Harvard Emergency Room Triage Trial Finds AI Diagnosis More Accurate Than Human Doctors

The study, published in the journal *Science* and led by a team from Harvard Medical School, is regarded by independent experts as a “genuine advance” in artificial intelligence’s clinical reasoning, one that goes beyond passing exams or solving artificially constructed test questions. It used a large-scale experimental design that compared hundreds of doctors with a large language model (LLM), focusing on differences in performance in critical scenarios such as emergency room triage and long-term treatment planning.

In one core experiment, the research team selected 76 real patient cases from the emergency room of a Boston hospital. The AI system and two human doctors were each given the same standard electronic medical records, which included vital signs, demographic information, and a few sentences from nurses describing the reason for the visit. Asked to make an initial diagnosis from this limited information alone, the AI gave an accurate or very close diagnosis in 67% of cases, compared with 50% to 55% for the human doctors.

The study notes that the AI’s advantage is most pronounced in triage scenarios, where information is extremely limited and judgments must be made quickly. When both the AI, which used OpenAI’s o1 reasoning model, and the doctors were given more detailed clinical information, the AI’s diagnostic accuracy rose to 82%, while the human experts scored between 70% and 79%. This difference was not statistically significant.

Beyond emergency room triage, the AI also outperformed doctors in developing long-term treatment plans. In another experiment, the research team had the AI and 46 doctors review five clinical cases, with tasks that included designing antibiotic regimens and planning end-of-life care. The AI’s treatment plans scored significantly higher, at 89%, while doctors relying on traditional resources (such as search engines) scored only 34%.

However, the researchers emphasize that it is far too early to “announce the layoff of emergency room doctors.” The study compared the diagnostic abilities of AI and humans only on text-based medical records. It left out many non-textual signals that are crucial in real clinical settings, such as a patient’s expressions of pain, emotional state, body language, and interactions with family members. In other words, the AI in this study is closer to a “behind-the-scenes consultant” offering a second opinion based on paper records.

“I don’t think our findings mean that AI will replace doctors,” said Arjun Manrai, one of the study’s lead authors and head of the Harvard Medical School AI Lab. “I think it means we are witnessing a profoundly impactful technological shift that will reshape the entire healthcare system.” Adam Rodman, also a lead author and a physician at Beth Israel Deaconess Medical Center in Boston, called large language models “one of the most impactful technologies in decades.” He predicts that over the next ten years AI will not replace doctors but will instead form a new “tripartite care model” made up of “doctors, patients, and artificial intelligence systems.”

The study also presented a representative clinical case: a patient came to the hospital with a pulmonary embolism and worsening symptoms. The human doctors initially concluded that anticoagulation therapy had failed and that the disease had progressed. After reading the medical history, however, the AI noticed a key detail: the patient had lupus, an autoimmune disease that can also cause pulmonary inflammation. Further examination confirmed that the AI’s inference was correct.

AI’s clinical use is no longer confined to the laboratory; many doctors already use it in practice. According to a recent study released by the American Medical Association, nearly one in five American doctors have already introduced AI-assisted tools into the diagnostic process. In the UK, a recent survey by the Royal College of Physicians found that 16% of doctors use such technology daily and another 15% use it at least once a week, with “clinical decision support” among the most common use cases.

However, the UK doctors surveyed also expressed a high degree of caution, particularly about the risk of AI misdiagnosis and about liability. Although billions of dollars have poured into medical AI startups worldwide, how responsibility is defined and who bears the consequences when AI makes a mistake remain an urgent institutional gap. “There is currently no formal accountability framework,” Rodman pointed out, while emphasizing that patients, when facing life-or-death decisions or complex treatment plans, “ultimately want to be guided, accompanied, and explained by humans.”

Professor Ewen Harrison, Co-Director of the Medical Informatics Centre at the University of Edinburgh, believes the study is significant because it shows that “these systems are no longer just passing medical exams or responding to artificially constructed test questions.” In his view, AI is gradually becoming a useful “second opinion tool” for clinicians, especially when they need to review the full range of possible diagnoses and avoid overlooking important causes.

At the same time, Wei Xing, Assistant Professor at the School of Mathematics and Physical Sciences at the University of Sheffield, cautions that some of the study’s results suggest doctors may unconsciously defer to AI conclusions when working with AI, weakening their independent thinking. “This tendency may be further enhanced as AI becomes routinely used in clinical environments,” he pointed out. Xing also stressed that the study did not fully disclose which types of patients the AI performed worse on, for example whether elderly patients or patients who were not native English speakers were harder to diagnose. These questions cannot be ignored when assessing safety.

Therefore, although the Harvard trial results are encouraging, they do not prove that AI is safe enough to be used routinely and independently for clinical diagnosis and treatment, nor do they mean that the public should turn to free AI tools in place of professional medical advice. For the foreseeable future, AI is more likely to serve as a high-performance “intelligent stethoscope” and “second brain” embedded in a human-led healthcare system, supporting more accurate and efficient diagnosis and treatment while also raising new questions for society about responsibility, ethics, and trust.