Harvard Study Shocker: AI Outperforms Doctors in ER Diagnoses!

By Uche Emeka

A recent study published in Science by a research team from Harvard Medical School and Beth Israel Deaconess Medical Center has examined the performance of large language models (LLMs) in various medical contexts, including real emergency room cases. The findings suggest that at least one AI model demonstrated higher accuracy than human doctors in certain diagnostic scenarios.

The research team, led by physicians and computer scientists, conducted several experiments to evaluate OpenAI’s o1 and 4o models against human physicians. A significant part of their investigation focused on 76 patients admitted to the Beth Israel emergency room. In this experiment, the diagnoses provided by two internal medicine attending physicians were compared to those generated by the AI models. The assessments were then reviewed by two independent attending physicians who were blinded to whether the diagnoses originated from humans or AI.

According to the study, at each diagnostic touchpoint, the o1 model performed either nominally better than or on par with both human attending physicians and the 4o model. The differences in performance were particularly pronounced at the initial emergency room triage stage, where limited patient information is available and timely, accurate decisions are most critical. The researchers highlighted in Harvard Medical School’s press release that no pre-processing was done on the data, meaning the AI models were given the exact same information from electronic medical records as the human doctors at the time of diagnosis.

Specifically, the o1 model achieved an “exact or very close diagnosis” in 67% of triage cases. In comparison, one human physician reached this level of accuracy 55% of the time, while the other achieved it 50% of the time. Arjun Manrai, who heads an AI lab at Harvard Medical School and is one of the study’s lead authors, stated that the AI model surpassed both prior models and physician baselines across virtually every benchmark tested.

Despite these promising results, the study explicitly clarifies that it does not claim AI is ready to make real-life, critical decisions in the emergency room. Instead, the findings underscore an "urgent need for prospective trials to evaluate these technologies in real-world patient care settings." The researchers also noted that their study focused exclusively on models processing text-based information, and that existing studies indicate current foundation models are more constrained when reasoning over non-text inputs such as images and lab waveforms.

Adam Rodman, a Beth Israel doctor and co-lead author, cautioned The Guardian about the current lack of a formal framework for accountability regarding AI diagnoses. He emphasized that patients still want human guidance for life-or-death and challenging treatment decisions.

Kristen Panthagani, an emergency physician, commented on the study, calling it "an interesting AI study that has led to some very overhyped headlines." She highlighted that the comparison was made against internal medicine physicians, not emergency room physicians, arguing that comparisons should be made with doctors who actually practice the relevant specialty. Panthagani further stated that an ER doctor's primary goal at first encounter is not to guess the ultimate diagnosis but to determine whether a patient has a life-threatening condition.
