A new Harvard study has found that advanced AI models can outperform doctors in diagnosing patients, including in high-pressure emergency room (ER) scenarios. The researchers pitted OpenAI’s o1 language model directly against physicians across multiple stages of the diagnostic process, finding that the AI consistently outperformed doctors in both diagnosing conditions and planning clinical management.
What does the study find?
The new research, published in the journal Science, gave 76 clinical cases from the emergency room of Beth Israel Deaconess Medical Center to OpenAI’s o1 model and two expert attending physicians.
The researchers found that o1 performed on par with or significantly better than the human experts on a variety of tasks:
Initial triage: When the least information was available, o1 identified the exact or a very close diagnosis in 67.1% of cases, compared with 55.3% and 50% accuracy for the two physicians.
ER evaluation: As more patient information became available during the physicians’ initial assessment, the model’s accuracy improved to 72.4%, compared with 61.8% and 52.6% for the two doctors.
Hospital admission: At the final stage, when patients were admitted to the medical floor or ICU, o1 reached 81.6% accuracy, again outperforming both physicians, who scored 78.9% and 69.7%, respectively.
The study also found that the AI had a definitive edge when asked to produce treatment plans, such as prescribing antibiotics or making end-of-life care decisions. Across five case studies, the AI achieved a median score of 89%, substantially outperforming physicians, who scored around 34% when using conventional resources and 41% when using GPT-4.
“Although applying AI to assist with clinical decision support is sometimes viewed as a high-risk endeavour, greater use of these tools might serve to mitigate the human and financial costs of diagnostic error, delay, and lack of access,” the researchers said.
“Our findings suggest that LLMs have now eclipsed most benchmarks of clinical reasoning, motivating the urgent need for human-computer interaction studies and prospective clinical trials to rigorously assess the potential of AI systems to improve clinical practice and patient outcomes,” they added.
However, the researchers also caution that clinical medicine is filled with non-text inputs, such as a patient’s level of physical distress or the interpretation of medical imaging. They suggest that future research should therefore assess how humans and machines can collaborate effectively.
In a statement to The Guardian, Arjun Manrai, one of the lead authors of the study, noted, “I don’t think our findings mean that AI replaces doctors.”
“I think it does mean that we’re witnessing a really profound change in technology that will reshape medicine,” he added.