AI models are improving each day. They can perform many tasks, such as writing, analysing information, and coding, at a level comparable to, if not better than, humans. As AI improves and its use cases expand, researchers are testing its use in medicine. Recently, researchers completed what they describe as the largest study yet comparing artificial intelligence and physicians across a wide range of clinical reasoning tasks. The aim of the study was to evaluate whether an AI system could do what physicians do every day.
The study found that at least one large language model (LLM) performed more accurately than human doctors across many of these tasks, including making emergency-room decisions based on the available information, identifying likely diagnoses, and choosing the next steps in management. The study was conducted by a team led by physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center.
“We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines,” said co-senior author Arjun (Raj) Manrai, assistant professor of biomedical informatics in the Blavatnik Institute at HMS, in Harvard Medical School’s press release about the study.
Emergency room experiment and findings
In one experiment, the study used 76 emergency room cases from Beth Israel Deaconess Medical Center in which decisions ranged from prioritising care to admitting patients to the ICU. It compared the diagnoses offered by two attending physicians with those generated by OpenAI's o1 and 4o models.
Two other attending physicians, blinded to which diagnoses came from humans and which came from AI, evaluated the results. They found that at each stage of the emergency room diagnosis process, o1 performed either nominally better than or on par with the two attending physicians and 4o.
The Harvard release notes that at each stage, the model was given only the information that was available in the electronic medical records at the time of each diagnosis.
“We didn’t pre-process the data at all,” said co-senior author Adam Rodman, HMS assistant professor of medicine at Beth Israel Deaconess, in the press release. “I thought it was going to be a fun experiment but that it wouldn’t work that well. That was not at all what happened,” Rodman added.
In the experiment, the researchers found that at the early decision points in real-world emergency department cases, the model matched or exceeded attending physicians in diagnostic accuracy—a result that surprised even the researchers.
Not a replacement for doctors yet
However, the researchers say the study does not suggest that AI can replace doctors or is ready to practice medicine autonomously. Rather, it shows that AI should be evaluated the way new medical interventions are: through carefully controlled, rigorous, prospective clinical trials in real care settings.
“A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm,” study co-author Brodeur said in the press release. “Humans should be the ultimate baseline when it comes to evaluating performance and safety.”