Saturday, February 28, 2026

Can AI be trusted in emergencies? Study raises red flags on ChatGPT health

A new study published in the journal Nature Medicine has raised serious safety concerns about ChatGPT Health, a consumer-facing AI tool launched by OpenAI earlier this year.

The research found that the tool under-assessed more than half of simulated emergency medical cases, in some instances recommending routine care when urgent hospital treatment was needed.

The study, titled “ChatGPT Health performance in a structured test of triage recommendations,” was led by Dr Ashwin Ramaswamy and colleagues at the Icahn School of Medicine at Mount Sinai.

It was published online on February 23, 2026, just weeks after ChatGPT Health’s public launch on January 7.

HOW THE STUDY WAS CONDUCTED

Researchers designed 60 detailed, clinician-written medical scenarios, known as vignettes. These cases covered 21 clinical areas, including heart disease, respiratory illness, mental health crises, and metabolic disorders.

Each case was tested under 16 different conditions, resulting in 960 total responses from ChatGPT Health.
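The fully crossed design described above can be sketched in a few lines of Python (a minimal illustration only; the numeric labels stand in for vignettes and conditions, which the article does not enumerate):

```python
# Sketch of the study's factorial design: every clinician-written
# vignette is tested under every condition, so 60 x 16 = 960 runs.
from itertools import product

num_vignettes = 60    # clinician-written cases across 21 clinical areas
num_conditions = 16   # test conditions per case (labels not given here)

runs = list(product(range(num_vignettes), range(num_conditions)))
print(len(runs))  # 960
```

Crossing every case with every condition is what lets researchers separate the effect of the scenario itself from the effect of how it is framed.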

The goal was to check whether the AI could correctly recommend the level of medical urgency: for example, whether a patient should go to the emergency department immediately or could safely seek care within a few days.

AN “INVERTED U” PATTERN OF PERFORMANCE

The findings showed what researchers described as an “inverted U-shaped” performance curve.

The AI handled moderate medical situations reasonably well. It also correctly identified many classic emergencies such as stroke and severe allergic reactions (anaphylaxis).

However, performance dropped significantly at the extremes, particularly in high-risk emergencies.

In gold-standard emergency cases, ChatGPT Health under-triaged 52% of the time, meaning it recommended less urgent care than was medically appropriate.

In several scenarios, it suggested patients seek evaluation within 24–48 hours instead of going immediately to the emergency department.

Examples included:

  • Diabetic ketoacidosis, a life-threatening complication of diabetes
  • Impending respiratory failure, which requires urgent medical intervention

In such cases, delays could potentially result in serious harm.

MENTAL HEALTH CRISIS RESPONSES WERE INCONSISTENT

The study also examined how the AI responded to mental health emergencies, including suicidal thoughts.

Researchers found that crisis intervention messages, such as directing users to call the 988 Suicide and Crisis Lifeline, were triggered inconsistently.

Surprisingly, the AI was more likely to activate crisis messaging when no specific suicide method was described, and less likely when a clear method was mentioned.

This inconsistency raised concerns about reliability in high-stakes mental health situations.

INFLUENCE OF EXTERNAL BIAS

Another important finding involved what researchers called “anchoring bias.” When scenarios included family or friends minimising symptoms, the AI was significantly more likely to recommend non-urgent care.

In these edge cases, recommendations shifted toward less urgent advice with an odds ratio of 11.7. Researchers said this suggests that contextual wording can strongly influence AI output.
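An odds ratio of 11.7 means the odds of a non-urgent recommendation were nearly twelve times higher when the scenario included minimising language. As a rough illustration of how such a figure is derived from a 2x2 table (the counts below are invented for the example, not the study's data):

```python
# Illustration of computing an odds ratio from a 2x2 table.
# The counts are hypothetical, NOT taken from the Nature Medicine study.
def odds_ratio(a, b, c, d):
    """a/b: non-urgent vs urgent advice with minimising context;
    c/d: non-urgent vs urgent advice without it."""
    return (a / b) / (c / d)

# e.g. 54 of 60 runs shifted non-urgent with minimising wording,
# versus 27 of 60 without it (hypothetical numbers):
print(round(odds_ratio(54, 6, 27, 33), 1))  # 11.0
```

A ratio near 1 would mean the wording made no difference; values far above 1, like the study's 11.7, indicate a strong shift toward non-urgent advice.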

The study found no statistically significant differences based on race, gender, or access-to-care barriers. However, the authors noted that the data did not fully rule out the possibility of meaningful disparities.

RESEARCHERS URGE CAUTION

The authors emphasised that their research was based on simulated cases and conducted at a single time point. They stressed the need for prospective, real-world validation before AI tools like ChatGPT Health are widely relied upon for medical triage.

The study's rapid timeline (submitted January 15, accepted February 20, and published days later) reflects what experts say is an urgent need to assess the safety of AI systems already being used by millions of people.

OPENAI’S RESPONSE

OpenAI welcomed independent research but highlighted the study’s limitations. A company spokesperson said the findings may not reflect typical real-life usage and added that the model undergoes continuous updates and improvements.

The company also noted that ChatGPT Health is designed to provide general health information and is not intended to replace professional medical advice.

GROWING DEBATE OVER AI IN HEALTHCARE

The findings have added fuel to an ongoing debate about the readiness of large language models for direct consumer health decision-making.

Experts say AI tools can improve access to medical information, especially in areas with limited healthcare services.

However, they warn that blind spots in emergency triage could lead to delayed treatment or unnecessary harm.

As AI systems become more deeply integrated into everyday healthcare decisions, researchers say strong clinical validation, clear safety guardrails, and transparent limitations will be essential.

For now, the study serves as a reminder that while AI may assist with health information, emergency symptoms should always be evaluated by qualified medical professionals without delay.
