What 52% Emergency Under-Triage Really Means
A 2026 Nature Medicine stress test found high emergency under-triage rates in ChatGPT Health. Here’s how to interpret that number responsibly.
Primary paper: Ramaswamy A, Tyagi A, Hugo H, et al. ChatGPT Health performance in a structured test of triage recommendations. Nature Medicine. Published online 23 Feb 2026. DOI: 10.1038/s41591-026-04297-7
Human benchmark context: Sax DR, et al. Evaluation of Version 4 of the Emergency Severity Index in US Emergency Departments for the Rate of Mistriage. JAMA Network Open. 2023; and Zachariasse JM, et al. Performance of triage systems in emergency care: a systematic review and meta-analysis. BMJ Open. 2019.
The Headline Number
On 23 February 2026, Nature Medicine published a Brief Communication titled:
“ChatGPT Health performance in a structured test of triage recommendations”
(Ramaswamy et al., Icahn School of Medicine at Mount Sinai).
The headline number:
52% of gold-standard emergency cases were under-triaged.
That statistic is powerful.
It’s also easy to misinterpret.
What Was the Study?
The paper evaluated ChatGPT Health — OpenAI’s consumer-facing triage tool launched in January 2026.
Researchers used:
- 60 clinician-authored vignettes
- 16 factorial conditions
- 960 total responses
The tests were run as a structured stress test (not a real-world outcomes trial).
That distinction matters.
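The design arithmetic is simple: 60 vignettes crossed with 16 conditions yields 960 responses. A 16-condition grid is consistent with four binary factors, which can be sketched as below. The factor names here are illustrative placeholders, not the paper's actual manipulations.

```python
from itertools import product

# Four hypothetical binary factors (2^4 = 16 conditions).
# These names are illustrative only; the paper's actual factors differ.
factors = {
    "symptoms_minimized_by_family": [False, True],
    "first_person_phrasing": [False, True],
    "vitals_included": [False, True],
    "medical_history_included": [False, True],
}

conditions = list(product(*factors.values()))
n_vignettes = 60

print(len(conditions))                # 16 factorial conditions
print(n_vignettes * len(conditions))  # 960 total responses
```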
What 52% Actually Refers To
The 52% figure refers to:
Under-triage among cases the researchers defined as requiring emergency department evaluation within their vignette framework.
It does not mean:
- half of real emergencies are being missed in the wild,
- or that the tool “fails randomly.”
It means emergency escalation thresholds were often too low under controlled stress testing.
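Operationally, under-triage is the share of gold-standard emergency cases for which the recommended care level falls below "emergency." A minimal sketch, using synthetic records to illustrate the metric only (not the paper's data or urgency scale):

```python
# Ordered urgency levels, lowest to highest (illustrative, not the paper's scale).
LEVELS = ["self_care", "primary_care", "urgent_care", "emergency"]

def under_triage_rate(results):
    """Share of gold-standard emergency cases recommended below 'emergency'."""
    emergencies = [r for r in results if r["gold"] == "emergency"]
    missed = [r for r in emergencies
              if LEVELS.index(r["recommended"]) < LEVELS.index("emergency")]
    return len(missed) / len(emergencies)

# Synthetic example records, purely for illustration.
results = [
    {"gold": "emergency", "recommended": "urgent_care"},   # under-triaged
    {"gold": "emergency", "recommended": "emergency"},
    {"gold": "primary_care", "recommended": "primary_care"},
    {"gold": "emergency", "recommended": "primary_care"},  # under-triaged
    {"gold": "emergency", "recommended": "emergency"},
]
print(under_triage_rate(results))  # 0.5
```

Note the denominator: only cases the researchers defined as emergencies count, which is why the 52% figure says nothing about non-emergency vignettes or real-world incidence.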
Human Triage Context (and Why This Is Still a Big Deal)
Human triage is imperfect — misclassification happens at scale in real EDs.
But human systems are also intentionally asymmetric: they accept inefficiency (over-triage) to avoid catastrophic misses (under-triage).
The clearest way to hold the 52% number responsibly is to put it beside human benchmarks and explain why the comparison is imperfect — but still informative.
Human vs AI Emergency Under-Triage: Contextual Comparison
A quick benchmark view. Definitions vary across studies; treat these as context, not a head-to-head.
Published Human ED Triage Benchmarks
- Under-triage: often reported in the low single digits to low-teens depending on definition and proxy outcomes
- Over-triage: commonly higher, reflecting an intentionally conservative bias
Human triage systems are designed to tolerate inefficiency (over-triage) to reduce catastrophic misses (under-triage).
Telephone / Remote Triage
- Misclassification can be meaningfully higher when vitals and examination are unavailable
- Safety depends heavily on escalation rules and “red flag” capture
AI Triage (Nature Medicine 2026 Stress Test)
- 52% under-triage among gold-standard emergency cases in structured vignette testing
Takeaway: methodologies differ, but the magnitude gap is substantial and suggests escalation-threshold calibration remains a central safety issue for consumer AI triage.
Why This Matters
The paper reports an “inverted U” pattern — better performance in the middle, weaker at extremes.
It also flags:
- Anchoring: when friends/family minimize symptoms, triage shifts toward lower urgency.
- Safeguards: crisis messaging for suicidal ideation activated inconsistently in their scenarios.
These aren’t “medical trivia” failures.
They’re calibration and robustness failures — the stuff that decides whether triage is safe at scale.
What This Study Proves
- AI triage can fail at emergency thresholds under stress testing.
- Behavioural framing can materially shift recommendations.
- Safeguard activation may be inconsistent.
What It Does Not Prove
- Real-world harm rates.
- That consumer AI triage is universally unsafe.
- That humans are reliably “correct.”
- That these systems can’t be recalibrated.
The Bigger Question
Triage requires defensive escalation under uncertainty.
General-purpose language models optimize likelihood.
They do not naturally overweight worst-case outcomes unless explicitly constrained.
If AI is going to triage at consumer scale, escalation logic may need to be:
- rule-enforced,
- hard-coded for red flags,
- and intentionally conservative.
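The shape of such a layer can be sketched in a few lines: a deterministic rule pass that can only escalate, never downgrade, the model's recommendation. The red-flag list and function interface below are illustrative assumptions, not any real product's API.

```python
# Hard-coded red-flag phrases (illustrative; a real system would use a
# clinically validated, far more robust matcher than substring search).
RED_FLAGS = ("chest pain", "suicidal", "stroke symptoms",
             "anaphylaxis", "severe bleeding")

def escalate(user_text: str, model_recommendation: str) -> str:
    """Apply rule-enforced red flags; rules can only raise urgency."""
    text = user_text.lower()
    if any(flag in text for flag in RED_FLAGS):
        return "emergency"  # rule overrides the model unconditionally
    return model_recommendation

print(escalate("crushing chest pain for 20 minutes", "urgent_care"))
# emergency
print(escalate("mild sore throat for two days", "self_care"))
# self_care
```

The key design property is asymmetry: the rule layer is allowed to make the system more conservative, never less, mirroring how human triage systems tolerate over-triage to avoid catastrophic misses.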
That’s not just a technical question.
It’s a governance question.
FAQ
Q1: Does this mean ChatGPT Health misses half of real emergencies?
A: No. The 52% figure describes under-triage in structured vignette testing, not real-world outcomes.
Q2: Was this a live clinical trial?
A: No. It was a structured stress test using clinician-authored scenarios.
Q3: How does this compare to human triage?
A: Published ED studies show meaningful mistriage, but under-triage is typically far lower than 52% in many real-world analyses. Study designs differ, so comparisons should be cautious.
Q4: Why is under-triage more serious than over-triage?
A: Under-triage can delay time-sensitive care. Most triage systems tolerate more over-triage to reduce catastrophic misses.
Q5: What would improve AI triage safety?
A: Prospective validation, stricter escalation thresholds, red-flag overrides, and consistent crisis safeguards.
Closing
The real story isn’t the number.
It’s the mismatch between what triage requires — defensive escalation under uncertainty — and what general-purpose language models naturally optimize for.