What 52% Emergency Under-Triage Really Means
A 2026 Nature Medicine stress test found high emergency under-triage rates in ChatGPT Health. Here’s how to interpret that number responsibly.
Primary paper: Ramaswamy A, Tyagi A, Hugo H, et al. ChatGPT Health performance in a structured test of triage recommendations. Nature Medicine. Published online 23 Feb 2026. DOI: 10.1038/s41591-026-04297-7
Human benchmark context: Sax DR, et al. Evaluation of Version 4 of the Emergency Severity Index in US Emergency Departments for the Rate of Mistriage. JAMA Network Open. 2023; and Zachariasse JM, et al. Performance of triage systems in emergency care: a systematic review and meta-analysis. BMJ Open. 2019.
The Headline Number
On 23 February 2026, Nature Medicine published a Brief Communication titled:
“ChatGPT Health performance in a structured test of triage recommendations”
(Ramaswamy et al., Icahn School of Medicine at Mount Sinai).
The headline number:
52% of gold-standard emergency cases were under-triaged.
That statistic is powerful.
It’s also easy to misinterpret.
What Was the Study?
The paper evaluated ChatGPT Health — OpenAI’s consumer-facing triage tool launched in January 2026.
Researchers used:
- 60 clinician-authored vignettes
- 16 factorial conditions
- 960 total responses
The tests were run as a structured stress test (not a real-world outcomes trial).
That distinction matters.
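The design arithmetic is simple: 60 vignettes crossed with 16 conditions yields 960 responses. A 16-condition grid is consistent with four binary factors, which can be sketched as below. The factor names here are illustrative placeholders, not the paper's actual manipulations.

```python
from itertools import product

# Four hypothetical binary factors (2^4 = 16 conditions).
# These names are illustrative only; the paper's actual factors differ.
factors = {
    "symptoms_minimized_by_family": [False, True],
    "first_person_phrasing": [False, True],
    "vitals_included": [False, True],
    "medical_history_included": [False, True],
}

conditions = list(product(*factors.values()))
n_vignettes = 60

print(len(conditions))                # 16 factorial conditions
print(n_vignettes * len(conditions))  # 960 total responses
```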
What 52% Actually Refers To
The 52% figure refers to:
Under-triage among cases the researchers defined as requiring emergency department evaluation within their vignette framework.
It does not mean:
- half of real emergencies are being missed in the wild,
- or that the tool “fails randomly.”
It means emergency escalation thresholds were often too low under controlled stress testing.
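Operationally, under-triage is the share of gold-standard emergency cases for which the recommended care level falls below "emergency." A minimal sketch, using synthetic records to illustrate the metric only (not the paper's data or urgency scale):

```python
# Ordered urgency levels, lowest to highest (illustrative, not the paper's scale).
LEVELS = ["self_care", "primary_care", "urgent_care", "emergency"]

def under_triage_rate(results):
    """Share of gold-standard emergency cases recommended below 'emergency'."""
    emergencies = [r for r in results if r["gold"] == "emergency"]
    missed = [r for r in emergencies
              if LEVELS.index(r["recommended"]) < LEVELS.index("emergency")]
    return len(missed) / len(emergencies)

# Synthetic example records, purely for illustration.
results = [
    {"gold": "emergency", "recommended": "urgent_care"},   # under-triaged
    {"gold": "emergency", "recommended": "emergency"},
    {"gold": "primary_care", "recommended": "primary_care"},
    {"gold": "emergency", "recommended": "primary_care"},  # under-triaged
    {"gold": "emergency", "recommended": "emergency"},
]
print(under_triage_rate(results))  # 0.5
```

Note the denominator: only cases the researchers defined as emergencies count, which is why the 52% figure says nothing about non-emergency vignettes or real-world incidence.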
Human Triage Context (and Why This Is Still a Big Deal)
Human triage is imperfect — misclassification happens at scale in real EDs.
But human systems are also intentionally asymmetric: they accept inefficiency (over-triage) to avoid catastrophic misses (under-triage).
The clearest way to hold the 52% number responsibly is to put it beside human benchmarks and explain why the comparison is imperfect — but still informative.
Human vs AI Emergency Under-Triage: Contextual Comparison
A quick benchmark view. Definitions vary across studies; treat these as context, not a head-to-head.
Published Human ED Triage Benchmarks
- Under-triage: often reported in the low single digits to low-teens depending on definition and proxy outcomes
- Over-triage: commonly higher, reflecting an intentionally conservative bias
Human triage systems are designed to tolerate inefficiency (over-triage) to reduce catastrophic misses (under-triage).
Telephone / Remote Triage
- Misclassification can be meaningfully higher when vitals and examination are unavailable
- Safety depends heavily on escalation rules and “red flag” capture
AI Triage (Nature Medicine 2026 Stress Test)
- 52% under-triage among gold-standard emergency cases in structured vignette testing
Takeaway: methodologies differ, but the magnitude gap is substantial and suggests escalation-threshold calibration remains a central safety issue for consumer AI triage.
Why This Matters
The paper reports an “inverted U” pattern — better performance in the middle, weaker at extremes.
It also flags:
- Anchoring: when friends/family minimize symptoms, triage shifts toward lower urgency.
- Safeguards: crisis messaging for suicidal ideation activated inconsistently in their scenarios.
These aren’t “medical trivia” failures.
They’re calibration and robustness failures — the stuff that decides whether triage is safe at scale.
What This Study Proves
- AI triage can fail at emergency thresholds under stress testing.
- Behavioural framing can materially shift recommendations.
- Safeguard activation may be inconsistent.
What It Does Not Prove
- Real-world harm rates.
- That consumer AI triage is universally unsafe.
- That humans are reliably “correct.”
- That these systems can’t be recalibrated.
The Bigger Question
Triage requires defensive escalation under uncertainty.
General-purpose language models optimize likelihood.
They do not naturally overweight worst-case outcomes unless explicitly constrained.
If AI is going to triage at consumer scale, escalation logic may need to be:
- rule-enforced,
- hard-coded for red flags,
- and intentionally conservative.
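The shape of such a layer can be sketched in a few lines: a deterministic rule pass that can only escalate, never downgrade, the model's recommendation. The red-flag list and function interface below are illustrative assumptions, not any real product's API.

```python
# Hard-coded red-flag phrases (illustrative; a real system would use a
# clinically validated, far more robust matcher than substring search).
RED_FLAGS = ("chest pain", "suicidal", "stroke symptoms",
             "anaphylaxis", "severe bleeding")

def escalate(user_text: str, model_recommendation: str) -> str:
    """Apply rule-enforced red flags; rules can only raise urgency."""
    text = user_text.lower()
    if any(flag in text for flag in RED_FLAGS):
        return "emergency"  # rule overrides the model unconditionally
    return model_recommendation

print(escalate("crushing chest pain for 20 minutes", "urgent_care"))
# emergency
print(escalate("mild sore throat for two days", "self_care"))
# self_care
```

The key design property is asymmetry: the rule layer is allowed to make the system more conservative, never less, mirroring how human triage systems tolerate over-triage to avoid catastrophic misses.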
That’s not just a technical question.
It’s a governance question.
FAQ
Q1: Does this mean ChatGPT Health misses half of real emergencies?
A: No. The 52% figure describes under-triage in structured vignette testing, not real-world outcomes.
Q2: Was this a live clinical trial?
A: No. It was a structured stress test using clinician-authored scenarios.
Q3: How does this compare to human triage?
A: Published ED studies show meaningful mistriage, but under-triage is typically far lower than 52% in many real-world analyses. Study designs differ, so comparisons should be cautious.
Q4: Why is under-triage more serious than over-triage?
A: Under-triage can delay time-sensitive care. Most triage systems tolerate more over-triage to reduce catastrophic misses.
Q5: What would improve AI triage safety?
A: Prospective validation, stricter escalation thresholds, red-flag overrides, and consistent crisis safeguards.
Closing
The real story isn’t the number.
It’s the mismatch between what triage requires — defensive escalation under uncertainty — and what general-purpose language models naturally optimize for.