What 52% Emergency Under-Triage Really Means

A 2026 Nature Medicine stress test found high emergency under-triage rates in ChatGPT Health. Here’s how to interpret that number responsibly.


Primary paper: Ramaswamy A, Tyagi A, Hugo H, et al. ChatGPT Health performance in a structured test of triage recommendations. Nature Medicine. Published online 23 Feb 2026. DOI: 10.1038/s41591-026-04297-7

Human benchmark context: Sax DR, et al. Evaluation of Version 4 of the Emergency Severity Index in US Emergency Departments for the Rate of Mistriage. JAMA Network Open. 2023; and Zachariasse JM, et al. Performance of triage systems in emergency care: a systematic review and meta-analysis. BMJ Open. 2019.


On 23 February 2026, Nature Medicine published a Brief Communication titled:

“ChatGPT Health performance in a structured test of triage recommendations”
(Ramaswamy et al., Icahn School of Medicine at Mount Sinai).

The headline number:

52% of gold-standard emergency cases were under-triaged.

That statistic is powerful.

It’s also easy to misinterpret.

What Was This Study?

The paper evaluated ChatGPT Health — OpenAI’s consumer-facing triage tool launched in January 2026.

Researchers used:

  • 60 clinician-authored vignettes
  • 16 factorial conditions
  • 960 total responses

The evaluation was a structured stress test, not a real-world outcomes trial.

That distinction matters.

What 52% Actually Refers To

The 52% figure refers to:

Under-triage among cases the researchers defined as requiring emergency department evaluation within their vignette framework.

It does not mean:

  • half of real emergencies are being missed in the wild,
  • or that the tool “fails randomly.”

It means emergency escalation thresholds were often too low under controlled stress testing.
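To make the metric concrete, here is a minimal sketch of how an under-triage rate like this is computed. The data structure, field names, and urgency scale are illustrative assumptions, not the paper's actual scoring code:

```python
# Hypothetical sketch of the under-triage metric; the urgency scale and
# field names are illustrative, not taken from the paper.
URGENCY = {"self_care": 0, "routine_visit": 1, "urgent_care": 2, "emergency": 3}

def under_triage_rate(responses):
    """Share of gold-standard emergency cases rated below 'emergency'.

    Each response is a dict with 'gold' (the clinician-assigned level)
    and 'model' (the level the tool recommended).
    """
    emergencies = [r for r in responses if r["gold"] == "emergency"]
    if not emergencies:
        return 0.0
    missed = [r for r in emergencies
              if URGENCY[r["model"]] < URGENCY["emergency"]]
    return len(missed) / len(emergencies)

sample = [
    {"gold": "emergency", "model": "urgent_care"},        # under-triaged
    {"gold": "emergency", "model": "emergency"},          # correct
    {"gold": "routine_visit", "model": "routine_visit"},  # not in denominator
]
print(under_triage_rate(sample))  # 0.5
```

Note that the denominator is only the gold-standard emergency cases, which is why the figure says nothing about how often emergencies occur in real use.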

Human Triage Context (and Why This Is Still a Big Deal)

Human triage is imperfect — misclassification happens at scale in real EDs.

But human systems are also intentionally asymmetric: they accept inefficiency (over-triage) to avoid catastrophic misses (under-triage).

The clearest way to hold the 52% number responsibly is to put it beside human benchmarks and explain why the comparison is imperfect — but still informative.

Human vs AI Emergency Under-Triage: Contextual Comparison

A quick benchmark view. Definitions vary across studies; treat these as context, not a head-to-head.

Published Human ED Triage Benchmarks

  • Under-triage: often reported in the low single digits to low-teens depending on definition and proxy outcomes
  • Over-triage: commonly higher, reflecting an intentionally conservative bias

Human triage systems are designed to tolerate inefficiency (over-triage) to reduce catastrophic misses (under-triage).
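That asymmetry has a simple decision-theoretic logic. As a toy sketch (the cost numbers are assumptions for illustration, not from any study): when a missed emergency costs far more than an unnecessary ED visit, escalating is rational even at low probabilities of a true emergency.

```python
# Toy expected-cost model of why triage is intentionally conservative.
# The cost ratio (100:1) is an illustrative assumption, not a published figure.
def should_escalate(p_emergency, cost_missed=100.0, cost_overtriage=1.0):
    """Escalate when the expected cost of holding back exceeds the
    expected cost of an unnecessary escalation."""
    cost_if_hold = p_emergency * cost_missed          # risk of a catastrophic miss
    cost_if_escalate = (1 - p_emergency) * cost_overtriage  # cost of inefficiency
    return cost_if_hold > cost_if_escalate

# With a miss costing 100x an unnecessary visit, escalation is rational
# at roughly a 1% probability of a true emergency (p > 1/101):
print(should_escalate(0.02))   # True
print(should_escalate(0.005))  # False
```

A likelihood-optimizing model has no such cost asymmetry built in, which is one plausible reading of why its thresholds drift upward.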

Telephone / Remote Triage

  • Misclassification can be meaningfully higher when vitals and examination are unavailable
  • Safety depends heavily on escalation rules and “red flag” capture

AI Triage (Nature Medicine 2026 Stress Test)

  • 52% under-triage among gold-standard emergency cases in structured vignette testing

Takeaway: methodologies differ, but the magnitude gap is substantial and suggests escalation threshold calibration remains a central safety issue for consumer AI triage.

Why This Matters

The paper reports an “inverted U” pattern — better performance in the middle, weaker at extremes.

It also flags:

  • Anchoring: when friends/family minimize symptoms, triage shifts toward lower urgency.
  • Safeguards: crisis messaging for suicidal ideation activated inconsistently in their scenarios.

These aren’t “medical trivia” failures.

They’re calibration and robustness failures — the stuff that decides whether triage is safe at scale.

What This Study Proves

  • AI triage can fail at emergency thresholds under stress testing.
  • Behavioural framing can materially shift recommendations.
  • Safeguard activation may be inconsistent.

What It Does Not Prove

  • Real-world harm rates.
  • That consumer AI triage is universally unsafe.
  • That humans are reliably “correct.”
  • That these systems can’t be recalibrated.

The Bigger Question

Triage requires defensive escalation under uncertainty.

General-purpose language models optimize likelihood.

They do not naturally overweight worst-case outcomes unless explicitly constrained.

If AI is going to triage at consumer scale, escalation logic may need to be:

  • rule-enforced,
  • hard-coded for red flags,
  • and intentionally conservative.
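A rule-enforced layer of that kind can be sketched in a few lines. This is a hypothetical illustration of the architecture, not how any deployed system works; the red-flag phrases and triage levels are invented for the example:

```python
# Illustrative sketch of a hard-coded red-flag override layered on top of
# a model's triage suggestion. Phrases and levels are hypothetical.
RED_FLAGS = ("chest pain", "slurred speech", "suicidal", "can't breathe")

def final_triage(symptom_text: str, model_level: str) -> str:
    """Apply a conservative, rule-enforced override before trusting the model.

    If any red-flag phrase appears, escalate to 'emergency' regardless of
    the model's recommendation; otherwise pass the model's level through.
    """
    text = symptom_text.lower()
    if any(flag in text for flag in RED_FLAGS):
        return "emergency"
    return model_level

print(final_triage("Mild chest pain, my family says it's nothing", "self_care"))
# -> emergency: the rule outranks both the model and the minimizing framing
```

The design point is that the override is deterministic: the anchoring effects the paper describes cannot talk it down, because it never consults the model once a red flag fires.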

That’s not just a technical question.

It’s a governance question.

FAQ

Q1: Does this mean ChatGPT Health misses half of real emergencies?
A: No. The 52% figure describes under-triage in structured vignette testing, not real-world outcomes.

Q2: Was this a live clinical trial?
A: No. It was a structured stress test using clinician-authored scenarios.

Q3: How does this compare to human triage?
A: Published ED studies show meaningful mistriage, but real-world under-triage rates are typically far below 52%. Study designs differ, so comparisons should be cautious.

Q4: Why is under-triage more serious than over-triage?
A: Under-triage can delay time-sensitive care. Most triage systems tolerate more over-triage to reduce catastrophic misses.

Q5: What would improve AI triage safety?
A: Prospective validation, stricter escalation thresholds, red-flag overrides, and consistent crisis safeguards.

Closing

The real story isn’t the number.

It’s the mismatch between what triage requires — defensive escalation under uncertainty — and what general-purpose language models naturally optimize for.