Intro
In February 2026, Nature Medicine published a structured evaluation of ChatGPT Health’s triage recommendations.
The study reported 52% under-triage among gold-standard emergency cases in structured vignette testing.
Evidence Quality
Strengths
- Peer-reviewed publication
- Structured factorial testing
- Focused on escalation safety
Limitations
- Vignette-based, not real-world
- Gold-standard labels depend on the study's own definitions
Human Comparison
Human vs AI Emergency Under-Triage: Contextual Comparison
A quick benchmark view: definitions and denominators vary across studies, so treat these figures as context, not a head-to-head comparison.
Published Human ED Triage Benchmarks
- Under-triage: often reported in the low single digits to low teens (percent), depending on definitions and proxy outcomes
- Over-triage: commonly higher, reflecting an intentionally conservative bias
Human triage systems are designed to tolerate inefficiency (over-triage) to reduce catastrophic misses (under-triage).
Telephone / Remote Triage
- Misclassification can be meaningfully higher when vital signs and physical examination are unavailable
- Safety depends heavily on escalation rules and “red flag” capture
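The "escalation rules" idea above can be made concrete with a minimal sketch of red-flag capture layered over a model's triage output. The symptom list, triage labels, and function name here are illustrative assumptions, not taken from the study or any clinical guideline:

```python
# Minimal sketch of rule-based escalation ("red flag" capture).
# RED_FLAGS and the triage labels are illustrative assumptions only.

RED_FLAGS = {
    "chest pain", "difficulty breathing", "sudden weakness",
    "severe bleeding", "loss of consciousness",
}

def escalate(symptoms: list[str], model_triage: str) -> str:
    """Force emergency escalation when any red flag is reported,
    regardless of the model's own triage level."""
    reported = {s.lower().strip() for s in symptoms}
    if reported & RED_FLAGS:       # any overlap with the red-flag set
        return "emergency"
    return model_triage            # otherwise defer to the model

print(escalate(["Chest pain", "nausea"], model_triage="self-care"))  # emergency
print(escalate(["mild headache"], model_triage="primary-care"))      # primary-care
```

The design point is asymmetry: the rule layer can only raise acuity, never lower it, which mirrors the intentionally conservative bias of human triage systems.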
AI Triage (Nature Medicine 2026 Stress Test)
- 52% under-triage among gold-standard emergency cases in structured vignette testing
Takeaway: methodologies differ, but the magnitude of the gap is substantial and suggests that escalation-threshold calibration remains a central safety issue for consumer AI triage.
FAQ
Q1: Does 52% reflect real-world harm?
A: No. The 52% figure comes from structured vignette stress testing, not from observed patient outcomes.
Q2: Was this a clinical trial?
A: No; it was a structured vignette evaluation, not a clinical trial.
Q3: Why is under-triage critical?
A: Under-triage assigns a true emergency a lower acuity level, which can delay life-saving treatment.
Q4: Can thresholds be improved?
A: Likely yes, for example by layering rule-based escalation on top of model output.
Q5: Is AI triage unsafe?
A: The study shows calibration risks; real-world harm has not been demonstrated.
Related Guides
- /guides/ai-in-health
- /guides/automation-bias-in-clinical-practice