AI Can Ace the Medical Exam. It Still Fails With Real Patients.
A Nature Medicine study shows LLMs perform brilliantly alone — and falter when used by the public. The problem isn’t intelligence. It’s interaction.
Hook
Large language models can identify the correct condition in ~95% of structured medical test scenarios.
Put them in the hands of real people?
Performance drops below 35%.
That’s not a glitch.
That’s a systems warning.
Context
A 2026 randomized study in Nature Medicine asked 1,298 UK adults to use GPT-4o, Llama 3, and Command R+ for medical self-assessment.
When evaluated alone, the models performed strongly.
When used by real participants:
- Relevant conditions were identified in fewer than about one-third of cases
- Correct triage decisions hovered around 43%
- Performance was no better than standard internet search
The failure wasn’t medical knowledge.
It was human–AI interaction.
Paper: https://www.nature.com/articles/s41591-025-04074-y
The Wrong Question
We keep asking:
“Is AI as smart as a doctor?”
That’s the wrong question.
The right question is:
Can ordinary people use this tool safely under uncertainty?
Medicine isn’t a multiple-choice exam.
It’s incomplete information. Stress. Ambiguity. Risk filtering.
And that’s where the breakdown occurred.
What Actually Failed
The study exposed three consistent failure points:
1) Incomplete information
Users didn’t provide all relevant details (onset, severity, associated symptoms, risk factors).
2) Differential overload
Models generated multiple plausible causes.
Users struggled to identify which one was dangerous.
3) Interpretation error
Even when a correct condition appeared in the conversation, users didn’t consistently act on it.
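The first failure point is the easiest to make concrete: a safer interface would refuse to offer explanations until the basics are on the table. Here is a minimal sketch of that idea in Python (the field names and checks are illustrative assumptions, not the study's interface):

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class SymptomIntake:
    """Minimum context worth collecting before suggesting causes (illustrative)."""
    main_symptom: Optional[str] = None
    onset: Optional[str] = None                # e.g. "sudden, 2 hours ago"
    severity: Optional[str] = None             # e.g. "worst headache of my life"
    associated_symptoms: Optional[str] = None  # e.g. "fever, stiff neck"
    risk_factors: Optional[str] = None         # e.g. "pregnant, on blood thinners"

def missing_fields(intake: SymptomIntake) -> list[str]:
    """Return the details the user still has to provide."""
    return [f.name for f in fields(intake) if not getattr(intake, f.name)]

intake = SymptomIntake(main_symptom="chest pain", onset="30 minutes ago")
gaps = missing_fields(intake)
if gaps:
    # Ask follow-up questions instead of reasoning from partial information.
    print("Before we continue, please describe:", ", ".join(gaps))
```

The point is not the code; it is the design stance: the system, not a distressed user, carries the burden of asking for complete information.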
The models passed the exam.
The interaction failed.
Tools, Not Doctors
LLMs are cognitive amplifiers.
In informed hands, they:
- Generate differential diagnoses
- Surface red flags
- Help structure thinking
- Improve preparation for clinician visits
In untrained or emotionally distressed hands, they:
- Create false reassurance
- Encourage premature closure
- Inflate trivial explanations
- Produce misplaced confidence
A chainsaw is powerful.
In trained hands, it builds. In careless hands, it injures.
LLMs belong in that category.
Why This Matters for Policy
Developers and regulators often rely on:
- Medical licensing exam benchmarks
- Simulated patient interactions
- Controlled test performance
This study showed those methods do not reliably predict real-world safety.
If AI is going to be deployed as a “front door” to healthcare, it must be evaluated:
- With real users
- Under incomplete information
- Under uncertainty
- Measuring under-triage (false reassurance) as a core safety outcome
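That last item is straightforward to operationalize once triage levels sit on an ordered scale, from self-care up to emergency. A minimal sketch of the metric in Python, with illustrative labels and made-up example data rather than figures from the paper:

```python
# Ordered triage levels: a higher number means more urgent (illustrative scale).
LEVELS = {"self-care": 0, "see GP": 1, "urgent care": 2, "emergency": 3}

def under_triage_rate(decisions: list[tuple[str, str]]) -> float:
    """Fraction of cases where the user's final choice was less urgent than the
    gold-standard label, i.e. false reassurance."""
    under = sum(LEVELS[chosen] < LEVELS[gold] for chosen, gold in decisions)
    return under / len(decisions)

# (user's decision after the AI conversation, gold-standard triage label)
trial = [("self-care", "emergency"), ("see GP", "see GP"), ("urgent care", "emergency")]
print(f"Under-triage rate: {under_triage_rate(trial):.0%}")  # Under-triage rate: 67%
```

Reporting that number alongside accuracy makes false reassurance visible instead of buried.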
Exam scores are not safety evidence.
The Responsible Path Forward
The answer is not abandonment.
It’s maturity.
AI in healthcare must be:
- Structured
- Constrained
- Explicit about risk
- Designed to highlight red flags first
- Tested with real humans before scaling
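What "structured, constrained, and red flags first" can look like in practice, sketched as a fixed output schema (the schema, wording, and cap of three causes are assumptions for illustration, not a published design):

```python
from dataclasses import dataclass

@dataclass
class TriageAnswer:
    """Constrained answer schema: danger signs first, differential capped."""
    red_flags: list[str]        # symptoms that mean "seek care now"
    differential: list[str]     # plausible causes, rendered at most three
    triage_advice: str          # one explicit next step
    uncertainty: str            # what the model does not know

def render(answer: TriageAnswer) -> str:
    """Render the answer with red flags before any reassuring explanation."""
    lines = ["IF ANY OF THESE APPLY, SEEK CARE NOW:"]
    lines += [f"  - {flag}" for flag in answer.red_flags]
    lines.append("Possible explanations (not a diagnosis):")
    lines += [f"  - {cause}" for cause in answer.differential[:3]]  # hard cap
    lines.append(f"Suggested next step: {answer.triage_advice}")
    lines.append(f"Limits of this advice: {answer.uncertainty}")
    return "\n".join(lines)

print(render(TriageAnswer(
    red_flags=["chest pain spreading to arm or jaw", "shortness of breath"],
    differential=["muscle strain", "acid reflux", "angina"],
    triage_advice="Call emergency services if any red flag applies; otherwise see a GP today.",
    uncertainty="Based only on the details provided; severity was not examined.",
)))
```

The constraint does the safety work: the model can be fluent, but the interface decides what the user sees first.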
AI is a tool.
And tools require skill.
Closing
The models passed the medical exam.
The public failed the interaction.
That gap — between capability and safe use —
is where healthcare AI will either mature responsibly…
or cause harm through overconfidence.