Introduction
Large language models (LLMs) are increasingly used by the public for health questions.
A 2026 randomized study in Nature Medicine tested 1,298 adults using GPT-4o, Llama 3, and Command R+ for medical self-assessment. Although the models performed well when tested alone, participants using them did not outperform a control group relying on standard internet search, and identified relevant conditions less often.
This guide explains how to use AI tools more safely — and why structured use matters.
Key Points
- LLMs generate possibilities, not diagnoses.
- Incomplete symptom descriptions degrade output quality.
- Users often struggle to choose the most serious possibility from multiple suggestions.
- Exam-style benchmark scores do not reliably predict safe real-world use.
- Use AI to support decisions — not to replace clinical care.
What the Evidence Shows
The Nature Medicine study found a sharp gap between:
- Model capability (how the LLM performs when given the full scenario)
- Real-world use (how humans perform when interacting with it)
In the study, LLMs alone identified relevant conditions in the large majority of scenarios, but members of the public using those same tools identified relevant conditions in fewer than about one-third of cases and did not improve triage decisions compared to control.
The most important takeaway is not “LLMs are useless.”
It is this:
Medical knowledge inside a model does not guarantee safe outcomes for the people using it.
Why Performance Breaks Down
1) Incomplete inputs (garbage in, garbage out)
People frequently omit details that change risk:
- onset timing (sudden vs gradual)
- severity and trajectory (getting worse?)
- associated symptoms
- risk factors (pregnancy, clot risk, immunosuppression)
LLMs can’t reliably infer what isn’t provided.
2) Differential overload (too many plausible options)
LLMs often output multiple possibilities.
That’s normal — it’s closer to how clinicians think.
But non-experts then must choose:
- which is most likely
- which is most serious
- which is time-sensitive
This filtering step is where users most often fail.
3) Cognitive bias and “automation effects”
Users may:
- over-trust fluent or confident language
- ignore rare-but-dangerous possibilities
- stop searching after the first plausible explanation (premature closure)
Health contexts amplify these biases because people are stressed and motivated to reduce uncertainty quickly.
How to Use AI Tools More Safely
Step 1: Provide structured information
Use a structured template:
- Age / sex
- Main symptom
- Onset: sudden or gradual, exact time if possible
- Duration
- Severity: 1–10
- Associated symptoms
- Relevant history / meds
- What you’re worried it might be
Bad prompt:
“I have a headache. What should I do?”
Better prompt:
“I’m a 42-year-old male. Sudden severe headache started 30 minutes ago (10/10), worst of my life, with neck stiffness and light sensitivity. No trauma. No blood thinners. What are the dangerous causes and what action should I take?”
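If you find yourself asking these questions often, the template can be captured once and reused. The sketch below is illustrative only: the SymptomReport fields and the build_prompt helper are invented names, not part of any product or of the study, and the output is simply a prompt string you would paste into whichever assistant you use.

```python
from dataclasses import dataclass

@dataclass
class SymptomReport:
    # Fields mirror the template above; all names here are illustrative.
    age: int
    sex: str
    main_symptom: str
    onset: str          # sudden or gradual, with timing if known
    duration: str
    severity: int       # 1-10
    associated: str     # associated symptoms
    history_meds: str   # relevant history / medications
    worry: str          # what you're worried it might be

def build_prompt(r: SymptomReport) -> str:
    """Turn the structured report into a prompt that leaves less for the model to guess."""
    return (
        f"I'm a {r.age}-year-old {r.sex}. Main symptom: {r.main_symptom}.\n"
        f"Onset: {r.onset}. Duration: {r.duration}. Severity: {r.severity}/10.\n"
        f"Associated symptoms: {r.associated}.\n"
        f"Relevant history/medications: {r.history_meds}.\n"
        f"I'm worried it might be: {r.worry}.\n"
        "What are the dangerous causes and what action should I take?"
    )

print(build_prompt(SymptomReport(
    age=42, sex="male", main_symptom="sudden severe headache, worst of my life",
    onset="sudden, about 30 minutes ago", duration="30 minutes", severity=10,
    associated="neck stiffness, light sensitivity",
    history_meds="no trauma, no blood thinners",
    worry="a brain bleed",
)))
```

The point is not the code itself but the discipline: every field in the template gets filled in before the model is asked anything.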
Step 2: Force acuity labeling
Ask explicitly:
“List possible causes and clearly label which require emergency care.”
This reduces ambiguity and helps prevent false reassurance.
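If you script the interaction, the same instruction can be built in and the reply sorted so possible emergencies are read first. This is a minimal sketch under two assumptions: the assistant can follow a JSON formatting instruction, and ask_model stands in for whichever chat API you use (it is not a real library call); the urgency labels are also illustrative.

```python
import json
from typing import Callable

# Stand-in for whichever chat API you use (GPT-4o, a local Llama 3, Command R+, ...).
AskModel = Callable[[str], str]

ACUITY_INSTRUCTION = (
    "List possible causes as a JSON array. Give each item the fields "
    "'condition' and 'urgency', where urgency is exactly one of: "
    "'emergency', 'see a doctor soon', 'self-care'. "
    "Clearly label anything that requires emergency care."
)

def causes_by_urgency(ask_model: AskModel, symptom_prompt: str) -> list[dict]:
    """Ask for explicit acuity labels and put possible emergencies at the top."""
    reply = ask_model(symptom_prompt + "\n\n" + ACUITY_INSTRUCTION)
    causes = json.loads(reply)  # real replies need validation; models don't always emit clean JSON
    order = {"emergency": 0, "see a doctor soon": 1, "self-care": 2}
    return sorted(causes, key=lambda c: order.get(c.get("urgency"), 3))
```

Read the emergency entries first, even if they look unlikely; false reassurance is the failure mode to guard against.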
Step 3: Separate “what it is” from “what to do”
Ask in two steps:
- “What are the top possibilities and dangerous alternatives?”
- “What should I do right now given uncertainty?”
This mirrors real triage: diagnosis is often unclear, but urgency can still be estimated.
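Scripted, the separation is simply two prompts rather than one. Another illustrative sketch; ask_model is again a placeholder for your chat API, not a real function.

```python
WHAT_IT_IS = "What are the top possibilities and the dangerous alternatives?"
WHAT_TO_DO = (
    "Setting the exact diagnosis aside, what should I do right now given the "
    "uncertainty, and how quickly?"
)

def two_step_query(ask_model, symptom_prompt: str) -> tuple[str, str]:
    """Keep 'what it is' and 'what to do' as separate questions, as in real triage."""
    possibilities = ask_model(symptom_prompt + "\n\n" + WHAT_IT_IS)
    next_step = ask_model(symptom_prompt + "\n\n" + WHAT_TO_DO)
    return possibilities, next_step
```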
Step 4: Ask for red flags (then verify)
Ask:
“What symptoms would make this an emergency?”
Then verify using trusted sources (NHS/CDC/WHO).
Even if the AI says “not urgent,” check the red flags yourself.
Step 5: Use AI to prepare for clinicians
AI is often best used to:
- turn your symptoms into a clear summary
- generate a question list for your GP
- interpret medical terms
- understand tests and treatment options
Think “preparation tool,” not “triage authority.”
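The same scripted approach works for preparation: instead of asking for a verdict, ask for a summary and a question list you can bring to the appointment. As before, this is a sketch with placeholder names, not a real tool.

```python
PREP_INSTRUCTION = (
    "Rewrite my description above as a short, factual summary I can read out to "
    "my GP (no diagnoses), then list five questions I should ask at the appointment."
)

def prepare_for_appointment(ask_model, symptom_prompt: str) -> str:
    """Use the model as a preparation tool, not a triage authority."""
    return ask_model(symptom_prompt + "\n\n" + PREP_INSTRUCTION)
```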
When AI Is Most Useful
- Preparing a short symptom summary for an appointment
- Understanding medical terminology
- Generating differential diagnoses to discuss with a clinician
- Learning about evidence and trade-offs
- Identifying lifestyle and medication interactions (then verifying)
Risks, Limits, and When to Escalate
Do not rely on AI alone if symptoms are:
- sudden and severe
- worsening rapidly
- associated with chest pain, shortness of breath, or fainting
- neurological (weakness, facial droop, speech trouble)
- severe abdominal pain, dehydration, or confusion
If you think it could be urgent, treat it as urgent.
Regulatory Implications
This study also suggests a broader conclusion:
Benchmarks and simulated interactions are not enough to establish safety.
Real-world deployment should require:
- evaluation with diverse real users
- testing under incomplete information conditions
- measures of under-triage (false reassurance) as a primary safety metric
- interface designs that constrain output and prioritize red flags
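To make the under-triage point concrete, here is one way such a metric could be computed during evaluation. The TriageCase record and the numbers in the example are invented for illustration; they are not taken from the study.

```python
from dataclasses import dataclass

@dataclass
class TriageCase:
    truly_urgent: bool   # ground truth: the scenario required urgent care
    rated_urgent: bool   # what the participant decided after using the assistant

def under_triage_rate(cases: list[TriageCase]) -> float:
    """Share of truly urgent cases that were falsely reassured (rated non-urgent)."""
    urgent = [c for c in cases if c.truly_urgent]
    if not urgent:
        return 0.0
    missed = sum(1 for c in urgent if not c.rated_urgent)
    return missed / len(urgent)

# Hypothetical evaluation: 1 of 4 urgent scenarios missed -> under-triage rate 0.25.
cases = [
    TriageCase(truly_urgent=True,  rated_urgent=True),
    TriageCase(truly_urgent=True,  rated_urgent=True),
    TriageCase(truly_urgent=True,  rated_urgent=False),
    TriageCase(truly_urgent=True,  rated_urgent=True),
    TriageCase(truly_urgent=False, rated_urgent=False),
]
print(under_triage_rate(cases))  # 0.25
```

Reporting this rate alongside accuracy keeps false reassurance visible as a first-class safety outcome rather than a footnote.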
FAQ
Q: Are LLMs medically accurate?
A: They contain substantial medical knowledge but can misinterpret incomplete inputs and can be inconsistent across similar prompts.
Q: Why do they do well on exams but poorly with users?
A: Exams are structured and complete. Real users provide partial information and must interpret multi-option answers under uncertainty.
Q: Should I stop using them for health questions?
A: No — but treat them as a support tool for understanding and preparation, not a decision-maker.
Further Reading
- Nature Medicine (2026): Reliability of LLMs as medical assistants for the general public
  https://www.nature.com/articles/s41591-025-04074-y
- Automation Bias in Clinical Practice
- Large Language Models in Medicine