An AI Diagnosed More ER Patients Correctly Than the Doctors Did

Updated May 3, 2026

67%. That’s the share of emergency room patients OpenAI’s o1 model diagnosed correctly in a new Harvard-led study. The doctors in the same study? They landed somewhere between 50% and 55%. In a setting where a wrong call can mean the difference between life and death, that gap is hard to ignore.

I’m Maya, and I write about AI for people who don’t have computer science degrees — just curiosity and maybe a little healthy skepticism. So let’s talk about what this study actually means, why it matters, and why it’s more complicated than the headline suggests.

What the Study Actually Found

Researchers tested OpenAI’s o1 reasoning model on real emergency room cases. The goal was straightforward: could the AI correctly identify what was wrong with a patient based on the kind of information a triage doctor would have early in a visit?

The results were striking. The AI got it right — or very close to right — in 67% of cases. Human physicians, working through the same early-stage information, scored between 50% and 55%. That’s not a small difference. That’s roughly a 12 to 17 percentage point gap in favor of the machine.

But here’s where it gets even more interesting. When researchers gave the AI more detail (fuller patient histories, more test results, richer context), its accuracy climbed to 82%. For comparison, doctors given similar additional detail scored between 70% and 79%. The AI stayed ahead, but its lead shrank to between 3 and 12 percentage points, because the doctors gained more from the extra information (roughly 20 to 24 points) than the model did (15 points).
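If you’d like to check that arithmetic yourself, here’s a minimal sketch in Python. It uses only the percentages quoted above; the variable and function names are my own illustration, not anything from the study.

```python
# Accuracy figures quoted in this article, in percent.
# All names below are illustrative, not taken from the study.
AI_EARLY = 67
DOCTORS_EARLY = (50, 55)      # low and high end of the reported range
AI_DETAILED = 82
DOCTORS_DETAILED = (70, 79)


def gap_range(ai_score, doctor_range):
    """Return the (smallest, largest) AI lead in percentage points."""
    low, high = doctor_range
    # The lead is smallest against the best doctors, largest against the worst.
    return ai_score - high, ai_score - low


early_gap = gap_range(AI_EARLY, DOCTORS_EARLY)
detailed_gap = gap_range(AI_DETAILED, DOCTORS_DETAILED)

print(f"Early information: AI ahead by {early_gap[0]} to {early_gap[1]} points")
# prints: Early information: AI ahead by 12 to 17 points
print(f"More detail: AI ahead by {detailed_gap[0]} to {detailed_gap[1]} points")
# prints: More detail: AI ahead by 3 to 12 points
```

Note that these are percentage points (simple subtraction), not relative percent changes.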

Why an AI Might Actually Be Good at This

Emergency rooms are chaotic. Doctors are tired, overworked, and making rapid decisions under pressure with incomplete information. A triage physician might see dozens of patients in a single shift. That’s a lot of cognitive load for any human brain to carry.

AI models like o1 don’t get tired. They don’t have a bad day. They don’t anchor on the first diagnosis that comes to mind and unconsciously filter out contradicting evidence — a well-documented human tendency called anchoring bias. They process the information in front of them and reason through it systematically every single time.

OpenAI’s o1 is specifically built around what’s called “chain-of-thought reasoning.” Instead of just pattern-matching to a quick answer, it works through a problem step by step, more like a methodical thinker than a reflex response. In a diagnostic setting, that kind of structured reasoning turns out to be genuinely useful.

What This Does Not Mean

Before anyone starts imagining hospitals staffed entirely by chatbots, let’s slow down a little.

A study is not a deployment. Controlled research conditions are very different from the noise and unpredictability of a real ER on a Friday night.
Diagnosis is one piece of care. A doctor does far more than name what’s wrong. They communicate with frightened patients, make judgment calls about risk tolerance, coordinate with specialists, and adapt in real time to things no dataset can fully capture.
AI can be confidently wrong. A 67% accuracy rate still means the model missed roughly one in three cases. In medicine, that matters enormously, and an AI that sounds certain when it’s wrong can be more dangerous than a hesitant human who knows to ask for a second opinion.
The study was Harvard-led, but details on sample size and patient demographics matter. How well these results hold across different populations and hospital types is still an open question.

So Where Does This Actually Leave Us

The most realistic and useful version of this future isn’t AI replacing doctors — it’s AI working alongside them. Think of o1 as a very fast, very thorough second opinion that a physician can consult in real time. A tool that flags possibilities the doctor might not have considered yet, especially in those early chaotic minutes when information is thin and the stakes are high.

That framing — AI as a thinking partner, not a replacement — is where most serious researchers and clinicians land when they talk about this technology in medicine. The goal is better outcomes for patients, and if a model can help a tired doctor catch something they might have missed at 2 a.m., that’s worth taking seriously.

The numbers from this study are genuinely impressive. A 67% accuracy rate beating a 50-55% human baseline is not a rounding error — it’s a signal that AI has real diagnostic capability worth building on. The next step is figuring out how to use that capability responsibly, inside systems that still keep humans accountable for the decisions that affect people’s lives.

That’s a harder problem than building the model. But it’s the one that actually matters.

🕒 Published: May 3, 2026

Written by Jake Chen

AI educator passionate about making complex agent technology accessible. Created online courses reaching 10,000+ students.
