Harvard study finds OpenAI's o1 model outperforms doctors in ER triage diagnosis

Digital Trends

Key Points

  • Harvard trial compared OpenAI's o1 model with two emergency physicians using identical patient records.
  • AI correctly diagnosed 67% of 76 cases in the first test; doctors scored 50%-55%.
  • With more detailed data, AI accuracy rose to 82% versus doctors' 70%-79%.
  • The performance gap did not reach statistical significance.
  • The AI cannot assess visual cues, tone, or patient demeanor.
  • Researchers propose a triadic model pairing doctors, patients, and AI for rapid second opinions.
  • Concerns include accountability, patient safety, and potential over‑reliance on AI recommendations.

A Harvard-led trial comparing OpenAI's o1 reasoning model with human physicians in a Boston emergency department showed the AI correctly identified the exact or near‑exact diagnosis in 67% of cases, outpacing doctors who scored between 50% and 55%. When provided with more detailed patient information, the model's accuracy rose to 82% versus 70%‑79% for clinicians. Researchers caution the findings are not statistically significant and note the AI cannot assess visual cues or patient demeanor, but suggest the technology could serve as a rapid second opinion in emergency care.

In a head‑to‑head trial conducted at a Boston hospital, an artificial‑intelligence system built by OpenAI outperformed practicing physicians in diagnosing emergency‑room patients. The study, overseen by Harvard researchers, pitted the o1 reasoning model against two doctors using identical electronic health records for each case.

Study design and results

Seventy‑six patients who arrived at the emergency department were evaluated. For each case, the AI and the physicians received the same basic data: vital signs, demographic details and a brief nurse‑written note describing the reason for the visit. In the first round, the AI identified the exact or near‑exact diagnosis in 67% of cases. Human doctors fell short, scoring between 50% and 55%.

A second round supplied more comprehensive information. Under those conditions, the o1 model’s accuracy climbed to 82%, while the physicians’ performance ranged from 70% to 79%. The researchers noted that the gap between the AI and the doctors did not reach statistical significance, tempering any claim of clear superiority.
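
To see why a double-digit gap can still fall short of significance with so few cases, here is a minimal back-of-the-envelope sketch in Python. It treats the two first-round accuracy figures as independent proportions over 76 cases and applies a simple two-proportion z-test; the assumed physician accuracy of 52.5% (the midpoint of the reported 50%-55% range) and the choice of test are illustrative assumptions, not the analysis the Harvard team actually ran.

  from math import sqrt
  from statistics import NormalDist

  n = 76                  # ER cases evaluated in the trial
  p_ai = 0.67             # o1 accuracy in the first round
  p_doc = 0.525           # assumed physician accuracy: midpoint of the 50%-55% range

  # Two-proportion z-test, treating the two accuracy figures as independent
  # samples of size n (a simplification; the real study scored the same cases).
  pooled = (p_ai + p_doc) / 2
  se = sqrt(pooled * (1 - pooled) * (2 / n))
  z = (p_ai - p_doc) / se
  p_value = 2 * (1 - NormalDist().cdf(z))

  print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")   # roughly z = 1.82, p = 0.07

Under these illustrative assumptions the sketch yields p of about 0.07, just above the conventional 5% threshold, which is consistent with the researchers' caution that 76 cases are too few to declare a clear winner.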

Implications and cautions

Lead author Dr. Adam Rodman, a physician at Beth Israel Deaconess Medical Center, emphasized that the experiment tested text‑based medical reasoning, not the full spectrum of emergency‑room assessment. "The model does not see a patient’s distress, tone, body language or other real‑world signals that clinicians rely on," he said.

Despite those limitations, Rodman envisions a “triadic care model” where doctors, patients and AI collaborate. In such a setup, the system could provide a rapid second opinion, especially when clinicians need to make swift decisions with limited data.

Experts, however, raised several concerns. Accountability for AI‑driven errors remains murky, and patient safety could be jeopardized if clinicians over‑rely on algorithmic suggestions. The study’s authors stressed that the technology is not ready for unsupervised deployment in emergency departments.

For now, the o1 model appears best suited as an adjunct tool, offering quick diagnostic suggestions that physicians can verify against their own clinical judgment. As AI continues to evolve, further trials with larger sample sizes and real‑time patient interaction will be needed to determine whether such systems can safely augment emergency care.

#artificial intelligence #emergency medicine #diagnostic accuracy #OpenAI #Harvard study #medical AI #triage #patient safety #healthcare technology #clinical decision support
Generated with News Factory - Source: Digital Trends
