Harvard study finds OpenAI's o1 model outperforms doctors in ER triage diagnosis

Digital Trends

Key Points

  • Harvard trial compared OpenAI's o1 model with two emergency physicians using identical patient records.
  • AI correctly diagnosed 67% of 76 cases in the first test; doctors scored 50%-55%.
  • With more detailed data, AI accuracy rose to 82% versus doctors' 70%-79%.
  • The performance gap did not reach statistical significance.
  • The AI cannot assess visual cues, tone, or patient demeanor.
  • Researchers propose a triadic model pairing doctors, patients, and AI for rapid second opinions.
  • Concerns include accountability, patient safety, and potential over‑reliance on AI recommendations.

A Harvard-led trial comparing OpenAI's o1 reasoning model with human physicians in a Boston emergency department showed the AI correctly identified the exact or near‑exact diagnosis in 67% of cases, outpacing doctors who scored between 50% and 55%. When provided with more detailed patient information, the model's accuracy rose to 82% versus 70%‑79% for clinicians. Researchers caution the findings are not statistically significant and note the AI cannot assess visual cues or patient demeanor, but suggest the technology could serve as a rapid second opinion in emergency care.

In a head‑to‑head trial conducted at a Boston hospital, an artificial‑intelligence system built by OpenAI outperformed practicing physicians in diagnosing emergency‑room patients. The study, overseen by Harvard researchers, pitted the o1 reasoning model against two doctors using identical electronic health records for each case.

Study design and results

Seventy‑six patients who arrived at the emergency department were evaluated. For each case, the AI and the physicians received the same basic data: vital signs, demographic details and a brief nurse‑written note describing the reason for the visit. In the first round, the AI identified the exact or near‑exact diagnosis in 67% of cases. Human doctors fell short, scoring between 50% and 55%.

A second round supplied more comprehensive information. Under those conditions, the o1 model’s accuracy climbed to 82%, while the physicians’ performance ranged from 70% to 79%. The researchers noted that the gap between the AI and the doctors did not reach statistical significance, tempering any claim of clear superiority.
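
To see why a double-digit gap can still fall short of significance with so few cases, here is a minimal back-of-the-envelope sketch in Python. It treats the two first-round accuracy figures as independent proportions over 76 cases and applies a simple two-proportion z-test; the assumed physician accuracy of 52.5% (the midpoint of the reported 50%-55% range) and the choice of test are illustrative assumptions, not the analysis the Harvard team actually ran.

  from math import sqrt
  from statistics import NormalDist

  n = 76                  # ER cases evaluated in the trial
  p_ai = 0.67             # o1 accuracy in the first round
  p_doc = 0.525           # assumed physician accuracy: midpoint of the 50%-55% range

  # Two-proportion z-test, treating the two accuracy figures as independent
  # samples of size n (a simplification; the real study scored the same cases).
  pooled = (p_ai + p_doc) / 2
  se = sqrt(pooled * (1 - pooled) * (2 / n))
  z = (p_ai - p_doc) / se
  p_value = 2 * (1 - NormalDist().cdf(z))

  print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")   # roughly z = 1.82, p = 0.07

Under these illustrative assumptions the sketch yields p of about 0.07, just above the conventional 5% threshold, which is consistent with the researchers' caution that 76 cases are too few to declare a clear winner.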

Implications and cautions

Lead author Dr. Adam Rodman, a physician at Beth Israel Deaconess Medical Center, emphasized that the experiment tested text‑based medical reasoning, not the full spectrum of emergency‑room assessment. "The model does not see a patient’s distress, tone, body language or other real‑world signals that clinicians rely on," he said.

Despite those limitations, Rodman envisions a “triadic care model” where doctors, patients and AI collaborate. In such a setup, the system could provide a rapid second opinion, especially when clinicians need to make swift decisions with limited data.

Experts, however, raised several concerns. Accountability for AI‑driven errors remains murky, and patient safety could be jeopardized if clinicians over‑rely on algorithmic suggestions. The study’s authors stressed that the technology is not ready for unsupervised deployment in emergency departments.

For now, the o1 model appears best suited as an adjunct tool, offering quick diagnostic suggestions that physicians can verify against their own clinical judgment. As AI continues to evolve, further trials with larger sample sizes and real‑time patient interaction will be needed to determine whether such systems can safely augment emergency care.

#artificial intelligence #emergency medicine #diagnostic accuracy #OpenAI #Harvard study #medical AI #triage #patient safety #healthcare technology #clinical decision support
Generated with News Factory - Source: Digital Trends
