AI Chatbots Miss Most Medical Diagnoses, Study Warns

Key Points
- Study in Nature Medicine tested ChatGPT and Meta's Llama 3 for medical advice.
- Diagnoses were correct in fewer than 34.5% of cases among 1,298 UK participants.
- Correct follow‑up steps were provided just 44.2% of the time.
- Incomplete user information led to many inaccurate responses.
- In two cases, an initially correct answer turned incorrect after users supplied more details.
- LLMs score at or above USMLE passing thresholds on benchmarks yet underperform in real‑world use.
- Three in five U.S. adults say they use AI for health information.
- ChatGPT's own disclaimer warns that it can make mistakes.
- Researchers advise against relying on chatbots for serious medical decisions.
A new study published in Nature Medicine examined how large language models such as ChatGPT and Meta's Llama 3 performed when asked for medical advice. Among 1,298 UK participants, the models correctly identified medical conditions in fewer than 34.5% of cases and offered correct follow‑up steps only 44.2% of the time. The research highlights that users often provide incomplete information, leading to inaccurate responses, and cautions against relying on AI chatbots for serious health decisions.
Study Overview
A recent investigation featured in Nature Medicine evaluated the diagnostic accuracy of large language models (LLMs) when used for medical advice. The study recruited 1,298 participants in the United Kingdom who interacted with AI systems such as ChatGPT and Meta's Llama 3. Across the sample, the models correctly identified the underlying medical condition in fewer than 34.5% of the interactions.
Performance Details
Although LLMs have achieved benchmark scores at or above the passing threshold of the United States Medical Licensing Examination, and their generated clinical documents are sometimes rated as equal to or better than those written by physicians, their real‑world diagnostic performance fell short. When participants provided only partial information, a scenario observed in 16 of 30 sampled exchanges, the models frequently produced incomplete or incorrect answers. In two instances, the model replaced an initially correct diagnosis with an inaccurate one after the user supplied additional details.
Follow‑Up Guidance
Beyond the initial diagnosis, the AI systems also struggled with recommending appropriate next steps. Correct follow‑up instructions were given only 44.2% of the time, underscoring limitations in the models’ ability to guide patients through subsequent care.
User Behavior and Expectations
A survey conducted by OpenAI revealed that three out of five U.S. adults report using AI for health‑related purposes. Respondents said they turn to AI when they first feel unwell, to prepare for appointments, and to better understand medical instructions. Despite a disclaimer on ChatGPT stating, “ChatGPT can make mistakes. Check important info,” many users still take the chatbot’s advice at face value.
Implications
The findings serve as a reminder that AI chatbots should not be the primary source of medical guidance, especially in serious or complex situations. While the technology shows promise, the study emphasizes the need for caution, thorough user input, and professional medical consultation to ensure safe and accurate health care decisions.