Science Journalists Find ChatGPT Struggles With Accurate Summaries

Science journalists find ChatGPT is bad at summarizing scientific papers
Ars Technica

Key Points

  • ChatGPT received low average scores on a five‑point scale: 2.26 for how well summaries would blend into existing briefings and 2.14 for how compelling they were.
  • Only one AI‑generated summary earned the top rating of 5; 30 received the lowest rating of 1.
  • Common issues included conflating correlation with causation, omitting key context, and using over‑hyped language.
  • The model performs well at literal transcription but fails to convey methodology, limitations, or broader implications.
  • Extensive fact‑checking is required, making AI‑assisted summaries as labor‑intensive as manual writing.
  • Journalists concluded ChatGPT does not meet current scientific briefing standards.
  • The journalists may revisit the evaluation after future major updates to the model.

A study by science journalists for the AAAS examined how well ChatGPT can summarize scientific papers. Reviewers consistently gave the AI‑generated briefs low scores for relevance, compellingness, and factual accuracy. The model often conflated correlation with causation, omitted crucial context, and used exaggerated language like “groundbreaking.” Journalists concluded that ChatGPT does not meet the quality standards required for scientific briefings and would need extensive fact‑checking before use.

Evaluation of ChatGPT‑Generated Summaries

Science journalists tasked with assessing ChatGPT’s ability to distill scientific articles reported uniformly low performance across several criteria. When asked whether the AI‑produced summaries could blend seamlessly into their existing lineup of briefings, evaluators assigned an average rating of 2.26 on a five‑point scale, where 1 means “no, not at all” and 5 means “absolutely.” For the question of how compelling the briefs were, the average score dropped slightly to 2.14. Only a single summary earned the top rating of 5 on either metric, while 30 received the lowest rating of 1.

Qualitative feedback highlighted recurring problems. Reviewers noted that ChatGPT frequently conflated correlation with causation, left out essential background (such as the typical slowness of soft actuators), and tended to over‑hype results, sprinkling in buzzwords like “groundbreaking” and “novel.” Although prompting the model to avoid such language reduced the hype, other issues persisted.
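The article notes that an explicit instruction to avoid promotional language curbed the hype, though not the other problems. Below is a minimal sketch of what such an instruction might look like through the OpenAI Python SDK; the prompt wording and model name are illustrative assumptions, not details published by the study.

    # Hypothetical sketch: steering a model's summary away from hyped language.
    # The system prompt below is an assumption for illustration; the AAAS
    # journalists' actual prompts were not published in this article.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    SYSTEM_PROMPT = (
        "Summarize the following scientific paper for a press briefing. "
        "Avoid promotional words such as 'groundbreaking' or 'novel'. "
        "Describe findings as correlations unless the study design supports "
        "causal claims, and note key methodological limitations."
    )

    def summarize(paper_text: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder; the exact model version was not specified
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": paper_text},
            ],
        )
        return response.choices[0].message.content

Even with guardrails of this kind, the reviewers found that deeper problems, such as missing context and conflated causality, still required human fact‑checking.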

Limitations in Depth and Accuracy

The journalists observed that ChatGPT excels at transcribing the literal text of a paper when the source material lacks nuance. However, the model struggles to translate those findings into a broader context, failing to discuss methodology, limitations, or larger implications. This weakness becomes especially apparent when summarizing papers that present multiple, sometimes conflicting, results or when asked to merge two related studies into a single brief.

Fact‑checking emerged as a major concern. Reporters described the need for “extensive fact‑checking” to verify AI‑generated content, noting that using ChatGPT as a starting point could demand as much effort as writing a summary from scratch. The journalists emphasized that scientific communication demands precision and clarity, making any lapse in factual reliability unacceptable.

Implications for Scientific Publishing

Overall, the AAAS journalists concluded that the current version of ChatGPT does not satisfy the style and standards required for scientific briefs in their press package. While they acknowledged that future major updates to the model might improve performance, they recommended a cautious approach and stressed the importance of human oversight. The study adds to a broader body of research showing that AI tools can cite incorrect sources as often as 60 percent of the time, reinforcing the need for rigorous editorial review when integrating AI‑generated text into scientific discourse.

#ChatGPT #AI summarization #science journalism #AAAS #research evaluation #fact checking #scientific communication #language model performance
Generated with News Factory - Source: Ars Technica
