AI makes science easy, but is it getting it right? Study warns LLMs are oversimplifying critical research
ET Online
Published 5 hours ago · 4 minute read
Large language models (LLMs), including ChatGPT, Llama, and DeepSeek, might be doing too good a job at being simple—and not in a good way. According to a study published in the journal Royal Society Open Science and reported by Live Science, researchers found that newer versions of these AI models are not only more likely to oversimplify complex information but may also distort critical scientific findings. Their attempts to be concise are sometimes so sweeping that they risk misinforming healthcare professionals, policymakers, and the general public.

Led by Uwe Peters, a postdoctoral researcher at the University of Bonn, the study evaluated over 4,900 summaries generated by ten of the most popular LLMs, including four versions of ChatGPT, three of Claude, two of Llama, and one of DeepSeek. These were compared against human-generated summaries of academic research.

The results were stark: chatbot-generated summaries were nearly five times more likely than human ones to overgeneralize the findings. And when prompted to prioritize accuracy over simplicity, the chatbots didn’t get better—they got worse. In fact, they were twice as likely to produce misleading summaries when specifically asked to be precise.

“Generalization can seem benign, or even helpful, until you realize it’s changed the meaning of the original research,” Peters explained in an email to Live Science. What’s more concerning is that the problem appears to be growing. The newer the model, the greater the risk of confidently delivered—but subtly incorrect—information.


In one striking example from the study, DeepSeek transformed a cautious phrase, “was safe and could be performed successfully”, into a bold, unqualified medical recommendation: “is a safe and effective treatment option.” Another summary, by Llama, eliminated crucial qualifiers around the dosage and frequency of a diabetes drug, potentially leading to dangerous misinterpretations if used in real-world medical settings.

Max Rollwage, vice president of AI and research at Limbic, a clinical mental health AI firm, warned that “biases can also take more subtle forms, like the quiet inflation of a claim’s scope.” He added that AI summaries are already integrated into healthcare workflows, making accuracy all the more critical.

Part of the issue stems from how LLMs are trained. Patricia Thaine, co-founder and CEO of Private AI, points out that many models learn from simplified science journalism rather than from peer-reviewed academic papers. This means they inherit and replicate those oversimplifications, especially when tasked with summarizing already simplified content.

Even more critically, these models are often deployed across specialized domains like medicine and science without any expert supervision. “That’s a fundamental misuse of the technology,” Thaine told Live Science, emphasizing that task-specific training and oversight are essential to prevent real-world harm.

Peters likens the issue to using a faulty photocopier: each version of a copy loses a little more detail, until what’s left barely resembles the original. LLMs process information through complex computational layers, often trimming the nuanced limitations and context that are vital in scientific literature.

Earlier versions of these models were more likely to refuse to answer difficult questions. Ironically, as newer models have become more capable and “instructable,” they’ve also become more confidently wrong.

“As their usage continues to grow, this poses a real risk of large-scale misinterpretation of science at a moment when public trust and scientific literacy are already under pressure,” Peters cautioned.

While the study's authors acknowledge some limitations, including the need to expand testing to non-English texts and different types of scientific claims, they insist the findings should be a wake-up call: developers need to create workflow safeguards that flag oversimplifications and prevent incorrect summaries from being mistaken for vetted, expert-approved conclusions.

In the end, the takeaway is clear: as impressive as AI chatbots may seem, their summaries are not infallible, and when it comes to science and medicine, there’s little room for error masked as simplicity.

Because in the world of AI-generated science, a few extra words, or a few missing ones, can mean the difference between informed progress and dangerous misinformation.
