AI-Powered Chatbots Are Giving Even Worse Medical Advice Than We Thought.

It’s tempting to think that a chatbot built on a large language model (LLM) will be able to answer any question you ask, including questions about your health. After all, chatbots are trained on a wealth of medical information and can reproduce it if given the right prompts. But that doesn’t mean they’ll give you accurate medical advice, and new research shows just how easily AI’s supposed competence breaks down. In short, they’re even worse at it than I thought.

In the study, the researchers first tested several chatbots directly on medical questions. Queried this way, GPT-4o, Llama 3, and Command R+ correctly identified the relevant conditions in an impressive 94% of cases, though they recommended the correct course of action in a much less impressive 56% of cases.

But this was not a real-world test of the usefulness of chatbots in the medical field.

The researchers then presented 1,298 people with medical scenarios and asked them to use an LLM to figure out what might be going on in the scenario, as well as what they should do about it (such as calling an ambulance, seeing a doctor at a convenient time, or handling the problem themselves).

Participants were recruited through an online platform that claims to verify that its subjects are real people and not bots. Some participants were assigned to a control group, which was asked to work through the scenarios on their own, without any AI tools. Ultimately, the no-AI control group did significantly better than the LLM group at correctly identifying medical conditions, including the most serious scenarios requiring urgent attention.

How a chatbot providing “correct” information can mislead people.

As the researchers note, strong performance by an LLM on its own is not enough to ensure it serves users well. Numerous previous studies have shown that chatbot performance is sensitive to the precise wording of users’ questions, and that chatbots appear to prioritize pleasing the user over providing accurate information.

Even if an LLM bot can correctly answer an objectively worded question, that doesn’t mean it will give you sound advice when you need it. That’s why it doesn’t mean much that ChatGPT can “pass” a modified version of the medical licensing exam: correctly answering standardized multiple-choice questions isn’t the same as telling you when to go to the hospital.

Researchers analyzed chat logs to determine where the issues occurred. Here are some of the problems they identified:

  • Users didn’t always give the bot all the necessary information. As non-specialists, they often didn’t know exactly what to include in their prompts. If you’ve ever consulted a doctor about something potentially serious, you know they’ll pepper you with questions to make sure nothing important gets missed. Bots don’t always do this.

  • The bots “generated multiple types of misleading and incorrect information.” Sometimes they ignored important details to focus on something else; sometimes they recommended calling an emergency number but gave the wrong one (for example, giving UK users the Australian emergency number).

  • Answers to similar questions could differ dramatically. For example, two users gave nearly identical descriptions of a subarachnoid hemorrhage. One was told to seek emergency medical attention; the other was told to lie down in a dark room.

  • People interacted with the chatbot in different ways. Some asked narrow questions to constrain the bot’s responses, for example, while others let the bot take the lead. Either approach could produce unreliable results.

  • Correct answers were often mixed in with incorrect ones. On average, each LLM presented users with 2.21 options to choose from, and understandably, people didn’t always pick the right one.

Overall, people who didn’t use an LLM were 1.76 times more likely to identify the correct condition. (Both groups were about equally likely to choose the correct course of action, but that doesn’t mean much; on average, they got it right only about 43% of the time.) The researchers characterized the control group as “significantly better” at the task. And this may represent a best-case scenario: the researchers note that their scenarios involved clear-cut examples of common conditions, and an LLM would likely do worse with rare diseases or more complex medical situations. They conclude that despite the high performance LLMs achieve on their own, both on existing benchmarks and in their scenarios, that medical knowledge was not sufficient for effective patient care.

Chatbots also pose a danger to doctors.

Patients may not know how to talk to an LLM or how to check its work, but surely doctors would fare better, right? Unfortunately, medical professionals are also using AI-powered chatbots to look up medical information in ways that pose risks to patient care.

ECRI, a nonprofit focused on healthcare safety, ranked the misuse of AI-powered chatbots at the top of its list of health technology dangers for 2026. While the AI marketing machine tries to persuade you to hand ChatGPT your medical information, ECRI rightly points out that it’s a mistake to think of these chatbots as having human personalities or cognitive abilities: “While these models produce human-like responses, they do so by predicting the next word based on large data sets, rather than through genuine understanding of the information.”

ECRI reports that doctors are indeed using generative AI tools for patient care, and that research has already shown serious risks. Using LLMs does not improve doctors’ clinical judgment. LLMs will reliably run with incorrect details contained in prompts. Google’s Med-Gemini model, created for medical use, invented a non-existent body part whose name combined two unrelated real body parts; Google told a Verge reporter it was a “typo.” ECRI argues that “because LLM responses often sound authoritative, there is a risk that doctors may subconsciously incorporate AI-generated suggestions into their judgments without critical analysis.”

Even in situations that don’t seem life-threatening, using a chatbot can cause harm. ECRI asked four LLM chatbots to recommend gel brands suitable for use with a specific ultrasound machine on a patient with a catheter near the scan site. In that situation, it is crucial to use sterile gel because of the risk of infection. Only one of the four chatbots flagged the issue and made appropriate recommendations; the others simply recommended standard ultrasound gels. In other tests, ECRI found that chatbots gave unsafe advice on electrode placement and the use of isolation gowns.

Clearly, LLM chatbots aren’t ready to be trusted with people’s safety in medical care, whether the person relying on them is the patient, the treating physician, or a staff member ordering supplies. But these services already exist, are widely used, and are heavily promoted. (Their makers even run Super Bowl ad campaigns.) There’s no foolproof way to keep chatbots out of your care entirely, but at the very least you can stick with good old Dr. Google, as long as you disable AI-powered search results.
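If Google is your search engine of choice, one widely shared way to do that (at the time of writing, and assuming Google keeps supporting it) is the “Web” filter, which you can also reach by adding udm=14 to the search URL, for example: https://www.google.com/search?q=sore+throat&udm=14. That view returns plain web links without an AI Overview on top.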
