GPT-4, Google Gemini Fall Short in Breast Imaging Classification

Use of publicly available large language models (LLMs) resulted in changes in breast imaging reports classification that could have a negative effect on patient management, according to a new international study published today in the journal Radiology, a journal of the Radiological Society of North America (RSNA). The study findings underscore the need to regulate these LLMs in scenarios that require high-level medical reasoning, researchers said.

LLMs are a type of artificial intelligence (AI) widely used today for a variety of purposes. In radiology, LLMs have already been tested in a wide variety of clinical tasks, from processing radiology request forms to providing imaging recommendations and diagnosis support.

Publicly available generic LLMs like ChatGPT (GPT 3.5 and GPT-4) and Google Gemini (formerly Bard) have shown promising results in some tasks. Importantly, however, they are less successful at more complex tasks requiring a higher level of reasoning and deeper clinical knowledge, such as providing imaging recommendations. Users seeking medical advice may not always understand the limitations of these untrained programs.

"Evaluating the abilities of generic LLMs remains important as these tools are the most readily available and may unjustifiably be used by both patients and non-radiologist physicians seeking a second opinion," said study co-lead author Andrea Cozzi, M.D., Ph.D., radiology resident and post-doctoral research fellow at the Imaging Institute of Southern Switzerland, Ente Ospedaliero Cantonale, in Lugano, Switzerland.

Dr. Cozzi and colleagues set out to test the generic LLMs on a task that pertains to daily clinical routine but where the depth of medical reasoning is high and where the use of languages other than English would further stress LLMs capabilities. They focused on the agreement between human readers and LLMs for the assignment of Breast Imaging Reporting and Data System (BI-RADS) categories, a widely used system to describe and classify breast lesions.

The Swiss researchers partnered with an American team from Memorial Sloan Kettering Cancer Center in New York City and a Dutch team at the Netherlands Cancer Institute in Amsterdam.

The study included BI-RADS classifications of 2,400 breast imaging reports written in English, Italian and Dutch. Three LLMs - GPT-3.5, GPT-4 and Google Bard (now renamed Google Gemini) - assigned BI-RADS categories using only the findings described by the original radiologists. The researchers then compared the performance of the LLMs with that of board-certified breast radiologists.

The agreement for BI-RADS category assignments between human readers was almost perfect. However, the agreement between humans and the LLMs was only moderate. Most importantly, the researchers also observed a high percentage of discordant category assignments that would result in negative changes in patient management. This raises several concerns about the potential consequences of placing too much reliance on these widely available LLMs.

According to Dr. Cozzi, the results highlight the need for regulation of LLMs when there is a highly likely possibility that users may ask them health-care-related questions of varying depth and complexity.

"The results of this study add to the growing body of evidence that reminds us of the need to carefully understand and highlight the pros and cons of LLM use in health care," he said. "These programs can be a wonderful tool for many tasks but should be used wisely. Patients need to be aware of the intrinsic shortcomings of these tools, and that they may receive incomplete or even utterly wrong replies to complex questions."

Cozzi A, Pinker K, Hidber A, Zhang T, Bonomo L, Lo Gullo R, Christianson B, Curti M, Rizzo S, Del Grande F, Mann RM, Schiaffino S.
BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study.
Radiology. 2024 Apr;311(1):e232133. doi: 10.1148/radiol.232133

Most Popular Now

Commission Joins Forces with Venture Cap…

The Commission has launched a Trusted Investors Network bringing together a group of investors ready to co-invest in innovative deep-tech companies in Europe together with the EU. The Union's investment...

Philips and Medtronic Advocacy Partnersh…

Royal Philips (NYSE: PHG, AEX: PHIA), a global leader in health technology, and Medtronic Neurovascular, a leading innovator in neurovascular therapies, today announced a strategic advocacy partnership. Delivering timely stroke...

Wearable Cameras Allow AI to Detect Medi…

A team of researchers says it has developed the first wearable camera system that, with the help of artificial intelligence (AI), detects potential errors in medication delivery. In a test whose...

New AI Tool Predicts Protein-Protein Int…

Scientists from Cleveland Clinic and Cornell University have designed a publicly-available software and web database to break down barriers to identifying key protein-protein interactions to treat with medication. The computational tool...

AI for Real-Rime, Patient-Focused Insigh…

A picture may be worth a thousand words, but still... they both have a lot of work to do to catch up to BiomedGPT. Covered recently in the prestigious journal Nature...

New Research Shows Promise and Limitatio…

Published in JAMA Network Open, a collaborative team of researchers from the University of Minnesota Medical School, Stanford University, Beth Israel Deaconess Medical Center and the University of Virginia studied...

G-Cloud 14 Makes it Easier for NHS to Bu…

NHS organisations will be able to save valuable time and resource in the procurement of technologies that can make a significant difference to patient experience, in the latest iteration of...

Start-Ups will Once Again Have a Starrin…

11 - 14 November 2024, Düsseldorf, Germany. The finalists in the 16th Healthcare Innovation World Cup and the 13th MEDICA START-UP COMPETITION have advanced from around 550 candidates based in 62...

Hampshire Emergency Departments Digitise…

Emergency departments in three hospitals across Hampshire Hospitals NHS Foundation Trust have deployed Alcidion's Miya Emergency, digitising paper processes, saving clinical teams time, automating tasks, and providing trust-wide visibility of...

MEDICA HEALTH IT FORUM: Success in Maste…

11 - 14 November 2024, Düsseldorf, Germany. How can innovations help to master the great challenges and demands with which healthcare is confronted across international borders? This central question will be...

A "Chemical ChatGPT" for New M…

Researchers from the University of Bonn have trained an AI process to predict potential active ingredients with special properties. Therefore, they derived a chemical language model - a kind of...

Siemens Healthineers co-leads EU Project…

Siemens Healthineers is joining forces with more than 20 industry and public partners, including seven leading stroke hospitals, to improve stroke management for patients all over Europe. With a total...