GPT-4, Google Gemini Fall Short in Breast Imaging Classification

Use of publicly available large language models (LLMs) resulted in changes in breast imaging reports classification that could have a negative effect on patient management, according to a new international study published today in the journal Radiology, a journal of the Radiological Society of North America (RSNA). The study findings underscore the need to regulate these LLMs in scenarios that require high-level medical reasoning, researchers said.

LLMs are a type of artificial intelligence (AI) widely used today for a variety of purposes. In radiology, LLMs have already been tested in a wide variety of clinical tasks, from processing radiology request forms to providing imaging recommendations and diagnosis support.

Publicly available generic LLMs like ChatGPT (GPT 3.5 and GPT-4) and Google Gemini (formerly Bard) have shown promising results in some tasks. Importantly, however, they are less successful at more complex tasks requiring a higher level of reasoning and deeper clinical knowledge, such as providing imaging recommendations. Users seeking medical advice may not always understand the limitations of these untrained programs.

"Evaluating the abilities of generic LLMs remains important as these tools are the most readily available and may unjustifiably be used by both patients and non-radiologist physicians seeking a second opinion," said study co-lead author Andrea Cozzi, M.D., Ph.D., radiology resident and post-doctoral research fellow at the Imaging Institute of Southern Switzerland, Ente Ospedaliero Cantonale, in Lugano, Switzerland.

Dr. Cozzi and colleagues set out to test the generic LLMs on a task that pertains to daily clinical routine but where the depth of medical reasoning is high and where the use of languages other than English would further stress LLMs capabilities. They focused on the agreement between human readers and LLMs for the assignment of Breast Imaging Reporting and Data System (BI-RADS) categories, a widely used system to describe and classify breast lesions.

The Swiss researchers partnered with an American team from Memorial Sloan Kettering Cancer Center in New York City and a Dutch team at the Netherlands Cancer Institute in Amsterdam.

The study included BI-RADS classifications of 2,400 breast imaging reports written in English, Italian and Dutch. Three LLMs - GPT-3.5, GPT-4 and Google Bard (now renamed Google Gemini) - assigned BI-RADS categories using only the findings described by the original radiologists. The researchers then compared the performance of the LLMs with that of board-certified breast radiologists.

The agreement for BI-RADS category assignments between human readers was almost perfect. However, the agreement between humans and the LLMs was only moderate. Most importantly, the researchers also observed a high percentage of discordant category assignments that would result in negative changes in patient management. This raises several concerns about the potential consequences of placing too much reliance on these widely available LLMs.

According to Dr. Cozzi, the results highlight the need for regulation of LLMs when there is a highly likely possibility that users may ask them health-care-related questions of varying depth and complexity.

"The results of this study add to the growing body of evidence that reminds us of the need to carefully understand and highlight the pros and cons of LLM use in health care," he said. "These programs can be a wonderful tool for many tasks but should be used wisely. Patients need to be aware of the intrinsic shortcomings of these tools, and that they may receive incomplete or even utterly wrong replies to complex questions."

Cozzi A, Pinker K, Hidber A, Zhang T, Bonomo L, Lo Gullo R, Christianson B, Curti M, Rizzo S, Del Grande F, Mann RM, Schiaffino S.
BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study.
Radiology. 2024 Apr;311(1):e232133. doi: 10.1148/radiol.232133

Most Popular Now

Is Your Marketing Effective for an NHS C…

How can you make sure you get the right message across to an NHS chief information officer, or chief nursing information officer? Replay this webinar with Professor Natasha Phillips, former...

Welcome Evo, Generative AI for the Genom…

Brian Hie runs the Laboratory of Evolutionary Design at Stanford, where he works at the crossroads of artificial intelligence and biology. Not long ago, Hie pondered a provocative question: If...

We could Soon Use AI to Detect Brain Tum…

A new paper in Biology Methods and Protocols, published by Oxford University Press, shows that scientists can train artificial intelligence (AI) models to distinguish brain tumors from healthy tissue. AI...

Telehealth Significantly Boosts Treatmen…

New research reveals a dramatic improvement in diagnosing and curing people living with hepatitis C in rural communities using both telemedicine and support from peers with lived experience in drug...

Research Study Shows the Cost-Effectiven…

Earlier research showed that primary care clinicians using AI-ECG tools identified more unknown cases of a weak heart pump, also called low ejection fraction, than without AI. New study findings...

AI can Predict Study Results Better than…

Large language models, a type of AI that analyses text, can predict the results of proposed neuroscience studies more accurately than human experts, finds a new study led by UCL...

New Guidance for Ensuring AI Safety in C…

As artificial intelligence (AI) becomes more prevalent in health care, organizations and clinicians must take steps to ensure its safe implementation and use in real-world clinical settings, according to an...

Remote Telemedicine Tool Found Highly Ac…

Collecting images of suspicious-looking skin growths and sending them off-site for specialists to analyze is as accurate in identifying skin cancers as having a dermatologist examine them in person, a...

Philips Aims to Advance Cardiac MRI Tech…

Royal Philips (NYSE: PHG, AEX: PHIA) and Mayo Clinic announced a research collaboration aimed at advancing MRI for cardiac applications. Through this investigation, Philips and Mayo Clinic will look to...

New Study Reveals Why Organisations are …

The slow adoption of blockchain technology is partly driven by overhyped promises that often obscure the complex technological, organisational, and environmental challenges, according to research from the University of Surrey...

Deep Learning Model Accurately Diagnoses…

Using just one inhalation lung CT scan, a deep learning model can accurately diagnose and stage chronic obstructive pulmonary disease (COPD), according to a study published today in Radiology: Cardiothoracic...

Shape-Changing Device Helps Visually Imp…

Researchers from Imperial College London, working with the company MakeSense Technology and the charity Bravo Victor, have developed a shape-changing device called Shape that helps people with visual impairment navigate...