Large language models (LLMs) outperformed a medical student but fell short of junior faculty and an in-training radiologist when solving imaging cases in a quiz, suggest findings published on 10 December in Radiology.
Researchers led by Dr. Pae Sun Suh from Yonsei University in Seoul, South Korea, found that the LLMs showed “substantial” accuracy with text and image inputs when analyzing New England Journal of Medicine (NEJM) Image Challenge cases. However, their accuracy decreased with shorter text lengths.
“The accuracy of LLMs was consistent regardless of image input and was significantly influenced by the length of text input,” Suh and colleagues wrote.
LLM use is on the rise in radiology, with models beginning to understand both textual content and visual images. However, many doubt the ability of LLMs to perceive and accurately interpret medical images.
The researchers evaluated the accuracy of LLMs in answering NEJM Image Challenge cases with radiologic images. They compared these results to those of human readers with varying levels of training experience. Finally, the team explored potential factors affecting LLM accuracy. It included four LLMs in its study: ChatGPT-4V, ChatGPT-4o, Gemini, and Claude.
The NEJM Image Challenge is a quiz for medical professionals that includes questions on various clinically impactful diseases in several medical fields. For the study, the researchers included radiologic images from 272 cases published between 2005 and 2024. The study also included 11 human readers: seven junior faculty radiologists, two clinicians, one in-training radiologist, and one medical student. The readers were blinded to the published answers.
Of the LLMs, ChatGPT-4o achieved the highest accuracy. And while it did not outperform the junior faculty or radiologist in training, it outperformed the medical student.
Accuracy of ChatGPT-4o, human readers

Model/reader | Accuracy | p-value (compared with LLM) |
---|---|---|
ChatGPT-4o | 59.6% | N/A |
In-training radiologist | 70.2% | 0.003 |
Junior faculty | 80.9% | < 0.001 |
Medical student | 47.1% | < 0.001 |
Also, ChatGPT-4o showed similar accuracy regardless of image inputs. It achieved an accuracy of 54% without images and 59.6% with images (p = 0.59).
And while human reader accuracy was unaffected by text length, LLMs achieved higher accuracy with long text inputs (all p < 0.001), with odds ratios ranging from 3.2 to 6.6.
The study authors highlighted that these findings demonstrate the uncertainty in the ability of LLMs to perform visual assessment and interpretation of image inputs.
They also wrote that one possible reason for the LLMs providing correct answers without image inputs is the probabilistic selection of answers from multiple choices based on extensive training data. Furthermore, LLM performance on multiple-choice quizzes “may be overestimated” because radiologists make diagnostic decisions without multiple choices helping them, the team added.
“Although LLMs have demonstrated promising advancements in radiologic diagnosis, certain limitations require great caution in their application to real-world diagnostics because of their uncertain ability to interpret radiologic images and their dependence on text inputs,” the authors wrote. “Thus, LLMs are unlikely to replace radiologists in the immediate future.”