GPT-4V flops on RSNA ‘Case of the Day’ questions

National Institutes of Health (NIH) investigators have found that GPT-4 Vision (GPT-4V) in its current form cannot reliably interpret radiologic images, according to research published on 1 October in Radiology.

GPT-4V performed poorly on RSNA “Case of the Day” questions compared with radiologists and residents, although its accuracy increased when only the textual context of the cases was provided as input, noted lead author Pritam Mukherjee, PhD, an NIH staff scientist, and colleagues.

“We found that the median accuracy of radiologists and residents significantly exceeded that of GPT-4V. Providing GPT-4V’s outputs to radiologists or residents did not necessarily improve their accuracy. GPT-4V relied on the textual context in the cases for making its choices,” the group wrote.

OpenAI released GPT-4V, which can take both text and images as inputs, in September 2023. The model has shown impressive performance in various benchmarks, including medical examinations, according to the authors.

To further assess its performance in the domain of radiology, the group prompted GPT-4V to solve 72 Case of the Day questions first presented at the RSNA 2023 annual meeting. The researchers compared its accuracy with that of five radiologists and three residents who answered the questions in an “open book” setting, and also explored whether providing the readers with GPT-4V’s outputs affected their accuracy.
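The paper’s exact prompting pipeline is not reproduced here, but as a rough illustration, a multimodal question of this kind can be sent to GPT-4V through the OpenAI Python SDK. The model name, file name, and question text below are placeholder assumptions, not details from the study:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_case(image_path: str, question: str) -> str:
    """Send one image-plus-text question to GPT-4V and return its answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed GPT-4V endpoint; the study's setup may differ
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Hypothetical usage with a placeholder case image and question:
print(ask_case("case_of_the_day_01.png",
               "Based on the image and clinical history, which answer choice "
               "is correct? Respond with a single letter."))
```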

Sixty-two of the 72 cases (86%) were categorized as imaging dependent, the authors noted. On these cases, GPT-4V’s accuracy was 39% (24 of 62), while its accuracy on the 10 imaging-independent cases was 70% (7 of 10), according to the findings.
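For readers who want to check the arithmetic, the reported figures are simple proportions; the short Python sketch below recomputes them and adds exact binomial confidence intervals, which are illustrative and do not appear in the study:

```python
from scipy.stats import binomtest

# GPT-4V's reported results: 24 of 62 imaging-dependent cases correct,
# 7 of 10 imaging-independent cases correct.
for label, correct, total in [("imaging-dependent", 24, 62),
                              ("imaging-independent", 7, 10)]:
    result = binomtest(correct, total)                # exact binomial test
    ci = result.proportion_ci(confidence_level=0.95)  # Clopper-Pearson interval
    print(f"{label}: {correct}/{total} = {correct / total:.0%} "
          f"(95% CI {ci.low:.0%} to {ci.high:.0%})")
```

The wide interval on the imaging-independent figure is a reminder of how little can be concluded from only 10 cases.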

In comparison, the radiologists had greater accuracy than GPT-4V for both imaging-dependent cases (59%; p = 0.31) and imaging-independent cases (76%; p = 0.99).

Overall, giving the radiologists access to GPT-4V’s responses produced no evidence of a difference in their marginal mean accuracy for either imaging-dependent or imaging-independent cases.

The findings suggest that GPT-4V in its current form is not able to reliably interpret radiologic images, yet the study “may guide future research involving large language models, for example, developing new prompting strategies and encouraging the fine-tuning of models with radiologic text and images,” the researchers concluded.

In an accompanying editorial, Dr. Douglas Katz, of NYU Grossman Long Island School of Medicine, noted that although the study was an interesting exercise, GPT-4V’s results were disappointing at best and a bust at worst.

“The results were better than flipping a coin, but not that much better for AI,” he wrote.

However, Katz added that the growth of large language models (LLMs) closely parallels or exceeds the pace of Moore’s law, which holds that the number of transistors on an integrated circuit doubles approximately every 18 months to two years.
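To make that doubling period concrete, here is a back-of-the-envelope Python sketch; the time horizons chosen are illustrative, not from the editorial:

```python
# Compounding implied by a fixed doubling period: factor = 2 ** (t / T)
for doubling_months in (18, 24):
    for years in (2, 5, 10):
        factor = 2 ** (years * 12 / doubling_months)
        print(f"doubling every {doubling_months} months: "
              f"~{factor:.0f}x after {years} years")
```

Under the faster 18-month assumption, that works out to roughly a tenfold increase every five years.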

“When will LLMs with visual input (and with video input, not just static inputs) be able to address difficult imaging cases, and with much greater accuracy? I am not sure, but it is probably not that many years away,” he concluded.

