Large language models (LLMs) are not ready to assign PI-RADS classifications for prostate cancer, suggest findings published on 13 November in the British Journal of Radiology.
A team led by Dr. Kang-Lung Lee from the University of Cambridge, U.K., found that radiologists outperformed all of the LLMs analyzed in the study, including ChatGPT and Google Gemini, in accurately assigning PI-RADS categories based on prostate MRI text reports.
“While LLMs, including online models, may be a valuable tool, it's essential to be aware of their limitations and exercise caution in their clinical application,” Lee told AuntMinnieEurope.com.
Since chatbots built on them became publicly available in late 2022, LLMs have demonstrated potential for clinical use, including in radiology departments. Radiology researchers continue to explore their capabilities as well as their current limitations.
Lee and colleagues tested the chatbots' ability to assign PI-RADS categories based on clinical text reports. They included 100 consecutive multiparametric prostate MRI reports from patients who had not undergone biopsy. Two radiologists classified the reports, and these classifications were compared with responses generated by four models: ChatGPT-3.5, ChatGPT-4, Google Bard, and Google Gemini.
Out of the total reports, 52 were originally reported as PI-RADS 1-2, nine as PI-RADS 3, 19 as PI-RADS 4, and 20 as PI-RADS 5.
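The article does not detail exactly how the reports were presented to the chatbots, so purely as an illustration of the task, the sketch below sends a report excerpt to a chat model through the OpenAI Python client and asks for a single PI-RADS category. The prompt wording, model name, and report text are assumptions for illustration and are not taken from the study.

```python
# Illustrative sketch only; the study's actual prompts and interfaces are not reproduced here.
# Assumes the OpenAI Python client (pip install openai) and an API key in OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

# Hypothetical report excerpt, invented for illustration
report_text = """Prostate MRI findings: 12 mm lesion in the left peripheral zone,
markedly hypointense on ADC, with early focal enhancement on DCE."""

response = client.chat.completions.create(
    model="gpt-4",  # stand-in model name
    messages=[
        {"role": "system", "content": "You are assisting with prostate MRI reporting."},
        {"role": "user", "content": (
            "Based on the following prostate MRI report, assign a single overall "
            "PI-RADS category (1-5). Reply with the number only.\n\n" + report_text
        )},
    ],
)

print(response.choices[0].message.content)  # e.g., "4"
```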
The radiologists outperformed all the LLMs. However, the researchers observed that the successor models (ChatGPT-4 and Gemini) outperformed their predecessors.
Accuracy of radiologists and large language models in PI-RADS classification

| Reader | Accuracy |
| --- | --- |
| Senior radiologist | 95% |
| Junior radiologist | 90% |
| ChatGPT-4 | 83% |
| Gemini | 79% |
| ChatGPT-3.5 | 67% |
| Bard | 67% |
For PI-RADS 1 and 2 cases, Bard and Gemini bested the ChatGPT models, achieving F1 scores of 0.94 and 0.98, respectively, compared with 0.77 for ChatGPT-3.5 and 0.94 for ChatGPT-4.
However, for PI-RADS 4 and 5 cases, ChatGPT-3.5 and ChatGPT-4 (F1 scores of 0.95 and 0.98, respectively) outperformed Bard and Gemini (0.71 and 0.87, respectively).
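Per-group F1 scores like these are typically computed by treating each PI-RADS grouping as its own class. A minimal sketch using scikit-learn follows; the label lists are invented for illustration and are not the study's data.

```python
# Minimal sketch of per-group F1 scoring with scikit-learn; labels are made up for illustration.
from sklearn.metrics import f1_score

# Original report categories, grouped as in the study (1-2, 3, 4, 5)
y_true = ["1-2", "1-2", "3", "4", "4", "5", "5", "1-2"]
# Categories assigned by one of the models
y_pred = ["1-2", "3",   "3", "4", "5", "5", "5", "1-2"]

# average=None returns one F1 score per PI-RADS grouping
scores = f1_score(y_true, y_pred, labels=["1-2", "3", "4", "5"], average=None)
for group, score in zip(["1-2", "3", "4", "5"], scores):
    print(f"PI-RADS {group}: F1 = {score:.2f}")
```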
Bard also hallucinated a nonexistent “PI-RADS 6” category for two patients; the PI-RADS scale contains only five categories.
“This hallucination phenomenon, however, was not observed in ChatGPT-3.5, ChatGPT-4, or Gemini,” Lee said.
Finally, the team observed varying levels of agreement between the original reports and the radiologists and models, with the following kappa values: senior radiologist, 0.93; junior radiologist, 0.84; ChatGPT-4, 0.86; Gemini, 0.81; ChatGPT-3.5, 0.65; and Bard, 0.57.
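The article does not specify which kappa variant was used; a minimal sketch of computing unweighted Cohen's kappa with scikit-learn is shown below, with category lists invented for illustration rather than taken from the study.

```python
# Minimal sketch of an agreement metric (Cohen's kappa) with scikit-learn;
# the categories below are invented for illustration, not the study's data.
from sklearn.metrics import cohen_kappa_score

original_report = [2, 2, 3, 4, 4, 5, 1, 2, 4, 5]   # categories from the original reports
reader_output   = [2, 3, 3, 4, 5, 5, 1, 2, 4, 5]   # categories assigned by a reader or model

kappa = cohen_kappa_score(original_report, reader_output)
print(f"kappa = {kappa:.2f}")
```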
Lee said that despite the results, LLMs have the potential to assist radiologists in assigning or verifying PI-RADS categories once text reports are completed, including by offering significant support to less experienced readers in making accurate decisions.
“Furthermore, not all radiologists include PI-RADS scores in their reports, which can create challenges when patients are referred to another hospital,” Lee told AuntMinnieEurope.com. “In such cases, LLMs can streamline the process for healthcare professionals at referral centers by efficiently generating PI-RADS categories from existing text reports.”
The researchers called for future research into the utility of LLMs in assisting residents with reading reports, as well as into where these models still fall short. This could offer further insight into how the models might be applied in training environments, they noted.
The full study can be found here.