A context-grounded chatbot can outperform radiologists and ChatGPT itself in providing imaging recommendations that adhere to appropriateness guidelines, according to German research published on 25 July in Radiology.
Researchers led by Alexander Rau, MD, from the University of Freiburg reported that their appropriateness criteria context-based chatbot (accGPT) gave "usually appropriate" recommendations according to the American College of Radiology (ACR) Appropriateness Criteria, delivered consistently correct answers, and yielded time and cost savings.
"Our results demonstrate the potential of the context-based accGPT chatbot in making imaging recommendations based on the ACR guidelines as it accepts standard clinical referral notes and provides concise recommendations on imaging in an end-to-end solution," Rau and colleagues wrote.
Radiologists have pointed out that standardized care is needed for more efficient and accurate diagnostic imaging. While the ACR created its first recommendations for streamlined decision-making in 1994, the researchers noted that variability in clinical routine persists. Contributing to this variability are a lack of awareness among some radiologists, as well as the rapid rise of new imaging technologies and methods, including AI.
One such AI tool, ChatGPT, has seen a rapid rise in use in patient-facing settings, and radiology researchers have also been testing its potential as a clinical assistance tool. While it has gone through a couple of upgrades since its initial launch in late 2022, namely GPT 3.5-Turbo and GPT 4, ChatGPT's use is limited by its training data, which extends only to September 2021. This means it can give incorrect or incomplete information.
Rau and colleagues wanted to investigate how incorporating specialized knowledge into their accGPT model would affect the accuracy and relevance of its responses. The team built the model with LlamaIndex, a framework that connects large language models to external data sources, and GPT 3.5-Turbo.
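The study does not publish the exact pipeline, but the general pattern is retrieval-augmented generation: the guideline text is indexed so that relevant passages can be retrieved and supplied to the model alongside a referral note. Below is a minimal, illustrative sketch of that pattern using LlamaIndex with GPT 3.5-Turbo; the local folder name ./acr_criteria and the sample referral note are assumptions for illustration only, not the authors' implementation.

```python
# Minimal retrieval-augmented setup with LlamaIndex (illustrative sketch).
# Assumes the ACR Appropriateness Criteria documents have been saved locally
# under "./acr_criteria" (a hypothetical path) and that an OpenAI API key is
# available in the OPENAI_API_KEY environment variable.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI

# Use GPT 3.5-Turbo as the underlying language model, as in the study.
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.0)

# Load the guideline documents and build a vector index over them so that
# relevant passages can be retrieved at query time.
documents = SimpleDirectoryReader("./acr_criteria").load_data()
index = VectorStoreIndex.from_documents(documents)

# The query engine retrieves matching guideline text and passes it to the
# model together with the clinical referral note.
query_engine = index.as_query_engine()

# Hypothetical referral note, phrased the way an ordering physician might.
referral_note = (
    "55-year-old patient with acute right lower quadrant pain; "
    "which imaging study is most appropriate?"
)
print(query_engine.query(referral_note))
```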
From there, the team tested its model against radiologists of varying experience levels, as well as GPT 3.5-Turbo and GPT 4, on adherence to the ACR Appropriateness Criteria. They used a random selection of 50 case files based on the ACR Appropriateness Criteria and repeated the testing six times for each chatbot.
The researchers found that when it came to providing "usually appropriate" recommendations, the accGPT model significantly outperformed the radiologists and GPT 3.5-Turbo, while only besting GPT 4 at a "trend" level.
Comparison of performance between the models and radiologists (odds ratios; column serves as reference)

| | Radiologists | GPT 3.5-Turbo | GPT 4 | accGPT |
|---|---|---|---|---|
| Radiologists | - | 0.78 (p = 0.23) | 0.41 (p < 0.001) | 0.27 (p < 0.001) |
| GPT 3.5-Turbo | 1.29 (p = 0.23) | - | 0.53 (p = 0.004) | 0.34 (p < 0.001) |
| GPT 4 | 2.44 (p < 0.001) | 1.9 (p = 0.004) | - | 0.65 (p = 0.08) |
| accGPT | 3.76 (p < 0.001) | 2.93 (p < 0.001) | 1.54 (p = 0.08) | - |
The researchers also reported that they did not observe a "robust" difference in correct answers among the radiologists, indicating that experience level did not have a significant impact on the results.
Because each case was presented six times, the team could also determine the proportion of cases with 100% (6 of 6) correct ratings and with at least 66.66% (4 of 6) correct ratings. When "may be appropriate" answers were counted as incorrect, accGPT was the best-performing model, answering correctly in all six runs for 74% of cases and in at least four of six runs for 82% of cases.

The accGPT model also performed well when "may be appropriate" answers were counted as correct: it was right in all six runs for 74% of cases and in at least four of six runs for 84% of cases.
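For readers unfamiliar with this kind of repeat-consistency measure, the sketch below shows how the two proportions can be computed from six repeated runs per case. The per-case results here are made up for illustration and are not the study's data.

```python
# Illustrative calculation of the repeat-consistency metrics described above,
# using hypothetical per-case results (True = rating matched the ACR criteria).
# Each case is presented to the chatbot six times.
runs_per_case = {
    "case_01": [True, True, True, True, True, True],    # 6 of 6 correct
    "case_02": [True, True, True, True, False, True],   # 5 of 6 correct
    "case_03": [True, False, False, True, True, False], # 3 of 6 correct
}

n_cases = len(runs_per_case)
# Proportion of cases answered correctly in all six runs (6 of 6).
all_six_correct = sum(all(runs) for runs in runs_per_case.values()) / n_cases
# Proportion of cases answered correctly in at least four of six runs.
at_least_four_correct = sum(sum(runs) >= 4 for runs in runs_per_case.values()) / n_cases

print(f"100% (6 of 6) correct: {all_six_correct:.0%}")
print(f"At least 66.66% (4 of 6) correct: {at_least_four_correct:.0%}")
```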
Finally, the team found that all three chatbots led to significant time and cost savings: an average decision time of 5 minutes and a cost of 0.19 euros, compared with 50 minutes and 29.99 euros for the radiologists (p < 0.01 for both time and cost).
The study authors suggested that based on their results, accGPT could be beneficial to radiologists and referring physicians.
"Radiologists would primarily use it as an information retrieval tool for rare cases, while ordering physicians might find it useful as a quick reference to guide decision-making in the ordering process," they wrote.
The authors also called for future studies of the model to include detailed assessment of the costs, availability, and potential radiation dose.