Meta’s Llama 3 70B open-source large language model (LLM) offers performance comparable to that of proprietary models in answering multiple-choice radiology test questions, according to research published on 13 August in Radiology.
A team led by Dr. Lisa Adams of the Technical University of Munich in Germany found that Llama 3 70B's performance was not inferior to that of OpenAI’s GPT-4, Google DeepMind’s Gemini Ultra, or Anthropic’s Claude models.
“This demonstrates the growing capabilities of open-source LLMs, which offer privacy, customization, and reliability comparable to that of their proprietary counterparts, but with far fewer parameters, potentially lowering operating costs when using optimization techniques such as quantization,” the group wrote.
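The quantization mentioned above lowers memory and serving costs by storing model weights at reduced numerical precision. As a rough illustration only (not the technique used in the study, which is not detailed here), a minimal sketch of symmetric int8 weight quantization in plain Python:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map float weights onto integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0  # one scale factor per tensor
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

# Toy weight values standing in for one tensor of a model
weights = [0.42, -1.3, 0.07, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored weight is within scale/2 of the original, while int8 storage
# needs one byte per weight instead of four (float32).
```

Production toolchains apply the same idea per channel or per block and at 4-bit precision, which is how a 70B-parameter model's memory footprint can be cut severalfold.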
The researchers tested the models, including versions of Mixtral, another open-source LLM, on 50 multiple-choice questions from a publicly available 2022 in-training test from the American College of Radiology (ACR), as well as on 85 additional board-style examination questions. Images were excluded from the analysis.
Performance on ACR diagnostic in-training exam and radiology board exam-style questions

| Question set (accuracy) | GPT-3.5 Turbo | Mixtral 8x22B | Gemini Ultra | Claude 3 Opus | GPT-4 Turbo | Llama 3 70B |
| --- | --- | --- | --- | --- | --- | --- |
| ACR diagnostic in-training exam | 58% | 64% | 72% | 78% | 78% | 74% |
| Radiology board exam-style | 61% | 72% | 72% | 76% | 82% | 80% |
With the exception of the comparison with the Mixtral 8x22B open-source model (p = 0.15), the differences in performance between Llama 3 70B and the other LLMs did not reach statistical significance on the ACR in-training exam questions. Llama 3 70B did significantly outperform GPT-3.5 Turbo (p = 0.05), however, on the radiology board exam-style questions.
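One reason accuracy gaps of this size can fail to reach significance is the small question set. As a purely hypothetical illustration (not the authors' analysis, which may have accounted for paired responses), a two-proportion z-test on 50 questions, using the 74% vs. 58% figures from the ACR question set:

```python
import math

def two_proportion_z(p1, p2, n1, n2):
    """Two-sided z-test for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF via the error function
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Llama 3 70B (74%) vs. GPT-3.5 Turbo (58%) on the 50 ACR questions
z, p = two_proportion_z(0.74, 0.58, 50, 50)
```

Under this (assumed, unpaired) test, a 16-percentage-point gap on 50 questions yields a p-value near 0.09, above the conventional 0.05 threshold, which shows how small samples blunt even sizable accuracy differences.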
The authors emphasized that important limitations remain for these types of models in radiology applications.
“Multiple-choice formats test only specific knowledge, missing broader clinical complexities,” they wrote. “More nuanced benchmarks are needed to assess LLM skill in radiology, including disease and treatment knowledge, guideline adherence, and real-world case ambiguities. The lack of multimodality in open-source models is a critical shortcoming in the image-centric field of radiology.”
What’s more, all LLMs face the challenge of producing unreliable outputs, including false-positive findings and hallucinations, they said.