Artificial intelligence (AI) requires data. Ideally that data should be clean, trustworthy, and above all, accurate. Unfortunately, the reality often falls far short. In fact, medical data can sometimes be so far removed from clean that it is positively dirty.
Consider the simple chest x-ray, the good old-fashioned posteroanterior radiograph of the thorax -- one of the longest-standing radiological techniques in the medical diagnostic armory, performed across the world in the billions. So many, in fact, that radiologists struggle to keep up with the sheer volume, and sometimes forget to read the odd 23,000 of them. Oops.
Surely, such a popular, tried, and tested medical test should provide great data for training radiology AI? There's clearly more than enough data to have a decent attempt, and the technique is so well standardized and robust that surely it's just crying out for automation?
Unfortunately, there is one small and inconvenient problem: humans.
Human radiologists are so bad at interpreting chest x-rays and agreeing on what findings they can see that the "report" that comes with the digital image is often entirely wrong, partially wrong, or missing information.
It's not the humans' fault -- they are trying their best! When your job is to process thousands of black and white pixels into a few words of natural language text in approximately 30 seconds, it's understandable that information gets lost and errors are made. Writing a radiology report is an extreme form of data compression -- you are converting around 2 megabytes of data into a few hundred bytes of text, in effect performing lossy compression with a huge compression ratio. It's like trying to stream a movie through a 16K modem by getting someone to tap out what's happening in Morse code.
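To put a rough number on that compression ratio, here is a back-of-the-envelope sketch; the figures (a roughly 2 MB image, a 30-word report) are assumptions chosen for illustration, not measured values:

```python
# Back-of-the-envelope figures only (assumptions, not measured values):
# a stored chest x-ray is on the order of 2 MB of pixel data, while a
# short free-text report might run to around 30 words.
image_bytes = 2 * 1024 * 1024        # ~2 MB of pixels
report_words = 30
bytes_per_word = 6                   # rough English average, spaces included
report_bytes = report_words * bytes_per_word

ratio = image_bytes / report_bytes
print(f"Report: ~{report_bytes} bytes; compression ratio ~= {ratio:,.0f}:1")
# Prints a ratio in the region of 11,000:1 -- and unlike a JPEG, there is
# no decoder that can reconstruct the pixels from the prose.
```

Even on these generous assumptions, the report is roughly ten thousand times smaller than the pixel data it summarizes, and nothing can be reconstructed from it.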
And then there's the subjectivity of it all. The interpretation of the image is subject to all sorts of external factors, including patient demographics, history, and geography. The problem is worse still for more complex imaging modalities such as MRI, or for operator-dependent modalities like ultrasound, where observer error is even higher.
But does it really matter?
So what if a chest x-ray report isn't very accurate? The image is still there, so surely no data are really lost?
The problem quickly becomes apparent when you start using the written report to train a radiology AI to learn how to interpret the image. A machine-learning team at Stanford University in California did exactly this, using 108,948 labeled chest x-rays made freely available by the U.S. National Institutes of Health (NIH). It proudly announced that its algorithm outperformed a radiologist at finding pneumonia.
Now, I'm all for cutting-edge research, and I think it's great that datasets like this are released to the public for exactly this reason, but we have to be extremely careful about how we interpret the results of any algorithms built on this data, because the data can be dirty.
How is it possible to train an AI to be better than a human if the data you give it are of the same low quality as that produced by humans? I don't think it is.
It boils down to a simple fact: Chest x-ray reports were never intended to be used for the development of radiology AI. They were only ever supposed to be an opinion, an interpretation, a creative, educated guess. Reading a chest x-ray is more of an art than a science. A chest x-ray is neither the final diagnostic test nor the first; it is just one part of a suite of diagnostic steps taken to reach a clinical end point.
The chest x-ray itself is not a substitute for a ground truth. In fact, its only real purpose is to act as a form of "triage," with the universal clinical question being, "Is there something here that I need to worry about?" That's where the value in a chest x-ray lies -- answering "Should I worry?" rather than "What is the diagnosis?" Perhaps the researchers at Stanford have been trying to answer the wrong question.
Three key points
If we are to develop an AI that can actually "read" chest x-rays, then future research should be concentrated on the following three things:
- The surrounding metadata and finding a ground truth, rather than relying on a human-derived report that wasn't produced with data-mining in mind. An ideal dataset would include all of the patient's details, epidemiology, history, blood tests, follow-up CT results, biopsy results, genetics, and more. Sadly, this level of verified anonymized data doesn't exist, at least not in the format required for machine reading. At a bare minimum, infrastructure should therefore be put into collating and verifying this metadata -- preferably at scale.
- Meticulous labeling of the dataset. And I do mean absolutely, painstakingly, thoroughly annotating images using domain experts trained specifically to do so for the purpose of providing machine-learning-ready data. Expert consensus opinion, alongside accurate metadata, will be demonstrably better than using random single-reader reports (a toy sketch of this consensus approach follows this list). Thankfully, this is what some of the more reputable AI companies are doing. Yes, it's expensive and time-consuming, but it's a necessity if the end goal is to be attained. This is what I have termed the data-refinement process, specifically the level B to level A stage. Skip this, and you'll never beat human performance.
- Standardizing radiological language. Many of the replies I got to my simple Twitter experiment used differing language to describe roughly similar things. For instance, "consolidation" is largely interchangeable with "pneumonia." Or is it? How do we define these terms, and when should one be used instead of the other? There is huge uncertainty in human language, and this extends to radiological language. (Radiologists are renowned in medical circles for their skill at practicing uncertainty, known as "the hedge.") Until this uncertainty is removed, and terminology is agreed upon for every single possible use case, it is hard to see how we can progress toward a digital nirvana. Efforts are underway to introduce a standardized language (RadLex); however, uptake by practicing radiologists has been slow and rather patchy. I don't know what the answer is to this, but I know the problem is language!
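To make the second and third points concrete, here is a minimal sketch of how multiple expert reads of the same image might be mapped onto a controlled vocabulary and merged into a single consensus label. The label names, synonym map, and voting threshold are hypothetical choices made for illustration, not an established standard or any particular company's pipeline:

```python
from collections import Counter

# Hypothetical mapping of free-text findings onto a small controlled
# vocabulary. A real mapping would be far larger and would ideally be
# built on an existing lexicon such as RadLex rather than invented ad hoc.
SYNONYMS = {
    "consolidation": "airspace_opacity",
    "pneumonia": "airspace_opacity",
    "air space opacification": "airspace_opacity",
    "clear lungs": "normal",
    "no acute abnormality": "normal",
}

def normalize(finding: str) -> str:
    """Map one free-text finding onto the controlled vocabulary."""
    return SYNONYMS.get(finding.strip().lower(), "unmapped")

def consensus_label(reads: list[str], threshold: float = 0.5) -> str:
    """Majority-vote consensus across several expert reads of one image."""
    votes = Counter(normalize(r) for r in reads)
    label, count = votes.most_common(1)[0]
    # If no normalized term wins a clear majority, flag the image for
    # arbitration rather than pushing a noisy label into the training set.
    return label if count / len(reads) > threshold else "needs_arbitration"

# Three hypothetical expert reads of the same chest x-ray:
print(consensus_label(["Consolidation", "pneumonia", "air space opacification"]))
# -> airspace_opacity
print(consensus_label(["Clear lungs", "consolidation", "no acute abnormality"]))
# -> normal
```

The point is not the code but the workflow it implies: several trained readers per image, disagreement surfaced for arbitration rather than buried, and a shared vocabulary agreed upon before a single label reaches a training set.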
Until we have done all of this, the only really useful role for AI in chest radiography is, at best, to provide triage support -- tell us what is normal and what is not, and highlight where it could possibly be abnormal. Just don't try to claim that AI can definitively tell us what the abnormality is, because it can't do so any more accurately than we can -- the data are dirty because we made them so.
For now, let's leave the fuzzy thinking and creative interpretation up to us humans, separate the "art" of medicine from "artificial intelligence," and start focusing on producing oodles of clean data.
Dr. Hugh Harvey is the clinical lead at the deep-learning company Kheiron Medical and a U.K. Royal College of Radiologists informatics committee member. He was formerly a consultant radiologist at Guy's and St. Thomas' Hospital in London and head of regulatory affairs at Babylon Health.
The comments and observations expressed herein do not necessarily reflect the opinions of AuntMinnieEurope.com, nor should they be construed as an endorsement or admonishment of any particular vendor, analyst, industry consultant, or consulting group.