Major flaws found in machine learning for COVID-19 diagnosis

A coalition of AI researchers and health care professionals in fields like infectious disease, radiology, and ontology has found several common but serious shortcomings in machine learning models built for COVID-19 diagnosis or prognosis.

After the start of the global pandemic, startups like DarwinAI, major companies like Nvidia, and groups like the American College of Radiology launched initiatives to detect COVID-19 from CT scans, X-rays, or other forms of medical imaging. The promise of such technology is that it could help health care professionals distinguish between pneumonia and COVID-19 or provide more options for patient diagnosis. Some models have even been developed to predict if a person will die or need a ventilator based on a CT scan. However, researchers say major changes are needed before this form of machine learning can be used in a clinical setting.

Researchers assessed more than 2,200 papers and, after removing duplicates and irrelevant titles, narrowed the results down to 320 papers that underwent a full-text review for quality. Finally, 62 papers were deemed fit to be part of what the authors refer to as a systematic review of published research and preprints shared on open research paper repositories like arXiv, bioRxiv, and medRxiv.

Of those 62 papers included in the analysis, roughly half made no attempt to perform external validation of training data, did not assess model sensitivity or robustness, and did not report the demographics of people represented in training data.

“Frankenstein” datasets, the kind made with duplicate images obtained from other datasets, were also found to be a common problem, and only one in five COVID-19 diagnosis or prognosis models shared their code so others can reproduce results claimed in literature.
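To illustrate how duplicated images across merged datasets might be flagged, here is a minimal sketch that hashes raw file contents and reports any image appearing in more than one source. It is not the method used in the review; the function names and the in-memory dataset layout (dataset name mapped to filename mapped to bytes) are hypothetical, chosen only for illustration.

```python
import hashlib
from collections import defaultdict


def content_hash(image_bytes: bytes) -> str:
    """Return a SHA-256 digest of the raw image bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


def find_cross_dataset_duplicates(datasets):
    """Find byte-identical images that appear in more than one dataset.

    `datasets` is a hypothetical layout: {dataset_name: {filename: bytes}}.
    Returns a list of groups; each group is a sorted list of
    (dataset_name, filename) pairs sharing identical content.
    """
    locations = defaultdict(set)
    for dataset_name, files in datasets.items():
        for filename, data in files.items():
            locations[content_hash(data)].add((dataset_name, filename))

    groups = []
    for places in locations.values():
        # Keep only content that shows up in at least two different datasets.
        if len({name for name, _ in places}) > 1:
            groups.append(sorted(places))
    return sorted(groups)
```

Exact hashing of this kind only catches byte-identical copies. An image that has been resized, cropped, or re-encoded would slip through, so a real audit would likely also need some form of perceptual or near-duplicate matching.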

“In their current reported form, none of the machine learning models included in this review are likely candidates for clinical translation for the diagnosis/prognosis of COVID-19,” the paper reads. “Despite the huge efforts of researchers to develop machine learning models for COVID-19 diagnosis and prognosis, we found methodological flaws and many biases throughout the literature, leading to highly optimistic reported performance.”

The research was published last week as part of the March issue of Nature Machine Intelligence by researchers from the University of Cambridge and University of Manchester. Other common issues they found with machine learning models developed using medical imaging data were a near-total lack of bias assessment and training on too few images. Nearly every paper reviewed was found to be at high or uncertain risk of bias; only six were considered at low risk of bias.

Publicly available datasets also commonly suffered from lower quality image formats and weren’t large enough to train reliable AI models. Researchers used the checklist for artificial intelligence in medical imaging (CLAIM) and radiomics quality score (RQS) to help assess the datasets and models.

“The urgency of the pandemic led to many studies using datasets that contain obvious biases or are not representative of the target population, for example, pediatric patients. Before evaluating a model, it is crucial that authors report the demographic statistics for their datasets, including age and sex distributions,” the paper reads. “Higher-quality datasets, manuscripts with sufficient documentation to be reproducible and external validation are required to increase the likelihood of models being taken forward and integrated into future clinical trials to establish independent technical and clinical validation as well as cost-effectiveness.”

Other recommendations from the group of AI researchers and health care professionals include ensuring that the model performance results reported in research papers are reproducible and considering how datasets are assembled.

In other news at the intersection of COVID-19 and machine learning, earlier this week the Food and Drug Administration (FDA) issued an emergency use authorization for a machine learning-based screening device, which the agency says is the first authorized in the U.S.

Khari Johnson