Analyzing crowdsourced sets of data used to create AI algorithms from medical images, University of Maryland School of Medicine (UMSOM) researchers found that most did not include patient demographics. In the study published April 3 in Nature Medicine, the researchers also found that the algorithms were not evaluated for inherent biases. That means there is no way of knowing whether these images contain representative samples of the population, including Black, Asian, and Indigenous American patients.
According to the researchers, much of medicine in the U.S. is already fraught with partiality toward certain races, genders, ages, or sexual orientations. Small biases in individual sets of data could be amplified greatly when hundreds or thousands of these datasets are combined in these algorithms.
"These deep learning models can diagnose things physicians can’t see, such as when a person might die or detect Alzheimer's disease seven years earlier than our known tests - superhuman tasks," said senior investigator Paul Yi, MD, Assistant Professor of Diagnostic Radiology and Nuclear Medicine at UMSOM. He is also Director of University of Maryland Medical Intelligent Imaging (UM2ii) Center. "Because these AI machine learning techniques are so good at finding needles in a haystack, they can also define sex, gender, and age, which means these models can then use those features to make biased decisions."
Much of the data collected in large studies tends to come from people of means who have relatively easy access to healthcare. In the U.S., this means the data tends to be skewed toward men rather than women, and toward white people rather than people of other races. Because the U.S. tends to perform more imaging than the rest of the world, this data gets compiled into algorithms that have the potential to slant outcomes worldwide.
For the current study, the researchers chose to evaluate the datasets used in data science competitions in which computer scientists and physicians crowdsource data from around the world and try to develop the best, most accurate algorithm. These competitions tend to have leaderboards that rank each algorithm and provide a cash prize, motivating people to create the best one. Specifically, the researchers investigated medical imaging algorithms, such as those that evaluate CT scans to diagnose brain tumors or blood clots in the lungs. Of the 23 data competitions analyzed, 61 percent did not include demographic data such as age, sex, or race. None of the competitions had evaluations for biases against underrepresented or disadvantaged groups.
"We hope that by bringing awareness to this issue in these data competitions - and if applied in an appropriate way - that there is tremendous potential to solve these biases," said lead author Sean Garin, Program Coordinator at the UM2ii Center.
The study's authors also encourage future competitions to require not only high accuracy, but also fairness among different groups of people.
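One way such a fairness requirement might be scored is to report accuracy separately for each demographic group alongside the overall figure. The Python sketch below is illustrative only and is not the study authors' method: the function names `accuracy_by_group` and `fairness_gap`, the group labels "A" and "B", and the toy labels are all assumptions made up for the example.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return overall accuracy and the accuracy within each demographic group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    per_group = {g: correct[g] / total[g] for g in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_group


def fairness_gap(per_group):
    """Largest difference in accuracy between any two groups."""
    return max(per_group.values()) - min(per_group.values())


# Toy, made-up labels purely for illustration (not data from the study).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]  # hypothetical group labels

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
gap = fairness_gap(per_group)
print(overall, per_group, gap)
```

In this made-up example the overall accuracy is 0.75, yet group A scores 1.0 and group B only 0.5, a gap that a leaderboard ranking on overall accuracy alone would hide.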
"As AI models become more prevalent in medical imaging and other fields of medicine, it is important to identify and address potential biases that may exacerbate existing health inequities in clinical care - an essential priority for every academic medical institution," said UMSOM Dean Mark T. Gladwin, MD, Vice President for Medical Affairs, University of Maryland, Baltimore, and the John Z. and Akiko K. Bowers Distinguished Professor.
Garin, S.P., Parekh, V.S., Sulam, J. et al. Medical imaging data science competitions should report dataset demographics and evaluate for bias. Nat Med, 2023. doi: 10.1038/s41591-023-02264-0