Published April 26, 2018 | Version v1
Journal article Open

RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning

  • 1. Princeton University
  • 2. King Abdullah University of Science and Technology
  • 3. University of Chicago

Description

Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, as race and ethnicity are powerful confounders for many health exposures and treatment outcomes; race and ethnicity are closely linked to population-specific genetic variation. We showed that deep neural networks generate more accurate estimates for missing racial and ethnic information than competing methods (e.g., logistic regression, random forest, support vector machines, and gradient-boosted decision trees). RIDDLE yielded significantly better classification performance across all metrics that were considered: accuracy, cross-entropy loss (error), precision, recall, and area under the curve for receiver operating characteristic plots (all p < 10⁻⁹). We made specific efforts to interpret the trained neural network models to identify, quantify, and visualize medical features that are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1) a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2) uneven accessibility and subjective importance of prophylactic health, (3) possible variation in lifestyle, such as dietary habits, and (4) differences in background genetic variation which predispose to diseases.
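
Two of the evaluation metrics named in the description, accuracy and cross-entropy loss, can be illustrated on toy data. The following is a minimal sketch and not the RIDDLE code: the function names, class indices, and predicted probabilities are all hypothetical and chosen only to show how each metric is computed for a multiclass classifier.

```python
import math

def accuracy(y_true, probs):
    """Fraction of samples whose highest-probability class matches the label."""
    correct = 0
    for label, p in zip(y_true, probs):
        predicted = max(range(len(p)), key=lambda k: p[k])
        correct += int(predicted == label)
    return correct / len(y_true)

def cross_entropy(y_true, probs, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, p in zip(y_true, probs):
        # Clamp to eps so a zero probability does not raise a math domain error.
        total += -math.log(max(p[label], eps))
    return total / len(y_true)

# Hypothetical toy data: 4 samples, 3 classes (indices 0..2).
y_true = [0, 2, 1, 1]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],  # misclassified: the true class is 1
    [0.2, 0.7, 0.1],
]

acc = accuracy(y_true, probs)
loss = cross_entropy(y_true, probs)
print(f"accuracy={acc:.2f} cross_entropy={loss:.4f}")
```

Accuracy only counts whether the top-ranked class is correct, whereas cross-entropy also penalizes low confidence in the true class, which is one reason the two are commonly reported together.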

Data availability

The data comprise millions of de-identified patient clinical records that cannot be deposited publicly and cannot be shared without special agreement with Columbia University and the University of Chicago. Data are available from third parties: to access the University of Chicago data, please visit the Center for Research Informatics, http://cri.uchicago.edu; at Columbia University, data can be accessed through the Electronic Medical Records and Genomics (eMERGE) network, http://emerge.cumc.columbia.edu.

Files

journal.pcbi.1006106.pdf

Files (5.8 MB)

Name Size
Article
md5:99db8829434fad6130b3f6a9a4f5de27
4.8 MB
md5:8bb1530283600a37ef75d6fb2a6b8d80
980.1 kB

Additional details

Identifiers

DOI
10.1371/journal.pcbi.1006106
Other
oai:uchicago.tind.io:6582

Funding

Defense Advanced Research Projects Agency
W911NF1410333
National Heart, Lung, and Blood Institute
R01HL122712
National Institute of Mental Health
P50 MH094267
King Abdullah University of Science and Technology
FCC/1/1976-04
King Abdullah University of Science and Technology
URF/1/3007-01
King Abdullah University of Science and Technology
URF/1/3450-01
King Abdullah University of Science and Technology
URF/1/3454-01
Liz and Kent Dauten

UChicago Information

Division(s)
Biological Sciences Division
Department(s)
Human Genetics, Medicine
Center(s) or Institute(s)
Institute for Genomics and Systems Biology