Published October 19, 2020 | Version v1
Journal article Open

Predicting antimicrobial resistance using conserved genes

  • 1. University of Chicago
  • 2. Fellowship for Interpretation of Genomes

Description

A growing number of studies are using machine learning models to accurately predict antimicrobial resistance (AMR) phenotypes from bacterial sequence data. Although these studies are showing promise, the models are typically trained using features derived from comprehensive sets of AMR genes or whole genome sequences and may not be suitable for use when genomes are incomplete. In this study, we explore the possibility of predicting AMR phenotypes using incomplete genome sequence data. Models were built from small sets of randomly selected core genes after removing the AMR genes. For Klebsiella pneumoniae, Mycobacterium tuberculosis, Salmonella enterica, and Staphylococcus aureus, we report that susceptible and resistant phenotypes can be classified with as few as 100 conserved non-AMR genes, with average F1 scores from 0.80 to 0.89, very major error rates from 0.11 to 0.23, and major error rates from 0.10 to 0.20. Models built from core genes have predictive power in cases where the primary AMR mechanisms result from SNPs or horizontal gene transfer. By randomly sampling non-overlapping sets of core genes, we show that F1 scores and error rates are stable and have little variance between replicates. Although these small core gene models have lower accuracies and higher error rates than models built from the corresponding assembled genomes, the results suggest that sufficient variation exists in the core non-AMR genes of a species for predicting AMR phenotypes.
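The sampling and scoring steps named in the description can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `disjoint_gene_sets` and `error_metrics` are hypothetical, the "R"/"S" label encoding is an assumption, and the metric definitions follow the conventional usage in susceptibility testing (a very major error is a resistant isolate predicted susceptible, counted against all resistant isolates; a major error is a susceptible isolate predicted resistant, counted against all susceptible isolates). The F1 score here treats the resistant class as positive, which may differ from the averaging used in the article.

```python
import random


def disjoint_gene_sets(core_genes, set_size, n_sets, seed=0):
    """Draw n_sets random, non-overlapping subsets of set_size genes each.

    Hypothetical helper: shuffles the gene list once and slices it, so no
    gene can appear in more than one subset.
    """
    if set_size * n_sets > len(core_genes):
        raise ValueError("not enough core genes for the requested disjoint sets")
    rng = random.Random(seed)
    shuffled = list(core_genes)
    rng.shuffle(shuffled)
    return [shuffled[i * set_size:(i + 1) * set_size] for i in range(n_sets)]


def error_metrics(y_true, y_pred):
    """Compute F1, very major error rate, and major error rate.

    Assumes labels are "R" (resistant) or "S" (susceptible), with the
    resistant class treated as positive.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == "R" and p == "R")
    fn = sum(1 for t, p in pairs if t == "R" and p == "S")  # very major errors
    fp = sum(1 for t, p in pairs if t == "S" and p == "R")  # major errors
    tn = sum(1 for t, p in pairs if t == "S" and p == "S")

    f1_denominator = 2 * tp + fp + fn
    return {
        "f1": 2 * tp / f1_denominator if f1_denominator else 0.0,
        "very_major_error_rate": fn / (tp + fn) if (tp + fn) else 0.0,
        "major_error_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


if __name__ == "__main__":
    # Placeholder gene identifiers, not real gene IDs from the study.
    genes = [f"gene_{i}" for i in range(1000)]
    replicates = disjoint_gene_sets(genes, set_size=100, n_sets=10)
    print(len(replicates), "disjoint sets of", len(replicates[0]), "genes")

    # Toy labels for illustration only.
    truth = ["R", "R", "R", "R", "S", "S", "S", "S"]
    predicted = ["R", "R", "R", "S", "S", "S", "S", "R"]
    print(error_metrics(truth, predicted))
```

Shuffling once and slicing guarantees that the replicate gene sets share no genes, so agreement between replicates cannot be explained by a small number of genes recurring across models.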

Data availability

All genome sequence data are available from NCBI and PATRIC. Genome IDs, gene IDs, and PubMed IDs that link to these data sources are provided, where appropriate, in the supplemental tables and main text. The underlying data, genes, and models for the alignment-based models described in this study are available at the GitHub repository for this project: https://github.com/jimdavis1/Core-Gene-AMR-Models. The k-mer-based models are too large to host in this way, but the main text shows that results from the alignment-based and k-mer-based models are equivalent.

Files

journal.pcbi.1008319.pdf

Files (15.8 MB)

Name Size
Article (md5:2290f05b9a5c05e908271c5f17263a26) 3.0 MB
Figures (md5:ac5a76c433f76c9e172f0aabd13bc026) 11.6 MB
(md5:9483c37eb045484d2faab04207733b81) 1.2 MB

Additional details

Identifiers

DOI
10.1371/journal.pcbi.1008319
Other
oai:uchicago.tind.io:6205

Funding

United States Defense Advanced Research Projects Agency
iSENTRY Friend or Foe program award
United States National Institute of Allergy and Infectious Diseases
Bacterial and Viral Bioinformatics Resource Center award

UChicago Information

Division(s)
Biological Sciences Division, Physical Sciences Division
Department(s)
Biological Sciences, Computer Science