Published December 19, 2023 | Version v1
Journal article Open

Statistical prediction of microbial metabolic traits from genomes

  • 1. University of Chicago

Description

The metabolic activity of microbial communities is central to their role in biogeochemical cycles, human health, and biotechnology. Despite the abundance of sequencing data characterizing these consortia, it remains a serious challenge to predict microbial metabolic traits from sequencing data alone. Here we culture 96 bacterial isolates individually and assay their ability to grow on 10 distinct compounds as a sole carbon source. Using these data as well as two existing datasets, we show that statistical approaches can accurately predict bacterial carbon utilization traits from genomes. First, we show that classifiers trained on gene content can accurately predict bacterial carbon utilization phenotypes by encoding phylogenetic information. These models substantially outperform predictions made by constraint-based metabolic models automatically constructed from genomes. This result reinforces the known strong connection between phylogeny and metabolic traits. However, phylogeny-based predictions fail for taxa that are phylogenetically distant from any strains in the training set. To overcome this limitation, we train improved models that predict carbon utilization traits from gene presence/absence. We show that these models can generalize to taxa that are phylogenetically distant from the training set, either by exploiting biochemical information for feature selection or by training on sufficiently large datasets. In the latter case, we provide evidence that a statistical approach can identify putatively mechanistic genes involved in metabolic traits. Our study demonstrates the potential of statistical approaches for predicting microbial phenotypes from genotypes.
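The idea of testing whether a gene presence/absence model generalizes to phylogenetically distant taxa can be illustrated with a small sketch. This is not the authors' pipeline: it swaps in a plain nearest-neighbor classifier and a leave-one-clade-out split, and the strain records, clade labels, and function names are all hypothetical toy values chosen so that growth depends only on the first gene.

```python
from collections import defaultdict

def hamming(a, b, features):
    """Count the selected genes whose presence/absence differs between two genomes."""
    return sum(a[i] != b[i] for i in features)

def predict_nearest(train, query, features):
    """Predict a trait from the training genome with the most similar gene content."""
    _, trait = min(train, key=lambda item: hamming(item[0], query, features))
    return trait

def leave_one_clade_out(strains, features=None):
    """Hold out each clade in turn, train on the others, and return accuracy per clade."""
    if features is None:
        # Default: use every gene in the presence/absence vector.
        features = range(len(strains[0]["genes"]))
    by_clade = defaultdict(list)
    for strain in strains:
        by_clade[strain["clade"]].append(strain)
    accuracy = {}
    for clade, held_out in by_clade.items():
        train = [(s["genes"], s["grows"]) for s in strains if s["clade"] != clade]
        correct = sum(
            predict_nearest(train, s["genes"], features) == s["grows"]
            for s in held_out
        )
        accuracy[clade] = correct / len(held_out)
    return accuracy

# Hypothetical toy data: growth on the carbon source depends only on gene 0.
STRAINS = [
    {"clade": "A", "genes": (1, 1, 0, 0), "grows": True},
    {"clade": "A", "genes": (1, 1, 0, 1), "grows": True},
    {"clade": "B", "genes": (0, 0, 1, 1), "grows": False},
    {"clade": "B", "genes": (0, 1, 1, 0), "grows": False},
    {"clade": "C", "genes": (1, 0, 1, 1), "grows": True},
    {"clade": "C", "genes": (0, 0, 0, 1), "grows": False},
]

if __name__ == "__main__":
    print("all genes:", leave_one_clade_out(STRAINS))
    print("gene 0 only:", leave_one_clade_out(STRAINS, features=[0]))
```

Comparing the two calls shows the role of feature selection in this toy setting: restricting the classifier to the one gene that determines growth lets it predict held-out clades correctly, whereas using all genes lets overall gene-content similarity, which tracks clade membership, influence the prediction.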

Data availability

Raw sequencing data and genome assemblies are available from the original source of each study: NCBI BioProject PRJNA660495 and PRJNA513156 for genomes from Gowda et al. (2022); PRJNA540276 for genomes from Muscarella et al. (2019); PRJNA940744 for genomes from Prabhakara et al. (2023), which were sequenced in this study; and Gralka et al. (2023) for the genomes from that study. The analysis data are publicly available on Open Science Framework (https://doi.org/10.17605/OSF.IO/JWKR7). The bioinformatic pipeline and all data analysis code are available at https://github.com/zeqianli/CarbonUtilization.

Files

journal.pcbi.1011705.pdf

Files (23.0 MB)

Name Size
md5:d4c5e006984e4db9869b17b7777b3ea9
19.8 MB
Article
md5:280fe4bcbe65fbc5072a0be796ac6369
3.2 MB
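The MD5 checksums listed above can be used to confirm that a downloaded file arrived intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names and the local file path in the usage note are hypothetical, and which checksum belongs to which file should be read from the record itself.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_listed_md5(path, listed):
    """Compare a file's MD5 with a listed value such as 'md5:<hex digest>'."""
    # Accept the value with or without the 'md5:' prefix used in the file list.
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_md5("downloaded_article.pdf", "md5:280fe4bcbe65fbc5072a0be796ac6369")` would return `True` only if the hypothetical local file `downloaded_article.pdf` has exactly that checksum.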

Additional details

Identifiers

DOI
10.1371/journal.pcbi.1011705
Other
oai:uchicago.tind.io:10183

Funding

National Science Foundation
Biology Directorate
National Institutes of Health
1R01GM151538
National Science Foundation
Center for Living Systems

UChicago Information

Division(s)
Biological Sciences Division, Physical Sciences Division
Department(s)
Biophysical Sciences, Ecology and Evolution, Physics
Center(s) or Institute(s)
Center for the Physics of Evolving Systems