Published December 19, 2023 | Version v1 | Journal article | Open
Statistical prediction of microbial metabolic traits from genomes
Description
The metabolic activity of microbial communities is central to their role in biogeochemical cycles, human health, and biotechnology. Despite the abundance of sequencing data characterizing these consortia, it remains a serious challenge to predict microbial metabolic traits from sequencing data alone. Here we culture 96 bacterial isolates individually and assay their ability to grow on 10 distinct compounds as a sole carbon source. Using these data as well as two existing datasets, we show that statistical approaches can accurately predict bacterial carbon utilization traits from genomes. First, we show that classifiers trained on gene content can accurately predict bacterial carbon utilization phenotypes by encoding phylogenetic information. These models substantially outperform predictions made by constraint-based metabolic models automatically constructed from genomes. This result solidifies our current knowledge about the strong connection between phylogeny and metabolic traits. However, phylogeny-based predictions fail for taxa that are phylogenetically distant from every strain in the training set. To overcome this, we train improved models that predict carbon utilization traits from gene presence/absence. We show that these models can generalize to taxa that are phylogenetically distant from the training set, either by exploiting biochemical information for feature selection or by having sufficiently large datasets. In the latter case, we provide evidence that a statistical approach can identify putatively mechanistic genes involved in metabolic traits. Our study demonstrates the potential of statistical approaches for predicting microbial phenotypes from genotypes.
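The evaluation idea in the description, testing whether a model trained on gene presence/absence still works on taxa far from the training set, can be sketched in a few lines. This is a minimal illustration only, not the method used in the study: the strain names, clade labels, gene identifiers, and trait values are invented, and a simple nearest-neighbor rule on Jaccard distance stands in for whatever classifier is actually trained. The helper names (`jaccard_distance`, `predict_nearest`, `leave_one_clade_out`) are hypothetical.

```python
from collections import defaultdict

# Toy data, invented for illustration only. Each strain maps to
# (clade label, set of gene families present, grows on the carbon source?).
STRAINS = {
    "s1": ("cladeA", {"g1", "g2", "g3"}, True),
    "s2": ("cladeA", {"g1", "g2"}, True),
    "s3": ("cladeB", {"g1", "g4", "g5"}, True),
    "s4": ("cladeB", {"g4", "g5"}, False),
    "s5": ("cladeC", {"g1", "g6"}, True),
    "s6": ("cladeC", {"g6", "g7"}, False),
}


def jaccard_distance(a, b):
    """Distance between two gene sets: 1 - |intersection| / |union|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def predict_nearest(genes, training):
    """Predict the trait of the training strain with the most similar gene content.

    `training` is a list of (gene set, trait) pairs. Ties go to the first
    strain encountered. This is a stand-in classifier, not the study's model.
    """
    _, trait = min(
        ((jaccard_distance(genes, g), t) for g, t in training),
        key=lambda pair: pair[0],
    )
    return trait


def leave_one_clade_out(strains):
    """Hold out each clade in turn, train on the rest, and report accuracy.

    Holding out whole clades (rather than random strains) mimics prediction
    for taxa that have no close relatives in the training set.
    """
    by_clade = defaultdict(list)
    for clade, genes, trait in strains.values():
        by_clade[clade].append((genes, trait))

    accuracy = {}
    for held_out, test_set in by_clade.items():
        training = [
            item
            for clade, items in by_clade.items()
            if clade != held_out
            for item in items
        ]
        correct = sum(predict_nearest(g, training) == t for g, t in test_set)
        accuracy[held_out] = correct / len(test_set)
    return accuracy


if __name__ == "__main__":
    for clade, acc in leave_one_clade_out(STRAINS).items():
        print(clade, acc)
```

Grouping the held-out strains by clade is the important design choice here: a random train/test split would leave close relatives of every test strain in the training set and could overstate how well the model generalizes.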
Data availability
Raw sequencing data and genome assemblies are available at the original source of each study (NCBI BioProject PRJNA660495 and PRJNA513156 for genomes from Gowda et al. (2022), PRJNA540276 for genomes from Muscarella et al. (2019), PRJNA940744 for genomes from Prabhakara et al. (2023) (sequenced in this study), and Gralka et al. (2023)). The analysis data are publicly available on Open Science Framework (https://doi.org/10.17605/OSF.IO/JWKR7). The bioinformatic pipeline and all data analysis code are available at https://github.com/zeqianli/CarbonUtilization.
Files
journal.pcbi.1011705.pdf
Total size: 23.0 MB
| Name | Size |
|---|---|
| md5:d4c5e006984e4db9869b17b7777b3ea9 | 19.8 MB |
| Article md5:280fe4bcbe65fbc5072a0be796ac6369 | 3.2 MB |
Additional details
Identifiers
- DOI: 10.1371/journal.pcbi.1011705
- Other: oai:uchicago.tind.io:10183
Funding
- National Science Foundation: Biology Directorate
- National Science Foundation: Biology Directorate
- National Institutes of Health: 1R01GM151538
- National Science Foundation: Center for Living Systems