Using pseudoalignment and base quality to accurately quantify microbial community composition

Reppell, Mark; Novembre, John

doi:10.6082/j9gzd-sfg32

Published April 16, 2018 | Version v1

Journal article Open


1. University of Chicago

Pooled DNA from multiple unknown organisms arises in a variety of contexts, for example microbial samples from ecological or human health research. Determining the composition of pooled samples can be difficult, especially at the scale of modern sequencing data and reference databases. Here we propose a novel method for taxonomic profiling in pooled DNA that combines the speed and low memory requirements of k-mer-based pseudoalignment with a likelihood framework that uses base quality information to better resolve multiply mapped reads. We apply the method to the problem of classifying 16S rRNA reads using a reference database of known organisms, a common challenge in microbiome research. Using simulations, we show the method is accurate across a variety of read lengths, with reference sequences of different lengths, at different sample depths, and when samples contain reads originating from organisms absent from the reference. We also assess performance in real 16S data, where we reanalyze previous genetic association data to show our method discovers a larger number of quantitative trait associations than other widely used methods. We implement our method in the software Karp, for k-mer-based analysis of read pools, to provide a novel combination of speed and accuracy that is uniquely suited for enhancing discoveries in microbial studies.
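The idea of using base quality within a likelihood to resolve multiply mapped reads can be illustrated with a minimal sketch. This is not Karp's implementation: the function names, the uniform e/3 mismatch model, the toy sequences, and the generic expectation-maximization (EM) update for mixture proportions are all assumptions made for illustration. Each read is assumed to have already been narrowed to a few candidate references, the step that pseudoalignment would perform.

```python
def phred_to_error(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def read_likelihood(read, quals, ref_segment):
    """P(read | reference segment), assuming independent base errors.

    Assumed error model: a matching base contributes (1 - e) and a
    mismatching base contributes e / 3, where e comes from the base quality.
    """
    likelihood = 1.0
    for base, q, ref_base in zip(read, quals, ref_segment):
        e = phred_to_error(q)
        likelihood *= (1 - e) if base == ref_base else e / 3
    return likelihood


def em_abundances(likelihoods, n_refs, iters=200, tol=1e-10):
    """Estimate relative abundances with a generic EM for mixture proportions.

    likelihoods: one dict per read, mapping reference index -> likelihood
    for the candidate references of that read.
    """
    theta = [1.0 / n_refs] * n_refs
    for _ in range(iters):
        counts = [0.0] * n_refs
        for lk in likelihoods:
            total = sum(theta[j] * l for j, l in lk.items())
            if total == 0:
                continue  # read is uninformative under the current estimate
            for j, l in lk.items():
                counts[j] += theta[j] * l / total  # fractional assignment
        total_count = sum(counts)
        if total_count == 0:
            return theta
        new_theta = [c / total_count for c in counts]
        converged = max(abs(a - b) for a, b in zip(theta, new_theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta


# Hypothetical toy data: two references that differ only at the last base.
refs = ["ACGT", "ACGA"]
reads = [
    ("ACGT", [30, 30, 30, 30]),  # confident match to the first reference
    ("ACGT", [30, 30, 30, 2]),   # last base is low quality, so ambiguous
    ("ACGA", [30, 30, 30, 30]),  # confident match to the second reference
]
likelihoods = [
    {j: read_likelihood(seq, quals, ref) for j, ref in enumerate(refs)}
    for seq, quals in reads
]
theta = em_abundances(likelihoods, n_refs=len(refs))
print(theta)
```

In this sketch, a mismatch at a low-quality base is penalized far less than one at a high-quality base, so the second read contributes fractionally to both references rather than being assigned outright to one of them.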

Data availability

The Karp software is available from GitHub at https://github.com/mreppell/Karp. The software simreads, used to simulate sequencing reads for this project, is available as part of the Harp software package at https://bitbucket.org/dkessner/harp. The authors of this paper modified simreads to handle paired-end reads and references shorter than the read length; these modifications and installation instructions are available at https://github.com/mreppell/simreads_expansion. The real sequence data analyzed in this study were shared with us by the Ober Lab at the University of Chicago, who have deposited them in dbGaP as phs000185.v4.p1, Genetic Studies in the Hutterites, available at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000185.v4.p1.

Files


Files (4.3 MB)

Name	Checksum	Size
journal.pcbi.1006096.pdf (Article)	md5:944522eeeb43c032c1e68a50a1b1b40a	2.8 MB
pcbi.1006096.zip	md5:716e2e493f184d34b49a252f299f109a	1.5 MB

Additional details

DOI: 10.1371/journal.pcbi.1006096
Other: oai:uchicago.tind.io:6353

National Human Genome Research Institute
R01 HG007089

Division(s): Biological Sciences Division
Department(s): Human Genetics

