Sequence deeper without sequencing more: Bayesian resolution of ambiguously mapped reads

Shah, Rohan N.; Ruthenburg, Alexander J.

doi:10.6082/h249x-dp484

Published April 19, 2021 | Version v1

Journal article Open

1. University of Chicago

Next-generation sequencing (NGS) has transformed molecular biology and contributed to many seminal insights into genomic regulation and function. Apart from whole-genome sequencing, an NGS workflow involves alignment of the sequencing reads to the genome of study, after which the resulting alignments can be used for downstream analyses. However, alignment is complicated by repetitive sequences: many reads align to more than one genomic locus, and 15–30% of the genome is not uniquely mappable by short-read NGS. This problem is typically addressed by discarding reads that do not map uniquely to the genome, but this practice can lead to systematic distortion of the data. Previous methods for handling ambiguously mapped reads were often of limited applicability or were computationally intensive, hindering their broader usage. In this work, we present SmartMap: an algorithm that augments industry-standard aligners to enable the use of ambiguously mapped reads by assigning weights to each alignment with Bayesian analysis of the read distribution and alignment quality. SmartMap is computationally efficient: it requires far fewer weighting iterations than previously thought necessary to process alignments and, as such, can analyze more than a billion alignments of NGS reads in approximately one hour on a desktop PC. By applying SmartMap to peak-type NGS data, including MNase-seq, ChIP-seq, and ATAC-seq in three organisms, we can increase read depth by up to 53% and increase the mapped proportion of the genome by up to 18% compared to analyses utilizing only uniquely mapped reads. We further show that SmartMap enables the analysis of more than 140,000 repetitive elements that could not be analyzed by traditional ChIP-seq workflows, and we utilize this method to gain insight into the epigenetic regulation of different classes of repetitive elements. These data emphasize both the dangers of discarding ambiguously mapped reads and their power for driving biological discovery.
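
To make the idea of weighting alignments by read distribution and alignment quality more concrete, the following is a minimal toy sketch in Python. It is not the SmartMap implementation and makes several simplifying assumptions: loci are treated as discrete labels rather than genomic intervals, the `likelihood` value is a hypothetical stand-in for alignment quality, and the function name `reweight` and its `iterations` default are chosen only for illustration.

```python
from collections import defaultdict


def reweight(reads, iterations=3):
    """Toy iterative reweighting (illustrative only, not SmartMap's code).

    reads: one list per read of (locus, likelihood) candidate alignments,
           where likelihood > 0 is a stand-in for alignment quality.
    Returns one list of weights per read; each list sums to 1.
    """
    # Initialize from alignment quality alone, normalized within each read.
    weights = []
    for candidates in reads:
        total = sum(lik for _, lik in candidates)
        weights.append([lik / total for _, lik in candidates])

    for _ in range(iterations):
        # Current read distribution: summed weight at each locus.
        coverage = defaultdict(float)
        for candidates, read_weights in zip(reads, weights):
            for (locus, _), w in zip(candidates, read_weights):
                coverage[locus] += w

        # Posterior for each alignment is proportional to likelihood x prior,
        # with the local coverage acting as the prior.
        updated = []
        for candidates in reads:
            posterior = [lik * coverage[locus] for locus, lik in candidates]
            norm = sum(posterior)
            updated.append([p / norm for p in posterior])
        weights = updated
    return weights


# Three reads unique to locus "A", one unique to "B", and one ambiguous read.
example = [[("A", 1.0)]] * 3 + [[("B", 1.0)], [("A", 1.0), ("B", 1.0)]]
print(reweight(example)[-1])  # the ambiguous read shifts toward locus A
```

In this toy example the uniquely mapped reads keep a weight of 1, while the ambiguous read is split between its two candidate loci in proportion to how well each is supported by the other reads.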

Data availability

Software written as part of this work is available for download at https://github.com/shah-rohan/SmartMap. The tools included in that repository are the SmartMapPrep script, the SmartMapRNAPrep script, and the SmartMap program. The SmartMapPrep software streamlines the alignment, filtering, and processing of reads to enable their use in the SmartMap software. The SmartMapRNAPrep software does the same for strand-specific applications. The SmartMap software conducts the iterative Bayesian reweighting algorithm described above and yields a gzipped BEDGRAPH file of the genome coverage of map weights. In addition, these tools are all available through Bioconda at http://bioconda.github.io/recipes/smartmap/README.html. Detailed instructions for installation and use are available at https://shah-rohan.github.io/SmartMap. All ICeChIP-seq and MNase-seq data are available at GEO under accession numbers GSE60378 and GSE103543. All RNA-seq and ATAC-seq data are available from the ENCODE Project at https://www.encodeproject.org/ under experiment accession numbers ENCSR000AEL and ENCSR483RKN, respectively. The simulated data and analysis workflow for both simulated and biological data are available on Zenodo at https://zenodo.org/record/4586639, with detailed instructions provided both in that Zenodo repository and on GitHub at https://shah-rohan.github.io/SmartMap/analysis.html. Simulated FASTQ files can be found on Zenodo at https://zenodo.org/record/4584103.
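
As an illustration of how a gzipped BEDGRAPH coverage file of this kind might be inspected, the sketch below sums the number of bases whose value exceeds a threshold. It assumes the standard four-column bedGraph layout (chromosome, 0-based start, end, value); the function name `covered_bases` and the `min_weight` parameter are hypothetical and not part of the SmartMap tools.

```python
import gzip


def covered_bases(path, min_weight=0.0):
    """Count bases in a gzipped bedGraph whose value exceeds min_weight."""
    total = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Skip blank lines and optional track/browser/comment lines.
            if not line.strip() or line.startswith(("track", "browser", "#")):
                continue
            chrom, start, end, value = line.split()[:4]
            if float(value) > min_weight:
                # bedGraph intervals are half-open, so length is end - start.
                total += int(end) - int(start)
    return total
```

A call such as `covered_bases("output.bedgraph.gz")`, where the file name is a placeholder, would return the number of bases with nonzero weighted coverage.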

Files

Files (14.1 MB)

Name	Type	Checksum	Size
journal.pcbi.1008926.pdf	Article	md5:fec790823c96a7fa76d8fa5b46b43a1b	4.1 MB
pcbi.1008926.s001_014.zip	Figures	md5:4daab35194e83a1c39df0c1cf60f0418	10.0 MB
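
A downloaded file can be checked against the md5 checksums listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper name `md5_of_file` and the chunk size are illustrative choices, and the local file path is assumed to be wherever the download was saved.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `md5_of_file("journal.pcbi.1008926.pdf") == "fec790823c96a7fa76d8fa5b46b43a1b"` should evaluate to `True` for an intact copy of the article PDF.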

Additional details

DOI: 10.1371/journal.pcbi.1008926
Other: oai:uchicago.tind.io:6031

National Institutes of Health: R01-GM115945-05
National Institutes of Health: R01-HL148719
National Institutes of Health: T32-HD007009-45

Division(s): Biological Sciences Division, Pritzker School of Medicine
Department(s): Molecular Genetics and Cell Biology
Center(s) or Institute(s): Becker Friedman Institute for Economics

