Published August 13, 2015 | Version v1
Journal article Open

ExScalibur: A High-Performance Cloud-Enabled Suite for Whole Exome Germline and Somatic Mutation Identification

Description

Whole exome sequencing has facilitated the discovery of causal genetic variants associated with human diseases at deep coverage and low cost. In particular, the detection of somatic mutations from tumor/normal pairs has provided insights into the cancer genome. Although there is an abundance of publicly available software for the detection of germline and somatic variants, concordance is generally limited among variant callers and alignment algorithms. Successful integration of variants detected by multiple methods requires in-depth knowledge of the software, access to high-performance computing resources, and advanced programming techniques. We present ExScalibur, a set of fully automated, highly scalable and modular pipelines for whole exome data analysis. The suite integrates multiple alignment and variant calling algorithms for the accurate detection of germline and somatic mutations with close to 99% sensitivity and specificity. ExScalibur implements streamlined execution of analytical modules, real-time monitoring of pipeline progress, robust handling of errors and intuitive documentation that allows for increased reproducibility and sharing of results and workflows. It runs on local computers, high-performance computing clusters and cloud environments. In addition, we provide a data analysis report utility to facilitate visualization of the results that offers interactive exploration of quality control files, read alignment and variant calls, assisting downstream customization of potential disease-causing mutations. ExScalibur is open-source and is also available as a public image on the Amazon cloud.

Data availability

The NA12878 trio datasets used in this study are publicly available from the Sequence Read Archive (SRA) database (accession numbers SRX079575, SRX079576, SRX079577). The AML datasets are available from The Cancer Genome Atlas (TCGA) for researchers who meet the criteria for access to the protected data. To submit an application, please follow the TCGA controlled-access data application process (URL: https://wiki.nci.nih.gov/display/TCGA/Application+Process). Once approved, researchers may use the TCGA sample IDs (provided in supplementary Table S16) to retrieve the AML datasets from The Cancer Genomics Hub (CGHub). Sample IDs are listed as "tumor, normal" pairs. The ExScalibur pipeline is available from GitHub (https://github.com/cribioinfo). We have also developed a website that hosts general information as well as instructions, tutorials and release notes for ExScalibur. It is publicly accessible at http://exscalibur.cri.uchicago.edu.

Files

journal.pone.0135800.pdf

Files (1.8 MB)

Article (996.3 kB)
md5:ac28bc9a83e3b7bb2bfa4ba603b084fc
Supporting information (836.9 kB)
md5:35e0bae84c440a1c19f149603fcd8415
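
The md5 values listed for the Article and Supporting information files can be used to check that a download arrived intact. The following is a minimal Python sketch of one way to do that, not a tool provided by this record or by ExScalibur. The function names `md5_of_file` and `matches_record` are hypothetical, and any local file path passed to them is an assumption about where the download was saved.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing on this record.
EXPECTED_MD5 = {
    "Article": "ac28bc9a83e3b7bb2bfa4ba603b084fc",
    "Supporting information": "35e0bae84c440a1c19f149603fcd8415",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, label):
    """Compare a local file (hypothetical path) against the listed checksum."""
    return md5_of_file(path) == EXPECTED_MD5[label]
```

For example, `matches_record("journal.pone.0135800.pdf", "Article")` would return `True` only if a file saved under that name in the working directory has the md5 listed above; a mismatch suggests an incomplete or altered download.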

Additional details

Identifiers

DOI
10.1371/journal.pone.0135800
Other
oai:uchicago.tind.io:7700

Funding

National Institutes of Health
UL1 RR024999

UChicago Information

Division(s)
Biological Sciences Division
Department(s)
Pediatrics
Center(s) or Institute(s)
Center for Research Informatics