Published July 14, 2024 | Version v1
Journal article Open

nQuack: An R package for predicting ploidal level from sequence data using site-based heterozygosity

  • 1. University of Florida
  • 2. Cornell University
  • 3. University of Chicago
  • 4. College of Idaho

Description

Premise: Traditional methods of ploidal-level estimation are tedious; using DNA sequence data for cytotype estimation is an ideal alternative. Multiple statistical approaches have been developed that use site-based heterozygosity in sequence data to infer ploidy. However, these approaches may require high-coverage sequence data, use inappropriate probability distributions, or have other statistical shortcomings that limit inference. We introduce nQuack, an open-source R package that addresses the main shortcomings of current methods.

Methods and Results: nQuack performs model selection for improved ploidy predictions. Here, we implement expectation-maximization algorithms with normal, beta, and beta-binomial distributions. Using extensive computer simulations that account for variability in sequencing depth, as well as real data sets, we demonstrate the utility and limitations of nQuack.
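
To make the general idea concrete, the following is a minimal sketch of expectation maximization with a normal mixture followed by model selection. It is not nQuack's implementation or API: the function names, the fixed component means (1/2 for diploid, 1/3 and 2/3 for triploid, 1/4, 1/2, and 3/4 for tetraploid), the single shared standard deviation, and the use of BIC are all assumptions made for illustration, and the sketch is written in Python rather than R. The beta and beta-binomial variants named above are not shown.

```python
import math
import random

# Expected allele-frequency peaks at biallelic heterozygous sites.
# These fixed means are an illustrative assumption, not nQuack's parameterization.
EXPECTED_MEANS = {
    "diploid": [1 / 2],
    "triploid": [1 / 3, 2 / 3],
    "tetraploid": [1 / 4, 1 / 2, 3 / 4],
}


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def log_likelihood(freqs, means, weights, sd):
    total = 0.0
    for x in freqs:
        dens = sum(w * normal_pdf(x, m, sd) for w, m in zip(weights, means))
        total += math.log(max(dens, 1e-300))  # guard against underflow
    return total


def fit_mixture(freqs, means, n_iter=200, tol=1e-9):
    """EM for a normal mixture with fixed means; fits weights and one shared sd."""
    k, n = len(means), len(freqs)
    weights, sd = [1.0 / k] * k, 0.1
    prev = cur = log_likelihood(freqs, means, weights, sd)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each site.
        resp = []
        for x in freqs:
            dens = [w * normal_pdf(x, m, sd) for w, m in zip(weights, means)]
            s = max(sum(dens), 1e-300)
            resp.append([d / s for d in dens])
        # M-step: update mixture weights and the shared standard deviation.
        weights = [sum(r[j] for r in resp) / n for j in range(k)]
        var = sum(
            r[j] * (x - means[j]) ** 2 for x, r in zip(freqs, resp) for j in range(k)
        ) / n
        sd = max(math.sqrt(var), 1e-3)
        cur = log_likelihood(freqs, means, weights, sd)
        if abs(cur - prev) < tol:
            break
        prev = cur
    return cur, weights, sd


def select_ploidy(freqs):
    """Fit each candidate ploidy and return the one with the lowest BIC."""
    n = len(freqs)
    bic = {}
    for name, means in EXPECTED_MEANS.items():
        ll, _, _ = fit_mixture(freqs, means)
        n_params = len(means)  # (k - 1) free weights + 1 shared sd
        bic[name] = n_params * math.log(n) - 2.0 * ll
    return min(bic, key=bic.get), bic


def simulate(means, n_sites, sd, rng):
    """Draw hypothetical per-site allele frequencies around the given peaks."""
    return [rng.gauss(rng.choice(means), sd) for _ in range(n_sites)]


rng = random.Random(42)
sim = simulate(EXPECTED_MEANS["tetraploid"], 600, 0.03, rng)
best, bic = select_ploidy(sim)
print(best, {name: round(value, 1) for name, value in bic.items()})
```

The simulated input here is clean and evenly mixed, which real sequence data rarely are; as the conclusions note, inference from site-based heterozygosity alone is difficult, so this sketch should be read as an illustration of the technique rather than as a usable ploidy caller.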

Conclusions: Inferring ploidy based on site-based heterozygosity alone is difficult. Even though nQuack is more accurate than similar methods, we suggest caution when relying on any site-based heterozygosity method to infer ploidy.

Data availability

The R package nQuack is available at https://github.com/mgaynor1/nQuack and https://mlgaynor.com/nQuack/. A full implementation tutorial (https://mlgaynor.com/nQuack/articles/BasicExample.html), as well as detailed tutorials on data preparation (https://mlgaynor.com/nQuack/articles/DataPreparation.html) and model inference (https://mlgaynor.com/nQuack/articles/ModelOptions.html), are available with the package documentation. For three sample sets, reference genomes and population genetics data are available via open repositories (see Appendix S3 and S4 for accessions). Sequence data for Galax urceolata and Larrea tridentata will be published in open repositories with future publications. An exemplar data set and processing times required for every step of model implementation (1.46–2.09 s for models with the normal distribution; 6.41–23.16 min for models with the beta distribution; 9.54–46.15 min for models with the beta-binomial distribution), as well as the output of each step of our method, are available on our GitHub (https://mlgaynor.com/nQuack/articles/BasicExample.html).

Files

nQuack.pdf

Files (4.1 MB)

  • Article (932.4 kB), md5:a64a84bc0c8bba5c58b5477c70e05087
  • Supporting information files (3.1 MB), md5:e3f2d806af3731d37748a49da721a21c

Additional details

Identifiers

  • DOI: 10.1002/aps3.11606
  • Other: oai:uchicago.tind.io:12832

Funding

  • National Science Foundation: Graduate Research Fellowship
  • National Science Foundation: Small Grant
  • National Science Foundation: Plant Genome Fellowship
  • National Institute of Food and Agriculture, U.S. Department of Agriculture: Hatch award

UChicago Information

  • Division(s): Biological Sciences Division
  • Department(s): Ecology and Evolution