Published June 3, 2025 | Version v1
Journal article Open

Study design and the sampling of deleterious rare variants in biobank-scale datasets

  • 1. University of Chicago
  • 2. Massachusetts Institute of Technology
  • 3. Johns Hopkins University
  • 4. Icahn School of Medicine

Description

One key component of study design in population genetics is the "geographic breadth" of a sample (i.e., the size of the region across which individuals are sampled). How geographic breadth affects observations of rare, deleterious variants is unclear, even though such variants are of particular interest for biomedical and evolutionary applications. To gain insight into the effects of sample design on ascertained genetic variants, we formulate a stochastic model of dispersal, genetic drift, selection, mutation, and geographically concentrated sampling. We use this model to understand how the geographic breadth of sampling effort affects the discovery of negatively selected variants. We find that geographically broader samples discover a greater number of variants than geographically narrow samples (an effect we label "discovery"), but detect those variants at lower average frequency (e.g., as singletons; an effect we label "dilution"). Importantly, both effects are amplified for larger sample sizes and for larger fitness effects. We validate these results using both population genetic simulations and empirical analyses in the UK Biobank. Our results are particularly important in two contexts: the association of large-effect rare variants with particular phenotypes, and the inference of negative selection from allele frequency data. Overall, our findings emphasize the importance of considering geographic breadth when designing and carrying out genetic studies, especially at biobank scale.
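The "discovery" and "dilution" effects described above can be illustrated with a deliberately simplified toy simulation. The sketch below is not the authors' model: it omits mutation, selection, drift, and dispersal dynamics, and simply assumes that the carriers of each rare variant are clustered around a random point in a one-dimensional habitat. The function name `simulate` and every parameter value are hypothetical choices made for illustration only.

```python
import random


def simulate(n_pop=100_000, n_variants=2_000, carriers_per_variant=20,
             spread=500.0, sample_size=5_000, narrow_width=10_000, seed=0):
    """Toy comparison of geographically narrow vs. broad sampling.

    Individuals are indexed 0..n_pop-1 along a circular 1D habitat, so an
    index doubles as a location. All defaults are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Assumption: each rare variant is carried by a small cluster of
    # individuals scattered (Gaussian, in index units) around a random origin.
    variants = []
    for _ in range(n_variants):
        origin = rng.randrange(n_pop)
        carriers = {
            (origin + round(rng.gauss(0.0, spread))) % n_pop
            for _ in range(carriers_per_variant)
        }
        variants.append(carriers)

    def summarize(sample):
        # Number of sampled carriers for each variant.
        counts = [len(carriers & sample) for carriers in variants]
        found = [c for c in counts if c > 0]
        mean_count = sum(found) / len(found) if found else 0.0
        # (variants discovered, mean sample count among discovered variants)
        return len(found), mean_count

    # Narrow design: all sampled individuals come from one central window.
    start = (n_pop - narrow_width) // 2
    narrow = set(rng.sample(range(start, start + narrow_width), sample_size))
    # Broad design: the same number of individuals drawn from the whole habitat.
    broad = set(rng.sample(range(n_pop), sample_size))

    return {"narrow": summarize(narrow), "broad": summarize(broad)}


if __name__ == "__main__":
    for design, (n_found, mean_count) in simulate().items():
        print(f"{design}: {n_found} variants found, "
              f"mean sample count {mean_count:.2f}")
```

Because both designs use the same sample size, the mean sample count is directly comparable to a sample allele frequency. Under these assumed parameters the broad design encounters many more distinct variants, each seen in fewer sampled individuals, which is the qualitative pattern the abstract describes.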

Data availability

Some study data are available. All simulation data and associated scripts are available at https://doi.org/10.5281/zenodo.15398319 (61). Empirical analyses used the published exome sequence data from the UK Biobank resource (19; https://www.ukbiobank.ac.uk/).

Files

steiner-et-al-study-design-and-the-sampling-of-deleterious-rare-variants-in-biobank-scale-datasets.pdf

Files (12.4 MB)

  • Supporting information (9.3 MB), md5:def679304d9e72140a1c776a99ebc08a
  • Article (3.1 MB), md5:a7c1120c1921c7d0a7b95b552463be6f
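The MD5 checksums listed above can be used to confirm that a downloaded file is intact. The following is a minimal sketch using Python's standard `hashlib` module. The checksums are copied from this record, but the helper names (`md5_hex`, `md5_of_file`, `verify`) and any local file path passed to them are hypothetical.

```python
import hashlib

# Checksums as listed on this record.
EXPECTED = {
    "Supporting information": "def679304d9e72140a1c776a99ebc08a",
    "Article": "a7c1120c1921c7d0a7b95b552463be6f",
}


def md5_hex(chunks):
    """Return the MD5 hex digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large PDFs need not fit in memory."""
    def read_chunks():
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                yield chunk
    return md5_hex(read_chunks())


def verify(path, label):
    """Compare a local file against the checksum listed under `label`."""
    return md5_of_file(path) == EXPECTED[label]
```

For example, `verify("article.pdf", "Article")` would return `True` only if the local copy (here saved under the assumed name `article.pdf`) matches the listed checksum. MD5 is suitable for detecting accidental corruption during download, not for guarding against deliberate tampering.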

Additional details

Identifiers

DOI
10.1073/pnas.2425196122
Other
oai:uchicago.tind.io:15459

Funding

National Science Foundation
DGE1746045
National Institutes of Health
R01 GM132383
National Institutes of Health
R35 GM149521

UChicago Information

Division(s)
Biological Sciences Division
Department(s)
Ecology and Evolution, Human Genetics