Published September 3, 2024 | Version v1
Journal article Open

Joint trajectory inference for single-cell genomics using deep learning with a mixture prior

  • 1. Carnegie Mellon University
  • 2. University of Texas at Austin
  • 3. University of Chicago

Description

Trajectory inference methods are essential for analyzing the developmental paths of cells in single-cell sequencing datasets. They provide insights into cellular differentiation, transitions, and lineage hierarchies, helping unravel the dynamic processes underlying development and disease progression. However, many existing tools lack a coherent statistical model and reliable uncertainty quantification, limiting their utility and robustness. In this paper, we introduce VITAE (Variational Inference for Trajectory by AutoEncoder), a statistical approach that integrates a latent hierarchical mixture model with variational autoencoders to infer trajectories. The statistical hierarchical model enhances the interpretability of our framework, while the posterior approximations generated by our variational autoencoder ensure computational efficiency and provide uncertainty quantification of cell projections along trajectories. Specifically, VITAE enables simultaneous trajectory inference and data integration, improving the accuracy of learning a joint trajectory structure in the presence of biological and technical heterogeneity across datasets. We show that VITAE outperforms other state-of-the-art trajectory inference methods on both real and synthetic data under various trajectory topologies. Furthermore, we apply VITAE to jointly analyze three distinct single-cell RNA sequencing datasets of the mouse neocortex, unveiling comprehensive developmental lineages of projection neurons. VITAE effectively reduces batch effects within and across datasets and uncovers finer structures that might be overlooked in individual datasets. Additionally, we showcase VITAE's efficacy in integrative analyses of multiomic datasets with continuous cell population structures.

Data availability

  • Developing mouse brain datasets. The Yuzwa dataset (22) and the Ruan dataset (23) are available from the Gene Expression Omnibus (GEO) database under accession codes GSE107122 and GSE161690, respectively. For the Yuzwa dataset, we use only the cortically derived cells selected by the original paper. We keep the genes that are measured in both datasets (14,707 genes) and merge all 16,651 cells (6,390 and 10,261 cells, respectively). The cell cycle scores (S.Score and G2M.Score) of the two datasets are calculated separately by Seurat v3. As cell type labels are not provided by the Yuzwa dataset, we perform clustering using Seurat and annotate cell types using the marker genes in the Ruan dataset. The information on the collection days of both datasets is summarized in SI Appendix, Table S2.

  • Mouse cerebral cortex atlas. Di Bella’s preprocessed dataset (25), prepared following the procedure in ref. 25, is available at https://singlecell.broadinstitute.org/singlecell/study/SCP1290/molecular-logic-of-cellular-diversification-in-the-mammalian-cerebral-cortex. The cell cycle scores (S.Score and G2M.Score), cell type annotations, and collection day information are provided in the original paper. We exclude cells labeled as “Doublet” and “Low quality cells” and retain 91,648 cells and 19,712 genes with 24 different cell type annotations.
  • Integration of the three mouse brain datasets. The merged developing mouse brain dataset and the atlas dataset are preprocessed separately by the standard procedure (normalize_total, log1p, highly_variable_genes, and scale) using scanpy v1.8.3 (42) with default parameters, and cells labeled as “Doublet” and “Low quality” are excluded. The two datasets are then merged by taking the union of the highly variable genes in each dataset. Next, we unify different cell type annotations that refer to the same cell type (for example, “Intermediate progenitors” as “IPC”). The final dataset has 108,299 cells and 13,183 genes. Empirically, we found that the subcerebral projection neurons (SCPN) before and after day E16 are always projected to two different latent space regions, so we relabeled the SCPN cells before E16 as SCPN1 to provide a more biologically meaningful center when initializing the latent space.
  • Human hematopoiesis data. We obtain the gene expression, peak matrix, and cell type annotations of human hematopoiesis data of healthy donors from https://github.com/GreenleafLab/MPAL-Single-Cell-2019 (30). The TF activity scores from ref. 30 were computed using chromVAR (43). Following the preprocessing procedure in ref. 44, we excluded cells labeled as “Unknown” and combined the clusters with the same cell-type annotation into one label (for example, “CLP.1” and “CLP.2” as “CLP”). We retain only cell types that are on the two developmental trajectories analyzed in ref. 44, and the “cDC” and “CD16.Mono” cell types are removed because they contain too few cells. We calculate the highly variable genes for each dataset with scanpy and retain the union of these genes. The raw count matrices containing the selected cells and genes are then concatenated accordingly. This results in 19,309 cells for the scRNA-seq data and 22,685 cells for the gene activities of the scATAC-seq data, with 4,094 genes measured.
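
The gene and label bookkeeping described in the items above can be sketched in plain Python. This is a minimal illustration, not code from the VITAE repository: the function names and the EXCLUDED_LABELS and LABEL_ALIASES tables are hypothetical, and they encode only the examples given in the text (intersecting genes, taking the union of highly variable genes, dropping flagged cells, and collapsing annotations such as “CLP.1” and “CLP.2”).

```python
# Hypothetical helpers; a sketch of the bookkeeping only, not the VITAE code.

# Labels dropped before analysis, as listed in the text above.
EXCLUDED_LABELS = {"Doublet", "Low quality cells", "Unknown"}

# Explicit renames; only the example from the text is included here.
LABEL_ALIASES = {"Intermediate progenitors": "IPC"}


def unify_label(label):
    """Map a raw annotation to a shared label."""
    if label in LABEL_ALIASES:
        return LABEL_ALIASES[label]
    # Collapse numbered sub-clusters such as "CLP.1" and "CLP.2" into "CLP".
    base, sep, suffix = label.rpartition(".")
    if sep and suffix.isdigit():
        return base
    return label


def shared_genes(genes_a, genes_b):
    """Genes measured in both datasets, in the order of the first dataset."""
    in_b = set(genes_b)
    return [g for g in genes_a if g in in_b]


def union_genes(hvg_a, hvg_b):
    """Union of two highly variable gene lists, preserving first-seen order."""
    return list(dict.fromkeys(list(hvg_a) + list(hvg_b)))


def filter_and_relabel(labels):
    """Return (kept_indices, unified_labels) after dropping excluded cells."""
    kept = [i for i, lab in enumerate(labels) if lab not in EXCLUDED_LABELS]
    return kept, [unify_label(labels[i]) for i in kept]
```

Under these assumptions, shared_genes would correspond to keeping genes measured in both developing mouse brain datasets, and union_genes to merging highly variable gene lists before concatenation. Labels whose suffix is not a number, such as “CD16.Mono”, pass through unify_label unchanged.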

After filtering genes and cells, we apply the standard preprocessing procedure by scanpy v1.8.3 (42) before supplying the datasets to our model.
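
To make the arithmetic behind three of the named preprocessing steps concrete, the sketch below reimplements normalize_total, log1p, and scale on a small list-of-lists count matrix (rows are cells, columns are genes). It is an illustration under stated assumptions, not scanpy's implementation: it assumes every cell has a nonzero total, uses the median of the per-cell totals as the normalization target, and uses the sample standard deviation when scaling. The highly_variable_genes step is omitted.

```python
import math
import statistics


def normalize_total(counts):
    """Rescale each cell (row) so its total equals the median of the original totals.

    Assumes every row has a nonzero total.
    """
    totals = [sum(row) for row in counts]
    target = statistics.median(totals)
    return [[x * target / t for x in row] for row, t in zip(counts, totals)]


def log1p(matrix):
    """Apply the natural-log transform log(1 + x) to every entry."""
    return [[math.log1p(x) for x in row] for row in matrix]


def scale(matrix):
    """Center each gene (column) to zero mean and scale it to unit variance."""
    scaled_cols = []
    for col in zip(*matrix):
        mean = statistics.fmean(col)
        sd = statistics.stdev(col)  # sample SD; the estimator choice is an assumption
        # Constant genes have no spread, so they are set to zero after centering.
        scaled_cols.append([(x - mean) / sd if sd > 0 else 0.0 for x in col])
    return [list(row) for row in zip(*scaled_cols)]


def preprocess(counts):
    """Chain the three steps in the order named in the text."""
    return scale(log1p(normalize_total(counts)))
```

For example, with counts [[1, 3], [2, 2], [6, 2]] the per-cell totals are 4, 4, and 8, so the target is 4 and the third cell is rescaled to [3, 1] before the log transform and scaling are applied.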

The VITAE Python package is publicly available at https://github.com/jaydu1/VITAE (45) under the MIT license. Python and R scripts for reproducing all results in this paper are also provided in the same repository.

Files

du-et-al-2024-joint-trajectory-inference-for-single-cell-genomics-using-deep-learning-with-a-mixture-prior.pdf

Files (87.4 MB)

Article (28.4 MB)
md5:564758aef56b0167947a20ef56542461
Supporting information (59.0 MB)
md5:aa355e0802e9831d735b06eef2e6878a

Additional details

Identifiers

DOI
10.1073/pnas.2316256121
Other
oai:uchicago.tind.io:13349

Funding

National Science Foundation
DMS-2113646
National Science Foundation
DMS-2238656

UChicago Information

Division(s)
Booth School of Business, Physical Sciences Division, The College
Department(s)
Econometrics and Statistics, Physical Sciences, Statistics