Joint trajectory inference for single-cell genomics using deep learning with a mixture prior
- 1. Carnegie Mellon University
- 2. University of Texas at Austin
- 3. University of Chicago
Description
Data availability
- Developing mouse brain datasets. The Yuzwa dataset (22) and the Ruan dataset (23) are available from the Gene Expression Omnibus (GEO) database under accession codes GSE107122 and GSE161690, respectively. For the Yuzwa dataset, we use only the cortically derived cells selected by the original paper. We keep the genes that are measured in both datasets (14,707 genes) and merge all 16,651 cells (6,390 and 10,261 cells, respectively). The cell cycle scores (S.Score and G2M.Score) of the two datasets are calculated separately by Seurat v3. As the Yuzwa dataset does not provide cell type labels, we perform clustering using Seurat and annotate cell types using the marker genes in the Ruan dataset. The collection days of both datasets are summarized in SI Appendix, Table S2.
- Mouse cerebral cortex atlas. Di Bella’s dataset (25), preprocessed following the procedure in ref. 25, is available at https://singlecell.broadinstitute.org/singlecell/study/SCP1290/molecular-logic-of-cellular-diversification-in-the-mammalian-cerebral-cortex. The cell cycle scores (S.Score and G2M.Score), cell type annotations, and collection day information are provided in the original paper. We exclude cells labeled as “Doublet” and “Low quality cells” and retain 91,648 cells and 19,712 genes with 24 different cell type annotations.
- Integration of the three mouse brain datasets. The merged developing mouse brain dataset and the atlas dataset are preprocessed separately by the standard procedure (normalize_total, log1p, highly_variable_genes, and scale) using scanpy v1.8.3 (42) with default parameters, and cells labeled as “Doublet” and “Low quality” are excluded. The mouse brain and atlas datasets are then merged by taking the union of the highly variable genes in each dataset. We then unify different cell type annotations that refer to the same cell type (for example, “Intermediate progenitors” as “IPC”). The final dataset contains 108,299 cells and 13,183 genes. Empirically, we found that the subcerebral projection neuron (SCPN) cells before and after day E16 are always projected to two different regions of the latent space, so we relabeled the SCPN cells before E16 as SCPN1 to provide a more biologically meaningful center when initializing the latent space.
- Human hematopoiesis data. We obtain the gene expression, peak matrix, and cell type annotations of the human hematopoiesis data of healthy donors from https://github.com/GreenleafLab/MPAL-Single-Cell-2019 (30). The TF activity scores from ref. 30 were computed using chromVAR (43). Following the preprocessing procedure in ref. 44, we excluded cells labeled as “Unknown” and combined the clusters with the same cell-type annotation into one label (for example, “CLP.1” and “CLP.2” as “CLP”). We retain only cell types that are on the two developmental trajectories analyzed in ref. 44, and the “cDC” and “CD16.Mono” cell types are removed because they contain too few cells. We calculate the highly variable genes for each dataset using scanpy and retain the union of these genes. The raw count matrices containing the selected cells and genes are then concatenated accordingly. This results in 19,309 cells of scRNA-seq data and 22,685 cells of scATAC-seq gene activities for the analysis, with 4,094 genes measured.
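The gene and label harmonization steps described in the items above (keeping genes shared by two datasets, taking the union of per-dataset gene selections, and collapsing equivalent annotations) can be sketched in plain Python. This is a minimal illustration and not the authors' code: the helper names `shared_genes`, `union_genes`, and `unify_label` are hypothetical, and stripping a trailing numeric suffix is an assumption inferred from the “CLP.1”/“CLP.2” example.

```python
import re


def shared_genes(*gene_lists):
    """Genes present in every list (intersection), in the order of the first list."""
    common = set(gene_lists[0]).intersection(*gene_lists[1:])
    return [gene for gene in gene_lists[0] if gene in common]


def union_genes(*gene_lists):
    """Union of several gene selections, keeping first-seen order."""
    seen = {}
    for genes in gene_lists:
        for gene in genes:
            seen.setdefault(gene, None)
    return list(seen)


# Hypothetical alias table; the single entry is the example given in the text.
ALIASES = {"Intermediate progenitors": "IPC"}


def unify_label(label, aliases=ALIASES):
    """Map a synonymous annotation to one label and drop a sub-cluster suffix like '.1'."""
    label = aliases.get(label, label)
    # Assumption: sub-clusters are marked by a trailing '.<digits>' only.
    return re.sub(r"\.\d+$", "", label)
```

Labels without a numeric suffix, such as “CD16.Mono”, pass through unchanged under this rule, so a real pipeline would still need an explicit list of the cell types to drop.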
After filtering genes and cells, we apply the standard preprocessing procedure by scanpy v1.8.3 (42) before supplying the datasets to our model.
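To make the standard preprocessing procedure more concrete, the following NumPy sketch shows roughly what library-size normalization, log transformation, and per-gene scaling compute on a cells-by-genes count matrix. It is an illustration under stated assumptions rather than scanpy's implementation: the function name `normalize_log_scale` is hypothetical, the median library size is assumed as the normalization target, the sample standard deviation is assumed for scaling, and highly variable gene selection is omitted entirely.

```python
import numpy as np


def normalize_log_scale(counts):
    """Normalize library sizes, apply log1p, and standardize each gene.

    counts: array of shape (cells, genes) holding non-negative raw counts.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)       # per-cell library size
    target = np.median(totals)                       # assumed normalization target
    safe_totals = np.where(totals > 0, totals, 1.0)  # avoid dividing by zero
    normed = counts / safe_totals * target
    logged = np.log1p(normed)                        # natural log of (1 + x)
    mean = logged.mean(axis=0)
    std = logged.std(axis=0, ddof=1)                 # assumed: sample standard deviation
    std[std == 0] = 1.0                              # constant genes stay at zero
    return (logged - mean) / std
```

After this transformation each non-constant gene has zero mean and unit variance across cells, which keeps highly expressed genes from dominating a downstream model.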
The VITAE Python package is publicly available at https://github.com/jaydu1/VITAE (45) under the MIT license. Python and R scripts for reproducing all results in this paper are provided in the same repository.
Files
du-et-al-2024-joint-trajectory-inference-for-single-cell-genomics-using-deep-learning-with-a-mixture-prior.pdf
Total size: 87.4 MB
| Name | Size |
|---|---|
| Article (md5:564758aef56b0167947a20ef56542461) | 28.4 MB |
| Supporting information (md5:aa355e0802e9831d735b06eef2e6878a) | 59.0 MB |
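The MD5 checksums listed for the two files can be used to confirm that a download is intact. The sketch below is one way to do this with Python's standard `hashlib` module. The helper names `md5_of_file` and `matches` are hypothetical, and which local file name corresponds to which checksum is something the reader has to supply.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksum listed in this record for the "Article" file.
EXPECTED_ARTICLE_MD5 = "564758aef56b0167947a20ef56542461"


def matches(path, expected_md5):
    """True if the file at `path` has the expected MD5 (case-insensitive)."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, `matches("article.pdf", EXPECTED_ARTICLE_MD5)` would check a downloaded copy saved under the placeholder name `article.pdf`.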
Additional details
Identifiers
- DOI
- 10.1073/pnas.2316256121
- Other
- oai:uchicago.tind.io:13349
Funding
- National Science Foundation
- DMS-2113646
- National Science Foundation
- DMS-2238656