Published August 1, 2021 | Version v2

ACTIVA: realistic single-cell RNA-seq generation with automatic cell-type identification using introspective variational autoencoders

  • University of California, Merced

Description

(References for the tools used are available in the manuscript.)

Datasets
68K PBMC: To compare our results with the current state-of-the-art deep learning model, scGAN/cscGAN, we trained and evaluated our model on a dataset containing 68579 peripheral blood mononuclear cells (PBMCs) from a healthy donor (68K PBMC). 68K PBMC is an ideal dataset for evaluating generative models because of its distinct cell populations, complexity, and size (as noted in the scGAN work). After pre-processing, the data contained 17789 genes. We then performed a balanced split of this data, resulting in 6991 test cells and 61588 training cells (a sketch of such a split is shown below).
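The balanced split keeps each cell population represented at roughly the same proportion in the training and test sets. Below is a minimal sketch of such a stratified split with scikit-learn, under stated assumptions: the AnnData object `adata` and the `cell_type` column in `adata.obs` are hypothetical names for illustration, not identifiers from the ACTIVA codebase.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed inputs: `adata` is an AnnData object with per-cell labels in
# adata.obs["cell_type"] (both names are hypothetical).
idx = np.arange(adata.n_obs)

# Stratify on the cell-type labels so both splits preserve population proportions.
train_idx, test_idx = train_test_split(
    idx,
    test_size=0.10,  # roughly 10% of each dataset is held out, as described here
    stratify=adata.obs["cell_type"].values,
    random_state=0,
)

adata_train = adata[train_idx].copy()
adata_test = adata[test_idx].copy()
```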


Brain Small: In addition to 68K PBMC, we used a randomly selected subset of a larger dataset called Brain Large (both from 10x Genomics). Brain Large contains approximately 1.3 million cells from the cortex, hippocampus, and subventricular zone of two embryonic day 18 mice. Compared to 68K PBMC, this subset has fewer cells and differs in complexity and organism. Both Brain Large and its subset (Brain Small) are available on the 10x Genomics portal. After the pre-processing steps, Brain Small contained 17970 genes, which we then split (via the same balanced split) into 1997 test cells and 18003 training cells.

NeuroCOVID: This dataset (Heming et al.) contains scRNA-seq data of immune cells from the cerebrospinal fluid (CSF) of Neuro-COVID patients and of patients with non-inflammatory and autoimmune neurological diseases or with viral encephalitis. Our pre-processing resulted in a matrix of 85414 cells × 22824 genes, which we split into testing and training subsets as described above.

Pre-Processing
We used the pipeline provided by Marouf et al. 2020 (scGAN) to pre-process the data. First, we removed genes that were expressed in fewer than 3 cells and cells that expressed fewer than 10 genes. Next, cells were normalized by total unique molecular identifier (UMI) counts and scaled to 20000 reads per cell. Finally, we selected a "test set" (approximately 10% of each dataset).
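For reference, the filtering and normalization steps above can be expressed with Scanpy. This is only a sketch of the described pipeline, not the exact scGAN code; the input path is a placeholder, and the final random hold-out stands in for the balanced split sketched earlier.

```python
import numpy as np
import scanpy as sc

# Placeholder path to a 10x count matrix.
adata = sc.read_10x_mtx("data/68k_pbmc/filtered_matrices/")

# Remove genes expressed in fewer than 3 cells and cells expressing fewer than 10 genes.
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.filter_cells(adata, min_genes=10)

# Normalize each cell by its total UMI count and scale to 20,000 reads per cell.
sc.pp.normalize_total(adata, target_sum=2e4)

# Hold out roughly 10% of cells as the test set (plain random split here).
rng = np.random.default_rng(0)
test_mask = rng.random(adata.n_obs) < 0.10
adata_test, adata_train = adata[test_mask].copy(), adata[~test_mask].copy()
```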


Post-Processing
After generating a count matrix with a generative model (e.g. ACTIVA or scGAN), we add the gene names (from the real data) and save the result as a Scanpy/Seurat object. We then use Seurat to identify 3000 highly variable genes via the variance-stabilizing transformation (VST), which applies a negative binomial regression to identify outlier genes. The shared highly variable genes are then used for integration (Seurat integration), which allows for biological feature overlap between different datasets so that the downstream analyses presented in this work can be performed. Next, we perform gene-level scaling, i.e. centering the mean of each feature at zero and scaling by the standard deviation. The feature space is then reduced to 50 principal components, followed by Uniform Manifold Approximation and Projection (UMAP) and t-distributed Stochastic Neighbor Embedding (t-SNE). As noted by Marouf et al., analyses on lower-dimensional representations have two main advantages: (i) most biologically relevant information is captured while noise is reduced, and (ii) statistically, it is more acceptable to use lower-dimensional embeddings in classification tasks when samples and features are of the same order of magnitude, which is often the case for scRNA-seq datasets (such as the ones we used). Lastly, we use Scater to visualize the datasets. A Scanpy-based sketch of these steps is shown below.
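The post-processing in the manuscript is carried out with Seurat and Scater in R; the sketch below only approximates the same steps on the Python side with Scanpy (the "seurat_v3" HVG flavor stands in for Seurat's VST selection, and the integration and Scater visualization steps are omitted). The `generated` matrix and `adata_real` object are placeholder names.

```python
import anndata as ad
import scanpy as sc

# Placeholder inputs: `generated` is a cells x genes count matrix from ACTIVA/scGAN,
# and `adata_real` is the real dataset providing the matching gene names.
adata_gen = ad.AnnData(X=generated, var=adata_real.var.copy())
adata_gen.write("generated_cells.h5ad")  # Scanpy object; convertible to a Seurat object

# Select 3000 highly variable genes; Scanpy's "seurat_v3" flavor approximates Seurat's VST.
sc.pp.highly_variable_genes(adata_gen, n_top_genes=3000, flavor="seurat_v3", subset=True)

# Gene-level scaling: zero-center each gene and divide by its standard deviation.
sc.pp.scale(adata_gen)

# Reduce to 50 principal components, then compute UMAP and t-SNE embeddings.
sc.pp.pca(adata_gen, n_comps=50)
sc.pp.neighbors(adata_gen, n_pcs=50)
sc.tl.umap(adata_gen)
sc.tl.tsne(adata_gen, n_pcs=50)
```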

Files (1.1 GB)

  • ACTIVA-main.zip
  • 61.7 MB (md5: 7696db0199a30a36f857e86a1ad1da8d)
  • 18.9 kB (md5: 4b2786f49895924b9301b46c2358224f)
  • 744.0 MB (md5: f75a89e43bc30999a0e346a0b6583dd2)
  • 300.8 MB (md5: d27ed71280681d1f7317b25783f3d044)