Code Used for Processing mRNA for COMET Dataset

By: George C. Hartoularos
Original Date: 10MAY20
Edited Date: 28JUL21

For processing mRNA, we find that an iterative process that runs on all cells at once and retains as much information as possible yields the best results: it captures variability in cell type as well as environmental effects while removing technical artifacts from subsequent visualizations. Running memory-intensive algorithms like ComBat, PCA, and UMAP on all cells at once requires a machine large enough to hold the data for all cells in memory. Because of the large number of cells captured here, we rented an AWS instance with 768 GiB of memory and 96 vCPUs (AWS instance: r5a.24xlarge).
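As a back-of-the-envelope illustration of why so much memory is needed, the sketch below estimates the footprint of a dense cells-by-genes matrix. The cell and gene counts, the float32 assumption, and the number of simultaneous copies are all hypothetical and are not taken from this dataset.

```python
def dense_matrix_gib(n_cells: int, n_genes: int, bytes_per_value: int = 4) -> float:
    """Memory in GiB for a dense n_cells x n_genes matrix (float32 by default)."""
    return n_cells * n_genes * bytes_per_value / 2**30


def fits_in_ram(n_cells: int, n_genes: int, ram_gib: float = 768.0, copies: int = 3) -> bool:
    """True if `copies` dense matrices of this shape fit in `ram_gib` of memory."""
    return copies * dense_matrix_gib(n_cells, n_genes) <= ram_gib


# Hypothetical sizes, chosen only to illustrate the arithmetic.
print(round(dense_matrix_gib(1_000_000, 20_000), 1))
print(fits_in_ram(1_000_000, 20_000))
```

The `copies` parameter is a crude allowance for intermediate arrays created during processing; the real overhead depends on the algorithm.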

Therefore, to process the cells, I ran the following snippets of code interactively in Python (from the command line, not in a Jupyter notebook). The input to this interactive Python session was concat.gene.filt.singlets.h5ad, generated by mrna_preprocessing.ipynb.

Initial Processing

This initial snippet was run on the original dataset.
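The snippet itself is not shown here. As a rough illustration only, and not the code that was actually run, the sketch below walks through the steps that the output file name suggests (transform, scale, PCA) on a toy matrix using NumPy. It assumes the transform is a per-cell normalization followed by log(1 + x), which the text does not confirm. In scanpy the corresponding calls would be `sc.pp.normalize_total`, `sc.pp.log1p`, `sc.pp.scale`, and `sc.pp.pca`.

```python
import numpy as np


def log_normalize(counts: np.ndarray, target_sum: float = 1e4) -> np.ndarray:
    """Scale each cell (row) to target_sum total counts, then apply log(1 + x)."""
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / totals * target_sum)


def scale_genes(x: np.ndarray) -> np.ndarray:
    """Center each gene (column) to mean 0 and scale it to unit variance."""
    mean = x.mean(axis=0)
    std = x.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant genes stay at zero after centering
    return (x - mean) / std


def pca(x: np.ndarray, n_comps: int):
    """PCA via SVD of the centered matrix; returns (scores, variance_ratio)."""
    x = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    var = s**2 / (x.shape[0] - 1)
    return u[:, :n_comps] * s[:n_comps], var[:n_comps] / var.sum()


rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(50, 20)).astype(float)  # toy data: 50 cells x 20 genes
scores, variance_ratio = pca(scale_genes(log_normalize(counts)), n_comps=10)
print(scores.shape, variance_ratio.round(3))
```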

The initial snippet ran overnight on the instance. I then checked the PCA variance plot and, based on it, fed 200 PCs to the neighbors calculation. I ran the remaining code in a new session.
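The number of PCs here was chosen by eye from the variance plot. As a hedged illustration, and not the code that was actually run, the sketch below shows a numeric alternative (a cumulative-variance threshold, which is an assumption of this example) together with what the remaining graph-based steps might look like in scanpy. The function names and the threshold value are hypothetical; `neighbors_umap_leiden` expects `adata.obsm["X_pca"]` to already hold at least `n_pcs` components.

```python
import numpy as np


def n_pcs_for_variance(variance_ratio, threshold: float = 0.9) -> int:
    """Smallest number of leading PCs whose cumulative variance ratio reaches threshold."""
    cumulative = np.cumsum(np.asarray(variance_ratio, dtype=float))
    if cumulative[-1] < threshold:
        return len(cumulative)  # threshold never reached: keep every PC
    return int(np.searchsorted(cumulative, threshold) + 1)


def neighbors_umap_leiden(adata, n_pcs: int = 200):
    """Neighbor graph on the first n_pcs PCs, then UMAP and Leiden clustering."""
    import scanpy as sc  # imported here so the helper above works without scanpy

    sc.pp.neighbors(adata, n_pcs=n_pcs)
    sc.tl.umap(adata)
    sc.tl.leiden(adata)
    return adata


# Toy variance ratios, for illustration only.
print(n_pcs_for_variance([0.5, 0.3, 0.1, 0.05, 0.05], threshold=0.85))
```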

The output of this step was concat.gene.filt.singlets.trans.scale.pca.dimred.clust.h5ad, which was then fed into remove_consent_declined.ipynb to remove patients who did not consent to data sharing. That notebook generated concat.gene.filt.singlets.trans.scale.pca.dimred.clust.consent.h5ad, which was then fed into mrna1.ipynb. After running that notebook, there were two outcomes:

  1. Some cells were marked as non-target cells or cells with high mitochondrial content (or containing contaminating transcripts from them), or as doublets. These cells were marked for removal for the next iteration.
  2. There were observable batch effects due to the pool/run, so it was decided to run ComBat in the next iteration.
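The first outcome can be pictured as a boolean mask over the per-cell metadata. The sketch below is a minimal illustration with pandas; the column names, the toy values, and the mitochondrial cutoff are all hypothetical and are not the criteria used in mrna1.ipynb.

```python
import pandas as pd

# Toy per-cell metadata; column names and values are hypothetical.
obs = pd.DataFrame(
    {
        "pct_mito": [2.1, 35.0, 4.8, 3.3],
        "is_doublet": [False, False, True, False],
        "is_target": [True, True, True, False],
    },
    index=["cell_a", "cell_b", "cell_c", "cell_d"],
)

MAX_PCT_MITO = 20.0  # hypothetical cutoff, not the one used in mrna1.ipynb

# Keep cells that pass every check; everything else is marked for removal.
keep = (obs["pct_mito"] <= MAX_PCT_MITO) & ~obs["is_doublet"] & obs["is_target"]
cells_to_keep = obs.index[keep].tolist()
print(cells_to_keep)
```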

 

Processing — Iteration 2

The outputs of mrna1.ipynb were cells.to.keep.tsv and obs.to.keep.csv. These files were then input into a second iteration of the processing, also run on the command line. This iteration repeats the first processing but removes the cells marked for removal and, because of the batch effects noted above, applies ComBat.
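As a hedged sketch of how a keep-list might be applied before batch correction (not the code that was actually run), the example below reads cell identifiers and shows the subsetting logic. It assumes the keep-list holds one cell barcode per line with no header, and that the pool/run label lives in an `obs` column called `batch`; both are assumptions of this example. `sc.pp.combat` is scanpy's ComBat implementation.

```python
import io

import pandas as pd


def load_keep_ids(handle) -> pd.Index:
    """Read one cell barcode per line (no header) from a path or file-like object."""
    ids = pd.read_csv(handle, sep="\t", header=None, names=["cell"])["cell"]
    return pd.Index(ids.astype(str))


def subset_and_combat(adata, keep_ids: pd.Index, batch_key: str = "batch"):
    """Drop cells not in keep_ids, then batch-correct adata.X with ComBat."""
    import scanpy as sc  # imported lazily; only needed for the real dataset

    adata = adata[adata.obs_names.isin(keep_ids)].copy()
    sc.pp.combat(adata, key=batch_key)  # corrects adata.X in place
    return adata


# Toy stand-in for the keep-list file.
keep_ids = load_keep_ids(io.StringIO("cell_a\ncell_c\n"))
all_cells = pd.Index(["cell_a", "cell_b", "cell_c"])
print(all_cells.isin(keep_ids))
```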

The first part of this second iteration was run overnight, and the remainder was run the next day.

The second iteration output concat.combatted.pca.clustering.h5ad, which was the input to mrna2.ipynb. That notebook was used to subcluster the Leiden clusters to better capture the heterogeneity visible in the UMAP. Once these clusters were generated, the final notebook, mrna3.ipynb, was used to annotate them with cell types and states. This was the final version of the matrix/UMAP that was used for further analysis.
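A minimal sketch of what subclustering and annotation can look like follows; it is an illustration under stated assumptions, not the contents of mrna2.ipynb or mrna3.ipynb. It assumes the existing clusters are stored in `obs["leiden"]`, and the `leiden_sub` key, resolution, cluster IDs, and cell-type labels are all hypothetical. The `restrict_to` argument of `sc.tl.leiden` limits re-clustering to the named clusters.

```python
import pandas as pd


def subcluster(adata, cluster: str, resolution: float = 0.5):
    """Re-run Leiden inside one existing cluster, storing labels in obs['leiden_sub']."""
    import scanpy as sc  # imported lazily; only needed for the real dataset

    sc.tl.leiden(
        adata,
        resolution=resolution,
        restrict_to=("leiden", [cluster]),
        key_added="leiden_sub",
    )
    return adata


def annotate(clusters: pd.Series, labels: dict) -> pd.Series:
    """Map cluster IDs to cell-type names; unmapped clusters become 'unassigned'."""
    return clusters.astype(str).map(labels).fillna("unassigned")


# Hypothetical cluster IDs and labels, for illustration only.
clusters = pd.Series(["0", "1", "2", "3"], index=["c1", "c2", "c3", "c4"])
labels = {"0": "T cell", "1": "B cell", "2": "Monocyte"}
print(annotate(clusters, labels).tolist())
```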