mRNA_processing

Code Used for Processing mRNA for COMET Dataset

By: George C. Hartoularos Original Date: 10MAY20 Edited Date: 28JUL21

For processing mRNA, we find that using an iterative process running on all cells at once and trying to retain as much information as possible yields the best results in terms of capturing variability in cell type as well as environmental affects while removing technical artifacts from subsequent visualizations. To run memory-intensive algorithms like ComBat, PCA, and UMAP on all cells at once, a large machine capable of processing and storing data for all cells in memory is required. Due to the large cell numbers we captured here, we required renting out an AWS instance with 768 GiB of memory and 96 vCPUs (AWS instance: r5a.24xlarge).

Therefore, in order to process the cells, I ran the folliowing snippets of code interactively using Python (from the command line, not in a Jupyter notebook). Input to this interactive Python session was concat.gene.filt.singlets.h5ad, generated by mrna_preprocessing.ipynb.

Initial Processing

This initial snippet was run on the original dataset.


>>> import scanpy as sc
>>> import pandas as pd
>>> sc.settings.n_jobs=60
>>> sc.settings.verbosity=0
>>> path_prefix = '/data/comet.tmp/' # where my files are stored
>>> concat = sc.read_h5ad(path_prefix + 'h5ads/older/concat.gene.filt.singlets.h5ad')
>>> print(concat.shape)
>>> sc.pp.normalize_per_cell(concat, counts_per_cell_after=1e6)
>>> sc.pp.log1p(concat)
>>> sc.pp.scale(concat, max_value=10)
>>> sc.pp.pca(concat, n_comps=200, chunked=True, chunk_size=50000)
>>> sc.pl.pca_variance_ratio(concat, log=True, n_pcs=150, show=False, save='.combat.no_cov.pdf')
>>> concat.write_h5ad(path_prefix + 'h5ads/concat.gene.filt.singlets.trans.scale.pca.h5ad')

The above code ran overnight on the instance. I checked the PCA variance plot and based on that fed 200 PCs to the neighbors calculations. Then ran the remaining code in a new session:


xxxxxxxxxx
>>> import scanpy as sc
>>> sc.settings.n_job=60
>>> sc.settings.verbosity=4
>>> concat = sc.read_h5ad('concat.combatted.pca.h5ad')
>>> sc.pp.neighbors(concat, n_pcs=150)
computing neighbors
computing neighbors
using 'X_pca' with n_pcs = 150
computed neighbors (0:05:52)
computed connectivities (0:00:29)
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:06:22)
>>> sc.tl.umap(concat)
computing UMAP
completed  0  /  200 epochs
completed  20  /  200 epochs
completed  40  /  200 epochs
completed  60  /  200 epochs
completed  80  /  200 epochs
completed  100  /  200 epochs
completed  120  /  200 epochs
completed  140  /  200 epochs
completed  160  /  200 epochs
completed  180  /  200 epochs
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:27:53)
>>> concat.write_h5ad('concat.gene.filt.singlets.trans.scale.pca.dimred.clust.h5ad')
/home/ssm-user/anaconda3/envs/scanpy/lib/python3.6/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
if is_string_dtype(df[key]) and not is_categorical(df[key])

The output of this file was concat.gene.filt.singlets.trans.scale.pca.dimred.clust.h5ad which was then fed into remove_consent_declined.ipynb to remove patients from the study that did not consent to data sharing. That notebook generated concat.gene.filt.singlets.trans.scale.pca.dimred.clust.consent.h5ad which was then fed into mrna1.ipynb. After running that notebook, there were two outcomes:

Some cells were marked as non-target cells or cells with high mitochondrial content (or containing contaminating transcripts from them), or as doublets. These cells were marked for removal for the next iteration.
There were observable batch effects due to the pool/run, so it was decided to run combat in the next iterations.

Processing — Iteration 2

So output by mrna1.ipynb was cells.to.keep.tsv and obs.to.keep.csv. These files were then input into a second iteration of the processing, also on the command line. This is identical to the first processing but removes the non-target cells.


xxxxxxxxxx
>>> import scanpy as sc
>>> import pandas as pd
>>> sc.settings.n_jobs=60
>>> sc.settings.verbosity=0
>>> path_prefix = '/data/comet.tmp/' # where my files are stored
>>> concat = sc.read_h5ad(path_prefix + 'h5ads/older/concat.gene.filt.singlets.h5ad')
>>> with open(path_prefix + 'tsvs/cells.to.keep.tsv', index_col'r') as file:
...     cells_to_keep = [i.strip() for i in file.readlines()]
...
>>> concat = concat[cells_to_keep].copy()
>>> concat.shape
>>> sc.pp.normalize_per_cell(concat, counts_per_cell_after=1e6)
>>> sc.pp.log1p(concat)
>>> sc.pp.scale(concat, max_value=10)
>>> obs_to_keep = pd.read_csv('/data/comet.tmp/csvs/obs.to.keep.csv', index_col=0)
>>> concat.obs = concat.obs.join(obs_to_keep)
>>> sc.pp.combat(concat, key='pool')
>>> sc.pp.scale(concat, max_value=10)
>>> sc.pp.pca(concat, n_comps=200, chunked=True, chunk_size=50000)
>>> sc.pl.pca_variance_ratio(concat, log=True, n_pcs=150, show=False, save='.combat.no_cov.pdf')
>>> concat.write_h5ad(path_prefix + 'h5ads/concat.combatted.pca.h5ad')

Above code was run overnight, followed by the code below the next day.


xxxxxxxxxx
>>> import scanpy as sc
>>> sc.settings.n_job=60
>>> sc.settings.verbosity=4
>>> concat = sc.read_h5ad('concat.combatted.pca.h5ad')
>>> sc.pp.neighbors(concat, n_pcs=150)
computing neighbors
computing neighbors
using 'X_pca' with n_pcs = 150
computed neighbors (0:05:52)
computed connectivities (0:00:29)
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:06:22)
>>> sc.tl.umap(concat)
computing UMAP
completed  0  /  200 epochs
completed  20  /  200 epochs
completed  40  /  200 epochs
completed  60  /  200 epochs
completed  80  /  200 epochs
completed  100  /  200 epochs
completed  120  /  200 epochs
completed  140  /  200 epochs
completed  160  /  200 epochs
completed  180  /  200 epochs
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:27:53)
>>> concat.write_h5ad('concat.combatted.pca.clustering.h5ad')
/home/ssm-user/anaconda3/envs/scanpy/lib/python3.6/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
if is_string_dtype(df[key]) and not is_categorical(df[key]

This above code output concat.combatted.pca.clustering.h5ad, which was input to mrna2.ipynb. Then, mrna2.ipynb was used to subcluster Leiden clusters in order to better capture the heterogeneity visualized through UMAP. Once these clusters were generated, the final notebook mrna3.ipynb was used to annotate them with cell types and states. This was the final version of the matrix/UMAP that was used for further analysis.