By: George C. Hartoularos Original Date: 10MAY20 Edited Date: 28JUL21
For processing mRNA, we find that using an iterative process running on all cells at once and trying to retain as much information as possible yields the best results in terms of capturing variability in cell type as well as environmental affects while removing technical artifacts from subsequent visualizations. To run memory-intensive algorithms like ComBat, PCA, and UMAP on all cells at once, a large machine capable of processing and storing data for all cells in memory is required. Due to the large cell numbers we captured here, we required renting out an AWS instance with 768 GiB of memory and 96 vCPUs (AWS instance: r5a.24xlarge
).
Therefore, in order to process the cells, I ran the folliowing snippets of code interactively using Python (from the command line, not in a Jupyter notebook). Input to this interactive Python session was concat.gene.filt.singlets.h5ad
, generated by mrna_preprocessing.ipynb
.
This initial snippet was run on the original dataset.
>>> import scanpy as sc
>>> import pandas as pd
>>> sc.settings.n_jobs=60
>>> sc.settings.verbosity=0
>>> path_prefix = '/data/comet.tmp/' # where my files are stored
>>> concat = sc.read_h5ad(path_prefix + 'h5ads/older/concat.gene.filt.singlets.h5ad')
>>> print(concat.shape)
>>> sc.pp.normalize_per_cell(concat, counts_per_cell_after=1e6)
>>> sc.pp.log1p(concat)
>>> sc.pp.scale(concat, max_value=10)
>>> sc.pp.pca(concat, n_comps=200, chunked=True, chunk_size=50000)
>>> sc.pl.pca_variance_ratio(concat, log=True, n_pcs=150, show=False, save='.combat.no_cov.pdf')
>>> concat.write_h5ad(path_prefix + 'h5ads/concat.gene.filt.singlets.trans.scale.pca.h5ad')
The above code ran overnight on the instance. I checked the PCA variance plot and based on that fed 200 PCs to the neighbors calculations. Then ran the remaining code in a new session:
xxxxxxxxxx
>>> import scanpy as sc
>>> sc.settings.n_job=60
>>> sc.settings.verbosity=4
>>> concat = sc.read_h5ad('concat.combatted.pca.h5ad')
>>> sc.pp.neighbors(concat, n_pcs=150)
computing neighbors
computing neighbors
using 'X_pca' with n_pcs = 150
computed neighbors (0:05:52)
computed connectivities (0:00:29)
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:06:22)
>>> sc.tl.umap(concat)
computing UMAP
completed 0 / 200 epochs
completed 20 / 200 epochs
completed 40 / 200 epochs
completed 60 / 200 epochs
completed 80 / 200 epochs
completed 100 / 200 epochs
completed 120 / 200 epochs
completed 140 / 200 epochs
completed 160 / 200 epochs
completed 180 / 200 epochs
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:27:53)
>>> concat.write_h5ad('concat.gene.filt.singlets.trans.scale.pca.dimred.clust.h5ad')
/home/ssm-user/anaconda3/envs/scanpy/lib/python3.6/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
if is_string_dtype(df[key]) and not is_categorical(df[key])
The output of this file was concat.gene.filt.singlets.trans.scale.pca.dimred.clust.h5ad
which was then fed into remove_consent_declined.ipynb
to remove patients from the study that did not consent to data sharing. That notebook generated concat.gene.filt.singlets.trans.scale.pca.dimred.clust.consent.h5ad
which was then fed into mrna1.ipynb
. After running that notebook, there were two outcomes:
So output by mrna1.ipynb
was cells.to.keep.tsv
and obs.to.keep.csv
. These files were then input into a second iteration of the processing, also on the command line. This is identical to the first processing but removes the non-target cells.
xxxxxxxxxx
>>> import scanpy as sc
>>> import pandas as pd
>>> sc.settings.n_jobs=60
>>> sc.settings.verbosity=0
>>> path_prefix = '/data/comet.tmp/' # where my files are stored
>>> concat = sc.read_h5ad(path_prefix + 'h5ads/older/concat.gene.filt.singlets.h5ad')
>>> with open(path_prefix + 'tsvs/cells.to.keep.tsv', index_col'r') as file:
... cells_to_keep = [i.strip() for i in file.readlines()]
...
>>> concat = concat[cells_to_keep].copy()
>>> concat.shape
>>> sc.pp.normalize_per_cell(concat, counts_per_cell_after=1e6)
>>> sc.pp.log1p(concat)
>>> sc.pp.scale(concat, max_value=10)
>>> obs_to_keep = pd.read_csv('/data/comet.tmp/csvs/obs.to.keep.csv', index_col=0)
>>> concat.obs = concat.obs.join(obs_to_keep)
>>> sc.pp.combat(concat, key='pool')
>>> sc.pp.scale(concat, max_value=10)
>>> sc.pp.pca(concat, n_comps=200, chunked=True, chunk_size=50000)
>>> sc.pl.pca_variance_ratio(concat, log=True, n_pcs=150, show=False, save='.combat.no_cov.pdf')
>>> concat.write_h5ad(path_prefix + 'h5ads/concat.combatted.pca.h5ad')
Above code was run overnight, followed by the code below the next day.
xxxxxxxxxx
>>> import scanpy as sc
>>> sc.settings.n_job=60
>>> sc.settings.verbosity=4
>>> concat = sc.read_h5ad('concat.combatted.pca.h5ad')
>>> sc.pp.neighbors(concat, n_pcs=150)
computing neighbors
computing neighbors
using 'X_pca' with n_pcs = 150
computed neighbors (0:05:52)
computed connectivities (0:00:29)
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:06:22)
>>> sc.tl.umap(concat)
computing UMAP
completed 0 / 200 epochs
completed 20 / 200 epochs
completed 40 / 200 epochs
completed 60 / 200 epochs
completed 80 / 200 epochs
completed 100 / 200 epochs
completed 120 / 200 epochs
completed 140 / 200 epochs
completed 160 / 200 epochs
completed 180 / 200 epochs
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:27:53)
>>> concat.write_h5ad('concat.combatted.pca.clustering.h5ad')
/home/ssm-user/anaconda3/envs/scanpy/lib/python3.6/site-packages/anndata/_core/anndata.py:1192: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
if is_string_dtype(df[key]) and not is_categorical(df[key]
This above code output concat.combatted.pca.clustering.h5ad
, which was input to mrna2.ipynb
. Then, mrna2.ipynb
was used to subcluster Leiden clusters in order to better capture the heterogeneity visualized through UMAP. Once these clusters were generated, the final notebook mrna3.ipynb
was used to annotate them with cell types and states. This was the final version of the matrix/UMAP that was used for further analysis.