scATAcat package
Submodules
scATAcat.bulk_data_functions module
- scATAcat.bulk_data_functions.generate_bulk_AnnData(bulk_df)[source]
Generate AnnData object from DataFrame.
Parameters:
- bulk_df (DataFrame): Dataframe of prototype data:
columns: cell types / samples
rows: features (cCREs)
Returns:
AnnData of prototypes.
- scATAcat.bulk_data_functions.generate_bulk_sparse_AnnData(bulk_df, var_key='cCREs', obs_key='cell_types')[source]
Generate AnnData object from DataFrame. The count matrix is sparse.
Parameters:
- bulk_df (DataFrame): Dataframe of prototype data:
columns: cell types / samples
rows: features (cCREs)
Returns:
AnnData of prototypes.
- scATAcat.bulk_data_functions.preprocess_bulk_adata(bulk_adata, remove_chrY=True, var_key='cCREs', copy=False)[source]
Preprocess a prototype count matrix in AnnData format by optionally removing features associated with chromosome Y.
If copy is True, a new AnnData object with the preprocessed data is returned, leaving the original AnnData object unchanged. If copy is False, the original AnnData object is modified in place, and the preprocessed AnnData object is returned.
Parameters:
bulk_adata (AnnData): An AnnData object containing the prototype count matrix.
remove_chrY (bool, optional): Whether to remove features associated with chromosome Y. Default is True.
var_key (str, optional): Key for accessing feature information in AnnData.var. Default is ‘cCREs’.
copy (bool, optional): If True, a copy of the AnnData object is returned; if False, the original AnnData object is modified. Default is False.
Returns:
AnnData: The preprocessed AnnData object.
scATAcat.helper_functions module
- scATAcat.helper_functions.add_binary_layer(adata, binary_layer_key='binary')[source]
Convert the count matrix associated with the AnnData object to binary and add it as a new layer.
This function converts the count matrix in the AnnData object to binary, where non-zero values are set to 1.
The resulting binary matrix is added as a new layer in the AnnData object using the specified key.
Parameters:
adata (AnnData): An AnnData object containing the sc count matrix.
binary_layer_key (str, optional): The key for the binary layer to be added. Default is “binary”.
Returns:
AnnData: The AnnData object with the binary layer added.
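As a rough illustration of the binarization described above, the following NumPy sketch sets every non-zero count to 1. It operates on a plain array rather than an AnnData layer, and the helper name `binarize_counts` is hypothetical; it is not scATAcat's implementation.

```python
import numpy as np

def binarize_counts(counts):
    """Return a 0/1 matrix: 1 where the count is non-zero, 0 elsewhere."""
    counts = np.asarray(counts)
    return (counts != 0).astype(np.int8)

# Toy cells x features count matrix (values are made up).
counts = np.array([[0, 3, 1],
                   [2, 0, 0]])
binary = binarize_counts(counts)
print(binary)
```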
- scATAcat.helper_functions.apply_PCA(adata, layer_key='TF_logIDF', svd_solver='arpack', random_state=0)[source]
Wrapper around scanpy.tl.pca to enable applying scanpy.tl.pca function to a specified layer.
Adds the _pca, _components, explained_variance_ratio_ and explained_variance_ entries to the adata object.
See the scanpy documentation for details: https://scanpy.readthedocs.io/en/latest/generated/scanpy.tl.pca.html#scanpy-tl-pca
Parameters:
adata (AnnData): An AnnData object containing the sc count matrix.
layer_key: The key for accessing the layer to which PCA is applied. Default is “TF_logIDF”.
Returns:
AnnData: The AnnData object with the PCA results added.
- scATAcat.helper_functions.apply_TFIDF_sparse(adata, binary_layer_key='binary', TFIDF_key='TF_logIDF')[source]
Apply Term Frequency - Inverse Document Frequency TF-log(IDF) normalization to the binary layer of the AnnData object.
If the binary layer is not present, it calculates and adds the binary layer using the specified key.
Additionally, if cell and feature statistics are not available, it calculates them using the binary layer.
Parameters:
adata (AnnData): An AnnData object containing the sc count matrix.
binary_layer_key (str): The key for accessing the binary layer. Default is “binary”.
TFIDF_key (str): The key for the TFIDF normalized matrix layer to be added. Default is “TF_logIDF”.
Returns:
AnnData: The AnnData object with the TF-log(IDF) normalized layer added.
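The exact TF-log(IDF) formula is not stated in this reference, so the sketch below uses one common variant as an assumption: the term frequency is the binary value divided by the number of features in the cell, and the inverse document frequency is the number of cells divided by the number of cells containing the feature, passed through log(1 + x). The helper name `tf_log_idf` is hypothetical, and the code works on a dense NumPy array rather than a sparse AnnData layer.

```python
import numpy as np

def tf_log_idf(binary):
    """TF-log(IDF) on a cells x features binary matrix (assumed variant)."""
    binary = np.asarray(binary, dtype=float)
    n_cells = binary.shape[0]
    features_per_cell = binary.sum(axis=1, keepdims=True)  # one value per cell
    cells_per_feature = binary.sum(axis=0, keepdims=True)  # one value per feature
    # Guard against division by zero for empty cells or unseen features.
    tf = binary / np.maximum(features_per_cell, 1)
    idf = n_cells / np.maximum(cells_per_feature, 1)
    return tf * np.log1p(idf)

# Toy binary matrix: 3 cells x 3 features.
binary = np.array([[1, 1, 0],
                   [1, 0, 0],
                   [1, 0, 1]])
normalized = tf_log_idf(binary)
print(normalized)
```

Features that are open in every cell receive the smallest IDF weight, so rarer features contribute more to the normalized values.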
- scATAcat.helper_functions.overlap_vars(adata1, adata2)[source]
Generic function to get shared variables between two AnnData objects.
Parameters:
adata1 (AnnData): An AnnData object containing the sc count matrix.
adata2 (AnnData): An AnnData object containing the sc count matrix.
Returns:
List: List of shared variables.
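Conceptually, finding shared variables is an intersection of the two feature-name lists. The sketch below shows this on plain Python lists; the helper name `shared_names` and the feature names are made up, and keeping the order of the first list is an assumption, not documented behaviour of `overlap_vars`.

```python
def shared_names(names_a, names_b):
    """Return the names present in both inputs, in the order of names_a."""
    lookup = set(names_b)  # set membership keeps the scan linear
    return [name for name in names_a if name in lookup]

sc_features = ["chr1_100_600", "chr1_900_1400", "chr2_50_550"]
bulk_features = ["chr2_50_550", "chr1_100_600", "chr3_10_510"]
common = shared_names(sc_features, bulk_features)
print(common)
```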
- scATAcat.helper_functions.preproces_sc_matrix(adata, cell_cutoff=1000, cell_cutoff_max=80000, feature_cutoff=3, remove_chrY=True, var_key='cCREs', copy=False)[source]
Preprocess a sc count matrix in AnnData format.
This function preprocesses a single-cell count matrix in AnnData format by applying the following steps:
Filters cells based on the number of features per cell using the specified cutoffs.
Filters features based on the number of cells per feature using the specified cutoff.
Optionally removes features associated with chromosome Y.
If copy is True, a new AnnData object with the preprocessed data is returned, leaving the original AnnData object unchanged. If copy is False, the original AnnData object is modified in place, and the preprocessed AnnData object is returned.
Parameters:
adata (AnnData): An AnnData object containing the sc count matrix.
cell_cutoff (int, optional): Minimum number of features required per cell. Default is 1000.
cell_cutoff_max (int, optional): Maximum number of features allowed per cell. Default is 80000.
feature_cutoff (int, optional): Minimum number of cells required per feature. Default is 3.
remove_chrY (bool, optional): Whether to remove features associated with chromosome Y. Default is True.
var_key (str, optional): Key for accessing feature information in AnnData.var. Default is ‘cCREs’.
copy (bool, optional): If True, a copy of the AnnData object is returned; if False, the original AnnData object is modified. Default is False.
Returns:
AnnData: The preprocessed AnnData object.
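The three filtering steps listed above can be sketched on a plain binary matrix as follows. Several details are assumptions made for illustration only: the cutoffs are treated as inclusive, the steps run in the order listed, features are counted on a binarized matrix, and chromosome Y features are recognised by a "chrY" name prefix. The helper name `filter_matrix` is hypothetical.

```python
import numpy as np

def filter_matrix(binary, feature_names, cell_cutoff, cell_cutoff_max,
                  feature_cutoff, remove_chrY=True):
    """Filter a cells x features binary matrix by cell and feature coverage."""
    binary = np.asarray(binary)
    feature_names = np.asarray(feature_names)
    # 1. Keep cells whose number of open features lies within the cutoffs.
    features_per_cell = binary.sum(axis=1)
    keep_cells = (features_per_cell >= cell_cutoff) & (features_per_cell <= cell_cutoff_max)
    binary = binary[keep_cells]
    # 2. Keep features seen in at least feature_cutoff of the remaining cells.
    cells_per_feature = binary.sum(axis=0)
    keep_features = cells_per_feature >= feature_cutoff
    # 3. Optionally drop chromosome Y features (name prefix is an assumption).
    if remove_chrY:
        on_chrY = np.array([name.startswith("chrY") for name in feature_names])
        keep_features &= ~on_chrY
    return binary[:, keep_features], feature_names[keep_features], keep_cells

binary = np.array([[1, 1, 1, 1],
                   [1, 1, 0, 0],
                   [1, 0, 0, 0],
                   [1, 1, 1, 1]])
names = ["chr1_1_500", "chr2_1_500", "chr3_1_500", "chrY_1_500"]
filtered, kept_names, keep_cells = filter_matrix(
    binary, names, cell_cutoff=2, cell_cutoff_max=4, feature_cutoff=2)
print(filtered.shape, kept_names.tolist())
```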
- scATAcat.helper_functions.preprocessing_libsize_norm_log2(adata)[source]
Perform library-size normalization and log2 transformation on the AnnData object.
Normalized and log2 transformed matrix is added as a layer with keyword “libsize_norm_log2”.
Parameters:
adata (AnnData): An AnnData object containing the sc count matrix.
Returns:
AnnData: The AnnData object with the libsize_norm_log2 normalized layer added.
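The scaling factor and pseudocount used by this function are not given in this reference. The sketch below therefore assumes a counts-per-million style scaling and a pseudocount of 1, purely to illustrate what library-size normalization followed by a log2 transformation looks like. The helper name and the `scale` parameter are hypothetical.

```python
import numpy as np

def libsize_norm_log2(counts, scale=1e6):
    """Scale each row to a common library size, then apply log2(x + 1)."""
    counts = np.asarray(counts, dtype=float)
    library_size = counts.sum(axis=1, keepdims=True)  # total counts per sample
    scaled = counts / np.maximum(library_size, 1) * scale
    return np.log2(scaled + 1)

# Two samples with the same proportions but a 10x difference in depth.
counts = np.array([[1, 3],
                   [10, 30]])
normalized = libsize_norm_log2(counts)
print(normalized)
```

After normalization the two rows are identical, which is the point of correcting for library size.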
- scATAcat.helper_functions.preprocessing_standardization(adata, input_layer_key='libsize_norm_log2', output_layer_key='libsize_norm_log2_std', std_key=None, mean_key=None, std_=None, mean_=None, zero_center=True)[source]
Perform z-normalization at the feature level. If the standard deviation (std) and mean are already included in the AnnData (adata), the function applies the normalization directly. Otherwise, it calculates them from the layer given by input_layer_key, adds them to the AnnData, and then performs z-normalization.
Additionally, if alternative std_ and mean_ matrices/arrays are provided, these values are utilized for the calculations instead of assuming zero mean and unit variance.
Parameters:
adata (AnnData): An AnnData object containing the sc count matrix.
input_layer_key (str): The key for accessing the layer to which standardization is applied. Default is “libsize_norm_log2”.
output_layer_key (str): The key for the standardized layer to be added. Default is “libsize_norm_log2_std”.
std_key (str): The key for the standard deviation to be added. If None, “feature_std” is added as key.
mean_key (str): The key for the mean to be added. If None, “feature_mean” is added as key.
std_ (numpy array): Array of standard deviations. If specified, it is utilized for the z-score calculations instead of assuming zero mean and unit variance. Default is None.
mean_ (numpy array): Array of means. If specified, it is utilized for the z-score calculations instead of assuming zero mean and unit variance. Default is None.
Returns:
AnnData: The AnnData object with the libsize_norm_log2_std standardized layer added.
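A feature-level z-normalization that can either compute its own statistics or reuse externally supplied ones can be sketched as below. This is an illustration on plain NumPy arrays, not the package's code; the helper name `standardize`, the use of the population standard deviation, and the handling of zero-variance features are assumptions.

```python
import numpy as np

def standardize(matrix, mean_=None, std_=None):
    """Z-score each feature (column); reuse mean_/std_ when they are given."""
    matrix = np.asarray(matrix, dtype=float)
    if mean_ is None:
        mean_ = matrix.mean(axis=0)
    if std_ is None:
        std_ = matrix.std(axis=0)  # population standard deviation (assumption)
    std_ = np.where(std_ == 0, 1.0, std_)  # avoid dividing by zero
    return (matrix - mean_) / std_, mean_, std_

# Compute statistics on one matrix ...
prototypes = np.array([[1.0, 10.0],
                       [3.0, 30.0]])
proto_z, feature_mean, feature_std = standardize(prototypes)

# ... and apply the same statistics to another matrix.
pseudobulk = np.array([[2.0, 40.0]])
pseudo_z, _, _ = standardize(pseudobulk, mean_=feature_mean, std_=feature_std)
print(proto_z, pseudo_z)
```

Reusing the statistics of one matrix for another puts both on the same scale, which matters when the two are later compared in a shared space.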
- scATAcat.helper_functions.sparse_mean_variance_axis(mtx: spmatrix, axis: int)[source]
This code and internal functions are based on sklearn’s sparsefuncs.mean_variance_axis.
Modifications:
allow deciding on the output type, which can increase accuracy when calculating the mean and variance of 32bit floats.
This doesn’t currently implement support for null values, but could.
Uses numba not cython
- scATAcat.helper_functions.subset_adata_obs(adata, obs_list, copy_=True)[source]
Subset the observations (cells) of an AnnData object based on a specified list.
The resulting AnnData object includes only the observations specified in the obs_list.
Parameters:
adata (AnnData): An AnnData object containing the sc count matrix.
obs_list (list): A list of observation names (cells) to retain in the subset.
copy_ (bool, optional): If True, a copy of the AnnData object is returned; if False, the original AnnData object is modified. Default is True.
Returns:
AnnData: The AnnData object with a subset of observations.
- scATAcat.helper_functions.subset_adata_vars(adata, vars_list, copy_=True)[source]
Subset the variables (features) of an AnnData object based on a specified list.
The resulting AnnData object includes only the variables specified in the vars_list.
Parameters:
adata (AnnData): An AnnData object containing the sc count matrix.
vars_list (list): A list of variable names (features) to retain in the subset.
copy_ (bool, optional): If True, a copy of the AnnData object is returned; if False, the original AnnData object is modified. Default is True.
Returns:
AnnData: The AnnData object with a subset of variables.
scATAcat.plot_functions module
- scATAcat.plot_functions.cell_feature_statistics(adata, binary_layer_key='binary')[source]
Calculates the cell and feature statistics and adds them to AnnData.
Parameters:
adata (AnnData): An AnnData object containing the sc count matrix.
binary_layer_key (str, optional): The key for accessing the layer used to calculate the cell and feature statistics. Default is ‘binary’.
Returns:
- AnnData with the following features:
num_cell_per_feature: how many cells have a count for a feature? / number of cells sharing a feature
num_feature_per_cell : how many features are open for a cell? / number of features in a cell
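On a binary cells x features matrix, the two statistics reduce to row and column sums. A small NumPy sketch with a made-up matrix, independent of the AnnData container:

```python
import numpy as np

# Toy binary matrix: rows are cells, columns are features.
binary = np.array([[1, 0, 1],
                   [1, 1, 1],
                   [0, 0, 1]])

num_feature_per_cell = binary.sum(axis=1)  # open features in each cell
num_cell_per_feature = binary.sum(axis=0)  # cells in which each feature is open
print(num_feature_per_cell, num_cell_per_feature)
```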
- scATAcat.plot_functions.plot_cell_statistics(adata, binary_layer_key='binary', color=None, edgecolor=None, bins=None, xlabel=None, ylabel=None, title=None, threshold=None, save=False, save_dir=None, dpi=300)[source]
Plots the cell statistics. In simpler terms, this method shows how densely features are occupied by cells.
It provides a visual representation of the distribution and concentration of these cells within the features.
Parameters:
adata (AnnData): An AnnData object containing the sc count matrix.
binary_layer_key (str, optional): The key for accessing the layer to calculate the cell and feature statistics, Default ‘binary’.
save (bool, optional): Whether or not to save the figure, Default False.
save_dir (str, optional): Directory to save the figure. Default is None, which saves to the current directory.
dpi (int, optional): Resolution of the saved figure in dots per inch. Default is 300.
kwds: Passed to matplotlib.pyplot.hist().
Returns:
- AnnData object with the following features:
num_cell_per_feature (int): how many cells have a count for a feature? / number of cells sharing a feature
num_feature_per_cell : how many features are open for a cell? / number of features in a cell
Cell statistic plot
- scATAcat.plot_functions.plot_feature_statistics(adata, binary_layer_key='binary', color=None, edgecolor=None, bins=None, xlabel=None, ylabel=None, title=None, threshold=None, save=False, save_dir=None, dpi=300, fig_size_inches=(15, 15))[source]
Plots the feature statistics. In simpler terms, this method shows how densely cells are occupied by features.
It provides a visual representation of the distribution and concentration of these features within the cells.
Parameters:
adata (AnnData): An AnnData object containing the sc count matrix.
binary_layer_key (str, optional): The key for accessing the layer to calculate the cell and feature statistics, Default ‘binary’.
save (bool, optional): Whether or not to save the figure, Default False.
save_dir (str, optional): Directory to save the figure. Default is None, which saves to the current directory.
dpi (int, optional): Resolution of the saved figure in dots per inch. Default is 300.
kwds: Passed to matplotlib.pyplot.hist().
Returns:
- AnnData object with the following features:
num_cell_per_feature (int): how many cells have a count for a feature? / number of cells sharing a feature
num_feature_per_cell : how many features are open for a cell? / number of features in a cell
Feature statistic plot
- scATAcat.plot_functions.plot_gene_activity_of_UMAP(adata, gene_name, activity_matrix, out_path, point_size=22, cmap=None)[source]
Plot the UMAP embedding colored by the given gene’s activity across single cells.
This function saves a Matplotlib figure to the specified file path.
Parameters:
adata (AnnData): An AnnData object containing the sc count matrix.
gene_name (str): Name of the gene.
activity_matrix (DataFrame): Gene activity score of the gene across cells. (Rows:cells x columns:genes (str))
out_path (str): The path to the output directory.
point_size: Size of the cell points displayed on the UMAP.
cmap: Color map object passed to sc.pl.umap()
Returns:
None
- scATAcat.plot_functions.plot_pca_dist_cent_heatmap(trained_bulk_pca_df_w_labels, projected_pseudobulk_pca_df, cmap='Blues_r')[source]
Plot a heatmap visualizing the pairwise Euclidean distances between centroids of prototypes and pseudobulks.
This function combines PCA components of projected pseudobulk data and trained prototype data, calculates centroids for trained prototype data, and plots a heatmap using Seaborn’s clustermap.
Parameters:
trained_bulk_pca_df_w_labels (DataFrame): DataFrame containing PCA components of trained bulk data with labels.
projected_pseudobulk_pca_df (DataFrame): DataFrame containing PCA components of projected pseudobulk data.
cmap (str, optional): Colormap for the heatmap. Default is ‘Blues_r’.
Returns:
- tuple: A tuple containing:
sns.ClusterGrid: Seaborn ClusterGrid object representing the heatmap.
DataFrame: DataFrame containing the pairwise Euclidean distances.
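The numerical part of this function, averaging prototype replicates into centroids and measuring Euclidean distances to pseudobulks in PCA space, can be sketched with NumPy as follows. The helper names, labels and coordinates are made up for illustration, and the sketch returns a pseudobulk x prototype array rather than the combined DataFrame described above.

```python
import numpy as np

def centroids_by_label(points, labels):
    """Average the rows of `points` that share a label."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    centers = np.array([points[labels == name].mean(axis=0) for name in names])
    return names, centers

def pairwise_euclidean(points_a, points_b):
    """Distance from every row of points_a to every row of points_b."""
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    diff = a[:, None, :] - b[None, :, :]  # broadcast to (len(a), len(b), dims)
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy PCA coordinates: two "B" replicates and one "T" replicate.
prototype_pcs = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
prototype_labels = ["B", "B", "T"]
names, centers = centroids_by_label(prototype_pcs, prototype_labels)

pseudobulk_pcs = np.array([[1.0, 0.0], [10.0, 7.0]])
distances = pairwise_euclidean(pseudobulk_pcs, centers)
print(names, distances)
```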
- scATAcat.plot_functions.plot_pca_dist_heatmap(trained_bulk_pca_df_w_labels, projected_pseudobulk_pca_df, cmap='Blues_r')[source]
Plot a heatmap visualizing the pairwise Euclidean distances of prototypes and pseudobulks.
This function combines PCA components of projected pseudobulk data and trained prototype data, calculates the pairwise Euclidean distances, and plots a heatmap using Seaborn’s clustermap.
Parameters:
trained_bulk_pca_df_w_labels (DataFrame): DataFrame containing PCA components of trained prototypes with labels.
projected_pseudobulk_pca_df (DataFrame): DataFrame containing PCA components of projected pseudobulk data.
cmap (str, optional): Colormap for the heatmap. Default is ‘Blues_r’.
Returns:
- tuple: A tuple containing:
sns.ClusterGrid: Seaborn ClusterGrid object representing the heatmap.
DataFrame: DataFrame containing the pairwise Euclidean distances.
- scATAcat.plot_functions.projection(prototype_adata, pseudobulk_adata, prototype_layer_key='libsize_norm_log2_std', pseudobulk_layer_key='libsize_norm_log2_bulk_scaled', prototype_label_font_size=18, pseudobulk_label_font_size=18, prototype_colors=None, cmap=None, pseudobulk_colors=None, color_key='clustering_color', pseudobulk_point_size=180, prototype_point_size=150, pseudobulk_point_alpha=0.8, prototype_point_alpha=0.7, prototype_legend=True, pseudobulk_legend=True, save_path=None, dpi=300, fig_size_inches=(15, 15))[source]
Custom 3D PCA projection of prototypes and pseudobulks.
Parameters:
prototype_adata (AnnData): An AnnData object containing the prototype count matrix.
pseudobulk_adata (AnnData): An AnnData object containing the pseudobulk count matrix.
prototype_layer_key (str): The key for accessing the prototype layer for projection. Default ‘libsize_norm_log2_std’.
pseudobulk_layer_key (str): The key for accessing the pseudobulk layer for projection. Default ‘libsize_norm_log2_bulk_scaled’
prototype_label_font_size (int): Font size of the prototype labels on the PCA projection. If set to 0, no labels will be plotted. Default 18.
pseudobulk_label_font_size (int): Font size of the pseudobulk labels on the PCA projection. If set to 0, no labels will be plotted. Default 18.
prototype_colors (List[str] or None): A list of color codes to be used for plotting prototypes. If None, colors will be chosen by the cmap parameter.
cmap (str): Matplotlib colormap used to color-code the prototypes if prototype_colors is None.
pseudobulk_colors (List[str] or None): A list of color codes to be used for plotting pseudobulks. If None, colors will be determined by color_key parameter
color_key (str, optional): The key for accessing the cluster colors in the sc data. If provided, the pseudobulk points will be colored based on the cluster colors they originated from.
pseudobulk_point_size (int): Size of the pseudobulk point displayed on the plot. Default 180.
prototype_point_size (int): Size of the prototype point displayed on the plot. Default 150.
pseudobulk_point_alpha (float): Parameter controlling the transparency of the plotted pseudobulks. Ranges between 0 (transparent) and 1 (opaque). Default 0.8.
prototype_point_alpha (float): Parameter controlling the transparency of the plotted prototypes. Ranges between 0 (transparent) and 1 (opaque). Default 0.7.
prototype_legend (bool): A boolean value indicating whether or not to include prototype-related items in the legend. Default True.
pseudobulk_legend (bool): A boolean value indicating whether or not to include pseudobulk-related items in the legend. Default True.
save_path (str or None): Path where the plot should be saved. If None, the plot is not saved.
dpi (int): The resolution in dots per inch. Default 300.
fig_size_inches (tuple): A tuple representing the size (width, height) of the figure in inches. Default (15,15).
Returns:
3D PCA projection figure.
PCA transformed values of prototypes.
PCA transformed values of pseudobulks.
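The idea of training a PCA on the prototypes and projecting the pseudobulks into the same space can be sketched with an SVD in NumPy. This is a simplified illustration under stated assumptions: the data are random, the helper names `fit_pca` and `transform` are hypothetical, and the real function may differ in how it centers the data and which solver it uses. Plotting is omitted.

```python
import numpy as np

def fit_pca(matrix, n_components=3):
    """Fit PCA via SVD; return the feature means and principal axes."""
    matrix = np.asarray(matrix, dtype=float)
    mean = matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(matrix - mean, full_matrices=False)
    return mean, vt[:n_components]

def transform(matrix, mean, components):
    """Project samples onto previously fitted principal axes."""
    return (np.asarray(matrix, dtype=float) - mean) @ components.T

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(6, 10))   # 6 prototype samples x 10 features
pseudobulks = rng.normal(size=(4, 10))  # 4 pseudobulks x the same 10 features

mean, components = fit_pca(prototypes, n_components=3)
prototype_pcs = transform(prototypes, mean, components)
pseudobulk_pcs = transform(pseudobulks, mean, components)
print(prototype_pcs.shape, pseudobulk_pcs.shape)
```

Because the axes are fitted on the prototypes only, the pseudobulks do not influence the space they are placed in.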
scATAcat.pseudobulk_functions module
- scATAcat.pseudobulk_functions.get_closest_prototype_to_pseudobulk(pseudobulk_prototype_centroid_euclidean_dis_df)[source]
Calculates the distances between prototypes and pseudobulks and returns the closest prototype to a pseudobulk.
Parameters:
pseudobulk_prototype_centroid_euclidean_dis_df (Pandas Dataframe): square matrix of pairwise distances between prototype centroids and pseudobulk samples. Can be obtained by running the plot_pca_dist_cent_heatmap() function.
Returns:
{pseudobulk:closest_prototype} dictionary
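Picking the closest prototype for each pseudobulk amounts to taking the minimum of each row of a distance table. The sketch below uses a nested dictionary with made-up names and distances in place of the DataFrame; the helper name `closest_by_row` is hypothetical.

```python
def closest_by_row(distances):
    """Map each row key to the column key with the smallest distance."""
    return {row: min(cols, key=cols.get) for row, cols in distances.items()}

# Toy distances: pseudobulk -> {prototype: Euclidean distance}.
distances = {
    "cluster_0": {"B_cell": 1.5, "T_cell": 4.0},
    "cluster_1": {"B_cell": 3.2, "T_cell": 0.9},
}
closest = closest_by_row(distances)
print(closest)
```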
- scATAcat.pseudobulk_functions.get_closest_pseubulk_to_prototype(pseudobulk_prototype_centroid_euclidean_dis_df)[source]
Calculates the distances between pseudobulks and prototypes and returns the closest pseudobulk to a prototype.
Parameters:
pseudobulk_prototype_centroid_euclidean_dis_df (Pandas Dataframe): square matrix of pairwise distances between prototype centroids and pseudobulk samples. Can be obtained by running the plot_pca_dist_cent_heatmap() function.
Returns:
{prototype:closest_pseudobulk} dictionary
- scATAcat.pseudobulk_functions.get_pseudobulk_matrix(adata, cluster_key='leiden', method='sum')[source]
Constructs pseudobulk by features matrix given the cluster key.
Parameters:
adata (AnnData): An AnnData object containing the sc count matrix.
cluster_key (str, optional): The key for the cluster key from which the pseudobulk matrix is constructed. Default is “leiden”.
- method: method to aggregate the cells:
sum: sums the feature counts across cells
mean: takes the mean of the feature counts across cells
Returns:
Pandas dataframe in the shape of pseudobulk x feature.
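Aggregating cells into pseudobulks by cluster label can be sketched as below, with both the sum and the mean variant. The code works on a NumPy array plus a list of labels instead of an AnnData object, and the helper name `pseudobulk` and the toy data are assumptions made for illustration.

```python
import numpy as np

def pseudobulk(counts, cluster_labels, method="sum"):
    """Aggregate a cells x features matrix into a clusters x features matrix."""
    if method not in ("sum", "mean"):
        raise ValueError("method must be 'sum' or 'mean'")
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(cluster_labels)
    names = sorted(set(labels.tolist()))
    rows = []
    for name in names:
        block = counts[labels == name]  # all cells assigned to this cluster
        rows.append(block.sum(axis=0) if method == "sum" else block.mean(axis=0))
    return names, np.vstack(rows)

counts = np.array([[1, 0],
                   [2, 1],
                   [0, 4]])
labels = ["0", "0", "1"]
names, summed = pseudobulk(counts, labels, method="sum")
_, averaged = pseudobulk(counts, labels, method="mean")
print(names, summed, averaged)
```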
- scATAcat.pseudobulk_functions.get_pseudobulk_matrix_ext(adata_to_subset, adata_to_get_clusters, cluster_key='leiden', method='sum')[source]
Constructs pseudobulk by features matrix given the cluster key and AnnData objects.
Parameters:
adata_to_subset (AnnData): An AnnData object containing the sc count matrix.
adata_to_get_clusters (AnnData): An AnnData object containing the clustering information for the given cluster_key.
cluster_key (str, optional): The key for the cluster key from which the pseudobulk matrix is constructed. Default is “leiden”.
- method: method to aggregate the cells:
sum: sums the feature counts across cells
mean: takes the mean of the feature counts across cells
Returns:
Pandas dataframe in the shape of pseudobulk x feature.
- scATAcat.pseudobulk_functions.get_pseudobulk_to_prototype_distance(pseudobulk_prototype_centroid_euclidean_dis_df, pbulk_to_prototype=True)[source]
Transforms Euclidean distances into scaled similarities from the perspective of either the pseudobulk or the bulk samples.
This function takes a square matrix of pairwise Euclidean distances between bulk centroids and pseudobulk samples.
It then scales the distances to the minimum and returns a DataFrame representing the percentage contributions for each sample.
Parameters:
pseudobulk_prototype_centroid_euclidean_dis_df (DataFrame): A square matrix of pairwise distances between bulk centroids and pseudobulk samples.
- pbulk_to_prototype (bool, optional):
If True, the distances are determined by the prototypes’ perspective.
If False, the distances are determined by the pseudobulk samples’ perspective. Default is True.