seismicrna.cluster package
Subpackages
Submodules
- seismicrna.cluster.addclust.add_orders(report_file: Path, max_order: int, *, tmp_dir: Path, brotli_level: int, n_procs: int)
Add orders to an existing report and dataset.
- seismicrna.cluster.addclust.run(input_path: tuple[str, ...], *, max_clusters: int = 0, brotli_level: int = 10, max_procs: int = 16, parallel: bool = True, tmp_pfx='./tmp-', keep_tmp=False) -> list[Path]
Add more clusters to a dataset that was already clustered.
- Parameters:
max_clusters (int) – Attempt to find at most this many clusters [keyword-only, default: 0]
brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]
max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 16]
parallel (bool) – Run tasks in parallel or in series [keyword-only, default: True]
tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: './tmp-']
keep_tmp – Keep temporary files after finishing [keyword-only, default: False]
- seismicrna.cluster.addclust.update_batches(dataset: ClusterMutsDataset, new_orders: list[RunOrderResults], tmp_dir: Path, brotli_level: int)
Update the cluster memberships in batches.
- seismicrna.cluster.addclust.update_field(report: ClusterReport, field: Field, new_orders: list[RunOrderResults], attr: str)
Merge the field from the original report with the field from the new orders.
- seismicrna.cluster.addclust.update_log_counts(new_orders: list[RunOrderResults], tmp_dir: Path, out_dir: Path, sample: str, ref: str, sect: str)
Update the expected log counts of unique bit vectors.
- seismicrna.cluster.addclust.update_report(original_report: ClusterReport, max_order: int, best_order: int, new_orders: list[RunOrderResults], checksums: list[str], began: datetime, ended: datetime, top: Path)
- class seismicrna.cluster.batch.ClusterMutsBatch(*, resps: DataFrame, **kwargs)
Bases: ClusterReadBatch, PartialMutsBatch, SectionMutsBatch
- property read_weights
Weights for each read when computing counts.
- class seismicrna.cluster.batch.ClusterReadBatch(*, resps: DataFrame, **kwargs)
Bases: PartialReadBatch
- property num_reads: Series
Number of reads.
- property read_nums
Read numbers.
Cluster Comparison Module
Auth: Matty
Collect and compare the results from independent runs of EM clustering.
- class seismicrna.cluster.compare.RunOrderResults(runs: list[EmClustering])
Bases: object
Results of clustering runs of the same order.
- property bic
BIC of the best run.
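For orientation, the Bayesian Information Criterion is conventionally defined as k·ln(n) − 2·ln(L), where k is the number of parameters, n is the number of data points, and ln(L) is the maximized log likelihood. The sketch below implements only that textbook formula; the helper name `calc_bic` and its arguments are hypothetical and may not match how seismicrna computes the value internally.

```python
import math


def calc_bic(n_params: int, n_data: int, log_like: float) -> float:
    """Textbook BIC: k * ln(n) - 2 * ln(L).

    Hypothetical helper for illustration; not part of seismicrna.
    """
    if n_data <= 0:
        raise ValueError("n_data must be positive")
    return n_params * math.log(n_data) - 2.0 * log_like
```

Under this definition a smaller BIC indicates a better trade-off between fit and model complexity.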
- seismicrna.cluster.compare.assign_clusterings(mus1: ndarray, mus2: ndarray)
Optimally assign clusters from two groups to each other.
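One way to picture an optimal assignment is to score every pair of clusters and then pick the one-to-one pairing with the largest total score. The sketch below does this by brute force over all permutations. It assumes a square score matrix is already available (for example, pairwise correlations) and the name `assign_by_brute_force` is hypothetical; it is not the library's implementation. For more than a handful of clusters, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the more practical choice.

```python
from itertools import permutations


def assign_by_brute_force(scores: list[list[float]]) -> list[tuple[int, int]]:
    """Pair row i with column j so that the summed score is maximal.

    Hypothetical helper: tries every permutation, so it is only
    practical for small numbers of clusters.
    """
    n = len(scores)
    if any(len(row) != n for row in scores):
        raise ValueError("scores must be a square matrix")
    # Each permutation maps row index i to column index perm[i].
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(scores[i][j] for i, j in enumerate(perm)),
    )
    return list(enumerate(best))
```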
- seismicrna.cluster.compare.calc_mean_pearson(run1: EmClustering, run2: EmClustering)
Compute the mean Pearson correlation between the clusters.
- seismicrna.cluster.compare.calc_nrmsd_groups(mus1: ndarray, mus2: ndarray)
Calculate the NRMSD of each pair of clusters in two groups.
- seismicrna.cluster.compare.calc_pearson_groups(mus1: ndarray, mus2: ndarray)
Calculate the Pearson correlation of each pair of clusters in two groups.
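To make "each pair of clusters in two groups" concrete, here is a dependency-free sketch that builds the full matrix of Pearson correlations. It assumes each group is a list of clusters and each cluster is a list of per-position mutation rates; the real function takes ndarrays whose layout is not documented here, and the names `pearson` and `pearson_groups` are hypothetical.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)


def pearson_groups(group1: list[list[float]],
                   group2: list[list[float]]) -> list[list[float]]:
    """Matrix whose [i][j] entry correlates cluster i of group1
    with cluster j of group2 (hypothetical layout)."""
    return [[pearson(c1, c2) for c2 in group2] for c1 in group1]
```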
- seismicrna.cluster.compare.calc_rms_nrmsd(run1: EmClustering, run2: EmClustering)
Compute the root-mean-square NRMSD between the clusters.
- seismicrna.cluster.compare.calc_rmsd_groups(mus1: ndarray, mus2: ndarray)
Calculate the RMSD of each pair of clusters in two groups.
- seismicrna.cluster.compare.find_best_order(orders: list[RunOrderResults]) -> int
Find the number of clusters with the best (smallest) BIC.
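The selection rule described here (smallest BIC wins) can be sketched with a plain mapping from order to BIC. The function name `best_order_by_bic` and the dictionary input are assumptions made for illustration; the actual function takes a list of RunOrderResults.

```python
def best_order_by_bic(bic_by_order: dict[int, float]) -> int:
    """Return the order (number of clusters) with the smallest BIC.

    Hypothetical helper; ties go to the order listed first.
    """
    if not bic_by_order:
        raise ValueError("no orders to compare")
    return min(bic_by_order, key=lambda order: bic_by_order[order])
```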
- seismicrna.cluster.compare.get_common_best_run_attr(orders: list[RunOrderResults], attr: str)
Get an attribute of the best clustering run from every order, and confirm that key(attribute) is identical for all orders.
- seismicrna.cluster.compare.get_common_order(runs: list[EmClustering])
Find the order of the clustering (the number of clusters) from a list of EM clustering runs of the same order. If multiple orders are found, then raise a ValueError.
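The consistency check can be sketched on a plain list of integers standing in for the order of each run. The name `common_order` is hypothetical, and raising on an empty list is an assumption, since the documentation only specifies the multiple-orders case.

```python
def common_order(orders: list[int]) -> int:
    """Return the single order shared by all runs.

    Hypothetical helper: raises ValueError if the runs disagree
    (or, by assumption, if there are no runs at all).
    """
    unique = set(orders)
    if len(unique) != 1:
        raise ValueError(
            f"Expected exactly one order, but got {sorted(unique)}"
        )
    return unique.pop()
```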
- seismicrna.cluster.compare.get_log_exp_obs_counts(orders: list[RunOrderResults])
Get the expected and observed log counts of each bit vector.
- seismicrna.cluster.compare.sort_replicate_runs(runs: list[EmClustering])
Sort the runs of EM clustering by decreasing likelihood so that the run with the best (largest) likelihood comes first.
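A minimal sketch of this ordering, using a stand-in dataclass instead of EmClustering. The `Run` class and `sort_runs` function are hypothetical; only the sort direction (largest log likelihood first) is taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Stand-in for one clustering run (hypothetical)."""
    name: str
    log_like: float


def sort_runs(runs: list[Run]) -> list[Run]:
    """Return the runs sorted by decreasing log likelihood."""
    return sorted(runs, key=lambda run: run.log_like, reverse=True)
```

With this ordering, the first element of the returned list is the best run.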
- seismicrna.cluster.csv.copy_all_run_tables(to_dir: Path, from_dir: Path, sample: str, ref: str, sect: str, max_k: int, num_runs: int)
- seismicrna.cluster.csv.copy_single_run_table(to_dir: Path, from_dir: Path, sample: str, ref: str, sect: str, table: str, k: int, run: int)
Hard-link (if possible – otherwise, copy) a table for a single run to a new directory.
- seismicrna.cluster.csv.get_count_path(top: Path, sample: str, ref: str, sect: str)
Build a path for a table of bit vector counts.
- seismicrna.cluster.csv.get_table_path(top: Path, sample: str, ref: str, sect: str, table: str, k: int, run: int)
Build a path for a table of clustering results.
- seismicrna.cluster.csv.write_log_counts(orders: list[RunOrderResults], top: Path, sample: str, ref: str, sect: str)
Write the expected and observed log counts of unique bit vectors to a CSV file.
- seismicrna.cluster.csv.write_mus(run: EmClustering, top: Path, sample: str, ref: str, sect: str, rank: int)
- seismicrna.cluster.csv.write_props(run: EmClustering, top: Path, sample: str, ref: str, sect: str, rank: int)
- seismicrna.cluster.csv.write_single_run_table(run: EmClustering, top: Path, sample: str, ref: str, sect: str, rank: int, *, output_func: Callable[[EmClustering], DataFrame], table: str)
Write a DataFrame of one type of data from one independent run of EM clustering to a CSV file.
- class seismicrna.cluster.data.ClusterDataset
Dataset for clustered data.
- class seismicrna.cluster.data.ClusterMutsDataset(data1: MutsDataset, data2: Dataset)
Bases: ClusterDataset, ArrowDataset, UnbiasDataset
Merge cluster responsibilities with mutation data.
- classmethod get_dataset1_load_func()
Function to load Dataset 1.
- classmethod get_dataset2_type()
Type of Dataset 2.
- property max_order
Number of clusters.
- property min_mut_gap
Minimum gap between two mutations.
- property pattern
Pattern of mutations to count.
- property quick_unbias
Use the quick heuristic for unbiasing.
- property quick_unbias_thresh
Consider mutation rates less than or equal to this threshold to be 0 when using the quick heuristic for unbiasing.
- property section
Section of the dataset.
- class seismicrna.cluster.data.ClusterReadDataset(report: BatchedReport, top: Path)
Bases: ClusterDataset, LoadedDataset
Load clustering results.
- classmethod get_batch_type()
Type of batch.
- classmethod get_report_type()
Type of report.
- property max_order
Number of clusters.
- property pattern
Pattern of mutations to count.
- class seismicrna.cluster.data.JoinClusterMutsDataset(*args, **kwargs)
Bases: ClusterDataset, JoinMutsDataset, MergedUnbiasDataset
- property clusts
Index of order and cluster numbers.
- classmethod get_batch_type()
Type of batch.
- classmethod get_dataset_load_func()
Function to load one constituent dataset.
- classmethod get_report_type()
Type of report.
- property max_order
Number of clusters.
- classmethod name_batch_attrs()
Name the attributes of each batch.
- seismicrna.cluster.delclust.del_orders(report_file: Path, max_order: int, *, tmp_dir: Path, brotli_level: int)
Delete orders from an existing report and dataset.
- seismicrna.cluster.delclust.run(input_path: tuple[str, ...], *, max_clusters: int = 0, brotli_level: int = 10, max_procs: int = 16, parallel: bool = True, tmp_pfx='./tmp-', keep_tmp=False) -> list[Path]
Delete clusters from a dataset that was already clustered.
- Parameters:
max_clusters (int) – Attempt to find at most this many clusters [keyword-only, default: 0]
brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]
max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 16]
parallel (bool) – Run tasks in parallel or in series [keyword-only, default: True]
tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: './tmp-']
keep_tmp – Keep temporary files after finishing [keyword-only, default: False]
- seismicrna.cluster.delclust.update_batches(dataset: ClusterMutsDataset, best_order: int, tmp_dir: Path, brotli_level: int)
Update the cluster memberships in batches.
- seismicrna.cluster.delclust.update_field(report: ClusterReport, field: Field, best_order: int)
Delete clusters from a field of a report.
- seismicrna.cluster.delclust.update_log_counts(best_order: int, tmp_dir: Path, out_dir: Path, sample: str, ref: str, sect: str)
Update the expected log counts of unique bit vectors.
- seismicrna.cluster.delclust.update_report(original_report: ClusterReport, max_order: int, best_order: int, checksums: list[str], began: datetime, ended: datetime, top: Path)
- class seismicrna.cluster.em.EmClustering(uniq_reads: UniqReads, order: int, *, min_iter: int, max_iter: int, em_thresh: float)
Bases: object
Run expectation-maximization to cluster the given reads into the specified number of clusters.
- property bic
Bayesian Information Criterion of the model.
- property clusters
MultiIndex of the order and cluster numbers.
- property delta_log_like
Compute the change in log likelihood from the previous to the current iteration.
- property end3s
3’ end coordinates (0-indexed).
- property end5s
5’ end coordinates (0-indexed).
- get_mus()
Log mutation rate at each position for each cluster.
- get_props()
Real and observed log proportion of each cluster.
- property log_like
Return the current log likelihood, which is the last item in the trajectory of log likelihood values.
- property log_like_prev
Return the previous log likelihood, which is the penultimate item in the trajectory of log likelihood values.
- property logn_exp
Log number of expected observations of each read.
- property masked
Masked positions (0-indexed).
- property n_data
Number of data points in the model.
- property n_params
Number of parameters in the model.
- property n_pos_total
Number of positions, including those masked.
- property n_pos_unmasked
Number of unmasked positions.
- run(seed: int | None = None)
Run the EM clustering algorithm.
- Parameters:
seed (int | None = None) – Random number generator seed.
- Returns:
This instance, in order to permit statements such as
return [em.run() for em in em_clusterings]
- Return type:
EmClustering
- property section_end5
5’ end of the section.
- property unmasked
Unmasked positions (0-indexed).
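The relationship between the log_like, log_like_prev, and delta_log_like properties of this class can be illustrated with a small stand-in that stores the trajectory of log likelihood values. The class `LogLikeTrajectory` is hypothetical, and returning NaN when too few values exist is an assumption; the documentation does not say how EmClustering behaves in that case.

```python
import math


class LogLikeTrajectory:
    """Stand-in that tracks log likelihoods across iterations."""

    def __init__(self) -> None:
        self._values: list[float] = []

    def append(self, value: float) -> None:
        self._values.append(value)

    @property
    def log_like(self) -> float:
        # Last item in the trajectory (NaN if empty, by assumption).
        return self._values[-1] if self._values else math.nan

    @property
    def log_like_prev(self) -> float:
        # Penultimate item (NaN if fewer than two, by assumption).
        return self._values[-2] if len(self._values) >= 2 else math.nan

    @property
    def delta_log_like(self) -> float:
        # Change from the previous to the current iteration.
        return self.log_like - self.log_like_prev
```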
- class seismicrna.cluster.io.ClusterBatchIO(*, sect: str, **kwargs)
Bases: ReadBatchIO, ClusterIO, ClusterReadBatch
- classmethod file_seg_type()
Type of the last segment in the path.
- class seismicrna.cluster.io.ClusterIO(*, sect: str, **kwargs)
- classmethod auto_fields()
Names and automatic values of selected fields.
- seismicrna.cluster.main.run(input_path: tuple[str, ...], *, max_clusters: int = 0, em_runs: int = 12, min_em_iter: int = 10, max_em_iter: int = 500, em_thresh: float = 0.37, brotli_level: int = 10, max_procs: int = 16, parallel: bool = True, force: bool = False, tmp_pfx='./tmp-', keep_tmp=False) -> list[Path]
Infer alternative structures by clustering reads’ mutations.
- Parameters:
max_clusters (int) – Attempt to find at most this many clusters [keyword-only, default: 0]
em_runs (int) – Repeat EM this many times for each number of clusters [keyword-only, default: 12]
min_em_iter (int) – Run EM for at least this many iterations (times number of clusters) [keyword-only, default: 10]
max_em_iter (int) – Run EM for at most this many iterations (times number of clusters) [keyword-only, default: 500]
em_thresh (float) – Stop EM when the log likelihood increases by less than this threshold [keyword-only, default: 0.37]
brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]
max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 16]
parallel (bool) – Run tasks in parallel or in series [keyword-only, default: True]
force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]
tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: './tmp-']
keep_tmp – Keep temporary files after finishing [keyword-only, default: False]
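How min_em_iter, max_em_iter, and em_thresh might interact can be sketched as a single stopping test. The function `should_stop` is hypothetical, and the precedence shown (hard cap first, then minimum, then threshold) is an assumption; only the defaults and the "times number of clusters" scaling come from the parameter descriptions above.

```python
def should_stop(n_iter: int,
                delta_log_like: float,
                order: int,
                *,
                min_em_iter: int = 10,
                max_em_iter: int = 500,
                em_thresh: float = 0.37) -> bool:
    """Decide whether an EM run of the given order should stop.

    Hypothetical sketch: both iteration limits are multiplied by
    the number of clusters (order), as the option text describes.
    """
    if n_iter >= max_em_iter * order:
        # Hard cap reached, regardless of convergence.
        return True
    if n_iter < min_em_iter * order:
        # Too early to stop, even if the likelihood has stalled.
        return False
    # Converged once the log likelihood gain drops below the threshold.
    return delta_log_like < em_thresh
```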
Cluster – Names Module
Auth: Matty
Define names for the indexes of the cluster tables.
- class seismicrna.cluster.report.ClusterReport(**kwargs: Any | Callable[[Report], Any])
Bases: BatchedReport, ClusterIO
- classmethod auto_fields()
Names and automatic values of selected fields.
- classmethod fields()
All fields of the report.
- classmethod file_seg_type()
Type of the last segment in the path.
- seismicrna.cluster.save.write_batches(dataset: MaskMutsDataset, orders: list[RunOrderResults], brotli_level: int, top: Path)
Write the cluster memberships to batch files.
- class seismicrna.cluster.uniq.UniqReads(sample: str, section: Section, min_mut_gap: int, quick_unbias: bool, quick_unbias_thresh: float, muts_per_pos: list[ndarray], batch_to_uniq: list[Series], counts_per_uniq: ndarray, **kwargs)
Bases: EndCoords
Collection of bit vectors of unique reads.
- classmethod from_dataset(dataset: MaskMutsDataset, **kwargs)
Get unique reads from a dataset.
- classmethod from_dataset_contig(dataset: MaskMutsDataset)
Get unique reads from a dataset of contiguous reads.
- get_cov_matrix()
Full boolean matrix of the covered positions.
- get_mut_matrix()
Full boolean matrix of the mutations.
- get_uniq_names()
Unique bit vectors as byte strings.
- property num_batches
Number of batches.
- property num_uniq
Number of unique reads.
- property read_end3s_zero
3’ end coordinates (0-indexed in the section).
- property read_end5s_zero
5’ end coordinates (0-indexed in the section).
- property ref
Reference name.
- property seg_end3s_zero
3’ end of every segment (0-indexed in the section).
- property seg_end5s_zero
5’ end of every segment (0-indexed in the section).
- seismicrna.cluster.uniq.get_uniq_reads(pos_nums: Iterable[int], pattern: RelPattern, batches: Iterable[SectionMutsBatch], **kwargs)
- seismicrna.cluster.write.cluster(mask_report_file: Path, max_order: int, n_runs: int, *, n_procs: int, brotli_level: int, force: bool, tmp_dir: Path, **kwargs)
Cluster unique reads from one mask dataset.
- seismicrna.cluster.write.run_max_order(uniq_reads: UniqReads, **kwargs)
Find the optimal order, from 1 to max_order.