seismicrna.cluster package

seismicrna.cluster.addclust.add_orders(report_file: Path, max_order: int, *, tmp_dir: Path, brotli_level: int, n_procs: int)

Add orders to an existing report and dataset.

seismicrna.cluster.addclust.run(input_path: tuple[str, ...], *, max_clusters: int = 0, brotli_level: int = 10, max_procs: int = 16, parallel: bool = True, tmp_pfx='./tmp-', keep_tmp=False) list[Path]

Add more clusters to a dataset that was already clustered.

Parameters:
  • max_clusters (int) – Attempt to find at most this many clusters [keyword-only, default: 0]

  • brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]

  • max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 16]

  • parallel (bool) – Run tasks in parallel or in series [keyword-only, default: True]

  • tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: './tmp-']

  • keep_tmp – Keep temporary files after finishing [keyword-only, default: False]

seismicrna.cluster.addclust.update_batches(dataset: ClusterMutsDataset, new_orders: list[RunOrderResults], tmp_dir: Path, brotli_level: int)

Update the cluster memberships in batches.

seismicrna.cluster.addclust.update_field(report: ClusterReport, field: Field, new_orders: list[RunOrderResults], attr: str)

Merge the field from the original report with the field from the new orders.

seismicrna.cluster.addclust.update_log_counts(new_orders: list[RunOrderResults], tmp_dir: Path, out_dir: Path, sample: str, ref: str, sect: str)

Update the expected log counts of unique bit vectors.

seismicrna.cluster.addclust.update_report(original_report: ClusterReport, max_order: int, best_order: int, new_orders: list[RunOrderResults], checksums: list[str], began: datetime, ended: datetime, top: Path)

class seismicrna.cluster.batch.ClusterMutsBatch(*, resps: DataFrame, **kwargs)

Bases: ClusterReadBatch, PartialMutsBatch, SectionMutsBatch

property read_weights

Weights for each read when computing counts.

class seismicrna.cluster.batch.ClusterReadBatch(*, resps: DataFrame, **kwargs)

Bases: PartialReadBatch

property num_reads: Series

Number of reads.

property read_nums

Read numbers.

Cluster Comparison Module

Author: Matty

Collect and compare the results from independent runs of EM clustering.

class seismicrna.cluster.compare.RunOrderResults(runs: list[EmClustering])

Bases: object

Results of clustering runs of the same order.

property bic

BIC of the best run.

seismicrna.cluster.compare.assign_clusterings(mus1: ndarray, mus2: ndarray)

Optimally assign clusters from two groups to each other.
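
The optimal pairing can be sketched as a minimum-cost assignment over a matrix of distances between cluster profiles. Everything below (the brute-force search over permutations, the Euclidean distance) is an illustrative stand-in, not the actual implementation, which may use a dedicated assignment algorithm:

```python
from itertools import permutations

import numpy as np

def assign_clusterings_sketch(mus1: np.ndarray, mus2: np.ndarray) -> tuple[int, ...]:
    """Return the permutation of mus2's clusters (columns) that best
    matches mus1's clusters, by minimum total Euclidean distance."""
    n_clusters = mus1.shape[1]
    # Distance between every pair of clusters (one column per cluster).
    dists = np.array([[np.linalg.norm(mus1[:, i] - mus2[:, j])
                       for j in range(n_clusters)]
                      for i in range(n_clusters)])
    # Brute-force search over all assignments (fine for small orders).
    return min(permutations(range(n_clusters)),
               key=lambda perm: sum(dists[i, j] for i, j in enumerate(perm)))

mus1 = np.array([[0.1, 0.8], [0.2, 0.9]])
mus2 = np.array([[0.8, 0.1], [0.9, 0.2]])  # same clusters, swapped order
assign_clusterings_sketch(mus1, mus2)  # -> (1, 0)
```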

seismicrna.cluster.compare.calc_mean_pearson(run1: EmClustering, run2: EmClustering)

Compute the mean Pearson correlation between the clusters.

seismicrna.cluster.compare.calc_nrmsd_groups(mus1: ndarray, mus2: ndarray)

Calculate the NRMSD of each pair of clusters in two groups.

seismicrna.cluster.compare.calc_pearson_groups(mus1: ndarray, mus2: ndarray)

Calculate the Pearson correlation of each pair of clusters in two groups.

seismicrna.cluster.compare.calc_rms_nrmsd(run1: EmClustering, run2: EmClustering)

Compute the root-mean-square NRMSD between the clusters.
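
As a rough illustration of these comparison metrics: RMSD is the root-mean-square difference between two mutation-rate profiles, and one common way to normalize it (assumed here, and not necessarily the normalization this module uses) is by the range of the combined values. The RMS NRMSD then aggregates the NRMSDs of the matched cluster pairs:

```python
import numpy as np

def rmsd(x, y):
    """Root-mean-square difference between two profiles."""
    return float(np.sqrt(np.mean((np.asarray(x) - np.asarray(y)) ** 2)))

def nrmsd(x, y):
    # Normalize by the range of the combined values: one common choice,
    # assumed here for illustration only.
    combined = np.concatenate([np.asarray(x), np.asarray(y)])
    return rmsd(x, y) / (combined.max() - combined.min())

def rms_nrmsd(pairs):
    # Root-mean-square of the NRMSD over matched cluster pairs.
    return float(np.sqrt(np.mean([nrmsd(x, y) ** 2 for x, y in pairs])))
```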

seismicrna.cluster.compare.calc_rmsd_groups(mus1: ndarray, mus2: ndarray)

Calculate the RMSD of each pair of clusters in two groups.

seismicrna.cluster.compare.find_best_order(orders: list[RunOrderResults]) int

Find the number of clusters with the best (smallest) BIC.
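
The selection criterion can be sketched in a few lines; here the list of RunOrderResults is stood in for by a plain mapping from order to the best run's BIC:

```python
# Hypothetical sketch of BIC-based model selection: among the results
# for each order (number of clusters), pick the order whose best run
# has the smallest BIC.
def find_best_order_sketch(orders: dict[int, float]) -> int:
    """orders maps each order to the BIC of its best run; return the
    order with the smallest (best) BIC."""
    return min(orders, key=orders.get)

find_best_order_sketch({1: 5000.0, 2: 4200.0, 3: 4350.0})  # -> 2
```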

seismicrna.cluster.compare.format_exp_count_col(order: int)

seismicrna.cluster.compare.get_common_best_run_attr(orders: list[RunOrderResults], attr: str)

Get an attribute of the best clustering run from every order, and confirm that key(attribute) is identical for all orders.

seismicrna.cluster.compare.get_common_order(runs: list[EmClustering])

Find the order of the clustering (the number of clusters) from a list of EM clustering runs of the same order. If multiple orders are found, then raise a ValueError.

seismicrna.cluster.compare.get_log_exp_obs_counts(orders: list[RunOrderResults])

Get the expected and observed log counts of each bit vector.

seismicrna.cluster.compare.parse_exp_count_col(col: str)

seismicrna.cluster.compare.sort_replicate_runs(runs: list[EmClustering])

Sort the runs of EM clustering by decreasing likelihood so that the run with the best (largest) likelihood comes first.
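
A minimal stand-in for this sorting, with each EmClustering run reduced to a (name, log-likelihood) pair:

```python
# Sort replicate EM runs by decreasing log likelihood so the run with
# the best (largest) likelihood comes first.
def sort_replicate_runs_sketch(runs):
    return sorted(runs, key=lambda run: run[1], reverse=True)

runs = [("run-a", -1500.2), ("run-b", -1483.7), ("run-c", -1491.0)]
sort_replicate_runs_sketch(runs)[0]  # -> ("run-b", -1483.7), the best run
```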

seismicrna.cluster.csv.copy_all_run_tables(to_dir: Path, from_dir: Path, sample: str, ref: str, sect: str, max_k: int, num_runs: int)

seismicrna.cluster.csv.copy_single_run_table(to_dir: Path, from_dir: Path, sample: str, ref: str, sect: str, table: str, k: int, run: int)

Hard-link (if possible; otherwise, copy) a table for a single run to a new directory.

seismicrna.cluster.csv.get_count_path(top: Path, sample: str, ref: str, sect: str)

Build a path for a table of bit vector counts.

seismicrna.cluster.csv.get_table_path(top: Path, sample: str, ref: str, sect: str, table: str, k: int, run: int)

Build a path for a table of clustering results.

seismicrna.cluster.csv.write_log_counts(orders: list[RunOrderResults], top: Path, sample: str, ref: str, sect: str)

Write the expected and observed log counts of unique bit vectors to a CSV file.

seismicrna.cluster.csv.write_mus(run: EmClustering, top: Path, sample: str, ref: str, sect: str, rank: int)

seismicrna.cluster.csv.write_props(run: EmClustering, top: Path, sample: str, ref: str, sect: str, rank: int)

seismicrna.cluster.csv.write_single_run_table(run: EmClustering, top: Path, sample: str, ref: str, sect: str, rank: int, *, output_func: Callable[[EmClustering], DataFrame], table: str)

Write a DataFrame of one type of data from one independent run of EM clustering to a CSV file.

class seismicrna.cluster.data.ClusterDataset

Bases: Dataset, ABC

Dataset for clustered data.

property max_order: int

Number of clusters.

class seismicrna.cluster.data.ClusterMutsDataset(data1: MutsDataset, data2: Dataset)

Bases: ClusterDataset, ArrowDataset, UnbiasDataset

Merge cluster responsibilities with mutation data.

classmethod get_dataset1_load_func()

Function to load Dataset 1.

classmethod get_dataset2_type()

Type of Dataset 2.

property max_order

Number of clusters.

property min_mut_gap

Minimum gap between two mutations.

property pattern

Pattern of mutations to count.

property quick_unbias

Use the quick heuristic for unbiasing.

property quick_unbias_thresh

Consider mutation rates less than or equal to this threshold to be 0 when using the quick heuristic for unbiasing.

property section

Section of the dataset.

class seismicrna.cluster.data.ClusterReadDataset(report: BatchedReport, top: Path)

Bases: ClusterDataset, LoadedDataset

Load clustering results.

classmethod get_batch_type()

Type of batch.

classmethod get_report_type()

Type of report.

property max_order

Number of clusters.

property pattern

Pattern of mutations to count.

class seismicrna.cluster.data.JoinClusterMutsDataset(*args, **kwargs)

Bases: ClusterDataset, JoinMutsDataset, MergedUnbiasDataset

property clusts

Index of order and cluster numbers.

classmethod get_batch_type()

Type of batch.

classmethod get_dataset_load_func()

Function to load one constituent dataset.

classmethod get_report_type()

Type of report.

property max_order

Number of clusters.

classmethod name_batch_attrs()

Name the attributes of each batch.

seismicrna.cluster.delclust.del_orders(report_file: Path, max_order: int, *, tmp_dir: Path, brotli_level: int)

Delete orders from an existing report and dataset.

seismicrna.cluster.delclust.run(input_path: tuple[str, ...], *, max_clusters: int = 0, brotli_level: int = 10, max_procs: int = 16, parallel: bool = True, tmp_pfx='./tmp-', keep_tmp=False) list[Path]

Delete clusters from a dataset that was already clustered.

Parameters:
  • max_clusters (int) – Attempt to find at most this many clusters [keyword-only, default: 0]

  • brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]

  • max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 16]

  • parallel (bool) – Run tasks in parallel or in series [keyword-only, default: True]

  • tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: './tmp-']

  • keep_tmp – Keep temporary files after finishing [keyword-only, default: False]

seismicrna.cluster.delclust.update_batches(dataset: ClusterMutsDataset, best_order: int, tmp_dir: Path, brotli_level: int)

Update the cluster memberships in batches.

seismicrna.cluster.delclust.update_field(report: ClusterReport, field: Field, best_order: int)

Delete clusters from a field of a report.

seismicrna.cluster.delclust.update_log_counts(best_order: int, tmp_dir: Path, out_dir: Path, sample: str, ref: str, sect: str)

Update the expected log counts of unique bit vectors.

seismicrna.cluster.delclust.update_report(original_report: ClusterReport, max_order: int, best_order: int, checksums: list[str], began: datetime, ended: datetime, top: Path)

class seismicrna.cluster.em.EmClustering(uniq_reads: UniqReads, order: int, *, min_iter: int, max_iter: int, em_thresh: float)

Bases: object

Run expectation-maximization to cluster the given reads into the specified number of clusters.

property bic

Bayesian Information Criterion of the model.
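
This is the standard BIC, which can be computed from quantities the class exposes (log_like, n_params, n_data). The sketch below is a standalone illustration of the formula, not the property's actual code:

```python
import math

def bic(log_like: float, n_params: int, n_data: int) -> float:
    """Standard Bayesian Information Criterion: smaller is better."""
    return n_params * math.log(n_data) - 2.0 * log_like

# A model with more parameters needs a sufficiently better likelihood
# to achieve a smaller (better) BIC.
bic(-1000.0, 10, 500)  # ~2062.1
bic(-990.0, 20, 500)   # ~2104.3: worse, despite the higher likelihood
```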

property clusters

MultiIndex of the order and cluster numbers.

property delta_log_like

Compute the change in log likelihood from the previous to the current iteration.

property end3s

3’ end coordinates (0-indexed).

property end5s

5’ end coordinates (0-indexed).

get_members(batch_num: int)

Cluster memberships of the reads in the batch.

get_mus()

Log mutation rate at each position for each cluster.

get_props()

Real and observed log proportion of each cluster.

property log_like

Return the current log likelihood, which is the last item in the trajectory of log likelihood values.

property log_like_prev

Return the previous log likelihood, which is the penultimate item in the trajectory of log likelihood values.

property logn_exp

Log number of expected observations of each read.

property masked

Masked positions (0-indexed).

property n_data

Number of data points in the model.

property n_params

Number of parameters in the model.

property n_pos_total

Number of positions, including those masked.

property n_pos_unmasked

Number of unmasked positions.

run(seed: int | None = None)

Run the EM clustering algorithm.

Parameters:

seed (int | None = None) – Random number generator seed.

Returns:

This instance, in order to permit statements such as return [em.run() for em in em_clusterings]

Return type:

EmClustering
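
The return-self pattern in action, with EmClustering stood in for by a minimal stub class:

```python
# run() returning its own instance lets several independent EM runs be
# launched and collected in a single comprehension.
class StubEm:
    def __init__(self, seed):
        self.seed = seed
        self.fitted = False

    def run(self, seed=None):
        self.fitted = True
        return self  # returning self permits chaining and comprehensions

runs = [StubEm(seed).run() for seed in range(3)]
all(run.fitted for run in runs)  # -> True
```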

property section_end5

5’ end of the section.

property unmasked

Unmasked positions (0-indexed).

class seismicrna.cluster.io.ClusterBatchIO(*, sect: str, **kwargs)

Bases: ReadBatchIO, ClusterIO, ClusterReadBatch

classmethod file_seg_type()

Type of the last segment in the path.

class seismicrna.cluster.io.ClusterIO(*, sect: str, **kwargs)

Bases: SectIO, ABC

classmethod auto_fields()

Names and automatic values of selected fields.

seismicrna.cluster.main.run(input_path: tuple[str, ...], *, max_clusters: int = 0, em_runs: int = 12, min_em_iter: int = 10, max_em_iter: int = 500, em_thresh: float = 0.37, brotli_level: int = 10, max_procs: int = 16, parallel: bool = True, force: bool = False, tmp_pfx='./tmp-', keep_tmp=False) list[Path]

Infer alternative structures by clustering reads’ mutations.

Parameters:
  • max_clusters (int) – Attempt to find at most this many clusters [keyword-only, default: 0]

  • em_runs (int) – Repeat EM this many times for each number of clusters [keyword-only, default: 12]

  • min_em_iter (int) – Run EM for at least this many iterations (times number of clusters) [keyword-only, default: 10]

  • max_em_iter (int) – Run EM for at most this many iterations (times number of clusters) [keyword-only, default: 500]

  • em_thresh (float) – Stop EM when the log likelihood increases by less than this threshold [keyword-only, default: 0.37]

  • brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]

  • max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 16]

  • parallel (bool) – Run tasks in parallel or in series [keyword-only, default: True]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: './tmp-']

  • keep_tmp – Keep temporary files after finishing [keyword-only, default: False]

Cluster Names Module

Author: Matty

Define names for the indexes of the cluster tables.

class seismicrna.cluster.report.ClusterReport(**kwargs: Any | Callable[[Report], Any])

Bases: BatchedReport, ClusterIO

classmethod auto_fields()

Names and automatic values of selected fields.

classmethod fields()

All fields of the report.

classmethod file_seg_type()

Type of the last segment in the path.

classmethod from_clusters(orders: list[RunOrderResults], uniq_reads: UniqReads, max_order: int, num_runs: int, *, min_iter: int, max_iter: int, em_thresh: float, checksums: list[str], began: datetime, ended: datetime)

Create a ClusterReport from EmClustering objects.

seismicrna.cluster.save.write_batches(dataset: MaskMutsDataset, orders: list[RunOrderResults], brotli_level: int, top: Path)

Write the cluster memberships to batch files.

class seismicrna.cluster.uniq.UniqReads(sample: str, section: Section, min_mut_gap: int, quick_unbias: bool, quick_unbias_thresh: float, muts_per_pos: list[ndarray], batch_to_uniq: list[Series], counts_per_uniq: ndarray, **kwargs)

Bases: EndCoords

Collection of bit vectors of unique reads.
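
The core idea, collapsing identical bit vectors and recording their multiplicities, can be sketched with np.unique over rows (the real class derives these structures from dataset batches rather than a single matrix):

```python
import numpy as np

# Each row is one read's bit vector (True = mutated at that position).
reads = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 0, 0],
                  [1, 0, 0]], dtype=bool)

# np.unique over rows yields the unique bit vectors, a mapping from
# each read to its unique vector, and each unique vector's count.
uniq, read_to_uniq, counts = np.unique(reads, axis=0,
                                       return_inverse=True,
                                       return_counts=True)
uniq.shape[0]  # number of unique reads (2)
counts         # how many times each unique vector occurs
```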

classmethod from_dataset(dataset: MaskMutsDataset, **kwargs)

Get unique reads from a dataset.

classmethod from_dataset_contig(dataset: MaskMutsDataset)

Get unique reads from a dataset of contiguous reads.

get_cov_matrix()

Full boolean matrix of the covered positions.

get_mut_matrix()

Full boolean matrix of the mutations.

get_uniq_names()

Unique bit vectors as byte strings.

property num_batches

Number of batches.

property num_nonuniq: int

Total number of reads (including duplicates of unique reads).

property num_uniq

Number of unique reads.

property read_end3s_zero

3’ end coordinates (0-indexed in the section).

property read_end5s_zero

5’ end coordinates (0-indexed in the section).

property ref

Reference name.

property seg_end3s_zero

3’ end of every segment (0-indexed in the section).

property seg_end5s_zero

5’ end of every segment (0-indexed in the section).

seismicrna.cluster.uniq.get_uniq_reads(pos_nums: Iterable[int], pattern: RelPattern, batches: Iterable[SectionMutsBatch], **kwargs)

seismicrna.cluster.write.cluster(mask_report_file: Path, max_order: int, n_runs: int, *, n_procs: int, brotli_level: int, force: bool, tmp_dir: Path, **kwargs)

Cluster unique reads from one mask dataset.

seismicrna.cluster.write.run_max_order(uniq_reads: UniqReads, **kwargs)

Find the optimal order, from 1 to max_order.

seismicrna.cluster.write.run_order(uniq_reads: UniqReads, order: int, n_runs: int, *, n_procs: int, **kwargs) list[EmClustering]

Run EM with a specific number of clusters.

seismicrna.cluster.write.run_orders(uniq_reads: UniqReads, min_order: int, max_order: int, n_runs: int, prev_bic: float | None, *, min_iter: int, max_iter: int, em_thresh: float, n_procs: int, top: Path, **kwargs)

Find the optimal order, from min_order to max_order.