seismicrna.relate package

Submodules

class seismicrna.relate.batch.QnamesBatch(*, names: list[str] | ndarray, **kwargs)

Bases: AllReadBatch

property num_reads

Number of reads.

classmethod simulate(batch: int, num_reads: int, formatter: ~typing.Callable[[int, int], str] = <function format_read_name>, **kwargs)

Simulate a batch.

Parameters:
  • batch (int) – Batch number.

  • num_reads (int) – Number of reads in the batch.

  • formatter (Callable[[int, int], str]) – Function to generate the name of each read: must accept the batch number and the read number and return a string.
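The formatter contract (two integers in, one string out) can be illustrated with a small stand-in. This is a minimal sketch: `my_read_name`, `name_reads`, the output format, and the choice to number reads from 0 are all illustrative assumptions, not the behavior of `format_read_name` or `QnamesBatch.simulate`.

```python
from typing import Callable


def my_read_name(batch: int, read: int) -> str:
    """Hypothetical formatter: build a read name from batch and read numbers."""
    return f"batch-{batch}-read-{read}"


def name_reads(batch: int, num_reads: int,
               formatter: Callable[[int, int], str]) -> list[str]:
    """Apply a formatter to every read number in one batch (illustration only).

    Numbering reads from 0 is an assumption made for this sketch.
    """
    return [formatter(batch, read) for read in range(num_reads)]


names = name_reads(batch=2, num_reads=3, formatter=my_read_name)
print(names)
```

Any callable with this shape should be acceptable wherever a `formatter` argument is documented, provided it returns a distinct string for each pair of numbers.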

class seismicrna.relate.batch.RelateBatch(*, section: Section, **kwargs)

Bases: SectionMutsBatch, AllReadBatch

property read_weights

Weights for each read when computing counts.

classmethod simulate(batch: int, ref: str, pmut: DataFrame, uniq_end5s: ndarray, uniq_end3s: ndarray, pends: ndarray, paired: bool, read_length: int, p_rev: float, min_mut_gap: int, num_reads: int, **kwargs)

Simulate a batch.

Parameters:
  • batch (int) – Batch number.

  • ref (str) – Name of the reference.

  • pmut (pd.DataFrame) – Rate of each type of mutation at each position.

  • uniq_end5s (np.ndarray) – Unique read 5’ end coordinates.

  • uniq_end3s (np.ndarray) – Unique read 3’ end coordinates.

  • pends (np.ndarray) – Probability of each set of unique end coordinates.

  • paired (bool) – Whether to simulate paired-end or single-end reads.

  • read_length (int) – Length of each read segment (paired-end reads only).

  • p_rev (float) – Probability that mate 1 is reversed (paired-end reads only).

  • min_mut_gap (int) – Minimum number of positions between two mutations.

  • num_reads (int) – Number of reads in the batch.
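One way to read the `uniq_end5s`, `uniq_end3s`, and `pends` parameters is as parallel arrays, where `pends[i]` is the probability that a read spans `uniq_end5s[i]` to `uniq_end3s[i]`. The sketch below draws end coordinates under that assumption. It uses plain lists and the standard library rather than NumPy arrays, the example values and the `sample_ends` helper are hypothetical, and it is not the sampling code used by `RelateBatch.simulate`.

```python
import random

# Hypothetical example values (not taken from SEISMIC-RNA).
uniq_end5s = [1, 1, 11]
uniq_end3s = [40, 60, 60]
pends = [0.5, 0.3, 0.2]


def sample_ends(num_reads: int, seed: int = 0) -> list[tuple[int, int]]:
    """Draw (end5, end3) pairs, choosing index i with probability pends[i]."""
    rng = random.Random(seed)
    indexes = rng.choices(range(len(pends)), weights=pends, k=num_reads)
    return [(uniq_end5s[i], uniq_end3s[i]) for i in indexes]


ends = sample_ends(5)
print(ends)
```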

seismicrna.relate.batch.format_read_name(batch: int, read: int)

Format a read name.

class seismicrna.relate.data.QnamesDataset(report: BatchedReport, top: Path)

Bases: LoadedDataset

Dataset of read names from the Relate step.

classmethod get_batch_type()

Type of batch.

classmethod get_report_type()

Type of report.

property pattern

Pattern of mutations to count.

class seismicrna.relate.data.RelateDataset(report: BatchedReport, top: Path)

Bases: LoadedMutsDataset

Dataset of mutations from the Relate step.

get_batch(batch: int)

Get a specific batch of data.

classmethod get_batch_type()

Type of batch.

classmethod get_report_type()

Type of report.

property paired

Whether the reads are paired-end.

property pattern

Pattern of mutations to count.

class seismicrna.relate.io.QnamesBatchIO(*, sample: str, ref: str, **kwargs)

Bases: ReadBatchIO, RelateIO, QnamesBatch

classmethod file_seg_type()

Type of the last segment in the path.

class seismicrna.relate.io.RelateBatchIO(*args, section: Section, **kwargs)

Bases: MutsBatchIO, RelateIO, RelateBatch

classmethod file_seg_type()

Type of the last segment in the path.

class seismicrna.relate.io.RelateIO(*, sample: str, ref: str, **kwargs)

Bases: RefIO, ABC

classmethod auto_fields()

Names and automatic values of selected fields.

seismicrna.relate.io.from_reads(reads: Iterable[tuple[str, tuple[list[int], [list[int]]], dict[int, int]]], sample: str, ref: str, refseq: DNA, batch: int)

Accumulate reads into relation vectors.

Relate – Main Module

Auth: Matty

Define the command line interface for the ‘relate’ command, as well as its main run function that executes the relate step.

seismicrna.relate.main.run(fasta: str, input_path: tuple[str, ...], *, out_dir: str = './out', min_reads: int = 1000, min_mapq: int = 25, phred_enc: int = 33, min_phred: int = 25, batch_size: int = 65536, ambindel: bool = True, overhangs: bool = True, clip_end5: int = 4, clip_end3: int = 6, max_procs: int = 16, parallel: bool = True, brotli_level: int = 10, force: bool = False, keep_tmp: bool = False, tmp_pfx='./tmp-')

Compute relationships between references and aligned reads.

Parameters:
  • out_dir (str) – Write all output files to this directory [keyword-only, default: ‘./out’]

  • min_reads (int) – Discard alignment maps with fewer than this many reads [keyword-only, default: 1000]

  • min_mapq (int) – Discard reads with mapping qualities below this threshold [keyword-only, default: 25]

  • phred_enc (int) – Specify the Phred score encoding of FASTQ and SAM/BAM/CRAM files [keyword-only, default: 33]

  • min_phred (int) – Mark base calls with Phred scores lower than this threshold as ambiguous [keyword-only, default: 25]

  • batch_size (int) – Limit batches to at most this many reads [keyword-only, default: 65536]

  • ambindel (bool) – Mark all ambiguous insertions and deletions [keyword-only, default: True]

  • overhangs (bool) – Retain the overhangs of paired-end mates that dovetail [keyword-only, default: True]

  • clip_end5 (int) – Clip this many bases from the 5’ end of each read [keyword-only, default: 4]

  • clip_end3 (int) – Clip this many bases from the 3’ end of each read [keyword-only, default: 6]

  • max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 16]

  • parallel (bool) – Run tasks in parallel or in series [keyword-only, default: True]

  • brotli_level (int) – Compress pickle files with this level of Brotli (0 - 11) [keyword-only, default: 10]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]

  • tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp-’]
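A call to `run` could look like the sketch below, which passes a subset of the keyword-only options at their documented defaults. The helpers `relate_options` and `run_relate` are hypothetical wrappers written for this example, and the import is deferred so that the sketch can be loaded without SEISMIC-RNA installed.

```python
def relate_options() -> dict:
    """Keyword arguments for run(); the values are the documented defaults."""
    return dict(
        out_dir="./out",
        min_reads=1000,
        min_mapq=25,
        phred_enc=33,
        min_phred=25,
        batch_size=65536,
        clip_end5=4,
        clip_end3=6,
        max_procs=16,
        brotli_level=10,
        force=False,
    )


def run_relate(fasta: str, xam_paths: tuple[str, ...]):
    """Call the relate step; requires SEISMIC-RNA to be installed."""
    # Imported lazily so this sketch can be loaded without the package.
    from seismicrna.relate.main import run
    return run(fasta, xam_paths, **relate_options())
```

For instance, `run_relate("refs.fa", ("aligned/sample1",))` would run the step on one input path; both paths here are placeholders, and what `run` returns is not documented in this section.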

class seismicrna.relate.report.RelateReport(**kwargs: Any | Callable[[Report], Any])

Bases: BatchedRefseqReport, RelateIO

classmethod fields()

All fields of the report.

classmethod file_seg_type()

Type of the last segment in the path.

refseq_file(top: Path)

seismicrna.relate.report.refseq_file_auto_fields()

seismicrna.relate.report.refseq_file_path(top: Path, sample: str, ref: str)

seismicrna.relate.report.refseq_file_seg_types()

class seismicrna.relate.sam.XamViewer(xam_input: Path, tmp_dir: Path, batch_size: int, n_procs: int = 1)

Bases: object

create_tmp_sam()

Create the temporary SAM file.

delete_tmp_sam()

Delete the temporary SAM file.

property flagstats: dict

property indexes

iter_records(batch: int)

Iterate through the records of the batch.

property n_reads

open_tmp_sam()

Open the temporary SAM file as a file object.

property paired

property ref

property sample

property tmp_sam_path

Get the path to the temporary SAM file.

seismicrna.relate.sam.read_name(line: str)

Get the name of the read in the current line of a SAM file.
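In the SAM format, the read name (QNAME) is the first tab-delimited field of each alignment line. The sketch below extracts it on that basis; `qname_of` is a hypothetical helper, and whether `read_name` is implemented this way or how it treats header lines is not stated here.

```python
def qname_of(line: str) -> str:
    """Return the QNAME, the first tab-delimited field of a SAM alignment line."""
    return line.split("\t", 1)[0]


# Hypothetical alignment record with the 11 mandatory SAM fields.
record = "read1\t0\tref\t1\t60\t4M\t*\t0\t0\tACGT\tIIII"
print(qname_of(record))
```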

seismicrna.relate.sam.tmp_xam_cmd(xam_in: Path, xam_out: Path, n_procs: int = 1)

Collate and create a temporary XAM file.

seismicrna.relate.sim.simulate_batch(sample: str, ref: str, batch: int, pmut: ~pandas.core.frame.DataFrame, uniq_end5s: ~numpy.ndarray, uniq_end3s: ~numpy.ndarray, pends: ~numpy.ndarray, paired: bool, read_length: int, p_rev: float, min_mut_gap: int, num_reads: int, formatter: ~typing.Callable[[int, int], str] = <function format_read_name>)

Simulate a pair of RelateBatchIO and QnamesBatchIO.

seismicrna.relate.sim.simulate_batches(batch_size: int, pmut: DataFrame, pclust: Series, num_reads: int, **kwargs)

seismicrna.relate.sim.simulate_cluster(first_batch: int, batch_size: int, num_reads: int, **kwargs)

Simulate all batches for one cluster.
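The relationship between `first_batch`, `batch_size`, and `num_reads` can be sketched as a partition of the reads into consecutive batches. This assumes that batches are numbered consecutively from `first_batch` and that every batch except possibly the last is full; `batch_sizes` is a hypothetical helper, not part of the module.

```python
def batch_sizes(first_batch: int, batch_size: int, num_reads: int) -> dict[int, int]:
    """Map each batch number to its number of reads (illustration only)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    sizes = {}
    batch = first_batch
    remaining = num_reads
    while remaining > 0:
        # Fill the batch completely unless fewer reads remain.
        sizes[batch] = min(batch_size, remaining)
        remaining -= sizes[batch]
        batch += 1
    return sizes


print(batch_sizes(first_batch=3, batch_size=4, num_reads=10))
```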

seismicrna.relate.sim.simulate_relate(*, out_dir: Path, tmp_dir: Path, sample: str, ref: str, refseq: DNA, batch_size: int, num_reads: int, pmut: DataFrame, uniq_end5s: ndarray, uniq_end3s: ndarray, pends: ndarray, pclust: Series, brotli_level: int, force: bool, **kwargs)

Simulate an entire relate step.

Relation Vector Writing Module

Given alignment map (BAM) files, split each file into batches of reads, write the relation vectors for each batch to a compressed file, and write a report summarizing the results.

class seismicrna.relate.write.RelationWriter(xam_view: XamViewer, seq: DNA)

Bases: object

Compute and write relation vectors for all reads from one sample mapped to one reference sequence.

property num_reads

property ref

property sample

write(*, out_dir: Path, release_dir: Path, min_mapq: int, min_reads: int, brotli_level: int, force: bool, overhangs: bool, min_phred: int, phred_enc: int, ambindel: bool, clip_end5: int, clip_end3: int, **kwargs)

Compute a relation vector for every record in a BAM file, write the vectors into one or more batch files, compute their checksums, and write a report summarizing the results.
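The checksum step can be sketched with the standard library. The example below pickles a stand-in object and computes the MD5 digest of the resulting bytes, MD5 being the checksum type that `generate_batch` in this module reports. It omits Brotli compression, which needs a third-party package, and `serialize_with_checksum` and the sample data are hypothetical rather than the module's actual file format.

```python
import hashlib
import pickle


def serialize_with_checksum(batch_data) -> tuple[bytes, str]:
    """Pickle an object and return the bytes with their MD5 hex digest."""
    payload = pickle.dumps(batch_data, protocol=pickle.HIGHEST_PROTOCOL)
    return payload, hashlib.md5(payload).hexdigest()


# Hypothetical stand-in for one batch of relation vectors.
payload, checksum = serialize_with_checksum({"batch": 0, "vectors": [[1, 1, 64]]})
print(len(checksum))
```

Recomputing the digest of a batch file and comparing it with the value recorded in a report is a simple way to detect a file that changed after it was written.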

seismicrna.relate.write.generate_batch(batch: int, *, xam_view: XamViewer, top: Path, refseq: DNA, min_mapq: int, min_qual: str, ambindel: bool, overhangs: bool, clip_end5: int, clip_end3: int, brotli_level: int)

Compute relation vectors for every SAM record in one batch, write the vectors to a batch file, and return its MD5 checksum and the number of vectors.
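`generate_batch` takes `min_qual` as a string, whereas `RelationWriter.write` takes the integers `min_phred` and `phred_enc`. A plausible link, assuming the standard ASCII encoding of Phred scores, is that the quality character equals `chr(score + phred_enc)`. The sketch below shows that conversion; `min_qual_char` and `is_low_quality` are hypothetical helpers, and whether the module converts the threshold exactly this way is an assumption.

```python
def min_qual_char(min_phred: int, phred_enc: int) -> str:
    """Return the ASCII character that encodes the minimum Phred score."""
    return chr(min_phred + phred_enc)


def is_low_quality(qual_char: str, min_qual: str) -> bool:
    """A base call is below the threshold if its quality character sorts lower."""
    return qual_char < min_qual


# With the documented defaults: 25 + 33 = 58, which is ":" in ASCII.
min_qual = min_qual_char(min_phred=25, phred_enc=33)
print(min_qual)
```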

seismicrna.relate.write.write_all(xam_files: Iterable[Path], max_procs: int, parallel: bool, **kwargs)

seismicrna.relate.write.write_one(xam_file: Path, *, fasta: Path, tmp_dir: Path, batch_size: int, n_procs: int, **kwargs)

Write the batches of relation vectors for one XAM file.