seismicrna.align package

Subpackages

Submodules

seismicrna.align.fqops.cutadapt_cmd(fq_inp: FastqUnit, fq_out: FastqUnit, *, n_procs: int, cut_q1: int, cut_q2: int, cut_g1: str, cut_a1: str, cut_g2: str, cut_a2: str, cut_o: int, cut_e: float, cut_indels: bool, cut_nextseq: bool, cut_discard_trimmed: bool, cut_discard_untrimmed: bool, cut_m: int)
seismicrna.align.fqops.fastqc_cmd(fq_unit: FastqUnit, out_prefix: Path, *, extract: bool, n_procs: int)
seismicrna.align.fqops.run_fastqc(fq_unit: FastqUnit, out_dir: Path, **kwargs)
class seismicrna.align.fqunit.FastqUnit(*, fastqz: Path | None = None, fastqy: Path | None = None, fastq1: Path | None = None, fastq2: Path | None = None, phred_enc: int, one_ref: bool)

Bases: object

Unified interface for the following sets of sequencing reads:

  • One FASTQ file of single-end reads from one sample

  • One FASTQ file of interleaved, paired-end reads from one sample

  • Two FASTQ files of mate 1 and 2 paired-end reads from one sample

  • One FASTQ file of single-end reads originating from one reference sequence in one sample

  • One FASTQ file of interleaved, paired-end reads originating from one reference sequence in one sample

  • Two FASTQ files of mate 1 and mate 2 paired-end reads originating from one reference sequence in one sample

BOWTIE2_FLAGS = {'fastq1': '-1', 'fastq2': '-2', 'fastqy': '--interleaved', 'fastqz': '-U'}
KEY_DINTER = 'dmfastqy'
KEY_DMATED = 'dmfastqx'
KEY_DSINGLE = 'dmfastqz'
KEY_INTER = 'fastqy'
KEY_MATE1 = 'fastq1'
KEY_MATE2 = 'fastq2'
KEY_MATED = 'fastqx'
KEY_SINGLE = 'fastqz'
MAX_PHRED_ENC = 127
property bowtie2_inputs

Return input file arguments for Bowtie2.
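
As a rough illustration of how the BOWTIE2_FLAGS mapping could be turned into command-line arguments, the sketch below pairs each flag with its file path. The helper bowtie2_input_args and the file names are hypothetical and are not part of seismicrna; only the flag mapping is copied from the class attribute above.

```python
from __future__ import annotations

from pathlib import Path

# Flag mapping copied from FastqUnit.BOWTIE2_FLAGS.
BOWTIE2_FLAGS = {"fastq1": "-1", "fastq2": "-2",
                 "fastqy": "--interleaved", "fastqz": "-U"}


def bowtie2_input_args(paths: dict[str, Path]) -> list[str]:
    """Hypothetical helper: flatten {key: path} into Bowtie2 arguments."""
    args = []
    for key, path in paths.items():
        # Each input file is preceded by the flag for its kind of FASTQ.
        args.extend([BOWTIE2_FLAGS[key], str(path)])
    return args


# Two mated FASTQ files (hypothetical names) become "-1 ... -2 ...".
paired_args = bowtie2_input_args({"fastq1": Path("sample_R1.fq"),
                                  "fastq2": Path("sample_R2.fq")})
print(paired_args)
```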

property cutadapt_input_args

Return input file arguments for Cutadapt.

fields(key: str)
classmethod from_paths(*, phred_enc: int, **fastq_args: list[Path])

Yield a FastqUnit for each FASTQ file (or each pair of mate 1 and mate 2 FASTQ files) whose paths are given as strings.

Parameters:
  • phred_enc (int) – ASCII offset for encoding Phred scores

  • fastq_args (list[Path]) – FASTQ files, given as lists of paths:

      - fastqz: FASTQ files of single-end reads
      - fastqy: FASTQ files of interleaved paired-end reads
      - fastqx: mated FASTQ files of paired-end reads
      - dmfastqz: demultiplexed FASTQ files of single-end reads
      - dmfastqy: demultiplexed FASTQ files of interleaved paired-end reads
      - dmfastqx: demultiplexed mated FASTQ files of paired-end reads

Yields:

FastqUnit – FastqUnit representing the FASTQ file or pair of FASTQ files. The order is determined primarily by the order of keyword arguments; within each keyword argument, by the order of file or directory paths; and for directories, by the order in which os.listdir returns file paths.
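
The phred_enc offset can be illustrated with a short sketch that decodes a quality string into Phred scores. The function decode_phred is hypothetical and only shows the arithmetic; it is not taken from seismicrna. The offset of 33 is the default listed for seismicrna.align.main.run.

```python
from __future__ import annotations


def decode_phred(qual: str, phred_enc: int = 33) -> list[int]:
    """Convert each quality character to a Phred score (hypothetical helper)."""
    # The score is the character's ASCII code minus the encoding offset.
    scores = [ord(char) - phred_enc for char in qual]
    if any(score < 0 for score in scores):
        raise ValueError(f"{qual!r} has characters below the offset {phred_enc}")
    return scores


scores = decode_phred("II5#")
print(scores)
```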

get_sample_ref_exts()

Return the sample and reference of the FASTQ file(s).

property kind
property n_reads: int

Number of reads in the FASTQ file(s).

property parent

Return the parent directory of the FASTQ file(s).

property phred_arg
property seg_types: dict[str, tuple[Segment, ...]]
to_new(*new_segments: Segment, **new_fields)

Return a new FASTQ unit with updated path fields.

seismicrna.align.fqunit.count_fastq_reads(fastq_file: Path)

Count the reads in a FASTQ file.

seismicrna.align.fqunit.fastq_gz(fastq_file: Path)

Return whether a FASTQ file is compressed with gzip.

seismicrna.align.fqunit.get_args_count_fastq_reads(fastq_file: Path)

Count the reads in a FASTQ file.

seismicrna.align.fqunit.parse_stdout_count_fastq_reads(process: CompletedProcess)

Parse the output of word count to find the number of reads.
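
For comparison, the same count can be obtained in pure Python. This is a minimal sketch, assuming every record spans exactly four lines and that gzipped files end in ".gz"; the function name count_reads_sketch is hypothetical, and the package itself parses the output of word count, as described above.

```python
import gzip
from pathlib import Path


def count_reads_sketch(fastq_file: Path) -> int:
    """Count reads as (number of lines) / 4, assuming 4-line records."""
    # Assumption: compression is detected from the file extension alone.
    opener = gzip.open if fastq_file.suffix == ".gz" else open
    with opener(fastq_file, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{fastq_file} has {n_lines} lines, "
                         f"which is not a multiple of 4")
    return n_lines // 4
```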

seismicrna.align.main.run(fasta: str, *, fastqz: tuple[str, ...] = (), fastqy: tuple[str, ...] = (), fastqx: tuple[str, ...] = (), dmfastqz: tuple[str, ...] = (), dmfastqy: tuple[str, ...] = (), dmfastqx: tuple[str, ...] = (), phred_enc: int = 33, out_dir: str = './out', keep_tmp: bool = False, force: bool = False, max_procs: int = 16, parallel: bool = True, fastqc: bool = True, qc_extract: bool = False, cut: bool = True, cut_q1: int = 25, cut_q2: int = 25, cut_g1: tuple[str, ...] = ('GCTCTTCCGATCT',), cut_a1: tuple[str, ...] = ('AGATCGGAAGAGC',), cut_g2: tuple[str, ...] = ('GCTCTTCCGATCT',), cut_a2: tuple[str, ...] = ('AGATCGGAAGAGC',), cut_o: int = 6, cut_e: float = 0.1, cut_indels: bool = True, cut_nextseq: bool = False, cut_discard_trimmed: bool = False, cut_discard_untrimmed: bool = False, cut_m: int = 20, bt2_local: bool = True, bt2_discordant: bool = False, bt2_mixed: bool = False, bt2_dovetail: bool = False, bt2_contain: bool = True, bt2_score_min_e2e: str = 'L,-1,-0.5', bt2_score_min_loc: str = 'L,1,0.5', bt2_i: int = 0, bt2_x: int = 600, bt2_gbar: int = 4, bt2_l: int = 20, bt2_s: str = 'L,1,0.1', bt2_d: int = 4, bt2_r: int = 2, bt2_dpad: int = 2, bt2_orient: str = 'fr', bt2_un: bool = True, min_mapq: int = 25, min_reads: int = 1000, sep_strands: bool = False, f1r2_plus: bool = False, minus_label: str = '-minus', tmp_pfx='./tmp-') → list[Path]

Trim FASTQ files and align them to reference sequences.

Parameters:
  • fastqz (tuple) – FASTQ file(s) of single-end reads [keyword-only, default: ()]

  • fastqy (tuple) – FASTQ file(s) of paired-end reads with mates 1 and 2 interleaved [keyword-only, default: ()]

  • fastqx (tuple) – FASTQ files of paired-end reads with mates 1 and 2 in separate files [keyword-only, default: ()]

  • dmfastqz (tuple) – Demultiplexed FASTQ files of single-end reads [keyword-only, default: ()]

  • dmfastqy (tuple) – Demultiplexed FASTQ files of paired-end reads interleaved in one file [keyword-only, default: ()]

  • dmfastqx (tuple) – Demultiplexed FASTQ files of mate 1 and mate 2 reads [keyword-only, default: ()]

  • phred_enc (int) – Specify the Phred score encoding of FASTQ and SAM/BAM/CRAM files [keyword-only, default: 33]

  • out_dir (str) – Write all output files to this directory [keyword-only, default: ‘./out’]

  • keep_tmp (bool) – Keep temporary files after finishing [keyword-only, default: False]

  • force (bool) – Force all tasks to run, overwriting any existing output files [keyword-only, default: False]

  • max_procs (int) – Run up to this many processes simultaneously [keyword-only, default: 16]

  • parallel (bool) – Run tasks in parallel or in series [keyword-only, default: True]

  • fastqc (bool) – Run FastQC on the initial and trimmed FASTQ files [keyword-only, default: True]

  • qc_extract (bool) – Unzip FastQC report files [keyword-only, default: False]

  • cut (bool) – Use Cutadapt to trim reads before alignment [keyword-only, default: True]

  • cut_q1 (int) – Trim base calls below this Phred score from read 1 [keyword-only, default: 25]

  • cut_q2 (int) – Trim base calls below this Phred score from read 2 [keyword-only, default: 25]

  • cut_g1 (tuple) – Trim this 5’ adapter from read 1 [keyword-only, default: (‘GCTCTTCCGATCT’,)]

  • cut_a1 (tuple) – Trim this 3’ adapter from read 1 [keyword-only, default: (‘AGATCGGAAGAGC’,)]

  • cut_g2 (tuple) – Trim this 5’ adapter from read 2 [keyword-only, default: (‘GCTCTTCCGATCT’,)]

  • cut_a2 (tuple) – Trim this 3’ adapter from read 2 [keyword-only, default: (‘AGATCGGAAGAGC’,)]

  • cut_o (int) – Require at least this many bases of an adapter to trim it [keyword-only, default: 6]

  • cut_e (float) – Tolerate at most this fraction of errors in adapter sequences [keyword-only, default: 0.1]

  • cut_indels (bool) – Allow errors in adapter sequences to be insertions and deletions [keyword-only, default: True]

  • cut_nextseq (bool) – Trim high-quality Gs from the 3’ end (for Illumina NextSeq and iSeq) [keyword-only, default: False]

  • cut_discard_trimmed (bool) – Discard reads in which an adapter was found [keyword-only, default: False]

  • cut_discard_untrimmed (bool) – Discard reads in which no adapters were found [keyword-only, default: False]

  • cut_m (int) – Discard reads shorter than this length after trimming [keyword-only, default: 20]

  • bt2_local (bool) – Run Bowtie2 in local mode rather than end-to-end mode [keyword-only, default: True]

  • bt2_discordant (bool) – Output paired-end reads whose mates align discordantly [keyword-only, default: False]

  • bt2_mixed (bool) – Attempt to align individual mates of pairs that fail to align [keyword-only, default: False]

  • bt2_dovetail (bool) – Consider dovetailed mate pairs to align concordantly [keyword-only, default: False]

  • bt2_contain (bool) – Consider nested mate pairs to align concordantly [keyword-only, default: True]

  • bt2_score_min_e2e (str) – Discard alignments that score below this threshold in end-to-end mode [keyword-only, default: ‘L,-1,-0.5’]

  • bt2_score_min_loc (str) – Discard alignments that score below this threshold in local mode [keyword-only, default: ‘L,1,0.5’]

  • bt2_i (int) – Discard paired-end alignments shorter than this many bases [keyword-only, default: 0]

  • bt2_x (int) – Discard paired-end alignments longer than this many bases [keyword-only, default: 600]

  • bt2_gbar (int) – Do not place gaps within this many bases from the end of a read [keyword-only, default: 4]

  • bt2_l (int) – Use this seed length for Bowtie2 [keyword-only, default: 20]

  • bt2_s (str) – Seed Bowtie2 alignments at this interval [keyword-only, default: ‘L,1,0.1’]

  • bt2_d (int) – Discard alignments if over this many consecutive seed extensions fail [keyword-only, default: 4]

  • bt2_r (int) – Re-seed reads with repetitive seeds up to this many times [keyword-only, default: 2]

  • bt2_dpad (int) – Pad the alignment matrix with this many bases (to allow gaps) [keyword-only, default: 2]

  • bt2_orient (str) – Require paired mates to have this orientation [keyword-only, default: ‘fr’]

  • bt2_un (bool) – Output unaligned reads to a FASTQ file [keyword-only, default: True]

  • min_mapq (int) – Discard reads with mapping qualities below this threshold [keyword-only, default: 25]

  • min_reads (int) – Discard alignment maps with fewer than this many reads [keyword-only, default: 1000]

  • sep_strands (bool) – Separate each alignment map into plus- and minus-strand reads [keyword-only, default: False]

  • f1r2_plus (bool) – With --sep-strands, consider forward mate 1s and reverse mate 2s to be plus-stranded [keyword-only, default: False]

  • minus_label (str) – With --sep-strands, append this label to each minus-strand reference [keyword-only, default: ‘-minus’]

  • tmp_pfx – Write all temporary files to a directory with this prefix [keyword-only, default: ‘./tmp-’]
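
A call to run might look like the sketch below, which overrides a few of the documented defaults for a single-end FASTQ file. The file names "refs.fa" and "sample.fq.gz" and the chosen option values are hypothetical; the import is deferred so that the block can be loaded without SEISMIC-RNA installed.

```python
# Keyword arguments taken from the parameter list above; values are examples.
ALIGN_KWARGS = dict(
    fastqz=("sample.fq.gz",),  # hypothetical FASTQ file of single-end reads
    out_dir="./out",
    cut_q1=30,                 # trim read 1 base calls below Phred 30
    bt2_local=False,           # use end-to-end mode instead of local mode
    min_mapq=30,
    max_procs=4,
)


def align_example(fasta: str = "refs.fa"):
    """Run the alignment step on a hypothetical reference FASTA file."""
    # Deferred import: requires the seismicrna package at call time only.
    from seismicrna.align.main import run
    # Per the signature above, run returns a list of paths.
    return run(fasta, **ALIGN_KWARGS)
```

All options not listed in ALIGN_KWARGS keep the defaults shown in the signature.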

class seismicrna.align.report.AlignRefReport(ref: str, **kwargs)

Bases: AlignReport

classmethod fields()

All fields of the report.

classmethod file_seg_type()

Type of the last segment in the path.

class seismicrna.align.report.AlignReport(**kwargs: Any | Callable[[Report], Any])

Bases: Report, ABC

classmethod auto_fields()

Names and automatic values of selected fields.

classmethod dir_seg_types()

Types of the directory segments in the path.

abstract classmethod fields()

All fields of the report.

class seismicrna.align.report.AlignSampleReport(ref: str | None = None, **kwargs)

Bases: AlignReport

classmethod fields()

All fields of the report.

classmethod file_seg_type()

Type of the last segment in the path.

Simulate SAM Files Module

seismicrna.align.sim.as_sam(name: str, flag: int, ref: str, end5: int, mapq: int, cigar: str, rnext: str, pnext: int, tlen: int, read: DNA, qual: str)

Return a line in SAM format from the given fields.

Parameters:
  • name (str) – Name of the read.

  • flag (int) – SAM flag. Must be in [0, MAX_FLAG].

  • ref (str) – Name of the reference.

  • end5 (int) – Most 5’ position to which the read mapped (1-indexed).

  • mapq (int) – Mapping quality score.

  • cigar (str) – CIGAR string. Not checked for compatibility with the read.

  • rnext (str) – Name of the mate’s reference (if paired-end).

  • pnext (int) – Most 5’ position of the mate (if paired-end).

  • tlen (int) – Length of the template.

  • read (DNA) – Base calls in the read. Must be equal in length to qual.

  • qual (str) – Phred quality score string of the base calls. Must be equal in length to read.

Returns:

A line in SAM format containing the given fields.

Return type:

str
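
The layout of such a line can be sketched as eleven tab-separated fields in the order of the parameters above. This is a simplified illustration, not the package's implementation: sam_line_sketch is a hypothetical name, the read is a plain str rather than a DNA object, the value 4095 for MAX_FLAG is an assumption (all 12 SAM flag bits set), and the trailing newline is also an assumption.

```python
MAX_FLAG = 4095  # assumption: the largest flag with all 12 SAM bits set


def sam_line_sketch(name: str, flag: int, ref: str, end5: int, mapq: int,
                    cigar: str, rnext: str, pnext: int, tlen: int,
                    read: str, qual: str) -> str:
    """Join the 11 mandatory SAM fields with tabs (hypothetical helper)."""
    if not 0 <= flag <= MAX_FLAG:
        raise ValueError(f"flag must be in [0, {MAX_FLAG}], but got {flag}")
    if len(read) != len(qual):
        raise ValueError("read and qual must be equal in length")
    fields = (name, flag, ref, end5, mapq, cigar,
              rnext, pnext, tlen, read, qual)
    return "\t".join(map(str, fields)) + "\n"


line = sam_line_sketch("read1", 0, "ref1", 1, 42, "4M", "*", 0, 0,
                       "ACGT", "IIII")
print(line, end="")
```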

seismicrna.align.sim.relvecs_to_sam_file(file: Path, *args, overwrite: bool = False, **kwargs)
seismicrna.align.sim.relvecs_to_sam_lines(relvecs: DataFrame, ref: str, paired: bool, **kwargs)
seismicrna.align.sim.sam_header(ref: str, length: int | DNA)
seismicrna.align.write.align_samples(fq_units: list[FastqUnit], fasta: Path, *, out_dir: Path, force: bool, **kwargs) → list[Path]

Run the alignment pipeline and return a list of all XAM files from the pipeline.

seismicrna.align.write.check_fqs_xams(alignments: dict[tuple[str, str], FastqUnit], out_dir: Path)

Return every FASTQ unit on which alignment must be run and every expected XAM file that already exists.

seismicrna.align.write.figure_alignments(fq_units: list[FastqUnit], refs: set[str])

Every expected alignment of a sample to a reference.

seismicrna.align.write.fq_pipeline(fq_inp: FastqUnit, fasta: Path, bowtie2_index: Path, *, out_dir: Path, tmp_dir: Path, keep_tmp: bool, fastqc: bool, qc_extract: bool, cut: bool, cut_q1: int, cut_q2: int, cut_g1: str, cut_a1: str, cut_g2: str, cut_a2: str, cut_o: int, cut_e: float, cut_indels: bool, cut_nextseq: bool, cut_discard_trimmed: bool, cut_discard_untrimmed: bool, cut_m: int, bt2_local: bool, bt2_discordant: bool, bt2_mixed: bool, bt2_dovetail: bool, bt2_contain: bool, bt2_score_min_e2e: str, bt2_score_min_loc: str, bt2_i: int, bt2_x: int, bt2_gbar: int, bt2_l: int, bt2_s: str, bt2_d: int, bt2_r: int, bt2_dpad: int, bt2_orient: str, bt2_un: bool, min_mapq: int, min_reads: int, sep_strands: bool, f1r2_plus: bool, minus_label: str, n_procs: int = 1) → list[Path]

Run all stages of the alignment pipeline for one FASTQ file or one pair of mated FASTQ files.

seismicrna.align.write.fqs_pipeline(fq_units: list[FastqUnit], main_fasta: Path, *, max_procs: int, parallel: bool, out_dir: Path, tmp_dir: Path, keep_tmp: bool, **kwargs) → list[Path]

Run all stages of alignment for one or more FASTQ files or pairs of mated FASTQ files.

seismicrna.align.write.list_fqs_xams(fq_units: list[FastqUnit], refs: set[str], out_dir: Path)

List every FASTQ to align and every extant XAM file.

seismicrna.align.write.merge_nondemult_fqs(fq_units: Iterable[FastqUnit])

For every FASTQ that is not demultiplexed, merge all the keys that map to the FASTQ into one key: (sample, None). Merging ensures that every non-demultiplexed FASTQ is aligned only once to the whole set of references, not once for every reference in the set. This function is essentially the inverse of figure_alignments.
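
The merging idea can be sketched with plain dictionaries. This is a simplified model and not the package's code: the real function takes FastqUnit objects, whereas here each FASTQ is a hypothetical (name, one_ref) tuple keyed by (sample, ref), and merge_nondemult_sketch is an illustrative name.

```python
from __future__ import annotations


def merge_nondemult_sketch(
        alignments: dict[tuple[str, str], tuple[str, bool]]
) -> dict[tuple[str, str | None], tuple[str, bool]]:
    """Collapse the keys of every non-demultiplexed FASTQ to (sample, None)."""
    merged = {}
    for (sample, ref), (fastq, one_ref) in alignments.items():
        # Demultiplexed FASTQs (one_ref=True) keep their (sample, ref) key;
        # all others share one (sample, None) key, so they are aligned once.
        key = (sample, ref) if one_ref else (sample, None)
        merged[key] = (fastq, one_ref)
    return merged


alignments = {
    ("s1", "refA"): ("s1.fq", False),
    ("s1", "refB"): ("s1.fq", False),
    ("s2", "refA"): ("s2_refA.fq", True),
}
merged = merge_nondemult_sketch(alignments)
print(merged)
```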

seismicrna.align.write.sep_strands_bam(bam_file: Path, paired: bool, refseq_minus: DNA, minus_label: str, f1r2_plus: bool, keep_tmp: bool, min_mapq: int, n_procs: int = 1, **kwargs)

Split reads in a BAM file into each strand.

seismicrna.align.write.write_tmp_ref_files(tmp_dir: Path, refset_path: Path, refs: set[str], n_procs: int = 1)

Write temporary FASTA files, each containing one reference that corresponds to a FASTQ file from demultiplexing.

Alignment XAM Generation Module


Alignment Score Parameters for Bowtie2

Consider this example: Ref = ACGT, Read = AG

Assume that we want to minimize the number of edits needed to convert the reference into the read sequence. The smallest number of edits is two, specifically these two deletions (/) from the reference: [A/G/] which gets a score of (2 * match - 2 * gap_open - 2 * gap_extend).

But there are two alternative alignments, each with 3 edits: [Ag//] and [A//g] (substitutions marked in lowercase). Each gets the score (match - substitution - gap_open - 2 * gap_extend).

In order to favor the simpler alignment with two edits, (2 * match - 2 * gap_open - 2 * gap_extend) must be greater than (match - substitution - gap_open - 2 * gap_extend). This inequality simplifies to (substitution > gap_open - match).

Thus, the substitution penalty and match bonus must be relatively large, and the gap open penalty small. We want to avoid introducing too many gaps, especially to prevent the introduction of an insertion and a deletion from scoring better than one substitution.

Consider this example: Ref = ATAT, Read = ACTT

The simplest alignment (the one with the fewest mutations) is ActT, with two substitutions, which gets a score of (2 * match - 2 * substitution). Another alignment with indels is A{C}T/T, where {C} denotes a C inserted into the read and / denotes an A in the reference that is deleted from the read. This alignment scores (3 * match - 2 * gap_open - 2 * gap_extend).

Thus, (2 * match - 2 * substitution) must be greater than (3 * match - 2 * gap_open - 2 * gap_extend), which simplifies to (2 * gap_open + 2 * gap_extend > match + 2 * substitution).

There are two easy solutions to these inequalities:

  • Bowtie2 v2.5 defaults: 6 > 5 - 2 and 2*5 + 2*3 > 2 + 2*6

  • Set every score to 1: 1 > 1 - 1 and 2*1 + 2*1 > 1 + 2*1
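
The two inequalities derived above can be checked numerically for any set of scores. The function below is a hypothetical helper written for this purpose; the first two parameter sets are the solutions quoted above, and the third is an arbitrary counterexample chosen to violate the second inequality.

```python
def favors_simple_alignments(match: int, substitution: int,
                             gap_open: int, gap_extend: int) -> bool:
    """Check both inequalities from the two examples above."""
    # Ref = ACGT, Read = AG: two deletions must outscore a substitution
    # combined with gaps.
    first = substitution > gap_open - match
    # Ref = ATAT, Read = ACTT: two substitutions must outscore an
    # insertion combined with a deletion.
    second = 2 * gap_open + 2 * gap_extend > match + 2 * substitution
    return first and second


bowtie2_defaults = favors_simple_alignments(match=2, substitution=6,
                                            gap_open=5, gap_extend=3)
all_ones = favors_simple_alignments(match=1, substitution=1,
                                    gap_open=1, gap_extend=1)
# Arbitrary counterexample: 2*2 + 2*1 = 6 is not greater than 1 + 2*4 = 9.
cheap_gaps = favors_simple_alignments(match=1, substitution=4,
                                      gap_open=2, gap_extend=1)
print(bowtie2_defaults, all_ones, cheap_gaps)
```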

seismicrna.align.xamops.bowtie2_build_cmd(fasta: Path, prefix: Path, *, n_procs: int = 1)

Build a Bowtie2 index of a FASTA file.

seismicrna.align.xamops.bowtie2_cmd(fq_inp: FastqUnit | None, sam_out: Path | None, *, paired: bool | None = None, phred_arg: str | None = None, index_pfx: Path, n_procs: int, bt2_local: bool, bt2_discordant: bool, bt2_mixed: bool, bt2_dovetail: bool, bt2_contain: bool, bt2_score_min_e2e: str, bt2_score_min_loc: str, bt2_i: int, bt2_x: int, bt2_gbar: int, bt2_l: int, bt2_s: str, bt2_d: int, bt2_r: int, bt2_dpad: int, bt2_orient: str, fq_unal: Path | None = None)
seismicrna.align.xamops.export_cmd(xam_in: Path, xam_out: Path, *, ref: str, header: str, ref_file: Path | None = None, n_procs: int = 1)

Select and export one reference to its own XAM file.

seismicrna.align.xamops.filter_cmd(*args, **kwargs)

Filter a XAM file based on flags, then collate the output.

seismicrna.align.xamops.filter_cmds(xam_inp: Path, xam_out: Path | None, *, tmp_pfx: Path | None = None, flags_req: int | Iterable[int] | None = None, flags_exc: int | Iterable[int] | None = None, collate: bool = True, n_procs: int = 1)

Filter a XAM file based on flags, then collate the output.
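
Flag-based filtering can be sketched as building a samtools view command, where -f requires flag bits and -F excludes them. This is a sketch under stated assumptions: that samtools is the filtering tool, and that several flags given as an iterable are combined with bitwise OR. The helper names and the file name are hypothetical, and the collation step is omitted.

```python
from __future__ import annotations

from functools import reduce
from operator import or_
from typing import Iterable


def combine_flags(flags: int | Iterable[int] | None) -> int:
    """Merge one flag or several flags into a single bit field."""
    if flags is None:
        return 0
    if isinstance(flags, int):
        return flags
    return reduce(or_, flags, 0)


def view_filter_args(xam_inp: str,
                     flags_req: int | Iterable[int] | None = None,
                     flags_exc: int | Iterable[int] | None = None) -> list[str]:
    """Build (but do not run) a samtools view command that filters by flag."""
    args = ["samtools", "view", "-h"]  # -h keeps the header in the output
    req = combine_flags(flags_req)
    exc = combine_flags(flags_exc)
    if req:
        args.extend(["-f", str(req)])  # keep reads with all of these bits
    if exc:
        args.extend(["-F", str(exc)])  # drop reads with any of these bits
    args.append(xam_inp)
    return args


# Require paired (1) and properly paired (2) reads; exclude unmapped (4).
args = view_filter_args("reads.bam", flags_req=[1, 2], flags_exc=4)
print(" ".join(args))
```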

seismicrna.align.xamops.get_bowtie2_index_paths(prefix: Path)

Return the Bowtie 2 index paths for a FASTA file.

seismicrna.align.xamops.parse_bowtie2(process: CompletedProcess)

Get the number of reads input and aligned.

seismicrna.align.xamops.realign_cmd(xam_inp: Path, xam_out: Path, *, paired: bool, tmp_pfx: Path | None = None, flags_req: int | Iterable[int] | None = None, flags_exc: int | Iterable[int] | None = None, min_mapq: int | None = None, n_procs: int = 1, **kwargs)

Re-align reads that are already in a XAM file.

seismicrna.align.xamops.xamgen_cmd(fq_inp: FastqUnit, bam_out: Path, *, min_mapq: int | None = None, n_procs: int = 1, **kwargs)

Wrap alignment and post-processing into one pipeline.