Relate: Compute relationships between references and aligned reads
Relate: Input files
Relate input file: Reference sequences
You need one file of reference sequences in FASTA format (for details on this format, see FASTA: Reference sequences). If your file has characters or formatting incompatible with SEISMIC-RNA, then you can fix it using the Clean FASTA Files tool.
Relate input file: Alignment maps
You can give any number of alignment map files, each of which must be in SAM, BAM, or CRAM (collectively, “XAM”) format. See SAM, BAM, and CRAM: Alignment Maps for more information.
Note
The references in the FASTA file must match those to which the reads in the
alignment map were aligned.
Discrepancies can cause the Relate step to fail or produce erroneous output.
You can assume that the references match if you use the same (unmodified)
FASTA file for both the align and relate commands, or if you run
both steps using the command seismic wf.
Provide the alignment map files as a list after the FASTA file.
See List Input Files for ways to list multiple files.
For example, to compute relation vectors for reads from sample-1 aligned to
references ref-1 and ref-2, and from sample-2 aligned to reference
ref-1, use the following command:
seismic relate {refs.fa} sample-1/align/ref-1.cram sample-1/align/ref-2.cram sample-2/align/ref-1.cram
where {refs.fa} is the path to the file of reference sequences.
Relate: Settings
Relate setting: Minimum Phred score
In the Relate step, you can flag bases with low quality scores as ambiguous, as
if they were Ns.
This step serves a purpose similar to that of quality trimming during the Align
step (see How to trim low-quality base calls).
The difference is that quality trimming removes low-quality bases by shortening
reads from their ends, while the minimum quality score in the Relate step flags
low-quality bases located anywhere in the reads, while preserving read lengths.
See Encoding low-quality base calls for a more detailed description of how this works.
To set the minimum quality score, use --min-phred.
The default is 25, meaning that base calls with a probabilities of at least
10-2.5 = 0.3% of being incorrect are flagged as ambiguous.
(See Phred quality score encodings for an explanation of quality scores.)
For example, if a T is called as a match with a quality score of 20, then it
would be flagged as possibly a match and possibly a subsitution to A, C, or G.
Relate setting: Ambiguous insertions and deletions
When insertions and deletions (indels) occur in repetitive regions, determining
which base(s) were inserted or deleted can be impossible due to the repetitive
reference sequence itself, even if the reads were perfectly free of errors.
To handle ambiguous indels, SEISMIC-RNA introduces a new algorithm that finds
all possible indels that could have produced the observed read (for details on
this algorithm, see ../algos/ambrel).
This algorithm is enabled by default.
If you do not need to identify ambiguous indels, then you can disable this
algorithm with --no-ambrel, which will speed up the Relate step at the cost
of reducing its accuracy on indels.
Relate setting: Batch size
In the Relate step, you can divide up your data into batches to speed up the
analysis and reduce the amount of memory needed.
For an explanation of batching and how to use it, see Batches and Benefits.
You can specify batch size (in millions of base calls) using --batch-size,
which is 64.0 (64 million base calls) by default.
Relate uses the batch size to calculate the number of reads in each batch.
The number of relationship bytes per batch, B, is the number of relationship
bytes per read, L, times the number of reads per batch, N:
B = LN
Since L is the length of the reference sequence and B is --batch-size:
N = B/L
Note
SEISMIC-RNA will aim to put exactly N reads in each batch but the last (the last batch can be smaller because it has just the leftover reads). If the reads are single-ended or were not aligned in mixed mode, then every batch but the last will contain exactly N reads. If the reads are paired-ended and were aligned in mixed mode, then batches may contain more than N reads, up to a maximum of 2N in the extreme case that only one read aligned in every mate pair.
Relate setting: Overhangs
When relating sequencing reads, SEISMIC-RNA will by default compute
relationships for all base calls between the smallest 5’ aligned position and
the greatest 3’ aligned position.
This occurs independent of mate alignment orientation, and can result in the
relating of base calls that fall outside the region between mate starts, for
instance, if the 5’ mate aligns to the negative strand at a position less than
the 3’ mate start position on the positive strand.
In certain rare circumstances, like when adapter trimming is inconsistent, or
when using randomized adapter sequences during library preparation, this can
result in SEISMIC-RNA calculating relationships for extraneous extensions.
The default behavior --overhangs can be disabled in favor of a more
conservative approach --no-overhangs, where only base calls greater than
the 5’ mate start and less than the 3’ mate start positions
(i.e. within the insert) are related.
Relate: Output files
All output files go into the directory {out}/{sample}/relate/{ref}, where
{out} is the output directory, {sample} is the sample, and {ref} is
the name of the reference.
Relate output file: Batch of relation vectors
Each batch of relation vectors contains a RelateBatchIO object and is saved
to the file relate-batch-{num}.brickle, where {num} is the batch number.
See Relate Batch for details on the data structure.
See Brickle: Compressed Python Objects for more information on brickle files.
Relate output file: Batch of read names
Within each batch, the relate step assigns an index (a nonnegative integer) to
each read and writes a file mapping the indexes to the read names.
Each batch of read names contains a QnamesBatchIO object and is saved to the
file qnames-batch-{num}.brickle, where {num} is the batch number.
See Read Names Batch for details on the data structure.
See Brickle: Compressed Python Objects for more information on brickle files.
Relate output file: Reference sequence
The relate step writes the reference sequence as a RefseqIO object to the
file refseq.brickle.
See Reference Sequence for details on the data structure.
See Brickle: Compressed Python Objects for more information on brickle files.
Relate output file: Relate report
SEISMIC-RNA also writes a report file, relate-report.json, that records the
settings you used for running the Relate step and summarizes the results.
See Relate Report for more information.
Relate: Troubleshoot and optimize
If you encounted errors during the Relate step, then the most likely cause is that the FASTA file or settings you used for the Relate step differ from those that you used during alignment.
Insufficient reads in {file} …
This error means that you provided a SAM/BAM/CRAM file containing fewer reads
than the minimum number set by --min-reads (-N).
There are two common causes of this error:
You ran
seismic alignandseismic relateseparately (instead of withseismic wf), and you used a larger value for--min-readsduring the Relate step than the Align step. To check if this happened, open your report files from Align and Relate and see if the field “Minimum number of reads in an alignment map” has a larger value in the Relate report.You ran alignment outside of SEISMIC-RNA or obtained alignment map files from an external source, and some of the alignment maps have insufficient reads.
The solution for the problem is to ensure that you run seismic relate with
--min-reads set to the minimum number of reads you actually want during the
Relate step.
As long as you do so, you may ignore error messages about insufficient reads,
since these messages just indicate that SEISMIC-RNA is skipping alignment maps
with insufficient reads, which is exactly what you want to happen.
Read {read} mapped with a quality score {score} …
This error means that a read inside an alignment file aligned with a mapping
quality lower than the minimum set by --min-mapq.
There are two common causes of this error:
You ran
seismic alignandseismic relateseparately (instead of withseismic wf), and you used a larger value for--min-mapqduring the Relate step than the Align step. To check if this happened, open your report files from Align and Relate and see if the field “Minimum mapping quality to use an aligned read” has a larger value in the Relate report.You ran alignment outside of SEISMIC-RNA or obtained alignment map files from an external source, and some reads in the alignment maps have insufficient mapping quality.
The solution for the problem is to ensure that you run seismic relate with
--min-mapq set to the minimum mapping quality you actually want during the
Relate step.
As long as you do so, you may ignore error messages about insufficient quality,
since these messages just indicate that SEISMIC-RNA is skipping reads with
with insufficient mapping quality, which is exactly what you want to happen.
Read {read} mapped to a reference named {name} …
This error means that a read inside an alignment file aligned to a reference
whose name does not match the name of the alignment file (minus the extension).
For example, if your alignment map file azure.cram contains a read that
aligned to a reference named cyan (instead of azure), then you will get
this error message.
If you aligned the reads using seismic align or seismic wf, then this
error should never occur (unless you renamed or modified the output files).
Otherwise, you can solve the problem by ensuring that
Each alignment map file contains reads that aligned to only one reference.
Each alignment map file is named (up to the file extension) the same as the one reference to which all of the reads aligned.
Relate crashes or hangs while producing few or no batch files
Most likely, your system has run out of memory.
You can confirm using a program that monitors memory usage (such as top in a
Linux/macOS terminal, Activity Monitor on macOS, or Task Manager on Windows).
If so, then rerun Relate with adjustments to one or both settings:
Use smaller batches (with
--batch-size) to limit the size of each batch, at the cost of having more files with a larger total size.Use fewer processors (with
--max-procs) to limit the memory usage, at the cost of slower processing.