Align: Trim FASTQ files and align them to reference sequences
Align: Input files
Align input file: Reference sequences
You need one file of reference sequences in FASTA format (for details on this format, see FASTA: Reference sequences). If your file has characters or formatting incompatible with SEISMIC-RNA, then you can fix it using the Clean FASTA Files tool.
Align input file: Read sequences
Your read sequences must be in FASTQ format (see FASTQ: Sequencing Reads for details on this format).
Reads can be single-end or paired-end
You can align FASTQ files of single- and paired-end reads with SEISMIC-RNA. (For definitions of single- and paired-end reads, see Endedness: single-end and paired-end reads). SEISMIC-RNA requires that single- and paired-end reads not be mixed within one file, but it can accept different types of reads in separate FASTQ files.
Reads can come from whole or demultiplexed samples
You can align FASTQ files that come from a whole sample (possibly containing multiple RNA sequences) or that have been demultiplexed before alignment so that they contain reads from only one RNA sequence. For more information on demultiplexing and how to perform it if needed, see Demultiplex: Split multiplexed FASTQ files by their barcodes.
How to specify the endedness and source of reads
Specify the endedness (single-end, paired-end interleaved, or paired-end in separate files) and source of the reads (whole or demultiplexed sample) in a FASTQ file by giving the file path after the appropriate option:
Endedness |
Whole Sample |
Demultiplexed |
|---|---|---|
paired-end, separate files |
|
|
paired-end, interleaved |
|
|
single-end |
|
|
For example, you could type -z {myfile} to input a FASTQ file of single-end
reads from a whole sample or -X {myfile1} -X {myfile2} to input two FASTQ
files of paired-end reads from a demultiplexed sample, replacing {myfile}
with the actual path(s) of your file(s).
Note
If you have paired-end reads in separate FASTQ files (-x or -X),
then you must give both files of 1st and 2nd mates.
See How to align a pair of FASTQ files (paired-end reads in separate files) for instructions.
How to align one FASTQ file (single-end or interleaved paired-end reads)
To align a FASTQ of single-end reads from a whole sample, use -z:
seismic align {refs.fa} -z {sample.fq.gz}
where {refs.fa} is the path to your FASTA file of reference sequences and
{sample.fq.gz} is the path to your FASTQ file of the sample.
For a FASTQ of paired-end, interleaved reads that were demultiplexed, use -Y
instead:
seismic align {refs.fa} -Y {sample/ref.fq.gz}
where {sample/ref.fq.gz} is the path to your FASTQ file containing reads
from only one reference in the sample.
How to align a pair of FASTQ files (paired-end reads in separate files)
If your reads are paired-end and you have one FASTQ file containing all 1st mates and another containing all 2nd mates, then you will need to provide both FASTQ files. There are two methods:
Use
-x/-Xtwice, once per FASTQ file:seismic align {refs.fa} -x {sample_R1.fq.gz} -x {sample_R2.fq.gz}
where
{sample_R1.fq.gz}and{sample_R2.fq.gz}are the paths to your FASTQ files of the 1st and 2nd mates, respectively.Make a new directory, move both FASTQ files into that directory, and provide the path to that directory with
-x/-X:mkdir {sample} mv {sample_R1.fq.gz} {sample_R2.fq.gz} {sample} seismic align {refs.fa} -x {sample}
where
{sample}is the new directory for both FASTQ files.
How to align multiple FASTQ files or pairs of paired-end FASTQ files
There are three ways to align multiple FASTQ files (or pairs thereof):
Use options more than once. You can repeat any of
-x/-y/-z/-X/-Y/-Z, as well as mix them in one command. For example, to align one pair of paired-end FASTQ files (sample 1), one interleaved paired-end FASTQ file (sample 2), and two single-end FASTQ files (samples 3 and 4), you could typeseismic align {refs.fa} -x {sample1_R1.fq.gz} -x {sample1_R2.fq.gz} -y {sample2.fq.gz} -z {sample3.fq.gz} -z {sample4.fq.gz}
This method is most useful when you have a small number of FASTQ files.
Group FASTQ files of the same type into a directory. Suppose you have 63 pairs of FASTQ files, with the files of mate 1s named
sample-1_R1.fq.gztosample-63_R1.fq.gzand the files of mate 2s namedsample-1_R2.fq.gztosample-63_R2.fq.gz; plus demultiplexed single-end reads from three samples (I-III) and six references (A-F), namedsample-I/ref-A.fq.gztosample-III/ref-F.fq.gz). You can align all of them with one command if you move the whole-sample, paired-end FASTQ files into their own directory, and the demultiplexed, single-end FASTQ files into another directory, and then give each directory after the appropriate options (-xand-Z, respectively):mkdir {paired} mv sample-*_R?.fq.gz {paired} mkdir {dm-single} mv sample-I* {dm-single} seismic align {refs.fa} -x {paired} -Z {dm-single}This method is most useful when you have many FASTQ files.
Combine the first two methods. Suppose you are working on two projects, have generated a set of many FASTQ files for each project, and want to process both sets. Currently, the FASTQ files for projects 1 and 2 are in directories
proj1andproj2, and you want to keep them separate. You can process both directories with one command:seismic align {refs.fa} -x proj1 -x proj2
This method is most useful when you have multiple directories of FASTQ files that you would like to keep separate.
Note
If you give a directory for any of the FASTQ options, then SEISMIC-RNA will search for FASTQ files recursively, with no limit to the depth.
Align: Settings
Align setting: Quality score encoding
Your FASTQ files may encode quality scores in several schemes (for details, see
Phred quality score encodings).
Modern Illumina sequencers use Phred+33 encoding, the default in SEISMIC-RNA.
To change the quality score encoding, use --phred-enc.
Note
If your FASTQ files do not use the Phred+33 encoding, then you must specify the correct Phred score encoding, or else Cutadapt and/or Bowtie 2 can produce incorrect output or fail outright.
If you do not know the encoding scheme of your FASTQ files, then you can process them with FastQC and check the “Encoding” field in the “Basic Statisics” part of the FastQC report:
If the Encoding field says
Illumina 1.0to1.7, then your FASTQ files use Phred+64 encoding (--phred-enc 64).If the Encoding field says
Illumina 1.8or greater, then your FASTQ files use Phred+33 encoding (--phred-enc 33, the default).Otherwise, you will need to search elsewhere for your encoding scheme.
Align setting: Quality assessment with FastQC
To check the quality of your FASTQ files, SEISMIC-RNA runs FastQC by default.
To disable FastQC, use --no-fastqc.
You can also enable automatic unzipping of the zipped output files from FastQC
with --qc-extract.
Align setting: Trimming reads with Cutadapt
To trim adapters and low-quality base calls before alignment, SEISMIC-RNA runs
Cutadapt by default.
To disable trimming, use --no-cut.
How to trim adapter sequences
Your reads may contain unwanted adapters (especially near their 3’ ends), which can cause problems such as misalignment (alignment to the wrong location). Your adapter sequences depend on how your samples were prepared for sequencing (i.e. on your library prep kit) and on your sequencing platform. Since Illumina sequencers are the most widely used for mutational profiling, SEISMIC-RNA defaults to the standard, minimal adapter sequences for Illumina for both read 1 and (if paired-end) read 2:
5’:
GCTCTTCCGATCT3’:
AGATCGGAAGAGC
If your samples have other adapters, then you can specify their sequences using
Side |
Read |
Option |
|---|---|---|
5’ |
1 |
|
5’ |
2 |
|
3’ |
1 |
|
3’ |
2 |
|
How to trim low-quality base calls
Your reads may contain low-quality base calls (especially near their 3’ ends),
which can cause misalignment and excessive mutations.
By default, SEISMIC-RNA trims base calls with quality scores less than 25, which
corresponds to a probability of 10-2.5 = 0.3% that the base call is
incorrect (for an explanation, see Phred quality score encodings).
You can set the quality threshold with --min-phred.
We discourage using a quality threshold less than 25 because doing so could lead
to a background error rate that is too high for accurate mutational profiling
(e.g. 1% with --min-phred 20), especially if you want to cluster your reads.
How to trim extra dark cycles (for Illumina two-channel chemistry)
Illumina sequencers using two-channel chemistry (e.g. NextSeq, NovaSeq, iSeq)
interpret the lack of color from either channel as G.
Consequently, if a DNA molecule is shorter than the read length, then the final
cycles of sequencing will produce no light and be called as a string Gs.
Using --cut-nextseq tells Cutadapt to trim high-quality Gs
from the 3’ end of every read.
Trimming dark cycles can improve alignment in end-to-end mode, but it also trims
real G bases (which cannot be distinguished from artifactual ones) from the 3’
ends of reads.
How to further customize read trimming
Refer to seismic align for the full list of options that SEISMIC-RNA can use
with Cutadapt, and the Cutadapt reference guide for details on each.
These options suffice for most users.
If you need more customization, then you can trim your FASTQ files externally
and then perform alignment within SEISMIC-RNA, using --no-cut to disable
additional trimming.
Align setting: Mapping reads with Bowtie 2
SEISMIC-RNA uses Bowtie 2 to align your reads to your reference sequences.
How to pre-build a Bowtie 2 index (optional)
Bowtie 2 requires the FASTA file of reference sequences to be indexed. You can have SEISMIC-RNA build the index for you automatically (the default) or index your FASTA file manually. With automatic indexing, SEISMIC-RNA builds the index in a temporary directory and deletes it after alignment finishes. This option is ideal for small sets of references (i.e. up to several hundred sequences of several thousand nucleotides each) because building the index will take on the order of seconds to minutes. However, for large sets of references (e.g. an entire mammalian transcriptome), building the index can take on the order of hours. In this case, we recommend building the index yourself using the command
bowtie2-build {refs}.fa {refs}
where {refs}.fa is the path of your FASTA file and {refs} is the path
without the FASTA file extension.
See the Bowtie 2 Indexer manual for more information on building an index.
Note that, while Bowtie 2 does not require the index to have the same name as
the FASTA file, SEISMIC-RNA does, so make sure that you use the same path for
the FASTA file and the index, except that the index path should not have the
FASTA file extension.
Indexing will generate six files with the extensions .1.bt2, .2.bt2,
.3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2.
As long as all six files are in the same directory as and have the same name
(minus the file extension) as the FASTA file, SEISMIC-RNA will use the index.
Otherwise, SEISMIC-RNA will build and use a temporary index.
Note
If you use a pre-built Bowtie 2 index, then SEISMIC-RNA does not verify that the index was actually built from the FASTA file of the same name. Discrepancies between the FASTA file and the index files can crash the Align and Relate steps or produce erroneous results.
How to choose between local and end-to-end alignment
You can align either whole reads (end-to-end mode) or the part of each read that aligns best to the reference (local mode). See the description of alignment modes in Bowtie 2 for more details.
Generally, end-to-end mode yields spurious mutations (false positives) at the ends of reads if the reads contain artifacts such as low-quality base calls or untrimmed or improperly trimmed adapters. Conversely, local mode misses real mutations (false negatives) within several nucleotides of the ends of reads because such mutations cannot be part of the best local alignment, which penalizes mutations and rewards matches.
For RNA mutational profiling, false positives generally cause more problems than
do false negatives, so SEISMIC-RNA uses local mode (--bt2-local) by default.
Use end-to-end mode (--bt2-end-to-end) only if you have a compelling reason
to do so (e.g. if you must quantify mutations at the ends of reads) and only
after carefully trimming any extraneous sequences from the ends of the reads.
How to align paired-end reads
If your reads are paired-end, then you have additional options for keeping or
discarding read pairs depending on how the two reads in the pair (called mates)
align relative to each other.
Bowtie 2 considers mates to align “concordantly” when their relative positions
match expectations and “discordantly” otherwise.
See the Bowtie 2 manual for details on concordant/discordant alignments.
By default, SEISMIC-RNA keeps only concordantly aligned pairs.
To include discordantly aligned pairs too, add --bt2-discordant.
Several options control which types of alignments are considered concordant versus discordant.
You can specify where mates should align relative to each other: mates may
overlap partially or completely, or dovetail.
By default, overlaps (partial and complete) are considered concordant, while
dovetailing is considered discordant.
You can treat complete overlaps as discordant with --bt2-no-contain, or
dovetailed mates as concordant with --bt2-dovetail.
Pairs that overlap partially (without dovetailing) are always concordant in
SEISMIC-RNA.
You can also specify the orientation of paired mates using --bt2-orient.
The choices are fr (the 5’-most mate is forward, the 3’-most is reversed),
rf (the 5’-most mate is reversed, the 3’-most is forward), or ff (both
mates are forward).
The default is fr (and if you are not sure which orientation you need, then
you probably need the default).
Note
First, --bt2[-no]-contain, --bt2[-no]-dovetail, and --bt2-orient
choose which paired-end alignments count as concordant or discordant.
If discordant, then --bt2-[no-]discordant choose whether to keep them.
Using --bt2-no-contain and --bt2-discordant, for example, would make
alignments where one mate fully contains the other discordant (because of
--bt2-no-contain) but still kept (because of --bt2-discordant),
despite what the name “no-contain” would imply.
You can also enable mixed mode with --bt2-mixed.
In mixed mode, if two mates fail to align as a pair, then Bowtie 2 will attempt
to align each mate individually, like a single-end read.
(It is possible in mixed mode for only one mate in a pair to align.)
How to filter aligned reads
You can filter alignments by alignment score and mapping quality.
Alignment score measures how well a read aligns to one specific location
in one reference sequence and depends on the number of matches, substitutions,
and gaps, using the score parameters.
You can specify the minimum alignment score for local and end-to-end modes using
--bt2-score-min-loc and --bt2-score-min-e2e, respectively.
See the section of the Bowtie 2 manual on alignment scores for advice.
Mapping quality measures how unique an alignment is among all locations
in all reference sequences: high if the read aligns with a high alignment
score to exactly one location, low quality if it aligns with similar alignment
scores to multiple locations in the reference (and thus it is hard to determine
a single location where the read aligns).
The default minimum mapping quality is 25, meaning that the probability that the
chosen location is incorrect is 10-2.5 = 0.3%.
You can change the minimum mapping quality using --min-mapq.
How to filter by number of aligned reads
Alignment maps containing very few reads are not generally useful for mutational
profiling, due to low coverage per position.
When aligning to many references (e.g. an entire transcriptome), most references
will receive few reads, producing many output files that would be unusable for
further processing.
To prevent unusable files from cluttering your output directory, you can choose
to have alignment map files with insufficient reads deleted automatically.
The default minimum is 1000 reads, which you can change using --min-reads.
With no minimum (--min-reads 0), no files are deleted automatically.
How to further customize alignment
See seismic align for the full list of options that SEISMIC-RNA can use with Bowtie 2, and the Bowtie 2 manual for details on each of these options. These options suffice for most users. If you need more customization, then you can align your FASTQ files externally and pass the alignment maps into SEISMIC-RNA during Relate: Compute relationships between references and aligned reads.
Align setting: Format of alignment maps
You can choose to output alignment map files in either BAM or CRAM format.
For information on these file formats, see SAM, BAM, and CRAM: Alignment Maps.
The default is CRAM (--cram); you can switch to BAM using --bam.
Alignment maps in CRAM format are smaller than their BAM counterparts, and hence better suited to long-term storage. However, the better compression of CRAM files comes at three costs:
A CRAM file must be accompanied by a FASTA file storing the sequence of every reference that appears in the header of the CRAM file. A CRAM file stores only the relative path to its FASTA file, not the sequence information, which enables the CRAM file to be much smaller than it would be if it did need to contain its own sequences. Because the FASTA file existed before and during the alignment, having this FASTA file accompany the CRAM file usually incurs no extra cost. However, moving or deleting the FASTA will break the CRAM file. As a safeguard against this fragility, SEISMIC-RNA keeps a copy of the FASTA file in the same directory as the output CRAM file. Creating an actual copy would require more storage space and defeat the point of CRAM’s smaller file size, so SEISMIC-RNA actually makes a hard link – not a copy – which requires minimal extra space. In some circumstances, making a hard link can fail, in which case SEISMIC-RNA will resort to copying the FASTA file instead.
Reading and writing CRAM files is slower than for BAM files due to the extra effort needed for compressing and decompressing CRAM files.
In the CIGAR strings, distinction between reference matches (
=) and substitutions (X) is lost upon compressing to CRAM format. Thus, the Relate step must perform extra work to determine if each non-indel position is a match or substitution, which makes it run more slowly than it would if the distinction had been preserved.
In general, use CRAM format if minimizing the size of your alignment map files
is a priority, especially for long-term storage.
Use BAM format to make the align and relate steps run faster, and to
make the output files more portable (since BAM files are self-contained, while
CRAM files will break without the FASTA file that accompanies them).
Align: Output files
All output files except FastQC reports are written to {out}/{sample}/align,
where {out} is your output directory and {sample} is the sample name.
Align output file: FastQC reports
If you run FastQC, then its report files go into {out}/{sample}/qc.
The directory {out}/{sample}/qc/initial contains the FastQC reports for your
initial FASTQ files, before trimming.
If you also run trimming, then reports for the post-trimmed FASTQ files go into
{out}/{sample}/qc/trimmed.
In each directory, FastQC outputs {fq_name}_fastqc.html (the FastQC report)
and {fq_name}_fastqc.zip (extra information), where {fq_name} comes from
the original FASTQ file.
If you add --qc-extract, then each {fq_name}_fastqc.zip will be unzipped
to the directory {fq_name}_fastqc.
For details on these outputs, see the documentation for FastQC.
Align output file: Alignment maps
Alignment maps store the location in the reference sequence to which each read
aligned, plus the Phred quality scores, mapping quality, and mutated positions.
(For more information on alignment maps, see SAM, BAM, and CRAM: Alignment Maps.)
SEISMIC-RNA outputs alignment maps where every read aligns to the same reference
(although this is not a restriction outside of SEISMIC-RNA).
Each alignment map is written to {ref}.{xam}, where {ref} is the name of
the reference to which the reads aligned, and {xam} is the file extension
(depending on the selected format).
SEISMIC-RNA can output alignment maps in either BAM or CRAM format.
For a comparison of these formats, see Align setting: Format of alignment maps.
Align output file: Reference sequences
If you choose to output alignment maps in CRAM format, then you also get a FASTA file(s) of the reference sequence(s) alongside the CRAM files. If the reads came from a whole sample, then a single FASTA file with the same name as the input FASTA file will be output. The output file will be a hard link to the input file, if possible, to avoid consuming unnecessary storage space. If the reads were demultiplexed before alignment, then for each CRAM file, a FASTA file with the same name (up to the file extension) will be output. In both cases, each FASTA file will be indexed using samtools faidx to speed up reading the CRAM files. If you choose to output alignment maps in BAM format, then you get (and need) no FASTA files alongside them.
Align output file: Unaligned reads
In addition to the alignment maps, SEISMIC-RNA outputs FASTQ file(s) of reads that Bowtie 2 could not align:
Each whole-sample FASTQ file of single-end (
-z) or interleaved (-y) reads yields one file:unaligned.fq.gzEach pair of whole-sample FASTQ files of 1st and 2nd mates (
-x) yields two files:unaligned.fq.1.gzandunaligned.fq.2.gzEach demultiplexed FASTQ file of single-end (
-Z) or interleaved (-Y) reads yields one file:{ref}__unaligned.fq.gzEach pair of demultiplexed FASTQ files of 1st and 2nd mates (
-X) yields two files:{ref}__unaligned.fq.1.gzand{ref}__unaligned.fq.2.gz
where {ref} is the reference for demultiplexed FASTQ files.
You can disable outputting unaligned using --bt2-no-un.
Align output file: Align report
SEISMIC-RNA also writes a report file, align-report.json, that records the
settings you used for running the Align step and summarizes the results.
See Align Report for more information.
Check the number of reads that aligned overall
Under “Number of reads after alignment”, the report says how many single-end and/or paired-end reads were in the FASTQ file(s), and how many reads aligned. This information is copied verbatim from the alignment summary of Bowtie 2; see its documentation for more details. For paired-end reads, each pair counts as one read.
Check the number of reads that aligned to each reference
Under “Number of reads aligned by reference”, the report lists every reference in your input FASTA file and the number of reads that aligned to it. For paired-end reads, each pair counts as one read.
Align: Troubleshoot and optimize
Align produces alignment map files too slowly
First, try running the Align step using more processors (with --max-procs),
at the cost of using more memory.
If, as a result, Align crashes or hangs without producing alignment map files, then try adjusting the settings of
Bowtie 2 to increase the speed, at the risk of overlooking valid alignments.
See seismic align for the Bowtie 2 settings you can adjust in SEISMIC-RNA,
and the Bowtie 2 manual for more detailed descriptions.
Align crashes or hangs without producing alignment map files
Most likely, your system has run out of memory.
You can confirm using a program that monitors memory usage (such as top in a
Linux/macOS terminal, Activity Monitor on macOS, or Task Manager on Windows).
If so, then rerun the Align step using fewer processors (with --max-procs)
to limit the memory usage, at the cost of slower alignment.
Fewer reads aligned than you expected
Try the following steps (in this order):
Ensure you are using Bowtie version 2.5.1 or later (version 2.5.0 has a bug that affects alignment rate). You can check the version with
bowtie2 --version | head -n 1.Double check that your FASTA file has the correct reference sequence(s) and that, if you pre-built the Bowtie 2 index before running
seismic align, that you indexed the correct FASTA file.Examine the reads that failed to align (see Align output file: Unaligned reads). Choose several reads randomly, copy one or two 20 - 40 nt segments from the middle of each read, and check if the segments come from any known sources by querying BLAST (or similar tools). Identifying the sources of unaligned reads can help determine the cause of the problem (e.g. contamination with ribosomal RNA or foreign nucleic acids such as from Mycoplasma) and whether the reads that did align are usable.