Step-by-step walkthrough
Set up your file system
Open a new terminal session, create and navigate to a new directory for this
tutorial (e.g. ~/seismic-tutorial):
mkdir ~/seismic-tutorial
cd ~/seismic-tutorial
Also make directories called fq and ref inside ~/seismic-tutorial
for your sequencing reads (FASTQ) and reference sequence (FASTA), respectively:
mkdir fq ref
Download the Example dataset into these directories: FASTQ (.fq.gz)
files into fq and the FASTA file (.fa) into ref, like so:
seismic-tutorial
|-- fq
| |-- sars2-fse_R1.fq.gz
| |-- sars2-fse_R2.fq.gz
|-- ref
| |-- sars2.fa
To confirm the paths are correct, type this:
ls fq/sars2-fse_R1.fq.gz fq/sars2-fse_R2.fq.gz ref/sars2.fa
If correct, this command will simply list the paths.
If it prints No such file or directory (or similar) for any path, then
that path is incorrect.
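The same check can be sketched as a small shell function that labels each path as found or missing. This is only an illustration: check_inputs is a hypothetical helper name, not part of SEISMIC-RNA, and it assumes you run it from inside ~/seismic-tutorial.

```shell
# Report which of the expected tutorial input files are present.
# check_inputs is a hypothetical helper, not part of SEISMIC-RNA.
check_inputs() {
    missing=0
    for path in "$@"; do
        if [ -f "$path" ]; then
            echo "found:   $path"
        else
            echo "missing: $path"
            missing=1
        fi
    done
    # Return nonzero if at least one path was missing.
    return "$missing"
}

check_inputs fq/sars2-fse_R1.fq.gz fq/sars2-fse_R2.fq.gz ref/sars2.fa \
    || echo "Fix the missing paths before continuing."
```

Unlike plain ls, this prints one line per path, so a single misplaced file is easy to spot.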
Run the entire workflow of SEISMIC-RNA
First, activate your Conda environment for SEISMIC-RNA:
conda activate seismic
Run the main workflow for SEISMIC-RNA using this command:
seismic -vv wf -x fq ref/sars2.fa
Let’s break down what this is doing:
seismic is the SEISMIC-RNA program (an executable file).
-vv makes SEISMIC-RNA use double-verbose mode, logging the maximum amount of information to the console; maximum verbosity is useful for tutorials and troubleshooting, though entirely optional.
wf tells SEISMIC-RNA to run its entire workflow.
-x fq tells SEISMIC-RNA to accept paired-end sequencing reads (-x) from fq; since fq is a directory, it will be searched (recursively) for all FASTQ files. You could also type -x fq/sars2-fse_R1.fq.gz -x fq/sars2-fse_R2.fq.gz to specify files individually, but this is more cumbersome to type.
ref/sars2.fa is the FASTA file of reference sequence(s); it must be given as the first positional argument (i.e. not immediately preceded by an option beginning with -) after the step wf.
This command should take several minutes to run on a modern computer.
View the results
All files generated by SEISMIC-RNA will have gone into the directory out
(the default name for the output directory).
Output directory
The output directory contains one directory for each sample (i.e. FASTQ file or
pair of paired-end FASTQ files).
In this case, there will be one sample called sars2-fse (whose name derives
from the input FASTQ files).
Sample directory
Inside the directory for a sample, there will be one directory for each step of
the workflow: qc, align, relate, mask, table, and graph.
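To see which of these step directories exist after a run, a short loop over the step names is enough. The sketch below assumes the default output directory out and the sample name sars2-fse described above; list_steps is a hypothetical helper name, not a SEISMIC-RNA command.

```shell
# Show which workflow step directories exist inside a sample directory.
# list_steps is a hypothetical helper, not part of SEISMIC-RNA.
list_steps() {
    sample_dir=$1
    for step in qc align relate mask table graph; do
        if [ -d "$sample_dir/$step" ]; then
            echo "$step: present"
        else
            echo "$step: absent"
        fi
    done
}

list_steps out/sars2-fse
```

A step reported as absent usually means the workflow stopped before reaching it, which the console log from -vv should help explain.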
Align directory
The directory align contains all output files from the Align step:
An Align Report file (align-report.json) that records settings used for alignment and summarizes the results, such as the number of reads that aligned to each reference (see Align Report).
An alignment map file in BAM (.bam) or CRAM (.cram) format for each reference (see SAM, BAM, and CRAM: Alignment Maps), containing the reads that mapped to that reference; the file name is the name of the reference.
A file of reads that did not align to any reference, in gzipped FASTQ format (see FASTQ: Sequencing Reads): unaligned.fq.gz (for single-end reads) or unaligned.fq.1.gz and unaligned.fq.2.gz (for paired-end reads); unaligned reads can be useful for troubleshooting low rates of alignment.
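When troubleshooting a low alignment rate, one rough first check is to compare the number of unaligned mate 1 reads with the number of input mate 1 reads. The sketch below assumes standard four-line FASTQ records and the default paths from this tutorial; count_fastq_reads is a hypothetical helper name, not part of SEISMIC-RNA.

```shell
# Count the reads in a gzipped FASTQ file, assuming four lines per read.
# count_fastq_reads is a hypothetical helper, not part of SEISMIC-RNA.
count_fastq_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Compare unaligned mate 1 reads against the input mate 1 reads.
for fastq in out/sars2-fse/align/unaligned.fq.1.gz fq/sars2-fse_R1.fq.gz; do
    if [ -f "$fastq" ]; then
        echo "$fastq: $(count_fastq_reads "$fastq") reads"
    fi
done
```

The comparison is only approximate, because reads may also be removed before alignment; the Align Report remains the authoritative summary.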
Relate directory
The directory relate contains one directory for each reference.
Each of those directories contains the following files:
A Relate Report file (relate-report.json) that records settings used for relating and summarizes the results (see Relate Report).
The reference sequence (refseq.brickle) in compressed form as a brickle file (see Relate Batch and Brickle: Compressed Python Objects).
Batches of relationship information (relate-batch-n.brickle) as brickle files (see Relate Batch and Brickle: Compressed Python Objects).
Batches of query (read) names (qnames-batch-n.brickle) as brickle files (see Read Names Batch and Brickle: Compressed Python Objects).
Mask directory
The directory mask contains one directory for each reference.
Each of those directories contains one directory for each section (so far, just
the default section full that spans the entire reference sequence).
Each directory for a section contains the following files:
A Mask Report file (mask-report.json) that records settings used for masking and summarizes the results (see Mask Report).
Batches of reads that passed all filters (mask-batch-n.brickle) as brickle files (see ../data/mask/mask and Brickle: Compressed Python Objects).
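The Align, Relate, and Mask steps each write a report whose name ends in -report.json, so all of them can be located at once with find. This sketch assumes the default output directory out and the sample sars2-fse; list_reports is a hypothetical helper name, not part of SEISMIC-RNA.

```shell
# List every report file (name ending in -report.json) under a directory.
# list_reports is a hypothetical helper, not part of SEISMIC-RNA.
list_reports() {
    find "$1" -type f -name '*-report.json' | sort
}

if [ -d out/sars2-fse ]; then
    list_reports out/sars2-fse
fi
```

Because the reports are plain JSON, any JSON viewer or text editor can open the files this prints.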
Table directory
The directory table contains one directory for each reference.
Each of those directories contains one directory for each section (so far, just
the default section full that spans the entire reference sequence).
Each directory for a section contains the following tables in (possibly gzipped)
CSV format:
A table counting all reads with each type of relationship at each position (relate-per-pos.csv).
A table counting masked reads with each masked type of relationship at each masked position in the section (mask-per-pos.csv).
A table counting all positions with each type of relationship in each read (relate-per-read.csv.gz).
A table counting masked positions with each masked type of relationship in each masked read (mask-per-read.csv.gz).
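Because the tables are CSV files, some of them gzipped, their first rows can be previewed straight from the terminal. The sketch below uses a shell glob in place of the reference directory name, which depends on the contents of ref/sars2.fa; preview_table is a hypothetical helper name, not part of SEISMIC-RNA.

```shell
# Print the first rows of a table, decompressing it first if gzipped.
# preview_table is a hypothetical helper, not part of SEISMIC-RNA.
preview_table() {
    table=$1
    rows=${2:-5}   # number of lines to show (default: 5)
    case "$table" in
        *.gz) gzip -dc "$table" | head -n "$rows" ;;
        *)    head -n "$rows" "$table" ;;
    esac
}

# The * stands in for the reference name; full is the default section.
for table in out/sars2-fse/table/*/full/relate-per-pos.csv; do
    if [ -f "$table" ]; then
        preview_table "$table"
    fi
done
```

The same function works for the per-read tables, e.g. preview_table path/to/mask-per-read.csv.gz 10 to show ten lines.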
Graph directory
The directory graph contains one directory for each reference.
Each of those directories contains one directory for each section (so far, just
the default section full that spans the entire reference sequence).
Each directory for a section contains the following graphs in HTML format, plus
their raw data in CSV format:
Mutational profile, i.e. the mutation rate at each position (profile_masked_m-ratio-q0).
Mutational profile with each position subdivided by type of mutation (profile_masked_acgtdi-ratio-q0).
Informative coverage (i.e. number of reads that were either definitely mutated or definitely matched) at each position (profile_masked_n-count).
Histogram of the number of mutations per read (histread_masked_m-count).