Step-by-step walkthrough

Set up your file system

Open a new terminal session, then create and navigate to a new directory for this tutorial (e.g. ~/seismic-tutorial):

mkdir ~/seismic-tutorial
cd ~/seismic-tutorial

Also make directories called fq and ref inside ~/seismic-tutorial for your sequencing reads (FASTQ) and reference sequence (FASTA), respectively:

mkdir fq ref
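If you expect to rerun the setup, the commands above can be combined into one rerunnable sketch. It relies on mkdir -p, which creates missing parent directories and does not fail when a directory already exists. TUTORIAL_DIR is a hypothetical variable introduced only for this illustration; SEISMIC-RNA does not read it.

```shell
# TUTORIAL_DIR is a hypothetical variable for this sketch (not used by SEISMIC-RNA).
TUTORIAL_DIR="${TUTORIAL_DIR:-$HOME/seismic-tutorial}"

# Create the tutorial directory and both subdirectories in one step.
# -p creates missing parents and does nothing for directories that already exist.
mkdir -p "$TUTORIAL_DIR/fq" "$TUTORIAL_DIR/ref"

# Work from the tutorial directory, as the rest of this walkthrough assumes.
cd "$TUTORIAL_DIR"
```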

Download the Example dataset into these directories: FASTQ (.fq.gz) files into fq and the FASTA file (.fa) into ref, like so:

seismic-tutorial
|-- fq
|   |-- sars2-fse_R1.fq.gz
|   |-- sars2-fse_R2.fq.gz
|-- ref
|   |-- sars2.fa

To confirm the paths are correct, type this:

ls fq/sars2-fse_R1.fq.gz fq/sars2-fse_R2.fq.gz ref/sars2.fa

If the paths are correct, this command simply lists them. If it prints No such file or directory (or similar) for any path, then that path is incorrect.
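For a more explicit report, the same check can be sketched as a small shell function that prints one line per file. The function name check_inputs is made up for this example and is not part of SEISMIC-RNA.

```shell
# check_inputs is a hypothetical helper: report each path as found or missing.
# It returns 0 only if every path is an existing regular file.
check_inputs() {
    missing=0
    for path in "$@"; do
        if [ -f "$path" ]; then
            echo "found:   $path"
        else
            echo "missing: $path"
            missing=1
        fi
    done
    return "$missing"
}

# Check the three files this tutorial expects, relative to ~/seismic-tutorial.
check_inputs fq/sars2-fse_R1.fq.gz fq/sars2-fse_R2.fq.gz ref/sars2.fa \
    || echo "Fix the missing paths before continuing."
```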

Run the entire workflow of SEISMIC-RNA

First, activate your Conda environment for SEISMIC-RNA:

conda activate seismic

Run the main workflow for SEISMIC-RNA using this command:

seismic -vv wf -x fq ref/sars2.fa

Let’s break down what this is doing:

  • seismic is the SEISMIC-RNA program (an executable file).

  • -vv makes SEISMIC-RNA use double-verbose mode, logging the maximum amount of information to the console; maximum verbosity is useful for tutorials and troubleshooting, though entirely optional.

  • wf tells SEISMIC-RNA to run its entire workflow.

  • -x fq tells SEISMIC-RNA to accept paired-end sequencing reads (-x) from fq; since fq is a directory, it will be searched (recursively) for all FASTQ files. You could also type -x fq/sars2-fse_R1.fq.gz -x fq/sars2-fse_R2.fq.gz to specify files individually, but this is more cumbersome to type.

  • ref/sars2.fa is the FASTA file of reference sequence(s); it must be given as the first positional argument (i.e. not immediately preceded by an option beginning with -) after the step wf.

This command should take several minutes to run on a modern computer.
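For comparison, here is the variant mentioned above in which each FASTQ file is passed with its own -x option instead of the directory fq. The arguments come straight from the breakdown above; the surrounding guard is an addition for this sketch, so that the snippet prints a hint rather than an error if the Conda environment has not been activated.

```shell
# Same workflow, but naming each paired-end FASTQ file explicitly.
# The FASTA file is still the first positional argument after wf.
if command -v seismic >/dev/null 2>&1; then
    seismic -vv wf \
        -x fq/sars2-fse_R1.fq.gz \
        -x fq/sars2-fse_R2.fq.gz \
        ref/sars2.fa
else
    # This guard is not part of SEISMIC-RNA; it only makes the sketch fail softly.
    echo "seismic not found on PATH; run 'conda activate seismic' first." >&2
fi
```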

View the results

All files generated by SEISMIC-RNA will have gone into the directory out (the default name for the output directory).

Output directory

The output directory contains one directory for each sample (i.e. FASTQ file or pair of paired-end FASTQ files). In this case, there will be one sample called sars2-fse (whose name derives from the input FASTQ files).

Sample directory

Inside the directory for a sample, there will be one directory for each step of the workflow: qc, align, relate, mask, table, and graph.
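A quick way to confirm that every step produced output is to look for each of these directories. The following is a minimal sketch; check_steps is a name invented for this example, and the paths assume the default output directory out and the sample sars2-fse from this tutorial.

```shell
# check_steps is a hypothetical helper: print one status line per workflow step.
check_steps() {
    sample_dir=$1
    for step in qc align relate mask table graph; do
        if [ -d "$sample_dir/$step" ]; then
            echo "present: $step"
        else
            echo "absent:  $step"
        fi
    done
}

# Run from ~/seismic-tutorial after the workflow has finished.
check_steps out/sars2-fse
```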

Align directory

The directory align contains all output files from the Align step:

  • An Align Report file (align-report.json) that records settings used for alignment and summarizes the results, such as the number of reads that aligned to each reference (see Align Report).

  • An alignment map file in BAM (.bam) or CRAM (.cram) format for each reference (see SAM, BAM, and CRAM: Alignment Maps), containing the reads that mapped to that reference; the file name is the name of the reference.

  • A file of reads that did not align to any reference, in gzipped FASTQ format (see FASTQ: Sequencing Reads): unaligned.fq.gz (for single-end reads) or unaligned.fq.1.gz and unaligned.fq.2.gz (for paired-end reads); unaligned reads can be useful for troubleshooting low rates of alignment.
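When troubleshooting a low alignment rate, a natural first question is how many reads ended up unaligned. The sketch below counts them, assuming the usual FASTQ layout of four lines per read; count_fastq_reads is a name made up for this example, and the paths assume the default output directory and the sample from this tutorial.

```shell
# count_fastq_reads is a hypothetical helper: count reads in a gzipped FASTQ
# file, assuming each read occupies exactly 4 lines.
count_fastq_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

ALIGN_DIR="out/sars2-fse/align"

# Only the files that exist are counted: one file for single-end reads,
# two files for paired-end reads.
for fq in "$ALIGN_DIR/unaligned.fq.gz" \
          "$ALIGN_DIR/unaligned.fq.1.gz" \
          "$ALIGN_DIR/unaligned.fq.2.gz"; do
    if [ -f "$fq" ]; then
        echo "$fq: $(count_fastq_reads "$fq") reads"
    fi
done
```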

Relate directory

The directory relate contains one directory for each reference. Each of those directories contains the following files:

Mask directory

The directory mask contains one directory for each reference. Each of those directories contains one directory for each section (so far, just the default section full that spans the entire reference sequence). Each directory for a section contains the following files:

  • A Mask Report file (mask-report.json) that records settings used for masking and summarizes the results (see Mask Report).

  • Batches of reads that passed all filters (mask-batch-n.brickle) as brickle files (see ../data/mask/mask and Brickle: Compressed Python Objects).

Table directory

The directory table contains one directory for each reference. Each of those directories contains one directory for each section (so far, just the default section full that spans the entire reference sequence). Each directory for a section contains the following tables in (possibly gzipped) CSV format:

  • A table counting all reads with each type of relationship at each position (relate-per-pos.csv).

  • A table counting masked reads with each masked type of relationship at each masked position in the section (mask-per-pos.csv).

  • A table counting all positions with each type of relationship in each read (relate-per-read.csv.gz).

  • A table counting masked positions with each masked type of relationship in each masked read (mask-per-read.csv.gz).
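Because these tables may or may not be gzipped, it can help to preview them with a helper that handles both cases. This is a sketch only: preview_table is a name invented here, and find is used to locate the tables so that the reference directory name does not need to be known in advance.

```shell
# preview_table is a hypothetical helper: show the first 5 lines of a CSV
# file, decompressing it first if its name ends in .gz.
preview_table() {
    case "$1" in
        *.gz) gzip -dc "$1" | head -n 5 ;;
        *)    head -n 5 "$1" ;;
    esac
}

# Preview every masked per-position table for the tutorial sample.
if [ -d out/sars2-fse/table ]; then
    find out/sars2-fse/table -type f -name 'mask-per-pos.csv*' |
        while IFS= read -r table; do
            echo "== $table"
            preview_table "$table"
        done
fi
```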

Graph directory

The directory graph contains one directory for each reference. Each of those directories contains one directory for each section (so far, just the default section full that spans the entire reference sequence). Each directory for a section contains the following graphs in HTML format, plus their raw data in CSV format:

  • Mutational profile, i.e. the mutation rate at each position (profile_masked_m-ratio-q0).

  • Mutational profile with each position subdivided by type of mutation (profile_masked_acgtdi-ratio-q0).

  • Informative coverage (i.e. number of reads that were either definitely mutated or definitely matched) at each position (profile_masked_n-count).

  • Histogram of the number of mutations per read (histread_masked_m-count).
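To find the graphs without navigating the nested directories by hand, you can list every HTML file under the graph directory and open any of them in a web browser. The sketch below assumes the default output directory and the sample from this tutorial; list_graphs is a name made up for this example.

```shell
# list_graphs is a hypothetical helper: list all HTML files under a directory,
# sorted by path.
list_graphs() {
    find "$1" -type f -name '*.html' | sort
}

# List the graphs for the tutorial sample, if the workflow has been run.
if [ -d out/sars2-fse/graph ]; then
    list_graphs out/sars2-fse/graph
fi
```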