
Step-by-step walkthrough
================================================================================

Set up your file system
--------------------------------------------------------------------------------

Open a new terminal session, then create and navigate to a new directory for
this tutorial (e.g. ``~/seismic-tutorial``)::

    mkdir ~/seismic-tutorial
    cd ~/seismic-tutorial

Also make directories called ``fq`` and ``ref`` inside ``~/seismic-tutorial``
for your sequencing reads (FASTQ) and reference sequence (FASTA), respectively::

    mkdir fq ref

Download the :ref:`example-data` into these directories: FASTQ (``.fq.gz``)
files into ``fq`` and the FASTA file (``.fa``) into ``ref``, like so::

    seismic-tutorial
    |-- fq
    |   |-- sars2-fse_R1.fq.gz
    |   |-- sars2-fse_R2.fq.gz
    |-- ref
    |   |-- sars2.fa

To confirm the paths are correct, type this::

    ls fq/sars2-fse_R1.fq.gz fq/sars2-fse_R2.fq.gz ref/sars2.fa

If the paths are correct, this command simply lists them.
If it prints ``No such file or directory`` (or similar) for any path, then that
path is incorrect.
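
To see at a glance which file is missing, a small shell loop can report on each
path separately.
This is only a sketch: ``check_inputs`` is a hypothetical helper defined here
for illustration, not a part of SEISMIC-RNA::

    # Report each expected input file as found or missing.
    # Returns a non-zero status if any file is missing.
    check_inputs() {
        missing=0
        for path in "$@"; do
            if [ -f "$path" ]; then
                echo "found:   $path"
            else
                echo "MISSING: $path"
                missing=1
            fi
        done
        return "$missing"
    }

    check_inputs fq/sars2-fse_R1.fq.gz fq/sars2-fse_R2.fq.gz ref/sars2.fa \
        || echo "Fix the paths marked MISSING before continuing."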

Run the entire workflow of SEISMIC-RNA
--------------------------------------------------------------------------------

First, activate your Conda environment for SEISMIC-RNA::

    conda activate seismic

Run the main workflow for SEISMIC-RNA using this command::

    seismic -vv wf -x fq ref/sars2.fa

Let's break down what this is doing:

- ``seismic`` is the SEISMIC-RNA program (an executable file).
- ``-vv`` makes SEISMIC-RNA use double-verbose mode, logging the maximum amount
  of information to the console; maximum verbosity is useful for tutorials and
  troubleshooting, though entirely optional.
- ``wf`` tells SEISMIC-RNA to run its entire workflow.
- ``-x fq`` tells SEISMIC-RNA to accept paired-end sequencing reads (``-x``)
  from ``fq``; since ``fq`` is a directory, it will be searched (recursively)
  for all FASTQ files.
  You could also type ``-x fq/sars2-fse_R1.fq.gz -x fq/sars2-fse_R2.fq.gz`` to
  specify files individually, but this is more cumbersome to type.
- ``ref/sars2.fa`` is the FASTA file of reference sequence(s); it must be given
  as the first positional argument (i.e. not immediately preceded by an option
  beginning with ``-``) after the step ``wf``.

This command should take several minutes to run on a modern computer.

View the results
--------------------------------------------------------------------------------

All files generated by SEISMIC-RNA will have gone into the directory ``out``
(the default name for the output directory).

Output directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The output directory contains one directory for each sample (i.e. FASTQ file or
pair of paired-end FASTQ files).
In this case, there will be one sample called ``sars2-fse`` (whose name derives
from the input FASTQ files).

Sample directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Inside the directory for a sample, there will be one directory for each step of
the workflow: ``qc``, ``align``, ``relate``, ``mask``, ``table``, and ``graph``.
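
A quick way to confirm that every step produced output is to check for each of
these directories.
The following is a sketch using a hypothetical helper, ``list_steps``, and
assumes the default output directory ``out`` and the sample ``sars2-fse``::

    # Show which workflow step directories exist for a sample.
    list_steps() {
        sample_dir=$1
        for step in qc align relate mask table graph; do
            if [ -d "$sample_dir/$step" ]; then
                echo "$step: present"
            else
                echo "$step: absent"
            fi
        done
    }

    list_steps out/sars2-fse

If a step is reported as absent, the console log from the workflow is the first
place to look for the reason.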

Align directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The directory ``align`` contains all output files from the Align step:

- An Align Report file (``align-report.json``) that records settings used for
  alignment and summarizes the results, such as the number of reads that aligned
  to each reference (see :doc:`../formats/report/align`).
- An alignment map file in BAM (``.bam``) or CRAM (``.cram``) format for each
  reference (see :doc:`../formats/data/xam`), containing the reads that mapped
  to that reference; the file name is the name of the reference.
- A file of reads that did not align to any reference, in gzipped FASTQ format
  (see :doc:`../formats/data/fastq`): ``unaligned.fq.gz`` (for single-end reads)
  or ``unaligned.fq.1.gz`` and ``unaligned.fq.2.gz`` (for paired-end reads);
  unaligned reads can be useful for troubleshooting low rates of alignment.
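
When investigating a low alignment rate, it can help to know how many reads
ended up in the unaligned files.
The sketch below counts them with standard tools; ``count_fastq_reads`` is a
hypothetical helper, and it assumes the usual FASTQ layout of exactly four
lines per read::

    # Count reads in a gzipped FASTQ file (assumes 4 lines per read).
    count_fastq_reads() {
        echo $(( $(gzip -cd "$1" | wc -l) / 4 ))
    }

    # Covers both the single-end and the paired-end file names.
    for fq in out/sars2-fse/align/unaligned.fq*.gz; do
        [ -f "$fq" ] || continue
        echo "$fq: $(count_fastq_reads "$fq") reads"
    done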

Relate directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The directory ``relate`` contains one directory for each reference.
Each of those directories contains the following files:

- A Relate Report file (``relate-report.json``) that records settings used for
  relating and summarizes the results (see :doc:`../formats/report/relate`).
- The reference sequence (``refseq.brickle``) in compressed form as a brickle
  file (see :doc:`../data/relate/relate` and :doc:`../formats/data/brickle`).
- Batches of relationship information (``relate-batch-n.brickle``) as brickle
  files (see :doc:`../data/relate/relate` and :doc:`../formats/data/brickle`).
- Batches of query (read) names (``qnames-batch-n.brickle``) as brickle
  files (see :doc:`../data/relate/qnames` and :doc:`../formats/data/brickle`).

Mask directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The directory ``mask`` contains one directory for each reference.
Each of those directories contains one directory for each section (so far, just
the default section ``full`` that spans the entire reference sequence).
Each directory for a section contains the following files:

- A Mask Report file (``mask-report.json``) that records settings used for
  masking and summarizes the results (see :doc:`../formats/report/mask`).
- Batches of reads that passed all filters (``mask-batch-n.brickle``) as brickle
  files (see :doc:`../data/mask/mask` and :doc:`../formats/data/brickle`).
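
Because the Align, Relate, and Mask steps each write a report whose name ends
in ``-report.json``, all of them can be located with one search.
This is a sketch with a hypothetical helper, ``list_reports``, assuming the
default output directory ``out``::

    # Print the path of every report file under a sample directory.
    list_reports() {
        find "$1" -type f -name '*-report.json' | sort
    }

    if [ -d out/sars2-fse ]; then
        list_reports out/sars2-fse
    fi

Each report is plain JSON, so it can be read in any text editor.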

Table directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The directory ``table`` contains one directory for each reference.
Each of those directories contains one directory for each section (so far, just
the default section ``full`` that spans the entire reference sequence).
Each directory for a section contains the following tables in (possibly gzipped)
CSV format:

- A table counting all reads with each type of relationship at each position
  (``relate-per-pos.csv``).
- A table counting masked reads with each masked type of relationship at each
  masked position in the section (``mask-per-pos.csv``).
- A table counting all positions with each type of relationship in each read
  (``relate-per-read.csv.gz``).
- A table counting masked positions with each masked type of relationship in
  each masked read (``mask-per-read.csv.gz``).
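
To peek at these tables without opening a spreadsheet program, the first few
lines of each can be printed in the terminal.
The sketch below uses a hypothetical helper, ``preview_table``, which
decompresses the file first when its name ends in ``.gz``; the paths assume the
default output directory ``out`` and the section ``full``::

    # Print the first 5 lines of a CSV table, gzipped or not.
    preview_table() {
        case "$1" in
            *.gz) gzip -cd "$1" | head -n 5 ;;
            *)    head -n 5 "$1" ;;
        esac
    }

    for table in out/sars2-fse/table/*/full/*.csv*; do
        [ -f "$table" ] || continue
        echo "== $table"
        preview_table "$table"
    done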

Graph directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The directory ``graph`` contains one directory for each reference.
Each of those directories contains one directory for each section (so far, just
the default section ``full`` that spans the entire reference sequence).
Each directory for a section contains the following graphs in HTML format, plus
their raw data in CSV format:

- Mutational profile, i.e. the mutation rate at each position
  (``profile_masked_m-ratio-q0``).
- Mutational profile with each position subdivided by type of mutation
  (``profile_masked_acgtdi-ratio-q0``).
- Informative coverage (i.e. number of reads that were either definitely mutated
  or definitely matched) at each position (``profile_masked_n-count``).
- Histogram of the number of mutations per read (``histread_masked_m-count``).
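
To list the graphs that were produced, along with their raw data, a loop such
as the following sketch can be used.
``list_graphs`` is a hypothetical helper, and it assumes that each HTML graph
and its CSV file share the same base name, which should be verified against the
actual output::

    # List each HTML graph in a directory and its CSV file, if one exists.
    # Assumption: graph.html and graph.csv share the same base name.
    list_graphs() {
        for html in "$1"/*.html; do
            [ -f "$html" ] || continue
            csv="${html%.html}.csv"
            if [ -f "$csv" ]; then
                echo "$html (data: $csv)"
            else
                echo "$html (no matching CSV found)"
            fi
        done
    }

    for section_dir in out/sars2-fse/graph/*/full; do
        list_graphs "$section_dir"
    done

The HTML files can then be opened in any web browser.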
