Performing an example analysis¶
Here you will be guided trough a small example analysis using a publicly available RNA-Seq from NCBI GEO that was part of a publication by Kröger et al.. This is a transcriptome analysis of Salmonella Typhimurium SL1344 in different conditions. We will generate several output files in different formats. The CSV (tabular separated plain text files) files can be opened with any spreadsheet program like LibreOffice or Excel. For inspecting the mappings (in BAM format) and coverage files (wiggle format) you can use a genome browser for example IGB or IGV.
Generating a project¶
At first we have to create the analysis folder and its subfolder. For
this we use the create
subcommand:
$ reademption create READemption_analysis
Created folder "READemption_analysis2" and required subfolders.
Please copy read files into folder "READemption_analysis2/input/reads" and reference sequences files into folder "READemption_analysis2/input/reference_sequences".
This will result in a folder structure as shown here:
READemption_analysis
├── input
│ ├── annotation_files
│ ├── reads
│ └── reference_sequences
└── output
├── align
│ ├── alignments
│ ├── index
│ ├── processed_reads
│ ├── reports_and_stats
│ │ ├── stats_data_json
│ │ └── used_reademption_version.txt
│ └── unaligned_reads
├── coverage
│ ├── coverage-raw
│ ├── coverage-tnoar_mil_normalized
│ └── coverage-tnoar_min_normalized
├── deseq
│ ├── deseq_raw
│ └── deseq_with_annotations
├── gene_quanti
│ ├── gene_quanti_combined
│ └── gene_quanti_per_lib
├── viz_align
├── viz_deseq
└── viz_gene_quanti
Retrieving the input data¶
We have to download the reference sequence (FASTA format) as well as
the annotation file (GFF3 format) for Salmonella from NCBI. As we
will use the URL of Salmonella Typhimurium SL1344’s source FTP
folder it several times we store it in an environment variable called
FTP_SOURCE
.
$ FTP_SOURCE=ftp://ftp.ncbi.nih.gov/genomes/archive/old_refseq/Bacteria/Salmonella_enterica_serovar_Typhimurium_SL1344_uid86645/
We download the reference sequence (the chromosome and three plasmids)
in FASTA format and store them in the reference_sequences
folder. The files are saved with a different suffix (.fa
instead
of .fna
) as some genome browser (e.g. IGB) will not accept them as
FASTA files otherwise.
$ wget -O READemption_analysis/input/reference_sequences/NC_016810.fa $FTP_SOURCE/NC_016810.fna
$ wget -O READemption_analysis/input/reference_sequences/NC_017718.fa $FTP_SOURCE/NC_017718.fna
$ wget -O READemption_analysis/input/reference_sequences/NC_017719.fa $FTP_SOURCE/NC_017719.fna
$ wget -O READemption_analysis/input/reference_sequences/NC_017720.fa $FTP_SOURCE/NC_017720.fna
We have to modify the header of the FASTA files as the sequence IDs have to be the same as the ones in the first column of the GGF3 files (see below) to be used in the gene quantification. This will be also necessary if both, FASTA and GFF3 files, will be loaded in the IGB.
$ sed -i "s/>/>NC_016810.1 /" READemption_analysis/input/reference_sequences/NC_016810.fa
$ sed -i "s/>/>NC_017718.1 /" READemption_analysis/input/reference_sequences/NC_017718.fa
$ sed -i "s/>/>NC_017719.1 /" READemption_analysis/input/reference_sequences/NC_017719.fa
$ sed -i "s/>/>NC_017720.1 /" READemption_analysis/input/reference_sequences/NC_017720.fa
Then we download the GFF3 files that contain the annotations.
$ wget -P READemption_analysis/input/annotations $FTP_SOURCE/*gff
Finally, we need the reads of the RNA-Seq libraries. To save some time for running this examples we will work with subsampled libraries of 1M reads each. This will the limit informative value of the results which is acceptable as we just want to understand the workflow of the READemption. Please be aware that READemption can perform only basic quality trimming and adapter clipping. If this is not sufficient you can use the FASTX toolkit, cutadapt or other tools for the preprocessing.
$ wget -P READemption_analysis/input/reads http://reademptiondata.imib-zinf.net/InSPI2_R1.fa.bz2
$ wget -P READemption_analysis/input/reads http://reademptiondata.imib-zinf.net/InSPI2_R2.fa.bz2
$ wget -P READemption_analysis/input/reads http://reademptiondata.imib-zinf.net/LSP_R1.fa.bz2
$ wget -P READemption_analysis/input/reads http://reademptiondata.imib-zinf.net/LSP_R2.fa.bz2
We have now all the necessary data available. The input folder should look like this now:
$ ls READemption_analysis/input/*
READemption_analysis/input/annotations:
NC_016810.gff NC_017718.gff NC_017719.gff NC_017720.gff
READemption_analysis/input/reads:
InSPI2_R1.fa.bz2 InSPI2_R2.fa.bz2 LSP_R1.fa.bz2 LSP_R2.fa.bz2
READemption_analysis/input/reference_sequences:
NC_016810.fa NC_017718.fa NC_017719.fa NC_017720.fa
Processing and aligning the reads¶
The first step it the read processing and mapping. Via parameters we
tell READemption to use 4 CPU (-p 4
) and perform a poly-A-clipping
(--poly_a_clipping
) before the mapping.
$ reademption align -p 4 --poly_a_clipping READemption_analysis
Once this the mapping is done the file read_alignment_stats.csv
is
created which can be found in
READemption_analysis/output/align/reports_and_stats/
. It contains
several mapping statistics for example how many reads are successfully
aligned in total and how many were aligned to each replicon. We see
that more than 98 % of the reads are mapped for each library. Sorted
and indexed alignements in BAM format are stored in
READemption_analysis/output/align/alignments
. We could load them
into a genome browser but instead we continue with the next step.
Generating coverage files¶
In order to generate strand specific coverage files with different
normalizations we use the subcommand coverage
.
$ reademption coverage -p 4 READemption_analysis
The sets are stored in subfolder of
READemption_analysis/output/coverage/
. The most oftenly used set
is stored in coverage-tnoar_min_normalized
. Here the coverage
values are normalized by the total number of aligned reads (TNOAR) of
the individual library and then multiplied by the lowest TNOAR value
of all libraries. These files could be inspected for differential
RNA-Seq (dRNA-Seq - comparing libraries with and without Terminator
Exonuclease treatment) data in order to determine transcriptional
start sites. They can be loaded in common genome browsers like IGB or IGV. Keep in mind that the
coverages of the reverse strand have negative values so you have to
adapt the scaling in some genome browsers.
Performing gene wise quantification¶
In this step we want to quantify the number of reads overlapping with
the locations of the annotation entries. With the --features
parameter we configure reademption
to just quantify CDS, tRNA and
rRNA entries.
$ reademption gene_quanti -p 4 --features CDS,tRNA,rRNA READemption_analysis
After the quantification we find tables that contain the combined
counting for all entries in
READemption_analysisoutput/gene_quanti/gene_quanti_combined/
. The
countings for mappings in sense and anti-sense are separately
listed. Besides the raw countings there are also tables for
countings normalized by the total number of reads and RPKM values.
Performing differential gene expression analysis¶
To compare the gene expression of different conditions we apply the
subcommand deseq
which makes use of the R library DESeq2.
$ reademption deseq \
-l InSPI2_R1.fa.bz2,InSPI2_R2.fa.bz2,LSP_R1.fa.bz2,LSP_R2.fa.bz2 \
-c InSPI2,InSPI2,LSP,LSP READemption_analysis
We have to tell READemption which libraries are replicates of which
condition. This is done by the parameter -l
and -c
. -l
should hold a comma separated list of the libraries and -c
the
corresponding conditions. In our case we have 4 libraries
(InSPI2_R1.fa.bz2
, InSPI2_R2.fa.bz2
, LSP_R1.fa.bz2
,
LSP_R2.fa.bz2
) and two condition (which we call InSPI2
and
LSP
). Just to make this association easier to understand:
InSPI2_R1.fa.bz2 InSPI2_R2.fa.bz2 LSP_R1.fa.bz2 LSP_R2.fa.bz2
| | | |
InSPI2 InSPI2 LSP LSP
When you call deseq
it will compare all conditions with each other
and you can pick the comparison that you need. The raw DESeq2
results are enriched with the original annotation information and are
stored in
READemption_analysis/output/deseq/deseq_with_annotations/
Create plots¶
Finally we generate plots that visualize the results of the different
steps. viz_align
creates histograms of the read length
distribution for the untreated and treated reads (saved in
READemption_analysis/output/viz_align/
).
$ reademption viz_align READemption_analysis
viz_gene_quanti
visualizes the gene wise countings. In our example
you will see that - as expected - the replicates are more similar to
each other than to the libs of the other condition. It also generates
bar plots that show the distribution of reads inside the different RNA
classes.
$ reademption viz_gene_quanti READemption_analysis
viz_deseq
generates MA-plots as well as volcano plots.
$ reademption viz_deseq READemption_analysis