MasterOfPores: A Workflow for the Analysis of Oxford Nanopore Direct RNA Sequencing Datasets


	Results



	Overview of the MasterOfPores Workflow

Workflow management systems together with Linux containers offer a solution to efficiently analyze large scale datasets in a highly reproducible, scalable and parallelizable manner. During the last decade, an increasing interest in the field has led to the development of different programs such as Snakemake (Köster and Rahmann, 2012), NextFlow (Di Tommaso et al., 2017), Galaxy (Afgan et al., 2018), SciPipe (Lampa et al., 2019) or GenPipes (Bourgey et al., 2019), among others. These tools enable the prototyping and deployment of pipelines by abstracting computational processes and representing pipelines as directed graphs, in which nodes represent tasks to be executed and edges represent either data flow or execution dependencies between different tasks.
Here we chose the workflow framework NextFlow (Di Tommaso et al., 2017) because of its native support of different batch schedulers (SGE, LSF, SLURM, PBS, and HTCondor), cloud platforms (Kubernetes, Amazon AWS, and Google Cloud) and GPU computing, which is crucial for processing huge volumes of data produced by nanopore sequencers. NextFlow has tight integration with lightweight Linux containers, such as Docker and Singularity. Automatic organization of intermediate results produced during the NextFlow pipeline execution allows reducing the complexity of intermediary file names and the possibility of name clashing. Continuous check-pointing with the possibility of resuming failed executions, interoperability and meticulous monitoring and reporting of resource usage are among other thought-after features of NextFlow. The executables of the presented pipeline have been bundled within Docker images accessible at DockerHub that can be converted on the fly into a Singularity image, thus allowing the HPC usage.
The MasterOfPores workflow includes all steps needed to process raw FAST5 files produced by Nanopore direct RNA sequencing and executes the following steps, allowing users a choice among different algorithms (Figure 1). The pipeline consists of 3 modules:
 ** NanoPreprocess: this module takes as input the raw Fast5 reads and produces as output base-called sequences both in FAST5 and FASTQ formats, as well as alignments in BAM format. The pre-processing module performs base-calling, demultiplexing, filtering, quality control, mapping and gene and/or transcript quantification, generating a final report of the performance and results of each of the steps performed.
** NanoTail: this module takes as input the output from the NanoPreprocess module and produces polyA tail length estimations using two different algorithms.
** NanoMod: this module takes as input the files generated during the pre-processing step, and produces flat text files with the predicted RNA modifications using two different algorithms.


	Pre-processing Module: NanoPreprocess

The NanoPreprocess module consists of 8 main steps (Figure 2):
 ** Read base-calling with the algorithm of choice, using Albacore[] or Guppy.[] This step can be run in parallel and the user can decide the number of files to be processed in a single job by using the command –granularity.
** Demultiplexing of the reads using DeePlexiCon (Smith et al., 2019). This step is optional, and can only be used if the libraries have been barcoded using the oligonucleotides used to train the deep neural classifier[6]
** Filtering of the resulting fastq files using Nanofilt (De Coster et al., 2018). This step is optional and can be run in parallel.
** Quality control of the base-called data, using MinIONQC (Lanfear et al., 2019) and FastQC.[7]
** Read mapping to the reference genome or transcriptome, using minimap2[8] or graphmap2.[9]
** Quality control on the alignment, using NanoPlot[10] and bam2stats.[11]
** Gene or transcript quantification, using HTSeq (Anders et al., 2015) or NanoCount.[12] The latter estimates transcript abundance using an expectation-maximization algorithm. NanoCount will be run if reads have been mapped to the transcriptome, using the flag –reference_type transcriptome, whereas HTSeq will be employed to quantify per-gene counts if the reads have been mapped to the genome.
** Final report of the data processing using multiQC[13] that combines the single quality controls done previously, as well as global run statistics (Figure 3).


	Data Analysis Modules: NanoTail and NanoMod

The MasterOfPores pipeline contains two additional modules for the downstream analyses of the mapped reads, namely NanoTail and NanoMod, which provide polyA tail length estimations and RNA modification predictions, respectively (Figure 2). The modules can be run using as input the output from the NanoPreprocess module.
The NanoTail module estimates polyA tail lengths using Nanopolish[14] and TailfindR,[15] producing a plain text file with polyA tail length estimations for each read, computed using both algorithms. The correlation between the two algorithms is also reported as a plot.
The NanoMod module predicts RNA modifications using Tombo[16] and EpiNano,[17] producing a plain text files with the predicted sites by each algorithm. The NanoMod module is run “paired mode,” i.e., providing two conditions, as both EpiNano and Tombo identify RNA modifications by comparing two conditions.


	Running MasterOfPores: Installation, Input, Parameters and Output

To run MasterOfPores, the following steps are required:
 ** Install NextFlow (version 19.10.0):
**  ** $ curl -s https://get.nextflow.io | bash

 ** Clone the MasterOfPores repository:
**  ** $ git clone –depth 1 https://github.com/biocorecrg/master_of_pores.git

 ** Install Docker or Singularity (for Singularity, version 2.6.1 and Docker 19.03 or later are required):
**  ** Docker: https://docs.docker.com/install/
** Singularity: https://sylabs.io/guides/2.6/user-guide/quick_start.html#quick-installation-steps

 ** Download Nanopore base-calling algorithms: guppy with or without GPU support and or the albacore Wheel file (a standard built-package format used for Python distributions) and install them inside the bin folder inside the MasterOfPores directory. The users can place their preferred version of guppy and/or albacore in the bin folder (in the example below, albacore version 2.1.7 and guppy 3.1.5).
**  ** $ cd master_of_pores/NanoPreprocess/bin
** $ tar -zvxf ont-guppy_3.1.5_linux64.tar.gz
** $ ln -s ont-guppy_3.1.5_linux64/ont-guppy/bin/guppy_ [∗].
** $ pip3 install –target = ./albacore ont_albacore-2.1.7-cp36-cp36m-manylinux1_x86_64.whl
** $ ln -s albacore/bin/multi_to_single_fast5
** $ ln -s albacore/bin/read_fast5_basecaller.py

 ** Optional step: install CUDA drivers (only needed for GPU support):
**  **  https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

 ** Run the pre-processing step of the pipeline (using singularity or docker):
**  ** $ cd./
** $ nextflow run nanopreprocess.nf-with-singularity
** or
** $ nextflow run nanopreprocess.nf-with-docker

 ** Run polyA tail estimation module
**  ** $ cd./NanoTail
** $ nextflow run nanotail.nf-bg-with-singularity –input_folders “.NanoPreprocess/output/RNA [∗]”

 ** Run RNA modification prediction module
**  ** $ cd./ NanoMod
** $ nextflow run nanomod.nf -with-singularityinput_path “.NanoPreprocess/output/”

The NanoPreprocess module can handle both single- and multi-FAST5 reads as input. To execute the workflow, several parameters can be defined by the user, including the choice of the basecaller (albacore or guppy), mapper (minimap2 or graphmap2), as well as their command line options. If these are not specified by the user, the workflow will be run with default parameter settings specified in the params.config file (Table 2). The final report includes four different types of metrics: (i) General statistics of the input, including the total number of reads, GC content and number of identical base-called sequences; (ii) Per-read statistics of the input data, including scatterplots of the average read length versus sequence identity, the histogram of read lengths, and the correlation between read quality and identity; (iii) Alignment statistics, including the total number of mapped reads, the total number of mapped bases, the average length of mapped reads, and the mean sequence identity; (iv) Quality filtering statistics, including the number of filtered reads, median Q-score and read length, compared to those observed in all sequenced reads; and (v) Per-read analysis of biases, including information on duplicated reads, over-represented reads and possible adapter sequences (Figure 3). The final outputs of this module include:
 ** Basecalled fast5 files within the “fast5_files” folder.
** Filtered fastq files within “fastq_files” folder.
** QC reports within “QC” folder.
** Final report within “report” folder.
** Aligned reads in sorted BAM files within the “aln” folder.
** Read counts within the “counts” folder.
The NanoMod module requires two samples to detect RNA modifications, typically wild-type and knock-out (or knock-down) matched conditions. The user must provide a tab-delimited file (–comparison “comparison.tsv”) indicating which input file is the wild-type condition and which one is the knock-out or knock-down condition (see, for example[18]), which is specified in the parameter file. The NanoMod module will output the results into two different folders:
 ** RNA modification results predicted using Tombo in the “Tombo” folder
** RNA modification results predicted using EpiNano in the “EpiNano” folder
The NanoTail module will output the results into three different folders:
 ** PolyA tail length estimates predicted using Nanopolish, in the “Nanopolish” folder.
** PolyA tail length estimates predicted using tailfindR, in the “Tailfindr” folder.
** In this module, an additional “NanoMod_final” folder is generated, containing combined Nanopolish and tailfindR estimates of polyA tail lengths, as well as information regarding the geneID or transcriptID where the read is mapped to.


	Running MasterOfPores on the Cloud (AWS Batch and AWS EC2)

Nanopore sequencing allows for real-time sequencing of samples. While GridION devices come with built-in GPUs that allows live base-calling, smaller MinION devices do not have built-in CPU or GPU. Thus, the user has to connect the MinION to a computer with sufficient CPU/GPU capabilities, or run base-calling after the sequencing. In all these contexts, the possibility of running the MasterOfPores pipeline on the cloud presents a useful alternative.
The Amazon Web Services (AWS) Batch is a computing service that enables users to submit jobs to a cloud-based user-defined infrastructure, which can be easily set up via either code-based definitions or a web-based interface. Computation nodes can be allocated in advance or according to resource availability. Cloud infrastructure can be also deployed or dismantled on demand using automation tools, such as CloudFormation or Terraform.
Here we show that the MasterOfPores pipeline can be successfully implemented on the cloud, and provide the Terraform script for running MasterOfPores on the AWS Batch CPU environments, available in the GitHub repository.[19] To run the pipeline using the AWS Batch, the users needs to change only a few parameters related to their accounts in a configuration file. The pipeline can be run from either a local workstation or an Amazon EC2 entrypoint instance initiated for this purpose (we recommend the latter). Data to be analyzed can be uploaded to an Amazon S3 storage bucket.
Similarly, we also tested whether our pipeline could be run in Amazon Web Services (AWS) Elastic Compute Cloud (EC2), which is one of the most popular cloud services (Supplementary Table S1). Compared to AWS Batch, to run any workflow in AWS EC2, the user must first create an Amazon Machine Image (AMI). The AMI can be created using the same instructions as provided inSupplementary File S1, starting from the official Ubuntu 18.04 LTS AMI, and including both Docker and Singularity software with NVIDIA libraries support. Here we show that the resulting image can be used to run the MasterOfPores workflow with NVIDIA Tesla V100 GPU cards. Automation scripts to run MasterOfPores in AWS EC2 can be found in the GitHub repository.[20]


	Test Case: Analysis of Saccharomyces cerevisiae SK1 PolyA(+) RNA



	Running the MasterOfPores Pipeline on S. cerevisiae PolyA(+) RNA

To benchmark the performance of the MasterOfPores workflow, we employed four publicly available direct RNA sequencing runs of polyA(+)-selected S. cerevisiae WT and ime4△ strains, in biological replicates, which had been sequenced using MinION and GridION devices, producing a total of ∼3 million reads (Table 1). We used up to 100 nodes with 8 CPUs for testing the base-calling in CPU mode and 1 node with 1 GPU card for testing the base-calling in GPU mode (Table 1).
The MasterOfPores NanoPreprocess module was ran using guppy version 3.1.5 as the base-caller and minimap2 version 2.17 as the mapping algorithm. Reads were filtered by running nanofilt with the options “-q 0 –headcrop 5 –tailcrop 3 –readtype 1D”. Filtered reads were mapped to the yeast SK1 fasta genome. Specifically, the command that was executed to run the pipeline with these settings was:
Then, the two data analysis modules were executed as follows:


	Benchmarking the Time Used for the Analysis of S. cerevisiae PolyA(+) RNA

Here we have tested the pipeline using both CPU and GPU computing. Specifically, we ran the pipeline on the following configurations: (i) a single CPU node (e.g., emulating the computing time on a single laptop); (ii) a CPU cluster with 100 nodes; (iii) a single mid-range GPU card (RTX2080); and (iv) a single high-end GPU card (GTX1080 Ti). We found that the computing time required to run the pipeline on a single GPU card was significantly lower than the running time in parallel on a high performance CPU cluster with 100 nodes, 8 cores per node (Table 1, see alsoSupplementary Table S1). Moreover, we found that the computing time of the NanoPreprocess module can be significantly reduced depending on the GPU card (base-calling step was ∼2X faster for GTX1080 Ti than for RTX2080).


	Reporting Resources Used for the Analysis of S. cerevisiae PolyA(+) RNA

Taking advantage of the NextFlow reporting functions, the pipeline can produce detailed reports on the time and resources consumed by each process (Figure 4), in addition to the output files (bam, fastq) and final report (html), if the workflow is executed with parameters -with-report (formatted report) or-with-trace (plain text report). Running the base-calling on each multi-fast5 file in parallel on our dataset showed that the most memory intensive tasks (about 5 Gbytes) were the mapping step (using minimap2) and the quality control step (using Nanoplot) (Table 3), while the most CPU-intensive and time-consuming step (∼80 min) was the base-calling (using Guppy) (Table 4).
Finally, we should note that the latest (19.10.0) version of NextFlow allows the user to control the execution of a pipeline remotely. To enable this feature, the user needs to login to the https://tower.nf/website developed by the NextFlow authors and retrieve a token for communicating with the pipeline. For doing that, the user should set this token as an environmental variable and run the pipeline as follows:


	Materials and Methods



	Code Availability

The pipeline is publicly available at https://github.com/biocorecrg/master_of_pores under an MIT license. The example input data as well as expected outputs are included in the GitHub repository. Detailed information on program versions used can be found in the GitHub repository. EpiNano was modified from its original version (1.0) to decrease the computing time of the pipeline (EpiNano version 1.1, available at https://github.com/enovoa/EpiNano).


	Documentation Availability

Detailed documentation on how to install and use the pipeline can be found at: https://biocorecrg.github.io/master_of_pores/


	Availability of Docker Files and Docker Images

The pipeline uses software that is embedded within Docker containers. Docker files are available in the GitHub repository.[21] The pipeline retrieves a specific Docker image from DockerHub. In particular, the workflow retrieves four distinct images: one for basecalling,[22] one for demultiplexing,[23] one for pre-processing[24] and one for measuring polyA tail lengths and detecting RNA modifications.[25]


	Integration of Base-Calling Algorithms in the Docker Images

Due to the terms and conditions that users agree to when purchasing Nanopore products, we are not allowed to distribute Nanopore software (binaries or in packaged form like docker images). While the original version of the MasterOfPores pipeline includes both guppy and albacore, we are not legally allowed to distribute it with the binaries. Therefore, here we only make available a version where the binaries must be downloaded and placed into a specific folder by the user. We expect future versions of MasterOfPores will include these programs within the docker image once this issue is solved.


	CPU and GPU Computing Time and Resources

The MasterOfPores workflow was tested both locally (using either CPU or GPU) as well as in the cloud (AWS). Computing times for each mode are shown inTable 1. CPU time was determined using a maximum of 100 nodes simultaneously with maximum 8 cores CPU per node (2.8–3.5 GHz, 80–130 Watt). GPU time was computed using either GIGABYTE GeForce RTX 1660 Ti (1536 CUDA cores @ 1770 MHz with 6GB of GDDR6 vRAM memory, 120 Watt) or INNO3D GeForce RTX 2080 (2944 CUDA cores @ 1710 MHz with 8 GB of GDDR6 vRAM memory, 225 Watt) or NVIDIA Tesla V100 (5120 CUDA cores + 640 Tensor cores @ 1462 MHz with 16 GB of HBM2 memory). For GPU computing, both system memory (RAM) and GPU memory (vRAM) are used. Base-calling with guppy typically uses 1 or 4.2 Gb of vRAM in fast and high accuracy mode, respectively. As a result, only one base-calling process can be performed on above mentioned cards in high accuracy mode at given time. The execution time in the AWS EC2 p3.2xlarge instance involves reading files already placed in a previously set-up S3 storage bucket but not writing back output results into it.
