FASTdRNA: a workflow for the analysis of ONT direct RNA sequencing


	2 Methods



	2.1 Overview of FASTdRNA

Snakemake was selected for the workflow framework due to being Python-based, which makes the pipeline better human readable. This decision considered the bioinformatic defects in computer programming as well as further maintenance and update. Snakemake can be seamlessly scaled to server, cluster, grid, and cloud environments without modifying the workflow definition. For various experiment designs, we separated the whole pipeline into two modules: the data preprocessing module, named dRNAmain, and the data analysis module containing dRNAtail, dRNAmodif, and dRNAas (Fig. 2). Procedure files are retained, as well as the basic visualization, we supply so that researchers will be able to access more details if necessary.
 ** dRNAmain: this initial step is crucial for the subsequent three steps. Fast5 files are converted to FASTQ files, mapped to the transcriptome reference to estimate transcript counts.
** dRNAtail: this step utilizes the FASTQ file, the mapping BAM file, and the transcriptome reference as inputs to estimate poly(A) length using the Nanopolish tool.
** dRNAmodif: this step aims to detect RNA modifications and requires the BAM file generated by dRNAmain, which includes group and replication information.
** dRNAas: this step detects alternative splicing events and generates information on AS variations between different conditions or specific controls using annotation and expression files as input.
By executing the dRNAmain, dRNAtail, dRNAmodif, and dRNAas modules of the FASTdRNA pipeline, researchers can effectively process dRNA-seq data, estimate poly(A) length, detect RNA modifications, and analyze alternative splicing events, allowing for a comprehensive exploration of transcriptome features in various experimental conditions.


	2.2 Data preprocessing: dRNAmain

The dRNAmain step of the FASTdRNA pipeline encompasses two main components: basecall and mapping. Other fundamental data processing steps will be discussed in subsequent sections.
 ** Basecall: initially, SLOW5tools (Gamaarachchi et al. 2022) is utilized to convert FAST5 files to SLOW5 format, a novel format for storing signal data generated by ONT devices. SLOW5 files are smaller and enable more efficient extraction of physical information from raw sequencing data. SLOW5 can be encoded in a human-readable ASCII format or a more efficient binary format called BLOW5. In the FASTdRNA pipeline, the BLOW5 format is selected for downstream analysis. It is essential to distinguish between two types of FAST5 formats: single-FAST5 and multi-FAST5. Single-FAST5 is the previous version of FAST5 files generated by ONT devices. Its notable characteristic is that the output FAST5 directory contains subfolders rather than a list of individual FAST5 files. While moving the FAST5 files from each subfolder into a single folder would enable the smooth execution of Guppy, it can lead to confusion when indexing sequencing information from FASTQ to raw FAST5 using Nanopolish. This issue can be addressed within SLOW5tools, which automatically converts single-FAST5 files into multiple FAST5 files while converting SLOW5 back to FAST5. Furthermore, Guppy, the current official software for basecalling, does not support direct basecalling from SLOW5 format. To overcome this limitation, SLOW5 files are converted back into FAST5 format. The exported FAST5 file releases large quantities of “unaccounted” space compared to the original fast5 file. Bonito, an alternative software to Guppy, can process SLOW5 datasets. However, Bonito currently only supports basecalling for DNA datasets, and future updates may enable RNA basecalling.
** Mapping: the Minimap2 aligner (Li 2018,2021) maps the FASTQ files to a reference transcriptome. The SAM file from Minimap2 is processed using Samtools (Li et al. 2009). Finally, transcript counts are estimated using NanoCount (Gleeson et al. 2022), a tool specifically designed for dRNA-seq analysis. Additional scripts are used to generate mapping statistics and facilitate differential expression analysis. The files generated for transcript count estimation are preserved for further research.
If the organism being sequenced has a pre-existing transcriptome reference, it can be directly used for mapping with Minimap2. However, if a transcriptome reference is not available, Gffread (Pertea and Pertea 2020) is employed to merge the genome reference and RNA annotation files into a unified transcriptome reference. To avoid mapping errors, it is crucial to ensure that the merged genome reference and RNA annotation files are derived from the same dataset.


	2.3 Data analysis: dRNAtail, dRNAmodif, and dRNAas

 ** dRNAtail: poly(A) length estimation was performed using Nanopolish. To improve efficiency, the SLOW5 format was chosen as the input, considering the time-consuming nature of extracting physical base information. The output file provides the estimated poly(A) length and its precise physical position.
** dRNAmodif: traditionally, the software of choice for detecting RNA modifications from dRNA-seq data has been Tombo. However, Tombo only accepts single-FAST5 files as input, requiring converting multi-FAST5 files to single-FAST5 format. Moreover, Tombo is no longer actively maintained. As an alternative, nanocompore (Leger et al. 2021) was selected for RNA modification detection, which requires the availability of a control group. We provide a pipeline utilizing Tombo for RNA modification detection for datasets without a control group, considering its advantage of analyzing data without a control group.
** dRNAas: SUPPA2 (Trincado et al. 2018) was modified to detect alternative splicing events. The analysis focuses on local alternative splicing events and different transcript isoforms. The output results include a “.psivec” file generated by deltaPSI (Lin and Krainer 2019), a novel counting strategy developed by SUPPA, which expresses the enrichment of specific transcripts or events. Researchers can identify alterations in transcripts or local alternative splicing events during the designed biological process. Additionally, two official pipeline visualizations are provided. One is a volcano-like plot illustrating the variation in transcript counts and confidence levels. The other is the result of cluster analysis, which groups local alternative splicing events with similar changes in PSI values across conditions into the same cluster.
By utilizing the dRNAtail, dRNAmodif, and dRNAas modules of the FASTdRNA pipeline, researchers can accurately estimate poly(A) length, detect RNA modifications, and analyze alternative splicing events gaining valuable insights into transcriptome dynamics and regulation.


	2.4 CPU and GPU computing resources

The performance of FASTdRNA was evaluated using both local computing resources and Amazon Elastic Compute Cloud (Amazon EC2). For local computing, the CPU utilized was Intel(R) Xeon(R) Gold 6230R (2.10 GHz), while GPU computing utilized the NVIDIA A100-SXM4-40GB. In the case of Amazon EC2, the CPU computing resources utilized the AMD EPYC 7571 to run the entire pipeline.


	3 Results



	3.1 Test for workflow: analysis of Triticum aestivum L. mutants

To verify the performance of FASTdRNA in an experimental setting, we collected two types of Triticum aestivum L. samples, namely wild-type (WT) and mutant, each with two biological replicates. The analysis was conducted using eight cores and 16 threads in CPU mode. In GPU mode, all eight cores share the same GPU for accelerated computing.Table 1 demonstrates the successful execution of FASTdRNA in an actual experiment. Within FASTdRNA, NanoPlot (version 1.41.0) was utilized to provide a summary report displaying sequencing quality and alignment information (Fig. 3). In the dRNAtail module, an overview violin plot was employed to illustrate the poly(A) length distribution of reads and contigs. Additionally, an interface was provided to display the poly(A) tail length of user-selected transcripts of interest (Fig. 4). For the dRNAmodif module, nanocompore (version 1.0.4) was used to predict potential methylation sites on RNA. A summary file containing comprehensive information on all modifications, including m6A methylation, was generated. In addition, users had the option to select transcripts of interest for visualization of all methylation or specifically m6A modifications (Fig. 5). In the dRNAas module, SUPPA2 was employed to search for alternative splicing events using annotation files and predict the probability of each event based on the sequencing data. The output provided by SUPPA2 served as the final report, as it already contained intuitive and comprehensive information, including various alternative splicing types within each transcript and the specific splicing position information for each splicing form. The FASTdRNA's operational logs are included in the ‘Test' subfolder of theSupplementary File.


	3.2 Comparison between FASTdRNA and MasterOfPores

MasterOfPores is another workflow for analyzing dRNA-seq data, written in Nextflow. However, due to issues with the code of the latest MasterOfPores (MOP2) version, it could not successfully run the four example datasets provided by FASTdRNA, as mentioned below. Therefore, for this comparison, we used the example datasets FAL and FAK from MOP2’s test data and compared the runtime of MOP2 and FASTdRNA. In this comparison, the MOP2 modules mop_preprocess, mop_tail, and mop_mod correspond to the FASTdRNA modules dRNAmain, dRNAtail, and dRNAmodif, respectively. Both workflows were executed using 36 cores and 64 threads, with mop_preprocess and dRNAmain utilizing a dedicated graphics card for computation. In the modification module, we chose the nanocompore algorithms in MOP2, which share the same method as FASTdRNA.Table 2 presents the running time of each pipeline.
From the recorded times, it is evident that FASTdRNA has improved analysis efficiency by 2–10 times, partly due to the adoption of the SLOW5 data format in the analysis pipeline. Additionally, while Nextflow is known for its intelligent resource allocation, it is not as flexible in allocating computing resources as Snakemake. Moreover, the more complex code structure of Nextflow makes debugging MOP2 more challenging compared to FASTdRNA. Due to the characteristics of Nextflow, MOP2 requires a decision on whether to use Docker or Singularity as containers. Docker, although widely used, requires high permissions, and improper operations may have serious consequences, such as system crashes.
Compared to MOP2, FASTdRNA provides a more user-friendly visualized report, including a poly(A) length summary for each transcript and potential methylation sites. Additionally, FASTdRNA includes the dRNAas module to detect various alternative splicing events and calculate their probabilities, further expanding the analytical capabilities of the pipeline. The operational logs of the two pipelines are stored in the ‘Compare' subfolder of theSupplementary File.


	3.3 Running FASTdRNA workflow on AWS EC2

Bioinformatics, a powerful tool for extensive data analysis, often requires advanced computer hardware due to the computational demands of processing large datasets. For ONT data analysis, tasks, such as FAST5 file management, including basecalling, mapping, and modification detection, often exceed the capabilities of personal computers. As a result, sufficient computing resources play a crucial role in determining the efficiency of analysis, particularly for basecalling. To address the limitations of computing resources, we propose utilizing Amazon Web Services Elastic Compute Cloud (AWS EC2). Amazon Web Services (AWS) offers a comprehensive range of global computing, storage, database, analysis, application, and deployment services. It provides several advantages for bioinformatics analysis. Firstly, it offers cost savings in the early stages by replacing upfront hardware investment with a monthly rental model based on actual resource usage. Secondly, AWS allows users to customize their resources according to their specific requirements, enabling scalability, resource utilization improvements, and flexible resource allocation.
Amazon EC2 is a service within AWS that provides scalable computing resources in the cloud. It eliminates the need for upfront hardware investment and allows users to quickly develop and deploy applications, launch virtual servers, configure security and network settings, and manage storage resources based on their needs. Here, we used the first 10 fast5 files of each four samples used in Section 3.1 as datasets and ran it on an Amazon EC2 instance to test its performance, which shows inTable 3. The performance testing of the FASTdRNA pipeline in AWS EC2 environments demonstrated that, when the necessary software and environment configurations are correctly set up, it can operate smoothly and successfully, resembling the experience of working on a local workstation. The operational logs of the FASTdRNA running on Amazon EC2 are stored in the ‘Amazon' subfolder of theSupplementary File.


	3.4 SLOW5 accelerates the pipeline and resolves FAST5 format issues

SLOW5 format was initially adopted to enhance the speed of processing raw FAST5 files. Although Guppy does not support basecalling directly from SLOW5 files, converting them back to FAST5 format improved the speed of basecalling compared to using raw FAST5 files. According to the research by the author of slow5tools, there is no data loss during the conversion process when converting from FAST5 -> BLOW5 -> FAST5, and the integrity of the original data is preserved. This suggests that numerous FAST5-based analyses can benefit from this new technique for accelerated calculations.
Furthermore, SLOW5tools automatically converts from single-FAST5 to multi-FAST5 format, simplifying the basecalling process within the workflow. For experiments requiring Tombo for modification event detection, which requires single-FAST5 files, it is highly recommended to use the “multi_to_single_FAST5” function from the “ont_FAST5_api” package to convert the multi-FAST5 files exported by SLOW5tools into single-FAST5 format. This approach ensures compatibility and avoids issues from using raw data directly as input.


	Supplementary Material

