GeneTEFlow: A Nextflow-based pipeline for analysing gene and transposable elements expression from RNA-Seq data


	Implementation

The GeneTEFlow pipeline was developed using Nextflow, a portable, flexible, and reproducible workflow management system, and Docker technology, a solution to securely build and run applications on multiple platforms. To build the GeneTEFlow pipeline, a series of bioinformatics tools (S1 Table) were selected for QC, quantitation and differential expression analysis of genes and TEs from RNA-Seq data. These bioinformatics tools and custom scripts were built into four Docker containers to ensure portability of the workflow on different computational platforms. Data processing and analysis steps were implemented by modules using Nextflow. Modules are connected through channels and can be run in parallel. Each module in GeneTEFlow can include any executable Linux scripts such as Perl, R, or Python. Parameters for each module are defined in a configuration file.
A conceptual workflow of GeneTEFlow is illustrated inFig 1. The workflow includes four major inputs: raw sequence files in fastq format, a sample meta data file in excel format, reference genome and gene annotation files, and a Nextflow configuration file. The sample meta data file contains detailed sample information and the design of group comparisons between different experimental conditions. Human reference genome UCSC hg38 with the gene annotation (.gtf) was downloaded from Illumina iGenomes collections [28] and used by the bioinformatics tools included in GeneTEFlow. Scheduling of computational resources for each application module is defined in the configuration file.
GeneTEFlow analysis is performed in following steps: QC, expression quantification, differential expression and down-stream analysis. First, adapter sequences are trimmed off from the Illumina raw reads using Trimmomatic(v0.36) [29] for single-end or paired-end reads, and low-quality reads are filtered out. Next, FastQC(v0.11.7) [30] is executed to survey the quality of sequencing reads, and report is generated to help identify any potential issues of the high throughput sequencing data. Reference genome index for mapping sequencing reads to mRNA genes is built using “rsem-prepare-reference” of RSEM (v.1.3.0). Reads remaining after the pre-processing step are mapped to the reference genome using STAR(v2.6.0c) [31]. Gene level expression is quantitated as expected counts and transcripts per million (TPM) using “rsem-calculate-expression” of RSEM(v1.3.0) with default parameters [32]. Custom Perl scripts were developed to aggregate data from each sample into a single data matrix for expected counts and TPM values respectively. The expression quantification of locus-specific TEs is performed by SQuIRE [22]. After aligning reads to reference genome, SQuIRE classifies reads into unique reads (reads mapped to a single locus) and multi-mapping reads (reads mapped to multiple locations). Then SQuIRE calculates unique read expression of each annotated TE and assigns the fractions of multi-mapping reads based on the normalized unique read expression of each TE. Finally, expectation-maximization (EM) algorithm was implemented in SQuIRE to recursively calculate the fractions of multi-mapping reads (E-step) of each TE, and re-estimate total read counts (M-step) until convergence.
In addition, we also implemented quality control measures after reads alignment step to detect potential outlier samples resulted from experimental errors. Boxplot and density plot are used to evaluate the overall consistency of the expression distribution for each sample. Sample correlation analysis is performed with Pearson method using TPM values to assess the correlation between biological replicates from each sample group. Principal component analysis (PCA) is employed to identify potential outlier samples and to investigate relationships among sample groups.
Differential expression analysis of genes and transposable elements is performed using DESeq2(v1.18.1) package [33]. Significantly up-regulated and down-regulated genes and TEs are summarized in a table. To analyse overlap among significantly regulated genes and TEs from pair-wise comparisons between different sample groups we use Venn diagrams. We perform hierarchical clustering of significantly dysregulated genes or TEs using R package “ComplexHeatmap” [34] with euclidean distance and average linkage clustering parameters. Gene set enrichment analysis (GSEA, v3.0) [35] is conducted using collections from the Molecular Signatures Database (MSigDB) [36]. The outputs (S2 Table) from GeneTEFlow are organized into several folders predefined in a GeneTEFlow configuration file.
In addition, GeneTEFlow can be run in step-by-step mode, which allows users to explore their RNA-Seq data. On GitHub, we include an example on how to run GeneTEFlow in a flexible manner where user has an option to remove some low-quality samples before proceeding with the differential expression analysis. A tutorial with detailed instructions on how to set up and run GeneTEFlow is provided at https://github.com/zhongw2/GeneTEFlow.
