A bioinformatic-assisted workflow for genome-wide identification of ncRNAs


	MATERIALS AND METHODS

Pinc uses state-of-the-art tools to be as robust and versatile as possible.


	Pre-processing

Raw sequencing reads need to pass several quality control and processing steps. These steps include adapter removal, read filtering and read trimming. If the output of the pre-processing is of poor quality, results may be compromised. This makes this arguably the most important step in the analysis pipeline. In order to simplify the procedure, Pinc uses fastp, a fast and all-in-one solution that can be customized to meet all possible needs (5).


	Transcript assembly

Two ways can be chosen to build transcripts from RNA sequencing data, a de novo assembly or a genome-guided approach. As de novo assemblies are very complex and often require manual curation of data, they are not suitable for an automated pipeline. Therefore, the reads first need to be aligned against a reference genome. HISAT2, the successor of TopHat, was chosen as alignment tool because it is fast, memory-efficient and sensitive (6). Speed and low memory usage are important considerations especially for RNA sequencing data, which usually consists of large datasets that may even be generated in replicates and/or for different conditions for differential gene expression analysis. After the reads are aligned, they need to be assembled into partial transcripts, also called transfrags. Since Pinc was expected to be suitable for prokaryotic as well as eukaryotic organisms, the assembler needs to accommodate for splice variants of eukaryotic genes. Another consideration is the possible presence of long reads generated by Third-Generation-Sequencers such as Oxford Nanopore or Pacific Biosciences SMRT sequencing technologies. StringTie was incorporated into the pipeline as it can utilize short reads, long reads or a combination of both (7). Using StringTie's -merge mode, the assembled transfrags across of samples and replicates are merged into one non-redundant set to facilitate downstream differential gene expression analysis. As of today, Pinc only supports short reads, however, the option to process long reads can be added at any point without the need to reconstruct major parts of the pipeline.


	Filtering of transcripts

Since the goal of Pinc is to predict potentially novel ncRNAs, the constructed transcriptome needs to be filtered. By comparing the assembled transfrags against the provided reference annotation gffcompare assigns each transfrag a tag based on the relationship to each annotated gene (8). Keeping only transfrags, which are labeled with ‘u’ (unknown), ‘i’ (fully contained within an intron) or ‘x’ (exonic overlap on complementary strand to annotated feature), effectively reduces the transcriptome to only transfrags that are either not annotated or have a high chance of being ncRNAs (see gffcompare manual for a detailed description of labels). The remaining transfrags might still contain RNAs coding for a protein, thus, an additional filtering step is applied. However, approaches that include aligning sequences to databases are not feasible for large datasets. Therefore, in recent years many tools expanded on linguistic features of sequences. This allows faster processing of sequences and can even yield better results, especially for lncRNAs. As this is a binary classification problem, there are several machine learning approaches to choose from. The two most straightforward ones would be either a Support-Vector-Machine (SVM), which predicts labels for each RNA, or a regression model, which calculates the probability to be a protein-coding RNA. Due to the extensive research on RNAs, a lot of data for ncRNAs are available to be applied in supervised learning in order to train the model. However, based on the training data, the model might be suitable to predict ncRNAs from humans but might yield rather poor results on other organisms. This led to the availability of models that are trained on well-studied species like Homo sapiens, Drosophila melanogaster or Arabidopsis thaliana on one hand, or models that can predict ncRNAs across all species with the trade-off of slightly lower accuracy. In order to tackle this problem, a novel strategy to reliably predict ncRNAs using data from not exhaustively studied organisms was developed. For this purpose we use a combination of CPC2 (9) and CPAT (10) in order to remove transcripts with protein-coding potential. CPAT uses four linguistic features of known coding and non-coding RNAs and transforms them into scores to build a ‘logistic regression’ model. CPAT comes with four pre-trained models, which are trained on datasets of human, mouse, zebrafish and fruitfly, respectively. In an optimal case, sequence data for mRNAs and ncRNAs of the investigated organism itself is available and can be used to train an organism-specific model. However, in most cases there is not enough information about lncRNAs to train a model specific for a target organism. CPC2 is a general classifier and is used to screen all transcripts for their protein-coding potential. Based on this pre-classification an organism-specific CPAT model is trained. As CPAT applies a logistic regression model to calculate coding potential, a probability cut-off needs to be found to distinguish predicted non-coding from protein-coding RNAs. In the end, all transcripts that show protein-coding potential are removed and in an ideal case only ncRNAs are remaining. During the training of the model, training data undergo 10-fold stratified cross-validation where in every iteration the optimal probability cut-off is calculated using a weighted variant of the Youden's index (11) based on a receiver operating characteristic (ROC) curve. From these 10 iterations, the median of all cut-offs will be calculated and used for the final performance evaluation. The weight can be chosen between 0 and 1. A weight of 0.5 means equal contributions of specificity and sensitivity to determine the cut-off. Decreasing the weight towards 0 will set the cut-off very loose and will lead to higher true-positive rate, however, the false-positive rate will also increase as a trade-off. An increase of the weight will make the cut-off more stringent and predicted lncRNAs will have higher confidence at the cost of potentially missing lncRNAs. To visualize the training process the ROC curves, as well as the performance in dependence on the cut-off, are plotted during each iteration.


	Training of classifier

The training of the classifier for Pinc is a two-step process as it combines the potential of both CPC2 and CPAT. After filtering of novel transcripts using gffcompare, the coding potential will be assessed using CPC2. All transcripts, which are predicted as ‘non-coding’, comprise the non-coding training set for CPAT. To complete the training set, protein-coding transcripts originating from the genome annotation are used. As CPC2 will most likely produce false-positives among those predicted non-coding transcripts, we looked into how these falsely classified transcripts will affect the predictive power of the resulting CPAT classifier. In order to estimate the impact of false-positives of CPC2 in the downstream process, the training set of known ncRNAs was ‘contaminated’ with known protein-coding sequences. This dataset for ncRNAs contained 80% ncRNAs and 20% protein-coding RNAs. The sequences used to simulate false-positives are removed from the set of CDS originating from the genome annotation. On the ‘contaminated’ dataset CPC2 was used to predict the coding-potential. Sequences predicted as ‘non-coding’ will be used to train CPAT. Transcripts predicted as ‘coding’ by CPC2 will be discarded for the training process only, however, they will be included in the final prediction of the coding potential of novel lncRNAs. A graphical overview of this procedure can be seen in Figure1A.
In order to evaluate the efficacy of the training, datasets of H. sapiens, A. thaliana and ascomycota were chosen. Since generally little is known about ncRNAs, high-quality datasets are only available for such well-studied organisms. For this reason, the datasets of human and A. thaliana comprise exclusively organism-specific mRNAs and ncRNAs. The dataset for ascomycetes is composed of mRNAs from Saccharomyces cerevisiae and ncRNAs from all ascomycetes. All datasets are based on RefSeq entries for the respective organism or phylum. The dataset of ascomycota is designed to simulate an RNA sequencing experiment to identify potentially novel ncRNAs in a rather unexplored organism. As in most cases, a reference genome of the studied organism is available, while data on ncRNAs are lacking, the combination of mRNAs (or mRNAs based on genome annotation predictions) from the studied organism and ncRNAs from related organisms shall simulate the aforementioned case.
The resulting datasets are split into a training and test set. This allows testing of the trained classifier on data, which it has never ‘seen’ before. As CPAT only calculates coding probabilities, a cut-off needs to be determined to distinguish coding from non-coding RNAs. In order to estimate such a cut-off, we employ 10-fold stratified cross-validation. The training set will be randomly split again into a training and test subset in such a way, that each resulting subset contains the same ratio of protein-coding to non-coding RNAs. Afterwards, CPAT will be trained on one subset, and the optimal cut-off is calculated using a weighted Youden's Index. This procedure was repeated 10 times. The final cut-off is the median of all tens cut-offs calculated during the cross-validation (see Figure1B). After training a separate CPAT classifier with each dataset, the three resulting models will be benchmarked against each other, CPC2 and a pretrained CPAT classifier, which was trained on human data.
Pinc employs a similar approach to train the classifier (see Figure2). Based on the provided genome annotation, Pinc extracts mRNAs and uses them as coding transcripts during the training. For ncRNAs, first a non-redundant transcriptome is built based on all RNA sequencing samples. Using gffcompare all annotated features are removed and the coding probability of the remaining transfrags is assessed using CPC2. Transfrags labelled as ‘non-coding’ will contain primarily ncRNAs and in most cases also some coding RNAs, which are protein-coding genes missing in the annotation. Based on these datasets CPAT is trained and used to predict the coding potential of all novel transfrags.


	Differential expression analysis

Pinc gives the possibility to perform a differential expression analysis using edgeR (12). A basic workflow that allows to compare two conditions is integrated into Pinc. This does not only identify differentially expressed genes, but also differentially transcribed ncRNAs. If replicates for each condition are provided, not only the fold changes of transcripts are reported, but also a statistical significance analysis can be performed. As recommended by the manual, the raw read counts of transcripts should be given to edgeR. For this, the tool HTSeq-count of the HTSeq package is used (13).


	Implementation of Pinc

Analysis of sequencing data requires many tools to be installed and maintained. Most tools can only be used through a command-line interface. This can be the first obstacle, since working with those programs can be quite a challenge for users who are not familiar with this environment. Pipelines help researchers by automating running all programs and handle the data flow between those. Pinc is embedded in a Nextflow framework which allows building very intuitive pipelines. However, this still requires that all used programs are installed on the user's system. Therefore, Pinc will be distributed by using a Docker image. This image is a whole-in-one solution as it comes with its own operating system and all required programs installed within the virtual environment. A graphical overview on the whole workflow can be inferred from Figure3. This allows researchers to use Pinc independent of the operating system and without the tedious installation and setup of multiple programs. In addition, the stand-alone Nextflow pipeline will also be available at GitHub when all necessary programs are already installed on the system.


	Generation of test data



	RNA sample preparation

During the extensive, large-scale cultivation Trichoderma reesei industry strains, such as Iogen-M10, can gradually lose their ability to produce the target product (i.e. cellulases) and transform into a non-producing (cel–) population in contrast to the initial, highly productive (cel+) population (14). A (cel+) and a (cel–) strain were cultivated in triplicates in Mandels-Andreotti medium supplemented with 1% (w/v) glucose as the sole carbon source and 0.1% (w/v) peptone or 1% (w/v) lactose and 0.1% (w/v) peptone. These conditions represent low cellulase producing and high cellulase producing conditions, respectively (15). After 48 h of cultivation, mycelium was harvested and total RNA was extracted. To enrich RNAs of interest, an rRNA depletion was performed using ‘NEBNext® RNA Depletion Core Reagent Set’. All samples were split into three sets, each set containing a sample from each condition, to ensure fast handling. 50 DNA probes were used to degrade rRNA using RNase H. Probes were designed to leave rRNA fragments of 50–70 nt length after RNase H-mediated degradation. Parameters for purification of samples by size exclusion were chosen as recommended by the manufacturer to obtain RNA fragments of ∼200 nt length.


	Library preparation

Library preparation of rRNA depleted samples was done using the NEBNext [®] Ultra™ II Directional RNA Library Prep Kit for Illumina [®]. RNA was fragmented by incubating at 94°C for 7 min. For PCR enrichment of adapter-ligated DNA 10 cycles was chosen as 500 ng of rRNA-depleted RNA was used in the beginning. Otherwise, the recommended protocol of the manufacturer was followed.


	Sequencing

Sequencing was done on an Illumina NextSeq 500 system to acquire a depth of about 20 million paired-end reads per sample.


	Supplementary Material

