ICAnnoLncRNA: A Snakemake Pipeline for a Long Non-Coding-RNA Search and Annotation in Transcriptomic Sequences


	2. Materials and Methods



	2.1. The Pipeline for lncRNA Identification and Analysis in Plant Transcriptomes

The computational pipeline for detecting and characterising lncRNAs in plant transcriptomes is shown inFigure 1. 
The input of the pipeline is a set of assembled transcripts provided by the user. The assembly step is not included in our pipeline and the user should obtain these sequences by de novo or reference-based assembly using additional software. Thus, our pipeline is free from the restrictions associated with the choice of a particular method of transcriptome assembling. The pipeline input also includes additional data: genomic sequences, its annotation, mRNA and lncRNA sequences annotated in the genome and description of the transcriptomic libraries (seeTable S1 and Figure S1 in the Supplementary Materials).
The analysis includes several major steps: data pre-processing (panel 1,Figure 1), lncRNA identification (panel 2,Figure 1) and lncRNA annotation (panel 3,Figure 1). 
The data pre-processing step involves building an index file for genomic sequences with GMAP [42] and training the lncRNA recognition model using LncFinder v1.1.4 [41].
LncRNA identification includes: (1) prediction of lncRNA candidates in the input set of transcripts; (2) alignment of the predicted lncRNA sequences to the reference genome; (3) filtering erroneous/noise transcripts; (4) filtering possible transposable elements (TEs).
The lncRNA annotation step includes: (1) identification of lncRNA classes; (2) identification of conserved lncRNAs; (3) analysis of lncRNA expression; (4) statistical analysis. 
Details of the pipeline analysis are provided below.


	2.1.1. Data Pre-Processing 

This step involves building an index file for a genomic sequence with GMAP [42] and training a lncRNA recognition model using LncFinder v1.1.4 [41]. 
LncFinder enables a researcher to train a lncRNA recognition model based on known lncRNA and mRNA sequences of specific genomes with known annotation. This software uses a neural network algorithm to classify sequences and utilises three types of parameters: hexamer frequencies in nucleotide sequences, physico-chemical properties of nucleotides and features of predicted secondary structure for an RNA sequence. The sequences used for training include the full set of known mRNAs and lncRNAs for the reference genome in two separate files in FASTA format. The pipeline generates training and testing sets for the LncFinder software from this complete dataset. Taking into account that the number of known lncRNAs in annotated genomes is usually significantly smaller than the number of mRNAs (for example, 33,725 mRNAs and 2535 lncRNAs are annotated in the v.40 maize genome annotation), all known lncRNA sequences were employed for this analysis, and the number of mRNA sequences sampled by random-without-return for training and testing was assumed to be twice as large. This ratio was used in ref. [41] and ensures dataset balancing for an RNA set. 
During network training, both the mRNA and lncRNA data were randomly divided at a ratio of 80% for the training set and 20% for the test set. Using the training data, the model parameters were optimised by LncFinder. The performance of the model was evaluated on the test data using sensitivity ( SN), precision (PR), specificity (SP) and F1 measures [43] as follows:SN = TP/(TP + FN),(1)PR = TP/(TP + FP),(2)SP = TN/(TN + FP),(3)F1 = 2 · ((SN × PR)/(SN + PR)),(4)
where TP (true positives) is the number of annotated lncRNAs predicted as lncRNAs, FP (false positives) is the number of annotated mRNAs predicted as lncRNAs, TN (true negatives) is the number of annotated mRNAs predicted as mRNAs and FN (false negatives) is the number of annotated lncRNAs predicted as mRNAs. The closer F1 is to 1.0, the better the performance of the lncRNA prediction. All the performance measures are provided as a result of the training process at the final pipeline output.
The training and testing procedures described above were performed in five replicates. The sequences for each replicate were chosen independently. The pipeline outputs performance estimates for lncRNA recognition for each replicate in a separate file in a CSV format. The pipeline saves the parameters of the replicate having maximal F1 in a separate file and uses them for further analysis. 
The LncFinder package offers the opportunity to use secondary structure parameters estimated by the ViennaRNA software [44] for lncRNA recognition. Our pipeline supports this option. In this case, the calculation time is usually increased by ~95%. At the same time, according to our preliminary estimates (data not shown), the performance of the lncRNA recognition improves by ~1% only. Therefore, in this work, we did not use secondary structure parameters to predict lncRNA candidates.


	2.1.2. LncRNA Identification and Filtering

LncRNA identification includes four steps: (1) prediction of lncRNA candidates in the input set of transcripts; (2) alignment of the predicted lncRNA sequences to the reference genome; (3) filtering erroneous/noise transcripts; (4) filtering possible transposable elements (TEs).
The lncRNA recognition was performed by LncFinder [41] with the parameters determined at the previous step (see above). Transcripts were classified into two sets—coding and non-coding sequences—and the classification data stored in an output text file in TSV format. The sequences identified by LncFinder as non-coding were lncRNA candidates. For each transcript, the ORF length, the Fickett index, ORF coverage and the isoelectric point were estimated by CPC2 [18] and stored in a separate file as additional information. These parameters, however, did not affect the lncRNA prediction results. 
Alignment of the lncRNA candidates to the reference genome was performed using the GMAP software v2020.10.14 [42] with the following parameters: —min-intronlength = 1 —intronlength = 414,579—cross-species—format = gff3_gene—split-large-introns—npaths = 1.
In the pipeline, we applied filtering to short sequence fragments resulting from de novo assembly errors or transcriptional noise. For this purpose, we identified genomic lncRNA loci as a continuous region of a chromosome to which at least one lncRNA transcript is aligned. The GFF file with lncRNA loci coordinates is used by the gffcompare tool v0.11.2 [45] to select transcripts that matched the loci regions (class ‘=’ transcripts, see below); transcripts without a match to lncRNA loci were removed from further analysis.
A second filter is aimed at removing transcripts from transposable elements loci. The user should provide a GFF file with TE coordinates in the reference genome as input to the ICAnnoLncRNA pipeline. All candidate lncRNA transcripts that overlapped by at least one nucleotide with the TE region were removed from further analysis. In our work, we used the EDTA package [46] to obtain TE libraries for the maize AGPv4 genome and to identify TEs in its sequence.
Candidate lncRNAs sequences that passed these filters were analysed further by the ICAnnoLncRNA pipeline.


	2.1.3. Identification of lncRNA Classes 

To classify the candidate lncRNA sequences by their location in the genome relative to protein-coding genes, ICAnnoLncRNA uses the program gffcompare v0.11.2 [45]. Each transcript that passes through filters at the previous steps is classified by this program into 1 of 14 classes represented by single-character notation. LncRNA candidates that completely matched known mRNA sequences (class ‘=’) due to possible prediction errors were excluded from further analysis. Some lncRNA candidates aligned with exons in the ‘+’ direction (classes ‘c’, ‘k’, ‘j’, ‘m’, ‘n’ and ‘o’). There is experimental evidence in humans that several lncRNAs are encoded by exonic fragments in the ‘+’ direction. However, these cases are infrequent, and no such examples are currently known in plants. In addition, it has been suggested that transcripts of the ‘c’, ‘k’, ‘j’, ‘m’, ‘n’ or ‘o’ class may represent new isoforms of known genes rather than lncRNAs [47]. For the above reasons, the sequences of these classes were excluded from further analysis. Some lncRNA candidates belong to classes ‘e’, ‘s’, ‘p’ and ‘y’ (including joined intron and exon fragments). We believe that these sequences are not lncRNAs and mainly represent sequencing/assembly artefacts. They were excluded from further analysis by our pipeline. 
Some sequences of the ‘=’ class aligned to known lncRNA genes annotated in the reference genome. Transcripts not included in the above classes were regarded as novel lncRNAs. The pipeline provides analysis of known as well as novel lncRNAs. However, in the current work, we focus our analysis on the detection of novel sequences not presented in the current genome annotation. 
Novel lncRNAs were categorised into three types [14,29,48] (Figure 2): ** intronic: a lncRNA that is fully contained within a reference intron (gffcompare class ‘i’).
** antisense: an exonic overlap on the reverse strand of a protein-coding gene (gffcompare class ’x’).
** intergenic: a lncRNA that is located between two protein-coding genes (gffcompare class ‘u’).

In accordance with the classification of lncRNAs described above, we categorised their loci into two types: intergenic lncRNA loci (class ‘u’) and loci overlapping with known protein-coding genes (classes ‘i’ and ‘x’).


	2.1.4. Identification of Conserved lncRNAs

Similarly to ref. [31], the ICAnnoLncRNA pipeline classifies novel lncRNAs as conserved and non-conserved by the presence/absence of their homologs in other species in the external databases. The lncRNA sequence from the reference species was considered conserved if a lncRNA sequence with identity greater or equal to 50% was found in other species in the dataset of known lncRNAs (seeSection 2.4 below). The search was performed using BLASTn v2.9.0+ [49] with the parameters: -outfmt 6 -evalue 1e-50 -max_target_seqs 1 -perc_identity 50. Otherwise, the lncRNA was regarded as non-conserved. 
The pipeline also allows the clustering of novel lncRNA sequences and sequences from the dataset of known lncRNAs by UCLUST v. 11 [50]. This allows the identification of groups of homologous lncRNAs from different species. 


	2.1.5. Analysis of lncRNA Expression 

The pipeline allows the specificity of lncRNA expression in various plant tissues to be determined using the data provided by user at the pipeline input (Table S1, Supplementary Materials). Our pipeline can compare three sets of RNA sequences: two sets of lncRNAs (conserved and non-conserved) and transcripts predicted as protein-coding. For each of the three RNA types, the number of sequences expressed in at least one tissue is estimated (expression value is greater than user-defined threshold). The number of transcripts of each class expressed in a given tissue is normalised to the number of transcripts of a given RNA type and to the number of experiments relevant to that tissue. The proposed normalisation takes into account the unequal number of genes represented in the three types of RNAs and the unequal number of experiments performed on different tissue types. This value (the number of expressed transcripts per library per tissue) shows the specificity of gene expression of the selected RNA molecules (conserved lncRNAs, non-conserved lncRNAs and mRNAs) for each tissue analysed. However, this value should not be confused with the overall expression level of the specific molecule type in the tissue. 


	2.1.6. Statistical Analysis

Our pipeline outputs several important statistics for novel lncRNAs: density in each chromosome, number of different classes of genome location, distribution of exon/intron characteristics, and specificity of expression in different tissues. During the analysis of the lncRNA exon/intron structure, we excluded those lncRNA transcripts with a small intron length (<60 bp) which could be obtained for de novo assembled transcripts but would be highly unlikely, according to experiments [51]. These transcripts, however, were not excluded from the analysis of expression and homology. 


	2.2. The Platform

The pipeline was designed using the workflow management system Snakemake v6.0.0 [52]. Snakemake is a tool for devising reproducible and scalable data analyses. Workflows are described in a human-readable, Python-based language. They can be seamlessly scaled to server, cluster, grid and cloud environments without the need to modify the workflow definition. Snakemake is compatible with the Conda management system (https://docs.conda.io/en/latest/, accessed on 1 October 2022), which enables an investigator to easily install any new programs needed for the pipeline. The direct acyclic graph (DAG) for the ICAnnoLncRNA pipeline is provided inFigure S2 (Supplementary Materials) and shows its implementation in detail.


	2.3. Transcriptomic Datasets and Sequence Assembly

We analysed 15 maize transcriptome libraries (Table S2, Supplementary Materials) from different plant tissues/organs. Two approaches were used for sequence assembly. The first one was the de novo method and included extraction of reads from SRA files by SRA Toolkit [53], data pre-processing by fastp [54], sequence assembly by Trinity v2.6.6 [55] and transcript quantification in TPM (transcripts per million) by Kallisto [56] (the analysis is described in detail in [57]). The second approach was reference-based. Reads were filtered and mapped to the reference maize genome AGPv4 [58] retrieved from the Ensembl Plants database [59] by Hisat2 v. 2.1 [60] and assembled by StringTie v. 1.3 [61]. We used the implementation of the second approach from the LncPipe pipeline [39]. We used Ensembl Plants v.40 annotation files for the reference genome. 
The analysis of tissue expression specificity for the transcripts was performed in this work for transcripts identified in the Trinity assembly. The transcript was considered as expressed if its TPM was greater than 1.


	2.4. A Library of Known lncRNA Sequences 

Sequences of known lncRNAs from PNRD [33], CANTATAdb v2.0 [62], GREENC v1.12 [35], PLncDB v2.0 [34] and EVLncRNAs v2.0 [32] databases were used for comparison with the lncRNAs obtained in the current work for maize transcriptomes. The sets of sequences from these databases were combined; identical sequences were removed to obtain a non-redundant sequence dataset. This procedure yielded 256,091 lncRNA sequences from 16 plant species; 39,456 of these belong to maize. Detailed statistics on the number of sequences of each species in the non-redundant dataset are presented inTable S3 (Supplementary Materials).
