Published February 8, 2024 | Version v1
Dataset Open

LsRTDv1: A reference transcript dataset for accurate transcript-specific expression analysis in lettuce

  • 1. University of York
  • 2. James Hutton Institute

Description

Accurate quantification of gene and transcript-specific expression, with the underlying knowledge of precise transcript isoforms, is crucial to understanding many biological processes. Analysis of RNA sequencing data has benefited from the development of alignment-free algorithms which enhance the precision and speed of expression analysis. However, such algorithms require a reference transcriptome. Here we present a reference transcript dataset (LsRTDv1) for lettuce, combining long- and short-read sequencing with publicly available transcriptome annotations, and filtering to keep only transcripts with high-confidence splice junctions and transcriptional start and end sites. LsRTDv1 is a valuable resource for the investigation of transcriptional and alternative splicing regulation in lettuce.

Notes

Funding provided by: Biotechnology and Biological Sciences Research Council
Crossref Funder Registry ID: https://ror.org/00cwqg982
Award Number: BB/S020160/1

Funding provided by: Ministry of National Education
Crossref Funder Registry ID: https://ror.org/00jga9g46
Award Number: MEB1416

Funding provided by: Scottish Government Rural and Environment Science and Analytical Services*
Crossref Funder Registry ID:
Award Number:

Methods

We generated a lettuce Reference Transcript Dataset (LsRTDv1) by integrating transcript assemblies from short- and long-read RNA sequencing data with existing lettuce genome annotations. RNA sequencing data were generated from 23 lettuce samples covering different tissues, plant ages and treatments. The 23 samples, all from Lactuca sativa cv. Saladin (synonymous with cv. Salinas), were combined in equal proportions into seven pooled samples prior to sequencing.

Short-read assembly

The RNA-seq reads of the seven pooled samples were pre-processed with Fastp (Chen et al., 2018) to remove adapters and filter low-quality reads (quality score <20, length <30). Trimmed reads were mapped to the latest lettuce reference genome assembly in NCBI (Lsat_Salinas_v11) using the STAR aligner in 2-pass mode to increase mapping sensitivity at splice junctions (SJs) (Dobin and Gingeras, 2015). The number of allowed mismatches was set to 1, with minimum and maximum intron sizes of 60 and 15,000 bp, respectively. Two transcript assemblers, StringTie (Pertea et al., 2015) and Scallop (Shao and Kingsford, 2017), were used to assemble transcripts for each sample. The assemblies were then merged and refined using RTDmaker (https://github.com/anonconda/RTDmaker) to remove low-quality transcripts, including redundant transcripts with identical intron combinations to longer transcripts, fragmented transcripts with length <70% of gene length, transcripts with non-canonical SJs, transcripts with SJs only supported by <5 spliced reads in <2 samples, and lowly expressed transcripts with <1 transcript per million reads (TPM) in <2 samples.
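The two support filters at the end of this step can be sketched as follows. This is a minimal illustration that assumes one reading of the thresholds: a splice junction is kept when at least 2 samples each contribute at least 5 spliced reads, and a transcript is kept when at least 2 samples each reach at least 1 TPM. The function name `has_support`, the input layout (one value per pooled sample) and the example values are hypothetical and are not taken from RTDmaker.

```python
from typing import Sequence


def has_support(per_sample: Sequence[float], min_value: float, min_samples: int) -> bool:
    """Return True if at least `min_samples` samples reach `min_value`."""
    return sum(v >= min_value for v in per_sample) >= min_samples


# Hypothetical values for the seven pooled samples (assumption, not real data).
sj_reads = [0, 7, 12, 3, 0, 5, 1]              # spliced reads covering one junction
tpm = [0.2, 0.0, 1.4, 0.1, 0.0, 0.3, 0.0]      # expression of one transcript

keep_junction = has_support(sj_reads, min_value=5, min_samples=2)
keep_transcript = has_support(tpm, min_value=1.0, min_samples=2)
print(keep_junction, keep_transcript)
```

In this example the junction passes (three samples have at least 5 spliced reads) while the transcript does not (only one sample reaches 1 TPM).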

Long-read assembly

We employed the IsoSeq pipeline (https://github.com/PacificBiosciences/IsoSeq) to pre-process the Iso-seq data from the seven samples. The CCS method was used to generate circular consensus sequences (CCS) from raw subreads, and reads with a predicted accuracy <90% were discarded (--min-rq=0.9). Barcodes associated with the CCS reads were removed using the lima method. To further refine the reads, Isoseq3 was applied to trim poly(A) tails and to identify and remove concatemers. The resulting full-length, non-concatemer (FLNC) reads were mapped to the reference genome using Minimap2 (Li, 2018). TAMA-collapse was used to collapse redundant transcript models in each sample, with no variation allowed at the 5' and 3' ends or at SJs (-a = 0, -m = 0 and -z = 0) to ensure high accuracy of boundaries. Reads with errors within 10 bp up- or downstream of an SJ were removed. TAMA-merge was used to merge transcript models from the seven samples (Kuo et al., 2020). To improve the quality of the assembly, we implemented well-established methods for SJ, transcript start site (TSS) and transcript end site (TES) analyses previously used for Arabidopsis AtRTD3 and barley BaRTv2 (Zhang et al., 2022b; Coulter et al., 2022). We removed low-quality transcripts that contained non-canonical SJs or low-quality SJs unless these SJs were also present in the short-read assembly. We applied a binomial test to distinguish high-confidence TSS and TES with a false discovery rate (FDR) <0.05. For genes with limited read support, statistical testing becomes challenging, hence we also kept TSS/TES if they were supported by at least 2 Iso-seq reads. A redundancy merge was applied to transcripts that differed only at their TSS/TES and by no more than 50 nucleotides. In addition, transcripts supported by only a single Iso-seq read were removed from the final dataset.
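The redundancy criterion in the last step can be illustrated with a small predicate. This is a sketch under stated assumptions: both transcripts are taken to lie on the same chromosome and strand, "differed only at their TSS/TES" is read as "share an identical intron chain", and the `Transcript` type and `is_redundant` function are hypothetical names, not part of TAMA or any cited pipeline. How the merged transcript's ends are chosen is not specified in the text and is not modelled here.

```python
from typing import NamedTuple, Tuple


class Transcript(NamedTuple):
    introns: Tuple[Tuple[int, int], ...]  # ordered (start, end) of each intron
    start: int                            # leftmost genomic coordinate
    end: int                              # rightmost genomic coordinate


def is_redundant(a: Transcript, b: Transcript, wobble: int = 50) -> bool:
    """True if a and b share an intron chain and both ends differ by <= wobble nt."""
    return (
        a.introns == b.introns
        and abs(a.start - b.start) <= wobble
        and abs(a.end - b.end) <= wobble
    )


# Hypothetical transcript models (assumption, not real coordinates).
t1 = Transcript(introns=((200, 300),), start=100, end=900)
t2 = Transcript(introns=((200, 300),), start=130, end=880)   # ends within 50 nt of t1
t3 = Transcript(introns=((200, 300),), start=100, end=1000)  # 3' end differs by 100 nt

print(is_redundant(t1, t2), is_redundant(t1, t3))
```

Here `t1` and `t2` would be candidates for merging, whereas `t3` would be kept as a separate model because its end lies more than 50 nucleotides away.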

Integration of multiple annotations

We integrated four transcript annotations: the long-read assembly, the short-read assembly and two versions of the Lsat_Salinas_v11 genome annotation, GenBank (GCA_002870075.4) and RefSeq (GCF_002870075.4). The Iso-seq long-read assembly served as the reliable backbone, while the other three annotations were incorporated in a step-wise manner to improve the completeness of the RTD. First, transcripts in the short-read assembly that introduced novel SJs and/or novel gene loci were integrated into the long-read assembly. Subsequently, we added transcripts from the GenBank and RefSeq annotations that contributed novel SJs or gene loci to build the lettuce RTD (LsRTDv1). In cases where two transcripts from GenBank and RefSeq had identical SJ combinations, or were mono-exonic transcripts with overlapping regions exceeding 30% of both transcripts, we collapsed them into a single transcript, and the longest TSS and TES were used as the start and end points of the collapsed transcript. In LsRTDv1, overlapping transcripts were assigned the same gene ID. However, if a set of overlapping transcripts resided entirely within an intron of other transcripts, they were treated as intronic transcripts and assigned a different gene ID. Where the overlapping transcripts could be divided into multiple groups and adjacent groups overlapped by less than 5% of the group lengths, the groups were assigned separate gene IDs.
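The collapsing rule for GenBank and RefSeq transcripts can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes both transcripts are on the same chromosome and strand, uses inclusive coordinates, and interprets "the longest TSS and TES" as the outermost start and end of the pair. The `Transcript` type and the function names are hypothetical.

```python
from typing import NamedTuple, Tuple


class Transcript(NamedTuple):
    introns: Tuple[Tuple[int, int], ...]  # empty tuple for a mono-exonic transcript
    start: int                            # leftmost genomic coordinate (inclusive)
    end: int                              # rightmost genomic coordinate (inclusive)


def overlap_fraction(a: Transcript, b: Transcript) -> float:
    """Fraction of a's length that is covered by b."""
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    return max(shared, 0) / (a.end - a.start + 1)


def should_collapse(a: Transcript, b: Transcript, min_overlap: float = 0.30) -> bool:
    """Identical SJ combination, or mono-exonic with >30% overlap of both."""
    if a.introns or b.introns:
        return a.introns == b.introns
    return (overlap_fraction(a, b) > min_overlap
            and overlap_fraction(b, a) > min_overlap)


def collapse(a: Transcript, b: Transcript) -> Transcript:
    """Keep the shared intron chain and extend to the outermost start and end."""
    return Transcript(a.introns, min(a.start, b.start), max(a.end, b.end))


# Hypothetical mono-exonic models from the two annotations (not real coordinates).
genbank = Transcript(introns=(), start=1000, end=1999)
refseq = Transcript(introns=(), start=1500, end=2099)

if should_collapse(genbank, refseq):
    merged = collapse(genbank, refseq)
    print(merged)
```

In this example the shared region is 500 bp, which is 50% of the first transcript and about 83% of the second, so the two models collapse into one spanning 1000 to 2099.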

Files

README.md

Files (473.8 MB)

Checksum Size
md5:19d7d3736f9b86bc9e8b02c6bfa0b1d6 294.7 MB
md5:ae32e03a10b8bbd4d9ccbc5038d9abaa 179.1 MB
md5:e460058065b19aa7fff0065e1cf2cba9 5.1 kB