Side-by-Side Comparison of Post-Entry Quarantine and High Throughput Sequencing Methods for Virus and Viroid Diagnosis


	2. Materials and Methods



	2.1. Plant Material Preparation and Nucleic Acid Extraction

Assays used at PEQ identified 14 imported plants infected with one or more endemic or regulated (i.e., deemed of biosecurity concern) viruses/viroids for comparison against HTS protocols. These plants comprised a diverse range of commodities, namely citrus ( n = 6), iris (n = 1), ornamental grass (n = 1), prunus (n = 1), raspberry (n = 1), strawberry (n = 3) and sweet potato (n = 1) (Table 1). Three leaves of different ages from different parts of the plant were selected for each plant and combined for downstream processing. Tissue samples were collected using a single-use sterile blade or by hand with single-use gloves. Approximately 100 mg of leaf/midrib was subsampled for nucleic acid extraction.
Total RNA extracts were prepared with an automated Maxwell [®] Rapid Sample Concentrator (RSC; Promega, Madisson, WI, USA) instrument using Maxwell [®] RSC SimplyRNA Tissue kit (AS1340, Promega). In brief, tissue samples were placed with a sterile 7 mm stainless steel bead in a sterile 2 mL round bottom tube and stored at −80 °C for 30 min. The tubes were then placed into a TissueLyser (Qiagen, Germantown, Maryland, Germany) for 2 min at 50 Hz. Homogenization solution (600 µL) was added to each tube followed by 300 µL of lysis buffer. The tubes were subsequently vortexed and spun for 5 min at 20,000× g before loading 500 µL supernatant into the Maxwell RSC cartridge. The quantity of the RNA extracted was measured by fluorometry using a Qubit RNA BR Assay kit (ThermoFisher Scientific, Waltham, MA, USA) on a Quantus fluorometer (Promega) according to manufacturer’s recommendations.


	2.2. Routine Diagnostics Assays

Virus and viroid infection was confirmed by PCR or RT-qPCR. Assay details described inSupplementary Materials Table S2.


	2.3. Library Construction and Sequencing

Aliquots of 2 µg of the same total RNA extract was shipped on dry ice to two separate HTS providers referred here as service provider 1 (SP1) and 2 (SP2). Fourteen and nine RNA samples were submitted to SP1 and SP2, respectively where the RNA quality was assessed and sequencing libraries were prepared (Supplementary Materials Table S1). Samples with RNA integrity scores ≥ 7.0 (Bioanalyser 2100, Agilent, Santa Clara, CA, USA) were retained for library preparation.
 **  Sequencing provider 1 (SP1)
For RNA-Seq library preparation, samples were subject to rRNA depletion using the Ribo-Zero Plant kit (Illumina, San Diego, CA, USA) as per the instruction manual. Then, the TruSeq Stranded Total RNA kit (Illumina) was used to generate single-indexed libraries. A proprietary sRNA library preparation protocol was used that included two PAGE electrophoresis steps: the first one to separate 18 to 30 bp long RNA from total RNA before library preparation and a second one after library preparation to separate final PCR products in the range of 100–120 bp. The sRNA libraries generated used single-indexed sequencing adapters and unique molecular identifiers (UMIs). All samples were sequenced using a proprietary rolling cycle sequencing approach and a custom sequencing platform, generating 2 × 150 bp paired-end RNA-Seq reads and 75 bp single-end reads for the sRNA-Seq samples (Supplementary Materials Figure S1).
 **  Sequencing provider 2 (SP2)
The TruSeq Stranded Total RNA kit with Ribo-Zero Plant (Illumina) was used to produce dual-indexed RNA-Seq library libraries. Paired-end 2 × 150 bp reads were sequenced using an Illumina NovaSeq 6000. For sRNA library preparation, samples were processed with the NEBNext Multiplex Small RNA Library kit (New England Biolabs, Ipswich, MA, USA) with a final size selection performed on a Pippin prep system using 3% cassettes with the P marker and 125–175 bp size selection parameters. Single-end 75 bp reads were sequenced on an Illumina NextSeq 500 (Supplementary Materials Figure S1).


	2.4. Bioinformatics Analyses

All raw fastq files were deposited in the Short Read Archive (SRA) database under the BioProject PRJNA752836. Post-sequencing, reads generated by RNA-Seq were filtered for low quality regions and sequencing adapters using Trim Galore [27]. The adapters were the specific forward (5′-AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA-3′) and reverse (5′-AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG-3′) adapters for the SP1 samples; and Illumina TruSeq adapter (5′-AGATCGGAAGAGC-3′) for the SP2 samples. The sRNA dataset provided by SP1 was already filtered for low quality sequences and adapters using SOAPnuke [28]. We performed an additional filtering step for low quality regions using Fastp [29]. The sRNA reads provided by SP2 were clipped for Illumina adapters and low quality regions using Fastp. FastQC was used to check the quality of reads prior to and after quality filtering (≥Q30) for both RNA-Seq and sRNA-Seq samples [30]. The amount of rRNA was assessed by aligning the reads with Bowtie [31] against the Silva database (https://www.arb-silva.de, accessed on 9 January 2021).
Four bioinformatics approaches were executed on the ribodepleted RNA-Seq samples to detect viruses and viroids. We trialed two methods that use direct annotation on quality trimmed reads without any assembly: the plant virus detection pipeline (PVDP) [32] and Kodoja [33]. These workflows were run on default parameters using the pre-computed database kodojaDB_v1.0 and the PlantVirusesDB_0420v4_masked.fa for Kodoja and PVDP, respectively. We used the stringent level assignment for viruses (i.e., same assignment designated by Kraken [34] and Kaiju [35]) and the non-stringent assignment for viroids (as viroids do not have protein assignments in RefSeq). We also tested two de novo transcriptome assemblers: the rnaviral mode of SPAdes 3.15 [36] and Trinity-v2.11.0 [37] using singularity. For Trinity, the quality-trimmed reads were filtered using a removal list of diverse non-informative sequences including ribosomal RNAs and representative host plant chloroplasts and mitochondria, before performing de novo assembly following best practice recommendations for this tool (https://informatics.fas.harvard.edu/best-practices-for-de-novo-transcriptome-assembly-with-trinity.html, accessed on 9 January 2021). SPAdes 3.15 was run using the rnaviral mode on default parameters, using k-mer lengths of 33 and 49. The contigs obtained from the de novo assembly approaches were collapsed into scaffolds using CAP3 [38]. Nucleotide sequence similarity search of scaffolds was then performed against the NCBI nucleotide (nt) database using megablast [39]. The top 5 matches were retained, and a report table was obtained using BlastTools (https://github.com/schmidda/blast-tools, accessed on 9 March 2021) (Supplementary Materials Figure S1).
We compared the detection of viruses and viroids by the sRNA-Seq technology using two methods: VirusDetect 1.7 [40] and VirReport (https://github.com/eresearchqut/virreport.git, accessed on 2 February 2022), a portable, scalable and reproducible nextflow [41] pipeline which uses a similar framework to the Yabi web-based Virus Surveillance and Diagnostics pipeline [42] (Supplementary Materials Figure S2). Before running VirusDetect, reads were first filtered for ribosomal RNA and host representative sequences (mitochondria and chloroplast) as recommended by the developer. The software was run using the cut-off depth parameter of 5 (default). VirReport was tested using two different de novo assemblers: SPAdes 3.14 [43,44]. and Velvet 1.2.10 [45] (Supplementary Materials Figure S1). The downstream results obtained from these assemblers are referred to as VirReport-SPAdes and VirReport-Velvet from here onwards). For SPAdes, several k-mer ranges were tested and the optimal setting which maximized recovery of viruses and viroids for this study was 9 to 21, increasing by increments of 2 (data not shown). For Velvet, a k-mer size of 15 and a depth cut-off of 3 was used. For both VirusDetect and VirReport, we restricted the input for de novo assembly to 21–22 nt- or 24 nt-long reads. The contigs obtained by SPAdes and Velvet were blasted against NCBI as per protocol for the RNA-Seq data. A custom python script was applied to derive the top virus or viroid blastn hit (from the report table derived with BlastTools), and use it as a reference to map the filtered 21–22 nt- and 24 nt-long reads using Bowtie, allowing up to a single mismatch (which is equivalent to up to 5% sequence divergence among mapped reads onto the same reference genome). PCR duplicates were identified using UMICollapse [46] for the libraries which incorporated UMIs for the sRNA samples from the SP1. From the obtained alignments, we calculated the total number of aligned reads and deduplicated reads (for samples from SP1). We normalized counts using reads per million (RPM): total number of aligned reads × 1,000,000/total number of quality-filtered reads. Coverage and average depth were calculated using Picard tools (http://broadinstitute.github.io/picard/, accessed on 9 March 2021).
To evaluate the concordance of the de novo assembled viral contigs across methodologies and enable the identification of artefacts such as chimeric contigs, we selected, and where applicable derived, representative sequences for each detected viruses and viroids using the de novo contigs generated from both ribodepleted RNA-Seq and sRNA-Seq. For cases where a suitable representative sequence was not available, we used a public reference genome. To compare read abundance for a given virus or viroid across sequencing technologies, the representative sequences were used as references to map high quality reads from virus-derived reads from each sample for both HTS approaches, to derive read counts and fragments per kilobase of transcript per million mapped reads (FPKM).
To compare the performance of each method and HTS technology combination, we tested their ability to detect a total of 38 known target viruses and viroids species identified by PEQ across both sequencing platform including 24 across 14 plant samples for the SP1 and 14 across nine samples for the SP2 (Table 1). We derived the number of true positive (TP) and false positive (FP) detections and calculated sensitivity, which corresponds to the number of TPs (number of correctly identified viruses and viroids) divided by the total number of known target species [16,47]. To normalize reads from each sample and HTS approach, both sRNA-Seq and RNA-Seq derived quality trimmed reads, which were filtered from non-informative RNA products, were randomly subsampled using seqtk [48]; the sRNA-Seq dataset was also limited to reads ranging in size from 18 to 25 nt. The chosen subsample data sizes were 1,000,000 (1 M), 2.5 M and 4 M single reads for the sRNA-Seq data; and 1 M, 2.5 M, 4 M and 10 M paired-end reads for the RNA-Seq datasets. These subsamples were reanalyzed using Kodoja (RNA-Seq) and VirReport-Velvet (sRNA-Seq). For each subset of data, we derived the number of TPs and FPs, and calculated sensitivity and false discovery rate using the formula FP/(FP + TP) [16,47].
We also implemented a contamination flag (CF) which was applied on Kodoja, PVDP, VirReport and VirusDetect. Candidate contamination events were flagged based on RPM value reported for each target detected in the final virus table output. We used RPM values derived by the bioinformatics method (PVDP, VirusDetect) or derived it independently (Kodoja, VirReport). The CF assumptions are that: (1) a virus or viroid present in high titer in a given sample is likely to be the source of contamination in other multiplexed samples in the same run, and (2) detection of reads matching to this pathogen in other samples occur at a significantly lower abundance. We first calculate the maximum RPM value recorded for each virus and viroid identified on a run. If for a given virus, the RPM value reported for a sample represented less than a percentage of this maximum RPM value, it was then flagged as a contamination event. We tested 0.1, 1 and 10% as minimum threshold values.
