VGEA: an RNA viral assembly toolkit


	Materials and Methods



	Datasets

We tested and validated VGEA with SARS-CoV-2 (n = 20) and Lassa virus (n = 20) datasets sequenced on the Illumina MiSeq and Illumina FGx sequencing machines in our laboratory at the African Centre of Excellence for Genomics of Infectious Diseases (ACEGID), Redeemer’s University, Ede, Nigeria. Briefly, samples were inactivated in buffer AVL and viral RNA was extracted using the QIAamp Viral RNA Mini Kit (Qiagen) according to the manufacturer’s instructions. Extracted RNA was treated with Turbo DNase to remove contaminating DNA, followed by cDNA synthesis with random hexamers. Sequencing libraries were prepared using the Nextera XT kit (Illumina) as previously described (Matranga et al., 2016) and sequenced on the Illumina MiSeq platform with 101 base pair paired-end reads. We also tested and validated VGEA with HIV-1 datasets (n = 20) that were sequenced on the Illumina HiSeq 2500 and obtained from the NCBI Sequence Read Archive (SRA). In total, 60 test datasets (20 each for Lassa virus, SARS-CoV-2 and HIV-1) were used to validate the VGEA pipeline. All test datasets are available on figshare (https://doi.org/10.6084/m9.figshare.13009997).


	Implementation

To install VGEA, the user downloads the pipeline onto a personal computer and creates a conda environment that sets up all dependencies. Complete installation steps are given in the GitHub README file: https://github.com/pauloluniyi/VGEA/blob/master/README.md
The VGEA analysis is broken down into a set of ‘rules’, each of which links the output files of one task to the input of the next task in the general workflow (Fig. 1). The dependencies are: fastp for read trimming and read-level quality control; BWA for mapping sequencing reads to the human reference genome; SAMtools for extracting unmapped reads and for splitting BAM files into fastq files; IVA for de novo assembly to generate contigs; shiver to pre-process reads for quality and contamination and then map them to a reference tailored to the sample, built from corrected contigs supplemented with the user’s choice of existing reference sequences; SeqKit for cleaning the shiver assembly for QUAST; QUAST for assessing the quality of genome assemblies; and MultiQC for aggregating the results from fastp, BWA and QUAST.
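The way such rules chain together can be illustrated with a minimal Python sketch. It is not the VGEA implementation: the rule names and file names below are hypothetical, and the sketch only shows how an execution order can be derived by matching each rule's declared outputs to the inputs of later rules.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical rules for illustration only; the names and file names are
# assumptions and are not taken from the VGEA source code.
RULES = {
    "trim_reads": {
        "inputs": ["sample_R1.fastq", "sample_R2.fastq"],
        "outputs": ["trimmed_R1.fastq", "trimmed_R2.fastq"],
    },
    "map_to_human": {
        "inputs": ["trimmed_R1.fastq", "trimmed_R2.fastq"],
        "outputs": ["mapped.bam"],
    },
    "extract_unmapped": {
        "inputs": ["mapped.bam"],
        "outputs": ["unmapped_R1.fastq", "unmapped_R2.fastq"],
    },
    "assemble_contigs": {
        "inputs": ["unmapped_R1.fastq", "unmapped_R2.fastq"],
        "outputs": ["contigs.fasta"],
    },
    "build_consensus": {
        "inputs": ["contigs.fasta", "unmapped_R1.fastq", "unmapped_R2.fastq"],
        "outputs": ["consensus.fasta"],
    },
    "assess_assembly": {
        "inputs": ["consensus.fasta"],
        "outputs": ["assembly_report.tsv"],
    },
}


def execution_order(rules):
    """Return rule names ordered so that every rule runs after its producers."""
    # Map each output file to the rule that produces it.
    producer = {
        out: name for name, rule in rules.items() for out in rule["outputs"]
    }
    # A rule depends on whichever rules produce its inputs; inputs with no
    # producer (the raw fastq files here) are supplied by the user.
    graph = {
        name: {producer[i] for i in rule["inputs"] if i in producer}
        for name, rule in rules.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for step, name in enumerate(execution_order(RULES), start=1):
        print(step, name)
```

Because each rule declares only its own inputs and outputs, a step can be added or replaced without editing the rest of the chain, provided the file names still connect.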
All of these tools can be installed from the Bioconda channel (Grüning et al., 2018). The input files for VGEA are paired-end fastq files. VGEA is fully customizable: users can modify the parameters of every step of the workflow to suit the samples being processed, and can add further steps to the pipeline as they see fit. The pipeline runs on Linux/Unix and macOS. No prior programming experience is required to run it; once the user supplies the input, the whole workflow runs automatically from beginning to end.
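Since the input consists of paired-end fastq files, it can be useful to check that every sample has both mates before starting a run. The following is a minimal sketch under an assumed naming convention (`<sample>_R1.fastq` and `<sample>_R2.fastq`, optionally gzipped); this convention and the function name are illustrative assumptions, not part of VGEA, and the README should be consulted for the naming the pipeline actually expects.

```python
import re
from pathlib import Path

# Assumed naming convention (hypothetical, not from the VGEA documentation):
# <sample>_R1.fastq[.gz] and <sample>_R2.fastq[.gz]
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq(\.gz)?$")


def find_read_pairs(filenames):
    """Group fastq file names by sample and report incomplete pairs.

    Returns (complete, incomplete): `complete` maps each sample name to a
    (forward, reverse) tuple of file names; `incomplete` is a sorted list of
    sample names for which only one mate was found.
    """
    mates = {}
    for name in filenames:
        match = PAIR_PATTERN.match(Path(name).name)
        if match is None:
            continue  # skip files that do not follow the assumed convention
        mates.setdefault(match["sample"], {})[match["mate"]] = name

    complete = {
        sample: (found["1"], found["2"])
        for sample, found in mates.items()
        if set(found) == {"1", "2"}
    }
    incomplete = sorted(
        sample for sample, found in mates.items() if set(found) != {"1", "2"}
    )
    return complete, incomplete


if __name__ == "__main__":
    files = ["s1_R1.fastq", "s1_R2.fastq", "s2_R1.fastq.gz", "notes.txt"]
    pairs, missing = find_read_pairs(files)
    print("complete pairs:", pairs)
    print("missing a mate:", missing)
```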
