Recovery of metagenomic data from the Aedes aegypti microbiome using a reproducible snakemake pipeline: MINUUR


	Methods



	Specifications

To undertake this analysis, we developed a metagenomics workflow called MINUUR, using the workflow manager Snakemake[40] (Figure 1). This pipeline was developed to facilitate the following analysis and future studies aiming to characterise non-reference reads in mosquitoes or other organisms. A breakdown of the pipeline that produced each result of this study is provided in the following section, with details of how each step was configured in the final section. 


	Database setup

MINUUR requires several databases to perform the analysis. This includes a high quality bowtie2-indexed reference genome[41] to separate host and non-host reads, and a KRAKEN2[42], BRACKEN[43] and MetaPhlAn3[44] database for taxonomic read classifications. All databases are available in their respective GitHub repositories. The databases used in this study are the default MetaPhlAn3 marker gene database, KRAKEN2 and BRAKEN indexes from the Ben Langmead repository located here. For our study, we downloaded and compiled these default databases on 8 September 2022. 


	Data preparation

MINUUR accepts either BAM or paired FASTQ inputs. For FASTQ inputs, MINUUR performs quality control (QC) using FASTQC (v0.11.9)[46], providing a QC report per sample. MINUUR does not use the FASTQC report in subsequent steps, but only as a quality assurance metric for the user and to estimate if read trimming is required. Read trimming can be performed within MINUUR using Cutadapt (v1.5)[47] with user defined parameters for minimum read length, base quality and adapter content (default: minimum base length = 50, average base quality = 30). To separate host and non-host, reads are aligned to a user defined indexed reference genome (the relevant host genome) using Bowtie2 (v2.4.4)[41]. Alignment sensitivity and type can be adjusted within the pipeline at the user’s discretion. A high-quality, chromosome level-assembled reference genome is recommended if available. In situations where this is not possible, users should be aware that read alignment will likely result in mismatches between the reference and target sequence and produce alignments with poor coverage[48]. As a result, non-reference reads used in subsequent steps are likely to contain a substantial number of host data. In this instance, we included a feature within MINUUR to extract KRAKEN2 assigned reads pertaining to known microbes and potentially improve metagenome assemblies. Non-reference reads within the coordinate sorted binary alignment (BAM) file are extracted using samtools using the command ”samtools -view -f 4” (v.1.14)[49] and converted to FASTQ format using bedtools (v2.3.0)[50] (”–bamToFastq”). As users might already have their data in BAM format mapped against the host reference (
e.g. in large-scale sequencing projects like the Ag1000G), we also included the option of a BAM input. Here, any non-reference reads within the BAM file will be extracted, converted to FASTQ and used in downstream steps.
For this study, we used five published genome datasets of Aedes aegypti publicly available on the European Nucleotide Archive[35,37,51–54]. We selected the data sets to cover a range of sequencing depths, DNA extraction method and sequencing platform (Figure 2C). All raw FASTQ files of published sequencing data were retrieved from the European Nucleotide Archive (ENA) under the project accession numbers PRJEB33044[37], PRJNA255893[51], PRJNA385349[52], PRJNA718905[35], PRJNA776956[53] and PRJNA992905[54].


	Read classification

MINUUR uses two read classification approaches to infer taxonomy. KRAKEN2 (v2.1.2)[42], which uses a k-mer based approach to map read fragments of k-length against a taxonomic genome library of k-mer sequences, and MetaPhlAn3 (v3.0.13)[44] to align reads against a library of marker genes using Bowtie2[41]. MINUUR also provides the option to use KRAKEN2 classified reads, parsed from KrakenTools (v1.2), to select a specific set of reads (for example bacterial) for metagenome assembly. To estimate the relative taxonomic abundance from KRAKEN2 classifications, MINUUR will parse KRAKEN2 read classifications to BRACKEN (v2.6.2)[43] which uses a Bayesian probability approach to redistribute reads assigned at higher taxonomic levels to lower (species) taxonomic levels.
MINUUR outputs classified and unclassified reads to paired FASTQ files and generates BRACKEN-estimated taxonomic abundance profiles for further analysis. An additional feature we included within MINUUR is the option to extract a specific taxon or group of taxa from KRAKEN2, using the program KrakenTools[42]. This option is useful in situations where a specific group of taxa are of interest or to exclude groups of taxa such as viral or archaeal reads. Alternatively, if alignment to a reference is poor, this option can be used to remove host reads that did not map to its reference.


	Metagenome assembly, binning and quality assurance

MINUUR uses the non-reference reads to perform de novo metagenome assemblies (the same reads used for KRAKEN2 and MetaPhlAn3 taxonomic classifications). Reads are parsed to MEGAHIT (v1.2.9)[55], a rapid and memory efficient metagenome assembler, for
de novo metagenome assembly. Assembled contigs are quality-checked using QUAST (v5.0.2)[56] to assess contig N50 and L50 scores. The resultant contigs, which are fasta files with sequences pertaining to genomic regions of a microbe, need to be placed within defined taxonomic groups - referred to as a bin. For this, contigs are indexed using the Burrows Wheeler Aligner (BWA) (v0.7.17)[57], and the original non-reference reads are aligned to the indexed contigs using ”–bwa-mem”. The subsequent coordinate sorted BAM file is parsed to the “jgi_summarize_bam_contig_depth” script from MetaBAT2 (v2.12.1)[58] to produce a depth file of contig coverage. The depth file and assembled contigs are input to the metagenome binner MetaBAT2 (v2.12.1)[58], to group contigs within defined genomic bins. Each bin is a predicted metagenome assembled genome (MAG). CheckM (v1.1.3)[59] is then used for quality assurance of each bin by identifying single copy core genes. Specifically, bin contamination is assessed by looking for one single copy core gene within each bin, and completeness by calculating a required set of single copy core genes. In addition, BUSCO (v5.4.7)[60] can be included to search for eukaryotic or prokaryotic specific genes in the final MAGs as an additional quality assurance metric.


	Pipeline configuration

All sample names were listed in the samples.tsv file of the configuration directory and paths to the FASTQ directories given in the samples_table.tsv file. To implement the pipeline, the configuration file was set to the following parameters: FASTQ = True, QC = True, CutadaptParams = ”–minimum-length 50 -q 30”, RemoveHostFromFastqGz = True, AlignmentSensitivity = ”–sensitive-local”, ProcessBam = True, From-Fastq = True, KrakenClassification = True, ConfidenceScore = 0, KrakenSummaries = True, GenusReadThreshold = 1000, SpeciesReadThreshold = 30000, MetagenomeAssm = True, MetagenomeBinning = True, MinimumContigLength = 1500, CheckmBinQA = True, BUSCO=True. All databases were installed from their respective repositories on 8 September 2022. The pipeline was run on an Ubuntu Linux system with 660gb of available memory and 128 CPUs. For our analysis, with the above settings and 10 cores available, runtime was two weeks; the maximum Resident Set Size (RSS) of an individual sample during this run was 9771 RSS (occurring during metagenome assembly); and total storage used (including temporary files) was 5.18Tb (terabytes) across all samples used in this study.


	Taxonomic classification of MAGs with GTDB-Tk

Separate from MINUUR, all bins produced from MetaBAT2 were taxonomically classified with GTDB-Tk[61] (v2.1.1) using ”–classify-wf” against the Genome Taxonomy Database (GTDB) (release 07-R207 8 April 2022, downloaded on 8 September 2022). GTDB-Tk assigns genes to MAGs using Prodigal (v2.6.3)[62] and ranks the taxonomic domain of each MAG using a database of 120 bacteria and 122 archaea marker genes[63] using HMMER3[64]. With this information, MAGs are then placed into domain specific reference trees with pplacer (v1.1)[65]. Taxonomic classifications with GTDB-Tk are based on placement within the GTDB reference tree, relative evolutionary divergence, and average nucleotide identity (ANI) scores with its closest reference genome. The relative evolutionary divergence score is used to refine ambiguous taxonomic rank assignments and ANI scores are used to define species classifications[61].
