Phylodynamic Inference of Bacterial Outbreak Parameters Using Nanopore Sequencing


	Materials and Methods



	Outbreak Sampling in FNQ and PNG

We collected isolates from outbreaks in two remote populations in northern Australia and PNG (fig. 1). Isolates associated with pediatric osteomyelitis cases (mean age of 8 years) were collected from 2012 to 2017 (n = 42) from Kundiawa, Simbu Province (27), and from 2012 to 2018 (n = 35) from patients in the neighboring Eastern Highlands province town of Goroka. We supplemented the data with methicillin sensitive S. aureus isolates associated with severe hospital-associated infections and blood cultures in Madang (Madang Province; n = 8) and Goroka (n = 12). Isolates from communities in FNQ, including urban Cairns, the Cape York Peninsula and the Torres Strait Islands (n = 91), were a contemporary sample from routine surveillance at Cairns Hospital in 2019. Isolates were recovered on LB agar from clinical specimens using routine microbiological techniques at Queensland Health and the Papua New Guinea Institute of Medical Research (PNGIMR). Isolates were transported on swabs from monocultures to the Australian Institute of Tropical Health and Medicine (AITHM Townsville) where they were cultured in 10-ml Luria-Bertani (LB) broth at 37 °C overnight and stored at −80 °C in glycosol and LB. Illumina short-read data from the ST93 lineage (van Hal et al. 2018) included in this study were collected from the European Nucleotide Archive (supplementary tables, Supplementary Material online).


	Nanopore Sequencing and Basecalling

Two milliliters of LB broth was spun down at 5,000 × g for 10 min and after removing the supernatant, 50 µl of 0.5 mg/ml lysostaphin were added to the tube and vortexed. Cell lysis was conducted at 37 °C for 2 h with gentle shaking followed by a proteinase K digestion for 30 min at 56 °C. DNA was extracted using a simple column protocol from the DNeasy Blood & Tissue kit (QIAGEN) following the manufacturer’s instructions. DNA was eluted in 70 µl of nuclease-free water, quantified on Qubit, and DNA was stored at 4 °C until library preparation. Library preparation was done using approximately 420 ng of DNA and the rapid barcoding kit with 12 barcodes (ONT, SQK-RBK004) as per manufacturer’s instructions. Basecalling was done using the R9.4.1 high accuracy (HAC,supplementary fig. S5, Supplementary Material online), the HAC methylation model (not shown), and the all context methylation Rerio model (not shown) in Guppy v4.2.3, as well as the final Bonito v0.3.6 R9.4.1 DNA model (used for all analyses), run on a local NVIDIA GTX1080-Ti or a remote cluster of NVIDIA P100 GPUs. Sequence runs were conducted with 2 × 12 barcoded (SQK-RBK004) isolates per flow cell in two consecutive 18–24 h runs. Libraries were nuclease flushed using the wash kit between consecutive runs (EXP-WSH-003). This is sufficiently effective to remove read carry-over, as demonstrated previously with hybrid assemblies of sequentially sequenced Enterobacteriaceae (Lipworth et al. 2020) and our analysis of a single library panel (FNQ-2) sequenced on a previously used flow cell with a human library. After washing with EXP-WSH-003, a total of 2,910/294,461 reads were classified as human in the S. aureus library, about twice as much as human contamination in other runs. Sequencing runs were managed on two MinIONs and monitored in MinKNOW > v20.3.1.


	Nanopore Genome Assembly and Quality Control

Genome assemblies for genotyping were constructed using our Nextflow assembly pipeline ( https://github.com/np-core/np-assembly), which first randomly subsamples reads to a maximum of 200× coverage with rasusa v0.3.0 (Hall 2022) and filtered Q > 7 with minimum read length of 100 bp using nanoq v0.8.0 (Steinig and Coin 2022). Fastp v0.20.1 (Chen et al. 2018) was used to trim adapter and low-quality Illumina sequences. We then constructed three types of assemblies: a polished long-read assembly using ONT data only (flye), one with short-read correction of the ONT long-read assembly (pilon) and one that first assembles short reads and scaffolds the assembly with long reads. For the polished long-read assembly, Flye v2.8.3 (Kolmogorov et al. 2019) was used in conjunction with four iterations of minimap2 v2.17-r941 (Li 2018) + Racon 1.4.20 (Vaser et al. 2017) and subsequent polishing with Medaka 1.2.3. For the long-read hybrid assembly, corrections were conducted with Illumina paired-end reads for each genome using two iterations of Pilon v1.2.3. For the short-read hybrid assembly, we used Unicycler v0.4.8. Reference Illumina assemblies were generated with the pipeline Shovill v1.1.0 ( https://github.com/tseemann/shovill) using Skesa v2.4.0 and genotyped with Mykrobe v0.9.0 (Hunt et al. 2019) (from reads) and SCCion v0.4.0 ( https://github.com/esteinig/sccion), a wrapper around common assembly-based genotyping tools and databases (Zankari et al. 2012;Chen et al. 2016;Kaya et al. 2018) for S. aureus. We called species, resistance genes, virulence factors, PVL, multilocus sequence type, mecA, and major SCCmec cassette subtypes. We assessed differences between the Illumina references and hybrid- or nanopore assemblies using the dnadiff v1.3 to determine assembly-based differences in SNPs and Indels, as well as assess overall identity between genomes (fig. 2). Coverage against the reference genome (ST93: JKD6159;Chua et al. 2010) was assessed using CoverM v0.6.0 ( https://github.com/wwood/CoverM).


	De Novo Variant Calling and Random Forest SNP Polishers

We called SNPs de novo using the neural network callers Medaka v1.2.3 ( https://github.com/nanoporetech/medaka) and Clair v2.1.1 (shown in example pipeline executions). Snippy v4.6.0 ( https://github.com/tseemann/snippy) was used to generate a core-site alignment of the ST93 background population (n = 444, 6,161 SNPs) and reference Illumina core alignments including the outbreaks in FNQ and PNG isolates (>5×, n = 531, 6,580 SNPs). Snippy variant calls (SNP type) were used as reference truth for matching ONT and Illumina sequenced isolates. We implemented the feature extraction and random forest design from Sanderson and colleagues (2020) who use the RandomForest classifier from scikit-learn (Pedregosa et al. 2011) with default hyperparameter settings and feature extraction with pysamstats. Like the original implementation, we subsampled isolates to 2, 5, 10, 20, 50, and 100× coverage with rasusa to account for read coverage in training and evaluating the classifiers. For training, we created three sets of matching Illumina and ONT sequence data, each with three isolates for training: three mixed sequence types (ST88, ST15, and ST93; saureus_mixed), one of FNQ within-lineage isolates (ST93; saureus_fnq), and one of Papua New Guinean within-lineage isolates (ST93; saureus_png). Training and validation sets for the classifiers were split into 60% training and 40% validation data.
Next, we evaluated the classifiers, including the N. gonorrhoeae classifier trained by Sanderson and colleagues, using the remaining isolates from FNQ and PNG as an independent test data set (fig. 1). We defined true positive (TP) SNPs as those that were called by both Illumina Snippy and ONT Clair, FP as ONT SNPs that were not called with Snippy, and FN Snippy calls that were missed by ONT calls or later excluded in the random forest filtering step. Since we used the de novo Snippy calls as reference, true negative (TN) calls (sites called as wild type by ONT and Snippy) were not able to be considered. We combined data from both outbreaks (nST93 = 118, nother = 44) and computed accuracy, precision, recall, and F1 scores for each evaluation against Illumina reference data (supplementary tables, Supplementary Material online,fig. 4).


	Hybrid Core-Site Outbreak Alignments

To contextualize polished ONT isolates called with Clair within the wider background of the ST93 lineage, we adopted the core functionality from Snippy's core alignment caller (Snippy-core) into an ONT and Illumina core SNP alignment caller in the NanoPath package ( https://github.com/np-core/nanopath). Core SNP sites were defined by polymorphic SNP sites present in genomes of all isolates included in the alignment, excluding any site that in any one isolate falls into a gap, or any site with less than a predefined minimum coverage (default: 1×). We first polished ONT SNPs from Clair with the trained random forest models, including the N. gonorrhoeae data set fromSanderson et al. (2020). We then created reference alignments of the Illumina data (ST93 background and outbreaks, n = 531, >5×) with Snippy-core, as well as a reference Illumina and polished hybrid alignments with ONT outbreak SNPs in NanoPath (fig. 5).


	ML Phylogenetics and Bayesian Model Configurations

ML phylogeny of the ST93 lineage was reconstructed from the Illumina and ONT polished alignments, including the outbreaks. We used RAXML-NG with the GTR+G and Lewi's ascertainment bias correction for SNP alignments. Trees were rooted on SRR115236 (early isolate from 1992), near the root of the phylogeny (van Hal et al. 2018) and decorated with metadata of sample origin at state level in ITOL (Letunic and Bork 2019). Sampling dates in years were provided for each isolate. We next subset the full lineage alignments to the isolates in the large clades of the FNQ (n = 36) and PNG (n = 62) outbreaks. We then configured birth–death skyline models in BEAST2 using a custom Python interface (NanoPath Beastling) that stores model configurations of the serially (PNG) and contemporaneously sampled models (FNQ) in YAML files. Birth–death models consider dynamics of a population forward in time using the (transmission) rate λ, the death (become uninfectious) rate δ, the sampling probability ρ, and the time of the start of the population (outbreak; also called origin time) T. The effective reproduction number (Re), can be directly extracted from these parameters by dividing the birth rate by the death rate (λ÷δ). We configured the model priors as outlined in table 1. Importantly, we set a lineage-wide fixed substitution rate prior at 3.199×10-4 (Steinig et al. 2021) to account for the loss of temporal signal in the outbreak subset alignments. NanoPath constructs the BEAST2 XML model files which can be run with the BEAGLE library on GPU. Results were summarized using the bdskytools package in R, where median higher posterior density intervals were computed in custom plotting scripts that can be found along with all other results from the pipelines and model runs at the data repository.


	Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.


	Supplementary Material

