MicroPIPE: validating an end-to-end workflow for high-quality complete bacterial genome construction


	Implementation



	Public data

The EC958 complete genome was downloaded from NCBI (GenBank: HG941718.1, HG941719.1, HG941720.1) [20]. Illumina reads for 12 ST131 genomes and draft assemblies for 95 ST131 were accessed from [17]. Twelve publicly available complete genomes were also selected to test MicroPIPE, under the following criteria: (i) the raw nanopore sequencing files (fast5) were available, (ii) a complete genome was made available for the same strain and (iii) Illumina sequencing data were available for the same strain. These 12 genomes represented 7 species from both gram-positive and gram-negative bacteria with chromosome sizes between 1.8 Mbp – 5.6 Mbps. A complete list of data used is provided in Supplementary dataset1.


	Culture and DNA extraction

Twelve ST131 E. coli isolates (including EC958) were grown from single colonies in Lysogeny Broth (LB) at 37 °C overnight with 250 rpm shaking. The overnight cultures (1.5 mL) were then pelleted for DNA extraction using the Wizard Genomic DNA Purification Kit (Promega) following manufacturer’s protocol with modifications. Briefly, the cell pellet was lysed following the protocol for Gram negative bacteria. RNA was removed by 1 h incubation at 37 °C with RNase and the lysate was then mix with Protein Precipitation Solution by vortexing for 5 s at max speed using Vortex-Genie 2 with horizontal tube adapter (Scientific Industries). The DNA was precipitated using isopropanol and washed with 70% ethanol. The DNA pellet was air-dried and then rehydrated in 100 μl EB buffer (QIAgen) by incubation at 65 °C for 1 h. The DNA was quantified using a Qubit fluorometer (ThermoFisher Scientific) and the DNA fragment size was estimated using agarose gel electrophoresis (0.5% agarose in TAE, 90 V, 1h30m).


	Nanopore sequencing

DNA from 12 ST131 E. coli were multiplexed onto a single FLO-MIN106 flow cell using the rapid barcode sequencing kit (SQK-RBK004) as per manufacturer’s recommendation with the following adjustments: the barcoded DNA was pooled without a concentration step using AMPure XP beads prior to sequencing. Read metrics for each isolate are given in Supplementary Table1.


	Pipeline tools and settings

Specific parameters and commands used to perform the following analyses are provided in full in Supplementary dataset1. MicroPIPE v0.8 uses Guppy v3.4.3, while MicroPIPE v0.9 uses Guppy v3.6.1.


	Basecalling

Reads were basecalled using Guppy (v3.4.3) “fast” and “high-accuracy” modes. Fast mode was evaluated using both GPU and CPU servers, while the “high-accuracy” mode was evaluated using only GPU as the time to completion for this mode became unfeasible when run using CPUs. Upon the release of Guppy v3.6.1, reads were re-basecalled using only the “high-accuracy” mode. Guppy versions (3.4.3 and 3.6.1) were tested using the methylation aware config file “dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg”.


	Demultiplexing

Demultiplexing was evaluated using Guppy_barcoder (v3.4.3) and qcat (v1.0.1) on the “passed” (>Q7) fastq reads after basecalling with Guppy. Demultiplexing using the raw fast5 reads was evaluated using Deepbinner (v0.2.0) [21]. Demultiplexed fast5 reads were subsequently basecalled with Guppy (v3.4.3).


	Quality control

Barcodes and adapters were trimmed using Porechop (v0.2.3_seqan2.1.1) ( https://github.com/rrwick/Porechop). Overall read quality metrics and basecalling statistics were extracted using PycoQC (v2.2.3) [22]. Read length and quality metrics per sample were extracted using NanoPlot (v1.26.1) [23]. Average percentage read accuracy was determined by mapping the basecalled reads to the reference genome EC958 using Minimap2 (v2.17-r954-dirty) [24] and computing reads accuracy using Nanoplot. Filtering was evaluated using two tools: Filtlong (v0.2.0) ( https://github.com/rrwick/Filtlong) and Japsa (v1.9-01a) ( https://github.com/mdcao/japsa/).


	Assembly

Six assemblers were evaluated for long-read assembly only: Canu (v1.9) [25], Flye (v2.5) [26], Raven (v1.1.5) https://www.nature.com/articles/s43588-021-00073-4 ( https://github.com/lbcb-sci/raven), Redbean (v2.5) [27], Shasta (v0.4.0: config file optimised for Nanopore: https://github.com/chanzuckerberg/shasta/blob/master/conf/Nanopore-Dec2019.conf) [28] and Unicycler (v0.4.7 long-read only) [29]. Three hybrid-assembly tools were also evaluated, including SPAdes (v3.13.1) [30], Unicycler (v0.4.7) and MaSuRCA (v3.3.5) [31]. Long-read correction was performed using Canu (v1.9).


	Polishing and quality assessment

Polishing of the draft assemblies was evaluated using long reads (ONT), short reads (Illumina), and a combination of both long and short reads. Long read polishing was performed using Racon (v1.4.9) [32] and Medaka (v0.10.0) ( https://nanoporetech.github.io/medaka/) (4 iterations of Racon based on Minimap2 v2.17-r941 overlaps followed by one iteration of Medaka), Nanopolish (v0.11.1) [33] (1 iteration based on Minimap2 v2.17-r941 alignment) and NextPolish (v1.1.0) [34] (2 iterations). Raw Illumina reads were trimmed using Trimmomatic (v0.36) [35] with the following settings: ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:30. Short read polishing was performed using NextPolish (v1.1.0) and Pilon (v1.23) [36] (both 2 rounds of polishing based on BWA MEM v0.7.17-r1188 alignments).
Circularity was checked using NUCmer (v3.1) [37] to perform self-alignments. For Flye, Canu and Unicycler, circularisation was determined by the assemblers themselves. For Canu, circularisation was also confirmed using Nucmer self-alignments results and contigs were trimmed to remove overlapping ends. Circularisation for Raven and Shasta was confirmed using generated GFA files. For MaSuRCA, circularisation was confirmed using Nucmer self-alignments results. For SPAdes, the plasmids were manually checked for circularity and the overlapping ends were trimmed. For Redbean, circularisation of the contigs was confirmed by alignment to the reference EC958 genome using QUAST.
Final assemblies were assessed for quality by comparison to the complete EC958 genome using the assess_assembly tool from Pomoxis (v0.3) ( https://github.com/nanoporetech/pomoxis) as well as DNAdiff (v1.3) [37] and QUAST (v5.0.2) [38] to detect errors, misassemblies, and determine overall nucleotide identity.


	Compute resources

All results were produced using cloud-based nodes with 16vCPUs and 32GB RAM. For the GPU node, the GPU is a NVIDIA Tesla P40 24GB while the CPUs are 2x Intel Xeon Silver 4214 2.2G (12C/24 T, 9.6GT/s, 16.5 M Cache, Turbo, HT [85 W] DDR4–2400).


	ST131 phylogeny

ParSNP (v1.5.2) [39] was used to create an ST131 phylogeny using the 12 ST131 E. coli assembled in this study in addition to 95 ST131 E. coli short-read assemblies from Petty and Ben Zakour et al. [17]. Recombination was removed using PhiPack [40], as implemented in ParSNP. To evaluate the accuracy of each assembly and polishing step, we included our 12 completely polished assemblies (long and short read), 12 unpolished assemblies, 12 long-read polished assemblies and 12 short-read polished assemblies. The tree was visualised using Figtree ( http://tree.bio.ed.ac.uk/software/figtree/) and iTOL [41].


	MEME methylation motif analysis

The 20 bps sequence (− 10 to + 10) around the 401 shared SNPs were extracted using BEDTools getfasta (v2.28.0–33-g0f45761e) [42]. MEME (v5.2.0) [43,44] was used to identify enriched motifs within the sequences using the default parameters of the classic mode and allowing zero or one occurrence per sequence. The motif CC(T/A)GG was significantly enriched in 393 sequences with an E-value of 6.2e-758.


	Results



	Validation of pipeline stages by comparison to EC958 complete genome

The main goal of this study was to create a robust and easily applicable pipeline for the construction of high-quality bacterial genomes with minimal manual manipulations. To achieve this, we first evaluated the performance of commonly used software at each stage of bacterial genome construction using the high-quality EC958 genome (Accession: HG941718) as our standard for final genome accuracy. Figure1 shows a diagram of the whole workflow, indicating the software chosen for comparison at each stage. Nanopore reads for EC958 were generated on a multiplexed run of 12 using the rapid barcoding kit on an R9.4.1 flow cell.



	Basecalling

When considering the ongoing stability and accuracy of our overall pipeline, we decided to limit our basecalling validation to software that we were confident would be consistent and well-maintained for the foreseeable future. Many existing basecallers (such as Bonito, Flappie and Runnie) are currently research releases and therefore have minimal support and unknown longevity. Other basecallers are either depreciated (Albacore) or no longer updated (Scrappie). As such, we decided to focus our analysis on Guppy, which is the ONT recommended basecaller and is stably released and maintained.
Here we tested Guppy using both the “fast” and “high-accuracy” modes, as well as the CPU vs. GPU configurations. When using Guppy v3.4.3 with the “high-accuracy” setting on GPU servers we generated reads with approximately 91.0% accuracy in 828.5 min (13.81 h). Using the “fast” mode on CPUs, we were able to generate 88.9% accuracy in 2948.4 min (49.14 h) (Table1). Testing the “high-accuracy” mode on a CPU server was unfeasible due to the time required for processing (fewer than 10% of reads completed basecalling in 1 week). Despite the lower per-read accuracy when using CPUs and the “fast” basecalling setting, the consensus quality of the overall finished genome (after assembly and polishing through MicroPIPE v0.8) was of comparable quality to that generated with the GPU and high-accuracy setting (Table1).

We also tested the effects of methylation and found that using the “high-accuracy” model with methylation-aware basecalling achieved a similar per-read accuracy (90.6%) to the “high-accuracy” only model. The final assembly, however, had fewer SNPs (3 vs. 23 originally) and indels (31 vs. 45 originally) compared to the reference standard (Table1).


	Demultiplexing

For demultiplexing we tested three tools: Deepbinner [21], Guppy_barcoder [45] and qcat [46]. While Guppy and qcat rely on basecalled reads, Deepbinner uses the raw fast5 reads. As such, we compared the total number of binned reads after both basecalling and binning for each tool. Overall, qcat was able to bin 89% of reads, compared to 84% for Guppy_barcoder and 75% for Deepbinner (Supplementary Figure1). Initially we chose qcat as the default demultiplexer as we prioritised read retention to maximise coverage of each genome. However, following the recent depreciation of qcat (detailed on their GitHub: https://github.com/nanoporetech/qcat), ONT is recommending the use of the Guppy demultiplexer. As such, Guppy was chosen as the default demultiplexer for MicroPIPE, while qcat is still optionally available within the pipeline.


	Filtering

Here we tested two filtering tools: Filtlong and Japsa. Filtlong has the advantage of being versatile enough to filter based on a number of requirements, such as read length, quality, percentage of reads to keep and the option of using an external reference. Japsa primarily filters based on read length and quality. Read metrics after filtering using each tool are given in Supplementary Figure2. Overall, we found that filtering with Japsa retained more reads, but with a reduced N50 read length and median read quality compared to Filtlong. Both tools took an equivalent amount of time to run. For all downstream analysis we filtered reads using Japsa with a minimum average quality cut-off of Q10 and 1 kb minimum read length, although Filtlong would have been equally suitable. Both filtering tools are available as optional steps in MicroPIPE. We have also included Rasusa [47] as an optional tool to randomly subsample large datasets down to a specific coverage (as necessary based on user needs). This subsampling step is performed before trimming in order to reduce computational time.


	Long-read-only assembly

A number of tools have been designed for de novo assembly from long reads. Here we compared six popular assembly tools and evaluated speed, completeness (of the chromosome and plasmids, including circularisation) and correctness (i.e. nucleotide identity) based on the complete EC958 reference genome standard, which contains 1 chromosome (5,109,767 bp) and 2 plasmids (135,602 bp and 4080 bp). Parameters used for all assemblers are given in Supplementary Dataset1.
Overall, we found that all assemblers constructed the chromosome and larger (~ 135 kb) plasmid (Fig.2, Supplementary Table2). Raven, Redbean and Shasta did not assemble the smaller ~ 4 kb plasmid. While Canu was able to assemble both plasmids, closer inspection found them to be much larger than expected (1.4x and 2x larger for the large and small plasmid, respectively) due to overlapping ends that required additional trimming. Interestingly, both Flye and Canu assembled a third, previously unidentified, small plasmid of ~ 1.8 kb in size. This small plasmid was only identified when the Flye “--plasmids” mode was selected (to rescue short unassembled plasmids) and when certain or no filtering parameters were applied to the reads prior to assembly (Supplementary Table3). Comparison of this small plasmid to the Illumina data for the EC958 reference genome standard confirmed its presence and was likely missed in the original assembly.

For most de novo assemblies, a number of small (< 4.5 kb) misassemblies were detected, mainly on the chromosome (Fig.2). This included a small inversion, which on closer inspection was found to be an invertible phage tail protein that has been characterised previously [20]. This inversion was found in the Flye, Unicycler, Raven and Redbean assemblies and was not counted as a misassembly due to its biological relevance.
Additional contigs were found in both Canu and Unicycler (long-read only mode). The three additional contigs produced by Unicycler all matched other parts of the EC958 reference genome standard (two on the chromosome, one on the larger plasmid). The additional contig in Canu matched part of the additional ~ 1.8 kb plasmid.
In terms of speed, Shasta, Redbean and Raven were the fastest assemblers, completing in less than 30 min. Of the remainder, Flye was four times faster than Canu and two times faster than Unicycler. The majority of contigs from all assemblers were reported as circularised upon assembly completion, with the exception of the additional contigs in Canu and Unicycler. Redbean did not generate circularisation information, although the chromosome and plasmid contigs could be circularised manually or using 3rd party software following assembly. Overall, we found that Flye generated the best de novo assembly from long read data without the need for manual intervention.


	Polishing

Polishing of assemblies generated using long reads is currently regarded as a necessity for ONT data due to high per-read errors that can persist through to the de novo assemblies [14]. Here we tested the polishing capabilities of three different tools (Racon/Medaka, NextPolish and Nanopolish) using nanopore long reads against the de novo assembly generated using Flye. We additionally tested polishing with Illumina short reads (NextPolish and Pilon), which have a higher basecall accuracy. Polishing was tested both independently (i.e., long read and short read separately) as well as sequentially (long read followed by short read polishing) to determine the best polishing protocol.
Overall, we found that polishing with Racon and Medaka (four rounds of Racon and one round of Medaka, using long reads) followed by NextPolish (two rounds using short reads) achieved the most accurate assemblies (Fig.3, Supplementary Table4). Polishing using only long or short reads did not produce comparable levels of accuracy, therefore we emphasize the requirement of short read sequencing in parallel with Nanopore for high-quality complete genome assembly (as is already commonly done).

To confirm our choice of Flye as the best assembler, we polished assemblies generated from the other five long-read assemblers, described above, using this strategy (Supplementary Table5). The polished Flye assembly remained the most accurate, closely followed by the polished Raven assembly.


	Hybrid assembly

In addition to long-read assembly (followed by short-read polishing), hybrid assemblers capable of using both long and short reads simultaneously have also been developed, and include Unicycler, MaSuRCA and SPAdes. Comparison of these pipelines to our genome completed with Flye, Racon, Medaka and NextPolish found that they did not outperform our current method. Unicycler was the only hybrid assembler able to completely resolve the chromosome and both plasmids (SPAdes failed to circularise the chromosome while MaSuRCA was unable to assemble the 4 kb plasmid) (Supplementary Table6). Additional long and short read polishing greatly improved the accuracy of the Unicycler and SPAdes hybrid assemblies but not MaSuRCA (Supplementary Table5). We compared the quality of the genomes generated by either the best long-read only assembly (Flye) or the best hybrid assembler based on accuracy and structure (Unicycler) and polished with the same strategy. The polished assemblies contained a similar number of indels compared to the EC958 reference genome standard, however the Flye assembly contained around two-fold fewer substitution errors (Supplementary Table5). Furthermore, Flye was nearly eight times faster than Unicycler (Supplementary Table6).


	Final pipeline

Based on the results of our comparative analysis for all of the major steps of bacterial genome assembly, we have developed MicroPIPE (Fig.4). The pipeline is written in Nextflow [48] and the dependencies are packaged into Singularity [49] container images available through the Docker Hub and Quay.io BioContainers repositories. The bioinformatics workflow manager Nextflow allows users to run the pipeline locally or using common High-Performance Computing schedulers. Each step of the pipeline uses a specific container image which enables easy modifications to be made in the future to include new or updated tools. Furthermore, in addition to the recommended default pipeline settings, MicroPIPE also provides alternative software options and/or parameters to suit the user’s individual needs. The pipeline is freely available on GitHub: https://github.com/BeatsonLab-MicrobialGenomics/micropipe.



	Evaluation of remaining differences with EC958 reference genome standard

The final genome for EC958 produced by MicroPIPE v0.8 was compared to the previously published EC958 reference genome standard (GenBank: HG941718.1) to assess any remaining differences. We observed a single 3.4 kb inversion corresponding to a phage tail protein switching event previously characterised in EC958 [20]. Overall, there were no other structural rearrangements. The final assembly contained an additional ~ 1.8 kb plasmid, with 100% nucleotide identity to previously reported E. coli plasmids (GenBank records CP048320.1, KJ484633.1, [50]). This plasmid appears to have been lost during size selection when constructing the original genomic DNA library for PacBio RSII sequencing of EC958 as it could be identified from de novo assembly of the corresponding Illumina reads.
Comparison of the two assemblies identified 68 remaining differences (66 on the chromosome, 2 on pEC958) (for full list, see Supplementary Dataset1). The two differences in the plasmid sequence correspond to known errors in the EC958 reference genome standard (PacBio assembly constructed without Illumina polishing). The majority of the chromosomal differences were indels (n = 45, 67%) ranging from 1 to 6 bp in size. These indels were mainly found in rRNA (n = 31), tRNA (n = 4), insertion sequences (n = 4), or phage-related genes (n = 2). The remaining 23 differences were SNPs, which were similarly found mainly in rRNA (n = 13) and insertion sequences (n = 8). These remaining differences likely represent an inability of current short-read polishing to adequately determine true alleles in repetitive regions of the genome. Using methylation-aware basecalling was found to significantly improve these errors, with only 3 SNPs and 31 indels (Supplementary Table7).


	MicroPIPE validation using 11 ST131 E. coli

To further test the robustness of MicroPIPE on other genomes, we included an additional 11 well-characterised E. coli ST131 strains [17] on a multiplexed run of 12 E. coli (i.e. 11 ST131 strains plus EC958).
Each strain took on average 120 min to run completely through MicroPIPE v0.8 using 16 threads (excluding the basecalling and demultiplexing steps) (Fig.4). Of these 11 isolates, all had complete circularised chromosomes of the expected size. They also carried an array of plasmids, which were circularised in all cases except for a single isolate, HVM2044 (Supplementary Table8). Re-analysis of this sample found that complete circularised plasmids can be achieved by adjusting the read filtering step. We also identified additional small plasmids in six out of the 11 genomes ranging between 1.5–5 kb in size. Importantly, we found that these plasmids are not recovered when using filtering parameters above 1 kb.
In order to confirm the accuracy of the assemblies generated with MicroPIPE, we recreated the ST131 phylogeny from [17] using (i) the complete MicroPIPE assembly, (ii) long read only polished assembly, (iii) short read only polished assembly and (iv) unpolished Nanopore assembly, and assessed the position of each strain within the tree. We found that all MicroPIPE v0.8 assemblies and ONT assemblies polished with Illumina clustered closest to their Illumina counterpart within the phylogenetic tree (Fig.5A). However, the long read polished and unpolished ONT assemblies in most cases did not cluster as expected. They also displayed longer branches indicative of the remaining errors within the assembly. Interestingly, the long read polished and unpolished assemblies for all ST131 isolates belonging to our previously defined fluoroquinolone-resistance clade C [17,18] clustered together independent of other clade C strains, possibly representing systematic errors from the ONT data. Further interrogation of the branch leading to this cluster identified 401 shared SNPs. Of these SNPs, 97% were transitions, particularly A → G (n = 187) and T → C (n = 203) (Supplementary Table9, Fig.5C). Further analysis of these sites determined that 393 (98%) were associated with a Dcm methylase motif CC(A/T)GG (Supplementary Figure4).



	MicroPIPE validation using publicly available ONT sequenced bacteria

Lastly, we tested MicroPIPE using 12 publicly available genomes from both gram-positive and gram-negative bacteria with available raw nanopore data (fast5) and validated our results using their corresponding complete genomes (Table2, Supplementary Dataset1). These genomes also represent a wide range of GC content to further validate the use of MicroPIPE on diverse bacterial species (Table2). As most of these isolates were sequenced using entire flow cells, the coverage was reduced to 100x during the initial Flye assembly stage to minimise processing time.

Using MicroPIPE v0.9, we were able to completely assemble the chromosome and plasmids of all 12 isolates. We were also able to recover two additional plasmids from the Salmonella enterica str. SA20055162 that were not reported in the original assembly (Table2).
To determine the accuracy of MicroPIPE, we compared our final assemblies with the submitted complete genome for each isolate. Overall, the fewest differences were detected between our MicroPIPE assembly and the complete genome of Staphylococcus aureus strain 110900 (6 SNPs, 5 indels) and strain 128254 (4 SNPs, 4 indels), constructed using ONT data basecalled with a recent version of Guppy (v3.2.6) (Table2). These were followed by Streptococcus pyogenes strain SP1336, constructed using PacBio long-read sequencing (8 SNPs, 96 indels). All other comparisons yielded 25–510 SNPs, and 14–758 indels, with the greatest number of differences observed in the Salmonella enterica serovar Napoli strain LC0541/17 (Table2).
With the exception of S. pyogenes SP1336, all other complete genomes were constructed using previously assembled nanopore data (Supplementary Dataset1). Specifically, all assemblies with a high number of SNPs and indels were generated using reads basecalled with Albacore or a Guppy version prior to v3. As such, we hypothesise that our MicroPIPE assemblies likely represent corrections to the existing complete genomes, as a result of updated basecalling and assembly methods. Further investigation found that one sample, Salmonella enterica Bareilly str. CFSAN000189, also had a corresponding complete genome constructed using PacBio data. Comparison of our MicroPIPE assembly to this complete genome detected 0 SNPs and 15 indels, while there were 32 SNPs and 34 indels compared to the ONT complete genome.


	Future development of MicroPIPE

Rapid and continual enhancement of nanopore technology has been integral to ONTs growth and popularity in recent years. It does, however, lead to several problems, including rapid depreciation, abandonment or replacement of software. As such, we have developed a modularised ONT/Illumina pipeline that can be readily adapted and re-evaluated alongside the changing nanopore landscape.
An example of MicroPIPE’s adaptability came from the release of Guppy v3.6.1 during preparation of this manuscript. As this version reported a substantial increase in basecalling accuracy, we incorporated it into MicroPIPE (v0.9) and re-evaluated our pipeline’s performance for all ST131 genomes.
Using MicroPIPE v0.9 on our EC958 data, we were able to resolve 21 out of the 23 SNPs and 32 out of 45 indels compared to the MicroPIPE v0.8 assembly (Guppy v3.4.3) (Supplementary Dataset1, Supplementary Figure3, Supplementary Table7), relative to the published reference genome. Two SNPs and 12 indels were additionally detected using v3.6.1, which were not detected using v3.4.3. Both SNPs were detected in IS elements, while 11 out of the 12 indels were detected in rRNA genes. Overall, the v3.6.1 assembly performed better than the v3.4.3 assembly with only 29 differences compared to the complete reference EC958 genome (4 SNPs and 25 indels). Interestingly, using methylation-aware basecalling with Guppy v3.6.1 was not found to improve overall assembly accuracy (Supplementary Table7).
We also found that by re-basecalling all other remaining ST131 isolates with MicroPIPE v0.9 and recreating assemblies as before, we were able to achieve a remarkable increase in the accuracy of Nanopore-only assemblies, such that all assemblies clustered in their expected position within the tree (Fig.5B).
