Published March 3, 2020 | Version 1.0
Preprint Open

Additional data for preprint: 'Analysis procedures for assessing recovery of high quality, complete, closed genomes from Nanopore long read metagenome sequencing'

Description

New long read sequencing technologies offer huge potential for effective recovery of complete, closed genomes from complex microbial communities. Using long read (MinION) data obtained from an ensemble of activated sludge enrichment bioreactors, we 1) describe new methods for validating long read assembled genomes using their counterpart short read metagenome-assembled genomes; 2) assess the influence of different correction procedures on genome quality and predicted gene quality; and 3) contribute 21 new closed or complete genomes of community members, including several species known to play key functional roles in wastewater bioprocesses, among them microbes known to exhibit the polyphosphate- and glycogen-accumulating organism phenotypes, and filamentous bacteria associated with the formation and stability of activated sludge flocs. Our findings further establish the feasibility of long read metagenome-assembled genome recovery, and demonstrate the utility of parallel sampling of moderately complex enrichment communities for recovery of genomes of key functional species from activated sludge bioprocesses. This submission provides additional data and intermediate results not provided in the raw data submission to the NCBI Sequence Read Archive (SRA).

The provided data comprise 387 directories and 3723 files, organized as follows (see the tree file for a more detailed breakdown):

1. LRAC.tar.gz: the Canu-assembled sequences in FASTA format for each of the four datasets (PAO*). Includes the Canu command specification in files named canu.e*

2. SRAC.tar.gz: the short read assemblies (FASTA) and binning results (contig membership of each provided within an .RData file) for each of the four datasets (PAO*).

3. srac2lrac.tar.gz: summary results and data for the concordance statistic analysis for each of the four datasets (PAO*). The BLASTN analysis was run as specified in the blastn.sh file, and the tabular output is provided in srac_spades2corrlrac_canu.out. The output from the concordance statistic analysis is provided in the .RData files named blastn3-6_canu.RData and blastn3-6_canu_fil.RData. See https://github.com/rbhwilliams/srac2lrac for a fully worked example.

4. LRGenomes.tar.gz: FASTA sequence, Prokka and CheckM annotations for each of the 21 recovered genomes from each correction procedure (uncorrected, MEGAN-FS, Medaka, Racon and multiple [4 rounds of Racon and 1 of Medaka]).

5. LRGenomesCoverage.tar.gz: for the MEGAN-FS versions of the 21 genomes, the per-base coverage data for both long read (LR) and short read (SR) data.

6. LRAccumulibacterGenomes.tar.gz: two refined genomes (.fasta) from Candidatus Accumulibacter and detailed notes on procedures applied (.txt).

7. tree: a text file listing the contents of the above archives in a tree-like format.
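
As a rough illustration of how the FASTA sequences inside one of the archives above might be inspected without unpacking it to disk, the following Python sketch lists sequence lengths per archive member. It is a minimal sketch under stated assumptions, not part of the submission: the function names are hypothetical, and the file suffixes used to recognize FASTA members are assumed (the tree file gives the actual layout).

```python
import io
import tarfile

# Assumed FASTA file suffixes; check the tree file for the names actually used.
FASTA_SUFFIXES = (".fasta", ".fa", ".fna")


def parse_fasta(handle):
    """Yield (header, sequence) pairs from a text handle in FASTA format."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_lengths_in_archive(archive_path, suffixes=FASTA_SUFFIXES):
    """Return {member_name: [(header, length), ...]} for FASTA members of a .tar.gz."""
    summary = {}
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(suffixes)):
                continue
            raw = tar.extractfile(member)
            text = io.TextIOWrapper(raw, encoding="ascii", errors="replace")
            summary[member.name] = [(h, len(s)) for h, s in parse_fasta(text)]
    return summary
```

For example, `fasta_lengths_in_archive("LRAC.tar.gz")` would return one entry per FASTA member found, each with the header and length of every sequence it contains. Whole sequences are held in memory one at a time, which should be acceptable for assemblies of this size but is a design shortcut rather than a requirement.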


Files

Files (4.2 GB)

MD5 checksum                            Size
md5:0dae85ce6c4a0854470fa3852a1bdd7c    281.8 MB
md5:9b19dab43ed67bcb3dfc1bbf78d96e94    3.0 MB
md5:ff2738386ccccc94e1a9d6d0c41f5130    2.0 GB
md5:80047cbe844a3391d18d2c6c4a28e362    540.3 MB
md5:4b0a6c57f6795ec0268e480d799ff3ee    1.2 GB
md5:0ef3951ad98965920f35deee6d361359    182.4 MB
md5:f00816f155ec7aaea23dace2068f4dc4    257.8 kB
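
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following Python sketch shows one way to do this with the standard library; the helper names are hypothetical, and which checksum belongs to which file must be taken from the record itself, since the pairing is not shown in this listing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so multi-gigabyte archives do not fill memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a listed value such as 'md5:0123...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listed_md5(local_path, "md5:...")` returns `True` only when the digest of the local file equals the listed value; a `False` result suggests the file should be downloaded again.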

Additional details

Related works

Is derived from
Preprint: 10.1101/2020.03.12.974238 (DOI)