Published June 10, 2025
| Version v1
Dataset
Open
Genomics NCD-gzip database
Authors/Creators
Description
This is a basic database consisting of 4634 genomes sampled from the RefSeq database using the Woltka pipeline. It is intended for testing and evaluation of metagenomic classification tools.
Contents
- The genome sequences are provided in a compressed archive (genomes.tar.gz).
- When unpacked, the folder structure is organized by NCBI Taxonomy ID (taxid), like so:
genomes/
├── taxid1/
│ ├── genome1_0.fna
│ └── genome1_1.fna
├── taxid2/
│ └── genome2_0.fna
├── taxid3/
│ ├── genome3_0.fna
│ ├── genome3_1.fna
│ └── genome3_2.fna
Each top-level directory corresponds to a taxonomic ID and contains one or more genome FASTA files in .fna format.
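The layout above can be walked with a few lines of standard-library Python. The following is a minimal sketch, not part of the dataset: the helper names `unpack_archive` and `index_genomes` are hypothetical, and it assumes the archive unpacks to a top-level genomes/ folder whose subdirectory names identify the taxa, as the tree suggests.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return dest_dir as a Path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return dest


def index_genomes(genomes_root):
    """Map each top-level directory name to its sorted list of .fna files.

    Assumption: every subdirectory of genomes_root is named after a taxon,
    as in the tree shown above. Non-directory entries are ignored.
    """
    index = {}
    for taxid_dir in sorted(Path(genomes_root).iterdir()):
        if not taxid_dir.is_dir():
            continue
        fasta_files = sorted(taxid_dir.glob("*.fna"))
        if fasta_files:
            index[taxid_dir.name] = fasta_files
    return index
```

A typical call might be `index_genomes(unpack_archive("genomes.tar.gz", "data") / "genomes")`, which returns a dictionary keyed by directory name.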
Additional Files
- fold1_list.txt and fold1_testing_list.txt: Lists of genome TaxIDs used for training and testing, respectively. These are included to support reproducible benchmarking of metagenomic classifiers.
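As an illustration of how the fold lists might be used to select genomes for a benchmark, here is a hedged sketch. The exact format of the list files is not stated here, so the code assumes one TaxID per line and that each TaxID matches a directory name under genomes/; the helper names `read_taxid_list` and `collect_fasta_files` are hypothetical.

```python
from pathlib import Path


def read_taxid_list(list_path):
    """Read identifiers from a text file (assumed: one TaxID per line)."""
    with open(list_path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def collect_fasta_files(genomes_root, taxids):
    """Return (.fna paths for the given TaxIDs, TaxIDs with no files found).

    Assumption: each TaxID is also the name of a directory under genomes_root.
    """
    root = Path(genomes_root)
    files, missing = [], []
    for taxid in sorted(taxids):
        taxid_dir = root / taxid
        found = sorted(taxid_dir.glob("*.fna")) if taxid_dir.is_dir() else []
        if found:
            files.extend(found)
        else:
            missing.append(taxid)
    return files, missing
```

For example, `collect_fasta_files("genomes", read_taxid_list("fold1_list.txt"))` would gather the training genomes, and the same call with fold1_testing_list.txt would gather the testing genomes. Checking the returned `missing` list is a quick way to confirm whether the assumed file format and directory naming actually hold.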