Published June 10, 2025 | Version v1
Dataset Open

Genomics NCD-gzip database

Description

This is a basic database of 4634 genomes sampled from the RefSeq database using the Woltka pipeline. It is intended for testing and evaluating metagenomic classification tools.

Contents

  • The genome sequences are provided in a compressed archive (genomes.tar.gz).

  • When unpacked, the folder structure is organized by NCBI Taxonomy ID (taxid), like so:

genomes/
├── taxid1/
│ ├── genome1_0.fna
│ └── genome1_1.fna
├── taxid2/
│ └── genome2_0.fna
├── taxid3/
│ ├── genome3_0.fna
│ ├── genome3_1.fna
│ └── genome3_2.fna

Each top-level directory corresponds to a taxonomic ID and contains one or more genome FASTA files in .fna format.
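
As a minimal sketch of how this layout might be consumed, the Python function below builds a mapping from each taxid directory name to the genome files it contains. It assumes the archive has already been unpacked into a local folder (for example with `tar -xzf genomes.tar.gz`); the function name `index_genomes` is hypothetical and not part of the dataset.

```python
from collections import defaultdict
from pathlib import Path


def index_genomes(root):
    """Map each taxid directory name to the sorted list of .fna files in it.

    `root` is assumed to be the unpacked `genomes/` folder (an assumption
    about where the archive was extracted).
    """
    index = defaultdict(list)
    # Each genome sits exactly one level below the root: <root>/<taxid>/<file>.fna
    for fna_path in sorted(Path(root).glob("*/*.fna")):
        index[fna_path.parent.name].append(fna_path)
    return dict(index)
```

Calling `index_genomes("genomes")` would then give a dictionary keyed by taxid, which makes it straightforward to count genomes per taxon or to iterate over all FASTA files of one taxon.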

Additional Files

  • fold1_list.txt and fold1_testing_list.txt: Lists of genome TaxIDs used for training and testing, respectively. These are included to support reproducible benchmarking of metagenomic classifiers.
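
The sketch below shows one way the two lists might be loaded and sanity-checked before benchmarking. The file format is not documented here, so the code assumes one identifier per line; both helper names (`read_id_list`, `shared_ids`) are hypothetical.

```python
def read_id_list(path):
    """Return the non-empty lines of a list file, stripped, in file order.

    Assumes one identifier per line (an assumption; the format of the
    fold lists is not specified in the description).
    """
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def shared_ids(train_ids, test_ids):
    """Return the sorted identifiers that occur in both lists."""
    return sorted(set(train_ids) & set(test_ids))
```

For example, `shared_ids(read_id_list("fold1_list.txt"), read_id_list("fold1_testing_list.txt"))` returns any identifiers present in both files, which is a quick check for leakage between the training and testing sets.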

 

Files (3.1 GB)

Size       Checksum
27.6 kB    md5:b6a685d31e9da969dd045318bc45c6ee
6.9 kB     md5:2d21602727c0403f8209cd8a80e0586c
34.5 kB    md5:b2ff962741d222418a4619aab68ff734
3.1 GB     md5:57a08afc0752c8fea51d2dfef7004af0
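
The MD5 checksums listed above can be used to verify a download. The following is a minimal sketch using Python's standard `hashlib` module; the helper name `md5_of_file` is hypothetical, and pairing the 3.1 GB entry with genomes.tar.gz is an assumption based on its size.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks.

    Chunked reading keeps memory use constant even for multi-gigabyte archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Under that assumption, a successful download of the archive would satisfy `md5_of_file("genomes.tar.gz") == "57a08afc0752c8fea51d2dfef7004af0"`.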