Published November 11, 2025 | Version v1
Dataset Open

Eulertigs for benchmarking kmer dictionaries

Authors/Creators

  • 1. Ca' Foscari University of Venice

Description

Introduction

The datasets in this collection are meant to benchmark kmer dictionaries (i.e., data structures that represent a set of kmers and support, at least, exact membership queries), like SSHash [1], SBWT [2], and FMSI [3].

There are two types of datasets in this collection:

  • Those with extension eulertigs.fa.gz contain the actual kmers to be indexed in FASTA format, represented as substrings of longer eulertigs [4]. Eulertigs were computed using the GGCAT algorithm [5] (commit 14b2853731787495d0874c7ec7b6ca3ee97cd3a4), for both k=31 and k=63.
  • Those with extension fastq.gz contain reads in FASTQ format that can be used to query the dictionaries.

All strings in this collection are over the DNA alphabet, which consists of the four symbols {A, C, G, T}.
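To make the relationship between eulertigs, kmers, and queries concrete, here is a minimal Python sketch. It is not part of the benchmark tooling and none of the function names come from SSHash, SBWT, or FMSI: it reads FASTA records, enumerates the kmers of each sequence as its length-k substrings, stores them in a plain Python set (a toy stand-in for a kmer dictionary, feasible only for small inputs), and reports the fraction of a read's kmers that are found. As a simplifying assumption, it matches kmers exactly as written and does not treat a kmer and its reverse complement as equivalent, which real dictionaries may do.

```python
import gzip
from typing import Iterable, Iterator, Set

DNA = set("ACGT")  # the alphabet used by every string in this collection


def read_fasta_sequences(path: str) -> Iterator[str]:
    """Yield sequences from a (possibly gzipped) FASTA file, joining wrapped lines."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        chunks = []
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if chunks:
                    yield "".join(chunks)
                chunks = []
            elif line:
                chunks.append(line)
        if chunks:
            yield "".join(chunks)


def kmers(sequence: str, k: int) -> Iterator[str]:
    """Yield every length-k substring of the sequence, left to right."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build_reference_set(sequences: Iterable[str], k: int) -> Set[str]:
    """Toy dictionary: a Python set holding all kmers of the given sequences."""
    reference = set()
    for sequence in sequences:
        if not set(sequence) <= DNA:
            raise ValueError("sequence contains symbols outside {A, C, G, T}")
        reference.update(kmers(sequence, k))
    return reference


def fraction_found(read: str, reference: Set[str], k: int) -> float:
    """Fraction of the read's kmers that are members of the reference set."""
    queried = list(kmers(read, k))
    if not queried:
        return 0.0
    return sum(kmer in reference for kmer in queried) / len(queried)
```

Such a set can serve as a ground truth when checking the answers of a real dictionary on a small sample, for example via `build_reference_set(read_fasta_sequences(path), 31)` on a small FASTA file.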

Whole genomes: cod, kestrel, human

The datasets headed "cod", "kestrel", and "human" were obtained by processing the whole genomes of Gadus morhua, Falco tinnunculus, and Homo sapiens, respectively, with GGCAT as follows.

wget http://ftp.ensembl.org/pub/current_fasta/gadus_morhua/dna/Gadus_morhua.gadMor3.0.dna.toplevel.fa.gz -O Gadus_morhua.gadMor3.0.dna.toplevel.fa.gz
wget http://ftp.ensembl.org/pub/current_fasta/falco_tinnunculus/dna/Falco_tinnunculus.FalTin1.0.dna.toplevel.fa.gz -O Falco_tinnunculus.FalTin1.0.dna.toplevel.fa.gz
wget http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -O Homo_sapiens.GRCh38.dna.toplevel.fa.gz

ggcat build -k 31 -j 64 ~/Gadus_morhua.gadMor3.0.dna.toplevel.fa.gz -s 1 --eulertigs -o cod.k31.eulertigs.fa
ggcat build -k 63 -j 64 ~/Gadus_morhua.gadMor3.0.dna.toplevel.fa.gz -s 1 --eulertigs -o cod.k63.eulertigs.fa

ggcat build -k 31 -j 64 ~/Falco_tinnunculus.FalTin1.0.dna.toplevel.fa.gz -s 1 --eulertigs -o kestrel.k31.eulertigs.fa
ggcat build -k 63 -j 64 ~/Falco_tinnunculus.FalTin1.0.dna.toplevel.fa.gz -s 1 --eulertigs -o kestrel.k63.eulertigs.fa

ggcat build -k 31 -j 64 ~/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -s 1 --eulertigs -o human.k31.eulertigs.fa
ggcat build -k 63 -j 64 ~/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -s 1 --eulertigs -o human.k63.eulertigs.fa

 

Number of distinct kmers

Collection   Num. distinct 31-mers   Num. distinct 63-mers
Cod                    502,465,200             556,585,658
Kestrel              1,150,399,205           1,155,250,667
Human                2,505,678,680           2,771,316,093
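The counts in the table can in principle be cross-checked from the eulertig files themselves. The sketch below rests on an assumption drawn from the eulertig definition in [4], namely that every kmer of the set occurs exactly once across all eulertigs: under that assumption, a sequence of length n contributes n - k + 1 kmers, and summing this quantity over all FASTA records gives the number of distinct kmers. The function name is hypothetical, and we have not verified the table values with this code.

```python
import gzip


def count_kmers_in_fasta(path: str, k: int) -> int:
    """Sum (length - k + 1) over all FASTA records; records shorter than k add 0."""
    opener = gzip.open if path.endswith(".gz") else open
    total = 0
    length = 0  # length of the record currently being read
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record (if any).
                total += max(0, length - k + 1)
                length = 0
            else:
                length += len(line)
    # Close the last record of the file.
    total += max(0, length - k + 1)
    return total
```

For example, the result of calling this function with k=31 on the k=31 cod eulertig file could be compared against the first entry of the table.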

 

Pangenomes: NCBI-virus, SE, HPRC

The datasets headed "ncbi-virus", "se", and "hprc" are pangenomes.

NCBI-virus

This is a collection of 18,836 virus assemblies downloaded from https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Genome&SourceDB_s=RefSeq in November 2025. After downloading, and assuming the file is named ncbi-virus.fasta.gz, the collection can be processed as follows.

ggcat build -k 31 -j 64 ncbi-virus.fasta.gz -s 1 --eulertigs -o ncbi-virus.k31.eulertigs.fa
ggcat build -k 63 -j 64 ncbi-virus.fasta.gz -s 1 --eulertigs -o ncbi-virus.k63.eulertigs.fa

SE

This is a pangenome containing all the 534,751 Salmonella enterica genomes from the "All The Bacteria" collection [6] (v0.2).

ggcat build -k 31 -j 64 -l salmonella_enterica-all.txt -s 1 --eulertigs -o se.k31.eulertigs.fa
ggcat build -k 63 -j 64 -l salmonella_enterica-all.txt -s 1 --eulertigs -o se.k63.eulertigs.fa

HPRC

This is a human pangenome. We downloaded the Linux binary of AGC [7] (version 3.2.1) from https://github.com/refresh-bio/agc/releases and the human472.agc file from https://zenodo.org/records/14854401.

We then ran:

agc getcol -o output_folder human472.agc

to extract all the individual files to be processed by GGCAT.

ggcat build -k 31 -j 64 -l hprc_filenames.txt -s 1 --eulertigs -o hprc.k31.eulertigs.fa
ggcat build -k 63 -j 64 -l hprc_filenames.txt -s 1 --eulertigs -o hprc.k63.eulertigs.fa
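The commands above read their inputs from a list file passed with -l (hprc_filenames.txt here, salmonella_enterica-all.txt for SE). How these list files were produced is not described, so the following Python sketch is only one possible way to create one. It assumes the list file holds one input path per line and that the extracted files carry one of the FASTA-like suffixes given in the (hypothetical) default argument; adjust both to the actual contents of the output folder.

```python
from pathlib import Path


def write_input_list(folder: str, list_path: str,
                     suffixes=(".fa", ".fasta", ".fa.gz", ".fasta.gz")) -> int:
    """Write the sorted absolute paths of matching files in folder, one per line.

    Returns the number of paths written. The suffix filter is an assumption
    about how the extracted files are named.
    """
    paths = sorted(
        p.resolve() for p in Path(folder).iterdir()
        if p.is_file() and p.name.endswith(suffixes)
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return len(paths)
```

A call such as `write_input_list("output_folder", "hprc_filenames.txt")` would then produce a list to pass to GGCAT.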

 

Number of distinct kmers

Collection Num. distinct 31-mers Num. distinct 63-mers
NCBI-virus    376,205,185     412,515,880
SE   894,310,084  1,524,904,156
HPRC 3,718,120,949 5,926,785,469

 

References

  1. Pibiri, Giulio Ermanno. "Sparse and skew hashing of k-mers." Bioinformatics 38.Supplement_1 (2022): i185-i194.
  2. Alanko, Jarno N., Simon J. Puglisi, and Jaakko Vuohtoniemi. "Small searchable k-spectra via subset rank queries on the spectral Burrows-Wheeler transform." SIAM Conference on Applied and Computational Discrete Algorithms (ACDA), 2023.
  3. Ondřej Sladký, Pavel Veselý, Karel Břinda. "FroM Superstring to Indexing: a space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT)", Bioinformatics Advances, vbaf290, 2025.
  4. Schmidt, Sebastian, and Jarno N. Alanko. "Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time." Algorithms for Molecular Biology 18.1 (2023): 5.
  5. Cracco, Andrea, and Alexandru I. Tomescu. "Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT." Genome Research 33.7 (2023): 1198-1207.
  6. Hunt et al. "AllTheBacteria – all bacterial genomes assembled, available, and searchable", BioRxiv, 2025. https://www.biorxiv.org/content/10.1101/2024.03.08.584059v7

  7. S. Deorowicz, A. Danek, H. Li. "AGC: Compact representation of assembled genomes with fast queries and updates." Bioinformatics (2023).

 

Files (12.0 GB)

MD5 checksum                           Size
md5:5eab2fa367f9586413e657b738c1aa9f   176.9 MB
md5:064d1431337657a2feadc68e94cca1d2   179.1 MB
md5:60001b1dee7d1bbe5483f11d1eb0fab9   1.9 GB
md5:7ed68f768004077bc3a6d9f5884f3bde   3.1 GB
md5:7ad952abed23c8b06151ca96816e401c   883.3 MB
md5:c5dd7ab16a0897a79a3528515d29a12a   853.6 MB
md5:caa4a66edd63f1f14819adbc52ac7376   341.9 MB
md5:204ba3a5083f2e77e4012ac36c2882af   336.9 MB
md5:3b07e83a5d3edb03300f85e2282dc3fa   6.1 MB
md5:16f4c8040f8ca9c187c4d31987b8f245   135.9 MB
md5:238d1e780ea1d07ba90fcbb3f2f714ec   140.5 MB
md5:007c4c68c1359f97c61211b9bcde584c   510.4 MB
md5:a52965361f691c8f1775a58fe1c24b45   860.0 MB
md5:1095a95986a13f2fde4b0752fe049edd   478.3 MB
md5:9a3e6c899ff8f238fb9c75af91ad63f5   53.9 MB
md5:ae7372762783d7f0d8e1d2c31fe1403c   439.5 MB
md5:f013eb2a4c6ce68e4192c0fb9e1dc6e4   1.7 GB
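After downloading a file, its integrity can be checked against the MD5 value listed for it. The sketch below uses only Python's standard hashlib module and reads the file in chunks so that multi-gigabyte archives do not have to fit in memory. The function names are illustrative, and which checksum belongs to which file has to be taken from the record's file list.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path: str, listed: str) -> bool:
    """Compare a file against a value written as in the list, e.g. 'md5:5eab...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again before benchmarking.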