Eulertigs for benchmarking kmer dictionaries

Pibiri, Giulio Ermanno

doi:10.5281/zenodo.17582116

Published November 11, 2025 | Version v1

Dataset Open

Pibiri, Giulio Ermanno¹

1. Ca' Foscari University of Venice

Introduction

The datasets in this collection are meant to benchmark kmer dictionaries (i.e., data structures that represent a set of kmers and support, at least, exact membership queries), like SSHash [1], SBWT [2], and FMSI [3].

There are two types of datasets in this collection:

Those with extension eulertigs.fa.gz contain the actual kmers to be indexed in FASTA format, represented as substrings of longer eulertigs [4]. Eulertigs were computed using the GGCAT algorithm [5] (commit 14b2853731787495d0874c7ec7b6ca3ee97cd3a4), for both k=31 and k=63.
Those with extension fastq.gz contain reads in FASTQ format that can be used to query the dictionaries.

All strings in this collection are over the DNA alphabet consisting of the four symbols {A, C, G, T}.
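
As a quick sanity check of the alphabet property, the following sketch counts the sequence characters of a FASTA file that fall outside {A, C, G, T}. It is only an illustration: the function name count_non_acgt is hypothetical, and the code assumes a gzip-compressed FASTA file whose header lines start with '>'. Lowercase letters are counted as outside the alphabet.

```shell
# count_non_acgt FILE
# Prints the number of sequence characters in a gzip-compressed FASTA file
# that are not one of A, C, G, T (header lines starting with '>' are skipped).
# Hypothetical helper, not part of any tool mentioned in this record.
count_non_acgt() {
    gzip -dc "$1" | awk '
        /^>/ { next }                        # skip FASTA header lines
        {
            sub(/\r$/, "")                   # tolerate Windows line endings
            bad += gsub(/[^ACGT]/, "")       # gsub returns the number of matches
        }
        END { printf "%.0f\n", bad }         # %.0f avoids 32-bit overflow in some awks
    '
}
```

For example, count_non_acgt cod.k31.eulertigs.fa.gz is expected to print 0 if the file contains only the four symbols above.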

Whole genomes: cod, kestrel, human

The datasets headed "cod", "kestrel", and "human" were obtained by processing with GGCAT the whole genomes of Gadus morhua, Falco tinnunculus, and Homo sapiens respectively, as follows.

wget http://ftp.ensembl.org/pub/current_fasta/gadus_morhua/dna/Gadus_morhua.gadMor3.0.dna.toplevel.fa.gz -O Gadus_morhua.gadMor3.0.dna.toplevel.fa.gz
wget http://ftp.ensembl.org/pub/current_fasta/falco_tinnunculus/dna/Falco_tinnunculus.FalTin1.0.dna.toplevel.fa.gz -O Falco_tinnunculus.FalTin1.0.dna.toplevel.fa.gz
wget http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -O Homo_sapiens.GRCh38.dna.toplevel.fa.gz

ggcat build -k 31 -j 64 ~/Gadus_morhua.gadMor3.0.dna.toplevel.fa.gz -s 1 --eulertigs -o cod.k31.eulertigs.fa
ggcat build -k 63 -j 64 ~/Gadus_morhua.gadMor3.0.dna.toplevel.fa.gz -s 1 --eulertigs -o cod.k63.eulertigs.fa

ggcat build -k 31 -j 64 ~/Falco_tinnunculus.FalTin1.0.dna.toplevel.fa.gz -s 1 --eulertigs -o kestrel.k31.eulertigs.fa
ggcat build -k 63 -j 64 ~/Falco_tinnunculus.FalTin1.0.dna.toplevel.fa.gz -s 1 --eulertigs -o kestrel.k63.eulertigs.fa

ggcat build -k 31 -j 64 ~/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -s 1 --eulertigs -o human.k31.eulertigs.fa
ggcat build -k 63 -j 64 ~/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -s 1 --eulertigs -o human.k63.eulertigs.fa

Number of distinct kmers

Collection	Num. distinct 31-mers	Num. distinct 63-mers
Cod	502,465,200	556,585,658
Kestrel	1,150,399,205	1,155,250,667
Human	2,505,678,680	2,771,316,093
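
Since eulertigs are designed so that each kmer (or its reverse complement) appears exactly once [4], the counts above can in principle be cross-checked by summing (length - k + 1) over all sequences of an eulertig file. The following is a minimal sketch under that assumption; the function name count_kmers is hypothetical, and the code makes no attempt to detect repeated kmers.

```shell
# count_kmers FILE K
# Prints the sum, over all records of a gzip-compressed FASTA file, of
# max(0, length - K + 1), i.e., the number of kmer positions in the file.
# Sequences split across several lines are handled.
# Hypothetical helper, not part of any tool mentioned in this record.
count_kmers() {
    gzip -dc "$1" | awk -v k="$2" '
        function end_record() {
            if (len >= k) total += len - k + 1
            len = 0
        }
        /^>/ { end_record(); next }          # a header closes the previous record
        { sub(/\r$/, ""); len += length($0) }
        END { end_record(); printf "%.0f\n", total }
    '
}
```

For example, count_kmers cod.k31.eulertigs.fa.gz 31 would be compared against the number of distinct 31-mers reported for Cod.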

Pangenomes: NCBI-virus, SE, HPRC

The datasets headed "ncbi-virus", "se", and "hprc" are pangenomes.

NCBI-virus

This is a collection of 18,836 virus assemblies downloaded from https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Genome&SourceDB_s=RefSeq in November 2025. After downloading, and assuming the file is named ncbi-virus.fasta.gz, the collection can be processed as follows.

ggcat build -k 31 -j 64 ncbi-virus.fasta.gz -s 1 --eulertigs -o ncbi-virus.k31.eulertigs.fa
ggcat build -k 63 -j 64 ncbi-virus.fasta.gz -s 1 --eulertigs -o ncbi-virus.k63.eulertigs.fa

SE

This is a pangenome containing all the 534,751 Salmonella enterica genomes from the "All The Bacteria" collection [6] (v0.2).

ggcat build -k 31 -j 64 -l salmonella_enterica-all.txt -s 1 --eulertigs -o se.k31.eulertigs.fa
ggcat build -k 63 -j 64 -l salmonella_enterica-all.txt -s 1 --eulertigs -o se.k63.eulertigs.fa

HPRC

This is a human pangenome. We downloaded the Linux 3.2.1 binary of AGC [7] from https://github.com/refresh-bio/agc/releases and the human472.agc file from https://zenodo.org/records/14854401.

Then, we ran:

agc getcol -o output_folder human472.agc

to extract all the individual files to be processed by GGCAT.

ggcat build -k 31 -j 64 -l hprc_filenames.txt -s 1 --eulertigs -o hprc.k31.eulertigs.fa
ggcat build -k 63 -j 64 -l hprc_filenames.txt -s 1 --eulertigs -o hprc.k63.eulertigs.fa
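
The commands above read their inputs from hprc_filenames.txt, whose creation is not shown. One possible way to produce such a list is sketched below. It assumes that the file passed with -l contains one input path per line and that every regular file under the extraction folder is an input; both are assumptions, and the function name make_file_list is hypothetical.

```shell
# make_file_list DIR
# Prints the path of every regular file under DIR, one per line, in a
# stable (byte-wise sorted) order.
# Hypothetical helper, not part of GGCAT or AGC.
make_file_list() {
    find "$1" -type f | LC_ALL=C sort
}

# Possible usage (not executed here):
#   make_file_list output_folder > hprc_filenames.txt
```

Sorting is not required for correctness; it only makes the list reproducible across runs.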

Number of distinct kmers

Collection	Num. distinct 31-mers	Num. distinct 63-mers
NCBI-virus	376,205,185	412,515,880
SE	894,310,084	1,524,904,156
HPRC	3,718,120,949	5,926,785,469

References

[1] Pibiri, Giulio Ermanno. "Sparse and skew hashing of k-mers." Bioinformatics 38.Supplement_1 (2022): i185-i194.
[2] Alanko, Jarno N., Simon J. Puglisi, and Jaakko Vuohtoniemi. "Small searchable k-spectra via subset rank queries on the spectral Burrows-Wheeler transform." SIAM Conference on Applied and Computational Discrete Algorithms (ACDA), 2023.
[3] Ondřej Sladký, Pavel Veselý, Karel Břinda. "FroM Superstring to Indexing: a space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT)", Bioinformatics Advances, vbaf290, 2025.
[4] Schmidt, Sebastian, and Jarno N. Alanko. "Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time." Algorithms for Molecular Biology 18.1 (2023): 5.
[5] Cracco, Andrea, and Alexandru I. Tomescu. "Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT." Genome Research 33.7 (2023): 1198-1207.
[6] Hunt et al. "AllTheBacteria – all bacterial genomes assembled, available, and searchable", bioRxiv, 2025. https://www.biorxiv.org/content/10.1101/2024.03.08.584059v7
[7] S. Deorowicz, A. Danek, H. Li. "AGC: Compact representation of assembled genomes with fast queries and updates." Bioinformatics (2023).

Files

Files (12.0 GB)

Name	MD5	Size
cod.k31.eulertigs.fa.gz	5eab2fa367f9586413e657b738c1aa9f	176.9 MB
cod.k63.eulertigs.fa.gz	064d1431337657a2feadc68e94cca1d2	179.1 MB
hprc.k31.eulertigs.fa.gz	60001b1dee7d1bbe5483f11d1eb0fab9	1.9 GB
hprc.k63.eulertigs.fa.gz	7ed68f768004077bc3a6d9f5884f3bde	3.1 GB
human.k31.eulertigs.fa.gz	7ad952abed23c8b06151ca96816e401c	883.3 MB
human.k63.eulertigs.fa.gz	c5dd7ab16a0897a79a3528515d29a12a	853.6 MB
kestrel.k31.eulertigs.fa.gz	caa4a66edd63f1f14819adbc52ac7376	341.9 MB
kestrel.k63.eulertigs.fa.gz	204ba3a5083f2e77e4012ac36c2882af	336.9 MB
ncbi-queries.fastq.gz	3b07e83a5d3edb03300f85e2282dc3fa	6.1 MB
ncbi-virus.k31.eulertigs.fa.gz	16f4c8040f8ca9c187c4d31987b8f245	135.9 MB
ncbi-virus.k63.eulertigs.fa.gz	238d1e780ea1d07ba90fcbb3f2f714ec	140.5 MB
se.k31.eulertigs.fa.gz	007c4c68c1359f97c61211b9bcde584c	510.4 MB
se.k63.eulertigs.fa.gz	a52965361f691c8f1775a58fe1c24b45	860.0 MB
SRR11449743_1.fastq.gz	1095a95986a13f2fde4b0752fe049edd	478.3 MB
SRR12858649.fastq.gz	9a3e6c899ff8f238fb9c75af91ad63f5	53.9 MB
SRR27871075_1.fastq.gz	ae7372762783d7f0d8e1d2c31fe1403c	439.5 MB
SRR5833294.fastq.gz	f013eb2a4c6ce68e4192c0fb9e1dc6e4	1.7 GB
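
The MD5 digests listed above can be used to check that a file was downloaded intact. A minimal sketch follows, assuming the md5sum utility from GNU coreutils is available (on other systems the equivalent tool may have a different name); the function name verify_md5 is hypothetical.

```shell
# verify_md5 FILE EXPECTED_HEX
# Compares the MD5 digest of FILE with EXPECTED_HEX.
# Prints "OK: FILE" and returns 0 on a match; otherwise prints a message
# to standard error and returns 1.
# Hypothetical helper; assumes md5sum (GNU coreutils) is installed.
verify_md5() {
    # Reading from standard input keeps the file name out of md5sum's output.
    vm_actual=$(md5sum < "$1" | awk '{ print $1 }')
    if [ "$vm_actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 (expected $2, got $vm_actual)" >&2
        return 1
    fi
}

# Possible usage (not executed here):
#   verify_md5 cod.k31.eulertigs.fa.gz 5eab2fa367f9586413e657b738c1aa9f
```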
