Eulertigs for benchmarking kmer dictionaries
Description
Introduction
The datasets in this collection are meant to benchmark kmer dictionaries (i.e., data structures that represent a set of kmers and support, at least, exact membership queries), like SSHash [1], SBWT [2], and FMSI [3].
There are two types of datasets in this collection:
- Those with extension eulertigs.fa.gz contain the actual kmers to be indexed in FASTA format, represented as substrings of longer eulertigs [4]. Eulertigs were computed using the GGCAT algorithm [5] (commit 14b2853731787495d0874c7ec7b6ca3ee97cd3a4), for both k=31 and k=63.
- Those with extension fastq.gz contain reads in FASTQ format that can be used to query the dictionaries.
All strings in this collection are over the DNA alphabet consisting of the four symbols {A, C, G, T}.
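As a rough illustration of how the two file types fit together, the following Python sketch enumerates the kmers of a gzipped eulertigs file and checks the kmers of each read against them. The plain Python set only stands in for a real dictionary such as SSHash, all helper names are hypothetical, and treating a kmer and its reverse complement as the same element is an assumption rather than something stated above.

```python
import gzip

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a kmer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)


def read_fasta(path):
    """Yield each sequence of a gzipped FASTA file (multi-line records allowed)."""
    chunks = []
    with gzip.open(path, "rt") as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                if chunks:
                    yield "".join(chunks)
                chunks = []
            elif line:
                chunks.append(line)
    if chunks:
        yield "".join(chunks)


def read_fastq(path):
    """Yield the sequence line of each 4-line record in a gzipped FASTQ file."""
    with gzip.open(path, "rt") as f:
        for i, line in enumerate(f):
            if i % 4 == 1:
                yield line.strip()


def kmers(seq, k):
    """Yield every length-k substring of seq, in canonical form."""
    for i in range(len(seq) - k + 1):
        yield canonical(seq[i:i + k])


def build_toy_dictionary(eulertigs_path, k):
    """Collect all kmers of the eulertigs in a set (only viable for small inputs)."""
    return {km for seq in read_fasta(eulertigs_path) for km in kmers(seq, k)}


def count_hits(dictionary, reads_path, k):
    """Return (positive, total) kmer membership queries over all reads."""
    positive = total = 0
    for read in read_fastq(reads_path):
        for km in kmers(read, k):
            total += 1
            positive += km in dictionary
    return positive, total
```

A set of Python strings is far too large for the datasets in this collection; the sketch is only meant to make the indexing and query workload concrete.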
Whole genomes: cod, kestrel, human
The datasets headed "cod", "kestrel", and "human" were obtained by processing with GGCAT the whole genomes of Gadus morhua, Falco tinnunculus, and Homo sapiens respectively, as follows.
wget http://ftp.ensembl.org/pub/current_fasta/gadus_morhua/dna/Gadus_morhua.gadMor3.0.dna.toplevel.fa.gz -O Gadus_morhua.gadMor3.0.dna.toplevel.fa.gz
wget http://ftp.ensembl.org/pub/current_fasta/falco_tinnunculus/dna/Falco_tinnunculus.FalTin1.0.dna.toplevel.fa.gz -O Falco_tinnunculus.FalTin1.0.dna.toplevel.fa.gz
wget http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -O Homo_sapiens.GRCh38.dna.toplevel.fa.gz
ggcat build -k 31 -j 64 ~/Gadus_morhua.gadMor3.0.dna.toplevel.fa.gz -s 1 --eulertigs -o cod.k31.eulertigs.fa
ggcat build -k 63 -j 64 ~/Gadus_morhua.gadMor3.0.dna.toplevel.fa.gz -s 1 --eulertigs -o cod.k63.eulertigs.fa
ggcat build -k 31 -j 64 ~/Falco_tinnunculus.FalTin1.0.dna.toplevel.fa.gz -s 1 --eulertigs -o kestrel.k31.eulertigs.fa
ggcat build -k 63 -j 64 ~/Falco_tinnunculus.FalTin1.0.dna.toplevel.fa.gz -s 1 --eulertigs -o kestrel.k63.eulertigs.fa
ggcat build -k 31 -j 64 ~/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -s 1 --eulertigs -o human.k31.eulertigs.fa
ggcat build -k 63 -j 64 ~/Homo_sapiens.GRCh38.dna.toplevel.fa.gz -s 1 --eulertigs -o human.k63.eulertigs.fa
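Since the GGCAT commands above differ only in the input file, the value of k, and the output name, they can also be generated programmatically. The sketch below only assembles and prints the command lines; the helper name and the dataset-to-file mapping are illustrative, while the flags are copied verbatim from the commands above.

```python
# Dataset name -> input file, as used in the commands above.
GENOMES = {
    "cod": "~/Gadus_morhua.gadMor3.0.dna.toplevel.fa.gz",
    "kestrel": "~/Falco_tinnunculus.FalTin1.0.dna.toplevel.fa.gz",
    "human": "~/Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
}


def ggcat_command(name, input_path, k, threads=64):
    """Return the argument list of one GGCAT invocation (hypothetical helper)."""
    return [
        "ggcat", "build",
        "-k", str(k),
        "-j", str(threads),
        input_path,
        "-s", "1",
        "--eulertigs",
        "-o", f"{name}.k{k}.eulertigs.fa",
    ]


if __name__ == "__main__":
    for name, path in GENOMES.items():
        for k in (31, 63):
            # Print rather than execute, so that the shell expands "~" when run.
            print(" ".join(ggcat_command(name, path, k)))
```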
Number of distinct kmers
| Collection | Num. distinct 31-mers | Num. distinct 63-mers |
|---|---|---|
| Cod | 502,465,200 | 556,585,658 |
| Kestrel | 1,150,399,205 | 1,155,250,667 |
| Human | 2,505,678,680 | 2,771,316,093 |
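Because eulertigs represent the kmers of a set without repetitions [4], counts like those in the table can in principle be re-derived from an eulertigs file by summing length - k + 1 over all its sequences. The following is a minimal sketch of that check for a gzipped FASTA file; the function name is hypothetical, and it assumes that every position of every sequence contributes exactly one distinct kmer.

```python
import gzip


def count_kmers(eulertigs_path, k):
    """Sum (length - k + 1) over all sequences of a gzipped FASTA file."""
    total = 0
    length = 0  # length of the record currently being read
    with gzip.open(eulertigs_path, "rt") as f:
        for line in f:
            if line.startswith(">"):
                # A header closes the previous record, if any.
                if length >= k:
                    total += length - k + 1
                length = 0
            else:
                length += len(line.strip())
    # Account for the last record, which has no header after it.
    if length >= k:
        total += length - k + 1
    return total
```

Under that assumption, calling count_kmers on one of the eulertigs.fa.gz files with the matching k should reproduce the corresponding table entry.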
Pangenomes: NCBI-virus, SE, HPRC
The datasets headed "ncbi-virus", "se", and "hprc" are pangenomes.
NCBI-virus
This is a collection of 18,836 virus assemblies downloaded from https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Genome&SourceDB_s=RefSeq in November 2025. After downloading, and assuming the file is named ncbi-virus.fasta.gz, the collection can be processed as follows.
ggcat build -k 31 -j 64 ncbi-virus.fasta.gz -s 1 --eulertigs -o ncbi-virus.k31.eulertigs.fa
ggcat build -k 63 -j 64 ncbi-virus.fasta.gz -s 1 --eulertigs -o ncbi-virus.k63.eulertigs.fa
SE
This is a pangenome containing all the 534,751 Salmonella enterica genomes from the "All The Bacteria" collection [6] (v0.2).
ggcat build -k 31 -j 64 -l salmonella_enterica-all.txt -s 1 --eulertigs -o se.k31.eulertigs.fa
ggcat build -k 63 -j 64 -l salmonella_enterica-all.txt -s 1 --eulertigs -o se.k63.eulertigs.fa
HPRC
This is a human pangenome. We downloaded the Linux binary of AGC [7] (version 3.2.1) from https://github.com/refresh-bio/agc/releases and the human472.agc file from https://zenodo.org/records/14854401.
Then, we ran:
agc getcol -o output_folder human472.agc
to extract all the individual files to be processed by GGCAT.
ggcat build -k 31 -j 64 -l hprc_filenames.txt -s 1 --eulertigs -o hprc.k31.eulertigs.fa
ggcat build -k 63 -j 64 -l hprc_filenames.txt -s 1 --eulertigs -o hprc.k63.eulertigs.fa
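How hprc_filenames.txt was produced is not spelled out here. A minimal sketch, assuming the list passed with -l is a plain text file with one input path per line and that the extracted files sit directly inside output_folder, could look as follows; the helper name is hypothetical.

```python
from pathlib import Path


def write_file_list(folder, list_path):
    """Write the path of every regular file in folder, one per line, sorted."""
    paths = sorted(p for p in Path(folder).iterdir() if p.is_file())
    with open(list_path, "w") as out:
        for p in paths:
            out.write(f"{p}\n")
    return len(paths)


# Example call, assuming the folder name used in the agc command:
# write_file_list("output_folder", "hprc_filenames.txt")
```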
Number of distinct kmers
| Collection | Num. distinct 31-mers | Num. distinct 63-mers |
|---|---|---|
| NCBI-virus | 376,205,185 | 412,515,880 |
| SE | 894,310,084 | 1,524,904,156 |
| HPRC | 3,718,120,949 | 5,926,785,469 |
References
[1] Pibiri, Giulio Ermanno. "Sparse and skew hashing of k-mers." Bioinformatics 38.Supplement_1 (2022): i185-i194.
[2] Alanko, Jarno N., Simon J. Puglisi, and Jaakko Vuohtoniemi. "Small searchable k-spectra via subset rank queries on the spectral Burrows-Wheeler transform." SIAM Conference on Applied and Computational Discrete Algorithms (ACDA), 2023.
[3] Ondřej Sladký, Pavel Veselý, Karel Břinda. "FroM Superstring to Indexing: a space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT)", Bioinformatics Advances, vbaf290, 2025.
[4] Schmidt, Sebastian, and Jarno N. Alanko. "Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time." Algorithms for Molecular Biology 18.1 (2023): 5.
[5] Cracco, Andrea, and Alexandru I. Tomescu. "Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT." Genome Research 33.7 (2023): 1198-1207.
[6] Hunt et al. "AllTheBacteria – all bacterial genomes assembled, available, and searchable", BioRxiv, 2025. https://www.biorxiv.org/content/10.1101/2024.03.08.584059v7
[7] S. Deorowicz, A. Danek, H. Li. "AGC: Compact representation of assembled genomes with fast queries and updates." Bioinformatics (2023).
Files
Total size: 12.0 GB
| MD5 | Size |
|---|---|
| 5eab2fa367f9586413e657b738c1aa9f | 176.9 MB |
| 064d1431337657a2feadc68e94cca1d2 | 179.1 MB |
| 60001b1dee7d1bbe5483f11d1eb0fab9 | 1.9 GB |
| 7ed68f768004077bc3a6d9f5884f3bde | 3.1 GB |
| 7ad952abed23c8b06151ca96816e401c | 883.3 MB |
| c5dd7ab16a0897a79a3528515d29a12a | 853.6 MB |
| caa4a66edd63f1f14819adbc52ac7376 | 341.9 MB |
| 204ba3a5083f2e77e4012ac36c2882af | 336.9 MB |
| 3b07e83a5d3edb03300f85e2282dc3fa | 6.1 MB |
| 16f4c8040f8ca9c187c4d31987b8f245 | 135.9 MB |
| 238d1e780ea1d07ba90fcbb3f2f714ec | 140.5 MB |
| 007c4c68c1359f97c61211b9bcde584c | 510.4 MB |
| a52965361f691c8f1775a58fe1c24b45 | 860.0 MB |
| 1095a95986a13f2fde4b0752fe049edd | 478.3 MB |
| 9a3e6c899ff8f238fb9c75af91ad63f5 | 53.9 MB |
| ae7372762783d7f0d8e1d2c31fe1403c | 439.5 MB |
| f013eb2a4c6ce68e4192c0fb9e1dc6e4 | 1.7 GB |
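The MD5 checksums listed above can be used to verify a download. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the file is read in chunks so that multi-gigabyte files do not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 digest matches the listed checksum."""
    return md5_of_file(path) == expected_md5.lower()
```

A call such as verify(downloaded_file, listed_checksum) returning False suggests a corrupted or incomplete download.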