Datasets used in "Theseus: Fast and Optimal Affine-Gap Sequence-to-Graph Alignment"

Jiménez-Blanco, Albert; López-Villellas, Lorién; Moure, Juan Carlos; Moretó Planas, Miquel; Marco-Sola, Santiago

doi:10.5281/zenodo.18482097

Published February 4, 2026 | Version v1

Dataset Open

1. Universitat Politècnica de Catalunya
2. Barcelona Supercomputing Center
3. Universidad de Zaragoza
4. Instituto Universitario de Investigación en Ingeniería de Aragón (I3A)
5. Universidad Autónoma de Barcelona

Datasets used in the paper "Theseus: Fast and Optimal Affine-Gap Sequence-to-Graph Alignment". The data are distributed as a single compressed archive, theseus_datasets.tar.gz, which contains the datasets for the two experiments in the paper:

MSA datasets:

File	Size	Source
mtb_benchmark_50kbp_shortened.fna	342 sequences, each of approximately 50Kbp	Derived from a set of 342 RefSeq-complete whole-genome assemblies of Mycobacterium tuberculosis genomes. Sequences start at the dnaA gene and are 50Kbp long.
mtb_benchmark_250kbp_trmB.fna	342 sequences, each of approximately 250Kbp	Derived from a set of 342 RefSeq-complete whole-genome assemblies of Mycobacterium tuberculosis genomes. Sequences start at the dnaA gene and are truncated at the trmB gene.
mtb_benchmark_500kbp_thiE.fna	342 sequences, each of approximately 500Kbp	Derived from a set of 342 RefSeq-complete whole-genome assemblies of Mycobacterium tuberculosis genomes. Sequences start at the dnaA gene and are truncated at the thiE gene.
mtb_benchmark_1Mbp_gltA2.fna	342 sequences, each of approximately 1Mbp	Derived from a set of 342 RefSeq-complete whole-genome assemblies of Mycobacterium tuberculosis genomes. Sequences start at the dnaA gene and are truncated at the gltA2 gene.
covid_19_complete.fasta	2732 sequences, each approximately 30 Kbp in length	2732 GenBank-complete SARS-CoV-2 genome assemblies.
monkeypox_100_seq.fasta	100 sequences, each approximately 200 Kbp in length	100 RefSeq-complete whole-genome assemblies of monkeypox virus.

Sequence-to-graph/pangenome read mapping datasets:

File	Size	Source
SRR062634_1.filt_REDUCED.fasta	250K sequences of length 100bp	Human Pangenome Reference Consortium
211109_M024_V350038332_L01_HUMuarfR092940-606_1_REDUCED.fasta	250K sequences of length 150bp	NIST Genome in a Bottle (GIAB) project
D1_S1_L001_R1_001_REDUCED.fasta	250K sequences of length 250bp	NIST Genome in a Bottle (GIAB) project
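
After extracting the archive, the sequence counts and lengths listed in the tables above can be checked against the actual files. The following is a minimal sketch, assuming the files are plain (uncompressed) FASTA with `>` header lines; the function names `fasta_lengths` and `summarize` are hypothetical helpers introduced here for illustration and are not part of the Theseus software.

```python
def fasta_lengths(path):
    """Return the length of every record in a plain-text FASTA file."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                # Sequence data may be wrapped over several lines.
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(path):
    """Summarize a FASTA file as record count plus shortest and longest record."""
    lengths = fasta_lengths(path)
    return {
        "sequences": len(lengths),
        "min_len": min(lengths, default=0),
        "max_len": max(lengths, default=0),
    }
```

For example, `summarize("mtb_benchmark_50kbp_shortened.fna")` would be expected to report 342 sequences if the file matches the description above; the path to the file inside the extracted archive is an assumption and may differ.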

This dataset is derived from original data produced by third parties, as detailed above. All rights to the original data remain with the original authors or copyright holders. Users are responsible for ensuring compliance with the licensing terms of the original data sources.

Files (252.2 MB)

Name	MD5	Size
theseus_datasets.tar.gz	f8b88a2233356cdb2def9832cfaa1ca0	252.2 MB
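
Before using the data, it can be worth confirming that the downloaded archive matches the MD5 checksum listed for it. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `verify_and_extract` and the destination directory are assumptions made for illustration, not part of the dataset or of the Theseus software.

```python
import hashlib
import tarfile
from pathlib import Path

# MD5 checksum listed for theseus_datasets.tar.gz in this record.
EXPECTED_MD5 = "f8b88a2233356cdb2def9832cfaa1ca0"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_extract(archive, dest, expected_md5=EXPECTED_MD5):
    """Check the archive checksum, extract it, and list the extracted files."""
    actual = md5_of_file(archive)
    if actual != expected_md5:
        raise ValueError(f"MD5 mismatch: expected {expected_md5}, got {actual}")
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    # Paths are returned relative to the destination directory.
    return sorted(str(p.relative_to(dest)) for p in dest.rglob("*") if p.is_file())
```

A call such as `verify_and_extract("theseus_datasets.tar.gz", "theseus_datasets")` raises `ValueError` if the download is corrupted and otherwise returns the list of extracted files. The internal directory layout of the archive is not described in this record, so the returned paths should be inspected rather than assumed.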

