Published February 4, 2026 | Version v1
Dataset Open

Datasets used in "Theseus: Fast and Optimal Affine-Gap Sequence-to-Graph Alignment"

  • 1. Universitat Politècnica de Catalunya
  • 2. Barcelona Supercomputing Center
  • 3. Universidad de Zaragoza
  • 4. Instituto Universitario de Investigación en Ingeniería de Aragón (I3A)
  • 5. Universidad Autónoma de Barcelona

Description

Datasets used in the paper "Theseus: Fast and Optimal Affine-Gap Sequence-to-Graph Alignment". The data are packaged in a single compressed archive, theseus_datasets.tar.gz, which contains the datasets for the two experiments in the paper:

MSA datasets:

File | Size | Source
mtb_benchmark_50kbp_shortened.fna | 342 sequences, each approximately 50 Kbp | Derived from a set of 342 RefSeq-complete whole-genome assemblies of Mycobacterium tuberculosis. Sequences start at the dnaA gene and are 50 Kbp long.
mtb_benchmark_250kbp_trmB.fna | 342 sequences, each approximately 250 Kbp | Derived from a set of 342 RefSeq-complete whole-genome assemblies of Mycobacterium tuberculosis. Sequences start at the dnaA gene and are truncated at the trmB gene.
mtb_benchmark_500kbp_thiE.fna | 342 sequences, each approximately 500 Kbp | Derived from a set of 342 RefSeq-complete whole-genome assemblies of Mycobacterium tuberculosis. Sequences start at the dnaA gene and are truncated at the thiE gene.
mtb_benchmark_1Mbp_gltA2.fna | 342 sequences, each approximately 1 Mbp | Derived from a set of 342 RefSeq-complete whole-genome assemblies of Mycobacterium tuberculosis. Sequences start at the dnaA gene and are truncated at the gltA2 gene.
covid_19_complete.fasta | 2732 sequences, each approximately 30 Kbp | 2732 GenBank-complete SARS-CoV-2 genome assemblies.
monkeypox_100_seq.fasta | 100 sequences, each approximately 200 Kbp | 100 RefSeq-complete whole-genome assemblies of monkeypox virus.
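
One way to check an extracted MSA file against the sizes listed above is to count its records and measure their lengths. The following is a minimal sketch that uses only the Python standard library. The function names `fasta_lengths` and `summarize` are illustrative and are not part of Theseus, and the path in the usage comment assumes the archive was extracted into the current directory.

```python
from statistics import mean


def fasta_lengths(path):
    """Return the length of every sequence in a FASTA file, in file order."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                # Sequence data may be wrapped over several lines.
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(path):
    """Return (number of sequences, mean sequence length) for a FASTA file."""
    lengths = fasta_lengths(path)
    return len(lengths), (mean(lengths) if lengths else 0)


# Example (hypothetical location; adjust to where the archive was extracted):
# count, avg = summarize("mtb_benchmark_50kbp_shortened.fna")
# print(count, round(avg))
```

The parser accepts multi-line records, so it should work whether or not the sequences in a given file are line-wrapped.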



Sequence-to-graph/pangenome read mapping datasets:

File | Size | Source
SRR062634_1.filt_REDUCED.fasta | 250K sequences of length 100 bp | Human Pangenome Reference Consortium
211109_M024_V350038332_L01_HUMuarfR092940-606_1_REDUCED.fasta | 250K sequences of length 150 bp | NIST Genome in a Bottle (GIAB) project
D1_S1_L001_R1_001_REDUCED.fasta | 250K sequences of length 250 bp | NIST Genome in a Bottle (GIAB) project
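
Each read file above is described as holding reads of a single fixed length, so a length histogram is a simple sanity check after extraction. This is a hedged sketch, not a tool shipped with the datasets: the helper names `read_length_counts` and `has_uniform_length` are hypothetical, and the file path in the usage comment assumes the archive was extracted into the current directory.

```python
from collections import Counter


def read_length_counts(path):
    """Count how many FASTA records have each sequence length."""
    counts = Counter()
    length = None  # None until the first header line is seen
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if length is not None:
                    counts[length] += 1
                length = 0
            elif line and length is not None:
                length += len(line)
    if length is not None:
        counts[length] += 1
    return counts


def has_uniform_length(path, expected):
    """True if the file is non-empty and every read has the expected length."""
    counts = read_length_counts(path)
    return bool(counts) and set(counts) == {expected}


# Example (hypothetical location):
# counts = read_length_counts("SRR062634_1.filt_REDUCED.fasta")
# print(sum(counts.values()), counts.most_common(3))
```

If a file turns out to contain a few shorter reads, the histogram shows how many, which is more informative than a bare pass/fail check.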


This dataset is derived from original data produced by third parties, as detailed above. All rights to the original data remain with the original authors or copyright holders. Users are responsible for ensuring compliance with the licensing terms of the original data sources.

Files

Files (252.2 MB)

Name | Size | Checksum
theseus_datasets.tar.gz | 252.2 MB | md5:f8b88a2233356cdb2def9832cfaa1ca0
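
After downloading, the archive can be checked against the MD5 checksum listed above before it is unpacked. The sketch below uses only the Python standard library (`hashlib` and `tarfile`). The function names are illustrative, and the path in the usage comment assumes the archive was saved as theseus_datasets.tar.gz in the current directory.

```python
import hashlib
import tarfile

# Checksum as listed in this record.
EXPECTED_MD5 = "f8b88a2233356cdb2def9832cfaa1ca0"


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_list(path, expected_md5=EXPECTED_MD5):
    """Check the archive's MD5, then return its member names without extracting."""
    actual = md5_of(path)
    if actual != expected_md5:
        raise ValueError(f"MD5 mismatch: expected {expected_md5}, got {actual}")
    with tarfile.open(path, "r:gz") as archive:
        return archive.getnames()


# Example (hypothetical location of the downloaded file):
# for name in verify_and_list("theseus_datasets.tar.gz"):
#     print(name)
```

Listing the members first, rather than extracting immediately, makes it easy to confirm that the archive contains the files described above.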