Published February 4, 2026 | Version v1
Dataset
Open
Datasets used in "Theseus: Fast and Optimal Affine-Gap Sequence-to-Graph Alignment"
Authors/Creators
Description
Datasets used in the paper "Theseus: Fast and Optimal Affine-Gap Sequence-to-Graph Alignment". The data is compressed into the file theseus_datasets.tar.gz, which
includes the datasets for the two experiments in the paper:
MSA datasets:
| File | Size | Source |
|---|---|---|
| mtb_benchmark_50kbp_shortened.fna | 342 sequences, each of approximately 50Kbp | Derived from a set of 342 RefSeq-complete whole-genome assemblies of Mycobacterium tuberculosis genomes. Sequences start at the dnaA gene and are 50Kbp long. |
| mtb_benchmark_250kbp_trmB.fna | 342 sequences, each of approximately 250Kbp | Derived from a set of 342 RefSeq-complete whole-genome assemblies of Mycobacterium tuberculosis genomes. Sequences start at the dnaA gene and are truncated at the trmB gene. |
| mtb_benchmark_500kbp_thiE.fna | 342 sequences, each of approximately 500Kbp | Derived from a set of 342 RefSeq-complete whole-genome assemblies of Mycobacterium tuberculosis genomes. Sequences start at the dnaA gene and are truncated at the thiE gene. |
| mtb_benchmark_1Mbp_gltA2.fna | 342 sequences, each of approximately 1Mbp | Derived from a set of 342 RefSeq-complete whole-genome assemblies of Mycobacterium tuberculosis genomes. Sequences start at the dnaA gene and are truncated at the gltA2 gene. |
| covid_19_complete.fasta | 2732 sequences, each of approximately 30 Kbp in length | 2732 GenBank-complete SARS-CoV-2 genome assemblies. |
| monkeypox_100_seq.fasta | 100 sequences, each of approximately 200 Kbp in length | 100 RefSeq-complete whole-genome assemblies of monkeypox virus. |
Sequence-to-graph/pangenome read mapping datasets:
| File | Size | Source |
|---|---|---|
| SRR062634_1.filt_REDUCED.fasta | 250K sequences of length 100bp | Human Pangenome Reference Consortium |
| 211109_M024_V350038332_L01_HUMuarfR092940-606_1_REDUCED.fasta | 250K sequences of length 150bp | NIST Genome in a Bottle (GIAB) project |
| D1_S1_L001_R1_001_REDUCED.fasta | 250K sequences of length 250bp | NIST Genome in a Bottle (GIAB) project |
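The sequence counts and lengths listed in the tables above can be sanity-checked after extraction. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the example path assumes the file sits in the current directory, which may not match the actual layout inside the archive.

```python
import os
from typing import Dict, Iterable, Iterator, List, Tuple


def iter_fasta_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (record_id, sequence_length) for each record in FASTA text."""
    name = None
    length = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, length  # emit the previous record
            parts = line[1:].split()
            name = parts[0] if parts else ""
            length = 0
        elif name is not None:
            length += len(line)  # sequence may span several lines
    if name is not None:
        yield name, length  # emit the final record


def summarize_lengths(lengths: List[int]) -> Dict[str, float]:
    """Return count, min, max, and mean of a list of sequence lengths."""
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }


def summarize_fasta(path: str) -> Dict[str, float]:
    """Summarize the records of a FASTA file on disk."""
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        return summarize_lengths([n for _, n in iter_fasta_lengths(handle)])


# Assumed location of one extracted file; adjust to the real path.
EXAMPLE_PATH = "mtb_benchmark_50kbp_shortened.fna"

if __name__ == "__main__" and os.path.exists(EXAMPLE_PATH):
    print(summarize_fasta(EXAMPLE_PATH))
```

For the file above, the reported count would be expected to be 342 with lengths near 50 Kbp, per the table.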
This dataset is derived from original data produced by third parties, as detailed above. All rights to the original data remain with the original authors or copyright holders. Users are responsible for ensuring compliance with the licensing terms of the original data sources.
Files
(252.2 MB)
| Name | Size |
|---|---|
| theseus_datasets.tar.gz (md5:f8b88a2233356cdb2def9832cfaa1ca0) | 252.2 MB |
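The MD5 checksum listed for this record can be used to confirm that the downloaded archive is intact before use. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the local path assumes the archive was saved under the name given in the description, in the current directory.

```python
import hashlib
import os
import tarfile
from typing import List

# Checksum as listed on this record.
EXPECTED_MD5 = "f8b88a2233356cdb2def9832cfaa1ca0"
# Assumed local path of the downloaded archive.
ARCHIVE = "theseus_datasets.tar.gz"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_list(archive: str, expected_md5: str) -> List[str]:
    """Check the archive checksum, then return the names of its members."""
    actual = md5_of_file(archive)
    if actual != expected_md5:
        raise ValueError(f"MD5 mismatch: expected {expected_md5}, got {actual}")
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()


if __name__ == "__main__" and os.path.exists(ARCHIVE):
    for member in verify_and_list(ARCHIVE, EXPECTED_MD5):
        print(member)
```

The sketch only lists the archive members; extraction can follow once the checksum matches.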