Published March 28, 2025 | Version v1
Dataset Open

Supporting data for the paper "CREMSA: Compressed Indexing of (Ultra) Large Multiple Sequence Alignments"

Description

Four files used in the paper “CREMSA: Compressed Indexing of (Ultra) Large Multiple Sequence Alignments” are made available here for reproducibility:

  • random_datasets_len10000_num30000.zip : An archive of artificial FASTA files generated as described in the paper.
  • HIV1_ALL_2022_genome_DNA.fasta.xz : A multiple sequence alignment of 5,381 HIV-1 genomes, retrieved from the Los Alamos National Laboratory in March 2025.
  • nextstrain_groups_LANL-HIV-DB_HIV_genome_timetree.jsonl.gz : A JSONL file, as produced by Nextstrain, containing the phylogeny of 3,090 of the 5,381 HIV genomes in the previous file.
  • MFS_1.fasta.xz : A multiple sequence alignment of 214,283 protein sequences of the Major Facilitator Superfamily (MFS), retrieved from Pfam in March 2025.
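The two `.fasta.xz` alignments can be read without decompressing them to disk first. The following is a minimal sketch using only the Python standard library; the function names `read_fasta_xz` and `alignment_shape` are illustrative and not part of the paper's tooling, and the sketch assumes the files are plain FASTA in which every record of an alignment has the same length.

```python
import lzma
from typing import Iterator, Tuple


def read_fasta_xz(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an xz-compressed FASTA file."""
    header = None
    chunks = []
    # Text mode decompresses on the fly, so the file is streamed line by line.
    with lzma.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header = line[1:]
                chunks = []
            elif line:
                chunks.append(line)
    # Emit the last record once the file is exhausted.
    if header is not None:
        yield header, "".join(chunks)


def alignment_shape(path: str) -> Tuple[int, int]:
    """Return (number of sequences, alignment width).

    Raises ValueError if the rows do not all have the same length,
    which would mean the file is not a well-formed alignment.
    """
    count = 0
    width = None
    for header, sequence in read_fasta_xz(path):
        if width is None:
            width = len(sequence)
        elif len(sequence) != width:
            raise ValueError(f"row {header!r} has length {len(sequence)}, expected {width}")
        count += 1
    return count, width or 0
```

If the download matches the description above, `alignment_shape("HIV1_ALL_2022_genome_DNA.fasta.xz")` would be expected to report 5,381 sequences. The alignment width is not stated here, so it is best read from the file itself.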

Files


Files (1.0 GB)

Checksum                                Size
md5:5879e267a0649a2919894e19c8f952d1    4.1 MB
md5:afb42d3030027bfe318166e0c2176ac5    44.8 MB
md5:c485b29a0e2a938b7b044f53516c3ae5    138.6 kB
md5:ea360d3a77a5057f7ad7a41d79eef934    995.6 MB
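A downloaded file can be checked against its listed MD5 checksum before use. The following is a minimal sketch using the Python standard library; the helper names `md5_of_file` and `matches_listed_md5` are illustrative. The listing here does not pair each checksum with a file name, so the value to pass in is the one shown next to the file on the record page.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: str, listed: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    value = listed.strip().lower()
    if value.startswith("md5:"):
        value = value[len("md5:"):]
    return md5_of_file(path) == value
```

A mismatch usually points to an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.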