Published July 16, 2025 | Version 1.0
Dataset Open

simulated_msa v1.0: simulated multiple sequence alignments with known phylogenies for benchmarking MSA tools

  • 1. Adam Mickiewicz University in Poznań
  • 2. Silesian University of Technology

Description

This dataset contains 1,860 simulated multiple sequence alignments (MSAs) with known phylogenies. The alignments were generated using the AliSim tool from IQ-TREE v2.4.0.

The simulation parameters span a broad range of conditions:

  • Number of sequences: 1,000–100,000
  • Substitution models: LG, JTT, and WAG
  • Sequence lengths: 400–2,000 residues
  • Sequence identities: 8%–75%
  • Gap fractions: 0%–99%
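As an illustration of what the identity and gap figures describe, the sketch below computes both for a tiny hand-made alignment. The exact definitions used to build this dataset are not stated here, so the formulas are assumptions: the gap percentage counts "-" characters over all alignment cells, and pairwise identity is taken over columns where both sequences have a residue. The function names are hypothetical.

```python
from itertools import combinations

GAP = "-"  # assumption: gaps are written as "-" in the aligned FASTA files


def gap_percent(rows):
    """Percentage of alignment cells that are gap characters."""
    total = sum(len(r) for r in rows)
    gaps = sum(r.count(GAP) for r in rows)
    return 100.0 * gaps / total if total else 0.0


def pairwise_identity_percent(a, b):
    """Identity over columns where both sequences carry a residue.

    This is one common convention; other definitions divide by the
    full alignment length or by the shorter sequence length.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def mean_identity_percent(rows):
    """Average pairwise identity over all sequence pairs."""
    scores = [pairwise_identity_percent(a, b) for a, b in combinations(rows, 2)]
    return sum(scores) / len(scores) if scores else 0.0


# Toy alignment, not taken from the dataset.
rows = ["MKV-LA", "MKVQLA", "MRV-LG"]
print(round(gap_percent(rows), 2))
print(round(mean_identity_percent(rows), 2))
```

The all-pairs loop grows quadratically with the number of sequences, so for the larger alignments in this dataset a random sample of pairs would be a more practical estimate.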

Directory structure

The dataset contains three main directories:

  • fasta/ – unaligned protein sequences [FASTA format]
  • msa/ – aligned protein sequences (reference MSAs) [FASTA format]
  • tree/ – phylogenetic trees corresponding to each simulated MSA (reference trees) [Newick format]

A metadata file (metadata.tsv) is also included, providing detailed information for each simulated MSA.
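A quick sanity check on the fasta/ and msa/ pair is that removing gaps from a reference MSA should give back the unaligned sequences. The sketch below shows one way to test this, assuming that both files use the same sequence headers and that gaps are written as "-". The file naming inside each directory is not described here, so the paths are passed in by the caller and the function names are hypothetical.

```python
def read_fasta(path):
    """Return {sequence_name: sequence} from a FASTA file.

    Wrapped sequence lines are joined; the name is the first
    whitespace-delimited token after ">".
    """
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].split()[0]
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def degap(aligned):
    """Strip gap characters from every aligned sequence."""
    return {name: seq.replace("-", "") for name, seq in aligned.items()}


def is_consistent(fasta_path, msa_path):
    """True if the degapped MSA equals the unaligned sequences."""
    return read_fasta(fasta_path) == degap(read_fasta(msa_path))
```

Called as is_consistent(path_to_unaligned, path_to_reference_msa), it returns False if any sequence is missing, renamed, or differs after gap removal.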

Metadata

The metadata file (metadata.tsv) provides the following columns for each simulated MSA:

  • id – unique MSA identifier
  • seqs_count – number of sequences in the MSA
  • alisim_length – seed sequence length
  • alisim_rlen_min / mean / max – relative branch length parameters
  • alisim_ins / alisim_del – insertion and deletion rates
  • alisim_model – substitution model (LG, JTT, WAG)
  • alisim_model_type – model configuration used by AliSim
  • mean_identity_percent – average sequence identity [%]
  • mean_gaps_percent – average fraction of gaps [%]
  • min_seq_length / mean_seq_length / max_seq_length – sequence length statistics
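The columns above make it straightforward to pick a subset of alignments for a benchmark run. The following is a minimal sketch using only the standard library. It assumes that metadata.tsv is tab-separated with a header row whose names match the columns listed here; the helper names and the example thresholds are hypothetical.

```python
import csv


def load_metadata(path):
    """Read metadata.tsv into a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def select_msas(rows, model, max_seqs):
    """Return ids of MSAs simulated under `model` with at most `max_seqs` sequences."""
    return [
        row["id"]
        for row in rows
        if row["alisim_model"] == model and int(row["seqs_count"]) <= max_seqs
    ]
```

For example, select_msas(load_metadata("metadata.tsv"), "LG", 10000) would list the LG-simulated alignments with up to 10,000 sequences, provided the assumptions about the file layout hold.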

Files

Total size: 21.0 GB
Checksum: md5:94d628de2bee89acc274d64434199987
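
Given the size of the download, it is worth comparing the local file against the MD5 checksum listed above before unpacking it. The sketch below hashes a file in chunks so that the whole archive is never held in memory. The archive file name is not given here, so the path is left to the caller; the function and constant names are hypothetical.

```python
import hashlib

# Checksum as listed for this record.
EXPECTED_MD5 = "94d628de2bee89acc274d64434199987"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

A False result from is_intact(path_to_downloaded_file) suggests an incomplete or corrupted download.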