simulated_msa v1.0: simulated multiple sequence alignments with known phylogenies for benchmarking MSA tools

Zielezinski, Andrzej; Gudyś, Adam; Deorowicz, Sebastian

doi:10.5281/zenodo.15971353

Published July 16, 2025 | Version 1.0

Dataset Open


1. Adam Mickiewicz University in Poznań
2. Silesian University of Technology

This dataset contains 1,860 simulated multiple sequence alignments (MSAs) with known phylogenies. The alignments were generated using the AliSim tool from IQ-TREE v2.4.0.

The simulation parameters span a broad range of conditions:

Number of sequences: 1,000–100,000
Substitution models: LG, JTT, and WAG
Sequence lengths: 400–2,000 residues
Sequence identities: 8%–75%
Gap fractions: 0%–99%
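As an illustration of how a gap fraction like the one listed above could be measured on an aligned FASTA file, here is a minimal Python sketch. It assumes the gap fraction is the share of `-` characters among all alignment cells; the dataset description does not state the exact definition used, and the helper names `read_fasta` and `gap_percent` are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def gap_percent(alignment):
    """Percentage of '-' characters over all cells of the alignment.

    Assumption: this is only one plausible definition of "gap fraction".
    """
    total = sum(len(seq) for seq in alignment.values())
    gaps = sum(seq.count("-") for seq in alignment.values())
    return 100.0 * gaps / total if total else 0.0
```

For real files, the text would be read from a file under msa/ and passed to `read_fasta`; for the largest alignments a streaming parser may be preferable to loading everything into memory.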

Directory structure

The dataset contains three main directories:

fasta/ – unaligned protein sequences [FASTA format]
msa/ – aligned protein sequences (reference MSAs) [FASTA format]
tree/ – phylogenetic trees corresponding to each simulated MSA (reference trees) [Newick format]
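The three directories can be traversed together to collect, for each simulated MSA, its unaligned sequences, reference alignment, and reference tree. The sketch below assumes that corresponding files share the same base name (file name without its final extension) across fasta/, msa/, and tree/; the description does not document the naming scheme, so this should be checked against the extracted archive. The function names are hypothetical.

```python
from pathlib import Path


def index_by_stem(directory):
    """Map each file's stem (name without its last extension) to its path."""
    return {p.stem: p for p in Path(directory).iterdir() if p.is_file()}


def pair_dataset_files(root):
    """Return {stem: (fasta_path, msa_path, tree_path)}.

    Only stems present in all three directories are kept.
    Assumption: matching files share a base name across directories.
    """
    root = Path(root)
    fasta = index_by_stem(root / "fasta")
    msa = index_by_stem(root / "msa")
    tree = index_by_stem(root / "tree")
    shared = sorted(fasta.keys() & msa.keys() & tree.keys())
    return {stem: (fasta[stem], msa[stem], tree[stem]) for stem in shared}
```

Restricting the result to shared stems makes incomplete triples visible as missing keys rather than as errors later in a benchmark run.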

A metadata file (metadata.tsv) is also included, providing detailed information for each simulated MSA.

Metadata

The metadata file (metadata.tsv) provides the following fields for each simulated MSA:

id – unique MSA identifier
seqs_count – number of sequences in the MSA
alisim_length – seed sequence length
alisim_rlen_min / mean / max – relative branch length parameters
alisim_ins / alisim_del – insertion and deletion rates
alisim_model – substitution model (LG, JTT, WAG)
alisim_model_type – model configuration used by AliSim
mean_identity_percent – average sequence identity [%]
mean_gaps_percent – average fraction of gaps [%]
min_seq_length / mean_seq_length / max_seq_length – sequence length statistics
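The fields above can be used to select a subset of the benchmark, for example all alignments simulated under one substitution model and below a given size. The following Python sketch uses only the standard library. It assumes metadata.tsv is tab-separated with a header row carrying the field names listed above, that seqs_count is stored as a plain integer, and that alisim_model holds the bare model name; `load_metadata` and `select_msas` are hypothetical helper names.

```python
import csv


def load_metadata(path):
    """Read metadata.tsv into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def select_msas(rows, model=None, max_seqs=None):
    """Return the ids of rows matching the optional filters.

    Assumptions: 'alisim_model' holds the bare model name (e.g. "LG")
    and 'seqs_count' parses as an integer.
    """
    selected = []
    for row in rows:
        if model is not None and row["alisim_model"] != model:
            continue
        if max_seqs is not None and int(row["seqs_count"]) > max_seqs:
            continue
        selected.append(row["id"])
    return selected
```

A call such as `select_msas(load_metadata("metadata.tsv"), model="LG", max_seqs=10000)` would then return the identifiers of the smaller LG-based alignments, if the assumptions above hold.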

Files

Files (21.0 GB)

Name	Size	Checksum
simulated_msa_dataset.tar.gz	21.0 GB	md5:94d628de2bee89acc274d64434199987
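Given the archive size, it is worth verifying the download against the published MD5 checksum before extracting it. The sketch below computes the digest in chunks so the 21.0 GB file is never loaded into memory at once; the function name `md5_of_file` and the local file path are assumptions.

```python
import hashlib

# Checksum published alongside simulated_msa_dataset.tar.gz
EXPECTED_MD5 = "94d628de2bee89acc274d64434199987"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

After downloading, `md5_of_file("simulated_msa_dataset.tar.gz") == EXPECTED_MD5` should evaluate to `True` for an intact copy of the file.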

