Published July 16, 2025 | Version 1.0
Dataset
Open
simulated_msa v1.0: simulated multiple sequence alignments with known phylogenies for benchmarking MSA tools
Authors/Creators
Description
This dataset contains 1,860 simulated multiple sequence alignments (MSAs) with known phylogenies. The alignments were generated using the AliSim tool from IQ-TREE v2.4.0.
The simulation parameters span a broad range of conditions:
- Number of sequences: 1,000–100,000
- Substitution models: LG, JTT, and WAG
- Sequence lengths: 400–2,000 residues
- Sequence identities: 8%–75%
- Gap fractions: 0%–99%
Directory structure
The dataset contains three main directories:
- fasta/ – unaligned protein sequences [FASTA format]
- msa/ – aligned protein sequences (reference MSAs) [FASTA format]
- tree/ – phylogenetic trees corresponding to each simulated MSA (reference trees) [Newick format]
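As a rough illustration of how the aligned files in msa/ relate to the unaligned files in fasta/, the sketch below parses FASTA text, strips gap characters to recover unaligned sequences, and computes the percentage of gap cells. It assumes that gaps are written as `-`; the toy records are invented for illustration, and the gap percentage shown is only one plausible definition, not necessarily the one used to produce the dataset's statistics.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA text from a file-like object into {name: sequence}."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def gap_percent(msa):
    """Percentage of alignment cells that are '-' (assumed gap character)."""
    total = sum(len(seq) for seq in msa.values())
    gaps = sum(seq.count("-") for seq in msa.values())
    return 100.0 * gaps / total if total else 0.0


def ungap(msa):
    """Remove gap characters to obtain unaligned sequences."""
    return {name: seq.replace("-", "") for name, seq in msa.items()}


# Toy alignment (invented, not taken from the dataset).
toy = StringIO(">s1\nMK-LV\n>s2\nMKALV\n>s3\nM--LV\n")
msa = read_fasta(toy)
print(gap_percent(msa))   # 3 gaps out of 15 cells -> 20.0
print(ungap(msa)["s3"])   # MLV
```

For a real file, the same functions can be applied to an opened handle, for example `with open(path) as fh: msa = read_fasta(fh)`, where `path` points to a file under msa/.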
A metadata file (metadata.tsv) is also included, providing detailed information for each simulated MSA.
Metadata
The metadata.tsv file provides the following fields for each simulated MSA:
- id – unique MSA identifier
- seqs_count – number of sequences in the MSA
- alisim_length – seed sequence length
- alisim_rlen_min / mean / max – relative branch length parameters
- alisim_ins / alisim_del – insertion and deletion rates
- alisim_model – substitution model (LG, JTT, WAG)
- alisim_model_type – model configuration used by AliSim
- mean_identity_percent – average sequence identity [%]
- mean_gaps_percent – average fraction of gaps [%]
- min_seq_length / mean_seq_length / max_seq_length – sequence length statistics
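The fields above can be used to pick a subset of alignments for a benchmark. The following is a minimal sketch using only the Python standard library. It assumes that metadata.tsv is tab-separated with a header row whose column names match the field names listed above; the function name `select_msas` and the sample rows are hypothetical and invented for illustration.

```python
import csv
from io import StringIO


def select_msas(handle, model=None, max_seqs=None, min_identity=None):
    """Return the ids of MSAs that satisfy the optional filters.

    Assumes a tab-separated file with a header row containing the columns
    id, seqs_count, alisim_model and mean_identity_percent.
    """
    selected = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if model is not None and row["alisim_model"] != model:
            continue
        if max_seqs is not None and int(row["seqs_count"]) > max_seqs:
            continue
        if (min_identity is not None
                and float(row["mean_identity_percent"]) < min_identity):
            continue
        selected.append(row["id"])
    return selected


# Invented sample rows that mimic the assumed layout of metadata.tsv.
SAMPLE = (
    "id\tseqs_count\talisim_model\tmean_identity_percent\n"
    "toy_a\t1000\tLG\t42.5\n"
    "toy_b\t50000\tLG\t12.0\n"
    "toy_c\t2000\tWAG\t60.0\n"
)

print(select_msas(StringIO(SAMPLE), model="LG", max_seqs=10000))  # ['toy_a']
print(select_msas(StringIO(SAMPLE), min_identity=40.0))  # ['toy_a', 'toy_c']
```

With the real file, the call would look like `with open("metadata.tsv", newline="") as fh: ids = select_msas(fh, model="LG")`.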
Files
(21.0 GB)
| Checksum | Size |
|---|---|
| md5:94d628de2bee89acc274d64434199987 | 21.0 GB |
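Given the size of the download, it may be worth checking the MD5 checksum listed above before unpacking. The sketch below hashes a file in chunks so that the 21.0 GB archive does not have to fit in memory. The helper names `md5_of_file` and `verify_download` are hypothetical, and the archive path is left to the caller because the file name is not given here.

```python
import hashlib

# Checksum as listed for the dataset file.
EXPECTED_MD5 = "94d628de2bee89acc274d64434199987"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

A typical use would be `verify_download(archive_path)`, where `archive_path` is wherever the downloaded file was saved; a result of `False` suggests an incomplete or corrupted download.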