afdb_clusters v1.0: AlphaFold-derived structure-based dataset for benchmarking MSA tools
Description
This dataset contains 1,166 protein families derived from AlphaFold Database Clusters. The families vary in size, ranging from approximately 1,000 to 680,000 sequences.
For each family, the dataset provides:
- protein sequences (FASTA format)
- download URLs for AlphaFold-predicted PDB structures corresponding to each protein sequence
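As a minimal sketch of reading these two per-family inputs, the following parses FASTA text and a URL list. It assumes one URL per line in the URL files, which the description does not state. The helper names `read_fasta` and `read_urls`, the sample sequences, and the example URLs are made up for illustration.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def read_urls(handle):
    """Return non-empty lines (assumption: one download URL per line)."""
    return [line.strip() for line in handle if line.strip()]

# Made-up sample data, not taken from the dataset.
fasta_text = ">seq1\nMKTAYIAK\nQRQISFVK\n>seq2\nMSLLTEVE\n"
urls_text = "https://example.org/seq1.pdb\nhttps://example.org/seq2.pdb\n"

records = list(read_fasta(io.StringIO(fasta_text)))
urls = read_urls(io.StringIO(urls_text))
```

Both helpers accept any iterable of lines, so an opened file can be passed in place of the `io.StringIO` objects.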
These paired sequences and structures enable structure-based benchmarking of multiple sequence alignment (MSA) tools using the Local Distance Difference Test (LDDT) score, computed with the FoldMason tool.
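To illustrate what an LDDT-style score measures, here is a simplified sketch that uses C-alpha coordinates only. It is not FoldMason's implementation, which scores whole alignments. The sketch assumes the two coordinate lists already correspond residue by residue, and it uses the commonly cited 15 Å inclusion radius and 0.5, 1, 2 and 4 Å tolerance thresholds. The function name `ca_lddt` and the coordinates are hypothetical.

```python
import math

def ca_lddt(ref, model, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Fraction of reference distances preserved in the model.

    ref, model: equal-length lists of (x, y, z) tuples for matched residues.
    Only residue pairs closer than `radius` in the reference are considered.
    """
    if len(ref) != len(model):
        raise ValueError("coordinate lists must have the same length")
    preserved = 0
    total = 0
    for i in range(len(ref)):
        for j in range(i + 1, len(ref)):
            d_ref = math.dist(ref[i], ref[j])
            if d_ref >= radius:
                continue  # pair lies outside the local neighbourhood
            diff = abs(d_ref - math.dist(model[i], model[j]))
            for t in thresholds:
                total += 1
                if diff < t:
                    preserved += 1
    return preserved / total if total else 0.0

# Made-up coordinates: three residues in a line, then with the third one moved.
ref = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]

identical_score = ca_lddt(ref, ref)
shifted_score = ca_lddt(ref, model)
```

In this toy case two of the three pairwise distances are unchanged and the third is off by about 2.2 Å, so it counts as preserved only at the 4 Å threshold.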
Directory structure
The dataset contains two main directories:
- fasta/ – protein sequences for each cluster [FASTA format]
- pdb_urls/ – text files containing download URLs for AlphaFold PDB structures for each sequence in the cluster [TXT format]
A metadata file (metadata.tsv) is also included, providing detailed information for each cluster.
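The layout above could be traversed as in the following sketch, which pairs each cluster's sequence file with its URL file. The file naming inside the two directories is not documented here, so the code assumes that matching files share the same stem (file name without extension). The function name `pair_cluster_files` and the names `clusterA` and `clusterB` are hypothetical.

```python
import tempfile
from pathlib import Path

def pair_cluster_files(root):
    """Map each shared file stem to (fasta_path, urls_path).

    Assumption: a cluster's files in fasta/ and pdb_urls/ share a stem.
    Stems present in only one directory are skipped.
    """
    root = Path(root)
    fasta = {p.stem: p for p in (root / "fasta").iterdir() if p.is_file()}
    urls = {p.stem: p for p in (root / "pdb_urls").iterdir() if p.is_file()}
    shared = sorted(fasta.keys() & urls.keys())
    return {stem: (fasta[stem], urls[stem]) for stem in shared}

# Demonstration on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "fasta").mkdir()
    (root / "pdb_urls").mkdir()
    (root / "fasta" / "clusterA.fasta").write_text(">s\nMK\n")
    (root / "pdb_urls" / "clusterA.txt").write_text("https://example.org/s.pdb\n")
    (root / "fasta" / "clusterB.fasta").write_text(">t\nMS\n")  # no URL file
    pairs = pair_cluster_files(root)
    paired_ids = list(pairs)
```

Matching on the stem avoids hard-coding file extensions, which would need to be checked against the actual archive contents.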
Metadata
A metadata file (metadata.tsv) provides:
- cluster_id – cluster identifier
- seqs_count – total number of sequences in the cluster
- min_seq_length – minimum sequence length within the cluster
- mean_seq_length – average sequence length within the cluster
- max_seq_length – maximum sequence length within the cluster
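The fields listed above could be loaded and filtered as in this sketch. It assumes that metadata.tsv is tab-separated with a header row using exactly these column names, that counts and minimum/maximum lengths are integers, and that the mean is a decimal number. The function names and sample rows are made up for illustration.

```python
import csv
import io

def read_metadata(handle):
    """Parse metadata rows into dicts with numeric fields converted."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "cluster_id": row["cluster_id"],
            "seqs_count": int(row["seqs_count"]),
            "min_seq_length": int(row["min_seq_length"]),
            "mean_seq_length": float(row["mean_seq_length"]),
            "max_seq_length": int(row["max_seq_length"]),
        })
    return rows

def select_by_size(rows, max_seqs):
    """Return IDs of clusters with at most `max_seqs` sequences."""
    return [r["cluster_id"] for r in rows if r["seqs_count"] <= max_seqs]

# Made-up rows, not taken from the dataset.
sample_tsv = (
    "cluster_id\tseqs_count\tmin_seq_length\tmean_seq_length\tmax_seq_length\n"
    "EXAMPLE_A\t1200\t50\t210.5\t400\n"
    "EXAMPLE_B\t680000\t30\t150.2\t900\n"
)

rows = read_metadata(io.StringIO(sample_tsv))
small_clusters = select_by_size(rows, max_seqs=10000)
```

Filtering on seqs_count in this way is one option for choosing a subset of families that a given MSA tool can process within available memory and time.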
Files
Total size: 5.1 GB
Checksum: md5:699c617bb59825e54b4e016a8140c7d6