afdb_clusters v1.0: AlphaFold-derived structure-based dataset for benchmarking MSA tools

Zielezinski, Andrzej; Gudyś, Adam; Deorowicz, Sebastian

doi:10.5281/zenodo.16082639

Published July 18, 2025 | Version 1.0

Dataset Open

1. Adam Mickiewicz University in Poznań
2. Silesian University of Technology

This dataset contains 1,166 protein families derived from AlphaFold Database Clusters. Family sizes range from approximately 1,000 to 680,000 sequences.

For each family, the dataset provides:

protein sequences (FASTA format)
download URLs for AlphaFold-predicted PDB structures corresponding to each protein sequence

These paired sequences and structures enable structure-based benchmarking of multiple sequence alignment (MSA) tools using the Local Distance Difference Test (LDDT) score, computed with the FoldMason tool.
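To illustrate what an LDDT-style score measures, the following is a minimal sketch of a simplified, C-alpha-only LDDT between two sets of coordinates for the same residues. It is not FoldMason's implementation, and FoldMason's MSA-level score aggregates over aligned structures in its own way. The function name lddt_ca and the toy coordinates are hypothetical; the 15 Å inclusion radius and the 0.5, 1, 2 and 4 Å thresholds are the values commonly used for LDDT.

```python
import math
from itertools import combinations


def lddt_ca(ref, model, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified global C-alpha LDDT for two equal-length coordinate lists.

    ref and model hold (x, y, z) tuples for the same residues in the same order.
    Illustrative only; not the FoldMason implementation.
    """
    if len(ref) != len(model):
        raise ValueError("coordinate lists must have the same length")
    preserved = 0
    considered = 0
    for i, j in combinations(range(len(ref)), 2):
        d_ref = math.dist(ref[i], ref[j])
        if d_ref >= radius:
            continue  # only pairs that are local in the reference are scored
        diff = abs(d_ref - math.dist(model[i], model[j]))
        considered += len(thresholds)
        # A distance counts as preserved once per threshold it satisfies.
        preserved += sum(diff < t for t in thresholds)
    return preserved / considered if considered else 0.0


# Toy coordinates (made up): three residues on a line, 3.8 Å apart.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
shifted = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (9.1, 0.0, 0.0)]
print(lddt_ca(reference, reference))  # identical structures score 1.0
print(round(lddt_ca(reference, shifted), 3))
```

Because the score compares intra-structure distances rather than superposed coordinates, it needs no structural superposition, which is one reason LDDT is convenient for scoring alignments of many structures.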

Directory structure

The dataset contains two main directories:

fasta/ – protein sequences for each cluster [FASTA format]
pdb_urls/ – text files containing download URLs for AlphaFold PDB structures for each sequence in the cluster [TXT format]

A metadata file (metadata.tsv) is also included, providing detailed information for each cluster.
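Since the archive ships download URLs rather than the structures themselves, the PDB files have to be fetched separately. The sketch below shows one possible way to do that, assuming each file in pdb_urls/ lists one URL per line (an assumption; the exact layout is not documented here). The helper names read_urls and download_structures, and the file path in the usage note, are hypothetical.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def read_urls(path):
    """Return the non-empty lines of a URL list file (assumed: one URL per line)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def download_structures(urls, out_dir, fetch=urllib.request.urlretrieve):
    """Download each URL into out_dir, skipping files that already exist.

    fetch is injectable so the logic can be exercised without network access.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Name the local file after the last component of the URL path.
        target = out_dir / Path(urlparse(url).path).name
        if not target.exists():
            fetch(url, str(target))
        saved.append(target)
    return saved
```

A call might look like download_structures(read_urls("pdb_urls/some_cluster.txt"), "pdb/some_cluster"), where the file name is a placeholder. For the largest families (hundreds of thousands of sequences), downloading in batches and rate-limiting requests is likely to be necessary.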

Metadata

The metadata file (metadata.tsv) provides the following fields for each cluster:

cluster_id – cluster identifier
seqs_count – total number of sequences in the cluster
min_seq_length – minimum sequence length within the cluster
mean_seq_length – average sequence length within the cluster
max_seq_length – maximum sequence length within the cluster
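The fields above make it straightforward to pick a subset of families by size before downloading any structures. The following is a minimal sketch, assuming metadata.tsv is tab-separated with a header row that uses exactly these field names and that the count and length columns are plain numbers. The helper names load_metadata and select_by_size and the two example rows are made up for illustration.

```python
import csv
import io


def load_metadata(handle):
    """Read metadata rows into dicts, converting the numeric columns."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "cluster_id": row["cluster_id"],
            "seqs_count": int(row["seqs_count"]),
            "min_seq_length": int(row["min_seq_length"]),
            "mean_seq_length": float(row["mean_seq_length"]),
            "max_seq_length": int(row["max_seq_length"]),
        })
    return rows


def select_by_size(rows, min_seqs, max_seqs):
    """Return the IDs of clusters whose sequence count lies in [min_seqs, max_seqs]."""
    return [r["cluster_id"] for r in rows if min_seqs <= r["seqs_count"] <= max_seqs]


# Made-up rows, used only to illustrate the assumed layout.
example = io.StringIO(
    "cluster_id\tseqs_count\tmin_seq_length\tmean_seq_length\tmax_seq_length\n"
    "clusterA\t1200\t80\t152.4\t310\n"
    "clusterB\t54000\t45\t98.7\t220\n"
)
rows = load_metadata(example)
print(select_by_size(rows, 1000, 10000))  # ['clusterA']
```

With the real file, the same functions could be called on an open file handle, for example with open("metadata.tsv", newline="", encoding="utf-8") as handle.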

Files

Name	Size	MD5
afdb_clusters_dataset.tar.gz	5.1 GB	699c617bb59825e54b4e016a8140c7d6

