active_sites v1.0: enzyme domain sequences with annotated active sites from Pfam v37.1 for benchmarking MSA tools
Description
This dataset contains 772 enzyme domain families annotated with 1,376 active sites in Pfam v37.1.
Protein sequences containing Pfam domains were retrieved from UniProt, and active site residues were predicted using Pfam’s pfam_scan.pl tool v1.626 together with the active site database active_site.dat v37.1. For each protein, only the sequence corresponding to the annotated domain was extracted from the full-length protein sequence. Domain sequences were excluded if the annotated domain was shorter than 25% of the length of the corresponding Pfam HMM, or if more than 10% of its residues were non-standard amino acids.
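The two exclusion criteria can be illustrated with a minimal sketch. It is not the pipeline used to build the dataset: the function name `passes_filters`, its parameters, and the choice of the 20 canonical amino acids as the "standard" alphabet are assumptions made for illustration.

```python
# Assumption: "standard" means the 20 canonical amino acids.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def passes_filters(domain_seq: str, hmm_length: int,
                   min_coverage: float = 0.25,
                   max_nonstandard: float = 0.10) -> bool:
    """Return True if a domain sequence would be kept under both criteria."""
    if not domain_seq or hmm_length <= 0:
        return False
    seq = domain_seq.upper()
    # Criterion 1: exclude domains shorter than 25% of the HMM length.
    if len(seq) < min_coverage * hmm_length:
        return False
    # Criterion 2: exclude domains with more than 10% non-standard residues.
    nonstandard = sum(1 for aa in seq if aa not in STANDARD_AA)
    if nonstandard / len(seq) > max_nonstandard:
        return False
    return True
```

For example, a 10-residue domain matched against an HMM of length 40 sits exactly at the 25% threshold and is kept, whereas a 9-residue domain would be excluded.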
Directory structure
The dataset contains two directories:
- families/ – domain protein sequences grouped by enzyme family [FASTA format]
- active_sites/ – active site residue annotations for each family [TSV format]
A metadata file (metadata.tsv) is also included, providing detailed information for each enzyme family.
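As a rough sketch of how the domain sequences of one family might be read, the following uses only the Python standard library. The file naming scheme (`families/<family_id>.fasta`) and the helper names `parse_fasta` and `load_family` are assumptions; the description does not state how the files inside families/ are named, so the extension may need adjusting.

```python
from pathlib import Path
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {sequence_id: sequence}."""
    seqs: Dict[str, str] = {}
    current_id = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if current_id is not None:
                seqs[current_id] = "".join(chunks)
            header = line[1:].split()
            current_id = header[0] if header else ""
            chunks = []
        elif current_id is not None:
            chunks.append(line)
    if current_id is not None:
        seqs[current_id] = "".join(chunks)
    return seqs


def load_family(root: str, family_id: str,
                fasta_ext: str = ".fasta") -> Dict[str, str]:
    """Load one family; the path layout here is hypothetical."""
    path = Path(root) / "families" / f"{family_id}{fasta_ext}"
    with path.open() as handle:
        return parse_fasta(handle)
```

The sequence identifier is taken as the first whitespace-delimited token of each header line, which is a common convention but not one confirmed for this dataset.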
Metadata
The metadata file (metadata.tsv) provides the following columns:
- family_id – Pfam family identifier (e.g., PF02615)
- family_name – Pfam family name
- seqs_count – total number of domain sequences in the family
- seqs_with_active_sites – number of sequences containing at least one annotated active site
- seqs_with_active_sites_percent – percentage of sequences with active sites
- active_site_ids – comma-separated list of active site identifiers for the family
- min_seq_length – minimum domain sequence length
- mean_seq_length – average domain sequence length
- max_seq_length – maximum domain sequence length
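The metadata columns listed above lend themselves to selecting a subset of families for a benchmark. The sketch below assumes that metadata.tsv starts with a header row using exactly these column names and that the percentage is stored as a number between 0 and 100; neither detail is confirmed by the description. The function names and thresholds are illustrative only.

```python
import csv
from typing import Dict, Iterable, List


def read_metadata(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Read tab-separated metadata rows into dictionaries keyed by column name."""
    return list(csv.DictReader(lines, delimiter="\t"))


def select_families(rows: List[Dict[str, str]],
                    min_percent: float = 50.0,
                    min_seqs: int = 10) -> List[str]:
    """Return family_ids with enough sequences and enough annotated ones."""
    selected = []
    for row in rows:
        # Tolerate a trailing percent sign in case the value is stored as "80%".
        percent = float(row["seqs_with_active_sites_percent"].rstrip("%"))
        if int(row["seqs_count"]) >= min_seqs and percent >= min_percent:
            selected.append(row["family_id"])
    return selected
```

With a real copy of the dataset, `read_metadata` could be given an open file handle, for example `read_metadata(open("metadata.tsv", newline=""))`.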
Active sites
Each family has a corresponding TSV file in active_sites/ listing sequence-specific active site annotations:
- protein_id – protein sequence identifier
- site_id – active site identifier (e.g., 114_H, 42_H)
- protein_position – residue position within the sequence
- protein_residue – amino acid at the position
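One way these annotations might be used to assess an alignment is to check whether residues sharing a site_id end up in the same alignment column. The sketch below makes several assumptions not stated in the description: the TSV has a header row with the four column names above, protein_position is 1-based and refers to the domain sequence stored in families/, and the alignment is available as a dictionary of gapped strings keyed by protein_id. All function names are hypothetical.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple


def read_active_sites(lines: Iterable[str]) -> Dict[str, Dict[str, Tuple[int, str]]]:
    """Map protein_id -> {site_id: (position, residue)}."""
    sites: Dict[str, Dict[str, Tuple[int, str]]] = defaultdict(dict)
    for row in csv.DictReader(lines, delimiter="\t"):
        sites[row["protein_id"]][row["site_id"]] = (
            int(row["protein_position"]), row["protein_residue"])
    return dict(sites)


def residue_to_column(aligned_seq: str, position: int,
                      gap_chars: str = "-.") -> Optional[int]:
    """Return the 0-based column of a 1-based ungapped position, or None."""
    seen = 0
    for col, ch in enumerate(aligned_seq):
        if ch not in gap_chars:
            seen += 1
            if seen == position:
                return col
    return None


def site_columns(alignment: Dict[str, str],
                 sites: Dict[str, Dict[str, Tuple[int, str]]]) -> Dict[str, Set[int]]:
    """Collect, per site_id, the alignment columns its residues fall into."""
    columns: Dict[str, Set[int]] = defaultdict(set)
    for protein_id, per_site in sites.items():
        aligned = alignment.get(protein_id)
        if aligned is None:
            continue  # sequence not present in this alignment
        for site_id, (pos, _residue) in per_site.items():
            col = residue_to_column(aligned, pos)
            if col is not None:
                columns[site_id].add(col)
    return dict(columns)
```

Under these assumptions, a site whose set contains a single column is aligned consistently across all annotated sequences, while a larger set indicates that the aligner placed equivalent residues in different columns.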
Files
(4.6 GB)
Name | Size |
---|---|
md5:84b918786d488ba70e8aed35b2f20232 | 4.6 GB |
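After downloading, the archive can be checked against the MD5 checksum listed above. The sketch below uses only the standard library and reads the file in chunks so that the 4.6 GB archive is never loaded into memory at once. The file name is not shown in the listing, so the path passed to `md5_of_file` is left to the reader.

```python
import hashlib

# Checksum as listed in the file table above.
EXPECTED_MD5 = "84b918786d488ba70e8aed35b2f20232"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A download is intact if `md5_of_file(path_to_archive) == EXPECTED_MD5` evaluates to true, where `path_to_archive` is wherever the file was saved.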