Published August 17, 2024 | Version 1.0
Dataset Open

DPCstruct Classification of AlphaFold2-Predicted Protein Structures

Description

This dataset contains DPCstruct domain classifications for protein structures predicted by AlphaFold2, as presented in the paper "Unsupervised Domain Classification of AlphaFold2-Predicted Protein Structures."

DPCstruct was applied to a non-redundant set of the AlphaFold Database v4.0, known as Foldseek Clusters, which includes approximately 15 million representative proteins, as described in the work by Barrio-Hernandez et al.

This repository provides the results of our classification, along with all the data related to the analyses presented in our study. The DPCstruct algorithm is available at https://github.com/RitAreaSciencePark/DPCstruct, together with examples of how to use it.

FILES DESCRIPTION:

  • dpcstruct_classification.tsv: List of domains identified by DPCstruct and the metacluster each one belongs to. Columns: metacluster ID, protein UniProt ID, domain start, domain end.
  • mcs_reps.fasta: For each metacluster, two representative domains were selected: one representing the center of the cluster and the other being the domain with the highest pLDDT score. If these are the same, only one domain is included as the representative. This file contains the list of representative domains and their sequences in FASTA format.
  • mcs_reps_pdbs.zip: Contains a PDB file for each representative domain. The filename is structured as 'proteinID_metacluster.pdb'.
  • mcs_properties.tsv: Set of properties per metacluster, including:
    • mcID: Metacluster ID.
    • size: Number of domains.
    • len_aa: Average length of domains (number of amino acids).
    • len_std: Standard deviation of domain lengths.
    • len_ratio: Ratio of len_std to len_aa.
    • plddt: Average predicted LDDT as reported by AlphaFold2.
    • disorder: Average intrinsic disorder score calculated with AIUPred.
    • alntmscore: Pairwise alignment TM-score between domains, averaged over all pairs.
    • tmscore: Pairwise alignment TM-score between domains, averaged over all pairs, using the maximum between TM-score normalized by query or target.
    • lddt: Pairwise LDDT score, averaged over all pairs.
    • prob: Pairwise probability of homology according to SCOPe, as reported by Foldseek.
    • pident: Pairwise percentage identity, averaged over all pairs.
  • annotated_[cath|scop]_qc[x]_t[x]_l[x].tsv: For each fold in [CATH|SCOP], we provide the best matching DPCstruct domain, if available, along with the structural alignment information as reported by Foldseek. A fold is considered annotated if its alignment values meet or exceed the following thresholds:
    • qc: query coverage.
    • t: TM-score (template modeling score) of the alignment.
    • l: LDDT score of the alignment.
  • dpcstruct_consistency.tsv: Consistency of DPCstruct metaclusters with respect to Pfam 36.0 labels. Note that a Pfam label is considered to overlap with a DPCstruct domain even if the two share only one amino acid, which is why some metaclusters have many labels. In such cases, only 5 representative labels are displayed.
  • pfam_consistency.tsv: Consistency of Pfam clans with respect to DPCstruct labels.
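
Two rules stated in the list above lend themselves to a short illustration: the "meet or exceed" threshold test used for the annotated_[cath|scop] files, and the one-residue overlap criterion used for Pfam labels. The following Python sketch is not taken from the DPCstruct code base. The function names are hypothetical, the example filename is made up, and the assumptions that the [x] placeholders are plain decimal numbers and that domain coordinates are inclusive should be checked against the actual files.

```python
import re

# Assumed filename layout: annotated_<db>_qc<x>_t<x>_l<x>.tsv, with <x> a decimal number.
PATTERN = re.compile(
    r"^annotated_(cath|scop)_qc([0-9.]+)_t([0-9.]+)_l([0-9.]+)\.tsv$"
)


def parse_thresholds(filename):
    """Extract the database name and the three thresholds from a filename."""
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename}")
    database, qc, t, l = match.groups()
    return {"database": database, "qc": float(qc), "t": float(t), "l": float(l)}


def is_annotated(query_coverage, tm_score, lddt, thresholds):
    """A fold counts as annotated if every value meets or exceeds its threshold."""
    return (
        query_coverage >= thresholds["qc"]
        and tm_score >= thresholds["t"]
        and lddt >= thresholds["l"]
    )


def overlaps(a_start, a_end, b_start, b_end):
    """True if two inclusive residue ranges share at least one position."""
    return a_start <= b_end and b_start <= a_end


# Hypothetical filename and alignment values, for illustration only.
thresholds = parse_thresholds("annotated_cath_qc0.8_t0.5_l0.5.tsv")
print(thresholds)
print(is_annotated(0.85, 0.62, 0.71, thresholds))
print(overlaps(1, 10, 10, 20))
```

Because a single shared residue is enough for an overlap, a domain that merely touches the edge of a Pfam region still inherits its label, which is consistent with the remark that some metaclusters carry many labels.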

Note: All 'tsv' files contain a header as the first row.
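
As a minimal sketch of how dpcstruct_classification.tsv could be read, the Python snippet below skips the header row and groups domains by metacluster. It relies only on the column order given in the file description (metacluster ID, protein ID, start, end). The header names and rows in the sample are invented, and treating start and end as inclusive integer positions is an assumption.

```python
import csv
import io
from collections import defaultdict


def load_domains(handle):
    """Group (protein_id, start, end) tuples by metacluster ID."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # every tsv file starts with a header row
    by_metacluster = defaultdict(list)
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        mc_id, protein_id, start, end = row[:4]
        by_metacluster[mc_id].append((protein_id, int(start), int(end)))
    return dict(by_metacluster)


# Made-up rows for illustration; the real header names may differ.
sample = io.StringIO(
    "mcID\tprotein\tstart\tend\n"
    "1\tP12345\t10\t120\n"
    "1\tQ67890\t5\t110\n"
    "2\tP12345\t130\t260\n"
)
domains = load_domains(sample)

for mc_id, members in domains.items():
    # Domain length assuming inclusive coordinates.
    lengths = [end - start + 1 for _, start, end in members]
    print(mc_id, len(members), sum(lengths) / len(lengths))
```

For the real file, replace the sample with open("dpcstruct_classification.tsv", newline=""). Reading by position rather than by header name avoids depending on the exact column labels.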

If you have any questions about the data, or if something is missing, please contact us:

federico.barone@areasciencepark.it

Files

Files (1.3 GB)

MD5 checksum  Size
md5:f6491ea3c21a88c64cd2c45d8d2e8cec  166.3 kB
md5:c8aa536b0e8251c635d451412c428c4e  162.1 kB
md5:c7c54380ad30d1f97aad2d9758de0788  188.2 kB
md5:9cdf35ad337589ab8d1bb354af61d647  181.4 kB
md5:586802b8857511d4f2b3be270b9e8f12  38.3 MB
md5:a5cb8307a311fc157a8056cb4062eba4  361.1 kB
md5:88f3070908223a679246e1eeea066275  2.4 MB
md5:c25e303fbf621362e03072f4fb33575f  10.2 MB
md5:2cf9dd2cdcfdcc696336ee053c27d341  1.3 GB
md5:05af7e5ccb67623efdbdcc5ba1febb19  190.4 kB
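
The MD5 values listed above can be used to check that a download completed without corruption. The sketch below uses Python's standard hashlib module. The helper names are hypothetical, and the demonstration runs on a small temporary file rather than on one of the dataset files. The pairing of each checksum with its file name has to be taken from the record page itself.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Demonstration on a temporary file containing the bytes b"hello".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

demo_ok = matches_checksum(demo_path, "md5:5d41402abc4b2a76b9719d911017c592")
os.remove(demo_path)
print(demo_ok)
```

Reading in chunks keeps memory use constant, which matters for the 1.3 GB archive.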

Additional details

Related works

Is variant form of
Journal article: 10.1371/journal.pcbi.1010610 (DOI)

Software

Repository URL
https://github.com/RitAreaSciencePark/DPCstruct
Programming language
C++
Development Status
Active