DPCstruct Classification of AlphaFold2-Predicted Protein Structures
Description
This dataset contains DPCstruct domain classifications for protein structures predicted by AlphaFold2, as presented in the paper "Unsupervised Domain Classification of AlphaFold2-Predicted Protein Structures."
DPCstruct was applied to a non-redundant set of the AlphaFold Database v4.0, known as Foldseek Clusters, which includes approximately 15 million representative proteins, as described in the work by Barrio-Hernandez et al.
This repository provides the results of our classification, along with all the data related to the analyses presented in our study. The DPCstruct algorithm is available at https://github.com/RitAreaSciencePark/DPCstruct, together with usage examples.
FILES DESCRIPTION:
- dpcstruct_classification.tsv: List of domains identified by DPCstruct and their corresponding metaclusters. Columns: Metacluster ID, Protein UniProt ID, domain start, domain end.
- mcs_reps.fasta: For each metacluster, two representative domains were selected: one representing the center of the cluster and the other being the domain with the highest pLDDT score. If these are the same, only one domain is included as the representative. This file contains the list of representative domains and their sequences in FASTA format.
- mcs_reps_pdbs.zip: Contains a PDB file for each representative domain. The filename is structured as 'proteinID_metacluster.pdb'.
- mcs_properties.tsv: Set of properties per metacluster, including:
  - mcID: Metacluster ID.
  - size: Number of domains.
  - len_aa: Average length of domains (number of amino acids).
  - len_std: Standard deviation of domain lengths.
  - len_ratio: Ratio of len_std to len_aa.
  - plddt: Average predicted LDDT as reported by AlphaFold2.
  - disorder: Average intrinsic disorder score calculated with AIUPred.
  - alntmscore: Pairwise alignment TM-score between domains, averaged over all pairs.
  - tmscore: Pairwise alignment TM-score between domains, averaged over all pairs, taking for each pair the maximum of the query-normalized and target-normalized TM-scores.
  - lddt: Pairwise LDDT score, averaged over all pairs.
  - prob: Pairwise probability of homology according to SCOPe, as reported by Foldseek.
  - pident: Pairwise percentage identity, averaged over all pairs.
- annotated_[cath|scop]_qc[x]_t[x]_l[x].tsv: For each fold in [CATH|SCOP], we provide the best matching DPCstruct domain, if available, along with the structural alignment information as reported by Foldseek. A fold is considered annotated if its alignment values meet or exceed the thresholds [x] given in the filename:
  - qc: query coverage.
  - t: template modelling score (TM-score) of the alignment.
  - l: LDDT score of the alignment.
- dpcstruct_consistency.tsv: Consistency of DPCstruct metaclusters with respect to Pfam 36.0 labels. Note that we consider a Pfam label to overlap with a DPCstruct domain even if the two share just one amino acid, which is why some metaclusters have many labels. In such cases, we display only 5 representative labels.
- pfam_consistency.tsv: Consistency of Pfam Clans with respect to DPCstruct labels.
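The one-residue overlap rule used for the Pfam consistency files can be illustrated with a short sketch. This is not code from the DPCstruct repository: the function name `overlaps` and the assumption that start and end positions are inclusive residue indices are ours, for illustration only.

```python
def overlaps(dom_start, dom_end, pfam_start, pfam_end, min_shared=1):
    """Return True if two residue ranges share at least `min_shared` positions.

    Assumption: both ranges are inclusive (start and end residues belong
    to the region). With min_shared=1, a single shared residue counts.
    """
    shared = min(dom_end, pfam_end) - max(dom_start, pfam_start) + 1
    return shared >= min_shared


# A domain spanning residues 10-120 and a Pfam region starting at residue 120
# share exactly one residue, so the label is counted as overlapping.
touching = overlaps(10, 120, 120, 200)

# Shifting the Pfam region by one residue removes the overlap.
disjoint = overlaps(10, 120, 121, 200)
```

Because a single shared residue is enough, a long domain flanked by several Pfam regions can pick up all of their labels, which matches the remark that some metaclusters carry many labels.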
Note: All 'tsv' files contain a header as the first row.
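As a rough sketch of how the classification table could be read, the snippet below skips the header row and groups domains by metacluster, then derives size, len_aa, len_std, and len_ratio in the spirit of mcs_properties.tsv. Several things here are assumptions rather than facts about the dataset: the column order follows the list given for dpcstruct_classification.tsv, domain length is taken as end - start + 1 (inclusive coordinates), the population standard deviation is used, and the sample rows and header names are made up.

```python
import csv
import io
import statistics
from collections import defaultdict


def load_classification(handle):
    """Group domains by metacluster from a tab-separated stream with a header row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # every TSV file in the dataset starts with a header row
    clusters = defaultdict(list)
    for row in reader:
        if not row:  # tolerate blank lines
            continue
        mc_id, protein_id, start, end = row[:4]
        clusters[mc_id].append((protein_id, int(start), int(end)))
    return clusters


def length_stats(domains):
    """Compute size and length statistics for one metacluster."""
    # Assumption: coordinates are inclusive, so length = end - start + 1.
    lengths = [end - start + 1 for _, start, end in domains]
    mean = statistics.fmean(lengths)
    # Assumption: population standard deviation; the dataset may use another convention.
    std = statistics.pstdev(lengths)
    return {"size": len(lengths), "len_aa": mean, "len_std": std, "len_ratio": std / mean}


# Made-up sample rows; header names and IDs are hypothetical.
sample = io.StringIO(
    "mcID\tprotein\tstart\tend\n"
    "1\tP00001\t1\t100\n"
    "1\tP00002\t11\t130\n"
    "2\tP00003\t5\t54\n"
)
clusters = load_classification(sample)
stats = {mc: length_stats(doms) for mc, doms in clusters.items()}
```

For the real file, replace the `io.StringIO` sample with `open("dpcstruct_classification.tsv", newline="")`.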
If you have any questions about the data or notice that something is missing, please contact us:
federico.barone@areasciencepark.it
Files (1.3 GB)
| MD5 | Size |
|---|---|
| md5:f6491ea3c21a88c64cd2c45d8d2e8cec | 166.3 kB |
| md5:c8aa536b0e8251c635d451412c428c4e | 162.1 kB |
| md5:c7c54380ad30d1f97aad2d9758de0788 | 188.2 kB |
| md5:9cdf35ad337589ab8d1bb354af61d647 | 181.4 kB |
| md5:586802b8857511d4f2b3be270b9e8f12 | 38.3 MB |
| md5:a5cb8307a311fc157a8056cb4062eba4 | 361.1 kB |
| md5:88f3070908223a679246e1eeea066275 | 2.4 MB |
| md5:c25e303fbf621362e03072f4fb33575f | 10.2 MB |
| md5:2cf9dd2cdcfdcc696336ee053c27d341 | 1.3 GB |
| md5:05af7e5ccb67623efdbdcc5ba1febb19 | 190.4 kB |
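The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5sum` and `verify` are ours, and the self-check runs on a throwaway file rather than on any file from this dataset.

```python
import hashlib
import os
import tempfile


def md5sum(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Compare a file's MD5 digest with an expected value (the 'md5:' prefix is optional)."""
    expected = expected_md5.lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5sum(path) == expected


# Self-check on a temporary file with known content.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world")
check_ok = verify(tmp.name, "md5:5eb63bbbe01eeed093cb22bb8f5acdc3")
os.remove(tmp.name)
```

To check a downloaded file, call `verify` with its local path and the checksum shown for that file on the record page.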
Additional details
Related works
- Is variant form of: Journal article: 10.1371/journal.pcbi.1010610 (DOI)
Software
- Repository URL: https://github.com/RitAreaSciencePark/DPCstruct
- Programming language: C++
- Development Status: Active