The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4

Lau, Andy; Bordin, Nicola; Kandathil, Shaun; Sillitoe, Ian; Waman, Vaishali; Wells, Jude; Orengo, Christine; Jones, David T

doi:10.5281/zenodo.13908086

Published October 31, 2024 | Version v5

Dataset Open

1. University College London

Dataset description:

The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 365 million domain assignments.

In this data release, we make available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structure elements), taxonomy, and putative CATH superfamily or fold assignments, for all 365 million domains (~324 million domains in TED100 and ~40 million domains in TED-redundant).

The chain-level TED-redundant files contain, for every chain, the boundary predictions, the consensus level and information on the TED100 representative.

For both TED100 and TED-redundant we provide the domain boundary predictions output by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc).

We are making available 7,427 PDB files for potentially novel folds identified during the TED classification process, with an annotation table sorted by novelty, as well as 6,433 representatives of highly symmetrical folds.

Please use the gunzip command to extract files with a '.gz' extension and "tar -xzvf file.tar.gz" to open .tar.gz files.
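
Because several of the tables are many gigabytes in size, it can be convenient to stream a '.gz' file instead of extracting it first. The following is a minimal sketch using only the Python standard library; the helper name iter_tsv_rows and the demonstration row are illustrative and not part of the dataset or of ted-tools.

```python
import gzip
import io


def iter_tsv_rows(path_or_file):
    """Yield each line of a gzip-compressed TSV as a list of string fields.

    Lines are decompressed and read one at a time, so the whole file is
    never held in memory.
    """
    with gzip.open(path_or_file, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n").split("\t")


# Demonstration on an in-memory gzip stream holding one invented row.
_demo_line = "\t".join(["AF-A0A000-F1-model_v4_TED01", "53-288", "high"]) + "\n"
_demo_stream = io.BytesIO(gzip.compress(_demo_line.encode("utf-8")))
demo_rows = list(iter_tsv_rows(_demo_stream))
```

For a downloaded file, the same function can be given the file path, for example iter_tsv_rows("ted_365m_domain_boundaries_consensus_level.tsv.gz").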

CATH annotations have been assigned using the Foldseek algorithm applied in various modes, and the Foldclass algorithm, both of which are used to report significant structural similarity to a known CATH domain.

Note: The TED protocol differs from that of the standard CATH Assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for classification into superfamilies.

Changelog Version 5:

Add: ted_365m.domain_summary.cath.globularity.taxid.tsv.tar.gz - This table, in the same format as the previous ted_100_324m.domain_summary.cath.globularity.taxid.tsv.tar.gz, contains per-domain annotations for the whole of TED, including domain quality metrics such as secondary structure element counts, globularity scores and average pLDDT, as well as taxonomic assignments.
Add: high_symmetry_folds_set.domain_summary.tsv.gz - subset of ted_365m.domain_summary.cath.globularity.taxid.tsv containing information on 6,433 high symmetry folds in TED. The entries are sorted in descending order by Z-score obtained from SymD.
Add: high_symmetry_folds_set_models.tar.gz - TED domain models in PDB format for 6,433 high symmetry folds in TED.
Add: ISP_data.tar.gz - Raw data for Interacting SuperFamily Pairs calculations used in the manuscript. A more detailed description of the ISP data is available below as well as within the tar.gz file.
Add: ted_redundant_40m_domain_id.list.gz - list of TED_domain_ID in TED redundant
Add: ted_100_324m_domain_id.list.gz - list of TED_domain_ID in TED100
Fix/Replace: A domain-level summary of TED, now consolidated into ted_365m.domain_summary.cath.globularity.taxid.tsv, is consistent with the protocol used in the manuscript. As Foldclass and Foldseek T-level hits provide all 4 CATH digits, we removed the H portion of the CATH code from each prediction at the T-level.
Previously, the following columns
14. cath_label - CATH superfamily code if predicted, either a C.A.T.H. homologous superfamily or C.A.T. fold assignment. i.e. 3.40.50.300
15. cath_assignment_level - H for homologous superfamily assignment, T for fold level assignment.
16. cath_assignment_method - Method used to assign a CATH label, either Foldseek or Foldclass
sometimes showed an additional label with a T-level prediction by Foldclass in the case of T-level assignments obtained by Foldseek, e.g.
3.40.30,3.40.30 T foldseek,foldclass
This has now been corrected to reflect the TED protocol, with Foldclass T-level assignments applied only to domains where a T-level assignment could not be applied using Foldseek, e.g.
domain-x 3.40.30 T foldseek
domain-y 3.20.20 T foldclass

Thus, in the current version of the data, the CATH assignment label can only be one of the following:
H-level assignment by Foldseek (e.g. 3.40.50.300 H foldseek)
T-level assignment by Foldseek (e.g. 3.40.30 T foldseek)
T-level assignment by Foldclass (e.g. 3.40.30 T foldclass)
or no assignment (- - -)
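
The four permitted combinations can be checked mechanically when reading the table. The sketch below is one possible validation, assuming the three values arrive as plain strings exactly as shown above; the function name classify_cath_assignment is hypothetical.

```python
# Level/method pairs that the changelog above allows for an assigned domain.
ALLOWED_LEVEL_METHOD = {("H", "foldseek"), ("T", "foldseek"), ("T", "foldclass")}


def classify_cath_assignment(label, level, method):
    """Return 'H', 'T' or None (no assignment) for one row.

    Raises ValueError for any combination not listed in the changelog,
    such as the doubled labels present in earlier versions.
    """
    if (label, level, method) == ("-", "-", "-"):
        return None
    if (level, method) not in ALLOWED_LEVEL_METHOD:
        raise ValueError(f"unexpected level/method: {level!r}, {method!r}")
    # H-level labels carry four CATH digits (C.A.T.H), T-level labels three (C.A.T).
    expected_parts = 4 if level == "H" else 3
    if len(label.split(".")) != expected_parts:
        raise ValueError(f"label {label!r} does not match level {level!r}")
    return level
```

With this check, a legacy row such as "3.40.30,3.40.30 T foldseek,foldclass" is rejected rather than silently accepted.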

This dataset contains:

ted_214m_per_chain_segmentation.tsv
The file contains all 214M protein chains in TED with consensus domain boundaries and proteome information in the following columns.
1. AFDB_model_ID: chain identifier from AFDB in the format AF-<UniProtID>-F1-model_v4, e.g. AF-A0A1V6M2Y0-F1-model_v4
2. md5 hash for chain sequence
3. nres - number of residues in chain
4. n_high - number of high consensus domains predicted in chain
5. n_med - number of medium consensus domains predicted in chain
6. n_low - number of low consensus domains predicted in chain
7. high_consensus - boundaries of high consensus domains predicted in chain. If none, 'na'
8. med_consensus - boundaries of medium consensus domains predicted in chain. If none, 'na'
9. low_consensus - boundaries of low consensus domains predicted in chain. If none, 'na'
10. proteome_id - proteome identifier in the format proteome-tax_id-<taxonID>-<shard>_v4, e.g. proteome-tax_id-67581-0_v4
ted_365m_domain_boundaries_consensus_level.tsv.gz
The file contains all domain assignments in TED100 and TED-redundant (365M) in the format:
1. TED_ID: TED domain identifier in the format AF-<UniProtID>-F1-model_v4_TED<domain_number_in_chain> i.e. AF-A0A1V6M2Y0-F1-model_v4_TED03
2. Boundaries: domain boundaries in the format <start>-<stop> or <start>-<stop>_<start>-<stop> for discontinuous domains.
3. Consensus: either high or medium.
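
A TED domain identifier can be split back into its AFDB chain identifier, UniProt accession and domain number. The sketch below assumes identifiers follow the format given above exactly (single fragment F1, model_v4 suffix, alphanumeric accession); the pattern and the function name split_ted_id are illustrative.

```python
import re

# Pattern derived from the documented format
# AF-<UniProtID>-F1-model_v4_TED<domain_number_in_chain>.
TED_ID_PATTERN = re.compile(
    r"^(?P<chain>AF-(?P<uniprot>[A-Za-z0-9]+)-F1-model_v4)_TED(?P<number>\d+)$"
)


def split_ted_id(ted_id):
    """Return (AFDB chain ID, UniProt accession, domain number as int)."""
    match = TED_ID_PATTERN.match(ted_id)
    if match is None:
        raise ValueError(f"not a TED domain identifier: {ted_id!r}")
    return match.group("chain"), match.group("uniprot"), int(match.group("number"))
```

The chain identifier returned here is the key needed to join domain-level rows with the chain-level tables.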
ted_100_324m_domain_id.list.gz - list of ~324 million domain identifiers in TED100, one per line in the format AF-<UniProtID>-F1-model_v4_TED<domain_number_in_chain> i.e. AF-A0A1V6M2Y0-F1-model_v4_TED0
ted_redundant_40m_domain_id.list.gz - list of ~40 million domain identifiers in TED redundant, one per line in the format AF-<UniProtID>-F1-model_v4_TED<domain_number_in_chain> i.e. AF-A0A1V6M2Y0-F1-model_v4_TED0
ted_365m.domain_summary.cath.globularity.taxid.tsv, novel_folds_set.domain_summary.tsv and high_symmetry_folds_set.domain_summary.tsv are header-less with the following columns separated by tabs (.tsv). novel_folds_set.domain_summary.tsv is sorted by novelty

1. ted_id - TED domain identifier in the format AF-<UniProtID>-F1-model_v4_TED<domain_number_in_chain> i.e. AF-A0A1V6M2Y0-F1-model_v4_TED03
2. md5_domain - md5 hash of domain sequence
3. consensus_level - medium (2 methods agreement) or high (3 methods agreement)
4. chopping - domain boundaries in the format <start>-<stop> or <start>-<stop>_<start>-<stop> for discontinuous domains
5. nres_domain - number of residues in domain
6. num_segments - number of individual segments in domain.
7. plddt - average pLDDT for domain (range from 0 to 100)
8. num_helix_strand_turn - total number of helices, strands and turns predicted by STRIDE
9. num_helix - number of helices predicted by STRIDE
10. num_strand - number of strands predicted by STRIDE
11. num_helix_strand - number of helices and strands predicted by STRIDE
12. num_turn - number of turns predicted by STRIDE
13. proteome_id - proteome identifier in the format proteome-tax_id-<taxonID>-<shard>_v4 i.e. proteome-tax_id-67581-0_v4
14. cath_label - CATH code if predicted, either a C.A.T.H homologous superfamily or a C.A.T fold assignment, e.g. 3.40.50.300. Otherwise '-'
15. cath_assignment_level - H for homologous superfamily assignment, T for fold level assignment. Otherwise '-'
16. cath_assignment_method - Method used to assign a CATH label, either foldseek or foldclass. Otherwise '-'
17. packing_density - metric used to determine globularity. A domain with packing_density >= 10.333 and norm_rg < 0.356 is considered globular.
18. norm_rg - normalised radius of gyration. A domain with packing_density >= 10.333 and norm_rg < 0.356 is considered globular.
19. tax_common_name - Common name for organism
20. tax_scientific_name - Scientific name for organism
21. tax_lineage - Full taxonomic lineage.
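
Since these tables are header-less, the column names have to be supplied when reading them. The following sketch attaches the 21 names listed above and applies the globularity rule (packing_density >= 10.333 and norm_rg < 0.356). The function names and the demonstration row are invented for illustration, and the sketch assumes both globularity columns always hold numeric values.

```python
import csv
import io

# Column names in the order documented above.
DOMAIN_SUMMARY_COLUMNS = [
    "ted_id", "md5_domain", "consensus_level", "chopping", "nres_domain",
    "num_segments", "plddt", "num_helix_strand_turn", "num_helix",
    "num_strand", "num_helix_strand", "num_turn", "proteome_id",
    "cath_label", "cath_assignment_level", "cath_assignment_method",
    "packing_density", "norm_rg", "tax_common_name", "tax_scientific_name",
    "tax_lineage",
]


def is_globular(packing_density, norm_rg):
    """Apply the documented thresholds for a globular domain."""
    return packing_density >= 10.333 and norm_rg < 0.356


def read_domain_summary(handle):
    """Yield one dict per row, with an added boolean 'globular' entry."""
    reader = csv.DictReader(
        handle,
        fieldnames=DOMAIN_SUMMARY_COLUMNS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    for row in reader:
        row["globular"] = is_globular(
            float(row["packing_density"]), float(row["norm_rg"])
        )
        yield row


# Invented row, used only to demonstrate the column layout.
_demo_fields = [
    "AF-A0A1V6M2Y0-F1-model_v4_TED03", "0" * 32, "high", "10-52_289-394",
    "149", "2", "90.5", "12", "4", "5", "9", "3",
    "proteome-tax_id-67581-0_v4", "3.40.50.300", "H", "foldseek",
    "11.2", "0.30", "-", "-", "-",
]
demo_summary = list(read_domain_summary(io.StringIO("\t".join(_demo_fields) + "\n")))
```

The handle can be any text-mode file object, including one returned by gzip.open(path, "rt") for the compressed tables.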
ted_324m_seq_clustering.cathlabels.tsv.gz
The file contains the results of the domain sequences clustering with MMseqs2.
Columns:
1. Cluster_representative
2. Cluster_member
3. CATH code assignment if available i.e. 3.40.50.300 for a domain with a homologous match or 3.20.20 for a domain matching at the fold level in the CATH classification
4. CATH assignment type - either Foldseek-T, Foldseek-H or Foldclass
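
The representative/member layout can be turned into a mapping from each cluster representative to its members. This is a small sketch suitable for a subset of the file, since holding all ~324 million rows in memory is unlikely to be practical; the function name and the identifiers in the demonstration are made up.

```python
from collections import defaultdict


def group_clusters(lines):
    """Map each cluster representative to the list of its members.

    Only the first two tab-separated columns are used; the CATH columns
    are ignored here.
    """
    clusters = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        representative, member = fields[0], fields[1]
        clusters[representative].append(member)
    return dict(clusters)


# Invented rows, used only to demonstrate the grouping.
_demo_lines = [
    "AF-REP1-F1-model_v4_TED01\tAF-REP1-F1-model_v4_TED01\t3.40.50.300\tFoldseek-H\n",
    "AF-REP1-F1-model_v4_TED01\tAF-MEM1-F1-model_v4_TED02\t3.40.50.300\tFoldseek-H\n",
    "AF-REP2-F1-model_v4_TED01\tAF-REP2-F1-model_v4_TED01\t3.20.20\tFoldclass\n",
]
demo_clusters = group_clusters(_demo_lines)
```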
Domain assignments for TED redundant using single-chain and multi-chain consensus in ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv.gz and ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv.gz
The file ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv.gz contains a header with the following fields. Each column is tab-separated (.tsv).
1. TED_redundant_id - TED chain identifier in the format AF-<UniProtID>-F1-model_v4 i.e. AF-A0A1V6M2Y0-F1-model_v4
2. md5 - md5 hash for chain sequence
3. nres - number of residues in chain
4. n_high - number of high consensus domains predicted in chain
5. n_med - number of medium consensus domains predicted in chain
6. high_consensus - boundaries of high consensus domains predicted in chain
7. med_consensus - boundaries of medium consensus domains predicted in chain
8. ndom_consensus - number of consensus domains predicted in chain
9. n_targets - number of chains considered for consensus calculation
10. proteome_id - proteome identifier in the format proteome-tax_id-<taxonID>-<shard>_v4 i.e. proteome-tax_id-67581-0_v4
11. TED_redundant_species - Scientific name for organism the chain originally comes from.
12. TED100_chain_rep - TED100 representative for chain
13. TED100_chain_rep_species - Species of TED100 representative for chain.
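
Because this file carries a header, its columns can be addressed by name. The sketch below lists TED-redundant chains whose TED100 representative comes from a different species. It assumes the header uses exactly the field names listed above; the function name and the demonstration rows are invented.

```python
import csv
import io


def cross_species_representatives(handle):
    """Yield (chain ID, TED100 representative) where the species differ."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if row["TED_redundant_species"] != row["TED100_chain_rep_species"]:
            yield row["TED_redundant_id"], row["TED100_chain_rep"]


# Invented header and rows following the documented column order.
_header = [
    "TED_redundant_id", "md5", "nres", "n_high", "n_med", "high_consensus",
    "med_consensus", "ndom_consensus", "n_targets", "proteome_id",
    "TED_redundant_species", "TED100_chain_rep", "TED100_chain_rep_species",
]
_rows = [
    ["AF-EXAMPLE1-F1-model_v4", "0" * 32, "300", "1", "1", "5-110", "120-290",
     "2", "3", "proteome-tax_id-67581-0_v4", "Species a",
     "AF-EXAMPLE2-F1-model_v4", "Species b"],
    ["AF-EXAMPLE3-F1-model_v4", "1" * 32, "300", "1", "1", "5-110", "120-290",
     "2", "3", "proteome-tax_id-67581-0_v4", "Species a",
     "AF-EXAMPLE4-F1-model_v4", "Species a"],
]
_demo_text = "\n".join("\t".join(fields) for fields in [_header] + _rows) + "\n"
demo_pairs = list(cross_species_representatives(io.StringIO(_demo_text)))
```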
The file ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv.gz contains a header with the following fields. Each column is tab-separated (.tsv).
1. TED_redundant_id - TED chain identifier in the format AF-<UniProtID>-F1-model_v4 i.e. AF-A0A1V6M2Y0-F1-model_v4
2. md5 - md5 hash for chain sequence
3. nres - number of residues in chain
4. n_high - number of high consensus domains predicted in chain
5. n_med - number of medium consensus domains predicted in chain
6. high_consensus - boundaries of high consensus domains predicted in chain
7. med_consensus - boundaries of medium consensus domains predicted in chain
8. proteome_id - proteome identifier in the format proteome-tax_id-<taxonID>-<shard>_v4 i.e. proteome-tax_id-67581-0_v4
9. TED_redundant_species - Scientific name for organism the chain originally comes from
10. TED100_chain_rep - TED100 representative for chain
11. TED100_chain_rep_species - Species of TED100 representative for chain.

novel_folds_set_models.tar.gz contains PDB files of all novel fold representatives identified in TED100.
high_symmetry_folds_set_models.tar.gz contains PDB files of all highly symmetrical fold representatives identified in TED100.
Per-tool domain boundary predictions - All per-tool domain boundary predictions for TED100 and TED-redundant are in the same format with the following columns.
1. TED_chainID - TED chain identifier in the format AF-<UniProtID>-F1-model_v4, e.g. AF-A0A1V6M2Y0-F1-model_v4
2. TED_chain_md5 - md5 hash for chain sequence
3. TED_chain_length - number of residues in chain
4. ndoms - number of domains predicted in chain
5. Domain boundaries - domain boundaries in the format <start>-<stop> or <start>-<stop>_<start>-<stop> for discontinuous domains
6. Prediction probability - probability of each per-chain prediction
Domain boundary predictions share the same format, with domains separated by ',', the segments of a domain separated by '_' and segment boundaries (start, stop) separated by '-'.

e.g. domain prediction by Merizo for AF-A0A000-F1-model_v4:
AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 394 2 10-52_289-394,53-288 0.90077

Merizo predicts one discontinuous domain and one continuous domain:
Domain 1 (discontinuous): 10-52_289-394
    segment 1: 10-52
    segment 2: 289-394
Domain 2 (continuous): 53-288
    segment 1: 53-288
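
The boundary string from the Merizo example can be parsed into domains and segments as sketched below. The function names are illustrative, and the residue count assumes that start and stop are both inclusive, which is not stated explicitly above.

```python
def parse_chopping(chopping):
    """Split a boundary string into domains, each a list of (start, stop).

    Domains are separated by ',', segments by '_' and the two segment
    boundaries by '-'.
    """
    domains = []
    for domain in chopping.split(","):
        segments = []
        for segment in domain.split("_"):
            start, stop = segment.split("-")
            segments.append((int(start), int(stop)))
        domains.append(segments)
    return domains


def domain_length(segments):
    """Number of residues in a domain, assuming inclusive boundaries."""
    return sum(stop - start + 1 for start, stop in segments)


merizo_domains = parse_chopping("10-52_289-394,53-288")
```

A domain is discontinuous when its list holds more than one segment, so len(segments) > 1 distinguishes the two cases in the example.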
ISP_data.tar.gz contains raw data for the Interacting Superfamily Pairs (ISP) calculations featured in the manuscript. The archive contains a README as well as the following files:
all_ISP_data_cath.pkl:
ISP data for CATH 4.3 in Python pickle format.
A Python dictionary with the following contents:
Each key is an ISP, e.g. '3.40.640.10-3.90.1150.10'
Each value is a dictionary with the following contents:
key 'aligned_domain_pairs': value is a Python list of length N, each element is a string specifying the two TED domain IDs in contact, separated by a colon, e.g. "AF-A0A000-F1-model_v4_TED02:AF-A0A000-F1-model_v4_TED01"
key 'vectors': value is a numpy.ndarray of shape (N, 3). Each row is a raw unnormalized interaction vector for the corresponding domain pair, after aligning to the reference structure.
Any given index in each list or ndarray has the data for a single domain pair; the order is constant in each list/array.
----------------------------------
all_ISP_data_afdb.pkl:
ISP data for TED100 in Python pickle format.
A Python dictionary with the following contents:
Each key is an ISP, e.g. '3.40.640.10-3.90.1150.10'
Each value is a dictionary with the following contents:
key 'aligned_domain_pairs': value is a Python list of length N, each element is a string specifying the two TED domain IDs in contact, separated by a colon, e.g. "AF-A0A000-F1-model_v4_TED02:AF-A0A000-F1-model_v4_TED01"
key 'vectors': value is a numpy.ndarray of shape (N, 3). Each row is a raw unnormalized interaction vector for the corresponding domain pair, after aligning to the reference structure.
key 'choppings': value is a Python list of length N, each element is a colon-separated string containing the TED chopping strings for the domains in contact, e.g. "54-288:11-41_290-389". The format for each 'chopping' follows that used in the main TED TSV files.
key 'pae_score': value is a numpy.ndarray of floats of shape (N,). Each element is the median PAE score between the domains in contact, computed across both relevant parts of the PAE matrix as described in the paper.
Any given index in each list or ndarray has the data for a single domain pair; the order is constant in each list/array.
NB: each list of domains has not been filtered by PAE score, so the pae_score values include values greater than 4.0, which was the threshold used to filter confident predictions in the manuscript.
-------------------------------------
isp_data_afdbonly_nopaefilter.csv:
A subset of the data in the TED100 .pkl file above, in CSV format.
Each row contains the following fields:
AFDB ID, e.g. AF-A0A000-F1-model_v4
ISP, e.g. 3.40.640.10-3.90.1150.10
Colon-separated domain ID pair, e.g. AF-A0A000-F1-model_v4_TED02:AF-A0A000-F1-model_v4_TED01
PAE score, e.g. '4.0'
As with the files above, this data has not been filtered by PAE score.
ted-tools-main.zip - copy of the https://github.com/psipred/ted-tools repository, containing tools and software used to generate TED.
cath-alphaflow-main.zip - copy of CATH-AlphaFlow, used to generate globularity scores for TED domains.
ted-web-master.zip - copy of TED-web, containing code to generate the web interface of TED (https://ted.cathdb.info)
gofocus_data.tar.bz2 - GOFocus model weights

Files (69.6 GB)

Name	Size
cath-alphaflow-main.zip md5:ffd9ce6d5ef54f4e8b77e705fe1ad464	2.4 MB
gofocus_data.tar.bz2 md5:2f5a724b81f7df8059f460ee4667640c	2.6 GB
high_symmetry_folds_set.domain_summary.tsv.gz md5:126ef69099e20aaac597c452a820fd42	531.9 kB
high_symmetry_folds_set_models.tar.gz md5:7991ac56b535bd91dafd90005a376e6c	153.3 MB
ISP_data.tar.gz md5:fce964fd32311656893f5066cf71a111	2.3 GB
novel_folds_set.domain_summary.tsv.gz md5:0b916cd1b666d2334787c807c7c96d55	716.6 kB
novel_folds_set_models.tar.gz md5:16ae5869086328a9a632595ab3a8f682	158.9 MB
ted-tools-main.zip md5:87183e4d56251b66de2d0a913ae9e5d9	18.1 MB
ted-web-master.zip md5:36ba4d61312179a42beb5c6baa32b266	11.5 MB
ted_100_188m.chainsaw.filtered.tsv.gz md5:ddb1545c5bc8cb6cab8e8f629377121e	7.0 GB
ted_100_188m.merizo.filtered.tsv.gz md5:bec4efed9b84e41ca6ec73879306a61d	7.7 GB
ted_100_188m.unidoc-NDR.filtered.tsv.gz md5:357a3a4705ae8776be51d4276f3ed579	6.8 GB
ted_100_324m_domain_id.list.gz md5:92b39fbd1952995c8b8109ebde3b16d6	839.3 MB
ted_214m_per_chain_segmentation.tsv.gz md5:99b6afad9d670aa51813397d58f8e7c6	9.6 GB
ted_324m_seq_clustering.cathlabels.tsv.gz md5:0bafd4153561ef34d25efe5e56a882ed	3.3 GB
ted_365m.domain_summary.cath.globularity.taxid.tsv.gz md5:51468b8ec6c74388085ebe0f66f973c5	19.9 GB
ted_365m_domain_boundaries_consensus_level.tsv.gz md5:6fb6837a40a41e0fbb10191c77ec5d71	2.9 GB
ted_redundant_26m.chainsaw.filtered.tsv.gz md5:ad99b3db16c05590991d1661d26c4382	907.5 MB
ted_redundant_26m.merizo.filtered.tsv.gz md5:d945eb0d7378c923584fe1d0d65e9571	998.1 MB
ted_redundant_26m.unidoc-ndr.filtered.tsv.gz md5:25aa365e2975914ed19d567ef9ba325c	870.2 MB
ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv.gz md5:ffc9b1c9c50d9a2eb36e8d258657c103	1.8 GB
ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv.gz md5:6c8f33d844924e28c7378b25d33d97c4	1.7 GB
ted_redundant_40m_domain_id.list.gz md5:efbf7e7b54b4dc85ea3fcfc63db4dd39	114.1 MB
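
The md5 checksums listed for each file can be used to confirm that a download completed without corruption. The following is a minimal sketch using the Python standard library; the function name md5_of_file is illustrative, and the file is read in chunks because several of the archives are many gigabytes in size.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal md5 digest of a file, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, md5_of_file("ted_redundant_40m_domain_id.list.gz") should equal "efbf7e7b54b4dc85ea3fcfc63db4dd39" if the download is intact.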

Additional details

Available: 2024-10-31

