The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4

Lau, Andy; Bordin, Nicola; Kandathil, Shaun; Sillitoe, Ian; Waman, Vaishali; Wells, Jude; Orengo, Christine; Jones, David T

doi:10.5281/zenodo.10848710

Published March 20, 2024 | Version v2

Dataset Restricted

1. University College London

Dataset description:

The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments.

In this data release, we make available to the community a table of domain boundaries with additional metadata on quality (pLDDT, globularity, number of secondary structure elements), taxonomy, and putative CATH superfamily or fold assignments for all 324 million domains in TED100.

For all chains in the TED-redundant dataset, the attached file contains boundary predictions, consensus level, and information on the TED100 representative.

Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes:

Organism TaxonID

arabidopsis_thaliana 3702
caenorhabditis_elegans 6239
candida_albicans 237561
danio_rerio 7955
dictyostelium_discoideum 44689
drosophila_melanogaster 7227
escherichia_coli 83333
glycine_max 3847
homo_sapiens 9606
methanocaldococcus_jannaschii 243232
mus_musculus 10090
oryza_sativa 39947
rattus_norvegicus 10116
saccharomyces_cerevisiae 559292
schizosaccharomyces_pombe 284812
zea_mays 4577
ajellomyces_capsulatus 447093
brugia_malayi 6279
campylobacter_jejuni 192222
cladophialophora_carrionii 86049
dracunculus_medinensis 318479
fonsecaea_pedrosoi 1442368
haemophilus_influenzae 71421
helicobacter_pylori 85962
klebsiella_pneumoniae 1125630
leishmania_infantum 5671
madurella_mycetomatis 100816
mycobacterium_leprae 272631
mycobacterium_tuberculosis 83332
mycobacterium_ulcerans 1299332
neisseria_gonorrhoeae 242231
nocardia_brasiliensis 1133849
onchocerca_volvulus 6282
paracoccidioides_lutzii 502779
plasmodium_falciparum 36329
pseudomonas_aeruginosa 208964
salmonella_typhimurium 99287
schistosoma_mansoni 6183
shigella_dysenteriae 300267
sporothrix_schenckii 1391915
staphylococcus_aureus 93061
streptococcus_pneumoniae 171101
strongyloides_stercoralis 6248
trypanosoma_brucei 185431
trypanosoma_cruzi 353153
wuchereria_bancrofti 6293

For both TED100 and TED-redundant, we provide the domain boundary predictions produced by each of the three methods used in the project (Chainsaw, Merizo, UniDoc).

We also make available PDB files for 7,427 novel folds identified during the TED classification process, together with an annotation table sorted by novelty.

This dataset contains:

ted_100_324m.domain_summary.cath.globularity.taxid.tsv and novel_folds_set.domain_summary.tsv are header-less, tab-separated (.tsv) files with the columns listed below.
novel_folds_set.domain_summary.tsv is sorted by novelty.
1. ted_id
2. md5_domain
3. consensus_level
4. chopping
5. nres_domain
6. num_segments
7. plddt
8. num_helix_strand_turn
9. num_helix
10. num_strand
11. num_helix_strand
12. num_turn
13. proteome_id
14. cath_label
15. cath_assignment_level
16. cath_assignment_method
17. packing_density
18. norm_rg
19. tax_common_name
20. tax_scientific_name
21. tax_lineage
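Because these two tables carry no header row, the column names have to be supplied by the reader. The following is a minimal sketch, assuming only the 21-column order listed above; the function name `read_domain_summary` is hypothetical, and the sample row is filled with placeholder strings rather than real TED values.

```python
import csv
import io

# Column names in the order listed above for the header-less tables.
TED100_COLUMNS = [
    "ted_id", "md5_domain", "consensus_level", "chopping", "nres_domain",
    "num_segments", "plddt", "num_helix_strand_turn", "num_helix",
    "num_strand", "num_helix_strand", "num_turn", "proteome_id",
    "cath_label", "cath_assignment_level", "cath_assignment_method",
    "packing_density", "norm_rg", "tax_common_name", "tax_scientific_name",
    "tax_lineage",
]


def read_domain_summary(handle):
    """Yield one dict per row of a header-less, tab-separated summary file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, fields in enumerate(reader, start=1):
        if len(fields) != len(TED100_COLUMNS):
            raise ValueError(
                f"line {line_number}: expected {len(TED100_COLUMNS)} "
                f"fields, got {len(fields)}"
            )
        yield dict(zip(TED100_COLUMNS, fields))


# Placeholder row (value1 ... value21); not real data from the release.
sample_row = "\t".join(f"value{i}" for i in range(1, 22)) + "\n"
rows = list(read_domain_summary(io.StringIO(sample_row)))
print(rows[0]["chopping"])  # prints "value4", the fourth column
```

The generator streams rows one at a time, which is likely the safer choice for a table with hundreds of millions of lines; all values are left as strings, so numeric columns such as plddt would need converting by the caller.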
Domain assignments for TED-redundant are in ted_redundant_39m.consensus_domain_summary.taxid.tsv.
The file has a header row with the following tab-separated (.tsv) fields.
1. TED_redundant_id
2. md5
3. nres
4. n_high
5. n_med
6. high_consensus
7. med_consensus
8. ndom_consensus
9. n_targets
10. proteome_id
11. TED_redundant_species
12. TED100_chain_rep
13. TED100_chain_rep_species
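Since this file carries a header, the columns can be addressed by name. As a sketch of one possible use, the snippet below maps each TED-redundant chain to its TED100 representative. It assumes the header spells the field names exactly as listed above; `load_representatives` is a hypothetical helper, and the sample row contains placeholder strings only.

```python
import csv
import io

# Field names as listed above; assumed to match the file's header row.
REDUNDANT_COLUMNS = [
    "TED_redundant_id", "md5", "nres", "n_high", "n_med", "high_consensus",
    "med_consensus", "ndom_consensus", "n_targets", "proteome_id",
    "TED_redundant_species", "TED100_chain_rep", "TED100_chain_rep_species",
]


def load_representatives(handle):
    """Map each TED_redundant_id to its TED100_chain_rep."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    needed = {"TED_redundant_id", "TED100_chain_rep"}
    missing = needed - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    return {row["TED_redundant_id"]: row["TED100_chain_rep"] for row in reader}


# Header plus one placeholder row; not real data from the release.
header = "\t".join(REDUNDANT_COLUMNS)
row = "\t".join(f"placeholder_{name}" for name in REDUNDANT_COLUMNS)
representatives = load_representatives(io.StringIO(header + "\n" + row + "\n"))
print(representatives)
```

For the full file, an in-memory dictionary may be too large to hold comfortably; iterating over the `DictReader` and filtering rows as they arrive is a lighter alternative.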
novel_folds_set_models.tar.gz contains PDB files of all novel folds identified in TED100.
All per-tool domain boundary prediction files share the same format, with the following columns.
1. TED_chainID
2. TED_chain_md5
3. TED_chain_length
4. ndoms
5. Domain boundaries
6. Prediction probability
Domain boundary predictions share the same format: domains are separated by ',', segments within a domain by '_', and segment boundaries (start, stop) by '-'.

    e.g. domain prediction by Merizo for AF-A0A000-F1-model_v4
    AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 394 2 10-52_289-394,53-288 0.90077

    Merizo predicts one discontinuous domain and one continuous domain:
    Domain 1 (discontinuous): 10-52_289-394
        segment 1: 10-52
        segment 2: 289-394
    Domain 2 (continuous): 53-288
        segment 1: 53-288
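The boundary notation can be unpacked with a few string splits. The sketch below parses the Merizo example line above into domains and segments. It makes two assumptions: the ',' separator between domains is taken from the example line, and fields are split on any whitespace (the example is shown with spaces, while the actual files are presumably tab-separated). The function names are hypothetical.

```python
def parse_chopping(chopping):
    """Turn '10-52_289-394,53-288' into a list of domains,
    each a list of (start, stop) segments."""
    domains = []
    for domain in chopping.split(","):          # one entry per domain
        segments = []
        for segment in domain.split("_"):       # one entry per segment
            start, stop = segment.split("-")    # inclusive residue range
            segments.append((int(start), int(stop)))
        domains.append(segments)
    return domains


def parse_prediction_line(line):
    """Parse one per-tool prediction line with the six columns listed above."""
    chain_id, md5, length, ndoms, chopping, probability = line.split()
    return {
        "chain_id": chain_id,
        "md5": md5,
        "length": int(length),
        "ndoms": int(ndoms),
        "domains": parse_chopping(chopping),
        "probability": float(probability),
    }


example = (
    "AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 "
    "394 2 10-52_289-394,53-288 0.90077"
)
record = parse_prediction_line(example)

# Residues per domain, counting both segment ends.
domain_lengths = [
    sum(stop - start + 1 for start, stop in domain)
    for domain in record["domains"]
]
print(record["domains"])  # [[(10, 52), (289, 394)], [(53, 288)]]
print(domain_lengths)     # [149, 236]
```

A domain with more than one segment is discontinuous, so `len(domain) > 1` distinguishes the two cases in the example. How chains with no predicted domains are encoded is not described here, so the sketch does not handle that case.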
model_organisms_and_global_health_proteomes.tar.gz - domain assignments for 21 model organisms and 25 global health proteomes

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.
