There is a newer version of the record available.

Published March 20, 2024 | Version v2
Dataset Restricted

The Encyclopedia of Domains (TED) structural domains assignments for AlphaFold Database v4

Description

Dataset description:

The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments. 

In this data release, we make available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy, and putative CATH superfamily or fold assignments for all 324 million domains in TED100.

For all chains in the TED-redundant dataset, the attached file contains boundary predictions, consensus level, and information on the TED100 representative.

Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes:

Organism TaxonID

arabidopsis_thaliana 3702
caenorhabditis_elegans 6239
candida_albicans 237561
danio_rerio 7955
dictyostelium_discoideum 44689
drosophila_melanogaster 7227
escherichia_coli 83333
glycine_max 3847
homo_sapiens 9606
methanocaldococcus_jannaschii 243232
mus_musculus 10090
oryza_sativa 39947
rattus_norvegicus 10116
saccharomyces_cerevisiae 559292
schizosaccharomyces_pombe 284812
zea_mays 4577
ajellomyces_capsulatus 447093
brugia_malayi 6279
campylobacter_jejuni 192222
cladophialophora_carrionii 86049
dracunculus_medinensis 318479
fonsecaea_pedrosoi 1442368
haemophilus_influenzae 71421
helicobacter_pylori 85962
klebsiella_pneumoniae 1125630
leishmania_infantum 5671
madurella_mycetomatis 100816
mycobacterium_leprae 272631
mycobacterium_tuberculosis 83332
mycobacterium_ulcerans 1299332
neisseria_gonorrhoeae 242231
nocardia_brasiliensis 1133849
onchocerca_volvulus 6282
paracoccidioides_lutzii 502779
plasmodium_falciparum 36329
pseudomonas_aeruginosa 208964
salmonella_typhimurium 99287
schistosoma_mansoni 6183
shigella_dysenteriae 300267
sporothrix_schenckii 1391915
staphylococcus_aureus 93061
streptococcus_pneumoniae 171101
strongyloides_stercoralis 6248
trypanosoma_brucei 185431
trypanosoma_cruzi 353153
wuchereria_bancrofti 6293


For both TED100 and TED-redundant, we provide the domain boundary predictions produced by each of the three methods used in the project (Chainsaw, Merizo, UniDoc).

We also provide PDB files for the 7,427 novel folds identified during the TED classification process, together with an annotation table sorted by novelty.


This dataset contains:

 

  • ted_100_324m.domain_summary.cath.globularity.taxid.tsv and novel_folds_set.domain_summary.tsv are header-less, tab-separated (.tsv) files with the columns listed below.
        novel_folds_set.domain_summary.tsv is sorted by novelty.
        1. ted_id
        2. md5_domain
        3. consensus_level
        4. chopping
        5. nres_domain
        6. num_segments
        7. plddt
        8. num_helix_strand_turn
        9. num_helix
        10. num_strand
        11. num_helix_strand
        12. num_turn
        13. proteome_id
        14. cath_label
        15. cath_assignment_level
        16. cath_assignment_method
        17. packing_density
        18. norm_rg
        19. tax_common_name
        20. tax_scientific_name
        21. tax_lineage
  • ted_redundant_39m.consensus_domain_summary.taxid.tsv contains the domain assignments for TED-redundant.
        The file has a header row with the following tab-separated (.tsv) fields.
        1. TED_redundant_id
        2. md5
        3. nres
        4. n_high
        5. n_med
        6. high_consensus
        7. med_consensus
        8. ndom_consensus
        9. n_targets      
        10. proteome_id    
        11. TED_redundant_species
        12. TED100_chain_rep        
        13. TED100_chain_rep_species
  • novel_folds_set_models.tar.gz contains PDB files of all novel folds identified in TED100.
  • All per-tool domain boundary predictions are in the same format, with the following columns.
        1. TED_chainID
        2. TED_chain_md5
        3. TED_chain_length
        4. ndoms
        5. Domain boundaries
        6. Prediction probability
  • In the Domain boundaries column, domains are separated by ',', segments within a domain by '_', and segment boundaries (start, stop) by '-'.

        For example, the domain prediction by Merizo for AF-A0A000-F1-model_v4:
        AF-A0A000-F1-model_v4    e8872c7a0261b9e88e6ff47eb34e4162    394    2    10-52_289-394,53-288    0.90077

        Merizo predicts one discontinuous domain and one continuous domain:
        Domain 1 (discontinuous): 10-52_289-394
        segment 1: 10-52
        segment 2: 289-394
        Domain 2 (continuous): 53-288
        segment 1: 53-288
  • model_organisms_and_global_health_proteomes.tar.gz - domain assignments for 21 model organisms and 25 global health proteomes
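
The per-tool boundary format described above can be read with a few string splits. The Python sketch below is illustrative only and is not part of the TED release: the names parse_chopping, parse_tool_line and ToolPrediction are hypothetical, the per-tool files are assumed to be tab-separated like the other tables, and chains with no predicted domains are not handled because their encoding is not described in this record.

```python
from typing import List, NamedTuple, Tuple

Segment = Tuple[int, int]   # (start, stop) residue numbers, as written in the file
Domain = List[Segment]      # a domain is one or more segments


class ToolPrediction(NamedTuple):
    """One row of a per-tool prediction file (field names are hypothetical)."""
    chain_id: str
    chain_md5: str
    chain_length: int
    ndoms: int
    domains: List[Domain]
    probability: float


def parse_chopping(chopping: str) -> List[Domain]:
    """Split a boundary string into domains (','), segments ('_') and start/stop ('-')."""
    domains = []
    for domain_text in chopping.split(","):
        segments = []
        for segment_text in domain_text.split("_"):
            start, stop = segment_text.split("-")
            segments.append((int(start), int(stop)))
        domains.append(segments)
    return domains


def parse_tool_line(line: str) -> ToolPrediction:
    """Parse one line of a per-tool file, assuming six tab-separated columns."""
    chain_id, md5, length, ndoms, chopping, probability = line.rstrip("\n").split("\t")
    return ToolPrediction(
        chain_id, md5, int(length), int(ndoms),
        parse_chopping(chopping), float(probability),
    )


# The Merizo example from the description, rebuilt with tab separators.
example = "\t".join([
    "AF-A0A000-F1-model_v4",
    "e8872c7a0261b9e88e6ff47eb34e4162",
    "394",
    "2",
    "10-52_289-394,53-288",
    "0.90077",
])

prediction = parse_tool_line(example)
for number, domain in enumerate(prediction.domains, start=1):
    kind = "discontinuous" if len(domain) > 1 else "continuous"
    print(f"Domain {number} ({kind}): {domain}")
```

The same parse_chopping helper should also work on the chopping column of the domain summary tables, on the assumption that it holds a single domain whose segments are joined with '_'.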

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.