Published March 15, 2023 | Version 2
Dataset Open

Metadata supporting the AFDB90v4 annotated sequence similarity network

  • 1. Biozentrum and SIB Swiss Institute of Bioinformatics
  • 2. Institute of Technology, University of Tartu, Estonia
  • 3. VantAI, New York, USA
  • 4. European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • 5. Institute of Technology, University of Tartu, Estonia; Department of Experimental Medical Science, Lund University, Sweden; Science for Life Laboratory, Lund, Sweden

Description

Driven by the development and upscaling of fast genome sequencing and assembly pipelines, the number of protein-coding sequences deposited in public protein sequence databases is increasing exponentially. Recently, the dramatic success of deep learning-based approaches to protein structure prediction has done the same for protein structures. We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database. These models cover most of the catalogued natural proteins, including those that are difficult to annotate for function or putative biological role using standard, homology-based approaches. In this work, we quantified how much of this "dark matter" of the natural protein universe was structurally illuminated by AlphaFold2 at high predicted accuracy, and modelled this diversity as an interactive sequence similarity network that can be navigated at https://uniprot3d.org/atlas/AFDB90v4. The dataset deposited here contains the metadata generated in this work, which forms the basis of the similarity network and its interpretation. These files were either generated or processed using the code available at https://github.com/ProteinUniverseAtlas/AFDB90v4.

This repository further contains the detailed, individual sequence similarity networks (in CLANS format) generated for the 3 example protein (super)families described in the text.

The full content of this repository includes:

  1. AFDBv4_pLDDT_diggestion_UniRef50_2023-02-01.csv: table listing all UniRef50 clusters in UniProt, including information on structural representatives from AFDB. Each column provides a different annotation, including functional brightness, median and best pLDDT, brightness and structural representatives, etc.

  2. AFDBv4_DUF_dark_diggestion_UniRef50_2023-02-06.csv: table listing all UniRef50 clusters in UniProt and whether they include proteins mapped to known domains of unknown function (DUF).

  3. AFDBv4_90.fasta: FASTA file with the sequences of all selected UniRef50 clusters, used for the all-against-all MMseqs2 searches that form the basis of the network.

  4. AFDB90v4_data.csv: the subset of file (1) that corresponds to the AFDB90v4 dataset, including columns such as functional brightness, median and best pLDDT, brightness and structural representatives, etc.

  5. AFDB90v4_data_with_graph_labels.csv: table listing each individual UniRef50 cluster included in the AFDB90v4 dataset, together with its mapping to communities and connected components.

  6. AFDB90v4_cc_data.csv: table of UniRef50 clusters in connected components, including their annotations and the columns in file (5).

  7. AFDB90v4_cc_data_uniprot_community_taxonomy_map.csv: mapping of each UniProt accession (AC) to its corresponding component, community and taxonomy.

  8. AFDB90v4_subgraphs_summary.csv: table summarising the properties of individual connected components, including the average brightness, the number of members, the number of unique protein sequences, the median length, and the number of communities.

  9. communities_summary.csv: table summarising the properties of individual communities, including average brightness, the number of members, the number of unique protein sequences, the median length, the most common superkingdom represented, the average structure outlier score, etc.

  10. communities_edge_list-coordinates.csv: the coordinates of each community in the graphical representation. Singleton communities or singleton UniRef50 clusters are not included.

  11. communities_edge_list_no_duplicates.csv: list of edges that make up the graph.

  12. node_class.json: map between each UniRef50 cluster and its corresponding component and community.

  13. subgraphs.tar.gz: tar file containing GML files for each individual connected component.

  14. AFDB90v4_outlier_scores.tsv: table containing the outlier scores for each community representative.

  15. AFDB90v4_dark_galaxies_summary.csv: table containing the summary of all dark connected components, including average brightness, median length, representatives, number of communities, etc.

  16. AFDB90v4_uniprot_naming_assessment_counts.csv: table listing the per-component semantic diversity scores, as well as the major source of the titles of the proteins included and their count.

  17. uniprot_naming_assessment.tar.gz: tar file containing the per-community assessment of predicted protein names in UniProt as of February 2023.

  18. CLANS_files.tar.gz: stores the 3 sequence similarity networks, in CLANS format, constructed for the analysis of the sequence diversity and sequence similarities of the proteins in components 27, 159 and 3314. These CLANS files form the basis of panel A in Figures 3 and 4 and Extended Data Figure 5.
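
Several of the tables in this record are multiple gigabytes in size, so it can help to stream them row by row rather than load them whole. The following is a minimal sketch using only the Python standard library, assuming the CSV files are comma-separated with a header row. The column name `community` in the usage note is hypothetical; check the actual header of each file before relying on it.

```python
import csv
from collections import Counter
from typing import Dict, Iterator


def iter_rows(path: str, delimiter: str = ",") -> Iterator[Dict[str, str]]:
    """Yield one row at a time as a dict keyed by the file's header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle, delimiter=delimiter)


def count_by(path: str, column: str) -> Counter:
    """Count rows per distinct value of `column` without loading the whole file."""
    counts: Counter = Counter()
    for row in iter_rows(path):
        # Values are kept as strings, exactly as they appear in the file.
        counts[row[column]] += 1
    return counts
```

For example, `count_by("AFDB90v4_data_with_graph_labels.csv", "community")` would tally the number of UniRef50 clusters per community, provided the file has a column with that (assumed) name.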

 

Notes

This work was supported by funding from the SIB - Swiss Institute of Bioinformatics (https://www.sib.swiss/), the Biozentrum of the University of Basel (https://www.biozentrum.unibas.ch/), by the European Union via project MIBEst H2020-WIDESPREAD-2018-2020/GA number 857518 (T.T. and V.H.), by a grant from the Estonian Research Council (PRG335 to T.T. and V.H.), the Knut and Alice Wallenberg Foundation (2020-0037 to V.H.), Swedish Research Council (Vetenskapsrådet) grants (2021-01146 to V.H.), Cancerfonden (20 0872 Pj to V.H.), and the Biotechnology and Biological Sciences Research Council and the NSF Directorate for Biological Sciences (BB/X012492/1 to A.B.).

Files

Files (19.6 GB)

MD5 checksum                          Size
md5:0f445c09121a4fe27ea8ec32a531099d  879.6 MB
md5:3cb36f96923bf502348d19b594ca14b3  4.1 GB
md5:cd9496bbb54194e46800a93c1c0387c0  3.6 MB
md5:6366e3cdc8475da69446420ccaf4bc57  1.2 GB
md5:74bf56feeba528b0dbb72d647b0aea3a  1.2 GB
md5:508a235190c5aca1490752649fe8c388  101.2 MB
md5:5667d1090b97a5c24037263e68fdadd9  18.1 MB
md5:d2aac68dbe61154f0e6ad520ba88162d  15.6 MB
md5:fad23381870a1c0064b242725210a576  1.6 GB
md5:e6e3a04c40ce3233a8becd8e523dc2f6  409.5 MB
md5:e6db871d1896f865cbc630716b155b42  7.8 GB
md5:1bd28aeb217db2aa23c2f63764459a60  17.2 MB
md5:89f1517edcde527429abdab21219b415  20.2 MB
md5:9151046bc46584bd895a3618b5e868fc  27.8 MB
md5:574dba0fdfbfd6a856811026008846c8  128.4 MB
md5:d62dec6c50c99ec53cc63bf55e844b9e  1.2 GB
md5:46c78158915abfb6474869ef4dd4b324  126.2 MB
md5:0ebc209d7b54f140e99a8cde9b10e3c9  687.6 MB
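
The MD5 checksums listed for the files can be used to confirm that a download completed intact. Below is a minimal sketch, assuming the file has been saved locally; the helper names `md5_of_file` and `matches_checksum` are illustrative and not part of the deposited code. The file is read in chunks so that multi-gigabyte files do not need to fit in memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, listed: str) -> bool:
    """Compare a file against a listed checksum such as 'md5:0f44...'."""
    # Accept the value with or without the 'md5:' prefix used in the listing.
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_checksum(local_path, "md5:cd9496bbb54194e46800a93c1c0387c0")` returns `True` only if the local copy is byte-identical to the file that checksum was computed from.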

Additional details

Related works

Is documented by
Preprint: 10.1101/2023.03.14.532539 (DOI)
Is required by
Software: https://github.com/ProteinUniverseAtlas/AFDB90v4 (URL)