Metadata supporting the AFDB90v4 annotated sequence similarity network

Durairaj, Janani; Waterhouse, Andrew M.; Mets, Toomas; Brodiazhenko, Tetiana; Abdullah, Minhal; Studer, Gabriel; Tauriello, Gerardo; Akdel, Mehmet; Andreeva, Antonina; Bateman, Alex; Tenson, Tanel; Hauryliuk, Vasili; Schwede, Torsten; Pereira, Joana

doi:10.5281/zenodo.8121336

Published March 15, 2023 | Version 2

Dataset Open

1. Biozentrum and SIB Swiss Institute of Bioinformatics
2. Institute of Technology, University of Tartu, Estonia
3. VantAI, New York, USA
4. European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
5. Institute of Technology, University of Tartu, Estonia; Department of Experimental Medical Science, Lund University, Sweden; Science for Life Laboratory, Lund, Sweden

Driven by the development and upscaling of fast genome sequencing and assembly pipelines, the number of protein-coding sequences deposited in public protein sequence databases is increasing exponentially. Recently, the dramatic success of deep learning-based approaches applied to protein structure prediction has done the same for protein structures. We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database. These models cover most of the catalogued natural proteins, including those that are difficult to annotate for function or putative biological role using standard, homology-based approaches. In this work, we quantified how much of this "dark matter" of the natural protein universe was structurally illuminated by AlphaFold2 at a high predicted accuracy, and modelled this diversity as an interactive sequence similarity network that can be navigated at https://uniprot3d.org/atlas/AFDB90v4. The dataset deposited here contains the metadata that was generated and that forms the basis of the similarity network and its interpretation. These files were either generated or processed using the code available at https://github.com/ProteinUniverseAtlas/AFDB90v4.

This repository further contains the detailed, individual sequence similarity networks (in CLANS format) generated for the 3 example protein (super)families described in the text.

The full content of this repository includes:

1. AFDBv4_pLDDT_diggestion_UniRef50_2023-02-01.csv: table listing all UniRef50 clusters in UniProt, including information on structural representatives from AFDB. Each column provides a different annotation, including functional brightness, median and best pLDDT, brightness and structural representatives, among others.
2. AFDBv4_DUF_dark_diggestion_UniRef50_2023-02-06.csv: table listing all UniRef50 clusters in UniProt and whether they include proteins mapped to known domains of unknown function (DUF).
3. AFDBv4_90.fasta: FASTA file with the sequences of all selected UniRef50 clusters, used for the all-against-all mmseqs searches that form the basis of the network.
4. AFDB90v4_data.csv: the subset of file (1) that corresponds to the AFDB90v4 dataset, including columns such as functional brightness, median and best pLDDT, brightness and structural representatives, among others.
5. AFDB90v4_data_with_graph_labels.csv: table listing each individual UniRef50 cluster included in the AFDB90v4 dataset, together with its mapping to communities and connected components.
6. AFDB90v4_cc_data.csv: table of UniRef50 clusters in connected components, including their annotations and the columns in file (5).
7. AFDB90v4_cc_data_uniprot_community_taxonomy_map.csv: mapping of each uniprotAC entry to its corresponding component, community and taxonomy.
8. AFDB90v4_subgraphs_summary.csv: table summarising the properties of individual connected components, including the average brightness, the number of members, the number of unique protein sequences, the median length, and the number of communities.
9. communities_summary.csv: table summarising the properties of individual communities, including the average brightness, the number of members, the number of unique protein sequences, the median length, the most common superkingdom represented, the average structure outlier score, among others.
10. communities_edge_list-coordinates.csv: the coordinates of each community in the graphical representation. Singleton communities and singleton UniRef50 clusters are not included.
11. communities_edge_list_no_duplicates.csv: list of the edges that make up the graph.
12. node_class.json: map between each UniRef50 cluster and its corresponding component and community.
13. subgraphs.tar.gz: tar file containing GML files for each individual connected component.
14. AFDB90v4_outlier_scores.tsv: table containing the outlier scores for each community representative.
15. AFDB90v4_dark_galaxies_summary.csv: table containing the summary of all dark connected components, including average brightness, median length, representatives, number of communities, among others.
16. AFDB90v4_uniprot_naming_assessment_counts.csv: table listing the per-component semantic diversity scores, as well as the major source of the titles of the proteins included and their count.
17. uniprot_naming_assessment.tar.gz: tar file containing the per-community assessment of predicted protein names in UniProt as of February 2023.
18. CLANS_files.tar.gz: stores the 3 sequence similarity networks, in CLANS format, constructed for the analysis of the sequence diversity and sequence similarities of the proteins in components 27, 159 and 3314. These CLANS files form the basis of panel A in each of Figures 3 and 4 and Extended Data Figure 5.
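
Several of the tables above are large (hundreds of megabytes to several gigabytes), so it can help to inspect the column names and a few rows before loading a file in full. The following is a minimal sketch using only the Python standard library. It assumes the CSV files are comma-delimited with a single header row, which is not stated in this record; the function name `peek_csv` is hypothetical and not part of the AFDB90v4 code.

```python
import csv
from itertools import islice


def peek_csv(path, n_rows=5, delimiter=","):
    """Return (header, first n_rows data rows) without reading the whole file.

    Assumption: the file has one header row and uses `delimiter` as the
    field separator. Pass delimiter="\t" for tab-separated files.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, [])          # empty list if the file is empty
        rows = list(islice(reader, n_rows))  # stops early on short files
    return header, rows


if __name__ == "__main__":
    import sys

    # Usage: python peek.py AFDB90v4_subgraphs_summary.csv
    if len(sys.argv) > 1:
        header, rows = peek_csv(sys.argv[1])
        print("columns:", header)
        for row in rows:
            print(row)
```

Because the reader is consumed lazily, only the first few lines are read from disk, so the same call should be usable on the multi-gigabyte tables as well as the smaller summaries.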

Notes

This work was supported by funding from the SIB - Swiss Institute of Bioinformatics (https://www.sib.swiss/), the Biozentrum of the University of Basel (https://www.biozentrum.unibas.ch/), by the European Union via project MIBEst H2020-WIDESPREAD-2018-2020/GA number 857518 (T.T. and V.H.), by a grant from the Estonian Research Council (PRG335 to T.T. and V.H.), the Knut and Alice Wallenberg Foundation (2020-0037 to V.H.), Swedish Research Council (Vetenskapsrådet) grants (2021-01146 to V.H.), Cancerfonden (20 0872 Pj to V.H.), and the Biotechnology and Biological Sciences Research Council and the NSF Directorate for Biological Sciences (BB/X012492/1 to A.B.).

Files

Files (19.6 GB)

Name	MD5 checksum	Size
AFDB90v4_cc_data.csv	0f445c09121a4fe27ea8ec32a531099d	879.6 MB
AFDB90v4_cc_data_uniprot_community_taxonomy_map.csv	3cb36f96923bf502348d19b594ca14b3	4.1 GB
AFDB90v4_dark_galaxies_summary.csv	cd9496bbb54194e46800a93c1c0387c0	3.6 MB
AFDB90v4_data.csv	6366e3cdc8475da69446420ccaf4bc57	1.2 GB
AFDB90v4_data_with_graph_labels.csv	74bf56feeba528b0dbb72d647b0aea3a	1.2 GB
AFDB90v4_outlier_scores.tsv	508a235190c5aca1490752649fe8c388	101.2 MB
AFDB90v4_subgraphs_summary.csv	5667d1090b97a5c24037263e68fdadd9	18.1 MB
AFDB90v4_uniprot_naming_assessment_counts.csv	d2aac68dbe61154f0e6ad520ba88162d	15.6 MB
AFDBv4_90.fasta	fad23381870a1c0064b242725210a576	1.6 GB
AFDBv4_DUF_dark_diggestion_UniRef50_2023-02-06.csv	e6e3a04c40ce3233a8becd8e523dc2f6	409.5 MB
AFDBv4_pLDDT_diggestion_UniRef50_2023-02-01.csv	e6db871d1896f865cbc630716b155b42	7.8 GB
CLANS_files.tar.gz	1bd28aeb217db2aa23c2f63764459a60	17.2 MB
communities_edge_list-coordinates.csv	89f1517edcde527429abdab21219b415	20.2 MB
communities_edge_list_no_duplicates.csv	9151046bc46584bd895a3618b5e868fc	27.8 MB
communities_summary.csv	574dba0fdfbfd6a856811026008846c8	128.4 MB
full_graph.gml	d62dec6c50c99ec53cc63bf55e844b9e	1.2 GB
node_class.json	46c78158915abfb6474869ef4dd4b324	126.2 MB
uniprot_naming_assessment.tar.gz	0ebc209d7b54f140e99a8cde9b10e3c9	687.6 MB
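
The MD5 checksums listed above can be used to confirm that a download completed without corruption. Below is a minimal sketch of such a check using Python's standard `hashlib` module. The function names `md5_of_file` and `verify_download` are hypothetical helpers introduced here for illustration, and the local file path is assumed to match the name in the table.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 matches the checksum from the record."""
    return md5_of_file(path) == expected_md5.strip().lower()


if __name__ == "__main__":
    import os

    # Example with one entry from the table; assumes the file was saved
    # to the current directory under its original name.
    name = "CLANS_files.tar.gz"
    if os.path.exists(name):
        ok = verify_download(name, "1bd28aeb217db2aa23c2f63764459a60")
        print(name, "OK" if ok else "MISMATCH")
```

A mismatch most likely indicates a truncated or interrupted download, in which case fetching the file again is the simplest remedy.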

Additional details

Is documented by: Preprint: 10.1101/2023.03.14.532539 (DOI)
Is required by: Software: https://github.com/ProteinUniverseAtlas/AFDB90v4 (URL)

	All versions	This version
Views	1,317	1,109
Downloads	3,442	3,355
Data volume	5.1 TB	4.8 TB
