Metadata supporting the AFDB90v4 annotated sequence similarity network
Creators
- 1. Biozentrum and SIB Swiss Institute of Bioinformatics
- 2. Institute of Technology, University of Tartu, Estonia
- 3. VantAI, New York, USA
- 4. European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
- 5. Institute of Technology, University of Tartu, Estonia; Department of Experimental Medical Science, Lund University, Sweden; Science for Life Laboratory, Lund, Sweden
Description
Driven by the development and upscaling of fast genome sequencing and assembly pipelines, the number of protein-coding sequences deposited in public protein sequence databases is increasing exponentially. Recently, the dramatic success of deep learning-based approaches applied to protein structure prediction has done the same for protein structures. We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database. These models cover most of the catalogued natural proteins, including those that are difficult to annotate for function or putative biological role using standard, homology-based approaches. In this work, we quantified how much of this "dark matter" of the natural protein universe was structurally illuminated by AlphaFold2 at a high predicted accuracy, and modelled this diversity as an interactive sequence similarity network that can be navigated at https://uniprot3d.org/atlas/AFDB90v4. The dataset deposited here contains the metadata that forms the basis of the similarity network and its interpretation. These files were either generated or processed using the code available at https://github.com/ProteinUniverseAtlas/AFDB90v4.
This repository further contains the detailed, individual sequence similarity networks (in CLANS format) generated for the 3 example protein (super)families described in the text.
The full content of this repository includes:
- AFDBv4_pLDDT_diggestion_UniRef50_2023-02-01.csv: table listing all UniRef50 clusters in UniProt, including information on structural representatives from AFDB. Each column provides a different annotation, including functional brightness, median and best pLDDT, brightness and structural representatives, etc.
- AFDBv4_DUF_dark_diggestion_UniRef50_2023-02-06.csv: table listing all UniRef50 clusters in UniProt and whether they include proteins mapped to known domains of unknown function (DUFs).
- AFDBv4_90.fasta: FASTA file with the sequences of all selected UniRef50 clusters, used for the all-against-all mmseqs searches that form the basis of the network.
- AFDB90v4_data.csv: the subset of AFDBv4_pLDDT_diggestion_UniRef50_2023-02-01.csv that corresponds to the AFDB90v4 dataset, including columns such as functional brightness, median and best pLDDT, brightness and structural representatives, etc.
- AFDB90v4_data_with_graph_labels.csv: table listing each individual UniRef50 cluster included in the AFDB90v4 dataset, together with its mapping to communities and connected components.
- AFDB90v4_cc_data.csv: table of UniRef50 clusters in connected components, including their annotations and the columns in AFDB90v4_data_with_graph_labels.csv.
- AFDB90v4_cc_data_uniprot_community_taxonomy_map.csv: mapping of each uniprotAC entry to its corresponding component, community and taxonomy.
- AFDB90v4_subgraphs_summary.csv: table summarising the properties of individual connected components, including the average brightness, the number of members, the number of unique protein sequences, the median length, and the number of communities.
- communities_summary.csv: table summarising the properties of individual communities, including the average brightness, the number of members, the number of unique protein sequences, the median length, the most common superkingdom represented, the average structure outlier score, etc.
- communities_edge_list-coordinates.csv: the coordinates of each community in the graphical representation. Singleton communities and singleton UniRef50 clusters are not included.
- communities_edge_list_no_duplicates.csv: list of the edges that make up the graph.
- node_class.json: map between each UniRef50 cluster and its corresponding component and community.
- subgraphs.tar.gz: tar file containing the gml files for each individual connected component.
- AFDB90v4_outlier_scores.tsv: table containing the outlier scores for each community representative.
- AFDB90v4_dark_galaxies_summary.csv: table containing the summary of all dark connected components, including average brightness, median length, representatives, number of communities, etc.
- AFDB90v4_uniprot_naming_assessment_counts.csv: table listing the per-component semantic diversity scores, as well as the major source of the titles of the proteins included and their count.
- uniprot_naming_assessment.tar.gz: tar file containing the per-community assessment of predicted protein names in UniProt as of February 2023.
- CLANS_files.tar.gz: stores the 3 sequence similarity networks, in CLANS format, constructed for the analysis of the sequence diversity and sequence similarities of the proteins in components 27, 159 and 3314. These CLANS files form the basis of panel A in figures 3 and 4 and extended data figure 5.
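As a rough illustration of how an edge list such as communities_edge_list_no_duplicates.csv relates to the connected components reported in the other tables, the sketch below groups nodes into components with a union-find structure. It is not the authors' pipeline (see the GitHub repository for that). The column names `source` and `target` and the in-memory sample are assumptions made for illustration only; the actual header of the deposited file should be checked before use.

```python
import csv
import io
from collections import defaultdict


def read_edges(handle, source_col="source", target_col="target"):
    """Yield (source, target) pairs from a CSV handle.

    The column names are hypothetical placeholders, not the real header.
    """
    for row in csv.DictReader(handle):
        yield row[source_col], row[target_col]


def connected_components(edges):
    """Group nodes into connected components using union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node directly at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    for source, target in edges:
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            parent[root_t] = root_s

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    # Largest components first, ties broken alphabetically.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Tiny in-memory stand-in for the real (much larger) edge list.
sample = io.StringIO("source,target\nc1,c2\nc2,c3\nc4,c5\n")
components = connected_components(read_edges(sample))
print(components)
```

Because the edges are consumed as a stream, the same functions can be pointed at an open file handle instead of the sample, so the full edge list never has to be loaded at once; only the node-to-parent map is kept in memory.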
Files
(19.6 GB)
Checksum | Size |
---|---|
md5:0f445c09121a4fe27ea8ec32a531099d | 879.6 MB |
md5:3cb36f96923bf502348d19b594ca14b3 | 4.1 GB |
md5:cd9496bbb54194e46800a93c1c0387c0 | 3.6 MB |
md5:6366e3cdc8475da69446420ccaf4bc57 | 1.2 GB |
md5:74bf56feeba528b0dbb72d647b0aea3a | 1.2 GB |
md5:508a235190c5aca1490752649fe8c388 | 101.2 MB |
md5:5667d1090b97a5c24037263e68fdadd9 | 18.1 MB |
md5:d2aac68dbe61154f0e6ad520ba88162d | 15.6 MB |
md5:fad23381870a1c0064b242725210a576 | 1.6 GB |
md5:e6e3a04c40ce3233a8becd8e523dc2f6 | 409.5 MB |
md5:e6db871d1896f865cbc630716b155b42 | 7.8 GB |
md5:1bd28aeb217db2aa23c2f63764459a60 | 17.2 MB |
md5:89f1517edcde527429abdab21219b415 | 20.2 MB |
md5:9151046bc46584bd895a3618b5e868fc | 27.8 MB |
md5:574dba0fdfbfd6a856811026008846c8 | 128.4 MB |
md5:d62dec6c50c99ec53cc63bf55e844b9e | 1.2 GB |
md5:46c78158915abfb6474869ef4dd4b324 | 126.2 MB |
md5:0ebc209d7b54f140e99a8cde9b10e3c9 | 687.6 MB |
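Since several of these files are multiple gigabytes, it can be worth checking a download against its listed md5 checksum. The following is a minimal sketch using only the Python standard library; the helper names `md5_of` and `matches_listing` are illustrative and not part of the repository, and the demonstration uses a small temporary file rather than one of the deposited files.

```python
import hashlib
import os
import tempfile


def md5_of(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of(path) == expected


# Demonstration on a throwaway file (not one of the deposited files).
fd, demo_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as handle:
        handle.write(b"hello")
    demo_digest = md5_of(demo_path)
    demo_ok = matches_listing(demo_path, "md5:" + demo_digest)
finally:
    os.remove(demo_path)

print(demo_digest, demo_ok)
```

For a real download, `matches_listing` would be called with the local file path and the corresponding checksum string from the table. Reading in chunks keeps memory use constant regardless of file size.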
Additional details
Related works
- Is documented by
- Preprint: 10.1101/2023.03.14.532539 (DOI)
- Is required by
- Software: https://github.com/ProteinUniverseAtlas/AFDB90v4 (URL)