Published January 19, 2022
| Version 1.1
Dataset
Open
Metaclusters by DPCfam clustering of UniRef50 v 2017_07
Creators
- 1. SISSA, Trieste (IT); Area Science Park, Trieste (IT)
- 2. Area Science Park, Trieste (IT)
Contributors
Data curator:
Other:
Researcher:
Supervisors:
- 1. Area Science Park, Trieste (IT)
- 2. SISSA, Trieste (IT)
- 3. Center for Omics Sciences, IRCCS San Raffaele Hospital Milan (IT)
- 4. European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
Description
Metaclusters obtained from the DPCfam clustering of UniRef50, v. 2017_07.
Metaclusters represent putative protein families automatically derived using the DPCfam method, as described in: Russo ET, Unsupervised protein family classification by Density Peak clustering, PhD Thesis, 2020, http://hdl.handle.net/20.500.11767/116345 (supervisors: Alessandro Laio, Marco Punta).
See also https://dpcfam.areasciencepark.it/ to browse the data.
VERSION 1.1 changes:
- Added the DPCfamB database, which includes all small metaclusters with 25<=N<50 seed sequences. DPCfamB files are named with the prefix B_
- Added an AlphaFold representative, based on AlphaFoldDB, for each metacluster (MC)
FILES DESCRIPTION:
1) Standard DPCfam database
- metaclusters_xml.tar.gz Metaclusters' seeds, unaligned, in an XML table. Only MCs whose seeds have 1) more than 50 elements and 2) an average length larger than 50 amino acids are reported. Metacluster entries also include some statistical information about each MC (such as size, average length and low-complexity fraction) and a Pfam comparison (Dominant Architecture). A README file describing the data, a parser that converts the XML data to space-separated tables, and the XML schema are included.
- metaclusters_msas.tar.gz Metaclusters' multiple sequence alignments, in FASTA format. Only MCs whose seeds have 1) more than 50 elements and 2) an average length larger than 50 amino acids are reported.
- metaclusters_hmms.tar.gz Metaclusters' profile HMMs, one ".hmm" file per metacluster. Only MCs whose seeds have 1) more than 50 elements and 2) an average length larger than 50 amino acids are reported.
- all_metaclusters_hmm.tar.gz Collective metaclusters' profile HMMs: a single .hmm file collecting the profile HMMs of all MCs. Only MCs whose seeds have 1) more than 50 elements and 2) an average length larger than 50 amino acids are reported.
- uniref50_annotated.xml.gz UniRef50 v. 2017_07 database annotated with Pfam families and DPCfam metaclusters. A README file describing the data, a parser that converts the XML data to space-separated tables, and the XML schema are included. The XML schema is derived from UniProt's UniRef50 XML schema.
2) DPCfamB database
- B_metaclusters_xml.tar.gz Metaclusters' seeds, unaligned, in an XML table. All metaclusters are listed. Metacluster entries also include some statistical information about each MC (such as size, average length and low-complexity fraction) and a Pfam comparison (Dominant Architecture). A README file describing the data, a parser that converts the XML data to space-separated tables, and the XML schema are included.
- B_metaclusters_msas.tar.gz Metaclusters' multiple sequence alignments, in FASTA format. Only MCs whose seeds have 1) 25<=N<50 elements and 2) an average length larger than 50 amino acids are reported.
- B_metaclusters_hmms.tar.gz Metaclusters' profile HMMs, one ".hmm" file per metacluster. Only MCs whose seeds have 1) 25<=N<50 elements and 2) an average length larger than 50 amino acids are reported.
- B_all_metaclusters_hmm.tar.gz Collective metaclusters' profile HMMs: a single .hmm file collecting the profile HMMs of all MCs. Only MCs whose seeds have 1) 25<=N<50 elements and 2) an average length larger than 50 amino acids are reported.
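The alignment archives described above are stated to hold multiple sequence alignments in FASTA format. A minimal sketch of how they might be read without unpacking to disk is given below. The internal layout of the archives is not documented here, so the code assumes only that each regular file in the archive is a FASTA alignment; the function names `parse_fasta` and `iter_alignments` are hypothetical helpers, not part of the dataset's own tooling, and the bundled README should be consulted for the actual layout.

```python
import tarfile
from typing import Dict, Iterator, Tuple


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}, keeping gap characters.

    If the same header appears twice, the later record replaces the earlier one.
    """
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped; concatenate them.
            records[header] += line
    return records


def iter_alignments(archive_path: str) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (member name, parsed alignment) for each regular file in a .tar.gz.

    Assumption: every regular file is a FASTA alignment. Files in another
    format (for example a README) would simply yield an empty or odd mapping.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            text = handle.read().decode("utf-8", errors="replace")
            yield member.name, parse_fasta(text)
```

For instance, `for name, msa in iter_alignments("metaclusters_msas.tar.gz"): print(name, len(msa))` would list each alignment with its number of sequences, provided the assumption about the archive contents holds.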
Files
(18.8 GB)
MD5 checksum | Size |
---|---|
md5:f45fdcefda4774732518fb4540611105 | 868.6 MB |
md5:69ab9002ae2723a0e09cdfb68fd53278 | 612.1 MB |
md5:dc011864bfd3b5227f8eb6d367ba5c03 | 616.6 MB |
md5:0495a913905493d93acac4040c38ae49 | 112.3 MB |
md5:2d90e2e2701fed2808f581530d1532b1 | 129.7 MB |
md5:a2355250a9b6c59a85e3809d36c0f0d6 | 874.6 MB |
md5:a9c29ec54c6692426760aac2e03082d6 | 1.5 GB |
md5:9155aafb636717bf65ad6b8dd79fff6d | 1.7 GB |
md5:3977c848d3cdc4c24329196321c9cbff | 12.4 GB |
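Since the listing gives an MD5 checksum for each file, a downloaded archive can be checked for corruption before use. The sketch below shows one way to do this with Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the file is read in chunks because the largest archive is 12.4 GB.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, listed: str) -> bool:
    """Compare a file against a listed checksum, with or without the 'md5:' prefix."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

The checksum string can be pasted exactly as it appears in the listing, including the `md5:` prefix. Which checksum belongs to which file has to be taken from the record page itself.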
Additional details
References
- Russo ET, Unsupervised protein family classification by Density Peak clustering, PhD Thesis, 2020, http://hdl.handle.net/20.500.11767/116345 (supervisors: Alessandro Laio, Marco Punta).