Unified Human Gastrointestinal Proteome clustering results by DPCfam
Creators
- 1. AREA Science Park
Description
This dataset contains the result of clustering the Unified Human Gastrointestinal Proteome (UHGP) using the DPCfam algorithm.
More details on the DPCfam clustering algorithm can be found in the original publication:
Russo, Elena Tea, et al. "DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets." PLOS Computational Biology 18.10 (2022): e1010610. https://doi.org/10.1371/journal.pcbi.1010610
All of the putative protein families obtained through DPCfam (including previous results) can be browsed online at our dedicated webserver: https://dpcfam.areasciencepark.it/uhgp
The original protein dataset is version 1.0 of the UHGP-50 dataset, available for download from MGnify at https://www.ebi.ac.uk/metagenomics/.
FILES DESCRIPTION:
Only metaclusters (MCs) whose seeds have 1) more than 50 elements and 2) an average length greater than 50 amino acids are reported.
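The two reporting criteria above can be written as a small predicate. This is only an illustrative sketch, not the code used to build the dataset: the function name `passes_report_filter`, the metacluster labels and the seed lengths are all hypothetical.

```python
from statistics import mean

def passes_report_filter(seed_lengths, min_size=50, min_avg_len=50):
    """Return True if a metacluster's seed set meets both reporting criteria."""
    # Criterion 1: strictly more than 50 seed elements.
    # Criterion 2: average seed length strictly larger than 50 amino acids.
    return len(seed_lengths) > min_size and mean(seed_lengths) > min_avg_len

# Hypothetical seed lengths for three metaclusters (illustrative values only).
examples = {
    "mc_a": [120] * 60,   # 60 seeds, average length 120: reported
    "mc_b": [120] * 50,   # exactly 50 seeds: not reported
    "mc_c": [40] * 200,   # average length 40: not reported
}
reported = [name for name, lens in examples.items() if passes_report_filter(lens)]
print(reported)
```

Both comparisons are strict, matching the wording "more than" and "larger than" in the description.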
metaclusters_xml.tar.gz:
- dpcfam_uhgp_metaclusters.xml: Metaclusters' seeds. Each metacluster entry also includes statistical information about the MC (such as size, average length, low complexity fraction, etc.) and a comparison with Pfam (Dominant Architecture).
- dpcfam_metaclusters.xsd: XML schema file for the data.
- MCxml_to_tables.awk: Awk script to convert from XML to tabular text files. Use through the parse.sh script.
- parse.sh: XML parser.
- README.md
uhgp_xml.tar.gz:
- uhgp_seed_match.xml: XML file containing all UHGP-50 proteins and their corresponding sequences, annotated with Pfam and DPCfam metacluster data. Annotations cover a protein's membership as a seed and the matches found through the profile-hmms of the DPCfam-UHGP and DPCfam-Uniref clusterings.
- uhgp_matches.xsd: XML schema file for the data.
- xml_to_list.awk: Awk script to convert from XML to tabular text files. Use through the parse.sh script.
- xml_to_list_mcfiles.awk: Awk script to convert from XML to tabular text files (including individual files for metaclusters' seeds). Use through the parse.sh script.
- parse.sh: XML parser.
- README.md
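The XML files in both archives are large, so it may help to stream them rather than load them whole. The sketch below counts element tags with Python's standard `xml.etree.ElementTree.iterparse`, which is one way to get a first look at the structure before using the provided awk scripts. The actual element names are defined in the `.xsd` schema files and are not listed here, so the tags in the demo input (`root`, `entry`, `size`) are placeholders, not the real schema.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_tags(source):
    """Count how often each element tag occurs, reading the XML incrementally.

    `source` may be a file path or a binary file object.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        # Drop the element's text, attributes and children once it is counted,
        # which limits how much of the document is held in memory.
        elem.clear()
    return counts

# Placeholder document: the tag names here are NOT taken from the dataset schema.
demo = io.BytesIO(
    b"<root><entry><size>3</size></entry><entry><size>5</size></entry></root>"
)
tag_counts = count_tags(demo)
print(tag_counts.most_common())
```

If the real files declare an XML namespace, the tags reported by `iterparse` would carry it as a `{namespace}` prefix.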
Metacluster Files:
- seeds.zip: Metaclusters' seed sequences. One fasta file per metacluster, before filtering.
- filtered_seeds.zip: Metaclusters' seed sequences after clustering at 60 percent identity.
- metaclusters_hmms.tar.gz: Metaclusters' profile-hmms. One ".hmm" file per metacluster.
- metaclusters_msas.tar.gz: Metaclusters' multiple sequence alignments, in fasta format.
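Since the seed archives hold one fasta file per metacluster, a per-file record count can be obtained without unpacking the archive to disk. The following is a minimal sketch under stated assumptions: the member names, their extension and the UTF-8 encoding are guesses, and the demo archive is built in memory with made-up sequences rather than read from `seeds.zip`.

```python
import io
import zipfile

def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def seeds_per_file(zip_source):
    """Map each archive member to the number of FASTA records it contains."""
    counts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            text = archive.read(info).decode("utf-8")  # encoding is an assumption
            counts[info.filename] = sum(1 for _ in read_fasta(text))
    return counts

# Tiny in-memory archive standing in for seeds.zip (hypothetical name and sequences).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mc_example.fasta", ">seq1\nMKV\nLLA\n>seq2\nMPT\n")
buffer.seek(0)
seed_counts = seeds_per_file(buffer)
print(seed_counts)
```

Passing a path such as `"seeds.zip"` instead of the buffer should work the same way, provided the archive members are plain-text fasta files.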
uhgp_protein_mapping.txt:
- Contains a mapping between the identifiers of versions 1.0 and 2.0.2 of UHGP. The first column corresponds to the ID in UHGP-50 1.0 (representatives for the clustering at 50% protein identity), the second column to the ID in version 2.0.2 and the third column to the ID of the representative of the protein for clustering at 100% sequence identity, for which the protein sequence can be found in UHGP-100.
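The three-column layout described above can be loaded into a lookup table keyed by the UHGP-50 1.0 identifier. This sketch assumes whitespace-separated columns, no header line and unique first-column identifiers, none of which is stated in the description; the field names and demo identifiers are hypothetical.

```python
import io
from typing import NamedTuple

class MappingRow(NamedTuple):
    uhgp50_v1_id: str     # column 1: ID in UHGP-50 version 1.0
    v2_id: str            # column 2: ID in version 2.0.2
    uhgp100_rep_id: str   # column 3: representative at 100% sequence identity

def read_mapping(lines):
    """Build a dict from the UHGP-50 1.0 ID to its full mapping row."""
    rows = {}
    for line in lines:
        fields = line.split()  # assumes whitespace-separated columns
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        row = MappingRow(*fields)
        rows[row.uhgp50_v1_id] = row
    return rows

# Made-up identifiers, used only to show the lookup.
demo = io.StringIO("idA_v1\tidA_v2\tidA_rep100\nidB_v1\tidB_v2\tidB_rep100\n")
mapping = read_mapping(demo)
print(mapping["idA_v1"].uhgp100_rep_id)
```

With the real file, `read_mapping(open("uhgp_protein_mapping.txt"))` would be the equivalent call; if the file turns out to have a header line with three fields, it would need to be skipped explicitly.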
Files
Total size: 3.2 GB
Checksum | Size |
---|---|
md5:e3a4aae980d43fb38da6715edb864303 | 300.8 MB |
md5:9a1177672dda0b719b16e02b628293d1 | 226.8 MB |
md5:c4198fedd6fc485b9ad702e559852cea | 410.1 MB |
md5:11da7cc43e6520efcebdc6d2ff945d92 | 381.5 MB |
md5:a7335d52ee8587c15846e93bb8d00ed0 | 363.5 MB |
md5:efc65434edcd5f94ec0fc0826670ad74 | 295.2 MB |
md5:5f491b7c6a2680ab91346fca0e0cce8a | 1.2 GB |
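The MD5 checksums listed above can be used to verify a download. A minimal sketch with Python's standard `hashlib` follows; the helper names are hypothetical, and the demo checks a temporary file against a checksum computed on the spot rather than against one of the dataset files.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_listed_checksum(path, listed):
    """Compare a file against a checksum given as 'md5:<hex>' or plain '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected

# Demo on a throwaway file; the payload is arbitrary.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as handle:
        handle.write(b"example payload")
    listed = "md5:" + hashlib.md5(b"example payload").hexdigest()
    demo_ok = matches_listed_checksum(demo_path, listed)
print(demo_ok)
```

Reading in chunks keeps memory use flat, which matters for the larger archives in this record.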
Additional details
References
- Russo, E. T., Barone, F., Bateman, A., Cozzini, S., Punta, M., & Laio, A. (2022). DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets. PLoS computational biology, 18(10), e1010610.