Published February 2, 2024 | Version 1.0.2
Dataset Open

Unified Human Gastrointestinal Proteome clustering results by DPCfam

Description

This dataset contains the result of clustering the Unified Human Gastrointestinal Proteome (UHGP) using the DPCfam algorithm. 

More details on the DPCfam clustering algorithm can be found in the original publication:

Russo, Elena Tea, et al. "DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets." PLOS Computational Biology 18.10 (2022): e1010610. https://doi.org/10.1371/journal.pcbi.1010610

All of the putative protein families obtained through DPCfam (including previous results) can be browsed online at our dedicated webserver: https://dpcfam.areasciencepark.it/uhgp

The original protein dataset is version 1.0 of the UHGP-50 dataset, available for download from MGnify at https://www.ebi.ac.uk/metagenomics/.

FILES DESCRIPTION:

Only metaclusters (MCs) whose seeds 1) number more than 50 and 2) have an average length greater than 50 amino acids are reported.
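As an illustration of the two reporting criteria, the following minimal sketch checks a list of seed sequences against both thresholds. It assumes that "elements" means individual seed sequences and that the average length is the arithmetic mean in residues; the function and constant names are hypothetical and are not part of the DPCfam code.

```python
from statistics import mean

# Thresholds taken from the description above; the names are illustrative.
MIN_SEEDS = 50        # a metacluster needs strictly more than 50 seeds
MIN_AVG_LENGTH = 50   # and an average seed length strictly above 50 residues


def is_reported(seed_sequences):
    """Return True if a metacluster's seeds pass both reporting thresholds."""
    if len(seed_sequences) <= MIN_SEEDS:
        return False
    return mean(len(seq) for seq in seed_sequences) > MIN_AVG_LENGTH
```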

metaclusters_xml.tar.gz:

  • dpcfam_uhgp_metaclusters.xml: Metaclusters' seeds. Each metacluster entry also includes summary statistics (size, average length, low-complexity fraction, etc.) and a comparison with Pfam (Dominant Architecture).
  • dpcfam_metaclusters.xsd: XML schema file for the data. 
  • MCxml_to_tables.awk: Awk script to convert from XML to tabular text files. Use through the parse.sh script.
  • parse.sh: XML parser. 
  • README.md

uhgp_xml.tar.gz: 

  • uhgp_seed_match.xml: XML file containing all UHGP-50 proteins and their corresponding sequences, annotated with Pfam and DPCfam metacluster data. The annotations record whether a protein is a seed of a metacluster or a match found through the profile HMMs of the DPCfam-UHGP and DPCfam-Uniref clusterings.
  • uhgp_matches.xsd: XML schema file for the data. 
  • xml_to_list.awk: Awk script to convert from XML to tabular text files. Use through the parse.sh script.
  • xml_to_list_mcfiles.awk: Awk script to convert from XML to tabular text files (including individual files for metaclusters' seeds). Use through the parse.sh script.
  • parse.sh: XML parser. 
  • README.md
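The provided parse.sh and Awk scripts are the intended route for converting the XML files to tables. For a quick look at the structure of a large XML file before converting it, a streaming pass that counts element tags can be sketched as follows. This sketch makes no assumption about the schema; the function name is hypothetical, and tags in a namespaced document would appear in the form `{namespace}tag`.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(xml_source):
    """Stream through an XML file (path or binary file object) and count each tag."""
    counts = Counter()
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        counts[elem.tag] += 1
        # Release the element's text and children; an empty shell stays
        # attached to its parent, so memory still grows slowly on huge files.
        elem.clear()
    return counts
```

Calling `count_tags` on the extracted XML file and printing `counts.most_common()` gives an overview of which elements occur and how often.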

Metacluster Files:

  • seeds.zip: Metaclusters' seed sequences. A fasta file for each metacluster before filtering.
  • filtered_seeds.zip: Metaclusters' seed sequences after clustering at 60 percent identity. 
  • metaclusters_hmms.tar.gz: Metaclusters' profile HMMs. A ".hmm" file for each metacluster.
  • metaclusters_msas.tar.gz: Metaclusters' multiple sequence alignments, in fasta format. 
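The seed and alignment files are in FASTA format, so they can be read with any standard FASTA parser. A minimal dependency-free reader is sketched below, assuming the usual layout of a ">" header line followed by one or more sequence lines; the function name is hypothetical. For alignment files, gap characters are kept as they appear.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)
```

An open text file can be passed directly, for example `list(read_fasta(open(path)))`, where `path` points to one extracted FASTA file.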

uhgp_protein_mapping.txt:

  • Contains a mapping between the identifiers of versions 1.0 and 2.0.2 of UHGP. The first column corresponds to the ID in UHGP-50 1.0 (representatives for the clustering at 50% protein identity), the second column to the ID in version 2.0.2 and the third column to the ID of the representative of the protein for clustering at 100% sequence identity, for which the protein sequence can be found in UHGP-100.  
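The three-column mapping can be loaded into a dictionary keyed by the UHGP-50 1.0 identifier, as in the sketch below. It assumes the columns are separated by whitespace and that the file has no header row; neither is stated above, so both should be checked against the actual file. The function name is hypothetical.

```python
def load_mapping(lines):
    """Map each UHGP-50 v1.0 ID to (v2.0.2 ID, UHGP-100 representative ID)."""
    mapping = {}
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated columns
        if len(fields) != 3:
            continue  # skip blank lines and rows without exactly three columns
        v1_id, v2_id, rep100_id = fields
        mapping[v1_id] = (v2_id, rep100_id)
    return mapping
```

With an open file handle, `load_mapping(handle)` returns a dictionary in which `mapping[old_id]` gives the version 2.0.2 identifier and the UHGP-100 representative for that protein.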

Files (3.2 GB)

Size       Checksum
300.8 MB   md5:e3a4aae980d43fb38da6715edb864303
226.8 MB   md5:9a1177672dda0b719b16e02b628293d1
410.1 MB   md5:c4198fedd6fc485b9ad702e559852cea
381.5 MB   md5:11da7cc43e6520efcebdc6d2ff945d92
363.5 MB   md5:a7335d52ee8587c15846e93bb8d00ed0
295.2 MB   md5:efc65434edcd5f94ec0fc0826670ad74
1.2 GB     md5:5f491b7c6a2680ab91346fca0e0cce8a
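After downloading, each file can be checked against the MD5 checksums listed above. The following sketch computes the digest in chunks so that the larger archives do not have to fit in memory; the function name and chunk size are illustrative choices.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string should equal the listed checksum for that file with the "md5:" prefix removed.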

Additional details

References

  • Russo, E. T., Barone, F., Bateman, A., Cozzini, S., Punta, M., & Laio, A. (2022). DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets. PLOS Computational Biology, 18(10), e1010610. https://doi.org/10.1371/journal.pcbi.1010610