Published October 21, 2020 | Version v2
Other Open

PUA_UR50 and P53_UR50 datasets and metaclusters

  • 1. Sissa, Trieste

Description

This zip archive contains the data used for the Density Peak clustering of the PUA_UR50 and P53_UR50 datasets in Russo, E.T., Laio, A. & Punta, M. Density Peak clustering of protein sequences associated to a Pfam clan reveals clear similarities and interesting differences with respect to manual family annotation. BMC Bioinformatics 22, 121 (2021). https://doi.org/10.1186/s12859-021-04013-x

The metaclusters (MCs) generated by the procedure are also included.

Each folder is named after its dataset and contains two files:
- all.fasta (FASTA file containing the query sequences of the dataset)
- all_blasted_out.txt (alignments produced by running BLAST on the respective all.fasta file)
To analyze these files, you can use the DPCfam0 program at https://gitlab.com/ETRu/dpcfam (see the repository README for how to use these data).
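
For a quick look at a query file without DPCfam0, the following is a minimal sketch of a FASTA reader. It assumes only that all.fasta follows the standard FASTA layout (a ">" header line followed by one or more sequence lines); the function name read_fasta and the example path are illustrative and are not part of the dataset or of DPCfam0.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file.

    The header is returned without the leading '>'; sequence lines
    belonging to the same record are joined into a single string.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Tiny in-memory example with made-up records (not taken from the dataset).
demo = io.StringIO(">seq1\nMKT\nLLV\n>seq2\nGGA\n")
for name, seq in read_fasta(demo):
    print(name, len(seq))
```

To read a real file, open it and pass the handle, for example `with open("PUA_UR50/all.fasta") as fh:` (the path is an assumption based on the folder naming described above).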

Moreover, each folder contains an "MCs" folder.
It stores the final MCs, filtered at 95% sequence identity with CD-HIT. Each MC file is a FASTA file named after the numbered MC discussed in the paper. Each sequence is named using a numeric protein identifier followed, after a | separator, by the starting and ending positions of the sequence along the given protein. Note that the reported sequences are NOT full proteins, but the specific regions located at the start-end positions written in the sequence name. A "dictionary" file gives the correspondence between each numeric ID and the UniRef protein name.
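
The naming scheme for MC sequences can be unpacked with a small helper. This is a sketch under stated assumptions: the description only says that the numeric ID is separated from the positions by a |, so the delimiter between the start and end positions is unknown here, and the code simply takes the two integers found after the first |. The names Region and parse_mc_header, and the example headers, are hypothetical.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    protein_id: int  # numeric ID, to be looked up in the "dictionary" file
    start: int       # starting position along the protein
    end: int         # ending position along the protein


def parse_mc_header(header: str) -> Region:
    """Split an MC sequence name into protein ID, start and end.

    Assumption: the part after the first '|' contains exactly two
    integers (start, then end), whatever character separates them.
    """
    name = header.lstrip(">").strip()
    protein_id, sep, positions = name.partition("|")
    if not sep:
        raise ValueError(f"no '|' separator in header: {header!r}")
    numbers = re.findall(r"\d+", positions)
    if len(numbers) != 2:
        raise ValueError(f"expected start and end positions in: {header!r}")
    start, end = (int(n) for n in numbers)
    return Region(int(protein_id), start, end)


# Made-up header for illustration; check a real MC file for the actual layout.
print(parse_mc_header(">12345|10-120"))
```

If the real files use a different layout for the positions, only the regular expression step should need adjusting.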

Note, finally, that the numbering of the reported MCs corresponds to the numbering in the paper's tables (2 and 4), and NOT to the numbering produced by the algorithm.

Files

Files (1.2 GB)

md5:c31c86f0a8902b40e432bfe2b4089bcd

Additional details

References

  • Russo, E.T., Laio, A. & Punta, M. Density Peak clustering of protein sequences associated to a Pfam clan reveals clear similarities and interesting differences with respect to manual family annotation. BMC Bioinformatics 22, 121 (2021). https://doi.org/10.1186/s12859-021-04013-x