DPCfam PUA_UR50 and P53_UR50 datasets and metaclusters

Elena Tea Russo

This zip contains the data used in DPCfam to analyze the PUA_UR50 and P53_UR50 datasets (paper in submission).

Each folder, named after its dataset, contains two files:
- all.fasta (fasta file containing the query sequences of the dataset)
- all_blasted_out.txt (alignments produced by running BLAST using the respective all.fasta file)
To analyze these files, you can use the DPCfam0 program at https://gitlab.com/ETRu/dpcfam (see the repository README for how to use these data).

Each folder also contains an "MCs" subfolder.
It stores the final metaclusters (MCs), filtered at 95% sequence identity (95 PI) with CD-HIT. Each MC file is a fasta file named after the numbered MC discussed in the paper. Each sequence is named using its protein's UniRef50 identifier followed, after a | separator, by the starting and ending positions of the sequence along that protein. Note that the reported sequences are NOT the full proteins, but only the regions located between the starting and ending positions written in the sequence name.
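
As an illustration of the naming scheme described above, here is a minimal Python sketch that reads an MC fasta file and splits each sequence name into protein identifier, start and end. It is not part of DPCfam. The function names read_fasta and parse_mc_header are hypothetical, and since this README does not state which character separates the start from the end position, the sketch simply takes the first two integers found after the | separator; please check this against the actual files before relying on it.

```python
import re
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a fasta file."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_mc_header(header: str) -> Tuple[str, int, int]:
    """Split an MC sequence name into (protein_id, start, end).

    Assumption: the protein identifier is everything before the first "|",
    and the start and end positions are the first two integers after it.
    """
    protein_id, sep, rest = header.partition("|")
    if not sep:
        raise ValueError(f"no '|' separator in header: {header!r}")
    numbers = re.findall(r"\d+", rest)
    if len(numbers) < 2:
        raise ValueError(f"start/end positions not found in header: {header!r}")
    return protein_id, int(numbers[0]), int(numbers[1])
```

To use it on a real file, open one of the MC fasta files, pass the file object to read_fasta, and call parse_mc_header on each header it yields. Each returned sequence is the region of the protein between start and end, not the full protein.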

Note, finally, that the numbering of the MCs reported here corresponds to the numbering in the paper's tables (2 and 4), and NOT to the numbering produced by the algorithm.

