There is a newer version of the record available.

Published November 18, 2020 | Version 2020-07
Dataset Open

PSSH2 - database of protein sequence-to-structure homologies

  • 1. HSWT
  • 2. Garvan Institute of Medical Research

Description

The PSSH2 data set

PSSH2 is a database of protein sequence-to-structure homologies based on HHblits, an alignment method employing iterative comparisons of hidden Markov models (HMMs). To ensure the highest possible final alignment quality for matches in Aquaria using HHblits, we first calculated HMM profiles for each unique PDB sequence (PDB_full) and for each unique Swiss-Prot sequence. We then generated PSSH2 by using HHblits to find similarities between the HMMs from PDB and the HMMs from UniProt sequences.

This dataset contains the Swiss-Prot and PDB data used for generating PSSH2 along with the PSSH2 data itself, that is, the sequence-to-structure alignments used in Aquaria (aquaria.ws) and in Aquaria's COVID-19 resource (http://aquaria.ws/covid).

 

Calculating PSSH2

The bulk of the Swiss-Prot and PDB data was downloaded in February 2020; incremental updates, especially those related to COVID-19, were added until July 2020.
Generating PSSH2: We used Uniclust30 from HH-suite, a database of non-redundant UniProt sequence clusters in which the highest pairwise sequence identity between clusters was 30% (http://gwdu111.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz). The HHblits code and the code for running the calculations were retrieved from git (https://github.com/soedinglab/hh-suite.git and https://github.com/aschafu/PSSH2.git respectively) at the time of each calculation, between February and July 2020.
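
The two-stage procedure described above (build an HMM per sequence against Uniclust30, then compare it with the PDB HMMs) could be sketched roughly as follows. This is not the PSSH2 pipeline itself, which lives in the repository linked above; the function names, file paths, iteration counts and the choice of searching Swiss-Prot profiles against a PDB HMM database are all assumptions made for illustration. Only the standard HHblits options -i, -d, -o, -ohhm, -n and -cpu are used, and the code only assembles the command lines without running them.

```python
from pathlib import Path


def profile_command(query_fasta, uniclust_db, out_dir, iterations=3, cpus=4):
    """Assemble an hhblits call that builds an HMM for one sequence.

    uniclust_db is the database basename (hypothetical path), and the
    iteration count is an illustrative default, not the PSSH2 setting.
    """
    hhm = str(Path(out_dir) / (Path(query_fasta).stem + ".hhm"))
    cmd = [
        "hhblits",
        "-i", str(query_fasta),   # input sequence
        "-d", str(uniclust_db),   # Uniclust30 database basename
        "-ohhm", hhm,             # write the resulting HMM
        "-n", str(iterations),    # number of search iterations
        "-cpu", str(cpus),
    ]
    return cmd, hhm


def search_command(query_hhm, pdb_hmm_db, out_dir, cpus=4):
    """Assemble an hhblits call that compares one HMM with a PDB HMM database.

    pdb_hmm_db is a hypothetical basename for a database built from the
    PDB_full profiles; a single iteration is assumed here.
    """
    hhr = str(Path(out_dir) / (Path(query_hhm).stem + ".hhr"))
    cmd = [
        "hhblits",
        "-i", str(query_hhm),
        "-d", str(pdb_hmm_db),
        "-o", hhr,                # alignment results in HHR format
        "-n", "1",
        "-cpu", str(cpus),
    ]
    return cmd, hhr


# Example: print the two commands for one hypothetical Swiss-Prot entry.
build_cmd, hhm_path = profile_command("P12345.fasta", "db/uniclust30_2018_08", "out")
scan_cmd, hhr_path = search_command(hhm_path, "db/pdb_full", "out")
print(" ".join(build_cmd))
print(" ".join(scan_cmd))
```

The commands can be passed to subprocess.run once HH-suite is installed and the databases are unpacked.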

 

Evaluating PSSH2

The resulting alignment data was analysed using CATH domain assignments downloaded from /cath/releases/all-releases/v4_2_0/cath-classification-data/ to define correct hits and false hits: 

  • The set of query sequences is defined by the CATH non-redundant S40_overlap_60 dataset (ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/all-releases/v4_2_0/non-redundant-data-sets/)
  • The set of expected hits comprises all PDB structures containing a domain with the same CATH code, counted either whenever the structure is in the set of processed sequences ("all") or only when it is also in the set of non-redundant sequences ("nr40").
  • A hit counts as a true positive if it shares the query's CATH code up to the level of homology ("CATH") or up to the level of topology ("CAT").
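
The true-positive criterion in the list above can be illustrated with a small sketch. It assumes CATH codes are written as dot-separated numbers (class.architecture.topology.homology, for example 3.40.50.300); the function names are hypothetical and not taken from the PSSH2 code.

```python
# Number of leading CATH levels that must agree for each evaluation mode.
LEVEL_DEPTH = {"CAT": 3, "CATH": 4}


def cath_prefix(code, depth):
    """Return the first `depth` levels of a dot-separated CATH code."""
    parts = code.split(".")
    if len(parts) < depth:
        raise ValueError(f"CATH code {code!r} has fewer than {depth} levels")
    return tuple(parts[:depth])


def is_true_positive(query_code, hit_code, level="CATH"):
    """True if query and hit share the CATH code down to the given level."""
    depth = LEVEL_DEPTH[level]
    return cath_prefix(query_code, depth) == cath_prefix(hit_code, depth)


# Same topology (3.40.50) but different homologous superfamily.
print(is_true_positive("3.40.50.300", "3.40.50.720", level="CAT"))
print(is_true_positive("3.40.50.300", "3.40.50.720", level="CATH"))
```

In this example the pair counts as correct under "CAT" but as a false hit under "CATH".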

The data was evaluated with respect to false discovery rate (FDR) and recall (true positive rate, TPR), either cumulatively, considering all hits with an E-value below the threshold ("C"), or in bins, considering only hits with an E-value between one tenth of the threshold and the threshold ("B"). This evaluation was carried out for the data obtained in February 2020 (202002) as well as previous data from September 2017 (201709) and has since been repeated for data from October 2020 (202010). The results are collected in PSSH CATH validation.csv.
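
As a rough illustration of the two evaluation modes, the sketch below computes FDR and TPR for a list of hits. It assumes FDR = FP / (TP + FP) and TPR = TP / (number of expected hits), that the threshold itself is excluded, and that the lower bin edge is included; these conventions, the function names and the toy data are assumptions, not the procedure used to produce the CSV file.

```python
def rates(labels, n_expected):
    """Return (FDR, TPR) for a list of booleans marking true positives."""
    tp = sum(1 for ok in labels if ok)
    fp = len(labels) - tp
    fdr = fp / (tp + fp) if labels else None  # undefined without any hits
    tpr = tp / n_expected
    return fdr, tpr


def evaluate(hits, thresholds, n_expected):
    """Evaluate hits, given as (evalue, is_true_positive) pairs.

    "C": cumulative, all hits with E-value below the threshold.
    "B": binned, hits with threshold/10 <= E-value < threshold.
    """
    results = {}
    for t in thresholds:
        cumulative = [ok for e, ok in hits if e < t]
        binned = [ok for e, ok in hits if t / 10 <= e < t]
        results[t] = {
            "C": rates(cumulative, n_expected),
            "B": rates(binned, n_expected),
        }
    return results


# Toy data: four hits, four expected hits in total.
toy_hits = [(1e-10, True), (1e-5, True), (5e-4, False), (2e-3, True)]
for threshold, modes in evaluate(toy_hits, [1e-3], n_expected=4).items():
    print(threshold, modes)
```

With the toy data and a threshold of 1e-3, the cumulative set holds two true and one false hit, while the bin holds only the single false hit.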

Files (7.7 GB)

Checksum Size
md5:778396cfe6b38b6c91d4f12f203ae3d3 8.8 kB
md5:ff12ac27cf8615ed84c75fc38c4bc411 23.5 MB
md5:865b58b6edf7aaf5ec30c2467a615e81 7.6 GB
md5:0dc2aa7a543baf26a626737bab11b02f 82.4 MB

Additional details

Related works

Continues
Journal article: 10.1093/nar/gkg110 (DOI)
Is documented by
Journal article: 10.1101/2020.07.16.207308v5 (DOI)
References
Journal article: 10.1038/nmeth.3258 (DOI)