PSSH2 - database of protein sequence-to-structure homologies

Andrea Schafferhans; Sean O'Donoghue

doi:10.5281/zenodo.4279164

Published November 18, 2020 | Version 2020-07

Dataset Open

1. HSWT
2. Garvan Institute of Medical Research

The PSSH2 data set

PSSH2 is a database of protein sequence-to-structure homologies based on HHblits, an alignment method that iteratively compares hidden Markov models (HMMs). To ensure the highest possible final alignment quality for matches in Aquaria, we first calculated HMM profiles for each unique PDB sequence (PDB_full) and for each unique Swiss-Prot sequence. We then generated PSSH2 by using HHblits to find similarities between the HMMs from PDB and the HMMs from UniProt sequences.

This dataset contains the Swiss-Prot and PDB data used for generating PSSH2, along with the PSSH2 data itself. The PSSH2 data consists of the sequence-to-structure alignments used in Aquaria (aquaria.ws) and in the Aquaria COVID-19 resource (http://aquaria.ws/covid).

Calculating PSSH2

Most of the Swiss-Prot and PDB data was downloaded in February 2020; incremental updates, especially those related to COVID-19, were added until July 2020.
Generating PSSH2: We used Uniclust30 from HH-suite, a database of non-redundant UniProt sequence clusters in which the highest pairwise sequence identity between clusters was 30% (http://gwdu111.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz). The HHblits code and the code for running the calculations were retrieved from git (https://github.com/soedinglab/hh-suite.git and https://github.com/aschafu/PSSH2.git, respectively) at the time of each calculation, between February and July 2020.

Evaluating PSSH2

The resulting alignment data was analysed using CATH domain assignments downloaded from /cath/releases/all-releases/v4_2_0/cath-classification-data/ to define correct hits and false hits:

The set of query sequences is defined by the CATH non-redundant S40_overlap_60 dataset (ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/all-releases/v4_2_0/non-redundant-data-sets/).
The set of expected hits consists of all PDB structures containing a domain with the same CATH code as the query, counted either if they are in the set of processed sequences (-> all) or only if they are also in the set of non-redundant sequences (-> nr40).
The set of true positives consists of hits that share the query's CATH code up to the level of homology ("CATH") or up to the level of topology ("CAT").
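The two matching levels can be sketched in Python as a comparison of code prefixes. This is a minimal illustration, not the code used to produce the validation file: it assumes CATH codes are written as dot-separated strings (class.architecture.topology.homology), and the function names and example codes are hypothetical.

```python
# Number of leading CATH levels that must agree for each matching mode.
LEVEL_DEPTH = {"CAT": 3, "CATH": 4}


def cath_prefix(code, depth):
    """Return the first `depth` levels of a dot-separated CATH code."""
    parts = code.strip().split(".")
    if len(parts) < depth:
        raise ValueError(f"CATH code {code!r} has fewer than {depth} levels")
    return tuple(parts[:depth])


def same_classification(query_code, hit_code, level="CATH"):
    """True if both codes agree up to topology ("CAT") or homology ("CATH")."""
    depth = LEVEL_DEPTH[level]
    return cath_prefix(query_code, depth) == cath_prefix(hit_code, depth)


def is_true_positive(query_codes, hit_codes, level="CATH"):
    """Assumption: a hit counts as correct if any domain of the hit
    structure matches any domain of the query at the chosen level."""
    return any(
        same_classification(q, h, level)
        for q in query_codes
        for h in hit_codes
    )


# Illustrative codes: same topology, different homologous superfamily.
print(same_classification("3.40.50.300", "3.40.50.720", "CAT"))   # True
print(same_classification("3.40.50.300", "3.40.50.720", "CATH"))  # False
```

Matching at the "CAT" level is the more permissive of the two, so every "CATH" true positive is also a "CAT" true positive.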

The data was evaluated with respect to false discovery rate (FDR) and recall (true positive rate, TPR), either by cumulatively considering all hits with an E-value below a given threshold ("C") or by considering bins of hits with an E-value between the threshold and one tenth of the threshold ("B"). This evaluation was carried out for the data obtained in February 2020 (202002) as well as for previous data from September 2017 (201709), and has since been repeated for data from October 2020 (202010). The results are collected in PSSH CATH validation.csv.
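The following Python sketch illustrates how the cumulative ("C") and binned ("B") measures might be computed for one E-value threshold. It is an illustration under stated assumptions rather than the evaluation code itself: FDR is taken as false hits divided by all selected hits, recall as true hits divided by the number of expected hits, and the bin is treated as half-open (threshold/10 <= E < threshold). The function name and the example data are hypothetical.

```python
def evaluate(hits, n_expected, threshold, mode="C"):
    """Return (FDR, recall) for one E-value threshold.

    hits       -- list of (e_value, is_true_hit) pairs
    n_expected -- number of expected hits, the recall denominator
    mode       -- "C": all hits with E-value below the threshold
                  "B": hits with threshold/10 <= E-value < threshold
    """
    if mode == "C":
        selected = [ok for e, ok in hits if e < threshold]
    elif mode == "B":
        selected = [ok for e, ok in hits if threshold / 10 <= e < threshold]
    else:
        raise ValueError(f"unknown mode {mode!r}")

    true_pos = sum(selected)
    false_pos = len(selected) - true_pos
    # Convention chosen here: an empty selection reports an FDR of 0.0.
    fdr = false_pos / len(selected) if selected else 0.0
    recall = true_pos / n_expected if n_expected else 0.0
    return fdr, recall


# Made-up hits for one query: (E-value, whether the hit is correct).
hits = [(1e-30, True), (1e-12, True), (5e-4, True), (2e-4, False), (0.5, False)]

print(evaluate(hits, n_expected=4, threshold=1e-3, mode="C"))  # (0.25, 0.75)
print(evaluate(hits, n_expected=4, threshold=1e-3, mode="B"))  # (0.5, 0.25)
```

The cumulative view shows the overall quality of everything accepted at a threshold, while the binned view isolates the hits near the threshold, where false hits tend to concentrate.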

Files (7.7 GB)

Name	MD5	Size
PSSH CATH validation.csv	778396cfe6b38b6c91d4f12f203ae3d3	8.8 kB
PSSH2_PDB_chain_2020-07.csv.gz	ff12ac27cf8615ed84c75fc38c4bc411	23.5 MB
PSSH2_pssh2_2020-07.csv.gz	865b58b6edf7aaf5ec30c2467a615e81	7.6 GB
PSSH2_swissprot_2020-07.csv.gz	0dc2aa7a543baf26a626737bab11b02f	82.4 MB
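Because the largest file is 7.6 GB, it may be worth checking a download against the MD5 checksums listed above before using it. The sketch below uses Python's standard hashlib module and reads the file in chunks so that it never has to fit in memory. The helper names are hypothetical, and the script assumes the files keep their original names in the local directory.

```python
import hashlib
import os

# Checksums as listed in the file table of this record.
EXPECTED_MD5 = {
    "PSSH CATH validation.csv": "778396cfe6b38b6c91d4f12f203ae3d3",
    "PSSH2_PDB_chain_2020-07.csv.gz": "ff12ac27cf8615ed84c75fc38c4bc411",
    "PSSH2_pssh2_2020-07.csv.gz": "865b58b6edf7aaf5ec30c2467a615e81",
    "PSSH2_swissprot_2020-07.csv.gz": "0dc2aa7a543baf26a626737bab11b02f",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Compare a local file against the checksum listed for its file name."""
    expected = EXPECTED_MD5[os.path.basename(path)]
    return md5_of_file(path) == expected


if __name__ == "__main__":
    for name in EXPECTED_MD5:
        if os.path.exists(name):
            status = "OK" if verify_download(name) else "MISMATCH"
            print(f"{name}: {status}")
        else:
            print(f"{name}: not found")
```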

Additional details

Continues: Journal article: 10.1093/nar/gkg110 (DOI)
Is documented by: Journal article: 10.1101/2020.07.16.207308v5 (DOI)
References: Journal article: 10.1038/nmeth.3258 (DOI)

