High quality protein residues: top2018 mainchain-filtered residues

doi:10.5281/zenodo.5115075

Published July 19, 2021 | Version 1.0

Dataset Open

1. Duke University

Introduction
--------------------------------------------------------------------------------
This directory contains files from the top2018 dataset by the Richardson Lab at Duke University.

These are high-quality residues from high-quality, low-redundancy protein chains in the PDB.

Usage recommendations
--------------------------------------------------------------------------------
Protein residues that fail the filtering criteria described below have been removed from the files. As a result, these files can be considered pre-filtered and will return only results for residues of good model quality with supporting experimental data. As long as the question concerns mainchain protein atoms, these files should be usable as is. There is a separate version that has been filtered on all atoms that is suitable for sidechains.

The top2018 is provided at several different levels of homology clustering to ensure nonredundant datasets. The 70% homology level is a reliable default. These chains are listed in top2018_chains_hom70_mcfilter_60pct_complete.txt and found in top2018_pdbs_mc_filtered_hom70.tar.gz.

Files are organized in subdirectories based on the first two letters of their PDB ids. The included python script sample_file_loop.py may aid in accessing the directory structure.
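
As an illustration of walking the two-letter subdirectories, here is a minimal Python sketch. It is not the included sample_file_loop.py. It assumes that the archive has been extracted to a directory of your choosing and that the structure files end in .pdb; the function name iter_pdb_files is hypothetical.

```python
from pathlib import Path
from typing import Iterator, Union


def iter_pdb_files(root: Union[str, Path]) -> Iterator[Path]:
    """Yield every .pdb file below `root`, in sorted path order.

    Assumption: files sit in subdirectories of the extracted archive and
    carry a .pdb extension. The search is recursive, so the exact depth
    of the subdirectories does not matter.
    """
    yield from sorted(Path(root).rglob("*.pdb"))
```

Calling list(iter_pdb_files("path/to/extracted/archive")) would then give one path per chain file, regardless of which subdirectory it lives in.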

Files already contain hydrogens added by Reduce. NQH flips have been performed to ensure that these are the best versions of these structures.

top2018_metadata_mc_filtered.csv contains information on release date, resolution, and validation scores.

top2018_passrates_mc_filtered.csv contains information on how many protein residues from the original chain passed the quality filters.

Homology sets:
--------------------------------------------------------------------------------
Using sequence homology clusters provided by the RCSB PDB, for each homology cluster, the best chain was selected for inclusion in the dataset. This ensures minimal sequence/structural redundancy.

The top2018 is available at several different levels of homology clustering, which may be appropriate to different uses. Lists of the included chains at each homology level are included in this distribution.

Lower homology numbers mean greater variety and less redundancy, but also fewer total chains in the dataset.

For general use, ***we recommend the 70% homology set*** as a good balance between inclusivity and variety. This list is given in the file top2018_chains_hom70_mcfilter_60pct_complete.txt.

Usage caveats:
--------------------------------------------------------------------------------
These files are incomplete. They are single chains from structures that may have had multiple chains. Residues that fail the filtering criteria have been removed. Programs with strong requirements for completeness or uninterrupted chains should be used with care. Chain completeness and fragmentation statistics are available in top2018_passrates_mc_filtered.csv and in USER records at the end of each .pdb file.

All header information from the original structure has been preserved. This includes information about chains and residues no longer present in the file.

All ligands and waters associated with the chain have been preserved without filtering. Robust ligand filtering is beyond the scope of this dataset. Trust the ligands at your own discretion.

Sidechain atoms beyond CB have not been considered in the filtering. However, all sidechains have been included for residues that passed the mainchain filters. DO NOT use this set of files for serious questions involving sidechains. See our all-atom filtered dataset instead.

Filtering criteria: Chain level
--------------------------------------------------------------------------------
Chain is protein
Released on or before Dec 31, 2018
Resolution < 2.0 Å
MolProbity Score < 2.0
<3% residues have cbeta deviations
<2% residues have covalent bond length outliers
<2% residues have covalent bond geometry outliers

Using sequence homology clusters provided by the RCSB PDB, for each homology cluster, the chain with the best (lowest) average of Resolution and MolProbity Score was selected.
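
The chain-level thresholds and the per-cluster selection rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline used to build the dataset: the ChainRecord fields, the function names, and the tie-breaking behavior (first chain seen wins) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable


@dataclass(frozen=True)
class ChainRecord:
    # Hypothetical record layout; field names are illustrative only.
    chain_id: str
    cluster_id: str
    is_protein: bool
    release_date: date
    resolution: float
    molprobity_score: float
    cbeta_outlier_fraction: float
    bond_length_outlier_fraction: float
    bond_geometry_outlier_fraction: float

    @property
    def quality(self) -> float:
        """Average of resolution and MolProbity score (lower is better)."""
        return (self.resolution + self.molprobity_score) / 2.0


def passes_chain_filters(c: ChainRecord) -> bool:
    """Apply the chain-level thresholds listed in this section."""
    return (
        c.is_protein
        and c.release_date <= date(2018, 12, 31)
        and c.resolution < 2.0
        and c.molprobity_score < 2.0
        and c.cbeta_outlier_fraction < 0.03
        and c.bond_length_outlier_fraction < 0.02
        and c.bond_geometry_outlier_fraction < 0.02
    )


def select_best_chains(chains: Iterable[ChainRecord]) -> Dict[str, ChainRecord]:
    """Keep, for each cluster, the passing chain with the lowest quality value.

    Assumption: ties are broken in favor of the chain seen first.
    """
    best: Dict[str, ChainRecord] = {}
    for chain in chains:
        if not passes_chain_filters(chain):
            continue
        current = best.get(chain.cluster_id)
        if current is None or chain.quality < current.quality:
            best[chain.cluster_id] = chain
    return best
```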

Filtering criteria: Residue level
--------------------------------------------------------------------------------
Even good structures may contain poorly-resolved regions. Residue-level filtering helps avoid including these regions in otherwise high-quality data.

Mainchain atoms are defined as N, CA, C, O, CB.
Note that CB is included, since its ideal position is defined by the other mainchain atoms.

All mainchain atoms in a residue:
Bfactor <= 40
Real-space correlation coefficient (rscc) >= 0.7
2Fo-Fc map value >= 1.2

Additionally, residues are not allowed to have:
Covalent geometry outliers
Steric overlaps or "clashes", as per Probe
Alternate conformations
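
The residue-level criteria above can be expressed as a small Python check. This is a sketch only: the Atom and Residue classes are hypothetical, and it assumes that B-factor, rscc, and map value are available per atom. The returned string reuses the 1-letter reason codes (bcmgoa) that the USER DEL records in each file use.

```python
from dataclasses import dataclass
from typing import List

MAINCHAIN_ATOMS = frozenset({"N", "CA", "C", "O", "CB"})


@dataclass
class Atom:
    name: str
    bfactor: float
    rscc: float       # real-space correlation coefficient
    map_value: float  # 2Fo-Fc map value


@dataclass
class Residue:
    atoms: List[Atom]
    geometry_outlier: bool = False
    steric_overlap: bool = False
    alternate_conformation: bool = False


def pruning_reasons(residue: Residue) -> str:
    """Return the reason codes for pruning; an empty string means the residue passes.

    Only mainchain atoms (including CB) are examined; sidechain atoms
    beyond CB are ignored, as in the mainchain-filtered dataset.
    """
    mainchain = [a for a in residue.atoms if a.name in MAINCHAIN_ATOMS]
    checks = [
        ("b", any(a.bfactor > 40 for a in mainchain)),
        ("c", any(a.rscc < 0.7 for a in mainchain)),
        ("m", any(a.map_value < 1.2 for a in mainchain)),
        ("g", residue.geometry_outlier),
        ("o", residue.steric_overlap),
        ("a", residue.alternate_conformation),
    ]
    return "".join(code for code, failed in checks if failed)
```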

Chain Completeness criteria
--------------------------------------------------------------------------------
Chains which lost >40% of their residues during filtering were dropped from this dataset. All chains present here are at least 60% complete.
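
For clarity about the boundary case, the completeness rule can be written as a one-line check. This is a sketch with a hypothetical function name; integer arithmetic is used so that a chain retaining exactly 60% of its residues is kept.

```python
def is_complete_enough(n_original: int, n_passed: int, min_percent: int = 60) -> bool:
    """Return True if at least `min_percent` of the original residues passed filtering."""
    if n_original <= 0:
        raise ValueError("n_original must be positive")
    # Compare as integers to avoid floating-point rounding at the threshold.
    return n_passed * 100 >= min_percent * n_original
```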

Filtering documentation
--------------------------------------------------------------------------------
Each file documents its pruned and included residues with USER records. These include self-documenting USER DOC lines as follows:
USER DOC Lines marked with USER DEL list residues pruned by
USER DOC quality filtering.
USER DOC Format is chain:resseq:icode:reason_for_pruning
USER DOC Reasons for pruning are abbreviated as 1-letter codes: bcmgoa
USER DOC b=bfactor, c=real space correlation, m=2Fo-Fc mapvalue
USER DOC g=geometry outlier, o=steric overlap, a=alternate conformations
USER DOC Lines marked USER INC list the uninterrupted fragments of structure
USER DOC still included after pruning by quality filtering
USER DOC Format is chain1:resseq1:icode1:chain2:resseq2:icode2:fragment_length
USER DOC where 1 is the first and 2 the last residue of the fragment
USER DOC Line marked with USER PCT gives statistics for structure completeness
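
A minimal Python sketch for reading these records is given below. It assumes each record line consists of the word USER, a tag (DEL or INC), and then the colon-separated fields described in the USER DOC lines; the exact column layout in the files has not been verified here, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class PrunedResidue:
    chain: str
    resseq: int
    icode: str
    reasons: str  # subset of the 1-letter codes "bcmgoa"


@dataclass(frozen=True)
class Fragment:
    start: Tuple[str, int, str]  # (chain, resseq, icode) of the first residue
    end: Tuple[str, int, str]    # (chain, resseq, icode) of the last residue
    length: int


def parse_user_records(lines: Iterable[str]) -> Tuple[List[PrunedResidue], List[Fragment]]:
    """Collect USER DEL and USER INC records; all other lines are ignored."""
    pruned: List[PrunedResidue] = []
    fragments: List[Fragment] = []
    for line in lines:
        parts = line.split(None, 2)
        if len(parts) < 3 or parts[0] != "USER":
            continue
        tag = parts[1]
        fields = [f.strip() for f in parts[2].split(":")]
        if tag == "DEL" and len(fields) == 4:
            chain, resseq, icode, reasons = fields
            pruned.append(PrunedResidue(chain, int(resseq), icode, reasons))
        elif tag == "INC" and len(fields) == 7:
            c1, r1, i1, c2, r2, i2, length = fields
            fragments.append(Fragment((c1, int(r1), i1), (c2, int(r2), i2), int(length)))
    return pruned, fragments
```

With the parsed fragments in hand, a program that needs uninterrupted stretches of chain can restrict itself to fragments above a chosen minimum length.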

Files (4.7 GB)
--------------------------------------------------------------------------------

Name	MD5	Size
README	c584daa2388c7791f50fd6559bca0dc6	6.0 kB
sample_file_loop.py	eaa77f695a6555f516371fdc6d393c81	641 Bytes
top2018_chains_hom30_mcfilter_60pct_complete.txt	f056880fbd878a048c130724d5e60b9b	58.1 kB
top2018_chains_hom50_mcfilter_60pct_complete.txt	086725b3d16c440d60bae73930b40763	82.6 kB
top2018_chains_hom70_mcfilter_60pct_complete.txt	c7e39101aed13b47ff9fd224cbb25e22	95.7 kB
top2018_chains_hom90_mcfilter_60pct_complete.txt	f6305987966abac79fc5b1a5351a2626	106.3 kB
top2018_metadata_mc_filtered.csv	52325221e3f26ab10ac544ed9bc505df	2.0 MB
top2018_passrates_mc_filtered.csv	cd5d510592873d2e01cf2585066e29f9	340.3 kB
top2018_pdbs_mc_filtered_hom30.tar.gz	cfb648bf739e8396c437919805c03958	782.1 MB
top2018_pdbs_mc_filtered_hom50.tar.gz	c5e4c1bb27741590e00df7aad20bd657	1.1 GB
top2018_pdbs_mc_filtered_hom70.tar.gz	7d08a54068f2af6f2e2f929e2e812ebf	1.3 GB
top2018_pdbs_mc_filtered_hom90.tar.gz	cb2dc438f313ebfb00e7ddc8846e6da8	1.5 GB
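
Since the archives are large, it may be worth checking a download against the MD5 values listed above. The following Python sketch uses only the standard library; the function name md5_of_file and the local file path are assumptions.

```python
import hashlib
from pathlib import Path
from typing import Union


def md5_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, md5_of_file("top2018_pdbs_mc_filtered_hom70.tar.gz") should equal 7d08a54068f2af6f2e2f929e2e812ebf if the download is intact.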
