High quality protein residues: top2018 all-atom-filtered residues - mmCIF

Williams, Christopher; Richardson, David; Richardson, Jane

doi:10.5281/zenodo.5889221

Published January 22, 2022 | Version 0.9

1. Duke University

Introduction
--------------------------------------------------------------------------------
This directory contains files from the top2018 dataset by the Richardson Lab at Duke University.

These are high-quality residues from high-quality, low redundancy protein chains in the PDB.

This dataset is quality-filtered on all atoms in the residue.

The accompanying publication is:
Williams, C. J., Richardson, D. C., & Richardson, J. S. (2021). The importance of residue-level filtering, and the Top2018 best-parts dataset of high‐quality protein residues. Protein Science. http://doi.org/10.1002/pro.4239

Usage recommendations
--------------------------------------------------------------------------------
Protein residues that fail the filtering criteria described below have been removed from the files. As a result, these files can be considered pre-filtered and will return only results for residues of good model quality with supporting experimental data. All protein atoms have been considered in filtering; these files should be usable for any protein question. If your work is strictly limited to mainchain atoms (plus CB), there is a separate version that has been filtered on only mainchain atoms.

The top2018 dataset is provided at several levels of homology clustering (30%, 50%, 70%, 90%) to ensure nonredundant datasets. The 70% homology level is a reliable default. The chains at that level are listed in top2018_chains_hom70_fullfiltered_60pct_complete.txt and found in top2018_pdbs_full_filtered_hom70.tar.gz.

Files are organized in subdirectories based on the first two letters of their PDB ids. The included python script sample_file_loop.py may aid in accessing the directory structure.
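
The layout described above can be walked with a few lines of Python. This is a minimal sketch and is not the included sample_file_loop.py. It assumes the structure files end in ".cif" and sit exactly one level below the extraction root; the helper names subdir_for and iter_cif_files are hypothetical.

```python
from pathlib import Path
from typing import Iterator


def subdir_for(pdb_id: str) -> str:
    """Return the subdirectory name for a PDB id: its first two characters."""
    return pdb_id[:2]


def iter_cif_files(root: Path) -> Iterator[Path]:
    """Yield every .cif file found one level below root, in sorted order.

    Assumption: files use a ".cif" extension; adjust the pattern if not.
    """
    for subdir in sorted(p for p in root.iterdir() if p.is_dir()):
        yield from sorted(subdir.glob("*.cif"))
```

Calling iter_cif_files(Path("path/to/extracted/archive")) yields one Path per structure file, which can then be handed to any mmCIF parser.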

Files already contain hydrogens added by Reduce. NQH (Asn/Gln/His) flips have been performed to ensure that these are the best versions of these structures.

top2018_metadata_full_filtered.csv contains information on release date, resolution, and validation scores for each file.

top2018_passrates_full_filtered.csv contains information on how many protein residues from the original chain passed the quality filters.
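
Both CSV files can be inspected with the Python standard library. The sketch below assumes only that the first row of each file is a header; the column names are not listed in this README, so the code reports them rather than hard-coding any. The function name read_table is hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, List, Tuple


def read_table(path: Path) -> Tuple[List[str], List[Dict[str, str]]]:
    """Return (column names, rows) from a CSV file with a header row.

    Assumption: the first line of the file holds the column names.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        columns = list(reader.fieldnames or [])
    return columns, rows
```

Printing the returned column list for top2018_metadata_full_filtered.csv is a quick way to find the fields for release date, resolution, and validation scores before filtering on them.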

Homology sets:
--------------------------------------------------------------------------------
For each sequence homology cluster provided by the RCSB PDB, the best chain was selected for inclusion in the dataset. This ensures minimal sequence/structural redundancy.

The top2018 is available at several different levels of homology clustering, which may be appropriate to different uses. Lists of the included chains at each homology level are included in this distribution.

Lower homology numbers mean less redundancy, but fewer total chains in the dataset.

For general use, ***we recommend the 70% homology set*** as a good balance between inclusivity and variety. This list is given in the file top2018_chains_hom70_fullfiltered_60pct_complete.txt
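
A chain list can be loaded into a set for quick membership tests. This sketch assumes the list file holds one chain identifier per line, which is not stated in this README; check the file and adjust the parsing if the layout differs. The function name load_chain_list is hypothetical.

```python
from pathlib import Path
from typing import Set


def load_chain_list(path: Path) -> Set[str]:
    """Read non-empty, non-comment lines into a set of chain identifiers.

    Assumption: one identifier per line; lines starting with "#" are skipped.
    """
    chains: Set[str] = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            entry = line.strip()
            if entry and not entry.startswith("#"):
                chains.add(entry)
    return chains
```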

Usage caveats:
--------------------------------------------------------------------------------
These files are incomplete. They are single chains from structures that may have had multiple chains. Residues that fail the filtering criteria have been removed. Programs with strong requirements for completeness or uninterrupted chains should be used with care. Chain completeness and fragmentation statistics are available in top2018_passrates_full_filtered.csv and as _top2018.percent_passrate in the .cif file.

All ligands and waters associated with the chain have been preserved without filtering. Robust ligand filtering is beyond the scope of this dataset. Trust the ligands at your own discretion.

Filtering criteria: Chain level
--------------------------------------------------------------------------------
Chain is protein
Released on or before Dec 31, 2018
Resolution < 2.0 Å
MolProbity Score < 2.0
<3% residues have cbeta deviations
<2% residues have covalent bond length outliers
<2% residues have covalent bond geometry outliers

For each sequence homology cluster provided by the RCSB PDB, the chain with the best (lowest) average of Resolution and MolProbity Score was selected.
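
The chain-level criteria and the per-cluster selection can be written out as a short sketch. This is an illustration of the rules as listed here, not the code used to build the dataset. The ChainStats record and its field names are hypothetical, and applying the filter before choosing the best chain in each cluster is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable

CUTOFF_DATE = date(2018, 12, 31)


@dataclass
class ChainStats:
    """Hypothetical per-chain record; field names are illustrative only."""
    chain_id: str
    cluster_id: str
    is_protein: bool
    released: date
    resolution: float
    molprobity_score: float
    pct_cbeta_deviations: float
    pct_bond_length_outliers: float
    pct_bond_geometry_outliers: float


def passes_chain_filter(c: ChainStats) -> bool:
    """Apply the chain-level thresholds listed above."""
    return (
        c.is_protein
        and c.released <= CUTOFF_DATE
        and c.resolution < 2.0
        and c.molprobity_score < 2.0
        and c.pct_cbeta_deviations < 3.0
        and c.pct_bond_length_outliers < 2.0
        and c.pct_bond_geometry_outliers < 2.0
    )


def rank_key(c: ChainStats) -> float:
    """Average of resolution and MolProbity score; lower is better."""
    return (c.resolution + c.molprobity_score) / 2.0


def best_per_cluster(chains: Iterable[ChainStats]) -> Dict[str, ChainStats]:
    """Keep, for each cluster, the passing chain with the lowest rank_key."""
    best: Dict[str, ChainStats] = {}
    for c in chains:
        if not passes_chain_filter(c):
            continue
        current = best.get(c.cluster_id)
        if current is None or rank_key(c) < rank_key(current):
            best[c.cluster_id] = c
    return best
```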

Filtering criteria: Residue level
--------------------------------------------------------------------------------
Even excellent structures usually contain some poorly-resolved regions. Residue-level filtering helps avoid including these regions in otherwise high-quality data.

All atoms in a residue:
B-factor <= 40
Real-space correlation coefficient (rscc) >= 0.7
2Fo-Fc map value >= 1.2

Additionally, residues are not allowed to have:
Covalent geometry outliers
Steric overlaps or "clashes", as per Probe
Alternate conformations
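
As an illustration of how the residue-level rules combine, the sketch below keeps a residue only if every atom meets all three per-atom thresholds and none of the residue-wide exclusions apply. The AtomStats and ResidueStats records are hypothetical; the per-atom values would have to come from your own validation output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AtomStats:
    """Hypothetical per-atom values; names are illustrative only."""
    b_factor: float
    rscc: float
    map_value: float  # 2Fo-Fc map value at the atom


@dataclass
class ResidueStats:
    """Hypothetical per-residue record."""
    atoms: List[AtomStats]
    has_geometry_outlier: bool = False
    has_clash: bool = False
    has_altconf: bool = False


def residue_passes(res: ResidueStats) -> bool:
    """Return True only if the residue meets every criterion listed above."""
    if res.has_geometry_outlier or res.has_clash or res.has_altconf:
        return False
    # Every atom must pass; a residue with no atoms is rejected.
    return bool(res.atoms) and all(
        atom.b_factor <= 40.0 and atom.rscc >= 0.7 and atom.map_value >= 1.2
        for atom in res.atoms
    )
```

Because the test is applied to all atoms, a single poorly supported sidechain atom is enough to remove an otherwise well-ordered residue.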

Chain Completeness criteria
--------------------------------------------------------------------------------
Chains which lost >40% of their residues during filtering were dropped from this dataset. All chains present here are at least 60% complete.
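
The completeness rule is simple arithmetic: a chain is kept when at least 60% of its original residues survive filtering. A minimal sketch, with hypothetical function and parameter names:

```python
def keep_chain(n_original: int, n_kept: int, min_percent: int = 60) -> bool:
    """Return True if the chain retains at least min_percent of its residues.

    Integer arithmetic avoids floating-point rounding at the 60% boundary.
    """
    if n_original <= 0:
        raise ValueError("n_original must be positive")
    return n_kept * 100 >= min_percent * n_original
```

For example, a 200-residue chain with 120 residues remaining is exactly 60% complete and is kept, while one with 119 remaining is dropped.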

Filtering documentation
--------------------------------------------------------------------------------
Each file documents its pruned residues and included segments in a cif data block named data_top2018_dataset. This block can be found at the end of the file.

In the _top2018_deleted_residue loop, causes of pruning are documented. If a residue was removed due to failing the B-factor filter, a "b" will appear in the appropriate column. Otherwise, a "." will appear. Other filtering criteria are treated similarly with the following codes:
b - B-factor
c - RSCC
m - map value
g - geometry outliers
o - steric overlaps
a - alternate conformations
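
Once the flag columns for a deleted residue have been read with the mmCIF parser of your choice, they can be translated back into readable reasons. This sketch covers only the decoding step and assumes the flags arrive as a sequence of single-character strings, one per criterion column; decode_flags is a hypothetical helper name.

```python
from typing import Iterable, List

# Codes as documented in the list above.
REASONS = {
    "b": "B-factor",
    "c": "RSCC",
    "m": "map value",
    "g": "geometry outliers",
    "o": "steric overlaps",
    "a": "alternate conformations",
}


def decode_flags(flags: Iterable[str]) -> List[str]:
    """Translate per-column flags into reasons, skipping "." placeholders."""
    reasons: List[str] = []
    for flag in flags:
        if flag == ".":
            continue
        if flag not in REASONS:
            raise ValueError(f"unknown filter code: {flag!r}")
        reasons.append(REASONS[flag])
    return reasons
```

A residue flagged as ["b", ".", "m", ".", ".", "."] was therefore removed for failing both the B-factor and map value filters.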

Version history
--------------------------------------------------------------------------------
Version 0.9
Initial upload to establish DOI

Files

Files (2.4 MB)

Name                                                  Size       MD5
README                                                5.7 kB     35adafcae6dd270936b652ebf6c4354e
sample_file_loop.py                                   640 Bytes  0f75606741b20ac43540c7ef1f26e746
top2018_chains_hom30_fullfiltered_60pct_complete.txt  50.7 kB    ccba58b9a2f628830a182ab020e61ba6
top2018_chains_hom50_fullfiltered_60pct_complete.txt  72.9 kB    251571af4dffb6fd922c38ac09f9fe92
top2018_chains_hom70_fullfiltered_60pct_complete.txt  84.9 kB    0b03b8b0f4a182fefa6d59cb19e3e683
top2018_chains_hom90_fullfiltered_60pct_complete.txt  94.2 kB    2b4fc8cc2dc093451bd973ba9b63442d
top2018_metadata_full_filtered.csv                    1.8 MB     385a2968247083a71e65e9a9e573d059
top2018_passrates_full_filtered.csv                   305.5 kB   6f9cfbacac6d44c9feb7e9ee8f4c1313
