There is a newer version of this record available.

Dataset Open Access

High quality protein residues: top2018 mainchain-filtered residues

Williams, Christopher; Richardson, David; Richardson, Jane


JSON-LD (schema.org) Export

{
  "description": "<p>Introduction<br>\n--------------------------------------------------------------------------------<br>\nThis directory contains files from the top2018 dataset by the Richardson Lab at Duke University.</p>\n\n<p>These are high-quality residues from high-quality, low-redundancy protein chains in the PDB.</p>\n\n<p><br>\nUsage recommendations<br>\n--------------------------------------------------------------------------------<br>\nProtein residues that fail the filtering criteria described below have been removed from the files.&nbsp; As a result, these files can be considered pre-filtered and will return only results for residues of good model quality with supporting experimental data.&nbsp; As long as the question concerns mainchain protein atoms, these files should be usable as is.</p>\n\n<p>The top2018 contains several different levels of homology clustering to ensure nonredundant datasets.&nbsp; The 70% homology level is a reliable default.&nbsp; These chains are listed in top2018_chains_hom70_60pct_complete.txt</p>\n\n<p>Files are organized in subdirectories based on the first two letters of their PDB IDs.</p>\n\n<p>Files already contain hydrogens added by Reduce.&nbsp; NQH flips have been performed to ensure that these are the best versions of these structures.</p>\n\n<p>top2018_metadata_mcfilter.csv contains information on release date, resolution, and validation scores.</p>\n\n<p>top2018_passrates_mc_filtered.csv contains information on how many protein residues from the original chain passed the quality filters.</p>\n\n<p><br>\nHomology sets:<br>\n--------------------------------------------------------------------------------<br>\nUsing sequence homology clusters provided by the RCSB PDB, for each homology cluster, the best chain was selected for inclusion in the dataset.&nbsp; This ensures minimal sequence/structural redundancy.</p>\n\n<p>The top2018 is available at several different levels of homology clustering, which may be appropriate to different uses.&nbsp; Lists of the included chains at each homology level are included in this distribution.</p>\n\n<p>Lower homology numbers mean greater variety and less redundancy, but also fewer total chains in the dataset.</p>\n\n<p>For general use, ***we recommend the 70% homology set*** as a good balance between inclusivity and variety. This list is given in the file top2018_chains_hom70_60pct_complete.txt</p>\n\n<p><br>\nUsage caveats:<br>\n--------------------------------------------------------------------------------<br>\nThese files are incomplete.&nbsp; They are single chains from structures that may have had multiple chains.&nbsp; Residues that fail the filtering criteria have been removed.&nbsp; Programs with strong requirements for completeness or uninterrupted chains should be used with care.</p>\n\n<p>All header information from the original structure has been preserved.&nbsp; This includes information about chains and residues no longer present in the file.</p>\n\n<p>All ligands and waters associated with the chain have been preserved without filtering.&nbsp; Robust ligand filtering is beyond the scope of this dataset.&nbsp; Trust the ligands at your own discretion.</p>\n\n<p>Sidechain atoms beyond CB have not been considered in the filtering.&nbsp; However, all sidechains have been included for residues that passed the mainchain filters.&nbsp; DO NOT use this set of files for serious questions involving sidechains.&nbsp; See our all-atom filtered dataset instead.</p>\n\n<p><br>\nFiltering criteria: Chain level<br>\n--------------------------------------------------------------------------------<br>\nChain is protein<br>\nReleased on or before Dec 31, 2018<br>\nResolution &lt; 2.0 &Aring;<br>\nMolProbity Score &lt; 2.0<br>\n&lt;3% residues have cbeta deviations<br>\n&lt;2% residues have covalent bond length outliers<br>\n&lt;2% residues have covalent bond geometry outliers</p>\n\n<p>Using sequence homology clusters provided by the RCSB PDB, for each homology cluster, the chain with the best (lowest) average of Resolution and MolProbity Score was selected.</p>\n\n<p><br>\nFiltering criteria: Residue level<br>\n--------------------------------------------------------------------------------<br>\nEven good structures may contain poorly-resolved regions.&nbsp; Residue-level filtering helps avoid including these regions in otherwise high-quality data.</p>\n\n<p>Mainchain atoms are defined as N, CA, C, O, CB.<br>\nNote that CB is included, since its ideal position is defined by the other mainchain atoms.</p>\n\n<p>All mainchain atoms in a residue:<br>\nB-factor &lt;= 40<br>\nReal-space correlation coefficient (rscc) &gt;= 0.7<br>\n2Fo-Fc map value &gt;= 1.2</p>\n\n<p>Additionally, residues are not allowed to have:<br>\nCovalent geometry outliers<br>\nSteric overlaps or &quot;clashes&quot;, as per Probe<br>\nAlternate conformations</p>\n\n<p><br>\nChain Completeness criteria<br>\n--------------------------------------------------------------------------------<br>\nChains which lost &gt;40% of their residues during filtering were dropped from this dataset.&nbsp; All chains present here are at least 60% complete.</p>", 
  "license": "https://creativecommons.org/licenses/by/4.0/legalcode", 
  "creator": [
    {
      "affiliation": "Duke University", 
      "@id": "https://orcid.org/0000-0002-5808-8768", 
      "@type": "Person", 
      "name": "Williams, Christopher"
    }, 
    {
      "affiliation": "Duke University", 
      "@id": "https://orcid.org/0000-0001-5069-343X", 
      "@type": "Person", 
      "name": "Richardson, David"
    }, 
    {
      "affiliation": "Duke University", 
      "@id": "https://orcid.org/0000-0002-3311-2944", 
      "@type": "Person", 
      "name": "Richardson, Jane"
    }
  ], 
  "url": "https://zenodo.org/record/4626150", 
  "datePublished": "2021-03-21", 
  "version": "0.9", 
  "@context": "https://schema.org/", 
  "distribution": [
    {
      "contentUrl": "https://zenodo.org/api/files/87f145ae-721d-4c6f-a36a-62c56d2dadac/README", 
      "encodingFormat": "", 
      "@type": "DataDownload"
    }, 
    {
      "contentUrl": "https://zenodo.org/api/files/87f145ae-721d-4c6f-a36a-62c56d2dadac/top2018_mc_filtered_pdbs.tar.gz", 
      "encodingFormat": "gz", 
      "@type": "DataDownload"
    }
  ], 
  "identifier": "https://doi.org/10.5281/zenodo.4626150", 
  "@id": "https://doi.org/10.5281/zenodo.4626150", 
  "@type": "Dataset", 
  "name": "High quality protein residues: top2018 mainchain-filtered residues"
}
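
The filtering rules in the description above can be illustrated with a minimal Python sketch. This is not the Richardson Lab's pipeline: the `Atom`, `Residue`, and `ChainSummary` classes and all function names are hypothetical, the per-atom application of the rscc and map-value thresholds is an assumption, and residues lacking CB (such as glycine) are assumed to be judged on the mainchain atoms they do have. Only the thresholds themselves are taken from the description.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Mainchain atoms as defined in the description (CB is included).
MAINCHAIN_ATOMS = ("N", "CA", "C", "O", "CB")


@dataclass
class Atom:  # hypothetical container, not part of the dataset
    name: str
    bfactor: float
    rscc: float        # real-space correlation coefficient
    map_value: float   # 2Fo-Fc map value


@dataclass
class Residue:  # hypothetical container, not part of the dataset
    atoms: List[Atom]
    has_geometry_outlier: bool = False
    has_clash: bool = False
    has_altconf: bool = False


@dataclass
class ChainSummary:  # hypothetical container, not part of the dataset
    chain_id: str
    resolution: float
    molprobity_score: float


def passes_mainchain_filter(res: Residue) -> bool:
    """Apply the residue-level thresholds to mainchain atoms only."""
    if res.has_geometry_outlier or res.has_clash or res.has_altconf:
        return False
    mainchain = [a for a in res.atoms if a.name in MAINCHAIN_ATOMS]
    # Sidechain atoms beyond CB are ignored, as in the description.
    return all(
        a.bfactor <= 40 and a.rscc >= 0.7 and a.map_value >= 1.2
        for a in mainchain
    )


def chain_is_complete_enough(residues: Iterable[Residue]) -> bool:
    """Keep a chain only if at least 60% of its residues pass."""
    residues = list(residues)
    if not residues:
        return False
    kept = sum(passes_mainchain_filter(r) for r in residues)
    return kept / len(residues) >= 0.6


def best_chain(cluster: Iterable[ChainSummary]) -> ChainSummary:
    """Pick the chain with the lowest average of resolution and MolProbity score."""
    return min(cluster, key=lambda c: (c.resolution + c.molprobity_score) / 2)
```

In this sketch a residue with a poorly ordered sidechain but well-ordered mainchain atoms still passes, which mirrors the caveat that these files should not be used for serious sidechain questions.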