Dataset Open Access
Williams, Christopher;
Richardson, David;
Richardson, Jane
<?xml version='1.0' encoding='utf-8'?> <resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4.1/metadata.xsd"> <identifier identifierType="DOI">10.5281/zenodo.4626150</identifier> <creators> <creator> <creatorName>Williams, Christopher</creatorName> <givenName>Christopher</givenName> <familyName>Williams</familyName> <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0002-5808-8768</nameIdentifier> <affiliation>Duke University</affiliation> </creator> <creator> <creatorName>Richardson, David</creatorName> <givenName>David</givenName> <familyName>Richardson</familyName> <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0001-5069-343X</nameIdentifier> <affiliation>Duke University</affiliation> </creator> <creator> <creatorName>Richardson, Jane</creatorName> <givenName>Jane</givenName> <familyName>Richardson</familyName> <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0002-3311-2944</nameIdentifier> <affiliation>Duke University</affiliation> </creator> </creators> <titles> <title>High quality protein residues: top2018 mainchain-filtered residues</title> </titles> <publisher>Zenodo</publisher> <publicationYear>2021</publicationYear> <dates> <date dateType="Issued">2021-03-21</date> </dates> <resourceType resourceTypeGeneral="Dataset"/> <alternateIdentifiers> <alternateIdentifier alternateIdentifierType="url">https://zenodo.org/record/4626150</alternateIdentifier> </alternateIdentifiers> <relatedIdentifiers> <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.4626149</relatedIdentifier> </relatedIdentifiers> <version>0.9</version> <rightsList> <rights rightsURI="https://creativecommons.org/licenses/by/4.0/legalcode">Creative Commons Attribution 4.0 International</rights> <rights 
rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights> </rightsList> <descriptions> <description descriptionType="Abstract"><p>Introduction<br> --------------------------------------------------------------------------------<br> This directory contains files from the top2018 dataset by the Richardson Lab at Duke University.</p> <p>These are high-quality residues from high-quality, low-redundancy protein chains in the PDB.</p> <p><br> Usage recommendations<br> --------------------------------------------------------------------------------<br> Protein residues that fail the filtering criteria described below have been removed from the files.&nbsp; As a result, these files can be considered pre-filtered and will return only results for residues of good model quality with supporting experimental data.&nbsp; As long as the question concerns mainchain protein atoms, these files should be usable as is.</p> <p>The top2018 contains several different levels of homology clustering to ensure nonredundant datasets.&nbsp; The 70% homology level is a reliable default.&nbsp; These chains are listed in top2018_chains_hom70_60pct_complete.txt</p> <p>Files are organized in subdirectories based on the first two letters of their PDB IDs.</p> <p>Files already contain hydrogens added by Reduce.&nbsp; NQH flips have been performed to ensure that these are the best versions of these structures.</p> <p>top2018_metadata_mcfilter.csv contains information on release date, resolution, and validation scores.</p> <p>top2018_passrates_mc_filtered.csv contains information on how many protein residues from the original chain passed the quality filters.</p> <p><br> Homology sets:<br> --------------------------------------------------------------------------------<br> Using sequence homology clusters provided by the RCSB PDB, for each homology cluster, the best chain was selected for inclusion in the dataset.&nbsp; This ensures minimal sequence/structural redundancy.</p> <p>The top2018 is 
available at several different levels of homology clustering, which may be appropriate to different uses.&nbsp; Lists of the included chains at each homology level are included in this distribution.</p> <p>Lower homology numbers mean greater variety and less redundancy, but also fewer total chains in the dataset.</p> <p>For general use, ***we recommend the 70% homology set*** as a good balance between inclusivity and variety. This list is given in the file top2018_chains_hom70_60pct_complete.txt</p> <p><br> Usage caveats:</p> <p>--------------------------------------------------------------------------------<br> These files are incomplete.&nbsp; They are single chains from structures that may have had multiple chains.&nbsp; Residues that fail the filtering criteria have been removed.&nbsp; Programs with strong requirements for completeness or uninterrupted chains should be used with care.</p> <p>All header information from the original structure has been preserved.&nbsp; This includes information about chains and residues no longer present in the file.</p> <p>All ligands and waters associated with the chain have been preserved without filtering.&nbsp; Robust ligand filtering is beyond the scope of this dataset.&nbsp; Trust the ligands at your own discretion.</p> <p>Sidechain atoms beyond CB have not been considered in the filtering.&nbsp; However, all sidechains have been included for residues that passed the mainchain filters.&nbsp; DO NOT use this set of files for serious questions involving sidechains.&nbsp; See our all-atom filtered dataset instead.</p> <p><br> Filtering criteria: Chain level<br> --------------------------------------------------------------------------------<br> Chain is protein<br> Released on or before Dec 31, 2018<br> Resolution &lt; 2.0<br> MolProbity Score &lt; 2.0<br> &lt;3% residues have cbeta deviations<br> &lt;2% residues have covalent bond length outliers<br> &lt;2% residues have covalent bond geometry outliers</p> <p>Using sequence 
homology clusters provided by the RCSB PDB, for each homology cluster, the chain with the best (lowest) average of Resolution and MolProbity Score was selected.</p> <p><br> Filtering criteria: Residue level<br> --------------------------------------------------------------------------------<br> Even good structures may contain poorly-resolved regions.&nbsp; Residue-level filtering helps avoid including these regions in otherwise high-quality data.</p> <p>Mainchain atoms are defined as N, CA, C, O, CB.<br> Note that CB is included, since its ideal position is defined by the other mainchain atoms.</p> <p>All mainchain atoms in a residue:<br> Bfactor &lt;= 40<br> Real-space correlation coefficient (rscc) &gt;= 0.7<br> 2Fo-Fc map value &gt;= 1.2</p> <p>Additionally, residues are not allowed to have:<br> Covalent geometry outliers<br> Steric overlaps or &quot;clashes&quot;, as per Probe<br> Alternate conformations</p> <p><br> Chain Completeness criteria<br> --------------------------------------------------------------------------------<br> Chains which lost &gt;40% of their residues during filtering were dropped from this dataset.&nbsp; All chains present here are at least 60% complete.</p></description> </descriptions> </resource>
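The residue-level and chain-completeness criteria listed in the abstract can be illustrated with a minimal Python sketch. This is not the Richardson Lab's actual pipeline: the `Atom` and `Residue` classes, their field names, and the functions `residue_passes` and `chain_is_complete` are all hypothetical, and the per-atom rscc and 2Fo-Fc map values, as well as the outlier, clash, and alternate-conformation flags, are assumed to have been computed beforehand by external validation tools.

```python
from dataclasses import dataclass
from typing import List

# Mainchain atoms as defined in the abstract (CB is included).
MAINCHAIN_ATOMS = ("N", "CA", "C", "O", "CB")


@dataclass
class Atom:
    """Hypothetical per-atom record; values are assumed precomputed."""
    name: str
    bfactor: float
    rscc: float       # real-space correlation coefficient
    map_value: float  # 2Fo-Fc map value


@dataclass
class Residue:
    """Hypothetical per-residue record with validation flags."""
    atoms: List[Atom]
    has_geometry_outlier: bool = False
    has_clash: bool = False
    has_altloc: bool = False


def residue_passes(res: Residue) -> bool:
    """Apply the residue-level mainchain criteria from the abstract."""
    if res.has_geometry_outlier or res.has_clash or res.has_altloc:
        return False
    # Sidechain atoms beyond CB are ignored; glycine simply has no CB.
    mainchain = [a for a in res.atoms if a.name in MAINCHAIN_ATOMS]
    if not mainchain:
        return False  # assumption: a residue with no mainchain atoms fails
    return all(
        a.bfactor <= 40 and a.rscc >= 0.7 and a.map_value >= 1.2
        for a in mainchain
    )


def chain_is_complete(residues: List[Residue], min_percent: int = 60) -> bool:
    """Keep a chain only if at least min_percent of its residues pass."""
    if not residues:
        return False
    kept = sum(residue_passes(r) for r in residues)
    # Integer comparison avoids floating-point edge cases at exactly 60%.
    return kept * 100 >= min_percent * len(residues)
```

A chain of five residues in which three pass the filters sits exactly at the 60% threshold and would be kept under this reading of the completeness rule; with only two passing residues it would be dropped.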
| | All versions | This version |
|---|---|---|
| Views | 1,313 | 296 |
| Downloads | 922 | 40 |
| Data volume | 205.6 GB | 26.1 GB |
| Unique views | 1,098 | 261 |
| Unique downloads | 575 | 30 |