There is a newer version of this record available.

Dataset Open Access

High quality protein residues: top2018 mainchain-filtered residues

Williams, Christopher; Richardson, David; Richardson, Jane


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nmm##2200000uu#4500</leader>
  <controlfield tag="005">20211213211739.0</controlfield>
  <controlfield tag="001">5115075</controlfield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Duke University</subfield>
    <subfield code="0">(orcid)0000-0001-5069-343X</subfield>
    <subfield code="a">Richardson, David</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Duke University</subfield>
    <subfield code="0">(orcid)0000-0002-3311-2944</subfield>
    <subfield code="a">Richardson, Jane</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">6044</subfield>
    <subfield code="z">md5:c584daa2388c7791f50fd6559bca0dc6</subfield>
    <subfield code="u">https://zenodo.org/record/5115075/files/README</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">641</subfield>
    <subfield code="z">md5:eaa77f695a6555f516371fdc6d393c81</subfield>
    <subfield code="u">https://zenodo.org/record/5115075/files/sample_file_loop.py</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">58149</subfield>
    <subfield code="z">md5:f056880fbd878a048c130724d5e60b9b</subfield>
    <subfield code="u">https://zenodo.org/record/5115075/files/top2018_chains_hom30_mcfilter_60pct_complete.txt</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">82642</subfield>
    <subfield code="z">md5:086725b3d16c440d60bae73930b40763</subfield>
    <subfield code="u">https://zenodo.org/record/5115075/files/top2018_chains_hom50_mcfilter_60pct_complete.txt</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">95739</subfield>
    <subfield code="z">md5:c7e39101aed13b47ff9fd224cbb25e22</subfield>
    <subfield code="u">https://zenodo.org/record/5115075/files/top2018_chains_hom70_mcfilter_60pct_complete.txt</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">106274</subfield>
    <subfield code="z">md5:f6305987966abac79fc5b1a5351a2626</subfield>
    <subfield code="u">https://zenodo.org/record/5115075/files/top2018_chains_hom90_mcfilter_60pct_complete.txt</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">2007333</subfield>
    <subfield code="z">md5:52325221e3f26ab10ac544ed9bc505df</subfield>
    <subfield code="u">https://zenodo.org/record/5115075/files/top2018_metadata_mc_filtered.csv</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">340310</subfield>
    <subfield code="z">md5:cd5d510592873d2e01cf2585066e29f9</subfield>
    <subfield code="u">https://zenodo.org/record/5115075/files/top2018_passrates_mc_filtered.csv</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">782111083</subfield>
    <subfield code="z">md5:cfb648bf739e8396c437919805c03958</subfield>
    <subfield code="u">https://zenodo.org/record/5115075/files/top2018_pdbs_mc_filtered_hom30.tar.gz</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">1140492248</subfield>
    <subfield code="z">md5:c5e4c1bb27741590e00df7aad20bd657</subfield>
    <subfield code="u">https://zenodo.org/record/5115075/files/top2018_pdbs_mc_filtered_hom50.tar.gz</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">1330641624</subfield>
    <subfield code="z">md5:7d08a54068f2af6f2e2f929e2e812ebf</subfield>
    <subfield code="u">https://zenodo.org/record/5115075/files/top2018_pdbs_mc_filtered_hom70.tar.gz</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">1465895953</subfield>
    <subfield code="z">md5:cb2dc438f313ebfb00e7ddc8846e6da8</subfield>
    <subfield code="u">https://zenodo.org/record/5115075/files/top2018_pdbs_mc_filtered_hom90.tar.gz</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2021-07-19</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire_data</subfield>
    <subfield code="o">oai:zenodo.org:5115075</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">Duke University</subfield>
    <subfield code="0">(orcid)0000-0002-5808-8768</subfield>
    <subfield code="a">Williams, Christopher</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">High quality protein residues: top2018 mainchain-filtered residues</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">https://creativecommons.org/licenses/by/4.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;Introduction&lt;br&gt;
--------------------------------------------------------------------------------&lt;br&gt;
This directory contains files from the top2018 dataset by the Richardson Lab at Duke University.&lt;/p&gt;

&lt;p&gt;These are high-quality residues from high-quality, low redundancy protein chains in the PDB.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
Usage recommendations&lt;br&gt;
--------------------------------------------------------------------------------&lt;br&gt;
Protein residues that fail the filtering criteria described below have been removed from the files.&amp;nbsp; As a result, these files can be considered pre-filtered and will return only results for residues of good model quality with supporting experimental data.&amp;nbsp; As long as the question concerns mainchain protein atoms, these files should be usable as is.&amp;nbsp; There is a separate version that has been filtered on all atoms that is suitable for sidechains.&lt;/p&gt;

&lt;p&gt;The top2018 contains several different levels of homology clustering to ensure nonredundant datasets.&amp;nbsp; The 70% homology level is a reliable default.&amp;nbsp; These chains are listed in top2018_chains_hom70_mcfilter_60pct_complete.txt and found in top2018_pdbs_mc_filtered_hom70.tar.gz&lt;/p&gt;

&lt;p&gt;Files are organized in subdirectories based on the first two letters of their PDB ids.&amp;nbsp; The included python script sample_file_loop.py may aid in accessing the directory structure.&lt;/p&gt;

&lt;p&gt;Files already contain hydrogens added by Reduce.&amp;nbsp; NQH flips have been performed to ensure that these are the best versions of these structures.&lt;/p&gt;

&lt;p&gt;top2018_metadata_mc_filtered.csv contains information on release date, resolution, and validation scores.&lt;/p&gt;

&lt;p&gt;top2018_passrates_mc_filtered.csv contains information on how many protein residues from the original chain passed the quality filters.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
Homology sets:&lt;br&gt;
--------------------------------------------------------------------------------&lt;br&gt;
Using sequence homology clusters provided by the RCSB PDB, for each homology cluster, the best chain was selected for inclusion in the dataset.&amp;nbsp; This ensures minimal sequence/structural redundancy.&lt;/p&gt;

&lt;p&gt;The top2018 is available at several different levels of homology clustering, which may be appropriate to different uses.&amp;nbsp; Lists of the included chains at each homology level are included in this distribution.&lt;/p&gt;

&lt;p&gt;Lower homology numbers mean greater variety and less redundancy, but also fewer total chains in the dataset.&lt;/p&gt;

&lt;p&gt;For general use, ***we recommend the 70% homology set*** as a good balance between inclusivity and variety. This list is given in the file top2018_chains_hom70_mcfilter_60pct_complete.txt&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
Usage caveats:&lt;br&gt;
--------------------------------------------------------------------------------&lt;br&gt;
These files are incomplete.&amp;nbsp; They are single chains from structures that may have had multiple chains.&amp;nbsp; Residues that fail the filtering criteria have been removed.&amp;nbsp; Programs with strong requirements for completeness or uninterrupted chains should be used with care.&amp;nbsp; Chain completeness and fragmentation statistics are available in top2018_passrates_mc_filtered.csv and in USER records at the end of each .pdb file.&lt;/p&gt;

&lt;p&gt;All header information from the original structure has been preserved.&amp;nbsp; This includes information about chains and residues no longer present in the file.&lt;/p&gt;

&lt;p&gt;All ligands and waters associated with the chain have been preserved without filtering.&amp;nbsp; Robust ligand filtering is beyond the scope of this dataset.&amp;nbsp; Trust the ligands at your own discretion.&lt;/p&gt;

&lt;p&gt;Sidechain atoms beyond CB have not been considered in the filtering.&amp;nbsp; However, all sidechains have been included for residues that passed the mainchain filters.&amp;nbsp; DO NOT use this set of files for serious questions involving sidechains.&amp;nbsp; See our all-atom filtered dataset instead.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
Filtering criteria: Chain level&lt;br&gt;
--------------------------------------------------------------------------------&lt;br&gt;
Chain is protein&lt;br&gt;
Released on or before Dec 31, 2018&lt;br&gt;
Resolution &amp;lt; 2.0&lt;br&gt;
MolProbity Score &amp;lt; 2.0&lt;br&gt;
&amp;lt;3% residues have cbeta deviations&lt;br&gt;
&amp;lt;2% residues have covalent bond length outliers&lt;br&gt;
&amp;lt;2% residues have covalent bond geometry outliers&lt;/p&gt;

&lt;p&gt;Using sequence homology clusters provided by the RCSB PDB, for each homology cluster, the chain with the best (lowest) average of Resolution and MolProbity Score was selected.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
Filtering criteria: Residue level&lt;br&gt;
--------------------------------------------------------------------------------&lt;br&gt;
Even good structures may contain poorly-resolved regions.&amp;nbsp; Residue-level filtering helps avoid including these regions in otherwise high-quality data.&lt;/p&gt;

&lt;p&gt;Mainchain atoms are defined as N, CA, C, O, CB.&lt;br&gt;
Note that CB is included, since its ideal position is defined by the other mainchain atoms.&lt;/p&gt;

&lt;p&gt;All mainchain atoms in a residue:&lt;br&gt;
Bfactor &amp;lt;= 40&lt;br&gt;
Real-space correlation coefficient (rscc) &amp;gt;= 0.7&lt;br&gt;
2Fo-Fc map value &amp;gt;= 1.2&lt;/p&gt;

&lt;p&gt;Additionally, residues are not allowed to have:&lt;br&gt;
Covalent geometry outliers&lt;br&gt;
Steric overlaps or &amp;quot;clashes&amp;quot;, as per Probe&lt;br&gt;
Alternate conformations&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
Chain Completeness criteria&lt;br&gt;
--------------------------------------------------------------------------------&lt;br&gt;
Chains which lost &amp;gt;40% of their residues during filtering were dropped from this dataset.&amp;nbsp; All chains present here are at least 60% complete.&lt;/p&gt;

&lt;p&gt;Filtering documentation&lt;br&gt;
--------------------------------------------------------------------------------&lt;br&gt;
Each file documents its pruned and included residues with USER records.&amp;nbsp; These include self-documenting USER&amp;nbsp; DOC lines as follows:&lt;br&gt;
USER&amp;nbsp; DOC Lines marked with USER&amp;nbsp; DEL list residues pruned by&lt;br&gt;
USER&amp;nbsp; DOC quality filtering.&lt;br&gt;
USER&amp;nbsp; DOC Format is chain:resseq:icode:reason_for_pruning&lt;br&gt;
USER&amp;nbsp; DOC Reasons for pruning are abbreviated as 1-letter codes: bcmgoa&lt;br&gt;
USER&amp;nbsp; DOC b=bfactor, c=real space correlation, m=2Fo-Fc mapvalue&lt;br&gt;
USER&amp;nbsp; DOC g=geometry outlier, o=steric overlap, a=alternate conformations&lt;br&gt;
USER&amp;nbsp; DOC Lines marked USER&amp;nbsp; INC list the uninterrupted fragments of structure&lt;br&gt;
USER&amp;nbsp; DOC still included after pruning by quality filtering&lt;br&gt;
USER&amp;nbsp; DOC Format is chain1:resseq1:icode1:chain2:resseq2:icode2:fragment_length&lt;br&gt;
USER&amp;nbsp; DOC where 1 is the first and 2 the last residue of the fragment&lt;br&gt;
USER&amp;nbsp; DOC Line marked with USER&amp;nbsp; PCT gives statistics for structure completeness&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.4626149</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.5115075</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
</record>
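
The USER record layout given in the description field above (USER  DEL as chain:resseq:icode:reason_for_pruning, USER  INC as chain1:resseq1:icode1:chain2:resseq2:icode2:fragment_length) could be read with a short parser along the following lines. This is only a sketch: the function and class names are hypothetical, and it assumes that the colon-delimited payload follows the `USER  DEL` or `USER  INC` tag on the same line and that fields may be padded with spaces. Check both assumptions against an actual .pdb file from the dataset before relying on it.

```python
from typing import NamedTuple

# One-letter pruning codes, as listed in the USER  DOC lines.
REASONS = {
    "b": "bfactor",
    "c": "real space correlation",
    "m": "2Fo-Fc mapvalue",
    "g": "geometry outlier",
    "o": "steric overlap",
    "a": "alternate conformations",
}


class Pruned(NamedTuple):
    chain: str
    resseq: int
    icode: str
    reasons: list


class Fragment(NamedTuple):
    start: tuple  # (chain, resseq, icode) of the first residue
    end: tuple    # (chain, resseq, icode) of the last residue
    length: int


def _payload(line, tag):
    """Return the text after 'USER  <tag>', or None for any other line."""
    parts = line.split(None, 2)
    if len(parts) == 3 and parts[0] == "USER" and parts[1] == tag:
        return parts[2].strip()
    return None


def parse_user_records(lines):
    """Collect pruned residues (DEL) and retained fragments (INC)."""
    pruned, fragments = [], []
    for line in lines:
        del_text = _payload(line, "DEL")
        inc_text = _payload(line, "INC")
        if del_text is not None:
            fields = [f.strip() for f in del_text.split(":")]
            if len(fields) != 4:
                continue  # layout differs from the assumed one; skip
            chain, resseq, icode, codes = fields
            reasons = [REASONS.get(c, c) for c in codes]
            pruned.append(Pruned(chain, int(resseq), icode, reasons))
        elif inc_text is not None:
            fields = [f.strip() for f in inc_text.split(":")]
            if len(fields) != 7:
                continue  # layout differs from the assumed one; skip
            c1, r1, i1, c2, r2, i2, n = fields
            fragments.append(Fragment((c1, int(r1), i1), (c2, int(r2), i2), int(n)))
    return pruned, fragments
```

To use it on a file, pass an open file handle, for example `parse_user_records(open(path))`, where `path` points to one of the extracted .pdb files. Lines that do not match the assumed field counts are skipped rather than guessed at.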