<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator>Williams, Christopher</dc:creator>
  <dc:creator>Richardson, David</dc:creator>
  <dc:creator>Richardson, Jane</dc:creator>
  <dc:date>2021-07-19</dc:date>
  <dc:description>&amp;lt;p&amp;gt;Introduction&amp;lt;br&amp;gt;
--------------------------------------------------------------------------------&amp;lt;br&amp;gt;
This directory contains files from the Top2018 dataset by the Richardson Lab at Duke University.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;These are high-quality residues from high-quality, low redundancy protein chains in the PDB.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;This dataset is quality-filtered on mainchain atoms.&amp;nbsp; For the full-residue filtered set, see https://doi.org/10.5281/zenodo.5115232&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;The accompanying publication is:&amp;lt;br&amp;gt;
Williams, C. J., Richardson, D. C., &amp;amp; Richardson, J. S. (2021). The importance of residue‐level filtering, and the Top2018 best‐parts dataset of high‐quality protein residues. Protein Science. http://doi.org/10.1002/pro.4239&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Usage recommendations&amp;lt;br&amp;gt;
--------------------------------------------------------------------------------&amp;lt;br&amp;gt;
Protein residues that fail the filtering criteria described below have been removed from the files.&amp;nbsp; As a result, these files can be considered pre-filtered and will return only results for residues of good model quality with supporting experimental data.&amp;nbsp; As long as the question concerns mainchain protein atoms, these files should be usable as is.&amp;nbsp; There is a separate version that has been filtered on all atoms that is suitable for sidechains.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;The Top2018 contains several different levels of homology clustering (30%, 50%, 70%, 90%) to ensure nonredundant datasets.&amp;nbsp; The 70% homology level is a reliable default.&amp;nbsp; These chains are listed in top2018_chains_hom70_mcfilter_60pct_complete.txt and found in top2018_pdbs_mc_filtered_hom70.tar.gz&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Files are organized in subdirectories based on the first two letters of their PDB ids.&amp;nbsp; The included python script sample_file_loop.py may aid in accessing the directory structure.&amp;lt;/p&amp;gt;
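
&amp;lt;p&amp;gt;As a rough illustration of walking that directory layout, the following Python sketch yields every .pdb file together with the name of the subdirectory that holds it. It is not the bundled sample_file_loop.py; the function name iter_pdb_files and the root path passed to it are hypothetical, and the only assumption about the files is the .pdb extension.&amp;lt;/p&amp;gt;

```python
import os


def iter_pdb_files(root):
    """Yield (subdir_name, file_path) for every .pdb file below root."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for name in sorted(filenames):
            if name.lower().endswith(".pdb"):
                yield os.path.basename(dirpath), os.path.join(dirpath, name)
```

&amp;lt;p&amp;gt;Calling list(iter_pdb_files(path_to_unpacked_archive)) would then give one entry per structure file, grouped by the two-letter subdirectory.&amp;lt;/p&amp;gt;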

&amp;lt;p&amp;gt;Files already contain hydrogens added by Reduce.&amp;nbsp; NQH flips have been performed to ensure that these are the best versions of these structures.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;top2018_metadata_mc_filtered.csv contains information on release date, resolution, and validation scores for each file.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;top2018_passrates_mc_filtered.csv contains information on how many protein residues from the original chain passed the quality filters.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;br&amp;gt;
Homology sets:&amp;lt;br&amp;gt;
--------------------------------------------------------------------------------&amp;lt;br&amp;gt;
Using sequence homology clusters provided by the RCSB PDB, for each homology cluster, the best chain was selected for inclusion in the dataset.&amp;nbsp; This ensures minimal sequence/structural redundancy.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;The Top2018 is available at several different levels of homology clustering, which may be appropriate to different uses.&amp;nbsp; Lists of the included chains at each homology level are included in this distribution.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Lower homology numbers mean less redundancy, but fewer total chains in the dataset.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;For general use, ***we recommend the 70% homology set*** as a good balance between inclusivity and variety. This list is given in the file top2018_chains_hom70_mcfilter_60pct_complete.txt&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;br&amp;gt;
Usage caveats:&amp;lt;br&amp;gt;
--------------------------------------------------------------------------------&amp;lt;br&amp;gt;
These files are incomplete.&amp;nbsp; They are single chains from structures that may have had multiple chains.&amp;nbsp; Residues that fail the filtering criteria have been removed.&amp;nbsp; Programs with strong requirements for completeness or uninterrupted chains should be used with care.&amp;nbsp; Chain completeness and fragmentation statistics are available in top2018_passrates_mc_filtered.csv and in USER records at the end of each .pdb file.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;All header information from the original structure has been preserved.&amp;nbsp; This includes information about chains and residues no longer present in the file.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;All ligands and waters associated with the chain have been preserved without filtering.&amp;nbsp; Robust ligand filtering is beyond the scope of this dataset.&amp;nbsp; Trust the ligands at your own discretion.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Sidechain atoms beyond CB have not been considered in the filtering.&amp;nbsp; However, all sidechains have been included for residues that passed the mainchain filters.&amp;nbsp; DO NOT use this set of files for serious questions involving sidechains.&amp;nbsp; See our all-atom filtered dataset instead.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;br&amp;gt;
Filtering criteria: Chain level&amp;lt;br&amp;gt;
--------------------------------------------------------------------------------&amp;lt;br&amp;gt;
Chain is protein&amp;lt;br&amp;gt;
Released on or before Dec 31, 2018&amp;lt;br&amp;gt;
Resolution &amp;lt; 2.0&amp;lt;br&amp;gt;
MolProbity Score &amp;lt; 2.0&amp;lt;br&amp;gt;
&amp;lt;3% residues have cbeta deviations&amp;lt;br&amp;gt;
&amp;lt;2% residues have covalent bond length outliers&amp;lt;br&amp;gt;
&amp;lt;2% residues have covalent bond geometry outliers&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Using sequence homology clusters provided by the RCSB PDB, for each homology cluster, the chain with the best (lowest) average of Resolution and MolProbity Score was selected.&amp;lt;/p&amp;gt;
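
&amp;lt;p&amp;gt;The per-cluster selection can be sketched in Python as follows. This is a minimal illustration, not the pipeline used to build the dataset: the Chain record and its field names are hypothetical, and only the resolution and MolProbity score cutoffs are applied (the release date, protein, cbeta, and covalent outlier checks are left out).&amp;lt;/p&amp;gt;

```python
from collections import namedtuple

# Hypothetical record; field names are illustrative, not the CSV column names.
Chain = namedtuple("Chain", "chain_id cluster resolution molprobity")


def passes_chain_filters(chain):
    """Resolution and MolProbity score must both be strictly below 2.0."""
    return 2.0 > chain.resolution and 2.0 > chain.molprobity


def best_chain_per_cluster(chains):
    """Keep, per cluster, the chain with the lowest mean of the two scores."""
    best = {}
    for chain in chains:
        if not passes_chain_filters(chain):
            continue
        score = (chain.resolution + chain.molprobity) / 2.0
        # On a tie the chain seen first is kept.
        if chain.cluster not in best or best[chain.cluster][0] > score:
            best[chain.cluster] = (score, chain)
    return {cluster: chain for cluster, (score, chain) in best.items()}
```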

&amp;lt;p&amp;gt;&amp;lt;br&amp;gt;
Filtering criteria: Residue level&amp;lt;br&amp;gt;
--------------------------------------------------------------------------------&amp;lt;br&amp;gt;
Even excellent structures usually contain some poorly-resolved regions.&amp;nbsp; Residue-level filtering helps avoid including these regions in otherwise high-quality data.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Mainchain atoms are defined as N, CA, C, O, CB.&amp;lt;br&amp;gt;
Note that CB is included, since its ideal position is defined by the other mainchain atoms.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;All mainchain atoms in a residue:&amp;lt;br&amp;gt;
Bfactor &amp;lt;= 40&amp;lt;br&amp;gt;
Real-space correlation coefficient (rscc) &amp;gt;= 0.7&amp;lt;br&amp;gt;
2Fo-Fc map value &amp;gt;= 1.2&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Additionally, residues are not allowed to have:&amp;lt;br&amp;gt;
Covalent geometry outliers&amp;lt;br&amp;gt;
Steric overlaps or &amp;quot;clashes&amp;quot;, as per Probe&amp;lt;br&amp;gt;
Alternate conformations&amp;lt;/p&amp;gt;
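
&amp;lt;p&amp;gt;The residue-level criteria above can be written as a single predicate. The Python sketch below uses the thresholds listed here, but the data layout is an assumption: each mainchain atom is taken to carry its own B-factor, rscc, and map value, and the three outlier flags are supplied by the caller. The name residue_passes is hypothetical, and skipping absent atoms (glycine has no CB) is likewise an assumption.&amp;lt;/p&amp;gt;

```python
MAINCHAIN = ("N", "CA", "C", "O", "CB")


def residue_passes(atoms, has_geometry_outlier=False, has_clash=False,
                   has_altconf=False):
    """atoms maps an atom name to a (bfactor, rscc, map_value) tuple."""
    if has_geometry_outlier or has_clash or has_altconf:
        return False
    for name in MAINCHAIN:
        if name not in atoms:
            continue  # assumption: absent atoms (e.g. CB in glycine) are skipped
        bfactor, rscc, map_value = atoms[name]
        # Fail on B-factor above 40, rscc below 0.7, or map value below 1.2.
        if bfactor > 40.0 or 0.7 > rscc or 1.2 > map_value:
            return False
    return True
```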

&amp;lt;p&amp;gt;&amp;lt;br&amp;gt;
Chain Completeness criteria&amp;lt;br&amp;gt;
--------------------------------------------------------------------------------&amp;lt;br&amp;gt;
Chains which lost &amp;gt;40% of their residues during filtering were dropped from this dataset.&amp;nbsp; All chains present here are at least 60% complete.&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;&amp;lt;br&amp;gt;
Filtering documentation&amp;lt;br&amp;gt;
--------------------------------------------------------------------------------&amp;lt;br&amp;gt;
Each file documents its pruned and included residues with USER records.&amp;nbsp; These include self-documenting USER&amp;nbsp; DOC lines as follows:&amp;lt;br&amp;gt;
USER&amp;nbsp; DOC Lines marked with USER&amp;nbsp; DEL list residues pruned by&amp;lt;br&amp;gt;
USER&amp;nbsp; DOC quality filtering.&amp;lt;br&amp;gt;
USER&amp;nbsp; DOC Format is chain:resseq:icode:reason_for_pruning&amp;lt;br&amp;gt;
USER&amp;nbsp; DOC Reasons for pruning are abbreviated as 1-letter codes: bcmgoa&amp;lt;br&amp;gt;
USER&amp;nbsp; DOC b=bfactor, c=real space correlation, m=2Fo-Fc mapvalue&amp;lt;br&amp;gt;
USER&amp;nbsp; DOC g=geometry outlier, o=steric overlap, a=alternate conformations&amp;lt;br&amp;gt;
USER&amp;nbsp; DOC Lines marked USER&amp;nbsp; INC list the uninterrupted fragments of structure&amp;lt;br&amp;gt;
USER&amp;nbsp; DOC still included after pruning by quality filtering&amp;lt;br&amp;gt;
USER&amp;nbsp; DOC Format is chain1:resseq1:icode1:chain2:resseq2:icode2:fragment_length&amp;lt;br&amp;gt;
USER&amp;nbsp; DOC where 1 is the first and 2 the last residue of the fragment&amp;lt;br&amp;gt;
USER&amp;nbsp; DOC Line marked with USER&amp;nbsp; PCT gives statistics for structure completeness&amp;lt;/p&amp;gt;
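
&amp;lt;p&amp;gt;A reader for these records might look like the Python sketch below. It relies only on the colon-separated field order documented above; the exact column spacing of the USER lines is an assumption (fields are stripped of padding), and the function name parse_user_records is hypothetical. Check it against a real file before relying on it.&amp;lt;/p&amp;gt;

```python
def parse_user_records(lines):
    """Collect pruned residues (DEL) and kept fragments (INC) from USER lines."""
    deleted, included = [], []
    for line in lines:
        parts = line.split(None, 2)  # "USER", tag, remaining payload
        if len(parts) != 3 or parts[0] != "USER":
            continue
        tag = parts[1]
        fields = [f.strip() for f in parts[2].split(":")]
        if tag == "DEL" and len(fields) == 4:
            chain, resseq, icode, reasons = fields
            # reasons is a string of 1-letter codes drawn from "bcmgoa"
            deleted.append((chain, int(resseq), icode, set(reasons)))
        elif tag == "INC" and len(fields) == 7:
            c1, r1, i1, c2, r2, i2, length = fields
            included.append(((c1, int(r1), i1), (c2, int(r2), i2), int(length)))
    return deleted, included
```

&amp;lt;p&amp;gt;Passing the lines of one .pdb file returns the list of pruned residues with their reason codes and the list of uninterrupted fragments with their lengths.&amp;lt;/p&amp;gt;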

&amp;lt;p&amp;gt;Version history&amp;lt;br&amp;gt;
--------------------------------------------------------------------------------&amp;lt;br&amp;gt;
Version 0.9 10.5281/zenodo.4626150 &amp;nbsp;&amp;nbsp; &amp;nbsp;Mar 21, 2021&amp;lt;br&amp;gt;
Initial version&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Version 1.0 10.5281/zenodo.5115075 &amp;nbsp;&amp;nbsp; &amp;nbsp;Jul 19, 2021&amp;lt;br&amp;gt;
Split into 30, 50, 70, and 90% homology sets&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Version 2.0&amp;lt;br&amp;gt;
Set case of filenames to unambiguous standard: all lowercase except L&amp;lt;/p&amp;gt;

&amp;lt;p&amp;gt;Version 2.01&amp;lt;br&amp;gt;
Added missing chain list for recommended hom70 set&amp;lt;/p&amp;gt;</dc:description>
  <dc:identifier>https://doi.org/10.5281/zenodo.5777651</dc:identifier>
  <dc:identifier>oai:zenodo.org:5777651</dc:identifier>
  <dc:publisher>Zenodo</dc:publisher>
  <dc:relation>https://doi.org/10.5281/zenodo.4626149</dc:relation>
  <dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
  <dc:rights>Creative Commons Attribution 4.0 International</dc:rights>
  <dc:rights>https://creativecommons.org/licenses/by/4.0/legalcode</dc:rights>
  <dc:title>High quality protein residues: Top2018 mainchain-filtered residues</dc:title>
  <dc:type>info:eu-repo/semantics/other</dc:type>
</oai_dc:dc>