{
  "DOI": "10.5281/zenodo.5777651",
  "abstract": "Introduction\n--------------------------------------------------------------------------------\nThis directory contains files from the Top2018 dataset by the Richardson Lab at Duke University.\n\n\nThese are high-quality residues from high-quality, low redundancy protein chains in the PDB.\n\n\nThis dataset is quality-filtered on mainchain atoms.\u00a0 For the full-residue filtered set, see https://doi.org/10.5281/zenodo.5115232\n\n\nThe accompanying publication is:\nWilliams, C. J., Richardson, D. C., & Richardson, J. S. (2021). The importance of residue\u2010level filtering, and the Top2018 best\u2010parts dataset of high\u2010quality protein residues. Protein Science. http://doi.org/10.1002/pro.4239\n\n\nUsage recommendations\n--------------------------------------------------------------------------------\nProtein residues that fail the filtering criteria described below have been removed from the files.\u00a0 As a result, these files can be considered pre-filtered and will return only results for residues of good model quality with supporting experimental data.\u00a0 As long as the question concerns mainchain protein atoms, these files should be usable as is.\u00a0 There is a separate version that has been filtered on all atoms that is suitable for sidechains.\n\n\nThe Top2018 contains several different levels of homology clustering (30%, 50%, 70%, 90%) to ensure nonredundant datasets.\u00a0 The 70% homology level is a reliable default.\u00a0 These chains are listed in top2018_chains_hom70_mcfilter_60pct_complete.txt and found in top2018_pdbs_mc_filtered_hom70.tar.gz\n\n\nFiles are organized in subdirectories based on the first two letters of their PDB ids.\u00a0 The included python script sample_file_loop.py may aid in accessing the directory structure.\n\n\nFiles already contain hydrogens added by Reduce.\u00a0 NQH flips have been performed to ensure that these are the best versions of these structures.\n\n\ntop2018_metadata_mc_filtered.csv 
contains information on release date, resolution, and validation scores for each file.\n\n\ntop2018_passrates_mc_filtered.csv contains information on how many protein residues from the original chain passed the quality filters.\n\n\n\nHomology sets:\n--------------------------------------------------------------------------------\nFor each sequence homology cluster provided by the RCSB PDB, the best chain was selected for inclusion in the dataset.\u00a0 This ensures minimal sequence/structural redundancy.\n\n\nThe Top2018 is available at several different levels of homology clustering, which may be appropriate for different uses.\u00a0 Lists of the included chains at each homology level are included in this distribution.\n\n\nLower homology numbers mean less redundancy, but fewer total chains in the dataset.\n\n\nFor general use, ***we recommend the 70% homology set*** as a good balance between inclusivity and variety. This list is given in the file top2018_chains_hom70_mcfilter_60pct_complete.txt\n\n\n\nUsage caveats:\n--------------------------------------------------------------------------------\nThese files are incomplete.\u00a0 They are single chains from structures that may have had multiple chains.\u00a0 Residues that fail the filtering criteria have been removed.\u00a0 Programs with strong requirements for completeness or uninterrupted chains should be used with care.\u00a0 Chain completeness and fragmentation statistics are available in top2018_passrates_mc_filtered.csv and in USER records at the end of each .pdb file.\n\n\nAll header information from the original structure has been preserved.\u00a0 This includes information about chains and residues no longer present in the file.\n\n\nAll ligands and waters associated with the chain have been preserved without filtering.\u00a0 Robust ligand filtering is beyond the scope of this dataset.\u00a0 Trust the ligands at your own discretion.\n\n\nSidechain atoms beyond CB have not been 
considered in the filtering.\u00a0 However, all sidechains have been included for residues that passed the mainchain filters.\u00a0 DO NOT use this set of files for serious questions involving sidechains.\u00a0 See our all-atom filtered dataset instead.\n\n\n\nFiltering criteria: Chain level\n--------------------------------------------------------------------------------\nChain is protein\nReleased on or before Dec 31, 2018\nResolution < 2.0\nMolProbity Score < 2.0\n<3% of residues have cbeta deviations\n<2% of residues have covalent bond length outliers\n<2% of residues have covalent bond geometry outliers\n\n\nFor each sequence homology cluster provided by the RCSB PDB, the chain with the best (lowest) average of Resolution and MolProbity Score was selected.\n\n\n\nFiltering criteria: Residue level\n--------------------------------------------------------------------------------\nEven excellent structures usually contain some poorly resolved regions.\u00a0 Residue-level filtering helps avoid including these regions in otherwise high-quality data.\n\n\nMainchain atoms are defined as N, CA, C, O, CB.\nNote that CB is included, since its ideal position is defined by the other mainchain atoms.\n\n\nAll mainchain atoms in a residue must have:\nBfactor <= 40\nReal-space correlation coefficient (rscc) >= 0.7\n2Fo-Fc map value >= 1.2\n\n\nAdditionally, residues are not allowed to have:\nCovalent geometry outliers\nSteric overlaps or \"clashes\", as per Probe\nAlternate conformations\n\n\n\nChain Completeness criteria\n--------------------------------------------------------------------------------\nChains that lost >40% of their residues during filtering were dropped from this dataset.\u00a0 All chains present here are at least 60% complete.\n\n\n\nFiltering documentation\n--------------------------------------------------------------------------------\nEach file documents its pruned and included residues with USER records.\u00a0 These include self-documenting 
USER\u00a0 DOC lines as follows:\nUSER\u00a0 DOC Lines marked with USER\u00a0 DEL list residues pruned by\nUSER\u00a0 DOC quality filtering.\nUSER\u00a0 DOC Format is chain:resseq:icode:reason_for_pruning\nUSER\u00a0 DOC Reasons for pruning are abbreviated as 1-letter codes: bcmgoa\nUSER\u00a0 DOC b=bfactor, c=real space correlation, m=2Fo-Fc mapvalue\nUSER\u00a0 DOC g=geometry outlier, o=steric overlap, a=alternate conformations\nUSER\u00a0 DOC Lines marked USER\u00a0 INC list the uninterrupted fragments of structure\nUSER\u00a0 DOC still included after pruning by quality filtering\nUSER\u00a0 DOC Format is chain1:resseq1:icode1:chain2:resseq2:icode2:fragment_length\nUSER\u00a0 DOC where 1 is the first and 2 the last residue of the fragment\nUSER\u00a0 DOC Line marked with USER\u00a0 PCT gives statistics for structure completeness\n\n\nVersion history\n--------------------------------------------------------------------------------\nVersion 0.9 10.5281/zenodo.4626150 \u00a0\u00a0 \u00a0Mar 21, 2021\nInitial version\n\n\nVersion 1.0 10.5281/zenodo.5115075 \u00a0\u00a0 \u00a0Jul 19, 2021\nSplit into 30, 50, 70, and 90% homology sets\n\n\nVersion 2.0\nSet case of filenames to unambiguous standard: all lowercase except L\n\n\nVersion 2.01\nAdded missing chain list for recommended hom70 set",
  "author": [
    {
      "family": "Williams",
      "given": "Christopher"
    },
    {
      "family": "Richardson",
      "given": "David"
    },
    {
      "family": "Richardson",
      "given": "Jane"
    }
  ],
  "id": "5777651",
  "issued": {
    "date-parts": [
      [
        "2021",
        "07",
        "19"
      ]
    ]
  },
  "publisher": "Zenodo",
  "title": "High quality protein residues: Top2018 mainchain-filtered residues",
  "type": "dataset",
  "version": "2.01"
}