5773119
doi
10.5281/zenodo.5773119
oai:zenodo.org:5773119
Richardson, David
Duke University
Richardson, Jane
Duke University
High quality protein residues: top2018 mainchain-filtered residues
Williams, Christopher
Duke University
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
<p>Introduction<br>
--------------------------------------------------------------------------------<br>
This directory contains files from the top2018 dataset by the Richardson Lab at Duke University.</p>
<p>These are high-quality residues from high-quality, low redundancy protein chains in the PDB.</p>
<p>This dataset is quality-filtered on mainchain atoms. For the full-residue filtered set, see https://doi.org/10.5281/zenodo.5115232</p>
<p>The accompanying publication is:<br>
Williams, C. J., Richardson, D. C., & Richardson, J. S. (2021). The importance of residue‐level filtering, and the Top2018 best‐parts dataset of high‐quality protein residues. Protein Science. http://doi.org/10.1002/pro.4239</p>
<p>Usage recommendations<br>
--------------------------------------------------------------------------------<br>
Protein residues that fail the filtering criteria described below have been removed from the files. As a result, these files can be considered pre-filtered: analyses run on them return results only for residues of good model quality with supporting experimental data. As long as the question concerns mainchain protein atoms, these files should be usable as is. A separate version, filtered on all atoms, is suitable for questions involving sidechains.</p>
<p>The top2018 contains several different levels of homology clustering (30%, 50%, 70%, 90%) to ensure nonredundant datasets. The 70% homology level is a reliable default. These chains are listed in top2018_chains_hom70_mcfilter_60pct_complete.txt and found in top2018_pdbs_mc_filtered_hom70.tar.gz</p>
<p>Files are organized in subdirectories based on the first two letters of their PDB IDs. The included Python script sample_file_loop.py may aid in accessing the directory structure.</p>
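<p>As an illustration of walking the directory structure, here is a minimal sketch. It is not the included sample_file_loop.py. It assumes only that the extracted archive keeps its .pdb files somewhere below a root directory, so it searches recursively instead of relying on a particular nesting depth. The names iter_pdb_files and root are hypothetical.</p>

```python
from pathlib import Path
from typing import Iterator


def iter_pdb_files(root: str) -> Iterator[Path]:
    """Yield every .pdb file found below root, in sorted order.

    Assumption: the files sit in nested subdirectories of unknown depth,
    so a recursive glob is used rather than a hard-coded layout.
    """
    yield from sorted(Path(root).rglob("*.pdb"))


if __name__ == "__main__":
    # "top2018_pdbs_mc_filtered_hom70" is an assumed extraction directory.
    for path in iter_pdb_files("top2018_pdbs_mc_filtered_hom70"):
        print(path)
```

<p>A recursive search keeps the loop independent of the exact subdirectory naming, which may matter because filename case conventions changed between versions.</p>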
<p>Files already contain hydrogens added by Reduce. NQH flips have been performed to ensure that these are the best versions of these structures.</p>
<p>top2018_metadata_mc_filtered.csv contains information on release date, resolution, and validation scores for each file.</p>
<p>top2018_passrates_mc_filtered.csv contains information on how many protein residues from the original chain passed the quality filters.</p>
<p><br>
Homology sets:<br>
--------------------------------------------------------------------------------<br>
Using sequence homology clusters provided by the RCSB PDB, for each homology cluster, the best chain was selected for inclusion in the dataset. This ensures minimal sequence/structural redundancy.</p>
<p>The top2018 is available at several different levels of homology clustering, which may be appropriate to different uses. Lists of the included chains at each homology level are included in this distribution.</p>
<p>Lower homology numbers mean less redundancy, but fewer total chains in the dataset.</p>
<p>For general use, ***we recommend the 70% homology set*** as a good balance between inclusivity and variety. This list is given in the file top2018_chains_hom70_mcfilter_60pct_complete.txt</p>
<p><br>
Usage caveats:<br>
--------------------------------------------------------------------------------<br>
These files are incomplete. They are single chains from structures that may have had multiple chains. Residues that fail the filtering criteria have been removed. Programs with strong requirements for completeness or uninterrupted chains should be used with care. Chain completeness and fragmentation statistics are available in top2018_passrates_mc_filtered.csv and in USER records at the end of each .pdb file.</p>
<p>All header information from the original structure has been preserved. This includes information about chains and residues no longer present in the file.</p>
<p>All ligands and waters associated with the chain have been preserved without filtering. Robust ligand filtering is beyond the scope of this dataset. Trust the ligands at your own discretion.</p>
<p>Sidechain atoms beyond CB have not been considered in the filtering. However, all sidechains have been included for residues that passed the mainchain filters. DO NOT use this set of files for serious questions involving sidechains. See our all-atom filtered dataset instead.</p>
<p><br>
Filtering criteria: Chain level<br>
--------------------------------------------------------------------------------<br>
Chain is protein<br>
Released on or before Dec 31, 2018<br>
Resolution < 2.0 Å<br>
MolProbity Score < 2.0<br>
<3% residues have cbeta deviations<br>
<2% residues have covalent bond length outliers<br>
<2% residues have covalent bond geometry outliers</p>
<p>Using sequence homology clusters provided by the RCSB PDB, for each homology cluster, the chain with the best (lowest) average of Resolution and MolProbity Score was selected.</p>
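<p>The chain-level criteria and the per-cluster selection can be sketched as follows. This is an illustrative reading of the rules above, not the authors' pipeline. The ChainStats record and its field names are hypothetical, and applying the filters before picking the best chain in a cluster is an assumption about ordering.</p>

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass
class ChainStats:
    """Hypothetical per-chain summary; field names are illustrative."""
    chain_id: str
    is_protein: bool
    release_date: date
    resolution: float                  # Angstroms
    molprobity_score: float
    pct_cbeta_deviations: float        # percent of residues
    pct_bond_length_outliers: float    # percent of residues
    pct_bond_geometry_outliers: float  # percent of residues


def passes_chain_filters(c: ChainStats) -> bool:
    """Apply the chain-level thresholds listed in the README."""
    return (
        c.is_protein
        and c.release_date <= date(2018, 12, 31)
        and c.resolution < 2.0
        and c.molprobity_score < 2.0
        and c.pct_cbeta_deviations < 3.0
        and c.pct_bond_length_outliers < 2.0
        and c.pct_bond_geometry_outliers < 2.0
    )


def best_chain(cluster: Iterable[ChainStats]) -> Optional[ChainStats]:
    """Pick the chain with the lowest average of resolution and MolProbity score.

    Assumption: chains are filtered first, then ranked within the cluster.
    """
    candidates = [c for c in cluster if passes_chain_filters(c)]
    if not candidates:
        return None
    return min(candidates, key=lambda c: (c.resolution + c.molprobity_score) / 2)
```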
<p><br>
Filtering criteria: Residue level<br>
--------------------------------------------------------------------------------<br>
Even excellent structures usually contain some poorly-resolved regions. Residue-level filtering helps avoid including these regions in otherwise high-quality data.</p>
<p>Mainchain atoms are defined as N, CA, C, O, CB.<br>
Note that CB is included, since its ideal position is defined by the other mainchain atoms.</p>
<p>All mainchain atoms in a residue must satisfy:<br>
B-factor <= 40<br>
Real-space correlation coefficient (rscc) >= 0.7<br>
2Fo-Fc map value >= 1.2</p>
<p>Additionally, residues are not allowed to have:<br>
Covalent geometry outliers<br>
Steric overlaps or "clashes", as per Probe<br>
Alternate conformations</p>
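<p>The residue-level rules can be expressed as a small function that returns the 1-letter pruning codes (bcmgoa) used in the USER records of each file. This is a sketch under stated assumptions: the MainchainAtom record is hypothetical, all three map-based values are treated as per-atom quantities, and the geometry, clash, and alternate-conformation flags are taken as precomputed inputs rather than derived here.</p>

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class MainchainAtom:
    """Hypothetical per-atom record for N, CA, C, O, or CB."""
    name: str
    bfactor: float
    rscc: float       # real-space correlation coefficient
    map_value: float  # 2Fo-Fc map value


def pruning_reasons(
    atoms: Iterable[MainchainAtom],
    has_geometry_outlier: bool,
    has_clash: bool,
    has_altconf: bool,
) -> str:
    """Return the pruning codes that apply; an empty string means the residue passes."""
    atoms = list(atoms)
    reasons = ""
    if any(a.bfactor > 40 for a in atoms):
        reasons += "b"  # B-factor above 40
    if any(a.rscc < 0.7 for a in atoms):
        reasons += "c"  # real-space correlation below 0.7
    if any(a.map_value < 1.2 for a in atoms):
        reasons += "m"  # 2Fo-Fc map value below 1.2
    if has_geometry_outlier:
        reasons += "g"
    if has_clash:
        reasons += "o"
    if has_altconf:
        reasons += "a"
    return reasons
```

<p>A single failing mainchain atom is enough to prune the residue, which matches the requirement that all mainchain atoms meet the thresholds.</p>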
<p><br>
Chain Completeness criteria<br>
--------------------------------------------------------------------------------<br>
Chains that lost more than 40% of their residues during filtering were dropped from this dataset. All chains present here are at least 60% complete.</p>
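<p>The completeness rule amounts to a simple ratio test. The sketch below assumes that completeness is the number of residues kept divided by the number of protein residues in the original chain; the function and parameter names are hypothetical.</p>

```python
def is_complete_enough(n_kept: int, n_original: int, min_fraction: float = 0.60) -> bool:
    """Return True if at least min_fraction of the original residues survived filtering."""
    if n_original <= 0:
        raise ValueError("n_original must be positive")
    return n_kept / n_original >= min_fraction
```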
<p><br>
Filtering documentation<br>
--------------------------------------------------------------------------------<br>
Each file documents its pruned and included residues with USER records. These include self-documenting USER DOC lines as follows:<br>
USER DOC Lines marked with USER DEL list residues pruned by<br>
USER DOC quality filtering.<br>
USER DOC Format is chain:resseq:icode:reason_for_pruning<br>
USER DOC Reasons for pruning are abbreviated as 1-letter codes: bcmgoa<br>
USER DOC b=bfactor, c=real space correlation, m=2Fo-Fc mapvalue<br>
USER DOC g=geometry outlier, o=steric overlap, a=alternate conformations<br>
USER DOC Lines marked USER INC list the uninterrupted fragments of structure<br>
USER DOC still included after pruning by quality filtering<br>
USER DOC Format is chain1:resseq1:icode1:chain2:resseq2:icode2:fragment_length<br>
USER DOC where 1 is the first and 2 the last residue of the fragment<br>
USER DOC Line marked with USER PCT gives statistics for structure completeness</p>
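<p>A minimal sketch of reading these records is shown below. The colon-separated field order follows the USER DOC description above. The exact column layout of the lines is an assumption: the code expects whitespace-separated "USER" and "DEL" or "INC" tokens followed by the payload. All function and class names are hypothetical.</p>

```python
from typing import Iterable, List, NamedTuple, Tuple

# 1-letter pruning codes as documented in the USER DOC lines.
REASON_CODES = {
    "b": "bfactor",
    "c": "real space correlation",
    "m": "2Fo-Fc mapvalue",
    "g": "geometry outlier",
    "o": "steric overlap",
    "a": "alternate conformations",
}


class PrunedResidue(NamedTuple):
    chain: str
    resseq: int
    icode: str
    reasons: List[str]


class Fragment(NamedTuple):
    chain1: str
    resseq1: int
    icode1: str
    chain2: str
    resseq2: int
    icode2: str
    length: int


def parse_del(payload: str) -> PrunedResidue:
    """Parse chain:resseq:icode:reason_for_pruning."""
    chain, resseq, icode, codes = [f.strip() for f in payload.split(":")]
    return PrunedResidue(chain, int(resseq), icode, [REASON_CODES[c] for c in codes])


def parse_inc(payload: str) -> Fragment:
    """Parse chain1:resseq1:icode1:chain2:resseq2:icode2:fragment_length."""
    c1, r1, i1, c2, r2, i2, n = [f.strip() for f in payload.split(":")]
    return Fragment(c1, int(r1), i1, c2, int(r2), i2, int(n))


def parse_user_records(lines: Iterable[str]) -> Tuple[List[PrunedResidue], List[Fragment]]:
    """Collect USER DEL and USER INC records; all other lines are ignored."""
    pruned, fragments = [], []
    for line in lines:
        tokens = line.split(None, 2)  # assumed layout: USER <tag> <payload>
        if len(tokens) != 3 or tokens[0] != "USER":
            continue
        if tokens[1] == "DEL":
            pruned.append(parse_del(tokens[2]))
        elif tokens[1] == "INC":
            fragments.append(parse_inc(tokens[2]))
    return pruned, fragments
```

<p>Typical use would be to pass an open .pdb file handle to parse_user_records, since iterating over the file yields its lines.</p>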
<p>Version history<br>
--------------------------------------------------------------------------------<br>
Version 0.9 10.5281/zenodo.4626150 Mar 21, 2021<br>
Initial version</p>
<p>Version 1.0 10.5281/zenodo.5115075 Jul 19, 2021<br>
Split into 30, 50, 70, and 90% homology sets</p>
<p>Version 2.0<br>
Set case of filenames to unambiguous standard: all lowercase except L</p>
Zenodo
2021-07-19
info:eu-repo/semantics/other
4626149
2.0
1690576978.017341
782122628
md5:c80be319fcdc7ca3072a1344af5b7aba
https://zenodo.org/records/5773119/files/top2018_pdbs_mc_filtered_hom30.tar.gz
1465922465
md5:fd1671aa7e88e49335ff44e87d96883c
https://zenodo.org/records/5773119/files/top2018_pdbs_mc_filtered_hom90.tar.gz
1330693936
md5:31ec0d70c296b63891d8e57f8fd546a4
https://zenodo.org/records/5773119/files/top2018_pdbs_mc_filtered_hom70.tar.gz
1140526057
md5:7197acde2313b0a1da84786b0e897bd2
https://zenodo.org/records/5773119/files/top2018_pdbs_mc_filtered_hom50.tar.gz
2007333
md5:eecc82da39bed07f3d26bdb6d5272c2d
https://zenodo.org/records/5773119/files/top2018_metadata_mc_filtered.csv
106274
md5:72bdb96c209ae36b59dd1b4f71361064
https://zenodo.org/records/5773119/files/top2018_chains_hom90_mcfilter_60pct_complete.txt
82642
md5:8b955c245684bbd76ff23436f79a5445
https://zenodo.org/records/5773119/files/top2018_chains_hom50_mcfilter_60pct_complete.txt
58149
md5:aaa1b4dbc1191489ed6e05cf6060fbb5
https://zenodo.org/records/5773119/files/top2018_chains_hom30_mcfilter_60pct_complete.txt
340310
md5:9ad257bd0b521aa0cfc1a986d869060f
https://zenodo.org/records/5773119/files/top2018_passrates_mc_filtered.csv
641
md5:f542f9a7ce0f4c220da7ba16c430fc6b
https://zenodo.org/records/5773119/files/sample_file_loop.py
6807
md5:750f8826827c5369a1ce46c1117acb54
https://zenodo.org/records/5773119/files/README_mc_filter.txt
public
10.5281/zenodo.4626149
isVersionOf
doi