Dataset Open Access
Williams, Christopher;
Richardson, David;
Richardson, Jane
<?xml version='1.0' encoding='utf-8'?> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:adms="http://www.w3.org/ns/adms#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dct="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:dcat="http://www.w3.org/ns/dcat#" xmlns:duv="http://www.w3.org/ns/duv#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:frapo="http://purl.org/cerif/frapo/" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:gsp="http://www.opengis.net/ont/geosparql#" xmlns:locn="http://www.w3.org/ns/locn#" xmlns:org="http://www.w3.org/ns/org#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:prov="http://www.w3.org/ns/prov#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:schema="http://schema.org/" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:vcard="http://www.w3.org/2006/vcard/ns#" xmlns:wdrs="http://www.w3.org/2007/05/powder-s#"> <rdf:Description rdf:about="https://doi.org/10.5281/zenodo.5773119"> <rdf:type rdf:resource="http://www.w3.org/ns/dcat#Dataset"/> <dct:type rdf:resource="http://purl.org/dc/dcmitype/Dataset"/> <dct:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#anyURI">https://doi.org/10.5281/zenodo.5773119</dct:identifier> <foaf:page rdf:resource="https://doi.org/10.5281/zenodo.5773119"/> <dct:creator> <rdf:Description rdf:about="http://orcid.org/0000-0002-5808-8768"> <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"/> <dct:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">0000-0002-5808-8768</dct:identifier> <foaf:name>Williams, Christopher</foaf:name> <foaf:givenName>Christopher</foaf:givenName> <foaf:familyName>Williams</foaf:familyName> <org:memberOf> <foaf:Organization> <foaf:name>Duke University</foaf:name> </foaf:Organization> </org:memberOf> </rdf:Description> </dct:creator> <dct:creator> <rdf:Description rdf:about="http://orcid.org/0000-0001-5069-343X"> <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"/> <dct:identifier 
rdf:datatype="http://www.w3.org/2001/XMLSchema#string">0000-0001-5069-343X</dct:identifier> <foaf:name>Richardson, David</foaf:name> <foaf:givenName>David</foaf:givenName> <foaf:familyName>Richardson</foaf:familyName> <org:memberOf> <foaf:Organization> <foaf:name>Duke University</foaf:name> </foaf:Organization> </org:memberOf> </rdf:Description> </dct:creator> <dct:creator> <rdf:Description rdf:about="http://orcid.org/0000-0002-3311-2944"> <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"/> <dct:identifier rdf:datatype="http://www.w3.org/2001/XMLSchema#string">0000-0002-3311-2944</dct:identifier> <foaf:name>Richardson, Jane</foaf:name> <foaf:givenName>Jane</foaf:givenName> <foaf:familyName>Richardson</foaf:familyName> <org:memberOf> <foaf:Organization> <foaf:name>Duke University</foaf:name> </foaf:Organization> </org:memberOf> </rdf:Description> </dct:creator> <dct:title>High quality protein residues: top2018 mainchain-filtered residues</dct:title> <dct:publisher> <foaf:Agent> <foaf:name>Zenodo</foaf:name> </foaf:Agent> </dct:publisher> <dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#gYear">2021</dct:issued> <dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2021-07-19</dct:issued> <owl:sameAs rdf:resource="https://zenodo.org/record/5773119"/> <adms:identifier> <adms:Identifier> <skos:notation rdf:datatype="http://www.w3.org/2001/XMLSchema#anyURI">https://zenodo.org/record/5773119</skos:notation> <adms:schemeAgency>url</adms:schemeAgency> </adms:Identifier> </adms:identifier> <dct:isVersionOf rdf:resource="https://doi.org/10.5281/zenodo.4626149"/> <owl:versionInfo>2.0</owl:versionInfo> <dct:description><p>Introduction<br> --------------------------------------------------------------------------------<br> This directory contains files from the top2018 dataset by the Richardson Lab at Duke University.</p> <p>These are high-quality residues from high-quality, low redundancy protein chains in the PDB.</p> <p>This dataset is 
quality-filtered on mainchain atoms.&nbsp; For the full-residue filtered set, see https://doi.org/10.5281/zenodo.5115232</p> <p>The accompanying publication is:<br> Williams, C. J., Richardson, D. C., &amp; Richardson, J. S. (2021). The importance of residue‐level filtering, and the Top2018 best‐parts dataset of high‐quality protein residues. Protein Science. http://doi.org/10.1002/pro.4239</p> <p>Usage recommendations<br> --------------------------------------------------------------------------------<br> Protein residues that fail the filtering criteria described below have been removed from the files.&nbsp; As a result, these files can be considered pre-filtered and will return only results for residues of good model quality with supporting experimental data.&nbsp; As long as the question concerns mainchain protein atoms, these files should be usable as is.&nbsp; There is a separate version that has been filtered on all atoms that is suitable for sidechains.</p> <p>The top2018 contains several different levels of homology clustering (30%, 50%, 70%, 90%) to ensure nonredundant datasets.&nbsp; The 70% homology level is a reliable default.&nbsp; These chains are listed in top2018_chains_hom70_mcfilter_60pct_complete.txt and found in top2018_pdbs_mc_filtered_hom70.tar.gz</p> <p>Files are organized in subdirectories based on the first two letters of their PDB ids.&nbsp; The included python script sample_file_loop.py may aid in accessing the directory structure.</p> <p>Files already contain hydrogens added by Reduce.&nbsp; NQH flips have been performed to ensure that these are the best versions of these structures.</p> <p>top2018_metadata_mc_filtered.csv contains information on release date, resolution, and validation scores for each file.</p> <p>top2018_passrates_mc_filtered.csv contains information on how many protein residues from the original chain passed the quality filters.</p> <p><br> Homology sets:<br> 
--------------------------------------------------------------------------------<br> Using sequence homology clusters provided by the RCSB PDB, for each homology cluster, the best chain was selected for inclusion in the dataset.&nbsp; This ensures minimal sequence/structural redundancy.</p> <p>The top2018 is available at several different levels of homology clustering, which may be appropriate to different uses.&nbsp; Lists of the included chains at each homology level are included in this distribution.</p> <p>Lower homology numbers mean less redundancy, but fewer total chains in the dataset.</p> <p>For general use, ***we recommend the 70% homology set*** as a good balance between inclusivity and variety. This list is given in the file top2018_chains_hom70_mcfilter_60pct_complete.txt</p> <p><br> Usage caveats:<br> --------------------------------------------------------------------------------<br> These files are incomplete.&nbsp; They are single chains from structures that may have had multiple chains.&nbsp; Residues that fail the filtering criteria have been removed.&nbsp; Programs with strong requirements for completeness or uninterrupted chains should be used with care.&nbsp; Chain completeness and fragmentation statistics are available in top2018_passrates_mc_filtered.csv and in USER records at the end of each .pdb file.</p> <p>All header information from the original structure has been preserved.&nbsp; This includes information about chains and residues no longer present in the file.</p> <p>All ligands and waters associated with the chain have been preserved without filtering.&nbsp; Robust ligand filtering is beyond the scope of this dataset.&nbsp; Trust the ligands at your own discretion.</p> <p>Sidechain atoms beyond CB have not been considered in the filtering.&nbsp; However, all sidechains have been included for residues that passed the mainchain filters.&nbsp; DO NOT use this set of files for serious questions involving sidechains.&nbsp; See our all-atom 
filtered dataset instead.</p> <p><br> Filtering criteria: Chain level<br> --------------------------------------------------------------------------------<br> Chain is protein<br> Released on or before Dec 31, 2018<br> Resolution &lt; 2.0<br> MolProbity Score &lt; 2.0<br> &lt;3% residues have cbeta deviations<br> &lt;2% residues have covalent bond length outliers<br> &lt;2% residues have covalent bond geometry outliers</p> <p>Using sequence homology clusters provided by the RCSB PDB, for each homology cluster, the chain with the best (lowest) average of Resolution and MolProbity Score was selected.</p> <p><br> Filtering criteria: Residue level<br> --------------------------------------------------------------------------------<br> Even excellent structures usually contain some poorly-resolved regions.&nbsp; Residue-level filtering helps avoid including these regions in otherwise high-quality data.</p> <p>Mainchain atoms are defined as N, CA, C, O, CB.<br> Note that CB is included, since its ideal position is defined by the other mainchain atoms.</p> <p>All mainchain atoms in a residue:<br> Bfactor &lt;= 40<br> Real-space correlation coefficient (rscc) &gt;= 0.7<br> 2Fo-Fc map value &gt;= 1.2</p> <p>Additionally, residues are not allowed to have:<br> Covalent geometry outliers<br> Steric overlaps or &quot;clashes&quot;, as per Probe<br> Alternate conformations</p> <p><br> Chain Completeness criteria<br> --------------------------------------------------------------------------------<br> Chains which lost &gt;40% of their residues during filtering were dropped from this dataset.&nbsp; All chains present here are at least 60% complete.</p> <p><br> Filtering documentation<br> --------------------------------------------------------------------------------<br> Each file documents its pruned and included residues with USER records.&nbsp; These include self-documenting USER&nbsp; DOC lines as follows:<br> USER&nbsp; DOC Lines marked with USER&nbsp; DEL list residues pruned 
by<br> USER&nbsp; DOC quality filtering.<br> USER&nbsp; DOC Format is chain:resseq:icode:reason_for_pruning<br> USER&nbsp; DOC Reasons for pruning are abbreviated as 1-letter codes: bcmgoa<br> USER&nbsp; DOC b=bfactor, c=real space correlation, m=2Fo-Fc mapvalue<br> USER&nbsp; DOC g=geometry outlier, o=steric overlap, a=alternate conformations<br> USER&nbsp; DOC Lines marked USER&nbsp; INC list the uninterrupted fragments of structure<br> USER&nbsp; DOC still included after pruning by quality filtering<br> USER&nbsp; DOC Format is chain1:resseq1:icode1:chain2:resseq2:icode2:fragment_length<br> USER&nbsp; DOC where 1 is the first and 2 the last residue of the fragment<br> USER&nbsp; DOC Line marked with USER&nbsp; PCT gives statistics for structure completeness</p> <p>Version history<br> --------------------------------------------------------------------------------<br> Version 0.9 10.5281/zenodo.4626150 &nbsp;&nbsp; &nbsp;Mar 21, 2021<br> Initial version</p> <p>Version 1.0 10.5281/zenodo.5115075 &nbsp;&nbsp; &nbsp;Jul 19, 2021<br> Split into 30, 50, 70, and 90% homology sets</p> <p>Version 2.0<br> Set case of filenames to unambiguous standard: all lowercase except L</p></dct:description> <dct:accessRights rdf:resource="http://publications.europa.eu/resource/authority/access-right/PUBLIC"/> <dct:accessRights> <dct:RightsStatement rdf:about="info:eu-repo/semantics/openAccess"> <rdfs:label>Open Access</rdfs:label> </dct:RightsStatement> </dct:accessRights> <dcat:distribution> <dcat:Distribution> <dct:license rdf:resource="https://creativecommons.org/licenses/by/4.0/legalcode"/> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.5773119"/> </dcat:Distribution> </dcat:distribution> <dcat:distribution> <dcat:Distribution> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.5773119"/> <dcat:byteSize>6807</dcat:byteSize> <dcat:downloadURL rdf:resource="https://zenodo.org/record/5773119/files/README_mc_filter.txt"/> 
<dcat:mediaType>text/plain</dcat:mediaType> </dcat:Distribution> </dcat:distribution> <dcat:distribution> <dcat:Distribution> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.5773119"/> <dcat:byteSize>641</dcat:byteSize> <dcat:downloadURL rdf:resource="https://zenodo.org/record/5773119/files/sample_file_loop.py"/> <dcat:mediaType>text/x-python</dcat:mediaType> </dcat:Distribution> </dcat:distribution> <dcat:distribution> <dcat:Distribution> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.5773119"/> <dcat:byteSize>58149</dcat:byteSize> <dcat:downloadURL rdf:resource="https://zenodo.org/record/5773119/files/top2018_chains_hom30_mcfilter_60pct_complete.txt"/> <dcat:mediaType>text/plain</dcat:mediaType> </dcat:Distribution> </dcat:distribution> <dcat:distribution> <dcat:Distribution> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.5773119"/> <dcat:byteSize>82642</dcat:byteSize> <dcat:downloadURL rdf:resource="https://zenodo.org/record/5773119/files/top2018_chains_hom50_mcfilter_60pct_complete.txt"/> <dcat:mediaType>text/plain</dcat:mediaType> </dcat:Distribution> </dcat:distribution> <dcat:distribution> <dcat:Distribution> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.5773119"/> <dcat:byteSize>106274</dcat:byteSize> <dcat:downloadURL rdf:resource="https://zenodo.org/record/5773119/files/top2018_chains_hom90_mcfilter_60pct_complete.txt"/> <dcat:mediaType>text/plain</dcat:mediaType> </dcat:Distribution> </dcat:distribution> <dcat:distribution> <dcat:Distribution> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.5773119"/> <dcat:byteSize>2007333</dcat:byteSize> <dcat:downloadURL rdf:resource="https://zenodo.org/record/5773119/files/top2018_metadata_mc_filtered.csv"/> <dcat:mediaType>text/csv</dcat:mediaType> </dcat:Distribution> </dcat:distribution> <dcat:distribution> <dcat:Distribution> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.5773119"/> <dcat:byteSize>340310</dcat:byteSize> 
<dcat:downloadURL rdf:resource="https://zenodo.org/record/5773119/files/top2018_passrates_mc_filtered.csv"/> <dcat:mediaType>text/csv</dcat:mediaType> </dcat:Distribution> </dcat:distribution> <dcat:distribution> <dcat:Distribution> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.5773119"/> <dcat:byteSize>782122628</dcat:byteSize> <dcat:downloadURL rdf:resource="https://zenodo.org/record/5773119/files/top2018_pdbs_mc_filtered_hom30.tar.gz"/> <dcat:mediaType>application/x-tar</dcat:mediaType> </dcat:Distribution> </dcat:distribution> <dcat:distribution> <dcat:Distribution> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.5773119"/> <dcat:byteSize>1140526057</dcat:byteSize> <dcat:downloadURL rdf:resource="https://zenodo.org/record/5773119/files/top2018_pdbs_mc_filtered_hom50.tar.gz"/> <dcat:mediaType>application/x-tar</dcat:mediaType> </dcat:Distribution> </dcat:distribution> <dcat:distribution> <dcat:Distribution> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.5773119"/> <dcat:byteSize>1330693936</dcat:byteSize> <dcat:downloadURL rdf:resource="https://zenodo.org/record/5773119/files/top2018_pdbs_mc_filtered_hom70.tar.gz"/> <dcat:mediaType>application/x-tar</dcat:mediaType> </dcat:Distribution> </dcat:distribution> <dcat:distribution> <dcat:Distribution> <dcat:accessURL rdf:resource="https://doi.org/10.5281/zenodo.5773119"/> <dcat:byteSize>1465922465</dcat:byteSize> <dcat:downloadURL rdf:resource="https://zenodo.org/record/5773119/files/top2018_pdbs_mc_filtered_hom90.tar.gz"/> <dcat:mediaType>application/x-tar</dcat:mediaType> </dcat:Distribution> </dcat:distribution> </rdf:Description> </rdf:RDF>
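The USER DEL and USER INC record formats documented in the description above could be read with something like the following Python sketch. It is not the authors' code. It assumes each record sits on one line, that the tag (DEL or INC) follows `USER` separated by whitespace, and that the colon-delimited fields may be space-padded; the exact column layout is not stated in the description, so the sample lines and the names `parse_user_records`, `Pruned`, and `Fragment` are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Tuple

# One-letter pruning codes, as listed in the USER DOC lines of the description.
REASONS = {
    "b": "bfactor",
    "c": "real space correlation",
    "m": "2Fo-Fc mapvalue",
    "g": "geometry outlier",
    "o": "steric overlap",
    "a": "alternate conformations",
}


class Pruned(NamedTuple):
    chain: str
    resseq: int
    icode: str
    reasons: List[str]


class Fragment(NamedTuple):
    start_chain: str
    start_resseq: int
    start_icode: str
    end_chain: str
    end_resseq: int
    end_icode: str
    length: int


def parse_user_records(lines: Iterable[str]) -> Tuple[List[Pruned], List[Fragment]]:
    """Collect USER DEL and USER INC records from the lines of one file.

    Assumption: the record payload is everything after the tag, split on ':'.
    Lines with any other tag (DOC, PCT) or an unexpected field count are skipped.
    """
    pruned: List[Pruned] = []
    fragments: List[Fragment] = []
    for line in lines:
        if not line.startswith("USER"):
            continue
        parts = line[4:].split(None, 1)  # -> [tag, payload]
        if len(parts) != 2:
            continue
        tag, payload = parts
        fields = [f.strip() for f in payload.split(":")]
        if tag == "DEL" and len(fields) == 4:
            chain, resseq, icode, codes = fields
            reasons = [REASONS.get(code, code) for code in codes]
            pruned.append(Pruned(chain, int(resseq), icode, reasons))
        elif tag == "INC" and len(fields) == 7:
            c1, r1, i1, c2, r2, i2, length = fields
            fragments.append(Fragment(c1, int(r1), i1, c2, int(r2), i2, int(length)))
    return pruned, fragments


# Hypothetical sample lines, written to match the documented field order only.
sample = [
    "USER  DOC Format is chain:resseq:icode:reason_for_pruning",
    "USER  DEL A:  12: :bc",
    "USER  INC A:  13: :A:  40: :28",
]
pruned, fragments = parse_user_records(sample)
```

Splitting on `:` and stripping each field keeps the sketch tolerant of padding and blank insertion codes; a real file should be checked against its own USER DOC lines before relying on this.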
Metric | All versions | This version |
---|---|---|
Views | 1,430 | 52 |
Downloads | 1,060 | 38 |
Data volume | 237.9 GB | 9.5 GB |
Unique views | 1,195 | 48 |
Unique downloads | 664 | 28 |