Published October 7, 2025 | Version v2
Dataset Open

Supplementary data for "New Targets and Procedures for Validating the Valence Geometry of Nucleic Acid Structures"

  • 1. Czech Academy of Sciences
  • 2. Poznań University of Technology
  • 3. MRC Laboratory of Molecular Biology

Description

This repository contains data and code accompanying the paper "New Targets And Procedures For Validating The Valence Geometry Of Nucleic Acid Structures" by Černý et al. The files are as follows:

  • pdb_na_reference_set.zip - the complete PDB-NA Reference Set
  • zprime_thresholds.zip - thresholds for the weighted asymmetric non-parametric standard score (Z')
  • restraints_in literature_and_refinement_software.xlsx - listing of geometrical restraints for nucleic acid bond lengths and angles found in the literature and refinement programs
  • filtering_and_prosco_code.zip - the code for filtering the PDB-NA Reference Set and for calculating the probability percentile score (ProSco)
  • residues_removed_after_expert_inspection.csv - list of residues excluded from the PDB-NA Reference Set after manual inspection
  • Preferred_CSD_stats.ods - table with PDB-wide summary of proportions of the lower and upper boundaries between Preferred and Allowed determined by CSD 3σ values rather than ProSco 5 values
  • Z-prime_analysis.zip - an HTML report with the visualizations used to inspect the effect of different Z' thresholds
  • prosco_json.zip - ProSco values in JSON format for all analyzed bond lengths and angles

 

Additional information about the content of the "filtering_and_prosco_code.zip" file:

The "filtering" directory contains scripts and data for quality filtering of the non-redundant DNA and RNA residues that form the "PDB-NA Reference Set".

"rcsb_all_DNA+RNA_within_3.5A_xray_with_data.txt" - list of DNA and RNA X-ray PDB structures with a crystallographic resolution of 3.5 Å or better for which experimental data are available. The list was obtained by an Advanced Search query with these parameters at the rcsb.org web site.

The scripts in the expected order of calling:

  • "validation_xml2json" - Converts the PDB validation reports from XML to JSON.
  • "graphQL_GET_protein_clusters" - Queries the https://data.rcsb.org/graphql endpoint for details about the biomolecular chains contained in the list of X-ray structures and generates the "rcsb_DNA+RNA_graphQL.json" file with the requested data.
  • "make_non_redundant_DNA.py" and "make_non_redundant_RNA.py" - Identify DNA and RNA sequence clusters in complexes with proteins and in naked NAs. This step relies on a modified BioPython substitution matrix; update your local installation with the files from the attached Bio directory.
  • "calculate_scores_single_res.py" - Adding the quality score for each NA chain.
  • "process_non_redundant" - Compiles the set of highest quality non-redundant NA chains.
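
As a rough illustration of the GraphQL step in the list above, the following Python sketch builds a GET request against the https://data.rcsb.org/graphql endpoint for a few PDB entries. It is not the "graphQL_GET_protein_clusters" script from the archive: the function names are hypothetical, and the requested fields (polymer type, canonical sequence, sequence cluster membership) are an assumption about which chain details might be useful, based on the public RCSB Data API schema.

```python
import json
import urllib.parse
import urllib.request

GRAPHQL_ENDPOINT = "https://data.rcsb.org/graphql"


def build_query(entry_ids):
    """Return a GraphQL query string for the given PDB entry IDs.

    The selected fields are an assumption for illustration, not the
    field list used by the scripts in the archive.
    """
    # json.dumps renders the Python list as ["1BNA", "4TNA"], which is
    # also a valid GraphQL list-of-strings literal.
    ids_literal = json.dumps(list(entry_ids))
    return (
        "{ entries(entry_ids: " + ids_literal + ") {"
        " rcsb_id"
        " polymer_entities {"
        " rcsb_id"
        " entity_poly { rcsb_entity_polymer_type pdbx_seq_one_letter_code_can }"
        " rcsb_cluster_membership { cluster_id identity }"
        " } } }"
    )


def build_url(entry_ids):
    """Encode the query as a URL parameter for a GET request."""
    params = urllib.parse.urlencode({"query": build_query(entry_ids)})
    return GRAPHQL_ENDPOINT + "?" + params


def fetch_chain_details(entry_ids, timeout=30):
    """Send the request and return the decoded JSON response (needs network)."""
    with urllib.request.urlopen(build_url(entry_ids), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_chain_details(["1BNA", "4TNA"])` would return a dictionary that can be written to disk with `json.dump`; for the full structure list, the entry IDs would likely have to be sent in batches.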

The "naval" directory contains a C++ re-implementation of the Python-based annotation code (https://github.com/mkowiel/nucleic-acid-validation.git).
The C++ program depends on the "libLLKA" library used by the https://dnatco.datmos.org web service. The library source code is available from the https://github.com/cernylab/libLLKA.git repository.

The code processes the structures from the "filtering" step, measures the bond lengths and angles, and returns an intermediate classification. Only the CSD-related classes are used further for the final composite validation tier (combining the ProSco, CSD, and Z' scores).
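
For readers unfamiliar with the measurement itself, the sketch below shows how a bond length and a valence angle can be computed from Cartesian atom coordinates. It is a generic geometric illustration in Python, not the "naval" C++ code; the function names and the example coordinates are hypothetical.

```python
import math


def bond_length(a, b):
    """Distance between two atoms given as (x, y, z) tuples, in the input units."""
    return math.dist(a, b)


def bond_angle(a, b, c):
    """Valence angle a-b-c in degrees, with b as the central atom."""
    v1 = [p - q for p, q in zip(a, b)]  # vector from b to a
    v2 = [p - q for p, q in zip(c, b)]  # vector from b to c
    dot = sum(x * y for x, y in zip(v1, v2))
    cos_theta = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


# Hypothetical coordinates (in Å) of three bonded atoms.
atom_a = (1.5, 0.0, 0.0)
atom_b = (0.0, 0.0, 0.0)
atom_c = (0.0, 1.4, 0.0)

length_ab = bond_length(atom_a, atom_b)
angle_abc = bond_angle(atom_a, atom_b, atom_c)
```

The measured values would then be compared against the reference distributions to assign a classification; that comparison is specific to the paper and is not reproduced here.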

The "prosco" directory contains the R script and auxiliary shell scripts for (re)calculation of the *_prosco.json files. It uses the "naval"-annotated CSV files from "pdb_na_reference_set.zip" as input.

Files

filtering_and_prosco_code.zip

Files (96.8 MB)

Checksum Size
md5:c2de4601d22834ee4ee474706a2c3f39 9.7 MB
md5:a531d308a3984eaefb453fc988c91359 2.6 MB
md5:1d8ca40cdb9d48c8fc8f0419315b3269 53.3 kB
md5:cb9a8818f2d2341d986b90af2c0317b4 20.1 MB
md5:b391b3cbfd833a8c97970827ec76d869 291 Bytes
md5:c0ea6a31d353d3baa658d19440a3fe74 212.1 kB
md5:117a6c16b75dc2eea423022a37e2d0f2 64.0 MB
md5:fbc182002554d92382fbe31a22e6d56c 8.0 kB
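
The md5 checksums listed above can be used to confirm that a download is intact. A minimal Python sketch, assuming the file has been saved locally (the function names are hypothetical, and the pairing of each checksum with a file name has to be taken from the record itself):

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum("downloaded_file.zip", "md5:...")` returns True only when the local copy matches the checksum given for that file.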

Additional details

Software

Programming language
Python, R, C++