Published December 22, 2025 | Version 2025-12-22
Software Open

Source Code for the 'Corpus of Resolutions: UN Security Council' (CR-UNSC-Source)

  • 1. Ludwig-Maximilians-Universität München
  • 2. Scuola Superiore Sant'Anna
  • 3. King's College London

Description

Overview

This code in the R programming language downloads and processes the full set of resolutions, drafts and meeting records rendered by the United Nations Security Council (UNSC), as published by the UN Digital Library, into a rich and structured human- and machine-readable dataset. It is the basis for the Corpus of Resolutions: UN Security Council (CR-UNSC).

All data sets created with this script will be hosted permanently, open access and freely available, at Zenodo, the scientific repository of CERN. Each version is uniquely identified with a persistent Digital Object Identifier (DOI), the Version DOI. The newest version of the data set will always be available via the link of the Concept DOI: https://doi.org/10.5281/zenodo.7319780

We have also recently published a pre-print entitled "Words of Power: Introducing a Comprehensive Corpus of UN Security Council Resolutions" (Zenodo 2025), which describes the collection, revision and full creation process of the corpus.

 

Updates

The CR-UNSC will be updated at least once per year. In case of serious errors, an update will be provided at the earliest opportunity and a highlighted advisory will be issued on the Zenodo page of the current version. Minor errors will be documented in the GitHub issue tracker and fixed with the next scheduled release.

The CR-UNSC is versioned according to the day of the last run of the data pipeline, in the ISO format YYYY-MM-DD. Its initial release version is 2024-05-03.
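
Because versions are ISO dates, two releases can be ordered by comparing their digits. The following is a minimal sketch of that idea in POSIX shell. The function names `is_version` and `is_newer` are hypothetical helpers and not part of the pipeline, and the check covers only the shape YYYY-MM-DD, not calendar validity.

```shell
# Hypothetical helpers (not part of CR-UNSC): compare YYYY-MM-DD version strings.

# is_version <string>: succeeds if the string has the shape YYYY-MM-DD
is_version() {
    case "$1" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# is_newer <a> <b>: succeeds if version a is later than version b
is_newer() {
    is_version "$1" && is_version "$2" || return 1
    # Removing the hyphens turns each date into an integer such as 20251222
    [ "$(printf '%s' "$1" | tr -d '-')" -gt "$(printf '%s' "$2" | tr -d '-')" ]
}

# Example:
# is_newer 2025-12-22 2024-05-03 && echo "2025-12-22 is the later release"
```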

Notifications regarding new and updated data sets will be published on the academic website of Seán Fobbe at www.seanfobbe.com or on the Fediverse at @seanfobbe@fediscience.org

 

Changelog

  • Full recompilation of dataset with most recent data up to UNSC resolution 2798
  • OCR content fixes in 108 lines of 36 gold-standard files (courtesy of Kilian Lüders and Hannah Birkenkötter)
  • Dataset now available in Parquet format
  • Improved detection of citations to UNGA special and emergency sessions
  • Number of countries and subjects in bar charts reduced to 25 for better readability
  • New test for completeness of TXT conversion
  • Set timezone of Docker container to America/New_York
  • Text variables are properly removed from the metadata only variant
  • Repair parsing of HTML main and vote records
  • Fix Quanteda parallelization
  • Show resolution number instead of doc_id in quality check of short resolutions
  • Increased OCR range for Chinese resolutions up to and including res 1292 to capture content of ~360 documents
  • Set general OCR inclusion range up to and including res 902
  • Refactored citation extraction to simplify and update deprecated functions
  • Simplify cleanup script with git clean, add Dockerized version
  • Remove docker build script, added build instructions to README
  • Simplify docker run script with docker-compose
  • R core upgraded to version 4.4.0 (mitigate CVE-2024-27322)
  • R packages version-locked to CRAN date 2024-06-13
  • R package requirements now in plaintext file
  • Tesseract upgraded to version 5.5.1

 

Functionality

The pipeline will produce the following results and store them in the output/ folder:

  • Codebook as PDF
  • Compilation Report as PDF
  • Quality Assurance Report as PDF
  • ZIP archive containing the main data set as a CSV file
  • ZIP archive containing only the metadata of the main data set as a CSV file
  • ZIP archive containing citation data and metadata as a GraphML file
  • ZIP archive containing bibliographic data as a BIBTEX file
  • ZIP archive containing all resolution texts as TXT files (OCR and extracted)
  • ZIP archive containing all resolution texts as PDF files (original and English OCR)
  • ZIP archive containing all draft texts as PDF files (original)
  • ZIP archive containing all meeting record texts as PDF files (original)
  • ZIP archive containing the full Source Code
  • ZIP archive containing all intermediate pipeline results ("targets")

The integrity and veracity of each ZIP archive is documented with cryptographically secure hashes (SHA2-256 and SHA3-512). Hashes are stored in a separate CSV file created during the data set compilation process.
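
A downloaded archive can be checked against these hashes by recomputing its digest locally. The following is a minimal sketch, assuming GNU coreutils (`sha256sum`) and OpenSSL 1.1.1 or later (for SHA3) are installed. The function names are hypothetical, and because the layout of the hash CSV is not described here, the expected value has to be copied from that file by hand.

```shell
# Hypothetical helpers (not part of CR-UNSC): recompute and compare digests.

# sha256_of <file>: prints the SHA2-256 digest of the file
sha256_of() {
    sha256sum "$1" | awk '{print $1}'
}

# sha3_512_of <file>: prints the SHA3-512 digest of the file
sha3_512_of() {
    # openssl prints "SHA3-512(<file>)= <digest>"; the digest is the last field
    openssl dgst -sha3-512 "$1" | awk '{print $NF}'
}

# digest_matches <actual> <expected>: succeeds if both are equal, ignoring case
digest_matches() {
    actual=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Example (file name and expected value are placeholders):
# digest_matches "$(sha256_of archive.zip)" "<value from the hash CSV>" && echo OK
```

Checking both digests is optional. Either one detects accidental corruption during download.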

 

System Requirements

  • The reference data sets were compiled on a Debian host system. Running the Docker config on an SELinux system like Fedora will require modifications of the Docker Compose config file
  • 55 GB space on hard drive
  • Multi-core CPU recommended. We used 8 cores/16 threads to compile the reference data sets. Standard config will use all cores on a system. This can be fine-tuned in the config file
  • The full pipeline can take up to 72 hours to run
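
The disk and CPU requirements above can be checked before starting a long run. The following is a minimal sketch in POSIX shell. The function names are hypothetical, and the 55 GB figure is approximated as 55 GiB.

```shell
# Hypothetical pre-flight helpers (not part of CR-UNSC).

# free_gib <dir>: prints free space on the filesystem holding <dir>, in whole GiB
free_gib() {
    # df -Pk gives POSIX output; column 4 is available space in 1024-byte blocks
    df -Pk "$1" | awk 'NR==2 {print int($4 / 1024 / 1024)}'
}

# has_free_space <dir> <required_gib>: succeeds if enough space is free
has_free_space() {
    [ "$(free_gib "$1")" -ge "$2" ]
}

# cpu_count: prints the number of available processing units
cpu_count() {
    nproc 2>/dev/null || getconf _NPROCESSORS_ONLN
}

# Example:
# has_free_space . 55 || echo "Less than 55 GiB free in this folder" >&2
# echo "Cores available: $(cpu_count)"
```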


Instructions

Step 1: Prepare Folder

Copy the full source code to an empty folder, for example by executing:

$ git clone https://codeberg.org/seanfobbe/cr-unsc

Always use a dedicated and empty (!) folder for compiling the data set. The scripts will automatically delete all PDF, TXT and many other file types in their working directory to ensure a clean run.
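
To guard against cloning into a folder that already holds files, the target can be checked first. The following is a minimal sketch. The function name `is_empty_dir` and the variable `target` are hypothetical and not part of the source code.

```shell
# Hypothetical helper (not part of CR-UNSC).

# is_empty_dir <dir>: succeeds if <dir> exists and contains no entries,
# including hidden files
is_empty_dir() {
    [ -d "$1" ] && [ -z "$(ls -A -- "$1")" ]
}

# Example (the folder name is a placeholder):
# target=cr-unsc-build
# mkdir -p "$target"
# is_empty_dir "$target" && git clone https://codeberg.org/seanfobbe/cr-unsc "$target"
```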

 

Step 2: Create Docker Image

The Dockerfile contains automated instructions to create a full operating system with all necessary dependencies. To create the image from the Dockerfile, please execute:

$ docker-compose build --pull

 

Step 3: Compile Dataset

If you have previously compiled the data set, whether successfully or not, you can delete all output and temporary files by executing:

$ bash delete_all_data.sh

 

You can compile the full data set by executing:

$ bash docker-run-project.sh

 

Results

The data set and all associated files are now saved in your working directory.

 

GNU General Public License Version 3

Copyright (C) 2024 Seán Fobbe, Lorenzo Gasbarri and Niccolò Ridi

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program.  If not, see https://www.gnu.org/licenses/

 

Author Websites

Personal Website of Seán Fobbe

Personal Website of Lorenzo Gasbarri

Personal Website of Niccolò Ridi

 

Contact

Did you discover any errors? Do you have suggestions on how to improve the data set? You can either post these to the Issue Tracker on Codeberg or contact Seán Fobbe via https://seanfobbe.com/contact/

Files

CR-UNSC_2025-12-22_CompilationReport.pdf

Files (454.4 MB)

  • md5:fd041a31918dab1b5d54c4f08d2857fd (591.8 kB)
  • md5:f6f7481d1e7ecc6fe373d0fa7d34ad20 (7.0 kB)
  • md5:7725309b4f4180abd25220be55ced7d9 (5.8 MB)
  • md5:22fb8a825b4fe7f789c1bc5ae94c1074 (3.8 MB)
  • md5:c647e090fffe2686253bebf8ba8f8bba (444.2 MB)

Additional details

Related works

Compiles: 10.5281/zenodo.15154519 (DOI)
Is derived from: Dataset: https://digitallibrary.un.org/ (URL)

Software

Repository URL: https://codeberg.org/SeanFobbe/cr-unsc
Programming language: R
Development Status: Active