Source Code for the 'Corpus of Decisions: International Court of Justice' (CD-ICJ-Source)

Fobbe, Sean

doi:10.5281/zenodo.7876287

Published May 9, 2023 | Version 2023-05-07

Software Open

Fobbe, Sean¹

1. Ludwig Maximilian University of Munich

Overview

This code, written in the R programming language, downloads and processes the full set of decisions and appended opinions rendered by the International Court of Justice (ICJ), as published on its website, into a rich and structured human- and machine-readable data set. It is the basis of the Corpus of Decisions: International Court of Justice (CD-ICJ).

All data sets created with this script are hosted permanently, open access and free of charge, at Zenodo, the scientific repository of CERN. Each version is uniquely identified by a persistent Digital Object Identifier (DOI), the Version DOI. The newest version of the data set will always be available via the link of the Concept DOI: https://doi.org/10.5281/zenodo.3826444

Citation

A peer-reviewed academic paper describing the construction and relevance of the data set, entitled 'Introducing Twin Corpora of Decisions for the International Court of Justice (ICJ) and the Permanent Court of International Justice (PCIJ)', was published open access in the Journal of Empirical Legal Studies (JELS). It is also available in print at JELS 2022, Vol. 19, No. 2, pp. 491-524.

If you use the data set for academic work, please cite both the JELS paper and the precise version of the data set you used for your analysis.

New in Version 2023-05-07

Full recompilation of data set
Entire computational environment now version-controlled with Docker
Scope extends up to case number 187: Obligations of States in respect of climate change (Advisory Opinion)
Upgrade Tesseract OCR to version 5.3.1
Upgrade OCR training data to "tesseract_best"
Simplified config file
Simplified function loading
Ensure that debug mode only processes cases once
Fix download manifest
Update download function
Contents of source ZIP file linked to Git manifest

Updates

The CD-ICJ will be updated twice per year, ideally every six months. In case of serious errors, an update will be provided at the earliest opportunity and a highlighted advisory will be issued on the Zenodo page of the current version. Minor errors will be documented in the GitHub issue tracker and fixed with the next scheduled release.

The CD-ICJ is versioned according to the day the data was acquired from the website of the Court, in the ISO format YYYY-MM-DD. Its initial release version was 2021-11-23.
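Because versions are ISO dates, they can be parsed and compared as ordinary dates. The following Python sketch illustrates this under the assumption that file names follow the pattern seen in this record's files (CD-ICJ_YYYY-MM-DD_...). The helper name version_from_filename is hypothetical and is not part of the project's code.

```python
import re
from datetime import date

# Assumed pattern: "CD-ICJ_" followed by an ISO date (YYYY-MM-DD) and "_".
_VERSION_RE = re.compile(r"^CD-ICJ_(\d{4}-\d{2}-\d{2})_")


def version_from_filename(filename: str) -> date:
    """Extract the version date from a CD-ICJ file name (hypothetical helper)."""
    match = _VERSION_RE.match(filename)
    if match is None:
        raise ValueError(f"no CD-ICJ version found in {filename!r}")
    # date.fromisoformat rejects impossible dates such as 2023-02-30.
    return date.fromisoformat(match.group(1))


current = version_from_filename("CD-ICJ_2023-05-07_Source_Files.zip")
initial = date.fromisoformat("2021-11-23")
print(current.isoformat(), current > initial)
```

Comparing date objects rather than raw strings also catches malformed version strings early.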

Notifications regarding new and updated data sets will be published on my academic website at www.seanfobbe.com or via Mastodon at @seanfobbe@fediscience.org

Functionality

This code will produce 21 ZIP archives:

2 archives of CSV files containing the full machine-readable data set (English/French)
2 archives of CSV files containing the full machine-readable metadata (English/French)
2 archives of TXT files containing all machine-readable texts with a reduced set of metadata encoded in the filenames (English/French)
2 archives of PDF files containing all human-readable texts with enhanced OCR (English/French)
2 archives of PDF files containing all human-readable majority opinions with enhanced OCR (English/French)
2 archives of PDF files of documents dated 2004 and earlier containing monolingual documents with enhanced OCR (English/French)
2 archives of PDF files as originally published by the ICJ (English/French)
2 archives of TXT files containing text as generated by Tesseract for documents dated 2004 or earlier (English/French)
2 archives of TXT files containing extracted text from the original documents (English/French)
1 archive of PDF files that were unlabelled on the website (intended for replication and review only)
1 archive of analysis data and diagrams
1 archive containing all source files

The integrity and veracity of each ZIP archive are documented with cryptographically secure hash signatures (SHA2-256 and SHA3-512). Hashes are stored in a separate CSV file created during the data set compilation process.
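A downloaded archive can be checked against the published hashes with any tool that supports both algorithms. The following is a minimal Python sketch using only the standard library. The function names file_hashes and matches are hypothetical, and the layout of the CSV file is not assumed here: the expected values are simply passed in as strings.

```python
import hashlib
from pathlib import Path


def file_hashes(path: Path, chunk_size: int = 1 << 20) -> dict:
    """Return SHA2-256 and SHA3-512 hex digests of a file, read in chunks."""
    sha2 = hashlib.sha256()
    sha3 = hashlib.sha3_512()
    with path.open("rb") as handle:
        # Reading in chunks keeps memory use flat for large ZIP archives.
        while chunk := handle.read(chunk_size):
            sha2.update(chunk)
            sha3.update(chunk)
    return {"sha2_256": sha2.hexdigest(), "sha3_512": sha3.hexdigest()}


def matches(path: Path, expected_sha2_256: str, expected_sha3_512: str) -> bool:
    """Compare a file against expected hex digests (case-insensitive)."""
    actual = file_hashes(path)
    return (
        actual["sha2_256"] == expected_sha2_256.strip().lower()
        and actual["sha3_512"] == expected_sha3_512.strip().lower()
    )
```

An archive should be treated as intact only if both digests match the published values.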

System Requirements

Docker
Docker Compose
25 GB disk space on hard drive
Parallelization will automatically be customized to your machine by detecting the maximum number of cores
A full run of this script takes approximately 11 hours on a machine with a Ryzen 3700X CPU using 16 threads, 64 GB DDR4 RAM and a fast SSD
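The core detection mentioned above is performed by the project's R code. The following Python sketch only illustrates the general idea with the standard library; the function name worker_count and the reserve parameter are hypothetical.

```python
import os


def worker_count(reserve: int = 0) -> int:
    """Detected logical cores minus `reserve`, but never fewer than 1."""
    detected = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, detected - reserve)


print(worker_count())
```

Reserving one or two cores can keep a machine responsive during a long run, at the cost of a somewhat longer compile time.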

Instructions

Step 1: Prepare Folder

Copy the full source code to an empty folder, for example by executing:

$ git clone https://github.com/seanfobbe/cd-icj

Always use a dedicated and empty (!) folder for compiling the data set. The scripts will automatically delete all PDF, TXT and many other file types in their working directory to ensure a clean run.
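Since the scripts delete files in their working directory, a quick check before cloning can prevent accidental data loss. The following Python sketch shows one way to do this; the helper is_safe_target is hypothetical and not part of the project.

```python
from pathlib import Path


def is_safe_target(folder: Path) -> bool:
    """True if `folder` does not exist yet or is an existing empty directory."""
    if not folder.exists():
        return True
    # any() stops at the first entry, so large folders are not fully listed.
    return folder.is_dir() and not any(folder.iterdir())
```

For example, is_safe_target(Path("cd-icj")) should return True before the repository is cloned into a folder of that name.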

Step 2: Create Docker Image

The Dockerfile contains automated instructions to create a full operating system with all necessary dependencies. To create the image from the Dockerfile, please execute:

$ bash docker-build-image.sh

Step 3: Compile Dataset

If you have previously compiled the data set, whether successfully or not, you can delete all output and temporary files by executing:

$ Rscript delete_all_data.R

You can compile the full data set by executing:

$ bash docker-run-project.sh

Results

The data set and all associated files are now saved in your working directory.

Academic Publications (Fobbe)

Website — www.seanfobbe.com

Open Data — zenodo.org/communities/sean-fobbe-data

Code Repository — zenodo.org/communities/sean-fobbe-code

Regular Publications — zenodo.org/communities/sean-fobbe-publications

Contact

Did you discover any errors? Do you have suggestions on how to improve the data set? You can either post these to the Issue Tracker on GitHub or write me an e-mail at fobbe-data@posteo.de

Notes

Open Data Impact Award 2022 (Stifterverband für die deutsche Wissenschaft)

Files

Files (4.1 MB)

Name	MD5	Size
CD-ICJ_2023-05-07_CompilationReport.pdf	4d5165f6f83a2391ffe74234d5f05cbc	1.3 MB
CD-ICJ_2023-05-07_CryptographicSignatures.zip	25702fc289be662090b0578fb58f04fa	7.7 kB
CD-ICJ_2023-05-07_Source_Files.zip	b19e56bb233b9ab13bd7ebb71b69a519	277.8 kB
CD-ICJ_2023-05-07_UnlabelledFiles.zip	4d82789d2c44498e7a2f21c4462996cb	2.5 MB

Additional details

Cites: https://www.icj-cij.org (URL)
Software: https://github.com/SeanFobbe/cd-icj (URL)
Compiles: Dataset: 10.5281/zenodo.7876286 (DOI)
