Source Code for the 'Corpus of Decisions: International Court of Justice' (CD-ICJ-Source)
Description
Overview
This code in the R Programming Language downloads and processes the full set of decisions and appended opinions rendered by the International Court of Justice (ICJ) as published on its website into a rich and structured human- and machine-readable data set. It is the basis of the Corpus of Decisions: International Court of Justice (CD-ICJ).
All data sets created with this script will be hosted permanently at Zenodo, the scientific repository of CERN, where they are open access and freely available. Each version is uniquely identified by a persistent Digital Object Identifier (DOI), the Version DOI. The newest version of the data set will always be available via the Concept DOI: https://doi.org/10.5281/zenodo.3826444
Citation
A peer-reviewed academic paper describing the construction and relevance of the data set entitled 'Introducing Twin Corpora of Decisions for the International Court of Justice (ICJ) and the Permanent Court of International Justice (PCIJ)' was published open access in the Journal of Empirical Legal Studies (JELS). It is also available in print at JELS 2022, Vol. 19, No. 2, pp. 491-524.
If you use the data set for academic work, please cite both the JELS paper and the precise version of the data set you used for your analysis.
New in Version 2023-05-07
- Full recompilation of data set
- Entire computational environment now version-controlled with Docker
- Scope extends up to case number 187: Obligations of States in respect of climate change (Advisory Opinion)
- Upgrade Tesseract OCR to version 5.3.1
- Upgrade OCR training data to "tesseract_best"
- Simplified config file
- Simplified function loading
- Ensure that debug mode only processes cases once
- Fix download manifest
- Update download function
- Contents of source ZIP file linked to Git manifest
Updates
The CD-ICJ will be updated twice a year, ideally every six months. In case of serious errors, an update will be provided at the earliest opportunity and a highlighted advisory will be issued on the Zenodo page of the current version. Minor errors will be documented in the GitHub issue tracker and fixed with the next scheduled release.
The CD-ICJ is versioned according to the day the data was acquired from the website of the Court, in the ISO format YYYY-MM-DD. Its initial release version was 2021-11-23.
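Because the version is an ISO date, it can be read back from a file name and compared as plain text. The following is a minimal sketch, assuming file names embed the version the way the compilation report of this release does (CD-ICJ_2023-05-07_CompilationReport.pdf). The helper name extract_version is hypothetical and not part of the project.

```shell
# Hypothetical helper: print the first YYYY-MM-DD date found in a file name.
extract_version() {
    printf '%s\n' "$1" | grep -Eo '[0-9]{4}-[0-9]{2}-[0-9]{2}' | head -n 1
}

# Example with the compilation report of this release.
extract_version "CD-ICJ_2023-05-07_CompilationReport.pdf"
```

ISO dates sort chronologically when sorted as text, so piping several extracted versions through sort and taking the last line yields the most recent one.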
Notifications regarding new and updated data sets will be published on my academic website at www.seanfobbe.com or via Mastodon at @seanfobbe@fediscience.org
Functionality
This code will produce 21 ZIP archives:
- 2 archives of CSV files containing the full machine-readable data set (English/French)
- 2 archives of CSV files containing the full machine-readable metadata (English/French)
- 2 archives of TXT files containing all machine-readable texts with a reduced set of metadata encoded in the filenames (English/French)
- 2 archives of PDF files containing all human-readable texts with enhanced OCR (English/French)
- 2 archives of PDF files containing all human-readable majority opinions with enhanced OCR (English/French)
- 2 archives of PDF files containing monolingual documents dated 2004 or earlier with enhanced OCR (English/French)
- 2 archives of PDF files as originally published by the ICJ (English/French)
- 2 archives of TXT files containing text as generated by Tesseract for documents dated 2004 or earlier (English/French)
- 2 archives of TXT files containing extracted text from the original documents (English/French)
- 1 archive of PDF files that were unlabelled on the website (intended for replication and review only)
- 1 archive of analysis data and diagrams
- 1 archive containing all source files
The integrity and veracity of each ZIP archive is documented with cryptographically secure hashes (SHA2-256 and SHA3-512). The hashes are stored in a separate CSV file created during the data set compilation process.
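To check a downloaded archive, you can recompute both hashes locally and compare them with the values recorded in the CSV file. The following is a minimal sketch, not the project's own code: it assumes GNU coreutils (sha256sum) and OpenSSL 1.1.1 or later (for SHA-3) are installed, and the helper name hash_file is hypothetical.

```shell
# Hypothetical helper: print the SHA2-256 and SHA3-512 hashes of one file.
# Assumes sha256sum (GNU coreutils) and OpenSSL 1.1.1+ (SHA-3 support).
hash_file() {
    file="$1"
    # Both tools print "<hash> <filename>"; keep only the hash.
    printf 'sha2-256: %s\n' "$(sha256sum "$file" | cut -d ' ' -f 1)"
    printf 'sha3-512: %s\n' "$(openssl dgst -sha3-512 -r "$file" | cut -d ' ' -f 1)"
}
```

Call it as hash_file path/to/archive.zip and compare the two hexadecimal strings with the corresponding entries in the hash CSV. Any difference means the file was altered or corrupted in transit.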
System Requirements
- Docker
- Docker Compose
- 25 GB of free disk space
- Parallelization will automatically be customized to your machine by detecting the maximum number of cores
- A full run of this script takes approximately 11 hours on a machine with a Ryzen 3700X CPU using 16 threads, 64 GB DDR4 RAM and a fast SSD
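The pipeline detects the core count on its own, so no configuration is needed. If you want to see beforehand how many cores your machine reports, a shell sketch like the following may help. It is an illustration only and not the detection code used by the project; the helper name detect_cores is hypothetical.

```shell
# Hypothetical helper: print the number of CPU cores available to this shell.
detect_cores() {
    if command -v nproc >/dev/null 2>&1; then
        nproc                        # GNU coreutils
    else
        getconf _NPROCESSORS_ONLN    # fallback where nproc is missing (e.g. macOS)
    fi
}

echo "Cores available: $(detect_cores)"
```

Note that Docker may be configured to expose fewer cores to a container than the host reports.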
Instructions
Step 1: Prepare Folder
Copy the full source code to an empty folder, for example by executing:
$ git clone https://github.com/seanfobbe/cd-icj
Always use a dedicated and empty (!) folder for compiling the data set. The scripts will automatically delete all PDF and TXT files, as well as files of many other types, in their working directory to ensure a clean run.
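Since stray files in the working directory would be deleted, it can be worth confirming that the target folder really is empty before cloning into it. The following is a minimal sketch; the helper name is_empty_dir and the folder name cd-icj-build are hypothetical.

```shell
# Hypothetical helper: succeed only if the directory exists and has no
# entries at all (hidden files count as entries).
is_empty_dir() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Example usage (folder name is an assumption):
#   mkdir -p cd-icj-build
#   is_empty_dir cd-icj-build && git clone https://github.com/seanfobbe/cd-icj cd-icj-build
```

git clone itself refuses to clone into an existing non-empty directory, so the explicit check mainly serves as a reminder not to copy unrelated files into the folder afterwards.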
Step 2: Create Docker Image
The Dockerfile contains automated instructions to create a full operating system with all necessary dependencies. To create the image from the Dockerfile, please execute:
$ bash docker-build-image.sh
Step 3: Compile Dataset
If you have previously compiled the data set, whether successfully or not, you can delete all output and temporary files by executing:
$ Rscript delete_all_data.R
You can compile the full data set by executing:
$ bash docker-run-project.sh
Results
The data set and all associated files are now saved in your working directory.
Academic Publications (Fobbe)
Website — www.seanfobbe.com
Open Data — zenodo.org/communities/sean-fobbe-data
Code Repository — zenodo.org/communities/sean-fobbe-code
Regular Publications — zenodo.org/communities/sean-fobbe-publications
Contact
Did you discover any errors? Do you have suggestions on how to improve the data set? You can either post these to the Issue Tracker on GitHub or write me an e-mail at fobbe-data@posteo.de
Files
CD-ICJ_2023-05-07_CompilationReport.pdf
Additional details
Related works
- Cites
- https://www.icj-cij.org (URL)
- Software: https://github.com/SeanFobbe/cd-icj (URL)
- Compiles
- Dataset: 10.5281/zenodo.7876286 (DOI)