Reproducibility data for a study of regulatory statements in EU legislation

Kody Moodley; Gijs Jan Brandsma; Jens Blom-Hansen; Christiaan Meijer

doi:10.5281/zenodo.12760951

Published July 17, 2024 | Version v2

Dataset Open

1. Netherlands eScience Center
2. Radboud University
3. Aarhus University

Reproducibility data for a quantitative study on EU legislation

The files in this repository were generated or used in a pipeline of analysis operations on EU legislation published between 1971 and 2022. The project, called the Nature of EU Rules, seeks to analyse the "strictness" and density of EU regulations over time and by legal policy area. The data has been made available so that other researchers can reproduce the results of our study. The underlying data used in the study has also been published in this repository.

File descriptions

complete_training_data.csv
- This file contains training data for the binary classification of sentences in EU legislation as either regulatory (constituting a legal obligation for some agent) or not regulatory (called a constitutive statement). The sentences were labelled by EU law professors from Aarhus University in Denmark and Radboud University in the Netherlands.
- Note: The file also contains columns identifying the specific agent being regulated (the agent to which the legal obligation applies) in each sentence. However, this information was not used in the study.
extracted_sentences_classified_1971_2022.csv
- List of sentences extracted from EU legislation documents
- Classification results indicating whether each sentence is regulatory or not. Two columns record the results: one for a rule-based approach (using grammatical dependency parsing) and one for a LegalBERT classification approach.
inlegal_bert_xgboost_classifier.json
- Trained binary classification model for classifying sentences as regulatory or not (based on InlegalBERT).
- Note: This model was trained on the file 'complete_training_data.csv' in this Zenodo repository.
- Model was trained using this script and used by these scripts: one, two
metadata_enriched.csv
- Metadata file from this repository, enriched with additional columns, one of which is the count of regulatory sentences in each document.
- File is generated by this script
- File is used by this script
classification_results_all_algorithms_test_set.csv
- Classification results for each sentence in the test set, which contains 1451 sentences (20% of the training set), according to both the fine-tuned Legal-BERT model and the dependency parsing (rule-based) algorithm.
- Also contains the ground truth labels.
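
As a minimal sketch of how the test-set file could be used to compare the two approaches, the Python below computes the share of sentences on which a classifier's prediction agrees with the ground truth label. The function names are illustrative, and the column names in the usage comment (`ground_truth`, `legalbert_prediction`, `rule_based_prediction`) are hypothetical placeholders, not taken from the file. Check the CSV header and substitute the actual names.

```python
import csv
from typing import Dict, Iterable, List, Mapping


def load_rows(path: str) -> List[Dict[str, str]]:
    """Read a CSV file into a list of dictionaries keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def accuracy(rows: Iterable[Mapping[str, str]], truth_col: str, pred_col: str) -> float:
    """Return the fraction of rows whose prediction equals the ground truth."""
    total = 0
    correct = 0
    for row in rows:
        total += 1
        # Compare as stripped strings so stray whitespace does not matter.
        if row[pred_col].strip() == row[truth_col].strip():
            correct += 1
    if total == 0:
        raise ValueError("no rows to evaluate")
    return correct / total


# Example usage (column names are assumptions; inspect the header first):
# rows = load_rows("classification_results_all_algorithms_test_set.csv")
# print(accuracy(rows, "ground_truth", "legalbert_prediction"))
# print(accuracy(rows, "ground_truth", "rule_based_prediction"))
```

Accuracy is only one possible summary. If the two classes are imbalanced, precision and recall for the regulatory class may be more informative.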

GitHub repositories relevant to this analysis

The Python scripts in the following GitHub repositories were responsible for generating the data files in this Zenodo repository. The first repository listed is the core one for running the pipeline to classify and quantitatively analyse legal obligations in EU legislation. The other listed GitHub repositories represent components or steps of the pipeline.

http://github.com/nature-of-eu-rules/eu-legislation-strictness-analysis

Files

Files (244.7 MB)

Name	MD5	Size
classification_results_all_algorithms_test_set.csv	42ad8ab6d8ba8801c418ee6c124fdd3d	441.4 kB
complete_training_data.csv	346b1d6ce15623a2e71c1fa2acc9511c	2.5 MB
extracted_sentences_classified_1971_2022.csv	01e90d77e3a61e54dca3cee4c491d4b1	178.3 MB
inlegal_bert_xgboost_classifier.json	9b8be847c35c33f130b4f3a9331ec59d	956.1 kB
metadata_enriched.csv	ef9837ad2df996cf0c1c6ea97adfb858	62.5 MB
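
The MD5 checksums in the file listing above can be used to confirm that a download is complete and uncorrupted. The following is a minimal sketch using only the Python standard library. The function names `md5_of` and `verify_downloads` are illustrative, and the sketch assumes the files were saved under their original names in a single directory.

```python
import hashlib
from pathlib import Path
from typing import Dict

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "classification_results_all_algorithms_test_set.csv": "42ad8ab6d8ba8801c418ee6c124fdd3d",
    "complete_training_data.csv": "346b1d6ce15623a2e71c1fa2acc9511c",
    "extracted_sentences_classified_1971_2022.csv": "01e90d77e3a61e54dca3cee4c491d4b1",
    "inlegal_bert_xgboost_classifier.json": "9b8be847c35c33f130b4f3a9331ec59d",
    "metadata_enriched.csv": "ef9837ad2df996cf0c1c6ea97adfb858",
}


def md5_of(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reading keeps memory use low for the larger CSV files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory) -> Dict[str, bool]:
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        candidate = Path(directory) / name
        results[name] = candidate.is_file() and md5_of(candidate) == expected
    return results


# Example usage (assumes the files were downloaded to ./data):
# for name, ok in verify_downloads("data").items():
#     print(name, "OK" if ok else "MISSING OR MISMATCH")
```

A file reported as a mismatch is most likely a truncated download or a file from a different version of the record, so downloading it again from this version is the first thing to try.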

	All versions	This version
Views	1,057	285
Downloads	2,263	926
Data volume	182.9 GB	85.7 GB
