There is a newer version of the record available.

Published July 11, 2025 | Version v1
Dataset Open

EU Public Consultations and Feedback Portal - Dataset of all public input data to EU legislation, attached files and extraction of textual responses (until May 2025)

  • 1. University of Zurich

Description

Purpose of repository

This dataset contains all data (to the best of the author's knowledge) on public input to EU legislation that was publicly available through the EU Public Consultations and Feedback Portal (https://ec.europa.eu/info/law/better-regulation) as of May 2025. Through the portal, the general public of the EU is invited to give input on proposed regulation, via a feedback form field and/or by attaching a file, as part of a specific consultation process initiated by the EU.

The data was downloaded from the portal with the Python package eu_consultations, developed specifically for this purpose. The package downloads metadata on consultations, downloads attached files, extracts text from the files and stores all information in a specific data format.

The aim of this dataset is to facilitate academic analysis of how the public participates in EU public consultations.

 🚧 This is version v0.1.0 of the dataset. A future version will include a more accessible format for the textual data, as well as the results of a translation process for all textual data that is currently running. 🚧

Repository structure

The repository contains the raw data as scraped, plus the results of running docling on all documents that could be processed (mainly .pdf and .docx files) to extract their textual content.

It also contains the exact code that was run for scraping, to document the procedure.

Data

The downloaded raw data from scraping is in a zipped file named `v0_1_0_bytopic.zip`. 

The data is organized in folders by topic (the EU assigns every consultation to one of 38 overarching topics).
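
Because the archive is large, it can be useful to inspect its folder layout before unpacking it. The following is a minimal sketch that lists the top-level folder names of a zip file without extracting it; the helper name `top_level_folders` is illustrative, and whether the topic folders sit directly at the top level of the archive is an assumption to check against the output.

```python
import zipfile


def top_level_folders(zip_path):
    """List the distinct top-level folder names in a zip archive, without extracting it."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
    # Keep only entries that live inside a folder and take the first path component.
    return sorted({name.split("/", 1)[0] for name in names if "/" in name})
```

Calling `top_level_folders("v0_1_0_bytopic.zip")` only reads the archive's index, so it should stay fast even for a file of this size.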

Every <topic> folder contains a subfolder labelled with the date (YYYY-MM-DD-Minute-Second) on which the scraping process for the topic was initiated. This subfolder contains:

  • `consultation_data.json`: metadata about all initiatives and consultations grouped below them in the topic
  • `consultations_with_downloads.json`: the initiative data, with paths into the folder `files/` for all files that were downloaded for each consultation
  • `consultations_with_extracted`: a very large JSON file, which also contains the extracted text for every file in every attachment for every initiative

The benefit of these files is that they can be directly loaded with the eu_consultations Python package, using read_initiatives_from_json(), which reads the data in a validated format.
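
For a quick look without the package, the per-topic JSON files can also be read with the standard library. This is a minimal sketch, assuming the archive was unpacked into a local directory with the layout <root>/<topic>/<scrape-date>/; the function name `load_topic_metadata` is hypothetical, and nothing is assumed about the structure of the loaded JSON. Unlike read_initiatives_from_json(), this performs no validation.

```python
import json
from pathlib import Path


def load_topic_metadata(root, topic, filename="consultation_data.json"):
    """Load one JSON file from the last scrape subfolder of a topic.

    Assumes the layout <root>/<topic>/<scrape-date>/<filename>; adjust the
    path handling if the unpacked archive is nested differently.
    """
    topic_dir = Path(root) / topic
    # Subfolders are named by scrape date; a name sort is used as an
    # approximation of chronological order.
    runs = sorted(p for p in topic_dir.iterdir() if p.is_dir())
    if not runs:
        raise FileNotFoundError(f"no scrape subfolder found in {topic_dir}")
    with open(runs[-1] / filename, encoding="utf-8") as fh:
        return json.load(fh)
```

The `filename` parameter can be pointed at `consultations_with_downloads.json` instead; for the file with extracted text, memory use may become a concern because the whole file is parsed at once.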

However, all processed files are also contained in a `files/` folder. Files in `files/` follow the naming convention <fileid>_<filename>.<extension>, and for each processed file the conversion to lossless docling JSON is stored in `files/docling/<fileid>.json`.

In addition, `files/consultations` includes the metadata for every consultation within the topic separately, to facilitate working with single consultations.

The folder `logs` contains the logs created during scraping of the data.
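
The naming convention makes it possible to go from a downloaded file to its docling JSON. A minimal sketch, assuming that <fileid> itself contains no underscore (so the id is everything before the first underscore); the helper name `docling_json_for` and the example file name are made up for illustration.

```python
from pathlib import Path


def docling_json_for(file_path):
    """Return the expected docling JSON path for a file stored in `files/`.

    Assumes the file is named <fileid>_<filename>.<extension> and that
    <fileid> contains no underscore.
    """
    file_path = Path(file_path)
    file_id, sep, _rest = file_path.name.partition("_")
    if not sep or not file_id:
        raise ValueError(f"unexpected file name: {file_path.name}")
    return file_path.parent / "docling" / f"{file_id}.json"
```

For example, `docling_json_for("files/12345_position_paper.pdf")` yields the path `files/docling/12345.json`. The function only builds the path; it does not check that the JSON exists, which matters because not every file could be processed.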

Code

The code used for scraping the data is contained in a zipped file named `src.zip`. Most importantly, it contains:

  • a `pyproject.toml` file to replicate the procedure using `uv`
  • a script `scrape_parallel.sh`, which was used to scrape topic by topic in parallel: it first calls a script that checks which topics are already finished and then runs the scraping script `data_gathering/scrape_topic.py` to scrape a specific topic.

Files

README.md

Files (37.8 GB)

  • md5:205286a62b53c5424d398a5b7b851850 (3.6 kB)
  • md5:e2c85d9af3ce321a6ba980d57443c46d (2.8 kB)
  • md5:2a7050808695ee7b4f6eef4a4a96c590 (37.8 GB)
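
The md5 checksums listed above can be used to verify a download, which is worthwhile for a file of 37.8 GB. A minimal sketch using only the standard library; the helper names `md5_of_file` and `matches_checksum` are illustrative, and the file is read in chunks so that it never has to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.strip().lower().removeprefix("md5:")
    return md5_of_file(path) == expected
```

A call such as `matches_checksum("v0_1_0_bytopic.zip", "md5:2a7050808695ee7b4f6eef4a4a96c590")` returns `True` only if the local file is byte-identical to the published one. Note that `str.removeprefix` requires Python 3.9 or later.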