Overcoming Language Barriers: Multilingual Analysis of the 2023 Swiss Privacy Law's Impact
Contributors
Contact person:
Data manager (2):
Description
This repository contains the complete research pipeline for the automated analysis of privacy policies in a multilingual setting (validated for English, German, French, and Italian). The project applies this pipeline to study (1) the 2023 revision of Swiss privacy law and (2) the use of automated policy generators.
Repository Structure
This repository is organized into three main subdirectories, each serving a specific purpose in the research pipeline:
1. Analysis
The Analysis directory contains most statistical analyses and data processing scripts.
Key Components:
- Data preprocessing (including text cleaning and removal of personal identifiers with Presidio)
- Creation of the three relevant groups for the analysis (CH, CH & EU, and EU)
- Main statistical and semantic analysis scripts presented in the paper
- Analysis of policy clusters based on generator use
2. Data
The Data directory contains the relevant datasets and corpora used throughout the paper.
Key Components:
- Annotated datasets:
  1. LLM-annotated full original dataset after removal of personal identifiers with Presidio ("swiss-gdpr_annotated.parquet")
  2. Final annotated and grouped dataset used for all statistical analyses ("swiss-gdpr_annotated_groups.parquet")
  3. Log of the LLM annotations ("run.log")
- CrUX dataset used for website popularity rankings as well as the website budgeting list used to scrape the initial dataset
- Embeddings of the policies used for the cluster analysis
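As a starting point for working with the annotated datasets listed above, the following minimal sketch locates one of the two parquet files inside an extracted copy of the archive and loads it with pandas. The file names come from the list above; the helper names (`find_dataset`, `load_dataset`), the short keys, and the recursive search are assumptions made for illustration, since the exact folder layout inside the archive is not documented here. Loading requires pandas and a parquet engine such as pyarrow.

```python
from pathlib import Path

# File names as listed in this README. Their exact location inside the
# extracted archive is an assumption, so we search for them recursively.
DATASETS = {
    "annotated": "swiss-gdpr_annotated.parquet",
    "grouped": "swiss-gdpr_annotated_groups.parquet",
}


def find_dataset(root, key):
    """Return the first path under `root` that matches the dataset's file name."""
    filename = DATASETS[key]
    matches = sorted(Path(root).rglob(filename))
    if not matches:
        raise FileNotFoundError(f"{filename} not found under {root}")
    return matches[0]


def load_dataset(root, key="grouped"):
    """Load one of the annotated datasets as a pandas DataFrame."""
    # Imported lazily so that locating files works without pandas installed.
    import pandas as pd

    return pd.read_parquet(find_dataset(root, key))
```

For example, if the archive was extracted into a folder named `Artifacts` (a hypothetical path), `load_dataset("Artifacts", "grouped")` would return the grouped dataset, and `df.columns` can then be inspected to see which annotation fields are available.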
3. LLM
The LLM directory contains all files related to the LLM-based data analysis.
Key Components:
- The codebooks, human annotations (reference benchmark), and evaluations for all three initial annotation phases ("Annotations")
- The validation of the models against the final set of human annotations ("Validation")
- The scripts for the large-scale policy evaluation using OpenAI's GPT-5 ("Evaluation")
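To illustrate what validating model output against human annotations can look like, the sketch below computes Cohen's kappa, a chance-corrected agreement score, between two label sequences. This is one common choice of metric and is shown only as an assumption: the metrics actually reported are documented in the paper and in the "Validation" folder, and the function name and example labels here are hypothetical.

```python
from collections import Counter


def cohen_kappa(human, model):
    """Chance-corrected agreement between two equally long label sequences."""
    if len(human) != len(model) or not human:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(human)
    # Share of items on which both annotators chose the same label.
    observed = sum(h == m for h, m in zip(human, model)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    human_freq = Counter(human)
    model_freq = Counter(model)
    expected = sum(human_freq[label] * model_freq[label] for label in human_freq) / (n * n)
    if expected == 1.0:
        # Both sequences use one identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four policies.
human_labels = ["yes", "yes", "no", "no"]
model_labels = ["yes", "no", "no", "no"]
print(cohen_kappa(human_labels, model_labels))  # 0.5
```

In this toy example the two sequences agree on three of four items (observed agreement 0.75) while chance agreement is 0.5, which gives a kappa of 0.5.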
Citation
If you use this work in your research, please cite it as:
```
Accepted at PETS '26 [Citation details to be added upon publication]
```
We kindly ask you to cite the paper and not the dataset itself. Please find a more detailed list of funding sources in the paper's Acknowledgments section.
License
This work and its artifacts are licensed under the Creative Commons Attribution 4.0 International license (CC BY 4.0).
Contact
For questions about this research, please contact Luka Nenadic at lnenadic@ethz.ch.
Files
Artifacts.zip
Additional details
Funding
- Swiss National Science Foundation
  - Digital Market Regulation: Monitoring Enforcement and Evaluating Outcomes (10002634)
- Ministerio de Ciencia, Innovación y Universidades
  - Re-InitS (PID2024-155230OB-C43)
Dates
- Accepted: 2026-06-01 (PETS '26)