Published June 2, 2026 | Version v1

Overcoming Language Barriers: Multilingual Analysis of the 2023 Swiss Privacy Law's Impact

  • 1. ETH Zurich

Contributors

Contact person:

  • 1. ETH Zurich
  • 2. Information Processing and Telecommunications Center, Universidad Politécnica de Madrid
  • 3. Carnegie Mellon University

Description

This repository contains the complete research pipeline for automatically analyzing privacy policies in a multilingual setting (validated for English, German, French, and Italian). The project operationalizes this pipeline in the context of (1) a revision in Swiss privacy law and (2) the use of automated policy generators.

Repository Structure

This repository is organized into three main subdirectories, each serving a specific purpose in the research pipeline:

1. Analysis

The Analysis directory contains most statistical analyses and data processing scripts.

Key Components:
- Data preprocessing (including text cleaning and removal of personal identifiers with Presidio)
- Creation of the three relevant groups for the analysis (CH, CH & EU, and EU)
- Main statistical and semantic analysis scripts presented in the paper
- Analysis of policy clusters based on generator use

2. Data

The Data directory contains the relevant datasets and corpora used throughout the paper.

Key Components:
- Annotated datasets:
  1. LLM-annotated full original dataset after removal of personal identifiers with Presidio ("swiss-gdpr_annotated.parquet")
  2. Final annotated and grouped dataset used for all statistical analyses ("swiss-gdpr_annotated_groups.parquet")
  3. Log of the LLM annotations ("run.log")
- CrUX dataset used for website popularity rankings as well as the website budgeting list used to scrape the initial dataset
- Embeddings of the policies used for the cluster analysis

3. LLM

The LLM directory contains all files related to the LLM-based data analysis.

Key Components:
- The codebooks, human annotations (reference benchmark), and evaluations for all three initial annotation phases ("Annotations")
- The validation of the models against the final set of human annotations ("Validation")
- The scripts for the large-scale policy evaluation using OpenAI's GPT-5 ("Evaluation")

Citation

If you use this work in your research, please cite it as:

```
Accepted at PETS '26 [Citation details to be added upon publication]

``` 
We kindly ask you to cite the paper and not the dataset itself. Please find a more detailed list of funding sources in the paper's Acknowledgments section.

License

This work and its artifacts are licensed under a CC-BY 4.0 license.

Contact

For questions about this research, please contact Luka Nenadic at lnenadic@ethz.ch.

Notes

Artifact Appendix

Paper title: Overcoming Language Barriers: Multilingual Analysis of the 2023 Swiss Privacy Law's Impact

Requested Badge(s):
  - [X] Available
  - [ ] Functional
  - [ ] Reproduced

Description

This is the artifact repository for the paper "Overcoming Language Barriers: Multilingual Analysis of the 2023 Swiss Privacy Law's Impact" published at PETS 2026.4. The authors are Luka Nenadic (ETH Zurich), David Rodriguez (Information Processing and Telecommunications Center, Universidad Politécnica de Madrid), and Joseph Calandrino (Carnegie Mellon University).

The repository contains all datasets (after the removal of personal identifiers with Presidio) as well as the statistical and LLM-based analyses used for the paper.

Security/Privacy Issues and Ethical Concerns

Given that we (1) have removed personal identifiers with Presidio and (2) do not use any artifacts that disable security mechanisms, run vulnerable code, or collect user data, our artifacts pose only minimal security and privacy risks. Nonetheless, Presidio has likely not removed every personal identifier from the publicly accessible privacy policies we use in our paper.

Environment

The entire artifact repository is hosted on Zenodo (DOI: 10.5281/zenodo.20512192) and can be downloaded as a .ZIP file. We explain the repository's structure above.
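Once the archive is downloaded, its layout can be checked against the three directories described under Repository Structure. The following is a minimal sketch using only Python's standard library. It assumes the download is saved as `Artifacts.zip` in the working directory and that `Analysis`, `Data`, and `LLM` sit at the top level of the archive; if the archive wraps them in a single root folder, the check will report them as missing. The function names are illustrative and not part of the artifact.

```python
import zipfile
from pathlib import Path, PurePosixPath

# Directory names taken from the Repository Structure section.
# Assumption: they appear at the top level of the archive.
EXPECTED_DIRECTORIES = ("Analysis", "Data", "LLM")


def top_level_entries(archive_path):
    """Return the sorted top-level names stored in a ZIP archive."""
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()
    # ZIP member names always use forward slashes, hence PurePosixPath.
    return sorted({PurePosixPath(name).parts[0] for name in names if name.strip("/")})


def missing_directories(entries, expected=EXPECTED_DIRECTORIES):
    """Return the expected directory names that are absent from the entries."""
    present = set(entries)
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    archive = Path("Artifacts.zip")  # assumed download location
    if archive.exists():
        entries = top_level_entries(archive)
        print("Top-level entries:", entries)
        print("Missing directories:", missing_directories(entries))
    else:
        print(f"{archive} not found; download it from the record first")
```

Listing the members reads only the archive's central directory, so the check is quick even for a 2.1 GB file and does not require extracting anything.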

All analyses can be performed on a personal computer, except for the LLM-based analyses, for which we comprehensively describe how we used OpenAI's API. The statistical analyses were performed with Python and R (version 4.5.0 in RStudio version 2026.01.0).

Accessibility

The entire artifact repository is hosted on Zenodo (DOI: 10.5281/zenodo.20512192) and can be downloaded as a .ZIP file. We explain the repository's structure above.

Notes on Reusability

While we have validated our multilingual disclosure-assessment method for English, German, French, and Italian, it could also be applied to other languages after careful validation in each target language.

Finally, and as we note in the paper, the LLM-based policy analysis cannot guarantee perfect reproducibility, since model outputs may vary slightly across runs.

Files

Files (2.1 GB)

Name: Artifacts.zip
Size: 2.1 GB
Checksum: md5:a414bbe4bb194b31f680a7279b2444f5
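The MD5 checksum listed above can be used to confirm that the archive downloaded completely. The following is a minimal sketch with Python's standard library, assuming the file is saved as `Artifacts.zip` in the working directory; the helper names are illustrative. MD5 is suitable here only as a check against transfer corruption, not as a guarantee against tampering.

```python
import hashlib
from pathlib import Path

# Checksum as listed in the Files section of this record.
EXPECTED_MD5 = "a414bbe4bb194b31f680a7279b2444f5"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected=EXPECTED_MD5):
    """Compare the file's digest with the expected value, ignoring letter case."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("Artifacts.zip")  # assumed download location
    if archive.exists():
        print("checksum OK" if matches_checksum(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it from the record first")
```

Reading in chunks keeps memory use constant regardless of archive size.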

Additional details

Funding

Swiss National Science Foundation
Digital Market Regulation: Monitoring Enforcement and Evaluating Outcomes 10002634
Ministerio de Ciencia, Innovación y Universidades
Re-InitS PID2024-155230OB-C43

Dates

Accepted
2026-06-01
PETS '26