Cross-language corpora of privacy policies
Authors/Creators
- 1. University of Trento
- 2. University of Trento, Vrije Universiteit Amsterdam
Description
The dataset consists of three privacy policy corpora, in English and Italian, comprising 81 unique privacy policy texts from the period 2018-2021. The first is the original English-language corpus used in the study by Tang et al. [2]. The other two are cross-language corpora built from the first: a source corpus in English and a replication corpus in Italian, the language of a potential replication study.
The policies were collected from:
- the Alexa top 10 website rankings for Italy and the U.S.;
- the Play Store app rankings in the "most profitable games" category for Italy and the U.S.
We manually analyzed the Alexa top 10 websites for Italy as of November 2021. Analogously, we analyzed selected apps that, in the same period, ranked highest in the "most profitable games" category of the Play Store for Italy.
All the privacy policies are ANSI-encoded text files and have been manually read and verified.
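"ANSI" does not name a single encoding; on Windows systems it usually refers to a code page such as Windows-1252. The following is a minimal sketch for reading one policy file, assuming Windows-1252 (`cp1252`) is the code page in use, which this description does not confirm. The function name `read_policy` is hypothetical.

```python
from pathlib import Path


def read_policy(path, encoding="cp1252"):
    """Return the policy text decoded with the assumed ANSI code page."""
    raw = Path(path).read_bytes()
    # errors="replace" substitutes U+FFFD for any byte that is undefined
    # in the code page instead of raising UnicodeDecodeError.
    return raw.decode(encoding, errors="replace")
```

If accented Italian characters come out garbled, a different code page may apply and can be passed through the `encoding` parameter.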
The dataset is a useful starting point for building comparable cross-language corpora of privacy policies, which in turn help replicate studies in different languages.
Details on the methodology can be found in the accompanying paper.
The available files are as follows:
- policies-texts.zip: a directory of text files containing the policy texts. Each file name is the SHA1 hash of the policy text.
- policy-metadata.csv: a CSV file with the metadata for each privacy policy.
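Because the file names are SHA1 hashes of the policy text, an extracted copy of the archive can be checked for consistency. The sketch below assumes that the hash was computed over the raw bytes of each file and that any file extension is not part of the hash; neither detail is stated here, so a mismatch may only mean that the hash was taken over a different representation of the text. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha1_hex(path):
    """Return the SHA-1 hex digest of a file's raw bytes."""
    return hashlib.sha1(Path(path).read_bytes()).hexdigest()


def mismatched_names(directory):
    """List files whose name (minus extension) differs from their SHA-1 digest."""
    mismatches = []
    for path in sorted(Path(directory).iterdir()):
        # Assumption: the stem of the file name is the hex digest.
        if path.is_file() and path.stem.lower() != sha1_hex(path):
            mismatches.append(path.name)
    return mismatches
```

Calling `mismatched_names` on the extracted directory returns an empty list when every name agrees with its content under these assumptions.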
This dataset is the original dataset used in the publication [1]. The original English U.S. corpus is described in the publication [2].
[1] F. Ciclosi, S. Vidor and F. Massacci. "Building cross-language corpora for human understanding of privacy policies." Workshop on Digital Sovereignty in Cyber Security: New Challenges in Future Vision. Communications in Computer and Information Science. Springer International Publishing, 2023, In press.
[2] J. Tang, H. Shoemaker, A. Lerner, and E. Birrell. Defining Privacy: How Users Interpret Technical Terms in Privacy Policies. Proceedings on Privacy Enhancing Technologies, 3:70–94, 2021.
Files
Total size: 948.8 kB
| MD5 checksum | Size |
|---|---|
| 6abd745fb0440d8d80520b24970f6af1 | 928.2 kB |
| d549ee495e960cbe0daa03e68fb44623 | 20.6 kB |
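The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch; the helper names are hypothetical, and since the listing here does not say which checksum belongs to which file, the check only tests whether a file's digest is one of the two listed values.

```python
import hashlib

# The two MD5 values from the file listing of this record.
LISTED_MD5 = {
    "6abd745fb0440d8d80520b24970f6af1",
    "d549ee495e960cbe0daa03e68fb44623",
}


def md5_hex(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_listed(path):
    """True if the file's MD5 digest is one of the listed checksums."""
    return md5_hex(path) in LISTED_MD5
```

For example, `is_listed("policies-texts.zip")` should be true for an uncorrupted download, assuming the file was saved under that name in the working directory.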
Additional details
Related works
- Is described by
- Preprint: https://arxiv.org/pdf/2302.05355.pdf (URL)
- Conference paper: 10.1007/978-3-031-36096-1_8 (DOI)