Multilingual Scraper of Privacy Policies and Terms of Service

Bernhard, David; Nenadic, Luka; Bechtold, Stefan; Kubicek, Karel

doi:10.5281/zenodo.14562039

Published March 25, 2025 | Version v1

Dataset Open

1. ETH Zürich
2. ETH Zurich

Multilingual Scraper of Privacy Policies and Terms of Service: Scraped Documents of 2024

This dataset supplements the publication "Multilingual Scraper of Privacy Policies and Terms of Service" at ACM CSLAW '25, March 25–27, 2025, München, Germany. It includes the first 12 months of scraped policies and terms from about 800k websites; the exact numbers are listed below.

The following table lists the number of websites visited per month:

Month	Number of websites
2024-01	551'148
2024-02	792'921
2024-03	844'537
2024-04	802'169
2024-05	805'878
2024-06	809'518
2024-07	811'418
2024-08	813'534
2024-09	814'321
2024-10	817'586
2024-11	828'662
2024-12	827'101

The number of websites visited should always be higher than the number of jobs (Table 1 of the paper), because a website may redirect, which results in two websites being scraped, or its crawl may have to be retried.

To simplify access, we release the data as large CSV files: for each month, there is one file for policies and another for terms. Each file contains all metadata usable for analysis. If your favourite CSV parser reports the same numbers as above, the dataset is parsed correctly. We use ',' as the separator, the first row is the header, and strings are quoted.
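As a sanity check of these parsing settings, the following minimal sketch streams one file with Python's standard `csv` module and counts the distinct values of a column. It rests on assumptions that this README does not state: that the monthly numbers in the table correspond to distinct `website_month_id` values, and that the files are UTF-8 encoded. The function names are hypothetical.

```python
import csv

# Full document texts can be far longer than csv's default per-field
# limit (131072 characters), so raise the limit before parsing.
csv.field_size_limit(2**31 - 1)


def count_distinct(lines, column="website_month_id"):
    """Count distinct values of `column` in CSV text given as an iterable of lines."""
    # Separator ',', header in the first row, strings in double quotes.
    reader = csv.DictReader(lines, delimiter=",", quotechar='"')
    return len({row[column] for row in reader})


def count_distinct_in_file(path, column="website_month_id"):
    """Same as count_distinct, but reads from a file path (UTF-8 is an assumption)."""
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(path, newline="", encoding="utf-8") as handle:
        return count_distinct(handle, column)
```

Because rows are consumed one at a time and only the set of identifiers is kept, memory use stays small even for the largest monthly files.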

Our scraper sometimes collects documents other than policies and terms (for how often this happens, see the evaluation in Sec. 4 of the publication). Such documents might contain personal data, for example addresses of authors of websites that are maintained only for a selected audience. We therefore decided to reduce the risks for websites by anonymizing the data using Presidio, which substitutes personal data with tokens. If your personal data has not been effectively anonymized in the dataset and you wish for it to be deleted, please contact us.

Preliminaries

The uncompressed dataset is about 125 GB in size, so you will need sufficient storage. You likely cannot load all the data into memory at once, which is why we split the data by month and into separate files for policies and terms.
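If storage is tight, one option is to stream rows directly from a downloaded archive instead of extracting it first. The sketch below does this with Python's standard `zipfile` and `csv` modules. It assumes the CSV files inside the archive are UTF-8 encoded and end in `.csv`; the function name is hypothetical, and the internal layout of the archives is not described in this README.

```python
import csv
import io
import zipfile

# Document texts can exceed csv's default per-field limit.
csv.field_size_limit(2**31 - 1)


def iter_rows_from_zip(zip_path, member_suffix=".csv"):
    """Yield (member_name, row_dict) for every CSV row inside the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(member_suffix):
                continue  # Skip anything that is not a CSV file
            with archive.open(name) as binary:
                # Decode on the fly; newline="" keeps quoted line breaks intact.
                text = io.TextIOWrapper(binary, encoding="utf-8", newline="")
                for row in csv.DictReader(text):
                    yield name, row
```

Each row is decompressed and parsed lazily, so neither the full archive nor a full month needs to fit on disk uncompressed or in memory.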

Files and structure

The files have the following names:

2024__policy.csv for policies
2024__terms.csv for terms

Shared metadata

Both files contain the following metadata columns:

website_month_id - identification of crawled website
job_id - one website can have multiple jobs in case of redirects (but most commonly has only one)
website_index_status - network state of loading the index page. This is resolved by the Chrome DevTools Protocol.
- DNS_ERROR - domain cannot be resolved
- OK - all fine
- REDIRECT - domain redirects to somewhere else
- TIMEOUT - the request timed out
- BAD_CONTENT_TYPE - 415 Unsupported Media Type
- HTTP_ERROR - 404 error
- TCP_ERROR - error in the network connection
- UNKNOWN_ERROR - unknown error
website_lang - language of the index page, detected using the langdetect library
website_url - the URL of the website sampled from the CrUX list (may contain subdomains, etc). Use this as a unique identifier for connecting data between months.
job_domain_status - indicates the status of loading the index page. Can be:
- OK - all works well (at the moment, should be all entries)
- BLACKLISTED - URL is on our list of blocked URLs
- UNSAFE - website is not safe according to the Safe Browsing API by Google
- LOCATION_BLOCKED - country is in the list of blocked countries
job_started_at - when the visit of the website was started
job_ended_at - when the visit of the website was ended
job_crux_popularity - JSON with all popularity ranks of the website this month
job_index_redirect - when we detect that the domain redirects us, we stop the crawl and create a new job with the target URL. This saves time if many websites redirect to one target, as it will be crawled only once. job_index_redirect is then the job_id corresponding to the redirect target.
job_num_starts - number of crawlers that started this job (counts restarts in case of an unsuccessful crawl, max is 3)
job_from_static - whether this job was included in the static selection (see Sec. 3.3 of the paper)
job_from_dynamic - whether this job was included in the dynamic selection (see Sec. 3.3 of the paper). This is not exclusive with job_from_static: both can be true when the lists overlap.
job_crawl_name - our name of the crawl, which contains the year and month (e.g., 'regular-2024-12' for the regular crawl in December 2024)
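To illustrate how job_index_redirect links jobs, the following minimal sketch follows redirects from a job to the job that was actually crawled. It assumes that rows are dictionaries as produced by a CSV reader, that job_index_redirect is empty when a job does not redirect, and that its values can be compared directly with job_id; none of this is guaranteed by this README, and the function names are hypothetical.

```python
def build_redirect_map(rows):
    """Map job_id -> target job_id for rows with a non-empty job_index_redirect."""
    return {
        row["job_id"]: row["job_index_redirect"]
        for row in rows
        if row.get("job_index_redirect")
    }


def resolve_final_job(job_id, redirect_map):
    """Follow redirects until a job without a redirect is reached."""
    seen = {job_id}
    while job_id in redirect_map:
        job_id = redirect_map[job_id]
        if job_id in seen:  # Defensive: stop instead of looping on a cycle
            break
        seen.add(job_id)
    return job_id
```

For example, if job "1" redirects to job "2", which redirects to job "3", then `resolve_final_job("1", redirect_map)` returns "3".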

Policy data

policy_url_id - ID of the URL this policy has
policy_keyword_score - score (higher is better), according to the crawler's keyword list, that the given document is a policy
policy_ml_probability - probability assigned by the BERT model that the given document is a policy
policy_consideration_basis - the basis on which we decided that this URL is a policy. The crawler tries the following three options in this order:
1. 'keyword matching' - this policy was found using the crawler navigation (which is based on keywords)
2. 'search' - this policy was found using a search engine
3. 'path guessing' - this policy was found by using well-known URLs like example.com/policy
policy_url - full URL to the policy
policy_content_hash - used as identifier - if the document remained the same between crawls, it won't create a new entry
policy_content - contains the text of policies and terms extracted to Markdown using Mozilla's readability library
policy_lang - language of the content, detected by fastText
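The columns above can be combined to select documents that are likely to be actual policies and to summarize them by language. The sketch below is one way to do this. The threshold of 0.5 is purely illustrative and not a recommendation from the paper, and the function names are hypothetical; the code also assumes that policy_ml_probability parses as a floating-point number and skips rows where it does not.

```python
from collections import Counter


def likely_policies(rows, min_probability=0.5):
    """Yield rows whose policy_ml_probability is at least min_probability."""
    for row in rows:
        try:
            probability = float(row.get("policy_ml_probability", ""))
        except (TypeError, ValueError):
            continue  # Missing or unparsable probability: skip the row
        if probability >= min_probability:
            yield row


def language_counts(rows):
    """Count documents per detected language (policy_lang)."""
    return Counter(row["policy_lang"] for row in rows)
```

Both functions accept any iterable of row dictionaries, so they can be chained on a streaming CSV reader without loading a whole file into memory.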

Terms data

Analogous to the policy data; substitute policy with terms in the column names.
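Since the two file types differ only in the column prefix, analysis code can be written once for both. The following minimal sketch assumes that the terms columns are named exactly like the policy columns with `terms_` in place of `policy_` (for example `terms_url`); the helper names are hypothetical.

```python
# Document-specific fields, listed without their policy_/terms_ prefix.
DOCUMENT_FIELDS = (
    "url_id",
    "keyword_score",
    "ml_probability",
    "consideration_basis",
    "url",
    "content_hash",
    "content",
    "lang",
)


def document_columns(kind):
    """Return the document column names for kind 'policy' or 'terms'."""
    if kind not in ("policy", "terms"):
        raise ValueError("kind must be 'policy' or 'terms'")
    return [f"{kind}_{field}" for field in DOCUMENT_FIELDS]


def strip_prefix(row, kind):
    """Return the document fields of a row with the kind prefix removed."""
    prefix = kind + "_"
    return {
        key[len(prefix):]: value
        for key, value in row.items()
        if key.startswith(prefix)
    }
```

With `strip_prefix`, a row from either file exposes the same keys (`url`, `lang`, `content`, and so on), while the shared metadata columns are left out.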

Updates

See this Google Doc for an updated version of this README.md.

Files

Files (39.3 GB)

Name	Checksum	Size
cslaw_policies.zip	md5:ba4689d4eeb33dbd70aec816e24584a7	21.4 GB
cslaw_terms.zip	md5:8347a45dd5c545445a6088d2e4eb559f	17.9 GB
README.md	md5:e8d7cd61217377a8cf3d22ad5fd8265c	6.1 kB

Additional details

Additional titles

Subtitle: Scraped documents 2024

Related works

Is derived from: Conference paper: 10.1145/3589334.3645709 (DOI)
Is described by: Journal: To appear in: Multilingual Scraper of Privacy Policies and Terms of Service (Other)

Funding

Swiss National Science Foundation: Reproducibility of Web Privacy Measurements (P500PT_225449)
Swiss National Science Foundation: Digital Market Regulation: Monitoring Enforcement and Evaluating Outcomes (10002634)
