Published March 25, 2025 | Version v1
Dataset Open

Multilingual Scraper of Privacy Policies and Terms of Service

  • ETH Zürich

Description

Multilingual Scraper of Privacy Policies and Terms of Service: Scraped Documents of 2024

This dataset supplements the publication "Multilingual Scraper of Privacy Policies and Terms of Service", presented at ACM CSLAW’25, March 25–27, 2025, Munich, Germany. It includes the first 12 months of scraped policies and terms from about 800k websites; the exact monthly numbers are listed below.

The following table lists the number of websites visited per month:

| Month   | Number of websites |
|---------|--------------------|
| 2024-01 | 551'148 |
| 2024-02 | 792'921 |
| 2024-03 | 844'537 |
| 2024-04 | 802'169 |
| 2024-05 | 805'878 |
| 2024-06 | 809'518 |
| 2024-07 | 811'418 |
| 2024-08 | 813'534 |
| 2024-09 | 814'321 |
| 2024-10 | 817'586 |
| 2024-11 | 828'662 |
| 2024-12 | 827'101 |

The number of websites visited should always be higher than the number of jobs (Table 1 of the paper), because a website may redirect (so two websites are scraped for one job) or may have to be retried.

To simplify access, we release the data as large CSV files: one file for policies and one for terms per month. These files contain all metadata usable for analysis. We use ',' as the separator, the first row is the header, and strings are quoted. If your favourite CSV parser reports the same numbers as above, the dataset has been parsed correctly.
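As a sanity check, the monthly row counts can be reproduced with Python's built-in csv module, which handles the comma separator, header row, and quoted strings described above. The inline sample below is illustrative, not real dataset content; point the function at one of the monthly files instead.

```python
import csv
import io

def count_rows(fileobj):
    """Count data rows in a monthly CSV (comma-separated, first row
    is the header, strings quoted)."""
    reader = csv.reader(fileobj, delimiter=",", quotechar='"')
    header = next(reader)              # first row is the heading
    return header, sum(1 for _ in reader)

# Tiny inline sample standing in for a real monthly file.
sample = io.StringIO(
    'website_month_id,policy_url,policy_content\n'
    '"1","https://example.com/policy","Some, quoted text"\n'
    '"2","https://example.org/privacy","More text"\n'
)
header, n = count_rows(sample)
print(n)  # 2 -> for a real file this should match the table above
```

For the real files, open them with `open(path, newline="", encoding="utf-8")` and stream the rows rather than loading everything into memory.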

Our scraper sometimes collects documents other than policies and terms (see the evaluation in Sec. 4 of the publication for how often this happens), and these might contain personal data, such as addresses of website authors who maintain their sites only for a selected audience. To reduce the risks for websites, we therefore anonymized the data using Presidio, which substitutes personal data with tokens. If your personal data has not been effectively anonymized and you wish for it to be deleted, please contact us.

Preliminaries

The uncompressed dataset is about 125 GB, so you will need sufficient storage. It also likely will not fit in memory at once, which is why we split the data by month and into separate files for policies and terms.

Files and structure

The files have the following names:

  • 2024__policy.csv for policies
  • 2024__terms.csv for terms

Shared metadata

Both files contain the following metadata columns:

  • website_month_id - identification of crawled website
  • job_id - one website can have multiple jobs in case of redirects (but most commonly has only one)
  • website_index_status - network status of loading the index page, as reported via the Chrome DevTools Protocol. Possible values:
    • DNS_ERROR - domain cannot be resolved
    • OK - all fine
    • REDIRECT - the domain redirects elsewhere
    • TIMEOUT - the request timed out
    • BAD_CONTENT_TYPE - 415 Unsupported Media Type
    • HTTP_ERROR - an HTTP error status (e.g., 404)
    • TCP_ERROR - error in the network connection
    • UNKNOWN_ERROR - unknown error
  • website_lang - language of the index page, detected using the langdetect library
  • website_url - the URL of the website sampled from the CrUX list (may contain subdomains, etc). Use this as a unique identifier for connecting data between months.
  • job_domain_status - indicates the status of loading the index page. Can be:
    • OK - all works well (at the moment, all entries should have this value)
    • BLACKLISTED - URL is on our list of blocked URLs
    • UNSAFE - website is not safe according to Google's Safe Browsing API
    • LOCATION_BLOCKED - country is in the list of blocked countries
  • job_started_at - when the visit of the website was started
  • job_ended_at - when the visit of the website was ended
  • job_crux_popularity - JSON with all popularity ranks of the website this month
  • job_index_redirect - when we detect that the domain redirects us, we stop the crawl and create a new job with the target URL. This saves time if many websites redirect to one target, as it will be crawled only once. The index_redirect is then the job.id corresponding to the redirect target.
  • job_num_starts - number of crawler attempts that started this job (counts restarts after unsuccessful crawls; max is 3)
  • job_from_static - whether this job was included in the static selection (see Sec. 3.3 of the paper)
  • job_from_dynamic - whether this job was included in the dynamic selection (see Sec. 3.3 of the paper). Not exclusive with job_from_static: both can be true when the lists overlap.
  • job_crawl_name - our name for the crawl; contains the year and month (e.g., 'regular-2024-12' for the regular crawl in Dec 2024)
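As a sketch of how website_url can serve as the stable cross-month identifier described above, the following snippet indexes two monthly files by website_url and intersects the key sets to find websites present in both months. The two inline samples are made-up stand-ins for real monthly CSVs.

```python
import csv
import io

def index_by_url(fileobj):
    """Map website_url -> row for one monthly CSV."""
    reader = csv.DictReader(fileobj)
    return {row["website_url"]: row for row in reader}

# Illustrative stand-ins for two consecutive monthly files.
jan = io.StringIO(
    'website_url,website_index_status\n'
    '"https://example.com","OK"\n'
    '"https://old.example.org","REDIRECT"\n'
)
feb = io.StringIO(
    'website_url,website_index_status\n'
    '"https://example.com","OK"\n'
)
seen_both = index_by_url(jan).keys() & index_by_url(feb).keys()
print(sorted(seen_both))  # ['https://example.com']
```

For the full dataset, keeping only the columns you need per month (rather than whole rows) keeps the index small enough to hold in memory.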

Policy data

  • policy_url_id - ID of the URL this policy has
  • policy_keyword_score - score (higher is better) from the crawler's keyword list indicating that a given document is a policy
  • policy_ml_probability - probability assigned by the BERT model that a given document is a policy
  • policy_consideration_basis - the basis on which we decided that this URL is a policy. The crawler tries the following three options in this order:
    1. 'keyword matching' - this policy was found using the crawler navigation (which is based on keywords)
    2. 'search' - this policy was found using a search engine
    3. 'path guessing' - this policy was found by using well-known URLs like example.com/policy
  • policy_url - full URL to the policy
  • policy_content_hash - used as identifier - if the document remained the same between crawls, it won't create a new entry
  • policy_content - the document text, extracted to Markdown using Mozilla's Readability library
  • policy_lang - language of the content, detected using fastText
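For example, the columns above can be used to filter out documents that are unlikely to be genuine policies. The snippet below keeps rows with a high policy_ml_probability and a matching policy_lang; the 0.5 threshold and the sample rows are assumptions for illustration, not values from the paper.

```python
import csv
import io

def likely_policies(fileobj, min_prob=0.5, lang=None):
    """Yield rows that look like genuine policies: BERT probability at
    least min_prob and, optionally, a matching detected language."""
    reader = csv.DictReader(fileobj)
    for row in reader:
        if float(row["policy_ml_probability"]) >= min_prob and \
           (lang is None or row["policy_lang"] == lang):
            yield row

# Made-up sample rows for illustration only.
sample = io.StringIO(
    'policy_url,policy_ml_probability,policy_lang\n'
    '"https://a.example/policy","0.97","en"\n'
    '"https://b.example/about","0.12","en"\n'
    '"https://c.example/datenschutz","0.88","de"\n'
)
urls = [r["policy_url"] for r in likely_policies(sample, lang="en")]
print(urls)  # ['https://a.example/policy']
```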

Terms data

Analogous to the policy data; just substitute the policy_ prefix with terms_ in the column names.

Updates

Check this Google Doc for an updated version of this README.md.

Files

cslaw_policies.zip

Files (39.3 GB)

MD5 checksum and size per file:

  • md5:ba4689d4eeb33dbd70aec816e24584a7 (21.4 GB)
  • md5:8347a45dd5c545445a6088d2e4eb559f (17.9 GB)
  • md5:e8d7cd61217377a8cf3d22ad5fd8265c (6.1 kB)

Additional details

Additional titles

Subtitle
Scraped documents 2024

Related works

Is derived from
Conference paper: 10.1145/3589334.3645709 (DOI)
Is described by
Journal: To appear in: Multilingual Scraper of Privacy Policies and Terms of Service (Other)

Funding

Swiss National Science Foundation
Reproducibility of Web Privacy Measurements P500PT_225449
Swiss National Science Foundation
Digital Market Regulation: Monitoring Enforcement and Evaluating Outcomes 10002634