# Crawler and Data Artifacts

This zip provides the main artifacts for our study: a web crawler (`crawler_and_data/crawler`), the resulting collection of crawled documents (`document_collection`), and manually annotated data (`data_files/quotations.csv`) used for analysis. All data files needed to reproduce the main results of the paper "Layered, Overlapping, and Inconsistent: A Large-Scale Analysis of the Multiple Privacy Policies and Controls of U.S. Banks" are provided.

The research questions are
- RQ1: How many privacy policies are consumers likely to encounter for a given U.S. bank? What do their length and readability reveal about the effort required for consumers to understand a bank’s data practices? 
- RQ2: What do a bank's multiple privacy policies, provided in response to different regulations, disclose about third-party sharing practices regarding marketing and advertising purposes? Are these disclosures consistent across multiple policies provided by the *same* bank? 
- RQ3: How many privacy opt-outs do banks provide regarding third-party sharing for marketing purposes, as required by different regulations? 

Our main artifacts are
- (a) a crawler that searches for potential privacy policy documents starting from the landing page of a website, up to a depth of three (see the "Privacy Policy Retrieval" paragraph of Section 4.1.1 in the paper);
- (b) a collection of five types of privacy documents that the top 2,073 U.S. banks provide (see the "Policy classification" paragraph of Section 4.1.1 in the paper); and
- (c) privacy policy documents annotated according to our codebook, where data-sharing disclosures and opt-out choices are provided as quotations (see Section 4.2.4 in the paper).

Additionally, we provide code and supplementary data files to reproduce the main results that answer the three research questions: Tables 1-5 and Figure 4.

---

## 0. Summary of Provided Artifacts

1. **Crawler code**: `crawler_and_data/crawler/`
2. **Crawled files (demo)**: `crawler_and_data/crawled_files/`
3. **Document collection**: `data_artifact/document_collection/`
4. **Annotated quotations**: `data_files/quotations.csv`
5. **Data files**: `data_files/` (supporting measurement and analysis datasets)
6. **Code for analysis**: `generate_files_for_tables_figure.ipynb`
7. **Generated Tables 1-5 and Figure 4 results**: `result_files_generated/`


## 1. Environment Setup

To run the crawler, you will need **Python 3.12.2** and a virtual environment.

```bash
# Create and activate virtual environment
python3.12 -m venv crawler-artifact
source crawler-artifact/bin/activate

# Upgrade pip and install dependencies
pip install --upgrade pip
pip install "Scrapy==2.11.1"
```

Additional dependencies for running analysis notebooks:

```bash
pip install pandas numpy matplotlib
```

---

## 2. Crawler Artifact

### Crawler Information 

The crawler searches for links and files related to privacy policies (and closely related keywords), up to 2 clicks away from each website's landing page (depths 0, 1, and 2). 
It takes a CSV file with two columns, `URL` and `ID`.  
For demonstration purposes, we provide a sample CSV file, `crawler_sample.csv`, in the `crawler_and_data/crawler` folder with **5 example URLs**. 
The keywords it looks for in link text or URLs include, but are not limited to, 'privacy', 'notice', 'policy', 'disclosure', 'glba', 'ccpa', 'cookie', 'download', and 'pdf'. 
For each `ID`, it saves the webpages and files (including PDF files) found at all depths in that `ID`'s subfolder under `crawler_and_data/crawled_files/`.
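
The keyword matching described above can be illustrated with a minimal sketch. This is not the crawler's actual code: the function names `is_candidate_link` and `is_pdf` are hypothetical, the keyword tuple contains only the keywords quoted in this README, and simple case-insensitive substring matching is an assumption.

```python
from urllib.parse import urlparse

# Only the keywords quoted in this README; the crawler's real list may be longer.
KEYWORDS = ("privacy", "notice", "policy", "disclosure", "glba",
            "ccpa", "cookie", "download", "pdf")


def is_candidate_link(link_text: str, url: str) -> bool:
    """Return True if any keyword appears in the link text or the URL.

    Assumption: case-insensitive substring matching.
    """
    haystack = f"{link_text} {url}".lower()
    return any(keyword in haystack for keyword in KEYWORDS)


def is_pdf(url: str) -> bool:
    """Heuristic (assumed): a URL whose path ends in .pdf points to a PDF file."""
    return urlparse(url).path.lower().endswith(".pdf")
```

Under these assumptions, a link labeled "Privacy Policy" pointing to `https://thread.bank/privacy-policy` would be followed as an HTML page, while a link whose URL path ends in `.pdf` would be treated as a file to download.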

While the crawl runs, Scrapy produces a runtime log that records environment info, framework settings, network activity (requests/responses), item extraction events, and periodic crawl statistics. The log is written to `multibank_log.txt` in the `crawler_and_data/crawler` folder and is appended to continuously while the crawler runs. 
This file contains periodic LogStats lines like "Crawled X pages.... scraped Y items..." that can be used as a quick health check. It also records each network request with its method, URL, and status, which helps trace navigation and diagnose blocks such as 403 errors. 
The webpages and files identified as relevant to privacy policies are captured in a dictionary (e.g., ID, lists of discovered PDF/HTML links, and any saved-page markers like "depth_1_html_saved"). These dictionaries also appear in the log file in lines like "Scraped from <200 ...>".

Lines like "Crawled (200) <GET URL>" indicate a successful fetch. If such a line is followed by a "Scraped from" block, the page yielded an item.
For example: 
> Scraped from <200 https://thread.bank/> {'ID': '1149', 'privacy_link_from_landing_page': ['https://21857472.fs1.hubspotusercontent-na1.net/hubfs/21857472/CCPA%20Privacy%20Notice%204.2024.pdf', 'https://thread.bank/privacy-policy'], 'depth_0_pdf_links': ['https://21857472.fs1.hubspotusercontent-na1.net/hubfs/21857472/CCPA%20Privacy%20Notice%204.2024.pdf'], 'depth_0_html_links': ['https://thread.bank/privacy-policy'], 'files': []}

For website ID 1149, 'privacy_link_from_landing_page' stores the links identified on the landing page (depth=0) that contain a relevant privacy keyword. The crawler then saved the identified PDF file and visited the identified webpage (depth=1).
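
As a quick health check, the "Crawled (status) <METHOD URL>" lines can be tallied by status code. The following is a minimal sketch rather than part of the artifact: the helper name `summarize_statuses` is hypothetical, the regular expression assumes the line format quoted above, and the two sample lines are illustrative only.

```python
import re
from collections import Counter

# Matches the "Crawled (200) <GET URL>" pattern quoted in this README.
CRAWLED_RE = re.compile(r"Crawled \((\d{3})\) <(\w+) (\S+)>")


def summarize_statuses(log_lines):
    """Count responses per HTTP status and collect URLs that returned 403."""
    counts = Counter()
    blocked = []
    for line in log_lines:
        match = CRAWLED_RE.search(line)
        if match is None:
            continue  # not a request/response line
        status, _method, url = match.groups()
        counts[status] += 1
        if status == "403":
            blocked.append(url)
    return counts, blocked


# Illustrative lines only, not copied from a real run.
sample_lines = [
    "Crawled (200) <GET https://thread.bank/>",
    "Crawled (403) <GET https://example.com/blocked>",
]
counts, blocked = summarize_statuses(sample_lines)
```

To apply this to a real run, one could pass the open `multibank_log.txt` file object instead of `sample_lines`, since iterating over a text file yields its lines.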

The log file documents the crawler's encounters and saves at each depth.
For example:
> {'ID': '1149', 'depth_1_html_saved': 'https://thread.bank/privacy-policy/', 'depth_1_pdf_links': ['https://assets.tina.io/854745ed-f244-4a0d-bfa0-321425fdc744/Thread%20Bank%20Rate%20Sheet%20(2).pdf'], 'depth_1_html_links': ['https://thread.bank/sweep-disclosure'], 'files': []} 

For website ID 1149, the crawler visited the depth=1 link, where it identified a PDF file (listed in 'depth_1_pdf_links') and a webpage (listed in 'depth_1_html_links') that contain privacy-related keywords. The crawler saves the PDF file and then visits the pages in 'depth_1_html_links' at depth=2.
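
Because each scraped item is printed as a Python dictionary, it can be read back for inspection. The sketch below is one possible approach, not the authors' tooling: the helper `parse_item` is hypothetical, and it assumes the dictionary sits on a single line and contains only literals (strings and lists), as in the example above.

```python
import ast


def parse_item(text):
    """Extract the {...} part of a log line and parse it as a Python literal.

    Returns None when the line contains no parsable dictionary.
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        return ast.literal_eval(text[start:end + 1])
    except (ValueError, SyntaxError):
        return None


# The depth=1 item for ID 1149 quoted above.
line = (
    "{'ID': '1149', 'depth_1_html_saved': 'https://thread.bank/privacy-policy/', "
    "'depth_1_pdf_links': ['https://assets.tina.io/854745ed-f244-4a0d-bfa0-321425fdc744/"
    "Thread%20Bank%20Rate%20Sheet%20(2).pdf'], "
    "'depth_1_html_links': ['https://thread.bank/sweep-disclosure'], 'files': []}"
)
item = parse_item(line)
next_to_visit = item["depth_1_html_links"]  # pages the crawler visits at depth=2
```

`ast.literal_eval` is used instead of `eval` so that only literal values are accepted when reading log text.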

The `multibank_log.txt` ends with a crawl summary that documents the success rates of crawls and downloads.

When the crawler finishes, it saves its structured log in `output.csv` in the `crawler_and_data/crawler` folder. Each row represents a single item (e.g., a discovered document or page) scraped during the crawl, for use in downstream analysis. Columns indicate the ID of the input website the crawler visited, the source URL where the item was found, the crawl depth relative to the entry page, and the file type (HTML, PDF, etc.). This file can be checked against the log file and the saved files in `crawler_and_data/crawled_files/` to spot unexpected failures.


Runtime varies with the number of keyword-matched links the crawler finds. 
For the **5 example URLs** provided, the crawl should be completed in 1-2 minutes. 
We recommend splitting long URL lists into smaller batches to improve reliability, isolate errors, and simplify troubleshooting.
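
One way to split a long input list into batches is sketched below. It is not part of the artifact: the function name `split_csv`, the output file naming, and the default batch size of 100 are all assumptions. The header row (`URL`, `ID`) is copied into every batch file so each one remains a valid crawler input.

```python
import csv
from pathlib import Path


def split_csv(input_path, output_dir, batch_size=100):
    """Write consecutive batches of rows to numbered CSV files.

    Returns the list of batch file paths that were written.
    """
    input_path = Path(input_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    with input_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        rows = list(reader)

    written = []
    for start in range(0, len(rows), batch_size):
        number = start // batch_size + 1
        batch_path = output_dir / f"{input_path.stem}_batch_{number:03d}.csv"
        with batch_path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()  # keep the URL/ID header in every batch
            writer.writerows(rows[start:start + batch_size])
        written.append(batch_path)
    return written
```

For example, `split_csv("all_banks.csv", "batches", batch_size=200)` (with a hypothetical `all_banks.csv`) would produce `batches/all_banks_batch_001.csv`, `batches/all_banks_batch_002.csv`, and so on, each of which can be crawled and checked separately.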


### Running the Crawler

Action item #1: In `./data_artifact/crawler_and_data/crawler/singlebank/spiders/multibank_sample.py`, update `FILES_STORE` with your file path.

Action item #2: Navigate to the crawler directory and run the crawler with the following terminal commands, after replacing [your_file_path] with the path to this unzipped artifact:

```bash
cd "/[your_file_path]/data_artifact/crawler_and_data/crawler"
scrapy crawl multibank_sample -O output.csv
```


---

## 3. Document Collection Artifact

We ran the above crawler on the websites of the top 2,073 banks provided by the Federal Reserve. We manually visited 373 bank websites where the crawler failed to access the page, found no privacy-related content, or found no GLBA notices (which we expected all banks to provide). We also manually reviewed the websites of the top 200 banks because of their more complex site structures. The crawler captured and saved all files on a bank's website that might contain privacy-notice information, so the resulting collection included many irrelevant files and duplicates. Through automated and manual review, we filtered out non-relevant files and duplicates. 

To answer RQ1 (number of privacy policies a bank provides), we manually labeled the files based on their headings, categorized them into five types, and renamed the files accordingly. 

We provide the resulting collection of manually reviewed and cleaned documents in:  
`data_artifact/document_collection/`

This is one of the main artifacts of our study. 
This collection of documents covers multiple privacy policy types that a bank provides, including (1) GLBA notice (which banks are required to provide to consumers annually under the Gramm-Leach-Bliley Act), (2) general privacy policy (often titled “privacy policy” or “digital privacy” and focused on online privacy), (3) CCPA privacy policy (by which we refer to both the privacy policy and the notice at collection provided in compliance with the CCPA), (4) mobile privacy policy (specific to mobile applications or mobile data collection practices), and (5) cookie/advertising policy (focusing on cookie use and tracking technologies or interest-based advertising).
This collection of privacy documents of diverse types could support future research on training classifiers to identify sharing-related disclosures across policy types, beyond those covered by existing resources that focus mainly on website privacy policies (type (2), general privacy policy, in our collection).

---

## 4. Annotated Data (Quotations) Artifact

We manually annotated statements related to **data sharing disclosures** and **opt-out controls** in the documents. These annotations are provided in:  

`data_files/quotations.csv`

The file includes:
- The **Text Content** of each quotation,
- The **Document** where each quotation originates, 
- The applied **Codes**, based on the codebooks provided in the artifact evaluation appendix.
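
A first look at the annotations might count quotations per source document. The sketch below makes several assumptions: that the CSV header uses the exact column names `Text Content`, `Document`, and `Codes`; that the helper name `quotations_per_document` is hypothetical; and that the inline sample rows are invented placeholders, not data from the study.

```python
import csv
import io
from collections import Counter


def quotations_per_document(handle, document_column="Document"):
    """Count rows per document in an open CSV file.

    Raises KeyError if the assumed column name is not in the header.
    """
    reader = csv.DictReader(handle)
    return Counter(row[document_column] for row in reader)


# Invented placeholder rows, used only to show the expected shape of the file.
sample = io.StringIO(
    "Text Content,Document,Codes\n"
    "placeholder quotation 1,document_a,code_x\n"
    "placeholder quotation 2,document_a,code_y\n"
    "placeholder quotation 3,document_b,code_x\n"
)
counts = quotations_per_document(sample)
```

With the real file, the same function could be called inside `with open("data_files/quotations.csv", newline="", encoding="utf-8") as handle:`; the encoding is also an assumption.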

---

## 5. Data Files

The folder `data_files/` contains additional supporting datasets from our measurements and analyses:

- **bank_list.csv**: List of banks we crawled and examined, using a commercial VPN from a California vantage point. All subsequent measurements were conducted from the California vantage point.  
- **ccpa_optout.csv**: Measurement results of whether a bank provides an opt-out under CCPA. We considered both the **GPC (Global Privacy Control) response** and a **Do-Not-Sell link** on the landing page, as required by CCPA.  
- **ccpa_yes.csv**: Secondary analysis of data sharing statements under the CCPA definition. Contains statements where banks explicitly acknowledge sharing under CCPA, e.g., references to "cross-context behavioral advertising."  
- **cookie_optout.csv**: Measurement of whether and what kind of cookie control settings or banners were present on bank websites.  
- **glba_optout.csv**: Analysis result of GLBA notices, focusing on the "To limit our sharing" section to identify opt-out choices provided under GLBA.  
- **glba_sharing.csv**: Analysis result of GLBA notices, focusing on the data sharing table to identify disclosure of sharing practices under GLBA.  


## 6. Reproducing Tables and Figures

Code for reproducing **Tables 1–5** and **Figure 4** is provided in:  

`generate_files_for_tables_figure.ipynb`

This notebook requires the following imports:  

```python
import os
import re
import pandas as pd
from collections import defaultdict
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
```

- Executing this notebook generates results saved under:  
  `result_files_generated/`


---

## 7. Reproducibility Notes

- The provided setup allows Tables 1-5 and Figure 4 to be reproduced, and allows others to replicate our crawling process, data collection, and analysis workflow.  
- Manual steps are documented in the paper for transparency. 
- The collection of privacy documents and the annotated data sharing disclosure and opt-out choices can potentially be re-used by future research for automated policy document analysis. 
- This zip does not reproduce other results in our paper, which require additional steps of processing and manual review.
- Please refer to the artifact appendix for more information. 

