Published April 23, 2026 | Version v1.0
Dataset Open

BuyTheBy - An annotated dataset of paper mill advertisements with price data

  • 1. Northwestern University
  • 2. Freie Universität Berlin

Description

A preprint describing this dataset has been submitted to arXiv. This entry will be updated as soon as arXiv's moderation process is complete.

The study of paper mills and similar businesses operating in the market for academic and education fraud services is frustrated by the lack of market price data on their various offerings. Here, we assemble BuyTheBy, a large, annotated dataset of timestamped, text-based paper mill advertisements from seven businesses operating out of seven different countries. The dataset consists of 18,710 individual advertisements, of which 15,839 have prices listed. Among the priced advertisements, 20,598 positions are listed for sale across 5,567 unique products in 14 product categories, yielding 51,812 timestamped price data points. Code for reproducing figures and summary statistics is available at https://github.com/reeserich/buytheby.

Files

buytheby_v_1_0_combined_processed_ads.csv

Files (158.8 MB)

Checksum Size
md5:57ecb0a7e868864484a6b5e1cf17fef7 142.1 MB
md5:cf7a90e0b132f91fe606a08bd6c653d5 16.7 MB
md5:8d23ed9aeee76e3f6909d4199c5dde78 10.6 kB
md5:4d3f6b561085d96bfda7a500e450cc5f 910 Bytes
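
The MD5 values in the file listing can be used to confirm that a download completed without corruption. The following is a minimal Python sketch of such a check, not part of the dataset's own tooling. The helper names `md5_of_file` and `matches_listing` are hypothetical, and because the listing does not show which checksum belongs to which file name, any pairing of a file with a checksum is an assumption to confirm on the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large CSV files never need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file's MD5 with a listed value such as 'md5:57ec...'."""
    # Drop the optional 'md5:' prefix used in the file listing.
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing("buytheby_v_1_0_combined_processed_ads.csv", "md5:57ecb0a7e868864484a6b5e1cf17fef7")` would return `True` only if the local file's digest equals that value. This example assumes the first listed checksum corresponds to that file, which the listing itself does not confirm.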

Additional details