A dataset of late 1990s and early 2000s web banner ads on Chinese- and English-language web pages
Description
This dataset contains information about 22,915 unique banner ad images that appeared on Chinese- and English-language web pages in the late 1990s and early 2000s. The dataset was mined from 1,384,355 archived web page snapshots downloaded from the Wayback Machine, representing 77,747 unique HTTP URLs. The URLs were collected from six printed Internet directory books published in mainland China and the United States between 1999 and 2001, as part of a larger research project on Chinese-language web archiving.
For each banner ad image, the dataset provides standard image metadata such as file format and dimensions. The dataset also provides the original URLs of the web pages where the banner ad image was found, timestamps of the archived web page snapshots containing the image, archived URLs of the image file, and, if available, archived URLs of the web pages to which the ad image links. Additionally, the dataset provides text extracted from the banner ad images using optical character recognition (OCR). We expect the dataset to be useful for researchers across a variety of disciplines and fields such as visual culture, history, media studies, and business and marketing.
Notes
Technical info (English)
The dataset is presented as a JSON file containing an array of individual banner ad images. Each object in the array represents one unique banner ad image. Each object contains the following fields:
- md5: This field contains the MD5 hash value of the banner image file. It is used as a unique identifier for all banner ads in the dataset.
- width and height: These fields specify the dimensions of the banner ad in pixels.
- filetype: This field indicates the file format of the banner ad image as it was served from the original website (or the original ad provider's server) to the Wayback Machine's crawler. Possible values are gif, jpeg, bmp, and png. File type is detected by examining the first two characters of the image's base64 string.
- appearances: This is an array of objects, each representing one appearance of the banner ad in the downloaded collection of archived web page snapshots along with associated details. An appearance is defined as the banner ad image located at a unique image_url (see below) appearing in a web page snapshot at a specific URL archived at a specific timestamp (see below). Each object in this array contains the following fields:
  - url: This field provides the original URL of the web page where the banner ad was found.
  - timestamp: This field indicates when the web page containing the banner ad was archived by the Wayback Machine. Timestamps use the format "YYYYMMDDHHMMSS". The archived snapshot of the web page containing the banner ad image can therefore be accessed at https://web.archive.org/web/{{timestamp}}/{{url}}
  - image_url: This field provides the archived URL of the banner ad image as it appeared in the archived snapshot of the web page captured at the time indicated in timestamp.
  - hrefs: This field is an array of archived URLs that the image would take the user to when clicked. In most cases, the array contains only one element. If the array contains multiple elements, the banner image loaded from the same image_url appeared multiple times on this archived snapshot of the web page and was linked to at least two different URLs.
- ocr_result: If the image is a corrupted GIF, the value of ocr_result is the string "corrupt". Otherwise, ocr_result is an array containing text extracted from the image using PaddleOCR. For animated images, each object in this array represents one individual frame. For static images, the array contains a single object whose frame_num is 0. Each object in this array contains the following fields:
  - frame_num: The number of the frame of the banner ad image that this object represents (counting from zero).
  - result: An array representing the bounding boxes detected by the OCR engine on the frame. Each object in this array contains the following fields:
    - text: The text detected by the OCR engine.
    - confidence: The confidence score given by the OCR engine for the detected text.
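The field descriptions above can be illustrated with a short Python sketch. It is not the authors' code: the helper names (`filetype_from_base64`, `snapshot_url`, `ocr_lines`), the `min_confidence` parameter, and every value in `sample_banner` are hypothetical and exist only for illustration. The base64 prefix table is an assumption derived from each format's magic bytes, since the description does not list the prefixes actually used, and the sketch assumes that confidence is a number.

```python
from typing import Any, Optional

WAYBACK_PREFIX = "https://web.archive.org/web"

# Assumed prefixes: the first two base64 characters produced by each
# format's magic bytes (GIF "GIF", JPEG FF D8, PNG 89 50, BMP "BM").
BASE64_PREFIX_TO_FILETYPE = {"R0": "gif", "/9": "jpeg", "iV": "png", "Qk": "bmp"}


def filetype_from_base64(b64: str) -> Optional[str]:
    """Guess the file type from the first two base64 characters, or None."""
    return BASE64_PREFIX_TO_FILETYPE.get(b64[:2])


def snapshot_url(appearance: dict[str, Any]) -> str:
    """Build the archived page URL from an appearance's timestamp and url."""
    return f"{WAYBACK_PREFIX}/{appearance['timestamp']}/{appearance['url']}"


def ocr_lines(banner: dict[str, Any], min_confidence: float = 0.0) -> list[tuple[int, str]]:
    """Return (frame_num, text) pairs; an empty list for corrupted GIFs."""
    ocr = banner["ocr_result"]
    if ocr == "corrupt":  # corrupted GIFs carry a string instead of an array
        return []
    lines = []
    for frame in ocr:
        for box in frame["result"]:
            if box["confidence"] >= min_confidence:
                lines.append((frame["frame_num"], box["text"]))
    return lines


# Hypothetical record shaped like the description; not taken from the dataset.
sample_banner = {
    "md5": "00000000000000000000000000000000",
    "width": 468,
    "height": 60,
    "filetype": "gif",
    "appearances": [
        {
            "url": "http://www.example.com/",
            "timestamp": "20000301120000",
            "image_url": "https://web.archive.org/web/20000301120000im_/http://www.example.com/banner.gif",
            "hrefs": ["https://web.archive.org/web/20000301120000/http://www.example.com/ad"],
        }
    ],
    "ocr_result": [
        {"frame_num": 0, "result": [{"text": "Click here", "confidence": 0.98}]},
        {"frame_num": 1, "result": [{"text": "???", "confidence": 0.31}]},
    ],
}

if __name__ == "__main__":
    print(snapshot_url(sample_banner["appearances"][0]))
    print(ocr_lines(sample_banner, min_confidence=0.5))
```

Checking for the "corrupt" string before iterating matters because ocr_result is either a string or an array, and iterating over the string would fail on the first character.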
Files
Name | Size |
---|---|
banners_output_20230930.json (md5:0b183aca59895cc1cb9ea96eae82ccfd) | 215.0 MB |
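Since the record lists an MD5 checksum for the file, a download can be verified before it is parsed. The following is a minimal sketch, assuming the file has been saved locally under its listed name and is UTF-8 encoded; the helper names `file_md5` and `load_banners` are hypothetical.

```python
import hashlib
import json

# Checksum as listed for banners_output_20230930.json in this record.
EXPECTED_MD5 = "0b183aca59895cc1cb9ea96eae82ccfd"


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_banners(path: str, expected_md5: str = EXPECTED_MD5) -> list:
    """Verify the checksum, then parse the JSON array of banner objects."""
    actual = file_md5(path)
    if actual != expected_md5:
        raise ValueError(f"MD5 mismatch: expected {expected_md5}, got {actual}")
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

A call such as `load_banners("banners_output_20230930.json")` would return the array of banner objects. The file is roughly 215 MB, so parsing it in one pass needs a correspondingly large amount of memory.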