tvarchive Dataset

WhatEvery1Says (WE1S) Project

doi:10.5281/zenodo.5068196

Published July 3, 2021 | Version v1

Dataset Open

The tvarchive dataset contains word-frequency and other non-consumptive-use data about 1,205,844 English-language transcriptions of U.S. television news broadcasts. The documents were scraped from the Internet Archive's TV News Archive, which includes automatic captions of select U.S. news broadcasts since 2009. The full TV News Archive includes over 2.2 million transcripts from 33 networks and hundreds of shows, but WE1S researchers were able to collect only about 1.2 million documents containing complete transcripts. Unlike other WE1S datasets, the tvarchive dataset was not collected using keyword searches for specific terms (e.g., searches for documents containing the word "humanities"). (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")

Notes

WE1S makes word-frequency data available for "non-consumptive use" only. This dataset cannot be used to access, read, or reconstruct the original texts.

The data has been archived in JSON Lines (jsonl) format: each line of the file holds one JSON document, and documents are separated by line breaks.
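Because the file is line-delimited, it can be read one document at a time instead of being loaded whole, which matters for a file of this size. The following is a minimal sketch of that approach, not a WE1S tool. The helper name `iter_jsonl` and the field name `doc` in the sample are hypothetical; the actual keys in tvarchive.jsonl are not described here.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON document per non-blank line of the stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines between documents
            yield json.loads(line)


# Tiny in-memory sample standing in for the real file.
# The "doc" key is a placeholder, not a field of the dataset.
sample = io.StringIO('{"doc": 1}\n\n{"doc": 2}\n')
docs = list(iter_jsonl(sample))
```

For the real file, the same generator could be fed an open file handle, for example `with open("tvarchive.jsonl", encoding="utf-8") as fh: ...`, assuming the file is UTF-8 encoded.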

Files

Name	MD5 checksum	Size
tvarchive.jsonl	48c843b98370b4bd34f292c41c95a60a	32.1 GB
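The MD5 checksum listed for tvarchive.jsonl can be used to confirm that a download completed without corruption. The sketch below shows one way to do this with Python's standard `hashlib` module, reading the file in chunks so that the 32.1 GB file never has to fit in memory. The function names and the chunk size are illustrative choices, and the temporary file at the end exists only to exercise the helper.

```python
import hashlib
import os
import tempfile

# Checksum as published in the file listing for tvarchive.jsonl.
EXPECTED_MD5 = "48c843b98370b4bd34f292c41c95a60a"


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare a file's MD5 digest with the expected hex string."""
    return md5_of_file(path) == expected.lower()


# Self-check on a throwaway file (not the dataset itself).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example bytes")
    tmp_path = tmp.name
try:
    demo_digest = md5_of_file(tmp_path)
finally:
    os.remove(tmp_path)
```

After downloading, `matches_expected("tvarchive.jsonl")` should return `True` if the local copy matches the published checksum; the local path is an assumption about where the file was saved.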
