Published July 3, 2021 | Version v1

tvarchive Dataset

Description

The tvarchive dataset contains word-frequency and other non-consumptive-use data about 1,205,844 English-language transcriptions of U.S. television news broadcasts. The documents were scraped from the Internet Archive's TV News Archive, which includes automatic captions of select U.S. news broadcasts since 2009. While the complete TV News Archive contains over 2.2 million transcripts, WE1S researchers were only able to collect about 1.2 million documents containing complete transcripts. The full TV News Archive includes transcripts from 33 networks and hundreds of shows. Unlike other WE1S datasets, the tvarchive dataset was not collected using keyword searches for specific terms (i.e., documents containing the word "humanities"). (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")

Notes

WE1S makes available word frequency data only "non-consumptive use". This dataset cannot be used to access, read, or reconstruct the original texts.

The data has been archived in jsonl format (each json document is delimited by a line break).

Files

Files (32.1 GB)

Name Size
md5:48c843b98370b4bd34f292c41c95a60a
32.1 GB Download