Quotebank: A Corpus of Quotations from a Decade of News

Vaucher, Timoté; Spitz, Andreas; Catasta, Michele; West, Robert

doi:10.5281/zenodo.4277311

Published March 8, 2021 | Version 1.0

Dataset Open

1. EPFL
2. Stanford University

Introduction

Quotebank is a dataset of 235 million unique, speaker-attributed quotations that were extracted from 196 million English news articles (127 million containing quotations) crawled from over 377 thousand web domains (15 thousand root domains) between September 2008 and April 2020. The quotations were extracted and attributed using Quobert, a distantly and minimally supervised end-to-end, language-agnostic framework for quotation attribution.

For further details, please refer to the description below and to the original paper:

Timoté Vaucher, Andreas Spitz, Michele Catasta, and Robert West
"Quotebank: A Corpus of Quotations from a Decade of News"
Proceedings of the 14th International ACM Conference on Web Search and Data Mining (WSDM), 2021.
https://doi.org/10.1145/3437963.3441760

When using the dataset, please cite the above paper. Note that the numbers above differ from those listed in the paper, because the updated data in this repository was computed from an expanded set of input news articles.

Dataset summary

The dataset consists of two versions:

Quotation-centric version (quotes-YYYY.json.bz2)
An aggregated set of unique quotations with the most likely speaker. Each unique quotation occurs only once in this version of the data, and the probabilities of the candidate speakers to which the quotation can be attributed are aggregated over all occurrences of the quotation. This version is a minimal but complete list of attributed quotations, aimed at users who only require quotation-speaker attributions and not the individual contexts of these quotations in the original articles.
Article-centric version (quotebank-YYYY.json.bz2)
A complete set of all individual quotation mentions with associated speaker as well as the article context in which they are mentioned. This larger version contains one entry per article in the news data. Each entry contains all speakers that appear in the news article as well as the (attributed) quotations, alongside a context window surrounding the quotations.

Both versions are split into 13 files (one per year) for ease of downloading and handling.
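
As a small illustration of the naming scheme described above, the following sketch enumerates the 13 per-year file names of each version. The helper name `yearly_files` is hypothetical and not part of the dataset or its tooling.

```python
# Years covered by the release: 2008 through 2020, i.e. 13 files per version.
YEARS = range(2008, 2021)


def yearly_files(prefix: str) -> list[str]:
    """Return the per-year file names for a prefix ("quotes" or "quotebank")."""
    return [f"{prefix}-{year}.json.bz2" for year in YEARS]


quotation_centric = yearly_files("quotes")    # quotes-YYYY.json.bz2
article_centric = yearly_files("quotebank")   # quotebank-YYYY.json.bz2
```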

Dataset details

The following formatting applies to both versions of the dataset:

All data is made available in JSON format that has been compressed using bzip2.
The data is split per year (i.e., there is one file for each calendar year).
The offsets of quotations, contexts, and speaker annotations are given in units of Penn TreeBank Tokenizer tokens.
Offsets are zero-based and are computed from the start of the article.
When pairs of offsets are provided, the end offset is non-inclusive (e.g. in Python you can call tokens[start:end] without having to do end+1).
The Spinn3r data from which Quotebank was extracted was collected over more than a decade. During this time, the client-side code used for collecting the data changed several times, and various character-encoding issues led to different representations of the original text at different times. We thus divide the 12 years spanned by the Spinn3r corpus into five phases (Phases A through E). A detailed description is available on GitHub; the key takeaways are that (1) text was lowercased in Phases A, B, and C, whereas the original capitalization was maintained in Phases D and E, and that (2) non-ASCII characters are properly represented only in Phase E.
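
The formatting conventions listed above can be sketched in Python as follows. This is a minimal sketch under one loud assumption: that each compressed file stores one JSON object per line. The description only states that the data is JSON compressed with bzip2, so please verify the layout on a real file before relying on it. The function names `iter_records` and `slice_span` are hypothetical.

```python
import bz2
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Stream records from a bzip2-compressed file without loading it fully.

    Assumption (not stated in the dataset description): one JSON object per line.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def slice_span(tokens: list[str], start: int, end: int) -> list[str]:
    """Return the tokens of a zero-based span whose end offset is non-inclusive."""
    return tokens[start:end]
```

Streaming line by line keeps memory use flat, which matters because the individual files are several gigabytes in size even when compressed.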

Version 1: Quotation-centric data

In this version of the dataset, the quotations are aggregated across all their occurrences in the news article data, and assigned a probability for each speaker candidate. We consider two quotations to be equivalent and suitable for aggregation if they are identical after lower-casing and removing punctuation.
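
The equivalence criterion above can be approximated as follows. This is only a sketch: the exact punctuation set used for Quotebank is not specified here, so the code assumes ASCII punctuation (Python's `string.punctuation`) and additionally collapses whitespace, which is also an assumption. The names `normalize_quotation` and `are_equivalent` are hypothetical.

```python
import string

# Assumption: "punctuation" is approximated by ASCII punctuation characters only.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_quotation(text: str) -> str:
    """Lower-case the text, drop ASCII punctuation, and collapse whitespace."""
    stripped = text.lower().translate(_PUNCT_TABLE)
    return " ".join(stripped.split())


def are_equivalent(a: str, b: str) -> bool:
    """Return True if two quotations are identical after normalization."""
    return normalize_quotation(a) == normalize_quotation(b)
```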

Quotation-centric data
 |-- quoteID: Primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
 |-- quotation: Text of the longest encountered original form of the quotation
 |-- date: Earliest occurrence date of any version of the quotation
 |-- phase: Corresponding phase of the data in which the quotation first occurred (A-E)
 |-- probas: Array representing the probabilities of each speaker having uttered the quotation.
      The probabilities across different occurrences of the same quotation are summed for
      each distinct candidate speaker and then normalized
      |-- proba: Probability for a given speaker
      |-- speaker: Most frequent surface form for a given speaker in the articles where the quotation occurred
 |-- speaker: Selected most likely speaker. This matches the first speaker entry in `probas`
 |-- qids: Wikidata IDs of all aliases that match the selected speaker
 |-- numOccurrences: Number of times this quotation occurs in the articles
 |-- urls: List of links to the original articles containing the quotation
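
The aggregation described for `probas` (sum the probabilities per candidate speaker over all occurrences, then normalize) can be sketched as below. The input representation, one `{speaker: probability}` dictionary per occurrence, is an assumption made for illustration, as is the function name `aggregate_probas`. The selection of the most frequent surface form per speaker is not modeled; speakers are assumed to be already resolved to a single name.

```python
from collections import defaultdict


def aggregate_probas(occurrences: list[dict]) -> list[dict]:
    """Sum per-speaker probabilities over occurrences, normalize, sort descending.

    `occurrences` holds one {speaker: probability} dict per occurrence
    (a hypothetical input format chosen for this sketch).
    """
    totals = defaultdict(float)
    for local in occurrences:
        for speaker, proba in local.items():
            totals[speaker] += proba
    norm = sum(totals.values())
    if norm == 0:
        return []
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [{"speaker": s, "proba": p / norm} for s, p in ranked]


# Made-up example: the same quotation seen in two articles.
probas = aggregate_probas([
    {"Alice": 0.75, "None": 0.25},
    {"Alice": 0.5, "Bob": 0.5},
])
top_speaker = probas[0]["speaker"]  # first entry is the most likely speaker
```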

Note that for some speakers there can be more than one Wikidata ID in the `qids` field. To access Wikidata information about those speakers it is necessary to disambiguate them, i.e., select one of the listed Wikidata IDs that most likely corresponds to the respective speaker. Speaker disambiguation can be done using scripts available in the quotebank-toolkit repository. Additionally, the repository contains useful scripts for cleaning and enriching Quotebank.

Version 2: Article-centric data

In this data set, individual quotations are not aggregated. For each article, one JSON entry contains all speakers that appear in the news article, the (attributed) quotations, and the text within a context window surrounding each of the quotations.

Article-centric data
 |-- articleID: Primary key
 |-- articleLength: Length of the article in PTB tokens
 |-- date: Publication date of the article
 |-- phase: Corresponding phase in which the article appeared (A-E)
 |-- title: Title of the article
 |-- url: Link to the original article
 |-- names: List of all extracted speakers that occur in the article
      |-- name: Surface form of the first occurrence of each speaker in the article
      |-- ids: List of Wikidata IDs that have `name` as a possible alias
      |-- offsets: List of pairs of start/end offset, signifying positions at which the speaker occurs in the article (full and partial mention of the speaker)
 |-- quotations: List of all the quotations that appear in the article
      |-- quoteID: Foreign key of the quotation (from the quotation-centric dataset)
      |-- quotation: Text of the quotation as it occurs in this article
      |-- quotationOffset: Index where the quotation starts in the article
      |-- leftContext: Text in the left context window of the quotation (used for the attribution)
      |-- rightContext: Text in the right context window (used for the attribution)
      |-- globalProbas: Array representing the probabilities of each speaker having uttered the quote *at the aggregated level*. Same as `probas` for a given `quoteID`
      |-- globalTopSpeaker: Most probable speaker *at the aggregated level*. Same as `speaker` for a given `quoteID` 
      |-- localProbas: Array representing the probabilities of each speaker having said the quote *given this article context*.
           |-- proba: Probability for a given speaker
           |-- speaker: Name of the speaker as it first occurs in this article
      |-- localTopSpeaker: Selected speaker. Same name as the first entry in `localProbas`
      |-- numOccurrences: Number of times this quotation occurs in any article
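
To illustrate how the local and aggregated attributions in this schema relate, the sketch below lists the quotations of one article entry whose `localTopSpeaker` differs from `globalTopSpeaker`. The function name and the example record are made up for illustration. Because the two fields may use different surface forms of the same name, a string mismatch does not necessarily mean that two different people were selected.

```python
def disagreeing_quotations(article: dict) -> list[dict]:
    """Return quotations whose article-level and aggregated top speakers differ."""
    return [
        q for q in article.get("quotations", [])
        if q.get("localTopSpeaker") != q.get("globalTopSpeaker")
    ]


# Made-up article entry using only field names from the schema above.
article = {
    "articleID": "example-article",
    "quotations": [
        {"quoteID": "2020-01-01-000001",
         "localTopSpeaker": "Jane Doe", "globalTopSpeaker": "Jane Doe"},
        {"quoteID": "2020-01-01-000002",
         "localTopSpeaker": "None", "globalTopSpeaker": "John Roe"},
    ],
}
mismatches = disagreeing_quotations(article)
```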

Code repository

The Quobert code that was used for the extraction and attribution of this dataset is maintained in a GitHub repository.

Files

Files (189.7 GB)

Name	MD5	Size
quotebank-2008.json.bz2	403525925dc408f811ca2746230ad320	2.2 GB
quotebank-2009.json.bz2	071c9b74b9a21a596d8802ac9a4a68b9	10.6 GB
quotebank-2010.json.bz2	d7bd58b27601a222ea2624535bcbb6ca	10.1 GB
quotebank-2011.json.bz2	0b08111bf0d63c81e0c667a6db75beaa	11.4 GB
quotebank-2012.json.bz2	a8fef352327046acf23ef1774924e1db	11.1 GB
quotebank-2013.json.bz2	7572f0ce58e318d2515aa2d73c452c51	10.4 GB
quotebank-2014.json.bz2	0b22916e136acd1bc645a38b795576b0	11.3 GB
quotebank-2015.json.bz2	1ecf6a5aa98ed8fa2e6e3dc9a6c5a840	11.7 GB
quotebank-2016.json.bz2	01e205a4b139e65b38efc16c3bd9e049	7.8 GB
quotebank-2017.json.bz2	4af6ec1e83dfde55c5c057966bca8850	16.1 GB
quotebank-2018.json.bz2	359db3897e4411ba7482b0c4c2e1cdee	20.0 GB
quotebank-2019.json.bz2	4f6522c49afda32904acf720b45af6eb	23.0 GB
quotebank-2020.json.bz2	1d9c0abad33e034b01a1028237435a9e	5.3 GB
quotes-2008.json.bz2	37e7bd0328fb88168485ab0e60b0f18c	1.3 GB
quotes-2009.json.bz2	73f755950784f7fc3d55c56d7898047f	3.1 GB
quotes-2010.json.bz2	358682a9e2c462b6cf1af04aac1104c3	2.6 GB
quotes-2011.json.bz2	20a3b5be3912b7e512d1df3120c90376	2.9 GB
quotes-2012.json.bz2	adb8002da8a8f00e564c33fca388f422	2.9 GB
quotes-2013.json.bz2	c9d7f80ae943d656b66e32cba3044b2e	2.8 GB
quotes-2014.json.bz2	dc586892e40ba47780efa752898e1a72	3.0 GB
quotes-2015.json.bz2	559feb6c332218053a907d810a3b72d4	3.3 GB
quotes-2016.json.bz2	721a0ad89471112c34a111642b16674b	2.3 GB
quotes-2017.json.bz2	57b939b996da97d1decc0d167fa9ca42	5.2 GB
quotes-2018.json.bz2	2fe5b1326057a7a04baf99e3139c3b6a	4.8 GB
quotes-2019.json.bz2	5ea5ee668428b24f6da9ae02912c14f1	3.6 GB
quotes-2020.json.bz2	c76f55fab5183fdbf3357f6d4eb3e4a7	830.7 MB
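
Given the size of these files, it may be worth checking a download against the MD5 checksum listed for it. The following is a minimal sketch using Python's standard `hashlib` module; the function names are hypothetical, and the file is read in chunks so that multi-gigabyte archives do not need to fit in memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with the expected hex string from the table."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_checksum("quotes-2020.json.bz2", "c76f55fab5183fdbf3357f6d4eb3e4a7")` would be expected to return `True` for an intact download of that file, assuming it was saved under that name in the working directory.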

