Vaucher, Timoté; Spitz, Andreas; Catasta, Michele; West, Robert
Introduction
Quotebank is a dataset of 235 million unique, speaker-attributed quotations that were extracted from 196 million English news articles (127 million containing quotations) crawled from over 377 thousand web domains (15 thousand root domains) between September 2008 and April 2020. The quotations were extracted and attributed using Quobert, a distantly and minimally supervised end-to-end, language-agnostic framework for quotation attribution.
For further details, please refer to the description below and to the original paper:
Timoté Vaucher, Andreas Spitz, Michele Catasta, and Robert West
"Quotebank: A Corpus of Quotations from a Decade of News"
Proceedings of the 14th International ACM Conference on Web Search and Data Mining (WSDM), 2021.
https://doi.org/10.1145/3437963.3441760
When using the dataset, please cite the above paper. Note that the numbers above differ from those listed in the paper, because the updated data in this repository was computed from an expanded set of input news articles.
Dataset summary
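The dataset consists of two versions: a quotation-centric version, in which quotations are aggregated across all articles, and an article-centric version, in which each entry describes a single article.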
Both versions are split into 13 files (one per year) for ease of downloading and handling.
Dataset details
The following formatting applies to both versions of the dataset: each file is a bzip2-compressed JSON file (`.json.bz2`).
Version 1: Quotation-centric data
In this version of the dataset, the quotations are aggregated across all their occurrences in the news article data, and assigned a probability for each speaker candidate. We consider two quotations to be equivalent and suitable for aggregation if they are identical after lower-casing and removing punctuation.
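The equivalence rule above can be sketched as a normalization key. This is a minimal illustration, not the Quobert code: the function name `canonical_key` is hypothetical, and since the exact punctuation set and whitespace handling are not specified here, the sketch assumes that "punctuation" means any character in a Unicode `P*` category and that runs of whitespace collapse to a single space.

```python
import unicodedata


def canonical_key(quotation: str) -> str:
    """Lower-case the text and drop punctuation so equivalent quotations collide.

    Assumption: punctuation is any character whose Unicode category starts
    with "P"; whitespace runs are collapsed to one space.
    """
    lowered = quotation.lower()
    kept = "".join(
        ch for ch in lowered if not unicodedata.category(ch).startswith("P")
    )
    return " ".join(kept.split())


a = "We will, of course, continue."
b = "we will of course continue"
print(canonical_key(a) == canonical_key(b))  # True
```

Quotations that map to the same key would then be treated as occurrences of one aggregated quotation.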
Quotation-centric data
|-- quoteID: Primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
|-- quotation: Text of the longest encountered original form of the quotation
|-- date: Earliest occurrence date of any version of the quotation
|-- phase: Corresponding phase of the data in which the quotation first occurred (A-E)
|-- probas: Array representing the probabilities of each speaker having uttered the quotation. The probabilities across different occurrences of the same quotation are summed for each distinct candidate speaker and then normalized
|   |-- proba: Probability for a given speaker
|   |-- speaker: Most frequent surface form for a given speaker in the articles where the quotation occurred
|-- speaker: Selected most likely speaker. This matches the first speaker entry in `probas`
|-- qids: Wikidata IDs of all aliases that match the selected speaker
|-- numOccurrences: Number of times this quotation occurs in the articles
|-- urls: List of links to the original articles containing the quotation
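The description of `probas` (sum per candidate speaker across occurrences, then normalize) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `aggregate_probas` and the input shape (one speaker-to-probability dict per occurrence) are hypothetical, and the sketch skips the step of mapping different surface forms to the same speaker.

```python
from collections import defaultdict


def aggregate_probas(occurrences):
    """Sum per-speaker probabilities over all occurrences, then normalize.

    `occurrences` is a list of dicts mapping speaker -> probability, one
    dict per occurrence of the same quotation (hypothetical input shape).
    Returns (speaker, proba) pairs sorted by decreasing probability.
    """
    totals = defaultdict(float)
    for local in occurrences:
        for speaker, proba in local.items():
            totals[speaker] += proba
    norm = sum(totals.values())
    if norm == 0:
        return []
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(speaker, total / norm) for speaker, total in ranked]


# Two occurrences of the same quotation with made-up local probabilities.
occurrences = [
    {"Speaker A": 0.75, "None": 0.25},
    {"Speaker A": 0.5, "Speaker B": 0.5},
]
probas = aggregate_probas(occurrences)
top_speaker = probas[0][0]  # mirrors `speaker`: the first entry of `probas`
```

Sorting by decreasing probability makes the first entry the selected speaker, which matches the relationship between `speaker` and `probas` described in the schema.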
Version 2: Article-centric data
In this version of the dataset, individual quotations are not aggregated. For each article, one JSON entry contains all speakers that appear in the news article, the (attributed) quotations, and the text within a context window surrounding each of the quotations.
Article-centric data
|-- articleID: Primary key
|-- articleLength: Length of the article in PTB tokens
|-- date: Publication date of the article
|-- phase: Corresponding phase in which the article appeared (A-E)
|-- title: Title of the article
|-- url: Link to the original article
|-- names: List of all extracted speakers that occur in the article
|   |-- name: Surface form of the first occurrence of each speaker in the article
|   |-- ids: List of Wikidata IDs that have `name` as a possible alias
|   |-- offsets: List of pairs of start/end offset, signifying positions at which the speaker occurs in the article (full and partial mention of the speaker)
|-- quotations: List of all the quotations that appear in the article
|   |-- quoteID: Foreign key of the quotation (from the quotation-centric dataset)
|   |-- quotation: Text of the quotation as it occurs in this article
|   |-- quotationOffset: Index where the quotation starts in the article
|   |-- leftContext: Text in the left context window of the quotation (used for the attribution)
|   |-- rightContext: Text in the right context window (used for the attribution)
|   |-- globalProbas: Array representing the probabilities of each speaker having uttered the quote *at the aggregated level*. Same as `probas` for a given `quoteID`
|   |-- globalTopSpeaker: Most probable speaker *at the aggregated level*. Same as `speaker` for a given `quoteID`
|   |-- localProbas: Array representing the probabilities of each speaker having said the quote *given this article context*.
|   |   |-- proba: Probability for a given speaker
|   |   |-- speaker: Name of the speaker as it first occurs in this article
|   |-- localTopSpeaker: Selected speaker. Same name as the first entry in `localProbas`
|   |-- numOccurrences: Number of times this quotation occurs in any article
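As a rough sketch of how the article-centric entries might be consumed, the code below streams a compressed file and lists quotations whose article-level attribution differs from the aggregated one. It assumes that each line of the decompressed file holds one JSON object, which should be checked against the actual files; the helper names `iter_records` and `speaker_disagreements` are hypothetical.

```python
import bz2
import json


def iter_records(path):
    """Yield one record per line of a bzip2-compressed file.

    Assumption: each non-empty line holds one JSON object.
    """
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def speaker_disagreements(article):
    """Return (quoteID, local, global) for quotations whose top speakers differ."""
    return [
        (q["quoteID"], q["localTopSpeaker"], q["globalTopSpeaker"])
        for q in article.get("quotations", [])
        if q["localTopSpeaker"] != q["globalTopSpeaker"]
    ]
```

With a locally downloaded file, one could iterate over `iter_records(path)` and call `speaker_disagreements` on each article. Streaming line by line avoids decompressing a multi-gigabyte file into memory at once.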
Code repository
The Quobert code that was used for the extraction and attribution of this dataset is available and maintained in a GitHub repository.
Name | MD5 | Size |
---|---|---|
quotebank-2008.json.bz2 | 403525925dc408f811ca2746230ad320 | 2.2 GB |
quotebank-2009.json.bz2 | 071c9b74b9a21a596d8802ac9a4a68b9 | 10.6 GB |
quotebank-2010.json.bz2 | d7bd58b27601a222ea2624535bcbb6ca | 10.1 GB |
quotebank-2011.json.bz2 | 0b08111bf0d63c81e0c667a6db75beaa | 11.4 GB |
quotebank-2012.json.bz2 | a8fef352327046acf23ef1774924e1db | 11.1 GB |
quotebank-2013.json.bz2 | 7572f0ce58e318d2515aa2d73c452c51 | 10.4 GB |
quotebank-2014.json.bz2 | 0b22916e136acd1bc645a38b795576b0 | 11.3 GB |
quotebank-2015.json.bz2 | 1ecf6a5aa98ed8fa2e6e3dc9a6c5a840 | 11.7 GB |
quotebank-2016.json.bz2 | 01e205a4b139e65b38efc16c3bd9e049 | 7.8 GB |
quotebank-2017.json.bz2 | 4af6ec1e83dfde55c5c057966bca8850 | 16.1 GB |
quotebank-2018.json.bz2 | 359db3897e4411ba7482b0c4c2e1cdee | 20.0 GB |
quotebank-2019.json.bz2 | 4f6522c49afda32904acf720b45af6eb | 23.0 GB |
quotebank-2020.json.bz2 | 1d9c0abad33e034b01a1028237435a9e | 5.3 GB |
quotes-2008.json.bz2 | 37e7bd0328fb88168485ab0e60b0f18c | 1.3 GB |
quotes-2009.json.bz2 | 73f755950784f7fc3d55c56d7898047f | 3.1 GB |
quotes-2010.json.bz2 | 358682a9e2c462b6cf1af04aac1104c3 | 2.6 GB |
quotes-2011.json.bz2 | 20a3b5be3912b7e512d1df3120c90376 | 2.9 GB |
quotes-2012.json.bz2 | adb8002da8a8f00e564c33fca388f422 | 2.9 GB |
quotes-2013.json.bz2 | c9d7f80ae943d656b66e32cba3044b2e | 2.8 GB |
quotes-2014.json.bz2 | dc586892e40ba47780efa752898e1a72 | 3.0 GB |
quotes-2015.json.bz2 | 559feb6c332218053a907d810a3b72d4 | 3.3 GB |
quotes-2016.json.bz2 | 721a0ad89471112c34a111642b16674b | 2.3 GB |
quotes-2017.json.bz2 | 57b939b996da97d1decc0d167fa9ca42 | 5.2 GB |
quotes-2018.json.bz2 | 2fe5b1326057a7a04baf99e3139c3b6a | 4.8 GB |
quotes-2019.json.bz2 | 5ea5ee668428b24f6da9ae02912c14f1 | 3.6 GB |
quotes-2020.json.bz2 | c76f55fab5183fdbf3357f6d4eb3e4a7 | 830.7 MB |
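Because the files are large, it may be worth checking a download against the MD5 checksum listed above before processing it. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_download` are hypothetical, and the file path is whatever location the file was saved to locally.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 matches the expected checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `verify_download("quotes-2020.json.bz2", "c76f55fab5183fdbf3357f6d4eb3e4a7")` would compare a local copy of that file against its listed checksum. Reading in chunks keeps memory use constant regardless of file size.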