4277311
doi
10.5281/zenodo.4277311
oai:zenodo.org:4277311
Spitz, Andreas
EPFL
Catasta, Michele
Stanford University
West, Robert
EPFL
Quotebank: A Corpus of Quotations from a Decade of News
Vaucher, Timoté
EPFL
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
<p><strong>Introduction</strong></p>
<p>Quotebank is a dataset of 235 million unique, speaker-attributed quotations that were extracted from 196 million English news articles (127 million containing quotations) crawled from over 377 thousand web domains (15 thousand root domains) between September 2008 and April 2020. The quotations were extracted and attributed using Quobert, a distantly and minimally supervised end-to-end, language-agnostic framework for quotation attribution.</p>
<p>For further details, please refer to the description below and to the original paper:</p>
<p>Timoté Vaucher, Andreas Spitz, Michele Catasta, and Robert West<br>
"Quotebank: A Corpus of Quotations from a Decade of News"<br>
Proceedings of the 14th International ACM Conference on Web Search and Data Mining (WSDM), 2021.<br>
<a href="https://doi.org/10.1145/3437963.3441760">https://doi.org/10.1145/3437963.3441760</a></p>
<p>When using the dataset, please cite the above paper. Note that the numbers above differ from those listed in the paper because the updated data in this repository was computed from an expanded set of input news articles.</p>
<p> </p>
<p><strong>Dataset summary</strong></p>
<p>The dataset consists of two versions:</p>
<ul>
<li><strong>Quotation-centric version</strong> (<em>quotes-YYYY.json.bz2</em>)<br>
An aggregated set of unique quotations with the most likely speaker. Each unique quotation occurs only once in this version of the data and the probabilities of the candidate speakers to which the quotation can be attributed are aggregated over all occurrences of the quotation. This version of the data is a minimal - but complete - list of attributed quotations that is aimed at users who only require quotation-speaker attributions, but no individual contexts for these quotations from the original articles.</li>
<li><strong>Article-centric version</strong> (<em>quotebank-YYYY.json.bz2</em>)<br>
A complete set of all individual quotation mentions with associated speaker as well as the article context in which they are mentioned. This larger version contains one entry per article in the news data. Each entry contains all speakers that appear in the news article as well as the (attributed) quotations, alongside a context window surrounding the quotations.</li>
</ul>
<p>Both versions are split into 13 files (one per year) for ease of downloading and handling.</p>
<p> </p>
<p><strong>Dataset details</strong></p>
<p>The following formatting applies to both versions of the dataset:</p>
<ul>
<li>All data is made available in JSON format that has been compressed using bzip2.</li>
<li>The data is split per year (i.e., there is one file for each calendar year).</li>
<li>The offsets of quotations, contexts, and speaker annotations are given in units of <a href="https://nlp.stanford.edu/software/tokenizer.shtml">Penn TreeBank Tokenizer</a> tokens.</li>
<li>Offsets are zero-based and are computed from the start of the article.</li>
<li>When pairs of offsets are provided, the end offset is non-inclusive (e.g. in Python you can call tokens[start:end] without having to do end+1).</li>
<li>The Spinn3r data from which Quotebank was extracted was collected over more than a decade. During this time, the client-side code used for collecting the data changed several times, and various character-encoding issues led to different representations of the original text at different times. We thus divide the 12 years spanned by the Spinn3r corpus into five phases (Phases A through E). A <a href="https://github.com/epfl-dlab/Quotebank/blob/main/phases.md">detailed description</a> is available on GitHub; the key takeaways are that (1) text was lowercased in Phases A, B, and C, whereas the original capitalization was maintained in Phases D and E, and (2) non-ASCII characters are properly represented only in Phase E.</li>
</ul>
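<p>The formatting rules above can be sketched in Python as follows. This is a minimal sketch, not part of the official tooling: it assumes that each decompressed file holds one JSON object per line, and the helper names <em>iter_records</em> and <em>slice_tokens</em> as well as the token list are hypothetical.</p>

```python
import bz2
import json


def iter_records(path):
    """Yield one parsed JSON object per non-empty line of a bzip2 file.

    Assumption: the file stores one JSON object per line.
    """
    # bz2.open in text mode decompresses lazily, so the multi-gigabyte
    # yearly files never have to be held in memory at once.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def slice_tokens(tokens, start, end):
    """Return the tokens covered by a zero-based, end-exclusive offset pair."""
    return tokens[start:end]


# Hypothetical token list standing in for a PTB-tokenized article.
example_tokens = ["``", "We", "will", "win", "''", ",", "she", "said", "."]
quoted_span = slice_tokens(example_tokens, 1, 4)
```

<p>A typical call would be <em>for record in iter_records("quotes-2020.json.bz2")</em>, processing one record at a time instead of loading a whole year.</p>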
<p><br>
<strong>Version 1: Quotation-centric data</strong></p>
<p>In this version of the dataset, the quotations are aggregated across all their occurrences in the news article data, and assigned a probability for each speaker candidate. We consider two quotations to be equivalent and suitable for aggregation if they are identical after lower-casing and removing punctuation.</p>
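<p>The equivalence rule described above can be approximated with a normalization key. The sketch below is an illustration only: it strips ASCII punctuation and collapses whitespace, which are assumptions, and the exact rules used to build Quotebank may differ. The function name <em>normalization_key</em> is hypothetical.</p>

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalization_key(quotation):
    """Lower-case a quotation, drop ASCII punctuation, collapse whitespace."""
    lowered = quotation.lower()
    without_punctuation = lowered.translate(_PUNCTUATION_TABLE)
    # Collapsing whitespace is an extra assumption so that removed
    # punctuation does not leave double spaces behind.
    return " ".join(without_punctuation.split())


def are_equivalent(first, second):
    """Two quotations are treated as equivalent if their keys match."""
    return normalization_key(first) == normalization_key(second)
```

<p>Under these assumptions, <em>"Yes, we can!"</em> and <em>"yes we can"</em> map to the same key and would be aggregated into one entry.</p>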
<pre><code>Quotation-centric data
|-- quoteID: Primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
|-- quotation: Text of the longest encountered original form of the quotation
|-- date: Earliest occurrence date of any version of the quotation
|-- phase: Corresponding phase of the data in which the quotation first occurred (A-E)
|-- probas: Array representing the probabilities of each speaker having uttered the quotation.
            The probabilities across different occurrences of the same quotation are summed for
            each distinct candidate speaker and then normalized
    |-- proba: Probability for a given speaker
    |-- speaker: Most frequent surface form for a given speaker in the articles where the quotation occurred
|-- speaker: Selected most likely speaker. This matches the first speaker entry in `probas`
|-- qids: Wikidata IDs of all aliases that match the selected speaker
|-- numOccurrences: Number of times this quotation occurs in the articles
|-- urls: List of links to the original articles containing the quotation </code></pre>
<p>Note that for some speakers there can be more than one Wikidata ID in the `qids` field. To access Wikidata information about those speakers it is necessary to disambiguate them, i.e., select one of the listed Wikidata IDs that most likely corresponds to the respective speaker. Speaker disambiguation can be done using scripts available in the <a href="https://github.com/epfl-dlab/quotebank-toolkit">quotebank-toolkit</a> repository. Additionally, the repository contains useful scripts for cleaning and enriching Quotebank.</p>
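<p>The aggregation behind the `probas` field (sum per candidate speaker across occurrences, then normalize) can be sketched as below. This is an illustration under stated assumptions, not the Quobert implementation: each occurrence is assumed to be a list of (speaker, probability) pairs, and the function name <em>aggregate_probas</em> is hypothetical.</p>

```python
from collections import defaultdict


def aggregate_probas(occurrences):
    """Sum per-speaker probabilities over all occurrences, then normalize.

    Assumption: `occurrences` is a list with one entry per article mention,
    each entry being a list of (speaker, probability) pairs.
    Returns (speaker, probability) pairs sorted from most to least likely.
    """
    totals = defaultdict(float)
    for local_probas in occurrences:
        for speaker, proba in local_probas:
            # float() also accepts probabilities stored as strings.
            totals[speaker] += float(proba)

    normalizer = sum(totals.values())
    if normalizer == 0:
        return []

    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(speaker, total / normalizer) for speaker, total in ranked]
```

<p>With this ordering, the first returned pair plays the role of the selected `speaker` in the schema above.</p>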
<p><strong>Version 2: Article-centric data</strong></p>
<p>In this version of the dataset, individual quotations are not aggregated. For each article, one JSON entry contains all speakers that appear in the news article, the (attributed) quotations, and the text within a context window surrounding each of the quotations.</p>
<pre><code>Article-centric data
|-- articleID: Primary key
|-- articleLength: Length of the article in PTB tokens
|-- date: Publication date of the article
|-- phase: Corresponding phase in which the article appeared (A-E)
|-- title: Title of the article
|-- url: Link to the original article
|-- names: List of all extracted speakers that occur in the article
    |-- name: Surface form of the first occurrence of each speaker in the article
    |-- ids: List of Wikidata IDs that have `name` as a possible alias
    |-- offsets: List of pairs of start/end offset, signifying positions at which the speaker occurs in the article (full and partial mention of the speaker)
|-- quotations: List of all the quotations that appear in the article
    |-- quoteID: Foreign key of the quotation (from the quotation-centric dataset)
    |-- quotation: Text of the quotation as it occurs in this article
    |-- quotationOffset: Index where the quotation starts in the article
    |-- leftContext: Text in the left context window of the quotation (used for the attribution)
    |-- rightContext: Text in the right context window (used for the attribution)
    |-- globalProbas: Array representing the probabilities of each speaker having uttered the quote *at the aggregated level*. Same as `probas` for a given `quoteID`
    |-- globalTopSpeaker: Most probable speaker *at the aggregated level*. Same as `speaker` for a given `quoteID`
    |-- localProbas: Array representing the probabilities of each speaker having said the quote *given this article context*.
        |-- proba: Probability for a given speaker
        |-- speaker: Name of the speaker as it first occurs in this article
    |-- localTopSpeaker: Selected speaker. Same name as the first entry in `localProbas`
    |-- numOccurrences: Number of times this quotation occurs in any article </code></pre>
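<p>As an illustration of how an article-centric entry might be traversed, the sketch below groups the quotations of one article by their `localTopSpeaker`. It assumes the entry has already been parsed into a Python dictionary whose `quotations` field is a list of dictionaries following the schema above; the function name and the sample entry are hypothetical and not taken from the data.</p>

```python
from collections import defaultdict


def quotations_by_local_speaker(article):
    """Map each locally selected speaker to the quotations attributed to them."""
    grouped = defaultdict(list)
    for entry in article.get("quotations", []):
        grouped[entry["localTopSpeaker"]].append(entry["quotation"])
    return dict(grouped)


# Made-up entry for illustration only; real entries carry more fields.
sample_article = {
    "articleID": "example-article",
    "quotations": [
        {"quoteID": "example-000001", "quotation": "first quote",
         "localTopSpeaker": "Speaker One"},
        {"quoteID": "example-000002", "quotation": "second quote",
         "localTopSpeaker": "Speaker Two"},
        {"quoteID": "example-000003", "quotation": "third quote",
         "localTopSpeaker": "Speaker One"},
    ],
}
grouped_quotes = quotations_by_local_speaker(sample_article)
```

<p>Because `localTopSpeaker` holds a surface form from the article while `globalTopSpeaker` holds the most frequent surface form across articles, comparing the two as plain strings is only a heuristic.</p>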
<p> </p>
<p><strong>Code repository</strong></p>
<p>The Quobert code that was used to extract and attribute the quotations in this dataset is available in a <a href="https://github.com/epfl-dlab/Quotebank">GitHub repository</a>.</p>
Zenodo
2021-03-08
info:eu-repo/semantics/other
4277310
1.0
1687106463.110338
2887107232
md5:20a3b5be3912b7e512d1df3120c90376
https://zenodo.org/records/4277311/files/quotes-2011.json.bz2
5279195205
md5:1d9c0abad33e034b01a1028237435a9e
https://zenodo.org/records/4277311/files/quotebank-2020.json.bz2
4810907820
md5:2fe5b1326057a7a04baf99e3139c3b6a
https://zenodo.org/records/4277311/files/quotes-2018.json.bz2
2945789354
md5:adb8002da8a8f00e564c33fca388f422
https://zenodo.org/records/4277311/files/quotes-2012.json.bz2
1347514222
md5:37e7bd0328fb88168485ab0e60b0f18c
https://zenodo.org/records/4277311/files/quotes-2008.json.bz2
11143217255
md5:a8fef352327046acf23ef1774924e1db
https://zenodo.org/records/4277311/files/quotebank-2012.json.bz2
5202007408
md5:57b939b996da97d1decc0d167fa9ca42
https://zenodo.org/records/4277311/files/quotes-2017.json.bz2
2619342887
md5:358682a9e2c462b6cf1af04aac1104c3
https://zenodo.org/records/4277311/files/quotes-2010.json.bz2
11341488928
md5:0b22916e136acd1bc645a38b795576b0
https://zenodo.org/records/4277311/files/quotebank-2014.json.bz2
2766832999
md5:c9d7f80ae943d656b66e32cba3044b2e
https://zenodo.org/records/4277311/files/quotes-2013.json.bz2
22962334176
md5:4f6522c49afda32904acf720b45af6eb
https://zenodo.org/records/4277311/files/quotebank-2019.json.bz2
10407438507
md5:7572f0ce58e318d2515aa2d73c452c51
https://zenodo.org/records/4277311/files/quotebank-2013.json.bz2
19972385209
md5:359db3897e4411ba7482b0c4c2e1cdee
https://zenodo.org/records/4277311/files/quotebank-2018.json.bz2
3085704672
md5:73f755950784f7fc3d55c56d7898047f
https://zenodo.org/records/4277311/files/quotes-2009.json.bz2
2324413044
md5:721a0ad89471112c34a111642b16674b
https://zenodo.org/records/4277311/files/quotes-2016.json.bz2
3343646526
md5:559feb6c332218053a907d810a3b72d4
https://zenodo.org/records/4277311/files/quotes-2015.json.bz2
2959019572
md5:dc586892e40ba47780efa752898e1a72
https://zenodo.org/records/4277311/files/quotes-2014.json.bz2
2211705548
md5:403525925dc408f811ca2746230ad320
https://zenodo.org/records/4277311/files/quotebank-2008.json.bz2
10597472147
md5:071c9b74b9a21a596d8802ac9a4a68b9
https://zenodo.org/records/4277311/files/quotebank-2009.json.bz2
10069850381
md5:d7bd58b27601a222ea2624535bcbb6ca
https://zenodo.org/records/4277311/files/quotebank-2010.json.bz2
830742956
md5:c76f55fab5183fdbf3357f6d4eb3e4a7
https://zenodo.org/records/4277311/files/quotes-2020.json.bz2
11678780298
md5:1ecf6a5aa98ed8fa2e6e3dc9a6c5a840
https://zenodo.org/records/4277311/files/quotebank-2015.json.bz2
3561935569
md5:5ea5ee668428b24f6da9ae02912c14f1
https://zenodo.org/records/4277311/files/quotes-2019.json.bz2
7848827875
md5:01e205a4b139e65b38efc16c3bd9e049
https://zenodo.org/records/4277311/files/quotebank-2016.json.bz2
16068952045
md5:4af6ec1e83dfde55c5c057966bca8850
https://zenodo.org/records/4277311/files/quotebank-2017.json.bz2
11388702855
md5:0b08111bf0d63c81e0c667a6db75beaa
https://zenodo.org/records/4277311/files/quotebank-2011.json.bz2
public
10.5281/zenodo.4277310
isVersionOf
doi