Quotebank: A Corpus of Quotations from a Decade of News

DOI: 10.5281/zenodo.4277311
URL: https://zenodo.org/records/4277311
OAI identifier: oai:zenodo.org:4277311
Authors: Timoté Vaucher (EPFL), Andreas Spitz (EPFL), Michele Catasta (Stanford University), Robert West (EPFL)
Publisher: Zenodo, 2021
Dates: 2021-03-08, 2023-06-18
Language: eng
Related identifier: 10.5281/zenodo.4277310
Version: 1.0
License: Creative Commons Attribution 4.0 International

Introduction

Quotebank is a dataset of 235 million unique, speaker-attributed quotations that were extracted from 196 million English news articles (127 million containing quotations) crawled from over 377 thousand web domains (15 thousand root domains) between September 2008 and April 2020. The quotations were extracted and attributed using Quobert, a distantly and minimally supervised end-to-end, language-agnostic framework for quotation attribution.

For further details, please refer to the description below and to the original paper:

Timoté Vaucher, Andreas Spitz, Michele Catasta, and Robert West. "Quotebank: A Corpus of Quotations from a Decade of News." Proceedings of the 14th International ACM Conference on Web Search and Data Mining (WSDM), 2021. https://doi.org/10.1145/3437963.3441760

When using the dataset, please cite the above paper. (Note that the above numbers differ from those listed in the paper, as the updated data in this repository has been computed from an expanded set of input news articles.)

Dataset summary

The dataset consists of two versions:

Quotation-centric version (quotes-YYYY.json.bz2)
An aggregated set of unique quotations with the most likely speaker. Each unique quotation occurs only once in this version of the data, and the probabilities of the candidate speakers to which the quotation can be attributed are aggregated over all occurrences of the quotation.
This version of the data is a minimal (but complete) list of attributed quotations that is aimed at users who only require quotation-speaker attributions, but no individual contexts for these quotations from the original articles.

Article-centric version (quotebank-YYYY.json.bz2)
A complete set of all individual quotation mentions with associated speaker as well as the article context in which they are mentioned. This larger version contains one entry per article in the news data. Each entry contains all speakers that appear in the news article as well as the (attributed) quotations, alongside a context window surrounding the quotations.

Both versions are split into 13 files (one per year) for ease of downloading and handling.

Dataset details

The following formatting applies to both versions of the dataset:

- All data is made available in JSON format that has been compressed using bzip2.
- The data is split per year (i.e., there is one file for each calendar year).
- The offsets of quotations, contexts, and speaker annotations are given in units of Penn TreeBank Tokenizer tokens. Offsets are zero-based and are computed from the start of the article. When pairs of offsets are provided, the end offset is non-inclusive (e.g., in Python you can call tokens[start:end] without having to do end+1).

The Spinn3r data from which Quotebank was extracted had been collected over the course of over a decade. During this time, the client-side code used for collecting the data changed several times, and various character-encoding-related issues led to different representations of the original text at different times. We thus divide the 12 years spanned by the Spinn3r corpus into five phases (Phases A through E). A detailed description is available on GitHub; the key takeaways are that (1) text was lowercased in Phases A, B, and C, whereas the original capitalization was maintained in Phases D and E, and that (2) non-ASCII characters are properly represented only in Phase E.
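As a minimal sketch of how the files described above might be read, the following Python snippet streams a bzip2-compressed file without first decompressing it to disk. It assumes that each line of a file holds exactly one JSON object (a line-delimited layout, which the description above does not state explicitly). The helper names `iter_entries` and `slice_tokens` are illustrative and not part of any Quotebank tooling.

```python
import bz2
import json
from typing import Iterator, List


def iter_entries(path: str) -> Iterator[dict]:
    """Yield one decoded JSON entry per non-empty line of a .json.bz2 file.

    Assumption: the file is line-delimited JSON (one object per line).
    """
    # Text mode lets bz2 decompress and decode UTF-8 on the fly.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def slice_tokens(tokens: List[str], start: int, end: int) -> List[str]:
    """Return the tokens covered by a zero-based, end-exclusive offset pair."""
    return tokens[start:end]
```

Under these assumptions, `for entry in iter_entries("quotes-2019.json.bz2"): ...` would process one yearly file entry by entry while keeping memory use low, which matters given the size of the yearly files.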
Version 1: Quotation-centric data

In this version of the dataset, the quotations are aggregated across all their occurrences in the news article data, and assigned a probability for each speaker candidate. We consider two quotations to be equivalent and suitable for aggregation if they are identical after lowercasing and removing punctuation.

Quotation-centric data
 |-- quoteID: Primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
 |-- quotation: Text of the longest encountered original form of the quotation
 |-- date: Earliest occurrence date of any version of the quotation
 |-- phase: Corresponding phase of the data in which the quotation first occurred (A-E)
 |-- probas: Array representing the probabilities of each speaker having uttered the quotation. The probabilities across different occurrences of the same quotation are summed for each distinct candidate speaker and then normalized
     |-- proba: Probability for a given speaker
     |-- speaker: Most frequent surface form for a given speaker in the articles where the quotation occurred
 |-- speaker: Selected most likely speaker. This matches the first speaker entry in `probas`
 |-- qids: Wikidata IDs of all aliases that match the selected speaker
 |-- numOccurrences: Number of times this quotation occurs in the articles
 |-- urls: List of links to the original articles containing the quotation

Note that for some speakers there can be more than one Wikidata ID in the `qids` field. To access Wikidata information about those speakers, it is necessary to disambiguate them, i.e., to select the listed Wikidata ID that most likely corresponds to the respective speaker. Speaker disambiguation can be done using scripts available in the quotebank-toolkit repository. Additionally, the repository contains useful scripts for cleaning and enriching Quotebank.

Version 2: Article-centric data

In this dataset, individual quotations are not aggregated.
For each article, one JSON entry contains all speakers that appear in the news article, the (attributed) quotations, and the text within a context window surrounding each of the quotations.

Article-centric data
 |-- articleID: Primary key
 |-- articleLength: Length of the article in PTB tokens
 |-- date: Publication date of the article
 |-- phase: Corresponding phase in which the article appeared (A-E)
 |-- title: Title of the article
 |-- url: Link to the original article
 |-- names: List of all extracted speakers that occur in the article
     |-- name: Surface form of the first occurrence of each speaker in the article
     |-- ids: List of Wikidata IDs that have `name` as a possible alias
     |-- offsets: List of pairs of start/end offset, signifying positions at which the speaker occurs in the article (full and partial mention of the speaker)
 |-- quotations: List of all the quotations that appear in the article
     |-- quoteID: Foreign key of the quotation (from the quotation-centric dataset)
     |-- quotation: Text of the quotation as it occurs in this article
     |-- quotationOffset: Index where the quotation starts in the article
     |-- leftContext: Text in the left context window of the quotation (used for the attribution)
     |-- rightContext: Text in the right context window (used for the attribution)
     |-- globalProbas: Array representing the probabilities of each speaker having uttered the quote *at the aggregated level*. Same as `probas` for a given `quoteID`
     |-- globalTopSpeaker: Most probable speaker *at the aggregated level*. Same as `speaker` for a given `quoteID`
     |-- localProbas: Array representing the probabilities of each speaker having said the quote *given this article context*.
         |-- proba: Probability for a given speaker
         |-- speaker: Name of the speaker as it first occurs in this article
     |-- localTopSpeaker: Selected speaker. Same name as the first entry in `localProbas`
     |-- numOccurrences: Number of times this quotation occurs in any article

Code repository

The code of Quobert that was used for the extraction and attribution of this dataset is available and managed in a GitHub repository.
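The aggregation described for the quotation-centric version (matching quotations after lowercasing and removing punctuation, then summing and normalizing the per-speaker probabilities) can be sketched in Python as follows. This is an illustrative simplification under stated assumptions, not the Quobert implementation: the function names `normalize_quotation` and `aggregate_probas` are hypothetical, punctuation is approximated by ASCII `string.punctuation`, whitespace is additionally collapsed, and each occurrence is assumed to be given as a list of (speaker, proba) pairs.

```python
import string
from collections import defaultdict
from typing import Iterable, List, Tuple

# Assumption: "punctuation" is approximated by ASCII punctuation only.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_quotation(text: str) -> str:
    """Build a matching key: lowercase, drop punctuation, collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def aggregate_probas(
    occurrences: Iterable[Iterable[Tuple[str, float]]]
) -> List[Tuple[str, float]]:
    """Sum per-speaker probabilities over all occurrences, then normalize.

    Each occurrence is a list of (speaker, proba) pairs (hypothetical layout).
    Returns (speaker, proba) pairs sorted by decreasing probability.
    """
    totals = defaultdict(float)
    for local_probas in occurrences:
        for speaker, proba in local_probas:
            totals[speaker] += float(proba)  # also accepts numeric strings
    norm = sum(totals.values())
    if norm == 0:
        return []
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(speaker, total / norm) for speaker, total in ranked]
```

In this sketch, two mentions would be grouped when `normalize_quotation` returns the same key for both, and the first pair returned by `aggregate_probas` would play the role of the selected `speaker`, mirroring the statement that `speaker` matches the first entry in `probas`.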