Published January 6, 2021 | Version v1
Dataset Open

Contextualizing Trending Entities in News Stories

  • 1. Bloomberg
  • 2. University of Pisa

Description

This repository contains enrichments of The New York Times Annotated Corpus, developed for the paper:

“Marco Ponza, Diego Ceccarelli, Paolo Ferragina, Edgar Meij, Sambhav Kothari. Contextualizing Trending Entities in News Stories. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM 2021).”

It includes 149 trends comprising 120K entities in total. The goal is to retrieve a set of entities ranked by how useful they are in explaining why a given trending entity is trending.

Format

The repository contains the enrichments in JSON format.

The New York Times news stories from which these enrichments were derived are available from LDC.
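
As a minimal sketch of how the JSON enrichments might be read, the snippet below walks a zip archive and parses every member whose name ends in `.json`. The helper name `iter_json_members` is hypothetical, and the internal layout of the archive is not documented here, so the code assumes only that each such member holds a single JSON document and makes no assumption about field names.

```python
import json
import zipfile


def iter_json_members(archive_path):
    """Yield (member_name, parsed_json) for each .json file in a zip archive.

    Assumption: each .json member is one complete JSON document.
    The schema of the parsed objects is not assumed.
    """
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".json"):
                continue  # skip directories and non-JSON members
            with archive.open(name) as member:
                yield name, json.load(member)
```

A caller could loop over `iter_json_members("contextualizing-trending-entities.zip")` and inspect the first parsed object to learn the actual structure before writing any further processing code.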

Data Splits

We perform two kinds of evaluation.

  1. Unsupervised evaluation, where we use the complete dataset of 149 trends as a benchmark.
  2. Supervised evaluation, where we train and tune our models on the training and development sets and evaluate them on the test set.
  • The training set contains 50 trends comprising 36.3K entities from 1996 to 2000.
  • The development set contains 34 trends comprising 26.7K entities from 2000 to 2002.
  • The test set contains 65 trends comprising 57K entities from 2002 to 2007.
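
The three splits partition the full benchmark, which can be checked with a few lines of arithmetic. The sketch below uses only the counts listed above (entity counts in thousands, as rounded in the description); the dictionary keys are illustrative names, not identifiers from the dataset.

```python
# Split statistics as listed in the description (entities in thousands).
splits = {
    "train": {"trends": 50, "entities_k": 36.3},
    "dev": {"trends": 34, "entities_k": 26.7},
    "test": {"trends": 65, "entities_k": 57.0},
}

total_trends = sum(s["trends"] for s in splits.values())
# Round to one decimal to avoid floating-point noise in the sum.
total_entities_k = round(sum(s["entities_k"] for s in splits.values()), 1)

print(total_trends, total_entities_k)  # 149 120.0
```

The totals match the 149 trends and 120K entities reported for the complete dataset.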

Use

Please cite the dataset and the accompanying paper if you find the resources in this repository useful:

@inproceedings{ponza2021,
     title = {Contextualizing Trending Entities in News Stories},
     author = {Ponza, Marco and Ceccarelli, Diego and Ferragina, Paolo and Meij, Edgar and Kothari, Sambhav},
     booktitle = {Proceedings of the 14th ACM International Conference on Web Search and Data Mining},
     year = {2021},
}

Files

contextualizing-trending-entities.zip (19.0 MB)
md5:3cc71b7e1461637531143549f6c4c5ba
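
After downloading, the archive can be checked against the MD5 checksum listed above. This is a minimal sketch using Python's standard `hashlib`; the function names `md5_of_file` and `verify_archive` and the local file path are assumptions for illustration.

```python
import hashlib

# Checksum as listed for contextualizing-trending-entities.zip.
EXPECTED_MD5 = "3cc71b7e1461637531143549f6c4c5ba"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected
```

Calling `verify_archive("contextualizing-trending-entities.zip")` on the downloaded file (assuming it was saved under that name in the working directory) returns `True` only when the download is intact.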