Dataset Open Access
Rijhwani, Shruti; Preoțiuc-Pietro, Daniel
This repository contains the data set developed for the paper:
“Shruti Rijhwani and Daniel Preoțiuc-Pietro. Temporally-Informed Analysis of Named Entity Recognition. In Proceedings of the Association for Computational Linguistics (ACL). 2020.”
It includes 12,000 tweets annotated for the named entity recognition task. The tweets are uniformly distributed over the years 2014-2019, with 2,000 tweets from each year. The goal is to have a temporally diverse corpus to account for data drift over time when building NER models.
The entity types annotated are locations (LOC), persons (PER) and organizations (ORG). The tweets are preprocessed to replace usernames and URLs with a unique token. Hashtags are left intact and can be annotated as named entities.
The repository contains the annotations in JSON format.
Each year-wise file has the tweet IDs along with token-level annotations. The Public Twitter Search API (https://developer.twitter.com/en/docs/tweets/search) can be used extract the text for the tweet corresponding to the tweet IDs.
Typically, NER models are trained and evaluated on annotations available at the model building time, but are used to make predictions on data from a future time period. This setup makes the model susceptible to temporal data drift, leading to lower performance on future data as compared to the test set.
To examine this effect, we use tweets from the years 2014-2018 as the training set and random splits of the 2019 tweets as the development and test sets. These splits simulate the scenario of making predictions on data from a future time period.
The development and test splits are provided in the JSON format.
Please cite the data set and the accompanying paper if you found the resources in this repository useful.