Dataset Open Access

Temporally-Informed Analysis of Named Entity Recognition

Rijhwani, Shruti; Preoțiuc-Pietro, Daniel


Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator>Rijhwani, Shruti</dc:creator>
  <dc:creator>Preoțiuc-Pietro, Daniel</dc:creator>
  <dc:date>2020-06-17</dc:date>
  <dc:description>This repository contains the data set developed for the paper:

“Shruti Rijhwani and Daniel Preoțiuc-Pietro. Temporally-Informed Analysis of Named Entity Recognition. In Proceedings of the Association for Computational Linguistics (ACL). 2020.”

It includes 12,000 tweets annotated for the named entity recognition task. The tweets are uniformly distributed over the years 2014-2019, with 2,000 tweets from each year. The goal is to have a temporally diverse corpus to account for data drift over time when building NER models.

The entity types annotated are locations (LOC), persons (PER) and organizations (ORG). The tweets are preprocessed to replace usernames and URLs with a unique token. Hashtags are left intact and can be annotated as named entities.

Format

The repository contains the annotations in JSON format.

Each year-wise file has the tweet IDs along with token-level annotations. The Public Twitter Search API (https://developer.twitter.com/en/docs/tweets/search) can be used to extract the text of the tweet corresponding to each tweet ID.

Data Splits

Typically, NER models are trained and evaluated on annotations available at the model building time, but are used to make predictions on data from a future time period. This setup makes the model susceptible to temporal data drift, leading to lower performance on future data as compared to the test set.

To examine this effect, we use tweets from the years 2014-2018 as the training set and random splits of the 2019 tweets as the development and test sets. These splits simulate the scenario of making predictions on data from a future time period.

The development and test splits are provided in the JSON format.
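
The split logic described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used to produce the released splits: the function name `temporal_split`, the `dev_fraction` and `seed` parameters, and the toy tweet IDs are hypothetical, and the description does not state the actual development/test ratio. The provided JSON splits should be used for comparable results.

```python
import random


def temporal_split(tweets_by_year, future_year=2019, dev_fraction=0.5, seed=0):
    """Train on years before `future_year`; randomly split `future_year` into dev and test.

    `dev_fraction` and `seed` are illustrative assumptions, not values from the data set.
    """
    # Training set: every tweet from a year earlier than the held-out year.
    train = [
        tweet
        for year in sorted(tweets_by_year)
        if future_year > year
        for tweet in tweets_by_year[year]
    ]

    # Held-out year: shuffle a copy reproducibly, then cut it into dev and test.
    future = list(tweets_by_year[future_year])
    random.Random(seed).shuffle(future)
    cut = int(len(future) * dev_fraction)
    return train, future[:cut], future[cut:]


# Toy stand-in for the corpus: four placeholder IDs per year, 2014 through 2019.
toy = {year: [f"{year}-{i}" for i in range(4)] for year in range(2014, 2020)}
train, dev, test = temporal_split(toy)
print(len(train), len(dev), len(test))  # 20 2 2
```

Keeping the whole held-out year out of training is what makes the evaluation simulate prediction on a future time period; the random cut only decides which 2019 tweets are used for tuning and which for final testing.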

Use

Please cite the data set and the accompanying paper if you find the resources in this repository useful.</dc:description>
  <dc:identifier>https://zenodo.org/record/3899040</dc:identifier>
  <dc:identifier>10.5281/zenodo.3899040</dc:identifier>
  <dc:identifier>oai:zenodo.org:3899040</dc:identifier>
  <dc:language>eng</dc:language>
  <dc:relation>doi:10.5281/zenodo.3899039</dc:relation>
  <dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
  <dc:rights>https://creativecommons.org/licenses/by/4.0/legalcode</dc:rights>
  <dc:subject>named entity recognition</dc:subject>
  <dc:subject>twitter</dc:subject>
  <dc:subject>ner</dc:subject>
  <dc:subject>twitter ner</dc:subject>
  <dc:subject>tweets</dc:subject>
  <dc:subject>temporal analysis</dc:subject>
  <dc:subject>information extraction</dc:subject>
  <dc:title>Temporally-Informed Analysis of Named Entity Recognition</dc:title>
  <dc:type>info:eu-repo/semantics/other</dc:type>
  <dc:type>dataset</dc:type>
</oai_dc:dc>
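
A record in this export format can be read programmatically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `read_dc_fields` and the shortened `SAMPLE` record (a subset of the elements in the export above) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:dc declaration in the export.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Shortened copy of the record, kept small for the example.
SAMPLE = """<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
  <dc:creator>Rijhwani, Shruti</dc:creator>
  <dc:creator>Preoțiuc-Pietro, Daniel</dc:creator>
  <dc:date>2020-06-17</dc:date>
  <dc:identifier>https://zenodo.org/record/3899040</dc:identifier>
  <dc:identifier>10.5281/zenodo.3899040</dc:identifier>
  <dc:title>Temporally-Informed Analysis of Named Entity Recognition</dc:title>
</oai_dc:dc>"""


def read_dc_fields(xml_text, fields=("title", "creator", "date", "identifier")):
    """Return a dict mapping each Dublin Core field name to a list of its values."""
    root = ET.fromstring(xml_text)
    record = {}
    for field in fields:
        # Dublin Core elements may repeat, so every field is collected as a list.
        record[field] = [
            el.text.strip()
            for el in root.findall(f"dc:{field}", DC_NS)
            if el.text
        ]
    return record


record = read_dc_fields(SAMPLE)
print(record["creator"])
print(record["identifier"])
```

Every field is returned as a list because elements such as `dc:creator` and `dc:identifier` appear more than once in the record.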
Statistic          All versions   This version
Views              214            214
Downloads          48             48
Data volume        8.9 MB         8.9 MB
Unique views       196            196
Unique downloads   48             48