Dataset Open Access

Temporally-Informed Analysis of Named Entity Recognition

Rijhwani, Shruti; Preoțiuc-Pietro, Daniel

DataCite XML Export

<?xml version='1.0' encoding='utf-8'?>
<resource xmlns:xsi="" xmlns="" xsi:schemaLocation="">
  <identifier identifierType="DOI">10.5281/zenodo.3899040</identifier>
      <creatorName>Rijhwani, Shruti</creatorName>
      <creatorName>Preoțiuc-Pietro, Daniel</creatorName>
    <title>Temporally-Informed Analysis of Named Entity Recognition</title>
    <subject>named entity recognition</subject>
    <subject>twitter ner</subject>
    <subject>temporal analysis</subject>
    <subject>information extraction</subject>
    <date dateType="Issued">2020-06-17</date>
  <resourceType resourceTypeGeneral="Dataset"/>
    <alternateIdentifier alternateIdentifierType="url"></alternateIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.3899039</relatedIdentifier>
    <rights rightsURI="">Creative Commons Attribution 4.0 International</rights>
    <rights rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights>
    <description descriptionType="Abstract">&lt;p&gt;This repository contains the data set developed for the paper:&lt;/p&gt;

&lt;p&gt;&amp;ldquo;Shruti Rijhwani and Daniel Preoțiuc-Pietro. &lt;em&gt;Temporally-Informed Analysis of Named Entity Recognition.&lt;/em&gt; In Proceedings of the Association for Computational Linguistics (ACL). 2020.&amp;rdquo;&lt;/p&gt;

&lt;p&gt;It includes 12,000 tweets annotated for the named entity recognition task. The tweets are uniformly distributed over the years 2014-2019, with 2,000 tweets from each year. The goal is to have a temporally diverse corpus to account for data drift over time when building NER models.&lt;/p&gt;

&lt;p&gt;The entity types annotated are locations (LOC), persons (PER) and organizations (ORG). The tweets are preprocessed to replace usernames and URLs with a unique token. Hashtags are left intact and can be annotated as named entities.&lt;/p&gt;


&lt;p&gt;The repository contains the annotations in JSON format.&lt;/p&gt;

&lt;p&gt;Each per-year file contains the tweet IDs along with token-level annotations. The Public Twitter Search API (&lt;a href=""&gt;&lt;/a&gt;) can be used to extract the text of the tweet corresponding to each tweet ID.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Splits&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Typically, NER models are trained and evaluated on annotations available at model-building time, but are used to make predictions on data from a future time period. This setup makes the model susceptible to temporal data drift, leading to lower performance on future data than on the test set.&lt;/p&gt;

&lt;p&gt;To examine this effect, we use tweets from the years 2014-2018 as the training set and random splits of the 2019 tweets as the development and test sets. These splits simulate the scenario of making predictions on data from a future time period.&lt;/p&gt;

&lt;p&gt;The development and test splits are provided in JSON format.&lt;/p&gt;


&lt;p&gt;Please cite the data set and the accompanying paper if you find the resources in this repository useful.&lt;/p&gt;</description>