Dataset Open Access

Temporally-Informed Analysis of Named Entity Recognition

Rijhwani, Shruti; Preoțiuc-Pietro, Daniel


DataCite XML Export

<?xml version='1.0' encoding='utf-8'?>
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4.1/metadata.xsd">
  <identifier identifierType="DOI">10.5281/zenodo.3899040</identifier>
  <creators>
    <creator>
      <creatorName>Rijhwani, Shruti</creatorName>
      <givenName>Shruti</givenName>
      <familyName>Rijhwani</familyName>
      <affiliation>Bloomberg</affiliation>
    </creator>
    <creator>
      <creatorName>Preoțiuc-Pietro, Daniel</creatorName>
      <givenName>Daniel</givenName>
      <familyName>Preoțiuc-Pietro</familyName>
      <affiliation>Bloomberg</affiliation>
    </creator>
  </creators>
  <titles>
    <title>Temporally-Informed Analysis of Named Entity Recognition</title>
  </titles>
  <publisher>Zenodo</publisher>
  <publicationYear>2020</publicationYear>
  <subjects>
    <subject>named entity recognition</subject>
    <subject>twitter</subject>
    <subject>ner</subject>
    <subject>twitter ner</subject>
    <subject>tweets</subject>
    <subject>temporal analysis</subject>
    <subject>information extraction</subject>
  </subjects>
  <dates>
    <date dateType="Issued">2020-06-17</date>
  </dates>
  <language>en</language>
  <resourceType resourceTypeGeneral="Dataset"/>
  <alternateIdentifiers>
    <alternateIdentifier alternateIdentifierType="url">https://zenodo.org/record/3899040</alternateIdentifier>
  </alternateIdentifiers>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.3899039</relatedIdentifier>
  </relatedIdentifiers>
  <rightsList>
    <rights rightsURI="https://creativecommons.org/licenses/by/4.0/legalcode">Creative Commons Attribution 4.0 International</rights>
    <rights rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights>
  </rightsList>
  <descriptions>
    <description descriptionType="Abstract">&lt;p&gt;This repository contains the data set developed for the paper:&lt;/p&gt;

&lt;p&gt;&amp;ldquo;Shruti Rijhwani and Daniel Preoțiuc-Pietro. &lt;em&gt;Temporally-Informed Analysis of Named Entity Recognition.&lt;/em&gt; In Proceedings of the Association for Computational Linguistics (ACL). 2020.&amp;rdquo;&lt;/p&gt;

&lt;p&gt;It includes 12,000 tweets annotated for the named entity recognition task. The tweets are uniformly distributed over the years 2014-2019, with 2,000 tweets from each year. The goal is to have a temporally diverse corpus to account for data drift over time when building NER models.&lt;/p&gt;

&lt;p&gt;The entity types annotated are locations (LOC), persons (PER) and organizations (ORG). The tweets are preprocessed to replace usernames and URLs with a unique token. Hashtags are left intact and can be annotated as named entities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Format&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The repository contains the annotations in JSON format.&lt;/p&gt;

&lt;p&gt;Each year-wise file has the tweet IDs along with token-level annotations. The Public Twitter Search API (&lt;a href="https://developer.twitter.com/en/docs/tweets/search"&gt;https://developer.twitter.com/en/docs/tweets/search&lt;/a&gt;) can be used to extract the text of the tweets corresponding to the tweet IDs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Splits&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Typically, NER models are trained and evaluated on annotations available at the model building time, but are used to make predictions on data from a future time period. This setup makes the model susceptible to temporal data drift, leading to lower performance on future data as compared to the test set.&lt;/p&gt;

&lt;p&gt;To examine this effect, we use tweets from the years 2014-2018 as the training set and random splits of the 2019 tweets as the development and test sets. These splits simulate the scenario of making predictions on data from a future time period.&lt;/p&gt;

&lt;p&gt;The development and test splits are provided in JSON format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Please cite the data set and the accompanying paper if you find the resources in this repository useful.&lt;/p&gt;</description>
  </descriptions>
</resource>
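
The temporal split described in the abstract above (2014-2018 for training, random splits of 2019 for development and test) can be sketched in Python as follows. This is only an illustration under stated assumptions: the record does not specify the JSON schema, so the `tweet_id` and `year` field names, the synthetic placeholder records, the 50/50 development/test ratio, and the random seed are all hypothetical and not taken from the data set or the paper.

```python
import random
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def temporal_split(
    records: List[Record],
    future_year: int = 2019,
    dev_fraction: float = 0.5,  # assumed ratio; the record does not state one
    seed: int = 0,              # assumed seed, only for reproducibility
) -> Tuple[List[Record], List[Record], List[Record]]:
    """Train on years before `future_year`; split that year into dev and test."""
    train = [r for r in records if r["year"] < future_year]
    future = [r for r in records if r["year"] == future_year]

    # Shuffle the future-year records so dev and test are random splits.
    rng = random.Random(seed)
    rng.shuffle(future)

    cut = int(len(future) * dev_fraction)
    return train, future[:cut], future[cut:]


# Synthetic stand-in for the corpus: 2,000 placeholder records per year,
# 2014-2019. The field names are hypothetical, not the real file schema.
records = [
    {"tweet_id": f"{year}-{i}", "year": year}
    for year in range(2014, 2020)
    for i in range(2000)
]

train, dev, test = temporal_split(records)
print(len(train), len(dev), len(test))  # 10000 1000 1000
```

Evaluating on the 2019 splits rather than on a held-out portion of 2014-2018 is what simulates predicting on data from a future time period. The development and test splits shipped with the data set should be preferred over re-deriving them this way.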