Dataset Open Access

Temporally-Informed Analysis of Named Entity Recognition

Rijhwani, Shruti; Preoțiuc-Pietro, Daniel


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nmm##2200000uu#4500</leader>
  <datafield tag="041" ind1=" " ind2=" ">
    <subfield code="a">eng</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">named entity recognition</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">twitter</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">ner</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">twitter ner</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">tweets</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">temporal analysis</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">information extraction</subfield>
  </datafield>
  <controlfield tag="005">20200617221822.0</controlfield>
  <controlfield tag="001">3899040</controlfield>
  <datafield tag="711" ind1=" " ind2=" ">
    <subfield code="d">5-10 July 2020</subfield>
    <subfield code="g">ACL2020</subfield>
    <subfield code="a">The 58th Annual Meeting of the Association for Computational Linguistics</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Bloomberg</subfield>
    <subfield code="a">Preoțiuc-Pietro, Daniel</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">185283</subfield>
    <subfield code="z">md5:ba79cfa2ec554a7bc40241f86b344280</subfield>
    <subfield code="u">https://zenodo.org/record/3899040/files/temporal-ner-twitter-corpus.zip</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="y">Conference website</subfield>
    <subfield code="u">https://acl2020.org</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2020-06-17</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire_data</subfield>
    <subfield code="o">oai:zenodo.org:3899040</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">Bloomberg</subfield>
    <subfield code="a">Rijhwani, Shruti</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Temporally-Informed Analysis of Named Entity Recognition</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">https://creativecommons.org/licenses/by/4.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;This repository contains the data set developed for the paper:&lt;/p&gt;

&lt;p&gt;&amp;ldquo;Shruti Rijhwani and Daniel Preoțiuc-Pietro. &lt;em&gt;Temporally-Informed Analysis of Named Entity Recognition.&lt;/em&gt; In Proceedings of the Association for Computational Linguistics (ACL). 2020.&amp;rdquo;&lt;/p&gt;

&lt;p&gt;It includes 12,000 tweets annotated for the named entity recognition task. The tweets are uniformly distributed over the years 2014-2019, with 2,000 tweets from each year. The goal is to have a temporally diverse corpus to account for data drift over time when building NER models.&lt;/p&gt;

&lt;p&gt;The entity types annotated are locations (LOC), persons (PER) and organizations (ORG). The tweets are preprocessed to replace usernames and URLs with a unique token. Hashtags are left intact and can be annotated as named entities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Format&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The repository contains the annotations in JSON format.&lt;/p&gt;

&lt;p&gt;Each year-wise file has the tweet IDs along with token-level annotations. The Public Twitter Search API (&lt;a href="https://developer.twitter.com/en/docs/tweets/search"&gt;https://developer.twitter.com/en/docs/tweets/search&lt;/a&gt;) can be used to extract the text of the tweet corresponding to each tweet ID.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Splits&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Typically, NER models are trained and evaluated on annotations available at the model building time, but are used to make predictions on data from a future time period. This setup makes the model susceptible to temporal data drift, leading to lower performance on future data as compared to the test set.&lt;/p&gt;

&lt;p&gt;To examine this effect, we use tweets from the years 2014-2018 as the training set and random splits of the 2019 tweets as the development and test sets. These splits simulate the scenario of making predictions on data from a future time period.&lt;/p&gt;

&lt;p&gt;The development and test splits are provided in JSON format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Please cite the data set and the accompanying paper if you find the resources in this repository useful.&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.3899039</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.3899040</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
</record>