{
  "access": {
    "embargo": {
      "active": false,
      "reason": null
    },
    "files": "public",
    "record": "public",
    "status": "open"
  },
  "created": "2020-06-17T16:56:52.272923+00:00",
  "custom_fields": {
    "meeting:meeting": {
      "acronym": "ACL2020",
      "dates": "5-10 July 2020",
      "title": "The 58th Annual Meeting of the Association for Computational Linguistics",
      "url": "https://acl2020.org"
    }
  },
  "deletion_status": {
    "is_deleted": false,
    "status": "P"
  },
  "files": {
    "count": 1,
    "enabled": true,
    "entries": {
      "temporal-ner-twitter-corpus.zip": {
        "checksum": "md5:ba79cfa2ec554a7bc40241f86b344280",
        "ext": "zip",
        "id": "513db860-ecb4-4133-a790-5a154303ec85",
        "key": "temporal-ner-twitter-corpus.zip",
        "metadata": null,
        "mimetype": "application/zip",
        "size": 185283
      }
    },
    "order": [],
    "total_bytes": 185283
  },
  "id": "3899040",
  "is_draft": false,
  "is_published": true,
  "links": {
    "access": "https://zenodo.org/api/records/3899040/access",
    "access_grants": "https://zenodo.org/api/records/3899040/access/grants",
    "access_links": "https://zenodo.org/api/records/3899040/access/links",
    "access_request": "https://zenodo.org/api/records/3899040/access/request",
    "access_users": "https://zenodo.org/api/records/3899040/access/users",
    "archive": "https://zenodo.org/api/records/3899040/files-archive",
    "archive_media": "https://zenodo.org/api/records/3899040/media-files-archive",
    "communities": "https://zenodo.org/api/records/3899040/communities",
    "communities-suggestions": "https://zenodo.org/api/records/3899040/communities-suggestions",
    "doi": "https://doi.org/10.5281/zenodo.3899040",
    "draft": "https://zenodo.org/api/records/3899040/draft",
    "files": "https://zenodo.org/api/records/3899040/files",
    "latest": "https://zenodo.org/api/records/3899040/versions/latest",
    "latest_html": "https://zenodo.org/records/3899040/latest",
    "media_files": "https://zenodo.org/api/records/3899040/media-files",
    "parent": "https://zenodo.org/api/records/3899039",
    "parent_doi": "https://zenodo.org/doi/10.5281/zenodo.3899039",
    "parent_html": "https://zenodo.org/records/3899039",
    "requests": "https://zenodo.org/api/records/3899040/requests",
    "reserve_doi": "https://zenodo.org/api/records/3899040/draft/pids/doi",
    "self": "https://zenodo.org/api/records/3899040",
    "self_doi": "https://zenodo.org/doi/10.5281/zenodo.3899040",
    "self_html": "https://zenodo.org/records/3899040",
    "self_iiif_manifest": "https://zenodo.org/api/iiif/record:3899040/manifest",
    "self_iiif_sequence": "https://zenodo.org/api/iiif/record:3899040/sequence/default",
    "versions": "https://zenodo.org/api/records/3899040/versions"
  },
  "media_files": {
    "count": 0,
    "enabled": false,
    "entries": {},
    "order": [],
    "total_bytes": 0
  },
  "metadata": {
    "creators": [
      {
        "affiliations": [
          {
            "name": "Bloomberg"
          }
        ],
        "person_or_org": {
          "family_name": "Rijhwani",
          "given_name": "Shruti",
          "name": "Rijhwani, Shruti",
          "type": "personal"
        }
      },
      {
        "affiliations": [
          {
            "name": "Bloomberg"
          }
        ],
        "person_or_org": {
          "family_name": "Preo\u021biuc-Pietro",
          "given_name": "Daniel",
          "name": "Preo\u021biuc-Pietro, Daniel",
          "type": "personal"
        }
      }
    ],
    "description": "<p>This repository contains the data set developed for the paper:</p>\n\n<p>&ldquo;Shruti Rijhwani and Daniel Preo\u021biuc-Pietro. <em>Temporally-Informed Analysis of Named Entity Recognition.</em> In Proceedings of the Association for Computational Linguistics (ACL). 2020.&rdquo;</p>\n\n<p>It includes 12,000 tweets annotated for the named entity recognition task. The tweets are uniformly distributed over the years 2014-2019, with 2,000 tweets from each year. The goal is to have a temporally diverse corpus to account for data drift over time when building NER models.</p>\n\n<p>The entity types annotated are locations (LOC), persons (PER) and organizations (ORG). The tweets are preprocessed to replace usernames and URLs with a unique token. Hashtags are left intact and can be annotated as named entities.</p>\n\n<p><strong>Format</strong></p>\n\n<p>The repository contains the annotations in JSON format.</p>\n\n<p>Each year-wise file has the tweet IDs along with token-level annotations. The Public Twitter Search API (<a href=\"https://developer.twitter.com/en/docs/tweets/search\">https://developer.twitter.com/en/docs/tweets/search</a>) can be used extract the text for the tweet corresponding to the tweet IDs.</p>\n\n<p><strong>Data Splits</strong></p>\n\n<p>Typically, NER models are trained and evaluated on annotations available at the model building time, but are used to make predictions on data from a future time period. This setup makes the model susceptible to temporal data drift, leading to lower performance on future data as compared to the test set.</p>\n\n<p>To examine this effect, we use tweets from the years 2014-2018 as the training set and random splits of the 2019 tweets as the development and test sets. These splits simulate the scenario of making predictions on data from a future time period.</p>\n\n<p>The development and test splits are provided in the JSON format.</p>\n\n<p><strong>Use</strong></p>\n\n<p>Please cite the data set and the accompanying paper if you found the resources in this repository useful.</p>",
    "languages": [
      {
        "id": "eng",
        "title": {
          "en": "English"
        }
      }
    ],
    "publication_date": "2020-06-17",
    "publisher": "Zenodo",
    "resource_type": {
      "id": "dataset",
      "title": {
        "de": "Datensatz",
        "en": "Dataset"
      }
    },
    "rights": [
      {
        "description": {
          "en": "The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited."
        },
        "icon": "cc-by-icon",
        "id": "cc-by-4.0",
        "props": {
          "scheme": "spdx",
          "url": "https://creativecommons.org/licenses/by/4.0/legalcode"
        },
        "title": {
          "en": "Creative Commons Attribution 4.0 International"
        }
      }
    ],
    "subjects": [
      {
        "subject": "named entity recognition"
      },
      {
        "subject": "twitter"
      },
      {
        "subject": "ner"
      },
      {
        "subject": "twitter ner"
      },
      {
        "subject": "tweets"
      },
      {
        "subject": "temporal analysis"
      },
      {
        "subject": "information extraction"
      }
    ],
    "title": "Temporally-Informed Analysis of Named Entity Recognition"
  },
  "parent": {
    "access": {
      "owned_by": {
        "user": "70921"
      }
    },
    "communities": {},
    "id": "3899039",
    "pids": {
      "doi": {
        "client": "datacite",
        "identifier": "10.5281/zenodo.3899039",
        "provider": "datacite"
      }
    }
  },
  "pids": {
    "doi": {
      "client": "datacite",
      "identifier": "10.5281/zenodo.3899040",
      "provider": "datacite"
    },
    "oai": {
      "identifier": "oai:zenodo.org:3899040",
      "provider": "oai"
    }
  },
  "revision_id": 2,
  "stats": {
    "all_versions": {
      "data_volume": 52435089.0,
      "downloads": 283,
      "unique_downloads": 266,
      "unique_views": 1327,
      "views": 1437
    },
    "this_version": {
      "data_volume": 51879240.0,
      "downloads": 280,
      "unique_downloads": 263,
      "unique_views": 1315,
      "views": 1425
    }
  },
  "status": "published",
  "updated": "2020-06-17T22:18:22.795583+00:00",
  "versions": {
    "index": 1,
    "is_latest": true
  }
}