TweetNERD - End to End Entity Linking Benchmark for Tweets

Mishra, Shubhanshu; Saini, Aman; Makki, Raheleh; Mehta, Sneha; Haghighi, Aria; Mollahosseini, Ali

doi:10.5281/zenodo.6617192

Published June 6, 2022 | Version 0.0.0

Dataset Open

1. Twitter, Inc

Paper - Video - Neurips Page

This is the dataset described in the paper TweetNERD - End to End Entity Linking Benchmark for Tweets, accepted to the Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track.

Named Entity Recognition and Disambiguation (NERD) systems are foundational for information retrieval, question answering, event detection, and other natural language processing (NLP) applications. We introduce TweetNERD, a dataset of 340K+ Tweets spanning 2010-2021, for benchmarking NERD systems on Tweets. It is the largest and most temporally diverse open-source benchmark dataset for NERD on Tweets and is intended to facilitate research in this area.

The TweetNERD dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

The license only applies to the data files present in this dataset. See Data usage policy below.

Check out more details at https://github.com/twitter-research/TweetNERD

Usage

We provide the dataset split across the following tab-separated files:

OOD.public.tsv: OOD split of the data in the paper.
Academic.public.tsv: Academic split of the data described in the paper.
part_*.public.tsv: Remaining data split into parts in no particular order.

Each file is tab-separated and has the following format:

tweet_id	phrase	start	end	entityId	score
22	twttr	20	25	Q918	3
21	twttr	20	25	Q918	3
1457198399032287235	Diwali	30	38	Q10244	3
1232456079247736833	NO_PHRASE	-1	-1	NO_ENTITY	-1

For Tweets that have no entities, the phrase, start, end, entityId, and score columns are set to NO_PHRASE, -1, -1, NO_ENTITY, and -1 respectively.
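The format above can be read with any TSV parser. The following is a minimal sketch, assuming each file starts with the header row shown in the example; `Annotation`, `read_annotations`, and `SAMPLE` are illustrative names that are not part of the dataset or its repository. It keeps tweet_id as a string and maps the missing-value markers to `None`.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, Optional, TextIO


@dataclass
class Annotation:
    """One row of a TweetNERD TSV file (hypothetical helper type)."""
    tweet_id: str
    phrase: Optional[str]
    start: Optional[int]
    end: Optional[int]
    entity_id: Optional[str]
    score: Optional[int]


def read_annotations(handle: TextIO) -> Iterator[Annotation]:
    """Yield one Annotation per data row of a tab-separated file."""
    # Assumption: the first line is the header row shown in the example.
    # QUOTE_NONE leaves any quote characters inside phrases untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["entityId"] == "NO_ENTITY":
            # Tweet without any entity: all annotation fields are missing.
            yield Annotation(row["tweet_id"], None, None, None, None, None)
        else:
            yield Annotation(
                tweet_id=row["tweet_id"],
                phrase=row["phrase"],
                start=int(row["start"]),
                end=int(row["end"]),
                entity_id=row["entityId"],
                score=int(row["score"]),
            )


# Rows copied from the format example, used here instead of a real file.
SAMPLE = (
    "tweet_id\tphrase\tstart\tend\tentityId\tscore\n"
    "22\ttwttr\t20\t25\tQ918\t3\n"
    "1232456079247736833\tNO_PHRASE\t-1\t-1\tNO_ENTITY\t-1\n"
)

rows = list(read_annotations(io.StringIO(SAMPLE)))
for annotation in rows:
    print(annotation)
```

For a downloaded file, the same function can be applied to a handle opened with `open(path, encoding="utf-8", newline="")`; the UTF-8 file encoding is also an assumption.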

The file columns are described below:

Column	Type	Missing Value	Description
tweet_id	string		ID of the Tweet
phrase	string	NO_PHRASE	entity phrase
start	int	-1	start offset of the phrase in text using `UTF-16BE` encoding
end	int	-1	end offset of the phrase in the text using `UTF-16BE` encoding
entityId	string	NO_ENTITY	Entity ID. If not missing, it is NOT FOUND, AMBIGUOUS, or a Wikidata ID of the form Q{numbers}, e.g. Q918
score	int	-1	Number of annotators who agreed on the phrase, start, end, entityId information
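Because start and end are expressed in `UTF-16BE` encoding, they can differ from Python string indices whenever the Tweet text contains characters outside the Basic Multilingual Plane, such as many emoji. The sketch below shows one way to apply such offsets. It assumes that offsets are 0-based, that end is exclusive, and that they count 2-byte UTF-16 code units; the exclusive end is consistent with the twttr example row (25 - 20 = 5 characters) but is not stated explicitly. The function name `slice_utf16` and the sample text are invented for illustration.

```python
def slice_utf16(text: str, start: int, end: int) -> str:
    """Return the span [start, end) of text, with offsets in UTF-16 code units."""
    # UTF-16BE has no byte-order mark and uses 2 bytes per code unit,
    # so code-unit offsets map to byte offsets by doubling.
    encoded = text.encode("utf-16-be")
    return encoded[2 * start : 2 * end].decode("utf-16-be")


# Invented example text: the emoji occupies two UTF-16 code units,
# so UTF-16 offsets and Python string indices diverge after it.
text = "\U0001F389 hello twttr"

print(slice_utf16(text, 9, 14))  # prints "twttr"
print(text[9:14])                # prints "wttr" (plain indexing is off by one here)
```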

To use the dataset, retrieve the Tweet text for each value in the tweet_id column using the Twitter API (see the Data usage policy section below).

Data stats

Split	Number of rows	Number of unique Tweets
OOD	34102	25000
Academic	51685	30119
part_0	11830	10000
part_1	35681	25799
part_2	34256	25000
part_3	36478	25000
part_4	37518	24999
part_5	36626	25000
part_6	34001	24984
part_7	34125	24981
part_8	32556	25000
part_9	32657	25000
part_10	32442	25000
part_11	32033	24972
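The two counts in this table can be recomputed from a downloaded file. The following is a minimal sketch, assuming the header row is not counted as a row and that unique Tweets are distinct tweet_id values; `split_stats` is a hypothetical helper, and the second sample row for tweet 22 is invented to show how a Tweet with several annotations is counted.

```python
import csv
import io
from typing import TextIO, Tuple


def split_stats(handle: TextIO) -> Tuple[int, int]:
    """Return (number of data rows, number of distinct tweet_id values)."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    n_rows = 0
    tweet_ids = set()
    for row in reader:
        n_rows += 1
        tweet_ids.add(row["tweet_id"])  # IDs are kept as strings
    return n_rows, len(tweet_ids)


# Small in-memory sample; the "setting up" row is invented for illustration.
SAMPLE = (
    "tweet_id\tphrase\tstart\tend\tentityId\tscore\n"
    "22\ttwttr\t20\t25\tQ918\t3\n"
    "22\tsetting up\t5\t15\tNOT FOUND\t1\n"
    "1232456079247736833\tNO_PHRASE\t-1\t-1\tNO_ENTITY\t-1\n"
)

n_rows, n_tweets = split_stats(io.StringIO(SAMPLE))
print(n_rows, n_tweets)  # prints "3 2"
```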

Data usage policy

Use of this dataset is subject to you obtaining lawful access to the Twitter API, which requires you to agree to the Developer Terms Policies and Agreements.

Please cite the following if you use TweetNERD in your paper:

@dataset{TweetNERD_Zenodo_2022_6617192,
  author       = {Mishra, Shubhanshu and
                  Saini, Aman and
                  Makki, Raheleh and
                  Mehta, Sneha and
                  Haghighi, Aria and
                  Mollahosseini, Ali},
  title        = {{TweetNERD - End to End Entity Linking Benchmark 
                   for Tweets}},
  month        = jun,
  year         = 2022,
  note         = {{Data usage policy  Use of this dataset is subject 
                   to you obtaining lawful access to the [Twitter
                   API](https://developer.twitter.com/en/docs
                   /twitter-api), which requires you to agree to the
                   [Developer Terms Policies and
                   Agreements](https://developer.twitter.com/en
                   /developer-terms/).}},
  publisher    = {Zenodo},
  version      = {0.0.0},
  doi          = {10.5281/zenodo.6617192},
  url          = {https://doi.org/10.5281/zenodo.6617192}
}
@inproceedings{TweetNERDNeurips2022,
 author = {Mishra, Shubhanshu and Saini, Aman and Makki, Raheleh and Mehta, Sneha and Haghighi, Aria and Mollahosseini, Ali},
 booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
 pages = {},
 title = {TweetNERD - End to End Entity Linking Benchmark for Tweets},
 volume = {2},
 year = {2022},
 eprint = {arXiv:2210.08129},
 doi = {10.48550/arXiv.2210.08129}
}

Notes

Data usage policy: Use of this dataset is subject to you obtaining lawful access to the [Twitter API](https://developer.twitter.com/en/docs/twitter-api), which requires you to agree to the [Developer Terms Policies and Agreements](https://developer.twitter.com/en/developer-terms/).

Files (22.7 MB)

Name	MD5	Size
Academic.public.tsv	cfa247ab553e9c075e2bbb76d5205600	2.2 MB
OOD.public.tsv	7c7240692abe65b77e8e1be007e2c369	1.7 MB
part_0.public.tsv	939363c52f7a9b0d7641d18718612582	561.7 kB
part_1.public.tsv	e237d3d4ec9d25fce014f4121f97550e	1.8 MB
part_10.public.tsv	00a8fe795d3f2d97e0f314c2bdbbeaee	1.5 MB
part_11.public.tsv	7252f28baec0f6f23ed56cd6d125f125	1.5 MB
part_2.public.tsv	d6c05c5013341cabcd483405821648d5	1.6 MB
part_3.public.tsv	47c0ac603b3dc98c98ed9a3f6f382734	1.8 MB
part_4.public.tsv	a6cac2bf28df877c38218ea157ea94b3	1.8 MB
part_5.public.tsv	3bf9f5907197f38b5f4be3baad4ca839	1.8 MB
part_6.public.tsv	67e27bd542f1454ad43f36407708fd23	1.7 MB
part_7.public.tsv	e3666c4e7ca2493b9d7150a0f4ccd1f5	1.6 MB
part_8.public.tsv	1a4ea02be63206db2451ed4fea834e1c	1.5 MB
part_9.public.tsv	78ba7304cf2a47503b9db702f8ae7fd7	1.6 MB
README.md	0b9a4625539b4e35f162590cc25f1236	4.5 kB
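The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify`, the `EXPECTED` mapping (which copies only two entries from the list), and the assumption that files are saved under their listed names in a single local directory are all illustrative.

```python
import hashlib
from pathlib import Path
from typing import Dict, Union

# Two checksums copied from the file list above; extend as needed.
EXPECTED = {
    "Academic.public.tsv": "cfa247ab553e9c075e2bbb76d5205600",
    "OOD.public.tsv": "7c7240692abe65b77e8e1be007e2c369",
}


def md5_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory: Union[str, Path]) -> Dict[str, bool]:
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED.items():
        path = Path(directory) / name
        results[name] = path.exists() and md5_of_file(path) == expected
    return results
```

Calling `verify("data")`, where `data` is a hypothetical download directory, returns a dictionary in which a `False` value marks a file that is missing or whose checksum does not match.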

Additional details

Is cited by: Preprint: 10.48550/arXiv.2210.08129 (DOI)

