TweetNERD - End to End Entity Linking Benchmark for Tweets
Creators
- 1. Twitter, Inc
Description
This is the dataset described in the paper TweetNERD - End to End Entity Linking Benchmark for Tweets (accepted to the Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track).
Named Entity Recognition and Disambiguation (NERD) systems are foundational for information retrieval, question answering, event detection, and other natural language processing (NLP) applications. We introduce TweetNERD, a dataset of 340K+ Tweets spanning 2010-2021, for benchmarking NERD systems on Tweets. It is the largest and most temporally diverse open-source benchmark dataset for NERD on Tweets and can be used to facilitate research in this area.
The TweetNERD dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
The license only applies to the data files present in this dataset. See Data usage policy below.
Check out more details at https://github.com/twitter-research/TweetNERD
Usage
We provide the dataset split across the following tab-separated files:
- OOD.public.tsv: OOD split of the data in the paper.
- Academic.public.tsv: Academic split of the data described in the paper.
- part_*.public.tsv: Remaining data split into parts in no particular order.
Each file is tab-separated and has the following format:
tweet_id | phrase | start | end | entityId | score |
---|---|---|---|---|---|
22 | twttr | 20 | 25 | Q918 | 3 |
21 | twttr | 20 | 25 | Q918 | 3 |
1457198399032287235 | Diwali | 30 | 38 | Q10244 | 3 |
1232456079247736833 | NO_PHRASE | -1 | -1 | NO_ENTITY | -1 |
For Tweets that do not contain any entity, the column values for phrase, start, end, entityId, and score are set to NO_PHRASE, -1, -1, NO_ENTITY, and -1, respectively.
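The row format and sentinel values above can be read with a few lines of standard-library Python. The following is a minimal sketch, not an official loader: it assumes each file starts with a header row carrying the column names shown in the table, and the `Mention` type, `parse_rows` helper, and `SAMPLE` text are hypothetical names introduced only for illustration. Sentinel values are mapped to `None` so that entity-free Tweets are easy to detect.

```python
import csv
import io
from typing import Iterator, NamedTuple, Optional, TextIO


class Mention(NamedTuple):
    """One row of a TweetNERD file (hypothetical helper type)."""
    tweet_id: str
    phrase: Optional[str]
    start: Optional[int]
    end: Optional[int]
    entity_id: Optional[str]
    score: Optional[int]


def _int_or_none(value: str) -> Optional[int]:
    # -1 is the documented missing value for start, end, and score.
    number = int(value)
    return None if number == -1 else number


def parse_rows(handle: TextIO) -> Iterator[Mention]:
    # Assumption: the first line is a header with the column names above.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield Mention(
            tweet_id=row["tweet_id"],  # kept as a string, as documented
            phrase=None if row["phrase"] == "NO_PHRASE" else row["phrase"],
            start=_int_or_none(row["start"]),
            end=_int_or_none(row["end"]),
            entity_id=None if row["entityId"] == "NO_ENTITY" else row["entityId"],
            score=_int_or_none(row["score"]),
        )


# Two rows copied from the example table, joined with tabs.
SAMPLE = (
    "tweet_id\tphrase\tstart\tend\tentityId\tscore\n"
    "22\ttwttr\t20\t25\tQ918\t3\n"
    "1232456079247736833\tNO_PHRASE\t-1\t-1\tNO_ENTITY\t-1\n"
)
mentions = list(parse_rows(io.StringIO(SAMPLE)))
```

To read a real file, pass `open("OOD.public.tsv", encoding="utf-8", newline="")` instead of the in-memory sample. Keeping `tweet_id` as a string avoids precision loss in tools that would otherwise coerce 19-digit IDs to floating point.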
Description of file columns is as follows:
Column | Type | Missing Value | Description |
---|---|---|---|
tweet_id | string | | ID of the Tweet |
phrase | string | NO_PHRASE | entity phrase |
start | int | -1 | start offset of the phrase in text using UTF-16BE encoding |
end | int | -1 | end offset of the phrase in the text using UTF-16BE encoding |
entityId | string | NO_ENTITY | Entity ID. If not missing can be NOT FOUND, AMBIGUOUS, or Wikidata ID of format Q{numbers}, e.g. Q918 |
score | int | -1 | Number of annotators who agreed on the phrase, start, end, entityId information |
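Because start and end are measured in UTF-16 code units, they do not always match Python string indices, which count Unicode code points. The sketch below shows one way to recover a phrase from Tweet text. It assumes offsets are 0-based with an exclusive end; that convention is not stated above and should be checked against real data. The `slice_utf16` helper and the example text are hypothetical.

```python
def slice_utf16(text: str, start: int, end: int) -> str:
    """Return text[start:end] where offsets count UTF-16 code units.

    Assumption: offsets are 0-based and `end` is exclusive.
    """
    encoded = text.encode("utf-16-be")  # two bytes per code unit, no BOM
    return encoded[2 * start : 2 * end].decode("utf-16-be")


# Hypothetical Tweet text. The leading emoji (U+1F389) lies outside the
# Basic Multilingual Plane, so it occupies two UTF-16 code units but only
# one Python string index.
text = "\U0001F389 Happy Diwali everyone"

utf16_phrase = slice_utf16(text, 9, 15)  # offsets in UTF-16 code units
naive_phrase = text[9:15]                # same numbers as code point indices
```

Here `utf16_phrase` is `"Diwali"`, while the naive slice is shifted by one character. For text made only of Basic Multilingual Plane characters, both approaches give the same result.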
To use the dataset, retrieve the text of each Tweet from the Twitter API using the tweet_id column (see the Data usage policy section below).
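Since a Tweet with several entities appears on several rows, it helps to deduplicate IDs and group them into batches before requesting text. The sketch below makes no network calls and does not show any real client code. The `batch_tweet_ids` helper is hypothetical, and the default batch size of 100 is an assumption about the per-request limit of the lookup endpoint you use; confirm it against the current Twitter API documentation.

```python
from typing import Iterable, List


def batch_tweet_ids(tweet_ids: Iterable[str], batch_size: int = 100) -> List[List[str]]:
    """Deduplicate Tweet IDs (keeping first-seen order) and split them into batches.

    Assumption: `batch_size` matches the per-request ID limit of the
    lookup endpoint being called; 100 is a placeholder default.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    unique_ids = list(dict.fromkeys(tweet_ids))  # dicts preserve insertion order
    return [
        unique_ids[i : i + batch_size]
        for i in range(0, len(unique_ids), batch_size)
    ]


# Example: the IDs from the sample rows, with one ID repeated.
batches = batch_tweet_ids(
    ["22", "21", "1457198399032287235", "22", "1232456079247736833"],
    batch_size=2,
)
```

Each inner list can then be passed to whichever API client you have lawful access to, and the returned text joined back to the rows by `tweet_id`.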
Data stats
Split | Number of rows | Number of unique Tweets |
---|---|---|
OOD | 34102 | 25000 |
Academic | 51685 | 30119 |
part_0 | 11830 | 10000 |
part_1 | 35681 | 25799 |
part_2 | 34256 | 25000 |
part_3 | 36478 | 25000 |
part_4 | 37518 | 24999 |
part_5 | 36626 | 25000 |
part_6 | 34001 | 24984 |
part_7 | 34125 | 24981 |
part_8 | 32556 | 25000 |
part_9 | 32657 | 25000 |
part_10 | 32442 | 25000 |
part_11 | 32033 | 24972 |
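The counts in this table can be checked against a downloaded file, and the per-split figures can be totalled. The sketch below is illustrative only: `count_rows_and_tweets` is a hypothetical helper that assumes a header row, and `TABLE` simply copies the numbers above. Summing the unique-Tweet column gives the overall number of Tweets only under the assumption that the splits do not share Tweets.

```python
import csv
import io
from typing import TextIO, Tuple


def count_rows_and_tweets(handle: TextIO) -> Tuple[int, int]:
    """Return (number of data rows, number of distinct tweet_id values)."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = 0
    tweet_ids = set()
    for row in reader:
        rows += 1
        tweet_ids.add(row["tweet_id"])
    return rows, len(tweet_ids)


# (rows, unique Tweets) per split, copied from the table above.
TABLE = {
    "OOD": (34102, 25000), "Academic": (51685, 30119),
    "part_0": (11830, 10000), "part_1": (35681, 25799),
    "part_2": (34256, 25000), "part_3": (36478, 25000),
    "part_4": (37518, 24999), "part_5": (36626, 25000),
    "part_6": (34001, 24984), "part_7": (34125, 24981),
    "part_8": (32556, 25000), "part_9": (32657, 25000),
    "part_10": (32442, 25000), "part_11": (32033, 24972),
}
total_rows = sum(rows for rows, _ in TABLE.values())
total_tweets = sum(tweets for _, tweets in TABLE.values())

# Tiny in-memory check of the counting helper: two rows for one Tweet.
sample = io.StringIO(
    "tweet_id\tphrase\tstart\tend\tentityId\tscore\n"
    "22\ttwttr\t20\t25\tQ918\t3\n"
    "22\tsetting\t5\t12\tNOT FOUND\t2\n"
)
sample_counts = count_rows_and_tweets(sample)
```

With the table values, `total_rows` is 475990 and `total_tweets` is 340854, which is consistent with the 340K+ Tweets stated in the description.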
Data usage policy
Use of this dataset is subject to you obtaining lawful access to the Twitter API, which requires you to agree to the Developer Terms Policies and Agreements.
Please cite the following if you use TweetNERD in your paper:
@dataset{TweetNERD_Zenodo_2022_6617192,
  author    = {Mishra, Shubhanshu and Saini, Aman and Makki, Raheleh and Mehta, Sneha and Haghighi, Aria and Mollahosseini, Ali},
  title     = {{TweetNERD - End to End Entity Linking Benchmark for Tweets}},
  month     = jun,
  year      = 2022,
  note      = {{Data usage policy Use of this dataset is subject to you obtaining lawful access to the [Twitter API](https://developer.twitter.com/en/docs/twitter-api), which requires you to agree to the [Developer Terms Policies and Agreements](https://developer.twitter.com/en/developer-terms/).}},
  publisher = {Zenodo},
  version   = {0.0.0},
  doi       = {10.5281/zenodo.6617192},
  url       = {https://doi.org/10.5281/zenodo.6617192}
}

@inproceedings{TweetNERDNeurips2022,
  author    = {Mishra, Shubhanshu and Saini, Aman and Makki, Raheleh and Mehta, Sneha and Haghighi, Aria and Mollahosseini, Ali},
  booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
  pages     = {},
  title     = {TweetNERD - End to End Entity Linking Benchmark for Tweets},
  volume    = {2},
  year      = {2022},
  eprint    = {arXiv:2210.08129},
  doi       = {10.48550/arXiv.2210.08129}
}
Additional details
Related works
- Is cited by
- Preprint: 10.48550/arXiv.2210.08129 (DOI)