Dataset for Named Entity Recognition and Entity Linking from Greek Wikipedia Events
Authors/Creators
- 1. Foundation for Research and Technology (FORTH)
- 2. Foundation for Research and Technology (FORTH)
Description
An automated benchmark dataset for Named Entity Recognition (NER) and Named Entity Linking (NEL) tools, based on Greek Wikipedia event pages.
Note: This dataset includes data from the following source:
- Wikipedia (el.wikipedia.org)
The dataset is provided as three JSON-formatted subsets (train, validation and test), split in a 70-20-10 ratio. The current version contains 18,617 events annotated with 40,798 entity mentions and 36,189 links to Greek Wikipedia (elwiki) pages, together with their Wikidata IDs. The annotations cover 8 entity types: person, organization, location, GPE (geopolitical entity), event, facility, product and work of art.
| | Docs | Tokens | Sentences | Surface Mentions | Valid Links | Red Links |
|---|---|---|---|---|---|---|
| Train | 13,031 | 332,077 | 16,927 | 28,593 | 25,365 | 3,228 |
| Validation | 3,722 | 94,746 | 4,844 | 8,168 | 7,240 | 928 |
| Test | 1,862 | 47,450 | 2,427 | 4,037 | 3,584 | 453 |
| Total | 18,617 | 474,361 | 24,200 | 40,798 | 36,189 | 4,609 |
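The figures in the table can be sanity-checked with a few lines of arithmetic. The sketch below hard-codes the per-split counts, reading the first value of each row as the number of documents and the last three as surface mentions, valid links and red links. It checks that valid and red links add up to the surface mentions, and computes each split's share of the 18,617 events stated in the description. The function names are illustrative and not part of the dataset's code.

```python
# Counts copied from the table rows: (docs, surface mentions, valid links, red links).
SPLITS = {
    "train": (13_031, 28_593, 25_365, 3_228),
    "validation": (3_722, 8_168, 7_240, 928),
    "test": (1_862, 4_037, 3_584, 453),
}
TOTAL_DOCS = 18_617  # total number of events stated in the description


def links_consistent(mentions, valid, red):
    """Return True when valid and red links together account for all mentions."""
    return valid + red == mentions


def doc_share(docs, total=TOTAL_DOCS):
    """Return a split's share of all documents, in percent, to one decimal."""
    return round(100 * docs / total, 1)


if __name__ == "__main__":
    for name, (docs, mentions, valid, red) in SPLITS.items():
        print(name, doc_share(docs), links_consistent(mentions, valid, red))
```

The counts are consistent with each surface mention carrying either a valid link or a red link, and the document shares round to the stated 70-20-10 split.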
Example
An example record is given below.
{
  "json_file": "February 2012_39_0 events",
  "text": "Sudan and South Sudan sign non-aggression pact.",
  "ground_truth_mentions": [
    {"start": 0, "end": 4, "surface_mention": "Sudan", "mention_type": "GPE"},
    {"start": 10, "end": 20, "surface_mention": "South Sudan", "mention_type": "GPE"}
  ],
  "ground_truth_links": [
    {"enwiki": "Sudan", "wikidata": "Q1049"},
    {"enwiki": "South_Sudan", "wikidata": "Q958"}
  ]
}
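A minimal sketch of how such a record might be read in Python is shown here. It relies on two assumptions inferred from the example record alone, not from any documented specification: the `end` offset is the index of the last character of the mention (inclusive), and the entries of `ground_truth_links` are aligned by position with those of `ground_truth_mentions`. The helper names `mention_text` and `linked_mentions` are hypothetical.

```python
import json

# The example record, embedded as a string for illustration.
RECORD = json.loads("""
{
  "json_file": "February 2012_39_0 events",
  "text": "Sudan and South Sudan sign non-aggression pact.",
  "ground_truth_mentions": [
    {"start": 0, "end": 4, "surface_mention": "Sudan", "mention_type": "GPE"},
    {"start": 10, "end": 20, "surface_mention": "South Sudan", "mention_type": "GPE"}
  ],
  "ground_truth_links": [
    {"enwiki": "Sudan", "wikidata": "Q1049"},
    {"enwiki": "South_Sudan", "wikidata": "Q958"}
  ]
}
""")


def mention_text(text, mention):
    """Slice a mention out of the text.

    Assumption: "end" is inclusive, as in the example record, so 1 is added
    to obtain a Python slice bound.
    """
    return text[mention["start"]:mention["end"] + 1]


def linked_mentions(record):
    """Pair each mention with the link at the same list position (assumed alignment)."""
    pairs = []
    for mention, link in zip(record["ground_truth_mentions"],
                             record["ground_truth_links"]):
        surface = mention_text(record["text"], mention)
        pairs.append((surface, mention["mention_type"], link["wikidata"]))
    return pairs


if __name__ == "__main__":
    for surface, mention_type, wikidata_id in linked_mentions(RECORD):
        print(surface, mention_type, wikidata_id)
```

Comparing the sliced text with `surface_mention` is a quick way to confirm the offset convention on the real files. How red links are represented in a record is not shown in the example, so the positional pairing should be verified before relying on it.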
Code
https://gitlab.isl.ics.forth.gr/debatelab/elwiki_events_benchmark
Acknowledgments
This work has received funding from the Hellenic Foundation for Research and Innovation (HFRI) and the General Secretariat for Research and Technology (GSRT), under grant agreement No 4195.
Files
| Name | Size | Checksum |
|---|---|---|
| ner_nel_greek_dataset.zip | 2.9 MB | md5:4dd6186f149be5b5a76336c28b88c1f1 |
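The MD5 checksum listed for ner_nel_greek_dataset.zip can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `file_md5` and `is_intact` are illustrative, and the local path passed to them is whatever location the archive was saved to.

```python
import hashlib

# Checksum listed for ner_nel_greek_dataset.zip in the Files section.
EXPECTED_MD5 = "4dd6186f149be5b5a76336c28b88c1f1"


def file_md5(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest matches the expected checksum."""
    return file_md5(path) == expected.lower()
```

For example, `is_intact("ner_nel_greek_dataset.zip")` would return `True` for an uncorrupted copy saved in the current directory.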