Published December 12, 2022 | Version 0.1
Dataset Open

Dataset for Named Entity Recognition and Entity Linking from Greek Wikipedia Events

  • 1. Foundation for Research and Technology (Forth)
  • 2. Foundation for Research and Technology (Forth)

Description

An automated benchmark dataset for Named Entity Recognition (NER) and Named Entity Linking (NEL) tools, based on Greek Wikipedia event pages.

Note: This data includes data from the following sources:
- Wikipedia el.wikipedia.org

The dataset is provided as three JSON-formatted subsets (train, validation and test), split in a 70-20-10 ratio. The current version of the dataset contains 18,617 events annotated with 40,798 entity mentions and 36,189 links to Greek Wikipedia (elwiki) pages and their Wikidata IDs. The annotations cover 8 entity types: person, organization, location, gpe, event, facility, product and work of art.

Overall dataset statistics

            Docs    Tokens   Sentences  Surface Mentions  Valid Links  Red Links
Train       13,031  332,077  16,927     28,593            25,365       3,228
Validation   3,722   94,746   4,844      8,168             7,240         928
Test         1,862   47,450   2,427      4,037             3,584         453
Total       18,617  474,361  24,200     40,798            36,189       4,609
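
As a rough check of the 70-20-10 split, the per-split shares can be computed from the Surface Mentions and Valid Links columns of the table above. This is a minimal sketch: the variable and function names are illustrative and are not taken from the dataset's code. The "linked" figure assumes that a valid link is one that is not a red link.

```python
# Counts copied from the "Surface Mentions" and "Valid Links" columns.
mentions = {"train": 28593, "validation": 8168, "test": 4037}
valid_links = {"train": 25365, "validation": 7240, "test": 3584}


def shares(counts):
    """Return each split's fraction of the column total."""
    total = sum(counts.values())
    return {split: n / total for split, n in counts.items()}


mention_shares = shares(mentions)
for split, share in mention_shares.items():
    print(f"{split}: {share:.1%} of surface mentions")

# Fraction of surface mentions that carry a valid (non-red) link.
linked_fraction = sum(valid_links.values()) / sum(mentions.values())
print(f"linked: {linked_fraction:.1%}")
```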

Example

A record example is given below.

{
  "json_file": "February 2012_39_0 events",
  "text": "Sudan and South Sudan sign non-aggression pact.",
  "ground_truth_mentions": [
    {"start": 0, "end": 4, "surface_mention": "Sudan", "mention_type": "GPE"},
    {"start": 10, "end": 20, "surface_mention": "South Sudan", "mention_type": "GPE"}
  ],
  "ground_truth_links": [
    {"enwiki": "Sudan", "wikidata": "Q1049"},
    {"enwiki": "South_Sudan", "wikidata": "Q958"}
  ]
}
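
The offsets in the record above can be checked against the text with a few lines of Python. This is a sketch under one assumption inferred from the example only: "end" appears to be an inclusive character offset (0 to 4 covers the five characters of "Sudan"). The helper name mention_text is hypothetical, and other records in the dataset may need to be checked before relying on this convention.

```python
import json

# The example record, embedded as a string so the sketch is self-contained.
record_json = """
{
  "json_file": "February 2012_39_0 events",
  "text": "Sudan and South Sudan sign non-aggression pact.",
  "ground_truth_mentions": [
    {"start": 0, "end": 4, "surface_mention": "Sudan", "mention_type": "GPE"},
    {"start": 10, "end": 20, "surface_mention": "South Sudan", "mention_type": "GPE"}
  ],
  "ground_truth_links": [
    {"enwiki": "Sudan", "wikidata": "Q1049"},
    {"enwiki": "South_Sudan", "wikidata": "Q958"}
  ]
}
"""

record = json.loads(record_json)


def mention_text(text, mention):
    """Slice the mention out of the text.

    Assumption: "end" is inclusive, so one is added for Python slicing.
    """
    return text[mention["start"]:mention["end"] + 1]


spans = [mention_text(record["text"], m) for m in record["ground_truth_mentions"]]

# Each extracted span should equal the stored surface form.
for span, mention in zip(spans, record["ground_truth_mentions"]):
    assert span == mention["surface_mention"]
    print(span, mention["mention_type"])
```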

Code

https://gitlab.isl.ics.forth.gr/debatelab/elwiki_events_benchmark

Acknowledgments

This work has received funding from the Hellenic Foundation for Research and Innovation (HFRI) and the General Secretariat for Research and Technology (GSRT), under grant agreement No 4195.

Files

ner_nel_greek_dataset.zip (2.9 MB)
md5:4dd6186f149be5b5a76336c28b88c1f1
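
The MD5 checksum listed above can be used to confirm that a downloaded archive is intact. The following is a minimal sketch using Python's standard hashlib module; the function names and the local file path are assumptions for illustration.

```python
import hashlib

# Checksum as listed for ner_nel_greek_dataset.zip.
EXPECTED_MD5 = "4dd6186f149be5b5a76336c28b88c1f1"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected
```

For example, verify_archive("ner_nel_greek_dataset.zip") would return True for an intact copy, assuming the archive was saved under that name in the current directory.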