NILK
Description
A dataset for the NIL-detection and NIL-disambiguation tasks.
The NILK dataset has two main features: 1) It marks NIL-mentions for NIL-detection by extracting, from Wikipedia text, mentions of entities that were newly added to the knowledge base. 2) It provides an entity label for NIL-disambiguation by annotating each NIL-mention with its WikiData ID from the newer dump.
Dataset files contain JSON objects of the following structure:
{"mention":"Walter Damrosch",
"offset":348,
"length":15,
"context":"...the conductor Walter Damrosch. He scored the piece for the standard instruments of the symphony orchestra plus celesta, saxophone, and automobile horns...",
"wikipedia_page_id":"309",
"wikidata_id":"Q725579",
"nil":false}
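A minimal sketch of reading one such record in Python, using the example above as input. The `Mention` class and `parse_mention` helper are hypothetical names introduced here for illustration; the sketch assumes every record carries exactly the seven fields shown and that the types match the example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One NILK record; field names follow the JSON example in this description."""
    mention: str
    offset: int
    length: int
    context: str
    wikipedia_page_id: str
    wikidata_id: str
    nil: bool


def parse_mention(text: str) -> Mention:
    """Parse a single JSON object into a Mention.

    Assumption: the object has exactly the seven keys listed above;
    an unexpected or missing key raises TypeError.
    """
    return Mention(**json.loads(text))


# The example record from this description, written as one JSON string.
example = (
    '{"mention":"Walter Damrosch","offset":348,"length":15,'
    '"context":"...the conductor Walter Damrosch. He scored the piece for the '
    'standard instruments of the symphony orchestra plus celesta, saxophone, '
    'and automobile horns...",'
    '"wikipedia_page_id":"309","wikidata_id":"Q725579","nil":false}'
)

record = parse_mention(example)
print(record.mention, record.wikidata_id, record.nil)
```

In this example "length" equals the number of characters in "mention"; the sketch does not interpret "offset", since the description does not state what text it is measured against.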
The dataset contains both linked and unlinked mentions; they can be distinguished by checking the "nil" flag. To obtain NIL-mentions, we compared two WikiData dumps, one from 2017 and one from 2021. NIL-mentions carry the WikiData ID from the 2021 dump, which can be used to check whether two NIL-mentions refer to the same entity.
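The following sketch shows one way to separate linked mentions from NIL-mentions and to group NIL-mentions that share a WikiData ID. It assumes the files store one JSON object per line, which this description does not state explicitly. The function names are hypothetical, and the NIL records in the sample are invented placeholders, not entries from the dataset.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_by_nil(lines: Iterable[str]) -> Tuple[List[dict], List[dict]]:
    """Return (linked, nil_mentions), split on each record's "nil" flag."""
    linked: List[dict] = []
    nil_mentions: List[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        (nil_mentions if record["nil"] else linked).append(record)
    return linked, nil_mentions


def group_nil_by_entity(nil_mentions: Iterable[dict]) -> Dict[str, List[str]]:
    """Group NIL-mention surface forms by their (2021) WikiData ID."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for record in nil_mentions:
        groups[record["wikidata_id"]].append(record["mention"])
    return dict(groups)


# Sample input: the first record is abridged from the example above;
# the three NIL records and their IDs are invented for illustration.
sample_lines = [
    '{"mention":"Walter Damrosch","wikidata_id":"Q725579","nil":false}',
    '{"mention":"Jane Example","wikidata_id":"Q_HYPOTHETICAL_1","nil":true}',
    '{"mention":"J. Example","wikidata_id":"Q_HYPOTHETICAL_1","nil":true}',
    '{"mention":"Other Example","wikidata_id":"Q_HYPOTHETICAL_2","nil":true}',
]

linked, nil_mentions = split_by_nil(sample_lines)
groups = group_nil_by_entity(nil_mentions)
print(len(linked), len(nil_mentions))
print(groups)
```

Because `split_by_nil` accepts any iterable of strings, an open file handle can be passed directly, so a large file is read line by line rather than loaded at once.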
The dataset was designed with the 2017 WikiData dump as the target knowledge base: https://archive.org/download/wikibase-wikidatawiki-20170213/wikidata-20170213-all.json.gz
Files
| Name | Size | Checksum |
|---|---|---|
| nilk_full.zip | 13.3 GB | md5:5a9803516489842e9928cdf739695f40 |