Published May 31, 2022 | Version v4
Dataset Open

NILK

  • 1. Stuttgart University

Description

A dataset for the NIL-detection and NIL-disambiguation tasks.

The NILK dataset has two main features: 1) It marks NIL-mentions for NIL-detection by extracting, from Wikipedia text, mentions that refer to newly added entities. 2) It provides an entity label for NIL-disambiguation by marking NIL-mentions with WikiData IDs from the newer dump.

Dataset files contain JSON objects of the following structure:

{"mention":"Walter Damrosch",
"offset":348,
"length":15,
"context":"...the conductor Walter Damrosch. He scored the piece for the standard instruments of the symphony orchestra plus celesta, saxophone, and automobile horns...",
"wikipedia_page_id":"309",
"wikidata_id":"Q725579",
"nil":false}

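As a minimal sketch, a record of this shape could be parsed as shown below. The Mention class and both helper functions are hypothetical names introduced for illustration, not part of the dataset. The sketch assumes that each record is a single JSON object on its own line and that "offset" and "length" index into the full, untruncated "context" string; the description above does not state either point explicitly.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Mention:
    """One dataset record (hypothetical container, field names from the sample)."""
    mention: str
    offset: int
    length: int
    context: str
    wikipedia_page_id: str
    wikidata_id: Optional[str]
    nil: bool


def parse_mention(line: str) -> Mention:
    """Parse one JSON object into a Mention, ignoring any extra keys."""
    obj = json.loads(line)
    names = [f.name for f in fields(Mention)]
    return Mention(**{name: obj[name] for name in names})


def span_matches(m: Mention) -> bool:
    """Check the ASSUMPTION that offset/length locate the mention in context."""
    return m.context[m.offset:m.offset + m.length] == m.mention
```

Running span_matches over a sample of records is a cheap way to confirm or reject the offset assumption before relying on it.
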
The dataset contains both linked and unlinked mentions; the "nil" flag distinguishes them. To obtain NIL-mentions, we compared two WikiData dumps, one from 2017 and one from 2021. NIL-mentions carry the WikiData ID from the 2021 dump, which can be used to check whether two NIL-mentions refer to the same entity.
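
The use of the "nil" flag and the WikiData ID can be sketched as follows. The function names are hypothetical, and the sketch assumes the files hold one JSON object per line, which is not confirmed here. NIL-mentions are grouped by their "wikidata_id", so that mentions in the same group refer to the same entity.

```python
import json
from collections import defaultdict


def iter_records(lines):
    """Yield one dict per non-empty line (ASSUMES JSON Lines layout)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def group_nil_mentions(records):
    """Map each WikiData ID to the surface forms of its NIL-mentions.

    Linked mentions ("nil": false) are skipped.
    """
    clusters = defaultdict(list)
    for rec in records:
        if rec["nil"]:
            clusters[rec["wikidata_id"]].append(rec["mention"])
    return dict(clusters)
```

Because iter_records accepts any iterable of strings, an open file handle can be passed to it directly, which avoids loading a multi-gigabyte file into memory.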

The dataset was designed with the WikiData 2017 dump as the target knowledge base: https://archive.org/download/wikibase-wikidatawiki-20170213/wikidata-20170213-all.json.gz
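
To work with the target knowledge base, the dump can be streamed without decompressing it to disk. The sketch below is an illustration under assumptions: it relies on the usual layout of WikiData JSON dumps (a single JSON array with one entity object per line, each line ending in a comma) and on every entity carrying an "id" field. The function names and the local file path are hypothetical.

```python
import gzip
import json


def iter_wikidata_entities(path):
    """Stream entity dicts from a gzipped WikiData JSON dump.

    ASSUMES one entity per line inside a top-level JSON array.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue  # skip the array brackets and blank lines
            yield json.loads(line)


def collect_entity_ids(path):
    """Return the set of entity IDs (e.g. "Q725579") present in the dump."""
    return {entity["id"] for entity in iter_wikidata_entities(path)}
```

With such a set in hand, one could check that the "wikidata_id" of a linked mention is present in the 2017 dump and that the ID of a NIL-mention is absent from it. The full set of IDs for a complete dump is large, so expect substantial memory use.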


nilk_03_2023.zip contains the same data with longer contexts (unsplit).

Files

nilk_1000_split.zip

Files (26.7 GB)

md5:564f72662a7f4840d3c8e326995c59e6 13.4 GB
md5:5a9803516489842e9928cdf739695f40 13.3 GB
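
Given the size of the archives, it is worth verifying a download against the listed MD5 checksum. The following is a minimal sketch using Python's standard hashlib module. The function names are hypothetical, and which checksum belongs to which archive is not stated in the listing, so that pairing has to be taken from the record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as "md5:<hex>" or "<hex>"."""
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected.strip().lower()
```

Reading in chunks keeps memory use constant, which matters for files of more than 13 GB.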