There is a newer version of the record available.

Published May 31, 2022 | Version v2
Dataset Open

NILK

  • 1. Stuttgart University

Description

A dataset for the NIL-detection and NIL-disambiguation tasks.

The NILK dataset has two main features: 1) It marks NIL-mentions for NIL-detection by extracting mentions in Wikipedia text that belong to newly added entities. 2) It provides an entity label for NIL-disambiguation by marking NIL-mentions with WikiData IDs from the newer dump.

Dataset files contain JSON objects of the following structure:

{"mention":"Walter Damrosch",
"offset":348,
"length":15,
"context":"...the conductor Walter Damrosch. He scored the piece for the standard instruments of the symphony orchestra plus celesta, saxophone, and automobile horns...",
"wikipedia_page_id":"309",
"wikidata_id":"Q725579",
"nil":false}
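
As a rough illustration of how such records might be read, here is a minimal Python sketch. It assumes that each file stores one JSON object per line and that "offset" is a character offset into "context"; neither assumption is confirmed by this record. The helper names and the sample record are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_mentions(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty line (assumes one JSON object per line)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def mention_text(record: dict) -> str:
    """Slice the mention out of the context.

    Assumes "offset" counts characters from the start of "context".
    """
    start = record["offset"]
    return record["context"][start:start + record["length"]]


# Hypothetical record, shaped like the example above but not taken from the dataset.
sample_lines = [
    '{"mention":"Damrosch","offset":4,"length":8,'
    '"context":"the Damrosch era","wikipedia_page_id":"0",'
    '"wikidata_id":"Q0","nil":false}',
    "",
]
records = list(iter_mentions(sample_lines))
```

If the slice returned by mention_text does not equal the "mention" field on real records, the offset is probably measured against something other than the context string.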

The dataset contains both linked and unlinked mentions; the "nil" flag distinguishes them. To obtain NIL-mentions, we compared two WikiData dumps, from 2017 and 2021. Each NIL-mention carries a WikiData ID from the 2021 dump, which can be used to check whether two NIL-mentions refer to the same entity.
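
The use of the "nil" flag and the WikiData ID can be sketched as follows. This is an illustration only: the function names are hypothetical and the records are invented placeholders, not taken from the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_by_nil(records: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Return (linked, nil) lists according to the "nil" flag."""
    linked, nil = [], []
    for record in records:
        (nil if record["nil"] else linked).append(record)
    return linked, nil


def group_nil_by_entity(records: Iterable[dict]) -> Dict[str, List[str]]:
    """Group the surface forms of NIL-mentions by their WikiData ID.

    Mentions that end up under the same key refer to the same entity.
    """
    clusters = defaultdict(list)
    for record in records:
        if record["nil"]:
            clusters[record["wikidata_id"]].append(record["mention"])
    return dict(clusters)


# Invented placeholder records; the IDs are not real WikiData IDs.
example = [
    {"mention": "A. Example", "wikidata_id": "Q_PLACEHOLDER_1", "nil": True},
    {"mention": "Alice Example", "wikidata_id": "Q_PLACEHOLDER_1", "nil": True},
    {"mention": "Bob Sample", "wikidata_id": "Q_PLACEHOLDER_2", "nil": True},
    {"mention": "Known Person", "wikidata_id": "Q_PLACEHOLDER_3", "nil": False},
]
linked, nil_mentions = split_by_nil(example)
clusters = group_nil_by_entity(example)
```

Here the two mentions sharing Q_PLACEHOLDER_1 fall into one cluster, which is the kind of gold grouping a NIL-disambiguation system could be scored against.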

The dataset was designed with the 2017 WikiData dump as the target knowledge base: https://archive.org/download/wikibase-wikidatawiki-20170213/wikidata-20170213-all.json.gz
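
One possible way to collect the entity IDs present in the 2017 dump is sketched below. It relies on the usual layout of Wikidata JSON dumps (a single JSON array with one entity per line, each line ending in a comma) and on the "id" field of each entity. The function names and the local file path are hypothetical, and holding every ID in memory may need several gigabytes of RAM.

```python
import gzip
import json
from typing import Iterable, Iterator, Set


def iter_entity_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield the "id" of each entity line, skipping the array brackets."""
    for line in lines:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue
        yield json.loads(line)["id"]


def load_known_ids(dump_path: str) -> Set[str]:
    """Read a gzip-compressed dump and return the set of entity IDs."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as handle:
        return set(iter_entity_ids(handle))


# Tiny in-memory stand-in for the dump layout, for illustration only.
fake_dump = [
    "[\n",
    '{"id":"Q1","type":"item"},\n',
    '{"id":"Q2","type":"item"}\n',
    "]\n",
]
ids = list(iter_entity_ids(fake_dump))

# Hypothetical usage with the downloaded file:
# known_ids = load_known_ids("wikidata-20170213-all.json.gz")
```

With such a set, a record whose "wikidata_id" is absent from the 2017 IDs would be expected to carry "nil": true, which gives a simple consistency check against the target knowledge base.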

Files

Files (13.3 GB)

Name Size
nilk_full.zip 13.3 GB
md5:5a9803516489842e9928cdf739695f40
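
Given the size of the archive, it may be worth checking the download against the MD5 checksum listed above. The following is a minimal sketch using Python's standard hashlib; the function name, the chunk size, and the local path are assumptions.

```python
import hashlib

# Checksum listed for nilk_full.zip in this record.
EXPECTED_MD5 = "5a9803516489842e9928cdf739695f40"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical usage, assuming the archive was saved in the current directory:
# if md5_of_file("nilk_full.zip") != EXPECTED_MD5:
#     raise ValueError("checksum mismatch, the download may be corrupted")
```

Reading in chunks keeps memory use constant, which matters for a 13.3 GB file.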