Improving the Utility and Trustworthiness of Knowledge Graph Embeddings with Calibration
Description
This repository contains two public knowledge graph datasets used in our paper "Improving the Utility of Knowledge Graph Embeddings with Calibration". Each dataset is described below.
Note that for our experiments we split each dataset randomly 5 times into 80/10/10 train/validation/test splits. We recommend that users of our data do the same to avoid (potentially) overfitting models to a single dataset split.
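The repeated random splitting described above can be sketched as follows. This is a minimal illustration, not the code used in the paper: the function name `random_splits`, the seed handling, and the toy triples are assumptions made for this example.

```python
import random


def random_splits(triples, n_splits=5, fractions=(0.8, 0.1, 0.1), seed=0):
    """Yield n_splits independent random train/validation/test partitions.

    The rounding rule (truncate train and validation, give the remainder
    to test) is an assumption of this sketch.
    """
    rng = random.Random(seed)
    n_train = int(fractions[0] * len(triples))
    n_valid = int(fractions[1] * len(triples))
    for _ in range(n_splits):
        shuffled = list(triples)  # copy so the input order is untouched
        rng.shuffle(shuffled)
        train = shuffled[:n_train]
        valid = shuffled[n_train:n_train + n_valid]
        test = shuffled[n_train + n_valid:]
        yield train, valid, test


# Toy triples standing in for the rows of a triples.tsv file.
toy = [(f"h{i}", "r", f"t{i}") for i in range(100)]
splits = list(random_splits(toy))
```

Reporting the mean and spread of a metric across the five splits, rather than a single number, is one way to follow the recommendation above.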
wikidata-authors
This dataset was extracted by querying the Wikidata API for facts about people categorized as "authors" or "writers" on Wikidata. Note that all head entities of triples are people (authors or writers), and all triples describe something about that person (e.g., their place of birth, their place of death, or their spouse). The knowledge graph has 23,887 entities, 13 relations, and 86,376 triples.
The files are as follows:
entities.tsv
: A tab-separated file of all unique entities in the dataset. The fields are as follows:
eid
: The unique Wikidata identifier of this entity. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/<eid>.
label
: A human-readable label of this entity (extracted from Wikidata).
relations.tsv
: A tab-separated file of all unique relations in the dataset. The fields are as follows:
rid
: The unique Wikidata identifier of this relation. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/Property:<rid>.
label
: A human-readable label of this relation (extracted from Wikidata).
triples.tsv
: A tab-separated file of all triples in the dataset, in the form of <head eid>, <relation rid>, <tail eid>.
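As a rough sketch of how the three wikidata-authors files fit together, the snippet below reads tab-separated rows and resolves a triple to its labels and Wikidata URLs. It assumes the files have no header row and that fields appear in the order listed above; neither is stated in this description, so check the actual files first. The sample rows are written for this example and are not copied from the dataset.

```python
import csv
import io


def read_tsv(handle):
    """Return all non-empty rows of a tab-separated file as lists of strings."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


# Illustrative in-memory stand-ins for entities.tsv, relations.tsv, triples.tsv.
entities_tsv = "Q42\tDouglas Adams\nQ350\tCambridge\n"
relations_tsv = "P19\tplace of birth\n"
triples_tsv = "Q42\tP19\tQ350\n"

entity_label = dict(read_tsv(io.StringIO(entities_tsv)))      # eid -> label
relation_label = dict(read_tsv(io.StringIO(relations_tsv)))   # rid -> label
triples = [tuple(row) for row in read_tsv(io.StringIO(triples_tsv))]


def describe(triple):
    """Render an (eid, rid, eid) triple using human-readable labels."""
    head, relation, tail = triple
    return f"{entity_label[head]} -- {relation_label[relation]} --> {entity_label[tail]}"


def entity_url(eid):
    return f"https://www.wikidata.org/wiki/{eid}"


def relation_url(rid):
    return f"https://www.wikidata.org/wiki/Property:{rid}"
```

With real data, replace each `io.StringIO(...)` with an open file handle, for example `open("entities.tsv", encoding="utf-8", newline="")`.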
fb15krr-linked
This dataset is an extended version of the FB15k+ dataset provided by [Xie et al IJCAI16]. It has been linked to Wikidata using Freebase MIDs (machine IDs) as keys; we discarded triples from the original dataset that contained entities that could not be linked to Wikidata. We also removed reverse relations following the procedure described by [Toutanova and Chen CVSC2015]. Finally, we removed existing triples labeled as False and added predicted triples labeled as True based on the crowdsourced annotations we obtained in our True or False Facts experiment (see our paper for details). The knowledge graph consists of 14,289 entities, 770 relations, and 272,385 triples.
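The linking step described above (discard any triple with an entity that has no Wikidata match) can be illustrated with a small filter. This is only a sketch of the idea, not the authors' pipeline: `keep_linked`, the MIDs, and the identifiers below are hypothetical placeholders.

```python
def keep_linked(triples, mid_to_wikidata):
    """Keep only triples whose head and tail MIDs both have a Wikidata link."""
    return [
        (head, relation, tail)
        for head, relation, tail in triples
        if head in mid_to_wikidata and tail in mid_to_wikidata
    ]


# Hypothetical MIDs and Wikidata identifiers, for illustration only.
links = {"/m/hypo_a": "QA", "/m/hypo_b": "QB"}
raw_triples = [
    ("/m/hypo_a", "/hypothetical/relation", "/m/hypo_b"),  # both linked: kept
    ("/m/hypo_a", "/hypothetical/relation", "/m/hypo_c"),  # tail unlinked: dropped
]
linked_triples = keep_linked(raw_triples, links)
```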
The files are as follows:
entities.tsv
: A tab-separated file of all unique entities in the dataset. The fields are as follows:
mid
: The Freebase machine ID (MID) of this entity.
wiki
: The corresponding unique Wikidata identifier of this entity. You can find the corresponding Wikidata page at https://www.wikidata.org/wiki/<eid>.
label
: A human-readable label of this entity (extracted from Wikidata).
types
: All hierarchical types of this entity, as provided by [Xie et al IJCAI16].
relations.tsv
: A tab-separated file of all unique relations in the dataset. The fields are as follows:
label
: The hierarchical Freebase label of this relation.
triples.tsv
: A tab-separated file of all triples in the dataset, in the form of <head MID>, <relation label>, <tail MID>.
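Because fb15krr-linked triples are keyed by MID while each entity also carries a Wikidata identifier, a common first step is to build a MID lookup. The sketch below assumes no header row and the field order listed above (mid, wiki, label, types). The `types` field is kept as a raw string because its internal delimiter is not documented here. All sample values are hypothetical placeholders.

```python
import csv
import io
from typing import NamedTuple


class Entity(NamedTuple):
    mid: str
    wiki: str
    label: str
    types: str  # left unparsed: the delimiter between types is not documented


def load_entities(handle):
    """Map each MID to its Entity record, skipping rows with fewer than 4 fields."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: Entity(*row[:4]) for row in reader if len(row) >= 4}


def to_wikidata(triple, entities):
    """Rewrite a (head MID, relation label, tail MID) triple with Wikidata IDs."""
    head, relation, tail = triple
    return entities[head].wiki, relation, entities[tail].wiki


# Hypothetical rows standing in for entities.tsv.
sample_entities = (
    "/m/hypo_a\tQA\tExample A\t/hypothetical/type_a\n"
    "/m/hypo_b\tQB\tExample B\t/hypothetical/type_b\n"
)
entities = load_entities(io.StringIO(sample_entities))
converted = to_wikidata(("/m/hypo_a", "/hypothetical/relation", "/m/hypo_b"), entities)
```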
Files
kg-calibration.zip (2.9 MB)
md5:ad3cb8c32a0bf702686896a999e17695
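The md5 checksum listed above can be used to confirm that a download of kg-calibration.zip is intact. The helper below is a minimal sketch using Python's standard `hashlib`; the function name `md5_of_stream` and the chunk size are choices made for this example.

```python
import hashlib
import io

# Checksum as listed for kg-calibration.zip in the file listing.
EXPECTED_MD5 = "ad3cb8c32a0bf702686896a999e17695"


def md5_of_stream(stream, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a binary stream, reading it in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Sanity check on an in-memory stream (the digest of empty input).
empty_digest = md5_of_stream(io.BytesIO(b""))

# Usage with the downloaded archive (path is an assumption):
# with open("kg-calibration.zip", "rb") as f:
#     intact = md5_of_stream(f) == EXPECTED_MD5
```

Reading in chunks keeps memory use constant regardless of archive size.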