Published September 18, 2020 | Version 1.0
Dataset Open

WD50K

  • 1. TU Dresden; Fraunhofer IAIS
  • 2. Fraunhofer IAIS
  • 3. Fraunhofer IAIS; University of Bonn

Description

WD50K is a hyper-relational dataset derived from Wikidata statements.

The dataset is constructed by the following procedure based on the [Wikidata RDF dump](https://dumps.wikimedia.org/wikidatawiki/20190801/) of August 2019:

-  A set of seed nodes corresponding to entities from FB15K-237 having a direct mapping in Wikidata (P646 "Freebase ID") is extracted from the dump.
-  For each seed node, all statements whose main object and qualifier values correspond to wikibase:Item are extracted from the dump.
-  All literals are filtered out from the qualifiers of the above obtained statements.
-  All entities with fewer than two mentions in the dataset are dropped, together with the statements in which they appear.
-  The remaining statements are randomly split into the train, test, and validation sets.
-  All statements in the train and validation sets that share the same main triple (s,p,o) with a test statement are removed.
-  WD50k_33, WD50k_66, and WD50k_100 are then sampled from the above statements. Here 33, 66, and 100 denote the approximate percentage of hyper-relational facts (statements with qualifiers) in each variant.
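
The filtering, splitting, and de-duplication steps above can be sketched in a few lines of Python. This is a minimal illustration, not the original construction script: the in-memory statement layout `(s, p, o, qualifiers)`, the function names, the split ratios, and the single-pass entity filter are all assumptions made for the example.

```python
import random
from collections import Counter

# Assumed layout (illustrative only, not the on-disk format of WD50K):
# a statement is (s, p, o, qualifiers), where qualifiers is a tuple of
# (qualifier_relation, qualifier_entity) pairs.

def entities_of(stmt):
    """Return every entity mentioned in a statement, including qualifier values."""
    s, _p, o, quals = stmt
    return [s, o] + [qv for _qp, qv in quals]

def drop_rare_entities(statements, min_mentions=2):
    """Drop statements that mention an entity seen fewer than min_mentions times.

    This is a single pass; whether the original procedure repeats the
    filter until nothing changes is not stated in the description.
    """
    counts = Counter(e for st in statements for e in entities_of(st))
    return [st for st in statements
            if all(counts[e] >= min_mentions for e in entities_of(st))]

def random_split(statements, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle and split into train/valid/test. The ratios are hypothetical."""
    shuffled = list(statements)
    random.Random(seed).shuffle(shuffled)
    n_train = int(ratios[0] * len(shuffled))
    n_valid = int(ratios[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

def remove_test_leakage(train, valid, test):
    """Remove train/valid statements whose main triple (s, p, o) occurs in test."""
    test_triples = {st[:3] for st in test}
    train = [st for st in train if st[:3] not in test_triples]
    valid = [st for st in valid if st[:3] not in test_triples]
    return train, valid
```

Comparing only the main triple in `remove_test_leakage` means that a training statement is removed even when its qualifiers differ from those of the matching test statement, which is how the description above phrases the rule.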


The table below provides basic statistics of the dataset and its three variants:

| Dataset     | Statements | w/Quals (%)    | Entities | Relations | E only in Quals | R only in Quals | Train   | Valid  | Test   |
|-------------|------------|----------------|----------|-----------|-----------------|-----------------|---------|--------|--------|
| WD50K       | 236,507    | 32,167 (13.6%) | 47,156   | 532       | 5,460           | 45              | 166,435 | 23,913 | 46,159 |
| WD50K (33)  | 102,107    | 31,866 (31.2%) | 38,124   | 475       | 6,463           | 47              |  73,406 | 10,668 | 18,133 |
| WD50K (66)  |  49,167    | 31,696 (64.5%) | 27,347   | 494       | 7,167           | 53              |  35,968 |  5,154 |  8,045 |
| WD50K (100) |  31,314    | 31,314 (100%)  | 18,792   | 279       | 7,862           | 75              |  22,738 |  3,279 |  5,297 |
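
As a quick sanity check, the share of statements with qualifiers can be recomputed from the table, which also shows how the 33/66/100 labels relate to the data. The snippet below is a small sketch; the dictionary simply copies values from the table, and the helper names `qualifier_share` and `split_gap` are made up for this example.

```python
# Values copied from the statistics table:
# (statements, with_qualifiers, train, valid, test)
TABLE = {
    "WD50K":       (236_507, 32_167, 166_435, 23_913, 46_159),
    "WD50K (33)":  (102_107, 31_866,  73_406, 10_668, 18_133),
    "WD50K (66)":  ( 49_167, 31_696,  35_968,  5_154,  8_045),
    "WD50K (100)": ( 31_314, 31_314,  22_738,  3_279,  5_297),
}

def qualifier_share(statements, with_quals):
    """Percentage of statements that carry at least one qualifier."""
    return round(100 * with_quals / statements, 1)

def split_gap(statements, train, valid, test):
    """Difference between the summed split sizes and the statement count."""
    return train + valid + test - statements

for name, (n, q, tr, va, te) in TABLE.items():
    print(name, qualifier_share(n, q), split_gap(n, tr, va, te))
```

The recomputed shares match the percentages in the table. The split sizes sum exactly to the statement count for every row except WD50K (33), where the three splits as listed add up to 102,207, which is 100 more than the stated 102,107.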

 

 

When using the dataset, please cite:

@inproceedings{StarE,
  title={Message Passing for Hyper-Relational Knowledge Graphs},
  author={Galkin, Mikhail and Trivedi, Priyansh and Maheshwari, Gaurav and Usbeck, Ricardo and Lehmann, Jens},
  booktitle={EMNLP},
  year={2020}
}
For any further questions, please contact: mikhail.galkin@iais.fraunhofer.de

Notes

Funding sources:

-  SPEAKER: 01MK20011A
-  JOSEPH: Fraunhofer Zukunftsstiftung
-  Cleopatra: 812997
-  ML2R: 01 15 18038 A/B/C
-  MLwin: 01IS18050 D/F
-  ScADS: 01IS18026A

Files

WD50K.zip (7.1 MB)

md5:5e46e4630a4b425924efc0402d09696e
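
The MD5 checksum listed for WD50K.zip can be used to confirm that a download is complete and unmodified. The following is a minimal sketch using Python's standard `hashlib` module; the local path passed to `verify_download` and the helper names are assumptions for this example.

```python
import hashlib

# Checksum as listed on this page for WD50K.zip.
EXPECTED_MD5 = "5e46e4630a4b425924efc0402d09696e"

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at path matches the expected checksum."""
    return md5_of_file(path) == expected
```

For example, `verify_download("WD50K.zip")` would return `True` for an intact copy saved under that (assumed) name in the current directory.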

Additional details

Funding

European Commission
Cleopatra – Cross-lingual Event-centric Open Analytics Research Academy 812997