Published August 29, 2023 | Version v1
Dataset Open

WikiDBs 10k - A Corpus Of Relational Databases From Wikidata

  • 1. Technical University of Darmstadt
  • 2. Technical University of Darmstadt, DFKI

Description

WikiDBs-10k (https://wikidbs.github.io/) is a corpus of relational databases built from Wikidata (https://www.wikidata.org/). This is the preliminary 10k version; the newer version with 100k databases (https://zenodo.org/records/11559814) includes more coherent databases and more diverse table and column names.

The WikiDBs-10k corpus consists of 10,000 databases. For more details, see our paper: https://ceur-ws.org/Vol-3462/TADA3.pdf (TaDA@VLDB'23).

Each database is stored in its own sub-folder: the tables are provided as CSV files and the database schema as a JSON file.
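The layout described above can be read with the Python standard library alone. The following is a minimal sketch, not an official loader: the function name load_database is hypothetical, and because the exact file names and nesting inside a database folder are not stated here, the code simply collects every .csv and .json file it finds below the folder. It also assumes UTF-8 encoded files with a header row in each CSV.

```python
import csv
import json
from pathlib import Path


def load_database(db_dir):
    """Load one database sub-folder (hypothetical helper).

    Assumption: tables are *.csv files and the schema is a *.json file
    somewhere below db_dir; exact names and nesting are not assumed.
    """
    db_dir = Path(db_dir)

    # One entry per CSV file, keyed by file name without extension.
    tables = {}
    for csv_path in sorted(db_dir.rglob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as f:
            # Each row becomes a dict keyed by the CSV header row.
            tables[csv_path.stem] = list(csv.DictReader(f))

    # Usually a single schema file, but all JSON files are collected.
    schemas = {}
    for json_path in sorted(db_dir.rglob("*.json")):
        with json_path.open(encoding="utf-8") as f:
            schemas[json_path.name] = json.load(f)

    return tables, schemas
```

After unpacking the archive, calling load_database on one of the sub-folders returns the table rows and the parsed schema. All cell values are returned as strings, so any type conversion would have to be driven by the schema file.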

We thank Till Döhmen and Madelon Hulsebos for generously providing the table statistics from their GitSchemas dataset, and Jan-Micha Bodensohn for converting the dataset to SQLite files. This work has been supported by the BMBF and the state of Hesse as part of the NHR Program and the BMBF project KompAKI (grant number 02L19C150), as well as the HMWK cluster project 3AI. Finally, we want to thank hessian.AI and DFKI Darmstadt for their support.

Files

wikidbs_10k.zip

Files (761.1 MB)

Checksum Size
md5:71ebb739508f1a54b79c2112687d5d83 590.6 MB
md5:432a413072f6041966fe3129469b595e 170.5 MB
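
A downloaded file can be checked against the MD5 checksums listed above. The sketch below is one possible way to do this with Python's hashlib; the helper names md5_of_file and matches_listed_checksum are hypothetical. Since the listing does not pair each checksum with a file name, the check only tests whether the digest equals either of the two listed values.

```python
import hashlib

# Checksums copied from the file listing on this page.
LISTED_MD5 = {
    "71ebb739508f1a54b79c2112687d5d83",
    "432a413072f6041966fe3129469b595e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read until f.read returns an empty bytes object (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path):
    """True if the file's MD5 equals one of the checksums listed above."""
    return md5_of_file(path) in LISTED_MD5
```

Reading in chunks keeps memory use low for archives of several hundred megabytes.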

Additional details

Related works

Is previous version of
Dataset: 10.5281/zenodo.11559814 (DOI)