Published October 30, 2024 | Version v1
Dataset Open

WikiDBs - A Large-Scale Corpus Of Relational Databases From Wikidata

Description

WikiDBs is an open-source corpus of 100,000 relational databases. We aim to support research on tabular representation learning on multi-table data. The corpus is based on Wikidata and is designed to follow certain characteristics of real-world databases.

WikiDBs was published as a spotlight paper in the Datasets and Benchmarks track at NeurIPS 2024.

WikiDBs contains the database schemas as well as the table contents. The tables of each database are provided as CSV files, and each database schema is provided as a JSON file. The 100,000 databases are available in five splits of 20,000 databases each. In total, around 165 GB of disk space is needed for the full corpus. We also provide a script to convert the databases into SQLite.
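
As a rough illustration of working with the CSV tables, the sketch below loads every CSV file from one database folder into an SQLite database using only the Python standard library. It is not the conversion script provided with the corpus. The function name load_csv_tables is hypothetical, and the folder layout (one CSV file per table, with a header row, all in a single directory) is an assumption that should be checked against the actual corpus. All columns are created without declared types, and the JSON schema is ignored.

```python
import csv
import sqlite3
from pathlib import Path


def quote_identifier(name):
    """Quote a table or column name for use in an SQLite statement."""
    return '"' + name.replace('"', '""') + '"'


def load_csv_tables(db_dir, conn):
    """Load each *.csv file in db_dir into conn as a table named after the file.

    Assumes (hypothetically) one CSV file per table, with a header row.
    Returns the list of table names that were created.
    """
    loaded = []
    for csv_path in sorted(Path(db_dir).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            header = next(reader, None)
            if not header:
                continue  # skip files without a header row
            table = quote_identifier(csv_path.stem)
            columns = ", ".join(quote_identifier(c) for c in header)
            placeholders = ", ".join("?" for _ in header)
            conn.execute(f"CREATE TABLE {table} ({columns})")
            # Raises sqlite3.ProgrammingError if a row's length differs from the header.
            conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", reader)
            loaded.append(csv_path.stem)
    conn.commit()
    return loaded
```

A call such as load_csv_tables("path/to/one/database", sqlite3.connect("example.sqlite")) would then produce a queryable SQLite file. Foreign keys and column types from the JSON schema are not applied in this sketch, so the provided conversion script remains the better choice when those matter.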

Files

Files (48.9 GB)

Checksum Size
md5:500e0adf73b425813e18289fae06b1a6 9.8 GB
md5:53981a8b4e6b974f065cafeb949887ee 9.9 GB
md5:ac445e51cb7ed6705e494599d00754b3 9.8 GB
md5:220edab823eaf856bc1bf588825c5337 9.6 GB
md5:e369b2511eba38686ac1c622359a904d 9.8 GB
md5:371db2e2791190fe2033fe8a08ef1d5f 3.0 kB
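
Because each archive is close to 10 GB, it can be worth checking a download against the md5 checksums in the file listing before unpacking. The following is a minimal sketch using Python's hashlib. The function names md5_of_file and matches_listing are hypothetical, and the path of the downloaded file is whatever it was saved as locally.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so multi-gigabyte archives never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, expected):
    """Compare a file's MD5 with a listing entry such as 'md5:500e0a...'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, matches_listing(downloaded_path, "md5:500e0adf73b425813e18289fae06b1a6") should return True only if the local file is byte-identical to the archive carrying that checksum.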

Additional details

Related works

Continues
Dataset: 10.5281/zenodo.8227452 (DOI)

Software

Repository URL
https://github.com/DataManagementLab/wikidbs-public
Programming language
Python