WikiDBs - A Large-Scale Corpus Of Relational Databases From Wikidata
Description
WikiDBs is an open-source corpus of 100,000 relational databases. We aim to support research on tabular representation learning on multi-table data. The corpus is based on Wikidata and aims to follow certain characteristics of real-world databases.
WikiDBs was published as a spotlight paper in the Datasets & Benchmarks track at NeurIPS 2024.
WikiDBs contains the database schemas as well as the table contents. The database tables are provided as CSV files, and each database schema as a JSON file. The 100,000 databases are available in five splits of 20,000 databases each. In total, around 165 GB of disk space is needed for the full corpus. We also provide a script to convert the databases into SQLite.
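To illustrate working with the CSV tables, the following is a minimal sketch that loads a folder of CSV files into an SQLite database using only the Python standard library. It is not the conversion script shipped with the corpus. The function names `load_csv_tables` and `quote_ident` are hypothetical, the folder layout (one CSV file per table, with a header row) is an assumption, and the schema JSON is deliberately ignored here, so every column is stored as TEXT and no keys are created.

```python
import csv
import sqlite3
from pathlib import Path


def quote_ident(name: str) -> str:
    """Quote an SQLite identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def load_csv_tables(csv_dir: Path, conn: sqlite3.Connection) -> list[str]:
    """Load every *.csv file in csv_dir into its own all-TEXT table.

    Assumption (not taken from the corpus documentation): each CSV file
    holds one table and its first row is the header.
    """
    loaded = []
    for csv_path in sorted(csv_dir.glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, None)
            if not header:
                continue  # skip empty files
            table = quote_ident(csv_path.stem)
            columns = ", ".join(f"{quote_ident(c)} TEXT" for c in header)
            placeholders = ", ".join("?" for _ in header)
            conn.execute(f"CREATE TABLE {table} ({columns})")
            # Keep only rows whose width matches the header.
            rows = (row for row in reader if len(row) == len(header))
            conn.executemany(
                f"INSERT INTO {table} VALUES ({placeholders})", rows
            )
        loaded.append(csv_path.stem)
    conn.commit()
    return loaded
```

A call such as `load_csv_tables(Path("some_database/tables"), sqlite3.connect("some_database.sqlite"))` would then produce one SQLite table per CSV file; the path shown is a placeholder. For typed columns and foreign keys, the schema JSON and the provided conversion script are the better route.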
Files
Total size: 48.9 GB
Checksum | Size |
---|---|
md5:500e0adf73b425813e18289fae06b1a6 | 9.8 GB |
md5:53981a8b4e6b974f065cafeb949887ee | 9.9 GB |
md5:ac445e51cb7ed6705e494599d00754b3 | 9.8 GB |
md5:220edab823eaf856bc1bf588825c5337 | 9.6 GB |
md5:e369b2511eba38686ac1c622359a904d | 9.8 GB |
md5:371db2e2791190fe2033fe8a08ef1d5f | 3.0 kB |
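The MD5 checksums listed above can be used to check that a download completed without corruption. The following is a minimal sketch using Python's `hashlib`; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the file path in the usage note is a placeholder, since the mapping between file names and checksums is not shown here.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected: str) -> bool:
    """Compare a file's MD5 digest with a listed value such as 'md5:500e...'."""
    expected_hex = expected.strip().lower().removeprefix("md5:")
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum(Path("downloaded_archive.zip"), "md5:500e0adf73b425813e18289fae06b1a6")` returns `True` only if the downloaded file is the one that checksum belongs to. Reading in chunks matters here because each archive is close to 10 GB.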
Additional details
Related works
- Continues: Dataset 10.5281/zenodo.8227452 (DOI)
Software
- Repository URL: https://github.com/DataManagementLab/wikidbs-public
- Programming language: Python