SANTOS Benchmark for Table Union Search

Khatiwada, Aamod; Fan, Grace; Shraga, Roee; Chen, Zixuan; Gatterbauer, Wolfgang; Miller, Renée J.; Riedewald, Mirek

doi:10.5281/zenodo.7758091

Published March 22, 2023 | Version v1

Dataset Open

1. Northeastern University

This record contains the datasets released with the SIGMOD 2023 paper "SANTOS: Relationship-based Semantic Table Union Search". We release two new tabular benchmarks for evaluating table union search over data lakes. We also release relabeled ground truth for an existing TUS benchmark, produced by taking the binary relationships between columns into account. Please see our paper for further details.

If you use our dataset in your work, please cite our paper as:

Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Renée J. Miller, and Mirek Riedewald. 2023. SANTOS: Relationship-based Semantic Table Union Search. SIGMOD Conference 2023, ACM.

@article{DBLP:journals/pacmmod/KhatiwadaFSCGMR23,
author = {Aamod Khatiwada and
Grace Fan and
Roee Shraga and
Zixuan Chen and
Wolfgang Gatterbauer and
Ren{\'{e}}e J. Miller and
Mirek Riedewald},
title = {{SANTOS:} Relationship-based Semantic Table Union Search},
journal = {Proc. {ACM} Manag. Data},
volume = {1},
number = {1},
pages = {9:1--9:25},
year = {2023},
doi = {10.1145/3588689},
}

You can find the SANTOS implementation at: https://github.com/northeastern-datalab/santos

You can find the original TUS benchmark at: https://github.com/RJMillerLab/table-union-search-benchmark

Abstract: Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover semantic relationships between pairs of columns: the first uses an existing knowledge base (KB); the second (which we call a “synthesized KB”) uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm called SANTOS outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically in all benchmarks that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating synthesized KBs from data lakes with limited KB coverage and using them for union search.

Files (1.9 GB)

Name	Size	MD5
readme.txt	1.1 kB	f7f6da49a4ab4d21992031e3f35d4c76
real_data_lake_benchmark.zip	1.8 GB	1c3c041de79479bb5176729df3c55d35
santos_benchmark.zip	90.5 MB	2be111215b569bd2d64515fa5deed490
TUS_benchmark_relabeled_groundtruth.csv	1.2 MB	f749b12324ca4474cc51006f65e13a3b

Additional details

U.S. National Science Foundation: III: Medium: Collaborative Research: From Open Data to Open Data Curation (award 2107248)
U.S. National Science Foundation: CAREER: Scaling Approximate Inference and Approximation-Aware Learning (award 1762268)
U.S. National Science Foundation: III: Medium: Table-as-Query: Unifying Data Discovery and Alignment (award 1956096)
