Published July 14, 2025 | Version v1
Dataset Open

Context Aware Joins in Open Data

Description

The OpenData benchmark is a collection of tables extracted from open government data repositories, which have been frequently used in set-similarity-based join search studies. Most prior work used an exact method to compute set overlap and treated it as the ground truth for joinability evaluation. However, focusing solely on column similarity is insufficient for enterprise data lakes, which contain unrelated data from diverse sources. For data lakes, it is important to prioritize statistically significant and contextually relevant joins. In our work [1], we introduced the problem of context-aware joinable column search, but we did not find any significant study or benchmark that captures the notion of context-aware joinable columns.

To enable the creation of context-aware joins that reflect real-world user behavior, we gathered human annotations for a small subset of OpenData. We identified 471 column pairs that exhibited high containment scores and solicited join labels from fifteen human annotators. During the annotation process, annotators had access to table snippets, metadata such as table descriptions, organization IDs, and tags, as well as the potential outcome of the joined table.
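The description does not define the containment score. A common definition in set-similarity join search is the fraction of distinct values in a query column that also appear in a candidate column. The sketch below assumes that definition; the function name `containment` and the sample values are illustrative and are not taken from the dataset.

```python
def containment(query_values, candidate_values):
    """Fraction of distinct query values that also occur in the candidate column.

    Assumed definition: |Q ∩ C| / |Q| over distinct values.
    The measure is asymmetric: containment(Q, C) != containment(C, Q) in general.
    """
    query_set = set(query_values)
    if not query_set:
        # An empty query column has no values to contain.
        return 0.0
    return len(query_set & set(candidate_values)) / len(query_set)


# Illustrative columns (not from the dataset).
query_column = ["NY", "CA", "TX", "WA"]
candidate_column = ["NY", "CA", "FL"]
score = containment(query_column, candidate_column)
```

A high score only says that the values overlap; it does not say whether joining the two tables is meaningful, which is the gap the human annotations address.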

We received 6 to 15 annotations per sample and regarded columns as joinable if at least 10% of the annotations were positive. Subsequently, we discovered only 42 context-aware joinable column pairs among the 471 pairs that were identified as joinable through set similarity.
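The labeling rule above can be sketched as follows. This is a minimal illustration, assuming each annotation is encoded as a boolean (`True` for a positive join label); the function name `is_joinable` and that encoding are assumptions, and the actual format of the annotations in `gt.jsonl` may differ.

```python
def is_joinable(annotations, min_positive_fraction=0.10):
    """Return True if at least `min_positive_fraction` of the annotations are positive.

    `annotations` is assumed to be a non-empty sequence of booleans,
    one per annotator who labeled the column pair.
    """
    if not annotations:
        raise ValueError("at least one annotation is required")
    positive = sum(1 for label in annotations if label)
    return positive / len(annotations) >= min_positive_fraction


# With 6 annotations, one positive label clears the threshold (1/6 is above 0.10).
six_votes = [True] + [False] * 5
# With 15 annotations, one positive label does not (1/15 is below 0.10).
fifteen_votes = [True] + [False] * 14
```

Because the number of annotations per pair varies from 6 to 15, the same count of positive labels can lead to different outcomes under this rule.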

The zip file contains:
1. README.md
2. gt.jsonl -- Ground Truth Annotations
3. datalake/ -- Datalake tables in .df format
4. metadata/ -- Metadata of the tables in .json format
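As a starting point for working with the archive, the sketch below reads a JSON Lines file directly from a zip file using only the Python standard library. The helper name `read_jsonl_from_zip` is hypothetical, and the member path passed to it is an assumption: the file may sit under a top-level folder inside the archive, so check `zipfile.ZipFile.namelist()` first. The record fields are not documented here, so the sketch returns the parsed objects without interpreting them.

```python
import io
import json
import zipfile


def read_jsonl_from_zip(zip_path, member):
    """Parse one JSON object per non-empty line of `member` inside the zip at `zip_path`."""
    records = []
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # Wrap the binary stream so lines are decoded as UTF-8 text.
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:
                    records.append(json.loads(line))
    return records
```

A call might look like `read_jsonl_from_zip("opendata-contextawarejoins.zip", "gt.jsonl")`, with the member path adjusted to match the actual layout of the archive.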
 

[1] Harsha Kokel, Aamod Khatiwada, Tejaswini Pedapati, Haritha Ananthakrishnan, Oktie Hassanzadeh, Horst Samulowitz, and Kavitha Srinivas. TOPJoin: A Context-Aware Multi-Criteria Approach for Joinable Column Search. VLDB 2025 Workshop: Tabular Data Analysis (TaDA). https://arxiv.org/abs/2507.11505


LICENSE

This data is released under the CC BY-NC-ND 4.0 license.

Files

opendata-contextawarejoins.zip (151.9 MB)
md5:7348051c5fed71a1641122720fd7e6ef

Additional details

Dates

Accepted
2025-07-14