Published July 14, 2025 | Version v1
Dataset
Open
Context Aware Joins in Open Data
Description
The OpenData benchmark is a collection of tables extracted from open government data repositories and has been used frequently in set-similarity-based join search studies. Most prior work used an exact method to compute set overlap and treated it as the ground truth for joinability evaluation. However, column similarity alone is insufficient for enterprise data lakes, which contain unrelated data from diverse sources. For data lakes, it is important to prioritize statistically significant and contextually relevant joins. In our work [1], we introduced the problem of context-aware joinable column search, but we did not find any significant study or benchmark that captures the notion of context-aware joinable columns.
We therefore gathered human annotations for a small subset of OpenData to enable the creation of context-aware joins that reflect real-world user behavior. We identified 471 column pairs that exhibited high containment scores and solicited join labels from fifteen human annotators. During annotation, the annotators had access to table snippets, metadata such as table descriptions, organization IDs, and tags, as well as the potential outcome of the joined table.
We received 6 to 15 annotations per sample and regarded a column pair as joinable if at least 10% of its annotations were positive. Under this criterion, only 42 of the 471 pairs identified as joinable through set similarity were context-aware joinable.
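The selection and labeling steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `containment` function assumes the common definition |Q ∩ C| / |Q| (the description does not state which variant was used), and the function and parameter names are hypothetical. Only the 10% threshold comes from the description.

```python
from typing import Iterable, Sequence


def containment(query: Iterable[str], candidate: Iterable[str]) -> float:
    """Fraction of distinct query values that also occur in the candidate column.

    Assumed definition: |Q ∩ C| / |Q| over distinct values.
    """
    q, c = set(query), set(candidate)
    if not q:
        return 0.0
    return len(q & c) / len(q)


def is_context_joinable(votes: Sequence[bool],
                        min_positive_fraction: float = 0.10) -> bool:
    """Aggregate annotator votes for one column pair.

    A pair counts as joinable when the share of positive votes reaches
    the threshold (10% in the dataset description).
    """
    if not votes:
        return False
    return sum(votes) / len(votes) >= min_positive_fraction
```

For example, a pair with one positive vote out of six annotations would be labeled joinable under this rule, while a pair with no positive votes would not.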
The zip file contains:
1. README.md
2. gt.jsonl -- Ground Truth Annotations
3. datalake/ -- Datalake tables in .df format
4. metadata/ -- Metadata of the tables in .json format
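A first look at the annotations might start with a generic JSON Lines reader such as the sketch below. It assumes only that `gt.jsonl` holds one JSON object per line, which is what the extension suggests. The record fields are not documented here, so the helper (a hypothetical `read_jsonl`) makes no assumptions about key names; consult README.md for the actual schema.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of Python objects.

    Blank lines are skipped; every other line must be valid JSON.
    """
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Calling `read_jsonl("gt.jsonl")` from the extracted archive directory and inspecting the first record is a quick way to discover which fields the annotations carry.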
[1] Harsha Kokel, Aamod Khatiwada, Tejaswini Pedapati, Haritha Ananthakrishnan, Oktie Hassanzadeh, Horst Samulowitz, and Kavitha Srinivas. TOPJoin: A Context-Aware Multi-Criteria Approach for Joinable Column Search. VLDB 2025 Workshop: Tabular Data Analysis (TaDA). https://arxiv.org/abs/2507.11505
LICENSE
This data is released under the CC BY-NC-ND 4.0 license.
Files
opendata-contextawarejoins.zip
| Name | Size |
|---|---|
| opendata-contextawarejoins.zip (md5:7348051c5fed71a1641122720fd7e6ef) | 151.9 MB |
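The listed MD5 checksum can be used to confirm that the download is intact. The sketch below uses Python's standard `hashlib` module; the helper name `md5_of_file` and the local file path are assumptions, while the expected digest is the one listed for the archive.

```python
import hashlib

# Checksum listed for opendata-contextawarejoins.zip on this record.
EXPECTED_MD5 = "7348051c5fed71a1641122720fd7e6ef"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

After downloading, `md5_of_file("opendata-contextawarejoins.zip") == EXPECTED_MD5` should evaluate to true if the file was transferred without corruption.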
Additional details
Identifiers
- arXiv: arXiv:2507.11505
Dates
- Accepted: 2025-07-14
Software
- Repository URL: https://github.com/IBM/ContextAwareJoin