Retrieve, Merge, Predict: Augmenting Tables with Data Lakes

doi:10.5281/zenodo.10624396

Published May 21, 2024 | Version 2.0

Dataset | Open

Cappuzzo, Riccardo (Data curator)¹

1. Inria Saclay - Île-de-France Research Centre

Contributors

Project manager:

Varoquaux, Gael²

Project members:

1. EURECOM
2. Inria Saclay - Île-de-France Research Centre
3. Dataiku

Files composing the YADL data lake, for the paper "Retrieve, Merge, Predict: Augmenting Tables with Data Lakes (Experiment, Analysis & Benchmark Paper)"

We present an in-depth analysis of data discovery for analytics in data lakes, focusing on table augmentation for given machine learning tasks. We analyze alternative methods used in the three key steps: retrieving joinable tables, merging information, and predicting with the resultant table. As data lakes, the paper uses YADL (Yet Another Data Lake) -- a novel dataset developed as a tool for benchmarking this data discovery task -- and Open Data US, a well-referenced real data lake. Through systematic exploration on both lakes, our study outlines the importance of accurately retrieving join candidates, and the efficiency of simple aggregation methods. We report new insights on the benefits of existing solutions and on their limitations, with the aim of guiding future research in this space.

The archives provided here are named using the notation from the experiments, which differs from the naming used in the paper. The four YADL versions available here are listed below as archive name, followed by the name used in the paper in parentheses:

"binary_update" (YADL Binary)
"wordnet_full" (YADL Base)
"wordnet_vldb_10" (YADL 10k)
"wordnet_vldb_50" (YADL 50k)
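Because the archive names and the paper names differ, a small lookup table can help when cross-referencing results. The following is a minimal sketch, not part of the YADL repository: the dictionary contents come from the list above, while the names ARCHIVE_TO_PAPER_NAME and paper_name are hypothetical and chosen only for illustration.

```python
# Mapping from the archive (experiment) notation to the names used in the paper.
# The four pairs are taken from the list above; the identifiers are hypothetical.
ARCHIVE_TO_PAPER_NAME = {
    "binary_update": "YADL Binary",
    "wordnet_full": "YADL Base",
    "wordnet_vldb_10": "YADL 10k",
    "wordnet_vldb_50": "YADL 50k",
}


def paper_name(archive_name: str) -> str:
    """Return the paper's name for an archive, accepting 'x' or 'x.tar.gz'."""
    suffix = ".tar.gz"
    # Strip the archive extension if the caller passed a file name.
    stem = archive_name[: -len(suffix)] if archive_name.endswith(suffix) else archive_name
    return ARCHIVE_TO_PAPER_NAME[stem]


if __name__ == "__main__":
    for archive in ARCHIVE_TO_PAPER_NAME:
        print(f"{archive}.tar.gz -> {paper_name(archive)}")
```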

Files (39.3 GB)

Name	MD5	Size
binary_update.tar.gz	5d396388133981132df74cd7e9260f04	309.5 MB
wordnet_full.tar.gz	5351d655ce40a5caf2e23dd017c22606	9.8 GB
wordnet_vldb_10.tar.gz	1732c17383c6686b3a29e0456bf63efd	4.9 GB
wordnet_vldb_50.tar.gz	4dbd49f7d27bda04ba3d89d385d5f09e	24.3 GB
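Given the size of these archives, it may be worth checking each download against its published MD5 digest before extracting it. The following is a minimal sketch using only the Python standard library: the digests are copied from the file listing above, while the function names md5_of_file and verify_archive, and the assumption that the archives sit in the current directory under their original names, are illustrative only.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing above.
EXPECTED_MD5 = {
    "binary_update.tar.gz": "5d396388133981132df74cd7e9260f04",
    "wordnet_full.tar.gz": "5351d655ce40a5caf2e23dd017c22606",
    "wordnet_vldb_10.tar.gz": "1732c17383c6686b3a29e0456bf63efd",
    "wordnet_vldb_50.tar.gz": "4dbd49f7d27bda04ba3d89d385d5f09e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte archives are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path):
    """Return True if the file's MD5 matches the digest listed for its name."""
    path = Path(path)
    return md5_of_file(path) == EXPECTED_MD5[path.name]


if __name__ == "__main__":
    # Assumes the archives were downloaded into the current directory.
    for name in EXPECTED_MD5:
        if Path(name).exists():
            status = "OK" if verify_archive(name) else "MISMATCH"
            print(f"{name}: {status}")
        else:
            print(f"{name}: not found")
```

A mismatch usually indicates an incomplete or corrupted download, in which case downloading the archive again is the simplest remedy.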

Additional details

Repository URL: https://github.com/rcap107/YADL
Programming language: Python

	All versions	This version
Views	209	141
Downloads	93	62
Data volume	733.0 GB	570.7 GB
