Published May 21, 2024 | Version 2.0
Dataset Open

Retrieve, Merge, Predict: Augmenting Tables with Data Lakes

  • 1. Inria Saclay - Île-de-France Research Centre

Contributors

Project manager:

  • 1. EURECOM
  • 2. Inria Saclay - Île-de-France Research Centre
  • 3. Dataiku

Description

Files composing the YADL data lake, for the paper "Retrieve, Merge, Predict: Augmenting Tables with Data Lakes (Experiment, Analysis & Benchmark Paper)"

We present an in-depth analysis of data discovery for analytics in data lakes, focusing on table augmentation for given machine learning tasks. We analyze alternative methods used in the three key steps: retrieving joinable tables, merging information, and predicting with the resulting table. As data lakes, the paper uses YADL (Yet Another Data Lake) -- a novel dataset developed as a tool for benchmarking this data discovery task -- and Open Data US, a well-referenced real data lake. Through systematic exploration on both lakes, our study outlines the importance of accurately retrieving join candidates, and the efficiency of simple aggregation methods. We report new insights on the benefits of existing solutions and on their limitations, with the aim of guiding future research in this space.
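As a rough illustration of the "merging information" step with a simple aggregation, the sketch below left-joins a candidate table onto a base table and averages the candidate values when a key matches several rows. It is a toy example under stated assumptions, not the paper's implementation: the function name `left_join_with_mean`, the list-of-dicts table representation, and the choice of the mean as aggregate are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def left_join_with_mean(base_rows, candidate_rows, key, value):
    """Left-join candidate_rows onto base_rows on `key`.

    When a key matches several candidate rows, the `value` column is
    averaged so that the base table keeps exactly one row per input row.
    Rows without a match get None for `value`.
    """
    # Group the candidate values by join key.
    grouped = defaultdict(list)
    for row in candidate_rows:
        grouped[row[key]].append(row[value])

    merged = []
    for row in base_rows:
        matches = grouped.get(row[key])
        new_row = dict(row)  # copy, so the input table is not modified
        new_row[value] = mean(matches) if matches else None
        merged.append(new_row)
    return merged
```

Aggregating before attaching the new column keeps the number of rows in the base table unchanged, which matters when that table carries the prediction target.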

Archives provided here follow the notation used for the experiments, which differs from the names used in the paper. The four YADL versions available here are listed below by archive notation, with the corresponding name from the paper in parentheses:

  • "binary_update" (YADL Binary)
  • "wordnet_full" (YADL Base)
  • "wordnet_vldb_10" (YADL 10k)
  • "wordnet_vldb_50" (YADL 50k)
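For scripts that need to translate between the two naming schemes, the correspondence above can be kept in a small lookup table. This is only a convenience sketch: the dictionary contents come from the list above, while the names `ARCHIVE_TO_PAPER_NAME` and `paper_name` are hypothetical and not part of the YADL code base.

```python
# Archive notation (as used in the experiments) -> name used in the paper.
ARCHIVE_TO_PAPER_NAME = {
    "binary_update": "YADL Binary",
    "wordnet_full": "YADL Base",
    "wordnet_vldb_10": "YADL 10k",
    "wordnet_vldb_50": "YADL 50k",
}


def paper_name(archive_label: str) -> str:
    """Return the paper's name for an archive label.

    Raises KeyError listing the known labels if the label is not recognized.
    """
    try:
        return ARCHIVE_TO_PAPER_NAME[archive_label]
    except KeyError:
        known = ", ".join(sorted(ARCHIVE_TO_PAPER_NAME))
        raise KeyError(
            f"unknown archive label {archive_label!r}; expected one of: {known}"
        ) from None
```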

Files

Files (39.3 GB)

Checksum Size
md5:5d396388133981132df74cd7e9260f04 309.5 MB
md5:5351d655ce40a5caf2e23dd017c22606 9.8 GB
md5:1732c17383c6686b3a29e0456bf63efd 4.9 GB
md5:4dbd49f7d27bda04ba3d89d385d5f09e 24.3 GB
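Given the size of the archives, it may be worth checking a download against the MD5 checksums listed above before extracting it. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_md5` are hypothetical, and the file is read in chunks so that multi-gigabyte archives are not loaded into memory at once.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks of `chunk_size` bytes (1 MiB by default).
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Compare a file's MD5 digest with `expected`.

    `expected` may be given with or without the "md5:" prefix shown in
    the file listing; the comparison is case-insensitive.
    """
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `verify_md5(local_path, "md5:5d396388133981132df74cd7e9260f04")` would check the 309.5 MB archive, where `local_path` stands for wherever that file was saved.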

Additional details

Software

Repository URL: https://github.com/rcap107/YADL
Programming language: Python