Retrieve, Merge, Predict: Augmenting Tables with Data Lakes
Description
Files composing the YADL data lake, for the paper "Retrieve, Merge, Predict: Augmenting Tables with Data Lakes (Experiment, Analysis & Benchmark Paper)"
We present an in-depth analysis of data discovery for analytics in data lakes, focusing on table augmentation for given machine learning tasks. We analyze alternative methods used in the three key steps: retrieving joinable tables, merging information, and predicting with the resultant table. As data lakes, the paper uses YADL (Yet Another Data Lake) -- a novel dataset developed as a tool for benchmarking this data discovery task -- and Open Data US, a well-referenced real data lake. Through systematic exploration on both lakes, our study outlines the importance of accurately retrieving join candidates, and the efficiency of simple aggregation methods. We report new insights on the benefits of existing solutions and on their limitations, aiming at guiding future research in this space.
Archives provided here follow the notation used for the experiments, which differs from the names reported in the paper. The four YADL versions available here are listed below, with the name used in the paper in parentheses:
- "binary_update" (YADL Binary)
- "wordnet_full" (YADL Base)
- "wordnet_vldb_10" (YADL 10k)
- "wordnet_vldb_50" (YADL 50k)
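When scripting against the archives, it can help to keep the correspondence above in one place. The following is a minimal sketch, not part of the YADL code base: the dictionary name `PAPER_NAMES` and the helper `paper_name` are hypothetical, and only the four identifier/name pairs come from the list above.

```python
# Mapping from the archive identifiers used in the experiments to the
# names used in the paper. The pairs are copied from the list above;
# the variable and function names are illustrative assumptions.
PAPER_NAMES = {
    "binary_update": "YADL Binary",
    "wordnet_full": "YADL Base",
    "wordnet_vldb_10": "YADL 10k",
    "wordnet_vldb_50": "YADL 50k",
}


def paper_name(archive_id: str) -> str:
    """Return the paper's name for an archive identifier.

    Raises KeyError with a readable message for unknown identifiers,
    rather than silently returning a guess.
    """
    try:
        return PAPER_NAMES[archive_id]
    except KeyError:
        known = ", ".join(sorted(PAPER_NAMES))
        raise KeyError(
            f"unknown archive {archive_id!r}; expected one of: {known}"
        ) from None


if __name__ == "__main__":
    for archive_id in PAPER_NAMES:
        print(f"{archive_id} -> {paper_name(archive_id)}")
```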
Files
(39.3 GB)
MD5 checksum | Size |
---|---|
md5:5d396388133981132df74cd7e9260f04 | 309.5 MB |
md5:5351d655ce40a5caf2e23dd017c22606 | 9.8 GB |
md5:1732c17383c6686b3a29e0456bf63efd | 4.9 GB |
md5:4dbd49f7d27bda04ba3d89d385d5f09e | 24.3 GB |
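Given the size of the archives, it is worth checking a download against the MD5 checksums listed above before unpacking it. The sketch below uses only the Python standard library; the function names and the example path are hypothetical. Because this listing does not pair checksums with file names, the check only tests whether a file's digest matches any of the four listed values.

```python
import hashlib

# MD5 digests copied from the file listing above. The listing does not
# show which file name goes with which digest, so membership in this set
# is the only check performed here.
KNOWN_MD5 = {
    "5d396388133981132df74cd7e9260f04",
    "5351d655ce40a5caf2e23dd017c22606",
    "1732c17383c6686b3a29e0456bf63efd",
    "4dbd49f7d27bda04ba3d89d385d5f09e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_listed_archive(path):
    """Return True if the file's MD5 digest matches one of the listed ones."""
    return md5_of_file(path) in KNOWN_MD5
```

A typical use would be `is_listed_archive("path/to/downloaded_archive")`, where the path is a placeholder for wherever the file was saved. A `False` result suggests an incomplete or corrupted download.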
Additional details
Software
- Repository URL: https://github.com/rcap107/YADL
- Programming language: Python