Parallel Evaluation of Multi-Semi-Joins
- 1. Hasselt University
- 2. National Taiwan University
- 3. Université Libre de Bruxelles
Description
While services such as Amazon AWS make computing power abundantly available, adding more computing nodes can incur high costs, for instance in pay-as-you-go plans, without always significantly improving the net running time (also known as wall-clock time) of queries. In this work, we provide algorithms for the parallel evaluation of strictly guarded fragment (SGF) queries in MapReduce that optimize total time while retaining low net time. SGF queries can specify not only all semi-join reducers but also more expressive queries involving disjunction and negation. Since SGF queries can be seen as Boolean combinations of (potentially nested) semi-joins, we introduce a novel multi-semi-join (MSJ) MapReduce operator that enables the evaluation of a set of semi-joins in one job. We use this operator to obtain parallel query plans for SGF queries that outperform sequential plans with respect to net time, and we provide additional optimizations aimed at minimizing total time without severely affecting net time. Even though the underlying optimization problems are NP-hard, we present effective greedy algorithms. Our experiments, conducted using our own implementation, Gumbo, on top of Hadoop, confirm the usefulness of parallel query plans and the effectiveness and scalability of our optimizations, all with a significant improvement over Pig and Hive.
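To make the idea of evaluating a set of semi-joins in one job more concrete, the following is a minimal in-memory sketch, not Gumbo's implementation and not the operator as defined in the paper. It simulates a single map/shuffle/reduce pass in plain Python. The function name `multi_semi_join`, the "request"/"assert" message tags, and the toy relations `R`, `S`, and `T` are all assumptions made for illustration.

```python
from collections import defaultdict


def multi_semi_join(guard, semi_joins):
    """Toy single-pass evaluation of several semi-joins (illustrative only).

    guard:      list of tuples (the relation being filtered)
    semi_joins: dict mapping a hypothetical name to
                (guard_columns, other_relation, other_columns)
    Returns a dict mapping each name to the guard tuples that have a match.
    """
    # "Map" phase: emit messages for all semi-joins at once, grouped by join key.
    shuffled = defaultdict(list)
    for name, (guard_cols, relation, rel_cols) in semi_joins.items():
        for row in guard:
            key = tuple(row[c] for c in guard_cols)
            shuffled[key].append(("request", name, row))
        for row in relation:
            key = tuple(row[c] for c in rel_cols)
            shuffled[key].append(("assert", name, None))

    # "Reduce" phase: for each key, keep a guard tuple only if the same
    # semi-join also produced an "assert" message for that key.
    results = {name: [] for name in semi_joins}
    for messages in shuffled.values():
        asserted = {name for kind, name, _ in messages if kind == "assert"}
        for kind, name, row in messages:
            if kind == "request" and name in asserted:
                results[name].append(row)
    return results


# Hypothetical toy data: R is the guard, S and T are the conditional relations.
R = [(1, "a"), (2, "b"), (3, "c")]
S = [(1,), (3,)]
T = [("b",), ("c",)]

out = multi_semi_join(R, {
    "R_semijoin_S": ((0,), S, (0,)),  # match R's first column against S
    "R_semijoin_T": ((1,), T, (0,)),  # match R's second column against T
})

# A Boolean combination of the two semi-joins: in S AND NOT in T.
in_s_not_t = [r for r in R
              if r in out["R_semijoin_S"] and r not in out["R_semijoin_T"]]
```

In this sketch both semi-joins share one pass over the data, and the Boolean combination (here a conjunction with a negation) is computed afterwards from the per-semi-join results. A real MapReduce job would distribute the grouped keys across reducers rather than hold them in one dictionary.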
Files
Name | Size |
---|---|
vldb_2016_daenen_et_al.pdf (md5:773cc24f7de7ddb9ab0fa5c379ad3afc) | 837.0 kB |
Additional details
Related works
- Is supplement to
- 10.14778/2977797.2977800 (DOI)
- arXiv:1605.05219 (arXiv)
- 10.5281/zenodo.51517 (DOI)