Published September 5, 2016 | Version v1
Poster Open

Parallel Evaluation of Multi-Semi-Joins

  • 1. Hasselt University
  • 2. National Taiwan University
  • 3. Université Libre de Bruxelles

Description

While services such as Amazon AWS make computing power abundantly available, adding more computing nodes can incur high costs (for instance, in pay-as-you-go plans) without always significantly improving the net running time (i.e., wall-clock time) of queries. In this work, we provide algorithms for the parallel evaluation of strictly guarded fragment (SGF) queries in MapReduce that optimize total time while retaining low net time. SGF queries can express not only all semi-join reducers but also more expressive queries involving disjunction and negation. Since SGF queries can be seen as Boolean combinations of (potentially nested) semi-joins, we introduce a novel multi-semi-join (MSJ) MapReduce operator that enables the evaluation of a set of semi-joins in one job. We use this operator to obtain parallel query plans for SGF queries that outperform sequential plans w.r.t. net time, and we provide additional optimizations aimed at minimizing total time without severely affecting net time. Even though the optimization problems underlying the latter are NP-hard, we present effective greedy algorithms. Our experiments, conducted using our own implementation, Gumbo, on top of Hadoop, confirm the usefulness of parallel query plans as well as the effectiveness and scalability of our optimizations, all with a significant improvement over Pig and Hive.
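
As a rough illustration of what evaluating a set of semi-joins in one job can look like, the following Python sketch simulates a single map, shuffle, and reduce round in memory. It is not Gumbo's implementation and does not use Hadoop; the function name multi_semi_join, the "requests"/"asserts" grouping, and the sample relations R, S, and T are assumptions made for this example only.

```python
from collections import defaultdict


def multi_semi_join(relations, semi_joins):
    """Evaluate several semi-joins in one simulated map/shuffle/reduce round.

    relations:  dict mapping a relation name to a list of tuples.
    semi_joins: dict mapping an output name to (R, r_cols, S, s_cols),
                meaning "tuples of R whose r_cols values occur as s_cols in S".
    All names and the message layout are hypothetical, chosen for illustration.
    """
    # Shuffle buffer: join-key value -> messages that share this key.
    groups = defaultdict(lambda: {"requests": [], "asserts": set()})

    # Map phase, part 1: each tuple of R asks whether a match exists.
    for out, (r, r_cols, s, s_cols) in semi_joins.items():
        for t in relations[r]:
            key = tuple(t[i] for i in r_cols)
            groups[key]["requests"].append((out, (s, s_cols), t))

    # Map phase, part 2: each tuple of S asserts its key value, once per
    # distinct (relation, columns) pair that some semi-join probes.
    probes = {(s, s_cols) for (_, _, s, s_cols) in semi_joins.values()}
    for s, s_cols in probes:
        for t in relations[s]:
            key = tuple(t[i] for i in s_cols)
            groups[key]["asserts"].add((s, s_cols))

    # Reduce phase: per key, keep the requests whose probe was asserted.
    results = {out: [] for out in semi_joins}
    for group in groups.values():
        for out, probe, t in group["requests"]:
            if probe in group["asserts"]:
                results[out].append(t)
    return results


relations = {
    "R": [(1, "a"), (2, "b"), (3, "c")],
    "S": [(1,), (3,)],
    "T": [("b",), ("c",)],
}
semi_joins = {
    "X1": ("R", (0,), "S", (0,)),  # R tuples whose first column occurs in S
    "X2": ("R", (1,), "T", (0,)),  # R tuples whose second column occurs in T
}
result = multi_semi_join(relations, semi_joins)
```

Both semi-joins are answered from the same pass over the grouped data. A Boolean combination of them could then be formed from the output sets, for example R tuples in X1 but not in X2 for a query using negation.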

Files

vldb_2016_daenen_et_al.pdf

Files (837.0 kB)

md5:773cc24f7de7ddb9ab0fa5c379ad3afc
