DDS: integrating data analytics transformations in task-based workflows
- 1. Workflows and Distributed Computing, Barcelona Supercomputing Center, Barcelona, Catalunya, 08034, Spain
- 2. IOVLabs, Gibraltar, Gibraltar
Description
High-performance data analytics (HPDA) is a current trend in e-science research that aims to integrate traditional HPC with recent data analytic frameworks. Most of the work done in this field has focused on improving data analytic frameworks by implementing their engines on top of HPC technologies such as the Message Passing Interface. However, there is a lack of integration from an application development perspective. HPC workflows have their own parallel programming models, while data analytic (DA) algorithms are mainly implemented using data transformations and executed with frameworks like Spark. Task-based programming models (TBPMs) are a very efficient approach for implementing HPC workflows. Data analytic transformations can also be decomposed as a set of tasks and implemented with a task-based programming model. In this paper, we present a methodology to develop HPDA applications on top of TBPMs that allows developers to combine HPC workflows and data analytic transformations seamlessly. A prototype of this approach has been implemented on top of the PyCOMPSs task-based programming model to validate two aspects: that HPDA applications can be developed seamlessly, and that they achieve better performance than Spark. We compare our results against Spark using several different programs. Finally, we conclude that DA transformations can be integrated into HPC applications through TBPMs and summarize the evaluation of our method against Spark.
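To make the idea of decomposing a data analytic transformation into tasks concrete, the sketch below shows how a simple word-count transformation could be expressed with PyCOMPSs tasks. This is an illustrative example only, not the DDS API from the paper; the partitioning into `blocks` and the helper names are assumptions, while the `@task` decorator and `compss_wait_on` are part of the public PyCOMPSs API.

```python
# Illustrative sketch (not the paper's DDS interface): a word-count
# transformation decomposed into PyCOMPSs map and reduce tasks.
from collections import Counter

from pycompss.api.task import task
from pycompss.api.api import compss_wait_on


@task(returns=1)
def count_words(block):
    # Map phase: count word occurrences in one partition of the input text.
    return Counter(block.split())


@task(returns=1)
def merge_counts(a, b):
    # Reduce phase: merge two partial counts into one.
    a.update(b)
    return a


def word_count(blocks):
    # Each call spawns an asynchronous task; the PyCOMPSs runtime builds
    # the task graph from data dependencies and schedules it on the
    # available resources.
    partials = [count_words(b) for b in blocks]
    result = partials[0]
    for p in partials[1:]:
        result = merge_counts(result, p)
    # Synchronize and retrieve the final counts on the master process.
    return compss_wait_on(result)
```

Because the transformation is expressed as ordinary tasks, it can be mixed freely with other tasks of an HPC workflow in the same application, which is the integration the methodology targets.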
Files
- openreseurope-2-15731.pdf (4.0 MB, md5:e88452bbe1be12dc80860493b5bb66d5)
Additional details
References
- Zaharia M, Chowdhury M, Franklin MJ (2010). Spark: Cluster Computing with Working Sets.
- Asch M, Moore T, Badia R (2018). Big data and extreme-scale computing: Pathways to convergence-toward a shaping strategy for a future software and data ecosystem for scientific inquiry. Int J High Perform Comput Appl. doi:10.1177/1094342018778123
- Gittens A, Rothauge K, Wang S (2019). Alchemist: An Apache Spark-MPI interface. Concurr Comput Pract Exp.
- Caíno-Lores S, Carretero J, Nicolae B (2018). Spark-DIY: A framework for interoperable Spark operations with high performance block-based data models. doi:10.1109/BDCAT.2018.00010
- Dagum L, Menon R (1998). OpenMP: an industry-standard API for shared-memory programming. IEEE Comput Sci Eng. doi:10.1109/99.660313
- Gropp WD, Lusk E, Skjellum A (1999). Using MPI: portable parallel programming with the message-passing interface.
- El-Ghazawi T, Carlson W, Sterling T (2005). UPC: distributed shared memory programming.
- Grünewald D, Simmendinger C (2013). The GASPI API specification and its implementation GPI 2.0.
- Duran A, Perez JM, Ayguadé E (2008). Extending the OpenMP tasking model to allow dependent tasks. doi:10.1007/978-3-540-79561-2_10
- Augonnet C, Thibault S, Namyst R (2011). StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. CCPE - Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par 2009. doi:10.1002/cpe.1631
- Bauer M, Treichler S, Slaughter E (2012). Legion: Expressing locality and independence with logical regions. doi:10.1109/SC.2012.71
- Badia RM, Conejero J, Diaz C (2015). COMP superscalar, an interoperable programming framework. SoftwareX. doi:10.1016/j.softx.2015.10.004
- Dean J, Ghemawat S (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM. doi:10.1145/1327452.1327492
- Zaharia M, Chowdhury M, Das T (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing.
- Zhang B, Peng B, Chen L (2017). Introduction to Harp: when big data meets HPC.
- Rocklin M (2015). Dask: Parallel computation with blocked algorithms and task scheduling. doi:10.25080/Majora-7b98e3ed-013
- Kamburugamuve S, Govindarajan K, Wickramasinghe P (2020). Twister2: Design of a big data toolkit. Concurr Comput Pract Exp. doi:10.1002/cpe.5189
- Tejedor E, Becerra Y, Alomar G (2017). PyCOMPSs: Parallel Computational Workflows in Python. Int J High Perform Comput Appl. doi:10.1177/1094342015594678
- Gutenberg Project.
- News Articles.
- TeraSort dataset generator.
- Transitive Closure dataset generator.
- Lorem Ipsum dataset generator.
- COMPSs.