Published May 25, 2022 | Version 1
Journal article Open

DDS: integrating data analytics transformations in task-based workflows

  • 1. Workflows and Distributed Computing, Barcelona Supercomputing Center, Barcelona, Catalunya, 08034, Spain
  • 2. IOVLabs, Gibraltar, Gibraltar

Description

High-performance data analytics (HPDA) is a current trend in e-science research that aims to integrate traditional HPC with recent data analytic frameworks. Most of the work done in this field has focused on improving data analytic frameworks by implementing their engines on top of HPC technologies such as the Message Passing Interface. However, there is a lack of integration from an application development perspective. HPC workflows have their own parallel programming models, while data analytic (DA) algorithms are mainly implemented using data transformations and executed with frameworks like Spark. Task-based programming models (TBPMs) are a very efficient approach for implementing HPC workflows. Data analytic transformations can also be decomposed into a set of tasks and implemented with a task-based programming model. In this paper, we present a methodology to develop HPDA applications on top of TBPMs that allows developers to combine HPC workflows and data analytic transformations seamlessly. A prototype of this approach has been implemented on top of the PyCOMPSs task-based programming model to validate two aspects: that HPDA applications can be developed seamlessly, and that they achieve better performance than Spark. We compare our results using different programs. Finally, we conclude with a discussion of integrating DA into HPC applications and an evaluation of our method against Spark.
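To illustrate the decomposition the abstract describes, here is a minimal pure-Python sketch of expressing a data-analytic transformation (word count) as independent map tasks plus an associative reduction. This is not the paper's DDS API; in PyCOMPSs each function would additionally carry a `@task` decorator so the runtime executes the calls as concurrent tasks with automatic dependency tracking, whereas this sketch runs them sequentially.

```python
# Sketch: a data-analytic "transformation" (word count) decomposed into
# independent per-partition tasks plus a reduction -- the task graph a
# task-based programming model such as PyCOMPSs can schedule in parallel.
from collections import Counter

def map_task(chunk):
    """Count words in one partition of the dataset (independent task)."""
    return Counter(chunk.split())

def reduce_task(a, b):
    """Merge two partial counts; associative, so reducible in a tree."""
    a.update(b)
    return a

def word_count(partitions):
    # Each map_task call is independent of the others.
    partials = [map_task(p) for p in partitions]
    # Reduction stage combines the partial results.
    result = Counter()
    for part in partials:
        result = reduce_task(result, part)
    return result

counts = word_count(["a b a", "b c"])
```

Because `reduce_task` is associative, a task-based runtime is free to merge partial results pairwise in any order, which is what makes the reduction parallelizable.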

Files

openreseurope-2-15731.pdf (4.0 MB)
md5:e88452bbe1be12dc80860493b5bb66d5

Additional details

References

  • Zaharia M, Chowdhury M, Franklin MJ (2010). Spark: Cluster Computing with Working Sets.
  • Asch M, Moore T, Badia R (2018). Big data and extreme-scale computing: Pathways to convergence-toward a shaping strategy for a future software and data ecosystem for scientific inquiry. Int J High Perform Comput Appl. doi:10.1177/1094342018778123
  • Gittens A, Rothauge K, Wang S (2019). Alchemist: An Apache Spark-MPI interface. Concurr Comput Pract Exp.
  • Caíno-Lores S, Carretero J, Nicolae B (2018). Spark-DIY: A framework for interoperable Spark operations with high-performance block-based data models. doi:10.1109/BDCAT.2018.00010
  • Dagum L, Menon R (1998). OpenMP: an industry-standard API for shared-memory programming. IEEE Comput Sci Eng. doi:10.1109/99.660313
  • Gropp WD, Lusk E, Skjellum A (1999). Using MPI: portable parallel programming with the message-passing interface.
  • El-Ghazawi T, Carlson W, Sterling T (2005). UPC: distributed shared memory programming.
  • Grünewald D, Simmendinger C (2013). The GASPI API specification and its implementation GPI 2.0.
  • Duran A, Perez JM, Ayguadé E (2008). Extending the OpenMP tasking model to allow dependent tasks. doi:10.1007/978-3-540-79561-2_10
  • Augonnet C, Thibault S, Namyst R (2011). StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. CCPE - Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par 2009. doi:10.1002/cpe.1631
  • Bauer M, Treichler S, Slaughter E (2012). Legion: Expressing locality and independence with logical regions. doi:10.1109/SC.2012.71
  • Badia RM, Conejero J, Diaz C (2015). COMP superscalar, an interoperable programming framework. SoftwareX. doi:10.1016/j.softx.2015.10.004
  • Dean J, Ghemawat S (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM. doi:10.1145/1327452.1327492
  • Zaharia M, Chowdhury M, Das T (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing.
  • Zhang B, Peng B, Chen L (2017). Introduction to Harp: when big data meets HPC.
  • Rocklin M (2015). Dask: Parallel computation with blocked algorithms and task scheduling. doi:10.25080/Majora-7b98e3ed-013
  • Kamburugamuve S, Govindarajan K, Wickramasinghe P (2020). Twister2: Design of a big data toolkit. Concurr Comput Pract Exp. doi:10.1002/cpe.5189
  • Tejedor E, Becerra Y, Alomar G (2017). PyCOMPSs: Parallel Computational Workflows in Python. Int J High Perform Comput Appl. doi:10.1177/1094342015594678
  • Gutenberg Project.
  • News Articles.
  • Terasort dataset generator.
  • Transitive Closure dataset generator.
  • Lorem Ipsum dataset generator.
  • COMPSs.