Project deliverable Open Access
Vinicius Monteiro de Lira; Cristina Muntean; Franco Maria Nardini; Raffaele Perego; Nicola Tonellotto; Roberto Trani; Salvatore Trani
This document accompanies deliverable “D4.4 Resource Optimization Methods and Algorithms” and describes new research tools for managing distributed big data platforms, to be used in the BigDataGrapes (BDG) platform to optimize computing resource management.
The BigDataGrapes platform aims at providing Predictive Data Analytics tools that go beyond the state of the art as regards their application to large and complex grapevine-related data assets. Such tools leverage machine learning techniques that largely benefit from the distributed execution paradigm that serves as the basis for efficiently addressing the analytics and scalability challenges of grapevine-powered industries (see Deliverable 4.1). However, distributed architectures can consume a large amount of electricity when operated at large scale. This is the case of OnLine Data-Intensive (OLDI) systems, which perform significant computing over massive datasets per user request and often require responsiveness on a sub-second time scale at high request rates.
In this document, we illustrate the main operational characteristics of OLDI systems and describe the implementation and usage of our OLDI Simulator, a Java library for simulating OnLine Data-Intensive systems. The simulator can be used to study the performance of OLDI systems in terms of latency and energy consumption. We then show how to use the simulator through an example in which we 1) describe a system to simulate, 2) model the system, and 3) evaluate and compare various solutions by means of simulation. Finally, we discuss how the simulator can be used to compare energy-efficient scheduling algorithms in distributed big data platforms, using a Web Search Engine (WSE) with a real-world dataset as a motivating example. The scientific results exploited by our OLDI Simulator and the energy-efficient scheduling algorithms discussed in this document have been shared with the research community in a BigDataGrapes paper [cikm2018].
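As a rough illustration of the kind of quantities such a simulation estimates, the sketch below models a single OLDI server with exponential arrivals and service times and a two-state (busy/idle) power model, reporting mean request latency and total energy. This is a minimal, self-contained toy in Python under assumed parameters; it is not the API of the OLDI Simulator itself, and all names here are illustrative.

```python
import random

def simulate_oldi(num_requests=10000, arrival_rate=50.0, service_rate=80.0,
                  p_busy=200.0, p_idle=100.0, seed=42):
    """Toy single-server OLDI simulation: FIFO queue, exponential arrivals
    and service times. Returns (mean latency in s, energy in J), where energy
    follows a two-state power model: p_busy while serving, p_idle otherwise."""
    rng = random.Random(seed)
    clock = 0.0            # time of the current arrival
    server_free_at = 0.0   # time at which the server next becomes idle
    busy_time = 0.0
    total_latency = 0.0
    for _ in range(num_requests):
        clock += rng.expovariate(arrival_rate)    # next request arrives
        service = rng.expovariate(service_rate)   # its processing time
        start = max(clock, server_free_at)        # wait if the server is busy
        server_free_at = start + service
        busy_time += service
        total_latency += server_free_at - clock   # queueing delay + service
    makespan = server_free_at
    energy = busy_time * p_busy + (makespan - busy_time) * p_idle
    return total_latency / num_requests, energy
```

Comparing two scheduling policies then amounts to running such a simulation under each policy and contrasting the resulting latency/energy pairs, which is the kind of study the full simulator supports at much larger scale.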
In the second revision of this deliverable (due M33), we present a novel technique that enables efficient training of machine learning models in distributed environments. The proposed technique can be used within the BigDataGrapes platform to improve the usage of distributed computing resources. The BDG platform relies on a distributed environment composed of several co-operating machines to support efficient and effective predictive data analytics over extremely large and complex grapevine-related data assets. Such Predictive Data Analytics tools benefit from state-of-the-art machine learning techniques to extract evidence from the data and learn models of the various phenomena. Scalable and distributed machine learning algorithms are needed to efficiently extract knowledge from such extremely large datasets. Moreover, the data of different vineyards may be located on machines that are far from each other, e.g., close to the vineyards where the data are gathered. For this reason, it is crucial to distribute the computation and move it closer to the data, avoiding slow and costly data exchange across the globe. In this regard, the leading approach to learning machine learning models in this scenario is known as Federated Learning (FL). The updated version of this deliverable presents the application of quantization techniques, i.e., a form of lossy compression, to Federated Learning to improve efficiency in terms of data exchange between the machines involved in the computation. We focus on learning neural network models as they achieve state-of-the-art performance on several time-series tasks such as stock forecasting, natural language processing, and sequential image processing.
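To illustrate the idea of quantization as lossy compression of the model updates exchanged in Federated Learning, the sketch below applies uniform quantization to a weight-update vector, encoding each 32-bit float with a few bits. It is a generic, self-contained illustration with assumed names and parameters, not the specific quantization scheme evaluated in the deliverable.

```python
import numpy as np

def quantize(update, num_bits=4):
    """Uniform (lossy) quantization: map each float in `update` to one of
    2**num_bits levels spanning [min, max]. Returns the integer codes plus
    the offset and scale needed to dequantize."""
    lo, hi = float(update.min()), float(update.max())
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((update - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Reconstruct an approximate update from its quantized codes."""
    return codes.astype(np.float32) * scale + lo

# In an FL round, a client would transmit `codes` (num_bits per weight) plus
# two floats instead of 32-bit weights; at 4 bits this is roughly an 8x
# reduction in traffic, at the cost of a bounded reconstruction error
# (at most scale / 2 per weight) absorbed by subsequent training rounds.
```

The server dequantizes each client's codes before averaging them into the global model, so only the compressed representation crosses the network.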
Experiments on a public dataset show that introducing quantization techniques into the Federated Learning of neural networks reduces the volume of data exchanged between machines by up to 19×, with a minimal loss in the performance of the final model. The scientific results as well as the algorithms discussed in this document have been shared with the research community in an article that is currently under review.