Project deliverable Open Access
Ida Mele; Franco Maria Nardini; Raffaele Perego; Nicola Tonellotto; Vinicius Monteiro de Lira; Cristina Muntean; Salvatore Trani
The deliverable D7.1, “Scalability and Robustness Experimental Methodology”, consists of a report describing the methodology for assessing the performance of a big data system. In particular, the purpose of Task 7.1 is to develop and implement a rigorous, automated testing methodology for measuring and comparing the efficiency of the components of a big data system. The methodology considers the characteristics of the system as well as the heterogeneity and distributed nature of big data.
In this deliverable, we present the general concepts related to big data and its properties (i.e., volume, velocity, variety, and veracity). We analyze the state of the art in big data benchmarking, considering challenges that range from preserving the 4V properties of big data to streaming and scalability issues. We also describe state-of-the-art tools that can be used for assessing and monitoring the performance of a big data system.
Besides reviewing the literature, we present the steps that can be followed to rigorously test the BDG system. We advocate a layered design with the user interfaces at the top, providing users with easy access to the benchmarking facilities. Below the interfaces are the functional and execution layers: the former captures the data and test generators as well as the metrics; the latter provides the basic operations for configuring the system, converting the data, and analyzing the results.
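As a minimal sketch of this layered organization, purely for illustration: all class and method names below are assumptions of ours, not the actual BDG interfaces, and the toy "benchmark" merely averages numeric records.

```python
class ExecutionLayer:
    """Bottom layer: basic operations for configuring the system,
    converting data, and analyzing results (hypothetical names)."""
    def configure(self, settings):
        self.settings = dict(settings)

    def convert_data(self, raw_records):
        # Normalize raw input records into a common internal format.
        return [{"value": float(r)} for r in raw_records]

    def analyze_results(self, results):
        values = [r["value"] for r in results]
        return {"count": len(values), "mean": sum(values) / len(values)}

class FunctionalLayer:
    """Middle layer: data/test generation and metrics,
    expressed in terms of the execution layer's operations."""
    def __init__(self, execution):
        self.execution = execution

    def run_benchmark(self, raw_records, settings):
        self.execution.configure(settings)
        data = self.execution.convert_data(raw_records)
        return self.execution.analyze_results(data)

class UserInterface:
    """Top layer: easy, high-level access to benchmarking."""
    def __init__(self, functional):
        self.functional = functional

    def benchmark(self, raw_records):
        return self.functional.run_benchmark(raw_records, {"runs": 1})

ui = UserInterface(FunctionalLayer(ExecutionLayer()))
report = ui.benchmark(["1", "2", "3"])
print(report)  # {'count': 3, 'mean': 2.0}
```

The point of the layering is that each level depends only on the one directly below it, so test generators and metrics can evolve without touching either the user-facing interface or the low-level configuration and conversion operations.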
Another contribution of this deliverable is a set of guidelines that can support the rigorous testing of a big data system. We recommend following a standardized benchmarking methodology divided into stages, ranging from the selection of the application domain to the execution of the tests.
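A staged methodology of this kind can be modeled as an ordered pipeline. The sketch below is only illustrative: the intermediate stage names are our assumptions, since the source names only the first and last stages.

```python
from enum import Enum

class Stage(Enum):
    # Only SELECT_DOMAIN and EXECUTE_TESTS come from the text;
    # the intermediate stages are hypothetical placeholders.
    SELECT_DOMAIN = 1
    DEFINE_WORKLOAD = 2
    PREPARE_DATA = 3
    CONFIGURE_SYSTEM = 4
    EXECUTE_TESTS = 5

def run_methodology(handlers):
    """Run one handler per stage, in definition order,
    collecting each stage's result."""
    results = {}
    for stage in Stage:
        results[stage] = handlers[stage]()
    return results

# Example usage with trivial handlers that just report the stage name.
handlers = {stage: (lambda s=stage: s.name) for stage in Stage}
outcome = run_methodology(handlers)
print(outcome[Stage.EXECUTE_TESTS])  # EXECUTE_TESTS
```

Encoding the stages explicitly makes the ordering enforceable: every benchmark run passes through the same sequence, which is what makes results comparable across systems.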
Since in the BDG project the semantic infrastructure is represented by graph databases, we also describe the main limitations of current benchmarks in the context of relational databases and semantic repositories. We also identify solutions suitable for our project, for example, the benchmarks proposed by the Linked Data Benchmark Council (LDBC), which ensure linearity, reliability, repeatability, and ease of measurement of the metrics. Additionally, LDBC is open to submissions of novel industry benchmarks that may capture the specific data distributions of big data applications, which makes it particularly suitable for the BDG project.
Another contribution of this deliverable is a proposal of the metrics to use for assessing the performance of the BDG system. These metrics are chosen based on the datasets employed in the use cases and pilots of the BDG project. This deliverable was updated at M18: in that version, we extended the review of the literature with newly published contributions and presented an updated list of scalability metrics identified on the basis of the work done in defining the use cases and pilots.
This deliverable has been further updated, although not required, to keep it consistent with the other deliverables: it now includes the description of use case D (Food Protection), which was missing at the time of the second update at M18, and accordingly updates the list of datasets used by the BDG project and the list of scalability metrics. Version 2.1 includes this update and was submitted on 15 January 2021.
D7.1 Scalability and Robustness Experimental Methodology_v2.1_(Submitted to EC).pdf