Project deliverable Open Access
Mele, Ida; Tonellotto, Nicola; Nardini, Franco Maria; Perego, Raffaele; Monteiro de Lira, Vinicius; Muntean, Cristina
Deliverable D7.1, “Scalability and Robustness Experimental Methodology”, is a report describing the methodology for assessing the performance of a big data system. In particular, the purpose of Task 7.1 is to develop and implement a rigorous, automated testing methodology for measuring and comparing the efficiency of the components of a big data system. The methodology takes into account the characteristics of the system as well as the heterogeneity and distributed nature of big data.
In this first version of the deliverable, we present the general concepts related to big data and its four defining properties, the 4Vs (volume, velocity, variety, and veracity). We analyze the state of the art in big data benchmarking, considering challenges that range from preserving the 4V properties of big data to streaming and scalability issues.
Besides reviewing the literature, we present the steps that can be followed to test the BDG system rigorously. We believe a layered design is preferable, with the user interfaces at the top to give users easy access to the benchmarking. Below the interfaces are the functional and execution layers. The former captures the data and test generators as well as the metrics; the latter provides the basic operations for configuring the system, converting the data, and analyzing the results.
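The layered design described above can be illustrated with a minimal Python sketch. It is not the BDG implementation: every name in it (generate_data, generate_tests, throughput, BenchmarkConfig, run_benchmark, and so on) is hypothetical, the workloads are toy functions, and throughput is used only as an example metric.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


# --- Functional layer: data generator, test generator, metric ---
# All names below are hypothetical and chosen for illustration only.

def generate_data(n: int) -> List[int]:
    """Toy data generator: n synthetic integer records."""
    return list(range(n))


def generate_tests() -> List[Callable[[List[int]], int]]:
    """Toy test generator: the workloads to run against the data."""
    return [sum, len]


def throughput(records: int, seconds: float) -> float:
    """Example metric: records processed per second."""
    return records / seconds if seconds > 0 else float("inf")


# --- Execution layer: configure, convert, run, analyze ---

@dataclass
class BenchmarkConfig:
    n_records: int = 1000  # assumed configuration knob


@dataclass
class RunResult:
    workload: str
    records: int
    seconds: float


def convert(data: Iterable[int]) -> List[int]:
    """Placeholder conversion into the format the system under test expects."""
    return [int(x) for x in data]


def execute(workloads, data: List[int]) -> List[RunResult]:
    """Run each workload once and record its wall-clock time."""
    results = []
    for workload in workloads:
        start = time.perf_counter()
        workload(data)
        elapsed = time.perf_counter() - start
        results.append(RunResult(workload.__name__, len(data), elapsed))
    return results


def analyze(results: List[RunResult]) -> Dict[str, float]:
    """Apply the metric to every run."""
    return {r.workload: throughput(r.records, r.seconds) for r in results}


# --- User interface layer: a single entry point for the user ---

def run_benchmark(config: BenchmarkConfig) -> Dict[str, float]:
    data = convert(generate_data(config.n_records))
    return analyze(execute(generate_tests(), data))


if __name__ == "__main__":
    for name, value in run_benchmark(BenchmarkConfig(n_records=10_000)).items():
        print(f"{name}: {value:.0f} records/s")
```

The user only calls run_benchmark, while generators and metrics can be swapped in the functional layer without touching the execution code, which is the separation the layered design aims for.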
Another contribution of this deliverable is a set of guidelines to support the rigorous testing of a big data system. We believe a good approach is to follow a standardized benchmarking methodology divided into stages, from the selection of the application domain to the execution of the tests. Since the BDG system is not yet finalized, these guidelines are very general and will be refined and made concrete once the system has been developed.
Since the semantic infrastructure of the BDG project is represented by graph databases, we also describe the main limitations of current benchmarking in the context of relational databases and semantic repositories. We also identify suitable solutions for our project, for example the benchmarks proposed by the Linked Data Benchmark Council (LDBC), which ensure linearity, reliability, repeatability, and ease of measurement of the metrics. Additionally, LDBC is open to submissions of novel industry benchmarks that may capture the specific data distributions of big data applications, which makes it particularly suitable for the BDG project.
A further contribution of this deliverable is a first proposal of the metrics to use for assessing the performance of the BDG system. These metrics are chosen based on the datasets employed in the use cases of the BDG project. Like the guidelines, the metrics are subject to change: the use cases may be refined during the development of the project, and the corresponding datasets and metrics would change accordingly.
D7.1 - Scalability and Robustness Experimental Methodology.pdf