Project deliverable Open Access

BigDataGrapes D4.3 - Models and Tools for Predictive Analytics over Extremely Large Datasets

Franco Maria Nardini; Vinicius Monteiro de Lira; Salvatore Trani; Raffaele Perego; Cristina Muntean

This accompanying document for deliverable D4.3 (Models and Tools for Predictive Analytics over Extremely Large Datasets) describes the first version of the mechanisms and tools supporting efficient and effective predictive data analytics over the BigDataGrapes (BDG) platform in the context of grapevine-related assets.

The BDG software stack employs efficient and fault-tolerant tools for distributed processing, aimed at providing scalability and reliability for the target applications. On top of this stack, the BDG platform enables distributed predictive big data analytics by running scalable Machine Learning algorithms that make efficient use of the computational resources of the underlying infrastructure. The software components enabling BDG predictive data analytics have been designed and deployed as Docker containers. They thus include everything needed to run the supported predictive data analytics tools on any system that can run a Docker engine.

The document first introduces the main technologies used in the first version of the BDG component for performing efficient and scalable analytics over extremely large datasets. The Docker component provided in this deliverable relies on the BDG software stack discussed in Deliverable 2.3: “BigDataGrapes Software Stack Design” and exploits the distributed execution environment provided by the Persistence and Processing Layers of the BDG architecture contributed in Deliverable 4.1: “Methods and Tools for Scalable Distributed Processing”.

In Section 3, we detail the steps to download and deploy the first version of the BDG platform and provide the reader with practical usage examples of its scalable predictive analytics component. Specifically, we provide four demonstrators, released as Jupyter Notebooks, that implement four different machine learning tasks on the BDG infrastructure. The first one shows how to train two kinds of regressors, i.e., linear and random forest regressors, to fit synthetically generated data. We accompany the results with a visualization that allows the reader to understand the limitations of each specific solution. The second demonstrator employs a well-known dataset, i.e., the KDD CUP 1999 dataset, to train a binary logistic regression classifier. We show how to train the classifier and evaluate its performance by means of a standard metric, i.e., Accuracy. This second demonstrator also shows how the distributed file system can be exploited to feed the machine learning platform with data directly. The third demonstrator extends the second one by showing how to train a multi-label classifier on the Red Wine Quality Dataset, a public dataset employed on Kaggle for a machine learning competition. We show how to learn a multi-label logistic regression classifier and how to evaluate its performance in terms of Accuracy.
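As a rough illustration of the idea behind the first demonstrator, the sketch below fits a straight line to synthetically generated data and reports the mean squared error on a linear and on a non-linear target. It is not the notebook shipped with the deliverable: it runs on a single machine with the Python standard library only, and the function names, data-generating functions, noise level, and sample size are all assumptions chosen for illustration.

```python
import random


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared difference between the fitted line and the targets."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical synthetic data: 200 points in [-3, 3] with small Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-3.0, 3.0) for _ in range(200)]
ys_linear = [2.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]   # linear target
ys_quadratic = [x * x + rng.gauss(0.0, 0.1) for x in xs]        # non-linear target

# A line recovers the linear target well ...
slope_lin, intercept_lin = fit_line(xs, ys_linear)
mse_lin = mean_squared_error(xs, ys_linear, slope_lin, intercept_lin)

# ... but cannot follow the curvature of the quadratic target.
slope_quad, intercept_quad = fit_line(xs, ys_quadratic)
mse_quad = mean_squared_error(xs, ys_quadratic, slope_quad, intercept_quad)

print(f"linear target:    slope={slope_lin:.3f}, MSE={mse_lin:.4f}")
print(f"quadratic target: slope={slope_quad:.3f}, MSE={mse_quad:.4f}")
```

The gap between the two errors is the kind of limitation the demonstrator's visualization is meant to expose: a linear regressor underfits a non-linear relationship, which a more flexible model such as a random forest regressor can capture.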

Section 3 also presents a fourth demonstrator, which is considerably more complex and structured and was added to this document as an update at M15. The demonstrator focuses on the application of machine learning methods to wine data collected from online social networks of wine enthusiasts. The dataset analyzed contains 489,417 wine reviews by 195,678 users, written in 86 languages, related to 51,579 different wines from 58 wine countries and 2,272 wine regions. The predictive analysis conducted on this dataset shows the potential of the machine learning layer of the BDG infrastructure, which provides efficient and effective methods for assessing the potential market penetration of a given wine in a new country. We estimate this penetration capability by learning a model from user-generated content about wines in a target country.

Section 4 of this document was added as an update at M31. The section details the machine learning analytics developed to support the five BigDataGrapes pilots. For each solution we detail: i) the specific task addressed, ii) the software architecture of the analytics developed, iii) the data acquisition and preprocessing pipeline employed to train and apply the models within the BigDataGrapes infrastructure, and iv) the machine learning techniques employed to solve the task, the methodology used, and the hyper-parameter tuning performed to provide the best performance to the BigDataGrapes final user. For each technique, we also provide an experimental comparison against strong baselines and state-of-the-art competitors on the real data provided within the BigDataGrapes consortium. The experimental results reported show the effectiveness of the machine-learned analytics developed.
