Project deliverable Open Access
This accompanying document for deliverable D4.3 (Models and Tools for Predictive Analytics over Extremely Large Datasets) describes the first version of the mechanisms and tools supporting efficient and effective predictive data analytics over the BigDataGrapes (BDG) platform in the context of grapevine-related assets.
The BDG software stack employs efficient and fault-tolerant tools for distributed processing, aimed at providing scalability and reliability for the target applications. On top of this stack, the BDG platform enables distributed predictive big data analytics by exploiting scalable Machine Learning algorithms that make efficient use of the computational resources of the underlying infrastructure. The software components enabling BDG predictive data analytics have been designed and deployed using Docker containers. They thus include everything needed to run the supported predictive data analytics tools on any system that can run a Docker engine.
The document first introduces the main technologies used in the first version of the BDG component for performing efficient and scalable analytics over extremely large datasets. The Docker component provided in this deliverable relies on the BDG software stack discussed in Deliverable 2.3: “BigDataGrapes Software Stack Design” and exploits the distributed execution environment provided by the Persistence and Processing Layers of the BDG architecture contributed in Deliverable 4.1: “Methods and Tools for Scalable Distributed Processing”.
The document then details the steps to download and deploy the first version of the BDG platform and gives practical examples of how to use its scalable predictive analytics component. Specifically, we provide four demonstrators, released as Jupyter Notebooks, that implement four different machine learning tasks on the BDG infrastructure.
The first demonstrator shows how to train two kinds of regressors, i.e., linear and random forest regressors, to fit synthetically generated data. We visualize the results to help the reader understand the limitations of each solution.
The second demonstrator employs a well-known dataset, i.e., the KDD CUP 1999 dataset, to train a binary logistic regression classifier. We show how to train the classifier and how to evaluate its performance by means of a standard metric, i.e., Accuracy. This demonstrator also shows how the distributed file system can be exploited to feed data directly to the machine learning platform.
The third demonstrator extends the second one by showing how to train a multi-label classifier on the Red Wine Quality Dataset, a public dataset used on Kaggle for a machine learning competition. We show how to learn a multi-label logistic regression classifier and how to evaluate its performance in terms of Accuracy.
The fourth demonstrator is considerably more complex and structured, and was added to this document as an update at M15. It focuses on the application of machine learning methods to wine data collected from online social networks of wine enthusiasts. The dataset analyzed contains 489,417 wine reviews by 195,678 users, written in 86 languages and related to 51,579 different wines from 58 wine countries and 2,272 wine regions.
The predictive analysis conducted on this dataset allows us to show the potential of the machine learning layer of the BDG infrastructure, which provides efficient and effective methods for assessing the potential market penetration of a given wine in a new country. We estimate this penetration capability by learning a model from user-generated content about wines in a target country.
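As a small illustration of the point made by the first demonstrator (that each kind of regressor has its own limitations), the following sketch fits a straight line by ordinary least squares to two synthetic datasets, one linear and one quadratic, and compares the mean squared errors. It is not the code of the BDG notebooks: it is a standalone, single-machine stand-in written in plain Python, the random forest regressor is not reproduced, and the function names, data-generating formulas, and noise levels are assumptions chosen only for this example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic data (assumed for illustration): inputs on a regular grid in [-5, 5].
rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [i / 10 for i in range(-50, 51)]
linear_ys = [2 * x + 1 + rng.gauss(0, 0.1) for x in xs]   # linear target plus noise
quadratic_ys = [x * x + rng.gauss(0, 0.1) for x in xs]    # non-linear target plus noise

# Fit the same linear model to both datasets.
slope_lin, intercept_lin = fit_line(xs, linear_ys)
mse_lin = mean_squared_error(
    linear_ys, [slope_lin * x + intercept_lin for x in xs]
)

slope_quad, intercept_quad = fit_line(xs, quadratic_ys)
mse_quad = mean_squared_error(
    quadratic_ys, [slope_quad * x + intercept_quad for x in xs]
)

print(f"linear data:    slope={slope_lin:.2f}, mse={mse_lin:.3f}")
print(f"quadratic data: slope={slope_quad:.2f}, mse={mse_quad:.3f}")
```

On the linear data the fitted line recovers the generating slope closely and the error stays near the noise level, while on the quadratic data the same model leaves a large residual error. This is the kind of limitation that a visual comparison of regressors makes apparent.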