Project deliverable Open Access

D4.1 Definition of Architecture for Extreme-Scale Analytics

Project consortium members

This deliverable defines the building blocks of the INFORE Architecture, their functionality and their interconnections in the scope of a holistic, pluggable, extensible INFORE framework that will evolve into an omnibus solution for extreme-scale streaming analytics.

Aligned with the objectives stated in the project proposal, INFORE aims at: 

  (i) supporting the non-programmer data analyst in the rapid setup of streaming workflows tailored to the needs of her application scenario, by providing graphical workflow design facilities,
  (ii) automating the tuning of the underlying Big Data platform infrastructure that materializes the visually designed workflow, as well as of the provisioned physical resources, in a way that optimizes specific performance measures,
  (iii) providing real-time, interactive machine learning and data mining tools that can be leveraged by the designed workflows,
  (iv) enhancing interactivity via data summarization and approximate query processing techniques,
  (v) providing distributed complex event processing and forecasting techniques that not only detect business events of interest as soon as they occur, but also forecast their occurrence well in advance.

 

To achieve these goals, the definition of the INFORE Architecture includes the following loosely coupled, modular components: (i) Graphical Editor Component, (ii) Connection Component, (iii) Manager Component, (iv) Optimizer Component, (v) Synopses Data Engine Component, (vi) Interactive Online Machine Learning Component, (vii) Complex Event Forecasting Component.

The Graphical Editor Component is an extension of RapidMiner Studio developed in the scope of the project. An elaborate Streaming Nest operator is being developed within the Studio. The Streaming Nest operator is essentially an umbrella encompassing a family of streaming, logical (i.e., abstract, not tied to a particular implementation on a Big Data platform) operators developed in the scope of the project. This family includes both data management operators, such as filtering, join, projection, map, reduce and aggregation, and logical operators for online machine learning, complex event forecasting and data approximation.

Having designed the desired workflow using the drag-and-drop functionality of the Graphical Editor Component, the user proceeds to visually create connection objects for input streams, output streams and streaming backends (the available Big Data platforms and the respective clusters hosting them) in the Graphical Editor Component. These visually defined connection details are handled internally by the Connection Component.

Upon submission of the workflow, the Manager Component takes over and conveys it to the Optimizer Component. The Optimizer Component returns to the Manager Component a modified, optimized workflow to which it has attached execution plan information on (a) the Big Data platform on which each operator of the workflow will be executed, (b) the cluster resources that will be provisioned, (c) the cluster in which each operator will be deployed in the case of multiple geo-dispersed clusters, and (d) replacements of exact workflow operators with approximate ones provided by the Synopses Data Engine Component, if the user has stated that the application can tolerate some predefined inaccuracy guarantees in exchange for reduced workflow execution time. The Manager Component may either visualize the modified workflow and ask for user approval, or execute the plan provided by the Optimizer Component.

To execute the plan, all logical operators in it are instantiated by their physical implementations. A dispatcher module, part of the Manager Component, submits a separate job for each Big Data platform and respective cluster, while output streams are provided to the desired applications. In that scope, the physical implementation of approximate query processing operators is included in the Synopses Data Engine Component, the physical implementation of machine learning operators resides in the Interactive Online Machine Learning Component, and that of complex event forecasting operators resides in the Complex Event Forecasting Component.
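The hand-off from Optimizer to dispatcher described above can be illustrated with a minimal Python sketch. It is not the INFORE implementation: the class `LogicalOperator`, the functions `optimize` and `dispatch`, the operator kinds, and the platform and cluster names are all hypothetical, and the "optimizer" here only copies a placement table supplied by the caller instead of computing one.

```python
from collections import defaultdict
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class LogicalOperator:
    """Hypothetical logical operator: abstract, not tied to any platform."""
    name: str
    kind: str                       # e.g. "filter", "count_distinct"
    platform: Optional[str] = None  # filled in by the execution plan
    cluster: Optional[str] = None   # filled in by the execution plan
    approximate: bool = False       # True if swapped for a synopsis-based operator


def optimize(workflow, placement, synopsis_kinds, tolerate_error):
    """Toy stand-in for the Optimizer Component (assumption, not the real logic).

    Attaches a (platform, cluster) pair to every operator and marks an
    operator as approximate only when the user tolerates inaccuracy and
    an approximate counterpart exists for its kind.
    """
    plan = []
    for op in workflow:
        platform, cluster = placement[op.name]
        approx = tolerate_error and op.kind in synopsis_kinds
        plan.append(replace(op, platform=platform, cluster=cluster,
                            approximate=approx))
    return plan


def dispatch(plan):
    """Toy dispatcher: one job per (platform, cluster) pair in the plan."""
    jobs = defaultdict(list)
    for op in plan:
        jobs[(op.platform, op.cluster)].append(op.name)
    return dict(jobs)


# Hypothetical three-operator workflow and placement table.
workflow = [
    LogicalOperator("src_filter", "filter"),
    LogicalOperator("distinct_users", "count_distinct"),
    LogicalOperator("forecast", "event_forecasting"),
]
placement = {
    "src_filter": ("platform-x", "cluster-a"),
    "distinct_users": ("platform-x", "cluster-a"),
    "forecast": ("platform-y", "cluster-b"),
}

plan = optimize(workflow, placement, synopsis_kinds={"count_distinct"},
                tolerate_error=True)
jobs = dispatch(plan)
for target, operators in jobs.items():
    print(target, operators)
```

In this sketch the two operators placed on the same platform and cluster end up in a single job, while the third is submitted separately, which mirrors the "separate jobs for each Big Data platform and respective cluster" behaviour attributed to the dispatcher.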

This deliverable is directly related to the deliverables of WP1, WP2 and WP3 (to date, use case requirements have been expressed in Deliverables D1.2, D2.1 and D3.1), which aid in applying the INFORE framework to specific application scenarios. WP5 specifies the internal details of the Optimizer Component, starting with Deliverable D5.1, to be submitted in Month 16 of the project. WP6 develops the Synopses Data Engine Component (Deliverable D6.1, due in Month 12 together with the current one and later enhanced in D6.3), as well as the Interactive Online Machine Learning Component and the Complex Event Forecasting Component, described in Deliverables D6.2 (Month 16), D6.4 and D6.5. The first complete prototype of the INFORE Architecture is presented in the follow-up Deliverable D4.2 in Month 16.

Files (1.5 MB)
D4.1 Definition of Architecture for Extreme-Scale Analytics.pdf (1.5 MB)
md5:22eea9711cbed2f703bbd3a63f051176