Project deliverable Open Access
Project consortium members
This deliverable elaborates on the first version of the Data Synopses Generator, henceforth referred to as the Synopsis Data Engine (SDE), developed within the scope of INFORE. The purpose of the SDE is to provide concise summaries of the massive, high-velocity input data streams it processes, both per stream and across streams. The data summaries provided by the SDE constitute representative views of important aspects of the incoming data (samples, expected values, counts and frequency moments, among others), approximated with predefined accuracy guarantees. Such synopses summarize huge volumes of ever-growing data in a real-time, online fashion. Other INFORE modules, such as the online learning and complex event forecasting algorithms in WP6 or the INFORE Optimizer developed in WP5, can therefore work on carefully crafted data summaries instead of the entire stream(s). This speeds up response times of complex tasks in INFORE workflows, thus boosting interactive data exploration.
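To make the notion of a synopsis with predefined accuracy guarantees concrete, the following is a minimal Python sketch of one well-known general-purpose summary for counts, the Count-Min sketch. It is offered purely as an illustration and is not taken from the SDE code base: the class name, method names and the use of a salted BLAKE2b hash in place of a pairwise-independent hash family are all assumptions made for this example. Under the standard analysis, an estimate never falls below the true count and, with probability at least 1 - delta, exceeds it by at most epsilon times the total number of items seen.

```python
import hashlib
import math


class CountMinSketch:
    """Illustrative Count-Min sketch: approximate per-item counts in fixed space."""

    def __init__(self, epsilon: float, delta: float):
        # Standard sizing: width = ceil(e / epsilon), depth = ceil(ln(1 / delta)).
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0  # total count of all items added so far

    def _bucket(self, row: int, item: str) -> int:
        # Assumption: one salted BLAKE2b hash per row stands in for a
        # pairwise-independent hash family.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        # Increment one counter per row.
        self.total += count
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += count

    def estimate(self, item: str) -> int:
        # The minimum over rows is the least-contaminated counter,
        # so it never underestimates the true count.
        return min(
            self.table[row][self._bucket(row, item)]
            for row in range(self.depth)
        )


# Hypothetical usage: count trades per stock symbol.
sketch = CountMinSketch(epsilon=0.01, delta=0.01)
for symbol in ["AAPL", "MSFT", "AAPL", "AAPL", "MSFT"]:
    sketch.add(symbol)
approx_aapl = sketch.estimate("AAPL")  # at least 3, the true count
```

The memory footprint depends only on epsilon and delta, not on the length of the stream, which is the property that makes such summaries attractive for high-velocity inputs.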
Given the above, the current deliverable outlines the requirements posed by INFORE that drove the strategic decisions behind the SDE’s architectural design. It details the architectural choices in the development of the SDE, from the point where streaming data are ingested from the relevant data sources to the point where concise synopses of these data are built, maintained and subsequently delivered as responses to requests by other INFORE components. We further describe the functionality and design of the currently supported synopses. To ensure the utility of the SDE in INFORE use case scenarios and broader application areas, the synopses currently incorporated in the SDE include general-purpose streaming data summarization techniques as well as synopses destined to support INFORE use case workflows.
For the reasons argued in this deliverable, the SDE has been developed on the Apache Flink Big Data platform, one of the most prominent frameworks designed to operate as a true streaming engine and to combine batch and stream operations. The SDE is fully extensible: each new synopsis algorithm can be plugged in by extending the Synopsis supertype and implementing its basic methods. The first version of the SDE supports the following operations on the fly (i.e., while the SDE is up and running): (a) deploying and maintaining a new synopsis, chosen from a library of algorithms registered with the SDE, for a given StreamID (e.g., a stockID) or dataset (e.g., financial data), (b) querying a maintained synopsis, in an ad-hoc or continuous fashion, to obtain a response with respect to the quantity it is destined to approximate, (c) querying all maintained synopses for a given <DatasetID, StreamID> pair or <DatasetID, StreamID, Value> triplet, or for a given <DatasetID> or <DatasetID, Value> pair, and (d) providing reports on the currently running synopses and their parameters.
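The plug-in design and the on-the-fly operations (a) to (d) can be pictured with the following minimal sketch. It is written in plain Python for brevity, whereas the SDE itself runs on Apache Flink, and it does not reproduce the SDE's actual interfaces: apart from the Synopsis supertype named above, every class, method and parameter name here (RunningMean, SynopsisRegistry, deploy, ingest, query, query_all, report) is a hypothetical choice made for illustration.

```python
from abc import ABC, abstractmethod


class Synopsis(ABC):
    """Hypothetical supertype: a new algorithm implements add() and estimate()."""

    def __init__(self, **params):
        self.params = params  # kept so that reports can list them

    @abstractmethod
    def add(self, value: float) -> None:
        """Fold one incoming stream value into the summary."""

    @abstractmethod
    def estimate(self) -> float:
        """Return the current approximation of the target quantity."""


class RunningMean(Synopsis):
    """Toy synopsis: incrementally maintained mean using constant space."""

    def __init__(self, **params):
        super().__init__(**params)
        self.n = 0
        self.mean = 0.0

    def add(self, value: float) -> None:
        self.n += 1
        self.mean += (value - self.mean) / self.n

    def estimate(self) -> float:
        return self.mean


class SynopsisRegistry:
    """Hypothetical stand-in for the engine's bookkeeping of live synopses."""

    def __init__(self):
        self.library = {}  # algorithm name -> Synopsis subclass
        self.running = {}  # (dataset_id, stream_id, name) -> live instance

    def register(self, name, cls):
        self.library[name] = cls

    def deploy(self, dataset_id, stream_id, name, **params):  # operation (a)
        self.running[(dataset_id, stream_id, name)] = self.library[name](**params)

    def ingest(self, dataset_id, stream_id, value):
        # Route a value to every synopsis maintained for this stream.
        for (d, s, _), synopsis in self.running.items():
            if (d, s) == (dataset_id, stream_id):
                synopsis.add(value)

    def query(self, dataset_id, stream_id, name):  # operation (b), ad hoc
        return self.running[(dataset_id, stream_id, name)].estimate()

    def query_all(self, dataset_id, stream_id):  # operation (c)
        return {
            n: synopsis.estimate()
            for (d, s, n), synopsis in self.running.items()
            if (d, s) == (dataset_id, stream_id)
        }

    def report(self):  # operation (d)
        return [(d, s, n, syn.params) for (d, s, n), syn in self.running.items()]


# Hypothetical usage on a financial stream.
registry = SynopsisRegistry()
registry.register("mean", RunningMean)
registry.deploy("financial", "stock42", "mean", window="unbounded")
for price in [10.0, 20.0, 30.0]:
    registry.ingest("financial", "stock42", price)
current_mean = registry.query("financial", "stock42", "mean")
```

Adding a new algorithm in this sketch means writing one subclass and registering it under a name; the registry code does not change, which mirrors the extensibility argument made in the text.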
Within the scope of INFORE, the SDE may summarize data that directly originate from cancer evolution simulations or from financial or maritime activity monitoring scenarios. Therefore, this deliverable accounts for the scenarios described in Deliverables D1.1, D2.1 and D3.1, respectively, as the SDE is destined to support synopses tailored to the INFORE use cases besides being extensible to broader application areas. It further takes into consideration the dataset descriptions outlined in the Data Management Plan V1, Deliverable D8.3. Moreover, in the streaming workflows supported by INFORE, a synopsis provided by the SDE may act as an intermediate operation between other operators: some provide the input streams to be summarized (upstream operators), while the rest receive the constructed summaries as their own input (downstream operators). Upstream and downstream operators may involve machine learning or forecasting operators developed in Tasks T6.2 and T6.3 of WP6. Thus, the current deliverable directly interacts with the advancements made in these tasks, described in the upcoming Deliverable D6.2. In addition, INFORE’s Optimization module developed in WP5 should take into consideration that, when approximate query answers with predefined accuracy guarantees can be tolerated in a given workflow, combinations of operators in the workflow designed by the application may be substituted with approximate ones provided by the SDE. As such, the techniques developed in the upcoming versions of the Workflow Optimization Technology (D5.2-D5.4, D5.7) can leverage the SDE’s arsenal in their algorithms. Finally, the interactions of the SDE with other parts of the overall INFORE architecture are described in Deliverable D4.1.