I-BiDaaS: Industrial-Driven Big Data as a Self-Service Solution

Keywords: Real-time data analytics, Data stream analysis, Scalability, Data visualization, Very large databases

Organizations leverage data pools to drive value, while it is variety, not volume or velocity, that drives big-data investments. The convergence of IoT, cloud, and big data creates new opportunities for self-service analytics and a completely new paradigm for big data analytics. Human- and machine-created data is being aggregated, transforming our economy and society. To face these challenges, companies call upon expert analysts and consultants to assist them. A self-service solution will be transformative for organizations: it will empower their employees with the right knowledge and give the true decision-makers the insights they need to make the right decisions. It will shift the power balance within an organization, increase efficiency, reduce costs, improve employee empowerment, and increase profitability. I-BiDaaS aims to empower users to easily utilize and interact with big data technologies by designing, building, and demonstrating a unified solution that: significantly increases the speed of data analysis while coping with the rate of data asset growth, and facilitates cross-domain data-flow towards a thriving data-driven EU economy.


Executive Summary
This deliverable describes the implementation and operation of real-life industrial experiments from the telecommunication, banking and manufacturing industries in order to demonstrate how the I-BiDaaS solution has been applied in real-world environments. The I-BiDaaS project, funded by the Horizon 2020 Programme under Grant Agreement 780787, aims to empower users to easily utilize and interact with big data technologies, by designing, building, and demonstrating, a unified solution that significantly increases the speed of data analysis while coping with the rate of data asset growth, and facilitates cross-domain data-flow towards a thriving data-driven EU economy. To this end, the project developed an integrated platform for processing and extracting actionable knowledge from big data that includes: 1) data ingestion from various data sources and its preparation; 2) fabrication of realistic synthetic data for experimentation and testing; 3) batch and streaming analytics; and 4) simple, intuitive, and effective visualization and interaction capabilities for the end-users.
All activities have been aligned with the I-BiDaaS experimental protocol in order to ensure smooth and adequate running of the operational experiments, in alignment with the business objectives identified by the industrial partners for all defined use cases. The real-life industrial experiments were further revised during the implementation phase to reflect progress in the design of the I-BiDaaS platform and associated technologies.
In more detail, the deliverable reports on the description of the different types of data provided, the generated datasets, the experimental workflow and the data quality evaluation for all use cases exploited during the experimental process and integrated in the I-BiDaaS platform. The integration was performed by ensuring secure data management through the anonymization and encryption of the data. Several programming languages and advanced visualization tools have been used to develop an easy-to-use platform for all experiments. Furthermore, the different end-user categories for each sector were detailed to characterize the usage of the platform, and two additional generic use cases for expert and non-expert users have been developed to offer a more generic solution beyond the I-BiDaaS use cases, taking into consideration that the solution depends not only on the type and amount of data, but also on the type of potential end-users.
Finally, the deliverable describes progress on the impact analysis, continuing the work started in D6.2 [1] and reported in D6.3 [2] with respect to the expected project innovation and achievements, and provides the external stakeholders' feedback collected during the progress of the project.

Introduction
This deliverable continues the work started in D6.2 [1] and reports the detailed description of the results of 'WP6. Real-life industrial and operational experiments'. All experiments were executed according to the experimental protocol alignment (Task 6.1) and were implemented and operated through three different real scenarios belonging to the telecommunication, financial and manufacturing sectors (Task 6.2); tests were defined to determine the efficiency, operability, usability, robustness, performance, privacy awareness and costs of the real experiments and the impact analysis (Task 6.3). Each experiment was defined within the project in terms of data gathering, dataset implementation, analysis, integration and explanation of experimental results.
Synthetic and real anonymised data have been provided, generated and processed. The methods developed in WP2 'Data curation, ingestion and pre-processing' have been used to aggregate, pre-process, manage and synthesize different types of data in both batch and real time. Batch and stream processing, described in detail in deliverables submitted under WP3 'Batch processing innovative technologies for rapidly increasing historical data' and WP4 'Distributed analytics over extremely large numbers of high volume streams', have been performed in WP6 activities in order to take into account all aspects that may occur in real-world environments, such as cases that require a deeper analysis of large amounts of data collected over a period of time (batch), or those that require velocity and agility for events that need to be monitored in real or near real-time (streaming). Operational experiments and trials have been carried out using the I-BiDaaS solution within an interactive process between data providers and I-BiDaaS analysts and technologists. Finally, this deliverable reports on the progress of the impact analysis with respect to the expected project-level innovation and achievements. Furthermore, it provides a description of the activities that involved, in the evaluation process, external stakeholders who have expertise, experience or interest in Big Data analytics.
The rest of the document is structured as follows. Section 2 reports on the experimental protocol alignment, based on the incremental and iterative nature of the I-BiDaaS solution applied to real-life experiments. Section 3 provides a detailed description of the implemented datasets and the experimental workflow of the industrial experiments' implementation for each use case. Section 4 describes the experimental evaluation in terms of data quality, the I-BiDaaS solution, architecture implementation, and experiment verification and validation. Section 5 discusses the impact analysis and provides an overview of external stakeholders' involvement activities. Section 6 concludes with a summary of the results of WP6 achieved from M19 to M32.

Experimental protocol alignment
The experimental protocol alignment process aims to refine and, if necessary, revise the outcomes of the initial experiment's definition phase according to the I-BiDaaS experimental protocol (see D1.3 [3]) and to fine-tune the details of the industrial experiments to assure that the designed experiments will validate both business and technical requirements.
A detailed description of the alignment process, together with an overview of the initial alignment of the I-BiDaaS experiments, has been reported in deliverable D6.3 [2]. Due to the incremental and iterative nature of the I-BiDaaS experimental protocol, further alignment was required in order to reflect (a) revisions of the industrial use cases definitions and (b) progress in the design of the I-BiDaaS platform and associated technology characteristics.
This resulted in the revision of the definition of the I-BiDaaS experiments in terms of:
a) the experiments' goals, data sets, analytics type, workflow and participants;
b) the experimental indicators and associated metrics to be measured during the experiment.
Overall, ten experiments have been defined (reported in section 3). Eight of them are real-life industrial experiments that address real problems in the telecommunication, financial and manufacturing sectors. These experiments reflect the specific requirements of the project industrial partners and correspond to the 'Co-Develop mode' of operation of the I-BiDaaS platform, whereby end-users receive support and guidance from I-BiDaaS members in order to customize the analytics pipeline and enhance the visualization of the experiment results.
End-users, in this case, may fall into the following general categories:
• Data Providers and/or Data Consumers: Business users who introduce new data or information feeds into the platform, and/or use the Big Data analytics services and results. Depending on the experiment, these include data analysts, quality assurance and control managers, financial administrators, and infrastructure engineers.
• Other stakeholders (e.g. IT security personnel): these are not end-users per se; rather, they are evaluators or administrators of the platform.
See Section 3.3 for a more fine-grained description of users per different sector.
In addition, two generic experiments were defined, aiming to evaluate how the I-BiDaaS solution can be applied generally, reflecting the requirements of two generic user categories:
• Non-expert users: correspond to business users (data analysts) of the platform in 'self-service' mode. Such users understand the basic concepts of data analytics, machine learning and statistics.
• Expert users: correspond to Big Data developers that use the platform in 'expert mode'. They are able to develop data analysis applications in COMPSs or at least in 'pure' Python.

Overview
International organizations and different competitive scenarios have been selected for developing, implementing and evaluating real-world industrial experiments in the EU H2020 I-BiDaaS project. Three data providers, namely TID (Telefonica I+D), CAIXA (CaixaBank) and CRF (Centro Ricerche FIAT), belonging to the telecommunication, banking and manufacturing sectors respectively, have defined eight real-world, industry-led experiments in which the I-BiDaaS solution is being tangibly validated. In addition to the real-life experiments, two generic use cases have also been defined, considering potential non-expert/expert end-users, and inputs for a cross-sectorial experiment have been provided.
Each experiment has been carried out utilizing different datasets (synthetic/real data) and processing types (batch/stream). Table 1 provides an overview of the datasets generated by all three industrial partners, detailed in Section 3.2 for each of them. The telecommunications industry collects massive amounts of data that act as a catalyst for business improvement. TID tested three use cases in order to improve the customer experience by employing advanced Machine Learning techniques. Part of the effort of improving the customer experience is focused on the employment of voice-activated bots that help users accomplish tasks related to network configuration and operation. For the telecommunication use cases, anonymized/synthetic data are analysed to predict changes in the number of connected mobile phone users per sector and the Customer Satisfaction Index (CSI).
CAIXA, as the representative of the financial sector in the project, tested three use cases that revolve around the huge amount of data collected from different sources (ATMs, online banking services, employees' workstations, external providers' activity, network devices, etc.). For the financial use cases, data analysts used synthetic/tokenized data for developing algorithms, testing tool performance and validating proofs of concept, skipping the strict security and privacy internal validation procedures of CAIXA.
The manufacturing industry, represented in the I-BiDaaS project by CRF, generates a large amount of heterogeneous data from various devices, systems and applications that enable manufacturers to develop new methodologies for the Big Data era. CRF is testing two use cases in order to demonstrate the ability to exploit the I-BiDaaS solution to take advantage of near real-time shop-floor data and to apply sophisticated statistical assessments. For the manufacturing use cases, data analysts used real or anonymized data retrieved from the production lines for the continuous improvement of algorithms, in order to avoid costly breakdowns, micro- or macro-stoppages and decreases in quality level. Unnecessary actions, such as preventive or planned maintenance, retooling, refurbishing, or repair of products, will be drastically reduced.
Finally, experiments for end-to-end I-BiDaaS platform either in self-service mode or in expert mode have been defined to provide a service with functional solutions for non-expert/expert end-users who want to optimize the performance and efficiency of their businesses. Relevant dataset examples have also been provided for these generic experiments.

Generated datasets
In the following subsections, the synthetic, real encrypted and anonymized datasets are described. Synthetic data was fabricated on a dedicated I-BiDaaS Virtual Machine into a PostgreSQL DB, SQLite and csv files. TDF projects were defined and the data was successfully generated; however, changes in the fabrication approach were made (see section 4.1 'Data fabrication via simulation' of D2.5 [5]) due to inaccurate results in the TID use case.

TID datasets
The datasets (synthetic or otherwise) provided by TID address three representative and relevant use cases:
• Accurate location prediction with high traffic and visibility
• Optimization of placement of telecommunication equipment
• Quality of service in Call Centres
In all three use cases, anonymised Big Data has been made available to I-BiDaaS technologists and analysts, e.g. transcripts of customer service phone calls in order to assess the quality of the call centre services, or aggregated and anonymised mobility and antenna logs in order to perform predictions on user movements.
More specifically, for the 'Accurate location prediction with high traffic and visibility' use case, TID made available a dataset that consists of anonymous traces collected from a large European cellular network provider. Each trace is a time series of mobile events that contain the encrypted user identifier, a timestamp, and the location of the associated base station. The base stations have varied coverage (from ~100 m to tens of km) depending on deployment density and radio propagation characteristics such as obstacles, hills, or mountains. The expected user displacement in urban areas is smaller than in rural areas and can be as low as 70 m.
A mobile event is generated every time a mobile device:
• activates/deactivates in the network
• makes/receives a call
• sends/receives an SMS
• moves from one location area code to another
• changes from one technology to another
• requests access to data (2G/3G) or requests a high-speed data channel (4G)
• is actively pinged by the network if no other event is registered for 2 hours
More specifically, the dataset consists of approximately 120K traces × N_i events per user i, divided into four-hour periods (yielding 186 points in total). The first field is the hashed user identifier (UID); fields 2-5 are aggregated statistics related to 1) the distance traversed by the user and 2) the time connected to cell sites; from the 6th field onwards, there are tuples of antenna ID and the amount of time connected to that antenna. The length of each time series varies since, for example, a user who has moved a lot will have connected to a larger number of antennas.
The dataset was provided in JSON format, where every antenna is a JSON object containing its time series for the period. The values in the time series represent the number of users using the antenna at some point in time, so the values are strictly positive. Every time series was split into training and validation sets; the time series were then fitted and tested with the respective models.
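As an illustration of this format, the following minimal Python sketch loads such a JSON export and performs a chronological split of each antenna's series. The file name and the exact JSON layout (a mapping from antenna ID to a list of counts) are assumptions made for the example, not the actual TID schema.

```python
import json

# Hypothetical layout: {antenna_id: [user counts, one per 4-hour period]}
with open("antenna_timeseries.json") as f:
    antennas = json.load(f)

splits = {}
for antenna_id, series in antennas.items():
    cut = int(len(series) * 0.8)          # chronological split: no shuffling,
    splits[antenna_id] = (series[:cut],   # so the validation set stays in the
                          series[cut:])   # future relative to the training set
```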
For the 'Optimization of placement of telecommunication equipment' use case, TID aggregated 2G, 3G and 4G feeds. Mobile Network Operators (MNOs) continuously collect various Key Performance Indicators (KPIs), such as coverage monitoring and voice/data service metrics, about each radio sector. Such antenna KPIs are one of the key pieces of information for MNOs to understand network performance, and are used as input for network management, planning, and optimization. The employed indicators correspond to 2G, 3G and 4G sectors and can be grouped into the following categories:
• coverage (e.g., radio interference, noise level, power characteristics)
• accessibility (e.g., success establishing a voice or data channel, paging success, allocation of high-speed data channels)
• retainability (e.g., fraction of abnormally dropped channels)
• mobility (e.g., handovers' success ratio)
• availability and congestion (e.g., number of transmission time intervals, number of queued users waiting for a resource, congestion ratios, free channels available)
The said antenna KPI data consist of 999,257 observations × 17 features (24-hour cycle), and include more than 40K cell sites. Moreover, the data were anonymised using cryptographic hashing functions from OpenSSL's libcrypto library and methods from the sdcMicro R package. The sdcMicro package provides a series of probabilistic anonymization methods that depend on a probability mechanism or a random number-generating mechanism, i.e. every time a probabilistic method is used, a different outcome is generated. Since our target variable is a binary label, which corresponds to the notion of 'being a hot spot' on a certain day, the data was split into a training (80%) and test set (20%). To this end, we grouped the antennas by their ID, so that each antenna can be either in the train set or in the test set. We sought the splitting that yields a percentage of the positive class as similar as possible in the train and test sets.
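A minimal sketch of the grouped split described above is given below, assuming the KPI data is available as a pandas frame with an 'antenna_id' column and a binary 'hot_spot' label (both names are illustrative). The text above describes searching for the split that best balances the positive class; the sketch draws one grouped split and checks that balance.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("antenna_kpis.csv")     # hypothetical file: 17 KPI features + label

# Group-aware 80/20 split: every antenna lands entirely in train or in test.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(df, df["hot_spot"], groups=df["antenna_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# Check how similar the positive-class ratios are; in practice one would
# repeat with different random states and keep the most balanced split.
print(train["hot_spot"].mean(), test["hot_spot"].mean())
```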
Finally, for the 'Quality of service in Call Centres' use case, TID has constructed a real operation dataset from a LATAM (Spanish-speaking) country, under the standard CTM format. The anonymised dataset consists of 1.3M transcripts of continuous speech recordings and: 1) does not include personal and company identifiable information (relevant tokens were removed), 2) does not contain speaker information, 3) sentences are switched within the same call, and 4) real timestamps have been obfuscated although the relative order of the calls is kept. From this dataset, a subset consisting of 17K anonymised transcripts was derived, which was further split into 1) train, 2) development, and 3) test sets. More specifically, all transcripts are labeled with a Customer Satisfaction Index (CSI) score, as indicated by the customer at the end of the call. In addition, the transcripts are being augmented with the output of a sentiment analysis, for both the Spanish and English languages. Last, this dataset also serves benchmarking purposes, that is, the study of the impact of anonymization on KPIs (e.g., the impact on the retrieval of low-satisfaction calls), and will also be introduced in the Telefonica Hackathon event.

CAIXA datasets
CAIXA generated four different datasets for the I-BiDaaS experimentation and evaluation. It presented three different use cases, but one of the use cases was tested with two different datasets (a synthetic fabricated dataset and a real tokenized dataset).
The synthetic dataset for the 'Analysis of relationships through IP addresses' use case was the first dataset generated and the one selected to test the I-BiDaaS MVP.
The generated dataset provides data on the relationships between customers in order to build part of the social graph of the bank. The data was synthetically generated based on real data coming from a set of restricted tables (relational database), with information related to the customers and their IP address when connecting online. CAIXA and IBM generated the data recipe for the data fabrication using IBM TDF. Through an iterative analysis of obtained results, the rules were improved in order to obtain the fabricated dataset used for testing the MVP, with more than 1 million entries.
The structure of this dataset is the following:
• FK_NUMPERSO: Identifier of the person. NUMBER
• PK_ANYOMESDIA: Date (YYYYMMDD) of the connection of the user. NUMBER
• IP_TERMINAL: IP address of the connection of the user. VARCHAR2
• FK_COD_OPERACION: Code of the business operation done by the user. VARCHAR2
• PK_COD_ESTADO_OP: Code of the status of the operation done by the user. VARCHAR2
This use case was also used for validating the usage of synthetic data as a method to test the performance and adequacy of new technologies before integrating them into CAIXA's premises.
After the generation of the synthetic dataset, a dataset was also created with real tokenized data. The structure of this dataset is the same as the fabricated dataset.
For opening the real data of this use case and of the rest of the use cases, CAIXA worked internally on the specification of the types of encryption that enable the entity to share this data without breaking its privacy, while allowing a certain level of data analytics over the encrypted data. Indeed, one of the challenges of this approach is to find ways to encrypt the data such that as little relevant information as possible is lost. CAIXA proposed the 'Advanced analysis of bank transfer payment in financial terminal' use case in order to first test this new approach in the project, and proposed a tokenized dataset using three different data encryption algorithms (depending on the table field types):
• Format preserving encryption for categorical fields.
• Order preserving encryption for numerical fields.
• A Bloom-filtering encryption process for free text fields.
These types of encryption were also used for the tokenized datasets of the other two use cases ('Analysis of relationships through IP addresses' and 'Enhanced control on Online Banking') and will be further explained in the following subsection.
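To make the three properties concrete, the toy sketch below mimics each style of tokenization in plain Python. It is a deliberately simplified illustration of the properties (format preservation, order preservation, approximate matching on free text), not CAIXA's actual encryption algorithms, and it offers no real security guarantees.

```python
import hashlib
import hmac

KEY = b"demo-key"  # illustrative key, not a real secret-management scheme

def format_preserving_token(value: str, alphabet: str = "0123456789",
                            length: int = 8) -> str:
    """Token drawn from the same alphabet/length as the original domain."""
    digest = hmac.new(KEY, value.encode(), hashlib.sha256).digest()
    return "".join(alphabet[b % len(alphabet)] for b in digest[:length])

def order_preserving_token(value: float) -> float:
    """Any strictly increasing keyed map keeps <, >, == comparisons valid."""
    return 3.7 * value + 1234.5            # toy monotonic transform

def bloom_token(text: str, m: int = 64, k: int = 3) -> int:
    """Bloom-filter encoding of word bigrams; two encodings sharing many
    bits indicate likely overlap between the underlying free texts."""
    bits = 0
    words = text.lower().split()
    for gram in zip(words, words[1:]):
        for i in range(k):
            h = int(hashlib.sha256(f"{i}:{gram}".encode()).hexdigest(), 16)
            bits |= 1 << (h % m)
    return bits
```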
The 'Advanced analysis of bank transfer payment in financial terminal' use case dataset was generated by collecting most of the relevant and additional contextual information of a bank transfer done by a CAIXA employee in a bank office. The transfer is executed by the employee in the name of a customer who authenticates themselves and orders it. The dataset was built by identifying all the relational tables that can contain information on the transfer, the customer, the receiver of the transfer, the office and terminal where it is executed, and the employee who processed it. The dataset was used to identify anomalies that point to potentially fraudulent bank transfers or bad practices in the offices.
The structure of the used dataset is the following:
Finally, the dataset for the 'Enhanced control on Online Banking' use case was generated. This use case focuses on analysing the mobile-to-mobile bank transfers ordered through online banking (web or application). It focuses on assessing that the controls applied to user authentication (e.g. second factor authentication) are applied adequately, in accordance with the PSD2 regulation and depending on the context of the bank transfer.
The structure of this dataset is the following:
• PK_ANYOMES: Year and month of the partition, corresponding to the consolidation of the Bizum operation. NUMBER

CRF datasets
Initially, CRF gathered structured and unstructured sets of a large amount of heterogeneous data from different sources and different levels. During the preliminary stage, all information was analysed and CRF interacted with the plant in order to understand in depth the nature of the data.

For the 'Maintenance and monitoring of production assets' use case, data arrives from sensors mounted on different machines (e.g. linear stages, robots, elevators and so on). The data consists of two different datasets in csv format, named SCADA and MES.
The SCADA dataset contains production, process and control parameters of the daily vehicle production and is structured as follows: there are over 100 sensors, and each one is identified by a specific number (id).
The MES dataset contains specific data associated with the type of vehicle being produced and is structured as follows: when OP020.Passo [20] changes from 0 to 1, a new vehicle enters the area with sensors, and modello_op_020 indicates the model of the vehicle being processed.
Initially, both types of data were considered but, over time, we faced problems retrieving MES data because of rescheduling activities and changes in the production lines, partially due to the COVID-19 pandemic. We therefore decided to use only SCADA data to obtain thresholds for anomalous measurements for all sensors, also because the measurements of the sensor identified by number 141, over the 13 days of available MES data, showed high variance even for the same vehicle type. Within the project, by analysing the data we found satisfactory information in the SCADA data.
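A minimal sketch of how such per-sensor thresholds could be derived from the SCADA csv is shown below; the long-format column names ('sensor_id', 'timestamp', 'value') are assumptions for the example rather than the actual SCADA schema.

```python
import pandas as pd

scada = pd.read_csv("scada.csv")          # hypothetical long-format export

# Per-sensor empirical bounds: measurements outside the central 99% band
# are flagged as anomalous candidates.
bounds = scada.groupby("sensor_id")["value"].quantile([0.005, 0.995]).unstack()
bounds.columns = ["low", "high"]

merged = scada.merge(bounds.reset_index(), on="sensor_id")
anomalies = merged[(merged["value"] < merged["low"]) |
                   (merged["value"] > merged["high"])]
print(anomalies.groupby("sensor_id").size())  # anomaly counts per sensor
```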
For the 'Production process of Aluminium die-casting' use case, CRF received from the plant different datasets with heterogeneous data (e.g. piston speed in the first and second phase, piston stroke, intensification pressures, temperatures, cooling capacity), together with quality and operator data (e.g., defects manually detected). Due to the complexity of the process, at the initial project stages (M9) it was not possible to collect sufficient real data for Big Data analytics, so we extracted the most significant information from all the data received and created a synthetic dataset as close as possible to the real data. The aim was to share synthetic data with the consortium in order to have a flexible and rich dataset for understanding the patterns and for developing and testing the I-BiDaaS technologies. More specifically, CRF combined into a single file the most significant process parameters that reflect the trend of the real production, in which there is a wide variety and a low veracity of heterogeneous information.
The generation of the synthetic dataset was performed using IBM's Data Fabrication Tool (TDF), according to the definition reported in D2.1 [4], and produced a formatted text file, convertible into Excel, that contains 1 million rows, each of which corresponds to a subsequently produced engine block.
An excerpt of the structure of the synthetic dataset, detailed in section 3.9.1.2 'Data description' of D2.1 [4], is reported in the table below. The synthetic data has been validated with an empirical and analytical technique, as described in section 4.4 'Production process of aluminium casting' of D2.5 [5].
The high-level algorithms, developed in the first part of the project, identified the critical values from the dataset and defined the main parameters that affect the quality of the process. In the meantime, we gathered more real anonymised process data. An excerpt of the structure of the real anonymised dataset, containing 187 rows, is reported in the table below with the main parameters identified for the detection of the quality-level KPI (e.g., a second level of control for operators' changes in parameters). Additionally, two thermal imaging cameras have been installed on all die-casting machines, so that in the penultimate quarter of the second year of the project thermal data, examples of which are shown in Figure 1 and Figure 2, were also retrieved. Therefore, a large dataset of annotated thermal images has been provided in addition to the real anonymised dataset in order to test the complexity of the process with the analytics developed within the project. Several models have been developed to utilize both the sensor data and the thermal images, as reported in D3.3 [6].

Generic use case datasets
For the generic use cases, three different datasets have been utilised as proof of concept that the provided algorithms work properly.
More specifically:
A. Coordinates dataset, used for the K-Means clustering algorithm. It is a two-dimensional dataset containing coordinates (longitude, latitude). A preview is reported in Table 6.
The provided algorithms include:
• Lasso ADMM: Least Absolute Shrinkage and Selection Operator algorithm for regression analysis, solved in a distributed manner. The LASSO model uses L1 regularization to induce sparsity and prevent overfitting of the model.
• K-Means (prediction): The objective of K-Means is to group similar data points together in a user-specified number of clusters (K).
• K-Means (evaluation): The objective of this version of K-Means is to create a model based on a pre-labelled training dataset that can be used for classification.
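For instance, the coordinates dataset lends itself to a short stand-alone sketch of the K-Means prediction flavour; the plain scikit-learn call below stands in for the distributed platform implementation, and the csv column names are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans

coords = pd.read_csv("coordinates.csv")[["longitude", "latitude"]]

# K is the user-specified number of clusters, as in the generic experiment.
model = KMeans(n_clusters=5, random_state=0).fit(coords)
coords["cluster"] = model.labels_         # cluster assignment per point
print(model.cluster_centers_)             # one (longitude, latitude) centre per cluster
```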

Industrial experiments implementation
In this section, the list of the real experiments carried out using the I-BiDaaS solution is reported. Specifically, for each experiment, the experimental workflow in the operation plan is described, focusing on its goals and associated questions, as reported in Table 8. In addition to the eight experiments that relate to specific industrial sectors, two generic use case experiments have also been defined to show how the I-BiDaaS solution can also be applied generally in the 'Expert mode' and the 'Self-Service mode', considering the perspective of potential non-expert/expert end-users. Furthermore, CAIXA and TID identified the approach to work on cross-sectorial use cases, as explained in subsection 3.3.6.

Experimental workflow
The execution of all the conducted experiments was based on predefined workflows, which allowed better planning and monitoring by the participants. These workflows include a set of action steps that drive the execution of the experiment and correspond to relevant metrics that were captured and used during the evaluation of the results. Although a specific experimental workflow was devised for each experiment, the high-level series of steps involved in all experiments can be described as follows:
1. Project setup: Definition of the I-BiDaaS experimental setup, internally referred to as a 'project'. This includes its name, the processing mode (batch or streaming data) and the input type (single file, directory of files, DB source, etc.).
2. Data selection: Definition of the dataset to be used for the experiment. This is done by the pilots in the project's use cases, or by the participating end-users in the expert and self-service modes.
3. Data preparation: Actions required prior to using the dataset in an experiment. These include the actual data collection, legal requirements with respect to privacy and security issues, and pre-processing actions depending on the type of data: a. Real data (aggregation, anonymization, encryption); b. Synthetic data (rules generation, fabrication, uploading).
4. Experiment setup & execution: Definition of the internal I-BiDaaS representation of an experiment, namely: dataset access establishment, analytics algorithm definition and parameterisation, and resource allocation.
5. Results visualisation: Visualisation of the experiment results in various ways according to the nature of the data, i.e. from static charts for batch analysis results to real-time interactive graphics for analysis results of constantly incoming streaming data.
6. Feedback intake: Assessment of the perceived usefulness of the executed experiment for the participants' internal business operations, but also utilisation of the results for optimising I-BiDaaS components; e.g., in the cases where data fabrication is used, the results provide feedback to TDF for better fabrication rules.
The following paragraphs describe how this generic workflow was implemented in each use case and provide an analytical description of the experiment setup, execution and results.

Telecommunication experiments
Telecommunication experiments aim to test the efficiency of the I-BiDaaS solution in the context of improving and optimising current operations. To this end, three operational experiments have been defined, as shown in Tables 9-11.
Within TID, there are different types of users that can benefit from the advanced visualizations and the intelligent dashboards integrated in the I-BiDaaS platform. Considering that this is a heterogeneous group of experts and non-experts with diverse skills, we can define the following high-level groupings:
• IT & Big Data practitioners: employees and third-party consultants with specialised training (e.g., data scientists, software engineers, UX experts) who share a common knowledge of big data analytics.
• Intermediate users: people with a basic understanding of data analytics who are used to working with some big data tools, especially for visualisation and big data visual analysis. At best, these users may be able to complete basic data mining tasks using languages like Python or R.
• Operators: TID or third-party employees, at different levels of the production processes, who need to have access to different cascades and views of the data processing results.
• Non-IT users: people with a very good knowledge of the field and the sector (e.g., product managers, marketing and business units); they can interpret the data but lack the programming skills or data mining expertise.
With I-BiDaaS, TID tested different Big Data technologies (batch and streaming) in a unified platform to solve important challenges for the telecommunication sector. In particular, the design and implementation of a complete framework of tools augmented real data platforms with the functionality needed to enable a new, highly diverse and synergistic data ecosystem, in a privacy-preserving manner. Furthermore, advanced visualisation approaches and dashboards made it possible to harness the power of multiple heterogeneous sources and big-data analytics. This facilitates the ability to take data and understand it, process it, extract value from it, visualize it, and communicate it, with the primary focus of empowering both expert and non-expert big data practitioners involved in telecommunication activities.
TID provided three relevant and high-value use cases, as shown in Table 1, which are discussed in the next subsections.

Accurate location prediction with high traffic and visibility
This use case aims to analyse the behaviour of local and non-local customers over various periods of time (e.g. holidays) and to extract insights on the behavioural patterns of groups of people, enabling operators to optimize their value propositions. When users travel around the city they create traffic congestion in the network, so it would be useful to forecast the immediately following events in order to anticipate movements at scale and to improve the routing and placement of the telecommunication equipment that is already in place, or to arrange accordingly for new equipment to be obtained.
For this use case, synthetic and anonymised real data have been used.
The important challenges derived from this use case were to interpolate missing events to recover plausible event trajectories; to minimize processing time with respect to growing data size; and to maintain real-time delivery of results.
By selecting the best 1000 models, we obtained a baseline average mean absolute error of 1.2565. The baseline model accuracy is decent, although we suspect a more accurate model can be built with more data pre-processing (e.g. imputation of missing values). The reader can refer to D3.3 [6] for more details on the analytics aspects.
We can predict whether a specific antenna will experience a rise in the number of users attached to it, but the predictions are not directly linked to specific events or to a rise of users on neighbouring antennas. In addition, although the current dataset does not fully support this kind of analysis, this objective remains a possibility for future work, provided that the mobility data are augmented and fused with other datasets, such as weather data, public events and calendar data, and other context data, which are necessary to generalize the predictive capabilities of the models.
Because of the data sparsity, predictions are made for 4 hours ahead. Using Facebook's time-series tool Prophet, for the 1000 best sectors (those with the least missing data), predictions deviate on average by 1-2 users compared to the true values. The following table summarises some of the key points of the use case experimentation:

Experiment's Goals:
To test the I-BiDaaS solution efficiency with respect to the prediction of places with high traffic and congestion events, in order to optimise resource distribution.

Experiment's Questions:
Q1. What is the quality of the analytics results?
Q1.1 How able is the I-BiDaaS platform to forecast mobile phone user movements at scale?
Q2. How efficient is the process of data analytics?
Q2.1 Can we predict when new events will cause movements at scale and where they will appear?
Q2.2 What is the performance of the predictive models (ML/DL) as a function of time, amount of historical data, and prediction horizon?

Experimental Workflow (based on the generic workflow, to be further refined):
1 - Data selection
2 - Data preparation: aggregation of antenna KPI data from a major European telecommunications company, covering a city and a large number of cell sites
3 - Data analysis: experiments with various ML models (SVM, RF, XGBoost) and DL models to predict changes in the number of connected mobile phone users per sector
4 - Data visualization

Experimental Subjects (participating in any of the steps above):
Role: Data analysts | Steps involved: 1-4 | No of participants: 4
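A minimal sketch of the per-sector, 4-hours-ahead forecast described above follows; it uses the prophet package on synthetic counts, since the real per-antenna series are not reproduced here.

```python
import numpy as np
import pandas as pd
from prophet import Prophet

# Stand-in for one antenna's series: 186 four-hour periods of user counts.
series = np.random.poisson(50, 186)
df = pd.DataFrame({"ds": pd.date_range("2020-01-01", periods=186, freq="4H"),
                   "y": series})

m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=1, freq="4H")  # one step = 4 hours ahead
forecast = m.predict(future)
print(forecast[["ds", "yhat"]].tail(1))   # predicted user count for the next period
```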

Optimization of Placement of Telecommunication Equipment
This use case aims to optimise network operations by providing caches and identifying optimal antenna locations, given the data provided from customer usage. The important challenges were to analyse streaming data in order to improve the routing and placement of the telecommunication equipment that is available, or to arrange for new equipment to be obtained; to study the spatio-temporal patterns and provide insights on the dynamics of cellular sectors; and to consider DL models and study their performance as a function of time, amount of historical data, and prediction horizon.
All models (XGBoost, CatBoost and Random Forest) showed promising results, with XGBoost standing out with an accuracy of 0.999 and precision and recall of 0.998. Considering the high accuracy and high throughput of the models, the I-BiDaaS solution can help understand network performance and can be used as input for network management, planning, and optimization. The reader can refer to D3.3 [6] for more details on the analytics aspects.
Currently, near-perfect classification has been achieved with the available data.
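The classification step itself can be sketched as follows, continuing from the grouped 80/20 split sketched in the TID datasets subsection; the feature/label column names and hyper-parameters are illustrative, not the tuned configuration behind the reported scores.

```python
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# `train` / `test` as produced by the grouped 80/20 split sketched earlier.
X_train = train.drop(columns=["hot_spot", "antenna_id"])
y_train = train["hot_spot"]
X_test = test.drop(columns=["hot_spot", "antenna_id"])
y_test = test["hot_spot"]

clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print(accuracy_score(y_test, pred),
      precision_score(y_test, pred),
      recall_score(y_test, pred))
```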
Previous analysis on the hotspot prediction task was considered, and it brought strong evidence that, for moderate horizons, forecasts can be made even for sectors exhibiting isolated, non-regular behaviour. This work performs forecasts in two situations: daily hot spots and emerging persistent hot spots. We evaluated accuracy as a function of time, prediction horizon, and amount of considered past information. In both scenarios, we observed that the time of the forecast does not introduce significant variability in the results, that forecast accuracy reaches a plateau when at least one week of past information is considered, and that tree-based models can outperform the best baseline by 14% on daily hot spots and by 153% on emerging hot spots. We have also assessed the importance of KPIs in performing such forecasts, showing that their importance decisively increases for the forecasting of non-regular hot spots, especially for certain usage, congestion, interference, and signalling KPIs. The following table summarises some of the key points of the use case experimentation:
Dataset: Anonymised TID mobility data
Preparation status:

Experiment's Goals:
To test the I-BiDaaS solution efficiency with respect to the optimization of placement of telecommunication equipment.

Experiment's Questions:
Q1. What is the quality of the analytics results?
Q1.1 How able is the I-BiDaaS platform to support the management of large-scale cellular networks and provide operators with intel on which sectors underperform at any given time?
Q2. How efficient is the process of data analytics?
Q2.1 Can we timely predict when an antenna will become the next 'hot spot'?
Q2.2 What is the performance of the predictive models (ML/DL) as a function of time, amount of historical data, and prediction horizon?

Experimental Workflow (based on the generic workflow, to be further refined):
1 - Data selection
2 - Data preparation: aggregation of antenna KPI data from a major European telecommunications company, covering a city and a large number of cell sites
3 - Data analysis: experiments with various ML models (SVM, RF, XGBoost) and DL models to predict changes in the number of connected mobile phone users per sector
4 - Data visualization

Experimental Subjects (participating in any of the steps above):
Role | Steps involved | No of participants

QoS in Call Centres
This use case addresses the challenge of developing speech technologies that transform audio calls into relevant information for the Call Centre, which can be used to assess its performance and/or to automatically screen phone calls. By exploiting the results of the project, TID plans to increase the number of audio calls that can be processed per time unit.
There is a wide variety in the nature of the customer calls: asking for service and product information, reporting technical problems, following up on a purchase, providing feedback, etc. The I-BiDaaS solution makes it possible to quickly get familiar with and understand the customer's perspective and main interests, and to facilitate fast responses and improve customer service by using big data speech and language analytics. This is achieved by shortening call durations, waiting time and First Call Resolution (FCR) time, anticipating the customer's situation based on previous insights, e.g., using the aggregation of previous analytics by Call Centres or regions.
More specifically, the Business Units (BUs) in TID need to manually inspect a small portion of phone calls, less than 1% of the total amount of Call Centre (CC) calls per year. The I-BiDaaS solution, using GPU-accelerated text matching, automatically estimates a sentiment score aggregated by call centre/region and by time window, together with a list of the most relevant words (it retrieves the top-K frequent words or 2-grams and provides a quick overview of the current CC scenario and operations). In our scenario, we are not interested in computing the sentiment of a certain entity within a transcript, but rather in predicting the sentiment of a whole transcript, which is then further aggregated to predict the sentiment of the whole call centre. As such, the sentiment values of the words found in the text stream of each call centre are accumulated into a single score. This score is the sentiment score of a specific call centre for a given time window, and is an indicator of whether the overall customer sentiment is positive or negative. In addition, it accounts for the correlation between the sentiment of the call and the Customer Satisfaction Index. The reader can refer to D4.3 [7] for more details on the analytics aspects.
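The aggregation logic can be illustrated with a small self-contained sketch: word-level sentiment values are summed per (call centre, time window) pair, and the top-K frequent words are tracked alongside. The tiny lexicon and the field layout are toys for illustration, not the production NLP model.

```python
from collections import Counter, defaultdict

LEXICON = {"gracias": 1.0, "perfecto": 0.8, "problema": -1.0, "cancelar": -0.8}

def aggregate(transcripts):
    """transcripts: iterable of (call_centre, time_window, text) tuples."""
    scores = defaultdict(float)
    words_seen = defaultdict(Counter)
    for centre, window, text in transcripts:
        words = text.lower().split()
        scores[(centre, window)] += sum(LEXICON.get(w, 0.0) for w in words)
        words_seen[(centre, window)].update(words)
    return scores, words_seen

scores, words_seen = aggregate([
    ("CC1", "2020-08-01T10", "gracias por todo perfecto"),
    ("CC1", "2020-08-01T10", "tengo un problema y quiero cancelar"),
])
print(scores[("CC1", "2020-08-01T10")])                    # > 0 positive, < 0 negative
print(words_seen[("CC1", "2020-08-01T10")].most_common(3)) # top-K frequent words
```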
In Figure 5, an example of the execution of the sentiment analysis tool (English version) on a TID server is shown. For a detailed description, see Section 2.5.1 'Quality of service in call centres' of D2.6 [8].
Given an average call duration of 8.6 minutes, a human agent following a work schedule of 40 hours per week (160 hours per month) could help process up to 11,520 calls per year. This manual process allows flagging ~2,300 low customer satisfaction calls. By contrast, the I-BiDaaS solution can automatically process ~3.5B calls per year on a single GPU; by pre-processing/filtering the audio calls, the platform (in a configuration with 1 GPU) can increase the number of low customer satisfaction audio calls detected by human agents to 7,000 (a 200% increase). This corresponds to a maximum real-time throughput of 40K transcripts/second. The cost of the CC is a function of many variables; however, an automatic solution would go a long way towards reducing the manual effort and the human resources that need to be allocated for this task. Hence, it will result in a significant reduction of the operational costs.
The following table summarises some of the key points of the use case experimentation:

Experimental Workflow:
1 - Data selection
2 - Real data preparation: aggregation of call centre data from multiple sources and generation of data files (e.g., audio files, meta-data, etc.); execution of the ASR model to produce the transcripts; execution of speaker segmentation to segment the audio files by the different speakers; merging of the ctm (transcripts) and rttm (speakers) files
3 - Data analysis: application of the NLP model on the input data (merged transcripts) to predict the CSI
4 - Data visualization

Experimental Subjects (participating in any of the steps above):
Role: Data analysts | Steps involved: 1-4 | No of participants: 4

Banking experiments
Banking experiments aim to test the efficiency of the I-BiDaaS platform for reducing the costs and time of analysing large datasets whilst preserving data privacy & security. To this end, three operational experiments have been defined, as shown in Tables 14-16.
The usage of big data analytics in the financial sector is becoming more and more important every day, and it is gradually being integrated in many departments of CAIXA (security, risks, innovation, etc.). Therefore, CAIXA has a heterogeneous group of experts with different skills and also relies on several big data analytics experts that provide consultancy services. However, the people working with the great amount of data collected from the different sources and channels of CAIXA can be reduced to three groups:
• IT & Big Data expert users: employees and third-party consultants with strong programming skills and big data analytics knowledge.
• Intermediate users: people with some notions of data analytics who are used to working with some big data tools, especially for visualisation and big data visual analysis (such as QlikSense/QlikView). They are not skilled programmers, although they are capable of programming simple algorithms or functions with Python or R.
• Non-IT users: people with a very good knowledge of the field and the sector; they can interpret the data but lack programming skills or big data analytics knowledge.
Taking that into account, CAIXA proposed three different use cases and evaluated the I-BiDaaS tools from the perspective of potential usage by those different groups of employees:
• Enhanced control on Online Banking.
• Advanced Analysis of bank transfer payment in financial terminal.
• Analysis of relationships through IP addresses.
These use cases will be presented in this section in the chronological order in which they were studied in the project.

Analysis of relationships through IP addresses use case
'Analysis of relationships through IP addresses' was the first use case, used to test the MVP of I-BiDaaS.
In this use case, CAIXA aims to validate the usage of synthetic data and the usage of external big data analytics platforms. It is deployed in the context of identifying relationships between customers that use the same IP address in their connections to online banking. CAIXA stores information about its customers and the operations they perform (bank transfers, checking their accounts, etc.) using channels such as mobile apps or online banking, and afterwards uses this data for security and fraud prevention processes. One of these processes is to identify relationships between customers and use them to verify subsequent bank transfers between linked customers. Such operations are considered less likely to be fraudulent transactions, which allows CaixaBank's Security Operation Centre (SOC) to directly discard those bank transfers in their revision processes. The goal of this experiment is to validate the use of synthetic data for analysis (i.e., whether the rules fire in the same situations as with the real data) and to test the time efficiency of the I-BiDaaS solution.
For this use case, we started with synthetic data, using IBM TDF and the generation process described in D2.5 [5]. The set of rules that build the custom-tailored algorithm to find relationships between two users was identified, and the algorithm was programmed in COMPSs by BSC experts. This allowed us to obtain the results on the I-BiDaaS platform, getting a first-glance visual graphic of the number of relationships, as well as downloading the relationships between customers (Figure 8).

Figure 8: Analysis of relationships through IP address use case visualisation in I-BiDaaS platform
After the generation of relationships, the streaming use case was deployed, using a synthetic dataset of bank transfers between the users and identifying those bank transfers in which the sender and the receiver were already related (Figure 9). In addition, in that use case, the SOC employee can also check the graph of relations of the user, if a more in-depth analysis is needed (Figure 10). Both use cases were placed in the 'Co-develop Mode' of the I-BiDaaS platform due to their personalisation and custom-tailoring needs, addressed by the I-BiDaaS partners' experts.
Moreover, further analysis was performed over the generated synthetic data in order to assess its quality.
In that process, the dataset was transformed as follows: each user represents a data sample, while each IP address represents a feature. In such a data matrix, the value at position (i,j) represents the number of times user i connects via IP address j. This way, we obtain a data matrix with dimensions 8058×22992. The resulting matrix is very sparse. In order to retain only meaningful data, the next pre-processing step is to drop all the IP addresses that are used by only one user. After dropping all such IP addresses, we are left with 1075 IP addresses, which represents a huge reduction compared to the initial 22992 IP addresses contained in the original dataset.
Next, we filter out the users that are not connected to any of the remaining IP addresses. As it turns out, there are 6049 such users, leaving us with a dataset containing 2009 users and 1075 IP addresses, where each IP address has been connected to by at least 2 users.
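A compact pandas sketch of this pre-processing is given below, assuming a connection log with one row per (user, IP) connection; the column and file names are illustrative.

```python
import pandas as pd

logs = pd.read_csv("connections.csv")            # hypothetical columns: user, ip

# User-by-IP count matrix: entry (i, j) = number of connections of user i via IP j.
matrix = pd.crosstab(logs["user"], logs["ip"])

# Keep only IP addresses used by at least two distinct users ...
shared = matrix.columns[(matrix > 0).sum(axis=0) >= 2]
matrix = matrix[shared]

# ... and drop users left with no remaining IP addresses.
matrix = matrix[(matrix > 0).any(axis=1)]
print(matrix.shape)                               # e.g. (2009, 1075) on these data
```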

Clustering
In order to infer relationships between users, we first use clustering algorithms. In particular, we use K-Means [9] and DBSCAN [10]. Additionally, we use t-distributed Stochastic Neighbor Embedding (t-SNE) [11] to visualize the reduced dataset in 2D. The visualization is presented in Figure 11.
Both K-Means and DBSCAN offer some interesting hyper-parameters. In particular, K-Means allows us the flexibility of setting the desired number of clusters. On the other hand, DBSCAN decides on the number of clusters internally. However, it exposes a parameter representing the minimum number of samples in a neighbourhood for a point to be considered a core point, as well as a parameter for the maximum distance between two samples for them to be considered in the same neighbourhood. The described K-Means and DBSCAN hyper-parameters are to be set by an end-user based on experimentation and domain knowledge, and will be tunable through the I-BiDaaS platform user interface.
Clustering was performed both on the full 2009×1075 dataset and on the t-SNE-reduced 2009×2 dataset. We used the silhouette score [12] for evaluating the clustering quality. Roughly speaking, the silhouette score evaluates how well each point fits the cluster it is assigned to versus the next closest cluster. The values range from -1 to 1, with 1 representing a perfect clustering.

Results
All experiments in both the clustering and graph-based analyses were implemented in Python, on a single computer. The libraries used in the clustering-based analysis are numpy, pandas, scikit-learn and matplotlib. Numpy and pandas were used for data pre-processing, scikit-learn for the clustering analysis, and matplotlib for visualization.
We tested the performance of K-Means and DBSCAN for different values of their parameters. In particular, we experimented with the number of desired clusters (K) for K-Means, and with the parameter defining the maximum distance between two samples for them to be considered as in the same neighbourhood (eps) for DBSCAN. The parameter representing the minimum number of samples in a neighbourhood for a point to be considered a core point (min_samples) for DBSCAN was fixed to 2. This is intuitive, since we want to allow the algorithm to find clusters of (at least) 2 people in order to infer relationships.
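The parameter sweeps can be sketched as follows, reusing the user-by-IP matrix built earlier; the eps grid is illustrative, as the exact values tried are not listed here.

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

for k in (13, 600, 700, 800, 900, 1000):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(matrix)
    print("K-Means", k, silhouette_score(matrix, labels))

for eps in (0.5, 1.0, 2.0, 4.0):                  # illustrative grid
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(matrix)
    core = labels != -1                           # drop DBSCAN noise points
    if len(set(labels[core])) > 1:                # silhouette needs >= 2 clusters
        print("DBSCAN", eps, silhouette_score(matrix[core], labels[core]))
```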
To start, we applied t-SNE on the transformed dataset in order to visualize the data in 2D. (Any other suitable visualization method adopted by I-BiDaaS may be applied.) We set the perplexity parameter to its default value of 30, while we set the number of iterations to 1000. The algorithm is known to emphasize grouping of similar points, and the visual results suggest that there is some structure in the data.
For K-Means, we chose 6 different values for K: 13, 600, 700, 800, 900, 1000. The choice is motivated as follows: the visual test, based on the t-SNE projection, suggests roughly 13 clusters. On the other hand, DBSCAN initially found around 600 clusters. Based on this, we decided to try the small value suggested by t-SNE as well as the higher values suggested by DBSCAN. The results are presented in Table 11. Additionally, we evaluated the size of the obtained clusters. A recurring effect is that DBSCAN finds 1 big cluster (e.g., when the number of clusters is 665, DBSCAN finds 1 cluster containing 629 points, and when the number of clusters is 693, the big cluster contains 570 points). Interestingly, apart from the single big cluster, DBSCAN finds only clusters of size 2, 3, 4 and 5, with clusters of size 2 dominating (e.g. 618 and 643 clusters of size 2 when 665 and 693 total clusters are found, respectively). As for K-Means, the algorithm also generates a single big cluster and clusters of size 2, 3, 4, 5, as well as clusters that contain only 1 point. In general, the following pattern emerges: as K grows, the size of the big cluster decreases, but the number of clusters containing only 1 point grows. Once more, the size-2 clusters dominate.

Graph-based analysis
We present here a graph-based solution for relationship analysis among the users. We first describe the data pre-processing steps; subsequently, we describe the methods we use for analysis, and finally, we discuss the obtained results.
The dataset represents the users' online activity in January and February. In the graph-based solution for relationship detection, we performed a monthly analysis, i.e., the results here are restricted to the users' activity in January.
The first step of data pre-processing was to remove the users' activity from February. The newly produced dataset consists of 71,810 instances. After this, the dataset is transformed in the same way as in the clustering analysis: each user represents a sample, while each IP address represents a feature. In such a data matrix, the value at position (i,j) represents the number of times user i connected via IP address j. This way, we obtain a data matrix with dimensions 7947×22680.
Detecting the relationships and generating a graph of relationships
Since the goal is to establish relationships among the customers, we define an (M, N)-relationship as follows: a pair of users are (M, N)-related if they have each connected at least M times via at least N common IP addresses. On the constructed user-relationships graph, we apply the Louvain method for community detection [13]. This is an algorithm designed for detecting communities in networks; it is simple, efficient, easy to implement and one of the most widely used algorithms for community detection in large networks. We used the Python libraries numpy, pandas, networkx, community and matplotlib. Again, numpy and pandas were used for data pre-processing, networkx and community for community detection via the Louvain method, and matplotlib for visualization.
In our experiments, we considered (3, 1)-relationships, i.e., we considered the pairs of users who have connected at least 3 times via at least 1 common IP address.
After generating the network, we applied the Louvain method. The algorithm detected 817 communities. Most of the communities were of size 2, i.e., only two connected vertices formed a community. Table 13 presents the different types of components detected via the Louvain algorithm and their frequencies.
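A minimal sketch of the community-detection step follows, assuming the edge list of (3, 1)-relationships has already been computed; it uses networkx together with the python-louvain package (imported as `community`), which provides `best_partition`.

```python
# Sketch: Louvain community detection on the user-relationship graph.
import networkx as nx
import community as community_louvain  # python-louvain package
from collections import Counter

# Assumed input: pairs of users satisfying the (3, 1)-relationship.
edges = [("u1", "u2"), ("u2", "u3"), ("u4", "u5")]

G = nx.Graph()
G.add_edges_from(edges)

partition = community_louvain.best_partition(G)  # maps node -> community id
sizes = Counter(partition.values())
print(f"{len(sizes)} communities; size distribution: {Counter(sizes.values())}")
```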

Results
The results on the synthetic dataset suggest that both algorithms (K-Means and DBSCAN) manage to group similar users, as confirmed by the high silhouette scores obtained by both. The relatively small cluster sizes (2, 3, 4, 5) suggest that this approach could be meaningful for inferring relationships. Additionally, while the 'visual test' after projecting the data to 2D suggests roughly 13 clusters, the silhouette score indicates that the grouping of the data, as well as the possible relationships among the users, is more intricate, as it assigns higher scores to higher values of K (where K is the desired number of clusters).
In the graph-based modelling approach for relationship inference, we carried out detection of communities in the network of users using the Louvain method and analysed the obtained results using graph-theoretic tools and metrics. The obtained results suggest that the graph-based approach may be suitable for relationship inference. For future work, we will account for the temporal aspect of the IP connections. For example, one can differentiate between relationships observed on weekdays and weekends, or during working hours and evenings. Such an approach might provide a categorization of the obtained relationships (co-workers, friends, spouses, etc.).
The process was repeated with a dataset of real data. After the tokenization (see section 3.3.3.4) of a real dataset of customer connections, the same process was performed and some conclusions were extracted from the comparison between synthetic and real data experiments.

Comparison with respect to synthetic data
The results obtained using K-Means clustering on real tokenized data show higher silhouette scores than with the synthetic data, which may suggest that the clusters are in better agreement with the data than in the synthetic case. Using the synthetic data, only one big cluster is obtained, with clusters of size 2 dominating. On the other hand, using the tokenized data, there are two big clusters containing most of the points, while single-point clusters dominate. This might explain the higher silhouette scores, as single-point clusters tend to inflate the metric. Additionally, the large number of single-point clusters in the tokenized real data might point to users with specific transaction patterns and could offer an interesting direction for future research and analysis.
The results obtained using DBSCAN on real tokenized data show higher silhouette scores for all but the final two values of the parameter eps. Coupled with the drastically decreasing number of clusters found for these final two values (from 681 to 6 clusters), it can be hypothesized that too loose a maximal distance between two points, for them to be considered part of the same cluster, causes almost all of the points to be grouped into a small number of clusters. Comparing the results on the real tokenized data with those on the synthetic data using DBSCAN, it can be observed that, with both datasets, most of the clusters found are 2-point clusters. However, with the real data, two big clusters containing most of the points are found, while with the synthetic data, one big cluster was found. Also, increasing the eps parameter on the synthetic data results in an increase in both the number of clusters and the silhouette score, whereas the opposite effect is observed on the tokenized real data, where increasing eps leads to a decrease in both.
The following table summarises some of the key points of the use case experimentation:

Experiment's Goals
To test the time efficiency of the I-BiDaaS solution.

Experiment's Questions Q1. Can synthetic data provide the same insights as the real data use case?
Q1.1 Does the generated data have the same structure as the real data?
Q1.2 How valid is the model generated with synthetic data with regard to the model of the real data?

Q2. Is the process of data fabrication more efficient than the process of granting access to real data?
Q2.1 How much time (mean) is necessary for generating a volume of synthetic data that can provide a valid model?
Q2.2 How large is the time reduction obtained by generating synthetic data instead of granting access permits to an external provider?

Experimental Workflow (based on the generic workflow, to be further refined)
1 - Data selection
2 - Synthetic data preparation: generate rules, fabricate synthetic data, upload data set
3 - Real data preparation
4 - Data analysis: select algorithm, custom algorithm
5 - Data visualization
6 - Adjust data fabrication rules

Experimental Subjects (participating in any of the steps above)

Role | Steps involved

Advanced Analysis of bank transfer payment in financial terminal
The second CAIXA use case that was studied in I-BiDaaS is 'Advanced Analysis of bank transfer payment in financial terminal'. This use case aims to detect the differences between reliable transfers and possible fraudulent cases. The goal of this experiment is to test the efficiency of the I-BiDaaS solution in the context of anomaly detection in bank transfers from employees' workstations (financial terminal).
For that reason, the first step was to identify all the contextual information from the bank transfer (i.e. execution time, transferred amount, etc.), the sender and receiver (e.g. name, surname, nationality, physical address, etc.), the employee (i.e. employee id, authorization level, etc.) and the bank office (e.g. office id, type of bank office, etc.). All this information comes from several relational database tables stored in the CAIXA datapool. The meaningful information was extracted and flattened into a single table. This task is particularly challenging because it requires identifying events and instances in the log file corresponding to the money transfer operations carried out by an employee from a bank centre, and joining those that relate to the same bank transfer. The heterogeneous nature of the log files, as saved in the CAIXA datapool, makes this task even more difficult. There is a total of 969,351,155 events in the log data for April 2019 alone. These events are heterogeneous in nature and arise from a mix of disparate operations associated with the services provided by employees in bank offices of different types.
After a laborious table flattening and composition process, a table of 32 fields was obtained and subsequently tokenized according to the encryption schemes described in section 3.3.3.4. In order to find anomalies through I-BiDaaS, this data was uploaded to the platform through an 'Expert Mode' use case, and several algorithms were executed to process the tokenized data, such as K-Means, PCA (Principal Component Analysis) and DBSCAN, as described in section 3.1 of I-BiDaaS D3.3 [6]. Those algorithms were executed using the dislib library with the support of BSC. Although it was studied as an 'Expert mode' use case, the complexity of the use case is moderate, and CAIXA assessed that intermediate users could work with it in order to modify the parameters of the algorithms and refine the initially found anomalies. Figure 12 shows the results obtained with the expert-mode visualisation of I-BiDaaS. The visualisation tool was validated, permitting CAIXA employees to visually identify those transactions that are outliers, select them as anomalies, and download them (in several formats such as .csv, .xls, etc.) in order to analyse them on their own or send them to SOC employees.
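As an illustration of this expert-mode analysis, the following sketch runs K-Means and PCA with dislib (https://dislib.bsc.es/). It assumes a PyCOMPSs runtime is available and that the tokenized table has been exported as a numeric matrix; it is a sketch of the approach, not the exact code executed on the platform.

```python
# Sketch: distributed K-Means and PCA on the tokenized table using dislib.
# Requires a PyCOMPSs runtime; run e.g. with `runcompss script.py`.
import numpy as np
import dislib as ds
from dislib.cluster import KMeans
from dislib.decomposition import PCA

data = np.random.rand(10000, 32)           # placeholder for the 32-field tokenized table
x = ds.array(data, block_size=(1000, 32))  # partitioned into blocks processed in parallel

km = KMeans(n_clusters=10, random_state=0)
labels = km.fit_predict(x).collect()       # cluster assignment per transaction

pca = PCA(n_components=2)
projected = pca.fit_transform(x).collect() # 2D projection for visual outlier inspection
```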
Thanks to this, the tools offered in I-BiDaaS were validated for the full cycle of big data processing, as a self-service for non-IT and intermediate users, while more advanced users are able to customize their big-data analysis. This gives more flexibility in comparison with competitors. In this sense, we analysed the same dataset with DataRobot, a more mature commercial solution recently acquired by CAIXA. DataRobot provides a benchmark of algorithms that can be applied to an uploaded dataset, and an autopilot option that helps select the most adequate algorithm. However, the tool is much more focused on supervised learning, and it did not provide clear results for this use case with any of the unsupervised learning algorithms it offers. Figure 13 shows the results obtained with the Random Forest algorithm, which ranked best for the provided dataset. However, they were fuzzy, making it much more difficult to identify clear anomalies than in I-BiDaaS. As a conclusion of the comparison between the two platforms, on the one hand, I-BiDaaS provides more flexibility in the definition of your own code, scoring metrics, etc. For example, I-BiDaaS allows changing the scoring function of a specific column, while this feature is fixed in DataRobot. It also allows IT & Big Data expert users to provide custom-tailored algorithms or refine existing ones.
On the other hand, DataRobot has a very limited number of unsupervised learning models, whereas I-BiDaaS can provide much more detailed results on unsupervised learning use cases based on clustering.
The following table summarises some of the key points of the use case experimentation:

Q2. How efficient is the process of data analytics?
Q2.2 How many potential fraud cases can be solved with I-BiDaaS platform?

Q3. Does the tokenization/encryption method assure compliance with the current security and privacy regulations?
Q3.1 Does the tokenization/encryption method ensure the privacy of the data when it leaves the premises of CAIXA, without business implications?

Q4. Which features can I-BiDaaS provide with regards to other data analytics commercial solutions (such as Data Robot)?
Experimental Workflow (based on the generic workflow, to be further refined)

Enhanced control of customers to online banking
Finally, in the 'Enhanced control of customers to online banking' use case, we focused on analyzing the mobile-to-mobile bank transfers ordered through online banking (web and application). The use case focuses on assessing whether the controls applied to authenticate the user are applied adequately (e.g., Strong Customer Authentication (SCA) by means of second-factor authentication) according to the PSD2 regulation and depending on the context of the bank transfer.
With that aim, we wanted to cluster a dataset collected from mobile-to-mobile transfers. Most of the information in this dataset does not need to be encrypted, because only a few fields are sensitive. The main objectives of the use case are to identify usage patterns in the mobile-to-mobile bank transfers and to enhance the current security by identifying the set of transactions for which we should increase the level of authentication. For that reason, we decided to analyze the collected 'online banking' dataset and work with unsupervised methods such as clustering.
We were faced with the need to cluster a categorical database, on which most well-known algorithms lose efficacy. Initially, an attempt was made to apply K-Means, an unsupervised clustering algorithm that groups objects into k groups based on their characteristics; grouping is performed by minimizing the sum of distances between each object and the centroid of its group or cluster, and the quadratic distance is often used. Since the vast majority of the available variables were not numerical, calculating these distances was no longer straightforward (for example, if there are three types of enhanced authentication, must the distance between them be the same, or should it be greater, since some of them are more restrictive than the others?). This type of question affects the result of the model, and therefore a transformation was applied to the data: we transformed the variable categories into (1, 0) columns, a transformation known as one-hot encoding, which eliminates the problem of calculating distances between categories. Even so, the results were not satisfactory. Given the situation, a search was carried out for a model appropriate to this kind of data, and we found the k-modes library, which includes algorithms for clustering categorical data.
The K-modes algorithm [15] is basically the already known K-Means, but with some modifications that allow us to work with categorical variables. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update the modes in the clustering process in order to minimize the clustering cost function.
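A minimal sketch of this step with the kmodes package follows; the attribute names and values are illustrative stand-ins for the categorical transfer attributes, not the real schema.

```python
# Sketch: clustering categorical transfer data with k-modes.
import numpy as np
import pandas as pd
from kmodes.kmodes import KModes

rng = np.random.default_rng(0)
# Illustrative categorical attributes of mobile-to-mobile transfers.
data = pd.DataFrame({
    "channel": rng.choice(["web", "app"], 200),
    "auth":    rng.choice(["sms", "none", "biometric"], 200),
    "country": rng.choice(["ES", "FR", "PT"], 200),
})

km = KModes(n_clusters=4, init="Huang", n_init=5, random_state=0)
clusters = km.fit_predict(data)
print(clusters[:10])           # cluster label per transfer
print(km.cluster_centroids_)   # modal value of each attribute per cluster
```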
Once the algorithm has been decided, we must calculate the optimal number of clusters for our use case. For this, the so-called elbow method is applied, which allows us to locate the optimal number of clusters as follows. We first define:
• Distortion: the average of the squared distances from the points to the centres of their respective clusters.
• Inertia: the sum of squared distances of the samples to their closest cluster centre.
Then we iterate k from 1 to 10 and calculate the distortion and inertia for each value of k in the given range. The idea is to select the smallest number of clusters beyond which inertia (the within-cluster dispersion) decreases only marginally; inertia itself always decreases as k grows, so it cannot simply be minimized.
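A sketch of the elbow computation with k-modes follows, using the library's `cost_` attribute as the categorical analogue of inertia; the data frame is the illustrative one from the previous sketch's generator.

```python
# Sketch: elbow method over k = 1..10 using the k-modes clustering cost.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from kmodes.kmodes import KModes

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "channel": rng.choice(["web", "app"], 200),
    "auth":    rng.choice(["sms", "none", "biometric"], 200),
    "country": rng.choice(["ES", "FR", "PT"], 200),
})

ks, costs = range(1, 11), []
for k in ks:
    km = KModes(n_clusters=k, init="Huang", n_init=5, random_state=0)
    km.fit(data)
    costs.append(km.cost_)  # total mismatch cost, the categorical analogue of inertia

plt.plot(ks, costs, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("clustering cost")
plt.title("Elbow method: pick k where the curve flattens")
plt.show()
```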

Figure 14: Number of clusters selection for 'Enhanced Control of customers to Online Banking'
To determine the optimal number of clusters, we select the value of k at the 'elbow', i.e., the point after which the distortion/inertia starts decreasing in an approximately linear fashion. Thus, for the given data, we conclude that the optimal number of clusters is 4. Once we know the optimal number of clusters, we apply k-modes with k = 4 and analyse the results obtained.
We worked with BSC in the analysis of this dataset and the clustering of it within the I-BiDaaS platform, being integrated with their support in the "I-BiDaaS expert mode".
With that support, our 'Intermediate users' and 'Non-IT users' were able to easily change the number of clusters to run over the dataset and visually analyse the results in the platform (Figure 15).

Those results were checked with the Digital Security and Security Operation Centre (SOC) employees from CAIXA in order to understand whether the clustering algorithm applied allowed the identification of potential errors in our automated authentication mechanisms for mobile-to-mobile bank transfers. The obtained clusters of entries were useful for identifying the different patterns of usage of mobile-to-mobile bank transfers and for reconsidering the way we select the authentication method used to proceed with a transfer. Nevertheless, the most important conclusion of the use case was the ability to perform big data clustering analytics in a very agile way, based on existing or custom-tailored clustering algorithms.
The following table summarises some of the key points of the use case experimentation:

Experiment's Goals
To assess whether the controls used to authenticate the user (e.g., second-factor authentication) are applied adequately to mobile-to-mobile bank transfers in online banking.

Experiment's Questions Q1. What is the quality of the analytics results?
Q1.1 How able is the I-BiDaaS platform to cluster the dataset into meaningful clusters?

Q2. How efficient is the process of data analytics?
Q2.2 How easy was it to run the clustering and identify potential errors in customer authentication?

Q4. Which features can I-BiDaaS provide with regards to other data analytics commercial solutions (such as Data Robot)?
Experimental Workflow (based on the generic workflow, to be further refined)

Data Encryption
During the I-BiDaaS project lifetime, CaixaBank changed its approach with regard to extracting sensitive data and allowing big data analytics outside its premises, thus breaking inter- and intra-sectorial data silos and supporting data sharing, exchange, and interoperability. At first, in the project definition, it was planned to evaluate and validate only synthetic data generation and the usage of this synthetic data with the I-BiDaaS tools. However, after the first use case, we realised that relying only on synthetic data was a limitation for extracting new insights from the data. Therefore, CAIXA moved to a more open position and started to evaluate ways to share real data. In that sense, sharing such data with a third party (i.e. uploading a dataset to any external cloud) requires a cryptographically secure encryption process that does not degrade the quality of the data. Several data encryption experiments were undertaken during the project and used for data tokenization in the use cases.
In this section, we provide a proof of principle demonstration of the encryption schemes that were used in the project for encrypting financial data.

Format-Preserving Encryption
Our principal aim is to encrypt sensitive data under the constraint that the encrypted data closely follow the format of the real data. For example, encrypting a 4-digit string gives back another 4-digit string. For this, we use an encryption scheme known as Format-Preserving Encryption, whose goal is to securely encrypt the data while preserving the original format.
To carry out this form of encryption, we specifically used the Feistel-based FFX mode of format-preserving encryption. The construction we used is based on the hash-based pseudorandom function HMAC.
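As a proof-of-principle sketch, the pyffx package provides a pure-Python FFX implementation; the key and field values below are illustrative, and this is not necessarily the exact construction used in the project.

```python
# Sketch: format-preserving encryption of a 4-digit field with FFX (pyffx).
import pyffx

fpe = pyffx.Integer(b"secret-key", length=4)  # illustrative key, 4-digit domain

ct = fpe.encrypt(1234)
print(ct)               # another 4-digit integer
print(fpe.decrypt(ct))  # 1234
```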

Order-preserving Encryption
Order-preserving encryption (OPE) allows comparing ciphertext values in order to learn the corresponding order relation between the underlying plaintexts. By definition, order-preserving encryption methods are less secure than conventional encryption algorithms for the same data sizes, because the former leak the ordering information of the plaintext values.
CAIXA stores several numeric fields that must be secured before sharing. This method is quite useful when external algorithms that rely on ordered data are to be applied; a typical example is comparing the age of two clients. The implementation of this scheme is delegated to open-source libraries; see the pyope package for more information.
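A minimal sketch with the pyope package follows, using the age-comparison example; the key handling is simplified for illustration.

```python
# Sketch: order-preserving encryption of client ages with pyope.
from pyope.ope import OPE

key = OPE.generate_key()      # in practice the key is stored securely
cipher = OPE(key)

age_a, age_b = cipher.encrypt(25), cipher.encrypt(40)
assert age_a < age_b          # ciphertext order mirrors plaintext order
print(cipher.decrypt(age_a))  # 25
```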

Privacy-preserving encryption of text using Bloom filters
In the financial data owned by CAIXA, one kind of sensitive information consists of free text such as surnames, street names, etc. Such information can be important for establishing relations between clients. Encryption schemes such as those described in Section 3.3.3.4 complicate the establishment of such relations: any mistake in spelling (or different way of writing, for example, L'hospitalet and Lhospitalet) produces encrypted text totally different from the original.
To address this issue, we use a privacy-preserving scheme for record linkage [14] that employs cryptographic Bloom filters. We describe the process of record linkage in detail in the following subsections. Here we just comment on one important property: Bloom-filter-based cryptographic schemes are non-reversible, that is, knowing only the private keys and the encrypted data, one cannot recover the original data.
The encryption is carried out in the following steps:

Splitting text in n-grams
In the field of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. A Bloom-filter atom uses cryptographic hash functions followed by a mapping to a bit array. We have used two sets of parameters, depending on the type of text to encrypt. For texts which are not susceptible to frequency-based attacks (e.g. street names), we follow the procedure outlined in the article 'Privacy-preserving record linkage using Bloom filters' [14] with the following parameters:
• two independent hash functions (sha224, sha256);
• the number of dependent hash functions;
• the length of the bit array.
For texts susceptible to frequency-based attacks (e.g. surnames), we increase the number of independent hash functions along with the bit-array size, as suggested in the literature.

Cryptographic Bloom filter and comparison
The cryptographic coding of a text is carried out by collecting all bit arrays created from the application of Bloom-filter atoms to each element of the n-gram decomposition of the text; the collection of bit arrays is then joined by Boolean OR operations. Such a cryptographic procedure can be used to assess the closeness of two texts by comparing similarity measures. An example of Bloom-filter representations of two similar texts (A: SMITH, B: SMYTH) is shown pictorially in Figure 16, taken from the aforementioned article [14]. The similarity measure we use is the Dice coefficient, given by D_{A,B} = 2h / (a + b), where h is the number of bit positions set to 1 in both A and B, and a and b are the numbers of bits set to 1 in A and B, respectively.
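The following self-contained sketch illustrates the whole pipeline (bigrams, HMAC-based Bloom-filter encoding with double hashing, Dice comparison); the key and the parameter values BITS and K are illustrative, not the production settings.

```python
# Sketch: privacy-preserving text comparison with cryptographic Bloom filters.
import hashlib
import hmac

BITS = 1000   # illustrative bit-array length
K = 30        # illustrative number of dependent hash functions
KEY = b"secret-key"

def bigrams(text: str):
    """Split a padded text into contiguous 2-grams."""
    padded = f"_{text.lower()}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def bloom(text: str) -> set:
    """Encode a text as the set of bit positions produced by double hashing."""
    positions = set()
    for gram in bigrams(text):
        h1 = int(hmac.new(KEY, gram.encode(), hashlib.sha224).hexdigest(), 16)
        h2 = int(hmac.new(KEY, gram.encode(), hashlib.sha256).hexdigest(), 16)
        for i in range(K):                       # K dependent hash functions
            positions.add((h1 + i * h2) % BITS)  # OR into the shared bit array
    return positions

def dice(a: set, b: set) -> float:
    """Dice coefficient D = 2h / (a + b) over the numbers of set bits."""
    return 2 * len(a & b) / (len(a) + len(b))

print(dice(bloom("SMITH"), bloom("SMYTH")))   # high similarity
print(dice(bloom("SMITH"), bloom("GARCIA")))  # low similarity
```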

Other evaluated encryption methodologies
Fully Homomorphic Encryption (FHE) was also considered as a potential encryption scheme for some data fields. FHE is a cryptographic technique that allows performing operations on encrypted data that are equivalent to directly manipulating the plaintext. Performing analytics over encrypted data involves an intrinsic trade-off between Accuracy, Security and Performance: accuracy is measured against the accuracy of comparable plaintext analytics; security is measured in terms of the ability to deduce information about the private encrypted data; performance is measured against the time and storage performance of comparable plaintext analytics. For complex tasks, in most cases, at least one of these elements is sacrificed for the others.
All existing FHE schemes have the property that the encrypted data contain noise, and this noise increases as the data are manipulated. When performing long computations, this noise needs to be cleaned periodically. This can be done in two ways. One way is to interact with the client (the owner of the data, who encrypted the data in the first place): every time a ciphertext accumulates too much noise, it is sent back to the client, where it is decrypted, re-encrypted, and returned; decrypting cleans the noise, and re-encrypting creates a fresh ciphertext with minimal noise. The other, completely non-interactive way is to use an operation called bootstrapping, which cleans the noise; this operation is computationally expensive and currently not available in most FHE scheme implementations.
The main experiment performed focused on training a complex Neural Network (NN) under FHE. As the complexity of the NN grows, the time and storage overheads introduced by FHE become substantial. Overcoming this challenge while maintaining acceptable scores in all three metrics (Accuracy, Security, Performance) was the goal that we set out to achieve in the experiment. We worked simultaneously on the simplification of the NN architecture and on HE optimization to improve performance, while maintaining industry-acceptable security levels and minimizing accuracy degradation. We were able to train a complex NN within an 8-hour timeframe, a figure that depends roughly linearly on the number of CPUs used. Currently, for 6 output classes and 24 CPUs, the accuracy drop is about 20%, while for 2 output classes there is hardly any accuracy drop. Due to the parallelization work that was done, this figure can be improved by increasing the number of CPUs.
However, this work focused on one very specific task, and we concluded that FHE is not currently applicable to financial data encryption in general.

Manufacturing experiments
The main objective of the manufacturing experiments is to demonstrate the ability to exploit Big Data in order to take advantage of real-time shop-floor data and apply sophisticated statistical assessments. Tables 17-18 present the definition of the associated experiments. Manufacturing production processes are complex in that production lines have several robots and digital tools. At the shop-floor level, massive amounts of raw data are gathered; data that not only help to monitor processes but can also improve process robustness and efficiency.
Within the I-BiDaaS project, the data provider CRF identified two scenarios in which complex structured/unstructured data sets are retrieved from real processes.
The project focuses on providing a self-service solution that gives CRF employees the insights and tools they need to develop a methodology to implement at production sites for improving the quality of processes and products in a much more agile way, through the collaborative effort of self-organizing and cross-functional teams. Together with the experimental subjects who participated in the experimental workflow, the final end-users for the manufacturing sector can be grouped into three main groups:
• Manufacturers: people who have the relevant experience and knowledge of current practices to innovate and improve, offering the opportunity to validate and demonstrate the project, its approach and its results in real contexts.
• Intermediate users: people involved in data collection, data security, manual analysis, operational flows and required functionalities, who investigate the I-BiDaaS solution in order to innovate the production management processes.
• Operators: people employed at different levels in production processes, who need the data processing results to be genuinely useful. This is achieved, for example, through advanced data visualization methods that provide the insights, value, and operational knowledge extracted from the available data. This system allows the operator to understand the meaning and relationships of the analysed data through graph representations of the algorithms developed by the consortium.
As an industrial end-user, CRF identified the necessary requirements to develop analytics on the data retrieved from a real industrial environment. For both use cases, confidentiality is very important to protect information from being accessed by external parties, so the data have been anonymised before being shared. Furthermore, the lack of time to extract and analyse data, due to the fast rhythms of production and fast internal changes caused by rescheduled production quantities and component variations, required data cleaning, in terms of identifying incomplete, inaccurate and irrelevant parts of the data, and data analyses with advanced visualisation tools to better support manufacturers' decisions. All analyses were carried out by I-BiDaaS experts, as detailed in the corresponding activities developed within the project and reported in all technical deliverables. In the next two sections, we describe the main outcomes for both use cases and explain how the I-BiDaaS solution can be used to develop a methodology to implement the use cases in real scenarios for quality and process improvements and predictive maintenance.

Maintenance and Monitoring of production assets
This use case was selected to use the data to optimise a real industrial process and to set up a predictive maintenance procedure, in order to prevent faults before they happen by doing maintenance at the right time (not too late and not too early, to avoid inefficiencies). Different types of sensors are installed on the production line and acquire different kinds of data (e.g. acceleration, velocity, pressure, temperature and so on). All of the sensors record their perception of the surroundings, uploading and transferring this information to a server that manages the data. For example, accelerometers are used for measuring vibration and shock on machines and on basically anything that moves. The monitoring of vibrations is therefore important for checking the status of a machine, and the analysis of the trend of vibrations over time makes it possible to predict the onset of deterioration and to intervene before the failure. The continuous and periodic monitoring of the service conditions of a machine is known as predictive maintenance. The goal of this experiment is to test the I-BiDaaS platform, using different methods adapted to different users (expert/non-expert) across silos: different companies, departments and competences are involved.
Before the analysis, the data were transformed into separate time series, one per sensor, in order to monitor each sensor's time series on any given day. As described in D3.3 [6], I-BiDaaS analysts carried out an outlier detection analysis on each sensor separately. Subsequently, they compared the time stamps of the detected anomalous measurements across the results for different sensors. The analysis did not require any parallelization; everything was done on a single GPU (NVIDIA RTX2070). The outlier detection analysis was performed using a modified interquartile range (IQR) test. It was established that almost all sensors had days with anomalous measurements, and almost all of these days were common to different sensors (more than 90% on average). Two more similar tests were performed, in which Q1 was calculated as the 10th (5th) percentile and Q3 as the 90th (95th) percentile. The most informative results were obtained for Q1 = 5th and Q3 = 95th percentile.
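A minimal sketch of the modified IQR test described above follows; the sensor series is a random stand-in, and the percentile pair (5, 95) corresponds to the most informative setting reported.

```python
# Sketch: modified IQR outlier test on one sensor's time series.
import numpy as np

rng = np.random.default_rng(0)
series = rng.normal(0.0, 1.0, 10_000)  # placeholder for one sensor's measurements

q1, q3 = np.percentile(series, [5, 95])  # modified quantiles instead of 25/75
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = np.where((series < lower) | (series > upper))[0]
print(f"{outliers.size} anomalous measurements")  # their timestamps are then
                                                  # compared across sensors
```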
After implementing these results, the efficiency and accuracy of the I-BiDaaS model with respect to internal CRF analyses allowed us to quickly visualise the results on the I-BiDaaS platform in 'Co-Develop Mode', obtaining a visual graphic of the anomalous measurements for the selected year, month, day and sensor, as shown in Figure 18 and Figure 19, and giving us the possibility of developing a methodology to intervene with specific actions. By pressing any bar in Figure 18, it is possible to visualise the anomalous values for the selected day and match them in order to understand what happened.
The following table summarises some of the key points of the use case experimentation:

Experiment's Goals
To test the efficiency of the I-BiDaaS solution in the context of anticipating maintenance events (alarms).

Experiment's Questions Q1. What is the quality of the analytics results?
Q1.1 What is the accuracy of new models with respect to internal CRF models in use (geographical representation of the process)?

Q2. How efficient is the process of data analytics?
Q2.1 How efficient is the performance of the analytics application (algorithm)?
Q2.2 How efficient is the visualisation of the analytics solution to allow the workers a quick intervention with specific actions?

Experimental Workflow (based on the generic workflow, to be further refined)

Production Process of Aluminium die-casting
This use case aims to improve the quality of the production process of engine blocks. During the die-casting process, molten aluminium is injected into a die cavity where it solidifies quickly. The process is complex, and it is important not only to carefully design parameters and temperatures but also to control them, because they have a direct impact on the quality of the casting. Big Data analysis aims to improve the quality of the process by finding the most significant parameters to monitor and control. The goal of this experiment is to test the efficiency of the I-BiDaaS solution in the context of correlating defects with the production process parameters and resetting them to prevent repairs and reprocessing of the engine blocks.
Firstly, to allow the classification of the engine blocks according to their control class, Random Forest was used to assess feature importance and possibly point to the most important process parameters that determine the outcome of the classification. The first analysis showed that this use case corresponds to an imbalanced problem, i.e. there are more samples belonging to class 'a' than to class 'b', so the basic random forest algorithm tends to favour the larger class. To avoid this, weighted Random Forest, an extension of the basic random forest that associates different weights (or rewards) with the different classes in the objective function, was used. Treating the problem as a binary classification problem, with the aim of separating scrap from proper engines, a binary classification algorithm was applied. Furthermore, the newly implemented distributed alternating direction method of multipliers (ADMM) algorithm was used to perform the binary classification on the given dataset. Since the data contain a large number of process parameters, and because of their proven performance in practice, a Deep Neural Network framework was also chosen. Deep Neural Networks contain a large number of layers that enable them to learn the relationships between the input data and the target values, whether those relationships are linear or highly non-linear. Considering that part of the data for this use case consisted of thermal images of engines, convolutional neural networks, and DenseNet-201 in particular, were used for image classification. Further details on the modelling and analytical approach for this use case can be found in D3.2 [18] and D3.3 [6].
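A minimal sketch of the weighted Random Forest step with scikit-learn follows; weighting via `class_weight='balanced'` is one standard way to realise the class weighting described above, and the data are illustrative, not CRF's.

```python
# Sketch: weighted Random Forest for the imbalanced engine-block classification.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 30))            # placeholder process parameters
y = (rng.random(2000) < 0.05).astype(int)  # rare class 'b' (e.g. scrap engines)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' reweights classes inversely to their frequencies,
# counteracting the bias towards the larger class.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                            random_state=0)
rf.fit(X_tr, y_tr)

# Feature importances point to the most influential process parameters.
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("most important parameter indices:", top)
```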
A visualization approach using t-Distributed Stochastic Neighbor Embedding (t-SNE [11]) is also used to visualize the data in 2D and see whether there is any structure emerging (see D3.2 [18]).
To develop a methodology to improve the quality of the process, we can quickly visualise the results of the analyses, performed by I-BiDaaS experts in the I-BiDaaS platform. In the following figures, the self-service solution, developed by I-BiDaaS, is explained step by step.
A dynamic diagram shows the incoming streaming data in real time, as well as constantly updated aggregations of them, after pressing the top-left button 'Run Experiment'. In this case, batch analytics were used to develop the high-level algorithms in order to identify and select the critical parameters, providing a 'Co-Develop Mode' for timely checking the status of the process and classifying the quality levels that we identified as KPIs. Subsequently, CRF connected its internal server to the I-BiDaaS platform through a Virtual Machine created by the I-BiDaaS technologists for sharing data in real time. Every two minutes, corresponding to the production time of an engine block, data are copied to a folder in the Virtual Machine, analysed in near real time, and a response is provided within a few seconds. In this way, we can develop a methodology to reduce scrap and waste and to prevent repairs and reprocessing, by avoiding unnecessary actions after the die-casting of the engine block, such as impregnation, cooling, storing and the management of failed engines.
The possibility of using a Virtual Machine for sharing and copying data was a great solution provided by the I-BiDaaS consortium; it can be easily adopted by industries whose corporate constraints do not allow sharing a high volume of data with internal systems and which cannot give access to their internal servers.
The following table summarises some of the key points of the use case experimentation:

Experiment's Goals
To test the efficiency of the I-BiDaaS solution in the context of correlating defects with the production process parameters.

Experiment's Questions Q1. What is the quality of the analytics results?
Q1.1 What is the accuracy of new models with respect to internal CRF Aluminium Casting models?

Q2. How efficient is the process of data analytics?
Q2.1 How efficient is the performance of the analytics application (algorithm)?
Q2.2 How efficient is the visualisation of the analytics solution to allow a quick intervention with specific actions?

Experimental Workflow (based on the generic workflow, to be further refined)
1 - Data selection
2 - Synthetic data preparation: generate rules, fabricate synthetic data, upload data set
3 - Real data preparation
4 - Data analysis: select algorithm, custom algorithm
5 - Data visualisation
6 - Adjust data fabrication rules

Experimental Subjects (participating in any of the steps above)

Role | Steps involved | No of participants
Quality assurance and control managers

Generic experiments
Generic experiments aim to evaluate the functionality and usability of the I-BiDaaS platform from the perspective of potential generic end-users. These experiments are meant for platform usage beyond the concrete industrial use cases defined within the project, aiming at wider usability and applicability of the solution. To this end, Experiment #9, shown in Table 19, corresponds to the 'Self-Service mode' of the platform and is targeted at non-experts, while Experiment #10, shown in Table 20, is for expert users (PyCOMPSs developers).
End-to-end solutions have been defined to offer comprehensive systems aligned with the I-BiDaaS infrastructure, considering that an end-to-end solution may cover everything from project setup and the selection of data sources to the selection, setup and execution of the algorithms and the visualisation of results.

Experimental Workflow (based on the generic workflow, to be further refined)
1 - Project setup
2 - Selecting a data source
3 - Algorithm selection and setup (run the user questionnaire, use the interactive guidance to select an algorithm, set the amount of resources to be used)

Experimental Subjects (participating in any of the steps above)

Role | Steps involved | No of participants
Data analysts | 1-5 | 10-20

Q4. Cost/effort reduction (from not having to set up or maintain a PyCOMPSs environment for development)?
Experimental Workflow (based on the generic workflow, to be further refined)
1 - Create a new project
2 - Select a data source
3 - Upload code for the desired algorithm, optionally using a provided template
4 - Edit the code and re-upload

Cross-sectorial experiments
Several discussions have taken place between CAIXA and TID to find potential cross-sectorial experiments that would help in the evaluation of I-BiDaaS. Most of the use cases that were considered require sharing very sensitive data from CAIXA or TID customers. CAIXA studied how to extract data without breaking data privacy and perform a certain level of big data analytics; however, the results from those analytics are evaluated internally after decrypting the data.
Other encryption mechanisms that could be used to run analytics mixing data from different entities were evaluated (i.e. FHE), but we arrived at the conclusion that only very simple operations can be performed under this kind of encryption, so they are not viable either (as described in the previous Section 3.3.3.4).
Therefore, the only research line that was found to correctly perform the cross-sectorial experiments and overcome the data privacy barriers set by the General Data Protection Regulation (GDPR) was to pursue the Federated Machine Learning (FML) direction. FML allows different entities to perform big data analytics with the sensitive data they own, on their own premises. Each entity generates models from its data, and these are combined into a common macro-model constructed from the individual data models. This common model should be defined and agreed upon by the entities beforehand, and it should have some parameters in common, defined by the different entities, in order to be useful. After the macro-model is constructed, it can be consulted by all the entities in order to enhance each entity's model and extract new insights not previously available.
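To make the idea concrete, the following sketch shows federated averaging (FedAvg), one standard way to combine per-entity models into a common macro-model without sharing raw data; it is an illustration of the concept, not an I-BiDaaS component.

```python
# Sketch: federated averaging of per-entity model weights (FedAvg).
import numpy as np

def local_update(weights: np.ndarray, X: np.ndarray, y: np.ndarray,
                 lr: float = 0.1) -> np.ndarray:
    """One gradient step of linear regression on an entity's private data."""
    grad = X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])

# Each entity holds its own private dataset (never shared with the others).
datasets = []
for n in (100, 300, 600):  # entities of different sizes
    X = rng.normal(size=(n, 3))
    datasets.append((X, X @ true_w + rng.normal(0.0, 0.1, n)))

global_w = np.zeros(3)
for _ in range(200):  # communication rounds
    locals_ = [local_update(global_w, X, y) for X, y in datasets]
    sizes = np.array([len(y) for _, y in datasets], dtype=float)
    # Combine: average the local models, weighted by each entity's sample count.
    global_w = np.average(locals_, axis=0, weights=sizes)

print(global_w)  # approaches true_w without any raw data exchange
```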
We consider FML a very interesting approach to work on cross-sectorial use cases in which we want to mix sensitive data that, by regulation, cannot be shared between industrial entities. We identified this approach as a hot research topic that is trending up. However, further experimentation with this approach was considered out of the scope of the I-BiDaaS experimentation because of its complexity, and it would require newer initiatives and projects that focus on this research line.
However, we need to highlight that during the course of the project we performed an extensive analysis of how the I-BiDaaS solution can support cross-sectorial experiments. I-BiDaaS offers both experts and non-IT users all the necessary technologies to ingest, understand, process and visualize data, and to extract value from it, as soon as the data become available.

Overview
This section describes the evaluation process of the I-BiDaaS solution, covering the analysis of the different tools and modules provided and the overall performance of the I-BiDaaS prototype. It follows the evaluation methodology defined in D1.3 [3], reporting the smooth and adequate running of the experiments according to the experimental protocol and demonstrating how the I-BiDaaS solution can effectively aggregate, pre-process, manage and synthesize different types of data, including noisy and large-scale data sets, in both batch and real-time processing.
The evaluation process provides structured feedback to the development process, both from the data providers and from the technology owners, in order to ensure the project's impact and thus foster the platform's long-term sustainability.

Data quality from the perspective of assessing algorithm scalability
In this section, we report on testing data quality with respect to scalability or, more specifically, with respect to algorithm scalability when testing with synthetic data. This means that, in order to measure synthetic data quality, we need to see how it behaves (scaling-wise) compared to the real data. Ideally, the differences in scaling should be minimal.
We tested this hypothesis on the CAIXA dataset corresponding to the use case 'Analysis of relationships through IP addresses'. The real (tokenized) dataset has 295,838 samples, while the dataset generated with IBM's TDF tool has 481,672 samples. We took 5,000 samples from each of these datasets and tested a K-Means implementation from the scikit-learn library (https://github.com/scikit-learn/scikit-learn) on them.
Our conclusion is that the synthetic data behaves very similarly to the real data in terms of execution time/CPU scaling, which indicates that it is suitable for algorithm scalability testing.
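A sketch of such a scaling test follows. It times scikit-learn's K-Means on equal-sized samples of the real and synthetic data while capping the number of CPU threads with threadpoolctl (a scikit-learn dependency); the data-loading step is a placeholder.

```python
# Sketch: compare execution-time/CPU scaling of K-Means on real vs. synthetic data.
import time
import numpy as np
from sklearn.cluster import KMeans
from threadpoolctl import threadpool_limits

rng = np.random.default_rng(0)
real = rng.poisson(0.1, size=(5000, 100)).astype(float)       # stand-in: real sample
synthetic = rng.poisson(0.1, size=(5000, 100)).astype(float)  # stand-in: TDF sample

for name, data in (("real", real), ("synthetic", synthetic)):
    for n_cpus in (1, 2, 4, 8):
        with threadpool_limits(limits=n_cpus):  # cap the BLAS/OpenMP thread pool
            t0 = time.perf_counter()
            KMeans(n_clusters=50, n_init=1, random_state=0).fit(data)
            elapsed = time.perf_counter() - t0
        print(f"{name:9s} cpus={n_cpus}  time={elapsed:.2f}s")
# Similar scaling curves for both datasets support the suitability of the
# synthetic data for algorithm scalability testing.
```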

Specific and general utility
This evaluation was carried out, in several phases, over data that were fabricated for the CAIXA use case 'Analysis of relationships through IP address' and the CRF use case 'Production process of aluminium die-casting'. Both use cases include detailed data definitions, specifications and descriptions which can be found in D2.1 [4]. The synthetic data was fabricated by TDF using these definitions.
The TDF synthesizer accepts data description rules (constraints defined by the user) and fabricates data that satisfies all the constraints (using a solver). Although the synthesized data is guaranteed to satisfy each of the provided constraints, there is no assurance that the TDF synthesizer model corresponds to the model that generated the original data. When the generative model adheres too closely to the proposed utility model, validity checks, such as the existence of other interactions, may not be reflected in the synthesized data. That is why both general and specific measures of utility are required to provide an assessment of the synthetic data model. To that end, the data providers initially performed a set of tests to validate the structure, as well as a set of minimal requirements of internal applications that make use of such data.
To extend analyses and experiments, IBM performed a generic evaluation process for the real data (provided by the use case providers) compared with the fabricated data. This evaluation is concerned with methods to judge whether the fabricated data have a distribution that is comparable to that of the original data, what is commonly referred to in the literature as general utility. In addition to the general utility, we also consider specific utility, i.e. the similarity of results of analyses from the synthetic data and the original data.
As a general measure of data utility, we used the propensity score mean-squared-error (pMSE), adapted to the specific case of synthetic data. As specific utility measures, we used confidence-interval overlap and the standardized difference in summary statistics, which we added to the general utility results.

Specific Utility
Synthetic data utility is often assessed by analysis-specific measures which compare data summaries and/or the coefficients of models fitted to synthetic data with those from the original data. If inferences from original and synthetic data agree, the synthetic data are said to have high utility. Published evaluations of synthetic data using specific utility measures, usually for just a few selected analyses, have highlighted differences in the quality of syntheses.
We applied data analysis over the real and synthetic datasets, which included the inference of single- and multi-attribute constraints, and compared a selected subset of the results. The inferred single-attribute constraints include best-fitted distributions (and the corresponding parameters), min-max values, value frequencies, patterns, formats and other statistical properties. The inferred multi-attribute constraints included value correlations between tuples (up to size 3) of columns (numeric, categorical, dates, and polynomial relations of up to degree 4), as shown in Figure 28.

General Utility
Previous work has suggested various general measures of utility for data that have undergone disclosure control. Generally, these measures consider the distributional similarity between the original and fabricated datasets, with greater utility attributed to masked data that are more similar to the original data. In the broadest sense, measures such as the distance between empirical Cumulative Distribution Functions (CDFs) or the Kullback-Leibler (KL) divergence give an estimate of the difference. Karr et al. (2006) [16] and the follow-up paper by Woo et al. (2009) [17] discussed and implemented various distributional measures, such as the KL divergence, an empirical CDF measure, a method based on clustering, and one that uses propensity scores, to estimate general utility. They compared these measures for micro-aggregation, additive noise, swapping, and resampling methods, and evaluated the propensity score method as the most promising. Propensity scores represent probabilities of group membership, commonly used in causal inference studies. To use them as a measure of utility, one needs to model group membership between the original and the masked data to obtain an estimate of distinguishability, where small distinguishability corresponds to high distributional similarity between the original and masked data. If the propensity scores are well modelled, this general measure should capture relationships among the data that methods such as the empirical CDF may miss.
The propensity score method, given in Woo et al. (2009) [17], can be summarized as follows. The n rows of the original data and the m rows of the synthetic data are merged, with the addition of an indicator variable I for the source of each row (0 for real, 1 for synthetic). A propensity score p̂_i is estimated for each of the n + m rows, as the probability of classification for the indicator variable, using predictors based on the variables in the data. The mean squared difference between these estimated probabilities and the true proportion c = m/(n + m) of synthetic records in the merged data gives the utility statistic

pMSE = (1 / (n + m)) · Σ_{i=1}^{n+m} (p̂_i − c)².

The method can be thought of as a classification problem in which the desired result is poor classification (a 50% error rate), so lower values of the pMSE indicate better utility. Randomly sampling 5,000 data points from the real and synthetic datasets and using a logistic regression to provide the probability for the label classification, we measured a mean pMSE score of 0.234 (standard deviation 0.000835) for the CAIXA IP dataset, and a pMSE score of 0.218 (standard deviation 0.00146) for the 'Production Process of Aluminium die-casting' dataset.
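A sketch of the pMSE computation as just described, with logistic regression as the propensity model; the data loading is a placeholder.

```python
# Sketch: propensity-score mean-squared-error (pMSE) between real and synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(5000, 10))    # placeholder real sample
synth = rng.normal(0.05, 1.0, size=(5000, 10))  # placeholder synthetic sample

X = np.vstack([real, synth])
I = np.concatenate([np.zeros(len(real)), np.ones(len(synth))])  # source indicator
c = len(synth) / len(X)                          # true synthetic proportion m/(n+m)

model = LogisticRegression(max_iter=1000).fit(X, I)
p_hat = model.predict_proba(X)[:, 1]             # estimated propensity scores

pmse = np.mean((p_hat - c) ** 2)
print(f"pMSE = {pmse:.4f}")  # near 0 means the two datasets are hard to distinguish
```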

The I-BiDaaS integrated solution and architecture implementation
The I-BiDaaS platform takes into consideration many important features that have been implemented in order to make the I-BiDaaS solutions developed by data analysts and data technologists usable. Batch and streaming analytics have been enhanced, respectively, through the implementation of more data, high-level algorithms and better testing, and through analysing real-time data via complex event processing.
a) The Expert mode allows experts (developers) to upload their own data analytics code based on the available I-BiDaaS highly reusable templates.
b) The Self-service mode allows users that have the relevant domain knowledge and some knowledge about data analysis (non-experts) to easily construct Big Data pipelines in a user-friendly way, selecting a pre-defined data analytics algorithm from an available list.
The I-BiDaaS architecture is depicted in the following figure.
The system architecture is shown in Figure 37. The cluster that runs the batch processing jobs is based on Docker Swarm. The swarm consists of a manager node and a set of worker nodes. The orchestrator dynamically assigns a set of workers from the available nodes in the swarm, based on the user's preferences and set-up entered through the UI. These workers exploit the Cassandra DB and the shared file system in order to complete the requested job. For the case of streaming analytics, the architecture is presented in Figure 38. In this case, the orchestrator collaborates with the Universal Messaging bus provided by SAG, which collects the data and feeds the APAMA analytics engine.

Experiments verification and validation
Experiments verification and validation corresponds to the Operation step of the I-BiDaaS experimental process (described in deliverable D6.1 [19]), aiming at the evaluation of the I-BiDaaS platform according to the experiments' definition and against the stakeholders' requirements. A main characteristic of the I-BiDaaS experimental process is that it considers both technical and business requirements. In particular, it aims to evaluate both the performance of the I-BiDaaS solution and its alignment with the needs of the industrial users. To this end, the experiments verification and validation integrates the following aspects:
a) Quantitative (technology-centred) evaluation: evaluation of the quality of the I-BiDaaS platform in parts and as a whole, using appropriate benchmarks. The main stakeholders in this phase are technology providers.
b) Qualitative (user-centred) evaluation: experimental evaluation of the I-BiDaaS platform in a real business setting and against user requirements. The main stakeholders in this phase are business users (data analysts, financial administrators, etc.), as well as other users such as IT administrators, Big Data developers, etc.
To this end, a set of indicators has been defined against which validation of the I-BiDaaS solution efficiency can be measured. Such indicators reflect both technology features at component and platform level (e.g., operational performance), as well as business key performance indicators (e.g., service quality, time efficiency). The former are use case independent, whilst the latter reflect the specific needs expressed in each use case. Alignment of both types of indicators has been a main objective of the experimental definition phase.
For each indicator, a set of quantifiable metrics has also been defined, whose measurement relates to the achievement (or not) of a specific indicator. For example, in the case of operational performance, relevant metrics include execution time and throughput. In the case of service quality, a relevant metric might be the Overall Equipment Effectiveness (OEE). For the metrics related to technology indicators, available big data benchmarks can also be used. An initial investigation of applicable big data benchmarks was performed during the I-BiDaaS baseline phase and reported in D1.3 [3]. This has been further informed by recent research in the area, reported in the results of ongoing projects, such as the classification of big data benchmarks proposed by the European DataBench project, in which some of the most well-known benchmarking tools are classified according to benchmark category, type, domain, data type and metrics measured.

Quantitative evaluation
The quantitative (technical) evaluation focuses on the evaluation of the quality of the I-BiDaaS platform, in parts and as a whole, through testing with appropriate benchmarks, where available. This section continues the work reported in D6.3 [2] concerning the evaluation of the results of I-BiDaaS, providing an update of the evaluation of the individual parts and of the overall I-BiDaaS solution at M32.

Individual parts evaluation
The progress of the quantitative evaluation of each I-BiDaaS platform component is reported in this section, in order to provide an update of the evaluation indicators used for the verification and validation of the I-BiDaaS solution in parts. Table 21 reports the measurements obtained for each of the modules developed in I-BiDaaS:

Qbeast
During the project, the tests initially proposed for Qbeast were reconsidered, as described in the previous table. The new tests better reflect the real performance of Qbeast, not only for synthetic benchmarks but also for real applications. We ran our tests at the Barcelona Supercomputing Center, on the MareNostrum IV supercomputer. Each server contains two sockets with an Intel Xeon Platinum 8160 24C, for a total of 48 cores and 96 GB of RAM per server. Nodes are interconnected by 100 Gbit Intel Omni-Path and 10 Gbit Ethernet (https://www.bsc.es/marenostrum/marenostrum/technical-information). We used the local 240 GB Intel s3520 SATA SSD scratch disk to store data. The disks are rated for sequential reads and writes of up to 320 and 300 MB/s, respectively, and for random reads and writes of up to 65,000 and 16,000 IOPS. We used a stress tool shipped with Cassandra to benchmark the system, using twice as many machines for the stress tool as for the database and performing random insertions with a Gaussian distribution.
In Figure 39, with two nodes, Cassandra and Qbeast perform very similarly, achieving ≈84K and ≈83K IOPS, respectively. Both improve by approximately 80% when doubling the nodes. The scalability is not linear because replication is synchronous, which adds latency and increases resource usage. Table 22 shows how Qbeast improves after multiple Read-Optimizations for three different types of queries (All 0.01%, Olfactory 1% and Inhaler 1%). The table also reports the speedups achieved for the three queries, ranging from a 24.51x improvement down to a 'mere' factor of 2.37. In query 'All 0.01%', we obtain the highest speedup, as we benefit the most from the efficient sampling of Qbeast. Finally, in terms of disk usage, older versions of Qbeast had to replicate each item 5.29 times on average, while the newest version requires only 1.14.

Hecuba
The tests of the integration between Hecuba and dislib were executed on the MareNostrum IV supercomputer, like the Qbeast tests. In order to obtain the different measurements, we ran three different algorithms: K-Means, PCA and KNN.
For the K-Means algorithm, we used a dataset of 10 million samples with 50 features. Dislib gives us the possibility to decide the granularity of the data; to obtain the best performance, we divided the dataset into 48 blocks, which are operated on in parallel thanks to COMPSs. The K-Means run analyses the data to find 50 clusters. The best performance is obtained when using 96 cores, which in MareNostrum IV is equivalent to using 2 computing nodes.
To perform the PCA algorithm, we used a dataset of 10 million samples with 50 features, reducing the number of dimensions to 3, a typical PCA configuration for subsequent clustering and visualization of the data. In this case, the best performance was obtained by dividing the data into 96 blocks and using 96 cores (2 computing nodes). For the K-Means and PCA algorithms, there is some overhead when using data from Hecuba, due to the time needed to load each block of data; in future updates a cache will be developed to reduce this difference.
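A corresponding sketch for the PCA configuration (again with randomly generated input standing in for the Hecuba-backed data) could look as follows:

    import dislib as ds
    from dislib.decomposition import PCA

    # 10M samples x 50 features in 96 row blocks (matching the 96 cores
    # used in the reported runs), reduced to 3 principal components.
    n_samples, n_features, n_blocks = 10_000_000, 50, 96
    x = ds.random_array((n_samples, n_features),
                        (n_samples // n_blocks + 1, n_features))

    pca = PCA(n_components=3)
    x3 = pca.fit_transform(x)  # distributed array with 3 columns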
Finally, for the KNN algorithm, a dataset of 10 million samples and 50 features was used, performing a KNN search with 10 neighbours. For these tests, the chosen granularity was 48 blocks. As can be seen, the algorithm performs better when using data from Hecuba; this is due to an improvement in COMPSs, which detects that the data reside in Cassandra, so each task can retrieve its own data directly instead of the data being serialized into a file and passed to the tasks.
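Finally, a sketch of the KNN configuration under the same assumptions (random input in place of the Hecuba-backed tables, execution through the COMPSs runtime):

    import dislib as ds
    from dislib.neighbors import NearestNeighbors

    # 10M samples x 50 features in 48 blocks; query the 10 nearest
    # neighbours of every sample, as in the reported experiment.
    n_samples, n_features, n_blocks = 10_000_000, 50, 48
    x = ds.random_array((n_samples, n_features),
                        (n_samples // n_blocks + 1, n_features))

    knn = NearestNeighbors(n_neighbors=10)
    knn.fit(x)
    distances, indices = knn.kneighbors(x)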

Overall I-BiDaaS solution evaluation
The following table provides an update of the main metrics used for testing the quality of the overall I-BiDaaS solution, together with relevant benchmarks and the measurements obtained so far. The measurements were collected through questionnaires distributed in focus groups and webinars organized by Telefonica and CAIXA.

Tests in relation to I-BiDaaS industry validated benchmarks
Tests in relation to industrial benchmarks will be facilitated through the use of the DataBench Toolbox, which aims to provide tooling support for Big Data benchmarking users to search, select, deploy and run existing Big Data benchmarks on the one hand and, on the other hand, to obtain the results of the execution, homogenize the technical metrics, and finally help derive business insights and KPIs.
To this end, a collaboration between the I-BiDaaS and DataBench projects was initiated in May, aiming to explore how to best exploit the DataBench Toolbox in the context of I-BiDaaS. The DataBench webinar in which I-BiDaaS participated is described in Section 5.2.

Qualitative evaluation
As presented in D6.3 [2], the definition of the experimental qualitative evaluation follows a goal-oriented approach, whereby for each experiment: first, the experiment's goal(s) towards which the measurement will be performed are defined; then, a number of questions are formed aiming to characterize the achievement of each goal; and finally, a set of indicators and appropriate metrics is associated with every question in order to answer it in a measurable way. The experimental goals and associated questions for each experiment have been presented in Section 3.3. The following Sections 4.6.1 to 4.6.4 show the indicators and associated metrics for each experiment. These are defined at both the business and application levels, thus ensuring (a) that both business and technical requirements are taken into consideration and (b) traceability between business and application performance.
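To make the goal-question-metric chain concrete, the following minimal Python sketch shows how a goal can be decomposed into questions and measurable metrics. The goal, question and metric names are hypothetical examples, not the project's actual indicator definitions.

    # Hypothetical goal-question-metric (GQM) decomposition; names and
    # targets are illustrative, not the actual I-BiDaaS indicators.
    gqm = {
        "goal": "Reduce analysis latency in the banking experiment",
        "questions": [
            {"question": "How quickly are incoming records scored?",
             "metrics": [{"name": "end_to_end_latency", "unit": "ms",
                          "target": "< 500"}]},
            {"question": "Does throughput scale with data volume?",
             "metrics": [{"name": "records_per_second", "unit": "records/s",
                          "target": ">= 10000"}]},
        ],
    }

    # A question is considered answered when all of its metrics meet their
    # targets during the experiment runs; a goal is achieved when all of
    # its questions are answered.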

Telecommunication experiments
Tables 24-26 report the values obtained during the telecommunication experiments with respect to the identified metrics.

Banking experiments
Tables 27-29 report the values obtained during the banking experiments with respect to the identified metrics.
Among the platform-level performance indicators, the cost (price of technologies) depends on the selected type of license and is in the order of 100k€.

Manufacturing experiments
Table 30 and Table 31 report the values obtained during the manufacturing experiments with respect to the identified metrics.

Generic experiments
Table 32 and Table 33 provide an overview of the metrics and associated expected values with respect to the two generic I-BiDaaS experiments.

High-level non-functional requirements evaluation
All experiments have been tested with respect to the usability, operability, robustness, innovation, compliance, privacy awareness and cost of the I-BiDaaS solution, based on the user questionnaires filled in by the data providers (see D6.3 [2]). Specifically:
• Usability: The usage of real and valuable datasets from different industrial sectors allowed the usability of the I-BiDaaS solution to be tested with three different modes, developed for expert and non-expert users. Data anonymization was used in this process to ensure data privacy and the preservation of sensitive information while delivering the desired data analytics and meta-knowledge. The advanced visualisation tools provided a simple and intuitive design, showing the qualities of the underlying data analytics and meta-information and delivering the expected added value to the platform.
• Operability: The I-BiDaaS solution can be integrated into a real business setting, taking into account the specific internal requirements that have to be evaluated on a case-by-case basis. The different use cases tested within the I-BiDaaS platform show that it is not only high-performing but also scalable, while guaranteeing security and privacy.
• Robustness: The same use cases were tested with several datasets and the results were valid in all tests. Furthermore, several iterations were performed and they produced the same results.
• Innovation: The main innovation lies in the capacity to easily empower both non-expert and expert big data practitioners, belonging to a very diverse use-case landscape, supported by high-level algorithmic solutions. The I-BiDaaS platform provides not only the end-user mode but also an expert mode that enables the data analyst to prepare the dataset directly inside the same platform in the cloud, leveraging advanced visualisation approaches and dashboards that harness the power of multiple heterogeneous sources and Big Data analytics. This facilitates the ability to acquire, understand, process and visualize data, and to extract value from them.
• Compliance: The relevant security and privacy regulations are enforced internally, during the selection and pre-processing stages, when data are anonymised and tokenised. In this way, data providers manage information internally, and data are made readily available without requiring further security or anonymization steps to be implemented by the I-BiDaaS platform.
• Privacy awareness: All relevant security and privacy requirements pertaining to the in-house access of proprietary data were met and the necessary practices were applied (e.g., data anonymization, data aggregation, encryption, etc.).
• Cost: The I-BiDaaS platform reduces both infrastructure and personnel costs. In the first case, it obtains improved results and state-of-the-art performance using fewer hardware resources, thus cutting down on unnecessary and costly investments and avoiding the maintenance of expensive infrastructure. In the second case, it replaces certain manual, labour-intensive and costly practices with automated, efficient and scalable technologies.

Impact Analysis
I-BiDaaS is delivering a full array of big data business analytics solutions for real and synthetic data for companies in the telecommunication, finance and manufacturing domains. These solutions are more accessible, cost-effective and employee-empowering than existing ones, giving companies the opportunity to deploy Big Data self-service solutions across the organisation, from consumer-facing employees with little IT experience or expertise to top management, and helping companies optimize decision-making at the tactical, operational and strategic levels.
To ensure that the I-BiDaaS project meets its ambitious objectives and achieves the expected impacts, it uses a three-stage impact assessment model that realises the project's business case, monitors progress, raises issues and helps inform operational decisions. The impact assessments occur at: a) the Project Level, to ensure Project Partners deliver the required outputs to test the business cases; b) the Pilot Level, with involved local and national stakeholders, to produce outcomes that test and refine the value proposition and improve the business case for I-BiDaaS; and c) the European Level, encompassing wider society, to aggregate and spread the social and economic benefits that result from the business case.
In this section, the results of the Experimentation & Evaluation Phase (M19-M32) of the project are discussed, along with an analysis with respect to the expected project-level innovation and achievements. Moreover, the implemented or prospective activities aiming to demonstrate the I-BiDaaS solution and involve external users in the evaluation process are described.

Progress Report
After the end of M18 and the successful completion of the Innovation phase, the I-BiDaaS project entered the Experimentation and Evaluation phase, in which all functionalities developed during the previous period are implemented in 8 (eight) real-life industrial scenarios in the I-BiDaaS target domains of telecommunication, finance and manufacturing. During this phase, all the innovation development results accomplished under the technical WPs of the project (WP2-WP5) have been integrated, realising the 2nd version of the I-BiDaaS platform: an integrated framework that enables the effective extraction of meaningful knowledge from very large datasets from heterogeneous and multiple domains, as well as more effective and scalable data analytics and real-time complex event processing, supporting low-level employees and decision makers with advanced visualization capabilities.
The Experimentation phase is being supported by various events organized by I-BiDaaS, such as CAIXA's workshop and the Big Data Pilot Demo Days series, in which 3 (three) dedicated webinars were organized under the BDV PPP Summit 2020. The I-BiDaaS solution has also been demonstrated at events such as the BDV PPP Summit 2019 in Riga and the EBDVF 2019 in Helsinki. There, the experimentation is supported through the involvement of external entities in the evaluation of the I-BiDaaS solution and the collection of their feedback. In all the aforementioned events, the main goal is to promote I-BiDaaS tools and technologies to software developers, big data experts, data analysts, decision makers, non-IT end users, etc., and to receive valuable feedback for further improvement. More events are planned until the end of the project (e.g., TID's hackathon and a closing event to celebrate all I-BiDaaS achievements). The output of this phase will feed the Consolidation Phase that follows. The open-source code is available on GitHub. Since the number of downloads/clones of the GitHub repository is not provided by GitHub, we measure the popularity of the Tools section instead (clicked links to the knowledge database and to the proprietary tools of the project). The number of these events (clicked links) is expected to reach the KPI threshold due to the release of the 2nd and final version of the tools developed within the project, and due to the continuous dissemination efforts until the end of the project.

Dissemination & Communication KPIs
KPI-DC-1
Although this KPI is covered, the project will continue its dissemination efforts to increase the downloads of its material. The detailed list of conferences & workshops attended by I-BiDaaS partners can be found in D7.3 [21] for 2018 and in D7.5 [22] for 2019. For 2020, the list of events will be reported in D7.7 (to be submitted at M36).

KPI
The I-BiDaaS Consortium invests in events targeted at industry and academia to showcase the I-BiDaaS vision, impact and results, and to create an active community for the project that will significantly facilitate its entrance to the market.

KPI-DC-4
For this reporting period, 64% of the publications have an impact factor or ERA classification.

KPI-DC-5
During the reporting period, 3 (three) journal articles have been accepted and published. Thus, the I-BiDaaS consortium has achieved 33.3% gold open access for the journal articles linked to I-BiDaaS scientific results, scoring higher than expected for this KPI.

KPI-DC-6
Participation in BDVA, which is driving big data standardization and interoperability priorities and is connected with Big Data standards related to Big Data PPP projects. Participation in, and active collaboration with, DataBench, which is designing performance benchmarking processes for Big Data; DataBench is expected to set the standards and benchmarks for the emerging Big Data ecosystem.

Impact in research community and contribution to innovation capacity:
The I-BiDaaS project has achieved a significant impact in the research community with 3 journal publications, 17 conference papers and 2 poster publications, more than 60% of which have an impact factor or ERA classification. Moreover, an important achievement for the project is the recognition by the EU Innovation Radar of 5 (five) innovations developed under I-BiDaaS as Excellent Innovations. The Innovation Radar is a European Commission initiative to identify high-potential innovations and innovators in EU-funded research and innovation framework programmes. The full list of the accepted innovations is given in Table 37.

Impact in Data Market and the Big Data Economy:
I-BiDaaS is developing a Big Data as a Self-Service solution to provide a significant boost to the finance, manufacturing and telecommunication sectors. For the financial and telecommunication sectors, I-BiDaaS has offered its tools and services for the development of 6 (six) different use cases (3 for CAIXA and 3 for TID), making it possible for CAIXA and TID to exploit their big data efficiently and therefore increase their market share and the services provided to their customers. For the manufacturing sector, I-BiDaaS has offered its tools and services for the development of two different use cases, making it possible for CRF to exploit big data even more easily and at scale.

Feedback from external stakeholders
In this section, we present the external stakeholders' feedback received during the project period M18-M32; for earlier results, the reader may refer to D6.1 [19] and D6.3 [2]. In particular, we discuss the following important events at which external feedback was collected: the Big Data Pilot Demo Days, the CAIXA workshop, and the webinar 'Virtual BenchLearning - Assessing the Performance and Impact of Big Data, Analytics and AI' organized by the DataBench project.
Big Data Pilot Demo Days. Three dedicated webinars were organized, one per sector; the webinar dedicated to the Manufacturing Sector was held on July 9, 2020. The main goal of the webinars was to demonstrate, in a step-by-step fashion, the I-BiDaaS solution in the three sectors and to receive feedback from the participants. At each of the three events, the audience was asked to respond to four short questions, aiming to investigate the background of the attendees and adjust the nature of the webinars accordingly. The questions were: 1) to which stakeholder type they belong; 2) whether they work with Big Data; 3) whether they are interested in Big Data technologies to optimize customer experience; and 4) what is the main barrier preventing the adoption of Big Data analytics technologies in their organization. The results of the questions for each of the three events are shown in Figure 45. In addition, after each event, a more detailed questionnaire designed with Microsoft Forms was provided to the attendees. The respective questionnaires (CaixaBank, Telefonica, CRF) can be found in the Appendix. An example of the results is shown in Figure 46. More details about the three webinars can be found on the I-BiDaaS website, in the news & events section.
Virtual BenchLearning - Assessing the Performance and Impact of Big Data, Analytics and AI. On July 7, 2020, I-BiDaaS participated in this webinar organized by the DataBench project. The webinar described a framework and tools to assess the performance and impact of Big Data and AI technologies by providing real insights coming from DataBench. I-BiDaaS participated through a presentation of its current benchmarking approach, landscape and needs, from both the technological and business perspectives. The main goal of our participation is to explore, during the final 6-month period of the project, the possibilities to harness the collaboration and benchmarking tools provided by the DataBench project.

Exploitation and potential commercialization
According to Statista's IT Market Model, spending in the global IT services market will reach the 853 billion U.S. dollar mark by 2021, up from 737 billion U.S. dollars in 2017. Profitability and cost reduction are some of the expected impacts of I-BiDaaS, providing companies with the competitive advantage they need towards a thriving data-driven EU economy. The nature of the project requires designing new and innovative businesses that often include complex interconnections and interoperable dimensions.
The use of the Dynamic Business Modelling (DBM) tool and the exploitation workshop (see D7.6 [23]) enabled the definition of the exploitation strategy for each participating partner, as well as for the whole consortium, maximizing the exploitation opportunities for individual partners and the sustainability of the tools in the long-term, beyond the lifespan of the project. Moreover, individual exploitation plans enabled the identification of potential I-BiDaaS products linked to actual market needs, given their capability to address different stakeholders in the market.
Five different joint business models were developed to adapt each of the solutions and business processes to the targeted markets and clients, including Non-IT SMEs, Large companies, Academics, and Data harvesting companies. The business models will be re-analysed in the last months of the project for the design of a sound business plan that ensures the long-term sustainability and potential commercial viability of the solution. Furthermore, the joint exploitation plan will also include a detailed profiling of the partners (including academics) to develop a competency profile for the whole consortium and allow the export of 'the I-BiDaaS innovation ecosystem' to third-party companies.
The design of the I-BiDaaS solution allows a given tool to be decoupled from the rest, making the platform flexible and modular; these factors will impact the pricing model. Therefore, a methodology for the pricing model was designed, enabling the identification of a suitable licensing strategy and possible collaborations that will allow partnership agreements to be established. Business planning activities will be analysed for the elaboration of any Intellectual Property Rights (IPR) aspects and revenue-sharing models, including an analysis of the potential revenue streams to validate the Return on Investment (RoI) plans.
Finally, to achieve long-term sustainability, I-BiDaaS will leverage the opportunity and ambition of the EU to become a global leader in the acceleration towards digital transformation, as well as Europe's goal of becoming a circular industry, through the incorporation of circular business models.

Conclusion
This deliverable reports on the experimentation phases of the industrial experiments carried out within the I-BiDaaS project. It provides a detailed description of each experiment in terms of the datasets used and the experimental workflow. Pilot demos have been developed for each use case and can be accessed via the I-BiDaaS platform, integrated for expert and non-expert users. Furthermore, for each industrial sector, the end-users have been described and it has been shown how the I-BiDaaS solution can be easily utilised.
All of the work presented is aligned with the experimental protocol described in D6.3 [2] and was revised during the implementation of each experiment to ensure that the designed experiments validate both the business and technical requirements of the I-BiDaaS platform and associated technologies.
In the final section of this deliverable, we presented the benefits of the participatory evaluation, which involved the consortium and external stakeholders in evaluating key results and defining what constitutes success. Furthermore, for each experiment, an impact analysis has been carried out, focusing on the usability, operability, innovation, robustness, privacy awareness and cost of the I-BiDaaS solution. The final results will be reported in the next deliverable, D6.5 'Assessment report and impact analysis' (M36).