Real-Time Integration of Building Energy Data

An Energy Management System (EMS) is a monitoring tool that tracks building energy consumption with the purpose of enhancing energy efficiency by identifying savings opportunities and situations of misuse. To achieve this, EMSs collect data flows (data streams) from a network of energy meters and sensors, which are then combined into useful information. Data must be processed in real time to support a timely decision-making process. Traditionally, EMSs use Database Management Systems (DBMSs) to process data. This introduces a persistence step that leads to unacceptable latency in data evaluation, and DBMSs do not properly support many types of time-series queries. This work explores the feasibility of using Data Stream Management Systems (DSMSs) to support Energy Management applications, pointing out how to implement an EMS capable of real-time data processing.


I. Introduction
Buildings account for 40% of energy consumption, ahead of other sectors such as industry or transportation [1]. Therefore, small improvements in building energy consumption translate to major savings. Among other ways, energy efficiency in buildings can be achieved through Intelligent Energy Management [2]. This topic refers to monitoring energy consumption and carefully tracing energy usage, enabling building managers to identify saving opportunities. EMSs continuously monitor the energy consumed in buildings. Consumption-related data is evaluated according to several variables, such as time, areas and their occupation, equipment state, and expected consumption. This evaluation determines the building's energy usage patterns and provides the information required to decide on adjustments that improve energy usage [3].
One fundamental aspect of energy management is timeliness: faster decisions translate to less waste and larger savings. In other words, timely information greatly improves the decision-making process [4], since building managers are able to immediately diagnose and promptly respond to anomalous situations. EMSs are real-time decision-making applications that require (near) real-time integration of huge quantities of data, wherein each record relates to a very short period [5]. This leads to Big Data challenges: (i) achieving low latency in query evaluation, even during periods of massive workload, in order to maximize throughput; (ii) identifying complex event patterns to extract meaningful information from data streams, and correlating several data streams that come from different sensors; and (iii) coping with the extension of the EMS concept to an entire city (smart cities), which would greatly increase the volume of gathered data that needs to be evaluated in a timely manner.
Existing EMSs are not, however, prepared to provide information in real time (see Section II): neither their functionalities nor their software architecture are conceived for a truly real-time data processing system. Indeed, their persistent data model, which arises from the usage of a DBMS, is not suited to extracting relevant information from collected data in a timely fashion. Let us consider the following queries:
Q1: Which periods, in the past 8 hours, had energy consumption 20% higher than the average consumption of those 8 hours?
Q2: State the cost of the energy consumed in the last 12 hours per zone/equipment.
Q3: List the zones that are consuming more than last year.
Q4: List the equipment that is consuming inefficiently.
Q5: List the zones whose consumption is unexpected given their occupation.
Q6: List the spaces where CO2 has increased after energy consumption has decreased.
The evaluation of queries such as those presented above raises a number of requirements that are not easily addressed by DBMSs. For example, in query Q1 (see Fig. 1): (i) the window covering the last 8 hours is always changing, therefore the query must be continuously evaluated; (ii) input data takes the form of a potentially unbounded data stream, which means that the AVG operator cannot block and must be evaluated online; (iii) whenever AVG changes, the set of periods must be recalculated, meaning that the last 8 hours of data must stay in memory to reduce the latency of fetching data. Traditional DBMSs, typically used to handle EMS data, are not conceived to properly handle these classes of queries. In a traditional DBMS, all data must first be persisted in the database, and only then can the offline query be fully evaluated. This introduces a latency that is unacceptable for many data streaming applications and violates (iii). DBMSs are designed to run one-time queries against a fixed and finite dataset; they are not optimized to run the same query continuously over a time-varying and possibly unbounded data stream, which should be processed continuously through online queries. Fig. 2 illustrates how a DBMS-based solution processes a data stream.
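The three requirements that Q1 raises can be illustrated with a minimal Python sketch of an online, windowed AVG. The class name `SlidingAverage`, the `(timestamp, kwh)` reading format, and the sample values are assumptions made for illustration; they do not come from any EMS or DSMS discussed in this paper.

```python
from collections import deque


class SlidingAverage:
    """Online average over a time-based sliding window (hypothetical helper)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.readings = deque()  # (timestamp, kwh) pairs currently inside the window
        self.total = 0.0         # running sum, so AVG never rescans the whole window

    def push(self, timestamp, kwh):
        """Add one reading, expire old ones, return timestamps above 120% of AVG."""
        self.readings.append((timestamp, kwh))
        self.total += kwh
        # Requirement (i): the window moves with every new tuple.
        while self.readings[0][0] <= timestamp - self.window_seconds:
            _, expired = self.readings.popleft()
            self.total -= expired
        # Requirement (ii): AVG is updated incrementally and never blocks.
        average = self.total / len(self.readings)
        # Requirement (iii): the window stays in memory, so periods are recomputed cheaply.
        return [ts for ts, value in self.readings if value > 1.2 * average]


# Hypothetical hourly readings fed into an 8-hour window.
window = SlidingAverage(8 * 3600)
for hour, kwh in enumerate([10, 10, 10, 22]):
    flagged = window.push(hour * 3600, kwh)
```

Each call to `push` yields an up-to-date answer to Q1 without persisting anything, which is the behaviour the one-time query model of a DBMS does not provide.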
Some monitoring applications achieve better results by using a DSMS, such as the monitoring of stock market transactions [5], network traffic [6], [7], [8], and health sensor data [9], [10]. This suggests that an EMS based on a DSMS would perform better than one based on a DBMS. DSMSs continuously process arriving data without having to persist it, which speeds up the data evaluation process and produces more timely results. Moreover, DSMS query languages are more expressive because they are better suited to the real-time data processing domain, which simplifies the development of data stream applications.
The identification of patterns in an event stream is also an important feature that many DSMSs provide through the Complex Event Processing (CEP) capabilities of their languages. For instance, Q6 makes it possible to assess how energy savings affect the quality of HVAC ventilation.
Fig. 3 illustrates this paradigm shift in the data processing infrastructure [11], [12], [13], where DSMSs run online queries against continuous and unbounded data streams, greatly reducing data processing latency.
This work aims to present and support the hypothesis that, for an EMS, a DSMS is a more effective data processing infrastructure than a DBMS.

II. Energy Management Systems
To understand how an EMS should be modified in order to monitor building energy consumption in real time, we point out a general architecture (see Fig. 4) based on the system goals and functionalities [14]. From the architectural layers, we identify three dimensions that may increase the latency between data gathering and the presentation of the produced information.
Data Presentation. Dashboards are an effective tool for monitoring applications to display their information, and their refresh rate is a very important issue. Some dashboards can refresh at the same rate as their data sources, that is, their displays are dynamically updated as soon as new information becomes available. Others only refresh periodically, i.e., they poll their data sources regularly to fetch updated data [15].
Data Processing. Data processing comprises the integration and evaluation of collected data. The integration process may have to consider both static and dynamic data. Static data rarely changes, does not have to be processed on a regular basis, and is related to building characteristics (room areas, equipment by room, energy tariffs, etc.). Dynamic data is constantly being updated from several sources, and EMSs have to retrieve and process those updates. This greatly increases the overhead of the data processing step and makes real-time results harder to achieve.
Data Acquisition. There are three different types of data sources: energy meters, environmental sensors, and equipment status sensors. Depending on the device, the gathering of data may be based on a polling or an event-driven approach. In the first case, the EMS pulls the data from the sensors by querying them periodically. Typically, EMSs allow the adjustment of this time period. Since an EMS has to explicitly check all devices (one by one), the total elapsed time may be too large for the system to respond in real time. In event-driven approaches, devices are responsible for sending data to the EMS whenever a new value is measured. By avoiding the polling time, the second approach may achieve better results than the first one in collecting data in a timely manner.
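The cost of sequential polling can be made concrete with a short worked calculation in Python. The device counts and the per-device query time below are hypothetical figures chosen for illustration; the 15-minute bound is the least demanding real-time deadline considered in this work.

```python
def polling_cycle_seconds(device_count, seconds_per_device):
    """Time needed to query every device once, one by one."""
    return device_count * seconds_per_device


def meets_deadline(device_count, seconds_per_device, deadline_minutes=15):
    """True if a full polling cycle fits inside the real-time deadline."""
    return polling_cycle_seconds(device_count, seconds_per_device) <= deadline_minutes * 60


# Hypothetical deployment: each query takes 0.5 s on average.
small_site = meets_deadline(1000, 0.5)   # cycle of 500 s
large_site = meets_deadline(2000, 0.5)   # cycle of 1000 s
```

Under these assumed figures, doubling the number of devices is enough to push a full polling cycle past the deadline, whereas in an event-driven approach the acquisition delay does not depend on the number of devices in this way.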
In an EMS with real-time monitoring capabilities, the three dimensions stated above must be taken into account. Table I states, for each dimension, the deadlines (in minutes) that must be met for each level of real-time [16], [17], [3]. The 15-minute deadline is the least demanding requirement that must be met in order to consider that each of those dimensions is responding in real time. Note that the scope of our work lies in the Data Processing and Data Presentation layers; real-time Data Acquisition is beyond the scope of our research.

Fig. 3: Data stream processing using a DSMS. Data streams generated from several sources are directly processed by the DSMS, greatly decreasing the latency of data processing. Data streams may be persisted (for computations that require past data) without affecting the evaluation of ongoing continuous queries.
We also evaluate a representative sample of existing EMS solutions to identify their features and their ability to respond in real time, correlating this with their data processing infrastructure. Given that many of the existing solutions are proprietary, this kind of survey is often limited by both license restrictions and available documentation. This is even more so when, as in several cases, the documentation omits many technical details and claims real-time capabilities without clarifying the time scale involved. Hence, from the several studied solutions, we consider only those that give at least minimal insight into the details discussed above.
In Table II, we identify the general features of four analysed solutions: EEMSuite [18], EnergyWitness [19], EnerwiseEM [20], and OpenEIS [21]. Among these, only EnergyWitness can update the results of some of its functions in a timely enough manner to be considered real-time (as defined in Table I). We note, however, that for some functions it can only produce new results with periods of up to fifteen minutes, hence our classification as near real-time. The remaining three solutions update their results hourly, therefore we do not consider them truly real-time.
All four solutions rely on a DBMS. Moreover, all the literature we reviewed [14], [16], [17] points to the usage of a DBMS as the standard way to manage, process, and query collected data in EMSs. Therefore, to the best of our knowledge, there is no EMS solution based on a DSMS approach.

III. Data Stream Management Systems
There is a huge number of stream-based applications that have to process high volumes of data in a timely manner, pushing the capabilities of current data processing systems to the limit. EMSs are one of those applications. To cope with those requirements, stream-based systems should provide [11]:
Real-time response. Process high volumes of data under real-time requirements, which is only possible on a system designed and optimized for this specific purpose. Those systems should be able to process data on demand in a timely manner.
High-level language. Support a language able to effectively express queries over data streams, and capable of reporting complex relationships among data stream tuples.
Scalability. Spread the workload, transparently and automatically, across available resources.
Tolerance to faulty streams. Deal with delayed, lost, or malformed tuples. Among other things, this is essential to properly evaluate blocking operators.
Deterministic response. Produce sound results for equivalent inputs, which is essential for fault tolerance and recovery capabilities.
Integration. Integrate stored data with live data streams, using a uniform language and without the need for manual intervention.
Availability. Preserve the integrity of data, being strongly resilient to failures.

A. Database Management Systems Approaches
The inability of DBMSs to fulfil the previous requirements and to support data stream applications is widely recognized [11], [22]. DBMSs only process data after it has been stored and indexed, which introduces an unacceptable penalty on data processing latency. To mitigate this issue, some DBMS-based alternatives have emerged, although none of them solves it:
Main Memory Database Systems [23]. Since they store information in main memory instead of using secondary storage, they are far more efficient than traditional DBMSs. Yet they continue to use the same basic model (process-after-store), which is by nature incompatible with the timeliness requirement imposed by data stream applications.
Traditional DBMSs are also completely passive: they only present data when explicitly requested by a user or application, a model known as Human-Active, Database-Passive (HADP). The HADP model does not allow the system to spontaneously issue a notification whenever a predefined condition is fulfilled [22].
Active Database Systems [24], [25]. These attempt to solve the preceding issue through a mechanism of rules triggered by events. However, those triggers scale poorly, leading to a large impact on system performance [26].
Another issue with DBMSs is related to the query language: SQL is not prepared to operate on potentially unbounded time series. The one-time query model of DBMSs is unsuitable for the class of queries related to stream-based applications. In particular, blocking operators face serious problems in producing an answer in the presence of an unbounded data stream, and SQL lacks the expressiveness to specify complex and interesting time-related conditions [27], [28]. A query language specifically designed for data stream applications is urgently needed.
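The problem with blocking operators, and the windowing that stream-oriented languages use to avoid it, can be sketched in Python. This is a minimal illustration under stated assumptions: the function names, the 60-second window, and the synthetic source are hypothetical, and tuples are assumed to arrive in timestamp order.

```python
import itertools


def tumbling_average(stream, window_seconds):
    """Yield (window_start, average) as soon as each window closes.

    A plain AVG over the whole stream would block forever on an unbounded
    input; bounding it by a window lets results be emitted continuously.
    """
    current_start = None
    total = 0.0
    count = 0
    for timestamp, value in stream:
        start = timestamp - (timestamp % window_seconds)
        if current_start is None:
            current_start = start
        if start != current_start:
            # First tuple of a new window: the previous window is complete.
            yield current_start, total / count
            current_start, total, count = start, 0.0, 0
        total += value
        count += 1


def endless_meter():
    """Hypothetical unbounded source: one reading every 30 seconds."""
    for i in itertools.count():
        yield i * 30, float(i % 4)


# Results are available although the source never terminates.
first_windows = list(itertools.islice(tumbling_average(endless_meter(), 60), 3))
```

The consumer obtains one result per closed window and can stop at any time, which is the continuous-query behaviour that a one-time SQL query over a finite table cannot express.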

B. Stream Processing Engine Approaches
Taking into account the requirements imposed by stream-based applications, a new class of systems has emerged: Stream Processing Engines (SPEs) [11], [29], [30]. Although these systems share a common target, they differ greatly in their architectures, data models, processing mechanisms, and query languages [31]. Among the many proposed systems, two major approaches emerge [22]:
Data Stream Management Systems [29]. DSMSs are a type of SPE designed to process several input data streams, from a wide range of sources, and to produce new output streams as a result. DSMSs can be seen as an evolution of DBMSs to properly support data stream processing, with many DSMSs (TelegraphCQ [32], NiagaraCQ [33], and Cougar [34]) having been developed from existing DBMSs [35].
Complex Event Processing Systems [29]. CEPSs aim to process streams of events (facts that happened) in order to draw conclusions from them, i.e., to identify patterns in the sequence of events and produce more complex events (events with a higher semantic level), which indicate more complicated circumstances. The main goal of those systems is to recognize more meaningful situations from less complex ones, and to respond to them as soon as possible.
These two classes of systems are also distinguishable by their different language models. DSMSs use a declarative (e.g., CQL [36]) or imperative (e.g., SQuAL [37]) language model, whereas CEPSs use a pattern-based model (e.g., CEL [38]). Declarative languages (like SQL) specify the expected results of the query evaluation, while imperative languages explicitly determine the sequence of transformations to be applied to data streams. Pattern-based languages are defined as a set of rules, where each rule is composed of a precondition that, when satisfied by a data stream pattern, triggers a specific action.
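The rule structure of pattern-based languages (a precondition over a sequence of events that triggers an action) can be sketched in Python for query Q6. This is an illustrative sketch only: the event format `(space, kind, value)`, the function name, and the sample events are assumptions, and the code does not reproduce the syntax or semantics of CEL or of any other language cited here.

```python
def detect_co2_after_energy_drop(events):
    """Return the spaces where a CO2 increase followed an energy decrease.

    `events` is an iterable of (space, kind, value) tuples in arrival order,
    where kind is either "energy" or "co2" (hypothetical event format).
    """
    last_value = {}          # (space, kind) -> most recent value seen
    energy_dropped = set()   # spaces whose precondition is half satisfied
    matches = []
    for space, kind, value in events:
        previous = last_value.get((space, kind))
        last_value[(space, kind)] = value
        if previous is None:
            continue
        if kind == "energy" and value < previous:
            # First half of the pattern: energy consumption decreased.
            energy_dropped.add(space)
        elif kind == "co2" and value > previous and space in energy_dropped:
            # Pattern complete: the action here is simply to report the space.
            matches.append(space)
            energy_dropped.discard(space)
    return matches


sample_events = [
    ("lab", "energy", 5.0), ("lab", "co2", 400),
    ("office", "energy", 3.0), ("office", "co2", 420),
    ("lab", "energy", 4.0),      # energy drops in the lab
    ("office", "co2", 450),      # CO2 rises, but no energy drop in the office
    ("lab", "co2", 430),         # CO2 rises after the drop: pattern matched
]
flagged_spaces = detect_co2_after_energy_drop(sample_events)
```

In a pattern-based language the bookkeeping done by hand above (remembering the half-matched state per space) is what the engine would derive from the rule definition.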

Understanding the differences between DSMSs and CEPSs is crucial because: (i) each system is designed to properly process just one type of stream, data or event streams, a characteristic that identifies first-generation SPEs; (ii) there are misunderstandings concerning SPE concepts, hindering the cooperation within the community that is required to advance the state of the art; (iii) an ideal SPE should be able to process both stream types; and (iv) the main difference between first- and second-generation SPEs is the capacity of the latter to process both data and event streams, which demonstrates the current maturity of these systems.

C. Survey of existing DSMSs
Table III lists the DSMS solutions that we are considering in our work. Language model expressiveness is our most critical concern. The relation between language expressiveness and operators is not linear, since some operators can be built by combining other operators [22]. However, for the sake of simplicity in rule implementation, we only consider that a given system provides a given operator semantics if it implements the operator explicitly, or if the semantics can be trivially mimicked by the provided operators. The deployment model (distributed or centralized) is tightly related to system scalability, reflecting how performance is affected under massive workload scenarios. The ability to integrate the system into new applications, or to extend its functionalities and APIs, is reflected in the Open Source column.

D. Discussion
State-of-the-art stream processing systems, such as Esper, Storm, and S4, are conceived to process both data streams and complex events. Furthermore, they can be deployed as part of fully distributed, cloud-based solutions, allowing the construction of highly scalable systems. As we discussed, both the time model and the language model have a strong impact on the implementation of operators, and consequently on their semantics, as well as on the ability of systems to detect and identify patterns in streams. Let us now focus on some other relevant details related to our work.
Load Shedding is the capability of a system to drop some tuples in situations of overload, in order to fulfil the requirement of timely data processing. Each dropped tuple implies a degradation in the quality of the produced results. Purely event stream processing systems do not support any Load Shedding technique, while some of the analysed data stream systems do support it, such as STREAM, NiagaraCQ, Borealis, and TelegraphCQ. Systems oriented towards pattern matching rules do not tolerate approximate answers. Note in particular that dropping the "wrong" tuple may invalidate the detection of a whole pattern [13, Chapter 7].
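The trade-off behind Load Shedding can be illustrated with a minimal Python sketch. The policy shown (dropping a uniformly random subset once a batch exceeds a capacity) is an assumption chosen for simplicity; it is not the policy implemented by STREAM, NiagaraCQ, Borealis, or TelegraphCQ, and the function name and capacity are hypothetical.

```python
import random


def shed_load(tuples, capacity, rng=None):
    """Return at most `capacity` tuples, dropping a random subset when overloaded.

    Arrival order is preserved among the tuples that are kept.
    """
    if len(tuples) <= capacity:
        return list(tuples)          # no overload: nothing is dropped
    rng = rng or random.Random()
    kept_indexes = sorted(rng.sample(range(len(tuples)), capacity))
    return [tuples[i] for i in kept_indexes]


# Hypothetical overload: 100 readings arrive but only 60 can be processed in time.
batch = list(range(100))
processed = shed_load(batch, 60, random.Random(0))
```

An aggregate such as an average computed over `processed` is only an approximation of the one computed over `batch`, which is acceptable for many data stream queries but not for pattern matching, where any of the dropped tuples might have been part of a pattern.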
For the most recent systems, it is also relevant to understand how their programming model may influence our work. Storm relies on an imperative programming style, in the sense that it is the user who explicitly implements the graph of transformations to be applied to the data stream. Hence Storm is not a pure query engine: there is no query optimizer to automatically produce an optimal transformation graph. The S4 programming model is somewhat different from the Storm model, but it also lacks an optimizer to generate an optimal sequence of transformations. On the other hand, Esper provides a declarative query language that allows the user to specify which transformations should be applied to the data, instead of specifying how the data must be transformed. Note that we usually rely on a graph of transformations that the stream flows through. Each graph node represents a step in a pipeline of transformations that processes the data stream in order to extract the desired information. Given our requirement for real-time data processing, it is of utmost importance to formulate optimal sequences of transformations. In Esper, the graph of transformations is not explicitly defined by the user; instead, it is automatically produced and optimized by the engine compiler, providing a transparent decoupling between the logical level (query specification) and the physical level (query execution). This gives Esper a major advantage over Storm and S4, where the difficulty of implementing and optimizing a graph grows rapidly with query complexity. Although Esper hides the graph complexity from the user, it does not prevent the explicit specification of a graph, which can be done with a pipeline of continuous queries; it just promotes the declarative implementation of the node (query) functionality.

IV. Position Statement
Our envisioned solution aims at creating a data processing architecture able to integrate energy-related data in real time. Existing architectures, often used to prepare data to populate a given data warehouse, process data through a sequence of transformations in a batch manner, which prevents timely data processing [39], [40], [41], [42], [43]. The proposed solution adapts this concept of a pipeline of data transformations to process data continuously (in stream) with the lowest possible latency, enabling the freshness of data to be measured in minutes or seconds. Therefore, we believe that a real-time data processing architecture blueprint supported by a DSMS (see Fig. 5) is the adequate choice for processing data in an EMS.

A. Architecture Overview
A DSMS should process continuous queries to integrate streaming energy-related data with the benefits of in-memory processing, eliminating the latency and overhead penalties typical of traditional batch processing. A detailed explanation of a possible solution architecture is as follows.
Data Processing Tier is the core component of the solution. Conceptually, it works like a pipeline of data transformations, where data is manipulated according to the requirements of the data stream application. The data transformation flow is structured in stages using the types of components detailed below.
Adapter mediates the introduction of data extracted from several sources into the data transformation process. The adapter understands the source delivery model (push or pull based) and, in order to speed up data propagation, pushes data into the remaining components. The adapter also handles bursts of data produced by each source, by holding data and delivering it into the system at a steadier rate. Adapters additionally perform a set of data validation and transformation steps. The quality of gathered data is assessed in order to identify and discard faulty tuples that may hamper the process. This is necessary to tolerate faulty streams generated by faulty equipment or by problems in the transmission process. The adapter also normalizes, into a common schema, the distinct data stream schemas that come from different sensors. For instance, energy meters provided by distinct suppliers may rely on different schemas to provide equivalent data. When a data source is a database, the adapter ensures the required connection and communication, converting the retrieved set of tuples into a stream of tuples. The role of adapters is critical to the effectiveness of the whole data transformation process: they have to be finely tuned, bringing into the "pipeline" only strictly necessary data, pre-processed in the most convenient way for the remaining transformations [40].
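The validation and normalization duties of an adapter can be sketched in Python as follows. The two vendor formats, their field names, and the common schema (`meter`, `timestamp`, `kwh`) are all hypothetical and serve only to illustrate the idea; a real adapter would depend on the actual devices.

```python
def normalize(raw):
    """Map a vendor-specific reading to the common schema, or return None if faulty."""
    try:
        if "Wh" in raw:
            # Hypothetical vendor A: fields 'id', 'ts', and energy in watt-hours.
            reading = {"meter": raw["id"], "timestamp": raw["ts"],
                       "kwh": float(raw["Wh"]) / 1000.0}
        else:
            # Hypothetical vendor B: fields 'meter_id', 'time', energy in kilowatt-hours.
            reading = {"meter": raw["meter_id"], "timestamp": raw["time"],
                       "kwh": float(raw["kWh"])}
    except (KeyError, TypeError, ValueError):
        return None                  # malformed tuple: missing or non-numeric field
    if reading["kwh"] < 0:
        return None                  # implausible value, e.g. from faulty equipment
    return reading


def adapter(source):
    """Push only valid, normalized tuples into the rest of the pipeline."""
    for raw in source:
        reading = normalize(raw)
        if reading is not None:
            yield reading


raw_stream = [
    {"id": "m1", "ts": 0, "Wh": 1500},
    {"meter_id": "m2", "time": 0, "kWh": "2.5"},
    {"meter_id": "m3", "time": 0},            # missing measurement
    {"id": "m4", "ts": 0, "Wh": -10},         # negative consumption
]
clean_stream = list(adapter(raw_stream))
```

Because the adapter is a generator, downstream components receive each normalized tuple as soon as it is available, in line with the push-based propagation described above.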
Data Integration represents the core functionality of the data transformation process, which consists of data integration and cleaning steps. The main purpose is to combine several data streams in order to formulate a new set of data streams, following well-defined and suitable schemas that better fit the problem domain. Such integrated schemas will be used as input for domain queries. However, the integration of several streams is far from trivial and raises several data quality issues. For instance, some data cleaning may be required in order to ensure data consolidation and consistency. To solve such issues, this component must be able to merge data from multiple sources (e.g., sensor networks and databases), transform data under different schemas, recalculate and synthesize attributes, specify default values, calculate new attributes, etc. For scalability purposes, each integrated schema must be independent, so that each one can be placed on a single cluster node when deployed in a distributed setting.
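A minimal Python sketch of this integration step is given below, assuming that each input stream is already ordered by timestamp. The static table (meter to zone and area), the derived attribute `kwh_per_m2`, and all names are hypothetical and chosen only to illustrate merging, enrichment with static data, and cleaning.

```python
import heapq


def merge_by_time(*streams):
    """Merge several timestamp-ordered streams into one ordered stream."""
    return heapq.merge(*streams, key=lambda reading: reading["timestamp"])


def integrate(readings, static_data):
    """Join each dynamic reading with static building data and derive new attributes."""
    for reading in readings:
        info = static_data.get(reading["meter"])
        if info is None:
            continue                 # cleaning step: unknown meter, tuple discarded
        yield {
            "zone": info["zone"],
            "timestamp": reading["timestamp"],
            "kwh": reading["kwh"],
            "kwh_per_m2": reading["kwh"] / info["area_m2"],   # calculated attribute
        }


# Hypothetical static data and two hypothetical meter streams.
static_data = {
    "m1": {"zone": "lab", "area_m2": 50.0},
    "m2": {"zone": "office", "area_m2": 20.0},
}
stream_a = [{"meter": "m1", "timestamp": 0, "kwh": 25.0},
            {"meter": "m1", "timestamp": 20, "kwh": 50.0}]
stream_b = [{"meter": "m2", "timestamp": 10, "kwh": 10.0},
            {"meter": "mX", "timestamp": 15, "kwh": 1.0}]
integrated = list(integrate(merge_by_time(stream_a, stream_b), static_data))
```

The output follows a single integrated schema keyed by zone, which is the kind of stream that the domain queries would then consume.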
Data Evaluation component supports the evaluation of application queries, including those that represent energy monitoring use-case scenarios. Those queries should be evaluated on the previously integrated data streams, which represent the data sources available to application queries. The evaluation of those queries produces the essential Key Performance Indicators (KPIs) that support the decision-making process. How easily and how quickly they can be evaluated depends on how suitable the data source schema produced by the Data Integration component is, and naturally the effectiveness of the decision-making process is highly dependent on the set of KPIs produced by the use-case queries. Therefore, these queries are the foundation of the data stream application. It is worth noting that use-case queries should be evaluated independently, with the data stream sources being duplicated whenever a query needs to access them. This decoupling of evaluation enables adding and removing queries without side effects, and also allows query evaluation to be deployed across several nodes in order to achieve better scalability.
Application Adapter converts query results into a format understood by the application (e.g., XML or JSON). For instance, the adapter should produce an output that can be consumed by the dashboard.
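The pairing of a use-case query with an application adapter can be sketched in Python for query Q2. This is a simplified snapshot evaluation over the tuples currently retained for the window, not a continuous query engine; the flat tariff, the field names, and the function names are assumptions made for illustration.

```python
import json
from collections import defaultdict


def cost_per_zone(readings, now, tariff_per_kwh, window_seconds=12 * 3600):
    """KPI for Q2: cost of the energy consumed per zone in the last 12 hours."""
    totals = defaultdict(float)
    for reading in readings:
        if now - window_seconds < reading["timestamp"] <= now:
            totals[reading["zone"]] += reading["kwh"] * tariff_per_kwh
    return dict(totals)


def to_dashboard_json(kpis):
    """Application adapter: serialize query results for the dashboard."""
    return json.dumps(kpis, sort_keys=True)


# Hypothetical integrated tuples and a hypothetical flat tariff of 0.5 per kWh.
integrated_readings = [
    {"zone": "lab", "timestamp": 1000, "kwh": 2.0},       # older than 12 hours
    {"zone": "lab", "timestamp": 50000, "kwh": 3.0},
    {"zone": "office", "timestamp": 50000, "kwh": 4.0},
]
kpis = cost_per_zone(integrated_readings, now=50000, tariff_per_kwh=0.5)
payload = to_dashboard_json(kpis)
```

Keeping the query and the adapter as separate functions mirrors the decoupling described above: the query can be replaced or removed without touching the output format.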
Data Queues hold excess data when the arrival rate of data stream tuples becomes higher than the processing capability of the receiving component. Queues will be added between the most critical components (e.g., Data Integration and Data Evaluation), namely those that may yield data at different rates because of the differing complexity of their data transformations. Besides their main purpose, queues may also, if necessary, perform some additional computation, for instance to impose a priority order on the delivery of tuples.
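A queue that buffers tuples and delivers them in priority order can be sketched in Python with the standard `heapq` module. The class name `PriorityBuffer` and the convention that lower numbers mean higher priority are assumptions made for this illustration.

```python
import heapq
import itertools


class PriorityBuffer:
    """Buffer between two pipeline components that delivers tuples by priority."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()   # tie-breaker: FIFO within one priority

    def put(self, item, priority=1):
        """Store a tuple; lower `priority` values are delivered first."""
        heapq.heappush(self._heap, (priority, next(self._arrival), item))

    def get(self):
        """Remove and return the most urgent tuple."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


# Hypothetical usage: an alarm tuple overtakes ordinary readings.
queue = PriorityBuffer()
queue.put("reading-1")
queue.put("overload-alarm", priority=0)
queue.put("reading-2")
delivery_order = [queue.get() for _ in range(len(queue))]
```

The arrival counter ensures that tuples with the same priority leave the queue in the order in which they arrived, so ordinary readings are not reordered.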
The loose coupling of architectural components allows them to be deployed in cloud computing environments (in a fully distributed way), greatly improving system scalability under huge workloads.

B. Implementation
As discussed before, taking into account the state of the art, we propose that an implementation of the solution should rely on Esper [44], a DSMS able to process Continuous Queries (CQs) over unbounded data streams. The architectural data transformation components (Adapters, Data Integration and Evaluation, and Application Adapter) should be implemented as a composition of CQs that forms a graph of data transformations; see Fig. 6 (a). Those CQs are expressed in a declarative language and transparently compiled into an optimal query execution plan; see Fig. 6 (b). Esper would thus be used as a key building block of the Data Processing Tier. The EMS Data Presentation Tier should also be adapted in order to present data in real time. For this purpose, Graphite, a real-time graphing tool that renders graphs from time-series data, may be used to build a real-time energy dashboard.
Queries Q1-Q6 should guide the design of the integrated schemas in the Data Integration component. Those queries derive from a class of building energy management methods, such as Load Profile, Peak Load Analysis, Model Baseline, and Equipment Efficiency, that should be evaluated in a timely manner to better support the decision-making process [3].

V. Conclusion
EMSs are used to support the decision-making process of building energy managers, helping them to act in order to use energy more efficiently. To achieve this, those systems monitor building energy consumption in order to identify potential problems and assess how the actions taken affect energy efficiency.
Effective problem solving requires early intervention, which is only possible with early detection of problems. Typically, a problem takes days or weeks to be detected; reducing this time to hours, or even minutes, would be a major contribution. However, to achieve this, EMSs should be able to detect volatile and ephemeral situations, which, in a real scenario, requires the continuous gathering of energy-related data that must also be continuously evaluated in a timely manner. EMSs should be able to evaluate in real time huge amounts of data collected from several buildings or even from large urban areas.
Since, as we have discussed, DBMSs are not the best solution for the timely processing of EMS data, we propose the hypothesis that an EMS based on a DSMS performs better than common solutions based on a DBMS.
As future work, we intend to validate our hypothesis by implementing the solution proposed above under the "Smart Campus Energy Monitor" project, an EMS developed at Instituto Superior Técnico that monitors campus energy consumption through a network of energy meters and other related sensors. The current implementation is based on a DBMS, and our objective is to consider the data processing tier of Fig. 5 as a replacement for the current data processing infrastructure. To assess which architecture performs better, we plan to run a benchmark evaluation to determine which solution presents lower latency in data evaluation and a more suitable query language for supporting the decision-making process. The work will also evaluate how Esper, a non-parallel DSMS, performs under the challenges brought by Big Data applications, in order to understand which features should be taken into account in a future evaluation of parallel DSMSs.

Fig. 1: Illustrative evaluation scenario for Q1. The evaluation process has to: 1) consider the data points of the past 8 hours, 2) calculate the average value of those points, 3) identify the periods for which the values are 20% greater than the average.

Fig. 2: Data stream processing using a DBMS. Data streams produced by sources are first persisted into a database, and only then processed by the DBMS. For the data stream application, this introduced delay is unacceptable.

Fig. 4: General EMS software architecture, organized in three layers: (i) data presentation, responsible for presenting the information derived from the collected data; (ii) data processing, responsible for integrating and evaluating collected data; (iii) data acquisition, responsible for retrieving data from sensors and meters.

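Fig. 5: Proposed Architecture. From left to right: Data Acquisition Tier, composed of the data streaming sources; Data Processing Tier, responsible for the integration, validation, and normalization of gathered data and, if necessary, for persisting it for future reference; Data Presentation Tier, where the produced results are passed to the stream application.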

Fig. 6: Data processing as a graph of Continuous Queries (CQs). (a) The data transformation components (e.g., Data Integration) are implemented through a composition of CQs. (b) Each CQ works as a data transformation step (in a pipeline of transformations) and is specified in Esper's declarative language, to then be transparently compiled into an optimal query execution plan.

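TABLE I: Real-time dimensions and corresponding deadlines [16], [17], [3]. Each dimension may be classified by its real-time capabilities. (†) Captures the scope of this work.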

TABLE III: Summary of general features provided by DSMS solutions. •: supported. •: not supported. C: centralized. D: distributed. : unknown information. (†) EsperHA is a paid extension of Esper that supports distributed deployment.