Analysis of Data Cleaning Techniques for Electrical Energy Consumption of a Public Building

Statistical techniques and artificial intelligence are becoming a necessity in a fast-paced world rather than just a theoretical use case. To satisfy this need, the optimization process starts with data collection and cleaning. The aim of this paper is to provide a short overview of outlier detection methods and to explain the need for data cleaning in the field of energy consumption by analyzing the energy profile data from the Technical University of Cluj-Napoca's swimming complex. The first and second parts of the article present a short overview of cleaning methods. The third part compares the efficiency of the proposed methods. Finally, the fourth part of the article is dedicated to conclusions and future work.


I. INTRODUCTION
Before the development of high-end hardware and software, data analysis was often just an exercise of applying theoretical algorithms to various dummy data sets in order to validate an isolated perspective. Nowadays the theory still stands, yet the dummy data has been replaced by real-time data that must prove its usefulness in later analysis. In the process of acquiring real-time data, cleaning has become an essential step to avoid a "Garbage In, Garbage Out" scenario. Data gaps, outliers, missing values, or out-of-range values can be a consequence of data entry, measurement, distillation, or data integration errors [1,2]. With the rapid development of cloud computing technologies [3], storage has become a widely used service that allows most companies to collect and store large volumes of data. With a large volume of data, the probability of errors increases, and dirty data can lead to wrong decisions and questionable analyses, which makes data quality a major concern. Other common errors are typos, mixed formats, replicated entries, and violations of business rules, which analysts need to treat as a key consideration when exploring the research side of databases [4,5]. While forecasting the energy profile of the Technical University of Cluj-Napoca's swimming complex, we encountered gaps and abnormal values in the data sets given by our data feed provider. Because of these issues, we decided that an outlier detection analysis had to be implemented before attempting any forecast. The contribution of this paper is to identify efficient outlier detection techniques for an energy data set through an internally built intelligent scoring algorithm.

II. OUTLIER DETECTION TECHNIQUES
From the vast literature focused on outlier detection, some definitions can be summarized. A general one is given by Barnett and Lewis, who define an outlier as an observation, or a set of observations, which appears to be inconsistent with the rest of the data set [6]; or, in short, "Outliers do not equal errors. They should be detected, but not necessarily removed. Their inclusion in the analysis is a statistical decision" [7]. From a deeper understanding of the data, we should take into consideration that outliers do not necessarily have to be removed or replaced; in some cases they can be a consistent observation in the long term. In response, many different outlier detection methods have been developed in the literature [8,9]. A detailed overview of methods used in outlier detection was presented in [10], grouped as: probabilistic models with parametric and nonparametric approaches, statistical models, and machine learning algorithms with clustering-based and classification-based techniques.

A. Probabilistic models
Probability distribution functions have been proposed to detect outliers as data points that have the highest probability of falling outside a given threshold. There are two types of probabilistic approaches: parametric, in which the data is analyzed with a predefined, known distribution function, and nonparametric, in which the distribution is estimated from a density or distance function. In both, deviations are considered anomalies because they do not behave like the majority of the tested population (data points) [11,12]. Gaussian distribution functions and the median absolute deviation are usually applied in parametric probabilistic outlier detection methods [11]. Because most of the distributions are univariate and the underlying distribution of the observations needs to be known in advance, probabilistic parametric models fail to deliver when the data is not known [13].
The median, although a measure of central tendency like the mean, is largely insensitive to the presence of outliers. The median together with the Median Absolute Deviation (MAD) also characterizes the statistical dispersion of a data set, and is much more robust than the mean and standard deviation. One indicator used to quantify the insensitivity of the median is the "breakdown point" (see, e.g., Donoho & Huber, 1983) [14]. This indicator represents the maximum fraction of contaminated data that a data set can contain without affecting the final result. For example, if one record in a data set has an infinite value, the mean becomes infinite, while the median remains unchanged. The median only breaks down when more than half of the values are infinite [15,16].
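As an illustrative sketch (not code from this study), the median/MAD rule can be implemented as follows; the threshold of 3.0 and the 1.4826 consistency constant are conventional choices for normally distributed data, not values taken from the paper:

```python
import numpy as np

def mad_outliers(values, threshold=3.0):
    """Flag points whose distance from the median exceeds `threshold`
    scaled MADs. The 1.4826 constant makes the MAD comparable to the
    standard deviation under a normal distribution."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:
        # degenerate case: more than half the values are identical
        return np.zeros_like(x, dtype=bool)
    score = np.abs(x - median) / (1.4826 * mad)
    return score > threshold

# A single extreme reading barely moves the median,
# so only that reading is flagged.
readings = [50, 52, 51, 49, 53, 50, 500]
print(mad_outliers(readings))  # only the last point is True
```

Note how the example demonstrates the robustness argument above: the contaminated value 500 would drag the mean far off, but the median-based score isolates it cleanly.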
Box-and-whiskers plots (Fig. 1) can be used for outlier detection as a numeric and graphical approach for tediously large data sets, but they require human analysts to obtain accurate results. The output can generally cut through a high density of outliers over the studied time horizon, and it remains a good graphical indicator when tested over different time stamps. To achieve better accuracy in outlier detection, automated techniques have been developed in the literature [9,36].
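The numeric rule behind a box-and-whiskers plot can be automated with Tukey's fences; the sketch below assumes the conventional 1.5 × IQR multiplier and is our own illustration, not code from the paper:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Tukey's rule from the box-and-whiskers plot: points beyond
    k * IQR below Q1 or above Q3 are flagged as outliers."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (x < lower) | (x > upper)

readings = [50, 52, 51, 49, 53, 50, 500]
print(iqr_outliers(readings))  # only the last point is True
```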
Nonparametric methods give a broader view, having been tested on multidimensional data sets [17,18], and can be combined with various clustering techniques such as k-nearest neighbors [19] or density-based approaches (which take only a global view of the data set) using kernel estimation and Parzen windows [20][21][22]; however, they are usually computationally expensive. Apart from the most common nonparametric methods (ranking or scoring data based on differences and similarities [23], and Gaussian mixture models [24]), probabilistic ensemble models combining the density-based local outlier factor (LOF) with distance-based methods such as k-nearest neighbors are also used [25]. LOF itself measures how close a certain point is to the other points in its vicinity in order to obtain the local neighborhood density. This density is then compared to the densities of the other points. The whole procedure is governed by a parameter k, which dictates whether the outlier detection has a more local focus: a smaller k gives a more local view but becomes more error-prone as noise in the data grows, while a large k can miss local outliers. In this paper we test multiple values of k to determine a suitable one for our data set. In the context of data cleaning for electrical energy consumption, this algorithm offers an interesting opportunity beyond merely identifying problematic, faulty, or abnormal consumption readings, as LOF does not treat being an outlier as a binary property. Thus, the result can be used not only to identify outliers but also to determine an adjustment factor [26].
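A minimal sketch of LOF on a univariate consumption series, using scikit-learn's `LocalOutlierFactor` as an assumed implementation (the synthetic series and the injected spike are purely illustrative):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic hourly consumption with one injected spike;
# n_neighbors plays the role of the parameter k discussed above.
rng = np.random.default_rng(0)
consumption = rng.normal(100.0, 5.0, size=200)
consumption[42] = 400.0  # simulated faulty reading

lof = LocalOutlierFactor(n_neighbors=25)
labels = lof.fit_predict(consumption.reshape(-1, 1))  # -1 marks outliers

# negative_outlier_factor_ is a continuous score, so readings can
# also be ranked (or adjusted) rather than just discarded.
print(int(np.sum(labels == -1)))
print(labels[42])
```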

B. Statistical approaches
Residual analysis can be a good indicator for outlier detection when using statistical methods such as autoregressive moving averages [27][28][29][30], even if it is hard to identify the polynomial function for real time-series data [31,32]. Linear regression models were proposed in [33], where the dependent variable is the electric load consumption and the independent variables are the weather inputs. Because most of the time the parameters are calculated from historical data, the statistical algorithms in the literature were generally developed only for offline anomaly detection, even if some of them were described as heavyweight online anomaly detection methods used in wireless networks [34,35].

C. Machine learning algorithms
From the more advanced artificial intelligence perspective, machine learning with supervised (classification-based) and unsupervised (clustering-based) learning has been used to detect outliers in fraud, health care, image processing, and network intrusions [36][37][38]. In the clustering approaches, each data point is assigned a degree of belonging to each cluster. Anomalies are detected by comparison against the clusters' thresholds and by examining the cluster membership of the tested data. A good example of a clustering method is the k-means algorithm, where the top n points situated at the greatest distance from their nearest cluster are selected as outliers [39].
The most feasible machine learning techniques for anomaly detection in an unsupervised environment are the clustering-based approaches, represented by models such as k-means [MacQueen] [40] or DBSCAN [Ester] [41]. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) was proposed in 1996 and has shown excellent results in extracting density information from data. DBSCAN presents some advantages over the k-nearest neighbors algorithm, such as automatically adjusting the number of clusters to be computed and the ability to isolate outliers in individual clusters. DBSCAN classifies the data points into three groups: core points, border points, and outlier points. The model divides the samples into different classes based on their proximity, considering two input parameters: ε (eps), the maximum distance between two samples, and the minimum number of points, i.e., the number of samples in a neighborhood required for a point to be considered a core point. A sample is considered a border point if it is not a core point but is still part of a cluster. The remaining points are the outlier points [40,41].
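The core/border/outlier labeling can be sketched with scikit-learn's `DBSCAN` (an assumed implementation choice; the eps value and synthetic data are illustrative and unrelated to the parameters used later in the paper):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Dense synthetic consumption cluster plus one isolated reading.
rng = np.random.default_rng(1)
consumption = rng.normal(100.0, 5.0, size=(200, 1))
consumption[10] = 400.0  # far from any eps-neighborhood

# eps is the maximum neighborhood distance; min_samples is the
# number of points required for a core point. DBSCAN assigns the
# label -1 to outlier (noise) points.
db = DBSCAN(eps=5.0, min_samples=5).fit(consumption)
print(db.labels_[10])          # the isolated reading
print(sorted(set(db.labels_))) # cluster ids, with -1 for noise
```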
In the AI literature we can also find machine learning classification-based approaches, such as neural networks and support vector machines, that mimic classifiers for anomaly detection. Neural networks have been used in various domains [42][43][44], with the advantage of clearly differentiating between different outlier classes, even if they need a rigorous definition of the cost function. In the case of support vector machines, the algorithm looks for the optimal hyperplane that splits two adjacent data classes [44] and finds the maximum margin necessary to separate them. Hybrid methods have also been proposed for outlier detection, such as the Bayesian classifier, which combines probabilistic and machine learning algorithms by applying Bayes' theorem between the features and the given classes [45].

III. PROPOSED METHODS AND RESULTS

A. Proposed outlier detection techniques
In the process of finalizing the DR-BoB "Demand Response in Blocks of Buildings" project, funded by the EU Horizon 2020 innovation program under grant agreement No. 696114/2016 [46,47], data was collected from the Technical University of Cluj-Napoca's (TUCN) buildings in order to develop an energy monitoring and targeting tool with a demand response curve control strategy. While gathering the electrical consumption data from the swimming complex of TUCN, inconsistent data was detected (see Fig. 2). "You can observe a lot by just watching" (Yogi Berra): to confirm our descriptive observations, we applied the LOF method to understand the distribution of the data beyond a binary level. To avoid seasonality effects, the data was split by year for testing. In our analysis we used k values of 2, 3, 4, 5, 25, and 50, and we determined that the most suitable value for our scenario was 25, even if in the literature the most used parameters are usually 2 and 3 [26]. The reason for selecting a higher value of k was that we wanted to see the anomalous data from a more global perspective. It was observed that, on average, 20% of our data points were outliers. We consider that a point has a higher probability of being an outlier if it is selected under all tested k values. After the first round of analysis, it was observed that all outlier techniques detected anomalous data for more than one month in 2017. The interquartile range (IQR) and median methods were also applied to the same data set, with similar results: on average, more than 20% of the data was flagged as outliers. Before accepting the final results from the detection algorithms, the data feeds were rechecked and the building's administration was questioned in order to understand whether a potential event could have occurred in the tested period.
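The multi-k vote described above (a point counts as a likely outlier only if every tested k flags it) might be sketched as follows; the helper name `consensus_outliers` and the synthetic series are our own illustration under the assumption of scikit-learn's LOF, not the project's actual code:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def consensus_outliers(series, ks=(2, 3, 4, 5, 25, 50)):
    """Keep only the points that LOF marks as outliers for every
    tested neighborhood size k (the paper's list of k values)."""
    x = np.asarray(series, dtype=float).reshape(-1, 1)
    votes = np.ones(len(x), dtype=bool)
    for k in ks:
        labels = LocalOutlierFactor(n_neighbors=k).fit_predict(x)
        votes &= (labels == -1)  # unanimous vote required
    return votes

# Synthetic yearly slice with one injected anomaly.
rng = np.random.default_rng(2)
data = rng.normal(100.0, 5.0, size=300)
data[7] = 500.0
flags = consensus_outliers(data)
print(bool(flags[7]))
```

Requiring unanimity across all k values trades recall for precision, which matches the paper's goal of keeping only the points with the highest probability of being true outliers.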
During our investigation, we determined that the anomaly in the data was caused by a faulty collection feed. After checking and adjusting the data feed (see Fig. 3 for the corrected feed), the interquartile range (IQR) and median methods were rerun to identify the real anomalous data. The testing was conducted for each year from the beginning of 2014, and also for the whole data set. It was observed that, on average, 552 data points out of a total of 51120 were detected as outliers, i.e., 0.9% of the analyzed data (see Table 1). The LOF method was rerun after the data adjustment, and on average 2576 outliers were detected (Table 2). There is still a large gap of 2024 detected outliers between the IQR/median methods and the LOF method, which led to a different type of analysis. To understand our data further, we also chose a clustering-based method, DBSCAN, to find the anomalous data. As input parameters, the default value of 0.5 was used for epsilon (ε), together with various minimum numbers of points: 5, 10, 20, and 50, respectively. It was observed that the most relevant outliers were detected using 5 minimum points. In order to cover more outliers, the process was run on different data variations, including tuples of registered consumption value and hour, and of registration day and consumption value. The results showed an estimated 843 clusters with a silhouette score of 0.97 for consumption value and hour, detecting 614 outliers (see Fig. 4). The same exercise was done for consumption values and weekdays, and the output showed 359 clusters with a silhouette score of 0.98 and 285 outliers. Taking a multi-parameter approach, testing was also conducted using consumption values together with day of week and hour (Fig. 5).
The estimated number of clusters was 66, with a silhouette score of 0.99 and 123 outliers. It is important to note that higher silhouette scores indicate higher accuracy, which qualifies this test as a relevant result.

B. Intelligent Scoring Method
After the first iteration, we considered that the most relevant outliers are those with the highest probability of occurring in most of the tested scenarios. In order to create an automated system for anomalous data detection, we decided to use an incidence factor for values that do not follow the pattern of the majority of the data; in our case, for the tested data and methods, the total number of common outliers is 123.
In order to avoid counting natural energy peaks as outliers, an intelligent scoring method was implemented. The method was designed to take the output of any outlier detection technique and compare it with the average energy consumption over four different data clusters for the same time interval (hour) and similar working or weekend days. The first cluster contains data from the same year and the same two-month period as the investigated outlier data point. The second cluster contains energy consumption data from the same year and the same season (winter or summer). The third and fourth clusters contain data from the entire historical data set for the same two-month period and for the same season, respectively. The aim of the exercise is to validate, through a scoring process, whether the flagged data points are unusual consumption and/or damaged data values. The score over each data cluster runs from 0 to 5: 0 means the flagged data point is a usual energy consumption value (an invalid outlier), while 5 means the consumption value is much higher/lower than 90% of the cluster data points. A final scaled score is computed from the individual scores over each analyzed data cluster.
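The paper specifies only the endpoints of the 0-5 scale (0 for usual consumption, 5 when the value exceeds 90% of the cluster), so the sketch below assumes evenly spaced percentile bands in between; the helper names `cluster_score` and `final_score`, and the band thresholds, are hypothetical:

```python
import numpy as np

def cluster_score(candidate, cluster_values):
    """Score a flagged data point against one reference cluster on a
    0-5 scale: 0 when the value sits inside the bulk of the cluster,
    5 when it is above/below 90% of the cluster data points.
    The intermediate bands are an assumed calibration."""
    x = np.asarray(cluster_values, dtype=float)
    # fraction of cluster points the candidate exceeds (or undercuts)
    frac = max(np.mean(candidate > x), np.mean(candidate < x))
    bands = [0.50, 0.60, 0.70, 0.80, 0.90]  # assumed thresholds
    return sum(frac >= b for b in bands)

def final_score(candidate, clusters):
    """Scale the four per-cluster scores into one 0-5 value."""
    return np.mean([cluster_score(candidate, c) for c in clusters])

typical = list(range(90, 111))  # typical consumption 90..110 kWh
print(cluster_score(150, typical))  # 5: beyond 90% of the cluster
print(cluster_score(100, typical))  # 0: sits inside the bulk
```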

C. Final Results and Resolutions
Because the last iteration was not enough to cover all the outliers and to help us understand which method, or combination of methods, is best for error detection, we decided to filter the initial results through the proposed intelligent scoring method. Having a small amount of data, we combined the IQR and median method results into a single database for the computation. It was observed that, from a common ground of 413 outliers, only 322 were validated by the system. The same process was conducted on the combined DBSCAN results, and 95.2% of the observations were validated as real outliers. Because the LOF method produced a larger number of detected issues, we decided to run the K2 and K3 databases independently. For both LOF computations the results lacked accuracy, with only 755 valid outliers out of 3512 detected for K2, and only 292 out of 1628 for K3 (Table 3). After the scoring process, the valid output was analyzed in one database. It was observed that, across all methods, there are a total of 1468 unique valid outliers. Some of the methods validated the same data points as anomalies: 23 common data points were detected by all the methods, 63 by three of them, and 416 by any two methods with a common value (Table 4). Given this agreement between the methods, we can be certain of 502 anomalous data points in the data set.

IV. CONCLUSION
This paper presented various outlier detection methods applied to determine the sanity of the data collected from the Technical University of Cluj-Napoca's swimming complex and to prepare it for a future forecasting exercise. During the analysis, we came to understand the need for an intelligent scoring method, as the presented outlier detection methods were unable to differentiate between natural energy peaks and anomalous data. The exercise reinforces the idea that outlier detection methods do not give high accuracy in a universal usage approach. For the current test, we obtained the highest accuracy with the DBSCAN method. The LOF method will be reviewed in our next use cases; if its low accuracy persists, it will be removed from future work. For future analysis, we will extend our outlier detection by adding more methods and new data sets collected during the DR-BoB "Demand Response in Blocks of Buildings" project.