A Transfer Learning Framework for Predictive Energy-Related Scenarios in Smart Buildings

Human activities and city routines follow patterns. Transfer learning can help achieve scalable solutions toward the realization of smart cities by accounting for similarities between regions, domains, and activities. In this study, we propose a transfer learning-based framework for smart buildings to test this hypothesis on energy-related problems. Our framework has two major components: the network creation and the transferable predictive model. To create the network that groups buildings sharing characteristics, we evaluated two strategies: a novel clustering algorithm for mixed data, k-prod, and clustering of the image-based representation of time series. Then, a combination of long short-term memory and convolutional neural networks was trained on the centroids of the clusters for energy consumption prediction. The coefficient of variation of the root mean squared error (CVRMSE) of the predictions in such clusters varies between 3.85% and 58.85%. The obtained parameters were transferred to the rest of the buildings for predictive purposes, yielding accurate results in buildings with little data. Our framework deals with insufficient training data, since parameters from scenarios with more sensors can be received. It also achieves state-of-the-art performance on three datasets from different sources with a total of 533 rooms/buildings and two energy efficiency domains: consumption prediction, reducing the CVRMSE by 21.6%, and air conditioning usage prediction, reducing the CVRMSE from 4.18% to 0.28%. Our framework extracts more knowledge from available IoT deployments, so that smartness can spread between environments at a lower cost, given that less individual effort will be needed.


I. INTRODUCTION
In cities, resources are scarce, and the new paradigm of electricity generation and consumption forces the realization of smart urban environments. To address this, one of the main courses of action is energy management toward its efficient use [1], [2]. That is why smart meters are on the rise. In Europe, at the end of 2018, around 44% of customers had a smart electricity meter, and the penetration rate is expected to reach 58%-71% by 2023 [3]. As of 2020, the penetration of smart electricity meters in North America had reached 68% [4].
Sensor provision, installation, and maintenance are essential for functionally deploying IoT solutions that feed the algorithms providing intelligence to cities and buildings [5]. However, despite the reduction of their energy consumption (EC) and their cost, sensors are sufficiently intrusive to make it unfeasible to place them on every building for every application. In that sense, we can benefit from comprehensive IoT roll-outs that extract initial knowledge and then use transfer learning techniques to extrapolate the relevant information to partially monitored environments, applying higher level knowledge about the similarities of the sites. As with the hardware, complete datasets are not always available in fast-emerging smart cities. Data can be noncanonical, expensive to collect and label, inaccessible [6], [7], or simply nonexistent in many cases. For these two reasons, we consider that leveraging transfer learning will help in the realization of the smart city.
Our work identifies representative buildings and how they relate to others. To find such archetypes, we have developed two approaches: a new algorithm for clustering mixed sparse data and a visual representation of time series for their clustering. The properties considered include physical characteristics, geospatial information, and monitored values. Once clusters are created using these characteristics, we select the building that is closest to the centroid of each cluster as the representative of the rest. The transfer learning framework for time series analysis was constructed using convolutional neural networks (CNN) and long short-term memory (LSTM) networks under big data requirements.
We provide experiments on forecasting EC time series and heating, ventilation, and air conditioning (HVAC) behavior. These forecasts are needed for energy-efficiency strategies such as buying and storing green energy, preventing power peaks, and profiling HVAC users for efficiency recommendations. In addition, the framework offers deep insights into how to efficiently leverage transfer learning to enable a range of IoT services. The novelty of this article stems from the fact that this is the first effort toward generalizing the methods created for the emergence of intelligence in specific buildings and, therefore, a first step toward the sublimation of the smart city. We do this by applying transfer learning techniques to a great variety of buildings to solve two different problems related to energy efficiency. The specific contributions are as follows:
1) Creation of a clustering algorithm for mixed static data able to handle missing values.
2) Creation of a multivariate time series clustering methodology based on images.
3) Development of a transfer learning framework based on CNN-LSTM neural networks.
4) Evaluation of our framework to transfer knowledge in similar-domain problems: EC prediction to EC prediction and EC prediction to HVAC setpoint prediction, for a set of more than 500 buildings.
The article is structured as follows: Section II outlines the background and related work with regard to transfer learning principles, highlighting CNN and LSTM networks and the applications of transfer learning in smart buildings, and describes k-pod and k-prototypes clustering, which are the main concepts needed for the development of our framework.
Section III describes our framework for creating a network of buildings and for transferring the learning between predictive models. Section IV presents the experiments, results, and discussion and finally, the conclusions and future work are enumerated in Section V.

II. BACKGROUND AND RELATED WORK
In this section, we define the needed terms and background to build our transfer-learning methodology and algorithms.

A. Transfer Learning Principles and Techniques
Transfer learning arises from the need to create a high-performance learner for a target domain trained from a related source domain. It can improve learning efficiency by leveraging the knowledge learnt from related domains, and it becomes very useful in practical scenarios in which the provision of IoT devices is not feasible in terms of resources. The transfer learning framework can be formally defined as follows: Let D be a domain consisting of two components, D = {X, P(X)}, where X = {x_1, . . ., x_n} represents the features and P(X) their marginal probability distribution.
For a given domain D, a task T is defined by two components as T = {Y, f(·)}, where Y is a label space and f(·) is a predictive function learned from the feature vector and label pairs {x_i, y_i}, where x_i ∈ X and y_i ∈ Y. The task can be rewritten as T = {Y, P(Y|X)}. Now, D_S is defined as the source domain data, D_S = {(x_{S_1}, y_{S_1}), . . ., (x_{S_n}, y_{S_n})}, where x_{S_i} ∈ X_S is the ith data instance of D_S and y_{S_i} ∈ Y_S is the corresponding class label for x_{S_i}. Similarly, D_T is defined as the target domain. Having stated this, we can formally define transfer learning as the process of improving the target predictive function f_T(·) in the target domain D_T by using the related information from D_S and T_S, where D_S ≠ D_T or T_S ≠ T_T. Depending on which condition is met, there are several classifications of transfer learning:
1) Heterogeneous transfer learning: X_S ≠ X_T, meaning that the source features differ from the target features.
2) Homogeneous transfer learning: X_S = X_T, meaning that the features are the same in both source and target.
3) P(X_S) ≠ P(X_T): meaning that the marginal distributions are different.
4) P(Y_S|X_S) ≠ P(Y_T|X_T): meaning that the conditional probability distributions are different.
5) Y_S ≠ Y_T: a mismatch in the class space.
6) P(Y_S) ≠ P(Y_T): caused by imbalanced labeled datasets between the source and target domains.
7) Negative transfer learning: when transferring knowledge degrades the target performance.
Another way to classify transfer learning approaches answers the question of "what to transfer?" and consists of: sample instances, feature mapping, model parameters, and association rules [8]. Instance transfer learning improves model accuracy by finding training samples that have a strong correlation with the target domain and reweighting them so that they can be used as input for training in the target domain [9].
Feature-based methodologies generate a different representation of the source features with the objective of reducing the differences between the source and the target. There are two subcategories: asymmetric transfer, which reweights the features [10], and symmetric transfer, which finds a common latent feature space [11].
The parameter-based category transfers knowledge through the shared parameters of the source and target domain learner models, or by creating multiple source learner models and optimally combining the reweighted learners (ensemble learners) to form an improved target learner [12]. Typically, when dealing with neural networks (NN), the model begins with random weights near zero and adjusts them during training. Training a deep NN requires a large amount of labeled data. Using transfer learning, we can start the network with weights prepared on a comparable domain and adjust them to the specific target. This reduces the time and data requirements for creating an accurate network. The last and least used approach is known as relational based. It relies on some defined relationship between the source and target domains [13], [14].
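As an illustration of the parameter-based idea, the following minimal sketch (not from the original study) trains a linear model on a data-rich source domain and uses its weights to warm-start a short fine-tuning run on a small target set; the `fit_linear` helper and all data are synthetic assumptions:

```python
import numpy as np

def fit_linear(X, y, w_init=None, steps=500, lr=0.1):
    """Fit y ~ X @ w by gradient descent, optionally warm-started."""
    w = np.zeros(X.shape[1]) if w_init is None else w_init.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])

# Source domain: plenty of data, model trained from scratch.
X_src = rng.normal(size=(500, 3))
y_src = X_src @ w_true + 0.01 * rng.normal(size=500)
w_src = fit_linear(X_src, y_src)

# Target domain: only a handful of samples; start from the source weights
# and fine-tune for a few steps instead of training from zero.
X_tgt = rng.normal(size=(10, 3))
y_tgt = X_tgt @ w_true + 0.01 * rng.normal(size=10)
w_tgt = fit_linear(X_tgt, y_tgt, w_init=w_src, steps=20)
```

The same warm-start pattern carries over to deep networks, where the copied parameters are layer weights rather than a single coefficient vector.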
Much research effort has been devoted to transfer learning in applications such as image classification [15], natural language processing [16], and product recommendation [17], while transfer learning in smart cities and buildings is still less studied. However, data-driven smart building applications are sometimes even more critical, since city-level decisions that involve infrastructure are more definitive than others. Transfer learning can play a crucial role in this decision-making process [18].
In this article, we have used CNNs and LSTM networks for our transfer learning strategy and therefore, we review some basic principles in the following.

1) CNN and LSTM Networks:
A CNN is a class of deep neural networks, most commonly applied to analyzing visual images, that uses convolution instead of general matrix multiplication in at least one of its layers. Convolution can be seen as a way of looking at a function's surroundings to make accurate predictions of its outcome. Even though CNNs are known to be appropriate for image processing, they are also used in areas such as natural language processing, time series, proteomics, and audio data classification [19]-[21]. The layers of a CNN have the so-called neurons arranged in three dimensions (width, height, and depth) as nodes connecting inputs and outputs, and the neurons in a layer are only connected to a small region of the previous layer. In a CNN, the network learns the filters that in traditional algorithms were hand-engineered. This independence from prior knowledge in feature design is a major advantage of the method.
Also in the family of artificial neural networks (ANNs), a long short-term memory (LSTM) network is a type of recurrent neural network (RNN) that retains information over longer periods of time. Unlike CNNs, LSTM networks have feedback connections, implying that for making a decision they selectively consider the previous input(s) and the output(s) they have previously learned, thereby addressing the vanishing gradient problem. LSTMs are trained using backpropagation, meaning that information about the errors is sent in reverse through the network so that it can alter the weights. This kind of network is well suited to time series data: it can process not only single data points (such as images) but also entire sequences of data (such as videos).
The architecture of an LSTM network has the form of a chain of repeating modules or blocks. Each block typically has a memory cell and three gates: an input gate, an output gate, and the so-called forget gate. The memory cell saves the state of the LSTM unit, while the gates are a way to optionally let information through. The gates are composed of a sigmoid function and a pointwise multiplication operation. The sigmoid function outputs numbers between zero and one, controlling the extent to which a new value flows into the cell. Similarly, the forget gate and output gate control the extent to which a value remains in the cell and its use to compute the output activation of the LSTM block, respectively.
In recent years, the LSTM approach has been combined with convolution mechanisms. This has led to the ConvLSTM, which has already proven superior in several recent applications [22]-[24]. The ConvLSTM's properties derive from the convolution and the LSTM layers.
Let X_1, . . ., X_t be the inputs, C_1, . . ., C_t the cell outputs, H_1, . . ., H_t the hidden states, and i_t, f_t, o_t the gates. The key equations of the ConvLSTM are shown below, where "*" denotes the convolution operator and "•" denotes the Hadamard product:

i_t = σ(W_xi * X_t + W_hi * H_{t-1} + W_ci • C_{t-1} + b_i)
f_t = σ(W_xf * X_t + W_hf * H_{t-1} + W_cf • C_{t-1} + b_f)
C_t = f_t • C_{t-1} + i_t • tanh(W_xc * X_t + W_hc * H_{t-1} + b_c)
o_t = σ(W_xo * X_t + W_ho * H_{t-1} + W_co • C_t + b_o)
H_t = o_t • tanh(C_t)

B. K-Pod and K-Prototypes for Clustering
In transfer learning, it remains an open question how to choose the main subjects from which knowledge is going to be obtained. For this, we propose a clustering approach. A cluster refers to a collection of data points grouped together because of certain similarities, and the representation of their center is the centroid. The K-means algorithm starts with a group of randomly selected centroids and then performs iterative calculations to optimize their positions until they stabilize. K-means is useful for numerical data. When a problem presents categorical data, k-medoids [25] is used; its computations are based on actual data points (medoids) instead of means. The k-prototypes algorithm integrates both through the definition of a combined dissimilarity measure [26]. K-pod is the adaptation of K-means to missing data [27]. In K-pod, the missing features are replaced by those of the centroids, given that the underlying assumption of this kind of method is that each observation of the dataset is a noisy realization of a cluster centroid.
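The combined dissimilarity at the heart of k-prototypes can be sketched as follows; `kproto_dissimilarity` and the toy building descriptors are illustrative names, with γ weighting categorical mismatches against the squared Euclidean numeric distance:

```python
import numpy as np

def kproto_dissimilarity(a_num, a_cat, b_num, b_cat, gamma=1.0):
    """Combined k-prototypes distance: squared Euclidean distance on the
    numeric attributes plus gamma times the number of mismatching
    categorical attributes (the matching dissimilarity)."""
    num_part = float(np.sum((np.asarray(a_num) - np.asarray(b_num)) ** 2))
    cat_part = sum(x != y for x, y in zip(a_cat, b_cat))
    return num_part + gamma * cat_part

# Two toy buildings: (floor area, number of floors) plus (usage, region).
d = kproto_dissimilarity([120.0, 3.0], ["office", "EU"],
                         [100.0, 3.0], ["school", "EU"], gamma=0.5)
# numeric part 400.0, one categorical mismatch weighted by 0.5 -> 400.5
```

Choosing γ balances how much a categorical disagreement counts relative to the numeric scale, which is why numeric attributes are usually standardized first.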

C. Energy-Related Problems in Smart Buildings
Consumption prediction methods can be classified into engineering, statistical, and machine learning methods [28].
Engineering models, also known as white-box models, use numerical equations by considering the physical properties of building characteristics [29]. Such information is often very difficult to obtain. Some researchers have tried to simplify engineering models to effectively predict energy consumption [30], yet those models' results may not have been completely accurate [31]. Statistical methods use mathematical formulas to correlate energy consumption data with influencing factors. The calculation processes are straightforward, but they often cannot handle complex interactions between factors, so they are not flexible and often have poor prediction accuracy [32].
Several machine learning techniques have been applied to smart building management in independent efforts. We claim that cooperation between domains is key to fully unlocking the potential of IoT research. The survey [33] compares the four main categories of machine learning applied to buildings: supervised, unsupervised, semisupervised, and reinforcement learning algorithms. Regarding supervised learning, energy consumption prediction is targeted in [34] using SVM and in [35] using a hybrid neuro-fuzzy inference system, and a multivariate feature selection strategy for the same purpose is proposed in [36]. Other works present hybrid approaches (ANN and K-pattern clustering) for predicting user activities in the smart environment, including but not restricted to those related to energy consumption [37]. It is also worth mentioning that many works combine LSTM with other strategies, such as decomposition mechanisms or data augmentation [38], for energy consumption prediction [39].
The design of a data-driven building energy consumption model consists of four steps: collection of historical and current data, such as baseline consumption and outdoor weather conditions; data preprocessing (cleaning, transformation, and/or data reduction); model training, which aims to fit the model to the training data; and model testing, to verify that the model performs well on unseen data.
The main design decisions normally concern the type of building (residential or nonresidential), the temporal granularity of data collection, the predictive horizon (from subhourly to yearly), the type of energy consumption predicted, the features to be used in the analysis, and the different data sizes between scenarios. For new buildings and most existing buildings without advanced IoT deployments, there is a lack of sufficient data to train data-driven predictive models [40]. Our method's novelty and value stem from the fact that we develop a strategy that improves accuracy in buildings with limited training data. As previously stated, the collection of data is one of the main design challenges in smart building problems, and we provide a novel method to overcome this issue with remarkable results.
Transfer learning can be leveraged to accelerate smart city development; this was termed the urban transfer learning paradigm in [41]. The literature on transfer learning in buildings is extensively reviewed in [42]. A recent study [43] uses feature extraction and weight initialization as transfer learning strategies for predicting a 24 h horizon of EC in buildings. This study serves as a baseline for developing transfer learning-based methods for building energy management in general. Feature extraction and domain adaptation combined in one model training process for transfer learning in consumption prediction scenarios between three buildings has been investigated in [44], and LSTMs have been used for transference between one office building and 20 offices [45]. The goals of these three studies are similar to ours; however, they do not find the relationships between buildings so as to share other kinds of models. At the same time, they only considered one database for their experiments, and the fact that they chose a subset of buildings hinders the generalization of their studies.

III. PROPOSED FRAMEWORK
A. Creation of a Network of Buildings
We aim to identify representative buildings, or their stereotype characteristics, and how they relate to other buildings, considering factors that include but also go beyond physical characteristics and geospatial information. In that sense, we could either select the buildings that best represent the rest for their sensorization, or measure how much a sensorized building relates to others in order to transfer knowledge between them. For this purpose, we have created two methodologies.
1) Creation of K-Prod for Metadata/Static Clustering: When analyzing the static characteristics of a building, we can find numerical variables (e.g., dimensions) and categorical variables (e.g., industrial, educational). In addition, many variables/values may be missing in real-world data, since they are usually manually annotated.
Our proposed algorithm, k-prod, is an adaptation of K-prototypes that deals with missing data. We have reused the idea of k-pod, in which missing features are replaced by those of the centroids, given that the underlying assumption of this kind of method is that each observation of the dataset is a noisy realization of a cluster centroid.
We first compute the mode or mean of each feature of the matrix, depending on the kind of data, and introduce such values in the missing points of the matrix. The column indexes for categorical data are specified as inputs of the algorithm. We first fit the K-prototypes model with the well-known method in [46], using the implementation of the k-modes python package 1, to the filled data X̂. After that, we iteratively replace the missing points with the computed centroids (distinguishing between numerical and categorical values) and refit the K-prototypes algorithm using the current centroids as initialization. The cost function is the Euclidean dissimilarity function for the numerical variables and the matching dissimilarity function [47] for the categorical ones. Once the labels do not change or the maximum number of iterations is reached, the algorithm finishes. This process is summarized in Algorithm 1.

Algorithm 1: k-prod: Adaptation of K-Prototypes to Missing Values.
Inputs:
    X: an n × m matrix
    n_cl: an integer, the number of clusters
    cat_index: an array with the ids of the categorical columns
    max_iter: an integer, the maximum number of iterations
Outputs:
    labels: an array with the assigned label per row
    centroids: a centroid per cluster
    X̂: a matrix with imputed missing values
    γ: the weighted sum of costs
The algorithm starts:
    labels_0 = NULL
    X̂ = X where missing values are filled with the mode or mean of the column
    cls = apply KPrototypes for n_cl to X̂
    for i in 0 to max_iter do
        labels = current cls labels
        centroids = current cls centroids
        γ = current cls gamma
        X̂ = X where missing values are filled with the centroids
        if labels == labels_0 then
            break
        end if
        labels_0 = labels
        cls = apply KPrototypes for n_cl using centroids as initialization to X̂
    end for
    return labels, centroids, X̂, γ
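A simplified, numeric-only sketch of the k-prod loop could look like the following; plain k-means stands in for K-prototypes, so categorical columns and the γ cost are omitted, and the deterministic initialization is an assumption for reproducibility:

```python
import numpy as np

def kprod_numeric(X, n_cl, max_iter=20):
    """Numeric-only sketch of k-prod: impute missing entries with column
    means, cluster, re-impute them from the assigned cluster's centroid,
    and repeat until the labels stop changing."""
    miss = np.isnan(X)
    Xf = np.where(miss, np.nanmean(X, axis=0), X)          # initial imputation
    # Deterministic initialization: evenly spaced rows as first centroids.
    centroids = Xf[np.linspace(0, len(Xf) - 1, n_cl).astype(int)]
    labels_prev = None
    for _ in range(max_iter):
        dists = ((Xf[:, None, :] - centroids[None]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([Xf[labels == k].mean(axis=0)
                              for k in range(n_cl)])
        Xf = np.where(miss, centroids[labels], X)          # re-impute from centroids
        if labels_prev is not None and np.array_equal(labels, labels_prev):
            break                                          # labels stabilized
        labels_prev = labels
    return labels, centroids, Xf

# Two well-separated groups of buildings, each with one missing value.
X = np.array([[0.0, 0.0], [0.1, 0.0], [np.nan, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, np.nan]])
labels, cents, Xf = kprod_numeric(X, 2)
```

The full algorithm additionally splits the distance into numeric and categorical parts and imputes categorical gaps with the centroid modes.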
We have thus created an algorithm that clusters buildings using solely their metadata.
2) Clustering Time-Series Data: The data collected from sensors in smart buildings come in time series form. For each building, we have some static data, as specified above, and several generated time series, including but not limited to EC, indoor temperature, and occupancy. The clustering of multivariate time series is always a complicated task, and we have decided to explore a new way to perform it. One of the problems where deep learning excels is image classification, so we propose to convert time series data into heatmap images in order to get an image representation of each building. Considering a building's EC and outdoor temperature as its representation, we would obtain the images shown in Fig. 1 for a single building. This strategy can incorporate any other kind of time-series data into the representation.
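As a sketch of this image-based representation, an hourly series can be folded into a days × hours matrix and min-max normalized so that it can be stored as a grayscale heatmap (the synthetic month of readings below is illustrative, not from the datasets used in the paper):

```python
import numpy as np

# One month of hourly consumption readings for a synthetic building with
# a daily cycle plus noise.
rng = np.random.default_rng(1)
hours = np.arange(30 * 24)
consumption = (50 + 20 * np.sin(2 * np.pi * (hours % 24) / 24)
               + rng.normal(0, 2, hours.size))

# Heatmap representation: one row per day, one column per hour of day.
heatmap = consumption.reshape(30, 24)

# Normalize to [0, 1] so the matrix can be saved as a grayscale image
# (e.g., with matplotlib's imsave) and fed to a pretrained CNN.
img = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min())
```

Stacking several such matrices (e.g., consumption and outdoor temperature) yields a multichannel image per building.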
From a deep learning perspective, it is common to apply transfer learning to reduce the training cost in image classification. In particular, fine-tuning existing models pretrained on massive datasets is a common transfer learning method. ImageNet [48] is a project that aims to provide a large image database for research purposes. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2 is an annual competition organized by the ImageNet team since 2010, where research teams evaluate their computer vision algorithms on visual recognition tasks such as object detection in images. The training data used for this purpose is a subset of ImageNet with 1.2 million images belonging to 1000 classes. We use ImageNet weights for initialization. The winners of ILSVRC have released their models to the open-source community, and many models are accessible, such as AlexNet, VGGNet, Inception, ResNet, and Xception. Given that we are interested in extracting the encoded features from the images, we remove the top dense layers, which are intended for classification problems. In our methodology, we have considered the vgg16, vgg19, and resnet50 models. These CNN models provide 3-D vectors that represent the image, which we need to flatten for the clustering algorithms. We can also use principal component analysis (PCA) for dimensionality reduction, easing the work of the clustering algorithms by reducing the number of features.
The vgg16, vgg19, and ResNet50 models were compiled using the categorical cross entropy loss function, the stochastic gradient descent optimization algorithm [49], and the accuracy (Acc) and mean square error (MSE) metrics. The categorical cross entropy is designed to quantify the difference between two probability distributions. It is well suited to classification tasks, since one example can be considered to belong to a specific category with probability 1 and to the other categories with probability 0. The stochastic gradient descent algorithm estimates the error gradient for the current state of the model using examples from the training dataset and then updates the weights of the model using backpropagation. Accuracy is defined as Acc = #correct predictions / #total predictions, and MSE is the mean of the squared differences between predictions and targets. After compiling, the networks output 3-D vectors that represent the image. We then flatten these for the clustering algorithms to start working with them. Before that, we also create PCA instances for each convnet output. This process is shown in Fig. 2. Next, we use K-means and Gaussian mixture model (GMM) algorithms to create and fit models that group similar-looking images together. Both models return as many centroids as the indicated number of clusters. We compute the distances between all the members of the clusters and the centroids in order to choose a representative building for each cluster.
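The final step, choosing each cluster's representative as the member closest to its centroid, can be sketched as follows; the toy 2-D feature vectors stand in for the flattened convnet features, and `representatives` is an illustrative name:

```python
import numpy as np

def representatives(features, labels, centroids):
    """For each cluster, return the index of the member whose (flattened)
    feature vector is closest to the cluster centroid."""
    reps = {}
    for k, c in enumerate(centroids):
        members = np.where(labels == k)[0]
        dists = np.linalg.norm(features[members] - c, axis=1)
        reps[k] = int(members[dists.argmin()])
    return reps

# Four buildings in two clusters; centroids as a clustering model returns them.
feats = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
labels = np.array([0, 0, 1, 1])
cents = np.array([[0.4, 0.0], [10.6, 10.0]])
reps = representatives(feats, labels, cents)  # building index per cluster
```

The selected buildings are the ones whose data train the source models that the rest of their cluster later inherits.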

B. Transfer Learning Algorithm for Time Series Prediction
We have adopted instance-based transfer with the strategy of using pretrained models for weight initialization. Reweighting was applied to incorporate source domain data into the target task training process. The predictive models are based on the combination of 1-D CNN and LSTM networks. Conventional deep CNNs are designed to operate exclusively on 2-D data such as images and videos. As an alternative, a modified version of 2-D CNNs, called 1-D CNNs, has recently been used [50]. One-dimensional convolutional layers can automatically extract local temporal features from time series, and they are advantageous in many ways when compared to 2-D ones: they have less computational complexity, allow compact configurations (networks with fewer than 10 K parameters), can run on a standard computer, and are well suited for real-time and low-cost applications [51]. They serve as a data preprocessing step and have been reported to be useful in reducing computational costs and enhancing model performance. The number of 1-D convolutional layers, the number of filters, the kernel sizes, strides, and activation function types are arbitrarily chosen, and future work should address the optimization of these hyperparameters.
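The local temporal features that a 1-D convolutional layer extracts can be illustrated with a hand-picked difference kernel; a trained layer would learn such filters itself, and the toy signal is illustrative:

```python
import numpy as np

# A 1-D convolution slides a kernel over the series.  Here a discrete
# difference filter highlights local changes (steps) in a consumption-like
# signal, the kind of local temporal feature a 1-D CNN layer learns.
signal = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 1.0, 1.0])
kernel = np.array([1.0, -1.0])                    # difference filter
feature_map = np.convolve(signal, kernel, mode="valid")
# Nonzero responses mark the two step changes in the series.
```

A convolutional layer stacks many such filters and learns their weights from data instead of fixing them by hand.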
After that, recurrent layers are adopted to capture interactions among temporal features for accurate predictions. The recurrent units are LSTM units, chosen for their excellence in modeling long-term dependence and tackling the vanishing or exploding gradient problem. A further extension of the combination of CNN and LSTM is to perform the convolutions of the CNN (i.e., how the CNN reads the input sequence data) as part of the LSTM for each time step. The combination that we used is called a convolutional LSTM, or ConvLSTM for short, and it is also used for spatiotemporal data [52]. Unlike an LSTM, which reads the data indirectly in order to calculate internal state and state transitions, the ConvLSTM uses convolutions directly as part of reading the input into the LSTM units themselves.
Regarding the implementation, we have used the ConvLSTM2D class provided by the Keras library. The layer expects a sequence of 2-D images, with input shape [samples, timesteps, rows, columns, features]. For our purposes, each sample is split into subsequences, which we chose to be 7 (a week), and the columns are the number of time steps per sequence, which we chose to be 24 (a day). The number of rows is fixed at 1, as we are working with 1-D data. Finally, we have two features (consumption/HVAC use and temperature). Therefore, our input shape configuration is [samples, 7, 1, 24, 2]. The methodology consists of, for every cluster, training a ConvLSTM network using the representative building's data. Then, the weights of this network are used to initialize the ConvLSTM networks of the rest of the buildings in the same cluster. Our transfer learning is safe in the sense that, since we only share the weights of the models, it could be considered a federated learning approach, where each building is a client [53]. In that sense, our approach potentially preserves the privacy of the buildings. Further investigation with privacy-preserving techniques should be carried out to diminish the security risks associated with data sharing [54]. Some works are already investigating how to incorporate federated learning into smart building scenarios in order to preserve privacy [55].
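Shaping two aligned hourly series into the [samples, 7, 1, 24, 2] input described above can be sketched as follows; only the reshaping is shown, not the Keras model itself, and the random series are placeholders for real consumption and temperature data:

```python
import numpy as np

# Two aligned hourly series (consumption and outdoor temperature) cut into
# one-week samples and reshaped to the 5-D input that Keras' ConvLSTM2D
# expects: [samples, timesteps, rows, cols, features] = [weeks, 7, 1, 24, 2].
n_weeks = 10
hours = n_weeks * 7 * 24
rng = np.random.default_rng(2)
consumption = rng.normal(size=hours)   # placeholder hourly consumption
temperature = rng.normal(size=hours)   # placeholder hourly temperature

stacked = np.stack([consumption, temperature], axis=-1)   # [hours, 2]
X = stacked.reshape(n_weeks, 7, 1, 24, 2)                 # ConvLSTM2D input
```

An array shaped this way can be passed directly to a `ConvLSTM2D` layer whose `input_shape` is `(7, 1, 24, 2)`.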

IV. RESULTS AND DISCUSSION
In this section, we describe the results from our framework. Data and scripts to reproduce the experiments are available at a GitHub repository. 3

A. Data Description and Preparation
Open data favors science. Open and reproducible research practices enable scientific reuse, accelerating future projects and discoveries in any discipline when accompanied by workflows and explanations [56]. Many studies have suggested the need for a set of time series benchmarks so that contributions are generalizable enough [57].
In recent years, some initiatives of great importance have been carried out for the acquisition of data and its publication as a basis for the creation of open and shareable data repositories for building performance research.
In this work, we have used three different open databases on building operation.
The first one is the nonresidential Building Data Genome Project [58] (DGP), which consists of one year of hourly EC data for 507 nonresidential buildings, 38 weather condition datasets that can be related to them, and building metadata. The building metadata provides general building information, such as the building location, primary usage type, and total floor area, for the majority of the buildings. The buildings are mainly located in America and Europe; only five buildings in Asia are included. The buildings are categorized into five primary usage types: 156 offices, 105 primary or secondary school classrooms, 95 university laboratories, 81 university classrooms, and 70 university dormitories.
3 [Online]. Available: https://github.com/auroragonzalez/TLIAS
Given that there are very few buildings in Asia in this dataset, we decided to consider as the second database the Indian BuiLdings Energy coNsumption Dataset (I-BLEND) [59]. This database contains minute-based energy-related information covering 52 months for 7 different buildings at an autonomous research institute in Delhi, India. We aggregated the data to an hourly sampling rate. The database also includes datasets on occupancy, the institute calendar, building architecture details, and four months of local weather (temperature, humidity). The I-BLEND consumption dataset is vaster than the previous one, but it also presents more missing values.
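The minute-to-hour aggregation applied to I-BLEND can be sketched as follows for a gap-free day of readings; with pandas one would typically resample by timestamp instead, and the toy series is illustrative:

```python
import numpy as np

# A day of minute-level power readings aggregated to hourly means:
# reshape the series into (hours, 60) and average each row.  This assumes
# a complete, evenly sampled day with no missing minutes.
minute_power = np.arange(24 * 60, dtype=float)    # toy minute-level series
hourly_power = minute_power.reshape(24, 60).mean(axis=1)
# hourly_power[0] is the mean of minutes 0..59
```

For real data with gaps, a timestamp-indexed resampling (e.g., pandas `resample("1h").mean()`) is the safer route, since it tolerates missing minutes.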
An extra benefit from adding data from several databases is that we are covering more of the semantics that are used for buildings' metadata, which can help enable interoperability across the different categories of buildings.
The last database that we consider will be referred to in this work as the Langevin-HVAC database [60]. It delivers occupants' energy behavior data collected over a year in 15 offices and from 24 workers at the Friends Center office building in Philadelphia, USA. It consists of longitudinal comfort survey data together with continuous measurements of the weather, the local indoor environment around each subject, and some of the subjects' behavioral actions. To the authors' knowledge, this is the first time that the dataset is used for predictive purposes. The preprocessing steps consist of assigning the rows to specific rooms and participants (subjects), selecting some continuous variables regarding thermostat information, weather, and room conditions, and aggregating them in an hourly manner. This preprocessing can be reproduced using the code within the data folder of the already-mentioned repository. 3 All the databases contain nonresidential buildings (offices, schools, laboratories, etc.). Consumption patterns of residential and nonresidential buildings are complementary, and the authors will delve into this in future work using available open data sources such as the ones in [61] and [62].

B. Clustering Results
First, we applied the novel k-prod clustering to all databases. To select the number of clusters, we used the elbow-method heuristic, which consists of plotting the explained variation (the cost, in our case) as a function of the number of clusters and picking the elbow of the curve as the number of clusters. This selected five clusters, as Fig. 3 shows.
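The elbow heuristic can be sketched as below. This is a minimal illustration, not the paper's code: it picks as elbow the k whose point on the cost curve lies farthest from the straight line joining the curve's two endpoints, one common way of automating the "pick the elbow" step.

```python
import numpy as np

def pick_elbow(costs):
    """Pick the elbow of a cost-vs-k curve as the k whose point has the
    largest perpendicular distance to the line joining the endpoints."""
    ks = np.arange(1, len(costs) + 1)
    p1 = np.array([ks[0], costs[0]], dtype=float)
    p2 = np.array([ks[-1], costs[-1]], dtype=float)
    line = p2 - p1
    line /= np.linalg.norm(line)          # unit vector along the chord
    dists = []
    for k, c in zip(ks, costs):
        v = np.array([k, c], dtype=float) - p1
        proj = v.dot(line) * line          # component along the chord
        dists.append(np.linalg.norm(v - proj))  # orthogonal distance
    return int(ks[int(np.argmax(dists))])

# Example: a cost curve that drops steeply until k = 5 and then flattens.
costs = [100, 80, 60, 40, 20, 18, 17, 16.5, 16]
print(pick_elbow(costs))  # -> 5
```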
Second, we applied our image-based time-series clustering. It is important to remark that, for the image clustering, we use only 2000 data points, since the further experiments assume that buildings have only recently started gathering data and therefore have insufficient data to train a neural network successfully by themselves. For ease of access, we saved the images in a public repository4 and built an image downloader with multiprocessing to speed up retrieval. We used the deep learning framework Keras [63] to define an architecture capable of accepting multiple inputs of building information, and then trained a single end-to-end network with these data. We randomly selected a train set of buildings constituting 70% of the total. For comparison between models, we also chose five clusters in this case, given the results shown in Fig. 3. The data of those buildings were used to obtain images, and then 12 combinations of feature extractor and clustering algorithm were tested: vgg16, vgg19, ResNet50, vgg16-pca, vgg19-pca, and ResNet50-pca, each paired once with K-means and once with GMM. With and without PCA, the centroids are the same. Then, we obtained the features for the test buildings' images and fit them with the previously created clustering models in order to obtain their labels. Given that there are 12 different options, we computed the adjusted Rand index in order to find the most common groupings and make a decision. The Rand index (RI) is one of the most popular alternatives for comparing partitions and has been rediscovered and/or modified by different authors [64]. Let n be the number of elements (in our case, the number of buildings included in the test set); there are n(n-1)/2 distinct pairs. Between two partitions U and V, we can find: 1) agreements: the objects in a pair are in the same class (or in different classes) in both U and V; and 2) disagreements: the objects in a pair are in the same class in one partition but in different classes in the other.
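The feature-extraction-plus-clustering stage can be sketched as follows. This is a hedged illustration, not the authors' implementation: synthetic random vectors stand in for the CNN embeddings (in the actual pipeline they would come from vgg16/vgg19/ResNet50 with the classification head removed), and scikit-learn provides the PCA, K-means, and GMM back-ends.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for CNN embeddings of the time-series images of 60 "buildings":
# the real features would be extracted with a pretrained network, e.g.
# VGG16(include_top=False) followed by pooling.
features = rng.normal(size=(60, 512))

# Optional dimensionality reduction before clustering (the "-pca" variants).
reduced = PCA(n_components=10, random_state=0).fit_transform(features)

# The two clustering back-ends paired with each feature extractor.
km_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(reduced)
gmm_labels = GaussianMixture(n_components=5, random_state=0).fit_predict(reduced)
```

Labels for held-out test buildings would then be obtained with the fitted models' `predict` method, as described above.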
The RI is not able to take into account the effect of random groupings: as the number of clusters increases, more and more pairs count as agreements simply because they are likely to fall into different clusters in both partitions. To counter this drawback, the calculation is adjusted for grouping by chance through the generalized hypergeometric distribution, yielding the adjusted Rand index (ARI), whose formula and further theoretical explanation can be seen in [64]. The ARI is recommended as the index of choice for measuring agreement between two partitions in clustering analysis [65]; a greater ARI indicates greater agreement between clustering strategies.
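As an illustration of the chance adjustment, a minimal ARI implementation from the contingency-table marginals (following the standard formula discussed in [64]) might look like this; it is a generic sketch, not the paper's code:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(u, v):
    """ARI between two partitions given as equal-length label sequences."""
    n = len(u)
    pairs = Counter(zip(u, v))            # contingency table cells
    a = Counter(u)                        # row marginals
    b = Counter(v)                        # column marginals
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# Identical partitions up to relabeling score 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```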
Given that using PCA would not alter our results, we reduced our models to six. Table I shows that the highest ARI is obtained when comparing the partitions from Kvgg16 and

C. Transfer Learning Results
For each cluster centroid found with k-prod and with image clustering, the closest building in terms of its characteristics is selected, since in our algorithm centroids do not have to coincide with any observation, that is, with any particular building. The distance between each building and the centroid is computed as explained in Section III-A1, that is, using the cost function. We decided to test our approach not only with 5 centroids, as the elbow method in Fig. 3 suggested, but also with 15 centroids, which seemed more appropriate considering that we have more than 500 buildings available for study. The latter was an arbitrary choice that may depend on each setup.
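Picking the representative building for a centroid can be sketched as below. The default distance here is a simple squared Euclidean placeholder over numeric metadata, standing in for the actual k-prod cost function of Section III-A1, which handles mixed data:

```python
import numpy as np

def representative_building(centroid, buildings, cost=None):
    """Return the index of the building closest to a cluster centroid.

    `cost` stands in for the k-prod cost function of the paper; a squared
    Euclidean distance over numeric features is used as a placeholder.
    """
    if cost is None:
        cost = lambda x, c: float(np.sum((x - c) ** 2))
    costs = [cost(b, centroid) for b in buildings]
    return int(np.argmin(costs))

# Hypothetical metadata rows: e.g. floor area and number of storeys.
buildings = np.array([[1200.0, 3], [800.0, 2], [805.0, 2]])
centroid = np.array([810.0, 2.1])
print(representative_building(centroid, buildings))  # -> 2
```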
We created EC predictive models using the data from the 5 and 15 centroids of each of the approaches, respectively, and stored the weights. The metrics we used include the root mean square error (RMSE) and its coefficient of variation (CVRMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE). The performance on the centroids can be seen in Tables II and III, respectively, including the names of the reference buildings. With k-prod we were able to use all buildings to find the centroids, because its input does not include energy consumption. In the case of image clustering, we used only 70% of the buildings as train and 30% as test; the test set contains only 2000 values of temperature and consumption, to be aligned with the later prediction strategy. We can see that a greater number of centroids translates into greater variety in the metrics of our models: it creates very accurate models, such as the one for centroid 10 (3.85% CVRMSE), and very inaccurate ones, such as the model for centroid 4 (49.74% CVRMSE). It is not the goal of our research to create the best models for each of the buildings; however, we would expect a correlation between the accuracy of the model on the original building and that of the models that use it as "initialization" in our transfer learning strategy, that is, the models for the buildings in the same clusters.

Fig. 4. CVRMSE comparison between using the pretrained models as initialization and not using it in the DGP dataset for 15 clusters obtained using k-prod (left) and image clustering (right).
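For reference, the four error metrics can be written compactly as follows (a generic sketch; CVRMSE is expressed as a percentage of the mean observed value):

```python
import numpy as np

def rmse(y, yhat):
    """Root mean square error."""
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(yhat)) ** 2)))

def cvrmse(y, yhat):
    """RMSE normalized by the mean of the observations, in percent."""
    return 100.0 * rmse(y, yhat) / float(np.mean(y))

def mae(y, yhat):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(yhat))))

def mape(y, yhat):
    """Mean absolute percentage error, in percent."""
    y = np.asarray(y, dtype=float)
    return 100.0 * float(np.mean(np.abs((y - np.asarray(yhat)) / y)))

y, yhat = [100.0, 200.0, 300.0], [110.0, 190.0, 330.0]
print(rmse(y, yhat), cvrmse(y, yhat), mae(y, yhat), mape(y, yhat))
```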
1) Same-Domain Transfer Learning With K-Prod and Image Clustering: EC to EC: In this first scenario, we assumed that we only had 2000 values of EC and weather from the rest of the buildings and used our neural network algorithm in two ways.
1) Without transfer learning: using only the prior 2000 data points to estimate the next 24 h of EC. 2) With transfer learning: using the prior 2000 data points plus the weights of their representative building to estimate the next 24 h of EC. In Fig. 4 (left), we can see that our transfer learning strategy reduces the error in all groups when the prototypes obtained from k-prod clustering are used as initialization. On average, the CVRMSE improved by 19 percentage points when using transfer learning (23.64%) compared to not using it (42.78%). We have not depicted the case of 5 clusters, but there the average improvement is 21.6 percentage points in CVRMSE.
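The warm-start idea behind option 2 amounts to initializing the target network with the weights learned on the cluster's representative building before fine-tuning on the 2000 local points. The sketch below illustrates this generically with a dictionary-of-arrays stand-in for the actual Keras weights: every tensor whose shape matches is copied from the source, while the rest keep the target's own initialization.

```python
import numpy as np

def warm_start(source_weights, target_weights):
    """Copy every weight tensor whose shape matches from the source
    (trained on the cluster centroid) into the target network; keep the
    target's own initialization for mismatched layers."""
    out = {}
    for name, w in target_weights.items():
        src = source_weights.get(name)
        out[name] = src.copy() if src is not None and src.shape == w.shape else w
    return out

rng = np.random.default_rng(0)
# Hypothetical two-layer networks; "head" differs in size between them.
source = {"conv": rng.normal(size=(3, 3)), "head": rng.normal(size=(4,))}
target = {"conv": np.zeros((3, 3)), "head": np.zeros((5,))}
init = warm_start(source, target)
print(np.allclose(init["conv"], source["conv"]), init["head"].shape)  # True (5,)
```

With Keras, the same effect is typically obtained by building the target model with the same architecture and loading the stored source weights before continuing training.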
In Fig. 4 (right), we can see that our transfer learning strategy also reduces the error in all groups when the prototypes obtained from the image clustering are used as initialization. In general, the results are better with the k-prod approach, that is, when the metadata of the buildings are used to construct our relationship network.
We have separated the results for the I-BLEND dataset, whose source is different, in order to show how our strategy adapts to a different source. As can be seen in Table IV, where the representative (prototype) of each building is given between brackets, our TL approach improves the CVRMSE by several points for the three buildings.
2) Similar-Domain Transfer Learning: EC to HVAC Preferences: Finally, we analyzed the goodness of our transfer learning approach between different domains. We used a dataset that collects the HVAC setting points and the outdoor temperature, and tried our approach in three different rooms. That study did not provide much metadata, so we were not able to determine which group each room should be assigned to. Given that, as a preliminary approach, we defined an experiment to predict the next 24 h of HVAC setting points for the three rooms using the weights of each of the 15 buildings selected as centroids by our k-prod algorithm. Those buildings are completely unrelated to the rooms from which we collected the HVAC setting points; however, we assume that EC and HVAC usage are related. Fig. 5 shows the CVRMSE of the prediction of the HVAC setpoint for three particular rooms. First of all, the scenario without transfer learning is a straight line because no centroids are used to transfer anything. We can also see that, when using the model trained on the centroid of cluster 3, all rooms present a lower CVRMSE with transfer learning. Therefore, it is beneficial to use the weights of a neural network trained on EC data to initialize a network whose goal is to predict HVAC setting points. Fig. 6 provides a more extensive evaluation, where each room's model is initialized using each of the 15 possible centroids from the clustering analysis and the results are portrayed as boxplots. The line showing the CVRMSE when not using any transfer learning strategy always stays higher than every scenario with transfer learning.
Since there is not enough metadata to perform the k-prod clustering, we only tried the image clustering approach. With it, all 23 tested rooms belonged to cluster 3. Using the building chosen as representative of cluster 3, Office-Gustavo, we obtained on average an improvement of almost 4 percentage points in CVRMSE, going from 4.18% to 0.28%.
These preliminary results open up a wealth of opportunities for further research.

3) Comparison With Previous Studies on the Same Datasets:
To the authors' knowledge, this is the first time that the I-BLEND dataset and the Langevin-HVAC database are used for transfer learning and predictive purposes. The I-BLEND dataset was only aggregated and used as such, whereas the Langevin-HVAC database was transformed so as to generate the HVAC set-point prediction scenario. This makes the latter incomparable with previous studies, but it provides the possibility of new studies with a different perspective on the same dataset.
The DGP dataset has been used in the literature for transfer learning purposes on five occasions, as can be seen in Table V. Four of these studies do not present a grouping strategy; they group either manually or randomly. Their general improvements in CVRMSE are also lower than with our strategy, and they perform their experiments on fewer buildings, which makes them less robust and reliable in terms of generalization. We consider that our methodology is more general and can be more easily adopted and improved than theirs. It is also important to note that none of the other transfer learning studies have investigated the transfer of knowledge between different domains as we do.
Regarding predictive purposes in general, the DGP dataset has been used several times, as can be seen in Table VI. Studies that performed only one-step prediction show, as expected, better metrics than ours, which performed 24-step prediction. The work in [66] focused on three-step prediction and delivers metrics that are more comparable to ours. It is important to mention that our objective focused on scenarios with less data, and it is therefore complicated to make a fair comparison between works.

V. CONCLUSION AND FUTURE WORK
In this article, we have proposed a framework that can transfer knowledge from sensorized buildings to a large number of buildings from similar domains that have less data. The framework includes two methodologies that select the best buildings to transfer the knowledge from. We performed experiments using as source domain the prediction of EC in 5 and 15 main buildings. There were three target domains with different challenges: 1) EC prediction on other buildings from the same database using less data; 2) EC prediction on other buildings from a different database using less data; and 3) HVAC setpoint prediction (a related domain) with few data.
In all these scenarios, our framework was successful when compared to developing a predictive strategy on the target domain without it. With a very reduced dataset (2000 points), our approach improved the CVRMSE of prediction by 21.6% (k-prod) and 25% (IC) on average over all buildings when using data from the same domain (EC-EC). When using data from a different domain (EC-HVAC), we went from a 4.18% to a 0.28% CVRMSE.
This work opens up a great number of research lines: 1) to investigate how to include knowledge from more than one building simultaneously, and how the distance to the centroid influences the accuracy; 2) to investigate knowledge transfer between residential and nonresidential buildings; 3) to investigate how to adapt the framework to a federated learning scenario for data privacy; 4) to develop an optimization strategy for the deployment of sensors in buildings and perform a cost-benefit analysis; 5) to find the optimal number of buildings to be sensorized in order to achieve better accuracy, and the cut-off point for the amount of data to be used in the target domain; 6) to optimize the parameters of the Conv2DLSTM; and 7) to account for inconsistent features between datasets.