Federated Reservoir Computing Neural Networks

Abstract—A critical aspect in Federated Learning is the aggregation strategy for the combination of multiple models, trained on the edge, into a single model that incorporates all the knowledge in the federation. Common Federated Learning approaches for Recurrent Neural Networks (RNNs) do not provide guarantees on the predictive performance of the aggregated model. In this paper we show how the use of Echo State Networks (ESNs), which are efficient state-of-the-art RNN models for time-series processing, enables a form of federation that is optimal in the sense that it produces models mathematically equivalent to the corresponding centralized model. Furthermore, the proposed method is compliant with privacy constraints. The proposed method, which we denote as Incremental Federated Learning, is experimentally evaluated against an averaging strategy on two datasets for human state and activity recognition.


I. INTRODUCTION
In a centralized setting, a Machine Learning algorithm can make use of all the available training data to produce a predictive model that best generalizes to unseen data. Unfortunately, a centralized setting is not always feasible. When the data comes from multiple independent devices, constraints such as network connectivity, bandwidth, and privacy preservation can make it impossible to aggregate the training data within a centralized location.
In a typical Federated Learning scenario [1], the aforementioned problem is tackled by letting each client produce a local Machine Learning (ML) model trained on just the locally available data. Then, instead of the raw data, it is the models that are transferred to a centralized location such as a server. In the server, the models must be aggregated by some kind of strategy (e.g., averaging the weights) and then sent back to the clients if they need it for inference or further training. The critical point for an effective federation lies in the aggregation strategy, which ideally should produce a single compact model that incorporates all the knowledge from each client. However, due to the notorious difficulty in the interpretation of the weights of a neural network, it is not easy to give guarantees about the outcome of the aggregation.
Due to the relevance of Federated Learning in the case of clients collecting sensor data, in this paper we focus on federation techniques for Recurrent Neural Networks (RNNs), which are ML models especially suited for time-series processing. While there is a vast amount of literature regarding federated RNNs [2]-[6], here we focus on the paradigm of Reservoir Computing, which allows producing highly resource-efficient RNNs with a long-proven effectiveness in applications with sensor-gathered information such as human activity recognition [7], ambient assisted living [8], medical diagnosis [9], meteorological forecasting [10], [11], industrial applications (for blast furnace off-gas) [12], [13], and more [14]. In all these cases, Reservoir Computing approaches provide an unparalleled tool, both in terms of achievable predictive performance and of the trade-off between accuracy and efficiency in learning. We will show how the use of Echo State Networks (ESNs) [15], [16], from the Reservoir Computing paradigm [17], [18], allows the implementation of an optimal aggregation strategy that produces models equivalent to the corresponding centralized model directly trained on all the data. Nevertheless, with the proposed approach privacy preservation constraints are still satisfied.
This work has been developed within the context of the H2020 project TEACHING [19], an ongoing research endeavour targeting specifically the provisioning of innovative methods and systems to enable the development of the next generation of autonomous applications, leveraging a learning system distributed [20] over a cyber-physical system (CPS). In this context, TEACHING puts forward a human-centric perspective on CPS intelligence based on a synergistic collaboration between the human and the cybernetic intelligence. In particular, it leverages human reactions as a driver for continual adaptation [21] and personalization of neural models deployed on the devices at the edge of the CPS. This clearly depicts a federated learning scenario where such localized neural models will be adapted across time to provide personalized predictions to the single users, while maintaining centralized and aggregated models on the cloud, leveraging the full knowledge harvested by the personalized models. Within such a scenario, TEACHING is planning to leverage ESNs as the model of choice to implement the learning models distributed at the CPS edge, using the federated learning mechanisms described in this paper to build, maintain and consolidate the aggregated models at the network core.

Fig. 1. Architecture of an ESN. The input signal u(t) is fed to the recurrent reservoir. Then, a state x(t) is extracted from the reservoir, from which an output y(t) is computed.
In Section II we introduce ESNs and the commonly used aggregation technique denoted as Federated Averaging. In Section III we propose a novel federation technique, denoted as Incremental Federated Learning, for producing aggregated ESN models that are equivalent to a centralized model. In Section IV we perform an experimental comparison of the approaches of Federated Averaging and Incremental Federated Learning, by simulating a federated scenario over two datasets for human state and activity recognition. Finally, in Section V we draw the conclusions of the study.

A. Echo State Networks
Echo State Networks (ESNs) [15], [16] are an efficient ML approach for temporal data. They are part of the general framework of Recurrent Neural Networks (RNNs), but are based on the exploitation of the network activations from the point of view of a discrete-time dynamical system. The idea of studying the evolution of the recurrent network as a dynamical system is not unique to ESNs, but is shared under the umbrella of the so-called Reservoir Computing paradigm [17], [18]. While in this work we focus on ESNs, the techniques are also applicable to other Reservoir Computing models such as Liquid State Machines [22].
The architecture of an ESN consists of two components, which are illustrated in Fig. 1. One is the recurrent network that holds an internal state evolving over the time steps, which is called the reservoir. The other component, the readout, is a linear layer that takes as input a state of the reservoir and emits a prediction. Formally, let x(t) ∈ R^{N_R} denote the state of a reservoir with N_R recurrent units at a given time step t. Then, the evolution of the state for an input sequence of vectors u(1), . . . , u(t) ∈ R^{N_U} in a reservoir of leaky-integrator neurons can be described as

x(t) = (1 − a) x(t − 1) + a tanh(W_in u(t) + Ŵ x(t − 1)).  (1)

Equation (1) is parametrized by two matrices and a scalar: W_in ∈ R^{N_R × N_U} is the input-to-reservoir weight matrix, Ŵ ∈ R^{N_R × N_R} is the recurrent reservoir-to-reservoir weight matrix, and a ∈ R is the leaking rate, under the constraint 0 < a ≤ 1. The bias term is omitted for the sake of conciseness.
Unlike popular RNNs in which the parameters of the whole network are jointly trained by an iterative algorithm, in ESNs only the parameters of the readout are trained. This allows for an extremely efficient training process. In fact, the weights in the reservoir are initialized from a suitable random distribution and then left fixed. For the reservoir-to-reservoir matrix Ŵ, the initialization step also includes an important constraint on the spectral radius ρ(Ŵ) (the largest eigenvalue in absolute value), which is controlled in order to meet the condition for the stability of the reservoir dynamics [15]. Moreover, W_in and Ŵ are often initialized as sparse, i.e., to have a limited degree of connectivity between the units, to enable faster matrix operations.
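As a concrete sketch, the reservoir initialization and the state update of Eq. (1) can be written in a few lines of NumPy. Function names and default hyperparameter values here are our own choices for illustration, not prescriptions from the method itself:

```python
import numpy as np

def init_reservoir(n_u, n_r, spectral_radius=0.9, density=0.1, seed=0):
    # Untrained weights: drawn once from a uniform distribution, then fixed.
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-1.0, 1.0, (n_r, n_u))
    w_hat = rng.uniform(-1.0, 1.0, (n_r, n_r))
    # Sparsify: keep roughly `density` of the recurrent connections.
    w_hat *= rng.random((n_r, n_r)) < density
    # Rescale so that rho(W_hat), the largest |eigenvalue|, hits the target.
    w_hat *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w_hat)))
    return w_in, w_hat

def run_reservoir(inputs, w_in, w_hat, a=0.5):
    # Leaky-integrator state update of Eq. (1), one state per time step.
    x = np.zeros(w_hat.shape[0])
    states = []
    for u in inputs:
        x = (1.0 - a) * x + a * np.tanh(w_in @ u + w_hat @ x)
        states.append(x.copy())
    return np.stack(states)
```

Since no gradient flows through the reservoir, `init_reservoir` is called exactly once and its outputs are never modified afterwards.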
From a given state x(t), the output of the network is computed by the readout as

y(t) = W x(t).  (2)

Here, W ∈ R^{N_Y × N_R} is a weight matrix. In ESNs, W is the only matrix subject to training. As such, training proceeds as follows: 1) the input sequences from the training dataset are fed to the reservoir; 2) the relevant states on which the network must learn to perform predictions are collected column-wise into a matrix S ∈ R^{N_R × N_train}, where N_train is the number of such states, and the associated targets are collected into the matrix Y ∈ R^{N_Y × N_train}; 3) the matrix W is obtained as the solution to a least squares minimization problem between WS and Y.
In particular, a common algorithm for a regularized solution to the least squares problem is ridge regression. In this case, if β ∈ R^+ is the L2 regularization factor chosen by model selection, the readout weights are computed in closed form as

W = Y S^T (S S^T + β I)^{−1}.  (3)

By avoiding the tuning of the recurrent connections, the training process of ESNs can be particularly efficient. Moreover, by avoiding the use of gradient descent it does not run into the optimization problems associated with the popular algorithm of backpropagation through time [23].
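The closed-form solution of Eq. (3) is a one-liner in NumPy. In this sketch (the function name is ours) we solve the symmetric linear system rather than forming the inverse explicitly, which is numerically preferable but mathematically equivalent:

```python
import numpy as np

def train_readout(S, Y, beta=1e-3):
    # Ridge regression in closed form, Eq. (3): W = Y S^T (S S^T + beta I)^-1.
    # Solving (S S^T + beta I) W^T = S Y^T avoids an explicit matrix inverse.
    n_r = S.shape[0]
    return np.linalg.solve(S @ S.T + beta * np.eye(n_r), S @ Y.T).T
```

The cost is dominated by the N_R × N_R system solve, independently of the number of collected states N_train.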
Fig. 2. Federated Averaging scheme. Each client c sends their local matrix Wc to the server. After the aggregation of the models is performed in the form of a weighted average, the server sends back the same matrix W to all clients.

B. Federated Averaging
A straightforward and standard strategy for performing federated learning in neural networks is that of Federated Averaging (FedAvg) [24]. In this strategy, the weights of each locally trained model are aggregated in the central server by an element-wise average, possibly weighted by the size of the local datasets.
In the special case of ESNs, we can assume the scenario of a uniform configuration of the reservoir among all clients. In practice this means that the input-to-reservoir matrices W_in and the reservoir-to-reservoir matrices Ŵ will be identical in all clients. In this case, since W_in and Ŵ are fixed, Federated Averaging simply amounts to the transmission and averaging of the readout weights alone.
Formally, let S_c ∈ R^{N_R × N_train,c} be a matrix containing the states collected from the reservoir, locally to client c ∈ C. The states in S_c can be used to locally train the readout weights W_c in closed form as

W_c = Y_c S_c^T (S_c S_c^T + β_c I)^{−1},  (4)

where Y_c ∈ R^{N_Y × N_train,c} contains the label associated to each reservoir state in S_c, β_c ∈ R^+ is the L2 regularization factor, and I is the identity matrix. After the local readout weights W_c have been computed, they can be transferred to the server. The local readout weights are the only information sent to the server, which is never aware of the content of the actual data or states from the clients. The average weights are computed as

W = Σ_{c ∈ C} (N_train,c / N_train) W_c,  (5)

where N_train = Σ_{c ∈ C} N_train,c. After the aggregated weights are computed as in (5), the matrix W is transmitted back to the clients. Then, the clients can substitute their readout weights with the averaged version received from the server. A schematic view of how the matrices are transmitted between clients and server is shown in Fig. 2.
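Server-side aggregation as in Eq. (5) then reduces to a size-weighted mean of the transmitted readout matrices. A minimal sketch (function name ours):

```python
import numpy as np

def fedavg_readout(local_weights, local_sizes):
    # Eq. (5): element-wise average of the W_c, weighted by local dataset size.
    n_total = sum(local_sizes)
    return sum((n / n_total) * w for w, n in zip(local_weights, local_sizes))
```

With equal local dataset sizes this degenerates to a plain element-wise mean of the W_c.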
Let us ignore the impact of transmitting single scalar values such as N_train,c. Then, with the technique of Federated Averaging, each client needs to share with the server N_Y N_R floating-point values, corresponding to the entries in W_c. On the other hand, each client receives from the server N_Y N_R floating-point values for W. The transmission load is thus symmetric.

Fig. 3. Incremental Federated Learning scheme. Each client sends their local matrices Ac and Bc to the server. After the matrices are aggregated and multiplied to compute the optimal readout weights, the server transmits W back to all clients.
Averaging the readout weights is a straightforward technique that, however, does not give any strong guarantee about the performance of the aggregated model. In the next section we propose a different aggregation strategy that guarantees the optimal aggregated weights given the data and the reservoir.

III. PROPOSED METHOD
The peculiar characteristics of ESN training allow an optimal form of federated learning, in the sense that the resulting aggregated model is equivalent to the model that would be obtained by aggregating all the input data and using it for the training process. In fact, the proposed method, which we denote as Incremental Federated Learning (IncFed), exploits an algebraic decomposition of the typical readout training equation (3). While the approach is introduced here in the context of federated learning for Reservoir Computing neural networks, it is worth mentioning that it is generally applicable to ridge regression-based training of linear output layers in the presence of large volumes of data [25], [26].
Like the Federated Averaging technique, this method also assumes a uniform configuration of the reservoir among all clients. Locally, instead of computing the readout weights, each client c computes the matrices A_c ∈ R^{N_Y × N_R} and B_c ∈ R^{N_R × N_R} as

A_c = Y_c S_c^T,  (6)
B_c = S_c S_c^T + β_c I.  (7)

The matrices A_c and B_c are sent to the server, where they get summed as in the following equations:

A = Σ_{c ∈ C} A_c,  (8)
B = Σ_{c ∈ C} B_c.  (9)

After the summed matrices are computed as in (8) and (9), the server can compute the optimal readout weights W in closed form as

W = A B^{−1}.  (10)

Notice how (10) is mathematically equivalent to (3) if all data was locally available to the server. After the weights are computed as in (10), they can be transmitted back to the clients as in the Federated Averaging approach from Section II-B. A schematic view of how the matrices are transmitted between clients and server in the Incremental Federated Learning approach is shown in Fig. 3. The benefit of the proposed method with respect to Federated Averaging is that it allows a one-shot training with a virtually unlimited amount of data. The reason behind this is that both matrices A_c ∈ R^{N_Y × N_R} and B_c ∈ R^{N_R × N_R} do not depend on the number of training sequences. Whenever new data is available, the two matrices can be iteratively updated by the clients by simply adding the corresponding results from the associated computations. Formally, assume that the client c has already computed the matrices A_c and B_c. As soon as new input data is available together with the associated labels, the client can compute the matrices Ã_c and B̃_c as in (6) and (7) using only the newly available data, and then update the accumulated matrices as

A_c ← A_c + Ã_c,  (11)
B_c ← B_c + B̃_c.  (12)

Notice how equations (11) and (12) leave unchanged the size of the matrices to be sent to the server. It is this incremental nature of the approach that gives rise to the name Incremental Federated Learning. As long as the entries of S do not need to be rescaled by some function of the whole matrix S, Incremental Federated Learning is applicable to any model that can exploit ridge regression-like training.
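The whole IncFed exchange fits in a short NumPy sketch. Class and function names are ours; we fold the regularizer into B_c once at construction, so that repeated incremental updates keep B_c = S_c S_c^T + β_c I over all data seen so far (note that, with this formulation, the effective regularizer of the aggregated model is the sum of the local β_c):

```python
import numpy as np

class IncFedClient:
    # Client-side accumulators for Incremental Federated Learning (sketch).
    def __init__(self, n_r, n_y, beta=1e-3):
        self.A = np.zeros((n_y, n_r))  # accumulates Y_c S_c^T, Eq. (6)
        self.B = beta * np.eye(n_r)    # accumulates S_c S_c^T + beta I, Eq. (7)

    def update(self, S_new, Y_new):
        # Eqs. (11)-(12): fold newly collected states and targets in.
        self.A += Y_new @ S_new.T
        self.B += S_new @ S_new.T

def incfed_aggregate(clients):
    # Server side, Eqs. (8)-(10): sum the A_c and B_c, then solve for W.
    A = sum(c.A for c in clients)
    B = sum(c.B for c in clients)
    return np.linalg.solve(B, A.T).T   # W = A B^-1, B is symmetric
```

Because only A_c and B_c travel over the network, a client can call `update` any number of times between communication rounds without growing its payload.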
Even though the training equation is decomposed, and even though the server is aware of the reservoir weights that were used to produce the final states, we point out that the server is still unable to recover the original training data. In fact, the striking advantage of the proposed approach is that the training data is never transferred to other nodes and it is not possible to recover it from the transferred matrices, but at the same time the readout can be aggregated as if all the original training data was available to the server. Thus, Incremental Federated Learning enables an exact form of federation for ESNs while guaranteeing privacy preservation.
In Incremental Federated Learning, the number of floating-point values that get transferred from each client to the server is N_R^2 + N_R N_Y, while the transmission from the server to the clients involves N_R N_Y floating-point values. The added transmission load with respect to Federated Averaging (which is still constant with respect to the number of examples used for training) is justified by a better predictive performance of the aggregated model, as will be demonstrated in Section IV.
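To make the comparison of per-round payloads concrete, the two counts above can be computed directly (the function name and the example sizes are ours):

```python
def payload_floats(n_r, n_y):
    # Floating-point values sent client -> server per round.
    fedavg = n_y * n_r               # entries of W_c
    incfed = n_r * n_r + n_r * n_y   # entries of B_c plus A_c
    return fedavg, incfed
```

For instance, with N_R = 500 reservoir units and N_Y = 4 output units, FedAvg uploads 2 000 values per client while IncFed uploads 252 000; the extra N_R^2 term dominates, but it still does not grow with the amount of local training data.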

IV. EXPERIMENTAL EVALUATION
In ML applications governed by strict efficiency constraints, such as within low-power edge devices, the training process must be efficient. This is why ESNs are often the first choice in such contexts. The aim of our experimental evaluation is to compare the performance of federated ESNs, on one side using the Federated Averaging approach, and on the other side using the proposed Incremental Federated Learning method. Our analysis will involve two datasets for human state and activity recognition, which are well-suited for simulating a federated scenario.
In the following we first discuss the datasets that have been used to evaluate the proposed Incremental Federated Learning approach. Then, we describe our experimental setup and we discuss the results of the experiments.

A. Datasets
To perform the experiments we have chosen two datasets whose data is organized on a per-subject basis. This makes the datasets suited for a comparison of federated learning techniques by assuming a different edge device for each subject in the datasets. In the following we briefly describe the two datasets.

a) WESAD: WESAD [27] is a publicly available dataset for stress and affect detection from wearable devices. The time-series in the dataset are recorded from both a wrist- and a chest-worn device, in a lab study comprising 15 subjects. For our purposes we employ a subset of the data; in particular, we only consider the following signals: electrocardiogram, electrodermal activity, electromyogram, respiration, body temperature, and three-axis acceleration, for a total of 8 features. All considered signals are synchronized and sampled at 700 Hz. We consider the classification problem of predicting the state of the user from the aforementioned signals, restricted to 4 classes: baseline, stress, amusement, and meditation.

b) HAR: The Heterogeneity Activity Recognition dataset from smartphone and smartwatch sensors [28] is a dataset collected in real-world settings that can be used for benchmarking human activity recognition tasks. The time-series contained in the dataset have been produced by sensors commonly found in smartphones, namely accelerometer, gyroscope, magnetometer and GPS. The data was collected by having subjects carry smartphones or smartwatches while performing scripted activities in no specific order. The dataset includes 6 kinds of activities: biking, sitting, standing, walking, climbing stairs up, and climbing stairs down. For our purposes we only consider the signals coming from the accelerometer and the gyroscope, which are sampled at the highest frequency that the respective device allows, between 50 Hz and 200 Hz. The devices include 4 smartwatches and 8 smartphones, which record the data of 9 different subjects in total.

B. Experimental setup
Our aim is to measure and compare the performance of the two different federation strategies, namely Federated Averaging and the proposed Incremental Federated Learning. To evaluate the strategies in different contexts, we simulate four different degrees of availability of the clients in the federation. In particular, we simulate the scenarios in which only 25%, 50%, 75% or 100% of the subjects are available from each dataset.
While the number of subjects in the training set is controlled in order to simulate different numbers of clients in the federation, we emphasize that the number of subjects in the validation and test sets remains fixed in all experiments.
The best-performing hyperparameters, which are selected by evaluating on the validation set, are shared across all models in the federation. Therefore, the models within all clients share the same hyperparameters. Moreover, the models also share the same exact initialization for the weights in the reservoir (matrices W in andŴ).
We point out that the practice of using all the training data to perform hyperparameter tuning is obviously not realistic in practical federated learning applications. As an alternative, one could explore how the models behave when the hyperparameters are chosen on the smallest fraction of the dataset. In this study we have chosen to use the entire training set for the hyperparameter search in order to limit the number of variables involved in our experimental evaluation.
The two datasets are processed as follows.

a) WESAD: The 8 synchronized time-series that are considered from the WESAD dataset are split into chunks of 350 samples (0.5 seconds per chunk at 700 Hz) and each chunk is associated to its target class.
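The chunking step above can be sketched with a single reshape (function name ours; any trailing samples that do not fill a chunk are dropped, which is one reasonable convention among several):

```python
import numpy as np

def chunk_series(x, chunk_len=350):
    # Split a (T, n_features) series into non-overlapping chunks.
    # At WESAD's 700 Hz sampling rate, 350 samples cover 0.5 seconds.
    n_chunks = x.shape[0] // chunk_len
    return x[: n_chunks * chunk_len].reshape(n_chunks, chunk_len, x.shape[1])
```

Each resulting chunk is then paired with the class label of the segment it was cut from.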
For WESAD, the hyperparameter tuning is performed by a hold-out evaluation strategy. In detail, the dataset is split so that 3 subjects (∼ 20%) are used for the test set (used for testing the models after aggregation) and 3 other subjects (∼ 20%) for the validation set (used for validating the models after aggregation). All other subjects are used for training.
In Table I we report the range of hyperparameters explored for each model on WESAD.

b) HAR: From the HAR dataset we only select the time-series associated to the accelerometer and to the gyroscope. For each of these, the dataset specifies the values for each of the three axes (x, y, z), for a total of 6 features. Additional features, namely the mean, standard deviation, minimum and maximum of the aforementioned 6 features, are computed using a sliding window of size 200 with 50% overlap, as described in detail in [29]. Moreover, we also include as additional features the magnitudes of the accelerometer and gyroscope data. Therefore, the resulting data used to train the models is composed of a total of 32 features.
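The sliding-window statistics described above can be sketched as follows (function name and argument defaults ours; this computes only the 24 windowed statistics, with the raw signals and the two magnitude features handled separately):

```python
import numpy as np

def window_features(x, win=200, overlap=0.5):
    # Per-feature mean/std/min/max over sliding windows with 50% overlap.
    step = int(win * (1.0 - overlap))
    rows = []
    for start in range(0, x.shape[0] - win + 1, step):
        w = x[start:start + win]
        rows.append(np.concatenate([w.mean(0), w.std(0), w.min(0), w.max(0)]))
    return np.array(rows)
```

With 6 input features, each window yields 4 statistics per feature, i.e., 24 derived features per window.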
Regarding the length of the sequences, we have chosen to split the time-series into sequences of length 500.
The hyperparameter tuning is performed by a hold-out evaluation strategy. In detail, the dataset is split so that 2 subjects (∼ 20%) are used for the test set (used for testing the models after aggregation) and 2 other subjects (∼ 20%) for the validation set (used for validating the models after aggregation). All other subjects are used for training.
In Table II we report the range of hyperparameters explored for each model on HAR.

C. Results
In Table III we report the best performing hyperparameters, chosen on the validation set, for Federated Averaging and Incremental Federated Learning on the WESAD task. The corresponding results for the HAR task are reported in Table IV. As it can be observed from Table III and Table IV, the accuracy on the validation set is similar for both approaches.
To evaluate the actual generalization performance of the two approaches we have tested the models on the held-out test set. The results for WESAD and HAR are reported respectively in Table V and in Table VI, where we show the average accuracy and standard deviation over 3 repetitions of the experiments. The tables show the results of the experiments with a varying number of subjects, or clients, in the federation. In particular, the reported values on the training and test set are measured after aggregation (where applicable). In federation terms, the test accuracy indicates the average performance of the aggregated model for a set of clients that join the federation at a later time.
In the tables, the accuracies of a random baseline are included for reference. Since the distribution of the classes is balanced, the random model sets a baseline over which to evaluate how much the other models are actually learning from the data. Note that, for clarity and for numerical confirmation, we also report the performance obtained by a centralized ESN, which turns out to be perfectly compatible with that achieved by the Incremental Federated Learning ESN. Moreover, we also report the outcome of applying the Federated Averaging method to the popular LSTM model [30]. For a fair comparison, the LSTM has been instantiated with a number of trainable parameters comparable to that of the ESN. As highlighted in bold, the reader can observe from Table V and Table VI that the Incremental Federated Learning method has superior predictive performance with respect to Federated Averaging in all cases except one. Interestingly, the performance of the federated LSTM on both datasets is consistently surpassed by the ESN, both in the Federated Averaging scenario and in the Incremental Federated Learning scenario. We speculate that this behaviour may be due to the increased complexity of the LSTM: averaging the weights of an LSTM could be a more destructive operation than averaging the weights of a simple ESN.
Still from Tables V and VI, it is interesting to observe how both training and test accuracies vary with the number of subjects in the training set. For Federated Averaging, increasing the number of clients can drastically decrease the training accuracy. This can be interpreted as a form of overfitting in the case of few subjects: in fact, for a high training accuracy we can observe an associated poor generalization performance on the test set. On the other hand, an increasing accuracy on the test set for increasing numbers of subjects is to be expected and it is clearly highlighted by our results.

V. CONCLUSIONS
Thanks to their efficient training process, ESNs and, in general, Reservoir Computing approaches are often considered ideal for deployment on low-powered devices such as those found at the edge. In this work we have shown that it is possible to further exploit the peculiar training method of ESNs to improve their predictive accuracy in a federated learning scenario. In particular, the novel Incremental Federated Learning approach that we propose makes Reservoir Computing models such as ESNs especially suited for federated infrastructures.
The advantages of the proposed approach are manifold. First, the long-proven characteristics of ESNs make it possible to train predictive models very efficiently, even directly on the edge. Second, the global model that is produced by aggregating the local models is optimal in the sense that no better equivalent model could have been produced by gathering all the training data within a centralized node. Third, privacy constraints are preserved since the potentially sensitive training data is never transmitted over the network and remains confined within each local node.
We point out that the proposed approach is not limited to a specific architecture of the reservoir. While here for simplicity we have employed an ESN with leaky-integrator neurons, the approach can be extended without modifications to more complex reservoirs (e.g., Deep ESNs [31]) which, thanks to their characteristics, are often better suited for modeling multiple time-scales in the data. This makes Incremental Federated Learning a highly versatile federation method for Reservoir Computing models.
ACKNOWLEDGMENT

This work is supported by the EC H2020 programme under project TEACHING (grant n. 871385).