Architectural richness in deep reservoir computing

Reservoir computing (RC) is a popular class of recurrent neural networks (RNNs) with untrained dynamics. Recently, advancements on deep RC architectures have shown a great impact in time-series applications, showing a convenient trade-off between predictive performance and required training complexity. In this paper, we go more in depth into the analysis of untrained RNNs by studying the quality of recurrent dynamics developed by the layers of deep RC neural networks. We do so by assessing the richness of the neural representations in the different levels of the architecture, using measures originating from the fields of dynamical systems, numerical analysis and information theory. Our experiments, on both synthetic and real-world datasets, show that depth—as an architectural factor of RNNs design—has a natural effect on the quality of RNN dynamics (even without learning of the internal connections). The interplay between depth and the values of RC scaling hyper-parameters, especially the scaling of inter-layer connections, is crucial to design rich untrained recurrent neural systems.


Introduction
Recurrent neural networks (RNNs) represent a consolidated computational abstraction for learning with variablelength times-series data.Their extensions toward deep architectural settings, i.e., deep RNNs, are of special interest as they enable the development of richer and multiscale internal representations of the dynamical input information in an effective fashion, with breakthrough applications in relevant domains such as language, document and speech processing [21,23,35].However, a major downside of traditional approaches is that training (deep) RNNs, typically by means of gradient backpropagation through time [43] algorithms, is difficult [3] and involves high computational requirements.Hence, whenever accuracy is not the only constraint in place, finding efficient alternatives to standard RNN training becomes of paramount importance (e.g., in distributed intelligence applications embedded on low-powerful devices).
In this context, randomization-based deep neural networks algorithms provide a useful trade-off between complexity and accuracy by exploiting as much as possible the architectural biases of deep neural networks models [20,39].The fundamental idea is that of leaving the hidden layers of the deep architecture untrained after random-but constrained-initialization, thereby limiting the learning algorithms to operate only on an output readout component.The resulting striking advantage is that the training algorithms are much simpler and faster, and can even be implemented at the edge of a distributed computing setup [2].Reservoir computing (RC) [32,41] is the reference paradigm for randomization-based RNN models.In this sense, echo state networks [25,28] represent the most notable instance of the RC paradigm.The crucial insight is that it is possible to separate the treatment of the dynamical component of the network, called the reservoir, from that of the readout part.The dynamical reservoir can be left untrained after random initialization provided that it implements an asymptotically stable input-driven dynamical system [25,46].
In particular, the recently introduced study of deep RC systems [13,17,18], comprising a stacked composition of multiple untrained reservoirs, is a research line of special interest as it allows to combine the expressiveness of deep architectures for time-series with the efficiency of RCbased training algorithms.The approach has proven over the past few years to be extremely effective in addressing real-world problems in several application domains, including industrial plants monitoring [5,7], speech and polyphonic music processing [19], meteorological forecasting [31], energy consumption and wind power generation prediction [24].Beyond the practical merits, the analysis of the properties of deep RC models is also important because it enables to characterize the intrinsic behavior of deep recurrent systems prior to, or in the absence of, learning of the recurrent connections.Relevantly, studies on deep RC showed that a stacked composition of recurrent hidden layers inherently produces a qualitative diversification of the dynamical response of the neurons in the successive layers of the architecture [18,19].However, despite the great recent interest in this class of neural network models, the effects of architectural choices on the quality of the generated dynamics still remain unexplored in the literature.
To fill this gap, in this paper, we go more in-depth into the analysis of the dynamics developed by the hidden layers of deep RC neural networks.Specifically, we ask whether (and under what circumstances) a deep setting of an RC architecture is able to unleash a beneficial process of enriching state dynamics.We address this research question by performing an empirical evaluation that uses several metrics for quantifying the concepts of ''richness'' and ''goodness'' of deep reservoirs.The considered metrics cover heterogeneous aspects, including dynamical stability, diversity and intrinsic dimensionality of the reservoir manifold, as well as the conditioning of the resulting learning problem for the readout.The experiments reported in this paper are conducted on several datasets with different properties, including popular RC benchmarks as well as real-world time-series.As our investigation aims at analyzing the intrinsic quality of state dynamics in progressively deeper layers of an RC-system, we follow an unsupervised approach and take aside all the aspects related to training the readout based on specific target output signals.Overall, this work intends to shed light on the intrinsic effects of layering in RNNs, and on the interplay with the values of fundamental RC-based scaling hyperparameters.
The rest of this paper is organized as follows.We start by introducing the deep reservoir computing paradigm in Sect. 2. After that, in Sect.3, we address the richness of reservoir dynamics, presenting the metrics used to quantify it.The experimental analysis is reported in Sect. 4 (with additional results given in Appendix A).Finally, Sect. 5 concludes the paper.

Deep reservoir computing neural networks
A deep reservoir computing (Deep RC) [17,18,20] neural network is made by a stacked composition of multiple hidden layers of sparsely connected recurrent neurons that are left untrained after initialization.Although this concept can be in principle instantiated to both continuous and discrete time dynamics, here we refer to the latter, and use the formalism introduced by the deep echo state network (DeepESN) model in [18].In a deep reservoir system, the time varying input signal is applied only to the first recurrent layer.Each successive reservoir layer is driven by the dynamics generated by the previous reservoir in the stack.The overall deep RC architecture is shown in Fig. 1.
In mathematical terms, at each time-step t an X-dimensional input feature denoted by xðtÞ is fed to the first reservoir layer, which filters it according to the following state transition function: where h ð1Þ ðtÞ is the H ð1Þ -dimensional reservoir state of the first layer at time-step t, V ð1Þ and W ð1Þ are, respectively, the input and the recurrent weight matrix for the first layer, and tanh denotes the element-wise applied hyperbolic tangent nonlinearity.Following the pipeline of computation, each successive i-th layer (with i [ 1) receives as input at timestep t the state of the previous reservoir at the same timestep, and implements a state transition function described as follows: where h ðiÞ ðtÞ is the H ðiÞ -dimensional state of the i-th reservoir layer, V ðiÞ is the inter-layer weight matrix that modulates the information transmission between layer i À 1 and layer i, and tanh is the non-linearity as in the previous case.Notice that in both equations 1 and 2, we omit the bias terms for the ease of notation.Furthermore, a more general definition pertaining to leaky integration reservoir neurons [29] can be found, e.g., in [18].The formulation in equations 1 and 2 is complemented by a set of initial reservoir conditions, i.e., h ð1Þ ð0Þ; . ..; h ðLÞ ð0Þ, all of which are typically set to a null vector 0. Also note that while in general the number of neurons in the different layers of the deep reservoir architecture can be different, for the sake of simplicity in this paper, we focus on a setup in which all reservoir layers contain the same number of neurons, indicated by H. Crucially, all the weight matrices V ð1Þ ; . ..; V ðLÞ , W ð1Þ ; . ..; W ðLÞ , involved in the deep reservoir computation are left untrained after initialization.This initialization needs to consider constraints related to asymptotic stability of the resulting dynamical system, a condition that in this context is known under the name of Echo State Property (ESP) [25,46].The interested reader can find a detailed analysis of the stability conditions for deep RC neural networks in [14].The essential outcome of such analysis is that it is possible to use randomized and untrained weights in the reservoirs, provided that the magnitude of the weights is kept under control and properly scaled.As such, to initialize the reservoirs in a deep RC system, we randomly draw all the weight values from a uniform distribution in ½À1; 1, and then: • Rescale the input weight matrix V ð1Þ to have a maximum singular value equal to an input scaling x in ; • For each layer i [ 1, rescale the inter-layer weight matrix V ðiÞ to have a maximum singular value equal to an inter-scaling x ðiÞ il ; • For each layer i ! 1, rescale the recurrent weight matrix W ðiÞ to have a spectral radius1 equal to q ðiÞ .Controlling the values of the spectral radii of the reservoirs in the deep architecture is particularly important.Commonly, values smaller than, but close to, 1 are considered as those giving the best trade-off between expressivity and stability [25,32].
In common setups, the deep RC neural network also comprises a readout output layer that performs the output computation being fed by reservoir states, and that is the only trained component of the architecture [18].In this work, we focus explicitly on studying the quality of the neural representations developed in the reservoir, and all aspects related to the readout part of the network is left aside of the current study2 .

Richness of reservoir dynamics
Measuring the quality of reservoir systems' dynamics is a matter involving analysis under several perspectives.While there is not a unique commonly adopted measure of reservoirs richness, there are a number of desired properties that a reservoir should possess.Stability, diversity and high intrinsic dimension of internal state dynamics, in addition to a ''good'' conditioning of the resulting training problem (in the reservoir space) are among the desirable properties that a randomized recurrent system should have, and set the ground for the experimental analysis in this paper.Addressing the above-mentioned properties, we specifically quantify the richness of reservoir dynamics by using the measure described in the following sub-sections.Notice that in introducing such measures, we refer to the case in which the deep reservoir is driven by a unique long input time-series and we are interested in assessing the quality of the transient state dynamics at each time-step.This setup can be straightforwardly extended to the case in which we have many input time-series and we are interested in evaluating the richness of the final reservoir representation developed in the last time-step of each time-series.
Specifically, to measure the richness and goodness of reservoir dynamics, we resort to the following: ESP index (described in Sect.3.1), average state entropy of reservoir neurons activations (described in Sect.3.2, linearly uncoupled dynamics (described in Sect.3.3), and condition number (described in Sect.3.4).

ESP index
Studying the dynamical behavior of a reservoir layer under the effect of the external driving input is of paramount practical importance.In particular, studied as a non-autonomous dynamical system, each reservoir layer should be asymptotically stable, i.e., robust to perturbations in the initial conditions (equivalently, in the input [25]).This means that after a transient period, the trajectory of the reservoir state should be uniquely determined by the driving input.In practice, if the system is started with several different initial conditions and receives the same input time-series, after a transient period, all the possible trajectories in the reservoir state space synchronize (to a unique attractive trajectory).This idea has been used in [10] to develop a practical measure of reservoir stability, called the ESP index, and consisting in computing the average deviation among reservoir state trajectories obtained starting from different initial conditions under the influence of the same input signal.
Specifically, referring to the dynamics in the i-th reservoir layer, and given in input the time-series xð1Þ; . ..; xðSÞ, we first compute-as a reference trajectory-the states obtained by starting the reservoir from the 0 state, i.e., h ðiÞ 0 ð1Þ; . ..; h ðiÞ 0 ðSÞ.We then pick P randomly chosen initial conditions and, using the same driving input, we compute the correspondingly achieved reservoir states, i.e., fh In applications, one can then use the ESP ðiÞ index computed as in equation 4 to verify that a reservoir layer under consideration is effectively in an asymptotically stable regime.
In practice, whenever ESP ðiÞ index ¼ 0, we can claim that the ESP is practically satisfied by the reservoir under the influence of the given input signal.On the other hand, whenever ESP ðiÞ index takes values [ 0, the reservoir is sensible to perturbations, and the ESP is not met for the given input signal.
We find it useful to remark here that the case in which ESP index [ 0 is potentially harmful for practical applications, as it implies some form of instability in the reservoir dynamics, which might ultimately result in poor generalization performance [22].In this paper, then, we use the ESP index to measure the ''goodness'' of a specific reservoir system under a given input signal.Accordingly, the reservoir is considered ''good'' if ESP index ¼ 0, and ''bad'' otherwise.

Average state entropy (ASE)
A popular downside of RC-based approaches is that the random and untrained settings of the reservoir connections (under stability constraints) concretely results into a limitation of the variety and expressiveness of the information that is conveyed by the reservoir state.Intuitively, the components of the system's state should be as diverse as possible to provide a richer pool of dynamics from which the trainable part can appropriately combine.From an information-theoretic point of view, we can measure this form of richness by means of the entropy of instantaneous reservoir states.In particular, to this purpose, we follow the work in [34] and adapt the use of the efficient estimator of Renyi's quadratic entropy in [36,37] to the case of deep RC neural networks.Specifically, for each time-step t and for each layer i, we compute the value of instantaneous state entropy H ðiÞ ðtÞ as follows: where h ðiÞ j ðtÞÞ is the j-th component of h ðiÞ ðtÞ, and K indicates a Gaussian kernel 3 .Given an input time-series of length S, for each layer, we average the values H ðiÞ ðtÞ from equation 5, obtaining the average state entropy (ASE): which we use as one of the metrics to quantify the ''goodness'' of deep reservoir layers, where higher values are preferable and denote richer dynamics.

Linearly uncoupled dynamics (LUD)
Another manifestation of the possibly limited variety of reservoir dynamics is typically observed in the high correlation among the reservoir state variables.In other words, reservoir's units exhibit behaviors that are strongly coupled among each other [32,45], i.e., the system state contains redundant components.This means that the reservoir state's trajectories are constrained into a state manifold whose dimension can be much smaller than the size of the reservoir.This phenomenon has been experimentally linked to the inherent Markovian organization of reservoir state spaces [12,40] resulting from contractive initialization [11].The stronger is this coupling effect, the lower is the actual dimensionality of the reservoir manifold, and hence the poorer is the quality of the resulting reservoir dynamics.In this spirit, we propose to assess the richness of the reservoir by measuring the actual dimensionality of uncoupled state dynamics.Intuitively, the dimension of a reservoir state manifold can be thought of as the number of orthogonal directions in the reservoir state space.Hence, a simple way to compute the number of (linearly) uncoupled reservoir dynamics consists in performing PCA of the state activations, and then computing the number of principal components that explain the most of the variance in the sampled state space.Specifically, suppose to drive the dynamics of the i-th layer in a deep reservoir by an input time-series, and let us denote by H ðiÞ the column-wise concatenation of the resulting reservoir states, and by Intuitively, the larger is the value of R ðiÞ j , the larger is the amount of variability of reservoir activations explained by the j-th principal component.Ideally, i.e., in a ''good'' reservoir, the whole variability would be spread across many principal components rather than being concentrated on a few components.On this ground, we propose to evaluate the number of linearly uncoupled dynamics (LUD) as the number of principal components that are necessary to explain a desired amount of (normalized) reservoir space variability h 2 ð0; 1.In particular, we compute the following: where we use the value of h ¼ 0:9, meaning that we measure the number of linearly uncoupled directions in the reservoir state space that can explain at least the 90% of the whole observed variability.The higher is the value of LUD ðiÞ the richer are considered to be the dynamics of the i-th reservoir layer.

Condition number (C)
Ideally, the role of a reservoir layer should be to nonlinearly project the driving input signal into a high dimensional feature space, where-based on the theoretical insights of the Cover's theorem [6]-the original learning problem is more likely to be solved linearly.It is then interesting to investigate the ''goodness'' of the reservoir operation in terms of how much it actually facilitates the readout training phase.A way to quantify this is given by the condition number of the reservoir state matrix [32].Specifically, using again H ðiÞ to denote the state matrix obtained the column-wise concatenation of the activations in the i-th reservoir layer, and r ðiÞ 1 ; . ..; r ðiÞ H for its ordered singular values, we can calculate the condition number as follows: Higher values of C ðiÞ indicate bad conditioning of the learning problem in the i-th reservoir layer state space.We therefore consider lower values of C ðiÞ to be preferable, and characteristic of ''good'' reservoirs [27].

Experimental analysis
In this section, we describe our experimental analysis on the richness of reservoir state dynamics in deeper layers of deep RC neural networks.To this aim, we make use of the metrics of dynamical richness introduced in Sect.3, empirically evaluated in different architectural configurations of the deep RC system.Note that our analysis targets the intrinsic goodness of untrained (and input-driven) state dynamics in deep RNNs.Hence, our experiments follow an unsupervised setup, abstracting from the particular target output signal associated with the input time-series.
We introduce the used datasets in Sect.4.1, while in Sect.4.2, we report the details of the experimental conditions for our experiments.The results are provided and analyzed in Sect.4.3.

Datasets
We considered a diverse pool of datasets to cover the cases of i) both uni-dimensional and multi-dimensional input features at each time-step, ii) both single-sequence (i.e., a unique long-input time-series) and multiple-sequence (i.e., several-input time-series) datasets, and iii) both synthetic and real-world time-series.The datasets are briefly illustrated in the following.
We first consider a RANDOM dataset, where the input at each time-step xðtÞ is a uni-dimensional variable that is sampled from a uniform distribution in [0, 0.5].The generated time-series is 5000 time-steps long.Notice that this case encompasses a number of classical benchmarks in RC literature, from estimation of memory capacity [26] to learning NARMA systems [1], where a specific target signal is coupled to the random input.Here, instead, abstracting from the possible supervised tasks, we are specifically interested in the assessment of the richness of reservoir dynamics under the influence of a randomized (and temporally unstructured) input signal.The dataset contains a single sequence, and it is shown in Fig. 2.
The second dataset contains a single time-series of sampled intensities from a far-infrared LASER, taken from the Santa Fe competition [42].This dataset is representative for real-world time-series in chaotic regime and is a popular benchmark in the RC literature.We used the first 5000 time-steps of the dataset, to which we applied a minimal preprocessing consisting in a simple rescaling by a factor of 0.01.The resulting LASER dataset is shown in Fig. 3.
The third dataset is the electrocardiogram (ECG) dataset [33] from the UCR archive [4].In this case, there are multiple time-series, representing recordings of the cardiac electrical activity in normal and myocardic infarction conditions.We used the first 100 time-series in this dataset, where each of them regards a single heartbeat, and has a length that varies between 54 and 147 time-steps.The input feature at each time-step xðtÞ is two-dimensional, and contains the information regarding leads 0 and 1 of the ECG recordings.Examples of the time-series in this dataset are shown in Fig. 4.
The fourth dataset is the LIBRAS4 dataset [8] from the UCI repository [9].This dataset contains discrete representations of hand movements that represent specific types of language signs.Each time-series is obtained after preprocessing (with time-frame selection) of a recorded video during which the hand movement is performed, and the The last dataset we used is the Character Trajectories (CHAR) dataset [44] taken from the UCI repository [9].The dataset contains data gathered from a WACOM tablet, while a user is writing alphabet characters.Each timeseries pertains to the writing of a single character, and the input information at each time-step represents a three-dimensional velocity trajectory of the pen tip during the writing.We used the first 300 sequences of the dataset, with lengths varying between 63 and 172 5 .We report a couple of time-series examples from this dataset in Fig. 6.

Experimental settings
Following the model's description in Sect.2, in our experiments we used a simple strategy to designing the deep RC system, in which all the layers share the same values of the crucial hyper-parameters 6 .Specifically, all the layers shared the same value of the spectral radius q (i.e., q ð1Þ ¼ q ð2Þ ¼ . . .¼ q ðLÞ ), and all the layers at depth greater than one shared the same value of the inter layer scaling x il (i.e., x ð2Þ il ¼ . . .¼ x ðLÞ il ).In our experiments, we explored deep RC hyper-parametrizations varying q in f0:1; 0:3; . ..; 1:3g, x in and x il in f0:1; 0:2; 0:5; 1; 2; 5; 10g.We considered networks with a maximum number of L ¼ 7 layers, each of which containing H ¼ 100 sparsely connected reservoir neurons, where each reservoir neuron receives one input connection from the previous layer (i.e., from the external input for the first layer, and from layer i À 1 for every layer i [ 1) and one recurrent connection from the neurons in the same layer.
We assessed the richness of state dynamics developed in progressively deeper layers of a deep RC system.To this end, we computed the metrics described in Sect.3. In particular, we computed the ESP index (see Sect. 3.1, where we set P ¼ 50), the ASE entropy measure (see Sect. 3.2), the dimensionality of the reservoir manifold LUD (see Sect. 3.3, where we set h ¼ 0:9), and the condition number of the reservoir states C (see Sect. 3.4).For the RANDOM and LASER datasets, where a single time-series is given, we used a transient of T ¼ 500 time-steps to washout initial conditions in the state.For the remaining datasets, in which several time-series are given, to compute the richness quality measures we collected the reservoir states in the last time-step of each time-series.For each configuration explored for the setting of the deep RC networks, we repeated the experiments a number of 50 times (with different random seeds for the initialization), aggregating all the achieved results.In particular, results presented in this Fig. 4 Examples of time-series from the ECG dataset paper are aggregated by the median operator, enabling a robust analysis with respect to possible outliers.In considering the performance to the variation of a specific hyper-parameter, we consider fixed the others, using as a reference (default) hyper-parametrization the one given by: q ¼ 0:9, x in ¼ 1, x il ¼ 1.
In the next Section 4.3, we illustrate the achieved results considering the input data without further rescaling.For the sake of completeness, and for homogeneity of information processing with uniform scalings coefficients in the various cases, the same experiments were also performed on input data after rescaling to the common range [-1, 1] along each input dimension.The outcomes of this latter analysis are varying the spectral radius q hyper-parameterization (vertical axis in each plot).The plots show ESP index (the lower the better), ASE (the higher the better), LUD (the higher the better), and C (in log 10 scale, the lower the better).Deviation across datasets and dataset-specific plots are reported in Appendix A Fig. 8 Input scaling variability.Richness of reservoir dynamics measured in deeper reservoir layers (horizontal axis in each plot), varying the input scaling x in hyper-parameterization (vertical axis in each plot).The plots show ESP index (the lower the better), ASE (the higher the better), LUD (the higher the better), and C (in log 10 scale, the lower the better).Deviation across datasets and dataset-specific plots are reported in Appendix A Fig. 9 Inter-layer scaling variability.Richness of reservoir dynamics measured in deeper reservoir layers (horizontal axis in each plot), varying the inter-layer scaling x il hyper-parameterization (vertical axis in each plot).The plots show ESP index (the lower the better), ASE (the higher the better), LUD (the higher the better), and C (in log 10 scale, the lower the better).Deviation across datasets and datasetspecific plots are reported in Appendix A given in Appendix A, and are qualitatively in line with those reported in Sect.4.3.

Results
We report here the results on reservoir's richness at increasing layer depth, analyzing the effect of layering when varying the spectral radius q (Fig. 7), the input scaling x in (Fig. 8), and the inter-layer scaling x il (Fig. 9).For the sake of neatness of presentation, the results that we show here have been aggregated on all the datasets.The interested reader can find the detailed results, including the deviations across the different datasets and the results aggregated individually on each dataset, in Appendix A. Notice that in the provided figures the values are represented by a color code, where for ASE and LUD higher (i.e., lighter) values indicate a richer reservoir, for ESP index and C (given in log 10 scale) lower (i.e., darker) values are preferable.To ease the results description, we drop the superscript (i) from the notation introduced in Sect.3, as here the reference to the layer in the architecture is made explicit in the plots.
Analyzing the results in Fig. 7, we first observe, by looking at the ESP index results, that network's dynamics tend to become progressively more unstable in progressively higher layers.In fact, while the ESP index is zero (hence the ESP condition is met) in the first layer for all the explored values of q (up to q ¼ 1:3), in progressively deeper layers, values of q [ 0:9 determine progressively larger values of the ESP index [ 0, hence more unstable dynamics.Figure 7 also shows the trend in which for every choice of the layer's depth, the ASE and LUD values tend to increase and C tends to decrease, thereby indicating richer dynamics, with the increasing value of the spectral radius q.On the other hand, at the increase in the considered layer's depth, when the dynamics remain in a stable regime (i.e., when ESP index % 0), results show a more or less marked depletion of the reservoir quality.A trend of enrichment at the increase of layer's depth is actually observed only when the reservoir is in an unstable regime (i.e., when ESP index [ 0), which makes this potential improvement not usable in practice.
Looking at Fig. 8, we observe that keeping fixed the layer's depth, the goodness of reservoir dynamics tends to improve at the increase in the input scaling hyper-parameter x in .The reservoir dynamics tend to be more stable (lower values of ESP index ) and richer (higher values of ASE and LUD, lower values of C).On the other hand, keeping the value of x in fixed, at the increase in the layer's depth, the results show progressively less stable dynamics (though generally close to ideal values of the ESP index 7 especially for higher values of x in ).The other metrics do not show a uniform trend.While ASE tends to worsen at the increase in layer's depth for x in ! 1, and to improve (albeit with low values) for x in \1, LUD shows an opposite trend, while C shows a general improvement for increasing depth (though less significant for x in \1).
Figure 9 shows interesting and neat results on the synergy between the inter-layer scaling x il and the depth of the recurrent architecture.Keeping fixed the depth, all the measured metrics display a trend of improvement with the increase of x il .Moreover, at the increase of the depth of the considered reservoir layer, the richness of system's dynamics generally shows a marked improvement for values of x il [ 1 (with a peak for x il \10), a marked degradation for x il \1, and a trend of substantial nonvariability for x il ¼ 1.
From the analyzed results, we can make some overall comments.In general, the degree of richness of the reservoir dynamics is sensitive to both the values of the reservoir hyper-parameters and the depth of the reservoir layer in the architecture.Our results show that in progressively deeper networks, it is possible to observe either a progressive enrichment of the dynamics or their progressive impoverishment.This fundamental observation is interestingly linked to the empirical result that prediction accuracy in DeepESNs is generally non-monotonic with the addition of more reservoir layers (see e.g., [15,19]): adding layers is not per se sufficient to improve the goodness (nor the performance) of the reservoir system in applications.In this sense, the value of the scaling of interlayer connections is crucial: values of x il [ 1 tend to result in more stable and richer dynamics, at the same time setting up a better conditioned learning problem for readout.In the sense of the progressive qualitative change of dynamics in deeper layers, the other scaling hyper-parameters seem to play a decidedly less prominent role.
Finally, a few hints on model selection in practical applications of deep RC can be derived from the above observations and put forward for the RC practitioners.Most relevantly, the number of reservoir layers should be considered as a crucial hyper-parameter of the model, and thereby its value should be tuned (along with the values of the other hyper-parameters) on a validation set, rather than fixed to an arbitrary value.The interested reader can also find an automatic approach (rooted in the frequency analysis of the reservoir state content) to tune the number of layers in a DeepESN in one of our previous works [19].Secondly, when tuning the scaling parameters of the deep reservoir, our empirical results suggest that it would be useful to start from exploring values of the inter-layer scaling x il larger than 1.Moreover, from a broader perspective, our analysis revealed that a non-negligible portion of the hyper-parameters space can lead to intrinsically poor quality reservoirs (e.g., with ESP index [ 0), and in such cases training the readout layer could be an avoidable cost.In applications, it would be then advisable to preliminarily sample the hyper-parameters space to identify the most promising regions (e.g., those corresponding to the richer resulting dynamics according to the measures proposed in this paper) and then deepen the tuning process by the standard supervised model selection phase only in the selected promising regions.

Conclusions
In this paper, we have presented an empirical study on the richness of the hidden states representations developed in deeper layers of untrained deep RNNs, implemented according to the RC framework.Our analysis targeted a diverse range of metrics used to quantify the goodness of state dynamics from different perspectives.The stability regime was analyzed by the ESP index.The degree of diversity of reservoir's units activations was investigated by the average state entropy and by the number of linearly uncoupled dynamics.Finally, the extent to which the operation of the reservoir facilitates the learning in the case of supervised problems was studied by the condition number of the resulting state matrix.The empirical analysis was conducted on datasets with different characteristics, including both common RC benchmarks and real-world data from diverse domains.
The outcomes of our experiments clearly point out that layering has an inherent effect on the quality of RNNs dynamics.Depending on the architectural construction (i.e., the layer's depth) and on the initialization of the weight matrices involved in the state computation (analyzed here in terms of RC-based scaling coefficients), the quality of the recurrent representations can be enhanced or reduced even in the absence, or prior to, learning of the recurrent connections.In particular, our results suggest that an especially crucial role is played by the maximum singular value of the inter-layer weight matrices x il .To determine a progressive enrichment of the state dynamics, the value of x il should be larger than 1.
Our study paves the way to several future research directions.First of all, it would be interesting to study the impact of the organization of the untrained network connections on the richness of the resulting reservoir dynamics.A starting point in this direction would consist in exploring diverse setups for the sparsity of input, interlayer and reservoir connections, including topological organization of special interest, like cycle, permutation or small-world reservoirs [30,38].Moreover, extending the analysis presented in this paper to settings with possibly different reservoir sizes in different layers of the deep RC network would also allow exploring the most suitable architectural organization of an RC system on a given computational budget, i.e., to address the question of finding the convenient number of layers in which to arrange a given total number of reservoir neurons.The analysis presented here would also stimulate the development of input-driven pre-training strategies aiming at tailoring the architectural construction of deep RNNs toward regions of the hyper-parameters space where the developed dynamics are rich by construction, hence facilitating training in downstream supervised learning tasks.Besides, from an RC perspective, an interesting side effect of our analysis is the observation that a proper construction of deep reservoir architectures leads to a better-conditioned learning problem for the readout (as testified by a smaller value of the condition number).The consequence is that deep RC neural networks lend themselves more easily to an effective application of efficient iterative readout training algorithms based on least mean square error, which is traditionally ineffective in shallow RC due to bad conditioning [27,32].We believe that this direction is particularly attractive for a successful design of distributed deep learning systems for temporal data embedded in lowpowerful devices.Furthermore, extensions of our analysis to the case of deep reservoirs for graphs [16] is also of a great appeal.

A Further Results
Here, we report additional experimental results that provide further support to the analysis conducted in the paper.
In particular, in Fig. 10, we show the deviation of the results across the different datasets, expressed in terms of median average deviation (MAD) 8 , for the spectral radius, the input scaling and the inter-layer variability.The provided plots match the corresponding median results given in Figs. 7, 8, and 9.We can observe that in general the deviation is rather small and, varying the reservoir configurations and the number of layers, shows a trend generally in line with the observations made in Sect.4.3, with tendentially lower values for richer reservoirs.
Next, we detail the outcomes of the experiments reported in Sect.4.3, aggregated on the individual datasets.We provide the values of ESP index , ASE, LUD and C, achieved under the experimental settings illustrated in Sect.4.2, varying the value of the spectral radius q in Fig. 11, varying the value of the input scaling x in in Fig. 12, and varying the value of the inter-layer scaling x il in Fig.

Fig. 1
Fig. 1 Architecture of a deep RC neural network

ð3Þ
ðiÞ j ð1Þ ; . ..; h ðiÞ j ðSÞg P j¼1 .After discarding an initial transient of duration T, we can compute the average deviation between We finally compute the ESP index by averaging the ESP j values in equation 3, as follows:

r ðiÞ 1 ;
. ..; r ðiÞ H the singular values of H ðiÞ in decreasing order.The normalized relevance of the j-th principal component of the sampled i-th reservoir space can be then computed as follows:

Fig.
Fig. 11, varying the value of the input scaling x in in Fig. 12, and varying the value of the inter-layer scaling x il in Fig. 13.In each figure, each row corresponds to a different dataset.The values shown in the plots evidently confirm the same trends analyzed in Sect.4.3 (see Figs. 7, 8, and 9).Finally, we report the outcomes of the further experiments conducted in the same conditions as in Sect.4.2, but

Fig. 11
Fig. 11 Spectral radius variability.ESP index , ASE, LUD and C (in log 10 scale) measured in deeper reservoir layers (horizontal axis in each plot), varying the value of the spectral radius q (vertical axis in each plot).Results are presented individually for each dataset

Fig. 12
Fig. 12 Input scaling variability.ESP index , ASE, LUD and C (in log 10 scale) measured in deeper reservoir layers (horizontal axis in each plot), varying the value of the input scaling x in (vertical axis in each plot).Results are presented individually for each dataset

Fig. 13
Fig. 13 Inter-layer scaling variability.ESP index , ASE, LUD and C (in log 10 scale) measured in deeper reservoir layers (horizontal axis in each plot), varying the value of the inter-layer scaling x il (vertical axis in each plot).Results are presented individually for each dataset