Integrated Neural Networks for Nonlinear Continuous-Time System Identification

Abstract—This letter introduces a novel neural network architecture, called Integrated Neural Network (INN), for direct identification of nonlinear continuous-time dynamical models in state-space representation. The proposed INN is used to approximate the continuous-time state map, and it consists of a feed-forward network followed by an integral block. The unknown parameters are estimated by minimizing a properly constructed dual-objective criterion. The effectiveness of the proposed methodology is assessed against the Cascaded Tanks System benchmark.


I. INTRODUCTION
Direct identification of continuous-time (CT) dynamical systems from sampled data is nowadays a mature research topic which has attracted the attention of many researchers in the system and control community. Successful applications and complete reviews of direct CT identification methods can be found in the works [1]-[6], in the special issue [7], and in the book [8]. However, the methodologies proposed in the above-cited works can only be applied to the identification of dynamical systems with linear input-output relationships, such as linear time-invariant, linear time-varying and linear parameter-varying systems. On the other hand, the identification of general nonlinear dynamical systems is still an open and challenging research problem. In this letter, we address the problem of direct continuous-time identification of general nonlinear state-space models, where the state and the output mappings are described by artificial neural networks.
For continuous-time neural state-space identification, a straightforward approach is to discretize the ordinary differential equation (ODE) representing the state mapping function of the dynamical model and then minimize the simulation error with respect to the model parameters [9]. The resulting procedure turns out to be very similar to the ones used for discrete-time (DT) system identification with Recurrent Neural Network (RNN) architectures. Training is performed by applying the back-propagation through time algorithm [10], [11]. Other approaches for discrete-time nonlinear system identification with neural networks could alternatively be applied, such as the one-step prediction error method [12], or identification schemes using Long Short-Term Memory networks [13]. The main limitation of these approaches is that they cannot be parallelized, due to the inherently sequential nature of a simulation through time. Thus, their implementation on modern hardware optimized for parallel computing is often inefficient. A certain degree of parallelization can be achieved by carrying out the training procedure simultaneously on several sub-sequences extracted from the training dataset, according to the widespread (mini)batch optimization paradigm. However, special care has to be taken to ensure that the initial conditions of all these sub-sequences are compatible with the identified model dynamics over the whole training dataset. For instance, a multiple shooting approach has been discussed in [14], while a regularization-based approach is presented in [15].
In order to estimate a continuous-time state-space model, a novel neural network architecture, called integrated neural network (INN), is presented in this letter. The INN architecture combines feed-forward neural networks and the integral form of the Cauchy problem defined by the state-space model dynamics. The output mapping function is modelled as a standard feed-forward neural network. The weights of the two networks and the estimate of the (hidden) state sequence are computed simultaneously by minimizing a properly-defined non-convex multi-objective loss function. To solve this non-convex optimization problem, we take advantage of modern deep learning libraries that provide a combination of the back-propagation algorithm [16] for gradient evaluation and gradient-based solvers for optimization. The proposed methodology has a flexible structure and allows easy adaptation to non-uniformly sampled data. The latter is a typical advantage of continuous-time identification with respect to DT identification.
With the INN architecture introduced in this letter, we circumvent the need to run time simulations by considering the unknown state sequence as a tunable variable to be optimized along with the state and output mappings. This allows for a full parallelization of the training algorithm. Furthermore, the derivatives of the loss function required for optimization are obtained through a plain back-propagation procedure, as opposed to the back-propagation through time required in simulation error minimization. For completeness, it is worth mentioning that the approach in [17] based on 1-D Convolutional Neural Networks can also be highly parallelized. However, unlike the proposed INN architecture, the approach in [17] considers identification of discrete-time models represented by the interconnection of finite impulse response dynamic blocks with static non-linearities.
This letter is organized as follows. The overall identification setting is outlined in Section II. The proposed methodology is described in Section III, where the integrated neural network is introduced and details for numerical implementation of the training algorithm are provided. Results on the cascaded two-tank system identification benchmark are reported in Section IV to show the effectiveness of the approach.

II. PROBLEM SETTING
Let us consider a data-generating system S described by the continuous-time state-space representation

  ẋ(t) = f(x(t), u(t)),   x(t_0) = x_0,        (1a)
  y^o(t) = g(x(t)),                             (1b)

where x(t) ∈ R^{n_x} and ẋ(t) ∈ R^{n_x} are the state vector and its time derivative, respectively; x_0 ∈ R^{n_x} is the state initial condition; u(t) ∈ R^{n_u} is the system input; y^o(t) ∈ R^{n_y} is the (noise-free) system output at time t ∈ R; f(·,·): R^{n_x} × R^{n_u} → R^{n_x} and g(·): R^{n_x} → R^{n_y} are the state and output mappings, respectively. A training dataset D = {u(t_k), y(t_k)}, k = 0, 1, ..., N−1, is available, where y(t_k) denotes a (possibly noisy) measurement of the output y^o(t_k) collected at the sampling instant t_k.

III. PROPOSED METHODOLOGY

A. Modelling
Our modelling approach relies on the representation of f as a feed-forward neural network N_f, which is fed by the system input u(t) and the (estimated) state x̂(t) at time t, and returns the estimated state time-derivative, i.e.,

  dx̂(t)/dt = N_f(x̂(t), u(t); W_f),        (2)

where W_f ∈ R^{n_f} denotes the neural network parameters to be identified. Similarly, the output mapping g is represented by a feed-forward neural network N_g, fed by x̂(t) and returning the model output ŷ(t) at time t, i.e.,

  ŷ(t) = N_g(x̂(t); W_g),        (3)

where W_g ∈ R^{n_g} denotes the parameters of the network. Note that, in many cases, the output map g is actually known (e.g., when the system's outputs correspond to a subset of the state variables). Obviously, in these cases, the output mapping can be fixed and there is no need to estimate the network N_g.
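As an illustration, the two mappings above could be implemented as plain feed-forward networks in PyTorch, the framework used later for the benchmark example. The sketch below is only indicative: the class names, hidden-layer width and activation are illustrative choices, not the exact architecture used in this letter.

```python
import torch
import torch.nn as nn

class StateMapping(nn.Module):
    """N_f: maps the state estimate and the input to the estimated state derivative."""
    def __init__(self, n_x, n_u, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_x + n_u, hidden), nn.Sigmoid(),
            nn.Linear(hidden, n_x),
        )

    def forward(self, x_hat, u):
        # x_hat: (..., n_x), u: (..., n_u) -> estimated state derivative: (..., n_x)
        return self.net(torch.cat([x_hat, u], dim=-1))

class OutputMapping(nn.Module):
    """N_g: maps the state estimate to the model output."""
    def __init__(self, n_x, n_y, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_x, hidden), nn.Sigmoid(),
            nn.Linear(hidden, n_y),
        )

    def forward(self, x_hat):
        return self.net(x_hat)
```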
The resulting neural dynamical model is then given by:

  dx̂(t)/dt = N_f(x̂(t), u(t); W_f),        (4a)
  x̂(t_0) = x̂_0,                            (4b)
  ŷ(t) = N_g(x̂(t); W_g).                   (4c)

In order to fit the neural network parameters to the training dataset D, one possibility is to simulate (4) using a numerical ODE solution scheme, and then to minimize the simulation error norm penalizing the distance between measured and simulated outputs, with respect to the neural network parameters W_f and W_g.
In this letter, we introduce an alternative method exploiting the integral form of the Cauchy problem (4a)-(4b), by defining an integrated neural network N_I as:

  x̂_I(t) = x̂(t_0) + ∫_{t_0}^{t} N_f(x̂(τ), u(τ); W_f) dτ.        (5)

If the state sequence x̂(t) feeding the integrated neural network N_I is actually generated by model (4), then the signal x̂_I(t) exactly matches x̂(t), i.e.,

  x̂_I(t) = x̂(t),  ∀ t ≥ t_0.        (6)

The block diagram in Fig. 1 is a representation of (5), along with the output equation (4c) producing ŷ(t).

B. Fitting Criterion
In our approach, the neural network weights W_f, W_g and the state signal x̂(t), t ∈ [t_0, t_{N−1}], are tunable parameters to be optimized altogether according to a dual objective. First, the model output should be close to the measurements in D. This objective is represented by a fitting term in the cost function penalizing the distance between ŷ(t_k) and y(t_k), k = 0, 1, ..., N−1. Second, the state signal x̂ should be compatible with the model dynamics (4). This is promoted by an additional regularization term in the cost function penalizing the distance between x̂_I(t) and x̂(t), where x̂_I is defined as in (5).
The following minimization problem is thus formulated:

  min_{W_f, W_g, x̂}  J(W_f, W_g, x̂),        (7a)

where

  J(W_f, W_g, x̂) = Σ_{k=0}^{N−1} ‖y(t_k) − ŷ(t_k)‖² + α ∫_{t_0}^{t_{N−1}} ‖x̂_I(t) − x̂(t)‖² dt = J_y + α J_x,        (7b)
  ŷ(t_k) = N_g(x̂(t_k); W_g),        (7c)
  x̂_I(t) = x̂(t_0) + ∫_{t_0}^{t} N_f(x̂(τ), u(τ); W_f) dτ.        (7d)

The weighting constant α > 0 acts as a regularization hyper-parameter balancing the relative importance of the fitting objective J_y and the regularization objective J_x. Fig. 2 is a block diagram representation of the operations required to compute the cost J in (7b).

C. Numerical Implementation
Note that the optimization problem (7) is actually infinite-dimensional, as one of the optimization variables is the continuous-time state signal x̂(t) ∈ R^{n_x}, t ∈ [t_0, t_{N−1}]. In order to transform (7) into a finite-dimensional problem amenable to numerical optimization, x̂(t) is approximated using a finite-dimensional parametrization. For the sake of exposition, we adopt in this letter a simple piecewise constant parametrization, where x̂(t) is constant on the intervals [t_k, t_{k+1}], k = 0, 1, ..., N−1. In general, different parametrizations of x̂, such as piecewise linear or polynomial, could be adopted. Moreover, the intervals for the piecewise definition of x̂ do not necessarily correspond to the ones induced by the measurements in D.
Furthermore, the integrals in (7b) and (7d) are approximated by applying a numerical integration scheme. For the sake of simplicity, we adopt the classical rectangular approximation rule for the numerical integration of both integrals. Thus, the piecewise constant parametrization of x̂(t) and the rectangular quadrature of the integrals yield:

  J ≈ Σ_{k=0}^{N−1} ‖y(t_k) − ŷ(t_k)‖² + α Σ_{k=1}^{N−1} Δt_k ‖x̂_I(t_k) − x̂(t_k)‖²,        (8)

where Δt_k = t_k − t_{k−1}, and (7d) is approximated as:

  x̂_I(t_k) = x̂(t_0) + Σ_{j=1}^{k} Δt_j N_f(x̂(t_{j−1}), u(t_{j−1}); W_f).        (9)

Note that, for implementation convenience, the sum in (9) can be constructed recursively as:

  x̂_I(t_{k+1}) = x̂_I(t_k) + Δt_{k+1} N_f(x̂(t_k), u(t_k); W_f),   x̂_I(t_0) = x̂(t_0).        (10)

In general, other quadrature rules such as the trapezoidal rule or Gaussian quadrature [18] could be adopted for approximation of the integrals in (7b) and (7d).

Algorithm 1: Training Neural Network Through Gradient Descent
Inputs: training dataset D; number of gradient-descent iterations n; learning rate λ; weighting constant α.
1) Initialize the neural network parameters W_f, W_g and the state sequence x̂.
For j = 0, 1, ..., n−1:
  2) Evaluate the model output ŷ and the integrated state x̂_I from (7c) and (10), respectively.
  3) Define the cost J from (7b) and (8).
  4) Evaluate the gradients of J with respect to W_f, W_g and x̂ through back-propagation.
  5) Update the network parameters and the hidden state:
       W_f ← W_f − λ ∇_{W_f} J,   W_g ← W_g − λ ∇_{W_g} J,   x̂ ← x̂ − λ ∇_{x̂} J.
Output: estimated network parameters W_f and W_g.
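For concreteness, a minimal PyTorch sketch of the discretized cost (8)-(10) is given below, assuming a piecewise constant state estimate stored as a tensor x_hat with one row per sampling instant. The function and variable names (integrated_state, cost, dt) are illustrative, and the networks N_f and N_g are assumed to follow the interfaces sketched above.

```python
import torch

def integrated_state(x_hat, u, dt, N_f):
    # Rectangular-rule approximation of the integrated state, cf. (9)-(10).
    # x_hat: (N, n_x) state estimate (optimization variable), u: (N, n_u), dt: (N,)
    dx = N_f(x_hat, u) * dt.unsqueeze(-1)            # all N_f evaluations in parallel
    x_I = x_hat[0:1] + torch.cumsum(dx, dim=0)       # sequential part: a cheap cumulative sum
    return torch.cat([x_hat[0:1], x_I[:-1]], dim=0)  # align so that x_I(t_0) = x_hat(t_0)

def cost(x_hat, u, y, dt, N_f, N_g, alpha):
    # Discretized dual-objective loss, cf. (8).
    y_hat = N_g(x_hat)                               # fitting term J_y
    J_y = torch.sum((y - y_hat) ** 2)
    x_I = integrated_state(x_hat, u, dt, N_f)        # regularization term J_x
    J_x = torch.sum(dt.unsqueeze(-1) * (x_I - x_hat) ** 2)
    return J_y + alpha * J_x
```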

D. Algorithm Overview
Algorithm 1 describes the steps required to estimate the network parameters W_f and W_g by minimizing the loss J defined in (7b) over n gradient-descent iterations.
At the beginning of the procedure (Step 1), all the optimization parameters W_f, W_g, and x̂ are initialized to some selected (or random) values. It is important to remark that, for numerical implementation and with some abuse of notation, the optimization variable x̂ in Algorithm 1 denotes the finite-dimensional representation of the state signal x̂, i.e., x̂ = {x̂(t_0), ..., x̂(t_{N−1})}. The measured input and output values in the dataset D may also be normalized, if required. Then, at each iteration j = 0, ..., n−1 of the gradient-based optimization, the following operations are repeated.
In Step 2, the neural network output ŷ and the integrated state x̂_I are evaluated according to (7c) and (10) (recall that (10) is an approximation of (7d)), respectively. The cost function J is then computed (Step 3) as in (7b) and (8), and its gradient with respect to the optimization variables W_f, W_g, and x̂ is obtained through back-propagation (Step 4), based on the computational graph in Fig. 2. Lastly, the optimization variables W_f, W_g, and x̂ are updated according to the gradient descent optimization algorithm with learning rate λ (Step 5). Enhanced variants of the vanilla gradient descent algorithm such as RMSProp and Adam [19] can also be adopted in Step 5.
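A compact PyTorch rendering of this loop might look as follows, reusing the cost function sketched in the previous section and Adam as the gradient-based update; hyper-parameter values and tensor shapes are placeholders.

```python
import torch

def train_inn(u, y, dt, N_f, N_g, n_x, alpha=0.1, lr=1e-4, n_iter=10_000):
    # Step 1: initialize the networks (done by their constructors) and the state
    # sequence, which is itself a free optimization variable.
    x_hat = torch.zeros(u.shape[0], n_x, requires_grad=True)
    opt = torch.optim.Adam(
        list(N_f.parameters()) + list(N_g.parameters()) + [x_hat], lr=lr
    )
    for _ in range(n_iter):
        opt.zero_grad()
        J = cost(x_hat, u, y, dt, N_f, N_g, alpha)  # Steps 2-3: forward pass and cost
        J.backward()                                # Step 4: gradients via back-propagation
        opt.step()                                  # Step 5: update W_f, W_g and x_hat
    return N_f, N_g, x_hat
```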
The computational graph corresponding to the loss minimized in Algorithm 1 is sketched in Fig. 3. It is evident from this graph that all the neural network function evaluations can be performed in parallel, as the state sequence x̂(t) is a free optimization variable. Only the cumulative sum required to obtain x̂_I(t_{k+1}) recursively as x̂_I(t_{k+1}) = x̂_I(t_k) + Δx̂_k has to be computed sequentially. However, this simple operation has a negligible computational cost compared to the neural network evaluations.
For the sake of comparison, the computational graph of a simulation error minimization training strategy based on the forward Euler numerical scheme for ODE integration is shown in Fig. 4. It is evident that in this case the neural network function evaluations have to be performed sequentially, as the simulated state x̂(t_{k+1}) at time step t_{k+1} depends on the previously simulated state x̂(t_k) at time step t_k. This does not allow a parallel implementation and generally results in a longer algorithm runtime.
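For contrast, a sequential forward-Euler rollout of the kind underlying Fig. 4 might be sketched as below; each step depends on the previously simulated state, which is what prevents parallelization over time.

```python
import torch

def simulate_euler(x0, u, dt, N_f):
    # Sequential simulation: x_hat(t_{k+1}) depends on x_hat(t_k),
    # so the N_f evaluations cannot be carried out in parallel.
    x = [x0]
    for k in range(u.shape[0] - 1):
        x.append(x[-1] + dt[k] * N_f(x[-1], u[k]))
    return torch.stack(x, dim=0)
```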
Remark 1: It is possible to take advantage of a previously available model to initialize the optimization variables in Step 1 of Algorithm 1. For instance, an initial estimate of the state sequence x̂ may be obtained by optimizing the cost function J with respect to x̂ only, while fixing the system dynamics to the initial model. This type of initialization is proposed in [20] for general nonlinear identification of state-space models, with an initial linear model obtained as the Best Linear Approximation of the training data.
Remark 2: In Algorithm 1, the entire training dataset is processed at each iteration j of the optimization loop. Unlike in simulation error minimization, (mini)-batch training is not required in our approach to increase the level of parallelization. Nonetheless, executing the training procedure on shorter training sequences could still be beneficial for different reasons. First, as shown in [14], the Lipschitz constant of the objective function may blow up exponentially with the training sequence length for non-contractive system dynamics. Thus, the optimization problem (7) may be better posed when solved over subsequences extracted from the training dataset. Furthermore, for large-scale training datasets, minibatch training may be required due to memory limitations of the hardware.

IV. CASE STUDY

A. System Description
The proposed method is evaluated on a publicly available system identification benchmark hosted at www.nonlinearbenchmark.com. In particular, the Cascaded Tanks System (CTS) benchmark, thoroughly described in [21], [22], is considered here.
The CTS (schematized in Fig. 5) consists of two cascaded tanks with free outlets. The system input is the water flow provided by a pump feeding water from a reservoir into the upper tank, while the measured output is the water level in the lower tank. When one of the tanks is completely filled, water overflow occurs. Furthermore, when an overflow occurs in the upper tank, part of the water flows into the lower tank, while the rest leaves the system.
An approximate nonlinear state-space model of the CTS can be derived as in [21], exploiting Bernoulli's principle and conservation of mass (but ignoring the water overflow effect):

  ẋ_1(t) = −k_1 √(x_1(t)) + k_4 u(t),        (11a)
  ẋ_2(t) = k_2 √(x_1(t)) − k_3 √(x_2(t)),    (11b)
  y(t) = x_2(t),                              (11c)

where u(t) is the input signal that controls the water pump flow from the reservoir into the upper tank; x_1(t) and x_2(t) are the state variables of the system representing the water levels in the upper and lower tank, respectively; y(t) is the output signal that measures the water level in the lower tank; and k_1, k_2, k_3, k_4 are unknown coefficients depending on the CTS configuration.
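A simple numerical rendering of this physics-based right-hand side, neglecting the overflow effect, is sketched below; the function name is illustrative and the coefficient values are unknown placeholders.

```python
import numpy as np

def cts_rhs(x, u, k1, k2, k3, k4):
    # Right-hand side of the physics-based CTS model, overflow neglected.
    # x[0], x[1]: upper- and lower-tank levels; u: pump input.
    x1, x2 = np.maximum(x, 0.0)      # water levels cannot become negative
    dx1 = -k1 * np.sqrt(x1) + k4 * u
    dx2 = k2 * np.sqrt(x1) - k3 * np.sqrt(x2)
    return np.array([dx1, dx2])
```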
Both the training and the validation datasets consist of N = 1024 samples, and the sampling time is Δt = 4 s. The input u and the output y are expressed in Volts, as they correspond to raw actuator and sensor readings, respectively. The input signals of the training and validation datasets are plotted in Fig. 6.
The goal of this benchmark is to estimate the dynamics of the system as accurately as possible, using only the training data. The performance of the estimation algorithm is then assessed on the validation dataset. As proposed in [21], the Root Mean Square Error (RMSE) between the measured output y and the open-loop simulated output ŷ is used as performance index:

  e_RMS = sqrt( (1/N) Σ_{k=0}^{N−1} (y(t_k) − ŷ(t_k))² ).        (12)
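This index can be computed in a single line; the sketch below assumes y and y_hat are arrays holding the measured and open-loop simulated outputs.

```python
import numpy as np

def rmse(y, y_hat):
    # Root Mean Square Error between measured and simulated outputs
    return np.sqrt(np.mean((y - y_hat) ** 2))
```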

B. Neural Model Structure
Based on the proposed identification architecture and on the physics-based model given in (11), the following neural network model structure is chosen:

  dx̂(t)/dt = N_f(x̂(t), u(t); W_f),        (13a)
  ŷ(t) = x̂_2(t),                           (13b)

with x̂(t) ∈ R². This model structure is motivated by the observation that (i) the physics-based model (11) has two state variables and (ii) the measured output coincides with the second state variable, so that the output mapping is known and does not need to be estimated. The feed-forward neural network N_f has three input nodes (corresponding to the two state components and the system input u); two hidden layers with 80 nodes and sigmoid non-linearities; and two linear output nodes corresponding to the two components of the state equation (13a).
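The described network can be written down directly in PyTorch; the snippet below mirrors the stated layer sizes, with the input expected as the concatenation of the two state components and u.

```python
import torch.nn as nn

# N_f for the CTS model: 3 inputs (two state components and u),
# two hidden layers of 80 sigmoid units, 2 linear outputs (the state derivatives).
N_f_cts = nn.Sequential(
    nn.Linear(3, 80), nn.Sigmoid(),
    nn.Linear(80, 80), nn.Sigmoid(),
    nn.Linear(80, 2),
)
```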

C. Algorithm Setup
The proposed identification algorithm is implemented using the PyTorch Deep Learning framework [23]. To account for the different magnitude scales of the signals, the input and the output data are normalized to have zero mean and unitary standard deviation. In Algorithm 1, gradient-based optimization is performed using the Adam method [19] over n = 4·10^5 iterations, fixing the learning rate to λ = 10^{-5}. The hyper-parameter α is set equal to 1/8. The code for reproducing this example is available in the following GitHub repository: https://github.com/bmavkov/INN-for-Identification.
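A minimal sketch of the normalization step and of the stated hyper-parameters is given below; the names are illustrative and data loading is omitted.

```python
import torch

def normalize(sig):
    # Zero-mean, unit-standard-deviation scaling of a signal
    return (sig - sig.mean()) / sig.std()

# Settings used in this example: Adam optimizer, learning rate 1e-5,
# 4e5 iterations, regularization weight alpha = 1/8.
LR, N_ITER, ALPHA = 1e-5, 400_000, 1 / 8
```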

D. Results
The achieved results may vary due to the random initial conditions assigned to the optimization variables and the non-convexity of the objective function. Thus, multiple executions with different initial conditions and a varying number of neurons per layer were performed. The neural network consists of 3 layers, while the number of neurons per layer was varied in the range 20-120. The boxplots of the RMSE values on the validation data are shown in Fig. 7.
The measured output and the simulated model output in the training and in the validation datasets are reported in Fig. 8(a) and Fig. 8(b), respectively. The resulting RMSE index for the best performing model is e_RMS = 0.36 for the training dataset and e_RMS = 0.41 for the validation dataset.
We observe that the achieved performance index is similar in the training and in the validation datasets, and that the results are in line with the ones obtained using other state-of-the-art identification methods applied to this benchmark [24]- [29]. The resulting RMSE values of these algorithms are given in Tab. I.

V. CONCLUSION
A novel neural network architecture has been presented for the identification of nonlinear dynamical systems in continuous-time state-space forms. The proposed structure consists of a feed-forward neural network followed by an integral block. The network is trained using gradient-descent optimization, with gradients computed through standard backpropagation.
Current research activities are focused on the extension of the proposed approach for direct identification of systems described by partial differential equations, which requires handling infinite-dimensional state-space vectors.