Backpropagation Through Time: What It Does and How to Do It

Backpropagation is now the most widely used tool in the field of artificial neural networks. At the core of backpropagation is a method for calculating derivatives exactly and efficiently in any large system made up of elementary subsystems or calculations which are represented by known, differentiable functions; thus, backpropagation has many applications which do not involve neural networks as such. This paper first reviews basic backpropagation, a simple method which is now being widely used in areas like pattern recognition and fault diagnosis. Next, it presents the basic equations for backpropagation through time, and discusses applications to areas like pattern recognition involving dynamic systems, systems identification, and control. Finally, it describes further extensions of this method, to deal with systems other than neural networks, systems involving simultaneous equations or true recurrent networks, and other practical issues which arise with this method. Pseudocode is provided to clarify the algorithms. The chain rule for ordered derivatives (the theorem which underlies backpropagation) is briefly discussed.


I. INTRODUCTION
Backpropagation through time is a very powerful tool, with applications to pattern recognition, dynamic modeling, sensitivity analysis, and the control of systems over time, among others. It can be applied to neural networks, to econometric models, to fuzzy logic structures, to fluid dynamics models, and to almost any system built up from elementary subsystems or calculations. The one serious constraint is that the elementary subsystems must be represented by functions known to the user, functions which are both continuous and differentiable (i.e., possess derivatives). For example, the first practical application of backpropagation was for estimating a dynamic model to predict nationalism and social communications in 1974 [1].

[...] recognition work. For example, suppose that we are trying to build a neural network which can learn to recognize handwritten ZIP codes. (AT&T has actually done this [13], although the details are beyond the scope of this paper.) We assume that we already have a camera and preprocessor which can digitize the image, locate the five digits, and provide a 19 x 20 grid of ones and zeros representing the image of each digit. We want the neural network to input the 19 x 20 image, and output a classification; for example, we might ask the network to output four binary digits which, taken together, identify which decimal digit is being observed.
Before adapting the parameters of the neural network, one must first obtain a training database of actual handwritten digits and correct classifications. Suppose, for example, that this database contains 2000 examples of handwritten digits. In that case, T = 2000. We may give each example a label t between 1 and 2000. For each sample t, we have a record of the input pattern and the correct classification. Each input pattern consists of 380 numbers, which may be viewed as a vector with 380 components; we may call this vector X(t). The desired classification consists of four numbers, which may be treated as a vector Y(t). The actual output of the network will be Ŷ(t), which may differ from the desired output Y(t), especially in the period before the network has been adapted. To solve the supervised learning problem, there are two steps:
We must specify the "topology" (connections and equations) for a network which inputs X(t) and outputs a four-component vector Ŷ(t), an approximation to Y(t). The relation between the inputs and outputs must depend on a set of weights (parameters) W which can be adjusted.
We must specify a "learning rule," a procedure for adjusting the weights W so as to make the actual outputs Ŷ(t) approximate the desired outputs Y(t).
Basic backpropagation is currently the most popular learning rule used in supervised learning. It is generally used with a very simple network design (to be described in the next section), but the same approach can be used with any network of differentiable functions, as will be discussed in Section IV.
Even when we use a simple network design, the vectors X(t) and Y(t) need not be made of ones and zeros. They can be made up of any values which the network is capable of inputting and outputting. Let us denote the components of X(t) as $X_1(t), \ldots, X_m(t)$ so that there are m inputs to the network. Let us denote the components of Y(t) as $Y_1(t), \ldots, Y_n(t)$ so that we have n outputs. Throughout this paper, the components of a vector will be represented by the same letter as the vector itself, in the same case; this convention turns out to be convenient because x(t) will represent a different vector, very closely related to X(t).
Fig. 1 illustrates the supervised learning task in the general case. Given a history of X(1), ..., X(T) and Y(1), ..., Y(T), we want to find a mapping from X to Y which will perform well when we encounter new vectors X outside the training set. The index "t" may be interpreted either as a time index or as a pattern number index; however, this section will not assume that the order of patterns is meaningful.

B. Simple Feedforward Networks
Before we specify a learning rule, we have to define exactly how the outputs of a neural net depend on its inputs and weights. In basic backpropagation, we assume the following logic:

$$x_i = X_i, \quad 1 \le i \le m \tag{1}$$
$$net_i = \sum_{j=1}^{i-1} W_{ij} \, x_j, \quad m < i \le N+n \tag{2}$$
$$x_i = s(net_i), \quad m < i \le N+n \tag{3}$$
$$\hat{Y}_i = x_{i+N}, \quad 1 \le i \le n \tag{4}$$

where the function s in (3) is usually the following sigmoidal function:

$$s(z) = \frac{1}{1 + e^{-z}} \tag{5}$$

and where N is a constant which can be any integer you choose as long as it is no less than m. The value of N + n decides how many neurons are in the network (if we include inputs as neurons). Intuitively, $net_i$ represents the total level of voltage exciting a neuron, and $x_i$ represents the intensity of the resulting output from the neuron. ($x_i$ is sometimes called the "activation level" of the neuron.) It is conventional to assume that there is a threshold or constant weight $W_{i0}$ added to the right side of (2); however, we can achieve the same effect by assuming that one of the inputs (such as $X_m$) is always 1.
The significance of these equations is illustrated in Fig. 2. There are N + n circles, representing all of the neurons in the network, including the input neurons. The first m circles are really just copies of the inputs $X_1, \ldots, X_m$; they are included as part of the vector x only as a way of simplifying the notation. Every other neuron in the network (such as neuron number i, which calculates $net_i$ and $x_i$) takes input from every cell which precedes it in the network. Even the last output cell, which generates $\hat{Y}_n$, takes input from other output cells, such as the one which outputs $\hat{Y}_{n-1}$.
In neural network terminology, this network is "fully connected" in the extreme. As a practical matter, it is usually desirable to limit the connections between neurons. This can be done by simply fixing some of the weights $W_{ij}$ to zero so that they drop out of all calculations. For example, most researchers prefer to use "layered" networks, in which all connection weights $W_{ij}$ are zeroed out, except for those going from one "layer" (subset of neurons) to the next layer. In general, one may zero out as many or as few of the weights as one likes, based on one's understanding of individual applications. For those who first begin this work, it is conventional to define only three layers: an input layer, a "hidden" layer, and an output layer. This section will assume the full range of allowed connections, simply for the sake of generality.
In computer code, we could represent this network as a Fortran subroutine (assuming a Fortran which distinguishes upper case from lower case):

      SUBROUTINE NET(X, W, x, Yhat)
      REAL X(m), W(N+n, N+n), x(N+n), Yhat(n), net
      INTEGER i, j, m, n, N
C     First insert the inputs, as per equation (1)
      DO 1 i = 1, m
    1 x(i) = X(i)
C     Next implement (2) and (3) together for each value of i
      DO 1000 i = m+1, N+n
C     calculate net as a running sum, based on (2)
      net = 0.
      DO 10 j = 1, i-1
   10 net = net + W(i,j)*x(j)
C     finally, calculate x(i) based on (3) and (5)
 1000 x(i) = 1/(1+exp(-net))
C     Finally, copy over the outputs, as per (4)
      DO 2000 i = 1, n
 2000 Yhat(i) = x(i+N)

In the pseudocode, note that X and W are technically the inputs to the subroutine, while x and Yhat are the outputs. Yhat is usually regarded as "the" output of the network, but x may also have its uses outside of the subroutine proper, as will be seen in the next section.
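As a cross-check on the subroutine above, the same forward pass can be sketched in Python. This is a minimal illustration and not the author's code: the names `sigmoid` and `net_forward` are hypothetical, indices are 0-based rather than 1-based, and `W[i][j]` is assumed to hold the weight from neuron j to neuron i, with only entries j < i being used.

```python
import math

def sigmoid(z):
    """Equation (5): s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def net_forward(X, W, m, N, n):
    """Forward pass following (1)-(4), using 0-based indices.

    X is a list of m inputs; W is an (N+n) x (N+n) nested list.
    Returns (x, Yhat): all N+n activations and the n network outputs.
    """
    size = N + n
    x = [0.0] * size
    x[:m] = X                                          # (1): copy the inputs
    for i in range(m, size):
        net = sum(W[i][j] * x[j] for j in range(i))    # (2): running sum over j < i
        x[i] = sigmoid(net)                            # (3)
    Yhat = x[N:N + n]                                  # (4): the last n neurons are outputs
    return x, Yhat
```

With all weights set to zero, every active neuron receives a net input of zero, so each one outputs s(0) = 0.5 regardless of the inputs.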
C. Adapting the Network: Approach
In basic backpropagation, we choose the weights $W_{ij}$ so as to minimize square error over the training set:

$$E = \sum_{t=1}^{T} E(t) = \sum_{t=1}^{T} \sum_{i=1}^{n} \tfrac{1}{2} \left( \hat{Y}_i(t) - Y_i(t) \right)^2 \tag{6}$$

This is simply a special case of the well-known method of least squares, used very often in statistics, econometrics, and engineering; the uniqueness of backpropagation lies in how this expression is minimized. The approach used here is illustrated in Fig. 3.

Fig. 3. Basic backpropagation (in pattern learning).
In basic backpropagation, we start with arbitrary values for the weights W. (It is usual to choose random numbers in the range from -0.1 to 0.1, but it may be better to guess the weights based on prior information, in cases where prior information is available.) Next, we calculate the outputs Ŷ(t) and the errors E(t) for that set of weights. Then we calculate the derivatives of E with respect to all of the weights; this is indicated by the dotted lines in Fig. 3. If increasing a given weight would lead to more error, we adjust that weight downwards. If increasing a weight leads to less error, we adjust it upwards. After adjusting all the weights up or down, we start all over, and keep on going through this process until the weights and the error settle down. (Some researchers iterate until the error is close to zero; however, if the number of training patterns exceeds the number of weights in the network, as recommended by studies on generalization, it may not be possible for the error to reach zero.) The uniqueness of backpropagation lies in the method used to calculate the derivatives exactly for all of the weights in only one pass through the system.

D. Calculating Derivatives: Theoretical Background
Many papers on backpropagation suggest that we need only use the conventional chain rule for partial derivatives to calculate the derivatives of E with respect to all of the weights. Under certain conditions, this can be a rigorous approach, but its generality is limited, and it requires great care with the side conditions (which are rarely spelled out); calculations of this sort can easily become confused and erroneous when networks and applications grow complex. Even when using (7) below, it is a good idea to test one's gradient calculations using explicit perturbations in order to be sure that there is no bug in one's code.
When the idea of backpropagation was first presented to the Harvard faculty in 1972, they expressed legitimate concern about the validity of the rather complex calculations involved. To deal with this problem, I proved a new chain rule for ordered derivatives:

$$\frac{\partial^{+}\, TARGET}{\partial z_i} = \frac{\partial\, TARGET}{\partial z_i} + \sum_{j>i} \frac{\partial^{+}\, TARGET}{\partial z_j} \cdot \frac{\partial z_j}{\partial z_i} \tag{7}$$

where the derivatives with the superscript represent ordered derivatives, and the derivatives without superscripts represent ordinary partial derivatives. This chain rule is valid only for ordered systems where the values to be calculated can be calculated one by one (if necessary) in the order $z_1, z_2, \ldots, z_n$, TARGET. The simple partial derivatives represent the direct impact of $z_i$ on $z_j$ through the system equation which determines $z_j$. The ordered derivative represents the total impact of $z_i$ on TARGET, accounting for both the direct and indirect effects. For example, suppose that we had a simple system governed by the following two equations, in order:

$$z_2 = 4 z_1$$
$$z_3 = 3 z_1 + 5 z_2$$

The "simple" partial derivative of $z_3$ with respect to $z_1$ (the direct effect) is 3; to calculate the simple effect, we only look at the equation which determines $z_3$. However, the ordered derivative of $z_3$ with respect to $z_1$ is 23 because of the indirect impact by way of $z_2$. The simple partial derivative measures what happens when we increase $z_1$ (e.g., by 1, in this example) and assume that everything else (like $z_2$) in the equation which determines $z_3$ remains constant. The ordered derivative measures what happens when we increase $z_1$, and also recalculate all other quantities (like $z_2$) which are later than $z_1$ in the causal ordering we impose on the system.
This chain rule provides a straightforward, plodding, "linear" recipe for how to calculate the derivatives of a given TARGET variable with respect to all of the inputs (and parameters) of an ordered differentiable system in only one pass through the system. This paper will not explain this chain rule in detail since lengthy tutorials have been published elsewhere [1], [11]. But there is one point worth noting: because we are calculating ordered derivatives of one target variable, we can use a simpler notation, a notation which works out to be easier to use in complex practical examples [11].
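The difference between the direct effect (3) and the ordered derivative (23) can be checked numerically. The sketch below is illustrative only: the coefficients 4, 3, and 5 are assumptions chosen to reproduce the two values quoted in the text, and the function names are hypothetical. It applies the chain rule (7) backwards with TARGET = z3, then confirms the result by explicit perturbation.

```python
def run_system(z1, dz2=0.0):
    """Evaluate the ordered system; dz2 perturbs z2 at the point where it is calculated."""
    z2 = 4.0 * z1 + dz2          # assumed first equation
    z3 = 3.0 * z1 + 5.0 * z2     # assumed second equation
    return z3

def ordered_derivatives():
    """Work backwards through the system with TARGET = z3, as in (7)."""
    F_z2 = 5.0                   # z2 affects z3 only directly
    F_z1 = 3.0 + F_z2 * 4.0      # direct effect plus indirect effect by way of z2
    return F_z1, F_z2

def perturb_z1(z1=1.0, delta=1.0):
    """Increase z1 by delta and recalculate everything later in the ordering."""
    return (run_system(z1 + delta) - run_system(z1)) / delta
```

Because this system is linear, the perturbation estimate agrees exactly with the chain-rule result; in a nonlinear system one would use a small perturbation and expect agreement only to within the finite-difference error.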
We can write the ordered derivative of the TARGET with respect to $z_i$ as "$F\_z_i$," which may be described as "the feedback to $z_i$." In basic backpropagation, the TARGET variable of interest is the error E. This changes the appearance of our chain rule in that case to

$$F\_z_i = \frac{\partial E}{\partial z_i} + \sum_{j>i} F\_z_j \cdot \frac{\partial z_j}{\partial z_i} \tag{8}$$

For purposes of debugging, one can calculate the true value of any ordered derivative simply by perturbing $z_i$ at the point in the program where $z_i$ is calculated; this is particularly useful when applying backpropagation to a complex network of functions other than neural networks.

E. Adapting the Network: Equations
For a given set of weights W, it is easy to use (1)-(6) to calculate Ŷ(t) and E(t) for each pattern t. The trick is in how we then calculate the derivatives.
Let us use the prefix "F_" to indicate the ordered derivative of E with respect to whatever variable the "F_" precedes. Thus, for example,

$$F\_\hat{Y}_i(t) = \frac{\partial E}{\partial \hat{Y}_i(t)} = \hat{Y}_i(t) - Y_i(t) \tag{9}$$

which follows simply by differentiating (6). By the chain rule for ordered derivatives as expressed in (8),

$$F\_x_i(t) = F\_\hat{Y}_{i-N}(t) + \sum_{j=i+1}^{N+n} W_{ji} \, F\_net_j(t) \tag{10}$$
$$F\_net_i(t) = s'(net_i) \, F\_x_i(t) \tag{11}$$
$$F\_W_{ij} = \sum_{t=1}^{T} F\_net_i(t) \, x_j(t) \tag{12}$$

where s' is the derivative of s(z) as defined in (5), and where $F\_\hat{Y}_k$ is taken to be zero for $k \le 0$. Equation (10) requires us to run backwards through the network in order to calculate the derivatives; this backwards propagation of information is what gives backpropagation its name. A little calculus and algebra, starting from (5), shows us that

$$s'(z) = s(z) \, (1 - s(z)) \tag{13}$$

which we can use when we implement (11). Finally, to adapt the weights, the usual method is to set

$$\text{New } W_{ij} = W_{ij} - learning\_rate \cdot F\_W_{ij} \tag{14}$$

where the learning_rate is some small constant chosen on an ad hoc basis. (The usual procedure is to make it as large as possible, up to 1, until the error starts to diverge; however, there are more analytic procedures available [11].)

F. Adapting the Network: Code
The calculations above can be coded up into a "dual" subroutine, as follows.

CONTINUE
Note that the array F_W is the only output of this subroutine. Equation (14) represented "batch learning," in which weights are adjusted only after all T patterns are processed. It is more common to use pattern learning, in which the weights are continually updated after each observation. Pattern learning may be represented as follows: Note how weights are updated within the "DO 100" loop.
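To make the dual calculation concrete, here is a minimal Python sketch of a backward pass for a single pattern, together with the forward pass it depends on. It is an illustration under stated assumptions and not the author's listing: the names `net_forward`, `f_net`, and `error` are hypothetical, indices are 0-based, and `W[i][j]` is assumed to hold the weight from neuron j to neuron i. The function `error` exists only so that the derivatives can be tested by explicit perturbation, as the text recommends.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def net_forward(X, W, m, N, n):
    """Forward pass (1)-(4); returns all activations x and the outputs Yhat."""
    x = [0.0] * (N + n)
    x[:m] = X
    for i in range(m, N + n):
        x[i] = sigmoid(sum(W[i][j] * x[j] for j in range(i)))
    return x, x[N:N + n]

def f_net(F_Yhat, W, x, m, N, n):
    """Dual of net_forward: returns F_W[i][j] = dE/dW[i][j] for one pattern."""
    size = N + n
    F_x = [0.0] * size
    for k in range(n):
        F_x[N + k] = F_Yhat[k]                     # initialize (10) with the output feedback
    F_W = [[0.0] * size for _ in range(size)]
    for i in range(size - 1, m - 1, -1):           # active neurons, in reverse order
        F_net_i = F_x[i] * x[i] * (1.0 - x[i])     # (11), using (13)
        for j in range(i):
            F_W[i][j] = F_net_i * x[j]             # (12), for a single pattern
            F_x[j] += W[i][j] * F_net_i            # running sum in (10)
    return F_W

def error(X, Y, W, m, N, n):
    """E(t) from (6) for one pattern, used only for perturbation tests."""
    _, Yhat = net_forward(X, W, m, N, n)
    return 0.5 * sum((yh - y) ** 2 for yh, y in zip(Yhat, Y))
```

A typical check perturbs one weight at a time by a small amount, recomputes `error`, and compares the resulting finite difference with the matching entry of `F_W`.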

WERBOS
The key point here is that the weights W are adjusted in response to the current vector F_W, which only depends on the current pattern t; the weights are adjusted after each pattern is processed. (In batch learning, by contrast, the weights are adjusted only after the "DO 100" loop is completed.) In practice, maximum_passes is usually set to an enormous number; the loop is exited only when a test of convergence is passed, a test of error size or weight change which can be injected easily into the loop. True real-time learning is like pattern learning, but with only one pass through the data and no memory of earlier times t. (The equations above could be implemented easily enough as a real-time learning scheme; however, this will not be true for backpropagation through time.) The term "on-line learning" is sometimes used to represent a situation which could be pattern learning or could be real-time learning. Most people using basic backpropagation now use pattern learning rather than real-time learning because, with their data sets, many passes through the data are needed to ensure convergence of the weights.
The reader should be warned that I have not actually tested the code here. It is presented simply as a way of explaining more precisely the preceding ideas. The C implementations which I have worked with have been less transparent, and harder to debug, in part because of the absence of range checking in that language. It is often argued that people "who know what they are doing" do not need range checking and the like; however, people who think they never make mistakes should probably not be writing this kind of code. With neural network code, especially, good diagnostics and tests are very important because bugs can lead to slow convergence and oscillation, problems which are hard to track down, and are easily misattributed to the algorithm in use. If one must use a language without range checking, it is extremely important to maintain a version of the code which is highly transparent and safe, however inefficient it may be, for diagnostic purposes.
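The distinction between batch learning and pattern learning can be sketched in a few lines of Python. To keep the example short, it assumes the smallest possible network, a single sigmoid neuron (the special case N = m, n = 1), for which (9), (11), and (12) reduce to one line; the names `grad_one_pattern` and `train`, the tolerance `tol`, and the training data in the usage note are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_one_pattern(w, X, Y):
    """Return (F_W, E(t)) for one pattern and a single sigmoid output neuron."""
    yhat = sigmoid(sum(wj * xj for wj, xj in zip(w, X)))
    f_net = (yhat - Y) * yhat * (1.0 - yhat)       # (9) and (11) combined
    return [f_net * xj for xj in X], 0.5 * (yhat - Y) ** 2

def train(patterns, w, learning_rate, max_passes, batch, tol=1e-4):
    """Steepest descent; batch=True adjusts weights once per pass, else after each pattern."""
    w = list(w)
    total_error = 0.0
    for _ in range(max_passes):
        total_error = 0.0
        F_W_sum = [0.0] * len(w)
        for X, Y in patterns:
            F_W, E_t = grad_one_pattern(w, X, Y)
            total_error += E_t
            if batch:
                F_W_sum = [a + b for a, b in zip(F_W_sum, F_W)]
            else:
                # pattern learning: update inside the loop over patterns
                w = [wj - learning_rate * g for wj, g in zip(w, F_W)]
        if batch:
            # batch learning, as in (14): update only after all patterns are processed
            w = [wj - learning_rate * g for wj, g in zip(w, F_W_sum)]
        if total_error < tol:                      # simple convergence test
            break
    return w, total_error
```

For example, with the four patterns of a logical OR (plus a constant input of 1 acting as the threshold), both modes drive the summed error well below its starting value of 0.5; which mode converges faster depends on the data and the learning rate.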

A. Background
Backpropagation through time, like basic backpropagation, is used most often in pattern recognition today. Therefore, this section will focus on such applications, using notation like that of the previous section. See Section IV for other applications.
In some applications, such as speech recognition or submarine detection, our classification at time t will be more accurate if we can account for what we saw at earlier times. Even though the training set still fits the same format as above, we want to use a more powerful class of networks to do the classification; we want the output of the network at time t to account for variables at earlier times (as in Fig. 5). The Introduction cited a number of examples where such "memory" of previous time periods is very important. For example, it is easier to recognize moving objects if our network accounts for changes in the scene from the time t - 1 to time t, which requires memory of time t - 1. Many of the best pattern recognition algorithms involve a kind of "relaxation" approach where the representation of the world at time t is based on an adjustment of the representation at time t - 1; this requires memory of the internal network variables for time t - 1. (Even Kalman filtering requires such a representation.)

B. Example of a Recurrent Network
Backpropagation can be applied to any system with a well-defined order of calculations, even if those calculations depend on past calculations within the network itself. For the sake of generality, I will show how this works for the network design shown in Fig. 5 where every neuron is potentially allowed to input values from any of the neurons at the two previous time periods (including, of course, the input neurons). To avoid excess clutter, Fig. 5 shows the hidden and output sections of the network (parallel to Fig. 2) only for time T, but they are present at other times as well. To translate this network into a mathematical system, we can simply replace (2) above by

$$net_i(t) = \sum_{j=1}^{i-1} W_{ij} \, x_j(t) + \sum_{j=1}^{N+n} W'_{ij} \, x_j(t-1) + \sum_{j=1}^{N+n} W''_{ij} \, x_j(t-2) \tag{15}$$

Again, we can simply fix some of the weights to be zero, if we so choose, in order to simplify the network. In most applications today, the W'' weights are fixed to zero (i.e., erased from all formulas), and all the W' weights are fixed to zero as well, except for $W'_{ii}$. This is done in part for the sake of parsimony, and in part for historical reasons. (The "time-delay neural networks" of Watrous and Shastri [5] assumed that special case.) Here, I deliberately include extra terms for the sake of generality. I allow for the fact that all active neurons (neurons other than input neurons) can be allowed to input the outputs of any other neurons if there is a time lag in the connection. The weights W' and W'' are the weights on those time-lagged connections between neurons. [Lags of more than two periods are also easy to manage; they are treated just as one would expect from seeing how we handle lag-two terms, as a special case of (7).]
These equations could be embodied in a subroutine:

      SUBROUTINE NET2(X(t), W, W', W'', x(t-2), x(t-1), x(t), Yhat),

which is programmed just like the subroutine NET, with the modifications one would expect from (15). The output arrays are x(t) and Yhat.
When we call this subroutine for the first time, at t = 1, we face a minor technical problem: there is no value for x(-1) or x(0), both of which we need as inputs. In principle, we can use any values we wish to choose; the choice of x(-1) and x(0) is essentially part of the definition of our network. Most people simply set these vectors to zero, and argue that their network will start out with a blank slate in classifying whatever dynamic pattern is at hand, both in the training set and in later applications. (Statisticians have been known to treat these vectors as weights, in effect, to be adapted along with the other weights in the network. This works fine in the training set, but opens up questions of what to do when one applies the network to new data.) In this section, I will assume that the data run from an initial time t = 1 through to a final time t = T, which plays a crucial role in the derivative calculations. Section IV will show how this assumption can be relaxed somewhat.

C. Adapting the Network: Equations
To calculate the derivatives $F\_W_{ij}$, we use the same equations as before, except that (10) is replaced by

$$F\_x_i(t) = F\_\hat{Y}_{i-N}(t) + \sum_{j=i+1}^{N+n} W_{ji} \, F\_net_j(t) + \sum_{j=m+1}^{N+n} W'_{ji} \, F\_net_j(t+1) + \sum_{j=m+1}^{N+n} W''_{ji} \, F\_net_j(t+2) \tag{16}$$

Once again, if one wants to fix the W'' terms to zero, one can simply delete the rightmost term. Notice that this equation makes it impossible for us to calculate $F\_x_i(t)$ and $F\_net_i(t)$ until after $F\_net_j(t+1)$ and $F\_net_j(t+2)$ are known; we therefore have to proceed backwards in time, starting from time T. The derivatives with respect to the time-lagged weights are

$$F\_W'_{ij} = \sum_{t=1}^{T} F\_net_i(t+1) \, x_j(t) \tag{17}$$
$$F\_W''_{ij} = \sum_{t=1}^{T} F\_net_i(t+2) \, x_j(t) \tag{18}$$

In all of these calculations, F_net(T + 1) and F_net(T + 2) should be treated as zero. For programming convenience, I will later define quantities like $F\_net'_i(t) = F\_net_i(t+1)$, but this is purely a convenience; the subscript "i" and the time argument are enough to identify which derivative is being represented. (In other words, $net_i(t)$ represents a specific quantity $z_k$ as in (8), and $F\_net_i(t)$ represents the ordered derivative of E with respect to that quantity.)

D. Adapting the Network: Code
To fully understand the meaning and implications of these equations, it may help to run through a simple (hypothetical) implementation.
First, to calculate the derivatives, we need a new subroutine, dual to NET2.

      SUBROUTINE F_NET2(F_Yhat, W, W', W'', x, F_net, F_net', F_net'', F_W, F_W', F_W'')
      REAL F_Yhat(n), W(N+n, N+n), ...
      REAL x(N+n), F_net(N+n), F_net'(N+n), ...
      REAL F_W(N+n, N+n), F_W'(N+n, N+n), ...
C     Initialize equation (16)
      ...
C     RUN THROUGH (16), (12), (17), and (18) ... (as running sums)
      DO 12 j = 1, i-1
      ...

Once again, note that we have to go backwards in time in order to get the required derivatives. (There are ways to do these calculations in forward time, but exact results require the calculation of an entire Jacobian matrix, which is far more expensive with large networks.) For backpropagation through time, the natural way to adapt the network is in one big batch. Also note that we need to store a lot of intermediate information (which is inconsistent with real-time adaptation). This storage can be reduced by clever programming if W' and W'' are sparse, but it cannot be eliminated altogether. In using backpropagation through time, we usually need to use much smaller learning rates than we do in basic backpropagation if we use steepest descent at all. In my experience [20], it may also help to start out by fixing the W' weights to zero (or to 1 when we want to force memory) in an initial phase of adaptation, and slowly free them up.
In some applications, we may not really care about errors in classification at all times t. In speech recognition, for example, we may only care about errors at the end of a word or phoneme; we usually do output a preliminary classification before the phoneme has been finished, but we usually do not care about the accuracy of that preliminary classification. In such cases, we may simply set F_Yhat to zero in the times we do not care about. To be more sophisticated, we may replace (6) by a more precise model of what we do care about; whatever we choose, it should be simple to replace (9) and the F_Yhat loop accordingly.
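The backward sweep through time described in this section can be sketched in Python as follows. This is a hedged illustration, not a transcription of the subroutine above: the names `forward`, `backward`, `total_error`, `W1`, and `W2` (standing for W' and W'') are hypothetical, indices are 0-based, x(-1) and x(0) are assumed to be zero vectors, and all time periods are held in memory at once. The lagged-weight derivatives are accumulated as F_net_i(t) * x_j(t-1) and F_net_i(t) * x_j(t-2), which is the same sum as (17) and (18) with the time index shifted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(Xs, W, W1, W2, m, N, n):
    """Run (15) for t = 1..T. Returns xs, where xs[t + 1] holds x(t)."""
    size = N + n
    xs = [[0.0] * size, [0.0] * size]              # x(-1) and x(0), assumed zero
    for X in Xs:
        x = [0.0] * size
        x[:m] = X
        for i in range(m, size):
            net = sum(W[i][j] * x[j] for j in range(i))
            net += sum(W1[i][j] * xs[-1][j] + W2[i][j] * xs[-2][j] for j in range(size))
            x[i] = sigmoid(net)
        xs.append(x)
    return xs

def total_error(Xs, Ys, W, W1, W2, m, N, n):
    """E from (6), summed over all times; used for perturbation tests."""
    xs = forward(Xs, W, W1, W2, m, N, n)
    return sum(0.5 * (x[N + k] - Y[k]) ** 2 for x, Y in zip(xs[2:], Ys) for k in range(n))

def backward(Xs, Ys, W, W1, W2, m, N, n):
    """Backpropagation through time: returns (F_W, F_W1, F_W2)."""
    size, T = N + n, len(Xs)
    xs = forward(Xs, W, W1, W2, m, N, n)
    F_W = [[0.0] * size for _ in range(size)]
    F_W1 = [[0.0] * size for _ in range(size)]
    F_W2 = [[0.0] * size for _ in range(size)]
    F_net1 = [0.0] * size                          # F_net(t+1); zero when t = T
    F_net2 = [0.0] * size                          # F_net(t+2); zero when t >= T-1
    for t in range(T, 0, -1):                      # backwards in time
        x = xs[t + 1]
        F_x = [0.0] * size
        for k in range(n):
            F_x[N + k] = x[N + k] - Ys[t - 1][k]   # (9)
        for i in range(size):                      # time-lagged terms of (16)
            F_x[i] += sum(W1[j][i] * F_net1[j] + W2[j][i] * F_net2[j] for j in range(m, size))
        F_net = [0.0] * size
        for i in range(size - 1, m - 1, -1):
            F_net[i] = F_x[i] * x[i] * (1.0 - x[i])        # (11)
            for j in range(i):
                F_W[i][j] += F_net[i] * x[j]               # (12), as a running sum
                F_x[j] += W[i][j] * F_net[i]               # same-time term of (16)
        for i in range(m, size):
            for j in range(size):
                F_W1[i][j] += F_net[i] * xs[t][j]          # x_j(t-1), as in (17)
                F_W2[i][j] += F_net[i] * xs[t - 1][j]      # x_j(t-2), as in (18)
        F_net2, F_net1 = F_net1, F_net
    return F_W, F_W1, F_W2
```

Storing every x(t) is what makes this sketch unsuitable for real-time adaptation; that cost is the storage requirement the text refers to.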

IV. EXTENSIONS OF THE METHOD
Backpropagation through time is a very general method, with many extensions. This section will try to describe the most important of these extensions.

A. Use of Other Networks
The network shown in (1)-(5) is a very simple, basic network. Backpropagation can be used to adapt a wide variety of other networks, including networks representing econometric models, systems of simultaneous equations, etc. Naturally, when one writes computer programs to implement a different kind of network, one must either describe which alternative network one chooses or else put options into the program to give the user this choice.
In the neural network field, users are often given a choice of network "topology." This simply means that they are asked to declare which subset of the possible weights/connections will actually be used. Every weight removed from (15) should be removed from (16) as well, along with (12) and (14) (or whichever apply to that weight); therefore, simplifying the network by removing weights simplifies all the other calculations as well. (Mathematically, this is the same as fixing these weights to zero.) Typically, people will remove an entire block of weights, such that the limits of the sums in our equations are all shrunk.
In a truly brain-like network, each neuron [in (15)] will only receive input from a small number of other cells. Neuroscientists do not agree on how many inputs are typical; some cite numbers on the order of 100 inputs per cell, while others quote 10 000. In any case, all of these estimates are small compared to the billions of cells present. To implement this kind of network efficiently on a conventional computer, one would use a linked list or a list of offsets to represent the connections actually implemented for each cell; the same strategy can be used to implement the backwards calculations and keep the connection costs low. Similar tricks are possible in parallel computers of all types. Many researchers are interested in devising ways to automatically make and break connections so that users will not have to specify all this information in advance [20]. The research on topology is hard to summarize since it is a mixture of normal science, sophisticated epistemology, and extensive ad hoc experimentation; however, the paper by Guyon et al. [...]

s(z) = 1 - 1/(1 + z + 0.5 * z^2), z > 0.
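One way to make a restricted topology concrete is to store, for each neuron, only the connections that are actually present, and to walk the same lists in both the forward and the backward calculations. The Python sketch below illustrates this for the basic (non-recurrent) network; the data layout (`conns[i]` as a list of `[j, weight]` pairs with j < i) and the function names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sparse_forward(X, conns, m, size):
    """Forward pass where conns[i] lists [j, weight] pairs for neuron i (j < i)."""
    x = [0.0] * size
    x[:m] = X
    for i in range(m, size):
        x[i] = sigmoid(sum(w * x[j] for j, w in conns[i]))
    return x

def sparse_backward(F_x_init, x, conns, m):
    """Backward pass over the same lists.

    F_x_init holds the output feedback, as in (9), and zeros elsewhere.
    Returns F_w, where F_w[i][k] is the derivative for the k-th connection of neuron i.
    """
    F_x = list(F_x_init)
    F_w = {}
    for i in range(len(x) - 1, m - 1, -1):
        F_net = F_x[i] * x[i] * (1.0 - x[i])       # (11), using (13)
        F_w[i] = []
        for j, w in conns[i]:
            F_w[i].append(F_net * x[j])            # (12), only for stored connections
            F_x[j] += w * F_net                    # (10), only for stored connections
    return F_w
```

The cost of both passes is then proportional to the number of stored connections rather than to the square of the number of neurons.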
In a similar spirit, it is common to speed up learning by "stretching out" s(z) so that it goes from -1 to 1 instead of 0 to 1. Backpropagation can also be used without using neural networks at all. For example, it can be used to adapt a network consisting entirely of user-specified functions, representing something like an econometric model. In that case, the way one proceeds depends on who one is programming for and what kind of model one has.
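A small Python sketch may help show what "stretching out" s(z) involves. The particular rescaling used here, 2 * s(z) - 1, is an assumption (it is one simple way to map the range from (0, 1) to (-1, 1), and it equals tanh(z/2)); the function names are hypothetical. The point is that the derivative needed in (11) changes along with the function.

```python
import math

def s(z):
    """The sigmoid of (5), with outputs between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def s_stretched(z):
    """Assumed rescaling to the range (-1, 1): 2*s(z) - 1, equal to tanh(z/2)."""
    return 2.0 * s(z) - 1.0

def s_stretched_prime(z):
    """Derivative of the stretched function, written in terms of its own output y."""
    y = s_stretched(z)
    return 0.5 * (1.0 - y * y)     # equals 2*s(z)*(1 - s(z)), by (13)
```

When such a function replaces s in (3), the factor x(i)*(1 - x(i)) used in the backward calculations has to be replaced by 0.5*(1 - x(i)*x(i)).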
If one is programming for oneself and the model consists of a sequence of equations which can be invoked one after the other, then one should consider the tutorial paper [11], which also contains a more rigorous definition of what these "$F\_x_i$" derivatives really mean and a proof of the chain rule for ordered derivatives. If one is developing a tool for others, then one might set it up to look like a standard econometric package (like SAS or Troll) where the user of the system types in the equations of his or her model; the backpropagation would go inside the package as a way to speed up these calculations, and would mostly be transparent to the user. If one's model consists of a set of simultaneous equations which need to be solved at each time, then one must use more complicated procedures [15]; in neural network terms, one would call this a "doubly recurrent network." (The methods of Pineda [16] and Almeida [17] are special cases of this situation.)
Pearlmutter [18] and Williams [19] have described alternative methods, designed to achieve results similar to those of backpropagation through time, using a different computational strategy. For example, the Williams-Zipser method is a special case of the "conventional perturbation" equation cited in [14], which rejected this as a neural network method on the grounds that its computational costs scale as the square of the network size; however, the method does yield exact derivatives with a time-forward calculation.
Supervised learning problems or forecasting problems which involve memory can also be translated into control problems [15, p. 352], [20], which allows the use of adaptive critic methods, to be discussed in the next section. Normally, this would yield only an approximate solution (or approximate derivatives), but it would also allow time-forward real-time learning. If the network itself contains calculation noise (due to hardware limitations), the adaptive critic approach might even be more robust than backpropagation through time because it is based on mathematics which allow for the presence of noise.

B. Applications Other Than Supervised Learning
Backpropagation through time can also be used in two other major applications: neuroidentification and neurocontrol. (For applications to sensitivity analysis, see [14] and [15].) In neuroidentification, we try to do with neural nets what econometricians do with forecasting models. (Engineers would call this the identification problem or the problem of identifying dynamic systems. Statisticians refer to it as the problem of estimating stochastic time-series models.) Our training set consists of vectors X(t) and u(t), not X(t) and Y(t). Usually, X(t) represents a set of observations of the external (sic) world, and u(t) represents a set of actions that we had control over (such as the settings of motors or actuators). The combination of X(t) and u(t) is input to the network at each time t. Our target, at time t, is the vector X(t + 1).
We could easily build a network to input these inputs, and aim at these targets. We could simply collect the inputs and targets into the format of Section II, and then use basic backpropagation. But basic backpropagation contains no "memory." The forecast of X(t + 1) would depend on X(t), but not on previous time periods. If human beings worked like this, then they would be unable to predict that a ball might roll out the far side of a table after rolling down under the near side; as soon as the ball disappeared from sight [from the current vector X(t)], they would have no way of accounting for its existence. (Harold Szu has presented a more interesting example of this same effect: if a tiger chased after such a memoryless person, the person would forget about the tiger after first turning to run away. Natural selection has eliminated such people.) Backpropagation through time permits more powerful networks, which do have a "memory," for use in the same setup.
Even this approach to the neuroidentification problem has its limitations. Like the usual methods of econometrics [15], it may lead to forecasts which hold up poorly over multiple time periods. It does not properly identify where the noise comes from. It does not permit real-time adaptation.
In an earlier paper [20], I have described some ideas for overcoming these limitations, but more research is needed. The first phase of Kawato's cascade method [9] for controlling a robot arm is an identification phase, which is more robust over time, and which uses backpropagation through time in a different way; it is a special case of the "pure robust method," which also worked well in the earliest applications which I studied [1], [20].
After we have solved the problem of identifying a dynamic system, we are then ready to move on to controlling that system.
In neurocontrol, we often start out with a model or network which describes the system or plant we are trying to control. Our problem is to adapt a second network, the action network, which inputs X(t) and outputs the control u(t). (In actuality, we can allow the action network to "see" or input the entire vector x(t) calculated by the model network; this allows it to account for memories such as the recent appearance of a tiger.) Usually, we want to adapt the action network so as to maximize some measure of performance or utility U(X, t) summed over time. Performance measures used in past applications have included everything from the energy used to move a robot arm [8], [9] through to net profits received by the gas industry [11]. Typically, we are given a set of possible initial states X(1), and asked to train the action network so as to maximize the sum of utility from time 1 to a final time T.
To solve this problem using backpropagation through time, we simply calculate the derivatives of our performance measure with respect to all of the weights in the action network. "Backpropagation" refers to how we calculate the derivatives, not to anything involving pattern recognition or error. We then adapt the weights according to these derivatives, as in (12), except that the sign of the adjustment term is now positive (because we are maximizing rather than minimizing).
The easiest way to implement this approach is to merge the utility function, the model network, and the action network into one big network. We can then construct the dual to this entire network, as described in 1974 [1]. Instead of working with a single subroutine, NET, we now need three subroutines:
UTILITY(X; t; x"; U)
MODEL(X(t), u(t); x(t); X(t + 1))
ACTION(x(t); W; x'(t); u(t)).
In each of these subroutines, the two arguments on the right are technically outputs, and the argument on the far right is what we usually think of as the output of the network. We need to know the full vector x produced inside the model network so that the action network can "see" important memories. The action network does not need to have its own internal memory, but we need to save its internal state (x') so that we can later calculate derivatives. For simplicity, I will assume that MODEL does not contain any lag-two memory terms (i.e., W'' weights). The primes after the x's indicate that we are looking at the internal states of different networks; they are unrelated to the primes representing lagged values, discussed in Section III, which we will also need in what follows.
The outputs of these subroutines are the arguments on the far right (including F-net), which are represented by the broken lines in Fig. 4. The subroutine F-UTILITY simply reports out the derivatives of U(X, t) with respect to the variables X_i. The subroutine F-MODEL is like the earlier subroutine F-NET2, except that we need to output F-u instead of derivatives to weights. (Again, we are adapting only the action network here.) The subroutine F-ACTION is virtually identical to the old subroutine F-NET, except that we need to calculate F-W as a running sum (as we did in F-NET2).
Of these three subroutines, F-MODEL is by far the most complex. Therefore, it may help to consider some possible code.
      SUBROUTINE F-MODEL(F-net', F-X, x, F-net, F-U)
C     The weights inside this subroutine are those
C     used in MODEL, analogous to those in NET2, and are
C     unrelated to the weights in ACTION
      REAL F-net'(N+n), F-X(n), x(N+n), F-net(N+n), F-u(p), F-x(N+n)
      INTEGER i, j, n, m, N, p
      DO ...
The last small DO loop here assumes that u(t) was part of the input vector to the original subroutine MODEL, inserted into the slots between x(n + 1) and x(m). Again, a good programmer could easily compress all this; my goal here is only to illustrate the mathematics.
Finally, in order to adapt the action network, we go through multiple passes, each starting from one of the starting values of X(1). In each pass, we call ACTION and then MODEL, one after the other, until we have built up a stream of forecasts from time 1 up to time T. Then, for each time t going backwards from T to 1, we call the UTILITY subroutine, then F-UTILITY, then F-MODEL, and then F-ACTION. At the end of the pass, we have the correct array of derivatives F-W, which we can then use to adjust the weights of the action network.
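The pass just described can be sketched in Python for one starting state. This is only an illustration under strong simplifying assumptions, none of which come from the subroutines above: the model is a single tanh layer with fixed matrices A and B and no hidden units (so x(t) = X(t)), the action network is a single tanh layer with weights W, and the utility is U(X) = -0.5 |X|^2. All function names are hypothetical.

```python
import numpy as np

def utility(X):
    return -0.5 * float(X @ X)          # assumed utility: keep the state near zero

def f_utility(X):
    return -X                           # derivative of the assumed U with respect to X

def model(X, u, A, B):
    return np.tanh(A @ X + B @ u)       # assumed model: X(t+1) from X(t) and u(t)

def f_model(F_X_next, X_next, A, B):
    F_pre = F_X_next * (1.0 - X_next ** 2)   # back through the tanh
    return A.T @ F_pre, B.T @ F_pre          # derivatives passed to X(t) and to u(t)

def action(X, W):
    return np.tanh(W @ X)               # assumed action network: u(t) from X(t)

def f_action(F_u, X, u, W, F_W):
    F_s = F_u * (1.0 - u ** 2)          # back through the tanh
    F_W += np.outer(F_s, X)             # running sum of weight derivatives
    return W.T @ F_s                    # derivatives passed back to X(t)

def total_utility_and_gradient(W, X1, A, B, T):
    # Forward pass: alternate ACTION and MODEL to build X(1), ..., X(T).
    Xs, us = [np.asarray(X1, dtype=float)], []
    for t in range(T - 1):
        us.append(action(Xs[t], W))
        Xs.append(model(Xs[t], us[t], A, B))
    total = sum(utility(X) for X in Xs)
    # Backward pass: from the final time back to time 1.
    F_W = np.zeros_like(W, dtype=float)
    F_X = f_utility(Xs[-1])
    for t in range(T - 2, -1, -1):
        F_X_model, F_u = f_model(F_X, Xs[t + 1], A, B)
        F_X_action = f_action(F_u, Xs[t], us[t], W, F_W)
        F_X = f_utility(Xs[t]) + F_X_model + F_X_action
    return total, F_W

def ascent_step(W, F_W, learning_rate):
    # Positive sign: the utility is being maximized rather than minimized.
    return W + learning_rate * F_W
```

In use, one would repeat total_utility_and_gradient and ascent_step over the available starting states, possibly summing F_W across starting states before each update. Comparing F_W against finite differences of the total utility is a cheap way to check a backward pass of this kind.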
In general, backpropagation through time has the advantage of being relatively quick and exact. That is why I chose it for my natural gas application [11]. However, it cannot account for noise in the process to be controlled. To account for noise in maximizing an arbitrary utility function, we must rely on adaptive critic methods [21]. Adaptive critic methods do not require backpropagation through time in any form, and are therefore suitable for true real-time learning. There are other forms of neurocontrol as well [21] which are not based on maximizing a utility function.

C. Handling Strings of Data
In most of the examples above, I assumed that the training data form one lone time series, from t equals 1 to t equals T. Thus, in adapting the weights, I always assumed batch learning (except in the code in Section II); the weights were always adapted after a complete set of derivatives was calculated, based on a complete pass through all the data. Mechanically, one could use pattern learning in the backwards pass through time; however, this would lead to a host of problems, and it is difficult to see what it would gain.
Data in the real world are often somewhere between the two extremes represented by Sections II and III. Instead of having a set of unrelated patterns or one continuous time series, we often have a set of time series or strings. For example, in speech recognition, our training set may consist of a set of strings, each consisting of one word or one sentence. In robotics, our training set may consist of a set of strings, where each string represents one experiment with a robot.
In these situations, we can apply backpropagation through time to a single string of data at a time. For each string, we can calculate complete derivatives and update the weights. Then we can go on to the next string. This is like pattern learning, in that the weights are updated incrementally before the entire data set is studied. It requires intermediate storage for only one string at a time. To speed things up even further, we might adapt the net in stages, initially fixing certain weights (like Wjj) to zero or one. Nevertheless, string learning is not the same thing as real-time learning. To solve problems in neuroidentification and supervised learning, the only consistent way to have internal memory terms and to avoid backpropagation through time is to use adaptive critics in a supporting role [15]. That alternative is complex, inexact, and relatively expensive for these applications; it may be unavoidable for true real-time systems like the human brain, but it would probably be better to live with string learning and focus on other challenges in neuroidentification for the time being.

D. Speeding Up Convergence
For those who are familiar with numerical analysis and optimization, it goes without saying that steepest descent, as in (12), is a very inefficient method. There is a huge literature in the neural network field on how to speed up backpropagation. For example, Fahlman and Touretzky of Carnegie-Mellon have compiled and tested a variety of intuitive insights which can speed up convergence a hundredfold. Their benchmark problems may be very useful in evaluating other methods which claim to do the same.
A few authors have copied simple methods from the field of numerical analysis, such as quasi-Newton methods (BFGS) and Polak-Ribiere conjugate gradients; however, the former works only on small problems (a hundred or so weights) [22], while the latter works well only with batch learning and very careful line searches. The need for careful line searches is discussed in the literature [23], but I have found it to be unusually important when working with large problems, including simulated linear mappings.
In my own work, I have used Shanno's more recent conjugate gradient method with batch learning; for a dense training set (made up of distinctly different patterns), this method worked better than anything else I tried, including pattern learning methods [12]. Many researchers have used approximate Newton's methods, without saying that they are using an approximation; however, an exact Newton's method can also be implemented in O(N) storage, and has worked reasonably well in early tests [12]. Shanno has reported new breakthroughs in function minimization which may perform still better [24]. Still, there is clearly a lot of room for improvement through further research.
Needless to say, it can be much easier to converge to a set of weights which do not minimize error or which assume a simpler network; methods of that sort are also popular, but are useful only when they clearly fit the application at hand for identifiable reasons.

E. Miscellaneous Issues
Minimizing square error and maximizing likelihood are often taken for granted as fundamental principles in large parts of engineering; however, there is a large literature on alternative approaches [12], both in neural network theory and in robust statistics.
These literatures are beyond the scope of this paper, but a few related points may be worth noting. For example, instead of minimizing square error, we could minimize the 1.5 power of error; all of the operations above still go through. We can minimize E of (5) plus some constant k times the sum of squares of the weights; as k goes to infinity and the network is made linear, this converges to Kohonen's pseudoinverse method, a common form of associative memory. Statisticians like Dempster and Efron have argued that the linear form of this approach can be better than the usual least squares methods; their arguments capture the essential insight that people can forecast by analogy to historical precedent, instead of forecasting by a comprehensive model or network. Presumably, an ideal network would bring together both kinds of forecasting [12], [20].
Many authors worry a lot about local minima. In using backpropagation through time in robust estimation, I found it important to keep the "memory" weights near zero at first, and free them up gradually in order to minimize problems. When T is much larger than m (as statisticians recommend for good generalization), local minima are probably a lot less serious than rumor has it. Still, with T larger than m, it is very easy to construct local minima. Consider the example with m = 2 shown in Table I. The error for each of the patterns can be plotted as a contour map as a function of the two weights w1 and w2. (For this simple example, no threshold term is assumed.) Each map is made up of straight contours, defining a fairly sharp trough about a central line. The three central lines for the three patterns form a triangle, the vertices of which correspond roughly to the local minima. Even when T is much larger than m, conflicts like this can exist within the training set. Again, however, this may not be an overwhelming problem in practical applications [19].
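Returning to the alternative error measures mentioned at the start of this discussion, the following Python sketch shows the derivatives they would contribute. It is an illustration only: the function names are hypothetical, and it assumes that the derivative of the error with respect to each network output simply takes the place of the usual square-error term at the start of the backwards pass, while the penalty derivative is added to the derivative for each weight.

```python
import numpy as np

def power_error(y_hat, y, power=1.5):
    """Sum over outputs of |y_hat - y| raised to the given power."""
    e = np.asarray(y_hat, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sum(np.abs(e) ** power))

def f_power_error(y_hat, y, power=1.5):
    """Derivative of power_error with respect to each output y_hat."""
    e = np.asarray(y_hat, dtype=float) - np.asarray(y, dtype=float)
    return power * np.abs(e) ** (power - 1.0) * np.sign(e)

def weight_penalty(W, k):
    """The added term: k times the sum of squares of the weights."""
    W = np.asarray(W, dtype=float)
    return float(k * np.sum(W ** 2))

def f_weight_penalty(W, k):
    """Derivative of weight_penalty with respect to each weight."""
    return 2.0 * k * np.asarray(W, dtype=float)
```

With power equal to 1.5, the derivative grows like the square root of the error, so large errors pull less strongly on the weights than they do under square error.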

V. SUMMARY
Backpropagation through time can be applied to many different categories of dynamical systems: neural networks, feedforward systems of equations, systems with time lags, systems with instantaneous feedback between variables (as in ordinary differential equations or simultaneous equation models), and so on. The derivatives which it calculates can be used in pattern recognition, in systems identification, and in stochastic and deterministic control. This paper has presented the key equations of backpropagation, as applied to neural networks of varying degrees of complexity. It has also discussed other papers which elaborate on the extensions of this method to more general applications and some of the tradeoffs involved.