Estimation of Dynamic Models with Occasionally Binding Constraints.

We present an algorithm for estimating non-linear dynamic models, including those featuring occasionally binding constraints. The algorithm extends the Cubature Kalman Filter of Arasaratnam and Haykin (2009) with dynamic state space reduction, to give adequate speed in the presence of occasionally binding constraints, and to ensure that it can handle the large state spaces generated by pruned perturbation solutions to medium-scale DSGE models. We further extend the base algorithm to allow for alternative cubature procedures to improve the tracking of non-linearities. The algorithm relies on the solution method for models with occasionally binding constraints of Holden (2016b). We illustrate that the method can solve some of the identification problems that plague linearized DSGE models.


Introduction
Traditionally, DSGE models have been linearized prior to solution and estimation.
Assuming that the driving shocks are normally distributed, the resulting solution is a linear-Gaussian state space model that may be estimated without additional approximation error via the standard Kalman filter. Thanks to linearity, and the special properties of the normal distribution, this is remarkably tractable, and has enabled the estimation of rich macroeconomic models, as popularized by Smets and Wouters (2003). However, any small departure from linearity or Gaussianity means that avoiding approximation error would require tracking the full distribution of the state, which is an infinite dimensional object. As a result, any non-linear filter will inevitably introduce additional approximation error.
One important departure from linearity is the presence of occasionally binding constraints. Such constraints are ubiquitous in modern macroeconomic models, appearing, variously, in the zero lower bound on nominal interest rates, in the irreversibility of investment, and in borrowing and collateral constraints. If they are ignored during estimation, the resulting parameter estimates can be severely biased, as lower bounded variables will tend to have a higher mean in an accurate approximation than they do under a first order one. Furthermore, when occasionally binding constraints are ignored, there can be failures of identification, as some of the model's parameters may only alter behaviour when at the bound.
Models with occasionally binding constraints present a variety of difficulties. As established by Holden (2016a), they may possess multiple solutions in some states, and no solutions at all in others, even when a terminal condition is specified. For estimation, this means both that the likelihood may be minus infinity for some parameters, and that unless care is taken, the likelihood may be discontinuous in the parameters, invalidating standard inference. Holden (2016b) provides the first simulation algorithm for models with occasionally binding constraints that is always guaranteed to select the same solution in the presence of multiplicity, restoring continuity in the parameters; prior to that paper it was thus impossible to perform conventional statistical inference in models with occasionally binding constraints. Furthermore, Holden (2016b) establishes that computing solutions to models with occasionally binding constraints is a computationally difficult problem, in a sense which that paper makes precise. Indeed, Holden (2016b) presents the first algorithm for simulating such models that is guaranteed to complete in finite time. Since almost all non-linear estimation methods are based upon repeated simulation, this in turn implies that estimating models with occasionally binding constraints is likely to be particularly difficult, and essentially impossible if the Holden (2016b) simulation algorithm is not used.
Some progress has recently been made in estimating non-linear DSGE models using relatively standard particle filters (see e.g. Fernández-Villaverde and Rubio-Ramírez 2007; Fernández-Villaverde, Guerrón-Quintana, and Rubio-Ramírez 2015). The standard particle filter has the advantage that its estimate of the likelihood is unbiased (though noisy), and this is sufficient for correct Bayesian inference at least (Andrieu, Doucet, and Holenstein 2010). However, these methods require too many simulation steps to be computationally tractable in the presence of occasionally binding constraints, at least on current computer hardware. Furthermore, one of the chief reasons macroeconomists were led towards Bayesian methods was the weak identification of many linearized DSGE models. Since non-linear filters potentially solve the weak identification issue, we would ideally like to be able to return to maximum likelihood estimation, which is less econometrically divisive. However, due to the noise in the standard particle filter's estimate of the likelihood, numerically maximising the likelihood is almost impossible in this case. While techniques such as the smooth particle filter of Pitt (2002) may remedy this, they come at the cost of introducing substantial sampling noise into the final parameter estimates.
An alternative track of the literature has sought to exploit the properties of pruned perturbation solutions (Kim et al. 2008) to enable the use of the standard Kalman filter, despite the non-linearity, albeit with some additional approximation error. One prominent example of this approach is that of Kollmann (2013), who exploits the existence of an augmented state-space in which the pruned perturbation solution is linear, though non-Gaussian. Another is that of Meyer-Gohde (2014), who exploits the ability to get a closed form first order approximation around the mean of an underlying pruned perturbation approximation. However, both of these methods are restricted to models which are everywhere differentiable, ruling out occasionally binding constraints.
In this paper, we take an intermediary route between these two extremes, based on the cubature Kalman filter of Arasaratnam and Haykin (2009). Like the standard particle filter approach, we will rely on approximating the distribution of the state via simulation at a collection of points. Indeed, the approach we take is sometimes referred to as a type of particle filter (see e.g. Adjemian et al. 2016). Like the alternative approach though, we will maintain a Gaussian approximation to the state of the model, and we use an underlying simulation algorithm from Holden (2016b) that exploits the pruned perturbation structure.
Relative to Arasaratnam and Haykin (2009), our approach expands and improves along several lines. Firstly, we give an algorithm for the initialization of the state distribution. Secondly, we present a technique that ensures the dimensionality of both this initial state, and subsequent states, are as low as possible, without introducing major inaccuracies. Given that pruned perturbation solutions produce very large state spaces, this is essential for medium-scale DSGE applications. Thirdly, we introduce a choice of cubature methods, which will produce better estimates at some additional computational cost. Fourthly, we modify the method to support models in which shocks enter the model non-linearly. Fifthly, we discuss techniques for the numerical maximisation of the likelihood, which needs considerable care given the potential multimodality. Such difficulties in numerical maximisation were another key reason why the profession shifted towards Bayesian methods, so addressing them is crucial if we are to argue for using maximum likelihood estimates. Finally, we discuss the computation of standard errors, given the non-normality of the actual surprise content of observations.
Our paper is structured as follows. In the following section, we present our estimation algorithm, and discuss assorted extensions and econometric issues. In section 3, we then test the algorithm's performance, and illustrate how non-linear estimation can address the weak identification of linearized DSGE models. Section 4 concludes. All files needed for the replication of this paper's numerical results are included in the "Examples" directory of the author's DynareOBC toolkit.

Our algorithm for estimating non-linear models
In this section, we first describe the core algorithm, leaving unspecified how the Gaussian integrals are evaluated. We then go on to discuss alternative methods of performing the Gaussian integration. The following sub-section discusses the computation of the initial distribution of the state. This is followed by material on the practical numerical maximisation of the likelihood, and the computation of standard errors. The section concludes with a discussion of our implementation of this algorithm in the open source DynareOBC toolkit, which is freely available from https://github.com/tholden/dynareOBC.

Core algorithm
The solution to a general non-linear DSGE model takes the following form for all $t \in \mathbb{N}^+$:
$$\begin{bmatrix} x_t \\ y_t \\ z_t \end{bmatrix} = f(x_{t-1}, \varepsilon_t),$$
where $x_t \in \mathbb{R}^{n_x}$, $y_t \in \mathbb{R}^{n_y}$, $z_t \in \mathbb{R}^{n_z}$, $f: \mathbb{R}^{n_x} \times \mathbb{R}^{n_\varepsilon} \to \mathbb{R}^{n_x + n_y + n_z}$, and $\varepsilon_t \sim \text{NIID}\big(0, I_{n_\varepsilon \times n_\varepsilon}\big)$. The restriction to normal shocks is without loss of generality, since if $\varepsilon \sim \mathrm{N}(0,1)$, we can generate shocks from a distribution with cumulative distribution function $F$ by evaluating $F^{-1}\big(\Phi(\varepsilon)\big)$, where $\Phi$ is the cumulative distribution function of the standard normal.
Given that the shocks may enter non-linearly, such an expression may be incorporated into our $f$.
We allow $f$ to return more than just the state variables, as often it is impossible (or at least inconvenient) to substitute control variables out of a model's equations. $y_t$ will include all of the controls which have an impact on the measurement equations, and $z_t$ includes any other controls.
We suppose that rather than observing $x_t$, $y_t$ and/or $z_t$ for all $t \in \mathbb{N}^+$, we instead observe:
$$m_t = \mu(y_t) + \nu_t,$$
where $m_t \in \mathbb{R}^{n_m}$, $\mu: \mathbb{R}^{n_y} \to \mathbb{R}^{n_m}$, and $\nu_t \sim \text{NIID}(0, \Lambda)$, where $\Lambda \in \mathbb{R}^{n_m \times n_m}$ is diagonal. Restricting ourselves to additive, uncorrelated, Gaussian measurement error is again without loss of generality, as richer measurement error processes can be directly incorporated into $f$. Not allowing $\mu$ to depend directly on $x_t$ is also without loss of generality, as components of $x_t$ can pass through $y_t$ if really required. It is important though that we do always allow for some additional measurement error, since our state space dimension reduction procedure may induce stochastic singularity even in models with as many shocks as observables, if some shocks make a small enough contribution.

Now, suppose that we believe that $x_{t-1}|\mathcal{F}_{t-1} \sim \mathrm{N}\big(\hat{x}_{t-1|t-1},\, S_{t-1|t-1} S_{t-1|t-1}'\big)$, where $\mathcal{F}_{t-1} := \{m_1, \ldots, m_{t-1}\}$ is the period $t-1$ information set, and where $S_{t-1|t-1}$ is not necessarily square. As in the standard Kalman filter, we start with a predict step, i.e. by calculating the distribution of $x_t|\mathcal{F}_{t-1}$. Let $a \in \mathbb{R}^{k_{t-1|t-1}}$ be a draw from $\mathrm{N}\big(0_{k_{t-1|t-1}}, I_{k_{t-1|t-1} \times k_{t-1|t-1}}\big)$, where $k_{t-1|t-1} := \operatorname{cols} S_{t-1|t-1}$, and note that the distribution of $x_{t-1}|\mathcal{F}_{t-1}$ is then equal to that of $\hat{x}_{t-1|t-1} + S_{t-1|t-1} a$. Hence, writing $g(a, \varepsilon) := f_{xy}\big(\hat{x}_{t-1|t-1} + S_{t-1|t-1} a,\ \varepsilon\big)$ for brevity, where $f_{xy}$ denotes the first $n_x + n_y$ components of $f$:
$$\mathbb{E}\left[\begin{bmatrix} x_t \\ y_t \end{bmatrix} \,\middle|\, \mathcal{F}_{t-1}\right] = \int g(a, \varepsilon)\, \phi_{k_{t-1|t-1}+n_\varepsilon}\big([a', \varepsilon']'\big)\, \mathrm{d}[a', \varepsilon']'$$
and
$$\operatorname{Var}\left[\begin{bmatrix} x_t \\ y_t \end{bmatrix} \,\middle|\, \mathcal{F}_{t-1}\right] = \int \left(g(a, \varepsilon) - \mathbb{E}\left[\begin{bmatrix} x_t \\ y_t \end{bmatrix} \,\middle|\, \mathcal{F}_{t-1}\right]\right)\left(g(a, \varepsilon) - \mathbb{E}\left[\begin{bmatrix} x_t \\ y_t \end{bmatrix} \,\middle|\, \mathcal{F}_{t-1}\right]\right)' \phi_{k_{t-1|t-1}+n_\varepsilon}\big([a', \varepsilon']'\big)\, \mathrm{d}[a', \varepsilon']',$$
where $\phi_k: \mathbb{R}^k \to \mathbb{R}^+$ is the probability density function of a standard $k$-dimensional normal variable. Thus, to derive a Gaussian approximation to the distribution of $[x_t', y_t']'|\mathcal{F}_{t-1}$, we just need to evaluate a pair of $(k_{t-1|t-1} + n_\varepsilon)$-dimensional standard Gaussian integrals.
Methods for approximately performing this integration will be discussed in the following section, but we observe here that any integration rule will take the form of a weighted sum over a set of sample points. Hence, the two integrals may be evaluated simultaneously: we first evaluate $g(a, \varepsilon) = f_{xy}\big(\hat{x}_{t-1|t-1} + S_{t-1|t-1} a, \varepsilon\big)$ at the sample points, then use the integration rule to calculate the approximation to $\mathbb{E}\big[[x_t', y_t']'\big|\mathcal{F}_{t-1}\big]$, which then enables us to calculate the outer products entering the variance integral without further evaluations of $f$. Using these approximate integrals, we may define:
$$\begin{bmatrix} \hat{x}_{t|t-1} \\ \hat{y}_{t|t-1} \end{bmatrix} :\approx \mathbb{E}\left[\begin{bmatrix} x_t \\ y_t \end{bmatrix} \,\middle|\, \mathcal{F}_{t-1}\right], \qquad P_{t|t-1} :\approx \operatorname{Var}\left[\begin{bmatrix} x_t \\ y_t \end{bmatrix} \,\middle|\, \mathcal{F}_{t-1}\right],$$
where "$:\approx$" is a shorthand for "is defined to be an approximation to". Finally, we note that while the normal approximation will not be exact for non-linear models, in practice it will often beat particle approximations unless implausibly large numbers of particles are used.
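To fix ideas, the following sketch (in Python/NumPy; purely illustrative, and not the DynareOBC implementation, whose details differ) shows how a single predict step can evaluate both integrals from one common set of function evaluations. The model function `f`, the cubature rule and all dimensions are placeholders supplied by the caller.

```python
import numpy as np

def predict_step(f, x_hat, S, n_eps, cubature):
    """One predict step: x_{t-1} | F_{t-1} ~ N(x_hat, S S'), shocks eps ~ N(0, I).
    `cubature(dim)` must return (nodes, weights) approximating integrals against
    a standard dim-dimensional Gaussian. `f` maps (state, shocks) to the stacked
    vector of interest. Returns approximations to its conditional mean and variance."""
    k = S.shape[1]                              # reduced dimension of the state factor
    nodes, weights = cubature(k + n_eps)
    # Evaluate f once per cubature node; both integrals reuse these evaluations.
    outputs = np.array([f(x_hat + S @ a[:k], a[k:]) for a in nodes])
    mean = weights @ outputs                    # approximation to E[ . | F_{t-1}]
    dev = outputs - mean
    cov = (weights[:, None] * dev).T @ dev      # approximation to Var[ . | F_{t-1}]
    return mean, cov
```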
We then proceed to the update step. Recall that in the standard Kalman filter, the update step is based on the surprise in the observed measurement. Thus, we need to evaluate the distribution of $m_t|\mathcal{F}_{t-1}$. Since $m_t$ is only a function of $y_t$ and $\nu_t$, and both are at least approximately Gaussian conditional on $\mathcal{F}_{t-1}$, this will require a further Gaussian integral. To perform this integration, first let $U_{t|t-1} D_{t|t-1} U_{t|t-1}'$ be the Schur decomposition of $P_{t|t-1}$, where $U_{t|t-1}$ is orthogonal, and $D_{t|t-1}$ is diagonal and weakly positive, since $P_{t|t-1}$ is positive semi-definite (henceforth, p.s.d.). Let $d_{t|t-1} := \operatorname{diag} D_{t|t-1}$. Without loss of generality, suppose that only the first $k_{t|t-1}$ elements of $d_{t|t-1}$ are bigger than some threshold $\tau$, where in the below we set $\tau := 10^{-6}$. Now let $U_{t|t-1,\cdot 1}$ be the first $k_{t|t-1}$ columns of $U_{t|t-1}$, and $d_{t|t-1,1}$ be the first $k_{t|t-1}$ elements of $d_{t|t-1}$. Then, providing $\tau$ is small, $P_{t|t-1} \approx U_{t|t-1,\cdot 1} \operatorname{diag}\big(d_{t|t-1,1}\big) U_{t|t-1,\cdot 1}'$, giving a reduced rank approximation to $P_{t|t-1}$.
There are two benefits to taking reduced rank approximations. Firstly, by reducing the dimensionality of the space over which we have to integrate, it will greatly speed up the computation. Secondly, integration rules are often much better behaved in lower dimensions. This may mean they evaluate less far from the centre, avoiding distortions caused by extreme tail non-linearities, or it may mean that they have more uniform weights, avoiding e.g. failures of positive semi-definiteness caused by a negative weight.
We go on to set:
$$S_{t|t-1} := U_{t|t-1,\cdot 1} \operatorname{diag}\sqrt{d_{t|t-1,1}},$$
so that $S_{t|t-1} S_{t|t-1}' \approx P_{t|t-1}$. Consequently, if $a \sim \mathrm{N}\big(0_{k_{t|t-1}}, I_{k_{t|t-1} \times k_{t|t-1}}\big)$, the distribution of $[x_t', y_t']'|\mathcal{F}_{t-1}$ is approximately equal to that of $[\hat{x}_{t|t-1}', \hat{y}_{t|t-1}']' + S_{t|t-1} a$. Hence:
$$\mathbb{E}[m_t|\mathcal{F}_{t-1}] = \int \mu\big(\hat{y}_{t|t-1} + S_{t|t-1,2}\, a\big)\, \phi_{k_{t|t-1}}(a)\, \mathrm{d}a,$$
where $S_{t|t-1,1}$ denotes the rows of $S_{t|t-1}$ corresponding to $x_t$ and $S_{t|t-1,2}$ those corresponding to $y_t$, and:
$$\operatorname{Var}[m_t|\mathcal{F}_{t-1}] = \Lambda + \int \big(\mu\big(\hat{y}_{t|t-1} + S_{t|t-1,2}\, a\big) - \mathbb{E}[m_t|\mathcal{F}_{t-1}]\big)\big(\mu\big(\hat{y}_{t|t-1} + S_{t|t-1,2}\, a\big) - \mathbb{E}[m_t|\mathcal{F}_{t-1}]\big)'\, \phi_{k_{t|t-1}}(a)\, \mathrm{d}a,$$
$$\operatorname{Cov}[x_t, m_t|\mathcal{F}_{t-1}] = \int \big(S_{t|t-1,1}\, a\big)\big(\mu\big(\hat{y}_{t|t-1} + S_{t|t-1,2}\, a\big) - \mathbb{E}[m_t|\mathcal{F}_{t-1}]\big)'\, \phi_{k_{t|t-1}}(a)\, \mathrm{d}a.$$
As before, these Gaussian integrals may be simultaneously approximated by the cubature methods we will discuss in the next section. Using this approximation, we define:
$$\hat{m}_{t|t-1} :\approx \mathbb{E}[m_t|\mathcal{F}_{t-1}], \qquad P^{mm}_{t|t-1} :\approx \operatorname{Var}[m_t|\mathcal{F}_{t-1}], \qquad P^{xm}_{t|t-1} :\approx \operatorname{Cov}[x_t, m_t|\mathcal{F}_{t-1}].$$
We want to know the distribution of $x_t|\mathcal{F}_t$, but by the definition of $\mathcal{F}_t$, this is just equal to the distribution of $x_t|m_t, \mathcal{F}_{t-1}$. Hence, by standard results on conditional normal distributions, $x_t|\mathcal{F}_t$ is approximately distributed as:
$$\mathrm{N}\Big(\hat{x}_{t|t-1} + P^{xm}_{t|t-1}\big(P^{mm}_{t|t-1}\big)^{-1}\big(m_t - \hat{m}_{t|t-1}\big),\ P^{xx}_{t|t-1} - P^{xm}_{t|t-1}\big(P^{mm}_{t|t-1}\big)^{-1}P^{xm\prime}_{t|t-1}\Big),$$
where $P^{xx}_{t|t-1}$ denotes the block of $P_{t|t-1}$ corresponding to $x_t$. Hence, we define:
$$\hat{x}_{t|t} := \hat{x}_{t|t-1} + P^{xm}_{t|t-1}\big(P^{mm}_{t|t-1}\big)^{-1}\big(m_t - \hat{m}_{t|t-1}\big), \qquad P_{t|t} := P^{xx}_{t|t-1} - P^{xm}_{t|t-1}\big(P^{mm}_{t|t-1}\big)^{-1}P^{xm\prime}_{t|t-1}.$$
To complete the inductive step, we have to find $S_{t|t}$ such that $S_{t|t} S_{t|t}' \approx P_{t|t}$. We proceed much as before, first constructing $U_{t|t}$ (orthogonal) and $d_{t|t}$ such that $P_{t|t} = U_{t|t} \operatorname{diag}\big(d_{t|t}\big) U_{t|t}'$, and then setting:
$$S_{t|t} := U_{t|t,\cdot 1} \operatorname{diag}\sqrt{d_{t|t,1}},$$
where $d_{t|t,1}$ contains the $k_{t|t}$ elements of $d_{t|t}$ that are greater than $\tau$, and $U_{t|t,\cdot 1}$ contains the corresponding columns of $U_{t|t}$. After this, we have that the distribution of $x_t|\mathcal{F}_t$ is approximately:
$$\mathrm{N}\big(\hat{x}_{t|t},\, S_{t|t} S_{t|t}'\big),$$
which completes one time-step.
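The corresponding update step might look as follows, in the same illustrative Python style. Here `mu`, `cubature` and the handling of the measurement error covariance `Lambda` are assumptions of the sketch rather than a description of the toolkit; for simplicity the measurement function is handed the whole stacked vector (in the text it depends only on the $y_t$ block), and the whole stacked vector is conditioned on the observation rather than just $x_t$. The final lines anticipate the likelihood computation of the next paragraph.

```python
import numpy as np

def reduce_rank(P, tau=1e-6):
    """Factor a symmetric matrix P as S S', keeping only eigenvalues above tau."""
    d, U = np.linalg.eigh(P)                    # spectral (Schur) decomposition
    keep = d > tau
    return U[:, keep] * np.sqrt(d[keep])        # columns scaled by sqrt eigenvalues

def update_step(mu, w_hat, P_ww, m_obs, Lambda, cubature):
    """One update step. (w_hat, P_ww): predicted mean/covariance of the stacked
    vector w_t = (x_t, y_t); mu maps w_t to the noise-free measurement;
    Lambda: measurement error covariance matrix."""
    S = reduce_rank(P_ww)                       # reduced rank factor of P_ww
    nodes, weights = cubature(S.shape[1])
    w_pts = w_hat + nodes @ S.T                 # cubature points in w-space
    m_pts = np.array([mu(w) for w in w_pts])
    m_hat = weights @ m_pts                     # ~ E[m_t | F_{t-1}]
    dev_m = m_pts - m_hat
    dev_w = w_pts - weights @ w_pts
    P_mm = (weights[:, None] * dev_m).T @ dev_m + Lambda   # ~ Var[m_t | F_{t-1}]
    P_wm = (weights[:, None] * dev_w).T @ dev_m            # ~ Cov[w_t, m_t | F_{t-1}]
    gain = P_wm @ np.linalg.inv(P_mm)
    w_post = w_hat + gain @ (m_obs - m_hat)     # conditioning on the observation
    P_post = P_ww - gain @ P_wm.T
    # Contribution of this observation to the approximate log-likelihood.
    resid = m_obs - m_hat
    _, logdet = np.linalg.slogdet(P_mm)
    loglik = -0.5 * (len(m_obs) * np.log(2.0 * np.pi) + logdet
                     + resid @ np.linalg.solve(P_mm, resid))
    # The x_t block of (w_post, P_post) would then be re-factorised with
    # reduce_rank to give the starting point for the next predict step.
    return w_post, P_post, loglik
```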
We close this section by noting that we can use these calculations and approximations to obtain the approximate likelihood, in the standard way. In particular:
$$\log p(m_1, \ldots, m_T) = \sum_{t=1}^{T} \log p(m_t|\mathcal{F}_{t-1}) \approx -\frac{1}{2}\sum_{t=1}^{T}\Big[n_m \log 2\pi + \log\det P^{mm}_{t|t-1} + \big(m_t - \hat{m}_{t|t-1}\big)'\big(P^{mm}_{t|t-1}\big)^{-1}\big(m_t - \hat{m}_{t|t-1}\big)\Big].$$
This gives an iterative formula for progressively calculating the approximate log-likelihood.

Cubature methods
Degree 3 monomial rule
Arasaratnam and Haykin (2009) suggest approximating the Gaussian integrals via degree three monomial cubature. The degree three rule they advocate is based upon the following approximation:
$$\int_{\mathbb{R}^n} h(a)\, \phi_n(a)\, \mathrm{d}a \approx \frac{1}{2n} \sum_{i=1}^{n} \Big[ h\big(\sqrt{n}\, e_i\big) + h\big(-\sqrt{n}\, e_i\big) \Big],$$
where $e_i$ is the $i$th column of $I_{n \times n}$. This approximation is exact when $h$ is a sum of monomials of at most degree 3. Indeed, this is the degree 3 rule requiring the minimum possible number of function evaluations, making it highly computationally efficient.
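In code, the rule and a quick check of its degree 3 exactness can be written as follows (an illustrative NumPy fragment; the choice $n = 4$ is arbitrary):

```python
import numpy as np

def degree3_rule(n):
    """Minimal degree 3 rule for N(0, I_n): 2n nodes at +/- sqrt(n) e_i,
    each with weight 1/(2n)."""
    nodes = np.vstack([np.sqrt(n) * np.eye(n), -np.sqrt(n) * np.eye(n)])
    return nodes, np.full(2 * n, 1.0 / (2 * n))

# Quick check of exactness for monomials up to degree 3 (here with n = 4):
n = 4
nodes, weights = degree3_rule(n)
print(np.allclose(weights @ nodes, 0.0))                              # E[a] = 0
print(np.allclose((weights[:, None] * nodes).T @ nodes, np.eye(n)))   # E[a a'] = I
print(np.allclose(weights @ nodes ** 3, 0.0))                         # E[a_i^3] = 0
```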
In a model without occasionally binding constraints, under a first order approximation the rule will give the exact mean and variance, and under a second or third order approximation, it will give the exact mean, but only an approximate variance. In the presence of occasionally binding constraints, it will only give approximate means and variances, whatever the order of approximation.
However, in practice this rule tends to perform remarkably well. As discussed in Holden (2016b), degree 3 monomial cubature rules are particularly robust since they have positive, equal weights. All known higher degree integration rules whose number of nodes grows no more than polynomially in $n$ feature negative weights on at least some nodes (Cools 2003), which means that their result is not guaranteed to lie within the convex hull of the source evaluations, and that the approximated covariance matrix need not be p.s.d.
A downside of the rule is that when $n$ is large, it evaluates the integrand a long distance from the mean (at a distance of $\sqrt{n}$ from the origin). When the true integrand is actually a sum of monomials of at most degree 3, this obviously does not matter, but in reality $h$ often has substantially more curvature.
Indeed, in the presence of occasionally binding constraints, $h$ may feature extreme behaviour in the tails. Hence, when $h$ is evaluated far into the tails, we are likely to obtain biased estimates of the integral. Our dimension reduction algorithm obviously helps with this, but in large models it may still be problematic.

Genz and Keister (1996) rules
By way of motivation, note that with the standard cubature Kalman filter, if we obtained a non-p.s.d. covariance matrix at one step, it would produce a catastrophic failure of the likelihood evaluation. However, given our dimension reduction algorithm, in our context this is not the case. To see this, suppose that $P_{t|t-1}$ is not p.s.d.. It will nonetheless be symmetric, as it is being approximated by a weighted sum of symmetric matrices. Hence, by the spectral theorem for real symmetric matrices, the Schur decomposition will still give us an orthogonal $U_{t|t-1}$ and a diagonal $D_{t|t-1}$ such that $U_{t|t-1} D_{t|t-1} U_{t|t-1}' = P_{t|t-1}$. Thus, we can still select the values on the diagonal of $D_{t|t-1}$ that are greater than $\tau$ to give a p.s.d. approximation to $P_{t|t-1}$, as before. This will be a reasonable approximation to the true covariance of interest provided (plausibly enough) that the reason we ended up with a non-p.s.d. value for $P_{t|t-1}$ was that the true covariance has some very small eigenvalues.
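A minimal numerical illustration of this repair (the matrix below is made up purely for the example):

```python
import numpy as np

# A symmetric matrix with one (tiny) negative eigenvalue, of the kind a negative
# cubature weight might produce.
P = np.array([[1.0,   0.999,  0.0  ],
              [0.999, 1.0,    0.0  ],
              [0.0,   0.0,   -1e-8 ]])
d, U = np.linalg.eigh(P)
keep = d > 1e-6                       # the threshold tau from the text
S = U[:, keep] * np.sqrt(d[keep])     # reduced rank factor
P_psd = S @ S.T                       # p.s.d. approximation to P
print(np.linalg.eigvalsh(P_psd))      # all (numerically) non-negative
```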
In light of this discussion, it is natural for us to reappraise the use of higher degree cubature rules in our setting. Holden (2016b) found that the rules of Genz and Keister (1996) performed very well in a different context, with not excessively high computational cost, so these rules seem a natural thing to try here too. These rules allow one to choose the maximum degree of monomial that should be integrated exactly, up to a maximum of 51. The number of points used is $O\big(n^K\big)$, where $2K + 1$ is the degree of monomial that is integrated exactly. When $K > 0$ and $n > 1$, the rule features negative weights on at least one node, but this enables it to ensure that the maximum absolute value of the integration points is independent of $n$. This contrasts with the aforementioned degree 3 rule, in which the higher is $n$, the further into the tails of the distribution one has to evaluate the integrand, which may lead to poor performance as discussed previously. Furthermore, by using a higher degree rule, we can generally obtain a better approximation to the integrand, leading to improved estimates. The Genz and Keister (1996) rules are likely to be particularly useful for performing the integrals in the update step, as these integrals do not require performing costly simulation steps at each node, so using many nodes is quite tractable.

Selecting the initial distribution of the state
In order to evaluate the likelihood, we need to be able to calculate the unconditional distribution of $m_1$, which in turn requires the unconditional distribution of $x_0$. Now, if we have a model without occasionally binding constraints, then it is possible to get the mean and covariance of $x_0$ without simulation if we are using a pruned solution. Even if we are not using a pruned solution, we can at least derive reasonable approximations to these quantities without resorting to simulation. However, in the presence of occasionally binding constraints, or with more general non-linearities, we will not be able to calculate a reasonable approximation to the stationary distribution of $x_t$ without resorting to simulation.
One approach then would be to simulate a long run from the model, discard a burn-in period, and then take the mean and covariance of the remainder. This has two drawbacks. Firstly, due to the high degree of auto-correlation in many DSGE models, removing all sampling variation from the estimate would require a prohibitively long simulation run. Even if the same random seed were used for each run, so that the objective remained continuous in the parameters even in finite samples, the result would just be that the sampling variation was transmitted to the final parameter estimates. This is essentially the same problem as is encountered by the particle filter in maximum likelihood contexts.

Secondly, it is not clear that the stationary distribution of the model is actually what is needed here. In general, thanks to the approximations intrinsic in any variant of the cubature Kalman filter, including ours, the value to which $\hat{x}_{t|t}$ and $P_{t|t}$ would converge given an infinite string of completely uninformative observations will not agree with the stationary distribution of the model. If it is agreed that parameter estimates should not change when the data set is augmented by a run of initial missing observations, then rather than trying to evaluate the stationary distribution of $x_t$, we should be trying to evaluate the limit of $\hat{x}_{t|t}$ and $P_{t|t}$ when no information arrives.
This is the approach we take here. In particular, we imagine that we were tasked with running the cubature Kalman filter on an infinite run of missing observations. Since this works in "pseudo-time", to distinguish "pseudo-times" from real times, we place a tilde over all "pseudo-time" indices in the following. We start by initializing $\hat{x}_{\tilde{0}|\tilde{0}}$ and $P_{\tilde{0}|\tilde{0}}$ with some easily computable approximation, such as that derived from the pruned perturbation approximation to the model, omitting occasionally binding constraints. We then run the cubature Kalman filter forward, skipping the update step, meaning that $\hat{x}_{\tilde{t}|\tilde{t}} = \hat{x}_{\tilde{t}|\tilde{t}-1}$ and $P_{\tilde{t}|\tilde{t}} = P_{\tilde{t}|\tilde{t}-1}$ for all $\tilde{t} \in \mathbb{N}^+$. We continue until the change in $\hat{x}_{\tilde{t}|\tilde{t}}$ and $P_{\tilde{t}|\tilde{t}}$ is sufficiently small (e.g. on the order of $10^{-8}$). For some models, this procedure may not converge exactly, in which case rather than making a full step from $\hat{x}_{\tilde{t}-1|\tilde{t}-1}$ to $\hat{x}_{\tilde{t}|\tilde{t}}$ and from $P_{\tilde{t}-1|\tilde{t}-1}$ to $P_{\tilde{t}|\tilde{t}}$, we instead make a partial step to a weighted average of the old and new points. DynareOBC contains code for dynamically adjusting the weight which works well in practice, ensuring convergence. When $\hat{x}_{\tilde{t}|\tilde{t}}$ and $P_{\tilde{t}|\tilde{t}}$ have converged, we set $\hat{x}_{0|0}$ and $P_{0|0}$ to the found limiting values.
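Schematically, the initialization loop can be written as the following damped fixed-point iteration (an illustrative Python sketch only; DynareOBC's actual weight-adjustment rule is more sophisticated than the simple heuristic used here):

```python
import numpy as np

def initial_state(predict, x0, P0, tol=1e-8, max_iter=10_000):
    """Pseudo-time iteration for the initial filter state: repeatedly apply the
    predict step (the update step is skipped, as all observations are treated as
    missing) until the mean and covariance stop changing. If a full step appears
    not to settle down, damped (partial) steps are taken instead."""
    x, P = x0, P0
    weight, last_change = 1.0, np.inf
    for _ in range(max_iter):
        x_new, P_new = predict(x, P)
        change = max(np.max(np.abs(x_new - x)), np.max(np.abs(P_new - P)))
        if change < tol:
            return x_new, P_new
        if change > last_change:          # crude damping heuristic, illustrative only
            weight = max(0.1, 0.5 * weight)
        last_change = change
        # Partial step: a weighted average of the old and new points.
        x = (1 - weight) * x + weight * x_new
        P = (1 - weight) * P + weight * P_new
    return x, P
```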

Maximising the likelihood and computing standard errors
Traditionally, DSGE models have chiefly been estimated by Bayesian methods. This has apparently been for two reasons. Firstly, many parameters in linearized DSGE models are either unidentified or just weakly identified. By placing a prior over the parameter space, although the likelihood may be flat in places, the posterior density will not be, ensuring that any numerical maximisation algorithm will return the same maximum a posteriori estimate. Nonetheless, the prior does not solve the underlying non-identification. Instead, when the likelihood is flat, the prior becomes highly "informative". Of course, if the prior reflects true external information (e.g. from panel micro-data), then it is completely appropriate to incorporate this information into the final estimate. However, most of the time the prior used in macroeconomic modelling is instead a product (albeit indirect) of the same data on which the model is now being estimated. It is preferable then to attempt to fix the underlying weak identification problem, which is what non-linear estimation potentially permits. This is thanks to the fact that two parameters may have identical effects on dynamics very close to the steady state, but quite different effects further away.
The second reason people have not traditionally pursued maximum likelihood estimation of DSGE models is because the likelihoods tend to be highly multi-modal.
The hope is that with a strong enough prior, the posterior density might be unimodal, even though the likelihood is not. This hope is somewhat naïve though, since the likelihood will asymptotically dominate the prior along any dimensions in which there is identification. Evidence for the practical relevance of this is provided by Herbst and Schorfheide (2014), who show that a range of popular DSGE models actually possess multimodal posteriors, though this was previously missed due to the difficulties of integrating over high dimensional spaces.
Consequently, the only way that a Bayesian approach could have a computational advantage over a classical approach would be if somehow it was at least easier to integrate over high dimensional spaces than it was to maximise in them. But this cannot be true. If one has an algorithm that can sample from a distribution in a high dimensional space, then by starting standard local maximisation procedures from these draws, one will eventually find the global maximum. Furthermore, since local maximisation requires far fewer evaluations than (say) MCMC would to explore a mode, this will be faster. Of course one can never guarantee that a maximisation procedure has found the true global maximum, but neither can one guarantee that an integration procedure has explored every mode. The fact that most MCMC implementations start by using conventional methods to search for a global mode provides further evidence that integration must be harder than maximisation.
In short then, the standard arguments for a Bayesian approach do not seem relevant when the model is estimated non-linearly. To make a maximum likelihood approach practical though, we must provide a global search algorithm with decent performance. By default, DynareOBC uses a customised version of the CMA-ES algorithm of Hansen et al. (1995; 2006). While first order (and first generation) evolutionary algorithms evolve a population of parameter vectors by combining parameters from multiple "parents" and adding independent noise to each component, the second order (and second generation) CMA-ES algorithm draws noise from a covariance matrix which mirrors the shape of the objective. This covariance matrix is dynamically updated over time in an entirely parameter-free way, and the result is an algorithm which is almost competitive with local algorithms on unimodal objectives (though CMA-ES requires around ten times more function evaluations than local methods on quadratic objectives), but which also possesses good global search properties. The modified version in DynareOBC is designed to exploit parallel computing environments, further speeding up the search.
Once we have found the location of the maximum of the likelihood function, we then need to compute standard errors. Now, recall that the likelihood is coming from normal approximations to the distributions of $m_t|\mathcal{F}_{t-1}$ for $t \in \{1, \ldots, T\}$. In non-linear models, these distributions may be quite a way from normality, hence, what we have here is in fact a quasi-maximum likelihood estimate (or at least, an approximate quasi-maximum likelihood one, given integral approximation error). Quasi-maximum likelihood estimates enjoy the same consistency and asymptotic normality properties as standard maximum likelihood (White 1982), but they require standard errors to be computed using a sandwich formula, as detailed in e.g. Canova (2007).
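Concretely, writing $\ell_t(\theta)$ for the period $t$ contribution to the approximate log-likelihood and $\hat{\theta}$ for the quasi-maximum likelihood estimate, the sandwich estimator takes the familiar form (the notation here is ours, and we abstract from any adjustment for serial dependence in the scores):
$$\widehat{\operatorname{Var}}\big(\hat{\theta}\big) = \widehat{H}^{-1}\, \widehat{J}\, \widehat{H}^{-1}, \qquad \widehat{H} := -\sum_{t=1}^{T} \frac{\partial^2 \ell_t\big(\hat{\theta}\big)}{\partial\theta\, \partial\theta'}, \qquad \widehat{J} := \sum_{t=1}^{T} \frac{\partial \ell_t\big(\hat{\theta}\big)}{\partial\theta}\, \frac{\partial \ell_t\big(\hat{\theta}\big)}{\partial\theta'}.$$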

Further details on the DynareOBC toolkit
Code implementing the estimation algorithm discussed here is contained in the author's "DynareOBC" toolkit, which is a suite of MATLAB files designed to augment the abilities of Dynare (Adjemian et al. 2011). The toolkit may be freely downloaded from https://github.com/tholden/dynareOBC, and this site also contains complete documentation for its assorted options. (A PDF of the toolkit's documentation is available from: https://github.com/tholden/dynareOBC/raw/master/ReadMe.pdf.) To use it for simulation, one merely has to include a "max", "min" or "abs" in the MOD file describing the DSGE model to be simulated, and then to invoke DynareOBC with the MATLAB command "dynareOBC ModFileName.MOD". Using it for estimation is almost as easy, and the examples in the "Examples\BoundedProductivityEstimation" sub-folder should make it clear to the user how to proceed.
While base Dynare now supports using the cubature Kalman filter for estimating second order approximations to models (Adjemian et al. 2016), it does not implement either the state initialization or the state space reduction technique developed here; it does not support third order approximations; it does not calculate quasi-maximum likelihood standard errors; and it cannot handle occasionally binding constraints. In all these regards then, the DynareOBC estimation procedure is an improvement on that already contained in base Dynare.

A test of the performance of our algorithm
In this section, we give some indications of the accuracy of our approach, by applying the implementation of it in the DynareOBC toolkit to two simple non-linear models, one of which contains occasionally binding constraints. We restrict ourselves to models for which the results of Holden (2016a) imply there is a unique solution in all states of the world, to avoid additional uncertainty coming from equilibrium selection, though the algorithm of Holden (2016b) does give a natural procedure for doing this. Likewise, we restrict ourselves to models for which an exact solution is available, to enable us to assess the impact of approximation error on the final parameter estimates. We start by describing the two models, and their identification properties, then we go on to present the estimation results.

A simple model without occasionally binding constraints
Suppose the representative household in an economy chooses consumption and zero net supply bond holdings to maximise:
$$\mathbb{E}_0 \sum_{t=0}^{\infty} \beta^t \frac{C_t^{1-\gamma}}{1-\gamma},$$
subject to the restriction that:
$$C_t + B_t \le A_t + R_{t-1} B_{t-1},$$
where $A_t$'s evolution is given by: $\log A_t = \log A_{t-1} + a_t$, where $a_t = (1-\rho)\bar{a} + \rho a_{t-1} + \sigma\varepsilon_t$ and $\varepsilon_t \sim \mathrm{N}(0,1)$. Market clearing implies $C_t = A_t$ and $B_t = 0$ for all $t \in \mathbb{Z}$, implying that the first order condition may be written as:
$$\frac{1}{R_t} = \beta\, \mathbb{E}_t\!\left[\left(\frac{A_{t+1}}{A_t}\right)^{-\gamma}\right] = \beta\, \mathbb{E}_t \exp(-\gamma a_{t+1}),$$
and from this, it is easy to see that in the exact solution:
$$\frac{1}{R_t} = \beta \exp\!\left(-\gamma\mu_t + \frac{\gamma^2\sigma^2}{2}\right), \qquad \text{where } \mu_t := (1-\rho)\bar{a} + \rho a_t.$$
To make our estimation task more challenging, we suppose that the econometrician only observes $\log R_t$, not $A_t$. The model has five parameters, but with this limited information set, only three of them can be identified. To see this, first note that since $\gamma$ just scales $a_{t+1}$, and $\log\beta$ just shifts $\log R_t$, the parameter vector $(\beta, \gamma, \bar{a}, \rho, \sigma)$ must be observationally equivalent to the parameter vector $(1, 1, \gamma\bar{a} - \log\beta, \rho, \gamma\sigma)$. In light of this, when we estimate the model, we fix $\beta$ and $\gamma$ at their true values. These estimates will be based on a 1000 period artificial data-set constructed from the exact solution using the following parameters: $\beta := 0.99$, $\gamma := 5$, $\bar{a} := 0.005$, $\rho := 0.95$ and $\sigma := 0.007$ (after discarding 100 periods of burn-in).
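Purely for illustration (the replication code proper is in the DynareOBC "Examples" directory), an artificial data set of this kind can be generated directly from the exact solution as follows; the random seed is an arbitrary choice:

```python
import numpy as np

# True parameters from the text.
beta, gamma, a_bar, rho, sigma = 0.99, 5.0, 0.005, 0.95, 0.007

rng = np.random.default_rng(0)        # arbitrary seed
T, burn = 1000, 100
a = a_bar                             # start the growth process at its unconditional mean
log_R = np.empty(T + burn)
for t in range(T + burn):
    a = (1 - rho) * a_bar + rho * a + sigma * rng.standard_normal()
    mu_t = (1 - rho) * a_bar + rho * a
    # Exact solution: 1/R_t = beta * exp(-gamma * mu_t + gamma^2 sigma^2 / 2).
    log_R[t] = -np.log(beta) + gamma * mu_t - 0.5 * gamma ** 2 * sigma ** 2
log_R = log_R[burn:]                  # discard the burn-in period
```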

A simple model with occasionally binding constraints
To produce a model with occasionally binding constraints, we modify the previous model, changing the law of motion for $a_t$. In particular, we suppose that for all $t \in \mathbb{Z}$:
$$a_t = \max\{0,\ (1-\rho)\bar{a} + \rho a_{t-1} + \sigma\varepsilon_t\},$$
where $\varepsilon_t \sim \mathrm{N}(0,1)$. This specification may be thought of as capturing the fact that technologies cannot be un-invented. The first order condition of this model is as before, but now, in the exact solution, a slightly more onerous calculation gives us that:
$$\frac{1}{R_t} = \beta\left[\Phi\!\left(-\frac{\mu_t}{\sigma}\right) + \exp\!\left(-\gamma\mu_t + \frac{\gamma^2\sigma^2}{2}\right)\Phi\!\left(\frac{\mu_t}{\sigma} - \gamma\sigma\right)\right],$$
where $\mu_t$ is as before.
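For completeness, this censored-normal expectation is straightforward to evaluate numerically; the following illustrative Python fragment (not part of the toolkit) does so:

```python
import numpy as np
from scipy.stats import norm

def log_R_bounded(a, beta, gamma, a_bar, rho, sigma):
    """Exact log R_t in the bounded model, where a_{t+1} = max(0, mu_t + sigma * eps_{t+1}),
    so E_t exp(-gamma * a_{t+1}) is a censored-normal expectation."""
    mu_t = (1 - rho) * a_bar + rho * a
    expectation = (norm.cdf(-mu_t / sigma)
                   + np.exp(-gamma * mu_t + 0.5 * gamma ** 2 * sigma ** 2)
                   * norm.cdf(mu_t / sigma - gamma * sigma))
    return -np.log(beta) - np.log(expectation)
```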
This model also contains five parameters, but thanks to the additional non-linearity, four of them can be identified despite only $\log R_t$ being observed. While $\gamma$ still just scales the $a_{t+1}$ term in the first order condition, varying $\log\beta$ is no longer equivalent to shifting $\bar{a}$, due to the zero lower bound on productivity growth. Thus, when we estimate this model, we only need to fix $\gamma$ at its true value. For comparison though, we also perform runs with $\beta$ fixed too. These estimates will use an artificial data set constructed from the exact solution exactly as before, with identical parameters.

Estimation results
Results from estimating both models are contained in Table 1 below. We also include the results of estimating the unbounded model on data from the bounded model, to illustrate the biases that can occur when bounds are ignored. Furthermore, to illustrate the potential costs of different types of approximation error, we include both results where the simulations in the filter predict step are performed using the exact solution, and results where they are performed using the approximate solution algorithm from Holden (2016b), either without cubature or with degree 3 monomial cubature in the internal simulation algorithm (referred to as "fast cubature" in the below). If cubature is not incorporated into the inner simulation algorithm, then expectations are not fully rational, as agents are continually surprised by the presence of the bound. Using cubature in the inner solution algorithm fixes this, producing more accurate simulations, and, hopefully, better estimates.
When using this approximate solution algorithm without cubature, we try both with an order one approximation to the underlying model, and with an order three one. The latter illustrates our algorithm's dimension reduction method, since it turns out that in these models, order two and order three solutions agree. Finally, when we use the exact solution algorithm, we try both with degree 3 monomial cubature for the update step integrals, and with the degree 51 Genz and Keister (1996) rules, which are essentially exact. All other integrals are performed with the degree 3 monomial rule.
All numerical maximisation was performed with the CMA-ES algorithm, with the results polished by MATLAB's "fmincon". We start CMA-ES from the true parameters, with the measurement error standard deviation, $\sqrt{\Lambda}$, set to 0.0001, but thanks to the initial broad search undertaken by the CMA-ES algorithm, it soon moves away from this point, so identical results would be derived with a different initial point. We constrain $\beta$ and $\rho$ to lie in $\big[10^{-6}, 1 - 10^{-6}\big]$, $\bar{a}$ to lie in $\big[10^{-6}, \infty\big)$, and $\sigma$ and $\sqrt{\Lambda}$ to lie in $[0, \infty)$. Estimation times range from a few minutes for the models without bounds, to around four hours for the runs with "fast cubature" in the internal simulation algorithm, running on either a 12 or 20 core machine. We stress though that thanks to our dimension reduction techniques, and the fact that simulation speed is almost independent of model size, running times should be of a similar order of magnitude even with medium-scale models.
Turning to the results, we first note that the estimates of $\bar{a}$ are all quite poor. For example, in the model without occasionally binding constraints, the estimate of $\bar{a}$ is roughly half its true value, even with the exact simulation algorithm, which turns this into a standard linear filtering problem. However, in all cases this is just driven by sampling error.
Given the high persistence in $a_t$, $\bar{a}$ cannot be precisely estimated. To provide further intuition, note that in the model without occasionally binding constraints:
$$\mathbb{E}\log R_t = -\log\beta + \gamma\bar{a} - \frac{\gamma^2\sigma^2}{2}.$$
Hence, conditional on the other parameters, we can estimate $\bar{a}$ by evaluating:
$$\frac{1}{\gamma}\left[\frac{1}{T}\sum_{t=1}^{T}\log R_t + \frac{\gamma^2\sigma^2}{2} + \log\beta\right].$$
With $a_t$ as persistent as it is here, the sample mean of $\log R_t$ has a large sampling variance, so this expression is an imprecise estimator of $\bar{a}$; the large reported standard errors, coming from the poor conditioning of the Hessian, which itself comes from the weak identification of $\bar{a}$, reflect exactly this.
We now examine the estimates generated by running the model without occasionally binding constraints on the data generated with an occasionally binding constraint. Here, we see a big upward bias in $\bar{a}$, and a big downwards bias in $\rho$ and $\sigma$. The former is due to the fact that the mean of a zero lower bounded process is higher than the mean of the process without the bound. The latter is due to the fact that hitting the bound prevents $a_t$ from getting any lower, both compressing its range, and reducing its mean half-life, as returning from e.g. $a_t = -0.01$ takes longer than returning from $a_t = 0$. This illustrates the severe biases that may accompany an attempt to estimate a model without occasionally binding constraints when there clearly are such constraints in the data generating process.
The estimates using a model with occasionally binding constraints unsurprisingly fare much better. With "fast cubature", we get results very close to those using the exact solution, suggesting that even in models for which an exact solution is not available, we are likely to get good results using "fast cubature". The estimates of $\sigma$ seem particularly impressive, which is most likely explained by the additional information coming from the zero lower bound, as $\sigma$ will now have a non-linear effect.
When we attempt to estimate $\beta$ as well, understandably, the estimates of the other parameters suffer somewhat. For example, $\beta$ is biased downwards, and $\bar{a}$ is driven to its lower bound of $10^{-6}$ when we do not incorporate cubature in the simulation algorithm.
To understand this, observe that in the absence of cubature, our simulation algorithm essentially ignores the bound when computing expectations. In particular, if $\delta$ is a dummy which equals 0 under a first order approximation, and 1 under a third order approximation, then our solution's approximation to $R_t$ in the absence of cubature, which we shall denote $\hat{R}_t(\delta)$, solves:
$$\frac{1}{\hat{R}_t(\delta)} = \beta\exp\!\left[-\gamma\big((1-\rho)\bar{a} + \rho a_t\big) + \delta\frac{\gamma^2\sigma^2}{2}\right].$$
Hence:
$$\log\hat{R}_t(\delta) = -\log\beta + \gamma(1-\rho)\bar{a} + \gamma\rho a_t - \delta\frac{\gamma^2\sigma^2}{2} \ge -\log\beta + \gamma(1-\rho)\bar{a} - \delta\frac{\gamma^2\sigma^2}{2},$$
as $a_t \ge 0$. Thus, using this approximation, a reasonable estimate of $\log\beta$ conditional on the other parameters would be:
$$\log\hat{\beta} = \gamma\big[(1-\rho)\bar{a} + \rho\,\mathbb{E}a_t\big] - \delta\frac{\gamma^2\sigma^2}{2} - \frac{1}{T}\sum_{t=1}^{T}\log R_t.$$
Note though, that this is likely to be substantially biased down, since our approximation to $\mathbb{E}_t\exp(-\gamma a_{t+1})$ is biased upwards, as it ignores the impact of the bound. Now suppose we are trying to simultaneously estimate $\bar{a}$ as well. Given our approximation, the natural estimate of $\bar{a}$ is the solution for $\bar{a}$ to:
$$\gamma\big[(1-\rho)\bar{a} + \rho\,\mathbb{E}a_t\big] = \frac{1}{T}\sum_{t=1}^{T}\log R_t + \delta\frac{\gamma^2\sigma^2}{2} + \log\beta, \qquad (1)$$
which, using our estimate of $\log\hat{\beta}$, is the solution for $\bar{a}$ to the corresponding equation with $\log\beta$ replaced by $\log\hat{\beta}$. Since $\mathbb{E}a_t$ depends positively on $\bar{a}$, and our estimate of $\log\beta$ is biased down, so too will be our estimate of $\bar{a}$. This in turn causes further downwards bias in the estimate of $\log\beta$, which further pushes down $\bar{a}$, and so on.
To see the magnitude of these expected biases, note that our estimation algorithm approximates the stationary distribution of $a_t$ by a Gaussian, say $\mathrm{N}\big(\mu_a, \omega^2\big)$. Hence, if $a_{t-1} \sim \mathrm{N}\big(\mu_a, \omega^2\big)$ and $\varepsilon_t \sim \mathrm{N}(0,1)$, then:
$$(1-\rho)\bar{a} + \rho a_{t-1} + \sigma\varepsilon_t \sim \mathrm{N}\big(\tilde{\mu}, \tilde{\omega}^2\big),$$
where $\tilde{\mu} := (1-\rho)\bar{a} + \rho\mu_a$ and $\tilde{\omega} := \sqrt{\rho^2\omega^2 + \sigma^2}$, so:
$$\mu_a = \tilde{\mu}\,\Phi\!\left(\frac{\tilde{\mu}}{\tilde{\omega}}\right) + \tilde{\omega}\,\phi\!\left(\frac{\tilde{\mu}}{\tilde{\omega}}\right), \qquad \mu_a^2 + \omega^2 = \big(\tilde{\mu}^2 + \tilde{\omega}^2\big)\Phi\!\left(\frac{\tilde{\mu}}{\tilde{\omega}}\right) + \tilde{\mu}\tilde{\omega}\,\phi\!\left(\frac{\tilde{\mu}}{\tilde{\omega}}\right),$$
which implicitly define $\mu_a$ and $\omega$ in terms of $\bar{a}$. Substituting $\mathbb{E}a_t \approx \mu_a$ into equation (1), and solving the resulting non-linear equations gives an estimate of $\bar{a}$ on our artificial data of $-0.0080$. Since we constrain $\bar{a}$ to be greater than $10^{-6}$ when we estimate, this means that we should be unsurprised that we hit this bound. Given this, the implied estimate of $\log\beta$ is correspondingly biased down. Luckily, when we include cubature in the underlying simulation algorithm, ensuring that agents' expectations are truly rational, all of these biases disappear. Indeed, our estimates with "fast cubature" including the bound are the most accurate estimates of both $\beta$ and $\bar{a}$ from any of our estimation runs. Given that "fast cubature" is computationally tractable even on large models, this suggests that we should be able to use our estimation procedure to get reliable estimates even in the medium scale DSGE models used by policy makers.
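These moment conditions are easy to solve by successive substitution; the following illustrative Python fragment (not part of the toolkit) computes $\mu_a$ and $\omega$ for given parameters:

```python
import numpy as np
from scipy.stats import norm

def stationary_moments(a_bar, rho, sigma, iters=1000):
    """Successive substitution on the moment conditions above: a Gaussian
    N(mu_a, omega^2) approximation to the stationary distribution of
    a_t = max(0, (1 - rho) * a_bar + rho * a_{t-1} + sigma * eps_t)."""
    mu_a, omega = max(a_bar, 0.0), sigma
    for _ in range(iters):
        mu_tilde = (1 - rho) * a_bar + rho * mu_a
        omega_tilde = np.sqrt(rho ** 2 * omega ** 2 + sigma ** 2)
        z = mu_tilde / omega_tilde
        first = mu_tilde * norm.cdf(z) + omega_tilde * norm.pdf(z)      # E max(0, .)
        second = (mu_tilde ** 2 + omega_tilde ** 2) * norm.cdf(z) \
                 + mu_tilde * omega_tilde * norm.pdf(z)                 # E max(0, .)^2
        mu_a, omega = first, np.sqrt(max(second - first ** 2, 0.0))
    return mu_a, omega
```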

Conclusion
This paper has presented an efficient algorithm for estimating non-linear models, including those with occasionally binding constraints. Thanks to its dimension reduction techniques, the algorithm keeps the costs of forming predictive distributions manageable, allowing it to be used even on models for which simulation is expensive, such as those with occasionally binding constraints. We went on to show that identification is easier in non-linear models, and that estimating while ignoring occasionally binding constraints may introduce substantial biases. The latter is particularly relevant for the zero lower bound on nominal interest rates, given the amount of time many economies have now spent at that bound. Macro-econometricians thus have no choice but to include the occasionally binding constraint in their model when they estimate medium-scale DSGE models, if they want to use up-to-date data.
Luckily, the algorithms presented in this paper will readily scale up to handle such models.
Code implementing all of the algorithms discussed here is contained in the author's "DynareOBC" toolkit which augments the abilities of Dynare (Adjemian et al. 2011) with the ability to solve and estimate models with occasionally binding constraints.