A study on window size selection for threshold and bootstrap Value-at-Risk models

This paper investigates the effects of window size selection on various models for Value-at-Risk (VaR) forecasting using high performance computing. Subsequently, automated procedures using change-point analysis for optimal window size selection are proposed. In particular, stationary bootstrapping and the peaks-over-threshold methods are utilized for the rolling daily VaR estimation and are contrasted with the classical conditional Gaussian model. It is evidenced that change-point procedures can, on average, result in more adequate risk predictions than a predetermined fixed window size. The data sets analyzed include indices across 5 continents, i.e., the Dow Jones Industrial Average Index (DJI), the Financial Times Stock Exchange 100 Index (UKX), the NIKKEI Top 225 Index (NKY), the Johannesburg Stock Exchange Top 40 Index (JSE Top40), the Ibovespa Brazil Sao Paulo Stock Exchange All Index (IBOV), and the Bombay Stock Exchange Top 500 Index (BSE 500).


Introduction
Value-at-Risk (VaR) is a statistic used to measure the volatility, and therefore the risk, of a given financial portfolio. This statistic is essentially a lower quantile, which quantifies the likelihood that a security return falls below a certain value, at a given confidence level, over a predetermined time period. Typically, historical observations are used to calculate an estimate of VaR. This estimation is however sensitive to the size of the historical window, or period, used. Too large a window will result in little variation in the statistic but will expose the statistic to model bias. Conversely, too small a window will result in a statistic which is sensitive to model changes but is very volatile. This presents portfolio managers with the challenge of selecting a window size to use in order to most accurately describe the risk of a given portfolio in the next period.
The Basel Accord II (Basel Committee on Banking Supervision 2006) laid out various quantitative standards to aid financial institutions in the evaluation of their internal risk models. For example, banks may use VaR to determine the amount of market risk capital to hold as a cushion against adverse market movements (McNeil and Frey 2000). One of the Basel II quantitative standards requires that the historical observation period, for the calculation of VaR, must be constrained to at least one year (i.e., approximately 250 trading days). However, the limited amount of research on the appropriate window size shows mixed results. For example, Hendricks (1996) observed that a long window (i.e., 1250 days) produced the best result in his analysis, relative to four other shorter windows. In contrast, Hoppe (1998) argued for the use of shorter windows (in some cases as short as 30 days) to produce better coverage results and to better align with the non-stationarity of the data.
Čížek, Härdle and Spokoiny (2009), Härdle, Hautsch and Mihoci (2015) and Schröder (2016) suggested the use of adaptive pointwise selection procedures for studying volatility, parameter variations and local trend estimation in financial time series. Our current work explores these ideas for the forecasting of VaR, as a remedy to the problem of window size selection.
The issue above is further complicated by the existence of various VaR models, with varying degrees of success under different scenarios. The cohort of such models is expanded by Basel III, which relaxed some of the requirements on the effective window size (Basel Committee on Banking Supervision 2011). This paved the way for the use of exponentially weighted moving average models and the generalized autoregressive conditional heteroscedastic (GARCH) class of methods. In particular, GARCH-type filtered approaches, with non-parametric (e.g., historical simulation), parametric (e.g., Gaussian distribution), or semi-parametric (e.g., threshold models) procedures for describing the innovations, have become popular for estimating VaR (McNeil and Frey 2000; Orhan and Köksal 2012; Brandolini and Colucci 2012; Laker, Huang and Clark 2017).
The current paper contributes to the literature as follows. Firstly, a comprehensive analysis is performed for three popular VaR models, for fixed window sizes over the range from 500 to 1500 days (with increments of 50 days). This provides an overview of the effects of selecting different window sizes for the process of VaR forecasting. This analysis also identifies the optimal fixed window size. However, it requires high performance computing and may not be practical in real applications. Hence, as a further contribution, we explore the use of change-point analysis for window size selection. In particular, the at-most-one-change (AMOC) and the binary segmentation (BinSeg) algorithms are used to produce VaR backtesting results superior to the average performance of the fixed window size approach.

Methodology
Given a series of daily closing stock prices P_1, P_2, P_3, …, we define the negative stock, or security, return for day t as

X_t = −ln(P_t / P_{t−1}).

The negative return of a stock is then decomposed into two parts: an expected return which is predicted by an autoregressive (AR) model (typically an AR(1) model) and an error term which captures deviations from the predicted value. This is presented as follows:

X_t = E(X_t | F_{t−1}) + ε_t = μ_t + ε_t,

where F_{t−1} denotes all the information about the return process up to day t − 1. We will assume a GARCH(1,1) process for the error term by defining

ε_t = σ_t Z_t,   σ_t^2 = α_0 + α_1 ε_{t−1}^2 + β σ_{t−1}^2,

where α_0 > 0, α_1 > 0 and β > 0. This is often referred to as a GARCH(1,1) filter and various assumptions may be conjectured about the innovations (or residuals) Z_t. Such assumptions can be utilised to estimate the GARCH parameters and produce forecasts for E(X_{t+1} | F_t) and σ_{t+1}.
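As an illustration, the filtering recursion above can be sketched in Python. This is a minimal sketch with fixed, already-estimated parameters (function and variable names are ours); the recursion is started at the unconditional GARCH variance, which assumes α_1 + β < 1.

```python
import math

def ar1_garch11_filter(x, phi0, phi1, alpha0, alpha1, beta):
    """Recover the conditional means mu_t, conditional variances sigma_t^2
    and standardized innovations Z_t implied by an AR(1)-GARCH(1,1) model
    with fixed (already estimated) parameters, given negative returns x."""
    n = len(x)
    mu = [0.0] * n        # conditional means mu_t
    eps = [0.0] * n       # AR residuals epsilon_t
    sigma2 = [0.0] * n    # conditional variances sigma_t^2
    # Start the recursion at the unconditional variance (needs alpha1 + beta < 1).
    sigma2[0] = alpha0 / (1.0 - alpha1 - beta)
    for t in range(1, n):
        mu[t] = phi0 + phi1 * x[t - 1]                        # AR(1) mean forecast
        eps[t] = x[t] - mu[t]
        sigma2[t] = alpha0 + alpha1 * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    z = [eps[t] / math.sqrt(sigma2[t]) for t in range(1, n)]  # innovations Z_t
    return mu, sigma2, z
```

In practice the parameters would be obtained by (pseudo) maximum likelihood before the filter is applied; the sketch only shows the recursion itself.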
The VaR statistic is defined as an extreme tail quantile of a financial time series. VaR is said to be violated when the security return is in excess of the VaR quantile. The probability of getting such an extreme return, to the right of the lower limit of the upper tail of the assumed distribution of negative returns, is denoted by q, i.e., Pr(X_t > x_{1−q}) = q, where x_{1−q} denotes the 100(1 − q)-th quantile.
Hence, using the above GARCH filter, we can deduce VaR for the next period as

VaR_{t+1} = μ_{t+1} + σ_{t+1} z_q,

where z_q is the 100(1 − q)-th quantile of the marginal distribution of Z_t. If z_q is estimated by the Gaussian distribution, then we have the popular mean-variance formulation. When its estimate is obtained by bootstrapping, then the procedure is referred to as filtered historical simulation (FHS). The former is a parametric approach and the latter is non-parametric, both being commonly used in practice due to their simplicity in implementation and interpretation. However, we shall extend the standard FHS to cater for dependencies in the residuals by using stationary bootstrapping (Laker, Huang and Clark 2017). A third approach that is semi-parametric estimates z_q indirectly through the modelling of the corresponding threshold exceedances of Z_t. This is termed the conditional peaks-over-threshold (POT) method, which has also gained popularity in recent years (McNeil and Frey 2000). Stationary bootstrapping and the POT method are briefly described below. These are followed by discussions on change-point algorithms (and how they may be used for window size selection) and backtests for VaR.
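The one-step-ahead VaR calculation can be sketched as follows, assuming the forecasts μ_{t+1} and σ_{t+1} are already available. The helper functions are illustrative: the Gaussian quantile corresponds to the mean-variance formulation, while the empirical quantile (using one of several common order-statistic conventions) stands in for the bootstrap/FHS estimate of z_q.

```python
import math
from statistics import NormalDist

def var_forecast(mu_next, sigma_next, z_q):
    """One-step-ahead VaR: conditional mean plus conditional
    volatility scaled by the innovation quantile z_q."""
    return mu_next + sigma_next * z_q

def gaussian_z_q(q):
    """100(1 - q)-th quantile of the standard Gaussian, e.g. q = 0.05."""
    return NormalDist().inv_cdf(1.0 - q)

def empirical_z_q(innovations, q):
    """Non-parametric alternative: the empirical 100(1 - q)-th quantile
    of (possibly bootstrapped) innovations, via a simple order-statistic
    convention."""
    s = sorted(innovations)
    k = max(0, min(len(s) - 1, int(math.ceil((1.0 - q) * len(s))) - 1))
    return s[k]
```

For q = 0.05, gaussian_z_q gives approximately 1.645, recovering the familiar 95% Gaussian VaR formula.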

Stationary bootstrapping
Stationary bootstrapping is a generalization of ordinary bootstrapping that subdivides the observations into blocks prior to resampling. It also assumes that the block length is a geometric random variable. Firstly, a value p ∈ (0, 1] is preselected, which can be optimally taken at p = c^{−1} n^{−1/3} for some constant c (Politis and Romano 1994). Subsequently, each observation is put through a decision rule in the order in which it appears. For each observation, a number u is randomly drawn from the UNIF(0, 1) distribution. The observation is then included in the present block if u is less than 1 − p. Alternatively, if u is greater than 1 − p, then a new block is started. This decision rule is continued until all the observations have been selected into blocks. This procedure caters for dependency in the observations and preserves stationarity. For our purposes, we shall apply the above to the realized GARCH innovations and calculate a bootstrapped value for z_q.
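A single stationary-bootstrap resample can be sketched as follows. This is a minimal implementation in the circular form of Politis and Romano (1994): with probability 1 − p the next (circularly adjacent) observation extends the current block, and with probability p a new block is started at a random position, so block lengths are geometric with mean 1/p. Names are ours.

```python
import random

def stationary_bootstrap(data, p, rng=random):
    """One stationary-bootstrap resample of the same length as data.
    Blocks are grown circularly with probability 1 - p and restarted
    at a random position with probability p."""
    n = len(data)
    sample = []
    i = rng.randrange(n)                # start of the first block
    for _ in range(n):
        sample.append(data[i])
        if rng.random() < 1.0 - p:
            i = (i + 1) % n             # continue the current block (circular)
        else:
            i = rng.randrange(n)        # start a new block
    return sample
```

In our setting, this resampling would be repeated many times over the realized GARCH innovations, with z_q taken as the empirical quantile across resamples.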

POT
The POT method is known to be very useful in the tail estimation of data as this method is based on sound statistical theory and provides an asymptotic parametric form for the tail of the distribution (McNeil and Frey 2000). Furthermore, it is specifically useful for approximating the distribution of financial data due to the leptokurtic nature of the conditional distribution of the errors arising from the GARCH model (Orhan and Köksal 2012). More precisely, given a threshold u, we can write the corresponding excess distribution of Z as

F_u(y) = Pr(Z − u ≤ y | Z > u) = (F(y + u) − F(u)) / (1 − F(u)),

where F is the distribution function of Z and 0 ≤ y < z_0 − u, for some z_0 defined as the right endpoint of F. For a large class of distributions (i.e., those in the domain of attraction of the extreme value distribution), there exists a measurable function β(u) such that

lim_{u → z_0} sup_{0 ≤ y < z_0 − u} |F_u(y) − G_{ξ, β(u)}(y)| = 0,

where G_{ξ, β}(y) = 1 − (1 + ξy/β)^{−1/ξ} is the generalized Pareto distribution (GPD). In other words, the distribution of threshold exceedances is asymptotically identified with the GPD (Embrechts, Klüppelberg and Mikosch 1997). This allows us, by reverting the arguments above, to calculate

ẑ_q = u + (β̂/ξ̂)[(qn/N_u)^{−ξ̂} − 1],

where ξ̂ and β̂ are maximum likelihood estimates for ξ and β, respectively, N_u is the number of exceedances above u, and n is the number of observations.
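The tail quantile estimator can be sketched directly, assuming the maximum likelihood estimates of ξ and β for the exceedances have already been obtained (the function name is ours):

```python
def gpd_tail_quantile(u, xi, beta, n_u, n, q):
    """Quantile z_q with Pr(Z > z_q) = q, from the GPD tail estimator
    fitted to the n_u exceedances above threshold u, out of n
    observations in total.  Assumes xi != 0."""
    return u + (beta / xi) * ((q * n / n_u) ** (-xi) - 1.0)
```

By construction, plugging the returned quantile back into the GPD tail formula (N_u/n)(1 + ξ(z − u)/β)^{−1/ξ} recovers the target tail probability q.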

AMOC and BinSeg
The GARCH model, used in this paper to model the volatility of financial data over time, is known to assume time homogeneity. This may not be appropriate when modelling financial data, as market and institutional changes result in changing variability of the data over time. When these changes are not accounted for, the accuracy of the model fit is compromised. As such, one can employ the technique of change-point analysis to achieve a more flexible model which is able to forecast data more accurately over longer periods of time (Čížek, Härdle and Spokoiny 2009). This technique essentially aims to identify stationary intervals (with a fixed right endpoint) inside every estimation window. This is achieved by detecting the points at which the variability of the data changes most significantly (Čížek, Härdle and Spokoiny 2009).
The points at which the data's distribution changes significantly can be found in numerous ways. A significant change in the mean or variance of the series can be monitored and the splits made accordingly. Two such approaches are implemented through the AMOC algorithm (Silva and Texeira 2008) and BinSeg algorithm (Scott and Knott 1974).
Initially, one assumes that the data series captured within the window is time homogeneous (i.e., the null hypothesis). This notion is then tested against there being a change-point required due to the non-homogeneity of the series within the window (i.e., the alternative hypothesis). The AMOC algorithm makes use of the likelihood ratio method, where the maximum likelihood is calculated under both hypotheses across different change-point positions within the selected window. The BinSeg method is a generalized version of the AMOC algorithm, where once a change-point is established the sequence is split and each segment is then again tested for a change point until the prescribed threshold is reached.
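The likelihood-ratio search described above can be sketched for a change in variance as follows. This is a minimal AMOC-type detector on (approximately) zero-mean data; the minimum segment length and the acceptance threshold for the statistic are tuning choices, and the names are ours.

```python
import math

def amoc_variance(x, min_seg=30):
    """At-most-one-change (AMOC) detector for a change in variance:
    the Gaussian likelihood-ratio statistic is maximized over all
    candidate split points k.  Returns (best split index, statistic);
    the split would only be accepted if the statistic exceeds a
    user-chosen penalty threshold."""
    n = len(x)
    total = sum(v * v for v in x)      # mean is assumed (approximately) zero
    full = total / n                   # variance under the null (no change)
    best_k, best_stat, css = None, -math.inf, 0.0
    for k in range(1, n):
        css += x[k - 1] * x[k - 1]     # running sum of squares of the left segment
        if k < min_seg or n - k < min_seg:
            continue
        s1 = css / k                   # left-segment variance
        s2 = (total - css) / (n - k)   # right-segment variance
        stat = n * math.log(full) - k * math.log(s1) - (n - k) * math.log(s2)
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat
```

BinSeg would apply this same test recursively: once a change-point is accepted, each resulting segment is searched again, until no statistic exceeds the threshold or a cap on the number of splits is reached.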

Automated Window Size Selection
An important facet of the current work is to identify ways to select an appropriate historic window for VaR forecasting, at each given point in time. This avoids the problem of having to artificially specify a fixed window size in advance, as well as the sensitivity of models to window size (as we shall explore later). This paper proposes the use of AMOC and BinSeg to identify change-points in terms of volatility (points at which the largest change in variance is identified). For each day to be forecasted, the procedure is implemented as follows: Let t be the day for which VaR is to be estimated, and let W_max and W_min be the maximum and minimum sizes (in days) allowed for the estimation window to be used. Day t − 1 is a fixed right endpoint, as the closest point to the day of interest. Then, AMOC or BinSeg is applied to identify a change-point in the interval [t − W_max; t − W_min], say at τ. This results in the estimation window [τ; t − 1], which is then used for model estimation.
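The selection step can be sketched as follows. The inline change-point search is a simple variance likelihood-ratio detector standing in for AMOC/BinSeg, and all names and defaults are illustrative; when no admissible split is found, the maximal window is used as the fallback.

```python
import math

def select_window(x, t, w_min=500, w_max=1500, min_seg=30):
    """For forecast day t, search [t - w_max, t - w_min] for the most
    significant variance change-point tau and return the estimation
    window x[tau : t] (right endpoint fixed at day t - 1).  Falls back
    to the maximum window when no candidate split is admissible."""
    lo, hi = t - w_max, t - w_min
    seg = x[lo:t]                       # data inside the maximal window
    n = len(seg)
    total = sum(v * v for v in seg)     # mean assumed (approximately) zero
    full = total / n
    best_k, best_stat, css = None, -math.inf, 0.0
    for k in range(1, hi - lo + 1):     # candidate splits left of t - w_min
        css += seg[k - 1] ** 2
        if k < min_seg:
            continue
        s1, s2 = css / k, (total - css) / (n - k)
        stat = n * math.log(full) - k * math.log(s1) - (n - k) * math.log(s2)
        if stat > best_stat:
            best_k, best_stat = k, stat
    tau = lo + best_k if best_k is not None else lo
    return x[tau:t]                     # estimation window [tau, t - 1]
```

Restricting candidate splits to [t − W_max, t − W_min] guarantees that the returned window length lies between W_min and W_max, as required.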

VaR Backtests
We employ two common backtests to evaluate the adequacy of VaR estimates, namely the Kupiec likelihood ratio test (Kupiec 1995) and the Christoffersen test (Christoffersen, Hahn and Inoue 2001). The former is an unconditional coverage test that compares the realized VaR violations against the specified VaR level (i.e., the value of q). The second test is a conditional coverage test that examines the dependency amongst VaR violations (i.e., a robust VaR model should react quickly enough so as to minimise the chance of consecutive VaR violations). The resulting p-values from both backtests are examined as a method to compare across different models.
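The Kupiec unconditional coverage test can be sketched as follows; the chi-square survival function with one degree of freedom is computed via the error function, so no external libraries are needed.

```python
import math

def chi2_sf_1df(x):
    """Survival function of a chi-square with 1 df: P(X > x)."""
    return 1.0 - math.erf(math.sqrt(x / 2.0))

def kupiec_pvalue(violations, n, p):
    """Kupiec unconditional coverage test: likelihood ratio of the
    observed violation rate x/n against the nominal level p, compared
    to a chi-square distribution with one degree of freedom."""
    x = violations
    if x == 0:
        lr = -2.0 * n * math.log(1.0 - p)
    else:
        lr = -2.0 * ((n - x) * math.log(1.0 - p) + x * math.log(p)
                     - (n - x) * math.log(1.0 - x / n) - x * math.log(x / n))
    lr = max(lr, 0.0)   # guard against tiny negative values from rounding
    return chi2_sf_1df(lr)
```

The Christoffersen test extends this idea by additionally testing whether violations occur independently over time (via the transition counts of the violation indicator sequence), which is what penalizes clustered, consecutive violations.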

Data Analysis
Our data consists of security returns calculated from daily closing prices of the past 16 years (January 2000 to November 2016), sourced from the Bloomberg database. Collecting information from international stock markets results in minor discrepancies in the number of available data points. To overcome this problem, the database of returns has been truncated to 4147 data points for each stock, ensuring that an equal number of data points is recorded for each stock. We will apply three VaR models to all 6 return series, namely GARCH with Gaussian innovations (GARCH-Gaussian), GARCH with stationary bootstrapping (GARCH-Boot), and GARCH with POT (GARCH-POT). The last two are implemented by using pseudo maximum likelihood estimation for the GARCH parameters. This is to minimize distributional assumptions on the innovations (McNeil and Frey 2000). As a preliminary analysis, the GARCH model is fitted to the complete horizon of data points in each index and descriptive statistics are drawn from both the original returns and the innovations. These are displayed in Table 1. From Table 1 it can be seen that all return series, irrespective of market maturity, have a near-zero mean and a negatively skewed distribution. This is fairly typical in financial data and is explained by either the return distribution peaking at values larger than the mean with a long tail to the negative end of the return scale, or merely that the higher returns outweigh the lower returns to the left of the scale.
The excess kurtosis is positive for all markets, with particularly high values being recorded for the BSE 500 Index and the DJI Index. These kurtosis values indicate sharply peaked distributions, with fatter tails than that of the Gaussian distribution. The extent of this deviation from the Gaussian distribution seems to be larger in the developed market. On the other hand, the innovation excess kurtosis values are much lower than those of the original returns. As expected, this is one effect of the GARCH model catering for the changing volatility. However, when applying the rolling-window mechanism there is the possibility for the excess kurtosis to vary, depending on which window of data has been selected. This phenomenon is illustrated in Figure 1 below for a rolling window of size 1500 days. It is evidenced that the rolling innovation excess kurtosis can often surge significantly above zero. This justifies the need to cater for conditional heavy tails. Interestingly, the common spike at around the 2000th observation corresponds to the 2008 global credit crunch.

Figure 1: Rolling excess kurtosis of GARCH innovations using a window size of 1500 days.
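The rolling excess kurtosis underlying Figure 1 can be sketched as follows (a simple two-pass moment calculation per window; a production version would update the moments incrementally for speed):

```python
def excess_kurtosis(x):
    """Sample excess kurtosis: fourth standardized moment minus 3,
    so the Gaussian distribution gives a value near zero."""
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    return m4 / var ** 2 - 3.0

def rolling_excess_kurtosis(z, window=1500):
    """Excess kurtosis of each rolling window over the innovations z."""
    return [excess_kurtosis(z[i - window:i]) for i in range(window, len(z) + 1)]
```

Applied to the realized GARCH innovations, sustained excursions of this series above zero signal conditional heavy tails, motivating the bootstrap and POT treatments of z_q.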
We now perform daily 95% VaR forecasts (using all three models described earlier) for each return series using a fixed number of observations in each window (i.e., a fixed window size). With the aid of high performance computing, this is done for all window sizes from 500 to 1500 (at increments of size 50) for each index. The parameter p for stationary bootstrapping is determined by setting the constant c to 3.15, which is obtained by Monte Carlo simulation using the Gaussian AR(1) process. Each window is bootstrapped 1000 times. The threshold u for the POT approach is set to be the 90% quantile of the innovations in the corresponding window. The resulting p-values for the Kupiec and Christoffersen tests, for each window size and each index, are displayed by the figures in Appendix A. It is interesting to note that the commonly used window sizes, e.g., 500 and 1000, do not always produce the optimal results. The window size for the highest p-value also varies across different models, indices and backtests. The results are also summarized in Tables 2 to 7. Column 3 records the average of all p-values obtained across different fixed window sizes, while column 4 gives the standard deviation of these p-values. Columns 5 and 6 present the smallest and largest p-values obtained across all window sizes, with the window size that produced the largest p-value given in the last column. It is clear that results produced by the two commonly used backtests can drastically change according to the choice of window size.
Interestingly, the models (on average) seem to perform the best for JSE Top 40. This may be attributed to the fact that this index produced the least excess kurtosis. This also resulted in a relatively smaller coefficient of variation (C.V.) for the different p-values across different window sizes. Hence, JSE Top 40 appears to be the least sensitive (relative to other indices) to window size selection. In contrast, UKX and NKY produced significantly higher C.V. values, indicating a high sensitivity to window size selection.

On average (considering the mean p-value again), the POT and stationary bootstrap approaches seem to produce more robust VaR estimates than the conditional Gaussian model. This is with the exception of DJI, where the three models were very similar in performance. In terms of the maximum p-value achieved at the optimal window size, the results again varied across different indices. For UKX, the best performance is dominated by POT, whereas the difference between POT and the Gaussian approaches is indecisive for DJI (likewise, POT and stationary bootstrap produced similar results for NKY). For IBOV and BSE 500, the best model under optimal window size is clearly achieved by stationary bootstrapping. All three models gave almost identical VaR backtesting results for JSE Top 40, at the optimal window sizes.
At this point, it is imperative to note several shortcomings. Firstly, the VaR backtesting results varied across different indices, models and window sizes. This creates a dilemma for practitioners in deciding on an appropriate approach. Secondly, the models seem to be highly sensitive to window size selection. Although the optimal window sizes are determined as part of the process in this analysis, high performance computing is required to record such results. There is also no universal choice for the best window size. These issues make the approach impractical in some applications. At the same time, these optimal window sizes may be restricted only to the time horizon in the current sets of data.
To overcome the issues above, we propose and implement change-point procedures in each estimation window, with a fixed right endpoint (as described in the Methodology Section). AMOC and BinSeg are both utilized, where the number of splits for BinSeg is capped at three. The minimum window size is set to be 500 and the maximum is 1500 (i.e., the default value if no change-point is identified). This procedure is automated without the need to run the models through all window sizes and hence requires significantly less computing power. Table 8 summarizes the results after implementing the change-point procedures. The highest p-values for each index and backtest are highlighted in bold. Columns 3 and 4 record the actual number of VaR exceedances, as produced by implementing AMOC and BinSeg, respectively. The results of the two backtests are then recorded in columns 5 to 8. First, one can clearly observe that, in almost all cases, the results produced by the change-point procedures are more robust than the average p-value obtained by fixed window sizes in Tables 2 to 7. This is strong evidence that implementing a change-point procedure is more useful than a randomly chosen (or selected-by-convenience) fixed window size. More importantly, given the appropriate innovation assumption or treatment, the corresponding model can produce similar results to those arising from having previously identified the optimal fixed window sizes. These make the proposed procedure practically more attractive.
As expected, the Gaussian innovation assumption again produced the least robust model in most indices, except for DJI and JSE Top 40. For JSE Top 40, the results are mixed across the two different backtests, with comparable results between models. More interestingly, the conditional Gaussian model with BinSeg seems to be the best approach for DJI.

Conclusion
In this paper, we studied the effects of window size selection on three popular VaR models, namely the conditional Gaussian model, the conditional POT model and a generalized FHS method. This was done by utilizing high performance computing to run the models through a range of window sizes for six different indices. The level of sensitivity across different markets is varied. However, without doubt, the choice of the window size plays a vital role in the performance of these models.
Although the optimal window sizes were determined from the above process, the approach is in general not practical. This is due to the large disparity in terms of model choice, the reliance on computing power and the non-existence of a universal optimal window size. At the same time, the optimal window size determined here may be restricted to the current time horizon analyzed. To overcome these problems, we have proposed and implemented change-point procedures to identify significant changes in volatility within the estimation windows. This then allows us to specify appropriate windows with a fixed right endpoint. The results show that the combination of the different models with change-point procedures is capable of producing adequate VaR forecasts. In particular, they produce (on average) more robust results than randomly selected, or selected-by-convenience, window sizes (such as the traditional 500 or 1000 day windows). More importantly, given the appropriate treatment of the innovations, one can achieve similar backtesting results as in the case where the optimal window size is known.
Further investigations could possibly include extending the approach to utilize other adaptive pointwise estimation methods (Čížek, Härdle and Spokoiny 2009) and performing similar analysis on portfolios of public companies. The combination of a switching model strategy (Chiu and Chuang 2016) with varying window sizes may be another interesting exploration. It is also the aim of the authors to extend the study to other risk measures, such as conditional VaR and entropic VaR.

Declaration of Interest
The authors report no conflicts of interest.