Introduction

The financial world is characterized by high levels of uncertainty, and there is an ever-increasing need for financial institutions to estimate the risk of their positions. In particular, the ability to quantify such risk enables risk managers to continuously re-evaluate the adequacy of their risk capitalization against extreme losses. Value-at-risk (VaR) is a financial risk measure popularized by J. P. Morgan; it is simply a chosen quantile of the loss distribution. More precisely, Holton (2014) defined VaR as the threshold value of return or capital that a company’s or portfolio’s losses will not exceed, given a specific probability level and time horizon. While VaR is widely used in practice, it is not a coherent risk measure. Artzner et al. (1999) defined a coherent measure as one that is monotonic, translation invariant, homogeneous, and sub-additive. VaR satisfies the first three properties, but it is not always sub-additive. Hence, an alternative risk measure called expected shortfall (ES), sometimes also referred to as conditional VaR, has been introduced. It is defined as the expected loss of a portfolio or asset return, conditional on the loss exceeding the corresponding VaR. ES is a coherent risk measure and is more sensitive to the tail of the loss distribution. While VaR is still commonly used by practitioners, ES has gained popularity in recent years for these reasons.

Parametric approaches for estimating VaR and ES typically depend on an accurate distributional assumption for the underlying financial data. Traditionally, such risk measures were based on the assumption that asset returns are independently and identically distributed (IID) according to the Gaussian distribution. It is well known in practice that such assumptions are often inappropriate. The resulting model also needs to capture various stylized facts of financial time series, such as volatility clustering, heavy tails, asymmetry, and long-range dependence. Moreover, such approaches can be hindered by the scarcity of extreme observations (which are, by definition, rare). Another common approach for modeling asset returns is the generalized autoregressive conditional heteroskedastic (GARCH) model. While this model caters for the continually changing variance, assumptions still need to be made about its innovations, or residuals. Typically, the innovations are assumed to be IID and Gaussian. However, these assumptions are not always suitable (see, for example, Byström 2004; Huang et al. 2016). For a comprehensive review of VaR estimation and backtesting, see Abad et al. (2014).

Bootstrapping procedures provide alternatives that overcome the challenges mentioned above. In general, they do not rely on distributional assumptions about the data and asymptotically replicate the empirical density of the resampled data. Moreover, block bootstrapping and stationary bootstrapping can also cater for dependence and stationarity in the data (Sunesson 2011). In this paper, we examine daily rolling one-day-ahead forecasts of VaR and ES using bootstrapping procedures. More precisely, we apply ordinary, block, and stationary bootstrapping to both the original financial series and its GARCH innovations (for two different underlying noise assumptions). At the same time, we assess the effect of using different window sizes of historical data for these procedures. The two datasets used are daily returns of the S&P500 from the New York Stock Exchange (NYSE) and the ALSI from the Johannesburg Stock Exchange (JSE).

The contribution to the literature is threefold. Firstly, the analyses should indicate whether dependent bootstrapping can improve performance in terms of VaR and ES estimation. In particular, we also examine whether the various bootstrap procedures are more effective with or without GARCH implementations. These are extensions of previous work on historical simulation and filtered historical simulation, where only ordinary bootstrapping is used (see, for example, Zenti and Pallotta 2000; Brandolini et al. 2001; Barone-Adesi et al. 2002; Lin et al. 2006; Hartz et al. 2006; Brandolini and Colucci 2012; Cecarone and Colucci 2016). Secondly, we test the approaches against different window sizes of historical data and aim to draw conclusions on the effect of window size selection in bootstrap VaR and ES estimation. Lastly, we compare the performance of these methods across two vastly different markets.

Methodology

The procedures implemented in this paper are generalized forms of historical simulation and filtered historical simulation (Pritsker 2006). In other words, we forecast VaR or ES based on the information provided by past data. Hence, one needs to decide which, and how much, historical data to include. In our analyses, we use the moving window technique (Richardson et al. 1997) to re-calculate VaR and ES on a daily basis. The window sizes chosen are 250, 500, and 1000 days, equivalent to approximately 1 year, 2 years, and 4 years of daily observations. For example, using a window size of 250, we utilize the observations from day 1 to day 250 to estimate VaR for day 251, then use the observations from day 2 to day 251 to estimate VaR for day 252, and so on. This method is implemented for the bootstrapping procedures described below; a sketch of the re-estimation loop follows.
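To make the procedure concrete, the following Python sketch illustrates the moving-window re-estimation loop under illustrative assumptions: the estimator passed in (an empirical 1% quantile) and the simulated data are placeholders rather than the exact procedures evaluated in this paper.

```python
import numpy as np

def rolling_one_day_ahead(returns, window, estimator):
    """Re-estimate a risk measure each day using only the previous `window` observations."""
    returns = np.asarray(returns)
    forecasts = []
    for t in range(window, len(returns)):
        history = returns[t - window:t]          # observations for days t-window, ..., t-1
        forecasts.append(estimator(history))     # one-day-ahead estimate for day t
    return np.array(forecasts)

# Example: empirical 1% VaR (reported as a positive loss) from each window.
rng = np.random.default_rng(0)
simulated_returns = 0.01 * rng.standard_t(df=5, size=1500)   # placeholder return series
var_1pct = rolling_one_day_ahead(simulated_returns, window=250,
                                 estimator=lambda x: -np.quantile(x, 0.01))
print(var_1pct[:5])
```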

Ordinary bootstrapping

Bootstrapping is a resampling procedure first introduced by Efron (1979). Typically, it involves repeated resampling, with replacement, from a given dataset, say \(X_1, X_2, \ldots, X_n\), to produce numerous samples of size n. A statistic of interest is then estimated from each sample, creating a series of estimates that can be used to approximate the sampling distribution of the statistic. One drawback of the ordinary bootstrapping procedure is that it assumes IID data, whereas real-world data rarely satisfy this assumption. Hence, to cater for dependent data, generalizations of ordinary bootstrapping have been proposed. A minimal sketch of the ordinary bootstrap is given below.
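The sketch below shows ordinary (IID) bootstrap estimation of VaR and ES from a single window of returns; the number of resamples and the VaR level are illustrative choices rather than values prescribed in this paper.

```python
import numpy as np

def ordinary_bootstrap_var_es(returns, alpha=0.01, n_boot=1000, seed=None):
    """Ordinary (IID) bootstrap estimates of VaR and ES at level alpha."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns)
    n = len(returns)
    var_estimates, es_estimates = [], []
    for _ in range(n_boot):
        sample = rng.choice(returns, size=n, replace=True)   # resample with replacement
        var_b = -np.quantile(sample, alpha)                  # loss quantile, reported as positive
        tail = -sample[-sample > var_b]                      # losses exceeding the VaR estimate
        es_b = tail.mean() if tail.size else var_b
        var_estimates.append(var_b)
        es_estimates.append(es_b)
    # Average across bootstrap replications as the point estimates.
    return float(np.mean(var_estimates)), float(np.mean(es_estimates))
```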

Block bootstrapping

Block bootstrapping, or moving block bootstrapping, is suited for resampling from a dependent dataset. The method was introduced by Künsch (1989) and can be used to replicate the correlation in the dataset by resampling blocks of data. It involves dividing a dataset of n observations into n − b + 1 overlapping blocks of fixed length b. For example, the block beginning with the ith observation is of the form

$$B_{i}^{b} = \left\{ {X_{i} , X_{i + 1} , \ldots , X_{i + b - 1} } \right\}.$$

Subsequently, n/b blocks are randomly selected, with replacement, from the n − b + 1 overlapping blocks and concatenated together. The concatenated observations are then used to estimate the statistic of interest. Again, the process is repeated several times to produce a series of estimates. In addition, we follow Hall et al. (1995) in using the block length \(b = n^{1/4}\), which is optimal when, as here, we focus only on the negative side of the returns distribution for our estimation of VaR and ES. A sketch of the resampling step is given below.
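The resampling step can be sketched as follows; the helper names are ours and the VaR level is illustrative.

```python
import numpy as np

def block_bootstrap_sample(returns, rng):
    """One moving block bootstrap resample with fixed block length b = n^(1/4)."""
    returns = np.asarray(returns)
    n = len(returns)
    b = max(1, int(round(n ** 0.25)))                         # block length b = n^(1/4)
    blocks = [returns[i:i + b] for i in range(n - b + 1)]     # all overlapping blocks B_i
    n_blocks = int(np.ceil(n / b))                            # roughly n/b blocks are drawn
    chosen = rng.integers(0, len(blocks), size=n_blocks)
    return np.concatenate([blocks[i] for i in chosen])[:n]    # trim back to length n

def block_bootstrap_var(returns, alpha=0.01, n_boot=1000, seed=None):
    """Average block bootstrap estimate of VaR at level alpha."""
    rng = np.random.default_rng(seed)
    estimates = [-np.quantile(block_bootstrap_sample(returns, rng), alpha)
                 for _ in range(n_boot)]
    return float(np.mean(estimates))
```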

Stationary bootstrapping

A possible problem with block bootstrapping is that the procedure can destroy stationarity in the dataset. In other words, if the original dataset is stationary, the block bootstrapped samples may not be. This can be overcome by allowing the block length to be random (Politis and Romano 1994). The procedure is as follows. A value p ∊ (0, 1] is predefined, which is optimally taken to be \(c^{-1} n^{-1/3}\) for some constant c. When deciding whether the next observation should be included in the current block, a number u is randomly drawn from the \(UNIF(0,1)\) distribution. If u is less than 1 − p, the observation is included in the current block; otherwise, a new block is started. This algorithm continues until all the observations have been assigned to blocks. Hence, the block length is a random variable following a geometric distribution with parameter p, which in turn makes the number of blocks a random variable. This modification preserves the stationarity of the original series. A sketch of the resampling step is given below.
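The following sketch shows the stationary bootstrap resampling step; the tuning constant c is an assumed illustrative value, and the starting index of each block is drawn uniformly at random.

```python
import numpy as np

def stationary_bootstrap_sample(returns, p, rng):
    """One stationary bootstrap resample with geometric block lengths (parameter p)."""
    returns = np.asarray(returns)
    n = len(returns)
    sample = np.empty(n)
    idx = rng.integers(0, n)                 # random start of the first block
    for t in range(n):
        sample[t] = returns[idx]
        if rng.random() < 1 - p:             # with probability 1 - p, extend the current block
            idx = (idx + 1) % n              # wrap around at the end of the series
        else:                                # with probability p, start a new block
            idx = rng.integers(0, n)
    return sample

# Example: p chosen as c^(-1) * n^(-1/3) with an assumed constant c = 1.
rng = np.random.default_rng(0)
r = 0.01 * rng.standard_normal(500)          # placeholder return series
c = 1.0
p = 1.0 / (c * len(r) ** (1 / 3))            # expected block length 1/p = c * n^(1/3)
print(-np.quantile(stationary_bootstrap_sample(r, p, rng), 0.01))   # 1% VaR from one resample
```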

GARCH

In addition to applying bootstrapping directly to the financial return series, we also explore an alternative approach of bootstrapping the innovations after the series have been fitted with a GARCH(1,1) model. The GARCH model generalizes the ARCH model of Engle (1982) and takes the general form

$$R_{t} = \mu_{t} + \sigma_{t} Z_{t},$$

where \(R_t\) is the return at time t, \(\mu_t\) is the mean function of the overall series, \(\sigma_t^2\) is the conditional variance, and \(Z_t\) is the residual, or innovation, at time t. It is further defined that

$$\sigma_{t}^{2} = \alpha_{0} + \alpha_{1} \varepsilon_{t - 1}^{2} + \beta_{1} \sigma_{t - 1}^{2},$$

where \(\varepsilon_t = \sigma_t Z_t\), and \(\alpha_0\), \(\alpha_1\), and \(\beta_1\) are non-negative constants. It is typically assumed that the innovation series is IID and follows the Gaussian distribution. Consequently, the normal quantiles can be back-transformed to estimate the VaR or ES of the original series. The innovations can also be extracted under the assumption that they follow a Student’s t distribution, with the respective quantiles utilized as above. In this paper, however, we implement the bootstrapping procedures on the innovations instead and contrast our results against these parametric approaches, as sketched below.
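The filtered historical simulation step can be sketched as follows. The example assumes the Python arch package for the GARCH(1,1) fit (the paper does not prescribe a specific implementation) and uses the ordinary bootstrap on the standardized innovations; a block or stationary resampler could be substituted at the indicated step.

```python
import numpy as np
from arch import arch_model   # assumed dependency; install via `pip install arch`

def garch_bootstrap_var_es(returns, alpha=0.01, n_sim=10000, seed=None):
    """One-day-ahead VaR and ES from a GARCH(1,1) filter with bootstrapped innovations."""
    rng = np.random.default_rng(seed)
    r = 100 * np.asarray(returns)                        # scale to percent for numerical stability
    fit = arch_model(r, mean='Constant', vol='GARCH',
                     p=1, q=1, dist='normal').fit(disp='off')
    z = fit.resid / fit.conditional_volatility           # standardized innovations Z_t
    fc = fit.forecast(horizon=1)
    mu_next = fc.mean.values[-1, 0]                      # one-day-ahead conditional mean
    sigma_next = np.sqrt(fc.variance.values[-1, 0])      # one-day-ahead conditional volatility
    # Ordinary bootstrap of the innovations; a block or stationary resampler
    # could be substituted here to obtain the dependent versions.
    z_star = rng.choice(z, size=n_sim, replace=True)
    r_star = (mu_next + sigma_next * z_star) / 100       # simulated one-day-ahead returns
    var = -np.quantile(r_star, alpha)
    tail = -r_star[-r_star > var]
    es = tail.mean() if tail.size else var
    return float(var), float(es)
```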

Validation tests for VaR and ES

Mathematically, VaR can be defined as

$$\mathrm{VaR}_{\alpha}(L) = \inf\{ m \in \mathbb{R} : P(L \le m) \ge \alpha \}$$

for some probability level α and loss function L. Additionally, ES is defined as

$$ES_{\alpha} = \mathbb{E}\{ L \mid L > \mathrm{VaR}_{\alpha}(L) \}.$$

To compare the performance of our VaR and ES estimates, we implement various backtesting procedures for the two risk measures. In particular, we utilize the Kupiec likelihood ratio test (Kupiec 1995) and the VaR duration test (Christoffersen and Pelletier 2004) for the VaR estimates. The backtesting procedure of McNeil and Frey (2000) is implemented for the ES estimates. Generally, a higher p value for each test implies a better estimate. However, care must be taken when interpreting the results as a whole. The Kupiec test checks for the correct number of exceedances (i.e., unconditional coverage), while the VaR duration test checks whether the VaR violations are IID. The duration test is strongly sensitive to the number of violations, which is more problematic at very extreme VaR levels. Moreover, the ES test is, by definition, strongly dependent on the corresponding VaR estimates. As a result, we take the Kupiec test as our primary check for a suitable model (identifying the highest p value) and utilize the other two tests as secondary checks for desirable model properties (suitably high p values). A sketch of the Kupiec test is given below.
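For reference, the following sketch implements the Kupiec unconditional coverage likelihood ratio test; the function name and interface are ours, not part of the original paper.

```python
import numpy as np
from scipy.stats import chi2

def kupiec_test(returns, var_forecasts, alpha):
    """Kupiec (1995) unconditional coverage test.

    returns: realized returns; var_forecasts: VaR forecasts reported as positive losses;
    alpha: target violation probability (e.g. 0.01 for 1% VaR).
    """
    returns = np.asarray(returns)
    var_forecasts = np.asarray(var_forecasts)
    violations = returns < -var_forecasts               # loss exceeds the reported VaR
    T, x = len(returns), int(violations.sum())
    pi_hat = x / T                                       # observed violation rate
    ll_null = (T - x) * np.log(1 - alpha) + x * np.log(alpha)
    if x == 0 or x == T:
        ll_alt = 0.0                                     # convention 0 * log(0) = 0
    else:
        ll_alt = (T - x) * np.log(1 - pi_hat) + x * np.log(pi_hat)
    lr = -2 * (ll_null - ll_alt)                         # asymptotically chi-squared with 1 df
    return lr, chi2.sf(lr, df=1)                         # higher p value implies better coverage
```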

Empirical results

In this paper, we examine the daily returns of two indices, namely the S&P500 and the ALSI, both extracted from McGregor BFA and covering the period 03/01/2000 to 11/06/2015 (a total of 4029 data points). For each dataset, we calculate the daily returns as

$$R_{t} = \ln \left( {\frac{{C_{t} }}{{C_{t - 1} }}} \right),$$

where \(C_t\) is the closing price on day t. Table 1 presents the excess kurtosis and skewness of both return series. Both series are leptokurtic and asymmetric, as is commonly observed in financial data; hence, they deviate from the Gaussian distribution. This is also highlighted by the QQ plots in Fig. 1, which indicate that the tails of both series depart significantly from the Gaussian distribution.
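The return calculation and the summary statistics in Table 1 can be reproduced along the following lines; the file name and column label are placeholders rather than the actual data source.

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, skew

# Placeholder data source; replace with the actual closing-price series.
prices = pd.read_csv("index_close.csv", parse_dates=["Date"], index_col="Date")["Close"]
returns = np.log(prices / prices.shift(1)).dropna()         # R_t = ln(C_t / C_{t-1})

print("excess kurtosis:", kurtosis(returns, fisher=True))   # 0 for the Gaussian distribution
print("skewness:       ", skew(returns))                    # 0 for a symmetric distribution
```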

Table 1 Excess kurtosis and skewness for S&P500 and ALSI
Fig. 1 Normal QQ plots of S&P500 (left) and ALSI (right)

The partial autocorrelation function (PACF) plots in Fig. 2 show a number of marginally significant lags for both datasets, while the time series plots of the returns (Fig. 3) exhibit volatility clustering. This is evidence that at least some weak dependence exists in the datasets, justifying the use of dependent bootstrapping.

Fig. 2 PACF plots for S&P500 (left) and ALSI (right)

Fig. 3 Time series plots for S&P500 (left) and ALSI (right)

After applying a GARCH filter (under both Gaussian and Student t noise) to each series, we extract the corresponding residuals. The skewness and excess kurtosis of these residual series are presented in Table 2. The residuals still exhibit significant excess kurtosis (albeit less than the original series), with the S&P500 innovations being more leptokurtic than those of the ALSI. Skewness also persists in the residuals.

Table 2 Excess kurtosis and skewness for GARCH innovations of S&P500 and ALSI

Figures 4 and 5 present the PACF plots for the GARCH residuals of the S&P500 and ALSI returns, respectively. All plots show very slow (if any) decay of the autocorrelation, indicating long-range dependence in the GARCH innovations. This is further evidence that dependent bootstrapping may be required instead of the usual IID bootstrap.

Fig. 4 PACF plots for S&P500 innovations (squared) under Gaussian noise (left) and innovations under Student t noise (right)

Fig. 5 PACF plots for ALSI innovations under Gaussian noise (left) and innovations (squared) under Student t noise (right)

We now estimate VaR on a daily basis by utilizing the various bootstrapping methods (without GARCH filters) and using a moving window of 250, 500, or 1000 trading days. VaR is calculated at the 0.1, 1, 5, and 10% levels for both S&P500 and ALSI returns. The resulting series of VaR estimates are presented in Figs. 6, 7, and 8 (using the original return series) in Appendix 1. The results are also benchmarked against the corresponding VaR estimates obtained from the Gaussian distribution (Fig. 9). The formal backtesting results for the VaR and ES estimates are displayed in Tables 5, 6, 7, and 8 in Appendix 1.

We observe that the bootstrap methods do perform better than the Gaussian distribution in terms of VaR estimation. This is apparent as the Gaussian VaR estimates often underestimate the magnitude of risk, generating an excessive number of VaR violations. We also see that the selection of the moving window size plays an essential role in how quickly the VaR estimates can return to normal levels after a shock to the market. In particular, a smaller window size tends to accelerate the return of VaR to normal levels, while a large window size can cause large VaR estimates to lag. These lagging estimates can drastically overestimate the actual magnitude of risk. Practically, this corresponds to a financial institution setting aside larger-than-required risk capital even after a crisis has passed, capital which could have been invested elsewhere. However, a small window provides very little information about the extreme tails of the data. Moreover, as often depicted in the literature (and evidenced by the figures in Appendix 1), direct bootstrapping of the returns is in general slow to react to changes in the markets (even for the smallest window size of 250), as the model does not respond quickly enough to the changing volatility. Hence, we next consider bootstrapping within a filtered historical simulation procedure.

The return series are now filtered by GARCH processes, with either Gaussian or Student t noise. The innovations are extracted and bootstrapped by the three methods as before. Again, the three different window sizes are used and the risk measures are estimated at the four different VaR levels. The backtesting results are given in Tables 3 and 4, while the plots of the VaR estimates from the various models are given in Appendix 2.

Table 3 Backtesting of daily one-day-ahead VaR and ES estimates for S&P500 with GARCH filters
Table 4 Backtesting of daily one-day-ahead VaR and ES estimates for ALSI with GARCH filters

It is clearly evident that the added GARCH effect is better at capturing the changing variance and produces significantly more adequate models, relative to the earlier results in Appendix 1. In Tables 3 and 4, we have also highlighted (in bold) the situations where dependent bootstrapping has produced results equal to or better than the IID counterpart (for the Kupiec test). Although the performances of the different bootstrapping methods are quite similar when restricted to a common window size and noise assumption, there is evidence of possible improvements over the IID approach.

It is also worth noting that window size selection does have an effect on the performance of the different models for the two indices at the various VaR levels. For 10% VaR, a window size of 500 seems to be the best choice. However, a window size of 1000 is likely to be more adequate for more extreme VaR levels. In fact, for ALSI returns, the models with window size 1000 and Student t noise are the only cases that produce satisfactory results for 0.1% VaR.

The comparisons across the two indices are also noteworthy. In general, the models seem to perform better for the ALSI at the 10 and 1% VaR levels, but are relatively less effective at the 5 and 0.1% VaR levels. This may be attributed to the vastly different structures of the two indices and how they react to sudden changes in the market. It is also interesting that the gap between the better performing models and the less adequate ones is much larger for the S&P500 than for the ALSI. This gap is largely driven by the choice of window size, making that factor a vital one for risk prediction on S&P500 returns.

In general, a GARCH filter should be applied, and there is some evidence that dependent bootstrapping can improve VaR and ES estimation. However, the models generally perform quite similarly when restricted to similar conditions. At the same time, window size selection does have an impact on model performance, and more significantly so for S&P500 returns. The optimal model for the two indices varies across the VaR levels. Our recommendation, thus far, is to employ a window size of 1000 for ALSI returns (possibly for all VaR levels, as window size 500 is only marginally better at the 10% VaR level) and a window size of 500 for the S&P500 at 10% VaR. For the 5, 1, and 0.1% VaR levels of the S&P500, the better choice is a window size of 1000. We also recommend using both block bootstrapping and stationary bootstrapping to check which gives the better result; they tend to produce results similar to ordinary bootstrapping, but one of them can at times give improved estimates. It may be highly beneficial to explore a model switching procedure to optimize the performance of these models.

Conclusion

In this paper, we have examined the use of dependent bootstrapping in VaR and ES estimation for daily returns of the S&P500 and ALSI. The bootstrap methods included the ordinary bootstrap, block bootstrap, and stationary bootstrap. These approaches were implemented with and without GARCH filters and were contrasted against the corresponding parametric approaches. As expected, the bootstrap methods produced improved VaR results compared to their parametric counterparts. Furthermore, the GARCH filters yielded further improvements in the VaR and ES estimates.

Overall, we suggest that a GARCH filter be applied for both VaR and ES estimation. A window size of 1000 seems to be quite satisfactory for ALSI returns. However, for the S&P500, a window size of 500 can significantly improve the estimates at the 10% VaR level but should be reverted to 1000 for more extreme VaR levels.

As further work, we suggest implementing a model switching approach with varying window sizes to fully capture the changing results across the VaR levels. Such a model should also be able to adjust the bootstrapping method according to the particular window at hand; this, too, is left for further investigation. Another avenue for exploration is to compare our results with other backtesting procedures; for example, one may consider the new backtesting method for ES proposed by Acerbi and Szekely (2017).