1 Introduction

Accurate forecasting of key macroeconomic variables such as Gross Domestic Product (GDP), inflation, and industrial production has been at the forefront of economic research for many decades. Early approaches involved univariate models or, at best, low-dimensional multivariate systems. The era of big data has led to the use of regularisation and shrinkage methods such as dynamic factor models, the Lasso, LARS, and Bayesian VARs, in an effort to exploit the plethora of potentially useful predictors now available. These predictors commonly also include the components of the variables of interest. For instance, GDP is formed as an aggregate of consumption, government expenditure, investment, and net exports, with each of these components also formed as an aggregate of other economic variables. While the macroeconomic forecasting literature regularly uses such sub-indices as predictors, it does so in ways that fail to exploit the accounting identities that describe known deterministic relationships between macroeconomic variables.

In this chapter we take a different approach. Over the past decade there has been a growing literature on forecasting collections of time series that follow aggregation constraints, known as hierarchical time series. Initially the aim of this literature was to ensure that forecasts adhered to aggregation constraints thus ensuring aligned decision-making. However, in many empirical settings the forecast reconciliation methods designed to deal with this problem have also been shown to improve forecast accuracy. Examples include forecasting accidents and emergency admissions (Athanasopoulos, Hyndman, Kourentzes, & Petropoulos, 2017), mortality rates (Shang & Hyndman, 2017), prison populations (Athanasopoulos, Steel, & Weatherburn, 2019), retail sales (Villegas & Pedregal, 2018), solar energy (Yagli, Yang, & Srinivasan, 2019; Yang, Quan, Disfani, & Liu, 2017), tourism demand (Athanasopoulos, Ahmed, & Hyndman, 2009; Hyndman, Ahmed, Athanasopoulos, & Shang, 2011; Wickramasuriya, Athanasopoulos, & Hyndman, 2018), and wind power generation (Zhang & Dong, 2019). Since both aligned decision-making and forecast accuracy are key concerns for economic agents and policy makers we propose the application of state-of-the-art forecast reconciliation methods to macroeconomic forecasting. To the best of our knowledge the only application of forecast reconciliation methods to macroeconomics focuses on point forecasting for inflation (Capistrán, Constandse, & Ramos-Francia, 2010; Weiss, 2018).

The remainder of this chapter is set out as follows: Section 21.2 introduces the concept of hierarchical time series, i.e., collections of time series with known linear constraints, with a particular emphasis on macroeconomic examples. Section 21.3 describes state-of-the-art forecast reconciliation techniques for point forecasts, while Sect. 21.4 describes the more recent extension of these techniques to probabilistic forecasting. Section 21.5 describes the data used in our empirical case study, namely Australian GDP data, that is represented using two alternative hierarchical structures. Section 21.6 provides details on the setup of our empirical study including metrics used for the evaluation of both point and probabilistic forecasts. Section 21.7 presents results and Sect. 21.8 concludes providing future avenues for research that are of particular relevance to the empirical macroeconomist.

2 Hierarchical Time Series

To simplify the introduction of some notation we use the simple two-level hierarchical structure shown in Fig. 21.1. Denote as y Tot,t the value observed at time t for the most aggregate (Total) series corresponding to level 0 of the hierarchy. Below level 0, denote as y i,t the value of the series corresponding to node i, observed at time t. For example, y A,t denotes the tth observation of the series corresponding to node A at level 1, y AB,t denotes the tth observation of the series corresponding to node AB at level 2, and so on.

Fig. 21.1
figure 1

A simple two-level hierarchical structure

Let y t = (y Tot,t, y A,t, y B,t, y AA,t, y AB,t, y BA,t, y BB,t, y BC,t)′ denote a vector containing observations across all series of the hierarchy at time t. Similarly denote as b t = (y AA,t, y AB,t, y BA,t, y BB,t, y BC,t)′ a vector containing observations only for the bottom-level series. In general, y t ∈ ℝn and b t ∈ ℝm, where n denotes the total number of series in the structure, m the number of series at the bottom level, and n > m always. In the simple example of Fig. 21.1, n = 8 and m = 5.

Aggregation constraints dictate that y Tot,t = y A,t + y B,t = y AA,t + y AB,t + y BA,t + y BB,t + y BC,t, y A,t = y AA,t + y AB,t and y B,t = y BA,t + y BB,t + y BC,t. Hence we can write

$$\displaystyle \begin{aligned} \boldsymbol{y}_t = \boldsymbol{Sb}_t, \end{aligned} $$
(21.1)

where

$$\displaystyle \begin{aligned} \boldsymbol{S} = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 1 \\ & & \boldsymbol{I}_{m} & & \end{pmatrix} \end{aligned}$$

is an n × m matrix referred to as the summing matrix and I m is an m-dimensional identity matrix. S reflects the linear aggregation constraints and in particular how the bottom-level series aggregate to levels above. Thus, columns of S span the linear subspace of ℝn for which the aggregation constraints hold. We refer to this as the coherent subspace and denote it by \(\mathfrak {s}\). Notice that pre-multiplying a vector in ℝm by S will result in an n-dimensional vector that lies in \(\mathfrak {s}\).
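As a concrete illustration of (21.1), the following sketch builds the summing matrix for the hierarchy of Fig. 21.1 and checks that a made-up bottom-level vector aggregates coherently. The snippet is in Python with numpy purely for exposition and is not part of the original analysis.

```python
import numpy as np

# Summing matrix S for the hierarchy of Fig. 21.1 (n = 8, m = 5).
# Row order matches y_t: Total, A, B, AA, AB, BA, BB, BC;
# column order matches b_t: AA, AB, BA, BB, BC.
S = np.vstack([
    np.ones((1, 5)),                      # Total = AA + AB + BA + BB + BC
    np.array([[1., 1., 0., 0., 0.],       # A = AA + AB
              [0., 0., 1., 1., 1.]]),     # B = BA + BB + BC
    np.eye(5),                            # bottom-level series map to themselves
])

b_t = np.array([2.0, 3.0, 1.5, 2.5, 4.0])  # hypothetical bottom-level observations
y_t = S @ b_t                              # the full coherent vector, Eq. (21.1)

assert np.isclose(y_t[0], y_t[1] + y_t[2])   # Total = A + B
assert np.isclose(y_t[1], b_t[0] + b_t[1])   # A = AA + AB
```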

Property 21.1

A hierarchical time series has observations that are coherent, i.e., \(\boldsymbol {y}_{t} \in \mathfrak {s}\) for all t. We use the term coherent to describe not just y t but any vector in \(\mathfrak {s}\).

Structures similar to the one shown in Fig. 21.1 can be found in macroeconomics. For instance, in Sect. 21.5 we consider two alternative hierarchical structures for the case of GDP and its components. However, while this motivating example involves aggregation constraints, the mathematical framework we use can be applied for any general linear constraints, examples of which are ubiquitous in macroeconomics. For instance, the trade balance is computed as exports minus imports, while the consumer price index is computed as a weighted average of sub-indices, which are in turn weighted averages of sub-sub-indices, and so on. These structures can also be captured by an appropriately designed S matrix.

An important alternative aggregation structure, also commonly found in macroeconomics, is one for which the most aggregate series is disaggregated by attributes of interest that are crossed, as distinct from nested, which is the case for hierarchical time series. For example, industrial production may be disaggregated along the lines of geography or sector or both. We refer to this as a grouped structure. Figure 21.2 shows a simple example of such a structure. The Total series disaggregates into y A,t and y B,t, but also into y X,t and y Y,t, at level 1, and then into the bottom-level series, b t = (y AX,t, y AY,t, y BX,t, y BY,t). Hence, in contrast to hierarchical structures, grouped time series do not naturally disaggregate in a unique manner.

Fig. 21.2
figure 2

A simple two-level grouped structure

An important implementation of aggregation structures is given by temporal hierarchies, introduced by Athanasopoulos et al. (2017). In this case the aggregation structure spans the time dimension and dictates how higher frequency data (e.g., monthly) are aggregated to lower frequencies (e.g., quarterly, annual). There is a vast literature that studies the effects of temporal aggregation, going back to the seminal work of Amemiya and Wu (1972), Brewer (1973), Tiao (1972), Zellner and Montmarquette (1971) and others, including Hotta and Cardoso Neto (1993), Hotta (1993), Marcellino (1999), and Silvestrini, Salto, Moulin, and Veredas (2008). The main aim of this work is to find the single best level of aggregation for modelling and forecasting time series. In this literature, the analyses, results (whether theoretical or empirical), and inferences are extremely heterogeneous, making it very challenging to reach a consensus or to draw firm conclusions. For example, Rossana and Seater (1995), who study the effect of aggregation on several key macroeconomic variables, state:

Quarterly data do not seem to suffer badly from temporal aggregation distortion, nor are they subject to the construction problems affecting monthly data. They therefore may be the optimal data for econometric analysis.

A similar conclusion is reached by Nijman and Palm (1990). Silvestrini et al. (2008) consider forecasting the French cash state deficit and provide empirical evidence of forecast accuracy gains from forecasting with the aggregate model rather than aggregating forecasts from the disaggregate model.

The vast majority of this literature concentrates on a single level of temporal aggregation (although there are some notable exceptions such as Andrawis, Atiya, and El-Shishiny (2011), Kourentzes, Petropoulos, and Trapero (2014)). Athanasopoulos et al. (2017) show that considering multiple levels of aggregation via temporal hierarchies and implementing forecast reconciliation approaches rather than single-level approaches results in substantial gains in forecast accuracy across all levels of temporal aggregation.
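For intuition, the same summing-matrix machinery captures a temporal hierarchy. The sketch below is an illustrative construction for one year of quarterly data (not the notation of Athanasopoulos et al. (2017)): it stacks the annual total, the two half-years, and the four quarters.

```python
import numpy as np

# Temporal-hierarchy summing matrix for one year of quarterly data:
# bottom level = 4 quarters; aggregates = 2 half-years and the annual total.
S_temporal = np.vstack([
    np.ones((1, 4)),                 # annual = Q1 + Q2 + Q3 + Q4
    np.array([[1., 1., 0., 0.],      # first half-year  = Q1 + Q2
              [0., 0., 1., 1.]]),    # second half-year = Q3 + Q4
    np.eye(4),                       # the quarters themselves
])

quarters = np.array([10.0, 12.0, 11.0, 13.0])   # hypothetical quarterly values
print(S_temporal @ quarters)                    # annual, half-years, then quarters
```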

3 Point Forecasting

A requirement when forecasting hierarchical time series is that the forecasts adhere to the same aggregation constraints as the observed data; i.e., they are coherent.

Definition 21.1

A set of h-step-ahead forecasts \(\tilde {\boldsymbol {y}}_{T+h|T}\), stacked in the same order as y t and generated using information up to and including time T, are said to be coherent if \(\tilde {\boldsymbol {y}}_{T+h|T} \in \mathfrak {s}\).

Hence, coherent forecasts of lower level series aggregate to their corresponding upper level series and vice versa.

Let us consider the smallest possible hierarchy with two bottom-level series, depicted in Fig. 21.3, where y Tot = y A + y B. While base forecasts could lie anywhere in ℝ3, the realisations and coherent forecasts lie in a two dimensional subspace \(\mathfrak {s}\subset \mathbb{R}^3\).

Fig. 21.3
figure 3

Representation of a coherent subspace in a three dimensional hierarchy where y Tot = y A + y B. The coherent subspace is depicted as a grey two dimensional plane labelled \(\mathfrak {s}\). Note that the columns of S, s 1 = (1, 1, 0)′ and s 2 = (1, 0, 1)′, form a basis for \(\mathfrak {s}\). The red points lying on \(\mathfrak {s}\) can be either realisations or coherent forecasts

3.1 Single-Level Approaches

A common theme across all traditional approaches for forecasting hierarchical time series is that a single level of aggregation is first selected and forecasts for that level are generated. These are then linearly combined to generate a set of coherent forecasts for the rest of the structure.

3.1.1 Bottom-Up

In the bottom-up approach, forecasts for the most disaggregate level are first generated. These are then aggregated to obtain forecasts for all other series of the hierarchy (Dunn, Williams, & Dechaine, 1976). In general, this consists of first generating \(\hat {\boldsymbol {b}}_{T+h|T} \in \mathbb{R}^m\), a set of h-step-ahead forecasts for the bottom-level series. For the simple hierarchical structure of Fig. 21.1, \(\hat {\boldsymbol {b}}_{T+h|T} = (\hat {{y}}_{\text{AA},T+h|T}, \hat {{y}}_{\text{AB},T+h|T}, \hat {{y}}_{\text{BA},T+h|T}, \hat {{y}}_{\text{BB},T+h|T},\hat {{y}}_{\text{BC},T+h|T}),\)where \(\hat {{y}}_{i,T+h|T}\) is the h-step-ahead forecast of the series corresponding to node i. A set of coherent forecasts for the whole hierarchy is then given by

$$\displaystyle \begin{aligned} \tilde{\boldsymbol{y}}^{\text{BU}}_{T+h|T}=\boldsymbol{S\hat{\boldsymbol{b}}}_{T+h|T}. \end{aligned}$$

Generating bottom-up forecasts has the advantage of no information being lost due to aggregation. However, bottom-level data can potentially be highly volatile or very noisy and therefore challenging to forecast.

3.1.2 Top-Down

In contrast, top-down approaches involve first generating forecasts for the most aggregate level and then disaggregating these down the hierarchy. In general, coherent forecasts generated from top-down approaches are given by

$$\displaystyle \begin{aligned} \tilde{\boldsymbol{y}}^{\text{TD}}_{T+h|T}=\boldsymbol{S}\boldsymbol{p}\hat{y}_{\text{Tot}, T+h|T}, \end{aligned}$$

where p = (p 1, …, p m) is an m-dimensional vector consisting of a set of proportions which disaggregate the top-level forecast \(\hat {y}_{\text{Tot}, T+h|T}\) to forecasts for the bottom-level series; hence \(\boldsymbol {p}\hat {y}_{\text{Tot}, T+h|T}=\hat {\boldsymbol {\boldsymbol {b}}}_{T+h|T}\). These are then aggregated by the summing matrix S.

Traditionally, proportions have been calculated based on the observed historical data. Gross and Sohl (1990) present and evaluate twenty-one alternative approaches. The most convenient attribute of these approaches is their simplicity. Generating a set of coherent forecasts involves only modelling and generating forecasts for the most aggregate top-level series. In general, such top-down approaches seem to produce quite reliable forecasts for the aggregate levels and they are useful with low count data. However, a significant disadvantage is the loss of information due to aggregation. A limitation of such top-down approaches is that characteristics of lower level series cannot be captured. To overcome this, Athanasopoulos et al. (2009) introduced a new top-down approach which disaggregates the top-level based on proportions of forecasts rather than the historical data and showed that this method outperforms the conventional top-down approaches. However, a limitation of all top-down approaches is that they introduce bias to the forecasts even when the top-level forecast itself is unbiased. We discuss this in detail in Sect. 21.3.2.

3.1.3 Middle-Out

A compromise between bottom-up and top-down approaches is the middle-out approach. It entails first forecasting the series of a selected middle level. For series above the middle level, coherent forecasts are generated using the bottom-up approach by aggregating the middle-level forecasts. For series below the middle level, coherent forecasts are generated using a top-down approach by disaggregating the middle-level forecasts. Similarly to the top-down approach, it is useful when bottom-level data are low count. Since the middle-out approach involves generating top-down forecasts, it also introduces bias to the forecasts.

3.2 Point Forecast Reconciliation

All approaches discussed so far are limited to only using information from a single level of aggregation. Furthermore, these ignore any correlations across levels of a hierarchy. An alternative framework that overcomes these limitations is one that involves forecast reconciliation. In a first step, forecasts for all the series across all levels of the hierarchy are computed, ignoring any aggregation constraints. We refer to these as base forecasts and denote them by \(\hat {\boldsymbol {y}}_{T+h|T}\). In general, base forecasts will not be coherent, unless a very simple method has been used to compute them, such as naïve forecasts. In that case, forecasts are simply equal to a previous realisation of the data and they inherit the property of coherence.

The second step is an adjustment that reconciles base forecasts so that they become coherent. In general, this is achieved by mapping the base forecasts \(\hat {\boldsymbol {y}}_{T+h|T}\) onto the coherent subspace \(\mathfrak {s}\) via a matrix SG, resulting in a set of coherent forecasts \(\tilde {\boldsymbol {y}}_{T+h|T}\). Specifically,

$$\displaystyle \begin{aligned} \tilde{\boldsymbol{y}}_{T+h|T}=\boldsymbol{S}\boldsymbol{G}\hat{\boldsymbol{y}}_{T+h|T}, \end{aligned} $$
(21.2)

where G is an m × n matrix that maps \(\hat {\boldsymbol {y}}_{T+h|T}\) to ℝm, producing new forecasts for the bottom level, which are in turn mapped to the coherent subspace by the summing matrix S. We restrict our attention to projections onto \(\mathfrak {s}\), in which case SGS = S. This ensures that unbiasedness is preserved, i.e., if the base forecasts are unbiased, then the reconciled forecasts will also be unbiased.

Note that all single-level approaches discussed so far can also be represented by (21.2) using appropriately designed G matrices; however, not all of these will be projections. For example, for the bottom-up approach, \(\boldsymbol{G} = \begin{pmatrix}\boldsymbol{0}_{m\times (n-m)} & \boldsymbol{I}_m\end{pmatrix}\), in which case SGS = S. For any top-down approach, \(\boldsymbol{G} = \begin{pmatrix}\boldsymbol{p} & \boldsymbol{0}_{m\times (n-1)}\end{pmatrix}\), for which SGS ≠ S.
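To make these mappings concrete, the following sketch forms the bottom-up and top-down G matrices for the hierarchy of Fig. 21.1, using a made-up vector of proportions p, and checks which of them satisfies SGS = S. It is an illustrative Python/numpy sketch rather than any package implementation.

```python
import numpy as np

n, m = 8, 5
S = np.vstack([np.ones((1, m)),
               [[1., 1., 0., 0., 0.], [0., 0., 1., 1., 1.]],
               np.eye(m)])

# Bottom-up: G = [0_{m x (n-m)} | I_m] picks out the bottom-level base forecasts.
G_bu = np.hstack([np.zeros((m, n - m)), np.eye(m)])

# Top-down: G = [p | 0_{m x (n-1)}] with illustrative proportions summing to one.
p = np.array([0.3, 0.2, 0.2, 0.2, 0.1])
G_td = np.hstack([p.reshape(-1, 1), np.zeros((m, n - 1))])

print(np.allclose(S @ G_bu @ S, S))   # True:  SGS = S, so unbiasedness is preserved
print(np.allclose(S @ G_td @ S, S))   # False: SGS != S, so top-down introduces bias
```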

3.2.1 Optimal MinT Reconciliation

Wickramasuriya et al. (2018) build a unifying framework for much of the previous literature on forecast reconciliation. We present here a detailed outline of this approach and in turn relate it to previous significant contributions in forecast reconciliation.

Assume that \(\hat {\boldsymbol {y}}_{T+h|T}\) is a set of unbiased base forecasts, i.e., \(\text{E}_{1:T}(\hat {\boldsymbol {y}}_{T+h|T})= \text{E}_{1:T}[\boldsymbol {y}_{T+h}\mid \boldsymbol {y}_1,\dots ,\boldsymbol {y}_T]\), the true mean with the expectation taken over the observed sample up to time T. Let

$$\displaystyle \begin{aligned} \hat{\boldsymbol{e}}_{T+h|T} = \boldsymbol{y}_{T+h}-\hat{\boldsymbol{y}}_{T+h|T} \end{aligned} $$
(21.3)

denote a set of base forecast errors with Var\((\hat {\boldsymbol {e}}_{T+h|T})=\boldsymbol {W}_h\), and

$$\displaystyle \begin{aligned} \tilde{\boldsymbol{e}}_{T+h|T} = \boldsymbol{y}_{T+h}-\tilde{\boldsymbol{y}}_{T+h|T} \end{aligned}$$

denote a set of coherent forecast errors. Lemma 1 in Wickramasuriya et al. (2018) shows that for any matrix G such that SGS = S, \(\text{Var}(\tilde {\boldsymbol {e}}_{T+h|T})=\boldsymbol {S}\boldsymbol {G}\boldsymbol {W}_h\boldsymbol {G}^{\prime }\boldsymbol {S}^{\prime } \). Furthermore Theorem 1 shows that

$$\displaystyle \begin{aligned} \boldsymbol{G} = (\boldsymbol{S}^{\prime}{\boldsymbol{W}}^{-1}_h\boldsymbol{S})^{-1}\boldsymbol{S}^{\prime}{\boldsymbol{W}}^{-1}_h \end{aligned} $$
(21.4)

is the unique solution that minimises the trace of \(\boldsymbol{S}\boldsymbol{G}\boldsymbol{W}_h\boldsymbol{G}^{\prime}\boldsymbol{S}^{\prime}\) subject to SGS = S. MinT is optimal in the sense that, given a set of unbiased base forecasts, it returns a set of best linear unbiased reconciled forecasts, using as G the unique solution that minimises the trace (hence MinT) of the variance of the forecast error of the reconciled forecasts.
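A minimal sketch of the MinT reconciliation step, assuming an estimate W of the base forecast error covariance is already in hand (here a made-up diagonal matrix stands in for an estimate of W_h):

```python
import numpy as np

def mint_reconcile(S, W, y_hat):
    """Map base forecasts y_hat onto the coherent subspace using Eq. (21.4).

    S     : n x m summing matrix
    W     : n x n covariance of the base forecast errors (assumed given here)
    y_hat : length-n vector of (possibly incoherent) base forecasts
    """
    W_inv = np.linalg.inv(W)
    G = np.linalg.solve(S.T @ W_inv @ S, S.T @ W_inv)   # (S' W^-1 S)^-1 S' W^-1
    return S @ G @ y_hat

m = 5
S = np.vstack([np.ones((1, m)),
               [[1., 1., 0., 0., 0.], [0., 0., 1., 1., 1.]],
               np.eye(m)])
rng = np.random.default_rng(0)
W = np.diag(rng.uniform(0.5, 2.0, size=S.shape[0]))     # stand-in for an estimate of W_h
y_hat = rng.normal(size=S.shape[0])                      # incoherent base forecasts
y_tilde = mint_reconcile(S, W, y_hat)
print(np.isclose(y_tilde[0], y_tilde[1] + y_tilde[2]))   # reconciled forecasts are coherent
```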

A significant advantage of the MinT reconciliation solution is that it is the first to incorporate the full correlation structure of the hierarchy via W h. However, estimating W h is challenging, especially for h > 1. Wickramasuriya et al. (2018) present possible alternative estimators for W h and show that these lead to different G matrices. We summarise these below.

  • Set \(\boldsymbol{W}_h = k_h\boldsymbol{I}_n\) for all h, where \(k_h > 0\) is a proportionality constant. This simple assumption returns \(\boldsymbol{G} = (\boldsymbol{S}^{\prime}\boldsymbol{S})^{-1}\boldsymbol{S}^{\prime}\) so that the base forecasts are orthogonally projected onto the coherent subspace \(\mathfrak {s}\), minimising the Euclidean distance between \(\hat {\boldsymbol {y}}_{T+h|T}\) and \(\tilde {\boldsymbol {y}}_{T+h|T}\). Hyndman et al. (2011) arrive at the same solution from the perspective of the following regression model

    $$\displaystyle \begin{aligned} \hat{\boldsymbol{y}}_{T+h|T} = \boldsymbol{S}\boldsymbol{\beta}_{T+h|T} + \boldsymbol{\varepsilon}_{T+h|T}, \end{aligned}$$

    where \(\boldsymbol{\beta}_{T+h|T} = \text{E}[\boldsymbol{b}_{T+h}\mid \boldsymbol{b}_1, \dots, \boldsymbol{b}_T]\) is the unknown conditional mean of the bottom-level series and \(\boldsymbol{\varepsilon}_{T+h|T}\) is the coherence or reconciliation error with mean zero and variance \(\boldsymbol{V}\). The OLS solution leads to the same projection matrix \(\boldsymbol{S}(\boldsymbol{S}^{\prime}\boldsymbol{S})^{-1}\boldsymbol{S}^{\prime}\), and due to this interpretation we continue to refer to this reconciliation method as OLS. A disadvantage of the OLS solution is that the homoscedastic diagonal entries do not account for the scale differences between the levels of the hierarchy due to aggregation. Furthermore, OLS does not account for the correlations across series.

  • Set \({\boldsymbol {W}}_{h}=k_{h}\text{diag}(\hat {\boldsymbol {W}}_{1})\) for all h (k h > 0), where

    $$\displaystyle \begin{aligned} \hat{\boldsymbol{W}}_{1} = \frac{1}{T}\sum_{t=1}^{T} \hat{\boldsymbol{e}}_{t}\hat{\boldsymbol{e}}_{t}^{\prime} \end{aligned}$$

    is the unbiased sample covariance estimator of the in-sample one-step-ahead base forecast errors as defined in (21.3). Hence this estimator scales the base forecasts using the variance of the in-sample residuals and is therefore described and referred to as a weighted least squares (WLS) estimator applying variance scaling. A similar estimator was proposed by Hyndman et al. (2019).

    An alternative WLS estimator is proposed by Athanasopoulos et al. (2017) in the context of temporal hierarchies. Here \(\boldsymbol{W}_h\) is proportional to \(\text{diag}(\boldsymbol{S}\boldsymbol{1})\), where 1 is a unit column vector of dimension m. Hence the weights are proportional to the number of bottom-level variables required to form each aggregate. For example, in the hierarchy of Fig. 21.1, the weights corresponding to the Total, series A and series B are proportional to 5, 2 and 3 respectively. This weighting scheme depends only on the aggregation structure and is referred to as structural scaling. Its advantage over OLS is that it assumes equivariant forecast errors only at the bottom level of the structure and not across all levels. It is particularly useful in cases where forecast errors are not available; for example, in cases where the base forecasts are generated by judgemental forecasting.

  • Set \(\boldsymbol {W}_{h}=k_{h}\hat {\boldsymbol {W}}_{1}\) for all h (\(k_h > 0\)) to be proportional to the unrestricted sample covariance estimator for h = 1. Although this is relatively simple to obtain and provides a good solution for small hierarchies, it does not provide reliable results as m grows compared to T. This is referred to as the MinT(Sample) estimator.

  • Set \(\boldsymbol {W}_{h}=k_{h}\hat {\boldsymbol {W}}_{1}^D\) for all h (k h > 0), where \(\hat {\boldsymbol {W}}^{D}_{1} = \lambda _{D} \text{diag}(\hat {\boldsymbol {W}}_{1}) + (1 - \lambda _{D})\hat {\boldsymbol {W}}_{1}\) is a shrinkage estimator with diagonal target and shrinkage intensity parameter

    $$\displaystyle \begin{aligned} \hat{\lambda}_{D} = \frac{\sum_{i \ne j}\hat{\text{Var}}(\hat{r}_{ij})}{\sum_{i \ne j}\hat{r}_{ij}^2}, \end{aligned}$$

    where \(\hat {r}_{ij}\) is the (i, j)th element of \(\hat {\boldsymbol {R}}_{1}\), the one-step-ahead sample correlation matrix as proposed by Schäfer and Strimmer (2005). Hence, off-diagonal elements of \(\hat {\boldsymbol {W}}_1\) are shrunk towards zero while diagonal elements (variances) remain unchanged. This is referred to as the MinT(Shrink) estimator.
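The shrinkage estimator above can be sketched as follows; the shrinkage intensity is computed from the sample correlations of the in-sample one-step-ahead errors, in the spirit of the Schäfer and Strimmer (2005) formula quoted in the text. Treat this as an illustrative implementation under those assumptions, not the authors' exact code. Plugging the resulting matrix into (21.4) (for example via the mint_reconcile sketch above) gives MinT(Shrink); replacing it with its diagonal gives the WLS variance-scaling variant.

```python
import numpy as np

def shrink_covariance(E):
    """Shrinkage estimator of W_1 from an n x T matrix E of in-sample base forecast errors."""
    n, T = E.shape
    W1 = (E @ E.T) / T                      # sample covariance of the errors, as in the text
    d = np.sqrt(np.diag(W1))
    R = W1 / np.outer(d, d)                 # implied sample correlation matrix

    # Estimated variances of the off-diagonal correlations (Schafer-Strimmer style).
    Z = E / d[:, None]                      # standardised errors
    P = np.einsum('it,jt->ijt', Z, Z)       # products z_{i,t} * z_{j,t}
    var_r = P.var(axis=2, ddof=1) * T / (T - 1) ** 2

    off = ~np.eye(n, dtype=bool)
    lam = np.clip(var_r[off].sum() / (R[off] ** 2).sum(), 0.0, 1.0)

    # Shrink off-diagonal elements towards zero; variances are left unchanged.
    return lam * np.diag(np.diag(W1)) + (1.0 - lam) * W1
```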

4 Hierarchical Probabilistic Forecasting

A limitation of point forecasts is that they provide no indication of uncertainty around the forecast. A richer description of forecast uncertainty can be obtained by providing a probabilistic forecast, also commonly referred to as a density forecast. For a review of probabilistic forecasts, and scoring rules for evaluating such forecasts, see Gneiting and Katzfuss (2014). This chapter and Chapter 16 respectively provide comprehensive summaries of methods for constructing density forecasts and predictive accuracy tests for both point and density forecasts. In recent years, the use of probabilistic forecasts and their evaluation via scoring rules has become pervasive in macroeconomic forecasting, some notable (but non-exhaustive) examples are Geweke and Amisano (2010), Billio, Casarin, Ravazzolo, and Van Dijk (2013), Carriero, Clark, and Marcellino (2015) and Clark and Ravazzolo (2015).

The literature on hierarchical probabilistic forecasting is still an emerging area of interest. To the best of our knowledge the first attempt to even define coherence in the setting of probabilistic forecasting is provided by Taieb, Taylor, and Hyndman (2017) who define a coherent forecast in terms of a convolution. An equivalent definition due to Gamakumara, Panagiotelis, Athanasopoulos, and Hyndman (2018) defines a coherent probabilistic forecast as a probability measure on the coherent subspace \(\mathfrak {s}\). Gamakumara et al. (2018) also generalise the concept of forecast reconciliation to the probabilistic setting.

Definition 21.2

Let \(\mathcal {A}\) be a subset of \(\mathfrak {s}\) and let \(\mathcal {B}\) be all points in ℝn that are mapped onto \(\mathcal {A}\) after premultiplication by SG. Letting \(\hat {\nu }\) be a base probabilistic forecast for the full hierarchy, the coherent measure \(\tilde {\nu }\) reconciles \(\hat {\nu }\) if \(\tilde {\nu }(\mathcal {A})=\hat {\nu }(\mathcal {B})\) for all \(\mathcal {A}\).

In practice this definition leads to two approaches. For some parametric distributions, for instance the multivariate normal, a reconciled probabilistic forecast can be derived analytically. However, in macroeconomic forecasting, non-standard distributions such as bimodal distributions are often required to take different policy regimes into account. In such cases a non-parametric approach based on bootstrapping in-sample errors, proposed by Gamakumara et al. (2018), can be used. These two scenarios are now covered in detail.

4.1 Probabilistic Forecast Reconciliation in the Gaussian Framework

In the case where the base forecasts are probabilistic forecasts characterised by elliptical distributions, Gamakumara et al. (2018) show that reconciled probabilistic forecasts will also be elliptical. This is particularly straightforward for the Gaussian distribution which is completely characterised by two moments. Letting the base probabilistic forecasts be \(\mathcal {N}(\hat {\boldsymbol {y}}_{T+h|T}, \hat {\boldsymbol {\Sigma }}_{T+h|T})\), then the reconciled probabilistic forecasts will be \(\mathcal {N}(\boldsymbol {\tilde {y}}_{T+h|T}, \tilde {\boldsymbol {\Sigma }}_{T+h|T})\), where

$$\displaystyle \begin{aligned} \boldsymbol{\tilde{y}}_{T+h|T} &= \boldsymbol{S}\boldsymbol{G}\hat{\boldsymbol{y}}_{T+h|T} \end{aligned} $$
(21.5)
$$\displaystyle \begin{aligned} \text{and}\qquad {} \tilde{\boldsymbol{\Sigma}}_{T+h|T} &= \boldsymbol{S}\boldsymbol{G}\hat{\boldsymbol{\Sigma}}_{T+h|T}\boldsymbol{G}^{\prime}\boldsymbol{S}^{\prime}. \end{aligned} $$
(21.6)

There are several options for obtaining the base probabilistic forecasts and in particular the variance covariance matrix \(\hat {\boldsymbol {\Sigma }}\). One option is to fit multivariate models either level by level or for the hierarchy as a whole leading respectively to a \(\hat {\boldsymbol \Sigma }\) that is block diagonal or dense. Another option is to fit univariate models for each individual series in which case \(\hat {\boldsymbol {\Sigma }}\) is a diagonal matrix. A third option that we employ here is to obtain \(\hat {\boldsymbol {\Sigma }}\) using in-sample forecast errors, in a similar vein to how \(\hat {\boldsymbol {W}}_{1}\) is estimated in the MinT method. Here the same shrinkage estimator described in Sect. 21.3.2 is used. The reconciled probabilistic forecast will ultimately depend on the choice of G; the same choices of G matrices used in Sect. 21.3 can be used.
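Under the Gaussian assumption, reconciliation only requires transforming the first two moments as in (21.5) and (21.6). A minimal sketch, applicable to any of the G matrices of Sect. 21.3:

```python
import numpy as np

def reconcile_gaussian(S, G, mu_hat, Sigma_hat):
    """Reconcile a Gaussian base forecast N(mu_hat, Sigma_hat) via Eqs. (21.5)-(21.6)."""
    SG = S @ G
    mu_tilde = SG @ mu_hat                  # reconciled mean, Eq. (21.5)
    Sigma_tilde = SG @ Sigma_hat @ SG.T     # reconciled covariance, Eq. (21.6)
    return mu_tilde, Sigma_tilde
```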

4.2 Probabilistic Forecast Reconciliation in the Non-parametric Framework

In many applications, including macroeconomic forecasting, it may not be reasonable to assume Gaussian predictive distributions. Therefore, non-parametric approaches have been widely used for probabilistic forecasts in different disciplines. For example, ensemble forecasting in weather applications (Gneiting, 2005; Gneiting & Katzfuss, 2014; Gneiting, Stanberry, Grimit, Held, & Johnson, 2008), and bootstrap-based approaches (Manzan & Zerom, 2008; Vilar & Vilar, 2013). In macroeconomics, Cogley, Morozov, and Sargent (2005) discuss the importance of allowing for skewness in density forecasts and more recently Smith and Vahey (2016) discuss this issue in detail.

Due to these concerns, we employ the bootstrap method proposed by Gamakumara et al. (2018) that does not make parametric assumptions about the predictive distribution. An important result exploited by this method is that applying point forecast reconciliation to the draws from an incoherent base predictive distribution results in a sample from the reconciled predictive distribution. We summarise this process below; a code sketch of the key steps follows the list:

  1. 1.

    Fit univariate models to each series in the hierarchy over a training set from t = 1, …, T. Denote these models by M 1, …, M n.

  2. 2.

    Compute one-step-ahead in-sample forecast errors. Collect these into an n × T matrix \({\hat {\boldsymbol E}}=(\hat {\boldsymbol {e}}_1,\hat {\boldsymbol {e}}_2,\dots ,\hat {\boldsymbol {e}}_T)\), where the n-vector \(\hat {\boldsymbol {e}}_t={\boldsymbol {y}}_t-\hat {\boldsymbol {y}}_{t|t-1}\). Here, \(\hat {\boldsymbol {y}}_{t|t-1}\) is a vector of forecasts made for time t using information up to and including time t − 1. These are called in-sample forecasts since while they depend only on past values, information from the entire training sample is used to estimate the parameters for the models on which the forecasts are based.

  3. 3.

    Block bootstrap from \(\hat {\boldsymbol {E}}\); that is, choose H consecutive columns of \(\hat {{\boldsymbol E}}\) at random, repeating this process B times. Denote the n × H matrix obtained at iteration b as \(\hat {{\boldsymbol E}}^b\) for b = 1, …, B.

  4. 4.

    For all b, compute \(\hat {\boldsymbol \Upsilon }^b = \{\hat {\boldsymbol {\gamma }}^b_1,\ldots ,\hat {\boldsymbol {\gamma }}^b_n\}^{\prime }\in \mathbb{R}^{n\times H}\) with \(\hat {\gamma }^b_{i,h}=f(M_i, \hat {e}^b_{i,h})\), where f(⋅) simulates from the univariate model M i fitted in step 1, with the error replaced by the corresponding block-bootstrapped sample error \(\hat {e}^b_{i,h}\), the (i, h)th element of \(\hat {{\boldsymbol E}}^b\). That is, \(\hat {\boldsymbol {\gamma }}^b_{i}\) is a sample path simulated from the fitted model M i for the ith series. Each row of \(\hat {\boldsymbol \Upsilon }^b\) is a sample path of H forecasts for a single series. Each column of \(\hat {\boldsymbol \Upsilon }^b\) is a realisation from the joint predictive distribution at a particular horizon.

  5. 5.

    For each h = 1, …, H, select the hth column of \(\hat {\boldsymbol \Upsilon }^b\) from each bootstrap iteration b = 1, …, B and stack these to form an n × B matrix \(\hat {\boldsymbol {\Upsilon }}_{T+h|T}\).

  6. 6.

    For a given G matrix and for each h = 1, …, H, compute \(\tilde {\boldsymbol {\Upsilon }}_{T+h|T}={\boldsymbol S}{\boldsymbol G}\hat {\boldsymbol {\Upsilon }}_{T+h|T}\). Each column of \(\tilde {\boldsymbol \Upsilon }_{T+h|T}\) is a realisation from the joint h-step-ahead reconciled predictive distribution.
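A sketch of the bootstrap steps follows. Step 4 is model specific, so the simulation of sample paths is left to the fitted models; the snippet covers the block bootstrap of the error matrix (step 3) and the reconciliation of the draws (step 6), the point at which the result quoted above is used.

```python
import numpy as np

def block_bootstrap_errors(E, H, B, rng):
    """Step 3: draw B blocks of H consecutive columns from the n x T error matrix E."""
    n, T = E.shape
    starts = rng.integers(0, T - H + 1, size=B)
    return np.stack([E[:, s:s + H] for s in starts])     # B x n x H array

def reconcile_draws(S, G, Y_hat_draws):
    """Step 6: reconcile each draw from the base predictive distribution.

    Y_hat_draws is an n x B matrix whose columns are h-step-ahead draws
    (one per bootstrap replicate); the output columns are coherent draws."""
    return S @ (G @ Y_hat_draws)
```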

5 Australian GDP

In our empirical application we consider Gross Domestic Product (GDP) of Australia with quarterly data spanning the period 1984:Q4–2018:Q3. The Australian Bureau of Statistics (ABS) measures GDP using three main approaches, namely Production, Income, and Expenditure. The final GDP figure is obtained as an average of these three measures. Each of these measures is an aggregate of economic variables that are themselves targets of interest for the macroeconomic forecaster. This suggests that a hierarchical approach to forecasting could be used to improve forecasts of all series in the hierarchy, including headline GDP.

We concentrate on the Income and Expenditure approaches as nominal data are available only for these two. We restrict our attention to nominal data because real data are constructed via a chain price index approach with different price deflators used for each series. As a result, real GDP data are not coherent: the aggregate series is not a linear combination of the disaggregate series. For similar reasons we do not use seasonally adjusted data; the process of seasonal adjustment results in data that are not coherent. Finally, although there is a small statistical discrepancy between each series and the headline GDP figure, we simply treat this statistical discrepancy, which is also published by the ABS, as a time series in its own right. For further details on the data please refer to Australian Bureau of Statistics (2018).

5.1 Income Approach

Using the income approach, GDP is calculated by aggregating all income flows. In particular, GDP at purchaser’s price is the sum of all factor incomes and taxes, minus subsidies on production and imports (Australian Bureau of Statistics, 2015):

$$\displaystyle \begin{aligned} \textit{GDP} & = \textit{Gross operating surplus} + \textit{Gross mixed income} \\ & + \textit{Compensation of employees} \\ & + \textit{Taxes less subsidies on production and imports} \\ & + \textit{Statistical discrepancy (I)}. \end{aligned} $$

Figure 21.4 shows the full hierarchical structure capturing all components aggregated to form GDP using the income approach. The hierarchy has two levels of aggregation below the top-level, with a total of n = 16 series across the whole structure and m = 10 series at the bottom level.

Fig. 21.4
figure 4

Hierarchical structure of the income approach for GDP. The pink cell contains GDP the most aggregate series. The blue cells contain intermediate-level series and the yellow cells correspond to the most disaggregate bottom-level series

5.2 Expenditure Approach

In the expenditure approach, GDP is calculated as the aggregation of final consumption expenditure, gross fixed capital formation (GFCF), changes in inventories (of finished goods, work-in-progress, and raw materials), and the value of exports less imports of goods and services (Australian Bureau of Statistics, 2015). The underlying equation is:

$$\displaystyle \begin{aligned} \textit{GDP} & = \textit{Final consumption expenditure} + \textit{Gross fixed capital formation} \\ & + \textit{Changes in inventories} + \textit{Trade balance} + \textit{Statistical discrepancy (E)}. \end{aligned} $$

Figures 21.5, 21.6, and 21.7 show the full hierarchical structure capturing all components aggregated to form GDP using the expenditure approach. The hierarchy has three levels of aggregation below the top-level, with a total of n = 80 series across the whole structure and m = 53 series at the bottom level. Descriptions of each series in these hierarchies along with the series ID assigned by the ABS are given in the Tables 21.1, 21.2, 21.3, and 21.4 in the Appendix.

Table 21.1 Variables, series IDs and their descriptions for the income approach
Table 21.2 Variables, series IDs and their descriptions for expenditure approach
Table 21.3 Variables, series IDs and their descriptions for changes in inventories—expenditure approach
Table 21.4 Variables, series IDs and their descriptions for household final consumption—expenditure approach
Fig. 21.5
figure 5

Hierarchical structure of the expenditure approach for GDP. The pink cell contains GDP, the most aggregate series. The blue and purple cells contain intermediate-level series with the series in the purple cells further disaggregated in Figs. 21.6 and 21.7. The yellow cells contain the most disaggregate bottom-level series

Fig. 21.6
figure 6

Hierarchical structure for Gross Fixed Capital Formations under the expenditure approach for GDP, continued from Fig. 21.5. Blue cells contain intermediate-level series and the yellow cells correspond to the most disaggregate bottom-level series

Fig. 21.7
figure 7

Hierarchical structure for Household Final Consumption Expenditure under the expenditure approach for GDP, continued from Fig. 21.5. Blue cells contain intermediate-level series and the yellow cells correspond to the most disaggregate bottom-level series

Figure 21.8 displays time series from the income and expenditure approaches. The top panel shows the most aggregate GDP series. The panels below show series from lower levels for the income hierarchy (left panel) and the expenditure hierarchy (right panel). The plots show the diverse features of the time series with some displaying positive and others negative trending behaviour, some showing no trends but possibly a cycle, and some having a strong seasonal component. These highlight the need to account for and model all information and diverse signals from each series in the hierarchy, which can only be achieved through a forecast reconciliation approach.

Fig. 21.8
figure 8

Time plots for series from different levels of income and expenditure hierarchies

6 Empirical Application Methodology

We now demonstrate the potential for reconciliation methods to improve forecast accuracy for Australian GDP. We consider forecasts from h = 1 quarter ahead up to h = 4 quarters ahead using an expanding window. First, the training sample is set from 1984:Q4 to 1994:Q3 and forecasts are produced for 1994:Q4 to 1995:Q3. Then the training window is expanded by one quarter at a time, i.e., the final training window spans 1984:Q4 to 2017:Q4, with the final forecasts produced for the last available observation in 2018:Q1. This leads to 94 1-step-ahead, 93 2-steps-ahead, 92 3-steps-ahead, and 91 4-steps-ahead forecasts available for evaluation.
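The expanding-window design can be expressed as a loop over forecast origins; a small sketch, with function and variable names chosen here purely for illustration:

```python
import numpy as np

def expanding_window_indices(n_obs, first_train_size, max_h=4):
    """Yield (train_idx, test_idx) pairs for an expanding-window evaluation."""
    for origin in range(first_train_size, n_obs):
        h = min(max_h, n_obs - origin)            # horizons shrink near the sample end
        yield np.arange(origin), np.arange(origin, origin + h)
```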

6.1 Models

The first task in forecast reconciliation is to obtain base forecasts for all series in the hierarchy. In the case of the income approach, this necessitates forecasting n = 16 separate time series, while in the case of the expenditure approach, forecasts for n = 80 separate time series must be obtained. Given the diversity in these time series discussed in Sect. 21.5, we focus on an approach that is fast but also flexible. We consider simple univariate ARIMA models, where the model order is selected via a combination of unit root testing and the AIC using an algorithm developed by Hyndman, Koehler, Ord, and Snyder (2008) and implemented in the auto.arima() function in Hyndman, Lee, and Wang (2019). A similar approach was also undertaken using the ETS framework to produce base forecasts (Hyndman & Khandakar, 2008). Using ETS models to generate base forecasts had minimal impact on our conclusions with respect to forecast reconciliation methods, and in most cases ARIMA forecasts were found to be more accurate than ETS forecasts. Consequently, for brevity, we do not present the results for ETS models. However, these are available from github and are discussed in detail in Gamakumara (2019). We note that a number of more complicated approaches could have been used to obtain base forecasts, including multivariate models such as vector autoregressions, and models and methods that handle a large number of predictors such as factor models or least angle regression. However, Panagiotelis, Athanasopoulos, Hyndman, Jiang, and Vahid (2019) show that univariate ARIMA models are highly competitive for forecasting Australian GDP even compared to these methods, and in any case our primary motivation is to demonstrate the potential of forecast reconciliation.

The hierarchical forecasting approaches we consider are bottom-up, OLS, WLS with variance scaling and the MinT(Shrink) approach. The MinT(Sample) approach was also used but due to the size of the hierarchy, forecasts reconciled via this approach were less stable. Finally, all forecasts (both base and coherent) are compared to a seasonal naïve benchmark (Hyndman & Athanasopoulos, 2018); i.e., the forecast for GDP (or one of its components) is the realised GDP in the same quarter of the previous year. The naïve forecasts are by construction coherent and therefore do not need to be reconciled.

6.2 Evaluation

For evaluating point forecasts we consider two metrics, the Mean Squared Error (MSE) and the Mean Absolute Scaled Error (MASE) calculated over the expanding window. The absolute scaled error is defined as

$$\displaystyle \begin{aligned} q_{T+h} = \frac{|\breve{e}_{T+h|T}|}{(T-4)^{-1}\sum_{t=5}^{T}|y_t - y_{t-4}|}\,, \end{aligned}$$

where \(\breve{e}_{T+h|T}\) is the difference between any forecast (base or coherent) and the realisation, and the lag of 4 is used due to the quarterly nature of the data. An advantage of using MASE is that it is a scale-independent measure. This is particularly relevant for hierarchical time series, since aggregate series by their very nature are on a larger scale than disaggregate series. Consequently, scale-dependent metrics may unfairly favour methods that perform well for the aggregate series but poorly for disaggregate series. For more details on different point forecast accuracy measures, refer to Chapter 3 of Hyndman and Athanasopoulos (2018).
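A sketch of the scaled error defined above, with the seasonal lag fixed at 4 for quarterly data:

```python
import numpy as np

def scaled_errors(y_train, y_test, y_fcst, m=4):
    """Absolute scaled errors q_{T+h}; the scale is the in-sample seasonal-naive MAE."""
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.abs(y_test - y_fcst) / scale

# MASE for a series is the mean of these scaled errors over the evaluation period.
```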

Forecast accuracy of probabilistic forecasts can be evaluated using scoring rules (Gneiting & Katzfuss, 2014). Let F̆ be a probabilistic forecast and let y̆ ∼F̆ where a breve is again used to denote that either base forecasts or coherent forecasts can be evaluated. The accuracy of multivariate probabilistic forecasts will be measured by the energy score given by

$$\displaystyle \begin{aligned} eS(\breve{F}_{T+h|T},\boldsymbol{y}_{T+h}) = \text{E}_{\breve{F}}\|\breve{\boldsymbol{y}}_{T+h}-\boldsymbol{y}_{T+h}\|{}^\alpha -\frac{1}{2}\text{E}_{\breve{F}}\|\breve{\boldsymbol{y}}_{T+h}-\breve{\boldsymbol{y}}^*_{T+h}\|{}^\alpha\,,\end{aligned} $$

where y T+h is the realisation at time T + h, \(\breve{\boldsymbol{y}}_{T+h}\) and \(\breve{\boldsymbol{y}}^*_{T+h}\) are independent copies distributed according to \(\breve{F}_{T+h|T}\), and α ∈ (0, 2]. We set α = 1, noting that other values of α give similar results. The expectations can be evaluated numerically as long as a sample from F̆ is available, which is the case for all methods we employ. An advantage of using energy scores is that in the univariate case the energy score simplifies to the commonly used continuous ranked probability score (CRPS) given by

$$\displaystyle \begin{aligned} \text{CRPS}(\breve{F}_i,y_{i,T+h}) = \text{E}_{\breve{F}_i}|\breve{y}_{i,T+h}-y_{i,T+h}| - \frac{1}{2}\text{E}_{\breve{F}_i}|\breve{y}_{i,T+h}-\breve{y}^*_{i,T+h}|,\end{aligned} $$

where the subscript i is used to denote that CRPS measures forecast accuracy for a single variable in the hierarchy.
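Both scores can be approximated by Monte Carlo from a sample of draws from the predictive distribution, which is how they are used with the Gaussian and bootstrap approaches above. A sketch, assuming the draws for one horizon are stored column-wise:

```python
import numpy as np

def energy_score(draws, y_obs, alpha=1.0, seed=1):
    """Sample-based energy score: draws is n x B, y_obs is the realised n-vector."""
    B = draws.shape[1]
    term1 = np.mean(np.linalg.norm(draws - y_obs[:, None], axis=0) ** alpha)
    # Approximate the second expectation by pairing each draw with a permuted copy.
    perm = np.random.default_rng(seed).permutation(B)
    term2 = np.mean(np.linalg.norm(draws - draws[:, perm], axis=0) ** alpha)
    return term1 - 0.5 * term2

def crps(draws_i, y_i):
    """Univariate CRPS for a single series, estimated from a vector of draws."""
    term1 = np.mean(np.abs(draws_i - y_i))
    term2 = np.mean(np.abs(draws_i[:, None] - draws_i[None, :]))
    return term1 - 0.5 * term2
```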

Alternatives to the energy score were also considered, namely log scores and variogram scores. The log score was disregarded since Gamakumara et al. (2018) prove that the log score is improper with respect to the class of incoherent probabilistic forecasts when the true DGP is coherent. The variogram score gave similar results to the energy score; these results are omitted for brevity but are available from github and are discussed in detail in Gamakumara (2019).

7 Results

7.1 Base Forecasts

Due to the different features in each time series, a variety of ARIMA and seasonal ARIMA models were selected for generating base forecasts. For example, in the income hierarchy, some series required seasonal differencing while others did not. Furthermore, the AR orders vary from 0 to 3, the MA orders from 0 to 2, and their seasonal counterparts SAR from 0 to 2 and SMA from 0 to 1. Figure 21.9 compares the accuracy of the ARIMA base forecasts to the seasonal naïve forecasts over different forecast horizons. The panels on the left show results for the Income hierarchy while the panels on the right show the results for the Expenditure hierarchy. The top panels summarise results over all series in the hierarchy, i.e., we calculate the MSE for each series and then average over all series. The bottom panels show the results for the aggregate-level GDP.

Fig. 21.9
figure 9

Mean squared errors for naïve and ARIMA base forecasts. Top panels refer to results summarised over all series while bottom panels refer to results for the top-level GDP series. Left panels refer to the income hierarchy and right panels to the expenditure hierarchy

The clear result is that the base forecasts are more accurate than the naïve forecasts; however, as the forecast horizon increases, the differences become smaller. This is to be expected since the naïve model here is a seasonal random walk, and for horizons h < 4, forecasts from an ARIMA model are based on more recent information. Similar results are obtained when MASE is used as the metric for evaluating forecast accuracy.

One disadvantage of the base forecasts relative to the naïve forecasts is that base forecasts are not coherent. As such we now turn our attention to investigating whether reconciliation approaches can lead to further improvements in forecast accuracy relative to the base forecasts.

7.2 Point Forecast Reconciliation

We now turn our attention to evaluating the accuracy of point forecasts obtained using the different reconciliation approaches as well as the single-level bottom-up approach. All results in subsequent figures are presented as percentage changes in a forecasting metric relative to the base forecasts, a measure known in the forecasting literature as a skill score. Skill scores are computed such that positive values represent an improvement in forecast accuracy over the base forecasts while negative values represent a deterioration.
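Skill scores as used here reduce to a percentage improvement over the base forecasts; a one-line sketch:

```python
def skill_score(metric_method, metric_base):
    """Percentage improvement over base forecasts; positive means the method is better."""
    return 100.0 * (metric_base - metric_method) / metric_base
```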

Figures 21.10 and 21.11 show skill scores using MSE and MASE respectively. The top row of each figure shows skill scores based on averages over all series. We conclude that reconciliation methods generally improve forecast accuracy relative to base forecasts regardless of the hierarchy used, the forecasting horizon, the forecast error measure or the reconciliation method employed. We do, however, note that while all reconciliation methods improve forecast performance, MinT(Shrink) is the best forecasting method in most cases.

Fig. 21.10
figure 10

Skill scores for point forecasts from alternative methods (with reference to base forecasts) using MSE. The left panels refer to the income hierarchy while the right panels refer to the expenditure hierarchy. The first row refers to results summarised over all series, the second row to top-level GDP series, the third row to aggregate levels, and the last row to the bottom level

Fig. 21.11
figure 11

Skill scores for point forecasts from different reconciliation methods (with reference to base forecasts) using MASE. The left two panels refer to the income hierarchy and the right two panels to the expenditure hierarchy. The first row refers to results summarised over all series, the second row to top-level GDP series, the third row to aggregate levels, and the last row to the bottom level

To further investigate the results we break down the skill scores by different levels of each hierarchy. The second row of Figs. 21.10 and 21.11 shows the skill scores for a single series, namely GDP which represents the top-level of both hierarchies. The third row shows results for all series excluding those of the bottom level, while the final row shows results for the bottom-level series only. Here, we see two general features. The first is that OLS reconciliation performs poorly on the bottom-level series, and the second is that bottom-up performs relatively poorly on aggregate series. The two features are particularly exacerbated for the larger expenditure hierarchy. These results are consistent with other findings in the forecast reconciliation literature (see for instance Athanasopoulos et al., 2017; Wickramasuriya et al., 2018).

7.3 Probabilistic Forecast Reconciliation

We now turn our attention towards results for probabilistic forecasts. Figure 21.12 shows results for the energy score which as a multivariate score summarises forecast accuracy over the entire hierarchy. Once again all results are presented as skill scores relative to base forecasts. The top panels refer to results assuming Gaussian probabilistic forecasts as described in Sect. 21.4.1 while the bottom panels refer to the non-parametric bootstrap method described in Sect. 21.4.2. The left panels correspond to the income hierarchy while the right panels correspond to the expenditure hierarchy. For the income hierarchy, all methods improve upon base forecasts at all horizons. In nearly all cases the best performing reconciliation method is MinT(Shrink), a notable result since the optimal properties for MinT have thus far only been established theoretically in the point forecasting case. For the larger expenditure hierarchy results are a little more mixed. While bottom-up tends to perform poorly, all reconciliation methods improve upon base forecasts (with the single exception of MinT(Shrink) in the Gaussian framework four quarters ahead). Interestingly, OLS performs best under the assumption of Gaussianity—this may indicate that OLS is a more robust method under model misspecification but further investigation is required.

Fig. 21.12
figure 12

Skill scores for multivariate probabilistic forecasts from different reconciliation methods (with reference to base forecasts) using energy scores. The top panels refer to the results for the Gaussian approach and the bottom panels to the non-parametric bootstrap approach. Left panels refer to the income hierarchy and right panels to the expenditure hierarchy

Finally, Fig. 21.13 displays the skill scores based on the continuous ranked probability score for a single series, namely top-level GDP. The poor performance of bottom-up reconciliation, driven by its failure to forecast the aggregate series accurately, is apparent here.

Fig. 21.13
figure 13

Skill scores for probabilistic forecasts of top-level GDP from different reconciliation methods (with reference to base forecasts) using CRPS. Top panels refer to the results for Gaussian approach and bottom panels refer to the non-parametric bootstrap approach. The left panel refers to the income hierarchy and the right panel to the expenditure hierarchy

8 Conclusions

In the macroeconomic setting, we have demonstrated the potential for forecast reconciliation methods to not only provide coherent forecasts, but to also improve overall forecast accuracy. This result holds for both point forecasts and probabilistic forecasts, for the two different hierarchies we consider and over different forecasting horizons. Even where the objective is to only forecast a single series, for instance top-level GDP, the application of forecast reconciliation methods improves forecast accuracy.

By comparing results from different forecast reconciliation techniques we draw a number of conclusions. Despite its simplicity, the single-level bottom-up approach can perform poorly at more aggregated levels of the hierarchy. Meanwhile, when forecast accuracy at the bottom level is evaluated, OLS tends to break down in some instances. Overall, the WLS and MinT(Shrink) methods (and particularly the latter) tend to yield the highest improvements in forecast accuracy. Similar results can be found in both simulations and the empirical studies of Athanasopoulos et al. (2017) and Wickramasuriya et al. (2018).

There are a number of open avenues for research in the literature on forecast reconciliation, some of which are particularly relevant to macroeconomic applications. First there is scope to consider more complex aggregation structures, for instance in addition to the hierarchies we have already considered, data on GDP and GDP components disaggregated along geographical lines are also available. This leads to a grouped aggregation structure. Also, given the substantial literature on the optimal frequency at which to analyse macroeconomic data, a study on forecasting GDP or other variables as a temporal hierarchy may be of interest. In this chapter we have only shown that reconciliation methods can be used to improve forecast accuracy when univariate ARIMA models are used to produce base forecasts. It will be interesting to evaluate whether such results hold when a multivariate approach, e.g., a Bayesian VAR or dynamic factor model, is used to generate base forecasts, or whether the gains from forecast reconciliation would be more modest. Finally, a current limitation of the forecast reconciliation literature is that it only applies to collections of time series that adhere to linear constraints. In macroeconomics there are many examples of data that adhere to non-linear constraints, for instance real GDP is a complicated but deterministic function of GDP components and price deflators. The extension of forecast reconciliation methods to non-linear constraints potentially holds great promise for continued improvement in macroeconomic forecasting.