
1 Introduction

Let us consider two scalar stochastic processes \(x_t\) and \(y_t\), \(t \in \mathbb{Z}\), each observed for \(T\) realizations. We assume that \(x_t\) and \(y_t\) are covariance stationary or that \(\Delta x_t\) and \(\Delta y_t\) are covariance stationary. Most time series observed in macroeconomics, for example, belong to this class of processes (see e.g. [29]). If we exclude the possibility that the future can cause the past, but allow contemporaneous feedback loops due, for example, to temporal aggregation, there are several possibilities as regards the causal structure between \(x_t\) and \(y_t\), which we list below. We denote causal relationships (Footnote 1) with directed edges (→), following the graphical causal models terminology [64].

  (i) The series \(x_t\) has a contemporaneous or lagged causal effect on \(y_t\), i.e. \(x_i \to y_{i+s}\) for some \(i, s\) such that \(i \ge 0\), \(s \ge 0\).

  (ii) The series \(y_t\) has a contemporaneous or lagged causal effect on \(x_t\), i.e. \(y_i \to x_{i+s}\) for some \(i, s\) such that \(i \ge 0\), \(s \ge 0\).

  (iii) An unmeasured series \(z_t\) has a contemporaneous or lagged causal effect on both \(x_t\) and \(y_t\).

  (iv) The causal structure between \(x_t\) and \(y_t\) can be described by any combination of (i)–(iii).

  (v) There is no causal link or path (of any type) linking \(x_t\) and \(y_{t+s}\), for any \(s \in \mathbb{N}\).

In principle, other, more intricate, causal structures are possible between \(x_t\) and \(y_t\). For example, the data generating process may run at a frequency different from the frequency of data collection, so that there are hidden causal structures between the observed variables. This class of structures has been considered in the literature on temporal aggregation in econometrics (see e.g. [17, 18, 50]) and in the literature on subsampling in machine learning (see [10, 36]), but will not be further discussed in this paper. We also limit our discussion to structures in which the variables are well-defined (i.e. they are not aggregates of variables with diverse causal roles) and the causal structures are time invariant: if \(x_i \to w_{i+s}\) for a given \(s \in \mathbb{Z}\), then this holds for all \(i \in \mathbb{Z}\), where \(w\) can be any variable (including \(x\) itself). We will also typically assume that each observed series \(w_t\) is directly causally influenced by its own past, up to a certain lag, and that each variable at each time unit is affected, in an additive manner, by one or more independent shocks. In other words, we focus on additive noise models.

The causal structure between two time series can be represented by a causal graph consisting of nodes for \(x_t, \ldots, x_{t-p}, y_t, \ldots, y_{t-p}\), where \(p\) is the largest lag by which \(x_t\) or \(y_t\) can be directly causally influenced. Using the terminology proposed by Chu and Glymour [7], this graph is called a unit causal graph. Examples of unit causal graphs are shown in Figs. 5.1 and 5.2, for \(p = 2\). Figure 5.1 represents the case in which (i) is true, while Fig. 5.2 represents the case in which (iii) is true. Chu and Glymour [7] note that a unit causal graph can be extended to a repetitive causal graph (not shown), including the variables \(x_t\) and \(y_t\) at a potentially infinite number of time units. The repetitive causal graph corresponding to the unit causal graph of Fig. 5.1, for example, would include nodes for \(x_{t-3}, x_{t-4}, \ldots\), for \(y_{t-3}, y_{t-4}, \ldots\), and directed edges from \(x_{t-s}\) to \(y_{t-s}\), as well as \(x_{t-s-2} \to y_{t-s}\) and \(x_{t-s-1} \to y_{t-s}\), for any \(s \in \mathbb{Z}\).

Fig. 5.1 Unit causal graph for bi-variate time series with both lagged and contemporaneous effects

Fig. 5.2 Unit causal graph for bi-variate time series with a latent series \(z_t\)

How do we detect which of the five cases (i)–(v) listed above is true? How do we learn which causal graph best represents the data generating process? How do we learn to what extent an intervention on one variable at time \(t\) propagates to all the variables at time \(t + h\), for any \(h > 0\)? These are the typical questions that concern, for example, the applied macro-econometrician. In this paper we discuss possible ways to address these questions. We review methods that are able to distinguish among different causal structures, under different assumptions.

Some causal discovery methods developed for i.i.d. data cannot be applied, without further modification, to the time series setting, because, even in the simple setting of causal pairs, there may be causal relationships with different effects at different lags. Furthermore, the autocorrelation (or self-dependence) structure underlying the data complicates standard statistical inference, reducing the efficiency of simple regression estimation or conditional independence testing [26]. Nevertheless, the time series setting is not necessarily a curse, and is actually a blessing in specific contexts of causal inference. Indeed, if one accepts the assumption that the future cannot cause the past (whose acceptance in economics requires carefully taking expectational variables into account, see [33]), exploiting the arrow of time allows one to solve many orientation problems, i.e. problems where it is known that there is a causal dependence between two variables, but not its direction. Moreover, in the case of causal pairs, the possibility of observing past values of the variables allows us to condition on more than two variables, which is not possible in the context of i.i.d. causal pairs.

We should also note the following: if the framework is that of a causal time-series pair in which only one direction of causal influence is admitted, either \(x_t \to y_s\) (for one or more values of \(s\) such that \(s \ge t\)) or \(y_t \to x_s\) (\(s \ge t\)), and one is only interested in the "summary graph" [58], i.e. in ascertaining whether \(x\) causes \(y\) or \(y\) causes \(x\) at any time unit, then the problem can be solved in a relatively easy fashion in many settings. Using a simple regression analysis, it suffices to regress \(x_t\) on lagged values of itself and of the other variable, and likewise to regress \(y_t\) on lagged values of itself and of the other variable. Since all the covariates in the two regressions are pre-determined, there are no endogeneity problems here, and the error terms will be independent of the regressors. Therefore, by simply testing the hypothesis of a non-zero statistical influence of one lagged variable (e.g. \(x_{t-1}\)) on another (e.g. \(y_t\)) and the hypothesis of a zero statistical influence in the symmetric regression (e.g. of \(y_{t-1}\) on \(x_t\)), we will be able to detect a genuine causal influence (at some unknown time unit) from one variable to another (e.g. from \(x\) to \(y\)). This framework is identical to the vector autoregressive framework that we will discuss below, and it is also related to an interpretation of the Granger non-causality test that we will also discuss below. Notice, however, that in many fields, like economics, the assumption of causality running in only one direction between time series, without the possibility of a feedback at a different time unit, is a toy example with very poor empirical applicability. This is why our discussion framework will be larger, including the possibility of structures like \(y_{t-1} \to x_t \to y_t\).

In reviewing different methods, we distinguish between methods that filter the series through a vector autoregressive model (Sect. 5.2) and methods that apply causal search directly to time series data (Sect. 5.3).

2 Vector-Autoregressive Framework

2.1 The VAR Model

One of the most popular approaches to identify dynamic causal effects in time series econometrics is structural vector autoregressive (VAR) analysis. Structural VAR analysis is based on the assumption that the statistical properties of a data generating process can be well approximated by a reduced-form VAR model.

Let us consider a vector \(Y_t\) of \(k\) time series variables. For example, \(Y_t = (x_t, y_t)\), in which case \(k = 2\). We assume that \(Y_t\) follows a stochastic process that can be well approximated by a linear VAR process of the form

$$\displaystyle \begin{aligned} Y_t = \mu + A_1 Y_{t-1} + \cdots + A_p Y_{t-p} + u_t, \end{aligned} $$
(5.1)

where \(\mu\) is a \(k \times 1\) vector of constants, \(A_i\) (\(i = 1, \ldots, p\)) is a \(k \times k\) matrix, and \(u_t\) is a \(k \times 1\) vector of white noise, whose elements are referred to as reduced-form residuals. Each element of \(u_t\) is in turn assumed to be a linear combination of latent structural shocks, \(\epsilon_{1t}, \epsilon_{2t}, \ldots\), which are the sources of variation of the system. In macroeconomics these shocks have special meanings, such as the productivity shock, the monetary policy shock, the fiscal policy shock, etc. It is standard in the VAR literature to assume that the number of shocks is equal to the number of measured variables. Another usual assumption is that \(\epsilon_{1t}, \ldots, \epsilon_{kt}\) are mutually independent, although orthogonality is sufficient in many applications. Thus we have:

$$\displaystyle \begin{aligned} u_t = B \varepsilon_t, \end{aligned} $$
(5.2)

where \(B\) is a \(k \times k\) invertible matrix (the impact or mixing matrix) and \(\varepsilon_t = (\epsilon_{1t}, \ldots, \epsilon_{kt})\) is a vector of independent shocks. Let \(W = B^{-1}\). Pre-multiplying Eq. (5.1) by \(W\) we get the structural VAR form:

$$\displaystyle \begin{aligned} W Y_t = \mu' + \varGamma_1 Y_{t-1} + \cdots + \varGamma_p Y_{t-p} + \varepsilon_t, \end{aligned} $$
(5.3)

where \(\mu' = W\mu\) and \(\Gamma_i = W A_i\) for \(i = 1, \ldots, p\). From Eq. (5.3) it is evident that the matrix \(W\) incorporates information about the contemporaneous causal structure, while the matrices \(\Gamma_i\) incorporate information about the lagged causal structure. Since Sims [62], econometricians have focused their attention on the identification of the effects of \(\varepsilon_t\) on \(Y_t\) over time. These effects are called impulse response functions and will be discussed in Sect. 5.2.4.

Since Eq. (5.3) cannot be directly estimated because of endogeneity problems, the idea of VAR analysis is to follow a two-step procedure: first, Eq. (5.1) is estimated through standard regression methods, which yields an estimate of the reduced-form residuals \(u_t\). Second, the parameters of Eq. (5.3) (in particular the coefficients entering \(W\) and \(\Gamma_i\)) are recovered by analyzing the relationships among the elements of \(u_t\), which, under some conditions, allows identifying the matrix \(B\) entering Eq. (5.2). Notice that, having estimated (5.1), knowing \(B\) is sufficient for identifying (5.3).

For example, Swanson and Granger [68], Bessler and Lee [3], Demiralp and Hoover [11], and Moneta [52] propose a two-step identification method, consisting in first estimating the reduced-form VAR residuals, and then applying to the estimated \(u_t\) (which should share the characteristics of i.i.d. data) conditional independence tests, in the spirit of a causal search based on graphical causal models [64]. This allows them to find out which entries of \(B\) are zero.
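To fix ideas, here is a minimal Python sketch of the two-step logic, assuming the statsmodels and scipy packages; the simulated data generating process, the parameter values, and the use of a simple correlation test as a stand-in for a full (conditional) independence test are all illustrative assumptions, not part of the methods cited above.

```python
import numpy as np
from scipy import stats
from statsmodels.tsa.api import VAR

# Step 0 (illustration only): simulate a bivariate VAR(1) with a
# contemporaneous effect x_t -> y_t, encoded in the mixing matrix B.
rng = np.random.default_rng(0)
T = 500
eps = rng.standard_t(df=5, size=(T, 2))        # non-Gaussian structural shocks
B = np.array([[1.0, 0.0],
              [0.8, 1.0]])                     # u_t = B eps_t, as in Eq. (5.2)
A1 = np.array([[0.5, 0.0],
               [0.3, 0.4]])
Y = np.zeros((T, 2))
for t in range(1, T):
    Y[t] = A1 @ Y[t - 1] + B @ eps[t]

# Step 1: estimate the reduced-form VAR, Eq. (5.1), by least squares.
res = VAR(Y).fit(maxlags=1)
u = res.resid                                  # estimated reduced-form residuals

# Step 2: for k = 2, probe the dependence between u_1t and u_2t; zero
# correlation is necessary (though not sufficient) for independence.
r, pval = stats.pearsonr(u[:, 0], u[:, 1])
print(f"corr(u1, u2) = {r:.3f}, p-value = {pval:.3g}")
```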

For \(k = 2\), as is the case of causal pairs, independence tests between \(u_{1t}\) and \(u_{2t}\) can only discriminate between the presence and the absence of a causal link between the contemporaneous variables, but are not of any help in finding causal directions. In other words, they find zero entries in \(B\) only in the case in which \(u_{1t}\) and \(u_{2t}\) are mutually independent (corresponding to the absence of contemporaneous causal relations).

2.2 ICA-Based Identification

An alternative method to identify \(B\) in the same two-step framework is to apply Independent Component Analysis (ICA) to the estimated reduced-form residuals \(u_t\). Since, as shown in (5.2), \(u_t = B\varepsilon_t\), it is possible to apply ICA to recover the coefficients that linearly mix the elements of \(\varepsilon_t\) to produce \(u_t\) [9, 37, 39]. ICA has been applied to a VAR setting by Hyvärinen et al. [40], Moneta et al. [53], and Gouriéroux et al. [22], among others.

ICA is based on a theorem (see [9, Th. 11], [15, Th. 3], [22, p. 112]) according to which, if \(B\) is invertible and the components of \(\varepsilon_t\) (\(\epsilon_{1t}, \ldots, \epsilon_{kt}\)) are independent, with at most one of them Gaussian, then the matrix \(B\) is identifiable up to post-multiplication by \(DP\), where \(P\) is a permutation matrix and \(D\) a diagonal matrix with non-zero diagonal elements.

There are many ICA approaches to estimate the mixing matrix \(B\) (cf. [39] for an overview), the most popular of which are the FastICA algorithm [38], which is based on the minimization of mutual information and the maximization of negentropy; the JADE algorithm [5], which maximizes a measure of non-Gaussianity based on fourth moments; and the product density ICA algorithm [28], which is based on the maximum likelihood principle. Alternative approaches have also been recently proposed in econometrics, e.g. the distance covariance approach by Matteson and Tsay [51], the Cramér-von Mises distance approach by Herwartz [30], the maximum likelihood approach by Lanne et al. [48], and the pseudo-ML approach by Gouriéroux et al. [22].

Assuming that \(B\) is invertible implies that each observed variable \(u_{it}\) is affected by at least one shock \(\epsilon_{it}\) and that each \(\epsilon_{it}\) influences at least one variable. In other words, there is always a column permutation of the mixing matrix \(\tilde{B}\) output by ICA such that all the elements on the main diagonal are significantly different from zero. This assumption is in tune with the standard VAR framework.

In the case of causal pairs (\(k = 2\)), with matrix \(B\) of dimension \(2 \times 2\), it is therefore very useful to test which entries of \(B\) are significantly close to zero and to check their row position. The significance test can be done with a bootstrap procedure, performing a nonparametric quantile test in order to decide whether 0 is an outlier, as proposed by Lacerda et al. [47]. Alternatively, one can test a zero restriction in \(B\) by exploiting the asymptotic distribution of the pseudo-ML estimator of \(B\), as proposed by Gouriéroux et al. [22].

Let us continue to assume that \(Y_t = (x_t, y_t)\). On the basis of tests of zero restrictions in \(B\), one can distinguish among four different cases. (1) If there is only one zero entry in \(B\) and it lies in the first row, the first element of \(u_t\), which we call \(u_{xt}\), is affected by only one shock, while the second element of \(u_t\), which we call \(u_{yt}\), is affected by both shocks. This means that \(x_t\) causes \(y_t\). (2) Symmetrically, if the only zero entry of \(B\) lies in the second row, \(y_t\) causes \(x_t\). (3) If there are two zero entries in \(B\), which, by construction, must lie either on its main diagonal or on its anti-diagonal, then \(x_t\) and \(y_t\) are not (contemporaneously) causally related. (4) If there are no zero entries in \(B\), other structures are possible: there could be a feedback loop between \(x_t\) and \(y_t\), or a latent variable \(z_t\) affecting both \(x_t\) and \(y_t\), possibly in combination with causal relationships between \(x_t\) and \(y_t\).
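As an illustration, the following sketch applies FastICA (from scikit-learn) to reduced-form residuals `u`, such as those estimated in the snippet above, and flags candidate zero entries of \(B\); the permutation step and the crude threshold used in place of the bootstrap quantile test of Lacerda et al. [47] are simplifying assumptions.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Recover a candidate mixing matrix B of Eq. (5.2) from the residuals u.
# (Older scikit-learn versions use whiten=True instead of "unit-variance".)
ica = FastICA(n_components=2, whiten="unit-variance", random_state=0)
shocks = ica.fit_transform(u)            # estimated shocks, up to scale/order
B_hat = ica.mixing_                      # estimated B: u_t ~ B_hat @ eps_t

# ICA identifies B only up to column permutation and scaling (DP); pick
# the column permutation whose main diagonal is farthest from zero.
P_swap = np.array([[0.0, 1.0], [1.0, 0.0]])
B_hat = max([B_hat, B_hat @ P_swap],
            key=lambda M: np.abs(np.diag(M)).min())

# Crude zero test (a bootstrap quantile test would be used in practice):
# an entry is flagged as "zero" if it is small relative to the largest entry.
is_zero = np.abs(B_hat) < 0.1 * np.abs(B_hat).max()
print(B_hat)
print(is_zero)   # a single zero in row 1 -> x causes y; in row 2 -> y causes x
```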

If there is a latent variable \(z_t\), the shocks affecting the system are potentially three, while the observed variables are still two. Attempting to identify the structural model would bring us outside the VAR framework. It is worth noting, however, that the ICA framework has been extended to cases where the number of sources is greater than the number of mixtures (overcomplete ICA; see [39, ch. 16]). The identification of the rectangular mixing matrix potentially allows distinguishing between the case of a feedback loop between \(x_t\) and \(y_t\) (two shocks affecting the system) and the case of a latent variable (three shocks affecting the system, with at least one idiosyncratic shock).

If it is known that the structural model has a recursive contemporaneous structure, that is, either \(x_t\) causes \(y_t\) or \(y_t\) causes \(x_t\) (equivalently, there is a permutation of the matrices \(B\) and \(W\) that makes them lower triangular), then a valid and efficient alternative to the zero-coefficient tests suggested above is to perform a LiNGAM (short for Linear Non-Gaussian Acyclic Model) analysis, as proposed by Shimizu et al. [61]. LiNGAM is an algorithm that incorporates ICA in a first step and then searches for the row permutation of the unmixing matrix \(W\) that yields a lower triangular matrix. Lacerda et al. [47] propose an extension of this algorithm to the cyclic case (in which feedback loops are allowed), called LiNG. Hoyer et al. [34] propose an extension of the basic LiNGAM to the case in which latent common causes are allowed, called LvLiNGAM.

2.3 Nonlinear Framework

The standard VAR framework, as proposed in the econometric literature, is a linear model. In economics and in many other fields, however, there is no compelling substantive reason why a variable should depend only linearly on current values of other variables and on past values of itself and of other variables. Thus, a class of nonlinear structural VAR models has been proposed (see [44, ch. 18]) that allows nonlinear dependence among the measured time series but keeps an additive white noise error term. In this case, we can apply a two-step identification procedure similar to the linear case: in a first step one estimates a reduced-form nonlinear VAR model, and in a second step one extracts from the estimated additive errors the information needed to recover the structural VAR model. A general nonlinear VAR model with additive errors can be written as:

$$\displaystyle \begin{aligned} Y_t = F_t (Y_{t-1}, \ldots, Y_{t-p}) + u_t, \end{aligned} $$
(5.4)

where the nonlinear function \(F_t(\cdot)\) may depend on \(t\). Most nonlinear VAR models considered in the econometric literature deal with time-varying coefficients (see e.g. [59]), which are able to capture very general nonlinear dynamics while keeping the mixing structure between reduced-form and structural residuals linear.

We do not review this literature here (see [27, 43], and references therein). Rather, we point out a method to identify the contemporaneous causal direction that exploits the nonlinear dependence among the variables and is based on two assumptions: (i) there is a contemporaneous, nonlinear causal relationship between \(x_t\) and \(y_t\) in only one direction (either \(x_t \to y_t\) or \(y_t \to x_t\)); (ii) the structural form model can be written as \(Y_t = F(Y_{t-1}, \ldots, Y_{t-p}) + G(Y_t) + \varepsilon_t\), where \(F(\cdot)\) and \(G(\cdot)\) are two (possibly nonlinear) functions and the elements of \(\varepsilon_t\) are mutually independent shocks.

The method follows a two-step procedure, as is typical of a VAR-based approach. In the first step, the lagged effects are filtered out through nonlinear or nonparametric estimates of the regressions \(x_t = f_1(x_{t-1}, \ldots, x_{t-p}, y_{t-1}, \ldots, y_{t-p}) + u_{1t}\) and \(y_t = f_2(x_{t-1}, \ldots, x_{t-p}, y_{t-1}, \ldots, y_{t-p}) + u_{2t}\), in order to obtain estimates of \(u_{1t}\) and \(u_{2t}\). In the second step, the contemporaneous causal direction is detected through a nonlinear additive noise model (see [35, 58]). Indeed, if the contemporaneous causal relation is \(x_t \to y_t\), we will have

$$\displaystyle \begin{aligned} u_{2t} = f_y(u_{1t}) + N_y \end{aligned} $$
(5.5)

where \(N_y\) is an unobserved noise term with \(N_y \perp\!\!\!\perp u_{1t}\). Likewise, if the contemporaneous causal relation is \(y_t \to x_t\),

$$\displaystyle \begin{aligned} u_{1t} = f_x(u_{2t}) + N_x, \end{aligned} $$
(5.6)

where \(N_x\) is an unobserved noise term with \(N_x \perp\!\!\!\perp u_{2t}\).

Thus, once \(u_{1t}\) and \(u_{2t}\) are estimated through a nonlinear or nonparametric VAR model, one regresses them on each other using a nonparametric estimator and obtains estimates of \(N_x\) and \(N_y\). If, on the basis of a nonparametric independence test (see e.g. [25]), the independence between \(N_y\) and \(u_{1t}\) is not rejected, while the independence between \(N_x\) and \(u_{2t}\) is rejected, one infers \(x_t \to y_t\). If the independence between \(N_x\) and \(u_{2t}\) is not rejected, while the independence between \(N_y\) and \(u_{1t}\) is rejected, one infers \(y_t \to x_t\).
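A possible implementation of this second step is sketched below, assuming Python with scikit-learn: kernel ridge regressions play the role of the nonparametric estimator, and a small permutation-based HSIC-type statistic stands in for the nonparametric independence test of [25]; all names and tuning constants are illustrative.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def hsic_pvalue(a, b, n_perm=200, rng=None):
    """Permutation p-value for an HSIC-style independence statistic
    with Gaussian kernels (median heuristic bandwidth)."""
    rng = rng or np.random.default_rng(0)
    a, b = a.reshape(-1, 1), b.reshape(-1, 1)
    def gram(v):
        d2 = (v - v.T) ** 2
        return np.exp(-d2 / np.median(d2[d2 > 0]))
    n = len(a)
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = gram(a), gram(b)
    stat = np.trace(K @ H @ L @ H) / n**2
    null = [np.trace(K @ H @ L[np.ix_(p, p)] @ H) / n**2
            for p in (rng.permutation(n) for _ in range(n_perm))]
    return np.mean([s >= stat for s in null])

def anm_direction(u1, u2, alpha=0.05):
    """Second step: regress the estimated residuals on each other and
    test independence of the candidate noise from the putative cause."""
    f_y = KernelRidge(kernel="rbf", alpha=0.1).fit(u1.reshape(-1, 1), u2)
    N_y = u2 - f_y.predict(u1.reshape(-1, 1))    # candidate x_t -> y_t
    f_x = KernelRidge(kernel="rbf", alpha=0.1).fit(u2.reshape(-1, 1), u1)
    N_x = u1 - f_x.predict(u2.reshape(-1, 1))    # candidate y_t -> x_t
    p_xy = hsic_pvalue(N_y, u1)                  # H0: N_y independent of u1
    p_yx = hsic_pvalue(N_x, u2)                  # H0: N_x independent of u2
    if p_xy > alpha >= p_yx:
        return "x -> y"
    if p_yx > alpha >= p_xy:
        return "y -> x"
    return "undecided"
```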

2.4 Impulse Response Functions

Having identified the mixing matrix \(B\) and the structural shocks \(\varepsilon_t\), econometricians are mostly interested in the responses over time of each element of \(Y_t = (x_t, y_t)\) to a one-time impulse in each element of \(\varepsilon_t = (\epsilon_{1t}, \epsilon_{2t})\). These impulse response functions are defined [44, p. 110] as:

$$\displaystyle \begin{aligned} \frac{\partial Y_{t+i}}{\partial \varepsilon^{\prime}_t} =\varTheta_i \; \; \; \; \; i = 0,1,2, \ldots, H, \end{aligned} $$
(5.7)

where, in the case of two variables, \(\Theta_i\) is a \(2 \times 2\) matrix whose four elements are \(\frac{\partial x_{t+i}}{\partial \epsilon_{1t}}\) and \(\frac{\partial y_{t+i}}{\partial \epsilon_{1t}}\) (first column), and \(\frac{\partial x_{t+i}}{\partial \epsilon_{2t}}\) and \(\frac{\partial y_{t+i}}{\partial \epsilon_{2t}}\) (second column).

Consider, for simplicity, a linear VAR model with one lag (p=1) and no intercept:

$$\displaystyle \begin{aligned} Y_t = A_1 Y_{t-1} + u_t. \end{aligned} $$
(5.8)

By recursive substitution this can be written as:

$$\displaystyle \begin{aligned} Y_{t+i} = A_1^{i+1} Y_{t-1} + \sum_{j=0}^{i}A_1^j u_{t+i-j}. \end{aligned} $$
(5.9)

The responses of \(Y_t\) to reduced-form errors (also referred to as forecast errors) \(i\) periods ago (Footnote 2) are then captured by the matrix \(\Phi_i = A_1^i\). If \(Y_t\) is a stable process (all eigenvalues of \(A_1\) have modulus less than 1), i.e. each element of \(Y_t\) is covariance stationary, Eq. (5.8) can be equivalently expressed according to the moving average (MA) representation (Wold decomposition):

$$\displaystyle \begin{aligned} Y_t = \sum_{i=0}^{\infty}\varPhi_i u_{t-i}, \end{aligned} $$
(5.10)

where \(\Phi_i\) is calculated as above (for the one-lag case), with \(\Phi_0 = I\). From Eqs. (5.10), (5.2), and (5.7) it follows that

$$\displaystyle \begin{aligned} Y_t = \sum_{i=0}^{\infty} \varPhi_i B B^{-1} u_{t-i} = \sum_{i=0}^{\infty} \varPhi_i B \varepsilon_{t-i} = \sum_{i=0}^{\infty} \varTheta_i \varepsilon_{t-i}. \end{aligned} $$
(5.11)

If the VAR is not stable, the infinite-order Wold representation is not available, but the same approach to calculating \(\Phi_i\) and \(\Theta_i\) still works, because Eq. (5.9) does not depend on stationarity. In the case of an unstable process, the impulse response functions are not tied to the MA representation and will not converge to zero as \(i \to \infty\). In particular, if \(\Delta x_t\) is stationary, the impulse response function of \(\Delta x_t\) will converge to a finite number.

This framework for calculating impulse response functions can be easily extended to the case of more lags using a "companion matrix" representation (see [44, p. 25]) and is not substantively affected by the presence of a constant in (5.8). However, it cannot be applied to nonlinear VAR models, due to its reliance on Eq. (5.9).
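For the one-lag case, the mapping from Eq. (5.11) to code is direct; the sketch below assumes that \(A_1\) and \(B\) have already been estimated (for example via the VAR and ICA steps illustrated earlier).

```python
import numpy as np

def structural_irf(A1, B, horizon):
    """Structural impulse responses Theta_i = Phi_i B = A1^i B, Eq. (5.11)."""
    Theta = []
    Phi = np.eye(A1.shape[0])        # Phi_0 = I
    for _ in range(horizon + 1):
        Theta.append(Phi @ B)        # Theta_i = Phi_i B
        Phi = A1 @ Phi               # next power: Phi_{i+1} = A1^{i+1}
    return np.array(Theta)           # shape (horizon+1, k, k)

# Example: structural_irf(A1, B, 10)[i][0, 1] is the response of x_{t+i}
# to a unit impulse in eps_2t.
```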

Thus, structural impulse responses in a nonlinear setting are defined in an alternative manner, using the concept of conditional expectation [44, 45, p. 615]. Denoting by \(\Omega_{t-1}\) the information set available at date \(t - 1\) and by \(\delta\) the magnitude of the impulse whose response one wants to study (e.g. \(\delta =\) the standard deviation of \(\epsilon_{1t}\)), the structural response of \(x_{t+i}\) to the structural shock \(\epsilon_{1t}\) is defined as

$$\displaystyle \begin{aligned} I_x(i,\delta, \varOmega_{t-1}) = \mathbb{E} (x_{t+i}| \epsilon_{1t}= \delta, \varOmega_{t-1}) - \mathbb{E} (x_{t+i} | \varOmega_{t-1}) \; \; \; i = 0, \ldots, H. \end{aligned} $$
(5.12)

Having estimated a nonlinear reduced-form VAR model (5.4) and having recovered the structural shocks (for example on the basis of the additive noise model framework; see the end of Sect. 5.2.3), one can evaluate (5.12) using a Monte Carlo procedure [44, pp. 615–616]. In this procedure, one simulates two time paths: in the first path, the shock of interest is set at time 0 to a particular value \(\delta\) and the subsequent realizations of the variables of interest are computed; in the second path, the value of the shock of interest is drawn from an empirically estimated marginal distribution. Eq. (5.12) is then estimated by subtracting the average outcome of the second path from that of the first.
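The following sketch illustrates this Monte Carlo procedure for a one-lag model \(Y_t = F(Y_{t-1}) + B\varepsilon_t\); the estimated function `F`, the mixing matrix `B`, and the empirical shock sample `eps` are assumed to come from a previously estimated model, and the averaging scheme is a simplified version of the procedure in [44].

```python
import numpy as np

def nonlinear_irf(F, B, eps, y0, delta, horizon, n_sim=1000, seed=0):
    """Monte Carlo estimate of Eq. (5.12) for Y_t = F(Y_{t-1}) + B eps_t."""
    rng = np.random.default_rng(seed)
    diff = np.zeros((n_sim, horizon + 1, len(y0)))
    for m in range(n_sim):
        # Draw future shocks from the empirical shock distribution.
        draws = eps[rng.integers(0, len(eps), size=horizon + 1)]
        shocked = draws.copy()
        shocked[0, 0] = delta              # impose eps_1t = delta at impact
        ys, yb = np.array(y0, float), np.array(y0, float)
        for h in range(horizon + 1):
            ys = F(ys) + B @ shocked[h]    # path with the imposed shock
            yb = F(yb) + B @ draws[h]      # path with the drawn shock
            diff[m, h] = ys - yb
    return diff.mean(axis=0)               # rows: horizons 0..H; cols: variables
```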

2.5 Granger Causality in a VAR Framework

VAR models have also been used for a type of causal analysis that does not involve the identification of a structural model like Eq. (5.3). This approach is based on a notion of causal relationship proposed by Granger [23, 24], referred to as Granger causality. Granger's general definition of causality relies on two principles: (i) the effect does not precede its cause in time; (ii) the causal series contains unique information about the series being caused that is not available otherwise (see [13]). A corollary of these principles is that \(x_t\) Granger-causes \(y_t\) if \(x_t\) is helpful for predicting future values of \(y_t\). Incidentally, these tenets share profound similarities with probabilistic theories of causality proposed in the philosophy of science literature [20, 21, 67] (see also [65]).

Although the definition of Granger causality is more general (see Sect. 5.3.1 below), several empirical studies and statistical software packages make it operational in a linear VAR framework. Consider a bivariate VAR with \(p\) lags:

$$\displaystyle \begin{aligned} \left( \begin{matrix} x_t\\ y_t \end{matrix} \right)= \sum_{i=1}^{p} \left[ \begin{matrix} a_{11,i} & a_{12,i}\\a_{21,i} & a_{22,i} \end{matrix} \right] \left( \begin{matrix} x_{t-i}\\ y_{t-i} \end{matrix} \right) + u_t. \end{aligned} $$
(5.13)

In this framework, \(x_t\) is said to be non-Granger-causal for \(y_t\) if and only if \(a_{21,i} = 0\) for \(i = 1, \ldots, p\) [49, p. 154]. This amounts to saying that the information set available up to time \(t - 1\) to forecast \(y_t\) comprises only \(x_{t-1}\) (together with further lagged terms) and \(y_{t-1}\) (together with further lagged terms), and one checks whether excluding the lagged \(x_t\) terms from the information set makes a difference in predicting \(y_t\). The zero restrictions can be tested with standard Wald \(\chi^2\)- or F-tests, which have standard asymptotic properties if the series are stationary [49, p. 154].
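A minimal sketch of this test in plain numpy/scipy is given below (statsmodels also ships a packaged version in its `grangercausalitytests` routine): the restricted regression drops the lags of \(x\) from the equation for \(y\), and the usual F statistic compares the two residual sums of squares. The helper names are our own.

```python
import numpy as np
from scipy import stats

def lagmat(v, p):
    """Columns v_{t-1}, ..., v_{t-p}, aligned with the target v[p:]."""
    return np.column_stack([v[p - i:len(v) - i] for i in range(1, p + 1)])

def granger_f_test(x, y, p):
    """F-test of H0: x does not Granger-cause y (all a_{21,i} = 0 in (5.13))."""
    target = y[p:]
    X_r = np.column_stack([np.ones(len(target)), lagmat(y, p)])   # restricted
    X_u = np.column_stack([X_r, lagmat(x, p)])                    # unrestricted
    def rss(X):
        beta = np.linalg.lstsq(X, target, rcond=None)[0]
        return np.sum((target - X @ beta) ** 2)
    rss_r, rss_u = rss(X_r), rss(X_u)
    df_num, df_den = p, len(target) - X_u.shape[1]
    F = ((rss_r - rss_u) / df_num) / (rss_u / df_den)
    return F, stats.f.sf(F, df_num, df_den)   # statistic and p-value
```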

A main limitation of this framework is the following: lagged \(x_t\) may make a difference in forecasting \(y_t\) (so that one infers that \(x_t\) Granger-causes \(y_t\)) because it contains information that is not contained in the information set comprising lagged \(y_t\) and lagged \(x_t\); but it is always possible that, had one considered a larger information set, for example one containing lagged values of a series \(z_t\), \(x_t\) would not bring any further contribution to the prediction of \(y_t\). If \(z_t\) is a common cause of both \(x_t\) and \(y_t\), one would have wrongly inferred that \(x_t\) causes \(y_t\). Thus, although scholars have worked in this direction, introducing concepts such as conditional independencies and higher-order interactions, causal sufficiency remains a fundamental tenet of this approach; this is particularly true if the focus on causality goes beyond what is sometimes referred to as "predictive causality."

Granger causality for causal pairs is a very powerful method in a setting in which, as mentioned in the introduction, the presence of a causal relationship between the two variables, up to some lag \(p \ge 0\), is known, but it is unknown whether it is \(x_{t-p}\) that causes \(y_t\) or \(y_{t-p}\) that causes \(x_t\).

Suppose, for example, that it is known that \(x_{t-p}\) causes \(y_t\) with \(p = 0\) or 1 and that there are no causal relationships from \(y_t\) to \(x_t\) at any lag. Then in all three admitted cases in which \(x_t\) can cause \(y_t\) ((i) \(x_{t-1} \to y_t\); (ii) \(x_t \to y_t\); (iii) i ∪ ii), the coefficient \(a_{12,1}\), estimated by regressing Eq. (5.13), is expected to be not significantly different from zero, while the other coefficients of the same matrix will be non-zero. Symmetrically, if \(y_{t-p}\) causes \(x_t\) with \(p = 0, 1\) (and there is no feedback from \(x_t\) to \(y_t\) at any lag), then the only coefficient of the same matrix, obtained by regressing the same equation, that is expected to be zero is \(a_{21,1}\).

Standard Granger-causality analysis in a VAR framework neglects, by design, the contemporaneous causal link, which is instead considered by the structural VAR approach. Geweke [19], however, proposes an extension of the Granger-causality concept to detect linear contemporaneous feedback between two time series, \(x_t\) and \(y_t\).

Jacobs et al. [41] and Hoover [32, pp. 151–152] present examples of bivariate, one-lag structural VAR models in which \(x_{t-1} \to y_t\), \(y_{t-1} \to x_t\), and \(x_t \to y_t\), but, for particular configurations of the parameters, the reduced-form VAR coefficient corresponding to the influence of \(y_{t-1}\) on \(x_t\) (\(a_{12,1}\) in Eq. (5.13)) is zero. One could exclude these types of parameter configurations as "measure-zero." This assumption would be similar to the faithfulness assumption in the graphical causal model literature [64], where configurations of parameters that yield statistical independence despite an underlying causal dependence are ruled out. Hoover [32] argues further that specific configurations of parameters for which Granger non-causality does not match structural non-causality may correspond to theoretical economic models and thus cannot be easily dismissed.

3 Direct Causal Search

In this section we discuss methods for causal pair search that are applied directly to time series data, without filtering them through a vector autoregressive model. Skipping VAR estimation has the clear advantage of not imposing a functional form (e.g. a linear VAR) when estimating the relationship between current and lagged values of the variables of interest. On the other hand, direct causal search must deal directly with autocorrelated data.

3.1 Granger Causality

As mentioned above (Sect. 5.2.5), the central notion in Granger causality is "incremental predictability" [32, p. 150]: if a time series \(y_{t+1}\) is better predicted by the set of all information available up to time \(t\) than by the same information set minus the series \(x_t\), then \(x_t\) Granger-causes \(y_{t+1}\). The general definition given by Granger [24, p. 49] is that \(x_t\) is said to cause \(y_{t+1}\) if

$$\displaystyle \begin{aligned} P(y_{t+1} \in A | \varOmega_t) \neq P(y_{t+1} \in A | \varOmega_t - x_t), \end{aligned} $$
(5.14)

where \(\Omega_t\) is all the knowledge in the universe available at time \(t\), \(\Omega_t - x_t\) is the same information set except for the values taken by \(x_t\) up to time \(t\) (where \(x_t \in \Omega_t\)), and \(A\) is any set of values that \(y_{t+1}\) can take. We can also write that \(x_t\) does not Granger-cause \(y_{t+1}\) if [13]

$$\displaystyle \begin{aligned} y_{t+1} \perp\!\!\!\perp \{x_s : s \leq t\} \mid \varOmega_t - x_t, \end{aligned} $$
(5.15)

otherwise \(x_t\) is said to Granger-cause \(y_{t+1}\). As Granger [24] admits, this general definition of causality is not operational, i.e. it cannot be implemented with actual data. A practical solution is to consider \(\Omega_t\) as incorporating only current and past values (up to certain lags) of \(x_t\), of \(y_t\), and of a set of observed variables \(Z_t\). Thus we have that \(x_t\) is Granger-noncausal for \(y_{t+1}\) if [16, 66]

$$\displaystyle \begin{aligned} y_{t+1} \perp\!\!\!\perp \{x_t, \ldots, x_{t-q}\} \mid \{y_t, \ldots, y_{t-p}, Z_t, \ldots, Z_{t-r}\}, \end{aligned} $$
(5.16)

given lags \(p, q, r\), where by \(\{x_t, \ldots, x_{t-q}\}\) we denote the σ-field generated by the vector of random variables \((x_t, \ldots, x_{t-q})\), and similarly for \(\{y_t, \ldots\}\). The σ-field generated by a random variable is the set of events that may be described in terms of that random variable [16, p. 588]. Let us suppose that the background knowledge available at time \(t\) comprises only two time series: \(x_t\) and \(y_t\). Then, given lags \(p\) and \(q\), \(x_{t-1}\) does not Granger-cause \(y_t\) if

$$\displaystyle \begin{aligned} y_{t} \perp\!\!\!\perp \{x_{t-1}, \ldots, x_{t-q}\} \mid \{y_{t-1}, \ldots, y_{t-p}\}. \end{aligned} $$
(5.17)

Assuming that \(x_t\) and \(y_t\) are stationary and ergodic, many studies have proposed nonparametric tests of (5.17), without assuming a linear structure (which could be treated in a linear VAR framework); see [1, 2, 4, 12, 31, 66, 70]. For \(p, q = 1\) the proposed tests perform well, but their performance tends to decline for large \(p\) and \(q\) when the sample size is limited [6]. The assumption that \(\Omega_t\) comprises only two time series is, of course, a strong one in empirical contexts where causal sufficiency may fail.

3.2 Graphical Models for Time Series

Since Granger causality faces fundamental hurdles in the case of unmeasured causal variables, one possible solution is to rely on causal inference procedures that are designed to perform well in the presence of latent variables. One algorithm that is asymptotically correct in the presence of latent variables is the Fast Causal Inference (FCI) algorithm proposed by Spirtes et al. [64]. This method belongs to the more general approach of graphical causal models based on conditional independence tests, also known as "constraint-based causal search" (see [63]). We mentioned this approach in Sect. 5.2.1, noticing that it was of little use when applied to pairs of estimated VAR reduced-form residuals. This approach, however, has wider applicability when applied directly to pairs of time series data (not filtered by a VAR model), because it can exploit the possibility of conditioning on both lagged and contemporaneous variables. An interesting method, in this setting, is the adaptation of the FCI algorithm that Entner and Hoyer [14] propose for time series.

In the case of causal sufficiency (and no feedback loops), constraint-based causal search starts from the assumption that the data generating process can be described by a directed acyclic graph (DAG) and a joint distribution \(P(X)\), where \(X = (X_1, \ldots, X_n)\) is the set of observable variables represented by the set \(V\) of \(n\) vertices of the DAG. Causal inference is based on two assumptions: the Markov and faithfulness conditions. The Markov condition states that if vertices \(i\) and \(j\) of a DAG \(\mathscr{G}\) are d-separated given some subset \(W \subseteq V \setminus \{i, j\}\) (d-separation being a graphical criterion defined by Pearl [54]), then \(X_i \perp\!\!\!\perp X_j \mid X_W\). The faithfulness condition states that all (conditional and unconditional) independence relations in \(P(X)\) are entailed by the Markov condition. In this setting, the PC algorithm [64], on the basis of these assumptions, starts from a complete graph (all vertices connected by undirected edges) over all variables and performs a series of independence tests that allow the removal of edges between pairs of variables that are independent conditional on some set of variables (including the empty set). It then makes use of rules which allow the orientation of edges among triples of vertices, and in particular the distinction between collider structures (⋅→⋅←⋅) and fork/chain structures (⋅←⋅→⋅, ⋅←⋅←⋅, or ⋅→⋅→⋅). This is also done on the basis of conditional independence tests and the two conditions above. The outcome of the algorithm is a set of DAGs that share the same (conditional) independence relations, i.e. a class of Markov equivalent DAGs.

Relaxing the assumption of causal sufficiency, the FCI algorithm [64] also starts from the assumption that the process underlying the data can be described by a DAG, but this DAG may contain vertices that correspond to latent variables. Richardson and Spirtes [60] (see also [8]) introduced a new class of graphs whose vertices are observed variables, but in which the causal relationships may involve latent variables. These graphs, in which a latent cause \(Z\) affecting the observed variables \(X\) and \(Y\) is represented by \(X \leftrightarrow Y\), are called maximal ancestral graphs (MAGs). The idea is that any DAG whose vertices include latent variables can be transformed into a unique MAG whose vertices comprise only the observed variables. Moreover, MAGs encode conditional independence relations among the observed variables through m-separation, a generalization of d-separation [8, 60]. A MAG is a graph \(\mathscr{M}\) with the following properties: (i) \(\mathscr{M}\) is a mixed graph (it contains not only directed (→), but also undirected (−) and bi-directed (↔) edges); (ii) \(\mathscr{M}\) is an ancestral graph (there is no vertex \(i\) which is an ancestor of any of its parents or of any of its spouses (Footnote 3)); (iii) for every pair of variables \(\langle X_i, X_j\rangle\) there is an edge between \(i\) and \(j\) in \(\mathscr{M}\) if and only if there does not exist a set of vertices \(W \subseteq V \setminus \{i, j\}\) in \(\mathscr{M}\) such that \(X_i \perp\!\!\!\perp X_j \mid X_W\) [14, 60].

Similarly to the PC algorithm, the output of the FCI algorithm is a class of MAGs that entail the same set of conditional independence relationships. This class of MAGs is represented by a partial ancestral graph (PAG), which is a graph with a third edge mark, besides the arrowtail (−) and the arrowhead (>), namely a circle (∘). Excluding feedback loops and selection bias (hence undirected edges), a PAG can only incorporate these types of edges: →, ↔, ∘→, and ∘−∘. If \(X_i \leftrightarrow X_j\), then neither variable is an ancestor of the other and there is a latent variable between \(X_i\) and \(X_j\). The circle (∘) denotes the case in which it is undecided whether in the underlying data generating process there is an arrowtail or an arrowhead next to the vertex where the circle appears. This means that the PAG contains a MAG with (−) and a MAG with (>) at that location. Like the PC algorithm, the FCI algorithm in a first step removes edges from a complete graph on the basis of conditional independence tests, and in a second step orients edges so that the inferred causal structures are in tune with the Markov and faithfulness assumptions (all the conditional independence relations must be derivable from m-separation).

Entner and Hoyer [14] adapt FCI to a time series framework, calling the result tsFCI. Suppose the observed time series variables are \(\{x_t\} = x_1, \ldots, x_T\) and \(\{y_t\} = y_1, \ldots, y_T\). The algorithm starts from a complete graph on a time window of the time series, i.e. the set of vertices is \(x_t, x_{t-1}, \ldots, x_{t-p}, y_t, y_{t-1}, \ldots, y_{t-p}\). It then removes edges from this complete graph, as in the standard FCI algorithm, on the basis of conditional independence tests, with the addition that if a contemporaneous edge is eliminated, it is eliminated at all time units (\(t, t-1, \ldots, t-p\)), and if a lagged edge with lag \(l\) is eliminated (for example from \(x_{t-l}\) to \(y_t\)), it is eliminated at all time units (for example from \(x_{t-l-1}\) to \(y_{t-1}\)). Orientation makes use not only of the orientation rules of the standard FCI algorithm, but also of the "arrow of time": an undirected edge between two lagged variables receives an arrowtail at the variable coming earlier and an arrowhead at the variable coming later. Moreover, if an edge is oriented contemporaneously at time \(t\), it is oriented in the same manner at all time units (\(t, t-1, \ldots\)); if a lagged edge with lag \(l\) is oriented (for example \(x_{t-l} \to y_t\)), it is oriented in the same manner at all time units (for example \(x_{t-l-1} \to y_{t-1}\)). A sketch of the window-construction step is given below.
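The sketch assumes Python/numpy; the column layout is our own illustrative choice. Running an off-the-shelf FCI implementation (for instance the one in the causal-learn package) on the resulting matrix, and then enforcing the tsFCI repetition and time-order rules on its output, are left as indicated in the comments.

```python
import numpy as np

def time_window_matrix(x, y, p):
    """Unroll the pair (x, y) into a time-window data matrix with columns
    x_t, y_t, x_{t-1}, y_{t-1}, ..., x_{t-p}, y_{t-p}."""
    cols, T = [], len(x)
    for lag in range(p + 1):
        cols.append(x[p - lag:T - lag])   # x at lag `lag`
        cols.append(y[p - lag:T - lag])   # y at lag `lag`
    return np.column_stack(cols)          # shape (T - p, 2 * (p + 1))

# One would then run FCI on time_window_matrix(x, y, p) and post-process
# its output graph: repeat every removal/orientation at all time units,
# and orient every lagged edge from earlier to later (arrow of time).
```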

Thus, exploiting the assumption that an effect cannot precede its cause and the assumption of repetition of causal structures over time (time invariance), one can reach a more detailed description of the data generating process than would be provided by a standard application of a constraint-based algorithm. However, since these methods ultimately rely on conditional independence tests, it is crucial that the tests are designed taking into account the specificities of testing self-dependence in a time series context (see [46]).

3.3 Additive Noise Models

We consider in this subsection the problem of distinguishing among different causal structures over the time-series pair \(\{x_t, y_t\}\), using a specific class of structural equation models. We assume that: (i) there are no latent common causes of \(x_t\) and \(y_t\) (at any lag); (ii) there are no contemporaneous causal feedback loops (i.e. either \(x_t \to y_t\) or \(x_t \leftarrow y_t\), but it is possible that \(x_{t-s} \to y_t \to x_{t+h}\), for \(s \ge 0\), \(h \ge 1\)); (iii) each variable \(x_t\) and \(y_t\) causally depends on its own past (respectively \(x_{t-1}, \ldots\) and \(y_{t-1}, \ldots\)) up to a lag \(p\); (iv) both contemporaneous and lagged causal structures recur over time: if \(x_{t-i} \to y_t\), then \(x_{t-i-s} \to y_{t-s}\), for \(i \ge 0\), \(s \ge 1\). To simplify the illustration, we also assume that (v) \(p = 1\). In Fig. 5.3 we show the 12 directed acyclic graphs (DAGs) corresponding to all the possible causal structures of the data generating process (represented as unit graphs) under these assumptions. We further assume that (vi) \(x_t\) and \(y_t\) are stationary and ergodic processes, and that the data generating process can be formalized as a specific type of structural equation model (or functional equation model, see [55]), namely an additive noise model [35, 56, 57], where

$$\displaystyle \begin{aligned} x_t = f_x(\mathbf{PA}^x) + N^x_t \end{aligned} $$
(5.18)

and

$$\displaystyle \begin{aligned} y_t = f_y(\mathbf{PA}^y) + N^y_t, \end{aligned} $$
(5.19)

where \(\mathbf{PA}^x\) denotes the graphical parents of \(x_t\) (and \(\mathbf{PA}^y\) those of \(y_t\)) in the DAG representing the data generating process, and \(N^x_t\) and \(N^y_t\) are independent white noise processes. We assume that (vii) \(N^x_t \perp\!\!\!\perp \mathbf{PA}^x\), \(N^y_t \perp\!\!\!\perp \mathbf{PA}^y\), and \(N^x_t \perp\!\!\!\perp N^y_t\); and (viii) \(f_x(\cdot)\) and \(f_y(\cdot)\) are either nonlinear functions, or linear functions with the additional assumption that \(N^x_t\) and \(N^y_t\) have non-Gaussian distributions (Footnote 4).

Fig. 5.3 Unit graphs of all the possible structural equation models under assumptions (i)–(viii)

In Fig. 5.3, below each DAG we show the set of corresponding structural equations and the set of implied (conditional or unconditional) independence relationships. Hoyer et al. [35] (see also Sect. 5.2.3) propose a procedure to check whether a DAG corresponding to a nonlinear additive noise model is consistent with the data: first, one constructs a nonlinear regression of each variable on its parents; then, one tests whether the estimated residuals are independent of the covariates and of each other. If any independence test is rejected, the DAG is rejected; if none of the independence tests is rejected, the DAG is consistent with the data. A sketch of this consistency check is given below.
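The sketch assumes scikit-learn for the nonparametric regressions; `indep_pvalue` stands in for any unconditional independence test (for instance the HSIC-type permutation test sketched in Sect. 5.2.3), and the DAG encoding is an illustrative choice.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def dag_consistent(data, dag, indep_pvalue, alpha=0.05):
    """Consistency check in the spirit of Hoyer et al.: `dag` maps each
    variable (column index of `data`) to the list of its parent columns."""
    residuals = {}
    for var, parents in dag.items():
        if parents:
            X = data[:, parents]
            model = KernelRidge(kernel="rbf", alpha=0.1).fit(X, data[:, var])
            res = data[:, var] - model.predict(X)
            # Residual must be independent of each covariate.
            if any(indep_pvalue(res, data[:, j]) < alpha for j in parents):
                return False
        else:
            res = data[:, var] - data[:, var].mean()
        residuals[var] = res
    # Residuals must also be mutually independent.
    keys = list(residuals)
    return all(indep_pvalue(residuals[a], residuals[b]) >= alpha
               for i, a in enumerate(keys) for b in keys[i + 1:])
```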

Thus, in principle, one could run the regressions corresponding to the equations indicated below each DAG in Fig. 5.3 to check whether a specific DAG is consistent with the data. Let us analyze some specific cases.

Suppose the data are generated by DAG 1 (see Fig. 5.3) and the data generating process is not known to the observer. By constructing the nonparametric regressions (Footnote 5):

$$\displaystyle \begin{aligned} x_t = f_1(x_{t-1}) + N_t^{x, 1} \end{aligned} $$
(5.20)
$$\displaystyle \begin{aligned} y_t = f_1(y_{t-1}) + N^{y,1}_t \end{aligned} $$
(5.21)

and by not rejecting the independence relations:

$$\displaystyle \begin{aligned} \widehat{N^{x,1}_t} \perp\!\!\!\perp \widehat{N^{y,1}_t}, \end{aligned} $$
(5.22)
$$\displaystyle \begin{aligned} \widehat{N^{x,1}_{t-1}} \perp\!\!\!\perp \widehat{N^{y,1}_t}, \end{aligned} $$
(5.23)
$$\displaystyle \begin{aligned} \widehat{N^{x,1}_t} \perp\!\!\!\perp \widehat{N^{y,1}_{t-1}}, \end{aligned} $$
(5.24)

one would conclude that DAG 1 is consistent with the data. Are other DAGs consistent with these findings? If we run the same regressions on data generated by DAG 2, we will not necessarily reject \(\widehat{N^{x,1}_t} \perp\!\!\!\perp \widehat{N^{y,1}_t}\) and \(\widehat{N^{x,1}_t} \perp\!\!\!\perp \widehat{N^{y,1}_{t-1}}\). Indeed, these regressions may suffer from omitted variable bias, but not from reverse causality. However, we will have that \(\widehat{N^{x,1}_{t-1}} \not\perp\!\!\!\perp \widehat{N^{y,1}_t}\). Indeed, \(\widehat{N^{y,1}_{t}}\) results from a regression in which \(x_{t-1}\) is omitted. Hence \(\widehat{N^{y,1}_{t}}\) is dependent on \(x_{t-1}\), and since \(x_{t-1}\) is in turn dependent on \(\widehat{N^{x,1}_{t-1}}\), it follows that \(\widehat{N^{x,1}_{t-1}} \not\perp\!\!\!\perp \widehat{N^{y,1}_t}\). If we run the same regressions (Eqs. (5.20), (5.21)) using data generated by any other DAG (from DAG 3 to DAG 12), analogous lines of reasoning lead to the same conclusion: \(\widehat{N^{x,1}_{t-i}} \not\perp\!\!\!\perp \widehat{N^{y,1}_{t-j}}\) for some \(\langle i, j\rangle = \langle 0, 0\rangle, \langle 1, 0\rangle, \langle 0, 1\rangle\).

Let us now suppose that DAG 1 has been found inconsistent with the data and one runs the nonparametric regressions (also indicated below DAG 2 in Fig. 5.3):

$$\displaystyle \begin{aligned} x_t = f_2(x_{t-1}) + N^{x,2}_t, \end{aligned} $$
(5.25)
$$\displaystyle \begin{aligned} y_t = f_2(x_{t-1}, y_{t-1}) + N^{y,2}_t. \end{aligned} $$
(5.26)

By not rejecting:

$$\displaystyle \begin{aligned} \widehat{N^{x,2}_t} \perp\!\!\!\perp \widehat{N^{y,2}_t}, \end{aligned} $$
(5.27)
$$\displaystyle \begin{aligned} \widehat{N^{x,2}_{t-1}} \perp\!\!\!\perp \widehat{N^{y,2}_t}, \end{aligned} $$
(5.28)
$$\displaystyle \begin{aligned} \widehat{N^{x,2}_t} \perp\!\!\!\perp \widehat{N^{y,2}_{t-1}}, \end{aligned} $$
(5.29)
$$\displaystyle \begin{aligned} \widehat{N^{y,2}_t} \perp\!\!\!\perp x_{t-1}, \end{aligned} $$
(5.30)

one would conclude that DAG 2 is consistent with the data. If the data were generated by DAG 3, we would have \(\widehat{N^{x,2}_t} \not\perp\!\!\!\perp \widehat{N^{y,2}_{t-1}}\), because in regressing \(x_t\) on \(x_{t-1}\) we are omitting \(y_{t-1}\), which is a graphical parent of \(x_t\) in DAG 3. If the data were generated by any DAG containing the contemporaneous causal link (DAG 4–DAG 12, except DAG 10), we would have \(\widehat{N^{x,2}_t} \not\perp\!\!\!\perp \widehat{N^{y,2}_t}\). If DAG 10 were generating the data, we would have \(\widehat{N^{x,2}_t} \not\perp\!\!\!\perp \widehat{N^{y,2}_{t-1}}\), because, again, we would omit \(y_{t-1}\) in the regression of \(x_t\) on \(x_{t-1}\).

Let us now suppose that DAG 4 is the data generating process. By running the nonparametric regressions

$$\displaystyle \begin{aligned} x_t = f_4(x_{t-1}) + N^{x,4}_t \end{aligned} $$
(5.31)
$$\displaystyle \begin{aligned} y_t = f_4(x_{t}, y_{t-1}) + N^{y,4}_t \end{aligned} $$
(5.32)

and not rejecting

$$\displaystyle \begin{aligned} \widehat{N^{y,4}_t} \perp\!\!\!\perp x_t, \end{aligned} $$
(5.33)
$$\displaystyle \begin{aligned} \widehat{N^{x,4}_t} \perp\!\!\!\perp \widehat{N^{y,4}_t}, \end{aligned} $$
(5.34)
$$\displaystyle \begin{aligned} \widehat{N^{x,4}_{t-1}} \perp\!\!\!\perp \widehat{N^{y,4}_t}, \end{aligned} $$
(5.35)
$$\displaystyle \begin{aligned} \widehat{N^{x,4}_t} \perp\!\!\!\perp \widehat{N^{y,4}_{t-1}}, \end{aligned} $$
(5.36)

we would conclude that DAG 4 is consistent with the data. If the data generating process were any DAG with the opposite contemporaneous causal link (DAG 5, 7, 9, 12), running the same regressions ((5.31), (5.32)) and tests ((5.33)–(5.36)), we would get \(\widehat{N^{y,4}_t} \not\perp\!\!\!\perp x_t\). If the data generating process were any DAG among DAG 2, 3, 6, 8, 10, 11, there would be no reverse contemporaneous causal link, but an omitted lagged variable in one (or both) of the two regressions. This would imply \(\widehat{N^{x,4}_{t-i}} \not\perp\!\!\!\perp \widehat{N^{y,4}_{t-j}}\) for some \(\langle i, j\rangle = \langle 1, 0\rangle, \langle 0, 1\rangle\).

These examples should already suggest that, within the framework of the 12 possible DAGs of Fig. 5.3 and under the assumptions listed above, an exhaustive search of the independence relationships implied by the possible DAGs allows one to uniquely identify the model that has generated the data. Based on these considerations, we propose a search procedure formalized in the algorithm described in the table below. The algorithm avoids an exhaustive causal search, but is nevertheless able to uniquely identify, among the 12 DAGs represented in Fig. 5.3, the one that has generated the data.

The search algorithm is able to efficiently infer one of the 12 DAGs on the basis of a limited number of nonparametric regressions and tests of unconditional independence. Once the algorithm outputs DAG number \(i\), however, we suggest checking its consistency with the data through the nonparametric regressions and (conditional and unconditional) independence tests indicated in Fig. 5.3 under the inferred DAG number.

Search Algorithm

1. Input: Samples from a 2-dimensional time series of length T, maximal order p = 1.

2. Run nonpar. regressions: \(x_t = f_1(x_{t-1}) + N_t^{x,1}\); \(y_t = f_1(y_{t-1}) + N_t^{y,1}\), get \(\widehat {N_t^{x,1}}, \widehat {N_t^{y,1}}\)

3. Test: \(\widehat {N_t^{x,1}} \perp\!\!\!\perp \widehat {N_t^{y,1}}\)

4. If \(\widehat {N_t^{x,1}} \perp\!\!\!\perp \widehat {N_t^{y,1}}\):

5. Test: \(\widehat {N_{t-i}^{x,1}} \perp\!\!\!\perp \widehat {N_{t-j}^{y,1}}\) for 〈i, j〉 = 〈1, 0〉, 〈0, 1〉

6. If \(\widehat {N_{t-i}^{x,1}} \perp\!\!\!\perp \widehat {N_{t-j}^{y,1}}\) for 〈i, j〉 = 〈1, 0〉, 〈0, 1〉, break, output DAG 1

7. If \(\widehat {N_{t-1}^{x,1}} \not\perp\!\!\!\perp \widehat {N_{t}^{y,1}}\) and \(\widehat {N_{t}^{x,1}} \perp\!\!\!\perp \widehat {N_{t-1}^{y,1}}\), then break, output DAG 2

8. If \(\widehat {N_{t-1}^{x,1}} \perp\!\!\!\perp \widehat {N_{t}^{y,1}}\) and \(\widehat {N_{t}^{x,1}} \not\perp\!\!\!\perp \widehat {N_{t-1}^{y,1}}\), then break, output DAG 3

9. Else break, output DAG 10

10. If \(\widehat {N_t^{x,1}} \not\perp\!\!\!\perp \widehat {N_t^{y,1}}\):

11. Run nonp. reg.: \(x_t = f_4(x_{t-1}) + N_t^{x,4}\); \(y_t = f_4(x_t, y_{t-1}) + N_t^{y,4}\), get \(\widehat {N_t^{x,4}}, \widehat {N_t^{y,4}}\)

12. Test: \(\widehat {N_t^{y,4}} \perp\!\!\!\perp x_t\)

13. If \(\widehat {N_t^{y,4}} \perp\!\!\!\perp x_t\):

14. Test: \(\widehat {N_{t-i}^{x,4}} \perp\!\!\!\perp \widehat {N_{t-j}^{y,4}}\) for 〈i, j〉 = 〈0, 0〉, 〈1, 0〉, 〈0, 1〉

15. If \(\widehat {N_{t-i}^{x,4}} \perp\!\!\!\perp \widehat {N_{t-j}^{y,4}}\) for 〈i, j〉 = 〈0, 0〉, 〈1, 0〉, 〈0, 1〉, break, output DAG 4

16. If \(\widehat {N_{t-i}^{x,4}} \perp\!\!\!\perp \widehat {N_{t-j}^{y,4}}\) only for 〈i, j〉 = 〈0, 0〉, 〈0, 1〉, break, output DAG 6

17. If \(\widehat {N_{t-i}^{x,4}} \perp\!\!\!\perp \widehat {N_{t-j}^{y,4}}\) only for 〈i, j〉 = 〈0, 0〉, 〈1, 0〉, break, output DAG 8

18. Else break, output DAG 11

19. If \(\widehat {N_t^{y,4}} \not\perp\!\!\!\perp x_t\):

20. Run \(x_t = f_5(x_{t-1}, y_t) + N_t^{x,5}\); \(y_t = f_5(y_{t-1}) + N_t^{y,5}\), get \(\widehat {N_t^{x,5}}, \widehat {N_t^{y,5}}\)

21. Test: \(\widehat {N_{t-i}^{x,5}} \perp\!\!\!\perp \widehat {N_{t-j}^{y,5}}\) for 〈i, j〉 = 〈0, 0〉, 〈1, 0〉, 〈0, 1〉

22. If \(\widehat {N_{t-i}^{x,5}} \perp\!\!\!\perp \widehat {N_{t-j}^{y,5}}\) for 〈i, j〉 = 〈0, 0〉, 〈1, 0〉, 〈0, 1〉, break, output DAG 5

23. If \(\widehat {N_{t-i}^{x,5}} \perp\!\!\!\perp \widehat {N_{t-j}^{y,5}}\) only for 〈i, j〉 = 〈0, 0〉, 〈0, 1〉, break, output DAG 7

24. If \(\widehat {N_{t-i}^{x,5}} \perp\!\!\!\perp \widehat {N_{t-j}^{y,5}}\) only for 〈i, j〉 = 〈0, 0〉, 〈1, 0〉, break, output DAG 9

25. Else break, output DAG 12

26. Output: One DAG among DAG 1 - DAG 12.
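A compact Python sketch of the algorithm (for p = 1) is given below; `indep` is assumed to be any unconditional independence test returning True when independence is not rejected (e.g. a wrapper around the HSIC permutation test of Sect. 5.2.3), and kernel ridge regression is an illustrative choice for the nonparametric regressions.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def regress(target, covariates):
    """Nonparametric regression residuals (kernel ridge as an example)."""
    model = KernelRidge(kernel="rbf", alpha=0.1).fit(covariates, target)
    return target - model.predict(covariates)

def lagged_pair(Nx, Ny, i, j):
    """Align the residual pair (N^x_{t-i}, N^y_{t-j}) over common t."""
    m, T = max(i, j), len(Nx)
    return Nx[m - i:T - i], Ny[m - j:T - j]

def search_dag(x, y, indep):
    X1, Y1 = x[:-1, None], y[:-1, None]            # x_{t-1}, y_{t-1} as columns
    xt, yt = x[1:], y[1:]                          # x_t, y_t
    Nx1, Ny1 = regress(xt, X1), regress(yt, Y1)    # step 2
    if indep(Nx1, Ny1):                            # steps 3-4: <0,0>
        i10 = indep(*lagged_pair(Nx1, Ny1, 1, 0))  # step 5
        i01 = indep(*lagged_pair(Nx1, Ny1, 0, 1))
        if i10 and i01: return 1                   # step 6
        if not i10 and i01: return 2               # step 7
        if i10 and not i01: return 3               # step 8
        return 10                                  # step 9
    # Contemporaneous dependence: try direction x_t -> y_t (steps 10-18).
    # The x regression is unchanged, so N^{x,4} equals N^{x,1}.
    Ny4 = regress(yt, np.column_stack([xt, y[:-1]]))         # step 11
    if indep(Ny4, xt):                                       # steps 12-13
        tests = {ij: indep(*lagged_pair(Nx1, Ny4, *ij))
                 for ij in [(0, 0), (1, 0), (0, 1)]}         # step 14
        if all(tests.values()): return 4                     # step 15
        if tests[(0, 0)] and tests[(0, 1)]: return 6         # step 16
        if tests[(0, 0)] and tests[(1, 0)]: return 8         # step 17
        return 11                                            # step 18
    # Otherwise try direction y_t -> x_t (steps 19-25).
    Nx5 = regress(xt, np.column_stack([x[:-1], yt]))         # step 20
    Ny5 = regress(yt, Y1)
    tests = {ij: indep(*lagged_pair(Nx5, Ny5, *ij))
             for ij in [(0, 0), (1, 0), (0, 1)]}             # step 21
    if all(tests.values()): return 5                         # step 22
    if tests[(0, 0)] and tests[(0, 1)]: return 7             # step 23
    if tests[(0, 0)] and tests[(1, 0)]: return 9             # step 24
    return 12                                                # step 25
```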

For a more general framework with \(k\) time series and \(p\) lags of causal influence, Peters et al. [56] propose a search procedure based on additive noise models called TiMINo, i.e. time series models with independent noise. The output of TiMINo is, however, a summary graph. This means that it is not possible to disentangle contemporaneous from lagged causal effects. The advantage of our search algorithm is that it can distinguish between these two types of effects, but only in the specific framework of time series pairs.

3.4 Local Projections

Local projections were introduced by Jordà [42] to compute impulse responses (see Sect. 5.2.4) without specifying and estimating a VAR model. Indeed, local projections eschew any attempt to represent the data generating process through a multivariate structural time series system. The idea is to estimate impulse responses through regression methods that are applied at each horizon of interest, without hinging on a pre-specified or pre-estimated time series model.

Let \(Y_t = (x_t, y_t)\), as in Sect. 5.2.1. Jordà [42] considered projecting \(Y_{t+s}\) onto the linear space generated by \((Y_{t-1}, \ldots, Y_{t-p})\) for a certain choice of lag \(p\), namely

$$\displaystyle \begin{aligned} Y_{t+s} = \alpha^s + P_1^{s+1} Y_{t-1} + P_2^{s+1} Y_{t-2} + \ldots + P_p^{s+1} Y_{t-p} + u_{t+s}^{s}, \end{aligned} $$
(5.37)

where \(\alpha^s\) is a \(2 \times 1\) vector of constants, the \(P_i^{s+1}\) are \(2 \times 2\) matrices of coefficients, and \(u_{t+s}^{s}\) is a \(2 \times 1\) vector of errors, by construction uncorrelated with the regressors. Superscripts denote the time window over which the regression is performed.

Impulse response functions are defined as the difference between two forecasts, an idea consistent with Eq. (5.12). More specifically, the impulse response of \(x_{t+s}\) to a shock at time \(t\) is

$$\displaystyle \begin{aligned} IR(t,s,\delta) = \mathbb{E} (x_{t+s}| v_{1t}= \delta, Y_{t}) - \mathbb{E} (x_{t+s} | v_{1t}= 0, Y_{t}) \; \; \; s = 0, \ldots, H, \end{aligned} $$
(5.38)

where \(\mathbb{E}(\cdot|\cdot)\) denotes the best mean-squared-error predictor, \(v_{1t}\) is a disturbance shock, and \(\delta\) is the magnitude of the shock whose impact one wants to measure.

The impulse responses estimated from (5.37) are

$$\displaystyle \begin{aligned} IR(t,s,\delta) = \hat{P}_1^s \delta. \end{aligned} $$
(5.39)

As noted by Kilian and Lütkepohl [44, ch. 12], these impulse responses will be relative to a reduced-form error (\(v_{it} = u_{it}\)), and not to the true shock affecting the system, if they are estimated directly through a least squares regression of Eq. (5.37). Thus, it is fundamental in this context to transform the reduced-form residuals into a mixture of structural shocks. But here the problem is analogous to the problem of identification of the structural VAR model, and the literature on local projections does not seem to have found a method yet that bypasses this step.
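For completeness, here is a minimal numpy sketch of the local-projection regressions (5.37) and the associated responses (5.39); the OLS implementation and the unit shock vector are illustrative, and, per the caveat just noted, the resulting responses are relative to reduced-form errors unless the residuals are first mapped to structural shocks.

```python
import numpy as np

def local_projection_irf(Y, p, horizons, delta=1.0):
    """OLS local projections: regress Y_{t+s} on (1, Y_{t-1}, ..., Y_{t-p})
    for each horizon s, as in Eq. (5.37), and read the response off the
    coefficient block on Y_{t-1}, as in Eq. (5.39)."""
    T, k = Y.shape
    irf = []
    for s in horizons:
        rows = range(p, T - s)
        target = np.array([Y[t + s] for t in rows])
        design = np.array([np.concatenate(([1.0], Y[t - p:t][::-1].ravel()))
                           for t in rows])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        P1 = coef[1:1 + k].T                     # estimated block on Y_{t-1}
        irf.append(P1 @ np.array([delta] + [0.0] * (k - 1)))  # shock to v_1t
    return np.array(irf)                         # shape (len(horizons), k)
```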

4 Conclusions

In this paper we have addressed the problem of causal inference from data that are realizations of bivariate time series processes. We have focused on the setting typically encountered in econometrics, namely stationary or difference-stationary autoregressive processes with additive noise. The standard approach in econometrics to this problem is structural vector autoregressive analysis, which allows the researcher to filter the time-series data in order to apply causal search algorithms to the i.i.d. filtered data. Since the time structure is filtered out, the output of this causal search is a contemporaneous causal structure, which, in a second step, makes it possible to recover the entire structural autoregressive model. In a causal pair setting, however, causal search in this framework is limited. For example, in the case of Gaussian data, the linear causal structure between the two filtered time series is not identifiable. We have shown that identification is possible under non-Gaussianity (exploiting independent component analysis) or under nonlinearity (exploiting nonlinear additive noise models). But we have also shown that, in a setting of bivariate time series, a valid alternative approach is to address the problem of causal inference while avoiding the vector autoregressive framework. This is possible by applying graphical model algorithms (like FCI) or nonlinear additive noise model algorithms (like the one presented in this paper) directly to the data, without filtering them. We have also shown the possibility of applying Granger non-causality testing and local projections in a framework in which VAR models are not necessarily estimated. The latter two techniques, however, deviate in many respects from a structural interpretation of causality (see Footnote 1), i.e. from a framework which allows intervention, and are closer to a notion of predictability. A study of the relative merits of the different methods presented above on empirical and simulated data is left to future research.