1 Background and motivation

Providing causal assessments about episodes of extreme weather or unusual climate conditions is an important topic in the climate sciences: it arises from multiple needs, including public dissemination, litigation, adaptation to climate change, or simply the improvement of the science associated with these events (Stott et al. 2013). The approach most widely used so far was introduced a decade ago by M.R. Allen and colleagues (Allen 2003; Stone and Allen 2005); it originates from best practices in epidemiology (Greenland and Rothman 1998) and is referred to as probabilistic event attribution (PEA).

In the PEA approach, one evaluates the extent to which a given external climate forcing — such as solar irradiation, greenhouse gas (GHG) emissions, ozone or aerosol concentrations — has changed the probability of occurrence of an event of interest. For this purpose, one thus needs to compute two probabilities: (i) the probability of occurrence of the event in an ensemble of model simulations representing the observed climatic conditions, which simulates the actual occurrence probability in the real world, referred to as factual; and (ii) the probability of occurrence of the event in a second ensemble of model simulations, representing this time the alternative world that might have occurred had the forcing of interest been absent, referred to as counterfactual.

Denoting by \(p_1\) and \(p_0\) the probabilities of the event occurring in the factual world and in the counterfactual world, respectively, the so-called fraction of attributable risk (FAR) is then defined as:

$$ \text{FAR} =1 - \frac{p_{0}}{p_{1}} $$
(1)

The FAR has long been interpreted as the fraction of the change in likelihood of an event that is attributable to the external forcing. Over the past decade, most causal claims have followed from the FAR and its uncertainty, resulting in statements such as “It is very likely that over half the risk of European summer temperature anomalies exceeding a threshold of 1.6 °C is attributable to human influence” (Stott et al. 2004). Hannart et al. (2015) have recently shown that, under realistic assumptions, the FAR may also be interpreted as the so-called probability of necessary causation (PN) associated — in a complete and self-consistent theory of causality (Pearl 2000) — with the causal link between the forcing and the event. The FAR thus corresponds to only one of the two facets of causality in such a theory, the probability of sufficient causation (PS) being its second facet.

In this setting,

$$ \text{PN} = \max\left\{1-\frac{p_{0}}{p_{1}},\; 0\right\} $$
(2a)
$$ \text{PS} = \max\left\{1-\frac{1-p_{1}}{1-p_{0}},\; 0\right\} $$
(2b)
$$ \text{PNS} = \max\left\{p_{1}-p_{0},\; 0\right\} $$
(2c)

where PNS is the probability of necessary and sufficient causation. Pearl (2000) provides rigorous definitions of these three concepts, as well as a detailed discussion of their meanings and implications. It can be seen from Eqs. 2a–2c that causal attribution requires evaluating the two probabilities \(p_0\) and \(p_1\), whose estimation is therefore the central methodological question of PEA.
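As a side note, the three probabilities are trivial to compute once \(p_0\) and \(p_1\) are available. The following minimal Python sketch implements Eqs. 2a–2c; the numerical values of \(p_0\) and \(p_1\) in the usage line are purely illustrative and are not taken from any study.

```python
def causal_probabilities(p0, p1):
    """Probabilities of necessary (PN), sufficient (PS), and necessary-and-sufficient
    (PNS) causation, per Eqs. 2a-2c, given the counterfactual and factual event
    probabilities p0 and p1."""
    pn = max(1.0 - p0 / p1, 0.0) if p1 > 0 else 0.0
    ps = max(1.0 - (1.0 - p1) / (1.0 - p0), 0.0) if p0 < 1 else 0.0
    pns = max(p1 - p0, 0.0)
    return pn, ps, pns

# Illustrative values only: a rare event whose probability quadruples in the
# factual world has PN = 0.75, but PS and PNS remain close to zero.
print(causal_probabilities(p0=0.001, p1=0.004))  # -> (0.75, 0.003..., 0.003)
```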

So far, most case studies have used large ensembles of climate model simulations in order to estimate p 1 and p 0, based on a variety of methods. However, this general approach has a very high computational cost and is difficult to implement in a timely and systematic way. As recognized by Stott et al. (2013), this remains an open problem: “the overarching challenge for the community is to move beyond research-mode case studies and to develop systems that can deliver regular, reliable and timely assessments in the aftermath of notable weather and climate-related events, typically in the weeks or months following (and not many years later as is the case with some research-mode studies)”. Several research initiatives are presently addressing this real-time attribution challenge: for instance, the weather@home system (Massey et al. 2014), in the context of the World Weather Attribution initiative (http://www.climatecentral.org/wwa), the system proposed by Christidis et al. (2013), and the Weather Risk Attribution Forecast system (http://www.csag.uct.ac.za/~daithi/forecast/) all aim at meeting those requirements within the conventional ensemble-based approach.

The purpose of this article is to introduce a new methodological approach that addresses the latter overarching operational challenge. Our proposal relies on a class of powerful statistical methods for interfacing high-dimensional models with large observational datasets. This class of methods originates from the field of weather forecasting and is referred to as data assimilation (DA) (Bengtsson et al. 1981; Ghil and Malanotte-Rizzoli 1991; Talagrand 1997).

Section 2 explains the rationale of the approach proposed herein, presents a brief overview of DA, and outlines the most prominent technical features of a “data assimilation–based detection and attribution” (DADA) approach. Section 3 illustrates the proposal by implementing it on a version of the classical Lorenz convection model (Lorenz 1963, L63 hereafter) subject to an additional constant force. Finally, in Section 4, we discuss the main strengths and limitations of the DADA approach, and highlight several theoretical and practical research questions that need to be addressed to make it potentially operational within weather forecasting centers in the near future.

2 Methodology

2.1 General rationale

In an operational context, a significant difficulty of PEA is that events of interest are usually rare, i.e. they occur in regions of the climate system’s attractor that are reached quite rarely. It may hence require a very large ensemble of simulations for the numerical model representing the climate system to reach the relevant region of the attractor. This requirement is particularly relevant if the event is defined narrowly, based on multiple features that might involve some combination of the atmospheric circulation, of the climate system’s thermodynamic state, and of the impacts associated with the event. Simulating a sufficiently large number of occurrences of such an event for a robust evaluation of p 1 and p 0 may then be computationally very costly, and a brute force approach based on an unconstrained ensemble may become unaffordable in an operational context.

The first general idea underlying the DADA proposal is that this computational burden may be greatly reduced by constraining the model to explore only the relevant region of its state space, where the event under scrutiny is defined to occur. Such a selective exploration of a high-dimensional state space is not new. The constrained simulation of very rare events using complex dynamical models has been studied extensively (e.g., Harris and Kahn 1951; Del Moral and Garnier 2005) and is referred to as Rare Event Sampling (RES). RES methods are based on importance sampling and probabilistic large-deviation theory (Bucklew 2004), and they are commonly used in several areas — such as queueing, reliability and telecommunications (Heidelberger 1995) — but their adaptation to a climate context has only recently started (Wouters and Bouchet 2015).

The second general idea of the DADA proposal is to take a shortcut along this path: DA methods present the key advantage of being already operational in weather forecasting centers to routinely update an atmospheric model with new observations in order to initialize the forecast, and we argue that they can be used simultaneously to solve the class of problems addressed by RES methods. Carrassi et al. (2008, and references therein) have already used a similarly selective exploration of a reduced number of phase space dimensions in the context of DA methods designed to control chaotic dynamics.

For the purposes of PEA, we show that, by assimilating the observed trajectory of an event into a model, one can obtain as a by-product the probability density function (PDF) associated with this trajectory. PEA then proceeds by assimilating the observations of the event twice, first in the factual setting of the model and second in its counterfactual setting, and by computing the FAR from the ratio of the two PDF values thus obtained.

Heuristically speaking, if an observed event is incompatible with the counterfactual world but compatible with the factual one — according to the standard approach of defining the existence of a causal link (Pearl 2000; Hannart et al. 2015) — then assimilation will act as a crucial experiment, since the event’s observed trajectory will be easy to assimilate in the factual setting and difficult to assimilate in the counterfactual one, merely because the counterfactual setting physically precludes the existence of such a trajectory.

In Section 2.2, we formulate this general rationale in probabilistic terms and discuss the relevance of the approach. We then show in Section 2.3 that, under a set of hypotheses similar to those that underlie the majority of operational DA methods, it is possible to quantify the extent to which an observed trajectory is compatible with the model physics — either factual or counterfactual — or not. This quantification in an operational context is at the core of the DADA approach, and it greatly facilitates real-time PEA.

2.2 Probabilistic description of the method

Let \(\mathbf{y}_t\) denote the d-dimensional vector of observations at discrete times t = 0, 1, …, T. Here, \(\mathbf{y} = \{\mathbf{y}_t : 0 \leq t \leq T\}\) corresponds, for instance, to the full set of available meteorological observations over a time interval covering the event of interest, regardless of the diversity and source of the data; typically, the latter include ground station networks, satellite measurements, ship data, and so on, cf. Bengtsson et al. (1981, Preface, Fig. 1) or Ghil and Malanotte-Rizzoli (1991, Fig. 1). In the present probabilistic context of PEA, the observed trajectory \(\mathbf{y}\) is viewed as a realization of a random variable denoted \(\mathbf{Y} = \{\mathbf{Y}_t : 0 \leq t \leq T\}\), i.e. there exists an ω ∈ Ω such that Y(ω) = y, where Ω denotes the sample space of all possible outcomes and encompasses observational error as well as internal variability.

In event attribution studies, it is recognized that defining the occurrence of an event, i.e. selecting a subset \(\mathcal {F}\subset {\Omega }\), depends on a rather arbitrary choice. Yet this choice has been shown to greatly affect causal conclusions (Hannart et al. 2015). For instance, a generic and fairly loose event definition is arguably prone to yield a low level of evidence with respect to both necessary and sufficient causality while, on the other hand, a tighter and more specific event definition is prone to yield a stringent level for necessary causality but a reduced one for sufficient causality.

Indeed, it is quite intuitive that many different factors are usually necessary to trigger the occurrence of a highly specific event and, conversely, that no single factor will ever hold as a sufficient explanation thereof. For the class of unusual events at stake in PEA, where both \(p_0\) and \(p_1\) are very small, we arguably lean towards specific definitions that inherently result in few sufficient causal factors or none. This conclusion follows immediately from Eq. 2b, which yields PS ≃ 0 when both \(p_0\) and \(p_1\) are very small.

Usually, an event occurrence is defined in PEA based on an ad hoc scalar index ϕ(Y) exceeding a threshold u, i.e. \(p_i = P(\phi(\mathbf{Y}) \geq u)\); from now on, we associate i=0 with the counterfactual and i=1 with the factual world. While this definition may already be quite restrictive for large u, it is defensible to restrict the event definition even further. Such a strategy may reduce an already negligible PS, but it may also increase PN by a greater amount; one thus expects to gain more than is lost in this trade-off. In particular, this will be the case if additional features, not accounted for in ϕ(Y), can be identified that allow one to further discriminate between the two worlds.

Following this strategy, a central element of our proposal is to use the tightest possible definition of event occurrence, i.e. the trajectory y exactly as it was observed, namely the singleton event {Y = y}. This singleton event has probability zero in both worlds, i.e. \(p_1 = p_0 = 0\). Indeed, the full sequence of observations y, exactly as it occurred, is unique. Quoting the Greek philosopher Heraclitus, “You cannot step into the same river twice, for other waters are continually flowing in”: the exact same sequence y never occurred before and will never occur again. Our proposed singleton event definition may thus arguably match the suggestion of Trenberth et al. (2015) that “a different framing is desirable which asks why extremes unfold the way they do”, in so far as it focuses on the event exactly as it happened and is thereby able to spot the detailed physical features of the event that made it “unfold the way it did”. However, by contrast with Trenberth et al. (2015), our proposed singleton event definition is not conditional on the circulation: the observed vector y may well include circulation-related observations.

One may find it surprising that a causal analysis of such a zero-probability event is possible. However, in the context of the aforementioned causal theory, such an analysis is both possible and meaningful. Indeed, the fact that \(p_1\) and \(p_0\) are null does not imply that the associated probability of necessary causation PN is null. Generally speaking, the ratio of two quantities that both tend to zero may well converge to a finite quantity (e.g. the derivative of a differentiable function). Likewise, here the singleton set {Y = y} may be viewed as the limit of the ball of radius r centered at y as the radius r tends to zero, i.e. \(\{ \mathbf {Y}=\mathbf {y}\}=\lim _{r\rightarrow 0} \{\Vert \mathbf {Y}-\mathbf {y}\Vert \leq r\}\). It is clear that when \(r\rightarrow 0\), then \(p_{0}\rightarrow 0\) and \(p_{1}\rightarrow 0\). It is also straightforward to show that the limit of PN = 1 − p_0/p_1 is then finite. More specifically, we have:

$$ \text{PN} =1 - \frac{f_{0}(\mathbf{y})}{f_{1}(\mathbf{y})} $$
(3)

where \(f_i\) denotes the PDF of Y in world i. By contrast, the quantity 1 − (1 − p_1)/(1 − p_0) converges to zero when p_0 and p_1 tend to zero; thus the probability of sufficient causation PS associated with the singleton event {Y = y} is always zero. Our DADA proposal thus intentionally sacrifices the evidence of sufficiency, in the hope of maximizing the evidence of necessity.
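To make the limiting argument behind Eq. 3 explicit (a standard step, spelled out here for convenience and assuming the densities \(f_i\) are continuous at y with \(f_1(\mathbf{y}) > 0\)):

$$ p_i(r) = \int_{\Vert \mathbf{y}^{\prime}-\mathbf{y}\Vert \leq r} f_i(\mathbf{y}^{\prime})\,\mathrm{d}\mathbf{y}^{\prime} \;\sim\; f_i(\mathbf{y})\,V(r) \quad \text{as } r \rightarrow 0, \qquad \text{hence} \quad 1-\frac{p_0(r)}{p_1(r)} \;\longrightarrow\; 1-\frac{f_0(\mathbf{y})}{f_1(\mathbf{y})}, $$

where V(r) is the volume of the ball of radius r; the common factor V(r) cancels in the ratio, which yields Eq. 3.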

Our betting on the singleton set is thus justifiable on the basis of the above theoretical considerations alone. Moreover, this choice also has a highly simplifying implication from a practical standpoint. Evaluating the PDF of Y at the single point Y = y is, under many circumstances, considerably easier than evaluating the probability P(ϕ(Y) ≥ u) required in the conventional approach. Appendix A gives a concrete illustration of this situation, and Fig. 1 shows the details of the latter evaluation for a scalar AR(1) process (panel a), as well as its associated accuracy (panels b and c) and the computational cost as the sample size n varies (panel d); this cost is much larger than that of the DADA approach, which consists in evaluating the PDF at a single point. This simple example confirms the large computational discrepancy between the two approaches. The reason for the discrepancy is quite simple: evaluating the conventional probability requires integrating a PDF over a predefined domain, instead of a one-off evaluation at a single point. Because both the domain of integration and the PDF may have potentially complex shapes, one cannot expect, in general, the requisite integral to be amenable to analytical treatment. Hence numerical integration is the default option: no matter how efficient an integration scheme one applies, it will require evaluating the PDF at many points, and it is thus correspondingly more costly than evaluating f(y) once.

Fig. 1

Illustration of the conventional PEA approach as applied to a univariate AR(1) process. a Observed time series (first component Y 1, dotted line) and daily average ϕ(Y) (heavy solid line) over the first three days. b Threshold level (vertical axis) as a function of the return period (horizontal axis): simulated values (crosses); fit based on the Generalized Pareto distribution (GPD, heavy dark-blue line); uncertainty range at the 95 % level (light blue area); and threshold value u=3.1 (light solid black line). c Estimated value of P = P(ϕ(Y)≥u) (heavy dark-blue line) using a GPD fit, as a function of the sample size n (horizontal axis); uncertainty range (light blue area); and true value P=0.01 (light solid black line). d Computational time on a desktop computer (seconds, vertical axis) as a function of sample size n (horizontal axis) required by the conventional method (dark blue line) and the DADA method (solid red line); the latter method is explained in Sections 2.2 and 3 below
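To give a feel for the computational contrast just discussed, the following Python sketch compares the two routes on a toy AR(1) process. The parameter values, record length, threshold and index definition are illustrative assumptions of ours and do not reproduce the exact setup of Appendix A or Fig. 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
phi_ar, sigma = 0.9, 1.0      # AR(1) coefficient and innovation standard deviation
T, n_days = 72, 3             # e.g. hourly values over three days
u = 3.1                       # threshold on the daily average

def simulate_ar1(length):
    y = np.empty(length)
    y[0] = rng.normal(scale=sigma / np.sqrt(1.0 - phi_ar**2))   # stationary start
    for t in range(1, length):
        y[t] = phi_ar * y[t - 1] + sigma * rng.normal()
    return y

def phi_index(y):
    """phi(Y): largest daily average over the record."""
    return y.reshape(n_days, -1).mean(axis=1).max()

# Conventional route: Monte Carlo estimate of p = P(phi(Y) >= u),
# which requires simulating many full trajectories.
n = 20_000                    # illustrative sample size
p_hat = np.mean([phi_index(simulate_ar1(T)) >= u for _ in range(n)])

# DADA-like route: a one-off evaluation of the joint density f(y) at a single
# trajectory, using the Markov factorization of the AR(1) process.
y_obs = simulate_ar1(T)       # stands in for the observed record
log_f = stats.norm.logpdf(y_obs[0], scale=sigma / np.sqrt(1.0 - phi_ar**2))
log_f += stats.norm.logpdf(y_obs[1:], loc=phi_ar * y_obs[:-1], scale=sigma).sum()
print(p_hat, log_f)
```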

In order to obtain the PDF of Y, the class of dynamic statistical models referred to as Hidden Markov Models (HMMs; e.g. Ihler et al. 2007) is relevant in the context of PEA. Indeed, the dynamics of a climate event can usually be represented by using a numerical climate model. Denoting by \(\mathbf{X}_t\) the N-dimensional state vector of the numerical model at time t, we can assume:

$$ \mathbf{X}_{t+1} = {\mathrm{M}(\mathbf{X}_{t},\mathbf{F}_{t})} + \mathbf{v}_{t}\,, $$
(4a)
$$ \mathbf{Y}_{t}=\mathrm{H}(\mathbf{X}_{t})+\mathbf{w}_{t}\, $$
(4b)

where Eq. 4a describes the dynamics of the state vector, with M the numerical model operator, \(\mathbf{v}_t\) a stochastic term representing modeling error, and \(\mathbf{F}_t\) a prescribed forcing. Equation 4b maps the state vector \(\mathbf{X}_t\) to our observations \(\mathbf{Y}_t\) at any time t, where H is the so-called observation or forward operator and \(\mathbf{w}_t\) is a stochastic term representing observational error. The problem of interest here is thus to derive the likelihoods \(f_0(\mathbf{y})\) and \(f_1(\mathbf{y})\) of the observation y under the counterfactual and factual forcings, respectively, by using the HMM setting of Eqs. 4a and 4b.

DA can be viewed as a class of inference methods designed for the above HMM setting. While inferring the unknown state trajectory X from the observed trajectory y is the main focus of DA, the likelihood f(y) can also be obtained as a by-product, as we clarify below. Therefore, since DA can provide the two likelihoods \(f_0(\mathbf{y})\) and \(f_1(\mathbf{y})\), which are the keys to causal attribution in our approach, it opens the way to near-real-time, systematic causal attribution of weather- and climate-related events.

2.3 Brief overview of data assimilation

DA was initially developed in the context of numerical weather forecasting, in order to initialize the model’s state variables X based on observations y that are incomplete, diverse, unevenly distributed in space and time and are contaminated by measurement error (Bengtsson et al. 1981; Talagrand 1997). Over the past decades, those methods have grown out of their original application field to reach a wide variety of topics in geophysics such as oceanography (Ghil and Malanotte-Rizzoli 1991), atmospheric chemistry, geomagnetism, hydrology, and space physics, among many other areas (Robert et al. 2006; Cosme et al. 2006; Kondrashov et al. 2011; Bocquet 2012; Martin et al. 2014).

DA is already playing an increasing role in the climate sciences, having been applied, for instance, to initialize a climate model for seasonal or decadal prediction (Balmaseda et al. 2009), to constrain a climate model’s parameters (Kondrashov et al. 2008; Ruiz et al. 2013), to infer carbon cycle fluxes from atmospheric concentrations (Chevallier 2013), or to reconstruct paleoclimatic fields out of sparse and indirect observations (Bhend et al. 2012; Roques et al. 2014). In the context of D&A, Lee et al. (2008) actually tested a DA-like approach to include the effects of the various forcings over the last millennium, in addition to other paleoclimate proxy data, in a combined climate reconstruction and detection analysis. The present work thus follows a general trend in climate studies.

Methodologically speaking, DA methods are traditionally grouped into two categories: sequential and variational (Ide et al. 1997, and references therein). Here, we concentrate on the sequential approach, but the two approaches are complementary and the choice of method depends on the specifics of the problem at hand (Ghil and Malanotte-Rizzoli 1991; Ide et al. 1997; Talagrand 1997). In the sequential approach (Ghil et al. 1981), the state estimate and a suitable estimate of the associated error covariance matrix are propagated in time until new observations become available and are used to update the state estimate. In practice, the evolution of the system of interest is retrieved — like in earlier, typically much smaller-dimensional applications (Kalman 1960; Jazwinski 1970; Gelb 1974) — through a sequence of prediction and analysis steps.

Abundant literature is available on DA and on Kalman-type filters. Kalman (1960) first presented the solution in discrete time for the case in which both the dynamic evolution operator M in Eq. 4a and the observation operator H in Eq. 4b are linear, and the errors are Gaussian. Under these assumptions, the state-estimation problem for the system given by Eq. 4a and 4b has an exact solution given by the sequential Kalman filter (KF) equations (Appendix B). Further, the likelihood function f(y), which is of primary importance for DADA, also has an exact expression under the above linearity and Gaussianity assumptions (Tandeo et al. 2014). Following the usual notations of DA, which are detailed in Appendix B, the expression of the likelihood is given by:

$$ f(\mathbf{y}) = \prod\limits_{t=0}^{T} \,(2\pi)^{-\frac{d}{2}}\,\vert \mathbf{\Sigma}_{t}\vert^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}\left(\mathbf{y}_{t}-\mathbf{H}\mathbf{x}_{t}^{f}\right)^{\prime}\,\mathbf{\Sigma}_{t}^{-1}\left(\mathbf{y}_{t}-\mathbf{H}\mathbf{x}_{t}^{f}\right)\right\} $$
(5)

with \(\mathbf {\Sigma }_{t} = \mathbf {H}\mathbf {P}_{t}^{f}\mathbf {H}^{\prime }+\mathbf {R}\). The proof of Eq. 5 is provided in Appendix C, and f(y) is typically computed by taking the logarithm of this equation to turn the product on the right-hand side into a sum.

The main interest of Eq. 5 is that, once the observations y t have been assimilated on the interval 0≤tT, the necessary ingredients \({\mathbf {x}^{f}_{t}}\) and \(\mathbf {P^{f}_{t}}\) in Eq. 5 are available from the KF equations (Appendix B) and thus calculating f(y) is both straightforward and computationally inexpensive. The fundamental connections between this calculation, the HMM context, and Bayes theorem are further clarified in Appendix C.
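To make this concrete, here is a minimal linear-Gaussian Kalman filter that accumulates the logarithm of Eq. 5 alongside the usual forecast and analysis steps. The interface (function and argument names) is our own and is only a sketch, not the implementation used in the appendices or in any operational system.

```python
import numpy as np

def kf_log_evidence(y, M, Q, H, R, x0, P0):
    """Exact log f(y) for the linear-Gaussian HMM of Eqs. 4a-4b, accumulated as in Eq. 5.

    y      : (T+1, d) array of observations y_t
    M, Q   : (n, n) linear model operator and model-error covariance
    H, R   : (d, n) observation operator and (d, d) observation-error covariance
    x0, P0 : forecast (prior) mean and covariance at t = 0
    """
    xf, Pf = np.array(x0, dtype=float), np.array(P0, dtype=float)
    d = R.shape[0]
    log_f = 0.0
    for t in range(y.shape[0]):
        # Innovation statistics: Sigma_t = H P_t^f H' + R
        S = H @ Pf @ H.T + R
        innov = y[t] - H @ xf
        _, logdet = np.linalg.slogdet(S)
        log_f += -0.5 * (d * np.log(2.0 * np.pi) + logdet
                         + innov @ np.linalg.solve(S, innov))
        # Analysis (update) step of the Kalman filter
        K = Pf @ H.T @ np.linalg.inv(S)
        xa = xf + K @ innov
        Pa = (np.eye(len(xf)) - K @ H) @ Pf
        # Forecast (prediction) step to t + 1
        xf = M @ xa
        Pf = M @ Pa @ M.T + Q
    return log_f
```

In the DADA context, the factual and counterfactual worlds would correspond to two different model settings (e.g. an additional constant forcing term added to the forecast step); running the filter twice on the same y then yields log f_1(y) and log f_0(y), and PN follows from Eq. 3.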

Many difficulties arise in applying the simple ideas outlined here to geophysical models, which are typically nonlinear, have non-Gaussian errors and are huge in size (Ghil and Malanotte-Rizzoli 1991). Most of these difficulties have been addressed by improving both sequential and variational methods in several ingenious ways (Bocquet et al. 2010; Kondrashov et al. 2011).

In particular, the Ensemble Kalman Filter (EnKF; Evensen, 2003) — in which the uncertainty propagation is evaluated by using a finite-size ensemble of trajectories — is now operational in numerical weather and oceanic prediction centers worldwide; see e.g. Houtekamer et al. (2005); Sakov et al. (2012). The EnKF is a convenient approximate solution to the filtering problem in a nonlinear, large-dimensional context. We simply note here that it can also be applied to obtain an approximation of the likelihood f(y) by substituting the approximate sequence \(\{ (\hat {x}^{f}_{t}, \hat {\mathbf P}^{f}_{t}): t = 0, \ldots , T \}\) that the EnKF produces into Eq. 5. This strategy is illustrated immediately below in the context of the L63 convection model subject to an additional constant force.
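In the same spirit, a bare-bones stochastic EnKF (perturbed observations, linear H, no inflation or localization) can produce the approximate forecast moments to substitute into Eq. 5 when the model is nonlinear. This is only a sketch with a generic model_step argument, not the operational algorithms cited above.

```python
import numpy as np

def enkf_log_evidence(y, model_step, H, R, ens0, rng):
    """Approximate log f(y) by substituting EnKF forecast moments into Eq. 5.

    y          : (T+1, d) array of observations
    model_step : callable mapping one state vector to the next (model error included)
    H, R       : (d, n) linear observation operator and (d, d) observation-error covariance
    ens0       : (m, n) initial ensemble drawn from the prior on x_0
    """
    E = np.array(ens0, dtype=float)
    m, _ = E.shape
    d = R.shape[0]
    log_f = 0.0
    for t in range(y.shape[0]):
        if t > 0:
            E = np.array([model_step(x) for x in E])     # forecast each member
        xf = E.mean(axis=0)
        Pf = np.cov(E, rowvar=False)                     # sample forecast covariance
        S = H @ Pf @ H.T + R                             # approximate Sigma_t
        innov = y[t] - H @ xf
        _, logdet = np.linalg.slogdet(S)
        log_f += -0.5 * (d * np.log(2.0 * np.pi) + logdet
                         + innov @ np.linalg.solve(S, innov))
        # Stochastic analysis step with perturbed observations
        K = Pf @ H.T @ np.linalg.inv(S)
        Yp = y[t] + rng.multivariate_normal(np.zeros(d), R, size=m)
        E = E + (Yp - E @ H.T) @ K.T
    return log_f
```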

3 Implementation within the modified L63 model

3.1 The modified model and its two worlds

A simple modification (Palmer 1999) of the L63 model (Lorenz 1963) has been extensively used for the purpose of illustrating methodological developments in both DA and PEA (e.g. Carrassi and Vannitsem, 2010; Stone and Allen, 2005). In the nonlinear, coupled system of three ordinary differential equations (ODEs) for x,y and z below,

$$ \frac{\mathrm{d}x}{\mathrm{d}t} = \sigma(y-x) + \lambda_{i}\cos\theta_{i}\,,\quad \frac{\mathrm{d}y}{\mathrm{d}t} = \rho x - y - x z + \lambda_{i}\sin\theta_{i}\,,\quad \frac{\mathrm{d}z}{\mathrm{d}t} = x y - \beta z\, $$
(6)

the time-constant forcing terms in the x- and y-equations represent, in fact, an addition to the forcing hidden in the original L63 model. The latter forcing is revealed by a well-known linear change of variables, in which x and y are left unchanged and \(z \mapsto z + \rho + \sigma\) (Lorenz 1963). In the new variables, the model of Eq. 6 takes the canonical form of a forced-dissipative system (Ghil and Childress 1987, Sec. 5.4), with an extra forcing term \(-\beta(\rho + \sigma)\) in the z-equation, just like the original L63 model.

Here λ i is the intensity of the additional forcing and 𝜃 i is its direction in world i=0,1: i.e., λ 0=0 represents a counterfactual world with no additional forcing, while λ 1≠0. We take the parameters (σ,ρ,β) to equal their usual values (10,28,8/3) that yield the well-known chaotic behavior, and the (nondimensional) time unit t is interpreted as equaling days.

The ODE system given by Eq. 6 is discretized by using Δt=0.01, and t refers hereafter to the number of time increments Δt. This system is then turned into an HMM, as described in Eq. 4a, by adding an error term v t assumed to be Gaussian and centered with covariance \(\mathbf Q = {\sigma _{Q}^{2}}\,\mathbf I\), where I is the 3×3 identity matrix. Furthermore, we assume that all three coordinates (x,y,z) of the state vector are observed, i.e. that H = I, and that the measurement error term w t is also Gaussian and centered, with covariance \(\mathbf R = {\sigma _{R}^{2}}\,\mathbf I\).
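For concreteness, one way to discretize Eq. 6 into the HMM just described is sketched below (forward Euler step with Δt = 0.01, additive Gaussian model noise, H = I). The initial state and the default noise levels are illustrative assumptions, not the exact values used in our experiments.

```python
import numpy as np

def l63_step(x, lam, theta, dt=0.01, sigma_q=0.1, rng=None,
             sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One Euler step of the modified L63 model of Eq. 6, plus model noise v_t.

    lam, theta : intensity and direction (in radians) of the additional constant
                 forcing; lam = 0 gives the counterfactual world."""
    rng = rng if rng is not None else np.random.default_rng()
    dx = sigma * (x[1] - x[0]) + lam * np.cos(theta)
    dy = rho * x[0] - x[1] - x[0] * x[2] + lam * np.sin(theta)
    dz = x[0] * x[1] - beta * x[2]
    return x + dt * np.array([dx, dy, dz]) + rng.normal(0.0, sigma_q, size=3)

def simulate(T, lam, theta, sigma_q=0.1, sigma_r=0.5, seed=0):
    """Simulate a state trajectory x_t and noisy observations y_t (H = I)."""
    rng = np.random.default_rng(seed)
    x = np.array([1.0, 1.0, 25.0])      # arbitrary initial state
    xs, ys = [], []
    for _ in range(T + 1):
        x = l63_step(x, lam, theta, sigma_q=sigma_q, rng=rng)
        xs.append(x)
        ys.append(x + rng.normal(0.0, sigma_r, size=3))   # w_t ~ N(0, sigma_R^2 I)
    return np.array(xs), np.array(ys)
```

Long runs of simulate with lam > 0 and lam = 0 would then provide the factual and counterfactual ensembles whose PDFs are shown in Fig. 2.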

The HMM defined above is stationary, i.e. the PDF of the observed vector \(\mathbf{y}_t\) depends neither on t nor on the initial condition after a sufficiently long time (Appendix D). In the factual world, the shape of the PDF is affected by the parameters \((\lambda_1, \theta_1)\) of the forcing. In both worlds, the PDFs can be estimated, for instance, by applying kernel density estimation to ensembles of simulations obtained for either forcing. In Fig. 2a and b, we plot the projections of both PDFs onto the plane associated with the greatest variance in the factual PDF. The difference between the two PDFs is shown in Fig. 2c; it emphasizes the existence of an area of the state space (represented in white) which is more likely to be reached in the factual world than in the counterfactual one.

Fig. 2

Two-dimensional (2-D) projections of the PDF of the modified L63 model; the projection is onto a plane defined by the two leading eigenvectors of the factual PDF shown in the first panel. a PDF of the factual attractor, with λ 1=20 and σ Q =0.1; and b PDF of the counterfactual attractor, with λ 0=0. c Difference between the factual and counterfactual PDFs. d Sample trajectories associated with an event occurrence originating from the factual (red solid lines) and counterfactual worlds (green solid lines); the vertical dashed line in all four panels indicates the threshold u with respect to the horizontal axis of largest variance in the factual PDF

Next, we define an event to occur for the sequence \(\{\mathbf{y}_t : t = 0,\ldots,T\}\) if the projection of \(\mathbf{y}_t\) onto a specified direction ϕ, i.e. the scalar product \(\hat{\phi}^{\prime} \mathbf{y}_{t}\) with the unit vector \(\hat{\phi}\), exceeds a threshold u for some 0 ≤ t ≤ T; the threshold u is chosen, given ϕ, so that p 1=0.01. Figure 2d shows a selection of sequences from both worlds in which an event did occur, with ϕ chosen to be the leading direction in the projection plane.

For this choice of ϕ, the trajectories associated with event occurrence happen to all lie in the area of the state space which is more likely to be reached in the factual world than in the counterfactual one. Accordingly, the probability of the event in the former is found to be higher than in the latter, i.e. p 1>p 0, and the occurrence of an event \(\{\max _{\{0 \leq t \leq T\}} \phi ^{\prime }\mathbf {y}_{t}\geq u\}\) is thereby informative from a causal perspective.

Figure 2d also shows that the trajectories associated with the event in the two worlds — counterfactual (green) and factual (red) — appear to have slightly distinct features: the red trajectories are shifted towards higher values along the second direction, i.e. the one with the second-largest variance. Such distinctions might help discriminate further between the two worlds in the DADA framework; the circumstances under which such further discrimination is helpful will be discussed in Section 4.

3.2 DADA for the modified L63 model

The DADA procedure is illustrated in Fig. 3. Panel (a) shows a trajectory of the state vector \(\mathbf{x}_t\) simulated under factual conditions, i.e. in the presence of the additional forcing (black solid line), along with the observations \(\{\mathbf{y}_t : 0 \leq t \leq T\}\) (gray dots), with T=400. The EnKF is used to assimilate these observations into a factual model (i=1) that thus matches the true model M=M1=M(λ 1,𝜃 1) used for the simulation: a reconstructed trajectory is obtained from the corresponding analyses \(\mathbf {x}^{a}_{t}\) (red solid line in panel (a)), cf. Eq. 8a, and the likelihoods \(f_1(\mathbf{y}_t)\) (red solid line in panel (c)) are obtained by applying Eq. 5.

Fig. 3

Sample trajectories from data assimilation (DA) in our modified L63 model. a True trajectory (black solid line) and the two trajectories reconstructed by DA in the factual (i=1) and counterfactual (i=0) worlds (red and green solid lines), respectively, over a long sequence, T=400; the values of λ 1 and 𝜃 1 here are the same as in Fig. 2, and the assimilated observations are shown as gray dots. b Same as panel (a) but zoomed over a short sequence, T=20. c Logarithm of the cumulative evidences f 1(y) and f 0(y) (red and green lines, respectively), computed over the window [0,t], t ≤ T; gray bars indicate the instantaneous differences between f 1(y t ) and f 0(y t ). d PN computed over the window [0,t]

Next, the assimilation is repeated in the counterfactual model (i=0, i.e. λ=0) to obtain a second analysis of the trajectory from the same observations; see the green solid line in panel (a), for T=400. The corresponding likelihoods \(f_0(\mathbf{y}_t)\) are shown in panel (c) as a green solid line. Comparing the trajectories of the two analyses in Fig. 3a shows that, even though the counterfactual analysis (green line) uses the same data as the factual analysis (red line), the latter lies closer to the true trajectory (black line).

The local discrepancies between the trajectories estimated in the two worlds appear rather small at first glance, cf. panel (a), and so are the instantaneous differences between the associated factors on the right-hand side of Eq. 5; the latter are shown as gray rectangles in panel (c) of the figure. Still, the evidence in favor of the factual world accumulates as the time t over which the two trajectories differ, albeit by a small amount, lengthens. This cumulative difference in evidence, \(\log f_{1} - \log f_{0}\) computed over the window [0,t], is reflected by a growing gap between the two curves, red and green, in panel (c), and by the associated overall growth over time of the probability PN of necessary causation, cf. the black solid line in panel (d).
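In practice the two evidences are best compared in log space, as noted after Eq. 5. Assuming log f_0(y) and log f_1(y) have been obtained from two assimilation runs (for instance with the EnKF sketch of Section 2.3), PN follows directly; the helper below is a minimal sketch of ours, not part of the procedure described in the appendices.

```python
import numpy as np

def dada_pn(log_f0, log_f1):
    """PN of Eq. 3 from the two log evidences: max(1 - f0(y)/f1(y), 0).

    Working with logarithms avoids underflow, since each evidence is a product
    of T+1 typically very small factors (Eq. 5)."""
    return max(1.0 - np.exp(log_f0 - log_f1), 0.0)
```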

In order to evaluate its performance and robustness more systematically, compared to the conventional FAR approach, the DADA procedure was applied to a large sample of sequences y t of length T=20 simulated under diverse conditions. The sample explored all possible combinations of the triplet of parameters (λ 1,σ Q ,σ R ), with ten equidistributed values each, for a total of \(10^3\) combinations; the ranges were 0≤λ 1≤40, 0.1≤σ Q ≤0.5 and 0.1≤σ R ≤1.0, respectively, with 𝜃 1=−140. For each combination of (λ 1,σ Q ,σ R ), ten directions ϕ were randomly generated and u was defined based on ϕ as in Section 3.1 above, so as to achieve p 1≥0.01.

In order to estimate the corresponding conventional probabilities p 0 and p 1 of the associated event, defined as \(\{\max _{\{0 \leq t \leq T\}} \phi ^{\prime }\mathbf {y}_{t}\geq u\}\), n=50 000 sequences y t of length T=20 were simulated, by using a single sequence of length \(nT = 10^6\) and splitting it into n equal segments. The probabilities p 0 and p 1 were then estimated directly from empirical frequencies.

For each quintuplet of parameter values (λ 1,σ Q ,σ R ;ϕ,u), one hundred sequences of observations \(\{\mathbf{y}_t : t = 0,\ldots,T=20\}\) were generated, with a proportion p 1/(p 1 + p 0) simulated from the factual world and a proportion p 0/(p 1 + p 0) from the counterfactual one. All sequences were treated with the DADA procedure — by applying DA to the synthetic observations according to Eqs. 8a–8d — and Eq. 5 was then applied to obtain f 0(y) and f 1(y) from the reconstructed trajectories. The a priori mean and covariance \({x^{f}_{0}}\) and \(\mathbf {P}_{0}^{f}\) required as inputs to the DADA procedure were those associated with the PDF of the attractor, given the forcing conditions (λ 1∈[0,40],𝜃 1=−140) assumed for each assimilation experiment. As a result, two probabilities of necessary causation are finally obtained for each sequence y t : PN p =1−p 0/p 1 for the conventional approach and PN f =1−f 0(y)/f 1(y) for the DADA approach.

We next wish to evaluate under various conditions how well the two probabilities PN p and PN f perform with respect to discriminating between the factual and counterfactual forcings. Consider a simple discrimination rule whereby a trajectory y t is identified as factual for PN exceeding a given threshold, and as counterfactual otherwise. The so-called receiver operating characteristic (ROC) curve plots the rate of true positives as a function of the rate of false positives obtained when varying the threshold in a binary classification scheme from 0 to 1; it thus gives an overall visual representation of the skill of our PN as a discriminative score.

The Gini (1921) index G was originally introduced as a measure of statistical dispersion, intended to summarize the information contained in the Lorenz (1905) curve that represents the income distribution of a nation’s residents; G may be viewed, though, more generally as a metric summarizing the dispersion of any smooth curve that starts at the origin and ends at the point (1,1) with respect to the diagonal of the corresponding square. In particular, we use G here to summarize the ROC curve into a single scalar that ranges from 0 for random discrimination to 1 for perfect discrimination.
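A small sketch of this evaluation is given below, under the assumption (ours) that the G used here coincides with the standard ROC-based Gini coefficient, G = 2 AUC − 1.

```python
import numpy as np

def roc_and_gini(scores, labels):
    """ROC curve and Gini index for a score (here, PN) used to discriminate
    factual (label 1) from counterfactual (label 0) sequences.

    Assumes G = 2*AUC - 1, i.e. 0 for random and 1 for perfect discrimination,
    and that both classes are present in `labels`."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(-scores)               # sweep the threshold from high to low
    labels = labels[order]
    tpr = np.concatenate(([0.0], np.cumsum(labels == 1) / np.sum(labels == 1)))
    fpr = np.concatenate(([0.0], np.cumsum(labels == 0) / np.sum(labels == 0)))
    auc = np.trapz(tpr, fpr)                  # area under the ROC curve
    return fpr, tpr, 2.0 * auc - 1.0

# Usage: scores could be the PN values of either method, and labels the world
# (1 = factual, 0 = counterfactual) from which each sequence was simulated.
```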

Figure 4a shows ROC curves obtained over the entire sample of n=50 000 sequences: they correspond to G=0.35 for the conventional method and to G=0.82 for the DADA method, i.e. the overall performance gap is more than twofold. As expected, the performance of both methods is nil for λ 1=0 and it is very sensitive to the intensity of the forcing, cf. Fig. 4b.

Fig. 4

Performance of the DADA and conventional methods (red vs. blue solid lines, respectively). a Receiver operating characteristic (ROC) curve: true positive rate as a function of false positive rate, when varying the cut-off level u, as obtained from the entire sample of n=50 000 sequences; see text for details. b Gini index G as a function of forcing intensity λ 1. c Same as (b) for several values of σ Q and for DADA only, with the black arrow indicating the direction of growing σ Q . d Same as (b) but as a function of model error amplitude σ Q . e Same as (b) but as a function of observational error amplitude σ R . f Same as (b) as a function of the logarithmic contrast between the conventional probabilities \(\log p_{1}/p_{0}\)

Furthermore, the skill of the DADA method is boosted when decreasing the level of model error, cf. Fig. 4c; this is an expected result, since DA becomes more reliable when the model is more accurate, and when it is known to be so. Ultimately, under perfect model conditions, i.e. as σ Q →0, DADA reaches perfect discriminative power, with G →1, no matter how small, but still positive, the forcing is; see Fig. 4d. On the other hand, the level of observational error σ R appears to have but a limited effect on DADA performance for the range of values considered, cf. Fig. 4e.

Finally, Fig. 4f shows that both methods perform better when the contrast between p 0 and p 1 is strong, but this contrast does not influence the gap between the two methods, which remains nearly constant. This nearly constant gap thus appears to quantify the additional power resulting from the extra discriminative features that the PDF f(y) is able to capture, on top of those associated with the probability P(ϕ(y)≥u).

4 Discussion and conclusions

Considerations rooted in the causality theory of Pearl (2000) have shown that the ratio between the factual likelihood f 1(y) and the counterfactual likelihood f 0(y) is relevant in studying causal attribution of weather- and climate-related events. In this paper, we described data assimilation (DA) methods and demonstrated that they are well suited for deriving f 0(y) and f 1(y) from trajectories in the factual and the counterfactual worlds, respectively. Besides, these methods offer the key practical advantage of being already up-and-running in real time at meteorological centers.

Combining these two sets of considerations, theoretical and practical, opens a novel route towards real-time, systematic causal attribution of weather- and climate-related events, thereby addressing a key challenge in the field of PEA at present (Stott et al. 2013).

4.1 Theoretical considerations

Implementing the DADA approach in the context of the L63 model in Section 3 allowed for a detailed, step-by-step illustration of our methodological proposal. It also provided a basic test for an initial performance assessment, which showed an improved level of discriminating power with respect to the conventional approach outlined in Section 1. These results are promising, and their promise is easy to understand: the DADA approach leverages the available information on the entire trajectory y, as opposed to the single specific feature ϕ(y)≥u used in the conventional approach.

It is important, though, to stress that the term “performance” here should be considered with caution: improving discriminatory performance may or may not be a desirable outcome, depending on the causal question being asked. Hannart et al. (2015) and Otto et al. (2015) have shown that the causal question being formulated reflects the subjective interests of a particular class of end-users, and that the formulation itself may dramatically affect the answer.

For example, the question “did anthropogenic CO 2 emissions cause the heatwave observed over Argentina during January 2014?” has traditionally been treated by defining a “heatwave” in terms of a predefined temperature index reaching a predefined threshold, i.e., by a single index exceeding a single threshold. This class of questions matters, for instance, in the context of insurance disbursements, where a financial compensation may typically be triggered by such an index exceedance. In this situation, the additional discriminatory power of DADA is meaningless because the DADA computation does not address the question at stake: there is simply no alternative to computing the probabilities p 0 and p 1 of the index exceeding the threshold.

However, if the question is formulated instead as “did anthropogenic CO 2 emissions cause the atmospheric conditions observed over Argentina during January 2014?” — i.e., without specifying which feature of the observed sequence is most important — then improving discrimination makes perfect sense and DADA becomes fully relevant. Furthermore, DADA is still fully relevant even if the question is formulated more specifically as “did anthropogenic CO 2 emissions cause the damages generated in Argentina by the atmospheric conditions of January 2014?,” provided that a model relating atmospheric observations to damages at every time step t along the trajectory of the physical model used in the assimilation is available and can be integrated into the observation operator H.

On the other hand, the results of Section 3 should also be considered with caution, simply because the L63 testbed obviously differs in many respects from the real situation envisioned for future applications, both in terms of model dimension n and observation dimension d: in practice n will be very large and d ≪ n, while here we took d = n = 3.

In particular, choosing a highly idealized, climatological prior distribution π(x 0) on the initial condition does not raise any difficulty under the tested conditions, nor does it significantly influence the outcome of the procedure (not shown). The choice of π(x 0), however, may be an important problem in practice, when d ≪ n, and may lead to potentially spurious results.

As a consequence, it may be both necessary and useful to further constrain the so-called background PDF π(x 0) by using the forecasts originating from τ previous assimilation cycles, thus following the ideas of lagged-average forecasting (Hoffman and Kalnay 1983; Dalcher et al. 1988). The evidence thus obtained, though, will then also depend on previous observations over the “initialization” window [−τ,...,−1] — i.e., it will no longer represent exclusively the desired evidence f(y). Besides, choosing τ optimally, so as to constrain the initial background PDF in a satisfactory manner while limiting this unwanted dependence on previous observations, is a challenging question that needs to be addressed.

More generally, the problem of evaluating the evidence f(y) is not new in the HMM and DA literature; see, for instance, Baum et al. (1970), Hurzeler and Kunsch (2001), Pitt (2002) and Kantas et al. (2009). Various algorithms are thus available to carry out this evaluation, depending on a number of key assumptions — such as the lack of Gaussianity or linearity — and on the inferential setting chosen, e.g. particle filtering. These algorithms may provide accurate and effective solutions to the above problem, as well as improved alternatives to the Gaussian and linear approximation of Eq. 5, since the latter may not be sufficiently accurate for successfully implementing the DADA approach under realistic conditions.

4.2 Practical considerations

While we have shown here that the proposal of using DADA for event attribution has intellectual merit, its main strength lies, in our view, in down-to-earth cost considerations. By design, the DADA approach allows one to piggyback at low marginal cost on the large and powerful infrastructures already in place at several meteorological centers, in terms of both hardware and personnel. These centers are capable of processing massive amounts of observational data with high-throughput pipelines on the world’s largest computational platforms, as opposed to requiring the design, set-up and maintenance of a new and large, PEA-specific infrastructure to collect observations and generate — under real-time constraints — the many model simulations required by the conventional approach recalled in Section 1.

Taking a step back, it is useful to examine our proposal within the wider context of the emergence of so-called climate services. It is widely recognized that extending the scope of activity of meteorological centers from being “monoline” weather forecasting providers to becoming “multiline” climate services providers – encompassing, for instance, weather forecasting and weather event attribution as two service lines among several others – is a relevant strategic option (Hewitt et al. 2012). Such a strategy may foster the timely and cost-efficient emergence of the latter services by building upon technological and infrastructure synergies with the former. For these reasons, our proposal is particularly relevant for, and could contribute to, the implementation of the strategic option just outlined.

This being said, DADA can very well serve as a method for real-time event attribution even for hypothetical climate services providers that focus solely or mainly on longer time scales, beyond a month, a season or a year. In such a context, DADA may allow for the assimilation of a broader range of observations, in particular ocean observations; it may, in fact, be important to include the latter in the causal analysis when the event occurrence under scrutiny is defined over a sufficiently long time window.

Finally, it is important to remember that providing real-time attribution assessments is a major communication challenge, since different methods give different answers, and different definitions of a specific event may also affect the outcome of an assessment — as mentioned above and as discussed recently by Trenberth et al. (2015). Various recent examples, such as the ongoing California drought, have shown that divergences among experts may lead to confusion in the media and among stakeholders. In this respect, a detailed comparison of the DADA approach with other methods in realistic, real-time situations will be required before the method can be applied operationally.