
1 Introduction

Studying climate time series such as temperature and precipitation requires statistical modeling. Although the earth's climate has changed gradually in response to both natural and human-induced processes, it is known that the climate can also change abruptly, i.e., a large shift may occur and persist for years or longer. Examples of such changes include shifts in average temperature and in the patterns of storms, floods, or droughts over a widespread area (Lohmann 2009). Climatic records show that large and widespread abrupt changes have occurred repeatedly throughout the geological record (Alley et al. 2003). Many studies have analyzed climate time series in a stationary framework, i.e., with the statistical parameters assumed constant over time. However, the stationarity assumption is invalid for climate time series, given their various internal dynamics and external forcings (Milly et al. 2008). Thus, statistical techniques based on the stationarity assumption should be modified to reveal the characteristics of abrupt climate change. A nonstationary time series consists of a set of clusters (also called regimes, phases, or segments), while the model within each cluster is stationary. These clusters are separated in time by change points (breaks). The analysis of nonstationary time series, including finding the change points between the clusters, is an ongoing research area in climate data analysis.

Several approaches have been proposed in the literature for change detection in climate time series. A brute-force search over all candidate points can find the best change points (Liu et al. 2010); however, this method is not applicable to longer time series with many change points because of the huge volume of computation. Change points have also been estimated by Bayesian inference, where the change points and other model parameters are treated as random variables (Ruggieri 2013). Kehagias and Fortin (2006) used a method based on hidden Markov models, assuming that the time series was generated by a Markov process, with the unknown parameters determined by maximum likelihood. However, the statistical assumptions on the data and the change points in Bayesian and Markov methods may not hold in general. Several statistical tests have been used in atmospheric studies to find change points, such as the sequential Mann–Kendall, Bai–Perron, and Pettitt–Mann–Whitney tests; however, the results of these tests are valid only if the data are not serially correlated (Lyubchich et al. 2013). For correlated time series with only one change point, appropriate hypothesis tests have been introduced (Robbins et al. 2011).

Bounded-variation (BV) clustering (Metzner et al. 2012) is another technique for finding the change points in a nonstationary time series. In this method, instead of statistical assumptions on the data or the change points, a single reasonable assumption is made: the total number of change points between the clusters is bounded. BV clustering is computationally efficient and is also applicable to serially correlated time series. Assuming a different linear trend (Horenko 2010a) or a vector autoregressive model (Horenko 2010b) in each cluster, BV clustering has been used to analyze the climate dynamics in the ERA-40 reanalysis data. It was used in Gorji Sefidmazgi et al. (2014c) to analyze the climate variability of North Carolina, and the effect of covariates on nonstationary time series has also been analyzed with BV clustering (Gorji Sefidmazgi et al. 2014a; Horenko 2010b; Kaiser and Horenko 2014).

In this paper, we show the applicability of BV clustering through two numerical examples: the pattern of changes in the Pacific Decadal Oscillation (PDO) and the US land surface temperature. Moreover, the Bayesian information criterion (BIC) is applied to find the optimal numbers of change points and clusters.

2 Dataset

The bias-adjusted monthly average temperature of the US continental stations is taken from the US Historical Climatology Network database (http://cdiac.ornl.gov/epubs/ndp/ushcn/ushcn.html). The period 1900–2013 is selected, and stations with more than 4 consecutive months of missing data are eliminated. The missing data in the remaining 1,189 stations are then filled by interpolation, and the mean seasonal cycle of each time series is removed to eliminate the effect of seasonality.
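The seasonal-cycle removal can be sketched as follows (a minimal Python illustration; the station selection and interpolation steps are omitted, and the 12-month period is an assumption matching monthly data):

```python
import numpy as np

def remove_mean_cycle(x, period=12):
    """Subtract the mean of each calendar month (cycle phase) from a
    monthly series, leaving deseasonalized anomalies."""
    anomalies = np.asarray(x, dtype=float).copy()
    for phase in range(period):
        anomalies[phase::period] -= anomalies[phase::period].mean()
    return anomalies
```

For example, a series that is a pure 12-month cycle is reduced to zero anomalies by this step.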

The annual time series of the PDO for 1900–2013 is taken from the NOAA database (http://www.esrl.noaa.gov/psd/data/climateindices/list).

3 Method

BV clustering can be applied in two cases: when the model of the time series in each cluster is a mathematical function (such as a polynomial or a differential equation) or a statistical distribution (such as Gaussian or Gamma). The change points and the parameters of each cluster are determined by alternating between a least squares/maximum likelihood (LS/ML) step and a constrained optimization.

Let x(t) be a nonstationary time series with M clusters. The first case is when the model in each cluster is a function of time (and of other covariates u(t), if they exist), i.e., \( x(t)=f\left(x(t-1),\dots,x(t-p),u(t-1),\dots,u(t-n),t,\alpha_m\right) \). Here, f is the model of the time series and \( \alpha_m \) is the set of parameters of the mth cluster, where \( m\in\{1,\dots,M\} \). Also, p and n are the orders of the lagged outputs and the covariates, respectively. In the second case, f is a probability density function, i.e., \( P\left(X=x(t)\right)=f\left(x(t)\,|\,u(t),t,\alpha_m\right) \). In both cases, the model distance function \( d_m(x(t)) \) is the distance between the time series at time \( t\in\{1,\dots,T\} \) and the model of the mth cluster, which can be defined by the Euclidean distance or the likelihood function:

$$ d_m\left(x(t)\right)=\left\Vert x(t)-f\left(x(t-1),\dots,x(t-p),u(t-1),\dots,u(t-n),t,\alpha_m\right)\right\Vert^2 $$
(17.1)
$$ d_m\left(x(t)\right)=\ell\left(f\left(x(t)\,|\,u(t),t,\alpha_m\right)\right) $$
(17.2)

where ‖·‖ and ℓ(·) denote the L2-norm and the negative log-likelihood, respectively. Now, the change detection problem can be defined as a minimization:

$$ \min_{\mu,\,\alpha}\;\sum_{t=1}^{T}\sum_{m=1}^{M}\mu_m(t)\,d_m\left(x(t)\right) $$
(17.3)

In (17.3), \( \mu_m(t)\in\{0,1\} \) is the cluster membership function, indicating whether the datum at time t belongs to the mth cluster. The change points are the times at which the values of \( \mu_m(t) \) change. For example, if \( \mu_2(50)=0 \) and \( \mu_2(51)=1 \), then cluster 2 starts at the change point t = 51. Clearly, the datum at each time belongs to exactly one cluster, and hence,

$$ \sum_{m=1}^{M}\mu_m(t)=1\qquad t=1,\dots,T $$
(17.4)
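As a concrete sketch of (17.2)–(17.4) (assuming, for illustration, Gaussian cluster models with known parameters and a hard 0/1 membership matrix):

```python
import numpy as np

def gaussian_neg_loglik(x, mean, var):
    """Model distance (17.2): negative log-likelihood of x under N(mean, var)."""
    return 0.5 * (np.log(2 * np.pi) + np.log(var) + (x - mean) ** 2 / var)

def objective(x, mu, means, variances):
    """Objective (17.3): sum over t and m of mu_m(t) * d_m(x(t)), where mu is
    an M x T binary membership matrix whose columns sum to 1, as in (17.4)."""
    total = 0.0
    for m in range(mu.shape[0]):
        total += np.sum(mu[m] * gaussian_neg_loglik(x, means[m], variances[m]))
    return total
```

A membership matrix that assigns each datum to the nearer cluster mean yields a smaller objective value than a mismatched assignment.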

Now, we collect the model distance functions in \( D_m=\left[d_m\left(x(1)\right),d_m\left(x(2)\right),\dots,d_m\left(x(T)\right)\right] \) and the cluster membership functions in \( U_m=\left[\mu_m(1),\mu_m(2),\dots,\mu_m(T)\right] \) for \( m=1,\dots,M \). The optimization in (17.3) is then rewritten as (Metzner et al. 2012):

$$ \min_{U_m,\,D_m}\;\sum_{m=1}^{M}D_m\cdot U_m^{T} $$
(17.5)

BV clustering solves the optimization (17.5) in two iterative steps using a coordinate-descent algorithm. In the first step, the cluster membership functions \( \mu_m(t) \) are assumed known and the cluster parameters \( \alpha_m \) are found. In the second step, \( \alpha_m \) are fixed and \( \mu_m(t) \) is determined.

In the first step, assume that \( \mu_m(t) \) and the change points are known. The data belonging to each cluster are separated, and the parameters \( \alpha_m \) are found by LS/ML. Using the estimated parameters \( \alpha_m \), the model distance function \( d_m(x(t)) \) is computed from (17.1) or (17.2). In the next step, the cluster membership function \( \mu_m(t) \) must be found. The assumption that the number of change points is bounded is added to the problem formulation by imposing constraints on \( \mu_m(t) \). First, a binary indicator \( q_m(t)\in\{0,1\} \) is defined, which equals one when \( \mu_m(t) \) changes from 0 to 1 or vice versa (i.e., at the change points) (Metzner et al. 2012):

$$ \left|\mu_m(t+1)-\mu_m(t)\right|\le q_m(t)\;\rightarrow\;\begin{cases}\mu_m(t+1)-\mu_m(t)-q_m(t)\le 0 & m=1,\dots,M\\[2pt] -\mu_m(t+1)+\mu_m(t)-q_m(t)\le 0 & t=1,\dots,T-1\end{cases} $$
(17.6)

In order to limit the total number of change points to a constant Q, a constraint on \( q_m(t) \) is added in the following form:

$$ \sum_{t=1}^{T-1}\sum_{m=1}^{M}q_m(t)=2Q $$
(17.7)
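To see why the right-hand side of (17.7) is 2Q: at each change point exactly one membership function switches from 1 to 0 and another from 0 to 1, so two of the indicators \( q_m(t) \) are set. A minimal sketch counting the switches:

```python
import numpy as np

def switch_indicators(mu):
    """q_m(t) = |mu_m(t+1) - mu_m(t)| as in (17.6), for an M x T binary mu."""
    return np.abs(np.diff(mu, axis=1))

def num_change_points(mu):
    """Total change points Q; each change point contributes 2 to sum(q)."""
    return int(switch_indicators(mu).sum()) // 2

# M = 2 clusters over T = 8 points with a single change point at t = 4
mu = np.array([[1, 1, 1, 1, 0, 0, 0, 0],
               [0, 0, 0, 0, 1, 1, 1, 1]])
print(num_change_points(mu))  # 1
```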

Now, \( \overline{\mathbf{q}}_m=\left[q_m(1),q_m(2),\dots,q_m(T-1)\right] \) is defined to record \( q_m(t) \) over time and is added to the set of unknowns. By defining (17.8) and (17.9), (17.5) is rewritten as (17.10) to find \( \mu_m(t) \) (Metzner et al. 2012):

$$ \tilde{\mathbf{D}}=\Big[\underbrace{D_1\;\;D_2\;\;\dots\;\;D_M}_{T\times M}\;\;\underbrace{0\;\;\dots\;\;0}_{(T-1)\times M}\Big] $$
(17.8)
$$ \tilde{\mathbf{U}}=\Big[\underbrace{U_1\;\;U_2\;\;\dots\;\;U_M}_{T\times M}\;\;\underbrace{\overline{\mathbf{q}}_1\;\;\overline{\mathbf{q}}_2\;\;\dots\;\;\overline{\mathbf{q}}_M}_{(T-1)\times M}\Big] $$
(17.9)
$$ \min_{\tilde{\mathbf{U}}}\;\tilde{\mathbf{U}}\cdot\tilde{\mathbf{D}}^{T} $$
(17.10)

All the elements of the unknown vector Ũ in (17.10) are either 0 or 1; thus, this is a constrained optimization in the form of a binary integer program. The linear constraints are (17.4), (17.6), and (17.7), which comprise \( 2M\times(T-1) \) inequality and \( T+1 \) equality constraints. Standard solvers for such constrained optimizations are available as toolboxes in Matlab or R (Gurobi 2014). Once Ũ is found, \( \mu_m(t) \) for all clusters, and hence the change points, can be determined.

In summary, the BV-clustering algorithm proceeds as follows. First, a random initial \( \mu_m(t) \) satisfying (17.4) is selected. The parameters of each cluster are then calculated by LS/ML, and the model distance function \( d_m(x(t)) \) is computed from (17.1) or (17.2). Next, the optimization problem (17.10) is constructed and solved under the constraints (17.4), (17.6), and (17.7). The LS/ML and constrained-optimization steps are repeated for a predefined number of iterations (usually five). This procedure converges to at least a local solution of the optimization in (17.3); to find the global solution, the algorithm should be restarted from different random initializations of \( \mu_m(t) \).
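The iterative procedure can be sketched for the simple case of Gaussian clusters that differ only in their means (variance fixed to 1 for brevity). Step 1 estimates each cluster mean from its assigned data; step 2 reassigns memberships. Here the binary-program step is replaced, for illustration only, by an equivalent dynamic program that limits the number of switches to Q; the original method instead solves (17.10) with an integer-programming solver.

```python
import numpy as np

def assign_bounded_switches(D, Q):
    """Minimize (17.3) over hard memberships with at most Q change points.

    D is an M x T distance matrix; dynamic programming over
    (time, cluster, switches used) returns the label sequence."""
    M, T = D.shape
    cost = np.full((T, M, Q + 1), np.inf)
    back = np.zeros((T, M, Q + 1, 2), dtype=int)
    cost[0, :, 0] = D[:, 0]
    for t in range(1, T):
        for m in range(M):
            for k in range(Q + 1):
                best, arg = cost[t - 1, m, k], (m, k)   # stay in cluster m
                if k > 0:                                # or switch clusters
                    for mp in range(M):
                        if mp != m and cost[t - 1, mp, k - 1] < best:
                            best, arg = cost[t - 1, mp, k - 1], (mp, k - 1)
                cost[t, m, k] = D[m, t] + best
                back[t, m, k] = arg
    m, k = np.unravel_index(np.argmin(cost[-1]), (M, Q + 1))
    labels = [m]
    for t in range(T - 1, 0, -1):
        m, k = back[t, m, k]
        labels.append(m)
    return np.array(labels[::-1])

def bv_cluster(x, M, Q, n_iter=5, seed=0):
    """Coordinate descent: alternate LS estimation of cluster means
    (step 1) and membership reassignment (step 2)."""
    rng = np.random.default_rng(seed)
    # random initial memberships satisfying (17.4): Q random change points
    cps = np.sort(rng.choice(np.arange(1, len(x)), size=Q, replace=False))
    labels = np.zeros(len(x), dtype=int)
    for i, c in enumerate(cps):
        labels[c:] = (i + 1) % M
    for _ in range(n_iter):
        means = np.array([x[labels == m].mean() if np.any(labels == m)
                          else rng.choice(x) for m in range(M)])
        D = (x[None, :] - means[:, None]) ** 2      # squared distance (17.1)
        labels = assign_bounded_switches(D, Q)
    return labels, means
```

On a toy series with one mean shift, the loop recovers the change point and the two cluster means within a few iterations.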

4 Model Selection

The number of clusters M and the number of change points Q must be set in BV clustering. In the time series literature, the number of change points is usually chosen by information-theoretic methods such as the BIC (Jandhyala et al. 2013). The BIC is a well-known approach for trading off goodness of fit against model complexity, preventing over- and under-fitting.

The index of the detected cluster at time t is \( m^{*}(t)=\arg\max_{m}\,\mu_m(t) \) for \( t=1,\dots,T \). When the cluster parameters are obtained by maximum likelihood, the minimized negative log-likelihood V and the BIC are determined by

$$ V=\sum_{t=1}^{T}\ell\left(x(t)\,|\,u(t),t,\alpha_{m^{*}(t)}\right) $$
(17.11)
$$ BIC\left(M,Q\right)=2V+\ln(T)\times\left(\mathrm{number\ of\ estimated\ parameters}\right) $$
(17.12)

Alternatively, when the cluster parameters are obtained by the least squares method, the residual w(t) is determined by

$$ w(t)=x(t)-f\left(x(t-1),\dots,x(t-p),u(t-1),\dots,u(t-n),t,\alpha_{m^{*}(t)}\right) $$
(17.13)

Under the additional assumption that the residuals follow a normal distribution with constant variance, the loss function V and the BIC are (Hastie et al. 2009):

$$ V=\ln\left(\frac{\sum_t w^{2}(t)}{T}\right) $$
(17.14)
$$ BIC\left(M,Q\right)=T\times V+\ln(T)\times\left(\mathrm{number\ of\ estimated\ parameters}\right) $$
(17.15)

Assume that the number of parameters in each cluster is ω. For example, for a time series with a Gaussian distribution in each cluster, the parameters are the mean and the variance, so \( \omega=2 \). Hence, in addition to the Q change points, ω parameters must be estimated for each of the M clusters:

$$ \mathrm{number\ of\ estimated\ parameters}=M\times\omega+Q $$
(17.16)

BV clustering is applied to the data for a range of values of M and Q, and the model with the smallest BIC is chosen.
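For the least-squares case, (17.14)–(17.16) can be sketched as follows (a minimal illustration; the residuals would come from (17.13) after a BV-clustering fit):

```python
import numpy as np

def bic_least_squares(residuals, M, Q, omega):
    """BIC of (17.15) from the residuals w(t) of (17.13), where omega is
    the number of parameters per cluster as in (17.16)."""
    w = np.asarray(residuals, dtype=float)
    T = len(w)
    V = np.log(np.sum(w ** 2) / T)          # loss function (17.14)
    n_params = M * omega + Q                 # (17.16)
    return T * V + np.log(T) * n_params      # (17.15)
```

For example, a two-cluster model that halves the residual magnitude can achieve a smaller BIC than a one-cluster model despite its extra parameters, which is exactly the trade-off the criterion encodes.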

5 Results and Conclusion

In this section, we apply the BV-clustering method to two time series: the PDO and the US surface temperature. It is well known that the PDO has regimes with different mean values (Rodionov 2006). Assume that the model of each cluster is a normal distribution \( N(\rho_m,\sigma_m^2) \). The distance function is defined as \( d_m\left(x(t)\right)=\frac{1}{2}\left[\ln\left(2\pi\right)+\ln\left(\sigma_m^2\right)+\left(x(t)-\rho_m\right)^2/\sigma_m^2\right] \), which is the negative log-likelihood of the normal distribution. Using maximum likelihood, the mean and the variance of each cluster are found, analogously to the parameters of mixture models (Hastie et al. 2009):

$$ \rho_m=\frac{\sum_t\mu_m(t)\,x(t)}{\sum_t\mu_m(t)};\qquad \sigma_m^2=\frac{\sum_t\mu_m(t)\left(x(t)-\rho_m\right)^2}{\sum_t\mu_m(t)} $$
(17.17)
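The per-cluster maximum-likelihood estimates can be sketched as follows (assuming a hard 0/1 membership matrix, so each cluster's statistics are normalized by the number of points assigned to it):

```python
import numpy as np

def gaussian_cluster_params(x, mu):
    """Per-cluster ML mean and variance, as in (17.17); mu is M x T binary."""
    x = np.asarray(x, dtype=float)
    counts = mu.sum(axis=1)                               # points per cluster
    means = (mu * x).sum(axis=1) / counts
    variances = (mu * (x - means[:, None]) ** 2).sum(axis=1) / counts
    return means, variances
```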

The result of the change detection is shown in Fig. 17.1, where three change points, in 1948, 1976, and 2007, are found among three clusters. The models of the time series in 1948–1976 and 2007–2013 are the same. These results are similar to the change points found by Rodionov (2006), but the proposed approach requires no prior knowledge of the minimum cluster length.

Fig. 17.1

Pattern of change in the PDO time series. The red lines show the average of the time series in each cluster, and the green lines show the average ± one standard deviation

The second example is the analysis of the US average temperature, where the data are serially correlated. It is known that, in terms of the BIC, a piecewise linear trend with first-order autoregressive (AR(1)) residuals fits the surface temperature better than a single linear trend; moreover, the trend is not necessarily continuous at the break points (Seidel and Lanzante 2004). Assume that the model of each cluster is a linear trend plus AR(1) noise: \( x(t)=\beta_{0m}+\beta_{1m}t+\varepsilon(t) \) and \( \varepsilon(t)=\rho_m\,\varepsilon(t-1)+w(t) \), where w(t) is white noise. The model distance function is defined as

$$ d_m\left(x(t)\right)=\left\Vert\left[x(t)-\left(\beta_{0m}+\beta_{1m}t\right)\right]-\rho_m\left[x(t-1)-\left(\beta_{0m}+\beta_{1m}(t-1)\right)\right]\right\Vert^2 $$
(17.18)

If \( \rho_m=0 \), the linear model parameters \( \beta_{0m} \) and \( \beta_{1m} \) can be found in closed form by ordinary least squares (Gorji Sefidmazgi et al. 2014b). However, when \( \rho_m\ne 0 \), no closed-form solution exists, and these parameters should be determined by feasible generalized least squares. The results of BV clustering for the average temperature show M = 2 clusters with one change point in 1958. Figure 17.2 shows the spatiotemporal pattern of the linear trends in each cluster, and Fig. 17.3 shows the piecewise linear trend at one of the stations. The linear trends increased after 1958 in most areas, especially over the eastern and central sections of the USA. Relating this change point to specific physical phenomena is difficult, since many anthropogenic and natural factors contribute to climate variability. However, common breaks in the trends of the global temperature and the anthropogenic forcings have been reported in the early 1960s (Estrada et al. 2013), which suggests a link between human influences and the altered long-term temperature trend.
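The feasible generalized least squares fit can be sketched as a two-stage, Cochrane-Orcutt-style procedure (shown here for a single cluster as an illustration): fit the trend by ordinary least squares, estimate ρ from the lag-1 autocorrelation of the residuals, then refit the trend on the quasi-differenced data.

```python
import numpy as np

def fit_trend_ar1(x):
    """Fit x(t) = b0 + b1*t + eps(t), with eps(t) = rho*eps(t-1) + w(t)."""
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x), dtype=float)
    A = np.column_stack([np.ones_like(t), t])
    # stage 1: ordinary least squares for the trend
    beta = np.linalg.lstsq(A, x, rcond=None)[0]
    eps = x - A @ beta
    # stage 2: estimate rho, quasi-difference, and refit the trend
    rho = (eps[1:] @ eps[:-1]) / (eps[:-1] @ eps[:-1])
    xs = x[1:] - rho * x[:-1]
    As = A[1:] - rho * A[:-1]      # columns: (1 - rho), t - rho*(t - 1)
    b0, b1 = np.linalg.lstsq(As, xs, rcond=None)[0]
    return b0, b1, rho
```

On simulated data with a known trend and AR(1) residuals, the procedure recovers the slope and the autocorrelation coefficient to within sampling error.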

Fig. 17.2

(a) Linear trend of temperature in 1,189 stations in cluster 1 during 1900–1958. (b) Linear trend of temperature in cluster 2 during 1958–2013

Fig. 17.3

Anomaly of average temperature in Boulder, CO, and its piecewise linear trend. The linear trend increased after the change point of 1958