1 Introduction

Deep earthquakes, with focal depths greater than 50 km, form a significant portion of earthquake catalogues around the world, including New Zealand. Deep earthquakes are important in that they give indications of the structure of the earth and the dynamics of the crust and mantle, see Frohlich (2006). Although deep earthquakes are usually less destructive than shallow earthquakes, they do cause damage on some occasions. A study of the occurrence patterns of deep earthquakes may prove valuable for understanding the subduction process at convergent plate boundaries and for evaluating near-surface geological hazards such as shallow earthquakes and volcanic activity. A common feature observed in deep earthquake catalogues is the nonstationarity of two important statistics, namely the seismicity rate and the magnitude-frequency distribution (MFD). We consider one type of nonstationarity: abrupt changes in the deep seismicity rate and the associated magnitude-frequency distribution. We characterize the time-varying pattern of the two statistics by means of a Bayesian multiple changepoint model for marked temporal Poisson processes.

Changepoint models are widely applied for modelling heterogeneities appearing in a set of observations collected sequentially. These models split the data into disjoint segments with a (random) number of changepoints, so that observations in the same segment come from the same pattern and observations in different segments show heterogeneity. Changepoint models have been applied extensively in engineering, signal processing, bioinformatics, earthquake modelling (Yip et al. 2017), hydrology (Kehagias 2004) and finance, among many others. There is a vast literature on the methods and applications of changepoint models; among the most popular are wild binary segmentation (Fryzlewicz 2014) and the simultaneous multiscale change point estimator (Frick et al. 2014). We consider Bayesian off-line inference for changepoint models. Often, inference for the number and locations of changepoints and other model parameters is based on Markov chain Monte Carlo methods, see Stephens (1994), Green (1995), Chib (1998), Lavielle and Lebarbier (2001) and Fearnhead (2006), among many others. It is noted that most of these models are applicable only to discrete time observations. Although some continuous time random processes may be approximated well by their discrete-time counterparts, it is necessary to develop multiple changepoint models in continuous time in their own right; this avoids the approximation errors introduced by discretization, which are often difficult to quantify. Marked point processes are widely utilized in statistical modelling for hurricane occurrences, insurance claims (Elliott et al. 2007), earthquake occurrences (Ogata 1988; Yip et al. 2018), etc. For Poisson processes, Galeano (2007) proposed a binary segmentation algorithm along with a centralized and normalized cumulative sum statistic to detect changepoints in the intensity rate of Poisson events; the approach is based on asymptotic arguments that provide consistent estimates for the locations of changepoints. Yang and Kuo (2001) suggested a binary segmentation procedure for locating the changepoints and the associated heights of the intensity function of a Poisson process by using the Bayes factor or its BIC approximation.

We consider a formulation of Bayesian multiple changepoint models to simultaneously monitor structural breaks in the Poisson intensity rate and the associated mark distribution, which is a continuous time extension of Chib’s multiple changepoint models (1998). For this continuous time hidden Markov model, we suggest an approach to directly simulate the full trajectory of the latent Markov chain in a block Gibbs sampling scheme, a step that is often implemented through the uniformization method (Fearnhead and Sherlock 2006; Rao and Teh 2013). The number of changepoints is determined via a modified Bayes information criterion, tailored particularly to these multiple changepoint models for marked Poisson processes.

The outline of the paper is as follows. In Sects. 2 and 3, we formulate a multiple changepoint model for marked Poisson processes via a continuous-time hidden Markov model. We then introduce a continuous-time forward filtering backward sampling algorithm for sampling the full trajectory of the latent Markov process without resorting to the uniformization method. The maximum a posteriori (MAP) estimate of the trajectory of the latent Markov chain x(t), i.e. the locations of the changepoints, can be obtained from a continuous time version of the Viterbi algorithm, as described in Sect. 4. The number of changepoints in a marked Poisson process is chosen by a modified BIC criterion. We then carry out simulation studies to demonstrate the method in Sect. 5. In the last section, we perform a case study for the New Zealand deep earthquakes. The temporal variability of the deep seismicity rate and the magnitude-frequency distribution is analysed via this multiple changepoint model, and its implications for seismic hazards are illustrated.

2 Model formulation and the likelihood

2.1 Model formulation

Let Y(t) be a Poisson process attached with marks. Suppose it is subject to abrupt changes at m unknown time points \(0\triangleq \tau _0<\tau _1<\cdots<\tau _m<\tau _{m+1}\triangleq T\). The process Y(t) is partitioned into \(m+1\) segments by the m changepoints, with the \(i\)-th segment consisting of the observations within \([\tau _{i-1}, \tau _i)\). We consider multiple changepoint models whose changepoint locations are associated with the state transition times of an unobservable continuous time finite Markov chain x(t). The intensity rate function of Y(t) is specified by \(\lambda _{x(t)}f_{x(t)}(y)\), where \(\lambda _{x(t)}\) is the stochastic intensity rate of the ground process and \(f_{x(t)}(y)\) is the probability density function of the mark distribution. A sequence of Poisson events at \(\{t_1,\ldots ,t_n\}\) and associated marks \(\{y_1,\ldots ,y_n\}\) is observed over a time interval [0, T]. The transition rate matrix of x(t) is constrained to be consistent with a changepoint model, such that x(t) either sojourns in its current state or jumps to the next one. Correspondingly, the transition rate matrix of x(t) is parameterized by

$$\begin{aligned} {\mathbf {Q}}= \begin{pmatrix} -q_1 &{} q_1 &{} 0 &{} \cdots &{} 0 &{} 0\\ 0 &{} -q_2 &{} q_2 &{} \cdots &{} 0 &{} 0\\ \vdots &{} \vdots &{} \vdots &{} \ddots &{} \vdots &{} \vdots \\ 0 &{} 0 &{} 0 &{} \cdots &{} -q_m &{} q_m\\ 0 &{} 0 &{} 0 &{} \cdots &{} 0 &{} 0 \end{pmatrix}. \end{aligned}$$

The chain x(t) starts from state 1 and ends in state \(m+1\). When x(t) is in state i, the stochastic intensity rate of the ground process and the mark distribution of Y(t) are given by \(\lambda _i\) and \(f_i(y)\). The attached marks can be any univariate or multivariate variables; we shall specify the mark distributions in later sections. This type of model is exactly a Markov modulated Poisson process (MMPP) with state-dependent marks, that is, a doubly stochastic point process in which the stochastic intensity of the ground process and the mark distributions are determined by an underlying finite Markov chain, see Lu (2012). This type of multiple changepoint model has the conditional independence property: conditional on the position of a changepoint, observations after the changepoint contain no information about segments and observations prior to the changepoint. It is also a type of product partition model (Barry and Hartigan 1992) and a continuous-time generalization of Chib’s multiple changepoint model (1998).
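To make the data generating mechanism concrete, the following minimal R sketch (with hypothetical parameter values; all function and variable names are ours, not from the paper's implementation) simulates such a marked Poisson process with piecewise constant Poisson rates and exponential marks:

```r
# Minimal sketch: simulate a marked Poisson process with m = 3 changepoints,
# piecewise-constant ground-process rates lambda_i and exponential marks with
# rates rho_i, as in the model formulated above.
simulate_mcp <- function(tau = c(0, 25, 50, 75, 100),  # tau_0, ..., tau_{m+1}
                         lambda = c(2, 5, 2, 5),       # ground-process rates
                         rho = c(3, 2, 3, 2)) {        # exponential mark rates
  times <- numeric(0); marks <- numeric(0)
  for (i in seq_along(lambda)) {
    n_i <- rpois(1, lambda[i] * (tau[i + 1] - tau[i]))  # events in segment i
    t_i <- sort(runif(n_i, tau[i], tau[i + 1]))         # uniform given count
    times <- c(times, t_i)
    marks <- c(marks, rexp(n_i, rho[i]))                # state-dependent marks
  }
  data.frame(t = times, y = marks)
}
set.seed(1)
Y <- simulate_mcp()
```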

Remark 1

The marks can be any type of variable, and may be dependent on or independent of the ground process. In particular, when the attached mark is an indicator of the class to which a point belongs, the model becomes a multiple changepoint model for multivariate Poisson processes (Ramesh et al. 2013). In this case, changepoints in multiple Poisson sequences can be monitored simultaneously.

Remark 2

This model formulation allows jointly monitoring changepoints in the Poisson rates and the associated marks in two cases: changepoints occurring in the ground process and the associated marks simultaneously, or changepoints occurring in either the ground process or the attached marks alone. This formulation is more flexible than modelling changepoints of the ground process and the attached marks separately, and is ideal for modelling “common” structural breaks.

Remark 3

When all the \(q_i\)s in \({\mathbf {Q}}\) are equal, the changepoints follow a constant rate Poisson process. As the \(q_i\)s in \({\mathbf {Q}}\) may differ, varying scales of segment lengths can be represented, which is desirable for modelling highly variable segment lengths between changepoints and avoids potential model bias.

This model formulation is a special case of the MMPP with state-dependent marks (Lu 2012). However, the current parameterization is consistent with a multiple changepoint model with \(m+1\) segments, and no “state reciprocal” is assumed, in contrast to specifying a full transition rate matrix for the latent Markov chain as in Lu (2012). See also the last paragraph of Sect. 6 for further discussion.

We define some notation used in later sections. Denote the inter-event time \(t_i-t_{i-1}\) by \(\Delta t_i\) and the observation \((t_i, y_i)\) by \(Y_i\); \(x(t_k)\) is denoted by \(x_k\). Generically, we write \(A[j,i]\) for the \((j,i)\)-th entry of a matrix A. The sample path of a random process Z(t) over a time interval \([a,b]\) or \([a,b)\) is denoted by \(Z[a,b]\) or \(Z[a,b)\) respectively.

2.2 The likelihood

The sequence \(\{(x_i,\Delta t_i,y_i)\}_{i=1}^n\) forms a Markov sequence with transition density matrix \(e^{({\mathbf {Q}}-\Lambda )(t_i-t_{i-1})}\Lambda \Upsilon (y_i)\), where \(\Lambda =\text{ diag }(\lambda _1,\ldots ,\lambda _{m+1})\) and \(\Upsilon (y)= \text{ diag }(f_1(y),\ldots ,f_{m+1}(y)).\) The likelihood is given by

$$\begin{aligned} L({\mathbf {Q}},\Lambda )={\mathbf {e}}_1' e^{({\mathbf {Q}}-\Lambda )\Delta t_1}\Lambda \Upsilon (y_1)\cdots e^{({\mathbf {Q}}-\Lambda )\Delta t_n}\Lambda \Upsilon (y_n){\mathbf {1}}, \end{aligned}$$
(1)

where \(\mathbf{e }_i\) is a unit column vector with all entries zero except the i-th entry, and \({\mathbf {1}}\) is a column vector with all entries equal to one; see Lu (2012) for the derivation of the likelihood.

The evaluation of the likelihood is facilitated by forward and backward recursions. Denote \(e^{({\mathbf {Q}}-\Lambda )(t_k-t_{k-1})}\Lambda \Upsilon (y_k)\) by \(L_k\). The forward and backward probabilities are \(\alpha _{t_k}(i)={\mathbf {e}}_1' L_1 \cdots L_k{\mathbf {e}}_i\) and \(\beta _{t_k}(j)={\mathbf {e}}_j'L_{k+1}L_{k+2}\cdots L_n{\mathbf {1}}.\) They are computed recursively via

$$\begin{aligned} {\left\{ \begin{array}{ll} \alpha _{t_{k+1}}(i)=\sum \limits _j\alpha _{t_k}(j)L_{k+1}[j,i]\\ \beta _{t_k}(i)=\sum \limits _jL_{k+1}[i,j]\beta _{t_{k+1}}(j), \end{array}\right. } \end{aligned}$$
(2)

where \(L_{k+1}[j,i]\) is the \((j,i)\)-th entry of \(L_{k+1}\). In terms of this device, the likelihood is \(L({\mathbf {Q}}, \Lambda )=\sum \nolimits _i \alpha _t(i)\beta _t(i)\) for each t, or equivalently \(L({\mathbf {Q}}, \Lambda )=\alpha _T(m+1).\) The forward and backward densities tend to zero or infinity exponentially fast as the number of observations accrues, leading to underflow or overflow in practical computations. This numerical instability is often treated by extended-precision arithmetic or by incorporating scaling procedures, as in our R implementation.
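As an illustration of one such scaling procedure, the following R sketch (ours, not the paper's implementation) evaluates the log-likelihood (1) by normalizing each forward vector and accumulating the log scaling constants. It assumes the expm package for matrix exponentials, and takes the mark densities as a function Upsilon returning a diagonal matrix:

```r
# Minimal sketch of a scaled forward recursion (a standard remedy, not the
# only one): each forward vector is normalized to sum to one, and the log
# normalizing constants accumulate to the log-likelihood of Eq. (1).
library(expm)  # provides expm() for the matrix exponential

log_likelihood <- function(Q, Lambda, Upsilon, dt, y) {
  # Q: (m+1)x(m+1) rate matrix; Lambda: diagonal matrix of Poisson rates;
  # Upsilon(y): diagonal matrix of mark densities; dt, y: inter-event
  # times and marks.
  m1 <- nrow(Q)
  alpha <- c(1, rep(0, m1 - 1))    # the chain starts in state 1
  loglik <- 0
  for (k in seq_along(dt)) {
    Lk <- expm((Q - Lambda) * dt[k]) %*% Lambda %*% Upsilon(y[k])
    alpha <- as.vector(alpha %*% Lk)
    ck <- sum(alpha)               # scaling constant
    alpha <- alpha / ck
    loglik <- loglik + log(ck)
  }
  loglik                           # log( e_1' L_1 ... L_n 1 ), cf. Eq. (1)
}
```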

3 Bayesian inference of multiple changepoint models

Let all the model parameters be denoted by \(\Theta =({\mathbf {Q}}, \theta )\). Suppose a prior \(\pi (\Theta )\) is specified for the model parameters \(({\mathbf {Q}}, \theta )\). In the Bayesian context, the posterior distribution \(\pi (\Theta |Y[0,T])\varpropto \pi (\Theta )p(Y[0,T]|\Theta )\) is of interest. We discuss a block Gibbs sampling scheme to sample approximately from the posterior distribution of \(\Theta\). The Gibbs sampling scheme involves two full conditionals: the full trajectory of the latent Markov chain x(t) given the model parameters, and the model parameters conditioned on the trajectory of the latent Markov chain.

3.1 A continuous time forward filtering and backward sampling algorithm

Sampling the full trajectory of the latent Markov chain given the model parameters is typically implemented by continuous time versions of forward filtering backward sampling algorithms based on the uniformization method, see Fearnhead and Sherlock (2006) and Rao and Teh (2013). Alternatively, we suggest a direct approach to simulate trajectories of the latent Markov chain x(t) without resorting to the uniformization method. We start from the discrete time version of the forward filtering backward sampling algorithm, sampling \(x_k\triangleq x(t_k), k=1,\ldots ,n\), at the Poisson arrival times conditioned on \(\{Y_t, 0\le t \le T\}\), see Chib (1998), Scott (2002) and Fearnhead and Sherlock (2006). The forward filtering recursion is given by

$$\begin{aligned}&p\left( x_i=k\big |Y[0,t_i],\Theta \right) \nonumber \\&\quad =\frac{\sum \nolimits _{l=k-1}^kp\left( x_{i-1}=l\big |Y[0,t_{i-1}],\Theta \right) p\left( x_i=k,Y(t_{i-1}, t_i]\big |x_{i-1}=l, \Theta \right) }{\sum \nolimits _k\sum \nolimits _{l=k-1}^kp\left( x_{i-1}=l\big |Y[0,t_{i-1}],\Theta \right) p\left( x_i=k,Y(t_{i-1}, t_i]\big |x_{i-1}=l, \Theta \right) } \end{aligned}$$
(3)

The state filtering is calculated recursively from the initial condition: \(p(x_0=1|\Theta )=1\).

The state sequence \(X_n=(x_1,\ldots ,x_n)\) is sampled from the joint distribution

$$\begin{aligned} p(X_n|Y[0,T],\Theta )=p(x_n|Y[0,T],\Theta )p(x_{n-1}|x_n,Y[0,T],\Theta )p(x_{n-2}|x_{n-1},Y[0,T], \Theta )\cdots p(x_1|x_2,Y[0,T],\Theta ). \end{aligned}$$

After the filtering probabilities are stored, the backward sampling is implemented according to

$$\begin{aligned}&p\left( x_i\bigg |x_{i+1},Y[0,T], \Theta \right) \nonumber \\&\quad \varpropto p\left( x_i\big |Y[0,t_i],\Theta \right) p\left( x_{i+1}, Y(t_i,t_{i+1}]\big |x_i,\Theta \right) . \end{aligned}$$
(4)

In (3) and (4), it is required to evaluate \(p\left( x_{i+1},Y(t_i,t_{i+1}]\big |x_i,\Theta \right)\) exactly. In this case,

$$\begin{aligned} p\left( x_{i+1},Y(t_i,t_{i+1}]\big |x_i,\Theta \right) =\mathbf{e }_{x_i}' e^{({\mathbf {Q}}-\Lambda )(t_{i+1}-t_{i})}\Lambda \Upsilon (y_{i+1})\mathbf{e }_{x_{i+1}}. \end{aligned}$$
(5)

See the previous section for the likelihood of a Markov modulated Poisson process with state-dependent marks.
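A minimal R sketch of this discrete-time forward filtering backward sampling step, under the same assumptions as the filtering sketch in Sect. 2.2 (our naming; expm for matrix exponentials), might read:

```r
# Minimal sketch of the discrete-time FFBS step (3)-(5): filt[k, ] stores
# p(x_k | Y[0, t_k]) and the states are then sampled backward via Eq. (4).
ffbs_states <- function(Q, Lambda, Upsilon, dt, y) {
  n <- length(dt); m1 <- nrow(Q)
  filt <- matrix(0, n, m1)
  prev <- c(1, rep(0, m1 - 1))                  # p(x_0 = 1) = 1
  Ls <- vector("list", n)
  for (k in 1:n) {                              # forward filtering, Eq. (3)
    Ls[[k]] <- expm((Q - Lambda) * dt[k]) %*% Lambda %*% Upsilon(y[k])
    f <- as.vector(prev %*% Ls[[k]])
    filt[k, ] <- f / sum(f)
    prev <- filt[k, ]
  }
  x <- integer(n)
  x[n] <- sample(m1, 1, prob = filt[n, ])       # draw x_n from the filter
  for (k in (n - 1):1) {                        # backward sampling, Eq. (4)
    w <- filt[k, ] * Ls[[k + 1]][, x[k + 1]]    # filter times L_{k+1}[i, x_{k+1}]
    x[k] <- sample(m1, 1, prob = w / sum(w))
  }
  x
}
```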

It is noted that the state sequence \(\{x_1,x_2,\ldots ,x_n\}\) should be in consecutive order, such that \(x_{k+1}=x_k\) or \(x_{k+1}=x_k+1\). Otherwise, there would exist at least one segment within \((t_k, t_{k+1})\) totally devoid of Poisson observations; it is reasonable to assume that no such segment appears in most practical settings. To simulate the full trajectory of the latent Markov chain, it is necessary to simulate the exact state transition times of x(t). When a state transition happens between two consecutive \(x_k\) and \(x_{k+1}\), the exact transition time \(\tau\) may be simulated by the uniformization method, see Fearnhead and Sherlock (2006) and Rao and Teh (2013). Alternatively, we consider a direct approach: the exact jump time \(\tau\) is simulated according to the probability density

$$\begin{aligned} p\left( \tau \bigg |x_k=i,x_{k+1}=i+1,Y(0,T]\right) =\frac{q_ie^{-(q_i+\lambda _i) (\tau -t_k)}e^{-(q_{i+1}+\lambda _{i+1})(t_{k+1}-\tau )}}{e^{(Q-\Lambda )(t_{k+1} -t_k)}[i,i+1]},t_k\le \tau \le t_{k+1}, \end{aligned}$$
(6)

where \(e^{(Q-\Lambda )(t_{k+1}-t_k)}[i,i+1]\) is the \((i, i+1)\)-th entry of the matrix exponential \(e^{(Q-\Lambda )(t_{k+1}-t_k)}\). The numerator of (6) is the probability density that x(t) sojourns in state i from \(t_k\) until \(\tau\) and then jumps to state \(i+1\), staying there over \([\tau , t_{k+1}]\), while no Poisson event happens in \((t_k, t_{k+1})\). The denominator is the probability density that x(t) is in state i at \(t_k\) and in state \(i+1\) at \(t_{k+1}\), with no Poisson arrivals in \((t_k, t_{k+1})\). The cumulative distribution function \(F(u)=\int _{t_k}^{u}p\left( v\big |x_k=i,x_{k+1}=i+1,Y(0,T]\right) \,dv\) is available in closed form. Hence, \(\tau\) can be sampled directly by the inverse transformation \(\tau =F^{-1}(U)\) as follows:

$$\begin{aligned} \tau =t_k+\log \left( \frac{Ve^{(Q-\Lambda )(t_{k+1}-t_{k})}[i,i+1]}{q_ie^{-(q_{i+1}+\lambda _{i+1})(t_{k+1}-t_{k})}}U+1\right) /V, \end{aligned}$$
(7)

where \(U\thicksim U[0,1]\) and \(V=(q_{i+1}+\lambda _{i+1}-q_i-\lambda _i)\).
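A minimal R sketch of this inverse-CDF draw (again our naming, reusing expm()) is given below; the degenerate case \(V=0\), where the density reduces to a uniform on \((t_k, t_{k+1})\), is handled separately:

```r
# Minimal sketch of the inverse-CDF draw (7) of the exact jump time tau in
# (t_k, t_{k+1}], given x_k = i, x_{k+1} = i + 1 and no events in between.
sample_jump_time <- function(tk, tk1, i, Q, Lambda) {
  d   <- tk1 - tk
  D   <- expm((Q - Lambda) * d)[i, i + 1]       # matrix-exponential entry
  qi  <- Q[i, i + 1]; qi1 <- -Q[i + 1, i + 1]   # q_i and q_{i+1}
  li  <- Lambda[i, i]; li1 <- Lambda[i + 1, i + 1]
  V   <- (qi1 + li1) - (qi + li)
  U   <- runif(1)
  if (abs(V) < 1e-12) {                         # V = 0: density is constant,
    return(tk + U * d)                          # i.e. uniform on (t_k, t_{k+1})
  }
  tk + log(V * D * U / (qi * exp(-(qi1 + li1) * d)) + 1) / V
}
```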

3.2 Simulation of \({\mathbf {Q}}\)

The block Gibbs sampling scheme is completed by sampling the model parameters conditioned on the full path of the underlying Markov chain. Given x(t), the distribution of \({\mathbf {Q}}\) is independent of Y(t) and \(\theta\). It is straightforward to simulate \({\mathbf {Q}}\) when the priors of the model parameters are conjugate. Suppose the prior distribution of \(q_i\) is \(\Gamma (a,b)\) with probability density \(\frac{b^a}{\Gamma (a)}q_i^{a-1}e^{-bq_i}\). Then the joint prior of \({\mathbf {Q}}\) is \(\prod \limits _{i=1}^{m}\frac{b^a}{\Gamma (a)}q_i^{a-1}e^{-bq_i}.\) The hyperparameters a and b are often selected empirically. The likelihood of \(x(t), 0\le t\le T\), is

$$\begin{aligned} \prod \limits _{i=1}^{m}q_ie^{-q_i(\tau _i-\tau _{i-1})}. \end{aligned}$$

Therefore, the posterior distribution of \({\mathbf {Q}}\) is given by

$$\begin{aligned} p({\mathbf {Q}}|x[0,T])\propto \prod \limits _{i=1}^{m}\frac{b^a}{\Gamma (a)} q_i^{(a+1)-1}e^{-[b+(\tau _i-\tau _{i-1})]q_i}. \end{aligned}$$
(8)

According to (8), \(q_i\) is simulated from \(\Gamma (a+1,b+(\tau _i-\tau _{i-1}))\). The final step of each Gibbs iteration samples from the full conditional \(\theta |x[0,T], Y[0,T], {\mathbf {Q}}\). We shall discuss it in the following sections once the mark distributions are specified.
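In R, this conjugate update takes one line per iteration; a minimal sketch (our naming), given the current changepoints, is:

```r
# Minimal sketch of the conjugate update (8): given the current changepoints
# tau = (tau_0, ..., tau_{m+1}) and Gamma(a, b) priors, each q_i is drawn
# from Gamma(a + 1, b + (tau_i - tau_{i-1})).
sample_q <- function(tau, a = 1, b = 1) {
  m <- length(tau) - 2
  rgamma(m, shape = a + 1, rate = b + diff(tau)[1:m])
}
```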

Algorithm 3.1

Block Gibbs sampler for multiple changepoint models of the marked Poisson processes:

  1.

    Sample the state sequence \(\{x_1,\ldots ,x_n\}\) of the latent Markov chain x(t) conditioned on the model parameters and observations by a discrete-time forward filtering backward sampling algorithm;

  2.

    If there is a jump between two consecutive states \(x_k\) and \(x_{k+1}\), the exact jump time, i.e. the location of a changepoint \(\tau\), is simulated according to (7);

  3.

    Sample \({\mathbf {Q}}\) and other model parameters \(\theta\) given the trajectories of the latent Markov chain x(t).

  4.

    Repeat the above steps until the desired number of iterations is reached.

4 The continuous-time Viterbi algorithm and the number of changepoints

In the discrete time setting, it is well known that the optimal set of changepoints can be found by dynamic programming, i.e. the Viterbi algorithm. However, for continuous time HMMs, there exists an infinite number of potential paths for the latent Markov chain. In Bebbington (2007), a continuous-time Viterbi algorithm is suggested to retrieve the optimal path of the latent Markov chain of a Markov modulated Poisson process. The method can be tailored to fit the current models. Define

$$\begin{aligned} M_i(t_k)=\max \limits _{x[0,t_k)}\log p\left( x[0,t_k), x_k=i, Y[0, t_k]\right) . \end{aligned}$$
(9)

The Viterbi recursion shows that

$$\begin{aligned} M_j(t)= \max \limits _i\left\{ M_i(t_k)+\log p\left( x(t)=j, Y(t_k,t]\big |x_k=i\right) \right\} ,&t_k\le t\le t_{k+1}. \end{aligned}$$
(10)

In this case, \(\log p\left( x(t)=j, Y(t_k,t]\big |x_k=i\right)\) needs to be maximized over all possible paths. It is noted that x(t) has at most one jump over \([t_k, t_{k+1})\). When \(j=i\), \(\log p\left( x(t)=j, Y(t_k,t]\big |x_k=i\right)\) is a constant, so only \(\log p\left( x(t)=i+1, Y(t_k,t]\big |x_k=i\right)\) needs to be maximized over the path space. Assume that the sample path of x(t) over \((t_k, t]\) is given by \(x[t_k,u)=i, x[u, t]=i+1.\) The probability density is \(q_ie^{-(q_i+\lambda _i)(u-t_k)}e^{-(q_{i+1}+\lambda _{i+1})(t-u)},\) which is maximized by setting either \(u-t_k\) or \(t-u\) to zero. So its maximum is

$$\begin{aligned}&q_i\max \{e^{-(q_i+\lambda _i)(t-t_k)},e^{-(q_{i+1}+\lambda _{i+1})(t-t_k)}\}\\&\quad =q_ie^{-\min \{q_i+\lambda _i,\, q_{i+1}+\lambda _{i+1}\}(t-t_k)}. \end{aligned}$$

Hence, along the optimal path, the latent Markov chain has state transitions only at the event times of Y(t). From the above argument, we have the following proposition.

Proposition 1

The optimal posterior path of the latent Markov chain x(t) of a Markov modulated Poisson process with marks, given \(Y(t), 0\le t \le T\), has state transitions only at the Poisson event times.

Proposition 1 implies that the optimal set of changepoint locations is attained exactly at the Poisson event times. Hence, the search for the set of most probable changepoint locations can be carried out by a discrete time version of the Viterbi algorithm.

Let \(f_k(j)\) be the probability density of the most probable sample path of x(t) up to \(t_k\) that reaches state j, and \(\phi _k(j)\) be the optimal state at time \(t_{k-1}\) for the sample path reaching state j at time \(t_k\).

Algorithm 2

(Continuous-time Viterbi Algorithm)

  1.

    Initialize \(f_1(j)={\mathbf {e}}_1' e^{({\mathbf {Q}}-\Lambda )\Delta t_1}\Lambda \Upsilon (y_1){\mathbf {e}}_j\) and set \(\phi _1(j)=0\) for all j.

  2.

    For \(k=2,\ldots ,n\) and all j, recursively compute

    $$\begin{aligned} f_k(j)=\max \limits _i\{f_{k-1}(i)e^{({\mathbf {Q}}-\Lambda )\Delta t_k}\Lambda \Upsilon (y_k)[i,j]\} \end{aligned}$$

    and

    $$\begin{aligned} \phi _k(j)=argmax_i\{f_{k-1}(i)e^{({\mathbf {Q}}-\Lambda )\Delta t_k}\Lambda \Upsilon (y_k)[i,j]\}. \end{aligned}$$
  3.

    Let \(x_n=argmax_jf_n(j)\) and backtrack the state sequence as follows:

    For \(k=n-1,\ldots ,1\), set \(x_k=\phi _{k+1}(x_{k+1})\).

In the above algorithm, \(f_k(j)\) propagates to zero or infinity exponentially fast, which again causes underflow or overflow. A proper scaling procedure, e.g. working with logarithms, is required in numerical computations, see our R implementation.
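A minimal R sketch of Algorithm 2 in the log domain (our naming; the matrices \(L_k\) are assumed precomputed as in the filtering sketch of Sect. 2.2) is:

```r
# Minimal sketch of the Viterbi recursion in the log domain to avoid
# underflow; Ls is the list of matrices L_k = expm((Q - Lambda) dt_k)
# Lambda Upsilon(y_k).
viterbi_states <- function(Ls) {
  n <- length(Ls); m1 <- nrow(Ls[[1]])
  f <- log(Ls[[1]][1, ])                  # f_1(j): start in state 1
  phi <- matrix(0L, n, m1)
  for (k in 2:n) {
    logL <- log(Ls[[k]])
    fk <- numeric(m1)
    for (j in 1:m1) {
      v <- f + logL[, j]                  # f_{k-1}(i) + log L_k[i, j]
      phi[k, j] <- which.max(v)
      fk[j] <- max(v)
    }
    f <- fk
  }
  x <- integer(n)
  x[n] <- which.max(f)
  for (k in (n - 1):1) x[k] <- phi[k + 1, x[k + 1]]
  x                                       # MAP state sequence; changepoints
}                                         # sit where x jumps by one
```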

There exist popular model selection methods specifically tailored to the challenging changepoint problem, see Green (1995), Zhang and Siegmund (2007) and Harchaoui and Lévy-Leduc (2010), among many others. One popular criterion is the Bayes information criterion (BIC). However, due to irregularities of the likelihood in multiple changepoint models, there is no justification for using the standard BIC in this scenario. Recently, a large sample approximation to the Bayes factor bypassing the Taylor expansion was derived for Gaussian changepoint models (Zhang and Siegmund 2007) and Poisson changepoint models (Shen and Zhang 2012) under the assumption of a uniform prior for the locations of changepoints. Similar to Zhang and Siegmund (2007), we estimate the Bayes factor of the changepoint model versus the homogeneous Poisson model with stationary exponential marks for large T and \(\lim \limits _{T\rightarrow \infty }\tau _i/T=r_i, i=1, \ldots , m\). Denote the multiple changepoint model with m changepoints by \({\mathcal {M}}_m\). We have the following proposition:

Proposition 2

Assume a uniform prior for the changepoints \(\tau\) and the other model parameters. Then

$$\begin{aligned} \log \frac{P({\mathcal {M}}_m|Y[0,T])}{P({\mathcal {M}}_0|Y[0,T])} = \log \left( \frac{\sup \hat{L}_{m}}{\sup \hat{L}_0}\right) -\sum \limits _{i=0}^m\log \left( \hat{\tau }_{i+1}-\hat{\tau }_i\right) +(1-m)\log T+ O_p(1). \end{aligned}$$

In Proposition 2, the first term is the generalized log-likelihood ratio statistic of the model with m changepoints relative to the null model with no changepoint. The remainder is interpreted as a (negative) penalty on model complexity; generally, the penalty term favors evenly distributed changepoints. See the proof sketch in the “Appendix”. Thus, the modified BIC is defined as a penalized likelihood criterion:

$$\begin{aligned} \log L_m(\hat{\Theta })-\sum \limits _{i=0}^m\log \left( \hat{\tau }_{i+1}- \hat{\tau }_i\right) +(1-m)\log T, \end{aligned}$$
(11)

where \(\hat{\Theta }\) is the maximum likelihood estimate of the parameters of model \({\mathcal {M}}_m\).
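A minimal R sketch of this criterion (our naming; the maximized log-likelihood and the estimated changepoints are assumed given) is:

```r
# Minimal sketch of the modified BIC (11): loglik_m is the maximized
# log-likelihood of the m-changepoint model and tau_hat the estimated
# changepoints; the model maximizing mbic() is selected.
mbic <- function(loglik_m, tau_hat, T_end) {
  tau <- c(0, tau_hat, T_end)                 # tau_0, ..., tau_{m+1}
  m <- length(tau_hat)
  loglik_m - sum(log(diff(tau))) + (1 - m) * log(T_end)
}
```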

5 Simulation studies

We perform a simulation study to demonstrate the methods, which also provides some insight for the application to deep earthquake modelling in the next section. We choose exponential marks in this simulation for the reason explained in that application. We also choose conjugate priors \(\Gamma (\alpha ,\beta )\) and \(\Gamma (\zeta ,\eta )\) for the Poisson intensity rates \(\Lambda\) and the rate parameters \(\rho\) of the exponential marks respectively. Therefore, the full conditional of \(\Lambda\) is

$$\begin{aligned} \Lambda |x[0,T], Y[0,T], {\mathbf {Q}}\varpropto \prod \limits _{i=1}^{m+1} \frac{\beta ^\alpha }{\Gamma (\alpha )}\lambda _i^{\alpha +N_i-1}e^{-[\beta +( \tau _i-\tau _{i-1})]\lambda _i}, \end{aligned}$$
(12)

where \(N_i\) is the number of Poisson events arriving in the i-th segment. Similarly, the full conditional of \(\rho\) is

$$\begin{aligned} \rho |x[0,T], Y[0,T], {\mathbf {Q}}, \Lambda \varpropto \prod \limits _{i=1}^{m+1} \frac{\zeta ^\eta }{\Gamma (\eta )}\rho _i^{\eta +N_i-1}e^{-[\zeta +S_i]\rho _i}, \end{aligned}$$
(13)

where \(S_i\) is the cumulative sum of the marks in the i-th segment.
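A minimal R sketch of the conjugate updates (12) and (13) (our naming) is:

```r
# Minimal sketch of the conjugate updates (12)-(13): given the current
# changepoints tau and the data, draw lambda_i and rho_i segment by segment.
sample_rates <- function(t, y, tau, alpha = 1, beta = 1, eta = 1, zeta = 1) {
  seg <- cut(t, breaks = tau, labels = FALSE, include.lowest = TRUE)
  k <- length(tau) - 1                              # m + 1 segments
  N <- tabulate(seg, nbins = k)                     # events per segment
  S <- sapply(1:k, function(i) sum(y[seg == i]))    # mark sums per segment
  list(lambda = rgamma(k, shape = alpha + N, rate = beta + diff(tau)),
       rho    = rgamma(k, shape = eta + N,   rate = zeta + S))
}
```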

We perform three simulations. In the first case, a multiple changepoint model with simultaneous changepoints in both the Poisson rate and the associated marks is considered. A three-changepoint model is assumed as the true model, with the specified parameters listed in rows (a) of Table 1. In each segment of the model, we simulate only 50 Poisson events attached with exponential marks, for a total of 200 observations, so the exact changepoints fall at the 51st, 101st and 151st observations. We fit the simulated data by multiple changepoint models with 2–4 changepoints. For each model, the Gibbs sampler iterates 100,000 times, starting from two different sets of initial values and hyperparameters, and the last 10,000 samples are treated as posterior samples. The estimates of the parameters, including \(\Lambda\) and \(\rho\), are given by the posterior means of the samples. Generally, the results are not very sensitive to the initial values and the hyperparameters in the priors; we present only one of the simulation results.

It is noted that the four segments are moderately separated, as indicated by the Poisson rates \(\Lambda\) and the \(\rho\) specified in the true model, see rows (a) of Table 1. However, with only 50 marked Poisson arrivals in each segment, it is still hard for the algorithm to accurately locate the changepoints and estimate the model parameters. For the three-changepoint model, the locations of the changepoints, the Poisson rates and \(\rho\) in each segment are properly estimated in comparison to the true values, see rows (c) of Table 1. However, according to rows (b) of Table 1, the log-likelihood of the two-changepoint model, rather than that of the three-changepoint model, is the highest among all the models. With greater penalties for the 3–4 changepoint models, their modified BIC values are smaller, so it is unnecessary to list all of them in Table 1. Obviously, a parsimonious model, i.e. the two-changepoint model, is preferred for this short sequence of only 200 events in terms of the modified BIC, which actually deviates from the true data generating mechanism.

The second simulation is a piecewise marked Poisson process parameterized nearly the same as the previous example, with Poisson rates \(\Lambda =(2,5,2,5)\) and exponential rates \(\rho =(3,2,3,2)\), see rows (a) in Table 2. However, in each segment we simulate 150 observations instead of only 50, for a total of 600 observations; the exact changepoints fall at the 151st, 301st and 451st observations. We fit the simulated data by multiple changepoint models with 2–4 changepoints. Again, the Gibbs sampler iterates 100,000 times in each case, with the last 10,000 samples treated as posterior samples. With more observations available, the algorithm locates the changepoints and estimates the model parameters reasonably well. For the three-changepoint model, the lag-5 autocorrelations of \({\mathbf {Q}}\) are all below 0.01, suggesting good mixing of the Gibbs sampler. The locations of the changepoints are accurately recovered and the estimated parameters are close to the true values, see rows (c) of Table 2. For the models with two or four changepoints, the estimated \({\mathbf {Q}}, \Lambda , \rho\) and the locations of the changepoints are listed in rows (b) and (d) of Table 2. In this case, the MBIC is able to identify the number of changepoints of the true model.

The previous two examples were designed for marked Poisson processes with simultaneous changepoints in both the Poisson rates and the associated marks. The third simulation is designed for a multiple changepoint model with miscellaneous types of changepoints, in which some changepoints appear in the Poisson rates or in the associated marks alone, while others appear in both components. For this marked Poisson process, the Poisson rates are \(\Lambda =(2,5,2,2)\) and the rate parameters of the exponential marks are \(\rho =(2,3,3,2)\), see rows (a) in Table 3, so both types of changepoints appear in this example. The true parameters of this simulation are comparable to those of the previous simulations. As in the second simulation, we generate 150 observations in each segment, 600 observations in total, and the exact changepoints fall at the 151st, 301st and 451st observations. With the same number of observations as the previous simulation, it is more difficult for the algorithm to locate the changepoints and estimate the other model parameters in this case. We fit the simulated data by multiple changepoint models with 2–4 changepoints. In each case, the Gibbs sampler iterates 100,000 times, with the last 10,000 samples treated as posterior samples. For the three-changepoint and two-changepoint models, the lag-5 autocorrelations of \({\mathbf {Q}}\) are all below 0.01; however, the Gibbs sampler converges rather slowly for misspecified models. The estimated \({\mathbf {Q}}, \Lambda , \rho\) and the locations of the changepoints are listed in Table 3. For the three-changepoint model, the changepoints can still be properly located and the estimated parameters are close to the true values of the simulated model, see rows (c) of Table 3. In this case, the modified BIC properly identifies the number of changepoints of the simulated series.

Table 1 The table lists the estimated \({\mathbf {Q}}, \varvec{\Lambda }, \rho\), the locations of changepoints \(\tau s\), along with logL for multiple changepoint models
Table 2 The table lists the estimated \({\mathbf {Q}}, \varvec{\Lambda }, \rho\), the locations of changepoints \(\tau s\), along with MBIC and logL for multiple changepoint models
Table 3 The table lists the estimated \({\mathbf {Q}}, \varvec{\Lambda }, \rho\), the locations of changepoints \(\tau s\), along with MBIC and logL for multiple changepoint models

6 An application to deep earthquakes

The data set studied in this analysis is from the New Zealand catalogue, which is freely obtainable from GNS Science of New Zealand via GeoNet (www.geonet.org.nz). We choose events from the New Zealand catalogue within the region defined in Fig. 1, at depths greater than 50 km and with magnitude above 5 in the New Zealand local magnitude scale. All the chosen events are either beneath land or close to the shore, so they are under good coverage of the monitoring network. To ensure the analysis is not biased by missing data, it is still necessary to assess the completeness threshold of the selected events. Generally, it is believed that the magnitude of completeness of the selected events is below 5 in this period, see, e.g., the analyses of catalogue completeness for New Zealand deep earthquakes by various techniques in Lu and Vere-Jones (2011) and Lu (2012). Some descriptive properties of New Zealand deep earthquakes, such as the epicentral and depth distributions, are given in Lu and Vere-Jones (2011).

Fig. 1
figure 1

Epicenter distribution of deep earthquakes with magnitude above 5 between 1965 and 2013. Events encircled by the dashed lines and map boundaries, within a polygon with vertices \((170^{\circ }E, 43^{\circ }S)\), \((175^{\circ }E, 36^{\circ }S)\), \((177^{\circ }E, 36^{\circ }S)\), \((180^{\circ }E, 37^{\circ }S)\), \((180^{\circ }E, 38^{\circ }S)\), \((173^{\circ }E, 45^{\circ }S)\), are considered

We focus on deep earthquake modelling in this analysis. The occurrence rate of earthquakes is a direct indicator of the level of seismicity, often characterized by the intensity function of a finite point process (Daley and Vere-Jones 2003). The magnitude-frequency distribution, also called the Gutenberg-Richter (G-R) law, states that the cumulative number N(M) of earthquakes with magnitude at least M, above the threshold \(M_c\), follows the log-linear relation:

$$\begin{aligned} \log _{10}N(M)=a-b(M-M_c), \end{aligned}$$
(14)

where a and b are constants and \(M_c\) is the magnitude threshold. The slope b, called the “b-value” in the geophysical community, indicates the relative proportion of small and large earthquakes, and corresponds to \(\beta\) in the exponential distribution \(F(x)=1-e^{-\beta x}\) via \(\beta =b\log 10\). See the right part of Fig. 2 for the magnitude-frequency distribution of the selected deep earthquakes. The log-linear relation of the magnitude-frequency distribution appears to hold well, which suggests magnitude completeness for the selected earthquakes. Otherwise, if the left end of the magnitude-frequency distribution fell below the log-linear relation predicted by the G-R law, the catalogue would typically be regarded as incomplete for smaller events due to limited detectability.
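For concreteness, a minimal R sketch of the maximum likelihood b-value estimate implied by this exponential form (Aki's estimator; mags is a hypothetical vector of magnitudes at or above \(M_c\)) is:

```r
# Minimal sketch: maximum likelihood b-value under the exponential form of
# the G-R law, using beta = b * log(10).
b_value <- function(mags, Mc = 5) {
  beta_hat <- 1 / mean(mags - Mc)   # MLE of the exponential rate
  beta_hat / log(10)                # convert to the G-R slope b
}
```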

Fig. 2
figure 2

The left part of the plot is the yearly counts of deep earthquakes. The right part of the plot is the magnitude-frequency distribution of deep earthquakes. The logarithm of the frequency of magnitude is nearly linear with respect to the magnitude for a complete catalogue

Generally, unlike shallow earthquakes, deep earthquakes are rarely followed by sequences of small aftershocks decaying according to the Omori law. Instead, the typical occurrence pattern of deep earthquakes varies from time to time, active in one period and relatively quiescent in another, see the left part of Fig. 2 for the yearly counts of deep events. We also show the centralized and normalized cumulative sums of the inter-event times \(\Delta t_i\), given by \(\frac{\sum _{i=1}^{j}\Delta t_i}{\sum _{i=1}^{n}\Delta t_i}-\frac{j}{n}\), in the left bottom of Fig. 3. For a homogeneous Poisson process, this statistic should be close to the line segment \(y=0, 0\le x \le 1\), behaving as a standard Brownian bridge on [0, 1], see Galeano (2007). From the left panel of Fig. 3 and the left part of Fig. 2, it is noted that the deep earthquakes show some nonstationarity. A constant rate Poisson process would not be adequate for fitting the occurrence rate of the deep earthquakes; instead, a Markov modulated Poisson process is viable for characterizing the time-varying behavior of the occurrence rate of deep earthquakes.
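A minimal R sketch of this statistic (our naming; it applies equally to the trimmed magnitudes considered below) is:

```r
# Minimal sketch of the centralized and normalized cumulative sum used in
# Fig. 3, for inter-event times dt (or trimmed magnitudes m - Mc); under a
# homogeneous model it fluctuates around zero like a Brownian bridge.
cusum_stat <- function(z) {
  n <- length(z)
  cumsum(z) / sum(z) - (1:n) / n
}
# e.g. plot(cusum_stat(diff(event_times)), type = "l")  # hypothetical data
```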

Fig. 3
figure 3

The two figures in the left panel are the cumulative sum and the centralized and normalized cumulative sum of the inter-event times of deep earthquakes, respectively. The two figures in the right panel are the cumulative sum and the centralized and normalized cumulative sum of earthquake magnitudes, respectively

Another statistic of interest is the magnitude-frequency distribution (MFD). Interpretation of the b-value of the earthquake MFD has received considerable attention in the geophysical community. A wide range of observational studies suggest that the b-value varies both spatially and temporally, see Wiemer et al. (1998). The b-value variability has been attributed to the ambient stress state, material heterogeneity, focal depth and geothermal gradient, which are directly or indirectly associated with the effective stress state. The relationship has been utilized for seismic hazard evaluation and risk forecasting, see Nanjo et al. (2012), Schorlemmer and Wiemer (2005), Nuannin et al. (2005) and Lu (2017), among many others. Similarly, we illustrate the b-value variability by the centralized and normalized cumulative sums of the (trimmed) magnitudes, \(\frac{\sum _{i=1}^{j}(m_i-M_c)}{\sum _{i=1}^{n}(m_i-M_c)}-\frac{j}{n}\), in the right bottom of Fig. 3. Again, when there is no changepoint in the b-value, the statistic should behave like a standard Brownian bridge on [0, 1]. From the right top and right bottom of Fig. 3, b-value variation is apparent for the deep earthquakes. It is also observed that there exists some sort of “coupling” between the deep seismicity rate and the b-value, see the bottom two graphs of Fig. 3: both the seismicity rate and the b-value show a change at about 200.

One approach to jointly modelling the temporal variability of the deep seismicity rate and the b-value is the Markov modulated Poisson process with state-dependent marks, formulated as in the previous sections. The occurrence times and the magnitude-frequency distribution of the deep earthquakes are fitted by multiple changepoint models with 1–6 changepoints. In each case, the Gibbs sampling scheme iterates 500,000 times, with the last 10,000 samples treated as posterior samples. For this short sequence, the models with 2–6 changepoints appear to overfit the data, as indicated by the MBIC values listed in Table 4. From Table 4, obvious similarities are also observed among the Poisson intensity rates and the rate parameters of the mark distributions, particularly for the multiple changepoint models with 3–6 changepoints, which again suggests that the models with 2–6 changepoints overfit the data. The single changepoint model is sufficient to characterize the heterogeneity of the data. For this single changepoint model, after 500,000 Gibbs iterations, the lag-5 autocorrelations for all the parameters are less than or around 0.01, suggesting that the Markov chain mixes well. The posterior summary for the single changepoint model is displayed in Fig. 4. The figure shows the kernel density estimates and the \(95\%\) highest posterior density (HPD) intervals from 10,000 samples for some of the model parameters: the location of the changepoint, the transition rate, the Poisson intensity rate and the rate parameter of the mark distribution. Both the upper and lower limits of the HPD intervals are indicated in the figure. The location of the changepoint appears somewhat diffuse, as indicated in the left top of the figure, which may result from a progressive rather than an abrupt change in the seismicity rate and/or the b-value. The posterior inference is relatively robust over a range of proper hyperparameters.

Fig. 4
figure 4

Kernel density estimates for some of the model parameters. The line segments in bold beneath each kernel density estimate indicate the \(95\%\) highest posterior density interval for the corresponding parameter; both the lower and upper limits of the \(95\%\) HPD are indicated. The left top, right top, left bottom and right bottom of the figure show the posterior summaries of the changepoint location \(\tau\), the transition rate \(q_1\), the Poisson rate \(\lambda _1\) and the rate parameter \(\rho _1\) of the exponential mark respectively. The number of posterior samples and the bandwidth used in the kernel density estimation are given

The posterior mean of the changepoint location \(\tau\) is around 1987. Before this point, the deep seismicity is relatively quiescent, with 9 deep events per year and a relatively high b-value above \(3.2/\log (10)\approx 1.3\). Since 1987, it appears that the deep seismicity rate increased and the b-value dropped to \(2.4/\log (10)\approx 1\): the deep seismicity changed from a relatively quiescent period to an active period. The accelerated energy release by deep seismicity after 1987 suggests a relatively high risk for the occurrence of large deep earthquakes. Although most deep earthquakes, including large ones, pose no direct threat to people, some great deep earthquakes do occasionally cause severe damage, so it is worthwhile to evaluate the risks of the occurrence of large deep earthquakes. For instance, the expected number of deep earthquakes with magnitude greater than 6 per year after 1987 is

$$\begin{aligned} E\left( \sum \limits _{i=1}^{N(1)} I_{M_i\ge 6}\right) =E(N(1))P(M\ge 6)=1.09, \end{aligned}$$
(15)

which is nearly three times that before 1987. In the above equation, N(1) is the number of earthquakes occurring per year and \(I_{M_i\ge 6}\) is the indicator that the i-th earthquake has magnitude greater than 6. The top of Fig. 5 shows the magnitude-time plot for major deep earthquakes with magnitude above 6. The vertical red dashed line shows the location of the changepoint \(\tau\), indicating a strong contrast in the risk of large deep earthquakes before and after that point. In addition, some sort of “weak coupling” seems to exist between the deep and shallow seismicity, see the bottom of Fig. 5, where the seismically active episode of deep earthquakes roughly coincides with that of shallow earthquakes. The episode of high seismicity rate and high mean energy release per quake is also the episode of frequent occurrences of large shallow earthquakes. The increase in the moment release rate by deep earthquakes may be attributed to an increase in the convergence rate of the tectonic plates, and hence increased risks of near-surface geological hazards such as shallow earthquakes and volcanic activity in the most recent decade.
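As an arithmetic check of (15), a minimal R sketch is given below; \(\rho _2=2.4\) is reported above, while the post-changepoint yearly rate of about 12 events is our assumption, chosen only to be consistent with the quoted value of 1.09:

```r
# Minimal sketch of the rate computation in (15); lambda2 is an assumed
# (illustrative) post-changepoint yearly rate, not a reported estimate.
lambda2 <- 12                     # assumed post-changepoint events per year
rho2 <- 2.4                       # post-changepoint exponential mark rate
Mc <- 5                           # magnitude threshold
lambda2 * exp(-rho2 * (6 - Mc))   # E N(1) * P(M >= 6), approximately 1.09
```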

Fig. 5
figure 5

The magnitude vs. time plot for major deep earthquakes and shallow earthquakes, with the location of the changepoint indicated by \(\tau\)

Nearly the same data set was analyzed in Lu (2012), where an MMPP with state-dependent marks was applied to characterize the variability of seismicity rates and b-values; a full transition rate matrix of the latent Markov chain was specified, and the model parameters were estimated by the EM algorithm. That model formulation assumes there exist “state reciprocals” in deep seismicity, which may cause model bias when no such “state reciprocals” actually exist. Furthermore, for a short sequence with only a few state transitions, the estimation of the transition rate matrix is unreliable, causing obvious instabilities in the state filtering and smoothing by the methods in Lu (2012): tiny perturbations may produce wholly different estimates of the latent state trajectories, leading to different decisions and scientific insights in practice. With a left-to-right restriction on the transition rate matrix, our model is suitable for modelling both long sequences and short sequences in which state transitions are rare. The current model formulation also avoids potential bias from the use of MMPPs with a full transition rate matrix when there is no underlying “state reciprocal” in the MFD and/or the deep seismicity rate. In addition, potential gains of the Bayesian approach include quantifying the uncertainties of the number and positions of changepoints, which is difficult with other approaches.

7 Concluding remarks and discussion

In this study, we propose a Bayesian multiple changepoint model to jointly detect changepoints in both the deep seismicity rate and the magnitude-frequency distribution, which is an extension of Chib’s multiple changepoint model to continuous time. We suggest an approach to directly simulate the full trajectory of the latent Markov chain in a block Gibbs sampling scheme. The locations and the number of changepoints can be obtained by a continuous time Viterbi algorithm and a modified BIC, tailored particularly to changepoint problems of marked Poisson processes. The model is applied to analyse the time-varying pattern of deep seismicity in New Zealand. An increase in both the deep seismicity rate and the mean energy release per quake has been seen since 1987, the period in which most major deep and shallow earthquakes occurred. The change in deep seismicity may be attributed to an increase in the convergence rate of the tectonic plates, suggesting a relatively high risk of near-surface hazards such as large shallow earthquakes in recent decades. The high seismicity and high mean energy release per quake showed no signs of turning down up to 2014. The method is also potentially applicable to modelling insurance claims (Elliott et al. 2007).

The current model considers only one type of nonstationarity, in which the model parameters are piecewise constant, subject to abrupt changes at a fixed number of locations. In practice, it might be sensible to consider other forms of nonstationarity, such as progressive changes or trends in the model parameters, which are beyond the current model. In addition, in contrast to some other approaches, the current formulation assumes a fixed number of changepoints; it might be desirable to allow the number of changepoints to grow unboundedly upon the arrival of new data. Finally, there seems to exist some “weak coupling” between the deep and shallow seismicity. However, whether large shallow earthquakes are preceded by an increase in deep seismicity, or vice versa, has not been thoroughly investigated so far. A study of this question is potentially valuable for seismic hazard evaluation and earthquake risk forecasting (Kagan 2017) in subduction zones.

Table 4 The table lists the estimated \({\mathbf {Q}}, \varvec{\Lambda }, \rho\), the locations of changepoints \(\tau s\), along with MBIC and logL for multiple changepoint models