1 Introduction

Change point estimation in sequential data has become an important task in many areas of active research. It assumes that at least two different processes are observed over some time interval. Since the specific times associated with each process are typically unknown, they have to be estimated along with the processes themselves. Applications of change point estimation procedures can be found in medicine [1], ecology [2], pharmacy [3], engineering [4], finance [5, 6], and many other fields. The problem of joint process and change point estimation is also known as phase I in statistical process control; phase II then deals with detecting changes in a process flow based on the already estimated processes.

Researchers have been exploring change point problems for decades, but many questions remain open. One of the earliest papers on the subject was devoted to the estimation of a change point in the means of univariate normal distributions [7]. The problem with a constant mean but a possible shift in the variance parameter was considered by [8,9,10,11]. A generalization of both ideas was developed by [12], who proposed a test capable of detecting a change in the mean and variance parameters simultaneously.

Attention has been paid to multivariate settings as well. [13] and [14] considered the framework with a single change point in the mean vectors of multivariate normal distributions. Soon after, the estimation of multiple change points in mean vectors was studied by [15] and [16]. In the same multivariate normal setting, [17] proposed a procedure for estimating a change in covariance matrices under the assumption of a constant mean vector. Recently, [18] developed a test for estimating change points in mean vectors and covariance matrices simultaneously, thus generalizing the above ideas. Other directions of research in the area of change point estimation include inference for the general exponential family [19, 20] and nonparametric methods [21], including probabilistic pruning based on various goodness-of-fit measures [22].

In this paper, we consider the problem of estimating multiple change points in multivariate processes. The problem is important, yet the number of existing methods is very limited (e.g., see the discussion in [22]). The most traditional approach, taken by the majority of researchers, assumes the independence of observations over time as well as their multivariate normality. Unfortunately, both assumptions are often inadequate or unrealistic. Among the available alternatives are two nonparametric procedures employing probabilistic pruning with the Energy statistic [23] and the Kolmogorov-Smirnov statistic [24], both implemented in the R package ecp [22]. It is worth mentioning that this R package is currently the only one aimed at identifying multiple change points in the multivariate setting. The lack of developments in this important area of change point inference motivates our methodology. Our proposed technique is based on the matrix normal distribution. Due to its form, one can model not only the covariance structure associated with variables (given by matrix rows) or time points (given by matrix columns) separately, but also the overall covariance structure linking variables and times. This effectively eliminates some common restrictive assumptions such as the independence of observations at different time points. To make the procedure more robust to deviations from normality, we incorporate one of several available transformations to near-normality. As a result, the proposed procedure gains robustness while accommodating various covariance structures in the data.

The rest of the paper is organized as follows. Section 2 presents the proposed methodology. Section 3 investigates the performance of our procedure and three competitors in various settings. Section 4 applies the developed methods to the analysis of real-life data. The paper concludes with a discussion in Sect. 5.

2 Methodology

2.1 Matrix Normal Distribution

Let \(\boldsymbol{y}_1, \boldsymbol{y}_2, \ldots , \boldsymbol{y}_T\) be a process observed over T time points, with each \(\boldsymbol{y}_t\) following a p-variate normal distribution. The entire dataset can be conveniently summarized in matrix form as

$$\begin{aligned} \boldsymbol{Y}= \left( \begin{array}{cccc} y_{11} & y_{12} & \ldots & y_{1T} \\ y_{21} & y_{22} & \ldots & y_{2T} \\ \vdots & \vdots & \ddots & \vdots \\ y_{p1} & y_{p2} & \ldots & y_{pT} \\ \end{array}\right) . \end{aligned}$$
(1)

Here, each row represents a particular variable observed over time, while every column stands for a p-variate measurement at a specific time point. The overall variability associated with \(\boldsymbol{Y}\) can often be explained by the variation observed in rows and columns. This leads to the idea of modeling the variability corresponding to p variables separately from that associated with T time points.

One distribution that can be effectively applied in the considered framework is the so-called matrix normal distribution [25], which has the following probability density function (pdf):

$$\begin{aligned} \phi _{p \times T}(\boldsymbol{Y}; \boldsymbol{M}, \boldsymbol{\Sigma }, \boldsymbol{\Psi }) = \frac{(2\pi )^{-\frac{pT}{2}}}{|\boldsymbol{\Sigma }|^{\frac{T}{2}}|\boldsymbol{\Psi }|^{\frac{p}{2}}}\exp \left\{ -\frac{1}{2} \text{tr}\left\{ \boldsymbol{\Sigma }^{-1}(\boldsymbol{Y}- \boldsymbol{M})\boldsymbol{\Psi }^{-1}(\boldsymbol{Y}- \boldsymbol{M})^\top \right\} \right\} , \end{aligned}$$
(2)

where \(\boldsymbol{Y}\) is the \(p \times T\) matrix argument defined in (1) and \(\boldsymbol{M}\) is a \(p \times T\) mean matrix. The \(p \times p\) matrix \(\boldsymbol{\Sigma }\) and the \(T \times T\) matrix \(\boldsymbol{\Psi }\) are covariance matrices that model the variability associated with rows and columns, respectively, and \(\text{tr}\{\cdot \}\) denotes the trace operator. It can be shown that \(\text{vec}(\boldsymbol{Y}) \sim \mathcal N_{pT}(\text{vec}(\boldsymbol{M}), \boldsymbol{\Psi }\otimes \boldsymbol{\Sigma })\), where \(\text{vec}(\cdot )\) denotes the vectorization operator that stacks matrix columns on top of each other, \(\otimes \) is the Kronecker product, and \(\mathcal N_{pT}\) is the pT-variate normal distribution with mean vector \(\text{vec}(\boldsymbol{M})\) and covariance matrix \(\boldsymbol{\Psi }\otimes \boldsymbol{\Sigma }\). There is a minor non-identifiability issue caused by the properties of the Kronecker product, since \(a\boldsymbol{\Psi }\otimes \boldsymbol{\Sigma }= \boldsymbol{\Psi }\otimes a\boldsymbol{\Sigma }\) for any multiplier \(a \in \mathbb{R}^+\). One simple restriction on \(\boldsymbol{\Psi }\) or \(\boldsymbol{\Sigma }\) effectively resolves this problem. The main advantage of taking the matrix data structure into account is the reduction of the number of covariance parameters from \(pT(pT+1)/2\) in the case of the most general covariance matrix to \(T(T+1)/2 + p(p+1)/2 - 1\). Hence, the proposed model effectively addresses a potential overparameterization issue while still allowing non-zero covariances \(\text{Cov}(y_{jt}, y_{j^\prime t^\prime })\) for any variables j and \(j^\prime \) at time points t and \(t^\prime \).
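As a concrete illustration of this vectorized representation, the following minimal R sketch simulates a single matrix normal draw through \(\text{vec}(\boldsymbol{Y}) \sim \mathcal N_{pT}(\text{vec}(\boldsymbol{M}), \boldsymbol{\Psi }\otimes \boldsymbol{\Sigma })\); the dimensions and parameter values are illustrative only.

```r
# Sketch: one matrix normal draw via the vec identity.
# (T shadows R's TRUE shorthand; kept to match the paper's notation.)
set.seed(1)
p <- 3; T <- 100
M     <- matrix(0, p, T)                   # p x T mean matrix
Sigma <- diag(p)                           # p x p row covariance
phi   <- 0.5
Psi   <- toeplitz(phi^(0:(T - 1)))         # T x T AR(1) correlation matrix
L     <- chol(Psi %x% Sigma)               # Kronecker product, then Cholesky
y_vec <- as.vector(M) + drop(crossprod(L, rnorm(p * T)))
Y     <- matrix(y_vec, nrow = p, ncol = T) # unstack columns to recover Y
```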

As the specific problem considered in our setting deals with vectors observed over time, the matrix \(\boldsymbol{\Psi }\) can be conveniently parameterized in terms of a desired time series process. In this paper, we illustrate the methodology with the autoregressive process of order 1 (AR(1)). Incorporating moving average or higher-order autoregressive processes is analogous, as it affects only the covariance matrix \(\boldsymbol{\Psi }\). In fact, the AR(1) model has been chosen as an illustration simply because it yields the best results for the application considered in Sect. 4. Under AR(1), the covariance matrix \(\boldsymbol{\Psi }\) is given by

$$\begin{aligned} \boldsymbol{\Psi }= \frac{\delta ^2}{1-\phi ^2}\left( \begin{array}{ccccc} 1 & \phi & \phi ^2 & \ldots & \phi ^{T-1} \\ \phi & 1 & \phi & \ldots & \phi ^{T-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \phi ^{T-1} & \phi ^{T-2} & \phi ^{T-3} & \ldots & 1 \\ \end{array}\right) , \end{aligned}$$

where \(\phi \) is the autocorrelation coefficient and \(\delta ^2\) is the variance parameter. Then, one convenient constraint that avoids the non-identifiability issue associated with \(\boldsymbol{\Psi }\otimes \boldsymbol{\Sigma }\) is to set \(\delta ^2 = 1 - \phi ^2\). This restriction immediately leads to \(\boldsymbol{\Psi }\equiv \boldsymbol{R}_\phi \), where \(\boldsymbol{R}_\phi \) denotes the corresponding correlation matrix, which relies on the single parameter \(\phi \). It can be shown that

$$\begin{aligned} |\boldsymbol{\Psi }| \equiv |\boldsymbol{R}_\phi | = (1 - \phi ^2)^{T-1} \quad \text{ and } \quad \boldsymbol{\Psi }^{-1} \equiv \boldsymbol{R}_\phi ^{-1} = \frac{1}{1 - \phi ^2}(\boldsymbol{I}_T - \phi \boldsymbol{J}_1 + \phi ^2 \boldsymbol{J}_2), \end{aligned}$$
(3)

where \(\boldsymbol{J}_1\) and \(\boldsymbol{J}_2\) are the \(T \times T\) matrices

$$\begin{aligned} \boldsymbol{J}_1 = \left( \begin{array}{cccccc} 0 & 1 & 0 & \ldots & 0 & 0 \\ 1 & 0 & 1 & \ldots & 0 & 0 \\ 0 & 1 & 0 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & 0 & 1 \\ 0 & 0 & 0 & \ldots & 1 & 0 \\ \end{array}\right) \quad \text {and} \quad \boldsymbol{J}_2 = \left( \begin{array}{cccccc} 0 & 0 & 0 & \ldots & 0 & 0 \\ 0 & 1 & 0 & \ldots & 0 & 0 \\ 0 & 0 & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & 1 & 0 \\ 0 & 0 & 0 & \ldots & 0 & 0 \\ \end{array}\right) . \end{aligned}$$

The expressions in (3) enable faster maximum likelihood estimation, as the potentially time-consuming inversion of the \(T \times T\) covariance matrix \(\boldsymbol{\Psi }\) can be avoided entirely.
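The closed forms in (3) are easy to verify numerically; the following R sketch does so for a small, illustrative choice of T and \(\phi \).

```r
# Sketch: numerical check of (3). J1 is tridiagonal with ones off the
# diagonal; J2 is diagonal with ones everywhere except the first and
# last entries.
T   <- 6; phi <- 0.7
R   <- toeplitz(phi^(0:(T - 1)))
J1  <- diag(0, T); J1[abs(row(J1) - col(J1)) == 1] <- 1
J2  <- diag(c(0, rep(1, T - 2), 0))
Rinv <- (diag(T) - phi * J1 + phi^2 * J2) / (1 - phi^2)
all.equal(Rinv, solve(R))                # TRUE: closed-form inverse
all.equal(det(R), (1 - phi^2)^(T - 1))   # TRUE: closed-form determinant
```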

2.2 Change Point Estimation

Consider the problem of estimating change points in the given framework. Let \(\boldsymbol{\mu }_0\) be the p-variate mean vector associated with the main process, and suppose there are K alternative processes with means \(\boldsymbol{\mu }_1, \boldsymbol{\mu }_2, \ldots , \boldsymbol{\mu }_K\). Then, the mean matrix \(\boldsymbol{M}\) can be written as \(\boldsymbol{M}= \sum _{k=0}^K \boldsymbol{\mu }_k \boldsymbol{m}_k^\top \), where \(\boldsymbol{m}_k\) (\(k = 0, 1, \ldots , K\)) is a vector of length T consisting of zeros and ones, with ones located in the positions where the \(k^{th}\) process is observed. From this definition, it follows that \(\sum _{k=0}^K \boldsymbol{m}_k = \boldsymbol{1}_T\), where \(\boldsymbol{1}_T\) is the vector of length T with all elements equal to 1. In general, the vectors \(\boldsymbol{m}_k\) can represent arbitrary arrangements of zeros and ones. In the case of K shift change points at times \(t_1, t_2, \ldots , t_K\), however, the mean matrix is given by

$$\begin{aligned} \boldsymbol{M}= \left( \underbrace{\boldsymbol{\mu }_0, \ldots , \boldsymbol{\mu }_0}_{t_1 - 1}, \underbrace{\boldsymbol{\mu }_1, \ldots , \boldsymbol{\mu }_1}_{t_2-t_1}, \ldots , \underbrace{\boldsymbol{\mu }_{K-1}, \ldots , \boldsymbol{\mu }_{K-1}}_{t_K - t_{K-1}}, \underbrace{\boldsymbol{\mu }_{K}, \ldots , \boldsymbol{\mu }_{K}}_{T-t_K+1}\right) . \end{aligned}$$

Correspondingly, \(\boldsymbol{m}_k = \left( \underbrace{0,\ldots ,0}_{t_{k}-1} , \underbrace{1,\ldots ,1}_{t_{k+1}-t_k}, \underbrace{0,\ldots ,0}_{T - t_{k+1} + 1}\right) ^\top \) with the boundary conventions \(t_0 = 1\) and \(t_{K+1} = T+1\). Under this parameterization, the mean matrix \(\boldsymbol{M}\) involves \(p(K+1)\) parameters.
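The construction of the indicator vectors \(\boldsymbol{m}_k\) and the mean matrix \(\boldsymbol{M}\) is straightforward to code; the R sketch below is one possible implementation, with make_M a hypothetical helper whose arguments are the \(p \times (K+1)\) matrix of means, the change point times, and T.

```r
# Sketch: build M = sum_k mu_k m_k' for K shift change points.
# mus: p x (K+1) matrix with columns mu_0, ..., mu_K; cps: c(t_1, ..., t_K).
make_M <- function(mus, cps, T) {
  bounds <- c(1, cps, T + 1)                  # t_0 = 1, t_{K+1} = T + 1
  m <- sapply(0:length(cps), function(k)      # T x (K+1) indicator matrix
    as.numeric(seq_len(T) >= bounds[k + 1] & seq_len(T) < bounds[k + 2]))
  mus %*% t(m)                                # p x T mean matrix
}
make_M(mus = cbind(c(0, 0), c(1, 2)), cps = 5, T = 10)  # p = 2, K = 1
```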

The log-likelihood function corresponding to Eq. (2) has the following form:

$$\begin{aligned} \log {\mathcal L}(\boldsymbol{Y}; \boldsymbol{M}, \boldsymbol{\Sigma }, \boldsymbol{\Psi }) = &-\frac{pT}{2}\log (2\pi ) - \frac{T}{2}\log |\boldsymbol{\Sigma }| - \frac{p}{2}\log |\boldsymbol{\Psi }| \\ &-\frac{1}{2}\text{tr}\left\{ \boldsymbol{\Sigma }^{-1}(\boldsymbol{Y}- \boldsymbol{M})\boldsymbol{\Psi }^{-1}(\boldsymbol{Y}- \boldsymbol{M})^\top \right\} . \end{aligned}$$

Oftentimes, the normality assumption is not adequate, and inference based on such a model may be incorrect or misleading. One possible remedy is to employ a transformation to near-normality: incorporating a transformation into the model makes it considerably more robust to violations of the normality assumption. Immediate candidates include the well-known power transformation of [26], alternative families of power transformations as in [27], and the exponential transformation of [28]. Let \(\mathcal T\) be an invertible and differentiable mapping representing the transformation operator such that \(\mathcal T(y; \lambda )\) is approximately normally distributed for an appropriate choice of the transformation parameter \(\lambda \). In the p-variate setting, the traditional assumption is that the coordinatewise transformation leads to joint near-normality [29,30,31], i.e., the p-variate transformation is given by \(\mathcal T(\boldsymbol{y}; \boldsymbol{\lambda }) = \left( \mathcal T(y_1; \lambda _1), \mathcal T(y_2; \lambda _2), \ldots , \mathcal T(y_p; \lambda _p)\right) ^\top \), where \(\boldsymbol{\lambda }= \left( \lambda _1, \lambda _2, \ldots , \lambda _p\right) ^\top \) is the vector of transformation parameters. This idea readily generalizes to the matrix framework, with \(\mathcal T(\boldsymbol{Y}; \boldsymbol{\lambda })\) representing the data transformed to matrix near-normality based on the p-variate vector \(\boldsymbol{\lambda }\).

Taking into account the special forms of \(\boldsymbol{\Psi }\) and \(\boldsymbol{M}\) and implementing the transformation idea, the log-likelihood function can be further written as

$$\begin{aligned} \log {\mathcal L}(\boldsymbol{\mu }_0, \boldsymbol{\mu }_1, \ldots , \boldsymbol{\mu }_K, \boldsymbol{\Sigma }, \phi , \boldsymbol{\lambda }) = &-\frac{pT}{2}\log (2\pi ) - \frac{T}{2}\log |\boldsymbol{\Sigma }| - \frac{p(T-1)}{2}\log (1-\phi ^2) \\ &-\frac{1}{2(1 - \phi ^2)}\text{tr}\Big \{\boldsymbol{\Sigma }^{-1}\big (\mathcal T(\boldsymbol{Y};\boldsymbol{\lambda }) - \sum _{k=0}^K \boldsymbol{\mu }_k \boldsymbol{m}_k^\top \big )\left( \boldsymbol{I}_T - \phi \boldsymbol{J}_1 + \phi ^2 \boldsymbol{J}_2\right) \\ &\times \big (\mathcal T(\boldsymbol{Y};\boldsymbol{\lambda }) - \sum _{k=0}^K \boldsymbol{\mu }_k \boldsymbol{m}_k^\top \big )^\top \Big \} + \log \Big |\frac{\partial \mathcal T(\boldsymbol{Y}; \boldsymbol{\lambda })}{\partial \boldsymbol{Y}}\Big |, \end{aligned}$$
(4)

where \(\log \Big |\frac{\partial \mathcal T(\boldsymbol{Y}; \boldsymbol{\lambda })}{\partial \boldsymbol{Y}}\Big |\) is the log-Jacobian associated with the transformation.

Maximum likelihood estimation leads to the following expressions for \(\boldsymbol{\mu }_k\):

$$\begin{aligned} \boldsymbol{\mu }_k = \Big (\mathcal T(\boldsymbol{Y};\boldsymbol{\lambda }) -\sum _{k^\prime =0 \atop k^\prime \ne k}^K \boldsymbol{\mu }_{k^\prime }\boldsymbol{m}_{k^\prime }^\top \Big )\boldsymbol{R}_\phi ^{-1}\boldsymbol{m}_k\left( \boldsymbol{m}_k^\top \boldsymbol{R}_\phi ^{-1}\boldsymbol{m}_k\right) ^{-1}, \end{aligned}$$

where \(\boldsymbol{R}_\phi ^{-1}\) is as in (3). Solving the resulting system of \(K+1\) equations yields the expressions for \(\boldsymbol{\mu }_0, \boldsymbol{\mu }_1, \ldots , \boldsymbol{\mu }_K\). Maximum likelihood estimation for \(\boldsymbol{\Sigma }\) yields the following expression:

$$\begin{aligned} \boldsymbol{\Sigma }= \frac{(\mathcal T(\boldsymbol{Y};\boldsymbol{\lambda }) - \sum _{k=0}^K \boldsymbol{\mu }_k \boldsymbol{m}_k^\top )\boldsymbol{R}_\phi ^{-1} (\mathcal T(\boldsymbol{Y};\boldsymbol{\lambda }) - \sum _{k=0}^K \boldsymbol{\mu }_k \boldsymbol{m}_k^\top )^\top }{T}. \end{aligned}$$

Substituting the expressions for \(\boldsymbol{\mu }_0, \boldsymbol{\mu }_1, \ldots , \boldsymbol{\mu }_K\) and \(\boldsymbol{\Sigma }\) into the log-likelihood function (4) turns the log-likelihood into a function of the parameters \(\phi \) and \(\boldsymbol{\lambda }\) only. The maximization with respect to these parameters can be carried out numerically using any of many available optimization algorithms.
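A minimal R sketch of this profiling step is given below, assuming a generic coordinatewise transformation trans() and its log-Jacobian log_jac() (instances for the Manly transformation are sketched after it is introduced below); all function and argument names are illustrative, and the unconstrained reparameterization of \(\phi \) through tanh is one convenient choice among many.

```r
# Sketch: profile log-likelihood over (phi, lambda) with mu_0, ..., mu_K
# and Sigma replaced by their closed-form ML estimates. Here m is the
# T x (K+1) matrix with columns m_0, ..., m_K.
profile_loglik <- function(par, Y, m, trans, log_jac) {
  p <- nrow(Y); T <- ncol(Y)
  phi    <- tanh(par[1])                      # keeps |phi| < 1
  lambda <- par[-1]
  Z    <- t(sapply(seq_len(p), function(j) trans(Y[j, ], lambda[j])))
  Rinv <- solve(toeplitz(phi^(0:(T - 1))))    # or the closed form in (3)
  Mu   <- Z %*% Rinv %*% m %*% solve(t(m) %*% Rinv %*% m)  # joint GLS means
  Res  <- Z - Mu %*% t(m)
  S    <- Res %*% Rinv %*% t(Res) / T         # profiled Sigma
  -p * T / 2 * log(2 * pi) - T / 2 * as.numeric(determinant(S)$modulus) -
    p * (T - 1) / 2 * log(1 - phi^2) - p * T / 2 + log_jac(Y, lambda)
}
```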

For the purpose of illustration, in this paper we focus on the exponential transformation of Manly given by \(\mathcal T(y; \lambda ) = y^{I(\lambda = 0)}\left( (\exp \{\lambda y\} - 1)\lambda ^{-1}\right) ^{I(\lambda \ne 0)}\), where \(I(\cdot )\) is the indicator function. In this setting, the log-Jacobian in (4) is given by \(\boldsymbol{\lambda }^\top \boldsymbol{Y}\boldsymbol{1}_T\), where \(\boldsymbol{1}_T = (1, 1, \ldots , 1)^\top \) is the vector of ones of length T.
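Under this transformation, the generic pieces assumed in the earlier sketch can be instantiated as follows; again, the names are illustrative.

```r
# Manly's exponential transformation, applied elementwise to one row of Y
manly <- function(y, lambda) {
  if (abs(lambda) < 1e-10) y else (exp(lambda * y) - 1) / lambda
}
# Since dT(y; lambda)/dy = exp(lambda * y), the log-Jacobian over all
# entries of Y is sum_{j,t} lambda_j * y_jt = lambda' Y 1_T
log_jacobian <- function(Y, lambda) sum(lambda * rowSums(Y))

# Numerical maximization over (phi, lambda), e.g. with optim():
# fit <- optim(c(0, rep(0, nrow(Y))), profile_loglik, Y = Y, m = m,
#              trans = manly, log_jac = log_jacobian,
#              control = list(fnscale = -1))
```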

The problem of change point estimation also requires assessing the number of processes. To avoid potential problems with the adjustment for multiple comparisons, to simplify calculations, and to avoid testing procedures in general, we employ the variant of the Bayesian Information Criterion (BIC) [32] proposed by [33] specifically for the change point framework. BIC is also an appealing option due to its connection to the Bayes factor commonly used in Bayesian inference for comparing competing models.
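Since we do not reproduce the exact penalty of the change point variant of [33] here, the sketch below shows only the standard Schwarz form for orientation, with the parameter count implied by the model of this section.

```r
# Sketch: generic Schwarz-type criterion (lower is better); the change
# point variant of [33] adjusts the penalty, so this form is illustrative.
# Free parameters: p*(K+1) means, p*(p+1)/2 entries of Sigma, one phi,
# and p transformation parameters lambda.
d   <- p * (K + 1) + p * (p + 1) / 2 + 1 + p
bic <- -2 * loglik + d * log(T)
```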

As a final note in this section, we remark that the proposed procedure focuses on processes with mean vectors \(\boldsymbol{\mu }_1, \boldsymbol{\mu }_2, \ldots , \boldsymbol{\mu }_K\). In real-life applications, it is possible that only some coordinates of these vectors differ while the remaining variables exhibit no change point behavior. The task of detecting changes in specific variables is a challenging standalone problem that is beyond the scope of this work. One practical approach is to search for such variables after first detecting differences in the mean vectors. Such a scenario is considered in Sect. 4.

3 Experiments

In this section, we present simulation studies devoted to a rigorous evaluation of the proposed methodology. We investigate the performance of the change point estimation procedure in two general settings. In both cases, we assume the existence of three processes observed over 100 time points. In the first setting, the first process is observed until the change point at \(t_1 = 10\), when the second process starts; the second process then runs until the next change point at \(t_2 = 20\), when the third process starts and runs for the remaining time. In the second setting, the change points are located at times \(t_1 = 10\) and \(t_2 = 50\). The difference between the two settings is that in the first, the first two processes are observed for relatively short periods of time while the third process is observed for much longer, whereas in the second, only the first process is observed for a short period of time. The parameters used in the simulation study are provided in Table 1.

Table 1 Parameter values used in the simulation study of Sect. 3

Various levels of correlation and scaling, reflected by the parameters \(\phi \) and \(\boldsymbol{\Sigma }\), respectively, are studied. In particular, we consider \(\phi = 0.1, 0.5, 0.9\) and \(\boldsymbol{\Sigma }, \boldsymbol{\Sigma }/ 2, \boldsymbol{\Sigma }/4\). 250 datasets were simulated for each combination of the covariance matrix and correlation parameter in both considered settings, yielding 4,500 simulated datasets in total. The proposed technique does not assume that the exact locations of the change points are known; the quality of each candidate model fit is assessed by means of BIC. It can be noticed that in the search for the optimal model with K change points, \(\binom{T-1}{K} = (T-1)! / (K!\,(T-1-K)!)\) alternatives should be considered. As K is usually rather low, the approach is computationally feasible even for moderate T values. In our experiments, each model could be fitted in under one second. In addition, parallel computing can readily be employed if the number of models becomes restrictively high.
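As an illustration of this search for K = 2, the R sketch below enumerates all ordered pairs of candidate change points and scores each by BIC; fit_model() is a hypothetical wrapper that fits the model of Sect. 2 with the change points held fixed and returns the corresponding BIC.

```r
# Sketch: exhaustive search over ordered change point pairs (t1 < t2);
# fit_model() is a placeholder, not part of any existing package.
candidates <- t(combn(2:T, 2))           # all pairs from {2, ..., T}
bic  <- apply(candidates, 1, function(cp) fit_model(Y, cp)$bic)
best <- candidates[which.min(bic), ]     # lowest BIC wins
```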

Some simulated datasets are illustrated in Fig. 1. Plots (a) and (b) show datasets simulated with \(\phi = 0.1\) but with different covariance matrices, \(\boldsymbol{\Sigma }\) and \(\boldsymbol{\Sigma }/ 4\), respectively. Plots (c) and (d) correspond to the same covariance matrices \(\boldsymbol{\Sigma }\) and \(\boldsymbol{\Sigma }/4\) but with a high correlation of \(\phi = 0.9\). All four datasets represent the first setting, with change points at \(t_1 = 10\) and \(t_2 = 20\). Within each of the four plots, there are three subplots showing the coordinatewise behavior of the processes in black, blue, and red. The top subplot corresponds to the first coordinate, the middle to the second, and the bottom to the third. Horizontal lines show the true back-transformed values of the corresponding coordinates of the vectors \(\boldsymbol{\mu }_0\), \(\boldsymbol{\mu }_1\), and \(\boldsymbol{\mu }_2\).

Fig. 1

Datasets generated in the course of the simulation study in Sect. 3 with different scaling (reflected by \(\boldsymbol{\Sigma }\) and \(\boldsymbol{\Sigma }/4\)) and correlation (\(\phi = 0.1, 0.9\)). Horizontal lines represent true back-transformed values of the corresponding coordinates of parameters \(\boldsymbol{\mu }_0\), \(\boldsymbol{\mu }_1\), and \(\boldsymbol{\mu }_2\)

Examining Fig. 1 makes it clear that the task of change point estimation is far from trivial in these cases. Especially when the variability is higher (left column of plots), we can observe a number of points that could be mistaken for change points. Thus, it is to be expected that false change points will often be found. Moreover, the first change point should be considerably easier to find than the second one due to the substantial gap in the second coordinate of the means of the first two processes (i.e., between the black and blue horizontal lines).

As pointed out by [22], the number of procedures capable of estimating multiple change points in multivariate processes is rather limited. In this section, the developed methodology is compared with one parametric approach that we call naive and two nonparametric procedures available to practitioners through the R package ecp [22]. The naive method mimics the most common practical approach, with all observations assumed independent and following multivariate normal processes. The two nonparametric procedures are based on probabilistic pruning with the Energy statistic [23, 34] and the Kolmogorov-Smirnov statistic [24] used as goodness-of-fit measures. Tables 3 and 4 provide the results of the simulation study in the first (\(t_1 = 10\), \(t_2 = 20\)) and second (\(t_1 = 10\), \(t_2 = 50\)) settings, respectively. The tables report the proportions of times various solutions, as described in Table 2, were found.
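For reference, the two nonparametric competitors can be run along the following lines; the calls are sketches assuming the default tuning of the ecp package, with the maximum number of change points capped at an illustrative value, and the \(p \times T\) data matrix transposed because ecp expects observations in rows.

```r
library(ecp)
X <- t(Y)                          # T x p: one multivariate observation per row
res_energy <- e.cp3o(Z = X, K = 5) # pruning with the Energy statistic
res_ks     <- ks.cp3o(Z = X, K = 5)# pruning with the Kolmogorov-Smirnov statistic
res_energy$estimates               # estimated change point locations
```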

Table 2 Interpretation of notation used in Tables 3 and 4

As we can observe from Table 3, the proposed method identifies change points rather effectively. Expectedly, the performance of the procedure improves considerably when the variability decreases. For example, in the case with \(\phi = 0.9\) and \(\boldsymbol{\Sigma }\), the correct combination of change points is identified in 14.8% of all cases; the percentage improves to 49.2% and 93.2% for \(\boldsymbol{\Sigma }/2\) and \(\boldsymbol{\Sigma }/4\), respectively. The performance somewhat degrades for lower values of the parameter \(\phi \). In particular, the correct setting was found in 63.2% and 55.6% of cases for \(\boldsymbol{\Sigma }/4\) with \(\phi = 0.1\) and \(\phi =0.5\), respectively. In the settings with higher variability, the task of estimating both change points correctly is considerably more difficult. It is worth mentioning that in these settings our procedure is still capable of identifying at least one change point effectively. In particular, there is a relatively low proportion of cases in which our method identified one point correctly while the other change point estimate was considerably off. Another observation concerns the low number of false change point detections made by our procedure. In addition, due to the strong penalty imposed by BIC, there is no tendency to overestimate the number of change points, as can be seen from the row \(\{t_1, t_2, x\}\).

Table 3 Simulation study from Sect. 3 assuming two change points at times \(t_1 = 10\) and \(t_2 = 20\). The four methods considered are our proposed procedure, naive procedure, and probabilistic pruning with Energy statistic and Kolmogorov-Smirnov statistic (KS) used as the goodness-of-fit measure. The notation interpretation is provided in Table 2. The bold font highlights the proportion of times the correct combination was found

From examining Table 3, we can conclude that the closest competitor is the naive procedure. In particular, it demonstrates quite similar results in terms of the proportion of correct solutions in the majority of cases unless \(\phi = 0.9\). When \(\phi \) is high, the naive procedure is substantially outperformed by the proposed method in all settings. This observation is not surprising, since the cases with lower correlations are closer to the naive model assuming independent observations. Our method dramatically outperforms the two nonparametric methods. In the easiest case considered, with \(\phi = 0.9\) and \(\boldsymbol{\Sigma }/4\), probabilistic pruning with the Energy statistic finds the correct combination of change points in 35.6% of cases. In all other cases, both procedures face considerable challenges. One can also notice that the nonparametric methods struggle to find even one of the two change points correctly. In the case of \(\boldsymbol{\Sigma }/4\), the Kolmogorov-Smirnov statistic (denoted KS) shows some improvement for \(\phi = 0.1\): it estimates one change point correctly and the other in close proximity to the true change point in 22.4% of all cases.

The inference drawn from Table 4 is largely similar. In the meantime, we can notice that our method's performance improves in all cases. This happens because the time points are more evenly distributed among the processes, so more accurate parameter estimation is possible. As a result, the difference between the proposed and naive approaches can now be observed for the case with \(\boldsymbol{\Sigma }/4\) and \(\phi = 0.9\). It is worth mentioning that a similar analysis was repeated for negative parameters \(\phi = -0.9, -0.5, -0.1\); the results and findings were similar to and consistent with those presented in this section. To conclude, the proposed procedure proves to be a powerful tool for identifying change points.

Table 4 Simulation study from Sect. 3 assuming two change points at times \(t_1 = 10\) and \(t_2 = 50\). The description of the table is similar to that of Table 3

4 Applications

4.1 Illustration on Crime Rates in US Cities

First, we apply the proposed methodology to US city crime data obtained from the US Department of Justice, Federal Bureau of Investigation website (http://www.ucrdatatool.gov/Search/Crime/Crime.cfm). There are seven crime types grouped into two general categories: violent and property crimes. The former includes Murder, Rape, Robbery, and Aggravated Assault; the property crimes are Burglary, Larceny Theft, and Motor Vehicle Theft. We focus on crime rates observed between 2000 and 2012 and, as an example, choose the data reported by the Austin and Cincinnati Police Departments. Figure 2 illustrates the violent (left column) and property (right column) crime rates. As the value \(T = 13\) is quite low, instead of assuming models with shift-related change points only, we consider all possible orderings of the processes, as sketched below.
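For a fixed number of processes, such a search over all orderings can be coded along the following lines; fit_model2() is a hypothetical fitter that takes a 0/1 membership vector in place of shift change points, and the two-process case shown is illustrative.

```r
# Sketch: enumerate every assignment of T = 13 years to two processes,
# up to label switching, and keep the assignment with the lowest BIC.
T <- 13
patterns <- as.matrix(expand.grid(rep(list(0:1), T)))      # all 2^T labelings
patterns <- patterns[patterns[, 1] == 0, , drop = FALSE]   # fix label switching
patterns <- patterns[rowSums(patterns) > 0, , drop = FALSE]# require two processes
bic  <- apply(patterns, 1, function(m1) fit_model2(Y, m1)$bic)
best <- patterns[which.min(bic), ]
```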

Fig. 2

Violent and Property crime rates in Austin and Cincinnati over the 13-year time period (2000-2012). The blue and red colors represent two processes detected. Horizontal lines stand for the means of the processes

In the case of Austin, the BIC value associated with a single process (i.e., no change points) is equal to −9.933. After running the developed procedure over all possible orderings of processes, the lowest BIC of −47.081 was found. It is worth mentioning that the naive procedure outlined in Sect. 3 yields a BIC of −45.225, and the model with the AR(1) structure of \(\boldsymbol{\Psi }\) but no transformation parameters produces a BIC of −44.099. This suggests that even with as few data points as in the considered application, the proposed procedure can be useful. The parameter estimates associated with the model can be found in Table 5, and the corresponding illustration is provided in the first row of plots in Fig. 2. Here, the years 2004, 2006, 2007, 2008, and 2009 are associated with the second process (shown in red), while the rest of the years represent the first process (shown in blue). The horizontal lines reflect the back-transformed parameters \(\hat{\boldsymbol{\mu }}_0\) and \(\hat{\boldsymbol{\mu }}_1\) detected by our methodology. As we can clearly see, the separation into two processes is strongly driven by the variable Violent Crime. In the meantime, the variable Property Crime demonstrates considerable variability associated with both processes.

The opposite situation is observed for Cincinnati (second row in Fig. 2): the variable Property Crime contributes more to the separation of the processes than Violent Crime. Model parameters are also provided in Table 5. The BIC value of the best detected model is equal to −6.804, which is considerably better than that of the model with a single process, 19.568. The years 2000, 2007, 2008, 2009, and 2012 are associated with the first process (shown in blue), while the rest of the years represent the other process (shown in red). The BIC value associated with the naive approach is equal to −10.846, suggesting that the AR(1) structure of \(\boldsymbol{\Psi }\) and the transformation-related parameters do not improve upon the naive model in this case.

Table 5 Parameter estimates, log-likelihood, and BIC values for Austin and Cincinnati

4.2 Effect of Colorado Amendment 64

In this section, we demonstrate how the proposed methodology can be applied to the analysis of the effects of public policies. As an example, we focus on studying the effects of Colorado Amendment 64, which legalized the private consumption, production, and possession of marijuana. Amendment 64 was added to the constitution of Colorado in December 2012, but stores did not officially open until January 2014.

The crime rate data have been obtained from the Colorado Bureau of Investigation, Department of Public Safety website (https://www.colorado.gov/pacific/cbi/crime-colorado1) for 10 years, from 2007 to 2016. The same seven variables as described in Sect. 4.1 have been explored, without combining them into the two categories. The goal of our analysis is to check whether the last three years, when the use of marijuana was legal, were any different from the previous seven years. The BIC corresponding to the model with no change points is equal to −996.2, while the model with a change point in 2014 yields a BIC of −1,006.1. The likelihood ratio test conducted to verify the significance of the change yields a P-value of \(1.47\times 10^{-6}\). Hence, there is very strong evidence in favor of the change point model based on both BIC and the likelihood ratio test.
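Since the change point location is fixed in advance by the policy date, the likelihood ratio test reduces to a standard nested comparison; the sketch below assumes the usual chi-squared reference with df = 7 (the number of extra mean components) and illustrative names loglik0 and loglik1 for the fitted log-likelihoods.

```r
# Sketch: LRT of the fixed-2014 change point model against the
# single-process model; df = 7 extra mean parameters (p = 7).
lrt_stat <- 2 * (loglik1 - loglik0)
p_value  <- pchisq(lrt_stat, df = 7, lower.tail = FALSE)
```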

Fig. 3

Crime rates in Colorado over the 10-year time period. The blue and red colors represent two processes. Horizontal lines stand for the back-transformed means of the processes

Figure 3 illustrates the obtained results. The first column, consisting of four plots, represents violent crimes, while the second column, with three plots, shows property crimes. The description of the individual plots is similar to that of Fig. 2. As we can see, some variables such as Rape or Burglary seem to contribute substantially to the difference between the two models analyzed. To formalize the analysis, we employed a variable selection procedure. As the number of variables in our experiment is relatively low, we tested the model with no change point against the model with the change point at 2014 over all possible combinations of the involved variables. The lowest P-value, \(1.36\times 10^{-6}\), was observed for the combination of the variables Murder, Rape, and Burglary. Thus, the most dramatic change in 2014 was observed for these three variables considered jointly. The corresponding P-value is only marginally lower than the P-value observed for the full model with all seven variables included, but it gives a good idea of which combination of variables contributes the most to separating the processes. Examining the contributions of the three variables, we notice that the Burglary rate dropped considerably, while Rape and, to some extent, Murder rates grew over the last three years. Of course, the proposed analysis does not support any cause-and-effect conclusions. In fact, we can notice a considerable decrease in Murder rates in 2014, and the increase in Rape rates began in 2013, i.e., one year before Amendment 64 became effective. Nevertheless, the proposed methodology presents a powerful exploratory tool for studying the effects of public policies.

5 Discussion

In this paper, we developed an efficient method capable of estimating multiple change points in multivariate processes. The proposed technique relies on the matrix normal distribution adjusted by the exponential Manly transformation; this adjustment makes the methodology robust to violations of the normality assumption. The matrix setting has an appealing form, as rows represent variables and columns are associated with time points. Based on the results of challenging simulation studies, we conclude that the proposed technique is very promising: it outperforms the two nonparametric competitors in all settings. The two applications to crime data considered in the paper demonstrate the usefulness of the developed method.