
Introduction

The problem of estimating the location of a jump discontinuity (changepoint) has been extensively studied in the statistics literature. There are two versions of the problem. The on-line version is concerned with the quickest detection of a changepoint in the parameters of a dynamic stochastic system, and is closely related to classical problems in sequential analysis; for a comprehensive treatment, together with a discussion of important applications, see the books by Siegmund [18], Basseville and Nikiforov [1], and the review article by Lai [12] and references therein. In the off-line version, data are available for n covariate-response pairs, and one is interested in estimating the location of the changepoint as accurately as possible (see Ritov [17], Müller [15], Loader [13], Gijbels, Hall and Kneip [6], Hall and Molchanov [7], Kosorok and Song [11], and the book by Csörgő and Horváth [4]). The on-line version is also closely related to many developments in statistical process control (Hawkins et al. [8]) and associated control charts (e.g. Cumulative Sum (CUSUM) and Exponentially Weighted Moving Average (EWMA) charts). However, both versions of the problem have dealt primarily with low- (usually one-) dimensional problems. Although there have been some extensions to multivariate data, they are usually obtained under an assumption of multivariate normality that gives rise to Hotelling's \(T^{2}\) test.

In this paper, we consider the off-line version in a high-dimensional network setting. Data are indexed by the edges of a graph; in the simplest case, binary data indicate whether the edge is present. We consider edges which evolve independently, so that at each point in time the network looks like an Erdős–Rényi random graph. This is a fundamental problem in changepoint analysis on networks, and already presents technical challenges. As graph size grows, we acquire more data about the changepoint, but have to deal with a higher-dimensional nuisance parameter space; this interaction is the main technical focus of the paper. We obtain the limiting distributions of the maximum likelihood estimates of both the changepoint and the remaining model parameters; although the asymptotic distribution for the changepoint estimate depends on the (unknown) signal-to-noise ratio, we develop an adaptive inference framework that does not require prior information about the limiting regime. Many of our results generalize those known for finite-dimensional models, although to our knowledge the focus on adaptive inference is new.

As a motivating application, we consider the question of whether the US Congress has abruptly become more polarized at some point in recent history. This question has raised a lot of interest in the political science literature; see for example [14, 16]. These works were primarily exploratory in nature, and no attempt was made to make inferences regarding the polarization process. Within the framework of our network-based approach, we use roll call vote data to generate a sequence of graphs, with vertices corresponding to congressmen and edges corresponding to whether they voted in the same way on a particular issue. We are then able to make inference about any changepoints in voting pattern.

Due to space constraints, we skip most of the details. A more extensive version of the paper is in preparation.

Network Changepoint Model and Estimators

Consider a sequence of random graphs indexed by n. Each graph has m=m(n) potential edges; we allow m(n) to grow with n. Each edge has a state α∈𝒮; for simplicity, in this note we take 𝒮={0,1}, but the model readily extends to an arbitrary common finite state space. We assume that the underlying graphs are embedded into each other, so that it makes sense to speak of  "edge 1 of system n". The edges evolve in discrete time; each edge evolves as a Markov chain with its own transition kernel, independently of all the other edges. Consequently, at each time point the state of the system is an Erdős–Rényi random graph (with different, time-varying, probabilities for each edge). We assume that edges transition according to one set of transition kernels \(\{P^{*}_{k},\ 1 \leq k \leq m(n)\}\) before a time \(t^{*}\), the changepoint, and according to another set of transition kernels \(\{Q^{*}_{k},\ 1 \leq k \leq m(n)\}\) after \(t^{*}\). The changepoint \(t^{*}\), as well as the matrices \(P^{*}_{k}\) and \(Q^{*}_{k}\), may depend on n; but note that \(t^{*}\) is the same for all the edges. We may also have \(P^{*}_{k} = Q^{*}_{k}\) for some edges, i.e. the changepoint may affect only a subset of the edges in the graph. For convenience, we will rescale time so that \(t^{*} \in[0,1]\).
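To fix ideas, the generative model just described can be sketched in a few lines of code. This is a hypothetical illustration only; the function name, array layout, and parameter values are our own choices, not the authors'.

```python
# Sketch of the network changepoint model: m independent binary edges, each a
# two-state Markov chain whose transition kernel switches from P_k to Q_k at
# the (rescaled) changepoint t_star. All names here are illustrative.
import numpy as np

def simulate_network(n, m, t_star, P, Q, rng=None):
    """Simulate n+1 snapshots of an m-edge graph.

    P, Q: arrays of shape (m, 2, 2), each row a probability distribution.
    Returns an (n+1, m) array of 0/1 edge states.
    """
    rng = np.random.default_rng(rng)
    states = np.zeros((n + 1, m), dtype=int)
    states[0] = rng.integers(0, 2, size=m)  # arbitrary initial edge states
    change = int(n * t_star)
    for s in range(n):
        K = P if s < change else Q          # kernel switches at the changepoint
        for k in range(m):
            p_one = K[k, states[s, k], 1]   # prob. the edge is present at s+1
            states[s + 1, k] = rng.random() < p_one
    return states
```

For instance, taking each \(P^{*}_{k}\) strongly "sticky" and each \(Q^{*}_{k}\) close to uniform produces a sequence of Erdős–Rényi snapshots whose edge dynamics visibly change character at the changepoint.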

We make n observations of the graph indexed by n, at times \(\{\frac{i}{n},\ i=1,\ldots,n\}\). This means that in the nth experiment, \(t^{*} = t^{*}(n) \in\{\frac{i}{n},\ i=1,\ldots,n\}\). We will assume \(t^{*}(n) \to t^{0}\) as n→∞, as well as \(P^{*}_{k} \to P^{0}_{k}\) and \(Q^{*}_{k} \to Q^{0}_{k}\) for each k. Below, we will frequently omit the dependence on n.

Let \(\mathbf {1}_{k,\alpha\to\beta}(s)\) be the indicator of the event that edge k was in state α at time s and in state β at time s+1. The log-likelihood function for this model is

$$\begin{aligned} l^M_n(P,Q,t) =& n^{-1} \Biggl(\,\sum_{k=1}^{m} \sum_{\alpha,\beta\in \mathscr {S}} \Biggl(\sum_{s=0}^{nt-1} \bigl(\mathbf {1}_{k,\alpha\to\beta}(s) \log(P_k)_{\alpha\beta} \bigr) \\ &{}+ \sum_{s=nt}^{n-1} \bigl( \mathbf {1}_{k, \alpha\to\beta}(s) \log(Q_k)_{\alpha\beta} \bigr) \Biggr) \Biggr). \end{aligned}$$
(22.1)

If the changepoint were at t, we could write down the MLEs \(\hat{P} = \hat{P}(t)\) and \(\hat{Q} = \hat{Q}(t)\):

$$\begin{aligned} \bigl(\hat{P}_k(t)\bigr)_{\alpha\beta} =& \frac{\sum_{s=0}^{n t - 1}\mathbf {1}_{k, \alpha \to\beta}(s)}{\sum_{s=0}^{n t - 1}\sum_{\gamma\in \mathscr {S}}\mathbf {1}_{k, \alpha \to\gamma}(s)}, \\ \bigl(\hat{Q}_k(t)\bigr)_{\alpha\beta} =& \frac{\sum_{s=n t}^{n- 1}\mathbf {1}_{k, \alpha\to\beta}(s)}{\sum_{s=n t}^{n - 1}\sum_{\gamma \in \mathscr {S}}\mathbf {1}_{k, \alpha\to\gamma}(s)}. \end{aligned}$$
(22.2)

The MLE \(\hat{t}\) can be obtained by iterating over t∈[0,1] (on the grid of discrete observation times), using the above form for \(\hat{P}\) and \(\hat{Q}\); in case of ties, we take the smallest maximizer.
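A minimal sketch of this plug-in profile-likelihood search, for the binary state space of this note, is as follows (illustrative code, not the authors' implementation; the overall \(n^{-1}\) factor of (22.1) is dropped since it does not affect the maximizer):

```python
# Profile-likelihood search for the changepoint: for each candidate t on the
# observation grid, plug in the segment-wise empirical transition frequencies
# (Eq. 22.2) and evaluate the log-likelihood (Eq. 22.1, up to the 1/n factor).
import numpy as np

def count_transitions(states):
    """N[k, a, b] = number of a -> b transitions of edge k; shape (m, 2, 2)."""
    m = states.shape[1]
    N = np.zeros((m, 2, 2))
    a, b = states[:-1], states[1:]
    for alpha in range(2):
        for beta in range(2):
            N[:, alpha, beta] = np.sum((a == alpha) & (b == beta), axis=0)
    return N

def mle_kernel(N):
    """Row-normalize transition counts into stochastic matrices."""
    return N / np.maximum(N.sum(axis=2, keepdims=True), 1e-12)

def profile_loglik(states, c=0.1):
    """Return (t_hat, dict mapping candidate t to its profile log-likelihood)."""
    n = states.shape[0] - 1
    best_t, best_ll, lls = None, -np.inf, {}
    for i in range(int(np.ceil(c * n)), int(np.floor((1 - c) * n)) + 1):
        N_pre = count_transitions(states[: i + 1])    # transitions s = 0..i-1
        N_post = count_transitions(states[i:])        # transitions s = i..n-1
        P_hat, Q_hat = mle_kernel(N_pre), mle_kernel(N_post)
        ll = np.sum(N_pre * np.log(np.maximum(P_hat, 1e-12))) \
           + np.sum(N_post * np.log(np.maximum(Q_hat, 1e-12)))
        lls[i / n] = ll
        if ll > best_ll:    # strict '>' keeps the smallest maximizer under ties
            best_t, best_ll = i / n, ll
    return best_t, lls
```

The restriction of the grid to (c, 1−c) mirrors Assumption 22.1.3 below; the search is linear in the number of candidate points, each requiring a pass over the data.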

Our main results will concern the asymptotic behavior of \(\hat{P}\), \(\hat{Q}\), and \(\hat{t}\) as n→∞. Below, we describe the necessary assumptions on the behavior of the dimension m(n), the "signal" \(\sum_{k} \Vert P^{*}_{k}-Q^{*}_{k}\Vert _{F}^{2}\), and the values of the true parameters. Here, \(\Vert A\Vert _{F} = (\sum_{i,j} A_{ij}^{2})^{1/2}\) is the Frobenius, or Hilbert–Schmidt, norm of the matrix A; and we write \(\Vert P^{*}-Q^{*}\Vert ^{2}_{F} = \sum_{k} \Vert P^{*}_{k}-Q^{*}_{k}\Vert _{F}^{2}\).

Assumption 22.1

  1. The underlying parameters converge as follows.

     a. m(n) is either constant, \(m(n) = m_{0}\), or else monotonically increasing to infinity.

     b. \(t^{*}(n) \to t^{0}\) as n→∞. (For example, we could have \(t^{*}(n) = n^{-1}\lfloor n t^{0} \rfloor\).)

     c. \(P^{*}_{k}(n) \to P^{0}_{k}\) and \(Q^{*}_{k}(n) \to Q^{0}_{k}\) uniformly in k.

  2. There exists a constant ε>0 (which we need not know) such that, for each k, one of the following holds: either \(\Vert Q^{0}_{k} - P^{0}_{k}\Vert _{F} > \varepsilon \), or else \(Q^{0}_{k} = P^{0}_{k}\).

  3. For each n and k, the transition matrices \(P^{*}_{k}(n)\) and \(Q^{*}_{k}(n)\) correspond to irreducible, aperiodic Markov chains with state space 𝒮. There exists some known constant c>0 such that \(t^{*} \in(c,1-c)\), and all entries of \(P^{*}_{k}\) and \(Q^{*}_{k}\) belong to (c,1−c). (The same is then true of \(t^{0}\), \(P^{0}_{k}\), and \(Q^{0}_{k}\).) We will only consider estimates of the changepoint that fall within (c,1−c).

  4. The number of edges m satisfies \(n^{-1/2}\log m(n) \to 0\).

  5. The signal-to-noise ratio satisfies \(\frac{n}{m} \sum_{k=1}^{m} \Vert P^{*}_{k} - Q^{*}_{k}\Vert ^{2}_{F} \to\infty\).

Remark 22.1

Assumption 22.1.3 implies that the Markov chains with transition kernels \(P^{*}_{k}\) and \(Q^{*}_{k}\) have uniformly bounded mixing times; in particular, the observations \(\mathbf {1}_{k,\alpha\to\beta}(\cdot)\) form a mixing sequence, with mixing coefficients bounded uniformly in k. For discussion of variants of the changepoint problem where the changepoint is very close to the edge of the interval, see for example [4, Theorem 1.5.3].

Assumption 22.1.4 implies that with high probability, all estimates \(\hat{P}_{k}(t)\) and \(\hat{Q}_{k}(t)\) will satisfy Assumption 22.1.3; and together with Assumption 22.1.2, it means that we will correctly identify which of the edges experienced a change at \(t^{*}\). The requirement \(n^{-1/2}\log m(n) \to 0\) still allows quite large graphs, e.g. we may have \(m(n) = \exp(n^{1/4})\).

Assumption 22.1.5 asserts that the "average" per-edge signal satisfies \(\Vert P^{*}_{k} - Q^{*}_{k}\Vert ^{2}_{F} \gg n^{-1}\). With finitely many edges (\(m(n) = m_{0}\)), this is necessary for detectability; when m(n)→∞, the necessary condition is very slightly weaker.

Results

We now present our main results. Theorem 22.1 addresses the rates of convergence of the estimators and their asymptotic distributions. Finally, Theorem 22.2 addresses the question of adaptive inference, that is, inferring the parameters of the asymptotic distribution from the data.

Because the exact formulae below get somewhat involved, we state only the qualitative form of the limiting processes and distributions. Full expressions for the parameters will be found in our forthcoming longer paper on the subject. The form of the result is qualitatively similar to finite-dimensional models, cf. [4, Chap. 1], although our model is considerably more general.

Theorem 22.1

(Rates of convergence and asymptotic distribution.)

Under Assumptions 22.1.1 through 22.1.5, \(n \Vert Q^{*}-P^{*}\Vert ^{2}_{F} \vert \hat{t}-t^{*}\vert = O_{P}(1)\).

For any finite set of edges K and simultaneously for all kK, \(n \Vert \hat{P}_{k} - P^{*}_{k}\Vert _{F}^{2} = O_{P}(1)\) and \(n \Vert \hat{Q}_{k} - Q^{*}_{k}\Vert _{F}^{2} = O_{P}(1)\).

Define the local parameters \(h^{P}_{k} = \sqrt{n}(P_{k} - P^{*}_{k})\), \(h^{Q}_{k} = \sqrt{n} (Q_{k} - Q^{*}_{k})\). For each k, \(h^{P}_{k}\) and \(h^{Q}_{k}\) are asymptotically normal:

$$\begin{aligned} h^P_k \Longrightarrow N \bigl(0,\bigl(t^0\bigr)^{-1}V^P_k \bigr), \qquad h^Q_k \Longrightarrow N \bigl(0, \bigl(1-t^0 \bigr)^{-1}V^Q_k \bigr), \end{aligned}$$

where the \(\vert \mathscr {S}\vert ^{2} \times\vert \mathscr {S}\vert ^{2}\) covariance matrices \(V^{P}_{k}\), \(V^{Q}_{k}\) depend on \(P^{0}_{k}\), respectively \(Q^{0}_{k}\). For any fixed finite set K of edges, the estimates \(\{\hat{h}^{P}_{k}, \hat{h}^{Q}_{k}, \hat{t}\colon\ k \in K\}\) are asymptotically independent.

For the limiting distribution of \((\hat{t}-t^{*})\), we distinguish three cases:

  1. If \(\Vert P^{*} - Q^{*}\Vert ^{2}_{F} \to\infty\), then \(n(\hat{t} - t^{*}) \to 0\) in probability. That is, asymptotically we precisely identify the index of the transition at which the transition probability matrix changed.

  2. If \(\Vert P^{*} - Q^{*}\Vert ^{2}_{F} \to 0\), then

     $$\begin{aligned} n \sum_{k=1}^{m} \sum _{\alpha,\beta\in \mathscr {S}} \frac{(\pi ^0_k)_\alpha}{(P^0_k)_{\alpha\beta}}\bigl(\bigl(P^*_k - Q^*_k\bigr)_{\alpha\beta}\bigr)^2 \bigl(\hat{t} - t^* \bigr) \to\sigma^{-1} \arg\max_{h \in \mathbb {R}} \biggl(B(h) - \frac{1}{2} \vert h\vert \biggr), \end{aligned}$$

     where B(h) is a standard Brownian motion, and \(\sigma^{2}\) comes from the Markov chain central limit theorem (cf. [10, Case 1 of Theorem 5]).

  3. If \(\Vert P^{*} - Q^{*}\Vert ^{2}_{F} \to C\in(0,\infty)\), then \(n(\hat{t} - t^{*})\) converges to the (smallest) maximizer of a limiting jump process supported on \(\mathbb {Z}\): \(n(\hat{t} - t^{*}) \to\arg\max \nolimits_{h \in \mathbb {Z}}[M(h) + G(h) - D(h)]\). Here, D is a deterministic triangular drift, G is a random walk with correlated Gaussian step sizes, and M is a functional of the Markov chain trajectories of some of the edges. Let \(\mathscr {I}_{+} = \{k\colon\ P^{0}_{k} \neq Q^{0}_{k}\} \) (necessarily finite); M(⋅) depends only on the edges in \(\mathscr {I}_{+}\), and D(⋅) and G(⋅) depend only on the remaining edges.

Interestingly, the network size m does not appear in the scaling of \(\hat{t} - t^{*}\); however, Assumption 22.1.5 places a lower bound on \(\Vert Q^{*}-P^{*}\Vert ^{2}_{F}\) that scales with m.
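In Case 2, the limit is the argmax of a two-sided Brownian motion with triangular drift, which is straightforward to approximate by discretization. The following Monte Carlo sketch is our own illustration (the window half-width and step size are arbitrary tuning choices, not quantities from the theorem):

```python
# Draw from the law of argmax_h { B(h) - |h|/2 } by gluing two independent
# discretized Brownian paths at h = 0 and maximizing over a symmetric grid.
import numpy as np

def argmax_bm_drift(half_width=50.0, step=0.01, rng=None):
    """One approximate draw of argmax_h { B(h) - |h|/2 }, h in [-W, W]."""
    rng = np.random.default_rng(rng)
    n_half = int(half_width / step)
    h = np.arange(-n_half, n_half + 1) * step
    # Independent Brownian increments on each side of the origin; B(0) = 0.
    right = np.concatenate([[0.0], np.cumsum(rng.normal(0, np.sqrt(step), n_half))])
    left = np.concatenate([[0.0], np.cumsum(rng.normal(0, np.sqrt(step), n_half))])
    B = np.concatenate([left[::-1][:-1], right])
    return h[np.argmax(B - 0.5 * np.abs(h))]

draws = [argmax_bm_drift(rng=i) for i in range(200)]
```

The empirical distribution of `draws` is symmetric about zero; the exact density of this argmax is known in closed form [2, 19], so such simulation is needed only as a sanity check in this regime.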

The proofs follow the approach of [20, Theorem 3.4.1], making extensive use of Doob's martingale maximal inequality (its use for Markov chains is somewhat unusual). The continuity of the argmax functional in Case 3 of Theorem 22.1 is non-standard. The high-dimensional nuisance parameter space makes it difficult to apply many classical changepoint techniques, such as those in [4].

Lastly, we present a result which allows adaptive inference of the limiting distribution from the data, irrespective of the limiting regime that applies. This means that we can provide asymptotically correct quantile estimation of the distribution based only on the data, without knowledge of the true parameters. The adaptive process is essentially the one that appears in case 3 of Theorem 22.1 when |ℐ+|=m.

Theorem 22.2

(Adaptive inference.)

Define the process \(\widetilde{M}(h)\) as follows. Let \(\tilde{X}_{k}(h), h \geq0\) be the reversed Markov chain with initial distribution \(\hat{\pi}_{k}\) and transition kernel \(\hat{ \mathscr {P}}_{k}\), \((\hat{ \mathscr {P}}_{k})_{\alpha\beta } = \frac{(\hat{\pi}_{k})_{\beta}}{(\hat{\pi}_{k})_{\alpha}}(\hat{P}_{k})_{\beta\alpha }\). Here, \((\hat{\pi}_{k})_{\alpha}:= (n\hat{t})^{-1}\sum_{s=0}^{n\hat{t}-1}\sum_{\beta\in \mathscr {S}} \mathbf {1}_{k,\alpha\to\beta}(s)\) is the empirical proportion of time that edge k spends in state α up to time \(\hat{t}\). Let \(\tilde{Y}_{k}(h), h \geq0\) be the (ordinary) Markov chain with initial distribution \(\hat{\pi}_{k}\) and transition kernel \(\hat{Q}_{k}\). For different values of k, let the Markov chains be independent; moreover, let \(\tilde{X}_{k}(0) = \tilde{Y}_{k}(0)\) and let their transitions be independent otherwise. Define

$$\begin{aligned} \widetilde{M}(h+1)-\widetilde{M}(h) = \begin{cases} \sum_{k=1}^m\sum_{\alpha,\beta\in \mathscr {S}} \mathbf {1}_{\tilde{Y}_k, \alpha\to\beta }(h) \log\frac{(\hat{P}_k)_{\alpha\beta}}{(\hat{Q}_k)_{\alpha\beta}}, & h \geq0,\\ \sum_{k=1}^m\sum_{\alpha,\beta\in \mathscr {S}} \mathbf {1}_{\tilde{X}_k, \beta\to\alpha }(\vert h\vert -1) \log\frac{(\hat{P}_k)_{\alpha\beta}}{(\hat{Q}_k)_{\alpha\beta }}, & h < 0. \end{cases} \end{aligned}$$

Let \(\tilde{h}\) be the smallest maximizer of \(\widetilde{M}(\cdot)\). Then \(\tilde{h}\) has the same asymptotic distribution as \(n(\hat{t} - t^{*})\), in the following sense:

  1. If \(\Vert Q^{*}-P^{*}\Vert ^{2}_{F} \to\infty\), then both \(\tilde{h} \to 0\) and \(n(\hat{t} - t^{*}) \to 0\) in probability.

  2. If \(\Vert Q^{*}-P^{*}\Vert ^{2}_{F} \to 0\), then we have convergence in distribution for the renormalized estimate:

     $$\begin{aligned} \sum_{k=1}^{m} \sum _{\alpha,\beta\in \mathscr {S}} \frac{(\pi ^0_k)_\alpha}{(P^0_k)_{\alpha\beta}}\bigl(\bigl(P^*_k - Q^*_k\bigr)_{\alpha\beta}\bigr)^2 \tilde{h} \to \sigma^{-1} \arg\max_{h \in \mathbb {R}} \biggl(B(h) - \frac{1}{2} \vert h\vert \biggr), \end{aligned}$$

     where B(h) is a standard Brownian motion, and \(\sigma^{2}\) is as in Theorem 22.1.

  3. If \(\Vert Q^{*}-P^{*}\Vert ^{2}_{F} \to C \in(0,\infty)\), then \(\tilde{h} \to \arg\max\nolimits_{h \in \mathbb {Z}}[M(h) + G(h) - \frac{1}{2} D(h)]\), where M(⋅), G(⋅), and D(⋅) are as in Theorem 22.1.
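For the binary state space of this note, one draw of the adaptive process \(\widetilde{M}\) and its smallest maximizer \(\tilde{h}\) can be sketched as follows. This is an illustrative implementation on a finite window; the window size `H` and all names are our own choices, and the estimated quantities `P_hat`, `Q_hat`, `pi_hat` are assumed to come from the profile-likelihood fit.

```python
# Simulate the adaptive process M~ of Theorem 22.2: Y_k runs forward under
# Q_hat, X_k runs "backward" under the time-reversal of P_hat, and the
# log-likelihood-ratio increments are accumulated on {-H, ..., H}.
import numpy as np

def adaptive_argmax(P_hat, Q_hat, pi_hat, H=50, rng=None):
    """One draw of the smallest maximizer h_tilde on the window {-H, ..., H}."""
    rng = np.random.default_rng(rng)
    m = P_hat.shape[0]
    # Time-reversed kernel (P_rev)_{ab} = (pi_b / pi_a) * (P_hat)_{ba};
    # re-normalize rows to absorb small empirical-frequency discrepancies.
    P_rev = np.transpose(P_hat, (0, 2, 1)) * pi_hat[:, None, :] / pi_hat[:, :, None]
    P_rev = P_rev / P_rev.sum(axis=2, keepdims=True)
    log_ratio = np.log(P_hat) - np.log(Q_hat)
    M = {0: 0.0}
    x = np.array([rng.choice(2, p=pi_hat[k]) for k in range(m)])  # shared start
    y = x.copy()
    for h in range(H):              # forward side: Y_k moves under Q_hat
        inc = 0.0
        for k in range(m):
            y_next = rng.choice(2, p=Q_hat[k, y[k]])
            inc += log_ratio[k, y[k], y_next]
            y[k] = y_next
        M[h + 1] = M[h] + inc
    for h in range(H):              # negative side: X_k moves under P_rev
        inc = 0.0
        for k in range(m):
            x_prev = rng.choice(2, p=P_rev[k, x[k]])
            inc += log_ratio[k, x_prev, x[k]]
            x[k] = x_prev
        M[-(h + 1)] = M[-h] - inc   # M~ decreases moving away from 0
    hs = sorted(M)
    vals = np.array([M[h] for h in hs])
    return hs[int(np.argmax(vals))]  # first max = smallest maximizer
```

Repeating this draw many times gives Monte Carlo quantiles for \(n(\hat{t}-t^{*})\) without knowledge of the true parameters or of which limiting regime applies.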

Application: Polarization in US Congress

We consider the question of whether the dynamics of discussion in the US Senate have experienced a changepoint in the recent past. To construct the sequence of graphs as above, we identify the senators with senate seats (two per state, e.g. Michigan 1 and Michigan 2). We then consider 7949 roll call votes on bills during the years 1979–2012. The state of an edge of the (complete) graph on 100 vertices is then 1 if the corresponding senators voted in the same way on the issue, and 0 if they voted differently. The Markovian structure is, of course, an approximation of this data, but it represents the fact that a particular pair of senators will tend to either agree or disagree on most issues. We note that while the occupants of a particular seat can change, this does not occur very often in practice, so the assumption that the parameters of the model are time-independent aside from the changepoint is not unreasonable.

In Fig. 22.1, we present the (profile) log-likelihood function for the location of the changepoint. We see broadly that the log-likelihood function reaches its maximum somewhere between the 104th and 107th Congresses, i.e. 1995–2003. (2003 corresponds to the Iraq war.) Within this interval, there are several local maxima; as the table to the right of Fig. 22.1 shows, which changepoint is dominant depends in particular on when data analysis starts. We can also examine the nature of the change by examining the estimated transition parameters before and after the changepoint (in this case, before the 104th and after the 107th Congress). We do not show the graphs due to space constraints, but the average probability of changing the status of an edge decreases by almost a factor of 2, from approximately 0.2 to approximately 0.1, leading to longer negotiation times until a compromise is reached.

Fig. 22.1 Log-likelihood function for the senate roll call data. The horizontal axis is labelled with the index of the roll call vote; vertical bands identify the Congress, i.e. the two-year inter-election period. The table to the right presents the dominant changepoint as a function of the year when data collection begins

Discussion and Simulation Issues

We have presented a model which can address questions of changepoint inference in a networked setting. We begin by discussing several extensions of the model assumptions, and then discuss the computational complexity of the estimation.

Vertex Labels and Dependent Edges

A natural extension to community structures is to add labels to the vertices (e.g. political party affiliation for the US Congress), and allow dependence among the edges. There are many possibilities for such extensions; some are the subject of future work.

Multiple Changepoints

Although our research is only directly applicable under the assumption of exactly one changepoint, we may use techniques similar to the binary segmentation method of [3, 21] to find multiple changepoints. The basic idea is to locate the dominant changepoint, and keep looking in the two smaller subintervals around it; an extra elimination step may reduce the probability of finding too many changepoints. In general, estimating multiple changepoints is a challenging issue; we refer to the survey article [9] for a discussion of current approaches.
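The binary segmentation recursion just described can be sketched generically. The interface below is hypothetical: `detect` stands for any single-changepoint routine (such as the profile-likelihood search above) that returns a location and a score on a window, or `None` when no changepoint passes its detection threshold; the elimination step mentioned in the text is omitted.

```python
# Generic binary segmentation: find the dominant changepoint, then recurse
# on the two flanking subintervals until no segment yields a detection.
def binary_segmentation(detect, lo, hi, min_len=20):
    """detect(lo, hi) -> (index, score) or None, on the half-open window [lo, hi)."""
    found = []
    stack = [(lo, hi)]
    while stack:
        a, b = stack.pop()
        if b - a < min_len:         # segment too short to search
            continue
        hit = detect(a, b)
        if hit is None:             # no detectable changepoint in [a, b)
            continue
        idx, _ = hit
        found.append(idx)
        stack.append((a, idx))      # recurse to the left of the changepoint
        stack.append((idx + 1, b))  # recurse to the right of the changepoint
    return sorted(found)
```

Because each detected changepoint splits its window in two, the total work is at most a logarithmic factor times the cost of a single full-window search.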

Computational Complexity

When the signal-to-noise ratio is either quite large or quite small (Cases 1 and 2 of Theorem 22.1), computing \(\hat{t}\) is the main computational challenge; the distribution of the maximizer of a Brownian motion with triangular drift, which appears in Case 2, can be computed precisely [2, 19]. In Case 3, which corresponds to the adaptive regime, the limiting process is easily simulated if \(P^{0}=Q^{0}\); see also Fotopoulos et al. [5] for computing the maximizer. However, even in the case of Gaussian jumps, there is no universal scaling that relates different examples to each other, in part due to the non-stationarity of the process. For the generalized binomial component of the limiting random process, it seems necessary to simulate the trajectories of all m Markov chains in order to estimate the maximizer; the computation is, however, parallelizable, and can scale up to fairly large networks.