1 Introduction

Gaussian graphical models are an indispensable tool for studying the relationships among random variables. Under Gaussianity, the precision (or inverse covariance) matrix is intrinsically linked to conditional independence: any pair of random variables is independent given all others if and only if the corresponding element in the precision matrix is zero. The conditional dependence is further encoded by an undirected graph, in which variables are represented by nodes and an edge between nodes designates variables that are not conditionally independent. Therefore, edge selection is equivalent to detection of the nonzero entries in the precision matrix. Crucially, inference of the edge set provides interpretable representations and insightful visualizations of the conditional dependence among variables.

To estimate this edge set, frequentist and Bayesian methods have been designed to estimate a sparse precision matrix. Frequentist approaches commonly employ penalized likelihood estimation, which couples a Gaussian likelihood with an appropriate sparsity penalty (Meinshausen and Bühlmann 2006; Friedman et al. 2008; Fan et al. 2009). Bayesian models can incorporate exact zeros in the precision matrix through careful specification of the prior, such as the hyper-inverse Wishart prior (Roverato 2002) and the G-Wishart prior (Lenkoski and Dobra 2011). Alternatively, Wang (2015) proposed a continuous spike-and-slab prior for greater computational efficiency, which has been generalized for multiple graphs by Shaddox et al. (2018) and Peterson et al. (2020). The continuous spike-and-slab prior has demonstrated excellent empirical performance relative to a variety of frequentist and Bayesian methods.

In this paper, we consider the problem of estimating dynamic graphical models based on time series data. Two significant challenges are present in many time series applications. First, we anticipate that the graph will change over time, yet these change points are not known in advance. Changes in the graph may be gradual, such as the addition or deletion of a small number of edges. However, whole-scale changes in the graph are also possible due to massive structural shifts in the underlying conditions. Second, outliers and heavy tails are often encountered in time series applications, causing the data to deviate substantially from the fundamental Gaussian assumption. Consequently, many existing Gaussian graphical models lack the robustness for reliable graph estimation in this setting.

To address these challenges, we propose dynamic and robust Bayesian graphical models that employ state-of-the-art hidden Markov models (HMMs), to introduce dynamics in the graph, and heavy-tailed multivariate t-distributions, for model robustness. The HMMs define a collection of discrete and unknown states that determine the graphical dependence at any given time. The learned HMM states are often interpretable and can be revisited throughout the time series. Crucially, the latent states are linked both temporally and hierarchically for greater information sharing across time and between states. For model robustness, we consider heavy-tailed alternatives based on the t-distribution. The multivariate and dynamic setting requires careful consideration of an appropriate t-distribution: we implement and evaluate both a classical multivariate t-distribution and a more flexible Dirichlet process alternative. The proposed dynamic and robust Bayesian graphical models are accompanied by a scalable MCMC algorithm for efficient posterior inference with full uncertainty quantification. A thorough simulation study demonstrates the excellent state recovery and graph estimation of the proposed methods relative to competing approaches, with the largest improvements occurring for heavy-tailed and contaminated data. Furthermore, we apply the proposed approach to human hand gesture tracking data, and discover edges and dynamics with clear practical interpretations.

To our knowledge, there is limited research on graphical models that are both dynamic and robust. Time-varying graphs can be estimated using penalized likelihood techniques that incorporate smoothness in the graph over time (Kolar et al. 2010; Gibberd and Nelson 2017; Yang and Peng 2020). Warnick et al. (2018) proposed a fully Bayesian approach using HMMs, but this method is limited in computational scalability and relies on a Gaussian emission distribution. However, robustness is essential: as demonstrated in the simulation study (Sect. 3), graph estimation deteriorates significantly for non-robust methods in the presence of heavy-tailed or contaminated data. The importance of robustness for graphical models has been emphasized in the non-dynamic setting, including deviation-weighted Gaussian likelihoods (Miyamura and Kano 2006; Sun and Li 2012), robust covariance estimators (Gottard and Pacillo 2010), trimmed estimators (Yang and Lozano 2015), and graphical models based on the t-distribution (Finegold and Drton 2011; Finegold et al. 2014). Vinciotti and Hashem (2013) provide a comparison among robust graphical models.

The remainder of the paper is organized as follows. In Sect. 2, we review some of the background literature and describe the proposed dynamic and robust graphical models, i.e., the dynamic classical-t graphical model and the dynamic Dirichlet-t graphical model. We also provide the MCMC algorithms. In Sect. 3, we provide simulation studies, which include comparisons to a frequentist competitor and a sensitivity analysis. In Sect. 4, we apply the proposed methods to the gesture tracking data. We conclude the paper with Sect. 5. The detailed MCMC algorithms are described in the “Appendix”.

2 Methods

2.1 Background

Undirected graphical models learn and express the conditional dependence among p variables \({\varvec{y}} = (y_1,\ldots , y_p)\). Fundamentally, the goal is to discover the set of variables that are conditionally dependent, or equivalently, the complement set of conditionally independent variables. Such dependence can be represented by a graph \(\mathcal {G(V, E)}\) with vertices \({\mathcal {V}} = \{1,\ldots ,p\}\) and undirected edges \({\mathcal {E}} \subset {\mathcal {V}} \times {\mathcal {V}}\).

From a modeling perspective, graphical dependence is intrinsically linked with the Gaussian distribution. When \({\varvec{y}} \sim N_p({\varvec{\mu }},{\varvec{\Omega }}^{-1})\) with mean \({\varvec{\mu }}\) and a positive definite precision matrix \({\varvec{\Omega }}= (\omega _{ij})_{i,j \in \{1,\ldots ,p\}}\), graphical dependence is determined by sparsity of the precision matrix: \(\omega _{i,j} = 0 \iff (i, j) \not \in {\mathcal {E}}\). More specifically, the edge set for Gaussian graphical models is \({\mathcal {E}} = \{(i,j): y_i \not \perp \!\!\! \perp y_j | {\mathcal {V}} /\{i,j\}\}\), so any two vertices ij are not connected if \(y_i\) and \(y_j\) are conditionally independent given the remaining variables. Consequently, estimating the graph \(\mathcal {G(V, E)}\) is equivalent to identifying the nonzero elements of \({\varvec{\Omega }}\). Let \({\mathcal {G}}\) be parameterized by the edge inclusion indicators as \({\varvec{G}} = \{g_{ij}\}_{1 \le i<j \le p}\) for \(g_{ij} \in \{0,1\}\). In Bayesian graphical models, the essential sparsity of the precision matrix is encoded in the prior distributions on \({\varvec{\Omega }}\) and \({\varvec{G}}\). The hyper-inverse Wishart prior (Roverato 2002) and the G-Wishart prior (Lenkoski and Dobra 2011) impose exact zeros. Alternatively, Wang (2015) proposed a continuous spike-and-slab prior for greater computational efficiency. In the sequel, following standard practice, we focus on the setting with \({\varvec{\mu }}= {\varvec{0}}\) and standardize the data in advance of model-fitting. However, generalizations for nonzero mean parameters are straightforward.
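As a concrete illustration (our own sketch, not code from the paper), the equivalence between zeros in \({\varvec{\Omega }}\) and absent edges can be checked numerically through partial correlations, since \(\rho _{ij} = -\omega _{ij}/\sqrt{\omega _{ii}\omega _{jj}}\) vanishes exactly when \(\omega _{ij} = 0\):

```python
import numpy as np

# Illustrative sketch (not from the paper): a sparse 3 x 3 precision matrix
# and the conditional dependence it encodes. A zero off-diagonal entry
# omega_ij means y_i and y_j are conditionally independent given the rest.
Omega = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.4],
                  [0.0, 0.4, 1.0]])

def partial_correlations(Omega):
    """Partial correlations: rho_ij = -omega_ij / sqrt(omega_ii * omega_jj)."""
    d = np.sqrt(np.diag(Omega))
    R = -Omega / np.outer(d, d)
    np.fill_diagonal(R, 1.0)
    return R

R = partial_correlations(Omega)

# Edge set E: pairs (i, j), i < j, with nonzero precision entries.
edges = [(i, j) for i in range(3) for j in range(i + 1, 3) if Omega[i, j] != 0.0]
```

Here the missing edge (1, 3) corresponds exactly to the zero partial correlation between the first and third variables.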

For time-ordered data, the assumption of independent and identically distributed observations is inadequate. One approach to allow for graphical dependence that varies across time, \(t=1,\ldots ,T\), is to introduce latent state variables \(s_t \in \{1,\ldots ,S\}\), where S is the number of states. Each discrete state \(s_t = s\) corresponds to a precision matrix \({\varvec{\Omega }}_s\) and edge inclusions \({\varvec{G}}_s\), which define the conditional dependence among the variables \(\{ {\varvec{y}}_t: s_t = s\}\). Conditional on the states, we can write:

$$\begin{aligned}{}[{\varvec{y}}_t | \{{\varvec{\Omega }}_s\}_{s=1}^S, \{s_t\}_{t=1}^T] {\mathop {\sim }\limits ^{indep}} N_p({\varvec{0}},{\varvec{\Omega }}_{s_t}^{-1}), \quad t=1,\ldots ,T, \end{aligned}$$
(1)

with state-dependent precision matrices. The observations \({\varvec{y}}_t\) remain conditionally independent but are no longer identically distributed given the states. Most important, (1) admits time-varying graphical dependence in \({\varvec{\Omega }}_{s_t}\), but with interpretable restrictions: each \({\varvec{\Omega }}_{s_t}\) belongs to the set of S precision matrices \(\{{\varvec{\Omega }}_s\}_{s=1}^S\) indexed by the dynamic states \(s_t\). Model dynamics are introduced via a discrete time-homogeneous Markov chain on the state variables. Specifically, the state transition probabilities are

$$\begin{aligned} p(s_t | s_{t-1}, \ldots , s_1) = p(s_t | s_{t-1}) = p_{s_{t-1} s_t} \end{aligned}$$
(2)

with \(S \times S\) transition probability matrix \({\varvec{P}} = (p_{rs})_{r,s \in \{1,\ldots , S\}}\). These dynamics imply that the graphical dependence at time t depends on that at time \(t-1\) through the states \(s_t\), which evolve according to the discrete Markov process defined in (2). The initial distribution of the Markov chain is the stationary distribution of the transition matrix \({\varvec{P}}\). The HMM is completed by independent Dirichlet priors on the rows \({\varvec{P}}_{r \cdot } = (p_{r1}, \ldots , p_{r S})\),

$$\begin{aligned} {\varvec{P}}_{r \cdot } {\mathop {\sim }\limits ^{indep}} \text{ Dir }(a_{r1}, \ldots , a_{rS}), \quad r=1,\ldots ,S, \end{aligned}$$
(3)

with concentration parameters \(a_{rs} > 0\).
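The state dynamics in (2)–(3) can be sketched in a few lines (our own illustration; the sizes S and T are chosen arbitrarily, and for simplicity the sketch starts from a uniform initial state rather than the stationary distribution of \({\varvec{P}}\) used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
S, T = 3, 750  # illustrative sizes

# Each row P_r. ~ Dirichlet(a_r1, ..., a_rS) with a_rs = 1, as in (3).
P = rng.dirichlet(np.ones(S), size=S)

def simulate_states(P, T, rng):
    """Simulate s_1, ..., s_T from the Markov chain (2)."""
    S = P.shape[0]
    s = np.empty(T, dtype=int)
    s[0] = rng.integers(S)  # uniform start: a simplification for this sketch
    for t in range(1, T):
        s[t] = rng.choice(S, p=P[s[t - 1]])
    return s

states = simulate_states(P, T, rng)
```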

2.2 Robust and dynamic graphical models

A crucial limitation of Gaussian graphical models is the lack of robustness that accompanies the Gaussian distribution. The presence of outliers or large values in the data violates the Gaussian assumption and, most importantly, can result in substantially less accurate estimates of the graph. For the HMM graphical model (1)–(3), outliers also impact state identification, which disrupts the interpretation of the hidden states \(s_t\) and the accompanying time t precision matrices \({\varvec{\Omega }}_{s_t}\).

To address these challenges, we introduce a robust generalization of the HMM graphical model. Specifically, we relax the Gaussianity assumption in (1) and instead consider robust alternatives based on the multivariate t-distribution. Importantly, the specified t-distributions provide the requisite robustness, yet preserve the modeling structure of the HMM model (1)–(3) and require only minimal modifications to the MCMC algorithm. We consider two variants: a classical multivariate t-distribution and a Dirichlet t-distribution.

The p-dimensional classical multivariate t-distribution is denoted as \({\varvec{y}} \sim t_{p}({\varvec{\mu }}, {\varvec{\Omega }}^{-1}, \nu )\) and has joint probability density function

$$\begin{aligned} f_\nu ( {\varvec{y}} | {\varvec{\mu }}, {\varvec{\Omega }}^{-1} ) = \dfrac{ \Gamma \{(\nu +p)/2\} |{\varvec{\Omega }}|^{1/2} }{ \pi ^{p/2} \nu ^{p/2} \Gamma (\nu /2) } \{1 + \nu ^{-1} ({\varvec{y}} - {\varvec{\mu }})' {\varvec{\Omega }}({\varvec{y}} - {\varvec{\mu }})\} ^{- (\nu +p)/2 }, \end{aligned}$$
(4)

where \(\nu \) is the degrees of freedom, \({\varvec{\mu }}\) is the mean parameter (when \(\nu > 1\)), and \({\varvec{\Omega }}^{-1}\) is the \(p \times p\) scale matrix (Pinheiro et al. 2001). The parameter \(\nu \) indexes the deviation from the multivariate Gaussian distribution: small values of \(\nu \) admit heavy-tailed behavior, while \(\nu \rightarrow \infty \) reverts (4) to a Gaussian distribution. We incorporate the classical multivariate t-distribution as the emission distribution within the HMM graphical model. Using a parameter expansion of the t-distribution (Kotz and Nadarajah 2004), we generalize (1) to

$$\begin{aligned}&[{\varvec{y}}_t | \{{\varvec{\Omega }}_s\}_{s=1}^S, \{s_t\}_{t=1}^T, \{\tau _t\}_{t=1}^T] {\mathop {\sim }\limits ^{indep}} N_p({\varvec{0}},{\varvec{\Omega }}_{s_t}^{-1}/\tau _t), \end{aligned}$$
(5)
$$\begin{aligned}&[\tau _t | \nu ] {\mathop {\sim }\limits ^{iid}} \text{ Gamma }(\nu /2, \nu /2), \quad t=1,\ldots ,T. \end{aligned}$$
(6)

Notably, \(\tau _t\) scales the precision matrix \({\varvec{\Omega }}_{s_t}\) at each time t, which adds a layer of robustness to the model that guards against sensitivity to extreme values of \(|y_{tj}|\). By marginalizing over \(\tau _t\), (5)–(6) implies \([{\varvec{y}}_t | \{{\varvec{\Omega }}_s\}_{s=1}^S, \{s_t\}_{t=1}^T] {\mathop {\sim }\limits ^{indep}} t_p({\varvec{0}},{\varvec{\Omega }}_{s_t}^{-1}, \nu )\) for \(t=1,\ldots ,T\).
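The scale-mixture construction can be made concrete with a short sketch (our own code; the dimensions and parameter values are illustrative). It draws from (5)–(6) and also evaluates the log density (4) directly; as a sanity check, for \(p = 1\) and \(\nu = 1\) the density reduces to a standard Cauchy:

```python
import numpy as np
from math import lgamma, log, pi

def log_t_density(y, mu, Omega, nu):
    """Log of the classical multivariate t density in Eq. (4)."""
    p = len(y)
    d = np.asarray(y) - np.asarray(mu)
    quad = d @ Omega @ d
    _, logdet = np.linalg.slogdet(Omega)
    return (lgamma((nu + p) / 2) - lgamma(nu / 2)
            + 0.5 * logdet - (p / 2) * log(pi * nu)
            - ((nu + p) / 2) * np.log1p(quad / nu))

def draw_classical_t(T, Sigma, nu, rng):
    """Draw T vectors via the parameter expansion (5)-(6):
    tau_t ~ Gamma(nu/2, rate nu/2), then y_t | tau_t ~ N(0, Sigma / tau_t)."""
    p = Sigma.shape[0]
    tau = rng.gamma(shape=nu / 2, scale=2 / nu, size=T)  # NumPy scale = 1/rate
    z = rng.multivariate_normal(np.zeros(p), Sigma, size=T)
    return z / np.sqrt(tau)[:, None]

rng = np.random.default_rng(1)
y = draw_classical_t(1000, np.eye(4), nu=3.0, rng=rng)
```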

Despite the generalization from the Gaussian model (1) to the classical multivariate-t model (5)–(6), the graph \({\mathcal {G}}\) is determined in the same way: \((i, j) \not \in {\mathcal {E}} \iff \omega _{i,j} = 0\). Conditional on the additional parameter \(\{\tau _t\}_{t=1}^T\), (5) is a Gaussian HMM graphical model and therefore inherits the familiar conditional independence interpretation among the components of \({\varvec{y}}_{t}\).

The primary disadvantage of the classical multivariate t-distribution is that each component of \({\varvec{y}}_t\) shares the same scaling \(\tau _t\). However, it is unlikely that all components of \({\varvec{y}}_t\) are outliers at exactly the same set of times t. As the dimension p increases, it becomes more likely that some but not all components of \({\varvec{y}}_t\) are outliers at any given time t. Clearly, the classical multivariate t-distribution is not well-designed for this common occurrence. To address this issue, we generalize (5) as follows:

$$\begin{aligned} \left[ {\varvec{y}}_t | \{{\varvec{\Omega }}_s\}_{s=1}^S, \{s_t\}_{t=1}^T, \{{\varvec{\tau }}_t\}_{t=1}^T\right] {\mathop {\sim }\limits ^{indep}} N_p({\varvec{0}}, \text {diag}( 1/ \sqrt{{\varvec{\tau }}_t}){\varvec{\Omega }}_{s_t}^{-1}\text {diag}( 1/ \sqrt{{\varvec{\tau }}_t})), \end{aligned}$$
(7)

which incorporates both the state-specific precision matrix \({\varvec{\Omega }}_{s_t}\) and a component-specific scaling \({\varvec{\tau }}_t = (\tau _{t1},\ldots , \tau _{tp})\) at each time t. When all components \(\tau _{tj} = \tau _t\) are identical for \(j=1,\ldots ,p\), model (7) simplifies to the classical multivariate t-model in (5). When all components \(\tau _{tj}\) are endowed with independent \(\text{ Gamma }(\nu /2 ,\nu /2)\) priors, Finegold et al. (2014) refer to the resulting marginal distribution as the alternative multivariate t-distribution. Although this specification improves modeling flexibility relative to the classical multivariate t-distribution, it suffers from overparametrization and incurs large computational costs.

As a compromise between the unnecessarily restrictive constraints \(\tau _{tj} = \tau _t\) of (5) and the overparametrization issues of (7) with iid priors on \(\tau _{tj}\), we consider a cluster-based approach using Dirichlet processes and cluster the p elements in \({\varvec{\tau }}_t\) at each time t separately. Clustering produces a more parsimonious parametrization and encourages information-sharing among the components of \({\varvec{\tau }}_t\), yet accommodates distinct scalings \(\tau _{tj}\) for any component j as needed. In particular, the Dirichlet-t model imposes the following prior on \({\varvec{\tau }}_t\):

$$\begin{aligned} \begin{aligned}&\tau _{tj} \overset{iid}{\sim }\ P_t, \quad j=1,\ldots ,p,\\&P_t \overset{iid}{\sim }\ DP(\alpha , P_0),\quad t=1,\ldots ,T,\\&P_0 = \text {Gamma}(\nu /2,\nu /2),\\&\alpha \sim \text {Gamma}(a_\alpha , b_\alpha ), \end{aligned} \end{aligned}$$
(8)

where \(DP(\alpha , P_0)\) denotes a Dirichlet process with concentration \(\alpha > 0\) and base measure \(P_0\). The prior (8) uses a Gamma distribution for the base measure \(P_0\) akin to (6), and is combined with the conditional likelihood (7). The concentration parameter \(\alpha \) determines the clustering behavior: as \(\alpha \rightarrow 0\), (8) converges to the classical multivariate t-distribution, while \(\alpha \rightarrow \infty \) produces the alternative t-distribution. Importantly, (8) includes a prior on \(\alpha \) in order to learn this key parameter. For a review of Dirichlet processes, see Neal (2000).
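A hedged sketch of one draw of \({\varvec{\tau }}_t\) under the Dirichlet-t prior (8), using a truncated stick-breaking representation of the DP (our own code; the truncation level K and the values of \(\alpha \) and \(\nu \) are illustrative choices):

```python
import numpy as np

def draw_tau_dp(p, K, alpha, nu, rng):
    """One draw of tau_t = (tau_t1, ..., tau_tp) under prior (8),
    via stick-breaking truncated at K atoms."""
    # Stick-breaking weights: w_k = v_k * prod_{l<k} (1 - v_l), v_k ~ Beta(1, alpha).
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0  # close the stick at the truncation level
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    atoms = rng.gamma(nu / 2, 2 / nu, size=K)  # atoms from P0 = Gamma(nu/2, nu/2)
    z = rng.choice(K, size=p, p=w)  # cluster indicator for each component
    return atoms[z], z

rng = np.random.default_rng(2)
tau_t, z_t = draw_tau_dp(p=20, K=10, alpha=1.0, nu=3.0, rng=rng)
```

Small \(\alpha \) concentrates all p components on a single atom (recovering the common scaling of the classical-t), while large \(\alpha \) tends toward p distinct atoms (the alternative t).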

2.3 Prior on multiple precision matrices

Although the states s induce distinct conditional dependence relations via \({\varvec{\Omega }}_s\), it is likely that the set of precision matrices \(\{{\varvec{\Omega }}_s\}_{s=1}^S\) will exhibit some common features. For example, some conditional correlations may be state-specific, while others may persist across some or all states. To incorporate these common features, we adopt a prior construction used in Peterson et al. (2020) as an extension of the continuous spike-and-slab prior of Wang (2015) to multiple graphs. This prior is written as:

$$\begin{aligned} p({\varvec{\Omega }}_1,\ldots ,{\varvec{\Omega }}_S|\{{\varvec{G}}_s\}_{s=1}^S, {\varvec{\Theta }}, v_0, v_1, \uplambda ) \propto \prod _{i<j} N_S ({\varvec{\omega }}_{ij}| {\varvec{0}}, {\varvec{\Theta }}_{ij}) \cdot \prod _{i} \prod _{s} \text{ Exp }(\omega _{s,ii}| \uplambda /2) \cdot {\varvec{1}}_{{\varvec{\Omega }}_1, \ldots , {\varvec{\Omega }}_S \in M^{+}}, \end{aligned}$$
(9)

where \({\varvec{\omega }}_{ij} = (\omega _{1,ij}, \omega _{2,ij},\ldots , \omega _{S,ij} )\), with \(\omega _{s,ij}\) the (i,j)th element in \({\varvec{\Omega }}_s\), \({\varvec{\Theta }}_{ij}\) is the covariance matrix, \(\text {Exp}(\omega _{s,ii}| \uplambda /2)\) denotes the exponential distribution with rate \(\uplambda /2\), \({\varvec{1}}\) is the indicator function, and \(M^+\) is the space of \(p\times p\) positive definite matrices. The vector \({\varvec{\omega }}_{ij}\) encodes the connection between vertex i and vertex j for each state \(s=1,\ldots ,S\). Dependence among the elements of \({\varvec{\omega }}_{ij}\) is achieved through the covariance matrix \({\varvec{\Theta }}_{ij}\), which is decomposed into standard deviations \({\varvec{v}}_{ij} = \{v_{g_{s,ij}}\}_{s=1}^{S}\) and the \(S \times S\) interstate correlation matrix \({\varvec{\Phi }}\):

$$\begin{aligned} {\varvec{\Theta }}_{ij} = \text {diag}({\varvec{v}}_{ij}) \cdot {\varvec{\Phi }}\cdot \text {diag}({\varvec{v}}_{ij}), \end{aligned}$$
(10)

with standard deviations \(v_{g_{s,ij}}\) such that \(v_{g_{s, ij}} = v_0\) if \(g_{s, ij} = 0\) and \(v_{g_{s, ij}} = v_1\) if \(g_{s, ij} = 1\), respectively for each \(\omega _{s,ij}\). The prior induces a continuous spike around zero by specifying the standard deviation \(v_0\) to be small, while \(v_1\) is chosen to be large for the diffuse slab component. The edge indicators \(g_{s,ij}\) determine whether each \(\omega _{s,ij}\) belongs to the spike component (\(g_{s,ij} = 0\)) or the slab component (\(g_{s,ij} = 1\)). Setting \({\varvec{\Phi }}= {\varvec{I}}_S\) results in independent priors. The linked spike-and-slab prior is completed with a prior on the state-specific edge indicators \(\{{\varvec{G}}_s\}_{s=1}^S\),

$$\begin{aligned} p({\varvec{G}}_1,\ldots ,{\varvec{G}}_S | {\varvec{\Phi }}, v_0, v_1, \uplambda , \pi ) \propto \prod _{s} \prod _{i<j} \{ \pi ^{g_{s,ij}} (1-\pi )^{1-g_{s,ij}} \} \end{aligned}$$
(11)

and the correlation matrix \({\varvec{\Phi }}\),

$$\begin{aligned} p({\varvec{\Phi }}) \propto {\varvec{1}}_{{\varvec{\Phi }}\in \mathcal { R^{+}}}, \end{aligned}$$
(12)

where \({\mathcal {R}}^+\) is the set of positive definite matrices with ones on the diagonal. As noted by Wang (2015) and Peterson et al. (2020), the prior (11) is analytically defined only up to a normalizing constant which is proportional to the unknown normalizing constant of prior (9), and therefore cancels out in the joint prior \([{\varvec{\Omega }}_s, {\varvec{G}}_s]\). The correlation matrix \({\varvec{\Phi }}\) appears in the normalizing constant of (11), so the joint prior (11)–(12) is no longer exactly uniform on \({\varvec{\Phi }}\) in \({\mathcal {R}}^+\). However, as shown by Wang (2015), the effect of these unknown normalizing constants on the posterior inference is extremely mild, and the parameters \(\pi ,v_0\) and \(v_1\) can be easily calibrated to achieve a pre-specified level of sparsity. Our sensitivity analyses confirm those findings (see Sect. 3.4).
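The covariance construction (10) for a single vertex pair is simple to sketch (our own illustration; the values of \(v_0\) and \(h = v_1/v_0\) echo the settings discussed in Sect. 3.2, and the interstate correlation matrix \({\varvec{\Phi }}\) is an arbitrary positive definite example):

```python
import numpy as np

# Spike/slab standard deviations: v0 small (spike), v1 = h * v0 large (slab).
v0, h = 0.02, 50
v1 = h * v0

# Edge indicators g_{s,ij} for one pair (i, j) across S = 3 states:
# edge absent in state 1, present in states 2 and 3.
g_ij = np.array([0, 1, 1])
v_ij = np.where(g_ij == 1, v1, v0)

# Illustrative interstate correlation matrix Phi (positive definite,
# unit diagonal).
Phi = np.array([[1.0, 0.3, 0.2],
                [0.3, 1.0, 0.4],
                [0.2, 0.4, 1.0]])

# Eq. (10): Theta_ij = diag(v_ij) * Phi * diag(v_ij).
Theta_ij = np.diag(v_ij) @ Phi @ np.diag(v_ij)
```

The spike entry (state 1) has tiny prior variance \(v_0^2\), while the slab entries (states 2 and 3) have variance \(v_1^2\) and are correlated through \({\varvec{\Phi }}\).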

Our proposed robust and dynamic approach offers several crucial advantages over existing methods for graphical model estimation. With respect to the robust Dirichlet-t model of Finegold et al. (2014), our model incorporates graph dynamics via HMMs, and further encourages information sharing among the state-specific graphs using linked priors. Furthermore, Finegold et al. (2014) used a hyper-inverse Wishart prior that requires decomposable graphs; our approach does not impose such a restriction. Our strategy to incorporate dynamics via HMM is similar to Warnick et al. (2018). Their method, however, is limited in computational scalability, as it uses G-Wishart distributions, and relies on a Gaussian emission distribution. Lastly, our algorithmic implementation of (8) uses a new truncation strategy that dramatically improves computational efficiency (see the “Appendix”). In conjunction, these features provide both methodological and computational advances.

2.4 MCMC algorithm for posterior inference

We design an efficient Markov chain Monte Carlo (MCMC) algorithm to sample the model parameters from their joint posterior distribution. In the dynamic Gaussian graphical model, the unknown parameters are \(\{\{{\varvec{G}}_s\}_{s = 1}^S\), \(\{{\varvec{\Omega }}_s\}_{s = 1}^S\), \({\varvec{\Phi }}\), \(\{s_t\}_{t = 1}^T, {\varvec{P}}\}\). The robust dynamic classical-t graphical model adds the scaling parameters \(\{\tau _t\}_{t = 1}^T\), while the dynamic Dirichlet-t graphical model further includes \(\{{\varvec{\tau }}_t\}_{t = 1}^T\) and \(\alpha \). A generic iteration of the MCMC algorithm uses the steps described below. Details are reported in the “Appendix”.

  • Sample \({\varvec{\Omega }}\) and \({\varvec{G}}\): For each \(s \in \{1,\ldots ,S\}\), we first update the precision matrix \({{\varvec{\Omega }}}_{s}\) using a block Gibbs sampler with closed-form conditional distributions for each column. The sampler automatically guarantees positive definiteness of the precision matrix, as shown in the “Appendix”. Then we update \({\varvec{G}}_s\) by drawing each edge from a Bernoulli distribution.

  • Sample \({\varvec{\Phi }}\): This Metropolis-within-Gibbs step samples the entire matrix at once using a parameter expansion method (Liu and Daniels 2006).

  • Sample \({\varvec{P}}\) and \(s_t\): This Metropolis-Hastings step first samples the transition probability matrix \({\varvec{P}}\), by row from a Dirichlet distribution, and then the hidden states \(s_t\), \(t = 1,\ldots ,T\), with Forward-Backward sampling (Scott 2002).

  • Sample \({\varvec{\tau }}\) (classical-t): Sample each \(\tau _t, t=1,\ldots ,T\) from its full conditional distribution: \(\text {Gamma}(\nu /2 + p/2, \nu /2 + {\varvec{y}}_t' {\varvec{\Omega }}_{s_t} {\varvec{y}}_t/2)\).

  • Sample \({\varvec{\tau }}\) and \({\varvec{\alpha }}\) (Dirichlet-t): We use a truncated stick-breaking representation of the Dirichlet process (Ishwaran and James 2001). Given a truncation level K, let \({\varvec{Z}}\) be a \(T \times p\) matrix of cluster indicators, i.e., \(z_{tj} = k\) if \(\tau _{tj}\) is in cluster k, \(k = 1,\ldots ,K\), and let \({\varvec{\eta }}_t = (\eta _{t1},\ldots ,\eta _{t K})\) denote the unique cluster values of \({\varvec{\tau }}_t\). First, we sample \(z_{tj}\) given \({\varvec{\tau }}\) and other parameters from its multinomial posterior distribution. Second, we sample the unique cluster values \(\eta _{tk}\) using the rejection algorithm from Finegold et al. (2014). Lastly, we sample \(\alpha \) from a conditionally conjugate Gamma distribution.
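For instance, the classical-t scaling update above is a standard conjugate Gamma draw. A minimal sketch (our own code; note that the paper states the full conditional in shape-rate form, whereas NumPy parameterizes the Gamma by scale = 1/rate):

```python
import numpy as np

def sample_tau(y_t, Omega_st, nu, rng):
    """Draw tau_t from Gamma(nu/2 + p/2, rate = nu/2 + y_t' Omega_{s_t} y_t / 2)."""
    p = len(y_t)
    shape = nu / 2 + p / 2
    rate = nu / 2 + y_t @ Omega_st @ y_t / 2
    return rng.gamma(shape, 1.0 / rate)  # NumPy uses scale = 1/rate

rng = np.random.default_rng(3)
p, nu = 5, 3.0
y_t = rng.standard_normal(p)
tau_t = sample_tau(y_t, np.eye(p), nu, rng)
```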

For posterior inference, we are interested in the detection of the hidden states \(s_t\), for \(t=1,\ldots , T\), and the estimation of the corresponding graphs \({\varvec{G}}_s\), for \(s=1, \ldots , S\). We estimate the hidden states by computing the proportion of MCMC samples for which a time point is classified in each of the S states, and then assigning the most probable state. Corresponding graphs are then estimated by computing marginal probabilities of edge inclusion as proportions of MCMC iterations in which an edge was included. More precisely, for each \(s=1, \ldots , S\), the posterior probability \(p(g_{s,ij}=1|\text {data})\) is estimated as the proportion of iterations that \(g_{s,ij}=1\). Median graphs are obtained by thresholding the posterior probabilities at 0.5. Given the graphs, estimates of the corresponding precision matrices can be obtained by averaging the sampled MCMC values. The MCMC output also generates an estimate of the correlation matrix \({\varvec{\Phi }}\) that describes the pairwise similarity between states.
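The posterior summaries described above reduce to simple proportions over MCMC draws; a sketch with stand-in output (the array names and dimensions are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in MCMC output: sampled edge indicators g_{s,ij} for one state,
# shape (number of draws) x (number of candidate edges).
n_draws, n_edges = 200, 6
G_draws = rng.integers(0, 2, size=(n_draws, n_edges))

# Marginal posterior probabilities of edge inclusion, estimated as the
# proportion of iterations with g_{s,ij} = 1, and the median graph
# obtained by thresholding at 0.5.
edge_probs = G_draws.mean(axis=0)
median_graph = (edge_probs > 0.5).astype(int)
```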

3 Simulation study

We use simulated data to assess the performance of the proposed models and compare results to existing approaches. We consider the proposed dynamic and robust models (Sect. 2.2), i.e. the dynamic classical-t graphical model (dCT) and the dynamic Dirichlet-t graphical model (dDT), against the dynamic Gaussian graphical model (dGM). We also consider a frequentist competitor, i.e. the fused graphical lasso (fLasso; Danaher et al. 2014), which is a penalized estimation technique that encourages sparse graphs with similarities across known groups. For these groups, we supply the fLasso with the true hidden states, which ensures that the fLasso is maximally competitive. The sparsity tuning parameter \(\uplambda _1\) and the similarity tuning parameter \(\uplambda _2\) are selected from a grid of values using AIC, as done in Danaher et al. (2014). The “Appendix” provides more details on the selection of \(\uplambda _1\) and \(\uplambda _2\).

3.1 Data generation

Synthetic datasets with \(p = 20\), \(S = 3\) and \(T = 750\) are simulated as follows. First, we construct the sparse precision matrices \(\{{\varvec{\Omega }}_s\}_{s = 1,\ldots ,S}\). We set \({\varvec{\Omega }}_1\) to be an AR(2) structure, with diagonal elements set to 1 and off-diagonal elements set to 0.5 and 0.4 on the first and second diagonal, respectively. Each subsequent \({\varvec{\Omega }}_s\), for \(s=2,\ldots ,S\), is generated from \({\varvec{\Omega }}_{s-1}\) following these steps: we randomly delete 10 edges by setting the corresponding entries in \({\varvec{\Omega }}_{s-1}\) to zero; we add 10 edges in random locations with values from \(\text {Uniform}((-0.6,-0.4)\cup (0.4,0.6))\); and we adjust \({\varvec{\Omega }}_s\) to be positive definite using Peng et al. (2009). The resulting graphs are shown in Fig. 1. Graphs \({\varvec{G}}_1\) and \({\varvec{G}}_2\) share 89% of edges, graphs \({\varvec{G}}_2\) and \({\varvec{G}}_3\) share 89% of edges, and graphs \({\varvec{G}}_1\) and \({\varvec{G}}_3\) share 84% of edges. The sample correlations \(\phi _{ss'}\) between the entries of \({\varvec{\Omega }}_s\) and \({\varvec{\Omega }}_{s'}\) are \(\phi _{12} = 0.67\), \(\phi _{23}=0.68\), and \(\phi _{13} = 0.56\). To generate the hidden states, we divide the time period [1, T] into 15 blocks of equal length and randomly assign each block to one of the S states, which results in states of different lengths. The precision matrices and hidden states are generated once and then fixed for all simulated datasets.
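The AR(2)-structured \({\varvec{\Omega }}_1\) described above can be constructed directly; a sketch (our own code, covering only the first state; the subsequent edge perturbations and the positive definiteness adjustment of Peng et al. 2009 are omitted):

```python
import numpy as np

# Omega_1: banded AR(2) structure with ones on the diagonal, 0.5 on the
# first off-diagonals, and 0.4 on the second off-diagonals.
p = 20
Omega1 = (np.eye(p)
          + 0.5 * (np.eye(p, k=1) + np.eye(p, k=-1))
          + 0.4 * (np.eye(p, k=2) + np.eye(p, k=-2)))
```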

Fig. 1 True edges in each state \(s=1,2,3\) for the simulated data (\(p = 20\))

Fig. 2 Simulated data for the classical-t (top) and highly contaminated (middle) scenarios along with the true hidden states (bottom) over time. In the top 2 plots, each colored line represents one time series

Given the precision matrices and the hidden states, we simulate three types of data, with examples illustrated in Fig. 2:

  1. Classical-t data: \([{\varvec{y}}_t | \{{\varvec{\Omega }}_s\}, \{s_t\}] {\mathop {\sim }\limits ^{indep}} t_{p}({\varvec{\mu }}= {\varvec{0}}, {\varvec{\Omega }}^{-1} = {\varvec{\Omega }}_{s_t}^{-1}, \nu = 3)\) for \(t=1,\ldots ,T\).

  2. Slightly contaminated data: first, we simulate \([{\varvec{y}}_t | \{{\varvec{\Omega }}_s\}, \{s_t\}] {\mathop {\sim }\limits ^{indep}} N_p({\varvec{\mu }}= {\varvec{0}}, {\varvec{\Omega }}^{-1} = {\varvec{\Omega }}_{s_t}^{-1})\) for \(t=1,\ldots ,T\); then, for each time series \(\{y_{tj}\}_{t=1}^T\), \(j=1,\ldots ,p\), we randomly select 5 time points and add independent noise generated from N(0, 100).

  3. Highly contaminated data: similar to the slightly contaminated data, except that we randomly select 10 time points for each time series to add Gaussian noise.
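The contamination mechanism in scenarios 2 and 3 can be sketched as follows (our own code; an identity covariance stands in for the state-dependent \({\varvec{\Omega }}_{s_t}^{-1}\), and n_contam = 5 or 10 selects the slight or high contamination level):

```python
import numpy as np

rng = np.random.default_rng(5)
T, p = 750, 20
n_contam = 10  # 5 for "slightly", 10 for "highly" contaminated

# Baseline Gaussian data; the identity stands in for the state-dependent
# covariance Omega_{s_t}^{-1} of the actual simulation design.
Y = rng.multivariate_normal(np.zeros(p), np.eye(p), size=T)

# For each component series, add N(0, 100) noise (sd = 10) at randomly
# chosen time points.
for j in range(p):
    idx = rng.choice(T, size=n_contam, replace=False)
    Y[idx, j] += rng.normal(0.0, 10.0, size=n_contam)
```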

3.2 Parameter settings

For the hyperparameters in the joint prior (9) on \(\{{\varvec{\Omega }}_s\}\), we follow the guidelines in Wang (2015) and set \(\uplambda = 1\), \(v_0 = 0.02\), \(h = 50\) (\(v_1 = h \cdot v_0\)) and \(\pi = 3/(p-1)\). A sensitivity analysis to these choices is presented in Sect. 3.4. The HMMs use \(S=3\) hidden states; details on selecting S for real data are provided in Sect. 4. The Dirichlet weights in (3) are set to \(a_{rs} = 1\) for all \(r,s = 1,\ldots , S\), and we fix the classical-t and Dirichlet-t parameters at \(\nu =3\) and \(a_{\alpha } = b_{\alpha } = 1\) in (8). To initialize the model, the states \(s_t\) are constructed by dividing the time series equally into S parts. Given the states, initial values of each \({\varvec{\Omega }}^{-1}_s\) are computed using a robust covariance estimator (Gottard and Pacillo 2010).

Results reported below for the dynamic Gaussian model and the dynamic classical-t model were obtained by running MCMC chains for 10,000 total iterations with a burn-in of 2000 iterations. The dynamic Dirichlet-t model converged much faster and was run for a total of 1500 iterations with a burn-in of 500 iterations. The Gaussian and classical-t models took on average about 3.2 min to run (19 s per 1000 iterations) and the Dirichlet-t model about 12.5 min (500 s per 1000 iterations), on a laptop computer with 1.80 GHz Intel Core i7. MCMC convergence was checked by visually inspecting the trace plots of the number of included edges in each state across iterations and those of the hidden states indicators (plots omitted for the simulation study). The Geweke diagnostic test (Geweke et al. 1991) was used to test for convergence of the number of observations in each state with a significance level of 1%. Notably, since there are many outliers, the Gaussian model never converged, even for substantially longer MCMC chains. Similarly, the classical-t model showed some convergence issues, while the Dirichlet-t model converged in almost all simulation replicates for all three scenarios. Similar behaviors were observed by Finegold et al. (2014).

Table 1 Performance comparisons for hidden state estimation over the three hidden states, for the three simulation scenarios

3.3 Results

We assess performance on edge selection and hidden states estimation using the true positive rate (TPR), the false positive rate (FPR) and the Matthews correlation coefficient (MCC). The MCC is defined as

$$\begin{aligned} MCC = \frac{TP \times TN - FP \times FN}{ \sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}} \end{aligned}$$

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. The MCC provides a measure of overall classification success for a selected model, with larger values indicating better performance. We also provide ROC curves, computed by varying the threshold on the marginal posterior probabilities of edge inclusion from 0 to 1. Finally, estimation of the precision matrices is evaluated based on the Frobenius loss:

$$\begin{aligned} \frac{1}{S}\sum _{s=1}^S || \hat{{\varvec{\Omega }}}_{s} - {\varvec{\Omega }}^*_{s} ||^2_F / || {\varvec{\Omega }}^*_{s} ||^2_F, \end{aligned}$$
(13)

where \({\varvec{\Omega }}_s^*\) is the true precision matrix in state s and \(\hat{{\varvec{\Omega }}}_s\) is the posterior mean.
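For concreteness, the edge-selection metrics and the loss in (13) amount to the following computations (a small Python sketch; the function names `edge_metrics` and `frobenius_loss` are ours):

```python
import numpy as np

def edge_metrics(true_adj, est_adj):
    """TPR, FPR and MCC for edge selection, comparing the upper
    triangles of a true and an estimated adjacency matrix."""
    iu = np.triu_indices_from(np.asarray(true_adj), k=1)
    t = np.asarray(true_adj)[iu].astype(bool)
    e = np.asarray(est_adj)[iu].astype(bool)
    tp = np.sum(t & e)
    tn = np.sum(~t & ~e)
    fp = np.sum(~t & e)
    fn = np.sum(t & ~e)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return tpr, fpr, mcc

def frobenius_loss(Omega_hat, Omega_true):
    """Average relative Frobenius loss of Eq. (13): for each state,
    ||Omega_hat - Omega_true||_F^2 / ||Omega_true||_F^2, averaged over S."""
    S = len(Omega_true)
    return sum(
        np.linalg.norm(oh - ot, "fro") ** 2 / np.linalg.norm(ot, "fro") ** 2
        for oh, ot in zip(Omega_hat, Omega_true)
    ) / S

truth = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # true edges (1,2) and (2,3)
est   = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]   # estimated edges (1,2) and (1,3)
tpr, fpr, mcc = edge_metrics(truth, est)    # -> 0.5, 1.0, -0.5
```

ROC curves follow by sweeping the threshold on the posterior edge inclusion probabilities from 0 to 1 and recording (FPR, TPR) for the resulting adjacency matrices.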

Table 1 summarizes hidden state estimation under the three simulation designs, in each case averaged over 25 simulated datasets. The estimated state at each time point is the posterior most probable state. The hidden states estimated by the Gaussian model are driven by outliers in the data and do not reflect the true hidden states, while the dynamic Dirichlet-t graphical model performs well, recovering the hidden states accurately in all three scenarios. As expected, the classical-t model performs well in the classical-t scenario, but its performance deteriorates as the number of contaminated data points increases.

Next, we evaluate graph estimation for the proposed methods and include comparisons to fLasso (Danaher et al. 2014). Figure 3 shows the edge selection accuracy and Frobenius loss across all 25 simulated datasets for each design, Table 2 reports these results averaged over all 25 replicates, and Fig. 4 plots the ROC curves obtained by varying the posterior probability of inclusion threshold (for fLasso, we compute AUCs over a grid of tuning parameters and keep the best one; more details are provided in the “Appendix”). As expected, the dynamic Gaussian graphical model is inferior to the robust alternatives in all three scenarios. The Dirichlet-t and classical-t models perform similarly for the classical-t data, while the Dirichlet-t model substantially outperforms the classical-t model in the contaminated scenarios, especially as the number of contaminated data points increases. These results are consistent across all five performance measures. Most notably, the proposed approaches vastly outperform the fLasso on both edge selection and estimation, despite the fact that fLasso assumes knowledge of the true hidden states. The FPRs for the fLasso are exceedingly high, which suggests that the fLasso produces unnecessarily dense graphs in the presence of heavy-tailed or contaminated data. Yet, as demonstrated by the ROC curves in Fig. 4, varying the fLasso tuning parameter does not resolve this issue.

In addition, we fit the three dynamic Bayesian models with \(S = 2\) and \(S = 4\) on the two contaminated datasets. The hidden state estimation for the Dirichlet-t model remains superior to both the Gaussian model and the classical-t model (results not shown). For the Dirichlet-t model, the 4th state collapses into an empty state when \(S = 4\), and states 2 and 3 are combined into one state when \(S = 2\). We also evaluate WAIC for different choices of S. Unlike other information criteria (e.g., AIC, BIC, and DIC), WAIC does not rely on a single point estimate of the model parameters and instead utilizes the joint posterior distribution for model assessment (Gelman et al. 1995). The complexity penalty is defined as the variance of the log-likelihoods across MCMC iterations, as recommended in Gelman et al. (1995). The WAIC for \(S=3\) is around 53,000 on average, about 400 lower than for the other two choices of S. In practice, we also recommend choosing S based on interpretability and by inspection of the state allocations, especially if one or more states are empty.
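As a reference, the WAIC computation described above can be sketched from a matrix of pointwise log-likelihoods saved across MCMC iterations (Python; the function names are ours, and this is a generic WAIC sketch rather than the paper's exact implementation):

```python
import numpy as np

def log_mean_exp(loglik):
    """log of the mean of exp(loglik) over MCMC draws (axis 0),
    computed stably for each observation."""
    m = loglik.max(axis=0)
    return m + np.log(np.mean(np.exp(loglik - m), axis=0))

def waic(loglik):
    """WAIC from pointwise log-likelihoods of shape (n_draws, n_obs).
    The complexity penalty is the variance of the log-likelihood
    across draws, summed over observations."""
    lppd = np.sum(log_mean_exp(loglik))           # log pointwise predictive density
    p_waic = np.sum(np.var(loglik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)

# With no posterior variability the penalty is zero:
ll = np.full((100, 5), -1.0)
waic(ll)  # -> -2 * (5 * (-1) - 0) = 10.0
```

Smaller WAIC indicates a better fit, so one would compute this for each candidate S and pick the minimizer, subject to the interpretability checks noted above.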

Lastly, we conduct a simulation study with \(p=10\). The edges in the first graph are randomly selected, and the graphs in the other two states are generated sequentially using the same random sampling procedure. The time period [1, T] was divided into 25 blocks instead of 15 as in the previous study. Compared to the previous setup, these graphs do not have a specific structure and the system has less persistence in the hidden states. The model hyperparameters are the same as in the previous simulation (\(\uplambda = 1\), \(v_0 = 0.02\), \(h = 50\), \(\pi = 3/(p-1)\), \(S=3\) and \(\nu =3\)). MCMC convergence was checked using the same procedure. Since the states are less persistent, 200 more iterations were added to the burn-in period of the Dirichlet-t model to ensure convergence of the hidden states.

Based on the simulation results, the performance of the four models is very similar to the previous study. The Dirichlet-t model is able to estimate the hidden states and graphs accurately, while the non-robust methods perform the worst. Tables for performance comparison are provided in the “Appendix”. The Dirichlet-t model took on average 2.5 min per 1000 iterations with a total of 1700 iterations; the classical-t model and the Gaussian model took on average 7.5 s per 1000 iterations with a total of 10,000 iterations. For our proposed Gibbs sampler, the empirical per-iteration computation time of sampling the scaling parameters with the Dirichlet prior is roughly proportional to p and to the total number of time points T. Wang (2012) studies the running time per 1000 iterations for sampling the precision matrix: the computational cost is roughly linear in p when p is under 100, but increases much faster when p is above 100. Our empirical computing time is consistent with these findings; it is also linear in the number of states S. The sampling time for the other parameters is negligible.

Aggregating across all performance metrics, it is clear that the Dirichlet-t HMM graphical model offers substantial advantages over competing methods under a variety of scenarios.

Fig. 3 Performance comparisons for graph estimation. Boxplots of the true positive rate (TPR), false positive rate (FPR), Matthews correlation coefficient (MCC) and Frobenius loss (FL), obtained on 25 replicates, for the three simulation scenarios

Table 2 Performance comparisons for graph estimation
Fig. 4 ROC curves for edge selection under each simulation design. The models are: fused graphical Lasso with given states (fLasso), dynamic Gaussian graphical model (dGM), dynamic classical-t graphical model (dCT) and dynamic Dirichlet-t graphical model (dDT)

3.4 Sensitivity analysis

We investigate sensitivity to hyperparameter choices using one simulated dataset from the slightly contaminated scenario. Wang (2015) recommends the default setting of \(\pi =2/(p-1)\) and provides a sensitivity analysis of the variance parameters, \(v_0\) and \(v_1\), of the spike-and-slab prior on the precision matrix. Using the reparameterization \(v_1 = hv_0\), he recommends settings with \(v_0>0.01\) and \(h<1000\) as a general range of acceptable values. Peterson et al. (2020) perform a similar sensitivity analysis on \(v_0\) and \(v_1\) in the linked prior (9)–(12), obtaining results consistent with those of Wang (2015): increasing \(v_0\) for fixed h leads to sparser graphs, while increasing h for fixed \(v_0\) reduces the number of selected edges. In our experience, graph estimation is robust to the choice of \(\uplambda \), confirming previous results in the literature. We fix \(\uplambda =1\) and standardize the data prior to model fitting.

Table 3 Sensitivity analysis for simulated data

Given the sensitivity results from the existing studies, we focus our investigation on the robust models of Sect. 2.2 and specifically on the priors on the scaling parameters \(\tau \) in (6) and (8). We investigate the hyperparameter \(\nu \in \{3, 6, 9\}\), such that the prior standard deviation ranges from 0.3 to 0.6 and the distribution is centered around 1. We also investigate different choices of \(\pi \in \{2/(p-1), 3/(p-1), 4/(p-1), 5/(p-1)\}\), which leads to a prior marginal edge inclusion probability ranging from \(1/(p-1)\) to \(4/(p-1)\). We set all other parameters as described in Sect. 3.2. Results reported in Table 3 show that the graph estimation accuracy of the Dirichlet-t model can be improved for smaller values of \(\nu \), i.e., a larger prior variance for the scale parameter \(\tau \), and for larger values of \(\pi \), which correspond to larger prior edge inclusion probabilities. The classical-t model is consistently inferior to the Dirichlet-t model across these hyperparameter settings, and in particular is overly conservative in selecting edges. This slightly contaminated simulation setting is not favorable for the classical-t model, and the hyperparameters apparently cannot be tuned to match the superior performance of the Dirichlet-t model.

In general, we suggest setting the model hyperparameters to be \(\uplambda = 1\), \(v_0 = 0.02, h = 50, \pi = 3/(p-1)\) and \(\nu = 3\). Based on our experience, the proposed dynamic Dirichlet-t graphical model can generate dense graphs based on the data even if the prior suggests a low edge inclusion probability. If the application requires a denser graph, one could increase \(\pi \) and decrease \(v_0\) and h.

4 Dynamic analysis of hand gesture data

Analysis of human gestures, such as facial expressions, hand motions, and torso movements, is essential across a wide range of applications, including computer vision, human-computer interaction, animation, linguistics, security, and rehabilitation, among many others. Modern motion capture devices such as Microsoft’s Xbox Kinect (Lun and Zhao 2015) provide high-resolution, affordable solutions for recording human gestures. Thus, human gesture analysis is becoming increasingly important and increasingly common, and new statistical methods are needed for these high-dimensional and noisy data.

Phase segmentation models, notably hidden Markov models, are a prevailing tool to study the dynamics of human gestures (Mitra and Acharya 2007; Cheok et al. 2019). The visible bodily actions are segmented into several states, for example, a natural rest position and an artificial gesture unit (Kim et al. 2007). In linguistic studies, a discourse can be considered as one or several movement excursions (Kendon 2004). An excursion refers to hands moving from a position of rest to a position where the main movement happens, then executing the main movement and turning back to the rest position. One challenge in automatic gesture phase segmentation is that the exact definition of states—even on the same clip—can vary across researchers and human encoders (Madeo et al. 2013). Our proposed methods define states as having unique coordination patterns between body parts of interest, and provide a novel solution to the segmentation problem.

4.1 Data

We analyze the Gesture Phase Segmentation Data Set available on the UCI Machine Learning Repository to study the correlations between body parts. The gestures of a subject during storytelling were recorded using a Microsoft Xbox Kinect device. The story was based on simple comics shown to the subject prior to capturing the video. The raw footage was in the form of a stream of 3D frames. The 3D coordinates (x, y, z) of 4 points of interest (left hand, left wrist, right hand and right wrist) were acquired from each frame. Next, the 3D velocity and acceleration of these four points were calculated in each frame and represented by 32 variables, including the vectorial velocity and acceleration of the four points along each (x, y, z) coordinate, as well as the scalar velocities and accelerations of the four points. A complete list of variables and their indexing is provided in the “Appendix”. The 32 time series are of length \(T = 1743\) and are displayed in Fig. 5. Details of the data collection and processing are provided in Madeo et al. (2013).

Fig. 5 Gesture data (top, 32 time series, each color represents one time series), phases defined by movement excursions based on linguistic studies (middle), and posterior most probable states over time from the proposed graphical model (bottom)

In Madeo et al. (2013), each frame was labeled by a human encoder, and the stream was segmented into five states based on movement excursions: (i) the rest position; (ii) the preparation state, with hands moving from the rest position to the position where the stroke happens; (iii) a brief pause after the preparation state or the stroke, during which the hands maintain their configuration and position; (iv) the stroke that expresses the semantic content; (v) the retraction of hands from the stroke position to the rest position. Frame labels are plotted in Fig. 5. The five states are numbered as states 1–5, respectively. Given these labels, Madeo et al. (2013) focused on segmenting the video into the rest position and the gesture unit.

Figure 5 clearly illustrates the non-Gaussianity and heavy tails of the individual time series. We therefore employ the proposed Dirichlet-t HMM graphical model in an effort to capture contemporaneous relationships among the time series, which may change over time, while accounting for the spikes in the data. Notably, our simulation analysis demonstrates that this model is significantly more robust to outliers and contaminated data than Gaussian, classical-t, or frequentist alternatives. Our analysis does not make use of the phases defined by excursions and semantic meaning; our interest is in whether the inferred hidden states recover any of that information.

4.2 Parameter settings

We set the Dirichlet-t hyperparameters as follows: \(\pi = 3/(p-1)\), \(v_0 = 0.02\), \(v_1 = 1\), \(\uplambda = 1\), \(\nu = 3\), \(a_{\alpha } = 1\), \(b_{\alpha } = 1\), and \(a_{rs} = 1\). These choices are discussed and evaluated in the simulation study. In addition, we note that state estimation and graph estimation are not overly sensitive to variations in \(\pi \in \{2/(p-1), 3/(p-1), 4/(p-1), 5/(p-1)\}\). We use \(K=6\) clusters as a truncation for the stick-breaking process; fewer than 1% of observations belong to the 6th cluster, which suggests that this truncation level is reasonable. The number of hidden states S is selected based on WAIC. The WAIC values for \(S = 2, 3, 4, 5\) are 235,620, 315,190, 325,160 and 308,110, respectively. We therefore set \(S = 2\), which has the smallest WAIC. When S is increased to \(S=3\), one of the states is roughly divided into two states while the other state is unchanged. The results are similar for \(S=4\) and \(S=5\). The MCMC algorithm was run for 8000 iterations with a burn-in of 8000 and required about 165 min on a 1.80 GHz Intel Core i7. Trace plots (not shown) demonstrated good mixing. Model convergence was verified via the Geweke test on the number of edges in each state with a significance level of 1%.

4.3 Results

For each state, we estimate the graph using the posterior median graph, in which an edge is included if and only if its posterior inclusion probability exceeds 0.5. The edge densities in the two states are 0.14 and 0.15, which suggests that a sparse graphical model is appropriate. The non-zero elements of the state-specific precision matrices are estimated using the posterior means of the respective elements in \({\varvec{\Omega }}_{s}\), while zeros are assigned according to the posterior median graph. Partial correlations based on the state-specific sparse estimated precision matrices are shown in Fig. 6. The figure highlights the positive partial correlations (blue), the negative partial correlations (red), the strength of each association (line thickness), and whether or not an edge is unique to a state (dotted line).
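The mapping from an estimated precision matrix to the plotted partial correlations, with the posterior-median-graph thresholding applied, can be sketched as follows (Python; the function names are ours, and `ppi` denotes a hypothetical matrix of posterior edge inclusion probabilities):

```python
import numpy as np

def partial_correlations(Omega):
    """Partial correlations implied by a precision matrix Omega:
    rho_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj), with unit diagonal."""
    d = np.sqrt(np.diag(Omega))
    rho = -Omega / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho

def sparsify(rho, ppi, threshold=0.5):
    """Zero out entries whose posterior inclusion probability does not
    exceed the threshold, i.e. keep only the posterior median graph."""
    keep = ppi > threshold
    np.fill_diagonal(keep, True)
    return np.where(keep, rho, 0.0)

Omega = np.array([[1.0, -0.5], [-0.5, 1.0]])
rho = partial_correlations(Omega)   # off-diagonal entries equal 0.5
```

A negative off-diagonal precision entry thus corresponds to a positive partial correlation (blue edge), and vice versa (red edge).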

Fig. 6 Posterior partial correlations among the movement measures of body parts in each hidden state. Blue lines indicate positive correlations and red lines indicate negative correlations; line thickness represents the correlation strength; a dotted line indicates that the edge is unique to the state. Top row: proposed Dirichlet-t HMM graphical model; bottom row: Gaussian graphical model with fixed states. The full names of the variables are listed in the “Appendix”

The two estimated graphs from the proposed Dirichlet-t HMM graphical model share 55 edges. Most of these common edges can be summarized as follows: (i) at the same point of interest, the vectorial velocity and acceleration on the same coordinate (x, y, z) are positively correlated; (ii) on each side (left/right) and each coordinate, the velocities of the hand and the wrist are positively correlated; (iii) on each side and each coordinate, the accelerations of the hand and the wrist are positively correlated; (iv) on each side and each coordinate, the velocity of the hand and the acceleration of the wrist are negatively correlated; (v) on each side and each coordinate, the velocity of the wrist and the acceleration of the hand are negatively correlated; (vi) the previous five kinds of connections among vectorial measurements also exist among the scalar velocities and accelerations. These six types of edges are well expected given their physical interpretation. Our Dirichlet-t graphical model recovered all of them in both states, except for one missing edge (25–31) in state 1. In addition, the lower right part of the graph is disconnected from the rest of the graph, indicating that the scalar velocities and accelerations are conditionally independent of their vectorial decompositions along the three axes. In human motion data analysis, similar correlation patterns are commonly assumed based on the kinematic chain structure of the human body (Han et al. 2006; Sigal et al. 2012); other studies build models to learn such networks from body motion data (Xu et al. 2022; Song et al. 2003; Zhang et al. 2021). In conclusion, our proposed model successfully recovered true edges from highly noisy observations without any prior domain knowledge.

For further evidence, we compared the graphs estimated with the Dirichlet-t graphical model to those obtained with a non-robust Gaussian graphical model, for which we fixed the hidden states to the posterior most probable states from the dynamic Dirichlet-t graphical model and fitted Gaussian graphical models independently in each state. The posterior graph estimates are shown in Fig. 6. The resulting graphs are much sparser than those from the Dirichlet-t model, and in particular they omit many strong and well-expected edges that were detected by the Dirichlet-t model. The state 1 graphs from the two models are more similar to each other, which is reasonable because the data are less noisy in the resting phase (Madeo et al. 2013). For further comparison, we estimated the fused graphical lasso using the estimated states from the Dirichlet-t model. The AIC-selected tuning parameters produce a dense graph. By further tuning these parameters to obtain sparser graphs, we find that the estimated graphs recover some of the strong positive connections that are also present in the Dirichlet-t results, but otherwise omit, or estimate only weakly, many well-expected connections. Further details are in the “Appendix”.

Next, we examine the hidden states estimated by our model. The posterior most probable state across time is shown in Fig. 5. Notably, state 1 in our result mostly matches the resting period provided by the segmentation of the original data, while state 2 includes the gesture unit. Although the original segmentation is based solely on semantics and the locations of the hands, it is interesting to see that it corresponds with the coordination dynamics between body parts. For comparison, we fixed the hidden states to the phases given by Madeo et al. (2013) and fit Dirichlet-t graphical models independently within each state. The partial correlations are almost the same between the major states, with the six types of common edges dominating the graphs (results not shown).

Comparing the two state-specific graphs in Fig. 6, we note that, in general, the edges in state 1 are weaker than those in state 2. In other words, the conditional correlations between movement measurements are stronger when the storyteller purposefully moves their hands instead of resting. Moreover, some strong edges exist only in state 2. While most of the shared edges connect vectorial measurements on the same coordinate (x, y, z), the edges unique to state 2, shown as dotted lines, connect measurements on different coordinates. For example, there are several positive correlations between the x coordinate and the z coordinate on the left side, which likely indicates that the storyteller moved their left hand back and forth along the x = z direction. Next, the velocity of the left hand on the z coordinate is negatively correlated with the acceleration of the left wrist on the x coordinate, and the acceleration of the left hand on the z coordinate is negatively correlated with the velocity of the left wrist on the x coordinate. These negative connections could be caused by bending the left wrist in a specific direction. Also, according to the posterior graphs, the left side and the right side are not completely independent: we discover edges connecting the left hand and the right wrist, the left wrist and the right hand, etc.

5 Conclusion

We have introduced a Bayesian graphical modeling framework for studying the conditional dependence in a multivariate time series. The proposed approach incorporates both dynamics via hidden Markov models and robustness for heavy-tailed data using multivariate t-distributions. The proposed models automatically estimate a graph at each time point, identify change points in the graphs, and classify time points into clusters. In addition, our methods include information sharing among the HMM latent states and generate posterior estimates of the state similarity matrix. Within the proposed modeling frameworks, the Dirichlet-t HMM graphical model is the most robust and performs well in simulations with a large proportion of outliers in the data. We have also applied this model to hand gesture tracking data and uncovered meaningful edges and dynamics within different movement measurements, including almost all the edges that should be included based on domain knowledge.

The proposed Bayesian framework—including both the model specification and the MCMC algorithm—is designed for computational efficiency. The central role of the continuous spike-and-slab prior builds upon Wang (2015) to offer greater computational advantages than other Bayesian approaches, especially those based on the G-Wishart prior (e.g., Finegold et al. 2014; Warnick et al. 2018). Our approach is also computationally competitive with frequentist alternatives, which require extensive parameter tuning and can be sensitive to parameter choices. By comparison, the proposed approach is quite robust to the choice of hyper-parameters, and provides superior graph estimation with fewer false positives.

The simulation study uses \(p=20\) variables, which is similar to the dimension of the application. However, the proposed sampling algorithm is scalable to larger p: for example, it takes about 55 min to sample \({\varvec{\Omega }}\) and \({\varvec{G}}\) for 1000 iterations when \(p = 100\) and \(S = 4\), which is much faster than the popular G-Wishart prior when four graphs are estimated. In particular, compared to the sampling algorithm of the full Dirichlet process with a rejection sampling scheme (Finegold et al. 2014), the truncated sampling algorithm for the stick-breaking process offers large gains in efficiency and can reduce the MCMC computing time by up to 90%. Naturally, larger p places greater demands on the HMM graphical model construction, which requires inference for S distinct \(p \times p\) precision matrices. Our approach borrows information across these states via a linked prior, which mitigates but cannot eliminate these challenges.

We acknowledge that HMMs are identifiable only up to permutations of the labels, which can cause problematic label switching (Scott 2002). In our simulation and application studies, we did not observe the label switching problem, likely because of the small number of states. When label switching is present, there are many MCMC post-processing strategies available that may be applied to our models and computations (Jasra et al. 2005; Sperrin et al. 2010; Rodríguez and Walker 2014).

There are many promising extensions. First, the models proposed in this paper assume that the state transition probabilities are constant across time. In practice, it is reasonable to consider a non-homogeneous HMM in which the state transition probabilities depend on time-varying exogenous factors. In addition, the graph prior (11) uses a Bernoulli distribution with the same parameter \(\pi \) for each edge. This specification could be modified to incorporate more complicated structures, such as state-specific \(\pi \) values (Osborne et al. 2022), covariates in the edge inclusion probabilities (Ni et al. 2021), or alternative graph priors entirely (Tan et al. 2017). Furthermore, the robust models in this paper assume that the scale parameters \(\tau _t\) are independent across time. Alternatively, one could add a hierarchical layer for time dependence among the volatilities within each state. Lastly, it is well known that the Dirichlet process is affected by the “rich-gets-richer” effect and tends to generate highly skewed cluster sizes. To overcome these limitations, one could extend our prior to include more flexible alternatives, such as normalized completely random measures (Cremaschi et al. 2019). Matlab and R code to reproduce the results of this paper is available at https://github.com/chunshanl/dynamic-and-robust-Bayesian-graphical-models.