1 Introduction

Gaussian graphical models are an indispensable tool for studying the relationships among random variables. Under Gaussianity, the precision (or inverse covariance) matrix is intrinsically linked to conditional independence: any pair of random variables is independent given all others if and only if the corresponding element in the precision matrix is zero. The conditional dependence is further encoded by an undirected graph, in which variables are represented by nodes and an edge between nodes designates variables that are not conditionally independent. Therefore, edge selection is equivalent to detection of the nonzero entries in the precision matrix. Crucially, inference of the edge set provides interpretable representations and insightful visualizations of the conditional dependence among variables.

To estimate this edge set, frequentist and Bayesian methods have been designed to estimate a sparse precision matrix. Frequentist approaches commonly employ penalized likelihood estimation, which couples a Gaussian likelihood with an appropriate sparsity penalty (Meinshausen and Bühlmann 2006; Friedman et al. 2008; Fan et al. 2009). Bayesian models can incorporate exact zeros in the precision matrix through careful specification of the prior, such as the hyper-inverse Wishart prior (Roverato 2002) and the G-Wishart prior (Lenkoski and Dobra 2011). Alternatively, Wang (2015) proposed a continuous spike-and-slab prior for greater computational efficiency, which has been generalized for multiple graphs by Shaddox et al. (2018) and Peterson et al. (2020). The continuous spike-and-slab prior has demonstrated excellent empirical performance relative to a variety of frequentist and Bayesian methods.

In this paper, we consider the problem of estimating dynamic graphical models based on time series data. Two significant challenges are present in many time series applications. First, we anticipate that the graph will change over time, yet these change points are not known in advance. Changes in the graph may be gradual, such as the addition or deletion of a small number of edges. However, whole-scale changes in the graph are also possible due to massive structural shifts in the underlying conditions. Second, outliers and heavy tails are often encountered in time series applications, causing the data to deviate substantially from the fundamental Gaussian assumption. Consequently, many existing Gaussian graphical models lack the robustness for reliable graph estimation in this setting.

To address these challenges, we propose dynamic and robust Bayesian graphical models that employ state-of-the-art hidden Markov models (HMMs), to introduce dynamics in the graph, and heavy-tailed multivariate t-distributions, for model robustness. The HMMs define a collection of discrete and unknown states that determine the graphical dependence at any given time. The learned HMM states are often interpretable and can be revisited throughout the time series. Crucially, the latent states are linked both temporally and hierarchically for greater information sharing across time and between states. For model robustness, we consider heavy-tailed alternatives based on the t-distribution. The multivariate and dynamic setting requires careful consideration of an appropriate t-distribution: we implement and evaluate both a classical multivariate t-distribution and a more flexible Dirichlet process alternative. The proposed dynamic and robust Bayesian graphical models are accompanied by a scalable MCMC algorithm for efficient posterior inference with full uncertainty quantification. A thorough simulation study demonstrates the excellent state recovery and graph estimation of the proposed methods relative to competing approaches, with the largest improvements occurring for heavy-tailed and contaminated data. Furthermore, we apply the proposed approach to human hand gesture tracking data, and discover edges and dynamics with clear practical interpretations.

To our knowledge, there is limited research on graphical models that are both dynamic and robust. Time-varying graphs can be estimated using penalized likelihood techniques that incorporate smoothness in the graph over time (Kolar et al. 2010; Gibberd and Nelson 2017; Yang and Peng 2020). Warnick et al. (2018) proposed a fully Bayesian approach using HMMs, but this method is limited in computational scalability and relies on a Gaussian emission distribution. However, robustness is essential: as demonstrated in the simulation study (Sect. 3), graph estimation deteriorates significantly for non-robust methods in the presence of heavy-tailed or contaminated data. The importance of robustness for graphical models has been emphasized in the non-dynamic setting, including deviation-weighted Gaussian likelihoods (Miyamura and Kano 2006; Sun and Li 2012), robust covariance estimators (Gottard and Pacillo 2010), trimmed estimators (Yang and Lozano 2015), and graphical models based on the t-distribution (Finegold and Drton 2011; Finegold et al. 2014). Vinciotti and Hashem (2013) provide a comparison among robust graphical models.

The remainder of the paper is organized as follows. In Sect. 2, we review some of the background literature and describe the proposed dynamic and robust graphical models, i.e., the dynamic classical-t graphical model and the dynamic Dirichlet-t graphical model. We also provide the MCMC algorithms. In Sect. 3, we provide simulation studies, which include comparisons to a frequentist competitor and a sensitivity analysis. In Sect. 4, we apply the proposed methods to the gesture tracking data. We conclude the paper with Sect. 5. The detailed MCMC algorithms are described in the “Appendix”.

2 Methods

2.1 Background

Undirected graphical models learn and express the conditional dependence among p variables \({\varvec{y}} = (y_1,\ldots , y_p)\). Fundamentally, the goal is to discover the set of variables that are conditionally dependent, or equivalently, the complement set of conditionally independent variables. Such dependence can be represented by a graph \(\mathcal {G(V, E)}\) with vertices \({\mathcal {V}} = \{1,\ldots ,p\}\) and undirected edges \({\mathcal {E}} \subset {\mathcal {V}} \times {\mathcal {V}}\).

From a modeling perspective, graphical dependence is intrinsically linked with the Gaussian distribution. When \({\varvec{y}} \sim N_p({\varvec{\mu }},{\varvec{\Omega }}^{-1})\) with mean \({\varvec{\mu }}\) and a positive definite precision matrix \({\varvec{\Omega }}= (\omega _{ij})_{i,j \in \{1,\ldots ,p\}}\), graphical dependence is determined by sparsity of the precision matrix: \(\omega _{i,j} = 0 \iff (i, j) \not \in {\mathcal {E}}\). More specifically, the edge set for Gaussian graphical models is \({\mathcal {E}} = \{(i,j): y_i \not \perp \!\!\! \perp y_j | {\mathcal {V}} /\{i,j\}\}\), so any two vertices ij are not connected if \(y_i\) and \(y_j\) are conditionally independent given the remaining variables. Consequently, estimating the graph \(\mathcal {G(V, E)}\) is equivalent to identifying the nonzero elements of \({\varvec{\Omega }}\). Let \({\mathcal {G}}\) be parameterized by the edge inclusion indicators as \({\varvec{G}} = \{g_{ij}\}_{1 \le i<j \le p}\) for \(g_{ij} \in \{0,1\}\). In Bayesian graphical models, the essential sparsity of the precision matrix is encoded in the prior distributions on \({\varvec{\Omega }}\) and \({\varvec{G}}\). The hyper-inverse Wishart prior (Roverato 2002) and the G-Wishart prior (Lenkoski and Dobra 2011) impose exact zeros. Alternatively, Wang (2015) proposed a continuous spike-and-slab prior for greater computational efficiency. In the sequel, following standard practice, we focus on the setting with \({\varvec{\mu }}= {\varvec{0}}\) and standardize the data in advance of model-fitting. However, generalizations for nonzero mean parameters are straightforward.
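As a concrete illustration (our own sketch, not code from the paper), the equivalence between zeros in \({\varvec{\Omega }}\) and absent edges can be checked numerically through partial correlations, since \(\rho _{ij} = -\omega _{ij}/\sqrt{\omega _{ii}\omega _{jj}}\) vanishes exactly when \(\omega _{ij} = 0\):

```python
import numpy as np

# Illustrative sketch (not from the paper): a sparse 3 x 3 precision matrix
# and the conditional dependence it encodes. A zero off-diagonal entry
# omega_ij means y_i and y_j are conditionally independent given the rest.
Omega = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.4],
                  [0.0, 0.4, 1.0]])

def partial_correlations(Omega):
    """Partial correlations: rho_ij = -omega_ij / sqrt(omega_ii * omega_jj)."""
    d = np.sqrt(np.diag(Omega))
    R = -Omega / np.outer(d, d)
    np.fill_diagonal(R, 1.0)
    return R

R = partial_correlations(Omega)

# Edge set E: pairs (i, j), i < j, with nonzero precision entries.
edges = [(i, j) for i in range(3) for j in range(i + 1, 3) if Omega[i, j] != 0.0]
```

Here the missing edge (1, 3) corresponds exactly to the zero partial correlation between the first and third variables.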

For time-ordered data, the assumption of independent and identically distributed observations is inadequate. One approach to allow for graphical dependence that varies across time, \(t=1,\ldots ,T\), is to introduce latent state variables \(s_t \in \{1,\ldots ,S\}\), where S is the number of states. Each discrete state \(s_t = s\) corresponds to a precision matrix \({\varvec{\Omega }}_s\) and edge inclusions \({\varvec{G}}_s\), which define the conditional dependence among the variables \(\{ {\varvec{y}}_t: s_t = s\}\). Conditional on the states, we can write:

$$\begin{aligned}{}[{\varvec{y}}_t | \{{\varvec{\Omega }}_s\}_{s=1}^S, \{s_t\}_{t=1}^T] {\mathop {\sim }\limits ^{indep}} N_p({\varvec{0}},{\varvec{\Omega }}_{s_t}^{-1}), \quad t=1,\ldots ,T, \end{aligned}$$
(1)

with state-dependent precision matrices. The observations \({\varvec{y}}_t\) remain conditionally independent but are no longer identically distributed given the states. Most important, (1) admits time-varying graphical dependence in \({\varvec{\Omega }}_{s_t}\), but with interpretable restrictions: each \({\varvec{\Omega }}_{s_t}\) belongs to the set of S precision matrices \(\{{\varvec{\Omega }}_s\}_{s=1}^S\) indexed by the dynamic states \(s_t\). Model dynamics are introduced via a discrete time-homogeneous Markov chain on the state variables. Specifically, the state transition probabilities are

$$\begin{aligned} p(s_t | s_{t-1}, \ldots , s_1) = p(s_t | s_{t-1}) = p_{s_{t-1} s_t} \end{aligned}$$
(2)

with \(S \times S\) transition probability matrix \({\varvec{P}} = (p_{rs})_{r,s \in \{1,\ldots , S\}}\). These dynamics imply that the graphical dependence at time t depends on that at time \(t-1\) through the states \(s_t\), which evolve according to the discrete Markov process defined in (2). The initial distribution of the Markov chain is the stationary distribution of the transition matrix \({\varvec{P}}\). The HMM is completed by independent Dirichlet priors on the rows \({\varvec{P}}_{r \cdot } = (p_{r1}, \ldots , p_{r S})\),

$$\begin{aligned} {\varvec{P}}_{r \cdot } {\mathop {\sim }\limits ^{indep}} \text{ Dir }(a_{r1}, \ldots , a_{rS}), \quad r=1,\ldots ,S, \end{aligned}$$
(3)

with concentration parameters \(a_{rs} > 0\).
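The state dynamics in (2)–(3) can be sketched in a few lines (our own illustration; the sizes S and T are chosen arbitrarily, and for simplicity the sketch starts from a uniform initial state rather than the stationary distribution of \({\varvec{P}}\) used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
S, T = 3, 750  # illustrative sizes

# Each row P_r. ~ Dirichlet(a_r1, ..., a_rS) with a_rs = 1, as in (3).
P = rng.dirichlet(np.ones(S), size=S)

def simulate_states(P, T, rng):
    """Simulate s_1, ..., s_T from the Markov chain (2)."""
    S = P.shape[0]
    s = np.empty(T, dtype=int)
    s[0] = rng.integers(S)  # uniform start: a simplification for this sketch
    for t in range(1, T):
        s[t] = rng.choice(S, p=P[s[t - 1]])
    return s

states = simulate_states(P, T, rng)
```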

2.2 Robust and dynamic graphical models

A crucial limitation of Gaussian graphical models is the lack of robustness that accompanies the Gaussian distribution. The presence of outliers or large values in the data violates the Gaussian assumption and, most importantly, can result in substantially less accurate estimates of the graph. For the HMM graphical model (1)–(3), outliers also impact state identification, which disrupts the interpretation of the hidden states \(s_t\) and the accompanying time t precision matrices \({\varvec{\Omega }}_{s_t}\).

To address these challenges, we introduce a robust generalization of the HMM graphical model. Specifically, we relax the Gaussianity assumption in (1) and instead consider robust alternatives based on the multivariate t-distribution. Importantly, the specified t-distributions provide the requisite robustness, yet preserve the modeling structure of the HMM model (1)–(3) and require only minimal modifications to the MCMC algorithm. We consider two variants: a classical multivariate t-distribution and a Dirichlet t-distribution.

The p-dimensional classical multivariate t-distribution is denoted as \({\varvec{y}} \sim t_{p}({\varvec{\mu }}, {\varvec{\Omega }}^{-1}, \nu )\) and has joint probability density function

$$\begin{aligned} f_\nu ( {\varvec{y}} | {\varvec{\mu }}, {\varvec{\Omega }}^{-1} ) = \dfrac{ \Gamma \{(\nu +p)/2\} |{\varvec{\Omega }}|^{1/2} }{ \pi ^{p/2} \nu ^{p/2} \Gamma (\nu /2) } \{1 + \nu ^{-1} ({\varvec{y}} - {\varvec{\mu }})' {\varvec{\Omega }}({\varvec{y}} - {\varvec{\mu }})\} ^{- (\nu +p)/2 }, \end{aligned}$$
(4)

where \(\nu \) is the degrees of freedom, \({\varvec{\mu }}\) is the mean parameter (when \(\nu > 1\)), and \({\varvec{\Omega }}^{-1}\) is the \(p \times p\) scale matrix (Pinheiro et al. 2001). The parameter \(\nu \) indexes the deviation from the multivariate Gaussian distribution: small values of \(\nu \) admit heavy-tailed behavior, while \(\nu \rightarrow \infty \) reverts (4) to a Gaussian distribution. We incorporate the classical multivariate t-distribution as the emission distribution within the HMM graphical model. Using a parameter expansion of the t-distribution (Kotz and Nadarajah 2004), we generalize (1) to

$$\begin{aligned}&[{\varvec{y}}_t | \{{\varvec{\Omega }}_s\}_{s=1}^S, \{s_t\}_{t=1}^T, \{\tau _t\}_{t=1}^T] {\mathop {\sim }\limits ^{indep}} N_p({\varvec{0}},{\varvec{\Omega }}_{s_t}^{-1}/\tau _t), \end{aligned}$$
(5)
$$\begin{aligned}&[\tau _t | \nu ] {\mathop {\sim }\limits ^{iid}} \text{ Gamma }(\nu /2, \nu /2), \quad t=1,\ldots ,T. \end{aligned}$$
(6)

Notably, \(\tau _t\) scales the precision matrix \({\varvec{\Omega }}_{s_t}\) at each time t, which adds a layer of robustness to the model that guards against sensitivity to extreme values of \(|y_{tj}|\). By marginalizing over \(\tau _t\), (5)–(6) implies \([{\varvec{y}}_t | \{{\varvec{\Omega }}_s\}_{s=1}^S, \{s_t\}_{t=1}^T] {\mathop {\sim }\limits ^{indep}} t_p({\varvec{0}},{\varvec{\Omega }}_{s_t}^{-1}, \nu )\) for \(t=1,\ldots ,T\).
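The scale-mixture construction can be made concrete with a short sketch (our own code; the dimensions and parameter values are illustrative). It draws from (5)–(6) and also evaluates the log density (4) directly; as a sanity check, for \(p = 1\) and \(\nu = 1\) the density reduces to a standard Cauchy:

```python
import numpy as np
from math import lgamma, log, pi

def log_t_density(y, mu, Omega, nu):
    """Log of the classical multivariate t density in Eq. (4)."""
    p = len(y)
    d = np.asarray(y) - np.asarray(mu)
    quad = d @ Omega @ d
    _, logdet = np.linalg.slogdet(Omega)
    return (lgamma((nu + p) / 2) - lgamma(nu / 2)
            + 0.5 * logdet - (p / 2) * log(pi * nu)
            - ((nu + p) / 2) * np.log1p(quad / nu))

def draw_classical_t(T, Sigma, nu, rng):
    """Draw T vectors via the parameter expansion (5)-(6):
    tau_t ~ Gamma(nu/2, rate nu/2), then y_t | tau_t ~ N(0, Sigma / tau_t)."""
    p = Sigma.shape[0]
    tau = rng.gamma(shape=nu / 2, scale=2 / nu, size=T)  # NumPy scale = 1/rate
    z = rng.multivariate_normal(np.zeros(p), Sigma, size=T)
    return z / np.sqrt(tau)[:, None]

rng = np.random.default_rng(1)
y = draw_classical_t(1000, np.eye(4), nu=3.0, rng=rng)
```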

Despite the generalization from the Gaussian model (1) to the classical multivariate-t model (5)–(6), the graph \({\mathcal {G}}\) is determined in the same way: \((i, j) \not \in {\mathcal {E}} \iff \omega _{i,j} = 0\). Conditional on the additional parameter \(\{\tau _t\}_{t=1}^T\), (5) is a Gaussian HMM graphical model and therefore inherits the familiar conditional independence interpretation among the components of \({\varvec{y}}_{t}\).

The primary disadvantage of the classical multivariate t-distribution is that each component of \({\varvec{y}}_t\) shares the same scaling \(\tau _t\). However, it is unlikely that all components of \({\varvec{y}}_t\) are outliers at exactly the same set of times t. As the dimension p increases, it becomes more likely that some but not all components of \({\varvec{y}}_t\) are outliers at any given time t. Clearly, the classical multivariate t-distribution is not well-designed for this common occurrence. To address this issue, we generalize (5) as follows:

$$\begin{aligned} \left[ {\varvec{y}}_t | \{{\varvec{\Omega }}_s\}_{s=1}^S, \{s_t\}_{t=1}^T, \{{\varvec{\tau }}_t\}_{t=1}^T\right] {\mathop {\sim }\limits ^{indep}} N_p({\varvec{0}}, \text {diag}( 1/ \sqrt{{\varvec{\tau }}_t}){\varvec{\Omega }}_{s_t}^{-1}\text {diag}( 1/ \sqrt{{\varvec{\tau }}_t})), \end{aligned}$$
(7)

which incorporates both the state-specific precision matrix \({\varvec{\Omega }}_{s_t}\) and a component-specific scaling \({\varvec{\tau }}_t = (\tau _{t1},\ldots , \tau _{tp})\) at each time t. When all components \(\tau _{tj} = \tau _t\) are identical for \(j=1,\ldots ,p\), model (7) simplifies to the classical multivariate t-model in (5). When all components \(\tau _{tj}\) are endowed with independent \(\text{ Gamma }(\nu /2 ,\nu /2)\) priors, Finegold et al. (2014) refer to the resulting marginal distribution as the alternative multivariate t-distribution. Although this specification improves modeling flexibility relative to the classical multivariate t-distribution, it suffers from overparametrization and incurs large computational costs.

As a compromise between the unnecessarily restrictive constraints \(\tau _{tj} = \tau _t\) of (5) and the overparametrization issues of (7) with iid priors on \(\tau _{tj}\), we consider a cluster-based approach using Dirichlet processes and cluster the p elements in \({\varvec{\tau }}_t\) at each time t separately. Clustering produces a more parsimonious parametrization and encourages information-sharing among the components of \({\varvec{\tau }}_t\), yet accommodates distinct scalings \(\tau _{tj}\) for any component j as needed. In particular, the Dirichlet-t model imposes the following prior on \({\varvec{\tau }}_t\):

$$\begin{aligned} \begin{aligned}&\tau _{tj} \overset{iid}{\sim }\ P_t, \quad j=1,\ldots ,p,\\&P_t \overset{iid}{\sim }\ DP(\alpha , P_0),\quad t=1,\ldots ,T,\\&P_0 = \text {Gamma}(\nu /2,\nu /2),\\&\alpha \sim \text {Gamma}(a_\alpha , b_\alpha ), \end{aligned} \end{aligned}$$
(8)

where \(DP(\alpha , P_0)\) denotes a Dirichlet process with concentration \(\alpha > 0\) and base measure \(P_0\). The prior (8) uses a Gamma distribution for the base measure \(P_0\) akin to (6), and is combined with the conditional likelihood (7). The concentration parameter \(\alpha \) determines the clustering behavior: as \(\alpha \rightarrow 0\), (8) converges to the classical multivariate t-distribution, while \(\alpha \rightarrow \infty \) produces the alternative t-distribution. Importantly, (8) includes a prior on \(\alpha \) in order to learn this key parameter. For a review of Dirichlet processes, see Neal (2000).
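A hedged sketch of one draw of \({\varvec{\tau }}_t\) under the Dirichlet-t prior (8), using a truncated stick-breaking representation of the DP (our own code; the truncation level K and the values of \(\alpha \) and \(\nu \) are illustrative choices):

```python
import numpy as np

def draw_tau_dp(p, K, alpha, nu, rng):
    """One draw of tau_t = (tau_t1, ..., tau_tp) under prior (8),
    via stick-breaking truncated at K atoms."""
    # Stick-breaking weights: w_k = v_k * prod_{l<k} (1 - v_l), v_k ~ Beta(1, alpha).
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0  # close the stick at the truncation level
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    atoms = rng.gamma(nu / 2, 2 / nu, size=K)  # atoms from P0 = Gamma(nu/2, nu/2)
    z = rng.choice(K, size=p, p=w)  # cluster indicator for each component
    return atoms[z], z

rng = np.random.default_rng(2)
tau_t, z_t = draw_tau_dp(p=20, K=10, alpha=1.0, nu=3.0, rng=rng)
```

Small \(\alpha \) concentrates all p components on a single atom (recovering the common scaling of the classical-t), while large \(\alpha \) tends toward p distinct atoms (the alternative t).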

2.3 Prior on multiple precision matrices

Although the states s induce distinct conditional dependence relations via \({\varvec{\Omega }}_s\), it is likely that the set of precision matrices \(\{{\varvec{\Omega }}_s\}_{s=1}^S\) will exhibit some common features. For example, some conditional correlations may be state-specific, while others may persist across some or all states. To incorporate these common features, we adopt a prior construction used in Peterson et al. (2020) as an extension of the continuous spike-and-slab prior of Wang (2015) to multiple graphs. This prior is written as:

$$\begin{aligned} p({\varvec{\Omega }}_1,\ldots ,{\varvec{\Omega }}_S|\{{\varvec{G}}_s\}_{s=1}^S, {\varvec{\Theta }}, v_0, v_1, \uplambda ) \propto \prod _{i<j} N_S ({\varvec{\omega }}_{ij}| {\varvec{0}}, {\varvec{\Theta }}_{ij}) \cdot \prod _{i} \prod _{s} \text{ Exp }(\omega _{s,ii}| \uplambda /2) \cdot {\varvec{1}}_{{\varvec{\Omega }}_1, \ldots , {\varvec{\Omega }}_S \in M^{+}}, \end{aligned}$$
(9)

where \({\varvec{\omega }}_{ij} = (\omega _{1,ij}, \omega _{2,ij},\ldots , \omega _{S,ij} )\), with \(\omega _{s,ij}\) the (i,j)th element in \({\varvec{\Omega }}_s\), \({\varvec{\Theta }}_{ij}\) is the covariance matrix, \(\text {Exp}(\omega _{s,ii}| \uplambda /2)\) denotes the exponential distribution with rate \(\uplambda /2\), \({\varvec{1}}\) is the indicator function, and \(M^+\) is the space of \(p\times p\) positive definite matrices. The vector \({\varvec{\omega }}_{ij}\) encodes the connection between vertex i and vertex j for each state \(s=1,\ldots ,S\). Dependence among the elements of \({\varvec{\omega }}_{ij}\) is achieved through the covariance matrix \({\varvec{\Theta }}_{ij}\), which is decomposed into standard deviations \({\varvec{v}}_{ij} = \{v_{g_{s,ij}}\}_{s=1}^{S}\) and the \(S \times S\) interstate correlation matrix \({\varvec{\Phi }}\):

$$\begin{aligned} {\varvec{\Theta }}_{ij} = \text {diag}({\varvec{v}}_{ij}) \cdot {\varvec{\Phi }}\cdot \text {diag}({\varvec{v}}_{ij}), \end{aligned}$$
(10)

with standard deviations \(v_{g_{s,ij}}\) such that \(v_{g_{s, ij}} = v_0\) if \(g_{s, ij} = 0\) and \(v_{g_{s, ij}} = v_1\) if \(g_{s, ij} = 1\), respectively for each \(\omega _{s,ij}\). The prior induces a continuous spike around zero by specifying the standard deviation \(v_0\) to be small, while \(v_1\) is chosen to be large for the diffuse slab component. The edge indicators \(g_{s,ij}\) determine whether each \(\omega _{s,ij}\) belongs to the spike component (\(g_{s,ij} = 0\)) or the slab component (\(g_{s,ij} = 1\)). Setting \({\varvec{\Phi }}= {\varvec{I}}_S\) results in independent priors. The linked spike-and-slab prior is completed with a prior on the state-specific edge indicators \(\{{\varvec{G}}_s\}_{s=1}^S\),

$$\begin{aligned} p({\varvec{G}}_1,\ldots ,{\varvec{G}}_S | {\varvec{\Phi }}, v_0, v_1, \uplambda , \pi ) \propto \prod _{s} \prod _{i<j} \{ \pi ^{g_{s,ij}} (1-\pi )^{1-g_{s,ij}} \} \end{aligned}$$
(11)

and the correlation matrix \({\varvec{\Phi }}\),

$$\begin{aligned} p({\varvec{\Phi }}) \propto {\varvec{1}}_{{\varvec{\Phi }}\in \mathcal { R^{+}}}, \end{aligned}$$
(12)

where \({\mathcal {R}}^+\) is the set of positive definite matrices with ones on the diagonal. As noted by Wang (2015) and Peterson et al. (2020), the prior (11) is analytically defined only up to a normalizing constant which is proportional to the unknown normalizing constant of prior (9), and therefore cancels out in the joint prior \([{\varvec{\Omega }}_s, {\varvec{G}}_s]\). The correlation matrix \({\varvec{\Phi }}\) appears in the normalizing constant of (11), so the joint prior (11)–(12) is no longer exactly uniform on \({\varvec{\Phi }}\) in \({\mathcal {R}}^+\). However, as shown by Wang (2015), the effect of these unknown normalizing constants on the posterior inference is extremely mild, and the parameters \(\pi ,v_0\) and \(v_1\) can be easily calibrated to achieve a pre-specified level of sparsity. Our sensitivity analyses confirm those findings (see Sect. 3.4).
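The covariance construction (10) for a single vertex pair is simple to sketch (our own illustration; the values of \(v_0\) and \(h = v_1/v_0\) echo the settings discussed in Sect. 3.2, and the interstate correlation matrix \({\varvec{\Phi }}\) is an arbitrary positive definite example):

```python
import numpy as np

# Spike/slab standard deviations: v0 small (spike), v1 = h * v0 large (slab).
v0, h = 0.02, 50
v1 = h * v0

# Edge indicators g_{s,ij} for one pair (i, j) across S = 3 states:
# edge absent in state 1, present in states 2 and 3.
g_ij = np.array([0, 1, 1])
v_ij = np.where(g_ij == 1, v1, v0)

# Illustrative interstate correlation matrix Phi (positive definite,
# unit diagonal).
Phi = np.array([[1.0, 0.3, 0.2],
                [0.3, 1.0, 0.4],
                [0.2, 0.4, 1.0]])

# Eq. (10): Theta_ij = diag(v_ij) * Phi * diag(v_ij).
Theta_ij = np.diag(v_ij) @ Phi @ np.diag(v_ij)
```

The spike entry (state 1) has tiny prior variance \(v_0^2\), while the slab entries (states 2 and 3) have variance \(v_1^2\) and are correlated through \({\varvec{\Phi }}\).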

Our proposed robust and dynamic approach offers several crucial advantages over existing methods for graphical model estimation. With respect to the robust Dirichlet-t model of Finegold et al. (2014), our model incorporates graph dynamics via HMMs, and further encourages information sharing among the state-specific graphs using linked priors. Furthermore, Finegold et al. (2014) used a hyper-inverse Wishart prior that requires decomposable graphs; our approach does not impose such a restriction. Our strategy to incorporate dynamics via HMM is similar to Warnick et al. (2018). Their method, however, is limited in computational scalability, as it uses G-Wishart distributions, and relies on a Gaussian emission distribution. Lastly, our algorithmic implementation of (8) uses a new truncation strategy that dramatically improves computational efficiency (see the “Appendix”). In conjunction, these features provide both methodological and computational advances.

2.4 MCMC algorithm for posterior inference

We design an efficient Markov chain Monte Carlo (MCMC) algorithm to sample the model parameters from their joint posterior distribution. In the dynamic Gaussian graphical model, the unknown parameters are \(\{\{{\varvec{G}}_s\}_{s = 1}^S\), \(\{{\varvec{\Omega }}_s\}_{s = 1}^S\), \({\varvec{\Phi }}\), \(\{s_t\}_{t = 1}^T, {\varvec{P}}\}\). The robust dynamic classical-t graphical model adds the scaling parameters \(\{\tau _t\}_{t = 1}^T\), while the dynamic Dirichlet-t graphical model further includes \(\{{\varvec{\tau }}_t\}_{t = 1}^T\) and \(\alpha \). A generic iteration of the MCMC algorithm uses the steps described below. Details are reported in the “Appendix”.

  • Sample \({\varvec{\Omega }}\) and \({\varvec{G}}\): For each \(s \in \{1,\ldots ,S\}\), we first update the precision matrix \({{\varvec{\Omega }}}_{s}\) using a block Gibbs sampler with closed-form conditional distributions for each column. The sampler automatically guarantees positive definiteness of the precision matrix, as shown in the “Appendix”. Then we update \({\varvec{G}}_s\) by drawing each edge from a Bernoulli distribution.

  • Sample \({\varvec{\Phi }}\): This Metropolis-within-Gibbs step samples the entire matrix at once using a parameter expansion method (Liu and Daniels 2006).

  • Sample \({\varvec{P}}\) and \(s_t\): This Metropolis-Hastings step first samples the transition probability matrix \({\varvec{P}}\), by row from a Dirichlet distribution, and then the hidden states \(s_t\), \(t = 1,\ldots ,T\), with Forward-Backward sampling (Scott 2002).

  • Sample \({\varvec{\tau }}\) (classical-t): Sample each \(\tau _t, t=1,\ldots ,T\) from its full conditional distribution: \(\text {Gamma}(\nu /2 + p/2, \nu /2 + {\varvec{y}}_t' {\varvec{\Omega }}_{s_t} {\varvec{y}}_t/2)\).

  • Sample \({\varvec{\tau }}\) and \({\varvec{\alpha }}\) (Dirichlet-t): We use a truncated stick-breaking representation of the Dirichlet process (Ishwaran and James 2001). Given a truncation level K, let \({\varvec{Z}}\) be a \(T \times p\) matrix of cluster indicators, i.e., \(z_{tj} = k\) if \(\tau _{tj}\) is in cluster k, \(k = 1,\ldots ,K\), and let \({\varvec{\eta }}_t = (\eta _{t1},\ldots ,\eta _{t K})\) denote the unique cluster values of \({\varvec{\tau }}_t\). First, we sample \(z_{tj}\) given \({\varvec{\tau }}\) and other parameters from its multinomial posterior distribution. Second, we sample the unique cluster values \(\eta _{tk}\) using the rejection algorithm from Finegold et al. (2014). Lastly, we sample \(\alpha \) from a conditionally conjugate Gamma distribution.
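For instance, the classical-t scaling update above is a standard conjugate Gamma draw. A minimal sketch (our own code; note that the paper states the full conditional in shape-rate form, whereas NumPy parameterizes the Gamma by scale = 1/rate):

```python
import numpy as np

def sample_tau(y_t, Omega_st, nu, rng):
    """Draw tau_t from Gamma(nu/2 + p/2, rate = nu/2 + y_t' Omega_{s_t} y_t / 2)."""
    p = len(y_t)
    shape = nu / 2 + p / 2
    rate = nu / 2 + y_t @ Omega_st @ y_t / 2
    return rng.gamma(shape, 1.0 / rate)  # NumPy uses scale = 1/rate

rng = np.random.default_rng(3)
p, nu = 5, 3.0
y_t = rng.standard_normal(p)
tau_t = sample_tau(y_t, np.eye(p), nu, rng)
```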

For posterior inference, we are interested in the detection of the hidden states \(s_t\), for \(t=1,\ldots , T\), and the estimation of the corresponding graphs \({\varvec{G}}_s\), for \(s=1, \ldots , S\). We estimate the hidden states by computing the proportion of MCMC samples for which a time point is classified in each of the S states, and then assigning the most probable state. Corresponding graphs are then estimated by computing marginal probabilities of edge inclusion as proportions of MCMC iterations in which an edge was included. More precisely, for each \(s=1, \ldots , S\), the posterior probability \(p(g_{s,ij}=1|\text {data})\) is estimated as the proportion of iterations that \(g_{s,ij}=1\). Median graphs are obtained by thresholding the posterior probabilities at 0.5. Given the graphs, estimates of the corresponding precision matrices can be obtained by averaging the sampled MCMC values. The MCMC output also generates an estimate of the correlation matrix \({\varvec{\Phi }}\) that describes the pairwise similarity between states.
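The posterior summaries described above reduce to simple proportions over MCMC draws; a sketch with stand-in output (the array names and dimensions are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in MCMC output: sampled edge indicators g_{s,ij} for one state,
# shape (number of draws) x (number of candidate edges).
n_draws, n_edges = 200, 6
G_draws = rng.integers(0, 2, size=(n_draws, n_edges))

# Marginal posterior probabilities of edge inclusion, estimated as the
# proportion of iterations with g_{s,ij} = 1, and the median graph
# obtained by thresholding at 0.5.
edge_probs = G_draws.mean(axis=0)
median_graph = (edge_probs > 0.5).astype(int)
```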

3 Simulation study

We use simulated data to assess the performance of the proposed models and compare results to existing approaches. We consider the proposed dynamic and robust models (Sect. 2.2), i.e. the dynamic classical-t graphical model (dCT) and the dynamic Dirichlet-t graphical model (dDT), against the dynamic Gaussian graphical model (dGM). We also consider a frequentist competitor, i.e. the fused graphical lasso (fLasso; Danaher et al. 2014), which is a penalized estimation technique that encourages sparse graphs with similarities across known groups. For these groups, we supply the fLasso with the true hidden states, which ensures that the fLasso is maximally competitive. The sparsity tuning parameter \(\uplambda _1\) and the similarity tuning parameter \(\uplambda _2\) are selected from a grid of values using AIC, as done in Danaher et al. (2014). The “Appendix” provides more details on the selection of \(\uplambda _1\) and \(\uplambda _2\).

3.1 Data generation

Synthetic datasets with \(p = 20\), \(S = 3\) and \(T = 750\) are simulated as follows. First, we construct the sparse precision matrices \(\{{\varvec{\Omega }}_s\}_{s = 1,\ldots ,S}\). We set \({\varvec{\Omega }}_1\) to be an AR(2) structure, with diagonal elements set to 1 and off-diagonal elements set to 0.5 and 0.4 on the first and second diagonal, respectively. Each subsequent \({\varvec{\Omega }}_s\), for \(s=2,\ldots ,S\), is generated from \({\varvec{\Omega }}_{s-1}\) following these steps: we randomly delete 10 edges by setting the corresponding entries in \({\varvec{\Omega }}_{s-1}\) to zero; we add 10 edges in random locations with values from \(\text {Uniform}((-0.6,-0.4)\cup (0.4,0.6))\); and we adjust \({\varvec{\Omega }}_s\) to be positive definite using Peng et al. (2009). The resulting graphs are shown in Fig. 1. Graphs \({\varvec{G}}_1\) and \({\varvec{G}}_2\) share 89% of edges, graphs \({\varvec{G}}_2\) and \({\varvec{G}}_3\) share 89% of edges, and graphs \({\varvec{G}}_1\) and \({\varvec{G}}_3\) share 84% of edges. The sample correlations \(\phi _{ss'}\) between the entries of \({\varvec{\Omega }}_s\) and \({\varvec{\Omega }}_{s'}\) are \(\phi _{12} = 0.67\), \(\phi _{23}=0.68\), and \(\phi _{13} = 0.56\). To generate the hidden states, we divide the time period [1, T] into 15 blocks of equal length and randomly assign each block to one of the S states, which results in states of different lengths. The precision matrices and hidden states are generated once and then fixed for all simulated datasets.
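The AR(2)-structured \({\varvec{\Omega }}_1\) described above can be constructed directly; a sketch (our own code, covering only the first state; the subsequent edge perturbations and the positive definiteness adjustment of Peng et al. 2009 are omitted):

```python
import numpy as np

# Omega_1: banded AR(2) structure with ones on the diagonal, 0.5 on the
# first off-diagonals, and 0.4 on the second off-diagonals.
p = 20
Omega1 = (np.eye(p)
          + 0.5 * (np.eye(p, k=1) + np.eye(p, k=-1))
          + 0.4 * (np.eye(p, k=2) + np.eye(p, k=-2)))
```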

Fig. 1 True edges in each state \(s=1,2,3\) for the simulated data (\(p = 20\))

Fig. 2 Simulated data for the classical-t (top) and highly contaminated (middle) scenarios along with the true hidden states (bottom) over time. In the top 2 plots, each colored line represents one time series

Given the precision matrices and the hidden states, we simulate three types of data, with examples illustrated in Fig. 2:

  1. Classical-t data: \([{\varvec{y}}_t | \{{\varvec{\Omega }}_s\}, \{s_t\}] {\mathop {\sim }\limits ^{indep}} t_{p}({\varvec{\mu }}= {\varvec{0}}, {\varvec{\Omega }}^{-1} = {\varvec{\Omega }}_{s_t}^{-1}, \nu = 3)\) for \(t=1,\ldots ,T\).

  2. Slightly contaminated data: first, we simulate \([{\varvec{y}}_t | \{{\varvec{\Omega }}_s\}, \{s_t\}] {\mathop {\sim }\limits ^{indep}} N_p({\varvec{\mu }}= {\varvec{0}}, {\varvec{\Omega }}^{-1} = {\varvec{\Omega }}_{s_t}^{-1})\) for \(t=1,\ldots ,T\); then, for each time series \(\{y_{tj}\}_{t=1}^T\), \(j=1,\ldots ,p\), we randomly select 5 time points and add independent noise generated from N(0, 100).

  3. Highly contaminated data: similar to the slightly contaminated data, except that we randomly select 10 time points for each time series to add Gaussian noise.
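The contamination mechanism in scenarios 2 and 3 can be sketched as follows (our own code; an identity covariance stands in for the state-dependent \({\varvec{\Omega }}_{s_t}^{-1}\), and n_contam = 5 or 10 selects the slight or high contamination level):

```python
import numpy as np

rng = np.random.default_rng(5)
T, p = 750, 20
n_contam = 10  # 5 for "slightly", 10 for "highly" contaminated

# Baseline Gaussian data; the identity stands in for the state-dependent
# covariance Omega_{s_t}^{-1} of the actual simulation design.
Y = rng.multivariate_normal(np.zeros(p), np.eye(p), size=T)

# For each component series, add N(0, 100) noise (sd = 10) at randomly
# chosen time points.
for j in range(p):
    idx = rng.choice(T, size=n_contam, replace=False)
    Y[idx, j] += rng.normal(0.0, 10.0, size=n_contam)
```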

3.2 Parameter settings

For the hyperparameters in the joint prior (9) on \(\{{\varvec{\Omega }}_s\}\), we follow the guidelines in Wang (2015) and set \(\uplambda = 1\), \(v_0 = 0.02\), \(h = 50\) (\(v_1 = h \cdot v_0\)) and \(\pi = 3/(p-1)\). A sensitivity analysis to these choices is presented in Sect. 3.4. The HMMs use \(S=3\) hidden states; details on selecting S for real data are provided in Sect. 4. The Dirichlet weights in (3) are set to \(a_{rs} = 1\) for all \(r,s = 1,\ldots , S\), and we fix the classical-t and Dirichlet-t parameters at \(\nu =3\) and \(a_{\alpha } = b_{\alpha } = 1\) in (8). To initialize the model, the states \(s_t\) are constructed by dividing the time series equally into S parts. Given the states, initial values of each \({\varvec{\Omega }}^{-1}_s\) are computed using a robust covariance estimator (Gottard and Pacillo 2010).

Results reported below for the dynamic Gaussian model and the dynamic classical-t model were obtained by running MCMC chains for 10,000 total iterations with a burn-in of 2000 iterations. The dynamic Dirichlet-t model converged much faster and was run for a total of 1500 iterations with a burn-in of 500 iterations. The Gaussian and classical-t models took on average about 3.2 min to run (19 s per 1000 iterations) and the Dirichlet-t model about 12.5 min (500 s per 1000 iterations), on a laptop computer with 1.80 GHz Intel Core i7. MCMC convergence was checked by visually inspecting the trace plots of the number of included edges in each state across iterations and those of the hidden states indicators (plots omitted for the simulation study). The Geweke diagnostic test (Geweke et al. 1991) was used to test for convergence of the number of observations in each state with a significance level of 1%. Notably, since there are many outliers, the Gaussian model never converged, even for substantially longer MCMC chains. Similarly, the classical-t model showed some convergence issues, while the Dirichlet-t model converged in almost all simulation replicates for all three scenarios. Similar behaviors were observed by Finegold et al. (2014).

Table 1 Performance comparisons for hidden state estimation over the three hidden states, for the three simulation scenarios

3.3 Results

We assess performance on edge selection and hidden states estimation using the true positive rate (TPR), the false positive rate (FPR) and the Matthews correlation coefficient (MCC). The MCC is defined as

$$\begin{aligned} MCC = \frac{TP \times TN - FP \times FN}{ \sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}} \end{aligned}$$

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. The MCC provides a measure of overall classification success for a selected model, with larger values indicating better performance. We also provide ROC curves, computed by varying the threshold on the marginal posterior probabilities of edge inclusion from 0 to 1. Finally, estimation of the precision matrices is evaluated based on the Frobenius loss:

$$\begin{aligned} \frac{1}{S}\sum _{s=1}^S || \hat{{\varvec{\Omega }}}_{s} - {\varvec{\Omega }}^*_{s} ||^2_F / || {\varvec{\Omega }}^*_{s} ||^2_F, \end{aligned}$$
(13)

where \({\varvec{\Omega }}_s^*\) is the true precision matrix in state s and \(\hat{{\varvec{\Omega }}}_s\) is the posterior mean.
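For concreteness, the edge-selection metrics and the loss in (13) amount to the following computations (a small Python sketch; the function names `edge_metrics` and `frobenius_loss` are ours):

```python
import numpy as np

def edge_metrics(true_adj, est_adj):
    """TPR, FPR and MCC for edge selection, comparing the upper
    triangles of a true and an estimated adjacency matrix."""
    iu = np.triu_indices_from(np.asarray(true_adj), k=1)
    t = np.asarray(true_adj)[iu].astype(bool)
    e = np.asarray(est_adj)[iu].astype(bool)
    tp = np.sum(t & e)
    tn = np.sum(~t & ~e)
    fp = np.sum(~t & e)
    fn = np.sum(t & ~e)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return tpr, fpr, mcc

def frobenius_loss(Omega_hat, Omega_true):
    """Average relative Frobenius loss of Eq. (13): for each state,
    ||Omega_hat - Omega_true||_F^2 / ||Omega_true||_F^2, averaged over S."""
    S = len(Omega_true)
    return sum(
        np.linalg.norm(oh - ot, "fro") ** 2 / np.linalg.norm(ot, "fro") ** 2
        for oh, ot in zip(Omega_hat, Omega_true)
    ) / S

truth = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # true edges (1,2) and (2,3)
est   = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]   # estimated edges (1,2) and (1,3)
tpr, fpr, mcc = edge_metrics(truth, est)    # -> 0.5, 1.0, -0.5
```

ROC curves follow by sweeping the threshold on the posterior edge inclusion probabilities from 0 to 1 and recording (FPR, TPR) for the resulting adjacency matrices.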

Table 1 summarizes hidden state estimation under the three simulation designs, in each case averaged over 25 simulated datasets. The estimated state at each time point is the posterior most probable state. The hidden states estimated by the Gaussian model are driven by outliers in the data and do not reflect the true hidden states, while the dynamic Dirichlet-t graphical model performs well, recovering the hidden states accurately in all three scenarios. As expected, the classical-t model performs well in the classical-t scenario, but its performance deteriorates as the number of contaminated data points increases.

Next, we evaluate graph estimation for the proposed methods and include comparisons to fLasso (Danaher et al. 2014). Figure 3 shows the edge selection accuracy and Frobenius loss across all 25 simulated datasets for each design, Table 2 reports these results averaged over all 25 replicates, and Fig. 4 plots the ROC curves obtained by varying the posterior probability of inclusion threshold (for fLasso, we compute AUCs over a grid of tuning parameters and keep the best one; more details are provided in the “Appendix”). As expected, the dynamic Gaussian graphical model is inferior to the robust alternatives in all three scenarios. The Dirichlet-t and classical-t models perform similarly for the classical-t data, while the Dirichlet-t model substantially outperforms the classical-t model in the contaminated scenarios, especially as the number of contaminated data points increases. These results are consistent across all five performance measures. Most notably, the proposed approaches vastly outperform the fLasso on both edge selection and estimation, despite the fact that fLasso assumes knowledge of the true hidden states. The FPRs for the fLasso are exceedingly high, which suggests that the fLasso produces unnecessarily dense graphs in the presence of heavy-tailed or contaminated data. Yet, as demonstrated by the ROC curves in Fig. 4, varying the fLasso tuning parameter does not resolve this issue.

In addition, we fit the three dynamic Bayesian models with \(S = 2\) and \(S = 4\) on the two contaminated datasets. The hidden state estimation for the Dirichlet-t model remains superior to both the Gaussian model and the classical-t model (results not shown). For the Dirichlet-t model, the 4th state collapses into an empty state when \(S = 4\), and states 2 and 3 are combined into one state when \(S = 2\). We also evaluate WAIC for different choices of S. Unlike other information criteria (e.g., AIC, BIC, and DIC), WAIC does not rely on a single point estimate of the model parameters and instead utilizes the joint posterior distribution for model assessment (Gelman et al. 1995). The complexity penalty is defined as the variance of the log-likelihoods across MCMC iterations, as recommended in Gelman et al. (1995). The WAIC for \(S=3\) is around 53,000 on average, about 400 lower than for the other two choices of S. In practice, we also recommend choosing S based on interpretability and by inspection of the state allocations, especially if one or more states are empty.
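As a reference, the WAIC computation described above can be sketched from a matrix of pointwise log-likelihoods saved across MCMC iterations (Python; the function names are ours, and this is a generic WAIC sketch rather than the paper's exact implementation):

```python
import numpy as np

def log_mean_exp(loglik):
    """log of the mean of exp(loglik) over MCMC draws (axis 0),
    computed stably for each observation."""
    m = loglik.max(axis=0)
    return m + np.log(np.mean(np.exp(loglik - m), axis=0))

def waic(loglik):
    """WAIC from pointwise log-likelihoods of shape (n_draws, n_obs).
    The complexity penalty is the variance of the log-likelihood
    across draws, summed over observations."""
    lppd = np.sum(log_mean_exp(loglik))           # log pointwise predictive density
    p_waic = np.sum(np.var(loglik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)

# With no posterior variability the penalty is zero:
ll = np.full((100, 5), -1.0)
waic(ll)  # -> -2 * (5 * (-1) - 0) = 10.0
```

Smaller WAIC indicates a better fit, so one would compute this for each candidate S and pick the minimizer, subject to the interpretability checks noted above.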

Lastly, we conduct a simulation study with \(p=10\). The edges in the first graph are randomly selected, and the graphs in the other two states are generated sequentially using the same random sampling procedure. The time period [1, T] was divided into 25 blocks instead of 15 as in the previous study. Compared to the previous setup, these graphs do not have a specific structure and the system has less persistence in the hidden states. The model hyperparameters are the same as in the previous simulation (\(\uplambda = 1\), \(v_0 = 0.02\), \(h = 50\), \(\pi = 3/(p-1)\), \(S=3\) and \(\nu =3\)). MCMC convergence was checked using the same procedure. Since the states are less persistent, 200 more iterations were added to the burn-in period of the Dirichlet-t model to ensure convergence of the hidden states.

Based on the simulation results, the performance of the four models is very similar to the previous study. The Dirichlet-t model is able to estimate the hidden states and graphs accurately, while the non-robust methods perform the worst. Tables for performance comparison are provided in the “Appendix”. The Dirichlet-t model took on average 2.5 min per 1000 iterations with a total of 1700 iterations; the classical-t model and the Gaussian model took on average 7.5 s per 1000 iterations with a total of 10,000 iterations. For our proposed Gibbs sampler, the empirical per-iteration computation time of sampling the scaling parameters with the Dirichlet prior is roughly proportional to p and to the total number of time points T. Wang (2012) studies the running time per 1000 iterations for sampling the precision matrix: the computational cost is roughly linear in p when p is under 100, but increases much faster when p is above 100. Our empirical computing time is consistent with these findings; it is also linear in the number of states S. The sampling time for the other parameters is negligible.

Aggregating across all performance metrics, it is clear that the Dirichlet-t HMM graphical model offers substantial advantages over competing methods under a variety of scenarios.

Fig. 3 Performance comparisons for graph estimation. Boxplots of the true positive rate (TPR), false positive rate (FPR), Matthews correlation coefficient (MCC) and Frobenius loss (FL), obtained on 25 replicates, for the three simulation scenarios

Table 2 Performance comparisons for graph estimation
Fig. 4 ROC curves for edge selection under each simulation design. The models are: fused graphical Lasso with given states (fLasso), dynamic Gaussian graphical model (dGM), dynamic classical-t graphical model (dCT) and dynamic Dirichlet-t graphical model (dDT)

3.4 Sensitivity analysis

We investigate sensitivity to hyperparameter choices using one simulated dataset from the slightly contaminated scenario. Wang (2015) recommends the default setting of \(\pi =2/(p-1)\) and provides a sensitivity analysis of the variance parameters, \(v_0\) and \(v_1\), of the spike-and-slab prior on the precision matrix. Using the reparameterization \(v_1 = hv_0\), he recommends settings with \(v_0>0.01\) and \(h<1000\) as a general range of acceptable values. Peterson et al. (2020) perform a similar sensitivity analysis on \(v_0\) and \(v_1\) in the linked prior (9)–(12), obtaining results consistent with those of Wang (2015): increasing \(v_0\) for fixed h leads to sparser graphs, while increasing h for fixed \(v_0\) reduces the number of selected edges. In our experience, graph estimation is robust to the choice of \(\uplambda \), confirming previous results in the literature. We fix \(\uplambda =1\) and standardize the data prior to model fitting.

Table 3 Sensitivity analysis for simulated data

Given the sensitivity results from the existing studies, we focus our investigation on the robust models of Sect. 2.2 and specifically on the priors on the scaling parameters \(\tau \) in (6) and (8). We investigate the hyperparameter \(\nu \in \{3, 6, 9\}\), such that the prior standard deviation ranges from 0.3 to 0.6 and the distribution is centered around 1. We also investigate different choices of \(\pi \in \{2/(p-1), 3/(p-1), 4/(p-1), 5/(p-1)\}\), which leads to a prior marginal edge inclusion probability ranging from \(1/(p-1)\) to \(4/(p-1)\). We set all other parameters as described in Sect. 3.2. Results reported in Table 3 show that the graph estimation accuracy of the Dirichlet-t model can be improved for smaller values of \(\nu \), i.e., a larger prior variance for the scale parameter \(\tau \), and for larger values of \(\pi \), which correspond to larger prior edge inclusion probabilities. The classical-t model is consistently inferior to the Dirichlet-t model across these hyperparameter settings, and in particular is overly conservative in selecting edges. This slightly contaminated simulation setting is not favorable for the classical-t model, and the hyperparameters apparently cannot be tuned to match the superior performance of the Dirichlet-t model.

In general, we suggest setting the model hyperparameters to be \(\uplambda = 1\), \(v_0 = 0.02, h = 50, \pi = 3/(p-1)\) and \(\nu = 3\). Based on our experience, the proposed dynamic Dirichlet-t graphical model can generate dense graphs based on the data even if the prior suggests a low edge inclusion probability. If the application requires a denser graph, one could increase \(\pi \) and decrease \(v_0\) and h.

4 Dynamic analysis of hand gesture data

Analysis of human gestures, such as facial expressions, hand motions, and torso movements, is essential across a wide range of applications, including computer vision, human-computer interaction, animation, linguistics, security, and rehabilitation, among many others. Modern motion capture devices such as Microsoft’s Xbox Kinect (Lun and Zhao 2015) provide high-resolution, affordable solutions for recording human gestures. Thus, human gesture analysis is becoming increasingly important and increasingly common, and new statistical methods are needed for these high-dimensional and noisy data.

Phase segmentation models, notably hidden Markov models, are a prevailing tool to study the dynamics of human gestures (Mitra and Acharya 2007; Cheok et al. 2019). The visible bodily actions are segmented into several states, for example, a natural rest position and an artificial gesture unit (Kim et al. 2007). In linguistic studies, a discourse can be considered as one or several movement excursions (Kendon 2004). An excursion refers to hands moving from a position of rest to a position where the main movement happens, then executing the main movement and turning back to the rest position. One challenge in automatic gesture phase segmentation is that the exact definition of states—even on the same clip—can vary across researchers and human encoders (Madeo et al. 2013). Our proposed methods define states as having unique coordination patterns between body parts of interest, and provide a novel solution to the segmentation problem.

4.1 Data

We analyze the Gesture Phase Segmentation Data Set available on the UCI Machine Learning Repository to study the correlations between body parts. The gestures of a subject during storytelling were recorded using a Microsoft Xbox Kinect device. The story was based on simple comics shown to the subject prior to capturing the video. The raw footage was in the form of a stream of 3D frames. The 3D coordinates (x, y, z) of 4 points of interest (left hand, left wrist, right hand and right wrist) were acquired from each frame. Next, the 3D velocity and acceleration of these four points were calculated in each frame and represented by 32 variables, including the vectorial velocity and acceleration of the four points along each (x, y, z) coordinate, as well as the scalar velocities and accelerations of the four points. A complete list of variables and their indexing is provided in the “Appendix”. The 32 time series are of length \(T = 1743\) and are displayed in Fig. 5. Details of the data collection and processing are provided in Madeo et al. (2013).

Fig. 5 Gesture data (top, 32 time series, each color represents one time series), phases defined by movement excursions based on linguistic studies (middle), and posterior most probable states over time from the proposed graphical model (bottom)

In Madeo et al. (2013), each frame was labeled by a human encoder, and the stream was segmented into five states based on movement excursions: (i) the rest position; (ii) the preparation state, with hands moving from the rest position to the position where the stroke happens; (iii) a brief pause after the preparation state or the stroke, during which the hands maintain their configuration and position; (iv) the stroke that expresses the semantic content; (v) the retraction of hands from the stroke position to the rest position. Frame labels are plotted in Fig. 5. The five states are numbered as states 1–5, respectively. Given these labels, Madeo et al. (2013) focused on segmenting the video into the rest position and the gesture unit.

Figure 5 clearly illustrates the non-Gaussianity and heavy tails of the individual time series. We therefore employ the proposed Dirichlet-t HMM graphical model in an effort to capture contemporaneous relationships among the time series, which may change over time, while accounting for the spikes in the data. Notably, our simulation analysis demonstrates that this model is significantly more robust to outliers and contaminated data than Gaussian, classical-t, or frequentist alternatives. Our analysis does not make use of the phases defined by excursions and semantic meaning; our interest is in whether the inferred hidden states recover any of that information.

4.2 Parameter settings

We set the Dirichlet-t hyperparameters as follows: \(\pi = 3/(p-1)\), \(v_0 = 0.02\), \(v_1 = 1\), \(\uplambda = 1\), \(\nu = 3\), \(a_{\alpha } = 1\), \(b_{\alpha } = 1\), and \(a_{rs} = 1\). These choices are discussed and evaluated in the simulation study. In addition, we note that state estimation and graph estimation are not overly sensitive to variations in \(\pi \in \{2/(p-1), 3/(p-1), 4/(p-1), 5/(p-1)\}\). We use \(K=6\) clusters as a truncation for the stick-breaking process; fewer than 1% of observations belong to the 6th cluster, which suggests that this truncation level is reasonable. The number of hidden states S is selected based on WAIC. The WAIC values for \(S = 2, 3, 4, 5\) are 235,620, 315,190, 325,160 and 308,110, respectively. We therefore set \(S = 2\), which has the smallest WAIC. When S is increased to \(S=3\), one of the states is roughly divided into two states while the other state is unchanged. The results are similar for \(S=4\) and \(S=5\). The MCMC algorithm was run for 8000 iterations with a burn-in of 8000 and required about 165 min on a 1.80 GHz Intel Core i7. Trace plots (not shown) demonstrated good mixing. Model convergence was verified via the Geweke test on the number of edges in each state with a significance level of 1%.

4.3 Results

For each state, we estimate the graph using the posterior median graph, in which an edge is included if and only if its posterior inclusion probability exceeds 0.5. The edge densities in the two states are 0.14 and 0.15, which suggests that a sparse graphical model is appropriate. The non-zero elements of the state-specific precision matrices are estimated using the posterior means of the respective elements in \({\varvec{\Omega }}_{s}\), while zeros are assigned according to the posterior median graph. Partial correlations based on the state-specific sparse estimated precision matrices are shown in Fig. 6. The figure highlights the positive partial correlations (blue), the negative partial correlations (red), the strength of each association (line thickness), and whether or not an edge is unique to a state (dotted line).
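The mapping from an estimated precision matrix to the plotted partial correlations, with the posterior-median-graph thresholding applied, can be sketched as follows (Python; the function names are ours, and `ppi` denotes a hypothetical matrix of posterior edge inclusion probabilities):

```python
import numpy as np

def partial_correlations(Omega):
    """Partial correlations implied by a precision matrix Omega:
    rho_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj), with unit diagonal."""
    d = np.sqrt(np.diag(Omega))
    rho = -Omega / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho

def sparsify(rho, ppi, threshold=0.5):
    """Zero out entries whose posterior inclusion probability does not
    exceed the threshold, i.e. keep only the posterior median graph."""
    keep = ppi > threshold
    np.fill_diagonal(keep, True)
    return np.where(keep, rho, 0.0)

Omega = np.array([[1.0, -0.5], [-0.5, 1.0]])
rho = partial_correlations(Omega)   # off-diagonal entries equal 0.5
```

A negative off-diagonal precision entry thus corresponds to a positive partial correlation (blue edge), and vice versa (red edge).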

Fig. 6 Posterior partial correlations among the movement measures of body parts in each hidden state. Blue lines indicate positive correlations and red lines indicate negative correlations; line thickness represents the correlation strength; a dotted line indicates that the edge is unique to the state. Top row: proposed Dirichlet-t HMM graphical model; bottom row: Gaussian graphical model with fixed states. The full names of the variables are listed in the “Appendix”

The two estimated graphs from the proposed Dirichlet-t HMM graphical model share 55 edges. Most of these common edges can be summarized as follows: (i) at the same point of interest, the vectorial velocity and acceleration on the same coordinate (x, y, z) are positively correlated; (ii) on each side (left/right) and each coordinate, the velocities of the hand and the wrist are positively correlated; (iii) on each side and each coordinate, the accelerations of the hand and the wrist are positively correlated; (iv) on each side and each coordinate, the velocity of the hand and the acceleration of the wrist are negatively correlated; (v) on each side and each coordinate, the velocity of the wrist and the acceleration of the hand are negatively correlated; (vi) the previous five kinds of connections among vectorial measurements also exist among the scalar velocities and accelerations. These six types of edges are well expected given their physical interpretation. Our Dirichlet-t graphical model recovered all of them in both states, except for one missing edge (25–31) in state 1. In addition, the lower right part of the graph is disconnected from the rest of the graph, indicating that the scalar velocities and accelerations are conditionally independent of their vectorial decompositions along the three axes. In human motion data analysis, similar correlation patterns are commonly assumed based on the kinematic chain structure of the human body (Han et al. 2006; Sigal et al. 2012); other studies build models to learn such networks from body motion data (Xu et al. 2022; Song et al. 2003; Zhang et al. 2021). In conclusion, our proposed model successfully recovered true edges from highly noisy observations without any prior domain knowledge.

For further evidence, we compared the graphs estimated with the Dirichlet-t graphical model to those obtained with a non-robust Gaussian graphical model, for which we fixed the hidden states to the posterior most probable states from the dynamic Dirichlet-t graphical model and fitted Gaussian graphical models independently in each state. The posterior graph estimates are shown in Fig. 6. The resulting graphs are much sparser than those from the Dirichlet-t model, and in particular they omit many strong and well-expected edges that were detected by the Dirichlet-t model. The state 1 graphs from the two models are more similar to each other, which is reasonable because the data are less noisy in the resting phase (Madeo et al. 2013). For further comparison, we estimated the fused graphical lasso using the estimated states from the Dirichlet-t model. The AIC-selected tuning parameters produce a dense graph. By further tuning these parameters to obtain sparser graphs, we find that the estimated graphs recover some of the strong positive connections that are also present in the Dirichlet-t results, but otherwise omit, or estimate only weakly, many well-expected connections. Further details are in the “Appendix”.

Next, we examine the hidden states estimated by our model. The posterior most probable state across time is shown in Fig. 5. Notably, state 1 in our result mostly matches the resting period provided by the segmentation of the original data, while state 2 includes the gesture unit. Although the original segmentation is based solely on semantics and the locations of the hands, it is interesting to see that it corresponds with the coordination dynamics between body parts. For comparison, we fixed the hidden states to the phases given by Madeo et al. (2013) and fit Dirichlet-t graphical models independently within each state. The partial correlations are almost the same between the major states, with the six types of common edges dominating the graphs (results not shown).

Comparing the two state-specific graphs in Fig. 6, we note that, in general, the edges in state 1 are weaker than those in state 2. In other words, the conditional correlations between movement measurements are stronger when the storyteller purposefully moves their hands instead of resting. Moreover, some strong edges exist only in state 2. While most of the shared edges connect vectorial measurements on the same coordinate (x, y, z), the edges unique to state 2, shown as dotted lines, connect measurements on different coordinates. For example, there are several positive correlations between the x coordinate and the z coordinate on the left side, which likely indicates that the storyteller moved their left hand back and forth along the x = z direction. Next, the velocity of the left hand on the z coordinate is negatively correlated with the acceleration of the left wrist on the x coordinate, and the acceleration of the left hand on the z coordinate is negatively correlated with the velocity of the left wrist on the x coordinate. These negative connections could be caused by bending the left wrist in a specific direction. Also, according to the posterior graphs, the left side and the right side are not completely independent: we discover edges connecting the left hand and the right wrist, the left wrist and the right hand, etc.

5 Conclusion

We have introduced a Bayesian graphical modeling framework for studying the conditional dependence in a multivariate time series. The proposed approach incorporates both dynamics via hidden Markov models and robustness for heavy-tailed data using multivariate t-distributions. The proposed models automatically estimate a graph at each time point, identify change points in the graphs, and classify time points into clusters. In addition, our methods include information sharing among the HMM latent states and generate posterior estimates of the state similarity matrix. Within the proposed modeling frameworks, the Dirichlet-t HMM graphical model is the most robust and performs well in simulations with a large proportion of outliers in the data. We have also applied this model to hand gesture tracking data and uncovered meaningful edges and dynamics within different movement measurements, including almost all the edges that should be included based on domain knowledge.

The proposed Bayesian framework—including both the model specification and the MCMC algorithm—is designed for computational efficiency. The central role of the continuous spike-and-slab prior builds upon Wang (2015) to offer greater computational advantages than other Bayesian approaches, especially those based on the G-Wishart prior (e.g., Finegold et al. 2014; Warnick et al. 2018). Our approach is also computationally competitive with frequentist alternatives, which require extensive parameter tuning and can be sensitive to parameter choices. By comparison, the proposed approach is quite robust to the choice of hyper-parameters, and provides superior graph estimation with fewer false positives.

The simulation study uses \(p=20\) variables, which is similar to the dimension of the application. However, the proposed sampling algorithm is scalable to larger p: for example, it takes about 55 min to sample \({\varvec{\Omega }}\) and \({\varvec{G}}\) for 1000 iterations when \(p = 100\) and \(S = 4\), which is much faster than the popular G-Wishart prior when four graphs are estimated. In particular, compared to the sampling algorithm of the full Dirichlet process with a rejection sampling scheme (Finegold et al. 2014), the truncated sampling algorithm for the stick-breaking process offers large gains in efficiency and can reduce the MCMC computing time by up to 90%. Naturally, larger p places greater demands on the HMM graphical model construction, which requires inference for S distinct \(p \times p\) precision matrices. Our approach borrows information across these states via a linked prior, which mitigates but cannot eliminate these challenges.

We acknowledge that HMMs are identifiable only up to permutations of the labels, which can cause problematic label switching (Scott 2002). In our simulation and application studies, we did not observe the label switching problem, likely because of the small number of states. When label switching is present, there are many MCMC post-processing strategies available that may be applied to our models and computations (Jasra et al. 2005; Sperrin et al. 2010; Rodríguez and Walker 2014).

There are many promising extensions. First, the models proposed in this paper assume that the state transition probabilities are constant across time. In practice, it is reasonable to consider a non-homogeneous HMM in which the state transition probabilities depend on time-varying exogenous factors. In addition, the graph prior (11) uses a Bernoulli distribution with the same parameter \(\pi \) for each edge. This specification could be modified to incorporate more complicated structures, such as state-specific \(\pi \) values (Osborne et al. 2022), covariates in the edge inclusion probabilities (Ni et al. 2021), or alternative graph priors entirely (Tan et al. 2017). Furthermore, the robust models in this paper assume that the scale parameters \(\tau _t\) are independent across time. Alternatively, one could add a hierarchical layer for time dependence among the volatilities within each state. Lastly, it is well known that the Dirichlet process is affected by the “rich-gets-richer” effect and tends to generate highly skewed cluster sizes. To overcome these limitations, one could extend our prior to include more flexible alternatives, such as normalized completely random measures (Cremaschi et al. 2019). Matlab and R code to reproduce the results of this paper is available at https://github.com/chunshanl/dynamic-and-robust-Bayesian-graphical-models.