1 Introduction

With the rapid advancement of data acquisition technology, time series and sequential data modeling have become important research topics in various domains, ranging from the modeling of medical virus sequences and human genome sequences [19], gesture recognition [31], and abnormal behavior detection [28] to text clustering [32]. One of the most powerful tools for modeling sequential data or time series is the hidden Markov model (HMM) [33, 34], a probabilistic graphical model which assumes that each observation is generated, conditioned on a hidden state, from a probability density known as the emission distribution.

In the literature of HMMs, the Gaussian distribution or the Gaussian mixture model (GMM) is a common choice as the emission density for modeling continuous sequential observations [21, 41]. Nevertheless, a number of research works have shown that HMMs with other emission densities are better alternatives to Gaussian-based HMMs in various practical applications where data often possess non-Gaussian properties (e.g. the distribution of the data is typically not symmetric) [8, 11, 15, 30]. Among different types of data, \(L_2\) normalized data, also called spherical data as they are defined on a unit hypersphere [29], have drawn considerable attention as they are frequently encountered in real-world applications [25, 29], such as gene expression clustering, fMRI data analysis, and text clustering. Moreover, in a variety of applications, \(L_2\) normalization is commonly adopted as an essential preprocessing step to handle the issue of sparsity by restricting the data to a hypersphere. It has also been shown that the clustering performance of various models improves when \(L_2\) normalization is applied during training [2]. In contrast with other distributions, a reasonable choice for modeling spherical data is a directional distribution, such as the von Mises (VM) distribution [7, 12, 29], the von Mises-Fisher (VMF) distribution [3, 29, 38], and the Watson distribution [14, 16, 36, 37].

Recently, an effective model has been proposed for sequential spherical data based on an HMM with VMF mixture models [15]. One limitation of this model is that the number of hidden states of the HMM and the number of VMF distributions in the mixture model under each state are determined by treating the log-likelihood function as the model selection criterion. This method, however, demands high computational resources and is time-consuming, since the learning algorithm has to be run multiple times with different numbers of hidden states and mixture components in order to find the configuration with the highest model selection score. Another limitation of the HMM in [15], shared by many other existing HMMs (such as [8, 11, 30]), is the assumption that all features are equally important for data modeling. This assumption rarely holds in real applications, where high-dimensional data normally involve irrelevant features that may degrade the modeling performance. An effective solution to this problem is feature selection [17, 26], which is the process of selecting the “best” feature subset for describing a given data set. Recently, a variety of feature selection techniques [1, 10, 20, 40] have been developed and have shown their effectiveness for handling high-dimensional data in different applications.

The goal of our work is to propose a novel nonparametric HMM (NHMM) for modeling sequential spherical observations. In our model, the emission distribution of each hidden state is a VM mixture model, which has better capability for modeling spherical data than other popular distributions (e.g. the Gaussian distribution). Our NHMM is constructed by leveraging a Bayesian nonparametric framework, namely the Dirichlet process (DP) [39]. By applying the stick-breaking representation [35] of the DP in our NHMM, the number of hidden states and the number of mixture components per state can be automatically adjusted based on the observed data. Moreover, to deal with high-dimensional data which may include irrelevant features, feature selection is adopted in our approach. Here, we formulate a unified framework which simultaneously performs data modeling and feature selection by integrating an unsupervised localized feature selection method [13, 27, 42], based on feature saliency [24], with the proposed NHMM. The resulting model (namely VM-NHMM-Fs) is learned by developing a convergence-guaranteed algorithm based on variational Bayes (VB) [6, 22], a deterministic learning framework that approximates probability densities through optimization and has been successfully applied to various Bayesian models. The advantages of our model are demonstrated by conducting experiments on both synthetic and real-world sequential data sets.

We summarize the contributions of our work as follows.

  • A novel NHMM with VM mixture models as its emission densities is proposed for modeling sequential spherical data;

  • The total number of hidden states and mixture components of our model are inferred automatically by leveraging the nonparametric stick-breaking DP;

  • We integrate our model with a localized feature selection method which results in a unified framework for both data modeling and feature selection;

  • A convergence-guaranteed algorithm based on VB inference is theoretically developed to learn the proposed model.

The remainder of this paper is organized as follows. Section 2 presents the VM-based NHMM with unsupervised localized feature selection. Section 3 develops an effective learning approach for the proposed model based on VB inference. Section 4 reports the experimental results on both synthetic and real-world data sets. Finally, Sect. 5 concludes the paper.

2 The nonparametric HMM with VM mixture model and localized feature selection

2.1 The VM mixture model with localized feature selection

A proper choice for modeling a D-dimensional spherical (i.e. \(L_2\) normalized) vector \({\varvec{y}} = \{y_d\}_{d=1}^D\) is the D-dimensional von Mises (VM) distribution [29]

$$\begin{aligned} p({\varvec{y}}|\varvec{\mu },\varvec{\lambda })= & {} \prod _{d=1}^D \mathrm {VM}\left( {\varvec{x}}_{d}|\varvec{\mu }_d, \lambda _d \right) \nonumber \\= & {} \prod _{d=1}^D \frac{1}{2\pi I_0(\lambda _d)} \exp \left( \lambda _d\varvec{\mu }_d^\mathrm {T}{\varvec{x}}_{d}\right) , \end{aligned}$$
(1)

where \(\Vert {\varvec{y}}\Vert _2=1\), \({\varvec{x}}_{d} = (x_{d1},x_{d2})\), and \(x_{d1} = y_{d}\). It is noteworthy that \(x_{d2}\) is included in the vector \({\varvec{x}}_{d}\) to attain the \(L_2\) normalization of \({\varvec{x}}_{d}\) (i.e., \(\Vert {\varvec{x}}_{d}\Vert _2=1\)). \(I_0(\cdot )\) represents the modified Bessel function of the first kind of order 0 [29]. The parameter \(\varvec{\mu } = \{\varvec{\mu }_d\}_{d=1}^D\) indicates the mean direction, and \(\varvec{\lambda } = \{\lambda _d\}_{d=1}^D\) in (1) represents the concentration parameter, where \(\varvec{\mu }_d = (\mu _{d1},\mu _{d2})\) and \(\lambda _d \ge 0\).
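For illustration, a minimal Python sketch of the log-density in (1) is given below. The choice \(x_{d2} = \sqrt{1-y_d^2}\) for the auxiliary coordinate is an assumption made here purely for illustration, and the function name and interface are ours rather than the paper's.

```python
import numpy as np
from scipy.special import i0  # modified Bessel function of the first kind, order 0

def vm_log_density(y, mu, lam):
    """Log of Eq. (1) for one observation.

    y   : (D,) L2-normalized observation vector
    mu  : (D, 2) mean directions mu_d = (mu_d1, mu_d2), each of unit norm
    lam : (D,) non-negative concentration parameters
    """
    # Auxiliary coordinate so that ||x_d||_2 = 1 (one possible convention)
    x = np.stack([y, np.sqrt(np.clip(1.0 - y ** 2, 0.0, None))], axis=1)  # (D, 2)
    dot = np.sum(mu * x, axis=1)                                          # mu_d^T x_d
    return np.sum(lam * dot - np.log(2.0 * np.pi * i0(lam)))
```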

A more flexible and powerful way to model the \(L_2\) normalized D-dimensional vector \({\varvec{y}}\) is through a mixture of K VM distributions as

$$\begin{aligned} p({\varvec{y}}|{\varvec{c}},\varvec{\mu },\varvec{\lambda }) = \sum _{k=1}^Kc_k \prod _{d=1}^D \mathrm {VM}({\varvec{x}}_{d}|\varvec{\mu }_{kd}, \lambda _{kd}), \end{aligned}$$
(2)

where \({\varvec{c}}=\{c_k\}_{k=1}^K\) with \(\sum _{k=1}^Kc_k=1\) represents the mixing coefficients. As we may notice, all features in the VM mixture model (2) are treated equally. In practical applications, however, high-dimensional data often include noise or features that are irrelevant to the corresponding task. In our work, we address this issue by adopting an unsupervised localized feature selection method [27]. The main idea is to assume that the irrelevant features of the VM mixture model follow a common VM distribution that does not depend on class labels

$$\begin{aligned} p(y_{d}) = \mathrm {VM}\left( {\varvec{x}}_{d}|\varvec{\mu }_{kd},\lambda _{kd}\right) ^{z_{kd}} \mathrm {VM}\left( {\varvec{x}}_{d}|\varvec{\mu }_{kd}',\lambda _{kd}' \right) ^{1-z_{kd}}, \end{aligned}$$
(3)

where the binary variable \(z_{kd}\) represents the feature relevancy in the kth component of the VM mixture model. If \(z_{kd}\) equals 0, it means that the dth feature associated with the kth VM density is irrelevant and is distributed as \(\mathrm {VM}({\varvec{x}}_{d}|\varvec{\mu }_{kd}',\lambda _{kd}')\). When \(z_{kd}\) equals 1, it indicates that the dth feature is relevant and follows the VM distribution \(\mathrm {VM}({\varvec{x}}_{d}|\varvec{\mu }_{kd},\lambda _{kd})\).
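The role of the relevance indicators in (2)–(3) can be illustrated with a short sketch in which the binary \(z_{kd}\) are marginalized out using per-feature saliencies \(\zeta_{kd}\); this marginalized form is our own simplification for illustration, and the helper names are hypothetical.

```python
import numpy as np
from scipy.special import i0, logsumexp

def vm_log_factor(x, mu, lam):
    """log VM(x_d | mu_kd, lam_kd) for all components k and dimensions d.
    x: (D, 2), mu: (K, D, 2), lam: (K, D) -> (K, D)."""
    return lam * np.einsum('kdc,dc->kd', mu, x) - np.log(2.0 * np.pi * i0(lam))

def vm_mixture_log_density(y, c, mu, lam, mu_irr, lam_irr, zeta):
    """Mixture of Eq. (2) with the indicators z_kd of Eq. (3) marginalized:
    p(y) = sum_k c_k prod_d [zeta_kd VM(.|mu_kd, lam_kd)
                             + (1 - zeta_kd) VM(.|mu'_kd, lam'_kd)]."""
    x = np.stack([y, np.sqrt(np.clip(1.0 - y ** 2, 0.0, None))], axis=1)  # (D, 2)
    log_rel = vm_log_factor(x, mu, lam)          # relevant-feature factors
    log_irr = vm_log_factor(x, mu_irr, lam_irr)  # common (irrelevant) factors
    log_per_dim = np.logaddexp(np.log(zeta) + log_rel, np.log1p(-zeta) + log_irr)
    return logsumexp(np.log(c) + log_per_dim.sum(axis=1))                 # over components
```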

2.2 The VM-NHMM with localized feature selection

In this part, we propose a nonparametric HMM (NHMM) which is formulated through the stick-breaking representation of the DP. If an infinite VM mixture model (i.e. a VM mixture model with an infinite number of components) with localized feature selection is considered as the emission density of an NHMM with an infinite number of states, then the resulting VM-NHMM-Fs model can be defined with parameters \(\varPhi = \{\varvec{\pi },A,C,\varTheta \}\), where \(\varvec{\pi }= \{\pi _i\}_{i=1}^\infty\) denotes the initial state probabilities, \(A = \{a_{ij}\}_{i,j=1}^{\infty ,\infty }\) represents the state transition matrix, \(C = \{c_{ik}\}_{i,k=1}^{\infty ,\infty }\) is the mixing coefficient matrix, and \(\varTheta =\{\varvec{\mu }, \varvec{\lambda }, \varvec{\mu }', \varvec{\lambda }'\}\) denotes the set of parameters governing the VM densities, with \(\varvec{\mu }=\{\varvec{\mu }_{ikd}\}_{i,k,d=1}^{\infty ,\infty ,D}\), \(\varvec{\lambda }=\{\lambda _{ikd}\}_{i,k,d=1}^{\infty ,\infty ,D}\), \(\varvec{\mu }'=\{\varvec{\mu }_{ikd}'\}_{i,k,d=1}^{\infty ,\infty ,D}\) and \(\varvec{\lambda }'=\{\lambda _{ikd}'\}_{i,k,d=1}^{\infty ,\infty ,D}\).

Consider a sequence of T observations \(Y = \{{\varvec{y}}_t\}_{t=1}^T\), where \({\varvec{y}}_t=\{y_{td}\}_{d=1}^{D}\) represents the feature vector at time t. Let \(S = \{s_t\}_{t=1}^T\), where \(s_t\in [1,\infty ]\) indicates the hidden state associated with the tth observation, and let \(L = \{l_t\}_{t=1}^T\), where \(l_t\in [1,\infty ]\) indicates from which component of the VM mixture model the tth observation is generated. The latent variable \({\varvec{z}}=\{z_{tikd}\}_{t,i,k,d=1}^{T,\infty ,\infty ,D}\) represents the saliencies of different features in different components.

Fig. 1 Graphical model of the proposed VM-NHMM-Fs

The graphical model of VM-NHMM-Fs is shown in Fig. 1, and the joint probability distribution of this model is given by

$$\begin{aligned} p(Y, S, L|{\varvec{z}},\varPhi ) = \pi _{s_1}\bigg [\prod _{t=1}^{T-1}a_{s_ts_{t+1}}\bigg ] \bigg [\prod _{t=1}^{T}c_{s_tl_t}p({\varvec{y}}_t|\varTheta ,{\varvec{z}}_t) \bigg ], \end{aligned}$$
(4)

where \(p({\varvec{y}}_t|\varTheta ,{\varvec{z}}_t)\) denotes the VM density with feature selection and can be represented by

$$\begin{aligned}&p \left( {\varvec{y}}_t|\varTheta ,{\varvec{z}}_t \right) \nonumber \\&\quad = \prod _{d=1}^D\bigg [\mathrm {VM}\left( {\varvec{x}}_{td}|\varvec{\mu }_{s_tl_td},\lambda _{s_tl_td}\right) ^{z_{ts_tl_td}} \mathrm {VM}\left( {\varvec{x}}_{td}|\varvec{\mu }'_{s_tl_td},\lambda '_{s_tl_td}\right) ^{1-z_{ts_tl_td}}\bigg ]. \end{aligned}$$
(5)

Therefore, we can represent the likelihood of parameters \(\varPhi\) for the data sequence Y as

$$\begin{aligned} p(Y|\varPhi ) = \sum _{S,L}\pi _{s_1}\bigg [\prod _{t=1}^{T-1}a_{s_ts_{t+1}}\bigg ] \bigg [\prod _{t=1}^{T}c_{s_tl_t} p({\varvec{y}}_t|\varTheta ,{\varvec{z}}_t)\bigg ]. \end{aligned}$$
(6)

2.3 Priors over model parameters

Since the proposed VM-NHMM-Fs is a Bayesian model, each unknown variable is associated with a prior distribution. The prior probability of the indicator variable \({\varvec{z}}\) is defined by

$$\begin{aligned} p({\varvec{z}}|\varvec{\zeta })=\prod _{t=1}^{T}\prod _{i=1}^{\infty }\prod _{k=1}^{\infty }\prod _{d=1}^D \zeta _{ikd}^{z_{tikd}}(1-\zeta _{ikd})^{1-z_{tikd}}, \end{aligned}$$
(7)

where \(\zeta _{ikd}\) represents the feature saliency indicating whether the dth feature in the kth component associated with the ith state is relevant.

For parameters \(\varvec{\mu }\), \(\varvec{\lambda }\), \(\varvec{\mu }'\), and \(\varvec{\lambda }'\) of the VM distributions, von Mises-Gamma priors are adopted

$$\begin{aligned}&p \left( \varvec{\mu },\varvec{\lambda }\right) \nonumber \\&\quad =\prod _{i=1}^\infty \prod _{k=1}^\infty \prod _{d=1}^D \mathrm {VM}\left( \varvec{\mu }_{ikd}|{\varvec{m}}_{ikd},\beta _{ikd}\lambda _{ikd}\right) {\mathcal {G}}\left( \lambda _{ikd}|u_{ikd},v_{ikd}\right) , \end{aligned}$$
(8)
$$\begin{aligned}&p \left( \varvec{\mu }',\varvec{\lambda }' \right) \nonumber \\&\quad = \prod _{i=1}^\infty \prod _{k=1}^\infty \prod _{d=1}^D \mathrm {VM}\left( \varvec{\mu }_{ikd}'|{\varvec{m}}_{ikd}',\beta _{ikd}'\lambda _{ikd}' \right) {\mathcal {G}}\left( \lambda _{ikd}'|u_{ikd}',v_{ikd}' \right) , \end{aligned}$$
(9)

where \({\varvec{m}}_{ikd} = \left( m_{ikd1},m_{ikd2}\right)\) and \({\varvec{m}}_{ikd}' = \left( m_{ikd1}',m_{ikd2}' \right)\).

In our model, similar to [9], we adopt a nonparametric DP [39] as the prior over the parameters \(\varvec{\pi }\), A and C. According to the stick-breaking representation of the DP [35], \(\pi _{i}\), \(a_{ij}\) and \(c_{ik}\) can be represented by

$$\begin{aligned}&\pi _{i} = \pi _{i}'\prod _{n=1}^{i-1}(1-\pi _{n}'), \end{aligned}$$
(10)
$$\begin{aligned}&a_{ij} = a_{ij}'\prod _{n=1}^{j-1}\left( 1-a_{in}' \right) , \end{aligned}$$
(11)
$$\begin{aligned}&c_{ik} = c_{ik}'\prod _{n=1}^{k-1}\left( 1-c_{in}' \right) , \end{aligned}$$
(12)

where \(\varvec{\pi }'\), \(A'\) and \(C'\) are distributed according to Beta distributions

$$\begin{aligned}&p(\varvec{\pi }') = \prod _{i=1}^\infty \mathrm {Beta}\left( 1,\phi _{i}^\pi \right) = \prod _{i=1}^\infty \phi _{i}^\pi \left( 1-\pi _{i}' \right) ^{\phi _{i}^\pi -1}, \end{aligned}$$
(13)
$$\begin{aligned}&p(A') = \prod _{i=1}^\infty \prod _{j=1}^\infty \mathrm {Beta}(1,\phi _{ij}^A)= \prod _{i=1}^\infty \prod _{j=1}^\infty \phi _{ij}^A \left( 1-a_{ij}' \right) ^{\phi _{ij}^A-1}, \end{aligned}$$
(14)
$$\begin{aligned}&p(C') = \prod _{i=1}^\infty \prod _{k=1}^\infty \mathrm {Beta}(1,\phi _{ik}^C)= \prod _{i=1}^\infty \prod _{k=1}^\infty \phi _{ik}^C \left( 1-c_{ik}' \right) ^{\phi _{ik}^C-1}. \end{aligned}$$
(15)
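As an illustration of the stick-breaking construction in (10)–(12), a small sketch (with hypothetical function and variable names) converting Beta-distributed stick proportions into weights is given below; the truncation of Sect. 3 simply fixes the last proportion to one.

```python
import numpy as np

def stick_breaking_weights(v):
    """Map stick proportions v_i (e.g. draws from Beta(1, phi)) to weights
    w_i = v_i * prod_{n<i} (1 - v_n), as in Eqs. (10)-(12)."""
    v = np.asarray(v, dtype=float)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # prod_{n<i}(1 - v_n)
    return v * remaining

# Illustrative draw with truncation level N = 20 and concentration phi = 0.5
rng = np.random.default_rng(0)
v = rng.beta(1.0, 0.5, size=20)
v[-1] = 1.0                         # truncation: the last stick takes all remaining mass
pi = stick_breaking_weights(v)      # the resulting weights sum to 1
```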

3 Model learning algorithm based on VB inference

In this section, we systematically develop an effective learning approach which is tailored for learning the proposed VM-NHMM-Fs through variational Bayes (VB). In our case, our goal is to discover a proper approximation \(q(S, L, {\varvec{z}},\varPhi )\) to the true posterior \(p(S, L, {\varvec{z}},\varPhi |Y)\), where \(\{S, L, {\varvec{z}},\varPhi \}\) denotes the set of latent and unknown variables in VM-NHMM-Fs as described previously. To obtain a tractable inference procedure, we apply the mean-field theory [4] as

$$\begin{aligned} q \left( {\varvec{z}},S, L, \varPhi \right) =q({\varvec{z}})q(S,L)q(\varPhi ). \end{aligned}$$
(16)

The approximations \(q({\varvec{z}})\), \(q(S, L)\) and \(q(\varPhi )\) (also known as variational posteriors) in VB inference are found by maximizing the objective function, namely the evidence lower bound (ELBO), defined by

$$\begin{aligned} \mathrm {ELBO}(q) =&\int q \left( {\varvec{z}},S, L,\varPhi \right) \ln \frac{p \left( Y, {\varvec{z}},S, L,\varPhi \right) }{q({\varvec{z}},S, L, \varPhi )}\mathrm {d}{\varvec{z}}\mathrm {d}S\mathrm {d}L\mathrm {d}\varPhi \nonumber \\ =\,\,&\mathrm {ELBO}\left( q \left( \varvec{\pi }' \right) \right) +\mathrm {ELBO}(q(A')) +\mathrm { ELBO}(q(C')) \nonumber \\ \,\,&+\mathrm {ELBO}(q(\varTheta )) + \mathrm {ELBO}\left( q({\varvec{z}})\right) +Constant. \end{aligned}$$
(17)

In addition, the truncation technique [5] is adopted to truncate the variational posteriors at finite numbers of hidden states and mixture components, N and K respectively, such that

$$\begin{aligned}&\pi _{N}' = 1, \quad \sum _{i=1}^N \pi _{i} = 1, \quad \pi _{i} = 0 \;\;\text {if}\;\; i>N, \end{aligned}$$
(18)
$$\begin{aligned}&a_{iN}' = 1, \quad \sum _{j=1}^N a_{ij} = 1, \quad a_{ij} = 0 \;\;\text {if}\;\; j>N, \end{aligned}$$
(19)
$$\begin{aligned}&c_{iK}' = 1, \quad \sum _{k=1}^K c_{ik} = 1, \quad c_{ik} = 0 \;\;\text {if}\;\; k>K, \end{aligned}$$
(20)

where the appropriate values of N and K are inferred automatically during the VB learning process.

3.1 Optimizing variational posteriors \(q(\varvec{\pi }')\), \(q(C')\) and \(q(A')\)

The variational posteriors of the initial state probability matrix \(q(\varvec{\pi }')\), the state transition matrix \(q(A')\), and the mixing coefficient matrix \(q(C')\) can be optimized by maximizing the ELBO in (17) as

$$\begin{aligned}&q(\varvec{\pi }') = \prod _{i=1}^N\mathrm {Beta}\left( \pi _i'|{\widehat{W}}_i^\pi ,{\widetilde{W}}_i^\pi \right) , \end{aligned}$$
(21)
$$\begin{aligned}&q(A') = \prod _{i=1}^N\prod _{j=1}^N\mathrm {Beta}\left( a_{ij}'|{\widehat{W}}_{ij}^A,{\widetilde{W}}_{ij}^A \right) , \end{aligned}$$
(22)
$$\begin{aligned}&q(C') = \prod _{i=1}^N\prod _{k=1}^K\mathrm {Beta}\left( c_{ik}'|{\widehat{W}}_{ik}^C,{\widetilde{W}}_{ik}^C \right) , \end{aligned}$$
(23)

where the hyperparameters of the above variational posteriors are given by

$$\begin{aligned}&{\widehat{W}}_i^\pi = 1+ q(s_1=i), \end{aligned}$$
(24)
$$\begin{aligned}&{\widetilde{W}}_i^\pi = \phi _i^\pi + \sum _{n=i+1}^Nq(s_1=n), \end{aligned}$$
(25)
$$\begin{aligned}&{\widehat{W}}_{ij}^A = 1+ \sum _{t=1}^{T-1} q \left( s_t=i,s_{t+1}=j \right) , \end{aligned}$$
(26)
$$\begin{aligned}&{\widetilde{W}}_{ij}^A = \phi _{ij}^A+ \sum _{t=1}^{T-1} \sum _{n=j+1}^Nq \left( s_t=i,s_{t+1}=n \right) , \end{aligned}$$
(27)
$$\begin{aligned}&{\widehat{W}}_{ik}^C = 1+ \sum _{t=1}^T q(s_t=i,l_t=k), \end{aligned}$$
(28)
$$\begin{aligned}&{\widetilde{W}}_{ik}^C = \phi _{ik}^C+ \sum _{t=1}^T \sum _{n=k+1}^Kq \left( s_t=i,l_t=n \right) , \end{aligned}$$
(29)

where \(q(s_1)\), \(q(s_t,s_{t+1})\) and \(q(s_t,l_{t})\) are computed with the classic forward-backward algorithm described in [34].
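For reference, a compact sketch of the scaled forward-backward recursions used to obtain these posteriors is given below; it assumes the optimized quantities of Sect. 3.4 are already available, and the interface (array shapes and names) is ours rather than the paper's.

```python
import numpy as np

def forward_backward(pi_star, A_star, B):
    """Scaled forward-backward recursions (cf. [34]).

    pi_star : (N,)   optimized initial weights, cf. Eq. (46)
    A_star  : (N, N) optimized transition weights, cf. Eq. (47)
    B       : (T, N) per-state emission terms sum_k c*_ik p*(y_t | ...), cf. Eq. (45)
    Returns q(s_t = i) and q(s_t = i, s_{t+1} = j)."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    scale = np.zeros(T)

    alpha[0] = pi_star * B[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):                        # forward pass
        alpha[t] = (alpha[t - 1] @ A_star) * B[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    for t in range(T - 2, -1, -1):               # backward pass
        beta[t] = (A_star @ (B[t + 1] * beta[t + 1])) / scale[t + 1]

    gamma = alpha * beta                         # q(s_t = i)
    xi = (alpha[:-1, :, None] * A_star[None, :, :] *
          (B[1:] * beta[1:])[:, None, :] / scale[1:, None, None])  # q(s_t, s_{t+1})
    return gamma, xi
```

The joint posterior \(q(s_t=i, l_t=k)\) then follows by multiplying \(q(s_t=i)\) with the responsibility of component k within state i.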

3.2 Optimizing variational posterior \(q({\varvec{z}})\)

By maximizing the ELBO with respect to the feature saliency indicator \({\varvec{z}}\), we can optimize the variational posterior \(q({\varvec{z}})\) as

$$\begin{aligned} q({\varvec{z}}) =\prod _{t=1}^T\prod _{i=1}^N\prod _{k=1}^K\prod _{d=1}^D\varphi _{tikd}^{z_{tikd}}(1-\varphi _{tikd})^{1-z_{tikd}}, \end{aligned}$$
(30)

where \(\varphi _{tikd}\) can be computed by

$$\begin{aligned} \varphi _{tikd}= & {} \frac{\exp \left( {\widetilde{\varphi }}_{tikd} \right) }{\exp \left( {\widetilde{\varphi }}_{tikd}\right) +\exp \left( {\widehat{\varphi }}_{tikd}\right) }, \end{aligned}$$
(31)
$$\begin{aligned} {\widetilde{\varphi }}_{tikd}= \,\,& {} q(s_t=i,l_t=k) \bigg [\langle \lambda _{ikd}\varvec{\mu }_{ikd}^{\mathrm {T}}{\varvec{x}}_{td}\rangle \nonumber \\&\quad -\bigg (\frac{\partial }{\partial \lambda _{ikd}}\ln I_0({\bar{\lambda }}_{ikd})\bigg )\left( \langle \lambda _{ikd}\rangle -{\bar{\lambda }}_{ikd}\right) \nonumber \\&\quad - \ln I_0 \left( {\bar{\lambda }}_{ikd}\right) \bigg ]+\ln \zeta _{ikd}, \end{aligned}$$
(32)
$$\begin{aligned} {\widehat{\varphi }}_{tikd}= & {} q(s_t=i,l_t=k) \bigg [\langle \lambda _{ikd}'\varvec{\mu }_{ikd}'^{\mathrm {T}}{\varvec{x}}_{td}\rangle \nonumber \\&\quad -\bigg (\frac{\partial }{\partial \lambda _{ikd}'}\ln I_0({\bar{\lambda }}_{ikd}')\bigg ) \bigg (\langle \lambda _{ikd}'\rangle -{\bar{\lambda }}_{ikd}'\bigg )\nonumber \\&\quad - \ln I_0 \left( {\bar{\lambda }}_{ikd}' \right) \bigg ]+\ln (1-\zeta _{ikd}), \end{aligned}$$
(33)

where \(\langle \cdot \rangle\) denotes the expectation, and \(\frac{\partial }{\partial \lambda _{ikd}}\ln I_0 \left( {\bar{\lambda }}_{ikd}\right) = \frac{I_1 \left( {\bar{\lambda }}_{ikd}\right) }{I_0 \left( {\bar{\lambda }}_{ikd}\right) }\) follows from the property \(I_0'(\kappa ) = I_1(\kappa )\) of the modified Bessel function, as discussed in [38].
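The Bessel-function ratio used throughout these updates is readily available in standard libraries; a one-line sketch (the function name is ours):

```python
from scipy.special import i0, i1

def bessel_ratio(lam_bar):
    """d/d(lambda) ln I_0(lambda) at lambda_bar, i.e. I_1(lambda_bar) / I_0(lambda_bar),
    which follows from the property I_0'(kappa) = I_1(kappa)."""
    return i1(lam_bar) / i0(lam_bar)
```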

The saliency of the dth feature in the kth component for the ith hidden state can be calculated by setting the derivative of ELBO with respect to \(\zeta _{ikd}\) to zero as

$$\begin{aligned} \zeta _{ikd}= \frac{1}{T}\sum _{t=1}^T\langle z_{tikd}\rangle , \end{aligned}$$
(34)

where the expectation \(\langle z_{tikd}\rangle = \varphi _{tikd}\).
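In code, (31) and (34) reduce to a sigmoid and an average over time; the sketch below uses hypothetical array names for the bracketed terms of (32)–(33).

```python
import numpy as np

def responsibilities(tilde_phi, hat_phi):
    """Eq. (31): varphi = exp(tilde) / (exp(tilde) + exp(hat)) = sigmoid(tilde - hat).
    tilde_phi, hat_phi : (T, N, K, D) arrays of the terms in Eqs. (32)-(33)."""
    return 1.0 / (1.0 + np.exp(-(tilde_phi - hat_phi)))

def update_feature_saliency(varphi):
    """Eq. (34): zeta_ikd = (1/T) * sum_t <z_tikd>, with <z_tikd> = varphi_tikd."""
    return varphi.mean(axis=0)      # (N, K, D)
```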

3.3 Optimizing variational posterior \(q(\varTheta )\)

Through the maximization of the ELBO with respect to \(\varTheta =\{\varvec{\mu },\varvec{\lambda },\varvec{\mu }',\varvec{\lambda }'\}\), the variational posteriors of the VM distributions \(q \left( \varvec{\mu },\varvec{\lambda }\right)\) and \(q \left( \varvec{\mu }',\varvec{\lambda }' \right)\) can be obtained by

$$\begin{aligned}&q \left( \varvec{\mu },\varvec{\lambda }\right) \nonumber \\&\quad = \prod _{i=1}^N\prod _{k=1}^K\prod _{d=1}^D\mathrm {VM}\left( \varvec{\mu }_{ikd}| {\varvec{m}}_{ikd}^{*},\beta _{ikd}^{*}\lambda _{ikd}\right) {\mathcal {G}}\left( \lambda _{ikd}|u_{ikd}^{*},v_{ikd}^{*}\right) , \end{aligned}$$
(35)
$$\begin{aligned}&q \left( \varvec{\mu }',\varvec{\lambda }' \right) \nonumber \\&\quad = \prod _{i=1}^N\prod _{k=1}^K\prod _{d=1}^D\mathrm {VM}\left( \varvec{\mu }_{ikd}'|{\varvec{m}}_{ikd}'^{*}, \beta _{ikd}'^{*}\lambda _{ikd}' \right) {\mathcal {G}}\left( \lambda _{ikd}'|u_{ikd}'^{*},v_{ikd}'^{*}\right) , \end{aligned}$$
(36)

where the hyperparameters can be computed by

$$\begin{aligned} \beta _{ikd}^{*}=\,\, & {} \Vert \beta _{ikd} {\varvec{m}}_{ikd} + \sum _{t=1}^{T}q(s_t=i,l_t=k) \langle z_{tikd}\rangle {\varvec{x}}_{td}\Vert , \end{aligned}$$
(37)
$$\begin{aligned} {\varvec{m}}_{ikd}^{*}=\,\, & {} \frac{1}{\beta _{ikd}^{*}}\bigg (\beta _{ikd}{\varvec{m}}_{ikd}+ \sum _{t=1}^{T}q(s_t=i,l_t=k) \langle z_{tikd}\rangle {\varvec{x}}_{td}\bigg ), \end{aligned}$$
(38)
$$\begin{aligned} u_{ikd}^*=\,\, & {} u_{ikd}+\beta _{ikd}^{*}{\bar{\lambda }}_{ikd}\bigg (\frac{\partial }{\partial \beta _{ikd}^{*}\lambda _{ikd}}\ln I_0(\beta _{ikd}^{*}{\bar{\lambda }}_{ikd})\bigg ), \end{aligned}$$
(39)
$$\begin{aligned} v_{ikd}^*=\,\, & {} v_{ikd}+\sum _{t=1}^{T}q(s_t=i,l_t=k) \langle z_{tikd}\rangle \bigg (\frac{\partial }{\partial \lambda _{ikd}}\ln I_0 \left( {\bar{\lambda }}_{ikd}\right) \bigg )\nonumber \\&\quad +\beta _{ikd}\bigg (\frac{\partial }{\partial \beta _{ikd}\lambda _{ikd}}\ln I_0(\beta _{ikd}{\bar{\lambda }}_{ikd})\bigg ), \end{aligned}$$
(40)
$$\begin{aligned} \beta _{ikd}'^*=\,\, & {} \Vert \beta _{ikd}' {\varvec{m}}_{ikd}' + \sum _{t=1}^{T}q(s_t=i,l_t=k) \langle 1-z_{tikd}\rangle {\varvec{x}}_{td}\Vert , \end{aligned}$$
(41)
$$\begin{aligned} {\varvec{m}}_{ikd}'^*=\,\, & {} \frac{1}{\beta _{ikd}'^{*}}\bigg (\beta _{ikd}'{\varvec{m}}_{ikd}'+ \sum _{t=1}^{T}q(s_t=i,l_t=k) \langle 1-z_{tikd}\rangle {\varvec{x}}_{td}\bigg ), \end{aligned}$$
(42)
$$\begin{aligned} u_{ikd}'^*=\,\, & {} u_{ikd}'+\beta _{ikd}'^*{\bar{\lambda }}_{ikd}'\bigg (\frac{\partial }{\partial \beta _{ikd}'^*\lambda _{ikd}'}\ln I_0(\beta _{ikd}'^*{\bar{\lambda }}_{ikd}')\bigg ), \end{aligned}$$
(43)
$$\begin{aligned} v_{ikd}'^*=\,\, & {} v_{ikd}'+\sum _{t=1}^{T}q(s_t=i,l_t=k) \langle 1-z_{tikd}\rangle \bigg (\frac{\partial }{\partial \lambda _{ikd}'}\ln I_0({\bar{\lambda }}_{ikd}')\bigg ) \nonumber \\&\quad +\beta _{ikd}'\bigg (\frac{\partial }{\partial \beta _{ikd}'\lambda _{ikd}'}\ln I_0(\beta _{ikd}'{\bar{\lambda }}_{ikd}')\bigg ). \end{aligned}$$
(44)

3.4 Optimizing variational posterior \(q(S, L)\)

Lastly, the joint variational posterior \(q(S, L)\), where S represents the state indicators and L the mixture component indicators, is optimized by maximizing the ELBO with respect to S and L

$$\begin{aligned} q(S,L) = \frac{1}{\varOmega } \pi _{s_1}^{*} \prod _{t=1}^{T-1}a_{s_ts_{t+1}}^{*} \prod _{t=1}^{T}c_{s_t, l_t}^{*} p^{*}\left( {\varvec{y}}_{t}|\varTheta ,{\varvec{z}}_{t}\right) , \end{aligned}$$
(45)

where

$$\begin{aligned} \pi _{i}^*=\,\, & {} \exp \bigg \{ \varPsi \left( {\widehat{W}}_i^\pi \right) - \varPsi \left( {\widehat{W}}_i^\pi +{\widetilde{W}}_i^\pi \right) + \sum _{n=1}^{i-1}\bigg [\varPsi \left( {\widetilde{W}}_n^\pi \right) \nonumber \\&\quad - \varPsi \left( {\widehat{W}}_n^\pi +{\widetilde{W}}_n^\pi \right) \bigg ]\bigg \}, \end{aligned}$$
(46)
$$\begin{aligned} a_{ij}^*=\,\, & {} \exp \bigg \{\varPsi ({\widehat{W}}_{ij}^A) - \varPsi ({\widehat{W}}_{ij}^A+{\widetilde{W}}_{ij}^A)+ \sum _{n=1}^{j-1}\bigg [\varPsi \left( {\widetilde{W}}_{in}^A \right) \nonumber \\&\quad - \varPsi \left( {\widehat{W}}_{in}^A+{\widetilde{W}}_{in}^A \right) \bigg ]\bigg \}, \end{aligned}$$
(47)
$$\begin{aligned} c_{ik}^*=\,\, & {} \exp \bigg \{\varPsi \left( {\widehat{W}}_{ik}^C \right) - \varPsi \left( {\widehat{W}}_{ik}^C+{\widetilde{W}}_{ik}^C \right) + \sum _{n=1}^{k-1}\bigg [\varPsi \left( {\widetilde{W}}_{in}^C \right) \nonumber \\&\quad - \varPsi \left( {\widehat{W}}_{in}^C+{\widetilde{W}}_{in}^C \right) \bigg ]\bigg \}, \end{aligned}$$
(48)
$$\begin{aligned} p^*\left( {\varvec{y}}_{t}|\varTheta ,{\varvec{z}}_{t}\right)=\,\, & {} \exp \bigg \{\sum _{d=1}^D \langle z_{tikd}\rangle \bigg [ \langle \lambda _{ikd}\varvec{\mu }_{ikd}^{\mathrm {T}}{\varvec{x}}_{td}\rangle \nonumber \\&\quad -\ln 2\pi - \ln I_0 \left( {\bar{\lambda }}_{ikd}\right) \nonumber \\&\quad -\bigg (\frac{\partial }{\partial \lambda _{ikd}}\ln I_0 \left( {\bar{\lambda }}_{ikd}\right) \bigg )(\langle \lambda _{ikd}\rangle -{\bar{\lambda }}_{ikd})\bigg ]\nonumber \\&\quad +\sum _{d=1}^D\langle 1-z_{tikd}\rangle \bigg [ \langle \lambda _{ikd}'\varvec{\mu }_{ikd}'^{\mathrm {T}}{\varvec{x}}_{td}\rangle \nonumber \\&\quad -\ln 2\pi -\bigg (\frac{\partial }{\partial \lambda _{ikd}'}\ln I_0({\bar{\lambda }}_{ikd}')\bigg )\left( \langle \lambda _{ikd}'\rangle -{\bar{\lambda }}_{ikd}' \right) \nonumber \\&\quad - \ln I_0 \left( {\bar{\lambda }}_{ikd}' \right) \bigg ] \bigg \}, \end{aligned}$$
(49)

where \(\varOmega\) in (45) is the normalizing constant and is given by

$$\begin{aligned} \varOmega&= q(Y|\varPhi ^{*}) = \sum _{S, L} \pi _{s_1}^{*} \prod _{t=1}^{T-1}a_{s_ts_{t+1}}^{*} \prod _{t=1}^{T}c_{s_t, l_t}^{*}p^{*}\left( {\varvec{y}}_{t}|\varTheta ,{\varvec{z}}_t \right) . \end{aligned}$$
(50)

It is noteworthy that, comparing (50) with (6), \(\varOmega\) can be regarded as an approximation to the likelihood of the model with the optimized parameters \(\varPhi ^{*}\).

Algorithm 1 summarizes the VB inference algorithm for learning the VM-NHMM-Fs model. This learning algorithm is guaranteed to converge because the ELBO in (17) is convex with respect to each individual variational posterior [4]. Convergence can be detected by monitoring the ELBO: the algorithm stops when the difference between the ELBO values of two consecutive iterations falls below a predefined threshold.

Algorithm 1 VB inference for learning the VM-NHMM-Fs model
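To summarize the overall procedure, the following high-level Python sketch shows the iteration and convergence structure of Algorithm 1; the update_* routines are hypothetical placeholders standing in for the closed-form updates of Sects. 3.1–3.4.

```python
import numpy as np

def fit_vm_nhmm_fs(Y, n_states=20, n_components=30, max_iter=200, tol=1e-5):
    """Skeleton of the VB learning loop (Algorithm 1); the helper functions
    are placeholders for the updates derived in Sects. 3.1-3.4."""
    params = initialise_priors(Y, n_states, n_components)        # hyperparameters of Sect. 2.3
    elbo_old = -np.inf
    for _ in range(max_iter):
        q_state = update_state_posteriors(Y, params)             # Sect. 3.4 via forward-backward
        params = update_weights(q_state, params)                 # Sect. 3.1: q(pi'), q(A'), q(C')
        params = update_feature_indicators(Y, q_state, params)   # Sect. 3.2: q(z) and saliencies
        params = update_emissions(Y, q_state, params)            # Sect. 3.3: q(mu, lambda), q(mu', lambda')
        elbo = evidence_lower_bound(Y, q_state, params)          # Eq. (17)
        if abs(elbo - elbo_old) < tol:                           # convergence criterion
            break
        elbo_old = elbo
    return params
```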

4 Experimental results

The proposed nonparametric HMM with localized feature selection (VM-NHMM-Fs) is evaluated through experiments on both synthetic and real-world time series or sequential data sets. We set the initial truncation values of N and K in our experiments to 20 and 30, respectively. The initial value of the feature saliency hyperparameter \(\zeta\) is set to 0.5. The hyperparameters \(\phi ^\pi\), \(\phi ^A\) and \(\phi ^C\) of the stick-breaking representation are all initialized to 0.5. The hyperparameters m and \(m'\) are initialized to the mean of the data set. The remaining hyperparameters are initialized as \((\beta , \beta ', u, u', v, v') = (0.01, 0.01, 0.3, 0.3, 0.05, 0.05)\). For all experiments, we report the average performance of our model over 20 runs.
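For reproducibility, the settings listed above can be collected in a single configuration; the snippet below simply mirrors the reported values (the key names are illustrative only).

```python
# Initialization reported above; key names are ours, not the paper's
init = {
    "N": 20, "K": 30,                              # truncation levels (states / components)
    "zeta": 0.5,                                   # initial feature saliency
    "phi_pi": 0.5, "phi_A": 0.5, "phi_C": 0.5,     # stick-breaking hyperparameters
    "beta": 0.01, "beta_prime": 0.01,
    "u": 0.3, "u_prime": 0.3,
    "v": 0.05, "v_prime": 0.05,
}
# m and m' are initialized to the mean of the (L2-normalized) data set
```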

4.1 Experiments on synthetic sequential data

In this part, a synthetic sequential data set is generated to validate the effectiveness of the proposed learning approach in inferring parameters and selecting important features for the proposed VM-NHMM-Fs.

Our synthetic sequential data set contains a sequence of 3000 data points generated from 2 hidden states, where State 1 generates the sequential observations at \(t = 1 : 1500\) and State 2 generates the observations at \(t = 1501 : 3000\). In each state, a mixture of two 3-dimensional VM densities corresponding to relevant features (i.e. 3 relevant features in total) was used as the emission density. The parameters adopted for generating the 3 relevant features are shown in Table 1. We then generated 12 irrelevant features according to a common VM distribution with parameters \(\varvec{\mu } = (0, 1)\) and \(\lambda = 1\) and appended them to the 3 relevant features to form a 15-dimensional data set.
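A sketch of how such a sequence can be generated is shown below; since Table 1 is not reproduced here, the mean directions and concentrations of the relevant features are placeholders, and only the overall structure (2 states, 2 VM components per state, 3 relevant plus 12 irrelevant features) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_vm_feature(mean_dir, lam, n):
    """Draw n values of one feature from a 2-D von Mises factor (cf. Eq. (1)):
    sample an angle around the mean direction and keep its first coordinate."""
    theta0 = np.arctan2(mean_dir[1], mean_dir[0])
    return np.cos(rng.vonmises(theta0, lam, size=n))

def random_unit_dirs(d):
    v = rng.normal(size=(d, 2))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def sample_state(components):
    """components: list of (n_k, mu (3,2), lam (3,)) for the VM densities of one state."""
    blocks = []
    for n_k, mu, lam in components:
        blocks.append(np.column_stack(
            [sample_vm_feature(mu[d], lam[d], n_k) for d in range(3)]))
    return np.vstack(blocks)

# Placeholder component parameters standing in for Table 1
comps1 = [(750, random_unit_dirs(3), np.full(3, 10.0)), (750, random_unit_dirs(3), np.full(3, 10.0))]
comps2 = [(750, random_unit_dirs(3), np.full(3, 10.0)), (750, random_unit_dirs(3), np.full(3, 10.0))]
relevant = np.vstack([sample_state(comps1), sample_state(comps2)])                 # (3000, 3)

# 12 irrelevant features from the common VM with mu = (0, 1) and lambda = 1
irrelevant = np.column_stack(
    [sample_vm_feature(np.array([0.0, 1.0]), 1.0, 3000) for _ in range(12)])
data = np.hstack([relevant, irrelevant])                                           # (3000, 15)
```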

Table 1 The parameters for generating the 3 relevant features for the 15-dimensional data set, where S1 and S2 indicate state 1 and state 2, respectively; \(n_k\) denotes the number of data points that are generated from the kth VM density, d represents the feature number

To verify the “correctness” of the proposed VB learning algorithm, we compared the true values of the parameters used to generate the data set with the corresponding estimated values, as in [3]. The comparison results for the synthetic sequential data set are reported in Tables 2 and 3 for State 1 and State 2, respectively. From these tables, we can see that the proposed VB inference algorithm estimates the model parameters accurately, which illustrates its effectiveness.

Table 2 The comparison of the true and the estimated parameters by the proposed VM-NHMM-Fs under State 1 for the synthetic data set
Table 3 The comparison of the true and the estimated parameters by the proposed VM-NHMM-Fs under State 2 for the synthetic data set
Fig. 2 Average feature saliencies on the synthetic data set obtained by VM-NHMM-Fs, plus and minus one standard deviation over 20 runs

Next, we test the feature selection performance of our VM-NHMM-Fs on the synthetic data set. The feature selection results in terms of feature saliency (i.e. the values of \(\{\zeta _d\}\)) are shown in Fig. 2. According to this figure, high degrees of relevancy (i.e. above 0.9) have been assigned to the first three features, while the remaining 12 features are considered irrelevant due to their low saliencies (i.e. close to 0). These results are consistent with the true settings of the synthetic sequential data set.

4.2 Experiments on real data sets

4.2.1 Data sets and experimental settings

In this part, the effectiveness of the proposed VM-NHMM-Fs was validated by conducting experiments on real sequential data sets in unsupervised clustering applications. We adopted two real data sets from the UCI machine learning repository: the gesture phase segmentation data set and the epileptic seizure recognition data set.

The gesture phase segmentation data set contains seven videos recorded with a Microsoft Kinect sensor and temporally segmented into gesture phases (rest, preparation, stroke, hold and retraction). In our case, we tested the performance of VM-NHMM-Fs on three videos of this data set: A1 (1747 frames), A2 (1264 frames) and A3 (1834 frames), where each video includes an original version and a processed version. Each frame is described by 50 features, of which 18 are obtained from the original videos and 32 are extracted from the processed videos.

The epileptic seizure recognition data set that we adopted is a pre-processed version of a data set for epileptic seizure detection, as described in the UCI machine learning repository. It contains 11500 observations, each consisting of 178 data points, where each data point represents the EEG value recorded at a different point in time. The data set contains five classes: (1) EEG of seizure activity; (2) EEG from the area where the tumor was located; (3) EEG from the healthy brain area; (4) EEG recorded while the patient had their eyes closed; (5) EEG recorded while the patient had their eyes open.

Table 4 The average recognition performance over 20 runs by different approaches
Fig. 3 Average feature saliencies for the resting phase of the gesture phase segmentation data set obtained by VM-NHMM-Fs, plus and minus one standard deviation over 20 runs

In our experiments, both data sets were \(L_2\) normalized and then modeled by the proposed VM-NHMM-Fs. To demonstrate the advantages of our model, we compared it with other well-established HMMs that employ different mixture models: the HMM with Gaussian mixture models (GMM-HMM) [21], the HMM with Gaussian mixture models and unsupervised feature selection (GMM-HMM-Fs) [43], the HMM with Dirichlet mixture models (DMM-HMM) [11], the HMM with inverted Dirichlet mixture models (IDMM-HMM) [30] and the HMM with VMF mixture models (VMF-HMM) [15]. Furthermore, to evaluate the importance of integrating feature selection in our model, we applied the proposed model both with localized feature selection (VM-NHMM-Fs) and without it (denoted by VM-NHMM). For the compared models, we adopted the same parameter values as in their original papers, and all models were evaluated on the same data sets as described above.
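The \(L_2\) normalization step mentioned above is a simple row-wise rescaling; a minimal sketch (the function name is ours):

```python
import numpy as np

def l2_normalize(X):
    """Row-wise L2 normalization applied to both data sets before model fitting."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.clip(norms, 1e-12, None)
```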

Fig. 4 Average feature saliencies for the class of seizure activity of the epileptic seizure recognition data set obtained by VM-NHMM-Fs, plus and minus one standard deviation over 20 runs

In our experiment, we set the initial number of states for both data sets to \(N=20\); the optimal number of states was then determined automatically during model learning. According to the results obtained by the proposed VM-NHMM-Fs, the gesture phase segmentation data set and the epileptic seizure recognition data set eventually converged to 3 and 2 states, respectively. For the other tested approaches, the number of hidden states was set manually. Table 4 shows the recognition performance of the different models on the two real data sets. As can be seen from this table, both VM-NHMM-Fs and VM-NHMM outperform the other HMM-based approaches with higher recognition accuracies on all data sets, which verifies the merits of applying nonparametric VM-based HMMs for modeling gestures and EEG data. Another advantage of our approach is that, in contrast with the other tested HMM-based approaches, in which the number of clusters was determined through an extra evaluation step based on model selection scores, this number in our case was automatically determined during the inference procedure thanks to the nonparametric framework of the Dirichlet process. According to Table 4, we may also notice the improvement in performance when feature selection is integrated with VM-NHMM, by comparing the results of VM-NHMM-Fs with those of VM-NHMM.

The feature saliencies obtained by VM-NHMM-Fs for the 50-dimensional gesture phase data vectors of the resting phase are shown in Fig. 3. It can be seen from this figure that 7 features obtain low degrees of relevance (i.e. saliencies less than 0.5) and are therefore considered irrelevant in the modeling process, while the remaining features are considered relevant due to their high feature saliencies (i.e. greater than 0.5). Figure 4 illustrates the feature saliencies obtained by VM-NHMM-Fs for the class of seizure activity of the epileptic seizure recognition data set. Based on this figure, different features contribute differently to the task of epileptic seizure recognition: 22 of the 178 features obtain relatively low saliencies (i.e. less than 0.5) and therefore contribute less to data modeling.

5 Conclusion

In this work, a nonparametric HMM has been proposed for modeling time series or sequential spherical data vectors. In our model, the emission distribution of each hidden state is a mixture of VM distributions, which has shown better capability for modeling spherical data than other commonly used distributions (such as the Gaussian distribution). We constructed our NHMM by leveraging the Bayesian nonparametric DP framework, so that the number of hidden states and the number of mixture components per state can be automatically adjusted according to the observed data. In addition, to deal with high-dimensional data sets which may contain irrelevant or noisy features, an unsupervised localized feature selection method was incorporated into the proposed NHMM, resulting in a unified framework that simultaneously performs data modeling and feature selection. The proposed model was learned with an effective algorithm based on VB inference. The advantages of our model were demonstrated on both simulated and real-world data sets. In particular, according to the experimental results, our model outperformed the other tested HMM-based models by at least 4.2% in gesture recognition and at least 2.6% in epileptic seizure recognition.

One limitation of the proposed NHMM is that it is not very efficient when dealing with large-scale data sets, mainly because of the batch learning strategy of the conventional VB inference adopted in our work. A possible future direction is therefore to extend the developed VB inference algorithm with stochastic variational Bayes (SVB) [18], which has proven efficient for learning over large data sets through stochastic optimization. Moreover, in recent years, deep learning techniques have been successfully applied in different fields owing to their promising capability of automatically extracting meaningful representations from observed data. Another interesting direction for future work is therefore to integrate deep neural networks (e.g. the variational auto-encoder [23]) with the proposed NHMM, leveraging the more representative features learned by these deep models to improve its performance.