1 Introduction

With the rapid advancement of data acquisition technology, time series and sequential data modeling have become important research topics in various domains, ranging from the modeling of medical virus sequences and human genome sequences [19], gesture recognition [31], and abnormal behavior detection [28] to text clustering [32]. One of the most powerful tools for modeling sequential data or time series is the hidden Markov model (HMM) [33, 34], a probabilistic graphical model which assumes that each observation is generated, conditioned on a hidden state, from a probability density known as the emission distribution.

In the literature of HMMs, the Gaussian distribution or the Gaussian mixture model (GMM) is a common choice as the emission density for modeling continuous sequential observations [21, 41]. Nevertheless, a number of research works have shown that HMMs with other emission densities are better alternatives to Gaussian-based HMMs in various practical applications where data often possess non-Gaussian properties (e.g. the distribution of the data is typically not symmetric) [8, 11, 15, 30]. Among different types of data, \(L_2\) normalized data, also called spherical data as they are defined on a unit hypersphere [29], have drawn considerable attention as they are frequently encountered in real-world applications [25, 29], such as gene expression clustering, fMRI data analysis, and text clustering. Moreover, in a variety of applications, \(L_2\) normalization is commonly adopted as an essential preprocessing step to handle the issue of sparsity by restricting the data to a hypersphere. It has also been shown that the clustering performance of various models improves when \(L_2\) normalization is applied during training [2]. In contrast with other distributions, a reasonable choice for modeling spherical data is a directional distribution, such as the von Mises (VM) distribution [7, 12, 29], the von Mises-Fisher (VMF) distribution [3, 29, 38], and the Watson distribution [14, 16, 36, 37].

Recently, an effective model has been proposed for sequential spherical data based on an HMM with VMF mixture models [15]. One limitation of this model is that the number of hidden states of the HMM and the number of VMF distributions in the mixture model under each state are determined by treating the log-likelihood function as the model selection criterion. This method, however, demands high computational resources and is time-consuming, since the learning algorithm has to be run multiple times with different numbers of hidden states and mixture components in order to find the configuration with the highest model selection score. Another limitation of the HMM in [15], shared by many other existing HMMs (such as [8, 11, 30]), is the assumption that all features are equally important for data modeling. This assumption rarely holds in real applications, where high-dimensional data normally involve irrelevant features that may degrade the modeling performance. An effective solution to this problem is feature selection [17, 26], which is the process of selecting the “best” feature subset for describing a given data set. Recently, a variety of feature selection techniques [1, 10, 20, 40] have been developed and have shown their effectiveness for handling high-dimensional data in different applications.

The goal of our work is to propose a novel nonparametric HMM (NHMM) for modeling sequential spherical observations. In our model, the emission distribution of each hidden state is a VM mixture model, which has better capability for modeling spherical data than other popular distributions (e.g. the Gaussian distribution). Our NHMM is constructed by leveraging a Bayesian nonparametric framework, namely the Dirichlet process (DP) [39]. By applying the stick-breaking representation [35] of the DP in our NHMM, the number of hidden states and the number of mixture components per state can be automatically adjusted based on the observed data. Moreover, to deal with high-dimensional data which may include irrelevant features, feature selection is adopted in our approach. Here, we formulate a unified framework which simultaneously performs data modeling and feature selection by integrating an unsupervised localized feature selection method [13, 27, 42], based on feature saliency [24], with the proposed NHMM. The resulting model (namely VM-NHMM-Fs) is learned by developing a convergence-guaranteed algorithm based on variational Bayes (VB) [6, 22], a deterministic learning framework that approximates probability densities through optimization and has been successfully applied to various Bayesian models. The advantages of our model are demonstrated by conducting experiments on both synthetic and real-world sequential data sets.

We summarize the contributions of our work as follows.

  • A novel NHMM with VM mixture models as its emission densities is proposed for modeling sequential spherical data;

  • The total number of hidden states and mixture components of our model are inferred automatically by leveraging the nonparametric stick-breaking DP;

  • We integrate our model with a localized feature selection method which results in a unified framework for both data modeling and feature selection;

  • A convergence-guaranteed algorithm based on VB inference is theoretically developed to learn the proposed model.

The remainder of this paper is organized as follows. Section 2 presents the VM-based NHMM with unsupervised localized feature selection. Section 3 develops an effective learning approach for the proposed model based on VB inference. Section 4 reports the experimental results on both synthetic and real-world data sets. Finally, Sect. 5 concludes the paper.

2 The nonparametric HMM with VM mixture model and localized feature selection

2.1 The VM mixture model with localized feature selection

A proper choice for modeling a D-dimensional spherical (i.e. \(L_2\) normalized) vector \({\varvec{y}} = \{y_d\}_{d=1}^D\) is the D-dimensional von Mises (VM) distribution [29]

$$\begin{aligned} p({\varvec{y}}|\varvec{\mu },\varvec{\lambda })= & {} \prod _{d=1}^D \mathrm {VM}\left( {\varvec{x}}_{d}|\varvec{\mu }_d, \lambda _d \right) \nonumber \\= & {} \prod _{d=1}^D \frac{1}{2\pi I_0(\lambda _d)} \exp \left( \lambda _d\varvec{\mu }_d^\mathrm {T}{\varvec{x}}_{d}\right) , \end{aligned}$$
(1)

where \(\Vert {\varvec{y}}\Vert _2=1\), \({\varvec{x}}_{d} = (x_{d1},x_{d2})\), and \(x_{d1} = y_{d}\). It is noteworthy that \(x_{d2}\) is included in the vector \({\varvec{x}}_{d}\) to attain the \(L_2\) normalization of \({\varvec{x}}_{d}\) (i.e., \(\Vert {\varvec{x}}_{d}\Vert _2=1\)). \(I_0(\cdot )\) represents the modified Bessel function of the first kind of order 0 [29]. The parameter \(\varvec{\mu } = \{\varvec{\mu }_d\}_{d=1}^D\) indicates the mean direction, and \(\varvec{\lambda } = \{\lambda _d\}_{d=1}^D\) in (1) represents the concentration parameter, where \(\varvec{\mu }_d = (\mu _{d1},\mu _{d2})\) and \(\lambda _d \ge 0\).
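For illustration, a minimal Python sketch of the log-density in (1) is given below. The choice \(x_{d2} = \sqrt{1-y_d^2}\) for the auxiliary coordinate is an assumption made here purely for illustration, and the function name and interface are ours rather than the paper's.

```python
import numpy as np
from scipy.special import i0  # modified Bessel function of the first kind, order 0

def vm_log_density(y, mu, lam):
    """Log of Eq. (1) for one observation.

    y   : (D,) L2-normalized observation vector
    mu  : (D, 2) mean directions mu_d = (mu_d1, mu_d2), each of unit norm
    lam : (D,) non-negative concentration parameters
    """
    # Auxiliary coordinate so that ||x_d||_2 = 1 (one possible convention)
    x = np.stack([y, np.sqrt(np.clip(1.0 - y ** 2, 0.0, None))], axis=1)  # (D, 2)
    dot = np.sum(mu * x, axis=1)                                          # mu_d^T x_d
    return np.sum(lam * dot - np.log(2.0 * np.pi * i0(lam)))
```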

A more flexible and powerful way to model the \(L_2\) normalized D-dimensional vector \({\varvec{y}}\) is through a mixture of K VM distributions as

$$\begin{aligned} p({\varvec{y}}|{\varvec{c}},\varvec{\mu },\varvec{\lambda }) = \sum _{k=1}^Kc_k \prod _{d=1}^D \mathrm {VM}({\varvec{x}}_{d}|\varvec{\mu }_{kd}, \lambda _{kd}), \end{aligned}$$
(2)

where \({\varvec{c}}=\{c_k\}_{k=1}^K\) with \(\sum _{k=1}^Kc_k=1\) represents the mixing coefficients. As we may notice, all features in the VM mixture model (2) are treated equally. In practical applications, however, high-dimensional data often include noise or features that are irrelevant to the corresponding task. In our work, we address this issue by adopting an unsupervised localized feature selection method [27]. The main idea is to assume that the irrelevant features of the VM mixture model follow a common VM distribution that does not depend on class labels

$$\begin{aligned} p(y_{d}) = \mathrm {VM}\left( {\varvec{x}}_{d}|\varvec{\mu }_{kd},\lambda _{kd}\right) ^{z_{kd}} \mathrm {VM}\left( {\varvec{x}}_{d}|\varvec{\mu }_{kd}',\lambda _{kd}' \right) ^{1-z_{kd}}, \end{aligned}$$
(3)

where the binary variable \(z_{kd}\) represents the feature relevancy in the kth component of the VM mixture model. If \(z_{kd}\) equals 0, it means that the dth feature associated with the kth VM density is irrelevant and is distributed as \(\mathrm {VM}({\varvec{x}}_{d}|\varvec{\mu }_{kd}',\lambda _{kd}')\). When \(z_{kd}\) equals 1, it indicates that the dth feature is relevant and follows the VM distribution \(\mathrm {VM}({\varvec{x}}_{d}|\varvec{\mu }_{kd},\lambda _{kd})\).
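The role of the relevance indicators in (2)–(3) can be illustrated with a short sketch in which the binary \(z_{kd}\) are marginalized out using per-feature saliencies \(\zeta_{kd}\); this marginalized form is our own simplification for illustration, and the helper names are hypothetical.

```python
import numpy as np
from scipy.special import i0, logsumexp

def vm_log_factor(x, mu, lam):
    """log VM(x_d | mu_kd, lam_kd) for all components k and dimensions d.
    x: (D, 2), mu: (K, D, 2), lam: (K, D) -> (K, D)."""
    return lam * np.einsum('kdc,dc->kd', mu, x) - np.log(2.0 * np.pi * i0(lam))

def vm_mixture_log_density(y, c, mu, lam, mu_irr, lam_irr, zeta):
    """Mixture of Eq. (2) with the indicators z_kd of Eq. (3) marginalized:
    p(y) = sum_k c_k prod_d [zeta_kd VM(.|mu_kd, lam_kd)
                             + (1 - zeta_kd) VM(.|mu'_kd, lam'_kd)]."""
    x = np.stack([y, np.sqrt(np.clip(1.0 - y ** 2, 0.0, None))], axis=1)  # (D, 2)
    log_rel = vm_log_factor(x, mu, lam)          # relevant-feature factors
    log_irr = vm_log_factor(x, mu_irr, lam_irr)  # common (irrelevant) factors
    log_per_dim = np.logaddexp(np.log(zeta) + log_rel, np.log1p(-zeta) + log_irr)
    return logsumexp(np.log(c) + log_per_dim.sum(axis=1))                 # over components
```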

2.2 The VM-NHMM with localized feature selection

In this part, we propose a nonparametric HMM (NHMM) which is formulated through the stick-breaking representation of the DP. If an infinite VM mixture model (i.e. a VM mixture model with an infinite number of components) with localized feature selection is considered as the emission density of an NHMM with an infinite number of states, then the resulting VM-NHMM-Fs model can be defined with parameters \(\varPhi = \{\varvec{\pi },A,C,\varTheta \}\), where \(\varvec{\pi }= \{\pi _i\}_{i=1}^\infty\) denotes the initial state probabilities, \(A = \{a_{ij}\}_{i,j=1}^{\infty ,\infty }\) represents the state transition matrix, \(C = \{c_{ik}\}_{i,k=1}^{\infty ,\infty }\) is the mixing coefficient matrix, and \(\varTheta =\{\varvec{\mu }, \varvec{\lambda }, \varvec{\mu }', \varvec{\lambda }'\}\) denotes the set of parameters governing the VM densities, with \(\varvec{\mu }=\{\varvec{\mu }_{ikd}\}_{i,k,d=1}^{\infty ,\infty ,D}\), \(\varvec{\lambda }=\{\lambda _{ikd}\}_{i,k,d=1}^{\infty ,\infty ,D}\), \(\varvec{\mu }'=\{\varvec{\mu }_{ikd}'\}_{i,k,d=1}^{\infty ,\infty ,D}\) and \(\varvec{\lambda }'=\{\lambda _{ikd}'\}_{i,k,d=1}^{\infty ,\infty ,D}\).

Consider a sequence of T observations \(Y = \{{\varvec{y}}_t\}_{t=1}^T\), where \({\varvec{y}}_t=\{y_{td}\}_{d=1}^{D}\) represents the feature vector at time t. Let \(S = \{s_t\}_{t=1}^T\), where \(s_t\in [1,\infty ]\) indicates the hidden state associated with the tth observation, and let \(L = \{l_t\}_{t=1}^T\), where \(l_t\in [1,\infty ]\) indicates from which component of the VM mixture model the tth observation is generated. The latent variable \({\varvec{z}}=\{z_{tikd}\}_{t,i,k,d=1}^{T,\infty ,\infty ,D}\) represents the saliencies of different features in different components.

Fig. 1 Graphical model of the proposed VM-NHMM-Fs

The graphical model of VM-NHMM-Fs is shown in Fig. 1, and the joint probability distribution of this model is given by

$$\begin{aligned} p(Y, S, L|{\varvec{z}},\varPhi ) = \pi _{s_1}\bigg [\prod _{t=1}^{T-1}a_{s_ts_{t+1}}\bigg ] \bigg [\prod _{t=1}^{T}c_{s_tl_t}p({\varvec{y}}_t|\varTheta ,{\varvec{z}}_t) \bigg ], \end{aligned}$$
(4)

where \(p({\varvec{y}}_t|\varTheta ,{\varvec{z}}_t)\) denotes the VM density with feature selection and can be represented by

$$\begin{aligned}&p \left( {\varvec{y}}_t|\varTheta ,{\varvec{z}}_t \right) \nonumber \\&\quad = \prod _{d=1}^D\bigg [\mathrm {VM}\left( {\varvec{x}}_{td}|\varvec{\mu }_{s_tl_td},\lambda _{s_tl_td}\right) ^{z_{ts_tl_td}} \mathrm {VM}\left( {\varvec{x}}_{td}|\varvec{\mu }'_{s_tl_td},\lambda '_{s_tl_td}\right) ^{1-z_{ts_tl_td}}\bigg ]. \end{aligned}$$
(5)

Therefore, we can represent the likelihood of parameters \(\varPhi\) for the data sequence Y as

$$\begin{aligned} p(Y|\varPhi ) = \sum _{S,L}\pi _{s_1}\bigg [\prod _{t=1}^{T-1}a_{s_ts_{t+1}}\bigg ] \bigg [\prod _{t=1}^{T}c_{s_tl_t} p({\varvec{y}}_t|\varTheta ,{\varvec{z}}_t)\bigg ]. \end{aligned}$$
(6)

2.3 Priors over model parameters

Since the proposed VM-NHMM-Fs is a Bayesian model, each unknown variable is associated with a prior distribution. The prior probability of the indicator variable \({\varvec{z}}\) is defined by

$$\begin{aligned} p({\varvec{z}}|\varvec{\zeta })=\prod _{t=1}^{T}\prod _{i=1}^{\infty }\prod _{k=1}^{\infty }\prod _{d=1}^D \zeta _{ikd}^{z_{tikd}}(1-\zeta _{ikd})^{1-z_{tikd}}, \end{aligned}$$
(7)

where \(\zeta _{ikd}\) represents the feature saliency indicating whether the dth feature in the kth component associated with the ith state is relevant.

For parameters \(\varvec{\mu }\), \(\varvec{\lambda }\), \(\varvec{\mu }'\), and \(\varvec{\lambda }'\) of the VM distributions, von Mises-Gamma priors are adopted

$$\begin{aligned}&p \left( \varvec{\mu },\varvec{\lambda }\right) \nonumber \\&\quad =\prod _{i=1}^\infty \prod _{k=1}^\infty \prod _{d=1}^D \mathrm {VM}\left( \varvec{\mu }_{ikd}|{\varvec{m}}_{ikd},\beta _{ikd}\lambda _{ikd}\right) {\mathcal {G}}\left( \lambda _{ikd}|u_{ikd},v_{ikd}\right) , \end{aligned}$$
(8)
$$\begin{aligned}&p \left( \varvec{\mu }',\varvec{\lambda }' \right) \nonumber \\&\quad = \prod _{i=1}^\infty \prod _{k=1}^\infty \prod _{d=1}^D \mathrm {VM}\left( \varvec{\mu }_{ikd}'|{\varvec{m}}_{ikd}',\beta _{ikd}'\lambda _{ikd}' \right) {\mathcal {G}}\left( \lambda _{ikd}'|u_{ikd}',v_{ikd}' \right) , \end{aligned}$$
(9)

where \({\varvec{m}}_{ikd} = \left( m_{ikd1},m_{ikd2}\right)\) and \({\varvec{m}}_{ikd}' = \left( m_{ikd1}',m_{ikd2}' \right)\).

In our model, similar to [9], we adopt a nonparametric DP [39] as the prior over the parameters \(\varvec{\pi }\), A and C. According to the stick-breaking representation of the DP [35], \(\pi _{i}\), \(a_{ij}\) and \(c_{ik}\) can be represented by

$$\begin{aligned}&\pi _{i} = \pi _{i}'\prod _{n=1}^{i-1}(1-\pi _{n}'), \end{aligned}$$
(10)
$$\begin{aligned}&a_{ij} = a_{ij}'\prod _{n=1}^{j-1}\left( 1-a_{in}' \right) , \end{aligned}$$
(11)
$$\begin{aligned}&c_{ik} = c_{ik}'\prod _{n=1}^{k-1}\left( 1-c_{in}' \right) , \end{aligned}$$
(12)

where \(\varvec{\pi }'\), \(A'\) and \(C'\) are distributed according to Beta distributions

$$\begin{aligned}&p(\varvec{\pi }') = \prod _{i=1}^\infty \mathrm {Beta}\left( 1,\phi _{i}^\pi \right) = \prod _{i=1}^\infty \phi _{i}^\pi \left( 1-\pi _{i}' \right) ^{\phi _{i}^\pi -1}, \end{aligned}$$
(13)
$$\begin{aligned}&p(A') = \prod _{i=1}^\infty \prod _{j=1}^\infty \mathrm {Beta}(1,\phi _{ij}^A)= \prod _{i=1}^\infty \prod _{j=1}^\infty \phi _{ij}^A \left( 1-a_{ij}' \right) ^{\phi _{ij}^A-1}, \end{aligned}$$
(14)
$$\begin{aligned}&p(C') = \prod _{i=1}^\infty \prod _{k=1}^\infty \mathrm {Beta}(1,\phi _{ik}^C)= \prod _{i=1}^\infty \prod _{k=1}^\infty \phi _{ik}^C \left( 1-c_{ik}' \right) ^{\phi _{ik}^C-1}. \end{aligned}$$
(15)
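As an illustration of the stick-breaking construction in (10)–(12), a small sketch (with hypothetical function and variable names) converting Beta-distributed stick proportions into weights is given below; the truncation of Sect. 3 simply fixes the last proportion to one.

```python
import numpy as np

def stick_breaking_weights(v):
    """Map stick proportions v_i (e.g. draws from Beta(1, phi)) to weights
    w_i = v_i * prod_{n<i} (1 - v_n), as in Eqs. (10)-(12)."""
    v = np.asarray(v, dtype=float)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # prod_{n<i}(1 - v_n)
    return v * remaining

# Illustrative draw with truncation level N = 20 and concentration phi = 0.5
rng = np.random.default_rng(0)
v = rng.beta(1.0, 0.5, size=20)
v[-1] = 1.0                         # truncation: the last stick takes all remaining mass
pi = stick_breaking_weights(v)      # the resulting weights sum to 1
```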

3 Model learning algorithm based on VB inference

In this section, we systematically develop an effective learning approach which is tailored for learning the proposed VM-NHMM-Fs through variational Bayes (VB). In our case, our goal is to discover a proper approximation \(q(S, L, {\varvec{z}},\varPhi )\) to the true posterior \(p(S, L, {\varvec{z}},\varPhi |Y)\), where \(\{S, L, {\varvec{z}},\varPhi \}\) denotes the set of latent and unknown variables in VM-NHMM-Fs as described previously. To obtain a tractable inference procedure, we apply the mean-field theory [4] as

$$\begin{aligned} q \left( {\varvec{z}},S, L, \varPhi \right) =q({\varvec{z}})q(S,L)q(\varPhi ). \end{aligned}$$
(16)

The approximations \(q({\varvec{z}})\), \(q(S, L)\) and \(q(\varPhi )\) (also known as variational posteriors) in VB inference are found by maximizing the objective function, namely the evidence lower bound (ELBO), defined by

$$\begin{aligned} \mathrm {ELBO}(q) =&\int q \left( {\varvec{z}},S, L,\varPhi \right) \ln \frac{p \left( Y, {\varvec{z}},S, L,\varPhi \right) }{q({\varvec{z}},S, L, \varPhi )}\mathrm {d}{\varvec{z}}\mathrm {d}S\mathrm {d}L\mathrm {d}\varPhi \nonumber \\ =\,\,&\mathrm {ELBO}\left( q \left( \varvec{\pi }' \right) \right) +\mathrm {ELBO}(q(A')) +\mathrm { ELBO}(q(C')) \nonumber \\ \,\,&+\mathrm {ELBO}(q(\varTheta )) + \mathrm {ELBO}\left( q({\varvec{z}})\right) +Constant. \end{aligned}$$
(17)

In addition, the truncation technique [5] is adopted to truncate the variational posteriors at finite numbers of hidden states and mixture components, N and K respectively, such that

$$\begin{aligned}&\pi _{N}' = 1, \quad \sum _{i=1}^N \pi _{i} = 1, \quad \pi _{i} = 0 \;\;\text {if}\;\; i>N, \end{aligned}$$
(18)
$$\begin{aligned}&a_{iN}' = 1, \quad \sum _{j=1}^N a_{ij} = 1, \quad a_{ij} = 0 \;\;\text {if}\;\; j>N, \end{aligned}$$
(19)
$$\begin{aligned}&c_{iK}' = 1, \quad \sum _{k=1}^K c_{ik} = 1, \quad c_{ik} = 0 \;\;\text {if}\;\; k>K, \end{aligned}$$
(20)

where the appropriate values of N and K are inferred automatically during the VB learning process.

3.1 Optimizing variational posteriors \(q(\varvec{\pi }')\), \(q(C')\) and \(q(A')\)

The variational posteriors of the initial state probability matrix \(q(\varvec{\pi }')\), the state transition matrix \(q(A')\), and the mixing coefficient matrix \(q(C')\) can be optimized by maximizing the ELBO in (17) as

$$\begin{aligned}&q(\varvec{\pi }') = \prod _{i=1}^N\mathrm {Beta}\left( \pi _i'|{\widehat{W}}_i^\pi ,{\widetilde{W}}_i^\pi \right) , \end{aligned}$$
(21)
$$\begin{aligned}&q(A') = \prod _{i=1}^N\prod _{j=1}^N\mathrm {Beta}\left( a_{ij}'|{\widehat{W}}_{ij}^A,{\widetilde{W}}_{ij}^A \right) , \end{aligned}$$
(22)
$$\begin{aligned}&q(C') = \prod _{i=1}^N\prod _{k=1}^K\mathrm {Beta}\left( c_{ik}'|{\widehat{W}}_{ik}^C,{\widetilde{W}}_{ik}^C \right) , \end{aligned}$$
(23)

where the hyperparameters of the above variational posteriors are given by

$$\begin{aligned}&{\widehat{W}}_i^\pi = 1+ q(s_1=i), \end{aligned}$$
(24)
$$\begin{aligned}&{\widetilde{W}}_i^\pi = \phi _i^\pi + \sum _{n=i+1}^Nq(s_1=n), \end{aligned}$$
(25)
$$\begin{aligned}&{\widehat{W}}_{ij}^A = 1+ \sum _{t=1}^{T-1} q \left( s_t=i,s_{t+1}=j \right) , \end{aligned}$$
(26)
$$\begin{aligned}&{\widetilde{W}}_{ij}^A = \phi _{ij}^A+ \sum _{t=1}^{T-1} \sum _{n=j+1}^Nq \left( s_t=i,s_{t+1}=n \right) , \end{aligned}$$
(27)
$$\begin{aligned}&{\widehat{W}}_{ik}^C = 1+ \sum _{t=1}^T q(s_t=i,l_t=k), \end{aligned}$$
(28)
$$\begin{aligned}&{\widetilde{W}}_{ik}^C = \phi _{ik}^C+ \sum _{t=1}^T \sum _{n=k+1}^Kq \left( s_t=i,l_t=n \right) , \end{aligned}$$
(29)

where \(q(s_1)\), \(q(s_t,s_{t+1})\) and \(q(s_t,l_{t})\) are computed with the classic forward-backward algorithm described in [34].
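For reference, a compact sketch of the scaled forward-backward recursions used to obtain these posteriors is given below; it assumes the optimized quantities of Sect. 3.4 are already available, and the interface (array shapes and names) is ours rather than the paper's.

```python
import numpy as np

def forward_backward(pi_star, A_star, B):
    """Scaled forward-backward recursions (cf. [34]).

    pi_star : (N,)   optimized initial weights, cf. Eq. (46)
    A_star  : (N, N) optimized transition weights, cf. Eq. (47)
    B       : (T, N) per-state emission terms sum_k c*_ik p*(y_t | ...), cf. Eq. (45)
    Returns q(s_t = i) and q(s_t = i, s_{t+1} = j)."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    scale = np.zeros(T)

    alpha[0] = pi_star * B[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):                        # forward pass
        alpha[t] = (alpha[t - 1] @ A_star) * B[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    for t in range(T - 2, -1, -1):               # backward pass
        beta[t] = (A_star @ (B[t + 1] * beta[t + 1])) / scale[t + 1]

    gamma = alpha * beta                         # q(s_t = i)
    xi = (alpha[:-1, :, None] * A_star[None, :, :] *
          (B[1:] * beta[1:])[:, None, :] / scale[1:, None, None])  # q(s_t, s_{t+1})
    return gamma, xi
```

The joint posterior \(q(s_t=i, l_t=k)\) then follows by multiplying \(q(s_t=i)\) with the responsibility of component k within state i.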

3.2 Optimizing variational posterior \(q({\varvec{z}})\)

By maximizing the ELBO with respect to the feature saliency indicator \({\varvec{z}}\), we can optimize the variational posterior \(q({\varvec{z}})\) as

$$\begin{aligned} q({\varvec{z}}) =\prod _{t=1}^T\prod _{i=1}^N\prod _{k=1}^K\prod _{d=1}^D\varphi _{tikd}^{z_{tikd}}(1-\varphi _{tikd})^{1-z_{tikd}}, \end{aligned}$$
(30)

where \(\varphi _{tikd}\) can be computed by

$$\begin{aligned} \varphi _{tikd}= & {} \frac{\exp \left( {\widetilde{\varphi }}_{tikd} \right) }{\exp \left( {\widetilde{\varphi }}_{tikd}\right) +\exp \left( {\widehat{\varphi }}_{tikd}\right) }, \end{aligned}$$
(31)
$$\begin{aligned} {\widetilde{\varphi }}_{tikd}= \,\,& {} q(s_t=i,l_t=k) \bigg [\langle \lambda _{ikd}\varvec{\mu }_{ikd}^{\mathrm {T}}{\varvec{x}}_{td}\rangle \nonumber \\&\quad -\bigg (\frac{\partial }{\partial \lambda _{ikd}}\ln I_0({\bar{\lambda }}_{ikd})\bigg )\left( \langle \lambda _{ikd}\rangle -{\bar{\lambda }}_{ikd}\right) \nonumber \\&\quad - \ln I_0 \left( {\bar{\lambda }}_{ikd}\right) \bigg ]+\ln \zeta _{ikd}, \end{aligned}$$
(32)
$$\begin{aligned} {\widehat{\varphi }}_{tikd}= & {} q(s_t=i,l_t=k) \bigg [\langle \lambda _{ikd}'\varvec{\mu }_{ikd}'^{\mathrm {T}}{\varvec{x}}_{td}\rangle \nonumber \\&\quad -\bigg (\frac{\partial }{\partial \lambda _{ikd}'}\ln I_0({\bar{\lambda }}_{ikd}')\bigg ) \bigg (\langle \lambda _{ikd}'\rangle -{\bar{\lambda }}_{ikd}'\bigg )\nonumber \\&\quad - \ln I_0 \left( {\bar{\lambda }}_{ikd}' \right) \bigg ]+\ln (1-\zeta _{ikd}), \end{aligned}$$
(33)

where \(\langle \cdot \rangle\) denotes the expectation, and \(\frac{\partial }{\partial \lambda _{ikd}}\ln I_0 \left( {\bar{\lambda }}_{ikd}\right) = \frac{I_1 \left( {\bar{\lambda }}_{ikd}\right) }{I_0 \left( {\bar{\lambda }}_{ikd}\right) }\) follows from the property \(I_0'(\kappa ) = I_1(\kappa )\) of the modified Bessel function, as discussed in [38].
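The Bessel-function ratio used throughout these updates is readily available in standard libraries; a one-line sketch (the function name is ours):

```python
from scipy.special import i0, i1

def bessel_ratio(lam_bar):
    """d/d(lambda) ln I_0(lambda) at lambda_bar, i.e. I_1(lambda_bar) / I_0(lambda_bar),
    which follows from the property I_0'(kappa) = I_1(kappa)."""
    return i1(lam_bar) / i0(lam_bar)
```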

The saliency of the dth feature in the kth component for the ith hidden state can be calculated by setting the derivative of ELBO with respect to \(\zeta _{ikd}\) to zero as

$$\begin{aligned} \zeta _{ikd}= \frac{1}{T}\sum _{t=1}^T\langle z_{tikd}\rangle , \end{aligned}$$
(34)

where the expectation \(\langle z_{tikd}\rangle = \varphi _{tikd}\).
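In code, (31) and (34) reduce to a sigmoid and an average over time; the sketch below uses hypothetical array names for the bracketed terms of (32)–(33).

```python
import numpy as np

def responsibilities(tilde_phi, hat_phi):
    """Eq. (31): varphi = exp(tilde) / (exp(tilde) + exp(hat)) = sigmoid(tilde - hat).
    tilde_phi, hat_phi : (T, N, K, D) arrays of the terms in Eqs. (32)-(33)."""
    return 1.0 / (1.0 + np.exp(-(tilde_phi - hat_phi)))

def update_feature_saliency(varphi):
    """Eq. (34): zeta_ikd = (1/T) * sum_t <z_tikd>, with <z_tikd> = varphi_tikd."""
    return varphi.mean(axis=0)      # (N, K, D)
```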

3.3 Optimizing variational posterior \(q(\varTheta )\)

Through the maximization of the ELBO with respect to \(\varTheta =\{\varvec{\mu },\varvec{\lambda },\varvec{\mu }',\varvec{\lambda }'\}\), the variational posteriors of the VM distributions \(q \left( \varvec{\mu },\varvec{\lambda }\right)\) and \(q \left( \varvec{\mu }',\varvec{\lambda }' \right)\) can be obtained by

$$\begin{aligned}&q \left( \varvec{\mu },\varvec{\lambda }\right) \nonumber \\&\quad = \prod _{i=1}^N\prod _{k=1}^K\prod _{d=1}^D\mathrm {VM}\left( \varvec{\mu }_{ikd}| {\varvec{m}}_{ikd}^{*},\beta _{ikd}^{*}\lambda _{ikd}\right) {\mathcal {G}}\left( \lambda _{ikd}|u_{ikd}^{*},v_{ikd}^{*}\right) , \end{aligned}$$
(35)
$$\begin{aligned}&q \left( \varvec{\mu }',\varvec{\lambda }' \right) \nonumber \\&\quad = \prod _{i=1}^N\prod _{k=1}^K\prod _{d=1}^D\mathrm {VM}\left( \varvec{\mu }_{ikd}'|{\varvec{m}}_{ikd}'^{*}, \beta _{ikd}'^{*}\lambda _{ikd}' \right) {\mathcal {G}}\left( \lambda _{ikd}'|u_{ikd}'^{*},v_{ikd}'^{*}\right) , \end{aligned}$$
(36)

where the hyperparameters can be computed by

$$\begin{aligned} \beta _{ikd}^{*}=\,\, & {} \Vert \beta _{ikd} {\varvec{m}}_{ikd} + \sum _{t=1}^{T}q(s_t=i,l_t=k) \langle z_{tikd}\rangle {\varvec{x}}_{td}\Vert , \end{aligned}$$
(37)
$$\begin{aligned} {\varvec{m}}_{ikd}^{*}=\,\, & {} \frac{1}{\beta _{ikd}^{*}}\bigg (\beta _{ikd}{\varvec{m}}_{ikd}+ \sum _{t=1}^{T}q(s_t=i,l_t=k) \langle z_{tikd}\rangle {\varvec{x}}_{td}\bigg ), \end{aligned}$$
(38)
$$\begin{aligned} u_{ikd}^*=\,\, & {} u_{ikd}+\beta _{ikd}^{*}{\bar{\lambda }}_{ikd}\bigg (\frac{\partial }{\partial \beta _{ikd}^{*}\lambda _{ikd}}\ln I_0(\beta _{ikd}^{*}{\bar{\lambda }}_{ikd})\bigg ), \end{aligned}$$
(39)
$$\begin{aligned} v_{ikd}^*=\,\, & {} v_{ikd}+\sum _{t=1}^{T}q(s_t=i,l_t=k) \langle z_{tikd}\rangle \bigg (\frac{\partial }{\partial \lambda _{ikd}}\ln I_0 \left( {\bar{\lambda }}_{ikd}\right) \bigg )\nonumber \\&\quad +\beta _{ikd}\bigg (\frac{\partial }{\partial \beta _{ikd}\lambda _{ikd}}\ln I_0(\beta _{ikd}{\bar{\lambda }}_{ikd})\bigg ), \end{aligned}$$
(40)
$$\begin{aligned} \beta _{ikd}'^*=\,\, & {} \Vert \beta _{ikd}' {\varvec{m}}_{ikd}' + \sum _{t=1}^{T}q(s_t=i,l_t=k) \langle 1-z_{tikd}\rangle {\varvec{x}}_{td}\Vert , \end{aligned}$$
(41)
$$\begin{aligned} {\varvec{m}}_{ikd}'^*=\,\, & {} \frac{1}{\beta _{ikd}'^{*}}\bigg (\beta _{ikd}'{\varvec{m}}_{ikd}'+ \sum _{t=1}^{T}q(s_t=i,l_t=k) \langle 1-z_{tikd}\rangle {\varvec{x}}_{td}\bigg ), \end{aligned}$$
(42)
$$\begin{aligned} u_{ikd}'^*=\,\, & {} u_{ikd}'+\beta _{ikd}'^*{\bar{\lambda }}_{ikd}'\bigg (\frac{\partial }{\partial \beta _{ikd}'^*\lambda _{ikd}'}\ln I_0(\beta _{ikd}'^*{\bar{\lambda }}_{ikd}')\bigg ), \end{aligned}$$
(43)
$$\begin{aligned} v_{ikd}'^*=\,\, & {} v_{ikd}'+\sum _{t=1}^{T}q(s_t=i,l_t=k) \langle 1-z_{tikd}\rangle \bigg (\frac{\partial }{\partial \lambda _{ikd}'}\ln I_0({\bar{\lambda }}_{ikd}')\bigg ) \nonumber \\&\quad +\beta _{ikd}'\bigg (\frac{\partial }{\partial \beta _{ikd}'\lambda _{ikd}'}\ln I_0(\beta _{ikd}'{\bar{\lambda }}_{ikd}')\bigg ). \end{aligned}$$
(44)

3.4 Optimizing variational posterior \(q(S, L)\)

Lastly, the joint variational posterior \(q(S, L)\), where S represents the state indicators and L the mixture component indicators, is optimized by maximizing the ELBO with respect to S and L

$$\begin{aligned} q(S,L) = \frac{1}{\varOmega } \pi _{s_1}^{*} \prod _{t=1}^{T-1}a_{s_ts_{t+1}}^{*} \prod _{t=1}^{T}c_{s_t, l_t}^{*} p^{*}\left( {\varvec{y}}_{t}|\varTheta ,{\varvec{z}}_{t}\right) , \end{aligned}$$
(45)

where

$$\begin{aligned} \pi _{i}^*=\,\, & {} \exp \bigg \{ \varPsi \left( {\widehat{W}}_i^\pi \right) - \varPsi \left( {\widehat{W}}_i^\pi +{\widetilde{W}}_i^\pi \right) + \sum _{n=1}^{i-1}\bigg [\varPsi \left( {\widetilde{W}}_n^\pi \right) \nonumber \\&\quad - \varPsi \left( {\widehat{W}}_n^\pi +{\widetilde{W}}_n^\pi \right) \bigg ]\bigg \}, \end{aligned}$$
(46)
$$\begin{aligned} a_{ij}^*=\,\, & {} \exp \bigg \{\varPsi ({\widehat{W}}_{ij}^A) - \varPsi ({\widehat{W}}_{ij}^A+{\widetilde{W}}_{ij}^A)+ \sum _{n=1}^{j-1}\bigg [\varPsi \left( {\widetilde{W}}_{in}^A \right) \nonumber \\&\quad - \varPsi \left( {\widehat{W}}_{in}^A+{\widetilde{W}}_{in}^A \right) \bigg ]\bigg \}, \end{aligned}$$
(47)
$$\begin{aligned} c_{ik}^*=\,\, & {} \exp \bigg \{\varPsi \left( {\widehat{W}}_{ik}^C \right) - \varPsi \left( {\widehat{W}}_{ik}^C+{\widetilde{W}}_{ik}^C \right) + \sum _{n=1}^{k-1}\bigg [\varPsi \left( {\widetilde{W}}_{in}^C \right) \nonumber \\&\quad - \varPsi \left( {\widehat{W}}_{in}^C+{\widetilde{W}}_{in}^C \right) \bigg ]\bigg \}, \end{aligned}$$
(48)
$$\begin{aligned} p^*\left( {\varvec{y}}_{t}|\varTheta ,{\varvec{z}}_{t}\right)=\,\, & {} \exp \bigg \{\sum _{d=1}^D \langle z_{tikd}\rangle \bigg [ \langle \lambda _{ikd}\varvec{\mu }_{ikd}^{\mathrm {T}}{\varvec{x}}_{td}\rangle \nonumber \\&\quad -\ln 2\pi - \ln I_0 \left( {\bar{\lambda }}_{ikd}\right) \nonumber \\&\quad -\bigg (\frac{\partial }{\partial \lambda _{ikd}}\ln I_0 \left( {\bar{\lambda }}_{ikd}\right) \bigg )(\langle \lambda _{ikd}\rangle -{\bar{\lambda }}_{ikd})\bigg ]\nonumber \\&\quad +\sum _{d=1}^D\langle 1-z_{tikd}\rangle \bigg [ \langle \lambda _{ikd}'\varvec{\mu }_{ikd}'^{\mathrm {T}}{\varvec{x}}_{td}\rangle \nonumber \\&\quad -\ln 2\pi -\bigg (\frac{\partial }{\partial \lambda _{ikd}'}\ln I_0({\bar{\lambda }}_{ikd}')\bigg )\left( \langle \lambda _{ikd}'\rangle -{\bar{\lambda }}_{ikd}' \right) \nonumber \\&\quad - \ln I_0 \left( {\bar{\lambda }}_{ikd}' \right) \bigg ] \bigg \}, \end{aligned}$$
(49)

where \(\varOmega\) in (45) is the normalizing constant and is given by

$$\begin{aligned} \varOmega&= q(Y|\varPhi ^{*}) = \sum _{S, L} \pi _{s_1}^{*} \prod _{t=1}^{T-1}a_{s_ts_{t+1}}^{*} \prod _{t=1}^{T}c_{s_t, l_t}^{*}p^{*}\left( {\varvec{y}}_{t}|\varTheta ,{\varvec{z}}_t \right) . \end{aligned}$$
(50)

It is noteworthy that, comparing (50) with (6), \(\varOmega\) can be regarded as an approximation to the likelihood of the model with the optimized parameters \(\varPhi ^{*}\).

Algorithm 1 summarizes the VB inference algorithm for learning the VM-NHMM-Fs model. This learning algorithm is guaranteed to converge because the ELBO in (17) is convex with respect to each individual variational posterior [4]. Convergence can be detected by monitoring the ELBO: the algorithm stops when the difference between the ELBO values of two consecutive iterations falls below a predefined threshold.

Algorithm 1 VB inference for learning the VM-NHMM-Fs model
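To summarize the overall procedure, the following high-level Python sketch shows the iteration and convergence structure of Algorithm 1; the update_* routines are hypothetical placeholders standing in for the closed-form updates of Sects. 3.1–3.4.

```python
import numpy as np

def fit_vm_nhmm_fs(Y, n_states=20, n_components=30, max_iter=200, tol=1e-5):
    """Skeleton of the VB learning loop (Algorithm 1); the helper functions
    are placeholders for the updates derived in Sects. 3.1-3.4."""
    params = initialise_priors(Y, n_states, n_components)        # hyperparameters of Sect. 2.3
    elbo_old = -np.inf
    for _ in range(max_iter):
        q_state = update_state_posteriors(Y, params)             # Sect. 3.4 via forward-backward
        params = update_weights(q_state, params)                 # Sect. 3.1: q(pi'), q(A'), q(C')
        params = update_feature_indicators(Y, q_state, params)   # Sect. 3.2: q(z) and saliencies
        params = update_emissions(Y, q_state, params)            # Sect. 3.3: q(mu, lambda), q(mu', lambda')
        elbo = evidence_lower_bound(Y, q_state, params)          # Eq. (17)
        if abs(elbo - elbo_old) < tol:                           # convergence criterion
            break
        elbo_old = elbo
    return params
```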

4 Experimental results

The proposed nonparametric HMM with localized feature selection (VM-NHMM-Fs) is evaluated through experiments on both synthetic and real-world time series or sequential data sets. We set the initial truncation values of N and K in our experiments to 20 and 30, respectively. The initial value of the feature saliency hyperparameter \(\zeta\) is set to 0.5. The hyperparameters \(\phi ^\pi\), \(\phi ^A\) and \(\phi ^C\) of the stick-breaking representation are all initialized to 0.5. The hyperparameters m and \(m'\) are initialized to the mean of the data set. The remaining hyperparameters are initialized as \((\beta , \beta ', u, u', v, v') = (0.01, 0.01, 0.3, 0.3, 0.05, 0.05)\). For all experiments, we report the average performance of our model over 20 runs.
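For reproducibility, the settings listed above can be collected in a single configuration; the snippet below simply mirrors the reported values (the key names are illustrative only).

```python
# Initialization reported above; key names are ours, not the paper's
init = {
    "N": 20, "K": 30,                              # truncation levels (states / components)
    "zeta": 0.5,                                   # initial feature saliency
    "phi_pi": 0.5, "phi_A": 0.5, "phi_C": 0.5,     # stick-breaking hyperparameters
    "beta": 0.01, "beta_prime": 0.01,
    "u": 0.3, "u_prime": 0.3,
    "v": 0.05, "v_prime": 0.05,
}
# m and m' are initialized to the mean of the (L2-normalized) data set
```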

4.1 Experiments on synthetic sequential data

In this part, a synthetic sequential data set is generated to validate the effectiveness of the proposed learning approach in inferring parameters and selecting important features for the proposed VM-NHMM-Fs.

Our synthetic sequential data set contains a sequence of 3000 data points generated from 2 hidden states, where State 1 generates the sequential observations at \(t = 1 : 1500\) and State 2 generates the observations at \(t = 1501 : 3000\). In each state, a mixture of two 3-dimensional VM densities corresponding to relevant features (i.e. 3 relevant features in total) was used as the emission density. The parameters adopted for generating the 3 relevant features are shown in Table 1. We then generated 12 irrelevant features according to a common VM distribution with parameters \(\varvec{\mu } = (0, 1)\) and \(\lambda = 1\) and appended them to the 3 relevant features to form a 15-dimensional data set.
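A sketch of how such a sequence can be generated is shown below; since Table 1 is not reproduced here, the mean directions and concentrations of the relevant features are placeholders, and only the overall structure (2 states, 2 VM components per state, 3 relevant plus 12 irrelevant features) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_vm_feature(mean_dir, lam, n):
    """Draw n values of one feature from a 2-D von Mises factor (cf. Eq. (1)):
    sample an angle around the mean direction and keep its first coordinate."""
    theta0 = np.arctan2(mean_dir[1], mean_dir[0])
    return np.cos(rng.vonmises(theta0, lam, size=n))

def random_unit_dirs(d):
    v = rng.normal(size=(d, 2))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def sample_state(components):
    """components: list of (n_k, mu (3,2), lam (3,)) for the VM densities of one state."""
    blocks = []
    for n_k, mu, lam in components:
        blocks.append(np.column_stack(
            [sample_vm_feature(mu[d], lam[d], n_k) for d in range(3)]))
    return np.vstack(blocks)

# Placeholder component parameters standing in for Table 1
comps1 = [(750, random_unit_dirs(3), np.full(3, 10.0)), (750, random_unit_dirs(3), np.full(3, 10.0))]
comps2 = [(750, random_unit_dirs(3), np.full(3, 10.0)), (750, random_unit_dirs(3), np.full(3, 10.0))]
relevant = np.vstack([sample_state(comps1), sample_state(comps2)])                 # (3000, 3)

# 12 irrelevant features from the common VM with mu = (0, 1) and lambda = 1
irrelevant = np.column_stack(
    [sample_vm_feature(np.array([0.0, 1.0]), 1.0, 3000) for _ in range(12)])
data = np.hstack([relevant, irrelevant])                                           # (3000, 15)
```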

Table 1 The parameters for generating the 3 relevant features for the 15-dimensional data set, where S1 and S2 indicate state 1 and state 2, respectively; \(n_k\) denotes the number of data points that are generated from the kth VM density, d represents the feature number

To verify the “correctness” of the proposed VB learning algorithm, we compared the true values of the parameters used to generate the data set with the corresponding estimated values, as in [3]. The comparison results for the synthetic sequential data set are reported in Tables 2 and 3 for State 1 and State 2, respectively. From these tables, we can see that the proposed VB inference algorithm estimates the model parameters accurately, which illustrates its effectiveness.

Table 2 The comparison of the true and the estimated parameters by the proposed VM-NHMM-Fs under State 1 for the synthetic data set
Table 3 The comparison of the true and the estimated parameters by the proposed VM-NHMM-Fs under State 2 for the synthetic data set
Fig. 2 Average feature saliencies on the synthetic data set obtained by VM-NHMM-Fs, plus and minus one standard deviation over 20 runs

Next, we test the feature selection performance of our VM-NHMM-Fs on the synthetic data set. The feature selection results in terms of feature saliency (i.e. the values of \(\{\zeta _d\}\)) are shown in Fig. 2. According to this figure, high degrees of relevancy (i.e. above 0.9) have been assigned to the first three features, while the remaining 12 features are considered irrelevant due to their low saliencies (i.e. close to 0). These results are consistent with the true settings of the synthetic sequential data set.

4.2 Experiments on real data sets

4.2.1 Data sets and experimental settings

In this part, the effectiveness of the proposed VM-NHMM-Fs was validated by conducting experiments on real sequential data sets in unsupervised clustering applications. We adopted two real data sets from the UCI machine learning repository: the gesture phase segmentation data set and the epileptic seizure recognition data set.

The gesture phase segmentation data set contains seven videos recorded with a Microsoft Kinect sensor and temporally segmented into gesture phases (rest, preparation, stroke, hold and retraction). In our case, we tested the performance of VM-NHMM-Fs on three videos of this data set: A1 (1747 frames), A2 (1264 frames) and A3 (1834 frames), where each video includes an original version and a processed version. Each frame is described by 50 features, of which 18 are obtained from the original videos and 32 are extracted from the processed videos.

The epileptic seizure recognition data set that we adopted is a pre-processed version of a data set for epileptic seizure detection, as described in the UCI machine learning repository. It contains 11500 observations, each consisting of 178 data points, where each data point represents the EEG value recorded at a different point in time. The data set contains five classes: (1) EEG of seizure activity; (2) EEG from the area where the tumor was located; (3) EEG from the healthy brain area; (4) EEG recorded while the patient had their eyes closed; (5) EEG recorded while the patient had their eyes open.

Table 4 The average recognition performance over 20 runs by different approaches
Fig. 3 Average feature saliencies for the resting phase of the gesture phase segmentation data set obtained by VM-NHMM-Fs, plus and minus one standard deviation over 20 runs

In our experiments, both data sets were \(L_2\) normalized and then modeled by the proposed VM-NHMM-Fs. To demonstrate the advantages of our model, we compared it with other well-established HMMs that employ different mixture models: the HMM with Gaussian mixture models (GMM-HMM) [21], the HMM with Gaussian mixture models and unsupervised feature selection (GMM-HMM-Fs) [43], the HMM with Dirichlet mixture models (DMM-HMM) [11], the HMM with inverted Dirichlet mixture models (IDMM-HMM) [30] and the HMM with VMF mixture models (VMF-HMM) [15]. Furthermore, to evaluate the importance of integrating feature selection in our model, we applied the proposed model both with localized feature selection (VM-NHMM-Fs) and without it (denoted by VM-NHMM). For the compared models, we adopted the same parameter values as in their original papers, and all models were evaluated on the same data sets as described above.
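The \(L_2\) normalization step mentioned above is a simple row-wise rescaling; a minimal sketch (the function name is ours):

```python
import numpy as np

def l2_normalize(X):
    """Row-wise L2 normalization applied to both data sets before model fitting."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.clip(norms, 1e-12, None)
```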

Fig. 4 Average feature saliencies for the class of seizure activity of the epileptic seizure recognition data set obtained by VM-NHMM-Fs, plus and minus one standard deviation over 20 runs

In our experiment, we set the initial number of states for both data sets to \(N=20\); the optimal number of states was then determined automatically during model learning. According to the results obtained by the proposed VM-NHMM-Fs, the gesture phase segmentation data set and the epileptic seizure recognition data set eventually converged to 3 and 2 states, respectively. For the other tested approaches, the number of hidden states was set manually. Table 4 shows the recognition performance of the different models on the two real data sets. As can be seen from this table, both VM-NHMM-Fs and VM-NHMM outperform the other HMM-based approaches with higher recognition accuracies on all data sets, which verifies the merits of applying nonparametric VM-based HMMs for modeling gestures and EEG data. Another advantage of our approach is that, in contrast with the other tested HMM-based approaches, in which the number of clusters was determined through an extra evaluation step based on model selection scores, this number in our case was automatically determined during the inference procedure thanks to the nonparametric framework of the Dirichlet process. According to Table 4, we may also notice the improvement in performance when feature selection is integrated with VM-NHMM, by comparing the results of VM-NHMM-Fs with those of VM-NHMM.

The feature saliencies obtained by VM-NHMM-Fs for the 50-dimensional gesture phase data vectors of the resting phase are shown in Fig. 3. It can be seen from this figure that 7 features obtain low degrees of relevance (i.e. saliencies less than 0.5) and are therefore considered irrelevant in the modeling process, while the remaining features are considered relevant due to their high feature saliencies (i.e. greater than 0.5). Figure 4 illustrates the feature saliencies obtained by VM-NHMM-Fs for the class of seizure activity of the epileptic seizure recognition data set. Based on this figure, different features contribute differently to the task of epileptic seizure recognition: 22 of the 178 features obtain relatively low saliencies (i.e. less than 0.5) and therefore contribute less to data modeling.

5 Conclusion

In this work, a nonparametric HMM has been proposed for modeling time series or sequential spherical data vectors. In our model, the emission distribution of each hidden state is a mixture of VM distributions, which has shown better capability for modeling spherical data than other commonly used distributions (such as the Gaussian distribution). We constructed our NHMM by leveraging the Bayesian nonparametric DP framework, so that the number of hidden states and the number of mixture components per state can be automatically adjusted according to the observed data. In addition, to deal with high-dimensional data sets which may contain irrelevant or noisy features, an unsupervised localized feature selection method was incorporated into the proposed NHMM, resulting in a unified framework that simultaneously performs data modeling and feature selection. The proposed model was learned with an effective algorithm based on VB inference. The advantages of our model were demonstrated on both simulated and real-world data sets. In particular, according to the experimental results, our model outperformed the other tested HMM-based models by at least 4.2% in gesture recognition and at least 2.6% in epileptic seizure recognition.

One limitation of the proposed NHMM is that it is not very efficient when dealing with large-scale data sets, mainly because of the batch learning strategy of the conventional VB inference adopted in our work. A possible future direction is therefore to extend the developed VB inference algorithm with stochastic variational Bayes (SVB) [18], which has proven efficient for learning over large data sets through stochastic optimization. Moreover, in recent years, deep learning techniques have been successfully applied in different fields owing to their promising capability of automatically extracting meaningful representations from observed data. Another interesting direction for future work is therefore to integrate deep neural networks (e.g. the variational auto-encoder [23]) with the proposed NHMM, leveraging the more representative features learned by these deep models to improve its performance.