1 Introduction

HMMs [1] are a simple yet powerful tool for representing and predicting sequential events [2] and are widely used in many types of data-driven tasks. The concept of HMMs is primarily based on Markov Chains [3, 4] (proposed by Andrey Markov in the early 20th century) but was formally developed later in many works. The key idea of HMMs is that a latent (state) variable evolves according to a discrete, first-order Markov process. More specifically, the modeled process/data is a sequence of states or values that are unknown (hidden), where each hidden state depends only on the previous hidden state in the sequence. This Markov Chain of hidden states is associated with a sequence of known values (observations) of the same length. Every hidden state emits an observation that follows a well-defined probability distribution over the space of observations, and each observation is conditionally independent of every other observation given the value of its associated hidden state. By their structure, HMMs can address a variety of tasks through three main functionalities [5, 6]: evaluation, decoding (inference), and learning. Evaluation is the computation of the probability of an observation sequence given an HMM. Decoding is the task of inferring the most probable sequence of hidden states given a defined HMM and a sequence of known observations. Learning is the search for the best parameters of the HMM (learning the HMM) given an observation sequence and the set of possible hidden states in the model.

By their definition, HMMs are an excellent choice for data tasks that involve non-observable sequential values, as their structure allows inferring these latent values from the observable signals, or even predicting their future trends. This flexibility makes HMMs a strong candidate for a variety of applications such as genetics and biomedical engineering [7, 8], climate modeling [9], signal processing [10], stock market prediction [11], speech recognition [12], video recognition [13], and information retrieval systems [14], to name a few.

The observation emission, i.e., the formulation of the conditional dependence between the observations and the hidden states of the HMM, is generally a deciding factor for the behaviour of the model, and is also our area of interest in this paper. For continuous data, the observation emission probability distributions associated with the hidden states often have a specific form from a parametric class such as Gaussian, Gamma, or Poisson. In this regard, multiple works have further explored the emission distributions and introduced mixture models as an alternative [15]. This has led to some very useful variants of HMMs, perhaps the most popular one being the Gaussian mixture model HMM (GMMHMM). This prevalence of the GMMHMM stems from the convenience of the GMM, as it provides a natural way to cluster the data and has relatively simple implementations and parameters. However, Gaussian-based distributions do not account for several natural characteristics of real-world data sets, including the presence of outliers [16], asymmetry, and the bounded region of space that the data occupies. Hence, HMMs with Gaussian-based emissions can be limited when dealing with outlier-heavy or significantly asymmetric data, which is often the case. Some of these issues have been tackled in [17] by introducing a bounded asymmetric Gaussian mixture [18] as an emission distribution for the HMM, but the low outlier tolerance of the Gaussian distribution remains a problem.

On this matter, the Student’s t-distribution [19] is an excellent alternative to the Gaussian when fitting skewed or heavy-tailed populations; thus, the multivariate finite Student’s t-Mixture Model (SMM) [20] can provide a more robust fit than the GMM in the presence of significant proportions of outliers in the data. Multiple articles have explored the potential of HMMs with SMM emissions, as in [21,22,23], but the idea of customizing this model within the HMM to better fit real-world data has not been examined yet. In fact, while SMMs are an excellent solution for handling outliers, they assume, by their mathematical definition, that the examined data is symmetric and spans an unbounded range, which is not a realistic depiction of most data sets.

This motivated us to introduce the BASMMHMM, an HMM with Bounded Asymmetric Student’s t-Mixture Model (BASMM) emissions. This model addresses the drawbacks observed in the previously proposed HMMs: its emission distributions not only fit observed data outliers (through heavy distribution tails), but also tolerate the natural imperfections of the data (through asymmetry) and take into account the fact that the data usually spans only finite regions of its space. We train our BASMMHMM using the Baum–Welch Expectation Maximization (EM) algorithm, and we apply it to a selection of popular real-world tasks where HMMs are a very effective tool: occupancy estimation [24], stock price prediction, and human activity recognition [25].

This paper is laid out as follows: in the first section, we introduce the general scope and the motivations for this work. In the second section, we present the emission distribution of our proposed HMM, the multivariate Bounded Asymmetric Student’s-t Mixture Model (BASMM). In the third section, we review the necessary mathematical definitions and present the HMM with BASMM emissions. The fourth section features experiments using the proposed model as well as their results. Following these results, we also establish a comparison between our proposed model and other types of HMMs with a variety of emissions. Finally, the fifth section concludes this work and discusses possible paths of improvement.

The mathematical notation used throughout the rest of the paper is detailed in Table 1.

Table 1 BASMMHMM notations

2 Multivariate bounded asymmetric student’s-t Mixture Model

The BASMM [26] is a generalization of the SMM in which the specific location of the modeled data in its space (bounded support) and its natural asymmetry are taken into consideration. Being based on the multivariate Student’s t-distribution, the BASMM and the SMM are more robust than other popular mixture models like the GMM. In fact, unlike the Gaussian density function, the Student’s t-density function has an additional parameter, the degrees of freedom \(\nu\), which acts as a robustness tuning parameter. As a result, the t-distribution provides a heavy-tailed alternative to the Gaussian distribution (see Fig. 1) for potential outliers in the data, and therefore the SMM can produce a clustering algorithm that is more outlier-tolerant than the GMM.

Fig. 1 Student’s-t versus Gaussian probability density functions (univariate case)

2.1 Multivariate bounded asymmetric student’s t-distribution

We begin this section by building up the mathematical definitions that hold the basis for our model, starting with the multivariate Student’s t-distribution, and leading up to the BASMM.

Let \(t\) be a multivariate Student’s-t probability density function with the following parameters: a mean \(\mu\), a covariance matrix \(\Sigma\), and \(\nu\) degrees of freedom. For a multivariate vector \(x\) of dimension \(d\), and given the aforementioned parameters, Student’s-t can be written as follows [27]:

$$\begin{aligned} t(x|\mu , \Sigma ,\nu ) = \frac{\Gamma \big (\frac{\nu +d}{2}\big )|\Sigma |^{-1/2}(\nu \pi )^{-d/2}}{\Gamma (\nu /2)[1+\nu ^{-1}\Delta (x,\mu ;\Sigma )]^{(\nu +d)/2}} \end{aligned}$$
(1)

where \(\Gamma (x)\) is the Gamma function and \(\Delta (x,\mu ;\Sigma )\) is the squared Mahalanobis distance. Both functions have the following definitions, respectively:

$$\begin{aligned} & \Gamma (y)=\int _0^{\infty } x^{y-1}e^{-x} \,dx \quad ;\quad y>0 \end{aligned}$$
(2)
$$\begin{aligned} & \Delta (x,\mu ;\Sigma ) = (x-\mu )^{T}\Sigma ^{-1}(x-\mu ) \end{aligned}$$
(3)
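For concreteness, the density of Eq. (1) can be evaluated numerically as in the minimal Python sketch below; the function and variable names (mu, Sigma, nu) are ours and simply mirror the notation above.

```python
# A minimal sketch of the multivariate Student's t density of Eq. (1);
# names (mu, Sigma, nu) mirror the notation above and are illustrative only.
import numpy as np
from scipy.special import gammaln

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance of Eq. (3)."""
    diff = x - mu
    return float(diff @ np.linalg.solve(Sigma, diff))

def student_t_pdf(x, mu, Sigma, nu):
    """Multivariate Student's t density t(x | mu, Sigma, nu) of Eq. (1), computed in log-space."""
    d = x.shape[0]
    delta = mahalanobis_sq(x, mu, Sigma)
    log_num = gammaln((nu + d) / 2.0) - 0.5 * np.linalg.slogdet(Sigma)[1] - (d / 2.0) * np.log(nu * np.pi)
    log_den = gammaln(nu / 2.0) + ((nu + d) / 2.0) * np.log1p(delta / nu)
    return np.exp(log_num - log_den)

# The heavy tails are visible far from the mean: for small nu the t density
# decays much more slowly than the Gaussian.
x = np.array([3.0, 3.0])
print(student_t_pdf(x, np.zeros(2), np.eye(2), nu=3.0))
```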

It is worth noting that the definition of the t-distribution density function differs between univariate and multivariate populations. Considering that most real-world data-related tasks feature multivariate observations, we will not tackle the univariate case in this paper. Hence, all the probability density functions, as well as the rest of the mathematical construction of our model, are presented for a multivariate random variable \(x\). If we add asymmetry to the multivariate t, with a left covariance \(\Sigma _l\) and a right covariance \(\Sigma _r\), we obtain the following density function \(\mathcal {T}\):

$$\begin{aligned} \mathcal {T}(x|\mu , \Sigma _l,\Sigma _r, \nu )={\left\{ \begin{array}{ll} t(x|\mu , \Sigma _l, \nu ) & \text {if }x\le 0\\ t(x|\mu , \Sigma _r, \nu ) & \text {otherwise} \end{array}\right. } \end{aligned}$$
(4)

We determine whether the multivariate vector \(x\) is less than the zero of \(\mathbb {R}^d\) by calculating the sum \(A\) of all the components of \(x\):

$$\begin{aligned} A=\sum _{i=1}^{d} x_i \end{aligned}$$
(5)

If \(A<0\), then \(x<0\), otherwise we consider \(x\ge 0\). When we add bounded support \(\Omega\) to the multivariate asymmetric t density function, we get the following probability density function \(\mathcal {S}\):

$$\begin{aligned} & \mathcal {S}(x|\theta ) = \frac{\mathcal {T}(x|\mu ,\Sigma _l,\Sigma _r,\nu )\times h(x,\Omega )}{\int _{\Omega } \mathcal {T}(y|\mu ,\Sigma _l,\Sigma _r,\nu ) \,dy} \end{aligned}$$
(6)

where \(h\) is an indicator function that bounds the multivariate t by \(\Omega \subset \mathbb {R}^d\) and is defined as follows:

$$\begin{aligned} h(x,\Omega )={\left\{ \begin{array}{ll} 1 & \text {if } x\in \Omega \\ 0 & \text {otherwise} \end{array}\right. } \end{aligned}$$
(7)

and \(\theta =\{\mu , \Sigma _l,\Sigma _r, \nu , \Omega \}\) is the set of parameters that fully defines the multivariate bounded asymmetric t-distribution.
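The bounded asymmetric density of Eqs. (4)–(7) can be sketched as follows, reusing `student_t_pdf` from the previous snippet. The support \(\Omega\) is assumed here to be a hyper-rectangle given by per-dimension bounds, and the normalizing integral of Eq. (6) is approximated by plain Monte Carlo; the paper does not prescribe a particular numerical scheme, so both choices are ours.

```python
# Sketch of S(x | theta), Eqs. (4)-(7); assumes student_t_pdf from the previous snippet.
import numpy as np

def asym_t_pdf(x, mu, Sigma_l, Sigma_r, nu):
    """Asymmetric t of Eq. (4): choose the left/right covariance from the sign of sum(x), Eq. (5)."""
    Sigma = Sigma_l if np.sum(x) < 0 else Sigma_r
    return student_t_pdf(x, mu, Sigma, nu)

def in_support(x, omega_lo, omega_hi):
    """Indicator h(x, Omega) of Eq. (7) for a rectangular support [omega_lo, omega_hi]."""
    return float(np.all(x >= omega_lo) and np.all(x <= omega_hi))

def bounded_asym_t_pdf(x, mu, Sigma_l, Sigma_r, nu, omega_lo, omega_hi, n_mc=20000, rng=None):
    """Bounded asymmetric t of Eq. (6), with a Monte Carlo estimate of the normalizer."""
    rng = np.random.default_rng(0) if rng is None else rng
    num = asym_t_pdf(x, mu, Sigma_l, Sigma_r, nu) * in_support(x, omega_lo, omega_hi)
    # normalizer = integral of the asymmetric t over Omega
    #            ~ volume(Omega) * average density over uniform draws from Omega
    draws = rng.uniform(omega_lo, omega_hi, size=(n_mc, len(mu)))
    avg = np.mean([asym_t_pdf(s, mu, Sigma_l, Sigma_r, nu) for s in draws])
    volume = np.prod(np.asarray(omega_hi, float) - np.asarray(omega_lo, float))
    return num / (volume * avg)
```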

2.2 Relation to the multivariate Gaussian distribution

According to [27, 28], the multivariate t-distribution is conditionally related to the normal distribution: if the random variable \(x\) follows a multivariate t-distribution with a mean \(\mu\), a covariance matrix \(\Sigma\), and \(\nu\) degrees of freedom, then, given a precision parameter \(\phi\), \(x\) follows a multivariate Gaussian distribution \(n\) with mean \(\mu\) and covariance \(\frac{\Sigma }{\phi }\), where \(\phi\) is a Gamma-distributed [29] variable with both shape and rate parameters equal to \(\frac{\nu }{2}\): \(\phi \sim \mathcal {G}(\frac{\nu }{2},\frac{\nu }{2})\) (see Eq. (8)).

$$\begin{aligned} x\sim t(\mu ,\Sigma ,\nu ) \ \Longleftrightarrow \ x|\phi \sim n\left(\mu ,\frac{\Sigma }{\phi }\right) \text { and } \nonumber \\ \quad \phi \sim \mathcal {G}\left(\frac{\nu }{2},\frac{\nu }{2}\right) \end{aligned}$$
(8)

By applying Bayes’ theorem, we find that the multivariate t-density function is the product of the Gaussian density and the Gamma density with the parameters explained above, which gives us Eq. (9).

$$\begin{aligned} t(x|\mu ,\Sigma ,\nu ) = n\Big (x|\mu ,\frac{\Sigma }{\phi }\Big ) \times \mathcal {G}(\phi ) \end{aligned}$$
(9)

where \(\mathcal {G}\) is the Gamma probability density function with both shape and rate parameters equal to \(\frac{\nu }{2}\):

$$\begin{aligned} & \mathcal {G}(\phi )=\frac{\big (\frac{\phi \nu }{2}\big )^{\frac{\nu }{2}} \exp {\big (\frac{-\phi \nu }{2}\big )}}{\phi \Gamma (\frac{\nu }{2})} \end{aligned}$$
(10)
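The relation in Eq. (8) can be checked numerically: drawing \(\phi\) from a Gamma distribution with shape and rate \(\nu/2\) and then \(x|\phi\) from a Gaussian with covariance \(\Sigma/\phi\) reproduces the t-distribution. The short sketch below does this with NumPy (which parameterizes the Gamma by shape and scale, hence the scale \(2/\nu\)); the sample covariance should approach \(\frac{\nu}{\nu-2}\Sigma\) for \(\nu>2\).

```python
# Numerical check of Eq. (8): Gamma-mixed Gaussian draws behave like Student's t draws.
import numpy as np

rng = np.random.default_rng(42)
d, nu = 2, 5.0
mu, Sigma = np.zeros(d), np.eye(d)

phi = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=20000)   # rate nu/2 <=> scale 2/nu
x = np.array([rng.multivariate_normal(mu, Sigma / p) for p in phi])

print(np.cov(x, rowvar=False))   # approx. nu/(nu-2) * Sigma = 1.667 * I for nu = 5
```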

As for the multivariate Gaussian distribution with a mean vector \(\mu\) and a covariance matrix \(\Sigma\), the probability density function is:

$$\begin{aligned} n(x|\mu ,\Sigma ) = \frac{\exp \Big (-\frac{1}{2}(x-\mu )^T\Sigma ^{-1}(x-\mu )\Big )}{\sqrt{(2\pi )^d|\Sigma |}} \end{aligned}$$
(11)

Suppose we want to add bounded support and asymmetry to this definition of the multivariate Student’s t. In that case, we can base it on an asymmetric multivariate Gaussian density function, then multiply it by the indicator function \(h\) (see Eq. (7)) and divide it by the integral over the bounded support region \(\Omega\), which yields the following density function:

$$\begin{aligned} & \mathcal {S}(x|\theta ) = \frac{ \mathcal {N}\Big (x|\mu ,\frac{\Sigma _l}{\phi },\frac{\Sigma _r}{\phi }\Big ) \times \mathcal {G}(\phi ) \times h(x,\Omega )}{\int _{\Omega } \mathcal {T}(y|\mu ,\Sigma _{l}, \Sigma _{r},\nu ) \,dy} \end{aligned}$$
(12)

where \(\mathcal {T}\) is the asymmetric multivariate t-probability density function (as presented in Eq. (4)), and where \(\mathcal {N}\) is the asymmetric multivariate Gaussian density function, which takes as parameters a mean vector, a left covariance matrix, and a right covariance matrix. In order to define this density function, we follow the same approach stated in Sect. 2.1 for the multivariate asymmetric t:

$$\begin{aligned} \mathcal {N}(x|\mu , \Sigma _l,\Sigma _r)={\left\{ \begin{array}{ll} n(x|\mu , \Sigma _l) & \text {if }x\le 0\\ n(x|\mu , \Sigma _r) & \text {otherwise} \end{array}\right. } \end{aligned}$$
(13)

Building up these definitions of the multivariate bounded asymmetric Student’s t-probability density function helps us construct the HMM observation emission probability that we employ in the full model.

2.3 Multivariate bounded asymmetric t-mixture model

Representing the distribution of a dataset \(X\) as a BASMM with \(K\) components implies that for every vector \(x_i\) of the dataset, the marginal probability density function of \(x_i\) is written as follows:

$$\begin{aligned} & f(x_i| \Theta )=\sum _{k=1}^{K} c_k\times \mathcal {S}(x_i|\theta _k) = \sum _{k=1}^{K} c_k\nonumber \\ & \quad \times \mathcal {S}(x_i|\mu _k,\Sigma _{l,k},\Sigma _{r,k},\nu _k,\Omega _k) \end{aligned}$$
(14)

where \(c_k\) and \(\theta _k\) are the mixing proportion and the set of parameters of the \(k^{th}\) mixture component, respectively; the mixture’s full set of parameters is \(\Theta =\{{\theta _1,\dots ,\theta _K};{c_1,\dots ,c_K}\}\). The mixing proportion \(c_k\) represents the prior probability that \(x_i\) belongs to the \(k^{th}\) component and thus satisfies:

$$\begin{aligned} c_k \ge 0 \quad \text {and} \quad \sum _{k = 1}^{K}c_k = 1 \end{aligned}$$
(15)
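Numerically, the marginal density of Eq. (14) is just a weighted sum of \(K\) bounded asymmetric t components; a minimal sketch, reusing `bounded_asym_t_pdf` from Sect. 2.1, is given below. The dictionary-based parameter layout is an arbitrary choice of ours.

```python
# Sketch of the BASMM marginal density of Eq. (14).
import numpy as np

def basmm_pdf(x, weights, components):
    """weights: mixing proportions c_k; components: list of dicts with keys
    mu, Sigma_l, Sigma_r, nu, omega_lo, omega_hi (the parameters theta_k)."""
    weights = np.asarray(weights, float)
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)   # constraint of Eq. (15)
    return sum(c * bounded_asym_t_pdf(x, **theta) for c, theta in zip(weights, components))
```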

2.4 Fitting the mixture model

To ensure that the mixture model fits the data as well as possible, we run the EM algorithm [30] to adjust the model parameters \(\Theta\) toward the closest representation of the modeled data. As its name suggests, the EM algorithm comprises two main steps: Expectation and Maximization.

2.4.1 Expectation step

The Expectation step consists of estimating the log-likelihood of the mixture model, i.e., how accurately the model with the current set of parameters \(\Theta\) represents the data. At iteration \(t\) of the EM algorithm, we define the log-likelihood as the logarithm of the BASMM’s probability density function of the data. Since the data points are considered independent and identically distributed (IID), the BASMM’s density function is the product of the marginal density values of the data vectors \((x_i)_{i=1}^{i=N}\). Thus, the BASMM’s log-likelihood is defined as follows:

$$\begin{aligned} & L(\Theta ) = \log \Big (\prod _{i=1}^{N}f(x_i|\Theta )\Big ) \nonumber \\ & \quad = \sum _{i=1}^{N}\log \Big (\sum _{k=1}^{K}c_kP(x_i|\theta _k)\Big ) \end{aligned}$$
(16)

where \(\Theta = \{\theta _1, \dots , \theta _K; c_1, \dots , c_K\}\) and \(\theta _k = \{\mu _k, \Sigma _{k,l}, \Sigma _{k,r}, \nu _k, \Omega _k\}\) for \(1\le k \le K\). In the same Expectation step, we denote by \(z_{ik}\) the posterior probability that the vector \(x_i\) belongs to the \(k^{th}\) component, for \(i\in \{1,\dots ,N\}\) and \(k\in \{1,\dots ,K\}\). These posterior probabilities are called responsibilities in mixture model terminology. They signify how responsible a mixture component (a single bounded asymmetric t-distribution in the case of the BASMM) is for a data vector \(x_i\), i.e., the contribution of the mixture component \(\theta _k\) to the overall density produced by the BASMM at \(x_i\). At each iteration \(t\) of the Expectation step, the responsibility values \(\big (z_{ik}^{(t)}\big )_{i=k=1}^{i=N,k=K}\) are computed by the following equation:

$$\begin{aligned} z_{i k}^{(t)} = \frac{c_k^{(t)}P(x_i|\theta _k^{(t)})}{\sum _{j=1}^{K}c_j^{(t)}P(x_i|\theta _j^{(t)})} \end{aligned}$$
(17)
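For reference, Eq. (17) amounts to a row-wise normalization of the weighted component densities; a small vectorized sketch is shown below, where `P[i, k]` stands for \(P(x_i|\theta _k^{(t)})\) evaluated beforehand.

```python
# E-step responsibilities of Eq. (17), computed for all N points and K components at once.
import numpy as np

def responsibilities(P, c):
    """P: (N, K) array of component densities P(x_i | theta_k); c: (K,) mixing weights."""
    weighted = P * np.asarray(c)[np.newaxis, :]            # c_k * P(x_i | theta_k)
    return weighted / weighted.sum(axis=1, keepdims=True)  # normalize over k
```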

2.4.2 Maximization step

The goal of the Maximization step in the EM algorithm is to update the model parameters so as to maximize the previously calculated log-likelihood function [31]. Equivalently, we minimize the negative log-likelihood function \(J(\Theta )=-L(\Theta )\). The approach is to compute the partial derivatives of \(J(\Theta )\) with respect to each parameter and update the parameters as the solutions of the equation:

$$\begin{aligned}\frac{\partial J(\Theta )}{\partial \Theta }=0\end{aligned}$$

The solutions of this equation with respect to each parameter of the BASMM require the responsibilities/posterior probabilities \(z_{ik}\) computed in the Expectation step; in turn, the responsibilities depend on the parameters of each mixture component \(\theta _k\). This interdependence explains the iterative nature of the EM algorithm. The M-step of this algorithm, as well as the updated parameters’ definitions, is elaborated in detail in [26]. A schematic of the overall alternation is sketched below.
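In the sketch, `component_densities` and `m_step` are hypothetical placeholders for the component density evaluation and the parameter updates of [26]; only the overall E/M loop and a stopping rule on the log-likelihood of Eq. (16) are shown.

```python
# Schematic EM loop for the BASMM; component_densities and m_step are hypothetical stubs.
import numpy as np

def em_fit(X, params, n_iter=100, tol=1e-6):
    prev_ll = -np.inf
    for _ in range(n_iter):
        P = component_densities(X, params)                     # S(x_i | theta_k), hypothetical helper
        ll = np.log((P * params["c"]).sum(axis=1)).sum()       # log-likelihood, Eq. (16)
        if ll - prev_ll < tol:                                 # stop when the likelihood stalls
            break
        prev_ll = ll
        Z = responsibilities(P, params["c"])                   # E-step, Eq. (17)
        params = m_step(X, Z, params)                          # M-step updates of [26], hypothetical helper
    return params
```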

3 Hidden Markov models

3.1 Bounded asymmetric student’s t-mixture model hidden Markov model (BASMMHMM)

Here we present the main contribution of this work, which is the observation emission strategy. As discussed in the introduction, we aim to produce an HMM with emissions that are more robust to outliers in the observable data. In this context, the Student’s t-distribution has been employed in modified versions as a non-Gaussian emission in [22, 32]. We build on these works by exploring asymmetry and bounded support along with the t-mixture for the emission. For this particular type of HMM, we consider that at time \(t\), the observation \(y_t\), given a hidden state \(s_i\), follows a probability distribution formed by a mixture of bounded asymmetric Student’s t-distributions with \(K\) components. We also consider that the number of mixture components is the same for all the hidden states of the HMM. As a result, the probability of emitting the observation \(y_t\) from the hidden state \(s_i\) is defined in the following equation:

$$\begin{aligned} & P(y_t|\Theta _i) = \sum _{k=1}^{K}c_{i,k}\times \mathcal {S}(y_t|\theta _{i,k}) \nonumber \\ & \quad = \sum _{k=1}^{K}c_{i,k}\times \mathcal {S}(y_t|\mu _{i,k}, \Sigma _{i,k}^l,\Sigma _{i,k}^r, \nu _{i,k}, \Omega _{i,k}) \end{aligned}$$
(18)

With the definition of the multivariate t given in Eqs. (1) and (4), running the EM algorithm when fitting the HMM is hard and computationally costly. We therefore employ the definition based on the bounded asymmetric Gaussian stated in Sect. 2.2. As a result, the probability of emitting the \(t^{th}\) observation \(y_t\) from the hidden state \(s_i\) (which corresponds to the emission mixture model \(\Theta _i\) with the set of parameters \(\Big (\theta _{i,k} = \{\mu _{i,k},\Sigma _{l,i,k},\Sigma _{r,i,k},\nu _{i,k},\Omega _{i,k}\}\Big )_{k=1}^{K}\)) is the following:

$$\begin{aligned} & P(y_t|s_i) = \sum _{k=1}^{K} \frac{c_{i,k}\times \mathcal {N}\Big (y_t|\mu _{i,k}, \frac{\Sigma _{l,i,k}}{\phi _{i,k}}, \frac{\Sigma _{r,i,k}}{\phi _{i,k}}\Big )\times \mathcal {G}(\phi _{i,k}) \times h(y_t,\Omega _{i,k})}{\int _{\Omega _{i,k}} \mathcal {T}(y|\mu _{i,k},\Sigma _{l,i,k}, \Sigma _{r,i,k},\nu _{i,k}) \,dy} \end{aligned}$$
(19)

where \(\phi _{i,k}\) is a precision parameter with \(\phi _{i,k}\sim \mathcal {G}(\frac{\nu _{i,k}}{2},\frac{\nu _{i,k}}{2})\) (see Sect. 2.2). We also define the observation indicators \(\big (\delta _{i,t}\big )_{i=t=1}^{i=I,t=L}\) by:

$$\begin{aligned} \delta _{i,t}={\left\{ \begin{array}{ll} 1 & \text {if the observation } y_t \text { is emitted from the hidden state } s_i\\ 0 & \text {otherwise} \end{array}\right. } \end{aligned}$$
(20)

Also, given \(\delta _{i,t}=1\), we define the state-conditional mixture component indicators \(\big (\eta _{i,k,t}\big )_{k=1}^{K}\) as follows:

$$\begin{aligned} \eta _{i,k,t}={\left\{ \begin{array}{ll} 1 & \text {if } y_t \text { is emitted from the } k^{th} \text { mixture component of the hidden state } s_i\\ 0 & \text {otherwise} \end{array}\right. } \end{aligned}$$
(21)

These indicators are latent variables that identify the mixture component each data point belongs to. This information is not observed, but defining it mathematically gives us a complete-data representation \(y^c\), which simplifies the equations, i.e., the complete-data probability density function of each emission mixture:

$$\begin{aligned} \begin{aligned}&P(y^c|s_i)= \prod _{k=1}^{K}\Bigg [c_{i,k}\times \mathcal {N}\Big (y|\mu _{i,k}, \frac{\Sigma _{l,i,k}}{\phi _{i,k}}, \frac{\Sigma _{r,i,k}}{\phi _{i,k}}\Big ) \\&\quad \times \frac{\mathcal {G}(\phi _{i,k}) \times h(y,\Omega _{i,k})}{\int _{\Omega _{i,k}} \mathcal {T}(y|\mu _{i,k},\Sigma _{l,i,k}, \Sigma _{r,i,k},\nu _{i,k}) \,dy}\Bigg ]^{\eta _{i,k,t}} \end{aligned} \end{aligned}$$
(22)

After some calculations, the log-likelihood of the emission mixture for the \(i^{th}\) hidden state is given by:

$$\begin{aligned} \begin{aligned}&\log {P(y^c|s_i)} = \log \Big [ \prod _{k=1}^{K}c_{i,k}\times \mathcal {S}(y|\theta _{i,k})^{\eta _{i,k,t}}\Big ] \\&\quad = \sum _{k=1}^{K}\eta _{i,k,t}\times \Bigg [-\log {\Gamma \big (\frac{\nu _{i,k}}{2}\big )}+\frac{\nu _{i,k}}{2}\Big (\log \big (\frac{\nu _{i,k}}{2}\big )-\phi _{i,k}+ \log {\phi _{i,k}}\Big ) \\&\qquad - \frac{1}{2}\Big (\log {|\Sigma _{i,k}|}+d\log (2\pi )+ \phi _{i,k} \Delta (y,\mu _{i,k};\Sigma _{i,k})\Big ) \\&\qquad - \log {\int _{\Omega _{i,k}} \mathcal {T}(y|\mu _{i,k},\Sigma _{l,i,k}, \Sigma _{r,i,k},\nu _{i,k}) \,dy}\Bigg ] \end{aligned} \end{aligned}$$
(23)

where \(\Sigma _{i,k}\) denotes the left or the right covariance matrix, depending on whether \(y\le 0\) or not.

3.2 Defining the log-likelihood of the BASMMHMM

The likelihood \(E(\mathcal {M})\) of the BASMMHMM measures how well the model fits the data (the set of observations). \(E(\mathcal {M})\) is obtained by combining the initial state probability, the state transition probabilities, and the emission probabilities of the observation sequence \(Y =\{y_t\}_{t=1}^{L}\) under every hidden state’s BASMM:

$$\begin{aligned} & E(\mathcal {M}) = \Bigg (\prod _{i=1}^N\lambda _{i}^{\delta _{i,1}}\Bigg )\nonumber \\ & \quad \times \Bigg (\prod _{i=1}^N\prod _{j=1}^N\prod _{t=1}^{L-1}\lambda _{i,j}^{\delta _{i,t}\times \delta _{j,t+1}}\Bigg ) \nonumber \\ & \quad \times \Bigg (\prod _{j=1}^N\prod _{t=1}^{L}P(y_t^c|s_{j})^{\delta _{j,t}}\Bigg ) \end{aligned}$$
(24)

Following this, the log-likelihood of the BASMMHMM is given by:

$$\begin{aligned} \begin{aligned}&\mathcal {L}(\mathcal {M}) = \log \big (E(\mathcal {M})\big ) \\&\quad = \sum _{i = 1}^{N} \Big (\delta _{i,1}\log {\lambda _i} + \sum _{j=1}^N\sum _{t=1}^{L-1}\delta _{i,t} \delta _{j,t+1} \log {\lambda _{i,j}}\Big ) \\&\qquad + \sum _{j=1}^N\sum _{t=1}^{L} \delta _{j,t}\log {P(y_t^c|s_{j})} \end{aligned} \end{aligned}$$
(25)

3.3 Training the BASMMHMM

The goal of training the Bounded Asymmetric Student’s-t Hidden Markov Model is to find the optimal set of model parameters \(\big \{\lambda _i,\lambda _{i,j},s_{j}\big \}_{i,j=1}^{N,N}\) that best fits the sequence of observations \(Y=\big (y_t\big )_{t=1}^{L}\). This is done by maximizing the log-likelihood (see Eq. (25)) with an EM algorithm. Let \(\rho _{i,t}\) and \(\rho _{i,j,t}\) be the posterior emission probabilities defined as follows:

$$\begin{aligned} & \rho _{i,t} = P(\delta _{i,t}=1|y_t) \end{aligned}$$
(26)
$$\begin{aligned} & \rho _{i,j,t} = P(\delta _{j,t+1}=1, \delta _{i,t}=1|y_t) \end{aligned}$$
(27)

To perform the training, we use the Baum–Welch algorithm. Our purpose here is to tune the parameters of the HMM, namely the state transition matrix, the emission parameters, and the initial state distribution, such that the model best explains the observed data. In short, Baum–Welch is an instance of the EM algorithm whose E-step consists of forward and backward passes [33].

3.3.1 Baum–Welch: expectation

  1.

    Calculate the forward value \(\alpha\), where \(\alpha _t(i)\) is the probability of observing the first \(t\) observations and being in the \(i^{th}\) state at time \(t\), given the model parameters \(\Theta\).

  2.

    Calculate the backward value \(\beta\), where \(\beta _t(i)\) is the probability of observing the sequence from timestamp \(t+1\) until the end, given that the model is in the \(i^{th}\) state at the \(t^{th}\) timestamp and given the model parameters \(\Theta\).

  3.

    Calculate the posterior transition probabilities \(\rho _{i,j,t}\): the probability of being in state \(i\) at time \(t\) and in state \(j\) at time \(t+1\). \(\rho _{i,j,t}\) is calculated from the forward and backward values as follows (a compact forward-backward sketch is given after this list):

    $$\begin{aligned} \begin{aligned}&\rho _{i,j,t} = \frac{\alpha _t(i)\times \lambda _{i,j}P(y_{t+1}|s_j)\times \beta _{t+1}(j)}{P(Y|\Theta )} \\&\quad = \frac{\alpha _t(i)\times \lambda _{i,j}P(y_{t+1}|s_j)\times \beta _{t+1}(j)}{\sum _{i=1}^N\sum _{j=1}^N\big [\alpha _t(i)\times \lambda _{i,j}P(y_{t+1}|s_j) \times \beta _{t+1}(j)\big ]} \end{aligned} \end{aligned}$$
    (28)
  4.

    Calculate the posterior emission values \(\rho _{i,t}\), i.e., the probability of being in the \(i^{th}\) state at time \(t\), given the observations \(Y\) and the model \(\Theta\). We get these posteriors by summing the \(\rho _{i,j,t}\) values over all states \(j\):

    $$\begin{aligned} \rho _{i,t}=\sum _{j=1}^N\rho _{i,j,t} \end{aligned}$$
    (29)
  5.

    Calculate \(\mathcal {Q}(\mathcal {M})\), the expectation of the log-likelihood of the BASMMHMM:

    $$\begin{aligned} & \mathcal {Q}(\mathcal {M}) = E(\mathcal {L}(\mathcal {M}))\nonumber \\ & \quad = \sum _{i = 1}^{N} \Big (\rho _{i,1}\log {\lambda _i} + \sum _{j=1}^N\sum _{t=1}^{L-1} \rho _{i,j,t} \log {\lambda _{i,j}}\Big ) \nonumber \\ & \quad + \sum _{j=1}^N\sum _{t=1}^{L} \rho _{j,t}E\big (\log {P(y_t^c|s_{j})}\big ) \end{aligned}$$
    (30)
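A compact, unscaled forward-backward pass computing the quantities of Eqs. (28)-(29) is sketched below; `pi` and `A` are the initial distribution \((\lambda _i)\) and transition matrix \((\lambda _{i,j})\), and `B[t, j]` holds the emission likelihood \(P(y_t|s_j)\) of Eq. (19) evaluated beforehand. For long sequences a scaled or log-space version should be used instead; this is an illustrative sketch, not the exact implementation used in the paper.

```python
# Baum-Welch E-step quantities (Eqs. (28)-(29)) via an unscaled forward-backward pass.
import numpy as np

def baum_welch_e_step(pi, A, B):
    L, N = B.shape
    alpha = np.zeros((L, N))
    beta = np.zeros((L, N))
    alpha[0] = pi * B[0]
    for t in range(1, L):                               # forward recursion
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    beta[-1] = 1.0
    for t in range(L - 2, -1, -1):                      # backward recursion
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    rho_ij = np.zeros((L - 1, N, N))                    # posterior transitions, Eq. (28)
    for t in range(L - 1):
        num = alpha[t][:, None] * A * (B[t + 1] * beta[t + 1])[None, :]
        rho_ij[t] = num / num.sum()
    rho_i = alpha * beta                                # posterior states, Eq. (29)
    rho_i /= rho_i.sum(axis=1, keepdims=True)
    return alpha, beta, rho_i, rho_ij
```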

3.3.2 Baum–Welch: maximization

In the Maximization step, we use the variables calculated in the Expectation step to update the HMM properties: the initial state probabilities, the state transition probabilities, and the emission mixtures of each hidden state. We proceed in the following steps:

  1.

    Update the initial hidden state probabilities \(\big (\lambda ^{t_0}_i\big )_{i=1}^N\) by using the posterior state probabilities \(\rho _{i,t_0}\):

    $$\begin{aligned} \widehat{\lambda ^{t_0}_i} = \rho _{i,t_0} \quad ;\quad i\in \{1,2,\dots ,N\} \end{aligned}$$
    (31)
  2.

    Update the state transition probabilities:

    $$\begin{aligned} & \widehat{\lambda _{i,j}} = \frac{\text{ number } \text{ of } \text{ transitions } \text{ from } s_i \text{ to } s_j}{\text{ number } \text{ of } \text{ transitions } \text{ from } s_i} \nonumber \\ & \quad = \frac{\sum _{t=1}^{L-1}\rho _{i,j,t}}{\sum _{t=1}^{L}\rho _{i,t}} \end{aligned}$$
    (32)
  3.

    Update the properties of the BASMM for each hidden state of the model: the means \((\mu _{i,k})_{i=k=1}^{i=N,k=K}\), the covariances, the mixing weights and the degrees of freedom.

    $$\begin{aligned} \widehat{\mu _{i,k}}=\frac{\sum _{t=1}^{L}\xi _{i,k,t}(u_{i,k}(y_t) y_t - A_{i,k})}{\sum _{t=1}^{L}\xi _{i,k,t}u_{i,k}(y_t)}; \end{aligned}$$
    (33)

where \(\xi _{i,k,t}\) is the \(i^{th}\) state’s mixture component membership posterior, i.e., the probability that the observation \(y_t\) is emitted from the \(k^{th}\) component of the \(i^{th}\) hidden state:

$$\begin{aligned} \xi _{i,k,t}= \frac{\rho _{i,t} c_{i,k} \mathcal {S}(y_t|s_{i,k})}{\sum _{j=1}^{K}c_{i,j} \mathcal {S}(y_t|s_{i,j})} \end{aligned}$$
(34)

where \(A_{i,k}\) is defined using a sample of data points \((S_{m})_{m=1}^{m=M}\) drawn from the \(k^{th}\) component of the \(i^{th}\) hidden state’s mixture:

$$\begin{aligned} A_{i,k}= \frac{\sum _{m=1}^{M}(S_{m}-\mu _{i,k})u_{i,k}(S_m)h(S_{m},\Omega _{i,k})}{\sum _{l=1}^{M}h(S_{l},\Omega _{i,k})} \end{aligned}$$
(35)

and \(u_{i,k}(y_t)\) is the precision function for an observation \(y_t\) of dimension \(d\):

$$\begin{aligned} u_{i,k}(y_t) = \frac{d+\nu _{i,k}}{\nu _{i,k}+\Delta (y_t,\mu _{i,k};\Sigma _{i,k})} \end{aligned}$$
(36)

The mixing weights \((c_{i,k})_{i=k=1}^{i=N,k=K}\) are updated by dividing the probability of emission from the \(k^{th}\) mixture component of the \(i^{th}\) hidden state by the total probability of being in that \(i^{th}\) state at any timestamp in the Markov chain:

$$\begin{aligned} & \widehat{c_{i,k}} = \frac{\sum _{t=1}^{L}\xi _{i,k,t}}{\sum _{t=1}^{L}\sum _{l=1}^{K}\xi _{i,l,t}} \nonumber \\ & \quad = \frac{\sum _{t=1}^{L}\xi _{i,k,t}}{\sum _{t=1}^{L}\rho _{i,t}} \end{aligned}$$
(37)

The covariances \((\Sigma _{i,k})_{i=k=1}^{i=N,k=K}\) are updated as follows:

$$\begin{aligned} & \widehat{\Sigma _{i,k}}=\frac{\sum _{t=1}^{L}\xi _{i,k,t}u_{i,k}(y_t) \times (y_t-\mu _{i,k})(y_t-\mu _{i,k})^T}{\sum _{t=1}^{L}\xi _{i,k,t}}-B_{i,k} \end{aligned}$$
(38)

where \(B_{i,k}\) is given by:

$$\begin{aligned} & B_{i,k}=\frac{\sum _{m=1}^{M}\big (\Sigma _{i,k}- (S_m-\mu _{i,k})(S_m-\mu _{i,k})^{T}u_{i,k}(S_m)\big )h(S_m,\Omega _{i,k})}{\sum _{m=1}^{M}h(S_m,\Omega _{i,k})} \end{aligned}$$
(39)

Next, the update of the degrees of freedom for each hidden state’s mixture component is the solution to the equation below:

$$\begin{aligned} & g(\nu _{i,k},d) +1+ \frac{1}{\sum _{t=1}^{L} \xi _{i,k,t}}\sum _{t=1}^{L} \xi _{i,k,t}\Big (\log {u_{i,k}(y_t)}-u_{i,k}(y_t)\Big ) \nonumber \\ & \quad -\frac{1}{\sum _{m=1}^{M}h(S_{m},\Omega _{i,k})} \sum _{m=1}^{M}\Big ( g(\nu _{i,k},d)+ 1+ \log {u_{i,k}(S_m)} - u_{i,k}(S_m) \Big )=0 \end{aligned}$$
(40)

where \(\psi\) is the digamma function and \(g(\nu ,d)\) is defined as:

$$\begin{aligned} g(\nu ,d) = -\psi (\frac{\nu }{2})+\log {(\frac{\nu }{2})}+\psi (\frac{\nu +d}{2}) - \log {(\frac{\nu +d}{2})} \end{aligned}$$
(41)

There is no closed-form solution to Eq. (40), so we use the Newton–Raphson method [34] to derive the optimal update of \(\nu _{i,k}\); a sketch of this update is given below. Finally, we update the bounds of each hidden state’s mixture model by taking the minima and maxima of the observations assigned to each mixture component in the Expectation step.
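As an illustration, the root of Eq. (40) can be found with a generic Newton-Raphson iteration such as the one below, where `f_nu` is the left-hand side of Eq. (40) assembled from the current responsibilities and the bound-correction sample; the derivative is approximated by central differences here for simplicity, which is a choice of ours rather than the paper's.

```python
# Newton-Raphson update of the degrees of freedom; f_nu is the left-hand side of Eq. (40).
import numpy as np
from scipy.special import digamma

def g(nu, d):
    """g(nu, d) of Eq. (41)."""
    return (-digamma(nu / 2.0) + np.log(nu / 2.0)
            + digamma((nu + d) / 2.0) - np.log((nu + d) / 2.0))

def newton_update_nu(f_nu, nu0, n_iter=50, eps=1e-5, tol=1e-8):
    nu = nu0
    for _ in range(n_iter):
        f = f_nu(nu)
        df = (f_nu(nu + eps) - f_nu(nu - eps)) / (2.0 * eps)   # numerical derivative of Eq. (40)
        step = f / df
        nu = max(nu - step, 1e-3)                              # keep the update positive
        if abs(step) < tol:
            break
    return nu
```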

4 Experiments and results

In this section, we select a few popular applications based on sequential data, apply the BASMMHMM to them, and evaluate its performance against baseline models among the following:

  • Gaussian Hidden Markov Model (GHMM)

  • Gaussian Mixture Hidden Markov Model (GMMHMM)

  • Student Mixture Hidden Markov Model (SMMHMM)

  • Student Hidden Markov Model (SHMM)

Our approach is to measure how much the Bounded Asymmetric Student’s t-Mixture emissions can improve the HMM’s performance. For this reason, the baseline models mentioned above are all HMM variants with different emission distributions.

4.1 Occupancy estimation

In the field of smart buildings, occupancy estimation is a frequently performed operation, as it is useful for many tasks such as energy saving, consumption tracking, and employee presence monitoring. Many works have extensively tackled this subject, such as [35, 36]. In this experiment, we estimate the number of occupants in a room using signals from non-intrusive sensors.

4.1.1 Data

The dataset [37] used for this experiment comprises signals obtained from seven non-intrusive sensor nodes of five different types: temperature, illumination, sound, CO2, and passive infrared (PIR). As Fig. 2 shows, sensor nodes S1-S4 were deployed at the desks (referred to as desk nodes). These desk nodes have temperature, light, and sound sensors only. Node S5 holds a CO2 sensor placed in the middle of the room to obtain the most representative measurement. Nodes S6 and S7 only contain PIR sensors and are mounted on the ceiling at an angle that maximizes the sensors’ field of view for motion detection.

Fig. 2 Sensors’ layout in the room

The data obtained from these nodes spans 21 days (from 22 December 2017 to 11 January 2018) and was recorded every 30 s, which gives a time series of 10129 timestamps. The ground-truth room occupancy varies between 0 and 3; we model this information as the hidden state of our HMM, which gives 4 hidden states. The observations are the signals sent by the sensors; in this experiment, they are vectors of dimension \(d=16\), as there are 16 distinct records taken from the sensors in total.

4.1.2 Preprocessing

When we observe the labels (the number of occupants over time), we find a clear imbalance: for most of the recording time, no one is in the room, so the number of occupants is zero.

We cope with the imbalance by oversampling the minority classes using the SMOTE technique [38]. However, we do not make the classes fully balanced, in order to keep some outliers and preserve the overall occupancy sequence patterns. The results of the oversampling are shown in Fig. 3.

Fig. 3 Original data versus resampled data

After oversampling, we scale the data using the MinMax method. We then perform a PCA to reduce the number of features and the computational complexity. The number of principal components is chosen so that the retained variance of the data stays above 0.95. Based on Fig. 4, we choose eight principal components. A sketch of this preprocessing pipeline is given below.

Fig. 4 Data variance depending on the number of principal components
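The preprocessing described above can be sketched with scikit-learn and imbalanced-learn as below. The per-class SMOTE targets are illustrative (the exact counts used in the experiment are not reproduced here); the only fixed choices are the partial oversampling, the MinMax scaling, and a PCA retaining 95% of the variance.

```python
# Sketch of the occupancy-data preprocessing: partial SMOTE, MinMax scaling, PCA (95% variance).
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

def preprocess(X, y):
    # Partial oversampling: grow the minority classes without fully equalizing them
    # (illustrative targets: at least half of the majority class size).
    majority = np.max(np.bincount(y))
    targets = {c: max(int(np.sum(y == c)), int(majority // 2)) for c in np.unique(y) if c != 0}
    X_res, y_res = SMOTE(sampling_strategy=targets, random_state=0).fit_resample(X, y)
    X_scaled = MinMaxScaler().fit_transform(X_res)
    pca = PCA(n_components=0.95)   # keep enough components to retain 95% of the variance
    return pca.fit_transform(X_scaled), y_res
```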

4.1.3 Results

We run the BASMMHMM and a selection of benchmark models (SMMHMM, SHMM, GMMHMM, GHMM) on the preprocessed data, taking the room occupancy numbers as hidden states. When fitting the models, we run the EM algorithm for a number of iterations ranging from 1 to 100, and we keep the number of iterations that gives the best result for each model. After multiple experiments with the different mixture-based HMMs on the data, we take \(K=3\) mixture components, as this produces the best fit for the dataset. The weighted averages of the accuracy, precision, recall, and F1 score are presented in Table 2.

Table 2 Occupancy estimation: accuracy and F1 score weighted averages for different models

According to the results above, the BASMMHMM clearly performed better than the rest, producing the highest accuracy and F1-score of 0.86, while the second best results were an accuracy and F1-score of 0.82 for the SMMHMM. The models based on Student’s-t emissions gave better metrics than those based on Gaussian emissions. This is mainly due to the Gaussian-based models’ poor prediction of the minority states (hidden states 1, 2, and 3); as mentioned earlier, a single label dominates the time series (0 occupants most of the time). Common to all the models is that they performed well on the majority hidden state 0. The confusion matrix in Fig. 5 shows that the BASMMHMM predicts all the classes/hidden states of the data well, despite their imbalance (class 0 occurs more often than the rest). In comparison, the confusion matrices of the other models, shown in Fig. 6, reveal limited prediction of the non-majority classes. The weighted averages of the accuracy, precision, recall, and F1 score when using the original data without oversampling are presented in Table 3. According to these results, the BASMMHMM still gives relatively good results, considering the complexity of this highly imbalanced data, while outperforming the other models.

Fig. 5 Occupancy estimation: confusion matrix of BASMMHMM

Fig. 6 Occupancy estimation: confusion matrices of other HMMs

Table 3 Occupancy estimation: accuracy and F1 score weighted averages for different models using original data

4.2 Stock price prediction

The stock market is an important indicator that reflects economic growth: when the economy grows, this typically translates into an upward trend in stock prices; in contrast, when the economy slows, stock prices tend to be more mixed. For traders, it is important to predict the behaviour of stock prices in order to take appropriate action and achieve profit. However, this prediction task is not easy, as several uncertain factors, such as economic conditions, policy changes, and supply and demand between investors, determine the price trend. These factors vary over time, making stock markets volatile.

4.2.1 Data and preprocessing

We use the stock price time-series made available through the Yahoo Finance API, which provides records of multiple companies’ stock prices spanning long periods of time. For our experiment, we select the datasets of three companies: Amazon (AMZN), Apple (AAPL), and Google (GOOGL). For each company, the time-series we use spans the 12 years from 1 January 2010 to 1 January 2022 and is multivariate with four variables: opening price, high price, low price, and closing price. As for the preprocessing, we perform MinMax scaling on the data before passing it to the HMM. After the forecasting, we unscale the results produced by the model and compare them to the unscaled ground-truth data to assess the model’s performance.

4.2.2 Forecasting approach

Our task is to predict the stock prices for day \(t+1\). To do this, we adopt the following method: first, we fit the BASMMHMM to the data (the time-series up to day \(t\)), then we predict based on sliding time windows \(W_j\) of fixed length \(q\) (where \(W_j\) is the \(q\)-day sequence ending on day \(j\)): we calculate the log-likelihood of each sliding window, take the window whose log-likelihood is closest to that of \(W_t\), and compute the day \(t+1\) prediction based on that chosen window. The adopted approach is further explained in Figs. 7 and 8 below, and sketched in the code that follows.
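A minimal sketch of this likelihood-matching rule is given below, assuming a fitted model that exposes an hmmlearn-style `score(window)` method returning the log-likelihood; carrying over the next-day change after the matched window is our reading of the scheme, not a verbatim implementation.

```python
# Sliding-window forecasting: match the latest window by log-likelihood, reuse the next-day move.
import numpy as np

def forecast_next(model, series, q):
    """series: (T, d) array of past observations; q: sliding-window length."""
    T = len(series)
    current = series[T - q:]                          # latest window W_t
    ll_current = model.score(current)
    best_j, best_gap = None, np.inf
    for j in range(q, T):                             # past window series[j-q:j] with known next day series[j]
        gap = abs(model.score(series[j - q:j]) - ll_current)
        if gap < best_gap:
            best_j, best_gap = j, gap
    change = series[best_j] - series[best_j - 1]      # move observed right after the matched window
    return series[-1] + change                        # predicted next-day prices
```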

Fig. 7 Forecasting the t+1 stock prices based on a sliding window of past k days

Fig. 8 Predicted time-series calculation

4.2.3 Results

After performing the forecasting, we compare the BASMMHMM with a selection of other models using the following two performance metrics:

  • MAPE: short for Mean Absolute Percentage Error, the average absolute error between the actual and predicted stock values, expressed as a percentage. The formula is:

    $$\begin{aligned} MAPE = \frac{1}{n}\sum _{i=1}^{n}\frac{|y_i-x_i|}{x_i}\times 100 \end{aligned}$$
    (42)

    where \(n\) is the length of the time-series, and for \(i\in \{1,2,\dots ,n\}\), \(y_i\) is the predicted value and \(x_i\) is the actual value.

  • RMSE: the Root Mean Square Error is the square root of the mean of the squared errors between the actual and the predicted data. The RMSE is widely used and is considered an excellent general-purpose error metric for numerical predictions. Using the notation of Eq. (42), the RMSE formula is the following:

    $$\begin{aligned} RMSE = \sqrt{\frac{1}{n}\sum _{i=1}^{n}(y_i-x_i)^2} \end{aligned}$$
    (43)

Tables 4, 5 and 6 present the metrics obtained after forecasting the stock prices of Amazon, Apple, and Google, respectively. The prediction is performed on multivariate stock price data with four variables (Open, High, Low, and Close prices), but in the tables we focus mainly on the High price variable. The BASMMHMM has been run with different numbers of hidden states \(N\) and sliding window sizes \(q\), and the combination \(\{N,q\}\) that gives the best performance is selected. As for the number of mixture components of the emissions, it is selected using the Minimum Message Length criterion [30]. In this experiment, the BASMMHMM is compared to the SMMHMM and the GMMHMM.

Table 4 AMZN stock price prediction: performance metrics for different models
Table 5 AAPL stock price prediction: performance metrics for different models
Table 6 GOOGL stock price prediction: performance metrics for different models

According to the tables above, the BASMMHMM generally performed better than the SMMHMM and GMMHMM. This is mainly explained by the outliers and the local minima/maxima being better predicted by the BASMMHMM. It is also worth mentioning that the models based on Student’s t-mixture emissions (BASMMHMM, SMMHMM) performed better than the GMMHMM, which is based on Gaussian mixture emissions. The graphs in Figs. 9, 10 and 11 give a clearer picture of the predicted versus the actual stock prices.

Fig. 9 Amazon stock prices: BASMMHMM prediction versus ground truth

Fig. 10 Apple stock prices: BASMMHMM prediction versus ground truth

Fig. 11 Google stock prices: BASMMHMM prediction versus ground truth

4.3 Human activity recognition

Human Activity Recognition (HAR) is a popular scientific application that enables machines to recognize human body behaviours. HAR is useful for many real-world tasks, such as fall detection in elderly healthcare monitoring or physical exercise measurement and tracking in sports science. In this experiment, we use the dataset provided by UCI [39], which is widely used in research works.

4.3.1 Dataset and preprocessing

The data at hand consists of 10299 records, each record having 561 features (the features are signals received from smartphone sensors). The labels of the data are the activities performed at the time of recording, and there are six of them: Walking, Walking Upstairs, Walking Downstairs, Sitting, Standing, and Laying.

The preprocessing consists of MinMax scaling followed by feature reduction with the Principal Component Analysis method. We perform the PCA so that the retained variance of the data stays above 0.95, which gives us 69 principal components.

In this experiment, we use a training sample of 7352 observations and a testing sample of 2947 observations. We create one HMM per activity, which gives us six HMMs in total. The parameters of each HMM are learned from the corresponding activity’s training set with the Baum–Welch algorithm. In the testing phase, for each sequence of the test set, we calculate the likelihood of each of the six trained HMMs having generated the observations, and the activity corresponding to the HMM with the highest likelihood is selected as the predicted label (a sketch of this rule is given below). For all six HMMs, we choose 2 hidden states and \(K=2\) mixture components per hidden state. Fig. 12 summarizes the modeling pipeline of this experiment.
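A sketch of this per-activity classification rule follows; an hmmlearn-style `fit(X, lengths)` / `score(X)` interface is assumed for the trained models, and `hmm_factory` is a hypothetical constructor for a 2-state BASMMHMM with \(K=2\) components.

```python
# One HMM per activity; a test sequence gets the label of the highest-likelihood model.
import numpy as np

def train_per_activity(train_segments, hmm_factory):
    """train_segments: dict activity -> list of (T_i, d) observation arrays."""
    models = {}
    for activity, segments in train_segments.items():
        model = hmm_factory()                                   # hypothetical BASMMHMM constructor
        model.fit(np.vstack(segments), lengths=[len(s) for s in segments])
        models[activity] = model
    return models

def classify(models, segment):
    """Return the activity whose HMM assigns the highest log-likelihood to the segment."""
    return max(models, key=lambda a: models[a].score(segment))
```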

Fig. 12 Human activity recognition: BASMMHMM framework

4.3.2 Results

After predicting the human activities, we calculate the weighted averages of the accuracy, precision, recall, and F1 score of the predicted labels. These weighted averages are obtained by taking the mean of the per-class metrics weighted by each class’s support, where the support is the number of actual occurrences of the class in the dataset; the ‘weight’ is the proportion of each class’s support relative to the sum of all support values.

The BASMMHMM performed better than the rest of the models, as shown in Table 7. Its accuracy and F1 score are close to 0.8, which is an improvement compared to the SMMHMM, which gave about 0.7. It is also worth mentioning that the models with emissions based on the Student’s t-mixture and t-distribution performed slightly better than the ones with emissions based on the Gaussian mixture and Gaussian distribution.

Table 7 HAR: Accuracy and F1 score weighted averages for different models

It is noteworthy that the computational complexity of each iteration of the Baum–Welch algorithm is \(\mathcal {O}(LN^2)\), which shows the practicality and scalability of HMMs in general, and the BASMMHMM in particular, in real-world applications, especially in scenarios where computationally demanding models are generally avoided (e.g., federated learning).

5 Conclusion

In this paper, we proposed the use of bounded asymmetric Student’s t-mixture models as the observation emission densities of continuous HMMs, offering a more robust methodology for sequential data modeling. We then presented different experiments where we applied the BASMMHMM, which showed enhanced performance compared to other benchmark HMM-based models. More specifically, thanks to its high flexibility, the BASMMHMM is a strong candidate for handling outliers and asymmetry in the data.

We can conclude that adding a custom emission to the HMM, such as the bounded asymmetric Student’s t-mixture, gives the model higher adaptability, regardless of its application. We presented the mathematical formulation of our model and backed it up with results from different experiments. Applications such as occupancy estimation, stock price prediction, and human activity recognition showed better performance for the BASMMHMM in comparison to other Student’s t- and Gaussian-based HMMs. Data anomalies are taken into account, making the BASMMHMM a very useful tool for tackling real-world datasets. This can also save the extra preprocessing step of removing outliers, which often alters the data and isolates the modeling from the real information/experiment.

Finally, there is room to improve the proposed model and expand this work in many directions. For instance, the number of emission mixture components is an important parameter to tune for the HMM to ensure an optimal fit to the data; introducing an adequate model selection approach [40] before training the HMM can accomplish this tuning. Furthermore, in the case of high-dimensional observations, it would be rigorous to implement a feature selection strategy [41] to avoid high computational complexity and to select the features that represent the data most efficiently.