1 Introduction

A Hidden Markov Model (HMM) is a graphical model that can be used to describe a sequence of observed events/symbols. HMMs are most commonly applied in speech recognition, computational linguistics, cryptanalysis and bioinformatics [3, 7, 11]. An Expectation–Maximization (EM) algorithm proposed by Baum et al. [1] is frequently used for HMM parameter estimation. This algorithm locally maximizes the likelihood of observing the symbol sequence given the parameter estimates.

The motivation for this work arises from situations where the application of HMMs is ideal but impractical because of the excessive demands on computational resources. For example, Ref. [6] points out the high computational cost of HMM training in the context of intrusion detection because of long sequence lengths. This computational burden stems from the way the Baum–Welch algorithm repeatedly performs computations at every time step of the sequence. In this chapter, we propose an alternate method to estimate HMM parameters, in which the information content of the observed sequence is condensed into a 2D matrix. Thereafter, the algorithm operates on this matrix, eliminating the need for computation at every time step of the sequence in each iteration. We propose a generative model to explain the information contained in the count matrix and show how the HMM parameters can be derived from the proposed generative model. It should be noted that, unlike the Baum–Welch algorithm, the parameters estimated by the proposed algorithm can be suboptimal in the sense of the likelihood of the entire observed symbol sequence. In closely related but concurrent and independent work [9], the author proposes a non-negative matrix factorization [10] based estimation of the HMM parameters (NNMF–HMM). Therefore, we benchmark our results not just against the Baum–Welch algorithm but also against the NNMF–HMM algorithm and demonstrate speed gains of several orders of magnitude on synthetic and real-life datasets. We refer to our algorithm as PMF–HMM, where PMF stands for probabilistic matrix factorization.

The chapter is organized as follows. In Sect. 2, we present some background on HMMs and set the notation to be used in the rest of the chapter. We formulate the problem of HMM parameter estimation using the PMF–HMM algorithm in Sect. 3. The experimental results are provided in Sect. 4, followed by concluding remarks in Sect. 5.

2 Hidden Markov Model

In an HMM, an observed symbol sequence of length \(T\), \(O=o_1 o_2 \ldots o_T\), is assumed to be generated by a hidden state sequence of the same length, \(Q=q_1 q_2 \ldots q_T\), as shown in Fig. 1. The hidden states and the observed symbols can take values from finite sets \(S= \{S_1,S_2, \ldots , S_N \}\) and \(V= \{V_1,V_2, \ldots , V_M \}\), respectively. At each time step, the current state emits a symbol before transitioning to the next state, and the process is repeated at every time step. Typically, the number of hidden states \((N)\) is smaller than the number of observed symbols \((M)\). The probability of transitioning from the \(k\)th to the \(l\)th hidden state, in successive time steps, is denoted as \(P({q_{(t+1)}}={S_l}|{q_t}=S_k )\) or simply \(P({S_l}|{S_k})\). The probability of emitting the \(j\)th observed symbol from the \(k\)th hidden state is given by \(P(o_t=V_j|q_t=S_k)\) or \(P(V_j|S_k)\). The combined parameter set can be represented as \(\lambda = \left\{ P(S_l|S_k), P(V_j|S_k)\right\} \). Essentially, any probabilistic model with parameters \(\lambda \) that satisfy the properties \(\sum _{l=1}^NP(S_l|S_k )=1\) and \(\sum _{j=1}^MP(V_j|S_k)=1\) can be interpreted as an HMM [2]. Rabiner [12], in a comprehensive review of HMMs, points to three basic problems of interest in HMMs:

Fig. 1

Generative process of a Hidden Markov Model. The grey and white nodes represent the observed symbols and hidden states, respectively

  1. How to efficiently compute the probability of the observed symbol sequence, \(P(O|\lambda )\), given the model parameters, \(\lambda \)?

  2. Given an observed symbol sequence, \(O\), and the model parameters, \(\lambda \), how do we choose a hidden state sequence that is optimal with respect to some metric?

  3. Given an observed symbol sequence, \(O\), how do we estimate the parameters, \(\lambda \), such that \(P(O|\lambda )\) is maximized?

The third problem of HMM parameter estimation is more challenging than the other two as finding the global optimum is computationally intractable. The Baum–Welch algorithm solves this problem and guarantees attainment of a local optimum of the observed sequence likelihood. In this chapter, we propose a faster method to estimate the HMM parameters, which also provides a locally optimal solution but for a different objective function. In Sect. 3, we provide the mathematical formulation of the PMF–HMM algorithm.
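As a concrete reference for the notation above, the following minimal sketch (our own, with assumed parameter shapes; not part of the original formulation) samples a state and symbol sequence according to the generative process of Fig. 1.

```python
import numpy as np

# Minimal sketch of the generative process in Fig. 1 (illustrative only).
# A: N x N transition matrix P(S_l|S_k); B: N x M emission matrix P(V_j|S_k);
# pi: initial state distribution. All names are our own.
def sample_hmm(A, B, pi, T, seed=0):
    rng = np.random.default_rng(seed)
    N, M = B.shape
    states = np.empty(T, dtype=int)
    symbols = np.empty(T, dtype=int)
    q = rng.choice(N, p=pi)                 # draw the initial hidden state
    for t in range(T):
        states[t] = q
        symbols[t] = rng.choice(M, p=B[q])  # current state emits a symbol
        q = rng.choice(N, p=A[q])           # then transitions to the next state
    return states, symbols
```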

3 PMF–HMM Algorithm

3.1 Problem Formulation

Let \(P(V_i,V_j)\) represent the bivariate mass function of observing the symbol pair \(\langle V_i,V_j\rangle \) in an HMM process. The empirical estimate, \(\hat{P}(V_i,V_j)\), of this bivariate mass function can be derived from the observed symbol sequence using Eq. (1).

$$\begin{aligned} \hat{P}(V_i,V_j) = \frac{1}{\left( T-1 \right) }\sum _{t=1}^{T-1}\mathbb {I}_{V_i}\left( o_t \right) \times \mathbb {I}_{V_j}\left( o_{t+1}\right) \end{aligned}$$
(1)

where the indicator function, \(\mathbb {I}_{V_i}(o_t)\), is a binary function that outputs 1 only when the observed symbol at time \(t\) is \(V_i\). The square matrix \(\hat{P}(V_i,V_j )\), which is the maximum likelihood estimate of the bivariate mass function \(P(V_i,V_j)\), contains the normalized frequencies with which different symbol pairs appear in the sequence \(O\).
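A minimal sketch of this computation is given below (the function and variable names are our own); `seq` is an integer-coded symbol sequence with values in \(\{0,\ldots ,M-1\}\).

```python
import numpy as np

# Sketch of Eq. (1): empirical bivariate mass function of successive symbol pairs.
def count_matrix(seq, M):
    P = np.zeros((M, M))
    for a, b in zip(seq[:-1], seq[1:]):   # every adjacent pair <o_t, o_{t+1}>
        P[a, b] += 1.0
    return P / (len(seq) - 1)             # normalize by (T - 1)
```

Consider now a process that can generate such pairs of symbols: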

  • The current observed symbol \(V_i\), at some arbitrary time, makes a transition to the hidden state \(S_k\) with a probability \(\hat{P}(S_k |V_i)\).

  • In the next time step, the \(k\)th hidden state emits the observed symbol \(V_j\) with a probability \(\hat{P}(V_j|S_k)\).

This process of generating all pairs of observed symbols is depicted as a graphical model in Fig. 2. It should be noted that this process is fundamentally different from the generation process of observed symbols in an HMM (Fig. 1). Based on the graphical model in Fig. 2, \(\hat{P}\left( V_i,V_j\right) \) can be factorized as shown in Eq. (2):

Fig. 2

The proposed graphical model that governs the generation of the pairs of symbols in the observed sequence. The grey and white nodes represent the observed symbols and hidden states, respectively. \(M\) is the total number of observed symbols and \(n(V_i)\) is the number of times symbol \(V_i\) appears in the sequence

$$\begin{aligned} \hat{P} \left( V_i,V_j \right) \approx \hat{P}(V_i)\sum _{k=1}^N\hat{P}(S_k|V_i)\hat{P}(V_j|S_k) \end{aligned}$$
(2)

where \(\hat{P}(V_i)\) is the marginal distribution of observed symbols, which can be estimated empirically as \( \hat{P}(V_i)=\sum _j\hat{P}(V_i,V_j)\). In Sect. 3.2, we demonstrate how a fairly popular algorithm from the field of text mining can be used to perform this factorization and estimate the remaining two factors, \(\hat{P}(S_k|V_i)\) and \(\hat{P}(V_j|S_k)\).

3.2 Probabilistic Factorization of Count Matrix

Hofmann proposed an EM algorithm for the probabilistic factorization of word count matrices in the field of text mining [4, 5]. In his seminal work, a count matrix was defined on a text corpus (a collection of documents) such that the entries represented the frequencies of occurrence of different words (from a finite dictionary) in the different documents of the corpus. Hofmann's model, known as Probabilistic Latent Semantic Analysis (PLSA), is a widely used method for automated document indexing. Although PLSA was proposed to factorize word-count matrices, it is applicable to any matrix containing co-occurrence information about two discrete random variables. The key assumption in PLSA is the conditional independence of a word and a document given the latent/hidden topic. The generative model shown in Fig. 2 makes the same assumption, i.e., a pair of observed symbols occur in a sequence independently of each other given the in-between hidden state. As a result, the EM algorithm proposed by Hofmann can be readily applied to perform the factorization shown in Eq. (2). The algorithm is implemented iteratively to estimate the model parameters \(\hat{P}(V_j|S_k)\) and \(\hat{P}(S_k|V_i)\) using the following steps:

E Step: In this step, the probability distribution of the hidden states is estimated for every pair of observed symbols given the current parameter estimates.

$$\begin{aligned} \hat{P}(S_k|V_i,V_j) = \frac{\hat{P}(S_k|V_i)\hat{P}(V_j|S_k)}{\sum \limits _{k'=1}^N\hat{P}(S_{k'}|V_i)\hat{P}(V_j|S_{k'})} \end{aligned}$$
(3)

M Step: In this step, the model parameters are updated from the probabilities estimated in the E step.

$$\begin{aligned} \hat{P}(V_j|S_k) = \frac{\sum \limits _{i=1}^M\hat{P}(V_i,V_j)\times \hat{P}(S_k|V_i,V_j)}{\sum \limits _{i=1}^M\sum \limits _{j=1}^M\hat{P}(V_i,V_j)\times \hat{P}(S_k|V_i,V_j)} \end{aligned}$$
(4)
$$\begin{aligned} \hat{P}(S_k|V_i) = \frac{\sum \limits _{j=1}^M\hat{P}(V_i,V_j)\times \hat{P}(S_k|V_i,V_j)}{\sum \limits _{j=1}^M\hat{P}(V_i,V_j)} \end{aligned}$$
(5)

This EM process converges, after several iterations, to a local maximum of the log-likelihood function given by Eq. (6).

$$\begin{aligned} \ell = \sum _{i=1}^M\sum _{j=1}^M\hat{P}(V_i,V_j) \log \left( \hat{P}(V_i)\sum _{k=1}^N \hat{P}(S_k|V_i)\hat{P}(V_j|S_k) \right) \end{aligned}$$
(6)

where \(\hat{P}(V_i,V_j)\) is the empirical estimate of the bivariate mass function of a pair of observed symbols, given by Eq. (1), while the term inside the logarithm is the same bivariate mass function estimated under the generative model shown in Fig. 2. It can be shown that maximizing the log-likelihood function (Eq. (6)) amounts to minimizing the Kullback–Leibler distance between the two joint mass functions, i.e., \(D_{KL}\left( \hat{P}(V_i,V_j)||\hat{P}(V_i)\sum _{k=1}^N\hat{P}(S_k|V_i)\hat{P}(V_j|S_k) \right) \).
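A minimal, vectorized sketch of these EM updates (our own names; `P_ij` is the \(M\times M\) empirical matrix from Eq. (1) and `N` is the number of hidden states) could look as follows.

```python
import numpy as np

# Sketch of the EM factorization in Eqs. (3)-(5).
def plsa_factorize(P_ij, N, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    M = P_ij.shape[0]
    P_k_given_i = rng.random((M, N)); P_k_given_i /= P_k_given_i.sum(1, keepdims=True)
    P_j_given_k = rng.random((N, M)); P_j_given_k /= P_j_given_k.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E step, Eq. (3): posterior over hidden states for every symbol pair (i, j)
        post = P_k_given_i[:, :, None] * P_j_given_k[None, :, :]       # shape (i, k, j)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M step, Eqs. (4)-(5)
        weighted = P_ij[:, None, :] * post                             # P(V_i,V_j) * P(S_k|V_i,V_j)
        P_j_given_k = weighted.sum(axis=0)                             # sum over i
        P_j_given_k /= P_j_given_k.sum(axis=1, keepdims=True) + 1e-12
        P_k_given_i = weighted.sum(axis=2) / (P_ij.sum(axis=1, keepdims=True) + 1e-12)
    return P_k_given_i, P_j_given_k                                    # P(S_k|V_i), P(V_j|S_k)
```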

3.3 Estimation of HMM Parameters

The HMM parameters consist of the emission probabilities, \(P(V_j|S_k)\), and the transition probabilities, \(P(S_l|S_k)\). The emission probabilities, \(P(V_j|S_k)\), are estimated directly in the M step (Eq. (4)) of the EM algorithm. However, the transition probabilities are not estimated under the proposed generative model. Nevertheless, these probabilities can be obtained using a simple trick. To get the transition probability from the \(k\)th to the \(l\)th hidden state, we can enumerate all possible paths between these two states (via all observed symbols) and aggregate the probabilities of all such paths, as shown in Eq. (7).

$$\begin{aligned} P(S_l|S_k) = \sum _{i=1}^MP(V_i|S_k)P(S_l|V_i) \end{aligned}$$
(7)
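Continuing the sketch above, Eq. (7) reduces to a single matrix product over the two estimated factors (names as in the earlier sketch).

```python
# Sketch of Eq. (7): P(S_l|S_k) = sum_i P(V_i|S_k) P(S_l|V_i).
def transition_matrix(P_j_given_k, P_k_given_i):
    return P_j_given_k @ P_k_given_i      # (N x M) @ (M x N) -> N x N; rows sum to 1
```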

Here we list four key differences between the Baum–Welch algorithm and the PMF–HMM algorithm for the HMM parameter estimation.

  • The Baum–Welch algorithm operates on the entire symbol sequence, while the PMF–HMM algorithm operates on the count matrix derived from that sequence.

  • The number of parameters estimated by the PMF–HMM algorithm is \(2MN\), while the Baum–Welch algorithm estimates \(N(M+N)\) parameters.

  • Baum–Welch maximizes the likelihood of the entire observed sequence given the model parameters i.e. \(P(O|\lambda )\) as opposed to Eq. (6), which is maximized by the PMF–HMM algorithm.

  • The time complexity of the PMF–HMM algorithm is \(O(T)+O(IM^2N)\approx O(T)\) for very long sequences, while the complexity of the Baum–Welch algorithm is \(O(IN^2T)\). The symbol \(I\) denotes the number of iterations of the respective EM algorithms.

In Sect. 4, we experimentally show that despite these differences, the PMF–HMM algorithm estimates the HMM parameters fairly well.

3.4 Non-Degenerate Observations

In an HMM, an observed symbol can be expressed as a degenerate probability mass function supported on the symbol set. At any arbitrary time, the mass is concentrated on the symbol that is being observed. Consider a scenario where there exists some ambiguity about the symbol being observed. This ambiguity can cause the probability mass to diffuse from one symbol to others, resulting in a non-degenerate mass function. Such a situation can arise during the discretization of a system with continuous observations. The proposed algorithm, which operates on the count matrix, is inherently capable of handling this type of uncertain information. Figure 3 juxtaposes the degenerate and non-degenerate scenarios for a system with six observed symbols. In the former case, the system makes a clean transition from the 3rd to the 4th symbol. The outer product of the two probability mass functions in successive time steps results in a \(6\times 6\) matrix with a single non-zero entry at the (3, 4)th position. To obtain the count matrix for the entire sequence of length \(T\), the outer products in successive time steps can be aggregated as shown in Eq. (8), which is equivalent to Eq. (1).

$$\begin{aligned} \hat{P}(V_i,V_j)=\frac{1}{T-1}\sum _{t=1}^{T-1}O_t\otimes O_{t+1} \end{aligned}$$
(8)
Fig. 3

a Generation of a count matrix by degenerate observations. The count value is localized at a single position in the matrix. b Generation of a count matrix by non-degenerate observations. The count value gets distributed in a neighborhood

For the non-degenerate case, the count value simply gets diffused from the (3, 4)th position to the neighboring positions, as shown in Fig. 3b. Nevertheless, Eq. (8) can still be used to compute the count matrix. Once the count matrix is obtained, the PMF–HMM algorithm can be applied to estimate the parameters.
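A minimal sketch of Eq. (8) for such observations (assuming each observation is supplied as a length-\(M\) probability vector; names are our own):

```python
import numpy as np

# Sketch of Eq. (8): aggregate outer products of successive observation vectors.
# `obs` is a T x M array whose rows are (possibly non-degenerate) mass functions.
def count_matrix_soft(obs):
    T, M = obs.shape
    P = np.zeros((M, M))
    for t in range(T - 1):
        P += np.outer(obs[t], obs[t + 1])  # O_t (outer) O_{t+1}
    return P / (T - 1)
```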

4 Experiments

In this section, we present some empirical evidence of the speed gains of the PMF–HMM over the Baum–Welch algorithm, using synthetic and real-life datasets.

4.1 Synthetic Data

We kept the experimental setup identical to the one proposed in Ref. [9]. This provided us with a platform to benchmark our algorithm not just against the Baum–Welch algorithm but also against the NNMF–HMM algorithm. The experiments were carried out in the MATLAB programming environment. For the implementation of the Baum–Welch algorithm, we used the Statistics Toolbox of MATLAB. The observed symbol sequences were generated using a hidden Markov process with the transition probabilities shown in Eq. (9).

$$\begin{aligned} P(S_l|S_k) = \begin{pmatrix} 0 &{} 0.9 &{} 0.1 \\ 0 &{} 0 &{} 1 \\ 1 &{} 0 &{} 0 \\ \end{pmatrix} \end{aligned}$$
(9)

The first and second hidden states randomly generated numbers from Gaussian distributions \(\phi (11,2)\) and \(\phi (16,3)\) respectively, while the third state generated numbers by uniformly sampling from the interval \((16,26)\). These emission probabilities are listed in Eq. (10).

$$\begin{aligned} P(V_j|S_k) = {\left\{ \begin{array}{ll} \phi (11,2) &{} \text {if } k=1, \\ \phi (16,3) &{} \text {if } k=2, \\ U(16,26) &{} \text {if } k=3. \end{array}\right. } \end{aligned}$$
(10)
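For reference, a minimal sketch (our own names and seed) of a generator that follows Eqs. (9) and (10) and rounds the continuous values to integer symbols, as done in the experiments below:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.0, 0.9, 0.1],     # transition matrix of Eq. (9)
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

def emit(k):
    if k == 0:
        return rng.normal(11, 2)   # phi(11, 2)
    if k == 1:
        return rng.normal(16, 3)   # phi(16, 3)
    return rng.uniform(16, 26)     # U(16, 26)

def synthetic_sequence(T, q0=0):
    q, obs = q0, np.empty(T)
    for t in range(T):
        obs[t] = emit(q)
        q = rng.choice(3, p=A[q])
    return np.rint(obs).astype(int)  # round to the nearest integer symbol
```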
Fig. 4

Plot of the run times of the PMF–HMM algorithm versus the sequence lengths. The total time is split into its two components: (1) the time spent computing the count matrix and (2) the time spent in the probabilistic factorization of the count matrix

The continuous observations were rounded to the nearest integer to form a discrete symbol sequence. Seven different sequence lengths, \(T = 10^{3+0.5x};x=0,1,\ldots ,6\), were chosen for the experiments. For each sequence length, the HMM parameters were estimated with the PMF–HMM algorithm. Figure 4 plots the run times of the algorithm at the different sequence lengths. The total runtime is split into its two constituent parts: (1) the time taken to populate the count matrix and (2) the time taken to factorize the count matrix. As expected, the time taken to populate the count matrix varies linearly with the sequence length, as indicated by the unit slope of the log–log plot. However, the time spent in matrix factorization remained almost constant because of its insensitivity to the sequence length (the complexity is \(O(M^2N)\)). Hence, at smaller sequence lengths, matrix factorization dominated the total run time, but its contribution quickly faded away as the sequences grew longer. Figure 5 plots the estimated emission probabilities of the three hidden states along with the true emission probabilities given in Eq. (10). The error bars represent the \(95\%\) confidence interval of the estimated value, computed from 20 runs of each experiment. Clearly, as the sequence length was increased, the estimated emission probabilities converged to the true values and the error bars shrank.

Fig. 5

Comparison of the true and the estimated emission probabilities (from the PMF–HMM algorithm) at different sequence lengths \((T)\). For the short sequence (a), the estimates were poor and showed high variance. For longer sequences (b and c), the estimated parameters matched the true parameters quite well with high confidence

In Ref. [9], the author compares different characteristics of the NNMF–HMM algorithm with those of the Baum–Welch algorithm. We add the characteristics of the PMF–HMM algorithm to these published results so as to have common ground for comparing the three algorithms. In Fig. 6b, the likelihood values of observing the symbol sequence given the estimated HMM parameters, \(P(O|\lambda )\), are plotted versus the sequence length. It is remarkable that, despite having a different generative model, both the PMF–HMM and NNMF–HMM algorithms resulted in likelihood values that were on par with those of the Baum–Welch algorithm. In Fig. 6c, the Hellinger distance between the estimated and true emission probabilities is plotted versus the sequence length for the three algorithms. As the sequences grew longer, the estimated emission probabilities converged to the true values, which is indicated by the drop in the distance values. Overall, the Hellinger distance of the PMF–HMM algorithm was higher than that of the other two algorithms, which can also explain its marginally lower likelihood values plotted in Fig. 6b. However, the main difference was observed in the run times of the three algorithms, where the PMF–HMM algorithm was better than the other two by a significant margin (Fig. 6a).

Fig. 6

Comparison of different characteristics of the PMF–HMM, NNMF–HMM and Baum–Welch algorithms on the synthetic data at different sequence lengths: a run times, b post-training likelihood values of the sequence, \(P(O|\lambda )\), and c Hellinger distances

In Sect. 3.4, we discussed the ability of the PMF–HMM algorithm to handle non-degenerate observations. Here, we demonstrate the advantage of this ability for estimating the HMM parameters. In the previous experiment, the continuous observation values were rounded to the nearest integer to yield a discrete symbol sequence. This discretization came at the cost of some information loss. As an alternative, a soft discretization scheme can be employed, which assigns a real-valued observation to multiple symbols with different memberships. One such soft discretization scheme is shown in Fig. 7, which involves defining a Gaussian kernel centered at the observation (8.35 in this case). As every symbol is bounded on both sides, the degree of membership of the real observation to a symbol can be obtained by computing the area under the Gaussian kernel between the symbol's boundaries (see Fig. 7). Because a probability density function (the Gaussian kernel) is used, the membership values have the desired property of summing to unity. We used this soft discretization scheme to obtain non-degenerate observation vectors and computed the count matrix using Eq. (8). The standard deviation of the Gaussian kernel was fixed at 1.0 (equal to the interval width). The remaining steps for computing the HMM parameters were identical to those for the discrete symbol sequence. Figure 8 shows the estimated emission probabilities, at a sequence length of 1000, along with the true emission probabilities. This figure can be compared with Fig. 5a, where the hard discretization scheme was employed for obtaining the count matrix. The estimated emission probabilities resulting from soft discretization were not only closer to the true ones but also had tighter confidence intervals.
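A minimal sketch of this soft discretization (our own names; `edges` holds the \(M+1\) unit-width bin boundaries) is shown below; stacking the resulting membership vectors over time yields the \(T\times M\) array used in the Eq. (8) sketch.

```python
import numpy as np
from math import erf, sqrt

# Membership of a continuous value x to each bin: the area under a Gaussian kernel
# centered at x, taken between the bin edges, renormalized to sum to one.
def soft_membership(x, edges, sigma=1.0):
    cdf = np.array([0.5 * (1.0 + erf((e - x) / (sigma * sqrt(2.0)))) for e in edges])
    w = np.diff(cdf)               # probability mass falling inside each bin
    return w / w.sum()             # renormalize any mass falling outside the edges
```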

Fig. 7

Illustration of a scheme for generating a non-degenerate observation vector from a continuous value. Instead of assigning the observation a specific symbol value, its membership to the different symbols can be obtained by computing the area under a Gaussian kernel centered at that value

Fig. 8

Comparison of the true and estimated emission probabilities (by the PMF–HMM algorithm) at the sequence length \(T=1000\). The count matrix was obtained using the non-degenerate observations. The quality of the estimated parameters is much better in comparison to the case when discrete observations were used to obtain the count matrix (Fig. 5a)

4.2 Key Stroke Dynamics Data

This dataset was generated in a study at CMU for the analysis of typing rhythms to discriminate among users [8]. The purpose was to model the typing rhythms for separating impostors from actual users. In the study, 51 subjects were asked to type the same password 400 times over a period of a few weeks. The password had eleven characters (including the enter key) and was identical for all the subjects. The recorded data consisted of the hold time (the length of time a key was pressed) and the transition time (the time taken in moving from one key to the next). Therefore, for every typed password, the data had 21 time entries (11 keys and 10 transitions). We used this dataset to perform an HMM-based classification of a typed password into the most probable subject. The idea is to first learn an HMM for each subject from the passwords in the training set and thereafter classify the passwords in the test set using the Maximum A-Posteriori (MAP) criterion. The training and test sets were obtained by splitting the original dataset in half. As the first step, we discretized the continuous time values into 32 equi-spaced bins. Therefore, the subsequent HMMs comprised 32 observed symbols and 21 hidden states.
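A minimal sketch of the classification step, assuming one trained HMM \((A, B, \pi )\) per subject (the scaled forward recursion below is the standard way to evaluate \(P(O|\lambda )\) and is not specific to this chapter; all names are our own):

```python
import numpy as np

def log_likelihood(obs, A, B, pi):
    # Scaled forward algorithm: log P(O | lambda) for an integer-coded sequence.
    alpha = pi * B[:, obs[0]]
    ll = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        ll += np.log(alpha.sum())
        alpha /= alpha.sum()
    return ll

def classify(obs, models, priors):
    # MAP rule: pick the subject maximizing log P(O | lambda_s) + log P(subject s).
    scores = [log_likelihood(obs, *m) + np.log(p) for m, p in zip(models, priors)]
    return int(np.argmax(scores))
```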

Table 1 Comparison of the time spent in building HMM classifiers by the Baum–Welch and the proposed algorithm on the key stroke dynamics data. Cohen's kappa values on the test dataset are also listed for the two classifiers

Table 1 compares the Baum–Welch and the PMF–HMM algorithms for their run time and classification accuracy. We used both the degenerate and non-degenerate variants of the PMF–HMM algorithm. The classification accuracy is quantified using Cohen's kappa (\(\kappa \)) statistic, which is a measure of inter-rater agreement for categorical items [13]. The kappa statistic takes into account agreements occurring by chance and hence usually gives a conservative estimate of a classifier's performance. It turns out that the Baum–Welch algorithm took almost 10,000 times longer than the proposed algorithm for the same job. Moreover, the longer time taken by the Baum–Welch algorithm was not reflected in its classification performance on the test dataset. The kappa value, \(\kappa =0.38\), of the classifier trained by the Baum–Welch algorithm was only slightly better than that of the PMF–HMM algorithm \((\kappa =0.32)\). Furthermore, the PMF–HMM's classification performance was further improved with the use of non-degenerate observations \((\kappa =0.35)\) without significantly impacting the training time.
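For completeness, a minimal sketch of the kappa computation from a confusion matrix \(C\) (rows: true subjects, columns: predicted subjects; our own names):

```python
import numpy as np

def cohens_kappa(C):
    C = np.asarray(C, dtype=float)
    n = C.sum()
    p_o = np.trace(C) / n                               # observed agreement
    p_e = (C.sum(axis=0) @ C.sum(axis=1)) / n ** 2      # agreement expected by chance
    return (p_o - p_e) / (1.0 - p_e)
```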

5 Conclusion

In this chapter, we proposed a probabilistic matrix factorization based algorithm for the parameter estimation of a Hidden Markov Model. The 2D matrix that is factorized contains information about the number of times different pairs of symbols occur in an observed sequence. A generative model is proposed that governs the generation process of these symbol pairs. Thereafter, an EM algorithm is used to estimate the HMM parameters under this generative model. The time required for parameter estimation with the proposed algorithm can be orders of magnitude shorter than with the Baum–Welch algorithm, making it attractive for time-critical problems. We also discussed the ability of the proposed algorithm to handle non-degenerate observations and demonstrated the resulting improvement in the quality of the HMM parameter estimates.