
1 Introduction

In recent decades, automatic systems for voice pathology diagnosis have advanced remarkably. However, discriminating between pathological and normal voices remains a difficult research problem in speech classification. The aim of this paper is to support the detection of pathological voices among normal voices. Currently, the traditional way to detect a voice pathology is to visit a specialist who examines the patient's vocal folds with endoscopic tools. This process is considered time-consuming, complex and expensive. This area has therefore attracted considerable attention, with the goal of developing an accurate automatic tool that helps speech specialists diagnose voice pathologies early. In this work, we propose an automatic system for the detection of pathological voices that combines an incremental possibilistic SVM with an HMM.

A Hidden Markov Model (HMM) [17] is a statistical model consisting of a finite number of hidden states, each associated with its own probability distribution. HMM provide a probabilistic framework able to model time series of observations, and they have been used successfully for classification tasks, in particular in bioinformatics and speech processing.

Over the past 20 years, Support Vector Machines (SVM) have acquired an important place in solving classification and regression problems, since they provide a learning scheme that generalizes well while handling high-dimensional data [5, 10]. SVM were first introduced by Vapnik as an approximate implementation of the Structural Risk Minimization (SRM) induction principle [5, 7].

Various studies using HMM and SVM have been proposed for voice pathology detection and classification. Dibazar et al. [14] investigate HMM performance on the detection of pathological voices using 5 pathologies. They report that an HMM approach using MFCC (Mel-frequency cepstral coefficients) achieved a classification accuracy of 70%. The authors in [15] present an HMM-based method that classifies speech into a normal class and a pathological class; this system detects vocal fold pathology with a performance of 94%. In [16], HMM was applied to classify the voices of a database composed of 11 normal voices and 11 pathological voices. The proposed system obtains an accuracy of 100% for pathological voices and 98% for normal voices.

Pend et al. [11] combine PCA (principal component analysis) with SVM using 27 features in order to classify normal and pathological voices. Four classifiers were evaluated on the voice pathology problem in [12], where Support Vector Machines achieved the best performance. In [13], the authors developed an incremental method combining density clustering and Support Vector Machines for voice pathology detection. This method achieved a performance of 92%.

The main idea of SVM is to seek a model with optimal generalization performance by solving the SRM minimization problem through quadratic programming [6].

For the given data points \((x_i,y_i)\), where \(i=1,\ldots ,n\) and n is the number of samples, SVM learn a classifier \(f(x)=w^Tx+b\) whose hyperplane optimally separates the data by minimizing:

$$\begin{aligned} \frac{1}{2}\Vert w\Vert ^2+C\sum _{i=1}^n\xi _i \end{aligned}$$
(1)

where C is a regularization term and \(\xi _i\) are positive slack variables, subject to the inequality constraints:

$$\begin{aligned} y_i[w^Tx_i+b]\ge 1-\xi _i, \quad \xi _i\ge 0, \quad i=1,2,\ldots ,n \end{aligned}$$
(2)

The solution of this quadratic programming problem can be obtained through the Lagrangian, which yields the following dual objective function:

$$\begin{aligned} L_d = \max _{\alpha } \sum _{i=1}^n \alpha _i - \frac{1}{2}\sum _{i=1}^n \sum _{j=1}^n \alpha _i \alpha _j y_i y_j K(x_i, x_j). \end{aligned}$$
(3)

where \(K( x_i, x_j)\) is the kernel evaluated on the data \(x_i\) and \(x_j\), and the coefficients \(\alpha _i\) are the Lagrange multipliers, computed for each sample of the data set.

In speech classification, the major problems of detection systems that use batch methods can be summarized in two points: the time-varying nature of speech samples and the amount of data available for the learning stage. Hence, online or incremental techniques provide a valuable solution in applications that handle speech data.

Given these facts, the main contribution of this paper is the application of an incremental possibilistic SVM technique combined with HMM to the problem of voice pathology detection. The proposed system begins by extracting parameters from the samples and then applies the proposed method to classify the voice samples.

This paper is organized as follows. Sect. 2 presents and discusses the proposed voice pathology detection system. Sect. 3 describes the feature extraction step. Sect. 4 presents and evaluates the experimental conditions and results. Sect. 5 gives the conclusion and the perspectives of this work.

2 The Proposed Incremental Possibilistic SVM-HMM System

In this paper, the outputs of the incremental possibilistic Support Vector Machines (SVMs) are treated as probabilities that feed the HMM-based decoder, where they are used in particular to compute the HMM likelihoods (see Fig. 1). Combining the incremental possibilistic SVM with an HMM is an attractive and robust solution, since SVM lack the ability to model time series. Hence, the HMM uses the probabilities output by the possibilistic SVM to provide its state-dependent likelihoods as follows:

$$\begin{aligned} P(x|q_i) \propto \sum _{k=1}^K c_{ik}\cdot \frac{Pr(k|x)}{Pr(k)} \end{aligned}$$
(4)

where, for a given feature vector x, Pr(k|x) is the posterior probability of class k, Pr(k) is the a-priori probability of class k, and \(c_{ik}\) are the mixture weights of HMM state \(q_i\).
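As a minimal sketch of Eq. (4), the scaled likelihood of one HMM state can be computed from the class posteriors, the class priors and the mixture weights of that state. The function name and the toy numbers below are illustrative assumptions and do not come from the experiments.

```python
from typing import Sequence


def scaled_state_likelihood(
    posteriors: Sequence[float],  # Pr(k|x) for k = 1..K, e.g. SVM outputs
    priors: Sequence[float],      # Pr(k), a-priori class probabilities
    weights: Sequence[float],     # c_ik, mixture weights of HMM state q_i
) -> float:
    """Return a value proportional to P(x|q_i), following Eq. (4)."""
    if not (len(posteriors) == len(priors) == len(weights)):
        raise ValueError("all inputs must have the same length K")
    if any(p <= 0.0 for p in priors):
        raise ValueError("class priors must be strictly positive")
    # Sum over classes of c_ik * Pr(k|x) / Pr(k).
    return sum(c * post / prior
               for c, post, prior in zip(weights, posteriors, priors))


# Toy example with K = 2 classes (illustrative numbers only).
likelihood = scaled_state_likelihood(
    posteriors=[0.9, 0.1], priors=[0.5, 0.5], weights=[1.0, 0.0]
)
```

Dividing each posterior by its prior turns it into a quantity proportional to a class-conditional likelihood, which is the form an HMM decoder expects.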

The proposed method is shown in Fig. 1. For our voice database, the Mel Frequency Cepstral Coefficients (MFCC) are extracted for each voice sample. Furthermore, in this work we use incremental learning, which behaves like online learning by repeatedly presenting new data to the current classifier. In other words, each step of the incremental learning of SVM consists of adding a new sample to the solution and retiring the old samples while keeping their Support Vectors (SV), which describe the learned decision boundary. The training set for the next step of the incremental learning process is thus obtained by combining the new incoming sample with the SV of the previous samples.
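The retention scheme described above (retrain on the previous support vectors plus the new sample) can be sketched as follows. This is only an illustration under strong assumptions: `toy_train` is a hypothetical stand-in trainer for separable 1-D data in which the negative class lies below the positive class, so the hard-margin support vectors are simply the closest pair of opposite-class points. A real system would call a kernel SVM solver instead.

```python
from typing import Callable, List, Optional, Tuple

Sample = Tuple[float, int]  # (1-D feature, label in {-1, +1})
Trainer = Callable[[List[Sample]], Tuple[Optional[float], List[Sample]]]


def toy_train(samples: List[Sample]) -> Tuple[Optional[float], List[Sample]]:
    """Hypothetical stand-in trainer: returns (threshold, support vectors)."""
    neg = [s for s in samples if s[1] == -1]
    pos = [s for s in samples if s[1] == +1]
    if not neg or not pos:
        # Only one class seen so far: no boundary yet, keep every sample.
        return None, list(samples)
    hi_neg = max(neg, key=lambda s: s[0])  # negative sample closest to boundary
    lo_pos = min(pos, key=lambda s: s[0])  # positive sample closest to boundary
    threshold = 0.5 * (hi_neg[0] + lo_pos[0])
    return threshold, [hi_neg, lo_pos]


def incremental_fit(stream: List[Sample], train: Trainer = toy_train):
    """Process samples one by one, keeping only the support vectors."""
    retained: List[Sample] = []
    model: Optional[float] = None
    for sample in stream:
        working_set = retained + [sample]     # previous SVs + new sample
        model, retained = train(working_set)  # old non-SV samples are retired
    return model, retained


stream = [(0.0, -1), (5.0, 1), (1.0, -1), (4.0, 1), (-3.0, -1), (9.0, 1)]
model, support_vectors = incremental_fit(stream)
```

For this toy stream, the incremental result coincides with training once on all samples, because every retired sample lies outside the margin of the later classifiers. That equivalence is not guaranteed in general.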

Fig. 1. The block diagram of the proposed method

The key idea of incremental SVM is to keep the Karush–Kuhn–Tucker (KKT) conditions satisfied while retiring old samples and adding a new one to the solution. Recall that the KKT conditions are:

$$\begin{aligned} g_t=-1+K_{t,:}\alpha +\mu y_t \left\{ \begin{array}{ll} \ge 0, & \text{if } \alpha _t=0\\ =0, & \text{if } 0<\alpha _t<C\\ \le 0, & \text{if } \alpha _t=C \end{array} \right. \qquad \frac{\partial W}{\partial \mu }=y^T \alpha =0 \end{aligned}$$
(5)
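The conditions of Eq. (5) can be checked numerically for a candidate solution. The sketch below assumes that \(K\) is the label-signed kernel matrix, \(K_{tj}=y_t y_j k(x_t,x_j)\), and that \(\mu \) is the multiplier of the equality constraint; both conventions are assumptions, since the text does not define them. The function names and the tolerance are hypothetical.

```python
from typing import List, Sequence


def kkt_gradients(K: Sequence[Sequence[float]], y: Sequence[int],
                  alpha: Sequence[float], mu: float) -> List[float]:
    """g_t = -1 + K[t, :] . alpha + mu * y_t for every sample t."""
    return [-1.0 + sum(k_tj * a_j for k_tj, a_j in zip(row, alpha)) + mu * y_t
            for row, y_t in zip(K, y)]


def kkt_satisfied(K, y, alpha, mu: float, C: float, tol: float = 1e-8) -> bool:
    """Check the three cases of Eq. (5) and the constraint y^T alpha = 0."""
    for g_t, a_t in zip(kkt_gradients(K, y, alpha, mu), alpha):
        if a_t <= tol:                 # alpha_t = 0      ->  g_t >= 0
            if g_t < -tol:
                return False
        elif a_t >= C - tol:           # alpha_t = C      ->  g_t <= 0
            if g_t > tol:
                return False
        elif abs(g_t) > tol:           # 0 < alpha_t < C  ->  g_t = 0
            return False
    return abs(sum(y_t * a_t for y_t, a_t in zip(y, alpha))) <= tol


# Toy problem: x = +1 (label +1) and x = -1 (label -1) with a linear kernel,
# so the label-signed kernel matrix is all ones.
K_signed = [[1.0, 1.0], [1.0, 1.0]]
labels = [1, -1]
ok = kkt_satisfied(K_signed, labels, alpha=[0.5, 0.5], mu=0.0, C=10.0)
```

An incremental update would adjust \(\alpha \) and \(\mu \) until such a check passes again after a sample is added or retired.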

The following algorithm summarizes the incremental SVM steps for learning a new incoming sample \((x_{c},y_{c})\). It constructs a classifier \(h^c\) from the classifier \(h^{c-1}\) [1].

figure a

To our knowledge, no efficient application of SVM-based incremental learning to voice pathology detection has been reported in recent years. We therefore propose incremental SVM learning for voice pathology detection, based on possibilistic degrees and combined with HMM.

2.1 Possibilistic SVM

SVM were first introduced to solve pattern classification problems. In recent years, SVM have demonstrated robustness and have been applied successfully in various practical applications. However, in many real-world applications the performance of SVM can be seriously affected by the nature of the available data, e.g. when the data are noisy. Voice pathology detection is often considered a complicated and delicate problem, since voice samples are non-stationary signals that vary considerably with the speaker and with the way the sample is pronounced.

Note that speech, which is generally produced on a short-time scale, includes non-stationary parts due to the physiological system of the speaker, which determines the amplitude and frequency modulation.

Furthermore, the behavior of SVM depends mostly on the training data, and the optimal hyperplane is determined mainly by the support vectors. Thus, variation in the voice samples may lead SVM to misclassify the data [4].

In the literature, various solutions have been proposed for this kind of problem, such as weighted SVM, adaptive SVM and central SVM. In this paper, we propose a possibilistic SVM based on a geometric distance to improve the performance of the conventional SVM on voice pathology detection. The main idea is to assign different possibilistic degrees to the different voice samples while the SVM computes class posterior probabilities. These degrees are derived from the Euclidean distance between a sample and the center of each class. As a result, the membership degree of a sample \(x_i\) near the center of the class \(y_i\) is larger than the degrees of samples far from the center. We use the Euclidean distance algorithm to generate the possibilistic degrees [2].

The formulation of the proposed possibilistic SVM is defined in three steps (see Fig. 2):

Fig. 2. The process of possibilistic SVM

As shown in Fig. 2, the first step computes the Euclidean distance between the centers of the different classes \(y_k\) and the data \(x_i\) to be classified. Then the possibilistic degree is evaluated, which measures the degree to which the data \(x_i\) belongs to the class \(y_i\). The final step incorporates those degrees into the SVM in order to help the HMM-based decoder classify the voice pathologies.

Euclidean Distance. The Euclidean distance is computed between \(X_i\) and the center \(CY_i\) of each class \(Y_i\), where \(i\in (1,\ldots ,k)\). We suppose that there is a possibility that the data \(X_i\) belongs to any one of the classes \(Y_i\). The smallest value of the Euclidean distance \(d(CY_i,X_i)\) corresponds to the class nearest to the data \(X_i\), and the largest value corresponds to the class farthest from \(X_i\).

Possibilistic Degrees. The possibilistic degrees, denoted \(m_i(X)\), measure the membership degree of every voice \(X_i\) of our data set in a given class \(Y_i\). These degrees are computed as follows:

$$\begin{aligned} m_i(X) := 1/d(C_i,X_i) \end{aligned}$$
(6)

where \(C_i\) is the center of the \(i^{th}\) class and d is the Euclidean distance computed in the previous step.
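The first two steps (class centers, Euclidean distance, degree of Eq. (6)) can be sketched in a few lines. The class center is taken here as the mean of the training samples of the class, and a small constant `eps` guards against division by zero when a sample coincides with a center; both choices are assumptions not stated in the text, and the helper names are hypothetical.

```python
import math
from typing import Dict, Hashable, List, Sequence


def class_centers(samples: Sequence[Sequence[float]],
                  labels: Sequence[Hashable]) -> Dict[Hashable, List[float]]:
    """Center of each class, assumed to be the mean of its samples."""
    sums: Dict[Hashable, List[float]] = {}
    counts: Dict[Hashable, int] = {}
    for x, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance d(a, b)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def possibilistic_degree(x: Sequence[float], center: Sequence[float],
                         eps: float = 1e-12) -> float:
    """m_i(X) = 1 / d(C_i, X), Eq. (6); eps avoids a division by zero."""
    return 1.0 / max(euclidean(center, x), eps)


# Toy 2-D data with two classes (illustrative values only).
data = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
targets = ["normal", "normal", "pathological", "pathological"]
centers = class_centers(data, targets)
degree_near = possibilistic_degree([0.0, 0.0], centers["normal"])
degree_far = possibilistic_degree([4.0, 0.0], centers["normal"])
```

A sample close to the class center receives a larger degree than a distant one, which is the behavior the possibilistic weighting relies on.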

Fig. 3 shows the possibilistic degrees generated by step 2 for a given class (where two samples of Class 1 are misclassified). As can be seen, the degrees of the training voice samples close to the center of Class 1 are much larger, and those of the samples farthest from the center are much smaller.

Fig. 3. An example of possibilistic degrees generated by the possibilistic SVM

Formulation of Possibilistic SVM. The purpose of incorporating possibilistic degrees is to relax the constraints for data that have a larger degree of membership in a given class.

Hence, with the formulation of possibilistic SVM for non-separable data, every training sample must satisfy the following constraints:

$$\begin{aligned} \mathbf m(x_t) (w^{ij})^T \phi (x_t)+b^{ij} \ge 1-\xi ^{ij}_t, \text{ if } ~~ y_t=i \nonumber \\ \mathbf m(x_t) (w^{ij})^T \phi (x_t)+b^{ij} \le -1+\xi ^{ij}_t, \text{ if } ~~ y_t=j \nonumber \\ \xi ^{ij}_t\ge 0 \end{aligned}$$
(7)

where \(m(x_t)\) is the possibilistic degree of the sample \(x_t\).

We also derive the dual of the possibilistic SVM formulation, which yields a new dual representation including the possibilistic degrees m(x):

$$\begin{aligned} L_d = \max _{\alpha } \sum _{i=1}^m \alpha _i - \frac{1}{2}\sum _{i=1}^m \sum _{j=1}^m \mathbf m(x_i) \mathbf m(x_j) \alpha _i \alpha _j y_i y_j \varPhi (x_i)^T \varPhi (x_j). \end{aligned}$$
(8)

In the new formulation of the possibilistic SVM, the decision function is given by:

$$\begin{aligned} \sum _{i=1}^m \mathbf m(x) \alpha _i y_i \varPhi (x_i) +b \end{aligned}$$
(9)
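One possible reading of Eq. (9) is a kernel expansion in which each training sample \(x_i\) contributes \(m(x_i)\,\alpha _i\,y_i\,K(x_i,x)\). The sketch below follows that reading with an RBF kernel. The kernelized form, the attachment of the degree to each training sample and all function names are assumptions made for illustration, not the formulation confirmed by the text.

```python
import math
from typing import Sequence


def rbf_kernel(a: Sequence[float], b: Sequence[float], gamma: float) -> float:
    """Gaussian (RBF) kernel exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def possibilistic_decision(
    x: Sequence[float],
    support_vectors: Sequence[Sequence[float]],
    alphas: Sequence[float],    # Lagrange multipliers alpha_i
    labels: Sequence[int],      # y_i in {-1, +1}
    degrees: Sequence[float],   # assumed m(x_i), one degree per training sample
    bias: float,
    gamma: float,
) -> float:
    """f(x) = sum_i m(x_i) * alpha_i * y_i * K(x_i, x) + b (assumed reading)."""
    return bias + sum(
        m_i * a_i * y_i * rbf_kernel(x_i, x, gamma)
        for x_i, a_i, y_i, m_i in zip(support_vectors, alphas, labels, degrees)
    )


# Two symmetric support vectors with equal weights (illustrative values).
svs = [[0.0], [2.0]]
score_mid = possibilistic_decision([1.0], svs, [1.0, 1.0], [1, -1],
                                   [1.0, 1.0], bias=0.0, gamma=1.0)
score_left = possibilistic_decision([0.0], svs, [1.0, 1.0], [1, -1],
                                    [1.0, 1.0], bias=0.0, gamma=1.0)
```

With equal degrees the expansion reduces to the conventional SVM decision function, so the degrees act as per-sample weights on the support vectors.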

2.2 Incremental Possibilistic SVM

In this paper, the training sample \(x_i\) is the vector of MFCC features of a voice file from the MEEI database. In supervised learning, the label \(y_i\) is the class to which the sample \(x_i\) belongs. Fig. 4 shows the process of the proposed incremental possibilistic SVM.

Fig. 4. Incremental possibilistic SVM

As seen in Fig. 4, the incremental possibilistic SVM first gets a new training vector from the data X. Then the existing SVM is updated to include the new training sample. Before the probabilities are computed, a possibilistic degree is calculated for the given data and incorporated into the SVM formulation. This process is repeated until the posterior probabilities of all training samples have been computed.

2.3 Hidden Markov Model

A Hidden Markov Model (HMM) is a statistical model used to estimate the probability of a set of observations based on a sequence of hidden state transitions. The use of HMM for speech recognition has become popular over the last decade thanks to its inherent statistical framework. HMM are simple networks that can generate speech using a sequence of states for each model, modeling the short-term spectra associated with each state. The following equation gives the state transition probability distribution \(a_{ij}\):

$$\begin{aligned} a_{ij}=P(q_{t+1}=j\mid q_t=i), \quad 1\le i, j\le N, \qquad \sum _{j=1}^{N} a_{ij}=1, \quad 1\le i\le N \end{aligned}$$
(10)

where N is the number of states in a given model and \(q_t\) is the current state.
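To make the role of \(a_{ij}\) concrete, the sketch below validates the row-sum constraint of Eq. (10) and evaluates the likelihood of an observation sequence with the standard forward recursion. The initial state distribution and the per-frame state likelihoods (which could come from Eq. (4)) are inputs assumed for this illustration; the function names are hypothetical.

```python
from typing import List, Sequence


def is_row_stochastic(A: Sequence[Sequence[float]], tol: float = 1e-9) -> bool:
    """Check a_ij >= 0 and sum_j a_ij = 1 for every state i, as in Eq. (10)."""
    return all(all(a >= 0.0 for a in row) and abs(sum(row) - 1.0) <= tol
               for row in A)


def forward_likelihood(initial: Sequence[float],
                       A: Sequence[Sequence[float]],
                       emissions: Sequence[Sequence[float]]) -> float:
    """Forward algorithm; emissions[t][j] is the likelihood of frame t in state j."""
    if not is_row_stochastic(A):
        raise ValueError("each row of the transition matrix must sum to 1")
    if not emissions:
        raise ValueError("at least one observation frame is required")
    n = len(A)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha: List[float] = [initial[i] * emissions[0][i] for i in range(n)]
    # Induction: alpha_{t+1}(j) = b_j(o_{t+1}) * sum_i alpha_t(i) * a_ij
    for frame in emissions[1:]:
        alpha = [frame[j] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
    # Termination: sum over the final states
    return sum(alpha)


# Two-state toy model with two observation frames (illustrative numbers).
transitions = [[0.9, 0.1], [0.2, 0.8]]
sequence_likelihood = forward_likelihood(
    initial=[1.0, 0.0], A=transitions, emissions=[[0.5, 0.1], [0.4, 0.3]]
)
```

For long sequences the recursion is usually run in the log domain or with per-frame scaling to avoid numerical underflow.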

3 Feature Extraction

Feature extraction is the first step of a recognition system, whose scheme is summarized in Fig. 1. In this study, Mel-frequency cepstral coefficient (MFCC) features [3] are extracted. These coefficients are a well-known representation that provides significant features usable in several pattern detection problems.

The differential (Delta) and acceleration (Delta-Delta) coefficients were also calculated and used. Furthermore, the frame energy is appended to each feature vector. In the MEEI database, the pathological speakers have different voice disorders, such as traumatic, organic and psychogenic problems. The MEEI database contains 53 healthy samples and 724 samples with voice disorders. The speech samples were recorded in a controlled environment at a sampling rate of 25 kHz or 50 kHz with 16 bits of resolution. The healthy voice samples have a duration of 3 s, and the pathological voice samples have a duration of 1 s. In this work, we set the duration of each frame to 20 ms and use a Hamming window to extract the speech frames. For our experiments, the voice sample set consists of 53 normal voice samples and 139 pathological (Keratosis/Vocal Poly/Adductor) voice samples.
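The framing step mentioned above (20 ms frames with a Hamming window) can be sketched as follows. The 10 ms hop between frames is an assumption, since the frame shift is not given in the text, and the function names are hypothetical. The MFCC, Delta and energy computations that would follow are not shown.

```python
import math
from typing import List, Sequence


def hamming(n: int) -> List[float]:
    """Hamming window w[k] = 0.54 - 0.46 * cos(2*pi*k / (n - 1))."""
    if n < 2:
        raise ValueError("window length must be at least 2")
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1))
            for k in range(n)]


def frame_signal(signal: Sequence[float], sample_rate: int,
                 frame_ms: float = 20.0, hop_ms: float = 10.0) -> List[List[float]]:
    """Split a signal into Hamming-windowed frames (hop_ms is an assumption)."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    window = hamming(frame_len)
    frames = []
    # Only full frames are kept; trailing samples that do not fill a frame are dropped.
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames
```

At a sampling rate of 25 kHz, a 20 ms frame contains 500 samples, so a 1 s pathological recording yields 99 frames with the assumed 10 ms hop.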

The speakers are of similar age and gender and have different voice pathologies. The following table describes the MEEI database used in this work (Table 1):

Table 1. Normal and pathological speakers from the MEEI database

4 Experimental Results

Voice pathology detection performance is evaluated for four algorithms: standard SVM, standard HMM, batch possibilistic SVM and the proposed incremental possibilistic SVM-HMM. The results are presented for a classification of all voices into four classes: normal/Keratosis/Vocal Poly/Adductor. For the SVM method, several parameters must be set, such as the kernel width \(\gamma \) and the regularization parameter C. We used the optimum values \(\gamma =\frac{1}{K}\) and \(C=10\), found by a grid search with cross-validation. RBF is selected as the kernel function. The RBF (Gaussian) kernel was chosen after a study on our data with different kernel functions, such as linear, polynomial and sigmoid.

The voice samples are split into a training portion (70%) and a testing portion (30%). To investigate the performance of our voice pathology detection system, we consider four measures: Equal Error Rate (EER), the performance accuracy (DCF), Sensitivity and Specificity.
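Sensitivity and specificity follow directly from the confusion counts of a detector. The sketch below treats the pathological class as the positive class, which is an assumption, and uses hypothetical label strings and function names; EER and DCF additionally require detection scores and are not covered.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[str], y_pred: Sequence[str],
                     positive: str = "pathological") -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN), with `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of pathological voices that are detected: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of normal voices that are correctly accepted: TN / (TN + FP)."""
    return tn / (tn + fp)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all test samples that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)
```

With a strongly unbalanced test set, such as many more pathological than normal samples, accuracy alone can be misleading, which is why sensitivity and specificity are reported separately.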

Table 2. Comparison of EER (%), Efficiency (DCF(%)), Sensitivity (%) and Specificity (%) for the different voice pathology detection systems and the proposed incremental possibilistic SVM-HMM system

The results given in Table 2 show that the detection system based on the hybrid incremental possibilistic SVM-HMM yields the best results in this study. The proposed system, using the incremental possibilistic SVM-HMM and MFCC coefficients with their first and second derivatives, outperforms the standard HMM, the standard SVM and the batch possibilistic SVM, with an accuracy of 99%.

Voice pathology detection using the standard HMM gives the worst results, with a rate of 90%. The detection system using the possibilistic SVM gives a rate of 93%. Table 2 presents the performance of the proposed incremental hybrid method compared with the standard SVM and HMM methods and the batch possibilistic SVM method. The results obtained in this study for voice pathology detection are very encouraging. As future work, we suggest investigating different multi-pathology detectors and improving the incremental hybrid classifiers in order to determine the degree of voice pathology.

Furthermore, Table 3 compares the proposed hybrid incremental method with other recent state-of-the-art methods for voice pathology detection on the MEEI dataset under similar experimental conditions.

Table 3. Comparison of the performance of our proposed incremental method possibilistic SVM-HMM and different methods in the state-of-art

Table 3 shows that the proposed incremental possibilistic SVM-HMM method improves the robustness of the voice pathology detection system and achieves the highest accuracy compared with several existing state-of-the-art methods.

5 Conclusion

Standard SVM and standard HMM work correctly in a batch setting, where the algorithm has a fixed collection of samples and uses them to construct a hypothesis that is then used for detection and classification tasks without further modification. This paper proposes to combine an incremental possibilistic SVM with HMM for voice pathology detection in an online setting. In the proposed method, we incorporate possibilistic degrees into the class posterior probabilities computed by the SVM. Then, to improve the detection decision, we pass those possibilistic probabilities to an HMM-based decoder in order to detect normal voices among pathological voices. The experimental results on the normal/pathological voices from the MEEI database suggest that the proposed method achieves high accuracy in the detection task compared with several methods in the literature.