Introduction

Parkinson’s disease is a degenerative disease of the nervous system which, as it progresses, causes patients to have difficulty walking, talking, thinking, or completing other simple tasks [1, 2, 3]. PD usually affects people over the age of 50, and for most elderly people with Parkinson’s disease, physical visits to the clinic for diagnosis, monitoring, and treatment are difficult [4]. As the disease causes vocal impairment in approximately 90% of patients with PD [5], the development of telemonitoring systems with accurate, reliable, and unbiased predictive models can be very useful for diagnosing the disease in its early stages and for lowering the inconvenience and cost of physical visits [1, 6].

There have been recent studies on the detection of voice disorders with machine-learning tools using acoustic measurements (features) of dysphonia extracted from both patients and control (healthy) subjects [1, 7-9]. The practical limitations on labor and expense associated with obtaining and screening each specimen typically constrain the number of samples available for such preliminary studies. As there are many interdependent vocal features that can be extracted, selecting a minimal yet maximally informative subset of features is an important step in reducing the input dimensionality, and thus the complexity of the learning task of the classifier [10, 11]. However, small sample sizes and large numbers of interrelated variables, coupled with efforts to maintain low false-discovery rates, weaken the power of statistical methods to identify features that could contribute to the prediction of diseases [1].

This paper addresses two important issues in constructing an unbiased, reliable telemonitoring system of PD. The first issue, selecting a minimal subset of features with maximal joint relevance to the PD-score, a binary score indicating whether or not the feature vector (the sample) belongs to a person with PD, can be accomplished iteratively by including the features that are maximally relevant to the PD-score while not redundant with the already selected ones. This criterion is known as maximum-relevance-minimum-redundancy (mRMR) [11]. For measuring the strength of the relevance between a feature and the PD-score, we apply the mutual information measure [11, 12]; and for assessing its statistical significance, we use the permutation test [13]. The second issue is the construction of a predictive model with minimal bias (i.e. maximizing the generalization of the predictions so that the model works well with unseen test examples). For this task, we use a Support Vector Machine (SVM) [14] classification model and test its generalization with a more realistic, less-biased cross-validation scheme that we call leave-one-individual-out. The reason for using this version of leave-one-out, tailored to the dataset at hand, is the presence of multiple speech recordings per subject. The conventional bootstrapping or leave-one-out validation methods [15-20] would spare some samples of an individual for training and some for testing, thus creating an artificial overlap between the training and test sets that is not typical of a real testing scenario.

The remainder of this paper is organized as follows. The “Materials and methods” section presents the dataset and the methods used. The “Experimental results” section presents the experimental results. Discussions are given in the “Discussions” section.

Materials and methods

Parkinson’s dataset

The Parkinson’s dataset consists of six (or, for a few subjects, seven) recordings from each of 32 individuals (24 with PD and eight healthy), making a total of 195 recordings, each represented by a 22-dimensional feature vector along with a binary PD-score to predict. A PD-score of 1 indicates that the feature vector belongs to a person with PD and a score of 0 indicates that it belongs to a healthy subject. The features in the dataset are diverse: some are traditional measures based on the application of short-time autocorrelation to successive segments of the signal; others are so-called non-standard measures based on nonlinear dynamical systems theory. The labels and short explanations for the measurements, along with some basic statistics of the original dataset, are given in Table 1. The feature values have been linearly mapped to the [−1, +1] interval as a preprocessing step for classification. The dataset was created at the University of Oxford, in collaboration with the National Centre for Voice and Speech, Denver, Colorado [1], and has been made available online at the UCI machine-learning archive [21] recently, in June 2008.
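For illustration, a loading and rescaling sketch is given below. It assumes the UCI file is named parkinsons.data, that its name column encodes the subject within the recording identifier, and that status holds the PD-score; these details come from the UCI archive, not from this paper, so they should be treated as assumptions.

```python
# Minimal loading/preprocessing sketch (assumed UCI file layout).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("parkinsons.data")              # 195 recordings, 24 columns
y = df["status"].values                          # binary PD-score (1 = PD, 0 = healthy)
# subject id = recording id without its trailing recording number
groups = df["name"].str.rsplit("_", n=1).str[0].values
X = df.drop(columns=["name", "status"]).values   # 22 dysphonia features

# linear mapping of each feature to the [-1, +1] interval
X_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
```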

Table 1 Description of the features of the Parkinson’s dataset

Mutual information

Mutual information (MI) [12] is a classical, information-theoretic, entropy-based measure of relevance/dependence. It has been used for feature selection in filter methods (i.e. for sorting the variables from the most relevant to the least) in several studies, with the aim of measuring the relevance of the features to the target variable [11, 22, 23].

The entropy of a random variable X, denoted by H(X) [12], is a measure of the uncertainty of X; thus, it quantifies how difficult it is to predict that variable. Shannon’s entropy can be written as an expectation:

$$H\left( X \right) = E\left[ { - \log P\left( X \right)} \right] = - \sum\limits_x {\left[ {p\left( x \right)\log \left( {p\left( x \right)} \right)} \right]} $$
(1)

where p(x) = P(X = x) is the probability distribution function of X (more precisely, the probability mass function in the discrete case, but the results generalize). Hence, Shannon’s entropy is the average amount of information contained in the random variable X. In other words, it is the uncertainty removed after the actual outcome of X is revealed. Mutual information (I) is a measure of the mutual dependence of two variables based on the entropy (Fig. 1):

$$I\left( {X;Y} \right) = H\left( X \right) + H\left( Y \right) - H\left( {X,Y} \right)$$
(2)
Fig. 1
figure 1

Visualization of mutual information with respect to entropy

The measure I is also the Kullback–Leibler (KL) divergence of the product P(X)P(Y) of the two marginal probability distributions from the joint probability distribution, P(X,Y):

$$I\left( {X;Y} \right) = D_{{\text{KL}}} \left( {\left. {P\left( {X,Y} \right)} \right\|P\left( X \right) \cdot P\left( Y \right)} \right) = \sum\limits_x {\sum\limits_y {\left[ {p\left( {x,y} \right)\log \left( {\frac{{p\left( {x,y} \right)}}{{p\left( x \right) \cdot p\left( y \right)}}} \right)} \right]} } $$
(3)

where p(x,y) = P(X = x, Y = y).

In this study, to simplify the MI computations, the features are discretized into nine discrete levels (the original values of the variables are used for the predictions), similar to [11]. For the discretization of each variable, we used its mean μ and its standard deviation σ such that feature values between μ − σ/2 and μ + σ/2 are converted to 0. The four intervals of size σ to the right of μ + σ/2 are converted to discrete levels 1 to 4, and the four intervals of size σ to the left of μ − σ/2 are mapped to discrete levels −1 to −4. Very large positive or negative feature values are truncated and discretized to ±4 accordingly.
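The following is a minimal numpy sketch of this nine-level discretization together with the discrete MI estimate of Eq. 3; the helper names are ours, and the code is only one way to realize the description above:

```python
import numpy as np

def discretize(x):
    """Map a feature to nine levels {-4, ..., 4} using its mean and std."""
    z = (x - x.mean()) / x.std()
    # values within half a std of the mean -> 0; each further sigma-wide
    # interval -> the next level; anything beyond 4.5 sigma is clipped to +/-4
    levels = np.floor(np.abs(z) + 0.5) * np.sign(z)
    return np.clip(levels, -4, 4).astype(int)

def mutual_information(a, b):
    """Discrete MI (Eq. 3) from the empirical joint distribution of a and b."""
    joint = np.zeros((a.max() - a.min() + 1, b.max() - b.min() + 1))
    for ai, bi in zip(a - a.min(), b - b.min()):
        joint[ai, bi] += 1
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float((p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])).sum())

# example: MI between the discretized j-th feature and the PD-score
# mi = mutual_information(discretize(X[:, j]), y)
```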

Permutation test for determining statistical significance of MI-scores

MI-scores can easily be overestimated, especially when applied to real-world datasets. The permutation test is applied to each feature to assess the statistical significance of its mutual information score with the target variable. Features found to have no significant MI-score are simply eliminated from the rest of the feature-selection process. To estimate the statistical significance of the MI-score of a feature with the class-labels (i.e. the target values to predict), we can use a common statistical approach that involves random shuffling of the class-labels of the samples. We then test the null hypothesis that the feature does not contain any information about the class-labels and that the above-zero MI-score is simply due to chance. Our alternative hypothesis is that the above-zero MI-score is not accidental.

By randomly shuffling the class-labels of the samples, we ensure that a feature does not have any mutual information with them. Then, under the null hypothesis, the MI-score on the original labels should fall within the range of MI-scores achieved on the shuffled labels. In each of n repetitions of the shuffling procedure, the MI-score between the shuffled class-labels and the feature is recomputed. For each feature, the fraction of repetitions in which the original MI-score is higher than the MI-score obtained on the randomly permuted version is the confidence level that the original MI-score is not accidental. Features with a confidence level smaller than a predefined threshold (e.g. 95%) are eliminated even if they have high mutual information with the class-labels.

For example, if a high-entropy feature is present in the dataset, it could easily have high mutual information with the class-labels (conceptually, a large intersection of the two sets in Fig. 1). An extreme example would be an ID field (an integer ID given to each patient) treated as a feature. Every shuffling of the class-labels would produce a different mapping, but the mutual information would remain as high as the original score, thus leading to a confidence level of zero. Such a feature would be eliminated by the permutation test.
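A minimal sketch of this test, using scikit-learn's mutual_info_score as the discrete MI estimator on an already-discretized feature (the function name and defaults below are ours), could look like this:

```python
import numpy as np
from sklearn.metrics import mutual_info_score   # discrete MI between two labelings

def permutation_confidence(feature_levels, labels, n_perm=1000, seed=0):
    """Fraction of shufflings in which the original MI exceeds the permuted MI."""
    rng = np.random.default_rng(seed)
    original = mutual_info_score(labels, feature_levels)
    wins = sum(
        original > mutual_info_score(rng.permutation(labels), feature_levels)
        for _ in range(n_perm)
    )
    return wins / n_perm   # confidence level in [0, 1]

# features whose confidence falls below 0.95 would be dropped before the
# mRMR ranking (in this dataset, NHR and RPDE turn out to be such features)
```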

mRMR (maximum relevance–minimum redundancy) approach

For feature selection, a mutual-information-based method called maximum relevance minimum redundancy (mRMR) [11] is applied to the features that have passed the significance test described in the “Permutation test for determining statistical significance of MI-scores” section. The mRMR method is based on the observation that combinations of individually good variables do not necessarily lead to improved classification/prediction performance. That is, to maximize the joint dependency of the top-ranking variables on the target variable, the redundancy among them must be reduced, which suggests incrementally selecting the maximally relevant variables while avoiding the redundant ones. This helps the top k selected features to have the highest joint dependency.

According to the mRMR approach, the kth feature chosen for inclusion in the set of selected variables, S, must satisfy the following condition:

$$\mathop {\max }\limits_{x_j \in X - S_{k - 1} } \left[ {I\left( {x_j ,T} \right) - \frac{1}{{k - 1}}\sum\limits_{x_i \in S_{k - 1} } {I\left( {x_j ;x_i } \right)} } \right]$$
(4)

where X is the whole set of features; T is the target variable (e.g. the PD-score); x_i is the ith feature; S_{k−1} is the set of the top k − 1 features selected in earlier iterations; and I is the mutual information.

To better understand why this difference is a sensible criterion, Eq. 4 can be rewritten in terms of the entropy of x_j, as shown in Eq. 5:

$$\mathop {\max }\limits_{x_j \in X - S_{k - 1} } \left[ {H\left( {x_j } \right)\left( {\frac{{I\left( {x_j ,T} \right)}}{{H\left( {x_j } \right)}} - \frac{1}{{\left( {k - 1} \right)H\left( {x_j } \right)}}\sum\limits_{x_i \in S_{k - 1} } {I\left( {x_{j\,} ;x_i } \right)} } \right)} \right]$$
(5)

where H(x_j) is the entropy of x_j.

In the above equation, the first term, \(\frac{{I\left( {x_j ,T} \right)}}{{H\left( {x_j } \right)}}\), quantifies what percentage of the candidate variable x_j’s entropy is shared with the target T. The second term, \(\frac{1}{{\left( {k - 1} \right)H\left( {x_j } \right)}}\sum\limits_{x_i \in S_{k - 1} } {I\left( {x_j ;x_i } \right)} \), measures what percentage of x_j’s entropy (on average) is shared with the already-selected variables (e.g. a variable x_j might have 60% of its entropy in common with the target variable T and, on average, 40% of its entropy in common with the other variables, which suggests that 20% of its entropy could be unique information about T that would be gained when x_j is included). Multiplying the difference of these terms by the variable’s entropy (favoring variables with higher entropy) approximates the unique information that the variable carries about the target class. Among the candidate variables, the one with the maximum mRMR score is selected next into the set of selected variables.
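A compact sketch of this greedy selection (Eq. 4), operating on the discretized features and using scikit-learn's discrete MI estimator, might look as follows; the function name and arguments are our own:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_rank(X_disc, target, n_select):
    """Greedy mRMR ranking (Eq. 4) over the columns of a discretized matrix."""
    n_features = X_disc.shape[1]
    relevance = np.array([mutual_info_score(target, X_disc[:, j])
                          for j in range(n_features)])
    selected = [int(np.argmax(relevance))]          # first pick: maximum relevance
    while len(selected) < n_select:
        best_j, best_score = None, -np.inf
        for j in set(range(n_features)) - set(selected):
            # average redundancy with the already-selected features
            redundancy = np.mean([mutual_info_score(X_disc[:, j], X_disc[:, i])
                                  for i in selected])
            score = relevance[j] - redundancy        # Eq. 4
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected   # feature indices in mRMR order
```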

Support vector machines

For the learning task, among the number of software packages available that implement the SVM theory, we use the LIBSVM implementation developed by [24]. In order to use an SVM on a particular dataset, only three basic parameters have to be specified: (1) the choice of the kernel (the Radial Basis Function, RBF, is recommended first) for simulating a nonlinear transformation, together with its kernel-specific parameter (e.g., the g-parameter for the RBF kernel, or the degree of the polynomial for the polynomial kernel); (2) the C-parameter, which controls the smoothness of the decision boundary in the transformed space (Fig. 2); and (3) the “class-weight” parameter, w, which is used to account for the imbalance of the numbers of samples in the classes by weighting more heavily the classification errors made on the samples of the rare classes.
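As an illustration only (not the tuned configuration used in this paper), the scikit-learn wrapper around the same LIBSVM library exposes these three parameters as follows; the concrete values here are placeholders:

```python
from sklearn.svm import SVC

clf = SVC(kernel="rbf",               # (1) kernel choice
          gamma=1.0,                  #     kernel-specific g-parameter for RBF
          C=1.0,                      # (2) smoothness of the decision boundary
          class_weight={0: 3, 1: 1})  # (3) weigh errors on the rare class more
```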

Fig. 2
figure 2

Control of the smoothness of the decision boundary. The data samples are shown in the input space and are separated into two classes by a curved line, which is only a linear boundary in the high-dimensional transformed space. In the left panel, the highly-convoluted boundary is overfitting: it correctly separates all the shown data samples by their classes, but is likely to be less accurate on new data samples than the smoother boundary in the right panel

Fig. 3
figure 3

Mutual information score and confidence level (%) of each variable. Features with a confidence level lower than 95% are eliminated before the feature-selection scheme; these are NHR and RPDE, with 63% and 91% confidence levels, respectively

Generalization to unseen data: leave-one-individual-out

For the validation of the SVM model, since we do not have independent validation samples (recordings) from new individuals, we used a validation scheme that we call leave-one-individual-out. That is, we left out all the samples of one individual to be used for validation, as if it were an unseen individual. As there are 32 individuals in the dataset, we left out one individual at a time, in turn, and trained a classifier that was tested on the samples of the left-out individual.
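This scheme corresponds to grouped cross-validation with one group per subject. A minimal sketch using scikit-learn's LeaveOneGroupOut is shown below, assuming X, y, and a groups array holding one subject identifier per recording (as in the loading sketch above); the function name is ours:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

def leave_one_individual_out(X, y, groups, clf=None):
    """Train on all subjects but one; test on every recording of the held-out subject."""
    clf = clf or SVC(kernel="rbf")
    accuracies = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        clf.fit(X[train_idx], y[train_idx])
        accuracies.append(clf.score(X[test_idx], y[test_idx]))
    # mean and std of the per-individual accuracies
    return np.mean(accuracies), np.std(accuracies)
```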

Using a regular leave-one-out validation, on the other hand, would result in a biased estimate because, among the six or seven recordings per individual, all except one would remain in the training set and one would go into the “independent” validation set. The same argument applies to bootstrap resampling validation as well.

In bootstrap resampling, each sample is assumed to have a probability of 1/n of being observed in a data sample of size n. We then draw n samples with replacement for training. Thus, the same sample may appear more than once in the training set and some samples will not appear in the training set at all. Testing is then applied to all n original samples. The procedure is repeated a number of times (200 repetitions are suggested) to obtain a statistically reliable estimate of the mean and standard deviation of the prediction accuracy of the classifier. If the n samples are obtained from c individuals (with n/c samples each), the probability p of choosing at least one sample from every individual in a bootstrap training set follows, by inclusion-exclusion over the individuals that could be left out entirely, as:

$$p = 1 - \sum\limits_{i = 1}^{c - 1} {\left[ {\left( { - 1} \right)^{i + 1} \cdot \frac{{c!}}{{\left( {c - i} \right)! \cdot i!}} \cdot \left( {\frac{{c - i}}{c}} \right)^n } \right]} $$
(6)

For the PD dataset, only three individuals have seven recordings (rather than six); thus, ignoring these three extra samples, with c = 32 and n ≈ 192 the probability p evaluates to approximately 0.93, meaning that about 93% of the bootstraps will contain at least one sample of each individual. Therefore, the validation set would greatly overlap with the training set, and this would create artificially high, biased prediction accuracies.
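This figure can be checked with a quick Monte Carlo sketch under the stated assumption of 32 subjects with six recordings each (n = 192); the names below are ours, and the estimate should land in the low 0.90s, consistent with the approximate value reported above:

```python
import numpy as np

rng = np.random.default_rng(0)
c, per_subject = 32, 6
subject_of_sample = np.repeat(np.arange(c), per_subject)   # n = 192 samples
n = len(subject_of_sample)

trials = 100_000
hits = 0
for _ in range(trials):
    boot = rng.integers(0, n, size=n)                       # one bootstrap training set
    if len(np.unique(subject_of_sample[boot])) == c:        # every subject present?
        hits += 1
print(hits / trials)   # fraction of bootstraps containing all 32 subjects
```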

Experimental results

Table 2 shows the mutual information of each feature with the PD-score, the normalized mutual information (i.e. the ratio of the mutual information to the entropy of the PD-score, as suggested by Fig. 1), and the ranks of the features according to mRMR. As seen in Table 2, some features could not pass the permutation test. The permutation test results are shown in detail in Fig. 3. The confidence threshold is set to 95%. After 1,000 permutations, two features, NHR and RPDE, remain below the threshold; thus, their MI-scores have not been found to be statistically significant and they are eliminated from the rest of the analysis.

Table 2 Relevance of the measurements of dysphonia to the PD-score

After ranking the variables using mRMR, we used SVMs to determine the classification accuracies of the top-k features (k = 1, 2, …, 20) using the leave-one-individual-out cross-validation method (“Generalization to unseen data: leave-one-individual-out” section). We optimized the SVM-parameters as suggested in [1], so as to build an SVM model capable of achieving their reported results of 91% with four features and 90% with a larger set of ten features. The SVM-parameters found are very conservative (near the default settings): C = 3 (the default value is 1) and g = 9/k for the RBF kernel, where k is the number of features given to the SVM as input (this setting for g worked for both k = 4 and k = 10; the default is 1/k). The “class-weight” parameter, w, which is used to account for the imbalance of the numbers of samples in the classes, is set to 3 for class 0 because the healthy subjects (class 0) are outnumbered by the subjects with PD (class 1) by a factor of three (the default is 1 for all classes). The optimization of the SVM-parameters is near-optimal, as the classification rates we obtained are near the results reported in [1].
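Expressed in the same scikit-learn terms as the earlier sketch, these settings correspond to the following configuration (our notation; class 0 is the healthy class):

```python
from sklearn.svm import SVC

k = 4   # number of top-ranked mRMR features given to the SVM (k = 10 was also tested)
clf = SVC(kernel="rbf", C=3, gamma=9.0 / k, class_weight={0: 3, 1: 1})
```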

Our classification results with the obtained SVM settings (with no additional search over the SVM-parameters for our set of variables) are shown in Fig. 4. The true positive (TP) and true negative (TN) rates are shown in Table 3. The highest accuracy obtained is 81.53%, achieved with the top four (k = 4) features of the mRMR ranking:

$${\text{mRMR}} - 4 = \left\{ {{\text{spread1,}}\,{\text{MDVP:Fo}}\left( {{\text{Hz}}} \right){\text{,}}\,{\text{Shimmer:APQ3,}}\,{\text{D2}}} \right\}$$
Fig. 4
figure 4

The top-k selected mRMR features and classification accuracy using leave-one-individual-out

Table 3 True positives (TP) and true negatives (TN) classification rates using SVMs on the top-k features of mRMR

In [1], the reported maximal correct classification rate of 91% was achieved with the following four features of dysphonia: {HNR, RPDE, DFA, and PPE}. This accuracy is seemingly better than the one our method achieves. However, their testing uses bootstrap resampling, which results in overestimation, as explained in the “Generalization to unseen data: leave-one-individual-out” section.

To compare mRMR-4’s accuracy with the feature set given in [1], we tested their subset using the leave-one-individual-out method and obtained only 65.13% classification accuracy, with a very high standard deviation of 35.84%. These results are much worse than those reported in [1] because the classifier in [1] was designed to maximize the accuracy on the test set under bootstrap resampling, which, in its application to the PD dataset, caused some samples of the test individuals to also appear in the training set. Therefore, the results were simply overestimates and did not generalize to unknown individuals.

Lastly, to show in a different way that the bootstrap resampling technique really causes overestimation for the PD dataset, we applied the bootstrap resampling validation as in [1] using the mRMR-4 features. As seen in Table 4, the classification rate of our feature set increases to 92.75%, because bootstrap resampling rarely hides all the samples of an individual from the training set of the classifier.

Table 4 Classification rates and standard deviations with leave-one-individual-out and bootstrap resampling validations

Discussions

Vocal impairment affects approximately 90% of the patients suffering from Parkinson’s disease (PD). Therefore, telediagnosis of PD using measurements of dysphonia would ease the clinical monitoring of elderly people and increase the chances of early diagnosis. However, building such an inferential model is not an easy task for a number of reasons: (1) it is not clear which features are relevant for this task; (2) it is not obvious which combination of the relevant features would be the best among the variety of relevant ones; and (3) it is difficult to obtain a large-enough dataset to experiment with, which makes it difficult to validate the built models.

Our study presents a methodology for selecting a minimal subset of features with maximal joint relevance to the PD-score, a binary score to be learnt, which indicates whether or not the sample (speech recording) belongs to a person with PD. We apply the mutual information measure together with the permutation test for assessing the relevance and the statistical significance of the relation between each feature and the PD-score; we then rank the features according to the maximum-relevance-minimum-redundancy criterion, which iteratively selects the features that are maximally relevant to the PD-score while not redundant with those already selected.

We use Support Vector Machines for building a predictive model of PD from the selected features. We also present a methodology for building the predictive model with minimal bias, using a validation mechanism more suitable for our problem. To maximize the generalization of the predictions to unseen test examples, we estimate classification accuracies using leave-one-individual-out, which fits the dataset at hand better than the conventional bootstrapping or leave-one-out validation methods. The reason for using this version of leave-one-out, tailored to the dataset at hand, is the presence of multiple speech recordings per subject. The conventional bootstrapping or leave-one-out validation methods are not suitable for this database because they assume that the samples of the database are independently distributed, and they therefore fail by letting samples of an individual appear in both the training and the test sets.

The accuracy obtained using our method with leave-one-individual-out is 81.53% with the four features spread1, MDVP:Fo(Hz), Shimmer:APQ3, and D2; this translates to 92.75% under the conventional bootstrap resampling validation. Our results compare favorably with the existing results in the literature. The study in [1] reports 91.40% accuracy using bootstrap resampling with another set of four features: HNR, RPDE, DFA, and PPE. However, we found that their feature set achieves only 65.13% accuracy under leave-one-individual-out cross-validation. Moreover, the classifier in [1] is also unreliable in that the standard deviation of its accuracy is extremely high (35.84%), indicating that testing on the six (or seven) samples of the left-out individuals gave either near-zero or near-one-hundred-percent accuracy. Such a high standard deviation is a sign of memorization of the training set. We conclude that our method generalizes better to unseen test examples. Moreover, by using a mutual-information-based method, we avoided evaluating all combinations of features as done in [1], and we chose to report a lower but better-generalizing accuracy rather than an over-fitted one.

In comparing the features selected by our method to those selected by [1], we first note that MDVP:Fo is selected into the mRMR-4 subset even though it was not included in the feature selection of [1], because they claim that this feature is adversely affected by gender and individual differences. However, they also mention that other studies have found statistical relationships between MDVP:Fo and PD-related dysphonia. Our results support the latter view by ranking MDVP:Fo as the second most important feature in predicting the PD-score. Shimmer:APQ3 is another variable in our mRMR-4 set that was eliminated by [1] in their preselection stage. In order to have fewer variables, so as to make an exhaustive search for the best subset feasible, they discard redundant features; for this task, they remove one of the variables in every pair with a correlation of 0.95 or greater. However, unlike mRMR, this preselection stage is not theoretically justified and may cause the elimination of useful features, such as Shimmer:APQ3 in our case.

PPE is a measure of dysphonia introduced by [1] that was not selected by our method due to its redundancy with the spread1 feature. In fact, 64.86% of the information (entropy) in PPE is shared (mutual) with spread1. Taking PPE instead of spread1 into the mRMR-4 subset drops the correct classification rate only slightly, from 81.53% to 80.14%. This result shows that PPE is indeed a good measure of frequency variation, as proposed in [1].

The successful results of this study must be considered together with its limitations. There are two sources of bias affecting the performance of our predictive SVM model. First, both our work and the work in [1] optimize the SVM parameters using all the available data, which is unavoidable when working with such small datasets. Second, the mutual information computations (for mRMR) also use all the available data. However, as mutual information works with two variables at a time, it does not take into account the joint effects of the variables on the PD-score; therefore, it is not as likely to overfit as using a separate SVM for evaluating all subsets of the features. Nevertheless, more data are required to validate the claims in this work. Although our estimate of SVM performance is likely to be somewhat inflated, it is nevertheless significant in that: (1) it shows that our method is superior to that of [1]; (2) the set of top four features identified by [1] is likely to be spurious; and (3) vocal features have a clear potential for PD evaluation.