
1 Introduction

Parkinson’s disease (PD) is a neurodegenerative disorder that produces several motor and non-motor impairments. The motor symptoms include, among others, tremor, rigidity, slowed movement, postural instability, lack of coordination, and speech disorders [1]. Evaluating the condition of PD patients is difficult. Mobility problems make attending medical appointments burdensome, while speech disorders may hinder communication with medical experts [2]. Currently, the motor capabilities of patients are assessed with the third section of the Movement Disorder Society Unified Parkinson’s Disease Rating Scale (MDS-UPDRS) [3]. This evaluation depends on clinical judgment, and its intra- and inter-rater variability can be high. The diagnosis could be supported by computer-aided systems, which could also improve the evaluation of disease progression. Moreover, only two of the 33 items of the MDS-UPDRS are related to the speech impairments of patients, even though speech disorders are among the most prevalent symptoms and are an early sign of further motor impairments [4]. Speech signals could therefore be used to assess the motor symptoms of PD patients.

There has been interest in the scientific community in developing computer-aided tools to evaluate the condition of PD patients using information from speech. The 2015 INTERSPEECH Computational Paralinguistics Challenge (ComParE) addressed the task of predicting the MDS-UPDRS score of PD patients from speech [5]. Speech recordings of 50 PD patients from the PC-GITA database [6] were considered for the train and development subsets. Recordings from eleven new patients were considered as the test set. All the speakers were native Spanish speakers. A Spearman’s correlation coefficient of 0.39 was reported as the baseline of the challenge. The winners of the challenge [7] grouped the speech tasks of each patient and used deep neural networks and Gaussian processes for the prediction, obtaining a correlation coefficient of up to 0.69. In [8] the authors classify the speech of PD patients vs. healthy control (HC) speakers in different languages. Cross-language experiments were performed using data in three languages: Czech, German and Spanish. The reported accuracies range from \(60\%\) to \(77\%\), depending on the languages used for the train and test sets. In [9] a speaker model based on Gaussian mixture models-universal background models (GMM-UBM) was proposed to monitor the neurological state of PD patients. UBMs were trained with information from 61 PD and 50 HC speakers. Specific GMMs were adapted for seven PD patients recorded in three sessions. Then, the Bhattacharyya distance between the speaker models and the UBM was computed and correlated with the MDS-UPDRS score of the patients. A Pearson’s correlation of up to 0.60 was reported by the authors. In [10] the authors proposed a new regression method to track the progression of speech disorders. The method relies on a non-parametric learning strategy that uses a probability distance measure between the speakers from the test and training sets. The authors consider data from 61 PD patients to predict the UPDRS score, obtaining a Pearson’s correlation of up to 0.58.

Speaker models inspired by speaker verification and identification systems have shown promising results in evaluating PD from speech. The most recent breakthrough in speaker verification is the i-vector approach [11]. This strategy has also proven to be effective in many other speech tasks. In particular, it has shown excellent results in language identification [12, 13]. Recently, i-vectors were used to identify the native language of a speaker from recordings in a second language [14]. According to the reviewed literature, a language-independent strategy to evaluate the PD condition from speech has not been sufficiently addressed. Developing a computer-aided system that can evaluate PD from speech in different languages would be a major step towards a unified objective assessment of the disease. Additionally, i-vectors have been successfully applied to model speaker traits in multiple languages, which indicates that they could also be used to evaluate the PD condition from speech in a cross-language approach. In this study, a strategy based on i-vectors is used to assess the PD condition from speech in three languages: Spanish, German and Czech. Cross-language experiments are performed, i.e., the models are trained with utterances from one language and tested with speech recordings from the other ones. The proposed approach is tested in two scenarios: (1) classification of PD vs. HC subjects, and (2) prediction of the MDS-UPDRS score of the patients. Different i-vector extractors are trained with features related to specific dimensions of speech, e.g., phonation, articulation and prosody, with the aim of evaluating the information provided by each dimension to represent the PD condition of the patients.

2 Methods

The methodology proposed in this study comprises four steps: (1) several feature sets are computed to analyze different dimensions of speech, (2) a subset of speakers is used to train an i-vector extractor, (3) the i-vectors of speech signals are extracted, and (4) the cosine distance between a reference i-vector and the speaker i-vector is computed. This process is summarized in Fig. 1.

Fig. 1. General methodology followed in this study.

2.1 Feature Extraction

Four feature sets were considered in this study to model the speech impairments of PD patients. The first set comprises the Mel-Frequency Cepstral Coefficients (MFCCs), which are the classical features used to train i-vectors. Nineteen MFCCs and the log-energy, extracted from 30 ms windows with a time-shift of 15 ms, were used to form a 20-dimensional feature vector. Non-speech frames were discarded using an energy-threshold voice activity detector (VAD). The other feature sets are formed with descriptors to assess the articulation, phonation, and prosody dimensions of speech. To evaluate articulation, the energy content in 22 Bark bands (BBE) in the voiced/unvoiced and unvoiced/voiced transitions was considered, as in [15]. The features considered to evaluate phonation and articulation in voiced segments are the log-energy, the fundamental frequency (\(F_0\)), the first and second formants (\(F_1\) and \(F_2\)), and their first and second derivatives. Additionally, perturbation features such as jitter and shimmer are included. These descriptors form a 14-dimensional feature vector. These features were computed from voiced segments using 30 ms long analysis frames with a time-shift of 5 ms. To evaluate prosody we followed the approach introduced in [16]: the log-\(F_0\) and the log-energy contours within analysis frames were approximated using Lagrange polynomials of order \(P=5\). Analysis frames of 200 ms with a time-shift of 50 ms were used, as in [13]. A 13-dimensional feature vector is formed by concatenating the six coefficients computed from each of the log-\(F_0\) and log-energy contours, along with the number of voiced frames in the utterance.
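The frame settings and vector sizes stated above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the names `FeatureSet` and `num_frames`, and the convention of counting only complete frames without padding, are assumptions and are not taken from the original implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSet:
    """Frame settings and dimensionality of one feature set (illustrative)."""
    name: str
    window_ms: int
    shift_ms: int
    dim: int


# Values stated in the text. The Bark-band (BBE) articulation set is left out
# because its frame settings and final dimension are not given explicitly.
MFCC = FeatureSet("mfcc", window_ms=30, shift_ms=15, dim=19 + 1)
# (log-energy, F0, F1, F2), each with first and second derivatives,
# plus jitter and shimmer.
PHONATION = FeatureSet("phonation", window_ms=30, shift_ms=5, dim=4 * 3 + 2)
# Order P=5 gives 6 coefficients per contour (log-F0 and log-energy),
# plus the number of voiced frames.
PROSODY = FeatureSet("prosody", window_ms=200, shift_ms=50, dim=2 * (5 + 1) + 1)


def num_frames(duration_ms: int, fs: FeatureSet) -> int:
    """Number of complete frames in a segment (assumption: no padding)."""
    if duration_ms < fs.window_ms:
        return 0
    return 1 + (duration_ms - fs.window_ms) // fs.shift_ms
```

Under these assumptions, one second of speech yields 65 MFCC frames but only 17 prosody frames, which shows how much coarser the prosodic representation is.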

Fig. 2. i-vector processing.

2.2 i-Vectors

In this approach, factor analysis is used to define a new low-dimensional space known as the total variability space with the aim of modeling the speaker and the channel variability [11]. For applications related to pathological speech analysis, the speaker variability carries the information about the disorders in speech due to the disease. In the total variability space, an utterance is represented by a supervector M formed by concatenating the mean vectors of a GMM-UBM. The total variability space is expressed according to Eq. 1, where m is a speaker and channel independent supervector (the UBM), T is the total variability matrix and w corresponds to the i-vector.

$$\begin{aligned} \mathbf {M}=\mathbf {m}+\mathbf {T}\mathbf {w} \end{aligned}$$
(1)
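To make the shapes in Eq. 1 concrete, the sketch below evaluates the forward model for a toy case. It is only an illustration: the function name `supervector` and the toy sizes are assumptions, and in practice the i-vector w is not given but estimated from the statistics of an utterance.

```python
def supervector(m, T, w):
    """Return M = m + T w for a UBM supervector m, matrix T and i-vector w.

    m has one entry per (Gaussian, feature) pair, T has one row per entry
    of m and one column per i-vector dimension.
    """
    if len(T) != len(m) or any(len(row) != len(w) for row in T):
        raise ValueError("shape mismatch between m, T and w")
    return [m_k + sum(t * x for t, x in zip(row, w)) for m_k, row in zip(m, T)]


# Toy example (assumed sizes): 2 Gaussians x 2 features -> supervector of
# size 4, represented by a 2-dimensional i-vector.
m = [0.0, 1.0, 2.0, 3.0]
T = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
w = [0.5, -1.0]
M = supervector(m, T, w)
```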

The i-vectors are processed in five steps: (1) i-vectors extracted from training speakers are normalized to zero mean and unit variance, i.e., Z-norm, (2) the normalized i-vectors computed from different speech tasks of a given speaker are averaged to obtain one i-vector per speaker, (3) the i-vectors of HC and PD speakers are averaged to obtain HC and PD reference i-vectors, respectively, (4) the i-vectors of a test speaker are normalized using the parameters from the training i-vectors, (5) the normalized i-vectors per utterance are averaged to obtain the speaker i-vector. Finally, the cosine distance between the HC/PD reference i-vectors and the speaker i-vectors is computed. The process is summarized in Fig. 2.
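The five processing steps can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: all function names, the `train` dictionary layout, and the use of the population standard deviation for the Z-norm are assumptions.

```python
from statistics import mean, pstdev


def fit_znorm(vectors):
    """Step 1: per-dimension mean and std of the training i-vectors."""
    dims = list(zip(*vectors))
    mu = [mean(d) for d in dims]
    sigma = [pstdev(d) or 1.0 for d in dims]  # guard against zero variance
    return mu, sigma


def apply_znorm(vector, mu, sigma):
    """Steps 1 and 4: normalize one i-vector with the training parameters."""
    return [(x - m) / s for x, m, s in zip(vector, mu, sigma)]


def average(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    return [mean(d) for d in zip(*vectors)]


def speaker_ivector(utterance_ivectors, mu, sigma):
    """Steps 2 and 5: normalize each utterance i-vector, then average."""
    return average([apply_znorm(v, mu, sigma) for v in utterance_ivectors])


def reference_ivectors(train, mu, sigma):
    """Step 3: one reference per class.

    `train` maps a speaker id to (label, list of utterance i-vectors),
    with label "HC" or "PD" (hypothetical layout).
    """
    by_label = {}
    for label, ivectors in train.values():
        by_label.setdefault(label, []).append(speaker_ivector(ivectors, mu, sigma))
    return {label: average(vs) for label, vs in by_label.items()}
```

A test speaker is handled by calling `speaker_ivector` with the `mu` and `sigma` obtained from the training set, so no statistics from the test data enter the normalization.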

2.3 Cosine Distance

The cosine distance is used to compare two i-vectors \(w_1\) and \(w_2\). The distance is defined by Eq. 2. In this study, the i-vector of a single speaker is compared with a reference i-vector that represents the HC or PD population.

$$\begin{aligned} d_c(\mathbf {w}_1,\mathbf {w}_2) =1 - \frac{\mathbf {w}_1 \cdot \mathbf {w}_2}{||\mathbf {w}_1|| ||\mathbf {w}_2||}. \end{aligned}$$
(2)
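Eq. 2 translates directly into code. The sketch below is a plain-Python illustration; the function name `cosine_distance` and the choice to raise an error for zero-norm vectors are assumptions.

```python
import math


def cosine_distance(w1, w2):
    """Cosine distance of Eq. 2: 1 - (w1 . w2) / (||w1|| ||w2||)."""
    dot = sum(a * b for a, b in zip(w1, w2))
    norm1 = math.sqrt(sum(a * a for a in w1))
    norm2 = math.sqrt(sum(b * b for b in w2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm1 * norm2)
```

The distance ranges from 0 for vectors pointing in the same direction, through 1 for orthogonal vectors, to 2 for vectors pointing in opposite directions, and it does not depend on the length of either vector.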

2.4 Evaluation

The cosine distance between the test speaker i-vector and the reference i-vector is compared to a threshold to discriminate between PD patients and HC speakers. The development set is used to find the threshold that maximizes the accuracy. The prediction of the neurological state of a patient is evaluated using the Spearman’s correlation coefficient between the real MDS-UPDRS score and the distance measure.

3 Data

Spanish- The PC-GITA database [6] is used in this study. It contains recordings of 50 PD patients and 50 healthy control (HC) speakers. All of them are native Colombian Spanish speakers. During the recordings, the participants were asked to perform different speech tasks including reading ten isolated sentences, and the repetition of /pa-ta-ka/, a diadochokinetic (DDK) exercise.

German- The German data contain recordings from 88 PD patients and 88 HC subjects. The speakers perform several speech tasks, including the repetition of /pa-ta-ka/, and reading five isolated sentences [17].

Czech- The Czech data are formed with recordings from 20 PD patients and 15 HC subjects. The patients were recorded at the time of diagnosis with PD, and none of them had been medicated before or during the recording session. The speech tasks performed by the speakers include the rapid repetition of /pa-ta-ka/, and several repetitions of a sentence [4].

4 Experiments and Results

Two speech tasks were analyzed independently in these experiments: the rapid repetition of /pa-ta-ka/, and read sentences. Data from the three languages are used in turn as training, development and test sets. All possible combinations are tested. The training data are used for several processes: (1) to train the UBM and the i-vector extractor, (2) to compute the normalization parameters, and (3) to obtain the HC and PD reference i-vectors. UBMs with different numbers of Gaussian components were trained, ranging from \(M=2\) to \(M=2^9\) in powers of 2. The dimension of the i-vector \(\mathrm {dim}_w\) was chosen following the relation \(\mathrm {dim}_w = \log _2 (M) \cdot \mathrm {dim}_f\), where M is the number of Gaussian components in the UBM and \(\mathrm {dim}_f\) is the dimension of the feature vector.
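The experimental grid described above can be enumerated with a short Python sketch. It assumes that "all possible combinations" means every assignment of the three languages to the training, development and test roles; the function names are hypothetical.

```python
from itertools import permutations

LANGUAGES = ("Spanish", "German", "Czech")


def language_splits(languages=LANGUAGES):
    """Every assignment of the languages to the train/dev/test roles."""
    roles = ("train", "dev", "test")
    return [dict(zip(roles, p)) for p in permutations(languages)]


def ubm_sizes(max_exponent=9):
    """Numbers of Gaussian components: M = 2, 4, ..., 2**max_exponent."""
    return [2 ** k for k in range(1, max_exponent + 1)]


def ivector_dim(num_gaussians, feature_dim):
    """dim_w = log2(M) * dim_f, for M a power of two."""
    if num_gaussians < 2 or num_gaussians & (num_gaussians - 1):
        raise ValueError("num_gaussians must be a power of two >= 2")
    return (num_gaussians.bit_length() - 1) * feature_dim
```

For the 20-dimensional MFCC features, this relation gives i-vectors of dimension 20 for \(M=2\) and 180 for \(M=2^9\).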

Table 1. Accuracies (\(\%\)) for the classification task.

Table 1 shows the results for the classification of PD vs. HC speakers. For the DDK speech task, only the tests on the Czech language show accuracies higher than \(65\%\). This could be explained by the fact that the Czech patients were recorded at the time of diagnosis and were at an earlier stage of the disease than the patients from the other two languages. Accuracies below \(50\%\) could be explained by the fact that the ranges of the cosine distance are likely to differ between the development and test sets. On the other hand, for the sentences, the i-vectors extracted with MFCCs and articulation-based features show the best results in most cases. The \(50\%\) results when Spanish is used for testing can be explained by the mismatch of cosine distance ranges in the development and test sets. In this case, the threshold could be set such that all speakers in the test set are classified as either PD or HC. The similar results found using the HC and PD reference i-vectors may be due to both vectors being antiparallel.

Table 2 shows the results for the prediction of the MDS-UPDRS. For this case, articulation-based features provide the best result when evaluating the DDK speech task. Phonation features show good results in some cases, especially when the Spanish language is used for testing, but show poor results when testing on Czech. This may be due to the fact that Spanish is a more voiced language than the other two languages, but further experimentation is required. Correlations with the MDS-UPDRS score of up to 0.4 were achieved. Slightly higher correlations were obtained using the DDK speech task, because such a task is language independent, i.e., the speakers in the corpora uttered the same sounds. Good results were also obtained when analyzing isolated sentences, which is a language-dependent speech task. This is encouraging and indicates that other speech tasks could also be analyzed in a cross-language setting.

Table 2. Spearman’s correlation for the prediction task.

For comparison, language dependent results using the same Spanish database and the same i-vector methodology can be found in [18].

5 Conclusion

In this work we address the task of cross-language evaluation of Parkinson’s disease speech using the i-vector approach. Data in Czech, German and Spanish were used. One of the languages was used for training, while the other two were used for parameter selection and testing. All possible combinations were considered. Two reference i-vectors were created, one representing the population of HC speakers and one representing the PD patients. Then, the cosine distance between one of these reference i-vectors and the i-vector of a test speaker was computed. This distance was used in two experiments: to classify PD patients and HC speakers, and to predict the neurological state of the patients. Results are promising, with classification rates around \(70\%\) when using MFCCs and articulation features. Similar classification results were obtained using both reference i-vectors. In many cases, a positive correlation between the labels and the cosine distance to the HC reference i-vector was found. This means that the more affected the speech, the larger the difference to healthy speakers. A similar reasoning can be followed for the negative correlations when comparing test speakers with respect to the PD reference i-vector, i.e., the more affected the speech, the lower the difference to the PD speakers. Future work includes evaluating techniques that can eliminate the variability of language in the i-vector space, with the aim of improving the results and obtaining a language-independent method to evaluate the condition of patients with neurodegenerative disorders.