1 Introduction

Dysarthria is a speech disorder caused by neuromuscular disturbances [1]. It results from damage to the brain occurring at birth or due to stroke, tumours or diseases such as cerebral palsy and multiple sclerosis. People suffering from dysarthria exhibit delayed speech, speech with fluctuating speed or poorly articulated speech. They have no difficulty in reading, writing or understanding the speech of others. Because of the variations in the speech rate of dysarthric people, their speech is difficult to understand, and automatic speech recognition systems can therefore be helpful in understanding it. Generic automatic speech recognition systems do not perform well on dysarthric speech [2]. Hence, it is essential to develop automatic speech recognition systems that specifically model the speech of dysarthric people. This paper first enhances the dysarthric speech signals and then develops a model based on convolutional neural networks to automatically recognize single words uttered by dysarthric people.

Due to changes in the vocal folds and glottal muscles, dysarthric speech differs from normal speech in terms of pitch and amplitude [3]. This difference arises from source degradation, i.e. from damage to the muscles of the vocal tract. Experiments have been conducted to find the similarity between speech in noise, where the speech is degraded by environmental noise, and dysarthric speech, where the speech is degraded at the source [4]. These experiments show that there is a relationship between speech in noise and dysarthric speech. Yakoub et al. [5] have also shown that using empirical mode decomposition (EMD) to denoise dysarthric speech improves dysarthric speech recognition. Moreover, Ram et al. [7] have shown that variational mode decomposition (VMD) [6] outperforms EMD in speech enhancement. These results have motivated us to use VMD to first denoise the dysarthric speech and then build a model on the enhanced signals to recognize the speech. The two main contributions of this paper are (1) application of variational mode decomposition and wavelet thresholding for enhancement of dysarthric speech signals; and (2) development of a convolutional neural network (CNN) based model for automatic recognition of dysarthric speech. The rest of this paper is organized as follows. Section 2 gives an overview of recent work on dysarthric speech recognition. Section 3 describes the methodology proposed in this paper for dysarthric speech recognition. Section 4 describes the dataset used and the results obtained. Section 5 gives the conclusion and directions for further work.

2 Related Work

A considerable amount of research has been carried out to develop automatic speech recognition systems specifically for dysarthric speech. These works improve the performance of dysarthric speech recognition systems by enhancing the dysarthric speech signals and/or by designing generative or machine learning models to recognize dysarthric speech. Dysarthric speech signals are regarded as signals corrupted by noise at the source [4], and pre-processing techniques such as enhancement or denoising are applied to improve them. The generative or machine learning models make use of training data to learn to recognize dysarthric speech effectively.

Various research works improve dysarthric speech signals by performing signal enhancement. Park et al. observe that initial consonants in dysarthric speech resemble noise, and therefore propose a consonant-vowel dependent Wiener filter to remove the noise introduced when dysarthric people pronounce initial consonants [8]. Wisler et al. also treat dysarthric speech as speech corrupted with noise [9]. They use transfer learning to extract features that are robust against noise for the classification of dysarthric and normal speech. Borrie et al. have also shown that there is a high similarity between speech signals corrupted by environmental distortion and dysarthric speech signals, which are distorted at the source [4].

Various approaches based on generative models [10,11,12], artificial neural networks (ANNs) [13, 14] and deep neural networks [5, 15,16,17] are available in the literature for automatic recognition of dysarthric speech. These works have predominantly concentrated on designing novel or hybrid features and/or on building machine learning based systems to effectively recognize dysarthric speech.

Generative models are very useful in modeling time-varying spectral sequences. One of the widely used generative models is the hidden Markov model (HMM). An HMM based system was developed by Deller et al. [10] to recognize words uttered by dysarthric speakers. They perform pre-processing at the signal level and use a linear prediction (LP) parameter vector to model each word with an HMM. An HMM based speech recognizer is used by Lee et al. [11] to assess the level of dysarthria. They train word-specific HMMs using mel-frequency cepstral coefficients (MFCCs), with Gaussian mixture models (GMMs) representing the phonemes in every word. Class-specific HMM models are developed by Rajeswari et al. [12] using MFCC features; the output of these models is given as input to a support vector machine (SVM) to enable dysarthric speech recognition.

In recent years, artificial neural networks (ANNs) have been used to recognize dysarthric speech from acoustic features. An ANN based dysarthric speech recognition system built by Shahamiri et al. [13] makes use of MFCC features: the best set of MFCC features is first identified and then used by a multilayer perceptron (MLP) to recognize the words spoken by dysarthric speakers. Hybrid ANN-HMM speech recognition systems have also been developed to combine the advantages of both ANNs and HMMs. A study carried out by Polur et al. [14] has shown that a hybrid ANN-HMM model based on MFCC features provides high accuracy in recognizing dysarthric speech.

Very recently, deep neural networks have also been deployed for automatic recognition of dysarthric speech. A convolutive bottleneck network is proposed by Nakashika et al. [15] to extract features from dysarthric speech signals; the bottleneck layer helps extract features that are specific to dysarthric speech. A hybrid DNN-HMM acoustic model is developed by Joy et al. [16] specifically for the TORGO dysarthric speech database; the system is speaker dependent and uses MFCC features to perform speech recognition. Two separate DNN models, based on a convolutional neural network (CNN) and a long short-term memory (LSTM) network, are developed by Zaidi et al. [17] for dysarthric speech recognition. Acoustic features such as MFCCs, mel-frequency spectral coefficients (MFSCs) or perceptual linear prediction coefficients are extracted from the dysarthric speech signals and given as input to the CNN and LSTM models.

HMM based models have been a standard for automatic speech recognition for decades. Although HMMs are useful in modeling speech signals that vary over time, they are not very effective at discriminating features for a speech recognition system [12]. ANNs have the advantage of effectively discriminating features; however, their limitation is that handcrafted features have to be extracted from the speech signals and given as input. The advantage of deep neural network (DNN) based models is that, instead of relying on handcrafted features, they learn the features themselves.

Some speech recognition systems improve performance by first pre-processing the speech signals and then using a machine learning based model for recognition. For instance, Yakoub et al. [5] use empirical mode decomposition (EMD) to enhance dysarthric speech and then apply convolutional neural networks (CNNs) for phoneme recognition; after enhancement, they extract mel-frequency cepstral coefficients (MFCCs) and give them as input to the CNN. In this work too, dysarthric speech signals are treated as noisy signals: they are first enhanced, and a CNN based model is then developed to recognize the dysarthric speech. Ram et al. [7] have shown that variational mode decomposition (VMD) outperforms EMD in speech enhancement, which has motivated us to use VMD and wavelet based enhancement of the dysarthric speech signals. CNNs, which are a type of deep neural network (DNN), can automatically learn features from the given input. Hence, in this work a CNN based speech model is developed to automatically learn features from the enhanced dysarthric speech signals for speech recognition.

3 Proposed Method

The CNN based dysarthric speech recognition system proposed in this paper consists of two main phases, viz., the training phase and the testing phase. The training phase builds and trains the CNN using the dysarthric speech signals, and the trained CNN model is evaluated in the testing phase. Both phases consist of two steps. In the first step, the dysarthric speech signals are enhanced using variational mode decomposition. In the second step, convolutional neural networks are used to automatically recognize the word from the input speech signal. This section explains these steps. The block diagram of the proposed method is presented in Fig. 1.

Fig. 1 Block diagram of the proposed system

3.1 Enhancement of Dysarthric Speech Using Variational Mode Decomposition and Wavelet Thresholding

VMD takes advantage of Wiener filtering, the Hilbert transform, frequency mixing and heterodyne demodulation to decompose a signal into modes and appropriately reconstruct them. The main objective of variational mode decomposition (VMD) [6] is to decompose the signal into several modes \(v_{k}\), also called sub-signals. Every mode is concentrated around a center frequency \(\mu _{k}\). The constrained variational problem for variational mode decomposition is given by Eq. (1).

$$\begin{aligned} \begin{aligned}& \min _{\left\{ v_{k}\right\} ,\left\{ \mu _{k}\right\} } \left\{ \sum _{k} \left\| \partial _{t} \left[ \left( \delta (t) + \frac{j}{\pi t} \right) * v_{k}(t) \right] e^{-j\mu _{k} t} \right\| ^{2}_{2}\right\} \\ &\text {such that } \sum _{k} v_{k}(t) = f(t) \end{aligned} \end{aligned}$$
(1)

where f(t) is the signal to be decomposed, \(\left[ \left( \delta (t) + \frac{j}{\pi t} \right) * v_{k}(t) \right]\) is the Hilbert transform of \(v_{k}\), \(\delta (t)\) is the Dirac distribution, \(*\) denotes convolution, k ranges from 1 to K (the predefined number of modes), t is the time index, j is the imaginary unit, \(\partial _{t}\) is the partial derivative with respect to time, and \(\mu _{k}\) is the center frequency of the kth mode. The term \(e^{-j\mu _{k} t}\) shifts the frequency spectrum of every kth mode to its baseband. The notations \(\{v_{k}\}=\{v_{1}, v_{2},\ldots , v_{K}\}\) and \(\{\mu _{k}\} = \{\mu _{1}, \mu _{2},\ldots , \mu _{K}\}\) represent the set of all modes and their center frequencies respectively. In the present work, the number of modes, K, is 3. The above constrained variational problem can be solved using the augmented Lagrangian L given in Eq. (2).

$$\begin{aligned} L\left( \left\{ v_{k}\right\} , \left\{ \mu _{k}\right\} ,\lambda \right)= & {} \alpha \sum _{k}\left\| \partial _{t} \left[ \left( \delta (t) + \frac{j}{\pi t} \right) * v_{k}(t) \right] e^{-j\mu _{k} t} \right\| ^{2}_{2} \nonumber \\&\quad +\, \left\| f(t) - \sum _{k} v_{k}(t)\right\| ^{2}_{2} \nonumber \\&\quad +\, \left\langle \lambda (t), f(t) - \sum _{k} v_{k}(t) \right\rangle \end{aligned}$$
(2)

where \(\lambda\) is the Lagrange multiplier and \(\alpha\) is the data-fidelity constraint parameter. The augmented Lagrangian helps in imposing the constraint. The optimization problem of Eq. (1) is solved by finding the saddle point of the Lagrangian in Eq. (2), and the solution of Eq. (2) yields the modes \(v_{k}\). It has been shown by Donoho et al. [18] that thresholding the detail wavelet coefficients of a signal helps in denoising it. Hence, this approach is used in this work to denoise/enhance the dysarthric speech signals. Firstly, the discrete wavelet transform (DWT) is applied to the modes \(v_{k}\), which results in the detail coefficients D and the approximation coefficients A. Let every level of detail coefficients obtained from the DWT be represented by \(D_{L}\). In this work, the number of wavelet decomposition levels, L, is 3. For every \(D_{L}\), soft thresholding of the wavelet coefficients is performed with a threshold th using Eq. (3) [18]. In the present work, the threshold value th is chosen by trial and error and is set to 0.5.

$$\begin{aligned} D^{T}_{L} = {\left\{ \begin{array}{ll} D_{L} - th &{}\quad \text {if } D_{L} \ge th \\ 0 &{} \quad \text {if } |D_{L}| < th \\ D_{L} + th &{} \quad \text {if } D_{L} \le -th \\ \end{array}\right. } \end{aligned}$$
(3)

The thresholded detail coefficients and the approximation coefficients of the modes \(v_{k}\) are used to reconstruct the modes using the inverse discrete wavelet transform. These enhanced modes \(v^{e}_{k}\) are then summed to reconstruct the signal. The reconstructed signal represents the denoised dysarthric speech signal \(f_{d}\).
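To make the enhancement procedure concrete, the following is a minimal sketch of the VMD plus wavelet soft-thresholding pipeline described above. It assumes the third-party vmdpy and PyWavelets packages; the ADMM settings (alpha, tau, tolerance) and the 'db4' wavelet order are illustrative assumptions rather than values reported in this paper.

```python
# Minimal sketch of the VMD + wavelet soft-thresholding enhancement (Sect. 3.1).
# Assumptions: the vmdpy and PyWavelets packages; the ADMM settings (alpha, tau, tol)
# and the 'db4' wavelet are illustrative choices, not values reported in the paper.
import numpy as np
import pywt
from vmdpy import VMD


def enhance_dysarthric_speech(signal, K=3, levels=3, th=0.5, wavelet="db4"):
    """Decompose into K VMD modes, soft-threshold the detail coefficients of each
    mode, reconstruct the modes and sum them to obtain the denoised signal f_d."""
    alpha, tau, DC, init, tol = 2000, 0.0, 0, 1, 1e-7          # assumed VMD settings
    modes, _, _ = VMD(signal, alpha, tau, K, DC, init, tol)    # modes v_k, shape (K, N)

    enhanced_modes = []
    for v_k in modes:
        # DWT of the mode: coeffs = [A_L, D_L, D_{L-1}, ..., D_1]
        coeffs = pywt.wavedec(v_k, wavelet, level=levels)
        # Soft-threshold only the detail coefficients, as in Eq. (3)
        coeffs = [coeffs[0]] + [pywt.threshold(d, th, mode="soft") for d in coeffs[1:]]
        enhanced_modes.append(pywt.waverec(coeffs, wavelet))   # inverse DWT

    # Sum the enhanced modes v_k^e to form the reconstructed (denoised) signal
    n = min(len(m) for m in enhanced_modes + [np.asarray(signal)])
    return np.sum([m[:n] for m in enhanced_modes], axis=0)
```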

3.2 Speech Recognition Using Convolutional Neural Networks

Convolutional neural networks (CNNs) belong to a class of feed-forward ANN models with a large number of hidden layers that perform feature extraction and classification. The enhanced speech signal is passed through a series of convolutional and pooling layers, followed by a fully connected layer that produces class scores. The convolutional and pooling layers together act as feature extractors, and the fully connected layer acts as a classifier that recognizes the spoken word. There are five main operations in the CNN: (a) convolution, (b) non-linearity, (c) pooling, (d) flattening and (e) classification. In the convolutional layer, feature maps are extracted from the input signal with the help of kernels. The kernel matrix acts as a sliding window that moves over the input signal; at every position of the window, the convolution operation is performed between the kernel matrix and the underlying elements of the input. The rectified linear unit (ReLU) is the activation function applied to the feature maps obtained by convolution. ReLU is a piecewise linear function that examines each element of the feature map and passes it through if it is positive; negative elements are replaced by zero. The intuition is that positive values accumulate knowledge and are preserved, while negative values contribute little and are eliminated. Pooling is a subsampling operation that reduces the dimensionality of each feature map while retaining its information; it is performed by taking the maximum or the average over the values under a defined window. In this work, max pooling is used. The flattening layer flattens the output of the previous layer into a single vector that can be used as input to the next layer. Classification is performed by fully connected layers, i.e. flat feed-forward neural network layers that use a non-linear activation function to output word prediction probabilities.
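As a toy illustration of the convolution, ReLU and max-pooling operations described above (the numbers and sizes are arbitrary and are not the configuration used in this paper):

```python
# Toy 1-D illustration of convolution, ReLU and max pooling (values are arbitrary).
import numpy as np

x = np.array([0.2, -0.5, 1.0, 0.3, -0.1, 0.8])      # toy input signal
kernel = np.array([1.0, 0.0, -1.0])                  # toy convolution kernel

feature_map = np.convolve(x, kernel, mode="valid")   # sliding-window convolution
activated = np.maximum(feature_map, 0.0)             # ReLU: negatives become zero
pooled = activated.reshape(-1, 2).max(axis=1)        # max pooling with pool size 2
print(feature_map, activated, pooled)
```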

4 Experimental Results

This section elaborates the experiments carried out to evaluate the proposed method on a benchmark dataset for dysarthric speech, viz., the Universal Access (UA)-Speech database, and the results obtained. The dysarthric speech signals are first enhanced using VMD and wavelet based thresholding. A CNN based model is then developed and evaluated for dysarthric speech recognition. All experiments are carried out on a system with a 1.99 GHz processor and 16 GB of RAM. Section 4.1 describes the dataset used to evaluate the proposed method and Sect. 4.2 elaborates the results obtained.

4.1 Dataset used for Dysarthric Speech Recognition

The proposed method for dysarthric speech recognition is evaluated using the Universal Access (UA)-Speech database [19]. The database consists of 10 digits, 26 radio alphabet words, 19 computer commands and 100 common words spoken by 19 dysarthric speakers. The data were recorded with an 8-microphone array and a digital video camera. The speakers have different speech intelligibility levels: ‘very low (0–25%)’, ‘low (25–50%)’, ‘mid (50–75%)’ and ‘high (75–100%)’. The severity of dysarthria is inversely related to speech intelligibility; for instance, a speaker suffering from severe dysarthria has very low speech intelligibility.

The experiments carried out in this work use the 10 digits, 19 computer commands and 100 common words spoken by 10 dysarthric speakers (4 male, 6 female) taken from the UA-Speech database. Table 1 lists the speakers and the number of utterances they made for every class of words. To evaluate the performance of the proposed method robustly, dysarthric speakers of all intelligibility levels are considered. The words uttered by 10 speakers without disabilities from the same database are used as control speech. The dataset is divided 80%/20% into training and test sets respectively. This work uses 2470, 1300 and 130,000 utterances of computer commands, digits and common words respectively; of these, 1976, 1040 and 104,000 utterances respectively are used for training and the remainder for testing. The proposed system is implemented in Python using the Keras library.
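A minimal sketch of the 80%/20% split is given below, assuming the utterances of one word category have already been loaded as arrays with word labels; the stratification and the random seed are illustrative assumptions, since the paper does not specify how the split was drawn.

```python
# Hedged sketch of the 80/20 train/test split per word category (digits, commands,
# common words). Stratification and the random seed are assumptions.
from sklearn.model_selection import train_test_split

def split_category(signals, labels, test_fraction=0.2, seed=0):
    return train_test_split(signals, labels,
                            test_size=test_fraction,
                            stratify=labels,      # keep word classes balanced (assumed)
                            random_state=seed)
```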

Table 1 Details of speakers and number of utterances used in this work

4.2 Results

The dysarthric speech signals are enhanced using VMD with the number of modes set to 3. Daubechies wavelets with 5 decomposition levels are used for the discrete wavelet transform. Soft wavelet thresholding is used with a threshold th of 0.5. The original signal of the word ‘people’ spoken by dysarthric speaker ‘F02’, its 3 VMD modes, the wavelet-thresholded VMD modes and the enhanced speech signal are shown in Fig. 2. The quality of the enhanced dysarthric speech signals is compared with that of the original signals using the signal-to-noise ratio (SNR) as the performance measure. To calculate the SNR values, the words spoken by the corresponding control speakers are used as reference signals. The average SNR values obtained for the original and enhanced dysarthric speech signals are presented in Table 2. The SNR values of the dysarthric speech signals enhanced using VMD are higher than those of the original signals, which shows that VMD and wavelet thresholding help in enhancing the dysarthric speech signals.
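As an indication of how such an SNR comparison can be computed, the sketch below uses the common ratio of reference power to residual power; the exact SNR formula and the alignment of dysarthric utterances with the control-speaker references are not detailed in the paper, so both are assumptions here.

```python
# Hedged SNR sketch: reference = control-speaker utterance, test = original or
# enhanced dysarthric utterance; assumes the two signals are aligned and equal length.
import numpy as np

def snr_db(reference, test):
    reference = np.asarray(reference, dtype=float)
    test = np.asarray(test, dtype=float)
    noise = reference - test
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

# e.g. compare snr_db(control_word, original_word) with snr_db(control_word, enhanced_word)
```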

Fig. 2 Original and enhanced signals of the word ‘people’ uttered by F02: a original waveform, b VMD mode 0, c VMD mode 1, d VMD mode 2, e wavelet-thresholded VMD mode 0, f wavelet-thresholded VMD mode 1, g wavelet-thresholded VMD mode 2, h enhanced waveform

Table 2 Performance of VMD and wavelet thresholding based enhancement of dysarthria speech signals

Convolutional neural networks (CNNs) are used in this work to learn the dysarthric speech recognition model from the enhanced speech signals obtained using VMD and wavelet thresholding. The CNN has a one-dimensional (1D) input layer which takes the enhanced signal as input. It has four 1D convolutional layers with 8, 16, 32 and 64 output filters and kernel sizes of 13, 11, 9 and 7 respectively. The rectified linear unit (ReLU) is used as the activation function in the convolutional layers. A max pooling layer performs the max pooling operation with a pool size of 3 and is followed by a dropout layer in which 20% of the neurons are dropped. The final layer recognizes the word; it has as many neurons as there are words in the category and uses the ‘softmax’ activation function. The scatter plots of the features obtained from the CNN model for the test instances of common words, computer commands and digits are presented in Fig. 3. The features of the 100 classes of common words, 19 classes of computer commands and 10 classes of digits are visualized using t-distributed stochastic neighbor embedding (t-SNE) [20]. t-SNE reduces the dimensionality of the features so that they can be visualized; in the present work, it is used to visualize the CNN features in a two-dimensional space. It can be observed that the features are discriminative.
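The architecture described above can be sketched in Keras as follows. The layer widths, kernel sizes, pooling size and dropout rate follow the description in this section, while the input length, the placement of the single pooling/dropout stage and the compilation settings are assumptions, since they are not fully specified in the paper.

```python
# Keras sketch of the described CNN: four Conv1D layers (8/16/32/64 filters,
# kernel sizes 13/11/9/7, ReLU), max pooling (pool size 3), 20% dropout, softmax output.
# Input length, exact layer placement and optimizer/loss are assumptions.
from tensorflow.keras import layers, models

def build_dysarthric_cnn(input_length, num_words):
    model = models.Sequential([
        layers.Input(shape=(input_length, 1)),            # enhanced 1-D speech signal
        layers.Conv1D(8, 13, activation="relu"),
        layers.Conv1D(16, 11, activation="relu"),
        layers.Conv1D(32, 9, activation="relu"),
        layers.Conv1D(64, 7, activation="relu"),
        layers.MaxPooling1D(pool_size=3),
        layers.Dropout(0.2),
        layers.Flatten(),
        layers.Dense(num_words, activation="softmax"),    # 10 digits, 19 commands or 100 words
    ])
    model.compile(optimizer="adam",                       # assumed training settings
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```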

Fig. 3 Visualization of features for a common words, b computer commands, c digits

Table 3 Results for various categories of words

The results obtained for the various categories of words, i.e. common words, computer commands and digits, are presented in Table 3. It can be observed that the recognition accuracies for all categories of words are better for the CNN models with VMD based enhancement. Speaker-specific results are presented in Table 4, and Figs. 4, 5, 6 and 7 show the accuracies of speakers with intelligibility levels ‘very low’, ‘low’, ‘mid’ and ‘high’ respectively. The average accuracy obtained with the CNN based model alone is \(91.80\%\), while speech enhancement using VMD and wavelet based thresholding together with the CNN gives an average accuracy of \(95.95\%\). For almost every speaker, VMD based enhancement with the CNN provides better results than the CNN alone; the exception is speaker M01, for whom the CNN alone performs better, possibly due to inconsistency in the words spoken by this speaker. The results obtained using the proposed method are compared with existing methods from the literature in Table 5. It can be observed that the proposed method gives better results than the existing methods based on generative models and ANNs. The results of the proposed method are also better than those obtained using MFCC features and an ANN by Shahamiri et al. [13], although they used all the utterances of all speakers in the UA-Speech database.

Table 4 Speaker specific results obtained for the proposed method
Fig. 4 Accuracies for very low intelligibility speakers

Fig. 5 Accuracies for low intelligibility speakers

Fig. 6 Accuracies for mid intelligibility speakers

Fig. 7 Accuracies for high intelligibility speakers

Table 5 Comparison of results with existing methods

5 Conclusion

This paper proposed a two-step method for dysarthric speech recognition. In the first step, speech signal enhancement based on variational mode decomposition and wavelet thresholding is performed. In the second step, a convolutional neural network is used to learn features from the enhanced signals automatically, rather than relying on handcrafted features. The average accuracy obtained with the proposed method is 95.95% with VMD based enhancement and 91.80% without speech enhancement. The results of the proposed method are better than those of existing methods. In the present work, only the common words, computer commands and digits of 10 speakers from the UA-Speech database are used for evaluation. In future, the entire UA-Speech database will be used for evaluation. Moreover, the layers of the CNN will be designed to extract features more effectively.