Introduction

Speech is a sequence of sounds produced as the output of the time-varying vocal tract system. In normal speech production, the articulators move in response to neural signals. Dysarthric speech is distorted because dysarthria is a motor speech disorder that impairs control of the articulators. Since the articulators and muscles involved in the speech production mechanism are damaged or paralysed, dysarthric speakers face impediments in speaking properly and find it difficult to convey information to others through speech. The speech intelligibility of the dysarthric speakers considered in our work (Kim et al. 2008) ranges between 2 and 95%. Dysarthria severity level has been assessed from short speech segments using residual neural networks (Gupta et al. 2021). Dysarthric severity classification using deep neural networks has been reported on speech utterances from the Torgo and UA-Speech databases (Joshy and Rajan 2021). Articulatory features and deep CNNs have been used to develop speaker-independent speech recognition systems for dysarthric speakers (Emre et al. 2019).

A dysarthric speech recognition system has been implemented using Mel frequency cepstral coefficients (MFCC) and assessed with GMM-HMM, DNN-HMM, CNN-HMM, and CLASM-HMM classifiers (Kim et al. 2018). Dysarthric speech recognition on the Torgo database has been performed using MFCC and convolutional recurrent neural networks (Albaqshi and Sagheer 2020). The Listen, Attend and Spell (LAS) model has been investigated for dysarthric speech recognition, with character error rate (CER) as the performance metric (Takashima et al. 2019). A gated CNN-based voice conversion system has been used to enhance the intelligibility of dysarthric speech for automatic speech recognition (Chen et al. 2020). The accuracy of dysarthric speech recognition has also been improved using empirical mode decomposition and CNN (Sidi Yakoub et al. 2020). The present work mainly uses a speech enhancement technique to improve the accuracy of a dysarthric isolated digit recognition system. It employs deep CNN models for template creation and testing, both for original dysarthric speech and for intelligibility-enhanced speech obtained through phase spectrum compensation (PSC). This paper is organized as follows. ‘Development of dysarthric speech recognition system’ section describes the database used in our work and analyses normal and dysarthric speech in the time, frequency, and time–frequency domains. ‘Implementation of the CNN-based dysarthric speech recognition’ section describes the methods for implementing the system, covering feature extraction, CNN-based model development, and the speech enhancement technique used. Experimental results based on the proposed features and the CNN-based system are presented in the ‘Results of the dysarthric speech recognition system based on experiments conducted’ section. ‘Discussion based on the outcome of the experiments’ section discusses the experimental results. The conclusions of the work are summarized in the ‘Conclusions’ section.

Development of dysarthric speech recognition system

A dysarthric speech recognition system is developed to recognize the utterances of dysarthric speakers. Since dysarthric speech is highly distorted, such a system is of paramount importance, yet building a robust one is challenging. This section describes the database used and analyses normal and dysarthric speech in the time, frequency, and time–frequency domains.

Details of the database used—dysarthric speaker information (Kim et al. 2008)

Table 1 indicates the speaker information considered in our study. An isolated digit recognition system is developed to recognize the digits uttered by dysarthric speakers. These speakers are diagnosed with spastic dysarthria, and their intelligibility levels vary between 2 and 95%.

Table 1 Information—speakers considered for the current study

Analysis of speech in time, frequency, and time–frequency domains

Speech uttered by normal and dysarthric speakers is analysed in the time, frequency, and time–frequency domains. For example, Fig. 1 presents the analysis of the isolated digit ‘one’ uttered by a normal speaker in the time, frequency, and time–frequency domains. The speaker takes less than a second (0.84 s) to utter this word.

Fig. 1 Normal speaker—analysis of speech in time, frequency, and time–frequency domains

Figure 2 shows the analysis of speech uttered by dysarthric speaker M09, whose speech intelligibility is 86%, in the time, frequency, and time–frequency domains. Because this speaker’s intelligibility is relatively high, the signal characteristics show more similarity to those of the normal speaker’s speech. This speaker takes 2.15 s to utter the digit ‘one’.

Fig. 2 Dysarthric speaker M09 with 86% intelligibility—analysis of speech in time, frequency, and time–frequency domains

Figure 3 shows the analysis of speech uttered by dysarthric speaker F03, whose intelligibility is 6%, in the time, frequency, and time–frequency domains. The severity of the disorder is evident in the signal characteristics when compared with those of the normal speaker. This speaker takes 2.6 s to utter the isolated digit ‘one’.

Fig. 3 Dysarthric speaker F03 with 6% intelligibility—analysis of speech in time, frequency, and time–frequency domains

This analysis indicates that dysarthric speakers take more time to utter even simple words and that the severity level strongly affects their ability to speak, lengthening their utterances further. There is therefore a need for an automated system that recognizes their speech and acts as a translator.

Implementation of the CNN-based dysarthric speech recognition

Dysarthric speech recognition is implemented in two phases: training and testing. During training, features are extracted from the utterances of the dysarthric speakers, the features are applied to the modelling technique to create templates as representative models of the speech, and the models are fine-tuned for speech recognition. During testing, features are extracted from the utterances earmarked for testing and applied to the models. Depending on the classifier output, the speech is recognized as the isolated digit associated with the best-matching model.

Feature extraction phase

In this work on CNN-based dysarthric speech recognition, time–frequency representational features are used to fine-tune the CNN models. The extracted features should discriminate well among the speech classes considered. Speech utterances of each isolated digit are concatenated, and the spectrogram, Melspectrogram, and Gammatonegram are extracted for speech frames of 8192 samples, with the frame advanced by 256 samples at each step. The block schematic used for feature extraction is depicted in Fig. 4.

Fig. 4 Feature extraction phase

Eighty percent of the features are used for training and the remaining 20% for testing. The sizes of the spectrogram, Melspectrogram, and Gammatonegram features are [129, 127], [64, 127], and [64, 49], respectively. For one frame, the spectrogram, Melspectrogram, and Gammatonegram are plotted in Figs. 5, 6, and 7.
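A minimal sketch of this extraction step in Python is given below, assuming a 16 kHz sampling rate, SciPy/librosa for the spectrogram and Melspectrogram, and the third-party gammatone package for the Gammatonegram. The window and hop settings are assumptions chosen to reproduce the feature sizes reported above; they are not parameters stated in the text.

```python
import numpy as np
from scipy.signal import spectrogram
import librosa
from gammatone.gtgram import gtgram  # third-party 'gammatone' package

FS = 16000      # assumed sampling rate
FRAME = 8192    # frame length stated in the text
SHIFT = 256     # frame advance stated in the text

def frame_features(frame, fs=FS):
    """Compute the three time-frequency features for one 8192-sample frame.
    The window/hop settings are assumptions chosen to reproduce the
    reported feature sizes [129, 127], [64, 127], and [64, 49]."""
    # Spectrogram: 128-sample window, 64-sample hop, 256-point FFT -> [129, 127]
    _, _, spec = spectrogram(frame, fs=fs, nperseg=128, noverlap=64, nfft=256)
    # Melspectrogram: 64-band mel filterbank applied to the spectrogram -> [64, 127]
    mel_fb = librosa.filters.mel(sr=fs, n_fft=256, n_mels=64)
    melspec = mel_fb @ spec
    # Gammatonegram: 64 channels, 25 ms window, 10 ms hop -> [64, 49]
    gtg = gtgram(frame, fs, window_time=0.025, hop_time=0.010,
                 channels=64, f_min=50)
    return spec, melspec, gtg

def sliding_features(signal):
    """Slide the 8192-sample frame over the concatenated digit
    utterances in steps of 256 samples, as described above."""
    for start in range(0, len(signal) - FRAME + 1, SHIFT):
        yield frame_features(signal[start:start + FRAME])
```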

Fig. 5 Spectrogram—waterfall plot—dysarthric speech—isolated digit ‘one’

Fig. 6 Melspectrogram—waterfall plot—dysarthric speech—isolated digit ‘one’

Fig. 7 Gammatonegram—waterfall plot—dysarthric speech—isolated digit ‘one’

Development of CNN templates

The spectrogram, Melspectrogram, and Gammatonegram two-dimensional feature sets for each isolated digit are applied to the CNN network, and the network models are fine-tuned to perform speech recognition for dysarthric speakers. Table 2 describes the CNN layered architecture (Soliman et al. 2021; Zhang et al. 2017; Arias-Vergara et al. 2021; Vavrek et al. 2021; Sangwan et al. 2020; Chen et al. 2020; Binh et al. 2021) used in this work for Gammatonegram-based dysarthric isolated digit recognition.

Table 2 CNN layered architecture—Gammatonegram—dysarthric isolated digit recognition

A similar CNN architecture is implemented with the image input size changed to [129, 127, 1] for the spectrogram-based and [64, 127, 1] for the Melspectrogram-based networks. Figure 8 indicates the modules used for creating the group CNN templates for the proposed features.
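Since Table 2’s exact layer parameters are not reproduced here, the sketch below shows one plausible Keras realization of such a layered architecture for the Gammatonegram input size [64, 49, 1]; the filter counts, kernel sizes, and dropout rate are assumptions, not the values of Table 2.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(64, 49, 1), num_classes=10):
    """Illustrative layered CNN for Gammatonegram-based digit recognition.
    Filter counts, kernel sizes, and dropout rate are assumptions. Use
    (129, 127, 1) or (64, 127, 1) as input_shape for the spectrogram and
    Melspectrogram networks, respectively."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dropout(0.4),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# 80% of the feature frames train the model; the remaining 20% are held out:
# model = build_cnn()
# model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=30)
```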

Fig. 8 CNN template creation

Phase spectrum compensation-based speech enhancement (Stark et al. 2008)

In this method, a modified phase response is combined with the noisy magnitude response to obtain a changed frequency response for the noisy speech. Exploiting the relation between the spectral and time domains during the synthesis process makes it possible to cancel out noise-dominated components, thus producing a signal with a reduced noise component. The short-time Fourier transform (STFT) of the noisy signal is computed as in Eq. (1).

$$X_{n}\left(k\right)=\left|X_{n}(k)\right|e^{j\angle X_{n}(k)}$$
(1)

The compensated short-time phase spectrum is computed using Eqs. (2) and (3). The phase spectrum compensation function is obtained as in Eq. (2).

$${\Lambda }_{n}\left(k\right)=\lambda \psi (k)\left|{D}_{n}(k)\right|$$
(2)

where \(\left|{D}_{n}(k)\right|\) specifies the magnitude spectrum of the noise signal and \(\lambda\) is a constant.

The anti-symmetry function \(\psi (k)\) is defined as in Eq. (3).

$$\psi\left(k\right)=\left\{\begin{array}{c}1\;if\;0<\frac kN<0.5\\-1\;if\;0.5<\frac kN<1\end{array}\right.$$
(3)

Multiplying the symmetric magnitude spectrum of the noise signal by the anti-symmetric function \(\psi \left(k\right)\) produces an anti-symmetric \({\Lambda }_{n}\left(k\right)\). Noise cancellation occurs during the synthesis process through the anti-symmetry of the phase spectrum compensation function. The offset complex spectrum of the noisy speech is computed as in Eq. (4).

$${Y}_{n}\left(k\right)={X}_{n}\left(k\right)+{\Lambda }_{n}\left(k\right)$$
(4)

The compensated phase spectrum of the noisy signal is derived as in Eq. (5).

$${\angle Y}_{n}\left(k\right)=ARG[{Y}_{n}\left(k\right)]$$
(5)

The compensated phase response is recombined with the magnitude response of the noisy signal to obtain the modified spectrum in Eq. (6), from which the enhanced speech is derived by performing the inverse transform in Eq. (7).

$${S}_{n}\left(k\right)=\left|{X}_{n}(k)\right|{e}^{j{\angle Y}_{n}(k)}$$
(6)
$$s\left(n\right)=\mathrm{real}\left[\mathrm{inverse}\;\mathrm{STFT}\left(S_{n}\left(k\right)\right)\right]$$
(7)
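A minimal NumPy sketch of Eqs. (1)–(7) follows. The noise magnitude estimate \(\left|{D}_{n}(k)\right|\), the value of \(\lambda\) (a setting reported in the PSC literature, taken here as an assumption), and the framing parameters are all illustrative rather than the settings of the original work.

```python
import numpy as np

def psc_enhance(noisy, noise_mag, lam=3.74, frame_len=256, hop=128):
    """Sketch of phase spectrum compensation following Eqs. (1)-(7).
    noise_mag: two-sided noise magnitude estimate |D_n(k)| of length
    frame_len (e.g. averaged over noise-only frames). lam and the
    framing values are illustrative assumptions."""
    win = np.hanning(frame_len)
    # Anti-symmetry function psi(k) of Eq. (3); zero at k = 0 and k = N/2
    psi = np.zeros(frame_len)
    psi[1:frame_len // 2] = 1.0
    psi[frame_len // 2 + 1:] = -1.0
    comp = lam * psi * noise_mag                  # Eq. (2): Lambda_n(k)
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, hop):
        x = noisy[start:start + frame_len] * win
        X = np.fft.fft(x)                         # Eq. (1): noisy spectrum X_n(k)
        Y = X + comp                              # Eq. (4): offset spectrum Y_n(k)
        S = np.abs(X) * np.exp(1j * np.angle(Y))  # Eqs. (5)-(6): noisy magnitude
                                                  # with compensated phase
        s = np.real(np.fft.ifft(S))               # Eq. (7): inverse transform
        out[start:start + frame_len] += s * win   # overlap-add synthesis
        norm[start:start + frame_len] += win ** 2
    return out / np.maximum(norm, 1e-8)
```

The anti-symmetric offset leaves strong speech-dominated bins largely unchanged, while low-energy noise-dominated bins receive phase offsets that cause them to cancel during the overlap-add synthesis, as described above.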

Figure 9 illustrates the performance of the phase spectrum compensation speech enhancement technique.

Fig. 9 Illustration of speech enhancement technique—phase spectrum compensation

Results of the dysarthric speech recognition system based on experiments conducted

Speech utterances of each isolated digit are concatenated, and the spectrogram, Melspectrogram, and Gammatonegram two-dimensional features are extracted after voice activity detection on the original raw dysarthric speech. Speech intelligibility is then improved by applying phase spectrum compensation as a speech enhancement technique to the raw dysarthric speech, and the proposed time–frequency representational features are extracted from the enhanced speech as well. Eighty percent of the features are used for training and 20% for testing. The test features are given to the CNN models, and, based on the matching, each row of test vectors is associated with one of the training groups, where a group is a categorical label for one of the isolated digits ‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘6’, ‘7’, ‘8’, ‘9’, ‘0’. The confusion chart in Fig. 10 indicates the performance of the Gammatonegram and CNN-based dysarthric isolated digit recognition without PSC speech enhancement. The overall average system accuracy is 88.3%, with a relatively low accuracy of 83% for recognizing the digit ‘2’.

Fig. 10 Confusion chart—Gammatonegram and CNN-based system (without PSC)

The confusion chart shown in Fig. 11 depicts the performance of the spectrogram and CNN-based dysarthric isolated digit recognition system without PSC. The average accuracy is 97.8%.

Fig. 11 Performance chart—spectrogram and CNN-based system (without PSC)

The confusion chart depicted in Fig. 12 demonstrates the performance of dysarthric isolated digit recognition for the Melspectrogram and CNN-based system. The average accuracy is 98%.

Fig. 12 Performance chart—Melspectrogram and CNN-based system (without PSC)

Decision-level fusion of the results of the three spectrogram-type features for the CNN-based systems is performed; Table 3 reports the performance of the fused system, and Fig. 13 depicts the proposed decision-level fusion scheme. The overall accuracy of the decision-level fusion classifier is 99.72%.
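The text does not spell out the exact fusion rule, so the sketch below assumes a simple majority vote over the three classifiers’ predicted digit labels; the tie-break behaviour is an artifact of scipy.stats.mode, not a documented design choice.

```python
import numpy as np
from scipy import stats

def fuse_decisions(pred_spec, pred_mel, pred_gtg):
    """Majority vote over the predicted digit labels of the three CNN
    classifiers. When all three disagree, scipy.stats.mode falls back
    to the smallest label, which is an arbitrary tie-break."""
    votes = np.stack([pred_spec, pred_mel, pred_gtg])  # shape (3, n_samples)
    fused, _ = stats.mode(votes, axis=0, keepdims=False)
    return fused

# Example: three classifiers vote on four test frames
# fuse_decisions([1, 2, 3, 0], [1, 2, 5, 0], [1, 7, 5, 9]) -> [1, 2, 5, 0]
```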

Table 3 Performance assessment—decision-level fusion of spectrograms and CNN-based system (without PSC)
Fig. 13 Decision-level fusion classifier

Figures 14, 15, and 16 indicate the performance of the Gammatonegram-, spectrogram-, and Melspectrogram-based CNN systems when PSC is applied to the raw dysarthric speech as an enhancement technique to improve intelligibility.

Fig. 14 Performance chart—Gammatonegram and CNN-based system with PSC for speech enhancement

Fig. 15 Performance chart—spectrogram and CNN-based system with PSC for speech enhancement

Fig. 16 Performance chart—Melspectrogram and CNN-based system with PSC for speech enhancement

The average recognition accuracies of the Gammatonegram-, spectrogram-, and Melspectrogram-based CNN systems with PSC as the speech enhancement technique are 96.67%, 99.46%, and 98.76%, respectively. Table 4 indicates the performance of the dysarthric isolated digit recognition system with PSC speech enhancement and decision-level fusion of the results of the three two-dimensional features (spectrogram, Melspectrogram, and Gammatonegram).

Table 4 Performance assessment—dysarthric digit recognition with PSC for speech enhancement and decision-level fusion classifier

The overall average accuracy is 99.92%. Figure 17 compares the performance of the system with and without the speech enhancement technique.

Fig. 17 Comparative analysis with and without PSC speech enhancement technique

This work on isolated digit recognition is extended to connected word recognition. Twenty related words spoken by dysarthric speakers are taken, and Fig. 18 shows the confusion charts for connected word recognition using the proposed features and CNN-based systems. The performance of the decision-level fusion of correct indices across the feature-based CNN systems is indicated in Fig. 19. This system is also evaluated with PSC for speech enhancement, and the accuracy remains good.

Fig. 18 Results—confusion charts—proposed features and CNN-based systems—connected word recognition

Fig. 19 Results—decision-level fusion—connected word recognition

Discussion based on the outcome of the experiments

In this work on dysarthric speech recognition, the speech utterances of dysarthric speakers are split into training and testing sets. Spectrogram, Melspectrogram, and Gammatonegram features are extracted from the training utterances, and CNN templates are created for each isolated digit from the pertinent input features. The test utterances of each isolated digit are then evaluated, and the system’s performance is analysed for the three different two-dimensional spectrogram-type features with CNN used for modelling and classification. Of the features used, the spectrogram and Melspectrogram yield nearly the same overall accuracy. A further experiment is conducted on the intelligibility-improved speech of dysarthric speakers obtained by applying the phase spectrum compensation technique for speech enhancement; here, the spectrogram-based features attain the best accuracy among the three. Decision-level fusion of the experimental outcomes for the features with the speech enhancement technique applied attains a very good accuracy of 99.92% over all the isolated digits uttered by dysarthric speakers. Table 5 gives a comparative analysis of the proposed work with existing works in the literature.

Table 5 Comparative analysis—proposed work with related works in literature

Conclusions

In this paper, a speech recognition system for the isolated digits uttered by dysarthric speakers is developed and assessed using two-dimensional spectrogram features and a deep CNN. Spectrogram, Melspectrogram, and Gammatonegram features are extracted from the speech utterances of the isolated digits ‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘6’, ‘7’, ‘8’, ‘9’, ‘0’. Eighty percent of the derived two-dimensional time–frequency representational features are applied to the CNN layered architecture to create combined group CNN models. The remaining 20% of the features are applied to the CNN group models, and classification is done by associating feature frames with one of the training groups. The Gammatonegram-, spectrogram-, and Melspectrogram-based CNNs have overall accuracies of 88.3%, 97.89%, and 98%, respectively. Decision-level fusion of the correct classification indices of the three CNN-based systems yields a good overall accuracy of 99.72%, with 100% individual accuracy for some isolated digits. The system is also evaluated by applying the PSC speech enhancement technique to the raw dysarthric speech, which gives a 9% increase in overall accuracy for the Gammatonegram and CNN-based system, a 1% increase for the spectrogram- and Melspectrogram-based systems, and a marginal improvement for the decision-level fusion of all CNN-based systems. Such an automated system for recognizing the distorted speech of persons with dysarthria would be useful for caretakers in providing the required help and assistance.