1 Introduction

Speech production can be viewed as the output of a vocal tract system excited by vocal cord vibration driven by airflow from the lungs. The articulators and muscles of the vocal tract move in response to neural signals to produce speech. In persons with dysarthria, these muscles are too weak to deliver speech effectively. Dysarthria refers to a group of neurogenic disorders marked by irregularities in speech, assessed in terms of the strength, speed, range, tone and precision of articulator movements. Dysarthric patients may struggle to control the articulatory mechanisms needed to produce normal sounds, and they can often be identified simply by listening to how they speak: their utterances may be unclear, slurred, erratic, rapid, slow or weak. They may also have difficulty controlling the facial muscles and articulators used for speech production. Stroke, brain injury or a brain tumour can underlie the neurological deficits of dysarthric patients; dysarthria may also result from facial paralysis or from weakness of the tongue and throat muscles.

Speech-language pathologists treat dysarthric patients and administer remedial measures to improve their speaking abilities. It is suggested that dysarthric patients' communication abilities could improve by increasing lip and tongue movement, strengthening the speech muscles, reducing the speaking rate, and performing regular voice exercises. Converting thoughts into speech sounds involves the articulators of the vocal tract system [1]. Speech is the output of a linear time-varying vocal tract system excited by quasi-periodic air pulses due to vocal cord vibration for voiced sounds and by noise for unvoiced sounds. Dysarthria is a speech disorder caused by the inability to control the articulators, and the resulting speech impairment affects every daily activity and severely reduces quality of life.

Dysarthria makes it difficult to coordinate the nerves and articulators used for speech production, so the articulation of speech sounds is perturbed by this lack of neural control. Dysarthric speakers also speak considerably more slowly than typical speakers. Because muscular activity is poorly controlled, patients with dysarthria find it difficult to regulate speech parameters such as loudness, speed, pacing, breath, rhythm and voice quality. A system is proposed [2] to recognize the speech of persons with an articulation disorder caused by cerebral palsy. The acoustic and language models are the constituent components of such a speech recognition system.

An acoustic model may be specific to persons with dysarthria, whereas a language model may be universal across speaker categories. A speech recognition system [3] has been developed to assess the severity of dysarthric speakers' impairments. A partial-least-squares-based voice conversion (VC) method [4] has been applied to dysarthric speech. Healthy speech has been transformed into dysarthric utterances for data augmentation [5], and large-scale machine-learning models are used for classification.

Convolutional bottleneck networks [6] have been used for speech recognition. Two hybrid speech recognition systems (DNN-HMM and GMM-HMM) [7] have been developed, yielding a 13% improvement in word error rate. Recognition of dysarthric speech has been improved using rhythm knowledge [8]; a connectionist approach assesses the severity of dysarthria, and a Hidden Markov Model is used to recognize speaker-dependent dysarthric speech. A system [9] has been developed to convert the spoken utterances of physically disabled persons into intelligible utterances that are easier to understand; the transformations are based on the movement of the articulators during speech production. Non-negative matrix factorization [10] has been used for voice conversion and performs better than GMM-based voice conversion.

Augmenting acoustic models with articulatory information [11] improves recognition of dysarthric speech, with the integration tailored to dysarthric speakers. Deep neural networks [12] have been used to predict the severity of dysarthric speakers' impairments. Dysarthric speech has been modified in the temporal and spectral domains [13] to improve its intelligibility. A speech recognition system [14] based on Hidden Markov Models has been developed for dysarthric speakers, and the severity of their speech impairments has been evaluated. Speech enhancement techniques are described in [15,16,17,18,19,20,21], and the UA-Speech database is described in [22]. A speech recognition system [23] has been developed for the hearing impaired. An unsupervised learning method [24] has been developed to assess auditory systems for speech recognition without requiring specific transcriptions of the training data.

Dysarthric speech classification from coded telephone speech has been developed [25, 26]; the features are extracted using a deep-neural-network-based glottal inverse filtering method. An algorithm has also been proposed for detecting syllable boundaries and syllable repetitions [27] in dysarthric speech. Acoustic speech parameters [28] have been analyzed for patients with Parkinson's disease. Speech patterns have been analyzed to study the speaking characteristics of dysarthric speakers, and speech recognition systems [29,30,31,32,33,34] have been developed. Variational mode decomposition with wavelet thresholding has been used for speech enhancement, and CNNs [35] classify dysarthric speech on the UA-Speech database. Speaker-independent dysarthric speech assessment systems [36] have been developed, and deep neural network architectures [37] have been used to analyze dysarthric speech. Empirical mode decomposition with Hurst-based mode selection (EMDH), combined with a convolutional neural network (CNN) [38], has been used to improve dysarthric speech recognition. The diversity of dysarthric speakers' speech patterns [39] has been characterized from a clinical perspective and through speech analytics. Dysarthric speech has been synthesized using text-to-speech (TTS) systems [40] to improve the accuracy of dysarthric speech recognition.

Deep belief neural networks [41] have been used for dysarthric speech recognition. Dysarthric speech has been augmented [42] with additional training data to improve accuracy. Transfer-learning-based convolutional neural networks (CNN) [43] have been applied to dysarthric speech recognition on the TORGO dataset. Features and models for dysarthric speech recognition [44] have been evaluated on the UA-Speech database. CNNs have also been used for speech emotion recognition [45] and for the detection of dysarthric speech [46]. Automatic assessment of dysarthric speech intelligibility [47] has been performed using deep learning techniques, and deep-learning-based acoustic feature representations [48] have been developed for dysarthric speech recognition [49,50,51,52]. Audio-visual features are considered in [49] for dysarthric speech recognition, and a dysarthric isolated digit recognition system with speech enhancement techniques has been developed [50]. The present work emphasizes the use of speech enhancement techniques to improve the intelligibility of dysarthric speech and thereby build a robust dysarthric speech recognition system; it also investigates different spectral features and machine learning techniques for this purpose.

In this work on speaker-independent dysarthric speech recognition, Sect. 2 describes the database used, the analysis of dysarthric speech in the time and frequency domains, the implementation of speech enhancement techniques, the feature extraction procedures, and the modelling and testing procedures. Section 3 presents the experimental results, the subjective comparison between the experimental and manual assessments, and the statistical validation of the results. Finally, Sect. 4 summarises the outcome of the dysarthric speech recognition system obtained by applying speech enhancement techniques, features, models and the testing procedures of the different modelling techniques.

2 Preliminaries

2.1 Dysarthric Speech Database

The dysarthric dataset [22] considered in this work contains utterances of each isolated digit from six speakers (M01, M04, M07, M09, F03, F05) aged 18–51. According to the database description, M01, M04, M07, M09 and F03 have low speech intelligibility, whereas F05 has high speech intelligibility. Subjective listening to the speech of F05 also confirms the clarity of her utterances, which are close to those of typical speakers, and listeners can easily understand and recognize her recordings. These speakers are spastic or athetoid and use wheelchairs. Speech intelligibility is measured as the average score in word transcription tasks. Only a few utterances were recorded from these speakers because it is difficult for them to understand and reproduce the word transcriptions correctly, which makes it hard to increase the robustness of a dysarthric speech recognition system.

2.2 Analysis – Dysarthric Speech

Characterizing the speech uttered by dysarthric speakers is an interesting problem. Dysarthric speakers differ widely from one another in how they pronounce words and sentences, which necessitates an extensive database for recognizing their utterances. Their speech is highly distorted, so even subjective identification of the utterances is difficult. On average, dysarthric and typical speakers differ in speaking style, accent and place of articulation. Figure 1 shows the characteristics of a dysarthric speaker's utterance of the isolated digit "one" in terms of the signal waveform and its spectrogram.

Fig. 1
figure 1

Speech signal and spectrogram – Dysarthric speaker (M09)—Digit "One"

The signal waveform and spectrogram in Fig. 2 depict another dysarthric speaker's characteristics when uttering the word "one". The same word spoken at different instants may differ in amplitude and spectral energy. Since dysarthric speech is indistinct, achieving good accuracy in recognizing dysarthric speakers' utterances is all the more demanding.

Fig. 2
figure 2

Speech signal and spectrogram – Dysarthric speaker (M07)—Digit "One"

2.3 Implementation of Speaker Independent Speech Recognition System

Both speaker-dependent and speaker-independent speech recognition systems are implemented to recognize isolated words, isolated digits, continuous sentences or spontaneous sentences. A speaker-dependent system is developed using one set of utterances from all the speakers for training and the remaining utterances from the same speakers for testing. A speaker-independent system uses utterances from some of the enrolled speakers for training and utterances from the other enrolled speakers for testing.

Feature extraction and modelling are the two stages of the training phase. The feature extraction stage extracts speech-specific robust features, which are then applied to the modelling techniques to create templates for the speech classes. In the testing phase, test feature vectors are applied to the trained models, and the spoken word is recognized as belonging to the best-matching model. It is therefore important to select robust features and suitable modelling techniques and to implement an appropriate testing procedure. It is equally essential to apply techniques that improve the clarity of the distorted speech of dysarthric speakers so that the system's accuracy can be reasonably enhanced.

2.4 Speech Enhancement Techniques

In noisy practical environments, background noise sources often degrade speech clarity. Efficient speech enhancement techniques are therefore required to improve the clarity of the speech. The noisy speech is represented by Eq. (1)

$$y_{n} \left( m \right) = x_{n} \left( m \right) + d_{n} \left( m \right)$$
(1)

\({x}_{n}\left(m\right)\) – Clean speech signal.

\({d}_{n}\left(m\right)\) – Noise signal.
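As a simple illustration of the additive model in Eq. (1), the sketch below mixes a noise recording into clean speech at a chosen signal-to-noise ratio. The file names, the use of the soundfile package and the 5 dB SNR are assumptions for illustration only, not part of the original experiments.

```python
# Minimal sketch of the additive noise model y_n(m) = x_n(m) + d_n(m) in Eq. (1).
import numpy as np
import soundfile as sf  # assumed audio I/O package

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that clean + noise has the requested SNR (in dB)."""
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

clean, fs = sf.read("digit_one.wav")      # hypothetical clean utterance
noise, _ = sf.read("background.wav")      # hypothetical noise recording
noisy = mix_at_snr(clean, noise, snr_db=5.0)
```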

2.4.1 Single-Channel Online Enhancement of Speech [15]

Background noise and reverberation degrade voice-based interaction between people, and speech enhancement techniques are used to improve speech quality for better speech recognition. The online speech enhancement technique based on an all-pole model enhances speech quality; it uses a reverberation power model and a hidden Markov model to remove the noise superimposed on the speech. Statistical parameters are estimated from the speech and noise, the analysis is performed on a short-time Fourier transform (STFT) with filters spaced on the MEL scale, and a spectral gain is derived.

Figure 3 indicates the speech enhancement process in the STFT domain. The system parameters and signal powers are estimated in MEL-spaced sub-bands, and the power spectrum is then transformed using filters spaced on the Mel scale.

Fig. 3
figure 3

Speech enhancement using a single-channel online enhancement technique

Consider the noisy speech signal \({y}_{n}\left(m\right)\). The STFT is applied to \({y}_{n}\left(m\right)\), and its coefficients are computed as in (2)

$$Y_{n} \left( k \right) = \mathop \sum \limits_{m = 0}^{N - 1} y_{n} \left( m \right)w\left( m \right)e^{{\frac{ - i2\pi mk}{N}}}$$
(2)

k – STFT frequency bin. \(n\) – Time frame index. \(w(m)\) – Hamming window sequence. \(N\) – Frame length in samples.

A power-domain filter bank is applied to compute the power in each Mel-spaced sub-band \(F\) as in (3)

$$\widehat{{Y_{n} }}\left( F \right) = \mathop \sum \limits_{k = 0}^{N - 1} a_{F,k} \left| {Y_{n} \left( k \right)} \right|^{2}$$
(3)

\({a}_{F,k}\) – Frequency response of the triangular filters.
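A minimal sketch of Eqs. (2)–(3) is given below: each Hamming-windowed frame is transformed with an FFT and its power is summed in Mel-spaced triangular sub-bands, the triangular weights playing the role of \(a_{F,k}\). The frame length, hop size and number of sub-bands are assumptions.

```python
# Sub-band power computation in the STFT domain (Eqs. (2)-(3)).
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters a_{F,k} with centres spaced uniformly on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

def subband_powers(y, fs=16000, n_fft=512, hop=256, n_filters=24):
    window = np.hamming(n_fft)                             # w(m)
    fbank = mel_filterbank(n_filters, n_fft, fs)
    powers = []
    for start in range(0, len(y) - n_fft + 1, hop):
        frame = y[start:start + n_fft] * window            # y_n(m) w(m)
        spectrum = np.fft.rfft(frame)                      # Y_n(k), Eq. (2)
        powers.append(fbank @ (np.abs(spectrum) ** 2))     # Eq. (3)
    return np.array(powers)                                # frames x sub-bands
```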

An HMM is used to model the clean speech, with an initial state probability distribution, state transition probabilities, and an output observation probability distribution; its combined state vector is given in (4)

$$H_{m} = \left( {x_{m} , d_{m} ,R_{m} } \right)^{T}$$
(4)

\({{\text{H}}}_{{\text{m}}}\) is an HMM state vector that includes the clean speech \({x}_{m}\), the noise parameters \({{\text{d}}}_{{\text{m}}}\) and the reverberation parameters \({{\text{R}}}_{{\text{m}}}\) for all sub-bands in a single state vector. Noise removal and the improvement in intelligibility of the noisy speech are carried out in the STFT domain. Figure 4 shows the signals before and after applying the online speech enhancement algorithm to dysarthric speech.

Fig. 4
figure 4

Illustration of online speech enhancement algorithm

2.4.2 Speech Enhancement Using a Minimum Mean Square Error (MSE) Log-Spectral Amplitude Estimator [16]

The short-time spectral amplitude (STSA) estimator minimizes the MSE between the log-spectrum of the original short-time spectral amplitude and that of its estimate. The magnitude and phase responses of the noisy, noise and clean speech signals are expressed in the frequency domain as

\({Y}_{k}={R}_{k}{e}^{j{\gamma }_{k}}\), \({D}_{k}={B}_{k}{e}^{j{\beta }_{k}}\) and \({X}_{k}={A}_{k}{e}^{j{\alpha }_{k}}\)

The short-time spectral amplitude estimator \(\widehat{{{\text{A}}}_{{\text{k}}}}\) minimizes the distortion measure defined in (5)

$$E\left[ {\left( {\log A_{k} - \log \widehat{{A_{k} }}} \right)^{2} } \right]$$
(5)

The estimator that minimizes (5) is the exponential of the conditional expectation of \({\text{ln}}{A}_{k}\) given \({Y}_{k}\), as in (6)

$$\overline{{A_{k} }} = \exp \left\{ {E\left[ {\ln A_{k} |Y_{k} } \right]} \right\}$$
(6)

MSE of log power spectra is calculated as in (7)

$$E\left\{ {\left( {\log A_{k}^{2} - \log \widehat{{A_{k} }}^{2} } \right)^{2} } \right\}$$
(7)

If \({\widehat{{A}_{k}}}^{2}\) denotes the optimal estimator of \({A}_{k}^{2}\) under (7), the corresponding amplitude estimate \(\overline{{A}_{k}}\) is obtained as in (8)

$$\overline{{A_{k} }} = \sqrt {\widehat{{A_{k} }}^{2} }$$
(8)

\(E[{\text{ln}}{A}_{k}|{Y}_{k}]\) is computed by utilizing the moment-generating function of \({\text{ln}}{A}_{k}\) given \({Y}_{k}\).

Let \({Z}_{k}={\text{ln}}{A}_{k}\), and let \({\varphi }_{{Z}_{k|{Y}_{k}}}(\mu )\) be the moment-generating function of \({Z}_{k}\) given \({Y}_{k}\), defined as in (9)

$$\varphi_{{Z_{{k|Y_{k} }} \left( \mu \right)}} = E\{ \exp (\mu Z_{k} |Y_{k} )\}$$
(9)

\(E\{{\text{ln}}{A}_{k}|{Y}_{k}\}\) is obtained from \({\varphi }_{{Z}_{k|{Y}_{k}}}(\mu )\) by using (10)

$$E\left\{ {\ln A_{k} |Y_{k} } \right\} = \left. {\frac{d}{d\mu }\varphi_{{Z_{{k|Y_{k} }} }} \left( \mu \right)} \right|_{\mu = 0}$$
(10)

The resulting log-spectral amplitude estimator is as in (11)

$$\overline{{A_{k} }} = \frac{{\varepsilon_{k} }}{{1 + \varepsilon_{k} }}\exp \left\{ {\frac{1}{2}\mathop \smallint \limits_{{v_{k} }}^{\infty } \frac{{e^{ - t} }}{t}dt} \right\}R_{k}$$
(11)
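A minimal sketch of the gain implied by Eq. (11) is shown below; the integral is the exponential integral \(E_1(v_k)\), available as scipy.special.exp1. The a priori SNR (here xi, playing the role of \(\varepsilon_k\)) and the a posteriori SNR are assumed to be estimated elsewhere, e.g. by a decision-directed rule, and are simply passed in as arrays.

```python
# Log-spectral amplitude gain of Eq. (11), applied per STFT bin.
import numpy as np
from scipy.special import exp1  # exponential integral E1(x) = integral_x^inf e^-t / t dt

def lsa_gain(xi, gamma):
    """xi: a priori SNR, gamma: a posteriori SNR (arrays of the same shape)."""
    v = xi / (1.0 + xi) * gamma                       # v_k as in Eq. (19)
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(np.maximum(v, 1e-8)))

# Enhanced amplitude for one frame: A_hat = lsa_gain(xi, gamma) * R,
# where R is the noisy short-time spectral amplitude |Y_k|.
```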

Figure 5 indicates the speech enhancement process using a log spectral amplitude estimator.

Fig. 5
figure 5

Illustration of speech enhancement by log spectral amplitude estimator

2.4.3 Speech Enhancement by Using Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator [17]

The signal \({\text{x}}\left({\text{t}}\right),\) the noise \({\text{n}}\left({\text{t}}\right),\) and the noisy observation \({\text{y}}\left({\text{t}}\right)\) are expressed in the frequency domain as \({{\text{X}}}_{{\text{k}}}\), \({{\text{D}}}_{{\text{k}}}\) and \({{\text{Y}}}_{{\text{k}}}\). The coefficient \({{\text{Y}}}_{{\text{k}}}\) over the interval [0, T] is defined as in (12)

$$Y_{k} = \frac{1}{T}\mathop \smallint \limits_{t = 0}^{T} y\left( t \right)\exp \left( {\frac{ - j2\pi kt}{T}} \right){\text{dt}}$$
(12)

The spectral components are assumed to be uncorrelated with each other. The MMSE estimator \(\widehat{{{\text{A}}}_{{\text{k}}}}\) of \({{\text{A}}}_{{\text{k}}}\) given \({{\text{Y}}}_{{\text{k}}}\) is obtained as in (13) and (14)

$$\widehat{{A_{k} }} = E\{ A_{k} |Y_{k} \}$$
(13)

\(E\{.\}\) – denotes Expectation operation

$$\widehat{{A_{k} }} = \frac{{\mathop \smallint \nolimits_{0}^{\infty } \mathop \smallint \nolimits_{0}^{2\pi } a_{k} p\left( {Y_{k} |a_{k} , \alpha_{k} } \right)p\left( {a_{k} ,\alpha_{k} } \right)da_{k} d\alpha_{k} }}{{\mathop \smallint \nolimits_{0}^{\infty } \mathop \smallint \nolimits_{0}^{2\pi } p\left( {Y_{k} |a_{k} , \alpha_{k} } \right)p\left( {a_{k} ,\alpha_{k} } \right)da_{k} d\alpha_{k} }}$$
(14)

\(p\left(.\right)\)- Probability density function.

\(p\left({Y}_{k}|{a}_{k} , {\alpha }_{k}\right)\) is given by (15)

$$p\left( {Y_{k} |a_{k} , \alpha_{k} } \right) = \frac{1}{{\pi \lambda_{d} \left( k \right)}}\exp \left\{ {\frac{ - 1}{{\lambda_{d} \left( k \right)}}\left| {Y_{k} - a_{k} e^{{ - i\alpha_{k} }} } \right|^{2} } \right\}$$
(15)

\(p\left({a}_{k},{\alpha }_{k}\right)\) is given by (16)

$$p\left( {a_{k} ,\alpha_{k} } \right) = \frac{{a_{k} }}{{\pi \lambda_{x} \left( k \right)}}\exp \left\{ {\frac{{ - a_{k}^{2} }}{{\lambda_{x} \left( k \right)}}} \right\}$$
(16)

\({\lambda }_{d}\left(k\right)\) and \({\lambda }_{x}\left(k\right)\) are the variances of the \(k\)th spectral components of the noise and the speech, respectively.

Substituting (15) and (16) into (14) yields (17), which evaluates to (18)

$$\widehat{{A_{k} }} = \frac{{\mathop \smallint \nolimits_{0}^{\infty } a_{k}^{2} \exp \left( {\frac{{ - a_{k}^{2} }}{{\lambda_{k} }}} \right)I_{0} \left( {2a_{k} \sqrt {\frac{{v_{k} }}{\lambda \left( k \right)}} } \right)da_{k} }}{{\mathop \smallint \nolimits_{0}^{\infty } a_{k} \exp \left( {\frac{{ - a_{k}^{2} }}{{\lambda_{k} }}} \right)I_{0} \left( {2a_{k} \sqrt {\frac{{v_{k} }}{\lambda \left( k \right)}} } \right)da_{k} }}$$
(17)
$$\widehat{{A_{k} }} = \Gamma \left( {1.5} \right)\frac{{\sqrt {v_{k} } }}{{\gamma_{k} }}\exp \left( {\frac{{ - v_{k} }}{2}} \right)\left[ {\left( {1 + v_{k} } \right)I_{0} \left( {\frac{{v_{k} }}{2}} \right) + v_{k} I_{1} \left( {\frac{{v_{k} }}{2}} \right)} \right]R_{k}$$
(18)
$$\Gamma \left( {1.5} \right) = \frac{\sqrt \pi }{2}$$

\(\Gamma \left(.\right)\) Denotes the gamma function.

\({I}_{0}\left(.\right)\) and \({I}_{1}\left(.\right)\) denote the modified Bessel functions of zero and first order, respectively, with parameters as in Eqs. (19)–(21)

$$v_{k} = \frac{{\varepsilon_{k} }}{{1 + \varepsilon_{k} }}\gamma_{k}$$
(19)
$${\epsilon }_{k}=\frac{{\lambda }_{x}(k)}{{\lambda }_{d}(k)}$$
(20)
$${\gamma }_{k}=\frac{{{R}_{k}}^{2}}{{\lambda }_{d}(k)}$$
(21)

\({\epsilon }_{k}\) and \({\gamma }_{k}\) – a priori and a posteriori signal-to-noise ratios.
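A minimal sketch of the MMSE-STSA gain defined by Eqs. (18)–(21) follows; the exponentially scaled Bessel functions i0e/i1e are used so that the \(\exp(-v_k/2)\) factor is absorbed without numerical overflow. The SNR estimates are again assumed to come from elsewhere.

```python
# MMSE short-time spectral amplitude gain following Eq. (18).
import numpy as np
from scipy.special import i0e, i1e  # exponentially scaled modified Bessel functions

def stsa_gain(xi, gamma):
    """xi: a priori SNR (Eq. 20), gamma: a posteriori SNR (Eq. 21)."""
    v = xi / (1.0 + xi) * gamma                              # Eq. (19)
    # exp(-v/2) * I0(v/2) = i0e(v/2), and likewise for I1, so Eq. (18) becomes:
    return (np.sqrt(np.pi) / 2.0) * (np.sqrt(v) / gamma) * (
        (1.0 + v) * i0e(v / 2.0) + v * i1e(v / 2.0))

# Enhanced amplitude: A_hat_k = stsa_gain(xi, gamma) * R_k
```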

Figure 6 depicts the speech enhancement process by a short-time spectral amplitude estimator.

Fig. 6
figure 6

Illustration of speech enhancement by short-time spectral amplitude estimator

2.4.4 Wavelet Denoising for Speech Enhancement [18]

The wavelet denoising technique suppresses noise in noisy speech to obtain clean speech. First, a wavelet packet transform decomposes the noisy speech into approximation and detail coefficients. A threshold is then fixed and applied to the final-level sub-band coefficients to reduce the noise contribution. Figure 7 shows the wavelet-based speech enhancement procedure. Finally, enhanced speech is obtained by upsampling and interpolating the modified detail and approximation coefficients.
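The sketch below illustrates the idea with PyWavelets; it uses a plain discrete wavelet decomposition with a soft universal threshold as a stand-in for the wavelet packet analysis described above, and the wavelet family, decomposition level and threshold rule are assumptions.

```python
# Wavelet-denoising sketch: decompose, soft-threshold the detail coefficients, reconstruct.
import numpy as np
import pywt

def wavelet_denoise(noisy, wavelet="db4", level=4):
    coeffs = pywt.wavedec(noisy, wavelet, level=level)
    # Noise level estimated from the finest detail coefficients (median absolute deviation)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(len(noisy)))            # universal threshold
    coeffs = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[:len(noisy)]
```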

Fig. 7
figure 7

Speech enhancement using Wavelet denoising technique

Figure 8 demonstrates the speech enhancement process by using wavelets.

Fig. 8
figure 8

Illustration of speech enhancement process by wavelets

2.4.5 Probabilistic Geometric Approach (PGA) to Spectral Subtraction-Based Speech Enhancement [19]

A confidence parameter for the noise estimate is introduced in the gain function, which removes the noise efficiently while preventing speech distortion. The schematic in Fig. 9 depicts the modules of the PGA-based speech enhancement technique.

Fig. 9
figure 9

Speech enhancement using Probabilistic Geometric Approach

In the STFT domain, the noisy speech is represented as in (22)

$$Y_{n} \left( k \right) = X_{n} \left( k \right) + D_{n} \left( k \right)$$
(22)

\(n\) – frame number.

The STFT of \({{\text{y}}}_{{\text{n}}}\left({\text{m}}\right)\) is represented as in (23)

$$Y_{n} \left( k \right) = \mathop \sum \limits_{m = 0}^{N - 1} y_{n} \left( m \right)e^{{ - j\frac{2\pi mk}{N}}}$$
(23)

From the basic rule of spectral subtraction, Eq. (23) can be written as (24)

$$\left| {\widetilde{{X_{n} \left( k \right)}}} \right|^{2} = \left| {Y_{n} \left( k \right)} \right|^{2} - \left| {D_{n} \left( k \right)} \right|^{2}$$
(24)

This equation can also be written as (25)

$$\left| {\widetilde{{X_{n} \left( k \right)}}} \right|^{2} = \left| {H_{{n\left( {PGA} \right)}} \left( k \right)} \right|^{2} - \left| {D_{n} \left( k \right)} \right|^{2}$$
(25)

\({H}_{n(PGA)}\left(k\right)\) – gain function of the nth frame. \({X}_{n}\left(k\right)\), \({Y}_{n}\left(k\right)\) and \({D}_{n}\left(k\right)\) can be expressed in polar form as

$$X_{n} \left( k \right) = a_{x} e^{{i\theta_{x} }} ,\;Y_{n} \left( k \right) = a_{y} e^{{i\theta_{y} }} ,\;D_{n} \left( k \right) = \rho a_{d} e^{{i\theta_{d} }}$$

\(\rho\) – a constant dependent on the a posteriori and a priori SNRs.

\({a}_{x}\), \({a}_{y}\) and \({a}_{d}\) are the magnitude response of clean, noisy and noise signals.

\({\theta }_{x}\), \({\theta }_{y}\) and \({\theta }_{d}\) are the phase response of clean, noisy and noise signals.

The gain function \({H}_{n(PGA)}\left(k\right)\) can be defined as in (26)

$$H_{{n\left( {PGA} \right)}} \left( k \right) = \sqrt {\frac{{a_{y}^{2} }}{{a_{x}^{2} }}}$$
(26)

The unchanged phase spectrum and compensated magnitude spectrum are combined to produce an enhanced speech by using the formula in (27)

$$\widetilde{{X_{n} }}\left( k \right) = \left| {\widetilde{{X_{n} \left( k \right)}}} \right|e^{{i\theta_{y} }}$$
(27)

Figure 10 indicates the effect of the probabilistic geometric approach in enhancing the speech uttered by the dysarthric speaker.

Fig. 10
figure 10

Illustration of speech enhancement by a probabilistic geometric approach

2.4.6 The Geometric Approach to Spectral Subtraction-Based Speech Enhancement [20]

Noise present in speech is effectively reduced by spectral subtraction. The spectral subtraction method removes the noise under the assumption that the noise is uncorrelated with the speech signal. Figure 11 gives a detailed description of the blocks used for geometric-approach-based speech enhancement.

Fig. 11
figure 11

Geometric approach to spectral Subtraction for speech enhancement

The equation used to compute the clean-speech spectrum is as in (28)

$$\left| {\widetilde{{X_{n} \left( k \right)}}} \right|^{2} = \left| {H_{{n\left( {GA} \right)}} \left( k \right)} \right|^{2} - \left| {D_{n} \left( k \right)} \right|^{2}$$
(28)

\({H}_{n(GA)}\left(k\right)\) – gain function of the Geometric approach of the nth frame.

The magnitude and phase response of the noisy, noise and clean speech are related as in (29)

$$a_{y} e^{{i\theta_{y} }} = a_{x} e^{{i\theta_{x} }} + a_{D} e^{{i\theta_{d} }}$$
(29)

The triangle shown in Fig. 12 depicts the phase spectra of the noisy speech and clean speech and noise signals.

Fig. 12
figure 12

Geometric representation of noisy speech, clean speech and noise spectra

In Fig. 12, the trigonometric relations between the magnitude and phase spectra of the noisy speech, clean speech and noise signals lead to Eq. (30).

$$\overline{AB} \bot \overline{BC}$$
$${a}_{y}{\text{sin}}\left({\theta }_{D}-{\theta }_{y}\right)={a}_{x}{\text{sin}}\left({\theta }_{D}-{\theta }_{x}\right)$$

Squaring both sides gives

$${{a}_{y}}^{2}{sin}^{2}\left({\theta }_{D}-{\theta }_{y}\right)={{a}_{x}}^{2}{sin}^{2}\left({\theta }_{D}-{\theta }_{x}\right)$$
$${{a}_{y}}^{2}\left[1-{cos}^{2}\left({\theta }_{D}-{\theta }_{y}\right)\right]={{a}_{x}}^{2}\left[1-{cos}^{2}\left({\theta }_{D}-{\theta }_{x}\right)\right]$$

It can be written as in (30)

$$a_{y}^{2} \left[ {1 - C_{yD}^{2} } \right] = a_{x}^{2} \left[ {1 - C_{xD}^{2} } \right]$$
(30)

The gain function is defined as in (31)

$$H_{{n\left( {GA} \right)}} = \frac{{a_{x} }}{{a_{y} }} = \sqrt {\frac{{1 - C_{yD}^{2} }}{{1 - C_{xD}^{2} }}}$$
(31)

Using the cosine rule in triangle ABC, Eqs. (32) and (33) are obtained

$$C_{yD} = \frac{{a_{y}^{2} + a_{D}^{2} - a_{x}^{2} }}{{2a_{y} a_{D} }}$$
(32)
$$C_{xD} = \frac{{a_{y}^{2} - a_{D}^{2} - a_{x}^{2} }}{{2a_{x} a_{D} }}$$
(33)

Dividing the numerators and denominators of (32) and (33) by \({{a}_{D}}^{2}\) gives (34) and (35)

$$C_{yD} = \frac{{\frac{{a_{y}^{2} }}{{a_{D}^{2} }} + 1 - \frac{{a_{x}^{2} }}{{a_{D}^{2} }}}}{{\frac{{2a_{y} }}{{a_{D} }}}}$$
(34)
$$C_{xD} = \frac{{\frac{{a_{y}^{2} }}{{a_{D}^{2} }} - 1 - \frac{{a_{x}^{2} }}{{a_{D}^{2} }}}}{{\frac{{2a_{x} }}{{a_{D} }}}}$$
(35)
$$\Upsilon = \frac{{a_{y}^{2} }}{{a_{D}^{2} }},\;\varepsilon = \frac{{a_{x}^{2} }}{{a_{D}^{2} }}$$

\(\Upsilon\) – A posteriori SNR.

\(\varepsilon\)– A priori SNR.

The gain function can be written as in (36)

$$H_{{n\left( {GA} \right)}} = \frac{{a_{x} }}{{a_{y} }} = \sqrt {\frac{{1 - \frac{{\left( {\gamma + 1 - \varepsilon } \right)^{2} }}{4\gamma }}}{{1 - \frac{{\left( {\gamma - 1 - \varepsilon } \right)^{2} }}{4\gamma }}}}$$
(36)

Enhanced speech is obtained by combining an unchanged phase spectrum with compensated magnitude spectrum, as in (37)

$$\widetilde{{X_{n} }}\left( k \right) = \left| {\widetilde{{X_{n} \left( k \right)}}} \right|e^{{i\theta_{y} }}$$
(37)
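A minimal sketch of the geometric-approach gain of Eq. (36) applied in the STFT domain is given below. The noise power is assumed to be estimated from a few leading noise-only frames, and the a priori SNR is crudely approximated as \(\max(\gamma - 1, \epsilon)\) rather than with a decision-directed estimator; the frame length and other parameters are assumptions.

```python
# Geometric-approach spectral subtraction sketch (gain of Eq. (36), noisy phase kept).
import numpy as np
from scipy.signal import stft, istft

def geometric_enhance(noisy, fs=16000, n_fft=512, noise_frames=6):
    f, t, Y = stft(noisy, fs=fs, nperseg=n_fft)
    noise_psd = np.mean(np.abs(Y[:, :noise_frames]) ** 2, axis=1, keepdims=True)
    gamma = np.maximum(np.abs(Y) ** 2 / np.maximum(noise_psd, 1e-12), 1e-6)  # a posteriori SNR
    eps_snr = np.maximum(gamma - 1.0, 1e-3)                                  # crude a priori SNR
    num = 1.0 - (gamma + 1.0 - eps_snr) ** 2 / (4.0 * gamma)
    den = 1.0 - (gamma - 1.0 - eps_snr) ** 2 / (4.0 * gamma)
    gain = np.sqrt(np.clip(num, 0.0, None) / np.maximum(den, 1e-12))         # Eq. (36)
    X_hat = gain * Y                                  # unchanged noisy phase, as in Eq. (37)
    _, enhanced = istft(X_hat, fs=fs, nperseg=n_fft)
    return enhanced[:len(noisy)]
```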

Figure 13 illustrates the speech enhancement process by the geometric approach applied to the dysarthric speaker's speech.

Fig. 13
figure 13

Effect of speech enhancement process by a geometric approach

2.4.7 Phase Spectrum Compensation Based Speech Enhancement [21]

This method combines a modified phase response with the unchanged magnitude response to obtain a changed frequency response for the noisy speech. By exploiting the relation between the spectral and time domains during the synthesis process, the noise-dominated components can be cancelled out, producing a signal with a reduced noise level. The STFT of the noisy signal is computed as in (38)

$$Y_{n} \left( k \right) = \left| {Y_{n} \left( k \right)} \right|e^{{j\angle Y_{n} \left( k \right)}}$$
(38)

The compensated short-time phase spectrum is computed by using Eqs. (39)–(42).

The phase spectrum compensation function is obtained as in Eq. (39)

$$\wedge_{n} \left( k \right) = \lambda \psi \left( k \right)\left| {D_{n} \left( k \right)} \right|$$
(39)

\(\left|{D}_{n}(k)\right|\) denotes the magnitude spectrum of the noise signal.

\(\lambda\)– Constant.

The anti-symmetry function \(\psi (k)\) is defined as in (40)

$$\psi \left( k \right) = \left\{ {\begin{array}{*{20}c} 1 & {if\; 0 < \frac{k}{N} < 0.5} \\ { - 1} & {if\; 0.5 < \frac{k}{N} < 1} \\ \end{array} } \right.$$
(40)

Multiplying the symmetric magnitude spectrum of the noise signal by the anti-symmetric function \(\uppsi \left({\text{k}}\right)\) produces an anti-symmetric compensation function \({\wedge }_{{\text{n}}}\left({\text{k}}\right)\). Noise cancellation takes place during the synthesis process owing to the anti-symmetry of the phase spectrum compensation function. The compensated complex spectrum of the noisy speech is computed as in Eq. (41)

$$Y_{ \wedge ,n} \left( k \right) = Y_{n} \left( k \right) + \wedge_{n} \left( k \right)$$
(41)

The compensated phase spectrum of the noisy signal is then derived as in Eq. (42)

$$\angle Y_{n} \left( k \right) = ARG\left[ {Y_{ \wedge ,n} \left( k \right)} \right]$$
(42)

The compensated phase response is recombined with the magnitude response of the noisy signal to obtain the modified spectrum given in (43), from which the enhanced speech is derived by performing the inverse transform as in (44).

$$S_{n} \left( k \right) = \left| {Y_{n} \left( k \right)} \right|e^{{j\angle Y_{n} \left( k \right)}}$$
(43)
$$s\left( n \right) = real\;\left[ {inverse \;STFT\;\left( {S_{n} \left( k \right)} \right)} \right]$$
(44)
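A minimal sketch of the phase-spectrum-compensation procedure of Eqs. (38)–(44) is given below. The value of \(\lambda\), the noise-magnitude estimate from leading frames, and the STFT parameters are assumptions; a two-sided STFT is used so that the anti-symmetry of \(\psi(k)\) spans the full band.

```python
# Phase spectrum compensation sketch: compensate the phase, keep the noisy magnitude.
import numpy as np
from scipy.signal import stft, istft

def psc_enhance(noisy, fs=16000, n_fft=512, lam=3.0, noise_frames=6):
    f, t, Y = stft(noisy, fs=fs, nperseg=n_fft, return_onesided=False)
    noise_mag = np.mean(np.abs(Y[:, :noise_frames]), axis=1, keepdims=True)  # |D_n(k)| estimate
    k = np.arange(Y.shape[0])[:, None]
    psi = np.zeros_like(k, dtype=float)
    psi[(k > 0) & (k < n_fft / 2)] = 1.0                  # anti-symmetry function, Eq. (40)
    psi[(k > n_fft / 2) & (k < n_fft)] = -1.0
    Lam = lam * psi * noise_mag                           # compensation function, Eq. (39)
    Y_comp = Y + Lam                                      # compensated spectrum, Eq. (41)
    S = np.abs(Y) * np.exp(1j * np.angle(Y_comp))         # Eqs. (42)-(43)
    _, s = istft(S, fs=fs, nperseg=n_fft, input_onesided=False)
    return np.real(s)[:len(noisy)]                        # Eq. (44)
```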

Figure 14 indicates the performance of the speech enhancement technique by phase compensation. Figure 15 depicts the variation in the distribution of samples for each speech enhancement technique by performing histogram equalization.

Fig. 14
figure 14

Illustration of speech enhancement by phase spectrum compensation technique

Fig. 15
figure 15

Histogram equalization – (1) Original speech (2–8) Enhanced speech using speech enhancement techniques

2.5 Feature Extraction

PLP extraction is based on the principle of how the human ear perceives sounds [1]. The PLP extraction method is similar to linear prediction analysis, except that the spectral characteristics are modified to match the human auditory system. Perceptual features with filters spaced on the BARK, ERB and MEL scales are extracted from the pre-processed speech using the procedure shown in Fig. 16. The power spectrum of the pre-processed signal is obtained by FFT, and the auditory spectrum is obtained by multiplying the signal's power spectrum by the squared magnitude response of the filters spaced on the different frequency scales. Cube-root compression and loudness equalization simulate the human ear's power law of hearing. Finally, an inverse transform is performed to obtain the signal, from which cepstral coefficients are derived using LP and cepstral analyses.

Fig. 16
figure 16

Procedure—Perceptual features extraction

The procedural steps used for PLPC, MF-PLPC and ERB-PLPC extraction are summarised as follows; a short sketch of the frequency-scale mappings used in steps 2 and 3 is given after the list.

1 Computation of the power spectrum of the pre-processed speech segment.

2 Critical band analysis uses 21, 47 and 35 critical bands on the BARK, MEL and ERB frequency scales, respectively, at a sampling frequency of 16 kHz. The magnitude responses of the filter banks spaced on the MEL, BARK and ERB scales are shown in Figs. 17, 18 and 19. Frequency in Hz is related to the MEL, BARK and ERB scales as in (45), (46) and (47).

$$f\left( {Mel} \right) = 2595*\log_{10} \left( {1 + \frac{{f\left( {Hz} \right)}}{700}} \right)$$
(45)
$$f\left( {Bark} \right) = \left[ {\frac{{26.81f\left( {Hz} \right)}}{{1960 + f\left( {Hz} \right)}}} \right] - 0.53$$
(46)
$$f\left( {ERB} \right) = 24.7\left( {4.37f\left( {kHz} \right) + 1} \right)$$
(47)
Fig. 17
figure 17

Magnitude response of critical bands in the BARK scale

Fig. 18
figure 18

Magnitude response of the critical bands in the MEL scale

Fig. 19
figure 19

Magnitude response of the critical bands in the ERB scale

3 The power law of hearing is simulated by loudness equalization and cube-root compression. Loudness equalization is performed by a pre-emphasis (equal-loudness) filter that weights the filter-bank outputs to simulate the ear's sensitivity to different frequencies, as in (48).

$$E\left( \omega \right) = \frac{{\left( {\omega^{2} + 56.8*10^{6} } \right)^{4} }}{{\left( {\omega^{2} + 6.3*10^{6} } \right) \left( {\omega^{2} + 0.38*10^{9} } \right)\left( {\omega^{6} + 9.58*10^{26} } \right)}}$$
(48)

The equalized values are then transformed according to the power law of hearing, i.e., raised to the power 0.33, as represented in (49)

$$L\left( \omega \right) = I(\omega )^{\frac{1}{3}}$$
(49)

4 IFFT is performed on L (\(\omega\)).

5 The Levinson–Durbin procedure computes the LP coefficients.

6 The LP coefficients are converted into PLPC, MF-PLPC and ERB-PLPC cepstral coefficients.
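As referenced above, a small sketch of the frequency-scale mappings of Eqs. (45)–(47) and the cube-root compression of Eq. (49) follows. The 47 Mel bands and the 8 kHz upper edge are taken from the text; everything else is illustrative only.

```python
# Frequency-scale mappings and cube-root compression used in the perceptual front end.
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)          # Eq. (45)

def hz_to_bark(f_hz):
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53          # Eq. (46)

def hz_to_erb(f_hz):
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)            # Eq. (47), frequency in kHz

def cube_root_compression(intensity):
    return np.power(intensity, 1.0 / 3.0)                 # Eq. (49)

# Example: centre frequencies (in Hz) of 47 critical bands spaced on the Mel scale
mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 47 + 2)
centres_hz = 700.0 * (10.0 ** (mel_edges[1:-1] / 2595.0) - 1.0)
```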

2.6 Implementation of Template Creation Module

For a speech recognition system, templates are created to act as representative models of the speech classes to be recognized. Among the many modelling techniques, VQ-based or fuzzy-based clustering forms a low-dimensional cluster set from the high-dimensional training set. The cluster set contains M cluster centroids that represent the speech model derived from the high-dimensional training data. It is built by computing the Euclidean distance between the training vectors and the initial cluster centroids; the centroids are updated over iterations until the cluster set for the pertinent speech class represents the training feature vectors. For testing, the Euclidean distance is computed between the test vectors and each cluster set, and the centroid producing the minimum distance is retained. The minimum distances accumulated over all test feature vectors are stored as the model score; this is repeated for all models, and the model with the smallest score is selected for the test speech. The MHMM modelling technique uses the expectation–maximization procedure to generate templates containing maximum-likelihood parameters. During testing, the test features are applied to the MHMM models and log-likelihood values are computed; the model with the largest log-likelihood is associated with the test speech.
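As a rough illustration of the VQ-based template creation and minimum-distance testing described above, the sketch below uses k-means clustering from scikit-learn as the codebook generator; the codebook size of 32 and the use of scikit-learn are assumptions.

```python
# Sketch of VQ codebook training and minimum-distance classification.
import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(features_per_word, codebook_size=32):
    """features_per_word: dict mapping a word label to an (N, D) training feature array."""
    return {word: KMeans(n_clusters=codebook_size, n_init=10).fit(feats).cluster_centers_
            for word, feats in features_per_word.items()}

def classify(test_features, codebooks):
    """Return the word whose codebook gives the smallest average distance to the test vectors."""
    scores = {}
    for word, centroids in codebooks.items():
        # Euclidean distance from every test vector to every centroid
        d = np.linalg.norm(test_features[:, None, :] - centroids[None, :, :], axis=2)
        scores[word] = d.min(axis=1).mean()   # nearest-centroid distance, averaged
    return min(scores, key=scores.get)
```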

3 Experimental Evaluation – Results and Discussion

The dysarthric speech recognition system is evaluated using perceptual features and various modelling techniques. Applying different speech enhancement techniques to the distorted dysarthric speech enables the system to enhance its performance. The system encompasses training and testing phases. During training, the speech recordings are concatenated and conventional pre-processing techniques are applied, after which perceptual features are extracted and used to create templates. During testing, the test speech undergoes pre-processing, perceptual features are extracted and applied to all speech templates, and, based on the classifier used, the speech is identified with the pertinent template. Recognition accuracy/word error rate is used as the performance metric for evaluating the system. Finally, speech enhancement techniques are applied to the raw training and test speech, and the system's performance is reassessed. The implementation uses decision-level fusion of speech enhancement techniques, features and modelling techniques to classify the pertinent dysarthric speech: features extracted from each test segment are applied to the models, the model index is derived based on the classifier used, this process is repeated for all test segments, and finally a decision-level fusion of the indices obtained from the different modelling techniques is performed to augment the system's performance. The decision-level fusion classifier is depicted in Fig. 20.
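A minimal sketch of one possible decision-level fusion rule, majority voting over the labels returned by the individual (enhancement, feature, model) branches, is given below; the exact fusion rule is not spelled out here, so this is only an assumed instantiation.

```python
# Majority-vote fusion of the per-branch decisions for one test segment.
from collections import Counter

def fuse_decisions(branch_labels):
    """branch_labels: list of digit labels, one per (enhancement, feature, model) branch."""
    return Counter(branch_labels).most_common(1)[0][0]

# Hypothetical branch outputs for one test segment:
print(fuse_decisions(["one", "one", "four", "one"]))   # -> "one"
```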

Fig. 20
figure 20

Decision-level fusion classifier

This decision-level fusion classifier classifies the pertinent speech based on the correct classification across features, modelling techniques and speech enhancement techniques. Table 1 indicates the system's performance with decision-level fusion of features and models, for speech with and without speech enhancement. The overall accuracy for the ten digits in Fig. 21 shows the system's performance in recognizing dysarthric speech for each speech enhancement technique with decision-level fusion of the results over features and models. The individual accuracy for some isolated digits is 100%, and the overall accuracy of the decision-level fusion of the features, models and speech enhancement techniques is 80.2%.

Table 1 Performance – Average accuracy—decision level fusion of features and models for speech enhancement techniques
Fig. 21
figure 21

Performance of the dysarthric speech recognition system – dysarthric speaker F03 (6% speech intelligibility)

The individual accuracy for some isolated digits is low because testing is done with utterances of a dysarthric speaker with only 6% speech intelligibility. Training the models with more feature vectors could enhance the system's accuracy. The system has not provided good accuracy because it is tested on the female speaker with only 6% speech intelligibility. Decision-level fusion of the results over features and models provides a better overall accuracy of 43% when phase spectrum compensation is applied as the speech enhancement technique, which is 12% higher than the system without any speech enhancement. Thus, the system's accuracy depends on the features, models, speech enhancement techniques and the test set of spoken utterances. Although the system is trained on speech utterances at all intelligibility levels, obtaining good accuracy for speakers with very low intelligibility remains difficult.

Table 2 gives the individual performance of the isolated digit recognition system for the dysarthric speaker with 95% speech intelligibility, considering the perceptual features and vector quantization (VQ) models. The results show that the system provides excellent accuracy when the features and models are tested on speaker F05, who is diagnosed with 95% speech intelligibility. She speaks almost like a typical speaker, and testing with her utterances yields exemplary accuracy for all isolated digits. Thus, the speech utterances used for evaluation must be of acceptable quality with a low level of distortion.

Table 2 Performance of the system – Perceptual features and clustering –Female Speaker F05 (95% speech intelligibility)

3.1 Statistical Analysis and Validation of Experimental Results

The system's performance is statistically analyzed [23] to validate the choice of perceptual features, models and speech enhancement techniques for recognizing dysarthric speech. Table 3 shows the use of the χ2 distribution as a statistical tool to analyze the experimental results. The expected frequency is the number of test segments of the concatenated test speech uttered by the dysarthric speaker for the pertinent digit, and the observed frequency is the number of correctly identified test segments for each digit. The ten isolated digits are taken as ten attributes. Since the sample size is 100, the χ2 distribution is applied to statistically analyze the choice of features, models and speech enhancement techniques. The hypotheses based on the χ2 distribution are framed as below:

Table 3 Statistical analysis of isolated digits using decision-level fusion classification by the χ2 distribution test – F03 speaker (6% speech intelligibility)

H0: Rejection rate is greater than or equal to 10%

H1: Rejection rate is less than 10%

The individual χ2 test is applied at a 10% significance level with nine degrees of freedom, for which χ20.1 = 21.66. Comparing the calculated values with this table value, the H0 hypothesis is accepted. Table 4 shows the system's statistical analysis for speaker F05 with perceptual features and clustering as the modelling technique, with the hypotheses set as below.

Table 4 Statistical analysis – Performance of the Decision level Fusion system – Perceptual features and clustering – F05 speaker (95% speech intelligibility)

H1: Digit recognition rate is \(\ge \,95\%\)

H0: Digit recognition rate is \(< 95\%\)

The individual χ2 test is applied at a 5% significance level with nine degrees of freedom, for which χ20.05 = 16.919. The calculated values are much less than this table value; hence, the H1 hypothesis is accepted. Subjective analysis is carried out to supplement the experimental dysarthric speech recognition results. Four typical listeners are asked to recognize the speech uttered by the dysarthric speakers; they listen to the isolated digits spoken by the dysarthric speakers F03 and F05. Tables 5 and 6 give the subjective analysis results for recognizing the digits uttered by the dysarthric speakers. Figures 22 and 23 show the comparative analysis between the experimental and manual assessments of the isolated digits spoken by F03 and F05. Both the experimental and subjective analyses yield low accuracy for F03, who has 6% speech intelligibility, and the experimental assessment is better than manual recognition for all digits except 'zero'. This reveals how challenging it is to ensure good performance for dysarthric speech recognition. The comparative analysis in Fig. 23 indicates that the subjective assessment yields slightly better accuracy than the experimental assessment for F05; since F05 is a dysarthric female speaker with 95% speech intelligibility, both the decision-level experimental classification and the subjective analysis achieve very high accuracy. Accuracy is thus directly proportional to the intelligibility of the speakers' utterances, and in this work speech enhancement techniques are implemented to improve it.
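The sketch below shows how the χ2 comparison described above could be reproduced; the observed and expected per-digit counts are hypothetical placeholders, not values from Tables 3 or 4.

```python
# Chi-square check of per-digit recognition counts against expected counts.
import numpy as np
from scipy.stats import chi2

observed = np.array([9, 8, 10, 7, 9, 10, 8, 9, 10, 9])   # hypothetical correct counts per digit
expected = np.full(10, 10.0)                              # test segments per digit

statistic = np.sum((observed - expected) ** 2 / expected)
critical = chi2.ppf(1.0 - 0.05, df=9)    # 16.919 at the 5% level, 9 degrees of freedom
print(statistic, critical, statistic < critical)
```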

Table 5 Performance assessment by Subjective analysis – dysarthric speaker—F03 (6% speech intelligibility)
Table 6 Performance assessment by Subjective analysis – dysarthric speaker—F05 (95% speech intelligibility)
Fig. 22
figure 22

Comparative analysis – Experimental and subjective assessment – F03 dysarthric speaker (6% speech intelligibility)

Fig. 23
figure 23

Comparative analysis – Experimental and subjective assessment – F05 dysarthric speaker (95% speech intelligibility)

Since the speech of speaker F03, with 6% speech intelligibility, is highly distorted and disordered, it is cumbersome to ensure good accuracy. There are significant variations in style, difficulty level and pronunciation of words in the speech of such speakers. However, when the speech intelligibility is good, the speech can be classified without ambiguity. Adopting better speech enhancement mechanisms, features and models would be a promising way to ensure better accuracy for speech-impaired persons whose impairment level is high. Table 7 gives a comparative analysis between existing works and the proposed work.

Table 7 Comparative analysis – Existing works and the proposed work

4 Conclusion

Since the speech uttered by dysarthric people is severely distorted and degraded, it is essential to improve the intelligibility of dysarthric spoken utterances. Subjective analysis of dysarthric spoken words reveals that manual recognition by human listeners is difficult, especially for utterances of speakers with low speech intelligibility. The proposed system uses perceptual features, speech enhancement techniques and statistical modelling methods. The proposed decision-level fusion of features, models and speech enhancement techniques improves the accuracy of recognizing isolated digits uttered by dysarthric speakers. The decision-level fusion classifier achieves 81% accuracy in identifying digits spoken by the dysarthric speaker with 6% speech intelligibility and 99% accuracy in recognizing the isolated digits uttered by the dysarthric speaker with 95% speech intelligibility. The experimental results surpass manual recognition of digits uttered by the speaker with very low speech intelligibility, whereas manual recognition reaches 100% accuracy for the isolated digits spoken by the dysarthric speaker with 95% speech intelligibility. The system would provide better accuracy if it were trained on a database containing a larger number of utterances spoken by more dysarthric speakers. Such a system can act as a translator that helps caretakers understand dysarthric speakers' speech and provide them with the necessary assistance, and a robust speech translator may be designed to convert unintelligible spoken utterances into intelligible ones so that the speech of dysarthric speakers can be understood. This work emphasizes the need for more efficient speech enhancement techniques to improve speech quality, and it is proposed to strengthen the selection of features, speech enhancement and modelling techniques to further improve the system's performance.