1 Introduction

Dysarthria is known as a motor speech disorder resulting from the malfunction of the muscles controlling the vocal apparatus [1]. The causes of dysarthria are multiple and include Parkinson's disease, stroke, head trauma, tumors, muscular dystrophies, and cerebral palsy [2, 3]. Dysarthria may affect breathing, phonation, resonance, articulation, and prosody. The consequences are hypernasality and the drastic reduction in speech intelligibility. Vowels may also be distorted in the most severe cases. The range of degradation of intelligibility is wide and depends on the extent of neurological damage.

1.1 Dysarthric speech recognition

Automatic speech recognition (ASR) systems can be very useful for people who suffer from dysarthria and other speech disabilities. Unfortunately, due to the high variability and distortions in dysarthric speech [4, 5], the automatic recognition of dysarthric spoken words remains a challenging task [6, 7]. These distortions have a negative impact on the production and articulation of phonemes, which greatly complicates their automatic analysis and characterization. For instance, an effortful grunt is often heard at the end of vocalizations, and an excessively low pitch is frequently found, producing a harsh voice. In some cases, phonemes are characterized by pitch breaks in vocalic segments and imprecise consonant production. Therefore, the acoustical analysis of dysarthric speech has to deal with many issues related to aberrant voicing, tempo disturbance, unpredictable shifting of formant frequencies in sonorants, and utterances in which phonemes are erroneously dropped.

This complexity was also demonstrated by the acoustic study carried out by Ziegler and von Cramon, which involved ten patients with spastic dysarthria [8]. This study revealed that impaired acceleration of the moving articulators increases production time and thus induces a slower speech rate. These alterations disrupt or mask the acoustical characteristics that normally help discriminate between phonemes, which makes dysarthric speech recognition a more complex process.

Thus, there has recently been a trend toward the creation of tailored ASR systems for people with dysarthria [6, 9,10,11]. Indeed, the best results for dysarthric speech recognition have been provided by isolated word ASR models and conventional ASR algorithms, such as artificial neural networks (ANNs) [12], but an effective ASR system requires the ability to recognize continuous speech [13, 14]. Recently, some research initiatives have succeeded in recognizing dysarthric speech with a limited vocabulary; however, a large-vocabulary dysarthric speech recognition system is still unavailable.

Most conventional dysarthric speech recognition systems are generally based on statistical approaches such as hidden Markov models (HMMs) that perform the modeling of the sequential structure of speech signals. The HMMs of speech are mainly based on Gaussian mixture models (GMMs) that are considered the best statistical representation of the spectral distributions of speech waveforms.

Probabilistic modeling remains a powerful approach when coupled with a flexible representation of uncertainty along the time dimension. In this context, a Gaussian process regression (GPR) method was proposed to predict human intention [15]. For applications such as dysarthric speech synthesis, where partially observable sequences need to be completed, the GPR method could be useful for improving the naturalness of synthetic speech.

Nevertheless, these methods cannot be applied to the recognition of dysarthric speech. GMM-based modeling is effective when a large quantity of data is used to train a robust model; however, it is not as efficient for dysarthria because the corpora available for training are typically small [16].

As an alternative to the statistical approaches and in the context of the considerable progress made by connectionist approaches, numerous configurations based on deep neural networks (DNNs) have been proposed to deal with the inherent complexity of dysarthric speech. Among these advanced configurations, convolutional neural networks (CNNs) [17] and long short-term memory (LSTM) networks [18] have achieved state-of-the-art recognition accuracy in many applications.

1.2 Dysarthric speech processing using CNN-based architectures

Isolated word ASR models and conventional ANN architectures have been widely used to perform dysarthric speech recognition. The authors of [12] identified the best-performing set of mel-frequency cepstral coefficient (MFCC) parameters to represent dysarthric acoustic features for use in ANN-based ASR. The results show that the speech recognizer trained on the conventional 12-coefficient MFCC features, without delta and acceleration features, provided the best accuracy, and the proposed speaker-independent ASR recognized the speech of unseen dysarthric subjects with a word recognition rate of 68.38%. To improve dysarthric speech identification, the authors of [19] proposed a system using features resulting from the coding of 39 MFCCs by a deep belief network (DBN). The evaluation was performed using the Dysarthric Speech Database for Universal Access Research in both text-dependent and text-independent conditions, where an accuracy rate of 97.3% was achieved. Using the same data, a study presented in [20] explored multiple methods for improving a hybrid GMM-DNN-based HMM for dysarthric speech recognition. The experiments were carried out using DNNs with four hidden layers and sigmoid activation functions for the 1024 neurons of each layer; a dropout factor of 0.2 was applied during the first four DNN training epochs. This configuration reduced the average relative word error rate (WER) by 14.12%.

Recently, DNN-based architectures have been proposed to generate artificial samples of dysarthric speech. In [21], artificial dysarthric speech samples were presented to five experienced speech-language pathologists. The authors used CNN-based architectures in both the generator and the discriminator of dysarthric speech. The results reveal that speech-language pathologists identified transformed speech as dysarthric 65% of the time.

In [22], an interpretable objective severity assessment algorithm for dysarthric speech based on DNNs was proposed. An intermediate Darley–Aronson–Brown (DAB) layer containing a priori knowledge provided by speech-language pathologists and neurologists was added to the DNN. The model was trained with a scalar severity label at the output of the network and intermediate labels that describe how atypical the impaired speech was along four perceptual dimensions in the DAB layer. The best performance for severity prediction was 82.6%.

In [23], an automatic detection of dysarthria using extended speech features called centroid formants was presented. The experimental data consisted of 200 speech samples from 10 dysarthric speakers and 200 speech samples from 10 age-matched healthy speakers. The centroid formants enabled an accuracy of 75.6% achieved with just one hidden layer and 10 neurons.

In [24], 39 MFCCs were used as input features for a dysarthric speech recognizer based on a hybrid framework using a generative learning-based data representation and a discriminative learning-based classifier. The authors also proposed the use of example-specific HMMs to obtain log-likelihood scores for dysarthric speech utterances to form a fixed-dimensional score vector representation. The discriminative capabilities of the score vector representation technique were demonstrated, particularly in the case of utterances with very low intelligibility.

In a recent application [25], the authors proposed to rate dysarthric speakers along five perceptual dimensions: severity, nasality, vocal quality, articulatory precision, and prosody on a scale from 1 to 7 (from normal to severely abnormal). They also used the Google ASR engine to calculate the WER of uttered sequences. Based on the obtained results, 32 dysarthric speakers were categorized with respect to the severity of their impairment.

To capture relevant acoustic–phonetic information of impaired speech, numerous studies have investigated different types of features. In this context, mel-frequency spectral coefficients (MFSCs) have been proposed as the basic acoustic features [26]. For the CNN, the authors used 40-dimensional filter bank features to obtain more evolved speaker-independent MFSC features, a linear discriminant analysis transformation for projecting sequences of frames into 40 dimensions, and then a maximum likelihood linear transformation for diagonalizing the covariance matrix and capturing the correlations among vectors. For speaker-dependent features, the authors employed a feature-space maximum likelihood linear regression. A comparison of the speech recognition architectures shows that, even with a small database, hybrid DNN-HMM models outperform classical GMM-HMM models according to WER measures.

In another study published in [16], different types of input features used by DNNs were assessed to automatically detect repetition stuttering and nonspeech dysfluencies within dysarthric speech. The authors used the TORGO database, and the MFCC and linear prediction cepstral coefficient (LPCC) features produced similar recognition accuracies. Repetition stuttering in dysarthric speech and nondysarthric speech was correctly identified with accuracies of 86% and 84%, respectively. Nonspeech sounds were recognized with approximately 75% accuracy in dysarthric speakers.

In [27], a convolutive bottleneck network, which is an extension of a CNN, was proposed to extract disorder-specific features. A convolutive bottleneck network stacks a bottleneck layer, where the number of units is extremely small compared with the adjacent layers. The database used in their work was the American Broad News corpus. The use of bottleneck features in a convolutive network improved the accuracy from 84.3 to 88.0%.

In the context of speech-to-text systems for clinical applications, multiple speaker-independent ASR systems robust against pathological speech are presented in [28]. The authors investigated the performance of two convolutional neural network architectures: (1) a time–frequency convolutional neural network (TFCNN), which performs time and frequency convolution on gammatone filterbank features, and (2) a fused-feature-map convolutional neural network (FCNN), which applies frequency and time convolution in the acoustic and articulatory spaces, enabling the joint use of acoustic and articulatory information. The authors also compared the TFCNN models with standard DNN and CNN models.

Recently, the authors of [29] proposed a novel approach for assessing dysarthric speech intelligibility that correlates strongly with perceptual intelligibility. Their approach requires the patient to speak only a limited set of words (no more than five). The system is based on the end-to-end deep speech framework to obtain a string of characters.

1.3 Dysarthric speech processing using recurrent neural network (RNN)-based architectures

A recurrent neural network (RNN) is a category of artificial neural networks with the capacity to model the temporal dynamic behavior of a given input sequence. Its main feature is that the connections between nodes form a directed graph along the input time sequence. In a conventional RNN, the training algorithm uses gradient-based back-propagation through time. This configuration has the drawback of slow updating of the network weights. To solve this problem, a new structure, the LSTM network, was introduced. Unlike conventional RNNs, LSTM networks connect their units in a specific way to avoid the problems of vanishing and exploding gradients, which makes them very useful for tasks such as unsegmented speech processing and recognition. The performance of various architectures for training acoustic models for large-vocabulary speech recognition, namely LSTM, conventional RNN, and DNN, was compared in [30], and a distributed training of LSTM-RNNs using asynchronous stochastic gradient descent optimization was proposed [30]. The authors also showed that two-layer deep LSTM-RNNs, where each LSTM layer has a linear recurrent projection layer, can exceed state-of-the-art speech recognition performance. The deep LSTM-RNNs were extended in [31] by introducing gated direct connections between memory cells in adjacent layers; CNN and LSTM networks with very deep structures were investigated, and the performance of each method was analyzed and compared with that of DNNs. The obtained results clearly demonstrated the advantage of the CNN and LSTM techniques in terms of improving ASR accuracy for various tasks [31]. As with DNNs with deeper architectures, deep LSTM-RNNs have been successfully used for speech recognition [32,33,34].

Based on an RNN with LSTM units, [35] determined whether Mandarin-speaking individuals were afflicted with a form of dysarthria based on samples of syllable pronunciations. Using accuracy and receiver operating characteristic metrics on classification tasks, the authors evaluated several LSTM network architectures. Their results showed that the LSTM's ability to leverage temporal information within its input makes it an effective step in the pursuit of accessible dysarthria diagnoses.

Similarly, [36] proposed a machine learning-based method to automatically classify dysarthric speech as intelligible or unintelligible using LSTM neural networks. The classification and training of dysarthric speech were performed using bidirectional LSTM RNNs. The authors adopted a transfer learning approach in which the internal representations are learned by DNN-based ASR models.

Despite the availability of numerous technological solutions and fundamental approaches, the design of robust dysarthric speech recognition systems still faces numerous issues. Dysarthric speech is highly variable and uncertain and, in many situations, remains intractable to conventional formalisms and methods. It is worth mentioning that very little research has been done to give dysarthric speech recognition systems the required robustness by realizing the potential benefit of jointly optimizing front-end processing and recognition modeling.

In an attempt to provide new insights into dysarthric speech recognition, a unified approach is proposed that aims to provide robustness when the systems are confronted with imprecise and distorted dysarthric speech signals. Unlike state-of-the-art methods, this approach investigates the benefits that can be derived from DNN-based architectures by jointly optimizing the front-end processing, multiple parameters such as the framing and training configuration, and the classifier architectures. A comprehensive and holistic analysis of the dysarthric speech recognition process is carried out to provide a theoretical scheme based on the most effective components leading to a usable user interface.

1.4 Objective and contributions

In this paper, the best approaches for automatically recognizing dysarthric speech using DNN-based architectures are investigated. Our goal is to contribute to the research effort that ultimately will open the doors toward the design of personalized assistive speech systems and devices based on robust and effective speech recognition that are still not available for people who live with dysarthria. In this context, the contributions of this paper are as follows:

  (i) to propose a new design of a speaker-dependent dysarthric speech recognizer. The proposed system is an important step toward the realization of a usable speech-enabled interface for people with dysarthria;

  (ii) to assess original DNN-based architectures, providing a benchmark for DNN models on the publicly available Nemours dataset [37];

  (iii) to provide a detailed analysis that investigates the ability of acoustic modeling based on perception and hearing mechanisms to yield more robustness in the dysarthric speech recognition system. In this context, the performance of three acoustical analyzers, namely mel-frequency cepstral coefficients, mel-frequency spectral coefficients, and perceptual linear prediction coefficients, is assessed;

  (iv) to present a comprehensive investigation of the pre-processing pipeline that leads to the optimal framing and training/test methodology while reducing the risk of bias and overfitting.

The remainder of this paper is organized as follows: Section 2 describes the methodology of this work. Section 3 presents the experimental protocol used in the experiments, particularly regarding the different input features and the baseline HMM-GMM system against which the CNN and LSTM systems were compared. The obtained results and related discussions are provided in Sect. 4. Finally, Sect. 5 draws the conclusions and highlights future work.

2 Methodology

2.1 Data

The Nemours database is a collection of 814 short nonsense sentences; 74 sentences are uttered by each of the 11 American male speakers with different degrees of dysarthria. Each sentence has been transcribed at the word and phoneme levels.

To provide input data to the CNN- and LSTM-based systems, we divided each sentence's waveform into its phoneme waveforms. A set of 14,080 waveform files was created (see Table 1) and used in the experiments.

Table 1 Nemours database

Different subsets were extracted from the new phoneme set using the following splitting techniques [38]: normal subset, threefold top subset, threefold middle subset, and threefold bottom subset.

The threefold technique is widely used for corpus splitting in the context of neural network-based classification. It consists of splitting the corpus into three equal parts, each comprising 33% of the whole corpus. From the three parts, we chose a single part as the test part (upper part, middle part, or lower part) and used the rest for training. Table 2 shows the extracted subsets used throughout the experiments.
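As an illustration, the following minimal Python sketch shows how such threefold subsets could be built. It is not taken from the paper: the function name and the alphabetical ordering of the files are assumptions made purely for illustration.

```python
import numpy as np

def threefold_subsets(waveform_files):
    """Split the list of phoneme waveform files into three equal parts and
    build the three train/test configurations (top, middle, bottom).
    Hypothetical helper; the exact ordering used in the paper is not stated."""
    ordered = sorted(waveform_files)
    parts = np.array_split(ordered, 3)  # three parts of ~33% each
    subsets = {}
    for idx, name in enumerate(("top", "middle", "bottom")):
        test = list(parts[idx])
        train = [f for i, part in enumerate(parts) if i != idx for f in part]
        subsets[f"threefold_{name}"] = {"train": train, "test": test}
    return subsets
```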

Table 2 Subsets generated using different splitting techniques

2.2 Auditory-based input features

Several acoustic features can be used as input parameters of recognition systems dealing with speech disorders. In our case, we extracted the MFCC [39], MFSC [40], and the perceptual linear prediction (PLP) features [41] from the data subsets and used them to train and test the HMM-GMM baseline system as well as the CNN- and LSTM-based systems. These three acoustical analysis methods perform modeling of perception and hearing mechanisms, which is expected to provide more robustness to the dysarthric speech recognition system. The detailed results obtained by each system using different types of acoustic features are presented in Sect. 3.

2.2.1 Mel-frequency Cepstral coefficients (MFCCs)

MFCCs are frequency-domain features that have demonstrated their effectiveness in speech recognition [39]. They are the most commonly used frame-based features with the assumption that the speech is wide-sense stationary over short frames with time lengths ranging from 10 to 25 ms. The frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response. The extraction process follows the steps illustrated in Fig. 1.

Fig. 1
figure 1

MFCC and MFSC feature extraction

The input vector of each frame is composed of the first 13 static MFCCs, their first derivatives representing the velocity (∆MFCCs), and their second derivatives representing the acceleration (∆∆MFCCs), yielding a vector of 39 coefficients. We used a 25-ms Hamming window with a 10-ms offset.
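A minimal sketch of this extraction step is given below. It assumes the librosa toolkit and a 16-kHz sampling rate; the actual feature extraction tool used in this work is not specified, so this is illustrative only.

```python
import librosa
import numpy as np

def mfcc_39(wav_path, sr=16000):
    """13 static MFCCs + delta (velocity) + delta-delta (acceleration) = 39
    coefficients per frame, 25-ms Hamming window, 10-ms offset.
    librosa is an assumed toolkit, not necessarily the one used in the paper."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),       # 25-ms frame
                                hop_length=int(0.010 * sr),  # 10-ms offset
                                window="hamming")
    delta = librosa.feature.delta(mfcc)             # velocity
    delta2 = librosa.feature.delta(mfcc, order=2)   # acceleration
    return np.vstack([mfcc, delta, delta2]).T       # shape: (frames, 39)
```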

2.2.2 Mel-frequency spectral coefficients (MFSCs)

The mel-frequency spectral coefficients are obtained by stopping the MFCC procedure before the discrete cosine transform (DCT). As shown in Fig. 1, this leads to the log mel-frequency spectral coefficients. The human auditory system's response is approximated by the frequency bands that are equally spaced on the mel scale. Despite the correlation present within the MFSCs, these features can be used in a DNN-based configuration since the deep structure subsequently performs an implicit decorrelation [40]. In our study, 39 mel filterbanks were used when calculating the MFSCs for 16-kHz sampled speech waveforms.
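Under the same assumptions as the MFCC sketch above (librosa as an illustrative toolkit, 25-ms Hamming window, 10-ms offset), the MFSCs can be obtained by stopping before the DCT, i.e., by taking the logarithm of a 39-filter mel spectrogram:

```python
import librosa
import numpy as np

def mfsc_39(wav_path, sr=16000):
    """Log mel-filterbank (MFSC) features: the MFCC pipeline of Fig. 1
    stopped before the DCT, with 39 mel filters for 16-kHz speech."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=39,
                                         n_fft=int(0.025 * sr),
                                         hop_length=int(0.010 * sr),
                                         window="hamming")
    return np.log(mel + 1e-10).T   # shape: (frames, 39); small offset avoids log(0)
```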

2.2.3 Perceptual linear prediction (PLP)

The goal of the original PLP model developed by Hermansky [41] was to describe the psychophysics of human hearing more accurately in the feature extraction process. In contrast to pure linear predictive analysis of speech, PLP modifies the short-term spectrum of speech by several psychophysically based transformations. In this study, we used 39 PLP coefficients for each frame. Figure 2 illustrates the main steps for calculating the PLP coefficients.

Fig. 2
figure 2

PLP feature extraction

Table 3 shows the configuration of feature extraction used for all experiments.

Table 3 Configuration of the different methods of feature extraction used in this study

2.3 HMM-GMM: baseline system

HMM-GMM modeling has been widely used in ASR. In such a probabilistic-based approach, each HMM state is represented by a GMM output probability distribution. Mostly, diagonal covariance Gaussians are used because they are easier to train than full covariance Gaussians of the same size.

Building HMM-GMMs for smaller units of sound such as phonemes is a more efficient solution since phonemes are the basic units of speech. The HMM-GMM can be built using single phonemes (a monophone configuration) or context-dependent phoneme models in which individual phonemes are linked to their left and right contexts (a triphone configuration). Triphone models are commonly used in ASR. A three-state left-to-right HMM is generally used to build a phoneme model, with an incoming state, a middle state, an outgoing state, and dummy start and end nodes. Each state node is associated with an HMM state index, whereas the dummy nodes are not related to any acoustic event and are used to mark the two end points of a unit.

An HMM models a random process that, at each time t ∈ {1, …, T}, is in one of N hidden states belonging to the set s and, at the next time step, either remains in the current state or moves to a different hidden state in accordance with certain transition probabilities. The hidden states are described by features that appear in the observation sequence O.

We now specify the parameters that fully describe an HMM with discrete observations. The hidden state of the HMM at time t is denoted as qt, and the observation generated at time t is denoted as ot.

A discrete HMM is characterized by:

  • $${\text{Set}}\,{\text{of}}\,{\text{hidden}}\,{\text{states:}}\quad s = \, \left\{ {s_{1} ,s_{2} , \ldots ,s_{N} } \right\},$$
    (1)
  • $${\text{Observation}}\,{\text{sequence:}}\quad O = \left\{ {o_{t} ,\;t = \overline{1,T} } \right\},$$
    (2)
  • $${\text{Initial}}\,{\text{distribution:}}\quad \prod = \left\{ {\pi_{i} = p\left( {q_{1} = s_{i} } \right), i = \overline{1,N} } \right\},$$
    (3)
  • $${\text{Transition}}\,{\text{probabilities}}\,{\text{matrix}}:\quad A = \left\{ {a_{ij} = p\left( {q_{t + 1} = s_{j} \backslash q_{t} = s_{i} } \right), i,j = \overline{1,N} } \right\},$$
    (4)
  • $${\text{Set}}\,{\text{of}}\,{\text{symbols}}:\quad V = \left\{ {v_{1} , \ldots ,v_{M} } \right\},$$
    (5)

    where M is the number of observation symbols per state.

  • $${\text{Observation}}\,{\text{probabilities}}\,{\text{matrix}}:\quad B = \left\{ {b_{i} \left( m \right) = p\left( {o_{t} = v_{m} \backslash q_{t} = s_{i} } \right), i = \overline{1,N} ,m = \overline{1,M} } \right\},$$
    (6)

We denote an HMM by the triplet:

$$\lambda =(A,B, \pi ),$$
(7)

In a continuous density HMM-based system, each state is associated with a continuous probability density function. The most effective probability density used in speech recognition is the density of the Gaussian mixture (the GMM part of the scheme) defined as follows:

$$\mathcal{N}\left({O}_{t};{\mu }_{jm}; {\Phi }_{jm}\right)= \frac{1}{\sqrt{{(2\pi )}^{n} \left|{\Phi }_{jm}\right|}}{e}^{-\frac{1}{2}{({O}_{t}-{\mu }_{jm})}^{\prime}{\Phi }_{jm}^{-1}({O}_{t}-{\mu }_{jm})},$$
(8)

where \(\mathcal{N}\) denotes a multivariate Gaussian of dimension n, \({\mu }_{jm}\) is the mean vector, and \({\Phi }_{jm}\) is the covariance matrix.

In practice, a mixture of Gaussian densities is used to generate a distribution that is closest to the real distribution of the data. For state j associated with a GMM, the probability of observation \({O}_{t}\) is calculated from:

$${b}_{j}\left({O}_{t}\right)= \sum_{m=1}^{M}{C}_{jm}\mathcal{N}({O}_{t};{\mu }_{jm}; {\Phi }_{jm}),$$
(9)

where M is the number of mixture components and \({C}_{jm}\) is the weight of the mth Gaussian of state j.
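As a worked example of Eqs. (8) and (9), the following NumPy sketch evaluates the emission probability of one observation vector for a single state; diagonal covariances are assumed, as in the baseline system, and the variable names are illustrative.

```python
import numpy as np

def gaussian_pdf(o_t, mu, var):
    """Multivariate Gaussian of Eq. (8) with a diagonal covariance
    (var holds the diagonal of Phi_jm)."""
    n = o_t.shape[0]
    diff = o_t - mu
    exponent = -0.5 * np.sum(diff ** 2 / var)
    return np.exp(exponent) / np.sqrt((2 * np.pi) ** n * np.prod(var))

def gmm_emission(o_t, weights, means, variances):
    """State emission probability b_j(o_t) of Eq. (9): weighted sum of the
    M Gaussian components of state j."""
    return sum(c * gaussian_pdf(o_t, mu, var)
               for c, mu, var in zip(weights, means, variances))
```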

In HMMs, Gaussian mixture models are used to represent the emission distribution of the states. When S feature streams are used, the probability of the vector \({O}_{t}\) at each instant t in state j is given by the following equation:

$${b}_{j}\left({O}_{t}\right)=\prod_{s=1}^{S}{\left[\sum_{m=1}^{M}{C}_{jsm}\mathcal{N}({O}_{st};{\mu }_{jsm}; {\Phi }_{jsm})\right]}^{{\gamma }_{js}},$$
(10)

where M is the number of mixture components and \({C}_{jsm}\) is the weight of the mth Gaussian of state j for stream s.

The exponent \({\gamma }_{js}\) specifies the contribution of each stream to the global distribution by weighting its corresponding density. The value of \({\gamma }_{js}\) is assumed to satisfy the following constraints:

$$0\le {\gamma }_{js}\le 1 \,{\rm and }\,\sum_{s=1}^{S}{\gamma }_{js}=1,$$
(11)

The classification of a speech utterance is fundamentally based on the observation sequence probability given by the model. This value is calculated for the observation sequence and for each competing HMM. Finally, the sequence is assigned to the class corresponding to the HMM-GMM with the highest probability. To calculate the probability of the sequence O given the model \(\lambda\), the forward–backward algorithm is often used. The first part of the forward–backward algorithm calculates the forward variables, denoted by:

$${\alpha }_{t}(i)=P\left({o}_{1},\dots ,{o}_{t}{,q}_{t}={s}_{i}\backslash \lambda \right), t=\overline{1,T},i=\overline{1,N},$$
(12)

The following calculation steps lead to the determination of forward variables:

  (1)
    $$\hbox{Initialization}: {\alpha }_{1}(i)={{\pi }_{i}b}_{i}\left({o}_{1}\right),i=\overline{1,N},$$
    (13)
  (2)
    $$\hbox{Induction}: {\alpha }_{t+1}\left(i\right)={b}_{i}\left({o}_{t+1}\right)\left[\sum_{j=1}^{N}{\alpha }_{t}\left(j\right){a}_{ji}\right],i=\overline{1,N}, t=\overline{1,T-1},$$
    (14)
  (3)
    $$\hbox{Termination}: p\left(O\backslash \lambda \right)=\sum_{i=1}^{N}{\alpha }_{T}\left(i\right),$$
    (15)

The second part of the forward–backward algorithm enables us to calculate the backward variables by:

$${\beta }_{t}(i)=P\left({o}_{t+1},\dots ,{o}_{T}{\backslash q}_{t}={s}_{i},\lambda \right), t=\overline{1,T},i=\overline{1,N,}$$
(16)

The calculation of the backward variables proceeds as follows (a minimal numerical sketch of both recursions is given after these steps):

  (1)
    $$\hbox{Initialization}: {\beta }_{T}(i)=1,i=\overline{1,N},$$
    (17)
  (2)
    $$\hbox{Induction}: {\beta }_{t}\left(i\right)=\sum_{j=1}^{N}{\beta }_{t+1}\left(j\right){b}_{j}\left({o}_{t+1}\right){a}_{ij},i=\overline{1,N}, t=\overline{1,T-1},$$
    (18)
  (3)
    $$\hbox{Termination}: p\left(O\backslash \lambda \right)=\sum_{i=1}^{N}{\pi }_{i}{b}_{i}\left({o}_{1}\right){\beta }_{1}\left(i\right),$$
    (19)
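To make these recursions concrete, the following NumPy sketch implements both of them; B[t, i] stands for b_i(o_t), and all names are illustrative rather than taken from an existing toolkit.

```python
import numpy as np

def forward(pi, A, B):
    """Forward recursion, Eqs. (13)-(15). B[t, i] = b_i(o_t)."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[0]                          # initialization (Eq. 13)
    for t in range(T - 1):
        alpha[t + 1] = B[t + 1] * (alpha[t] @ A)  # induction (Eq. 14)
    return alpha, alpha[-1].sum()                 # termination (Eq. 15)

def backward(pi, A, B):
    """Backward recursion, Eqs. (17)-(19)."""
    T, N = B.shape
    beta = np.ones((T, N))                        # initialization (Eq. 17)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])    # induction (Eq. 18)
    return beta, np.sum(pi * B[0] * beta[0])      # termination (Eq. 19)
```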

In our experiments and as illustrated in Fig. 3, each phoneme is represented by a 5-state HMM model with two non-emitting states (the 1st and 5th states) and a mixture of 2, 4, 8, or 16 Gaussian distributions (the GMM component). We used three different types of coefficients (MFCCs, MFSCs or PLPs) as inputs to train and test the system.

Fig. 3
figure 3

Baseline system based on HMM-GMM

This HMM-GMM is the baseline system used for comparison with the two proposed systems, namely the CNN- and LSTM-based systems.

2.4 CNN-based system: first proposed system

The principle of a CNN is to perform convolution operations that produce filtered feature maps stacked on top of each other. A conventional CNN has the following components [17]:

  • A convolutional layer: This is the basic element of a CNN. The main purpose of this layer is to extract characteristics from the input features. The resulting output of the convolutional layer is given as follows:

    $${C(x}_{u , v})=\sum_{i=-\frac{n}{2}}^\frac{n}{2}\sum_{j=-\frac{m}{2}}^\frac{m}{2}{f}_{k}\left(i,j\right) {x}_{u-i , v-j} ,$$
    (20)

    where \({f}_{k}\) is the filter with a kernel size of \(n\times m\) applied to the input \(x\), and \(n\times m\) is the number of input connections to each CNN neuron (unit).

  • A pooling layer: This layer reduces the number of features and makes the learned functions more robust by making them more invariant to changes in scale and orientation. Certain functions are used to reduce subregions, such as taking the average or maximum value. The max-pooling function given below is used by our CNN-based system.

    $$M\left({x}_{i}\right)=\max\left\{{x}_{i+k , i+l} \,\middle|\, \left|k\right|\le \frac{m}{2},\; \left|l\right|\le \frac{m}{2},\; k,l \in {\mathbb{N}} \right\},$$
    (21)

    where x is the input and m is the size of the filter.

  • A Rectified Linear Unit (ReLU): The ReLU is an operation that replaces all negative values in the feature map with zero. The goal of ReLU is to introduce nonlinearity into our CNN-based system because most of the data we want our CNN to learn are nonlinear. Other nonlinear functions, such as tanh or sigmoid functions, can be used, but in most cases, ReLU is more efficient. Given the input x, ReLU uses the activation function R(x) = max (0, x) to calculate its output.

  • A fully connected layer: This layer takes all the neurons of the previous layer and connects them to each of its neurons. Adding a fully connected layer is a good method for learning nonlinear combinations of these features. The output of this layer is given by:

    $$F\left(x\right)= \sigma \left({ W}_{l\times k}*x\right),$$
    (22)

where σ is the activation function, k is the size of the input x, l is the number of neurons in the fully connected layer, and \({W}_{l\times k}\) is the corresponding weight matrix.

  • An output layer: The output layer is a one-hot vector representing the class of the given input vector. Therefore, its dimensionality is equal to the number of classes. In our work, we used 44 classes. The resulting class for output vector x is represented by:

    $$C\left(x\right)=\left\{i |\, \exists\, i \forall j\ne i : {x}_{j}\le {x}_{i}\right\},$$
    (23)
  • A softmax layer: The error is propagated back over a softmax layer. Let \(N\) be the dimension of the input vector. Then, the softmax calculates a mapping such that:

    $$S\left(x\right): {\mathbb{R}}^{N}\to {\left[{0,1}\right]}^{N},$$
    (24)

For each component \(1 \le j \le N\), the output is calculated as follows:

$${S(x)}_{j }= \frac{{e}^{{x}_{j}}}{\sum_{i=1}^{N}{e}^{{x}_{i}}},$$
(25)

The output of each layer is the input of the next layer. Most of the features learned by the convolution and pooling layers may already be good; however, combinations of these features can be even better.

Conventionally, a CNN consists of several iterations of this succession of layers. One of the advantages of a CNN is its relatively rapid training.

The first system proposed and implemented for this study is the CNN-based system with the architecture depicted in Fig. 4. The input of the CNN is represented by several feature maps containing either MFCC, MFSC, or PLP coefficients. Each input map is represented in two dimensions, composed of 112 frames per phoneme and 39 coefficients per frame, corresponding to 4368 features per phoneme.

Fig. 4
figure 4

Implemented CNN-based architecture

The implemented CNN consists of a single convolutional layer that outputs 64 activation maps (ReLU), each produced by a kernel filter of size 39 × 39.

At the output of the convolutional layer, we applied a max-pooling downsampling layer (2, 2) to reduce the size of the activation maps. The reduced output of the max-pooling layer (64 small activation maps) feeds two dense hidden layers of 500 neurons each with a ReLU activation function. The output of the dense hidden layers is connected to the output layer, a single vector that represents the class of the input data (in this case, 44 classes corresponding to the 44 phonemes) with a softmax activation function. The dropout regularization method (p = 0.5) is only used with the dense hidden layers.
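A minimal Keras sketch of this architecture is given below. The 'same' padding on the convolution (so that the 2 × 2 pooling remains applicable after the 39 × 39 kernel) and the choice of optimizer and loss are assumptions, since these details are not stated here.

```python
from tensorflow.keras import layers, models

def build_cnn(input_shape=(112, 39, 1), n_classes=44):
    """CNN of Fig. 4: one convolutional layer (64 maps, 39 x 39 kernel, ReLU),
    2 x 2 max pooling, two dense layers of 500 neurons with dropout 0.5, and
    a 44-class softmax output. Padding, optimizer, and loss are assumptions."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(64, (39, 39), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dense(500, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(500, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```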

2.5 LSTM-based system: second proposed system

An LSTM network is a type of RNN that can learn long-term dependencies between the time steps of sequence data. It consists of a set of recurrently connected subnetworks referred to as memory blocks [18], as illustrated in Fig. 5b. Each memory block contains memory cells to store the temporal state of the network, as well as three multiplicative gate units to control the information flow. The input gate controls the information transmitted from the input activations into the memory cells, and the output gate controls the information transmitted from the memory cells to the rest of the network. Finally, the forget gate adaptively resets the memory of the cell.

Fig. 5
figure 5

a Implemented LSTM-based architecture, b LSTM unit architecture

An LSTM network computes a mapping from an input sequence \(x =\left(x1, . . . , xT \right)\) to an output sequence \(y = (y1, . . . , yT )\) by calculating the network unit activations using the following equations iteratively from \(t =1\) to T:

$${i}_{t}=\sigma \left({W}_{ix}{x}_{t}+ {W}_{im}{m}_{t-1}+ {W}_{ic}{c}_{t-1}+ {b}_{i}\right),$$
(26)
$${c}_{t}={f}_{t}\odot {c}_{t-1}+ {i}_{t} \odot g \left({W}_{cx}{x}_{t}+ {W}_{cm}{m}_{t-1}+{b}_{c}\right),$$
(27)
$${o}_{t}=\sigma \left({W}_{ox}{x}_{t}+ {W}_{om}{m}_{t-1}+ {W}_{oc}{c}_{t}+ {b}_{o}\right),$$
(28)
$${m}_{t}={o}_{t}\odot {h(c}_{t}),$$
(29)
$${y}_{t}=\varnothing \left({W}_{ym}{m}_{t}+{b}_{y}\right),$$
(30)

where the \(W\) terms represent the weight matrices (e.g., Wix is the matrix of the weights from the input to the input gate); Wic, Wfc, and Woc are the diagonal weight matrices for the peephole connections; the \(b\) terms denote the bias vectors (bi is the input gate bias vector); \(\sigma\) is the logistic sigmoid function; it, ft, ot, and ct are, respectively, the input gate, forget gate, output gate, and cell activation vectors at step t; mt is the output of the LSTM layer; \(\odot\) is the elementwise product of the vectors; g and h are the cell input and cell output activation functions, respectively; and \(\varnothing\) is the network output activation function. The architecture of the implemented system is depicted in Fig. 5a.
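A minimal Keras sketch of the system in Fig. 5a is given below. The number of LSTM units shown is only a placeholder for the layer sizes explored later in Table 7, and the optimizer and loss are assumptions.

```python
from tensorflow.keras import layers, models

def build_lstm(n_frames=112, n_features=39, n_units=512, n_classes=44):
    """Single-LSTM-layer recognizer of Fig. 5a followed by a 44-class softmax.
    n_units is the layer size varied in Table 7 (512 is a placeholder)."""
    model = models.Sequential([
        layers.Input(shape=(n_frames, n_features)),  # one frame sequence per phoneme
        layers.LSTM(n_units),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```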

3 Evaluation setup

The evaluation and comparison experiments were carried out using the Nemours database of pathological voices [37]. We created subsets (see Sect. 2.1) that allowed us to carry out experiments and to draw conclusions regarding the effectiveness of deep learning approaches to recognize dysarthric speech.

For comparison and validation purposes, we developed a baseline system based on HMM-GMM that is similar to the system described in [42]. Both implemented CNN- and LSTM-based systems were compared against the HMM-GMM baseline system. The HTK [43] toolkit was used to build the HMM-GMM models. The TensorFlow [44] and Keras [45] tools were used to implement the CNN and LSTM systems. Different architectures using different features were investigated to find the best configurations.

Table 4 describes the hyperparameters used to obtain results on the CNN- and LSTM-based systems.

Table 4 Hyperparameters used for experiments using the CNN and LSTM configurations

3.1 Experiments and results

To train the CNN- and LSTM-based systems, we tried several corpus splitting techniques to extract different data subsets from the Nemours database. These subsets were used for training, testing, and validation.

After running the HMM-GMM baseline system as well as the CNN- and LSTM-based systems on the corpus threefold middle using MFCCs, MFSCs, and PLP features, we obtained a global accuracy for each system, as shown in Table 5.

Table 5 Global accuracy of phonemes uttered by all speakers using threefold middle corpus using MFCCs, MFSCs, and PLP features

From these results, we noticed that the best recognition rate for all speakers was obtained with the CNN-based system using PLP features and the ReLU activation function.

To find the best configuration for the test subsets, multiple experiments were carried out with different data partitions. Table 6 shows the results obtained for speaker BB after training and testing the CNN-based system of Fig. 4 and the LSTM-based system of Fig. 5a on different subsets with PLP features. The best performance was achieved with the threefold middle configuration.

Table 6 Corpus splitting influence on BB speaker recognition rate with PLP coefficients in CNN and LSTM

The accuracies were obtained with standard deviations of \({\sigma }_{CNN}= 3.53\) and \({\sigma }_{LSTM}=2.73\). The standard deviation \({\sigma }_{CNN}\) was higher than \({\sigma }_{LSTM}\), which means that the CNN results were more widely spread around their average than the LSTM results.

The LSTM-based system was trained and tested using an architecture consisting of a single LSTM layer composed of LSTM units. We noticed that increasing the number of units in the LSTM layer improved the recognition rate of dysarthric speech. Table 7 shows the recognition rates for speaker BB with respect to the LSTM layer size. These rates were obtained with a standard deviation of \(\sigma =3.23\).

Table 7 Influence of the LSTM layer size on the recognition rate of the phonemes uttered by speaker BB

3.1.1 Effect of the filter size of the pooling layer

Table 8 shows that the ideal pooling layer filter size for the implemented CNN is 2 × 2 because it yields a better recognition rate for speaker BB. The best recognition rate is 75.27% with a standard deviation of \(0.95\). These results confirm that the size of the pooling layer has a substantial impact on the performance of the CNN-based system.

Table 8 Influence of the filter size of the max-pooling layer in the CNN on the recognition rate of speaker BB

3.1.2 Effect of the kernel filter size of the convolutional layer

As shown in Table 9, the performance of the implemented dysarthric speech recognition system is better when the kernel filter size of the convolutional layer is large. These results were obtained with a standard deviation of \(\sigma =1.99\).

Table 9 Influence of the kernel filter size of the convolutional layer in the CNN on the recognition rate of speaker BB

From the results in both Tables 8 and 9, we concluded that an ideal configuration of CNN is a 2 × 2 pooling layer filter size and a large kernel filter size for the convolutional layer.

3.1.3 Effect of the CNN’s fully connected layer size

In this section, we investigate the effect of the number of neurons and layers of the fully connected layers in the CNN. Table 10 shows that CNN-based system performance was better with two fully connected 500-neuron layers than with a single fully connected layer of 1000 neurons.

Table 10 Impact of the number of neurons and layers of the fully connected layer in the CNN on the recognition rate of phonemes uttered by speaker BB

3.1.4 Effect of Hamming window size

Table 11 and Figs. 6 and 7 show the impact of the Hamming window size on the automatic recognition rate of dysarthric phonemes with the CNN-based system for speakers BB, BK, and BV. These results show that for speaker BB, who presented a low level of dysarthria severity, the ideal length of the Hamming window was 25 ms.

Table 11 Impact of the Hamming window size on the automatic recognition rate of dysarthric phonemes for speakers BB, BK, and BV using PLP coefficients on the CNN-based system
Fig. 6
figure 6

Statistical analysis of the impact of the Hamming window size on the automatic recognition rate of dysarthric phonemes for speakers BB, BK, and BV using PLP coefficients on the CNN-based system

Fig. 7
figure 7

Statistical analysis of the impact of the Hamming window size on the automatic recognition rate of dysarthric phonemes for speakers BB, BK, and BV using PLP coefficients on the CNN-based system

For the two speakers BK and BV, who presented a high level of dysarthria severity, we noticed that decreasing the size of the Hamming window (to 15 ms) yielded better performance. This is probably due to the characteristics of dysarthric speech phonemes, where abnormal phoneme durations are observed, particularly in the most severe cases.

According to the statistical analysis of the experiments in Table 11 (see Fig. 6), the values of the standard deviations are \({\sigma }_{BB}= 1.21\), \({\sigma }_{BK}= 0.78\), and \({\sigma }_{BV}= 0.49\).

In Fig. 7, we can see that the recognition rate is approximately normally distributed, since most of the observations lie within one standard deviation of the mean and all of them lie within two standard deviations.

3.1.5 Effect of acoustic features

We compared three types of features (see Sect. 2) as inputs to the CNN-based system. Each vector of input features was composed of 39 coefficients. Table 12 compares the three types of features, namely MFCC, MFSC, and PLP coefficients, used as inputs of the CNN-based system. According to the obtained results, the best recognition rate was 80.13%, obtained with the PLP coefficients for speaker FB. In the case of speaker LL, using the MFSCs led to a recognition rate of 54.16%, which represented the lowest result. For all cases, the best results were obtained when the PLP coefficients were used as input features.

Table 12 Comparison of the correct recognition rate of the CNN-based system using MFCCs, MFSCs, and PLP coefficients

Figure 8 describes the statistical analysis of the experiments used to identify the best-performing parameter among these three coefficients: MFCC, MFSC, and PLP coefficients.

Fig. 8
figure 8

Statistical analysis of the correct recognition rate of the CNN-based system using MFCCs, MFSCs, and PLP coefficients

3.1.6 Effect of activation functions

We carried out experiments using different DNN activation functions in order to investigate their impact on the performance of the training procedure. A variety of standard activation functions, namely ReLU [46], AReLU [47], and SELU [48], were evaluated. Moreover, in the context of the recent and growing interest in polynomial activation functions, we also introduced and evaluated two new functions that we call Poly1ReLU and Poly2ReLU. The description and evaluation of both the standard and the new functions are given in the following subsections.

  (a) Rectified Linear Units (ReLU)

The Rectified Linear Units activation function is a piecewise linear function that will output the input directly if it is positive; otherwise, it will output zero. It has become the default activation function for many types of neural networks because a model that uses it is easier to train and often achieves good performance [46]. The ReLU activation function is defined as:

$${\mathbf{\mathcal{R}}}\left( {x_{i} } \right) = \left\{ {\begin{array}{*{20}l} {0,} \hfill & {x_{i} < 0} \hfill \\ {x_{i} ,} \hfill & {x_{i} \ge 0} \hfill \\ \end{array} } \right. = {\text{max}}\left( {0,x_{i} } \right),$$
(31)

where \(X=\{{x}_{i}\}\) is the input of the current layer.

  (b) Attention-based Rectified Linear Units (AReLU)

The Attention-based Rectified Linear Units (AReLU) activation function [47] is given by:

$$\mathcal{F}\left({x}_{i},\alpha ,\beta \right)=\mathcal{R}\left({x}_{i}\right)+\mathcal{L}\left({x}_{i},\alpha ,\beta \right)=\left\{\begin{array}{l}C\left(\alpha \right){x}_{i}, \quad {x}_{i}<0\\ \left(1+\sigma \left(\beta \right)\right){x}_{i}, \quad { x}_{i}\ge 0\end{array}\right.,$$
(32)

where \(\mathcal{R}\left({x}_{i}\right)\) is the standard ReLU activation function; \(\mathcal{L}\left({x}_{i},\alpha ,\beta \right)\) represents the function in elementwise Sign-based attention (ELSA) with a network layer having learnable parameters α and β; \(C\left(.\right)\) clamps the input variable into [0.01, 0.99]; σ is the sigmoid function. AReLU is expected to amplify positive elements and to suppress negative ones based on the learned scaling parameters β and α.

  (c) Scaled Exponential Linear Unit (SELU)

The Scaled Exponential Linear Unit (SELU) activation function [48] is defined as:

$$SELU\left( x \right) = \lambda \left\{ {\begin{array}{*{20}l} {x,} \hfill & {x > 0} \hfill \\ {\alpha e^{x} - \alpha ,} \hfill & {x \le 0} \hfill \\ \end{array} } \right.,$$
(33)

where \(\alpha\) and \(\lambda\) are predefined constants with \(\alpha\) = 1.67 and \(\lambda\)= 1.05 in our case.

  (d) Poly1ReLU activation function

The main idea of using polynomial activation functions is to learn nonlinearity and to approximate continuous real values of input data in order to provide the best discriminative model. The proposed Poly1ReLU activation function can be considered as a first-order polynomial activation function and is given by:

$$Poly1ReLU\left( {x_{i} } \right) = \left\{ {\begin{array}{*{20}l} {0,} \hfill & {x_{i} < 0} \hfill \\ {\frac{{x_{i} }}{{MAX\left( {\left| X \right|} \right)}},} \hfill & {x_{i} \ge 0} \hfill \\ \end{array} } \right.,$$
(34)

where \(X=\{{x}_{i}\}\) is the input of the current layer and \(MAX(|X|)\) is the maximum of the absolute value of \(X\).

  (e) Poly2ReLU activation function

We extend polynomial-based activation functions by proposing a second-order polynomial activation function:

$$Poly2ReLU\left( {x_{i} } \right) = \left\{ {\begin{array}{*{20}l} {0,} \hfill & {x_{i} \,<\, 0} \hfill \\ {\frac{{\left( {x_{i}^{2} + x_{i} } \right)}}{{MAX\left( {\left| X \right|} \right)}},} \hfill & {x_{i} \,\ge\, 0} \hfill \\ \end{array} } \right.,$$
(35)

The second-order polynomial activation function is expected to be able to learn more suitable nonlinearities. However, it is important to mention that the order of the function cannot be increased indefinitely because of instability due to exploding gradients. Indeed, despite the use of normalization, which scales the inputs, the abrupt fluctuations of high-order polynomials cannot be prevented in the particular case of pathological speech.
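A minimal TensorFlow sketch of Eqs. (34) and (35) is shown below. Taking MAX(|X|) over the whole input tensor of the layer (i.e., over the current batch) is an assumption, since the reduction scope is not specified here; eps is an added safeguard against division by zero.

```python
import tensorflow as tf

def poly1_relu(x, eps=1e-8):
    """Poly1ReLU (Eq. 34): zero for negative inputs, x / MAX(|X|) otherwise.
    The maximum is taken over the whole layer input (assumption)."""
    max_abs = tf.reduce_max(tf.abs(x)) + eps
    return tf.where(x >= 0.0, x / max_abs, tf.zeros_like(x))

def poly2_relu(x, eps=1e-8):
    """Poly2ReLU (Eq. 35): zero for negative inputs, (x^2 + x) / MAX(|X|) otherwise."""
    max_abs = tf.reduce_max(tf.abs(x)) + eps
    return tf.where(x >= 0.0, (tf.square(x) + x) / max_abs, tf.zeros_like(x))

# Usage: pass the callable as the activation of a Keras layer, e.g.
# layers.Dense(500, activation=poly1_relu)
```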

Table 13 presents the results obtained by the CNN-based system using different activation functions while maintaining the best features and parameters identified in the previous sections: the threefold middle cross-validation corpus with PLP coefficients. From the obtained results, we notice that the Poly1ReLU activation function gives the best results compared to the other activation functions.

Table 13 Impact of activation functions on the accuracy of dysarthric speech recognition

3.1.7 Effect of the number of Gaussians and mixture weights on GMM-HMM system

In order to find the best configuration of the GMM-HMM system, an embedded optimization process is carried out within the procedure of probability re-estimation to determine the number of Gaussian mixtures M and the corresponding weighting coefficients c.

The methodology we used to select the number of mixtures consists of repeatedly increasing the number of components by a process of mixture splitting until the desired level of performance is reached. The process increments the number of mixtures step by step and then performs a re-estimation, which continues until the estimated probability converges. Given m mixture components, the probability density function is converted to m + 1 mixture components by cloning one of the components, where the two resulting mean vectors are perturbed by adding 0.2 standard deviations to one and subtracting the same amount from the other. The re-estimation allows a floor to be set on each individual variance of every mixture; thus, if any diagonal covariance component falls below the threshold of 0.00001, the corresponding mixture weight is set to zero. All parameters are updated using the embedded forward–backward algorithm presented in Sect. 2.3. The re-estimation procedure used in this work is provided by the HTK tools [43].

Table 14 presents the results of the GMM-HMM dysarthric speech recognition for all speakers. The increasing numbers of mixtures are presented together with the corresponding sets of optimal weights obtained using the HTK toolkit [43]. These results show that the best recognition rate for all speakers is obtained with a number of mixture components M equal to 1. This result is to some extent predictable, since the limited amount of data available to design speaker-dependent systems is not dispersed enough to require multiple mixture components. In this type of system, the mixture components have little associated training data, and therefore both the variance and the corresponding mixture weight become very small. The additional mixture components are then deleted, and only one Gaussian is retained to represent the HMM.

Table 14 Influence of the number of components M of the mixture and weighing coefficients c on the recognition rate for all dysarthric speakers

3.1.8 Comparison of the CNN- and LSTM-based systems

Table 15 compares the results of the correct recognition rate obtained by the three different recognition systems, CNN, LSTM, and HMM-GMM with optimal number of Gaussians, for each of the ten dysarthric speakers of the Nemours database.

Table 15 Comparison of the correct recognition rate of the three systems for all speakers using PLP features and the threefold middle corpus

From these results, the best recognition rate of 82.02% was obtained with the CNN-based system. The lowest rates were obtained for speaker BK, who was the most severely impaired by dysarthria. In all cases, the best results were obtained with the CNN-based system.

Figure 9 shows the results obtained by the three systems for speaker BB using PLP features for the eight experimental subsets. According to this graph, the best recognition rate was obtained with the CNN-based system using the threefold middle subset.

Fig. 9
figure 9

Comparison results of the CNN-, LSTM-, and HMM-GMM-based systems for speaker BB using PLP features on the eight subsets

Figure 10 describes the statistical analysis of the experiments used to identify the best-performing system for dysarthric speech. From Figs. 8 and 10, we notice that the recognition rate is approximately normally distributed, since the majority of the observations lie within one standard deviation of the mean and all of them lie within two standard deviations.

Fig. 10
figure 10

Statistical analysis of the correct recognition rate of the three systems for all speakers using PLP features and the threefold middle corpus

4 Discussion

The results clearly show that the CNN-based system achieves the best recognition rate compared to the HMM-GMM- and LSTM-based systems. Theoretically, the LSTM has salient features that make it a potentially useful configuration for capturing the temporal variability and dependencies linked to the impact of dysarthria on utterance duration. However, this capacity was not demonstrated by the obtained results. Indeed, the CNN architecture achieved better performance even for the most severe cases, where the prosody of speech is disturbed, the speech rate is slowed, and the timing of phonemes is abnormal. In the context of speaker-dependent applications, the CNN is thus capable of capturing these timing artifacts and can be used as a robust recognizer of dysarthric speech.

As presented in Table 16, a single convolutional layer is sufficient for the CNN; an additional convolutional layer did not significantly improve the recognition rate. As for the kernel filter size and the number of filters used in the convolutional layer, increasing them led to a better recognition rate. The filter size of the pooling layer, as mentioned in the literature, had a significant impact on the recognition performance. In addition, two fully connected layers composed of 500 neurons each were found to be effective, but no clear rule could be derived for the number of neurons in the fully connected layers.

Table 16 Configurations for CNN and LSTM models

In terms of acoustic analysis, although the PLP features were superior when used in conjunction with CNNs, the results show that conventional MFCCs remain robust for recognizing dysarthric speech.

An important finding is related to the optimal frame duration. The results show that when the severity level is high, a shorter Hamming window is recommended. This can be explained by the need to adapt to the distortions induced by impaired speech: a shorter duration allows the system to cope with the rapid changes that characterize the highest severity levels of dysarthria.

Threefold cross-validation was found to be optimal for evaluating the models' ability to generalize to unseen data during the test phase. The experiments confirm that cross-validation provides a less biased estimate with reduced overfitting compared to a simple train/test split strategy.

Since there have been several studies on selecting the optimal activation function, we also investigated different types of activation functions considering their potential for improving the performance of DNN systems. ReLU is one of the most widely used activation functions because it is easy to use in CNN training and often achieves satisfactory performance. However, the obtained results showed that, in the case of dysarthric speech recognition, it is preferable to replace it with AReLU or with the polynomial functions we propose. The use of polynomial activation functions proved effective and provided the best discriminative model.

The Nemours database is used throughout this study. The quantity of samples extracted from this collection of pathological data can be considered insufficient for training deep learning networks. To face this challenge, we opted for two strategies that were found to be effective when analyzing the outcomes. The first strategy was to design the recognition systems targeted at each dysarthric speaker, which, in contrast to speaker-independent configurations, does not require a large number of samples. The second strategy consisted of using the k-fold cross-validation method to split the training and test data, which is often recommended in the case of small samples.

One of the main characteristics of dysarthria is its extreme inter-speaker variability. The general interpretation of the promising results obtained through this study leads us to conclude that further work is required to perform a prior classification of the different varieties of dysarthria and/or their corresponding severity levels before recognition. Therefore, the acoustical and DNN parameters investigated in this study might be seen not only as the best values for recognition but also as being associated with a particular type of dysarthria or group of impairment varieties; that is, they may be interpreted as impairment-specific. Studies of other dysarthria or speech impairment varieties are needed to follow up on these issues.

5 Conclusion and future work

In this paper, several solutions have been proposed to advance the use of deep learning architectures for dysarthric speech recognition. Two DNN-based architectures, namely CNN and LSTM, have been implemented and compared with a statistical HMM-GMM-based system.

The first contribution made by this study pertains to the design of a robust speaker-dependent dysarthric speech recognition system based on a CNN model with different activation functions. The performance of the standard ReLU, AReLU, and SELU activation functions has been compared to that of the two polynomial functions, Poly1ReLU and Poly2ReLU, which we proposed as alternatives to the conventional functions.

The CNN configuration using the proposed Poly1ReLU activation function achieved the best recognition rate of 82%, obtained for a speaker with a mild severity level of dysarthria. This CNN score represents improvements of 11% and 32% over the LSTM- and HMM-GMM-based systems, respectively.

The second contribution consisted of presenting the results of a benchmark study that enhanced the understanding of the challenges encountered when creating optimal deep learning models of dysarthric speech. This comprehensive investigation, carried out on the Nemours publicly available dataset, may have an impact on selecting the best architecture of pathological speech processing applications in the future. These results have shown the ability of the speaker-dependent CNN architecture to deal with the most severe cases of dysarthria by capturing the relevant timing artifacts.

The third contribution provided new insights by investigating the ability of acoustic modeling based on perception and hearing mechanisms to yield more robustness in dysarthric speech recognition systems. The performance assessment of three acoustical analyzers using auditory perception modeling, namely the MFCCs, the MFSCs, and the PLP coefficients, was carried out. The results demonstrated the competitive performance of the PLP analysis when used in conjunction with CNNs.

The fourth contribution consisted of presenting a comprehensive investigation that advances practical knowledge related to the waveform preprocessing in order to improve the performance of deep learning dysarthric speech recognizers. This improvement was reached by optimizing the framing duration and finding the best structure of the datasets that reduces the risk of bias and overfitting.

The main challenge faced by this study remains the availability of the large amount of data needed to train deep learning algorithms so that they reach their full potential. The lack of data affects the recognition of impaired speech in more than one way. In a context where we are witnessing a paradigm shift from signal processing and expert knowledge modeling toward highly data-driven approaches, the risk linked to the lack of appropriate and representative data for training should be mitigated.

In future work, we plan to implement sequence-to-sequence architectures that will benefit from transfer learning to integrate the knowledge of pretrained networks on multiple datasets to cope with the limited availability of pathological speech data. Besides this, a data augmentation approach will be used to perform further training of impairment-specific DNN-based dysarthric speech recognizers.