1 Introduction

The need for an ASR system able to recognize multi-dialectal speech is becoming increasingly important. One way to achieve this is to determine the dialect of the input speech. Nevertheless, identifying spoken Arabic dialects is a challenging task, particularly for fine-grained ones. This is due, on the one hand, to the similarity between these dialects at the phonological, morphological, lexical, and syntactic levels and, on the other hand, to the lack of corpora for these vernaculars. In order to evaluate our approach, we used a corpus composed of ten digits spoken in different Algerian and Moroccan dialects, namely the Moroccan Berber Dialect, the Moroccan Arabic Dialect, and the Algerian Arabic Dialect, in addition to Modern Standard Arabic (MSA). This corpus was recorded by twenty-four speakers, ten times each, in the three aforementioned dialects and MSA. We prepared this dataset for building models for both dialect identification and ASR systems. The work presented in this paper is twofold: first, performing Maghrebi dialect identification, and second, showing its impact on multi-dialect ASR accuracy. Our approach to identifying the dialects is based on a set of efficient classification algorithms, namely k-Nearest Neighbours (KNN), Extra Trees (EXT), Random Forest (RF), Gradient Boosting (GB), Convolutional Neural Networks (CNN), and Support Vector Machines (SVM) (Campbell et al., 2006). This paper is organized as follows: we present an overview of both speech-based dialect identification and recognition of dialectal speech, together with the related work, in Sects. 2 and 3, respectively. In Sect. 4, we describe the corpus used to run the different experiments. In Sect. 5, we present the system architecture. Section 6 is devoted to the experiments and results regarding both dialect identification and speech recognition. The conclusion is presented in Sect. 7.

2 Speech based dialect identification

Speech-based dialect identification has attracted the interest of many researchers (Liu & Hansen, 2011; Chittaragi et al., 2018, 2019; Kakouros et al., 2020). However, very little research has been devoted to Arabic dialects. To supply more resources for Arabic and its dialects, Shon et al. (2020) provided a large dialectal Arabic corpus containing 17 dialects. For this purpose, a total of 3000 h of speech was made available for training a fine-grained Arabic dialect identification system, split into three subsets according to utterance duration (< 5 s, 5 s \(\sim \) 20 s, and > 20 s). Further, many state-of-the-art techniques were built using the aforementioned corpus. The obtained results show that the longer the duration of the utterance (in this case > 20 s), the better its identification. Regarding the same problem, and to highlight the usefulness of the X-Vector technique for the Arabic spoken dialect identification task, Hanani and Naser (2020) designed an X-Vector model using a set of relevant features (acoustic, lexical, and phonetic) extracted from VarDial 2018 and VarDial 2017, and showed that it outperforms other state-of-the-art models, for instance those based on i-vectors, Bottleneck features, and GMM-tokens.

In the case of Maghrebi dialects, Lounnas et al. (2018) carried out a set of experiments using different feature configurations to discriminate between Standard Arabic and one of the Berber dialects known as Kabyl. They showed that the combination of acoustic (Mel Frequency Cepstral Coefficients) and prosodic (melody and stress) characteristics is an appropriate representation for identifying these dialects. A further extension of this work is the one developed in Lounnas et al. (2019), where different systems were built for the purpose of identifying Persian, German, English, Arabic, and the Kabyl dialect. The results showed that despite the small size of the data, the system yielded an encouraging accuracy of 84.6%. Prosodic information characterized by rhythm and intonation was used in Bougrine et al. (2018) to model six Algerian dialects, using an SVM based on the Universal Pearson VII Kernel function (PUK). The authors found that the prosodic cue was suitable even for short utterances, with a precision of more than 69%.

In Belgacem et al. (2010), the authors developed a GMM-based model that detects similarities between nine dialects. They showed that there are no clear borders between dialects, as well as the system’s ability to distinguish between eastern and western dialects and between Gulf and North African dialects, resulting in an accuracy of 73.33%. A similar approach was presented in Nour-Eddine and Abdelkader (2015) and Lachachi and Adla (2016), which addressed the problem of Minimal Enclosing Ball reduction using two SVM-based systems, both used for data reduction. These techniques were evaluated on a Maghrebi database containing five dialects (3 Algerian, 1 Moroccan, 1 Tunisian). In Terbeh et al. (2018), the authors proposed a statistical approach based on phonetic modelling to identify the corresponding Arabic dialect for each input acoustic signal by computing the appropriate phonetic model; the latter is then compared to all referenced Arabic dialect models using cosine similarity.

3 Speech recognition for dialects

Many works have tackled the recognition of Arabic spoken digits (Wazir & Chuah, 2019; Azim et al., 2021; Touazi & Debyeche, 2017; Zerari et al., 2018). Unfortunately, little research has been done on dialectal Maghrebi speech recognition. In Satori and ElHaoussi (2014), the authors addressed the problem of speech recognition for one specific Moroccan dialect, “Tarifit Berber.” They developed an ASR system for this vernacular using the CMU-Sphinx tool. Sixty native speakers of Tarifit Berber recorded a corpus composed of 10 digits and 33 alphabet letters. The findings showed that a 16-GMM system provided a good recognition rate of 92%. Furthermore, in order to check the ability of an HMM speech recognition system to distinguish the vocal print of Moroccan dialect speakers, it was shown in Mouaz et al. (2019) that using MFCC, delta, and delta-delta features for dialectal model design is enough for a good characterization of the Moroccan dialect, yielding an accuracy of 90%. Similarly, in El Ghazi et al. (2011), the authors presented their ASR system for the Moroccan dialect, where they showed that HMM outperformed dynamic programming with an accuracy of 30%.

4 Dataset preparation

Our main goal is to present the best dialect identification system, which improves multi-dialect ASR performance. The lack of labelled data and of a standardized orthography for Arabic dialects, particularly those of the Maghrebi region, is the main reason behind the absence of works dealing with speech recognition for these vernaculars. As aforementioned, we prepared our corpus in 3 dialects in addition to MSA. One part of this corpus, covering MSA and the Moroccan Berber dialect, has already been used in Lounnas et al. (2020). The second part, concerning the Algerian Arabic dialect and the Moroccan Arabic dialect, was recorded recently by native speakers. We summarize in Table 1 the characteristics of this corpus and the recording conditions, such as the number of speakers, environmental noise, and the total number of tokens.

Table 1 The corpus’ characteristics

Taking into consideration that the two parts of the corpus were recorded under conditions that differ from one speaker to another, we had to re-sample the recorded digits to obtain a uniform sampling frequency, using Praat.
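The resampling itself was done in Praat; purely as an illustrative sketch (not the authors' procedure), the same uniform-rate conversion can be expressed with `scipy.signal.resample_poly`. The 16 kHz target rate below is an assumption, since the paper does not state the frequency used:

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

TARGET_SR = 16000  # assumed uniform rate; the paper does not state the value used

def resample_to_target(signal, orig_sr, target_sr=TARGET_SR):
    """Resample a mono signal to a uniform sampling frequency."""
    if orig_sr == target_sr:
        return signal
    g = gcd(orig_sr, target_sr)
    # polyphase resampling by the rational factor target_sr / orig_sr
    return resample_poly(signal, target_sr // g, orig_sr // g)

# Example: one second of a 440 Hz tone recorded at 44.1 kHz
t = np.arange(44100) / 44100.0
tone = np.sin(2 * np.pi * 440.0 * t)
uniform = resample_to_target(tone, orig_sr=44100)
```

Each recording would be resampled this way before segmentation, so that all utterances share one sampling frequency.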

Then, we segmented the recorded signals into small fragments. This task was performed using both Praat and Audacity.
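Praat and Audacity were used interactively for this segmentation; as a hedged sketch of the underlying idea, a crude energy-based segmenter can locate the spoken fragments automatically (the threshold and frame sizes below are illustrative assumptions, not the authors' settings):

```python
import numpy as np

def energy_segments(signal, sr, frame_ms=25, hop_ms=10, threshold=0.01):
    """Return (start, end) sample indices of regions whose short-time
    energy exceeds a fixed threshold (a crude voice-activity detector)."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = max(0, 1 + (len(signal) - frame) // hop)
    energy = np.array([np.mean(signal[i * hop:i * hop + frame] ** 2)
                       for i in range(n_frames)])
    active = energy > threshold
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((start * hop, i * hop + frame))
            start = None
    if start is not None:
        segments.append((start * hop, len(signal)))
    return segments

# Example: half a second of silence, one second of tone, half a second of silence
sr = 16000
sig = np.concatenate([np.zeros(sr // 2),
                      0.5 * np.sin(2 * np.pi * 300 * np.arange(sr) / sr),
                      np.zeros(sr // 2)])
segs = energy_segments(sig, sr)
```

In practice a manual pass in Praat/Audacity remains more reliable for isolated digits recorded in noisy conditions.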

5 System architecture

Our system is based on two components: the Dialect Identification (DI) and the Automatic Speech Recognition (ASR). Figure 1 presents an illustration of the proposed architecture. The DI block aims at identifying the dialect/language of the spoken digits. This output is very important because it allows selecting the appropriate model corresponding to the dialect of the spoken utterance.

Fig. 1 System architecture

5.1 Dialect identification component

To boost our system’s ability to recognize spoken digits, it is essential to set up a language model adaptation process. This can be done by implementing a module that identifies the dialect of the spoken digits. To implement a reliable dialect identification module, we propose two architectures: one uses acoustic-spectral information, and the other is based on spectrogram images.

5.1.1 Acoustical-based DI architecture

Our first architecture consists of four blocks, as presented in Fig. 2:

  • Input Tier:

    Speech utterances.

  • Feature Extraction:

    We extract relevant information based on acoustic and spectral cues.

  • Classification Process:

    A set of classifiers based on both machine learning and deep learning are applied to identify the dialects.

  • Output Tier:

    The dialect of the speech utterance is identified. The system performance is evaluated using F1 score.

Fig. 2 Acoustic-spectral based DI component

5.1.2 Spectrogram-based DI architecture

The input in this architecture is made of a set of spectrogram images of speech signals (Fig. 3).

  • Input Tier:

    Speech utterances.

  • Spectrogram Representation:

    The spectrogram images are used to train the model.

  • Classification Process:

    A set of classifiers based on both machine learning and deep learning are applied to identify the dialects.

  • Output Tier:

    The dialect of the speech utterance is identified. The system performance is evaluated using F1 score.

Fig. 3 Spectrogram based DI component

For this purpose, we ran several experiments in order to select the classifier that gives the best performance. More details can be found in Sect. 6.

5.2 Automatic speech recognition (ASR)

There are three necessary elements in the ASR system: the acoustic model, the n-gram language model, and the pronunciation dictionary (Fig. 4).

Fig. 4 ASR system

The extracted features are mainly based on the 13-dimensional Mel-Frequency Cepstral Coefficients (MFCC) and their delta and delta-delta vectors. In the decoding phase, the HMM decoder analyzes the features and compares them to the knowledge base. Our ASR system is based on the CMU Sphinx toolkit (Ezzine et al., 2020; Zealouk et al., 2018), where we used an HMM-GMM approach. Note that each word is represented as a sequence of phonemes, and each phoneme is modelled by a 3-state HMM, with non-emitting entry and exit states that link the HMM unit models together in the ASR system. Each emitting state consists of GMMs trained on the 39 overall MFCC-based coefficients. Figure 5 represents our HMM configuration, and Table 2 presents the dictionaries related to MSA and the three dialects.
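The stacking of delta and delta-delta vectors onto the 13 static MFCCs to obtain the 39-dimensional frames can be sketched as follows. The static MFCCs themselves come from the toolkit's front end, so a random matrix stands in for them here, and the regression formula is the standard one (an assumption, since the paper does not spell it out):

```python
import numpy as np

def deltas(feat, N=2):
    """Delta coefficients via the standard regression formula
    d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2), with edge padding."""
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feat, ((N, N), (0, 0)), mode="edge")
    d = np.zeros_like(feat, dtype=float)
    for n in range(1, N + 1):
        d += n * (padded[N + n: N + n + len(feat)] - padded[N - n: N - n + len(feat)])
    return d / denom

# Stack 13 static MFCCs with their delta and delta-delta -> 39 dims per frame
mfcc = np.random.randn(100, 13)          # placeholder static coefficients (frames x 13)
d1 = deltas(mfcc)                        # delta
d2 = deltas(d1)                          # delta-delta
features_39 = np.hstack([mfcc, d1, d2])  # shape (100, 39)
```

Each emitting HMM state then models these 39-dimensional vectors with its Gaussian mixture.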

Fig. 5 HMM structure with 3 states

Table 2 The dictionaries used in the training and testing phases

6 Experiments and results

In this section, we show the impact of dialect identification on the enhancement of spoken digit recognition for the Algerian and Moroccan dialects along with MSA. To that end, we carried out a set of experiments for both dialect identification and speech recognition. To get the best performance for dialect identification, we proposed statistical and deep learning-based approaches.

6.1 Machine learning based dialect identification

6.1.1 Scheme 1: rhythm characteristics, acoustic and spectral features

For the first scheme, we adopted acoustic and spectral features along with rhythm characteristics using the framework based on Librosa (Giannakopoulos, 2015). The 34 adopted features are the following: MFCC coefficients (13), Energy (1), Entropy of energy (1), Zero Crossing Rate (1), Spectral Centroid (1), Spectral Spread (1), Spectral Entropy (1), Spectral Rolloff (1), Chroma Vector (12), Spectral Flux (1), and Chroma Deviation (1).
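As a minimal sketch, a few of these 34 short-term features (zero crossing rate, energy, spectral centroid) can be computed per frame with textbook definitions; the framework's exact formulas may differ:

```python
import numpy as np

def short_term_features(frame, sr):
    """A few of the 34 short-term features: zero crossing rate,
    energy, and spectral centroid (common textbook definitions)."""
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    energy = np.sum(frame ** 2) / len(frame)
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
    return zcr, energy, centroid

sr = 16000
t = np.arange(400) / sr                  # one 25 ms frame
frame = np.sin(2 * np.pi * 1000.0 * t)   # 1 kHz test tone
zcr, energy, centroid = short_term_features(frame, sr)
```

For a pure 1 kHz tone, the centroid lands near 1000 Hz and the energy near 0.5, which makes such features easy to sanity-check before training.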

These features are used to train a set of classifiers, namely k-Nearest Neighbours (KNN), Support Vector Machines (SVM), Extra Trees (EXT), Random Forest (RF), and Gradient Boosting (GB) (Pedregosa et al., 2011). As we aim to select the best features, we used the default configuration of these classifiers (see Table 3). Taking into account the necessity of building a speaker-independent system, we selected, for each dialect, multiple combinations of speakers to form ten different (training, test) sets, such that four speakers (about 65% of the data) are used for training and two speakers (about 35%) for testing.
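A hedged sketch of this setup, with synthetic features standing in for the real corpus and the scikit-learn defaults left untouched, might look as follows (the speaker grouping and class-dependent means below are illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 6 speakers x 4 dialects x 10 utterances, 34 features each
n_speakers, n_dialects, n_utt, n_feat = 6, 4, 10, 34
X, y, speaker = [], [], []
for s in range(n_speakers):
    for d in range(n_dialects):
        # class-dependent mean keeps the toy task separable
        X.append(rng.normal(loc=d, scale=0.5, size=(n_utt, n_feat)))
        y += [d] * n_utt
        speaker += [s] * n_utt
X, y, speaker = np.vstack(X), np.array(y), np.array(speaker)

# Speaker-independent split: 4 speakers for training, 2 held out for testing
train = np.isin(speaker, [0, 1, 2, 3])
test = ~train

classifiers = {
    "KNN": KNeighborsClassifier(), "SVM": SVC(),
    "EXT": ExtraTreesClassifier(), "RF": RandomForestClassifier(),
    "GB": GradientBoostingClassifier(),
}
scores = {name: f1_score(y[test], clf.fit(X[train], y[train]).predict(X[test]),
                         average="macro")
          for name, clf in classifiers.items()}
```

Holding out entire speakers, rather than random utterances, is what makes the evaluation speaker-independent.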

Table 3 Default configuration used for each system
Table 4 # of different speakers combination

Tables 5, 6 and 7 report the performance using the aforementioned ten sets (listed in Table 4, where \( S_{i} \) denotes speaker number i) for 4-class, 3-class, and binary classification, respectively.

Table 5 Obtained results of our dialect identification system based on acoustic and spectral features and rhythm characteristics “4-class classification”

From Table 5, we note that, for 4-class classification over the 10 sets (Table 4), GB mostly achieved the best results. It recorded its best performance on the 6th set, with an F1-score of 85.89% and an accuracy of 93.03%. Most of the classifiers achieved their best performance with the 6th set, except SVM, which yielded its best result on the 2nd set, with an F1-score of 78.74% and an accuracy of 89.55%.

The 3-class classification gives its best results with the GB classifier, with an F1-score of 86.44% and an accuracy of 91.11% (see Table 6).

Table 6 Obtained results of our dialect identification system based on acoustic and spectral features and rhythm characteristics “3-class classification”
Table 7 Obtained results of the dialect identification system based on acoustic and spectral features and rhythm characteristics “binary classification”

For binary classification, one can notice from Table 7 that the EXT classifier outperforms the remaining classifiers when dealing with the dialect pairs (AAD-MBD) and (AAD-MAD), with F1-scores of 86.06% and 97.85%, respectively. In addition, it is ranked as the second-best classifier regarding the classification of AAD and MSA, with an F1-score of 96.06%. These findings lead us to state, intuitively, that the EXT classifier is suitable for inter-class classification (e.g., between the Algerian dialect and the Moroccan dialect). For the cases of MBD-MSA, MBD-MAD, and MSA-MAD, the best F1-scores were achieved by SVM (96.78%), GB (93.19%), and KNN (94.99%), respectively. Roughly speaking, the two best overall scores were obtained by the EXT classifier for the AAD-MAD pair, followed by the SVM for MBD-MSA.

6.1.2 Scheme 2: spectrogram

This approach consists of transforming the raw speech into the spectral domain by computing its spectrogram. A set of global characteristics, namely Hu Moments (Žunić et al., 2010; Sun et al., 2015), Haralick Texture (Sengupta et al., 2019), and Color Histogram (Sergyan, 2008), is extracted from the computed spectrograms and concatenated to form the feature vectors. The results presented in Table 8 show, in the case of 4-class classification, that the best performance is achieved by GB, with 72.11% (F1) and 86.51% (accuracy). It should be noted that this performance is lower than that obtained by the former approach (Scheme 1) by around 13.7%. The 3-class classification system, dealing with the three dialects (MBD, AAD, and MAD), gives an F1-score of 87.52% and an accuracy of 91.90% via the RF classifier (Table 9). This can be seen as an improvement of about 1% in comparison to the GB performance recorded in Scheme 1.
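As an illustration of such global descriptors, the sketch below computes a grayscale histogram and the first Hu moment invariant directly in NumPy from a synthetic spectrogram; Haralick texture features, typically obtained from a gray-level co-occurrence matrix (e.g., via mahotas), are omitted for brevity:

```python
import numpy as np

def spectrogram_image(signal, n_fft=256, hop=128):
    """Magnitude spectrogram scaled to a uint8 grayscale image."""
    frames = [signal[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(signal) - n_fft, hop)]
    mag = np.abs(np.fft.rfft(np.array(frames), axis=1)).T
    db = 20 * np.log10(mag + 1e-10)
    img = (db - db.min()) / (db.max() - db.min() + 1e-12)
    return (img * 255).astype(np.uint8)

def global_features(img, bins=32):
    """Intensity histogram plus the first Hu moment invariant."""
    hist = np.histogram(img, bins=bins, range=(0, 255))[0] / img.size
    y, x = np.mgrid[:img.shape[0], :img.shape[1]].astype(float)
    m00 = img.sum() + 1e-12
    xc, yc = (x * img).sum() / m00, (y * img).sum() / m00
    mu20 = ((x - xc) ** 2 * img).sum()
    mu02 = ((y - yc) ** 2 * img).sum()
    hu1 = (mu20 + mu02) / m00 ** 2          # eta20 + eta02
    return np.concatenate([hist, [hu1]])

sr = 16000
sig = np.sin(2 * np.pi * 500 * np.arange(sr) / sr)
feats = global_features(spectrogram_image(sig))
```

The concatenated vector (here 32 histogram bins plus one moment) is what feeds the classifiers in place of the raw image.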

Table 8 Results of the dialect identification system based on spectrogram “4-class classification”
Table 9 Results of the spectrogram based system “3 class-classification”
Table 10 Results of the spectrogram based system “binary classification”

By analyzing the results displayed in Table 10, related to the binary classification case, we note that the best performance is recorded by EXT for the pairs (AAD-MBD, AAD-MAD, and MSA-MAD) and by RF for (AAD-MAD, MBD-MSA, and MBD-MAD). Regarding intra-class classification (the Moroccan dialects), RF yielded the best performance for the MBD-MAD pair, in addition to MBD-MSA.

6.1.3 Scheme 3

In this part, we used the Librosa framework (McFee et al., 2015), which provides spectral features and rhythm characteristics. The adopted representation is composed of 193 components: MFCC coefficients (40), Mel spectrogram (128), Chroma Vector (12), Spectral Contrast (7), and Tonnetz (6).

Table 11 Results obtained for the dialect identification system (Scheme 3) “4-class classification”

As shown in Table 11, the best results are obtained by EXT, with an F1-score and accuracy of 94.46% and 88.13%, respectively. This representation, composed of 193 components, improved the F1-score by 3% compared to the Scheme 1 results.

Table 12 Results obtained for the dialect identification system (Scheme 3) “3-class classification”

Furthermore, the results for 3-class classification are presented in Table 12. The best performance is achieved by the GB classifier, with an F1-score of 89.17% and an accuracy of 93.01%, an improvement of about 2% in comparison to both Schemes 1 and 2.

Table 13 Results obtained for the dialect identification system (Scheme 3) “binary classification”

The feature representation used in Scheme 3 gives promising results for binary classification, as can be clearly seen in Table 13. We summarize the results in the following points:

  • SVM and EXT performed perfectly on four pairs of languages/dialects, namely AAD-MSA, AAD-MAD, MBD-MSA, and MSA-MAD, with an F1-score and accuracy of 100%.

  • For the AAD-MSA pair, almost all the classifiers achieved high performance.

  • KNN, SVM, and EXT yielded perfect scores for the MSA-MAD pair.

  • For the AAD-MBD pair, the best performance was achieved by the RF classifier, with an F1-score and accuracy of 95.71%.

  • An overall improvement is achieved for all six pairs of languages/dialects in comparison to Schemes 1 and 2.

6.2 Deep learning based dialect identification

This phase consists of adopting a deep neural network approach (Najafian et al., 2018). In Experiment 1, we trained a Convolutional Neural Network (CNN) classifier on the 193 Librosa-based features. The parameters used in the CNN architecture are reported in Table 14. In Experiment 2, we applied a transfer learning approach by retraining two Resnet models, Resnet50 and Resnet101, using spectrograms as features.
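For illustration, a small 1-D convolutional network over the 193-dimensional vector can be sketched in PyTorch; this is an assumed architecture for exposition only, not the configuration of Table 14:

```python
import torch
import torch.nn as nn

class DialectCNN(nn.Module):
    """Illustrative 1-D CNN over a 193-dim feature vector (assumed layout)."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, n_classes),       # one logit per dialect/language
        )

    def forward(self, x):                   # x: (batch, 193)
        return self.net(x.unsqueeze(1))     # add a channel dimension

model = DialectCNN()
logits = model(torch.randn(8, 193))         # a batch of 8 feature vectors
```

Training then minimizes a cross-entropy loss over the four dialect/language classes.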

Table 14 Our best CNN configuration

6.2.1 Experiment 1: Librosa + CNN

The results obtained for 4-class and 3-class classification are presented in Tables 15 and 16, respectively. For 4-class classification, we notice an F1-score improvement of around 7%, 23%, and 10% compared to Scheme 3, Scheme 2, and Scheme 1 (baseline), respectively. However, the performance of 3-class classification decreased in comparison to the three aforementioned schemes by about 22%, 20%, and 19%, respectively.

Table 15 Performance of the system based on Librosa+CNN for 4-class classification
Table 16 Performance of the system based on Librosa+CNN for 3-class classification

From the results of the binary classification task presented in Table 17, we note the following points:

  • The F1-score obtained with the CNN architecture reaches 100% for the three pairs: AAD-MSA, AAD-MAD, and MSA-MAD.

  • For AAD-MBD, CNN outperforms the first and second schemes. However, the best performing technique is Scheme 3.

  • For MBD-MSA, the first and third schemes outperform CNN.

  • For MBD-MAD pair, the CNN performance is the worst compared to all three schemes.

Table 17 Performance of the system based on Librosa + CNN for binary classification

6.2.2 Experiment 2: spectrogram + Resnet + CNN

In this experiment, we tackled the 4-class classification problem by retraining the last layer of Resnet50 and Resnet101 (He et al., 2016). It should be noted that for both Resnet architectures we used the same configuration, as explained in Table 18; the only exception is the number of layers per model. Table 19 clearly shows the degraded performance compared to Experiment 1, and that Resnet50 slightly outperforms Resnet101.

Table 18 Our best ResNet configuration
Table 19 Performance of the system based on spectrogram + Resnet50/101 + CNN for 4-class classification

In contrast, Table 20 shows a slight improvement for 3-class classification compared to 4-class classification.

Table 20 Performance of the system based on spectrogram + Resnet50/101 + CNN for 3-class classification

As can be noticed in Table 21, the accuracy achieved for binary classification ranges from 41.78 to 89.28% (Resnet50) and from 37.85 to 88.92% (Resnet101). The best results were recorded for the three pairs AAD-MBD, MBD-MAD, and AAD-MAD. Overall, Resnet101 performance is, in most cases, better than that of Resnet50, except for the pairs MBD-MSA and MSA-MAD.

Table 21 Performance of the system based on spectrogram + Resnet50/101 + CNN for binary classification

6.3 Multilingual ASR baseline system

Fig. 6 Speech recognition rates with different GMM configurations

In order to recognize the first ten digits spoken in MSA, MBD, MAD, and AAD, several experiments, with 3 HMM states and different Gaussian Mixture Models (4, 8, and 16 Gaussians), have been carried out.

On the one hand, we implemented four independent recognition engines for MSA, MBD, MAD, and AAD, respectively.

The best accuracy is obtained by using 3 HMM states and 4 GMMs, as shown in Fig. 6. On the other hand, we designed multilingual ASR baseline engines. Three ASR configurations have been considered to recognize, first, MAD and AAD jointly (mix-sys-1), second, MAD, AAD, and MBD (mix-sys-2), and third, MAD, AAD, and MBD, in addition to MSA (mix-sys-3). Figure 7 presents the recognition rates of the three configurations with different numbers of Gaussians. The best recognition rates are 58.8%, 56.7%, and 49.7% for mix-sys-1, mix-sys-2, and mix-sys-3, respectively, all obtained with 4 GMMs. This is probably due to the small amount of available data. The recognition rates drop dramatically as the number of dialects trained jointly increases.

To improve ASR systems’ accuracy, we integrated the language identification component, which identifies the speaker’s language/dialect before the speech recognition process. This will be detailed in the next section.

Fig. 7 The accuracy of the multilingual ASR baseline systems (mix-sys-1, mix-sys-2, and mix-sys-3)

6.4 The multilingual ASR system

Our proposed system is a combination of Automatic Speech Recognition and Language/Dialect Identification, able to switch between the four independent recognizers mentioned in Sect. 6.3 (Fig. 6). It selects the suitable ASR system to recognize the utterance spoken in the particular language/dialect identified and provided by the DI module (Fig. 1).

As the accuracy of the three ASR engines corresponding to the configurations mix-sys-1, mix-sys-2, and mix-sys-3 was unsatisfactory, as shown in Fig. 7, we added the dialect identification component, performing binary, 3-class, or 4-class classification according to the number of dialects considered in each of the three configurations (see Table 22). We notice a significant improvement achieved by our proposed multilingual system using 3 HMM states and 4 GMMs (see Fig. 8 and Table 22) compared to the baseline one.
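The switching logic can be sketched in a few lines; the classifier and per-dialect recognizers below are hypothetical placeholders for the components described above:

```python
# Minimal sketch of the combined pipeline: the DI module picks the dialect,
# then the matching single-dialect recognizer decodes the utterance.
def recognize(utterance, di_classifier, recognizers):
    """Route an utterance to the ASR engine of its identified dialect."""
    dialect = di_classifier(utterance)       # e.g. "MSA", "MBD", "MAD", "AAD"
    return recognizers[dialect](utterance)

# Toy stand-ins for the four engines and the DI module (hypothetical)
recognizers = {d: (lambda u, d=d: f"{d}:{u}") for d in ("MSA", "MBD", "MAD", "AAD")}
di_classifier = lambda u: "AAD" if u.startswith("aad") else "MSA"
result = recognize("aad_digit_5", di_classifier, recognizers)
```

Because each engine is trained on a single dialect, a correct DI decision turns the hard multi-dialect decoding problem into the much easier mono-dialect one, which is where the improvement in Table 22 comes from.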

Fig. 8 The accuracy of our proposed multilingual ASR system

Table 22 Accuracy of our proposed multilingual ASR system (Fig. 8) compared to the baseline one (Fig. 7) using the best DI system

7 Conclusion

In this paper, we presented a set of experiments aiming to improve spoken digit recognition by adding a language/dialect identification component to a standard ASR system. We showed that our proposed system is useful for such a task when dealing with Maghrebi vernaculars, which are considered under-resourced languages. We used different approaches to identify these dialects. The best performance for 4-class classification (AAD, MAD, MBD, MSA) was achieved using the third scheme, based on Librosa (193 components), to feed the CNN model. The machine learning based classifiers (SVM, EXT, KNN, RF, GB) achieved the best performance, either with the Librosa acoustic features or with the spectrogram, when dealing with the three dialects (AAD, MAD, MBD). Overall, for both binary and multi-class classification of the dialects, the best scheme is Librosa + CNN, which yielded an accuracy of 100% in some cases, achieved by selecting the appropriate configuration of the CNN model. The second-best performance was achieved by the system based on Librosa features with KNN, SVM, EXT, RF, and GB. Using the global features (Hu Moments, Haralick Texture, and Color Histogram) extracted from spectrogram images, these classifiers outperform the Resnet50/101 models that used the spectrogram images directly. The latter models are less efficient because of the small number of images.

Our proposed multilingual ASR system has successfully improved the recognition rate of digits spoken in low-resourced dialects from the Maghreb region. In our future research, we will focus on expanding our corpus to cover more dialects.