1 Introduction

The need for an ASR system able to recognize multi-dialectal speech is becoming increasingly important. One way to achieve this is to determine the dialect of the input speech. Nevertheless, identifying spoken Arabic dialects is a challenging task, particularly for fine-grained ones. This is due, on the one hand, to the similarity between these dialects at the phonological, morphological, lexical, and syntactic levels and, on the other hand, to the lack of corpora for these vernaculars. In order to evaluate our approach, we used a corpus composed of ten digits spoken in different Algerian and Moroccan dialects, namely the Moroccan Berber Dialect, the Moroccan Arabic Dialect, and the Algerian Arabic Dialect, in addition to Modern Standard Arabic (MSA). This corpus was recorded by twenty-four speakers, ten times each, in the three aforementioned dialects and MSA. We prepared this dataset for building models for both dialect identification and ASR systems. The work presented in this paper is twofold: first, performing Maghrebi dialect identification, and second, showing its impact on multi-dialect ASR accuracy. Our approach to identifying the dialects is based on a set of efficient classification algorithms, namely k-Nearest Neighbours (KNN), Extra Trees (EXT), Random Forest (RF), Gradient Boosting (GB), Convolutional Neural Networks (CNN), and Support Vector Machines (SVM) (Campbell et al., 2006). This paper is organized as follows: we present an overview of both speech-based dialect identification and recognition of dialectal speech, together with the related work, in Sects. 2 and 3, respectively. In Sect. 4, we describe the corpus used to run the different experiments. In Sect. 5, we present the system architecture. Section 6 is devoted to the experiments and results regarding both dialect identification and speech recognition. The conclusion is presented in Sect. 7.

2 Speech based dialect identification

Speech-based dialect identification has attracted the interest of many researchers (Liu & Hansen, 2011; Chittaragi et al., 2018, 2019; Kakouros et al., 2020). However, very little research has been devoted to Arabic dialects. To supply more resources for Arabic and its dialects, Shon et al. (2020) provided a large dialectal Arabic corpus containing 17 dialects. For this purpose, a total of 3000 h of speech was made available for training a fine-grained Arabic dialect identification system, split into three subsets according to utterance duration (< 5 s, 5 s \(\sim \) 20 s, and > 20 s). Further, many state-of-the-art techniques were built using the aforementioned corpus. The obtained results show that the longer the duration of the utterance (in this case > 20 s), the better its identification. Regarding the same problem, and to highlight the usefulness of the X-Vector technique for the Arabic spoken dialect identification task, Hanani and Naser (2020) designed an X-Vector model using a set of relevant features (acoustic, lexical, and phonetic) extracted from VarDial 2018 and VarDial 2017, and showed that it outperforms other state-of-the-art models, for instance those based on i-vectors, Bottleneck features, and GMM-tokens.

In the case of Maghrebi dialects, Lounnas et al. (2018) carried out a set of experiments using different feature configurations to discriminate between Standard Arabic and one of the Berber dialects known as Kabyl. They showed that the combination of acoustic (Mel Frequency Cepstral Coefficients) and prosodic (melody and stress) characteristics is an appropriate representation for identifying these dialects. A further extension of this work is the one developed in Lounnas et al. (2019), where different systems were built for the purpose of identifying Persian, German, English, Arabic, and the Kabyl dialect. The results showed that despite the small size of the data, the system yielded an encouraging accuracy of 84.6%. Prosodic information characterized by rhythm and intonation was used in Bougrine et al. (2018) to model six Algerian dialects, using an SVM based on the Universal Pearson VII Kernel function (PUK). The authors found that the prosodic cue was suitable even for short utterances, with a precision of more than 69%.

In Belgacem et al. (2010), the authors developed a GMM-based model that detects similarities between nine dialects. They showed that there are no clear borders between dialects, as well as the system’s ability to distinguish between eastern and western dialects and between Gulf and North African dialects, resulting in an accuracy of 73.33%. A similar approach was presented in Nour-Eddine and Abdelkader (2015) and Lachachi and Adla (2016), which addressed the problem of Minimal Enclosing Ball reduction using two SVM-based systems, both used for data reduction. These techniques were evaluated on a Maghrebi database containing five dialects (3 Algerian, 1 Moroccan, 1 Tunisian). In Terbeh et al. (2018), the authors proposed a statistical approach based on phonetic modelling to identify the corresponding Arabic dialect for each input acoustic signal by computing the appropriate phonetic model; the latter is then compared to all referenced Arabic dialect models using cosine similarity.

3 Speech recognition for dialects

Many works have tackled the recognition of Arabic spoken digits (Wazir & Chuah, 2019; Azim et al., 2021; Touazi & Debyeche, 2017; Zerari et al., 2018). Unfortunately, little research has been done on dialectal Maghrebi speech recognition. In Satori and ElHaoussi (2014), the authors addressed the problem of speech recognition for one specific Moroccan dialect, “Tarifit Berber.” They developed an ASR system for this vernacular using the CMU-Sphinx tool. Sixty native speakers of Tarifit Berber recorded a corpus composed of 10 digits and 33 alphabet letters. The findings showed that a 16-GMM system provided a good recognition rate of 92%. Furthermore, in order to check the ability of an HMM speech recognition system to distinguish the vocal print of Moroccan dialect speakers, it was shown in Mouaz et al. (2019) that using MFCC, delta, and delta-delta features for dialectal model design is enough for a good characterization of the Moroccan dialect, yielding an accuracy of 90%. Similarly, in El Ghazi et al. (2011), the authors presented their ASR system for the Moroccan dialect, where they showed that HMM outperformed dynamic programming with an accuracy of 30%.

4 Dataset preparation

Our main goal is to present the best dialect identification system, which improves multi-dialect ASR performance. The lack of labelled data and of a standardized orthography for Arabic dialects, particularly those of the Maghrebi region, is the main reason behind the absence of works dealing with speech recognition for these vernaculars. As aforementioned, we prepared our corpus in 3 dialects in addition to MSA. One part of this corpus, covering MSA and the Moroccan Berber dialect, has already been used in Lounnas et al. (2020). The second part, concerning the Algerian Arabic dialect and the Moroccan Arabic dialect, was recorded recently by native speakers. We summarize in Table 1 the characteristics of this corpus and the recording conditions, such as the number of speakers, environmental noise, and the total number of tokens.

Table 1 The corpus’ characteristics

Taking into consideration that the two parts of the corpus were recorded under conditions that differ from one speaker to another, we had to re-sample the recorded digits to obtain a uniform sampling frequency, using Praat.
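The resampling itself was done in Praat; purely as an illustrative sketch (not the authors' procedure), the same uniform-rate conversion can be expressed with `scipy.signal.resample_poly`. The 16 kHz target rate below is an assumption, since the paper does not state the frequency used:

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

TARGET_SR = 16000  # assumed uniform rate; the paper does not state the value used

def resample_to_target(signal, orig_sr, target_sr=TARGET_SR):
    """Resample a mono signal to a uniform sampling frequency."""
    if orig_sr == target_sr:
        return signal
    g = gcd(orig_sr, target_sr)
    # polyphase resampling by the rational factor target_sr / orig_sr
    return resample_poly(signal, target_sr // g, orig_sr // g)

# Example: one second of a 440 Hz tone recorded at 44.1 kHz
t = np.arange(44100) / 44100.0
tone = np.sin(2 * np.pi * 440.0 * t)
uniform = resample_to_target(tone, orig_sr=44100)
```

Each recording would be resampled this way before segmentation, so that all utterances share one sampling frequency.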

Then, we segmented the recorded signals into small fragments. This task was performed using both Praat and Audacity.
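Praat and Audacity were used interactively for this segmentation; as a hedged sketch of the underlying idea, a crude energy-based segmenter can locate the spoken fragments automatically (the threshold and frame sizes below are illustrative assumptions, not the authors' settings):

```python
import numpy as np

def energy_segments(signal, sr, frame_ms=25, hop_ms=10, threshold=0.01):
    """Return (start, end) sample indices of regions whose short-time
    energy exceeds a fixed threshold (a crude voice-activity detector)."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = max(0, 1 + (len(signal) - frame) // hop)
    energy = np.array([np.mean(signal[i * hop:i * hop + frame] ** 2)
                       for i in range(n_frames)])
    active = energy > threshold
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((start * hop, i * hop + frame))
            start = None
    if start is not None:
        segments.append((start * hop, len(signal)))
    return segments

# Example: half a second of silence, one second of tone, half a second of silence
sr = 16000
sig = np.concatenate([np.zeros(sr // 2),
                      0.5 * np.sin(2 * np.pi * 300 * np.arange(sr) / sr),
                      np.zeros(sr // 2)])
segs = energy_segments(sig, sr)
```

In practice a manual pass in Praat/Audacity remains more reliable for isolated digits recorded in noisy conditions.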

5 System architecture

Our system is based on two components: the Dialect Identification (DI) and the Automatic Speech Recognition (ASR). Figure 1 presents an illustration of the proposed architecture. The DI block aims at identifying the dialect/language of the spoken digits. This output is very important because it allows selecting the appropriate model corresponding to the dialect of the spoken utterance.

Fig. 1 System architecture

5.1 Dialect identification component

To boost our system’s ability to recognize spoken digits, it is essential to set up a language model adaptation process. This can be done by implementing a module that identifies the dialect of the spoken digits. To implement a reliable dialect identification module, we propose two architectures: one uses acoustic-spectral information, and the other is based on spectrogram images.

5.1.1 Acoustical-based DI architecture

Our first architecture consists of four blocks, as presented in Fig. 2:

  • Input Tier:

    Speech utterances.

  • Feature Extraction:

    We extract relevant information based on acoustic and spectral cues.

  • Classification Process:

    A set of classifiers based on both machine learning and deep learning are applied to identify the dialects.

  • Output Tier:

    The dialect of the speech utterance is identified. The system performance is evaluated using F1 score.

Fig. 2 Acoustic-spectral based DI component

5.1.2 Spectrogram-based DI architecture

The input in this architecture is made of a set of spectrogram images of speech signals (Fig. 3).

  • Input Tier:

    Speech utterances.

  • Spectrogram Representation:

    The spectrogram images are used to train the model.

  • Classification Process:

    A set of classifiers based on both machine learning and deep learning are applied to identify the dialects.

  • Output Tier:

    The dialect of the speech utterance is identified. The system performance is evaluated using F1 score.

Fig. 3 Spectrogram based DI component

For this purpose, we ran several experiments in order to select the classifier that gives the best performance. More details can be found in Sect. 6.

5.2 Automatic speech recognition (ASR)

There are three necessary elements in the ASR system: the acoustic model, the n-gram language model, and the pronunciation dictionary (Fig. 4).

Fig. 4 ASR system

The extracted features are mainly based on the 13-dimensional Mel-Frequency Cepstral Coefficients (MFCC) and their delta and delta-delta vectors. In the decoding phase, the HMM decoder analyzes the features and compares them to the knowledge base. Our ASR system is based on the CMU Sphinx toolkit (Ezzine et al., 2020; Zealouk et al., 2018), where we used an HMM-GMM approach. Note that each word is represented as a sequence of phonemes, and each phoneme is modelled by a 3-state HMM, with non-emitting entry and exit states that link the HMM unit models together in the ASR system. Each emitting state consists of GMMs trained on the 39 overall MFCC-based coefficients. Figure 5 represents our HMM configuration, and Table 2 presents the dictionaries related to MSA and the three dialects.
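The stacking of delta and delta-delta vectors onto the 13 static MFCCs to obtain the 39-dimensional frames can be sketched as follows. The static MFCCs themselves come from the toolkit's front end, so a random matrix stands in for them here, and the regression formula is the standard one (an assumption, since the paper does not spell it out):

```python
import numpy as np

def deltas(feat, N=2):
    """Delta coefficients via the standard regression formula
    d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2), with edge padding."""
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feat, ((N, N), (0, 0)), mode="edge")
    d = np.zeros_like(feat, dtype=float)
    for n in range(1, N + 1):
        d += n * (padded[N + n: N + n + len(feat)] - padded[N - n: N - n + len(feat)])
    return d / denom

# Stack 13 static MFCCs with their delta and delta-delta -> 39 dims per frame
mfcc = np.random.randn(100, 13)          # placeholder static coefficients (frames x 13)
d1 = deltas(mfcc)                        # delta
d2 = deltas(d1)                          # delta-delta
features_39 = np.hstack([mfcc, d1, d2])  # shape (100, 39)
```

Each emitting HMM state then models these 39-dimensional vectors with its Gaussian mixture.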

Fig. 5 HMM structure with 3 states

Table 2 The dictionaries used in the training and testing phases

6 Experiments and results

In this section, we show the impact of dialect identification on the enhancement of spoken digit recognition for the Algerian and Moroccan dialects along with MSA. To that end, we carried out a set of experiments for both dialect identification and speech recognition. To get the best performance for dialect identification, we proposed statistical and deep learning-based approaches.

6.1 Machine learning based dialect identification

6.1.1 Scheme 1: rhythm characteristics, acoustic and spectral features

For the first scheme, we adopted acoustic and spectral features along with rhythm characteristics using the framework based on Librosa (Giannakopoulos, 2015). The 34 adopted features are the following: MFCC coefficients (13), Energy (1), Entropy of energy (1), Zero Crossing Rate (1), Spectral Centroid (1), Spectral Spread (1), Spectral Entropy (1), Spectral Rolloff (1), Chroma Vector (12), Spectral Flux (1), and Chroma Deviation (1).
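As a minimal sketch, a few of these 34 short-term features (zero crossing rate, energy, spectral centroid) can be computed per frame with textbook definitions; the framework's exact formulas may differ:

```python
import numpy as np

def short_term_features(frame, sr):
    """A few of the 34 short-term features: zero crossing rate,
    energy, and spectral centroid (common textbook definitions)."""
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    energy = np.sum(frame ** 2) / len(frame)
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
    return zcr, energy, centroid

sr = 16000
t = np.arange(400) / sr                  # one 25 ms frame
frame = np.sin(2 * np.pi * 1000.0 * t)   # 1 kHz test tone
zcr, energy, centroid = short_term_features(frame, sr)
```

For a pure 1 kHz tone, the centroid lands near 1000 Hz and the energy near 0.5, which makes such features easy to sanity-check before training.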

These features are used to train a set of classifiers, namely k-Nearest Neighbours (KNN), Support Vector Machines (SVM), Extra Trees (EXT), Random Forest (RF), and Gradient Boosting (GB) (Pedregosa et al., 2011). As we aim to select the best features, we used the default configuration of these classifiers (see Table 3). Taking into account the necessity of building a speaker-independent system, we selected, for each dialect, multiple combinations of speakers to form ten different (training, test) sets, such that four speakers (about 65% of the data) are used for training and two speakers (about 35%) for testing.
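A hedged sketch of this setup, with synthetic features standing in for the real corpus and the scikit-learn defaults left untouched, might look as follows (the speaker grouping and class-dependent means below are illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 6 speakers x 4 dialects x 10 utterances, 34 features each
n_speakers, n_dialects, n_utt, n_feat = 6, 4, 10, 34
X, y, speaker = [], [], []
for s in range(n_speakers):
    for d in range(n_dialects):
        # class-dependent mean keeps the toy task separable
        X.append(rng.normal(loc=d, scale=0.5, size=(n_utt, n_feat)))
        y += [d] * n_utt
        speaker += [s] * n_utt
X, y, speaker = np.vstack(X), np.array(y), np.array(speaker)

# Speaker-independent split: 4 speakers for training, 2 held out for testing
train = np.isin(speaker, [0, 1, 2, 3])
test = ~train

classifiers = {
    "KNN": KNeighborsClassifier(), "SVM": SVC(),
    "EXT": ExtraTreesClassifier(), "RF": RandomForestClassifier(),
    "GB": GradientBoostingClassifier(),
}
scores = {name: f1_score(y[test], clf.fit(X[train], y[train]).predict(X[test]),
                         average="macro")
          for name, clf in classifiers.items()}
```

Holding out entire speakers, rather than random utterances, is what makes the evaluation speaker-independent.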

Table 3 Default configuration used for each system
Table 4 # of different speakers combination

Tables 5, 6 and 7 report the performance using the aforementioned ten sets (listed in Table 4, where \( S_{i} \) denotes speaker number i) for 4-class, 3-class, and binary classification, respectively.

Table 5 Obtained results of our dialect identification system based on acoustic and spectral features and rhythm characteristics “4-class classification”

From Table 5, we note that, for 4-class classification over the 10 sets (Table 4), GB mostly achieved the best results. It recorded its best performance on the 6th set, with an F1-score of 85.89% and an accuracy of 93.03%. Most of the classifiers achieved their best performance with the 6th set, except SVM, which yielded its best result on the 2nd set, with an F1-score of 78.74% and an accuracy of 89.55%.

The 3-class classification gives its best results with the GB classifier, with an F1-score of 86.44% and an accuracy of 91.11% (see Table 6).

Table 6 Obtained results of our dialect identification system based on acoustic and spectral features and rhythm characteristics “3-class classification”
Table 7 Obtained results of the dialect identification system based on acoustic and spectral features and rhythm characteristics “binary classification”

For binary classification, one can notice from Table 7 that the EXT classifier outperforms the remaining classifiers when dealing with the dialect pairs (AAD-MBD) and (AAD-MAD), with F1-scores of 86.06% and 97.85%, respectively. In addition, it is ranked as the second-best classifier regarding the classification of AAD and MSA, with an F1-score of 96.06%. These findings lead us to state, intuitively, that the EXT classifier is suitable for inter-class classification (e.g., between the Algerian dialect and the Moroccan dialect). For the cases of MBD-MSA, MBD-MAD, and MSA-MAD, the best F1-scores were achieved by SVM (96.78%), GB (93.19%), and KNN (94.99%), respectively. Roughly speaking, the two best overall scores were obtained by the EXT classifier for the AAD-MAD pair, followed by the SVM for MBD-MSA.

6.1.2 Scheme 2: spectrogram

This approach consists of transforming the raw speech into the spectral domain by computing its spectrogram. A set of global characteristics, namely Hu Moments (Žunić et al., 2010; Sun et al., 2015), Haralick Texture (Sengupta et al., 2019), and Color Histogram (Sergyan, 2008), is extracted from the computed spectrograms and concatenated to form the feature vectors. The results presented in Table 8 show, in the case of 4-class classification, that the best performance is achieved by GB, with 72.11% (F1) and 86.51% (accuracy). It should be noted that this performance is lower than that obtained by the former approach (Scheme 1) by around 13.7%. The 3-class classification system, dealing with the three dialects (MBD, AAD, and MAD), gives an F1-score of 87.52% and an accuracy of 91.90% via the RF classifier (Table 9). This can be seen as an improvement of about 1% in comparison to the GB performance recorded in Scheme 1.
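As an illustration of such global descriptors, the sketch below computes a grayscale histogram and the first Hu moment invariant directly in NumPy from a synthetic spectrogram; Haralick texture features, typically obtained from a gray-level co-occurrence matrix (e.g., via mahotas), are omitted for brevity:

```python
import numpy as np

def spectrogram_image(signal, n_fft=256, hop=128):
    """Magnitude spectrogram scaled to a uint8 grayscale image."""
    frames = [signal[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(signal) - n_fft, hop)]
    mag = np.abs(np.fft.rfft(np.array(frames), axis=1)).T
    db = 20 * np.log10(mag + 1e-10)
    img = (db - db.min()) / (db.max() - db.min() + 1e-12)
    return (img * 255).astype(np.uint8)

def global_features(img, bins=32):
    """Intensity histogram plus the first Hu moment invariant."""
    hist = np.histogram(img, bins=bins, range=(0, 255))[0] / img.size
    y, x = np.mgrid[:img.shape[0], :img.shape[1]].astype(float)
    m00 = img.sum() + 1e-12
    xc, yc = (x * img).sum() / m00, (y * img).sum() / m00
    mu20 = ((x - xc) ** 2 * img).sum()
    mu02 = ((y - yc) ** 2 * img).sum()
    hu1 = (mu20 + mu02) / m00 ** 2          # eta20 + eta02
    return np.concatenate([hist, [hu1]])

sr = 16000
sig = np.sin(2 * np.pi * 500 * np.arange(sr) / sr)
feats = global_features(spectrogram_image(sig))
```

The concatenated vector (here 32 histogram bins plus one moment) is what feeds the classifiers in place of the raw image.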

Table 8 Results of the dialect identification system based on spectrogram “4-class classification”
Table 9 Results of the spectrogram based system “3 class-classification”
Table 10 Results of the spectrogram based system “binary classification”

By analyzing the results displayed in Table 10, related to the binary classification case, we note that the best performance is recorded by EXT for the pairs (AAD-MBD, AAD-MAD, and MSA-MAD) and by RF for (AAD-MAD, MBD-MSA, and MBD-MAD). Regarding intra-class classification (the Moroccan dialects), RF yielded the best performance for the MBD-MAD pair, in addition to MBD-MSA.

6.1.3 Scheme 3

In this part, we used the Librosa framework (McFee et al., 2015), which provides spectral features and rhythm characteristics. The adopted representation is composed of 193 components: MFCC coefficients (40), Mel spectrogram (128), Chroma Vector (12), Spectral Contrast (7), and Tonnetz (6).

Table 11 Results obtained for the dialect identification system (Scheme 3) “4-class classification”

As shown in Table 11, the best results are obtained by EXT, with an F1-score and accuracy of 94.46% and 88.13%, respectively. This representation, composed of 193 components, improved the F1-score by 3% compared to the Scheme 1 results.

Table 12 Results obtained for the dialect identification system (Scheme 3) “3-class classification”

Furthermore, the results for 3-class classification are presented in Table 12. The best performance is achieved by the GB classifier, with an F1-score of 89.17% and an accuracy of 93.01%, an improvement of about 2% in comparison to both Schemes 1 and 2.

Table 13 Results obtained for the dialect identification system (Scheme 3) “binary classification”

The feature representation used in Scheme 3 gives promising results for binary classification, as can be clearly seen in Table 13. We summarize the results in the following points:

  • SVM and EXT performed perfectly on four pairs of languages/dialects, namely AAD-MSA, AAD-MAD, MBD-MSA, and MSA-MAD, with an F1-score and accuracy of 100%.

  • For the AAD-MSA pair, almost all the classifiers achieved high performance.

  • KNN, SVM, and EXT yielded perfect scores for the MSA-MAD pair.

  • For the AAD-MBD pair, the best performance was achieved by the RF classifier, with an F1-score and accuracy of 95.71%.

  • An overall improvement is achieved for all six pairs of languages/dialects in comparison to Schemes 1 and 2.

6.2 Deep learning based dialect identification

This phase consists of adopting a deep neural network approach (Najafian et al., 2018). In Experiment 1, we trained a Convolutional Neural Network (CNN) classifier on the 193 Librosa-based features. The parameters used in the CNN architecture are reported in Table 14. In Experiment 2, we applied a transfer learning approach by retraining two Resnet models, Resnet50 and Resnet101, using spectrograms as features.
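For illustration, a small 1-D convolutional network over the 193-dimensional vector can be sketched in PyTorch; this is an assumed architecture for exposition only, not the configuration of Table 14:

```python
import torch
import torch.nn as nn

class DialectCNN(nn.Module):
    """Illustrative 1-D CNN over a 193-dim feature vector (assumed layout)."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, n_classes),       # one logit per dialect/language
        )

    def forward(self, x):                   # x: (batch, 193)
        return self.net(x.unsqueeze(1))     # add a channel dimension

model = DialectCNN()
logits = model(torch.randn(8, 193))         # a batch of 8 feature vectors
```

Training then minimizes a cross-entropy loss over the four dialect/language classes.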

Table 14 Our best CNN configuration

6.2.1 Experiment 1: Librosa + CNN

The results obtained for 4-class and 3-class classification are presented in Tables 15 and 16, respectively. For 4-class classification, we notice an F1-score improvement of around 7%, 23%, and 10% compared to Scheme 3, Scheme 2, and Scheme 1 (baseline), respectively. However, the performance of 3-class classification decreased in comparison to the three aforementioned schemes by about 22%, 20%, and 19%, respectively.

Table 15 Performance of the system based on Librosa+CNN for 4-class classification
Table 16 Performance of the system based on Librosa+CNN for 3-class classification

From the results of the binary classification task presented in Table 17, we note the following points:

  • The F1-score obtained with the CNN architecture reaches 100% for the three pairs: AAD-MSA, AAD-MAD, and MSA-MAD.

  • For AAD-MBD, CNN outperforms the first and second schemes. However, the best performing technique is Scheme 3.

  • For MBD-MSA, the first and third schemes outperform CNN.

  • For MBD-MAD pair, the CNN performance is the worst compared to all three schemes.

Table 17 Performance of the system based on Librosa + CNN for binary classification

6.2.2 Experiment 2: spectrogram + Resnet + CNN

In this experiment, we tackled the 4-class classification problem by retraining the last layer of Resnet50 and Resnet101 (He et al., 2016). It should be noted that for both Resnet architectures we used the same configuration, as explained in Table 18; the only exception is the number of layers per model. Table 19 clearly shows the degraded performance compared to Experiment 1, and that Resnet50 slightly outperforms Resnet101.

Table 18 Our best ResNet configuration
Table 19 Performance of the system based on spectrogram + Resnet50/101 + CNN for 4-class classification

In contrast, Table 20 shows a slight improvement for 3-class classification compared to 4-class classification.

Table 20 Performance of the system based on spectrogram + Resnet50/101 + CNN for 3-class classification

As can be noticed in Table 21, the accuracy achieved for binary classification ranges from 41.78 to 89.28% (Resnet50) and from 37.85 to 88.92% (Resnet101). The best results were recorded for the three pairs AAD-MBD, MBD-MAD, and AAD-MAD. Overall, Resnet101 performance is, in most cases, better than that of Resnet50, except for the pairs MBD-MSA and MSA-MAD.

Table 21 Performance of the system based on spectrogram + Resnet50/101 + CNN for binary classification

6.3 Multilingual ASR baseline system

Fig. 6 Speech recognition rates with different GMM configurations

In order to recognize the first ten digits spoken in MSA, MBD, MAD, and AAD, several experiments, with 3 HMM states and different Gaussian Mixture Models (4, 8, and 16 Gaussians), have been carried out.

On the one hand, we implemented four independent recognition engines for MSA, MBD, MAD, and AAD, respectively.

The best accuracy is obtained by using 3 HMM states and 4 GMMs, as shown in Fig. 6. On the other hand, we designed multilingual ASR baseline engines. Three ASR configurations have been considered to recognize, first, MAD and AAD jointly (mix-sys-1), second, MAD, AAD, and MBD (mix-sys-2), and third, MAD, AAD, and MBD, in addition to MSA (mix-sys-3). Figure 7 presents the recognition rates of the three configurations with different numbers of Gaussians. The best recognition rates are 58.8%, 56.7%, and 49.7% for mix-sys-1, mix-sys-2, and mix-sys-3, respectively, all obtained with 4 GMMs. This is probably due to the small amount of available data. The recognition rates drop dramatically as the number of dialects trained jointly increases.

To improve ASR systems’ accuracy, we integrated the language identification component, which identifies the speaker’s language/dialect before the speech recognition process. This will be detailed in the next section.

Fig. 7 The accuracy of the multilingual ASR baseline systems (mix-sys-1, mix-sys-2, and mix-sys-3)

6.4 The multilingual ASR system

Our proposed system is a combination of Automatic Speech Recognition and Language/Dialect Identification, able to switch between the four independent recognizers mentioned in Sect. 6.3 (Fig. 6). It selects the suitable ASR system to recognize the utterance spoken in the particular language/dialect identified and provided by the DI module (Fig. 1).

As the accuracy of the three ASR engines corresponding to the configurations mix-sys-1, mix-sys-2, and mix-sys-3 was unsatisfactory, as shown in Fig. 7, we added the dialect identification component, performing binary, 3-class, or 4-class classification according to the number of dialects considered in each of the three configurations (see Table 22). We notice a significant improvement achieved by our proposed multilingual system using 3 HMM states and 4 GMMs (see Fig. 8 and Table 22) compared to the baseline one.
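The switching logic can be sketched in a few lines; the classifier and per-dialect recognizers below are hypothetical placeholders for the components described above:

```python
# Minimal sketch of the combined pipeline: the DI module picks the dialect,
# then the matching single-dialect recognizer decodes the utterance.
def recognize(utterance, di_classifier, recognizers):
    """Route an utterance to the ASR engine of its identified dialect."""
    dialect = di_classifier(utterance)       # e.g. "MSA", "MBD", "MAD", "AAD"
    return recognizers[dialect](utterance)

# Toy stand-ins for the four engines and the DI module (hypothetical)
recognizers = {d: (lambda u, d=d: f"{d}:{u}") for d in ("MSA", "MBD", "MAD", "AAD")}
di_classifier = lambda u: "AAD" if u.startswith("aad") else "MSA"
result = recognize("aad_digit_5", di_classifier, recognizers)
```

Because each engine is trained on a single dialect, a correct DI decision turns the hard multi-dialect decoding problem into the much easier mono-dialect one, which is where the improvement in Table 22 comes from.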

Fig. 8 The accuracy of our proposed multilingual ASR system

Table 22 Accuracy of our proposed multilingual ASR system (Fig. 8) compared to the baseline one (Fig. 7) using the best DI system

7 Conclusion

In this paper, we presented a set of experiments aiming to improve spoken digit recognition by adding a language/dialect identification component to a standard ASR system. We showed that our proposed system is useful for such a task when dealing with Maghrebi vernaculars, which are considered under-resourced languages. We used different approaches to identify these dialects. The best performance for 4-class classification (AAD, MAD, MBD, MSA) was achieved using the third scheme, based on Librosa (193 components), to feed the CNN model. The machine learning based classifiers (SVM, EXT, KNN, RF, GB) achieved the best performance, either with the Librosa acoustic features or with the spectrogram, when dealing with the three dialects (AAD, MAD, MBD). Overall, for both binary and multi-class classification of the dialects, the best scheme is Librosa + CNN, which yielded an accuracy of 100% in some cases, achieved by selecting the appropriate configuration of the CNN model. The second-best performance was achieved by the system based on Librosa features with KNN, SVM, EXT, RF, and GB. Using the global features (Hu Moments, Haralick Texture, and Color Histogram) extracted from spectrogram images, these classifiers outperform the Resnet50/101 models that used the spectrogram images directly. The latter models are less efficient because of the small number of images.

Our proposed multilingual ASR system has successfully improved the recognition rate of digits spoken in low-resourced dialects from the Maghreb region. In our future research, we will focus on expanding our corpus to cover more dialects.