Introduction

The Tunisian Dialect (TD) is the mother tongue spoken by all Tunisians, regardless of their origin and social affiliation. It is a subgroup of Arabic dialects usually identified with Maghreb Arabic. The TD has sparked increasing interest in the NLP community, for tasks such as Arabizi transliteration [21], speech recognition [22], dialect identification [4], word-to-word translation [13], and sentiment analysis [24], among others. The TD is neither standardized nor taught, and has no official status. Nevertheless, most native Arabic speakers cannot produce sustained spontaneous speech in Modern Standard Arabic (MSA), even in unwritten situations where spoken MSA would normally be expected. Hence, most Tunisian speakers resort to frequent code-switching between their dialect and MSA in their daily lives, such as in talk shows on radio and TV channels [19]. Given this instability of spoken TD, syntactic and grammatical speech errors, commonly called disfluencies, occur frequently in all forms of spontaneous speech, whether casual discussions or formal arguments [32].

Disfluencies are a characteristic of spontaneous speech that makes it different from written text [10]. They are additional speech noises corrected by the speaker and may affect the grammatical flow of utterances. Disfluencies detection presents a significant challenge for tasks dealing with spontaneous speech processing, such as parsing, machine translation, dialogue systems, and other NLP understanding tasks [6]. It consists of recognizing disfluent sequences in spoken language transcripts or automatic speech recognition output.

Detecting disfluencies in spoken TD has rarely been studied. Ref. [25] adopted a symbolic rule-based approach to automatically delimit and correct disfluencies using a set of rules and patterns, in a very restricted domain and with a limited TD vocabulary. Ref. [34] proposed transcription conventions for annotating disfluencies in TD corpora, which were later used to manually annotate only incomplete words, repetitions, and onomatopoeic words in the TD corpus STAC [33].

This work is part of disfluencies processing in the TD. As far as we know, no work in the spoken TD processing field detects and removes several types of disfluencies automatically from open-domain transcriptions. In this perspective, our study aims to propose an original method for processing eight types of disfluencies, namely syllabic elongations, word-fragments, speech words, simple and complex repetitions, insertions, substitutions, and deletions. In previous work, we presented a method for detecting and removing disfluencies in TD transcripts, which provides a rule-based approach for simple disfluencies processing and a sequence-based model within the Machine Learning (ML) approach for complex disfluencies processing. The major contributions of this paper concern the complex disfluencies detection task: we present a transition-based model for complex disfluencies detection and compare it with the sequence-based model.

This paper is structured as follows: We present the background of disfluencies processing in spontaneous spoken TD in Section “Background of Disfluencies Processing in TD”. We give an overview of the building process of our TD corpus and an analysis of its most significant characteristics in Section “Data”. We present our contribution to detecting disfluent regions of utterances in Section “The Proposed Method for Disfluencies Processing in TD”. We review the previous work on complex disfluencies processing in TD and show the comparison results and a critical analysis of both contributions in Section “Experiments and Discussion”, before drawing our conclusion and major future work in Section “Conclusion and Perspectives”.

Background of Disfluencies Processing in TD

In this section, we present the different types of disfluencies studied in our work, as well as the challenges of handling these phenomena in the context of spontaneous TD speech processing.

Disfluencies Taxonomy

The TD usually includes eight types of disfluencies as follows [5]:

Syllabic elongations are abnormal vowel lengthening of a syllable lasting more than 1 s. In TD speech, the elongations appear usually with either the first (e.g., Utt1) or the last (e.g., Utt2) syllable of the word. The following examples illustrate the two cases:

Utt1: صوووتك موش واضح [Swwwtk mw$ wADH] (Your voiiiice is not clear).

Utt2: فماااا تران برك [fmAAAA trAn brk] (Thereee’s only one train).

Speech words are characterized by the continuation of the acoustic signal generation during the pause period. They include filled pauses also called hesitations (e.g., اه [āh] (Ah)) and discursive markers (e.g., يعني [y’ny] (meaning)). They are the most frequent disfluencies used in spontaneous oral productions.

Word-fragments are "syllables, speech sounds or single consonants, which are similar to the beginning of the next fully articulated word... [and] they may neither be equal to the whole next word" [11]. They are truncated words started and interrupted by the same speaker (e.g., Utt3). They may be dropped, taken, or replaced.

Utt3: وقتاش يبدا ال المؤتمر [wqtA$ ybdA Al AlmWtmr] (When the the-conference starts).

Simple repetitions are words that occur several times consecutively (e.g., Utt4).

Utt4: ثلاثة أمم ثلاثة وزراء [vlAvp Omm vlAvp wzrA’] (Three emm three ministers).

Complex repetitions can be either one word that is repeated not consecutively (except speech words) in the utterance (e.g., Utt5) or a group of words identically repeated (e.g., Utt6).

Utt5: علاش هذا التبذير علاش [ElA$ h*A Altb*yr ElA$] (Why this extravagance why)

Utt6: بعد ثلاثة سوايع امم ثلاثة سوايع يبدا [bEd vlAvh swAyE Amm vlAvh swAyE ybdA] (After three hours emm three hours starts)

Insertions are the case of correcting a part of the speech by adding new words (e.g., Utt7).

Utt7: عندك مثال كان عندك مثال [Endk mvAl kAn Endk mvAl] (you have an example if you have an example)

Substitutions are the case of correcting a part of the speech by replacing some words with new ones (e.g., Utt8).

Utt8: موضوع يهم المسنين أهه سامحني الشباب [mwDwE yhm Almsnyn Ohh sAmHny Al$bAb] (A topic that interests the elderly euuh forgive me the youth people)

Deletions are the case of correcting a part of the speech by removing words (e.g., Utt9).

Utt9: تران نورمال إلي يخرج تو تران إلي يخرج تو [trAn nwrmAl Ily yxrj tw trAn Ily yxrj tw] (Normal train that leaves now train that leaves now)

Disfluencies typology usually depends on the language concerned. In the TD, 38% of elongation cases affect the first syllable of the word [6]. In research dealing with French, phonological characteristics like schwa and assimilation can be considered as a disfluency type [7].

The heterogeneity in how the different disfluencies can be detected led us to use the following taxonomy: simple disfluencies and complex disfluencies. Simple disfluencies affect only one minimal token; they include syllabic elongations, speech words, word-fragments, and simple repetitions [5]. Complex disfluencies affect several tokens and may break the morphological flow of the utterance; they include complex repetitions, insertions, substitutions, and deletions. According to [29], complex disfluencies are typically assumed to have a tripartite reparandum-interregnum-repair structure, as shown in Fig. 1.Footnote 1

Fig. 1
figure 1

Sentence with disfluencies annotated according to Shriberg (1994)

The reparandum is the disfluent portion of the utterance that is corrected or abandoned. The interregnum (also called editing term) is the optional portion of the utterance; it could include speech words. The repair is the portion of the utterance that corrects the reparandum.
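As a toy illustration (our own sketch, not the paper's code), the tripartite structure can be represented as token spans over the transliterated substitution example Utt8; the span indices below are hand-assigned for this example.

```python
# Hand-annotated tripartite structure for Utt8 (transliterated tokens).
utt8 = "mwDwE yhm Almsnyn Ohh sAmHny Al$bAb".split()

annotation = {
    "reparandum": (2, 3),   # "Almsnyn" (the elderly): the abandoned portion
    "interregnum": (3, 5),  # "Ohh sAmHny" (euuh forgive me): editing terms
    "repair": (5, 6),       # "Al$bAb" (the youth): the correction
}

def remove_disfluency(tokens, ann):
    """Drop the reparandum and interregnum; keep fluent words and the repair."""
    drop = set(range(*ann["reparandum"])) | set(range(*ann["interregnum"]))
    return [t for i, t in enumerate(tokens) if i not in drop]

print(remove_disfluency(utt8, annotation))  # ['mwDwE', 'yhm', 'Al$bAb']
```

Removing the reparandum and interregnum leaves the fluent reading “A topic that interests the youth”.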

Challenges for Disfluencies Detection in TD

Detecting disfluencies is a challenging task for spoken dialects, including the TD, even though speech is the most natural, easy, and spontaneous means of communication. Dialects are not standardized, they are not taught, and they have no official status.

Pronunciation variability. The same utterance can be expressed in different ways depending on the speaker's region, social class, and age. The TD is divided into several dialectal areas corresponding to the Tunisian regions. The vocabulary varies across areas, involving phonological, morphological, lexical, and syntactic variations. For example, the personal pronoun 'I' in English is pronounced "آنا [|nA]" in Tunis and Sfax, "آني [|ny]" in Sahel, "ناي [nAy]" in El Kef, etc.

Dialogue context. A spontaneous dialogue may contain sub-dialogues of reformulation, clarification, or rectification. Likewise, the speaker can expand an idea over several turns. Several disfluency cases can be wrongly detected in the absence of the dialogue context. In the dialogue below, "الزوز" and "الوحدة" would literally be considered substitution disfluencies, while this is not the case when the context of the first speaker's turn is considered.

- Speaker 1: سوم الوحدة والا الزوز؟ [swm AlwHdp wAlA Alzwz?] (the price of one or of both)

- Speaker 2: الزوز, الوحدة بعشرين دينار [Alzwz, AlwHdp bE$ryn dynAr] (both, one is for twenty dinars)

Out-of-vocabulary words. OOV words are speech recognition errors that correspond to insertions, deletions, and confusions of words generated by automatic recognition systems [3]. OOV words include unknown words, which do not exist in the recognition language model or in the lexicon and could be truncated words (e.g., Utt10), and mis-recognized words, which are produced in the output of speech recognition while other words were pronounced (e.g., Utt11).

Utt10: من تون من تونس [mn twn mn twns] (From Tun from Tunis)

Utt11: نحب نمشي لمارس [nHb nm$y lmArs] (I like to go to March)

Long-range dependency. Repairs do not necessarily follow reparandums immediately. They may be placed after several fluent words that do not belong to the disfluency structure. In Utt12, the reparandum "المعلمين" is corrected after three fluent words by the repair "التلامذة", where "اه لا" is the interregnum.

Utt12: المعلمين الي عملو مسيرة اه لا التلامذة [AlmElmyn Aly Emlw msyrp Ah lA AltlAm*p] (the teachers who made a protest ah no the students).

Segmentation of utterances. Sentence boundary detection and disfluencies processing are both increasingly studied in spontaneous speech research. The probability of disfluencies was found to be exponentially proportional to the length of the utterance [2]. Utt13 contains a complex repetition of the word تيكاي [tykAy] (ticket), which appears in both positions 3 and 6. Considering its grammatical structure, however, Utt13 should be split into two utterances as follows:

باهي اعطيني تيكاي [bAhy AETyny tykAy] (ok give me a ticket) and بقداه هي التيكاي [bqdAh hy AltykAy] (how much is the ticket).

Utt13: باهي اعطيني تيكاي بقداه هي التيكاي [bAhy AETyny tykAy bqdAh hy AltykAy] (ok give me a ticket how much is the ticket).

Irregular word order. The order of the words in oral utterances is not always respected, without this affecting the semantics conveyed. In a given Tunisian verbal utterance, the canonical word order can generally follow three syntactic structures, namely Subject-Verb-Object (SVO), Verb-Subject-Object (VSO), and Object-Verb-Subject (OVS). Since TD is an irregular language, the syntactic order of words may change within the same disfluent utterance. In Utt14, "وقتاش يخرج تران [wqtA$ yxrj trAn], VSO" is a repetition of "التران وقتاش يخرج [AltrAn wqtA$ yxrj], SVO", both translated as "when does the train leave".

Utt14: وقتاش يخرج تران نحب نعرف التران وقتاش يخرج لتوزر [wqtA$ yxrj trAn nHb nErf AltrAn wqtA$ yxrj ltwzr] (when does the train leave to Tozeur).

Compound words. Poly-lexical expressions such as compound words involve several linguistic phenomena whose syntactic and semantic properties only partially overlap. Disfluent compound words represent 2.7% of the 4.6% of disfluent units in the study of [9]. Disfluencies can appear inside compound words (e.g., منزل امم بوزيان [mnzl emm bwzyAn] (Manzel Bouzayen), a commune in the Sidi Bouzid governorate in Tunisia) or outside them (e.g., خا خارق للعادة [xA xArq llEAdp] (extraordinary)).

Voluntary repetitions. Voluntary word repetition is meant to highlight a description of reality and can also contribute to an argument. The repetition can then be either semantic (e.g., anaphora) or grammatical. In French, some words are repeated for syntactic reasons (e.g., nous nous sommes, for pronominal verbs). In TD vocabulary, we noticed voluntary repetitions used frequently, such as "تو تو [tw tw] (now)" and "كيف كيف [kyf kyf] (the same)".

Synonymy. The speaker may replace some words with their synonyms even when this is unnecessary for the syntactic structure of the utterance. In Utt15, "مشى [m$Y]" may be a synonym of "خرج [xrj]". Besides, in languages that code-switch between several languages, such as TD, a word can be repeated by its equivalent in another language.

Utt15: المدير مشى خرج [Almdyr m$Y xrj] (the director walked left)

Enumeration. An enumeration consists of successively detailing the various elements of which a generic concept or an overall idea is composed. In the absence of coordinating conjunctions, an enumeration can be misinterpreted as a substitution disfluency in which the speaker replaces words to catch up and correct his utterance, as the following example shows.

Utt16: باش يفسر يفهم [bA$ yfsr yfhm] (in order to explain understood)

Agglutination. Arabic TD is an agglutinative language. New dialectal affixes are added while others are removed compared to the MSA morphology. The negation particle "ما [mA]" and the negation letter "ش [$]", attached respectively to the beginning and the end of verbs, replace the MSA negation particles "لم [lm]" and "لن [ln]"; for example, "مامشيتش [mAm$yt$] (I did not go)", whose root is the verb "مشيت [m$yt] (I went)". Likewise, the MSA interrogative clitics "ا [A]" and "هل [hl]" are replaced by "شي [$y]" attached to the end of the word, as in "مشيتشي [m$yt$y] (Did you go)". Similarly, the proclitics "ه [h]", "ع [E]" and "م [m]" agglutinated to the definite article "ال [Al]" result from the reduction of the close demonstrative pronouns ("هذا [h * A] (this)", "هذه [h * h] (this)", "هؤلاء [hWlA ’] (these, 'feminine/masculine plural')", "هاتان [hAtAn] (these, 'feminine dual')", "هذان [h * An] (these, 'masculine dual')"), the preposition "من [mn] (from)" and the coordinating conjunction "مع [mE] (with)", respectively.
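As a small hedged sketch (ours, not the authors' implementation, working on the Buckwalter transliteration used in the examples), the negation circumfix can be stripped with a regular expression; real TD morphological analysis is of course far richer than this toy rule.

```python
import re

# Toy rule for the TD negation circumfix: prefix "mA" plus suffix "$"
# (transliterated), e.g. "mAm$yt$" (I did not go) built on "m$yt" (I went).
NEG = re.compile(r"^mA(.+)\$$")

def strip_negation(word):
    m = NEG.match(word)
    return m.group(1) if m else word

print(strip_negation("mAm$yt$"))  # -> "m$yt" (I went)
print(strip_negation("m$yt"))     # unchanged: not negated
```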

Lack of diacritics. The absence of Arabic vowels (i.e., diacritical marks placed above or below the Arabic letters) leads to lexical ambiguity, given the polysemous nature of unvowelled words, which causes problems for automatic analysis, especially morpho-syntactic tagging. "درس [drs]" can have different vocalizations, such as "دَرَسَ [darasa] (he studied)", "دَرْسٌ [darsun] (a lesson)", "دَرَّسَ [dar sa] (he taught)", etc.

Syntactical dependencies. The TD is characterized by the fact that possessive pronouns or adjectives can be reduced and agglutinated to nouns in final position. In Utt17, the expression "your phone" is written in two forms: "تليفون متاعك [tlyfwn mtAEk]", where the possessive pronoun "متاعك [mtAEk]" is detached from the noun, and "تليفونك [tlyfwnk]", where the possessive pronoun is reduced and agglutinated to the noun.

Utt17: هز تليفونك قتلك هز تليفون متاعك [hz tlyfwnk qtlk hz tlyfwn mtAEk] (take your phone I told you to take your phone)

Alternation of languages and dialects. TD is a spoken variety of Arabic and presents a mode of communication built on the alternation between several languages and dialects. Tunisians usually express themselves spontaneously using three languages, namely MSA, TD, and French. The TD itself represents a mosaic of languages, with many words and expressions borrowed from French, Maltese, English, Turkish, and Spanish as a result of trade movements and colonization over the centuries. These words and expressions can be used daily without any phonological or morphological modification.

Reuse of borrowed words. Some foreign words, especially those from French, undergo morphological changes to express an action, an order, or the possession of objects. A borrowed verb is morphologically derived to produce adjectives, nouns, and the conjugation of that verb. Borrowed nouns also undergo morphological derivations, including the verb, the adjective, and the noun. These changes differ from the actual derivation of the word in its original language. For example, the French verb "gérer" (to manage) is conjugated as "يجاري [yjAry] (he manages)" instead of "il gère" in French. This reuse and derivation feature also applies to MSA words. For example, adding the affix "جي [jy]" to nouns like "بنك [bnk] (bank)" indicates the profession "بنكاجي [bnkAjy] (banker)".

Data

Our study is carried out using the Disfluencies Corpus from Tunisian Arabic Transcription ’DisCoTAT’ [5]. It consists of transcribed utterances coming mainly from recordings of railway information services and Tunisian TV channels and radio programs.

DisCoTAT is composed of two parts. The first part consists of 38,627 utterances collected from three existing TD corpora (i.e., STAC [33], TUDICOI [12], and TARIC [20]). STAC (Spoken Tunisian Arabic Corpus) is composed of 3 h and 28 min (7,788 utterances) of TD speech recordings collected from different TV channels and radio stations. TUDICOI (TUnisian DIalect COrpus Interlocutor) is composed of 1,825 dialogues comprising 12,182 utterances. TARIC (Tunisian Arabic Railway Interaction Corpus) consists of 20 h of transcribed speech. It is composed of 4,662 dialogues with 18,657 utterances. TUDICOI and TARIC utterances consist of railway information services (e.g., train schedules, train destinations, tariffs, etc.). However, only 21% of the collected utterances contain disfluency phenomena.

The second part of DisCoTAT is about 2 h of recordings obtained from different TV channels and radio stations (i.e., Mosaique radio, Sfax radio, and Nessma TV). The transcription is done manually according to OTTA and CODA-TUN conventions [34]. Only disfluent utterances are transcribed to increase the number of disfluencies occurrences in the corpus. To date, the number of transcripts is about 406 disfluent utterances. The total number of disfluent utterances in DisCoTAT is about 3780 composed of 4757 disfluency phenomena. 80% of utterances are used for training and 20% of utterances are used for evaluation. Tables 1 and 2 illustrate the distribution of disfluencies types in DisCoTAT.

Table 1 Simple disfluencies distribution in the DisCoTAT corpus
Table 2 Complex disfluencies distribution in the DisCoTAT corpus

DisCoTAT is composed of a mosaic of words coming from various languages, mainly TD (62%), MSA (17%), French (13%), and others (8%). Table 3 presents an example of a DisCoTAT utterance.

DisCoTAT is enriched with two types of annotation: morpho-syntactic annotation using TD-WordNet, and hand-crafted complex disfluencies annotation using the annotator tool DisAnT [5]. A given utterance goes through two processing phases. The first phase is automatic; it applies a set of pre-processing tasks such as lexical analysis, POS tagging, and simple disfluencies processing. The second phase is manual; it consists of identifying the disfluencies boundaries in the utterance.

Table 3 Examples of utterances with disfluencies

The Proposed Method for Disfluencies Processing in TD

In pursuit of our previous work [6], we present in this section a transcription-based method, guided by linguistic features, to handle disfluencies removal from transcribed utterances of spoken TD.

We have taken up the method proposed in our previous work, in particular the pre-processing and simple disfluencies processing steps. However, in this paper, we propose a transition-based model to carry out the complex disfluencies processing step. Figure 2 shows the steps of the proposed method through a TD utterance example.

Fig. 2
figure 2

Steps of the proposed method

Pre-processing Step

The pre-processing step is essential for the detection of the different types of disfluencies; it adapts the utterance for the downstream steps.

Tokenization

The purpose of tokenization is to segment the utterance into tokens. A token is either a word or a group of words (i.e., a compound word). For example, "موش [mw$] (is not)" and "نورمال [nwrmAl] (normal)" constitute one token "موش-نورمال (abnormal)" labelled ("locution").

Morpho-Syntactic Analysis

The tokens found are labelled with POS tags using the TD-WordNet lexicon [5]. The morpho-syntactic analysis lemmatizes unlabelled words based on the TD-WordNet list of prefixes and suffixes to find their POS tag. For example, the word "قلتلك [qltlk] (I told you)" is an inflected form of the verb "قال [qAl] (tell)", concatenated with the suffix "لك [lk]", which refers to a singular pronoun.

Simple Disfluencies Processing Step

Simple disfluencies are processed using a rule-based approach. For detecting them, we designed a set of detection rules based mainly on the POS tags of words. In this work, we integrated the semantic feature 'word synonyms' to improve the detection of simple repetition phenomena.

Syllabic Elongations

Syllabic elongations processing consists of detecting and correcting words that are (i) not POS-tagged and (ii) contain more than two extensions in the first or last syllable.
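A minimal sketch of this rule (our illustration, not the authors' implementation): a letter repeated more than twice is collapsed to a single occurrence. For simplicity the sketch collapses such runs anywhere in the word, whereas the described rule targets the first or last syllable.

```python
import re

ELONGATION = re.compile(r"(.)\1{2,}")  # a letter repeated more than twice

def fix_elongation(word):
    # Collapse the elongated run to a single letter.
    return ELONGATION.sub(r"\1", word)

print(fix_elongation("Swwwtk"))  # -> "Swtk"  (Utt1: elongated first syllable)
print(fix_elongation("fmAAAA"))  # -> "fmA"   (Utt2: elongated last syllable)
```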

Speech Words

Detecting speech words is a word-based matching of words tagged ("Marq_Disc", discourse mark) or ("Marq_Hesit", hesitation mark). Although speech words fall into the simple disfluencies category, their removal is deferred to the next step, as they help to detect complex disfluencies.

Word-Fragments

Word-fragments are lexical syllable repetitions. Their processing consists of detecting and removing words that (i) may not be POS-tagged and (ii) are an integral part of the following word.
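Condition (ii) can be sketched as a simple prefix test (our toy version): a token that is a strict prefix of the following token is treated as a fragment and dropped.

```python
def remove_fragments(tokens):
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # (ii) the token is an integral part (prefix) of the next word
        if nxt is not None and nxt != tok and nxt.startswith(tok):
            continue  # truncated start of the next word -> removed
        out.append(tok)
    return out

# Utt3: "Al" is the truncated beginning of "AlmWtmr" (the conference).
print(remove_fragments("wqtA$ ybdA Al AlmWtmr".split()))
```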

Simple Repetitions

Simple repetition processing consists of detecting and removing words that occur several times consecutively. Indeed, repetition can be lexical (i.e., with the same word) or semantic (i.e., with a synonym).

Speech words that appear inside word-fragment cases (e.g., Utt18) or simple repetition cases (e.g., Utt19) are removed together with the corresponding disfluency.

Utt18: "تران متا اه متاع تونس [trAn mtA Ah mtAE twns] (The train of euh of-Tunis)".

Utt19: "المعرض يبدا اليوم آآ ليوما [AlmErD ybdA Alywm euuh lywmA] (The show starts today ahh today)".
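The lexical-or-semantic repetition rule can be sketched as follows (our illustration; the synonym table is a toy stand-in for the TD-WordNet lookup, using the transliterated forms of Utt20 from the evaluation section):

```python
# Toy synonym normalisation: "trynw" and "trAn" both mean "train".
SYNONYMS = {"trynw": "trAn"}

def normalise(tok):
    return SYNONYMS.get(tok, tok)

def remove_simple_repetitions(tokens):
    out = []
    for tok in tokens:
        # Consecutive tokens with the same normalised form are one repetition.
        if out and normalise(out[-1]) == normalise(tok):
            out[-1] = tok  # keep the last occurrence (the speaker's correction)
        else:
            out.append(tok)
    return out

print(remove_simple_repetitions("trAn trynw yxrj AlArbEp".split()))
```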

Complex Disfluencies Processing Step

Previous studies on the disfluencies processing task fall into four main approaches:

The Noisy Channel Models (NCM) based approach [2, 17] uses NCM with Tree Adjoining Grammars (TAG) [14] to find the similarity between the disfluent chunk of the utterance and its correction as an indicator of disfluencies.

NCM models are not suitable for detecting various types of disfluencies, notably insertions and deletions, which are not necessarily accompanied by repetitions. In [35], 62% of the words of the reparandum are identical to the words of the repair. In our study corpus, in approximately 32% of complex disfluency cases, the repair does not contain words repeated from the reparandum.

The transition-based approach [16, 30] uses transition-based analysis models that detect disfluencies while simultaneously identifying the utterance’s syntactic tree. Disfluencies detection is achieved by adding new actions to the parser to detect and remove the disfluent parts of the utterance and their dependencies.

The advantage of transition models lies in the fact that two different tasks (i.e., syntactic analysis and disfluency processing) can be carried out simultaneously. Likewise, they can capture the contiguous syntactic dependencies of disfluencies as well as segment-level information. In contrast, joint models require large annotated treebanks containing both disfluency and syntactic structure annotations for the training phase. Moreover, the additional syntactic annotation is very expensive to produce and can introduce noise by significantly enlarging the output search space.

The sequence-tagging based approach [1, 18] uses word sequence-tagging models. It is based on statistical models that predict the class of a token according to the BIO encoding schema [26]. A model labels words as being inside or outside of the edit region.

Sequence-tagging-based models make it possible to capture distant dependencies between the reparandum and the repair, even in long utterances, using neural networks. Their disadvantage lies in the fact that they require large volumes of annotated data.

The seq2seq transformer-based approach [27, 28] is inspired by the machine translation task, which considers the disfluent text as the source language and the fluent text as the target language. The Transformer is a seq2seq neural architecture that has the particularity of using only the attention mechanism, with no recurrent or convolutional network.

The seq2seq models using the attention mechanism make it possible to almost perfectly detect a repair that is far from the reparandum. In contrast, seq2seq models must rely on a large amount of data.

This work is part of the disfluencies processing project in the Tunisian dialect. We propose to create and compare several disfluency processing models based on existing state-of-the-art methods. In previous work, we presented our classification-based model. In the present work, we propose to handle complex disfluencies using a transition-based model, which detects and removes the disfluent chunk of the utterance with a set of transition actions and without syntactic dependencies, inspired by [30]. We also incorporate semantic features, based mainly on word synonyms, to improve complex repetition detection in addition to the morpho-syntactic features. The main idea of this work is to detect only the disfluent words (i.e., reparandum and interregnum) while ignoring the rest of the utterance's words (i.e., fluent words and repair). Identifying reparandums is the most challenging task in disfluencies processing. They may take various forms, occur at different places, vary in length, and, in some cases, be nested [30]. We present the comparison of both models in Section “Complex Disfluencies Processing Evaluation”. In future work, we aim to implement the other disfluency detection methods.

Model presentation

A disfluent utterance is represented by the tuple (A, I, D, O) where:

  • Action (A) contains the history of the actions,

  • Input (I) contains the words not yet processed,

  • Disfluent (D) contains the words considered disfluent,

  • Output (O) contains the words considered fluent.

The model outputs a sequence of binary tags denoted \(D_1^n = d_{w_1},d_{w_2},\ldots,d_{w_n}\), which means that each \(w_i\) is tagged as either fluent or disfluent. The best sequence of tags \(D^*\) given \(W_1^n\) [30] is:

$$\begin{aligned} D^* = argmax_D \, P\left( D_1^n \mid W_1^n\right) \end{aligned}$$

At instant \(t_0\), I contains \(W_1^n\), while O, D, and A are initially empty. For each word \(w_i\), the module predicts the transition action \(a_i = P(A_i^w \mid w_i^n, a_{i-1})\), where \(A_i^w = \{clear; shift\}\) and \(a_{i-1}\) is a dynamic feature:

  • clear: moves \(w_i\) from I to D, and

  • shift: moves \(w_i\) from I to O.

The model stops when I is empty. Since the model aims to predict a transition action for each word of \(W_1^n\), we implemented the Binary Classifier Transitions (BCT) proposed by [31], based on the word sequence \(W_1^n\) and the feature vectors \(V_{w_{i}^{n}} = v_1^m.w_1,v_1^m.w_2,\ldots,v_1^m.w_n\) presented in Section “Features of the Binary Classifier”. Algorithm 1 summarizes the BCT's main steps.

figure a

Algorithm 1 Transition Model Algorithm
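The transition loop can be sketched in a few lines (our illustration; `classify` is a hypothetical stand-in for the trained binary classifier):

```python
def run_transitions(words, classify):
    # (A, I, D, O): action history, input, disfluent words, fluent output.
    A, I, D, O = [], list(words), [], []
    while I:                       # the model stops when I is empty
        w = I.pop(0)
        a = classify(w, A)         # predicts "clear" or "shift"
        A.append(a)
        if a == "clear":
            D.append(w)            # clear: move w from I to D
        else:
            O.append(w)            # shift: move w from I to O
    return D, O

# Toy classifier: flag the hesitation "Ah" as disfluent.
D, O = run_transitions("trAn Ah ybdA".split(),
                       lambda w, history: "clear" if w == "Ah" else "shift")
print(D, O)  # ['Ah'] ['trAn', 'ybdA']
```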

Features of the Binary Classifier

Labelling a word entity is based on a set of observations introduced to a classifier as feature vectors. The task of detecting disfluencies relies mainly on either prosodic or linguistic features. Prosodic information (such as duration, rhythm, etc.) is omitted in our work, since we process transcripts only. We therefore rely solely on the linguistic features presented in Table 4.

Table 4 Features of the binary classifier

We used contextual features with a window of ±3 words, after experimenting with windows of ±1, ±2, ±3, and ±4 words. This choice is justified by the fact that TD utterances are not very long: the corpus analysis shows that the repair starts within a window that does not exceed three words after the disfluent part, not counting the interregnum. Finally, we used the dynamic criterion, which considers the class assigned dynamically to the three previous words.
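The feature window can be sketched as follows (our illustration with hypothetical feature names, not the paper's exact feature set):

```python
def features(tokens, i, prev_classes, window=3):
    # Contextual features: the tokens within +/-window positions of token i.
    ctx = {}
    for k in range(-window, window + 1):
        j = i + k
        ctx[f"w{k:+d}"] = tokens[j] if 0 <= j < len(tokens) else None
    # Dynamic feature: classes assigned to the three previous words.
    ctx["dyn"] = tuple(prev_classes[-3:])
    return ctx

feats = features("bAhy AETyny tykAy".split(), 1, ["shift"])
print(feats["w-1"], feats["w+0"], feats["dyn"])  # bAhy AETyny ('shift',)
```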

Model Generation

The binary classifier is built using the ML algorithm SVM, which gives high performance in binary classification tasks [23]. SVM classifies data by finding the best hyperplane that separates the data points of one class from those of the other. We experimented with various ML classification algorithms using the open-source library WEKA,Footnote 2 and found that libSVM [8], an implementation of SVM, achieves the highest results (79.81%). Table 5 summarizes the experimental results of five classification algorithms.

Table 5 Classification algorithm results

In the model parameters, the batch size is fixed at 100 instances. We used the Sigmoid kernel function [15]:

$$\begin{aligned} K(X,Y)=tanh\left( \gamma X^TY+r\right) \end{aligned}$$

The Sigmoid kernel takes two parameters: \(\gamma\) (i.e., the scaling parameter of the input data) and r (i.e., a shifting parameter that controls the threshold of the mapping). \(\gamma\) and r are fixed to 0. The loss parameter is set to 0.1; it defines the tolerated error between the predicted target and the given target value.
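Written out in code (our sketch; a non-zero \(\gamma\) is used purely for illustration, since with \(\gamma = 0\) and r = 0 the kernel value is constant):

```python
import math

def sigmoid_kernel(x, y, gamma=0.0, r=0.0):
    # K(X, Y) = tanh(gamma * X^T Y + r)
    dot = sum(a * b for a, b in zip(x, y))
    return math.tanh(gamma * dot + r)

print(sigmoid_kernel([1.0, 2.0], [3.0, 4.0], gamma=0.1))  # tanh(0.1 * 11)
```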

We tested the model using k-fold cross-validation with k = 10. The data are split into 10 parts (i.e., 10 folds), and the algorithm runs 10 times.
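The splitting scheme can be sketched as follows (our illustration of standard k-fold indexing, not the WEKA internals):

```python
def kfold_indices(n, k=10):
    # Assign items to k folds round-robin; each fold serves once as the test set.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(kfold_indices(100))
print(len(splits), len(splits[0][1]))  # 10 10 -> 10 folds, 10 test items each
```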

Experiments and Discussion

In this section, we report the evaluations performed using the evaluation portion of DisCoTAT. We implemented the method in the Java programming language within the NetBeans environment. We used the Recall, Precision, and F-measure metrics.

The overall method gives good results: the recall, precision, and F-measure are 95.39%, 82.24%, and 88.33%, respectively. We also evaluated the main steps of the method. The next sections present the evaluation results and analysis of the simple (Section “Simple Disfluencies Processing Evaluation”) and complex (Section “Complex Disfluencies Processing Evaluation”) disfluencies processing steps.
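For reference, the three metrics are computed from the counts of correctly detected disfluent words (tp), false alarms (fp), and misses (fn); the counts below are purely illustrative, not the paper's results.

```python
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

p, r, f = prf(tp=80, fp=20, fn=10)  # illustrative counts
print(round(p, 2), round(r, 2), round(f, 3))  # 0.8 0.89 0.842
```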

Simple Disfluencies Processing Evaluation

The simple disfluencies processing step achieved promising results. Rates are projected in Table 6.

Table 6 Evaluation of simple disfluency types

The performance of the simple disfluencies processing module in detecting speech words and word-fragments is mainly due to the efficiency of the pre-processing step. The TD-WordNet lexicon covers all instances of hesitation and discourse marks identified for the TD vocabulary. The lemmatization task, which recognizes unlabelled words using their prefixes and suffixes, improved the detection of simple disfluency types. In addition, we tested the contribution of the semantic feature to the detection of simple repetitions and obtained an 8.5% improvement compared to the previous work. In Utt20, "الترينو [Altrynw]" and "تران [trAn]" are synonyms, both meaning (train); this simple repetition disfluency case is successfully detected.

Utt20: تران ترينو يخرج الاربعة [trAn trynw yxrj AlArbEp] (Train train leaves at four).
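The synonym-based repetition check can be sketched as follows. The synset table here is a hypothetical stand-in for a TD-WordNet lookup, populated only with the (transliterated) words of Utt20; the real lexicon is far richer:

```java
import java.util.HashMap;
import java.util.Map;

public class SimpleRepetition {
    // Hypothetical stand-in for a TD-WordNet synset lookup: word -> synset id.
    static final Map<String, Integer> SYNSET = new HashMap<>();
    static {
        SYNSET.put("trAn", 1);  // "train"
        SYNSET.put("trynw", 1); // variant of "train", same synset
        SYNSET.put("yxrj", 2);  // "leaves"
    }

    /** Flags a simple (possibly synonym-based) repetition:
     *  two adjacent words sharing the same synset. */
    public static boolean isRepetition(String w1, String w2) {
        Integer s1 = SYNSET.get(w1);
        Integer s2 = SYNSET.get(w2);
        return s1 != null && s1.equals(s2);
    }
}
```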

Complex Disfluencies Processing Evaluation

We evaluated the efficiency of the complex disfluencies processing step by experimenting with two models. In the present paper, we proposed a transition-based model, which detects and removes the disfluent chunk of the utterance through a set of transition actions, without syntactic dependencies. It achieved an F-Measure score of 79.81%.
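A minimal sketch of such a transition decoder is given below. The two-action set (OUT keeps a word, DEL discards it as disfluent) and all names are illustrative assumptions; the model's actual action inventory is not reproduced here:

```java
import java.util.ArrayList;
import java.util.List;

public class TransitionDecoder {
    /** Assumed action set: OUT keeps the next buffer word, DEL discards it as disfluent. */
    public enum Action { OUT, DEL }

    /** Applies a predicted action sequence to the token buffer,
     *  yielding the cleaned (fluent) utterance. */
    public static List<String> apply(List<String> buffer, List<Action> actions) {
        List<String> output = new ArrayList<>();
        for (int i = 0; i < buffer.size(); i++) {
            if (actions.get(i) == Action.OUT) {
                output.add(buffer.get(i));
            }
            // DEL: the token belongs to a disfluent chunk and is dropped
        }
        return output;
    }
}
```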

In our previous work [6], we proposed a sequence-tagging-based model for complex disfluencies processing. The model classifies the utterance's words into six classes based on the reparandum-interregnum-repair structure, following the BIO encoding [26]. Tokens can be labelled as B_RM (i.e., the beginning of the reparandum), I_RM (i.e., inside the reparandum), B_RP (i.e., the beginning of the repair), I_RP (i.e., inside the repair), IP (i.e., interregnum, a speech word), or O (i.e., a fluent word). The sequence-tagging-based model relies on the statistical CRF algorithm [1] and uses the same features as the transition-based model. It achieved an F-Measure score of 78.97% when considering semantic features.
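The six-class tag set can be illustrated as follows; the helper predicate (our own, not part of the cited models) marks which tags correspond to material removed when cleaning an utterance, namely the reparandum and the interregnum:

```java
public class BioLabels {
    // Six classes from the reparandum-interregnum-repair structure (BIO encoding)
    public static final String[] LABELS = {"B_RM", "I_RM", "B_RP", "I_RP", "IP", "O"};

    /** A word is removed from the cleaned utterance if it belongs
     *  to the reparandum (B_RM/I_RM) or the interregnum (IP). */
    public static boolean isRemovable(String tag) {
        return tag.equals("B_RM") || tag.equals("I_RM") || tag.equals("IP");
    }
}
```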

Table 7 Evaluation of complex disfluency types

From the rates reported in Table 7, we notice that the performances of the two models are close. The transition-based model gives a slight improvement in the detection of insertions, modifications, and deletions, whereas its results for detecting complex repetitions have slightly degraded. In fact, in some cases of complex repetition of the discontinuous type (i.e., another entity is inserted between the two repeated words), as shown in Utt21, the transition-based model treats the disfluency as a deletion and removes the inserted word together with the reparandum. In such cases, detecting both the repair and the reparandum is mandatory; however, this error does not affect the syntactic harmony of the utterance.

Utt21: الخدام أنا الخدام إلي يتضر [AlxdAm OnA AlxdAm Ily ytDr] (The worker, I am the worker who is harmed).

Considering the challenges of disfluencies processing in TD presented in section 2.2, the models have overcome several difficulties:

  • The models can capture long-range dependencies of disfluencies thanks to the context of a ±3 window of neighbouring words, applied to several features,

  • The richness of TD-WordNet made it possible to handle compound words, voluntary repetitions, synonyms and pronunciation variability,

  • The lemmatization task in the pre-processing step overcomes the difficulty of agglutination inherent to the Arabic language and to TD specifically. Moreover, the simple disfluencies processing step facilitates complex disfluencies detection, since it eliminates the phenomena that disturb the syntactic and semantic flow of the utterance,

  • The proposed models can detect several complex disfluency structures even in the presence of out-of-vocabulary (OOV) words,

  • The efficiency of the selected features made it possible to treat the syntactic dependencies that appear within complex disfluencies, notably those involving possessive adjectives or pronouns, as explained in section 2.2,

  • The transition-based model can detect reformulations, a disfluency type that we did not address in our contribution, with an F-measure rate of 42.7%. Utt22 demonstrates a sentence reformulation: the speaker breaks off his speech and starts a new utterance. Utt22: مع وقتاش يخرج... ثمة تران توا لسوسة [mE wqtA$ yxrj... vmp trAn twA lswsp] (When leaves... Is there a train now to Sousse.).
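The ±3 window feature mentioned above can be sketched as follows; the padding symbol and all names are illustrative assumptions for how neighbouring-word context might be collected at utterance boundaries:

```java
import java.util.ArrayList;
import java.util.List;

public class WindowFeatures {
    /** Collects the words in a +/-radius window around position i,
     *  padding with a boundary symbol at utterance edges. */
    public static List<String> window(String[] words, int i, int radius) {
        List<String> feats = new ArrayList<>();
        for (int j = i - radius; j <= i + radius; j++) {
            if (j == i) {
                continue; // the focus word itself is handled as a separate feature
            }
            feats.add(j >= 0 && j < words.length ? words[j] : "<PAD>");
        }
        return feats;
    }
}
```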

Also, to validate the learning feature set, we trained the models without the grammatical gender and number of POS tags. This syntactic information turns out to improve the performance of both complex disfluencies detection models, by 8.85% (sequence-tagging based) and 8% (transition based).

However, the major error analysis cases are mainly due to the following reasons:

  • POS tagging without considering Arabic vowels can generate multiple POS tags for a given word, for example, "المقابلة [AlmqAblp]" means both (game, Noun) and (across, Adverb) in TD-WordNet,

  • In the case where coordinating conjunction is omitted, the enumeration can be interpreted as a modification disfluency type insofar as the speaker replaces the words to catch up and correct his enunciation,

  • Foreign words undergo uncontrolled lexical changes, so their various phonological and morphological derivatives cannot be pinned down in a lexical base,

  • The absence of a discursive context in the utterance or dialogue causes various problems, in particular annotation confusion. Some disfluency cases, nested ones for example, have a very complex structure and can therefore receive different annotations depending on the annotator.

Conclusion and Perspectives

In this paper, we investigated the field of disfluencies processing in the spontaneous spoken Tunisian Dialect. First, we presented our study corpus DisCoTAT, which consists of transcribed utterances coming mainly from recordings of Tunisian TV channels and radio programs. Then, we proposed our method to detect eight types of disfluencies. We constructed a set of detection rules, based mainly on the POS tags of words, for simple disfluencies. We also proposed a transition-based model for the complex disfluencies detection task, which was evaluated and compared to another model based on sequence-tagging. The comparison showed that both the transition-based and sequence-tagging models are efficient, with a slight advantage for the transition-based model.

The originality of our contribution lies in the fact that eight types of disfluencies are detected and corrected automatically, that complex disfluencies are handled by stochastic models derived from ML techniques with various linguistic features, and that the transcriptions come from a wide domain of spontaneous spoken TD.

In future work, we intend to implement and test other disfluencies detection models, such as seq2seq Transformers. We also aim to add acoustic features (e.g., duration, intensity, pitch) to the linguistic features.