1 Introduction

The pattern of stress and intonation in a language is called prosody. There are many application domains that might benefit from automatic detection of prosody. It can be utilized in text-to-speech synthesis to model intonation for computerized and robot speech. Shriberg et al. (2005) and Escudero-Mancebo et al. (2014) demonstrated that prosodic models improve speaker identification and verification. Irregular prosody is one of the symptoms of autism and other related developmental disorders (Frith and Happé 1994; Fine et al. 1991; Paul et al. 2005; Shriberg et al. 2001; McCann and Peppé 2003). Computer programs that detect irregular prosody automatically have been employed to diagnose autism (Xu et al. 2009; Ringeval et al. 2011; Oller et al. 2010; Diehl & Paul 2012; Van Santen et al. 2010). Suprasegmental measures derived from the elements of Brazil’s model have been shown to explain half of the variance in oral proficiency and comprehensibility ratings of non-native speakers (Kang et al. 2010; Kang and Wang 2014). A number of studies have concluded that the inclusion of prosodic elements enhances automatic speech recognition (Bocklet and Shriberg 2009; Hämäläinen et al. 2007; Litman et al. 2000; Ostendorf 1999).

This study examines automatic detection of tone choice, one of the fundamental elements of Brazil's (1997) model of prosody (see Sect. 2 for further details on Brazil's model). The purpose of this paper is to determine the best machine learning algorithm, and the associated acoustic feature set, for classifying tone choice. We analyzed the accuracy and κ of two machine learning classifiers (neural network and boosting ensemble) in two configurations (multi-class and pairwise coupling) and of a rule-based classifier. We tested three sets of acoustic features created from the TILT and Bézier models and from a new four-point model introduced in this paper. In the sections that follow, we explain how we decided on the classifiers and acoustic feature sets to test, describe the methods employed to determine the best machine learning algorithm and acoustic feature set for classifying the tone choice of a termination prominent syllable, and, after presenting the results, compare the current findings with those of other research in the field of speech science.

2 Brazil’s intonation model

Prosody is described by a variety of speech models. Brazil’s (1997) model and Pierrehumbert’s (1980) model are two that are used often in the fields of linguistics and applied linguistics. Pierrehumbert’s model is often utilized to model prosody for synthesized speech in text-to-speech applications (Wennerstrom 2001). Brazil’s model is frequently applied to language teaching (Cauldwell 2012). Using Brazil’s model is an innovative aspect of the current study because as far as we know it has not been applied to computational linguistics before. Brazil’s model defines pitch concord in an interactive dialog between two persons. Pitch concord matches the relative pitch of the key (first) and termination (last) prominent syllables between two speakers. For instance, high pitch on the termination of one speaker’s statement is matched with a high pitch on the key of the next speaker’s statement. Likewise, a mid termination is paired with a mid key. Pitch concord is a powerful predictor of speaking proficiency in non-native speakers (Pickering 1999). If we assume the goal of computational linguistics is more human-like speech production and interaction, then it is necessary to explore and adopt a model with a more thorough interpretation of intonation at a discourse (i.e., dialog) level.

The basis of Brazil’s theory is the tone unit. Brazil explains a tone unit as a portion of a discourse that a listener can distinguish as having a rising and falling pitch pattern that is distinctive from those of otherwise alike tone units having other patterns of pitch. Every tone unit has one or more prominent syllables, which can be identified from three properties of the syllable: pitch (fundamental frequency in Hz), duration (length in seconds), and intensity (amplitude in dB) (Chun 2002). Brazil asserts (as others have) that the importance of prominence is on the syllable, and not the word. Brazil differentiates prominence from lexical stress. He explains that lexical stress denotes the syllable inside content words that is stressed; however, prominence is the use of emphasis to add more meaning, importance, or contrast to words in a discourse. Accordingly, a syllable that is typically not stressed (e.g., a function word) may be accented to make it prominent. Conversely, a syllable that is customarily stressed lexically may be delivered with additional pitch, duration, or intensity to highlight its meaning, importance, or contrast. Every tone unit contains a key (first) and a termination (last) prominent syllable. If a tone unit has a single prominent syllable, then it is considered to be equally the key and termination prominent syllable. The termination syllable is also referred to as the tonic syllable. The relative pitch of the key and termination prominent syllables and the tone choice of the termination prominent syllable define the tone unit’s intonation pattern. Brazil postulated three evenly balanced scales of relative pitch: low, mid, and high, and five tone choices: falling, rising, rising-falling, falling-rising, and neutral as illustrated in Fig. 1.

Fig. 1 Brazil's five tone choices

The Brazil model covers both constrained and unconstrained speech in monologues and dialogs. Thus, the elements of the model (e.g., tone choice) apply equally to all types of speech.

3 Related research

In this section, we will review related research to identify techniques that can be applied to solving the problem of classifying tone choice. Brazil’s (1997) model has not been exploited in the field of computational linguistics. However, there is a large body of research on classifying ToBI Pitch Accents and Boundary Tones from which we identified candidate machine learning classifiers and acoustic feature sets for our experiments. The tones and break indices (ToBI) is a system for labeling prosodic events in speech (Wightman et al. 1992; Beckman and Elam 1997). ToBI defines three prosodic events: pitch accents, boundary tones, and break indices. Of these, pitch accents and boundary tones are the most closely related to Brazil’s tone choices. Pitch accents serve as cues for prominence, while boundary tones serve as cues for intonational phrasing. Although pitch accents are cues for prominence, there are usually more pitch accents in a dialog than there are Brazil’s prominent syllables. Boundary tones match closely with Brazil’s concept of key prominent syllables (i.e., initial boundary tones and phrasal tones) and termination prominent syllables (i.e., final boundary tones). ToBI defines eight types of pitch accents and nine types of boundary tones. There is not a one-to-one correspondence between Brazil’s tone choices and either pitch accents or boundary tones. Nonetheless, the methods of classifying them and Brazil’s tone choices are similar.

We compared several ToBI experiments involving pitch accents and boundary tones based on the accuracy to determine the candidate classifiers and feature sets we utilized in our experiments. We applied three constraints to the experiments we considered: (1) The experiment had to involve multiple speakers because single speaker classification is somewhat trivial and our goal is speaker independent recognition of Brazil’s tone choices; (2) the experiment had to classify with only acoustic features; and (3) the experiment had to include five or more classes since there are five tone choices.

There are three pitch contour models, which have been employed in the ToBI investigations. In the TILT model, intonation is characterized by parameters representing amplitude, duration, and tilt, where tilt is a measure of the shape of the pitch contour (Taylor 2000). The Bézier model is an approximation of pitch contours with Bézier functions (Escudero-Mancebo and Cardeñoso-Payo 2007). The Quantized Contour Model (QCM) (Rosenberg 2010a, b) quantizes the pitch contour of a word in the time and pitch domains, generating a low-dimensional representation of the contour. Each of these models produces a set of acoustic features, which can be classified with machine learning.

Table 1 presents the accuracy of several recent ToBI experiments along with what was classified (pitch accent or boundary tones), the number of classes classified out of the total number of classes, the number of speakers out of the total number of speakers, the pitch contour model, and machine learning classifier. Also indicated is whether the experiment met two of our constraints, i.e., multiple speakers and five or more classes. None of the experiments met our constraint of acoustic features only. All of the experiments made use of the Boston University Radio News Corpus (Ostendorf et al. 1995), except Li et al. (2010). Their corpus data was a set of 20 male and 20 female speakers from an L2 English speech corpus read by native Mandarin speakers. The speakers were asked to read 29 prompted sentences and instructed to read with a rising or falling intonation, according to an indicator next to each sentence.

Table 1 Summary of several recent ToBI experiments sorted by constraints met and accuracy (Acc)

AuToBI is a tool for automatic ToBI annotation (Rosenberg 2010a, b). Rosenberg reported on the performance of AuToBI in classifying pitch accents and boundary tones utilizing various classifiers and features in 2010 and then again in 2012. In 2010, he described the operation of AuToBI on the Boston Directions Corpus and the Columbia Games Corpus. Utilizing SVMs, AuToBI classified pitch accents of the spontaneous portion of the Boston Directions Corpus with a combined error rate of 0.284, intonational phrase final tones with 55.0 % accuracy, and intermediate phrase ending tones with 68.6 %. He did not give the pitch accent classification results on the Columbia Games Corpus, but stated the intonational phrase final tones were classified with 35.34 % accuracy, whereas intermediate phrase ending phrase accents were classified with 62.21 % accuracy. In 2012, Rosenberg examined a number of features and classifiers to improve the capability of AuToBI to classify pitch accents and boundary tones. He found the AdaBoost classifier implemented with weka did the best at classifying pitch accents (60.91 % accuracy) and that the Random Forest classifier implemented with weka was the best at classifying pitch accent (47.44 %) and pitch accent/boundary tones (74.47 %).

From the experiments that met our constraints, we chose the neural network and decision tree classifiers as candidates for our experiments. We augmented the decision tree classifier with boosting, a machine learning ensemble method designed to improve the performance of decision tree classifiers. We did not choose a Naïve Bayesian classifier because, of all machine learning techniques, Naïve Bayesian classifiers are typically the weakest (Caruana and Niculescu-Mizil 2006). We also selected two classification configurations: multi-class and pairwise coupling. In the multi-class configuration, the classifier makes a 1-of-n choice. Multi-class classifiers generally perform worse than binary classifiers. Pairwise coupling is a method of breaking a multi-class classification problem into a number of more accurate binary classification problems (Hastie and Tibshirani 1998). For feature set models, we picked the TILT and Bézier models. We did not select the Quantized Contour Model because the low number of classes in Rosenberg's (2010a, b) experiments may have inflated its accuracy relative to the TILT and Bézier model experiments.

In addition to the candidate classifiers and feature sets that we identified from the ToBI experiments, we also considered another classifier and another pitch contour model. The rule-based classifier is further detailed in Sect. 4.3. The other pitch contour model, which we call the four-point model in this paper, was derived for the rule-based classifier. This pitch contour model is the generalization of any pitch contour, i.e., every pitch contour has a first, last, minimum, and maximum pitch point. Section 4.2.1 contains a more in-depth description of the four-point model.

4 Experimental procedure

In this paper we compare the accuracy and κ of two candidate machine learning classifiers (neural network and boosting ensemble) in two configurations (multi-class and pairwise coupling) in automatically classifying the five tone choices of Brazil's intonation model. For each of the four combinations of classifier and configuration, we consider three sets of features derived from three pitch contour models: TILT, Bézier, and our four-point model. We also compare these twelve combinations with our rule-based classifier, which is founded on the four-point model.

4.1 TIMIT corpus

The DARPA TIMIT Acoustic–Phonetic Continuous Speech Corpus (TIMIT) of read speech provides speech data for the acquisition of acoustic–phonetic knowledge and for the development and evaluation of automatic speech recognition systems (Garofolo et al. 1993). TIMIT contains a total of 6300 sentences, 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. The text material in the TIMIT prompts consists of two dialect sentences, 450 phonetically-compact sentences, and 1890 phonetically-diverse sentences. The dialect sentences were intended to reveal the dialect of the speakers and were read by all 630 speakers. The phonetically-compact sentences were designed to provide a good coverage of pairs of phones, with extra occurrences of phonetic contexts thought to be either of particular interest or difficult. Each speaker read five of these sentences and each text was spoken by seven different speakers. The phonetically-diverse sentences were selected to maximize the variety of allophonic contexts found in the texts. Each speaker read three of these sentences, with each sentence being read only by a single speaker. The corpus includes hand corrected start and end times for the phones, phonemes, pauses, syllables, and words.

The TIMIT corpus is composed of constrained (i.e., short read sentences) monologic speech. We chose the TIMIT corpus over others (e.g., Boston University Radio News Corpus) because of the large number of speakers and dialects spoken.

The TIMIT corpus includes definitions for 60 phones, which are also used by other corpora. For our experiments, we utilized a subset of the corpus consisting of 84 speakers of four dialects. Our subset contained 825 utterances comprising 10,512 syllables, 994 of which were terminating prominent syllables. Table 2 presents the distribution of speakers by gender and dialect.

Table 2 Distribution of TIMIT speakers in this research by gender and dialect

We augmented the corpus by identifying the prominent syllables and the tone choices on the termination (last) prominent syllables in the experimental subset using the syllable demarcations provided with the corpus. The prominent syllables and tone choices were identified by a trained linguist who coded them both by listening to the audio files and by using Praat (Boersma and Weenink 2014), a computerized speech analysis program, to confirm the movement of the pitch contour. Approximately ten percent of the samples were analyzed by a second trained linguist to confirm the consistency of the coding. The inter-rater reliability between the two linguists was 85 to 87 %, a satisfactory rate comparable to that found in other similar studies (e.g., Kang 2010) utilizing Brazil's (1997) prosody model. The two linguists resolved any discrepancies and continued coding the data until no further discrepancies arose; the first linguist then finished coding the rest of the speech files alone. This method of annotation has been employed extensively as a reliable labeling technique in other applied linguistics studies (Kang et al. 2010; Kang and Wang 2014; Pickering 1999). The linguist identified the tone choice of 994 terminating prominent syllables in the speech samples. The distribution of tone choices is depicted in Table 3.

Table 3 Distribution of tone choices

Initially the analysts examined the pitch contours with the Multi-Speech and Computerized Speech Laboratory (CSL) software (KayPENTAX 2008), while the computer analyzed them using Praat (Boersma and Weenink 2014). We discovered significant differences between the pitch contours displayed by the two programs. This discrepancy resulted in substantial disagreement between the tone choices classified by the computer and those assigned by the human expert. In addition, further differences in the pitch contours were found even between different versions of Praat, and more still between the same version of Praat running on different computers. Maryn et al. (2009) also reported differences between the Multi-Speech and CSL software and Praat, stating that their pitch and intensity values were not comparable. Amir et al. (2009) noted the same discrepancy and added that findings from the Multi-Speech and CSL software and from Praat should not be combined. To ensure these variations did not affect our results, the analyst redid the tone choice annotations used to train the classifiers with the same version of Praat that the computer used.

4.2 Parameterization of the F0 contours

We investigated three sets of classification features, each derived from a different model of the pitch contour: four-point model, TILT model (Taylor 2000), and a model proposed by Escudero-Mancebo and Cardeñoso-Payo (2007), which consists of Bézier parameters. The pitch contour was extracted with Praat (Boersma and Weenink 2014).
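
As a concrete illustration of this extraction step, the sketch below obtains an F0 contour in Python via the parselmouth interface to Praat. It is an analogue of, not the exact pipeline used in, this study (which ran Praat itself), and the file name and syllable boundaries are hypothetical placeholders; in our experiments the syllable boundaries came from the corpus annotations.

import parselmouth  # Python interface to Praat (an assumption; the study used Praat directly)

snd = parselmouth.Sound("utterance.wav")      # hypothetical file name
pitch = snd.to_pitch()
times = pitch.xs()
f0 = pitch.selected_array['frequency']        # 0 Hz where Praat found no voicing

# Keep only voiced frames inside the terminating prominent syllable.
start_s, end_s = 1.23, 1.51                   # hypothetical syllable boundaries
mask = (times >= start_s) & (times <= end_s) & (f0 > 0)
contour = list(zip(times[mask], f0[mask]))    # (time_s, f0_hz) pairs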

4.2.1 Four-point model features

The four-point model is our own design and is proposed here for the first time. It has two sub-models, as depicted in Fig. 2: rise-fall-rise and fall-rise-fall.

Fig. 2 Four-point model sub-models: rise-fall-rise (left) and fall-rise-fall (right)

The rise-fall-rise sub-model is applied if the maximum pitch point occurs earlier in time than the minimum pitch point; the fall-rise-fall sub-model is applied if the minimum pitch point occurs earlier than the maximum pitch point. The features for the sub-models are built on the following four points (from which the model derives its name): first is the pitch of the first point in the pitch contour (Hz); last is the pitch of the last point in the pitch contour (Hz); max is the maximum pitch in the pitch contour (Hz); and min is the minimum pitch in the pitch contour (Hz). The features for the rise-fall-rise sub-model are first-rise (r1), first-fall (f1), and second-rise (r2), calculated as follows:

$$r1 = max-first$$
(1)
$$f1 = max-min$$
(2)
$$r2 = last-min$$
(3)

The features for the fall-rise-fall sub-model are first-fall (f1), first-rise (r1), and second-fall (f2) and they are calculated as follows:

$$f1 = first-min$$
(4)
$$r1 = max-min$$
(5)
$$f2 = max-last$$
(6)

We apply this model because it is the generalization of any pitch contour; i.e., every pitch contour has a first, last, minimum, and maximum pitch point. In some cases, some or all of the four points might coincide. For example, the maximum may also be the first point. Theoretically, the classifiers should determine the tone choice by the significance or insignificance of the rises and falls. The significance of the rises and falls is determined during the classifier training. For instance, as depicted in Fig. 3, in the rise-fall-rise sub-model, if all the rises and falls are insignificant, then the tone choice is neutral. If r1 is insignificant, f1 is significant, and r2 is insignificant, the tone choice is fall.
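
As an illustration, the following is a minimal sketch (our own simplification, not a reference implementation) of computing the sub-model and its three features from Eqs. 1-6. It assumes the contour is a list of (time_s, f0_hz) pairs for one terminating prominent syllable, with unvoiced frames already removed.

def four_point_features(contour):
    """Return (sub_model, features): ('rise-fall-rise', (r1, f1, r2)) or
    ('fall-rise-fall', (f1, r1, f2)), with all feature values in Hz."""
    times = [t for t, _ in contour]
    pitches = [f for _, f in contour]

    first, last = pitches[0], pitches[-1]
    max_i = max(range(len(pitches)), key=lambda i: pitches[i])
    min_i = min(range(len(pitches)), key=lambda i: pitches[i])
    f_max, f_min = pitches[max_i], pitches[min_i]

    if times[max_i] <= times[min_i]:
        # Maximum occurs before the minimum: rise-fall-rise sub-model (Eqs. 1-3).
        return 'rise-fall-rise', (f_max - first, f_max - f_min, last - f_min)
    else:
        # Minimum occurs before the maximum: fall-rise-fall sub-model (Eqs. 4-6).
        return 'fall-rise-fall', (first - f_min, f_max - f_min, f_max - last)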

Fig. 3 Examples of how the significance of the rises and falls determines the tone choice

Table 4 specifies the truth table for all possible combinations of significant and insignificant rises and falls, which are illustrated in Fig. 3. In the last row of Table 4, all of the rises and falls are significant, so the tone choice could be either fall-rise or rise-fall. In these two cases, the tone choice was determined by the last two significant movements, i.e., f1 and r2 for the rise-fall-rise sub-model and r1 and f2 for the fall-rise-fall sub-model.

Table 4 Truth table for all possible combinations of significant and insignificant rises and falls; 0 = rise/fall is insignificant (i.e., it is less than a threshold); 1 = rise/fall is significant (i.e., it is more than a threshold)

4.2.2 TILT model features

TILT is one of the more popular models for parameterizing pitch contours (Taylor 2000). The model was developed to automatically analyze and synthesize speech intonation. In the model, intonation is represented as a sequence of events, which are characterized by parameters representing amplitude, duration, and tilt. Tilt is a measure of the shape of the event, or pitch contour. Festival (The Centre for Speech Technology Research 2014), a popular public domain text-to-speech system, applies this model to synthesize speech intonation. The model is illustrated in Fig. 4. Three points are defined: the start of the event, the peak (the highest point), and the end of the event.

Fig. 4 Parameters of the RFC model in the TILT model of a pitch contour

Each event is characterized by five RFC (rise/fall/connection) parameters: rise amplitude (difference in pitch between the pitch value at the peak and at the start, which is always greater than or equal to zero), rise duration (distance in time from start of the event to the peak), fall amplitude (pitch distance from the end to the peak, which is always less than or equal to zero), fall duration (distance in time from the peak to the end), and vowel position (distance in time from start of pitch contour to start of vowel). The TILT representation transforms four of the RFC parameters into three TILT parameters: duration (sum of the rise and fall durations), amplitude (sum of absolute values of the rise and fall amplitudes), and tilt (a dimensionless number which expresses the overall shape of the event). The TILT parameters are calculated as follows:

$$s = {\text{ start of event}}$$
(7)
$$p = {\text{ peak }}\left( {\text{the highest point}} \right)$$
(8)
$$e = {\text{ end of event}}$$
(9)
$$a_{rise} = {\text{ difference in pitch between the pitch value at the peak }}\left( p \right){\text{ and at the start }}\left( s \right), \, \ge 0$$
(10)
$$d_{rise} = {\text{distance in time from start}}\left( s \right){\text{of the event to the peak}}\left( p \right)$$
(11)
$$a_{fall} = {\text{ pitch distance from the end }}\left( e \right){\text{ to the peak }}\left( p \right), \, \le 0$$
(12)
$$d_{fall} = {\text{ distance in time from the peak }}\left( p \right){\text{ to the end }}\left( e \right)$$
(13)
$$d = {\text{ duration }} = d_{rise} + d_{fall}$$
(14)
$$a = {\text{ amplitude }} = \, \left| {a_{rise} } \right| \, + \, \left| {a_{fall} } \right|$$
(15)
$$t = {\text{ tilt }} = \frac{{\left| {a_{rise} } \right| - |a_{fall} |}}{{2\left( {\left| {a_{rise} } \right| + |a_{fall} |} \right)}} + \frac{{d_{rise} - d_{fall} }}{{2\left( {d_{rise} + d_{fall} } \right)}}$$
(16)
$$c = {\text{ pitch contour }} = \, \left\{ {f_{i} ,\,f_{i + 1} , \, \ldots ,\,f_{i + N} } \right\}$$
(17)
$$f_{i} = {\text{ frequency }}\left( {\text{Hz}} \right){\text{ of}}\;i{\text{th point in pitch contour}}$$
(18)
$$f_{v} \in c$$
(19)
$$v = {\text{ index of the beginning of the vowel}}$$
(20)

In our experiments, duration (d), amplitude (a), tilt (t), and vowel position (v) were the input features to the classifiers.
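
A compact sketch of this computation (under our reading of the TILT parameterization described above, not Taylor's reference implementation) follows. It assumes the same (time_s, f0_hz) contour representation as before, plus the hand-labeled start time of the vowel.

def tilt_features(contour, vowel_start_s):
    """Return the four classifier inputs: duration d, amplitude a, tilt t,
    and vowel position v (Eqs. 7-20)."""
    times = [t for t, _ in contour]
    pitches = [f for _, f in contour]

    peak_i = max(range(len(pitches)), key=lambda i: pitches[i])
    a_rise = pitches[peak_i] - pitches[0]     # >= 0 because the peak is the maximum
    a_fall = pitches[-1] - pitches[peak_i]    # <= 0
    d_rise = times[peak_i] - times[0]
    d_fall = times[-1] - times[peak_i]

    d = d_rise + d_fall                       # Eq. 14
    a = abs(a_rise) + abs(a_fall)             # Eq. 15
    t = 0.0                                   # Eq. 16, guarding against flat or instantaneous events
    if a > 0:
        t += (abs(a_rise) - abs(a_fall)) / (2.0 * a)
    if d > 0:
        t += (d_rise - d_fall) / (2.0 * d)
    v = vowel_start_s - times[0]              # vowel position relative to the contour start
    return d, a, t, v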

4.2.3 Bézier model features

Escudero-Mancebo and Cardeñoso-Payo (2007) proposed an alternative to the TILT model that is constructed from the approximation of the pitch contours with Bézier functions as illustrated in Fig. 5.

Fig. 5 Example of the Bézier function fitting stylization from Escudero-Mancebo and Cardeñoso-Payo (2007)

Similarly we used Bézier functions to approximate the pitch contour of the terminating prominent syllable, where:

$${\mathbf{P}} = {\text{ pitch contour}}$$
(21)
$${\mathbf{P}}_{i} = \, \left( {f_{i} ,t_{i} } \right) \, = {\text{F}}0 \, \left( {\text{Hz}} \right){\text{ at time }}\left( {\text{s}} \right)t_{i}$$
(22)
$$n = \, \left| {\mathbf{P}} \right| \, - { 1}$$
(23)
$$b = {\text{number of B}}\'e {\text{zier points}} = 4$$
(24)
$$x = \left( {0, \frac{1}{b - 1}, \frac{2}{b - 1}, 1} \right)$$
(25)
$$j = \, \left( { 1,{ 2},{ 3},{ 4}} \right)$$
(26)
$${\mathbf{B}}\left( {x_{j} } \right) \, = \, \left( {p_{j} ,x_{j} } \right)$$
(27)
$$p_{j} = {\text{ B}}\'e {\text{zier approximation of F}}0 \, \left( {\text{Hz}} \right){\text{ at time }}\left( {\text{s}} \right)x_{j}$$
(28)
$${\mathbf{B}} (x) = \sum\limits_{{i = 0}}^{n} {b_{{i,n}} } (x){\mathbf{P}}_{i},\quad 0 \, \le x \le {1}$$
(29)
$$b_{i,n} \left( x \right) = \left( {\begin{array}{*{20}c} n \\ i \\ \end{array} } \right)x^{i} (1 - x)^{n - i} \quad i = 0, \ldots , n$$
(30)

The resulting four Bézier parameters (p1, p2, p3, and p4) are the features on which the tone choice classifiers are trained.
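
The following sketch (ours, not the cited authors' code) computes these four features as defined in Eqs. 21-30: the contour's F0 values are treated as control points of a degree-n Bézier curve, and the curve's F0 component is evaluated at b = 4 evenly spaced parameter values.

import numpy as np
from scipy.stats import binom

def bezier_features(f0, b=4):
    """Return (p1, ..., pb), the Bezier approximation of F0 at the
    parameter values x = (0, 1/(b-1), ..., 1)."""
    f0 = np.asarray(f0, dtype=float)
    n = len(f0) - 1                           # Eq. 23
    xs = np.linspace(0.0, 1.0, b)             # Eq. 25
    feats = []
    for x in xs:
        # Bernstein weights C(n, i) x^i (1 - x)^(n - i) for i = 0..n (Eq. 30);
        # binom.pmf computes exactly this and stays numerically stable for large n.
        w = binom.pmf(np.arange(n + 1), n, x)
        feats.append(float(np.dot(w, f0)))    # F0 component of Eq. 29
    return feats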

4.3 Classifiers

We tested two standard machine learning classifiers to classify tone choices: neural network and boosting. We employed the Matlab patternnet function with ten hidden nodes and the Levenberg–Marquardt optimization network training function to implement the neural network classifier (MathWorks 2013). Boosting is an ensemble classifier that combines the outcomes of weak classifiers (typically decision trees) to improve their accuracy. Boosting was implemented with the Matlab fitensemble function using the AdaBoostM1 (binary classifier) or AdaBoostM2 (multi-class classifier) booster and 100 decision tree learners (i.e., weak classifiers).
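
For readers without Matlab, the snippet below is only a rough scikit-learn analogue of the two classifiers as configured above; it is not the setup used in our experiments (for example, MLPClassifier does not offer Levenberg-Marquardt training).

from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier

# Ten hidden nodes, mirroring the patternnet configuration described above.
neural_net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000)
# AdaBoost with 100 decision-tree weak learners, mirroring fitensemble.
boosting = AdaBoostClassifier(n_estimators=100)

# X holds the feature vectors (four-point, TILT, or Bezier features) and
# y the tone-choice labels 1-5:
# neural_net.fit(X_train, y_train); boosting.fit(X_train, y_train)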

We also utilized a rule-based classifier that implemented the four-point model truth table specified in Table 4 above. The thresholds for significance versus insignificance of each rise and fall (i.e., rise-fall-rise sub-model: r1, f1, and r2; fall-rise-fall sub-model: f1, r1, and f2) were determined during training. A simple brute-force method of trying every combination of unique rises and falls in the training data as thresholds determined the set of thresholds (TH*rfr, TH*frf) that maximized the accuracy, as follows:

$$TC = \, \left\{ { 1,{ 2},{ 3},{ 4},{ 5}} \right\}{\text{ corresponding to tone choices }}\left\{ {{\text{rise}},{\text{ neutral}},{\text{ fall}},{\text{ fall-rise}},{\text{ rise-fall}}} \right\}$$
(31)
$$T = \, \left\{ {\text{human classified tone choices for training data}} \right\}$$
(32)
$$T_{rfr} = \, \left\{ {\text{human classified tone choices for training data for rise-fall-rise sub-model}} \right\}$$
(33)
$$T_{frf} = \, \left\{ {\text{human classified tone choices for training data for fall-rise-fall sub-model}} \right\}$$
(34)
$$T = T_{rfr} \cup T_{frf}$$
(35)
$$\emptyset = T_{rfr} \cap T_{frf}$$
(36)
$$N = \, \left| {T_{rfr} } \right|$$
(37)
$$M = \, \left| {T_{frf} } \right|$$
(38)
$$T_{rfr} = \, \left\{ {t_{1} , \ldots ,t_{N} } \right\}$$
(39)
$$T_{frf} = \, \left\{ {t_{1} , \ldots ,t_{M} } \right\}$$
(40)
$$t_{i} \in TC$$
(41)
$$tr1_{i} = r1\;{\text{for}}\;i{\text{-th training sample}}$$
(42)
$$tf1_{i} = f1\;{\text{for}}\;i{\text{-th training sample}}$$
(43)
$$tr2_{i} = r2\;{\text{for}}\;i{\text{-th training sample}}$$
(44)
$$tf2_{i} = f2\;{\text{for}}\;i{\text{-th training sample}}$$
(45)
$$F_{rfr} = \, \left\{ {\left( {tr1_{1} ,tf1_{1} ,tr2_{1} } \right), \, \ldots , \, \left( {tr1_{N} ,tf1_{N} ,tr2_{N} } \right)} \right\}$$
(46)
$$F_{frf} = \, \left\{ {\left( {tf1_{1} ,tr1_{1} ,tf2_{1} } \right), \, \ldots , \, \left( {tf1_{M} ,tr1_{M} ,tf2_{M} } \right)} \right\}$$
(47)
$$\emptyset = F_{rfr} \cap F_{frf}$$
(48)
$$r1_{rfr} = \, \left\{ {tr1_{1} , \, \ldots ,tr1_{N} } \right\}$$
(49)
$$f1_{rfr} = \, \left\{ {tf1_{1} , \, \ldots ,tf1_{N} } \right\}$$
(50)
$$r2_{rfr} = \, \left\{ {tr2_{1} , \, \ldots ,tr2_{N} } \right\}$$
(51)
$$R1_{rfr} = \left\{ {{\text{unique values in }}r1_{rfr} } \right\}$$
(52)
$$F1_{rfr} = \left\{ {{\text{unique values in }}f1_{rfr} } \right\}$$
(53)
$$R2_{rfr} = \left\{ {{\text{unique values in }}r2_{rfr} } \right\}$$
(54)
$$I = \, \left| {R1_{rfr} } \right|$$
(55)
$$J = \, \left| {F1_{rfr} } \right|$$
(56)
$$K = \, \left| {R2_{rfr} } \right|$$
(57)
$$r1_{i} \in R1_{rfr}$$
(58)
$$f1_{j} \in F1_{rfr}$$
(59)
$$r2_{k} \in R2_{rfr}$$
(60)
$$\lambda_{rfr} \left( {F_{rfr} ,\left( {r1_{i} , \, f1_{j} , \, r2_{k} } \right)} \right) \, = {\text{rule-based classifier applying rise-fall-rise sub-model of Table 4}}$$
(61)
$$\lambda_{rfr} \left( {F_{rfr} ,\left( {r1_{i} , \, f1_{j} , \, r2_{k} } \right)} \right) \in TC$$
(62)
$$f1_{frf} = \, \left\{ {tf1_{1} , \, \ldots ,tf1_{M} } \right\}$$
(63)
$$r1_{frf} = \, \left\{ {tr1_{1} , \, \ldots ,tr1_{M} } \right\}$$
(64)
$$f2_{frf} = \, \left\{ {tf2_{1} , \, \ldots ,tf2_{M} } \right\}$$
(65)
$$F1_{frf} = \left\{ {{\text{unique values in }}f1_{frf} } \right\}$$
(66)
$$R1_{frf} = \left\{ {{\text{unique values in }}r1_{frf} } \right\}$$
(67)
$$F2_{frf} = \left\{ {{\text{unique values in }}f2_{frf} } \right\}$$
(68)
$$F = \, \left| {F1_{frf} } \right|$$
(69)
$$G = \, \left| {R1_{frf} } \right|$$
(70)
$$H = \, \left| {F2_{frf} } \right|$$
(71)
$$f1_{f} \in F1_{frf}$$
(72)
$$r1_{g} \in R1_{frf}$$
(73)
$$f2_{h} \in F2_{frf}$$
(74)
$$\lambda_{frf} \left( {F_{frf} ,\left( {f1_{f} , \, r1_{g} , \, f2_{h} } \right)} \right) \, = {\text{rule-based classifier applying fall-rise-fall sub-model of Table 4}}$$
(75)
$$\lambda_{frf} \left( {F_{frf} ,\left( {f1_{f} , \, r1_{g} , \, f2_{h} } \right)} \right) \in TC$$
(76)
$$X = \, \left\{ {\text{rule-based classifier tone choices for training samples for a sub-model}} \right\}$$
(77)
$$Y = \, \left\{ {\text{human tone choices for training samples for a sub-model}} \right\}$$
(78)
$$Z = \, \left| X \right| \, = \, \left| Y \right|$$
(79)
$$x_{i} \in X$$
(80)
$$y_{i} \in Y$$
(81)
$$a_{i} = \left\{ {\begin{array}{*{20}c} {1,\quad x_{i} = y_{i} } \\ {0,\quad x_{i} \ne y_{i} } \\ \end{array} } \right.$$
(82)
$$A\left( {X,Y} \right) \, = Accuracy = \frac{{\mathop \sum \nolimits_{i = 1}^{Z} a_{i} }}{Z}$$
(83)
$$TH_{rfr}^{*} = {\text{ arg max}}_{{ 1\le i \le I,{ 1} \le j \le J,{ 1} \le k \le K}} \left[ {A\left( {\lambda_{rfr} \left( {F_{rfr} ,\left( {r1_{i} , \, f1_{j} , \, r2_{k} } \right)} \right),T_{rfr} } \right)} \right]$$
(84)
$$TH_{frf}^{*} = {\text{ arg max}}_{{ 1\le f \le F,{ 1} \le g \le G, 1\le h \le H}} \left[ {A\left( {\lambda_{frf} \left( {F_{frf} ,\left( {f1_{f} , \, r1_{g} , \, f2_{h} } \right)} \right),T_{frf} } \right)} \right]$$
(85)
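
The sketch below compresses Eqs. 31-85 into code for the rise-fall-rise sub-model (the fall-rise-fall case is symmetric). It is illustrative rather than a reference implementation: samples is assumed to be a list of ((r1, f1, r2), tone) training pairs, and only the two rows of the Table 4 truth table spelled out in the text above are shown.

from itertools import product

TRUTH_TABLE_RFR = {
    (0, 0, 0): 'neutral',   # no significant movement
    (0, 1, 0): 'fall',      # only f1 is significant
    # ... remaining rows of Table 4 ...
}

def classify_rfr(feats, thresholds, table=TRUTH_TABLE_RFR):
    # Map each rise/fall to 1 (significant) if it exceeds its threshold, else 0.
    pattern = tuple(int(v > th) for v, th in zip(feats, thresholds))
    return table.get(pattern)

def train_thresholds_rfr(samples):
    """Brute-force search (Eq. 84) for the (r1, f1, r2) thresholds that
    maximize training accuracy over all unique observed values."""
    r1_vals = sorted({f[0] for f, _ in samples})
    f1_vals = sorted({f[1] for f, _ in samples})
    r2_vals = sorted({f[2] for f, _ in samples})
    best, best_acc = None, -1.0
    for th in product(r1_vals, f1_vals, r2_vals):
        acc = sum(classify_rfr(f, th) == t for f, t in samples) / len(samples)
        if acc > best_acc:
            best, best_acc = th, acc
    return best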

4.4 Classifier configurations

We analyzed two different configurations of the neural network and boosting ensemble classifiers: multi-class and pairwise coupling. We employed fivefold cross-validation in each of the experiments to tune the parameters of the machine learning classifiers (i.e., training) and then test them. The method for determining the classifier outputs is described below for each combination of classifier and configuration.

4.4.1 Neural network multi-class

The multi-class neural network provides five outputs, one for each of the possible tone choices. The outputs are real numbers in the range from zero to one. The output with the highest value is selected as the tone choice. There is one multi-class neural network for the TILT model; one for the Bézier; and one for each of the four-point model sub-models.

4.4.2 Boosting ensemble multi-class

The multi-class ensemble provides one output, which is from the set {1, 2, 3, 4, 5} corresponding to the set of tone choices {rise, neutral, fall, fall-rise, rise-fall}. There are four multi-class ensembles; one for the TILT model; one for the Bézier; and one for each of the two four-point model sub-models.

4.4.3 Neural network pairwise coupling

The neural network pairwise coupling configuration consists of ten neural networks trained to classify each combination of tone choices: rise versus neutral, rise versus fall, rise versus fall-rise, rise versus rise-fall, neutral versus fall, neutral versus fall-rise, neutral versus rise-fall, fall versus fall-rise, fall versus rise-fall, and fall-rise versus rise-fall. There are ten neural networks for the TILT model; ten for the Bézier; and ten for each of the four-point model sub-models. The output of each classifier is a real number between zero and one. The outputs are treated as probabilities. The probabilities are combined as follows and the one with the highest probability is the tone choice selected.

$$T = \, \left\{ { 1,{ 2},{ 3},{ 4},{ 5}} \right\}{\text{ corresponding to tone choices }}\left\{ {{\text{rise}},{\text{ neutral}},{\text{ fall}},{\text{ fall-rise}},{\text{ rise-fall}}} \right\}$$
(86)
$$t \in T$$
(87)
$$o_{i,j} = {\text{ output of classifier trained to classify tone choice}}\,i\,{\text{vs}}\,j$$
(88)
$$o_{i,j} \in {\mathbb{R}}$$
(89)
$$0 \, \le o_{i,j} \le { 1}$$
(90)
$$p_{ 1} = Pr\left( {{\text{tone choice }} = {\text{ rise}}} \right) \, = o_{r,n} \cdot o_{r,f} \cdot o_{r,fr} \cdot o_{r,rf}$$
(91)
$$p_{ 2} = Pr\left( {{\text{tone choice }} = {\text{ neutral}}} \right) \, = o_{n,f} \cdot o_{n,fr} \cdot o_{n,rf} \cdot \left( { 1 { } - o_{r,n} } \right)$$
(92)
$$p_{ 3} = Pr\left( {{\text{tone choice }} = {\text{ fall}}} \right) \, = o_{f,fr} \cdot o_{f,rf} \cdot \left( { 1 { } - o_{r,f} } \right) \cdot \left( { 1 { }{-}o_{n,f} } \right)$$
(93)
$$p_{ 4} = Pr\left( {{\text{tone choice }} = {\text{ fall - rise}}} \right) \, = o_{fr,rf} \cdot \left( { 1 { }{-}o_{r,fr} } \right) \cdot \left( { 1 { }{-}o_{n,fr} } \right) \cdot \left( { 1 { }{-}o_{f,fr} } \right)$$
(94)
$$p_{ 5} = Pr\left( {{\text{tone choice }} = {\text{ rise - fall}}} \right) \, = \, \left( { 1 { }{-}o_{fr,rf} } \right) \cdot \left( { 1 { }{-}o_{r,rf} } \right) \cdot \left( { 1 { }{-}o_{n,rf} } \right) \cdot \left( { 1 { }{-}o_{f,rf} } \right)$$
(95)
$$t^{*} = {\text{arg max}}_{1 \le t \le 5} \;p_{t}$$
(96)
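
A small sketch of this combination rule (Eqs. 86-96) is given below. The dict o is assumed to hold, under the key (i, j) with i < j, the output of the network trained on tone i versus tone j (1 = rise, 2 = neutral, 3 = fall, 4 = fall-rise, 5 = rise-fall), interpreted as the probability that the sample belongs to class i rather than class j.

def combine_pairwise_nn(o):
    """Return t* = arg max_t p_t (Eq. 96) from the ten pairwise outputs."""
    p = {}
    for t in range(1, 6):
        prob = 1.0
        for u in range(1, 6):
            if u == t:
                continue
            i, j = (t, u) if t < u else (u, t)      # classifiers are keyed with i < j
            # o[(i, j)] is Pr(class = i); take its complement when t plays the role of j.
            prob *= o[(i, j)] if t == i else 1.0 - o[(i, j)]
        p[t] = prob
    return max(p, key=p.get)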

4.4.4 Boosting ensemble pairwise coupling

The boosting ensemble pairwise coupling configuration consists of ten boosting ensembles trained to classify each combination of tone choices: rise versus neutral, rise versus fall, rise versus fall-rise, rise versus rise-fall, neutral versus fall, neutral versus fall-rise, neutral versus rise-fall, fall versus fall-rise, fall versus rise-fall, and fall-rise versus rise-fall. There are ten ensembles for the TILT model; ten for the Bézier; and ten for each of the four-point model sub-models. The output of each classifier is from the set {1, 2, 3, 4, 5} corresponding to the set of tone choices {rise, neutral, fall, fall-rise, rise-fall}. For example, the output of the rise versus neutral classifier would be either 1 or 2. The accuracy of the classifier classifying the training data correctly is treated as the probability that the classifier output is correct. The probabilities are combined as follows and the one with the highest probability is the tone choice selected.

$$T = \left\{ {1, 2, 3, 4, 5} \right\}\;{\text{corresponding to tone choices}}\;\left\{ {{\text{rise}},\;{\text{neutral}},\;{\text{fall}},\;{\text{fall-rise}},\;{\text{rise-fall}}} \right\}$$
(97)
$$o_{i,j} = {\text{ output of classifier trained to classify tone choice}}\;i\;{\text{vs}}\;j$$
(98)
$$o_{i,j} \in T$$
(99)
$${a_{i,j}} = {\text{accuracy of classifier trained to classify tone choice}}\,i\,{\text{vs.}}\;j$$
(100)
$$X_{i,j} = \, \left\{ {{\text{tone choices for training samples from classifier classifying tone choice}}\;i\;{\text{vs}}\;j} \right\}$$
(101)
$$Y_{i,j} = \, \left\{ {{\text{tone choices for training samples from human classifying tone choice}}\;i\;{\text{vs}}\;j} \right\}$$
(102)
$$Z_{i,j} = \, \left| {X_{i,j} } \right| \, = \, \left| {Y_{i,j} } \right|$$
(103)
$$x_{k} \in X_{i,j}$$
(104)
$$y_{k} \in Y_{i,j}$$
(105)
$$b_{k} = \left\{ {\begin{array}{*{20}c} {1,\quad x_{k} = y_{k} } \\ {0,\quad x_{k} \ne y_{k} } \\ \end{array} } \right.$$
(106)
$$a_{i,j} = \frac{{\mathop \sum \nolimits_{k = 1}^{{Z_{i,j} }} b_{k} }}{{Z_{i,j} }}$$
(107)
$$t \in T$$
(108)
$$p(o_{i,j} , a_{i,j} , t) = \left\{ {\begin{array}{ll} {a_{i,j} ,\quad o_{i,j} = t} \\ {1 - a_{i,j},\quad o_{i,j} \ne t} \\ \end{array} } \right.$$
(109)
$$p_{ 1} = Pr\left( {{\text{tone choice }} = {\text{ rise}}} \right) \, = p\left( {o_{r,n} ,a_{r,n} ,{ 1}} \right)\cdot p\left( {o_{r,f} ,a_{r,f} ,{ 1}} \right)\cdot p\left( {o_{r,fr} ,a_{r,fr} ,{ 1}} \right)\cdot p\left( {o_{r,rf} ,a_{r,rf} ,{ 1}} \right)$$
(110)
$$p_{ 2} = Pr\left( {{\text{tone choice }} = {\text{ neutral}}} \right) \, = p\left( {o_{r,n} ,a_{r,n} ,{ 2}} \right)\cdot p\left( {o_{n,f} ,a_{n,f} ,{ 2}} \right)\cdot p\left( {o_{n,fr} ,a_{n,fr} ,{ 2}} \right)\cdot p\left( {o_{n,rf} ,a_{n,rf} ,{ 2}} \right)$$
(111)
$$p_{ 3} = Pr\left( {{\text{tone choice }} = {\text{ fall}}} \right) \, = p\left( {o_{r,f} ,a_{r,f} ,{ 3}} \right)\cdot p\left( {o_{n,f} ,a_{n,f} ,{ 3}} \right)\cdot p\left( {o_{f,fr} ,a_{f,fr} ,{ 3}} \right)\cdot p\left( {o_{f,rf} ,a_{f,rf} ,{ 3}} \right)$$
(112)
$$p_{ 4} = Pr\left( {{\text{tone choice }} = {\text{ fall-rise}}} \right) \, = p\left( {o_{r,fr} ,a_{r,fr} ,{ 4}} \right)\cdot p\left( {o_{n,fr} ,a_{n,fr} ,{ 4}} \right)\cdot p\left( {o_{f,fr} ,a_{f,fr} ,{ 4}} \right)\cdot p\left( {o_{fr,rf} ,a_{fr,rf} ,{ 4}} \right)$$
(113)
$$p_{ 5} = Pr\left( {{\text{tone choice }} = {\text{ rise-fall}}} \right) \, = p\left( {o_{r,rf} ,a_{r,rf} ,{ 5}} \right)\cdot p\left( {o_{n,rf} ,a_{n,rf} ,{ 5}} \right)\cdot p\left( {o_{f,rf} ,a_{f,rf} ,{ 5}} \right)\cdot p\left( {o_{fr,rf} ,a_{fr,rf} ,{ 5}} \right)$$
(114)
$$t^* = {\text{ arg max}}_{ 1\le t \le 5} p_{t}$$
(115)
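
The analogous sketch for the boosting ensembles (Eqs. 97-115) differs only in that each pairwise ensemble returns a hard label o[(i, j)] in {1, ..., 5} and its training accuracy a[(i, j)] stands in for the probability that this label is correct (Eq. 109).

def combine_pairwise_boosting(o, a):
    """Return t* = arg max_t p_t (Eq. 115)."""
    p = {}
    for t in range(1, 6):
        prob = 1.0
        for u in range(1, 6):
            if u == t:
                continue
            key = (t, u) if t < u else (u, t)       # ensembles are keyed with i < j
            prob *= a[key] if o[key] == t else 1.0 - a[key]
        p[t] = prob
    return max(p, key=p.get)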

4.5 Experimental design

We employed fivefold cross-validation in each of the experiments. The 84 speakers were randomly allocated to folds. Speakers, rather than utterances, were allotted to folds to guarantee that training and testing on the same speaker did not bias the trials. Thirteen experiments were conducted: one for each combination of the two classifiers (neural network and boosting ensemble), two configurations (multi-class and pairwise coupling), and three sets of features (from the TILT, Bézier, and four-point models), plus one experiment for the rule-based classifier.
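
A minimal sketch of this speaker-level fold assignment (our own illustration; the fold-size balancing is approximate) is shown below; utterances is assumed to map each utterance id to its speaker id.

import random

def assign_folds(speakers, n_folds=5, seed=0):
    """Shuffle the speakers and assign each one to a fold."""
    speakers = list(speakers)
    random.Random(seed).shuffle(speakers)
    return {spk: i % n_folds for i, spk in enumerate(speakers)}

def split(utterances, fold_of_speaker, test_fold):
    """Split utterances so no speaker appears in both training and test sets."""
    train = [u for u, s in utterances.items() if fold_of_speaker[s] != test_fold]
    test = [u for u, s in utterances.items() if fold_of_speaker[s] == test_fold]
    return train, test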

5 Results

In 13 experimental setups, we examined the performance of combinations of three classifiers in two configurations and three sets of features in automatically classifying the tone choice of a termination prominent syllable. We calculated accuracy and Cohen’s kappa coefficient (κ) (Cohen 1960) to evaluate the thirteen approaches of classifying tone choice. Accuracy is calculated as follows:

$$H_{test} = \, \left\{ {\text{human tone choices for test samples}} \right\}$$
(116)
$$M_{test} = \, \left\{ {\text{machine tone choices for test samples}} \right\}$$
(117)
$$N = \, \left| {M_{test} } \right| \, = \, \left| {H_{test} } \right|$$
(118)
$$h_{i} \in H_{test}$$
(119)
$$m_{i} \in M_{test}$$
(120)
$$a_{i} = \left\{ {\begin{array}{*{20}c} {1,\quad m_{i} = h_{i} } \\ {0,\quad m_{i} \ne h_{i} } \\ \end{array} } \right.$$
(121)
$$Accuracy = \frac{{\mathop \sum \nolimits_{i = 1}^{N} a_{i} }}{N}$$
(122)

Cohen’s kappa coefficient (κ) is calculated as follows:

$$Pr\left( a \right) \, = {\text{ relative observed agreement between human and machine }} = Accuracy$$
(123)
$$T = \, \left\{ { 1,{ 2},{ 3},{ 4},{ 5}} \right\}{\text{ corresponding to tone choices }}\left\{ {{\text{rise}},{\text{ neutral}},{\text{ fall}},{\text{ fall-rise}},{\text{ rise-fall}}} \right\}$$
(124)
$$t \in T$$
(125)
$$b_{t,i} = \left\{ {\begin{array}{*{20}c} {1,\quad h_{i} = t} \\ {0,\quad h_{i} \ne t} \\ \end{array} } \right.$$
(126)
$$c_{t,i} = \left\{ {\begin{array}{*{20}c} {1,\quad m_{i} = t} \\ {0,\quad m_{i} \ne t} \\ \end{array} } \right.$$
(127)
$$Pr\left( {h_{i} = t} \right) \, = \;\frac{{\mathop \sum \nolimits_{i = 1}^{N} b_{t,i} }}{N}$$
(128)
$$Pr\left( {m_{i} = t} \right) \, = \frac{{\mathop \sum \nolimits_{i = 1}^{N} c_{t,i} }}{N}$$
(129)
$$Pr\left( e \right) \, = {\text{ probability of chance agreement between human and machine}}$$
(130)
$$Pr\left( e \right) = \sum\limits_{t = 1}^{5} {Pr\left( {h_{i} = t} \right) \cdot Pr\left( {m_{i} = t} \right)}$$
(131)
$$\kappa = \frac{{\Pr \left( a \right) - { \Pr }(e)}}{{1 - { \Pr }(e)}}$$
(132)
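
For reference, the two metrics reduce to the short computation below (a sketch following Eqs. 116-132), where h and m are equal-length sequences of human and machine tone-choice codes 1-5.

from collections import Counter

def accuracy(h, m):
    return sum(hi == mi for hi, mi in zip(h, m)) / len(h)

def cohens_kappa(h, m):
    n = len(h)
    pr_a = accuracy(h, m)                                   # observed agreement, Eq. 123
    h_freq, m_freq = Counter(h), Counter(m)
    # Chance agreement: sum over classes of Pr(h = t) * Pr(m = t), Eqs. 128-131.
    pr_e = sum((h_freq[t] / n) * (m_freq[t] / n) for t in range(1, 6))
    return (pr_a - pr_e) / (1.0 - pr_e)                     # Eq. 132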

Table 5 displays the accuracy and Cohen’s kappa coefficient (κ) of the three feature models: four-point, TILT, and Bézier; using two classifiers: neural network and boosting; in two configurations: multi-class and pairwise coupling. It also presents these metrics for the rule-based classifier. The accuracy and Cohen’s kappa coefficient (κ) are the mean of the five folds.

Table 5 Accuracy and Cohen’s kappa coefficient (κ) for different feature models, classifiers, and configurations

The rule-based classifier, which is built on our four-point model, classified better than the others with an accuracy of 75.1 % and a Cohen’s kappa coefficient of 0.73 (bolded in Table 5). We believe this happened because the four-point model, on which the rule-based classifier is founded, is a more general model of pitch contour than either the TILT or Bézier models. Our initial hypothesis was that a more general model was needed to model the more complex pitch contours of Brazil’s tone choices.

From a model perspective, our four-point model was the best with a mean classifier accuracy of 74.1 % and a mean classifier κ of 0.71, followed by the Bézier model (71.0 %, 0.68) and the TILT model (67.4 %, 0.65). The TILT model may have functioned poorly because it did not account for Brazil’s fall-rise tone choice. From a machine learning classifier point of view, the boosting ensemble was better than the neural network with a mean classifier accuracy of 71.6 versus 69.9 % and a mean κ of 0.69 versus 0.67.

The findings for the multi-class configuration versus the pairwise coupling configuration were mixed. The multi-class configuration worked better for the neural network with all three models, and it also achieved better results with the boosting ensemble when our four-point model was employed. However, with the boosting ensemble and the other two models, the pairwise coupling configuration achieved better accuracy and κ than the multi-class configuration.

6 Discussion

The study evaluated two machine learning classifiers (i.e., neural network and boosting ensemble) in two configurations (i.e., multi-class and pairwise coupling) in automatically classifying the five tone choices of Brazil’s intonation model. For each of the four combinations of classifier and configuration, we considered three sets of features drawn from three pitch contour models: TILT, Bézier, and our four-point model. We have also compared these twelve combinations with our rule-based classifier which is established on the four-point model. We assessed the performance in terms of accuracy and Cohen’s kappa coefficient.

The outcomes of our study provide evidence that a computer can classify the tone choices of terminating prominent syllables with an accuracy of 75.1 % and a κ of 0.73 when compared with a human expert. At present there is no other research on automatically classifying Brazil's tone choices with which to compare these results. Thus, our work sets the standard for future efforts.

At the same time, the agreement between a computer and a human found in our study can be compared with the inter-rater agreement between two humans. A common inter-rater agreement measure is Cohen’s kappa coefficient. Escudero-Mancebo et al. (2014) noted that in the current state of art for ToBI research, κ ranges from 0.51 (Yoon et al. 2004) to 0.69 (Syrdal and McGory 2000). Breen et al. (2012) reported κ values of 0.52 and 0.77 for RaP investigations. The Rhythm and Pitch (RaP) system is a method of labeling the rhythm and relative pitch of spoken English. It is an extension of ToBI that permits the capture of both intonational and rhythmic aspects of speech (Dilley and Brown 2005), based on a tone interval theory proposed by Dilley (2005). In our experiments, as can be seen in Table 5, κ was generally higher than this, ranging from 0.61 to 0.73. Cross-corpora comparisons are dubious, but in this case we are comparing the human annotation of corpora using two different models of prosody, ToBI and RaP, with our computer annotation using the Brazil model. Although not conclusive, it does show that our computer annotation is in the range of inter-rater agreement between two humans.

Our study can also be contrasted with other research from the perspective of models, classifiers, and configuration. From a model view point, our four-point model functioned the most successfully, followed by the Bézier model, and the TILT model. The Bézier model performed better than the TILT model in other studies, too (Escudero-Mancebo and Cardeñoso-Payo 2007; González-Ferreras et al. 2012). From the perspective of a machine learning classifier, the boosting ensemble classifies tone choices better than the neural network. González-Ferreras et al. (2012) also support this view that the boosting ensemble is better than a neural network for classifying ToBI boundary tones and pitch accents. Unlike our mixed findings of the multi-class configuration versus the pairwise coupling configuration, after testing the TILT and Bézier models, González-Ferreras et al. (2012) reported that pairwise coupling is better at classifying ToBI boundary tones and pitch accents than multi-class in every case.

7 Conclusions

These experiments assessed the performance, in terms of accuracy and Cohen’s kappa, of two machine learning classifiers (i.e., neural network and boosting ensemble) in two configurations (i.e., multi-class and pairwise coupling) of classifying the five tone choices of Brazil’s intonation model with three sets of features extracted from three pitch contour models: TILT, Bézier, and our four-point model. These twelve combinations of classifiers, configurations, and feature sets were also contrasted with our rule-based classifier which is founded on the four-point model.

The findings reported in this paper offer empirical evidence that a computer can classify terminating prominent syllable tone choices specified in Brazil’s (1997) model of intonation with an accuracy approaching that of two human analysts. They also demonstrate that our four-point model is a better one for Brazil’s tone choices than either the TILT or Bézier model. Automatic classification of tone choices is an important achievement because tone choices are one of the key elements of Brazil’s model. Brazil’s model deals with the intonational and rhythmic aspects of speech and explains how they convey meaning that goes beyond what the sentences communicate (Brazil 1997). Accordingly, automatically classifying tone choices is another vital step in automatically deducing the intonational and rhythmic facets of speech.

Examining other classifiers (e.g., linear classifiers, support vector machines, lazy learning algorithms, random forests, meta-algorithms) as a means of improving tone choice classification is an area for further study. Since TIMIT is only read speech we cannot generalize the results to unconstrained, conversational, or any other type of speech. Thus, another area to explore is the use of other training corpora containing spontaneous, dialogic, and other types of speech.

The results reported in this paper reaffirm the potential, identified in our earlier work, of investigating Brazil's (1997) intonation discourse theory as a means of better comprehending natural discourse in different environments.