1 Introduction

The pattern of stress and intonation in a language is called prosody. There are many application domains that might benefit from automatic detection of prosody. It can be utilized in text-to-speech synthesis to model intonation for computerized and robot speech. Shriberg et al. (2005) and Escudero-Mancebo et al. (2014) demonstrated that prosodic models improve speaker identification and verification. Irregular prosody is one of the symptoms of autism and other related developmental disorders (Frith and Happé 1994; Fine et al. 1991; Paul et al. 2005; Shriberg et al. 2001; McCann and Peppé 2003). Computer programs that detect irregular prosody automatically have been employed to diagnose autism (Xu et al. 2009; Ringeval et al. 2011; Oller et al. 2010; Diehl & Paul 2012; Van Santen et al. 2010). Suprasegmental measures derived from the elements of Brazil’s model have been shown to explain half of the variance in oral proficiency and comprehensibility ratings of non-native speakers (Kang et al. 2010; Kang and Wang 2014). A number of studies have concluded that the inclusion of prosodic elements enhances automatic speech recognition (Bocklet and Shriberg 2009; Hämäläinen et al. 2007; Litman et al. 2000; Ostendorf 1999).

This study examines automatic detection of tone choice, one of the fundamental elements of Brazil's (1997) model of prosody (see Sect. 2 for further details on Brazil's model). The purpose of this paper is to determine the best machine learning algorithm, and the associated acoustic feature set, for classifying tone choice. We analyzed the accuracy and κ of two machine learning classifiers (neural network and boosting ensemble) in two configurations (multi-class and pairwise coupling) and of a rule-based classifier. We tested three sets of acoustic features created from the TILT and Bézier models and from a new four-point model introduced in this paper. In the sections that follow, we explain how we decided on the classifiers and acoustic feature sets to test, describe the methods employed to determine the best machine learning algorithm and acoustic feature set for classifying the tone choice of a termination prominent syllable, and, after presenting the results, compare the current findings with those of other research in the field of speech science.

2 Brazil’s intonation model

Prosody is described by a variety of speech models. Brazil’s (1997) model and Pierrehumbert’s (1980) model are two that are used often in the fields of linguistics and applied linguistics. Pierrehumbert’s model is often utilized to model prosody for synthesized speech in text-to-speech applications (Wennerstrom 2001). Brazil’s model is frequently applied to language teaching (Cauldwell 2012). Using Brazil’s model is an innovative aspect of the current study because as far as we know it has not been applied to computational linguistics before. Brazil’s model defines pitch concord in an interactive dialog between two persons. Pitch concord matches the relative pitch of the key (first) and termination (last) prominent syllables between two speakers. For instance, high pitch on the termination of one speaker’s statement is matched with a high pitch on the key of the next speaker’s statement. Likewise, a mid termination is paired with a mid key. Pitch concord is a powerful predictor of speaking proficiency in non-native speakers (Pickering 1999). If we assume the goal of computational linguistics is more human-like speech production and interaction, then it is necessary to explore and adopt a model with a more thorough interpretation of intonation at a discourse (i.e., dialog) level.

The basis of Brazil’s theory is the tone unit. Brazil explains a tone unit as a portion of a discourse that a listener can distinguish as having a rising and falling pitch pattern that is distinctive from those of otherwise alike tone units having other patterns of pitch. Every tone unit has one or more prominent syllables, which can be identified from three properties of the syllable: pitch (fundamental frequency in Hz), duration (length in seconds), and intensity (amplitude in dB) (Chun 2002). Brazil asserts (as others have) that the importance of prominence is on the syllable, and not the word. Brazil differentiates prominence from lexical stress. He explains that lexical stress denotes the syllable inside content words that is stressed; however, prominence is the use of emphasis to add more meaning, importance, or contrast to words in a discourse. Accordingly, a syllable that is typically not stressed (e.g., a function word) may be accented to make it prominent. Conversely, a syllable that is customarily stressed lexically may be delivered with additional pitch, duration, or intensity to highlight its meaning, importance, or contrast. Every tone unit contains a key (first) and a termination (last) prominent syllable. If a tone unit has a single prominent syllable, then it is considered to be equally the key and termination prominent syllable. The termination syllable is also referred to as the tonic syllable. The relative pitch of the key and termination prominent syllables and the tone choice of the termination prominent syllable define the tone unit’s intonation pattern. Brazil postulated three evenly balanced scales of relative pitch: low, mid, and high, and five tone choices: falling, rising, rising-falling, falling-rising, and neutral as illustrated in Fig. 1.

Fig. 1 Brazil's five tone choices

The Brazil model covers both constrained and unconstrained speech in monologues and dialogs. Thus, the elements of the model (e.g., tone choice) apply equally to all types of speech.

3 Related research

In this section, we will review related research to identify techniques that can be applied to solving the problem of classifying tone choice. Brazil’s (1997) model has not been exploited in the field of computational linguistics. However, there is a large body of research on classifying ToBI Pitch Accents and Boundary Tones from which we identified candidate machine learning classifiers and acoustic feature sets for our experiments. The tones and break indices (ToBI) is a system for labeling prosodic events in speech (Wightman et al. 1992; Beckman and Elam 1997). ToBI defines three prosodic events: pitch accents, boundary tones, and break indices. Of these, pitch accents and boundary tones are the most closely related to Brazil’s tone choices. Pitch accents serve as cues for prominence, while boundary tones serve as cues for intonational phrasing. Although pitch accents are cues for prominence, there are usually more pitch accents in a dialog than there are Brazil’s prominent syllables. Boundary tones match closely with Brazil’s concept of key prominent syllables (i.e., initial boundary tones and phrasal tones) and termination prominent syllables (i.e., final boundary tones). ToBI defines eight types of pitch accents and nine types of boundary tones. There is not a one-to-one correspondence between Brazil’s tone choices and either pitch accents or boundary tones. Nonetheless, the methods of classifying them and Brazil’s tone choices are similar.

We compared several ToBI experiments involving pitch accents and boundary tones based on the accuracy to determine the candidate classifiers and feature sets we utilized in our experiments. We applied three constraints to the experiments we considered: (1) The experiment had to involve multiple speakers because single speaker classification is somewhat trivial and our goal is speaker independent recognition of Brazil’s tone choices; (2) the experiment had to classify with only acoustic features; and (3) the experiment had to include five or more classes since there are five tone choices.

There are three pitch contour models, which have been employed in the ToBI investigations. In the TILT model, intonation is characterized by parameters representing amplitude, duration, and tilt, where tilt is a measure of the shape of the pitch contour (Taylor 2000). The Bézier model is an approximation of pitch contours with Bézier functions (Escudero-Mancebo and Cardeñoso-Payo 2007). The Quantized Contour Model (QCM) (Rosenberg 2010a, b) quantizes the pitch contour of a word in the time and pitch domains, generating a low-dimensional representation of the contour. Each of these models produces a set of acoustic features, which can be classified with machine learning.

Table 1 presents the accuracy of several recent ToBI experiments along with what was classified (pitch accent or boundary tones), the number of classes classified out of the total number of classes, the number of speakers out of the total number of speakers, the pitch contour model, and machine learning classifier. Also indicated is whether the experiment met two of our constraints, i.e., multiple speakers and five or more classes. None of the experiments met our constraint of acoustic features only. All of the experiments made use of the Boston University Radio News Corpus (Ostendorf et al. 1995), except Li et al. (2010). Their corpus data was a set of 20 male and 20 female speakers from an L2 English speech corpus read by native Mandarin speakers. The speakers were asked to read 29 prompted sentences and instructed to read with a rising or falling intonation, according to an indicator next to each sentence.

Table 1 Summary of several recent ToBI experiments sorted by constraints met and accuracy (Acc)

AuToBI is a tool for automatic ToBI annotation (Rosenberg 2010a, b). Rosenberg reported on the performance of AuToBI in classifying pitch accents and boundary tones utilizing various classifiers and features in 2010 and then again in 2012. In 2010, he described the operation of AuToBI on the Boston Directions Corpus and the Columbia Games Corpus. Utilizing SVMs, AuToBI classified pitch accents of the spontaneous portion of the Boston Directions Corpus with a combined error rate of 0.284, intonational phrase final tones with 55.0 % accuracy, and intermediate phrase ending tones with 68.6 %. He did not give the pitch accent classification results on the Columbia Games Corpus, but stated the intonational phrase final tones were classified with 35.34 % accuracy, whereas intermediate phrase ending phrase accents were classified with 62.21 % accuracy. In 2012, Rosenberg examined a number of features and classifiers to improve the capability of AuToBI to classify pitch accents and boundary tones. He found the AdaBoost classifier implemented with weka did the best at classifying pitch accents (60.91 % accuracy) and that the Random Forest classifier implemented with weka was the best at classifying pitch accent (47.44 %) and pitch accent/boundary tones (74.47 %).

From the experiments that met our constraints, we chose the neural network and decision tree classifiers as candidates for our experiments. We augmented the decision tree classifier with boosting, a machine learning ensemble method designed to improve the performance of decision tree classifiers. We did not choose a Naïve Bayesian classifier because, of all machine learning techniques, Naïve Bayesian classifiers are typically the weakest (Caruana and Niculescu-Mizil 2006). We also selected two classification configurations: multi-class and pairwise coupling. In the multi-class configuration, the classifier makes a 1-of-n choice. Multi-class classifiers generally perform worse than binary classifiers. Pairwise coupling is a method of breaking a multi-class classification problem into a number of more accurate binary classification problems (Hastie and Tibshirani 1998). For feature set models, we picked the TILT and Bézier models. We did not select the Quantized Contour Model because the low number of classes in Rosenberg's (2010a, b) experiments may have inflated its accuracy relative to the TILT and Bézier model experiments.

In addition to the candidate classifiers and feature sets that we identified from the ToBI experiments, we also considered another classifier and another pitch contour model. The rule-based classifier is further detailed in Sect. 4.3. The other pitch contour model, which we call the four-point model in this paper, was derived for the rule-based classifier. This pitch contour model is the generalization of any pitch contour, i.e., every pitch contour has a first, last, minimum, and maximum pitch point. Section 4.2.1 contains a more in-depth description of the four-point model.

4 Experimental procedure

In this paper we compare the accuracy and κ of two candidate machine learning classifiers (neural network and boosting ensemble) in two configurations (multi-class and pairwise coupling) in automatically classifying the five tone choices of Brazil's intonation model. For each of the four combinations of classifier and configuration, we consider three sets of features derived from three pitch contour models: TILT, Bézier, and our four-point model. We also compare these twelve combinations with our rule-based classifier, which is founded on the four-point model.

4.1 TIMIT corpus

The DARPA TIMIT Acoustic–Phonetic Continuous Speech Corpus (TIMIT) of read speech provides speech data for the acquisition of acoustic–phonetic knowledge and for the development and evaluation of automatic speech recognition systems (Garofolo et al. 1993). TIMIT contains a total of 6300 sentences, 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. The text material in the TIMIT prompts consists of two dialect sentences, 450 phonetically-compact sentences, and 1890 phonetically-diverse sentences. The dialect sentences were intended to reveal the dialect of the speakers and were read by all 630 speakers. The phonetically-compact sentences were designed to provide a good coverage of pairs of phones, with extra occurrences of phonetic contexts thought to be either of particular interest or difficult. Each speaker read five of these sentences and each text was spoken by seven different speakers. The phonetically-diverse sentences were selected to maximize the variety of allophonic contexts found in the texts. Each speaker read three of these sentences, with each sentence being read only by a single speaker. The corpus includes hand corrected start and end times for the phones, phonemes, pauses, syllables, and words.

The TIMIT corpus is composed of constrained (i.e., short read sentences) monologic speech. We chose the TIMIT corpus over others (e.g., Boston University Radio News Corpus) because of the large number of speakers and dialects spoken.

The TIMIT corpus includes definitions for 60 phones, which are also used by other corpora. For our experiments, we utilized a subset of the corpus consisting of 84 speakers of four dialects. Our subset contained 825 utterances comprising 10,512 syllables, 994 of which were terminating prominent syllables. Table 2 presents the distribution of speakers by gender and dialect.

Table 2 Distribution of TIMIT speakers in this research by gender and dialect

We augmented the corpus by identifying the prominent syllables and the tone choices on the termination (last) prominent syllables in the experimental subset using the syllable demarcations provided with the corpus. The prominent syllables and tone choices were identified by a trained linguist who coded them both by listening to the audio files and by using Praat (Boersma and Weenink 2014), a computerized speech analysis program, to confirm the movement of the pitch contour. Approximately ten percent of the samples were analyzed by a second trained linguist to confirm the consistency of the coding. The inter-rater reliability between the two linguists was 85 to 87 %, a satisfactory rate comparable to that found in other similar studies (e.g., Kang 2010) utilizing Brazil's (1997) prosody model. The two linguists resolved any discrepancies and continued coding the data until no further discrepancies arose; the first linguist then finished coding the rest of the speech files alone. This method of annotation has been employed extensively as a reliable labeling technique in other applied linguistics studies (Kang et al. 2010; Kang and Wang 2014; Pickering 1999). The linguist identified the tone choice of 994 terminating prominent syllables in the speech samples. The distribution of tone choices is depicted in Table 3.

Table 3 Distribution of tone choices

Initially the analysts examined the pitch contours with the Multi-Speech and Computerized Speech Laboratory (CSL) software (KayPENTAX 2008), while the computer analyzed them using Praat (Boersma and Weenink 2014). We discovered significant differences between the pitch contours displayed by the two programs. This discrepancy resulted in substantial disagreement between the tone choices classified by the computer and those assigned by the human expert. In addition, further differences in the pitch contours were found even between different versions of Praat, and more still between the same version of Praat running on different computers. Maryn et al. (2009) also reported differences between the Multi-Speech and CSL software and Praat, stating that their pitch and intensity values were not comparable. Amir et al. (2009) noted the same discrepancy and added that findings from the Multi-Speech and CSL software and from Praat should not be combined. To ensure these variations did not affect our results, the analyst redid the tone choice annotations used to train the classifiers with the same version of Praat that the computer used.

4.2 Parameterization of the F0 contours

We investigated three sets of classification features, each derived from a different model of the pitch contour: four-point model, TILT model (Taylor 2000), and a model proposed by Escudero-Mancebo and Cardeñoso-Payo (2007), which consists of Bézier parameters. The pitch contour was extracted with Praat (Boersma and Weenink 2014).
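
As a concrete illustration of this extraction step, the sketch below obtains an F0 contour in Python via the parselmouth interface to Praat. It is an analogue of, not the exact pipeline used in, this study (which ran Praat itself), and the file name and syllable boundaries are hypothetical placeholders; in our experiments the syllable boundaries came from the corpus annotations.

import parselmouth  # Python interface to Praat (an assumption; the study used Praat directly)

snd = parselmouth.Sound("utterance.wav")      # hypothetical file name
pitch = snd.to_pitch()
times = pitch.xs()
f0 = pitch.selected_array['frequency']        # 0 Hz where Praat found no voicing

# Keep only voiced frames inside the terminating prominent syllable.
start_s, end_s = 1.23, 1.51                   # hypothetical syllable boundaries
mask = (times >= start_s) & (times <= end_s) & (f0 > 0)
contour = list(zip(times[mask], f0[mask]))    # (time_s, f0_hz) pairs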

4.2.1 Four-point model features

The four-point model is our own design and is proposed here for the first time. It has two sub-models, as depicted in Fig. 2: rise-fall-rise and fall-rise-fall.

Fig. 2 Four-point model sub-models: rise-fall-rise (left) and fall-rise-fall (right)

The rise-fall-rise sub-model is applied if the maximum pitch point occurs earlier in time than the minimum pitch point; the fall-rise-fall sub-model is applied if the minimum pitch point occurs earlier than the maximum pitch point. The features for the sub-models are built on the following four points (from which the model derives its name): first is the pitch of the first point in the pitch contour (Hz); last is the pitch of the last point in the pitch contour (Hz); max is the maximum pitch in the pitch contour (Hz); and min is the minimum pitch in the pitch contour (Hz). The features for the rise-fall-rise sub-model are first-rise (r1), first-fall (f1), and second-rise (r2), calculated as follows:

$$r1 = max-first$$
(1)
$$f1 = max-min$$
(2)
$$r2 = last-min$$
(3)

The features for the fall-rise-fall sub-model are first-fall (f1), first-rise (r1), and second-fall (f2) and they are calculated as follows:

$$f1 = first-min$$
(4)
$$r1 = max-min$$
(5)
$$f2 = max-last$$
(6)

We apply this model because it is the generalization of any pitch contour; i.e., every pitch contour has a first, last, minimum, and maximum pitch point. In some cases, some or all of the four points might coincide. For example, the maximum may also be the first point. Theoretically, the classifiers should determine the tone choice by the significance or insignificance of the rises and falls. The significance of the rises and falls is determined during the classifier training. For instance, as depicted in Fig. 3, in the rise-fall-rise sub-model, if all the rises and falls are insignificant, then the tone choice is neutral. If r1 is insignificant, f1 is significant, and r2 is insignificant, the tone choice is fall.
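
As an illustration, the following is a minimal sketch (our own simplification, not a reference implementation) of computing the sub-model and its three features from Eqs. 1-6. It assumes the contour is a list of (time_s, f0_hz) pairs for one terminating prominent syllable, with unvoiced frames already removed.

def four_point_features(contour):
    """Return (sub_model, features): ('rise-fall-rise', (r1, f1, r2)) or
    ('fall-rise-fall', (f1, r1, f2)), with all feature values in Hz."""
    times = [t for t, _ in contour]
    pitches = [f for _, f in contour]

    first, last = pitches[0], pitches[-1]
    max_i = max(range(len(pitches)), key=lambda i: pitches[i])
    min_i = min(range(len(pitches)), key=lambda i: pitches[i])
    f_max, f_min = pitches[max_i], pitches[min_i]

    if times[max_i] <= times[min_i]:
        # Maximum occurs before the minimum: rise-fall-rise sub-model (Eqs. 1-3).
        return 'rise-fall-rise', (f_max - first, f_max - f_min, last - f_min)
    else:
        # Minimum occurs before the maximum: fall-rise-fall sub-model (Eqs. 4-6).
        return 'fall-rise-fall', (first - f_min, f_max - f_min, f_max - last)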

Fig. 3 Examples of how the significance of the rises and falls determines the tone choice

Table 4 specifies the truth table for all possible combinations of significant and insignificant rises and falls, which are illustrated in Fig. 3. In the last row of Table 4, all of the rises and falls are significant, so the tone choice could be either fall-rise or rise-fall. In these two cases, the tone choice was determined by the last two significant movements, i.e., f1 and r2 for the rise-fall-rise sub-model and r1 and f2 for the fall-rise-fall sub-model.

Table 4 Truth table for all possible combinations of significant and insignificant rises and falls; 0 = rise/fall is insignificant (i.e., it is less than a threshold); 1 = rise/fall is significant (i.e., it is more than a threshold)

4.2.2 TILT model features

TILT is one of the more popular models for parameterizing pitch contours (Taylor 2000). The model was developed to automatically analyze and synthesize speech intonation. In the model, intonation is represented as a sequence of events, which are characterized by parameters representing amplitude, duration, and tilt. Tilt is a measure of the shape of the event, or pitch contour. Festival (The Centre for Speech Technology Research 2014), a popular public domain text-to-speech system, applies this model to synthesize speech intonation. The model is illustrated in Fig. 4. Three points are defined: the start of the event, the peak (the highest point), and the end of the event.

Fig. 4 Parameters of the RFC model in the TILT model of a pitch contour

Each event is characterized by five RFC (rise/fall/connection) parameters: rise amplitude (difference in pitch between the pitch value at the peak and at the start, which is always greater than or equal to zero), rise duration (distance in time from start of the event to the peak), fall amplitude (pitch distance from the end to the peak, which is always less than or equal to zero), fall duration (distance in time from the peak to the end), and vowel position (distance in time from start of pitch contour to start of vowel). The TILT representation transforms four of the RFC parameters into three TILT parameters: duration (sum of the rise and fall durations), amplitude (sum of absolute values of the rise and fall amplitudes), and tilt (a dimensionless number which expresses the overall shape of the event). The TILT parameters are calculated as follows:

$$s = {\text{ start of event}}$$
(7)
$$p = {\text{ peak }}\left( {\text{the highest point}} \right)$$
(8)
$$e = {\text{ end of event}}$$
(9)
$$a_{rise} = {\text{ difference in pitch between the pitch value at the peak }}\left( p \right){\text{ and at the start }}\left( s \right), \, \ge 0$$
(10)
$$d_{rise} = {\text{distance in time from start}}\left( s \right){\text{of the event to the peak}}\left( p \right)$$
(11)
$$a_{fall} = {\text{ pitch distance from the end }}\left( e \right){\text{ to the peak }}\left( p \right), \, \le 0$$
(12)
$$d_{fall} = {\text{ distance in time from the peak }}\left( p \right){\text{ to the end }}\left( e \right)$$
(13)
$$d = {\text{ duration }} = d_{rise} + d_{fall}$$
(14)
$$a = {\text{ amplitude }} = \, \left| {a_{rise} } \right| \, + \, \left| {a_{fall} } \right|$$
(15)
$$t = {\text{ tilt }} = \frac{{\left| {a_{rise} } \right| - |a_{fall} |}}{{2\left( {\left| {a_{rise} } \right| + |a_{fall} |} \right)}} + \frac{{d_{rise} - d_{fall} }}{{2\left( {d_{rise} + d_{fall} } \right)}}$$
(16)
$$c = {\text{ pitch contour }} = \, \left\{ {f_{i} ,\,f_{i + 1} , \, \ldots ,\,f_{i + N} } \right\}$$
(17)
$$f_{i} = {\text{ frequency }}\left( {\text{Hz}} \right){\text{ of}}\;i{\text{th point in pitch contour}}$$
(18)
$$f_{v} \in c$$
(19)
$$v = {\text{ index of the beginning of the vowel}}$$
(20)

In our experiments, duration (d), amplitude (a), tilt (t), and vowel position (v) were the input features to the classifiers.
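
A compact sketch of this computation (under our reading of the TILT parameterization described above, not Taylor's reference implementation) follows. It assumes the same (time_s, f0_hz) contour representation as before, plus the hand-labeled start time of the vowel.

def tilt_features(contour, vowel_start_s):
    """Return the four classifier inputs: duration d, amplitude a, tilt t,
    and vowel position v (Eqs. 7-20)."""
    times = [t for t, _ in contour]
    pitches = [f for _, f in contour]

    peak_i = max(range(len(pitches)), key=lambda i: pitches[i])
    a_rise = pitches[peak_i] - pitches[0]     # >= 0 because the peak is the maximum
    a_fall = pitches[-1] - pitches[peak_i]    # <= 0
    d_rise = times[peak_i] - times[0]
    d_fall = times[-1] - times[peak_i]

    d = d_rise + d_fall                       # Eq. 14
    a = abs(a_rise) + abs(a_fall)             # Eq. 15
    t = 0.0                                   # Eq. 16, guarding against flat or instantaneous events
    if a > 0:
        t += (abs(a_rise) - abs(a_fall)) / (2.0 * a)
    if d > 0:
        t += (d_rise - d_fall) / (2.0 * d)
    v = vowel_start_s - times[0]              # vowel position relative to the contour start
    return d, a, t, v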

4.2.3 Bézier model features

Escudero-Mancebo and Cardeñoso-Payo (2007) proposed an alternative to the TILT model that is constructed from the approximation of the pitch contours with Bézier functions as illustrated in Fig. 5.

Fig. 5 Example of the Bézier function fitting stylization from Escudero-Mancebo and Cardeñoso-Payo (2007)

Similarly we used Bézier functions to approximate the pitch contour of the terminating prominent syllable, where:

$${\mathbf{P}} = {\text{ pitch contour}}$$
(21)
$${\mathbf{P}}_{i} = \, \left( {f_{i} ,t_{i} } \right) \, = {\text{F}}0 \, \left( {\text{Hz}} \right){\text{ at time }}\left( {\text{s}} \right)t_{i}$$
(22)
$$n = \, \left| {\mathbf{P}} \right| \, - { 1}$$
(23)
$$b = {\text{number of B}}\'e {\text{zier points}} = 4$$
(24)
$$x = \left( {0, \frac{1}{b - 1}, \frac{2}{b - 1}, 1} \right)$$
(25)
$$j = \, \left( { 1,{ 2},{ 3},{ 4}} \right)$$
(26)
$${\mathbf{B}}\left( {x_{j} } \right) \, = \, \left( {p_{j} ,x_{j} } \right)$$
(27)
$$p_{j} = {\text{ B}}\'e {\text{zier approximation of F}}0 \, \left( {\text{Hz}} \right){\text{ at time }}\left( {\text{s}} \right)x_{j}$$
(28)
$${\mathbf{B}} (x) = \sum\limits_{{i = 0}}^{n} {b_{{i,n}} } (x){\mathbf{P}}_{i},\quad 0 \, \le x \le {1}$$
(29)
$$b_{i,n} \left( x \right) = \left( {\begin{array}{*{20}c} n \\ i \\ \end{array} } \right)x^{i} (1 - x)^{n - i} \quad i = 0, \ldots , n$$
(30)

The resulting four Bézier parameters (p1, p2, p3, and p4) are the features on which the tone choice classifiers are trained.
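
The following sketch (ours, not the cited authors' code) computes these four features as defined in Eqs. 21-30: the contour's F0 values are treated as control points of a degree-n Bézier curve, and the curve's F0 component is evaluated at b = 4 evenly spaced parameter values.

import numpy as np
from scipy.stats import binom

def bezier_features(f0, b=4):
    """Return (p1, ..., pb), the Bezier approximation of F0 at the
    parameter values x = (0, 1/(b-1), ..., 1)."""
    f0 = np.asarray(f0, dtype=float)
    n = len(f0) - 1                           # Eq. 23
    xs = np.linspace(0.0, 1.0, b)             # Eq. 25
    feats = []
    for x in xs:
        # Bernstein weights C(n, i) x^i (1 - x)^(n - i) for i = 0..n (Eq. 30);
        # binom.pmf computes exactly this and stays numerically stable for large n.
        w = binom.pmf(np.arange(n + 1), n, x)
        feats.append(float(np.dot(w, f0)))    # F0 component of Eq. 29
    return feats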

4.3 Classifiers

We tested two standard machine learning classifiers to classify tone choices: neural network and boosting. We employed the Matlab patternnet function with ten hidden nodes and the Levenberg–Marquardt optimization network training function to implement the neural network classifier (MathWorks 2013). Boosting is an ensemble classifier that combines the outcomes of weak classifiers (typically decision trees) to improve their accuracy. Boosting was implemented with the Matlab fitensemble function using the AdaBoostM1 (binary classifier) or AdaBoostM2 (multi-class classifier) booster and 100 decision tree learners (i.e., weak classifiers).
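
For readers without Matlab, the snippet below is only a rough scikit-learn analogue of the two classifiers as configured above; it is not the setup used in our experiments (for example, MLPClassifier does not offer Levenberg-Marquardt training).

from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier

# Ten hidden nodes, mirroring the patternnet configuration described above.
neural_net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000)
# AdaBoost with 100 decision-tree weak learners, mirroring fitensemble.
boosting = AdaBoostClassifier(n_estimators=100)

# X holds the feature vectors (four-point, TILT, or Bezier features) and
# y the tone-choice labels 1-5:
# neural_net.fit(X_train, y_train); boosting.fit(X_train, y_train)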

We also utilized a rule-based classifier that implemented the four-point model truth table specified in Table 4 above. The thresholds for significance versus insignificance of each rise and fall (i.e., rise-fall-rise sub-model: r1, f1, and r2; fall-rise-fall sub-model: f1, r1, and f2) were determined during training. A simple brute-force method of trying every combination of unique rises and falls in the training data as thresholds determined the set of thresholds (TH*rfr, TH*frf) that maximized the accuracy, as follows:

$$TC = \, \left\{ { 1,{ 2},{ 3},{ 4},{ 5}} \right\}{\text{ corresponding to tone choices }}\left\{ {{\text{rise}},{\text{ neutral}},{\text{ fall}},{\text{ fall-rise}},{\text{ rise-fall}}} \right\}$$
(31)
$$T = \, \left\{ {\text{human classified tone choices for training data}} \right\}$$
(32)
$$T_{rfr} = \, \left\{ {\text{human classified tone choices for training data for rise-fall-rise sub-model}} \right\}$$
(33)
$$T_{frf} = \, \left\{ {\text{human classified tone choices for training data for fall-rise-fall sub-model}} \right\}$$
(34)
$$T = T_{rfr} \cup T_{frf}$$
(35)
$$\emptyset = T_{rfr} \cap T_{frf}$$
(36)
$$N = \, \left| {T_{rfr} } \right|$$
(37)
$$M = \, \left| {T_{frf} } \right|$$
(38)
$$T_{rfr} = \, \left\{ {t_{1} , \ldots ,t_{N} } \right\}$$
(39)
$$T_{frf} = \, \left\{ {t_{1} , \ldots ,t_{M} } \right\}$$
(40)
$$t_{i} \in TC$$
(41)
$$tr1_{i} = r1\;{\text{for}}\;i{\text{-th training sample}}$$
(42)
$$tf1_{i} = f1\;{\text{for}}\;i{\text{-th training sample}}$$
(43)
$$tr2_{i} = r2\;{\text{for}}\;i{\text{-th training sample}}$$
(44)
$$tf2_{i} = f2\;{\text{for}}\;i{\text{-th training sample}}$$
(45)
$$F_{rfr} = \, \left\{ {\left( {tr1_{1} ,tf1_{1} ,tr2_{1} } \right), \, \ldots , \, \left( {tr1_{N} ,tf1_{N} ,tr2_{N} } \right)} \right\}$$
(46)
$$F_{frf} = \, \left\{ {\left( {tf1_{1} ,tr1_{1} ,tf2_{1} } \right), \, \ldots , \, \left( {tf1_{M} ,tr1_{M} ,tf2_{M} } \right)} \right\}$$
(47)
$$\emptyset = F_{rfr} \cap F_{frf}$$
(48)
$$r1_{rfr} = \, \left\{ {tr1_{1} , \, \ldots ,tr1_{N} } \right\}$$
(49)
$$f1_{rfr} = \, \left\{ {tf1_{1} , \, \ldots ,tf1_{N} } \right\}$$
(50)
$$r2_{rfr} = \, \left\{ {tr2_{1} , \, \ldots ,tr2_{N} } \right\}$$
(51)
$$R1_{rfr} = \left\{ {{\text{unique values in }}r1_{rfr} } \right\}$$
(52)
$$F1_{rfr} = \left\{ {{\text{unique values in }}f1_{rfr} } \right\}$$
(53)
$$R2_{rfr} = \left\{ {{\text{unique values in }}r2_{rfr} } \right\}$$
(54)
$$I = \, \left| {R1_{rfr} } \right|$$
(55)
$$J = \, \left| {F1_{rfr} } \right|$$
(56)
$$K = \, \left| {R2_{rfr} } \right|$$
(57)
$$r1_{i} \in R1_{rfr}$$
(58)
$$f1_{j} \in F1_{rfr}$$
(59)
$$r2_{k} \in R2_{rfr}$$
(60)
$$\lambda_{rfr} \left( {F_{rfr} ,\left( {r1_{i} , \, f1_{j} , \, r2_{k} } \right)} \right) \, = {\text{rule-based classifier applying rise-fall-rise sub-model of Table 4}}$$
(61)
$$\lambda_{rfr} \left( {F_{rfr} ,\left( {r1_{i} , \, f1_{j} , \, r2_{k} } \right)} \right) \in TC$$
(62)
$$f1_{frf} = \, \left\{ {tf1_{1} , \, \ldots ,tf1_{M} } \right\}$$
(63)
$$r1_{frf} = \, \left\{ {tr1_{1} , \, \ldots ,tr1_{M} } \right\}$$
(64)
$$f2_{frf} = \, \left\{ {tf2_{1} , \, \ldots ,tf2_{M} } \right\}$$
(65)
$$F1_{frf} = \left\{ {{\text{unique values in }}f1_{frf} } \right\}$$
(66)
$$R1_{frf} = \left\{ {{\text{unique values in }}r1_{frf} } \right\}$$
(67)
$$F2_{frf} = \left\{ {{\text{unique values in }}f2_{frf} } \right\}$$
(68)
$$F = \, \left| {F1_{frf} } \right|$$
(69)
$$G = \, \left| {R1_{frf} } \right|$$
(70)
$$H = \, \left| {F2_{frf} } \right|$$
(71)
$$f1_{f} \in F1_{frf}$$
(72)
$$r1_{g} \in R1_{frf}$$
(73)
$$f2_{h} \in F2_{frf}$$
(74)
$$\lambda_{frf} \left( {F_{frf} ,\left( {f1_{f} , \, r1_{g} , \, f2_{h} } \right)} \right) \, = {\text{rule-based classifier applying fall-rise-fall sub-model of Table 4}}$$
(75)
$$\lambda_{frf} \left( {F_{frf} ,\left( {f1_{f} , \, r1_{g} , \, f2_{h} } \right)} \right) \in TC$$
(76)
$$X = \, \left\{ {\text{rule-based classifier tone choices for training samples for a sub-model}} \right\}$$
(77)
$$Y = \, \left\{ {\text{human tone choices for training samples for a sub-model}} \right\}$$
(78)
$$Z = \, \left| X \right| \, = \, \left| Y \right|$$
(79)
$$x_{i} \in X$$
(80)
$$y_{i} \in Y$$
(81)
$$a_{i} = \left\{ {\begin{array}{*{20}c} {1,\quad x_{i} = y_{i} } \\ {0,\quad x_{i} \ne y_{i} } \\ \end{array} } \right.$$
(82)
$$A\left( {X,Y} \right) \, = Accuracy = \frac{{\mathop \sum \nolimits_{i = 1}^{Z} a_{i} }}{Z}$$
(83)
$$TH_{rfr}^{*} = {\text{ arg max}}_{{ 1\le i \le I,{ 1} \le j \le J,{ 1} \le k \le K}} \left[ {A\left( {\lambda_{rfr} \left( {F_{rfr} ,\left( {r1_{i} , \, f1_{j} , \, r2_{k} } \right)} \right),T_{rfr} } \right)} \right]$$
(84)
$$TH_{frf}^{*} = {\text{ arg max}}_{{ 1\le f \le F,{ 1} \le g \le G, 1\le h \le H}} \left[ {A\left( {\lambda_{frf} \left( {F_{frf} ,\left( {f1_{f} , \, r1_{g} , \, f2_{h} } \right)} \right),T_{frf} } \right)} \right]$$
(85)
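
The sketch below compresses Eqs. 31-85 into code for the rise-fall-rise sub-model (the fall-rise-fall case is symmetric). It is illustrative rather than a reference implementation: samples is assumed to be a list of ((r1, f1, r2), tone) training pairs, and only the two rows of the Table 4 truth table spelled out in the text above are shown.

from itertools import product

TRUTH_TABLE_RFR = {
    (0, 0, 0): 'neutral',   # no significant movement
    (0, 1, 0): 'fall',      # only f1 is significant
    # ... remaining rows of Table 4 ...
}

def classify_rfr(feats, thresholds, table=TRUTH_TABLE_RFR):
    # Map each rise/fall to 1 (significant) if it exceeds its threshold, else 0.
    pattern = tuple(int(v > th) for v, th in zip(feats, thresholds))
    return table.get(pattern)

def train_thresholds_rfr(samples):
    """Brute-force search (Eq. 84) for the (r1, f1, r2) thresholds that
    maximize training accuracy over all unique observed values."""
    r1_vals = sorted({f[0] for f, _ in samples})
    f1_vals = sorted({f[1] for f, _ in samples})
    r2_vals = sorted({f[2] for f, _ in samples})
    best, best_acc = None, -1.0
    for th in product(r1_vals, f1_vals, r2_vals):
        acc = sum(classify_rfr(f, th) == t for f, t in samples) / len(samples)
        if acc > best_acc:
            best, best_acc = th, acc
    return best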

4.4 Classifier configurations

We analyzed two different configurations of the neural network and boosting ensemble classifiers: multi-class and pairwise coupling. We employed fivefold cross-validation in each of the experiments to tune the parameters of the machine learning classifiers (i.e., training) and then test them. The method for determining the classifier outputs is described below for each combination of classifier and configuration.

4.4.1 Neural network multi-class

The multi-class neural network provides five outputs, one for each of the possible tone choices. The outputs are real numbers in the range from zero to one. The output with the highest value is selected as the tone choice. There is one multi-class neural network for the TILT model; one for the Bézier; and one for each of the four-point model sub-models.

4.4.2 Boosting ensemble multi-class

The multi-class ensemble provides one output, which is from the set {1, 2, 3, 4, 5} corresponding to the set of tone choices {rise, neutral, fall, fall-rise, rise-fall}. There are four multi-class ensembles; one for the TILT model; one for the Bézier; and one for each of the two four-point model sub-models.

4.4.3 Neural network pairwise coupling

The neural network pairwise coupling configuration consists of ten neural networks trained to classify each combination of tone choices: rise versus neutral, rise versus fall, rise versus fall-rise, rise versus rise-fall, neutral versus fall, neutral versus fall-rise, neutral versus rise-fall, fall versus fall-rise, fall versus rise-fall, and fall-rise versus rise-fall. There are ten neural networks for the TILT model; ten for the Bézier; and ten for each of the four-point model sub-models. The output of each classifier is a real number between zero and one. The outputs are treated as probabilities. The probabilities are combined as follows and the one with the highest probability is the tone choice selected.

$$T = \, \left\{ { 1,{ 2},{ 3},{ 4},{ 5}} \right\}{\text{ corresponding to tone choices }}\left\{ {{\text{rise}},{\text{ neutral}},{\text{ fall}},{\text{ fall-rise}},{\text{ rise-fall}}} \right\}$$
(86)
$$t \in T$$
(87)
$$o_{i,j} = {\text{ output of classifier trained to classify tone choice}}\,i\,{\text{vs}}\,j$$
(88)
$$o_{i,j} \in {\mathbb{R}}$$
(89)
$$0 \, \le o_{i,j} \le { 1}$$
(90)
$$p_{ 1} = Pr\left( {{\text{tone choice }} = {\text{ rise}}} \right) \, = o_{r,n} \cdot o_{r,f} \cdot o_{r,fr} \cdot o_{r,rf}$$
(91)
$$p_{ 2} = Pr\left( {{\text{tone choice }} = {\text{ neutral}}} \right) \, = o_{n,f} \cdot o_{n,fr} \cdot o_{n,rf} \cdot \left( { 1 { } - o_{r,n} } \right)$$
(92)
$$p_{ 3} = Pr\left( {{\text{tone choice }} = {\text{ fall}}} \right) \, = o_{f,fr} \cdot o_{f,rf} \cdot \left( { 1 { } - o_{r,f} } \right) \cdot \left( { 1 { }{-}o_{n,f} } \right)$$
(93)
$$p_{ 4} = Pr\left( {{\text{tone choice }} = {\text{ fall - rise}}} \right) \, = o_{fr,rf} \cdot \left( { 1 { }{-}o_{r,fr} } \right) \cdot \left( { 1 { }{-}o_{n,fr} } \right) \cdot \left( { 1 { }{-}o_{f,fr} } \right)$$
(94)
$$p_{ 5} = Pr\left( {{\text{tone choice }} = {\text{ rise - fall}}} \right) \, = \, \left( { 1 { }{-}o_{fr,rf} } \right) \cdot \left( { 1 { }{-}o_{r,rf} } \right) \cdot \left( { 1 { }{-}o_{n,rf} } \right) \cdot \left( { 1 { }{-}o_{f,rf} } \right)$$
(95)
$$t^{*} = {\text{arg max}}_{1 \le t \le 5} \;p_{t}$$
(96)
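
A small sketch of this combination rule (Eqs. 86-96) is given below. The dict o is assumed to hold, under the key (i, j) with i < j, the output of the network trained on tone i versus tone j (1 = rise, 2 = neutral, 3 = fall, 4 = fall-rise, 5 = rise-fall), interpreted as the probability that the sample belongs to class i rather than class j.

def combine_pairwise_nn(o):
    """Return t* = arg max_t p_t (Eq. 96) from the ten pairwise outputs."""
    p = {}
    for t in range(1, 6):
        prob = 1.0
        for u in range(1, 6):
            if u == t:
                continue
            i, j = (t, u) if t < u else (u, t)      # classifiers are keyed with i < j
            # o[(i, j)] is Pr(class = i); take its complement when t plays the role of j.
            prob *= o[(i, j)] if t == i else 1.0 - o[(i, j)]
        p[t] = prob
    return max(p, key=p.get)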

4.4.4 Boosting ensemble pairwise coupling

The boosting ensemble pairwise coupling configuration consists of ten boosting ensembles trained to classify each combination of tone choices: rise versus neutral, rise versus fall, rise versus fall-rise, rise versus rise-fall, neutral versus fall, neutral versus fall-rise, neutral versus rise-fall, fall versus fall-rise, fall versus rise-fall, and fall-rise versus rise-fall. There are ten ensembles for the TILT model; ten for the Bézier; and ten for each of the four-point model sub-models. The output of each classifier is from the set {1, 2, 3, 4, 5} corresponding to the set of tone choices {rise, neutral, fall, fall-rise, rise-fall}. For example, the output of the rise versus neutral classifier would be either 1 or 2. The accuracy of the classifier classifying the training data correctly is treated as the probability that the classifier output is correct. The probabilities are combined as follows and the one with the highest probability is the tone choice selected.

$$T = \left\{ {1, 2, 3, 4, 5} \right\}\;{\text{corresponding to tone choices}}\;\left\{ {{\text{rise}},\;{\text{neutral}},\;{\text{fall}},\;{\text{fall-rise}},\;{\text{rise-fall}}} \right\}$$
(97)
$$o_{i,j} = {\text{ output of classifier trained to classify tone choice}}\;i\;{\text{vs}}\;j$$
(98)
$$o_{i,j} \in T$$
(99)
$${a_{i,j}} = {\text{accuracy of classifier trained to classify tone choice}}\,i\,{\text{vs.}}\;j$$
(100)
$$X_{i,j} = \, \left\{ {{\text{tone choices for training samples from classifier classifying tone choice}}\;i\;{\text{vs}}\;j} \right\}$$
(101)
$$Y_{i,j} = \, \left\{ {{\text{tone choices for training samples from human classifying tone choice}}\;i\;{\text{vs}}\;j} \right\}$$
(102)
$$Z_{i,j} = \, \left| {X_{i,j} } \right| \, = \, \left| {Y_{i,j} } \right|$$
(103)
$$x_{k} \in X_{i,j}$$
(104)
$$y_{k} \in Y_{i,j}$$
(105)
$$b_{k} = \left\{ {\begin{array}{*{20}c} {1,\quad x_{k} = y_{k} } \\ {0,\quad x_{k} \ne y_{k} } \\ \end{array} } \right.$$
(106)
$$a_{i,j} = \frac{{\mathop \sum \nolimits_{k = 1}^{{Z_{i,j} }} b_{k} }}{{Z_{i,j} }}$$
(107)
$$t \in T$$
(108)
$$p(o_{i,j} , a_{i,j} , t) = \left\{ {\begin{array}{ll} {a_{i,j} ,\quad o_{i,j} = t} \\ {1 - a_{i,j},\quad o_{i,j} \ne t} \\ \end{array} } \right.$$
(109)
$$p_{ 1} = Pr\left( {{\text{tone choice }} = {\text{ rise}}} \right) \, = p\left( {o_{r,n} ,a_{r,n} ,{ 1}} \right)\cdot p\left( {o_{r,f} ,a_{r,f} ,{ 1}} \right)\cdot p\left( {o_{r,fr} ,a_{r,fr} ,{ 1}} \right)\cdot p\left( {o_{r,rf} ,a_{r,rf} ,{ 1}} \right)$$
(110)
$$p_{ 2} = Pr\left( {{\text{tone choice }} = {\text{ neutral}}} \right) \, = p\left( {o_{r,n} ,a_{r,n} ,{ 2}} \right)\cdot p\left( {o_{n,f} ,a_{n,f} ,{ 2}} \right)\cdot p\left( {o_{n,fr} ,a_{n,fr} ,{ 2}} \right)\cdot p\left( {o_{n,rf} ,a_{n,rf} ,{ 2}} \right)$$
(111)
$$p_{ 3} = Pr\left( {{\text{tone choice }} = {\text{ fall}}} \right) \, = p\left( {o_{r,f} ,a_{r,f} ,{ 3}} \right)\cdot p\left( {o_{n,f} ,a_{n,f} ,{ 3}} \right)\cdot p\left( {o_{f,fr} ,a_{f,fr} ,{ 3}} \right)\cdot p\left( {o_{f,rf} ,a_{f,rf} ,{ 3}} \right)$$
(112)
$$p_{ 4} = Pr\left( {{\text{tone choice }} = {\text{ fall-rise}}} \right) \, = p\left( {o_{r,fr} ,a_{r,fr} ,{ 4}} \right)\cdot p\left( {o_{n,fr} ,a_{n,fr} ,{ 4}} \right)\cdot p\left( {o_{f,fr} ,a_{f,fr} ,{ 4}} \right)\cdot p\left( {o_{fr,rf} ,a_{fr,rf} ,{ 4}} \right)$$
(113)
$$p_{ 5} = Pr\left( {{\text{tone choice }} = {\text{ rise-fall}}} \right) \, = p\left( {o_{r,rf} ,a_{r,rf} ,{ 5}} \right)\cdot p\left( {o_{n,rf} ,a_{n,rf} ,{ 5}} \right)\cdot p\left( {o_{f,rf} ,a_{f,rf} ,{ 5}} \right)\cdot p\left( {o_{fr,rf} ,a_{fr,rf} ,{ 5}} \right)$$
(114)
$$t^* = {\text{ arg max}}_{ 1\le t \le 5} p_{t}$$
(115)
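
The analogous sketch for the boosting ensembles (Eqs. 97-115) differs only in that each pairwise ensemble returns a hard label o[(i, j)] in {1, ..., 5} and its training accuracy a[(i, j)] stands in for the probability that this label is correct (Eq. 109).

def combine_pairwise_boosting(o, a):
    """Return t* = arg max_t p_t (Eq. 115)."""
    p = {}
    for t in range(1, 6):
        prob = 1.0
        for u in range(1, 6):
            if u == t:
                continue
            key = (t, u) if t < u else (u, t)       # ensembles are keyed with i < j
            prob *= a[key] if o[key] == t else 1.0 - a[key]
        p[t] = prob
    return max(p, key=p.get)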

4.5 Experimental design

We employed fivefold cross-validation in each of the experiments. The 84 speakers were randomly allocated to folds. Speakers, rather than utterances, were allotted to folds to guarantee that training and testing on the same speaker did not bias the trials. Thirteen experiments were conducted: one for each combination of the two classifiers (neural network and boosting ensemble), two configurations (multi-class and pairwise coupling), and three sets of features (from the TILT, Bézier, and four-point models), plus one experiment for the rule-based classifier.
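
A minimal sketch of this speaker-level fold assignment (our own illustration; the fold-size balancing is approximate) is shown below; utterances is assumed to map each utterance id to its speaker id.

import random

def assign_folds(speakers, n_folds=5, seed=0):
    """Shuffle the speakers and assign each one to a fold."""
    speakers = list(speakers)
    random.Random(seed).shuffle(speakers)
    return {spk: i % n_folds for i, spk in enumerate(speakers)}

def split(utterances, fold_of_speaker, test_fold):
    """Split utterances so no speaker appears in both training and test sets."""
    train = [u for u, s in utterances.items() if fold_of_speaker[s] != test_fold]
    test = [u for u, s in utterances.items() if fold_of_speaker[s] == test_fold]
    return train, test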

5 Results

In 13 experimental setups, we examined the performance of combinations of three classifiers in two configurations and three sets of features in automatically classifying the tone choice of a termination prominent syllable. We calculated accuracy and Cohen’s kappa coefficient (κ) (Cohen 1960) to evaluate the thirteen approaches of classifying tone choice. Accuracy is calculated as follows:

$$H_{test} = \, \left\{ {\text{human tone choices for test samples}} \right\}$$
(116)
$$M_{test} = \, \left\{ {\text{machine tone choices for test samples}} \right\}$$
(117)
$$N = \, \left| {M_{test} } \right| \, = \, \left| {H_{test} } \right|$$
(118)
$$h_{i} \in H_{test}$$
(119)
$$m_{i} \in M_{test}$$
(120)
$$a_{i} = \left\{ {\begin{array}{*{20}c} {1,\quad m_{i} = h_{i} } \\ {0,\quad m_{i} \ne h_{i} } \\ \end{array} } \right.$$
(121)
$$Accuracy = \frac{{\mathop \sum \nolimits_{i = 1}^{N} a_{i} }}{N}$$
(122)

Cohen’s kappa coefficient (κ) is calculated as follows:

$$Pr\left( a \right) \, = {\text{ relative observed agreement between human and machine }} = Accuracy$$
(123)
$$T = \, \left\{ { 1,{ 2},{ 3},{ 4},{ 5}} \right\}{\text{ corresponding to tone choices }}\left\{ {{\text{rise}},{\text{ neutral}},{\text{ fall}},{\text{ fall-rise}},{\text{ rise-fall}}} \right\}$$
(124)
$$t \in T$$
(125)
$$b_{t,i} = \left\{ {\begin{array}{*{20}c} {1,\quad h_{i} = t} \\ {0,\quad h_{i} \ne t} \\ \end{array} } \right.$$
(126)
$$c_{t,i} = \left\{ {\begin{array}{*{20}c} {1,\quad m_{i} = t} \\ {0,\quad m_{i} \ne t} \\ \end{array} } \right.$$
(127)
$$Pr\left( {h_{i} = t} \right) \, = \;\frac{{\mathop \sum \nolimits_{i = 1}^{N} b_{t,i} }}{N}$$
(128)
$$Pr\left( {m_{i} = t} \right) \, = \frac{{\mathop \sum \nolimits_{i = 1}^{N} c_{t,i} }}{N}$$
(129)
$$Pr\left( e \right) \, = {\text{ probability of chance agreement between human and machine}}$$
(130)
$$Pr\left( e \right) = \sum\limits_{t = 1}^{5} {Pr\left( {h_{i} = t} \right) \cdot Pr\left( {m_{i} = t} \right)}$$
(131)
$$\kappa = \frac{{\Pr \left( a \right) - { \Pr }(e)}}{{1 - { \Pr }(e)}}$$
(132)
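
For reference, the two metrics reduce to the short computation below (a sketch following Eqs. 116-132), where h and m are equal-length sequences of human and machine tone-choice codes 1-5.

from collections import Counter

def accuracy(h, m):
    return sum(hi == mi for hi, mi in zip(h, m)) / len(h)

def cohens_kappa(h, m):
    n = len(h)
    pr_a = accuracy(h, m)                                   # observed agreement, Eq. 123
    h_freq, m_freq = Counter(h), Counter(m)
    # Chance agreement: sum over classes of Pr(h = t) * Pr(m = t), Eqs. 128-131.
    pr_e = sum((h_freq[t] / n) * (m_freq[t] / n) for t in range(1, 6))
    return (pr_a - pr_e) / (1.0 - pr_e)                     # Eq. 132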

Table 5 displays the accuracy and Cohen’s kappa coefficient (κ) of the three feature models: four-point, TILT, and Bézier; using two classifiers: neural network and boosting; in two configurations: multi-class and pairwise coupling. It also presents these metrics for the rule-based classifier. The accuracy and Cohen’s kappa coefficient (κ) are the mean of the five folds.

Table 5 Accuracy and Cohen’s kappa coefficient (κ) for different feature models, classifiers, and configurations

The rule-based classifier, which is built on our four-point model, classified better than the others with an accuracy of 75.1 % and a Cohen’s kappa coefficient of 0.73 (bolded in Table 5). We believe this happened because the four-point model, on which the rule-based classifier is founded, is a more general model of pitch contour than either the TILT or Bézier models. Our initial hypothesis was that a more general model was needed to model the more complex pitch contours of Brazil’s tone choices.

From a model perspective, our four-point model was the best with a mean classifier accuracy of 74.1 % and a mean classifier κ of 0.71, followed by the Bézier model (71.0 %, 0.68) and the TILT model (67.4 %, 0.65). The TILT model may have functioned poorly because it did not account for Brazil’s fall-rise tone choice. From a machine learning classifier point of view, the boosting ensemble was better than the neural network with a mean classifier accuracy of 71.6 versus 69.9 % and a mean κ of 0.69 versus 0.67.

The findings for the multi-class configuration versus the pairwise coupling configuration were mixed. The multi-class configuration worked better for the neural network with all three models, and it also achieved better results with the boosting ensemble when our four-point model was employed. However, with the boosting ensemble and the other two models, the pairwise coupling configuration achieved better accuracy and κ than the multi-class configuration.

6 Discussion

The study evaluated two machine learning classifiers (i.e., neural network and boosting ensemble) in two configurations (i.e., multi-class and pairwise coupling) in automatically classifying the five tone choices of Brazil’s intonation model. For each of the four combinations of classifier and configuration, we considered three sets of features drawn from three pitch contour models: TILT, Bézier, and our four-point model. We have also compared these twelve combinations with our rule-based classifier which is established on the four-point model. We assessed the performance in terms of accuracy and Cohen’s kappa coefficient.

The outcomes of our study provide evidence that a computer can classify the tone choices of terminating prominent syllables with an accuracy of 75.1 % and a κ of 0.73 when compared with a human expert. At present there is no other research on automatically classifying Brazil's tone choices with which to compare these results. Thus, our work sets the standard for future efforts.

At the same time, the agreement between a computer and a human found in our study can be compared with the inter-rater agreement between two humans. A common inter-rater agreement measure is Cohen’s kappa coefficient. Escudero-Mancebo et al. (2014) noted that in the current state of art for ToBI research, κ ranges from 0.51 (Yoon et al. 2004) to 0.69 (Syrdal and McGory 2000). Breen et al. (2012) reported κ values of 0.52 and 0.77 for RaP investigations. The Rhythm and Pitch (RaP) system is a method of labeling the rhythm and relative pitch of spoken English. It is an extension of ToBI that permits the capture of both intonational and rhythmic aspects of speech (Dilley and Brown 2005), based on a tone interval theory proposed by Dilley (2005). In our experiments, as can be seen in Table 5, κ was generally higher than this, ranging from 0.61 to 0.73. Cross-corpora comparisons are dubious, but in this case we are comparing the human annotation of corpora using two different models of prosody, ToBI and RaP, with our computer annotation using the Brazil model. Although not conclusive, it does show that our computer annotation is in the range of inter-rater agreement between two humans.

Our study can also be contrasted with other research from the perspective of models, classifiers, and configuration. From a model view point, our four-point model functioned the most successfully, followed by the Bézier model, and the TILT model. The Bézier model performed better than the TILT model in other studies, too (Escudero-Mancebo and Cardeñoso-Payo 2007; González-Ferreras et al. 2012). From the perspective of a machine learning classifier, the boosting ensemble classifies tone choices better than the neural network. González-Ferreras et al. (2012) also support this view that the boosting ensemble is better than a neural network for classifying ToBI boundary tones and pitch accents. Unlike our mixed findings of the multi-class configuration versus the pairwise coupling configuration, after testing the TILT and Bézier models, González-Ferreras et al. (2012) reported that pairwise coupling is better at classifying ToBI boundary tones and pitch accents than multi-class in every case.

7 Conclusions

These experiments assessed the performance, in terms of accuracy and Cohen’s kappa, of two machine learning classifiers (i.e., neural network and boosting ensemble) in two configurations (i.e., multi-class and pairwise coupling) of classifying the five tone choices of Brazil’s intonation model with three sets of features extracted from three pitch contour models: TILT, Bézier, and our four-point model. These twelve combinations of classifiers, configurations, and feature sets were also contrasted with our rule-based classifier which is founded on the four-point model.

The findings reported in this paper offer empirical evidence that a computer can classify terminating prominent syllable tone choices specified in Brazil’s (1997) model of intonation with an accuracy approaching that of two human analysts. They also demonstrate that our four-point model is a better one for Brazil’s tone choices than either the TILT or Bézier model. Automatic classification of tone choices is an important achievement because tone choices are one of the key elements of Brazil’s model. Brazil’s model deals with the intonational and rhythmic aspects of speech and explains how they convey meaning that goes beyond what the sentences communicate (Brazil 1997). Accordingly, automatically classifying tone choices is another vital step in automatically deducing the intonational and rhythmic facets of speech.

Examining other classifiers (e.g., linear classifiers, support vector machines, lazy learning algorithms, random forests, meta-algorithms) as a means of improving tone choice classification is an area for further study. Since TIMIT is only read speech we cannot generalize the results to unconstrained, conversational, or any other type of speech. Thus, another area to explore is the use of other training corpora containing spontaneous, dialogic, and other types of speech.

The results reported in this paper reaffirm the potential, identified in our earlier work, of investigating Brazil's (1997) intonation discourse theory as a means of better comprehending natural discourse in different environments.