1 Introduction

Prosody conveys crucial information in speech. It reflects various features of the speaker as well as the utterance: the emotional state of the speaker, the form of the utterance, the presence of irony or sarcasm, emphasis, contrast, and focus, or other elements of language that may not be encoded by grammar or by choice of vocabulary. Prosody extends over more than one sound segment in an utterance and covers paralinguistic aspects of speech such as pitch, tone, duration, intensity, and voice quality (Chun 2002). Several intonational models for representing prosodic features have been proposed. However, they often face various challenges and limitations in adequately interpreting actual, natural human discourse (Xu 2012).

Two principal frameworks are often used for representing and interpreting prosodic features: Brazil (1997) and Pierrehumbert (1980). Both models provide important information about the prosodic features of human discourse. Pierrehumbert’s framework is widely used to model text-to-speech synthesis and has been realized quantitatively. Brazil’s framework has been used in discourse analysis and language teaching and learning (Cauldwell 2012). However, Pierrehumbert’s model does not sufficiently account for the meaning of intonation in naturally occurring discourse (Dilley 2005).

Dilley (2005) proposed a tone interval theory, which captured the intonational and rhythmic aspects of speech. Her theory provided experimental evidence that showed how Pierrehumbert’s framework (Pierrehumbert 1980; Pierrehumbert and Beckman 1988) did not account for intonation meaning. However, such tone interval theory still does not explain how tone units are used in discourse. This is the motivation of the current study choosing Brazil’s model, which offers efficient and meaningful interpretation of natural discourse (e.g., tone choices). It emphasizes the idea of interactional significance of prosody and the achievement of the communicative functions in discourse (Brazil 1997). For example, Brazil’s model stresses the first and last prominent syllables and comprises pitch concord, which are essential distinctions across varieties of languages (Pickering 1999). It uses tone choices for the interpretation of speakers’ intention, meaning, emotion, or other communicative purposes in the discourse (Pickering 2009).

In a monologic speech, rising tones are used for showing solidarity or expressing known or shared knowledge or indicating uncertainty or lack of power. Falling tones are for presenting a topic closure or expressing new information, or showing speakers’ authority. Level tones are more for focusing on action rather than discourse or indicating the continuation of discourse. In addition, rising contours are associated with anger, fear, and joy whereas falling contours are connected to sadness and tenderness (Juslin and Laukka 2004). The patterns of these tones can be understood in the relationship between the final tone unit of one move and the initial key choice of the next move, called pitch concord (Brazil 1997).

The fundamental unit of Brazil’s model is the prominent syllable. Brazil is very clear in his work that the importance of prominence is on the syllable and not the word. He provided examples of words with more than one prominent syllable and words whose prominent syllable varied depending on the intonational meaning the speaker was imparting. Although the rest of Brazil’s model is easy to quantify, what makes a syllable prominent is difficult to compute. Brazil states that prominent syllables are recognized by the hearer as having more emphasis than other syllables. A trained analyst can easily identify prominent syllables by listening to an utterance. However, quantifying the difference between a prominent syllable and a non-prominent syllable is not so straightforward. Brazil further notes in his description of prominent syllables that prominence should be contrasted with word or lexical stress. Lexical stress concerns the syllable within a content word that is stressed. Prominence, in contrast, concerns the use of stress to distinguish those words that carry more meaning, more emphasis, or more contrast in an utterance. Thus, a syllable within a word that normally receives lexical stress may receive additional pitch, length, or loudness to distinguish meaning (Brazil 1997). Alternatively, a syllable that does not usually receive stress (such as a function word) may receive stress for contrastive purposes.

Brazil’s definition of prominence is similar to, but more specific than, more commonly known definitions of prominence. Terken (1991) stated that prominence is the attribute of a linguistic unit which makes it stand out perceptually from its environment. However, unlike Brazil, he did not specify the linguistic unit as the syllable. Like Brazil, he said prosodic prominence is connected to the suprasegmental pitch, duration, and intensity attributes of speech.

The purpose of this paper is to determine the best machine learning classifier and set of features, chosen from pitch, length (i.e., duration), and loudness (i.e., intensity), to automatically detect Brazil’s prominent syllables. Specifically, we will assess the performance of five machine learning classifiers and seven feature sets built from three features (pitch, intensity, and duration) taken one at a time, two at a time, and all three together, in automatically detecting Brazil’s prominent syllables.

Section 2 reviews existing research in the area of prominent syllable and prominent word detection. Section 3 describes the speech corpus, classification features, machine learning classifiers, and experimental methods used in the current research. In Sect. 4, we present the results of our experiments, followed by a comparison with other research findings in the field of speech science along with conclusions in Sect. 5.

2 Prominent syllable and word detection research

In the field of speech production and engineering, Brazil’s (1997) framework has hardly been utilized. In contrast, there is a large body of research on detecting Pitch Accents and Boundary Tones as defined by the ToBI standard. Tones and Break Indices (ToBI) is a system for labeling prosodic events in spoken utterances (Wightman et al. 1992; Beckman and Elam 1997). This standard specifies three types of prosodic events: Pitch Accents, Boundary Tones and Break Indices. Pitch Accents refer to the prosodic function of prominence. Boundary Tones and Break Indices refer to the prosodic function of phrasing. Although pitch accents are defined as a function of prominence, there are usually more pitch accents in an utterance than Brazil’s prominent syllables. This is because Pierrehumbert did not make a distinction between lexical stress and what Brazil calls “prominence”. Thus, there is no one-to-one correspondence between Brazil’s concept of prominent syllables and Pitch Accents, Boundary Tones, or Break Indices.

The ToBI-related research uses a variety of intensity, duration, and pitch measurements along with lexical or syntactic cues (i.e., features) to detect prosodic events. Ludusan and Dupoux (2014) investigated using several duration and pitch features, by themselves and in combination, without any lexical or syntactic cues to detect prosodic boundaries. They found that a combination of all the cues compared well with previous work. Ni et al. (2011) detected ToBI Pitch Accents with an accuracy of 91.4 % and Boundary Tones with an accuracy of 95.2 % utilizing pitch, duration, intensity, and lexical and syntactic cues. Later, they applied the same techniques to detect Mandarin stress (Ni et al. 2012) with an accuracy of 89.9 %. Jeon and Liu (2009) also used pitch, duration, intensity, and lexical and syntactic features to detect ToBI Pitch Accents and achieved an accuracy of 89.8 %. Likewise, Ananthakrishnan and Narayanan (2008) detected ToBI accent (86.75 % accuracy) and prosodic phrase boundaries (91.6 % accuracy) with pitch, duration, intensity, and lexical and syntactic cues.

To classify both Pitch Accents and Boundary Tones, González-Ferreras et al. (2012) used a number of acoustic features (pitch, energy, and vowel nucleus duration), lexical and syntactic features (part-of-speech tags), and pitch contour features with fusion of pairwise coupled neural network and decision trees classifiers and applied the Viterbi algorithm to find the best tone sequence to achieve classification accuracies of 70.8 % (pitch accents) and 84.2 % (boundary tones). With pitch, duration, intensity, and lexical and syntactic features, Sridhar et al. (2008) detected ToBI Pitch Accents (86 % accuracy) and Boundary Tones (93.1 % accuracy).

Rosenberg and Hirschberg (2009) compared pitch accent identification at the syllable, vowel, and word level, and found that a word level approach is superior to syllable or vowel level identification achieving an accuracy of 84.2 %.

Silipo and Greenberg (1999) concluded that intensity and duration are the most important acoustic parameters underlying prosodic stress in casually spoken American English, and that pitch plays only a minor role in the assignment of stress. In a later study (Silipo and Greenberg 2000), they reexamined this conclusion using both the range and average level of pitch to determine whether there were circumstances in which pitch figures importantly in prosodic stress. They found in the later study that pitch range is slightly more effective than average pitch. They explained that this finding was most likely a consequence of duration-related information intrinsic to pitch range, and was thus consistent with their early finding that pitch played a relatively minor role in stress assignment in naturally spoken American English. Kochanski et al. (2005) studied seven dialects of British and Irish English and three different styles of speech to find acoustic correlates of prominence. They found pitch played a minor role in distinguishing prominent syllables from the rest of the utterance. Instead, speakers primarily marked prominence with patterns of intensity and duration. Rosenberg and Hirschberg (2006) studied the correlation between intensity and pitch accent of four native speakers of Standard American English. They were able to predict pitch accent in read speech with an accuracy of 81.9 % using only intensity.

There are also other non-ToBI research initiatives examining “prominence”, where “prominence” in this case also includes lexical stress. These research initiatives combine intensity, duration, pitch, and other acoustic features (i.e., no lexical or syntactic cues) to automatically identify syllabic prominence. In relatively recent years, various studies have attempted to detect such prominence types. Avanzi et al. (2010) detected syllabic prominence in French with pitch, duration, and pause. Similarly, Streefkerk et al. (1997) identified prominence in Dutch with intensity and duration and no pitch. Prominence in English was detected with pitch, intensity, and duration by Mahrt et al. (2011, 2012a, b). Tamburini (2006) used pitch movements, overall syllable energy, syllable nuclei duration, and mid-to-high-frequency emphasis to detect syllabic prominence in English. Some researchers only used pitch and intensity to detect syllabic prominence in French and Italian (Ludusan et al. 2011). Finally, Cutugno et al. (2012) included pitch, intensity, and duration of a syllable and its neighbors to detect syllabic prominence in English and Italian.

The research above shows that a good number of machine learning classifiers and features have been employed to detect various types of prominence. In this paper, we will determine the best machine learning classifier and set of features to automatically detect Brazil’s prominent syllables. We will only test pitch, duration, and intensity because those are the features that constitute prominence in Brazil’s (1997) terms. Specifically, we will examine the performance of five machine learning classifiers (neural network, decision tree, support vector machine, bagging, and boosting) and seven feature sets built from three features (pitch, intensity, and duration) taken one at a time, two at a time, and all three together.

3 Methods

3.1 TIMIT corpus

The DARPA TIMIT Acoustic–Phonetic Continuous Speech Corpus (TIMIT) of read speech was designed to provide speech data for the acquisition of acoustic–phonetic knowledge and for the development and evaluation of automatic speech recognition systems (Garofolo et al. 1993). TIMIT contains a total of 6300 sentences, 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. The text material in the TIMIT prompts consists of two dialect sentences, 450 phonetically-compact sentences, and 1890 phonetically-diverse sentences. The dialect sentences were intended to reveal the dialect of the speakers and were read by all 630 speakers. The phonetically-compact sentences were designed to provide a good coverage of pairs of phones, with extra occurrences of phonetic contexts thought to be either of particular interest or difficult. Each speaker read five of these sentences and each text was spoken by seven different speakers. The phonetically-diverse sentences were selected to maximize the variety of allophonic contexts found in the texts. Each speaker read three of these sentences, with each sentence being read only by a single speaker. The corpus includes hand corrected start and end times for the phones, phonemes, pauses, syllables, and words. The TIMIT corpus includes definitions for 60 phones. The TIMIT phones are used by other corpora. For this research, we used a subset of the corpus consisting of 84 speakers speaking four dialects. There were 836 utterances in our subset containing 10,657 syllables. Table 1 shows the distribution of speakers by gender and dialect.

Table 1 Distribution of TIMIT speakers by gender and dialect used in this research

We augmented the corpus by identifying the prominent syllables in the experimental subset. The prominent syllables were identified by a trained analyst who coded them both by listening to the audio files and by using the Multi-Speech and CSL Software (KayPENTAX 2008) to view the pitch, intensity, and duration of the syllables. Roughly 10 percent of the samples were analyzed by a second trained analyst to verify the reliability of prominent syllable coding. The inter-coder reliability between the two human coders was around 85–87 %, which is a relatively acceptable rate as seen in other similar coding protocols (e.g., Kang 2010), particularly those using Brazil’s (1997) framework. The two analysts reviewed any inconsistencies and resumed coding the samples until they agreed on the coding. The first analyst then completed the analysis independently for the remaining speech samples. The analyst identified 3536 prominent syllables in the speech samples. This coding method has been widely practiced as a reliable labeling scheme in other studies in applied linguistics (Kang 2010; Kang et al. 2010; Pickering 1999).

Although the TIMIT corpus consists of isolated sentences, we chose it for prominent syllable detection because it contained a wide variety of speakers and dialects, many more than the six speakers and one dialect in the Boston University Radio News Corpus (Ostendorf et al. 1995), which is another commonly used corpus for intonation studies. The current study only used a subset of the TIMIT corpus. However, 84 speakers speaking four dialects with over 10,000 syllables and over 3500 prominent syllables proved to be sufficient for the identification of an appropriate prominent syllable classifier and feature set.

3.2 Classification features

As input to the classifiers, we used seven sets of features for each syllable consisting of combinations of three features: pitch, intensity, and duration, taken one at a time, two at a time, and all three. The pitch feature was calculated by taking the median of the pitch contour of the syllable extracted by Praat (Boersma and Weenink 2014). The intensity feature was calculated by taking the maximum of the intensity contour of the syllable extracted using the Matlab audioread function (MathWorks 2013). The duration feature was calculated by using the syllable start and stop times from the TIMIT corpus (Garofolo et al. 1993). In other words, the prominent syllable classifiers used the syllable boundaries given in the corpus. Pitch, intensity, and duration vary across speakers. They can be different even for the same speaker due to various idiosyncratic factors. To ameliorate the effect that this variation might have on prominent syllable detection, the features within a run were normalized with Z-scores and scaled to the interval [−1, 1]. A run is the speech between two pauses, where a pause is defined as a silence longer than 100 ms; the lengths of the pauses were provided by the corpus. The normalization was computed as follows:

$$ {\text{f}}_{\text{i}} = {\text{feature value for syllable i}} $$
(1)
$$ {\text{f}}_{\text{mean}} = {\text{mean of f}}_{\text{i}} {\text{ for all syllables in the run}} $$
(2)
$$ {\text{f}}_{\text{std}} = {\text{standard deviation of f}}_{\text{i}} {\text{ for all syllables in the run}} $$
(3)
$$ {\text{fnorm}}_{\text{i}} = \left( {{\text{f}}_{\text{i}} - {\text{f}}_{\text{mean}} } \right)/{\text{f}}_{\text{std}} $$
(4)
$$ {\text{fnorm}}_{ \hbox{max} } = {\text{maximum fnorm}}_{\text{i}} {\text{ for all syllables in the run}} $$
(5)
$$ {\text{fnorm}}_{ \hbox{min} } = {\text{minimum fnorm}}_{\text{i}} {\text{ for all syllables in the run}} $$
(6)
$$ {\text{fnorm}}_{\text{scale}} = { \hbox{max} }\left( {{\text{fnorm}}_{ \hbox{max} } ,|{\text{fnorm}}_{ \hbox{min} } |} \right) $$
(7)
$$ {\text{fscaled}}_{\text{i}} = {\text{fnorm}}_{\text{i}} /{\text{fnorm}}_{\text{scale}} $$
(8)

Z-score normalization provides a zero-mean, unit-standard deviation normalization of the input data. For this to be valid there is an assumption that the underlying data be normally distributed. The [−1, 1] interval normalization is extremely sensitive to outliers. However, we tried three other methods of normalization: (1) no normalization, (2) dividing the feature values by the mean feature value of the run, and (3) Z-score normalization without interval normalization; and found all of them provided worse performance in terms of accuracy, F-measure, and κ.
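
As an illustrative sketch of Eqs. (1)–(8), and not the Matlab code used in this research, the per-run normalization can be written in Python as below. The function name `normalize_run`, the use of the population standard deviation, and the handling of a run whose values are all equal are assumptions made for the example; the paper does not specify them.

```python
from statistics import mean, pstdev


def normalize_run(values):
    """Z-score the feature values of one run, then scale them into [-1, 1].

    `values` holds one feature (pitch, intensity, or duration) for every
    syllable in a single run, in order.
    """
    f_mean = mean(values)                      # Eq. (2)
    f_std = pstdev(values)                     # Eq. (3); population SD is an assumption
    if f_std == 0:
        # Assumption: a run with no variation (e.g. one syllable) maps to zeros.
        return [0.0 for _ in values]
    fnorm = [(f - f_mean) / f_std for f in values]           # Eq. (4)
    fnorm_scale = max(max(fnorm), abs(min(fnorm)))           # Eqs. (5)-(7)
    return [z / fnorm_scale for z in fnorm]                  # Eq. (8)


if __name__ == "__main__":
    # Hypothetical median-pitch values (Hz) for a three-syllable run.
    print(normalize_run([100.0, 120.0, 200.0]))
```

Because Eq. (8) divides every Z-score by the largest absolute Z-score in the run, the standard deviation cancels out of the final scaled value, so the choice between the sample and the population standard deviation does not change the result of the sketch.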

3.3 Classifiers

We used five standard machine-learning classifiers to detect prominent syllables: neural network, support vector machine, decision tree, bagging, and boosting.

In machine learning, neural networks are a group of statistical learning models motivated by the biological neural networks in animal brains (Happel and Murre 1994). They are utilized to estimate or approximate functions (e.g., prominent syllable detection) that can depend on a number of unknown inputs (e.g., pitch, duration, and intensity). Neural networks are commonly portrayed as arrangements of interconnected nodes that send messages to each other, representing the interconnection of neurons in the brain. The connections have numeric weights that are tuned with a set of training data, which allows neural networks to adapt to their inputs and makes them capable of learning. We employed the Matlab fitnet function with ten hidden nodes (i.e., neurons) to implement the neural network classifier (MathWorks 2013).

Support vector machines are machine learning models with associated learning algorithms that recognize patterns (i.e., pitch, intensity, and duration of prominent syllables) (Cortes and Vapnik 1995). Provided with a set of training examples (e.g., pitch, intensity and duration of syllable), each denoted as belonging to one of two categories (e.g., prominent syllable or non-prominent syllable), a support vector training algorithm constructs a model that designates new examples as belonging to one category or the other. It is a non-probabilistic binary linear classifier. A support vector model is a depiction of the examples as points in space (e.g. pitch, intensity, and duration as points in 3-dimensional space), plotted so that the examples of the separate classes (e.g., prominent syllable and non-prominent syllable) are partitioned by a well-defined gap that is maximally broad. A new example is then plotted into that same space and determined to belong to a class depending on which side of the gap it is on. We utilized the Matlab svmtrain function to implement the support vector machine classifier (MathWorks 2013).

Decision tree learning is a machine learning technique that makes use of a decision tree as a predictive model to map observations (e.g., pitch, intensity, and duration) about an item (e.g., syllables) to conclusions about the item’s target value (e.g., prominent or non-prominent) (Quinlan 1999). It is a predictive modeling approach frequently found in statistics and data mining. Classification trees are models where the target variable can take a finite set of values. Leaves of the tree represent class labels (e.g., prominent and non-prominent) and branches are combinations of features that lead to those class labels (e.g., pitch, intensity, and duration). Decision tree learning is one of the more successful machine learning techniques. The decision tree classifier was implemented with the Matlab ClassificationTree function (MathWorks 2013).

Bagging and boosting are ensemble classifiers that combine the results of weak classifiers (typically decision trees) to improve their performance. Ensemble prediction usually entails more calculations than predicting with a single model, thus ensembles may be considered as a means to make up for poor learning algorithms with extra computation (Opitz and Maclin 1999). An ensemble is itself a machine learning technique, because it is trained and then applied to make predictions. Ensembles are more flexible in the functions they can model. This flexibility can lead to them over-fitting the training data more than a single model would. To compensate for this, ensemble classifiers employ techniques that reduce problems related to over-fitting of the training data.

Bagging stands for bootstrap aggregation (Breiman 1994). Bagging is accomplished by replicating portions of the training data and constructing multiple decision trees from the replicated data. The output of the ensemble is the average of the predictions from the individual trees. Bagging was implemented with the Matlab fitensemble function using 100 decision tree learners (MathWorks 2013).
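
The idea of bootstrap aggregation can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the Matlab fitensemble implementation: `fit` stands for any hypothetical routine that trains one weak learner and returns a function mapping a feature vector to 0 (non-prominent) or 1 (prominent), and the averaged output is thresholded at 0.5.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) training examples with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_bagging(rows, fit, n_learners=100, seed=0):
    """Train n_learners weak learners, each on its own bootstrap replicate.

    `fit` is a hypothetical training function: it takes a list of
    (features, label) pairs and returns a callable model.
    """
    rng = random.Random(seed)
    return [fit(bootstrap_sample(rows, rng)) for _ in range(n_learners)]


def bagged_predict(models, x):
    """Average the 0/1 predictions of all learners and threshold at 0.5."""
    mean_vote = sum(model(x) for model in models) / len(models)
    return 1 if mean_vote >= 0.5 else 0
```

Each learner sees a slightly different replicate of the training data, so averaging their predictions reduces the variance of a single decision tree.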

Most boosting algorithms entail repetitive training of weak classifiers and adding them to a final strong classifier (Breiman 1996). When they are added, they are usually weighted in a manner that is typically connected to the weak learners’ accuracy. After a weak learner is added, the training instances are reweighted: instances that are misclassified gain weight and instances that are classified correctly lose weight. Thus, future weak learners concentrate more on the instances that prior weak learners classified incorrectly. Several boosting algorithms also decrease the weight of examples that are continually classified incorrectly. Boosting was realized with the Matlab fitensemble function using the AdaBoostM1 booster and 100 decision tree learners (MathWorks 2013).
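
To make the reweighting step concrete, the following Python sketch shows one round of the standard AdaBoost.M1 update as formulated by Freund and Schapire, in which correctly classified instances are scaled down so that misclassified instances carry relatively more weight in the next round. It is an illustration only; the function name `adaboost_m1_reweight` is hypothetical and the code is not taken from the Matlab implementation used here.

```python
import math


def adaboost_m1_reweight(weights, correct):
    """Perform one AdaBoost.M1 reweighting step.

    weights: current instance weights (any positive numbers).
    correct: booleans, True where the new weak learner classified the
             instance correctly.
    Returns (new_weights, learner_weight).
    """
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted error of the weak learner on the training instances.
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < error < 0.5:
        raise ValueError("AdaBoost.M1 needs a weighted error strictly between 0 and 0.5")
    beta = error / (1.0 - error)
    # Shrink the weight of correctly classified instances, then renormalize.
    new_weights = [w * beta if ok else w for w, ok in zip(weights, correct)]
    norm = sum(new_weights)
    new_weights = [w / norm for w in new_weights]
    # The learner's vote in the final ensemble grows as its error shrinks.
    learner_weight = math.log(1.0 / beta)
    return new_weights, learner_weight
```

With four equally weighted instances and one misclassification, the misclassified instance ends the round holding half of the total weight, which is what forces the next weak learner to focus on it.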

Resubstitution error is the difference between the actual responses (i.e., prominent and non-prominent) in the training data and the responses the tree predicts based on the input training data (i.e., pitch, intensity, and duration). If the resubstitution error is high, the predictions of the tree cannot be expected to be good. A common method of determining the number of decision trees to use in an ensemble is to plot resubstitution error versus number of trees and use a number of decision trees well past the knee of the curve (MathWorks 2013). Figure 1 illustrates that 100 decision tree learners are sufficient.

Fig. 1 Number of decision trees versus resubstitution error

None of the classifiers were optimized beyond the standard settings for the Matlab functions.

3.4 Experimental design

In all the experiments we applied five-fold cross-validation. The folds were created by randomly assigning the 84 speakers to folds. Speakers were randomly assigned to folds rather than the utterances to ensure that training and testing on the same speaker did not bias the experiments. Thirty-five experiments were conducted: one for each combination of the five classifiers (i.e., neural network, decision tree, support vector machine, bagging, and boosting) and seven combinations of features (i.e., pitch, intensity, and duration taken one at a time, two at a time, and all three at a time).
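
The design can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the helper names, the random seed, and the round-robin split that yields folds of nearly equal size are all hypothetical choices, since the paper states only that speakers were assigned to folds at random.

```python
import random
from itertools import combinations

FEATURES = ("pitch", "intensity", "duration")
CLASSIFIERS = (
    "neural network",
    "decision tree",
    "support vector machine",
    "bagging",
    "boosting",
)


def feature_sets(features=FEATURES):
    """All non-empty subsets: one at a time, two at a time, and all three."""
    return [
        combo
        for k in range(1, len(features) + 1)
        for combo in combinations(features, k)
    ]


def speaker_folds(speaker_ids, n_folds=5, seed=0):
    """Randomly assign speakers (not utterances) to folds.

    Every syllable of a speaker lands in the same fold, so no speaker
    appears in both the training and the test data of a given fold.
    """
    ids = sorted(speaker_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def experiments():
    """One experiment per (classifier, feature set) pair."""
    return [(clf, fs) for clf in CLASSIFIERS for fs in feature_sets()]


if __name__ == "__main__":
    print(len(feature_sets()), "feature sets,", len(experiments()), "experiments")
```

Three features give 2^3 − 1 = 7 non-empty feature sets, and 5 classifiers × 7 feature sets give the 35 experiments.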

4 Results

The purpose of this research is to determine which machine learning classifier and set of features, chosen from pitch, duration, and intensity, is best for automatically detecting Brazil’s prominent syllables. In 35 experiments, we examined the performance of five machine learning classifiers and seven sets of features consisting of three features: pitch, intensity, and duration, taken one at a time, two at a time, and all three at a time, in automatically detecting Brazil’s prominent syllables. To evaluate the performance of the five classifiers and the seven sets of features, accuracy, F-measure, and Cohen’s kappa coefficient (κ) (Cohen 1960) were used. Accuracy and F-measure are calculated as follows:

$$ {\text{TP}} = {\text{number of syllables where both the computer and the human identified it as prominent}} $$
(9)
$$ {\text{TN}} = {\text{number of syllables where both the computer and the human identified it as not prominent}} $$
(10)
$$ {\text{FP}} = {\text{number of syllables where the computer identified it as prominent and the human identified it as not prominent}} $$
(11)
$$ {\text{FN}} = {\text{number of syllables where the computer identified it as not prominent and the human identified it as prominent}} $$
(12)
$$ {\text{Accuracy}} = \left( {{\text{TP}} + {\text{TN}}} \right)/\left( {{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}} \right) $$
(13)
$$ {\text{F-Measure}} = 2 {\text{TP}}/\left( { 2 {\text{TP}} + {\text{FP}} + {\text{FN}}} \right) $$
(14)
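
Equations (9)–(14) translate directly into code. The Python sketch below is illustrative only; the function names are hypothetical. The κ function uses the standard two-category definition from Cohen (1960), which is an assumption about how κ was computed, since the text above does not give its formula. The functions return fractions, so multiply by 100 to compare with the percentages reported in this paper.

```python
def confusion_counts(predicted, actual):
    """Count TP, TN, FP, FN; 1 means prominent, 0 means not prominent."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Eq. (13)."""
    return (tp + tn) / (tp + tn + fp + fn)


def f_measure(tp, tn, fp, fn):
    """Eq. (14); TN is accepted for a uniform signature but not used."""
    return 2 * tp / (2 * tp + fp + fn)


def cohens_kappa(tp, tn, fp, fn):
    """Cohen's kappa for two raters (computer and human) and two categories."""
    n = tp + tn + fp + fn
    p_observed = (tp + tn) / n
    # Agreement expected by chance from each rater's marginal totals.
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)
```

Unlike accuracy, κ discounts the agreement that two raters would reach by chance, which matters here because non-prominent syllables outnumber prominent ones.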

The confidence interval was assumed to be symmetrical and was calculated as follows, where n is the number of folds (5) and σ is the standard deviation of the folds:

$$ {\text{Confidence interval}} = \pm \frac{\sigma }{2\sqrt n } $$
(15)
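
A minimal Python sketch of Eq. (15) follows. The function name is hypothetical, and the use of the sample standard deviation (dividing by n − 1) is an assumption, because the text does not state which estimator of σ was used.

```python
from math import sqrt
from statistics import stdev


def confidence_half_width(fold_scores):
    """Half-width of the symmetric interval in Eq. (15): sigma / (2 * sqrt(n)).

    fold_scores: one score (accuracy, F-measure, or kappa) per fold.
    """
    n = len(fold_scores)
    sigma = stdev(fold_scores)  # assumption: sample standard deviation
    return sigma / (2 * sqrt(n))
```

The reported value for a measure is then the mean of the fold scores plus or minus this half-width.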

Table 2 shows the performance of five classifiers (neural network, decision tree, support vector machine, bagging, and boosting) using seven different sets of the features duration, intensity, and pitch. The accuracy, F-measure, and Cohen’s kappa coefficient (κ) are the means over the five folds.

Table 2 Accuracy, F-measure, and Cohen’s kappa coefficient (κ) and confidence interval for different classifiers and different sets of features sorted by accuracy

Bagging is clearly the best classifier for identifying prominent syllables and the best feature set is duration, intensity, and pitch by all three measures (accuracy = 95.9 ± 0.2 %; F-measure = 93.7 ± 0.4; κ = 0.907 ± 0.005). The second best classifier is Decision Tree, which is what would be expected, since Bagging is a method for improving classification results by using an ensemble of 100 Decision Trees. Comparisons between human coder agreements and the machine are further provided in Sect. 5.

5 Discussion

The results showed that our computer program could detect prominent syllables with an accuracy of 95.9 % (±0.2 %), an F-measure of 93.7 (±0.4), and a κ of 0.907 (±0.005) when compared with humans. These results can be interpreted by comparing them with those of other related computer programs and with agreement rates between human experts.

First, even though cross-corpus comparisons are not always reliable, there are other related computer programs where prominence was identified automatically. Avanzi et al. (2010) reported F-measures of three French syllabic prominence detectors: ANALOR (69.7), PROSOPROM (71.7), and IRCAMPROM (75.4). We achieved an F-measure of 93.7 (±0.4), which is substantially higher than those of ANALOR, PROSOPROM, and IRCAMPROM. Obin et al. (2009) proposed an approach for detecting prominence in a corpus of French read speech which obtained an F-measure of 87.5 and an accuracy of 90.4 %. Christodoulides and Avanzi (2014) trained and evaluated four classifiers on a corpus of spontaneous French speech and found the neural network classifier was the best with an accuracy of 84.2 % and an F-measure of 79.1. Rosenberg and Hirschberg (2010) found classifiers trained on Mandarin L1 English could automatically detect prominence in Mandarin L1 English with an accuracy of 87.2 % and an F-measure of 86.6 while those trained on native English speech detected prominence with an accuracy of 74.8 % and F-measure of 82.4. The current results are higher than all of these studies with an F-measure of 93.7 (±0.4) and an accuracy of 95.9 % (±0.2 %).

The machine classifier performances shown in Table 2 can also be compared to the inter-rater agreement between two human experts. Price et al. (1988) conducted an inter-rater agreement study on a set of three stories from the Boston University Radio News Corpus (Ostendorf et al. 1995) containing 1002 words. They found agreement on presence versus absence for 91 % of the words. Boundary tone agreement was 93 % for the 207 words marked by both labelers with an intonational phrase boundary, and similarly there was 91 % agreement for 280 phrase accents. Ludusan et al. (2011) reported an inter-rater agreement of 91.5 % on syllabic prominence. In this case, the human–human inter-rater reliability was less than the human–computer inter-rater reliability of 95.9 % for the best classifier shown in Table 2.

In another example, where the human–computer inter-rater reliability of 95.9 % was greater than the human–human inter-rater reliability, Kang (2010) found the inter-rater agreement between two phonetic analysts was 86 % or lower in identifying Brazil’s prominent syllables. The main problem raised in her study was the human coders’ subjectivity and fatigue during the labor-intensive procedure of prominence analysis. Indeed, prominence analyses are subject to perceptual limitations (Kang and Pickering 2013). Discrepancies between the two human analysts tend to occur in determining the location of prominent syllables in a stretch of discourse. Therefore, a calibration procedure in which two human analysts reach consensus is required to ensure the reliability of the analysis; however, this process is often known to be difficult. Accordingly, the current method achieving 95.9 % (±0.2 %) agreement between the human rater and the computer is very promising. Such high agreement was possible due to the consistency of a computer program, once it was trained on the basis of protocols used for human coders. While certain ambiguous parts of speech could involve inconsistent judgments between two human coders, a computer program applies the same criteria consistently throughout the speech. This computer-based prominence detection offers a useful resource to supplement human coding in the field of speech science.

The inter-rater agreement between two human experts can likewise be contrasted with the machine classifier performances shown in Table 2 using Cohen’s kappa coefficient. Escudero-Mancebo et al. (2014) noted that in the current state of the art for ToBI research, κ ranges from 0.51 (Yoon et al. 2004) to 0.69 (Syrdal and McGory 2000). Breen et al. (2012) reported κ values of 0.52 and 0.77 for RaP research. The RaP (Rhythm and Pitch) system is a method of labeling the rhythm and relative pitch of spoken English. It is an extension of ToBI that permits the capture of both intonational and rhythmic aspects of speech (Dilley and Brown 2005). It is based on tone interval theory proposed by Dilley (2005). The current method achieved a much greater κ value (0.907 ± 0.005) for inter-rater agreement between the computer and a human than either the ToBI research or the RaP research.

The performances of the support vector machine and the neural network are very low. For the support vector machine, this is probably because the support vector machine is an older machine learning method, which has been surpassed in performance by more modern methods, such as ensembles of decision trees. Escudero-Mancebo et al. (2014) also found that support vector machines under-performed neural networks and decision trees. A more likely reason for the poor performance of the support vector machine and the neural network is that machine learning techniques (e.g., support vector machines, neural networks, decision trees, bagging, and boosting) perform differently in different applications. There is no machine learning technique that works best in all applications. That is the reason we compared the performance of more than one machine learning technique.

6 Conclusions

Overall, the current research has shown that it is possible to detect the fundamental element of Brazil’s model, prominent syllables, with an accuracy exceeding that of two human analysts and other programs that measure prominence. This is an important achievement because detecting prominent syllables is the foundation of Brazil’s theory. As we discussed earlier, Dilley (2005) showed that Pierrehumbert’s framework (Pierrehumbert 1980; Pierrehumbert and Beckman 1988) does not account for the meaning of intonation in natural discourse. On the other hand, Brazil’s model offers a meaningful interpretation of natural discourse by emphasizing the interactional communicative functions in discourse (Brazil 1997). Thus, automatically detecting prominent syllables is the important first step in automatically interpreting the interactional aspects of natural discourse.

The next steps are finding classifiers and algorithms with appropriate feature sets for automatically detecting the other elements of Brazil’s model (i.e., tone unit, tone choice, relative pitch, and pitch concord). Automatic interpretation of natural discourse has many applications in automatic speech recognition (Bocklet and Shriberg 2009; Hämäläinen et al. 2007; Litman et al. 2000; Ostendorf 1999), text-to-speech synthesis, speaker verification and identification (Shriberg et al. 2005; Escudero-Mancebo et al. 2014), human-robot interaction (Nadel et al. 2006), automatic speech scoring systems (Kang et al. 2010; Kang and Wang 2014), computer-aided language learning, forensics, and early childhood diagnosis of autism (Frith and Happé 1994; Fine et al. 1991; Paul et al. 2005; Shriberg et al. 2001; McCann and Peppé 2003). The current study demonstrated the potential of exploring a new discourse-based intonation model, i.e., Brazil’s (1997) intonation discourse framework, to better understand natural discourse in various contexts.