Introduction

Prosody often refers to the music of speech that comprises (1) intonation, (2) rhythm, and (3) stress (Cutler and Isard 1980). The perceptual and acoustic correlates of prosody include pitch (fundamental frequency, F0), intensity (amplitude), duration, and their co-variation (Cutler and Isard 1980). Typical individuals make use of prosody to achieve different communication functions including pragmatic, grammatical, and affective functions without conscious learning. For example, speakers can signal focused-information using stress and indicate utterance boundaries using pauses and lengthening the final syllables (Fox et al. 2008). On the other hand, individuals with autism spectrum disorders (ASD) often demonstrate atypical use of prosody, including both comprehension of prosodic cues (receptive prosody) and expression of prosody (expressive prosody) (Diehl and Berkovits 2010). The two gold-standard diagnostic tools of ASD, Autism Diagnostic Observation Schedule, second version (ADOS-2) (Lord et al. 2012) and Autism Diagnostic Interview-Revised (ADI-R) (Rutter et al. 2003), include prosody impairments as one of the diagnostic characteristics of the disorder, suggesting that atypical prosody may be a central feature of ASD. Individuals with high-functioning autism (HFA) form a sub-group of ASD. The term “high-functioning” encompasses a range of intellectual abilities, from superior to normal intellectual abilities and/or language abilities (Diehl et al. 2009).

The current study mainly focused on the production of intonation in individuals with HFA. Intonation is expressed as the pitch contours of speech and can be measured in terms of variation in the F0 of speech (Diehl et al. 2009). According to Crystal (1987), intonation serves six functions: (1) grammatical, which marks major units such as clauses and sentences; (2) textual, for which the pitch level marks the beginning and ending of a sentence in a discourse; (3) information structure, which involves signaling new information and background information; (4) emotional, which conveys a range of moods or emotions such as excitement, surprise, boredom, and reserve; (5) indexical, which conveys information about identity, such as indicating that a person belongs to a certain social group; and (6) psychological, which may help organize language into units that are easily perceived. Intonation patterns produced by individuals with HFA are very diverse. As pointed out by various scholars, there is a wide range of descriptions of prosody in ASD, including HFA, in the literature, and some seem to contradict each other (e.g., Baltaxe and Simmons 1985; Diehl et al. 2009; Nadig and Shaw 2012; Peppé et al. 2007). With regard to intonation, descriptions such as dull, monotonous, wooden, sing-songy, and exotically accented have been noted (Amoroso 1992; Fay and Schuler 1980; Lord et al. 1994).

McCann and Peppé (2003) conducted a comprehensive review on prosody in ASD and located 16 related studies published between 1980 and 2002. The review showed that there have been more studies on expressive prosody than receptive ones. McCann and Peppé (2003) categorized the studies reviewed under seven topics, namely, stress, rate, chunking, affect, reception, echolalia, and intonation. They found that intonation was relatively under-researched with only two studies included in the examination of intonation in ASD. The review concluded that there was no consensus on a specific pattern which characterized the prosodic features observed in ASD. Instead, substantial variability or even contradicting observations were found. Such a discrepancy in findings might be due to differences in the methodology, sample sizes, definition of prosody, and/or the functioning levels of the participants.

Another noteworthy conclusion made by McCann and Peppé (2003) was the predominant use of perceptual analysis in these studies, with only two studies adopting acoustic analysis (Baltaxe et al. 1984; Fosnot and Jun 1999). Although human perception provides ecologically valid descriptions of intonation, acoustic analysis offers reliable and objective measurements that exceed the capabilities of human hearing (McCann and Peppé 2003). In the study conducted by Baltaxe et al. (1984), the intonation contours of spontaneous declarative utterances in three groups of language-matched children were compared. These participants included six typically developing children (aged from 2;0 to 4;0, years;months), six children with aphasia (4;5–12;2), and five children with autism (4;6–12;2). With reference to F0 range, the children with autism ranked in the middle as a group and the typical group demonstrated the greatest range. However, at an individual level, the children with autism showed considerable variability, exhibiting pitch ranges that were either too narrow or too exaggerated. Fosnot and Jun (1999) reported somewhat different findings: children with autism in general demonstrated a significantly greater pitch range than both their typically developing peers and peers who stutter when reading and imitating interrogative and declarative utterances.

There has been an increase in the number of studies involving acoustic analysis of prosody in ASD over the past decade. Hubbard and Trauner (2007) adopted acoustic and perceptual techniques to examine intonation in children with ASD and typical children. Perceptually, Hubbard and Trauner (2007) found that the emotions conveyed by the ASD group in utterances sounded less differentiating to listeners than those produced by the typical children. The ASD group displayed a greater pitch range across all the utterances than the typical group. The findings of increased pitch range in individuals with ASD were replicated in other studies. Diehl et al. (2009) compared the intonation produced by four groups of speakers, children and adolescents with HFA and their typical peers matched on intelligence quotient and verbal abilities. Both HFA groups showed significantly higher F0 variation in their story production when compared to the typical groups, implying that the HFA individuals demonstrated exaggerated intonation patterns.

Impairment in prosody has also been documented in non-English-speaking children with ASD. Baltaxe and Simmons (1985) summarized their 1975 study and suggested that atypical prosody was observed in both their German-speaking and English-speaking participants. Cross-linguistic acoustic data attesting to atypical prosody have also been reported for Hebrew-speaking children with ASD (Green and Tobin 2009) and Hindi–English bilinguals with ASD (Sharda et al. 2010), who exhibited significantly higher pitch and wider pitch ranges than controls. Unlike the above observations, Nakai et al. (2014) reported a smaller pitch range in Japanese-speaking participants with ASD. Nakai et al. employed the pitch coefficient of variation (CV), a quantitative index of relative pitch dispersion based on the average F0 of each word, as a proxy for intonation. School-aged children with ASD exhibited a significantly smaller pitch CV than their age-matched typical peers, implying that their speech would sound more monotonous to listeners than that of their typical peers.
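To make the pitch CV measure concrete, the following is a minimal sketch of a coefficient of variation computed over per-word average F0 values. The function name `pitch_cv`, the example values, and the use of the sample standard deviation are assumptions made for illustration; they are not taken from Nakai et al. (2014).

```python
import statistics


def pitch_cv(word_mean_f0_hz):
    """Coefficient of variation: SD of per-word mean F0 divided by their mean.

    Assumption: the sample SD (n - 1 denominator) is used; the cited study
    may have used a different convention.
    """
    mean_f0 = statistics.mean(word_mean_f0_hz)
    return statistics.stdev(word_mean_f0_hz) / mean_f0


# Hypothetical per-word mean F0 values (Hz) for two speakers.
flat_speaker = [200.0, 202.0, 198.0, 201.0, 199.0]
varied_speaker = [200.0, 220.0, 180.0, 210.0, 190.0]

print(round(pitch_cv(flat_speaker), 4))
print(round(pitch_cv(varied_speaker), 4))
```

Because the CV divides dispersion by the mean, it allows speakers with different habitual pitch levels to be compared on relative, rather than absolute, pitch variation.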

Certain contradictions exist in the literature on the expressive prosody in ASD. However, it has been ascertained that prosodic deficits are present in the speech of individuals with ASD, at least in languages where intonation or pitch variation plays a prominent role in signaling pragmatic functions. Diehl and Berkovits (2010) described prosodic impairment as a “bellwether of an individual’s cognitive environment” (p. 167), suggesting that the underpinnings of prosodic impairment in ASD may have a cognitive basis. Prosodic impairment in ASD has long been regarded as related to their underlying social pragmatic deficits (Diehl and Berkovits 2010; McCann and Peppé 2003). The evidence supporting the association between the social pragmatic deficit and prosodic impairment comes from individuals’ reduced processing of communicative intentions and reading of emotions in the speakers’ voices (Kleinman et al. 2001; Rutherford et al. 2002). Individuals with HFA often encounter difficulties in the use of figurative languages, such as irony and metaphors which require the comprehension of prosody (Happé 1994). Due to the lack of theory of mind, individuals with ASD, including those with HFA, are not capable of attributing their thoughts to others, and understanding and predicting others’ intention and belief (Baron-Cohen 1995). Under this assumption, it is possible that individuals with HFA may not be able to relate pitch variations with speakers’ intention and emotion, regulate pitch variations in their own speech, and atypical expressive intonation may ensue.

Another more recent account ascribed atypical expressive prosody in ASD as a consequence of their unusual auditory feedback system (Arciuli 2014), which controls and regulates pitch and loudness of speech production (Lane et al. 1997). Russo et al. (2008) pointed out that their participants with ASD showed a hyper-responsive audio-vocal system that may lead to an extra large response to auditory feedback. As a consequence, these individuals may show exceptional sensitivity in perception and impaired vocal control which ultimately leads to prosodic problem. Based on this account, intonation impairment in ASD would happen in all languages and cultures, regardless of the importance of the functional role of intonation in the language. On the other hand, if intonation impairment is mainly a product of underlying social-pragmatic deficit as opposed to abnormal auditory processing, the impairment may not be observed in languages where intonation does not play a salient role in marking pragmatic functions. The current study aimed to test the hypotheses with a tone language where intonation is not salient. More specifically, the first research question to be addressed was, “Is atypical intonation production a universal clinical pattern, also observed in individuals with HFA who speak a tone language, where pitch variation is primarily used to encode lexical differences?”.

Intonation and Sentence-Final Particles in Cantonese

Cantonese is a typical example of tone languages where pitch variations are used to encode lexical differences. There are nine lexical tones in Cantonese, including six contrastive tones and three allotones of the level tones. Lexical tones are characterized by distinctive pitch contour, in which Tone 1 is high, Tone 2 is mid-low to high, Tone 3 is mid-level, Tone 4 is mid-low to low, Tone 5 is mid-low to mid-level, and Tone 6 is mid-low (Bauer and Benedict 1997). Tone 7, Tone 8, and Tone 9 share the same pitch contour as Tone 1, Tone 3, and Tone 6 respectively, and with a shorter duration (Bauer and Benedict 1997). Despite the use of pitch at the lexical level, intonation presented in the form of the overall pitch contour of an utterance also exists in Cantonese (Chao 1968; Fox et al. 2008; Ma et al. 2006). Chao (1968) analogized the relationship between lexical tones and intonation as “small ripples riding on larger waves” (p. 39), and claimed that changes in intonation at sentence level would not modify the lexical tones at word level. Likewise, Yip (2002) has proposed that Cantonese speakers use sentential intonation to convey different pragmatic implications, while the lexical tones are manifested on the overall intonation contour of a sentence. Intonation, however, may compete with lexical tones for phonetic and phonological ‘space’, and a change in intonation should be limited to an extent that it does not influence the lexical tones in a word. As a consequence, the use of intonation for pragmatic functions would be more constrained in Cantonese than non-tone languages, such as English, given the feature of lexical tones of the former (Chan 1996).

On the other hand, there is another linguistic device, which has been regarded as sharing similar functions as intonation in conveying grammatical, pragmatic, and affective functions in Cantonese. This device is called the sentence final particle (SFP) (Cheung 1986; Law 1990). SFP is a distinctive feature of Cantonese in comparison with English because there is no direct grammatical counterpart of SFPs in English (Matthews and Yip 1994). SFPs serve a variety of functions in verbal communication. For example, SFPs can be used to facilitate modality, focus, and conditional reasoning of speech (Lee and Law 2001). Table 1 shows the effect of different SFPs on the meaning of the same utterances with examples. SFPs also play an important role in conveying information about moods, attitudes, feelings, and emotions of a speaker (Matthews and Yip 1994). For example, the particle /tsɛk5/ was described to have “highly affective value” (Matthews and Yip 1994, p. 340) and can be used to convey a variety of emotions, ranging from appreciation to sarcasm. There are approximately 30 basic forms of SFPs in Cantonese, and they can be used either individually or in clusters of two or three (Kwok 1984). Syntactically, SFPs are bound morphemes (Kwok 1984) and do not carry any content-wise information for the sentence. Luke (1990) reported that SFPs occurred nearly every 1.5 s in a continuous speech, indicating the high pervasiveness of SFPs in Cantonese conversations. Developmentally, SFPs emerged into the speech of very young children. Lee and Law (2001) reported that children as young as 20 months old produced at least three different SFPs in their speech.

Table 1 Examples illustrating the effect of SFPs on meaning of the same utterances

Cheung (1986) argued that SFPs serve equivalent communication functions as intonation, and may even replace intonation. Yau (1980), on the other hand, claimed that SFPs and intonation work in a compensatory fashion, such that the more a speaker relies on using SFPs to present sentential connotations, the less he/she relies on intonation, and vice versa. Kwok (1984) also agreed there is a mutual compensation between SFPs and intonation and pointed out that intonation may further refine the meanings conveyed by SFPs. These studies provided some implications on how SFPs and intonation work in Cantonese to achieve different communication functions.

As reviewed above, intonation impairment in individuals with HFA speaking non-tone languages may stem from their social-pragmatic deficits. SFPs in Cantonese take on the pragmatic functions of intonation and may share a similar cognitive mechanism that governs their use. Cantonese-speaking individuals with HFA may, therefore, have difficulties in mastering the use of SFPs. This line of reasoning gave rise to the second research question of the current study: “Do Cantonese-speaking individuals with HFA demonstrate atypical use of SFPs when compared to typical counterparts?” Finally, this study explored the interaction between the use of SFPs and intonation. Yau (1980) suggested an inverse relationship between SFPs and intonation; that is, speakers with more variable F0 may produce fewer SFPs, and vice versa. The present study examined whether such a relationship existed in both the HFA group and their matched counterparts.

In summary, the current study investigated the expressive intonation and the use of SFPs in Cantonese-speaking adults with HFA and their neurotypical (NT) matched controls. With reference to the first research question, it was predicted that there might not be any significant difference in intonation, in terms of pitch variations, between the HFA and the NT groups since intonation in Cantonese is restricted by the lexical tone system. For the second research question, it was hypothesized that the difficulties in intonation for HFA in Cantonese may be manifested in the distinctive use of SFPs. SFPs would, therefore, pose particular challenges to these individuals. Therefore, the HFA group may use fewer or less diverse SFPs when compared to the NT group. Finally, based on the current literature, a prediction on the relationship between the use of SFPs and intonation was made for the NT group—there would be a negative correlation between their use of SFPs and intonation variations (Kwok 1984; Yau 1980).

Method

Participants

Nineteen male adults with HFA were recruited through the non-governmental organizations in Hong Kong. All the participants with HFA received a formal diagnosis of HFA from either a clinical psychologist or a pediatrician during their childhood. The participants, aged between 18;11 and 33;5, were all native Cantonese speakers and have received compulsory education in Hong Kong, at least until Secondary Five level. These 19 adults were matched with 19 NT adults on age, sex, and education level. All the NT controls were evaluated by The Adult Autism Spectrum Quotient (AQ) Ages 16 or above (Baron-Cohen et al. 2001), a self-report measure of autistic traits, to ensure that there were no indications of ASD. Also, all the NT controls reported no history of any diagnosed developmental disorders, and/or other psychiatric conditions. Participants’ characteristics are presented in Table 2.

Table 2 Summary of participants’ characteristics

Procedures

The current study followed the procedures described by Diehl et al. (2009), who studied narratives produced by the participants. The narrative subtest in the Hong Kong Cantonese Oral Language Assessment Scale (T’sou et al. 2006) was used to elicit narrative samples. In this task, participants listened to a model story through earphones once and retold the story to the investigator with the support of a series of pictures. The narrative production was recorded with an external microphone (Rode Lavalier Microphone) connected to a computer through a USB audio interface (Focusrite Scarlett 2i2) with a sampling rate of 44,000 Hz. The microphone was set 10 cm away from the mouth of the participant. All the recordings took place in a sound-proof booth.

Analysis

Acoustic Analysis of Intonation

All the narrative samples were analyzed acoustically using the PRAAT programme (Boersma and Weenink 2005). The two measures employed by Diehl et al. (2009), the average F0 and SD of F0, were adopted in the current study. The average F0 of each participant was extracted from the beginning to the end of the narrative to provide an overall measure of pitch. The SD of F0 was calculated for each participant as a measure of pitch range and pitch variation. A larger SD of F0 implies a wider pitch range and more pitch variation within a participant.
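The two intonation measures can be illustrated with a minimal sketch that takes a frame-by-frame F0 track and returns its mean and SD. This is not the PRAAT procedure used in the study. The function name `f0_summary`, the coding of unvoiced frames as 0, and the use of the sample standard deviation are all assumptions made for illustration.

```python
import statistics


def f0_summary(f0_track_hz):
    """Return (mean F0, SD of F0) in Hz over voiced frames only.

    Assumption: unvoiced or undefined frames are coded as 0 and are
    excluded, so that pauses do not pull the mean down or inflate the SD.
    """
    voiced = [f for f in f0_track_hz if f > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    return statistics.mean(voiced), statistics.stdev(voiced)


# Hypothetical F0 track (Hz); 0.0 marks unvoiced frames.
track = [0.0, 120.0, 125.0, 130.0, 0.0, 118.0, 122.0]
mean_f0, sd_f0 = f0_summary(track)
print(mean_f0, round(sd_f0, 2))
```

Under this sketch, a narrative with wider pitch excursions yields a larger SD of F0 even when its mean F0 is unchanged, which is why the SD serves as the index of pitch variation.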

Use of SFPs

Narrative samples were transcribed verbatim with a special focus on SFPs. SFPs were transcribed in both Chinese characters and International Phonetic Alphabet (IPA) form. The 28 forms of SFPs described by Kwok (1984) (see Table 4 in the Appendix) and clusters of them were used as the basis for identifying the SFPs in the samples. The total number of SFPs and the number of SFP types produced by each participant were computed. Inter-rater reliability for SFP identification was computed as well. Thirty percent of the narrative samples were randomly drawn from the HFA group and the NT group and were then transcribed independently by a second rater in both Chinese characters and IPA. The second rater was blind to group identity. The results were compared with those of the original rater. The inter-rater agreement of the SFP coding was 96.4 % (213/221). The raters disagreed most often in transcribing the particles [laa3] and [laak3]; these disagreements were eventually resolved through discussion.
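The agreement figure is a simple point-by-point percentage, which can be sketched as follows. The function name `percent_agreement` and the label lists are hypothetical: only the counts (213 agreements out of 221 coded items) come from the text, and the disagreements are all shown as [laa3] versus [laak3] purely for illustration.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of items on which two raters assigned the same label."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must code the same items")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


# Hypothetical codings reproducing the reported counts: 213 matches, 8 mismatches.
first_rater = ["aa3"] * 213 + ["laa3"] * 8
second_rater = ["aa3"] * 213 + ["laak3"] * 8

print(round(percent_agreement(first_rater, second_rater), 1))
```

Percentage agreement does not correct for chance agreement; a chance-corrected index would be a different, stricter measure than the one reported here.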

Statistical Analysis

Summary measures were calculated, including average F0 and F0 standard deviation (SD), and the frequency and type of SFPs produced by both groups. One-way ANOVAs were conducted to examine whether there were any statistically significant differences between the two groups on each of the measures. Levene’s tests of homogeneity of variance were conducted to ensure that the variances of the two groups on all the measures were not significantly different (ps > .05). Effect sizes (partial eta squared, partial η²) were also used to estimate the degree of the difference. Values of partial η² between .01 and .06 represent a small effect, values between .06 and .14 represent a medium effect, and values above .14 represent a large effect. To examine the relationship between the use of SFPs and intonation, Pearson product-moment correlation coefficients were computed for each group between SD of F0 and the total frequency of SFPs, and between SD of F0 and average SFP type.
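The group comparison can be sketched as a one-way ANOVA computed from sums of squares, together with the effect-size bands given above. In a one-way design, partial eta squared reduces to SS_between / (SS_between + SS_within). The function names, the example data, and the handling of values that fall exactly on a band boundary are assumptions made for illustration. The p value is omitted because it requires the F distribution, which a statistics package would supply.

```python
import statistics


def one_way_anova(*groups):
    """Return (F, df_between, df_within, partial eta squared)."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_values)
    ss_between = sum(
        len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups
    )
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_p2 = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_p2


def effect_label(eta_p2):
    """Map partial eta squared onto the bands used in the text.

    Assumption: boundary values are assigned to the higher band.
    """
    if eta_p2 >= 0.14:
        return "large"
    if eta_p2 >= 0.06:
        return "medium"
    if eta_p2 >= 0.01:
        return "small"
    return "negligible"


# Hypothetical scores for two small groups (not the study data).
f_stat, df_b, df_w, eta = one_way_anova([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(round(f_stat, 2), df_b, df_w, round(eta, 3), effect_label(eta))
```

With two groups of 19 participants each, the same function would give 1 and 36 degrees of freedom.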

Results

The duration of the narrative samples ranged from 2 to 9 min. The mean duration of narrative samples of the HFA group was 3 min and 45 s, whereas the NT group was 3 min and 13 s. An alpha level of .05 was used for the statistical tests. Table 3 summarizes the descriptive statistics of various measures of the two groups.

Table 3 Group means (standard deviations) for the intonation and SFP measures

Mean of Average F0

The means of average F0 (i.e., average pitch) of the HFA group and the NT group were 137.67 and 123.24 Hz respectively. Results of the ANOVA showed that the mean average F0 of the HFA group was significantly higher than that of the NT group with a large effect size, partial η² = .16, implying that participants with HFA generally spoke at a higher pitch than their NT peers.

Mean of SD of F0

The mean of SD of F0 across the narratives of the HFA group was 27.35 Hz and that of the NT group was 22.16 Hz. The difference between the two groups was significant, with a large effect size (partial η² = .145). This suggested that participants with HFA generally produced a wider pitch range, which listeners might perceive as more exaggerated pitch variation than that of their NT peers.

Use of SFPs

The HFA group produced an average of 28.89 SFPs (regardless of the type) and an average of 5.37 different types in their narratives. The NT group produced an average of 34.74 SFPs with 6.63 different types in their narratives. ANOVA tests showed that there was no significant group difference in the total SFP frequency, implying that the HFA group (M = 28.89) produced SFPs about as often as the NT group (M = 34.74). The HFA group produced fewer SFP types on average (M = 5.37) than the NT group (M = 6.63), but the difference was non-significant (p = .072).

Correlations

In the HFA group, there was no significant correlation between SD of F0 and the total SFP frequency, r(19) = −.205, p = .401. However, there was a moderate positive correlation between SD of F0 and the average SFP type, r(19) = .504, p < .05. In the NT group, neither correlation was significant: SD of F0 was not significantly correlated with the total SFP frequency, r(19) = .297, p = .217, or with the average SFP type, r(19) = .405, p = .085.
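The significance pattern of these coefficients can be checked with a short sketch that converts each r to a t statistic, t = r·sqrt(n − 2) / sqrt(1 − r²), and compares it with the two-tailed critical value at alpha = .05. The sketch assumes that the 19 in r(19) is the per-group sample size, so that df = n − 2 = 17. The function names and the dictionary of coefficients are illustrative; the critical value 2.110 is the standard table value for df = 17.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_from_r(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


N_PER_GROUP = 19     # assumption: r(19) reports the sample size
T_CRIT_DF17 = 2.110  # two-tailed critical t, alpha = .05, df = 17

# Coefficients as reported in the text (SD of F0 against each SFP measure).
reported = {
    "HFA frequency": -0.205,
    "HFA type": 0.504,
    "NT frequency": 0.297,
    "NT type": 0.405,
}
significant = {
    name: abs(t_from_r(r, N_PER_GROUP)) > T_CRIT_DF17
    for name, r in reported.items()
}
print(significant)
```

Under these assumptions, only the HFA correlation with SFP type exceeds the critical value, which matches the pattern reported above.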

Discussion

The current study compared the intonation variations and the use of SFPs in narratives produced by Cantonese-speaking adults with and without HFA. Results indicated a significantly wider pitch range for the HFA group than the NT group based on narrative samples. As for the use of SFPs (total frequency and type), there was a similar pattern between the two groups even though the difference in the type of SFPs between the two groups approached significance. Finally, there was a moderate positive correlation between the type of SFPs and SD of F0 in the HFA group but not the NT group.

Group Comparison of Intonation Variation

It was hypothesized that the HFA group and the NT group would not show any significant difference in SD of F0 in the narratives, given that the use of intonation in Cantonese may be restricted by its tonal system. This hypothesis, however, was not supported in the current study. The current study revealed a significantly wider pitch range in the HFA group with a large effect size. This observation was consistent with previous findings in non-tone languages (e.g., Diehl et al. 2009; Fosnot and Jun 1999; Green and Tobin 2009; Hubbard and Trauner 2007; Nadig and Shaw 2012; Paul et al. 2008; Sharda et al. 2010), suggesting that atypical prosody may be a universal characteristic of individuals with HFA regardless of the importance of intonation in the language they speak. In other words, prosodic impairment may involve a breakdown in auditory processing (Yu et al. 2015), rather than being related to higher-level processing of the pragmatic functions served by intonation. Russo et al. (2008) provided empirical evidence showing that children with ASD demonstrated abnormalities in their audio-vocal systems, which influence the processing of auditory feedback when speaking. Since auditory feedback is important in stabilizing F0 in one’s voice (Russo et al. 2008), a disturbed feedback system may lead to dysfunction in voice pitch regulation (i.e., prosody). The universal phenomenon of intonation impairment suggests that the impairment may not entirely stem from, or be secondary to, the social pragmatic deficit in ASD. This might also explain why difficulties in prosody in HFA often persist, given that interventions addressing social communication issues do not target prosody directly. Diehl and Berkovits (2010) pointed out that prosody difficulties often persist into adulthood despite improvement in other language and communication domains. Diehl and Berkovits (2010) also suggested that treatment using acoustic information as instant feedback could be a potential intervention technique for managing prosodic problems in individuals with ASD in general (cf. van Santen et al. 2009). Future intervention studies focusing on auditory feedback may provide stronger evidence on the association between unusual auditory processing and impaired prosody. The speculation about atypical audio-vocal regulation in HFA requires more systematic investigation.

Although the current study showed that the HFA group exhibited significantly more variations in intonation than the NT group, it should be noted that the findings represented a group pattern. A post hoc observation found that there was an overlapping between the HFA group and the NT group in terms of SD of F0 at an individual level. Some individuals with HFA might not be differentiated from their NT peers based on the SD of F0. In other words, not all individuals with HFA showed a wider pitch range in their narratives, and some of them even shared similar pitch variations as their NT peers. This phenomenon was also noted in the study conducted by Diehl et al. (2009), who found that some individuals with HFA produced similar patterns of pitch variations as their typical peers despite a significant group difference. These findings further confirmed the heterogeneous nature of expressive prosody deficits in ASD, and supported the claims that not all individuals with ASD showed disrupted features in a particular aspect of prosody (McCann et al. 2007; Shriberg et al. 2001).

Group Comparison of the Use of SFPs

The current study noted that individuals with HFA produced a similar number and diversity of SFPs in their narratives when compared with their NT peers, although the group difference in SFP type approached significance. Strictly speaking, the findings were not consistent with our prediction that individuals with HFA may be less capable of using SFPs to convey pragmatic implications in speech due to their social pragmatic impairments. The findings are also in general inconsistent with a previous study on the role of SFPs in the comprehension of irony, in which, with the use of SFPs, Cantonese-speaking children without ASD interpreted irony more accurately than the ASD group (Li et al. 2013). A post hoc observation found that the general SFPs /aa3, laa3, le1/ occurred most frequently in narratives produced by the HFA group and accounted for more than 50 % of the total SFPs produced by that group. These SFPs are phonologically similar, interchangeable in some contexts, and closely related in meaning (Leung 2005; Luke 1990). For example, the particle /le1/ has mainly been described as being used to distinguish questions from statements, to draw the listener’s attention to particular information, and, when used in statements, to remind the listener about something he/she should have known (Leung 2005). Although the particle /le1/ can facilitate the focus of speech and signal the difference between questions and statements, it does not carry much information about a speaker’s mood, emotion, feeling, or attitude (Leung 2005). It is possible that these individuals with HFA are capable of using affectively neutral SFPs to achieve grammatical functions, such as indicating utterance boundaries. However, this suggestion should be verified by further studies investigating the specific pragmatic functions of the SFPs used.

Relative Roles of Intonation and SFPs

The positive correlation in the HFA group may indicate that SFP type was associated with, or dependent on, pitch variation. It is important to recall that SFPs are “contentless”, so they provide more room for pitch variation at the utterance-final position than content words do. That is, even if the pitch of an SFP is changed, the core content of the utterance is not altered. In addition, different lexical tones carried on the same segments can represent different forms of SFPs. As a result, it is possible that the more diverse the SFP types an individual produced, the more pitch variation could be realized at the utterance-final position (as opposed to “sentential” intonation), and in turn the greater the SD of F0. Such a tendency was, in fact, also observed in the NT group, although the correlation only approached significance (p = .085). These findings did not support the previous claim that intonation and SFPs in Cantonese work together in a mutually compensatory fashion (e.g., Kwok 1984; Yau 1980), and instead suggested that the relationship between intonation and SFPs is more complex than originally assumed. However, this speculation requires more detailed and systematic examination. This could be achieved by analyzing only those sentences with SFPs, so that the effect of SFPs and their relationship with intonation can be made more explicit.

Future Studies

Another prosodic feature of Cantonese relates to its rhythm. Cantonese is a syllable-timed language, in which the duration of each syllable is approximately the same, in contrast to stress-timed languages such as English, in which syllable duration is more varied (Bauer and Benedict 1997; Mok and Dellwo 2008). In addition, lexical stress is absent in Cantonese. These typological features of Cantonese may provide fruitful ground for uncovering the theoretical underpinnings of prosody impairment in ASD.