1 Introduction

Autism spectrum disorder (ASD) is a neurodevelopmental disorder with core deficits in social interaction, language, and communication, as defined by the International Classification of Diseases and Related Health Problems, tenth edition (WHO, 1992), and the Diagnostic and Statistical Manual of Mental Disorders, fifth edition (DSM-5; American Psychiatric Association, 2013). As the name suggests, individuals with this “spectrum disorder” show a wide range of symptoms, from mild to severe. The DSM-5 has abandoned the use of subcategorized diagnoses; however, researchers sometimes use subcategorical terms, such as “high-functioning autism,” “low-functioning autism,” and “Asperger’s syndrome,” especially in publications prior to the DSM-5. High-functioning autism (HFA) is an unofficial term used to describe the milder forms of ASD: individuals with HFA usually have an intelligence quotient of 70 or above and are generally able to use words to communicate in daily life. In contrast, low-functioning autism (LFA) is an unofficial term referring to individuals on the severe end of the autism spectrum, often with an intelligence quotient below 70 and very limited language production and comprehension.

Since Kanner’s first report in 1943, the prevalence of autism has been increasing. One of the most recent reports from the US Centers for Disease Control and Prevention (CDC) (2014) estimated that ASD occurs in 1 out of 68 children (or 14.7 per 1000 8-year-old children). The prevalence of ASD in China is at least 1.18 per 1000 (Sun et al., 2013; Zhang & Ji, 2005), with at least 1.3–2 million children under 13 years of age affected by ASD nationwide (Huang, Jia, & Wheeler, 2013). A brief report by Tao (1987) on four cases of infantile autism marked the first published study of ASD in the Chinese population. Chinese languages, such as Mandarin and Cantonese, are tonal languages, and owing to a paucity of cross-language studies, it is currently unclear whether theories and findings based on non-tonal languages such as English apply to Chinese-speaking individuals with ASD.

The landscape of research and scholarly work pertaining to ASD in the Chinese population is still relatively sparse, despite an exponential increase in the number of studies on English-speaking individuals with ASD in the last 15 years. Chinese (e.g., Mandarin or Cantonese) phonology differs from that of non-tonal languages, such as English, in its use of pitch at the phonemic level. Specifically, the pitch pattern (e.g., a high-level fundamental frequency (F0) contour vs. a low-rising F0 contour) serves to differentiate the meanings of words that share the same phonetic segments (e.g., bi with Tone 1 means “force,” whereas bi with Tone 2 means “nose”). Whether individuals with ASD whose native language is tonal show speech-processing deficits similar to those of individuals with a non-tonal native language (e.g., English or Finnish) is an important experimental and theoretical question because, as reviewed in Tsao’s chapter (Chap. 10) and Singh’s chapter (Chap. 11) in this book, children with tonal language backgrounds may follow a somewhat divergent developmental trajectory from that of children with non-tonal language backgrounds (see also a recent review by Curtin & Werker, 2018).

Our goal in this review is to provide an overview and an evaluation of the behavioral and neurological evidence on Chinese prosody and lexical tone processing in individuals with ASD. Research on speech perception and production abilities in Chinese individuals with ASD is a recently emerging area, but significant progress has been made. Systematically summarizing our understanding of prosody and lexical tone processing in tonal language speakers, as well as how these abilities are associated with ASD at both the behavioral and the brain level, will further research and clinical practice in these areas.

We searched the following databases: Cochrane, ERIC, Google Scholar, NCBI/PubMed, PsycINFO, and Web of Science, using the keywords {“lexical tone,” “prosody,” “intonation,” “pitch,” or “fundamental frequency”} and {“Mandarin,” “Cantonese,” or “Chinese”} and {“Autism” or “Asperger”} in March and August 2017. We also checked the bibliographies of relevant articles and identified only nine studies on Chinese prosody and lexical tone perception and production across behavioral and neurophysiological research. Two of the studies focused on Cantonese and seven on Mandarin. In this review, we examine these studies in relation to the extensively researched areas of prosody and pitch production and perception in individuals with ASD from English and other non-tonal language backgrounds.

We first discuss findings from the behavioral literature and then describe recent data obtained using brain measures. This review is followed by a case study on neuroplasticity in children with ASD. Lastly, we discuss the theoretical and clinical significance of the evidence accumulated thus far and point out gaps and challenges in understanding prosody and lexical tone perception and production in Chinese-speaking individuals with ASD.

2 Prosody in Individuals with ASD: General Background

The terms “prosody” and “intonation” are often used interchangeably in the literature. In this review, we use “prosody” as a superordinate term to describe changes in pitch, intensity, duration, and voice quality (Cummins et al., 2015; Titze, 1994). In other words, prosody is a suprasegmental feature of speech that is expressed via variations in pitch (fundamental frequency), loudness (intensity), duration, stress, and rhythm (Cutler & Isard, 1980). Pitch is the perceptual correlate of the frequency of vocal fold vibration (i.e., the fundamental frequency, F0). During vocal production, individuals modulate their pitch to convey different emotions and pragmatic connotations (e.g., posing a question, delivering a statement, issuing a command, expressing surprise). In tonal languages such as Mandarin, Cantonese, and Thai, pitch also serves as a phonemic contrast, in which case it is referred to as “lexical tone.”

Two of the most widely used diagnostic tools for ASD in English-speaking settings, the Autism Diagnostic Observation Schedule-2 (ADOS-2; Lord et al., 2012) and the Autism Diagnostic Interview-Revised (ADI-R; Rutter, Le Couteur, & Lord, 2003), include atypical prosody production as a diagnostic criterion. Atypical prosodic production and perception in individuals with ASD have been reported since the earliest descriptions of the condition (Asperger, 1944; Kanner, 1943; Simmons & Baltaxe, 1975). Some researchers have proposed that distinctive and atypical vocal characteristics, such as monotonous and robot- or machine-like speech, may serve as one of the earliest-appearing biological markers of a later ASD diagnosis. However, there is some evidence that children with ASD may have intact prosody perception (Grossman & Tager-Flusberg, 2012; Paul, Augustyn, Klin, & Volkmar, 2005) or even superior pitch-processing skills (e.g., Bonnel, Mottron, Peretz, Trudel, Gallun, & Bonnel, 2003; Stanutz, Wapnick, & Burack, 2014). An understanding of prosodic production and perception in ASD is critical because prosody often serves as a major cue to both linguistic and paralinguistic functions (Crystal & Quirk, 1964). Moreover, attention and sensitivity to prosody play a critical role in early language development (Jusczyk, 1997; Mehler et al., 1988). Children with ASD are known to have difficulties detecting vocal prosodic cues that convey irony and sarcasm (Wang, Dapretto, Hariri, Sigman, & Bookheimer, 2004; Wang, Lee, Sigman, & Dapretto, 2006, 2007). Children and adolescents with high-functioning autism (HFA) sometimes have difficulty using vocal cues to make inferences about a speaker’s intentions in tasks designed to probe theory of mind (ToM) (e.g., Chevallier, Noveck, Happé, & Wilson, 2011). ToM refers to the ability to understand that other people have thoughts, intentions, and feelings that differ from one’s own, an ability that individuals with ASD are hypothesized to lack (Baron-Cohen, Leslie, & Frith, 1985). In an fMRI study, Eigsti and colleagues found that, compared to typically developing peers, high-functioning children with ASD showed broader recruitment of brain areas while processing both affective and grammatical prosodic cues (Eigsti, Schuh, Mencl, Schultz, & Paul, 2012). The authors suggested that, for a fairly simple language-processing task, greater recruitment of areas involved in executive functions and what they refer to as “mind-reading” functions can be interpreted as less automaticity in processing language.

3 Atypical Prosody Production in ASD

3.1 Atypical Prosody Production in ASD with Non-tonal Language Background

The speech of individuals with ASD has been described as both “monotone” and “exaggerated” (Baltaxe & Simmons, 1985). Atypical prosody production has been suggested as a “bellwether” of the cognitive profiles of individuals with ASD, as well as a behavioral indicator of subtypes of ASD (e.g., hypersensitive vs. hyposensitive to auditory input) (Diehl, Berkovits, & Harrison, 2010, p. 167). However, the precise prosodic features of the speech of individuals with ASD have only recently begun to be characterized. A review of 16 earlier studies on prosody production in ASD by McCann and Peppé (2003) revealed many contradictory findings; these contradictions may be due to the paucity of research, the small sample sizes of many studies, and the great variability in methodology across studies.

A number of more recent studies with larger sample sizes indicate that higher mean pitch and/or wider pitch range in the speech of participants with ASD are the primary prosodic differences from the speech of controls in tasks such as lexical elicitation (e.g., Bonnel et al., 2003), sentence elicitation (Diehl & Paul, 2013), and spontaneous prosodic production (Diehl, Watson, Bennetto, McDonough, & Gunlogson, 2009; but see Quigley et al., 2016 for contrary findings). A recent systematic review and meta-analysis of 34 empirical studies of vocal production in ASD calculated a moderate effect size for pitch measures (Cohen’s d of 0.4–0.5), but with a discriminatory accuracy of only about 61–64% (Fusaroli, Bang, Bowler, & Gaigg, 2017). Acoustic measures other than the mean and variance of pitch (e.g., duration and intensity) have also been examined, but they were not found to be stable predictors of speech produced by individuals with ASD (Fusaroli et al., 2017).
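For readers unfamiliar with these acoustic measures, the following minimal Python sketch illustrates how mean pitch, pitch range, and the pitch coefficient of variation can be derived from an F0 track. The file name is hypothetical, and the reviewed studies used their own analysis pipelines rather than this code.

```python
# Minimal sketch: extract an F0 track from one utterance and compute the
# summary pitch measures discussed above. The recording name is hypothetical;
# this only illustrates the measures, not any study's actual pipeline.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=None)            # hypothetical recording
f0, _, _ = librosa.pyin(y, fmin=75, fmax=500, sr=sr)      # F0 per analysis frame
f0 = f0[~np.isnan(f0)]                                    # keep voiced frames only

mean_f0 = f0.mean()                                       # "mean pitch"
pitch_range = f0.max() - f0.min()                         # "pitch range"
pitch_cv = f0.std() / f0.mean()                           # coefficient of variation

print(f"mean F0 = {mean_f0:.1f} Hz, range = {pitch_range:.1f} Hz, CV = {pitch_cv:.2f}")
```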

Endeavors to link pitch production with the severity of ASD have led to highly inconclusive results. A computerized task, the Profiling Elements of Prosodic Systems in Children (PEPS-C), developed to assess prosody perception and production in children aged 4–16 years, has been used in a number of studies. The general finding was that children with ASD have atypical prosodic output (e.g., Diehl et al., 2009; McCann, Peppé, Gibbon, O’Hare, & Rutherford, 2007), but that their prosodic output also overlaps considerably with that of their typically developing (TD) peers and children with other types of disorders. This was especially the case for simple tasks, such as when models or prompts were provided and/or no face-to-face spontaneous interaction was required (e.g., Diehl & Paul, 2013).

As Fusaroli et al. (2017) pointed out in their review, among the five studies that examined the correlation between pitch measures and the severity of ASD symptoms, both the strength and the direction of the relation varied across studies and pitch parameters. For example, Bone and colleagues analyzed speech segments of spontaneous interaction between a child and a psychologist during a standard observational evaluation session using the ADOS and found that the autism severity score was negatively correlated with the median pitch slope at turn ends, but not with pitch center or pitch slope variability (Bone et al., 2014). Nadig and Shaw (2012), analyzing spontaneous conversation samples, found no correlation between pitch range and children’s behavioral characteristics (IQ, language, and autism severity scores). Nakai and colleagues also reported no correlation between the pitch coefficient of variation and the total score on the Autism Screening Questionnaire (Nakai, Takashima, Takiguchi, & Takada, 2014). Diehl et al. (2009) tested two groups of children using similar narrative elicitation tasks: there was a positive correlation between clinicians’ judgments of ASD severity and the variance of pitch production in the older children and teenagers (Study 1, 10–18 years old), but no such correlation in the younger children (Study 2, 6–14 years old). It is possible that differences in measurement and methods (e.g., unprompted test conditions) among these studies led to the different outcomes.

So far, pitch measures show promise to distinguish individuals with and without ASD, but there is no evidence that they reflect severity of the disorder. We are certainly not ready to use prosodic measures as a tool to measure the severity of ASD. Several steps need to be taken first. In particular, it will be important to develop tasks that consistently and robustly result in prosodic differences between individuals with ASD and controls. Additional research needs to be undertaken with individuals who have language impairment, as well as ASD.

3.2 Atypical Prosody Production in Chinese Speakers with ASD

Tonal languages, such as Mandarin and Cantonese, use pitch both as a phonemic contrast at the lexical level and as a suprasegmental cue for intonation changes (as in non-tonal languages). The “contour interaction” theory (Thorsen, 1980; Vaissiere, 1983) and the “tone sequence” theory (Pierrehumbert, 1980) posited that lexical tone, stress, and intonation are tightly interwoven in the final suprasegmental pitch output. Both prosodic pitch patterns and lexical tone are realized via F0. F0 modulation for lexical tone, however, distinguishes meaning at the level of the morpheme, whereas F0 modulation for prosody conveys meaning at the sentence/discourse level (e.g., pragmatics). F0 can also signal lexical stress, which marks the prominence of a syllable in a multisyllabic word or phrase; in tonal languages, lexical stress is superimposed on the lexical tone. Chao (1968) described the resulting competition for F0 space between lexical tone and prosody as “small ripples riding on larger waves” (p. 39): sentence-level prosodic patterns (“sentence-level tunes”), including contrastive stress and sentential intonation, which can convey modality (i.e., illocutionary force, e.g., assertive vs. interrogative), are the “larger waves,” whereas lexical tone at the syllable level and lexical stress in multisyllabic morphemes are the “small ripples.” See Yu, Wang, and Li’s chapter (Chap. 5) of this book for the neural mechanisms of lexical tone processing in healthy adults, and Lee and Cheng’s chapter (Chap. 6) for the neural development of lexical tone processing in early childhood; Tsao and Liu (Chap. 10) review lexical tone perception in infancy. The question relevant to this review is how tonal language speakers with ASD convey multiple levels of pitch information, and whether they produce pitch output at the syllable, morpheme, word, phrase, and sentence levels that is comparable to that of healthy controls.

To the best of our knowledge, Chan and To (2016) is the only study to have examined the acoustic features of prosodic output in Chinese speakers with ASD. Chan and To (2016) focused on the use of sentence-final particles (SFPs) and expressive intonation in Cantonese-speaking adults. Cantonese SFPs are bound morphemes that play a role similar to that of prosodic patterns in other languages; they convey grammatical, pragmatic, and affective meaning. The authors proposed that individuals with high-functioning ASD (HFA) might show difficulties in mastering the use of SFPs and intonation, given their known deficits in decoding pragmatic and affective cues, and that these two skills (the use of SFPs and the use of intonation) might interact with each other and work in a compensatory fashion. Speech samples were elicited from 38 young adults (HFA group: n = 19; control group: n = 19) using spontaneous story retelling, and pitch variance was measured with the sentence as the unit of analysis. Higher average pitch and larger pitch variation were found in the HFA group than in the control group. The two groups were comparable in the total frequency of SFP use, but the HFA group produced slightly fewer SFP types on average than the control group (p = 0.072). The correlations between pitch and SFP measures were all nonsignificant, with the exception of a moderate positive correlation between the number of SFP types and pitch variability in the HFA group only. At the individual level, some individuals with HFA showed pitch patterns similar to those of the control participants.

It appears that the general patterns of atypical prosody production in Cantonese speakers with HFA were very similar to those of non-tonal language speakers with HFA, which suggests that prosody impairment may be language-independent.

4 Prosody Perception in Children with ASD

4.1 Infant Development

Prosodic cues are critical in assisting infants with segmenting running speech into linguistically meaningful units (e.g., syllables, words, phrases). Language-specific prosody perception and processing develop during early infancy (Bosch & Sebastián-Gallés, 1997; Friedrich, Herold, & Friederici, 2009; Jusczyk & Aslin, 1995; Sambeth, Ruohio, Alku, Fellman, & Huotilainen, 2008; Shafer, Jaeger, & Shucard, 1999; Stefanics et al., 2009; see Chap. 10 by Tsao & Liu in this book for a review of lexical tone development). For example, a language-specific preference for words with trochaic structure, the predominant English stress pattern, was observed in English-learning infants between 6 and 9 months of age (Jusczyk, Cutler, & Redanz, 1993) and in German-learning infants between 4 and 6 months of age, but not in French-learning 6-month-olds (Höhle, Bijeljac-Babic, Herold, Weissenborn, & Nazzi, 2009). In addition, neural responses in English-exposed three-month-old infants showed that they processed Dutch stories, but not Italian stories, in a fashion similar to English stories (Shafer et al., 1999). These findings presumably reflect the fact that Germanic languages (e.g., English, German, and Dutch) are stress-timed with trochaic predominance, whereas Romance languages (e.g., French and Italian) are syllable-timed and favor iambic stress patterns. The infants in Shafer et al. (1999) also distinguished the greater pitch range of the Dutch stories compared to the English stories, indicating early sensitivity to small differences in the melody of speech. Typically developing infants demonstrate an intrinsic preference for prosodically rich child-directed speech (e.g., Vouloumanos & Werker, 2007). Infants’ early preference for the higher pitch and exaggerated prosody of child-directed speech assists early socio-communicative learning (Kuhl, Coffey-Corina, Padden, & Dawson, 2005) and facilitates and predicts later language development (Thiessen, Hill, & Saffran, 2005). Furthermore, preference for speech correlated positively with general cognitive ability at 12 months, and a weaker preference for speech over non-speech sounds correlated with more autistic-like behavior in infants who had siblings with diagnosed ASD (Curtin & Vouloumanos, 2013). Multiple studies have reported that young children with ASD show a less robust preference for child-directed speech than their age-matched TD peers (e.g., Paul, Chawarska, Fowler, Cicchetti, & Volkmar, 2007; see Filipe, Watson, Vicente, & Frota, 2017 for a review).

Thus, deviant patterns of prosodic perception and processing in the first few years of life could serve as an early risk marker for ASD. More prospective studies of children at risk for ASD are needed to further explore this possibility.

4.2 Prosodic Perception in Older Children and Adults with ASD

Mixed evidence has been reported regarding acoustic tone perception and speech (word, phrase, and sentence) prosody perception in older children and adults with ASD. The majority of studies on acoustic tone perception have reported intact prosody perception in children with ASD (Grossman & Tager-Flusberg, 2012; Paul et al., 2005). For example, individuals with ASD have shown superior pitch-processing skills on a variety of psychophysical measures (e.g., Bonnel et al., 2003, 2010; Stanutz et al., 2014), superior detection of pitch direction over small intervals (Heaton, 2003, 2005; Heaton et al., 1999), and superior melodic contour identification (Järvinen-Pasley, Peppé, King-Smith, & Heaton, 2008). Contradictory findings include inferior pure-tone discrimination when the reference tone varied across trials (e.g., Boets, Verhoeven, Wouters, & Steyaert, 2015); see Haesen, Boets, and Wagemans (2011) for an extensive review. In contrast, many studies on sentence-level intonation have revealed that children with ASD have deficits in intonation perception and/or production, especially at the sentence level. For example, Järvinen-Pasley and colleagues reported that children with ASD have unimpaired perception of word-level intonation but deficits in understanding sentence-level intonation (e.g., Järvinen-Pasley et al., 2008). McCann and colleagues reported that most children with HFA have either expressive or receptive prosody deficits as measured by tasks assessing affect-related prosody at the single-word level, phrase-level stress, and sentence intonation (McCann et al., 2007). Prosodic production deficits in children with HFA have also been reported in other studies (e.g., Diehl & Paul, 2013); furthermore, there is evidence that adults with HFA have difficulties using prosodic cues to extract information about mood and emotion (e.g., Rutherford, Baron-Cohen, & Wheelwright, 2002). A brain imaging study by Wang et al. (2006) showed that, unlike TD controls, school-aged children with ASD were not only less accurate at interpreting prosodic cues to irony but also showed aberrant brain activation patterns, including absent activity in the medial prefrontal cortex and greater activation in the right inferior frontal gyrus and bilateral superior temporal sulcus (STS) regions.

The evidence so far suggests that the function of the pitch information (linguistic and non-linguistic) and the size of the prosodic unit (e.g., word level vs. sentence level) influence the performance level in these populations.

4.3 Prosody and Lexical Tone Perception in Chinese-Speaking Individuals with ASD: Behavioral Research

Cross-linguistic studies have suggested that extensive experience with a native tonal language attunes the perception of pitch contour in language processing (Burnham & Francis, 1997; Gandour, 1983; Gandour & Harshman, 1978; Stevens, Keller, & Tyler, 2013; Wayland & Guion, 2004; Xu, Krishnan, & Gandour, 2006b) and music processing (Alexander, Bradlow, Ashley, & Wong, 2011; Bidelman, Hutka, & Moreno, 2013; Stevens et al., 2013; see Chap. 8 of this book for a review by Ong and colleagues). For example, Mandarin listeners perceive the subtle acoustic differences among Mandarin tonal categories better than non-Mandarin speakers (Hallé, Chang, & Best, 2004; Leather, 1983; Lee, Vakoch, & Wurm, 1996), and Cantonese speakers discriminate Mandarin lexical tones better than English speakers (Lee et al., 1996). Native speakers of a tonal language (i.e., Thai) were faster and more accurate at discriminating pitch contours in both natural speech and musical contour tasks (Stevens et al., 2013). Mandarin listeners outperformed English listeners when discriminating Thai lexical tones after training, under both short and long interstimulus interval conditions (Wayland & Guion, 2004). Furthermore, using a multidimensional scaling method, Gandour and colleagues found a difference in perceptual dimension weighting between Chinese listeners and native English listeners when processing Mandarin tones: native English listeners tended to rely on pitch height, whereas Chinese (Cantonese and Mandarin) listeners attended to both pitch height and pitch direction (Gandour, 1984; Gandour & Harshman, 1978).

Whether Chinese individuals with ASD demonstrate the same superior attunement to lexical tone and musical prosody relative to non-tonal language speakers with and without ASD is an open question, owing to a lack of cross-language research. There are only five behavioral speech perception studies on Chinese children with ASD: one focused on lexical tone, one on intonation cues, and one on musical perception, while the remaining two focused on sentential prosody. These studies revealed differences between children with ASD and children with typical development in processing linguistic and non-linguistic stimuli, and the findings were generally consistent with studies of prosodic perception in children learning non-tonal languages. Below are the highlights of each study.

Chen et al. (2016) examined lexical tone (Tone 1 vs. Tone 2) identification and discrimination of the Mandarin syllable /i/ in eleven 6- to 8-year-old boys with ASD. The stimuli were drawn from an 11-step lexical tone continuum ranging from a high-level tone (Tone 1) to a low-rising tone (Tone 2). The children with ASD had an average developmental language age of 3 years and 6 months. Compared to age-matched typically developing controls, the children with ASD exhibited much lower discrimination accuracy and a broader category identification boundary. Furthermore, a strong negative correlation between boundary width and developmental language age was found in the children with ASD.
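To make the notion of a lexical tone continuum concrete, the sketch below generates an 11-step set of F0 contours by linear interpolation between a high-level contour and a low-rising contour. The specific F0 values and contour shapes are our own illustrative assumptions, not the parameters of Chen et al.’s (2016) stimuli.

```python
# Illustrative 11-step F0 continuum between a high-level tone (Tone 1) and a
# low-rising tone (Tone 2). Endpoint contours are assumed for illustration.
import numpy as np

n_frames = 100                                    # F0 samples across the syllable
t = np.linspace(0.0, 1.0, n_frames)

tone1 = np.full(n_frames, 220.0)                  # high-level contour (Hz), assumed
tone2 = 170.0 + 60.0 * t                          # low-rising contour (Hz), assumed

n_steps = 11
continuum = [tone1 + (step / (n_steps - 1)) * (tone2 - tone1)
             for step in range(n_steps)]          # step 0 = Tone 1 ... step 10 = Tone 2

for i, contour in enumerate(continuum, start=1):
    print(f"step {i:2d}: onset {contour[0]:.0f} Hz, offset {contour[-1]:.0f} Hz")
```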

Li, Law, Lam, and To (2013) examined how Cantonese-speaking children with ASD used sentential prosodic cues and sentence-final particles (SFPs) in ironic stories to judge a speaker’s belief and intent. Ironic expression is usually achieved via a slow speaking rate, larger pitch variation, and greater intensity (Cutler & Bruck, 1974). As discussed above, some believe that SFPs and intonation have a trading relation with each other in conveying sentential connotation (Kwok, 1984; Yau, 1980). Li et al. (2013) tested children with and without ASD between the ages of 8.3 and 12.9 years using 16 ironic stories and 5 complementary stories. The two groups demonstrated similar levels of comprehension when sentences contained neither prosodic nor SFP cues, and participants in both groups answered questions about the factual content of the stories with similar accuracy. Large group differences were observed for sentences with prosodic cues only, sentences with SFP cues only, and sentences with both prosodic and SFP cues: the Cantonese-speaking children with ASD failed to exploit either prosodic or SFP cues, similarly to English-speaking children with ASD (Happé, 1993). Note that the Cantonese-speaking TD children answered the questions with accuracy only slightly above chance in the prosody-only condition, suggesting that this condition is challenging even for TD children.

Jiang, Liu, Wan, and Jiang (2015) investigated the discrimination and identification of musical and linguistic pitch contours in 17 Mandarin-speaking individuals with high-functioning autism (ages 6.0–16.2 years) and 17 control children and adolescents matched on age, nonverbal IQ, and years of music training. They used five-tone sequences for melodic contour discrimination and identification, and disyllabic verb–object constructions as stimuli for the speech intonation discrimination and identification tasks. Participants were asked to match an auditory sequence with a visual display of its melodic contour or to identify whether a disyllabic verb–object construction was a statement or a question. The ASD group performed worse than the control group on Mandarin intonation discrimination and identification, but better than the controls on the melodic contour identification task; the two groups showed similar performance on the melodic contour discrimination task. Jiang and colleagues suggested that linguistic pitch may not be processed in the same way as musical pitch in Chinese-speaking individuals with ASD.

Mandarin sentences containing question words such as “what” (什么) and “who” (谁) can function as either statements or questions, depending on their prosodic features (rising vs. falling sentence-final intonation) and/or their semantic structure (e.g., the addition of the Mandarin universal quantifier “dou”/都, meaning “all”). Su, Jin, Wan, Zhang, and Su (2014) tested 28 children with high-functioning ASD (14 four- to eight-year-olds; 14 nine- to fourteen-year-olds) and 28 age-matched TD controls using a computerized sentence comprehension task in which statements and questions were identified on the basis of either prosodic or semantic cues. They found that older children with ASD performed on par with their TD peers, using either prosodic cues in ambiguous sentences or semantic cues in unambiguous sentences. However, younger children with ASD performed more poorly than the TD controls on statement sentences under all structure conditions (prosodic, semantic, and control structures). These results highlight a developmental delay in children with ASD in comprehending statement sentences containing wh-words. The findings of this study, together with several studies on English-speaking individuals with ASD, support the claim that grammatical prosody is relatively spared in children with ASD compared to affective or pragmatic prosody.

Wang and Tsao (2015) did not aim to explore tonal language-specific processing in Mandarin-speaking children with ASD; rather, they aimed to examine “emotional prosody” perception in this group. Because the data were collected from tonal language-learning children, however, they can potentially reflect an interaction between tonal language experience and emotional prosody processing; moreover, this is the only study so far to have examined emotional prosody processing in tonal language-learning children with ASD. For both reasons, we included the study in this review. Wang and Tsao (2015) used three emotional tones of voice (happy, sad, and angry) and a neutral tone presented in words and short sentences. Twenty-five boys with high-functioning autism (HFA) and 25 TD boys between 6 and 11 years of age were tested using an emotional prosody identification task. Children with HFA performed more poorly than TD children in identifying prosodic patterns associated with “happy,” but did not differ in identifying prosodic patterns associated with “sad” or “angry.” This was true regardless of whether the semantic content of the stimulus (words or sentences) was neutral or emotionally relevant. Correlation analyses revealed a strong positive association between the perception accuracy of happy prosody and the pragmatic language skills and social adaptation skills of the children with ASD.

4.4 Neural Indices of Pitch and Speech Processing in Individuals with ASD from Non-tonal Language Backgrounds

Event-related potentials (ERPs), recorded using electroencephalography (EEG), and event-related fields (ERFs), recorded using magnetoencephalography (MEG), are the measures most often used to examine auditory processing of pure tones and speech. Many such ERP/ERF studies have adopted an oddball paradigm in which repetitions of one sound pattern are interspersed with an infrequent sound pattern. The obligatory long-latency ERP/ERF sequence of peaks, P1-N1-P2, is thought to reflect the brain’s response to the physical features of the stimuli at the scalp level, with multiple neural generators in the primary and secondary auditory cortex (Näätänen & Picton, 1987; Ponton, Eggermont, Khosla, Kwong, & Don, 2002; Scherg & Von Cramon, 1986). The P1 response to auditory events is observed at frontocentral sites in early childhood, and P1 latency shifts earlier as the brain matures (Čeponienė, Rinne, & Näätänen, 2002; Choudhury & Benasich, 2011; Kushnerenko, Čeponienė, Balan, Fellman, & Näätänen, 2002; Morr, Shafer, Kreuzer, & Kurtzberg, 2002; Shafer, Yu, & Datta, 2010; see Sharma, Glick, Deeves, & Duncan, 2015 for a review). N1 and P2 are not always apparent in young children for auditory stimuli presented at rates faster than about one per second (Ponton, Eggermont, Kwong, & Don, 2000), and the P1-N1-P2 complex does not reach full maturity until the later adolescent years (Ponton et al., 2000). Bishop, Hardiman, Uwer, and von Suchodoletz (2007) reported that, in children under 11 years of age, P1 amplitude appears to be larger and P1 peaks earlier for speech than for pure tones.

Mismatch negativity (MMN) serves as an index of automatic, preattentive cortical discrimination of an auditory contrast. The MMN is largest at frontal sites and is best seen by subtracting the response to a frequent stimulus/pattern from the response to an infrequent stimulus/pattern. The MMN is larger for greater physical (acoustic) differences between two stimuli and is often larger for speech sounds that cross a phoneme boundary than for speech sounds that fall within the same phoneme category (Näätänen, Paavilainen, Rinne, & Alho, 2007). Significant maturational changes have also been observed in the presence, amplitude, and latency of the MMN in TD children with non-tonal language backgrounds (Friederici, Friedrich, & Weber, 2002; Kushnerenko et al., 2007; Leppänen et al., 2002; Morr et al., 2002; Shafer et al., 2010; Shafer, Yu, & Datta, 2011) and tonal language backgrounds (Cheng et al., 2013, 2015; Liu, Chen, & Tsao, 2014). In particular, infants and young children often show a positive mismatch response (pMMR) rather than, or in addition to, the MMN (Shafer et al., 2010). The pMMR may reflect greater recovery from refractoriness, because the deviant stimulus is less frequent.
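As a concrete illustration of the subtraction logic described above, the following minimal sketch computes an MMN difference wave with MNE-Python. The event labels, and the use of MNE rather than any particular study’s software, are our own assumptions.

```python
# Minimal sketch of the standard MMN computation (deviant minus standard),
# assuming an MNE-Python Epochs object with events labeled "standard" and
# "deviant". Event names and the analysis window are assumptions.
import mne

def mmn_difference_wave(epochs: mne.Epochs) -> mne.Evoked:
    evoked_std = epochs["standard"].average()
    evoked_dev = epochs["deviant"].average()
    # Difference wave: deviant minus standard (weights 1 and -1).
    return mne.combine_evoked([evoked_dev, evoked_std], weights=[1, -1])

# Typical usage: inspect the fronto-central negativity, e.g., 100-300 ms.
# diff = mmn_difference_wave(epochs)
# diff.plot(picks="Fz")
```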

The P3a response is an index of involuntary attention switching elicited by a salient or rare stimulus change. Its latency is later than that of the MMN and shifts progressively earlier from early toddlerhood onward, stabilizing at around 12 years of age (Fuchigami et al., 1995). However, its scalp distribution does not match that of adults until late adolescence (Määttä et al., 2005).

The first few ERP studies to examine auditory processing in children with ASD suggested superior processing of non-speech stimuli compared to typically developing controls. Oades, Walker, Geffen, and Stern (1988) tested seven children with ASD and found shorter N1 latencies and larger N1 amplitudes to a pure tone contrast in the ASD group than in TD controls. Ferri et al. (2003) also found that children with low-functioning autism (LFA) showed earlier and enhanced N1 peaks to pure tone contrasts (1000 Hz vs. 1300 Hz). Enhanced MMN and/or P3a responses to auditory tones have also been reported in children with ASD (Ferri et al., 2003; Kujala et al., 2007). Gomot et al. (2008), using a combination of behavioral discrimination and fMRI measures, found that children with Asperger’s syndrome were hyperreactive to sound, as indicated by faster discrimination reaction times with similar response accuracy; they also showed stronger activation in the prefrontal and inferior parietal cortices during complex tone discrimination.

Čeponienė et al. (2003) found that pitch encoding and discrimination, as measured by the P1-N2-P2-N4, MMN, and P3a, were similar in children with ASD and a TD group, although they did observe a marginally smaller P1 amplitude in the ASD group. In contrast, other studies have found delayed and diminished cortical responses to auditory and speech stimuli in children with ASD. School-aged children with ASD demonstrated delayed N1 latencies to pure tones compared to age-matched healthy controls (longer N1c latencies; Bruneau, Bonnet-Brilhault, Gomot, Adrien, & Barthélémy, 2003; Bruneau, Roux, Adrien, & Barthélémy, 1999). Another study revealed diminished and delayed P1 responses to pure tone contrasts (Jansson-Verkasalo et al., 2003). Diminished P1, N2, P3, and N4 responses have also been reported for vowel contrasts in children with ASD (Whitehouse & Bishop, 2008). Quite a number of studies have reported that individuals with ASD have delayed and/or diminished MMN and P3a responses to phonemic change (Lepistö et al., 2005, 2006). We have also observed a delayed P1 peak to auditory words in a picture–word priming paradigm in minimally verbal 3- to 7-year-old children with ASD compared to age-matched controls (Cantiani et al., 2016). The conflicting findings across studies may be due to experimental factors, such as the stimuli and tasks used, as well as to the heterogeneous nature of the ASD population.

4.5 Neurophysiological Measures of Pitch Processing in Chinese-Speaking Individuals with ASD

The pursuit of understanding how the autistic brain processes native lexical tone has only just begun. To date, only three studies, all from the same research group, have been undertaken (Huang et al., 2017; Wang, Wang, Fan, Huang, & Zhang, 2017; Yu et al., 2015). These studies suggest that children with ASD have greater difficulty processing speech than non-speech information, but that certain types of cues (e.g., duration) may be spared. Below are the highlights of the three studies.

Yu et al. (2015) were the first to examine whether enhanced processing of lower-level perceptual features such as pitch variation would hinder the processing of higher-level phonemic units, namely lexical tone categories. In their first oddball experiment, three types of stimuli were used: a simple pure tone contrast (standard 216 Hz vs. deviant 299 Hz), a lexical tone contrast (standard /bai2/ vs. deviant /bai4/), and a nonword condition (standard /rai2/ vs. deviant /rai4/). For the lexical tone contrast, the ASD group had larger MMN amplitudes than the TD control group at the vertex site (Cz) and smaller MMNs than the TD group at the frontal site (Fz). MMN was present in the TD group for all contrast types, but in the ASD group, MMN was absent for the nonword /rai2/-/rai4/ contrast at both Fz and Cz. In contrast to their TD peers, the ASD group showed enhanced P3a amplitude for the pure tone contrast. To further examine the influence of lexicality, children were also tested using hummed speech. The ASD group had larger MMN amplitude than the controls at Cz but not at Fz; the ASD group also showed larger P3a amplitude at Cz, and the TD group showed a tendency toward shorter P3a latencies at Fz compared to the ASD group. The authors speculated that the reduced neural sensitivity in lexical tone processing was probably due to inadequate suppression of irrelevant within-category pitch differences.

The account of a speech-specific deficit in lexical tone processing in autism proposed by Yu et al. (2015) was further supported by the findings of Wang et al. (2017), who used rigorously controlled synthetic speech and non-speech contrasts. The speech condition consisted of three acoustically equidistant stimuli from a nine-step /ba2/ (step 1) to /ba4/ (step 9) continuum, yielding a between-category contrast (steps 1 and 5) and a within-category contrast (steps 5 and 9). The non-speech condition consisted of three complex tone stimuli that matched the speech stimuli on all acoustic parameters except harmonic composition. Using a passive-listening oddball paradigm, the study found that the TD controls showed larger MMN to the between-category than to the within-category contrast, whereas the ASD group had equivalent MMNs for the between- and within-category comparisons, indicating a lack of categorical perception in the children with ASD. No significant P3a was observed in any condition for either group. Results from time–frequency analysis provided further evidence of group differences: the two groups demonstrated similar phase locking to the harmonic (non-speech) stimuli, but in the lexical tone condition only the TD group showed a significant inter-trial phase coherence (ITPC) difference in the theta band between the MMNs for within- versus cross-category contrasts. This evidence further suggests that children with ASD do not show categorical perception of lexical tone.
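For readers who wish to see how an ITPC measure of this kind is typically obtained, the following sketch computes theta-band inter-trial phase coherence with Morlet wavelets in MNE-Python. The frequency range, cycle settings, channel, and analysis window are illustrative assumptions rather than the exact parameters of Wang et al. (2017).

```python
# Sketch of a theta-band inter-trial phase coherence (ITPC) analysis using
# Morlet wavelets, assuming an MNE-Python Epochs object for one condition.
# Frequencies, n_cycles, channel, and time window are assumptions.
import numpy as np
import mne
from mne.time_frequency import tfr_morlet

def theta_itpc(epochs: mne.Epochs, channel: str = "Fz",
               tmin: float = 0.1, tmax: float = 0.3) -> float:
    """Mean theta-band ITPC at one channel within a post-stimulus window."""
    freqs = np.arange(4.0, 8.0, 0.5)          # assumed theta band, 4-8 Hz
    n_cycles = freqs / 2.0                    # shorter wavelets at lower frequencies
    # return_itc=True yields (power, itc); ITPC ranges from 0 (random phase)
    # to 1 (perfect phase locking across trials).
    _, itc = tfr_morlet(epochs, freqs=freqs, n_cycles=n_cycles,
                        use_fft=True, return_itc=True, decim=2)
    ch = itc.ch_names.index(channel)
    window = (itc.times >= tmin) & (itc.times <= tmax)
    return float(itc.data[ch][:, window].mean())
```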

The duration of speech segments can also serve to distinguish meaning in some languages (e.g., Finnish and Japanese). Behavioral and neurological evidence indicates that individuals with ASD have deficits in processing small durational differences in auditory and speech contrasts (Brodeur, Gordon Green, Flores, & Burack, 2014; Falter, Noreika, Wearden, & Bailey, 2012; Lambrechts, Falter-Wagner, & van Wassenhove, 2017; Lepistö et al., 2005; Maister & Plaisted-Grant, 2011; Szelag, Kowalska, Galkowski, & Pöppel, 2004). To address whether individuals with tonal language backgrounds also show deficits in temporal processing, Huang et al. (2017) used both pure tones (295 Hz) and nonsense syllables (/tý/) of two durations (250 ms vs. 350 ms) in a passive oddball paradigm and compared the neurophysiological responses to these duration changes in school-aged children with ASD and their TD peers. A delayed and diminished MMN peak was found in the ASD group relative to the TD control group for the pure tone condition only, whereas a delayed and diminished P3a peak was evident in the ASD group for the speech condition only. It is not entirely clear how to interpret this finding, given that the pure tone and syllable conditions yielded different results for comparable within-category duration changes. Clearly, additional research is necessary to fully understand how native language experience modulates auditory processing in children with ASD.

5 Treatment Study of Children with ASD Using Transcranial Direct Current Stimulation: A Feasibility, Pilot Study

Transcranial direct current stimulation (tDCS) is a noninvasive technique in which constant, low-intensity electrical current is applied to the scalp. The method has been extensively studied in animal models, where it is well established that this type of stimulation can alter the threshold, firing rate, and balance of excitation and inhibition of neurons and can therefore modulate brain function both in vivo and in vitro (Bikson et al., 2004; Bindman, Lippold, & Redfearn, 1964; Chan, Hounsgaard, & Nicholson, 1988; Purpura & McMurtry, 1965; Rahman, Toshev, & Bikson, 2014; see Reato, Rahman, Bikson, & Parra, 2013 for a review). In the past decade, numerous studies have reported that noninvasive brain stimulation (NIBS) techniques, such as transcranial magnetic stimulation (TMS) and tDCS, can facilitate language recovery in patients with aphasia (see Norise & Hamilton, 2016 for a review).

Whether language-related plasticity in the brains of children with ASD can be modulated by noninvasive brain stimulation is an emerging area of research. The available literature on the use of tDCS in ASD is preliminary and consists of studies with methodological limitations (see Jacobson, Koslowsky, & Lavidor, 2012 for a review), but some of the results are promising. A treatment course of 20 NIBS sessions was found to improve social and behavioral scale scores in children with ASD, with effects lasting six months (Gómez et al., 2017). A single session of anodal tDCS was found to increase peak alpha frequency at the stimulation site and to decrease autistic-like behavioral symptoms (1 mA, 20 min; Amatachaya et al., 2015), and to increase syntax comprehension in children with ASD (2 mA, 30 min; Schneider & Hopp, 2011). In a randomized controlled trial, children with attention deficit hyperactivity disorder showed increased inhibition accuracy after a single session of tDCS (1.5 mA, 15 min; Soltaninejad, Nejati, & Ekhtiari, 2015).

As reviewed above, Mandarin-learning children with ASD have shown atypical cortical oscillations (Wang et al., 2017); specifically, they may have deficits in inhibiting neural sensitivity to within-category lexical tone variation. We were therefore interested in whether the sensory and cognitive functions associated with aberrant excitatory and inhibitory neuronal activity can be modulated by tDCS. If brain stimulation via tDCS can enhance the inhibitory neural activity associated with within-category speech processing, then such treatment could potentially enhance the categorization of speech that varies at the within-category level. The following pilot study was designed to test this hypothesis.

5.1 Materials and Methods

Participants

Data from two children with ASD (male, 8.1 years old and 9.8 years old) and two typically developing control children (male, 9.11 years old and 10.7 years old) were obtained. A 10.6-year-old nonverbal child with ASD was recruited but could not be tested due to lack of adequate compliance and was therefore excluded from the study. All four children were from the same neighborhood in the metropolitan New York City area, and all were from families in which both parents were native Mandarin speakers. Language background questionnaires completed by the parents indicated that all four children had received consistent Mandarin exposure from both parents since birth. According to parental reports and the background questionnaires, all four children understood daily conversations in Mandarin and could carry out simple conversations about daily routines in Mandarin. All four children also had consistent English exposure at school, where English was the language of instruction. According to parental reports, the two children with ASD were more dominant in Mandarin, whereas the two typically developing children were more dominant in English at the time of testing. Both children with ASD were diagnosed by certified developmental psychologists around 3 years of age, and both had individualized education programs (IEPs) and were receiving special education through the public school system because of their ASD diagnoses. Their diagnoses were also validated by the school’s special education teacher, as well as by an experienced speech-language pathologist (the first author). Informed consent from each parent and verbal assent from each child were obtained following a protocol approved by the local institutional review board.

The Event-Related Potential Procedures

Stimuli Disyllabic nonword stimuli contrasting Tone 2 and Tone 3 were used in an oddball paradigm. The frequent/standard stimuli were three tokens of /gu3pa1/, and the infrequent/deviant stimuli were two tokens of /gu2pa1/. Multiple tokens of the same lexical category were used to encourage between-category rather than within-category processing: the tokens varied in irrelevant acoustic detail, so only the relevant tone difference could be used to categorize the stimuli correctly. These stimuli were used in Yu, Shafer, and Sussman (2017), in which native Mandarin-speaking adults showed larger MMN responses to these lexical tone differences than English-speaking adults. A total of 165 deviant (20%) and 645 standard (80%) stimuli were presented in 15 blocks with an average interstimulus interval of 675 ms (645–709 ms). Blocks were separated by 10-s breaks.
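The sketch below illustrates how a pseudorandomized presentation sequence with these proportions (43 standards and 11 deviants per block, 15 blocks) might be generated. The constraint that no two deviants occur in a row is a common convention that we assume here for illustration; it is not a reported detail of the stimulus lists.

```python
# Sketch of a pseudorandomized oddball sequence with roughly the reported
# proportions (80% standard /gu3pa1/, 20% deviant /gu2pa1/) over 15 blocks.
# The "no consecutive deviants" rule is an assumed convention.
import random

random.seed(13)  # reproducible illustration

def oddball_block(n_standard: int = 43, n_deviant: int = 11) -> list[str]:
    trials = ["standard"] * n_standard + ["deviant"] * n_deviant
    while True:
        random.shuffle(trials)
        # re-shuffle until no two deviants are adjacent (assumed constraint)
        if all(not (a == b == "deviant") for a, b in zip(trials, trials[1:])):
            return trials

blocks = [oddball_block() for _ in range(15)]
n_dev = sum(t == "deviant" for block in blocks for t in block)
n_all = sum(len(block) for block in blocks)
print(f"{n_dev} deviants out of {n_all} trials ({n_dev / n_all:.0%})")
```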

ERP recording The electroencephalogram (EEG) was recorded using a 65-channel sensor net at a sampling rate of 500 Hz with a band-pass filter of 0.1–100 Hz and was time-locked to stimulus onset. Two sessions of ERP recordings were collected, one before the tDCS procedure and one shortly after the tDCS procedure during the same laboratory visit. The data were band-pass filtered offline at 0.3–15 Hz and segmented from 200 ms before to 700 ms after stimulus onset. Artifact rejection, baseline correction, and re-referencing to the average reference were performed in BESA 6.1. All children contributed at least 100 trials in the deviant condition. The amplitudes of the subtraction waves (deviant minus standard) were compared across the four participants.
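For illustration, the offline steps described above can be approximated in MNE-Python as follows. The file name, event codes, and amplitude rejection threshold are hypothetical, and the actual processing was carried out in BESA 6.1.

```python
# Approximation of the offline steps described above (0.3-15 Hz band-pass,
# -200 to 700 ms epochs, baseline correction, average reference), written in
# MNE-Python rather than BESA. File name, event codes, and rejection
# threshold are hypothetical.
import mne

raw = mne.io.read_raw_egi("subject01.raw", preload=True)   # hypothetical 65-channel file
raw.filter(l_freq=0.3, h_freq=15.0)
raw.set_eeg_reference("average", projection=False)

events = mne.find_events(raw)
event_id = {"standard": 1, "deviant": 2}                    # hypothetical event codes
epochs = mne.Epochs(raw, events, event_id=event_id,
                    tmin=-0.2, tmax=0.7, baseline=(-0.2, 0.0),
                    reject=dict(eeg=100e-6),                # simple artifact rejection
                    preload=True)

# Per-condition averages; the deviant-minus-standard difference wave can then
# be formed as in the earlier MMN sketch.
evoked_standard = epochs["standard"].average()
evoked_deviant = epochs["deviant"].average()
```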

High-Definition-tDCS Procedure

We applied high-definition (HD) tDCS using the Soterix 1 × 1 tDCS Low-Intensity Stimulator with a Soterix 4 × 1 adaptor (Soterix Medical, New York, NY). The stimulating ring electrodes were placed over the frontocentral scalp region (C3, C4, F3, and F4 as cathodes and FCz as the anode). The sintered Ag/AgCl ring electrodes were fixed in place with an EEG cap fitted with HD-tDCS electrode holders (Soterix Medical, New York, NY). The impedances of all five electrodes were in the range of 0.50–0.80 quality value. The anode (FCz) was set to deliver a total current of 1 mA, and the four return electrodes shared the same total current of 1 mA (0.25 mA each). The duration of stimulation was 10 min. We used an intensity of 1 mA because HD-tDCS delivers more focal stimulation (Edwards, Cortes, Datta, Minhas, Wassermann, & Bikson, 2013) and to reduce the tingling sensation that many children with ASD might not tolerate. Furthermore, a single 20-min session of conventional tDCS at 1 mA was found to elicit significant behavioral improvement in children with ASD (Amatachaya et al., 2015).
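The montage can also be summarized as a simple configuration; the sketch below is only an illustrative bookkeeping check that the anodal and return currents balance, not vendor (Soterix) software.

```python
# Bookkeeping sketch of the 4 x 1 HD-tDCS montage described above: the anodal
# current must equal the summed return currents (values in mA).
montage = {
    "anode": {"FCz": 1.0},
    "cathodes": {"C3": -0.25, "C4": -0.25, "F3": -0.25, "F4": -0.25},
    "duration_min": 10,
}

total_return = sum(montage["cathodes"].values())
assert abs(montage["anode"]["FCz"] + total_return) < 1e-9, "currents must sum to zero"
print("montage balanced:", montage)
```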

5.2 Preliminary Results

P1-N1-P2 Results

Figure 13.1 shows the ERP waveforms to the standard stimuli for all four participants before and after tDCS stimulation. The two TD children showed a P1 (80–100 ms), followed by an N1 (around 150 ms) and a P2 (220 ms). The N1 was attenuated compared to that of adults, but it was clearly emerging for these 10-year-old children (note that the N1 is attenuated at short interstimulus intervals in children under 10 years of age). The two children with ASD showed only a broad P1 peak followed by an N2 around 250 ms, a pattern typically observed in younger typically developing children. After tDCS, there was a negative shift in the response at Fz after about 100 ms for all participants, with the inverse pattern at the left mastoid (LM). The children with ASD, however, showed somewhat different patterns: ASD01 showed a pattern similar to that of the TD children from about 200 to 350 ms, whereas ASD02 appeared to show an increased response (i.e., greater Fz negativity and greater LM positivity) from 200 to 600 ms.

Fig. 13.1 Grand average ERPs to the standard stimulus at Fz and the left mastoid. The top panel shows the two control participants, and the bottom panel shows the two participants with ASD

MMN Results

Figure 13.2 shows the topography of the subtraction (deviant minus standard) waves. Before tDCS stimulation, the two TD participants showed a negative response (blue) between 150 and 250 ms over the frontocentral scalp region, whereas a predominantly positive response (red) in the same time window was evident in both participants with ASD. A robust negativity over the superior central scalp region was seen in both children with ASD between 300 and 400 ms. Post-tDCS, the peak of the negativity shifted earlier in both TD children and in one of the children with ASD (Participant D in Fig. 13.2). Due to the small sample size, we can only speculate about what these differences indicate. The first, superior negative peak was possibly the MMN and was observed in both the children with ASD and the TD children; however, it appeared in a much later time window, both before and after HD-tDCS, in the children with ASD than in the TD children. The latency of this negativity in the two children with ASD was very similar to that of the four- to seven-year-old TD children in Shafer et al. (2010). Younger children (from infancy to seven years of age) often show a positive mismatch response, sometimes alone (in infants) and sometimes preceding the MMN (Morr et al., 2002; Shafer, Yu, & Garrido-Nag, 2012). In addition, the late negativity in the two children with ASD may correspond to the late negativity (LN) observed in children with specific language impairment by Shafer, Morr, Datta, Kurtzberg, and Schwartz (2005). This pattern might indicate a developmental delay in processing complex speech sounds. The ERP amplitude changes post-HD-tDCS suggest that both the children with ASD and the TD children responded to the HD-tDCS treatment. These preliminary data also suggest that this is a promising approach for examining the effect of HD-tDCS on lexical tone processing.

Fig. 13.2 Topographical voltage maps of the subtraction wave (deviant minus standard) between 150 and 400 ms after stimulus onset. Red indicates positivity, and blue indicates negativity. A and B are typical control children; C and D are participants with ASD

6 General Discussion and Future Directions

The literature concerning auditory processing in Chinese/tonal language-speaking individuals with ASD is sparse, with only a few studies examining perception or production and very few attempting to address both non-linguistic basic auditory pitch processing and higher-level linguistic pitch and prosody processing. Further research is needed first to replicate the findings of these few studies and then to allow a synthesis of the evidence. Even so, some preliminary comparisons can be made between the general literature on auditory processing in children with ASD and these studies of Chinese-speaking children with ASD.

6.1 Developmental Prosody and Lexical Tone Processing in Children with ASD

The phonetic features that distinguish lexical tone categories can be specified by multiple parameters along several spectral and temporal dimensions (e.g., onset F0, offset F0, F0 contour shape, and the duration of the contour). TD children make few production errors on Mandarin lexical tones in picture-naming tasks as early as 1.6 years of age (Hua & Dodd, 2000). However, based on results from perceptual judgment tasks, 6-year-old Mandarin-speaking children do not yet produce fully adultlike lexical tones, nor do they reach adultlike perception levels (Wong, 2013; Wong, Schwartz, & Jenkins, 2005), and MMN responses to lexical tone change are not adultlike in school-aged children (Liu et al., 2014). Developmental changes in lexical tone perception in Chinese-learning children with ASD have been addressed in only a few studies. Certain prosodic cues (e.g., prosodic cues to irony) and lexical tone contrasts (e.g., the difference between Tone 2 and Tone 3 in Mandarin) are intrinsically more challenging to learn, even for TD children. As shown in Li et al. (2013), TD children between 8 and 12 years of age do not demonstrate high comprehension accuracy for prosodic cues encoding irony. In Su et al. (2014), the older children with HFA performed at or near ceiling, as well as the TD controls, on all types of sentence processing, and performance gaps were evident only between the younger children with and without ASD. Such findings point to a developmental delay, rather than a persistent deficit, in children with ASD when processing statement sentences that contain wh-words (e.g., “what” or “shenme” in Mandarin).
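The short sketch below illustrates how the tone parameters named above might be computed from an F0 track; the contour values and frame rate are illustrative assumptions, not measurements from child speech.

```python
# Sketch of the tone parameters named above (onset F0, offset F0, F0 contour
# slope, and contour duration) computed from an F0 track sampled at a fixed
# frame rate. The example contour and frame rate are assumed for illustration.
import numpy as np

frame_rate = 100.0                                 # F0 frames per second (assumed)
f0_track = 170.0 + 60.0 * np.linspace(0, 1, 30)    # e.g., a rising, Tone 2-like contour

onset_f0 = f0_track[0]
offset_f0 = f0_track[-1]
duration_s = len(f0_track) / frame_rate
slope_hz_per_s = (offset_f0 - onset_f0) / duration_s

print(f"onset {onset_f0:.0f} Hz, offset {offset_f0:.0f} Hz, "
      f"duration {duration_s:.2f} s, slope {slope_hz_per_s:.0f} Hz/s")
```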

Future studies are needed to elaborate on the developmental trajectories of various prosodic cues and lexical categories in tonal language-learning children, both with and without ASD. This step is necessary before determining how to use this information to enhance clinical practice.

6.2 Lexical Tone and Music Processing in ASD

Mounting evidence suggests that the neural resources supporting language processing can be adapted for musical processing (Koelsch et al., 2002; Maess, Koelsch, Gunter, & Friederici, 2001; Patel, Gibson, Ratner, Besson, & Holcomb, 1998), and we suggest that future studies explore this relationship in children from tonal language backgrounds. Frameworks such as the shared syntactic integration resource hypothesis (SSIRH) propose that the neural resources shared between music and language processing are located in frontal brain regions and are recruited “when structural integration of incoming elements in a sequence is costly”; that is, these shared networks are the “processing regions” for structural integration in linguistic syntax and tonal harmony (Patel, 2003, 2014).

In healthy individuals, language experience fine-tunes the production and processing of critical auditory elements in both speech and non-speech domains. For example, Mandarin speakers are better at imitating and discriminating musical pitch than English speakers (Pfordresher & Brown, 2009), and they are more sensitive to both lexical pitch and non-speech (harmonic) pitch category boundaries (Xu, Gandour, & Francis, 2006a). Mutual enhancement between language and music, as seen in musicians and tonal language speakers, has been widely reported (Bidelman et al., 2013). In principle, such mutual enhancement should also benefit music and speech processing in individuals with ASD and other communication disorders.

Accumulating evidence suggests that individuals with ASD from tonal language backgrounds share the characteristics of superior melodic contour processing and inferior linguistic prosody processing observed in individuals with ASD from non-tonal language backgrounds (Heaton, 2005; Järvinen-Pasley & Heaton, 2007; Jiang et al., 2015). This disparity between music- and speech-processing skills has led researchers to conclude that linguistic pitch and musical pitch are processed differently in individuals with ASD, and that tonal language experience does not compensate for the linguistic prosody processing deficit in ASD (Jiang et al., 2015). Currently, there is no direct cross-language comparison; further cross-language studies are needed to test whether individuals with ASD from tonal language backgrounds have an advantage over their counterparts from non-tonal language backgrounds in musical and linguistic prosody processing.

Positive transfer from music training to speech processing has been demonstrated in TD children from non-tonal language backgrounds (Moreno et al., 2009). The OPERA model proposed by Patel (2011) posited that acoustic features shared by speech and music, such as F0 changes over time, are processed in an anatomically overlapping network. Patel further argued that music perception “place[s] higher demands on the encoding of certain acoustic features than does speech perception” for adequate communication. However, there is no evidence on whether this claim applies to all types of languages, especially tonal languages, in which high precision of pitch perception and production is a constant demand. Further studies should examine whether there is positive transfer from lexical tone learning to music processing, or from music training to lexical tone learning, in individuals with ASD. Answers to such questions will provide evidence for testing theory and for designing therapeutic frameworks for tonal language speakers with ASD.

6.3 Effects of Experimental Variables on Chinese Individuals with ASD

So far, all nine studies on Chinese-speaking individuals with ASD have recruited individuals with HFA; no study has yet focused on minimally verbal individuals with ASD, who account for about one-third of the ASD population. Our study, one of the few existing studies of this subgroup, suggests that some English-exposed children with ASD and minimal verbal skills have intact (although slightly delayed) lower-level visual and auditory processing skills (Cantiani et al., 2016). Unfortunately, the tasks employed in studies examining prosody and lexical tone production and processing often require functional communication skills (e.g., answering questions) as well as competency in social interaction (e.g., initiating conversation). Such demands preclude the participation of children with LFA who have limited verbal skills. Neurophysiological methods, including EEG-ERP methods, can use tasks that do not require a response (e.g., the passive oddball paradigm used by Zhang’s research group reviewed above provides a good alternative for testing individuals with LFA). However, individuals with LFA often have sensory sensitivities, and some cannot tolerate wearing the sensor net; on the other hand, it is possible to desensitize some of these children so that they will be able to tolerate electrodes (see Roesler et al., 2013). Multiple studies have conducted correlational analyses to examine the relationship between prosodic characteristics and the severity of autistic symptoms; some did not find a correlation, but given that few studies have included children with LFA, it remains unclear whether a relationship exists. This shortcoming is particularly important given that the DSM-5 definition of autism does not differentiate HFA from LFA.
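
As a concrete illustration of this kind of passive paradigm, the following is a minimal, hypothetical sketch of how a passive-oddball trial sequence might be generated; the 15% deviant rate, the requirement of at least two standards before each deviant, and the example Mandarin tone stimuli are assumptions for illustration and do not describe the specific protocols of the studies reviewed here.

import random

def oddball_sequence(n_trials=500, deviant_prob=0.15, min_standards_between=2, seed=1):
    """Generate a passive-oddball sequence of 'standard'/'deviant' trial labels.

    Deviants occur with roughly `deviant_prob` probability but never within
    `min_standards_between` trials of the previous deviant, so every deviant
    is preceded by enough standards to establish a sensory memory trace.
    """
    rng = random.Random(seed)
    seq, since_last_deviant = [], 0
    for _ in range(n_trials):
        if since_last_deviant >= min_standards_between and rng.random() < deviant_prob:
            seq.append("deviant")            # e.g., Mandarin Tone 3 syllable /ba3/
            since_last_deviant = 0
        else:
            seq.append("standard")           # e.g., Mandarin Tone 1 syllable /ba1/
            since_last_deviant += 1
    return seq

trials = oddball_sequence()
print(trials[:20], "deviant rate:", trials.count("deviant") / len(trials))

The listener simply hears the resulting sound stream while the EEG is recorded, and the mismatch response to the deviants is extracted offline, which is why no overt response is required.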

Recently, Eigsti and Fein (2013) compared the pitch discrimination sensitivity of teenagers with optimal outcomes, teenagers with HFA, and age-matched TD controls, and found that superior pitch perception was correlated with ASD symptomatology. Specifically, teenagers who were diagnosed with ASD before 5 years of age but no longer showed any autistic symptoms at the time of testing performed comparably to the controls, whereas teenagers who retained the ASD diagnosis continued to demonstrate heightened pitch discrimination. The heterogeneous nature of this disorder calls for further studies that examine the features of subgroups on the spectrum.

As the world becomes increasingly multilingual, the proportion of bilingual individuals with ASD will also increase. Language development in bilingual children who are exposed to two languages from birth follows a different developmental trajectory starting from the first year of life (Bosch & Sebastián-Gallés, 2003; Shafer et al., 2012), and there is evidence of both positive and negative transfer between first- and second-language phonologies (Hambly, Wren, McLeod, & Roulstone, 2013). A wider pitch range and higher average pitch have been found in English-speaking, German-speaking, and Mandarin-speaking monolingual individuals with ASD, as well as in Hindi–English bilinguals with ASD, whereas the opposite pattern was observed in Japanese-speaking children with ASD (Baltaxe & Simmons, 1985; Chan & To, 2016; Green & Tobin, 2009; Nakai et al., 2014; Sharda et al., 2010). It will be of theoretical and clinical interest to compare the pitch perception and lexical tone production of bilingual children with those of monolingual Chinese-speaking individuals, as in Chan and To (2016). The different language backgrounds of bilingual children with ASD (e.g., two tonal languages vs. one tonal and one non-tonal language; balanced bilinguals vs. bilinguals with one dominant language) would presumably lead to different hypotheses regarding the association between pitch perception, music perception, and lexical tone production.

Stimulus complexity and task demands are both known to influence auditory processing. An acoustically less salient lexical tone contrast, such as Tone 2 versus Tone 3, can be more challenging than a more salient contrast, such as Tone 1 versus Tone 3, for non-native listeners (Chandrasekaran, Krishnan, & Gandour, 2007; Yu et al., 2017). Tone sandhi (e.g., when two third tones occur in a row, the first is produced as a second tone) further complicates the production and perception of the Tone 2 versus Tone 3 contrast (see Chap. 7 of this book by Chang & Kuo). Speech versus non-speech comparisons are routinely used to dissociate domain-general from language-specific auditory processing, and the lexical status of the stimuli (e.g., word vs. nonword) appears to play a subtle role in the nature of the cortical response to such contrasts in individuals with ASD (Wang et al., 2017; Yu et al., 2015). Stimulus complexity often interacts with attention and/or memory demands in children with ASD and other learning disorders (Čeponienė et al., 2003; Whitehouse & Bishop, 2008). Systematic investigation using carefully controlled tasks and varying levels of stimulus complexity will have important implications for theory and clinical application.
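
As a concrete illustration of the third-tone sandhi rule mentioned above, the following is a deliberately simplified, hypothetical sketch that applies the pairwise rule to a sequence of tone numbers; real sandhi application additionally depends on prosodic and syntactic grouping, which this sketch ignores.

def apply_third_tone_sandhi(tones):
    """Apply the basic pairwise Mandarin third-tone sandhi rule:
    when two third tones occur in a row, the first surfaces as a second tone.
    (Real sandhi also depends on prosodic/syntactic grouping, ignored here.)"""
    out = list(tones)
    for i in range(len(out) - 1):
        if out[i] == 3 and out[i + 1] == 3:
            out[i] = 2
    return out

# "ni3 hao3" (hello) is produced as "ni2 hao3"
print(apply_third_tone_sandhi([3, 3]))   # -> [2, 3]

The practical consequence for perception experiments is that an underlying Tone 3 can surface with a Tone 2-like contour, blurring the very contrast (Tone 2 vs. Tone 3) that is already the least acoustically salient.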

6.4 Theoretical Implications

Higher perception accuracy, faster response times, and larger brain responses to relatively simple, domain-general perceptual contrasts in individuals with ASD than in controls have been taken as evidence for enhanced lower-level perceptual processing (Mottron, Dawson, Soulières, Hubert, & Burack, 2006; Mottron et al., 2013). Conversely, the less robust brain responses and less accurate behavioral responses to domain-specific complex stimuli, such as vocal pitch and phonemic/semantic discrimination, observed in some individuals with ASD have been taken as evidence for deficits in information integration for global processing or in higher-level linguistic processing (Cantiani et al., 2016). The neural complexity hypothesis suggested that superior performance for simple tone processing in the primary auditory cortex, together with impaired complex perceptual performance in the association cortex, is autism-specific (Bertone, Mottron, Jelenic, & Faubert, 2005). Happé and Frith (2006) proposed the weak central coherence (WCC) theory to describe the detail-focused processing bias in individuals with ASD: a bias toward local, detail-focused, lower-level information over global, meaning-focused, higher-level information that may negatively affect higher-level integrative processing. Whether such a bias actually has this impact still needs to be determined empirically. As mentioned above, a deficit in theory of mind (ToM) is a hallmark of ASD. A meta-analysis of ToM development showed a developmental timing difference of two years or more in false-belief performance between Chinese-speaking children and children in North American cultures (Liu, Wellman, Tardif, & Sabbagh, 2008). This timing difference has been taken to suggest that learning a tonal language such as Cantonese or Mandarin, or growing up in Chinese culture, may enhance the development of ToM. However, there is no direct evidence on whether learning a tonal language enhances the development of ToM in children with ASD; future studies need to test ToM in Chinese-speaking individuals with ASD directly.

The three neurophysiological studies on Mandarin-speaking children with ASD are consistent with the WCC theory mentioned above. You and colleagues proposed that, unlike children with other common developmental disorders, such as dyslexia and developmental language disorder (DLD), children with ASD do not have a categorical perception (CP) deficit but instead have a categorical precision (CPR) deficit; CPR concerns the perception of allophonic (within-category) differences (You, Serniclaes, Rider, & Chabane, 2017). This proposal was based on the evidence that children with ASD made more categorical judgments than TD controls did on a natural vowel continuum, yet did not show categorical perception deficits for vowels and consonants. One important issue here is whether this vowel versus consonant difference in categorical precision extends to lexical tone versus consonant differences, since lexical tone is largely superimposed on the vowel portion of the syllable. You et al. (2017) used a four-parameter logistic model, including the perceptual boundary, slope, and asymptotes of the identification function, to assess categorical precision. Because of differences in analysis, it is unclear whether the children with ASD in Chen et al. (2016) also had a CPR deficit for lexical tone processing. It is clear, however, that children with and without ASD were both near chance level when discriminating within-category lexical tone contrasts (Fig. 4 in Chen et al., 2016), and that the children with ASD had shallower identification slopes, suggesting continuous rather than categorical perception.
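
To illustrate the kind of analysis described above, the following is a minimal sketch of fitting a four-parameter logistic function (lower and upper asymptotes, slope, and category boundary) to identification data from a hypothetical seven-step Tone 2–Tone 3 continuum; it assumes SciPy is available and uses made-up response proportions, and the exact formulation and fitting procedure in You et al. (2017) and Chen et al. (2016) may well differ.

import numpy as np
from scipy.optimize import curve_fit

def logistic4(x, lower, upper, slope, boundary):
    """Four-parameter logistic identification function:
    lower/upper asymptotes, slope, and category boundary (inflection point)."""
    return lower + (upper - lower) / (1.0 + np.exp(-slope * (x - boundary)))

# Hypothetical proportions of "Tone 2" responses along a 7-step continuum
steps = np.arange(1, 8)
p_tone2 = np.array([0.97, 0.95, 0.88, 0.55, 0.15, 0.06, 0.04])

params, _ = curve_fit(
    logistic4, steps, p_tone2,
    p0=[0.0, 1.0, -1.0, 4.0],   # initial guesses: asymptotes, slope, boundary
)
lower, upper, slope, boundary = params
print(f"boundary at step {boundary:.2f}, slope {slope:.2f}, "
      f"asymptotes ({lower:.2f}, {upper:.2f})")

In this framework, a flatter (smaller-magnitude) fitted slope corresponds to less categorical, more continuous perception, which is the pattern reported for the children with ASD in Chen et al. (2016).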

Recently, there has been a lively discussion of spectral versus temporal auditory processing in the ASD literature (e.g., Huang et al., 2017; Kasai et al., 2005; Lambrechts et al., 2017; Lepistö et al., 2006; see Haesen et al., 2011 for a review). Spectral information is generally processed in the right hemisphere, whereas the rapid temporal dynamics relevant to speech processing are more dominant in the left hemisphere (Zatorre, Evans, Meyer, & Gjedde, 1992). However, the acoustic properties of auditory cues interact with the function of those cues in complex ways. As Zatorre and Gandour (2008) pointed out, the right hemisphere dominance for pitch processing may shift to the left hemisphere when the pitch is linguistically relevant, as in lexical tone. Indeed, Mandarin speakers have shown larger left hemisphere responses to between-category lexical tone contrasts and marginally larger right hemisphere responses to within-category lexical tone contrasts (e.g., Xi, Zhang, Shu, Zhang, & Li, 2010).

Frequency differences are the primary cues distinguishing the four lexical tones in Mandarin; crucially, these are differences in the fundamental frequency (the carrier), rather than the spectral differences that distinguish vowels and consonants. There are also intrinsic durational differences among the four lexical tones: Tone 3 is consistently longer than Tone 2, while Tone 4 has the shortest duration of the four (Lin, 1965; Shen, 1990). Future studies should examine how these intrinsic durational differences among lexical tone categories influence the developmental patterns of speech perception and production in individuals with ASD. Comparing the neural responses to spectral versus durational cues for lexical tone categories would provide further evidence about the neural specialization underlying atypical lexical tone processing in Chinese-speaking individuals with ASD.

We currently have no data on neuroplasticity in response to intervention, even though children diagnosed with ASD usually receive language and behavioral interventions. Two days of perceptual training on Thai lexical tones altered the neural responses to the trained tones in native speakers of a tonal language (Mandarin Chinese) and of a non-tonal language (English) (Kaan et al., 2008). Furthermore, research shows an association between the extent of brain activation and the degree of lexical tone learning in speech and word training paradigms (Wong, Perrachione, & Parrish, 2007). Novel neural stimulation methods, such as the HD-tDCS method presented in the pilot study above, can also provide a valuable way to investigate intervention-related neuroplasticity in Chinese-speaking individuals with ASD.

7 Conclusion

The behavioral and electrophysiological evidence accumulated thus far on ASD in Chinese/tonal languages suggests that both brain activation patterns and behavioral measures of lexical tone perception are atypical in Chinese-speaking individuals with ASD, and that these individuals also show atypical pitch production similar to that of non-tonal language speakers with ASD. Future studies need to systematically examine the effects of stimulus characteristics, the relationship between lexical tone and music processing, and brain plasticity in response to lexical tone learning in order to fill the substantial gaps in the literature on Chinese-speaking individuals with ASD.