Natural fast speech is perceived as faster than linearly time-compressed speech

Reinisch, Eva

doi:10.3758/s13414-016-1067-x

Published: 09 February 2016

Volume 78, pages 1203–1217, (2016)
Attention, Perception, & Psychophysics

Eva Reinisch¹

Abstract

Listeners compensate for variation in speaking rate: In a fast context, a given sound is interpreted as longer than in a slow context. Experimental rate manipulations have been achieved either through linear compression or by using natural fast speech. However, in natural fast speech, segments are subject to processes such as reduction or deletion. If speaking rate is then defined as the number of segments per unit time, the question arises as to what impact such processes have on listeners’ normalization for speaking rate. The present study tested the effect of sentence duration and fast-speech processes on rate normalization for a German vowel duration contrast. Results showed that a naturally produced short sentence containing segmental reductions and deletions led to the most “long” vowel responses, whereas the long sentence with clearly articulated segments led to the fewest. This suggests that speaking rate is not merely calculated as the number of segments realized per unit time. Rather, listeners associate properties of natural fast speech with a higher speaking rate. This contrasts with earlier results and a second experiment in which perceived speaking rate was measured in an explicit task. Models of speech comprehension are evaluated with regard to the present findings.

In order to understand spoken language listeners have to overcome large amounts of variability in the speech signal. One of the most prominent sources of variation is speaking rate, because it may vary considerably even within the same speaker (Miller, Grosjean, & Lomanto, 1984; Quené, 2008). Speaking rate is often defined as articulation rate (i.e., excluding pauses; Crystal & House, 1990; throughout this paper, the term speaking rate will be used as a synonym to articulation rate) and measured in realized segments or syllables per unit time (e.g., per second). Taking this definition, a larger number of realized segments per unit time indicates a faster rate (Koreman, 2006). However, natural variation in speaking rate may be confounded with variation in clarity of articulation. Formal speech tends to be slow, with most or all intended segments clearly realized. Natural fast speech, however, often leads to articulatory undershoot (Lindblom, 1963, 1990; see below) and may include segmental reductions and deletions (see, e.g., Ernestus, 2014, for a recent overview). If, then, speaking rate is defined as the number of segments and syllables per unit time, the question arises as to what impact such fast-speech processes have on listeners’ perception of speaking rate. How do segments that are produced noncanonically or even deleted contribute to such a segment count? Addressing this question is important, as experimental manipulations demonstrating how listeners cope with variability in speaking rate typically use one of two types of rate manipulation: (1) linear compression of normal-rate speech that keeps all segmental properties in place and shortens every segment to the same extent, and (2) natural fast speech that may be subject to fast-speech processes such as reductions and deletions. The present study addresses the possible impact of such fast-speech processes on rate normalization and compares their effect to normalization for linearly compressed fast speech.

Normalization for speaking rate means that listeners take into account that at a fast rate all segments shorten to some extent (Crystal & House, 1982, 1988) and compensate for this. Through this process, they deal with the large variability in speaking rate during speech perception. That is, following a fast context sentence, a given sound is interpreted as longer than when it follows a slow context sentence (e.g., Ainsworth, 1972, 1974; Allen & Miller, 2001; Dilley & Pitt, 2010; Kidd, 1989; Miller, 1981, 1987; Miller & Dexter, 1988; Newman & Sawusch, 2009; Reinisch, Jesse, & McQueen, 2011; Reinisch & Sjerps, 2013; Summerfield, 1981). For example, the word-initial stop voicing contrast in English (e.g., /g/ vs. /k/) is mainly cued by temporal properties, namely duration of voice onset time (VOT). When a stop such as /g/ or /k/ is preceded by a fast carrier sentence, listeners report hearing /k/ (long VOT) more often than /g/ (short VOT). Following a slow sentence, more /g/ responses are reported (e.g., Newman & Sawusch, 2009). As a result, context information influences whether listeners hear words, for instance, as goat or coat. This information can aid speech perception, especially in the case of ambiguous phonemes (e.g., Newman & Sawusch, 2009; Reinisch et al. 2011; Sawusch & Newman, 2000) and even when other potential cues are available to the listener (Reinisch & Sjerps, 2013).

Studies of rate normalization in phonetic categorization have largely used one of two different methods to implement the rate manipulation: either a speaker is recorded at his or her natural rate and the sentence is manipulated by linear compression or expansion such as through PSOLA (e.g., Dilley & Pitt, 2010; Reinisch et al. 2011; Reinisch & Sjerps, 2013), or the speaker is asked to produce a given sentence at normal, fast, and slow rates (e.g., Kidd, 1989; Newman & Sawusch, 2009). Although both methods of obtaining stimuli at different overall durations have produced reliable effects of speaking rate context on phonetic categorization, little is known about possible differences in the magnitude of these effects. Differences could be expected if the sentences that were naturally spoken at fast versus slow rates differed in the presence of natural fast-speech processes; that is, if the sentence that has been spoken fast contained segmental reductions and deletions but the sentence spoken at a normal rate (that then would be linearly compressed) did not contain these processes (usually little information is given on the segmental properties of these fast vs. slow sentences).

Differences in the perception between natural fast speech and artificially compressed fast speech have been reported with regard to intelligibility (Janse, Nooteboom, & Quené, 2003). In natural fast speech, not all segments are compressed equally (Gay, 1978; Janse et al., 2003). Vowels tend to shorten relatively more than consonants, and unstressed syllables get shortened to a greater extent or are more likely to be deleted than stressed syllables. Janse et al. (2003) tested whether the perceptual system would be specifically tuned to this natural variation such that natural fast speech or speech that mimics the temporal relations of natural fast speech would be more intelligible than linearly compressed fast speech that had been spoken at a normal rate. However, this is not what they found. Rather, linearly compressed fast speech appeared to be most intelligible. Moreover, Adank and Janse (2009) showed that although listeners were able to adapt to and improve perception of linearly compressed speech as well as natural fast speech, transfer of improved understanding occurred only from linearly compressed speech to natural fast speech but not the other way around. Both studies hence suggest that listening to linearly compressed speech where all segments are realized as in normal-rate speech is “easier” than listening to natural fast speech. While these previous studies were mostly concerned with the timing of segments, the present study is additionally concerned with the number of realized segments per unit time.

Segmental reductions and deletions tend to occur even in moderately fast speech (Ernestus, 2014; Pluymaekers, Ernestus, & Baayen, 2005). While segmental deletion means that a segment is completely absent, segmental reduction means that a segment is realized, but not fully. An example here would be that a vowel is produced slightly centralized (i.e., somewhat more schwa-like). The occurrence of these two types of processes is correlated so that speech with more reductions also contains more deletions. Sometimes it is even difficult to distinguish the two: Browman and Goldstein (1990) provided an example of the phrase “perfect memory,” in which there is no audible trace of the /t/, but articulatory measures show a brief alveolar contact (which is, however, only released after the labial closure). From the point of view of the listener, this would be a deletion of the /t/, even though from an articulatory point of view it was “only” a reduction. That is, deletions are typically accompanied by segmental reductions, and sentences spoken at a very fast rate tend to also contain deletions.^{Footnote 1} Because, therefore, their exact roles are hard to disentangle, the present study will consider them as a combination of common speech processes in natural fast speech (see Vitela, Warner, & Lotto, 2013, for an attempt at assessing the role of segmental reductions alone). Importantly, the number of realized segments per unit time has been directly linked to the calculation of speaking rate, hence any differences between the normalization of clearly articulated, linearly time-compressed fast speech and natural fast speech would force us to reconsider how speaking rate is being calculated. Consider the phrase He probably said . . ., in which the word probably can be produced as the canonical three-syllable word or as the two-syllable form prob’ly.
Assuming an overall duration of about 1,200 ms for this phrase (which constitutes a typical carrier sentence), the two-syllable form would result in a decrease in speaking rate from roughly 11 to 9 segments, or from four to three syllables, per second. That is, articulatory processes in natural fast speech may affect the magnitude of normalization for speaking rate relative to fast speech that has been created by linear compression, where all segments are realized as they would be at a normal or slow speaking rate.
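The rate arithmetic behind this example can be made explicit. The following is a minimal sketch, assuming the counts given for the phrase elsewhere in this paper (13 segments and five syllables for the full form, 11 segments and four syllables for the form with prob’ly) and the 1,200 ms duration assumed above; the function and variable names are illustrative only.

```python
def rate_per_second(n_units: int, duration_ms: float) -> float:
    """Return a rate in units (segments or syllables) per second."""
    return n_units / (duration_ms / 1000.0)


DURATION_MS = 1200  # assumed overall phrase duration, as in the text

# Counts for "He probably said" (full form) vs. "He prob'ly said" (reduced form).
FORMS = {
    "full": {"segments": 13, "syllables": 5},
    "reduced": {"segments": 11, "syllables": 4},
}

for form, counts in FORMS.items():
    seg_rate = rate_per_second(counts["segments"], DURATION_MS)
    syl_rate = rate_per_second(counts["syllables"], DURATION_MS)
    print(f"{form}: {seg_rate:.1f} segments/s, {syl_rate:.1f} syllables/s")
```

At an identical overall duration, the reduced form therefore has the lower realized rate on both measures, which is the starting point for the two scenarios considered below.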

The present study follows Koreman (2006), who first investigated the impact of casual speech and the number of realized segments per unit time with regard to the explicit perception of speech tempo in two rate judgment tasks. Koreman selected a variety of carefully versus sloppily pronounced sentences from the Kiel Corpus of German spontaneous speech and paired them according to their intended speaking rate (i.e., as if all segments were fully realized) or their realized speaking rate (i.e., number of realized segments per second, where deletions equal fewer realized segments; note that spectral and temporal reductions in segments were not the topic of the specific comparisons). Participants were asked to judge which of two sentences sounded faster and to rate the speaking rate in both sentences on a sliding scale, from too fast to too slow. Results showed that both intended and realized articulation rate correlated with listeners’ explicit rate comparison judgments. In keeping with the example given above, the phrase He probably said . . . sounded faster when all five syllables and thirteen segments were realized than when it had the same overall duration but fewer syllables and segments realized (e.g., He prob’ly said . . .; only four syllables and eleven segments realized; effect of realized rate). However, the version with deletions still sounded faster than other utterances of the same duration involving the same number (11) of segments/syllables that were instead all realized (i.e., He always says . . .; effect of intended rate). In summary, segmental deletions influence the perception of speaking rate in an explicit task such that fewer realized segments as well as fewer intended segments are taken as a sign of a slower rate.

The present study focuses on the influence of fast-speech processes in natural fast speech on the implicit perception of speaking rate. To test whether and how natural fast speech versus linearly time-compressed speech affect the speech perception process, the present study uses the well-established effect of normalization for speaking rate in phonetic categorization. This task may be called “implicit,” as participants will not be asked explicitly how fast they perceive an utterance to be. Rather, it will be tested whether the speaking rate of an utterance influences the perception of the following stimulus. The goal is to further inform psycholinguistic models of speech perception and to test whether perceived speaking rate depends on fast-speech processes, as this may impact the widely held assumption that speaking rate is calculated as the number of segments per unit time in implicit perception.

Two fundamentally different types of psycholinguistic models have been proposed to account for how listeners access mental representations of sounds and words. Simplifying for the sake of the argument, abstractionist models (e.g., Norris, 1994; Norris & McQueen, 2008) assume that listeners store one representation – typically, a canonical form of each word. Any variation due to speaker or speaking rate has to be “abstracted” prelexically to access word meaning. Note that this is the common way to describe listeners’ reactions to variation in speaking rate. Notions such as “rate normalization,” “compensation for speaking rate,” and “coping with variability” suggest that speaking rate changes are a problem for the listener that has to be resolved before accessing the lexical representations. However, although “normalization” is the classical way of thinking of how to link the variable speech input to lexical representations, rate normalization can also be explained in terms of models with multiple representations for a given word.

Exemplar models (e.g., Goldinger, 1998) assume storage of each variant of a word, including, for example forms with deletions, items produced by different speakers, and at different speaking rates. Word forms are accessed by directly comparing the acoustic input to stored representations. That is, variability due to speaking rate does not have to be abstracted, but words spoken at a fast rate are mapped onto fast exemplars and tokens spoken at a slow rate are mapped onto slow exemplars (exemplars are “labeled” to have occurred in fast vs. slow speech; Pierrehumbert, 2001). Speaking rate is then assessed via the labels for rate of the best matching exemplars. Although it is commonly accepted that in their extreme assumptions neither fully abstractionist nor exemplar models can account for the majority of findings in the psycholinguistic literature, the nature of the best hybrid model has yet to be established (see, e.g., Ernestus, 2014, for a discussion with regard to processing reductions in casual speech).

Notably, a third class of models has recently been suggested to describe how listeners flexibly adapt to certain properties of the speech signal. These probabilistic models of speech perception, such as the belief-updating model of perceptual adaptation (Kleinschmidt & Jaeger, 2015) state that listeners track consistencies in the speech signal for a given situation, for instance, a specific speaker's idiosyncratic pronunciation of certain sounds, and create models of cue distributions for this given situation. These specifically adapted models of cue distributions will be reapplied in perception when the situation or the speaker is recognized again, hence facilitating perception and providing new starting points for further adaptation. Although the time scale of this adaptation remains widely unspecified, contextual speaking rate is likely a signal property for which cue distributions have been established (see, e.g., Reinisch, 2015; Sjerps & Reinisch, 2015).

Whether or not speech processes that are common in natural fast speech affect the implicit processing of speaking rate will help to further evaluate the properties of the different speech perception models. Importantly, the present study will contribute toward resolving the question of whether speaking rate is indeed sufficiently explained by calculating the number of segments per unit time. A priori, two scenarios seem likely for how the processing of natural fast speech that typically includes segmental reductions and deletions may differ from the processing of linearly compressed normal-rate speech.

Under a first scenario, given the same sentence (i.e., the same intended words) and an identical overall duration, natural fast speech is perceived as slower than linearly compressed fast speech. This is because natural fast speech tends to contain segmental reductions and deletions. In contrast, linearly compressed speech tends to have all segments realized. If speaking rate were calculated from the number of realized segments per unit time or the relative speed of the articulators, we would expect the perception of slower speech tempo for a naturally produced fast than a linearly compressed normal-rate sentence. This scenario would be in line with the results for explicit speech perception (Koreman, 2006), which show that a higher number of realized segments leads to a higher perceived speaking rate. In terms of abstractionist speech perception models, it would suggest that speaking rate is calculated before abstraction during processing occurs. If rate were calculated only after abstraction, no difference between the natural fast and linearly compressed sentence would be expected. In terms of exemplar models, listeners would calculate rate by matching the signal onto stored exemplars, either full forms or forms that include reductions and deletions. Short forms and forms with a higher number of segments would be “labeled” as fast. Long forms and forms with fewer or less clearly articulated segments would be “labeled” slow. Probabilistic models of speech perception could also account for such an outcome. The belief updating model would state that a higher number of segments per unit time or faster perceived articulator speed would be directly associated with higher speaking rate and hence support the use of cue distributions for fast speech. These types of association would also be in line with accounts arguing for top-down prediction as a driving factor in perception (Clark, 2013; Farmer, Brown, & Tanenhaus, 2013, commentary).

Under a second scenario, given the same sentence (i.e., the same intended words) and an identical overall duration, natural fast speech that includes reductions and deletions is perceived as faster than linearly time-compressed fast speech with all segments present. This scenario would go against the traditional view of speaking rate being calculated by the number of realized segments/syllables per unit time. Abstractionist models would have a hard time explaining this outcome. However, it would be in line with exemplar models of speech perception in which stored fast sentences tend to include reductions and deletions. That is, not only shorter forms would be labeled fast but also forms that include properties that are typically found in fast speech, such as segmental reductions and deletions. Specifically, the information “short” and “includes fast-speech processes” would both contribute to the perceived speech tempo. In other words, Scenario 2 could be explained through listeners’ “knowledge” or association that segmental deletions tend to occur in fast speech. Probabilistic models of speech perception or accounts arguing for top-down prediction could also account for such an outcome. In fact, such an outcome would strongly favor the involvement of top-down knowledge. This point will be taken up in the General Discussion.

The present study set out to test whether and how processes in natural fast speech contribute to the perception of speaking rate. As laid out above, participants were not asked explicitly how fast they thought a sentence was. Instead, the perceived speaking rate was measured indirectly by asking participants to categorize a duration-based phonological contrast. Specifically, listeners categorized a German /a/–/a:/ duration continuum. In German, the /a/–/a:/ vowel contrast is described as a real duration contrast without consistent co-variation of spectral properties (Jessen, 1993; Pätzold & Simpson, 1997).^{Footnote 2} The /a/–/a:/ continuum appeared in a German minimal word pair at the end of a rate-manipulated carrier sentence. The faster this carrier sentence is perceived, the more “long” (/a:/) answers should be given. To be better able to judge the effect of speech processes typical to natural fast speech against an expected effect of rate due to overall context duration, the concept of speaking rate was split into two subcomponents: sentence duration (long/short) and fast-speech processes (present/absent). Sentence duration was defined such that “long” matched the duration of a naturally spoken normal-rate sentence and “short” matched a naturally spoken fast sentence including typical fast-speech processes such as reductions and deletions. That is, the first version of the sentence had a long overall duration with all segments articulated as is typical for natural normal-rate speech (“long/absent“ condition). In the second version, the same sentence was at a short overall duration including reductions and deletions as produced in natural fast speech (“short/present“ condition). That is, it was shorter, and contained fast-speech processes leading to a lower number of segments than the first version. For each of these sentences an additional version was created by artificial rate manipulation (i.e., linear compression and expansion). 
As a third version, the normal-rate sentence was linearly compressed to the short overall duration of the sentence that had been spoken fast (“short/absent” condition). This version was fast and had a higher articulation rate than the naturally spoken fast sentence, since more segments were realized in the same amount of time (and spoken more clearly). Finally, as a fourth version, the sentence that was spoken fast was linearly expanded to the overall duration of the long sentence (i.e., the normal-rate sentence; “long/present” condition). This sentence had the lowest articulation rate of all four conditions. All of these four conditions, except for the long one that was an expanded version of the naturally spoken fast sentence, can be found in studies on normalization for speaking rate in phonetic categorization. The fourth condition was included to complete the factorial design, as it is conceivable that speech processes found in natural fast speech may be perceived differently in fast speech (where they can be expected) than in slow speech, where they are unusual.

More specifically, the question was whether the different combinations of a long and short duration and the presence versus absence of fast-speech processes would result in rate effects of different magnitude as instantiated in shifts of the /a/–/a:/ category boundary. The faster a condition is perceived to be, the more /a:/ responses are expected. In a second experiment, the same sentences were subjected to an explicit rate comparison task similar to Koreman (2006) to allow for a better comparison of “implicit” effects of speaking rate on phonetic categorization and explicit judgments of perceived speaking rate.

Experiment 1

Method

Participants

Twelve native speakers of German from the student population of the University of Munich took part for a small monetary compensation. They reported no language, speech, or hearing impairment.

Materials

The German words bannen and bahnen (“to banish”–“to channel”) that differ minimally in the /a/–/a:/ vowel duration contrast served as targets. Both words were recorded by a female native speaker of German at the end of the carrier sentence Sie vermied in ihrem Text den Begriff {TARGET} (literal translation: “She avoided in her text the term {TARGET}”). The sentence was recorded multiple times at a “normal” speaking rate in a clear speech style, as well as fast with the possibility for segmental deletions (for details about the carrier, see below). One token of the target word bahnen (i.e., containing the long vowel; 161 ms) was selected from the normal-rate items and excised from the sentence for further manipulation.

Using this selected token of bahnen, an /a/–/a:/ vowel continuum was created by manipulating the duration tier in PRAAT (Boersma & Weenink, 2009) and subsequent PSOLA resynthesis. Pretests on similar minimal word pair continua in a different study showed that vowel durations of 51 and 146 ms suffice for clear /a/ and /a:/ continuum endpoints, respectively. This duration range was split into 13 steps with a step size of 7.3 ms. Nine of these were chosen for the present experiment (i.e., steps 1, 3, 5, 6, 7, 8, 9, 11, and 13). The other four were dropped to reduce the overall number of steps and to allow for a larger number of repetitions per point. To avoid any influence of segment durations other than the vowel biasing listeners towards the base word bahnen, all segment durations were set to an average duration between the manipulated word’s segments and a reference token of bannen (also selected from the normal-rate productions).

For the context manipulation, two tokens of the carrier sentence (Sie vermied in ihrem Text den Begriff {TARGET}) were selected: one spoken at a “normal” rate (long sentence, fast-speech processes absent), and one spoken fast with several segments reduced and deleted (short sentence, fast-speech processes present). Two trained phoneticians used broad IPA transcription to assess the number of segments produced in the chosen carrier sentences. Transcriptions were based on listening as well as visual inspection of the signals (spectrogram, oscillogram). The two transcribers agreed that in the clear version, 26 segments were realized (i.e., [zi: fɐmi:t ɪn irəm tɛks dəm begrɪf]), while in the version including deletions, only 20 segments were realized (i.e., [sɪ fəmit n iəm tɛks tm gɪf]). Note that in sequences like Text dem the final /t/ in Text did not show a separate release in either version. Given the same closure duration in the natural fast and linearly time-compressed fast version (approximately 30 ms) the same number of segments was counted in both versions (here: one). Appendix A lists all segments of the two carrier sentences, including segment durations, and values of the first and second formants of all vowels in Hertz and Bark to assess spectral reductions in addition to the difference in number of realized segments. Because the spectral values showed a slight tendency toward centralization of the vowels in the naturally spoken fast sentence, it was further established that the long-term average spectra of the two versions of the sentence did not show any consistent, long-lasting differences above and below the first two formants. This precluded the impact of spectral contrast effects as discussed in Vitela et al. (2013). 
Most important for the present study was that duration is the main cue to the German /a/–/a:/ contrast and that the naturally produced fast sentence had undergone processes typical of naturally produced fast speech such that it contained fewer segments than the naturally produced normal-rate sentence.
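The segment counts reported for the two carrier sentences can be checked directly against the broad transcriptions. The sketch below is only an illustration and not the transcribers' procedure: it assumes that every IPA symbol in these two transcriptions stands for one segment and that the length mark is a modifier rather than a segment of its own. That simplification holds for these strings but would not hold for transcriptions containing affricates, diphthongs, or diacritics.

```python
# Length marks modify the preceding vowel and are not counted as segments.
LENGTH_MARKS = {":", "ː"}


def count_segments(transcription: str) -> int:
    """Count symbols in a broad transcription, skipping spaces and length marks."""
    return sum(
        1
        for symbol in transcription
        if not symbol.isspace() and symbol not in LENGTH_MARKS
    )


# Transcriptions as reported for the clear and the fast carrier sentence.
clear_version = "zi: fɐmi:t ɪn irəm tɛks dəm begrɪf"
fast_version = "sɪ fəmit n iəm tɛks tm gɪf"

print(count_segments(clear_version), count_segments(fast_version))
```

Under these assumptions the counts agree with the 26 and 20 realized segments reported by the two transcribers.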

Two additional tokens of the carrier sentence were created to construct a fully crossed design. To disentangle the presence or absence of natural fast-speech processes from speaking rate as instantiated by overall sentence duration (or the time within which segments could be counted) the concept of speaking rate will henceforth be referred to as two components: sentence duration and presence versus absence of fast-speech processes. That is, the factors Duration (long/short) and fast-speech Processes (present/absent) were fully crossed. Using PSOLA resynthesis, the original short/present sentence was expanded to the overall duration of the original normal-rate sentence, and the normal-rate sentence was (linearly) compressed to the overall duration of the naturally spoken fast sentence. Figure 1 illustrates this manipulation. In order to account for differences in artifacts due to the expansion and compression, the original sentences were also resynthesized (speeded up and slowed down again) such that all four versions of the sentence had undergone manipulation.
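The fully crossed design and the resulting ordering of articulation rates can be sketched as follows. The segment counts (26 and 20) are those reported for the two carrier sentences; the two sentence durations are hypothetical placeholders chosen only for illustration, since the measured durations are listed in Appendix A and not repeated here. The ordering of the fastest and slowest condition does not depend on the placeholder values as long as the long sentence is longer than the short one.

```python
from itertools import product

# Realized segments by level of the factor Processes (values from the transcriptions).
SEGMENTS = {"absent": 26, "present": 20}

# HYPOTHETICAL overall durations in seconds; replace with the measured values.
DURATION_S = {"long": 2.0, "short": 1.4}

# Articulation rate (realized segments per second) for each of the four cells.
articulation_rate = {
    (duration, processes): SEGMENTS[processes] / DURATION_S[duration]
    for duration, processes in product(DURATION_S, SEGMENTS)
}

# Conditions ordered from highest to lowest realized articulation rate.
ranked = sorted(articulation_rate, key=articulation_rate.get, reverse=True)
for condition in ranked:
    print(condition, round(articulation_rate[condition], 1))
```

In terms of realized segments per second, the compressed normal-rate sentence (short/absent) comes out fastest and the expanded fast sentence (long/present) slowest, which is the ordering a purely segment-counting account of speaking rate would predict.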

Design and procedure

The four versions of the carrier sentence were combined with all nine selected steps of the bannen–bahnen continuum, resulting in a total of 36 sentences. Participants were seated in a sound-attenuated room and performed a phonetic categorization task. They listened to the sentences via headphones and indicated whether the last word in the sentence was bannen or bahnen by pressing the number keys 1 and 0 on a computer keyboard. Response options were displayed on the computer screen throughout the experiment with the layout of the words on the screen matching the sides of the response keys. Word-key assignments were counterbalanced across participants. The next trial started 700 ms after the participant’s response. Each participant received a total of 252 trials, that is, 7 repetitions of each carrier-target combination. Every 63 trials, participants were allowed to take a break. The experiment was controlled by ePrime software (Psychology Software Tools, Inc.) and took approximately 25 minutes to complete.
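The trial structure described above can be reconstructed in a few lines. This is a sketch under stated assumptions, not the ePrime script used in the experiment: the text does not specify how trials were ordered, so the full shuffle and the fixed seed below are assumptions, as are all names.

```python
import random
from itertools import product

CARRIERS = [
    ("long", "absent"),
    ("long", "present"),
    ("short", "absent"),
    ("short", "present"),
]
CONTINUUM_STEPS = [1, 3, 5, 6, 7, 8, 9, 11, 13]  # the nine selected steps
REPETITIONS = 7
BREAK_EVERY = 63


def build_trials(seed: int = 0) -> list:
    """Return all carrier-by-step combinations, repeated and shuffled."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    trials = list(product(CARRIERS, CONTINUUM_STEPS)) * REPETITIONS
    rng.shuffle(trials)  # ASSUMPTION: the randomization scheme is not stated
    return trials


trials = build_trials()
# Trial indices (0-based) before which a break would be offered.
break_points = list(range(BREAK_EVERY, len(trials), BREAK_EVERY))
print(len(trials), break_points)
```

The 4 × 9 combinations give the 36 sentences, and seven repetitions give the 252 trials, split into four blocks of 63.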

Results

Results were analyzed using a mixed-effects model with the dichotomous dependent variable /a:/ responses (i.e., response /a:/ coded as 1, response /a/ coded as 0) and participant as a random factor, including random slopes for all within-participant fixed factors (see Barr, Levy, Scheepers, & Tily, 2013). A logistic link function was used. The model included three fixed factors and their interactions: Sentence Duration (short coded as −0.5, long coded as 0.5), fast-speech Processes (present coded as −0.5, absent coded as 0.5), and Continuum Step (centered on 0). All factors were contrast coded such that effects could be interpreted as main effects. The two outermost steps on both sides of the continuum were classified correctly with close to ceiling performance (>98 % and <2 % /a:/ responses), which shows that acoustically unambiguous steps are unlikely to be affected by acoustic context. Only responses to the middle five steps of the continuum were therefore analyzed. Figure 2 shows the categorization responses along the /a/–/a:/ continuum in the four conditions. Figure 3 aggregates the effects over the five middle continuum steps.
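The predictor coding described above can be illustrated with a short sketch. The ±0.5 codes are those given in the text. How Continuum Step was centered is not spelled out, so subtracting the mean step number of the five analyzed steps is an assumption, and the helper name is hypothetical.

```python
# Contrast codes as stated in the text.
DURATION_CODE = {"short": -0.5, "long": 0.5}
PROCESSES_CODE = {"present": -0.5, "absent": 0.5}

SELECTED_STEPS = [1, 3, 5, 6, 7, 8, 9, 11, 13]

# Drop the two outermost steps on each side; five middle steps remain.
analyzed_steps = SELECTED_STEPS[2:-2]

# ASSUMPTION: center by subtracting the mean step number of the analyzed steps.
step_mean = sum(analyzed_steps) / len(analyzed_steps)
centered_step = {step: step - step_mean for step in analyzed_steps}


def code_trial(duration: str, processes: str, step: int) -> tuple:
    """Return the numeric predictor values for one trial."""
    return (DURATION_CODE[duration], PROCESSES_CODE[processes], centered_step[step])


print(analyzed_steps, code_trial("short", "present", 7))
```

With this coding, each coefficient for Duration and Processes reflects the difference between its two levels averaged over the other factor, which is what allows the effects to be read as main effects.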

Results showed main effects of Continuum Step (b_Intercept = −0.61, SE = 0.39, z = −1.57, p = .12; b_step = 2.4, SE = 0.22, z = 11.19, p < .001), Sentence Duration (b_duration = −1.18, SE = 0.25, z = −4.66, p < .001), and fast-speech Processes (b_processes = −0.57, SE = 0.26, z = −2.20, p < .05), without any of the interactions reaching significance (b_step×duration = −0.05, SE = 0.32, z = −0.16, p = .87; b_step×processes = −0.08, SE = 0.32, z = −0.25, p = .81; b_duration×processes = −0.43, SE = 0.4, z = −1.09, p = .28; b_step×duration×processes = 0.25, SE = 0.6, z = 0.42, p = .67). The effect of Continuum Step indicates that the vowel duration manipulation resulted in the expected effect; that is, the longer the vowel, the more /a:/ responses were given. Critically, both Sentence Duration and fast-speech Processes influenced vowel perception: both short sentence duration and the presence of fast-speech processes led to more /a:/ responses.
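To see what the reported coefficients imply for the four conditions, the fixed-effects part of the model can be evaluated by hand. The sketch below plugs the reported estimates into the inverse logit at the centered continuum step (0). It ignores the random effects, so the values are model-implied illustrations rather than the observed response proportions shown in the figures.

```python
import math

# Fixed-effect estimates as reported above (log-odds scale).
B = {
    "intercept": -0.61,
    "step": 2.4,
    "duration": -1.18,
    "processes": -0.57,
    "step:duration": -0.05,
    "step:processes": -0.08,
    "duration:processes": -0.43,
    "step:duration:processes": 0.25,
}

DURATION_CODE = {"short": -0.5, "long": 0.5}
PROCESSES_CODE = {"present": -0.5, "absent": 0.5}


def predicted_long_share(duration: str, processes: str, step: float = 0.0) -> float:
    """Model-implied probability of an /a:/ response (fixed effects only)."""
    d, p = DURATION_CODE[duration], PROCESSES_CODE[processes]
    log_odds = (
        B["intercept"]
        + B["step"] * step
        + B["duration"] * d
        + B["processes"] * p
        + B["step:duration"] * step * d
        + B["step:processes"] * step * p
        + B["duration:processes"] * d * p
        + B["step:duration:processes"] * step * d * p
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


for duration in ("short", "long"):
    for processes in ("present", "absent"):
        share = predicted_long_share(duration, processes)
        print(f"{duration}/{processes}: {share:.2f}")
```

At the centered step, the implied probability of an /a:/ response is highest for short/present (about .54), followed by short/absent (about .45) and long/present (about .31), and lowest for long/absent (about .17), in line with the direction of the two main effects.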

Discussion

Experiment 1 used the well-established speaking rate effect on phonetic categorization to test how duration and natural fast-speech processes contribute to perceived speaking rate. Based on previous literature, it was expected that the faster listeners perceived the preceding carrier sentence to be, the more “long” responses should be given, here with regard to the perception of the German /a/–/a:/ contrast. Results showed that both manipulated subcomponents, Duration and presence versus absence of fast-speech Processes, exerted an effect. The short sentence that contained natural fast-speech processes led to the most /a:/ responses, whereas the long sentence in normal-rate speech without such processes led to the fewest /a:/ responses. The remaining two conditions (short/absent and long/present) showed intermediate effects (see Fig. 3). These results speak to two issues.

First, these findings inform the literature on normalization for speaking rate in phonetic categorization. There are currently two main approaches to manipulating speaking rate. Either a speaker is asked to produce carrier sentences fast versus slowly (e.g., Kidd, 1989; Newman & Sawusch, 2009) or utterances spoken at a “neutral” speaking rate are manipulated with linear compression and expansion (e.g., Dilley & Pitt, 2010; Reinisch et al., 2011; Reinisch & Sjerps, 2013). Although both types of rate manipulation have been shown to trigger the expected contrastive effects in the perception of duration contrasts, the present study was the first to directly compare the magnitude of the effects using the same sentence and target words in different conditions. Results suggest that naturally produced fast speech is perceived as faster than linearly compressed “normal-rate” speech. However, there was no interaction between Duration and fast-speech Processes. This suggests that although speech that contains fast-speech processes is perceived as faster than compressed or uncompressed normal-rate speech, the magnitude of the rate normalization effect (i.e., the comparison between the long vs. short sentence within each condition of the factor Processes) did not differ.

The second, more important finding of the present experiment is the effect of natural fast-speech processes itself. This effect showed the opposite pattern to the explicit rate judgments reported in Koreman (2006). In the present experiment, the sentence that had been spoken fast and included segmental reductions and deletions was perceived as faster than the same sentence of the same duration with all segments realized (as in normal-rate speech). That is, the second scenario as described in the introduction appears to be confirmed. The implications of this result for speech processing will be taken up in the General Discussion. However, before jumping to conclusions, the four sentence conditions of the present experiment were subjected to an additional rate comparison task, similar to that conducted by Koreman (2006). This additional task will allow the results of Experiment 1 to be set in relation to explicit rate comparison judgments.