Introduction

When trying to understand speech, listeners can use many different types of information beyond just the acoustic signal itself. In fact, speech frequently occurs in complex auditory environments (e.g., multi-talker environments; Cherry, 1953; Schneider et al., 2007) and often is not carefully articulated (for review, see Smiljanić & Bradlow, 2009), both of which increase ambiguity in the signal. That speech perception is typically robust despite this ambiguity suggests that, in a given listening situation, there are myriad constraints that help the listener understand what is being said. Knowledge about who is talking (Kleinschmidt, 2019; Kraljic & Samuel, 2007; Nygaard & Pisoni, 1998), the topic being discussed (Borsky et al., 1998; Broderick et al., 2019; Hutchinson, 1989; Liberman, 1963), and how various sounds tend to pattern in one’s language (Frisch et al., 2000; Leonard et al., 2015; Mersad & Nazzi, 2011) are just a few examples of co-occurring constraints that may be available for processing. Language scientists have spent decades investigating how people make use of different types of information to process language, with a major aim of understanding what constraints are relevant in different contexts and how constraints might be combined (for an overview, see McRae & Matsuki, 2013). While theories of constraint satisfaction in language processing have been developed at various levels (MacDonald et al., 1994; Trueswell & Tanenhaus, 1994) and have been extended to domain-specific computational models (e.g., the TRACE model of speech perception and spoken word recognition; McClelland & Elman, 1986; SOPARSE, a model of sentence processing; Tabor & Hutchins, 2004), exactly how qualitatively distinct constraints are combined remains outside of the scope of these theories.

Bayesian approaches (e.g., Kleinschmidt & Jaeger, 2015) provide a computational framework in which expectations regarding the relative weight and reliability of different possible constraints (priors, such as knowledge of a talker’s typical productions, or of contextual information) can be combined optimally according to principles following from Bayes’ theorem. While this framework establishes a mathematical model of how an ideal observer should combine cues (in terms of Marr’s (1982) levels of information-processing theories, it is a theory of the computations required), it does not provide an algorithmic mechanism for this cue integration.
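For concreteness, one standard way to write down this kind of ideal-observer cue combination (a sketch in the spirit of this framework, not a reproduction of Kleinschmidt and Jaeger's specific model) treats each cue as contributing a likelihood term:

```latex
% Posterior probability of category c (e.g., /s/ vs. /S/), given an acoustic
% cue a and a contextual cue k, assuming the cues are conditionally
% independent given the category:
P(c \mid a, k) = \frac{P(a \mid c)\, P(k \mid c)\, P(c)}
                      {\sum_{c'} P(a \mid c')\, P(k \mid c')\, P(c')}
```

In this formulation, a cue's influence on the posterior depends on how sharply its likelihood distinguishes the candidate categories (its reliability), but the equation itself says nothing about when each term becomes available or in what order the terms are combined during online processing.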

We are concerned with the algorithmic level – that is, when and how various types of constraints are processed during online processing, in the context of real-time pressures on listeners. In the TRACE model of speech perception, for example, different types of constraints can be simulated (e.g., lexical-level constraints on phoneme ambiguities, or acoustic-level constraints, such as co-articulatory effects; Elman & McClelland, 1988). The relative weighting of two constraints, and how they are actually combined, are not, however, inherent in the model architecture. For example, in order to simulate effects akin to preceding (lexical or semantic) context, TRACE has a priming mechanism that allows the experimenter to boost the resting activation of specific words. However, it is not immediately clear how priming in TRACE could be parameterized to appropriately model over-time influences or interactions with other cues. Doing so would require empirical data that clarify how effects of prior context interact with other, qualitatively distinct cues to influence speech processing.

The goal of this work, then, is to understand how both qualitative and quantitative factors influence the use of multiple constraints in speech perception. We ask how constraints that differ qualitatively (in terms of timing and modality) and quantitatively (in terms of reliability) affect processing of ambiguities in speech. In the General discussion, we return to the challenges multiple cue integration poses for models of human spoken word recognition.

Qualitative variation in constraints

Information that can influence the identification of speech sounds can stem from a variety of sources. Some are related to variations in the actual acoustic signal. Others are based on linguistic knowledge (e.g., constraints implied by the words the listener knows, or expectations based on syntax or semantics). Many different kinds of constraints may be available in real-world contexts. Imagine the following scenario: an individual walks into the kitchen, where their roommate is holding a shopping bag and says, “The chiropractor dealt with my ba[?]”, where the final word contains a segment ambiguous between /g/ and /k/ (bag or back). This scenario contains cues in different modalities, occurring at different time points. The visual information (the roommate holding a bag) occurs early on and may suggest that the roommate said “bag.” There is also semantic context from the word chiropractor that might bias the listener towards hearing “back.” This cue occurs within the speech modality but still precedes the ambiguous segment by hundreds of milliseconds. There may also be constraining acoustic context adjacent to the point of ambiguity (e.g., the duration of the preceding vowel, where a longer vowel would be more consistent with the voiced alternative, /g/; Denes, 1955). There can be, then, many sources of constraint in different modalities and with different temporal relations to the point of ambiguity, and these can also vary in how informative or reliable they will be. How listeners use qualitatively different cues when processing speech, particularly when these sources of information may conflict, is not fully specified by psycholinguistic theories.

Much of the work on ambiguity resolution in speech perception looks at the effect of just one cue on another (often, how the perception of a phonetic continuum is influenced by the presence of a particular cue). Consider the well-known Ganong effect, which isolates lexical knowledge as a potential influence on ambiguity resolution (Ganong, 1980). For example, if an ambiguous /s/-/ʃ/ token is embedded in a sang-*shang continuum, listeners are more likely to identify ambiguous continuum steps as /s/, since only sang is a word. Semantic information can likewise influence ambiguity resolution. For example, Getz and Toscano (2019) found that semantic context can influence perception of targets from minimal pairs that differ in voicing (e.g., seeing the visual prompt AMUSEMENT before an auditory token ambiguous between bark and park leads to more /p/ responses).

Other studies have introduced multiple constraints on phonetic interpretation (e.g., trading relations; Repp, 1982). Typically, however, these studies look at use of multiple acoustic-phonetic cues. For example, many studies show that listeners seem to integrate across acoustic-phonetic cues (e.g., voice-onset times, vowel durations, preceding rate) and use these cues when they become available, reflecting a process of continuous integration (McMurray et al., 2008; Reinisch & Sjerps, 2013; Toscano & McMurray, 2015). This rapid integration of cues has also been shown at the lexical level, in a Ganong paradigm; listeners use lexical information very early (Kingston et al., 2016).

Recent work from Kaufeld et al. (2019) looked at the influence of cues across levels of language processing (i.e., not just cues stemming from the acoustic signal). They studied both syntactic and acoustic constraints on phoneme identification, and found additive effects of these sources of information in how listeners resolve ambiguities in the speech signal. In their study, listeners were influenced by both speaking rate (acoustic-phonetic) and morphosyntactic gender information (lexical/semantic) in interpreting ambiguous words in sentences. A similar examination of multiple sources of information in the visual word recognition literature comes from studies of how word frequency and masked repetition priming interact (Balota & Spieler, 1999; Becker, 1979; Connine et al., 1990; Forster & Davis, 1984; Holcomb & Grainger, 2006). Kinoshita (2006) showed both additive and interactive effects of these two sources of information, depending on how familiar the presented words were. When low-frequency words were familiar, priming effects were greater for lower frequency words than higher frequency words. However, when low-frequency words included familiar and unfamiliar items, priming effects did not differ based on lower or higher frequency. While familiarity is not of interest in the current work, it is interesting to note that under certain conditions, two sources of information might both appear to have influence, while under other conditions, the presence of one cue might eliminate, diminish, or amplify the influence of another to different degrees, at different levels of the first cue.

Recent work from Lai and colleagues (2022) has directly tested how two qualitatively distinct constraints influence spoken word recognition. Using a lexical decision task, they assessed how listeners used both co-articulatory information (stemming from rounded vowels following a sibilant, biasing listeners towards hearing /s/) and lexical information (assume vs. *ashume). Consistent with findings from Kaufeld and colleagues (2019), Lai and colleagues (2022) found overall additive effects of these sources of information. Additionally, like Kaufeld et al. (2019), they found differences at the individual level in the types of information that listeners tended to use: listeners who used lexical information more tended to use co-articulatory information less.

We note, however, that Lai and colleagues used word-nonword continua (e.g., assume/*ashume) for their stimuli. Given the strength of the Ganong effect, it is possible that their paradigm might underestimate the influence of co-articulatory information. Hence, in situations where listeners only hear words, it is possible that they might rely more on co-articulatory information. Additionally, we note that the lexical information comes after the point of ambiguity, which could influence the time course of lexical influences on phonetic processing and potentially modulate the strength of co-articulatory constraints.

While these studies provide steps towards understanding how different constraints may operate simultaneously, we must also consider the fact that quantitative differences, such as how reliable different constraints are, may influence how much a given constraint is used.

Quantitative variation in constraints

An important factor that might impact how listeners balance constraints from two sources is how reliable each source of information is. Work from Bushong and Jaeger (2019) suggests that in laboratory settings where there are unnatural correspondences between acoustic and contextual cues, listeners tend to discount contextual information. For instance, in naturalistic settings, /d/-like voice onset times (VOTs) occur more often in sentence contexts containing /d/-initial words, but these cue correspondences are often violated in experimental settings, where listeners may be presented with a high proportion of inconsistent cues (e.g., /t/-like VOTs in semantic contexts consistent with /d/).

Bushong and Jaeger presented listeners with sentences that varied in semantic context and VOT of the target word (example sentences from their paradigm include: (A) When the [?]ent in the forest was well camouflaged, we began our hike, and (B) When the [?]ent in the fender was well camouflaged, we sold the car). The ambiguous token, [?]ent, varied across six VOT values (from most /t/-like to most /d/-like). In one condition, which had "high conflict" between lexical expectations and acoustic cues, the six possible VOT values were evenly distributed across semantic contexts. Note that this condition mirrors the setup of most laboratory studies, where the goal is to obtain an equal number of observations per cell. In this high-conflict condition, listeners were equally likely to hear a semantically consistent sentence, containing a phrase such as “dent in the fender”, as they were to hear a semantically inconsistent sentence, containing a phrase such as “tent in the fender.” In their "low-conflict" condition, VOT values occurred in more natural proportions with typical lexical-VOT distributions (with endpoint tokens only presented with their expected lexical context and gradually decreasing proportions of consistency up to the completely ambiguous tokens, which were presented equally often with both lexical contexts). For example, in the low-conflict condition, listeners heard a higher proportion of /t/-like VOTs in the semantically consistent forest sentence context (“tent in the forest”). Listeners used lexical context more in the low-conflict condition, suggesting that listeners may be sensitive to the distributions of (sometimes competing) cues.
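To make the contrast concrete, the two conditions can be summarized as a set of presentation proportions per VOT step. The numbers below are purely illustrative (they are not Bushong and Jaeger's actual proportions), sketched in R:

```r
# Illustrative sketch of the high- vs. low-conflict manipulation (hypothetical
# proportions, not Bushong & Jaeger's actual design): for each of six VOT steps,
# the proportion of trials paired with the /t/-consistent ("forest") context.
design <- data.frame(
  vot_step      = 1:6,                          # 1 = most /d/-like, 6 = most /t/-like
  high_conflict = rep(0.5, 6),                  # cues fully crossed, as in typical lab designs
  low_conflict  = c(0, 0.25, 0.5, 0.5, 0.75, 1) # endpoints only with their consistent context
)
design
```

In the low-conflict column, the correspondence between VOT and sentence context approximates what listeners encounter outside the laboratory, which is the property Bushong and Jaeger argue makes listeners more willing to rely on the contextual cue.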

Similar sensitivity to the distribution of cues has been shown by Giovannone and Theodore (2021) in a Ganong (1980) paradigm, where listeners heard tokens along a *giss-kiss continuum (where lexical context biases towards /k/) and along a gift-*kift continuum. When the degree of conflict between lexical and phonetic cues was reduced (i.e., in a low-conflict condition, listeners heard a higher proportion of /g/-like VOTs in a gift-*kift continuum), listeners also seemed to rely more on lexical information.

Additionally, at the acoustic level, listeners may down-weight the importance of an acoustic cue depending on whether it agrees with other disambiguating information, whether acoustic or lexical (Idemaru & Holt, 2011; Zhang et al., 2021). For example, Idemaru and Holt (2011) presented listeners with pairs of cues (e.g., VOT and F0, which are both cues to voicing) that were either consistent with typical correlations between the cues, or reversed (e.g., an F0 value that typically occurs for a voiced token paired with a VOT for a voiceless token). In multiple pairings of cues, they found that when listeners were tested on ambiguous tokens where only one cue (F0) could guide their phoneme decision, they relied less on that cue (F0) when it had previously been paired with an atypical VOT value. This suggests that listeners track distributions of cues and adjust their reliance on them in accordance with recent experience.

In sum, listeners can alter their reliance on specific cues in the presence of input that either mirrors or violates naturally occurring cue correspondences. In the context of qualitatively distinct cues, it remains unknown exactly how listeners might use information about the reliability of one cue to guide their relative cue use. In other words, it is unknown whether the additive or interactive effects of distinct cues change when the reliability of one source of information changes. The current study aims to address the question of how listeners use two qualitatively different cues (cross-modal identity priming and co-articulatory context, which differ in content, modality, and temporal proximity to ambiguities in our materials) to interpret ambiguities in the speech signal, and whether listeners are sensitive to the reliability of certain cues (i.e., whether they differentially use certain cues based on their reliability).

An approach to studying constraint integration

Our aim with the current experiments is to examine how listeners use two constraints, co-articulatory context and cross-modal identity priming (henceforth referred to as visual priming), that differ saliently in modality and in temporal proximity to the point of ambiguity. Previous work has shown that co-articulatory context alone can influence the identification of ambiguous phonemes (e.g., Luthra, Peraza‐Santiago, Beeson, et al., 2021), and in Experiment 1, we test whether visual priming alone can also guide the interpretation of ambiguous phonemes. In Experiment 2, we examine how these two cues might be used when both are available. Finally, in Experiment 3, we examine whether manipulating reliability of the visual prime leads to different use of that cue. To set the stage, we conclude this section with a review of the two constraints we will manipulate in the experiments.

The first constraint will be a written word-form with potential to influence processing of a corresponding spoken word through visual priming (Blank & Davis, 2016; Sohoglu et al., 2014). Previous studies suggest that such primes influence how speech is perceived. For instance, for both degraded (vocoded) and relatively clear speech, Blank and Davis (2016) found that participants had greater accuracy reporting what they had heard when auditory stimuli had been preceded by a visual identity prime (e.g., written SHAME before degraded acoustic token shame), as compared to a neutral prime (e.g., written ######## before acoustic token shame). However, to our knowledge, previous work has not examined whether visual priming can shift the identification of ambiguous phonemes.

The second constraint will come from co-articulatory context. Effects of co-articulatory context can be seen in a paradigm known as compensation for co-articulation (CfC; Mann, 1980; Mann & Repp, 1981; Repp & Mann, 1981, 1982; Viswanathan et al., 2010). When speakers produce a sound with a posterior place of articulation (PoA), such as /k/, and then produce a sound with an anterior PoA (e.g., /s/), or the other way around, they may not reach the typical PoA for the second sound and consequently produce a more ambiguous speech sound. Hence, if listeners hear a token like maniac (with word-final /k/ and therefore posterior PoA) followed by an ambiguous same-shame token, they will be more likely to interpret the ambiguous token as “same” (with anterior PoA), as though they are compensating for acoustic contingencies that follow from co-articulation. Though there are alternative explanations for CfC effects that appeal to acoustic differences rather than articulatory differences (Diehl et al., 2004; Holt & Lotto, 2008; but see Viswanathan et al., 2010), for the purposes of this investigation, it only matters that these effects exist as another instance of context influencing interpretation of speech and that these effects differ from the effects of visual priming with regard to timing and modality.

Thus, in a series of four pre-registered experiments (see preregistrations at https://osf.io/6kmub), we compare the impact of two competing constraints on phoneme identification. These constraints vary qualitatively in modality (visual vs. auditory) and in their temporal relation to the point of ambiguity in the speech signal (visual primes occur more than 1 s before the point of ambiguity, while co-articulatory context immediately precedes the point of ambiguity). In line with prior work looking at integration of various acoustic-phonetic sources of information (e.g., McMurray et al., 2008; Toscano & McMurray, 2015), it is possible that the constraints we consider influence processing from the moment they are available. However, it is also possible that one constraint dominates. Additionally, because of the temporal order inherent in the presentation of these two constraints, it is possible that the co-articulatory information only exerts an influence when it is in conflict with the prime (which may already maximally activate the target lexical or phonetic item).

We will also examine whether more reliable information coming from the visual primes (i.e., including a greater proportion of trials where the prime matches the auditory target) leads to greater use of the prime (as in Bushong & Jaeger, 2019, or Giovannone & Theodore, 2021). How listeners use qualitatively distinct constraints with varying degrees of reliability will inform theories of language processing, and provide a foundation for extending algorithmic accounts of speech processing to account for the potentially simultaneous influence of cues that are qualitatively distinct in modality and timing.

Experiment 1

In Experiment 1, we examine how visual identity priming influences identification of ambiguous word-word minimal pairs. In this study, listeners made an 's'-'sh' judgment for spoken continua created from minimal pairs like same-shame that were preceded by visual primes that were neutral ("########") or matched one endpoint ("SAME" or "SHAME"). Visual identity priming has been shown to influence identification of noise-vocoded speech (Blank & Davis, 2016; Sohoglu et al., 2014). However, it remains unknown whether such priming can influence perception of ambiguous tokens (such as tokens along a same-shame continuum). Because visual semantic priming has been found to influence perception of word-word pairs (Getz & Toscano, 2019), we hypothesize that identity priming should influence phoneme identification. Establishing whether visual priming can influence perception of an acoustic-phonetic continuum is a prerequisite to our goal of pitting qualitatively distinct constraints against one another in Experiment 2.

Methods

Materials

We used materials developed by Luthra, Peraza‐Santiago, Beeson, et al. (2021). Luthra et al. identified context items and target pairs that elicit robust compensation for co-articulation (CfC; necessary for Experiments 2 and 3), and we used the target items (and in Experiments 2 and 3, the context items) that they established can drive CfC. We included five /s/-/ʃ/ minimal pairs that were shown to exhibit CfC effects in the pilot from Luthra et al. (2021a, 2021b, 2021c). These pairs were: same-shame, sell-shell, sign-shine, sip-ship, and sort-short. Each pair consisted of five audio stimuli identified by Luthra et al. (2021a, 2021b, 2021c): the most ambiguous step (proportion of /s/ responses across five pairs = 0.47) and two steps on each side of that maximally ambiguous step (where the most s-like step had a mean s-rate of 0.92 across five pairs and the most ʃ-like step had a mean s-rate of 0.04 across five pairs). For each pair, we used three written primes, with one matching each end of the target continua and one that was neutral (e.g., SIP, SHIP, and ########). In Experiments 2 and 3, we include co-articulatory context items before the /s/-/ʃ/ ambiguity. To keep the timing identical in this experiment, we inserted silent pauses, matched to the durations of the appropriate context items, between the presentation of the prime and the onset of the critical auditory target (Fig. 1). The four context items were isolate (846 ms), maniac (785 ms), pocketful (765 ms), and questionnaire (1046 ms).

Fig. 1 Schematic representation of a trial for Experiment 1

Participants

We collected data from 68 participants in order to achieve our pre-registered target sample size of 40 participants (15 female, 24 male, one other/decline to state; age range: 19–33 years; mean age: 27 years) after applying pre-registered exclusionary criteria (described below). To determine the appropriate sample size for our experiments, we considered relevant studies examining identity priming (Blank & Davis, 2016; n = 20 for 90% power) and compensation for co-articulation (Luthra, Peraza‐Santiago, Beeson, et al., 2021; n = 15 to achieve 90% power). However, a sample size sufficient for 90% power in previous studies might not be sufficient when combining constraints and examining interactions (as we do in Experiments 2 and 3). As such, we took a conservative approach and doubled the larger of the sample sizes, leading to a target sample of 40 participants per experiment.

Experimental sessions took approximately 60 min. Participants were paid $12, consistent with Connecticut minimum wage ($12/h at the time of data collection). Only participants who were 18–34 years of age, native speakers of North American English, and who reported normal/corrected-to-normal vision and normal hearing were recruited for this study.

After data collection, we applied our pre-registered exclusionary criteria to exclude participants (a) for not reaching at least 80% accuracy for the clear endpoint stimuli with neutral primes, in line with conventions used in Luthra and colleagues (2021), (b) for failing to respond in more than 10% of trials (with a 6-s trial timeout), (c) for failing our headphone check (described below) more than once, or (d) for not reaching at least 80% accuracy on reporting written primes (see Procedure below).

Procedure

The experiment was implemented in Gorilla (www.gorilla.sc; Anwyl-Irvine et al., 2020) and participants were recruited through Prolific (www.prolific.co). All procedures were approved by the University of Connecticut’s Institutional Review Board (IRB). Participants provided informed consent and filled out demographic information before the main task. Participants then completed a headphone screening that required them to identify the quietest tone among a series of three tones, a task that is designed to be difficult to pass without headphones due to phase cancelation (Woods et al., 2017). If a participant failed the screening twice, we excluded their data (per our pre-registered exclusion criteria), but they still received compensation as described above.

Trials consisted of a printed prime word presented in capital letters (in Open Sans font) for 500 ms, followed by a brief pause (250 ms), a silent gap corresponding to the duration of a context item (see Materials for timing details), and an auditory target (Fig. 1). Participants responded as to whether they thought the target started with an ‘s’ sound or an ‘sh’ sound by pressing the appropriate button (F or J; assignment of ‘s’ and ‘sh’ to F or J keys was counterbalanced across participants). To ensure that participants were paying attention to the written prime, on a subset of trials, participants were only presented with a written prime and asked to type that prime in a response box. Participants completed two blocks of trials, each consisting of 300 experimental trials (including all combinations of five target continuum steps, five target pairs, three written primes, and four gap durations) and 60 prime-only trials. Trial order was completely randomized within each block. Before the main blocks of the experiment, participants completed 12 practice trials with a different target continuum (daze-gaze). The experiment took about 60 min to complete.
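For clarity, the factorial structure of an experimental block can be sketched as follows (a minimal sketch with our own variable names, not the authors' Gorilla experiment code; the 60 prime-only attention-check trials per block are omitted):

```r
# Experiment 1 block structure: 5 continuum steps x 5 target pairs
# x 3 prime types x 4 silent-gap durations = 300 experimental trials.
pairs  <- c("same-shame", "sell-shell", "sign-shine", "sip-ship", "sort-short")
steps  <- -2:2                            # -2 = most /s/-like, +2 = most /S/-like
primes <- c("front", "back", "neutral")   # e.g., SIP, SHIP, ########
gaps   <- c(846, 785, 765, 1046)          # silent gaps (ms) matched to isolate,
                                          # maniac, pocketful, questionnaire

block <- expand.grid(pair = pairs, step = steps, prime = primes, gap_ms = gaps,
                     KEEP.OUT.ATTRS = FALSE)
nrow(block)                               # 300
block <- block[sample(nrow(block)), ]     # fully randomized order within a block
```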

Analyses

Pre-registered mixed-effects logistic regression models were run to predict the proportion of front-PoA (i.e., /s/) responses, using the mixed function in the R (R Core Team, 2021) package afex (Singmann et al., 2015), which reports results in ANOVA-like formats and is a wrapper for the glmer function in the lme4 package (Bates et al., 2015). Our model included fixed effects of Prime (front-consistent [e.g., SIP], back-consistent [e.g., SHIP], or neutral [########]; sum-coded) and Step (which ranged from -2 to +2) and their interaction. Our model also included by-subject and by-target-pair random slopes for Prime and Step and their interaction, as well as by-subject and by-target-pair random intercepts, without correlation between random slopes and intercepts. Following best practices outlined by Matuschek et al. (2017), we selected this random-effects structure by starting with the maximal model for our data and then using the anova function to test for differences between models with successively simpler random effects structures (first removing correlations between random slopes and intercepts and then removing by-item random effects) to arrive at the simplest model that does not significantly reduce fit. To investigate pairwise comparisons within the model, in exploratory analyses, we followed up on significant effects in the model using the emmeans package (Lenth, 2022), adjusting for multiple comparisons using the multivariate t-distribution. Details and results from pre-registered analyses of reaction time data can be found in the Online Supplementary Materials (OSM).
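The model specification described above can be sketched roughly as follows, assuming a trial-level data frame d with a binary column resp_s coding front-PoA (/s/) responses (object and column names are ours, not those of the pre-registered analysis scripts):

```r
library(afex)     # mixed(): ANOVA-style output, wraps lme4::glmer
library(emmeans)  # follow-up pairwise comparisons

d$prime <- factor(d$prime)                 # front / back / neutral
contrasts(d$prime) <- contr.sum(3)         # sum coding, as described above

m1 <- mixed(
  resp_s ~ prime * step +
    (prime * step || subject) +            # by-subject slopes and intercept, no correlations
    (prime * step || target_pair),         # by-target-pair slopes and intercept
  data = d, family = binomial,
  method = "LRT",                          # likelihood-ratio (chi-square) tests
  expand_re = TRUE                         # needed for || with factor predictors
)
m1

# Exploratory pairwise comparisons among prime conditions,
# adjusted via the multivariate t distribution
pairs(emmeans(m1, ~ prime), adjust = "mvt")
```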

Results

Figure 2 shows that participants made more front-PoA responses when they had seen a front-PoA prime (e.g., SAME) as compared to either a neutral prime (e.g., ########) or a back-PoA prime (e.g., SHAME). More specifically, across all steps, participants made a front-PoA response 47% of the time after a front-PoA prime, 42% of the time after a neutral prime, and 40% of the time after a back-PoA prime.

Fig. 2 Responses from Experiment 1 showing decisions on the first phoneme of a target continuum (all continua were between /s/ and /ʃ/, e.g., same-shame) after a written-word prime. The x-axis shows continuum step, ranging from most front (/s/) to most back (/ʃ/). The y-axis shows the proportion of front place of articulation (PoA; i.e., /s/) responses, with colors and shapes indicating prime type. Error bars represent 95% confidence intervals

Our mixed-effects logistic regression model revealed a significant effect of Prime (χ2 = 16.13, p < .001), indicating that participants’ responses were influenced by the written prime. After participants saw a prime beginning with a front PoA (e.g., SAME), they were more likely to make an /s/ response, which was the expected direction of this effect. The model also revealed a significant effect of Step (χ2 = 19.20, p < .001), indicating that participants made more front-PoA responses for more front-PoA steps. The interaction between Prime and Step was not significant (χ2 = 5.36, p = .07).

We conducted follow-up tests to analyze pairwise comparisons for Prime, correcting for multiple comparisons as described above. There were more front-PoA responses for a front Prime than for a neutral Prime (contrast estimate: .399, z-ratio = 4.171, p < .001), and more front-PoA responses for a front Prime than for a back Prime (contrast estimate: .772, z-ratio = 6.040, p < .001). Likewise, there were also fewer front-PoA responses for a back Prime than for a neutral Prime (contrast estimate: .373, z-ratio = 4.272, p < .001).

Discussion

Overall, these findings demonstrate that visual priming influences identification of ambiguous phonemes, both influencing trial-level interpretation of the stimulus and promoting faster response times. To our knowledge, this is the first demonstration of the influence of visual priming on such a task, though as we noted above, semantic priming has been shown to influence identification of ambiguous phonemes (Getz & Toscano, 2019). In the General discussion, we consider broader implications and potential extensions of this finding. Most importantly, however, we note that the demonstration that visual identity primes influence identification of ambiguous phonemes will allow us to examine in Experiment 2 how priming does (or does not) influence speech processing when another qualitatively different cue is present: co-articulatory context.

Experiment 2

In Experiment 2, we investigate how listeners reconcile potentially concordant or conflicting cues that differ in both modality and timing: a visual prime and co-articulatory context. We established in Experiment 1 that presenting a visual prime such as SHAME before an ambiguous same-shame token makes listeners more likely to identify ambiguous tokens as “shame.” Prior research shows that presenting an auditory token of isolate (ending with an anterior PoA) before an ambiguous same-shame token will likewise lead listeners to be more likely to report hearing “shame” (starting with a posterior PoA), due to the CfC effect introduced above (Luthra, Peraza‐Santiago, Beeson, et al., 2021; Mann, 1980; Mann & Repp, 1981; Repp & Mann, 1981, 1982). Of interest in the current work is how listeners make use of two different sources of information, particularly when they are in conflict.

By presenting listeners with both types of information (e.g., the visual prime SAME, followed by the auditory context isolate, followed by an ambiguous same-shame token), we can create cases where the two constraints make opposite predictions (here, CfC based on the front PoA at the offset of isolate favors shame) and test how both constraints influence speech perception. If listeners are more sensitive to the earliest information available, then we might expect listeners to rely more on the prime. If, however, listeners are more sensitive to within-modality information, we might expect listeners to rely more on the auditory context that immediately precedes the target. We also might expect to observe interactive effects, such that the presence of one source of information changes the effect of the other source of information. Another possibility (consistent with prior research, e.g., Kaufeld et al., 2019; Lai et al., 2022) is that we might observe additive effects of the two constraints. Note that here, we refer to interaction in the statistical sense, with relevant implications for underlying cognitive mechanisms.

We note that while Lai and colleagues found additive effects of co-articulatory information and lexical information, there are key differences between their work and ours. First, their constraints (co-articulatory information and lexical information) occurred within the same modality, which might better facilitate additive processing. Our work asks how constraints that differ in modality and timing influence speech processing. Second, they used word-nonword continua, which may have magnified lexical effects, since typical processing does not involve nonwords. Third, in their study, lexical information occurred only after the point of ambiguity. In contrast, we present both lexical (visual prime) and co-articulatory information before the auditory target (which comes only from word-word continua).

Methods

Materials

Stimuli consisted of the same primes and target pairs from Experiment 1. While in Experiment 1 we inserted silent gaps matched to the durations of the context items from Luthra, Peraza‐Santiago, Beeson, et al. (2021), in Experiment 2 we used the actual audio tokens of those four context items: isolate, maniac, pocketful, and questionnaire. Two of these auditory context items (isolate and pocketful) end in a front PoA and two (maniac and questionnaire) end in a back PoA. Recall that presenting auditory context items that end with a front PoA (isolate and pocketful) before hearing an ambiguous same-shame token should lead listeners to hear shame (starting with a back PoA), due to CfC (Mann, 1980; Mann & Repp, 1981; Repp & Mann, 1981, 1982).

Participants

We collected data from 63 participants in order to achieve our pre-registered target sample size of 40 participants (15 female, 25 male; age range: 18–34 years, mean age: 28 years) after applying pre-registered exclusionary criteria.

Experimental sessions took approximately 60 min. Participants were paid $12 for their participation, consistent with Connecticut minimum wage ($12/h at the time). We used the same pre-registered recruitment and exclusionary criteria as outlined in Experiment 1.

Procedure

All procedures were the same as in Experiment 1, except that auditory context items were presented in place of the corresponding silent gaps. Thus, trials consisted of a printed prime word for 500 ms, followed by a brief pause (250 ms), an auditory context item, and an auditory target (Fig. 3). As in Experiment 1, participants completed 12 practice trials that used different context items and a different target continuum (context: catalog; continuum: lip-rip).

Fig. 3 Schematic representation of a trial for Experiment 2

Analyses

Analyses followed a similar structure to Experiment 1. For our analysis of participants’ responses (i.e., whether they indicated the target began with a /s/ or /ʃ/), the logistic mixed-effects model included fixed effects of Prime (front, back, and neutral; sum-coded), Context (front and back; sum-coded) and Step (which ranged from -2 to +2), and their three-way interaction (as well as all the lower-level interactions). This model also included by-subject and by-target-pair random slopes for Prime, Context, Step, and their three-way interaction (as well as all the lower-level interactions), as well as by-subject and by-target-pair random intercepts, with no correlations between random slopes and intercepts. We arrived at this random-effects structure using the same model selection criteria as in Experiment 1. Pairwise comparisons were investigated following the same approach as in Experiment 1. As for Experiment 1, results from pre-registered analyses of reaction time data can be found in the supplementary materials.
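In the same notation as the Experiment 1 sketch (again with assumed object and column names), the Experiment 2 model adds the Context factor:

```r
# context = place of articulation of the auditory context item (front vs. back), sum-coded
d2$context <- factor(d2$context)
contrasts(d2$context) <- contr.sum(2)

m2 <- mixed(
  resp_s ~ prime * context * step +
    (prime * context * step || subject) +
    (prime * context * step || target_pair),
  data = d2, family = binomial, method = "LRT", expand_re = TRUE
)
```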

Results

Responses

As shown in Fig. 4, participants were influenced by both the visual prime and the auditory context items. As predicted, across all steps, participants made more front-PoA responses when they heard an auditory context item that ended in a back-PoA (overall 47%) than when the context item ended in a front-PoA (overall 38%). In other words, when participants heard an auditory context such as isolate (which has a front PoA), they were more likely to report that the auditory target began with /ʃ/ (which has a back PoA), which demonstrates the expected CfC effect. Likewise, they made more front-PoA responses when they saw a front prime (overall 47%) than when they saw a neutral prime (overall 42%) or a back prime (overall 39%). Breakdowns of the proportion of front responses by condition can be found in Table 1.

Fig. 4 Responses from Experiment 2 showing decisions about the first phoneme of a target continuum (all continua were between /s/ and /ʃ/, e.g., same-shame) after a visual prime and an auditory context item. The x-axis shows continuum step, ranging from most front (/s/) to most back (/ʃ/). The y-axis shows proportion of front place of articulation (PoA; /s/) responses. Colors and shapes indicate the prime that participants saw. Line type indicates context item PoA. Error bars represent 95% confidence intervals. (A) Effects of prime. (B) Effects of auditory context. (C) Effects of both prime and context

Our mixed-effects logistic regression model revealed significant effects of Prime (χ2 = 9.08, p = .011) and Context (χ2 = 15.11, p < .001), indicating that participants’ responses were influenced by the written prime and the PoA of the auditory context. In other words, participants made more /s/ responses after /s/-biasing primes, such as SAME, consistent with Experiment 1, and made more /s/ (front-PoA) responses after context items ending in a back PoA, such as maniac. The model also revealed a significant effect of Step (χ2 = 23.00, p < .001), indicating that participants made more front-PoA responses for more front-PoA steps. None of the interactions were significant (all ps > .05); notably, we do not find evidence that the effect of one cue (e.g., visual prime) is affected by the other cue (e.g., articulatory context).

We conducted follow-up tests to analyze pairwise comparisons for Prime, correcting for multiple comparisons. As in Experiment 1, there were more front-PoA responses for a front prime than for a neutral prime (contrast estimate: .417, z-ratio = 2.972, p = .007), and more front-PoA responses for a front prime than for a back prime (contrast estimate: .696, z-ratio = 3.128, p = .004). However, unlike in Experiment 1, the difference between front-PoA responses after a neutral prime and after a back prime was not statistically significant (contrast estimate: .280, z-ratio = 2.219, p = .056).

Discussion

In this study, we examined how two qualitatively different cues occurring at different times and in different modalities influence phoneme identification. Overall, we found that written primes and auditory contexts each impacted phoneme identification, but there was no interaction between these constraints, suggesting that their impacts were additive. Further supporting this point is a test of whether the presence of auditory context information led to different use of the prime. At the suggestion of a reviewer, we compared the effects of prime in Experiments 1 and 2. Neither the effect of experiment nor any interactions involving experiment and prime were significant in this model. Thus, at least in the case of these two specific cues, listeners’ decisions about ambiguous phonemes appear to be influenced by all available constraints.

Notably, the experiment included combinations where the two constraints were in conflict (e.g., a prime biasing a listener towards “same” and an auditory context biasing a listener towards “shame”) and where the prime was neutral and listeners only had auditory context. Thus, if the constraints were interactive, we might expect the size of the auditory context effect, for example, to differ based on whether or not there was priming information (neutral versus biasing primes). Alternatively, we could also find that the size of the priming effect varies based on whether there was biasing auditory information. As mentioned above, because we found no differential effects of priming across Experiments 1 and 2, this evidence also points towards additive effects. Our findings suggest that listeners use these two cues additively, such that the presence of one cue does not change the effect of the other.

The results raise important questions about how cue integration unfolds over time, which we consider more fully in the General discussion. For instance, it is possible that the prime sets a baseline expectation that the auditory context can then shift. Temporally sensitive neural measures (such as EEG) might help us unpack exactly how this processing unfolds. Furthermore, it is noteworthy that our two cues differ in the timing of when information becomes available. Because listeners leveraged both sources of information, our results suggest that listeners can integrate multiple cues over a relatively long timespan (~1.5 s). Exactly when information from these cues is integrated remains an open question.

Having established that listeners can combine qualitatively distinct cues, we next ask whether quantitative variation in one cue’s reliability can shift the use of available cues.

Experiment 3

In Experiment 3, we examine how the reliability of visual priming influences identification of ambiguous word-word minimal pairs, with and without the presence of co-articulatory context. In Experiments 1 and 2, listeners sometimes received mismatching prime-target pairings (e.g., seeing the written word SHAME before hearing a relatively clear same token). Listeners are sensitive to the likelihood of cue co-occurrence and have been shown to change how they weight cues in situations that do versus do not track typical co-occurrence of the cues (Bushong & Jaeger, 2019; Idemaru & Holt, 2011). For our endpoint auditory tokens (e.g., clear same or shame tokens), clear mismatches between the visual prime and the auditory target may lead participants to consider the primes to be unreliable predictors of what they will hear. We will therefore examine how reliance on priming might change when clear conflict cases are removed (rendering primes more reliable predictors of targets). If listeners are sensitive to the obvious mismatch with inconsistent primes at endpoints, this change should enhance the effect of the written prime. If listeners do not perform differently with a more reliable prime, this may suggest that assessment of contextual reliability operates over more realistic language input (i.e., distributions of cues encountered in real-world language settings as opposed to cue distributions in our paradigms) or relates to other differences, such as the timing of disambiguating information (where visual priming and auditory context occur on different timescales).

Experiment 3 thus includes replications of Experiment 1 (Experiment 3a) and Experiment 2 (Experiment 3b) with the following change: all non-neutral primes for endpoint auditory tokens are consistent with the auditory token (e.g., participants only see SHAME or ######## before hearing the shame endpoint and never see SAME before hearing the shame endpoint, and vice versa).

Experiment 3a

Methods

Materials Stimuli were the same as for Experiment 1.

Participants We collected data from 63 participants in order to achieve our pre-registered target sample size of 40 participants (21 female, 18 male, one other/decline to state; age range: 18–34 years, mean age: 28 years). Experimental sessions took approximately 60 min. Participants were paid $12 for their participation, consistent with Connecticut minimum wage ($12/h at the time). We used the same recruitment and exclusionary criteria as for Experiment 1.

Procedure Procedures were the same as for Experiment 1 with one key change. When the auditory target item was an endpoint token (i.e., the most /s/-like or /ʃ/-like token), the written prime was never a mismatch (i.e., participants only ever saw SHAME or ######## before hearing shame). For each target continuum, then, endpoint tokens were presented with matching primes two-thirds of the time, and neutral primes one-third of the time (as opposed to one-third matching, one-third mismatching, and one-third neutral as in Experiment 1).
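Continuing the trial-list sketch from Experiment 1, the reliability manipulation amounts to replacing mismatching biasing primes at the endpoint steps with the matching prime (again our own illustration, not the authors' experiment code):

```r
# Endpoint tokens (steps -2 and +2) never appear with a mismatching prime:
# a back prime at the /s/ endpoint becomes a front prime, and vice versa, so
# endpoints occur with matching primes 2/3 of the time and neutral primes 1/3.
mismatch <- (block$step == -2 & block$prime == "back") |
            (block$step ==  2 & block$prime == "front")
block$prime[mismatch] <- ifelse(block$step[mismatch] == -2, "front", "back")
```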

Analyses

We used the same model structure as for Experiment 1. This model included fixed effects of Prime (front, back, and neutral; sum-coded) and Step (which ranged from -2 to +2) and their interaction. Our model also included by-subject and by-target-pair random slopes for Prime and Step and their interaction, as well as by-subject and by-target-pair random intercepts, with no random correlations between slopes and intercepts. However, we restricted our analyses to only include our non-endpoint auditory targets, as the endpoints only occurred with one biasing prime and the neutral prime.

Results

Participants made more front-PoA responses when they had a front-PoA prime as compared to neutral and back-PoA primes (see Fig. 5). Across the three middle steps (for which priming was balanced), participants made a front-PoA response 47% of the time they had a front-PoA prime, 41% of the time they had a neutral prime, and 38% of the time they had a back-PoA prime.

Fig. 5 Responses from Experiment 3a showing decisions about the first phoneme of a target continuum (all continua were between /s/ and /ʃ/, e.g., same-shame) after priming, where endpoint tokens always had consistent primes. The x-axis shows the continuum steps, ranging from most front (/s/) to most back (/ʃ/). The y-axis shows proportion of front (/s/) responses. Colors and shapes indicate the prime that participants saw. Error bars represent 95% confidence intervals. The absence of data for front primes at Step 2 and for back primes at Step -2 reflects the reliability manipulation; in contrast to Experiments 1 and 2, primes were never presented in the “high-conflict” cases where primes clearly mismatch endpoint items

Our mixed effects logistic regression model revealed a significant effect of Prime (χ2 = 15.20, p < .001), indicating that participants’ responses were influenced by the written prime. The model also revealed a significant effect of Step (χ2 = 18.80, p < .001), indicating that participants made more front-PoA responses for more front-PoA steps. The interaction between Prime and Step was not significant (χ2 = 0.82, p = .66), suggesting that there was no difference in the effect of priming across continuum steps.

We conducted follow-up tests to analyze pairwise comparisons for Prime. There were more front-PoA responses for a front prime than for a neutral prime (contrast estimate: .486, z-ratio = 4.986, p < .001), and more front-PoA responses for a front prime than for a back prime (contrast estimate: .516, z-ratio = 3.170, p = .003). However, there were no differences in responses after a back prime and after a neutral prime (contrast estimate: .030, z-ratio = 0.312, p = .936).

Examining the effect of reliability with one cue. To examine how reliability affected the use of priming, we compared results from Experiment 1 to Experiment 3a (restricting analyses to exclude endpoint tokens). These analyses were not pre-registered and hence are exploratory. Our model followed a similar structure as the main model for Experiment 3a, with the addition of a fixed effect of Reliability (Experiment 1: unreliable, Experiment 3a: reliable; sum-coded) and its interactions with Prime and Step, and additional by-target-pair random slopes for Reliability and its interactions with Prime and Step. Models for reaction time data can be found in the OSM.
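A rough sketch of this exploratory model (with assumed object and column names, and with Reliability as a between-subjects factor) is:

```r
# Combine non-endpoint trials from Experiments 1 (unreliable) and 3a (reliable)
d13 <- subset(rbind(d_e1, d_e3a), abs(step) < 2)
d13$reliability <- factor(d13$reliability)         # unreliable vs. reliable
contrasts(d13$reliability) <- contr.sum(2)         # sum-coded

m_rel <- mixed(
  resp_s ~ prime * step * reliability +
    (prime * step || subject) +                    # reliability varies between subjects
    (prime * step * reliability || target_pair),
  data = d13, family = binomial, method = "LRT", expand_re = TRUE
)
```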

Response data. Because response data for Experiment 1 and Experiment 3a have been analyzed and presented in results above, we focus only on effects in the model that relate to Reliability (for a plot of both datasets together see Fig. 6). Our mixed-effects model revealed a significant Prime by Reliability interaction (χ2 = 9.52, p = .009), which appears to be attributable to the lack of a back-neutral difference in the Reliable condition (Experiment 3a). No other effects or interactions involving Reliability were significant (all ps > .05).

Fig. 6 Response data for Experiment 1 (unreliable prime-target relationship) and Experiment 3a (reliable prime-target relationship). We were interested in whether responses would differ based on the reliability of the prime and target relationship. The x-axis shows continuum step, ranging from most front (/s/) to most back (/ʃ/). The y-axis shows proportion of front place of articulation (PoA; /s/) responses. Colors and shapes indicate the prime that participants saw, and line type indicates the reliability condition. Error bars represent 95% confidence intervals

Discussion

We do not find evidence that increasing the reliability of the priming information (i.e., only presenting consistent primes before clear endpoint tokens) increases reliance on priming information when hearing ambiguous auditory tokens. While we still observe an overall effect of prime (and importantly a difference in responses between front and back primes) for the ambiguous tokens, participants were no more likely to respond in accordance with the prime when endpoint tokens were only paired with consistent primes (Experiment 3a) than when endpoint tokens were also paired with mismatching primes (Experiment 1). It is possible that this reflects a ceiling effect; the priming effect may have already been maximally large, and so increasing reliability did not make any difference. It is also possible that we did not change endpoint-priming consistency sufficiently to make listeners reweight reliance on the prime (given that we only manipulate the reliability of endpoint steps, whereas Bushong & Jaeger (2019) manipulated reliability across all steps). Future work with a more graded priming manipulation (e.g., where Steps 1 and -1 are not presented equally with both primes) could test this possibility. Future work could also explore whether changing the reliability of priming in the opposite direction (i.e., making primes less reliable or even completely unreliable) would affect the impact of primes.

Experiment 3b

The goal of Experiment 3b was to replicate Experiment 2, with the addition of the same reliability manipulation used in Experiment 3a. Namely, all non-neutral primes for endpoint auditory tokens matched the token.

Methods

Materials Stimuli were the same as for Experiment 2.

Participants We collected data from 55 participants in order to achieve our pre-registered target sample size of 40 participants (25 female, 15 male; age range 18–34 years, mean age: 27 years). Experimental sessions took approximately 60 min. Participants were paid $12 for their participation, consistent with Connecticut minimum wage ($12/h at the time). We used the same recruitment and exclusionary criteria as in Experiment 1.

Procedure Procedures were the same as for Experiment 2 with the same key change as in Experiment 3a: when the auditory target item was an endpoint token (i.e., the most /s/-like or /ʃ/-like token), the written prime was never a mismatch (i.e., participants only saw SHAME or ######## before hearing shame). For each target continuum, then, endpoint tokens were presented with matching primes two-thirds of the time, and neutral primes one-third of the time (as opposed to one-third matching, one-third mismatching, and one-third neutral as in Experiment 2).

Analyses Analyses followed the same structure as for Experiment 2.

Results

As shown in Fig. 7, participants were influenced by both the prime and the auditory context items. Across the three middle steps, participants made more front-PoA responses when they heard an auditory context item that ended in a back-PoA (overall 49%) than when the context item ended in a front-PoA (overall 34%). Likewise, they made more front-PoA responses when they saw a front prime (overall 46%) than when they saw a neutral prime (overall 41%) or a back prime (overall 39%). Breakdowns of the proportion of front responses by condition can be found in Table 2.

Fig. 7 Responses from Experiment 3b showing decisions about the first phoneme of a target continuum (all continua were between /s/ and /ʃ/, e.g., same-shame) after a visual prime and an auditory context item, where endpoint tokens always had consistent visual primes. The x-axis shows continuum step, ranging from most front (/s/) to most back (/ʃ/). The y-axis shows proportion of front (/s/) responses. Colors and shapes indicate the prime that participants saw. Line type indicates context item PoA. Error bars represent 95% confidence intervals. (A) Effect of prime. (B) Effect of auditory context. (C) Effects of both prime and context

Our mixed effects logistic regression model revealed significant effects of Prime (χ2 = 15.37, p < .001) and Context (χ2 = 17.56, p < .001), indicating that participants’ responses were influenced by the written prime and the place of articulation of the auditory context. The model also revealed a significant effect of Step (χ2 = 19.92, p < .001), indicating that participants made more front-PoA responses for more front-PoA steps. None of the interactions were significant (all ps > .05).

We conducted follow-up tests to analyze pairwise comparisons for Prime. There were more front-PoA responses for a front prime than for a neutral prime (contrast estimate: .387, z-ratio = 4.742, p < .001), and more front-PoA responses for a front prime than for a back prime (contrast estimate: .516, z-ratio = 4.348, p < .001). However, the difference between front-PoA responses after a neutral prime and after a back prime was not statistically significant (contrast estimate: .128, z-ratio = 1.610, p = .225).

Examining the effect of reliability with multiple cues. To examine how reliability affected the impact of priming, we compared results from Experiment 2 to Experiment 3b (restricting analyses to exclude endpoint tokens). These analyses were not pre-registered and hence should be interpreted as exploratory. Our model followed a similar structure as the main model for Experiment 3b, except that we did not include Context as a factor (as we were interested in the effect of prime and were likely not powered for a four-way interaction) and instead included a fixed effect of Reliability (Experiment 2: unreliable, Experiment 3b: reliable; sum-coded) and its interactions with Prime and Step, and additional by-target-pair random slopes for Reliability and its interactions with Prime and Step. Models for reaction time data can be found in supplementary materials.

Response data. Because response data for Experiment 2 and Experiment 3b have been analyzed and presented in the Results above, we focus only on effects in the model that relate to Reliability (for a plot of both datasets together see Fig. 8). No effects or interactions involving Reliability were significant (all ps > .05).

Fig. 8 Response data for Experiment 2 (unreliable) and Experiment 3b (reliable), based on Prime (collapsed across auditory Context). The x-axis shows the continuum steps, ranging from most front (/s/) to most back (/ʃ/). The y-axis shows proportion of front place of articulation (PoA; /s/) responses. Colors and shapes indicate the prime, and line type indicates reliability condition. Error bars represent 95% confidence intervals

Discussion

Similar to the results presented in Experiment 3a, we find that increasing the reliability of the priming information (i.e., only presenting consistent primes before clear endpoint tokens) did not increase the effect of prime at the level of the response data. Even though we did not find a reliability effect in Experiment 3a (with just one cue), it was possible in Experiment 3b that we would observe an effect of reliability in the presence of two cues, but we did not. It is possible that making prime an even more reliable cue might shift the overall weighting of cues, such that we would observe more reliance on the prime (and subsequently less reliance on the auditory context). Alternatively, as we discussed with respect to Experiment 3a, the lack of reweighting observed in Experiment 3b is consistent with the possibility that listeners were already maximally impacted by the prime.

As an additional test of whether we might see a different effect of the prime depending on whether or not there was auditory context information, we compared the effects of prime in Experiments 3a and 3b (a parallel analysis to that discussed in the Discussion of Experiment 2). Neither the effect of experiment nor any interactions involving experiment and prime were significant in this model. Thus, at least in the case of these two specific cues, listeners’ decisions about ambiguous phonemes appear to be influenced by all available constraints.

General discussion

Overall, the goal of this project was twofold: to investigate how people make use of (or are impacted by) qualitatively different cues when identifying phonemes and to assess how the reliability of a cue influences its use (or impact). As we discussed in the Introduction, most work on constraint integration has focused on one constraint at a time, or qualitatively similar constraints (e.g., pairs of acoustic-phonetic cues in trading relations studies; Repp, 1982). We know less about whether or how listeners integrate qualitatively distinct kinds of constraints or reconcile them when they conflict or vary in reliability. In the following subsections, we will briefly discuss our three key findings: (1) that resolution of acoustic-phonetic ambiguities can be influenced by visual primes that precede the point of ambiguity by more than a second; (2) that when we crossed two qualitatively distinct constraints (visual priming and co-articulatory context), we observed additive effects (rather than interactions, or differential use of one constraint in the presence of the other); and (3) that this additivity persisted even when we enhanced the reliability of the visual prime (by never presenting primes that would conflict with unambiguous endpoint tokens). After discussing each of these major findings, we will turn to the implications for theories of language processing.

Identity priming influences phoneme identification

We first established, in Experiment 1, that visual identity priming influences identification of ambiguous phonemes. While this finding was expected, we note that, to our knowledge, this study provides the first such test of the influence of visual priming on the identification of ambiguous phonemes.

Previous work has found that (semantic) primes exert an early influence on acoustic-phonetic processing. Getz and Toscano (2019) used a paradigm with a visual prime that was semantically related to one end of an ambiguous voice onset time (VOT) minimal pair (e.g., with a token ambiguous between park and bark, AMUSEMENT would be a semantic prime for park). They found that semantic primes do influence perception of ambiguous tokens, and that this (semantic) influence occurs very early: about 100 ms after the onset of the target stimulus, listeners are already more likely to perceive the onset phoneme as consistent with the prime (as indexed by the N100 ERP component, which has been shown to reflect perceptual information such as VOT linearly; Toscano et al., 2010). For example, if AMUSEMENT precedes the auditory token [?]ark, the N100 for the ambiguous token more closely resembles the N100 for a clear /p/ token than if KITCHEN precedes [?]ark.

Because priming can serve as a proxy for prediction, creating a possibly implicit expectation (Blank & Davis, 2016; Sohoglu et al., 2014), our finding could provide a basis for examining how prediction strength influences perception. Getz and Toscano (2019) demonstrated that semantic priming affects perception of ambiguous phonemes in both behavioral and ERP measures, whereas we showed that visual identity priming affects perception of ambiguous phonemes in behavioral measures; future work could therefore investigate how visual priming affects ERP measures. To what degree, for example, would the shift in the N100 that results from a semantic prime (Getz & Toscano, 2019) reflect the strength of the prediction? By varying the level of relatedness between the prime and the ambiguous target (from unrelated, to various degrees of semantic association, to identity priming, where the prime and target are identical), future work could examine how the spectrum of relatedness between prime and target is reflected in the N100.

Other manipulations, such as a more comprehensive reliability manipulation (Bushong & Jaeger, 2019) or the introduction of visual noise (to degrade the perceptual reliability of primes), could allow for further exploration of how the degree of prediction from prior knowledge affects perception. Such studies would help clarify the nature of early perceptual encoding, in particular whether the shift in the N100 reflects the degree of perceptual change the participant experiences.

Use of qualitatively distinct constraints

When listeners had both a visual prime and acoustic co-articulatory context available to constrain resolution of an ambiguous fricative, we observed an additive effect of the two cues. Because these cues were qualitatively distinct (occurring in different modalities and with different temporal relations to the point of ambiguity), it was quite possible that we would observe differential use of one source of information in the presence of the other. The auditory context immediately preceded the target token and occurred in the same modality as the target; the written prime, on the other hand, was explicit and occurred first, which could have made it the more salient cue (particularly in Experiment 3, where clear mismatches were removed).
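To make the notion of additivity concrete: in the logistic regression framework used here, additivity corresponds to a model in which each cue contributes its own shift in log-odds and interaction terms are negligible. Schematically (the coefficients below are labels for exposition, not fitted values):

\[ \operatorname{logit} P(\text{front-PoA response}) = \beta_0 + \beta_{\text{prime}} X_{\text{prime}} + \beta_{\text{context}} X_{\text{context}} + \beta_{\text{step}} X_{\text{step}} \]

On this description, the shift contributed by the prime is the same size regardless of the value of the auditory context, and vice versa.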

Reaction time analyses (see OSM) also support the idea that listeners are independently sensitive to the different types of information. Specifically, pre-registered reaction time analyses indicated that participants were faster when the target was preceded by any visual word prime (e.g., faster responses to the target following a prime of SIGN or SHINE relative to a neutral prime, ########) and were also faster for unambiguous continuum steps (a clear auditory sign or shine) than for an ambiguous auditory stimulus. These findings indicate that having any priming information made participants faster, whether or not this information conflicted with the auditory co-articulatory information, suggesting independent processing of priming and co-articulatory sources of information.

The finding of additivity of constraints is also in line with findings from Lai et al. (2022), who similarly found additive effects of lexical and co-articulatory constraints. However, the present work differs from the study by Lai et al. (2022) in several key ways. Though the nature of the co-articulatory constraints was similar in the two studies, lexical constraints differed in modality (auditory in Lai et al.'s study, visual in the current work) and in the timing of lexical information (in Lai et al.'s study, lexical status could only be determined after the lexical uniqueness point, whereas in the current work, visual primes were provided ~1 s before the auditory target). It is striking that both studies observed additive effects of lexical information and co-articulatory context, despite differences in the modality and timing of lexical information. It is particularly noteworthy that we still observed additive effects of these cues in Experiment 3, when we changed the reliability of the cues. Whether these results will hold over different combinations and different numbers of cues remains an open question. Nonetheless, the current findings might be extended to questions involving individual differences that affect the impact or availability of constraints as well as questions about the mechanisms of cue integration across different types of cues. We consider both of these possible extensions.

Individual differences

The results of Lai et al. (2022) provide some evidence for a trade-off between reliance on high-level contextual information and reliance on low-level acoustic information. In their study, an analysis of individual differences indicated that listeners who relied strongly on lexical knowledge relied relatively less on co-articulatory cues, and vice versa. Non-preregistered analyses of our data (see OSM) found a similar pattern in Experiment 2 (i.e., listeners who relied strongly on lexical information relied relatively less on co-articulatory information, and vice versa). Notably, however, there was no such relationship in Experiment 3 (where we manipulated the reliability of the cues). The analysis for Experiment 3 only considered non-endpoint continuum steps, and strikingly, neither the correlation in Lai et al. (2022) nor the correlation in our Experiment 2 is significant when only middle steps are included. Thus, the two studies provide some evidence for a trade-off in how strongly listeners rely on lexical knowledge versus co-articulatory information, but that trade-off might be driven primarily by continuum endpoints, where one cue (acoustics) is likely to dominate.

There are many ways in which these effects might vary across different types of listeners. Older adults, for example, have been shown to rely more on higher-level contextual information (e.g., lexical information as opposed to acoustic information; Mattys & Scharenborg, 2014). Interestingly, however, work by Luthra, Peraza-Santiago, Saltzman, and colleagues (2021) demonstrated that older adults do not show a larger influence of lexical information when resolving phonetic ambiguities in the case of co-articulation. In other words, listeners who potentially rely more on lexical information do not show larger CfC effects, suggesting that the CfC effect is robust but perhaps already at ceiling in the average listener. The fact that we observed additive effects of auditory context and written primes, however, suggests that larger perceptual shifts are still possible with additional sources of information. Future work could examine whether older adults (or others who tend to rely more on contextual information; see Crinnion et al., 2021, and Kaufeld et al., 2019) demonstrate larger effects of the prime (at least when biasing information is consistent) than younger adults (or individuals who rely more on the acoustic signal).

Previous research has also suggested that variation in the impact of acoustic information (e.g., sensitivity to subphonemic information; Li et al., 2019) and in the impact of lexical information (e.g., relative reliance on lexical vs. acoustic information; Giovannone & Theodore, 2021) is related to variation in language abilities (e.g., phonological skills, expressive and receptive language abilities). We would predict, then, that individual differences in language ability might influence the relative weighting of cues. In a situation where individuals have two sources of information that could potentially influence interpretation of acoustic cues, it would be interesting to test whether those with weaker language skills rely more on the prime (as opposed to the auditory context, which is an acoustic cue), since those with weaker language abilities tend to rely more on lexical-level information than on acoustic information (Giovannone & Theodore, 2021; though see Li et al., 2019, for evidence of individuals with lower language abilities relying more on acoustics).

Finally, it is important to consider how listeners’ tendencies to use acoustic cues may influence how they make use of multiple sources of information. Work by Kapnoula and colleagues (2017) demonstrated that more gradient listeners (i.e., listeners who, when asked how /s/- or /ʃ/-like a certain token sounds, respond in a less binary fashion) are more likely to use a secondary acoustic cue. In the current paradigm, we might expect more gradient listeners to be influenced more by the auditory context than by the written prime. It is also possible, however, that they would simply be more likely to use any type of additional information.

Mechanisms of cue integration

Future work can also examine the mechanisms behind additive cue use in speech perception. First, understanding this phenomenon more broadly requires that other combinations of cues be tested. While the fact that our cues were quite distinct from each other might suggest that any pair of cues would result in additive effects, this remains an open question. Second, the current work cannot speak to how listeners used both cues over the course of a trial. Particularly because the timing of our two cues differed considerably (the visual prime appeared early, over 1 s before the target, whereas the auditory context immediately preceded the target), examining the time course of processing will be important for future work.

Eye tracking or EEG would make this possible and provide a clearer picture of how and when the observed additivity arises. In fact, eye-tracking work examining the integration of acoustic-phonetic cues, and even of lexical and phonetic cues, suggests that information influences processing as soon as it is available (Kingston et al., 2016; McMurray et al., 2008; Reinisch & Sjerps, 2013; Toscano & McMurray, 2015; see also Dahan et al., 2001, and Li et al., 2019). To create a visual world analog of our paradigm, we could present participants with visual referents for the target words (e.g., a sun for shine and a stop sign for sign; see Kaufeld et al., 2019, for a similar approach) and examine where people look as the different cues unfold (perhaps using an auditory identity prime, to avoid visual interference with the referents, followed by the auditory context). Such a design could shed light on whether, over the course of a given trial, listeners commit early to a single source of information or continue to consider both sources once both are available.

ERP analysis of the N100 could also shed light on how predictions from the prime and the auditory context are integrated. By comparing ERPs on trials with just one source of information to ERPs on trials with two (potentially competing) sources, we could test whether the N100 response is similar in the two cases or instead shows something analogous to summation (for similar approaches, see Getz & Toscano, 2019, and Noe & Fischer-Baum, 2020). ERPs could also allow us to examine whether the locus of impact for both cues is perceptual (i.e., evident at the N100) or whether one cue has a later (potentially post-perceptual, decision-stage) impact.

Cue reliability

In Experiment 3, we increased the reliability of the written prime to test whether listeners would use the prime more than they did in Experiments 1 and 2. Findings from Bushong and Jaeger (2019) suggest that when acoustic information tracks more realistically with lexical information in the sentence context (i.e., when unambiguous endpoint tokens, such as dent, are always paired with related sentence contexts, such as dent in the fender, and never with unrelated contexts, such as dent in the forest), listeners rely more on that lexical information. Hence, we expected that when visual lexical priming was more consistent with the acoustic information (by never pairing visual primes with the clear endpoints that conflict with them, as in Experiments 3a and 3b), listeners would rely more on the prime. Similarly, work by Idemaru and Holt (2011) suggests that when two cues track each other in a more naturalistic way, listeners are more likely to use the secondary cue (F0) when the cue they would typically use (VOT) is ambiguous than when the correlation between the two cues is reversed relative to natural speech. These findings again suggest that listeners might use the prime more when the acoustic information tracked more consistently with it, as in Experiment 3a and even in Experiment 3b (when listeners had yet another cue to use). However, studies comparing cue weighting across more and less reliable cue relationships typically reveal down-weighting in the unreliable cases, but often find no difference between neutral and reliable relationships between two cues (e.g., Idemaru & Holt, 2011). Furthermore, Kim et al. (2020) found that in order to boost reliance on a weaker cue, the primary cue needed to be made less reliable.
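To illustrate the logic of such reweighting (this is an expository sketch, not a claim about the mechanism listeners use, and all values below are invented), one simple formalization lets each cue's contribution in log-odds space scale with its estimated reliability:

import numpy as np

def combined_logit(prime_logit, context_logit, prime_rel, context_rel):
    """Weight each cue's log-odds contribution by its normalized reliability."""
    w = np.array([prime_rel, context_rel], dtype=float)
    w /= w.sum()
    return w[0] * prime_logit + w[1] * context_logit

def p_front(logit):
    """Convert log-odds to the probability of a front-PoA response."""
    return 1.0 / (1.0 + np.exp(-logit))

# A prime favoring /s/ (positive log-odds) and a context favoring /ʃ/ (negative).
print(p_front(combined_logit(1.0, -0.5, prime_rel=0.5, context_rel=0.5)))  # equal reliability
print(p_front(combined_logit(1.0, -0.5, prime_rel=0.9, context_rel=0.5)))  # more reliable prime

The proportional weighting above is simply an illustrative stand-in; in ideal-observer accounts, weights are typically proportional to each cue's inverse variance. Under any such scheme, making the prime more reliable would be expected to shift responses toward the primed category.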

We did not see such an effect of our reliability manipulation. Because our experiment used a small set of repeated stimuli, it is possible that listeners were already maximally impacted by visual priming. Increasing variation in the stimuli or adding noise to them might result in a detectable impact of the reliability manipulation. We might also see changes in the impact of priming with a more graded priming manipulation (where not only the endpoints but all steps were paired with their more likely prime in proportion to their ambiguity). Our task (two-alternative forced choice) may also have limited the types of effects we were able to detect; for example, it provided no information about variation in how /s/-like participants perceived a given token. It would therefore be interesting to examine whether effects of reliability emerge with a different behavioral task, such as one that asks participants to rate how /s/- or /ʃ/-like a certain token sounds. Additionally, future work should explore whether changes in reliability (introduced, for example, by making the priming information less reliable) differentially affect scenarios in which there is just one cue to rely on (as in Experiments 1 and 3a) as opposed to multiple sources of information (as in Experiments 2 and 3b). This line of work would parallel findings that reliance on a secondary cue changes when it is made less reliable (Idemaru & Holt, 2011).

Implications

Our current findings add to the growing body of literature suggesting that (at least at the group level) listeners demonstrate patterns of additivity across information provided by different cues. As discussed above, building a more comprehensive mechanistic account of cue integration across different types of cues is a necessary step for the field.

Many frameworks for language processing, such as interactive activation models (e.g., TRACE; McClelland & Elman, 1986), autonomous models (e.g., Shortlist B; Norris & McQueen, 2008), and other Bayesian frameworks (e.g., the Ideal Adapter; Kleinschmidt & Jaeger, 2015), have mechanisms for implementing effects of information from different types of cues. TRACE can simulate phonetic context effects, and the original and current implementations (jTRACE; Strauss et al., 2007) include (mainly unexplored) facilities for simulating priming. In a Bayesian model, effects of a prime could be implemented, for example, by adjusting the prior distribution over categories (a toy illustration of this option is sketched below).
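The following minimal sketch (with invented category means, likelihood parameters, and prior values, purely for exposition rather than a model of our data) shows how a prime could be treated as setting the prior over the two fricative categories, which is then combined with an acoustic likelihood via Bayes' rule:

import numpy as np
from scipy.stats import norm

def posterior_s(acoustic_value, prior_s=0.5, mu_s=-1.0, mu_sh=1.0, sd=1.0):
    """P(/s/ | acoustics, prime) for a 1-D acoustic cue with Gaussian likelihoods."""
    like_s = norm.pdf(acoustic_value, loc=mu_s, scale=sd)
    like_sh = norm.pdf(acoustic_value, loc=mu_sh, scale=sd)
    num = prior_s * like_s
    return num / (num + (1.0 - prior_s) * like_sh)

ambiguous = 0.0  # a token midway between the two category means
print(posterior_s(ambiguous, prior_s=0.5))   # neutral prime: posterior stays at 0.5
print(posterior_s(ambiguous, prior_s=0.75))  # /s/-biasing prime: posterior shifts toward /s/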

However, the paradigms we used here are not easy to simulate with current models of spoken word recognition. In TRACE, for example, priming is implemented by the researcher pre-setting the resting activation level of a given word, but it is not clear what impact a spoken word intervening between the prime and the target should have. Similarly, in a Bayesian framework, one could assume that the system retains information from a prime until it becomes relevant and then integrates it at the appropriate point in the stimulus. A problem with either approach is that it can become an exercise in data fitting rather than a general processing framework: the researcher simply stipulates that priming happens and fits a parameter for its magnitude, but neither approach provides an integrated way to simulate the actual phenomenon of priming, especially the impact of temporally distant primes. Priming thus appears to be outside the explanatory scope of current models (Footnote 1).

In the current work, for example, the prime occurred over one second before the ambiguous auditory information and yet was reliably used to resolve this ambiguity. A better understanding of how the temporal proximity of various sources of information constrains their use is needed for a more complete model of speech perception. It could be the case, for instance, that the high degree of ambiguity in the current experiments (i.e., listeners heard many ambiguous tokens, perhaps more than are typically encountered in naturalistic listening environments) broadened the window of time over which listeners were willing to consider information as relevant for resolving ambiguities in the speech. On our view, future research seeking to build a more mechanistic understanding of the types of information that may be considered should directly examine the functional time window of integration, which may be context dependent. More temporally distant cues may be considered under higher degrees of uncertainty; varying the amount of uncertainty and manipulating the timing of previous information (and potentially of intervening information) would allow us to assess whether the system broadens its window of influence when more uncertainty is present. The possibility that the temporal window of cue integration expands under conditions of ambiguity is intriguing, especially because it would place important constraints on the underlying architecture of the speech processing system as well as on cognitive theories of speech perception.

Another important issue for integration processing frameworks to consider is the specific type of information involved. There are, for example, modeling approaches that explain how different acoustic cues are weighted in the perception of speech sounds (Crinnion et al., 2020; McMurray & Jongman, 2011), and even accounts of how a system might compute these cues relative to other types of context (e.g., who is talking; McMurray & Jongman, 2011). However, it is also important to understand, from a neurocomputational perspective, what function different types of information serve (Luthra, Li, et al., 2021a, 2021b, 2021c). For example, do higher-level sources of information support prediction, such that changes in neural activity would reflect some error signal (i.e., deviation from that prediction), or do they serve as sources of information that boost activation of consistent signals?

A first step toward answering some of these questions will come from more extensive exploration of cue additivity across a variety of constraints. As suggested by a reviewer, a metastudy approach (meta-analysis leveraging Bayesian hierarchical modeling; Baribault et al., 2018; DeKay et al., 2022) could be particularly powerful in this domain. Pooling data from many studies could allow for fruitful exploration of how qualitatively different cues (from different levels of linguistic representation) and quantitatively different cues (differing in strength, reliability, or timing, for example) are integrated in speech processing.
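As a rough illustration of what such pooling might look like (the effect sizes and standard errors below are invented placeholders, and this is only one of many possible specifications), a simple Bayesian random-effects meta-analysis of a cue effect could be written as:

import numpy as np
import pymc as pm
import arviz as az

effects = np.array([0.40, 0.25, 0.55, 0.30])  # hypothetical per-study log-odds shifts
ses = np.array([0.10, 0.12, 0.15, 0.08])      # hypothetical standard errors

with pm.Model() as meta_model:
    mu = pm.Normal("mu", mu=0.0, sigma=1.0)           # population-level effect
    tau = pm.HalfNormal("tau", sigma=0.5)             # between-study variability
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=len(effects))  # per-study effects
    pm.Normal("obs", mu=theta, sigma=ses, observed=effects)
    idata = pm.sample(1000, tune=1000, chains=4, random_seed=1)

print(az.summary(idata, var_names=["mu", "tau"]))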

Conclusion

In everyday speech, listeners often have many sources of information available (e.g., who is talking, what the general conversational topic is, how fast the person is talking, etc.) that can potentially help resolve acoustic ambiguities. While much research has examined how different types of information can, individually, influence identification (and arguably, perception) of ambiguous speech sounds, how listeners make use of multiple (sometimes conflicting) sources of information for resolution of a single ambiguity remains underspecified.

A challenge for existing computational frameworks will be to simulate the additive effects we observed despite differences in modality and timing, and to derive the predictions that would then follow, for instance, regarding shifts in cue reliability. Finally, to what extent are listeners actually influenced by all available information, and when might the impact of different cues change? Moreover, how consistent are such patterns across individuals? While many questions remain, the findings reported here provide a first step toward understanding how two cues that are qualitatively distinct (in timing and modality) can simultaneously and independently influence language processing.

Table 1 Proportions of front-posterior place of articulation (PoA) responses based on condition in Experiment 2
Table 2 Proportion of front-posterior place of articulation (PoA) responses based on condition in Experiment 3b