Introduction

Noise is unavoidable in everyday speech comprehension: another speaker at a cocktail party, the sound of traffic horns on the street, or the echoes in an empty hall. Maintaining relatively robust speech comprehension in noisy environments is therefore of great importance to human life. A large body of psychological, cognitive and neuroscientific research has investigated how people comprehend noisy speech and has yielded a wealth of discoveries (Coffey et al. 2017; Dryden et al. 2017; Alain et al. 2018). For example, multiple mechanisms, e.g., an auditory mechanism and a sensorimotor mechanism, have been found to support speech-in-noise comprehension in distinct ways and to correspond to different neural bases (Du et al. 2014; Guediche et al. 2014; Etard and Reichenbach 2019). However, most previous studies have investigated this issue using highly controlled, short-duration artificial stimuli, such as phonemes and words (Hamilton and Huth 2020), which fail to resemble the naturalistic speech used in real-life scenarios. Moreover, traditional neuroscience routinely adopted a single-brain or third-person approach. Participants were often isolated from the natural environment of interpersonal communication and required to accomplish a series of simple tasks instructed only by a computerized program (Hasson et al. 2012). Such a single-brain approach asked participants to solely and passively perceive a non-interactive stimulus (Redcay and Schilbach 2019), neglecting the interpersonal nature of communication through language (Jiang et al. 2021). These experimental settings differ markedly from naturalistic speech situations. Consequently, the neurocognitive mechanisms of speech-in-noise comprehension remain largely unclear.

In recent years, advances in simultaneous dual- or multi-brain measurement techniques (also known as ‘hyperscanning’, Montague et al. 2002) have given rise to a new approach in neuroscience: the inter-brain or second-person neuroscience approach (Schilbach et al. 2013; Hasson and Frith 2016; Redcay and Schilbach 2019). In contrast to the traditional single-brain or third-person approach, which focuses on estimating each individual’s neural responses to highly controlled and simplified stimuli, e.g., phonemes and words, the second-person approach measures the neural activities of socially interactive agents (i.e., speaker and listener) during interaction and analyzes how the coherence or coupling of their neural activities varies across conditions or correlates with interactive behavior (Czeszumski et al. 2020; Kingsbury and Hong 2020; Holroyd 2022). This offers a novel perspective for investigating the neural basis of speech-in-noise comprehension from an integrative view. In this review, we first briefly review previous findings about how the brain processes speech in noise and discuss their limitations; next, we introduce the second-person neuroscience approach and its advantages over the classical third-person approach; then, we discuss how the second-person approach could help reveal the linguistic and extralinguistic processes underlying speech-in-noise comprehension; finally, we conclude by raising some critical issues and calling for more research interest in the second-person approach to the neural mechanisms of speech-in-noise comprehension.

How does the brain process speech in noise?

Dual mechanisms, i.e., an auditory mechanism and a sensorimotor mechanism, have long been reported to support speech-in-noise comprehension (Du et al. 2014; Alain et al. 2018; Etard and Reichenbach 2019). The auditory mechanism refers to the faithful, bottom-up processing of multi-level linguistic information. It is associated with the brain regions responsible for acoustic, phonological, syntactic and lexical-semantic processing, which are mainly located in the temporal lobe (Hickok and Poeppel 2007; Price 2012), with an extension to frontal regions for complex linguistic computation (Friederici 2012; Fedorenko and Blank 2020). The auditory mechanism can filter out noise by selectively processing the target speech while suppressing the encoding of noise based on their distinct acoustic statistics (Guediche et al. 2014; Herrmann et al. 2014; Etard and Reichenbach 2019; Vander Ghinst et al. 2019; Marrufo-Perez et al. 2020), or resolve the noise-induced ambiguity of speech information by integrating it into the linguistic context (Zekveld et al. 2011; Golestani et al. 2013; Shi and Koenig 2016; Rysop et al. 2021).

In contrast to the auditory mechanism, the sensorimotor mechanism refers to the internal generation of linguistic information and its subsequent integration with the actual sensory input. It is associated with production-related regions covering the left posterior frontal lobe and the sensorimotor interface located at the posterior dorsal-most aspect of the left temporal lobe (Hickok and Poeppel 2007; Pulvermuller and Fadiga 2010; Sehm et al. 2013; Du et al. 2014; Alain et al. 2018). It supports speech-in-noise comprehension by compensating for noise-masked linguistic information through motor simulation (Liberman et al. 1967; Hickok and Poeppel 2007; Pulvermuller and Fadiga 2010) or content-based prediction (Hickok et al. 2011; Pickering and Garrod 2013; Schomers and Pulvermuller 2016). Sensorimotor-related regions in the frontal and parietal lobes have been broadly reported to be activated in noisy conditions (Du et al. 2014; Alain et al. 2018).

While both mechanisms play supportive roles during speech-in-noise comprehension, the sensorimotor mechanism appears more robust against noise than the auditory mechanism. This is because the sensorimotor mechanism can benefit from linguistic information generated by an internal model, whereas the auditory mechanism relies on the relative completeness of the external auditory input. Thus, when increasing background noise has degraded many acoustic and linguistic details of the speech, the sensorimotor mechanism may be more adaptive and supportive for speech-in-noise processing, as the auditory mechanism may no longer suffice to support comprehension. Du et al. (2014) used fMRI to measure neural activities in auditory-related and sensorimotor-related regions while people listened to phoneme tokens at various signal-to-noise ratio (SNR) levels, i.e., no noise, 8, -2, -6, -9 and -12 dB. While activation of the anterior superior temporal gyrus (STG) and the anterior and posterior middle temporal gyrus (MTG), among other regions, decreased with increasing background noise, neural activity in the sensorimotor regions, e.g., the anterior insula and adjacent Broca’s area and the ventral premotor cortex, was enhanced. Furthermore, the multivoxel activity patterns in the sensorimotor regions exhibited effective phoneme representation even when the noise became more intense than the original speech, whereas phoneme representation in the auditory regions was disrupted by even very mild background noise (Du et al. 2014, 2016; Du and Zatorre 2017). These findings suggest that the sensorimotor mechanism is more adaptive and may play a more fundamental role when environmental noise becomes strong.
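For readers who wish to see how such SNR levels translate into stimulus construction, the following minimal Python sketch mixes a clean signal with white noise at a target SNR by scaling the noise power; the signals, sampling rate and function name are purely illustrative and are not taken from Du et al. (2014).

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`,
    then return the mixture. A minimal sketch; real studies also control the
    overall presentation level and may use non-white maskers."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR(dB) = 10 * log10(P_speech / P_noise_scaled)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noise_scaled = noise * np.sqrt(target_noise_power / noise_power)
    return speech + noise_scaled

# Illustrative usage with synthetic signals (16 kHz, 1 s)
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # stand-in for speech
noise = rng.standard_normal(16000)                           # white-noise masker
mixture = mix_at_snr(speech, noise, snr_db=-9.0)             # strong-noise condition
```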

In addition to the above dual mechanisms, several studies have also highlighted the importance of extralinguistic processing for speech-in-noise comprehension (McGowan 2015; Hernandez et al. 2020). Extralinguistic processing is an indispensable part of speech comprehension (Hasson et al. 2018; Hagoort 2019). On the one hand, because speech is intended for interpersonal communication, speech comprehension is not limited to linguistic processes but involves a broad range of extralinguistic processes that handle the situational and (both individually and socially) personal content of the speech (Hasson et al. 2018; Redcay and Moraczewski 2020; Yuan 2020). These are non-linguistic-specific, domain-general cognitive processes, including mentalizing, perspective taking, personal memory and knowledge, self- and social-cognition, and social emotion (Redcay and Moraczewski 2020; Yeshurun et al. 2021). On the other hand, extralinguistic processes can in turn modulate the hierarchical encoding and content-driven prediction of linguistic information in a top-down fashion. For example, manipulating listeners’ beliefs about the age, gender, or race of a speaker can influence the processing of speech signals (Hanulikova 2021; Kutlu et al. 2022; Yu 2022). Therefore, extralinguistic processing might also help resolve the interference of noise. For example, when presented with a congruent cue about the speaker’s social identity, i.e., race, people comprehend speech in noise better (McGowan 2015).

Although considerable effort has been devoted to exploring the neural mechanisms of speech-in-noise comprehension, it remains to be elucidated how people comprehend speech in noise in real-life scenarios. This is because traditional neuroscience typically adopts a reductionist and deductive approach to investigate the neural response to a particular stimulus (Hasson et al. 2012; Sonkusare et al. 2019; Kingsbury and Hong 2020; Holroyd 2022). Researchers have often used highly controlled, short-duration speech stimuli, such as phonemes, words, and single sentences (Anderson and Kraus 2010; Scharenborg and van Os 2019; Hennessy et al. 2022), to measure the behavioral or neural response to a particular speech stimulus in noise. However, these isolated materials do not resemble the continuity, complexity and dynamics of naturalistic speech. Moreover, they lack continuous context, which forms the essential basis for recovering the missing parts of noise-contaminated speech (Golestani et al. 2013; Hennessy et al. 2022). As a result, the neural mechanism of naturalistic speech comprehension in noise is still little understood. Besides the over-simplified stimuli, the corresponding simplification of speech-related tasks, such as the Quick Speech-in-Noise Test (QuickSIN), the Hearing in Noise Test (HINT), or the Words-in-Noise test (WIN) (Wilson et al. 2007, 2012; Holder et al. 2018), encouraged participants to simply perceive or comprehend speech in a decontextualized and non-social way (Guediche et al. 2014; Sonkusare et al. 2019; Hitczenko et al. 2020; Jaaskelainena et al. 2021). Such neglect of the interpersonal nature of speech has led to an underestimation of extralinguistic processing, e.g., mentalizing and perspective taking (Redcay and Moraczewski 2020). As discussed above, extralinguistic processing not only influences the linguistic processing of speech but also helps to resolve the interference of noise. Thus, to obtain a complete picture of the neurocognitive mechanisms of speech-in-noise comprehension, both naturalistic speech stimuli and paradigms that encourage people to comprehend speech as naturally as in real-life scenarios are needed.

From third-person neuroscience to second-person neuroscience

The naturalistic stimulus paradigm has been gaining popularity recently (Sonkusare et al. 2019). It refers to the use of naturalistic stimuli, such as natural speech, videos, and music, that people typically encounter in everyday life. While naturalistic stimulus paradigms can be employed in laboratory settings, they are expected to give an ecologically reasonable approximation of real-life situations by resembling the complexity, diversity and dynamics of everyday stimuli (Sonkusare et al. 2019). The use of natural speech not only improves the ecological validity of neuroscientific research but also extends previous knowledge about the neural mechanisms of speech processing (for a review, see Hamilton and Huth 2020). For example, brain regions far beyond the classical language-specific areas, i.e., Wernicke’s and Broca’s areas, were found to be activated when comprehending natural speech as compared with isolated and simple language materials (Huth et al. 2016; de Heer et al. 2017). In addition, a less left-lateralized response was observed when people listened to natural speech than to simple language stimuli (Hamilton and Huth 2020). While some researchers proposed that the rich meaning and long duration of naturalistic speech contribute to more extensive activation of bilateral higher-order cortical areas (Price 2012), others explained it as increased involvement of the right hemisphere in the processing of prosody (Si et al. 2017; Weed and Fusaroli 2020), emotion (Schirmer and Kotz 2006), social information (Alexandrou et al. 2017), etc., which is fully engaged by natural speech.

However, the use of natural speech poses a great challenge for the classical single-brain or third-person neuroscience approach, which routinely estimates the brain-to-stimulus contingency by measuring an individual’s brain response to a particular stimulus. For one thing, high-level linguistic information is difficult to code quantitatively and objectively, let alone the more implicit extralinguistic processes (Armeni et al. 2017). Although recent advances in natural language processing algorithms seem to offer quantitative and human-like descriptions of speech at linguistic levels, such as syntax (Nelson et al. 2017) and semantics (Broderick et al. 2018; Grand et al. 2022), and even at extralinguistic levels, such as sentiment or emotion (Tanana et al. 2021) and social state (Badal et al. 2021), these labels still require validation against human behavioral and neural data (Kingsbury and Hong 2020). Also, the multi-level linguistic and extralinguistic information is often interwoven, making it hard to neatly extract one particular feature from naturalistic speech. For another, even with quantitative labels of natural speech from human coding or computational language models, the continuous, time-varying and multivariate properties of natural speech still render the conventional analytical method, i.e., the event-related design with general linear modelling, ineffective. To address this issue, some powerful mathematical models, e.g., the temporal response function (Ding and Simon 2012; Mesgarani and Chang 2012; Golumbic et al. 2013; Broderick et al. 2018; Li et al. 2022a), have been developed or introduced into neuroscience. In line with traditional event-related modelling, they typically map one or several stimulus features onto the measured neural data to estimate how the listener’s brain processes particular linguistic information. However, these models pre-assume hypotheses about the brain and its correspondence to the stimulus that are sometimes too abstract and over-simplified (Sonkusare et al. 2019). For instance, the temporal response function approximates the brain as a linear time-invariant system, while the brain is neither linear nor time-invariant (Crosse et al. 2016). These assumptions somewhat limit the validity of the resulting explanations of the brain.
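To make the linearity assumption concrete, the following minimal Python sketch estimates a temporal response function by ridge regression over a time-lagged stimulus feature (e.g., a speech envelope); it explicitly treats the brain as a linear time-invariant filter, which is the simplification noted above. All variable names and parameter values are illustrative, not those of any cited study.

```python
import numpy as np

def estimate_trf(stimulus, response, fs, tmin=-0.1, tmax=0.4, alpha=1.0):
    """Estimate a temporal response function (TRF) mapping a 1-D stimulus
    feature (e.g., speech envelope) to a 1-D neural response via ridge
    regression over time lags. Assumes a linear time-invariant system."""
    lags = np.arange(int(tmin * fs), int(tmax * fs) + 1)
    n = len(stimulus)
    # Lagged design matrix: each column is the stimulus shifted by one lag
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, j] = stimulus[:n - lag]
        else:
            X[:lag, j] = stimulus[-lag:]
    # Ridge solution: w = (X'X + alpha*I)^-1 X'y
    w = np.linalg.solve(X.T @ X + alpha * np.eye(len(lags)), X.T @ response)
    return lags / fs, w

# Illustrative usage with synthetic data (64 Hz sampling rate)
fs = 64
rng = np.random.default_rng(1)
envelope = rng.standard_normal(fs * 60)                      # stand-in for a speech envelope
kernel = np.exp(-np.arange(0, 0.3 * fs) / (0.05 * fs))       # a decaying "neural" kernel
neural = np.convolve(envelope, kernel)[:len(envelope)] + rng.standard_normal(len(envelope))
lag_times, trf = estimate_trf(envelope, neural, fs)
```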

The recent advent of inter-brain or second-person neuroscience (Hasson et al. 2012; Schilbach et al. 2013; Redcay and Schilbach 2019; Kingsbury and Hong 2020) provides a novel solution for investigating natural speech comprehension in no-noise or noisy conditions. As shown in Fig. 1, in contrast to the single-brain approach, which relies on modelling people’s neural responses to a particular stimulus or feature, the inter-brain approach collects data from both the speaker’s and the listener’s brains and estimates how the time series of their neural signals are synchronized or coupled with each other (Czeszumski et al. 2020; Kelsen et al. 2022). Indeed, synchronization or alignment between the listener and the speaker is the basis for successful comprehension (Garrod and Pickering 2004; Hasson and Frith 2016). It entails shared processing of multi-level linguistic information, i.e., acoustics, phonology, syntax and semantics, and of extralinguistic information, such as the situational model (Garrod and Pickering 2004; Hasson and Frith 2016). Following this line, the listener’s neural activities underlying these multiple processes should also be synchronized or coupled with the speaker’s. Emerging studies have demonstrated that the neural activities of the speaker and the listener are indeed significantly coupled (e.g., Stephens et al. 2010; Jiang et al. 2012; Kuhlen et al. 2012; Dikker et al. 2014).

Speaker-listener neural coupling is achieved by the transfer of speech from the speaker to the listener (Hasson et al. 2012; Schoot et al. 2016; Kelsen et al. 2022). In essence, the speaker’s and the listener’s brains together can be viewed as a coupled two-source system that communicates via the wireless transmission of sound-based physical signals, i.e., speech (Hasson et al. 2012; Schoot et al. 2016; Kelsen et al. 2022). The emergence of brain-to-brain coupling relies on brain-to-stimulus coupling on both the speaker’s and the listener’s sides. Thus, inter-brain neural coupling disappears when no verbal communication takes place between the speaker and the listener (Stephens et al. 2010). Moreover, as the speech is originally organized and generated by the speaker, the production-related neural activity in the speaker’s brain can be regarded as a standardized reference for estimating how the listener’s brain processes the speech. By this logic, the more coupled the listener’s neural activity is to the speaker’s, the better the listener should comprehend the speaker. Numerous studies have provided supportive evidence that the level of speaker-listener neural coupling is positively correlated with the listener’s comprehension (e.g., Stephens et al. 2010; Dai et al. 2018; Liu et al. 2020). Thus, the strength of speaker-listener neural coupling can reflect whether, and to what degree, the listener’s brain is (correctly) processing the speech.
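Coupling can be quantified in many ways (e.g., correlation, coherence, wavelet transform coherence, Granger-type measures), and the studies cited above do not all use the same measure. As a minimal illustration, the Python sketch below quantifies speaker-listener coupling as the Pearson correlation between two neural time series and assesses it against a circular-shift null distribution; the signals and parameters are synthetic and illustrative.

```python
import numpy as np

def coupling_with_null(speaker: np.ndarray, listener: np.ndarray,
                       n_perm: int = 1000, seed: int = 0):
    """Pearson correlation between speaker and listener time series, compared
    with a null distribution built by circularly shifting the listener signal.
    A minimal sketch of one possible coupling measure, not the exact method of
    any specific study cited in the text."""
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(speaker, listener)[0, 1]
    null = np.empty(n_perm)
    for i in range(n_perm):
        shift = rng.integers(1, len(listener))
        null[i] = np.corrcoef(speaker, np.roll(listener, shift))[0, 1]
    p_value = np.mean(np.abs(null) >= np.abs(observed))
    return observed, p_value

# Illustrative usage with synthetic fNIRS-like signals (10 Hz, 5 min)
rng = np.random.default_rng(2)
shared = rng.standard_normal(3000)                 # hypothetical shared component
speaker_sig = shared + rng.standard_normal(3000)   # speaker channel
listener_sig = shared + rng.standard_normal(3000)  # listener channel
r, p = coupling_with_null(speaker_sig, listener_sig)
```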

Compared with the single-brain approach, the inter-brain approach has several advantages. First, inter-brain neural coupling analysis provides a model-free, data-driven method by relating the neural activity of the listener’s brain to that of the speaker (Sonkusare et al. 2019). It requires no explicit model of the complex contents of dynamic stimuli or of the brain-to-stimulus correspondence (Redcay and Schilbach 2019; Sonkusare et al. 2019), and thus avoids pre-assumed biases about the brain, e.g., a linearized approximation of the neural system. Second, the inter-brain approach offers a powerful tool to fill the gap concerning the high-level linguistic and extralinguistic processing of natural speech, which is often underestimated in existing studies (Hamilton and Huth 2020). As interpersonal alignment comprehensively covers the multiple linguistic and extralinguistic processes of speech, the corresponding speaker-listener neural coupling also emerges across a wide range of speech-related regions. It has been found to cover the linguistic-related regions over the fronto-temporo-parietal cortex and to extend further to extralinguistic regions associated with processing the semantic and social aspects of a story, such as the precuneus, striatum, dorsolateral prefrontal cortex, orbitofrontal cortex and medial prefrontal cortex (Stephens et al. 2010; Silbert et al. 2014). Third, the inter-brain approach restores the interpersonal nature of communication through speech. In contrast to the single-brain approach, which solely analyzes the listener’s neural response to the speech stimuli, the inter-brain approach simultaneously includes the listener and the producer of the speech, i.e., the speaker. It encourages the listener to comprehend speech in an interpersonal and social way, and even allows the listener to communicate with the speaker in real time (Jiang et al. 2021).

Fig. 1

The third-person and second-person neuroscience approaches for studying speech-in-noise comprehension. The third-person approach has typically investigated brain-to-stimulus contingencies and revealed the involvement of both auditory and sensorimotor regions in linguistic processing under noisy conditions. The auditory-related regions (marked in blue) are mainly located in the temporal lobes; the sensorimotor-related regions (marked in red) mainly cover the left posterior frontal lobe, the ventral premotor cortex, and the posterior dorsal-most aspect of the left temporal lobe. Because previous single-brain studies often neglected the interpersonal nature of speech, extralinguistic processes, e.g., mentalizing, have received little attention. In contrast, the second-person neuroscience approach collects data from both the speaker’s and the listener’s brains from an integrative view and calculates the coupling between their neural activities. Speaker-listener neural coupling originates not only from the auditory- and sensorimotor-related regions but also from extralinguistic-related regions (marked in orange), such as the middle frontal gyrus and the temporal-parietal junction.

In sum, the recently developed inter-brain approach offers a novel perspective for examining the neural mechanisms of speech-in-noise comprehension by modelling the speech-evoked neural response of the listener against the production-related neural response of the speaker. It not only promotes the use of naturalistic speech but also provides a new conceptual framework, i.e., speaker-listener neural coupling, for measuring the listener’s neural processing from an integrative view. Thus, the inter-brain approach has the potential to deepen our understanding of the neural mechanisms of speech-in-noise comprehension. In the following sections, we discuss its implications for unfolding the linguistic and extralinguistic processing involved in speech-in-noise comprehension.

Inter-brain neural coupling underlies the linguistic processing in noise

Linguistic processes are the basic components of speech processing. As both the auditory and the sensorimotor mechanisms are suggested to support speech-in-noise comprehension (e.g., Ding and Simon 2013; Du et al. 2014), speaker-listener neural coupling in the corresponding brain regions could underlie the alignment of the listener’s auditory or sensorimotor processing of linguistic information with the speaker’s. By examining how inter-brain neural coupling in the auditory- and sensorimotor-related regions arises and correlates with comprehension in noisy conditions, researchers can assess how each of these linguistic processes is engaged and contributes to speech-in-noise comprehension.

One recent study used the inter-brain approach to explore how the auditory and sensorimotor mechanisms of linguistic processing support natural speech-in-noise comprehension (Li et al. 2021). In this study, both Chinese speakers and listeners were recruited. The speakers gave unrehearsed narratives on given topics. Their speech was recorded and mixed with meaningless white noise at different intensities, yielding four conditions with SNRs of no noise, 2, -6 and -9 dB. The listeners then listened to these narratives under the different noise conditions and completed comprehension tests about their content. The neural activities of both speakers and listeners were measured with functional near-infrared spectroscopy (fNIRS). Results showed that neural activity in the listener’s auditory-related regions, i.e., the right MTG and angular gyrus (AG), and sensorimotor-related regions, i.e., the left inferior frontal gyrus (IFG), was coupled with the speaker’s in both clear and noisy conditions. However, only the neural coupling from the left IFG correlated with the listener’s comprehension performance at the strong noise level. These results suggest that while both auditory and sensorimotor processes are engaged in noisy conditions, the sensorimotor processes play a more supportive role in comprehension when noise becomes strong (Li et al. 2021).

This study validated the feasibility of the inter-brain approach for revealing the neural mechanisms of speech-in-noise comprehension. To further investigate how people comprehend non-native speech in noise and to explain the non-native disadvantage in noisy conditions, Li et al. (2022b) recruited another group of Korean listeners who had learnt Chinese for years to listen to Chinese narratives at different noise levels, and calculated their neural coupling with the Chinese speakers. They found that the non-native listeners relied on a right-lateralized mechanism for linguistic processing. Specifically, neural activities in the non-native listener’s right dorsolateral prefrontal cortex, pre- and post-central gyrus (preCG/postCG), MTG and STG, as well as the left IFG, were coupled with the speaker’s. Among these regions, the neural coupling from the right postCG, MTG and STG was positively correlated with comprehension at the strong noise level. As the right postCG is responsible for sensorimotor processing and the right MTG/STG for auditory processing, this suggests that non-native listeners recruit a mixed, right-lateralized combination of auditory and sensorimotor processing to support speech-in-noise comprehension.

Moreover, the coupling pattern on the speaker’s side can bring additional insights for explaining the specific linguistic processing inside the listener’s brain. Specifically, because the speaker’s neural activity is regarded as a standardized reference for the listener, the brain regions involved on the speaker’s side can indicate what type of linguistic information is being processed by the listener in noisy conditions. In Li et al. (2021), the inter-brain neural coupling for native listeners covered a distributed, bilateral set of brain areas on the speaker’s side, including the right postCG, left superior frontal gyrus, bilateral supramarginal gyrus, bilateral middle frontal gyrus, and bilateral AG, which might represent a unified language production network for semantic-level linguistic generation. Meanwhile, in Li et al. (2022b), the coupling pattern on the speaker’s side for non-native listeners was restricted to the right postCG and STG, which are responsible for generating phonological-level linguistic information. Taken together, with the same group of speakers’ neural activities during narrative speaking as a reference, the neural coupling on the speaker’s side further highlights that people rely on different kinds of linguistic information for native and non-native speech comprehension in noise.

Beyond this, the temporal dynamics of inter-brain neural coupling could help to further distinguish different processing modes in listeners, such as the follow-up auditory encoding of the speech vs. the forward prediction of upcoming information, during speech-in-noise comprehension. Temporal dynamics here means that the neural activities of the speaker and the listener are not necessarily synchronized instantaneously but may be coupled with a time lag. Neural coupling in which the speaker precedes the listener might underlie the listener’s delayed linguistic processing, while coupling in which the listener precedes the speaker might underlie the listener’s predictive coding of the speech or the speaker (Jiang et al. 2021). Although no such study has been conducted in noisy conditions yet, studies in no-noise conditions have provided supportive evidence for this potential. For example, Liu et al. (2020) found that the listener’s neural activity lagged behind the speaker’s in an order that followed the temporal progression of speech processing. Specifically, the listener’s neural activity in the primary auditory cortex synchronized with the speaker’s articulation-related neural activity without delay, but lagged by 2 and 4 s in the STS/STG and MTG, respectively. This temporal sequence underlies the bottom-up information flow from lower-level acoustic-processing areas to higher-level semantic-processing areas. More generally, Stephens et al. (2010) showed that the listener’s neural activities lagged behind the speaker’s in most areas, but a striking listener precedence was observed in the striatum and anterior frontal areas. This listener precedence might indicate an anticipatory neural response that predicts the speaker’s upcoming words in a top-down fashion. Moreover, the listener-preceding neural coupling was highly correlated with the listener’s comprehension, suggesting that this prediction-based process is essential for speech comprehension and might play an even more supportive role in noisy conditions. Following this logic, future efforts could investigate how noise modulates these temporal dynamics of speaker-listener neural coupling, which would deepen our understanding of how follow-up and top-down linguistic processes are differentially engaged and support speech-in-noise comprehension.
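As an illustration of how such temporal precedence can be probed, the following minimal Python sketch scans a range of time lags and reports the lag at which the speaker-listener correlation peaks; a positive best lag indicates that the listener follows the speaker, a negative one that the listener precedes (and possibly predicts) the speaker. This is a generic lagged-correlation sketch, not the specific analysis pipeline of Stephens et al. (2010) or Liu et al. (2020).

```python
import numpy as np

def lagged_coupling(speaker: np.ndarray, listener: np.ndarray,
                    fs: float, max_lag_s: float = 8.0):
    """Correlate speaker and listener signals across time lags. Positive lags
    pair the speaker at time t with the listener at time t + lag (listener
    follows the speaker); negative lags test listener precedence."""
    max_lag = int(max_lag_s * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    n = min(len(speaker), len(listener))
    corrs = np.empty(len(lags))
    for i, lag in enumerate(lags):
        if lag >= 0:
            a, b = speaker[:n - lag], listener[lag:n]
        else:
            a, b = speaker[-lag:n], listener[:n + lag]
        corrs[i] = np.corrcoef(a, b)[0, 1]
    best_lag = lags[np.argmax(np.abs(corrs))] / fs
    return lags / fs, corrs, best_lag

# Illustrative usage: a listener signal that trails the speaker by ~2 s (10 Hz)
fs = 10.0
rng = np.random.default_rng(3)
speaker_sig = rng.standard_normal(3000)
listener_sig = np.roll(speaker_sig, 20) + rng.standard_normal(3000)
lag_times, corrs, best_lag = lagged_coupling(speaker_sig, listener_sig, fs)
```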

Inter-brain neural coupling underlies extralinguistic processing in noise

In addition to linguistic processing, extralinguistic processing contributes to speech-in-noise comprehension as well. The neural alignment between the speaker and the listener can also take place at the extralinguistic level, such as emotion (Smirnov et al. 2019). Many studies have shown that speaker-listener neural coupling (in no-noise conditions) not only originates from linguistic-related regions but also emerges from extralinguistic-related regions, such as the medial and dorsolateral prefrontal cortex, temporal-parietal junction (TPJ), and precuneus, during natural speech communication (for a review, see Jiang et al. 2021). Moreover, this extralinguistic-related speaker-listener neural coupling is sensitive to interpersonal interaction, e.g., visual gaze (Jiang et al. 2012; Leong et al. 2017) and interactive style (Pan et al. 2018; Zheng et al. 2018), rather than (solely) to the linguistic content. Thus, speaker-listener neural coupling could also help to examine the extralinguistic processes involved in speech-in-noise comprehension, which are often underestimated in previous studies.

Dai et al. (2018) adopted the inter-brain approach to reveal the functional role of extralinguistic processes in speech-in-noise comprehension. They recruited groups of three, i.e., one listener and two speakers. The two speakers spoke to the listener simultaneously, while the listener was required to attend to one of them and ignore the other. Results showed that the listener’s neural activity in the left TPJ was more strongly coupled with the attended speaker than with the unattended speaker, with the listener’s neural activity preceding the attended speaker’s by several seconds. Moreover, the strength of speaker-listener neural coupling from the TPJ was positively correlated with the quality of their speech-in-noise communication. As the left TPJ is a critical region for mentalizing about another’s mind, these results suggest that people selectively focus on the target speaker by predicting what that speaker intends to express. This finding supports the earlier hypothesis that prediction promotes the selective focus on, and comprehension of, the to-be-attended speech by assigning more weight to the processing of relevant information (Schwartz et al. 2012).

Although previous studies have highlighted the importance of extralinguistic processing for speech comprehension in no-noise conditions (Jiang et al. 2021), no inter-brain study other than the one above has examined its functional involvement in noisy conditions. Notably, some researchers have suggested that extralinguistic-related inter-brain neural coupling may serve as the neural basis for successful speech comprehension and mutual understanding (Schoot et al. 2016). According to the mutual prediction theory, each individual’s integration of predicting the other’s actions with enacting their own actions leads to dynamic neural similarity between them, which forms the basis for successful reciprocal social interaction (Kingsbury et al. 2019), including speech interaction. Along this line, extralinguistic-related neural coupling serves an important purpose in the listener’s interpretation of the speaker. A recent study showed that speaker-listener neural coupling from emotion-related regions, such as the middle frontal gyrus, superior parietal lobule, precuneus, and amygdala, modulated the emotional feelings shared between them: the more the listener’s neural activity synchronized with the speaker’s, the more similar the listener’s emotional feelings were to the speaker’s (Smirnov et al. 2019). Such extralinguistic-related speaker-listener neural coupling might be even more important in the presence of background noise, as the listener could draw on an overall representation of both themselves and the speaker (Sebanz et al. 2006; Yeshurun et al. 2021) to resolve the interference of noise. Therefore, future studies should employ the inter-brain approach to examine extralinguistic processing during speech-in-noise comprehension.

Discussion

Compared with the classical single-brain approach, the inter-brain approach provides a novel methodology for investigating the linguistic and extralinguistic processes of speech-in-noise comprehension by analyzing the relationship between the speaker’s and the listener’s neural activities. It is suitable for naturalistic settings, promoting the use of naturalistic speech and even allowing for real-time communication between speaker and listener. Recent studies have validated its potential by highlighting the essential roles of linguistic (Li et al. 2021, 2022b) and extralinguistic processes (Dai et al. 2018) in speech-in-noise comprehension. However, the number of existing inter-brain studies on speech-in-noise comprehension is still quite limited. Much remains unknown, calling for more research interest in the future.

Firstly, how speaker-listener neural coupling varies with different types of background noise remains unclear. Previous behavioral evidence has suggested that meaningless noise (e.g., white noise), meaningless speech (e.g., speech in an unknown language) and competing meaningful speech interfere with speech-in-noise comprehension in different ways (Oswald et al. 2000; Wong et al. 2012). In particular, while both meaningless noise and meaningful speech acoustically mask the target speech, the latter often brings additional interference to the high-level linguistic and extralinguistic processing of speech (Scharenborg and van Os 2019). However, existing inter-brain studies have only examined how either white noise (Li et al. 2021, 2022b) or competing speech (Dai et al. 2018) affects the linguistic or extralinguistic processes, respectively. It remains to be elucidated how various types of noise differentially modulate speaker-listener neural coupling in the speech-related regions, i.e., auditory, sensorimotor and extralinguistic-related regions. Future inter-brain studies directly comparing different types of noise are needed to clarify this question and are expected to bring more insight into the noise effect from an inter-brain perspective.

Another important issue is the causality between speaker-listener neural coupling and speech comprehension. While many studies have revealed significant speaker-listener neural coupling during successful speech comprehension in no-noise (e.g., Stephens et al. 2010; Liu et al. 2020) and noisy (e.g., Dai et al. 2018; Li et al. 2021) conditions, it remains controversial whether this coupling causally determines the outcome of speech comprehension or is merely an epiphenomenal consequence of sharing the same environment or performing the same task (Hamilton 2021; Novembre and Iannetti 2021). Causal protocols, such as multi-brain stimulation (MBS), offer a way to resolve this controversy (Novembre and Iannetti 2021). MBS refers to the simultaneous stimulation of multiple brains engaged in social interaction (Novembre and Iannetti 2021; Pan et al. 2021). Investigating whether directly manipulating speaker-listener neural coupling influences speech comprehension would clarify its causal role. Moreover, if causality were established, MBS could further be used to help people listen better, especially populations with difficulty in speech-in-noise comprehension, such as older adults (Panouilleres and Mottonen 2018) or people with hearing loss (Healy and Yoho 2016).

Besides, the existing inter-brain studies on speech-in-noise comprehension are all based on fNIRS measurements (Dai et al. 2018; Li et al. 2021, 2022b). fNIRS is often chosen for its high tolerance to motion and its low operating noise (e.g., Li et al. 2021), making it widely used in close-to-life (e.g., Dai et al. 2018) and even real-life communication scenarios (for a review, see Kelsen et al. 2022). However, fNIRS has some disadvantages: both its spatial and its temporal resolution are limited, with a temporal resolution of around 1 s and a spatial resolution on the order of 1 cm (Dieler et al. 2012). To allow a more precise description of the temporal dynamics or spatial localization of speaker-listener neural coupling during speech-in-noise comprehension, other neuroimaging technologies with higher spatial or temporal resolution, such as EEG, MEG, fMRI, or ECoG, could be implemented.

Last but not least, as speech communication is a dynamic process with continuous mutual adaptation and coordination between the speaker and the listener (Hasson and Frith 2016), advanced mathematical and computational methods are necessary to further estimate how background noise influences the emergence, direction and dynamics of the speaker-listener neural coupling. For instance, by taking the communicators’ brains together as an integrated neuronal network, computational neuronal models, such as the Rulkov map, could offer a promising tool to investigate the phenomenon of speech-in-noise comprehension by modeling how noise modulates the coherence and stochastic resonance over the network (Wang et al. 2008, 2009).
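As an illustration of this kind of model, the minimal Python sketch below simulates two diffusively coupled Rulkov map neurons with additive noise and computes a crude synchrony index; the parameter values and coupling scheme are illustrative and are not taken from Wang et al. (2008, 2009).

```python
import numpy as np

def coupled_rulkov(steps=20000, alpha=4.3, sigma=0.1, mu=0.001,
                   coupling=0.05, noise_sd=0.02, seed=0):
    """Two Rulkov map neurons with diffusive coupling and additive noise,
    using one common parametrization of the chaotic Rulkov map:
        x[n+1] = alpha / (1 + x[n]**2) + y[n] + coupling * (x_other - x) + noise
        y[n+1] = y[n] - mu * (x[n] + 1) + mu * sigma
    Parameter values are illustrative only."""
    rng = np.random.default_rng(seed)
    x_trace = np.zeros((steps, 2))
    x = np.array([-1.0, -1.2])   # fast variables of the two units
    y = np.array([-2.9, -2.9])   # slow variables
    for n in range(steps):
        x_trace[n] = x
        noise = noise_sd * rng.standard_normal(2)
        x_new = alpha / (1 + x ** 2) + y + coupling * (x[::-1] - x) + noise
        y = y - mu * (x + 1) + mu * sigma
        x = x_new
    return x_trace

# Illustrative usage: correlation between the two units as a crude synchrony index
traces = coupled_rulkov()
sync = np.corrcoef(traces[:, 0], traces[:, 1])[0, 1]
```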

Conclusion

Comprehending speech amid background noise is of great importance for human life. In the past decades, a large body of psychological, cognitive and neuroscientific research has explored the neurocognitive mechanisms of speech-in-noise comprehension. However, limited by the low ecological validity of the speech stimuli and experimental paradigms, as well as inadequate attention to high-order linguistic and extralinguistic processes, much remains unknown about how people comprehend noisy speech in real-life scenarios. A recently emerging approach, i.e., the second-person or inter-brain neuroscience approach, provides a novel conceptual framework to address these issues by measuring the neural activities of both the speaker and the listener and calculating their inter-brain neural coupling from an integrative view. It promotes the use of naturalistic speech and allows for real-time communication between speaker and listener as in real-life scenarios. Several studies have validated its potential for investigating the linguistic and extralinguistic processes of speech-in-noise comprehension. More research interest in the inter-brain approach will further extend present knowledge about the neural mechanisms of speech-in-noise comprehension.