
6.1 Introduction

Hearing loss is a widespread health issue, affecting 10 % of children (Niskar et al. 1998), 20 % of adults, and 50 % of older adults in the United States (Shield 2006; NIDCD 2010). Although the amount of reported difficulty varies, all people with hearing loss experience the same problem: poor speech perception. Poor speech perception is the most common reason for people to seek hearing care (Knudsen et al. 2010). Areas of greatest difficulty may include communicating in background noise, understanding talkers with soft voices, hearing speech at a distance, and conversing over the telephone. Except in rare cases, there are no medical or surgical treatments that can improve hearing in cases of sensorineural hearing loss. Consequently, hearing aids or other assistive devices are the most widely distributed treatment to improve speech perception.

The use of hearing aids can decrease hearing handicap by varying amounts depending on the patient and the situation. On average, hearing aid wearers report as much as a 70 % reduction of handicap for speech perception in quiet (compared to a listener with normal hearing in that situation) (Kochkin 2011). However, even in quiet, hearing aids do not eliminate hearing handicap. That limitation contrasts sharply with options for vision rehabilitation, where the most common treatments for vision loss (prescriptive lenses, cataract surgery, and laser vision correction) can nearly eliminate handicap (Kook et al. 2013; Lee 2014). The inability to “correct” hearing loss reflects the complex nature of the compromised auditory system, whereby hearing loss causes auditory deficits beyond simple threshold shifts. Some deficits can be easily addressed with hearing aids, while others present challenges. This chapter reviews the effects of hearing loss on speech perception and discusses how hearing aids can compensate for those effects.

6.2 Patient Factors Influencing Speech Perception

Several investigators have attempted to draw conclusions about cochlear damage patterns from the audiometric configuration. Seminal work by Schuknecht (Schuknecht and Gacek 1993; Schuknecht 1994) classified damage patterns in human temporal bones according to the site of lesion, such as hair cell damage, loss of spiral ganglion cells, or damage to the stria vascularis. Schuknecht originally proposed that different damage sites would result in different audiometric profiles and potentially in different speech perception abilities. For example, loss of spiral ganglion cells—“neural presbycusis”—was proposed to result in disproportionately poor speech perception. Although one-to-one associations between the cochlear site of lesion and speech perception ability are almost certainly an oversimplification for most human hearing loss, such associations do allow us to consider differences in cochlear damage patterns in relation to differences in speech perception and are likely to be one factor that explains speech perception variability among people with similar audiograms.

Human studies of cochlear damage patterns have been limited by the need to access audiometric data for later-obtained temporal bones. Therefore, most studies in this area have been based on animal models with control over the cause of hearing loss, such as intense noise exposure (Kujawa and Liberman 2006, 2009) or use of ototoxic drugs that modify the biochemical properties of the ear (Schmiedt et al. 2002; Lang et al. 2003, 2010). Using a novel approach, Dubno and colleagues (2013) surveyed more than 1,700 audiograms and selected the exemplars that fit predefined audiometric ranges derived from animal models of specific damage sites (e.g., metabolic damage linked to stria vascularis vs. sensory damage linked to hair cell survival). Patient history and risk factors were then analyzed for those exemplars. The results were consistent with the ideas put forth by Schuknecht and colleagues. For example, people whose audiograms fit the “sensory” criteria had a significantly higher incidence of noise exposure than those whose audiograms fit the “metabolic” criteria, and the “metabolic” group was significantly older than the “sensory” group.

To illustrate the idea that different underlying damage patterns may lead to different speech perception abilities, Halpin and Rauch (2009) devised a basic illustration of two people with similar pure-tone audiograms but with different underlying damage patterns. In one case, it was assumed that most sensory receptors (inner hair cells and ganglion cells) were present and in the other, that a portion of the basal hair cells were entirely absent. In the first individual, amplification can lead to appropriate frequency-selective information being carried in the auditory nerve and can improve speech perception. In the second individual, who has a “dead region” lacking receptors (Moore et al. 2000), amplification cannot make appropriate frequency-selective information available, and the individual will exhibit a plateau in the performance intensity function. This basic point will come up whenever the relationship between hearing loss, hearing aids, and speech perception is considered: without a means by which all important components of the acoustic signal can be received and transmitted within the auditory system, some degradation of speech perception is inevitable.

To expand this idea in the context of speech perception, consider the schematic representation in Fig. 6.1. The acoustic signal produced by the talker is first subject to the effects of the acoustic environment, including any background noise, reverberation, or a decrease in signal level due to distance between the talker and the listener. Use of a hearing aid or other assistive device further modifies the signal. The resulting signal is received by the listener but must be processed in several stages within the auditory and cognitive systems. At the periphery, the acoustic signal is transformed to a pattern of vibration along the cochlea, which leads to electrochemical processes in the outer and inner hair cells and then to neural encoding via the auditory nerve and its synaptic connections. At the peripheral level, information can be degraded by loss or dysfunction of outer and inner hair cells or by deficits in synaptic transmission. At the neural level, the firing rates of auditory fibers tuned to different frequencies transmit information about a short-term spectrum, changes in spectrum over time, and temporal patterns of amplitude modulation. The detailed timing of nerve spikes (phase locking) may also carry useful information about the temporal fine structure of the sound at each place in the cochlea (Young and Sachs 1979; Moore 2014). Reliance on that transmitted information has downstream effects on speech perception. For example, hearing loss is thought to shift the encoding balance of envelope and temporal fine structure (Kale and Heinz 2010; Scheidt et al. 2010; Swaminathan and Heinz 2011), a change that may have consequences for the ability to perceive speech in modulated backgrounds. At a later stage, information received via the auditory pathway is subjected to cognitive processes that compare information in working memory with long-term knowledge of phonology, syntax, and semantics to construct the meaning of the signal (Ronnberg et al. 2013). A disruption anywhere in this complex, multilevel process could potentially result in a deficit in speech perception.

Fig. 6.1

Schematic of stages in the transmission and processing of speech, each of which can affect speech perception

To summarize, speech perception is influenced by many factors, including the acoustic environment; any enhancement or distortion of the acoustic information produced by a hearing aid; the processing capabilities of the listener’s peripheral and central auditory systems; and the listener’s cognitive abilities. Sections 6.3–6.5 consider the contributions to speech perception of each of the last three factors.

Table 6.1 provides a framework for relating possible auditory damage patterns to degree of hearing loss as measured using the audiogram, along with options for treatment with a hearing aid or cochlear implant. With regard to acquired hearing loss via exposure to noise or ototoxic agents, outer hair cells are likely to be the most susceptible, although loss of synapses and auditory neurons may also occur, and inner hair cells may be damaged by impulsive sounds such as gunshots. Although the initial mechanism of age-related hearing loss may be metabolic (specifically, changes to the endocochlear potential) (Schuknecht 1994; Lang et al. 2003, 2010; Saremi and Stenfelt 2013), changes to the endocochlear potential affect both inner and outer hair cells. Therefore, the effect on auditory thresholds is likely to be similar to direct outer hair cell damage from other causes. Some auditory models indicate that complete loss of the cochlear amplifier associated with outer hair cells will result in 50–60 dB of threshold elevation (Ryan and Dallos 1975; Cheatham and Dallos 2000). In terms of speech perception, one consequence of outer hair cell loss is reduced frequency selectivity (broader tuning), which is discussed in Sect. 6.4.2. This reduces the number of independent channels that can code information about the signal envelope, further impairing speech perception in noise (Swaminathan and Heinz 2011). Speech perception in noise may also be impaired by collateral degeneration of spiral ganglion nerve fibers after the more immediate damage to outer hair cells (Kujawa and Liberman 2006, 2009).

Table 6.1 Expected cochlear damage for different amounts of hearing loss as measured using the audiogram

For greater degrees of hearing loss, loss of both outer and inner hair cells is expected (Stebbins et al. 1979; Hamernik et al. 1989; Nelson and Hinojosa 2006). Whereas loss of outer hair cells elevates the tips of neural tuning curves but not their tails, a combined loss of inner and outer hair cells shifts both the tips and tails of tuning curves to higher levels (Liberman and Dodds 1984), significantly affecting the transmission of auditory information. Some information may not be transmitted at all in cases of areas of missing or very sparse inner hair cells, termed “dead regions” (Moore 2004). Dead regions are often associated with severe hearing loss and often lead to perceived distortion of sounds (e.g., Huss and Moore 2005), poor sound quality, and reduced benefit from amplification.

6.3 Audibility

A prerequisite for speech perception is audibility. Speech sounds that fall below the auditory threshold cannot be perceived. For a sentence spoken at a constant vocal level, the level measured in narrow bands (typically, 1/3 octave bands) using 125-ms time windows varies by as much as 50 dB (Dunn and White 1940; Cox et al. 1988), although audibility of the full range may not be necessary for perception (Studebaker and Sherbecoe 2002; Moore et al. 2008). The range of levels is increased further by changes in overall level produced by variations in talker-listener distance and speaking effort. This concept is illustrated in Fig. 6.2, which represents levels in the ear canal for weak, medium, and intense speech presented to a listener with a severe hearing loss while wearing a hearing aid. In each panel, the range of speech levels (enclosed by dashed lines) is plotted relative to the listener’s hearing thresholds (filled circles). For the weak (50 dB SPL) input level, only a small portion of the speech information is audible. More generally, audibility is greater in cases of higher speech levels or lower threshold levels and lower in cases of weaker speech levels, higher threshold levels, or the presence of masking noise (not shown in the figure).

Fig. 6.2

A simple representation of the audibility of amplified speech (1/3 octave bands) for a listener with a severe hearing loss, wearing a hearing aid. The lines without symbols show the short-term range of speech levels (dashed lines) about the long-term average level (solid line); levels were measured in 125-ms windows. Each panel represents a different speech input level. In each panel, the audible part of the speech range is the area below the top dashed line (which represents the most intense speech segments) and above the thick line and filled circles (which represent the listener’s hearing thresholds). For the lowest input level of 50 dB SPL, even with hearing aid amplification, only 23 % of the speech information is audible. For the medium input level of 60 dB SPL, 52 % of the speech information is audible. For the highest speech level of 70 dB SPL, 76 % of the speech information is audible

Speech perception is determined, in part, by how much of the speech intensity range is audible but also by how much of the speech frequency range is audible. A classic measure that takes intensity and frequency into account is the Speech Intelligibility Index (SII; ANSI 1997) and its precursor, the Articulation Index (ANSI 1969). The SII is a measure of audibility ranging from 0 (inaudible) to 1 (audible) and is calculated from the proportion of the signal that is audible in each frequency band. The calculation takes into account the importance of each frequency band to speech perception. Audibility depends on the characteristics of the listener (auditory thresholds), the spectrum of the signal, and the spectrum of any background noise. Effects of reverberation are not taken into account. It is important to note that the SII value is not the predicted speech perception score; rather, a transfer function must be used to relate the SII value to intelligibility (Studebaker and Sherbecoe 1991; Souza and Turner 1999; McCreery and Stelmachowicz 2011). For listeners with normal hearing, and presumed good frequency resolution, speech intelligibility is well predicted by audibility (Dubno et al. 1989b). However, for listeners with hearing loss, speech perception is more variable and often falls below that predicted from audibility (Souza et al. 2007). This may be particularly true for listeners with greater amounts of hearing loss, especially listeners with dead regions (Baer et al. 2002; Malicka et al. 2013). The shortfall has been attributed to poor resolution and transmission of acoustic information. Some SII models incorporate a “proficiency factor” to capture these differences (Scollie 2008). Figure 6.3 illustrates this concept using data from a group of 27 listeners, with ages from 70 to 90 years. The figure shows the speech reception threshold (speech-to-noise ratio [SNR], required for 50 % correct) as a function of amount of hearing loss (three-frequency [0.5, 1, 2 kHz] pure-tone average). A higher speech reception threshold (SRT) indicates poorer speech reception, that is, the listener required a more favorable SNR to understand the speech. Open circles show unaided performance and filled circles show performance while wearing appropriately fitted hearing aids. In both conditions, greater amounts of hearing loss are associated with poorer speech reception. The effect of poor hearing is greater for the unaided condition, where both reduced audibility and poor auditory analysis would be expected to play a role in determining performance. For the aided condition, where audibility is expected to be increased and hence to have less influence, speech reception still worsens with increasing hearing loss, probably because of progressively poorer auditory analysis of audible signals. This is discussed more fully in Sect. 6.4.

Fig. 6.3

Speech reception threshold (dB SNR at threshold) as a function of three-frequency (0.5, 1, 2 kHz) pure-tone average. A larger y-axis value indicates poorer speech reception. Open circles show unaided performance and filled circles show performance while wearing appropriately fitted hearing aids. In both conditions, greater amounts of hearing loss are associated with poorer speech reception
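To make the band-audibility idea behind the SII concrete, the following is a minimal sketch of an importance-weighted audibility index for a handful of frequency bands. The band frequencies, importance weights, speech and noise levels, and thresholds are illustrative values rather than the tables from the ANSI standard, and the 30-dB speech range with peaks 15 dB above the long-term level is a common simplification of the standard's procedure.

```python
import numpy as np

# Illustrative audibility-index calculation in the spirit of the SII (ANSI S3.5-1997).
# Band frequencies, importance weights, levels, and thresholds are hypothetical
# example values, not the tables from the standard.

bands_hz   = np.array([250, 500, 1000, 2000, 4000])    # octave bands (example)
importance = np.array([0.10, 0.20, 0.25, 0.25, 0.20])  # band importance, sums to 1.0
speech_db  = np.array([55, 58, 52, 48, 42])             # long-term speech levels (dB SPL)
noise_db   = np.array([40, 38, 30, 25, 20])             # background noise levels (dB SPL)
thresh_db  = np.array([20, 25, 40, 55, 65])             # listener thresholds (dB SPL)

# Effective masker in each band: the larger of the noise level and the threshold.
masker_db = np.maximum(noise_db, thresh_db)

# Band audibility: proportion of the ~30-dB speech range (peaks ~15 dB above the
# long-term level) lying above the masker, limited to the range 0-1.
band_audibility = np.clip((speech_db + 15 - masker_db) / 30.0, 0.0, 1.0)

# The index is the importance-weighted sum of band audibilities
# (0 = inaudible, 1 = fully audible).
sii_like_index = float(np.sum(importance * band_audibility))
print(f"Audibility index = {sii_like_index:.2f}")
```

As noted above, the resulting value is not itself a predicted intelligibility score; a transfer function appropriate to the speech material would still be required.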

6.4 Suprathreshold Resolution

Speech signals vary rapidly in intensity and in spectrum, as illustrated in Fig. 6.4. The top panel shows the waveform and the bottom panel shows a narrowband spectrogram for the sentence “The lazy cow lay in the cool grass,” spoken by a female talker and sampled at 22.05 kHz. The figure shows the variation in short-term level over time (along the x-axis) and frequency (along the y-axis), with higher energy shown as darker shading. To analyze this stream of acoustic information, listeners with normal hearing have the advantage of fine resolution in both the spectral and temporal domains. For example, a listener with normal hearing can detect frequency differences as small as a few hertz and detect variations in energy over a few milliseconds (Fitzgibbons and Wightman 1982; Moore 1985). Those abilities allow easy translation of the spectrotemporal variations in the speech signal into meaningful sound.

Fig. 6.4

Two representations of the sentence “The lazy cow lay in the cool grass.” The top panel shows the waveform, that is, the instantaneous amplitude as a function of time. The lower panel shows the spectrogram, with frequency on the y-axis and higher amplitudes represented by darker shading. The figure illustrates the rapidly changing and complex nature of the speech signal. Note, for example, the difference between the dynamic, lower frequency energy of the vowel formant transitions at 0.1–0.3 s (diphthong /eɪ/ in “lazy”) and the static fricative energy at 1.9–2.1 s (/s/ in “grass”)

Once damage occurs to the cochlea or other auditory structures, suprathreshold resolution is often reduced, sometimes in unpredictable ways. Although greater amounts of sensorineural hearing loss are commonly associated with degraded resolution (and therefore poorer speech perception), it is difficult to predict resolution abilities for a particular listener based on their audiogram. Sections 6.4.1 and 6.4.2 review some effects of hearing loss on spectral and temporal resolution and implications for speech perception.

6.4.1 Temporal Resolution

For convenience, speech features can be categorized according to their dominant fluctuation rates. One approach is to consider three rate categorizations: slow (envelope, 2–50 Hz), medium (periodicity, 50–500 Hz), and fast (fine structure, 500–10,000 Hz) (Rosen 1992). Other researchers have proposed that envelope and fine structure should be described in terms of the processing that occurs in the cochlea and that the rapidity of envelope and temporal fine structure fluctuations depends on the characteristic frequency within the cochlea (Moore 2014). Regardless of the nomenclature, we know that different fluctuation rates make different contributions to speech perception. For example, prosodic cues are partly conveyed by slowly varying envelope, whereas segmental cues such as consonant place may be partly conveyed by rapidly varying fine structure. In addition, the relative contribution of each type of information depends on the situation. For example, listeners with normal hearing can perceive nearly 100 % of speech in quiet when that speech is processed to preserve envelope cues but disrupt temporal fine structure cues (Shannon et al. 1995; Friesen et al. 2001; Souza and Rosen 2009). Temporal fine structure is thought to be important for listening in background noise (Moore 2008) as well as for music perception (Heng et al. 2011).
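To illustrate the distinction between envelope and temporal fine structure, here is a minimal sketch that separates a band-limited signal into the two components using the Hilbert transform, the general approach used in vocoder-style studies such as those cited above. The sampling rate, the synthetic test signal, and the 50-Hz envelope cutoff are illustrative assumptions.

```python
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

fs = 22050                                   # sampling rate (Hz), illustrative
t = np.arange(0, 0.5, 1 / fs)
# Synthetic "band" signal: a 1000-Hz carrier with a slow 4-Hz amplitude modulation.
band_signal = (1 + 0.8 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)

analytic = hilbert(band_signal)              # analytic signal
envelope = np.abs(analytic)                  # Hilbert envelope (slow amplitude variation)
fine_structure = np.cos(np.angle(analytic))  # unit-amplitude carrier (temporal fine structure)

# Keep only slow (< ~50 Hz) envelope fluctuations.
sos = butter(4, 50 / (fs / 2), btype="low", output="sos")
slow_envelope = sosfiltfilt(sos, envelope)

# A vocoder would reimpose slow_envelope on a noise or tone carrier, preserving
# envelope cues while discarding the original fine structure.
```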

Poor temporal resolution for listeners with hearing loss (compared to listeners with normal hearing) is thought to be related to reduced sensation level and/or narrower stimulus bandwidth (Reed et al. 2009). However, many listeners with hearing loss are older, and age may introduce different problems with temporal processing. Consider how the temporal fine structure of a signal is conveyed through the auditory system. The frequency of a tone is represented, in part, by the time intervals between nerve “spikes.” In a normally functioning system, the interspike intervals are close to integer multiples of the period of the tone. With increasing age, the neural firing patterns may become disorganized such that they fail to faithfully represent the signal frequency. Some authors have proposed that this neural disorganization, or “dyssynchrony,” will impair the representation of sound at the level of the auditory brainstem (Pichora-Fuller et al. 2007; Anderson et al. 2012; Clinard and Tremblay 2013). Those listeners with poor neural representation also demonstrate poor objective (Anderson et al. 2010, 2011) and subjective (Anderson et al. 2013a) speech perception in noise.

In summary, the ability to resolve some types of temporal information (such as envelope information) may be relatively well preserved in people with sensorineural hearing loss. Other aspects of temporal information (such as temporal fine structure) are likely to be degraded by age and/or hearing loss. However, the extent to which temporal cues are preserved depends on the specific cue under study, the degree of hearing loss, the age of the listener, and perhaps other factors (such as hearing loss etiology) that are not yet well understood.

6.4.2 Spectral Resolution

Excluding conductive pathology, it is expected that most naturally occurring hearing loss involves some loss of outer hair cells. The consequences of outer hair cell loss are reduced audibility (caused by reduced gain of the cochlear amplifier) and reduced frequency selectivity. Listeners with cochlear hearing loss have broader-than-normal auditory filters (Glasberg and Moore 1986). The extent of the degradation roughly follows the degree of loss, so listeners with severe-to-profound sensorineural loss are likely to have very poor frequency selectivity. However, there can be large variability from person to person (Faulkner et al. 1990; Souza et al. 2012b). Degraded frequency selectivity is likely to be one of the major factors affecting speech perception. For speech in quiet, poor frequency resolution impedes accurate representation of spectral shape (Dubno et al. 1989a; Souza et al. 2012b, 2015). For speech in noise, masking effects are increased, causing the noise to obscure spectral features of the target speech (Leek et al. 1987; Leek and Summers 1996). A similar effect can be simulated for listeners with normal hearing by spectral “smearing” (Baer and Moore 1993).
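As a rough numerical illustration of what broader-than-normal auditory filters imply, the sketch below compares the normal equivalent rectangular bandwidth (ERB), computed from the Glasberg and Moore formula, with a hypothetically broadened filter. The threefold broadening factor is an illustrative assumption, not a measured value for any particular degree of loss.

```python
# Normal-hearing ERB (Hz) as a function of centre frequency (Glasberg and Moore);
# the broadening factor for the impaired ear is purely illustrative.

def erb_normal(f_hz: float) -> float:
    """Equivalent rectangular bandwidth (Hz) of the normal auditory filter."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

broadening = 3.0  # hypothetical factor for a listener with cochlear hearing loss

for f in (500, 1000, 2000, 4000):
    normal = erb_normal(f)
    print(f"{f:5d} Hz: normal ERB = {normal:6.1f} Hz, broadened = {broadening * normal:6.1f} Hz")
```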

6.5 “Top-Down” (Cognitive) Processing Ability and Listening Effort

Audiological care is usually focused on the capabilities of the peripheral auditory system. For example, clinical evaluations are based on the pure-tone audiogram, which provides information about audibility. Tests for dead regions have been suggested for use in selection of the hearing aid frequency response (Moore and Malicka 2013). The most common clinical speech perception test is monosyllabic word recognition in quiet, although tests of speech perception in noise are beginning to gain traction (Taylor 2003). Specific measures of suprathreshold resolution are infrequently included (Musiek et al. 2005), and it is unclear how those measures should be taken into account when making rehabilitation choices (Sirow and Souza 2013). Although there is considerable interest among clinicians, contributions of the cognitive system to speech perception are not usually assessed or considered as a routine part of audiological care. However, consider the demands of everyday communication: the listener must process a rapidly varying stream of acoustic information; match that acoustic information to stored lexical information to obtain meaning; and retain the information for later access and comparison with new information. It seems reasonable to expect that, in most situations, speech perception will depend on cognitive abilities, including memory and attention, and that those abilities will also affect the ability to understand and remember speech.

Recent work on speech perception and cognitive ability has focused on working memory, which refers to the ability to process and store information while performing a task (Daneman and Carpenter 1980; Baddeley 2000). Ronnberg et al. (2013) postulate that working memory involves deliberate and effortful processing, especially when the auditory representation of the input signal is degraded by noise, by a hearing aid, or by impaired processing in the auditory system. In that view, working memory plays only a minor role in the perception of speech in quiet or when contextual information is available to support a lexical decision (Cox and Xu 2010). Behavioral and physiological data support the idea that adults with poor working memory have poorer speech perception in complex listening environments (Akeroyd 2008; Wong et al. 2009). Such adults also report greater communication difficulty than listeners with similar amounts of hearing loss but better working memory (Zekveld et al. 2013). Because low working memory is associated with poor perception of acoustically degraded signals, it may also affect how an individual responds to signal-processing manipulations in hearing aids (Lunner and Sundewall-Thoren 2007; Arehart et al. 2013a).

Traditional studies of speech perception typically used percent correct or SRTs to compare results across individuals or groups. When there was no difference in score, it was assumed there was no difference in speech perception ability. However, such comparisons do not account for situations where one listener might apply greater conscious or unconscious effort to achieve the same level of speech perception as another listener. As one example, consider a simple intelligibility task (Wong et al. 2009) where older and younger listeners were asked to identify words in multitalker babble at 20 dB SNR (a relatively easy task). Although speech perception scores were similar for the two age groups, functional magnetic resonance imaging (fMRI) results showed reduced activation in the auditory cortex and an increase in working memory and attention-related cortical areas for the older listeners. In other words, equal performance was achieved only by the older listeners expending more cognitive effort to compensate for deficits in auditory and cognitive processing. Effort has also been shown to be correlated with working memory; people with lower working memory expend greater effort (Desjardins and Doherty 2013).

6.6 Language Experience and Effects of Age

A detailed consideration of the effects of age on speech perception is beyond the scope of this chapter. However, speech perception, by its nature, depends on language experience. Experience is one factor that may modify speech perception for younger or older listeners. For children, speech perception skills require time to mature (Hnath-Chisolm et al. 1998; Eisenberg 2007; Werner 2007). The last skills to develop involve speech perception in difficult listening environments, including background noise (Leibold and Buss 2013; Baker et al. 2014). As for adults, listening in these environments may require children with hearing impairment to use cognition to compensate for degraded auditory perception (Osman and Sullivan 2014).

Most hearing loss occurs gradually due to aging, noise exposure, or other late-occurring etiologies. The loss usually occurs in the context of long language experience, and language experience confers some protection against loss of auditory information. For example, older listeners appear to be better able than younger listeners to use context to fill in missing information (Lash et al. 2013). Note, though, that use of contextual information to compensate for degraded auditory input requires deployment of cognitive resources (Aydelott et al. 2011). Accordingly, older listeners’ ability to use contextual information may also depend on their cognitive abilities, including working memory (Janse and Jesse 2014).

Overall, there is little doubt that older listeners have more difficulty perceiving speech than younger listeners with similar levels of hearing loss (Gordon-Salant et al. 2010). These deficits are most obvious in complex listening environments (Pichora-Fuller and Souza 2003). Poorer performance in background noise and with rapidly varying signals, such as time-compressed speech (Jenstad and Souza 2007), may be related to degraded neural representations of temporal information (Anderson et al. 2011). Language or listening experience may partially offset those effects (Anderson et al. 2013b) and provide the ability to compensate for peripheral and central deficits.

6.7 Situational Factors Influencing Speech Perception

6.7.1 Background Noise

The most common complaint of people with hearing loss (and sometimes of people with normal hearing!) is difficulty listening in background noise. Most everyday situations involve some level of noise, ranging from favorable SNRs in relatively quiet situations (such as the listener’s home or workplace) to negative SNRs in restaurants or public transportation (Olsen 1998). The more spectral, temporal, or spatial “overlap” there is between the talker and background, the more difficult is speech perception. For example, a distant engine is unlikely to interfere with understanding a talker who is situated close to the listener because the engine noise is distinct in frequency spectrum, temporal pattern, and location from the talker’s voice. In contrast, attending to a talker in the presence of a second, unwanted talker standing next to the first talker is more challenging. In that case, the target and masking talkers may be producing sound that has similar frequency spectrum, temporal patterns, and location. The listener may need to expend more effort to focus on the target talker. The extent to which a noise “masks” (interferes with) perception of a target depends on a number of acoustic features of the two signals, including similarity of modulation patterns (Stone et al. 2012). Sections 6.7.1.1 and 6.7.1.2 consider the effect of noise on speech perception in two broad categories, energetic/modulation and informational masking.

6.7.1.1 Energetic and Modulation Masking

Energetic masking occurs when the peripheral response to the signal-plus-masker is almost the same as the response to the masker alone (Brungart et al. 2006). Energetic masking is reduced when there is a difference in the peripheral response to the signal-plus-masker and to the masker alone. Such a difference might occur because there is little overlap between the spectra of the target and masker (as for a target talker with a low-frequency voice speaking in the presence of a high-frequency fan), or because of brief reductions in the level of the masker. The noise encountered in everyday listening rarely has a constant level. Moreover, amplitude modulations can occur at different time points in different frequency regions. Listening in spectrotemporal “dips” in the background can decrease energetic masking and improve speech perception. Listeners with normal hearing can take advantage of momentary dips in the background where the SNR is briefly improved to obtain information about the target speech (Festen and Plomp 1990). When the background is speech, the amount of amplitude modulation is considerable when there are only a few talkers but decreases as the number of background talkers increases (Simpson and Cooke 2005; Rosen et al. 2013). Based on the principles of energetic masking, the most effective masker should be a broadband noise with a spectrum shaped to match that of the target speech because such a noise does not have pronounced temporal or spectral dips. In practice, this is not always the case, for reasons explained in the next paragraph.

Stone et al. (2012) have proposed that speech perception is better for speech in modulated noise than for speech in steady noise not because of release from energetic masking but because of release from modulation masking. Modulation masking occurs when amplitude fluctuations in the background make it harder to detect and discriminate amplitude fluctuations in the target signal (Bacon and Grantham 1989; Houtgast 1989). When the background is “steady” noise, random amplitude fluctuations in the noise produce modulation masking of the target speech. When the background sound contains pronounced spectrotemporal dips (over and above those associated with the random inherent fluctuations in the noise), these provide “clean” glimpses of the target speech, free from modulation masking, and this leads to better speech intelligibility. In that view, masker modulation can either increase or decrease speech intelligibility depending on the masker properties. Regardless of the mechanism, there is strong evidence that listeners with hearing loss have impaired glimpsing ability (Takahashi and Bacon 1992; Dubno et al. 2003; Wilson et al. 2010). Possible causes include reduced audibility of the target speech in the masker gaps (Bernstein and Grant 2009) as well as the limitations in auditory analysis described earlier in this section. For example, poor frequency selectivity may limit the ability to glimpse in a narrow spectral dip, and greater susceptibility to forward masking at low sensation levels may limit the ability to glimpse in a brief temporal dip (Festen and Plomp 1990; Gustafsson and Arlinger 1994; Eisenberg et al. 1995).
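One way to picture dip listening is to count the spectro-temporal “glimpses” in which the local SNR favors the target, in the spirit of glimpsing analyses. In the sketch below, the band-by-frame grid, the randomly generated levels, and the 3-dB criterion are illustrative assumptions rather than parameters of any published model.

```python
import numpy as np

# Count spectro-temporal cells in which the local SNR exceeds a criterion.
# Band/frame grid, random levels, and the 3-dB criterion are illustrative.

rng = np.random.default_rng(0)
n_bands, n_frames = 16, 200                  # e.g., 16 bands x 200 frames of 20 ms

speech_db = 60 + 12 * rng.standard_normal((n_bands, n_frames))  # fluctuating speech levels
noise_db  = 58 + 4 * rng.standard_normal((n_bands, n_frames))   # modulated masker levels

local_snr_db = speech_db - noise_db
criterion_db = 3.0
glimpse_proportion = float(np.mean(local_snr_db > criterion_db))

print(f"Proportion of usable glimpses: {glimpse_proportion:.2f}")
```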

6.7.1.2 Informational Masking

Informational masking occurs when the listener cannot distinguish the target from the background, even when energetic or modulation masking is not the cause. This happens when the target and masker are confusable and/or similar—as when two people talk at the same time. Informational masking occurs for listeners with normal hearing and with hearing loss (Kidd et al. 2002; Alexander and Lutfi 2004). Because the “noise” in many occupational or social environments includes other talkers, informational masking plays a significant role in everyday listening. Informational masking can also occur when the masker is not speech but is acoustically similar to speech (Souza and Turner 1994; Brungart 2001). For example, informational masking can occur when the masker is a language not understood by the listener (Garcia Lecumberri and Cooke 2006; Van Engen and Bradlow 2007) or when the masker is speech modified to be unintelligible (Freyman et al. 2001; Hoen et al. 2007; Cullington and Zeng 2008). Although some studies have suggested that informational masking may be greater for older listeners or for listeners with hearing loss (Kidd et al. 2002), others have not (e.g., Souza and Turner 1994; Rothpletz et al. 2012).

6.7.2 Reverberation

When listening to speech in a room, part of the speech energy arrives directly at the ears. Other speech energy reaches the ears after reflections from surrounding surfaces, and this energy is delayed relative to the direct signal. The amount of this reverberation is often defined by the reverberation time, RT60, which is the time that it takes for the reflections to decay by 60 dB. Reverberation reduces amplitude modulation depth and can affect speech perception in two ways: overlap masking and self-masking (Nabelek et al. 1989). Overlap masking occurs when reflections from one speech sound overlap in time with a following sound. As a result, whereas noise causes more errors in identification of initial consonants in words, reverberation causes more errors in identification of final consonants (Helfer 1994). Self-masking refers to the distortion of the spectrotemporal information within a single speech sound, such as disruption of formants within a diphthong (Nabelek 1988).
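For an ideal exponential decay, the reduction in modulation depth caused by reverberation can be approximated by the modulation transfer function m(F) = 1/sqrt(1 + (2*pi*F*RT60/13.8)^2) described by Houtgast and Steeneken. The sketch below evaluates this relation for a few illustrative modulation frequencies and reverberation times; the specific values are examples only.

```python
import numpy as np

def room_mtf(mod_freq_hz: float, rt60_s: float) -> float:
    """Fraction of modulation depth preserved at a given modulation frequency
    for a room with an ideal exponential decay (reverberation time rt60_s)."""
    return 1.0 / np.sqrt(1.0 + (2.0 * np.pi * mod_freq_hz * rt60_s / 13.8) ** 2)

for rt60 in (0.3, 0.6, 1.2):                  # mildly to strongly reverberant rooms
    preserved = ", ".join(f"{f} Hz -> {room_mtf(f, rt60):.2f}" for f in (2, 4, 8, 16))
    print(f"RT60 = {rt60:.1f} s: {preserved}")
```

The faster the modulation and the longer the reverberation time, the more the modulation depth is reduced, consistent with the greater disruption of rapidly varying temporal cues noted below.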

As RT60 increases, speech perception worsens (Duquesnoy and Plomp 1980; Shi and Doherty 2008). Listeners with hearing loss may be especially sensitive to distortion from reverberation (Helfer and Huntley 1991; Sato et al. 2007). One source of this problem may be that listeners with hearing loss depend to a greater degree on temporal cues, and these are distorted by reverberation (Nabelek et al. 1989). Unlike listeners with normal hearing, those with hearing loss may be unable to adjust perception to listen effectively in reverberant environments, perhaps because they cannot perceive the acoustic information that is necessary to support that adjustment.

6.8 What to Expect from the Hearing Aid

Hearing loss—and the impaired speech perception that results—has enormous consequences for communication. With very few exceptions, the only treatment available to improve speech perception is amplification, usually via hearing aids, and sometimes via other assistive listening devices. Hearing aids have an onerous task: to improve speech audibility and to preserve essential speech cues while avoiding distortion. Considering the difficulties, hearing aids are effective at improving speech recognition in many situations, particularly in quiet environments. However, they may provide limited benefit in difficult listening environments, including distant talkers where the talker’s voice fails to reach the hearing aid microphone at a sufficiently high level; noisy rooms; and highly reverberant situations.

In this section, the focus is on how technology might be used to address the issues relevant to speech perception covered in this chapter. Although a detailed review of hearing aid processing is provided elsewhere in this volume (Killion, Van Halteren, Stenfelt, and Warren, Chap. 3; Mecklenburger and Groth, Chap. 5), core hearing aid features are considered in relation to their effect on speech perception. This section also considers the relationship between hearing aid processing and the listener’s cognitive ability.

6.8.1 Overcoming Audibility Loss

The basic role of hearing aids is to improve audibility. Listeners with hearing loss whose dynamic range (from threshold of audibility to threshold of discomfort) is less than the dynamic range of speech will be at a disadvantage if linear amplification is used, in that either low-intensity sounds will be inaudible or high-intensity sounds will be uncomfortably loud. Focal loss of inner hair cells may also have implications for hearing aid use because it may not be possible to improve reception of signal components falling within the frequency range of the dead region (Hogan and Turner 1998; Vickers et al. 2001; Baer et al. 2002). Fortunately, several types of hearing aid processing can be used to address this issue.

6.8.1.1 Frequency Gain Shaping

Hearing aids are usually fitted in such a way that frequency bands for which hearing threshold is poorer (and audibility is lower) receive greater gain. Over the past 50 years, many schemes have been proposed that prescribe gain at each frequency based on the audiogram or sometimes on measures of loudness perception (Byrne and Dillon 1986; Cox 1995; Moore et al. 2010). Although all of the methods were based on sound theoretical principles, only some gained widespread acceptance. Some were abandoned because they were never updated to accommodate new amplification technology, some because they received little validation, and some because they were inconvenient to implement in clinical practice. Here, three procedures in current use are described as illustrations of the process by which speech perception can be improved via improved audibility.

The first procedure, the National Acoustic Laboratories nonlinear procedure (NAL-NL2; Dillon et al. 2011), aims to maximize speech intelligibility while keeping the overall loudness of the signal at or below that for a normal-hearing listener presented with unamplified speech. The target frequency- and level-dependent gains are derived from a modified version of the SII and a model of loudness perception (Moore and Glasberg 2004) (Fig. 6.5). Frequencies that do not contribute to higher SII values receive little or no gain. The prescription includes a modification for tonal languages, which are likely to differ in frequency content (and therefore require different audibility criteria) compared to nontonal languages. Currently, this is the most common procedure used to fit hearing aids for adults in the United States and in Australia.

Fig. 6.5

Illustration of the adaptive process used to derive a frequency-gain response that takes into account the audiogram of the listener and the input signal [Modified from Dillon et al. (2011) with permission of the author]

The underlying tenet of the second procedure, the Desired Sensation Level procedure and its latest implementation, DSL v5 (Scollie et al. 2005; Moodie et al. 2007), is that audibility will be beneficial. To that end, DSL prescriptions often result in a wider audible bandwidth, greater gain, and sometimes higher compression ratios (discussed in Sect. 6.8.1.3) than NAL-NL2. In the United States and Canada, DSL is a popular choice for pediatric hearing aid fitting because of its attention to child-specific speech spectra, such as the difference between the speech reaching a listener who faces the talker and the speech reaching a small child or infant held by the talker (Pittman et al. 2003), and its use of conversion factors (Bagatto et al. 2002; Scollie et al. 2011) that allow for fewer in situ measurements.

The third procedure is based on the loudness model developed by Moore and Glasberg (1997, 2004). This procedure, termed the Cambridge procedure for loudness equalization, or CAMEQ, has two goals: (1) to give an overall loudness that is similar to or slightly lower than what would be perceived by a normal-hearing person listening unaided and (2) to make all frequency components in speech equally loud, on average, over the range 500–4,000 Hz. The most recent version of this method, CAM2 (Moore et al. 2010), prescribes target gains for a wide frequency range. That feature has been shown to improve speech clarity and recognition of specific high-frequency phonemes compared to narrower bandwidth amplification (Füllgrabe et al. 2010; Moore and Füllgrabe 2010; Moore and Sek 2013). One caveat is that the benefit of high-frequency audibility presumably requires sufficient high-frequency auditory receptors. To date, CAM2 has not been tested for people with severe high-frequency hearing loss.

6.8.1.2 Frequency Lowering

In cases in which high-frequency audibility cannot be achieved by providing gain (because the loss is too severe, the power of the hearing aid amplifier is limited, or acoustic feedback limits the available gain), frequency lowering can be used to shift the frequency of the input signal to a lower frequency region. In a recent survey, a majority of audiologists reported using frequency lowering for some of their patients with high-frequency loss (Teie 2012). With regard to speech perception, the rationale is that restoring audibility might improve perception of high-frequency phonemes such as fricative consonants spoken by female and child talkers (Pittman et al. 2003). However, frequency lowering (especially strong frequency lowering that affects a wider frequency range) alters the acoustic characteristics of the shifted phoneme. Accordingly, frequency lowering may be beneficial to a listener when audibility outweighs distortion and detrimental when distortion outweighs audibility (Souza et al. 2013).
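As an illustration, one commonly described form of frequency lowering, nonlinear frequency compression, leaves frequencies below a cutoff unchanged and compresses frequencies above it toward the cutoff on a logarithmic frequency scale. The 2-kHz cutoff and 2:1 ratio in the sketch below are illustrative choices rather than a prescription, and proprietary implementations differ in detail.

```python
# Nonlinear frequency compression mapping (illustrative parameters only).

def lowered_frequency(f_in_hz: float, cutoff_hz: float = 2000.0, ratio: float = 2.0) -> float:
    """Map an input frequency to its output frequency after compression above the cutoff."""
    if f_in_hz <= cutoff_hz:
        return f_in_hz
    return cutoff_hz * (f_in_hz / cutoff_hz) ** (1.0 / ratio)

for f in (1000, 3000, 6000, 8000):
    print(f"{f} Hz -> {lowered_frequency(f):.0f} Hz")
```

With these example settings, energy at 6000 Hz would be moved to about 3460 Hz, a region where the listener may have better residual hearing, at the cost of altering the spectral relationships among the shifted components.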

6.8.1.3 Amplitude Compression

Hearing aids fitted with individual frequency gain shaping have been highly successful at improving speech audibility and perception relative to unaided listening. However, most listeners with hearing loss have threshold elevation without corresponding elevation of their loudness discomfort level. To improve speech perception, the hearing aid must adjust the applied gain depending on the input level of the signal. Accordingly, amplitude compression is a feature of all modern hearing aids. Compression works as follows. The incoming signal is first filtered into a number of frequency bands. The level in each band is estimated. For levels falling below the compression threshold, a fixed (maximum) gain is usually applied (linear amplification). Some very low level sounds (below 30 or 40 dB SPL) may receive less gain (expansion) to reduce the annoyance of environmental sounds or microphone/circuit noise. When the level in a given band exceeds the compression threshold, progressively less gain is applied as the input level increases. The extent to which gain is reduced is determined by the compression ratio.
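The level-dependent gain behavior just described can be summarized, for a single band, by a static input/gain rule. In the sketch below, the expansion threshold, compression threshold, maximum gain, and ratios are illustrative values only, not a prescription for any listener.

```python
def band_gain_db(input_db: float,
                 expansion_threshold_db: float = 35.0,
                 compression_threshold_db: float = 50.0,
                 max_gain_db: float = 25.0,
                 compression_ratio: float = 2.5,
                 expansion_ratio: float = 2.0) -> float:
    """Gain (dB) applied to one band as a function of its estimated input level."""
    if input_db < expansion_threshold_db:
        # Expansion: gain falls as the input drops further below the expansion
        # threshold, reducing amplification of microphone/circuit noise.
        return max_gain_db - (expansion_threshold_db - input_db) * (expansion_ratio - 1.0)
    if input_db <= compression_threshold_db:
        # Linear region: fixed (maximum) gain.
        return max_gain_db
    # Compression region: output rises by only 1/CR dB per dB of input.
    return max_gain_db - (input_db - compression_threshold_db) * (1.0 - 1.0 / compression_ratio)

for level in (30, 45, 60, 75, 90):
    print(f"input {level} dB SPL -> gain {band_gain_db(level):.1f} dB")
```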

Compression makes intuitive sense as a means of improving speech perception because higher level inputs require little to no amplification to make them audible. Also, reduced gain for high input levels is needed to avoid loudness discomfort. Regardless of the prescriptive procedure that is used, compression hearing aids are quite successful at achieving improved audibility of low-level sounds and acceptable loudness for high-level sounds (Jenstad et al. 1999, 2000). In a small number of cases (usually severe hearing loss), the auditory threshold is too high—or the loudness discomfort threshold is too low—to achieve audibility across a range of speech levels without using unacceptably high compression ratios. The combined effect of a high compression ratio and fast compression speed may be unacceptable if the resulting processing dramatically alters the intensity relationships between individual sounds and removes the natural intensity contrasts in speech. In those cases, clinical goals often shift to giving good audibility of conversational speech (but not low-level speech) without discomfort from intense sounds.

In a compression hearing aid, the gain changes over time depending on the level of the signal relative to the compression threshold. The speed with which those adjustments occur is determined by the attack and release times of the compressor. While attack times are usually short—typically, 5 ms or less—release times vary widely, from about 10 ms to several seconds. When coupled with a low compression threshold, short attack and release times improve speech audibility by providing more gain for brief, low-intensity speech sounds. However, that improved audibility comes at the expense of altered amplitude properties of the speech signal (Jenstad and Souza 2005). In other words, there may be a trade-off between improved consonant audibility and a desire to retain some natural amplitude variations.
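The interaction of attack and release times can be illustrated with a simple one-pole level tracker: the estimated band level (and therefore the gain) follows increases in input level quickly but follows decreases slowly. In the sketch below, the envelope update rate, time constants, and toy input are illustrative assumptions; commercial compressors differ in detail.

```python
import numpy as np

fs = 1000.0                          # envelope update rate (Hz), illustrative
attack_s, release_s = 0.005, 0.500   # fast attack, slow release
alpha_attack = np.exp(-1.0 / (attack_s * fs))
alpha_release = np.exp(-1.0 / (release_s * fs))

# Toy input: 0.3 s at 55 dB, a 0.2-s burst at 80 dB, then back to 55 dB.
input_db = np.concatenate([np.full(300, 55.0), np.full(200, 80.0), np.full(500, 55.0)])

estimate = np.empty_like(input_db)
level = input_db[0]
for i, x in enumerate(input_db):
    alpha = alpha_attack if x > level else alpha_release   # rise fast, fall slowly
    level = alpha * level + (1.0 - alpha) * x               # one-pole smoother
    estimate[i] = level

# With a long release time the level estimate stays high after the burst ends,
# so gain remains reduced for a while and level contrasts between successive
# sounds are better preserved; a short release restores gain (and audibility of
# weak sounds) more quickly at the cost of altering those contrasts.
```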

There is unlikely to be a single “best” compression speed that suits all hearing aid wearers. Rather, the optimal compression speed is likely to depend on both the environment and the listener. For example, the detrimental effects of fast compression may be more apparent when there is less acoustic information in the signal, such as for speech that is time compressed (Jenstad and Souza 2007) (mimicking rapidly spoken speech); spectrally degraded (Souza et al. 2012a) (mimicking a listener with poor spectral resolution; see Sect. 6.4.2); or when the listener is more susceptible to signal distortion (see Sect. 6.8.4). Finally, although short release times may offer greater speech recognition benefits for some listeners (e.g., Gatehouse et al. 2006), most listeners prefer a long release time for sound quality (Hansen 2002; Neuman et al. 1998).

6.8.2 Maintaining Acoustic Fidelity

A general assumption has been that once cues are made audible via appropriate frequency-dependent amplitude compression and frequency lowering, they will be accessible to the hearing aid wearer. If audibility were the only requirement for good speech perception, this would be a simple solution. However, amplitude compression and frequency lowering involve distortion of the signal and a loss of fidelity. In some sense, acoustic fidelity can be considered to be traded for audibility. Technically, it would be possible to make every signal audible for every listener but doing so might require amplification parameters (high gain, skewed frequency response, and extreme compression) that would degrade the acoustic signal. Instead, parameters must be chosen to improve the audibility on which speech perception depends while minimizing distortion.

It seems likely that poor spectral resolution for most people with hearing loss will force greater reliance on temporal information (Lindholm et al. 1988; Hedrick and Younger 2007). Each hearing-impaired listener uses both spectral and temporal information, but the balance between the two may vary across listeners. Consider two hypothetical listeners, both with moderately severe sensorineural loss and a 40-dB dynamic range. Listener A has good frequency selectivity and can access a full range of spectral cues to speech, including vowel spectra, formant transitions, and overall spectral shape. Listener B has broadened auditory filters and is limited to coarse representations of spectral information. Listener B must depend to a greater extent on temporal cues, including the amplitude envelope and periodicity in the signal. A clinician might be tempted to adjust hearing aid parameters for both listeners with audibility as the primary goal, using fast-acting wide dynamic range compression (WDRC) to improve the audibility of low-intensity sounds. Although fast-acting WDRC improves audibility, it also distorts the amplitude envelope and may be a poor choice for improving speech perception for Listener B (Jenstad and Souza 2005; Davies-Venn and Souza 2014). Audibility can also be improved by using a higher number of compression channels (Woods et al. 2006), but too many channels will smooth frequency contrasts (Bor et al. 2008) and may be a poor choice for improving speech perception for Listener A. Although such arguments are speculative, a necessary first step in clarifying these issues is to understand how reliance on spectral and temporal properties varies among individuals with hearing loss.

6.8.3 Listening in Noise

Because a common speech perception complaint is difficulty when listening in noise, considerable effort has gone into this aspect of hearing aid design. Two general strategies are used to reduce background noise: directional microphones and digital noise reduction. A more complete discussion of each feature is available elsewhere in this volume (Launer, Zakis, and Moore, Chap. 4; Akeroyd and Whitmer, Chap. 7). Here, the effects of each on speech perception are considered.

6.8.3.1 Directional Microphones

Directional microphones have been used to improve speech perception for nearly 40 years (Sung et al. 1975). Directionality is usually achieved by processing the outputs of two (or three) omnidirectional microphones. This has become a near-universal feature of hearing aids, with the exception of some aid styles (such as completely-in-canal aids) for which directional information is not preserved at the hearing aid microphone(s). A common configuration is a microphone that is both automatic and adaptive, where the modulation pattern and spatial location of the incoming signal are used to activate either an omnidirectional response or a directional response with a specific polar plot (Chung 2004). Because directional microphones operate in the spatial domain, they are successful at improving speech perception when speech and interfering sources are spatially separated. The improvement in SNR can be about 5 dB, which translates to as much as a 30 % improvement in speech intelligibility. Directional microphones are less advantageous in cases of multiple or moving noise sources, when the user wishes to switch attention between sources at different azimuths, when the speech signal of interest is behind the user, or in high levels of reverberation (Bentler and Chiou 2006b; McCreery et al. 2012; Ricketts and Picou 2013).
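The sketch below illustrates the delay-and-subtract principle behind a first-order directional response formed from two omnidirectional microphones: delaying the rear microphone signal by the acoustic travel time between the ports and subtracting it from the front signal yields a cardioid-like pattern with a null toward the rear. The port spacing, sampling rate, and internal delay are illustrative; real devices add microphone matching, low-frequency equalization, and adaptive control of the null direction.

```python
import numpy as np

fs = 48000.0
c = 343.0                         # speed of sound (m/s)
delay_samples = 2                 # internal electrical delay applied to the rear microphone
d = c * delay_samples / fs        # port spacing that makes this a cardioid (about 14 mm)

def directional_gain(angle_deg: float, freq_hz: float) -> float:
    """Magnitude response of the delay-and-subtract microphone pair at one frequency."""
    tau_ext = (d / c) * np.cos(np.radians(angle_deg))  # external inter-port delay
    tau_int = delay_samples / fs                        # internal electrical delay
    w = 2.0 * np.pi * freq_hz
    # y = front - delayed rear  ->  |1 - exp(-j*w*(tau_ext + tau_int))|
    return float(abs(1.0 - np.exp(-1j * w * (tau_ext + tau_int))))

front = directional_gain(0.0, 1000.0)
for angle in (0, 90, 180):
    rel = directional_gain(angle, 1000.0) / front
    print(f"{angle:3d} degrees: {20 * np.log10(max(rel, 1e-6)):6.1f} dB re: front")
```

The raw delay-and-subtract output falls off at low frequencies (roughly 6 dB per octave), so practical designs equalize the response before further processing.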

6.8.3.2 Digital Noise Reduction

Digital noise reduction is intended to remove noise while retaining speech information. Digital noise reduction is a nearly universal feature in modern hearing aids, although the type of digital noise reduction and the extent to which it is applied vary markedly. Noise reduction usually involves classifying the signal in each frequency band as predominantly speech or predominantly noise and decreasing the gain in bands that are dominated by noise while preserving the gain in bands that are dominated by speech. Typically, the modulation pattern of the signal is used to estimate whether speech or noise dominates in each band (Bentler and Chiou 2006a; Chung 2012). One limitation is that digital noise reduction cannot function perfectly without a template of the speech alone—something that is not available in real environments. On occasion, digital noise reduction may misclassify within-band noise as speech or misclassify within-band speech as noise. Such processing errors are more likely in cases in which the “noise” comprises other people speaking.
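The modulation-based classification idea can be illustrated for a single band: a band whose envelope fluctuates deeply (speech-like) keeps its gain, while a band with a nearly flat envelope (noise-like) is attenuated. The modulation metric, threshold, and maximum attenuation in the sketch below are illustrative assumptions and do not correspond to any manufacturer's algorithm.

```python
import numpy as np

def band_noise_reduction_gain_db(envelope: np.ndarray,
                                 modulation_threshold: float = 0.35,
                                 max_attenuation_db: float = 10.0) -> float:
    """Gain offset (dB, <= 0) for one band given its short-term envelope samples."""
    env = np.asarray(envelope, dtype=float)
    # Modulation-depth proxy: envelope standard deviation relative to its mean.
    modulation_index = env.std() / max(env.mean(), 1e-12)
    if modulation_index >= modulation_threshold:
        return 0.0                                    # speech-like: leave gain alone
    # Attenuate more the further the band falls below the modulation criterion.
    shortfall = (modulation_threshold - modulation_index) / modulation_threshold
    return -max_attenuation_db * shortfall

rng = np.random.default_rng(1)
t = np.arange(0, 1.0, 0.01)
speechlike = 0.2 + 1.8 * np.abs(np.sin(2 * np.pi * 4 * t))   # deep 4-Hz envelope fluctuation
noiselike = 1.0 + 0.05 * rng.standard_normal(t.size)          # nearly flat envelope

print(f"speech-like band gain offset: {band_noise_reduction_gain_db(speechlike):5.1f} dB")
print(f"noise-like band gain offset:  {band_noise_reduction_gain_db(noiselike):5.1f} dB")
```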

Patient expectations for digital noise reduction are high, but for many years the evidence suggested that it did not improve speech perception (Bentler 2005; Palmer et al. 2006; Bentler et al. 2008). Recently, however, researchers have begun to measure listening effort rather than speech identification. Those studies have consistently found that digital noise reduction reduces listening effort and fatigue and increases acceptance of background noise (Sarampalis et al. 2009; Hornsby 2013; Lowery and Plyler 2013; Gustafson et al. 2014). Because it reduces listening effort, noise reduction may also free cognitive resources for other tasks, such as learning new information (Pittman 2011).

6.8.4 Choosing Hearing Aid Parameters to Suit Individual Cognitive Abilities

Hearing aid choices and parameters have long been customized to suit the patient’s pure-tone audiogram and loudness discomfort levels. More recently, it has been recognized that individual cognitive abilities may also be relevant in selecting the parameters of hearing aid processing. Most of that work has relied on measurements of working memory (described in Sect. 6.5). Recall that low working memory is thought to reduce the ability to adapt to a degraded or altered acoustic signal. When hearing aids are used, the signal processing may significantly alter and/or degrade the speech signal. Such signal processing includes WDRC with a short release time (Sect. 6.8.1.3), frequency lowering (Sect. 6.8.1.2), and digital noise reduction (Sect. 6.8.3.2) in cases where classification errors result in reduced fidelity of the target speech signal or where the processing introduces spurious amplitude fluctuations that may affect intelligibility. Lower working memory is associated with poorer performance with short compression release times (Gatehouse et al. 2006; Lunner and Sundewall-Thoren 2007; Souza and Sirow 2014), and higher frequency compression ratios (Arehart et al. 2013a). There is emerging evidence that working memory may affect the benefit of digital noise reduction. One study showed that working memory was modestly associated with speech recognition benefit of digital noise reduction (Arehart et al. 2013b); another showed that digital noise reduction reduced cognitive load but only for listeners with high working memory (Ng et al. 2013). A third study showed no relationship between working memory and speech recognition, but patients with low working memory preferred stronger noise reduction settings (Neher et al. 2014).

Because noise can be a significant problem for patients with lower working memory, it seems probable that, for such patients, the beneficial effects of suppression of noise might outweigh the deleterious effects of the distortion produced by the noise suppression. Because the few available studies employed different outcome measures (speech recognition, word recall [i.e., memory load], and overall preference), additional work is needed to clarify the extent of the relationship between working memory and noise reduction. More generally, the relationships between individual cognitive abilities and benefit from different features of hearing aid processing reflect the importance of understanding not only the acoustic effect of the hearing aid but also the interaction of those effects with the listener.

6.9 Summary

For people with normal hearing, speech perception appears largely effortless and occurs unconsciously. Hearing loss can greatly increase the effort involved in understanding speech such that speech perception rises to the level of conscious attention. And, when hearing loss impairs speech perception, it does so in unpredictable ways. When no hearing aids are used, the consequences of hearing loss vary from minimal effects in selected situations to substantial difficulty such that communication becomes a struggle that impairs every aspect of work and social engagement. Speech perception is determined by both auditory and cognitive factors, ranging from the amount of hearing loss and the specific pattern of auditory damage to the listener’s ability to compensate for reduced auditory cues using cognitive processing. Hearing aids can compensate for reduced audibility in many situations but are limited as to how much they can improve communication in adverse conditions, such as for speech in background sounds or reverberation. Although many research studies have defined the general effects of hearing loss (and hearing aids) on speech perception, the variability among individuals serves as a reminder that each individual—and the optimal hearing aid processing for that individual—must also be treated as unique.