1 Speech hearing and understanding

Within the frame of a common philosophical discussion of understanding there are concepts such as dialogue, discussion, controversy, listening, attention, intention, cognition, interpretation, and other forms of linguistic and mental events and processes relevant to human communication, interaction and cooperation (Dascal 2003). But what if some of the words are missing or distorted due to common conditions such as a low voice, noise, reverberation, interfering conversations, or hearing loss?

Hearing is physical. Understanding is mental. Hearing is auditory; attention and listening are more holistic. Understanding is supported by body gestures and by symbolic and non-symbolic cues that may be added to the auditory signals. The ubiquity of literacy in society has created a perceived separation between hearing and understanding language. Cultural devices such as subtitles accompanying movies have bolstered this impression. It is commonly held that cognition operates independently of the sensory input, be it auditory, visual or even tactile (e.g. Braille). When experiencing hearing loss, people often go to an audiologist and receive a prescription for a hearing aid. When they do not understand what they hear, they might ask the speaker to repeat her words, make a guess, or try to grasp the linguistic information by visual means. However, the human capacity to compensate for hearing loss and maintain the level of mutual understanding is limited, even with the support of the most advanced hearing aids (Lesica 2018; Anderson et al. 2018). It seems that "hearing loss" entails much more than a mere deficit in the reception of auditory signals.

Rene Descartes' influence on philosophy and science has contributed much to the separation of the material (including auditory and visual input) from the mental (which includes cognition and emotions) (Descartes 1989/1649). In the twentieth century, the philosopher Sir Karl Popper and the Nobel laureate neurophysiologist Sir John Eccles expressed a similar position (Popper and Eccles 1977). Deviating from this tradition, we would like to ask whether hearing loss impacts cognition beyond the deprivation of data. More specifically, does hearing loss have unique effects on discourse understanding? Does it impair thinking, language and understanding in ways that are distinct from other forms of sensory deprivation? As we all know (but tend to overlook), mind and body, thinking and experiencing, are interdependent in a deep way (Varela et al. 1993).

The received signal at a person's ears may comprise multiple speech sources, interference and noise signals, as well as sound reverberations from surrounding physical objects. The hearer needs to separate the attended speaker’s voice from all the other speakers and sounds that might be louder and closer, and to proceed to achieve understanding through complex processes that involve lower and higher cognitive functions.

Received speech signals are often incomplete. As a basic condition for their successful processing, missing or distorted parts must be corrected and compensated for by both automatic and intentional completion of partial information at all levels. The automatic integration of background knowledge complements the completion of phonemes, words and sentences through the unconscious, automatic insertion of presumed missing parts (Warren 1970). In general, listeners can compensate for missing or noise-masked phonemes using their knowledge of the spoken language. Automatic neuro-cognitive processes "fill in" sensory gaps and complete missing parts that have not been heard. The path to understanding involves guessing, inferring and interpreting, all based on available cues and prior knowledge, in the real time of a conversation, lecture or any other form of spoken auditory input. Low-level cognitive activities of bridging and elaborative inference are also performed automatically in real time (Singer and Brooke 2012). By now it is widely accepted that beliefs and background knowledge are essential for understanding. Prior knowledge participates unconsciously and automatically in the process of interpretation intended to create meaning. Without it, no understanding can be reached (Madden and Zwaan 2006).

To a large extent, knowledge is acquired through experience, including acoustical experience. Hearing loss therefore means the loss of a major source of experience and should presumably have a negative impact on knowledge acquisition and speech understanding in more than one way. A reasonable conclusion is that missing verbal input changes the balance between hearing and prior knowledge, increases the uncertainty about the content of a spoken message and increases the probability of mistakes and misunderstanding. But this is what happens to each of us when ambient acoustical conditions are bad, when noise and reverberation impair hearing, and when hearing deteriorates with age. As an example, it has been verified that acoustical conditions in classrooms are often a barrier to learning (Nelson and Soli 2000).

Hearing loss is one of the most common conditions affecting aging people and is most often caused by disorders of the inner ear or auditory nerve. Gradual age-related hearing loss is termed presbycusis (from the Greek presbys, "elder", and akousis, "hearing"). It evidently causes difficulties in speech perception and localization and leads to degradation in quality of life and to social isolation (Amieva et al. 2015). It changes the allocation of cognitive resources and requires increased conscious effort and attention to reach speech understanding. People with hearing loss face a particularly difficult challenge in unfavorable acoustical situations, as demonstrated by the "cocktail party problem" (Cherry 1953) of separating and selecting sources within a noisy environment (Kahneman 1973). Aging adults are therefore the main target group for hearing aid devices.

Like age-related hearing loss, cognitive aging is the gradual decline in cognitive processing that occurs as people get older. It involves degradation in processing speed, attention, memory, language, visuospatial abilities, and executive functioning/reasoning (Harada et al. 2014). It might not come as a surprise that an association has been found between age-related hearing loss and cognitive decline, cognitive impairment and dementia, even though their causal relation is still uncertain; they might both be results of underlying degenerative processes such as degradation of vascular performance (Loughrey et al. 2018).

The effects of age-related hearing loss and cognitive aging are not limited to individual knowledge acquisition and processing. It should also be recognized that frequent and continuous dialogue with others might be essential for maintaining language and discourse understanding skills. Following Martin Buber, we may hypothesize that social interaction and communication with others are important to cognition and are essential to the perception of the self as a person (Buber 1958/1923). This can now be supported by several theories and empirical results. One example is the principle of linguistic relativity, which states that language has a major influence on thinking (the so-called Sapir-Whorf hypothesis). Another is the hypothesis that language attrition may be attributed to reduced use of language (Köpke 2007), which might also lead to reduced language understanding (Schneider et al. 2010). Such claims can be further reinforced by findings on brain plasticity, the brain's ability to change its structure and function in response to experience (Syka 2002). It follows that limited auditory experience leads to limited brain and cognitive response to external stimuli. We hypothesize that the very slow nature of cognitive decline and presbycusis fosters changes in the brain, and consequently in language understanding skills. Some of these changes may be compensatory, such as an increase in visual processing or inner speech. However, if age-related gradual hearing loss is indeed accompanied by cognitive aging and changes in the brain, then the late adoption of hearing aids, which focus on the production of sounds, may be less helpful to a brain and cognition that have already been altered by many years of deteriorating hearing.

Another report found that hearing loss is associated with accelerated cognitive decline in older adults and suggested a few potential reasons for this association (Lin et al. 2013). One is an increased cognitive load, as hearing loss is increasingly compensated for by cognitive processes and resources, such as inference and working memory, that are needed to process the decreased auditory information. The other is that hearing loss increases social isolation, which is a risk factor for cognitive decline and dementia, presumably due to reduced interaction with others and decreased stimulation. Indeed, the link between hearing loss and social isolation has been well established (Amieva et al. 2015). The association of cognitive decline and hearing loss brings into relief that hearing loss is not a limited and isolated problem of the sensory system, but a complex decline across the sensory, cognitive, linguistic, psychological, and social dimensions of the self.

The impact of hearing conditions on cognitive performance is borne out by the acoustics of ordinary classrooms, yet the relation between physical acoustic conditions and student achievement is rarely considered (Klatte et al. 2010). The mental effort directed at deciphering speech may drain mental and even emotional resources, especially when they are already at their limits (Glass et al. 1969; Corah and Boffa 1970). Hearing unintelligible speech is a double stressor, leading both to failure of understanding and to the experience of speech as noise.

Hearing and understanding may therefore be addressed as an integrated body-mind activity, in which decreased auditory input and output is linked to decreased cognitive capabilities. Such an approach should consider the whole person involved in speaking, listening and understanding, including their physical and cognitive limitations. The surrounding acoustical characteristics, social accommodation (e.g., speaking slowly and loudly), and the availability of assistive technologies can all mitigate these age-related disabilities and contribute to human perception, judgement, social participation, and well-being in general.

We will now turn to explore this hypothesis of integrated body-mind performance and its relation to hearing and understanding through a focused review of the role of hearing aids and cognitive technologies in hearing loss situations.

2 Hearing aids and cognitive technologies

Several researchers have addressed the relation between hearing aid use and cognitive decline. In a recent study, Sarant et al. (2020) reported a clinically and statistically significant improvement in cognition in a group of participants after 18 months of hearing aid use, suggesting that treatment of hearing loss with hearing aids may delay cognitive decline.

Still, hearing aids are mainly amplification devices designed to compensate for hearing loss caused by degradation of the peripheral hearing system—mainly the inner ear (Lesica 2018). However, we have seen that presbycusis and other hearing problems involve challenges broader than improving the physical properties of the acoustic signal reaching the inner ear. The problem of speech understanding remains unresolved, as is evident from the ubiquitous patient complaint "I can hear you but I can't understand you" (Lin 2012). As such, hearing aids do not sufficiently address the problem of speech understanding, nor do they address cognitive processes—at least not directly (ibid.).

It is therefore evident that there is a pressing need for more advanced technologies and broader device-based assistance to bridge the gap between environmental stimuli and symbolic understanding and meaning extraction. Technology that assists cognitive processes and functions such as memory, attention, selectivity in time and space, and speaker identification can accordingly be classified as 'cognitive technology' (Dascal 2002). Such technologies could extend the reach of current hearing devices to address cognitive functions. For example, they could manipulate time and space by recording or artificially slowing down speech, or bridge distance through voice transmission. Slowing and repetition of speech can compensate for slower cognitive processing, and remote microphones (e.g., in smartphones) can assist in source separation. They could also provide interactive means for training and could be adjusted by the user for optimal results in different environments.

Dascal (2002) defines cognitive technology as a means to assist cognitive processes. His definition allows him to consider language's contribution to cognition as an important case of man-made cognitive technology. Dascal's concept of cognitive technology could be useful for hearing-aid development, enhancing their function to the much-desired position of new understanding-aids. The designer of an artifact addresses its purpose: "A chair is made for sitting" (Mercier and Sperber 2017, p. 177).

To extend the functions of hearing devices beyond the physical realm and to transform them into a cognitive technology, hearing devices can make use of recent advances in computer and communication technologies that provide more processing power, dynamic network configuration and connectivity, all within smaller devices and with longer operational times.

As a technology platform, smart phones can support the [Assistive Technology for Cognition] ATC functions of alerting, distracting, navigating, reminding, prompting and storing and displaying information. Such diverse functionality from a single technology platform underscores our argument that research should focus on the generalizable level of ATC function, conceptualized in cognitive terms, rather than specific devices or even technology platform (Gillespie et al. 2012).

3 Cognitive technologies for speech understanding

Our philosophical reflection on hearing and understanding seeks to delineate, even with broad brushstrokes, specifications for devices that will address the holistic challenge of understanding speech by the hard of hearing.

Hearing aids have an indirect impact on cognition (e.g., they decrease cognitive load, see Kahneman 1973). Through their extension into other devices and networks, they now have an increased potential to improve speech understanding more directly.

Even a small change in the number of correctly understood words can have a disproportionate effect on overall understanding and communication experience (Whitton et al. 2017).

Hearing aids are already connected to smartphones that function as parts of an integrated cognitive hearing technology. Still, their functions are not designed to assist cognition and speech understanding, as can be seen from their published current and future specifications; they mostly include amplification and filtering functions. However, some relevant functionality has already been incorporated in recent commercial devices, including healthcare applications (Kimball et al. 2018):

  • Wireless connectivity of hearing aids and smartphones adds processing power, memory and network connectivity to the previously limited resources of hearing aids.

  • Remote control and monitoring of hearing aids (battery, volume, remote microphones, and dynamic profiles) provide a means for users to control their assistive devices for better use and comfort.

  • Hearing test applications support personal customization.

The concept of ‘hearables’ covers hearing aids, personal sound amplifiers, and earbuds (small earphones) alike. It might signify the start of a change in the perception of auditory devices, from medical devices that compensate for a physical disability towards cognitive devices that can improve social interaction and understanding. Much more can be expected. The following is a further, non-exhaustive list of potential cognitive technologies with their implications for speech and cognition.

  • Recording for repeated listening at chosen times to facilitate memory and compensate for slow cognitive processing (Schneider et al. 2010, p. 190).

  • Slowing speech for easier understanding, prolonging the duration of less predictable words (Kraus and Slater 2016, p. 89). This can also be done by selectively changing the time gaps between words to decrease the processing load involved in speech understanding (see the sketch following this list).

  • Frequency shifting out of dead regions into regions of audibility (Edwards 2004, p. 399), which adds more information for auditory processing.

  • Speech separation—multiple voice sensors and signal processing algorithms to assist the cognitive selection among many voice signals, addressing the "cocktail party effect" that blocks speech understanding (Gannot 2017).

  • Source identification—selecting a specific speaker and attenuating others, based on high-level speech attributes or on the preferences of the listener (O’Sullivan et al. 2017).

  • Target relocation and separation—moving the perceived location of different speakers by time and volume manipulation (Schneider et al. 2010, p. 175).

  • Ad-hoc networks of smartphones that can select one among many speakers. Each smartphone has a microphone, and each can be selected as an input for the user's hearing aids (Kimball et al. 2018).

  • Voice replacement—changing accent or replacing female and male voices (which differ in frequency range).

  • Machine translation—which can be extended to the replacement of complex words by frequently used synonyms, thus assisting word retrieval from memory.

  • Speech to text—enabling simultaneous reading and hearing, providing visual cues for better understanding (Greenberg and Ainsworth 2004, p. 37).

  • Training—which exploits the plasticity of cortical and subcortical circuits in the brain (Schnupp et al. 2011, pp. 289–293; Pichora-Fuller and Levitt 2012; Schneider et al. 2010, p. 200).
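To make the second item in the list above (slowing speech and widening the gaps between words) more concrete, the following is a minimal Python sketch. It is only an illustration, not a description of any commercial device: it assumes the librosa and soundfile packages and a hypothetical list of per-word time boundaries (in seconds), such as might be obtained from a forced aligner or a speech-to-text service.

```python
# Sketch: slow a speech recording and widen the pauses between words.
# Assumes librosa and soundfile are installed; `word_bounds` is a hypothetical
# list of (start, end) times per word, e.g. from a forced aligner.
# Note: the audio between words is replaced by silences of fixed length.
import numpy as np
import librosa
import soundfile as sf

def slow_and_space(path_in, path_out, word_bounds, rate=0.8, extra_gap=0.15):
    """Time-stretch each word by `rate` (< 1 slows it down) and insert
    `extra_gap` seconds of silence after it."""
    y, sr = librosa.load(path_in, sr=None)
    gap = np.zeros(int(extra_gap * sr), dtype=y.dtype)
    pieces = []
    for start, end in word_bounds:
        segment = y[int(start * sr):int(end * sr)]
        slowed = librosa.effects.time_stretch(segment, rate=rate)
        pieces.append(slowed)
        pieces.append(gap)
    sf.write(path_out, np.concatenate(pieces), sr)

# Example (hypothetical file and alignment):
# slow_and_space("lecture.wav", "lecture_slowed.wav",
#                word_bounds=[(0.00, 0.42), (0.47, 0.90), (0.95, 1.30)])
```

The point of the sketch is merely that such manipulations operate on the waveform but are driven by linguistic units (word boundaries), anticipating the levels of representation discussed in Sect. 4.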

In this context of future technologies, we would like to mention one report published by O’Sullivan et al. (2017). Their research addressed source identification and speech separation in a multi-speaker environment, based on the listener’s target of attention. They combined two technologies. The first is auditory attention decoding (AAD), which uses neural signals to decode the identity of an attended speaker. The second performs the separation and amplification of the attended speech signal and the attenuation of the others. The first technology was used to control which speech signal should be selected by the second for amplification. The first aims to decode attention, while the second provides source separation, which is mainly a cognitive task.

The authors named their system “cognitively controlled hearing aids”. Their results showed reduced listening effort, which might imply a cognitive gain. The researchers reported that their system performed well and improved signal quality and user satisfaction substantially. However, they could not identify any improvement in speech intelligibility:

lack of improved intelligibility is a well-known phenomenon in speech enhancement research where noise suppression does not typically improve intelligibility scores, even though listening effort is reduced (O’Sullivan et al. 2017, p. 10).

O’Sullivan et al. (2017) checked the cognitive implications of their system through a missing-word task and subjective reports by test participants. They stopped short of testing speech understanding (i.e. meaning). However, source selection is mainly a spontaneous cognitive function that is closely related to speech understanding rather than to mere perception. We would like to suggest that speech understanding, processing demand and listening effort could be a more appropriate target for their system (Wendt et al. 2016). The two technologies could then be extended to include human feedback on understanding, with the aim of achieving the best human-technology combination.
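To make the division of labour between the two technologies concrete, the following is a highly simplified Python sketch of the selection step only; it is not the implementation of O’Sullivan et al. It assumes that a separation stage has already produced candidate source signals and that an attention-decoding stage has reconstructed the attended speech envelope from neural recordings (time-aligned and at a matching frame rate); the sketch merely picks the candidate whose envelope correlates best with that reconstruction and boosts it in the remix.

```python
# Simplified sketch of attention-driven source selection (not the system of
# O'Sullivan et al.): pick the separated source whose amplitude envelope best
# matches an envelope reconstructed from neural data, then boost it.
import numpy as np

def envelope(x, frame=256):
    """Crude amplitude envelope: RMS over non-overlapping frames."""
    n = len(x) // frame
    return np.sqrt((x[:n * frame].reshape(n, frame) ** 2).mean(axis=1))

def select_and_boost(sources, neural_envelope, gain_db=9.0):
    """`sources`: list of separated 1-D signals of equal length.
    `neural_envelope`: attended-speech envelope decoded from EEG/ECoG,
    assumed here to be already aligned to the same frame rate."""
    scores = []
    for s in sources:
        e = envelope(s)[:len(neural_envelope)]
        r = np.corrcoef(e, neural_envelope[:len(e)])[0, 1]
        scores.append(r)
    attended = int(np.argmax(scores))        # source the listener attends to
    gain = 10 ** (gain_db / 20)
    mix = sum(gain * s if i == attended else s for i, s in enumerate(sources))
    return attended, mix / np.max(np.abs(mix))
```

The extension suggested above would add a further loop around such a selection step, in which the listener’s own feedback on understanding adjusts the gain or the selection itself.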

4 Identifying acoustical transforms for greater speech clarity

We now turn to a more detailed discussion of acoustical transformations that address speech clarity and might promote speech understanding.

Speech data may be represented at several different levels. In the most direct form, speech data may be recorded as an explicit digitization of the original audio input—that is to say, an audio file in formats such as .wav or .aiff which stores audio data directly and allows it to be replayed. At a higher level of abstraction, speech data can be modeled by performing quantitative analyses which isolate mathematical profiles of the input audio and its wave forms; encoded representations of such quantitative data may then be stored in lieu of the sound data itself. Important speech-processing tasks—such as speaker identification, or audio segmentation to isolate individual words—are typically performed on quantitative encodings of audio input data, rather than on the audio input itself. A conventional workflow will isolate "feature vectors" from mathematical audio encodings and then search for patterns in these vectors which signal, to some degree of probability, high-level facts about the audio. For instance, sudden quantitative shifts across multiple dimensions in a feature vector (tracked through time) suggest word boundaries, and grouping feature vectors by certain similarity metrics permits the isolation of individual speaking voices, or the separation of speech content from background noise. Such analyses then permit the audio signal to be recorded at a still higher level of abstraction and processing: here audio data may be represented linguistically, in terms of words, sentences, speakers, and prosodic features of spoken language.
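As a concrete illustration of this workflow, the following Python sketch (assuming the librosa package) extracts MFCC feature vectors from an audio file and flags frames where the vector changes sharply from one frame to the next as candidate word or segment boundaries. It is a toy heuristic of the kind described above, not a production segmentation algorithm, and the file name is hypothetical.

```python
# Sketch: extract per-frame feature vectors (MFCCs) and flag frames where the
# feature vector jumps sharply as candidate segment/word boundaries.
import numpy as np
import librosa

def candidate_boundaries(path, n_mfcc=13, z_threshold=2.0):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, frames)
    # Frame-to-frame change, aggregated over all feature dimensions.
    flux = np.linalg.norm(np.diff(mfcc, axis=1), axis=0)
    # Flag unusually large jumps using a simple z-score rule.
    z = (flux - flux.mean()) / (flux.std() + 1e-9)
    frames = np.where(z > z_threshold)[0] + 1
    return librosa.frames_to_time(frames, sr=sr)              # boundary times in seconds

# Example (hypothetical file):
# print(candidate_boundaries("utterance.wav"))
```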

Having thus identified three levels of abstraction for speech encoding—in terms of raw audio, of quantitative wave-form profiles, and of natural language and prosody, respectively—we can also observe that these levels apply not only to encoding audio as it is presented, but also to modifying audio with the higher levels of what we defined above as cognitive technology, for the benefit of the hearing-impaired. Phonetic qualities such as intonation, stress, and tempo are intrinsically bound to language: when we speak of individual words or syllables being stressed, we are working within the conceptual framework afforded by language and its manifestation in human speech. However, this high-level conceptual layer provides the scaffolding wherein lower-level audio phenomena are generated.

“Speech,” as the enunciation of linguistic units (e.g., words and sentences) is an emergent phenomenon whose substrate is audible vocalizations, by analogy to how consciousness itself is an emergent property of the nervous system. As with any emergent phenomenon, the supervening register (whatever its material dependency on a subvening substratum) can only be scientifically understood, in full detail, within a conceptual framework whose theoretical posits quantify over concepts ontologically bound to the emergent register. In normal human speech, it is the brain which translates vocal intentions to audible sounds; in other words, the explanatory gap between the subvening and supervening registers can be closed, in principle, by examining how the brain formulates vocalizations in the presence of abstract linguistic intentions (sentences as immaterial structures). In the technological context, audio-enhancement software must replicate at least some of this neurocognitive activity: it must generate and/or manipulate audio data by emulating how the human mind formulates speech.

Given this overview, we see that transformations of speech-audio content should be represented across several levels, retracing the levels of representation for speech itself. Each of these levels requires its own computational and representational models. A full description of representations at the audio and mathematical levels is outside the scope of this paper, but in this section we will discuss transform representations at the more abstract language/discourse level. Specifically, we will examine how we can identify modulations toward optimal speech patterns (vis-à-vis understandability for the hearing-impaired) insofar as they may be observed and notated through the conceptual framework of language and prosody.

Describing prosodic modifications is a different process from notating the existing speech patterns which are present in audio and/or transcribed records of speech occurrences. For transcribing speech as is, there are well-established formats such as SSML (Speech Synthesis Markup Language), Stem-ML (Soft Template Markup Language) and ToBI (an acronym for "tones and break indices"). While these formats achieve a detailed annotation of prosodic information supplemental to speech transcription (viz., annotated transcriptions do not merely record what was said as written text, but mark changes in pitch or tone, speaker alternation/overlap, sentence-boundary tones, and so forth), they are not intended for the further task of notating how a given speech artifact may be modified to generate a new audio resource optimized, relative to the original, for persons who are hearing-impaired.
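To illustrate the distinction in a toy form (separate from the supplemental library described below), the following Python sketch builds a small SSML-style annotation and then adds a hypothetical attribute layer, namespaced here as mod:*, to express how the same utterance should be modified for a hearing-impaired listener. The prosody element and its rate/pitch attributes belong to standard SSML; the mod:stretch and mod:gain-db attributes and their namespace are purely illustrative inventions, not part of any existing standard, and only gesture at the kind of notation such an extension might provide.

```python
# Sketch: SSML-like transcription of an utterance, extended with hypothetical
# "mod:*" attributes that notate desired modifications rather than what was said.
# The mod:* attributes and namespace are illustrative only, not standard SSML.
import xml.etree.ElementTree as ET

NS = {"mod": "urn:example:speech-modification"}   # hypothetical namespace
ET.register_namespace("mod", NS["mod"])

speak = ET.Element("speak")
s = ET.SubElement(speak, "s")                      # one sentence

w1 = ET.SubElement(s, "prosody", {"rate": "medium"})
w1.text = "Please"

# A less predictable word: notate that it should be elongated and emphasised.
w2 = ET.SubElement(s, "prosody", {
    "rate": "slow", "pitch": "+10%",               # standard SSML prosody hints
    f"{{{NS['mod']}}}stretch": "1.3",              # hypothetical: elongate by 30%
    f"{{{NS['mod']}}}gain-db": "4",                # hypothetical: amplify by 4 dB
})
w2.text = "acetaminophen"

print(ET.tostring(speak, encoding="unicode"))
```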

In order to demonstrate how prosodic markup may be extended to notate audio modifications, this paper is accompanied by a supplemental code library which presents several suggestions for extending prosodic encoding, and also implements a parser for the markup thus enhanced (Footnote 1). In addition, the library includes features for constructing a research environment where audio enhancements may be empirically tested. Specifically, given a prior audio sample and an alternative rendering of the audio constructed according to notated modifications, the library may be used to process empirical data reflecting how well the modified version promotes understandability compared to the original. The code is designed to be compatible with existing formats for testing audio/acoustic quality in the speech context, such as the Perceptual Objective Listening Quality Assessment (POLQA), Perceptual Evaluation of Speech Quality (PESQ), and Mean Opinion Score (MOS) standards, which measure multiple facets of subjectively experienced audio quality (including voice clarity, intrusiveness of background noise, clarity of individual words, etc.).
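As a hedged illustration of the kind of testing environment mentioned above (not necessarily the approach taken in the supplemental library), the sketch below compares an original and a modified rendering of an utterance against a clean reference using the open-source pesq package, a PESQ implementation; POLQA itself is a licensed standard and is not assumed here. The file names and the availability of a clean reference are illustrative assumptions, and MOS-style ratings would in practice be collected from human listeners.

```python
# Sketch: compare original vs. modified renderings of an utterance against a
# clean reference using PESQ scores (via the open-source `pesq` package).
# Assumes mono recordings at 16 kHz; POLQA is licensed and not used here, and
# MOS-style listener ratings would be collected separately from participants.
import soundfile as sf
from pesq import pesq   # pip install pesq

def pesq_score(reference_path, degraded_path, fs=16000):
    ref, _ = sf.read(reference_path)
    deg, _ = sf.read(degraded_path)
    return pesq(fs, ref, deg, "wb")   # wide-band PESQ, returns a MOS-LQO value

# Hypothetical files: a clean studio reference, the original noisy recording,
# and a version processed according to the notated modifications.
# print("original:", pesq_score("clean.wav", "original.wav"))
# print("modified:", pesq_score("clean.wav", "modified.wav"))
```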

5 Summary and concluding remarks

Hearing loss is a common impairment that is present, or will be present, for most of us. Its consequences include social difficulties in human interaction and degradation of speech understanding. Hearing loss correlates with solitude, cognitive decline, depression and dementia. Current hearing aid technology does not offer a solution for these problems; it is mostly designed for sound amplification and improvement of the signal-to-noise ratio. Extending auditory research to cognitive processes may change the way that hearing aids are defined and designed, turning them into cognitive technologies that assist and enhance both communication and cognition.

Hearing devices could further develop into an interactive source of information and augmented reality that enhances human experience, mental capacities and wellbeing, beyond and in addition to their traditional function of hearing loss compensation. As noted above, the concept of ‘hearables’ covers hearing aids, personal sound amplifiers, and earbuds alike, and might signify the start of a change in the perception of auditory devices, from medical devices that compensate for a physical disability towards cognitive and social devices.

Given the recording of a lecture or a conversation, for example, computer software could potentially generate an altered audio file which manipulates and clarifies the input data to produce an optimized version of the original speech data. To implement such alterations, it would then be necessary to identify how audio waveforms should be transformed. This corresponds to the second level of abstraction: given the mathematical encoding of the original audio data, notate desired transformations such as modifying the pitch and/or sound level of certain audio segments (corresponding, for example, to individual words), elongating certain words, or creating the effect of an individual speaker's voice being amplified, modulated, or figuratively moved in space (see Schneider et al. 2010, cited above). Finally, at the natural-language level, the implementation of audio-manipulation technology may be aided by a representation of desired audio-transform effects as they are manifest in the abstract register of language and discourse, where we may notate that a given word, for instance, should be emphasized via changes in pitch, amplitude, and/or segment length.
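A minimal Python sketch of such a waveform-level transform is given below, assuming the librosa and soundfile packages and hypothetical word timings taken from an aligner or transcript: it emphasises a single word by raising its level and nudging its pitch, leaving the rest of the recording untouched. It is a sketch of the mechanism, not a recommendation of particular parameter values.

```python
# Sketch: emphasise one word in a recording by amplifying it and nudging its
# pitch, as an example of a waveform-level transform notated at a higher level.
# Assumes hypothetical word timings (in seconds) from an aligner or transcript.
import numpy as np
import librosa
import soundfile as sf

def emphasize_word(path_in, path_out, start, end, gain_db=6.0, pitch_steps=1.0):
    y, sr = librosa.load(path_in, sr=None)
    i, j = int(start * sr), int(end * sr)
    word = librosa.effects.pitch_shift(y[i:j], sr=sr, n_steps=pitch_steps)
    word = word * (10 ** (gain_db / 20))
    out = np.concatenate([y[:i], word, y[j:]])
    sf.write(path_out, out / np.max(np.abs(out)), sr)   # normalise to avoid clipping

# Example (hypothetical file and timing):
# emphasize_word("sentence.wav", "sentence_emph.wav", start=1.20, end=1.65)
```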

Notating desired features of optimized/altered audio files is only the first step toward implementing “cognitive” hearing enhancement, but it is an important step, because effective audio manipulation can only be achieved within the context of analyzing which alterations are appropriate to improve understandability. Existing literature in speech technology has identified certain obvious alterations that can improve the quality of speech recordings (e.g., minimizing background noise or isolating individual speakers), but more subtle manipulation requires a more thorough understanding of the cognitive and linguistic background for speech processing. Our framework for investigating these more subtle enhancements cannot be developed primarily at the acoustical level (the register of speech as a quantitative waveform) but rather must be conceptualized initially at the level of language and discourse.

Further research could be directed to the role of other senses, such as sight, olfaction and touch (Cieśla et al. 2019), in speech understanding, and to the ways hearing aid functionality could be extended and integrated with "smart-glasses" technologies. Integrated assistive devices could combine several or all sensory inputs to help their users achieve the desired common-sensical results.