
1 Introduction

People affected by severe neurodegenerative diseases (e.g., late-stage amyotrophic lateral sclerosis (ALS) or locked-in syndrome) eventually lose all muscular control and are no longer able to gesture or speak. They also cannot use traditional assistive communication devices that depend on muscle control, nor typical brain-computer interfaces (BCIs) that depend on visual stimulation or feedback [1,2,3]. For this population, auditory [4,5,6,7,8] and tactile BCIs [9, 10] are two of only a few remaining means of communication (see [11] for review).

While visual BCIs typically preserve the identity between the stimulus (e.g., a highlighted ‘A’) and the symbol the user wants to communicate (e.g., the letter ‘A’), all currently used auditory or tactile BCIs require a relatively artificial mapping between a stimulus (e.g., a particular but arbitrary sound) and a communication output (e.g., a particular letter or word). This mapping is easy to learn when there are only a few possible outputs (e.g., a yes or no command). However, as the number of possible outputs increases, such as with a spelling device, the mapping becomes arbitrary and complex. This makes most current auditory and tactile BCI systems cumbersome to learn and use.

Two avenues are being investigated to overcome this limitation. The first avenue is to directly decode expressive silent speech without requiring any external stimuli. In this approach, linguistic elements at different levels (e.g., phonemes, syllables, words and phrases) are first decoded from brain signals and then synthesized into speech. While recent studies have demonstrated this possibility [12,13,14,15,16], even invasive recording techniques (e.g., ECoG, LFPs, single-neuron recordings) are currently unable to capture the entire complexity of expressive speech. Consequently, silent speech BCIs are limited in the vocabulary that can be decoded directly from the brain signals. The second avenue is to replace unnatural stimuli that require an artificial mapping with speech stimuli that do not. In such a system, the user would communicate simply by directing attention to the speech stimulus that matches his/her intent. Previous studies that explored this avenue required the speech stimuli to be designed (e.g., altered and broken up [17]) such that they elicited a particular, discriminable evoked response. Such evoked responses can be readily detected in scalp-recorded electroencephalography (EEG) to identify the attended speech stimulus. However, such altered speech stimuli are difficult to understand, which makes such a BCI system difficult to use. More importantly, this approach does not scale well beyond two simultaneously presented speech stimuli.

Recent studies suggest that the envelope of attended speech is directly tracked by electrocorticographic (ECoG) signals in the gamma band (i.e., 70–170 Hz) [15, 18,19,20,21], effectively removing the need to ‘alter’ the speech stimuli. Further evidence shows that this approach can identify auditory attention to one speaker in a mixture of speakers, i.e., a ‘cocktail-party’ situation [22].

However, BCI systems that use this physiological mechanism for communication purposes have not been described yet. In this study, we explore this possibility by implementing a BCI2000-based real-time system that uses ECoG signals to identify the attended speaker.

2 Methods

2.1 Human Subject

The subject in this study was a 49-year-old, left-handed woman with intractable epilepsy who underwent temporary placement of subdural electrode arrays (see Fig. 1a) to localize seizure foci prior to surgical resection. A neuropsychological evaluation [23] revealed normal cognitive function and hearing (full-scale IQ = 97, verbal IQ = 91, performance IQ = 99), and a pre-operative Wada test [24] determined left-hemispheric language dominance.

Fig. 1

Implant. The subject had 72 subdural electrodes (1 grid and 3 strips in different configurations) implanted over left frontal, parietal, and temporal regions. a Photograph of the craniotomy and the implanted grids in this subject. b Cortical model of the subject’s brain, showing an \(8\,\times \,8\) grid over frontal/parietal cortex, and two strips

The subject had a total of 72 subdural electrode contacts (one \(8\times 8\) 64-contact grid with 3 contacts removed, two strips in \(1\,\times \,4\) configuration, and one strip in \(1\,\times \,3\) configuration). The grid and strips were placed over the left hemisphere in frontal, parietal and temporal regions (see Fig. 1b for details). The implants consisted of flat electrodes with an exposed diameter of 2.3 mm and an inter-electrode distance of 1 cm, and were implanted for one week. Grid placement and duration of ECoG monitoring were based solely on the requirements of the clinical evaluation, without any consideration of this study. The subject provided informed consent, and the study was approved by the Institutional Review Board of Albany Medical College.

We used post-operative radiographs (anterior-posterior and lateral) and computed tomography (CT) scans to verify the cortical location of the electrodes. We then used Curry software (Neuroscan Inc., El Paso, TX) to create subject-specific 3D cortical brain models from high-resolution pre-operative magnetic resonance imaging (MRI) scans. We co-registered the MRIs with the post-operative CT and extracted the electrode coordinates according to the Talairach Atlas [25]. These electrode coordinates are depicted on a Talairach template brain in Fig. 1b.

2.2 Data Collection

We recorded ECoG from the implanted electrodes using a g.HIamp amplifier/digitizer system (g.tec, Graz, Austria) and the BCI software platform BCI2000 [26,27,28], which sampled the data at 1200 Hz. Simultaneous clinical monitoring was implemented using a connector that split the cables coming from the patient into one set that was connected to the clinical monitoring system and another set that was connected to the g.HIamp devices. This ensured that clinical data collection was not compromised at any time. Two electrocorticographically silent electrodes (i.e., locations that were not identified as eloquent cortex by electrocortical stimulation mapping) served as electrical ground and reference, respectively.

Fig. 2

Experimental setup and methods. a Subjects selectively directed auditory attention to one of two simultaneously presented speakers. b We extracted the envelope of ECoG signals in the high gamma band, as well as the envelopes of the attended and unattended speech stimuli (i.e., JFK and Obama). c The correlation between the envelopes of the ECoG gamma band and the attended speech stimulus, accumulated over time, is markedly larger than the accumulated correlation between the envelopes of the ECoG gamma band and the unattended speech stimulus

2.3 Stimuli and Task

The subject’s task was to selectively attend to one of two simultaneously presented speakers in a cocktail party situation (see Fig. 2a). The two speakers were John F. Kennedy and Barack Obama, each delivering his presidential inauguration address. Both speeches were similar in their linguistic features, but were uncorrelated in their sound intensities (\(\mathrm{r} = -0.02, \mathrm{p} = 0.9\)). To create a cocktail party situation, we mixed the two (monaural) speeches into a binaural presentation in which one ear received a \(20\%:80\%\) volume mixture of the two speakers and the other ear received the complementary \(80\%:20\%\) mixture. This allowed us to manipulate the aural location of each speaker throughout the task. For the binaural presentation, we used in-ear monitoring earphones (AKP IP2, 12–23500 Hz bandwidth) that isolated the subject from any ambient noise in the room.
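As an illustration, the following minimal sketch shows how two mono speech waveforms of equal length and sampling rate could be combined into such a complementary binaural mixture. The function name and the ratio parameter are our own; this is not the experimental software itself.

```python
import numpy as np

def mix_binaural(speaker_a, speaker_b, ratio=0.2):
    """Mix two mono speech waveforms into one stereo stream.

    Each ear receives a complementary volume mixture of the two speakers
    (e.g., 20%:80% in the left ear and 80%:20% in the right), which places
    each speaker at a distinct aural location.
    """
    n = min(len(speaker_a), len(speaker_b))
    a, b = speaker_a[:n], speaker_b[:n]
    left = ratio * a + (1.0 - ratio) * b
    right = (1.0 - ratio) * a + ratio * b
    return np.stack([left, right], axis=1)  # shape: (samples, 2)
```

Swapping the roles of the two speakers in this mixture (e.g., by calling the function with the arguments exchanged) moves each speaker to the opposite aural location, which is how the location could be counter-balanced across trials.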

To create a trial structure, we broke these combined streams into segments of 15–25 s in length, resulting in a total of 10 segments with a combined length of 187 s. In the course of the experiment, we presented each segment four times to counter-balance the aural location (i.e., left and right) and the identity (i.e., JFK and Obama) of the attended speaker. Thus, over these four presentations, the subject had to attend to each of the two speakers at each of the two aural locations. This resulted in a total of 40 trials (i.e., 10 segments, each presented 4 times).

At the beginning of each trial, an auditory cue indicated the aural location (i.e., left or right) to which the subject should attend. Throughout the trial, a visual stimulus complemented the initial auditory cue by indicating the identity of the speaker at the attended aural location (e.g., ‘JFK in LEFT ear’). Each trial consisted of a 4 s cue, a 15–25 s stimulus, and a 5 s inter-stimulus period. The total stimulus duration across these 40 trials was 12.5 min. The subject performed the 40 trials in 5 blocks of 8 trials each, with a 3 min break between blocks.

2.4 Offline Analysis

In the offline analysis, we characterized the relationship between the neural response (i.e., the ECoG signals) and the attended and unattended speech streams, as shown in Fig. 2b. In particular, we were interested in two parameters of this neural response. The first parameter was the delay between the audio stream and the resulting cortical processing, i.e., the time from presentation of the audio stream to the observation of the cortical change. The second parameter was the cortical location that was most selective for the attended speech stream. These were the only two parameters needed later to configure the online BCI system.

To determine these two parameters, we extracted the high gamma band envelope at each cortical location and the envelopes of the covertly attended and unattended speech (i.e., JFK and Obama). We then correlated the high gamma band envelope at each cortical location, once with the attended and once with the unattended speech envelope. This resulted in two Spearman’s r-values for each cortical location. An example of this is shown in Fig. 2c. To determine the delay between the audio stream and the resulting cortical processing, we measured the neural tracking of the sound intensity across delays from 0 to 250 ms to identify the delay with the highest r-value.

2.4.1 Signal Processing

We first pre-processed the ECoG signals from the 72 channels to remove external noise. To do this, we high-pass filtered the signals at 0.5 Hz and re-referenced them to a common average reference that we composed from only those channels for which the 60 Hz line noise was within 1.5 standard deviations of the average.
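A minimal sketch of this pre-processing step is shown below, assuming the ECoG data are held in a NumPy array of shape (samples, channels). The filter order and the Welch-based estimate of 60 Hz power are our own assumptions, since the text does not specify how the line noise was measured.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, welch

def rereference_car(ecog, fs=1200.0):
    """High-pass filter at 0.5 Hz and re-reference to a common average
    built only from channels whose 60 Hz line-noise power lies within
    1.5 standard deviations of the across-channel average."""
    sos = butter(2, 0.5, btype='highpass', fs=fs, output='sos')
    filtered = sosfiltfilt(sos, ecog, axis=0)

    # Estimate the 60 Hz power of each channel via Welch's method.
    freqs, psd = welch(filtered, fs=fs, nperseg=int(fs), axis=0)
    line_power = psd[np.argmin(np.abs(freqs - 60.0)), :]

    # Keep only channels whose line noise is close to the group average.
    ok = np.abs(line_power - line_power.mean()) <= 1.5 * line_power.std()
    car = filtered[:, ok].mean(axis=1, keepdims=True)
    return filtered - car
```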

Next, we extracted the signal envelope in the high gamma band using these pre-processed ECoG signals. For this, we applied an 18th order 70–170 Hz Butterworth filter and then extracted the envelope of the filtered signals using the Hilbert transform. Finally, we low-pass filtered the resulting signal envelope at 6 Hz (anti-aliasing) and downsampled the result to 120 Hz.

For each attended and unattended auditory stream, we extracted the time course of the sound intensity, i.e., the envelope of the signal waveform in the speech band. To do this, we applied an 80–6000 Hz Butterworth filter to each audio signal and then extracted the envelope of the filtered signals using the Hilbert transform. Finally, we low-pass filtered the speech envelopes at 6 Hz and downsampled them to 120 Hz.
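Both envelope computations follow the same pattern (band-pass filter, Hilbert envelope, 6 Hz low-pass, downsampling to 120 Hz), so a single helper can serve both. The sketch below makes our own assumptions about the low-pass filter order and the audio sampling rate (48 kHz, per the real-time system description further below).

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, resample_poly

def band_envelope(x, fs, band, order=18, target_fs=120.0):
    """Band-pass filter a signal, extract its Hilbert envelope, low-pass
    the envelope at 6 Hz (anti-aliasing), and downsample to target_fs."""
    # butter() with a band-pass doubles the order, so order // 2 poles
    # per edge yields the stated 18th-order filter for the ECoG band.
    sos = butter(order // 2, band, btype='bandpass', fs=fs, output='sos')
    envelope = np.abs(hilbert(sosfiltfilt(sos, x, axis=0), axis=0))
    sos_lp = butter(4, 6.0, btype='lowpass', fs=fs, output='sos')
    envelope = sosfiltfilt(sos_lp, envelope, axis=0)
    return resample_poly(envelope, up=1, down=int(round(fs / target_fs)), axis=0)

# High gamma envelope of the pre-processed ECoG (sampled at 1200 Hz) and
# speech-band envelope of each audio stream (assumed 48 kHz), e.g.:
# ecog_env   = band_envelope(ecog,  fs=1200.0,  band=(70.0, 170.0))
# speech_env = band_envelope(audio, fs=48000.0, band=(80.0, 6000.0))
```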

2.4.2 Feature Extraction

We extracted features that reflect the neural tracking of the attended and unattended speech stream. We defined neural tracking of speech as the correlation between the gamma envelope (of a given cortical location) and the speech envelope. We calculated this correlation separately for the attended and unattended speech, thereby obtaining two sets of r-values labeled ‘attended’ and ‘unattended,’ respectively.
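In code, this neural tracking could be computed per cortical location as in the sketch below, where `ecog_env` is the 120 Hz high gamma envelope matrix of shape (samples, channels) from the previous step and `speech_env` is one speech envelope of matching length (both names are ours).

```python
import numpy as np
from scipy.stats import spearmanr

def neural_tracking(ecog_env, speech_env):
    """Spearman correlation between the high gamma envelope of each
    cortical location and one speech envelope; returns one r-value
    and one p-value per channel."""
    n_channels = ecog_env.shape[1]
    r = np.empty(n_channels)
    p = np.empty(n_channels)
    for ch in range(n_channels):
        r[ch], p[ch] = spearmanr(ecog_env[:, ch], speech_env)
    return r, p

# r_att, p_att     = neural_tracking(ecog_env, attended_env)
# r_unatt, p_unatt = neural_tracking(ecog_env, unattended_env)
```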

2.4.3 Selection of Cortical Delay and Location

We expected a delay between the audio presentation and resulting cortical processing, i.e., the time from presentation of the audio stimuli to the observation of the cortical change. To account for this delay, we measured the neural tracking of the attended speech stream across different delays (0–250 ms, see Fig. 3) and across all channels. Next, we determined the cortical location that was most selective of the attended speech stream. For this, we selected the cortical location that showed the largest difference between the ‘attended’ and ‘unattended’ r-values. Based on these results, we selected a delay of 150 ms and a cortical location over superior temporal gyrus (STG). We corrected for this delay by shifting the speech envelopes relative to the ECoG envelopes prior to calculating the correlation values.
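The following sketch illustrates this parameter search. For simplicity it selects both the delay and the channel by the largest attended-minus-unattended r difference, a slight simplification of the two-step procedure described above, and it reuses the `neural_tracking` helper from the previous sketch.

```python
import numpy as np

def select_delay_and_channel(ecog_env, attended_env, unattended_env,
                             fs=120.0, max_delay_s=0.25):
    """Scan delays from 0 to 250 ms, shift the speech envelopes relative
    to the ECoG envelopes, and pick the delay and channel with the
    largest difference between 'attended' and 'unattended' r-values."""
    best = (0, 0, -np.inf)  # (delay in samples, channel index, r difference)
    for lag in range(int(max_delay_s * fs) + 1):
        # The cortex lags the audio: align ECoG sample t with audio sample t - lag.
        ecog_seg = ecog_env[lag:, :]
        att_seg = attended_env[:len(attended_env) - lag]
        unatt_seg = unattended_env[:len(unattended_env) - lag]
        r_att, _ = neural_tracking(ecog_seg, att_seg)
        r_unatt, _ = neural_tracking(ecog_seg, unatt_seg)
        diff = r_att - r_unatt
        ch = int(np.argmax(diff))
        if diff[ch] > best[2]:
            best = (lag, ch, diff[ch])
    delay_ms = 1000.0 * best[0] / fs
    return delay_ms, best[1]
```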

Fig. 3

Lag between speech presentation and neural response. This figure shows the correlation between neural response and the attended speech (green) for the most selective cortical location, across corrected lags between 0 and 250 ms. This correlation peaks at 150 ms

2.4.4 Classification

In our approach, we assumed that the extracted features, i.e., the two r-values of the selected cortical location, were directly predictive of the ‘attended’ conversation. In other words, for the selected cortical location, a trial was classified correctly if the ‘attended’ r-value was larger than the ‘unattended’ r-value. To determine the performance as a function of the length of attention, we applied our feature extraction and classification procedure to data segments from 0.1 to 15 s in length.
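A sketch of this decision rule and of the segment-length sweep is given below; the trial data structure and the `stg_channel` name are hypothetical and only serve to show how the pieces fit together.

```python
import numpy as np
from scipy.stats import spearmanr

def classify_trial(ecog_env, attended_env, unattended_env, channel,
                   fs=120.0, length_s=5.0):
    """Classify one trial from its first length_s seconds: the speech
    stream whose envelope correlates more strongly with the high gamma
    envelope of the selected channel is taken to be the attended one.
    Returns True if the trial is classified correctly."""
    n = int(length_s * fs)
    ecog = ecog_env[:n, channel]
    r_att, _ = spearmanr(ecog, attended_env[:n])
    r_unatt, _ = spearmanr(ecog, unattended_env[:n])
    return r_att > r_unatt

# Accuracy as a function of segment length, given a list of trials,
# each a tuple (ecog_env, attended_env, unattended_env):
# for length in (0.1, 0.5, 1, 2, 5, 10, 15):
#     acc = np.mean([classify_trial(*t, channel=stg_channel, length_s=length)
#                    for t in trials])
```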

2.5 Real-Time System Verification

In the real-time verification, we evaluated the system performance on the data recorded during the first stage of this study. We configured this system with parameters (i.e., cortical location and delay) determined in the previously detailed offline analysis.

2.5.1 Real-Time System Architecture

We used the BCI software platform BCI2000 [26,27,28] to implement an auditory attention based BCI. For this, we expanded BCI2000 with the capability to process auditory signals in real time. In detail, we implemented signal acquisition from audio devices (e.g., a microphone) or pre-recorded files that is synchronized with the acquisition of the neural signals. Further, we implemented a signal correlation filter. For our evaluation, the two (monaural) speeches served as the audio input to the auditory attention based BCI (see Fig. 4).

In this system, BCI2000 filters the audio signals between 80 and 6000 Hz and the ECoG signals between 70 and 170 Hz. Next, a BCI2000 filter extracts the envelopes, decimates them to a common sampling rate of 200 Hz, and adjusts their timing for the cortical delay. A signal correlation filter then calculates the correlation values, i.e., the correlation between each of the two (monaural) speeches and the selected neural envelope, to determine to which speaker the user directs his/her attention. Finally, the feedback augmentation filter increases the volume of the attended speaker and decreases the volume of the unattended speaker to provide feedback to the subject. These processing steps are repeated every 50 ms to provide feedback in real time.
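A schematic of one such update cycle is sketched below. The actual system is implemented as BCI2000 filters in C++; the buffer handling and gain step here are our own illustrative choices.

```python
import numpy as np
from scipy.stats import spearmanr

def update_cycle(ecog_env_buffer, audio_env_buffers, volumes, gain_step=0.05):
    """One 50 ms update of the online loop: correlate the selected ECoG
    envelope with the envelope of each speech stream, infer the attended
    speaker, and adjust the playback volumes to enhance the attended
    stream and attenuate the unattended one."""
    r = np.array([spearmanr(ecog_env_buffer, env)[0] for env in audio_env_buffers])
    attended = int(np.argmax(r))
    # Attenuate every stream by one step, then boost the attended stream
    # by two steps, so it gains relative to the others.
    new_volumes = np.asarray(volumes, dtype=float) - gain_step
    new_volumes[attended] += 2 * gain_step
    return attended, np.clip(new_volumes, 0.0, 1.0)
```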

Fig. 4

Real-time system design. The auditory attention BCI is based on BCI2000 and simultaneously acquires and processes audio and ECoG signals. The audio signals from multiple conversations are sampled at 48 kHz and acquired from a low-latency USB audio amplifier (Tascam US-122MKII). The ECoG signals from the surface of the brain are sampled at 1200 Hz and acquired from a 256-channel bio-signal amplifier (g.HIamp, g.tec, Graz, Austria). In the next step, the signals are band-pass filtered (80–6000 Hz for audio, 70–170 Hz for ECoG) and their envelopes are extracted. The resulting signal envelopes are decimated to a common sampling rate of 200 Hz and adjusted for timing differences. One channel of the decimated ECoG signal envelope is then selected and correlated with each of the decimated audio signal envelopes. As the human subject perceives the mixture of conversations through earphones, the auditory attention BCI can then provide feedback by modifying the volume of the presented mixture, enhancing the volume of the attended conversation and attenuating the volume of the unattended one

Fig. 5

Neural tracking of attended (left) and unattended (right) speech. The tracking of the attended speech is both stronger and more widely distributed than the tracking of the unattended speech. In addition, there is only a marginal difference in spatial distribution between attended and unattended stimuli

3 Results

3.1 Neural Correlates of Attended and Unattended Speech

First, we were interested in visualizing the cortical areas that track the ‘attended’ and ‘unattended’ conversations. The results in Fig. 5 show the neural tracking of the ‘attended’ and ‘unattended’ speech in the form of an activation index. For each cortical location, this activation index expresses the negative logarithm of the p-value (−log(p)) of the correlation between the high gamma ECoG envelope and the attended or unattended speech envelope. The neural tracking is focused predominantly on areas on or around the superior temporal gyrus (STG) and middle temporal gyrus (MTG).
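As a sketch, this activation index could be computed per channel as follows, reusing the Spearman p-values from the feature extraction step; the base-10 logarithm and the guard against zero p-values are our assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def activation_index(ecog_env, speech_env):
    """Activation index per cortical location: the negative logarithm of
    the p-value of the correlation between the high gamma envelope and
    one speech envelope (attended or unattended)."""
    idx = np.empty(ecog_env.shape[1])
    for ch in range(ecog_env.shape[1]):
        _, p = spearmanr(ecog_env[:, ch], speech_env)
        idx[ch] = -np.log10(max(p, np.finfo(float).tiny))  # guard against p == 0
    return idx
```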

3.2 Relationship Between Segment Length and Classification Accuracy

Next, we were interested in determining the duration of attention needed to infer the ‘attended’ speech. For this, we examined the relationship between segment length and classification accuracy. The results in Fig. 6 show the classification accuracies for variable segment lengths (0.1–15 s). The accuracy improvements level off after 5 s, at 80–90% accuracy.

Fig. 6

Accuracy for different segment lengths. The classification accuracy generally increases with segment length. The red horizontal dashed line indicates chance accuracy

3.3 Interface to the Investigator

Finally, we evaluated the performance that the real-time system, configured with the parameters (i.e., the cortical location and delay) determined in the offline analysis, achieved on the data recorded during the first stage of this study. The screenshot in Fig. 7 shows the interface to the investigator. The interface presents the decimated and aligned ECoG and audio envelopes, their correlation with each other, and the inferred attention. The content of this interface is updated 20 times per second.

Fig. 7

Interface design. The interface to the investigator presents multiple panels. The bottom left panel presents the decimated and aligned ECoG and audio envelopes. The panels on the right show the correlation between the ECoG and the attended speech (top), the ECoG and the unattended speech (middle), and the difference between the two correlation values (bottom). The panel on the top left shows this correlation difference in the form of an analogue instrument in which the pointer (i.e., the needle) indicates the direction of attention. In this experiment, the subject was cued to attend to the particular speaker annotated with “Attended” in this panel

4 Discussion

We present the first real-time implementation of an auditory attention BCI that uses ECoG signals and natural speech stimuli. The configuration of this system requires only two parameters: the cortical location and the delay between the audio presentation and the cortical processing. Our results can guide the selection of these parameters. For example, our results indicate that the underlying physiological mechanism is primarily focused on the temporal lobe, specifically the STG and MTG areas. Further, the neural tracking of attended speech is stronger and more widely distributed than that of unattended speech. This confirms results from a previous ECoG study that investigated auditory attention [22]. Finally, our study shows that the cortical delay between the audio presentation and the cortical processing is approximately 150 ms.

The presented results indicate that such a system could support BCI communication. While invasive, it may be justified for those affected by severe neurodegenerative diseases (e.g., late-stage ALS, locked-in syndrome) who have lost all muscular control and therefore cannot use conventional assistive devices or BCIs that depend on visual stimulation or feedback. Most importantly, the results suggest that sufficient communication performance (\({>}70\%\), [29]) could be achieved with a single electrode placed over STG or MTG. This finding is important because placement of ECoG grids as used in this study requires a large craniotomy, whereas a single electrode could be placed through a burr hole [30]. Furthermore, the electrodes in this study were placed subdurally (i.e., underneath the dura). Penetration of the dura increases the risk of bacterial infection [31,32,33,34,35], whereas epidural electrodes (i.e., electrodes placed on top of the dura) provide signals of approximately comparable fidelity [36, 37]. A single epidural electrode could thus reduce risk, which should make this approach more clinically practical.

In this study, we focused on demonstrating that one cortical location is sufficient for providing BCI communication. However, it is likely that combining the information from multiple cortical locations could substantially improve the communication performance. Thus, recent advances in clinically practical recordings of ECoG signals from multiple cortical locations [38, 39] could improve the clinical efficacy of the presented approach.

In comparison to many other auditory BCIs, the present approach has the unique advantage of using natural speech without any alteration. This aspect may be particularly relevant for those who are already at a stage where learning how to use a BCI has become difficult.

5 Conclusion

In summary, our study demonstrates the function of an auditory attention BCI that uses ECoG signals and natural speech stimuli. The implementation of this system within BCI2000 lays the groundwork for future studies that investigate its clinical efficacy. Once clinically evaluated, such a system could provide communication without depending on other sensory modalities or on an artificial mapping between stimulus and communication intent. In the near future, this could substantially benefit people affected by severe motor disabilities who cannot use conventional assistive devices or BCIs that require some residual motor control, including eye movement.