
1 Introduction

Enabling machines to understand humans and their implicit or explicit intents accurately is a key objective of any efficient human-machine collaborative system [22]. For instance, one task of an assistive robotic arm in a noisy industrial setting could be to hand over the specific tools a human is searching for. If the robotic arm knows that the person is searching for something, it can assist by locating and handing over the desired object or tool. However, how can a robotic arm determine whether a human is searching for something or merely scanning the environment? Object selection with eye tracking has been studied extensively [2, 3] to identify implicit intents. The EEG analysis of Kang et al. [1] barely exceeds chance level when distinguishing a search intent from a scene scan.

Another challenge is EEG variability. EEG signals are highly non-stationary and can vary considerably across days, or even within the same day, for the same user. Inter-subject variability refers to differences in brain signals between subjects, and intra-subject variability refers to differences in brain activity for the same subject across repetitions of the same task [13]. Both are unavoidable due to time-variant factors in the experimental recording setup and the underlying psychological and neurophysiological parameters. Ideally, real-world brain-machine communication should be effective and efficient at all times, i.e., across sessions and participants, without re-calibration. To address these problems, we combine brain signals recorded via electroencephalography (EEG) with eye tracking in a simulated working environment and investigate correlations between the EEG and eye-tracking data for two reasons: (1) to create context from overt data by automatically labeling EEG data based on the eye-tracking input, and (2) to build models for intention recognition based on this labeled EEG data. We use eye-tracking information only to perform automatic labeling of the EEG data; using it as an interaction modality is beyond the scope of this paper.

The main objective of this paper is to predict the implicit intentions that occur during visual stimulus presentation, i.e., Navigational Intent (free viewing) and Informational Intent (target searching). We collect EEG data from multiple participants during a visual search task to identify brain state transitions between these intentions and classify users' implicit intentions using machine learning algorithms. We investigate a wide range of feature extraction methods and classification algorithms to determine the best setup for labeling Navigational and Informational Intent from EEG activity with high accuracy. Our main contributions are as follows:

1. We design and develop an effective data acquisition paradigm and an end-to-end classification pipeline to categorize human intents using EEG signals.

2. We extensively evaluate our classification pipeline for single subjects, showing a significant improvement over the existing state of the art.

3. We extend the single-subject classification pipeline to enable, for the first time, the transfer of EEG-based learning to cross-subject scenarios.

The rest of the paper proceeds as follows. In Sect. 2, we review related work on intent recognition and inter- and intra-subject variability. Section 3 describes the data recording setup, including the recording devices and the overall procedure. Section 4 discusses the signal processing pipeline, including feature extraction techniques. Section 5 presents model performance for subject-wise and cross-subject scenarios. Section 6 concludes the paper with a discussion and limitations.

2 Related Work

In the recent development of human-machine collaborative systems [15, 22], intent recognition [14] plays a major role in making collaboration more efficient and successful. Recognizing human intentions from EEG signals attracts strong research interest because such signals give insight into the human mind and enable communication or interaction with external devices such as wheelchairs and intelligent robots [14, 16]. Slanzi et al. propose a physiology-based analysis for predicting web users' click intention by combining EEG responses and pupil dilation [4]. The authors designed ten questions per website, each concerned with finding certain information within the website. Participants followed a navigation path from the home page to the page where the information is present. The authors chose a wide range of features, such as Hjorth parameters, Petrosian Fractal Dimension, Higuchi Fractal Dimension, Hurst exponent, and statistical features, to train the model. However, the classifiers' performance is not satisfactory: the study achieved a maximum accuracy of \(71.09\%\) with logistic regression, which may not be sufficient for real-world scenarios.

Recent research shows that EEG-based intent recognition can capture implicit intentions even when a human does not express their thoughts. For example, Kang et al. develop advanced interactive web service engines that rely on identifying brain connectivity patterns related to the user's Navigational and Informational intentions through visual experiments based on static web images [1]. The authors analyze differences in phase-locking value (PLV) to classify users' Navigational and Informational intentions using Support Vector Machines, Naïve Bayes, and Gaussian Mixture Models. However, accuracies mostly fall between \(50\%\) and \(77\%\) for all classifiers, which is insufficient for real-world deployment, where precise estimation of intents is of utmost importance for a robust system.

Existing studies focus on subject-specific evaluation, which is suboptimal for real-world settings, where a generalized setup could save considerable training time and effort. Due to the complexity and high dimensionality of brain signals, intent recognition accuracy and signal interpretability depend heavily on a sophisticated feature vector representation. Moreover, EEG signals reflect voltage fluctuations from different cortical regions of the human brain over time [17]. It therefore becomes necessary to combine spatial and temporal information effectively to capture the uncertainties generated by inter- and intra-subject variability [13, 17]. Wei et al. use hierarchical clustering to explore the associations between EEG features and cognitive states and thereby tackle inter- and intra-subject variability within a large-scale EEG dataset collected in a simulated driving task [18]. Their subject transfer framework detects drowsiness and reduces calibration time by \(90\%\), but some amount of training is still needed. Other studies that address these variabilities use entirely different EEG paradigms, such as Motor Imagery [19] or the P300 speller [20], to reduce task-based calibration time. So far, handling inter- and intra-subject variability to improve intent recognition for visual search tasks remains unexplored.

3 Data Acquisition Setup

Fifteen healthy subjects (age: 20 to 30 years) participated in the experiment without prior training or knowledge. Before the start of the experiment, participants were informed about the procedure and asked to sign an informed consent form for the scientific use of the recorded data. The study was approved by the ethical review board of the Faculty of Mathematics and Computer Science at Saarland University. The experiment was performed in a dimly lit room with minimal distraction from external noise or electronic devices. Participants were asked to sit in a comfortable chair to prevent unnecessary muscle movements and thereby minimize noise and artifacts in the EEG signals, which could arise from mental stress, electrical interference, and other physiological motor activity [5, 6]. The display resolution of the monitor was set to 1920 \(\times \) 1080 pixels, the screen brightness to 300 cd/m\(^{2}\), and the distance between the user and the screen to 60 cm, with the user's eyes at about the same height as the center of the screen.

Recording Devices: EEG signals were recorded with a LiveAmp 64 amplifier by Brain Products at a sampling frequency of 500 Hz. The 10–20 international system of electrode placement was used to locate the electrodes [7]. Electrode impedances were kept below 25 k\(\Omega \) throughout the recording, a common practice for noise reduction in EEG recordings [8]. A Tobii Pro Fusion eye tracker collected the eye-tracking information, which is used only for automatically labeling the EEG data.

3.1 Experimental Procedure

The experiment consists of three parts: (i) Navigational Intent (free viewing), (ii) target presentation, and (iii) Informational Intent (target searching). Figure 1 shows the experimental sequence. We designed the experiment in Unity [9], with industrial scenes that closely resemble a real working scene in an industrial context. The recording steps are as follows:

1. The participant glances over the input scene without knowing the target, to get an overall overview of the scene.

2. The participant is shown a specific target tool as an image.

3. The participant searches for the shown target object in the input scene by looking around.

4. As soon as the participant finds the tool, the target object's boundary appears in red and later changes to green, confirming that the participant found the correct tool.

Fig. 1. Setup of the search task. The participant is shown the image on the left without a concrete target. Afterward (middle), the target is shown to the participant. The participant searches for the target in the scene; through gaze tracking, the object is highlighted and selected if the participant fixates on it for more than 2 s (right).

The recorded dataset consists of 5 sessions per subject, recorded on the same day with short breaks between sessions. Each session comprises 30 scenes for both Navigational and Informational Intent. Each session uses different images, resulting in a total of 150 unique input scenes resembling the industrial working conditions of manufacturing or production units.

4 Methods

This section presents the methods we used for EEG signal processing, including preprocessing and feature extraction. We assemble the dataset for each individual subject using the following steps.

4.1 Data Preprocessing

Typically, EEG signals recorded in such settings contain external noise and artifacts such as muscle movement and eye blinks [21]. It is therefore necessary to preprocess the recorded data before extracting meaningful information for further analysis. We preprocess the data in MATLAB using functions from the EEGLAB toolbox [10]. Below are the preprocessing steps to clean the data (a code sketch of the full pipeline follows the list):

1. Filtering: High-pass filtering at a cutoff frequency of 1 Hz is applied, as recommended by [11], to remove low-frequency noise and drifts before the independent component analysis (IIR filter, pop_iirfilt from EEGLAB). A notch filter with a lower cutoff of 48 Hz and an upper cutoff of 52 Hz removes power line noise [6], followed by low-pass filtering at a cutoff frequency of 40 Hz (IIR filter).

2. Artifact rejection: We reject electrodes using pop_clean_rawdata from EEGLAB, since poor electrode-to-skin contact, a broken recording device, or low signal quality degrades the signals. Electrodes with a large portion of noise are removed based on their standard deviation, as are channels that correlate poorly with the other channels; the rejection threshold for channel correlation is 0.8.

3. Re-referencing: All electrodes are re-referenced to a common average reference, which minimizes uncorrelated signal and noise sources through averaging.

4. Independent Component Analysis (ICA): Since EEG data collected at a single channel is a composition of all neuron potentials in an area, recordings between electrodes can be highly correlated [10]. We clean the data using ICA, which removes unwanted artifacts embedded in the data (muscle, eye blinks, or eye movements) without removing the affected data segments. We apply the Second-Order Blind Identification (SOBI) algorithm for the ICA decomposition, followed by automated ICLabel rejection (muscle, heart, and eye components at a \(95\%\) threshold).

5. Channel interpolation: The channels marked as bad are interpolated using spherical interpolation (pop_interp). The motivation behind channel interpolation is to avoid bias when calculating the average reference.

6. Epoching: From the preprocessed data, we extract specific time windows of the continuous EEG signal relative to stimulus onset. We take equal durations for Navigational and Informational Intent within each sample, as the feature extraction module expects inputs of the same dimensions. We also remove the period in the Informational part where the eyes are resting because the participant is merely fixating on the already-located object, see Sect. 3.1.
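The following is a minimal sketch of these six steps, assuming MNE-Python in place of the EEGLAB/MATLAB functions named above. The file name, the infomax substitute for SOBI (which MNE does not ship), the mne-icalabel package standing in for ICLabel, and the epoch window are all illustrative assumptions, not the authors' exact configuration.

```python
# Hedged MNE-Python approximation of the EEGLAB pipeline above; everything
# marked as a placeholder is an assumption for illustration.
import mne
from mne.preprocessing import ICA

raw = mne.io.read_raw_brainvision("subject01.vhdr", preload=True)  # hypothetical file

# 1. Filtering: 1 Hz high-pass, 48-52 Hz notch, 40 Hz low-pass (IIR, as in the paper)
raw.filter(l_freq=1.0, h_freq=None, method="iir")
raw.notch_filter(freqs=50.0, method="iir")
raw.filter(l_freq=None, h_freq=40.0, method="iir")

# 2. Artifact rejection: MNE has no direct pop_clean_rawdata equivalent, so a
# custom std/correlation check would populate the bad-channel list here
raw.info["bads"] = ["T7"]  # placeholder for automatically detected channels

# 3. Re-referencing to the common average
raw.set_eeg_reference("average")

# 4. ICA: extended infomax stands in for SOBI; component rejection would use
# the mne-icalabel package in place of EEGLAB's ICLabel
ica = ICA(n_components=0.99, method="infomax", fit_params=dict(extended=True))
ica.fit(raw)
ica.exclude = [0, 1]  # placeholder for components labeled muscle/heart/eye
ica.apply(raw)

# 5. Channel interpolation (spherical splines, as with pop_interp)
raw.interpolate_bads()

# 6. Epoching relative to stimulus onset; the 0-5 s window is an assumption
events, event_id = mne.events_from_annotations(raw)
epochs = mne.Epochs(raw, events, event_id=event_id, tmin=0.0, tmax=5.0, baseline=None)
```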

4.2 Feature Extraction

In this section, we present the methods used to assemble a feature vector with PyEEG [12] and the Common Spatial Pattern (CSP) [21]. PyEEG is an open-source Python module for EEG feature extraction [12]; we extract 15 features per EEG channel to generate a feature vector for further investigation. Table 1 lists the features extracted for each EEG channel. CSP extracts features from EEG data in a maximally discriminative manner: its basic principle is to apply a linear transformation that projects the multi-channel EEG signal onto a lower-dimensional spatial subspace such that the variance of one class is maximized while the variance of the other class is minimized. A code sketch of both feature paths follows Table 1.

Table 1. List of extracted features from PyEEG
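To make the two feature paths concrete, here is a minimal sketch assuming the PyEEG and MNE APIs; the specific subset of features, the band edges, the Kmax value, and the CSP settings are illustrative assumptions rather than the exact configuration behind Table 1.

```python
# Hedged sketch of the two feature paths: a per-channel PyEEG feature vector
# and CSP spatial filtering; parameter values are assumptions.
import numpy as np
import pyeeg
from mne.decoding import CSP

FS = 500                       # sampling rate in Hz (Sect. 3)
BANDS = [1, 4, 8, 13, 30, 40]  # assumed delta/theta/alpha/beta/gamma edges

def channel_features(x: np.ndarray) -> list:
    """A subset of the per-channel features listed in Table 1."""
    mobility, complexity = pyeeg.hjorth(x)
    power, power_ratio = pyeeg.bin_power(x, BANDS, FS)
    return [
        pyeeg.pfd(x),            # Petrosian fractal dimension
        pyeeg.hfd(x, 5),         # Higuchi fractal dimension (Kmax assumed)
        pyeeg.hurst(x),          # Hurst exponent
        mobility, complexity,    # Hjorth parameters
        pyeeg.spectral_entropy(x, BANDS, FS, Power_Ratio=power_ratio),
        *power_ratio,            # relative band powers
    ]

def assemble_fv(X: np.ndarray) -> np.ndarray:
    """X: epochs of shape (n_epochs, n_channels, n_times) -> flat feature matrix."""
    return np.array([[f for ch in epoch for f in channel_features(ch)] for epoch in X])

# CSP path: project onto a discriminative spatial subspace and use the
# log-variance of the components as features (y holds the two intent labels)
csp = CSP(n_components=4, log=True)
# X_csp = csp.fit_transform(X, y)
```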

4.3 Classification Algorithms

Similar to past studies [1, 21], we use Random Forest (RF) and Naïve Bayes (NB) classifiers to distinguish the EEG signals according to the users' implicit intention. Table 2 shows the tuned hyper-parameters; we use default values for all others. RF applies bootstrap aggregation over multiple decision trees, which improves predictive performance compared to a single model. NB is a probabilistic machine learning model based on Bayes' theorem; it assumes that every pair of features is conditionally independent given the class. A sketch of this classification stage follows.
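A minimal sketch of this stage with scikit-learn, under the stated assumptions: the grid values stand in for the hyper-parameters of Table 2, and the synthetic placeholder data substitutes for the real CSP/PyEEG features.

```python
# Hedged sketch of the RF/NB classification stage; grid values and data are
# placeholders, not the hyper-parameters actually listed in Table 2.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 15))  # placeholder for CSP or PyEEG features
labels = rng.integers(0, 2, size=300)  # 0 = Navigational, 1 = Informational

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2)  # 80/20

# Grid search with five-fold cross-validation, as described in Sect. 5.1
rf = GridSearchCV(
    RandomForestClassifier(),
    param_grid={"n_estimators": [100, 300, 500], "max_depth": [None, 10, 20]},
    cv=5,
)
rf.fit(X_tr, y_tr)
print("RF test accuracy:", rf.score(X_te, y_te))

# Gaussian NB assumes features are conditionally independent given the class
nb = GaussianNB().fit(X_tr, y_tr)
print("NB test accuracy:", nb.score(X_te, y_te))
```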

5 Experimental Evaluation

In this section, we present our results for subject-wise and cross-subject scenarios and identify the setup that generalizes best across participants. Kang et al. [1] is the work most closely related to ours; we show their highest achieved accuracy as a baseline in all our box plots, depicted by a horizontal line. Since no prior cross-subject evaluation exists for search tasks, we cannot compare our cross-subject analysis against prior work.

Table 2. List of hyperparameters
Fig. 2. Subject-wise accuracy using the common spatial pattern with RF and NB classifiers. The horizontal line shows the baseline. Triangle, orange line, and circular dots show the mean, median, and outlier values, respectively (Color figure online)

5.1 Subject-Wise Analysis

We performed data assembly, training, and evaluation of the test set for each subject individually. To evaluate our classification pipeline, we use 80% of the data as the training set and 20% as the test set. The test set is assembled randomly at the start of the pipeline to keep the setup close to online classification. We use hyperparameter optimization with grid search and five-fold cross-validation, a common practice in EEG classification [21]. Table 2 shows the parameters that influence classification performance for both classifiers; the combination yielding the best classification accuracy is selected as the optimal hyperparameters for each subject. Finally, we use the test set to evaluate the performance of the trained classifier. A sketch of this per-subject evaluation loop follows the figures.

Results for all subjects with the CSP feature extraction technique are shown as a box plot in Fig. 2, with the horizontal line marking the state-of-the-art accuracy. All subjects achieve high accuracy. For Random Forest, mean accuracy lies between \(90.79\%\) and \(97.18\%\), and the standard deviation falls between 0.73 and 5.06. The highest mean intent recognition accuracy of \(97.18\%\) is attained by subject \(S_5\); moreover, subjects \(S_i\), i \(\in \) \(\{1, 2, 5, 7, 8, 10, 13, 14, 15\}\) achieve mean accuracies above \(95\%\). Compared to Random Forest, Naive Bayes performs slightly worse, especially for subject \(S_1\), where the mean accuracy is \(88.73\%\). However, for subjects \(S_i\), i \(\in \) \(\{3, 6, 15\}\), Naive Bayes achieves better results, with mean accuracies of \(96.55\%\), \(93.16\%\), and \(96.95\%\), respectively. For Naive Bayes, the mean accuracy lies between \(88.73\%\) and \(97.89\%\), with standard deviations between 0.74 and 3.57. Subject \(S_5\) also attains the highest mean accuracy for the Naive Bayes classifier. Overall, with CSP features, both classifiers perform similarly in terms of mean accuracy.

Figure 3 shows the results obtained with the feature vector assembled using the PyEEG toolbox, again with Random Forest and Naive Bayes classifiers. To compare the feature extraction techniques and classifiers fairly, we use the same hyperparameters, as shown in Table 2. Both classifiers perform worse than with CSP: the mean accuracy lies between \(81.07\%\) and \(93.93\%\) with RF and between \(64.73\%\) and \(83.93\%\) with NB. Figure 4 shows the confusion matrices for subject \(S_{5}\) (who achieved the highest overall mean accuracy) using CSP and the assembled PyEEG feature vector. The diagonal elements show the number of correct classifications, while off-diagonal elements show misclassifications. The dataset is balanced between the two classes, and CSP yields the highest number of correct predictions compared to PyEEG.

Fig. 3. Subject-wise accuracy using the assembled feature vector (FV) with RF and NB classifiers. The horizontal line shows the baseline. Triangle, orange line, and circular dots show the mean, median, and outlier values, respectively (Color figure online)

Fig. 4. Confusion matrix for subject \(S_{5}\)
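The per-subject evaluation behind these figures can be summarized by the following hedged sketch; the repeat count, the bare RF classifier in place of the full grid search, and the data loading are assumptions for illustration.

```python
# Hedged sketch of the per-subject evaluation loop: repeated 80/20 splits
# yield the box-plot statistics, plus a confusion matrix as in Fig. 4.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

def evaluate_subject(X, y, n_repeats=10):
    """Return mean/std accuracy over repeats and the last run's confusion matrix."""
    accs = []
    for _ in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)  # 80/20 split
        clf = RandomForestClassifier().fit(X_tr, y_tr)
        accs.append(clf.score(X_te, y_te))
    cm = confusion_matrix(y_te, clf.predict(X_te))  # rows: true, cols: predicted
    return np.mean(accs), np.std(accs), cm
```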

5.2 Cross-Subject Analysis

For the cross-subject case, we study two types of variability. Inter-subject: differences in brain activity across subjects. Intra-subject: differences in brain activity for the same subject occurring in multiple repetitions of the same task.

Inter-subject: Table 3 shows the results for inter-subject variability, where each subject in turn serves as the test subject while the remaining subjects form the training set. Thus, the test set comes from a subject who is not part of the training data. The performance evaluation is done with Random Forest (RF) and Naive Bayes (NB) on the CSP feature vector and the assembled feature vector (FV) from the PyEEG toolbox. We compute the results using grid search with five-fold cross-validation, using the same hyperparameter ranges as before (Table 2). Since we do not fix the random state of the classifiers, we repeat the experiment 5 times and report the mean accuracy and standard deviation. The highest mean accuracy of \(96.83\%\) with RF-FV is achieved when \(S_8\) is the test subject and \(S_i\), i \(\in \) \(\{1,2,3,4,5,6,7,9,10,11,12,13,14,15\}\) form the training set. Hyperparameter tuning over an exhaustive and wide range of combinations plays a significant role in achieving optimal performance. Overall, RF with the assembled feature vector performs best for all subjects. A sketch of this leave-one-subject-out protocol follows Table 3.

Table 3. Inter-subject accuracy (%) for each subject as a test set with remaining subjects as train set
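A hedged sketch of the leave-one-subject-out protocol, assuming a hypothetical `subject_data` mapping from subject id to (features, labels); the intra-subject leave-one-session-out split described next follows the same pattern with sessions in place of subjects.

```python
# Hedged sketch of the leave-one-subject-out protocol; subject_data is a
# hypothetical dict mapping subject id -> (feature matrix, label vector).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def leave_one_subject_out(subject_data, n_repeats=5):
    """Train on all subjects but one; test on the held-out subject."""
    results = {}
    for test_id in subject_data:
        X_te, y_te = subject_data[test_id]
        X_tr = np.vstack([subject_data[s][0] for s in subject_data if s != test_id])
        y_tr = np.concatenate([subject_data[s][1] for s in subject_data if s != test_id])
        # random state is not fixed, so repeat and report mean +/- std (Sect. 5.2)
        accs = [RandomForestClassifier().fit(X_tr, y_tr).score(X_te, y_te)
                for _ in range(n_repeats)]
        results[test_id] = (float(np.mean(accs)), float(np.std(accs)))
    return results
```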

Intra-subject: For the intra-subject estimation, we take one complete session from all subjects as the test set, while the remaining four sessions form the training set. Since we do not fix the random state of the classifiers, we repeat the experiment 5 times and report the mean and standard deviation of the classification accuracy; the same hyperparameters are tuned (Table 2). Table 4 shows the mean accuracies: RF with the feature vector assembled using PyEEG performs significantly better than the other classifier and feature extraction combinations. These results align with the inter-subject analysis, where RF-FV also works best.

Table 4. Intra-subject mean accuracy (%)

6 Conclusion and Discussion

This paper proposes a classification pipeline for classifying users' intentions based on EEG data. The final prediction of the model depends heavily on the data acquisition methods, preprocessing algorithms, computed features, and the choice of classification algorithm. We evaluated our pipeline in subject-specific and cross-subject scenarios. In the subject-specific analysis, the CSP feature extraction method performs best for both Random Forest and Naive Bayes classifiers, achieving maximum mean accuracies of \(97.18\%\) and \(97.89\%\), respectively. This is a significant improvement over the state of the art and makes our pipeline applicable to real-world settings. With PyEEG features, our pipeline achieves maximum mean accuracies of only \(93.93\%\) and \(83.93\%\) for Random Forest and Naive Bayes, respectively. We also extend our pipeline to the cross-subject scenario by combining the subject-specific datasets; our cross-subject model achieves highest mean accuracies of \(96.83\%\) and \(92.46\%\) for the inter- and intra-subject settings, respectively. The pipeline thus generalizes brain signals across subjects and can reduce the need for exhaustive subject-specific training sessions and tedious calibration. We recommend PyEEG features with a Random Forest classifier, since this combination generalizes well over all subjects while remaining comparable to the other strategies in the subject-wise scenario. In the future, we would like to extend our approach to a multi-modal intent recognition model for discovering users' intentions in complicated scenes. Additionally, it would be interesting to use data recorded on different days from the same subject and handle this type of variability.

Limitations: In this study, we claim that our intent recognition pipeline generalizes from trained subjects to new, unseen subjects. However, we did not test this with subjects from diverse age groups or subjects with special conditions.