
1 Introduction

With the development of non-invasive techniques, electroencephalography (EEG) plays an important role in neuroscience, cognitive science, and cognitive psychology research, especially in Brain-Computer Interfaces (BCIs). EEG-based BCI systems mainly rely on modulating the responses of large populations of neurons, either through a training period or by using external stimuli [1]. An important advantage of EEG is the extremely low risk to the subjects compared with invasive and semi-invasive techniques such as electrocorticography (ECoG) [2]. In addition, EEG has a very high temporal resolution, on the order of milliseconds, and its signals can be divided into several frequency bands (e.g. mu, beta) that present different characteristics and reflect different functional brain states [3, 4]. Many factors contaminate EEG signals, including objective factors (e.g. environmental noise, faulty electrodes) and subjective factors [e.g. electrooculography (EOG)]. These factors result in various undesired artifacts, which reduce classification accuracy in a BCI system, alter the characteristics of the underlying neurological phenomena, and can cause serious errors in controlling the BCI [5, 6]. Therefore, it is important to apply artifact removal and band-pass filtering to suppress leakage of EOG artifacts and obtain meaningful EEG signals.

Through the modulation of brain signals, such as motor imagery (MI), BCI can be applied in both medical and non-medical fields. Medical applications, including cochlear implants for the deaf and deep brain stimulation for Parkinson’s disease, have become increasingly common in treatment [7, 8]. In addition, scientists and engineers have investigated various non-medical BCI products such as lie detectors, alertness monitors, games, and e-learning systems [9,10,11,12]. EEG-based BCI systems offer strong opportunities to help patients with severe motor impairment regain motor control, for example through cursor control, prosthetic control, and spellers [13,14,15]. EEG-based BCIs have also been applied to interacting with a Web browser, controlling robots, and lie detection [16, 17]. In general, a BCI system works by detecting changes in brain activity when the brain responds to voluntary or involuntary mental commands. The Motor Imagery Brain-Computer Interface (MI-BCI) is the most popular approach for detecting EEG signals associated with imagined movements. When subjects imagine moving a particular part of the body, the neural activity in the sensorimotor cortex is spatiotemporally similar to the activity during the actual movement. This type of neural response therefore allows us to discriminate EEG signals and match them to each imagined body movement [18]. The biggest challenge in an MI-BCI system is feature extraction, which must yield features that are robust, informative, and discriminative. The motor imagery frequencies show both inter- and intra-subject variability, which leads to low accuracy in such systems [19, 20].

Regarding EEG recording and motor imagery, Hans Berger had already described the oscillations of alpha waves with eyes closed and open by the 1930s [21], a phenomenon commonly observed in EEG experiments. Event-related desynchronization (ERD) and synchronization (ERS) were later introduced to investigate the dynamics of EEG oscillations [22]. During imagined movement recorded with EEG and EMG, these phenomena change the power of the mu (8–13 Hz) and beta (13–30 Hz) bands [23, 24].

Event-related desynchronization (ERD) is a neurological phenomenon that was introduced by Gert Pfurtscheller and colleagues in the 1970s [25]. It can be used in BCI to classify different motor imagery classes. When a subject imagines a limb movement, event-related desynchronization occurs, leading to a drop in the power of the mu and beta bands. Conversely, event-related synchronization (ERS) increases the power of the mu and beta bands when the imagined movement stops.

Band power is the most common methodology used to determine event-related desynchronization and synchronization. First, the signal is filtered into distinct bands such as 8–14 Hz and 24–30 Hz; the strength of the ERD/ERS effect differs across frequency components (e.g. in 8–14 Hz the ERD effect occurs most clearly). Within the same frequency bands, the ERD/ERS effect is not identical for each subject, so it is recommended to select the frequency bands individually for each participant. The band-limited signal is then converted into power values, so that ERD/ERS appears as a change in band power [26]. In an experiment, we perform various trials to obtain the final result, so the resulting signal is averaged over trials. This describes how the ERD/ERS effect is detected and quantified with the band power method.
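As a concrete illustration (the classical definition used in the ERD/ERS literature, e.g. by Pfurtscheller, rather than an equation taken from this paper), the relative band-power change is usually quantified as:

$$\mathrm{ERD/ERS}\;(\%) = \frac{P_{\text{event}} - P_{\text{ref}}}{P_{\text{ref}}} \times 100$$

where $P_{\text{ref}}$ is the average band power in a reference (rest) interval and $P_{\text{event}}$ is the band power during the imagery period; negative values correspond to ERD and positive values to ERS.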

In this paper, we used dataset 2a from the BCI Competition IV (2008), which contains EEG recordings of 9 subjects. The dataset includes 18 files of data recordings (nine files for the training set and nine files for the evaluation set) and 18 files of true labels for the evaluation step. The data are divided into four classes corresponding to four different motor imagery tasks (left hand, right hand, both feet, and tongue) [27]. This work aims to build an algorithm that distinguishes these four motor imagery tasks. Multichannel electroencephalogram (EEG) signals normally give a fuzzy image of brain activity due to their low signal-to-noise ratio (SNR) [28], which is why signals must be filtered before they are usable in BCI applications. In our proposed algorithm, we applied several methods to detect and remove unwanted components and obtain clean EEG signals. First, the EEG signals were band-pass filtered between 7 and 30 Hz with a 5th-order Butterworth IIR filter. We also applied linear regression and down-sampling to 62.5 Hz in the preprocessing.

Compared with other techniques such as wavelet-based filters, linear regression is a simpler method that allows us to easily predict the effect of the filtering [29]. Due to their efficiency and reliability, these methods are very helpful for improving the signal-to-noise ratio. For feature extraction, the common spatial pattern (CSP) method was introduced. CSP is one of the most efficient and common methods in BCI for extracting class-discriminative information from EEG signals [30, 31]. It is a spatial filtering method that optimizes the discriminability of two classes [30].

Furthermore, the CSP method reduces the number of feature components, which emphasizes the differences between the motor imagery classes. After that, Mutual Information Best Individual Features (MIBIF) was used to select the best features for classification. The combination of CSP and MIBIF can raise the classification accuracy and kappa score. For classification, linear discriminant analysis (LDA) was used to classify the features and reduce the dimension of the feature vector.

2 Materials and Methods

In this work, we aim to translate brain signals into an intended movement. To reach this goal, a set of processing phases is required to create the control signal, comprising pre-processing, feature extraction, feature selection, and classification. An illustration of the processing scheme is shown in Fig. 1 and the details of each stage are presented in the following sections.

Fig. 1 The workflow of the experiment

2.1 Data Acquisition and Dataset

The dataset used for this work is dataset 2a from the BCI Competition IV in 2008 [27]. This data set is provided by the Institute for Knowledge Discovery (Laboratory of Brain-Computer Interfaces), Graz University of Technology (Clemens Brunner, Robert Leeb, Gernot Müller-Putz, Alois Schlögl, Gert Pfurtscheller).

Dataset 2a includes EEG data from 9 subjects. For each subject, two sessions were recorded on different days. Each session consists of six runs, and each run consists of 48 trials covering the four classes, so a single session comprises 288 trials. The cue of each trial corresponds to one of four motor imagery tasks: left hand (class 1), right hand (class 2), feet (class 3), and tongue (class 4). A trial begins with a warning tone and, at the same time, a fixation cross displayed on the computer screen. After two seconds, a cue is shown as a small arrow pointing left, right, up, or down, lasting 1.25 s. The subjects are prompted to perform the imagery task until the fixation cross disappears from the screen at t = 6 s, followed by a short break. Figure 2 illustrates the paradigm of a single trial.

Fig. 2 Timing scheme of the experimental paradigm

Twenty-two EEG channels were recorded using Ag/AgCl electrodes. The specific challenge of this dataset is the eye-movement artifact, so three monopolar electrooculographic (EOG) channels were added for artifact-processing purposes. The signals were sampled at 250 Hz and band-pass filtered between 0.5 and 100 Hz, with an additional 50 Hz notch filter.

2.2 Preprocessing

Artifact Removal

Since the output of the proposed method is the four motor imagery classes, the relevant ERD/ERS information lies in specific frequency bands of the brain rhythms. A 5th-order Butterworth infinite impulse response (IIR) filter with a 7–30 Hz passband was therefore applied to concentrate on the mu (8–13 Hz) and beta (13–30 Hz) rhythms. Additionally, the three EOG components were used to subtract the EOG artifacts from the EEG signals by linear regression [32]. We assume the following regression model:

$$X = E + kO$$
(1)

Here, X is the recorded EEG, represented as a matrix of N channels X = [x1, x2, x3, …, xN], E denotes the uncontaminated EEG signal without eye-movement artifacts, O denotes the pure EOG channels, and k denotes the weights with which the EOG artifact affects the N EEG channels.

The dataset includes EOG channels for each subject (recorded with electrodes placed adjacent to the eyes to avoid other artifacts). To compute the unknown k, multiplying (1) by the transposed EOG signal O^T (the superscript T denotes the transpose) and averaging over time yields:

$$\langle O^{T} X\rangle = \langle O^{T} E\rangle + k\langle O^{T} O\rangle$$
(2)

Since we can assume that the artifact-free EEG and the EOG signal are uncorrelated, ⟨O^T E⟩ equals zero and the coefficient k becomes:

$$k = \langle O^{T} X\rangle \langle O^{T} O\rangle^{-1}$$
(3)

And the uncontaminated EEG signal can be found according to:

$$E = X - kO$$
(4)

This approach is also called the “multiple least-squares approach”, since it can remove more than one EOG component. In the Graz dataset 2a there are three EOG channel recordings per subject, so the algorithm was applied to each EOG channel in turn, choosing k so that the mean square of E is minimized.
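As a minimal sketch of Eqs. (1)–(4) (not the authors’ code; the variable names, the channels × samples layout, and the joint handling of all three EOG channels are assumptions), the regression-based removal can be written in MATLAB as:

```matlab
% Remove EOG artifacts from EEG by least-squares regression (Eqs. 1-4).
% X: 22 x T matrix of EEG samples, O: 3 x T matrix of EOG samples.
function E = removeEOG(X, O)
    % Regression weights of the EOG channels for every EEG channel,
    % assuming clean EEG and EOG are uncorrelated: k = <X O'><O O'>^-1
    k = (X * O') / (O * O');   % 22 x 3 coefficient matrix
    E = X - k * O;             % subtract the estimated EOG contribution
end
```

The weights can also be estimated one EOG channel at a time, as described above; the joint form shown here handles all three channels in a single least-squares step.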

Resampling

This step reduces the time spent processing the signal and limits small variations in the data. According to the Nyquist theorem [33], the highest frequency in the input signal must be at most half the sampling rate. Thus, the filtered signal (band-limited to 30 Hz) must be sampled at more than 60 Hz. We therefore used one-fourth of the original sampling rate, i.e. 62.5 Hz.
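A minimal sketch of the filtering and resampling stage (the 7–30 Hz 5th-order Butterworth filter and the factor-of-four down-sampling come from the text; the zero-phase filtering and the variable names are assumptions):

```matlab
% Band-pass filter (7-30 Hz Butterworth) and downsample 250 Hz -> 62.5 Hz.
fs = 250;                                   % original sampling rate (Hz)
[b, a] = butter(5, [7 30] / (fs/2), 'bandpass');
filtered    = filtfilt(b, a, eeg')';        % zero-phase filtering, eeg: channels x samples
downsampled = downsample(filtered', 4)';    % keep every 4th sample -> 62.5 Hz
```

Since the filter already limits the signal to 30 Hz, simple decimation by 4 is sufficient; otherwise `decimate` or `resample`, which include anti-aliasing filters, would be preferable.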

Fig. 3 An illustration of the preprocessing method. a Original EEG signal of the first channel of subject 1. b The filtered signal. c The downsampled signal

A visualization of the preprocessing step is presented in Fig. 3. This figure shows the time period from 225 to 240 s of the first channel of subject 1.

2.3 Feature Extraction

Common Spatial Pattern (CSP)

This algorithm magnifies the discriminability between classes by optimizing the variances of the filtered data [34]. Let W be an N × M matrix (M spatial filters for N channels), with the superscript T denoting the transpose, and let E(t) be the input EEG signal; the CSP model is then:

$$E_{CS} (t) = W^{T} E(t)$$
(5)

For the spatial filter, the covariance matrices Rc under the conditions c ∈ {1, 2} are given in the equation below. Although multiple conditions were used in the proposed method, two conditions are shown here for simplicity of illustration.

$$R_{c} = \frac{1}{K}\sum_{i} Z_{c}^{i} \left( Z_{c}^{i} \right)^{T}$$
(6)

Here, Z_c^i denotes the EEG data of the i-th trial under condition c, and K is the number of trials. Moreover, the CSP filters satisfy:

$$W^{T} R_{c} W = \tau_{c}$$
(7)
$$\tau_{1} + \tau_{2} = I$$
(8)

The sum of the two diagonal matrices τ1 and τ2 is the identity matrix I. This means that the variances of the two classes sum to 1: if the variance of class 1 is high, the variance of class 2 is correspondingly low and vice versa. The CSP filters are obtained by simultaneous diagonalization of the covariance matrices of the two classes.
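For illustration, a common way to compute two-class CSP filters is via a generalized eigenvalue decomposition of the class covariance matrices (a sketch assuming trial-wise normalized covariances; not necessarily the authors’ exact implementation):

```matlab
% Two-class CSP: trials1/trials2 are cell arrays of [channels x samples] trials.
function W = cspFilters(trials1, trials2, m)
    R1 = averageCov(trials1);
    R2 = averageCov(trials2);
    % Solve R1*w = lambda*(R1 + R2)*w; extreme eigenvalues give the most
    % discriminative variance ratios between the two classes.
    [V, D] = eig(R1, R1 + R2);
    [~, order] = sort(diag(D), 'descend');
    V = V(:, order);
    W = [V(:, 1:m), V(:, end-m+1:end)];   % m filters per class
end

function R = averageCov(trials)
    R = zeros(size(trials{1}, 1));
    for i = 1:numel(trials)
        Z = trials{i};
        R = R + (Z * Z') / trace(Z * Z'); % normalized spatial covariance, Eq. (6)
    end
    R = R / numel(trials);
end
```

The spatially filtered signal of Eq. (5) is then `W' * E`, and the log-variance of each filtered component is typically used as the feature.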

Band Power and Time Domain Parameters

Band power features were extracted by using the BioSig library [35] to band-pass filter the signal into the target bandwidths. Our target frequency bands are 8–14, 19–24, and 24–30 Hz. The signal was filtered with a 5th-order Butterworth IIR band-pass filter whose passbands are the target bands. To prevent leakage effects that reduce the quality of the processed signal, we apply a smoothing window of 2 s after filtering. Finally, the natural logarithm of the output signal is taken to enhance the performance of linear classification.
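The following sketch shows one way the log band-power features could be computed (the band edges, filter order, and 2-s smoothing window come from the text; the use of `movmean` and the variable names are assumptions, and the BioSig call used by the authors is not reproduced here):

```matlab
% Log band-power features in the 8-14, 19-24 and 24-30 Hz bands.
% sig: channels x samples, fs: sampling rate of the signal (e.g. 62.5 Hz).
bands = [8 14; 19 24; 24 30];
win   = round(2 * fs);                       % 2-s smoothing window
features = [];
for b = 1:size(bands, 1)
    [bb, aa] = butter(5, bands(b, :) / (fs/2), 'bandpass');
    x = filtfilt(bb, aa, sig')';             % band-limited signal
    p = movmean(x.^2, win, 2);               % smoothed instantaneous power
    features = [features; log(mean(p, 2))];  % one log-power value per channel
end
```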

2.4 Feature Selection

Feature selection is a critical step in the classification problem. Some extracted features are irrelevant to the output variable: they do not contribute much to model performance but still take time to train on. Furthermore, high-dimensional data leads to complicated computation and overfitting. Therefore, feature selection methods are applied to reduce the dimensionality of the input while maintaining the information essential for classification.

Mutual Information Best Individual Features

Mutual information is a quantity that measures the relationship, or statistical dependence, between two random variables [36]. The mutual information between two variables X and Y is given by:

$$I(X,Y) = \sum_{x \in X} \sum_{y \in Y} P(x,y)\,\log \frac{P(x,y)}{P(x)P(y)}$$
(9)

where P(x, y) is the joint distribution and P(x) and P(y) are the marginal distributions of X and Y.

The Mutual Information (MI) Function

The MI algorithms were developed by Räsänen [37].

The MI function estimates the mutual information and ranks the features, scoring each feature by its individual mutual information with the four output classes. The higher a feature’s score, the better the performance expected when it is included in the model.

From the previous step, 24 features corresponding to the three bandwidths 8–14, 19–24, and 24–30 Hz are extracted. The mutual information of the 24 features is calculated and used as their weight. Based on our experiments, the 15 highest-scoring features are kept.
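A rough sketch of this selection step (the 10-bin discretization and the variable names are assumptions; only the idea of ranking features by their mutual information with the class label follows the text):

```matlab
% Rank features by mutual information with the class label, keep the best 15.
% F: trials x 24 feature matrix, y: trials x 1 class labels in {1,2,3,4}.
nBins = 10;
miScore = zeros(size(F, 2), 1);
for j = 1:size(F, 2)
    xq    = discretize(F(:, j), nBins);            % quantize the feature
    pxy   = accumarray([xq, y], 1);                % joint histogram
    pxy   = pxy / sum(pxy(:));
    px    = sum(pxy, 2);
    py    = sum(pxy, 1);
    nz    = pxy > 0;
    ratio = pxy ./ (px * py);
    miScore(j) = sum(pxy(nz) .* log2(ratio(nz)));  % Eq. (9)
end
[~, idx] = sort(miScore, 'descend');
selected = F(:, idx(1:15));                        % 15 best individual features
```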

2.5 Classification

Linear discriminant analysis (LDA) is implemented in our experiment.

Although multi-class classification was performed in this work, the principle of the LDA algorithm is illustrated here with binary classification for simplicity. Specifically, to discriminate two classes X and Y, LDA finds a direction w such that the projections of the samples of both classes onto this direction satisfy the two following conditions:

  • Maximize the distance between the means of given classes.

  • Minimize the variation within each class.

By projecting each sample onto w, the data of the two classes are represented on a new axis on which it is easier to establish a decision boundary l that discriminates them (Fig. 4).

Fig. 4 Projection of red and blue data points, representing the two classes in the dataset, onto different vectors w by the LDA classifier. a Scatter plot of the data of the two classes. b Projection of each data point onto the new vector w. c Separation of the data points of the two classes by the line l

A good model yields a large distance between the two classes when their data points are projected onto vector w. A threshold can then be defined to separate the two classes and used to predict new data points.
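For completeness, the direction w that satisfies both conditions is given by the classical Fisher solution (stated here explicitly, since the text describes it only verbally):

$$w \propto S_{W}^{-1}(\mu_{1} - \mu_{2}), \qquad S_{W} = \Sigma_{1} + \Sigma_{2}$$

where $\mu_{1}, \mu_{2}$ are the class means and $\Sigma_{1}, \Sigma_{2}$ the within-class covariance matrices.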

For the implementation of the LDA algorithm for the classification of the four motor imagery tasks, the MATLAB function fitcdiscr is used with:

  • Bayesian optimization.

  • Iterations: 30.

At each iteration, the Bayesian optimizer updates its model of the objective function and proposes new hyperparameter values in order to obtain the best fit to the observed data points, thereby improving the classification performance of the model.
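The corresponding MATLAB call could look roughly as follows (only `fitcdiscr`, the Bayesian optimizer, and the 30 iterations come from the text; the data variables are placeholders):

```matlab
% LDA with Bayesian hyperparameter optimization (30 objective evaluations).
mdl = fitcdiscr(trainFeatures, trainLabels, ...
    'OptimizeHyperparameters', 'auto', ...
    'HyperparameterOptimizationOptions', struct( ...
        'Optimizer', 'bayesopt', ...
        'MaxObjectiveEvaluations', 30, ...
        'ShowPlots', false));
predicted = predict(mdl, testFeatures);   % labels for the evaluation set
```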

3 Evaluation Metric

Cohen’s kappa coefficient was calculated to evaluate the performance of the classification model. The kappa score is considered a robust measure of categorical discrimination because it takes into account not only the percentage of agreement but also the probability of chance agreement. Kappa statistics focus on the class distribution of the predictions and ignore the number of features used to obtain it, so this measure is also useful for imbalanced datasets.

Cohen’s kappa equation [38] is defined as:

$$k = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}$$
(10)
where Pr(a) is the relative observed agreement among raters and Pr(e) is the hypothetical probability of chance agreement, calculated from the observed data as the probability that the raters agree by chance on each category.
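For reference, Cohen’s kappa can be computed directly from a confusion matrix; the sketch below is generic and not tied to the authors’ evaluation script:

```matlab
% Cohen's kappa from a square confusion matrix C (rows: true, columns: predicted).
function k = cohenKappa(C)
    n  = sum(C(:));
    pa = trace(C) / n;                        % observed agreement, Pr(a)
    pe = sum(sum(C, 2) .* sum(C, 1)') / n^2;  % chance agreement, Pr(e)
    k  = (pa - pe) / (1 - pe);                % Eq. (10)
end
```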

4 Result and Discussion

As mentioned in the evaluation metric section, the kappa value is suggested to address several accuracy-validation problems in BCI research. Different trials can result in different kappa values, so we averaged the results over several trials to obtain a more representative and more general kappa. The model reached a mean kappa of 0.57 when trained on the Graz dataset 2a. Additionally, we used the evaluation set of the same dataset to test our model and obtained a mean evaluation kappa of 0.41, which approaches the training result; consequently, our model is considered reliable. Table 1 summarizes the kappa coefficients of the training and evaluation sets.

Table 1 The final training and evaluation kappa scores for each subject

Improvement. To compare the efficiency of our method with more traditional setups, we evaluated two other models: the first performs artifact removal without linear regression, and the second performs feature extraction without CSP. Figure 5 shows that the best model is the full proposed method (k = 0.41), while the model without linear regression in the artifact-removal step is slightly less effective (k = 0.39). The experiment also shows that using CSP for feature extraction improves the kappa score (from 0.33 to 0.41).

Fig. 5 Comparison of the effects of linear regression and CSP on the kappa score

To assess the role of linear regression and CSP in improving the kappa score, the Kruskal–Wallis test [39] was used to compare the mean values of the three models. This non-parametric test was chosen because the sample size is small (nine subjects per method) and the data do not satisfy the assumptions of a parametric test, according to the Levene test (p = 0.025) [39, 40]. In the average ranks (see Table 2), the model with feature extraction without CSP is ranked lowest, which confirms that CSP plays an important role in our pipeline. On the other hand, linear regression has only a small impact on the mean rank compared with the model without it. Furthermore, Fig. 5 shows that applying both linear regression and CSP enhances the mean kappa score. Nevertheless, no significant difference in the means is observed among the three approaches (p = 0.641).
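In MATLAB, such a comparison can be run with the built-in Kruskal–Wallis function (a generic sketch; the 9 × 3 matrix of per-subject kappa values and the group labels are placeholders for the authors’ data):

```matlab
% kappaMatrix: 9 subjects x 3 methods (full method, no linear regression, no CSP).
[p, tbl, stats] = kruskalwallis(kappaMatrix, ...
    {'Proposed', 'No regression', 'No CSP'}, 'off');  % 'off' suppresses the figure
meanRanks = stats.meanranks;   % average rank of each method, cf. Table 2
```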

Table 2 The Kruskal–Wallis test summarizes the mean rank of 3 methods

Confusion Matrix

To describe the behaviour of the 4-class classifier, we use the confusion matrix, which relates the predicted classes (the labels intended by the user) to the true classes (the known labels). Table 3 shows the proportion of evaluation-data samples predicted by the proposed model for each ground-truth class; the bold values correspond to correct classifications. The higher the bold values of the confusion matrix, the more effective the model.

Table 3 Confusion matrix of the evaluation data

According to Table 3, tongue motor imagery obtained the best result with 87.8% correct predictions. In contrast, left-hand motor imagery reached only 64%; this class was still frequently confused with the right-hand class (25% of left-hand trials were labelled as right-hand motor imagery). Meanwhile, feet and right-hand motor imagery achieved considerable results, 79% and 85.1% respectively. Overall, our model performs well in classifying the four motor imagery classes (79% on average).

Comparison with the BCI Competition IV (2008)

Compared with the BCI Competition IV (2008) results on the same Graz dataset 2a, our work, which removes eye artifacts and applies Mutual Information Best Individual Features (MIBIF) and the LDA algorithm, yields promising results. Our kappa is higher than that of the 3rd prize in the competition (0.41 versus 0.31) [41] (see Table 4). Using the Filter Bank CSP and a Naive Bayes Parzen Window classifier, K. K. Ang et al. won the first prize with a kappa value of 0.57 [42]. Similarly to the proposed method, Liu Guangquan et al. also applied the LDA algorithm; however, its combination with the log-variance of the best eight components and a Bayesian approach explains why their kappa value (k = 0.52) is higher than ours [43].

Table 4 The comparison kappa values of 9 subjects with the three best competitors

Code Deposit

Our code has been uploaded to GitHub: https://github.com/nganluu0903/BCI_project.

Limitation and Future Steps

In the proposed method, the techniques used in the signal-processing procedure (feature extraction, feature selection, and classification) are fairly common. Therefore, a future step is to select and apply more recent advanced techniques (e.g. recurrent neural networks) to improve the quality of this study. Furthermore, the evaluated dataset, Graz dataset 2a, which was published in 2008, has various limitations, so more recent datasets or a self-constructed dataset are highly recommended for future work.

5 Conclusion

EEG-based BCI is a non-invasive approach that is growing significantly and brings many advantages as well as challenges in EEG signal processing [44]. In recent studies, motor imagery based BCI (MI-BCI) has become a common approach; however, classifying imagined motor tasks remains challenging [32].

In this paper, the proposed algorithm exploits the event-related desynchronization (ERD) phenomenon to categorize four different motor imagery classes. We implemented band power features, time-domain parameters, and common spatial patterns (CSP) for feature extraction, while Mutual Information Best Individual Features was used for feature selection. A linear discriminant analysis (LDA) classifier was then used to classify the four classes. The proposed algorithm was evaluated with the kappa score, obtained from the confusion matrix, which quantifies the effectiveness of the algorithm and makes it easy to compare the accuracy of different models. In addition, spatial filtering is recommended to enhance the kappa score and increase the reliability of the model.

For future work, it is recommended to focus on raising the performance of the MI-BCI algorithm by using different techniques for encoding the EEG signal. Several factors influence the performance of the proposed algorithm. For instance, frequency band selection for each subject is necessary since the effect of the ERD/ERS phenomenon differs between individuals; to optimize system quality, an automatic method for selecting the frequency bands should be considered before preprocessing. Furthermore, the filter bank common spatial patterns (FBCSP) algorithm should be considered for feature extraction in separate, smaller frequency bands.