Introduction

Brain science is one of the most challenging research fields; it attempts to reveal the inner mechanisms of brain activity and function. A Brain-Computer Interface (BCI) [1, 8, 28] provides a bridge between the brain and an external device for brain science studies. Electroencephalogram (EEG) [2, 4] data are collected using BCI devices to obtain information about brain activities. EEGs present the responses of the brain to external stimuli and are widely used for monitoring brain activities. Motor imagery EEG [20] has attracted the interest of many researchers and plays an important role in the study of dyskinesia, especially for patients with neuromuscular disorders.

EEG recognition is divided into two stages: feature extraction and classification. For feature extraction, various methods have been developed, such as band power (BP) [21], power spectral density (PSD) values [26], and wavelet packages (WP) [29]. For classification, support vector machines (SVM) [3], neural networks [22], and naive Bayes [15] are widely used. Gert [20] uses band power values and a neural network for the online classification of right and left motor imagery and also analyzes offline classification using an adaptive autoregressive (AAR) model. An algorithm for epileptic seizure detection using lacunarity and Bayesian linear discriminant analysis (BLDA) [30] has been proposed for long-term intracranial EEG. Luis [18] presents an adaptive semi-supervised classification algorithm for online multi-class motor imagery EEG and achieves the highest accuracy (77%) to date. A combination of linear and nonlinear EEG features and a k-nearest neighbor (k-NN) classifier [16] is used to detect mild depression in college students with high accuracy.

Although neural networks have been widely used in EEG classification, the iterative process of parameter fine-tuning limits the learning speed. Hence, a type of single-hidden-layer feed-forward neural network (SLFN) named the Extreme Learning Machine (ELM) was proposed by Huang [5, 11, 31], in which the weights between the input and hidden layers are randomly generated, and the weights between the hidden and output layers are computed without an iterative process. Recently, the HELM [24], an approach based on the ELM, was presented; it extends the ELM into a deep neural network without back propagation or an iterative process. The HELM performs better than the ELM; however, the random parameters in the network affect its stability. By combining the HELM with the Kernel-based Extreme Learning Machine (Kernel-ELM) [9], the KHELM is proposed. This method not only maintains the high training speed but also performs better than the HELM and the Kernel-ELM. Therefore, it is applied to the classification of motor imagery EEG together with PCA and LDA.

The contributions of this paper are as follows:

  1. To improve the stability of the HELM, the HELM is combined with a Gaussian kernel to form the KHELM method.

  2. The KHELM is introduced to the classification of motor imagery EEG signals for the first time, and a classification system based on the KHELM is proposed.

  3. The classification system performs better than state-of-the-art methods with respect to accuracy as well as training and testing speeds.

The remainder of this paper is organized as follows. In the “Method” section, the classification system, including feature extraction and classification, is presented. In the “Experiments” section, experiments are conducted on the BCI competition 2003 Dataset Ia, and the performance results are given. In the “Conclusion” section, the conclusions are presented.

Method

The original EEG signals are easily affected by the environment, which results in a low signal-to-noise ratio (SNR). In addition, given their complicated components, EEG signals are difficult to recognize. In the proposed approach, we first partition each sample into overlapping segments. Because the partitioned EEG data are still characterized by high dimensionality and a low SNR, it is important to extract typical and distinctive features. Then, PCA is used to extract the principal dimensions of each segment. However, PCA cannot reveal class-discriminative information. After PCA, an LDA step is introduced to decrease the coupling between the classes. To further integrate and analyze all segments of each sample, the features are rearranged as the inputs of the KHELM for classification. Therefore, the system involves two parts: (1) feature extraction with PCA and LDA and (2) classification based on the KHELM. The framework of the proposed method is shown in Fig. 1.

Fig. 1 The framework of the system

Feature Extraction with PCA and LDA

As a common tool for dimensionality reduction, PCA projects the original high-dimensional data into a lower-dimensional space by maximizing the total scatter of the projected samples [13]. In this way, the SNR of the signal can be increased. Then, LDA is introduced to search for the best projection direction and decrease the coupling between classes, which reduces the dimension to 1 for binary-class data [19]. The features obtained by PCA and LDA are more discriminative.

Denoting the training data as Z and the testing data as T, the procedure is as follows. First, the eigenvectors and eigenvalues of the data are calculated by decomposing the covariance matrix. The eigenvectors corresponding to the top \(l\) eigenvalues form the basis \(L_{PCA} = [\nu_{1}, \nu_{2}, \ldots, \nu_{l}]\). Because a larger eigenvalue represents a larger contribution rate of the corresponding principal component, we choose the \(l\) principal components using a threshold on the accuracy contribution rate (ACR). Then, we obtain the training features \(Z_{PCA} = Z \cdot L_{PCA}\) and the testing features \(T_{PCA} = T \cdot L_{PCA}\). Finally, LDA is applied to \(Z_{PCA}\) and \(T_{PCA}\) to obtain the remapped matrices \(X_{train}\) and \(X_{test}\) as the final features.
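For illustration, a minimal Python sketch of this per-segment projection is given below. The names Z_seg, T_seg, y_train, and extract_segment_feature are hypothetical, and scikit-learn's PCA (which performs the covariance decomposition internally) and LDA stand in for the steps described above; the paper's exact implementation may differ.

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def extract_segment_feature(Z_seg, T_seg, y_train, n_components=16):
    # PCA: project onto the top principal components (16 chosen via the ACR threshold)
    pca = PCA(n_components=n_components)
    Z_pca = pca.fit_transform(Z_seg)      # training features Z_PCA = Z . L_PCA
    T_pca = pca.transform(T_seg)          # testing features  T_PCA = T . L_PCA

    # LDA: reduce the PCA features to a single discriminant dimension (binary classes)
    lda = LinearDiscriminantAnalysis(n_components=1)
    x_train = lda.fit_transform(Z_pca, y_train)
    x_test = lda.transform(T_pca)
    return x_train, x_test                # one feature per trial for this segment
```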

Classification Based on KHELM

The ELM is attractive for machine learning and artificial intelligence because of its learning speed. However, the ELM is limited to simple data because of its shallow architecture. Therefore, the Hierarchical Extreme Learning Machine was built, extending the structure into a more complex architecture [24]. In this paper, the KHELM is proposed based on a combination of the HELM and the Kernel-ELM. The technique is divided into two parts, as shown in Fig. 2: (1) unsupervised feature extraction based on the ELM-based sparse auto-encoder and (2) a supervised feature classifier based on the Kernel-ELM.

Fig. 2 The architecture of the KHELM learning algorithm [24]

For the KHELM learning algorithm, the first-layer weights of the KHELM are set as the transpose of the output weights \(\beta_{1}\) learned by the ELM-based Sparse Auto-Encoder shown in Fig. 2a, and the weights of each subsequent layer \(i+1\) are set as the transpose of the output weights \(\beta_{i+1}\) of the ELM-based Sparse Auto-Encoder shown in Fig. 2b. The framework of the KHELM includes two phases: multi-layer forward encoding followed by the Kernel-ELM classifier.

For the training features \(\{({x}_{i},{y}_{i})\}^{N}_{{i}=1} \in R^{{d}} \times R^{{m}}\) of the EEG data, \(x_{i}\) is the input vector and \(y_{i}\) is the class label. The output of each hidden layer is represented as follows:

$$ {H}_{{i}} = {g}({H}_{{i-1}} \cdot \beta_{{i}}^{{T}}), $$
(1)

where \(H_{i}\) is the output matrix of the \(i\)th hidden layer (the input \(x\) is regarded as the 0th hidden layer, i.e., \(H_{0} = x\)), \(g\) is the activation function of the hidden layers, and \(\beta_{i}\) is the output weight learned for the \(i\)th layer. As shown in Fig. 2, once the features of the previous hidden layer are extracted, the weights of the current hidden layer can be learned without fine-tuning.
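As a minimal sketch, the forward encoding of Eq. (1) can be written as follows; the function names, the list betas of per-layer output weights, and the sigmoid activation are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np

def g(a):
    # sigmoid activation; one common choice for the hidden layers
    return 1.0 / (1.0 + np.exp(-a))

def forward_encode(X, betas):
    H = X                      # the input is treated as the 0th hidden layer
    for beta in betas:         # Eq. (1): H_i = g(H_{i-1} . beta_i^T)
        H = g(H @ beta.T)
    return H
```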

To extract the abstract representation of the input features, the ELM-based Sparse Auto-Encoder is chosen to model the feature representation part of the KHELM, and its structure is shown in Fig. 2a.

The ELM is a promising technique with a high learning speed that requires only trivial human intervention [10]. Compared with other techniques, the ELM provides better generalization performance and can obtain globally optimal solutions [12]. We use the ELM to build an auto-encoder in which the target is replaced by the input, called the ELM-AE. To generate sparser and more compact features of the inputs, an \(l_{1}\) constraint is added, extending it into the ELM-based Sparse Auto-Encoder. Its optimization model is as follows:

$$ O_{\beta} = \arg\min\limits_{\beta}\{ \|{H}\beta - {X}\|^{2} + \| \beta\|_{l1} \}, $$
(2)

where X denotes both the input data and the output target, and \(\beta\) is the weight between the hidden and output layers. With the \(l_{1}\) and \(l_{2}\) constraints, the ELM-based Sparse Auto-Encoder is less sensitive to the input, and thus its generalization performance is enhanced.

The ELM-based Sparse Auto-Encoder is summarized in Algorithm 1, and its structure is shown in Fig. 2a. In the ELM-based Sparse Auto-Encoder, the hidden-layer output H is decoded into the input X with the weight \(\beta\). It can be shown that the input X can in turn be transformed into \(H^{*}\) with the weight \(\beta^{T}\), where \(H^{*}\) is the hidden-layer output in the KHELM and an approximation of H, and \(\beta^{T}\) is the transpose of \(\beta\).

Algorithm 1 The ELM-based Sparse Auto-Encoder
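A hedged Python sketch of the ELM-based Sparse Auto-Encoder follows: a random hidden mapping is generated, and the output weights \(\beta\) of Eq. (2) are obtained with a plain ISTA (iterative soft-thresholding) loop as a stand-in for the actual \(l_1\) solver used in the paper; the function name, layer size, and regularization value are illustrative.

```python
import numpy as np

def elm_sparse_autoencoder(X, n_hidden, lam=1e-3, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, (X.shape[1], n_hidden))   # random input weights
    b = rng.uniform(-1.0, 1.0, n_hidden)                  # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))                # random hidden-layer output

    # ISTA for Eq. (2): min_beta ||H beta - X||^2 + lam * ||beta||_1
    L = 2.0 * np.linalg.norm(H, 2) ** 2                   # Lipschitz constant of the gradient
    beta = np.zeros((n_hidden, X.shape[1]))
    for _ in range(n_iter):
        grad = 2.0 * H.T @ (H @ beta - X)
        z = beta - grad / L
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding
    return beta                                            # the layer weights are beta.T
```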

The Kernel-ELM combines the ELM with a kernel. The hidden-layer output matrix \(HH^{T}\) is replaced by the kernel function

$$ {H}{H}^{T} = {\Phi}_{ELM} = {\left[ \begin{array}{ccc} {K}({x}_{1},{x}_{1}) & {\cdots} & {K}({x}_{1},{x}_{L}) \\ {\vdots} & {\ddots} & {\vdots} \\ {K}({x}_{L},{x}_{1}) & {\cdots} &{K}({x}_{L},{x}_{L}) \end{array} \right]}, $$
(3)

where \(K(\cdot,\cdot)\) is the kernel function. The most popular kernel for the Kernel-ELM is the Gaussian kernel \(K(x_{i}, x_{j}) = \exp(-\|x_{i} - x_{j}\|^{2}/K)\), in which \(K\) is the kernel parameter.
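The kernel matrix \({\Phi}_{ELM}\) of Eq. (3) can be computed as in the short sketch below; the function name and the default value of K are illustrative.

```python
import numpy as np

def gaussian_kernel_matrix(X1, X2, K=1e5):
    # pairwise squared Euclidean distances between the rows of X1 and X2
    d2 = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-np.maximum(d2, 0.0) / K)   # K(x_i, x_j) = exp(-||x_i - x_j||^2 / K)
```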

The output of the Kernel-ELM is as follows:

$$\begin{array}{@{}rcl@{}} {f}({x}) &=& {h}({x})\beta = {h}({x}){H}^{T}\left({H}{H}^{T}+\frac{I}{C}\right)^{\dag}{Y} \\&=& {\left[ \begin{array}{c} {K}({x},{x}_{1}) \\ {\vdots} \\ {K}({x},{x}_{L}) \end{array} \right]}^{T}\left({\Phi}_{ELM}+\frac{I}{C}\right)^{\dag}{Y}, \end{array} $$
(4)

where C is the regularization coefficient, \((\cdot)^{\dag}\) represents the Moore-Penrose generalized inverse, and Y is the label matrix of the training samples.
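A hedged sketch of the Kernel-ELM classifier defined by Eq. (4) is given below; the class name is illustrative, SciPy's cdist is used for the pairwise distances, and the default C and K follow the values selected later in the experiments.

```python
import numpy as np
from scipy.spatial.distance import cdist

class KernelELM:
    def __init__(self, C=1e4, K=1e5):
        self.C, self.K = C, K

    def _kernel(self, X1, X2):
        # Gaussian kernel: K(x_i, x_j) = exp(-||x_i - x_j||^2 / K)
        return np.exp(-cdist(X1, X2, 'sqeuclidean') / self.K)

    def fit(self, X, Y):
        # Y is a one-hot label matrix; alpha = (Phi_ELM + I/C)^+ Y
        self.X_train = X
        omega = self._kernel(X, X)                          # Phi_ELM = H H^T
        self.alpha = np.linalg.pinv(omega + np.eye(X.shape[0]) / self.C) @ Y
        return self

    def predict(self, X):
        k = self._kernel(X, self.X_train)                   # rows: [K(x, x_1), ..., K(x, x_L)]
        return np.argmax(k @ self.alpha, axis=1)            # class with the largest output
```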

The procedure for training the KHELM is given in Algorithm 2. Using the ELM-based Sparse Auto-Encoder, the KHELM is extended into a deep neural network, and the features are re-encoded in a more compact and sparse form, which improves the classification performance of the KHELM. Meanwhile, because the hidden layer is replaced by a kernel function in the Kernel-ELM, the randomness of the parameters in the KHELM is lower than in the HELM, which makes the KHELM a more stable network than the HELM.

Algorithm 2 The training of the KHELM
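Putting the pieces together, the end-to-end KHELM training might be sketched as follows; this reuses elm_sparse_autoencoder, forward_encode, and KernelELM from the sketches above, and the two encoding layers and their sizes are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

# X_train, X_test: the 18-dimensional PCA+LDA features; y_train: 0/1 class labels
Y_train = np.eye(2)[y_train]                        # one-hot targets for the Kernel-ELM

betas, H = [], X_train
for n_hidden in (200, 200):                         # two encoding layers (sizes illustrative)
    beta = elm_sparse_autoencoder(H, n_hidden)      # Algorithm 1: learn sparse output weights
    betas.append(beta)
    H = 1.0 / (1.0 + np.exp(-(H @ beta.T)))         # Eq. (1): H_i = g(H_{i-1} beta_i^T)

clf = KernelELM(C=1e4, K=1e5).fit(H, Y_train)       # supervised Kernel-ELM on encoded features
y_pred = clf.predict(forward_encode(X_test, betas)) # encode the test features, then classify
```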

Experiments

Dataset Description

In this section, we first introduce the EEG dataset for motor imagery. The dataset is the BCI competition 2003 Dataset Ia, recorded from a healthy subject. The subject is asked to imagine moving a cursor up and down on a computer screen. Meanwhile, his cortical potentials are recorded, and he receives visual feedback of his slow cortical potentials. Cortical positivity leads to a downward movement of the cursor on the screen, whereas cortical negativity leads to an upward movement of the cursor. All the trials are composed of a training set (268 trials, with 135 for class 0 and 133 for class 1) and a testing set (293 trials, with 147 for class 0 and 146 for class 1), and each trial lasts 6 s. During each trial, the task is visually presented by a highlighted goal at either the top or bottom of the screen, indicating negativity or positivity, from 0.5 s until the end of the trial. The visual feedback is presented from 2 s to 5.5 s. Only this 3.5-s interval of each trial is used for training and testing. The 256 Hz sampling rate and 3.5-s recording length result in 896 samples per channel for each trial. Here, the central parietal region electrode (Cz-Mastoids) is chosen as the reference electrode, and the remaining six electrodes, which are located as shown in Fig. 3, are used to collect the signals.

Fig. 3 Distribution of the EEG electrodes for the 6 channels

In the experiment, we first choose only the signals from channel 1 and channel 2 (A1 and A2) to generate the discriminative feature sets, based on previous work [7]. Then, the continuous recording samples are partitioned into sub-epochs of 500 ms with a 125 ms overlap. The raw data are thus split into 9 segments per channel, each with 128 dimensions, giving 18 segments in total.
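For clarity, a small sketch of this partitioning is shown below (variable names are illustrative): the 3.5-s feedback interval sampled at 256 Hz gives 896 samples per channel, and 500 ms windows (128 samples) with a 125 ms (32-sample) overlap yield 9 segments per channel.

```python
import numpy as np

FS = 256                                            # sampling rate (Hz)

def partition_trial(channel_signal, win=128, overlap=32):
    # 500 ms windows with a 125 ms overlap, i.e., a 96-sample hop
    step = win - overlap
    starts = range(0, len(channel_signal) - win + 1, step)
    return np.stack([channel_signal[s:s + win] for s in starts])

trial = np.zeros(6 * FS)                            # dummy 6-s trial for illustration
feedback = trial[int(2.0 * FS):int(5.5 * FS)]       # the 3.5-s feedback interval (896 samples)
segments = partition_trial(feedback)
assert segments.shape == (9, 128)                   # 9 segments of 128 samples per channel
```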

Discussion of Results

Parameter Selection for Feature Extraction

For feature extraction, we use a combination of PCA and LDA. First, PCA is used for dimensionality reduction. With the Accuracy Contribution Rate (ACR) threshold set to 99%, a feature dimension of 16 is sufficient for all segments, as shown in Fig. 4, so 16 principal components are retained. Then, the 16-dimensional features of each segment are reduced to 1-dimensional features using LDA. Finally, the features of channels A1 and A2 are combined to form the 18-dimensional input of the KHELM.

Fig. 4 ACR for PCA of the segments of channel A1
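The dimension selection can be sketched as follows, interpreting the ACR as the cumulative explained-variance ratio of the principal components (an assumption); segment_data and the function name are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def select_dimension(segment_data, threshold=0.99):
    pca = PCA().fit(segment_data)                     # full PCA on one (trials x 128) segment
    acr = np.cumsum(pca.explained_variance_ratio_)    # accumulated contribution per dimension
    return int(np.searchsorted(acr, threshold) + 1)   # smallest l with ACR >= threshold
```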

Parameter Selection of KHELM

For the KHELM, three parameters require tuning: the parameter C for the regularized least mean square calculation, the number of hidden nodes L, and the parameter of the kernel function K. In the experiment, C is confined to \(\{10^{-10}, 10^{-9}, \ldots, 10^{9}, 10^{10}\}\), L is confined to \(\{500, 600, \ldots, 1400, 1500\}\), and K is confined to \(\{10^{-10}, 10^{-9}, \ldots, 10^{9}, 10^{10}\}\). The influences of L, C, and K on the performance of the KHELM are shown in Fig. 5.

Fig. 5 Classification accuracy of the KHELM. a Accuracy of the KHELM with respect to L. b Accuracy of the KHELM with respect to C and K

Figure 5a shows the impact of L on the performance of the KHELM, with the parameters C and K fixed beforehand. As shown in Fig. 5a, the accuracy fluctuates slightly around a central line, and the amplitude of the fluctuation is less than 1%. This demonstrates that the KHELM is not sensitive to the number of hidden nodes. In Fig. 5b, the 3-D accuracy curves of the KHELM in terms of C and K are shown. Clearly, C and K are the main influential factors for the KHELM. Therefore, C and K must be selected carefully; the best choices for C and K are \(10^{4}\) and \(10^{5}\), respectively.
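The sweep over these grids could be organized as in the sketch below; train_khelm and evaluate are hypothetical helpers standing in for the actual experiment code that trains a KHELM with the given parameters and returns its accuracy.

```python
import itertools

C_grid = [10.0 ** p for p in range(-10, 11)]        # {10^-10, ..., 10^10}
K_grid = [10.0 ** p for p in range(-10, 11)]        # {10^-10, ..., 10^10}
L_grid = range(500, 1501, 100)                      # {500, 600, ..., 1500}

best_acc, best_params = 0.0, None
for C, K, L in itertools.product(C_grid, K_grid, L_grid):
    acc = evaluate(train_khelm(C=C, K=K, L=L))      # hypothetical training/evaluation helpers
    if acc > best_acc:
        best_acc, best_params = acc, (C, K, L)
```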

Evaluation of the Classification System

To demonstrate the effectiveness of the classification system, the performance of the KHELM is first compared with that of other ELM-based classifiers using the same features. The performance of state-of-the-art methods on the same dataset is also reported for comparison.

For all experiments, the hardware and software specifications are as follows: PC, Intel-i5 3.30 GHz CPU, 8.00 GB RAM, Windows 7, and MATLAB R2012a.

To verify the superiority of the KHELM, the accuracies of 50 experiments are averaged for comparison with an HELM of similar structure. The results are shown in Fig. 6. The average accuracy of the KHELM is higher than that of the HELM, and the amplitudes of the fluctuations of the KHELM are substantially smaller than those of the HELM. Clearly, the HELM is more sensitive to the number of hidden nodes than the KHELM. Therefore, the KHELM is more stable than the HELM.

Fig. 6 Comparison of the KHELM and the HELM

We compare the performance of the KHELM with other ELM-based methods such as the avg-ELM, Kernel-ELM, V-ELM [8], ML-ELM, and HELM. The result of the avg-ELM is an average over 50 ELMs. To obtain the results of the V-ELM, we use 50 ELMs to vote. The ML-ELM is a simple architecture stacked layer by layer with the ELM-AE; here, we set it to have three hidden layers with (20, 20, 2000) hidden nodes. The HELM has three hidden layers, similar to the KHELM. Table 1 gives the results of the comparison. The average accuracy of the KHELM is 91.99%, while those of the HELM and Kernel-ELM are 90.94% and 91.81%, respectively. Meanwhile, the best accuracies of the KHELM and HELM are equal at 94.54%. The training and testing speeds of the KHELM are approximately five times higher than those of the HELM. In terms of both accuracy and speed, the performance of the KHELM is therefore superior to that of the Kernel-ELM and the HELM. Table 1 also shows that the best accuracy achieved by the KHELM is the highest overall, and its speed is higher than that of the other ELM-based methods except for the avg-ELM; because the avg-ELM has only one hidden layer, its high speed is expected. The results demonstrate that the sparse weights and the kernel function have an important effect on the performance of the system with respect to speed and accuracy.

Table 1 Comparison with other methods based on the ELM

We also compare our system with state-of-the-art methods on the same dataset. Table 2 gives the final results of the selected methods. The proposed system performs better than the state-of-the-art methods, with an accuracy increase of at least 2.39%. The results demonstrate that the classification system is effective for motor imagery EEG data and that introducing the KHELM to the classification of motor imagery EEG data is worthwhile.

Table 2 Comparison with various related methods

Conclusion

In this paper, a new system is proposed for motor imagery EEG signal classification based on PCA, LDA, and the KHELM. To improve the stability of the HELM, the KHELM is proposed; the results demonstrate that it is a successful improvement on the HELM. Compared with various other feature extraction methods, the features obtained by PCA and LDA are suitable for motor imagery signals, being low-dimensional and highly discriminative. Compared with other classification methods, the KHELM is easy to implement and requires only trivial human intervention. By extracting deep and compact feature information, the KHELM achieves high accuracy and high speed in classification. The experimental results demonstrate that the system is effective and efficient when applied to binary-class motor imagery signals.