1 Introduction

BCI applications have recently attracted the attention of many researchers [1,2,3]. One application of BCI is motor imagery (MI), which creates a new communication channel for disabled people that does not depend on muscle control. BCI allows a person to communicate with the environment independently of peripheral muscles and nerves, using EEG brain signals to drive assistive devices such as wheelchairs, robotic arms, and computers. It improves the quality of life of people with disabilities [4, 5].

The EEG signal reflects the electrical activity of the brain. These signals are nonlinear and contain useful information about the state of the brain, and they are recorded by electrodes placed on the scalp [5, 6]. EEG signals are used in applications such as epilepsy diagnosis [7], emotion detection [8], Alzheimer's disease detection [9], and lie detection [10].

MI is a mental process in which a person practices or simulates a particular movement without physically performing it [11]. In [12], the EEG signals of motor imagery are filtered by the CSP method. Then, these data are transferred to a five-dimensional phase space. PCA is used for feature extraction and phase space dimension reduction. Poincaré sections are applied to analyze the data trajectory in phase space. The parameters needed for the phase space reconstruction and the Poincaré coefficients are adjusted via an evolutionary algorithm. Finally, the signals are classified by SVM.

In [13], a method is proposed for four-class EEG motor imagery classification. Features are obtained using different feature extraction methods, such as the phase locking value (PLV) and higher-order statistics (HOS). The nonlinear principal component analysis (NLPCA) method was employed for dimension reduction. Features are classified using a backpropagation neural network (BPNN), k-nearest neighbors (k-NN), and a support vector machine (SVM). The classification accuracy of NLPCA with SVM on the PLV and HOS features is higher than that of the other classifiers.

In [14], conditional empirical mode decomposition (CEMD) is used for feature extraction. In CEMD, two conditions, correlation coefficients and relative energy occupancy rate, are used to select IMFs. Then, EEG signals are reconstructed from the selected IMFs. A model called 1DMSCNN is used for motor imagery EEG signal classification. This model is based on a convolutional neural network (CNN) with a multi-scale convolution feature extraction strategy. An intelligent wheelchair system is designed and implemented based on this algorithm. Jana [15] used the fast Fourier transform (FFT) for feature extraction on the BCI Competition III dataset IVa. DNN, SVM, KNN, Naive Bayes, Random Forest, and Decision Tree classifiers are used in the classification step. SVM is used with sigmoid and RBF kernels, KNN is used with parameter k = 3, and the maximum depth of trees in the Decision Tree method is 3. The DNN, with two hidden layers of 7 and 6 neurons, respectively, achieves the highest classification performance.

The motivation for this paper is as follows. Existing methods do not combine the concepts of increasing and decreasing the data dimensionality simultaneously. On the other hand, the number of features per subject in this application is not very large. Therefore, the data can be moved to another space to provide richer discriminant information and higher classification accuracy. Kernel functions commonly fall into two categories, global and local. Global kernel functions have strong extrapolation characteristics, while local kernel functions have better interpolation characteristics, so a single kernel function has specific limitations. Therefore, we decided to present a new efficient learning space based on multi-kernel learning. Since each kernel function has parameters, we used a meta-heuristic algorithm to fine-tune these parameters and perform dimension reduction. In this study, a space learning concept is proposed for multichannel motor imagery EEG signal classification. Four kernels are employed to map the features extracted by CSP into a high-dimensional space that provides better discriminant information. The four kernels are combined using a multi-kernel learning approach based on meta-heuristic optimization. The weight and parameters associated with each base kernel in the composite kernel are adjusted using the optimizer. The data dimensionality increases after mapping to the space generated by the kernels, and the curse of dimensionality occurs. Thus, the dimensionality of the data should be reduced. The dimensionality of the reduced space and the mapping matrix are obtained by the optimizer.

After the dimension reduction step, the data are fed as input to the classifier. Three classifiers, KNN, ELM, and SVM, are used. The proposed method is evaluated on the BCI Competition III dataset IVa. Experimental results show that the proposed method outperforms other methods. The rest of the paper is structured as follows. In Sect. 2, the basic concepts used in the paper are described. The proposed method is presented in Sect. 3. In Sect. 4, the proposed method is evaluated. Finally, the conclusion is presented in Sect. 5.

2 Literature review

In this section, the building blocks of the proposed method are reviewed. CSP for feature extraction and multi-kernel learning (MKL) for combining base kernels and mapping data to a higher-dimensional space are described. The Equilibrium Optimizer (EO) is explained for finding optimal parameter values. Finally, the ELM, KNN, and SVM classifiers used for EEG signal classification are described.

2.1 Common spatial pattern

Common spatial pattern (CSP) is a feature extraction method that extracts the features of each class by applying spatial filters to the input signals. CSP and its variants have received considerable attention for feature extraction in MI applications in recent years [16,17,18]. CSP is an effective method for extracting features in two-class movement-pattern BCI classification. Assume \({X}_{R},{X}_{L}\in {R}^{N\times M}\) are the EEG signals recorded for the two classes, where N is the number of channels and M is the number of samples per channel. First, the normalized spatial covariance matrix is calculated for each class:

$$ C = \frac{{EE^{\prime } }}{{{\text{trace}}\left( {EE^{\prime } } \right)}} $$
(1)

trace(.) is the sum of the diagonal elements, ′ denotes the matrix transpose, and \(E\) is \({X}_{L}\) or \({X}_{R}\). The composite spatial covariance is obtained according to the following equation and decomposed into a matrix of eigenvectors (\({U}_{c}\)) and the corresponding diagonal matrix of eigenvalues (\(\lambda \)):

$$ C_{c} = \overline{C}_{L} + \overline{C}_{R} = U_{c} \lambda U_{c} ^{\prime} $$
(2)

\({\overline{C} }_{L}\) and \({\overline{C} }_{R}\) are the averages of the normalized covariance matrices of the left- and right-hand classes, respectively. The diagonal matrix of eigenvalues λ is arranged in descending order. The whitening transformation matrix is obtained as:

$$ p = \lambda^{{ - \frac{1}{2}}} U_{c} ^{\prime} $$
(3)

The data are transformed using the whitening matrix as follows:

$$ S_{L}=p\overline{C}_{L}p^{\prime},\quad S_{R}=p\overline{C}_{R}p^{\prime} $$
(4)

Here \({S}_{L}\) and \({S}_{R}\) share common eigenvectors, and the sum of the corresponding eigenvalue matrices equals the identity matrix:

$$ S_{L} = B\lambda_{L} B^{\prime} ,\quad S_{R} = B\lambda_{R} {B}^{\prime } ,\quad \lambda_{L} + \lambda_{R} = I $$
(5)

The eigenvector with the largest eigenvalue corresponds to one class, and the eigenvector with the smallest eigenvalue corresponds to the other class. In the next step, the projection matrix is calculated:

$$ W = B^{\prime} p $$
(6)

The matrix \(W\) contains spatial filters that, when applied to the inputs, simultaneously maximize the variance of the signals of one class and minimize it for the other, so that the variance ratio between the two classes is maximized. The EEG signal of each class is then filtered by the W matrix:

$$Z=WE$$
(7)

where E can be \({X}_{R}\) or \({X}_{L }\) [19, 20].
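As a concrete illustration of the procedure in Eqs. (1)-(7), the following is a minimal sketch in Python, assuming NumPy/SciPy; the function and variable names are ours, not from the paper.

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_L, trials_R, m=1):
    """Compute 2m CSP spatial filters from two classes of EEG trials.

    trials_L, trials_R: lists of (channels x samples) arrays.
    Returns W of shape (2m x channels), whose rows are spatial filters.
    """
    def avg_norm_cov(trials):
        # Eq. (1): normalized spatial covariance, averaged over trials
        covs = [(E @ E.T) / np.trace(E @ E.T) for E in trials]
        return np.mean(covs, axis=0)

    C_L, C_R = avg_norm_cov(trials_L), avg_norm_cov(trials_R)
    # The generalized eigenvalue problem C_L v = lambda (C_L + C_R) v is
    # equivalent to the whitening route of Eqs. (2)-(6)
    vals, vecs = eigh(C_L, C_L + C_R)
    order = np.argsort(vals)
    # Keep the m filters with the smallest and the m with the largest eigenvalues
    picks = np.concatenate([order[:m], order[-m:]])
    return vecs[:, picks].T  # applied as Z = W @ E, Eq. (7)
```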

2.2 Multi-kernel learning (MKL)

The kernel function is represented as an inner product of data points, which is interpreted as the similarity between them. Suppose \(T={\left\{{(x}_{i},{y}_{i})\right\}}_{i=1}^{n}\) is the training dataset, where n is the number of labeled feature samples, \({x}_{i}\) is the ith sample, and \({y}_{i}\in \left\{+1,-1\right\}\) is its class label. M base kernels \(\left\{{K}_{1},{K}_{2},\dots ,{K}_{M}\right\}\) are constructed from the samples, where \({K}_{m}\left({x}_{i},{x}_{j}\right)=<{\varphi }_{m}\left({x}_{i}\right),{\varphi }_{m}\left({x}_{j}\right)>\) is the mth base kernel function and \({\varphi }_{m}\) transfers the data to a higher-dimensional (Hilbert) space. The combination function is parameterized by a set of weights d = (\({d}_{1},{d}_{2}\),…,\({d}_{M}\)). The MKL algorithm obtains the composite kernel (\({\mathbf{k}}_{\mathrm{c}}\)) by tuning the parameters of this function. The combination function can be linear or nonlinear; nonlinear functions are rarely used in machine learning because of their high computational complexity.

$$ {\mathbf{k}}_{\mathrm{c}} = \sum_{m = 1}^{M} d_{m} K_{m}, \quad \text{s.t.}\;\; d_{m} \ge 0, \;\; \sum_{m = 1}^{M} d_{m} = 1 $$
(8)

\({\mathbf{k}}_{\mathrm{c}}\) is the composite kernel, \({K}_{m}\) is the mth base kernel, M is the number of base kernels, and \({d}_{m}\in \mathbf{d}\) is the weight associated with the mth kernel, estimated by MKL algorithms [21,22,23].
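A minimal sketch of the linear combination in Eq. (8), assuming NumPy; the normalization used to enforce the constraints is our illustrative choice.

```python
import numpy as np

def combine_kernels(kernels, d):
    """Composite kernel k_c = sum_m d_m K_m (Eq. 8).

    kernels: list of M precomputed Gram matrices K_m (same shape).
    d: M raw weights; normalized so that d_m >= 0 and sum(d_m) = 1.
    """
    d = np.abs(d) / np.sum(np.abs(d))  # enforce the constraints of Eq. (8)
    return sum(w * K for w, K in zip(d, kernels))
```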

2.3 Equilibrium optimizer (EO)

Like most meta-heuristic algorithms, the Equilibrium Optimizer (EO) uses an initial population to start the optimization process. In EO, each particle acts as a search agent with its own position. There is no knowledge about the equilibrium state at the beginning of the optimization process, so four particles are considered as candidates to guide the particle search pattern. EO initially assigns a large fitness value to the equilibrium candidates. The fitness value is calculated for the population particles and compared with the fitness values of the four candidate particles; if the fitness value of a particle is less than that of a candidate, the candidate is replaced by that particle. Finally, the average of the four candidate particles is calculated, and the four candidates together with their average construct the equilibrium pool. Particle positions are updated according to the best available solutions (the equilibrium candidates) until the equilibrium state is reached. The equilibrium state is the most optimal state of the algorithm, i.e., its final convergence state [24].
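The pool bookkeeping described above can be sketched as follows; this is a simplified illustration of the candidate-replacement step, not the full EO update rules of [24], and all names are ours.

```python
import numpy as np

def update_equilibrium_pool(candidates, cand_fit, particles, fit):
    """Maintain the four best-so-far particles and build the equilibrium pool.

    candidates/cand_fit start as random particles with a large fitness value.
    """
    for x, f in zip(particles, fit):
        worst = int(np.argmax(cand_fit))  # candidate with the worst fitness
        if f < cand_fit[worst]:           # better particle found: replace it
            candidates[worst], cand_fit[worst] = x.copy(), f
    c_avg = np.mean(candidates, axis=0)   # average of the four candidates
    return candidates + [c_avg]           # pool = 4 candidates + their average
```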

2.4 Classifier

SVM: A support vector machine (SVM) searches for a hyperplane for data classification. This method selects the optimal hyperplane, the one with the maximum margin from the data, where the margin is the distance between the nearest examples and the hyperplane [8].


KNN: K-nearest neighbors (KNN) is a classifier used by many researchers for EEG signal classification [25,26,27]. For a new test sample X, KNN finds the K nearest neighbors of X in the training set according to a distance metric. These K samples then vote to determine the label of X: the label of X is the label with the maximum number of votes [28].


ELM: Extreme learning machine (ELM) is modeled as follows [29, 30]:

$$\sum_{j=1}^{L}{{\varvec{\beta}}}_{j}h_{j}\left({{\varvec{X}}}_{i}\right)={y}_{i},\quad i=1,\dots ,N$$
(9)

\(L\): number of hidden nodes; \(h_{j}\left({{\varvec{X}}}_{i}\right)\): output of the jth hidden node (a nonlinear feature map); \({{\varvec{\beta}}}_{{\varvec{j}}}\): the weight vector that connects the jth hidden neuron to the output neurons. Different functions, such as the sigmoid, are used to map features in ELM. The hidden layer output matrix (H) is calculated as follows:

$$ H\left( {a_{1} , \ldots ,a_{L} ,{ }b_{1} , \ldots ,b_{L} ,{ }x_{1} , \ldots ,x_{N} } \right) = \left( {\begin{array}{*{20}c} {h_{1} \left( {x_{1} } \right)} & \cdots & {h_{L} \left( {x_{1} } \right)} \\ \vdots & \ddots & \vdots \\ {h_{1} \left( {x_{N} } \right)} & \cdots & {h_{L} \left( {x_{N} } \right)} \\ \end{array} } \right) \in R^{N \times L} $$
(10)

\({\mathrm{a}}_{j}={\left[{a}_{j1},{a}_{j2},\dots ,{a}_{jD}\right]}^{T}\in {R}^{D}\) is the weight vector that connects the input neurons to the jth hidden neuron, and \({b}_{j}\) is the jth hidden node bias. These two parameters are assigned randomly. The output weight vector \(\widetilde{{\varvec{\beta}}}\) is calculated as follows:

$$ \tilde{\user2{\beta }} = H^{\dag } {\mathbf{y}} $$
(11)

The symbol \(\dag\) denotes the Moore–Penrose generalized inverse and \(\mathbf{y}={\left[{y}_{1},\dots ,{y}_{N}\right]}^{\mathrm{T}}\). The ELM decision function for a new test sample \(\widehat{X}\) is as follows:

$$y=sign\left(h\left(\widehat{X}\right)\widetilde{{\varvec{\beta}}}\right)$$
(12)

3 The proposed method

The proposed method consists of four phases, shown in Algorithm 1. First, the input signals are pre-processed. In phase 2, features are extracted by CSP. In the next phase, the data are mapped to another space using a combination of multiple kernels; the parameters and weight related to each kernel are calculated by the meta-heuristic optimization method. Mapping the data into the new space increases their dimensionality and causes the curse of dimensionality, so the data dimensionality is then reduced through the optimizer: the number of dimensions of the reduced space and the mapping matrix are obtained using the optimizer. In phase 4, the data are used as the classifier input to determine the test sample label. The code of the activation function and the number of hidden layer neurons in the ELM classifier are also determined by the optimizer.

[Algorithm 1]

3.1 Pre-processing

By selecting the data from the appropriate frequency band, signals are obtained that lead to better feature extraction and more accurate classification. In this phase, the raw EEG signals are filtered by a Butterworth band-pass filter (BPF) in the range of 8–30 Hz, which retains the data useful for motor imagery; the BPF also removes low- and high-frequency noise [31,32,33]. Studies have shown that the MI-related event-related desynchronization (ERD) and event-related synchronization (ERS) phenomena occur in the \(\alpha \) (8–13 Hz) and \(\beta \) (13–30 Hz) rhythms [34, 35]. ERD refers to a decrease in rhythmic activity associated with motor events; ERS refers to an increase in rhythmic activity [36].

The Butterworth band-pass filter is applied to the input signal through its transfer function coefficients (\(a\), \(b\)). For band-pass filters, these coefficients are row vectors of length 2\(n\)+1, where \(n\) is the filter order. The filter order equals the maximum delay used to generate an output sample. The transfer function H(z) is as follows [37]:

$$\mathrm{H}\left(\mathrm{z}\right)=\frac{b\left(1\right)+b\left(2\right){z}^{-1}+\dots +b\left(n+1\right){z}^{-n}}{a\left(1\right)+a\left(2\right){z}^{-1}+\dots +a\left(n+1\right){z}^{-n}}$$
(13)

As a result:

$$ Y\left( z \right) = H\left( z \right) X\left( z \right) $$
(14)

\(Y\left(z\right)\) is the z-transform of the filter output signal, and \(X\left(z\right)\) is the z-transform of the filter input signal \(X\). The input and filtered signals of subject aw for the right-hand and right-foot classes are shown in Fig. 1.
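As an illustration, this filtering step can be reproduced with SciPy as sketched below; the 8–30 Hz band and the 100 Hz sampling rate come from the paper, while the filter order, the zero-phase `filtfilt` variant, and the random placeholder trial are our assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 100                                   # sampling rate of the dataset (Hz)
trial = np.random.randn(118, 350)          # placeholder trial: channels x samples
# Coefficients (b, a) of the transfer function in Eq. (13); order 4 is illustrative
b, a = butter(4, [8, 30], btype="bandpass", fs=fs)
filtered = filtfilt(b, a, trial, axis=-1)  # band-pass filtering, cf. Eq. (14)
```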

Fig. 1: A: Right-hand class signals from channel 1; B: Right-foot class signals from channel 1

3.2 Feature extraction

In this phase, features are extracted from the filtered signals by CSP with parameter M = 1. The matrix \(W=\left[{w}_{1},{w}_{2},\dots {w}_{2M}\right]\in {R}^{N\times 2M}\) contains the spatial filters. The feature vector \({\varvec{a}}\) is obtained from the input signal E (Eq. 15). The log operator is used to normalize the distribution and reduce the range of the obtained features [29].

$$ a = \log \left[ \mathrm{var} \left( Z^{\prime} \right) \right] $$
(15)

\(var\)(.) is the variance. As an example of the output of this step, the feature vectors of the two classes for subject aw are shown in Fig. 2.
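A sketch of Eq. (15), reusing the `csp_filters` helper from the CSP sketch in Sect. 2.1; the normalized variant noted in the comment is a common choice and is our assumption.

```python
import numpy as np

def csp_features(E, W):
    """Log-variance feature vector a (Eq. 15) for one trial E (channels x samples)."""
    Z = W @ E              # spatially filtered signals, Eq. (7)
    v = np.var(Z, axis=1)  # variance of each filtered signal
    return np.log(v)       # Eq. (15); np.log(v / v.sum()) is a common variant
```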

Fig. 2: The feature vectors of the two classes (right foot, right hand) for subject aw

3.3 Space learning

In this phase, a learning space based on multi-kernel and dimension reduction techniques is devised. This phase includes parameter tuning, feature mapping with kernels, and dimension reduction.

3.3.1 Feature map with kernels

As shown in Fig. 2, there is considerable overlap between the features of the two classes obtained in the previous step. Therefore, the features are mapped to another, higher-dimensional space using the composite kernel. The composite kernel is a combination of the base kernels in Table 1.

Table 1 Functions of the basic kernels used in the composite kernel

The linear kernel is simple and does not require parameter tuning. The RBF kernel is a local kernel with good learning ability that is compatible with many conditions, such as high or low dimensionality and large or small samples. It requires few parameters compared to other kernel functions and is therefore convenient for regularization, but it has a wide convergence domain and poor generalization ability. The polynomial kernel is a global kernel with low learning ability and high generalization ability; if the degree of the polynomial is too high, the generalization ability decreases and overfitting may occur [38, 39]. By trial and error, we decided to use the RBF kernel twice because it gave better results. To combine the base kernels, the parameter and weight associated with each base kernel must be learned; these are adjusted with a meta-heuristic optimizer (Figs. 3, 4).
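The base kernels can be sketched as below and fed to the `combine_kernels` helper from Sect. 2.2; since Table 1 is not reproduced here, the exact functional forms (e.g., the offset in the polynomial kernel) are assumptions, and the single RBF parameter shared by the two RBF kernels follows the particle coding of Sect. 3.3.2.

```python
import numpy as np

def linear_k(X, Y):
    return X @ Y.T

def poly_k(X, Y, degree):
    return (1.0 + X @ Y.T) ** degree  # assumed form of the polynomial kernel

def rbf_k(X, Y, gamma):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * d2)

def composite_kernel(X, Y, d, gamma, degree):
    """Four base kernels (linear, polynomial, RBF used twice) combined per Eq. (8)."""
    Ks = [linear_k(X, Y), poly_k(X, Y, degree), rbf_k(X, Y, gamma), rbf_k(X, Y, gamma)]
    return combine_kernels(Ks, d)
```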

Fig. 3: Representation of particles using the KNN and SVM classifiers

Fig. 4: Representation of particles using the ELM classifier

3.3.2 Parameters tuning, dimension reduction

Meta-heuristic MKL algorithms can estimate the weights of the kernels in a composite kernel (Eq. 8) using meta-heuristic rules [21]. Here, the weights of the base kernels and their parameters are adjusted through EO. EO has parameters, such as the generation rate, that enhance its exploration, exploitation, and local-optima avoidance, and it performs better than other algorithms such as PSO, GWO, GA, GSA, SSA, CMA-ES, and LSHADE [24]. In EO, the population size, the maximum number of iterations, and the number of runs are set to 40, 25, and 5, respectively. The search interval is set to [−1, 1] so that the data space is uniform. The dimensions of the data increase after mapping by the composite kernel; the number of reduced space dimensions and the mapping matrix, obtained through EO, are used to reduce the data dimensionality linearly. The number of dimensions is restricted to the range [1, 10].
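Because EO searches in [−1, 1], every particle entry must be rescaled to its parameter range before use; a minimal sketch of this scale change, assuming a linear mapping (the paper does not state the exact mapping):

```python
def rescale(x, lo, hi):
    """Linearly map x from the EO search interval [-1, 1] to [lo, hi]."""
    return lo + (x + 1.0) * (hi - lo) / 2.0

# e.g., the reduced dimensionality is confined to [1, 10]:
dim = round(rescale(0.2, 1, 10))  # -> 6
```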

3.3.2.1 Fitness function

The inputs of the fitness function are the particle population and the feature vectors of the training data. The composite kernel matrix is built from the four kernels \(\left\{{\mathrm{k}}_{1}{,\mathrm{k}}_{2}{,\mathrm{k}}_{3}{,\mathrm{k}}_{4}\right\}\) on the input samples. The data are divided into three parts: test, train, and validation. One of the study's innovations is that the value of the fitness function equals the validation error; the train and test errors are also calculated in this function. With the SVM and KNN classifiers, each particle is encoded as an array, called particle, as follows:

The first four elements of the particle are the weights associated with the base kernels in the composite kernel. Element 5 is the value of the RBF parameter, used as the input for the two RBF kernels. Element 6 is the degree of the polynomial kernel. Element 7 is the number of dimensions to which the data are reduced. Element 8 is the mapping matrix W, which reduces the data dimensionality linearly. The dimension reduction is calculated with Eq. 16, where X is the output matrix of the composite kernel.

$${X}_{new}= X\times W\left(:,1:particle\left(7\right)\right)$$
(16)

The fitness function for this method is summarized in Algorithm 2.

[Algorithm 2]
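Since the algorithm figure is not reproduced here, the following is a hedged reconstruction of the fitness evaluation for the KNN variant, built from the particle layout above; the helpers `composite_kernel` and `rescale` are from the earlier sketches, and the shape assumed for the mapping matrix is illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fitness(particle, X_tr, y_tr, X_val, y_val):
    """Validation error of one particle (sketch of Algorithm 2, KNN variant)."""
    d = particle[:4]                               # base-kernel weights
    gamma = rescale(particle[4], 1, 50)            # RBF parameter, range [1, 50]
    degree = round(rescale(particle[5], 2, 4))     # polynomial degree, range [2, 4]
    dim = round(rescale(particle[6], 1, 10))       # reduced dimensionality, [1, 10]
    W = np.reshape(particle[7:], (len(y_tr), -1))  # mapping matrix (assumed layout)

    K_tr = composite_kernel(X_tr, X_tr, d, gamma, degree)
    K_val = composite_kernel(X_val, X_tr, d, gamma, degree)
    Z_tr = K_tr @ W[:, :dim]                       # Eq. (16)
    Z_val = K_val @ W[:, :dim]

    clf = KNeighborsClassifier(n_neighbors=1).fit(Z_tr, y_tr)
    return 1.0 - clf.score(Z_val, y_val)           # fitness = validation error
```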

The coding of particles in the fitness function using the ELM classifier is as follows:

Element 7 denotes the code of the activation function, as presented in Table 2; for example, when element 7 equals 1, the sigmoid function is selected. Its value is rescaled to the range [1, 5]. Element 8 represents the number of hidden layer neurons, rescaled to the range [2, 50]. The dimension reduction is calculated linearly by Eq. 17.

Table 2 Activation functions used in the ELM Classifier
$${X}_{new}= X\times W\left(:,1:particle\left(9\right)\right)$$
(17)

The fitness function for this method is stated in Algorithm 3.

[Algorithm 3]

3.4 Classification

In this step, the data are sent as input to the classifier to determine the test sample label. SVM, KNN, and ELM are selected for classification. SVM is a fast classifier that is not sensitive to overtraining or high-dimensional data and has good generalization performance. The KNN algorithm has no explicit training phase, and all the work happens during prediction; it can learn nonlinear decision boundaries [40]. In this paper, the KNN classifier uses the majority-voting method because it is less sensitive to noisy data [28]; here, KNN is used with parameter K = 1. ELM has good scalability and a fast learning speed, and the weights of its hidden layer are tuning-free [29, 30]. In ELM, the number of input layer neurons equals the number of features, and the number of output neurons equals the number of classes. The number of hidden layer neurons and the activation function are obtained through EO; five activation functions are defined, as shown in Table 2.

4 Experimental results

4.1 Data description

The BCI Competition III dataset IVa [41] is used to evaluate the proposed method. It contains EEG data measured at a sampling rate of 100 Hz from 118 electrodes placed according to the international 10–20 system. The EEG signals were recorded from five healthy subjects (aa, al, av, aw, ay) during imagination of right-hand or right-foot movement, with 280 trials per subject [42]. Table 3 shows the number of labeled trials for each subject.

Table 3 Number of labeled trials for each subject for BCI Competition III dataset IVa

4.2 Experimental evaluation

In this section, the results of the proposed method are presented. Fivefold cross-validation is used for the final evaluation, and the data are divided into three parts: test, train, and validation.

For example, the results on subject aw with the KNN classifier are shown in Table 4.

Table 4 Results obtained by the proposed method with KNN classifier on subject aw

In this table, the optimizer runs five times and searches for the optimal values in each run. The kernel weight values are confined to the range [0, 1] because kernel weights cannot be negative [21, 22]. Different experiments have shown that the higher the RBF kernel parameter, the lower its learning ability [38]; for this reason, the value of this parameter is limited to 50 by trial and error, and it is rescaled to the range [1, 50]. If the polynomial kernel parameter is high, overfitting occurs [38], so its value is rescaled to the range [2, 4]. The Dimension column shows the number of dimensions to which the data are reduced; this value is rescaled to the range [1, 10]. Another particle element obtained by the optimizer is the mapping matrix (W), whose number of columns equals the value in the Dimension column; it is not presented in these tables. The validation, train, and test errors are calculated in the fitness function in each run. Finally, the average of these errors and the corresponding accuracies are calculated.

The results on subject aw with the SVM classifier are shown in Table 5. The descriptions of the columns of Table 5 are similar to those of Table 4.

Table 5 Results obtained by the proposed method with SVM classifier on subject aw

The results on subject aw with the ELM classifier are shown in Table 6. In this table, the Activation Function code column determines the activation function; its value is rescaled to the range [1, 5]. The Number of hidden layer neurons column shows the optimal number of hidden layer neurons. If the number of neurons in the hidden layer is too small, classification errors occur, and if it is too large, the hidden layer becomes overly complex; therefore, this value is rescaled to the range [2, 50] neurons.

Table 6 Results obtained by the proposed method with ELM classifier on subject aw

The accuracy of the classification of the proposed method with three classifiers is shown in Table 7.

Table 7 Classification performance evaluation criteria of the proposed method

According to the table, SVM has the highest accuracy on subject av, ELM has higher accuracy on subjects aa, al, and aw, and KNN has the highest accuracy on subject ay with 99.5%. ELM has the highest average classification accuracy, at 91.4%. KNN and ELM have the highest recall and precision values on aa, respectively. On subject al, ELM has the highest recall and precision. The SVM classifier has the highest recall and precision values on subject av. KNN and SVM have the highest recall and precision values on subject aw, respectively. KNN has the highest recall and precision values on subject ay, with 99.7% and 99%, respectively. An F-score of 0.88 was obtained on subject aa by ELM and KNN. Also, KNN and SVM have an F-score of 0.99 on subject ay.

The time complexity of the training phase is \(O\left(Iter\times n\times Tfit\right)\), where \(Tfit\), \(Iter\), and \(n\) are the running time of the fitness function, the number of iterations, and the number of particles, respectively. The time complexity of the test phase is \(O\left(Tfit\right)\). The advantages of the proposed method can be summarized as follows: the new efficient space learning based on multi-kernel learning yields more discriminant data and higher classification accuracy, and the time complexity of the test phase is low. Another innovation is that the fitness value equals the validation error; the train error is not used as the fitness value, to avoid overfitting.

4.3 Compare with state-of-the-art methods

In this section, a comparison is made between the proposed method and recent state-of-the-art methods. Method 1 refers to the proposed method with the KNN classifier, Method 2 with the SVM classifier, and Method 3 with the ELM classifier. According to Table 8, these methods are compared with twelve other methods.

Table 8 Comparison of proposed methods with other methods

In [29], the multi-kernel extreme learning machine (MKELM) method is presented for motor imagery classification. A combination of Gaussian and polynomial kernels is used to map the features extracted by CSP into a nonlinear feature space, and classification is then performed by the ELM algorithm. In [43], a correlation-based channel selection (CCS) method is proposed to select the channels that contain more correlated information; features are then extracted by regularized CSP (RCSP), and SVM with the RBF kernel is used for classification. In [44], a new deep architecture called the Deep Stacked Support Matrix Machine (DSSMM) is based on the principle of stacked generalization. DSSMM is constructed layer by layer; each layer contains an SMM module that can grasp the structural information between the rows or columns of the EEG feature matrix.

In [45], the P-LTCSP (PLV-modulated local temporal common spatial patterns) method is proposed for feature extraction. P-LTCSP incorporates the phase locking value (PLV) into LTCSP, an effective method for obtaining the temporally local manifold of EEG time series. PLV is applied to quantify the phase relationship between samples and is used as the weight between two EEG samples. An LDA classifier is used for motor imagery EEG signal classification. In [46], filter band CSP (FCCSP) is proposed for MI classification. FCCSP employs two regularization parameters in order to increase robustness and reduce the estimation variance. The EEG signal is divided into frequency sub-bands using wavelet packets, features are extracted from the sub-bands by component regularized CSP (CRCSP), and the final features selected by mRMR are fed to LDA for classification.

In [47], a bispectrum-based channel selection (BCS) method is proposed for MI task classification. The bispectrum is a statistical analysis method used to analyze the interactions between EEG signals; it is computed for each channel over all trials. Channels without redundant information are selected based on the larger F-scores, where the F-score is based on the sum of logarithmic amplitudes (SLA) and the first-order spectral moment (FOSM) features from the bispectrum. Features extracted by CSP are classified by SVM. In [48], a binary harmony search (BHS) algorithm is proposed for channel selection. Harmony search (HS) is a recent meta-heuristic optimizer. The BHS is binary coded, with the length of every harmony vector equal to the number of available channels in the dataset; if a decision variable holds the value 1, the corresponding channel is selected. Sparse representation-based classification (SRC), SVM, and LDA are performed on the CSP-extracted features, among which BHS-SRC has the highest accuracy.

In [49], a spatial-frequency-temporal (SFT) 3D CNN model is proposed for MI classification. The SFT-3DCNN model consists of 8 layers: the first is the input layer, the next three are SFT convolution layers, and the remaining four comprise the fully connected layers and the output layer. This novel 3D CNN with three fully connected layers is proposed for SFT feature extraction and classification. In [50], common time-frequency-spatial patterns (CTFSP) are proposed to extract sparse CSP features from multi-band filtered EEG. First, the EEG signals are pre-processed with a Butterworth band-pass filter of 8–30 Hz and divided into seven frequency bands. Features are extracted from each frequency band by CSP, the most significant features are selected by LASSO, and classification is done by the voting result of three SVM classifiers.

The Deep Stacked Feature Representation (DSFR) method is proposed in [51]. DSFR employs a set of feature decoding modules (FDMs), each of which includes a CSP and a support matrix machine (SMM). The architecture of DSFR has several layers; each layer is an FDM, which is fed with the predictions of all the previous layers and the original EEG feature to produce the EEG feature representation and prediction. In [52], the Firefly Algorithm (FA) is proposed to find an optimal EEG channel set, which is then ranked using Fisher information index criteria. Regularized Common Spatial Pattern with Aggregation (RCSPA) is used for feature extraction; RCSPA has two regularization parameters that control the bias-variance tradeoff among MI tasks. A regularized support vector machine (SVM) is used for motor imagery task classification.

In [53], the Multiobjective X-shaped Binary Butterfly Optimization Algorithm (MX-BBOA) is used to select EEG channels. The MX-BBOA method aims to maintain a balance between the classification accuracy and the number of channels. This method models the butterfly's natural behavior with dual sigmoid functions to solve the channel selection problem. Features are extracted by multivariate empirical mode decomposition (MEMD). SVM, Naive Bayes, and Decision Tree classifiers were used for classification, with SVM achieving the highest performance.

According to Table 8, Method 3, Method 2, and Method 1 have higher accuracy than the other methods, ranking first, second, and third, respectively. Method 3 and Method 2 have higher accuracy on all subjects than [49], which used a deep convolutional neural network; Method 1 has higher accuracy on four subjects than [49].

As shown in Fig. 5, the proposed Method 3 performs best on subject aa, while Method 2 and Method 1 have higher accuracy than the other methods on subjects av and ay, respectively. Table 9 compares the average classification accuracy of the proposed methods and the other methods.

Fig. 5: Comparison of the proposed methods in terms of accuracy with other methods

Table 9 Comparison of the average classification accuracy and Standard deviation of the proposed methods and other methods

Method 3, Method 2, and Method 1 have higher accuracy than the other methods, at 91.4%, 91.2%, and 90.1%, respectively. The standard deviation is used as a measure of robustness; as Table 9 shows, Method 2 and Method 3 have lower standard deviations and are therefore more robust. The proposed method with ELM improved the average classification accuracy and standard deviation by 3.9% and 2.28, respectively.

The Wilcoxon signed-rank test is a pairwise test that shows the differences between the behaviors of two algorithms [54]. P-values are positive values between 0 and 1; the smaller the P-value, the more significant the difference between the two methods. The results of this test for Method 1 are shown in Table 10.
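For reference, such a pairwise test can be computed with SciPy as sketched below; the per-subject accuracy vectors are placeholders, not the paper's numbers.

```python
from scipy.stats import wilcoxon

acc_method1 = [89.0, 98.5, 74.0, 90.0, 99.5]  # placeholder per-subject accuracies
acc_other = [86.0, 96.0, 72.5, 89.0, 97.0]
stat, p = wilcoxon(acc_method1, acc_other)
print(p)  # a small p-value indicates a significant difference between the methods
```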

Table 10 Results of the Wilcoxon signed-rank test between Method 1 and other methods

In Table 10, the Superior column shows the number of subjects on which our method outperforms the other methods. As Table 10 states, the number of subjects on which Method 1 performs better than [45] and [52] is 5; the number for [29, 43, 46,47,48,49] and [53] is 4; and the number for [44, 50] and [51] is 3. Therefore, the overall results show that Method 1 is better than the twelve state-of-the-art methods.

The experimental results of the Wilcoxon signed-rank test for Method 2 and the other methods are shown in Table 11. The number of subjects on which Method 2 performs better than [45, 47, 49, 52] and [53] is 5, and the number for [29, 43, 44, 46, 48] and [51] is 4.

Table 11 Results of the Wilcoxon signed-rank test between Method 2 and other methods

The results of the Wilcoxon signed-rank test for Method 3 and the other methods are given in Table 12. The number of subjects on which Method 3 performs better than [29, 43, 45,46,47,48,49,50, 52] and [53] is 5, and the number for [44, 51] is 4. Thus, Method 3 is superior to the other state-of-the-art methods, and the overall results show that Method 3 is better than Method 1 and Method 2.

Table 12 Results of the Wilcoxon signed-rank test between Method 3 and other methods

5 Conclusion

In this study, efficient space learning based on the kernel trick and dimension reduction was presented for multichannel motor imagery EEG signals. The dimensionality of the data is increased with the multi-kernel learning method, whose parameters are optimized in this step, and dimension reduction is then used to overcome the curse of dimensionality. The composite kernel is obtained to map the features extracted by CSP; it is a combination of three types of kernels, i.e., RBF, polynomial, and linear, built by the meta-heuristic MKL algorithm. The parameters associated with each base kernel and their weights in the composite kernel were calculated by EO. After data mapping, the dimensionality of the data was reduced to at most ten dimensions; the number of reduced dimensions and the mapping matrix were obtained using EO. The data dimensionality is reduced linearly by the mapping matrix, whose number of columns equals the number of reduced dimensions. Three classifiers, KNN, SVM, and ELM, were selected for the proposed method; the number of hidden layer neurons and the code of the activation function of the ELM classifier were also calculated by EO. The proposed method with ELM has higher accuracy than the other two classifiers. The results indicate the superiority of the proposed method over state-of-the-art methods. This method can be employed for EEG signal classification in other applications, such as epileptic seizure diagnosis, emotion recognition, and other MI signal classification tasks.