1 Introduction

A brain–computer interface (BCI) represents one of the most promising assistive technologies for improving the quality of life of physically impaired individuals: it provides a communication channel between the brain of the patient and the outside world, so that mental activities can be used to coordinate actions taking place in the environment [1]. The electrical activity of the brain, monitored with invasive or noninvasive methods during specific cognitive tasks, is mapped to control commands by the set of pattern recognition algorithms that constitute the brain–computer interface.

BCIs are primarily used as a rehabilitation strategy for patients in a late stage of amyotrophic lateral sclerosis (ALS) or with locked-in syndrome [2]. Other applications of BCI include neuroscience research, control of robotic devices, gaming and virtual reality [3–8]. The sensorimotor cortex in humans is responsible for generating neural activity related to the execution or imagination of movement. When one imagines a movement (i.e., motor imagery), the associated sensorimotor rhythms first become attenuated and then rebound; these two changes are called event-related desynchronization (ERD) and event-related synchronization (ERS), respectively. Hence, the patterns generated by motor imagery can be exploited in a BCI. Figure 1 visualizes the ERD/ERS components.

Fig. 1 a ERD/ERS components in an observed brain signal. b Topographical map of ERD/ERS components during hand-movement motor imagery (the image originally appeared in [9])

A BCI based on motor imagery translates the movement-related oscillatory patterns produced while a subject imagines different tasks (e.g., moving the hands or feet) into control commands [10–12]. The μ (8–12 Hz) and β (13–30 Hz) rhythms originating from the sensorimotor cortex are the ones in which the ERD/ERS phenomena manifest [13–17].

In the frequency domain, a Butterworth band-pass filter is commonly used to discard the undesirable portions of the signal [18–21].

Because EEG signals are non-stationary, traditional feature extraction methods such as the Fourier transform are not well suited to analyzing such data. The wavelet transform is a time-frequency analysis method that decomposes the signal into several scales and allows selected sub-bands to be discarded, which makes it a suitable choice for EEG signal processing [22–25]. Other methods used for feature extraction include CSP, AAR parameters, AR spectral power and principal component analysis (PCA) [26].

CSP is used to filter the channels containing the most informative and discriminative EEG data. However, the raw EEG data used in this study already comprise the three channels considered most informative for a motor-imagery-based BCI (i.e., Cz, C3 and C4 [27]); thus, although CSP is customary in motor-imagery-based BCIs, it was not our first choice for the feature extraction phase. Additionally, for the ensemble system to be productive, a diverse and discriminative feature space should be fed to the experts, and the genetic algorithm feature selection phase requires a sufficiently large number of features to optimize. The wavelet transform combined with statistical measurements generates an adequately large feature set, making it well suited to our study, whereas CSP does not generate as many diverse features as the GA feature selection and the ensemble system require for our dataset.

Ensemble systems can be used to improve on the performance achieved by a single classifier [28]. They improve classification performance by combining the decisions made by different types of classifiers. Several circumstances motivate this technique, such as a small number of training samples or a high-dimensional feature space, both of which make classification difficult for a single classifier [29–31]. Esmaeili [32] reported better EEG classification accuracy using a multiple classifier combination.

Combining procedures can be divided into different categories from a variety of perspectives, three of which are considered here. The first perspective divides combining strategies into classifier selection and classifier fusion. Classifier selection follows a divide-and-conquer approach by assigning a particular part of the problem space to each classifier [33]. On the other hand, in classifier fusion, all classifiers are trained over the entire problem space [31].

From the second viewpoint, two categories are considered: trainable and non-trainable combining strategies. Non-trainable combiners are fixed algebraic rules, such as max, min, average or majority voting, while trainable combination rules, such as stacked generalization, determine their parameters during a learning procedure [29].

Finally, considering how the input data are involved in constructing the ensemble, there are two main types of combining classifiers: static and dynamic. Dynamic techniques choose an ensemble specifically for each sample from a large pool of classifiers [34], as opposed to static ensemble construction methods, which rely on the same set of classifiers for all samples.

In the present study, imagined movements of the left and right hand are first represented discriminatively using the discrete wavelet transform (DWT) and then classified by an ensemble of classifiers. To improve classification, we trained each classifier in the ensemble on different training samples so as to increase the diversity among the classifiers, which also improves the accuracy of the ensemble system.

The organization of this study is as follows: the dataset used for this study is described in Sect. 2. Preprocessing of the raw signal with a Butterworth band-pass filter is described in Sect. 3.1. Feature extraction using the discrete wavelet transform is discussed in Sect. 3.2. The single classifiers utilized are briefly presented in Sect. 3.3. The multiple classifier system used to improve EEG signal classification is described in Sect. 3.4. The experimental results are presented in Sect. 4. Finally, the study is concluded in Sect. 5.

2 EEG data

For this study, dataset III of BCI competition II was used [35]. The dataset was recorded from a normal 25-year-old female subject seated in a relaxing chair. The task assigned to the subject was motor imagery of the left or right hand, presented in random order.

The dataset consists of 280 fixed-length trials of 9 s each. The first two seconds of each trial are quiet. At t = 3 s, an arrow cue pointing left or right is displayed for 1 s, and the subject is required to move a bar in the direction of the cue. A g.tec amplifier and Ag/AgCl electrodes with three bipolar EEG channels, measured over C3, Cz and C4, were used to record the signals.

The EEG was sampled at 128 Hz and subsequently band-pass filtered between 0.5 and 30 Hz. Of the 280 trials, 140 randomly selected trials are reserved for training and the remaining 140 for testing.

3 Methodology

3.1 Preprocessing

To prepare the original signal obtained from each channel for feature extraction, we first extracted the segment from t = 4 s to t = 9 s. We then retained the portion of the signal containing the μ and β frequency bands, in which motor imagery manifests, using a sixth-order Butterworth band-pass filter. Figure 2 shows the raw and preprocessed signal from the C3 channel for one epoch in the frequency domain.

Fig. 2 a Original signal from the C3 channel representing left-hand motor imagery. b Preprocessed signal ready for feature extraction
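
For illustration, a minimal Python/SciPy sketch of this preprocessing step is given below. The 8–30 Hz passband (covering the μ and β bands), the zero-phase filtering and the variable names are assumptions for the example, not details taken from the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 128           # sampling rate of the dataset (Hz), see Sect. 2
LOW, HIGH = 8, 30  # assumed passband covering the mu and beta bands (Hz)

def preprocess(trial, fs=FS, low=LOW, high=HIGH, order=6):
    """Extract t = 4..9 s of one channel and band-pass filter it."""
    segment = trial[4 * fs:9 * fs]                       # 5-second segment
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return filtfilt(b, a, segment)                       # zero-phase filtering

# Example with a synthetic 9-second trial
raw_trial = np.random.randn(9 * FS)
filtered = preprocess(raw_trial)
```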

3.2 Feature extraction

Experimenting with different permutations of the available channels led to the conclusion that using the C3 and C4 channels for feature extraction results in a more discriminative feature space.

We applied the discrete wavelet transform at each stage and decomposed the signal into detail and approximation coefficients, representing the high-frequency and low-frequency components, respectively. We then used the wavelet coefficients at each level as features and reduced the dimension of the feature space by extracting the mean, minimum, maximum and standard deviation from them. Figure 3 illustrates the discrete wavelet transform decomposition process.

Fig. 3 Decomposition by the discrete wavelet transform: h[n] is the high-pass filter, g[n] is the low-pass filter, A is the approximation of the input and D is the detail
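
A minimal sketch of this feature extraction step using PyWavelets follows; the mother wavelet (db4) and the number of decomposition levels (4) are illustrative assumptions, since the section does not fix them.

```python
import numpy as np
import pywt

def dwt_features(signal, wavelet="db4", level=4):
    """Decompose one channel and summarize each coefficient level
    with its mean, min, max and standard deviation."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)   # [cA4, cD4, ..., cD1]
    feats = []
    for c in coeffs:
        feats.extend([np.mean(c), np.min(c), np.max(c), np.std(c)])
    return np.asarray(feats)

# Features from the two selected channels are concatenated per trial
trial_c3 = np.random.randn(640)   # 5 s at 128 Hz, already band-pass filtered
trial_c4 = np.random.randn(640)
feature_vector = np.concatenate([dwt_features(trial_c3), dwt_features(trial_c4)])
```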

3.3 Classification

In our experiment, the following classifiers were employed and evaluated:

3.3.1 K nearest neighbor

The nearest neighbor classifier is a classic classifier and is considered one of the simplest of all. K-nearest-neighbor (KNN) classification is based on finding the closest training samples to an unseen point and assigning it to the dominant class among them. Even though KNN is not an ideal choice for high-dimensional EEG data [36], we chose it to increase the diversity among the base classifiers of our ensemble system. Based on empirical results on the dataset, we concluded that using 13 nearest neighbors yields the best results for this classifier. To calculate the distance between a target sample and the other samples in the feature space, the Euclidean distance measure was used:

$$ d(p,q)= \sqrt{ \sum_{i=1}^{n} (p_{i}-q_{i})^2} $$
(1)

where \(d(p,q)\) is the distance between the samples p and q, \(p_i\) and \(q_i\) are the ith features of p and q, and n is the number of features.
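
A minimal scikit-learn sketch of this classifier, using k = 13 as reported above; the feature arrays are synthetic stand-ins for the wavelet features.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins for the wavelet feature matrices (hypothetical shapes)
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(140, 40)), rng.integers(0, 2, 140)
X_test = rng.normal(size=(140, 40))

# k = 13 follows the empirical choice reported above
knn = KNeighborsClassifier(n_neighbors=13, metric="euclidean")
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
```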

3.3.2 Multilayer perceptron

For our experiment, we wanted to evaluate a non-statistical classifier to compare against the other classifiers. The multilayer perceptron (MLP) fulfills this demand and is simpler to implement than other neural networks [37]. To revise the weights of the neurons, we used the backpropagation algorithm. First, the weights are set randomly, and then the values of the hidden and output layers are calculated:

$$ O=\frac{1}{1+e^{-o}} $$
(2)
$$ Y=\frac{1}{1+e^{-y}} $$
(3)

where O denotes the hidden-layer activations, Y denotes the output-layer activations, and o and y are the corresponding weighted input sums.

$$ Z'=z(1-z) $$
(4)

where z is the sigmoid function and Z′ is its derivative. Considering w1 as the weights between the input and hidden layers, and w2 as the weights between the hidden and output layers, \(\Updelta W\) for the output neurons is calculated:

$$ G=y(1-y)(d-y), \Updelta W=(G * O)*\eta $$
(5)

where d is the desired (target) output and η is the learning rate. Consequently, the value of \(\Updelta W\) for the hidden neurons must be obtained:

$$ G(o)=O(1-O)(w2*G), \quad \Updelta W(O)=(x*G(o))*\eta $$
(6)

Finally, the weights are updated:

$$ w1=w1+\Updelta W(O),w2=w2+\Updelta W $$
(7)
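
As a sketch, an equivalent network can be trained with scikit-learn instead of a hand-written backpropagation loop; the 22 hidden neurons follow the empirical choice in Sect. 4, while the logistic activation, solver and learning rate are assumptions for the example.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(140, 40)), rng.integers(0, 2, 140)
X_test = rng.normal(size=(140, 40))

# One hidden layer with 22 neurons; logistic activation matches the
# sigmoid units of Eqs. 2-3, trained with stochastic gradient descent
mlp = MLPClassifier(hidden_layer_sizes=(22,), activation="logistic",
                    solver="sgd", learning_rate_init=0.01, max_iter=2000)
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
```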

3.3.3 Naive Bayesian

The Bayesian classifier is a simple, classic probabilistic classifier based on Bayes' theorem. The class with the highest posterior probability is selected as the predicted class [38]. The simplicity of this classifier makes it an appropriate baseline against which to evaluate the other classifiers. Its power of rejection, i.e., the capability of the classifier to mark an input sample as unpredictable, makes it useful for dealing with the uncertainty inherent in EEG signals. Another compelling reason for using this classifier is its ability to produce the continuous outputs used for soft-label combining in the ensemble system [30]. The naive Bayesian classifier assumes that the features are independent within each class and predicts the class of an incoming instance X with features \([x_1, \ldots, x_n]\) by finding the class \(C_i\) with the highest posterior probability given X [39]:

$$ P(C_{i}|X)=\frac{P(C_{i})\prod_{j} P(x_{j}|C_{i})}{P(X)} $$
(8)
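
A sketch with scikit-learn's Gaussian naive Bayes, whose predict_proba output provides the posteriors of Eq. 8 that can be reused for soft combining; the data here are synthetic.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(140, 40)), rng.integers(0, 2, 140)
X_test = rng.normal(size=(140, 40))

nb = GaussianNB()
nb.fit(X_train, y_train)
hard_labels = nb.predict(X_test)          # class with the highest posterior (Eq. 8)
soft_labels = nb.predict_proba(X_test)    # posteriors reusable for soft combining
```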

3.3.4 Linear discriminant analysis

Linear discriminant analysis (LDA) is a linear classifier that assumes the two classes are linearly separable [38]. LDA separates the data with a hyperplane obtained by seeking a projection that satisfies the Fisher criterion (i.e., simultaneously maximizing the distance between the class centroids while minimizing the within-class variance) [38]. Its drawback is its linearity, which yields poor results for complex nonlinear data. The within-class scatter matrix S_w and the between-class scatter matrix S_b are defined as:

$$ S_w=\sum_{i=1}^{c}\sum_{x\in C_{i}}(x - \mu_{i})(x - \mu_{i})^{T} $$
(9)
$$ S_b=\sum_{i=1}^{c}(\mu_{i} - \mu)(\mu_{i} - \mu)^{T} $$
(10)

where \(\mu_i\) is the mean of class \(C_i\), μ is the mean of all samples and c denotes the number of classes. We then seek a transformation matrix W that maximizes the between-class scatter while minimizing the within-class scatter. This is achieved when the Fisher criterion is satisfied:

$$ w^{*} = argmax_w \left\{\frac{w^{T} S_{b} w}{w^{T} S_{w} w}\right\} $$
(11)
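
For the two-class case, the Fisher-optimal direction of Eq. 11 reduces to w proportional to S_w^{-1}(μ_1 − μ_0); the NumPy sketch below illustrates this on synthetic data (the small ridge term and the midpoint threshold are choices made for the example).

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, size=(70, 40))   # class 0 samples (synthetic)
X1 = rng.normal(loc=0.5, size=(70, 40))   # class 1 samples (synthetic)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter S_w (Eq. 9), summed over the two classes
Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
# Two-class solution of Eq. 11: w* proportional to S_w^{-1} (mu1 - mu0)
w = np.linalg.solve(Sw + 1e-6 * np.eye(Sw.shape[0]), mu1 - mu0)

threshold = w @ (mu0 + mu1) / 2           # midpoint between projected class means
predict = lambda X: (X @ w > threshold).astype(int)
print(predict(np.vstack([X0[:3], X1[:3]])))
```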

3.3.5 Support vector machine

The support vector machine (SVM), another well-known binary linear classifier, also selects a separating hyperplane, with the distinction that it improves discrimination by maximizing the margin (i.e., the distance from the hyperplane to the nearest training samples, called support vectors) [40, 41]. Margin maximization results in increased generalization capability for unseen data points. The SVM is a good choice for classification in a high-dimensional space and is known for its low sensitivity to overtraining [42]. Linear SVMs realize the large margin (i.e., the optimal hyperplane) by minimizing the cost function below

$$ \frac{1}{2}||w||^{2}+C\sum_{i=1}^{n}{\xi_i} $$
(12)

under the constraints

$$ \begin{aligned} y_i(w^{T} x_{i}+b) & \geq 1 - \xi_{i} \quad \hbox{and}\\ \xi_{i} & \geq 0 \quad\forall i=1,\ldots,n \end{aligned} $$
(13)

where ‖·‖ denotes the Euclidean norm, the \(\xi_i\) are slack variables, b is the bias and C is a regularization parameter. Selecting an appropriate value for C is important, since it controls the trade-off between model complexity and the number of non-separable points.
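
A sketch of a linear SVM with scikit-learn; the value of C here is arbitrary and would be tuned in practice, and probability=True is enabled only to obtain soft outputs for the combiners of Sect. 3.4.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(140, 40)), rng.integers(0, 2, 140)
X_test = rng.normal(size=(140, 40))

# Linear SVM; C controls the trade-off of Eq. 12
svm = SVC(kernel="linear", C=1.0, probability=True)
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
y_soft = svm.predict_proba(X_test)   # continuous support, usable for soft combining
```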

3.4 Classifier ensemble

Several methods exist for creating an ensemble system; we implemented the following:

Bagging: In the bootstrap aggregating (bagging) algorithm, given data containing m training samples, n subsamples of the same size as the original data are drawn with replacement. An instance may appear more than once in a subsample or may not appear at all [43]. The subsamples are used to train weak learners, and a new instance is classified by a vote among the constituent classifiers.
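
A sketch with scikit-learn's BaggingClassifier; the number of learners and the default decision-tree base learner are illustrative choices, not those of the paper.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(140, 40)), rng.integers(0, 2, 140)
X_test = rng.normal(size=(140, 40))

# Bootstrap subsamples of the same size as the training set;
# the default base learner (a decision tree) plays the role of the weak learner
bag = BaggingClassifier(n_estimators=10, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
y_pred = bag.predict(X_test)   # plurality vote of the 10 learners
```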

AdaBoost: AdaBoost is a well-known algorithm for improving the accuracy of weak learners [44]. It maintains a distribution over the training samples and assigns a weight to each classifier, predicting labels by a weighted majority vote. First, AdaBoost assigns a weight to each sample; the initial weight distribution is uniform, and each classifier draws a subset of samples according to this distribution for its learning phase. Next, the weak classifier produces a hypothesis, and its error is calculated (Eq. 14) [29].

$$ \varepsilon_{t} = \sum_{i:\, h_{t}(x_{i}) \neq y_{i}} D_{t}(i) $$
(14)

where \(\varepsilon_t\) is the error of expert t, \(x_i\) is a sample, \(y_i\) is its true label and \(D_t(i)\) is the weight of sample i.

For each classifier, a weight is defined (Eq. 15) and used to update the weight distribution for the next expert (Eq. 16).

$$ \beta_{t} = \frac{\varepsilon_{t}}{1-\varepsilon_{t}} $$
(15)
$$ D_{t+1}(i)=\frac{D_{t}(i)}{z_{t}}\times\begin{cases} \beta_{t}, & h_{t}(x_{i})=y_{i}\\ 1, & \text{otherwise} \end{cases} $$
(16)

where \(z_t\) is a normalization term equal to the sum of the updated weight distribution, \(\beta_t\) is the weight of the expert, \(D_{t+1}(i)\) is the weight of sample i for the next expert and \(h_t\) is the hypothesis of expert t. To predict a test sample, the weighted votes of all experts are collected for each class, and the class receiving the highest total vote is taken as the final decision (Eq. 17).

$$ v_{j}=\sum_{t:\, h_{t}(x)=\omega_{j}} \log \frac{1}{\beta_{t}}, \quad j=1,2,\ldots,C $$
(17)

where \(v_j\) is the total support received by class j and \(\omega_j\) represents class j.
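
A sketch with scikit-learn's AdaBoostClassifier, which implements a closely related reweighting and weighted-vote scheme (SAMME) internally; the number of estimators is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(140, 40)), rng.integers(0, 2, 140)
X_test = rng.normal(size=(140, 40))

# Sample weights are re-estimated after every round (cf. Eqs. 14-16) and the
# final label is a weighted vote of the weak learners (cf. Eq. 17)
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X_train, y_train)
y_pred = ada.predict(X_test)
```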

Behavioral knowledge space (BKS): BKS, proposed by Huang and Suen, uses knowledge about the joint behavior of the classifiers. It is a lookup table with one entry per combination of classifier decisions; with k classifiers and t classes, the number of possible decision combinations (the knowledge space) is t^k [45]. In the training phase, the BKS algorithm fills the knowledge space from the observed combinations of classifier decisions and, for each combination, counts the number of samples belonging to each class; the most frequent class is recorded as the predicted label for that cell. In the test phase, the decisions made by the classifiers index a cell in the knowledge space, and the label stored there is returned as the prediction.
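
A minimal dictionary-based sketch of the BKS table, using hypothetical hard decisions of three classifiers; unseen decision combinations fall back to a default label, a detail not specified in the text.

```python
import numpy as np
from collections import Counter, defaultdict

# Hypothetical hard decisions of T = 3 classifiers on 6 training samples
train_decisions = np.array([[0, 1, 1, 0, 1, 0],    # classifier 1
                            [0, 1, 0, 0, 1, 1],    # classifier 2
                            [1, 1, 1, 0, 0, 0]])   # classifier 3
train_labels = np.array([0, 1, 1, 0, 1, 0])

# Training: for every observed combination of decisions, count the true labels
table = defaultdict(Counter)
for col, label in zip(train_decisions.T, train_labels):
    table[tuple(col)][label] += 1

def bks_predict(decisions, fallback=0):
    """Return the most frequent label stored in this cell of the knowledge space."""
    cell = table.get(tuple(decisions))
    return cell.most_common(1)[0][0] if cell else fallback

print(bks_predict([0, 1, 1]))   # test sample classified by table lookup
```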

Majority voting: The plurality (majority) voting technique gathers the opinion of each classifier, finds the class label reported most often, and chooses that label as the final decision for the incoming test sample. Using the notation from [29], let the opinion of an individual classifier be \(d_{t,j}\,\in\,\{0,1\}\), which indicates support for class \(\omega_j\), with \(t=1,\ldots,T\) and \(j=1,\ldots,C\), where T is the number of classifiers and C is the number of classes. Class \(\omega_J\) is selected as the final decision when Eq. 18 holds.

$$ \sum_{t=1}^{T} d_{t,J}= \max_{j=1}^{C} \sum_{t=1}^{T} d_{t,j} $$
(18)

Weighted majority voting: Since some classifiers perform better than others, their decisions can be weighted so that they have more influence than those of other classifiers. This approach may further improve on the performance obtained by plurality voting. The weight of a classifier can be found in several ways; we used a genetic algorithm with classifier performance as the fitness function to estimate the weights of the classifiers in the ensemble system. Assuming \(w_t\) is the weight of classifier t, class \(\omega_J\) is selected as the final decision by weighted majority voting when Eq. 19 holds.

$$ \sum_{t=1}^{T} w_{t}d_{t,J}= \max_{j=1}^{C} \sum_{t=1}^{T} w_{t}d_{t,j} $$
(19)
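
A minimal numerical sketch of the weighted vote of Eq. 19; the classifier decisions and the GA-derived weights shown here are hypothetical, and setting all weights to one recovers the plain plurality vote of Eq. 18.

```python
import numpy as np

# d[t, j] = 1 if classifier t votes for class j; 5 classifiers, 2 classes
d = np.array([[1, 0],
              [0, 1],
              [1, 0],
              [0, 1],
              [0, 1]])
# Hypothetical GA-derived weights, one per classifier
w = np.array([0.9, 0.6, 0.4, 0.7, 0.8])

weighted_votes = w @ d            # total weighted support for each class
decision = np.argmax(weighted_votes)
print(weighted_votes, decision)   # -> [1.3 2.1] 1
```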

Combining continuous outputs: Classifiers can report continuous outputs that express their degree of support for each class. In our study, we applied several non-trainable algebraic combiners. Consider the notation in Eq. 20, taken from [46].

$$ \mu_{j}(x)=\xi[d_{1,j}(x),\ldots,d_{T,j}(x)] $$
(20)

where each element of the vector is a continuous value representing the support of one classifier for class j given the test sample x. Then, using ξ, which is one of the following functions, \(\mu_j(x)\) is calculated for each class and the class with the largest value is declared the winner. Mean rule: Using this rule, we calculate the average of all classifiers' continuous outputs supporting \(\omega_j\):

$$ \mu_{j}(x)= \frac{1}{T} \sum_{t=1}^{T} d_{t,j}(x) $$
(21)

where 1/T is the normalization factor. Min/Max/Median rule: As the names of these rules imply, we also used the minimum, maximum or median of the classifiers' continuous outputs as the combining function, again choosing the class with the largest value as the winner.

$$ \begin{aligned} \mu_{j}(x)& = \min_{t=1..T} \{d_{t,j}(x)\}\\ \mu_{j}(x)& = \max_{t=1..T} \{d_{t,j}(x)\}\\ \mu_{j}(x)& = \mathop {\text{median}}\limits_{t=1..T} \{d_{t,j}(x)\}\\ \end{aligned} $$
(22)
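
The algebraic rules of Eqs. 21–22 reduce to simple reductions over the stacked soft outputs; the sketch below uses hypothetical support values for a single test sample.

```python
import numpy as np

# Soft outputs D[t, j]: support of T = 5 classifiers for C = 2 classes
# on a single test sample (hypothetical values)
D = np.array([[0.7, 0.3],
              [0.4, 0.6],
              [0.8, 0.2],
              [0.6, 0.4],
              [0.9, 0.1]])

rules = {
    "mean":   D.mean(axis=0),        # Eq. 21
    "min":    D.min(axis=0),         # Eq. 22
    "max":    D.max(axis=0),
    "median": np.median(D, axis=0),
}
for name, mu in rules.items():
    print(name, mu, "-> class", np.argmax(mu))
```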

Decision template: The decision template approach was introduced by Kuncheva [31] for combining the continuous outputs of an ensemble system. It is based on decision profiles: the decision profile of a sample is a matrix whose rows correspond to the classifiers and whose columns hold their soft labels for each class. The decision template of a class is the average of the decision profiles of the training samples belonging to that class:

$$ DT_j=\frac{1}{N_j} \sum_{x_j\in w_j} DP(x_j) $$
(23)

where \(N_j\) is the number of training samples belonging to class j. Denoting by C the number of classes, T the number of experts and \(m_j(x)\) the distance between the decision profile of a test sample x and the decision template of class j, the squared Euclidean distance is computed between the decision template of each class and the decision profile of the test sample (Eq. 24); the class with the minimum distance gives the predicted label [29].

$$ m_{j}(x)=\frac{1}{T \times C} \sum_{t=1}^{T} \sum_{k=1}^{C}{\left(DP_{t,k}(x)-DT_{j}(t,k)\right)}^2 $$
(24)
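
A compact sketch of Eqs. 23–24 under the assumption that the soft outputs of the T experts on the training set are already available; all arrays here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, N = 5, 2, 140                          # classifiers, classes, training samples
profiles = rng.random((N, T, C))             # DP(x): soft outputs per training sample
labels = rng.integers(0, C, N)

# Decision template of each class: mean decision profile of its samples (Eq. 23)
DT = np.stack([profiles[labels == j].mean(axis=0) for j in range(C)])

def dt_classify(dp):
    """Assign the class whose template is closest to the sample's profile (Eq. 24)."""
    distances = ((dp - DT) ** 2).sum(axis=(1, 2)) / (T * C)
    return int(np.argmin(distances))

print(dt_classify(rng.random((T, C))))
```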

Genetic ensemble feature selection: Genetic algorithms (GAs), an evolutionary optimization technique, have proven effective for finding near-optimal feature subsets [47]. Using a GA for ensemble feature selection was first proposed in [48], with the accuracy of the base classifiers as the fitness function. The candidate feature combinations generated in each generation are represented by binary strings (i.e., each bit denotes the absence or presence of a feature). Until a stopping criterion is met, offspring chromosomes are produced from the parents of the previous population and evaluated by the fitness function, gradually converging to a suboptimal solution. Using a genetic algorithm to select features separately for each individual classifier also yields more diverse decisions. A minimal sketch of this procedure is given below.
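
In the sketch, binary chromosomes encode feature masks and the fitness is the cross-validated accuracy of a base classifier (LDA, as an example); truncation selection, one-point crossover and bit-flip mutation, as well as the population size and mutation rate, are assumptions and stand in for whatever operators were actually used.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = rng.normal(size=(140, 40)), rng.integers(0, 2, 140)   # synthetic features

def fitness(mask):
    """Cross-validated accuracy of the base classifier on the selected features."""
    if not mask.any():
        return 0.0
    return cross_val_score(LinearDiscriminantAnalysis(), X[:, mask], y, cv=5).mean()

POP, GENS, P_MUT = 20, 30, 0.05
population = rng.integers(0, 2, (POP, X.shape[1])).astype(bool)

for _ in range(GENS):
    scores = np.array([fitness(ind) for ind in population])
    parents = population[np.argsort(scores)[::-1][:POP // 2]]   # truncation selection
    children = []
    for _ in range(POP - len(parents)):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, X.shape[1])                       # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        child ^= rng.random(X.shape[1]) < P_MUT                 # bit-flip mutation
        children.append(child)
    population = np.vstack([parents, children])

best = population[np.argmax([fitness(ind) for ind in population])]
print("selected features:", np.flatnonzero(best))
```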

4 Experimental results

Table 1 summarizes the classification errors obtained by the groups that participated in BCI competition II using dataset III; each row gives the classification error of one group. Table 2 displays the recognition errors of the classification systems we implemented. Using the genetic algorithm for feature selection together with weighted majority voting yields the best recognition rate.

Table 1 BCI competition II, dataset III results
Table 2 Obtained results from classification experiments. Using weighted majority voting yields the best results

To choose the optimal number of neighbors for the KNN classifier, we evaluated several values, as shown in Fig. 4.

Fig. 4 Various numbers of nearest neighbors were evaluated; using 13 nearest neighbors yields the best results

Additionally, after several experiments, we found the number of hidden-layer neurons for the MLP that yields the best recognition rate; Fig. 5 shows the recognition rate for different numbers of hidden neurons. Furthermore, as mentioned earlier, using the C3 and C4 channels for feature extraction improves the recognition rate. Table 3 lists different permutations of channel selection and their recognition rates for the different classifiers.

Fig. 5 Using 22 hidden-layer neurons yields improved performance

Table 3 Recognition rates for various channel selections

To present the classification results and compare the single classifiers with the proposed method, we used confusion matrices (Table 4). A confusion matrix is a square matrix containing information about the actual and predicted labels assigned by a classification system.

Table 4 Confusion matrix of the base classifiers and the proposed multiple classifier system

5 Conclusion

This contribution presented several approaches for the classification of EEG signals based on the ERD/ERS phenomena. We reduced the dimensionality of the recorded data and extracted features using the wavelet transform. Because EEG signals are vulnerable to noise, the feature vector may contain noisy and useless features, so a feature selection method such as a genetic algorithm is beneficial for finding the optimal feature space. Ultimately, weak classifiers, when combined, can form an ensemble system that increases the precision of classifying motor imageries arising from the brain.