1 Introduction

Brain–computer interface (BCI) research has grown steadily since the 1970s. One goal of BCI research is to develop systems capable of classifying neural representations of natural movement planning and execution. Researchers aim to build systems that help disabled persons communicate with the external world and provide a non-muscular pathway between their brains and prosthetic devices such as robotic limbs, wheelchairs, or spellers. Such applications are not limited to prosthetic devices but also extend to virtual gaming, tele-operation, communication, and robotics (McFarland and Wolpaw 2008; Daly and Wolpaw 2008).

However, BCI faces several challenges: (i) Small training sets; the training sets are sometimes relatively small because the training process is constrained by usability concerns. Long training sessions are time-consuming and demanding for the subjects, yet the subject's signals recorded in the training phase are needed to train the classifier. Therefore, a significant challenge in designing a BCI is to balance the trade-off between the technological complexity of classifying the user's brain signals and the amount of training needed for successful operation of the interface. (ii) Nonlinearity; the brain is a highly complex nonlinear system in which disordered behavior of neural ensembles can be detected. Thus, EEG signals are better characterized by nonlinear dynamic methods than by linear methods. (iii) Non-stationarity and noise; the recorded signals change continuously over time, both between and within recording sessions. The mental and emotional state of the subject across sessions, fatigue, and concentration levels all contribute to EEG signal variability. Noise is also a major contributor to the non-stationarity problem, as it includes unwanted signals caused by changes in electrode placement and by environmental interference. (iv) High dimensionality; signals are recorded from multiple channels to preserve high spatial accuracy. Because the amount of data needed to properly describe different signals increases exponentially with the dimensionality of the feature vectors, various feature extraction methods have been proposed. They play an important role in identifying distinguishing characteristics, so that classifier performance depends on a small number of distinctive features rather than on the whole recorded signal, which may contain redundancy (Abdulkader et al. 2015).

The basic steps of a BCI include acquisition of brain signals, preprocessing, feature extraction, and classification. The decisions generated by the employed classifier can then be used to control an external device. The electroencephalogram (EEG) is a popular noninvasive signal acquisition technique that allows BCI systems to measure the electrical potentials of the brain at a temporal resolution on the order of milliseconds through electrodes placed on the surface of the scalp. EEG caps with 6 to 64 electrodes are most commonly used (some setups use many more electrodes, e.g., 256), so the dimension of the feature space is often very large and contains redundant features. This not only creates additional overhead in managing the space complexity but may also include outliers, thereby reducing classification accuracy (Rakotomamonjy et al. 2005).

Feature selection is a subarea of dimensionality reduction that aims to identify the best subset of features out of the original feature space. In BCI applications, principal component analysis (PCA) (Yu et al. 2014), independent component analysis (ICA) (Guo et al. 2013), sequential forward search (SFS) (Pal et al. 2014), and particle swarm optimization (PSO) (Hsu 2013) have been used for feature selection to reduce the dimensionality of the data. After feature extraction and reduction, classification algorithms serve two functions, in training and in the practical application of a BCI. During training, the task is to infer a mapping between signals and classes using the labeled feature vectors produced by the feature extraction module. During the application of the BCI, the task is to discriminate between different types of neurophysiologic signals and translate them into commands, thereby allowing control of the BCI.

A number of widely used classifiers, such as linear discriminant analysis (LDA), K nearest neighbor (KNN) algorithms, support vector machines (SVM), decision trees, the Naive Bayes (NB) classifier, and neural networks (NN) (Lotte et al. 2007), have been used as BCI classifiers. Linearity is the main limitation of LDA, which can lead to poor outcomes (Lotte et al. 2007). SVM has a slower execution speed but good generalization properties (Lotte et al. 2007). On the other hand, KNN assigns an unseen data sample to the dominant class among its K nearest neighbors in the training set. KNN may fail in some BCI experiments due to its sensitivity to the curse of dimensionality; however, it performs efficiently with low-dimensional feature sets (Lotte et al. 2007). The NB classifier is based on Bayes' theorem, with a strong assumption of independence among the features, and it is more suitable for BCI applications with a small number of trials.

This paper introduces a new fuzzy-based classification strategy (FBCS) for brain–computer interfaces. FBCS includes novel techniques for feature reduction and electrode selection to reduce the dimensionality of the data. Accordingly, both training time and response time are minimized, which makes FBCS suitable for real-time applications that need a quick response. In addition, FBCS relies on a fuzzy inference system for the classification task. To accomplish this task, a new instance of KNN, called fuzzified KNN (FKNN), is introduced. Besides its high classification accuracy, FKNN has a salient property that traditional KNN lacks: resistance to overfitting. The reason is that FKNN adds several classification heuristics beyond the K nearest neighbors, namely the distance among items in the feature space as well as the degree to which an item belongs to a class. These heuristics are merged via a fuzzy inference system, and accordingly FKNN provides accurate classification decisions. FKNN has been compared against recent classification techniques applied to BCI. Experimental results show that FKNN outperforms those techniques, giving not only the maximum classification accuracy and sensitivity but also the minimum response time. This paper is organized as follows: Section 2 gives an overview of EEG-based BCI systems and their main parts. Section 3 reviews previous work on dimensionality reduction and BCI classification techniques. Section 4 introduces the proposed fuzzy-based classification strategy (FBCS). Section 5 discusses the experimental results. Finally, the conclusion of our work is presented in Sect. 6.

2 General scheme of EEG-based BCI

Figure 1 illustrates the basic principle of an EEG-based BCI. Initially, signals from the brain are acquired. Generally, there are three methods to acquire (capture) signals that represent the electrical activity of the human brain: (i) invasive, (ii) partially invasive, and (iii) noninvasive. Invasive capture provides high-quality signal readings but causes great inconvenience and risks to human health. Partially invasive capture provides lower-quality signals with lower health risk. Noninvasive capture, on the other hand, is fully external to the body, more convenient, easy to use, provides good-quality signal capture, and presents no risk to users.

Fig. 1 General scheme of EEG-based BCI

Although there are many methods to detect brain signals, EEG acquisition systems have relatively short time constants, can function in most environments, and require relatively simple and inexpensive equipment, offering the possibility of a new non-muscular communication and control channel. The EEG signal is acquired with the help of a multi-channel headset at a certain sampling rate.

EEG (electroencephalography) is the most popular noninvasive brain signal acquisition tool, being the cheapest and simplest recording technique. However, it has a low signal-to-noise ratio (SNR) due to environmental noise and artifacts caused by muscle and eye movements. The EEG system contains electrodes, amplifiers, an A/D converter, and a recording device, which may be a personal computer or similar. The electrodes acquire the signal from the scalp; the amplifiers enlarge the amplitude of the analog EEG signals (scalp signals are in the microvolt range) so that the A/D converter can digitize the signal accurately. The recording device then stores and displays the data. The digitized signal can be analyzed to extract commands that control a computer or another device.

Applications include spelling, computer mouse control, and prosthesis or robot control. Generally, BCI can be applied in several domains: it allows paralyzed people to control prosthetic limbs with their mind; visual images can be transmitted to the mind of a blind person, allowing them to see; and auditory data can be transmitted to the mind of a deaf person, allowing them to hear. From another point of view, BCI allows gamers to control video games with their minds, and it can allow a mute person to have their thoughts displayed and spoken by a computer. Finally, feedback is provided to the user for further interaction. An improvement in just one of these steps can improve the performance of a BCI system.

3 Related work

The main target of this paper is to introduce a new classification strategy that enhances the classification performance of BCI systems using the concept of dimensionality reduction. Hence, some recent efforts in applying dimensionality reduction and classification techniques to BCI applications are presented in this section.

Principal component analysis (PCA) is a widely used linear transformation technique for dimensionality reduction. However, the projections it finds maximize variance, which is not necessarily related to classification performance (the class labels), so it is not particularly useful in classification and pattern recognition applications. Linear discriminant analysis (LDA) attempts to overcome this limitation of PCA by finding linear projections that maximize class separability under a Gaussian distribution assumption (Fukunaga 1990). The LDA projections are optimized based on the means and the covariance matrices of the classes, which are not descriptive of an arbitrary probability density function (pdf). Independent component analysis (ICA) has also been used as a tool to find linear transformations that maximize the statistical independence of random variables; however, it has similar drawbacks to PCA. Common spatial patterns (CSP) can be used instead of PCA and ICA (Naeem et al. 2009).

Atyabi et al. (2012) introduced electrode reduction (ER) and feature reduction (FR) methods based on genetic algorithms (GA) and particle swarm optimization (PSO). The evolution-based methods generate a set of indexes representing either electrode seats or feature points that maximize the output of a weak classifier, and a comparison is made between GA, PSO, and a random search algorithm as electrode and feature reduction methods. The results indicate that, on average across all subjects and across GA-based ER, GA-based FR, random-based ER, random-based FR, and PSO-based FR, electrode reduction (ER) had a greater impact on classification performance than feature reduction (FR), and the combination of polynomial SVM with GA-based ER performed better than all other methods except the combination of the full electrode set with polynomial SVM.

A sparse common spatial pattern (SCSP) algorithm was proposed in Arvaneh (2011) to select the smallest number of channels within a constraint on classification accuracy. To select channels using the SCSP method, two sparse common spatial filters corresponding to two motor imagery tasks are first obtained. After obtaining the sparse filters, channels corresponding to the zero elements in both spatial filters are discarded, and the rest are defined as the selected channels. To compare and weigh the importance of each selected channel, a ranking method was proposed: the top-ranked channels for each motor imagery task are determined from the maximum of the absolute values of the corresponding sparse spatial filter. The SCSP algorithm yielded an average improvement of 10% in classification accuracy compared to the use of three channels.

The multi-objective particle swarm optimization (MOPSO) method proposed in Hasan et al. (2009) addresses the problem of effective channel selection for brain–computer interface (BCI) systems. The proposed method was tested and compared to another search-based method, sequential floating forward search (SFFS). The results demonstrate the effectiveness of MOPSO in selecting a smaller number of channels with an insignificant sacrifice in accuracy, which is very important for building robust online BCI systems.

Muhammad et al. (2015) presented a comparison of commonly used classification algorithms with a new unsupervised learning technique for classification, namely neural-network-based self-organizing maps (SOM). SOM and the other algorithms were used to categorize the feature vectors acquired from the EEG dataset into their corresponding classes. Both the original and reduced feature sets were used for classification of motor imagery-based EEG signals, with the reduction performed by principal component analysis (PCA). The measured data showed that SOM achieves a maximum classification accuracy of 84.17% on the PCA-reduced feature set.

Nanayakkara and Sakkaff (2012) presented a new classification method closely related to the K nearest neighbor (KNN) method, named the fixed distance neighbor (FDN) classifier. For comparison purposes, the performance of the KNN and FDN methods was tested on the same feature vectors derived from EEG datasets recorded for imagery motor movement mental tasks. FDN performed slightly better than KNN for most of the datasets used in that study, indicating that FDN is a viable classification method that can be used in place of KNN in BCI systems.

The authors in [17] used a combination of bacterial foraging optimization and learning automata to determine the best subset of features from a given motor imagery electroencephalography (EEG)-based BCI dataset. They employed the discrete wavelet transform to obtain a high-dimensional feature set and classified it with a distance likelihood ratio test. This feature selector produced an accuracy of 80.291% in 216 s. On the other hand, Zanchettin et al. (2012) presented a hybrid KNN-SVM method for cursive character recognition. The main idea was to increase the K nearest neighbor recognition rate, which is sensitive to different classes with similar attributes, by using the SVM as a decision classifier: the two most frequent classes among the K nearest neighbors are determined, and the SVM decides between these two classes. The main disadvantage is the processing time.

The advantages of self-organizing map (SOM) artificial networks and KNN were explored in Silva and Del-Moral-Hernandez (2011), where KNN performs the classification and the SOM works as a preprocessor for the KNN classifier, applied to digit recognition in car plates. The main advantage of this method is that the time consumed by SOM-KNN is shorter than that consumed by KNN alone. Finally, a review of several BCI techniques for signal acquisition, preprocessing or signal enhancement, feature extraction, classification, and the control interface was given in Nicolas-Alonso and Gomez-Gil (2012), presenting their advantages, drawbacks, and latest advances.

Fig. 2 Block diagram of the feature reduction and EEG electrode selection BCI model

4 The proposed fuzzy-based classification strategy (FBCS)

This section illustrates the proposed fuzzy-based classification strategy (FBCS) in detail. The different steps of FBCS are depicted in Fig. 2. As illustrated in Fig. 2, the proposed FBCS consists of seven sequential steps, namely (i) data acquisition, (ii) preprocessing, (iii) feature extraction, (iv) feature selection, (v) dimensionality reduction, (vi) classification, and (vii) decision making for choosing a certain action. However, FBCS mainly focuses on: (iv) feature selection, to acquire a set of compact and informative features, (v) dimensionality reduction, to minimize processing time, and (vi) classification, to take the corresponding precise decisions. We claim that giving more attention to these steps will not only improve the performance of the BCI system but also greatly reduce its computational load. The next subsections explain these steps in more detail.

4.1 Data acquisition

Data acquisition can be accomplished through an EEG cap. Figure 3 shows a general view of an EEG cap fitted with electrodes. These electrodes are placed according to the standard 10/20 electrode placement system. An EEG cap with 22 electrodes at a 250 Hz sampling rate from Dataset 2a of BCI competition IV, provided by the BCI research group at Graz University (Brunner et al. 2008), and an EEG cap with 118 electrodes at a 1000 Hz sampling rate from Dataset IVa of BCI competition III (http://www.bbci.de/competition/iii/desc_IVa.html) are used as data acquisition components to evaluate the proposed FBCS.

Fig. 3 An EEG cap fitted with electrodes, used in BCI data acquisition

4.2 Preprocessing

Generally, the purpose of signal preprocessing is to enhance the signal produced by the EEG. Unfortunately, EEG recordings are highly challenging to evaluate due to the noise recorded together with the EEG signal, non-stationarity, and diverse artifacts. Artifacts are irrelevant, unwanted signals present in the BCI system. They have various origins, including power line (utility frequency) noise, body movements, and eye blinks. As the noise amplitude is usually larger than the signal of interest, the goal of preprocessing is to increase the signal-to-noise ratio (SNR) of the signal acquired from the EEG headset (Mallick and Kapgate 2015). Figure 4 shows the original signal before and after filtering to illustrate the effect of noise.

The digital EEG signal is stored electronically and can be filtered. Filtering can be applied either in the frequency domain, by selecting different pass bands, or in the spatial domain. Frequency filtering removes noise, for example by filtering out the direct current component and high-frequency noise (keeping 1–45 Hz). Frequency filtering can also select relevant frequency components, such as the sensorimotor (mu) rhythm at 8–12 Hz. The goal of spatial filtering is to create a subset of EEG channels related to a certain brain activity and to enhance the separability of the data; the choice of spatial filter can affect the SNR greatly. Bipolar derivation, Laplacian derivation, principal component analysis (PCA), independent component analysis (ICA), and common spatial patterns (CSP) analysis are alternative methods for deriving weights for a linear combination of EEG channels (Jung et al. 2000). In FBCS, EEG signals were band-pass filtered from 8 to 30 Hz, covering the mu (8–13 Hz) and beta (13–30 Hz) rhythms, which are used for classifying motor imagery data.
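
As an illustration of this preprocessing step, the following minimal Python sketch applies an 8–30 Hz band-pass filter to multi-channel EEG. The fourth-order Butterworth design and zero-phase filtering are assumptions, since the paper does not specify the filter implementation.

import numpy as np
from scipy.signal import butter, filtfilt

def bandpass_eeg(eeg, fs, low=8.0, high=30.0, order=4):
    """Band-pass filter EEG of shape (channels, samples) sampled at fs Hz."""
    nyq = 0.5 * fs
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    # Zero-phase filtering avoids shifting the EEG waveforms in time.
    return filtfilt(b, a, eeg, axis=-1)

# Example: 22 channels, 6 s of data at 250 Hz (as in Dataset 2a).
filtered = bandpass_eeg(np.random.randn(22, 6 * 250), fs=250)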

Fig. 4 EEG produced signals

4.3 Feature extraction

The goal of feature extraction is to represent the characteristics of the original signal without unwanted redundancy. Features can be extracted from the EEG signal in two different domains: time domain features (TDF) and frequency domain features (FDF).

Unlike the Fourier transform, which provides frequency domain analysis at a constant resolution on the frequency scale, the discrete wavelet transform (DWT) provides frequency domain as well as time domain analysis at multiple resolutions. Frequency domain analysis is mainly based on the power and coherence of each frequency band in the EEG signals, with spectral power estimation as its primary means. Time domain analysis mainly examines geometric properties of the EEG waveforms, such as amplitude, mean, and variance; it is widely used by EEG researchers for its intuitiveness and clear physical meaning (Zhao et al. 2015).

In this paper, the DWT is used. Signals are passed through filters with different cutoff frequencies at different scales; the number of filter stages (decomposition levels) depends on the required resolution. The feature vector is built from the detail coefficients of the third and fourth levels (D3 and D4) for each electrode, because these levels contain information in the frequency ranges of 8–12 and 16–24 Hz. Considering a headset of 14 electrodes as an example with \(S=84\) samples per electrode, the resulting feature matrix has dimension \(14\,\times \,84\), i.e., \(M=1176\) elements.
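
The following minimal Python sketch illustrates this feature extraction step: each electrode's signal is decomposed with the DWT and only the D3 and D4 detail coefficients are kept. The 'db4' mother wavelet and the signal length are illustrative assumptions; the paper does not name the wavelet.

import numpy as np
import pywt

def dwt_features(eeg, wavelet="db4", levels=4):
    """eeg: array (electrodes, samples); returns one feature row per electrode."""
    rows = []
    for channel in eeg:
        coeffs = pywt.wavedec(channel, wavelet, level=levels)
        # coeffs = [A4, D4, D3, D2, D1]; keep only D3 and D4.
        d4, d3 = coeffs[1], coeffs[2]
        rows.append(np.concatenate([d3, d4]))
    return np.vstack(rows)

features = dwt_features(np.random.randn(14, 640))  # 14-electrode illustration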

Fig. 5 Proposed feature reduction methodology

4.4 Feature reduction

Feature reduction methods aim to identify a subset of 'meaningful' features out of the original set of features. Feature reduction has several advantages, such as (i) avoiding overfitting, because the classification model is trained with the most precise and informative features, (ii) improving performance, and (iii) minimizing processing time, which makes the model more suitable for real-time applications (Saleh et al. 2016; Saleh and Abulwafa 2017). Generally, feature reduction methods can be subdivided into filter, wrapper, and embedded methods (Saleh and Abulwafa 2017). Filter methods compute a score for each feature based on its information content and then select only the features with the best scores. Wrapper methods train a predictive model on subsets of features and select the subset that gives the best accuracy. Finally, embedded methods determine the optimal subset of features directly from the trained weights of the classification method.

In this paper, we propose a new filter approach for feature reduction; the goal is to determine a single feature value for the repeated values of the same element across trials. For n trials, there are n feature matrices, i.e., n values for each element of the feature matrix of dimension \(M(E\,\times \,S)\); for the above-mentioned example, the matrix M of dimension (\(14\,\times \,84\)) is repeated n times for the same action. We therefore propose a feature reduction phase that yields only one nearly equivalent value for each element.

As shown in Fig. 5, the value of the feature \(x_{i,j}\) theoretically remains constant over the n trials of the same action. However, due to the effect of many factors, such as personal feelings, fatigue, happiness, or sadness, the value of any element \(x_{i,j}\) may differ from trial to trial. By following the next steps, a single value is obtained for each element, constructing one feature vector for the n trials of one action.

Fig. 6 Projecting the considered values on the numbering axis

For each element in the matrix \((x_{i,j})\), the following steps should be followed:

  • Step 1 represent the n values of the element \((x_{i,j})\) on the linear axis shown in Fig. 6 as \(x_{1},x_{2} ,{\ldots },x_{n}\), which are normally assumed to be nearly identical.

  • Step 2 calculate the average value \(\mu \) of the n values of the element \(x_{i,j}\), as:

    $$\begin{aligned} {\mu }=\frac{\mathop \sum \nolimits _{i=1}^n x_{i}}{n} \end{aligned}$$
    (1)
  • Step 3 find the set of values of x in the neighborhood of \(\mu \) after determining the neighborhood width as:

    $$\begin{aligned} \hbox {NW}= \frac{{X_\mathrm{max}}-{X_\mathrm{min}}}{2} \end{aligned}$$
    (2)

    Then select the set

    $$\begin{aligned} S_{1}=\{x_{i} \mid \mu - \hbox {NW}< x_{i}<\mu + \hbox {NW}\} \end{aligned}$$
    (3)
  • Step 4 repeat Step 2 and Step 3 up to a pre-defined number of times \((\xi )\), taking the approximate value of the element \(x_{i,j}\) as the average of the items \(\in {S}_{\xi }\), or stopping earlier if a single value of \(x_{i,j}\) is reached before the \((\xi )\) iterations are completed.

These steps are followed for all elements, so that after n trials the data are reduced to a single matrix of M elements. For illustration, suppose the system has 20 trials using 14 EEG electrodes and each electrode's signal has been sampled to 84 samples; there are then 20 feature matrices, each of dimension 14\(\,\times \,\)84 (1176 elements), and within each matrix each element \( x_{i,j} \) has 20 theoretically identical values. Table 1 shows an example of the \(x_{i,j}\) values from the 20 trials. The goal is to obtain a single feature matrix with a definite value for each element \(x_{i,j}\). First, the average of the given 20 values is determined, which equals 0.537. Then, the neighborhood width defined in Eq. (2) is calculated, which equals 0.395. Using Eq. (3), a list \(S_{1}\) of the given feature values lying inside the selected range is picked. Repeating the same procedure on \(S_{1}\), a new set of feature values \(S_{2}\), with fewer items than \(S_{1}\), is obtained. Assuming \(\xi =10\), the value of \(x_{i,j}\) is represented by 0.44 from \(S_{8}\) after 8 iterations. Repeating this procedure for the remaining elements of the feature matrix, the result is a single matrix representing the 20 trials after removing the outlier values.
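
A minimal Python sketch of Steps 1–4 for a single element is given below; the function and variable names are illustrative and do not appear in the paper.

import numpy as np

def reduce_element(values, xi=10):
    """values: the n trial values of one feature element; returns one value."""
    s = np.asarray(values, dtype=float)
    for _ in range(xi):
        if s.size <= 1:
            break
        mu = s.mean()                               # Step 2: average of current set
        nw = (s.max() - s.min()) / 2.0              # Eq. (2): neighborhood width
        kept = s[(s > mu - nw) & (s < mu + nw)]     # Eq. (3): neighborhood of the mean
        if kept.size == 0 or kept.size == s.size:
            break                                   # no further shrinkage possible
        s = kept
    return s.mean()                                 # Step 4: representative value

print(reduce_element(np.random.normal(0.5, 0.1, size=20)))  # 20-trial illustration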

Table 1 An example of the proposed feature reduction method

4.5 Electrode selection

Electrode selection focuses on using the electrodes and scalp locations that best represent the subject's intention and contribute most to classification accuracy. Different subjects may react differently to the tasks, and the optimal electrode set for specific tasks may also vary among subjects. In this paper, a new approach for electrode selection is introduced. Semantic analysis (Ogiela and Ogiela 2012) is used to determine the most informative set of electrodes for better classification accuracy and to eliminate the electrodes that may harm the average accuracy of a multi-action BCI system.

To reduce the dimensions of the feature matrix \(F_{E \times S}\), in which E is the number of electrodes (14 in the above example) and S is the number of samples (84 in the above example), the proposed strategy chooses the most informative set of electrodes IS with respect to their effect on the classification accuracy of the underlying action.

Initially, a simple classifier such as the extreme learning machine (ELM) (Geetha and Geethalakshmi 2011) is used to determine the accuracy for a certain action \(A_{1}\), i.e., the classification accuracy for the input matrix \(\hbox {F}_{\mathrm{ExS}}\) (e.g., \(\hbox {F}_{14 \times 84})\), considering the features from all electrodes. This accuracy is denoted Acc and is calculated by Eq. (4).

$$\begin{aligned} {\text {ACC}}=\frac{\sum _{i=1}^M {n_{{\textit{ii}}}}}{N} \end{aligned}$$
(4)

where the numerator represents the number of correctly classified samples and the denominator N represents the total number of samples. Then, the classification accuracy when using only one electrode \(E_{i}\) is calculated; if the accuracy decreases, \(E_{i}\) is added to the bad-effect set of electrodes, denoted B, which reduces the accuracy, whereas if using the features of \(E_{i}\) increases the accuracy, it is added to the informative set of electrodes IS. After that, the classification accuracy when using the two electrodes \(E_{i}\) and \(E_{i+1}\) is determined; if the accuracy decreases, the electrode \(E_{i+1}\) is added to the bad-effect set B and the classification accuracy using another electrode \(E_{i+2}\) together with \(E_{i}\) is determined; otherwise, \(E_{i+1}\) is added to the informative set IS. This is repeated for all remaining electrodes. Finally, there are two sets of electrodes representing the informative electrodes IS and the bad-effect electrodes B for a certain action. Hence, each action has a set of electrodes E (e.g., \(E=\{E_{1},E_{2},E_{3} ,{\ldots }, E_{n}\}\)) and a most informative selected set of electrodes \({\text {IS}}\subset E\); the remaining electrodes are considered the bad-effect set for that action, \(B=E-{\text {IS}}\). The bad effect (BE) on the accuracy can be determined as the difference between the best accuracy and the accuracy obtained when each electrode of the bad-effect set is used. Similarly, for every informative electrode, its goodness effect (G) can be determined as the difference between the accuracy of the system without using this electrode and the accuracy after using it.
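
A minimal Python sketch of this greedy electrode evaluation for a single action is given below. A scikit-learn KNN classifier with cross-validation stands in for the ELM classifier used in the paper, which is an assumption; only the accept/reject logic follows the description above.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def select_electrodes(X, y, clf=None):
    """X: array (trials, electrodes, samples); y: action label per trial."""
    clf = clf or KNeighborsClassifier()

    def accuracy(electrodes):
        feats = X[:, electrodes, :].reshape(len(X), -1)
        return cross_val_score(clf, feats, y, cv=5).mean()

    all_e = list(range(X.shape[1]))
    best = accuracy(all_e)             # baseline: accuracy with all electrodes
    informative, bad = [], []
    for e in all_e:
        acc = accuracy(informative + [e])
        if acc > best:                 # electrode improves accuracy -> informative set IS
            informative.append(e)
            best = acc
        else:                          # electrode hurts accuracy -> bad-effect set B
            bad.append(e)
    return informative, bad, best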


The above-mentioned procedure determines the mutual effect of the system's electrodes on the accuracy of recognizing each action and should be repeated for the remaining actions (e.g., if the system is designed to classify four actions, the same procedure is repeated for \(A_{2}, A_{3}, A_{4})\). Considering four actions, the result is the sets of electrodes that improve the accuracy of each action, \({\text {IS}}_{1}, {\text {IS}}_{2}, {\text {IS}}_{3},{\text {IS}}_{4}\), and the sets of bad-effect electrodes \(B_{1}, B_{2}, B_{3}, B_{4}\). Furthermore, since the system is designed to classify several actions, the IS and B sets may differ according to the underlying action.

To solve this issue, we select the common set of electrodes that guarantees the maximum classification accuracy of the whole classification system (i.e., over all considered actions). Electing the most suitable set of electrodes is a real challenge; it is accomplished here through a rule inference methodology and a semantic analysis algorithm. The importance of each electrode to the whole system can be determined by analyzing the contents of all the resulting sets semantically and answering the following questions:

  • If this electrode exists in all informative sets for all actions:

    • \(\checkmark \) So it can be considered as an informative electrode for all the system.

  • If this electrode exists in a pre-defined number of informative sets for some actions and has only a small bad effect on the accuracy of the other actions:

    • \(\checkmark \) So it can be considered as an informative electrode for all the system.

  • If this electrode exists in all bad effect sets for all actions:

    • \(\checkmark \) So it can be considered as a bad effect electrode for all the system.

  • If this electrode exists in a pre-defined number of informative sets for some actions but has a large, non-negligible bad effect on the accuracy of the other actions:

    • \(\checkmark \) So it can be considered as a bad effect electrode for all the system.

  • Otherwise, a comparison between the goodness effect of this electrode on some actions and its bad-effect degree on the other actions is made to determine its overall effect on all the system's actions and its importance for the system.

Hence, a set of election rules can be derived and then applied whenever needed to elect the most suitable set of electrodes to represent the data for the whole classification system. This reduces the dimension of the employed dataset as well as the response time of the classification system. The election rules are listed in Table 2, and the electrode election methodology is given in Algorithm 2.
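
The following minimal Python sketch illustrates how such election rules can be applied to the per-action sets; the data layout, threshold logic, and function name are illustrative assumptions rather than the exact rules of Table 2.

def elect_common_electrodes(IS, B, G, BE, electrodes):
    """IS, B: dicts action -> set of electrodes; G, BE: dicts (action, electrode) -> score."""
    selected = []
    for e in electrodes:
        good_for = [a for a in IS if e in IS[a]]
        bad_for = [a for a in B if e in B[a]]
        if not bad_for:                      # informative for every action -> keep
            selected.append(e)
        elif not good_for:                   # bad effect for every action -> discard
            continue
        else:
            # Mixed case: weigh total goodness against total bad effect.
            gain = sum(G.get((a, e), 0.0) for a in good_for)
            loss = sum(BE.get((a, e), 0.0) for a in bad_for)
            if gain > loss:
                selected.append(e)
    return selected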

Algorithm 2 Electrode election methodology
Table 2 Rules of electrode election methodology
Table 3 Selection for the most informative set of electrodes for the first action
Table 4 Effect (G—goodness/BE—bad effect) of each electrodes for the first action

Illustrative example

For illustration, consider a system that uses 6 EEG electrodes to classify four actions, with a trial duration of 6 s and each electrode's signal sampled at 5 samples/s. There are then four feature matrices, each of dimension 6\(\,\times \,\)30, i.e., 180 elements per matrix. The goal is to distinguish the most informative set of electrodes for the 4 actions. The extreme learning machine (ELM) classifier (Geetha and Geethalakshmi 2011) is used as a simple learning algorithm on the training set to select the set of electrodes that achieves the best accuracy.

As illustrated in Table 3, the classification accuracy using all electrodes for the first action is 89.1%. It is slightly improved when using only features from \(E_{1}\), and adding the features from \(E_{2}\) to those from \(E_{1}\) gives another slight improvement. Repeating the calculation of the classification accuracy after adding new electrodes to the previous set shows that the system achieves its maximum classification accuracy of 91.5% using the set of electrodes \({\text {IS}}_{1}= \{E_{1},E_{2},E_{4},E_{5}\}\), while adding electrodes \(E_{3}\) and \(E_{6}\) reduces the classification accuracy. Accordingly, they constitute the bad-effect set, i.e., \(B_{1}=\{E_{3},E_{6}\}\).

From Table 3, it is easy to determine the goodness (G) or bad effect (BE) of each electrode according to the promotion or demotion of the system's performance (i.e., classification accuracy), as depicted in Table 4. The same procedure is repeated to determine the most informative and the bad-effect sets of electrodes for the other three actions; the results are presented in Table 4.

Table 5 Selected electrodes on the four actions

It is clear from Table 5 that \(E_{1}\) and \(E_{5}\) are good electrodes for all actions; hence, they should be used in the classification of the four actions. On the other hand, \(E_{3}\) is a common bad-effect electrode for all actions, so it is discarded with no effect on the classification accuracy. \(E_{2}\) is informative for three actions and has only a small bad effect on \(A_{2}\), so it can be retained as an informative electrode for the whole system. Each of the electrodes \(E_{4}\) and \({E}_{6}\) has a bad effect on two of the four actions, so their goodness and bad effects are compared; from Table 4, it can be concluded that their goodness effect is slight while their bad effect cannot be neglected, hence the decision is to discard them. Finally, the common informative set of electrodes for the given classification system consists of only \(\{E_{1},E_{2},E_{5}\}\) instead of the six electrodes. Accordingly, the input feature matrix for this classification system has dimension 3\(\,\times \,\)30 instead of 6\(\,\times \,\)30, which reduces the number of input elements to half of the original data.

4.6 Classification and decision making

The performance of a BCI is measured by its classification accuracy (Aydemir and Kayikcioglu 2013). To guarantee online operation, the classifier must be fast enough to perform real-time classification of the EEG signals. Accordingly, several issues must be considered carefully: (i) classification should be robust with respect to outliers, since neurophysiologic signals may contain several outliers as well as artifacts; (ii) the employed classification technique should have as low a computational complexity as possible, since data in a BCI system must be processed in real time; and (iii) classification should provide confidence (prediction) levels as a natural basis for combining information obtained from different sources.

The K nearest neighbors (KNN) classifier is a classical supervised method in the field of machine learning. It is based on statistical data and is widely used in many areas such as text classification, pattern recognition, and image processing. The decision rule of the KNN algorithm is to find the K nearest (most similar) training samples in the feature space and then assign the test sample to the majority class among its K nearest neighbors (Zhao and Chen 2016). KNN performance depends on two factors: (i) the assigned value of K, which represents the number of considered neighbors, and (ii) the employed distance metric. The most commonly used distance between a test sample and a specified training sample is the Euclidean distance (Zhao and Chen 2016), which can be computed by Eq. (5).

$$\begin{aligned} {\text {Dist}(X, Y)} = \sqrt{\mathop \sum \nolimits _{{i=1}}^{n} \left( {{x}_{{i}} -{y}_{i}} \right) ^{2}} \end{aligned}$$
(5)

where Dist(X, Y) is the Euclidean distance between a test sample X and a specified training sample Y with features \((1,2,\ldots ,n)\), \(x_{i}\) represents the features of the test sample X, \(y_{i}\) represents the features of the specified training sample Y, and n is the total number of features.

The main advantage of KNN is that it can easily deal with problems in which the number of classes is more than two. In addition, KNN allows adding examples to the training dataset without retraining the classifier. The work in this paper extends the main concept of KNN to choose the appropriate action. Accordingly, a new instance of KNN with enhanced characteristics is produced via a set of additional parameters, which are introduced through the following definitions.

Definition 1

Distance To Center (DTC) \({\text {DTC}}_{i}\) is defined as the distance from the testing item to the center of the class corresponding to the ith action.

Definition 2

Inverse Belonging Degree (IBD) \(\text {IBD}_{i}\) is defined as the average distance from the K nearest neighbors of the testing item to the center of the class corresponding to the ith action.

Definition 3

Number of the nearest neighbors (NNN) \(\text {NNN}_i\) is defined as the number of nearest neighbors of the class corresponding to the ith action for the testing item.

As depicted in the above definitions, one of these additional parameters is the distance between the input test signal and the center point of each action's training data, \({\text {DTC}}_{i}\) (where i is the corresponding action's identifier). Moreover, the average distance between each of the selected nearest neighbor points in the training dataset related to a certain action and the center point of that action is considered as a new parameter called the inverse belonging degree \(({\text {IBD}}_{i})\). The inverse belonging degree affects the choice of the appropriate action because it helps to remove the outliers with the highest \({\text {IBD}}_{i}\). A fuzzy inference system is used to combine these parameters to formulate the suitable decision; hence, the new classification strategy is called the fuzzy-based classification strategy (FBCS). Initially, during the training phase, the centers of the underlying classes are calculated, denoted \(c_{1}, c_{2}, c_{3}, c_{4}\) for the four considered actions \((A_{1}, A_{2}, A_{3},{\textit{ and }}A_{4})\), as depicted in Fig. 7.

During the testing phase, the distances between the unknown (unclassified) input item, expressed as a feature matrix, and the pre-classified training items are determined first, and the K nearest training items are selected. Then the pre-mentioned parameters are calculated to classify the input item, namely the number of nearest neighbors (NNN), the distance to center (DTC), and the inverse belonging degree (IBD). Generally, the larger the number of nearest items related to the ith action (\({\text {NNN}}_{i}\)), the higher the probability that the unknown item is associated with that action. Likewise, the smaller the distance from the tested item to the center of the ith class (\({\text {DTC}}_{i}\)), the higher the probability that the unclassified item is related to that action. Moreover, the inverse belonging degree of the input item to the ith class, which is the average distance between the selected nearest neighboring points of the training dataset related to that action and the center point of that action's training set, is calculated by Eq. (6). The smaller the inverse belonging degree to the ith class (\({\text {IBD}}_{i}\)), the higher the probability that the input item is related to that class (action).

$$\begin{aligned} {\text {IBD}}_{i}=\frac{\sum _{j=1}^{{\text {NNN}}_{i}} {\text {ds}_{j}}}{{\text {NNN}}_{i}} \end{aligned}$$
(6)

where \({\text {NNN}}_{i}\) is the number of selected nearest points associated with the action \(A_{i}\), and \({\text {ds}}_{j}\) is the distance between each selected nearest training point associated with the action \(A_{i}\) and the center point of the training set for that action. The three parameters \({\text {NNN}}_{i}\), \({\text {DTC}}_{i}\), and \({\text {IBD}}_{i}\) for each action i are treated as three different fuzzy sets. A proposed fuzzy inference system is then employed to predict the weight of each action \((W_{i})\), i.e., how appropriate it is for the unknown input features. The fuzzy inference system is realized through three steps, namely (i) fuzzification of the inputs, (ii) fuzzy rule induction, and finally (iii) defuzzification.
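
The following minimal Python sketch computes the three fuzzy inputs for a test item, following Definitions 1–3 and Eqs. (5)–(6); the variable names and the use of NumPy are illustrative assumptions.

import numpy as np

def fuzzy_inputs(x, X_train, y_train, k=10):
    """Return {action: (NNN, DTC, IBD)} for a test item x."""
    centers = {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}
    dists = np.linalg.norm(X_train - x, axis=1)       # Euclidean distance, Eq. (5)
    neighbours = np.argsort(dists)[:k]                # K nearest training items
    params = {}
    for c, center in centers.items():
        idx = neighbours[y_train[neighbours] == c]    # neighbours belonging to class c
        nnn = len(idx)                                # Definition 3
        dtc = np.linalg.norm(x - center)              # Definition 1
        # Definition 2 / Eq. (6): mean distance of those neighbours to the class centre.
        ibd = np.linalg.norm(X_train[idx] - center, axis=1).mean() if nnn else np.inf
        params[c] = (nnn, dtc, ibd)
    return params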

Fig. 7 An example of a new unknown feature vector to be classified in a system with training features for four actions, together with the proposed fuzzy parameters

Fig. 8 Membership functions for the considered fuzzy sets

(i) Fuzzification

Generally, \({\text {NNN}}_{i}\), \({\text {DTC}}_{i}\), and \({\text {IBD}}_{i}\) are considered as three different fuzzy sets. During the fuzzification step, the inputs are transformed into degrees of membership in the linguistic terms 'low' and 'high' of the corresponding fuzzy set. A membership function then provides the similarity degree of the considered input to the corresponding fuzzy set; the result is a value between 0.0 (non-membership) and 1.0 (full membership). The membership functions used for the considered fuzzy sets are depicted in Fig. 8, and the values of \(\alpha \) and \(\beta \), assuming \(K=10\), are given in Table 6.
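
A minimal sketch of the fuzzification step is given below; the exact membership shapes of Fig. 8 are not reproduced, and a simple linear transition between the thresholds \(\alpha \) and \(\beta \) is assumed.

def fuzzify(value, alpha, beta):
    """Return membership degrees (mu_low, mu_high) of `value`, each in [0, 1]."""
    if value <= alpha:
        mu_high = 0.0
    elif value >= beta:
        mu_high = 1.0
    else:
        mu_high = (value - alpha) / (beta - alpha)   # assumed linear transition
    return 1.0 - mu_high, mu_high

# Example: fuzzify NNN_i = 4 with hypothetical thresholds alpha = 2, beta = 8.
print(fuzzify(4, alpha=2, beta=8))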

(ii) Fuzzy rule induction

After fuzzification, the result is introduced as the input to the fuzzy rule base. The considered rules are of the form: if (A is x) AND (B is y) AND (C is z) ... THEN (O is m), where A, B, and C represent the input variables (i.e., \({\text {NNN}}_{i}\), \({\text {DTC}}_{i}\), and \({\text {IBD}}_{i})\), x, y, and z represent the corresponding linguistic terms ('Low' or 'High'), O represents the rule output, and m represents the corresponding linguistic term ('Low', 'Medium', or 'High'). There are 8 rules, listed in Table 7 (where L refers to 'Low,' H to 'High,' and M to 'Medium'). For clarification, the first rule in Table 7 indicates that if NNN(action i) is Low AND DTC(action i) is Low AND IBD(action i) is Low THEN the Output is Medium.

As illustrated in Giarratano and Riley (2004), four methods for fuzzy rule inference are available: max–min, max-product, sum-dot, and drastic product. The max–min method is used in this paper; it applies the min operator for the conjunction in the rule premise and for the implication function, while the max operator is used for aggregation. Consider the case of two items of evidence per rule.


Thus, the max–min composition inference rule would be:

$$\begin{aligned} \mu _Y= & {} \overbrace{\text {max}}^{\text {aggregation}} \left[ \underbrace{\text {min}}_{\text {implication}} \left( {\mu _{X_{j1} } ,\mu _{X_{j2} } } \right) \,\forall \,j\in \left\{ {1,2,3,\ldots ,N} \right\} \right] \nonumber \\ \end{aligned}$$
(7)

This produces

$$\begin{aligned} \mu _Y= & {} {\text {max}}\left[ {\text {min}}\left( {\mu _{X_{11} } ,\mu _{X_{12} }} \right) ,\,{\text {min}}\left( {\mu _{X_{21} } ,\mu _{X_{22} } } \right) ,\,\ldots \ldots ,\right. \nonumber \\&\,\left. \qquad {\text {min}}\left( {\mu _{X_{N1} } ,\mu _{X_{N2} } } \right) \right] \end{aligned}$$
(8)

where \(\mu _{x}\) is the value of membership function associated with each fuzzy parameter, N is the number of Fuzzy rules, and \(\mu _{Y}\) represents the output membership value of fuzzy rule induction step.
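
The max–min composition of Eq. (8) can be sketched as follows; the rule encoding is an illustrative assumption, loosely following Table 7.

def max_min_inference(rules, memberships):
    """rules: list of (antecedent_terms, output_term); memberships: dict
    (variable, term) -> degree. Returns output_term -> aggregated degree."""
    out = {}
    for antecedents, output_term in rules:
        strength = min(memberships[a] for a in antecedents)          # implication (min)
        out[output_term] = max(out.get(output_term, 0.0), strength)  # aggregation (max)
    return out

# First rule of Table 7: NNN low AND DTC low AND IBD low -> output Medium.
rules = [((("NNN", "L"), ("DTC", "L"), ("IBD", "L")), "M")]
memberships = {("NNN", "L"): 0.7, ("DTC", "L"): 0.4, ("IBD", "L"): 0.9}
print(max_min_inference(rules, memberships))   # {'M': 0.4}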

(iii) Defuzzification

The output of the fuzzy rules, after applying them to the fuzzified inputs, is then defuzzified. Defuzzification is a transformation from a space of fuzzy actions into a space of non-fuzzy (crisp) ones. As depicted in Saleh et al. (2015), the most commonly used defuzzification techniques are the max-criterion, the mean of maxima, and the center of gravity (COG). In COG, the weighted average of the area bounded by the membership function curve is computed as the crisp value of the fuzzy quantity, as illustrated in Eq. (9). Defuzzification is accomplished using the output membership function illustrated in Fig. 9.

$$\begin{aligned} {\text {COG}}=\frac{\sum _i \mu \left( W_{i} \right) *W_{i}}{\sum _i \mu \left( W_{i} \right) } \end{aligned}$$
(9)

So, the fuzzified KNN classifier algorithm can be written as the following steps:

  • Determine the values of the three fuzzy input parameters \({\text {NNN}}_{i}\), \({\text {DTC}}_{i}\), and \({\text {IBD}}_{i}\) for each action i.

  • Apply the fuzzy rules to them, then determine the output membership values \((\mu _{\text {Low}}, \mu _{{\text {High}}}, \mu _{\text {Medium}})\) according to Eq. (8).

  • Plot the output membership values \((\mu _{{\text {Low}}}, \mu _{{\text {High}}}, \mu _{{\text {Medium}}})\) on the output membership function graph (Fig. 9).

  • Finally, determine the area under the resulting curve to obtain the weight of each action as a candidate for the unknown input feature according to Eq. (9), and select the action with the highest weight (a sketch of these steps is given after this list).
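
A minimal Python sketch of the defuzzification and action-selection steps is given below. The triangular output sets and their centers are assumptions, since the exact output membership function of Fig. 9 is not reproduced, and the sampled-area computation approximates the COG of Eq. (9).

import numpy as np

def centre_of_gravity(mu_low, mu_medium, mu_high, n=201):
    w = np.linspace(0.0, 1.0, n)                     # candidate output weights W_i
    # Assumed triangular output sets centred at 0.0, 0.5 and 1.0.
    low = np.clip(1.0 - 2.0 * w, 0.0, 1.0)
    med = np.clip(1.0 - 2.0 * np.abs(w - 0.5), 0.0, 1.0)
    high = np.clip(2.0 * w - 1.0, 0.0, 1.0)
    # Clip each output set by its inferred degree and aggregate with max.
    agg = np.maximum.reduce([np.minimum(low, mu_low),
                             np.minimum(med, mu_medium),
                             np.minimum(high, mu_high)])
    return float((agg * w).sum() / max(agg.sum(), 1e-12))   # Eq. (9)

# The action whose defuzzified weight is largest is the classifier's decision.
weights = {"A1": centre_of_gravity(0.1, 0.6, 0.3), "A2": centre_of_gravity(0.5, 0.4, 0.0)}
print(max(weights, key=weights.get))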

Table 6 Assigned values of \(\upalpha \) and \(\upbeta \)
Table 7 Used fuzzy rules for the fuzzified KNN classifier

5 Results and discussion

As the proposed fuzzy-based classification strategy (FBCS) mainly contributes feature reduction, to acquire a set of compact and informative features, electrode selection, to minimize processing time, and classification, to take the corresponding precise decisions, this section evaluates the proposed FBCS against previously published approaches for feature reduction, electrode selection, and classification in BCI. Informedness (Powers 2003), which was also used to evaluate feature selection and electrode reduction techniques in Atyabi et al. (2012), is used to assess performance; it is more informative than accuracy because it takes both sensitivity and specificity into account, as shown in Eq. (10).

$$\begin{aligned} \hbox {Informedness}=\frac{{\text {TP}}}{{\text {TP}}+{\text {FN}}} +\frac{{\text {TN}}}{{\text {TN}}+{\text {FP}}}-1 \end{aligned}$$
(10)

where TP is the number of true positives (positive samples correctly predicted), TN is the number of true negatives (negative samples correctly predicted), FP is the number of false positives (negative samples incorrectly predicted as positive), and FN is the number of false negatives (positive samples incorrectly predicted as negative).

Fig. 9 Output membership function for defuzzification

Fig. 10 Time line of EEG signal acquisition

Fig. 11 Average informedness result achieved on the testing set with SVM using feature reduction techniques

Moreover, the performance metrics (i) classification accuracy (CA), (ii) sensitivity (SE), (iii) specificity (SP), and (iv) computational time (CT) are also used in the comparison of the classifiers, as in Geetha and Geethalakshmi (2011); their calculating equations are given in Eqs. (11)–(13).

$$\begin{aligned}&\hbox {CA}=\frac{{\text {TP}}+{\text {TN}}}{{\text {TP}}+{\text {TN}}+{\text {FP}}+{\text {FN}}}\times 100 \end{aligned}$$
(11)
$$\begin{aligned}&\hbox {SE}=\frac{{\text {TP}}}{{\text {TP}}+{\text {FN}}}\times 100 \end{aligned}$$
(12)
$$\begin{aligned}&\hbox {SP}=\frac{{\text {TN}}}{{\text {TN}}+{\text {FP}}}\times 100 \end{aligned}$$
(13)
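
For reference, the following minimal Python sketch computes the metrics of Eqs. (10)–(13) from the binary confusion counts; the example counts are made up.

def bci_metrics(tp, tn, fp, fn):
    informedness = tp / (tp + fn) + tn / (tn + fp) - 1   # Eq. (10)
    ca = (tp + tn) / (tp + tn + fp + fn) * 100           # Eq. (11)
    se = tp / (tp + fn) * 100                            # Eq. (12)
    sp = tn / (tn + fp) * 100                            # Eq. (13)
    return informedness, ca, se, sp

print(bci_metrics(tp=40, tn=35, fp=5, fn=10))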

Higher accuracies reflect better decoding (prediction) of class information from the EEG data features. To obtain statistically significant conclusions, a Wilcoxon rank-sum test (Sheskin 2003), a nonparametric statistical test of whether two independent random samples come from the same distribution, was performed to compare the accuracies achieved by the proposed algorithm with those achieved by the other competitors, owing to the simplicity of this statistic. In statistical hypothesis testing, the p value is the probability, given a statistical model and assuming the null hypothesis is true, of obtaining a result at least as extreme as the one actually observed. The computation is based on the ranks of the samples, and a significance threshold of 0.05 was applied: if the p value is very low (\(p<\) 0.05), the null hypothesis is rejected and the result is considered significant; otherwise, the null hypothesis is retained. The null hypothesis of this statistical test assumes equivalent performance of all the competitor algorithms.
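
A minimal sketch of this significance test, using SciPy's rank-sum implementation on two illustrative (made-up) sets of accuracies, is given below.

from scipy.stats import ranksums

acc_proposed = [0.86, 0.88, 0.85, 0.87, 0.89]   # made-up per-fold accuracies
acc_baseline = [0.80, 0.79, 0.82, 0.81, 0.78]
stat, p_value = ranksums(acc_proposed, acc_baseline)
print("p =", round(p_value, 4), "significant" if p_value < 0.05 else "not significant")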

5.1 Employed datasets

In order to evaluate the proposed FBCS strategy, we carried out experimental testing using two well-known BCI datasets. The first, labeled Dataset D1, is Dataset 2a of BCI competition IV provided by the BCI research group at Graz University (Brunner et al. 2008). Recorded from 9 healthy subjects, the dataset consists of four different motor imagery (MI) tasks, namely left hand, right hand, both feet, and tongue, recorded on different days from each subject. The dataset comprises two sessions, each with six runs separated by short breaks. Every run includes 48 trials (12 trials for each task); therefore, there are 288 trials in total per session. For a single trial, the subjects sat in a comfortable armchair in front of a computer screen. As shown in Fig. 10, at the beginning of a trial \((t=0\) s), a fixation cross appeared on the black screen and a short warning tone was presented. After 2 s \((t=2\) s), a cue in the form of an arrow pointing either to the left, right, down, or up (corresponding to one of the four classes: left hand, right hand, foot, or tongue) appeared and stayed on the screen for 1.25 s. This prompted the subjects to perform the desired MI task. No feedback was provided. The subjects were asked to carry out the motor imagery task until the fixation cross disappeared from the screen at \(t=6\) s, followed by a short break during which the screen was black again. Table 8 gives a short description of the essential parameters of dataset D1.

On the other hand, the second employed dataset is labeled Dataset D2, which is Dataset IVa of BCI competition III (http://www.bbci.de/competition/iii/desc_IVa.html). It contains 3.5 s cues for the following three motor imageries that the subject should perform: (L) left hand, (R) right hand, and (F) right foot. The motor imagery tasks were performed by five healthy subjects over 280 trials. Even though the visual cue was presented to the subjects for 3.5 s, the first and last 0.5 s can be ignored, since they represent the transition time during which the subject changes state from non-task to task and vice versa. Therefore, it is reasonable to omit the first and last 0.5 s and use only the middle 2.5 s for the classification task. As shown in Table 9, this dataset contains EEG recordings from 118 channels over 280 trials of 2.5 s each, with a 1000 Hz sampling rate. The experiments were carried out in MATLAB R2015b on a computer with 4 GB memory and an Intel Core i3 2.53 GHz processor.

Table 8 Table of parameters of dataset D1
Table 9 Table of parameters of dataset D2

The two datasets (i.e., D1 and D2) were used to test the proposed feature reduction (FR), the proposed electrode selection (ES), and the proposed fuzzified KNN classifier against previously published approaches, as discussed in the next subsections.

5.2 Experimental results

This section evaluates the contributions introduced in this paper in detail. Three different experiments are presented. The first experiment is carried out on dataset D2, for simplicity, to assess the impact of the proposed feature reduction (FR); a comparison is made with feature reduction methods based on genetic algorithms (GA), particle swarm optimization (PSO), and random search introduced in Atyabi et al. (2012), and with principal component analysis (PCA) as applied in Muhammad et al. (2015). The comparison shows the impact on SVM performance of using the full feature set versus the reduced sets produced by the feature reduction methods.

The second experiment is also carried out on dataset D2 to assess the impact of the proposed electrode selection (ES); a comparison is made with electrode selection methods based on genetic algorithms (GA) and random search introduced in Atyabi et al. (2012). The comparison shows the impact on SVM performance of using the full electrode set versus the reduced sets produced by the electrode selection methods.

Finally, the third experiment assesses the impact of the proposed fuzzified KNN classifier. A comparison is made on the two datasets D1 and D2 with six classifiers, namely K nearest neighbor (KNN) (Lotte et al. 2007), support vector machines (SVM) (Lotte et al. 2007), linear discriminant analysis (LDA) (Lotte et al. 2007), Naive Bayes (NB) (Lotte et al. 2007), decision tree (DT) (Lotte et al. 2007), and self-organized map (SOM) (Muhammad et al. 2015), using their built-in MATLAB implementations with the competitors' default parameters. As in Geetha and Geethalakshmi (2011), we used the leave-one-out cross-validation (LOOCV) technique to estimate the most appropriate KNN and SVM parameters, avoiding the problems of random selection, as it selects the parameters that provide the highest average performance metrics. Table 10 lists the values of the control parameters specific to each of the employed competitors.

Table 10 Values of the control parameters specific to each of the employed competitor

5.2.1 Evaluating the proposed feature reduction methodology

This experiment uses a 10*20 cross-validation (CV) to assess the impact of the proposed feature reduction (FR). A comparison is made with feature reduction methods based on genetic algorithms (GA), particle swarm optimization (PSO), and random search, as presented in Atyabi et al. (2012), and with principal component analysis (PCA) as applied in Muhammad et al. (2015), for each of the 5 subjects of the D2 dataset. The comparison shows the impact on SVM performance of using the full feature set versus the reduced sets produced by the feature reduction methods.

Table 11 Averaged informedness result achieved on testing set with SVM using feature reduction techniques

Table 11 and Fig. 11 show the impact of the above-mentioned feature reduction techniques through the averaged informedness achieved on the testing set with SVM; the proposed FR yields superior results in most cases. The p values obtained through the Wilcoxon rank-sum statistical test between the best algorithm and each of the competitors are listed in Table 12. Values with \({p\;{\text {value}}}<0.05\) indicate that the differences are statistically significant. The results listed in Tables 11 and 12 clearly indicate the superiority of the proposed FR, in a statistically significant fashion, over most of the competitors. However, the statistical tests for subjects AV and AW indicate no significant difference of the proposed FR over PCA, as their corresponding p values are 0.052 and 0.057, respectively.

Table 12 p values obtained through the Wilcoxon rank-sum statistical test for the best algorithm—proposed FR—versus each of the competitors for each subject within the dataset
Fig. 12 Average informedness result achieved on the testing set with SVM using electrode selection techniques

Generally, the results indicate that, on average across all subjects (AA, AL, AV, AW, AY), informedness improves slightly from 0.482 to 0.49 using the proposed FR technique. Since feature reduction alone yields only a modest gain, the next experiment examines the impact of electrode selection.

5.2.2 Evaluating the proposed electrode selection methodology

As in experiment 1, experiment 2 evaluates the impact of the proposed electrode selection (ES) on the D2 dataset. A comparison is made with electrode selection methods based on genetic algorithms (GA) and the random search algorithm introduced in Atyabi et al. (2012). The comparison contrasts SVM performance on the full electrode set with performance on the reduced set obtained after applying each electrode selection method.

Table 13 Averaged informedness result achieved on testing set with SVM using electrode selection techniques
Table 14 p values obtained through the Wilcoxon rank-sum statistical test for the best algorithm—proposed ES—versus each of the competitors for each subject within the dataset

Table 13 and Fig. 12 report the averaged informedness achieved on the testing set with SVM for the above-mentioned electrode selection techniques. The results indicate that, averaged across all subjects, the informedness improves from 0.482 to 0.522 with the proposed ES technique. According to Table 14, where the p values obtained through the Wilcoxon rank-sum statistical test between the best algorithm and each of the competitors are displayed, the proposed ES is statistically significantly better than the remaining competitors. However, the statistical test indicates no significant difference between the proposed ES and GA for subject AL, and an insignificant advantage of ES over random search for subject AY, as their corresponding p values are 0.0506 and 0.0602, respectively.

Comparing these results with those of experiment 1, it is clear that electrode selection has a greater impact on classification performance than feature reduction. The next experiment evaluates the impact of combining the proposed feature reduction (FR) and electrode selection (ES) with the proposed FKNN and the other classifiers.

5.2.3 Evaluating the proposed fuzzified KNN classifier

Finally, to assess the impact of the proposed fuzzified KNN classifier \((K=10)\) on the two datasets (i.e., D1 and D2), a comparison is made against the K-nearest neighbor (KNN) (Lotte et al. 2007), support vector machines (SVM) (Lotte et al. 2007), linear discriminant analysis (LDA) (Lotte et al. 2007), naive Bayes (NB) (Lotte et al. 2007), decision tree (DT) (Lotte et al. 2007), and self-organizing map (SOM) (Muhammad et al. 2015) classifiers, using their built-in MATLAB implementations with the default parameter values listed in Table 10. The performance metrics defined in Eqs. (11)–(13) are evaluated, giving the following results.
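For readers unfamiliar with fuzzified KNN, the sketch below shows a generic Keller-style formulation with K = 10, in which class memberships are obtained by inverse-distance weighting of the nearest training trials. It is an illustration only; the exact membership function used by the proposed FKNN may differ.

```python
# Hypothetical sketch of a fuzzified KNN (K = 10): class memberships are
# obtained by inverse-distance weighting of the K nearest training trials.
# This follows the generic Keller-style formulation, not necessarily the
# paper's exact membership function.
import numpy as np

def fknn_predict(X_train, y_train, X_test, k=10, m=2.0, eps=1e-9):
    classes = np.unique(y_train)
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)           # Euclidean distances
        nn = np.argsort(d)[:k]                            # K nearest neighbors
        w = 1.0 / (d[nn] ** (2.0 / (m - 1.0)) + eps)      # fuzzy weights
        memberships = np.array([w[y_train[nn] == c].sum() for c in classes])
        memberships /= memberships.sum()                  # normalize to [0, 1]
        preds.append(classes[np.argmax(memberships)])     # defuzzify by max
    return np.array(preds)

# Toy usage with random stand-in data:
rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(100, 8)), rng.integers(0, 4, size=100)
X_te = rng.normal(size=(5, 8))
print(fknn_predict(X_tr, y_tr, X_te))
```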

Figure 13 illustrates the average classification accuracy (CA), i.e., the percentage of trials in the test set that are classified correctly, for the competing classifiers as well as FKNN. Both datasets (i.e., D1 and D2) are used in their original form without dimensionality reduction; note that D2 (118 electrodes) is of higher dimensionality than D1 (22 electrodes). As illustrated in the figure, FKNN has the highest average classification accuracy (78% for D1 and 74% for D2), while KNN shows the worst average classification accuracy (70.1% for D1 and 60.8% for D2), as it is very sensitive to the curse of dimensionality; this explains why KNN algorithms are not very popular in the BCI community. On the other hand, Fig. 14 presents the average classification accuracy for the same set of classifiers when the proposed feature reduction and electrode selection methodologies are applied. Clearly, using feature reduction and electrode selection improves the average accuracy of all classifiers, owing to the removal of outlier features and electrodes with a detrimental effect. Moreover, FBCS still has the highest accuracy (85.7% for D1 and 87.8% for D2).

Fig. 13 Average classification accuracy without using dimensionality reduction

Fig. 14 Average classification accuracy with the proposed feature reduction and electrode selection

Again, D1 and D2 are used to measure the average classification sensitivity (SE, also called the true positive rate) for the competing classifiers as well as FKNN in two scenarios. In the first scenario, SE is measured for all classifiers with no dimensionality reduction, while in the second scenario, both the proposed feature reduction and electrode selection methodologies are applied. Figure 15 shows SE for all classifiers under the first scenario. FKNN achieves the highest average classification sensitivity on both datasets, precisely 76.5% for D1 and 74% for D2, and, as with average classification accuracy, KNN shows the worst sensitivity, with averages of 70.1% for D1 and 60.8% for D2, the higher-dimensional dataset. On the other hand, under the second scenario, illustrated in Fig. 16, SE improves for all classifiers. The highest SE is given by the proposed FKNN, namely 85.1% for D1 and 84.2% for D2. Note that the KNN classifier gives acceptable sensitivity results once feature reduction and electrode selection are applied (82% for D1 and 80.5% for D2), since the dimensionality of the datasets has been reduced.

Fig. 15 Average classification sensitivity without using dimensionality reduction

Fig. 16 Average classification sensitivity with the proposed feature reduction and electrode selection

Fig. 17 Average classification specificity without using dimensionality reduction

Fig. 18 Average classification specificity with the proposed feature reduction and electrode selection

In this experiment, the target is to measure the average classification specificity (SP, also called the true negative rate) for FKNN and the competing classifiers. As illustrated in Fig. 17, with no dimensionality reduction, the proposed FKNN outperforms all other classifiers in terms of SP, achieving 79% for D1 and 88% for D2. When both the proposed feature reduction and electrode selection methodologies are applied, SP improves for all classifiers, as illustrated in Fig. 18. For FKNN, which achieves the highest SP, the calculated values were 89.9% for D1 and 90.2% for D2.
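For reference, the three quantities plotted in Figs. 13–18 can all be derived from a confusion matrix; for more than two classes, sensitivity and specificity are commonly macro-averaged over the classes in a one-vs-rest fashion. The sketch below follows that convention, which is an assumption about how the averages were computed.

```python
# Hypothetical sketch: accuracy, macro-averaged sensitivity (TPR) and
# specificity (TNR) from a confusion matrix, one-vs-rest per class.
import numpy as np
from sklearn.metrics import confusion_matrix

def ca_se_sp(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    total = cm.sum()
    ca = np.trace(cm) / total                        # classification accuracy
    se, sp = [], []
    for c in range(cm.shape[0]):
        tp = cm[c, c]
        fn = cm[c, :].sum() - tp
        fp = cm[:, c].sum() - tp
        tn = total - tp - fn - fp
        se.append(tp / (tp + fn))                    # per-class sensitivity
        sp.append(tn / (tn + fp))                    # per-class specificity
    return ca, float(np.mean(se)), float(np.mean(sp))

# Toy usage with placeholder labels for a 4-class problem:
y_true = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
y_pred = [0, 1, 2, 3, 0, 2, 2, 3, 1, 1]
print(ca_se_sp(y_true, y_pred))
```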

To ensure that the proposed FKNN is suitable for real-time operation, it is essential to measure the testing time. The average computational time for training and testing the different classifiers on the two BCI competition datasets without dimensionality reduction is depicted in Fig. 19. It can be seen that KNN and FKNN had the shortest training times, with averages over the two datasets of 0.04 s and 0.05 s, respectively. This is expected, as KNN is a lazy learner: most of the computation is deferred to the testing phase. SVM had the longest training time, as it performs classification by constructing the separating hyperplane in a multidimensional space that maximizes the margin. For the testing phase, SVM and FKNN had nearly the same average computational time (CT), 0.44 s and 0.5 s, respectively. The testing-phase CT of FKNN is greater than that of KNN because it carries out additional membership computations on top of simple KNN.
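Timing the two phases separately is straightforward to instrument, as sketched below for a lazy learner (KNN) against SVM. The classifiers, data sizes, and timing utility are illustrative, and absolute numbers depend on hardware and implementation.

```python
# Hypothetical sketch: timing the training and testing phases separately.
# Classifiers and data sizes are stand-ins; absolute times are hardware-dependent.
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def time_phases(clf, X_train, y_train, X_test):
    t0 = time.perf_counter(); clf.fit(X_train, y_train)
    t1 = time.perf_counter(); clf.predict(X_test)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1          # (training CT, testing CT) in seconds

rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(280, 118)), rng.integers(0, 2, size=280)
X_te = rng.normal(size=(280, 118))
for name, clf in [("KNN", KNeighborsClassifier(n_neighbors=10)),
                  ("SVM", SVC(kernel="rbf"))]:
    train_ct, test_ct = time_phases(clf, X_tr, y_tr, X_te)
    print(f"{name}: train {train_ct:.3f} s, test {test_ct:.3f} s")
```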

Fig. 19 Average classification computational time without using dimensionality reduction

Fig. 20 Average classification computational time with the proposed feature reduction and electrode selection

Table 15 p values obtained through the Wilcoxon rank-sum statistical test for the best algorithm—proposed FBCS—versus each of the competitors for each dataset

On the other hand, when the dimensionality reduction techniques are used, the average computational time (CT) of the training phase increases for all classification algorithms, since the feature reduction (FR) and electrode selection (ES) computations are carried out during training; even the training CT of the simple KNN and FBCS classifiers rises to about 5 s, and the SVM classifier still has the highest average training CT of roughly 30 s, as illustrated in Fig. 20. In the testing phase, the average classification CT decreases because of the reduced dataset dimensions. The main objective of the proposed strategy is to reduce the classification time, especially in the testing phase, to meet the needs of real-time BCI applications. As discussed earlier, FR and ES need additional time to select the best subsets of features and electrodes, but there is no pressing need to reduce the computational time of the training step. For newly captured data to be tested, the selected features from the selected electrodes are classified directly against the reduced training set, with no further feature reduction or electrode selection calculations. KNN has the shortest testing time, with a CT of 0.01 s, but taking accuracy into account, FBCS remains the fastest adequate strategy in the testing phase on both datasets, with a CT of about 0.018 s to decide what is expressed by the data captured from the EEG headset.
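The key point, that no FR/ES computation is repeated at test time, amounts to storing the selected electrode and feature indices during training and simply indexing into each incoming trial before classification, as sketched below. The variable names and the trials × electrodes × features layout are assumptions.

```python
# Hypothetical sketch: at test time only the stored indices are applied,
# so no feature reduction or electrode selection is recomputed.
# The (trials x electrodes x features-per-electrode) layout is an assumption.
import numpy as np

def apply_selection(trials, electrode_idx, feature_idx):
    """Keep the selected electrodes, then the selected features, and flatten."""
    reduced = trials[:, electrode_idx, :][:, :, feature_idx]
    return reduced.reshape(len(trials), -1)

# Indices assumed to have been chosen once during training:
electrode_idx = np.array([3, 7, 12, 20])       # selected electrodes (example)
feature_idx   = np.array([0, 2, 5])            # selected features (example)

new_trials = np.random.randn(10, 118, 8)       # incoming test trials
X_test = apply_selection(new_trials, electrode_idx, feature_idx)
# X_test can now be passed directly to the trained (F)KNN classifier.
```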

Table 15 reports the p values obtained through the Wilcoxon rank-sum statistical test for the proposed FBCS algorithm versus each of the other competitors on the datasets D1 and D2. The results of the statistical test confirm that FBCS with dimensionality reduction is among the best-performing models for the BCI recognition system. Unlike most publications in the BCI field, which recommend SVM and LDA as the highest-performing classifiers, our three experiments show that, for each feature reduction and electrode selection method, several classifiers should be tested, including the proposed one. In general, there is no single classifier, feature reduction method, or electrode selection method that outperforms all others, because performance varies with the subject's mental actions and with the combination of all model components (signal processing, feature extraction, dimensionality reduction, and classifier).

6 Conclusion

Dimensionality reduction of features is an open problem in brain–computer interface (BCI) research, since features extracted from brain signals are high-dimensional, which degrades classifier accuracy. Selecting the most relevant features and electrodes improves classifier performance and reduces the computational cost of the system. In this study, a new strategy called the fuzzy-based classification strategy (FBCS) is proposed to determine the best subset of features from a selected number of electrodes of an electroencephalography (EEG)-based BCI dataset for different actions. The proposed feature reduction and electrode selection, which shrink the dimensions of the feature vector, are tested and achieve the best classification performance metrics (especially the computational time CT, a vital parameter) on Dataset 2a of BCI competition IV (Brunner et al. 2008) and Dataset IVa of BCI competition III (http://www.bbci.de/competition/iii/desc_IVa.html). Thus, our algorithm can be employed for further real-time processing of multi-class problems. Our future aim is to design a real system able to classify brain tasks online in a real environment with less computational time. Further study in this direction will aim to optimize the feature reduction, electrode selection, and classification techniques for implementation in real-time BCI applications.