1 Introduction

A brain-computer interface (BCI) is an advanced communication system designed to bypass the peripheral nerves and muscles and to establish a communication pathway between the brain and the external environment, thereby decoding the brain's state of mind [1]. In recent years, with the interdisciplinary integration of computer science, brain science, neurobiology, and intelligent control, BCI technology has expanded from the medical and health fields to entertainment, education, smart home, and even military applications. A complete BCI system generally consists of four parts: a signal acquisition module, a signal processing module, a control signal output module, and a feedback module [2].

Electroencephalography (EEG) is widely used to measure neurophysiological activity in the signal acquisition module because of its noninvasiveness, high temporal resolution, and inexpensive recording devices [3]. However, EEG signals also have drawbacks that cannot be ignored, such as non-stationarity, non-linearity, and a low signal-to-noise ratio (SNR). These shortcomings place higher demands on the subsequent signal processing stage. Thus, seeking robust feature extraction methods has become a key issue in recent research [4].

Common spatial patterns (CSP), one of the most popular and representative spatial filtering algorithms, is well suited to extracting features associated with motor imagery tasks. In recent years, with the wide application of the CSP algorithm, researchers have put forward many new ideas to further optimize the performance of BCI systems. For example, Afrakhteh et al. indicated that higher recognition accuracy can be achieved by using the CSP algorithm for feature extraction and dimension reduction, combined with evolutionary approaches and improved neural network algorithms [5,6,7,8]. Many scholars have combined CSP with classical methods, such as Mel-frequency–based CSP (MF-CSP) [9], CSP combined with wavelet packet decomposition (wavelet-CSP) [10], and improved common spatial patterns (B-CSP) [11]. Besides, Rahman et al. proposed multiclass CSP (M-CSP) [12] by extending CSP from two classes to multiple classes.

The classical CSP itself also has drawbacks. Although the traditional CSP algorithm is simple and efficient, its covariance matrix estimation is based on the squared Euclidean distance, which makes the method vulnerable to outliers and noise [13]. To improve the robustness and sparsity of the CSP algorithm, several extensions have been put forward by modifying its objective function, such as L1-norm-based CSP (CSP-L1) [14, 15], sparse CSP-L1 (sp-CSPL1) [16], regularized CSP-L1 with a waveform length (wlCSPL1) [17], local temporal CSP (LTCSP) [18], local temporally correlated CSP (LTCCSP) [19], local temporal joint-recurrence CSP (LTRCSP) [20], and Lp-norm-based local temporally correlated CSP (LTCCSP-Lp) [21]. Among them, the L1-norm-based extensions are popular and can seek robust spatial filters effectively; however, the L1-norm cannot characterize the geometric structure of the data well, and its absolute value operator complicates the computation. This inspired the use of the L21-norm, which offers rotational invariance and better characterization of the geometric structure; L21-norm-based CSP (CSP-L21) [22] and regularized CSP with the L21-norm (RCSP-L21) [23] were proposed accordingly.
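
For clarity, these norms can be written explicitly for a matrix \(A=({a}_{1},{a}_{2},...,{a}_{m})\in {R}^{C\times m}\) with columns \({a}_{i}\); the column-wise convention below matches the way the norms are applied to the projected data in Sect. 3 (strictly speaking, the capped variant is not a true norm, but the terminology is kept for consistency with the literature):

$${\Vert A\Vert }_{1}=\sum_{i=1}^{m}\sum_{c=1}^{C}|{A}_{ci}|,\quad {\Vert A\Vert }_{F}^{2}=\sum_{i=1}^{m}{\Vert {a}_{i}\Vert }_{2}^{2},\quad {\Vert A\Vert }_{2,1}=\sum_{i=1}^{m}{\Vert {a}_{i}\Vert }_{2},\quad {\Vert A\Vert }_{{\text{cap}}21}=\sum_{i=1}^{m}\mathrm{min}\left({\Vert {a}_{i}\Vert }_{2},\varepsilon \right)$$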

In this paper, we consider a new algorithmic form with better stability and robustness by replacing the L2-norm with the capped L21-norm. This method, called the capped L21-norm-based CSP (CCSP-L21), is motivated by ideas underpinning some classical pattern recognition algorithms in the machine learning field. Compared with other extensions to CSP, CCSP-L21 has two main highlights as follows. On the one hand, by employing the L21-norm as the basic metric, we enhance the robustness of our new approach to achieve better classification performance. In fact, this enhancement is achieved by removing the square operator and has been applied in many feature selection algorithms, such as the rotational invariant L1-norm principal component analysis (R1-PCA) [24, 25], discriminant analysis via joint Euler transform and L21-norm (e-LDA-L21) [26], and L21-norm-based discriminant locality preserving projections (L21-DLPP) [27]. On the other hand, to further reduce the negative impact of some outliers with large amplitudes that appear during the signal acquisition process, we apply the capped norm to our new approach. Recently, some studies have also shown that methods integrating capped norms can obtain more discriminative features. For example, Lai et al. [28] presented a robust locally discriminant analysis via capped norm (RLDA) by mixing the L21-norm, capped norm, regularized term, and local structure information. Moreover, Wang et al. [29] proposed capped Lp-norm linear discriminant analysis (CappedLDA) to enhance the robustness of the algorithm.

The remainder of this paper is organized as follows. We define some notation and briefly review the conventional CSP in Sect. 2. In Sect. 3, CCSP-L21 is presented, and the corresponding non-greedy iterative algorithm is introduced. We carry out a set of experiments on three real EEG data sets and discuss the results in Sect. 4. Finally, Sect. 5 concludes the paper.

2 Brief review of conventional CSP

As one of the most commonly used feature extraction methods, common spatial patterns (CSP) performs well in the classification of multichannel EEG signals [30]. It is generally applied to two-class paradigms. Let \({X}^{1},{X}^{2},...,{X}^{{t}_{x}}\in {R}^{C\times N}\) be the EEG signals of one mental task, and let \({Y}^{1},{Y}^{2},...,{Y}^{{t}_{y}}\in {R}^{C\times N}\) be those of the other condition. Here, C represents the number of electrodes (channels), N is the number of recording time points in a trial, and tx and ty denote the numbers of trials that belong to the two classes, respectively. For ease of notation, the concatenated trials are relabeled column-wise as \(X=({x}_{1},{x}_{2},...,{x}_{m})\in {R}^{C\times m}\) (\(m=N\times {t}_{x}\)) and \(Y=({y}_{1},{y}_{2},...,{y}_{n})\in {R}^{C\times n}\) (\(n=N\times {t}_{y}\)), where m and n represent the numbers of sampled points from the two brain states. In addition, the trial segments are assumed to have already been band-pass filtered in a specific frequency band, mean-centered, and normalized [31].
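
As a concrete illustration of this notation, the following sketch (NumPy; the trial lists are hypothetical variables) concatenates preprocessed trials into the matrices X and Y:

```python
import numpy as np

def stack_trials(trials):
    """Concatenate t preprocessed trials, each of shape (C, N), into one
    (C, t*N) matrix whose columns are the sampled time points."""
    centered = [tr - tr.mean(axis=1, keepdims=True) for tr in trials]  # remove each channel's mean
    return np.hstack(centered)

# trials_x, trials_y: hypothetical lists of (C, N) arrays for the two mental tasks
# X = stack_trials(trials_x)   # shape (C, m), m = N * t_x
# Y = stack_trials(trials_y)   # shape (C, n), n = N * t_y
```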

The CSP algorithm aims to find an optimal spatial filter \(w\in {R}^{C}\) that projects multichannel EEG signals into a new space such that the variance of one class is maximized while that of the other class is minimized. Mathematically, the objective function can be given as follows:

$${J}_{\text{CSP}}(w)=\frac{{w}^{T}{C}^{x}w}{{w}^{T}{C}^{y}w}$$
(1)

Here, the covariance matrices of the two classes \({C}^{x}\in {R}^{C\times C}\) and \({C}^{y}\in {R}^{C\times C}\) can be calculated as Eqs. (2) and (3), respectively:

$${C}^{x}=\frac{1}{{t}_{x}}X{X}^{T}$$
(2)
$${C}^{y}=\frac{1}{{t}_{y}}Y{Y}^{T}$$
(3)

where T denotes the transpose operator. Maximizing this objective is essentially a generalized eigenvalue problem, which can be solved via the following equation:

$${C}^{x}w=\lambda {C}^{y}w$$
(4)

where the eigenvalue λ represents the ratio of the variances of the two classes.

Finally, we select the eigenvectors associated with the few largest and smallest eigenvalues as spatial filters. The normalized log-variances of the projected signals are then used as features, which are fed into a linear discriminant analysis (LDA) classifier.
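
A minimal sketch of the classical CSP computation described above (NumPy/SciPy; the variable names and the number of filter pairs are illustrative, and the covariance normalization follows Eqs. (2) and (3)):

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(X, Y, t_x, t_y, n_pairs=3):
    """Classical CSP: solve C^x w = lambda C^y w (Eq. (4)) and keep the
    eigenvectors associated with the largest and smallest eigenvalues."""
    Cx = X @ X.T / t_x                      # Eq. (2)
    Cy = Y @ Y.T / t_y                      # Eq. (3), assumed positive definite
    eigvals, eigvecs = eigh(Cx, Cy)         # generalized eigenvalue problem, ascending eigenvalues
    idx = np.r_[np.arange(n_pairs), np.arange(len(eigvals) - n_pairs, len(eigvals))]
    return eigvecs[:, idx]                  # spatial filters, shape (C, 2*n_pairs)

def log_variance_features(W, Z):
    """Normalized log-variance features of a single trial Z (C x N)."""
    var = (W.T @ Z).var(axis=1)
    return np.log(var / var.sum())
```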

3 Capped L21-norm-based common spatial patterns (CCSP-L21)

It is clear that the objective function expression of the traditional CSP algorithm is based on the square of Euclidean distance, which makes the performance of the method easily affected by outliers and noise. To address this problem, a new robust extension is considered in the paper. We term it capped L21-norm-based common spatial patterns (CCSP-L21).

In this section, we first present the new objective function of the proposed method. Then, a non-greedy iterative algorithm [32] is designed to solve the optimization problem. Finally, suitable features are extracted for classification.

3.1 Objective function

For the convenience of calculation, we rewrite the objective function of the classical CSP by substituting Eqs. (2) and (3) into Eq. (1), which is shown as Eq. (5):

$${J}_{\text{CSP}}(w)=\frac{{w}^{T}{C}^{x}w}{{w}^{T}{C}^{y}w}=\frac{\frac{1}{{t}_{x}}{\Vert {w}^{T}X\Vert }_{2}^{2}}{\frac{1}{{t}_{y}}{\Vert {w}^{T}Y\Vert }_{2}^{2}}=\frac{\frac{1}{{t}_{x}}\sum_{i=1}^{m}{({w}^{T}{x}_{i})}^{2}}{\frac{1}{{t}_{y}}\sum_{j=1}^{n}{({w}^{T}{y}_{j})}^{2}}$$
(5)

where \({\Vert \cdot \Vert }_{2}\) denotes the L2-norm.

To obtain more discriminative features, the objective function can be further reformulated by the capped L21-norm as follows:

$${J}_{{\text{CCSP}}-L21}(W)=\frac{{\Vert {W}^{T}X\Vert }_{{\text{cap}}21}}{{\Vert {W}^{T}Y\Vert }_{{\text{cap}}21}}=\frac{\sum_{i=1}^{m}\mathrm{min}\left({\Vert {W}^{T}{x}_{i}\Vert }_{2},\varepsilon \right)}{\sum_{j=1}^{n}\mathrm{min}\left({\Vert {W}^{T}{y}_{j}\Vert }_{2},\varepsilon \right)}$$
(6)

where \({\Vert \cdot \Vert }_{{\text{cap}}21}\) denotes the capped L21-norm, ε (ε > 0) is a thresholding parameter that is used to pick out the extreme data outliers, \(W\in {R}^{C\times d}\) (d < C) is an optimal projection matrix for dimension reduction, and d represents the number of extracted features for classification.
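For concreteness, the objective in Eq. (6) can be evaluated for a candidate projection matrix W as follows (a sketch using the notation above; the helper names are ours):

```python
import numpy as np

def capped_l21(W, X, eps):
    """||W^T X||_cap21 = sum_i min(||W^T x_i||_2, eps) over the columns x_i of X."""
    col_norms = np.linalg.norm(W.T @ X, axis=0)    # ||W^T x_i||_2 for every sampled point
    return np.minimum(col_norms, eps).sum()

def ccsp_l21_objective(W, X, Y, eps):
    """The ratio J_CCSP-L21(W) of Eq. (6)."""
    return capped_l21(W, X, eps) / capped_l21(W, Y, eps)
```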

By simple algebraic manipulation, objective function (6) is equivalent to the following formulation:

$${J}_{{\text{CCSP}}-L21}(W)=\frac{tr({W}^{T}X{D}_{x}{X}^{T}W)}{tr({W}^{T}Y{D}_{y}{Y}^{T}W)}$$
(7)
$${D}_{x}=diag\left(\frac{Ind{1}_{1}}{{\Vert {W}^{T}{x}_{1}\Vert }_{2}},\frac{Ind{1}_{2}}{{\Vert {W}^{T}{x}_{2}\Vert }_{2}},...,\frac{Ind{1}_{m}}{{\Vert {W}^{T}{x}_{m}\Vert }_{2}}\right)$$
(8)
$${D}_{y}=diag\left(\frac{Ind{2}_{1}}{{\Vert {W}^{T}{y}_{1}\Vert }_{2}},\frac{Ind{2}_{2}}{{\Vert {W}^{T}{y}_{2}\Vert }_{2}},...,\frac{Ind{2}_{n}}{{\Vert {W}^{T}{y}_{n}\Vert }_{2}}\right)$$
(9)

where tr(·) is the trace operator, and \(Ind{1}_{i}\) and \(Ind{2}_{j}\) are the indicator functions defined in Eqs. (10) and (11):

$$Ind{1}_{i}=\left\{\begin{array}{cc}1& if\ {\Vert {W}^{T}{x}_{i}\Vert }_{2}\le \varepsilon \\ 0& otherwise\end{array}\right.$$
(10)
$$Ind{2}_{j}=\left\{\begin{array}{cc}1& if\ {\Vert {W}^{T}{y}_{j}\Vert }_{2}\le \varepsilon \\ 0& otherwise\end{array}\right.$$
(11)
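
In code, the diagonal reweighting matrices of Eqs. (8)–(11) can be formed as follows (a sketch; the small constant guarding against division by zero is an implementation detail not stated in the text):

```python
import numpy as np

def capped_weights(W, X, eps, tiny=1e-12):
    """Diagonal entries of D_x (or D_y): Ind_i / ||W^T x_i||_2, with Ind_i = 1
    if ||W^T x_i||_2 <= eps and 0 otherwise (capped samples are dropped)."""
    col_norms = np.linalg.norm(W.T @ X, axis=0)
    ind = (col_norms <= eps).astype(float)         # indicator of Eqs. (10)/(11)
    return ind / np.maximum(col_norms, tiny)       # diagonal of D, Eqs. (8)/(9)

def weighted_scatter(W, X, eps):
    """The reweighted scatter matrix X D_x X^T appearing in Eq. (7)."""
    d = capped_weights(W, X, eps)
    return (X * d) @ X.T                           # same as X @ np.diag(d) @ X.T, but cheaper
```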

3.2 Iterative algorithm

Obviously, it is difficult to obtain a closed-form solution to the objective function of the proposed approach. To obtain the optimal projection matrix W, we therefore consider a non-greedy iterative algorithm that constructs an auxiliary function and combines an alternating update scheme, the subgradient algorithm, and the Armijo line search method.

The following theorem is introduced to provide an auxiliary function for objective optimization:

Theorem 1: Suppose that M(U) and N(U) are positive for any U satisfying \({U}^{T}U={I}_{p}\). Then we have:

$${\lambda }_{\mathrm{max}}=\frac{M({U}^{*})}{N({U}^{*})}=\underset{{U}^{T}U={I}_{p}}{\mathrm{max}}\frac{M(U)}{N(U)}$$
(12)

if and only if:

$$M({U}^{*})-{\lambda }_{\mathrm{max}}N({U}^{*})=\underset{{U}^{T}U={I}_{p}}{\mathrm{max}}\left(M(U)-{\lambda }_{\mathrm{max}}N(U)\right)=0$$
(13)

Thus, objective function (7) can be converted into the following equivalent trace-difference form:

$${W}_{\text{opt}}=\underset{{W}^{T}W={I}_{d}}{\mathrm{arg\,max}}\,\frac{{\Vert {W}^{T}X\Vert }_{{\text{cap}}21}}{{\Vert {W}^{T}Y\Vert }_{{\text{cap}}21}}=\underset{{W}^{T}W={I}_{d},\lambda }{\mathrm{arg\,max}}\left({\Vert {W}^{T}X\Vert }_{{\text{cap}}21}-\lambda {\Vert {W}^{T}Y\Vert }_{{\text{cap}}21}\right)$$
(14)

Because the detailed derivation and the convergence proof of the iterative process are given in [22], the complete iteration steps are listed directly here to avoid repetition, as shown in Table 1.

Table 1 Iterative algorithm procedure of CCSP-L21
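
Since Table 1 is not reproduced here, the sketch below outlines one plausible realization of such a non-greedy scheme: alternately update the capped weights and the ratio λ, take a (sub)gradient ascent step on the trace difference of Eq. (14) with Armijo-style backtracking, and re-orthogonalize W. The step-size rule, the QR retraction, and the stopping criterion are our assumptions, not the exact procedure of Table 1; weighted_scatter() is the helper sketched in Sect. 3.1.

```python
import numpy as np

def ccsp_l21_solver(X, Y, d, eps, alpha=0.5, beta=0.5, n_iter=50, seed=0):
    """Illustrative non-greedy iterative solver for Eq. (14)."""
    rng = np.random.default_rng(seed)
    C = X.shape[0]
    W = np.linalg.qr(rng.standard_normal((C, d)))[0]     # random start with W^T W = I_d
    for _ in range(n_iter):
        Sx = weighted_scatter(W, X, eps)                 # X D_x X^T, Eq. (7)
        Sy = weighted_scatter(W, Y, eps)                 # Y D_y Y^T
        lam = np.trace(W.T @ Sx @ W) / np.trace(W.T @ Sy @ W)
        G = 2.0 * (Sx - lam * Sy) @ W                    # (sub)gradient of the trace difference
        f_old = np.trace(W.T @ (Sx - lam * Sy) @ W)
        step = alpha
        while step > 1e-8:                               # Armijo-style backtracking line search
            W_new = np.linalg.qr(W + step * G)[0]        # retract back onto W^T W = I_d
            f_new = np.trace(W_new.T @ (Sx - lam * Sy) @ W_new)
            if f_new >= f_old + 1e-4 * step * np.sum(G * (W_new - W)):
                break
            step *= beta
        W = W_new
    return W
```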

3.3 Feature extraction

The optimal projection matrix W obtained by the above iterative algorithm can be regarded as a set of mutually orthogonal spatial filters. Therefore, we relabel the columns of W as \({w}_{1},{w}_{2},\cdots ,{w}_{d}\). Let Z denote an EEG trial; the feature vector f is then extracted as:

$$f={\left({\Vert {w}_{1}^{T}Z\Vert }_{2},{\Vert {w}_{2}^{T}Z\Vert }_{2},\cdots ,{\Vert {w}_{d}^{T}Z\Vert }_{2}\right)}^{T}$$
(15)

where d represents the number of spatial filters.
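
In code, the feature vector of Eq. (15) for one trial can be computed as (a brief sketch following the notation above):

```python
import numpy as np

def ccsp_l21_features(W, Z):
    """f = (||w_1^T Z||_2, ..., ||w_d^T Z||_2)^T for a single trial Z of shape (C, N)."""
    return np.linalg.norm(W.T @ Z, axis=1)   # one L2-norm per spatial filter
```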

4 Experiment

In the experiments, we use three public BCI competition EEG data sets, namely data sets IIIa and IVa of BCI competition III and data set IIa of BCI competition IV [33], to demonstrate the effectiveness of the proposed CCSP-L21 approach. In addition, other extensions of the original CSP method are introduced for comparison. Afterwards, we compare the performance of all methods when outliers occur at different frequencies. Linear discriminant analysis (LDA) is used as the classifier to evaluate algorithm performance throughout.

4.1 Real EEG data sets

The three real data sets record EEG signals while the subjects imagine limb movements (e.g., hand or foot movements) [34]. Note that only two-class classification is considered in the experiments. The detailed statistical information of the three publicly available data sets is summarized in Table 2.

Table 2 Detailed statistical information of the three real EEG data sets used for the experiment

4.2 Preprocessing of the EEG signals

For the three data sets introduced above, the raw EEG signals require a series of preprocessing operations before the experiment. The original signals are first filtered by a fifth-order Butterworth filter with cutoff frequencies of 8 and 35 Hz, covering both the α-band and the β-band. There is an optimal time window in which event-related synchronization (ERS) or event-related desynchronization (ERD) can be detected in the EEG. Thus, the EEG segments recorded from 0.5 s to 2.5 s after the visual cue are chosen for the first and third data sets. For the second data set, inspired by the winner of BCI competition IV and the pretreatment method described in the corresponding article [35], we use a time interval from 0.5 s to 3.75 s.
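
A sketch of this preprocessing stage (SciPy; the sampling rate fs is dataset-specific, the raw trial is assumed to be aligned to the visual cue, and the use of zero-phase filtfilt filtering is our implementation choice, not stated in the text):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_trial(raw, fs, t_start=0.5, t_end=2.5):
    """Band-pass a raw trial (C x samples) to 8-35 Hz with a fifth-order
    Butterworth filter and crop the cue-aligned time window."""
    b, a = butter(5, [8.0, 35.0], btype='bandpass', fs=fs)
    filtered = filtfilt(b, a, raw, axis=1)           # zero-phase filtering along time
    start, end = int(t_start * fs), int(t_end * fs)
    return filtered[:, start:end]                    # 0.5-2.5 s, or 0.5-3.75 s for the second data set
```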

4.3 Experimental settings

The CCSP-L21 algorithm involves three parameters: the line search parameters β and α and the thresholding parameter ε. The set {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1} is designed for β empirically, while α randomly takes a value between 0 and 1. Note that we run the program ten times to ensure the stability of the algorithm. Following the experience reported in [29], the thresholding parameter ε is searched over the set {1e − 5, 1e − 4, 1e − 3, 1e − 2, 1e − 1, 1, 1e1, 1e2} on a logarithmic scale. In addition, as one of the comparison methods, the TRCSP algorithm has regularization parameters, which are selected from the set {1e − 6, 1e − 5, 1e − 4, 1e − 3, 1e − 2, 1e − 1, 1e1, 1e2} by tenfold cross-validation.

In particular, the number of filter pairs varies from 1 to 0.5 × C rather than being fixed at a single value. The resulting d-dimensional feature vectors, where d denotes the number of filters, are fed into the LDA classifier.
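
Putting the pieces together, the evaluation for one filter-pair setting can be sketched as follows (scikit-learn LDA; ccsp_l21_features() is the helper from Sect. 3.3, and the trial lists and labels are hypothetical variables):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def evaluate(W, train_trials, train_labels, test_trials, test_labels):
    """Extract d-dimensional feature vectors with the learned filters W and
    score them with an LDA classifier."""
    F_train = np.array([ccsp_l21_features(W, Z) for Z in train_trials])
    F_test = np.array([ccsp_l21_features(W, Z) for Z in test_trials])
    clf = LinearDiscriminantAnalysis().fit(F_train, train_labels)
    return clf.score(F_test, test_labels)    # classification accuracy
```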

4.4 Outlier simulation

In order to further verify the robustness of the algorithm, a C-dimensional Gaussian distribution N(m + σ, Σ) is used to generate outliers, with their number varying from 0 to 0.5N in steps of 0.1N. Here, m represents the mean vector and σ the standard deviation vector of the EEG training data, Σ denotes the covariance matrix of the EEG training samples, and N is the number of recording time points.
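
A sketch of this outlier injection (NumPy; how the generated outliers are inserted into the trials is not specified in the text, so replacing randomly chosen time points is our assumption):

```python
import numpy as np

def add_outliers(trial, frac, m, sigma, Sigma, rng=None):
    """Replace a fraction `frac` (0 to 0.5) of the N time points of a trial with
    samples drawn from the C-dimensional Gaussian N(m + sigma, Sigma)."""
    rng = rng or np.random.default_rng()
    C, N = trial.shape
    idx = rng.choice(N, size=int(frac * N), replace=False)   # time points to contaminate
    noisy = trial.copy()
    noisy[:, idx] = rng.multivariate_normal(m + sigma, Sigma, size=len(idx)).T
    return noisy

# m, sigma: mean and standard deviation vectors of the EEG training data (length C)
# Sigma: covariance matrix of the EEG training samples (C x C)
```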

4.5 Results and discussion

In this section, the performance of the proposed CCSP-L21 algorithm is verified by comparing with five relevant methods on the three public BCI competition EEG data sets mentioned above. Other than the classical CSP algorithm, we also use some other extensions for comparison, including the CSP with Tikhonov regularization (TRCSP) [35], the CSP with weighted average covariance matrix (ACMCSP) [36], the regularized CSP based on diagonal loading (DLCSP) [37], and the L21-norm-based CSP (CSP-L21) [22].

Figure 1 displays the average recognition rates of the above six methods as the number of filter pairs changes. It can be seen that the blue curve, which represents the classification accuracies of the CCSP-L21 algorithm, lies above the curves of the other algorithms in most cases. In addition, our proposed approach achieves the highest recognition rates, which demonstrates its superior performance in recognizing motor imagery (MI)-based EEG signals.

Fig. 1
figure 1

The average classification accuracies as the pairs of spatial filters change by six different extensions to CSP, i.e., classical CSP, ACMCSP, TRCSP, DLCSP, CSP-L21, and CCSP-L21. a Data set IIIa, b IVa of BCI competition III, and c data set IIa of BCI competition IV

From Fig. 1, we can obtain the filter pairs at which each of the six methods reaches its optimal recognition rate on the three real data sets. For the three individuals in data set IIIa of BCI competition III, the optimal numbers of spatial filter pairs for CSP, ACMCSP, TRCSP, DLCSP, CSP-L21, and CCSP-L21 are 2, 3, 8, 2, 5, and 4, respectively; 3, 1, 1, 2, 2, and 3 filter pairs are selected for data set IVa of BCI competition III, which has five subjects. On data set IIa of BCI competition IV, the best accuracy is achieved with three filter pairs for all of the above methods.

Next, as shown in Tables 3 and 4, the optimal classification accuracies of each algorithm on the three data sets are reported, and the best result for each subject is shown in bold for ease of comparison. Note that the last row lists the results of the BCI competition winners only for completeness rather than as a direct comparison.

Table 3 Classification accuracies of CSP, ACMCSP, TRCSP, DLCSP, CSP-L21, and CCSP-L21 on the subjects of data sets IIIa and IVa of BCI competition III without outliers added. The BCI winner values with underline on data set IIIa of BCI competition III are the kappa scores. Values in bold indicate the best recognition rate for each subject
Table 4 Classification accuracies of CSP, ACMCSP, TRCSP, DLCSP, CSP-L21, and CCSP-L21 on the subjects of data set IIa of BCI competition IV without outliers added. The BCI winner values with underline are the kappa scores. Values in bold indicate the best recognition rate for each subject

Clearly, the classical CSP and the other four extensions each have advantages for some subjects. However, the proposed CCSP-L21 algorithm performs better than the other methods under most circumstances. For some subjects, such as s1, s3, al, and A08E, the recognition rates are above 98%; for individuals s1 and al they even reach 100%. Compared with the traditional CSP, the mean classification accuracies of CCSP-L21 increase by approximately 3.15%, 4.41%, and 1.95% on the three data sets, respectively, which fully demonstrates the effectiveness of the capped L21-norm. Moreover, CCSP-L21 also improves on CSP-L21 to some extent, which demonstrates the benefit of introducing the capped norm.

Next, the robustness of the CCSP-L21 algorithm is further evaluated with artificial outliers added. Figure 2 shows how the average recognition rate of the subjects on each data set varies with the frequency of the outliers. As the frequency of the outliers increases, CCSP-L21 still achieves excellent discrimination accuracies, which always exceed 65%, while the performance of the other methods deteriorates. Furthermore, taking data set IVa of BCI competition III as an example, the average classification accuracy of each subject on the contaminated data is reported in Table 5. As can be seen from the table, for subject ay, the TRCSP algorithm performs best because its regularization term can effectively alleviate overfitting on small samples [38]. For the other subjects, the CCSP-L21 algorithm obtains the best classification accuracy among all the methods. The performance of CCSP-L21 is also superior to that of CSP-L21 in the presence of outliers, which further proves the robustness of CCSP-L21. Moreover, compared with the other extensions, the average recognition rates of CCSP-L21 are consistently higher by approximately 10%. Based on the analysis above, we conclude that the CCSP-L21 approach effectively reduces the impact of outliers.

Fig. 2
figure 2

Average classification accuracies of CSP, ACMCSP, TRCSP, DLCSP, CSP-L21, and CCSP-L21 for the subjects of the three real EEG data sets with outliers added. The numbers of the outliers are 0.1 N, 0.2 N, 0.3 N, 0.4 N, and 0.5 N. a Data set IIIa of BCI competition III. b Data set IVa of BCI competition III. c Data set IIa of BCI competition IV

Table 5 Average classification accuracies of the CSP, ACMCSP, TRCSP, DLCSP, CSP-L21, and CCSP-L21 methods for the five subjects with increasing outlier occurrence frequencies on data set IVa of BCI competition III. Values in bold indicate the best average recognition rate for each subject. The numbers of the outliers vary from 0 to 0.5 N with step 0.1 N

Moreover, the determination of the line search parameter β and the thresholding parameter ε deserves discussion. We take the three subjects of data set IIIa of BCI competition III as examples and plot, in Fig. 3, the 3-D histograms of classification accuracy for each subject while varying the values of β and ε. Empirically, the set {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1} is designed for β, while the thresholding parameter ε is searched over the set {1e − 5, 1e − 4, 1e − 3, 1e − 2, 1e − 1, 1, 1e1, 1e2} according to [29]. It can be observed that the optimal value of β differs distinctly between subjects because brain-wave characteristics vary greatly between individuals. Moreover, the accuracy generally reaches a high level when ε takes a value from {1e − 2, 1e − 1, 1}. If the thresholding parameter is set too large, outliers are filtered less effectively; conversely, if it is set too small, much useful information is lost. This confirms that the parameter plays an important role in the algorithm and needs to be tuned carefully.

Fig. 3
figure 3

Classification accuracies of CCSP-L21 for each subject from the data set IIIa of BCI competition III change with the value of the line search parameter β and the thresholding parameter ε

Last but not least, we analyze the computational complexity of the proposed method and of the other CSP-based methods. For classical CSP, TRCSP, ACMCSP, and DLCSP, the main computational cost comes from solving the eigen-equation, which requires \(O({C}^{3})\) operations. For CSP-L21 and CCSP-L21, a non-greedy iterative procedure is used to solve the proposed objective function. In practice, the dimensions of the original EEG data are always larger than the other constants, so we only consider the dominant matrix computations and the number of iterations. Therefore, if the number of iterations is T, the total computational complexity is \(O((m+n)CT)\), where C denotes the number of electrodes (channels), and m and n represent the numbers of sampled points from the two brain states. It can be seen that the two L21-norm-based algorithms are affected by the number of iterations, which depends on the settings of the initial value and the step-size parameters.

To sum up, the experiments on both noisy and noise-free data sets demonstrate the robustness and superiority of the CCSP-L21 algorithm. However, the proposed method involves several parameters, which introduce uncertainty into the system. How to tune the parameters optimally, or how to design a more stable solution approach with fewer parameters, remains to be considered. In addition, improving the processing speed and evaluating more kinds of noise can also be investigated in future work.

5 Conclusion

In this paper, we propose the capped L21-norm-based common spatial patterns, named CCSP-L21. The algorithm constructs a more robust model by introducing the capped L21-norm to redefine the covariance matrices of the EEG data. In this formulation, the L21-norm removes the influence of the square operator, while the capping operation further filters out extreme outliers. A non-greedy iterative algorithm is designed to compute the optimal solution of the proposed CCSP-L21. Experimental results show that the CCSP-L21 method outperforms the classical CSP and other extensions. In future work, finding more appropriate parameters is a significant problem that deserves further consideration.