Introduction

Recently, the brain–computer interface (BCI), which aims to establish a direct communication pathway between the brain and a computer, has been widely researched around the world [1]. EEG-based BCIs that rely on the motor imagery of users are of particular interest to the BCI community. Because of factors such as low spatial resolution and high noise levels, it has been a big challenge to extract effective features from EEG signals and perform classification. Many methods have been proposed for feature extraction and classification of EEG signals, as well as for parameter selection [2]. Among them, the common spatial patterns (CSP) method [3, 4, 5] has proven very powerful in determining spatial filters that extract discriminative brain rhythms. However, it suffers from the limitation of assuming a strictly linear relation between the source signals of the brain and the recorded EEG signals.

In order to make CSP applicable to nonlinearly structured data, kernel-based methods have been applied. The main idea of kernel-based methods is to map the input data into a feature space by a nonlinear mapping, where inner products in the feature space can be computed by a kernel function without knowing the nonlinear mapping explicitly [6, 7, 8, 9, 10]. In [11], a nonlinear CSP was proposed, which is restricted to an equal number of trials from each class. A more flexible formulation of the kernel CSP was given in [12]. However, the complexity of the algorithm and the selection of kernel functions became new problems. More recently, a hybrid linear and kernel CSP [13] was proposed that uses hybrid kernel functions which are mostly linear but can also account for small degrees of nonlinearity.

The conventional CSP algorithm can be solved by a two-step PCA procedure, and most existing kernel CSP methods are based on kernel PCA [7]. A common limitation of these methods is that the class covariance matrix in the feature space must be nonsingular, which restricts their application to large EEG training sets. To reduce the calibration time of a BCI application, an efficient training algorithm based on a small number of EEG trials is necessary. Therefore, we propose a nonlinear CSP based on kernel methods using the generalized singular value decomposition (GSVD).

The remainder of the paper is organized as follows. In Sect. 2, we briefly introduce the classical CSP algorithm and reformulate it so that it can be solved by the GSVD; the kernel-based GCSP algorithm is then described in Sect. 3. In Sect. 4, experimental results are illustrated and discussed. Finally, conclusions are provided in Sect. 5.

Linear CSP

Generally, an M × T matrix \({\bf X}_{j}\) represents the j-th trial of EEG signals, where M is the number of channels and T is the number of samples per channel. The class-specific spatial covariance matrix can be obtained by

$$ {\bf R}_{c} = \frac{1}{n_{c}} \sum_{{\bf X}_{j}\in \hbox{class}\,c} \frac{{\bf X}_{j}{\bf X}^{T}_{j}}{\hbox{trace}({\bf X}_{j}{\bf X}^{T}_{j})}, \quad c=1,2. $$
(1)

Here, \(n_{c}\) is the number of trials recorded under the c-th mental task. A normalization is applied to each \({\bf X}_{j}\) so that the sum of the energies of all signal channels equals 1. For the two-class case, the total spatial covariance matrix can be formulated as

$$ {\bf R}_{t} = {\bf R}_{1} + {\bf R}_{2}. $$
(2)
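For concreteness, the computation in (1) and (2) can be sketched in a few lines of NumPy; the variable names (e.g. `trials_c`) are ours, not part of the original method description.

```python
import numpy as np

def class_covariance(trials_c):
    """Normalized spatial covariance of one class, Eq. (1).

    trials_c: list of M x T arrays, the EEG trials of class c.
    """
    R_c = np.zeros((trials_c[0].shape[0],) * 2)
    for X in trials_c:
        S = X @ X.T
        R_c += S / np.trace(S)   # per-trial unit-energy normalization
    return R_c / len(trials_c)

# Total spatial covariance for the two-class case, Eq. (2):
# R_t = class_covariance(trials_1) + class_covariance(trials_2)
```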

The objective of CSP is to find the optimal spatial filter w that maximizes the average energy of one class while minimizing that of the other (and vice versa). Such filters are often obtained by maximizing the Rayleigh quotient

$$ {\bf w} = \arg\max_{\bf w}\frac{{\bf w}^{T}{\bf R}_{c}{\bf w}}{{\bf w}^{T}{\bf R}_{t}{\bf w}}. $$
(3)

\({\bf R}_{c}\) and \({\bf R}_{t}\) in (3) can be further factored as

$$ {\bf R}_{c} = {\bf H}_{c}{\bf H}_{c}^{T} \quad \hbox{and} \quad {\bf R}_{t} = {\bf H}_{t}{\bf H}_{t}^{T}, $$
(4)

with

$$ {\bf H}_{c} = \left[ \frac{{\bf X}_{1}}{\sqrt{n_{c}\times \hbox{trace}({\bf X}_{1}{\bf X}^{T}_{1})}}, \ldots, \frac{{\bf X} _{j}}{\sqrt{n_{c}\times \hbox{trace}({\bf X}_{j}{\bf X}^{T}_{j})}} \right]_{\mathop{{\bf X}_{j}\in \hbox{class}\,c,}\limits_{1\leq j\leq n_{c}}} $$
(5)

and

$$ {\bf H}_{t} = [{\bf H}_{1}, \quad {\bf H}_{2}], $$
(6)

where \({\bf H}_{c} \in {\bf R}^{M\times p_{c}}\) and \(p_{c} = T\times n_{c}\) is the total number of sample points over all EEG trials belonging to class c.
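As a sanity check, the factorization (4)–(6) is straightforward to assemble; the sketch below (our notation, same assumptions as above) builds \({\bf H}_{c}\) so that \({\bf H}_{c}{\bf H}_{c}^{T}\) reproduces \({\bf R}_{c}\) from (1).

```python
import numpy as np

def build_H(trials_c):
    """Stack the normalized trials into H_c of Eq. (5), M x (T * n_c),
    so that build_H(trials_c) @ build_H(trials_c).T equals R_c."""
    n_c = len(trials_c)
    cols = [X / np.sqrt(n_c * np.trace(X @ X.T)) for X in trials_c]
    return np.hstack(cols)

# H_t = np.hstack([build_H(trials_1), build_H(trials_2)])   # Eq. (6)
# When R_t is nonsingular, the CSP filters of (3) can be found with
# scipy.linalg.eigh(R_c, R_t); the GSVD route below avoids that restriction.
```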

Kernel-based GCSP

In this section, we present a nonlinear extension of CSP based on kernel functions and the GSVD. By using the kernel method, we can work in the feature space through kernel functions, as long as the problem formulation depends only on inner products between data points. Suppose that the input space is mapped into a Hilbert space through a nonlinear mapping function:

$$ \phi : {\bf R}^{N} \to {\mathcal{F}}, \quad {\bf x} \mapsto \phi({\bf x}). $$
(7)

In the nonlinear mapping space, the inner product is denoted by a kernel function,

$$ \langle\phi({\bf a}), \phi({\bf b})\rangle = k({\bf a},{\bf b}). $$
(8)

We denote the nonlinear mapping of a matrix X which represents a single trial of EEG signals as

$$ {\bf X} \mapsto \phi({\bf X}), \quad {i.e.},\quad [{\bf x}_{1}\ldots{\bf x}_{n}] \mapsto [\phi({\bf x}_{1})\ldots\phi({\bf x}_{n})]. $$
(9)

To apply the kernel method to CSP in the feature space instead of the original input space, the class-c covariance matrix in (1) can be expressed as

$$ {\mathcal{R}}_{c} = \frac{1}{n_{c}}\sum_{{\bf X}_{j}\in \hbox{class}\,c} \frac{\phi({\bf X}_{j})\phi({\bf X}_{j})^{T}}{\hbox{trace}(\phi({\bf X}_{j})\phi({\bf X}_{j})^{T})}, \quad c=1,2. $$
(10)

The normalization term for \({\bf X}_{j}\) in the feature space can be expressed through the kernel function:

$$ h_{{\bf X}_{j}} = \hbox{trace}(\phi({\bf X}_{j})\phi({\bf X}_{j})^{T}) = \sum_{{\bf x}_{k}\in{\bf X}_{j}}{k({\bf x}_{k},{\bf x}_{k})}. $$
(11)
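For instance, with a Gaussian kernel, (11) is trivial to evaluate; the helper below is a minimal sketch, and the kernel choice is only an example.

```python
import numpy as np

def h_trial(X, kernel):
    """Feature-space normalization term of Eq. (11): the sum of
    k(x_k, x_k) over the sample points (columns) x_k of the trial X."""
    return sum(kernel(X[:, k], X[:, k]) for k in range(X.shape[1]))

# Example: Gaussian kernel; here k(x, x) = 1, so h_trial simply returns T.
gaussian = lambda a, b, sigma=1.0: np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma**2))
```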

Similarly to (4)–(6), we also have

$$ {\mathcal{R}}_{c} = {\mathcal{H}}_{c}{\mathcal{H}}_{c}^{T} \quad \hbox{and} \quad {\mathcal{R}}_{t} = {\mathcal{H}}_{t}{\mathcal{H}}_{t}^{T}, $$
(12)

with

$$ {\mathcal{H}}_{c} = \left[\frac{\phi({\bf X}_{1})}{\sqrt{n_{c} \sum_{{\bf x}_{k}\in{\bf X}_{1}}{k({\bf x}_{k},{\bf x}_{k})}}}, \ldots, \frac{\phi({\bf X} _{j})}{\sqrt{n_{c} \sum_{{\bf x}_{k}\in{\bf X}_{j}}{k({\bf x}_{k},{\bf x}_{k})}}}\right]_{\mathop{{\bf X}_{j}\in \hbox{class}\,c,}\limits_{1\leq j\leq n_{c}}}, $$
(13)
$$ {\mathcal{H}}_{t} = [{\mathcal{H}}_{1}, \quad {\mathcal{H}}_{2}], $$
(14)

where \({\mathcal{H}}_{c}\in {\mathcal{R}}^{M\times p_{c}}, p_{c}= T\times n_{c}\).

Then the CSP in the \({\mathcal{F}}\) space can also be expressed as a Rayleigh quotient,

$$ \varphi = \arg\max_\varphi\frac{\varphi^{T}{\mathcal{R}}_{c}\varphi} {\varphi^{T}{\mathcal{R}}_{t}\varphi}, $$
(15)

which can be solved via the generalized eigenvalue problem

$$ {\mathcal{R}}_{c}\varphi = \lambda {\mathcal{R}}_{t}\varphi. $$
(16)

Hence we can restrict the solution space of (16) to \(\hbox{span}\{\phi({\bf x}_{i})\}\). Let φ be represented as a linear combination of the mapped sample points \(\phi({\bf x}_{i})\):

$$ \varphi = \sum_{i=1}^{P}\alpha_{i}\phi({\bf x}_{i}) = {\bf Y}{\alpha}, $$
(17)

where \({\bf Y} = [\phi({\bf x}_{1}), \ldots, \phi({\bf x}_{P})]\), \(\alpha = [\alpha_{1}, \ldots, \alpha_{P}]^{T}\), and P is the total number of sample points over all EEG trials. According to (12) and (17), equation (16) can be rewritten, after left-multiplying by \({\bf Y}^{T}\), as

$$ ({\bf Y}^{T}{\mathcal{H}}_{c})({\bf Y}^{T}{\mathcal{H}}_{c})^{T}{\alpha} =\lambda ({\bf Y}^{T}{\mathcal{H}}_{t})({\bf Y}^{T}{\mathcal{H}}_{t})^{T}{\alpha}. $$
(18)

Now \(\alpha\) is the new eigenvector in the \({\mathcal{F}}\) space. The matrix \({\mathcal{H}}_{c}^{T}{\bf Y}\) can be computed and denoted by

$$ {\mathcal{K}}_{c}^{T} ={\mathcal{H}}_{c}^{T}{\bf Y} = \left[ \begin{array}{ccc} \frac{k({\bf x}_{1},{\bf x}_{1})}{\sqrt{n_{c} h_{{\bf x}_{1}}}} & \ldots& \frac{k({\bf x}_{1},{\bf x}_{P})}{\sqrt{n_{c} h_{{\bf x}_{1}}}}\\ \vdots &\ddots&\vdots\\ \frac{k({\bf x}_{p_{c}},{\bf x}_{1})}{\sqrt{n_{c} h_{{\bf x}_{p_{c}}}}}& \ldots &\frac{k({\bf x}_{p_{c}},{\bf x}_{P})}{\sqrt{n_{c} h_{{\bf x}_{p_{c}}}}}\\ \end{array}\right], $$
(19)

where \(h_{{\bf x}_{j}} = h_{{\bf X}_{j}}\) for \({\bf x}_{j} \in {\bf X}_{j}\). We define \(N = n_{1} + n_{2}\) as the total number of EEG trials in the two-class case and \(P = p_{1} + p_{2}\) as the total number of sample points over both classes. Hence,

$$ {\mathcal{K}}_{t}^{T} ={\mathcal{H}}_{t}^{T}{\bf Y} = \left[ \begin{array}{ccc} \frac{k({\bf x}_{1},{\bf x}_{1})}{\sqrt{n_{c} h_{{\bf x}_{1}}}} &\ldots& \frac{k({\bf x}_{1},{\bf x}_{P})}{\sqrt{n_{c} h_{{\bf x}_{1}}}}\\ \vdots &\ddots&\vdots\\ \frac{k({\bf x}_{P},{\bf x}_{1})}{\sqrt{n_{c} h_{{\bf x}_{P}}}}&\ldots&\frac{k({\bf x}_{P},{\bf x}_{P})}{\sqrt{n_{c} h_{{\bf x}_{P}}}} \end{array} \right]. $$
(20)
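The entries of (19) and (20) involve only kernel evaluations, so the matrices can be built directly. The sketch below (helper names are ours, reusing `h_trial` and `gaussian` from above) normalizes each row by the \(n_{c}\) and \(h_{{\bf X}_{j}}\) of the trial that the row's sample point belongs to, and obtains \({\mathcal{K}}_{t}^{T}\) by stacking the two class blocks, since \({\mathcal{H}}_{t} = [{\mathcal{H}}_{1}, {\mathcal{H}}_{2}]\) implies that \({\mathcal{K}}_{t}^{T}\) stacks \({\mathcal{K}}_{1}^{T}\) on \({\mathcal{K}}_{2}^{T}\).

```python
import numpy as np

def build_KT(row_trials, all_points, n_c, kernel):
    """Normalized kernel matrix K_c^T of Eq. (19), shape p_c x P.

    row_trials: list of M x T trials whose sample points index the rows;
    all_points: M x P array of all training sample points (columns of Y).
    """
    rows = []
    for X in row_trials:
        h_X = h_trial(X, kernel)                  # Eq. (11)
        scale = 1.0 / np.sqrt(n_c * h_X)
        for k in range(X.shape[1]):
            rows.append(scale * np.array(
                [kernel(X[:, k], all_points[:, i])
                 for i in range(all_points.shape[1])]))
    return np.array(rows)

# K1T = build_KT(trials_1, Y, len(trials_1), gaussian)   # p_1 x P
# K2T = build_KT(trials_2, Y, len(trials_2), gaussian)   # p_2 x P
# KtT = np.vstack([K1T, K2T])                            # P x P, Eq. (20)
```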

Then (18) is equivalent to

$$ {\mathcal{K}}_{c}{\mathcal{K}}_{c}^{T}{\alpha} = \lambda {\mathcal{K}}_{t}{\mathcal{K}}_{t}^{T}{\alpha}. $$
(21)

In order to relax the nonsingularity restriction on \({\mathcal{K}}_{c}{\mathcal{K}}_{c}^{T}\) and \({\mathcal{K}}_{t}{\mathcal{K}}_{t}^{T}\), we apply the GSVD to the pair \(({\mathcal{K}}_{c}^{T}, {\mathcal{K}}_{t}^{T})\). Then

$$ {\bf U}^{T}{\mathcal{K}}_{c}^{T}{\bf X} = [\Upgamma_{c} \quad 0] \quad \hbox{and} \quad {\bf V}^{T}{\mathcal{K}}_{t}^{T}{\bf X} = [\Upgamma_{t} \quad 0], $$
(22)

where U and V are orthogonal, X is nonsingular, \(\Upgamma_{c}^{T}\Upgamma_{c} + \Upgamma_{t}^{T}\Upgamma_{t} = {\bf I}\), and \(\Upgamma_{c}^{T}\Upgamma_{c}, \Upgamma_{t}^{T}\Upgamma_{t}\) are diagonal matrices with nonincreasing and nondecreasing diagonal components, respectively. The simultaneous diagonalizations of \({\mathcal{K}}_{c}{\mathcal{K}}_{c}^{T}\) and \({\mathcal{K}}_{t}{\mathcal{K}}_{t}^{T}\) can then be obtained as

$$ {\bf X}^{T}{\mathcal{K}}_{c}{\mathcal{K}}_{c}^{T}{\bf X} =\left[ \begin{array}{cc} \Upgamma_{c}^{T}\Upgamma_{c} &0 \\ 0 &0\\ \end{array}\right] \quad \hbox{and} \quad {\bf X}^{T}{\mathcal{K}}_{t}{\mathcal{K}}_{t}^{T}{\bf X} =\left[ \begin{array}{cc} \Upgamma_{t}^{T}\Upgamma_{t} &0 \\ 0 & 0\\ \end{array}\right]. $$
(23)
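NumPy and SciPy do not ship a ready-made GSVD routine, so the following is a minimal sketch of one possible implementation (our own, not taken from the paper): an SVD of the stacked pair followed by a CS-decomposition-style step. It returns the factors of (22) and (23) up to the numerical rank of the stacked matrix.

```python
import numpy as np

def gsvd(A, B, tol=1e-10):
    """GSVD of the pair (A, B) as in Eq. (22):
    U.T @ A @ X ~ [diag(gc), 0], V.T @ B @ X ~ [diag(gt), 0],
    with gc**2 + gt**2 = 1 columnwise (gc nonincreasing)."""
    m = A.shape[0]
    P, s, Wt = np.linalg.svd(np.vstack([A, B]))
    r = int(np.sum(s > tol * s[0]))          # numerical rank of [A; B]
    P1, P2 = P[:m, :r], P[m:, :r]            # P1.T @ P1 + P2.T @ P2 = I_r
    U, c, Zt = np.linalg.svd(P1)
    gc = np.zeros(r)
    gc[:c.size] = np.clip(c, 0.0, 1.0)
    gt = np.sqrt(1.0 - gc**2)
    PZ = P2 @ Zt.T                           # columns have norms gt
    V = PZ / np.where(gt > tol, gt, 1.0)     # zero columns left as zeros
    Xr = (Wt[:r].T * (1.0 / s[:r])) @ Zt.T   # first r columns of X
    return U, V, np.hstack([Xr, Wt[r:].T]), gc, gt

# U, V, X, gc, gt = gsvd(KcT, KtT)   # apply to the pair of Eq. (22)
```

The columns of X returned by this sketch realize the simultaneous diagonalization (23) on the range of the stacked matrix, with the remaining columns completing a basis of the null space.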

A regularized GCSP criterion can be introduced by adding a regularization parameter η to (14):

$$ {\mathcal{H}}_{t} = [\eta{\mathcal{H}}_{1}, \quad (1-\eta){\mathcal{H}}_{2}], \quad 0 \leq \eta \leq 1. $$
(24)

Large values of η put emphasis on the first class, and vice versa. An appropriate value for η can be obtained by cross-validation.
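Since \({\mathcal{K}}_{t}^{T} = {\mathcal{H}}_{t}^{T}{\bf Y}\), the weighting in (24) carries over directly to the kernel matrix as a row-block scaling; a one-line sketch (the function name is ours):

```python
import numpy as np

def regularized_KtT(K1T, K2T, eta):
    """Row-block weighting of K_t^T induced by Eq. (24); eta in [0, 1],
    with larger eta putting more emphasis on the first class."""
    return np.vstack([eta * K1T, (1.0 - eta) * K2T])
```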

The columns of X in (22) solve (21). Let \(\alpha\) be the matrix formed by the first r (by default 2) columns of X. For each class c, we obtain the corresponding \(\alpha\); in the end, we combine the \(\alpha\) of both classes into the final set of spatial filters. Hence, for any new input EEG trial Z, the nonlinear spatial filtering by the j-th filter is given by

$$ \varphi_{j}^{T}\phi({\bf Z}) = {\alpha}_{j}^{T}{\bf Y}^{T}\phi({\bf Z}) = \left[\sum_{i=1}^{P}\alpha_{ji} k({\bf x}_{i},{\bf z}_{1}),\ldots, \sum_{i=1}^{P}\alpha_{ji} k({\bf x}_{i},{\bf z}_{T})\right]. $$
(25)

The energies of the filtered EEG trial in (25) can therefore be used to create the feature vectors that serve as classifier input.
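Putting the pieces together, filtering a new trial per (25) again requires only kernel evaluations against the stored training points; the sketch below also computes the energy features mentioned above (the log-variance form is a common choice we adopt here, not prescribed by the paper).

```python
import numpy as np

def filter_trial(alpha, train_points, Z, kernel):
    """Nonlinear spatial filtering of a new M x T trial Z, Eq. (25).

    alpha: P x r coefficient matrix (one column per spatial filter);
    train_points: M x P matrix of training sample points (columns of Y).
    Returns the r x T filtered source signals."""
    KZ = np.array([[kernel(train_points[:, i], Z[:, t])
                    for t in range(Z.shape[1])]
                   for i in range(train_points.shape[1])])   # P x T
    return alpha.T @ KZ

# Energy of each filtered signal as the classifier feature vector:
# feats = np.log(np.var(filter_trial(alpha, Y, Z, gaussian), axis=1))
```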

Experimental results

We conduct classification experiments on real EEG signals to discriminate between imagination of left hand movements (first class) and right hand movements (second class). Data are taken from two subjects, S1 and S2. At second 2 of each trial, a symbol indicating a specific mental task is displayed. At second 5, the screen goes blank so that the subject can relax until the start of the next trial. EEG signals are recorded from 6 electrodes at a sampling rate of 256 Hz. As preprocessing, the EEG signals are bandpass filtered between 8 and 30 Hz. Each trial is split into non-overlapping time segments of 1.5 s length prior to the calculation of the spatial filters.
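This preprocessing can be reproduced with standard SciPy tools; the sketch below assumes a 4th-order Butterworth filter (our assumption, as the paper does not state the filter type or order).

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(raw, fs=256, band=(8.0, 30.0), seg_sec=1.5):
    """Band-pass filter a raw M x T trial between 8 and 30 Hz and split it
    into non-overlapping 1.5 s segments, as described above."""
    b, a = butter(4, band, btype="bandpass", fs=fs)  # 4th order: our choice
    filtered = filtfilt(b, a, raw, axis=1)
    seg = int(seg_sec * fs)
    return [filtered[:, i * seg:(i + 1) * seg]
            for i in range(filtered.shape[1] // seg)]
```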

Fig. 1 shows the data distribution of source signals filtered by the GCSP spatial filters. It is evident that, along each direction, one class has maximal variance while the other has minimal variance. Therefore, the energies along the corresponding axes are enhanced by the kernel feature extractor. In this study, the polynomial kernel function \(k({\bf x},{\bf y}) = ({\bf x}^{T}{\bf y})^{d}\) with d = 2 and the Gaussian kernel function \(k({\bf x},{\bf y}) = \exp(-{\|{\bf x}-{\bf y}\|^{2}}/{2\sigma^{2}})\) are applied in GCSP, and linear support vector machine (SVM) classifiers are used to assess the classification performance. The optimal value of the parameter σ is determined through 5-fold cross-validation on the training data. Table 1 gives the classification results of the GCSP and linear CSP methods with a small number of EEG training trials. For subject S2, linear CSP and GCSP yield roughly similar performance, while for subject S1 the results obtained by GCSP are generally better than those of linear CSP. These results empirically demonstrate the superiority of the nonlinear feature extractor.
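The two kernels and the model-selection step can be sketched as follows; the scikit-learn-based cross-validation is our illustration of the described procedure, and the per-σ GCSP feature extraction (`feats_by_sigma`) is assumed to happen upstream.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Kernels used with GCSP: polynomial with d = 2, and Gaussian.
poly = lambda a, b, d=2: (a @ b) ** d
gauss = lambda a, b, sigma=1.0: np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma**2))

def pick_sigma(feats_by_sigma, labels, sigmas):
    """5-fold cross-validation on the training data to select sigma.
    feats_by_sigma[s]: feature matrix from the GCSP pipeline with sigma=s."""
    score = lambda s: cross_val_score(SVC(kernel="linear"),
                                      feats_by_sigma[s], labels, cv=5).mean()
    return max(sigmas, key=score)
```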

Fig. 1

The feature distribution of source signals filtered by nonlinear spatial filters obtained from GCSP algorithm

Table 1 Classification accuracies (%) of the linear CSP and nonlinear GCSP algorithms by cross-validation

However, the disadvantage of GCSP lies in the time and memory complexity of the algorithm. Consider the GSVD of the kernel matrices \({\mathcal{K}}_{c}, {\mathcal{K}}_{t}\) in Eq. (21). These kernel matrices have a relatively large dimension in typical EEG classification problems. To put this into perspective, assume 20 EEG trials per class are given, each with a length of 1000 samples; then \({\mathcal{K}}_{c}\) and \({\mathcal{K}}_{t}\) will have about \(20{,}000^{2}\) and \(40{,}000^{2}\) elements, respectively. The GSVD of two such matrices imposes a high computational burden.

Conclusion

In this paper, a kernel CSP approach based on the GSVD has been described as a nonlinear spatial filtering method. For real BCI applications, one tends to use as few training trials as possible, and the kernel feature extractor proposed in this paper meets this need fairly well. One advantage of GCSP is that it can be applied regardless of the singularity of the spatial covariance matrices, both in the original space and in the feature space. In the future, the design of kernel functions tailored to individual subjects, as well as applications beyond EEG signal classification, will be investigated.