
1 Introduction

The aim of a brain-computer interface (BCI) is to establish a direct communication link between the brain and external electronic devices, whereby brain signals are translated into useful commands. Such a communication link would provide people suffering from severe muscular (motor) disabilities with an alternative means of communication and control that bypasses the normal output pathways [13]. In this paper, we focus on an important sub-component of BCI systems, namely feature extraction. The aim of this sub-component is to identify a set of features that are effective in discriminating between different classes of interest.

Transform-based approaches form an important class of feature extraction techniques. Their aim is to find a more compact, lower-dimensional representation in which most of the signal’s information is packed into a small number of uncorrelated coefficients. By eliminating irrelevant features (transform coefficients), these methods allow the extraction of effective features that preserve the generalization capability while lessening the computational complexity of the classification stage [4]. Transform-based approaches can be subdivided into linear and nonlinear, supervised and unsupervised, and signal-dependent and signal-independent methods. The most widely used linear techniques are principal component analysis (PCA) and linear discriminant analysis (LDA). The former is unsupervised and projects the data onto a low-dimensional subspace, called the principal subspace, spanned by the leading eigenvectors of the sample covariance matrix, so as to maximize the variance of the projected data. In contrast, the latter is supervised and attempts to find a linear mapping that maximizes the linear class separability of the data in a low-dimensional space [5].

Recently, the authors introduced a signal-dependent linear orthogonal transform, referred to as the LP-SVD transform [6]. The transform has the advantage that its transformation matrix is formed using only the autoregressive (AR) model parameters, rather than the data samples as in PCA. This transform is used in this paper to map EEG data into a new domain where only a few spectral coefficients contain most of the signal’s energy. A subset of these transform coefficients, in conjunction with the LP coefficients and the error variance, was used as features in the classification of EEG into four-class motor imagery tasks. The feature extraction method was validated using dataset IIIa of BCI competition III, and its classification capability was assessed against two state-of-the-art methods based on the DCT and the AAR model.

The rest of the paper is organized as follows. Section 2.1 describes the EEG data, its acquisition, and its pre-processing. Sections 2.2 and 2.3 introduce the LP-SVD transform and how it is used in feature extraction. Section 3 compares the classification performance of the proposed LP-SVD-based technique against methods based on two of the most widely used linear transforms for feature extraction. Section 4 concludes the paper.

2 Methodology

2.1 Data Acquisition and Pre-processing

The dataset IIIa from BCI competition III (2005) [7] was used to evaluate the effectiveness of the proposed feature extraction method. It is a widely used benchmark dataset of multiclass motor imagery tasks recorded from three subjects, referred to as K3b, K6b and L1b. The multichannel EEG signals were recorded using a 64-channel Neuroscan EEG amplifier (Compumedics, Charlotte, North Carolina, USA). Only 60 EEG channels were actually recorded from the scalp of each subject, using the 10–20 system and a referential montage. The left and right mastoids served as reference and ground, respectively. The recorded signal was sampled at 250 Hz and filtered using a bandpass filter with cut-off frequencies of 1 and 50 Hz. A notch filter was then applied to suppress power-line interference. During the experiments, each subject was instructed to perform imagery movements associated with visual cues. Each trial started with an empty black screen at t = 0 s. At t = 2 s, a short beep tone was presented and a cross ‘+’ appeared on the screen to attract the subject’s attention. At t = 3 s, an arrow pointing in one of the four main directions (left, right, upwards or downwards) was presented. Each of the four directions instructed the subject to imagine one of the following four movements: left hand, right hand, tongue or foot, respectively. The imagination process was performed until the cross disappeared at t = 7 s. Each of the four cues was displayed ten times in each run, in random order. No feedback was provided to the subject. The recorded dataset from subject K3b consists of 9 runs, while those from K6b and L1b consist of 6 runs each, which resulted in 360 trials for subject K3b and 240 trials for each of the other two subjects.

2.2 The LP-SVD Transform

The LP-SVD transform is constructed in two steps: the estimation of the LPC filter coefficients, and the computation of the left singular vectors of the LPC filter impulse response matrix using singular value decomposition (SVD).

Linear prediction (LP) consists of predicting the current signal observation, \( y\left( n \right) \), using a linear combination of its P past samples, namely \( y\left( {n - i} \right) \) for \( i = 1, \ldots , P \). This can be expressed mathematically as [8]

$$ y\left( n \right) = - \mathop \sum \limits_{i = 1}^{P} a_{i} y\left( {n - i} \right) + e\left( n \right), $$
(1)

where \( a_{i} \) are the linear prediction coefficients (LPCs), P is the prediction order and \( e\left( n \right) \) is the prediction error. Equation (1) can be written in a more compact form using the following matrix notation:

$$ \varvec{y} = \varvec{He}, $$
(2)

where \( \varvec{y} = [y\left( 1 \right), \ldots ,y\left( N \right)]^{T} \) and \( \varvec{e} = [e\left( 1 \right), \ldots ,e\left( N \right)]^{T} \) are, respectively, the N × 1 column vectors of the data samples and the prediction residual, while H is the N × N impulse response matrix of the synthesis filter (also called the LPC filter), whose entries are completely determined by the linear prediction coefficients \( a_{i} \). The matrix H is lower triangular and Toeplitz. Applying the SVD to H gives:

$$ \varvec{y} = \varvec{UDV}^{\varvec{T}} \varvec{e} $$
(3)

where U and V are the N × N orthogonal matrices containing the left and right singular vectors of H, respectively, and D is the N × N diagonal matrix of singular values [9].
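For reference, the entries of H can be written out explicitly from Eq. (1). The following is a sketch under the assumption, made here for illustration, that the samples preceding the segment are zero; \( h\left( n \right) \) denotes the impulse response of the synthesis filter:

$$ h\left( 0 \right) = 1,\quad h\left( n \right) = - \sum\limits_{i = 1}^{\min (n,P)} a_{i} h\left( {n - i} \right)\;\;(n \ge 1),\quad \left[ \varvec{H} \right]_{n,j} = \left\{ {\begin{array}{*{20}l} {h\left( {n - j} \right)} \hfill & {{\text{for }}n \ge j} \hfill \\ 0 \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. $$

For a first-order model this reduces to \( h\left( n \right) = \left( { - a_{1} } \right)^{n} \), so each column of H is a shifted geometric sequence.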

We define the transformation that maps the measurement vector \( \left( \varvec{y} \right) \) to a feature vector (\( \varvec{\theta} \)) by [6]:

$$ \varvec{\theta}= \varvec{U}^{\varvec{T}} \varvec{y} $$
(4)

It is important to note that the transform operation \( (\varvec{U}^{\varvec{T}} \varvec{y}) \) by itself does not achieve any dimensionality reduction. It only decorrelates the signal and packs a large fraction of its energy into relatively few transform coefficients, as shown in Fig. 1.

Fig. 1. Signal transformation using LP-SVD: (a) original EEG signal trace from subject L1b, (b) transform coefficients with AR(1) as a signal model.
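The construction in Eqs. (2) to (4) can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: the function names, the AR(1) coefficient and the white-noise residual are all assumptions chosen for the example, and samples before the start of the segment are taken to be zero.

```python
import numpy as np

def synthesis_impulse_response(a, n_samples):
    """Impulse response h of the all-pole filter implied by Eq. (1):
    h(0) = 1 and h(n) = -sum_i a[i-1] * h(n-i) for n >= 1."""
    h = np.zeros(n_samples)
    h[0] = 1.0
    for n in range(1, n_samples):
        for i in range(1, min(n, len(a)) + 1):
            h[n] -= a[i - 1] * h[n - i]
    return h

def impulse_response_matrix(a, n_samples):
    """Lower-triangular Toeplitz matrix H with H[n, j] = h(n - j)."""
    h = synthesis_impulse_response(a, n_samples)
    H = np.zeros((n_samples, n_samples))
    for j in range(n_samples):
        H[j:, j] = h[: n_samples - j]
    return H

# Hypothetical AR(1) example: y(n) = 0.9 y(n-1) + e(n), i.e. a_1 = -0.9.
rng = np.random.default_rng(0)
N = 64
a = np.array([-0.9])
e = rng.standard_normal(N)        # stand-in for the prediction residual

H = impulse_response_matrix(a, N)
y = H @ e                         # Eq. (2)
U, d, Vt = np.linalg.svd(H)       # Eq. (3): H = U diag(d) Vt
theta = U.T @ y                   # Eq. (4)

# Cumulative fraction of the signal energy captured by the leading coefficients.
energy_fraction = np.cumsum(theta ** 2) / np.sum(theta ** 2)
print(energy_fraction[:8])
```

Because U is orthogonal, the total energy of theta equals that of y; the printed cumulative fractions only show how that energy is distributed over the coefficients for this particular synthetic signal.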

2.3 LP-SVD-Based Feature Extraction

Our approach involves extracting features from each EEG segment. These features include the estimated LP coefficients \( \left( {a_{i} } \right) \), the prediction error variance \( \left( {Vr} \right) \), and a subset of the most significant transform coefficients \( \varvec{\theta} \). These features are described below.

According to the above LP analysis, the EEG vector is described in terms of the all-pole filter coefficients and the prediction error. There are two classical approaches to estimating the LP parameters, namely the autocorrelation and the covariance methods. In this study, we used the autocorrelation method, as it guarantees the stability of the filter and allows the efficient Levinson-Durbin recursion to be used to estimate the model parameters [8]. Once the coefficients are estimated, the prediction error sequence can be computed using Eq. (1). The variance of the prediction error e(n) is estimated by:

$$ Vr = \frac{1}{N - 1}\mathop \sum \limits_{n = 1}^{N} \left( {e\left( n \right) - \bar{e}} \right)^{2} , $$
(5)

where \( \bar{e} \) is the arithmetic mean of the prediction error vector e and N is its length.
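The estimation steps described above can be sketched as follows: autocorrelation estimates, the Levinson-Durbin recursion, the residual from Eq. (1) and the variance from Eq. (5). This is an illustrative sketch, not the authors' code; the function names and the synthetic AR(1) test signal are assumptions, and samples before the start of the segment are treated as zero when computing the residual.

```python
import numpy as np

def autocorrelation(y, max_lag):
    """Biased autocorrelation estimates r(0..max_lag)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    return np.array([np.dot(y[k:], y[: n - k]) for k in range(max_lag + 1)]) / n

def levinson_durbin(r, order):
    """Return (a_1..a_P, final prediction error power) for
    A(z) = 1 + sum_i a_i z^-i, matching the sign convention of Eq. (1)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        prev = a.copy()
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= 1.0 - k * k
    return a[1:], err

def prediction_error(y, a):
    """e(n) = y(n) + sum_i a_i y(n-i), with earlier samples taken as zero."""
    y = np.asarray(y, dtype=float)
    e = y.copy()
    for i, a_i in enumerate(a, start=1):
        e[i:] += a_i * y[:-i]
    return e

# Synthetic AR(1) test signal (hypothetical data, not EEG).
rng = np.random.default_rng(1)
N = 2000
innovations = rng.standard_normal(N)
y = np.zeros(N)
y[0] = innovations[0]
for n in range(1, N):
    y[n] = 0.9 * y[n - 1] + innovations[n]

P = 1
r = autocorrelation(y, P)
a, _ = levinson_durbin(r, P)
e = prediction_error(y, a)
Vr = np.var(e, ddof=1)            # Eq. (5): divides by N - 1
print(a, Vr)
```

With the sign convention of Eq. (1), a signal generated as y(n) = 0.9 y(n-1) + e(n) should give an estimate of a_1 close to -0.9.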

The data vector y is represented in the new coordinate system \( \{ \varvec{u}_{\varvec{i}} \} \) by the transform coefficients, or scores, \( \theta_{i} \). The transform coefficients corresponding to the K largest singular values are selected as features:

$$ \hat{\varvec{\theta }} = \hat{\varvec{U}}^{\varvec{T}} \varvec{y}, $$
(6)

where the columns of \( \hat{\varvec{U}} \) are \( \left\{ {\varvec{u}_{1} ,\varvec{u}_{2} , \ldots ,\varvec{u}_{K} } \right\} \).
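As a rough illustration of how the three groups of features could be assembled for one channel, the sketch below keeps the first K scores and concatenates them with the LP coefficients and the error variance. The function names, the random input segment and the values of a_1 and Vr are hypothetical; the closed-form H is specific to a first-order model and assumes zero samples before the segment.

```python
import numpy as np

def ar1_impulse_response_matrix(a1, n_samples):
    """H for an AR(1) model: H[n, j] = (-a1)**(n - j) for n >= j, else 0."""
    lag = np.subtract.outer(np.arange(n_samples), np.arange(n_samples))
    return np.where(lag >= 0, (-a1) ** np.clip(lag, 0, None), 0.0)

def lp_svd_features(y, a, vr, U, k):
    """Concatenate the LP coefficients, the error variance and the first k scores."""
    theta_hat = U[:, :k].T @ y            # projection onto u_1..u_K
    return np.concatenate((np.atleast_1d(a), [vr], theta_hat))

# Hypothetical inputs: one 501-sample segment and illustrative a_1 and Vr values.
rng = np.random.default_rng(2)
N, K = 501, 4
y = rng.standard_normal(N)
a1, vr = -0.9, 1.0

# numpy returns singular values in descending order, so the first K columns
# of U correspond to the K largest singular values.
U, d, _ = np.linalg.svd(ar1_impulse_response_matrix(a1, N))
features = lp_svd_features(y, a1, vr, U, K)
print(features.shape)   # (6,): one LP coefficient, Vr and four scores
```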
a. DCT-based feature extraction procedure

The DCT is a signal-independent, real-valued, orthogonal transform that is asymptotically equivalent to the optimal principal component analysis (PCA) for highly correlated first-order stationary autoregressive signals [10]. The orthonormal basis vectors \( \varvec{w}_{k} \) of an N-point discrete cosine transform (DCT-II) are given by:

$$ \varvec{w}_{k} = \left\{ {\begin{array}{*{20}l} {\frac{1}{\sqrt N } \left( {1,1, \ldots ,1} \right)^{\varvec{T}} } \hfill & {{\text{for }}k = 1} \hfill \\ {\sqrt {\frac{2}{N}} \left( {\cos \frac{{\left( {k - 1} \right)\pi }}{2N} ,\cos \frac{{3\left( {k - 1} \right)\pi }}{2N} , \ldots ,\cos \frac{{\left( {2N - 1} \right)\left( {k - 1} \right)\pi }}{2N} } \right)^{\varvec{T}} } \hfill & {{\text{for }}k = 2, \ldots ,N} \hfill \\ \end{array} } \right. $$
(7)

The N × N orthogonal DCT matrix is then defined as \( \varvec{W} = (\varvec{w}_{1} , \ldots ,\varvec{w}_{N} ) \). It follows immediately that the relation between a data vector y and its DCT transform Y is given by:

$$ \varvec{Y} = \varvec{W}^{\varvec{T}} \varvec{y} $$
(8)

The resulting DCT coefficients, represented by the vector Y, are concentrated in the low-frequency subspace, as shown in Fig. 2. Dimensionality reduction using the DCT is realized by using only these low-frequency coefficients as features and discarding the remaining high-frequency coefficients. This is expressed by the following linear mapping:

$$ \hat{\varvec{Y}} = \hat{\varvec{W}}^{\varvec{T}} \varvec{y} , $$
(9)

where the columns of \( \hat{\varvec{W}} \) are \( \left\{ {\varvec{w}_{1} ,\varvec{w}_{2} , \ldots ,\varvec{w}_{K} } \right\} \).

Figure 2 shows an example DCT coefficient vector for the EEG data of Fig. 1. The energy of the transformed data is packed into the first few low-frequency coefficients, while all high-frequency coefficients are relatively small.

Fig. 2. DCT-II transform of the EEG signal shown in Fig. 1(a).
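The DCT-based procedure can be sketched in the same way. The code below builds the standard orthonormal DCT-II basis, with scaling \( 1/\sqrt N \) for the first vector and \( \sqrt {2/N} \) for the others and frequency index k - 1 for the k-th vector, and then keeps the first K coefficients. The function name, the values of N and K, and the random-walk test signal are assumptions made for illustration.

```python
import numpy as np

def dct2_basis(n_points):
    """Orthonormal DCT-II basis; column k-1 holds the basis vector w_k."""
    n = np.arange(1, n_points + 1)           # sample index 1..N
    k = np.arange(1, n_points + 1)           # basis index 1..N
    W = np.cos(np.outer(2 * n - 1, k - 1) * np.pi / (2 * n_points))
    W *= np.sqrt(2.0 / n_points)
    W[:, 0] = 1.0 / np.sqrt(n_points)        # constant vector w_1
    return W

# Hypothetical test signal: a random walk, which is strongly low-pass.
rng = np.random.default_rng(3)
N, K = 128, 20
y = np.cumsum(rng.standard_normal(N))

W = dct2_basis(N)
Y = W.T @ y                 # full transform, Eq. (8)
Y_hat = W[:, :K].T @ y      # K low-frequency coefficients kept as features

# Fraction of the signal energy retained by the K kept coefficients.
print(np.sum(Y_hat ** 2) / np.sum(y ** 2))
```

Unlike the LP-SVD basis, W depends only on N, so it can be computed once and reused for every segment and channel.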

3 Experimental Results and Discussion

This section is divided into two parts. The first part is devoted to the AR model order selection. The second part evaluates the performance of the LP-SVD-based feature extraction method against two well-known related feature extraction methods. The classifier used to measure the performance is a logistic model tree, as implemented in the Weka software package, with its default parameters [11]. This classifier, which builds on SimpleLogistic, has an advantage over other classifiers owing to its use of LogitBoost. To evaluate the classification results, we used 10-fold cross-validation, in which the data are randomly split into 10 folds of equal size.
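The fold construction used in 10-fold cross-validation can be illustrated with a short sketch. This is not Weka's implementation; the function names and the seed are hypothetical, and the trial counts are those stated for the dataset.

```python
import numpy as np

def k_fold_indices(n_trials, n_folds=10, seed=0):
    """Randomly partition trial indices into n_folds disjoint folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_trials), n_folds)

def train_test_splits(n_trials, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs, each fold serving once as the test set."""
    folds = k_fold_indices(n_trials, n_folds, seed)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

# 360 trials (subject K3b) give folds of 36; 240 trials give folds of 24.
for n_trials in (360, 240):
    sizes = [len(test) for _, test in train_test_splits(n_trials)]
    print(n_trials, sizes)
```

The reported accuracy is then the average over the ten test folds, so every trial is used for testing exactly once.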

3.1 AR Model Selection

To determine the appropriate AR model order and the number of transform coefficients to be retained as features, we performed a series of simulations. In this part, only the parameters characterizing the LP-SVD transform are used as features, namely a subset of the transform coefficients (\( \hat{\varvec{\theta }} \)), the LP coefficients \( \left( {a_{i} } \right) \) and the prediction error variance (Vr). The features were extracted from the electrode sites over the primary motor area: C3, Cz, and C4. These are widely considered to be the most informative channels associated with motor imagery tasks [12].

We varied the AR model order from one to seven, using the EEG segments from t = 3.5 s to t = 5.5 s (501 samples) of each trial. The best model order was selected based on the resulting classification accuracy. In the present context, this criterion is more suitable than the one commonly used in signal representation (modeling), namely the tradeoff between the model order and the prediction error variance. Table 1 shows the classification results as a function of the order of the AR model.

Table 1. AR model order selection

For all subjects, the highest classification accuracy, on average, was obtained with a first-order AR model and a subset of four transform coefficients, with results ranging from 42.08 % for subject L1b to 66.11 % for subject K3b. Therefore, this model order and number of transform coefficients were used in the subsequent analysis.
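The segment length of 501 samples quoted above follows from the 250 Hz sampling rate when both end points of the 3.5 s to 5.5 s window are included. A small sketch, in which the function name and the placeholder trial array are hypothetical:

```python
import numpy as np

FS = 250  # sampling rate in Hz, as stated for the dataset

def segment_slice(t_start, t_end, fs=FS):
    """Slice covering t_start..t_end in seconds, both end points included."""
    first = int(round(t_start * fs))
    last = int(round(t_end * fs))
    return slice(first, last + 1)

# Hypothetical trial array: 7 s of data (samples x channels), zeros as a placeholder.
trial = np.zeros((7 * FS, 3))
segment = trial[segment_slice(3.5, 5.5)]
print(segment.shape)   # (501, 3)
```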

3.2 Feature Extraction Evaluation

This part compares the performance of the proposed feature extraction method with that of two similar approaches based on signal modeling and orthogonal transformation, namely the adaptive autoregressive (AAR) model [12] and the discrete cosine transform (DCT). In particular, Schlögl et al. [12] applied a third-order AAR model for EEG signal analysis. The extracted AAR coefficients, which provide dynamic information about the signal spectrum, served as features. The authors used three different classifiers, namely a neural network based on k-nearest neighbour (kNN), support vector machines (SVM), and linear discriminant analysis (LDA), to classify the EEG signal into one of the four classes described earlier. The results showed that the SVM-based classifier achieved the best accuracies, followed by LDA and then kNN. The authors also reported that the best results were obtained when using the features extracted from all 60 monopolar channels. In this evaluation, we used these same channels to provide a fair comparison between the methods.

To find the number of DCT coefficients that achieves the highest classification performance for each subject, we varied the number of retained DCT coefficients from 5 to 50 with a step size of 5. Table 2 summarizes the classification results as a function of the number of retained DCT coefficients. The numbers of coefficients required to achieve the highest classification accuracies for subjects K6b, L1b and K3b were 15, 40, and 20, respectively.

Table 2. Performance (classification accuracy) of DCT-based feature extraction using 60 Monopolar Channels

The performances of the three feature extraction approaches mentioned above are summarized in Table 3. When only the transform coefficients were used as features, the proposed approach outperformed the DCT-based one by up to 23 % in accuracy (for subject L1b) with 10 times fewer features. When the LP coefficients and the residual error variance were added to the LP-SVD transform coefficients, our technique performed better than the other two methods for subjects L1b and K6b and achieved results comparable to the AAR-based method for subject K3b. On average, the improvement in accuracy was about 25 % compared to the DCT-based method and about 6 % compared to the AAR-based method. It is pertinent to point out that, unlike the DCT, which yields only the transform coefficients as features, our method yields additional features, namely the LP coefficients and the residual signal variance, which led to a better characterization of the signal. In addition, the DCT is signal independent while the proposed transform is signal dependent. These two facts explain the difference in performance between the two methods.

Table 3. Comparative analysis of different features extraction approaches

4 Conclusion

In the present study, we presented a feature extraction approach based on the combination of autoregressive modeling and orthogonal transformation. The results of classification experiments on a benchmark dataset from BCI competition III, and the comparison against closely related approaches, namely DCT and AAR, demonstrate that the proposed feature set is compact and offers a significant improvement in performance as judged by the classification accuracy. The number of transform coefficients was kept constant during all the experiments. It would be interesting to address the issue of parameter tuning in future studies. Future work will also include adding more features to improve the performance beyond that obtained in this study.