1 Introduction

Electroencephalography (EEG) has become more readily available as a method to analyze cortical activity due to its relatively low cost and high temporal resolution [10]. Although EEG data are convenient to collect, the dimensions of the recorded electrical activity can become large very quickly, especially when high-density EEG is recorded over long trials. Because EEG datasets are high-dimensional and noisy, it is difficult to use them for classifying subjects in terms of their movement characteristics. Therefore, it is important to develop methods that analyze EEG data in a reduced-dimensional space while retaining the information needed to classify different gait movements for a given subject.

The purpose of this paper is to introduce a novel procedure that classifies movement characteristics for a given subject using EEG motion artifact data projected by spatial Independent Component Analysis (SICA). The EEG motion artifact data did not actually contain electrophysiological signals from the body. Instead, they consisted of signals that were recorded from an isolated conductive reference cap using an EEG system [17]. This recording method has enabled the development of new artifact rejection techniques to clean EEG signals [22], but these EEG motion artifact data could also potentially be used to classify movement characteristics.

ICA originates from a method for solving problems such as the “cocktail party” problem, where one hopes to identify individual voices when many people are speaking simultaneously and are recorded by multiple devices in different locations. In this case, the temporal ICA (TICA) algorithm assumes independence in time such that the original voices can be extracted from the mixtures [5]. Similar to the cocktail party problem, the signal at each EEG electrode is a mixture of multiple electrophysiological signals, including the true underlying cortical signals, which are assumed to be temporally independent [6]. Therefore, TICA has been most commonly used to analyze EEG data [9]. For this reason, the authors who recorded the first set of EEG motion artifact data in [17] applied TICA and source localization to the data and found that the independent components were mostly outside of the brain volume, which provided evidence that TICA could be used to partially distinguish motion artifact from electrophysiological signals [25]. However, the authors did not attempt to extract movement characteristics from the EEG motion artifact data, which inspired us to investigate that possibility in this paper.

In contrast to TICA, SICA assumes spatially independent components and has been used more commonly in literature which focuses on the analysis of functional magnetic resonance imaging (fMRI) data [7]. However, the temporal dimension of EEG data can be much larger than that of the spatial dimension in many cases [23]. Therefore, our aim is to reduce the temporal dimension and employ SICA on the reduced data instead of using TICA. Consequently, we propose a method which iteratively computes SICA on subsets of the partitioned data, and concatenates the independent components from each iteration.

The rest of this paper is organized as follows. We introduce the techniques of ICA, and then describe the examined data and the proposed data analysis procedure for dimension reduction. Furthermore, the classification methods (k-nearest neighbors, Support Vector Machines, Naive Bayes, and multinomial logistic regression) are described. Finally, we discuss the results of our classification and the implications of our conclusions.

2 Classification Methods

We use the EEG motion artifact data projected by Spatial ICA with the following classifiers: (1) k-nearest neighbor (k-nn), (2) Support Vector Machines, (3) Naive Bayes, and (4) multinomial logistic regression. A brief description of each method follows. In this section, X is the matrix of predictors and Y is the response vector of classification labels.

2.1 k-Nearest Neighbor (k-nn)

We used the k-nearest neighbor classification method with k = 3, which determines the class of each point by majority voting among its k nearest neighbors. The classifier was fitted with the knn3() function in the R package ‘caret’.
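As an illustration only, the majority-vote rule can be sketched in a few lines of Python. This is not the knn3() implementation: the Euclidean distance, the function name knn_predict, and the toy points and labels are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point (an assumption).
    dists = [(math.dist(x, query), label) for x, label in zip(train_x, train_y)]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(dists, key=lambda d: d[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D points with hypothetical walking-speed labels.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
train_y = ["0.4", "0.4", "0.4", "1.6", "1.6", "1.6"]

pred_slow = knn_predict(train_x, train_y, (0.15, 0.15))
pred_fast = knn_predict(train_x, train_y, (1.0, 1.1))
```

With k = 3 and two classes a vote cannot tie; with five classes a tie-breaking rule would be needed, which this sketch does not address.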

2.2 Support Vector Machines

In an approach to solve multi-class pattern recognition, we can consider the problem as many binary classification problems [8, 27]. If we consider the case of K classes, K classifiers are constructed. Each classifier builds a hyperplane between itself and the K − 1 other classes [27]. If our response, or the two classes, are represented by Y ∈{−1, 1}, we can use a Support Vector Machine (SVM) to construct a hyperplane to separate the two groups such that the distance between the hyperplane and the nearest point, or the margin, is maximized [8].

The optimization problem seeks to minimize \( \phi (w,\xi ) = \frac {1}{2}\|w\|^2 + C\sum _{i=1}^n \xi _i \) subject to the constraints \( y_i((w \cdot x_i) + b) \geq 1 - \xi _i \) and \( \xi _i \geq 0 \) for i = 1, …, n. We can solve the optimization problem by solving the dual problem, which consists of maximizing \( W(\alpha ) = \sum _{i=1}^n \alpha _i - \frac {1}{2}\sum _{i,j=1}^n y_iy_j\alpha _i\alpha _jK(x_i,x_j) \) under the constraints \( 0 \leq \alpha _i \leq C \) for i = 1, …, n and \( \sum _{i=1}^n \alpha _iy_i = 0 \), where K is the kernel function (for the linear kernel, \( K(x_i,x_j) = x_i \cdot x_j \)). The above gives the decision function \( f(x) = \mbox{sign}\left [\sum _{i=1}^n \left (\alpha _iy_iK(x,x_i)\right )+b\right ] \) [27].

Many different types of kernel functions can be used in Support Vector Machine classification. An SVM [8] with a linear kernel was used in this study. The cost parameter C was tuned using cross validation [16] with package ‘e1071’ in R [21].
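To make the decision function concrete, the following minimal Python sketch evaluates \( f(x) = \mbox{sign}[\sum _i \alpha _iy_iK(x,x_i)+b] \) with a linear kernel. The support vectors, multipliers, and offset are hypothetical values chosen by hand so that \( \sum _i \alpha _iy_i = 0 \); they are not fitted to any data and are not output of ‘e1071’.

```python
def linear_kernel(u, v):
    """Linear kernel K(u, v) = u . v."""
    return sum(a * b for a, b in zip(u, v))

def svm_decision(x, support_x, support_y, alpha, b, kernel=linear_kernel):
    """Return sign(sum_i alpha_i * y_i * K(x, x_i) + b) as +1 or -1."""
    score = sum(a * y * kernel(x, sx)
                for a, y, sx in zip(alpha, support_y, support_x)) + b
    return 1 if score >= 0 else -1

# Hypothetical two-class solution: one support vector per class.
support_x = [(1.0, 1.0), (-1.0, -1.0)]
support_y = [1, -1]
alpha = [0.25, 0.25]   # satisfies sum_i alpha_i * y_i = 0
b = 0.0

pred_a = svm_decision((2.0, 0.0), support_x, support_y, alpha, b)
pred_b = svm_decision((-3.0, 1.0), support_x, support_y, alpha, b)
```

In the K-class setting described above, one such binary decision function would be built per class.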

2.3 Naive Bayes

The Naive Bayes algorithm is a classification algorithm based on Bayes’ rule and a set of conditional independence assumptions. Given the goal of learning P(Y |X), where \( X = (X_1, \ldots , X_p) \), the Naive Bayes algorithm makes the assumption that each feature \( X_i \) is conditionally independent of each of the other \( X_j \) given Y = k, and also independent of each subset of the other \( X_j \) given Y = k [15]. Bayes’ rule states that the probability of some observed data, \( x = (x_1, \ldots , x_p) \), belonging to class k is \( P(Y=k|x) = \frac {\pi _kf_k(x)}{\sum _{l=1}^K\pi _lf_l(x)}, \) where \( P(Y = k) = \pi _k \) and \( f_k(x) = p(X = x|Y = k) \) is the probability density for X in class k. For a given class k, Naive Bayes classification makes the assumption that all of the features \( x_i \) are conditionally independent, that is, \( f_k(x) = P(x_1,x_2,\ldots ,x_p|Y=k) = \prod _{i=1}^p P(x_i|Y=k) \). Thus, the Naive Bayes classifier is \( \arg \max _k \pi _k \prod _{i=1}^p f^i_k(x_i), \) where \(f^i_k(x_i)=P(X_i=x_i|Y=k)\) [15]. We believe that, because we are reducing our data by Spatial ICA, Naive Bayes classification will provide the best results in terms of misclassification rate. We will use all other classification methods for comparison with Naive Bayes.
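The rule \( \arg \max _k \pi _k \prod _{i=1}^p f^i_k(x_i) \) can be sketched as follows. The sketch assumes Gaussian class-conditional densities for each feature, which the text above does not specify; the function names and the toy data are likewise hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the prior pi_k and per-feature mean/variance for each class k."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        p = len(rows[0])
        means = [sum(r[i] for r in rows) / len(rows) for i in range(p)]
        # Small constant keeps variances strictly positive.
        variances = [sum((r[i] - means[i]) ** 2 for r in rows) / len(rows) + 1e-9
                     for i in range(p)]
        model[label] = (len(rows) / len(X), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return argmax_k of log pi_k + sum_i log f_k^i(x_i)."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for xi, m, v in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        return total
    return max(model, key=lambda k: log_posterior(*model[k]))

# Toy two-feature data with hypothetical class names.
X = [(0.0, 0.2), (0.2, -0.1), (-0.1, 0.1), (3.0, 3.1), (2.9, 3.2), (3.2, 2.8)]
y = ["slow", "slow", "slow", "fast", "fast", "fast"]
model = fit_gaussian_nb(X, y)
pred_1 = predict_gaussian_nb(model, (0.1, -0.1))
pred_2 = predict_gaussian_nb(model, (3.0, 3.0))
```

Working with log-probabilities avoids numerical underflow when the product runs over many features.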

2.4 Multinomial Logistic Regression

Logistic regression can be used to model a categorical response variable and to identify significant parameters. In the multinomial case, the response variable y takes more than two categories. Unlike the Naive Bayes classifier, it does not require the assumption of statistical independence of predictors, but it does assume that the predictors are not strongly collinear. We used package ‘glmnet’ in R [11] to fit the multinomial logistic regression.
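For reference, a standard way to write the multinomial model for K classes is the symmetric softmax form below. The coefficient symbols \( \beta _{0k} \) and \( \beta _k \) are notation introduced here for illustration and do not appear in the source.

$$\displaystyle \begin{aligned}P(Y=k\mid x) = \frac{\exp (\beta _{0k} + \beta _k^{\top }x)}{\sum _{l=1}^K \exp (\beta _{0l} + \beta _l^{\top }x)}, \quad k = 1,\ldots ,K \end{aligned}$$

A new observation is then assigned to the class with the largest fitted probability. In this study K = 5, one class per walking speed.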

3 The EEG Motion Artifact Signals Data and Spatial ICA Methodology

We used the EEG motion artifact signals data that were collected in [17] and analyzed using TICA in [25]. The method of recording isolated motion artifact in EEG is described in detail in [17]. Briefly, the isolated motion artifact data were collected using a 256 channel EEG system (ActiveTwo, Biosemi) from ten young and healthy participants. A non-conductive silicone swim cap was placed on each subject’s head to block true electrophysiological signals. A simulated scalp consisting of a short wig soaked in conductive gel was placed over the silicone layer, and the EEG cap and electrodes were placed over the simulated scalp. Subjects sat (0 m/s) and walked at four different speeds (0.4 m/s, 0.8 m/s, 1.2 m/s, 1.6 m/s) on a treadmill. Each trial was 10 min in duration, and data were recorded at 512 Hz [25]. Ten subjects with complete data sets were used in the analysis. In this study, we pre-processed the data by vertically concatenating each subject’s five speeds into one data file while creating a speed label for each signal. In terms of the temporal dimension, each recording consisted of 300,000–310,000 points for the 10 min of recorded signal. However, we used only points 1 through 300,000 for consistency within and between the ten subjects. Therefore, each of the ten subjects has a total of 1280 EEG motion artifact signals of dimension p = 300,000 across the five speeds, with sample size n = 256 for a given speed.

Independent Component Analysis (ICA) belongs to a class of methods often referred to as “Blind Source Separation”, which aim to extract certain quantities from a mixture of other quantities [26]. Unlike other statistical methods of dimension reduction that find mutually de-correlated signals, such as Principal Component Analysis (PCA) or Factor Analysis (FA), ICA is based on the assumption of statistical independence [26]. ICA decomposes the data such that we are left with maximally independent signals by maximizing non-Gaussianity. One important distinction of ICA is that there is no order or ranking of the extracted components. It is also notable that the sign of each component is not identifiable [18]. Since the EEG signals in our dataset have heavy-tailed and multimodal distributions, PCA alone is inadequate because it cannot recover statistically independent source signals [13, 14].

Let us denote the observed data as an n by p matrix X, where n is the number of spatial points (voxels in fMRI, electrode channels here) and p is the number of time points. In Spatial ICA (SICA), we consider the n vectors, each containing p time instances, to be our signals [3]. We can represent the SICA decomposition as follows. Assume that X is a matrix of mixture signals from a source matrix S, and let r be the number of components. Then A is n × r, S is r × p, and X = AS + E, where E is defined using the smallest (n − r) principal components (PCs). In the notation of the estimation procedure, S = XKW is the estimated n × m source matrix, W is the estimated m × m un-mixing matrix, and K is the estimated p × m pre-whitening matrix projecting the data onto the first m principal components, where n is the number of observations and m is the number of independent components [3, 14]. In this study, the ordering of independent components is determined by Principal Component Analysis.
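As a check on the stated dimensions (K is p × m, W is m × m, and the estimated source matrix is n × m), the only conformable ordering of the product is the one also used in step 8 of Algorithm 1:

$$\displaystyle \begin{aligned}\underbrace{S}_{n\times m} = \underbrace{X}_{n\times p}\;\underbrace{K}_{p\times m}\;\underbrace{W}_{m\times m} \end{aligned}$$

For example, assuming the downsampled training set described in Section 4 (n = 1024 signals, p = 300,000/1000 = 300 retained time points, m = 4 components), X is 1024 × 300, K is 300 × 4, W is 4 × 4, and S is 1024 × 4.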

Before ICA is performed, the data must first be pre-processed with reduction and whitening. For the purposes of data compression, SICA presumes that there are fewer independent sources than there are time points [7]. Reduction is first performed by PCA, and the specified number of components is retained such that the maximum amount of variation is represented. Combining ICA with PCA thus achieves both whitening and dimension reduction [23].

ICA supposes that the underlying sources are not normally distributed. It follows that sources can be extracted by making them as non-Gaussian as possible, as measured by negentropy. Among all distributions with a given covariance matrix, the Gaussian distribution has the highest entropy [4, 18]. Negentropy, which is defined through differential entropy, is a measure of deviation from normality and is approximated as

$$\displaystyle \begin{aligned}N(Z) = \left(\mbox{E}G(Z) - \mbox{E}G(Z_{Gaussian}) \right)^2 \end{aligned}$$

where Z is an arbitrary multivariate random variable, \(Z_{Gaussian}\) is a multivariate Gaussian random variable with the same covariance matrix as Z, and the contrast function \(G(u)=-\exp (-u^2/2)\) was used in our analysis [18]. Several algorithms exist for estimating the independent components. The FastICA algorithm maximizes negentropy N(X) [23]. We use the FastICA algorithm [13, 14] for SICA because it has been shown to outperform most other ICA algorithms in speed of convergence [23].
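A one-dimensional illustration of the negentropy approximation may help. The sketch below is not the FastICA algorithm; it only evaluates \( (\mbox{E}G(Z) - \mbox{E}G(Z_{Gaussian}))^2 \) on standardized samples, using the fact that \( \mbox{E}[-\exp (-\nu ^2/2)] = -1/\sqrt {2} \) for a standard normal \( \nu \). The function names, sample sizes, and random seed are assumptions made for the example.

```python
import math
import random

def contrast(u):
    """Exponential contrast function G(u) = -exp(-u^2 / 2)."""
    return -math.exp(-u * u / 2.0)

# E[G(nu)] for a standard normal nu equals -1/sqrt(2).
EG_GAUSS = -1.0 / math.sqrt(2.0)

def negentropy_approx(sample):
    """Approximate negentropy of a 1-D sample after standardizing it."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in sample) / n)
    z = [(v - mean) / sd for v in sample]
    eg = sum(contrast(v) for v in z) / n
    return (eg - EG_GAUSS) ** 2

rng = random.Random(0)  # fixed seed, chosen arbitrarily
gauss_sample = [rng.gauss(0.0, 1.0) for _ in range(20000)]
unif_sample = [rng.uniform(-1.0, 1.0) for _ in range(20000)]

n_gauss = negentropy_approx(gauss_sample)  # close to zero
n_unif = negentropy_approx(unif_sample)    # clearly larger than n_gauss
```

The Gaussian sample gives a value near zero while the uniform sample does not, which is the sense in which maximizing this quantity drives components away from normality.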

4 Data Analysis Procedure

The challenge in examining EEG signal data is its high dimension. Since the dimension comes from the many time points at which the EEG signals of individuals walking on a treadmill at a given speed are recorded, we assume that the high dimensional (p > n) EEG motion artifact signals used in this study can be decomposed into m < n independent components. Note that SICA cannot perform dimension reduction directly. Therefore, PCA is applied as a pre-processing step to determine the ordering of the importance of the components by the magnitude of the eigenvalues of the correlation matrix of X. To reduce the high dimensional signals, we transform the signals by using only the first four independent components (that is, K − 1, where K is the number of categories). We applied package ‘fastICA’ in R [20] for our data analysis. For each of the ten subjects, we split our data into training and testing subsets, such that the training set is used to train our classification model and the testing set is used to see how our model performs on new observations. We randomly select 256 channels out of the 1280 concatenated channels as our test set, and use the remaining 1024 channels as our training set. Although the channels for a given trial (or given speed) receive motion artifact signals simultaneously, we proceed in our analysis as if the selected test signals were recorded from another trial. The data analysis procedure is shown in the following algorithm. The proposed method is outlined, starting from the structure of the original data to the concatenated data, further into the training and testing split, and finishing with the SICA and PCA projected data.

Algorithm 1 Spatial ICA and classification of EEG motion artifacts signals

  1: There are 10 subjects. Each subject has five EEG motion artifact signal datasets, one for each of the five walking speeds. Each dataset has 256 rows/space points (from 256 channels) and 300,000 columns/time points (recordings at sequential time points).
  2: procedure
  3:     Concatenate all records of the five EEG motion artifact datasets corresponding to the subject’s walking speeds.
  4:     Downsample the time points by keeping the first sample and then every 1000th sample after the first.
  5:     Partition the data into a training set \(X_{tr}\) (1024 records) and a test set \(X_{te}\) (256 records) by random sampling.
  6:     Apply SICA to the training set and extract 4 independent components by fastICA with type ‘deflation’ and the exponential contrast function. The components are extracted one at a time.
  7:     The SICA outputs a source matrix \(S_{tr}\), a pre-whitening matrix K, and an un-mixing matrix W.
  8:     Obtain the source matrix of the test set by projection: \(S_{te} = X_{te}KW\).
  9:     Build classification models on \(S_{tr}\) and evaluate them using \(S_{te}\).
  10: end procedure
  11: For comparison, another analysis randomly samples 4 time points from \(X_{tr}\) and \(X_{te}\), builds classification models on the training set, and evaluates them on the test set.
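The index bookkeeping behind steps 3 to 5 of Algorithm 1 can be sketched without loading any data. The constant and variable names and the random seed below are hypothetical; only the sizes (256 channels, five speeds, 300,000 time points, a step of 1000, a 512 Hz sampling rate, and a 256-record test set) come from the text.

```python
import random

N_CHANNELS = 256                      # rows per speed
SPEEDS = [0.0, 0.4, 0.8, 1.2, 1.6]    # m/s, one dataset per speed
N_TIME = 300_000                      # time points kept per recording
STEP = 1000                           # keep the first sample, then every 1000th
FS = 512                              # sampling rate in Hz

# Step 3: one speed label per concatenated row (5 * 256 = 1280 rows).
labels = [s for s in SPEEDS for _ in range(N_CHANNELS)]

# Step 4: column indices retained after downsampling.
kept_time_idx = list(range(0, N_TIME, STEP))

# Step 5: random partition of the rows into training and test sets.
rng = random.Random(1)  # arbitrary seed for reproducibility
rows = list(range(len(labels)))
test_rows = sorted(rng.sample(rows, 256))
test_set = set(test_rows)
train_rows = [r for r in rows if r not in test_set]

# Time between two retained samples, in seconds.
spacing_seconds = STEP / FS
```

Under these settings each signal keeps 300 time points, so the downsampled training matrix would be 1024 × 300 and the test matrix 256 × 300 before SICA is applied.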

For each of the ten subjects’ data, we first concatenated all records into a dataset with 1280 rows and 300,000 columns, and then partitioned the data into a training set (1024 records) and a test set (256 records) by random sampling. For each interval of 1000 time points, we sampled one time point for both the training and test sets. The downsampling interval corresponds to a duration of about 1.95 s, since the original signals were collected at a 512 Hz sampling rate (the unit of time is 1/512 s, or about 0.00195 s). We then applied SICA to the downsampled training set using the FastICA algorithm. Since the five categories can be represented in a 4-dimensional space, each signal is compressed to the first four independent components. We then project the test data set onto the space of the first four independent components obtained from the training data. The plots in Figs. 1 and 2 show that, for each subject, most of the four independent components of the test set have obvious clusters corresponding to the walking-speed categories.

Fig. 1

The first four projected independent components of the test set for each of subjects 1–5. In each plot, the x-axis represents the 256 space points (electrode channels) of the EEG motion artifact signals in the test set, and the y-axis represents the values of the independent component, which are compressed time points. There are 256 space points in total and five speeds (five categories); on average, each category contains about 51 space points. For each subject, most of the first four independent components cluster according to the subject’s walking speed. The clusters are highlighted in different colors

Fig. 2

The first four projected independent components of the test set for each of subjects 6–10. In each plot, the x-axis represents the 256 space points (electrode channels) of the EEG motion artifact signals in the test set, and the y-axis represents the values of the independent component. There are 256 space points in total and five speeds (five categories); on average, each category contains about 51 space points. For each subject, most of the first four independent components cluster according to the subject’s walking speed. The clusters are highlighted in different colors

We used k-nearest neighbors with k = 3, SVM, Naive Bayes, and multinomial logistic regression for classification modeling. The accuracy rate is computed as the number of correctly categorized signals divided by the total number of classified signals. We use the accuracy rates and the multi-class area under the curve (AUC) to evaluate the proposed method against the baseline of randomly selecting four time points. Next, we trained our Naive Bayes model for classification. For each subject, we use the projected SICA test data to compare the classified results with the true classifications and output the accuracy rate and AUC. Package ‘naivebayes’ in R was used for the Naive Bayes classification [19]. Finally, we trained our last classification model with the training data using multinomial logistic regression [28]. We do not evaluate individual parameters for significance but instead simply use the fitted model for prediction of the test data to obtain the accuracy rates for each subject. We also provide the AUC values to show classification performance. These values give the total area under the receiver operating characteristic (ROC) curve and were calculated with package HandTill2001 in R [12]. Classification results are presented in Tables 1 and 2. Table 3 compares the proposed SICA method with randomly sampling four time points on simulated data, which were generated by adding noise to the signals of Subject 1. The noise was sampled from uniformly distributed random variables on the range (−a, a), where a was scaled by \(\sqrt {3}\) so that the simulated data have a signal-to-noise ratio (SNR) of 1 (a uniform variable on (−a, a) has standard deviation \(a/\sqrt {3}\)).
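The two evaluation measures can be sketched as follows. The accuracy function follows the definition above. The multi-class AUC follows the pairwise-averaging idea of Hand and Till, written here from that definition rather than taken from the HandTill2001 package, so small differences from the package (for example in tie handling) are possible. All function names and the toy probabilities are hypothetical.

```python
from itertools import combinations

def accuracy(y_true, y_pred):
    """Correctly classified signals divided by all classified signals."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def pairwise_auc(scores_pos, scores_neg):
    """P(score of a positive > score of a negative); ties count as 1/2."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            wins += 1.0 if sp > sn else 0.5 if sp == sn else 0.0
    return wins / (len(scores_pos) * len(scores_neg))

def multiclass_auc(y_true, probs, classes):
    """Average of A(i, j) = (A(i|j) + A(j|i)) / 2 over all class pairs.

    probs[r][k] is the predicted probability that row r belongs to classes[k].
    """
    pairs = list(combinations(range(len(classes)), 2))
    total = 0.0
    for i, j in pairs:
        rows_i = [p for p, t in zip(probs, y_true) if t == classes[i]]
        rows_j = [p for p, t in zip(probs, y_true) if t == classes[j]]
        a_i_given_j = pairwise_auc([p[i] for p in rows_i], [p[i] for p in rows_j])
        a_j_given_i = pairwise_auc([p[j] for p in rows_j], [p[j] for p in rows_i])
        total += (a_i_given_j + a_j_given_i) / 2.0
    return total / len(pairs)

# Toy three-class example with perfectly separated predicted probabilities.
classes = ["a", "b", "c"]
y_true = ["a", "a", "b", "b", "c", "c"]
probs = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1], [0.2, 0.7, 0.1],
         [0.1, 0.1, 0.8], [0.1, 0.2, 0.7]]
y_pred = [classes[max(range(3), key=lambda k: p[k])] for p in probs]

acc = accuracy(y_true, y_pred)
auc = multiclass_auc(y_true, probs, classes)
```

Accuracy depends only on the predicted labels, whereas the AUC uses the predicted class probabilities, which is why the two measures are reported in separate tables.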

Table 1 Model comparisons in terms of accuracy rates
Table 2 Model comparisons in terms of multi-class areas under the ROC (AUC)
Table 3 Model comparisons of simulation data with 100 repetitions in terms of average accuracy rates and AUC

5 Classification Results

The aim of our study is to explore SICA as a method of dimension reduction for analyzing high dimensional EEG or EEG motion artifact datasets. We proposed an algorithm that downsamples signals with a large number of time points and applies SICA to them such that sufficient information is still retained for classification. By using the first four independent components, k-nn with k = 3, the support vector classifier with a linear kernel, and multinomial logistic regression all successfully classify the EEG motion artifact signals. The Naive Bayes method performed worse than the others. In contrast, the classification results are very poor when using just four randomly selected time points. The comparisons show that the proposed method effectively reduces the temporal dimension while maintaining high classification accuracy. The scatter plots of the top four independent components (ICs) indicate that these ICs have different patterns with respect to walking speed (see Figs. 1 and 2).

6 Discussion

Before classification, the independent component data for each subject had dimensions 256 by 4, whereas the original test set started as 256 by 300,000. Hence, the temporal dimension of our data was reduced by a factor of 75,000. The four independent components retained enough information about the EEG motion artifact signals to successfully classify a subject’s walking speed. However, it is important to take caution in interpreting the independent components. With TICA, it can be assumed that the independent components represent the unmixed cortical signals. SICA has been commonly applied to functional MRI data, where time points correspond to input dimensions and voxels are samples. In contrast, TICA for EEG assumes that sensors constitute input dimensions and time points are samples [1]. We used SICA for EEG motion artifact signals with the assumption that sensors constitute input samples and time points are dimensions. Consequently, the EEG motion artifact signals observed at different time points are assumed to be linear sums of the source signals, and the decomposition maximizes spatial sparsity alone [2].

It is evident that, for a given subject, we are able to successfully classify walking speed with EEG motion artifact signals. The classification results (Tables 1 and 2) show that, except for the Naive Bayes classifier, the k-nn, SVM, and multinomial logistic regression classifiers all have high classification accuracy and area under the receiver operating characteristic (ROC) curve (AUC). The Naive Bayes classifier assumes that every two predictors are mutually independent given the class. The classification results indicate that the statistically independent components obtained by SICA do not use the class information, so the assumption of Naive Bayes may not be satisfied when independent components are used as predictors. It is also well known that independence does not, in general, imply conditional independence. Future studies may use supervised ICA [24] and might create gait movement profiles across different subjects, such that, if information existed about a group of subjects and their raw uncleaned EEG signals, which include cortical signals and motion artifact signals for a given movement, a new subject’s raw uncleaned EEG signals could be used to classify the new subject’s movement.