1 Introduction

Amyotrophic lateral sclerosis (ALS) is a progressive neurodegenerative disease that affects nerve cells in the brain and spinal cord and often leads to complete paralysis (www.alsa.org/news/media/quick-facts.html, 2012).

ALS usually strikes people between the ages of 40 and 70. The incidence of ALS is two per 100,000 people, and it is estimated that about 30,000 people in the United States are living with ALS (www.alsa.org/news/media/quick-facts.html, 2012). As the disease progresses, assistive communication devices that were once a necessity may become ineffective (McCane et al. 2015).

The brain-computer interface (BCI) is one of the most promising research fields developed over the last decades to provide a new means of communication for these patients (Fouad and Hadidi 2014).

BCI research is based mainly on acquiring signals from the brain and extracting discriminative information from them, which helps classify these signals for various applications (Azar et al. 2014).

This interface creates a direct communication channel between the brain and the object to be controlled. In the case of the P300 speller, the basic purpose of the BCI system is to map P300 signals to the correct character to spell. Therefore, it detects P300 signals and converts these neurophysiological signals into basic actions (Wolpaw et al. 2003). These actions are then displayed on a computer screen.

An event-related potential (ERP) is a signal observed on the subject's scalp that occurs within a short period after an event such as a visual stimulus (Wu 2014). The ERP for each trial is hypothesized to be synchronous with the event and similar between trials (Wu 2014).

The P300-based speller was originally introduced by Farwell and Donchin in 1988 (Lu et al. 2013). Subjects using the P300 speller select characters from a matrix presented on a computer screen. Then, the system proceeds to analyze and classify the resulting EEG signals (Lu et al. 2013). The flashing of a character being focused on elicits an event-related potential (ERP) that distinguishes between target and non-target characters (Lu et al. 2013). A “target” refers to the character being focused on.

The P300 response is a positive deflection in the EEG over the parietal cortex that appears around 300 ms after a stimulus as shown in Fig. 1 (Lu et al. 2013). Groups of characters are successively and repeatedly flashed, but only the group that contains the target character will elicit a response (Mattout et al. 2014).

Fig. 1
figure 1

P300 signal (Cecotti and Graser 2011)

In the implementation of a P300 BCI, characters are grouped for flashing as rows and columns (Townsend et al. 2010). This orientation is referred to as the row–column paradigm (RCP) (Townsend et al. 2010). The computer can identify the target character as the intersection of the row and column that induced the largest P300 (Townsend et al. 2010).

Reviews of BCI systems in general have been given in Nicolas Alonso and Gomez-Gil (2012) and Wolpaw et al. (2002), among others. Waldert et al. concentrated on extracting directional information only (Waldert et al. 2009). McFarland et al. provide an overview of feature extraction and translation methods for classification-based systems (McFarland et al. 2006). Teplan concentrates on measurement in EEG-based systems (Teplan 2002) and Lotte et al. on classification (Lotte et al. 2007). Nicolas Alonso and Gomez-Gil (2012) also covered many kinds of BCI applications, ranging from adapted web browsers, word processors, and brain control of a wheelchair or neuro-prostheses to games and more. Furthermore, Moore (2003) identified and classified the following four main challenges of BCI systems in real-world use:

  1. The information transfer rate of such systems is far too low.

  2. The error rate during input is very high due to highly variable brain signals.

  3. Autonomy is not really given, since the sensors usually need to be applied by someone else.

  4. Under cognitive load and distraction, the brain generates different and noisier signals; BCI systems need to work even in such distracting environments.

Ferrez and Millan (2008) used error-related potentials within the EEG recording shortly after an action to apply reinforcement learning. That was an attempt towards a more adaptive BCI. Vidaurre et al. also strove to solve the so-called "BCI illiteracy" problem, namely that a significant portion of people (about 15–20%) are unable to control a BCI system. Therefore, they also tried to make BCIs more adaptive so that they do not rely solely on off-line calibration with supervised learning (Vidaurre et al. 2011).

Regarding filtering methods, Blankertz et al. relied on common spatial patterns (CSP) as proposed by Ramoser et al. (2000). They extended the common spatial patterns approach to consider typical variations of a subject’s EEG signals between trials (Blankertz et al. 2007).

Carlson and Millán also used a BCI to control a wheelchair with "left" and "right" commands and additionally equipped the wheelchair with obstacle avoidance to ensure safe use (Carlson and Millán 2013). They employed motor imagery as input, using Laplacian filtering and power spectral density. There is also a thesis by Rebsamen (2009) that covers wheelchair control through a BCI using P300-based destination selection. Mandel et al. created an EEG-based wheelchair control. They stressed that, due to the slight inaccuracy of the control method, it is important to equip the wheelchair with a safety mechanism that avoids collisions. They also mentioned one subject who was not able to use their control and thus fell under the so-called BCI illiteracy (Mandel et al. 2009).

Although current achievements are very impressive (Tsui et al. 2011; Geng et al. 2007), there is still much research to be conducted and many studies to be performed in the field of brain-computer interfaces (Fouad and Lalib 2017).

2 Materials

The data that support the findings of this study are openly available from the BCI Competition III (2004) challenge and were provided by the Wadsworth BCI dataset (P300 evoked potential) at https://bbci.de/competition/iii/ (Wolpaw et al. 2004). The data were acquired using BCI2000's P3 speller paradigm described by Donchin et al. (2000) and originally by Farwell and Donchin (1988).

In this work, Matlab version 9.2.0.556344 (R2017a) and its Signal Processing Toolbox, which supports an extensive range of signal processing operations, are used for data analysis and technical computing, as Matlab is a high-performance and powerful language. This work is implemented on a personal computer with an Intel(R) Core(TM) i7 2.6 GHz processor.

2.1 Paradigm

The user was presented with a 6 by 6 matrix of characters as shown in Fig. 2. The user's task was to focus attention on characters in a word that was prescribed by the investigator (i.e., one character at a time). All rows and columns of this matrix were successively and randomly intensified. Two out of 12 intensifications of rows or columns contained the desired character. The responses evoked by these infrequent stimuli (i.e., the 2 out of 12 stimuli that did contain the desired character) differ from those evoked by the stimuli that did not contain the desired character, and they resemble P300 responses. This P300-based paradigm can be considered a "virtual keyboard" on the BCI system's computer screen.

Fig. 2
figure 2

P300 Speller Paradigm (Donchin et al. 2000)

2.2 Data collection

The collected signals have the following specifications:

  • The signals are bandpass filtered in the 0.1–60 Hz range and digitized at 240 Hz from two subjects: subject A and subject B in five sessions each.

  • The training set for both subjects consists of 85 characters.

  • The test set for both subjects consists of 100 characters.

  • For each character, sets of 12 intensifications as shown in Fig. 3 (Donchin et al. 2000) were repeated 15 times and thus there were 12 × 15 (180) total intensifications.

  • The EEG signals have been acquired using 64-channels.

  • A more detailed description of the dataset can be found on the BCI Competition website (Donchin et al. 2000).

Fig. 3
figure 3

Matrix rows and columns (Donchin et al. 2000)

3 Methods

3.1 Data preprocessing

Preprocessing was used to remove the noise and enhance the EEG signals. It was known that the acquired signals were bandpass filtered in the range [0.1, 60] Hz and digitized at 240 Hz.

3.1.1 Trials and filtration

As noted before, there are 12 post-intensification segments repeated 15 times (12 × 15 = 180) as one signal for each character over 64 channels. The goal is to extract each post-intensification segment from the provided signals in both the training and test sets. Since the P300 ERP appears about 300–500 ms after the onset of the stimulus, a 650 ms window of data samples is extracted from the beginning of each intensification; this is long enough to obtain the required temporal features.

Filtering is an important step in noise reduction. Extra filtering using an eighth-order bandpass filter was applied to the training (180 × 85 characters) and test (180 × 100 characters) post-intensification segments at different cut-off frequencies, which were selected because cognitive activity very rarely occurs outside the 300–500 ms range (Allison 2003).
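
The following is a minimal sketch of segment extraction and filtering. It is written in Python (NumPy/SciPy) for illustration only; the authors implemented this work in Matlab, and the Butterworth design, the function names, and the array layout are assumptions, not the authors' exact code.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 240                     # sampling rate of the dataset (Hz)
WINDOW = int(0.65 * FS)      # 650 ms post-intensification window -> 156 samples

def extract_and_filter(eeg, stim_onsets, low=0.1, high=30.0, order=8):
    """Cut one 650 ms segment per intensification and band-pass filter it.

    eeg         : (n_samples, 64) continuous EEG of one character epoch
    stim_onsets : sample indices at which the 180 intensifications start
    returns     : (len(stim_onsets), WINDOW, 64) filtered segments
    """
    segments = np.stack([eeg[t:t + WINDOW, :] for t in stim_onsets])
    # The paper specifies an eighth-order band-pass filter; a zero-phase
    # Butterworth design is assumed here for illustration.
    sos = butter(order, [low, high], btype="band", fs=FS, output="sos")
    return sosfiltfilt(sos, segments, axis=1)
```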

3.1.2 Decimation

Decimation reduces the original sampling rate of a sequence to a lower rate. It lowpass filters the input to guard against aliasing and down-samples the result (Matlab 2017). The filtered segments are decimated according to the high cut-off frequency. As an example (a code sketch is given after the list below):

  • The sampling frequency is 240 Hz.

  • The selected window is 650 ms.

  • Then, the post-intensification segments' length is 240 × 0.65 = 156 samples.

  • If the high cut-off frequency is 30 Hz, then the decimation factor will be 240/30 = 8.

  • Therefore, the post-intensification segments' length is decimated to 156/8 ≈ 20 samples.

  • Then, the filtered training response has dimensions (180 × 85) × 20 × 64, and the filtered test response has dimensions (180 × 100) × 20 × 64.
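
A short sketch of this step, under the same illustrative assumptions as above (Python/SciPy as a stand-in for the MATLAB decimate function used in the paper):

```python
from scipy.signal import decimate

def decimate_segments(filtered, high_cutoff, fs=240):
    """Down-sample the filtered segments, mirroring MATLAB's decimate().

    filtered    : (n_segments, 156, 64) band-pass-filtered segments
    high_cutoff : high cut-off frequency of the band-pass filter, e.g. 30 Hz
    """
    q = int(fs / high_cutoff)                    # 240 / 30 = 8
    return decimate(filtered, q, axis=1, zero_phase=True)

# 156 samples decimated by a factor of 8 give 20 samples per channel,
# so one post-intensification segment becomes a 20 x 64 array.
```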

3.2 Feature vector

The feature vector is constructed by concatenating the measurements of the 64 channels of all post-intensification segments. As mentioned before, the training set for both subjects 'A' and 'B' consists of 85 characters, a total of 180 × 85 = 15,300 post-intensification segments, while the test set for both subjects consists of 100 characters, i.e., 180 × 100 = 18,000 post-intensification segments.
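
A minimal sketch of the concatenation step (illustrative Python; the array shapes follow the numbers quoted above, and the decimated length of 20 samples assumes the 30 Hz cut-off example):

```python
import numpy as np

def build_feature_matrix(decimated):
    """Concatenate the 64 channels of every decimated post-intensification segment.

    decimated : (n_segments, 20, 64) array
    returns   : (n_segments, 20 * 64) matrix, one 1,280-dimensional row per segment
    """
    return decimated.reshape(decimated.shape[0], -1)

# Training set: 180 * 85 = 15,300 rows; test set: 180 * 100 = 18,000 rows.
```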

3.3 Normalization

"Z-score normalization" is commonly used in machine learning when data with multiple dimensions must be handled. Feature standardization makes the values of each feature in the data have zero mean and unit variance. The general method is to determine the mean and standard deviation of each feature, subtract the mean from each feature value, and then divide the result by the standard deviation:

$$ x^{\prime} = \frac{{x - \overline{x} }}{\sigma }, $$

where x is the original feature value, \(\overline{x}\) is the mean of that feature, σ is its standard deviation, and x′ is the standardized value.

The training feature vector is normalized to zero mean and unit variance. Then, the test feature vector is normalized using the normalization parameters obtained from the training feature vector.
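
A two-line sketch of this train/test normalization, assuming X_train and X_test are the feature matrices from the previous step (scikit-learn is used here purely for illustration):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()                     # z-score: zero mean, unit variance per feature
X_train_norm = scaler.fit_transform(X_train)  # parameters estimated on the training set only
X_test_norm = scaler.transform(X_test)        # the same parameters reused for the test set
```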

3.4 Classification

3.4.1 Linear discriminant analysis

Linear classifiers are discriminant algorithms that use linear functions to separate different classes. They are probably the best-known algorithms for BCI applications.

The aim of LDA is to use hyper-planes to separate the data representing the different classes (Duda et al. 2001; Fukunaga 1990). For a two-class problem, the class of a feature vector depends on which side of the hyper-plane the vector lies.

LDA assumes a normal distribution of the data, with equal covariance matrices for both classes. The separating hyper-plane is obtained by seeking the projection that maximizes the distance between the two class means and minimizes the intra-class variance (Fukunaga 1990).

This technique has a very low computational requirement which makes it suitable for BCI systems. Moreover, this classifier is simple to use and generally provides good results.

Two procedures for applying an LDA classifier were used.

3.4.1.1 LDA method I (LDA I)

In the first procedure, referred to as "LDA method I", the test labels ('1' assigned to a target or '−1' to a non-target) from each row or column were averaged over the sequences, as shown in Fig. 4. The most probable row and column are considered to be the ones that maximize this averaged score.

Fig. 4
figure 4

LDA method I classifier

3.4.1.2 LDA method II (LDA II)

In the second procedure, referred to as "LDA method II", the test scores from each row or column were averaged over the sequences, as shown in Fig. 5. Again, the most probable row and column are the ones that maximize the averaged score.

Fig. 5
figure 5

LDA method II classifier
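
The sketch below illustrates the difference between the two LDA procedures. It is an illustrative Python/scikit-learn example, not the authors' implementation; X_train_norm, y_train (+1 target / −1 non-target per segment), and X_test_norm are assumed from the preceding steps, and the reshapes assume the test segments are ordered character by character, sequence by sequence, with a fixed stimulus-code order within each sequence.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

n_chars, n_seq, n_stim = 100, 15, 12

lda = LinearDiscriminantAnalysis()
lda.fit(X_train_norm, y_train)

# LDA method I: average the predicted hard labels (+1 / -1) over the 15 sequences
labels = lda.predict(X_test_norm)
score_I = labels.reshape(n_chars, n_seq, n_stim).mean(axis=1)

# LDA method II: average the continuous decision scores over the 15 sequences
scores = lda.decision_function(X_test_norm)
score_II = scores.reshape(n_chars, n_seq, n_stim).mean(axis=1)

# In both methods the most probable row and column maximize the averaged score.
```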

3.4.2 Support vector machine

A support vector machine uses a discriminant hyperplane to distinguish classes (Burges 1998; Bennett and Campbell 2000). However, concerning SVM, the selected hyperplane is the one that maximizes the margins, i.e., the distance from the nearest training points. Maximizing the margins is known to increase the generalization capabilities (Burges 1998; Bennett and Campbell 2000). This classifier uses a regularization parameter C that enables accommodation to outliers and allows errors on the training set.

Support vector machines enable classification using linear decision boundaries and are known as linear SVMs. These classifiers have been applied, always with success, to a relatively large number of synchronous BCI problems (Blankertz et al. 2002; Garrett et al. 2003; Rakotomamonjy et al. 2005; Rakotomamonjy and Guigue 2008).

SVMs have several advantages. Thanks to margin maximization and the regularization term, SVMs are known to have good generalization properties (Bennett and Campbell 2000; Jain et al. 2000), to be insensitive to overtraining (Jain et al. 2000), and to be robust to the curse of dimensionality (Burges 1998; Bennett and Campbell 2000).

SVMs have been used in BCI research since they are a powerful procedure for pattern recognition, especially for high-dimensional problems (Rakotomamonjy and Guigue 2008). Two approaches for applying a support vector machine classifier are presented.

3.4.2.1 SVM method I and II

In the first approach, two procedures are introduced, referred to as "SVM method I" and "SVM method II". After preprocessing, the post-intensification segments of the training feature vector that belong to the same row or column are averaged, so that for each character there are only 12 averaged post-intensification segments instead of 180. Therefore, the training feature vector and the training label vector are of size 12 × 85 instead of 180 × 85 and are fed to the support vector machine classifier as shown in Fig. 6.

Fig. 6
figure 6

SVM classifier. a SVM method I. b SVM method II

In the procedure named "SVM method I", the test labels (1 or −1) from each row or column were averaged over the sequences, while in the procedure named "SVM method II", the test scores from each row or column were averaged over the sequences. In both procedures, the most probable row and column are the ones that maximize the averaged score.
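
A sketch of this first SVM approach under the same illustrative assumptions as the LDA sketch (Python/scikit-learn stand-in, assumed segment ordering, and a placeholder C value; the actual regularization values were chosen by cross-validation, see Table 1):

```python
import numpy as np
from sklearn.svm import SVC

n_train_chars, n_test_chars, n_seq, n_stim = 85, 100, 15, 12

# Average the 15 repetitions of each row/column per training character,
# leaving 12 averaged segments per character (12 x 85 training rows).
X_avg = X_train_norm.reshape(n_train_chars, n_seq, n_stim, -1).mean(axis=1)
X_avg = X_avg.reshape(n_train_chars * n_stim, -1)
y_avg = y_train.reshape(n_train_chars, n_seq, n_stim)[:, 0, :].reshape(-1)

svm = SVC(kernel="linear", C=1.0)   # placeholder C; tuned by cross-validation in the paper
svm.fit(X_avg, y_avg)

# SVM method I averages the predicted labels, method II the decision scores.
score_I = svm.predict(X_test_norm).reshape(n_test_chars, n_seq, n_stim).mean(axis=1)
score_II = svm.decision_function(X_test_norm).reshape(n_test_chars, n_seq, n_stim).mean(axis=1)
```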

3.4.2.2 SVM method III and IV

The second approach is a multiple-classifier combination approach, in which two procedures are presented, referred to as "SVM method III" and "SVM method IV" (Fig. 7). After preprocessing, the training feature vector (180 × 85) and the training label vector are divided into 17 equal partitions. Each partition consists of five consecutive characters (180 × 5), as shown in Fig. 7, and is fed to a linear support vector machine classifier. Therefore, a multiple-classifier system was designed from the combination of the 17 SVM classifiers' outputs.

Fig. 7
figure 7

SVM method III and IV

In the procedure named "SVM method III", the test labels (1 or −1) from each row or column were averaged over the sequences, while in the procedure named "SVM method IV", the test scores from each row or column were averaged over the sequences. In both procedures, the most probable row and column are the ones that maximize the averaged score.
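
A sketch of the multiple-classifier combination, under the same illustrative assumptions (Python/scikit-learn, assumed segment ordering, placeholder C; averaging the 17 classifiers' outputs is one plausible way to combine them):

```python
import numpy as np
from sklearn.svm import SVC

SEGS_PER_CHAR, N_PARTITIONS, CHARS_PER_PART = 180, 17, 5

classifiers = []
for p in range(N_PARTITIONS):
    idx = slice(p * CHARS_PER_PART * SEGS_PER_CHAR,
                (p + 1) * CHARS_PER_PART * SEGS_PER_CHAR)
    clf = SVC(kernel="linear", C=1.0)        # placeholder C; see Table 1
    clf.fit(X_train_norm[idx], y_train[idx])
    classifiers.append(clf)

# Combine the 17 classifiers by averaging their outputs on every test segment:
# method III averages hard labels, method IV averages decision scores.
labels = np.mean([clf.predict(X_test_norm) for clf in classifiers], axis=0)
scores = np.mean([clf.decision_function(X_test_norm) for clf in classifiers], axis=0)
score_III = labels.reshape(100, 15, 12).mean(axis=1)
score_IV = scores.reshape(100, 15, 12).mean(axis=1)
```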

In the previously mentioned approaches, the values of the hyper-parameters (regularization parameters), selected through cross-validation using a subset of the training dataset as a validation set, are shown in Table 1.

Table 1 SVM hyper-parameters values

3.4.3 Linear regression

Linear regression algorithms (LREG) mostly differ depending on the number of independent variables and the type of relationship between the independent and dependent variables.

Linear regression is a linear model, i.e., a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, y can be calculated from a linear combination of the input variables (x) (https://towardsdatascience.com/introduction-to-machine-learning-algorithms-linearregression-14c4e325882a).

In the present work, a linear regression algorithm is implemented as a classifier that can be implemented in a BCI system as shown in Fig. 8.

Fig. 8
figure 8

LREG classifiers
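
A compact sketch of one plausible way to use linear regression as a scorer, consistent with how the other classifiers are used here (illustrative Python/scikit-learn; X_train_norm, y_train, and X_test_norm are assumed from the preceding steps, and treating the regression output as a row/column score is an interpretation, not the authors' stated implementation):

```python
from sklearn.linear_model import LinearRegression

# Regress the +1 / -1 segment labels on the normalized features; the continuous
# prediction is then averaged over sequences like the other classifiers' scores.
lreg = LinearRegression()
lreg.fit(X_train_norm, y_train.astype(float))
score_lreg = lreg.predict(X_test_norm).reshape(100, 15, 12).mean(axis=1)
```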

3.4.4 Bayesian classifier

Bayesian linear discriminant analysis (BLDA) is a simple and efficient method for classification. This technique was presented by Hoffmann et al. (2004) as follows (a code sketch is given after the list):

  • Based on the training labels, the regression targets are calculated.

  • The parameters alpha and beta were estimated iteratively.

  • To predict the class to which each row or column belongs, the mean of the predictive distribution was calculated for each character and used as the score for that row or column.
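
The sketch below uses scikit-learn's BayesianRidge as a stand-in for BLDA: it performs an evidence-maximization procedure closely related to Hoffmann et al.'s method, iteratively re-estimating the noise and weight-prior precisions while fitting a linear model. It is not the authors' implementation, and the exact regression targets and hyper-parameter names are assumptions.

```python
from sklearn.linear_model import BayesianRidge

# X_train_norm, y_train and X_test_norm are assumed from the preceding steps.
blda = BayesianRidge()
blda.fit(X_train_norm, y_train.astype(float))

# Estimated precisions and the number of iterations (compare Table 2).
print(blda.alpha_, blda.lambda_, blda.n_iter_)

# The mean of the predictive distribution is used as the score for each row/column.
score_blda = blda.predict(X_test_norm).reshape(100, 15, 12).mean(axis=1)
```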

Based on our proposed Bayesian classifier, the number of iterations and the estimated parameters are shown in Table 2.

Table 2 Number of iterations and estimated parameters of Bayesian classifier

3.5 Character prediction

The scores predicted from the previously applied classifiers are averaged over the 15 sequences, yielding 12 scores for each character. The target character is determined by finding the row and column with the highest score.
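
A short sketch of this final step (illustrative Python; the 6 × 6 matrix follows the layout in Fig. 2, and the assumption that the first six averaged scores index the columns and the last six the rows follows the dataset's stimulus codes and may need to be swapped):

```python
import numpy as np

# Speller matrix as presented to the subject (Fig. 2).
MATRIX = np.array([list("ABCDEF"), list("GHIJKL"), list("MNOPQR"),
                   list("STUVWX"), list("YZ1234"), list("56789_")])

def predict_characters(avg_scores):
    """avg_scores : (n_characters, 12) scores already averaged over the 15 sequences."""
    cols = avg_scores[:, :6].argmax(axis=1)   # assumed: codes 1-6 are columns
    rows = avg_scores[:, 6:].argmax(axis=1)   # assumed: codes 7-12 are rows
    return "".join(MATRIX[r, c] for r, c in zip(rows, cols))
```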

4 Results and discussion

Performance is evaluated as the percentage of correctly predicted characters in the test sets. Based on this evaluation criterion, Tables 3 and 5 show the performances for subjects 'A' and 'B' at the different frequencies that yield the best results for each classifier during the 5th and 15th sequences. The time taken to predict a character for subject 'A' does not exceed 24.0786 s, as shown in Table 4, while the time taken to predict a character for subject 'B' does not exceed 55.2476938 s, as shown in Table 6. These run times need to be improved to be more suitable for online applications.

Table 3 Subject A—performances of the presented classifiers
Table 4 Subject A—timing of the presented classifiers
Table 5 Subject B—performances of the presented classifiers
Table 6 Subject B-timing of the presented classifiers

Figure 9a and b shows that the highest performances were obtained when applying the proposed classifiers during the 15th sequence for subjects 'A' and 'B'.

Fig. 9
figure 9

Performances of the presented classifiers

Table 7 presents the performance results achieved on the test sets for subjects 'A' and 'B' during the 15th and 5th sequences. The LDA II method outperformed LDA I, achieving 95.5% and 61% for the 15th and 5th sequences, respectively. The SVM IV method outperformed the other support vector machine methods (SVM I, SVM II, and SVM III), achieving 98% and 54.5% for the 15th and 5th sequences, respectively.

Table 7 Performances of the presented classifiers

By comparing the performance results of the mentioned classifiers, it is obvious that the SVM IV and BLDA methods give the highest results for the 15th sequence, as both achieved 98%, while the LREG and BLDA methods gave the highest results for the 5th sequence, achieving 66.5% and 66%, respectively. Therefore, the BLDA method yields the highest performance across both sequences and outperformed all the other classifiers.

As previously discussed, to predict a target character, it is important to determine the row and column to which this character belongs. For character prediction itself, it is difficult to determine the four possible observations, where positive means target and negative means non-target:

  • True positives (TP): observation is positive and is predicted to be positive.

  • True negatives (TN): observation is negative and is predicted to be negative.

  • False positives (FP): observation is negative but is predicted positive.

  • False negatives (FN): observation is positive but is predicted negative.

Hence, the additional evaluation measurements (accuracy, sensitivity, specificity, and precision) cannot be computed directly on characters. Instead of evaluating the prediction of a specific character, its corresponding row and column were distinguished as target or non-target.

The actual and predicted test data were averaged over the 15 sequences, resulting in 1200 observations (12 rows/columns × 100 characters) for each subject.
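
A minimal sketch of how these measurements follow from the four observation counts (illustrative Python; the +1/−1 coding of target and non-target rows/columns is assumed from the labeling used above):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and precision for the averaged row/column
    observations (1,200 per subject), with +1 = target and -1 = non-target."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }
```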

The confusion matrices for the eight presented classifiers are shown in Tables 8 and 9 for both subjects A and B, respectively.

Table 8 Confusion matrices for all classifiers concerning subject A
Table 9 Confusion matrices for all classifiers concerning subject B

Tables 10 and 11 show the performance measurements for the eight classifiers applied on subjects A and B, respectively.

Table 10 Performance measurements for all classifiers concerning subject A
Table 11 Performance measurements for all classifiers concerning subject B

The receiver operating characteristic (ROC) curve is a major visualization technique for presenting the performance of a classification model. It summarizes the trade-off between the true positive rate (TPR) and the false positive rate (FPR) of a predictive model across different decision thresholds. The ROC curves of the eight presented classifiers for both subjects A and B are shown in Fig. 10.

Fig. 10
figure 10

ROC Curves of the presented classifiers. a Subject A. b Subject B
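
A brief sketch of how such a curve and its AUC can be produced (illustrative Python/scikit-learn; y_true holds the +1/−1 labels of the 1,200 averaged row/column observations and y_score the corresponding averaged classifier scores, both assumed from the steps above):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, _ = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], "k--", label="chance level")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```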

It is clear from Tables 10 and 11 that all the performance measurements yield high results for all the suggested classifiers. By analyzing the ROC curves of both subjects shown in Fig. 10, it is found that the areas under the ROC curves (AUC) indicate high performance (more than 0.934) for the presented classifiers. This means that the suggested classifiers are promising models.

5 Comparative study

Different classification techniques have been investigated on the same dataset to achieve high performance, as summarized in Table 12.

Table 12 Performances presented from different researchers

By comparing these researchers' classification performances, it is clear that the suggested BLDA and SVM IV methods give better performance after the 15th sequence than the other researchers' methods.

6 Conclusion

A complete BCI system was presented. For subjects 'A' and 'B' of Competition III, the training and test sets were filtered by an eighth-order bandpass filter with different cut-off frequencies: 0.1–10 Hz, 0.1–20 Hz, 0.1–30 Hz, and 0.1–40 Hz. The two sets were decimated according to the filter's high cut-off frequency. The feature vectors of both sets were then constructed and normalized.

Concerning the classification process, the performance obtained when applying all classifiers to the test sets during the 15th sequence is better than that during the 5th sequence. It was observed that the time taken to train the classifier or to classify a test character is about 2 s. Eight classification procedures were presented. By comparing the performances of the mentioned classifiers, it was observed that the LDA II method performs better than LDA I, and the SVM IV method yields better results than the other proposed methods. Finally, BLDA achieved the highest results during both the 15th and the 5th sequences.

In the future, several classification algorithms could be tried to achieve higher accuracy with a smaller number of sequences, while also considering reducing the number of electrodes used and thus decreasing the required time.