Introduction

A brain–computer interface (BCI) provides a direct communication channel between the brain and external devices that does not rely on the conventional neuromuscular pathways of peripheral nerves and muscles (Mak and Wolpaw 2009). Several different neural activities, such as visual evoked potentials (Gergondet and Kheddar 2015), auditory evoked potentials (Hwang et al. 2017), event-related potentials (Hoffmann et al. 2008), visual attention (Gaume et al. 2019), and event-related desynchronization/synchronization (ERD/ERS) (Pfurtscheller and Lopes da Silva 1999), have been utilized in electroencephalogram (EEG)-based BCIs.

Motor imagery (MI) is a mental representation of motor behavior and does not rely on external stimuli. The MI task elicits changes in the rhythmic activities of the brain, which are observed in the electrophysiological signal as ERD/ERS phenomenon (Pfurtscheller 1977, 1992), specifically in mu (8–12 Hz) and beta frequency bands (13–30 Hz) (McFarland et al. 2000; Pfurtscheller et al. 2006). The corresponding differences in EEG signals in the frequency and spatial domains can be extracted to discriminate mental states. Several applications, including neuroprosthesis control (Müller-Putz et al. 2005), 2D cursor control (Jinyi Long et al. 2012b), and wheelchair control (Tang et al. 2018), use MI-based BCIs to provide user control. Several studies have involved stroke patients in BCI-feedback training (Daly et al. 2009; Frisoli et al. 2012; van Dokkum et al. 2015; Pichiorri et al. 2015; Kim et al. 2016; Frolov et al. 2017). However, important questions remain to be addressed for implementing BCI-based rehabilitation in clinical applications (Pichiorri et al. 2017). In many cases, the EEG features of different mental states in MI-based BCIs are not sufficient to provide reliable EEG control commands (Guger et al. 2003; Blankertz et al. 2008a, 2010). Approximately 20% of BCI users are unable to obtain sufficient accuracy for control using MI (Guger et al. 2003). This phenomenon is called “BCI inefficiency” (Kaufmann et al. 2013). To address this shortcoming, many advanced methods have been proposed to improve the recognition performance of MI-based BCIs, including channel selection (Garrett et al. 2003; Arvaneh et al. 2011; He et al. 2013; Qiu et al. 2016) and feature extraction methods (Zhang et al. 2017; Miao et al. 2018; Feng et al. 2018). In addition, the performance of MI-based BCI systems depends primarily on the subject’s ability to modulate his/her sensorimotor rhythms. 
For naïve subjects, proficient modulation ability can be achieved via proper training (Neuper et al. 1999). Various training methods based on feedback have been proposed (Hwang et al. 2009; Zich et al. 2015; Abdalsalam et al. 2018) given that feedback can help subjects verify the effects of various MI strategies by providing them with instant information about control effects. However, inhibitory and facilitative effects on MI training can be induced by feedback, and these effects vary between subjects (McFarland et al. 1998; Yu et al. 2015). This problem is particularly pronounced in older stroke patients, and many patients have associated comorbidities, such as depression (Kertesz and Sheppard 1981; Di Carlo et al. 2000), which inhibit their ability to adapt quickly to the training. Excessive unintended or inaccurate feedback/control may frustrate subjects and, in severe cases, even obstruct the training.

Recent studies have shown that multi-modal brain signal fusion can effectively improve the performance of BCIs (Jinyi Long et al. 2012b; Amiri et al. 2013; Yin et al. 2015; Ma et al. 2017; Puanhvuan et al. 2017). To provide effective continuous feedback for MI training, a hybrid BCI paradigm combining MI and steady-state visually evoked potentials (SSVEPs) was proposed (Yu et al. 2015). The hybrid BCI system based on SSVEP and MI can effectively identify the intentions of the subjects. Moreover, the hybrid feedback BCI paradigm can be used to enhance MI training. However, SSVEP stimuli can cause severe fatigue in subjects because of the fast repetition of the flashing stimuli (Gergondet and Kheddar 2015). In comparison with SSVEP, the P300 protocol could alleviate fatigue to some degree (Ma et al. 2017). Several studies have shown that the combination of P300 and MI can help improve performance in terms of wheelchair control (Jinyi Long et al. 2012a), computer cursor control (Li et al. 2010), and speller systems (Yu et al. 2016). However, no existing P300-MI hybrid paradigm focuses on improving MI training. Moreover, these paradigms do not instruct subjects in how to perform the imagination, which might negatively affect MI performance (Qiu et al. 2017). For instance, subjects could attain better MI performance by imagining familiar actions (Gibson et al. 2014).

In the present study, a hybrid BCI paradigm based on MI and P300 is proposed to improve the MI training for Chinese people. Combining MI and P300 modalities is hypothesized to improve the feedback accuracy in the initial training stage, so as to remedy the limitation of MI training. In the paradigm, two Chinese character outlines are first displayed on the screen. Then, the strokes of each Chinese character are flashed (stroke by stroke) in accordance with natural writing. The subjects are asked to imagine writing the Chinese character and follow the flash order of the desired character. During the task, the ERD/ERS phenomenon is produced with writing imagination. Simultaneously, the P300 potential is evoked by the flash of each stroke. To fuse the P300 and MI classification outputs, a simple and effective fusion method is proposed, in which the unreliable P300 classifications are corrected by the reliable MI classifications. In our paradigm, the flash interval of the strokes of a single Chinese character is larger than 1 s, thereby providing a softer stimulus for users. Chinese people are familiar with writing Chinese characters, thereby helping to modulate their sensorimotor rhythms. The proposed hybrid paradigm is believed to exhibit two main advantages over existing MI training paradigms. First, it may help subjects modulate sensorimotor rhythms effectively with a softer stimulus. Second, the performance of the BCI system can be further improved by fusing the two features. Both offline and online recognition results demonstrate that the recognition accuracy of the fusion method is significantly higher than using P300 or MI features alone. In addition, the size of the required training data can be reduced by combining the two features.

Materials and methods

Subjects and EEG acquisitions

Eighteen healthy subjects (12 males and 6 females, aged 22–28 years, mean 23.6 ± 3.7 years) with no prior experience with MI-based BCIs participated in the experiment. The local ethics committee approved the consent form and experimental procedure, and all subjects signed a written consent form prior to the experiment. According to self-reports, all subjects were right-handed with no clinical history of neurological disorders, and all reported Mandarin Chinese as their native language.

The subjects were seated on a comfortable chair that was located 80 cm away from a standard 19-inch LED monitor (60 Hz refresh rate, 1920 × 1080 screen resolution) in an electromagnetically shielded room. During the task, subjects were asked to relax and avoid unnecessary movements.

EEG signals were recorded using 26 scalp electrodes placed in accordance with the International 10-20 System, at a sampling rate of 256 Hz. The 26 electrodes were placed at F3, Fz, F4, FC3, FCz, FC4, C5, C3, C1, Cz, C2, C4, C6, CP3, CP1, CPz, CP2, CP4, P7, P3, Pz, P4, P8, O1, Oz, and O2. Most of the electrodes were located over the central and parietal regions on the basis of previous reports (Hoffmann et al. 2008; Jin et al. 2011; Huang et al. 2018; Feng et al. 2018). All channels were referenced to electrode A1 located over the right mastoid, and the ground (GND) was placed on the forehead. The impedance of all electrodes was kept below 20 kΩ. Signals were amplified by a g.HIamp amplifier and pre-processed with a hardware bandpass filter (0.1–30 Hz) and a notch filter (50 Hz). All electrodes were used for feature extraction and subsequent calculations.

Experimental paradigms

The experimental protocol comprised a screen cue prompting the user to imagine writing the prompted Chinese character. Two Chinese characters and a forearm were displayed on the screen. The forearm indicated the side (left or right) on which subjects should perform the MI task. The Chinese character outlines were initially displayed on the screen in the preparation stage, and then each Chinese character was flashed stroke by stroke according to the natural writing sequence. Four different Chinese characters (“生”, “末”, “仗”, “正”) were used in the experiment. These Chinese characters were selected because most Chinese people are familiar with them and all of them involve five strokes, as in a previous study (Qiu et al. 2017). The characters were randomly combined and appeared in random order on the screen, such as “生-正”, “生-末”, “仗-正”, and “仗-末”.

Figure 1a, b illustrate an example of the screen cue shown to subjects. Figure 1a represents a left-hand MI cue, whereas Fig. 1b represents a right-hand MI cue. In each trial, only one of these cues was presented to the user. During the task, subjects were asked to imagine writing the prompted Chinese character while following the flash of the Chinese character. Figure 1c shows one trial of the imagined writing task. During the 0–1.9 s preparation phase in which the first row of Fig. 1 (either (a) or (b)) was displayed on the screen, two outlines of Chinese characters were shown on the screen along with a picture of a forearm. Subjects were asked to focus on the Chinese character on the side of the forearm prompted. During the 1.9–7.9 s MI phase, the strokes of the two Chinese characters flashed on the screen stroke by stroke according to the natural writing sequence as illustrated by the second to the last row of Fig. 1a or b. The flash start time sequence of the left side Chinese character was 1.9, 3.1, 4.3, 5.5, and 6.7 s, and that at the right side was 2.1, 3.3, 4.5, 5.7, and 6.9 s. Each flash lasted for 150 ms. The P300 potential was evoked by the flash of each stroke of the target Chinese character. To stably evoke P300, the interval between consecutive flashes of each stroke was longer than 1 s (Walter 1968; Donchin and Smith 1970; Donchin 1981; Pritchard 1981). If a subject focused on the left Chinese character, the P300 signal was time locked with the left flash time sequence, and conversely, if a subject focused on the right Chinese character, the P300 signal was time locked with the right flash time sequence. Subjects were asked to imagine writing the Chinese character while following the flash order of the corresponding Chinese character. As a result, the P300 potential (elicited by the stroke flashes) and the ERD/ERS phenomenon (elicited by MI) were produced simultaneously during the task. 
During the 7.9–10 s rest phase, an empty black background was displayed on the screen. Each subject completed three offline sessions and one online session, and each session consisted of 40 trials (with a balanced number of left and right classes).
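The flash schedule described above can be reproduced with simple arithmetic. The following is a minimal sketch only: the 1.2 s stroke interval and the 0.2 s delay of the right-side sequence are read off the listed onset times, and all constant and function names are hypothetical.

```python
# Trial timing taken from the text, expressed in integer milliseconds
# to avoid floating-point rounding.
MI_START_MS = 1900          # start of the MI phase
MI_END_MS = 7900            # end of the MI phase
N_STROKES = 5               # every character has five strokes
FLASH_MS = 150              # duration of one flash
STROKE_INTERVAL_MS = 1200   # assumption: spacing implied by 1.9, 3.1, 4.3, ... s
RIGHT_DELAY_MS = 200        # assumption: right side starts at 2.1 s

def flash_onsets(first_onset_ms, n_strokes=N_STROKES, interval_ms=STROKE_INTERVAL_MS):
    """Return the flash onset times (ms) of one character."""
    return [first_onset_ms + k * interval_ms for k in range(n_strokes)]

left_onsets = flash_onsets(MI_START_MS)
right_onsets = flash_onsets(MI_START_MS + RIGHT_DELAY_MS)
```

Under these assumptions, the last flash of the right-side character ends well before the MI phase finishes.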

Fig. 1 Experimental protocol. a Example of the left-hand MI screen cues shown to subjects, b example of the right-hand MI screen cues shown to subjects, c trial timing for the imagined writing task

Feature extraction procedure

P300 feature extraction

Generally, amplitude variations (Hoffmann et al. 2008), matched filtering (Serby et al. 2005), area calculation, and peak picking (Farwell and Donchin 1988) can be used to discriminate P300. Recent studies have shown that shape features (Alvarado-González et al. 2016) and phase locking value (PLV) (Kabbara et al. 2016) are efficient tools for discriminating target and non-target visual-evoked responses. Moreover, a satisfactory performance can be obtained when combining PLV and traditional features (Kabbara et al. 2016). Thus, in this study, amplitude variations and PLV features were combined for P300 classification.

First, a window with a length of 800 ms ranging from flash stimuli onset to 800 ms after flash stimuli was extracted (Hoffmann et al. 2008). The data segments extracted in accordance with the left and right flash time sequences were denoted as \(\varvec{Xu}_{\varvec{l}}^{\varvec{t}}\) and \(\varvec{Xu}_{\varvec{r}}^{\varvec{t}}\), respectively, where \(t = 1,2, \ldots ,5\), corresponding to the five strokes of each character. If the forearm appeared on the left side and subjects focused on the left Chinese character and performed the left-hand MI, \(\varvec{Xu}_{\varvec{l}}^{\varvec{t}}\) denotes the target data segment, whereas \(\varvec{Xu}_{\varvec{r}}^{\varvec{t}}\) denotes the non-target data segment. If the forearm appeared on the right side, \(\varvec{Xu}_{\varvec{l}}^{\varvec{t}}\) denotes the non-target data segment, whereas \(\varvec{Xu}_{\varvec{r}}^{\varvec{t}}\) denotes the target data segment. Then, a third-order Butterworth bandpass filter was used to filter the EEG between 0.1 and 12 Hz (Kolev et al. 1997; Jansen et al. 2004; Jin et al. 2011).

For amplitude variation feature extraction, the EEG data were down-sampled from 256 to 36.6 Hz by selecting every seventh sample from the filtered EEG (Hoffmann et al. 2008).
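The amplitude-feature steps (windowing, bandpass filtering, and down-sampling) can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the study: it relies on SciPy, applies the filter forward and backward (zero phase), and uses hypothetical function names and synthetic noise in place of EEG.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 256         # sampling rate in Hz (from the text)
WINDOW_S = 0.8   # epoch length after flash onset, in seconds
STEP = 7         # keep every seventh sample: 256 / 7 is about 36.6 Hz

def extract_epoch(trial, onset_s, fs=FS, window_s=WINDOW_S):
    """Cut the post-flash window from one trial of shape (channels, samples)."""
    start = int(round(onset_s * fs))
    length = int(round(window_s * fs))
    return trial[:, start:start + length]

def bandpass(epoch, low_hz=0.1, high_hz=12.0, fs=FS, order=3):
    """Third-order Butterworth bandpass; zero-phase filtering is an assumption."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, epoch, axis=-1)

def amplitude_features(trial, onset_s):
    """Window, filter, decimate, and flatten into one feature vector."""
    filtered = bandpass(extract_epoch(trial, onset_s))
    return filtered[:, ::STEP].ravel()

# Synthetic 10 s trial with 26 channels (random noise, for shape checking only).
rng = np.random.default_rng(0)
trial = rng.standard_normal((26, 10 * FS))
features = amplitude_features(trial, onset_s=1.9)
```

With an 800 ms window at 256 Hz, each channel contributes 30 samples after decimation, so 26 channels give a 780-dimensional amplitude feature vector per flash.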

The PLV feature extraction is based on phase coupling between two signals quantified by PLV. With \(S_{x} \left( t \right)\) and \(S_{y} \left( t \right)\) being the signals over two electrodes, PLV is calculated as follows (Wang et al. 2006; Wei et al. 2007; Kabbara et al. 2016):

$$PLV = \left| {\left\langle {e^{{j\Delta \theta \left( t \right)}} } \right\rangle_{t} } \right|$$
(1)

where \(\left\langle \cdot \right\rangle_{t}\) denotes averaging over time; \(\Delta \theta \left( t \right) = \theta_{x} \left( t \right) - \theta_{y} \left( t \right)\); and \(\theta_{x} \left( t \right)\) and \(\theta_{y} \left( t \right)\) denote the instantaneous phases of \(S_{x} \left( t \right)\) and \(S_{y} \left( t \right)\), respectively, which can be calculated using the Hilbert transform (Le Van Quyen et al. 2001; Wang et al. 2006). The PLV features extracted in accordance with the left and right flash time sequences are denoted by \(\varvec{Xv}_{\varvec{l}}^{\varvec{t}}\) and \(\varvec{Xv}_{\varvec{r}}^{\varvec{t}}\), respectively, where \(t = 1,2, \ldots ,5\).
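Equation (1) can be illustrated with a short NumPy sketch. The analytic signal is computed here with an FFT-based Hilbert transform; the function names are hypothetical, and the use of all channel pairs in `plv_features` is an assumption made for illustration.

```python
from itertools import combinations

import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal along the last axis (Hilbert transform)."""
    n = x.shape[-1]
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x, axis=-1) * weights, axis=-1)

def plv(sx, sy):
    """Phase locking value of two 1-D signals, following Eq. (1)."""
    delta = np.angle(analytic_signal(sx)) - np.angle(analytic_signal(sy))
    return np.abs(np.mean(np.exp(1j * delta)))

def plv_features(epoch):
    """PLV for every channel pair of an epoch with shape (channels, samples)."""
    phases = np.angle(analytic_signal(epoch))
    pairs = combinations(range(epoch.shape[0]), 2)
    return np.array([np.abs(np.mean(np.exp(1j * (phases[i] - phases[j]))))
                     for i, j in pairs])

# Two 10 Hz sinusoids with a constant phase lag are perfectly phase locked.
t = np.arange(256) / 256.0
a = np.sin(2 * np.pi * 10 * t)
b = np.sin(2 * np.pi * 10 * t + 0.7)
locked = plv(a, b)
```

A constant phase difference gives a PLV close to one, whereas unrelated phases drive the averaged phasor toward zero.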

MI feature extraction

Several methods have been introduced to EEG analysis for MI feature extraction, including band power, autoregressive models, the common spatial pattern (CSP) algorithm and its improved variants, and PLV (Wei et al. 2007; Kirar and Agrawal 2016; Mingai et al. 2016; Miao et al. 2017). Among them, CSP is an efficient feature extraction method that has been widely used in MI-based BCI systems (Ramoser et al. 2000; Qiu et al. 2017). Furthermore, combining multiple types of features can effectively improve the classification accuracy because the features are complementary (Wei et al. 2007). Here, we combined CSP and a phase coupling measure-based method to extract multiple features for MI classification. For each trial, EEG signals from 2.5 to 7.5 s after cue onset were selected for feature extraction.

The classification performance of CSP is largely dependent on the frequency bands used for bandpass filtering of the EEG data (Blankertz et al. 2008b; Zhang et al. 2015, 2017). However, the optimal filter band is typically subject-specific and is difficult to determine manually (Zhang et al. 2015). Thus, a sparse filter band common spatial pattern (SFBCSP) (Zhang et al. 2015) was used to generate optimal filter bands automatically. Several studies (Tam et al. 2011; Qiu et al. 2016; Miao et al. 2017) have shown that precise electrode selection can help improve the performance of MI-based BCIs. The ten channels that contributed most to the classification were selected by adopting Fisher’s linear discriminant criteria (Tam et al. 2011; Mingai et al. 2016).

The implementation procedure of SFBCSP is summarized as follows.

First, the raw EEG is bandpass filtered into a set of overlapping sub-bands. Second, CSP is used to extract the corresponding features from the filtered signals in each sub-band. The extracted CSP features can be described as \({\mathbf{G}} = \left[ {{\mathbf{g}}_{1} , {\mathbf{g}}_{2} , \ldots ,{\mathbf{g}}_{\varvec{i}} , \ldots , {\mathbf{g}}_{\varvec{N}} } \right]^{{\mathbf{T}}} \in {\mathbb{R}}^{N \times 2MK}\), where \({\mathbf{g}}_{\varvec{i}}\) denotes the feature vector extracted from the EEG sample at the i-th trial, \(N\) is the number of samples in the training set, 2M is the number of patterns selected in CSP, and K is the number of sub-bands. In the present study, M = 2. Sub-bands were chosen from the frequency range 8–30 Hz because it contains the mu and beta frequency components of the EEG, which are important for the discrimination task (Ramoser et al. 2000). Each sub-band has a bandwidth of 4 Hz with 2 Hz of overlap between adjacent sub-bands (8–12, 10–14, …, 26–30 Hz), which is consistent with previous studies (Kai et al. 2008; Zhang et al. 2015), giving a total of K = 10 sub-bands.
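The sub-band layout and the resulting feature dimension can be checked with a few lines of code. This is a small sketch with hypothetical names; it only enumerates the band edges and does not perform the filtering or the CSP decomposition.

```python
M = 2  # CSP keeps 2M spatial patterns per sub-band (from the text)

def make_subbands(f_low=8, f_high=30, width=4, step=2):
    """Enumerate overlapping (low, high) band edges in Hz."""
    bands = []
    start = f_low
    while start + width <= f_high:
        bands.append((start, start + width))
        start += step
    return bands

SUBBANDS = make_subbands()
N_CSP_FEATURES = 2 * M * len(SUBBANDS)   # length of each feature vector g_i
```

With a 4 Hz bandwidth and a 2 Hz step, the bands run from 8–12 Hz to 26–30 Hz, so each trial yields 2MK = 40 CSP features before sparse selection.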

Finally, significant CSP features can be extracted by following the sparse regression model

$${\mathbf{w}}^{*} = arg \mathop {\hbox{min} }\limits_{{\mathbf{w}}} \frac{1}{2}\left\| {{\mathbf{Gw}} - {\mathbf{y}}} \right\|_{2}^{2} +\uplambda\left\| {\mathbf{w}} \right\|_{1}$$
(2)

where \({\mathbf{w}}^{*}\) is the sparse weight vector to be learned, \({\mathbf{w}} \in {\mathbb{R}}^{2MK}\) is the weight vector of the extracted CSP features, \({\mathbf{y}} \in {\mathbb{R}}^{N}\) is the vector of class labels, \(\left\| \cdot \right\|_{1}\) denotes the \(l_{1}\)-norm, and λ is a positive regularization parameter for controlling the sparsity of \({\mathbf{w}}^{*}\) (a larger λ results in a sparser \({\mathbf{w}}^{*}\)). The optimization problem in (2) can be solved by the coordinate descent algorithm (Friedman et al. 2010). With the learned \({\mathbf{w}}^{*}\), the sparse feature vector can be denoted by \({\mathbf{g}}^{*} = {\mathbf{g}} \cdot {\mathbf{w}}^{*}\).
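To make the sparse regression in Eq. (2) concrete, the sketch below solves it with a plain cyclic coordinate descent in NumPy. It is a simplified illustration (fixed number of sweeps, no convergence check or feature standardization), not the solver used in the study, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator used in each coordinate update."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(G, y, lam, n_sweeps=200):
    """Minimize 0.5 * ||G w - y||^2 + lam * ||w||_1 by coordinate descent."""
    n_features = G.shape[1]
    w = np.zeros(n_features)
    col_sq = (G ** 2).sum(axis=0)
    residual = y - G @ w
    for _ in range(n_sweeps):
        for j in range(n_features):
            if col_sq[j] == 0.0:
                continue
            residual += G[:, j] * w[j]       # remove feature j's contribution
            rho = G[:, j] @ residual
            w[j] = soft_threshold(rho, lam) / col_sq[j]
            residual -= G[:, j] * w[j]       # add the updated contribution back
    return w

# With an orthonormal G the solution is simply the soft-thresholded targets.
G_demo = np.eye(3)
y_demo = np.array([3.0, -0.5, 1.5])
w_star = lasso_cd(G_demo, y_demo, lam=1.0)
```

Features whose weight is driven exactly to zero drop out of the sparse feature vector, and increasing λ removes more of them.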

The phase coupling measure-based features are extracted from the coupling between any two electrodes within each of the two ellipses around C3 and C4 (Wei et al. 2007). The PLV of each electrode pair is calculated using Eq. (1). The PLV feature vector is denoted by \(\varvec{p}\).

Classification scheme

Bayesian linear discriminant analysis (BLDA) (Hoffmann et al. 2008) was used to classify the MI and P300 features. In addition, a P300 and MI classification fusion method was proposed to improve classification accuracy, in which unreliable P300 classifications were corrected by reliable MI classifications.

BLDA for MI and P300 classification

BLDA can be regarded as an extension of Fisher’s linear discriminant analysis (FLDA). In contrast to FLDA, BLDA uses regularization to prevent overfitting to high-dimensional and possibly noisy datasets, and the degree of regularization is estimated automatically and quickly from the training data through Bayesian analysis (Hoffmann et al. 2008). The target vector \(\varvec{t}\) and the feature vectors X are assumed to be linearly related with additive white Gaussian noise \(\epsilon\):

$$\varvec{t} = {\mathbf{w}}^{T} {\mathbf{X}} + \epsilon$$
(3)

The likelihood function for the weights w can be described as follows:

$$p\left( {{\mathbf{D}} |\beta ,{\mathbf{w}}} \right) = \left( {\frac{\beta }{2\pi }} \right)^{N/2} \exp \left( { - \frac{\beta }{2}\left\| {{\mathbf{X}}^{\text{T}} {\mathbf{w}} - \varvec{t}} \right\|^{2} } \right)$$
(4)

where D denotes the pair {X, t}, \(\beta = \sigma^{ - 2}\) denotes the inverse variance of the noise \(\epsilon\), and N denotes the number of samples in the training set. The bias term can be omitted by assuming that the feature vectors contain one feature that is always equal to one.

To perform inference in a Bayesian setting, assume that the weight vector w satisfies the Gaussian prior distribution governed by a single precision parameter \(\alpha\):

$$p\left( {{\mathbf{w}} |\alpha } \right) = \left( {\frac{\alpha }{2\pi }} \right)^{D/2} { \exp }\left( { - \frac{\alpha }{2}\left| {\left| {\mathbf{w}} \right|} \right|_{2}^{2} } \right)$$
(5)

In accordance with the Bayesian rule, the posterior can be computed as follows:

$$p\left( {{\mathbf{w}} |\alpha ,\sigma^{2} ,\varvec{t}} \right) = \frac{{p\left( {\varvec{t} |{\mathbf{w}},\sigma^{2} } \right)p\left( {{\mathbf{w}} |\alpha } \right)}}{{p\left( {\varvec{t} |\alpha ,\sigma^{2} } \right)}}$$
(6)

Given that the likelihood and prior satisfy Gaussian distributions, the posterior also exhibits a Gaussian distribution. Thus, the mean \(\varvec{\mu}\) and covariance \({\varvec{\Sigma}}\) of the posterior satisfy the following equations:

$${\varvec{\Sigma}} = \left( {\sigma^{ - 2} {\mathbf{X}}{\mathbf{X}}^{\text{T}} + \alpha {\mathbf{I}}} \right)^{ - 1}$$
(7)
$$\varvec{\mu}= \sigma^{ - 2} {\varvec{\Sigma}}{\mathbf{X}}\varvec{t}$$
(8)

For a new test sample \({\hat{\mathbf{x}}}\), the predictive distribution can be computed as follows:

$$p\left( {\hat{t} |\alpha ,\sigma^{2} ,{\hat{\mathbf{x}}},\varvec{t}} \right) = \int p(\hat{t}|{\mathbf{w}},\sigma^{2} ,{\hat{\mathbf{x}}})p({\mathbf{w}}|\alpha ,\sigma^{2} ,\varvec{t})d{\mathbf{w}}$$
(9)

The predictive distribution also satisfies the Gaussian distribution with mean and variance of

$$\hat{\mu } =\varvec{\mu}^{T} {\hat{\mathbf{x}}}$$
(10)
$$\hat{\sigma }^{2} = \sigma^{2} + {\hat{\mathbf{x}}}^{T} {\mathbf{\varSigma \hat{x}}}$$
(11)

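For a binary classification model, the regression targets are set to \(N/N_{1}\) for training samples of class 1 and \(- N/N_{2}\) for training samples of class −1, where \(N_{1}\) and \(N_{2}\) are the numbers of training samples in the two classes. The hyperparameters α and \(\sigma^{2}\) can be automatically and iteratively estimated using a previously described procedure (MacKay 1992).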
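The posterior and predictive quantities of BLDA can be sketched in NumPy as follows. This is a minimal illustration, assuming that X stores one feature vector per column (features × samples), as implied by the term \({\mathbf{X}}^{\text{T}} {\mathbf{w}}\) in Eq. (4), and that the hyperparameters α and σ² are given; the iterative evidence-based estimation of the hyperparameters is not shown, and the function names are hypothetical.

```python
import numpy as np

def blda_posterior(X, t, alpha, sigma2):
    """Posterior mean and covariance of the weights.

    X: array of shape (n_features, n_samples), one sample per column.
    t: array of shape (n_samples,) holding the regression targets.
    """
    n_features = X.shape[0]
    precision = X @ X.T / sigma2 + alpha * np.eye(n_features)
    Sigma = np.linalg.inv(precision)
    mu = Sigma @ X @ t / sigma2
    return mu, Sigma

def blda_predict(x_new, mu, Sigma, sigma2):
    """Predictive mean and variance for one new feature vector."""
    mean = mu @ x_new
    variance = sigma2 + x_new @ Sigma @ x_new
    return mean, variance

# One-feature toy problem that can be verified by hand.
X_demo = np.array([[1.0, 1.0]])
t_demo = np.array([1.0, 1.0])
mu_demo, Sigma_demo = blda_posterior(X_demo, t_demo, alpha=1.0, sigma2=1.0)
pred_mean, pred_var = blda_predict(np.array([1.0]), mu_demo, Sigma_demo, sigma2=1.0)
```

In the toy problem the precision is 2 + 1 = 3, so the posterior mean is 2/3 and the predictive variance is 1 + 1/3; the prior shrinks the weight below the least-squares value of 1.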

With a learned posterior mean, \(\varvec{\mu}\), of the weight vector, w, a new P300 feature vector can be classified using a simple linear discriminant criterion:

$$t_{P300} = \left\{ {\begin{array}{*{20}l} {1,} \hfill & { \quad if\,\,\varvec{\mu}^{T} \mathop \sum \limits_{\varvec{t}}^{N} \varvec{Xp}_{\varvec{l}}^{\varvec{t}} >\varvec{\mu}^{T} \mathop \sum \limits_{\varvec{t}}^{N} \varvec{Xp}_{\varvec{r}}^{\varvec{t}} } \hfill \\ { - 1,} \hfill & {\quad if\,\,\varvec{\mu}^{T} \mathop \sum \limits_{\varvec{t}}^{N} \varvec{Xp}_{\varvec{l}}^{\varvec{t}} <\varvec{\mu}^{T} \mathop \sum \limits_{\varvec{t}}^{N} \varvec{Xp}_{\varvec{r}}^{\varvec{t}} } \hfill \\ \end{array} } \right.$$
(12)

where \(\varvec{Xp}_{\varvec{l}}^{\varvec{t}}\) and \(\varvec{Xp}_{\varvec{r}}^{\varvec{t}}\) denote the combined amplitude variation and PLV features of P300 for the left and right flash sequences, respectively, and N is the number of strokes (five in this study).

MI features can be classified by

$$t_{MI} = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {\quad if\,\,\varvec{\mu}^{T} Xm > - \mu_{0} } \hfill \\ { - 1,} \hfill & {\quad if\,\,\varvec{\mu}^{T} Xm < - \mu_{0} ,} \hfill \\ \end{array} } \right.$$
(13)

where \(\varvec{Xm}\) denotes the combination of SFBCSP and PLV features of MI, and \(\mu_{0}\) is the discriminant threshold. In the present study, \(\mu_{0} = 0\).
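The two decision rules in Eqs. (12) and (13) amount to comparing linear scores. The sketch below is illustrative only: the function names are hypothetical, and ties, which the equations leave undefined, are arbitrarily mapped to −1 here.

```python
import numpy as np

def classify_p300(mu, feats_left, feats_right):
    """Eq. (12): compare summed scores of the left and right flash sequences.

    feats_left, feats_right: arrays of shape (n_strokes, n_features).
    Returns the label (1 = left, -1 = right) and both scores.
    """
    score_left = mu @ np.sum(feats_left, axis=0)
    score_right = mu @ np.sum(feats_right, axis=0)
    label = 1 if score_left > score_right else -1   # assumption: tie -> -1
    return label, score_left, score_right

def classify_mi(mu, x_mi, mu0=0.0):
    """Eq. (13): threshold the linear score of the MI feature vector."""
    return 1 if mu @ x_mi > -mu0 else -1            # assumption: tie -> -1

mu_toy = np.array([1.0, 0.0])
left_toy = np.array([[1.0, 0.0], [2.0, 0.0]])
right_toy = np.array([[0.0, 5.0], [1.0, 5.0]])
p300_label, s_left, s_right = classify_p300(mu_toy, left_toy, right_toy)
mi_label = classify_mi(mu_toy, np.array([-0.5, 2.0]))
```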

P300 and MI classification fusion

As described in the previous section, the P300 potential and the ERD/ERS phenomenon are generated simultaneously in this experimental paradigm. In general, P300 produces higher classification accuracy and is more robust than MI (Bishop 2006). However, there are also cases where the P300 classifications are incorrect whereas the MI classifications are correct. To improve the classification accuracy, we can consider using the reliable MI classification to correct the unreliable P300 classification.

According to the statistical theory governing BLDA, classification errors arise from regions of the input space where the largest of the posterior probabilities \(p\left( {\hat{t} |{\hat{\mathbf{x}}}} \right)\) is significantly less than unity, or equivalently, where the joint distributions \(p\left( {\hat{t},{\hat{\mathbf{x}}}} \right)\) have comparable values (Nasrabadi 2007). These are the regions where we are relatively uncertain about class membership (as shown in Fig. 2). Correspondingly, in a BLDA model, a predictive mean \(\hat{\mu }\) of larger magnitude more strongly represents the characteristics of P300 and MI features as defined by the training set (Hoffmann et al. 2008). Thus, \(\left| {\hat{\mu }} \right|\) can be used as the classification confidence. For an MI classification, if \(\left| {\hat{\mu }} \right|\) is less than a threshold \(T_{c}\), the classification is considered unreliable. For P300 classification, a single classification is deemed unreliable if the predictive means \(\hat{\mu }_{l}\) and \(\hat{\mu }_{r}\) satisfy the following conditions:

  1. \(sign\left( {\hat{\mu }_{l} } \right) = sign\left( {\hat{\mu }_{r} } \right)\)

  2. \(\left| {\hat{\mu }_{l} } \right| < T_{c} \,\,and\,\, \left| {\hat{\mu }_{r} } \right| < T_{c}\)

where \(T_{c}\) is a reliability threshold that can be obtained via cross-validation. Typically, \(T_{c}\) is set to a value that produces a high error rate for unreliable classifications and a low error rate for reliable classifications. The first condition means that \(\overline{{\varvec{X}_{\varvec{l}} }}\) and \(\overline{{\varvec{X}_{\varvec{r}} }}\) are classified into the same class. The second condition means that the class memberships of \(\overline{{\varvec{X}_{\varvec{l}} }}\) and \(\overline{{\varvec{X}_{\varvec{r}} }}\) are relatively uncertain. Each condition results in an unreliable or low-confidence classification output, which we aim to prevent.

Fig. 2 Illustration of the unreliable region

For an unreliable classification of P300, the modification criteria are defined as follows:

  1. If the P300 classification result is consistent with the MI classification result, keep the classification result unchanged.

  2. If the P300 classification result differs from the MI classification result and the predictive mean of MI is less than the reliability threshold, use the P300 classification result.

  3. If the P300 classification result differs from the MI classification result and the predictive mean of MI is larger than the reliability threshold, use the MI classification result.

The procedure for P300 and MI classification fusion is summarized in Fig. 3.
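The reliability test and the modification criteria above can be written as a short decision function. This sketch makes two assumptions that are not stated explicitly in the text: a P300 output is treated as unreliable when either of the two conditions holds, and the magnitude of the MI predictive mean is compared against the same threshold \(T_{c}\). All names are hypothetical.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

def p300_is_unreliable(mu_l, mu_r, t_c):
    """Assumption: either condition alone marks the P300 output as unreliable."""
    same_class = sign(mu_l) == sign(mu_r)
    both_weak = abs(mu_l) < t_c and abs(mu_r) < t_c
    return same_class or both_weak

def fuse(p300_label, mu_l, mu_r, mi_label, mi_mean, t_c):
    """Correct an unreliable P300 label with a reliable MI label."""
    if not p300_is_unreliable(mu_l, mu_r, t_c):
        return p300_label            # reliable P300 output is kept
    if p300_label == mi_label:
        return p300_label            # criterion 1: both modalities agree
    if abs(mi_mean) > t_c:
        return mi_label              # criterion 3: MI is reliable, use MI
    return p300_label                # criterion 2: MI is unreliable, keep P300
```

For example, with \(T_{c} = 0.3\), an unreliable P300 output of 1 (predictive means 0.2 and 0.1) is overridden by an MI output of −1 whose predictive mean is −0.9, but not by one whose predictive mean is −0.1.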

Fig. 3 Procedure for P300 and MI classification fusion

Results

EEG results

To verify whether the MI and P300 tasks could be simultaneously analyzed and whether their corresponding features could be independently extracted, the ERD/ERS maps and EEG waveforms are studied in this section. The ERD/ERS maps over channels C3 and C4, the grand averaged target and non-target time-series data over channel Cz from three typical subjects, and the average value of all the subjects are shown in Fig. 4. These electrodes were selected for illustrative purposes because the left- and right-hand motor cortices are mainly localized in the C3 and C4 electrode regions (Pfurtscheller and Aranibar 1979; Pfurtscheller 1992; Pfurtscheller and Neuper 1997; Qiu et al. 2017; Ono et al. 2018), whereas Cz is a primary electrode used for P300 analysis (Sellers et al. 2006; Hoffmann et al. 2008; Jin et al. 2011).

Fig. 4 ERD/ERS maps (first row) over channels C3 and C4 and the grand averaged time series data (second row) of target and non-target responses over channel Cz from three typical subjects (a–c) and average value of all subjects (d)

From Fig. 4, obvious ERD/ERS phenomena can be observed at channels C3/C4 when subjects perform right-hand/left-hand MI tasks while following the flashing strokes of the Chinese characters. Additionally, for the target waveform (shown in red), a positive ERP occurred approximately 300 ms after the flash onset. For the non-target waveform (shown in blue), two smaller peaks occurred approximately 300 ms before and 300 ms after the P300 potential of the target waveform. This is because non-target data segments have a −300 or 300 ms offset compared with the target data segments. These results show that the task designed for the experiment can elicit MI and P300 features simultaneously. The characteristics of these signals are similar to those evoked through a single-modality task (McFarland et al. 2000; Jansen et al. 2004; Huang et al. 2018).

Offline classification results

To quantify the classification performance, we developed measures related to the P300 classification, which included the error rate of unreliable classification (\({\text{UER}}\)), error rate of reliable classification (\({\text{RER}}\)), and error correction rate (\({\text{CR}}\)). These are defined as follows:

$$UER = \frac{UF}{UT + UF}$$
(14)
$$RER = \frac{RF}{RT + RF}$$
(15)
$$CR = \frac{CT}{UF}$$
(16)

where UF and UT are the numbers of false and true P300 classifications, respectively, among the outputs deemed unreliable, so that UER is the error rate of unreliable P300 classifications. Likewise, RF and RT are the numbers of false and true P300 classifications, respectively, among the outputs deemed reliable, so that RER is the error rate of reliable P300 classifications. CR denotes the proportion of erroneous unreliable classifications that are corrected, where CT is the number of unreliable classifications corrected using the MI classification output. UER, RER, and CR obtained in a 10-fold cross-validation for the 18 subjects are shown in Table 1.
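The three measures in Eqs. (14)–(16) can be computed from per-trial bookkeeping as sketched below. The record layout and the function name are hypothetical, and the demo counts are made up purely to exercise the formulas; they are not data from the study.

```python
def p300_error_measures(records):
    """Compute UER, RER, and CR.

    records: iterable of (unreliable, p300_correct, fused_correct) booleans,
    one tuple per classified trial.
    """
    uf = sum(1 for u, ok, _ in records if u and not ok)
    ut = sum(1 for u, ok, _ in records if u and ok)
    rf = sum(1 for u, ok, _ in records if not u and not ok)
    rt = sum(1 for u, ok, _ in records if not u and ok)
    ct = sum(1 for u, ok, fused in records if u and not ok and fused)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {"UER": ratio(uf, ut + uf), "RER": ratio(rf, rt + rf), "CR": ratio(ct, uf)}

# Made-up example: 10 unreliable outputs (4 wrong, 3 of them corrected by MI)
# and 20 reliable outputs (1 wrong).
demo_records = (
    [(True, False, True)] * 3
    + [(True, False, False)] * 1
    + [(True, True, True)] * 6
    + [(False, False, False)] * 1
    + [(False, True, True)] * 19
)
measures = p300_error_measures(demo_records)
```

In this made-up example, UER = 4/10, RER = 1/20, and CR = 3/4.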

Table 1 Error rate of unreliable classification (UER), error rate of reliable classification (RER), and error correction rate (CR) of 18 subjects using 10-fold cross-validation

The first result of note is the large difference between UER and RER. The average value of UER (21.46 ± 7.20%) is larger than that of RER (1.13 ± 2.00%). Furthermore, all subjects, except S10, showed a higher UER versus RER. This finding means that the probability of error in an unreliable classification is significantly higher than that in reliable classification. The lowest error correction rate was 50.00%, and the highest error correction rate was 100.00%. The averaged error correction rate of the fusion method was 75.51 ± 22.87%.

Furthermore, to evaluate the effects of the P300 and MI fusion method on classification performance, the average classification accuracies of MI, P300, and P300 fusion with MI (P300 + MI) for 18 subjects are presented in Fig. 5. A paired sample t test was used to assess the difference in classification accuracy for P300 + MI versus MI and P300 + MI versus P300. The Lilliefors test was used to verify whether the sample distribution satisfies the condition of the paired sample t-test.
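For reference, the paired-sample t statistic is the mean of the per-subject accuracy differences divided by its standard error. The sketch below uses only the Python standard library and hypothetical names; it returns the statistic and the degrees of freedom, while the p-value (from the t distribution, for example via `scipy.stats.ttest_rel`) and the Lilliefors normality check are not reproduced here. The numbers in the demo are arbitrary and are not results from the study.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

# Arbitrary toy values: the differences are 1, 2, and 3.
t_value, dof = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```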

Fig. 5 Offline classification accuracy (%) of MI, P300, and P300 fusion with MI (P300 + MI) for 18 subjects. The average classification accuracies (mean ± SD) of MI, P300, and P300 + MI are 85.02 ± 9.09%, 94.72 ± 4.54%, and 97.29 ± 3.65%, and the p-values of P300 + MI versus P300 and P300 + MI versus MI are 5.56E−06 and 1.36E−06, respectively

As shown in Fig. 5, the P300 and MI classification fusion method (P300 + MI) yielded a higher average classification accuracy than the P300 or MI modalities alone. All subjects except S2 and S9 demonstrated a higher classification accuracy using the P300 and MI fusion method than using only P300. The average classification accuracy (mean ± SD) using the P300 + MI technique was 97.29 ± 3.65%, which was 2.57 and 12.27 percentage points higher than those of P300 (94.72 ± 4.54%) and MI (85.02 ± 9.09%), respectively. The paired sample t-test also showed that the classification performance of P300 + MI was significantly better than that of P300 (p < 0.01) and MI (p < 0.01).

Effect of reducing training data on classification performance

Theoretically, a higher classification accuracy can be achieved by using more training data for classifier calibration because more training data help mitigate over-fitting. However, more training data require a longer EEG recording time, which reduces the practicability of the BCI system. As such, an effective BCI system should be able to obtain a high classification accuracy with less training data.

We investigated the classification accuracy of each method using different lengths of training data. Figure 6 shows the classification accuracies of the P300, MI, and P300 + MI fusion methods, averaged over all subjects, with training data lengths from 20 to 90% of the entire data (in increments of 10%). The training data were randomly selected for each evaluation. To reduce the influence of randomness on the results, each evaluation was repeated 100 times, and the average classification accuracies were calculated (Peterson et al. 2017).
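This evaluation scheme can be sketched as repeated random sub-sampling. In the Python sketch below, `train_and_score` is a hypothetical callback that fits a classifier on the training trials and returns its accuracy on the remaining trials; the use of the remaining trials as the test set, the `Trial` representation, and the fixed random seed are assumptions, since those details are not restated in this section.

```python
import random
from statistics import fmean
from typing import Callable, Dict, List, Sequence, Tuple

Trial = Tuple[Sequence[float], int]  # (feature vector, class label), assumed layout


def mean_accuracy_by_fraction(
    trials: List[Trial],
    train_and_score: Callable[[List[Trial], List[Trial]], float],
    fractions: Sequence[float] = (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
    repeats: int = 100,
    seed: int = 0,
) -> Dict[float, float]:
    """Average accuracy over `repeats` random splits for each training fraction."""
    rng = random.Random(seed)
    results: Dict[float, float] = {}
    for frac in fractions:
        # Keep at least one trial on each side of the split.
        n_train = min(len(trials) - 1, max(1, round(frac * len(trials))))
        scores = []
        for _ in range(repeats):
            shuffled = trials[:]
            rng.shuffle(shuffled)  # new random selection for every repetition
            train, test = shuffled[:n_train], shuffled[n_train:]
            scores.append(train_and_score(train, test))
        results[frac] = fmean(scores)
    return results
```

Averaging over many random splits reduces the chance that a single favourable or unfavourable selection of training trials dominates the reported accuracy.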

Fig. 6

Classification accuracies using P300, MI, and P300 + MI fusion method, averaged for all subjects, for different lengths of training data

As shown in Fig. 6, the average classification accuracies of all three methods increased with more training data. However, for all training data lengths, the combined P300 and MI technique showed higher classification accuracy than P300 or MI alone. Moreover, the P300 + MI fusion method achieved high recognition accuracy with less training data. For a classification accuracy threshold of 90%, only 40% of the training data is needed to reach this accuracy using the P300 + MI technique, whereas 60% of the training data is needed using P300 alone.
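The threshold comparison can be expressed as a small helper that returns the smallest training fraction at which a method's mean accuracy reaches a given level. The function name and the dictionary layout (training fraction mapped to mean accuracy in percent) are hypothetical; the per-fraction accuracies themselves would have to be taken from the evaluation, as they are not listed in the text.

```python
from typing import Mapping, Optional


def min_fraction_reaching(
    acc_by_fraction: Mapping[float, float], threshold: float = 90.0
) -> Optional[float]:
    """Smallest training fraction whose mean accuracy (%) meets the threshold.

    Returns None if no fraction reaches the threshold.
    """
    reached = [f for f, acc in acc_by_fraction.items() if acc >= threshold]
    return min(reached) if reached else None
```

With the fractions reported above (40% for P300 + MI versus 60% for P300 alone), P300 alone needs 0.6 / 0.4 = 1.5 times as much training data to reach 90%.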

Online classification results

Figure 7 shows the online classification accuracies of MI, P300, and P300 fusion with MI (P300 + MI) for 18 subjects. The average online classification accuracy (mean ± SD) using the P300 + MI method was 93.94 ± 5.19%, which, based on the reported means, is 12.33 and 2.69 percentage points higher than that of MI (81.61 ± 8.79%) and P300 (91.25 ± 9.04%), respectively. The paired sample t-test also showed that the online classification performance of P300 + MI was significantly better than that of P300 (p < 0.05) and MI (p < 0.01).

Fig. 7

Online classification accuracy (%) of MI, P300, and P300 fusion with MI (P300 + MI) for 18 subjects. The average classification accuracies (mean ± SD) of MI, P300, and P300 + MI are 81.61 ± 8.79%, 91.25 ± 9.04%, and 93.94 ± 5.19%, and the p-values of P300 + MI vs. P300 and P300 + MI vs. MI are 0.03 and 1.45E−06, respectively

Discussion

The primary goal of this study was to design a hybrid BCI paradigm on the basis of MI and P300 potentials to improve feedback performance at the early stages of BCI-feedback training. The key to a hybrid BCI system is that the two signal components must be elicited simultaneously and independently (Li et al. 2010). The recognition of the MI task is fundamentally based on the ERD/ERS, which is reflected in the EEG frequency spectrum. As shown in Fig. 4, the ERD/ERS in the mu (8–12 Hz) or beta rhythm (13–30 Hz) can be clearly seen at channel C3 or C4, which is consistent with the findings reported in previous MI studies (Pfurtscheller and Lopes da Silva 1999; Blankertz et al. 2010; Faller et al. 2012; Ma et al. 2017). For P300, the task recognition is based on the characteristic peak components in the time domain. As shown in Fig. 4, a positive ERP occurred for target data segments at approximately 300 ms after the onset of the character-stroke flash. Together, these results demonstrate that the desired signal features (i.e., ERD/ERS for MI and P3 for P300) can be obtained simultaneously during imagined writing of Chinese characters.

In terms of recognition performance, both P300 and MI features were discriminable (see Figs. 5 and 7). The online classification accuracy of MI in three of the subjects was more than 90%, and the average classification accuracy of all subjects was more than 80% in their first MI training. As such, imagining writing Chinese characters can help subjects modulate their sensorimotor rhythms effectively. As expected, the average accuracy and stability of the P300 recognition outperformed those of MI. However, MI is often indispensable because in several BCI applications, such as BCI-based stroke rehabilitation, MI is better suited than evoked potentials, such as the P300 response. Though P300 performance is generally superior to that of MI, the recognition accuracy of P300 can be further improved by incorporating complementary information from MI. According to the statistical results (see Table 1), the error rate of unreliable P300 classifications was significantly higher than that of reliable classifications. Erroneous P300 classifications made when the output is unreliable can be corrected by MI. Using this method, the average error correction rate reached 75.51 ± 22.87%. According to ensemble learning theory, when the outputs of two or more classifiers have a certain accuracy and diversity, the accuracy and robustness can be improved by combining these classifiers (Zhou 2012). Here, a fusion method of P300 and MI classification, which automatically corrects an unreliable P300 classification with a reliable MI classification, was proposed. The recognition results showed that the recognition accuracy of the fusion method was significantly higher than that of P300 (p < 0.05) and MI (p < 0.01) alone. In particular, for the two subjects (S3 and S5) with low MI classification accuracy [< 70%, less than the BCI “efficiency” threshold (Kübler et al. 2001)], a remarkable improvement in BCI performance can be achieved through the hybrid paradigm.

In addition, the required length of the training data can be reduced through the fusion of the two features, which is very important in practical applications. Generally, in real-world applications, such as BCI-based stroke rehabilitation systems, a long period of offline data is required to train the classifier prior to online use (Jin et al. 2011; Faller et al. 2012). The long time required to perform these offline recordings decreases the practicability and convenience of the BCI system. Therefore, understanding how the offline training time can be reduced is essential for producing useful and practical BCIs (Shenoy et al. 2006; Wu and Ge 2013; Kindermans et al. 2014; Jiao et al. 2019). MI usually requires a long training time before a sufficiently high recognition performance can be achieved, whereas P300 requires less (the average classification accuracy of P300 is considerably higher than that of MI with smaller data set sizes; see Fig. 6). Moreover, the fusion of P300 and MI signals can further shorten the training data length. As shown in Fig. 6, the fusion of P300 and MI reaches an average accuracy of 90% with 40% of the training data, whereas P300 alone needs 60%, that is, 50% more data to achieve the same accuracy. In addition, the fusion method can also be combined with other technologies (semi-supervised learning, online adaptive learning, and transfer learning) to further reduce the offline training time.

Conclusion

A hybrid BCI paradigm based on MI and P300, in which subjects imagine writing Chinese characters, was proposed in this study. The ERD/ERS phenomenon and P300 potentials were elicited simultaneously by imagining writing Chinese characters while following the flashing strokes of these characters on a screen. On the basis of statistical theory, a P300 and MI classification fusion method was proposed for enhancing BCI performance. The results showed that the proposed hybrid BCI paradigm yielded higher recognition accuracy than BCI systems using only MI. Additionally, the offline training time could be shortened by fusing P300 and MI features. Our future work will focus on developing an online feedback system incorporating adaptive learning to further shorten the offline data collection time. We will also focus on applying this paradigm to stroke patients to verify its performance in stroke rehabilitation training.