1 Introduction

Epilepsy is a common chronic brain disorder characterized by abrupt, recurrent, and uncontrolled neuronal discharges [1]. It is primarily caused by genetic factors and external injuries, and seizures can lead to self-injury or worse consequences [2]. The electroencephalogram (EEG), which records the electrical activity of the brain, has become a reliable tool in the diagnosis of epilepsy [3]. Large amounts of EEG data must be identified by visual inspection by well-trained neurophysiologists, which is a subjective as well as tedious process [4]. Therefore, the development of automatic detection methods for EEG recordings provides a new way to relieve the workload of neurologists.

For the diagnosis of epilepsy, offline epileptic seizure detection is an indispensable procedure that can be traced back to the 1970s. To date, a large number of automated seizure detection algorithms based on the classification of EEG data have been introduced. A broadly applied technique, proposed by Gotman [5], decomposed EEG signals into half-waves and used the peak amplitude, slope, duration, and sharpness as features [6]. Orhan et al. [7] developed a method that adopted k-means clustering and a multilayer perceptron neural network model for discriminant analysis. With the development of nonlinear dynamics theory, a number of nonlinear features, including higher-order spectra [8], approximate entropy [9], the largest Lyapunov exponent [10], and the pattern match regularity statistic [11], were also used for seizure detection and yielded promising results.

Since EEG signals are non-stationary by nature and have paroxysmal and transient characteristics, the wavelet transform has become a powerful approach in EEG analysis; it expresses a signal as a linear combination of a specific series of wavelet functions. In contrast to the short-time Fourier transform (STFT), the wavelet transform uses a variable window size to overcome the resolution limitations of the STFT [12, 13]. Long time windows are employed at low frequencies to obtain high frequency resolution, and short time windows are adopted at high frequencies to acquire high time resolution. Hence, the wavelet transform can provide frequency and time information at different scales and localize transient changes precisely in both the time and frequency domains [14].

Sparse representation, which stems from compressed sensing, has been extensively applied in many fields, especially pattern recognition. Wright et al. [15] proposed sparse representation-based classification (SRC) for robust face recognition by encoding a query face over the set of training templates; classification is then performed by evaluating the representation residuals. Subsequently, Zhang et al. [16] extended the SRC to the collaborative representation-based classifier (CRC), which replaces l1-minimization with l2-minimization and achieves competitive accuracy at lower complexity. Zhou et al. introduced the SRC and CRC to discriminate ictal from interictal EEGs [17] and to detect seizure events in long-term EEG signals [18]. Although the SRC and CRC achieve competitive accuracy, they lack an intrinsic explanation. Recently, a clear probabilistic explanation of the classification mechanism of the CRC was given in [19], where a robust probabilistic collaborative representation-based classifier (R-ProCRC) was also presented for face recognition.

Because of the nonlinear separability of EEG data, it is difficult to find an effective linear technique to classify them in the original sample space. In this study, the kernel function method is applied to map each EEG epoch into a high-dimensional space, which, combined with the R-ProCRC, can more effectively capture the nonlinear relationships among EEG samples. The performance of the kernel R-ProCRC is evaluated on two different EEG databases, and the detection results indicate the potential of the proposed method for clinical application.

The remainder of this paper is organized as follows. Section 2 briefly describes the two EEG databases and details the proposed method, which comprises preprocessing, the kernel R-ProCRC, and post-processing. Section 3 presents the experimental results, a discussion of the performance follows in Section 4, and Section 5 concludes this work.

2 Materials and methods

2.1 EEG database

Two EEG databases are used to evaluate the proposed method in this study. One is from the Department of Epileptology, Bonn University, Germany, and the other is from the Epilepsy Center of the University Hospital of Freiburg, Germany. The Bonn database comprises five subsets (denoted Z, O, N, F, and S) digitized at a sampling rate of 173.61 Hz. Each subset contains 100 single-channel EEG segments of 23.6-s duration. Sets Z and O contain scalp EEG segments recorded from five healthy volunteers with eyes open and eyes closed, respectively. Sets N and F were collected during seizure-free intervals from five epileptic patients: epochs in set F were obtained from the epileptogenic zone, and those in set N were extracted from the hippocampal formation of the opposite hemisphere, while set S contains epileptic seizure epochs from all channels. A more detailed description of this database is given in [20].

Three classification experiments are constructed on the Bonn database to evaluate the classifying capacity of the proposed method. First, set F (interictal) and set S (ictal) are selected for classification, which is the scenario most likely to arise in clinical practice. Second, the classification of set Z (normal) versus set S (ictal) is addressed. Third, four sets are divided into two classes: the first class comprises sets Z, N, and F, while set S forms the second class.

The data from the University Hospital of Freiburg are acquired using a Neurofile NT digital video-EEG system with 128 channels, a 256-Hz sampling rate, and a 16-bit analog-to-digital converter. The whole database includes intracranial EEG recordings of 21 patients suffering from medically intractable focal epilepsy, which are recorded during presurgical epilepsy monitoring with invasive electrodes [21]. The seizure onset and offset are determined by epileptologists. In addition, three focal and three extrafocal channels were previously chosen by certified epileptologists for all the patients. The Freiburg database is summarized in Table 1.

Table 1 Summary of the Freiburg database used in this study

In this study, both the three focal channels and the three extrafocal channels are used for seizure detection. For each patient, no more than three seizure events are selected in chronological order (except for patients 5, 15, and 19), together with twice as much non-seizure data, for training. In total, 0.76 h of seizure data and 1.52 h of non-seizure data are used to train the classifier. In addition, 2.05 h of seizure data (1844 epochs) and 560.05 h of non-seizure data (504,042 epochs) are selected to assess the performance of the proposed method. In aggregate, 564.38 h of EEG data are used in this work.

The amount of data in the Bonn database is much smaller than that in the Freiburg database, and the sampling frequencies of the two databases differ. In addition, the EEG data in the Bonn database were selected and cut out from continuous EEG recordings after visual inspection for artifacts, e.g., due to muscle activity or eye movements, whereas the Freiburg database contains the original continuous long-term EEG recordings. The seizure detection task on the Freiburg database is therefore much more difficult than the EEG classification on the Bonn database. Only if good results are obtained on the Bonn database can the proposed kernel R-ProCRC be expected to yield good detection results on the Freiburg database, which contains far more EEG data contaminated with noise. Therefore, the experiment on the Bonn database can be regarded as a verification of the proposed method, while the experiment on the Freiburg database serves as the final proof of whether this method can be applied in practice.

2.2 Preprocessing

In this study, the preprocessing procedures differ between the Freiburg and Bonn databases. For the Freiburg database, the original long-term EEG data are decomposed into 4-s segments using a sliding window without overlap. Next, the discrete wavelet transform with five decomposition levels is used to preprocess the EEG segments. Because the sampling frequency of the Freiburg database is 256 Hz, each EEG segment is split into five detail coefficients (D1–D5) corresponding to 64–128 Hz, 32–64 Hz, 16–32 Hz, 8–16 Hz, and 4–8 Hz, and the approximation coefficients (A5) representing 0–4 Hz. In this work, the Daubechies-4 wavelet is adopted as the wavelet function, which many previous studies have shown to capture the characteristics of EEG signals effectively [14, 22, 23]. Considering that seizures usually occur between 3 and 29 Hz, the coefficients D3, D4, and D5 are selected to reconstruct the sub-signals XD3, XD4, and XD5 for epileptic seizure detection.
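The band decomposition and sub-signal reconstruction described above can be sketched as follows. This is a minimal numpy-only illustration using the Haar filter pair as a stand-in for the Daubechies-4 filters adopted in the paper; the function names are illustrative, not from the original implementation.

```python
import numpy as np

# One level of an orthogonal DWT (Haar filters as a simple stand-in for db4)
def dwt_step(x):
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # approximation coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # detail coefficients
    return a, d

def idwt_step(a, d):
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

def wavedec(x, levels):
    details = []
    a = x
    for _ in range(levels):
        a, d = dwt_step(a)
        details.append(d)                    # D1, D2, ..., D_levels
    return a, details

def band_signal(x, levels, keep):
    """Reconstruct the sub-signal carried by detail level `keep` alone."""
    _, details = wavedec(x, levels)
    a = np.zeros(len(x) // 2 ** levels)      # discard the approximation band
    for lvl in range(levels, 0, -1):
        d = details[lvl - 1] if lvl == keep else np.zeros_like(details[lvl - 1])
        a = idwt_step(a, d)
    return a

# A 4-s epoch at 256 Hz is 1024 samples; levels D3-D5 jointly cover 4-32 Hz
rng = np.random.default_rng(0)
epoch = rng.standard_normal(1024)
XD3, XD4, XD5 = (band_signal(epoch, 5, lvl) for lvl in (3, 4, 5))
```

Because the transform is orthogonal and linear, summing the reconstructions of all six bands (A5 and D1–D5) recovers the original epoch exactly, which is a convenient sanity check for any implementation of this step.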

The differential operator, defined as fn − fn − 1, can capture significant changes such as spikes and sharp waves in EEG signals. It heightens the contrast between seizure activities and the background, which contains interictal EEGs [24, 25]. In this study, the differential operator is applied to XD3, XD4, and XD5, and is defined as

$$ {F}_{\mathrm{n}}=\exp \left(\frac{1}{w}\left|{D}^{\prime }{f}_{\mathrm{n}}\right|\right) $$
(1)

where fn is the EEG signal, D′ represents the first-order difference with respect to n, and w is a patient-specific parameter. For the Freiburg database, continuous 1-h EEG data containing seizures were used to determine w in the training stage; w was adjusted in advance to obtain the best results for each patient. For the Bonn database, w was adjusted in the 10-fold cross-validation experiment to obtain the best results. After the above preprocessing steps, the three outputs of the differential operator are used for classification.
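Eq. (1) is simply the exponentiated, scaled absolute first difference of the signal. A minimal sketch follows; the function name and the padding of the first sample are illustrative assumptions:

```python
import numpy as np

def differential_feature(f, w):
    """F_n = exp(|f_n - f_{n-1}| / w); the first sample is paired with itself."""
    diff = np.abs(np.diff(f, prepend=f[0]))
    return np.exp(diff / w)

# A sharp transient yields a large response relative to slow background activity
f = np.array([0.0, 0.1, 0.1, 2.0, 0.1])
F = differential_feature(f, w=1.0)
```

A larger w compresses the response, which is why w must be tuned per patient against the training data.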

For the Bonn database, sets Z, N, F, and S are used in this work, each containing 100 segments of 4096 points. First, each segment is divided into four epochs of equal length (1024 points). Next, each EEG epoch is decomposed into one detail coefficient D1 and one approximation coefficient A1 by the discrete wavelet transform with one decomposition level. The differential operator is then applied to the approximation coefficient A1, and its outputs are used for the subsequent seizure detection.

The EEG data in the Bonn database are artifact-free segments selected by visual inspection, whereas the Freiburg database consists of raw long-term EEG recordings. Hence, the wavelet-transform preprocessing differs slightly between the two databases, and the choice of different wavelet decomposition levels likewise reflects these differences.

2.3 The kernel robust probabilistic collaborative representation-based classifier

The main procedures of the proposed method are exhibited in Fig. 1. In the following sections, a detailed description of each step will be given.

Fig. 1

(a) The schematic diagram of EEG classification based on the Freiburg database. (b) The detailed procedures of the kernel R-ProCRC

2.3.1 The robust probabilistic collaborative representation-based classifier

Suppose that the training samples comprise K classes of training sets X = [X1, X2, …, XK], where \( {X}_i=\left[{x}_{\mathrm{i}}^1,{x}_{\mathrm{i}}^2,\dots, {x}_{\mathrm{i}}^{{\mathrm{n}}_{\mathrm{i}}}\right] \) is the data matrix of the ith class, whose columns are its ni sample vectors. Let S denote the linear subspace collaboratively spanned by the training samples in X, and let lx denote the label set of the classes in X. Any data point in the subspace S can be represented by all training samples as

$$ x= Xa={\sum}_{k=1}^K{X}_k{a}_k $$
(2)

where \( a={\left[{a}_1^{\mathrm{T}},{a}_2^{\mathrm{T}},\dots, {a}_K^{\mathrm{T}}\right]}^{\mathrm{T}} \) and ai is the coding vector associated with the ith class.

The confidence that a data point belongs to lx is determined by the magnitude of the representation coefficients a. If x falls into the ith class, it can be coded as a linear combination of the training samples of that class: x = Xiai. Let l(x) denote the label of x; the probability that l(x) ∈ lx differs between data points. A Gaussian function is selected to define this probability:

$$ P\left(l(x)\in lx\right)\propto \exp \left(-c{\left\Vert a\right\Vert}_2^2\right) $$
(3)

where c is a constant and \( {\left\Vert a\right\Vert}_2^2 \) is the squared l2-norm of a. P(l(x) ∈ lx) is higher when the l2-norm of a is smaller, and vice versa.

Eq. (3) defines the probability for samples inside the subspace S. However, a test sample y will not necessarily fall into this subspace, so its probability P(l(y) ∈ lx) must be computed differently. First, we select a sample x in the subspace S and compute two probabilities: P(l(x) ∈ lx) and P(l(y) = l(x)), the probability that y and x share the same class label. We then obtain:

$$ P\left(l(y)\in lx\right)=P\left(l(y)=l(x)\left|l(x)\in lx\right.\right)\cdot P\left(l(x)\in lx\right) $$
(4)

Using the l1-norm to characterize the representation loss enhances the robustness of the classification [15], so the Laplacian kernel is chosen to define the similarity between y and x:

$$ P\left(l(y)=\left.l(x)\right|l(x)\in lx\right)\propto \exp \left(-\kappa {\left\Vert y-x\right\Vert}_1\right) $$
(5)

where κ is a constant. With Eq. (3)~Eq. (5), we obtain:

$$ P\left(l(y)\in lx\right)\propto \exp \left(-\left(\kappa {\left\Vert y- Xa\right\Vert}_1+c{\left\Vert a\right\Vert}_2^2\right)\right) $$
(6)

Moreover, the logarithmic operator is applied to Eq. (6) to obtain the maximum of the probability P(l(y) ∈ lX):

$$ \max P\left(l(y)\in {l}_X\right)=\max \ln P\left(l(y)\in {l}_X\right)\Leftrightarrow {\min}_{\alpha}\ \kappa {\left\Vert y- X\alpha \right\Vert}_1+c{\left\Vert \alpha \right\Vert}_2^2\Leftrightarrow {\min}_{\alpha }{\left\Vert y- X\alpha \right\Vert}_1+\lambda {\left\Vert \alpha \right\Vert}_2^2 $$
(7)

where λ = c/κ. To measure the probability that x belongs to the kth class, the Gaussian kernel is adopted:

$$ P\left(l(x)=k|l(x)\in {l}_X\right)\propto \exp \left(-\delta {\left\Vert x-{X}_{\mathrm{k}}{\alpha}_{\mathrm{k}}\right\Vert}_2^2\right) $$
(8)

where δ is a constant and \( {\left\Vert x-{X}_{\mathrm{k}}{\alpha}_{\mathrm{k}}\right\Vert}_2^2 \) is the squared l2-norm of x − Xkαk. For a testing sample y, the probability that l(y) = k can be computed as:

$$ P\left(l(y)=k\right)=P\left(l(y)=l(x)\left|l(x)=k\right.\right)\cdot P\left(l(x)=k\right)=P\left(l(y)=l(x)\left|l(x)=k\right.\right)\cdot P\left(l(x)=k\left|l(x)\in {l}_X\right.\right)\cdot P\left(l(x)\in {l}_X\right) $$
(9)

Since k ∈ lX, the probability P(l(y) = l(x) | l(x) ∈ lX) in Eq. (5) is independent of k, so P(l(y) = l(x) | l(x) = k) = P(l(y) = l(x) | l(x) ∈ lX). Combining Eqs. (7)–(9), we obtain:

$$ {\displaystyle \begin{array}{r}P\left(l(y)=k\right)=P\left(l(y)\in {l}_{\mathrm{X}}\right)\cdot P\left(l(x)=\left.k\right|l(x)\in {l}_{\mathrm{X}}\right)\propto \exp \left(-\left({\left\Vert y- X\alpha \right\Vert}_1+\lambda {\left\Vert \alpha \right\Vert}_2^2\right.\right.+\\ {}\left.\left.\gamma {\left\Vert X\alpha -{X}_{\mathrm{k}}{\alpha}_{\mathrm{k}}\right\Vert}_2^2\right)\right)\end{array}} $$
(10)

where γ = δ/κ. By maximizing the probability defined in Eq. (10) for each class, the corresponding data points can be found. Assume that a common x maximizes the joint probability P(l(y) = 1, …, l(y) = K) and that the events l(y) = k are independent. The class label of y can then be determined as

$$ {\displaystyle \begin{array}{r}P\left(l(y)=k\right)=\max P\left(l(y)=1,\dots, l(y)=K\right)=\max {\prod}_{\mathrm{k}}P\left(l(y)=k\right)\\ {}\propto \max \exp \left(-\left({\left\Vert y- X\alpha \right\Vert}_1+\lambda {\left\Vert \alpha \right\Vert}_2^2+\frac{\gamma }{K}{\sum}_{i=1}^K\left({\left\Vert X\alpha -{X}_{\mathrm{i}}{\alpha}_{\mathrm{i}}\right\Vert}_2^2\right)\right)\right)\end{array}} $$
(11)

The logarithmic operator is applied to Eq. (11), and the solution vector \( \widehat{\alpha} \) can be formulated as:

$$ \left(\widehat{\alpha}\right)=\arg {\min}_{\upalpha}\left\{{\left\Vert y- X\alpha \right\Vert}_1+\lambda {\left\Vert \alpha \right\Vert}_2^2+\frac{\gamma }{K}{\sum}_{k=1}^K{\left\Vert X\alpha -{X}_{\mathrm{k}}{\alpha}_{\mathrm{k}}\right\Vert}_2^2\right\} $$
(12)

where the parameters γ and λ are tunable. For the Freiburg database, continuous 1-h EEG data containing seizures from the training set are classified in the training stage, and γ and λ are adjusted to achieve the best results on these data; the parameters are determined in this way for each patient. For the Bonn database, γ and λ are adjusted in the 10-fold cross-validation experiment to obtain the best results. In this work, γ = 5 and λ = 0.001 were found to be the best values.

Examining Eq. (11) more closely, all classes share the common part \( \left({\left\Vert y- X\alpha \right\Vert}_1+\lambda {\left\Vert \alpha \right\Vert}_2^2\right) \). Thus, we need only compute the remaining class-dependent portion of Eq. (11), that is,

$$ {p}_{\mathrm{k}}=\exp \left(-\left({\left\Vert X\widehat{\alpha}-{X}_{\mathrm{k}}{\widehat{\alpha}}_{\mathrm{k}}\right\Vert}_2^2\right)\right) $$
(13)

where \( \widehat{\alpha} \) is the solution vector obtained in Eq. (12). The final identity of y can be defined as

$$ l(y)=\arg \underset{k}{\max}\left\{{p}_{\mathrm{k}}\right\} $$
(14)

The above model is the R-ProCRC. The sparse coefficient vector can be solved efficiently by the iteratively reweighted least squares (IRLS) algorithm. A diagonal weighting matrix WX is introduced as:

$$ {W}_{\mathrm{X}}\left(i,i\right)=1/\left|X\left(i,:\right)\alpha -{y}_{\mathrm{i}}\right| $$
(15)

where X(i, :) denotes the ith row of X. With Eqs. (12) and (15), we have:

$$ {\displaystyle \begin{array}{r}\left(\widehat{\alpha}\right)=\arg {\min}_{\upalpha}\left\{\frac{\gamma }{K}{\sum}_{k=1}^K{\left\Vert X\alpha -{X}_{\mathrm{k}}{\alpha}_{\mathrm{k}}\right\Vert}_2^2+\lambda {\left\Vert \alpha \right\Vert}_2^2\right.+\\ {}\left.{\left( X\alpha -y\right)}^{\mathrm{T}}{W}_{\mathrm{X}}\left( X\alpha -y\right)\right\}\end{array}} $$
(16)

Next, the representation coefficient is computed as

$$ \left(\widehat{\alpha}\right)={\left({X}^{\mathrm{T}}{W}_{\mathrm{X}}X+\frac{\gamma }{K}{\sum}_{k=1}^K{\left({\overline{X}}_{\mathrm{k}}^{\prime}\right)}^{\mathrm{T}}{\overline{X}}_{\mathrm{k}}^{\prime }+\lambda I\right)}^{-1}{X}^{\mathrm{T}}{W}_{\mathrm{X}}y $$
(17)

The representation coefficient α is updated iteratively until convergence or until an appropriate number of iterations is reached, set to five in this study.
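The IRLS loop of Eqs. (12) and (15)–(17) can be sketched as below, assuming samples are stored as columns of X; the function name and the toy data are illustrative. Note that \( {\overline{X}}_{\mathrm{k}}^{\prime } \) is realized here as X with the columns of class k zeroed, since then \( {\overline{X}}_{\mathrm{k}}^{\prime}\alpha =X\alpha -{X}_{\mathrm{k}}{\alpha}_{\mathrm{k}} \).

```python
import numpy as np

def r_procrc(X, labels, y, lam=0.001, gamma=5.0, iters=5, eps=1e-6):
    classes = np.unique(labels)
    K, n = len(classes), X.shape[1]
    # Xbar_k = X with class-k columns zeroed, so Xbar_k @ a = X a - X_k a_k
    Xbars = []
    P = lam * np.eye(n)                              # lambda * I term of Eq. (17)
    for k in classes:
        Xb = X.copy()
        Xb[:, labels == k] = 0.0
        Xbars.append(Xb)
        P += (gamma / K) * (Xb.T @ Xb)               # (gamma/K) sum_k Xbar_k^T Xbar_k
    a = np.zeros(n)
    for _ in range(iters):
        W = np.diag(1.0 / np.maximum(np.abs(X @ a - y), eps))   # Eq. (15)
        a = np.linalg.solve(X.T @ W @ X + P, X.T @ W @ y)       # Eq. (17)
    # Eqs. (13)-(14): pick the class with the largest p_k = exp(-||X a - X_k a_k||^2)
    pk = [np.exp(-np.sum((Xb @ a) ** 2)) for Xb in Xbars]
    return classes[int(np.argmax(pk))], a

# Toy check: two well-separated classes lying along different axes
rng = np.random.default_rng(1)
d, n = 8, 10
m0, m1 = np.zeros(d), np.zeros(d)
m0[0], m1[1] = 1.0, 1.0
X = np.hstack([m0[:, None] + 0.01 * rng.standard_normal((d, n)),
               m1[:, None] + 0.01 * rng.standard_normal((d, n))])
labels = np.repeat([0, 1], n)
y = m0 + 0.01 * rng.standard_normal(d)               # query near class 0
pred, _ = r_procrc(X, labels, y)
```

The weighting matrix of Eq. (15) is what turns the weighted least-squares step into an approximation of the l1 fidelity term, giving the "robust" behavior against outlying samples.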

2.3.2 The kernel R-ProCRC

Kernel-based methods have shown satisfactory classification performance in many machine learning applications [26, 27]. Samples that are inseparable in the original linear space may become linearly separable after being mapped into a high-dimensional feature space.

The mapping mentioned above is defined as Φ : Rm → RF. The inner products between the transformed feature samples ϕ(xi) and ϕ(xj) in the feature space RF can be calculated in the original input space:

$$ {\left\langle \Phi \left({x}_{\mathrm{i}}\right),\Phi \left({x}_{\mathrm{j}}\right)\right\rangle}_{R^{\mathrm{F}}}=\Phi {\left({x}_{\mathrm{i}}\right)}^{\mathrm{T}}\Phi \left({x}_{\mathrm{j}}\right)=K\left({x}_{\mathrm{i}},{x}_{\mathrm{j}}\right) $$
(18)

where K(·,·) denotes the kernel function defined by Eq. (18). Both training and testing samples can be mapped into the high-dimensional space and represented as K(Xc, X) and K(Xc, y). The center matrix Xc is obtained by selecting training samples following the idea of k-means clustering [27]. First, the mean sample of the training set of each class \( {X}_{\mathrm{i}}=\left[{x}_{\mathrm{i}}^1,{x}_{\mathrm{i}}^2,\dots, {x}_{\mathrm{i}}^{{\mathrm{n}}_{\mathrm{i}}}\right] \) is computed as \( {u}_{\mathrm{i}}=\left({\sum}_{j=1}^{n_{\mathrm{i}}}{x}_{\mathrm{i}}^{\mathrm{j}}\right)/{n}_{\mathrm{i}} \). Next, the 3ni/4 samples nearest to ui are selected to generate the matrix \( {X}_{\mathrm{c}}^{\mathrm{i}}=\left[{u}_{\mathrm{i}},{x^{\prime}}_{\mathrm{i}}^1,{x^{\prime}}_{\mathrm{i}}^2,\dots, {x^{\prime}}_{\mathrm{i}}^{\left(3{n}_{\mathrm{i}}/4\right)}\right] \). After these steps, we obtain the center matrix \( {X}_{\mathrm{c}}=\left[{X}_{\mathrm{c}}^1,{X}_{\mathrm{c}}^2,\dots, {X}_{\mathrm{c}}^{\mathrm{k}}\right] \).
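A sketch of the center-matrix construction just described, with samples stored as columns (the function name is illustrative):

```python
import numpy as np

def center_matrix(X, labels):
    """Per class: the mean sample u_i followed by the 3n_i/4 samples nearest to u_i."""
    blocks = []
    for k in np.unique(labels):
        Xi = X[:, labels == k]
        u = Xi.mean(axis=1)
        dist = np.linalg.norm(Xi - u[:, None], axis=0)
        nearest = np.argsort(dist)[: (3 * Xi.shape[1]) // 4]
        blocks.append(np.column_stack([u, Xi[:, nearest]]))
    return np.hstack(blocks)

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 16))        # two classes, 8 samples each
labels = np.repeat([0, 1], 8)
Xc = center_matrix(X, labels)           # 2 * (1 + 6) = 14 columns
```

Discarding the quarter of samples farthest from each class mean keeps the kernel matrices small and trims potential outliers before the kernel R-ProCRC is applied.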

When the kernel trick is plugged into the R-ProCRC, the parameter β (which is the representation coefficient in the space RF) can be recalculated as

$$ {\displaystyle \begin{array}{r}\left(\widehat{\beta}\right)=\left(K{\left({X}_{\mathrm{c}},X\right)}^{\mathrm{T}}{W}_{\mathrm{X}}K\left({X}_{\mathrm{c}},X\right)\right.+\frac{\gamma }{K}{\sum}_{k=1}^K{\overline{\left(K{\left({X}_{\mathrm{c}},X\right)}_{\mathrm{k}}^{\prime}\right)}}^{\mathrm{T}}\cdot \\ {}{\overline{K\left({X}_{\mathrm{c}},X\right)}}_{\mathrm{k}}^{\prime }{\left.+\lambda I\right)}^{-1}K{\left({X}_{\mathrm{c}},X\right)}^{\mathrm{T}}{W}_{\mathrm{X}}K\left({X}_{\mathrm{c}},y\right)\end{array}} $$
(19)

The algorithm of the kernel R-ProCRC is illustrated as follows:

  1) Generate the center matrix Xc.

  2) Map X and y into the high-dimensional feature space to obtain K(Xc, X) and K(Xc, y); the Gaussian RBF kernel is applied in this work.

  3) Normalize each column of K(Xc, X) and K(Xc, y) to unit l2-norm and compute the sparse representation coefficient β by the IRLS algorithm according to Eq. (19).

  4) Calculate the residual of each class:

$$ {\gamma}_{\mathrm{i}}(y)={\left\Vert K\left({X}_{\mathrm{c}},y\right)-K\left({X}_{\mathrm{c}},{X}_{\mathrm{i}}\right){\widehat{\beta}}_{\mathrm{i}}\right\Vert}_2^2 $$
(20)

In addition, we compare the performance of the proposed method with different kernel functions, namely the linear kernel, sigmoid kernel, and polynomial kernel. The results of this comparison are given in Section 3.

The linear kernel is fast to compute and has few parameters. When the samples are linearly separable, it can achieve satisfactory classification results. The linear kernel is defined as:

$$ K\left({X}_{\mathrm{c}},M\right)={X}_{\mathrm{c}}\cdotp M $$
(21)

where Xc is the center matrix, and M denotes the training samples or testing samples.

The sigmoid kernel is widely used in cases where the samples are linearly inseparable. It is also one of the most commonly used activation functions for neural networks and can map variables to values between 0 and 1. The sigmoid kernel can be formulated as:

$$ K\left({X}_c,M\right)=\tanh \left[a\left({X}_{\mathrm{c}}\bullet M\right)+c\right] $$
(22)

where tanh is the hyperbolic tangent function, Xc is the center matrix, M denotes the training or testing samples, a is a scale parameter, and c is a displacement parameter.

The polynomial kernel represents the similarity of vectors in a feature space over polynomials of the original variables. It is often used to map linearly inseparable samples from the original space to the feature space. However, it has more parameters than the sigmoid kernel, and when the order of the polynomial is high, the computational complexity becomes large. The polynomial kernel is computed as:

$$ K\left({X}_{\mathrm{c}},M\right)={\left[\left({X}_{\mathrm{c}}\cdotp M\right)+b\right]}^{\mathrm{d}} $$
(23)

where Xc is the center matrix, M denotes the training or testing samples, b is a free parameter, and d is the order of the polynomial kernel.
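With samples stored as columns, the kernel matrices K(Xc, M) compared in this work can be sketched as follows; the parameter defaults are illustrative, not the tuned values:

```python
import numpy as np

def linear_kernel(Xc, M):
    return Xc.T @ M                                   # Eq. (21): inner products

def sigmoid_kernel(Xc, M, a=0.01, c=0.0):
    return np.tanh(a * (Xc.T @ M) + c)                # Eq. (22)

def poly_kernel(Xc, M, b=1.0, d=2):
    return (Xc.T @ M + b) ** d                        # Eq. (23)

def rbf_kernel(Xc, M, sigma=1.0):
    # Gaussian RBF: exp(-||xc_i - m_j||^2 / (2 sigma^2)) for all column pairs
    sq = (np.sum(Xc ** 2, axis=0)[:, None]
          + np.sum(M ** 2, axis=0)[None, :] - 2.0 * Xc.T @ M)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

rng = np.random.default_rng(3)
Xc = rng.standard_normal((4, 6))
K = rbf_kernel(Xc, Xc)      # diagonal is 1: each sample is identical to itself
```

Each function returns a matrix whose (i, j) entry compares the ith column of Xc with the jth column of M, which is the form consumed by Eq. (19).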

2.4 Post-processing

In this study, the number 0 denotes seizure activity and the number 1 denotes a normal or non-seizure segment. The outputs of the classifier are not exactly equal to 0 or 1, so a post-processing procedure is necessary to obtain the final detection results.

If a test sample belongs to the seizure class, its representation vector α with respect to the ictal training set should be much larger in magnitude than that associated with the interictal training set, and its residual with respect to the ictal training set should be smaller than that associated with the interictal training set. Figure 2 shows a case in which an ictal segment is detected.

Fig. 2

The sparse coefficients and residuals of an ictal segment acquired from patient 4. (a) The sparse coefficients to the left of the vertical dashed line correspond to the interictal training samples, while the remainder correspond to the ictal training samples. (b) The residuals of this ictal segment with respect to the interictal and ictal training samples

For the Freiburg database, after a test sample is represented by the two classes of training samples, two residuals corresponding to the ictal and interictal training sets are obtained. To make the classification more accurate, a difference variable is used in this study, defined as the residual with respect to the interictal training set minus that with respect to the ictal training set.

In order to remove isolated misjudgment points and small fluctuations caused by noise, a moving average filter (MAF) is first applied to the difference variables, defined as:

$$ y(m)=\frac{1}{N+1}\sum \limits_{-N}^0x\left(m+n\right) $$
(24)

where x is the input signal, y is the output signal, and N + 1 is the smoothing length, which is patient-specific. For each patient, 1-h continuous EEG data containing seizures were used to determine N in the training phase; N was adjusted to achieve the best recognition rate on these data. The output of the MAF is then compared with a suitable threshold, set to zero in this study, to obtain binary decisions.
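Eq. (24) is a causal moving average over the current sample and its N predecessors. A minimal sketch follows; the edge handling (repeating the first sample) is an assumption, as the paper does not specify it:

```python
import numpy as np

def causal_maf(x, N):
    """y(m) = (1/(N+1)) * sum_{n=-N}^{0} x(m+n)."""
    xp = np.concatenate([np.full(N, x[0]), x])          # pad the left edge
    return np.convolve(xp, np.ones(N + 1) / (N + 1), mode="valid")

x = np.array([0.0, 0.0, 6.0, 0.0, 0.0])                 # one isolated spike
y = causal_maf(x, 2)                                    # window of 3 samples
# y == [0, 0, 2, 2, 2]: the spike is spread out and attenuated
```

This is exactly how isolated misjudgments are suppressed: a lone large difference value is averaged down below the threshold, while sustained seizure activity survives the smoothing.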

Multichannel integration is also applied to improve the correct detection rate. Figure 3 presents the procedure. Data from six electrodes are used for seizure detection, and each EEG epoch is decomposed into three sub-signals XD3, XD4, and XD5 in the preprocessing stage. If at least two of the six channels are marked "1", the epoch is marked as "seizure" for that sub-signal. Three decisions corresponding to the three sub-signals are thus obtained. If seizures are detected in at least two of the three decisions, the testing epoch is labeled "seizure." In addition, if only one of the three decisions indicates a seizure, the current epoch is still labeled "seizure" when it adjoins an epoch already marked as seizure; otherwise, it is labeled "non-seizure." Subsequently, the MAF is applied again to remove burrs and sporadic false detections.
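The voting logic just described can be sketched as follows for a binary decision array of shape (3 sub-signals, 6 channels, epochs), where 1 marks a seizure decision; the array layout and function name are assumptions for illustration:

```python
import numpy as np

def integrate_channels(decisions):
    subs = (decisions.sum(axis=1) >= 2).astype(int)   # >= 2 of 6 channels agree
    votes = subs.sum(axis=0)                          # per-epoch votes, 0..3
    out = (votes >= 2).astype(int)                    # >= 2 of 3 sub-signals agree
    # A lone vote is kept if it adjoins an epoch already marked as seizure
    for t in np.flatnonzero(votes == 1):
        if (t > 0 and out[t - 1]) or (t + 1 < out.size and out[t + 1]):
            out[t] = 1
    return out

decisions = np.zeros((3, 6, 4), dtype=int)
decisions[0, :2, 1] = decisions[1, :2, 1] = 1   # epoch 1: two sub-signals agree
decisions[0, :2, 2] = 1                         # epoch 2: a lone vote next to it
final = integrate_channels(decisions)           # -> [0, 1, 1, 0]
```

Epoch 2 carries only one sub-signal vote, but because it adjoins the detected epoch 1 it is still labeled as seizure, matching the rule above.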

Fig. 3

The procedures of the multichannel integration

The start and end of a seizure change gradually, which makes the beginning and ending of a seizure difficult to detect; moreover, the smoothing steps may further blur them. Hence, a collar technique is used to recover epochs mistakenly classified as non-seizure [28]. In this process, each detected seizure event is extended by l epochs on both sides (Fig. 4(h)). The parameter l is adjusted for each patient in the training stage to obtain the best classification result, does not exceed 5, and is then fixed in the testing stage. The post-processing procedure is exhibited in Fig. 4. Note that post-processing is only applied to the continuous EEG recordings of the Freiburg database; for the Bonn database, a testing sample is simply assigned to the class with the minimum representation residual.
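The collar operation simply dilates each detected seizure run by l epochs on each side; a sketch:

```python
import numpy as np

def collar(decision, l):
    """Extend every epoch marked 1 by l epochs on both sides."""
    out = decision.copy()
    for t in np.flatnonzero(decision):
        out[max(0, t - l): t + l + 1] = 1
    return out

x = np.array([0, 0, 0, 1, 1, 0, 0, 0])
# collar(x, 2) -> [0, 1, 1, 1, 1, 1, 1, 0]
```

Because l is at most 5 and the epoch length is 4 s, the collar extends each event by at most 20 s per side, enough to recover the gradual onset and offset without inflating the false detection rate much.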

Fig. 4

The post-processing scheme of 1-h EEG data with one seizure from patient 4. (a) The difference variable with channel 1. (b) The smoothed output after the moving average filtering. (c) The decisions with channel 1 after threshold judgment. (d), (e) The decisions with another two channels after threshold judgment. (f) The decisions after the multichannel integration. (g) The smoothed output after the second moving average filtering. (h) The final classification results after the collar operation

3 Results

The proposed method is evaluated comprehensively on the two EEG databases using different assessment criteria. All experiments are executed in the MATLAB 8.1 environment on a 3.40-GHz Intel Core processor. The segment-based criterion is employed for both databases, and the event-based criterion is additionally employed for the Freiburg database to appraise the performance of the proposed algorithm. For the segment-based criterion, the labels of the epochs judged by the algorithm are compared with those marked by the experts. Three statistical measures are used at this level:

  • Sensitivity: true positives divided by the total number of seizure epochs identified by the experts. True positives (TP) are epochs marked as seizure by both the classifier and the EEG experts.

  • Specificity: true negatives divided by the total number of non-seizure epochs identified by the experts. True negatives (TN) are epochs labeled as non-seizure by both the detector and the experts.

  • Recognition accuracy: the number of correctly marked epochs divided by the total number of epochs.
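The three segment-based measures reduce to simple counts over paired label sequences (1 = seizure epoch); a sketch:

```python
def segment_metrics(pred, truth):
    """Segment-based sensitivity, specificity, and recognition accuracy."""
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, truth))
    tn = sum(p == 0 and t == 0 for p, t in zip(pred, truth))
    n_seizure = sum(truth)
    n_nonseizure = len(truth) - n_seizure
    sensitivity = tp / n_seizure
    specificity = tn / n_nonseizure
    accuracy = (tp + tn) / len(truth)
    return sensitivity, specificity, accuracy

# 3 seizure epochs and 2 non-seizure epochs; one seizure epoch is missed
sens, spec, acc = segment_metrics([1, 1, 0, 0, 0], [1, 1, 1, 0, 0])
# sens = 2/3, spec = 1.0, acc = 0.8
```

Because non-seizure epochs vastly outnumber seizure epochs in long-term recordings, accuracy alone can look deceptively high, which is why sensitivity and specificity are reported separately.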

For the Freiburg database, the wavelet transform with five scales is applied to the EEG epochs, and the coefficients of scales 3, 4, and 5 are selected to reconstruct the sub-signals for the multichannel decision. The query samples are then represented sparsely over the training samples by the kernel R-ProCRC, and the residuals with respect to the seizure and non-seizure training samples are calculated. A post-processing procedure is conducted on the residuals to obtain the final detection results.

For the segment-based level, the experimental results on the Freiburg dataset are listed in Table 2. A best-case sensitivity of 100%, specificity of 99.98%, and recognition accuracy of 99.98% are achieved across patients. The last row of Table 2 shows that the mean values of all three statistical measures exceed 96%. Moreover, 12 patients, more than half of the total, reach a sensitivity of 100%; the lowest sensitivity, 87.18%, is obtained for patient 15. All patients except patient 10 have specificities above 94%; patient 10 achieves an unsatisfactory specificity of 75.01% owing to electrode disconnection and reconnection.

Table 2 Detection results of the proposed method on the segment-based level

Furthermore, the event-based evaluation is also employed to verify the feasibility of the proposed method in clinical practice. At this level, two measures are calculated: the number of true detections and the false detection rate. A true detection is a seizure event detected by the method that overlaps one marked by the EEG experts; events detected only by the classifier are counted as false detections. Table 3 displays the results at the event-based level, where 52 seizure events are used to evaluate the method. Except for one seizure event of patient 10, all events are detected. Most patients have a satisfactory false detection rate, and more than half achieve a rate below 0.1/h. The majority of false detections are caused by high-amplitude activities that are easily misjudged as seizures.

Table 3 Detection results of the proposed method on the event-based level

In order to comprehensively demonstrate the performance of the proposed approach, the Bonn dataset is also used in this study. K-fold cross-validation is adopted to acquire stable and convincing classification results. The original dataset is divided equally into K subsets; K − 1 subsets are selected to train the model, while the remaining one is treated as the testing set, so the classification process is executed K times in turn. K is set to ten in this work.
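The splitting procedure can be illustrated with a minimal NumPy sketch; the index handling is ours, and the paper does not state whether the folds are shuffled or stratified:

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index arrays for K-fold cross-validation:
    the samples are split into k roughly equal folds, and each fold
    serves once as the test set while the other k-1 folds train."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```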

The experimental results on the Bonn dataset are presented in Table 4. For the F-S classification, a sensitivity of 99%, a specificity of 99.5%, and an accuracy of 99.3% are acquired, demonstrating the remarkable classification capacity of the proposed approach. All three classification tasks reach recognition accuracies over 99%, which shows that the proposed method can classify ictal and interictal EEGs accurately. In addition, for the F-S classification problem, Table 5 gives the results of the proposed method with different kernel functions, including the linear, sigmoid, and polynomial kernels. The method with the Gaussian RBF achieves the best result, indicating that the Gaussian RBF is more adaptive than the others for distinguishing seizure from non-seizure epochs. Compared with the linear kernel, the Gaussian kernel is more suitable for linearly inseparable data, and it has fewer parameters than the polynomial kernel, which implies a lower complexity.
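For concreteness, the four kernel functions compared in Table 5 have the standard forms below; the parameter values are illustrative, not those tuned in the paper:

```python
import numpy as np

def linear_kernel(x, y):
    """k(x, y) = <x, y>: no parameters, linear decision boundaries only."""
    return x @ y

def polynomial_kernel(x, y, degree=3, c=1.0):
    """k(x, y) = (<x, y> + c)^degree: two parameters to tune."""
    return (x @ y + c) ** degree

def sigmoid_kernel(x, y, alpha=0.01, c=0.0):
    """k(x, y) = tanh(alpha * <x, y> + c)."""
    return np.tanh(alpha * (x @ y) + c)

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF: k(x, y) = exp(-gamma * ||x - y||^2). A single
    parameter, values in (0, 1], and well suited to data that is not
    linearly separable."""
    return np.exp(-gamma * np.sum((x - y) ** 2))
```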

Table 4 The results of the three classification types based on the Bonn database
Table 5 The results of the proposed method with different kernel functions for the F-S classification based on the Bonn database

4 Discussion

In this study, a novel automatic seizure detection method based on the kernel R-ProCRC is introduced, which classifies by calculating the maximum probability that the testing sample falls into each class. For conventional detection methods, the choice of distinctive features and an appropriate classifier is crucial; nonetheless, feature selection and extraction are complicated, and it is not always clear whether the selected features effectively help the classification. In contrast, the proposed method requires no feature selection: it only needs to sparsely represent the testing samples over the training set and compare the residuals with respect to the two categories. The kernel R-ProCRC, built on a probabilistic framework, can make full use of the training samples to judge the category of a testing sample.
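The representation-and-residual idea can be sketched in its simplest linear, non-probabilistic form; the paper's kernel R-ProCRC adds a kernel mapping and the probabilistic weighting on top of this collaborative ridge coding, which the sketch omits:

```python
import numpy as np

def crc_classify(X_train, y_train, x_test, lam=0.01):
    """Simplified collaborative-representation classification: code the
    test sample over ALL training samples with a ridge penalty, then
    assign the class whose own samples yield the smallest residual."""
    D = X_train.T                                  # dictionary: features x samples
    G = D.T @ D + lam * np.eye(D.shape[1])
    alpha = np.linalg.solve(G, D.T @ x_test)       # collaborative code
    best_class, best_res = None, np.inf
    for c in np.unique(y_train):
        mask = (y_train == c)
        res = np.linalg.norm(x_test - D[:, mask] @ alpha[mask])
        if res < best_res:
            best_class, best_res = c, res
    return best_class
```

Note the contrast with conventional pipelines: no features are extracted; the raw (or preprocessed) sample itself is coded over the training set, and the per-class residuals carry the discriminative information.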

In general, the classification of seizure and non-seizure EEG signals is complicated because the signals are irregular and nonlinear in nature and contain many seizure-like activities throughout the recordings. To prepare for the subsequent classification, preprocessing is applied to the raw EEG signals. The wavelet transform decomposes the EEG signal into sub-signals on different frequency bands, offering rich time and frequency information. Hence, it can effectively remove high-frequency noise, which benefits the differential operator applied afterwards; however, some short seizures may also be attenuated by the wavelet filtering and then misclassified as non-seizure. The differential operator captures the abrupt change at the boundary between seizure and non-seizure and amplifies it, making the contrast between the high-frequency components and the low-frequency background more prominent. In addition, it is a linear operator and therefore suitable for real-time systems. From Fig. 5, it can be seen that the contrast between the onset activities and the interictal background is clearly heightened, which is conducive to improving the detection accuracy of the proposed method.
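A first-order difference is one minimal realization of such a differential operator (a sketch; the paper's exact operator order and scaling may differ):

```python
import numpy as np

def differential_operator(x):
    """First-order difference with the first sample repeated so the
    output length matches the input. A flat background maps to ~0,
    while abrupt transitions (e.g. a seizure boundary) stand out."""
    return np.diff(x, prepend=x[0])
```

On a slowly varying background the output stays near zero, so any sharp jump dominates the transformed signal, which is exactly the contrast enhancement visible in Fig. 5.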

Fig. 5
figure 5

The performance of the differential operator for a 10-min EEG signal from patient 4 in the Freiburg database. a The EEG signal before the differential operator. b The EEG signal after the differential operator. The signals between the two vertical dashed lines are ictal signals

In order to assess the performance of the proposed method comprehensively, two different databases are employed: the Bonn database and the Freiburg database. Many previous papers on seizure detection also adopt these databases to evaluate their algorithms. Majumdar et al. [24] proposed a method that combined the differential operator with the windowed variance to identify seizure onsets in continuous EEG signals; 369 h of interictal data and 59 h of ictal data were used to evaluate the performance, and their method obtained a sensitivity of 91.25% on 59 seizures from only 15 patients. Raghunathan et al. [29] put forward a multistage detection method to identify the morphologies of seizures; their method takes advantage of wavelet filtering and combines the variance and coastline as EEG features, obtaining a sensitivity of 87.5% from five patients with 24 seizures. In the work of Yuan et al. [17], the kernel CRC is employed, evaluated on 21 patients including 60 seizures, and achieves a sensitivity of 96.03%. In this study, the kernel R-ProCRC exploits a probabilistic collaborative representation framework to jointly maximize the probability that a test EEG sample belongs to the seizure or non-seizure class, so feature extraction is no longer required. Compared with the works of Majumdar et al. and Raghunathan et al., our method uses more EEG data to assess the performance of detecting epileptic seizures in EEGs and yields a much higher sensitivity. Table 6 gives a detailed comparison between the previous algorithms and this approach on the Freiburg database.

Table 6 A comparison of results for different methods based on the Freiburg database

Table 7 shows a comparison between the proposed method and other previous methods based on the Bonn database. First, for the classification of normal (Z) and ictal (S) EEGs, this approach obtains an average accuracy of 99.30% using 10-fold cross-validation, the second best result listed in Table 7; Tzallas et al. [32] employed an artificial neural network and gained the best accuracy of 100%. The classification of interictal (F) and ictal (S) EEGs is the most significant of the three tasks, being closest to clinical application. For the F-S classification, the result obtained by this method is better than the others; the second best result is 98.63% in the work of Yuan et al. [18], in which the SRC is combined with the kernel trick for EEG classification. Thirdly, for the ZNF-S classification, the proposed method still yields the best accuracy of 99.20%, an improvement of 1.45% over the work of Guo et al. [36], who employed the wavelet transform to process the raw EEG signals and applied the line length feature to locate the seizure onset. These competitive results suggest that the proposed method is a promising approach for detecting seizures in clinical applications.

Table 7 Detailed comparison of the accuracy of three classification problems based on the Bonn database for different methods

5 Conclusion

In this study, a novel method based on the kernel version of R-ProCRC is presented to detect seizure events in EEG signals. When a test EEG sample arrives, the kernel R-ProCRC can effectively take full advantage of the training samples to represent it and deduce its label. Most previous works evaluated their methods on only one EEG database, which cannot verify the generalization ability of an algorithm across different EEG data. In this study, our method is evaluated on two different databases, and the experimental results show that the proposed method has remarkable adaptability to different types of EEG signals. Additionally, feature extraction is no longer required in this method, which simplifies the process of epilepsy detection.