1 Introduction

In recent times, machine learning and deep learning have seen increased use in healthcare optimisation (Khalilpourazari & Doulabi, 2021a, b). Electroencephalogram (EEG) signals are the principal components of several non-invasive diagnostic tests, routinely used to detect epilepsy, brain tumours, stroke, sleep disorders, etc. Many feature extraction techniques have been developed to process EEG signals so that classification algorithms can predict the nature of the data. One of the most widely used feature extraction methods for EEG data is the wavelet transform, which produces frequency-domain features localised in time. Several wavelet families specialise in different types of signals. The Daubechies db4 wavelet has been a leading feature extraction method for epilepsy detection (Adeli et al., 2003), and other wavelets such as Daubechies db2 have also been used (Güler & Übeyli, 2005). Different feature extraction techniques such as principal component analysis (PCA) and independent component analysis (ICA) have also been applied to EEG signals (Subasi & Gursoy, 2010). Once features are appropriately extracted, a classification algorithm is used to categorize the EEG signals.
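To make the feature extraction step concrete, the following is a minimal sketch of db4 wavelet feature extraction for a single EEG segment. The use of the PyWavelets library and the particular sub-band statistics are illustrative assumptions, not the exact pipeline of the cited studies.

```python
# A minimal sketch of db4 wavelet feature extraction for one EEG channel.
# PyWavelets (pywt) and the chosen sub-band statistics are illustrative assumptions.
import numpy as np
import pywt

def wavelet_features(signal, wavelet="db4", level=3):
    """Decompose an EEG signal and summarise each sub-band with simple statistics."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)  # [cA_L, cD_L, ..., cD_1]
    feats = []
    for band in coeffs:
        feats.extend([np.mean(np.abs(band)), np.std(band),
                      np.max(band), np.min(band)])
    return np.array(feats)

# Example on a synthetic "signal" of 4097 samples (the Bonn EEG segment length).
rng = np.random.default_rng(0)
x = rng.standard_normal(4097)
print(wavelet_features(x).shape)  # (level + 1) sub-bands x 4 statistics = (16,)
```

The same function can be called with wavelet="db2" or "haar" to obtain features for the other wavelet families mentioned above.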

Support vector machines (SVMs) (Cortes & Vapnik, 1995) and SVM-based algorithms, such as the twin support vector machine (TWSVM) (Jayadeva & Khemchandani, 2007), are excellent supervised learning algorithms employed in diverse classification and regression domains. They show reliable performance and have been used in several fields such as face recognition (Zhou et al., 2010), text categorisation (Wang & Chiang, 2007), electroencephalogram (EEG) signal classification (Richhariya & Tanveer, 2018), data mining (Bollegala et al., 2010), and the diagnosis of diseases such as epilepsy (Richhariya & Tanveer, 2018; Zhang et al., 2019) and Alzheimer’s disease (Tanveer et al., 2020). SVM has gained this recognition because it reduces the generalisation error by maximising the margin between the classes. This is achieved by formulating a convex quadratic programming problem (QPP), whose solution is globally optimal rather than a local optimum, a limitation that affects many other methods such as artificial neural networks. Also, SVM implements the structural risk minimisation principle, which minimises an upper bound on the generalisation error expressed in terms of the Vapnik-Chervonenkis (VC) dimension, a measure of the capacity of the functions that a statistical binary classification algorithm can learn.

Although SVM has been successful, it is not without disadvantages. Its advantages come at the cost of solving a QPP with very high complexity, \(O(N^3)\), where N is the total number of training data points. Jayadeva and Khemchandani (2007) attempted to address this complexity by introducing the twin support vector machine (TWSVM). TWSVM breaks the large QPP into two smaller sub-problems by seeking two non-parallel proximal hyperplanes. Further improvements were made by the twin bounded SVM (TBSVM) (Shao et al., 2011), which uses an additional regularization term to implement the structural risk minimisation principle. Kumar and Gopal made further improvements by introducing the least-squares twin SVM (LSTSVM) (Kumar & Gopal, 2009). The LSTSVM has very low computation time; however, it suffers in the presence of noise and outliers because it uses a quadratic loss function. The angle-based twin support vector machine (ATWSVM) proposed by Khemchandani et al. (2018) is an iteration on TWSVM with the advantage of quicker computation, achieved by solving the second problem as an unconstrained minimization problem instead of a QPP. Tanveer et al. introduced a robust energy-based least squares twin SVM (RELSTSVM) (Tanveer et al., 2016), which performed the best in a 2019 evaluation (Tanveer et al., 2019) of SVM classifiers on UCI datasets (Dua & Graff, 2017). For further details of the TWSVM, we refer interested readers to Tanveer et al. (2021). Another approach for improving TWSVM is to utilize the L1-norm distance rather than the traditional L2-norm distance, as implemented in L1-TWSVM (Yan et al., 2019). To improve the sparsity and robustness of the original twin SVM, linear programming twin SVM algorithms (Tanveer, 2015a, b) have been proposed. Fuzzy membership is another widely used method for improving generalization performance; the entropy-based fuzzy twin support vector machine (Gupta et al., 2019), entropy-based fuzzy least-squares TWSVM (Chen et al., 2020) and intuitionistic fuzzy TWSVM (Rezvani et al., 2019) are some recent examples of fuzzy implementations in TWSVM.

The aforementioned algorithms improve either SVM- or TWSVM-based models; in addition, the framework of universum data has been proposed for SVM-based algorithms. The idea of using universum data to improve the generalisation performance of SVM was introduced by Weston et al. (2006). It stems from the fact that SVM has no prior information about the data distribution of the training set. This prior information, which acts as a form of Bayesian prior, is incorporated into the optimisation problem by introducing universum data. The concept can be understood with an example: consider classifying images of handwritten digits into their respective classes. The training data consists of the pixel intensities of the images in vector form. When SVM uses this training data, it has no context about the data space, i.e. handwritten characters. Thus, a universum set containing handwritten letters (not digits) can be used to provide prior information about the data space.

As mentioned above, the universum data need not belong to the same classes as the training set; it is an unlabelled set used to provide relevant prior information. The universum-based SVM (USVM) (Weston et al., 2006) implements this concept by placing the universum data points in an \(\epsilon \)-insensitive zone between the two binary classes. As with SVM, USVM was further improved by the universum-based twin SVM (UTSVM) (Qi et al., 2012) and reduced UTSVM models (Richhariya & Tanveer, 2020). The computational cost of UTSVM is reduced by the universum least squares twin SVM (ULSTSVM) (Xu et al., 2016), which uses the least squares technique and thus introduces a quadratic loss. The fuzzy-based universum least squares twin SVM (FULSTSVM) (Richhariya & Tanveer, 2021) utilises fuzzy membership to further improve ULSTSVM. The universum-based Lagrangian twin bounded support vector machine (ULTBSVM) (Kumar & Gupta, 2021) uses the square of the 2-norm of the slack variables to formulate a strongly convex objective function, enabling the method to produce unique solutions; it also has additional regularization terms which implement the structural risk minimization (SRM) principle. The regularized universum twin support vector machine (Gupta et al., 2019) introduces regularization terms into the UTSVM formulation so that it is well-posed, removing the need to add another term to ensure that the matrix is positive definite. The iterative universum twin support vector machine (IUTWSVM) (Richhariya & Gupta, 2019) differs from the methods discussed above: it reduces the computational expense by using an iterative approach based on the Newton method. This low computational cost makes IUTWSVM a viable multi-class classifier.

All of the above algorithms are efficient classifiers based on various loss functions, such as the hinge loss, square loss, etc. However, they lack a critical property: noise insensitivity. Real-life datasets, and EEG signals in particular, are inherently noisy because they are recorded in real-world conditions where errors are unavoidable. This noise can arise from the measuring equipment or even from the user recording the observations. Thus, noise insensitivity is of critical importance for a classifier used in real-life applications. The noise performance (i.e. insensitivity to noise) of classifiers can be improved by using a suitable loss function. The pinball loss introduced by Huang et al. (2013) is an excellent candidate: it improves insensitivity towards noise and outliers and provides stability under re-sampling of the data. This insensitivity results from using the quantile distance (Koenker & Hallock, 2001; Christmann & Steinwart, 2007) rather than the shortest distance. SVM with pinball loss (Pin-SVM) (Huang et al., 2013) has shown promising performance in the presence of noise. However, Pin-SVM has high computational complexity as it solves a large QPP. Thus, Xu et al. proposed the twin parametric margin support vector machine with pinball loss (Pin-TSVM) (Xu et al., 2016) to overcome the noise sensitivity of TWSVM and reduce the complexity of Pin-SVM. Some recent algorithms utilizing the pinball loss include the smooth twin bounded support vector machine with pinball loss (Li & Lv, 2021), the bound estimation-based safe acceleration for maximum margin of twin spheres machine with pinball loss (Yuan & Xu, 2021) and the robust general twin support vector machine with pinball loss function (Ganaie & Tanveer, 2021).
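For reference, the pinball loss with parameter \(0 \le \tau \le 1\), with u typically taken as \(1-yf(x)\) in the SVM setting, can be written as

$$\begin{aligned} L_{\tau }(u)= {\left\{ \begin{array}{ll} u, &{} u\ge 0,\\ -\tau u, &{} u<0, \end{array}\right. } \end{aligned}$$

so that the hinge loss is recovered in the limit \(\tau \rightarrow 0\); negative deviations are penalised instead of being ignored, which is what yields the quantile-based noise insensitivity discussed above.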

The aforementioned methods assume that the noise distribution is identical everywhere in the dataset, which may not be true. Tanveer et al. proposed a general twin SVM with pinball loss (Pin-GTSVM) (Tanveer et al., 2019) to deal with this problem. Pin-GTSVM computes two non-parallel hyperplanes, each proximal to either the negative or the positive class. To improve the sparsity of the models, the sparse support vector machine with pinball loss (Pin-SSVM) (Tanveer et al., 2021) and the sparse pinball twin support vector machines (SPTWSVM) (Tanveer et al., 2019) were formulated. To improve performance on large datasets, the large-scale pinball TWSVM (Tanveer et al., 2021) and the large-scale twin parametric SVM with pinball loss function (Sharma et al., 2019) have been proposed.

Thus, motivated by the concept of universum data and the excellent performance of the pinball loss in various classifiers, in this work we propose a novel universum twin support vector machine with pinball loss (Pin-UTSVM) for EEG signal classification. The main highlights of this work are as follows:

  1. The novel universum twin support vector machine with pinball loss (Pin-UTSVM) is proposed.

  2. The proposed Pin-UTSVM model is robust to noise and more stable under resampling.

  3. The computational complexity of the proposed Pin-UTSVM model is similar to that of the standard UTSVM model. Hence, the proposed Pin-UTSVM brings noise insensitivity and stability without incurring additional computational cost.

  4. For handling non-linear cases, we use a kernel-based Pin-UTSVM model for better generalization.

  5. The evaluation of the classification models on real-world EEG signal classification shows that the proposed Pin-UTSVM has better generalization compared to the baseline models.

2 Related work

Let A and B be the sets of positive (\(+1\)) and negative (\(-1\)) class samples, respectively. Let \(c_i\) be positive parameters and \(e_j\) be vectors of ones of the relevant dimensions, for \(i=1,2,3,4\) and \(j=\{+,-,u\}\). Also, U denotes the universum data points and \(D=[A;B]\). ||x|| denotes the 2-norm of a vector x.

2.1 Twin SVM Jayadeva and Khemchandani (2007)

The twin support vector machine (TWSVM) (Jayadeva & Khemchandani, 2007) significantly improves on the conventional SVM by reducing its high computational complexity. SVM uses all data points as constraints, whereas in TWSVM the patterns of one class provide the constraints for the problem of the other class. Thus, TWSVM solves two smaller QPPs instead of one large QPP.

The non-linear TWSVM seeks to find the following two kernel generated surfaces:

$$\begin{aligned} K(x^t, D^t)u_+ + b_+ = 0 \text { and } K(x^t, D^t)u_- + b_- = 0, \end{aligned}$$
(1)

where K is a kernel function. The optimization problems for TWSVM can be expressed as follows:

$$\begin{aligned} \underset{u_+,b_+,\xi _+}{min}&~~\frac{1}{2}\Vert K(A, D^t)u_+ + e_+b_+\Vert ^2+c_1e^t_-\xi _+ \nonumber \\ s.t. ~~&-(K(B, D^t)u_+ + e_-b_+) + \xi _+ \ge e_-, \xi _+ \ge 0 \end{aligned}$$
(2)

and

$$\begin{aligned} \underset{u_-,b_-,\xi _-}{min}&~~\frac{1}{2}\Vert K(B, D^t)u_- + e_-b_-\Vert ^2+c_2e^t_+\xi _- \nonumber \\ s.t. ~~&(K(A, D^t)u_- + e_+b_-) + \xi _- \ge e_+, \xi _- \ge 0, \end{aligned}$$
(3)

where \(\xi _+, \xi _-\) are slack variables and \(e_+,e_-\) are vectors of ones with appropriate dimensions. Using Lagrange multipliers \(\alpha \ge 0, \beta \ge 0\) and the Karush-Kuhn-Tucker (K.K.T.) conditions, the Wolfe duals of (2) and (3) are:

$$\begin{aligned} \underset{\alpha }{max}&~~e^t_-\alpha - \frac{1}{2}\alpha ^tQ(P^tP)^{-1}Q^t\alpha \nonumber \\ s.t.&~~0 \le \alpha \le c_1 \end{aligned}$$
(4)

and

$$\begin{aligned} \underset{\beta }{max}&~~e^t_+\beta - \frac{1}{2}\beta ^tP(Q^tQ)^{-1}P^t\beta \nonumber \\ s.t.&~~0 \le \beta \le c_2, \end{aligned}$$
(5)

where \(P = [K(A, D^t) ~~e_+]\) and \(Q = [K(B, D^t) ~~e_-]\).

After solving (4) and (5), the optimal separating hyperplanes are given by:

$$\begin{aligned} \begin{bmatrix} u_+ \\ b_+ \end{bmatrix}&=-(P^tP+\delta I)^{-1}Q^t\alpha , \end{aligned}$$
(6)
$$\begin{aligned} \begin{bmatrix} u_- \\ b_- \end{bmatrix}&=(Q^tQ+\delta I)^{-1}P^t\beta , \end{aligned}$$
(7)

where \(\delta (\delta > 0)\) is a regularization parameter used to circumvent the ill-conditioning of the matrices \(P^tP\) and \(Q^tQ\). A new data point x is then assigned its class using the following equation:

$$\begin{aligned} class(x)=arg~~\underset{i=\{+,-\}}{min} \frac{|K(x^t, D^t)u_i + b_i|}{\Vert u_i\Vert }. \end{aligned}$$
(8)

2.2 Universum twin SVM Qi et al. (2012)

The universum twin support vector machine (UTSVM) (Qi et al., 2012) is an extended version of the universum support vector machine (Weston et al., 2006) with reduced computational cost.

The UTSVM seeks the following non-linear hyperplanes:

$$\begin{aligned} K(x^t, D^t)u_+ + b_+ = 0 \text { and } K(x^t, D^t)u_- + b_- = 0. \end{aligned}$$
(9)

The optimisation problem for UTSVM can be expressed as follows:

$$\begin{aligned} \underset{u_+,b_+,\xi _+,\psi }{min}&~~\frac{1}{2}\Vert K(A, D^t)u_+ + e_+b_+\Vert ^2+c_1e^t_-\xi _+ + c_ue_u^t\psi \nonumber \\ s.t.&-(K(B, D^t)u_+ + e_-b_+) + \xi _+ \ge e_-, ~~\xi _+ \ge 0, \nonumber \\&(K(U, D^t)u_+ + e_ub_+) + \psi \ge (-1 + \varepsilon )e_u, ~~\psi \ge 0 \end{aligned}$$
(10)

and

$$\begin{aligned} \underset{u_-,b_-,\xi _-,\psi }{min}&~~\frac{1}{2}\Vert K(B, D^t)u_- + e_-b_-\Vert ^2+c_2e^t_+\xi _- + c_ue_u^t\psi \nonumber \\ s.t.&(K(A, D^t)u_- + e_+b_-) + \xi _- \ge e_+,~~ \xi _- \ge 0, \nonumber \\&-(K(U, D^t)u_- + e_ub_-) + \psi \ge (-1 + \varepsilon )e_u, ~~\psi \ge 0, \end{aligned}$$
(11)

where \(\xi _+, \xi _-, \psi \) are the slack variables and \(\varepsilon \) is the tolerance variable for the universum. Using the Lagrangian multipliers \(\alpha _1, \alpha _2, \mu _1, \mu _2\) and the appropriate K.K.T. conditions, the duals of (10) and (11) can be expressed as follows:

$$\begin{aligned} \underset{\alpha _1, \mu _1}{max}&~~e^t_-\alpha _1 - \frac{1}{2}(\alpha _1^tT-\mu _1^tO)(S^tS)^{-1}(T^t\alpha _1 - O^t\mu _1) + (\varepsilon - 1)e_u^t\mu _1 \nonumber \\ s.t.&~~0 \le \alpha _1 \le c_1, 0 \le \mu _1 \le c_u \end{aligned}$$
(12)

and

$$\begin{aligned} \underset{\alpha _2, \mu _2}{max}&~~e^t_+\alpha _2 - \frac{1}{2}(\alpha _2^tS-\mu _2^tO)(T^tT)^{-1}(S^t\alpha _2 - O^t\mu _2) + (\varepsilon - 1)e_u^t\mu _2 \nonumber \\ s.t.&~~ 0 \le \alpha _2 \le c_2, 0 \le \mu _2 \le c_u, \end{aligned}$$
(13)

where \(O = [K(U,D^t) \;e_u]\), \(S = [K(A, D^t) \;e_+]\) and \(T = [K(B, D^t) \;e_-]\).

After solving (12) and (13), the optimal separating hyperplanes are given by:

$$\begin{aligned} \begin{bmatrix} u_+ \\ b_+ \end{bmatrix}&=-(S^tS+\delta I)^{-1}(T^t\alpha _1 - O^t\mu _1), \end{aligned}$$
(14)
$$\begin{aligned} \begin{bmatrix} u_- \\ b_- \end{bmatrix}&=(T^tT+\delta I)^{-1}(S^t\alpha _2 - O^t\mu _2), \end{aligned}$$
(15)

where \(\delta \) (\(\delta > 0\)) is a regularization parameter; the term \(\delta I\) is added to circumvent the ill-conditioning of the \(S^tS\) and \(T^tT\) matrices.

The new data point x is assigned its class using the following function:

$$\begin{aligned} class(x)=arg~~\underset{i=\{+,-\}}{min} \frac{|K(x^t, D^t)u_i + b_i|}{\Vert u_i\Vert }. \end{aligned}$$
(16)

2.3 Pin-GTSVM Tanveer et al. (2019)

The general twin support vector machine with pinball loss (Pin-GTSVM) (Tanveer et al., 2019) seeks the following hyperplanes:

$$\begin{aligned} K(x^t, D^t)u_+ + b_+ = 0 \text { and } K(x^t, D^t)u_- + b_- = 0, \end{aligned}$$
(17)

where K is the kernel function.

The optimisation problem for Pin-GTSVM can be expressed as follows:

$$\begin{aligned} \underset{u_+,b_+,\xi _+}{min}&~~\frac{1}{2}\Vert K(A, D^t)u_+ + e_+b_+\Vert ^2+c_1e^t_-\xi _+ \nonumber \\ s.t. ~&-(K(B, D^t)u_+ + e_-b_+) + \xi _+ \ge e_-, \nonumber \\&-(K(B, D^t)u_+ + e_-b_+) - \frac{\xi _+}{\tau _2} \le e_- \end{aligned}$$
(18)

and

$$\begin{aligned} \underset{u_-,b_-,\xi _-}{min}&~~\frac{1}{2}\Vert K(B, D^t)u_- + e_-b_-\Vert ^2+c_2e^t_+\xi _- \nonumber \\ s.t. ~&K(A, D^t)u_- + e_+b_- + \xi _- \ge e_+, \nonumber \\&K(A, D^t)u_- + e_+b_- - \frac{\xi _-}{\tau _1} \le e_+, \end{aligned}$$
(19)

where \(\xi _+, \xi _-\) are slack variables. Using the Lagrangian multipliers \(\alpha , \beta , \gamma , \sigma \ge 0\) and employing the appropriate K.K.T. conditions, the duals of (18) and (19) can be formulated as follows:

$$\begin{aligned} \underset{\alpha -\beta }{max}&~~e^t_-(\alpha -\beta ) - \frac{1}{2}(\alpha -\beta )^tQ(P^tP)^{-1}Q^t(\alpha -\beta ) \nonumber \\ s.t.&~~-\tau _2c_1e_- \le (\alpha -\beta ) \end{aligned}$$
(20)

and

$$\begin{aligned} \underset{(\gamma -\sigma )}{max}&~~e^t_+(\gamma -\sigma ) - \frac{1}{2}(\gamma -\sigma )^tP(Q^tQ)^{-1}P^t(\gamma -\sigma ) \nonumber \\ s.t.&~~ -\tau _1c_2e_+ \le (\gamma -\sigma ), \end{aligned}$$
(21)

where \(P = [K(A, D^t) ~~e_+]\) and \(Q = [K(B, D^t) ~~e_-]\).

After solving (20) and (21), the optimal separating hyperplanes are given by:

$$\begin{aligned} \begin{bmatrix} u_+ \\ b_+ \end{bmatrix}&=-(P^tP+\delta I)^{-1}Q^t(\alpha -\beta ), \end{aligned}$$
(22)
$$\begin{aligned} \begin{bmatrix} u_- \\ b_- \end{bmatrix}&=(Q^tQ+\delta I)^{-1}P^t(\gamma -\sigma ), \end{aligned}$$
(23)

where \(\delta \) (\(\delta > 0\)) is a regularization parameter used to circumvent the ill-conditioning of the \(P^tP\) and \(Q^tQ\) matrices.

A new data point x is assigned its class using the following equation:

$$\begin{aligned} class(x)=arg~~\underset{i=\{+,-\}}{min} \frac{|K(x^t, D^t)u_i + b_i|}{\Vert u_i\Vert }. \end{aligned}$$
(24)

3 Proposed universum twin SVM with pinball loss

Universum TWSVM (Qi et al., 2012) is based on the hinge loss and hence suffers in the presence of noise and is unstable under resampling. Taking inspiration from Pin-GTSVM (Tanveer et al., 2019), we include the pinball loss in the universum twin SVM and propose a novel universum twin SVM with pinball loss function (Pin-UTSVM). The proposed Pin-UTSVM overcomes the issues of noise and is more stable under resampling compared to the standard universum TWSVM model.

3.1 Linear universum twin SVM with pinball loss

The optimization problems for the linear variant of the proposed universum twin SVM with pinball loss (Pin-UTSVM) are given as:

$$\begin{aligned} \underset{u_+,b_+,\xi _+,\psi }{min}&~~\frac{1}{2}\Vert Au_++e_+b_+\Vert ^2+c_1e^t_-\xi _++c_2e_u^t\psi \nonumber \\ s.t.&-(Bu_++e_-b_+)\ge e_--\xi _+ \nonumber , \\&-(Bu_++e_-b_+)\le e_-+\frac{1}{\tau _1}\xi _+ \nonumber , \\&Uu_++e_ub_+\ge (-1+\epsilon )e_u-\psi \nonumber , \\&Uu_++e_ub_+\le (-1+\epsilon )e_u+\frac{1}{\tau _2}\psi , \end{aligned}$$
(25)

and

$$\begin{aligned} \underset{u_-,b_-,\xi _-,\psi }{min}&~~\frac{1}{2}\Vert Bu_-+e_-b_-\Vert ^2+c_3e^t_+\xi _-+c_4e_u^t\psi \nonumber \\ s.t.~~&Au_-+e_+b_-\ge e_+-\xi _- \nonumber , \\&Au_-+e_+b_-\le e_++\frac{1}{\tau _3}\xi _- \nonumber , \\&-(Uu_-+e_ub_-)\ge (-1+\epsilon )e_u-\psi \nonumber , \\&-( Uu_-+e_ub_-)\le (-1+\epsilon )e_u+\frac{1}{\tau _4}\psi , \end{aligned}$$
(26)

where \(\xi _i, \psi \) are the slack variables, \(c_j\) and \(\tau _j\) are positive parameters and \(\epsilon \) is a hyperparameter, for \(i=+,-\) and \(j=1,2,3,4\).

Note that when \(\tau _1,\tau _2\) and \(\tau _3,\tau _4\) tend to zero, the second and fourth constraints of (25) reduce to \(\xi _+\ge 0\) and \(\psi \ge 0\), and similarly for (26). Under these conditions, the proposed Pin-UTSVM reduces to the standard UTSVM. Hence, the standard UTSVM is a special case of the proposed Pin-UTSVM model.

To obtain the solution of (25) and (26), we derive their Wolfe dual. The Lagrangian of the optimization problem (25) is given as

$$\begin{aligned} L=&\frac{1}{2}\Vert Au_++e_+b_+\Vert ^2+c_1e^t_-\xi _++c_2e_u^t\psi +\alpha _1^t(Bu_++e_-b_+ +e_--\xi _+) \nonumber \\&-\alpha _2^t(Bu_++e_-b_++ e_-+\frac{1}{\tau _1}\xi _+) -\beta _1^t(Uu_++e_ub_+- (-1+\epsilon )e_u+\psi ) \nonumber \\&+\beta _2^t(Uu_++e_ub_+- (-1+\epsilon )e_u-\frac{1}{\tau _2}\psi ) . \end{aligned}$$
(27)

Applying K.K.T. conditions on (27), we have

$$\begin{aligned}&\frac{\partial L}{\partial u_+}= A^t(Au_++e_+b_+)+B^t(\alpha _1-\alpha _2)-U^t(\beta _1-\beta _2)=0, \end{aligned}$$
(28)
$$\begin{aligned}&\frac{\partial L}{\partial b_+}=e_+^t(Au_++e_+b_+)+e_-^t(\alpha _1-\alpha _2)-e_u^t(\beta _1-\beta _2)=0, \end{aligned}$$
(29)
$$\begin{aligned}&\frac{\partial L}{\partial \xi _+}=c_1e_--\alpha _1-\frac{1}{\tau _1}\alpha _2=0, \end{aligned}$$
(30)
$$\begin{aligned}&\frac{\partial L}{\partial \psi }=c_2e_u-\beta _1-\frac{1}{\tau _2}\beta _2=0, \end{aligned}$$
(31)
$$\begin{aligned}&\alpha _1^t(Bu_++e_-b_+ +e_--\xi _+)=0, \end{aligned}$$
(32)
$$\begin{aligned}&\alpha _2^t(Bu_++e_-b_++ e_-+\frac{1}{\tau _1}\xi _+)=0, \end{aligned}$$
(33)
$$\begin{aligned}&\beta _1^t(Uu_++e_ub_+- (-1+\epsilon )e_u+\psi )=0, \end{aligned}$$
(34)
$$\begin{aligned}&\beta _2^t(Uu_++e_ub_+- (-1+\epsilon )e_u-\frac{1}{\tau _2}\psi )=0 . \end{aligned}$$
(35)

Using (30) and \(\alpha _1\ge 0\), \(\alpha _2\ge 0\), we get

$$\begin{aligned} -\tau _1c_1e_-\le (\alpha _1-\alpha _2)\le c_1e_-. \end{aligned}$$
(36)

Similarly, using (31) and \(\beta _1\ge 0\), \(\beta _2\ge 0\), we get

$$\begin{aligned} -\tau _2c_2e_u\le (\beta _1-\beta _2)\le c_2e_u. \end{aligned}$$
(37)

Let \(X_1=[A,e_+], X_2=[B,e_-], X_3=[U,e_u]\).

Rewriting (28) and (29), we have

$$\begin{aligned} \begin{bmatrix} u_+ \\ b_+ \end{bmatrix}=-(X_1^tX_1)^{-1}\Big (X_2^t(\alpha _1-\alpha _2)-X_3^t(\beta _1-\beta _2)\Big ). \end{aligned}$$
(38)

Suppose \(\alpha =(\alpha _1-\alpha _2), \beta =(\beta _1-\beta _2), \gamma = [\alpha ;\beta ], N=[X_2;-X_3],\) and \(e_4=[e_-;(-1+\epsilon )e_u]\), then using K.K.T. conditions the Wolfe dual of (25) is given as

$$\begin{aligned} \underset{\gamma }{max}~~&-\frac{1}{2}\gamma ^tN(X_1^tX_1)^{-1}N^t\gamma +e_4^t\gamma \nonumber \\ s.t.~~&\begin{bmatrix} -\tau _1c_1e_- \\ -\tau _2c_2e_u \end{bmatrix}\ \le \gamma \le \begin{bmatrix} c_1e_- \\ c_2e_u \end{bmatrix}. \end{aligned}$$
(39)

In a similar manner, the Wolfe dual of (26) is given as follows

$$\begin{aligned} \underset{\theta }{max}~~&-\frac{1}{2}\theta ^tP(X_2^tX_2)^{-1}P^t\theta +e_5^t\theta \nonumber \\ s.t.~~&\begin{bmatrix} -\tau _3c_3e_+ \\ -\tau _4c_4e_u \end{bmatrix} \le \theta \le \begin{bmatrix} c_3e_+ \\ c_4e_u \end{bmatrix}, \end{aligned}$$
(40)

where \(\theta =[\eta ;\zeta ], \eta =(\eta _1-\eta _2), \zeta =(\zeta _1-\zeta _2), P=[X_1;-X_3],e_5=[e_+;(-1+\epsilon )e_u]\).

The optimal hyperplane corresponding to the other class is given as

$$\begin{aligned} \begin{bmatrix} u_- \\ b_- \end{bmatrix}=(X_2^tX_2)^{-1}(X_1^t\eta -X_3^t\zeta ). \end{aligned}$$
(41)

The testing data sample \(x\in {\mathbb {R}}^n\) is assigned the class as follows

$$\begin{aligned} Class(x)=arg~~\underset{i=\{+,-\}}{min} \frac{|u_i^tx+b_i|}{\Vert u_i\Vert }. \end{aligned}$$
(42)
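Putting the pieces together, the following is a minimal sketch of training the linear Pin-UTSVM through the duals (39)-(40) and recovering the hyperplanes via (38) and (41). It reuses the box_qp helper from the TWSVM sketch in Sect. 2.1; the small ridge term reg added to \(X_1^tX_1\) and \(X_2^tX_2\) for numerical stability, and all parameter values, are assumptions rather than part of the formulation above.

```python
# A minimal sketch of linear Pin-UTSVM training via duals (39)-(40).
import numpy as np

def pin_utsvm_train(A, B, U, c1, c2, c3, c4, tau, eps, reg=1e-6):
    """Solve duals (39)-(40) and recover [u_+; b_+], [u_-; b_-] via (38) and (41)."""
    t1, t2, t3, t4 = tau
    X1 = np.hstack([A, np.ones((len(A), 1))])   # X_1 = [A  e_+]
    X2 = np.hstack([B, np.ones((len(B), 1))])   # X_2 = [B  e_-]
    X3 = np.hstack([U, np.ones((len(U), 1))])   # X_3 = [U  e_u]
    inv1 = np.linalg.inv(X1.T @ X1 + reg * np.eye(X1.shape[1]))  # ridge: an assumption
    inv2 = np.linalg.inv(X2.T @ X2 + reg * np.eye(X2.shape[1]))
    # Dual (39): gamma = [alpha; beta], N = [X_2; -X_3], e_4 = [e_-; (eps-1)e_u]
    N = np.vstack([X2, -X3])
    e4 = np.concatenate([np.ones(len(B)), (eps - 1.0) * np.ones(len(U))])
    lb = np.concatenate([-t1 * c1 * np.ones(len(B)), -t2 * c2 * np.ones(len(U))])
    ub = np.concatenate([c1 * np.ones(len(B)), c2 * np.ones(len(U))])
    gamma = box_qp(N @ inv1 @ N.T, e4, lb, ub)
    z_pos = -inv1 @ N.T @ gamma                  # eq. (38): [u_+; b_+]
    # Dual (40): theta = [eta; zeta], P = [X_1; -X_3], e_5 = [e_+; (eps-1)e_u]
    P = np.vstack([X1, -X3])
    e5 = np.concatenate([np.ones(len(A)), (eps - 1.0) * np.ones(len(U))])
    lb = np.concatenate([-t3 * c3 * np.ones(len(A)), -t4 * c4 * np.ones(len(U))])
    ub = np.concatenate([c3 * np.ones(len(A)), c4 * np.ones(len(U))])
    theta = box_qp(P @ inv2 @ P.T, e5, lb, ub)
    z_neg = inv2 @ P.T @ theta                   # eq. (41): [u_-; b_-]
    return z_pos, z_neg

def pin_utsvm_predict(x, z_pos, z_neg):
    """Rule (42): assign x to the class of the nearer hyperplane."""
    d = [abs(np.append(x, 1.0) @ z) / np.linalg.norm(z[:-1]) for z in (z_pos, z_neg)]
    return +1 if d[0] <= d[1] else -1
```

A hypothetical call would be `pin_utsvm_train(A, B, U, c1=1, c2=1, c3=1, c4=1, tau=(0.5, 0.5, 0.5, 0.5), eps=0.1)`, with all values chosen only for illustration.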

3.2 Non-linear universum twin SVM with pinball loss

The optimization problems of the proposed universum twin SVM with pinball loss (Pin-UTSVM) for the non-linear case are given as

$$\begin{aligned} \underset{u_+,b_+,\xi _+,\psi }{min}&~~\frac{1}{2}\Vert K(A,D^t)u_++e_+b_+\Vert ^2+c_1e^t_-\xi _++c_2e_u^t\psi \nonumber \\ s.t.&-(K(B,D^t)u_++e_-b_+)\ge e_--\xi _+ \nonumber , \\&-(K(B,D^t)u_++e_-b_+)\le e_-+\frac{1}{\tau _1}\xi _+ \nonumber , \\&K(U,D^t)u_++e_ub_+\ge (-1+\epsilon )e_u-\psi \nonumber , \\&K(U,D^t)u_++e_ub_+\le (-1+\epsilon )e_u+\frac{1}{\tau _2}\psi \end{aligned}$$
(43)

and

$$\begin{aligned} \underset{u_-,b_-,\xi _-,\psi }{min}&~~\frac{1}{2}\Vert K(B,D^t)u_-+e_-b_-\Vert ^2+c_3e^t_+\xi _-+c_4e_u^t\psi \nonumber \\ s.t.&K(A,D^t)u_-+e_+b_-\ge e_+-\xi _- \nonumber , \\&K(A,D^t)u_-+e_+b_-\le e_++\frac{1}{\tau _3}\xi _- \nonumber , \\&-(K(U,D^t)u_-+e_ub_-)\ge (-1+\epsilon )e_u-\psi \nonumber , \\&-( K(U,D^t)u_-+e_ub_-)\le (-1+\epsilon )e_u+\frac{1}{\tau _4}\psi . \end{aligned}$$
(44)

To obtain the solution of (43) and (44), we derive their Wolfe dual. The Lagrangian of the optimization problem (43) is given as

$$\begin{aligned} L=&\frac{1}{2}\Vert K(A,D^t)u_++e_+b_+\Vert ^2\!+\!c_1e^t_-\xi _+\!+\!c_2e_u^t\psi \!+\!\alpha _1^t(K(B,D^t)u_++e_-b_+ +e_--\xi _+) \nonumber \\&-\alpha _2^t(K(B,D^t)u_++e_-b_++ e_-+\frac{1}{\tau _1}\xi _+) -\beta _1^t(K(U,D^t)u_+ \nonumber \\&+e_ub_+- (-1+\epsilon )e_u+\psi )+\beta _2^t(K(U,D^t)u_++e_ub_+- (-1+\epsilon )e_u-\frac{1}{\tau _2}\psi ) . \end{aligned}$$
(45)

Applying K.K.T. conditions on (45), we have

$$\begin{aligned}&\frac{\partial L}{\partial u_+}= (K(A,D^t))^t(K(A,D^t)u_++e_+b_+)\nonumber \\&\quad +(K(B,D^t))^t(\alpha _1-\alpha _2)-(K(U,D^t))^t(\beta _1-\beta _2)=0, \end{aligned}$$
(46)
$$\begin{aligned}&\frac{\partial L}{\partial b_+}=e_+^t(K(A,D^t)u_++e_+b_+)+e_-^t(\alpha _1-\alpha _2)-e_u^t(\beta _1-\beta _2)=0, \end{aligned}$$
(47)
$$\begin{aligned}&\frac{\partial L}{\partial \xi _+}=c_1e_--\alpha _1-\frac{1}{\tau _1}\alpha _2=0 , \end{aligned}$$
(48)
$$\begin{aligned}&\frac{\partial L}{\partial \psi }=c_2e_u-\beta _1-\frac{1}{\tau _2}\beta _2=0, \end{aligned}$$
(49)
$$\begin{aligned}&\alpha _1^t(K(B,D^t)u_++e_-b_+ +e_--\xi _+)=0, \end{aligned}$$
(50)
$$\begin{aligned}&\alpha _2^t(K(B,D^t)u_++e_-b_++ e_-+\frac{1}{\tau _1}\xi _+)=0, \end{aligned}$$
(51)
$$\begin{aligned}&\beta _1^t(K(U,D^t)u_++e_ub_+- (-1+\epsilon )e_u+\psi )=0, \end{aligned}$$
(52)
$$\begin{aligned}&\beta _2^t(K(U,D^t)u_++e_ub_+- (-1+\epsilon )e_u-\frac{1}{\tau _2}\psi )=0 . \end{aligned}$$
(53)

Using (48) and \(\alpha _1\ge 0\), \(\alpha _2\ge 0\), we get

$$\begin{aligned} -\tau _1c_1e_-\le (\alpha _1-\alpha _2)\le c_1e_-. \end{aligned}$$
(54)

Similarly, using (49) and \(\beta _1\ge 0\), \(\beta _2\ge 0\), we get

$$\begin{aligned} -\tau _2c_2e_u\le (\beta _1-\beta _2)\le c_2e_u. \end{aligned}$$
(55)

Let \(X_1=[K(A,D^t),e_+], X_2=[K(B,D^t),e_-], X_3=[K(U,D^t),e_u]\).

Rewriting (46) and (47), we have

$$\begin{aligned} \begin{bmatrix} u_+ \\ b_+ \end{bmatrix}=-(X_1^tX_1)^{-1}(X_2^t(\alpha _1-\alpha _2)-X_3^t(\beta _1-\beta _2)). \end{aligned}$$
(56)

Suppose \(\alpha =(\alpha _1-\alpha _2), \beta =(\beta _1-\beta _2), \gamma = [\alpha ;\beta ], N=[X_2;-X_3],\) and \(e_4=[e_-;(-1+\epsilon )e_u]\), then using K.K.T. conditions the Wolfe dual of (43) is given as

$$\begin{aligned} \underset{\gamma }{max}~~&-\frac{1}{2}\gamma ^tN(X_1^tX_1)^{-1}N^t\gamma +e_4^t\gamma \nonumber \\ s.t.~~&\begin{bmatrix} -\tau _1c_1e_- \\ -\tau _2c_2e_u \end{bmatrix}\le \gamma \le \begin{bmatrix} c_1e_- \\ c_2e_u \end{bmatrix}. \end{aligned}$$
(57)

In a similar manner, the Wolfe dual of (44) is given as follows

$$\begin{aligned} \underset{\theta }{max}~~&-\frac{1}{2}\theta ^tP(X_2^tX_2)^{-1}P^t\theta +e_5^t\theta \nonumber \\ s.t.~~&\begin{bmatrix} -\tau _3c_3e_+ \\ -\tau _4c_4e_u \end{bmatrix}\le \theta \le \begin{bmatrix} c_3e_+ \\ c_4e_u \end{bmatrix}, \end{aligned}$$
(58)

where \(\theta =[\eta ;\zeta ], \eta =(\eta _1-\eta _2), \zeta =(\zeta _1-\zeta _2), P=[X_1;-X_3],e_5=[e_+;(-1+\epsilon )e_u]\).

The optimal hyperplane corresponding to the other class is given as

$$\begin{aligned} \begin{bmatrix} u_- \\ b_- \end{bmatrix}=(X_2^tX_2)^{-1}\Big (X_1^t\eta -X_3^t\zeta \Big ). \end{aligned}$$
(59)

The testing data sample \(x\in {\mathbb {R}}^n\) is assigned the class as follows

$$\begin{aligned} class(x)=arg~~\underset{i=\{+,-\}}{min} \frac{|K(x^t, D^t)u_i + b_i|}{\Vert u_i\Vert }. \end{aligned}$$
(60)

The algorithm for Pin-UTSVM is briefly described in Algorithm 1.

Algorithm 1: Pin-UTSVM

3.3 Computational complexity

Let \(m_1,m_2\) and \(m_u\) be the numbers of positive, negative and universum samples, respectively, with \(N=m_1+m_2\). The computational complexity of SVM is \(T=O(N^3)\), where N is the dataset size. Since TWSVM solves two smaller QPPs, its complexity is \(T=O(m_1^3)+O(m_2^3)\). Comparing the Wolfe duals of the proposed Pin-UTSVM and the standard UTSVM model reveals that both have the same number of constraints and that the matrices requiring inversion have the same order. Thus, the complexity of the proposed Pin-UTSVM is \(T=O((m_1+m_u)^3)+O((m_2+m_u)^3)\), which is the same as that of the standard UTSVM model.

3.4 Comparison of proposed Pin-UTSVM with the baseline models

The proposed Pin-UTSVM is different from the TWSVM, Pin-GTSVM and UTSVM as follows:

  • The TWSVM model uses the hinge loss function whereas the proposed Pin-UTSVM model uses the pinball loss function to penalize the errors. Moreover, the proposed Pin-UTSVM uses universum data to improve the performance, which the standard TWSVM model ignores.

  • The Pin-UTSVM model differs from the Pin-GTSVM model in that the former uses the universum concept whereas the latter ignores it.

  • The UTSVM model uses the hinge loss function to penalize the errors whereas the proposed Pin-UTSVM model uses the pinball loss function.

4 Experiments

This section discusses the experimental setup followed for evaluating the performance of the classification models. We also describe the data acquisition and the preprocessing followed for feature extraction. We analyse the performance of the models statistically and evaluate the effect of different hyperparameters on the performance of the classification models.

Fig. 1: Sample EEG signals for a Healthy (control), b Interictal (seizure-free) and c Ictal (seizure) state

4.1 Dataset acquisition, pre-processing and experimental setup

This section details the experiments performed and the results obtained for classifying EEG signals. The dataset used is taken from Andrzejak et al. (2001). It consists of five collections (denoted S, F, N, O and Z), each consisting of one hundred single-channel EEG signals sampled at 173.61 Hz for 23.6 s. The collections O and Z represent healthy (control) subjects with eyes closed and open, respectively. The collections F and N represent subjects in the interictal (seizure-free) state, and the collection S represents subjects in the ictal (seizure) state. The EEG recordings for sets S, F and N are intracranial, and all EEG signals were acquired using the same 128-channel amplifier system with an average common reference.

In the numerical experiments, ten-fold cross-validation is used. The tests are designed for binary classification between the healthy and the ictal (seizure) states, giving the tasks Z vs S and O vs S, as both Z and O represent healthy subjects. The universum data is used to provide prior information about the training data; thus, following Richhariya and Tanveer (2018), the set N, which consists of interictal-state recordings, is used as universum data. Several feature extraction methods, i.e. independent component analysis (ICA), principal component analysis (PCA) and the wavelet transform, were used. Several wavelet families with different levels of decomposition were used: the Daubechies db2, db4 and Haar wavelets at level three, and db2 and db6 at level two. In the case of the wavelet transform and ICA, PCA is applied first to reduce the number of dimensions. The class discriminatory ratio (CDR) was used to rank the PCA components and choose the most relevant ones. The ICA implementation follows Bartlett et al. (2002). The effectiveness of the proposed Pin-UTSVM was tested against Pin-GTSVM (Tanveer et al., 2019), TWSVM (Jayadeva & Khemchandani, 2007) and UTSVM (Qi et al., 2012). The experiments were performed on both the linear and non-linear variants of all the methods. For the non-linear variants, the Gaussian kernel given by equation (61) was used, with the kernel parameter \(\mu \) calculated using equation (62) (Tsang et al., 2006). The values and ranges of the various hyperparameters used in the different methods are catalogued in Table 1. Additionally, to check the performance of the algorithms in the presence of noise, we added zero-mean Gaussian noise with standard deviation \(\sigma \) = 0 (i.e. no noise), 0.05, 0.075 and 0.1.

$$\begin{aligned} K(a, b)&= e^{\frac{-1}{2\mu ^2}\Vert a-b\Vert ^2}, \end{aligned}$$
(61)
$$\begin{aligned} \mu&= \frac{1}{N^2} \sum ^N_{i,j=1} \Vert x_i-x_j\Vert ^2, \end{aligned}$$
(62)

where N is the total number of data points and \(x_i\) represents the \(i{\text {th}}\) data point. The hyperparameters for UTSVM were set as \(c=c_1=c_2\), with \(c_u\) tuned separately. The hyperparameters for Pin-UTSVM were set as \(c_1=c_3, c_2=c_4\) and \(\tau _1=\tau _3, \tau _2=\tau _4\).
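The following short sketch computes the Gaussian kernel (61) and the kernel parameter \(\mu \) of (62). Taking the mean pairwise squared distance directly as \(\mu \) (rather than as \(\mu ^2\)) follows equation (62) literally and is noted here as an assumption about the convention of Tsang et al. (2006).

```python
# A minimal sketch of the Gaussian kernel (61) and kernel parameter mu (62).
import numpy as np
from scipy.spatial.distance import cdist

def kernel_width(X):
    """Eq. (62): mean of the pairwise squared distances over all N^2 pairs."""
    return cdist(X, X, metric="sqeuclidean").mean()

def gaussian_kernel(Xa, Xb, mu):
    """Eq. (61): K(a, b) = exp(-||a - b||^2 / (2*mu^2))."""
    return np.exp(-cdist(Xa, Xb, metric="sqeuclidean") / (2.0 * mu ** 2))
```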

All the computations were performed on a High-Performance PC running Windows 7 OS with Intel® Xeon® CPU E5-2697 v4 @ 2.30 GHz and 128 GB RAM. The program used for coding was MATLAB® R2017a.

Table 1 Parameter ranges for various methods

4.2 Results

Here, we discuss the performance of the standard models and the proposed Pin-UTSVM model for the classification of EEG signals with different feature extraction techniques. The performance of the models for the linear and non-linear cases is given as follows:

4.2.1 Evaluation of the models with linear kernel

The results of the experiments with the linear kernel are presented in Table 2, and the corresponding optimal parameters are given in Table 3. Table 2 reports the average accuracy obtained after cross-validation at the optimal parameters, which were found by grid search, together with the average rank of each method. The proposed Pin-UTSVM model performs best, achieving the lowest possible average rank of 1, which establishes that the proposed method has the best performance in the linear case. The best accuracy in Table 2 for a noise-free dataset is 0.89, obtained by the proposed Pin-UTSVM using the wavelet (Haar) feature extraction on the O &S dataset. Additionally, our proposed method obtains a substantially better average accuracy, around \(5\%\) higher than the next best method, UTSVM. TWSVM and UTSVM perform similarly, achieving average ranks of 2.625 and 2.375, respectively. Pin-GTSVM is the worst performing method.

Specifically evaluating the Z &S dataset for the linear case, we observe that the wavelet (Haar) and wavelet (db2) feature extraction methods perform best for the proposed Pin-UTSVM, whereas ICA performs the worst. For the O &S dataset, the same observations hold for the proposed Pin-UTSVM. The ICA feature extraction performs the worst for all of the methods in the linear case.

Table 2 Accuracies obtained for different methods using linear kernel on EEG data with four noise levels
Table 3 Optimal parameters obtained for different methods using linear kernel on EEG data with four noise levels
Table 4 Accuracies obtained for different methods using Gaussian kernel on EEG data with four noise levels
Table 5 Optimal parameters obtained for different methods using Gaussian kernel on EEG data with four noise levels

4.2.2 Evaluation of the models with Gaussian kernel

The results of the experiments with the Gaussian kernel are presented in Table 4, and the corresponding optimal parameters are given in Table 5. Table 4 is arranged similarly to the linear case, i.e., it reports the average accuracy obtained after cross-validation at the optimal parameters found by grid search, together with the average rank of each method. Analogous to the linear case, the proposed model performs exceptionally well, achieving an average rank of 1.286.

The best accuracy in Table 4 for a noise-free dataset is 0.975, obtained both by the proposed Pin-UTSVM using the wavelet (db1) feature extraction and by UTSVM using the ICA feature extraction, in each case on the Z &S dataset. The runner-up method is UTSVM with an average rank of 2.071. The TWSVM method obtains an average rank of 2.973, and Pin-GTSVM, the worst performing method, receives an average rank of 3.67. All the methods perform significantly better with the Gaussian kernel than with the linear kernel in Table 2: the average accuracy of each method increased by 9–22% when switching to the Gaussian kernel. The smallest increase is observed for the proposed Pin-UTSVM, suggesting a lower dependence on the Gaussian kernel to form separating hyperplanes. This suggests using the faster linear kernel when speed is a priority, with minimal compromise in accuracy.

Specifically evaluating the Z &S dataset for the Gaussian kernel case, we observe that the wavelet (db1) performs the best for our method, whereas ICA performs the worst. For the O &S dataset, wavelet (db4) performs the best and ICA the worst. A difference from the linear case can be observed for the ICA feature extraction: across the methods, the accuracy obtained with ICA is comparable to or higher than that of the other feature extraction techniques, indicating that a non-linear kernel is necessary when using ICA.

The performance of the models is also evaluated on UCI datasets (Dua & Graff, 2017). Table 16 shows that the performance of the proposed Pin-UTSVM model is better than that of the baseline models.

4.3 Statistical analysis

Here, we statistically evaluate the performance of the various models via the ANOVA and Tukey-Kramer test, the Friedman test and the pairwise sign test.

4.3.1 ANOVA & Tukey-Kramer (TK) test

The evaluation is performed for both the linear and Gaussian kernel cases. We first analyse the ANOVA (\(\alpha = 0.05\)) results to determine whether a significant difference exists between the models. If a difference exists, the TK post hoc analysis is used to determine the specific differences (Montgomery, 2017). When the sample sizes of the groups are equal, the TK test uses the q statistic defined below.

$$\begin{aligned} q = \frac{\bar{Y_i}-\bar{Y_j}}{s.e.} \end{aligned}$$
(63)

where \(\bar{Y_i}\) is the mean of the samples of group i, the standard error is \(s.e. = \sqrt{MS\text {(within group)}/n}\) and n is the number of samples per group.
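For concreteness, a minimal sketch of computing the pairwise q statistics of (63) from a table of accuracies is given below; the (k models \(\times \) n datasets) layout of the input array is an illustrative assumption.

```python
# A minimal sketch of the Tukey-Kramer q statistic (63) for equal group sizes.
import numpy as np

def tukey_q(group_acc):
    """Pairwise |q| values of (63); group_acc has shape (k groups, n datasets)."""
    k, n = group_acc.shape
    means = group_acc.mean(axis=1)
    # MS(within group): pooled within-group variance with df = k*(n - 1).
    ms_within = ((group_acc - means[:, None]) ** 2).sum() / (k * (n - 1))
    se = np.sqrt(ms_within / n)
    # Compare each entry against q_crit(alpha, k, df = k*(n-1)); larger => significant.
    return np.abs(means[:, None] - means[None, :]) / se
```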

For the linear case we can observe from Table 6 that \(F>F_{crit}\) and the P value \(<\alpha \), indicating a significant difference between the groups. Using a studentised range table, the closest value of \(q_{crit}\) for \(\alpha =0.05, df=220, k=4\) is 3.659. Table 7 reports the q values for the proposed Pin-UTSVM versus TWSVM, Pin-GTSVM and UTSVM. Since all the calculated values of q are greater than \(q_{crit}\), there is a significant difference between Pin-UTSVM and all the other methods.

Table 6 ANOVA results for Linear case
Table 7 Tukey-Kramer analysis for Linear case

For the non-linear case we can observe from Table 8 that \(F>F_{crit}\) and the P value \(<\alpha \), indicating a significant difference between the groups. Using a studentised range table, the closest value of \(q_{crit}\) for \(\alpha =0.05, df=220, k=4\) is 3.659. Table 9 reports the q values for the proposed Pin-UTSVM versus TWSVM, Pin-GTSVM and UTSVM. Since all the calculated values of q are greater than \(q_{crit}\), there is a significant difference between Pin-UTSVM and all the other methods.

Table 8 ANOVA results for Non-Linear case
Table 9 Tukey-Kramer analysis for Non-Linear case

4.3.2 Friedman test

The evaluation is performed for both the linear and Gaussian kernel cases. We use the Friedman test to analyse the significance of the models. In the Friedman test, the models are ranked on each dataset, with worse performing models assigned higher ranks and vice versa. Let \(r^i_j\) be the rank of the jth model on the ith dataset. Suppose K classification models are evaluated on T datasets; then the average rank of the jth classifier is \(R_j=\frac{1}{T}\sum _{i=1}^Tr_j^i\). The Friedman statistic follows the \(\chi ^2_F\) distribution with \((K-1)\) degrees of freedom and is given as

$$\begin{aligned} \chi ^2_F=\frac{12T}{K(K+1)}\Big [\sum _j R_j^2-\frac{K(K+1)^2}{4} \Big ]. \end{aligned}$$
(64)

The Friedman statistic is undesirably conservative; hence, a better statistic is given as

$$\begin{aligned} F_F=\frac{(T-1)\chi ^2_F}{T(K-1)-\chi ^2_F}, \end{aligned}$$
(65)

which follows the F-distribution with \((K-1)\) and \((K-1)(T-1)\) degrees of freedom. Under the null hypothesis, all models perform identically and their average ranks are equal. If the null hypothesis is rejected, the Nemenyi test is used for pairwise comparison of the models. In the Nemenyi post hoc test, two models are considered equivalent if their average rank difference is less than the critical difference (CD). Mathematically,

$$\begin{aligned} CD=q_\alpha \sqrt{\frac{K(K+1)}{6T}}. \end{aligned}$$
(66)
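The quantities in (64)-(66) can be computed directly from the accuracy table, as in the following sketch; the (T datasets \(\times \) K models) layout of the input array is an illustrative assumption.

```python
# A minimal sketch of the Friedman / Nemenyi quantities in (64)-(66).
import numpy as np
from scipy.stats import rankdata

def friedman_nemenyi(acc, q_alpha=2.5690):
    """acc has shape (T datasets, K models); higher accuracy gets the better (lower) rank."""
    T, K = acc.shape
    ranks = np.vstack([rankdata(-row) for row in acc])  # rank 1 = best on each dataset
    R = ranks.mean(axis=0)                              # average ranks R_j
    chi2 = 12.0 * T / (K * (K + 1)) * (np.sum(R ** 2) - K * (K + 1) ** 2 / 4.0)  # (64)
    f_f = (T - 1) * chi2 / (T * (K - 1) - chi2)         # (65)
    cd = q_alpha * np.sqrt(K * (K + 1) / (6.0 * T))     # (66)
    return R, chi2, f_f, cd
```

With T = 56, K = 4 and \(q_{\alpha =0.05}=2.5690\), this reproduces the critical difference of about 0.63 quoted below.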

After simple calculations, the average ranks of TWSVM, Pin-GTSVM, UTSVM and the proposed Pin-UTSVM for the linear case are 2.636, 4, 2.38 and 1, respectively. At the \(5\%\) level of significance, with \(T=56\) and \(K=4\), we obtain \(\chi ^2_F =153.9317\) and \(F_F=601.7958\). From the statistical table, \(F_F(3,165)=2.65\). Since \(601.7958>2.65\), the null hypothesis is rejected and substantial differences exist among the classification models. To locate the significant differences between the models, we use the Nemenyi test. Given \(q_{\alpha =0.05}=2.5690\), the \(CD=0.63\). Table 10 gives the significant differences among the models. Here, a blank entry denotes that the methods in the corresponding row and column perform comparably and no significant difference exists between them. A non-blank entry gives the average rank difference between the models and denotes that a significant difference exists between the row and column methods, with the row method performing better than the corresponding column method. One can see that the proposed Pin-UTSVM model is significantly better than the existing TWSVM, Pin-GTSVM and UTSVM models. Figure 2 gives a pictorial representation of the significant differences: models connected by a line perform comparably, with no significant difference between them. Thus, the proposed Pin-UTSVM model is significantly better than the existing models.

Table 10 Significant difference of models with linear kernel based on Nemenyi post-hoc test
Fig. 2: Nemenyi critical difference for linear case

The average ranks of TWSVM, Pin-GTSVM, UTSVM and the proposed Pin-UTSVM with the Gaussian kernel are 2.97, 3.67, 2.07 and 1.29, respectively. With a simple calculation at the \(5 \%\) level of significance, we get \(\chi ^2_F =108.8237\) and \(F_F=101.1436\), with \(T=56, K=4\) and \(F_F(3,165)=2.65\). As \(101.1436>2.65\), we reject the null hypothesis, i.e. substantial differences exist among the classification models. To locate the differences between the models, we use the pairwise Nemenyi test. With \(q_{\alpha =0.05}=2.5690\), we have \(CD=0.6268\). Thus, two models are significantly different if their average ranks differ by at least 0.6268. Table 11 gives the pairwise comparison of the models, with the entries in each cell having the same meaning as above. One can see that the proposed Pin-UTSVM model performs significantly better than the baseline models. Figure 3 gives a pictorial comparison of the models.

Table 11 Significant difference of models with Gaussian kernel based on Nemenyi post-hoc test
Fig. 3: Nemenyi critical difference for non-linear case

4.3.3 Statistical analysis based on pair wise: sign test

In the pairwise sign test, the number of datasets on which a model wins or loses is counted along with the number of ties (if any), and a sign test is then applied to the pairwise comparison. Under the null hypothesis, the two methods perform identically; thus each model wins on approximately T/2 of the T datasets. A model is significantly better if it wins on at least \(T/2+1.96\sqrt{T}/2\) datasets (\(p<0.05\)). If there are ties between the algorithms, the ties are split evenly between them; if the number of ties is odd, one tie is ignored. The win-tie-loss counts for each pair of models for the linear and non-linear cases are given in Tables 12 and 14, respectively. Each entry is given as [a, b, c], denoting that the row model wins a times, ties b times and loses c times against the column method. For \(T=56\), a significant difference exists between two models if the number of wins is at least 35.3336. Table 13 shows that the proposed Pin-UTSVM is significantly better than the baseline models, and Table 15 gives the significant differences between the models with the Gaussian kernel. The method pairs not reported in Tables 13 and 15 do not show any significant difference. One can see that the proposed Pin-UTSVM model is significantly better than the baseline models, and that UTSVM is significantly better than the TWSVM and Pin-GTSVM models.
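The win threshold quoted above follows directly from the formula; a tiny check for \(T=56\):

```python
# Sign-test win threshold for T = 56 datasets at p < 0.05.
import math

T = 56
threshold = T / 2 + 1.96 * math.sqrt(T) / 2
print(round(threshold, 4))  # 35.3336
```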

Table 12 Sign test with linear case
Table 13 Significant difference among the models with linear case via pairwise sign test
Table 14 Sign test with Gaussian kernel
Table 15 Significant difference among the models with Gaussian kernel via pairwise sign test
Table 16 Evaluation of the models on UCI datasets using Gaussian Kernel

4.4 Parameter sensitivity analysis

Figure 4 presents the effect of the hyperparameters on the accuracy of our proposed algorithm. We present four graphs for each dataset: accuracy versus (i) \(c_1\) vs \(c_2\), (ii) \(\tau _1\) vs \(\tau _3\), (iii) \(c_1\) vs \(\epsilon \) and (iv) \(c_1\) vs \(\tau _1\). While plotting a graph against two parameters, the remaining parameters are kept at their optimal values. We present only the four most relevant graphs for appropriately selecting the hyperparameters. From the (i) \(c_1\) vs \(c_2\) graphs, one can conclude that \(c_1\) and \(c_2\) have opposite effects on the accuracy; in general, large \(c_1\) and small \(c_2\) lead to better accuracy. From the (ii) \(\tau _1\) vs \(\tau _3\) graphs, we observe that \(\tau _1\) and \(\tau _3\) have a relationship similar to \(c_1\) and \(c_2\), respectively, i.e., large \(\tau _1\) and small \(\tau _3\) lead to better accuracy. From the (iii) \(c_1\) vs \(\epsilon \) graphs, we note that \(\epsilon \) has minimal effect on accuracy, although lower values generally lead to better accuracy; we also note again that lower values of \(c_1\) lead to poor accuracies. From the (iv) \(c_1\) vs \(\tau _1\) graphs, we note that at higher values of \(c_1\), \(\tau _1\) has minimal effect on the accuracy, but in general a minor improvement is observed for larger values of \(\tau _1\).

Fig. 4: Effect of various hyperparameters on the performance of proposed Pin-UTSVM model

4.5 Noise sensitivity analysis

The primary purpose of implementing the pinball loss function in our proposed method Pin-UTSVM is to enable noise insensitivity. To test this property, we introduced Gaussian noise to EEG data with varying standard deviations. The results of this experiment are tabulated in Tables 2 and 4. To properly analyse the noise insensitivity performance, we present Fig. 5.

In Fig. 5, we generated a two-dimensional synthetic dataset with two classes, each from a Gaussian distribution such that \(x_i, i \in \{i : y_i = 1\} \sim {\mathcal {N}}(\mu _1,\Sigma _1)\) (blue in Fig. 5) and \(x_i, i \in \{i : y_i = -1\} \sim {\mathcal {N}}(\mu _2,\Sigma _2)\) (red in Fig. 5). For the universum data required by UTSVM and Pin-UTSVM, we used points drawn from \({\mathcal {N}}(\mu _3,\Sigma _3)\), where \(\mu _1 = [0.5,-3]^T, \mu _2 = [-0.5, 3]^T, \mu _3 = [-0.5, -3]^T \) and \(\Sigma _1 = \Sigma _2 = \Sigma _3 = \begin{bmatrix} 0.2 &{} 0 \\ 0 &{} 3 \end{bmatrix}\). The ideal separating hyperplane for these two Gaussian distributions is given by the Bayes classifier, viz. \(f_c(x) = 2.5x_1 - x_2\), i.e. a slope of 2.5. We have plotted the hyperplanes for the different methods and provided the slopes in the legend.
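A minimal sketch of generating this synthetic dataset is given below; the sample sizes (100 points per class, 50 universum points) and the random seed are illustrative assumptions not stated above.

```python
# Synthetic two-class data of Fig. 5 with universum points and added Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)                       # seed is an assumption
cov = np.array([[0.2, 0.0], [0.0, 3.0]])             # Sigma_1 = Sigma_2 = Sigma_3
X_pos = rng.multivariate_normal([0.5, -3.0], cov, size=100)   # class +1 (blue)
X_neg = rng.multivariate_normal([-0.5, 3.0], cov, size=100)   # class -1 (red)
X_uni = rng.multivariate_normal([-0.5, -3.0], cov, size=50)   # universum points

# Bayes boundary: f_c(x) = 2.5*x1 - x2 = 0. Noisy copies at the paper's sigma levels.
noisy = {s: (X_pos + rng.normal(0.0, s, X_pos.shape),
             X_neg + rng.normal(0.0, s, X_neg.shape))
         for s in (0.0, 0.05, 0.075, 0.1)}
```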

The \(\sigma \text {(Std. deviation)} = 0\) plot represents data without any noise. Next, we introduce Gaussian noise to these data points, with \(\sigma = [0.05, 0.075, 0.1]\). This noise changes the distribution of the data points, thus disrupting the hyperplanes. One can observe that the hyperplanes deviate more as the noise increases. This deviation is significantly less severe for Pin-UTSVM than for the other methods, implying better insensitivity to noise.

Fig. 5: The above four figures depict the noise insensitivity property of the pinball loss via four noise levels of a synthetic dataset

4.6 Effect of universum data

We also present, in Fig. 6, the accuracy obtained for various numbers of universum points for the datasets (a) O &S using db1, (b) O &S using db6, (c) Z &S using Haar and (d) Z &S using db6. From Fig. 6, one can observe that the proposed method performs better than UTSVM in most cases for both the linear and non-linear variants. With linear UTSVM being a minor exception, the general trend is that increasing the number of universum points leads to better accuracy. The accuracy plateaus at around \(40-60\) universum data points in nearly all cases and techniques, suggesting that the number of universum points should be at least around 50% of the training set size.

Fig. 6: Performance comparison of linear and non-linear UTSVM and Pin-UTSVM w.r.t. number of universum samples used for a O &S using wavelet (db1), b O &S using wavelet (db6), c Z &S using Haar and d Z &S using wavelet (db6) feature extraction technique

4.7 Performance of the models on different feature extraction

Figure 7 presents the accuracies obtained by the different algorithms for various feature extraction techniques. One can observe that the proposed algorithm performs the best for nearly all the feature extraction techniques for both Z &S and O &S. One can also observe that ICA produces unreliable results; nearly all classification algorithms suffer with it (as seen by the dip in the accuracy graphs), although ICA also produces the best accuracy in case (c).

Fig. 7: Accuracy comparison against various feature extraction techniques for classification of EEG signals using different algorithms for a Z &S using linear kernel, b O &S dataset using linear kernel, c Z &S using non-linear kernel and d O &S dataset using non-linear kernel

5 Conclusion

In this paper, we proposed a novel universum twin SVM with pinball loss function (Pin-UTSVM) for EEG signal classification. The pinball loss function has been widely used in the classification and regression literature due to its association with quantile regression. Here, we incorporated the pinball loss function into the baseline universum TWSVM model. The proposed Pin-UTSVM model is stable under resampling of the data and is insensitive to noise. To show the efficacy of our proposed Pin-UTSVM model, we evaluated it on the classification of EEG signals. Experimental results and the statistical tests demonstrate the competence of the proposed Pin-UTSVM compared to the standard models. To prove the robustness of the proposed Pin-UTSVM model in the presence of noise, we corrupted the EEG signals with different levels of noise and evaluated the classification models. The results show that the proposed Pin-UTSVM model remains effective in the presence of noise, whereas the existing baseline models suffer. In the future, one can apply this algorithm to the diagnosis of other diseases such as Alzheimer's disease. Moreover, one can also focus on improving the performance of the model via efficient optimization algorithms. The source code will be available at https://github.com/mtanveer1.