1 Introduction

In recent years, many machine learning and data mining techniques have been introduced to solve classification and regression problems. If a dataset has an equal number of samples of each class, it is called a balanced dataset; otherwise, it is an imbalanced dataset. Solving the classification problem under class imbalance is not easy. The support vector machine (SVM) is one of the most popular machine learning approaches and is based on the structural risk minimization (SRM) principle [1,2,3]. It solves a quadratic programming problem and always provides a globally optimal, relatively robust and sparse solution, whereas techniques such as the artificial neural network (ANN) are based on the empirical risk minimization (ERM) principle and suffer from the local minima problem. SVM has been used in applications such as face recognition [4,5,6], pattern recognition [7, 8], speaker verification [9], intrusion detection [10] and various other classification problems [11,12,13,14].

SVM finds the resultant classifier by maximizing the margin between the support vectors and the decision boundary, thereby improving the generalization ability. One can notice that SVM provides better generalization performance, but the training cost of SVM is very high, i.e. \(O(m^{3} )\) where \(m\) is the total number of training samples [15]. Recently, an efficient approach, the twin support vector machine (TWSVM), was proposed by Jayadeva et al. [15] to decrease the training cost of SVM. In TWSVM, two quadratic programming problems (QPPs) of smaller size are solved to find the solution rather than a single large problem as in SVM.

SVM is a supervised machine learning algorithm which constructs a model depending on the available number of samples of each class. When the dataset is imbalanced, samples belonging to the minority class get misclassified since they cannot contribute much to the training phase of the method. Thus, the classifier becomes biased towards the majority class. Here, the class of interest is the minority class; therefore, giving more weight to the data points of the minority class resolves this problem to some extent. In applications such as fault detection and disease detection, more emphasis is on correctly identifying the faults in machinery and the abnormalities in patients' data, which are present in very few samples.

To address this problem, Lin et al. [16] proposed a support vector machine based on fuzzy membership values (FSVM). Similar to SVM, FSVM also suffers from the problem of class imbalance. Batuwita and Palade [17] presented a new model, FSVM for class imbalance learning (FSVM-CIL), to handle the problem of class imbalance which is less sensitive to outliers and noise. Here, smaller fuzzy membership values are assigned to support vectors, based on class centres, to reduce their effect on the resultant decision surface. In a similar manner, an efficient fuzzy support vector machine for non-equilibrium data was proposed [18] to reduce the misclassification of minority class samples in FSVM. A bilateral-weighted FSVM (B-FSVM) was proposed [19] where the membership of each sample is calculated by treating the sample as belonging to both the minority and the majority class with different membership values. To solve the bankruptcy prediction problem, a new fuzzy SVM was proposed by Chaudhuri and De [20]. In order to reduce the complexity of TWSVM for large-scale data, Shao et al. [21] proposed a weighted linear loss twin support vector machine for imbalanced problems (WLTSVM) where linear equations are solved and lesser weights are given to the points having high loss values. A fuzzy-based Lagrangian twin parametric-margin support vector machine (FLTPMSVM) was proposed by Gupta et al. [22] to deal with noisy data. Tomar et al. [12] assigned weights to the data points on the basis of the number of samples in each class and proposed a weighted least squares twin support vector machine (WLSTSVM); here, all the samples of a class are assigned the same weight. For more efficient classification methods, the reader may refer to [23, 24].

Recently, Fan et al. [25] proposed an entropy-based fuzzy support vector machine (EFSVM) for the class imbalance problem in which the fuzzy membership is computed based on the class certainty of samples. Motivated by the work of Fan et al. [25] and Jayadeva et al. [15], we propose a new approach termed entropy-based fuzzy twin support vector machine for class imbalance learning (EFTWSVM-CIL). One can notice that EFTWSVM-CIL solves a pair of smaller-sized QPPs to find the resultant decision surface rather than solving a single large one as in the case of SVM. Hence, EFTWSVM-CIL improves the generalization of the decision surface for minority class samples based on class certainty and also takes less training time.

In this paper, all vectors are considered as column vectors. Suppose \(x\) and \(z\) are vectors in the \(n\)-dimensional real space \(R^{n}\); then the inner product of the two vectors is denoted as \(x^{t} z\), where \(x^{t}\) is the transpose of \(x\). \(||x||\) and \(||Q||\) denote the 2-norm of a vector \(x\) and a matrix \(Q\), respectively. The identity matrix of appropriate size and the vector of ones of appropriate dimension are denoted by \(I\) and \(e\), respectively.

The paper is organized as follows: Sect. 2 reviews the related work, discussing the twin support vector machine (TWSVM), the fuzzy twin support vector machine (FTWSVM) and the entropy-based fuzzy support vector machine (EFSVM). The proposed method is discussed in Sect. 3. In Sect. 4, numerical experiments are performed on well-known real-world datasets for the discussed and proposed variants of SVM. In Sect. 5, we conclude the paper with future work.

2 Related Work

In this section, we briefly describe the formulations of the twin support vector machine (TWSVM), the fuzzy twin support vector machine (FTWSVM) and the entropy-based fuzzy support vector machine (EFSVM).

2.1 Twin Support Vector Machine (TWSVM)

Mangasarian and Wild [26] extended the idea of proximal SVM (PSVM) [27] to a new approach termed multisurface proximal SVM via generalized eigenvalues (GEPSVM) for binary classification. In order to improve the learning efficiency, Jayadeva et al. [15] suggested a novel approach, the twin support vector machine (TWSVM), in the light of GEPSVM. In TWSVM, two non-parallel hyperplanes are obtained instead of one hyperplane such that each of them is nearer to one of the classes and as far as possible from the other class. Here, two optimization problems of smaller size are solved in the form of QPPs instead of solving one large QPP as in the case of standard SVM. The running time of TWSVM is approximately \(2 \times \left( {\frac{m}{2}} \right)^{3} = \frac{{m^{3} }}{4}\), a reduction by a factor of four compared to standard SVM.

Let us consider the input matrices \(X_{1}\) and \(X_{2}\) of size \(p \times n\) and \(q \times n\), where \(p\) is the total number of data points belonging to ‘Class 1’, \(q\) is the total number of data points belonging to ‘Class 2’ such that the total number of data samples is \(m = p + q\), and \(n\) is the dimension of each data point. In the nonlinear case, the twin support vector machine finds a pair of non-parallel hyperplanes \(f_{1} (x) = K(x^{t} ,D^{t} )w_{1} + b_{1} = 0\) and \(f_{2} (x) = K(x^{t} ,D^{t} )w_{2} + b_{2} = 0\) from the solution of the following QPPs:

$$\hbox{min} \frac{1}{2}||K(X_{1} ,D^{t} )w_{1} + e_{1} b_{1} ||^{2} + C_{1} e_{2}^{t} \xi$$

subject to

$$- (K(X_{2} ,D^{t} )w_{1} + e_{2} b_{1} ) + \xi \ge e_{2} ,\;\xi \ge 0$$
(1)

and

$$\hbox{min} \frac{1}{2}||K(X_{2} ,D^{t} )w_{2} + e_{2} b_{2} ||^{2} + C_{2} e_{1}^{t} \eta$$

subject to

$$(K(X_{1} ,D^{t} )w_{2} + e_{1} b_{2} ) + \eta \ge e_{1} ,\;\eta \ge 0$$
(2)

where \(\xi\), \(\eta\) represent slack variables; \(C_{1}\), \(C_{2}\) are penalty parameters; \(D = [X_{1} ;X_{2} ]\); \(e_{1}\), \(e_{2}\) are vectors of ones of suitable dimension; and \(K(x^{t} ,D^{t} ) = (k(x,x_{1} ), \ldots ,k(x,x_{m} ))\) is a row vector in \(R^{m}\).

The Lagrangians of problems (1) and (2) are written as

$$L_{1} = \frac{1}{2}||K(X_{1} ,D^{t} )w_{1} + e_{1} b_{1} ||^{2} + C_{1} e_{2}^{t} \xi + \alpha_{1}^{t} (( K(X_{2} ,D^{t} )w_{1} + e_{2} b_{1} ) - \xi + e_{2} ) - \beta_{1}^{t} \xi$$
(3)
$$L_{2} = \frac{1}{2}||K(X_{2} ,D^{t} )w_{2} + e_{2} b_{2} ||^{2} + C_{2} e_{1}^{t} \eta + \alpha_{2}^{t} (( - K(X_{1} ,D^{t} )w_{2} - e_{1} b_{2} ) - \eta + e_{1} ) - \beta_{2}^{t} \eta$$
(4)

where \(\alpha_{1} = (\alpha_{11} , \ldots ,\alpha_{1q} )^{t}\), \(\beta_{1} = (\beta_{11} , \ldots ,\beta_{1q} )^{t}\), \(\alpha_{2} = (\alpha_{21} , \ldots ,\alpha_{2p} )^{t}\) and \(\beta_{2} = (\beta_{21} , \ldots ,\beta_{2p} )^{t}\) are the vectors of Lagrange multipliers. The Wolfe duals of Eqs. (3) and (4) are obtained by applying the Karush–Kuhn–Tucker (KKT) necessary and sufficient conditions [28] as

$$\hbox{max} \,e_{2}^{t} \alpha_{1} - \frac{1}{2}\alpha_{1}^{t} T(S^{t} S)^{ - 1} T^{t} \alpha_{1}$$
(5)

subject to

$$0 \le \alpha_{1} \le C_{1}$$

and

$$\hbox{max} \, e_{1}^{t} \alpha_{2} - \frac{1}{2}\alpha_{2}^{t} S(T^{t} T)^{ - 1} S^{t} \alpha_{2}$$
(6)

subject to

$$0 \le \alpha_{2} \le C_{2}$$

where \(S = [K(X_{1} ,D^{t} )\,\,\,e_{1} ]\) and \(T = [K(X_{2} ,D^{t} )\,\,\,e_{2} ]\).

We compute the nonlinear hyperplanes \(K(x^{t} ,D^{t} )w_{1} + b_{1} = 0\) and \(K(x^{t} ,D^{t} )w_{2} + b_{2} = 0\) by computing the values of \(w_{1}\), \(w_{2}\), \(b_{1}\) and \(b_{2}\) using Eqs. (7) and (8):

$$\left[ \begin{aligned} w_{1} \hfill \\ b_{1} \hfill \\ \end{aligned} \right] = - (S^{t} S + \delta I)^{ - 1} T^{t} \alpha_{1}$$
(7)
$$\left[ \begin{aligned} w_{2} \hfill \\ b_{2} \hfill \\ \end{aligned} \right] = (T^{t} T + \delta I)^{ - 1} S^{t} \alpha_{2}$$
(8)

Each new data point \(x \in R^{n}\) is assigned to class \(i\) depending on which of the two hyperplanes is closer to it:

$${\text{class}}\;i = \arg \hbox{min}_{i = 1,2} |K(x^{t} ,D^{t} )w_{i} + b_{i} |.$$
(9)
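To make Eqs. (5)–(8) concrete, the following is a minimal Python sketch (not the authors' implementation) of solving the dual (5) with a generic box-constrained solver and recovering \([w_{1} ;b_{1} ]\) via Eq. (7). The Gaussian kernel form and the value of the regularization parameter \(\delta\) are assumptions borrowed from Sect. 4.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, sigma):
    # K[i, j] = exp(-sigma * ||A_i - B_j||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sigma * d2)

def twsvm_plane1(X1, X2, C1, sigma, delta=1e-4):
    """Solve dual (5) for alpha_1 and recover [w1; b1] via Eq. (7)."""
    D = np.vstack([X1, X2])
    S = np.hstack([gaussian_kernel(X1, D, sigma), np.ones((len(X1), 1))])
    T = np.hstack([gaussian_kernel(X2, D, sigma), np.ones((len(X2), 1))])
    # regularized inverse of S^t S, as used in Eq. (7)
    G = np.linalg.inv(S.T @ S + delta * np.eye(S.shape[1]))
    H = T @ G @ T.T
    e2 = np.ones(len(X2))
    # maximize e2^t a - (1/2) a^t H a  <=>  minimize (1/2) a^t H a - e2^t a
    fun = lambda a: 0.5 * a @ H @ a - e2 @ a
    jac = lambda a: H @ a - e2
    res = minimize(fun, x0=np.zeros(len(X2)), jac=jac,
                   method='L-BFGS-B', bounds=[(0.0, C1)] * len(X2))
    u1 = -G @ T.T @ res.x            # Eq. (7): [w1; b1]
    return u1[:-1], u1[-1]
```

The second hyperplane follows symmetrically from Eqs. (6) and (8).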

2.2 Fuzzy Twin Support Vector Machine (FTWSVM)

In the case of FTWSVM, a weighting parameter based on fuzzy membership values is used. For comparison, we choose the fuzzy membership of each data point based on its distance from the class centroid [17]. The membership values are used to weight the error tolerance, i.e. the penalty \(C\), for every data point in FTWSVM.

The fuzzy membership function is given as

$${\text{mem}} = 1 - \frac{{d_{\text{cen}} }}{{\hbox{max} (d_{\text{cen}} ) + \delta }}$$

where \(d_{\text{cen}}\) is the Euclidean distance of each data point from the centroid of its class and \(\delta\) is a small positive number that keeps the denominator non-zero (a small sketch of this computation is given at the end of this subsection). The formulation of FTWSVM in primal is written as

$$\hbox{min} \frac{1}{2}||K(X_{1} ,D^{t} )w_{1} + e_{1} b_{1} ||^{2} + C_{1} s_{2}^{t} \xi$$

subject to

$$- (K(X_{2} ,D^{t} )w_{1} + e_{2} b_{1} ) + \xi \ge e_{2} ,\;\xi \ge 0$$
(10)

and

$$\hbox{min} \frac{1}{2}||K(X_{2} ,D^{t} )w_{2} + e_{2} b_{2} ||^{2} + C_{2} s_{1}^{t} \eta$$

subject to

$$(K(X_{1} ,D^{t} )w_{2} + e_{1} b_{2} ) + \eta \ge e_{1} ,\;\eta \ge 0$$
(11)

where \(\xi\), \(\eta\) represent slack variables; \(C_{1}\), \(C_{2}\) are penalty parameters; \(K(\cdot , \cdot )\) is the kernel function; and \(s_{1}\), \(s_{2}\) are vectors holding the membership values of the data samples appearing in the respective constraints.

The Lagrangians of problems (10) and (11) are written as

$$L_{1} = \frac{1}{2}||K(X_{1} ,D^{t} )w_{1} + e_{1} b_{1} ||^{2} + C_{1} s_{2}^{t} \xi + \alpha_{1}^{t} (( K(X_{2} ,D^{t} )w_{1} + e_{2} b_{1} ) - \xi + e_{2} ) - \beta_{1}^{t} \xi$$
(12)
$$L_{2} = \frac{1}{2}||K(X_{2} ,D^{t} )w_{2} + e_{2} b_{2} ||^{2} + C_{2} s_{1}^{t} \eta + \alpha_{2}^{t} (( - K(X_{1} ,D^{t} )w_{2} - e_{1} b_{2} ) - \eta + e_{1} ) - \beta_{2}^{t} \eta$$
(13)

where \(\alpha_{1} = (\alpha_{11} , \ldots ,\alpha_{1q} )^{t}\), \(\beta_{1} = (\beta_{11} , \ldots ,\beta_{1q} )^{t}\), \(\alpha_{2} = (\alpha_{21} , \ldots ,\alpha_{2p} )^{t}\) and \(\beta_{2} = (\beta_{21} , \ldots ,\beta_{2p} )^{t}\) are the vectors of Lagrange multipliers. Applying the Karush–Kuhn–Tucker (KKT) necessary and sufficient conditions [28], the Wolfe duals of Eqs. (12) and (13) are obtained as

$$\hbox{min} \frac{1}{2}\alpha_{1}^{t} T(S^{t} S)^{ - 1} T^{t} \alpha_{1} - e_{2}^{t} \alpha_{1}$$

subject to

$$0 \le \alpha_{1} \le s_{2} C_{1}$$
(14)

and

$$\hbox{min} \frac{1}{2}\alpha_{2}^{t} S(T^{t} T)^{ - 1} S^{t} \alpha_{2} - e_{1}^{t} \alpha_{2}$$

subject to

$$0 \le \alpha_{2} \le s_{1} C_{2}$$
(15)
(15)

where \(S = [K(X_{1} ,D^{t} )\,\,e_{1} ]\) and \(T = [K(X_{2} ,D^{t} )\,\,e_{2} ]\).

We compute the nonlinear hyperplanes \(K(x^{t} ,D^{t} )w_{1} + b_{1} = 0\) and \(K(x^{t} ,D^{t} )w_{2} + b_{2} = 0\) by computing the values of \(w_{1}\), \(w_{2}\), \(b_{1}\) and \(b_{2}\) using Eq. (16):

$$\left[ \begin{aligned} w_{1} \hfill \\ b_{1} \hfill \\ \end{aligned} \right] = - (S^{t} S + \delta I)^{ - 1} T^{t} \alpha_{1} \quad {\text{and}}\quad \left[ \begin{aligned} w_{2} \hfill \\ b_{2} \hfill \\ \end{aligned} \right] = (T^{t} T + \delta I)^{ - 1} S^{t} \alpha_{2}$$
(16)

Similarly, the resultant classifier is obtained by using Eq. (9).
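As an illustration, the following is a minimal Python sketch of the centroid-based membership formula given earlier in this subsection; the value of \(\delta\) used here is only illustrative.

```python
import numpy as np

def centroid_membership(X, delta=1e-6):
    """Fuzzy memberships for one class: mem = 1 - d_cen / (max(d_cen) + delta)."""
    # Euclidean distance of every sample from its class centroid
    d_cen = np.linalg.norm(X - X.mean(axis=0), axis=1)
    return 1.0 - d_cen / (d_cen.max() + delta)
```

Applied separately to each class, this yields the membership vectors \(s_{1}\) and \(s_{2}\) used in problems (10) and (11).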

3 Proposed Entropy-Based Fuzzy Twin Support Vector Machine for Class Imbalance Learning (EFTWSVM-CIL)

Recently, Fan et al. [25] proposed a novel fuzzy membership evaluation to improve the effectiveness and generalization ability of the fuzzy support vector machine, where the memberships of the samples are computed based on class certainty. In information theory, entropy is a measure of the information carried by a sample. Chen et al. [29] used information entropy to find the uncertainty measure of a neighbourhood system. In the class imbalance problem, most of the noisy data points of the majority class lie at the boundary of the two classes. So, for the majority class, the information of every data point is calculated based on its probability of belonging to either of the classes; this information is higher for the noisy samples than for the rest of the samples in that class. The probability of a sample belonging to a particular class reflects class certainty, and entropy is an effective measure of this certainty. Hence, one can assign the fuzzy membership to the data points by using the information entropy as the weighting parameter, so that the noisy samples of the majority class get smaller weights than the other samples of the class. The traditional approach of assigning weights [16] neither takes into account the noise at the boundary of the two classes nor incorporates information about the probability distribution. Moreover, most of the weighting strategies used for class imbalance problems rely on measures like the distance from the centroid, which give no information about the data points at the boundary of the two classes. In the proposed approach, to enhance the participation of the minority class in the decision classifier, the samples of the majority class with lower entropy get larger fuzzy membership values. The entropy of any sample \(x_{i}\) is calculated as:

$$E_{i} = - p_{{{\text{pos}}\_x_{i} }} \ln (p_{{{\text{pos}}\_x_{i} }} ) - p_{{{\text{neg}}\_x_{i} }} \ln (p_{{{\text{neg}}\_x_{i} }} )$$

where \(p_{{{\text{pos}}\_x_{i} }}\) and \(p_{{{\text{neg}}\_x_{i} }}\) are the probabilities of sample \(x_{i}\) belonging to the minority class and the majority class, respectively. To obtain them, we compute the \(K\) nearest neighbours of sample \(x_{i}\) and set \(p_{{{\text{pos}}\_x_{i} }}\) and \(p_{{{\text{neg}}\_x_{i} }}\) to the fractions of minority and majority class samples among these neighbours.

Further, the data points of the majority class are divided into \(n\) subsets based on increasing order of entropy. The fuzzy membership of the samples in each subset is calculated as

$$F_{\text{q}} = 1.0 - \beta *(q - 1),\;q = 1,2, \ldots ,n$$

where \(F_{q}\) is the fuzzy membership for the samples in the \(q\)th subset and the fuzzy membership parameter \(\beta \in \left( {0,\frac{1}{n - 1}} \right]\) controls the scale of the fuzzy values of the samples. The fuzzy membership function is written as

$$s_{i} = \left\{ {\begin{array}{*{20}l} {1 - \beta *(q - 1),} \hfill & {{\text{if}}\quad y_{i} = - 1 \,\& \,x_{i} \in q{\text{th}}\;{\text{subset}}} \hfill \\ {1,} \hfill & {{\text{if}}\quad y_{i} = 1} \hfill \\ \end{array} } \right.$$
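The membership assignment described above can be summarized by the following minimal Python sketch (a reading of the procedure, not the authors' code). The defaults for \(K\) and \(\beta\) mirror the experimental settings of Sect. 4, while \(n = 10\) subsets is an assumption for illustration.

```python
import numpy as np

def entropy_memberships(X, y, K=10, n_subsets=10, beta=0.05):
    """Entropy-based fuzzy memberships: minority samples (y == 1) get full
    membership 1; majority samples (y == -1) are weighted by the entropy of
    their K-nearest-neighbour class distribution."""
    s = np.ones(len(y), dtype=float)
    maj = np.where(y == -1)[0]
    ent = np.empty(len(maj))
    for t, i in enumerate(maj):
        # K nearest neighbours of x_i (excluding x_i itself)
        d = np.linalg.norm(X - X[i], axis=1)
        nbr = np.argsort(d)[1:K + 1]
        p_pos = np.mean(y[nbr] == 1)        # fraction of minority neighbours
        p_neg = 1.0 - p_pos
        ent[t] = -sum(p * np.log(p) for p in (p_pos, p_neg) if p > 0)
    # split the majority samples into n subsets in increasing order of entropy
    order = np.argsort(ent)
    for q, chunk in enumerate(np.array_split(order, n_subsets)):
        s[maj[chunk]] = 1.0 - beta * q      # F_q = 1 - beta*(q - 1), q = 1..n
    return s
```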

Fan et al. [25] considered this approach to find the fuzzy membership of the samples and proposed a new approach termed entropy-based fuzzy support vector machine for imbalanced datasets. Motivated by the work of Fan et al. [25] and Jayadeva et al. [15], in this paper, we propose a new fuzzy twin support vector machine based on information entropy for class imbalance learning, where the information entropy is used to define the fuzzy membership. The data points with the highest entropy are those lying on the boundary between the classes. So, the data points of the majority class get their membership values based on their entropy, and all the minority class samples get a full membership value equal to 1. EFTWSVM-CIL finds two non-parallel hyperplanes such that each one is closer to one of the two classes and as far as possible from the other, whereas EFSVM finds a separating hyperplane that maximizes the margin between the two classes. Due to this approach, the proposed EFTWSVM-CIL gives better generalization performance in comparison with EFSVM. Further, one can notice that we solve a pair of smaller-sized QPPs to find the decision surface of our proposed EFTWSVM-CIL, instead of solving a single large QPP as in the case of EFSVM. This makes our proposed EFTWSVM-CIL faster than EFSVM in terms of training time; thus, it is well suited for training on large imbalanced data. Now, we discuss the linear and nonlinear formulations of our EFTWSVM-CIL.

3.1 Linear EFTWSVM-CIL

In the linear case, EFTWSVM-CIL finds the resultant classifier by solving the following pair of QPPs:

$$\hbox{min} \frac{1}{2}||X_{1} w_{1} + e_{1} b_{1} ||^{2} + C_{1} s_{2}^{t} \xi$$

subject to

$$- (X_{2} w_{1} + e_{2} b_{1} ) + \xi \ge e_{2} ,\;\xi \ge 0$$
(17)

and

$$\hbox{min} \frac{1}{2}||X_{2} w_{2} + e_{2} b_{2} ||^{2} + C_{2} s_{1}^{t} \eta$$

subject to

$$(X_{1} w_{2} + e_{1} b_{2} ) + \eta \ge e_{1} ,\;\eta \ge 0$$
(18)

where \(\xi\), \(\eta\) represent slack variables, \(C_{1} ,C_{2} > 0\) are penalty parameters, and \(s_{1}\), \(s_{2}\) are vectors containing the entropy-based fuzzy membership values of the minority and majority classes, respectively. The Lagrangians of the primal problems (17) and (18) are written as

$$L_{1} = \frac{1}{2}||X_{1} w_{1} + e_{1} b_{1} ||^{2} + C_{1} s_{2}^{t} \xi + \alpha_{1}^{t} (( X_{2} w_{1} + e_{2} b_{1} ) - \xi + e_{2} ) - \beta_{1}^{t} \xi$$
(19)
$$L_{2} = \frac{1}{2}||X_{2} w_{2} + e_{2} b_{2} ||^{2} + C_{2} s_{1}^{t} \eta + \alpha_{2}^{t} (( - X_{1} w_{2} - e_{1} b_{2} ) - \eta + e_{1} ) - \beta_{2}^{t} \eta$$
(20)

where \(\alpha_{1} = (\alpha_{11} , \ldots ,\alpha_{1q} )^{t}\), \(\beta_{1} = (\beta_{11} , \ldots ,\beta_{1q} )^{t}\), \(\alpha_{2} = (\alpha_{21} , \ldots ,\alpha_{2p} )^{t}\) and \(\beta_{2} = (\beta_{21} , \ldots ,\beta_{2p} )^{t}\) are the vectors of Lagrange multipliers. Applying the KKT conditions to (19), we get

$$\frac{\partial L}{{\partial w_{1} }} = 0 \Rightarrow X_{1}^{t} \left( {X_{1} w_{1} + e_{1} b_{1} } \right) + X_{2}^{t} \alpha_{1} = 0$$
(21)
$$\begin{array}{*{20}l} {\frac{\partial L}{{\partial b_{1} }} = 0 \Rightarrow e_{1}^{t} \left( {X_{1} w_{1} + e_{1} b_{1} } \right) + e_{2}^{t} \alpha_{1} = 0} \hfill \\ {\frac{\partial L}{\partial \xi } = 0 \Rightarrow C_{1} s_{2} - \beta_{1} - \alpha_{1} = 0} \hfill \\ { - (X_{2} w_{1} + e_{2} b_{1} ) + \xi \ge e_{2} ,\;\xi \ge 0} \hfill \\ {\alpha_{1}^{t} ( - (X_{2} w_{1} + e_{2} b_{1} ) + \xi - e_{2} ) = 0} \hfill \\ {\beta_{1}^{t} \xi = 0,\;\alpha_{1} \ge 0,\;\beta_{1} \ge 0} \hfill \\ \end{array}$$
(22)

Combining (21) and (22), we get

$$\left[ \begin{aligned} X_{1}^{t} \hfill \\ e_{1}^{t} \hfill \\ \end{aligned} \right] \left[ {\begin{array}{*{20}l} {X_{1} } \hfill & {e_{1} } \hfill \\ \end{array} } \right]\left[ \begin{aligned} w_{1} \hfill \\ b_{1} \hfill \\ \end{aligned} \right] + \left[ \begin{aligned} X_{2}^{t} \hfill \\ e_{2}^{t} \hfill \\ \end{aligned} \right] \alpha_{1} = 0$$
(23)

One can rewrite (23) as

$$u_{1} = - (A^{t} A)^{ - 1} B^{t} \alpha_{1}$$

where \(A = \left[ {\begin{array}{*{20}l} {X_{1} } \hfill & {e_{1} } \hfill \\ \end{array} } \right]\), \(B = \left[ {\begin{array}{*{20}l} {X_{2} } \hfill & {e_{2} } \hfill \\ \end{array} } \right]\) and the augmented vector \(u_{1} = \left[ \begin{aligned} w_{1} \hfill \\ b_{1} \hfill \\ \end{aligned} \right]\).

Here, we introduce the regularization term \(\delta I\), where \(\delta > 0\) and \(I\) is the identity matrix of appropriate size, to handle the possible ill-conditioning of \(A^{t} A\) while computing the inverse. Thus, we get

$$u_{1} = - (A^{t} A + \delta I)^{ - 1} B^{t} \alpha_{1}$$
(24)

Using the above KKT conditions and (19), the dual of the optimization problem (17) can be written in the form of the following QPP:

$$\hbox{min} \frac{1}{2}\alpha_{1}^{t} B \left( {A^{t} A} \right)^{ - 1} B^{t} \alpha_{1} - e_{2}^{t} \alpha_{1}$$

subject to

$$0 \le \alpha_{1} \le s_{2} C_{1}$$
(25)

In a similar manner, one can find the dual of (18) as

$$\hbox{min} \frac{1}{2}\alpha_{2}^{t} A \left( {B^{t} B} \right)^{ - 1} A^{t} \alpha_{2} - e_{1}^{t} \alpha_{2}$$

subject to

$$0 \le \alpha_{2} \le s_{1} C_{2}$$
(26)

The values of \(w_{2}\) and \(b_{2}\) are calculated as

$$u_{2} = (B^{t} B + \delta I)^{ - 1} A^{t} \alpha_{2}$$
(27)

where \(u_{2} = \left[ \begin{aligned} w_{2} \hfill \\ b_{2} \hfill \\ \end{aligned} \right]\).

After calculating the values of \(u_{1}\) and \(u_{2}\), we obtain the non-parallel hyperplanes \(f_{1} (x) = w_{1}^{t} x + b_{1}\) and \(f_{2} (x) = w_{2}^{t} x + b_{2}\). Every new data point \(x \in R^{n}\) is assigned to class \(i\) depending on which of the two hyperplanes is closer to it:

$${\text{class}}\,i = \arg \hbox{min}_{i = 1,2} |x^{t} w_{i} + b_{i} |.$$
(28)
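As a concrete illustration, the following Python sketch solves the dual (25) and recovers \(u_{1}\) via Eq. (24). It is a minimal reading of the derivation above, reusing the generic box-constrained solver of Sect. 2.1; the value of \(\delta\) is an assumption. Note that the memberships \(s_{2}\) enter only through the per-sample upper bounds on \(\alpha_{1}\).

```python
import numpy as np
from scipy.optimize import minimize

def eftwsvm_linear_plane1(X1, X2, s2, C1, delta=1e-4):
    """Solve dual (25) and recover u1 = [w1; b1] via Eq. (24).
    s2: entropy-based memberships of the majority-class samples in X2."""
    A = np.hstack([X1, np.ones((len(X1), 1))])
    B = np.hstack([X2, np.ones((len(X2), 1))])
    G = np.linalg.inv(A.T @ A + delta * np.eye(A.shape[1]))
    H = B @ G @ B.T
    e2 = np.ones(len(X2))
    fun = lambda a: 0.5 * a @ H @ a - e2 @ a
    jac = lambda a: H @ a - e2
    # box constraints 0 <= alpha_1 <= s2 * C1 (membership-weighted penalties)
    bounds = [(0.0, si * C1) for si in s2]
    res = minimize(fun, np.zeros(len(X2)), jac=jac,
                   method='L-BFGS-B', bounds=bounds)
    u1 = -G @ B.T @ res.x            # Eq. (24)
    return u1[:-1], u1[-1]           # w1, b1
```

The second hyperplane is obtained symmetrically from (26) and (27), and a new point is classified by Eq. (28).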

3.2 Nonlinear EFTWSVM-CIL

For classifying nonlinearly separable data points, we use a kernel function to map the data points into a higher-dimensional feature space [30]. The nonlinear EFTWSVM-CIL is formulated in the primal form as

$$\hbox{min} \frac{1}{2}||K(X_{1} ,D^{t} )w_{1} + e_{1} b_{1} ||^{2} + C_{1} s_{2}^{t} \xi$$

subject to

$$- (K(X_{2} ,D^{t} )w_{1} + e_{2} b_{1} ) + \xi \ge e_{2} ,\;\xi \ge 0$$
(29)

and

$$\hbox{min} \frac{1}{2}||K(X_{2} ,D^{t} )w_{2} + e_{2} b_{2} ||^{2} + C_{2} s_{1}^{t} \eta$$

subject to

$$(K(X_{1} ,D^{t} )w_{2} + e_{1} b_{2} ) + \eta \ge e_{1} ,\;\eta \ge 0$$
(30)

where \(\xi\), \(\eta\) represent slack variables; \(C_{1}\), \(C_{2}\) are penalty parameters; \(D = [X_{1} ;X_{2} ]\); and \(s_{1}\), \(s_{2}\) are vectors containing the entropy-based fuzzy membership values. The Lagrangians of problems (29) and (30) are written as

$$L_{1} = \frac{1}{2}||K(X_{1} ,D^{t} )w_{1} + e_{1} b_{1} ||^{2} + C_{1} s_{2}^{t} \xi + \alpha_{1}^{t} (( K(X_{2} ,D^{t} )w_{1} + e_{2} b_{1} ) - \xi + e_{2} ) - \beta_{1}^{t} \xi$$
(31)
$$L_{2} = \frac{1}{2}||K(X_{2} ,D^{t} )w_{2} + e_{2} b_{2} ||^{2} + C_{2} s_{1}^{t} \eta + \alpha_{2}^{t} (( - K(X_{1} ,D^{t} )w_{2} - e_{1} b_{2} ) - \eta + e_{1} ) - \beta_{2}^{t} \eta$$
(32)

where \(\alpha_{1} = (\alpha_{11} , \ldots ,\alpha_{1q} )^{t}\), \(\beta_{1} = (\beta_{11} , \ldots ,\beta_{1q} )^{t}\), \(\alpha_{2} = (\alpha_{21} , \ldots ,\alpha_{2p} )^{t}\) and \(\beta_{2} = (\beta_{21} , \ldots ,\beta_{2p} )^{t}\) are the vectors of Lagrange multipliers.

Following the same procedure as in the linear case, we compute the nonlinear hyperplanes \(K(x^{t} ,D^{t} )w_{1} + b_{1} = 0\) and \(K(x^{t} ,D^{t} )w_{2} + b_{2} = 0\) by computing the values of \(w_{1}\), \(w_{2}\), \(b_{1}\) and \(b_{2}\) using Eqs. (33) and (34):

$$u_{1} = \left[ \begin{aligned} w_{1} \hfill \\ b_{1} \hfill \\ \end{aligned} \right] = - (P^{t} P + \delta I)^{ - 1} Q^{t} \alpha_{1}$$
(33)
$$u_{2} = \left[ \begin{aligned} w_{2} \hfill \\ b_{2} \hfill \\ \end{aligned} \right] = (Q^{t} Q + \delta I)^{ - 1} P^{t} \alpha_{2}$$
(34)

where \(P = [K(X_{1} ,D^{t} ) \,\,e_{1} ]\),\(Q = [ K(X_{2} ,D^{t} )\,\, e_{2} ]\).

Each new data point \(x \in R^{n}\) is assigned to class \(i\) depending on which of the two hyperplanes is closer to it:

$${\text{class}}\,i = \arg \hbox{min}_{i = 1,2} |K(x^{t} ,D^{t} )w_{i} + b_{i} |.$$
(35)
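For completeness, a minimal Python sketch of the decision rule (35) is given below; the Gaussian kernel form follows Sect. 4, and the class labels 1/2 are illustrative.

```python
import numpy as np

def predict(x, D, w1, b1, w2, b2, sigma):
    """Assign x to the class of the closer kernel-generated surface, Eq. (35)."""
    # row vector K(x^t, D^t) of kernel values against all training points in D
    k = np.exp(-sigma * np.linalg.norm(D - x, axis=1) ** 2)
    d1 = abs(k @ w1 + b1)
    d2 = abs(k @ w2 + b2)
    return 1 if d1 <= d2 else 2
```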

4 Numerical Experiments

In this section, to check the effectiveness of the proposed EFTWSVM-CIL in comparison with TWSVM, FTWSVM and EFSVM, we performed experiments on several imbalanced datasets from the KEEL imbalanced datasets [31] and the UCI repository [32] for binary classification. All computations were carried out on a PC running 64-bit Windows 7 with a 3.20 GHz Intel Core i5-2400 processor and 2 GB of RAM, under the MATLAB R2008b environment. We used the MOSEK optimization toolbox (http://www.mosek.com) to solve the SVM formulations. For selecting the optimal parameters, we used the fivefold cross-validation technique. To construct the nonlinear classifiers, we used the Gaussian kernel \(k(a,b) = \exp ( - \sigma \left\| {a - b} \right\|^{2} )\), where \(a,b \in R^{n}\).

We have taken the value of the parameter \(C = C_{1} = C_{2}\) from the set \(\{ 2^{ - 5} , \ldots ,2^{5} \}\) in all the cases. For FTWSVM, \(\delta\) is taken as 0.5. For EFTWSVM-CIL and EFSVM, the value of \(K\) for k-NN is chosen from {5, 10} and \(\beta\) is taken as 0.05. The value of \(\sigma\) is calculated in all methods by the following formula [33]:

$$\sigma = \frac{1}{{N^{2} }}\sum\limits_{i,j = 1}^{N} {||x_{i} - x_{j} ||^{2} }$$
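A vectorized Python computation of this formula is sketched below; it uses the standard expansion of the pairwise squared distances rather than the double loop.

```python
import numpy as np

def kernel_parameter(X):
    """sigma = (1/N^2) * sum_{i,j} ||x_i - x_j||^2 over the N training samples."""
    N = len(X)
    # sum_{i,j} ||x_i - x_j||^2 = 2*N*sum_i ||x_i||^2 - 2*||sum_i x_i||^2
    total = 2 * N * (X ** 2).sum() - 2 * (X.sum(axis=0) ** 2).sum()
    return total / N ** 2
```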

All the results for TWSVM, FTWSVM, EFSVM and the proposed EFTWSVM-CIL are shown in terms of prediction accuracy, i.e. the area under the ROC curve (AUC) [34], and training time for both the linear and nonlinear cases in Tables 1 and 3. One can observe from Tables 1 and 3 that EFTWSVM-CIL is superior to TWSVM, FTWSVM and EFSVM in terms of generalization performance. Our proposed EFTWSVM-CIL takes much less training time than EFSVM because EFTWSVM-CIL solves a pair of smaller-sized QPPs instead of a single large one as in the case of EFSVM.

Table 1 Performance comparison of EFTWSVM-CIL with TWSVM, FTWSVM and EFSVM using linear kernel for classification on imbalance datasets

It is observable from Table 1 that our proposed method EFTWSVM-CIL does not perform better on all the datasets for the linear kernel. Further, we analyse the comparative performance of EFTWSVM-CIL with TWSVM, FTWSVM and EFSVM based on the average ranks of all the methods, which are presented in Table 2 for the linear case. One can clearly observe from Table 2 that the average rank of the proposed EFTWSVM-CIL is the lowest among all the methods. We perform the Friedman test with the corresponding post hoc test [35] in the case of the linear kernel for a statistical comparison of the performance of the 4 algorithms on 24 datasets. We assume all the methods are equivalent under the null hypothesis, and the Friedman statistic is computed from Table 2 as

$$\begin{aligned} \chi _{F}^{2} & = \frac{{12 \times 24}}{{4 \times (4 + 1)}}\left[ {(2.4583^{2} + 2.3125^{2} + 3.4583^{2} + 1.7708^{2} ) - \frac{{4 \times (4 + 1)^{2} }}{4}} \right] \cong 21.4051 \\ F_{F} & = \frac{{(24 - 1) \times 21.4051}}{{24 \times (4 - 1) - 21.4051}} \cong 9.7306 \\ \end{aligned}$$

where \(F_{F}\) is distributed according to the \(F\)-distribution with \((3,\,3 \times 23) = (3,\,69)\) degrees of freedom for 4 methods and 24 datasets. The critical value of \(F(3,69)\) is \(2.7375\) for the level of significance \(\alpha = 0.05\). Since \(F_{F} = 9.7306 > 2.7375\), we reject the null hypothesis. Further, the Nemenyi post hoc test is performed for the pair-wise comparison of the methods; a significant difference requires the average ranks to differ by at least the critical difference (CD) at \(p = 0.10\), namely \(2.291\sqrt {\frac{4 \times (4 + 1)}{6 \times 24}} \approx 0.8539\).
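These statistics can be reproduced with a few lines of Python; the average ranks below are those reported in Table 2 for the linear kernel.

```python
import numpy as np
from scipy.stats import f as f_dist

ranks = np.array([2.4583, 2.3125, 3.4583, 1.7708])  # TWSVM, FTWSVM, EFSVM, EFTWSVM-CIL
N, k = 24, 4                                        # datasets, methods
chi2 = 12 * N / (k * (k + 1)) * ((ranks ** 2).sum() - k * (k + 1) ** 2 / 4)
F_F = (N - 1) * chi2 / (N * (k - 1) - chi2)
crit = f_dist.ppf(0.95, k - 1, (k - 1) * (N - 1))   # critical value of F(3, 69)
CD = 2.291 * np.sqrt(k * (k + 1) / (6 * N))         # Nemenyi CD at p = 0.10
print(chi2, F_F, crit, CD)  # approx. 21.405, 9.731, 2.737, 0.854
```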

Table 2 Average ranks of TWSVM, FTWSVM, EFSVM and EFTWSVM-CIL using the linear kernel for classification on imbalance datasets

Since the difference between the average ranks of EFSVM and EFTWSVM-CIL \((3.4583 - 1.7708 = 1.6875)\) is greater than \(0.8539\), we conclude that EFTWSVM-CIL is significantly better than EFSVM. Since the differences between the average ranks of TWSVM and FTWSVM with EFTWSVM-CIL are \((2.4583 - 1.7708 = 0.6875)\) and \((2.3125 - 1.7708 = 0.5417)\), respectively, both less than \(0.8539\), there is no significant difference between EFTWSVM-CIL and either TWSVM or FTWSVM.

For the Gaussian kernel, the accuracy values along with the training times of the proposed EFTWSVM-CIL, TWSVM, FTWSVM and EFSVM are shown in Table 3. One can observe from Table 3 that EFTWSVM-CIL shows better or equal generalization performance in 18 cases. The training speed of our proposed EFTWSVM-CIL is better than that of EFSVM and comparable to those of TWSVM and FTWSVM. The average ranks of all the methods based on the accuracy values are shown in Table 4. One can conclude that, among all the methods, our proposed EFTWSVM-CIL has the lowest average rank. It is noticeable from the table that the proposed EFTWSVM-CIL is not always better in terms of accuracy for all the datasets, so the Friedman statistical test with the post hoc tests is performed further.

Table 3 Performance comparison of EFTWSVM-CIL with TWSVM, FTWSVM and EFSVM using Gaussian kernel for classification on imbalance datasets

Now, the Friedman statistic is computed for the nonlinear kernel under the null hypothesis by using Table 4:

$$\chi_{F}^{2} = \frac{12 \times 28}{4 \times (4 + 1)}\left[ {(2.5536^{2} + 2.7143^{2} + 2.8214^{2} + 1.9107^{2} ) - \frac{{4 \times (4 + 1)^{2} }}{4}} \right] \cong 8.3894$$
$$F_{F} = \frac{(28 - 1) \times 8.3894}{28 \times (4 - 1) - 8.3894} \cong 2.9958$$
Table 4 Average ranks of TWSVM, FTWSVM, EFSVM and EFTWSVM-CIL using the Gaussian kernel for classification on imbalance datasets

The critical value of \(F(3,84)\), i.e. \(2.7132\) for the level of significance \(\alpha = 0.05\), is less than the value of \(F_{F}\). Thus, the null hypothesis is rejected. Further, the Nemenyi post hoc test is used to find the significant differences between the pair-wise comparisons. We computed the critical difference (CD) at \(p = 0.10\), by which the average ranks should differ by at least \(2.291\sqrt {\frac{4 \times (4 + 1)}{6 \times 28}} \approx 0.7905\).

The differences between the average ranks of EFTWSVM-CIL with EFSVM and FTWSVM are \((2.8214 - 1.9107 = 0.9107)\) and \((2.7143 - 1.9107 = 0.8036)\), respectively, which are greater than \(0.7905\). Hence, the proposed EFTWSVM-CIL is significantly better than EFSVM and FTWSVM.

We also studied the sensitivity of the proposed EFTWSVM-CIL to the values of its user-specified parameters \(C\) and \(K\). After extensive simulations, it is found that EFTWSVM-CIL is not very sensitive to the parameter \(K\). To illustrate this result, the performance of EFTWSVM-CIL with the Gaussian kernel on the Australian Credit, WPBC, Yeast-0-3-5-9_vs_7-8 and Yeast-2_vs_4 datasets is shown in Fig. 1. From the figures, one can observe that better accuracy is achieved for smaller values of \(C\).

Fig. 1 Insensitivity performance of EFTWSVM-CIL for classification with respect to the user-specified parameters \((C,K)\) on imbalance datasets using the Gaussian kernel: a Australian Credit, b WPBC, c Yeast-0-3-5-9_vs_7-8, d Yeast-2_vs_4

5 Conclusions and Future Work

In this paper, we proposed a new variant of SVM, termed EFTWSVM-CIL, to solve the class imbalance problem in binary datasets, where the fuzzy membership values are calculated based on the entropy values of the samples. Our proposed EFTWSVM-CIL solves two smaller-sized QPPs rather than a single large one, as in the case of EFSVM, to find the decision surface. One can conclude from the results that EFTWSVM-CIL shows better generalization performance compared to TWSVM, FTWSVM and EFSVM, which clearly illustrates its efficacy and applicability. It has also been found that EFTWSVM-CIL outperforms EFSVM in terms of learning speed for both the linear and nonlinear kernels. The performance of EFTWSVM-CIL also depends on the choice of parameters, so in future work a proper parameter selection procedure for EFTWSVM-CIL may improve the performance of our proposed model. Heuristic approaches for parameter selection can also be used, which may result in better performance.