1 Introduction

Support vector machine (SVM) [1] is one of the most popular machine learning approaches and has been successfully applied to a variety of real-world problems such as face recognition [2], bioinformatics [3], text categorization [4], intrusion detection [5] and various other classification problems [6, 7]. However, the training cost of SVM is very high, i.e. \( O(m^{3} ) \), where \( m \) is the number of training samples. Researchers have made many improvements on the basis of SVM. For example, Fung and Mangasarian [8] proposed the proximal support vector machine (PSVM) for binary classification. Following PSVM, Mangasarian and Wild [9] proposed the multisurface proximal SVM via generalized eigenvalues (GEPSVM) for binary classification, which seeks two nonparallel proximal hyperplanes such that each hyperplane is closer to one of the two classes and as far as possible from the other. Inspired by GEPSVM, Jayadeva et al. [10] proposed another nonparallel hyperplane classifier for pattern classification, called the twin support vector machine (TWSVM). The main idea of TWSVM is to solve two smaller quadratic programming problems (QPPs) rather than a single large QPP, which makes the training of TWSVM approximately four times faster than that of standard SVM. Since then, TWSVM has been widely investigated [11,12,13,14,15,16], and several improvements have been made to it to obtain higher classification accuracy with lower computational time, such as the least squares twin support vector machine (LSTSVM) [11], the twin bounded support vector machine (TBSVM) [12], the twin parametric-margin support vector machine (TPMSVM) [13], the robust twin support vector machine (RTSVM) [14], nonparallel support vector machines (NPSVM) [15], the angle-based twin support vector machine (ATSVM) [16], and others [17,18,19,20].

It should be noted that, in practical problems, data are often polluted by noise or of low quality, and samples, even those belonging to the same class, should play different roles in model training. Since SVM treats all samples with the same importance, it ignores such differences, which can bias the learned decision surface toward the majority class. To address this problem, Lin et al. [21] proposed the fuzzy SVM (FSVM), in which fuzzy membership values let different input samples contribute differently to the learning of the decision surface. However, FSVM assigns smaller fuzzy memberships to support vectors, which may weaken the effect of the support vectors on the construction of the decision surface. To overcome this problem, a new efficient fuzzy SVM for non-equilibrium data was proposed to reduce the misclassification of the minority class in FSVM [22]. Adopting fuzzy membership, various fuzzy SVMs have been presented, such as the bilateral-weighted FSVM (B-FSVM) [23], NFSVM [24], WCS-FSVM [25], FTSVM [26] and NFTSVM [27]. Moreover, fuzzy set and fuzzy system theory is also widely used in various control problems [28,29,30].

Recently, Fan et al. [31] proposed an entropy-based fuzzy support vector machine (EFSVM) for the class imbalance problem, in which fuzzy membership is computed from the class certainty of samples. Motivated by EFSVM, Gupta et al. [32] proposed a fuzzy twin support vector machine based on information entropy, termed EFTWSVM-CIL. These works show that the choice of fuzzy membership is very important for classification problems: each sample is given a fuzzy membership which indicates its importance toward one class, and this weighting shapes the final decision surface. Based on the above discussion and inspired by EFTWSVM-CIL and LSTSVM, in this paper we propose a new approach termed the entropy-based fuzzy least squares twin support vector machine (EFLSTSVM). The contributions of the proposed EFLSTSVM can be highlighted as follows. First, an entropy-based fuzzy membership is used to evaluate the class certainty of the training samples. Similar to EFSVM and EFTWSVM-CIL, our EFLSTSVM adopts entropy to evaluate the class certainty of each sample and then determines the corresponding fuzzy membership from this class certainty. Thus, it pays more attention to samples with higher class certainty, which results in a more robust decision surface and enhances classification accuracy and generalization ability. Second, we modify the QPP-based formulation of EFTWSVM-CIL in the least squares sense, which leads to an optimization problem with equality constraints. Different from EFTWSVM-CIL, our EFLSTSVM solves a pair of linear systems of equations instead of two QPPs, which leads to a simpler algorithm and less computational time. Third, a regularization term is introduced into the objective function. Therefore, our EFLSTSVM implements the structural risk minimization principle instead of the empirical risk minimization principle, which can overcome the possible over-fitting problem in LSTSVM. Last but not least, experimental results on several synthetic datasets and benchmark datasets show the effectiveness of our proposed EFLSTSVM.

The rest of this paper is organized as follows. In Sect. 2, we give a brief review of LSTSVM and FTSVM. Section 3 presents the details of the linear EFLSTSVM and its nonlinear version. Experimental results on both synthetic and real-world datasets that investigate the effectiveness of our method are described in Sect. 4. Finally, Sect. 5 concludes the paper.

2 Brief Review of LSTSVM and FTSVM

In this section, we briefly explain the basics of LSTSVM and FTSVM. Consider a binary classification problem in the n-dimensional real space \( R^{n} \), with a set of training samples \( T = \{ (x_{1} ,y_{1} ),(x_{2} ,y_{2} ), \ldots ,(x_{m} ,y_{m} )\} \), where \( x_{i} \in R^{n} \) and \( y_{i} \in \{ - 1,1\} ,i = 1,2, \ldots ,m \). We organize the \( m_{1} \) samples of the positive class in an \( m_{1} \times n \) matrix \( A \in R^{{m_{1} \times n}} \) and the \( m_{2} \) samples of the negative class in an \( m_{2} \times n \) matrix \( B \in R^{{m_{2} \times n}} \).

2.1 Least Squares Twin Support Vector Machine (LSTSVM)

Different from the least squares support vector machine (LSSVM) [33], the least squares twin support vector machine (LSTSVM) [11] aims to find a pair of nonparallel hyperplanes

$$ w_{1}^{T} x + b_{1} = 0\quad {\text{and}}\quad w_{2}^{T} x + b_{2} = 0 $$
(1)

such that each hyperplane is close to the training samples of one class and as far as possible from the samples of the other class. Then, the primal optimization problem of linear LSTSVM can be expressed as

$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{w_{1} ,b_{1} ,\xi_{2} }} \,\,\frac{1}{2}||Aw_{1} + e_{1} b_{1} ||_{2}^{2} + \frac{{c_{1} }}{2}\xi_{2}^{T} \xi_{2} \\ & \;s.t.\quad - (Bw_{1} + e_{2} b_{1} ) + \xi_{2} = e_{2} \\ \end{aligned} $$
(2)
$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{w_{2} ,b_{2} ,\xi_{1} }} \,\,\frac{1}{2}||Bw_{2} + e_{2} b_{2} ||_{2}^{2} + \frac{{c_{2} }}{2}\xi_{1}^{T} \xi_{1} \\ & \;s.t.\quad (Aw_{2} + e_{1} b_{2} ) + \xi_{1} { = }e_{1} \\ \end{aligned} $$
(3)

where \( c_{1} \) and \( c_{2} \) are positive penalty parameters, \( \xi_{1} \) and \( \xi_{2} \) are slack variables, and \( e_{1} \) and \( e_{2} \) are vectors of ones of appropriate dimensions.

By substituting the equality constraints into the objective functions and setting the gradients with respect to the hyperplane parameters to zero, we obtain the following optimality conditions.

$$ [A\;e_{1} ]^{T} [A\;e_{1} ][w_{1}^{T} \;b_{1} ]^{T} + c_{1} [B\;e_{2} ]^{T} [B\;e_{2} ][w_{1}^{T} \;b_{1} ]^{T} + c_{1} [B\;e_{2} ]^{T} e_{2} = 0 $$
(4)
$$ [B\;e_{2} ]^{T} [B\;e_{2} ][w_{2}^{T} \;b_{2} ]^{T} + c_{2} [A\;e_{1} ]^{T} [A\;e_{1} ][w_{2}^{T} \;b_{2} ]^{T} - c_{2} [A\;e_{1} ]^{T} e_{1} = 0 $$
(5)

Let \( E = [A\;e_{1} ],F = [B\;e_{2} ] \). From (4) and (5), we get

$$ [w_{1}^{T} \;b_{1} ]^{T} = - \left( {F^{T} F + \frac{1}{{c_{1} }}E^{T} E} \right)^{ - 1} F^{T} e_{2} $$
(6)
$$ [w_{2}^{T} \;b_{2} ]^{T} = \left( {E^{T} E + \frac{1}{{c_{2} }}F^{T} F} \right)^{ - 1} E^{T} e_{1} $$
(7)

The solutions to the optimization problems (2) and (3) can thus be found directly by solving the systems of linear equations (4) and (5); more details can be found in [11]. Once \( w_{1} ,b_{1} \) and \( w_{2} ,b_{2} \) are obtained from (6) and (7), the nonparallel hyperplanes (1) are known. A new data point \( x \in R^{n} \) is then assigned to the positive class \( W_{1} \) or the negative class \( W_{2} \) by

$$ x \in W_{k} ,\;k = \mathop {\arg \hbox{min} }\limits_{k = 1,2} \{ |w_{1}^{T} x + b_{1} |,|w_{2}^{T} x + b_{2} |\} $$
(8)

where \( |\cdot| \) is the absolute value.
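
For concreteness, the closed-form solutions (6) and (7) can be implemented in a few lines. The following NumPy sketch is illustrative (function names are ours); it solves the two linear systems directly rather than forming explicit inverses.

```python
import numpy as np

def lstsvm_train(A, B, c1, c2):
    """Closed-form LSTSVM: solve (6) and (7) for the two hyperplanes."""
    e1 = np.ones((A.shape[0], 1))
    e2 = np.ones((B.shape[0], 1))
    E = np.hstack([A, e1])   # E = [A e1]
    F = np.hstack([B, e2])   # F = [B e2]
    # (6): [w1; b1] = -(F^T F + (1/c1) E^T E)^{-1} F^T e2
    u1 = -np.linalg.solve(F.T @ F + E.T @ E / c1, F.T @ e2)
    # (7): [w2; b2] =  (E^T E + (1/c2) F^T F)^{-1} E^T e1
    u2 = np.linalg.solve(E.T @ E + F.T @ F / c2, E.T @ e1)
    return u1, u2

def lstsvm_predict(X, u1, u2):
    """Decision rule (8): assign each row of X to the closer hyperplane."""
    Xe = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.where(np.abs(Xe @ u1) <= np.abs(Xe @ u2), 1, -1).ravel()
```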

2.2 Fuzzy Twin Support Vector Machine (FTSVM)

Different from TWSVM [10], the linear FTSVM [26] uses fuzzy membership values to weight the error contributions of the samples when constructing the classifier. The formulation of linear FTSVM can be written as

$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{w_{1} ,b_{1} ,\xi_{2} }} \,\,\frac{1}{2}||Aw_{1} + e_{1} b_{1} ||_{2}^{2} + c_{1} s_{2}^{T} \xi_{2} \\ & \;s.t.\quad - (Bw_{1} + e_{2} b_{1} ) + \xi_{2} \ge e_{2} ,\xi_{2} \ge 0 \\ \end{aligned} $$
(9)
$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{w_{2} ,b_{2} ,\xi_{1} }} \,\,\frac{1}{2}||Bw_{2} + e_{2} b_{2} ||_{2}^{2} + c_{2} s_{1}^{T} \xi_{1} \\ & \;s.t.\quad (Aw_{2} + e_{1} b_{2} ) + \xi_{1} \ge e_{1} ,\xi_{1} \ge 0 \\ \end{aligned} $$
(10)

where \( c_{1} \) and \( c_{2} \) are positive penalty parameters, \( \xi_{1} \) and \( \xi_{2} \) are slack variables, \( e_{1} \) and \( e_{2} \) are vectors of ones of appropriate dimensions, and \( s_{1} \) and \( s_{2} \) are the fuzzy memberships of the samples of the positive and negative classes, respectively.

By introducing Lagrangian multipliers, the corresponding Wolfe duals of the QPPs (9) and (10) can be written as

$$ \begin{aligned} & \mathop {\hbox{max} }\limits_{\alpha } \,\,e_{2}^{T} \alpha - \frac{1}{2}\alpha^{T} G(H^{T} H)^{ - 1} G^{T} \alpha \\ & \;s.t.\quad 0 \le \alpha \le c_{1} s_{2} \\ \end{aligned} $$
(11)
$$ \begin{aligned} & \mathop {\hbox{max} }\limits_{\gamma } \,\,e_{1}^{T} \gamma - \frac{1}{2}\gamma^{T} H(G^{T} G)^{ - 1} H^{T} \gamma \\ & \;s.t.\quad 0 \le \gamma \le c_{2} s_{1} \\ \end{aligned} $$
(12)

where \( G = [B\;e_{2} ],\;H = [A\;e_{1} ] \) and \( \alpha \in R^{{m_{2} }} ,\;\gamma \in R^{{m_{1} }} \) are Lagrangian multipliers.

The nonparallel hyperplanes (1) can be obtained from the solutions \( \alpha \) and \( \gamma \) of (11) and (12) by

$$ [w_{1}^{T} \;b_{1} ]^{T} = - (H^{T} H)^{ - 1} G^{T} \alpha $$
(13)
$$ [w_{2}^{T} \;b_{2} ]^{T} = (G^{T} G)^{ - 1} H^{T} \gamma $$
(14)

Once \( w_{1} ,b_{1} \) and \( w_{2} ,b_{2} \) are obtained from (13) and (14), the two nonparallel hyperplanes (1) are known. A new data point \( x \in R^{n} \) is then assigned to positive class \( W_{1} \) or negative class \( W_{2} \) by

$$ x \in W_{k} ,\;k = \mathop {\arg \hbox{min} }\limits_{k = 1,2} \{ |w_{1}^{T} x + b_{1} |,|w_{2}^{T} x + b_{2} |\} $$
(15)

where \( |\cdot| \) is the absolute value. More details about FTSVM can be seen in [26].
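
The duals (11) and (12) are QPPs with box constraints only, so any box-constrained solver applies (the experiments in Sect. 4 use SOR). As a hedged sketch, the code below solves (11) with SciPy's L-BFGS-B and recovers the first hyperplane by (13); the small ridge term reg, which keeps \( H^{T} H \) invertible, is our own addition and not part of the formulation. The second hyperplane follows symmetrically from (12) and (14).

```python
import numpy as np
from scipy.optimize import minimize

def ftsvm_plane1(A, B, s2, c1, reg=1e-8):
    """First FTSVM hyperplane: solve dual (11), then recover [w1; b1] by (13)."""
    H = np.hstack([A, np.ones((A.shape[0], 1))])   # H = [A e1]
    G = np.hstack([B, np.ones((B.shape[0], 1))])   # G = [B e2]
    HtH = H.T @ H + reg * np.eye(H.shape[1])       # ridge keeps H^T H invertible
    Q = G @ np.linalg.solve(HtH, G.T)              # G (H^T H)^{-1} G^T
    e2 = np.ones(B.shape[0])
    # maximize e2^T a - (1/2) a^T Q a  <=>  minimize the negated objective
    res = minimize(lambda a: 0.5 * a @ Q @ a - e2 @ a,
                   np.zeros(B.shape[0]),
                   jac=lambda a: Q @ a - e2,
                   bounds=[(0.0, c1 * si) for si in s2],  # 0 <= alpha <= c1 s2
                   method='L-BFGS-B')
    u1 = -np.linalg.solve(HtH, G.T @ res.x)        # (13)
    return u1[:-1], u1[-1]                         # w1, b1
```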

3 Entropy-Based Fuzzy Least Squares Twin Support Vector Machine

As the evaluation of fuzzy membership is the key issue in fuzzy SVMs, in this section we first introduce the entropy-based fuzzy membership. Then, by adopting this membership, the entropy-based fuzzy least squares twin support vector machine (EFLSTSVM) for binary classification is presented.

3.1 Entropy-Based Fuzzy Membership

In information theory, entropy is a measure of the information carried by a sample [34]. Entropy thus characterizes the certainty about the source of information: the smaller the entropy, the more certain the information. By adopting entropy, we can evaluate the class certainty of the training samples and assign their fuzzy memberships based on this class certainty. Specifically, samples with higher class certainty are assigned larger fuzzy memberships to enhance their contribution to the decision surface, and vice versa. Suppose the probabilities of a training sample \( x_{i} \) belonging to the positive and negative classes are \( p_{ + i} \) and \( p_{ - i} \), respectively. The entropy of \( x_{i} \) is defined as

$$ H_{i} = - p_{ + i} \cdot \ln (p_{ + i} ) - p_{ - i} \cdot \ln (p_{ - i} ) $$
(16)

where ln denotes the natural logarithm. The key point in calculating \( H_{i} \) by (16) is evaluating the probability of each sample belonging to the positive and negative classes. We find the k nearest neighbours of sample \( x_{i} \) and set \( p_{ + i} \) and \( p_{ - i} \) according to the numbers of positive and negative neighbours, i.e.

$$ p_{ + i} = \frac{{num_{ + i} }}{k},\;p_{ - i} = \frac{{num_{ - i} }}{k} $$
(17)

where \( num_{ + i} \) and \( num_{ - i} \) denote the numbers of positive and negative samples among the selected k nearest neighbours, and \( num_{ + i} + num_{ - i} = k \).

By the above entropy evaluation, the entropies of the positive samples are \( H_{ + } = \{ H_{ + 1} ,H_{ + 2} , \ldots ,H_{{ + m_{1} }} \} \). The data points of the positive class are then sorted in increasing order of entropy and divided into \( N_{ + } \) subsets, and the fuzzy memberships of the positive samples in each subset are calculated as

$$ FM_{ + j} = 1.0 - \beta \times (j - 1),\;j = 1,2, \ldots ,N_{ + } $$
(18)

where \( FM_{ + j} \) is the fuzzy membership of the positive samples in the jth subset, and the fuzzy membership parameter \( \beta \in (0,\tfrac{1}{{N_{ + } - 1}}] \) controls the scale of the fuzzy values of the positive samples. The fuzzy memberships of the positive samples are then defined as

$$ s_{ + i} = 1 - \beta \times (j - 1),\;\;if\;x_{i} \in jth\;subset\;(i = 1,2, \ldots ,m_{1} ) $$
(19)

Similarly, the entropies of the negative samples are \( H_{ - } = \{ H_{ - 1} ,H_{ - 2} , \ldots ,H_{{ - m_{2} }} \} \). The data points of the negative class are then sorted in increasing order of entropy and divided into \( N_{ - } \) subsets, and the fuzzy memberships of the negative samples in each subset are calculated as

$$ FM_{ - j} = 1.0 - \beta \times (j - 1),j = 1,2, \ldots ,N_{ - } $$
(20)

where \( FM_{ - j} \) is the fuzzy membership of the negative samples in the jth subset, and the fuzzy membership parameter \( \beta \in (0,\;\tfrac{1}{{N_{ - } - 1}}] \) controls the scale of the fuzzy values of the negative samples. The fuzzy memberships of the negative samples are then defined as

$$ s_{ - i} = 1 - \beta \times (j - 1),\;\;if\;x_{i} \in jth\;subset\;(i = 1,2, \ldots ,m_{2} ) $$
(21)
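
To make (16)-(21) concrete, the sketch below computes the entropy-based fuzzy memberships of the samples of one class with NumPy. The paper does not specify whether a sample counts among its own neighbours, how ln 0 is handled for pure neighbourhoods, or how the subsets are sized, so the self-exclusion, the eps guard and the equal-sized split via np.array_split are our assumptions.

```python
import numpy as np

def entropy_fuzzy_membership(X, y, X_class, k=9, N=10, beta=0.05):
    """Entropy-based fuzzy memberships (16)-(21) for the samples X_class of
    one class, given the whole training set (X, y) with labels in {-1, +1}."""
    eps = 1e-12                                   # guard for ln(0); assumption
    H = np.empty(len(X_class))
    for i, xi in enumerate(X_class):
        dist = np.linalg.norm(X - xi, axis=1)
        idx = np.argsort(dist)[1:k + 1]           # k nearest neighbours, self excluded
        p_pos = np.mean(y[idx] == 1)              # (17): num_{+i} / k
        p_neg = 1.0 - p_pos                       # (17): num_{-i} / k
        H[i] = -p_pos * np.log(p_pos + eps) - p_neg * np.log(p_neg + eps)  # (16)
    # sort by increasing entropy and split into N subsets; the jth subset
    # (j = 1, ..., N) receives membership 1 - beta * (j - 1) as in (18)-(21)
    s = np.empty(len(X_class))
    for j, subset in enumerate(np.array_split(np.argsort(H), N)):
        s[subset] = 1.0 - beta * j
    return s
```

Calling this once with the positive samples and once with the negative samples yields \( s_{ + } \) and \( s_{ - } \), whose diagonal matrices \( S_{ + } \) and \( S_{ - } \) enter the primal problems (22) and (23) below.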

3.2 Linear EFLSTSVM

For the binary classification problem, inspired by LSTSVM [11] and FTSVM [26], the proposed EFLSTSVM seeks a pair of nonparallel hyperplanes. It modifies the primal problems of FTSVM by replacing the inequality constraints with equality constraints and using the squared 2-norm of the slack variables. Different from LSTSVM, the structural risk is minimized in our EFLSTSVM by adding a regularization term. Thus, the primal problems of EFLSTSVM are expressed as follows.

$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{w_{1} ,b_{1} }} \;\;\frac{{c_{2} }}{2}(w_{1}^{2} + b_{1}^{2} ) + \frac{1}{2}(Aw_{1} + e_{1} b_{1} )^{T} (Aw_{1} + e_{1} b_{1} ) + \frac{{c_{1} }}{2}\xi_{2}^{T} S_{ - } \xi_{2} \\ & \,s.t.\quad - (Bw_{1} + e_{2} b_{1} ) = e_{2} - \xi_{2} \\ \end{aligned} $$
(22)
$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{w_{2} ,b_{2} }} \;\;\frac{{c_{4} }}{2}(w_{2}^{2} + b_{2}^{2} ) + \frac{1}{2}(Bw_{2} + e_{2} b_{2} )^{T} (Bw_{2} + e_{2} b_{2} ) + \frac{{c_{3} }}{2}\xi_{1}^{T} S_{ + } \xi_{1} \\ & \,s.t.\quad Aw_{2} + e_{1} b_{2} = e_{1} - \xi_{1} \\ \end{aligned} $$
(23)

where \( c_{i} \;(i = 1,2,3,4) \) are positive penalty parameters, \( S_{ + } \; = diag(s_{ + 1} ,s_{ + 2} , \ldots ,s_{{ + m_{1} }} ) \) and \( S_{ - } \; = diag(s_{ - 1} ,s_{ - 2} , \ldots ,s_{{ - m_{2} }} ) \) are the entropy-based fuzzy membership matrices of the positive and negative classes, \( \xi_{1} \) and \( \xi_{2} \) are slack variables, and \( e_{1} \) and \( e_{2} \) are vectors of ones of appropriate dimensions.

Consider the primal problem (22). Substituting the equality constraint into the objective function gives the unconstrained problem

$$ \begin{aligned} L(w_{1} ,b_{1} ) & = \frac{{c_{2} }}{2}(w_{1}^{2} + b_{1}^{2} ) + \frac{1}{2}(Aw_{1} + e_{1} b_{1} )^{T} (Aw_{1} + e_{1} b_{1} ) \\ & \quad + \frac{{c_{1} }}{2}(e_{2} + Bw_{1} + e_{2} b_{1} )^{T} S_{ - } (e_{2} + Bw_{1} + e_{2} b_{1} ) \\ \end{aligned} $$
(24)

Taking the partial derivatives of (24) with respect to \( w_{1} \) and \( b_{1} \), we get

$$ \frac{\partial L}{{\partial w_{1} }} = c_{2} w_{1} + A^{T} (Aw_{1} + e_{1} b_{1} ) + c_{1} B^{T} S_{ - } (e_{2} + Bw_{1} + e_{2} b_{1} ) = 0 $$
(25)
$$ \frac{\partial L}{{\partial b_{1} }} = c_{2} b_{1} + e_{1}^{T} (Aw_{1} + e_{1} b_{1} ) + c_{1} e_{2}^{T} S_{ - } (e_{2} + Bw_{1} + e_{2} b_{1} ) = 0 $$
(26)

Then, combining (25) and (26) and solving for \( w_{1} \) and \( b_{1} \) leads to the following system of linear equations.

$$ \left( {\begin{array}{*{20}c} {c_{2} I_{1} + A^{T} A + c_{1} B^{T} S_{ - } B} & {A^{T} e_{1} + c_{1} B^{T} S_{ - } e_{2} } \\ {e_{1}^{T} A + c_{1} e_{2}^{T} S_{ - } B} & {c_{2} + e_{1}^{T} e_{1} + c_{1} e_{2}^{T} S_{ - } e_{2} } \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {w_{1} } \\ {b_{1} } \\ \end{array} } \right){ = }\left( {\begin{array}{*{20}c} { -\, c_{1} B^{T} S_{ - } e_{2} } \\ { -\, c_{1} e_{2}^{T} S_{ - } e_{2} } \\ \end{array} } \right) $$
(27)

where \( I_{1} \) is an identity matrix.

Let \( H = [A\;e_{1} ] \in R^{{m_{1} \times (n + 1)}} ,G = [B\;e_{2} ] \in R^{{m_{2} \times (n + 1)}} \). By simplifying (27), we get

$$ (c_{2} I_{ + } + H^{T} H + c_{1} G^{T} S_{ - } G)u_{ + } = - c_{1} G^{T} S_{ - } e_{2} $$
(28)

where \( I_{ + } \) is an identity matrix and \( u_{ + } = [w_{1} ;b_{1} ] \in R^{n + 1} \).

Thus, from (28), we get

$$ u_{ + } = - c_{1} (c_{2} I_{ + } + H^{T} H + c_{1} G^{T} S_{ - } G)^{ - 1} G^{T} S_{ - } e_{2} $$
(29)

Now consider the primal problem (23) and substitute the equality constraint into the objective function. We obtain

$$ \begin{aligned} L(w_{2} ,b_{2} ) & = \frac{{c_{4} }}{2}(w_{2}^{2} + b_{2}^{2} ) + \frac{1}{2}(Bw_{2} + e_{2} b_{2} )^{T} (Bw_{2} + e_{2} b_{2} ) \\ & \quad + \frac{{c_{3} }}{2}(e_{1} - Aw_{2} - e_{1} b_{2} )^{T} S_{ + } (e_{1} - Aw_{2} - e_{1} b_{2} ) \\ \end{aligned} $$
(30)

Taking the partial derivatives of (30) with respect to \( w_{2} \) and \( b_{2} \), we obtain

$$ \frac{\partial L}{{\partial w_{2} }} = c_{4} w_{2} + B^{T} (Bw_{2} + e_{2} b_{2} ) - c_{3} A^{T} S_{ + } (e_{1} - Aw_{2} - e_{1} b_{2} ) = 0 $$
(31)
$$ \frac{\partial L}{{\partial b_{2} }} = c_{4} b_{2} + e_{2}^{T} (Bw_{2} + e_{2} b_{2} ) - c_{3} e_{1}^{T} S_{ + } (e_{1} - Aw_{2} - e_{1} b_{2} ) = 0 $$
(32)

Then, combining (31) and (32) and solving for \( w_{2} \) and \( b_{2} \) leads to the following system of linear equations.

$$ \left( {\begin{array}{*{20}c} {c_{4} I_{2} + B^{T} B + c_{3} A^{T} S_{ + } A} & {B^{T} e_{2} + c_{3} A^{T} S_{ + } e_{1} } \\ {e_{2}^{T} B + c_{3} e_{1}^{T} S_{ + } A} & {c_{4} + e_{2}^{T} e_{2} + c_{3} e_{1}^{T} S_{ + } e_{1} } \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {w_{2} } \\ {b_{2} } \\ \end{array} } \right){ = }\left( {\begin{array}{*{20}c} {c_{3} A^{T} S_{ + } e_{1} } \\ {c_{3} e_{1}^{T} S_{ + } e_{1} } \\ \end{array} } \right) $$
(33)

where \( I_{2} \) is an identity matrix.

Similarly, by simplifying (33), we can obtain

$$ (c_{4} I_{ - } + G^{T} G + c_{3} H^{T} S_{ + } H)u_{ - } = c_{3} H^{T} S_{ + } e_{1} $$
(34)

where \( I_{ - } \) is an identity matrix and \( u_{ - } = [w_{2} ;\;b_{2} ] \in R^{n + 1} \).

Thus, from (34), we get

$$ u_{ - } = c_{3} (c_{4} I_{ - } + G^{T} G + c_{3} H^{T} S_{ + } H)^{ - 1} H^{T} S_{ + } e_{1} $$
(35)

The solutions to the pair of problems (22) and (23) can be found directly by solving the two systems of linear equations (28) and (34), which involves two matrix inversions of size \( (n + 1) \times (n + 1) \). When \( n \) is much smaller than the numbers of samples of the positive and negative classes, the training of linear EFLSTSVM is extremely fast. Once \( w_{1} ,b_{1} \) and \( w_{2} ,b_{2} \) are obtained from (29) and (35), the two nonparallel hyperplanes (1) are known. A new data point \( x \in R^{n} \) is then assigned to the positive class \( W_{1} \) or the negative class \( W_{2} \) by

$$ x \in W_{k} ,\;k = \mathop {\arg \hbox{min} }\limits_{k = 1,2} \{ |w_{1}^{T} x + b_{1} |,|w_{2}^{T} x + b_{2} |\} $$
(36)

where \( |\cdot| \) is the absolute value.

In summary, the steps for constructing the linear EFLSTSVM classifier are shown in Algorithm 1.

Algorithm 1 Linear EFLSTSVM classifier
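
A minimal NumPy sketch of Algorithm 1 follows; it solves the linear systems (28) and (34) directly instead of forming the inverses in (29) and (35), and classifies by (36). Function names are illustrative.

```python
import numpy as np

def eflstsvm_train_linear(A, B, s_pos, s_neg, c1, c2, c3, c4):
    """Linear EFLSTSVM: solve (28) and (34) for u+ = [w1; b1], u- = [w2; b2]."""
    H = np.hstack([A, np.ones((A.shape[0], 1))])  # H = [A e1]
    G = np.hstack([B, np.ones((B.shape[0], 1))])  # G = [B e2]
    S_pos, S_neg = np.diag(s_pos), np.diag(s_neg) # entropy-based memberships
    I = np.eye(A.shape[1] + 1)
    e1 = np.ones((A.shape[0], 1))
    e2 = np.ones((B.shape[0], 1))
    u_pos = -c1 * np.linalg.solve(c2 * I + H.T @ H + c1 * G.T @ S_neg @ G,
                                  G.T @ S_neg @ e2)           # (29)
    u_neg = c3 * np.linalg.solve(c4 * I + G.T @ G + c3 * H.T @ S_pos @ H,
                                 H.T @ S_pos @ e1)            # (35)
    return u_pos, u_neg

def eflstsvm_predict_linear(X, u_pos, u_neg):
    """Decision rule (36): label by the nearer of the two hyperplanes."""
    Xe = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.where(np.abs(Xe @ u_pos) <= np.abs(Xe @ u_neg), 1, -1).ravel()
```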

3.3 Nonlinear EFLSTSVM

For the nonlinear case, we first define \( C = [A;B] \) and choose an appropriate kernel function \( K \). Following the same idea, the linear EFLSTSVM classifier can be extended to a nonlinear version by considering the following kernel-generated surfaces

$$ K(x^{T} ,C^{T} )w_{1} + b_{1} = 0\quad {\text{and}}\quad K(x^{T} ,C^{T} )w_{2} + b_{2} = 0 $$
(37)

The primal problems of the nonlinear EFLSTSVM are expressed as follows.

$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{w_{1} ,b_{1} }} \;\;\frac{{c_{2} }}{2}(w_{1}^{2} + b_{1}^{2} ) + \frac{1}{2}(K(A,C^{T} )w_{1} + e_{1} b_{1} )^{T} (K(A,C^{T} )w_{1} + e_{1} b_{1} ) + \frac{{c_{1} }}{2}\xi_{2}^{T} S_{ - } \xi_{2} \\ & \,s.t.\quad - (K(B,C^{T} )w_{1} + e_{2} b_{1} ) = e_{2} - \xi_{2} \\ \end{aligned} $$
(38)
$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{w_{2} ,b_{2} }} \;\;\frac{{c_{4} }}{2}(w_{2}^{2} + b_{2}^{2} ) + \frac{1}{2}(K(B,C^{T} )w_{2} + e_{2} b_{2} )^{T} (K(B,C^{T} )w_{2} + e_{2} b_{2} ) + \frac{{c_{3} }}{2}\xi_{1}^{T} S_{ + } \xi_{1} \\ & \,s.t.\quad K(A,C^{T} )w_{2} + e_{1} b_{2} = e_{1} - \xi_{1} \\ \end{aligned} $$
(39)

Similar to the linear case, substituting the constraints into the objective functions yields the following unconstrained problems.

$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{w_{1} ,b_{1} }} \;\frac{{c_{2} }}{2}(w_{1}^{2} + b_{1}^{2} ) + \frac{1}{2}(K(A,C^{T} )w_{1} + e_{1} b_{1} )^{T} (K(A,C^{T} )w_{1} + e_{1} b_{1} ) \\ & \quad + \frac{{c_{1} }}{2}(e_{2} + K(B,C^{T} )w_{1} + e_{2} b_{1} )^{T} S_{ - } (e_{2} + K(B,C^{T} )w_{1} + e_{2} b_{1} ) \\ \end{aligned} $$
(40)
$$ \begin{aligned} & \mathop {\hbox{min} }\limits_{{w_{2} ,b_{2} }} \;\frac{{c_{4} }}{2}(w_{2}^{2} + b_{2}^{2} ) + \frac{1}{2}(K(B,C^{T} )w_{2} + e_{2} b_{2} )^{T} (K(B,C^{T} )w_{2} + e_{2} b_{2} ) \\ & \quad + \frac{{c_{3} }}{2}(e_{1} - K(A,C^{T} )w_{2} - e_{1} b_{2} )^{T} S_{ + } (e_{1} - K(A,C^{T} )w_{2} - e_{1} b_{2} ) \\ \end{aligned} $$
(41)

Let \( KerH = [K(A,C^{T} )\;\;e_{1} ] \in R^{{m_{1} \times (m + 1)}} ,KerG = [K(B,C^{T} )\;\;e_{2} ] \in R^{{m_{2} \times (m + 1)}} \). Similar to the linear case, the solutions of the unconstrained problems (40) and (41) satisfy

$$ (c_{2} I_{ + } + KerH^{T} \cdot KerH + c_{1} KerG^{T} \cdot S_{ - } \cdot KerG) \cdot v_{ + } = - c_{1} KerG^{T} \cdot S_{ - } \cdot e_{2} $$
(42)
$$ (c_{4} I_{ - } + KerG^{T} \cdot KerG + c_{3} KerH^{T} \cdot S_{ + } \cdot KerH) \cdot v_{ - } = c_{3} KerH^{T} \cdot S_{ + } \cdot e_{1} $$
(43)

where \( v_{ + } = [w_{1} ;b_{1} ] \in R^{m + 1} \), \( v_{ - } = [w_{2} ;b_{2} ] \in R^{m + 1} \).

Solving (42) and (43), we obtain

$$ v_{ + } = - c_{1} (c_{2} I_{ + } + KerH^{T} \cdot KerH + c_{1} KerG^{T} \cdot S_{ - } \cdot KerG)^{ - 1} \cdot KerG^{T} \cdot S_{ - } \cdot e_{2} $$
(44)
$$ v_{ - } = c_{3} (c_{4} I_{ - } + KerG^{T} \cdot KerG + c_{3} KerH^{T} \cdot S_{ + } \cdot KerH)^{ - 1} \cdot KerH^{T} \cdot S_{ + } \cdot e_{1} $$
(45)

In summary, the solutions to the pair of problems (38) and (39) can be found directly by solving the two systems of linear equations (42) and (43). Once \( w_{1} ,b_{1} \) and \( w_{2} ,b_{2} \) are obtained from (44) and (45), the two kernel-generated surfaces (37) are known. A new data point \( x \in R^{n} \) is then assigned to the positive class \( W_{1} \) or the negative class \( W_{2} \) by

$$ x \in W_{k} ,\;k = \mathop {\arg \hbox{min} }\limits_{k = 1,2} \{ |K(x,C^{T} )w_{1} + b_{1} |,|K(x,C^{T} )w_{2} + b_{2} |\} $$
(46)

where \( |\cdot| \) is the absolute value.

Similar to the linear EFLSTSVM, the steps for constructing the nonlinear EFLSTSVM classifier are given in Algorithm 2.

Algorithm 2 Nonlinear EFLSTSVM classifier
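
Analogously, a hedged sketch of Algorithm 2 with the Gaussian kernel used in Sect. 4; it solves (42) and (43) directly and classifies by (46).

```python
import numpy as np

def gaussian_kernel(X, C, sigma):
    """K(X, C^T) with K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def eflstsvm_train_nonlinear(A, B, s_pos, s_neg, c1, c2, c3, c4, sigma):
    """Nonlinear EFLSTSVM: solve (42) and (43) for v+ and v-."""
    C = np.vstack([A, B])                         # C = [A; B]
    KerH = np.hstack([gaussian_kernel(A, C, sigma), np.ones((A.shape[0], 1))])
    KerG = np.hstack([gaussian_kernel(B, C, sigma), np.ones((B.shape[0], 1))])
    S_pos, S_neg = np.diag(s_pos), np.diag(s_neg)
    I = np.eye(C.shape[0] + 1)
    e1 = np.ones((A.shape[0], 1))
    e2 = np.ones((B.shape[0], 1))
    v_pos = -c1 * np.linalg.solve(
        c2 * I + KerH.T @ KerH + c1 * KerG.T @ S_neg @ KerG,
        KerG.T @ S_neg @ e2)                      # (44)
    v_neg = c3 * np.linalg.solve(
        c4 * I + KerG.T @ KerG + c3 * KerH.T @ S_pos @ KerH,
        KerH.T @ S_pos @ e1)                      # (45)
    return C, v_pos, v_neg

def eflstsvm_predict_nonlinear(X, C, v_pos, v_neg, sigma):
    """Decision rule (46) over the kernel-generated surfaces (37)."""
    Ke = np.hstack([gaussian_kernel(X, C, sigma), np.ones((X.shape[0], 1))])
    return np.where(np.abs(Ke @ v_pos) <= np.abs(Ke @ v_neg), 1, -1).ravel()
```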

4 Experimental Results and Discussions

In order to validate the performance of the proposed EFLSTSVM, we investigate its classification accuracy and computational efficiency on several synthetic datasets, UCI benchmark datasets, NDC datasets and image recognition datasets. In our experiments, we focus on the comparison between our proposed EFLSTSVM and five state-of-the-art classifiers: TWSVM [10], LSTSVM [11], FTSVM [26], EFSVM [31] and EFTWSVM-CIL [32]. The classification accuracy of each method is evaluated by standard 5-fold cross-validation. All methods are implemented in MATLAB R2018a on a personal computer (PC) with an Intel (R) Core (TM) i7-7700 CPU (3.60 GHz × 8) and 32 GB random-access memory (RAM). The QPPs in TWSVM, FTSVM and EFTWSVM-CIL are solved by the SOR algorithm, which is also used to solve the QPPs in [12, 18], and the systems of linear equations in LSTSVM and our EFLSTSVM are solved by the MATLAB backslash operator '\'. In addition, grid search is used to find the optimal parameters of all methods. Specifically, the penalty parameters \( c_{i} \) and the width parameter \( \sigma \) of the Gaussian kernel \( K(x,y) = e^{{ - \frac{{\left\| {x - y} \right\|^{2} }}{{2\sigma^{2} }}}} \) are selected from the set \( \{ 2^{i} |i = - 8, - 7, \ldots ,7,8\} \) for all methods. The neighbourhood size k in our EFLSTSVM, EFSVM and EFTWSVM-CIL is chosen from the set \( \{ 1,3,5, \ldots ,17,19\} \). The numbers of subsets \( N_{ + } \) and \( N_{ - } \) are set to 10 and the fuzzy membership parameter \( \beta \) is set to 0.05.
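
For reference, the grid search can be sketched as below; cv_accuracy is a placeholder for the 5-fold cross-validation routine of a given method and is our assumption, not code from the experiments.

```python
import itertools

# search ranges from Sect. 4: penalties and kernel width from {2^-8, ..., 2^8},
# neighbourhood size k from {1, 3, ..., 19}
C_RANGE = [2.0 ** i for i in range(-8, 9)]
K_RANGE = range(1, 20, 2)

def grid_search(cv_accuracy, X, y):
    """Return the (c, sigma, k) triple with the best 5-fold CV accuracy."""
    return max(itertools.product(C_RANGE, C_RANGE, K_RANGE),
               key=lambda params: cv_accuracy(X, y, *params))
```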

4.1 Synthetic Datasets

In this subsection, two artificial datasets, the Two-moons manifold dataset [35, 36] and Ripley's synthetic dataset [37], are used to illustrate that the proposed EFLSTSVM can deal with linearly inseparable problems. In the experiments, the Two-moons-1 dataset contains 100 samples (50 positive and 50 negative) and the Two-moons-2 dataset contains 200 samples (100 positive and 100 negative). Figure 1 shows the two Two-moons datasets with different complexity.

Fig. 1 Two kinds of Two-moons datasets with different complexity

For the Two-moons datasets, we investigate the performance of the nonlinear TWSVM, LSTSVM, EFSVM, FTSVM, EFTWSVM-CIL and our EFLSTSVM with the Gaussian kernel. We randomly select 40% of the samples for training and 60% for testing; each experiment is repeated 10 times and the average results are listed in Table 1. For Ripley's synthetic dataset, which contains 250 training points and 1000 test points, we investigate the classification performance of the same nonlinear methods with the Gaussian kernel; the test accuracy is listed in Table 2. From Tables 1 and 2, we observe that our EFLSTSVM obtains the best performance on the Two-moons and Ripley datasets. Moreover, the hyperplanes of TWSVM, LSTSVM, EFSVM, FTSVM, EFTWSVM-CIL and our EFLSTSVM on Ripley's dataset are shown in Fig. 2.

Table 1 Classification accuracy on Two-moons datasets
Table 2 Classification accuracy on Ripley’s datasets
Fig. 2 Classification hyperplanes of nonlinear methods on Ripley's dataset. a TWSVM, b LSTSVM, c EFSVM, d FTSVM, e EFTWSVM-CIL and f EFLSTSVM

4.2 UCI Datasets

To further compare EFLSTSVM with TWSVM, LSTSVM, EFSVM, FTSVM and EFTWSVM-CIL, we choose 11 datasets from the UCI machine learning repository [38]: Australian, Bupa-Liver, House-Votes, Heart-c, Heart-Statlog, Ionosphere, Musk, PimaIndian, Sonar, Spect and Wpbc. Experimental results of the linear and nonlinear versions are given in Tables 3 and 4, where the best accuracy is shown in boldface and the shortest time is underlined for each dataset. From Table 3, we find that the accuracy of our proposed linear EFLSTSVM is better than that of TWSVM, LSTSVM, EFSVM, FTSVM and EFTWSVM-CIL on most of the datasets. For example, on the Spect dataset, the accuracy of our linear EFLSTSVM is 83.89%, while TWSVM achieves 81.24%, LSTSVM 80.89%, EFSVM 83.15%, FTSVM 81.67% and EFTWSVM-CIL 81.68%.

Table 3 Test results of linear TWSVM, LSTSVM, EFSVM, FTSVM, EFTWSVM-CIL and EFLSTSVM
Table 4 Test results of nonlinear TWSVM, LSTSVM, EFSVM, FTSVM, EFTWSVM-CIL and EFLSTSVM

Furthermore, Table 4 shows the experimental results of the nonlinear TWSVM, LSTSVM, EFSVM, FTSVM, EFTWSVM-CIL and our proposed EFLSTSVM on the above 11 UCI datasets. The results in Table 4 are similar to those in Table 3 and confirm the observation above. Especially on the Ionosphere dataset, our nonlinear EFLSTSVM obtains a classification accuracy of 96.58%, which is 2.26% higher than TWSVM, 2.28% higher than LSTSVM, 0.85% higher than EFSVM and 1.98% higher than both FTSVM and EFTWSVM-CIL. In addition, from Tables 3 and 4, we find that our proposed EFLSTSVM is not the fastest method: although it is slightly slower than LSTSVM in most cases, it is faster than the other four methods on all chosen datasets. The main reason is that TWSVM, EFSVM, FTSVM and EFTWSVM-CIL are required to solve QPPs, while LSTSVM and our proposed EFLSTSVM only solve systems of linear equations. Figure 3 shows the training times of the nonlinear methods on selected UCI datasets.

Fig. 3 Training times of nonlinear methods on selected UCI datasets

In addition, the Friedman test [39] is conducted to give a statistical comparison of the compared algorithms. For this test, the average ranks of the compared algorithms on the used datasets are listed in the last row of Tables 3 and 4. In the experiments, we have k (= 6) compared algorithms and n (= 11) datasets. Let \( r_{i}^{j} \) be the rank of the jth algorithm on the ith dataset. The average rank of the jth algorithm is calculated as \( R_{j} = \frac{1}{n}\sum\nolimits_{i = 1}^{n} {r_{i}^{j} } \). Under the null hypothesis that all the methods are equivalent, the Friedman statistic

$$ \chi_{F}^{2} = \frac{12n}{k(k + 1)}\left[ {\sum\limits_{j} {R_{j}^{2} } - \frac{k(k + 1)^{2} }{4}} \right] $$
(47)

is distributed according to the \( \chi^{2} \) distribution with \( (k - 1) \) degrees of freedom when n and k are reasonably large.

Moreover, Iman and Davenport [40] showed that Friedman's \( \chi_{F}^{2} \) is undesirably conservative, and they derived the better statistic

$$ F_{F} = \frac{{(n - 1)\chi_{F}^{2} }}{{n(k - 1) - \chi_{F}^{2} }} $$
(48)

which is distributed according to the F-distribution with \( (k - 1) \) and \( (k - 1)(n - 1) \) degrees of freedom.

For the linear case, Table 3 shows that our proposed EFLSTSVM ranks first with an average rank of 1.8182. To check whether the measured average ranks are significantly different from the mean rank expected under the null hypothesis, according to (47) and (48) we obtain

$$ \chi_{F}^{2} = \frac{12 \times 11}{6 \times 7}\left[ {(4.2273^{2} + 4.9545^{2} + 4.7727^{2} + 3.3182^{2} + 1.9091^{2} + 1.8182^{2} ) - \frac{6 \times 7^{2} }{4}} \right] = 30.3498 $$
$$ F_{F} = \frac{10 \times 30.3498}{11 \times 5 - 30.3498} = 12.3122 $$

Specifically, for 6 algorithms and 11 datasets, \( F_{F} \) is distributed according to the F-distribution with \( (6 - 1) = 5 \) and \( (6 - 1) \times (11 - 1) = 50 \) degrees of freedom. The critical value of \( F(5,50) \) at significance level \( \alpha = 0.05 \) is 2.400, which is less than the value of \( F_{F} \); hence, the null hypothesis is rejected. Thus, the compared algorithms are significantly different on the adopted datasets.
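
These values can be reproduced with a few lines of Python; the SciPy call gives the critical value of F(5, 50). This is an illustrative check of (47) and (48), not part of the original experiments.

```python
import numpy as np
from scipy import stats

def friedman_iman_davenport(avg_ranks, n):
    """Friedman statistic (47) and Iman-Davenport statistic (48)."""
    k = len(avg_ranks)
    chi2 = 12.0 * n / (k * (k + 1)) * (np.sum(np.square(avg_ranks))
                                       - k * (k + 1) ** 2 / 4.0)
    ff = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, ff

# average ranks of the six linear classifiers (last row of Table 3)
ranks = [4.2273, 4.9545, 4.7727, 3.3182, 1.9091, 1.8182]
chi2, ff = friedman_iman_davenport(ranks, n=11)
crit = stats.f.ppf(0.95, dfn=5, dfd=50)
print(chi2, ff, crit)   # approx. 30.35, 12.31, 2.40 -> reject the null hypothesis
```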

For the nonlinear case, Table 4 shows that EFLSTSVM ranks first with an average rank of 1.4091. According to (47) and (48), we obtain

$$ \chi_{F}^{2} = \frac{12 \times 11}{6 \times 7}\left[ {(3.7727^{2} + 3.8182^{2} + 4.9545^{2} + 4.2727^{2} + 2.6818^{2} + 1.4091^{2} ) - \frac{6 \times 7^{2} }{4}} \right] = 22.9195 $$
$$ F_{F} = \frac{10 \times 22.9195}{11 \times 5 - 22.9195} = 7.1444 $$

As in the linear case, for 6 algorithms and 11 datasets, the critical value of \( F(5,50) \) at significance level \( \alpha = 0.05 \) is 2.400, which is again less than the value of \( F_{F} \). Therefore, the null hypothesis is rejected and the compared algorithms are significantly different.

On the other hand, in order to compare the performance of the different nonparallel-hyperplane support vector machines, i.e. TWSVM, LSTSVM, FTSVM, EFTWSVM-CIL and our proposed EFLSTSVM, we give two-dimensional scatter plots [14, 18] of part of the test points for the above classifiers. The Australian UCI dataset is selected as a practical example, and the corresponding scatter plots for about 20% of the data points are shown in Fig. 4, where the coordinates \( (d_{i}^{ + } ,d_{i}^{ - } ) \) are the respective distances of a test point \( x_{i} \) to the two hyperplanes. Specifically, in Fig. 4, panels (a-e) show the results obtained by TWSVM, LSTSVM, FTSVM, EFTWSVM-CIL and our proposed EFLSTSVM, respectively. From Fig. 4, it can be seen that for all methods most test samples cluster around the corresponding hyperplanes, and it is clearly noticeable that our proposed EFLSTSVM obtains better clustered points and better separated classes than the other methods.

Fig. 4 Two-dimensional projection of test points from the Australian dataset. a TWSVM, b LSTSVM, c FTSVM, d EFTWSVM-CIL and e our proposed EFLSTSVM

4.3 Image Recognition

In this subsection, we apply our method to image recognition. Three well-known and publicly available databases corresponding to typical image classification tasks, i.e. handwritten digits (USPS), objects (COIL-20) and faces (AR), are used to compare our proposed EFLSTSVM with TWSVM, LSTSVM, EFSVM, FTSVM and EFTWSVM-CIL.

The USPS database [41] consists of gray-scale handwritten digit images from 0 to 9, as shown in Fig. 5. Each digit has 1100 images, and the size of each image is 16 × 16 pixels with 256 gray levels. Here we select five pairs of digits of varying difficulty for odd-versus-even digit classification.

Fig. 5 An illustration of 10 subjects in the USPS database

COIL-20 [42] is a database of gray-scale images of 20 objects, which are illustrated in Fig. 6. The objects were placed on a motorized turntable against a black background, and images were taken at pose intervals of 5°, which corresponds to 72 images per object. In our experiments, we resized each of the original 1440 images to 32 × 32 pixels.

Fig. 6 An illustration of 20 subjects in the COIL-20 database

The AR database [43] contains 100 subjects, and each subject has 26 face images taken in two sessions of 13 images each. Here, the 14 unoccluded images from the two sessions of each person are chosen for the experiments, as shown in Fig. 7. In our experiments, the 1400 images are all cropped to the same size of 40 × 30 pixels.

Fig. 7 An illustration of 14 images of one person from the AR database

For these datasets, we randomly partition the images of each subject into two parts of equal size, using one part for training and the other for testing; this process is repeated ten times. We only consider the Gaussian kernel for these methods, and Table 5 lists the experimental results on the USPS, COIL-20 and AR datasets. It can be seen that the proposed EFLSTSVM obtains better classification performance than the other five methods on the USPS datasets. Although our proposed EFLSTSVM obtains lower classification accuracies than EFTWSVM-CIL on the COIL-20 and AR datasets, it still obtains higher accuracies than the other four methods.

Table 5 The classification performance comparison on the USPS, COIL-20 and AR datasets

4.4 NDC Datasets

In this subsection, we conduct experiments on large scale classification datasets. David Musicant's NDC Data Generator [44] is used to evaluate the computation time of all methods with respect to the number of data points. Table 6 lists a description of the NDC datasets; each dataset is divided into a training set and a testing set.

Table 6 The characteristics of benchmark datasets

For the experiments on the NDC datasets, we fix the parameters of all methods to the same values (i.e. \( c_{i} = 1,\sigma = 1,k = 9 \)). The training accuracy, testing accuracy and training time of the linear and nonlinear classifiers are reported in Tables 7 and 8, respectively. In particular, Table 7 shows the comparison results of the linear TWSVM, LSTSVM, EFSVM, FTSVM, EFTWSVM-CIL and our proposed EFLSTSVM on the NDC datasets. From Table 7, we see that our EFLSTSVM obtains comparable accuracies and runs faster than the other methods on most datasets. In addition, for the nonlinear case, Table 8 shows the comparison results of all methods on the NDC datasets with the Gaussian kernel. These results also illustrate that LSTSVM and our EFLSTSVM are much faster than TWSVM, EFSVM, FTSVM and EFTWSVM-CIL, because LSTSVM and our proposed EFLSTSVM only solve systems of linear equations rather than the quadratic programs of the other methods.

Table 7 Comparison on NDC datasets for linear classifiers
Table 8 Comparison on NDC datasets for nonlinear classifiers

For the linear case, taking the NDC-10000 dataset as an example, the training time of TWSVM is 17.5482 s, EFSVM 64.2797 s, FTSVM 17.2046 s and EFTWSVM-CIL 18.6533 s, while LSTSVM takes 0.0347 s and our EFLSTSVM 0.0702 s. Moreover, for the nonlinear case, the training times of all methods on three large scale NDC datasets, i.e. NDC-5000, NDC-8000 and NDC-10000, are shown in Fig. 8. Thus, the results of Tables 7 and 8 and Fig. 8 indicate the efficiency of our proposed EFLSTSVM when dealing with large scale problems.

Fig. 8 Training times of all nonlinear classifiers on three large scale NDC datasets

4.5 Further Discussion

The proposed EFLSTSVM has several parameters, i.e. the number of nearest neighbours k, the positive penalty parameters \( c_{1} ,c_{2} ,c_{3} ,c_{4} \) and, for the nonlinear case, the kernel width parameter \( \sigma \), and these parameters significantly impact its classification performance. In order to investigate their influence, we select 5 datasets from Table 4 and discuss their effects on the classification performance of EFLSTSVM. For simplicity, we set \( c_{1} = c_{3} \) and \( c_{2} = c_{4} \). In addition, it should be noted that when one parameter is discussed, the other parameters are set to the best values selected by fivefold cross-validation. Taking the Sonar dataset for example, the best parameters are \( k = 13,c_{1} = c_{3} = 1,c_{2} = c_{4} = 2^{ - 6} \) and \( \sigma = 1 \). Specifically, when discussing k, k is selected from the set \( \{ 1,3, \ldots ,17,19\} \) and the other parameters are set to \( c_{1} = c_{3} = 1,c_{2} = c_{4} = 2^{ - 6} ,\sigma = 1 \). When discussing \( \sigma \), \( \sigma \) is selected from the set \( \{ 2^{ - 8} ,2^{ - 7} , \ldots ,2^{7} ,2^{8} \} \) and the other parameters are set to \( c_{1} = c_{3} = 1,c_{2} = c_{4} = 2^{ - 6} ,k = 13 \). When discussing \( c_{i} \), k is set to 13, \( \sigma \) is set to 1, and \( c_{1} = c_{3} \) and \( c_{2} = c_{4} \) are selected from the set \( \{ 2^{ - 8} ,2^{ - 7} , \ldots ,2^{7} ,2^{8} \} \). The classification accuracies of the proposed EFLSTSVM with respect to k, \( \sigma \) and \( c_{i} \) on the adopted datasets are shown in Figs. 9, 10 and 11, respectively.

Fig. 9 Accuracy of EFLSTSVM with respect to k on the selected datasets

Fig. 10 Accuracy of EFLSTSVM with respect to \( \sigma \) on the selected datasets

Fig. 11 Accuracy of EFLSTSVM with respect to \( c_{1} = c_{3} ,c_{2} = c_{4} \) on the selected datasets. a Wpbc, b Spect, c Sonar, d Heart-c, e Heart-Statlog and f Ionosphere

Further, according to (44) and (45), the solution of the nonlinear EFLSTSVM requires calculating matrix inverses of size \( (m + 1) \times (m + 1) \) twice, so the computational cost of these inversions becomes high when \( m \) is large. However, using the Sherman–Morrison–Woodbury (SMW) formula [45], the solutions of (42) and (43) can be obtained using inverses of smaller dimension than \( (m + 1) \times (m + 1) \). Thus, expressions (44) and (45) can be rewritten as follows.

$$ v_{ + } = - c_{1} \left( {Y - Y \cdot KerH^{T} \cdot (\bar{I}_{ + } + KerH \cdot Y \cdot KerH^{T} )^{ - 1} \cdot KerH \cdot Y} \right) \cdot KerG^{T} \cdot S_{ - } \cdot e_{2} $$
(49)
$$ v_{ - } = c_{3} \left( {Z - Z \cdot KerG^{T} \cdot (\bar{I}_{ - } + KerG \cdot Z \cdot KerG^{T} )^{ - 1} \cdot KerG \cdot Z} \right) \cdot KerH^{T} \cdot S_{ + } \cdot e_{1} $$
(50)
$$ \begin{aligned} Y & = (c_{2} I_{ + } + c_{1} KerG^{T} \cdot S_{ - } \cdot KerG)^{ - 1} \\ & = \frac{1}{{c_{2} }}\left( {I_{ + } - KerG^{T} \cdot S_{ - } \cdot \left( {\frac{{c_{2} }}{{c_{1} }}\bar{I}_{ - } + KerG \cdot KerG^{T} \cdot S_{ - } } \right)^{ - 1} \cdot KerG} \right) \\ \end{aligned} $$
(51)
$$ \begin{aligned} Z & = (c_{4} I_{ - } + c_{3} KerH^{T} \cdot S_{ + } \cdot KerH)^{ - 1} \\ & = \frac{1}{{c_{4} }}\left( {I_{ - } - KerH^{T} \cdot S_{ + } \cdot \left( {\frac{{c_{4} }}{{c_{3} }}\bar{I}_{ + } + KerH \cdot KerH^{T} \cdot S_{ + } } \right)^{ - 1} \cdot KerH} \right) \\ \end{aligned} $$
(52)

where \( \bar{I}_{ + } \in R^{{m_{1} \times m_{1} }} ,\bar{I}_{ - } \in R^{{m_{2} \times m_{2} }} \) are identity matrices.

After applying the SMW formula, according to (49), (50), (51) and (52), the solutions of (42) and (43) require inversions of matrices of size \( m_{1} \times m_{1} \) and \( m_{2} \times m_{2} \) only. Since \( m_{1} \) and \( m_{2} \) are both smaller than \( m \), we can utilize the SMW formula to reduce the computational complexity in our experiments.
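
As an illustration of this reduction, the following sketch computes \( Y \) by (51) and \( v_{ + } \) by (49); \( Z \) and \( v_{ - } \) follow symmetrically from (52) and (50). Names are ours, and explicit inverses are kept only to mirror the formulas.

```python
import numpy as np

def smw_Y(KerG, S_neg, c1, c2):
    """Y in (51): (c2 I + c1 KerG^T S_- KerG)^{-1} via an m2 x m2 inverse."""
    m2, mp1 = KerG.shape
    inner = np.linalg.inv((c2 / c1) * np.eye(m2) + KerG @ KerG.T @ S_neg)
    return (np.eye(mp1) - KerG.T @ S_neg @ inner @ KerG) / c2

def v_pos_smw(KerH, KerG, S_neg, c1, c2):
    """v+ by (49): the KerH^T KerH term is absorbed via an m1 x m1 inverse."""
    Y = smw_Y(KerG, S_neg, c1, c2)
    m1 = KerH.shape[0]
    inner = np.linalg.inv(np.eye(m1) + KerH @ Y @ KerH.T)  # I + KerH Y KerH^T
    full_inv = Y - Y @ KerH.T @ inner @ KerH @ Y
    e2 = np.ones((KerG.shape[0], 1))
    return -c1 * full_inv @ KerG.T @ S_neg @ e2
```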

5 Conclusions

In this paper, an entropy-based fuzzy least squares twin support vector machine (EFLSTSVM) is proposed, which adopts entropy to evaluate the class certainty of each sample and computes the corresponding fuzzy membership from this class certainty. The method pays more attention to samples with higher class certainty, which results in a more robust decision surface and enhances classification accuracy and generalization ability. Experimental results on synthetic and real-world datasets illustrate the effectiveness of the proposed EFLSTSVM. It should be pointed out that our EFLSTSVM has many parameters, so parameter selection is a practical issue. In future work, we will therefore develop techniques to select the optimal parameters and further improve the performance of the proposed EFLSTSVM. In addition, extending EFLSTSVM to multi-class classification [46, 47], multi-view classification [48, 49] and semi-supervised classification [50, 51], and expanding its practical application areas, are also interesting directions.