1 Introduction

Support vector machine (SVM) [1, 2], as a powerful tool for classification and regression, has been widely used in a variety of practical applications [3–6]. However, SVM is computationally costly because its solution follows from solving constrained quadratic programming problems (QPPs), especially when handling large-scale data. Furthermore, SVM produces only one separating hyperplane and hence cannot cope with the complex XOR problem. To address these issues, the generalized eigenvalue proximal SVM (GEPSVM) [7] uses two nonparallel hyperplanes such that each hyperplane is close to one of the two classes and is as far as possible from the other class. Motivated by GEPSVM, Jayadeva et al. [8] proposed the twin SVM (TWSVM), a novel nonparallel-hyperplane classifier for binary classification. In TWSVM, the training samples of each class are proximal to one of two nonparallel hyperplanes, which are obtained by solving a pair of small-sized QPPs. Experimental results in [8] validated the superiority of TWSVM over the classical SVM. Moreover, TWSVM can effectively deal with the XOR problem due to its underlying model. Consequently, nonparallel-hyperplane SVMs have been extensively studied and a variety of methods have been proposed [9–16], such as LSTSVM [9], TBSVM [10], TPMSVM [11], RTSVM [12], PPSVC [13], NHSVM [14] and NPSVM [15]. Methods that seek two projection directions have also been widely investigated, such as MVSVM [17], PTSVM [18] and LSPTSVM [19, 20]. An overview of twin support vector machines is given in [21].

However, in practice, patterns belonging to one class often play a more significant role in classification. Such problems are usually solved using fuzzy SVMs [22–26], in which the patterns of more important classes are assigned higher membership values. Different from FSVM, Tao and Wang proposed a new framework that solves the fuzzy classification problem using a fuzzy membership function [27], termed the new fuzzy SVM (NFSVM) [28]. However, FSVM and NFSVM have the same computational complexity as SVM because their solutions also follow from solving a QPP. Inspired by TWSVM, FTWSVM [29] and FTSVM [30] have also been proposed. Specifically, FTWSVM incorporates the fuzziness in the data and obtains nonparallel planes around which the data points of the corresponding class cluster, while FTSVM builds on the v-twin support vector machine and introduces the importance of each training sample.

In order to enhance the efficiency and performance of fuzzy SVM and TWSVM for binary classification, in this paper we propose a new fuzzy twin support vector machine (NFTSVM). The proposed NFTSVM combines the ideas of NFSVM and TWSVM, and has two advantages over them. First, NFTSVM generates two nonparallel hyperplanes such that each plane is close to one of the two classes and is as far as possible from the other; it employs the fuzzy membership to weigh the margin of each training sample, which enhances the generalization ability. Second, NFTSVM has a more general formulation than TWSVM and TBSVM; in fact, both are special cases of the proposed NFTSVM. Moreover, NFTSVM differs from FTWSVM and FTSVM. In FTWSVM and FTSVM, a fuzzy membership value is assigned to each pattern, and points are classified by assigning them to the nearest of two nonparallel planes that are close to their respective classes. In contrast, inspired by NFSVM, the main idea of NFTSVM is to weigh the margin of each training vector using the fuzzy membership function, so that the influence of samples with high uncertainty is decreased. In addition, an important advantage of NFTSVM, which borrows ideas from fuzzy neural networks, is that we can employ available fuzzy membership functions.

The rest of this paper is organized as follows. Section 2 gives a brief overview of related work, including NFSVM, TWSVM and TBSVM. Section 3 presents the proposed NFTSVM in detail. Experimental results and conclusions are given in Sects. 4 and 5, respectively.

2 Related work

Conventional SVMs solve the binary classification problem using the principle of structural risk minimization (SRM) [1, 2]. For a typical binary classification problem, we have a set of training samples:

$$T = \{ ({x_1},{y_1}),({x_2},{y_2}), \ldots ,({x_l},{y_l})\}$$
(1)

where \({x_i} \in {\Re ^n}\) is the i-th observation, \({y_i} \in \{ - 1,1\}\) is the label of the i-th observation, and \(l\) is the number of training samples. The aim of linear SVM is to solve the following primal QPP:

$$\begin{gathered} \mathop {\min }\limits_{w,b,\xi } \quad \frac{1}{2}{w^T}w + C\sum\limits_{i = 1}^l {{\xi _i}} \hfill \\ s.t.\quad {y_i}({w^T}{x_i} + b) \geqslant 1 - {\xi _i},\; \hfill \\ \quad \quad {\xi _i} \geqslant 0,\;i = 1,2, \ldots ,l \hfill \\ \end{gathered}$$
(2)

where \(C\) is the penalty parameter and \({\xi _i}\) are slack variables. The output of SVM is an optimal separating hyperplane:

$${w^T}x + b = 0$$
(3)

where \(w \in {\Re ^n}\) is the normal vector to the hyperplane, and \(b \in \Re\) is the offset. Given a new observation \(x'\), we can use this hyperplane to determine its label: we assign the positive label to \(x'\) if \({w^T}x' + b \geqslant 0\); otherwise, we assign the negative label.
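As a minimal sketch of this decision rule (the function name and array layout are our illustrative choices, not from [1, 2]):

```python
import numpy as np

def svm_predict(w, b, X):
    """Apply the separating hyperplane (3): label +1 if w^T x + b >= 0, else -1."""
    return np.where(X @ w + b >= 0.0, 1, -1)

# Example: w = (1, -1), b = 0 separates points above/below the line x1 = x2.
print(svm_predict(np.array([1.0, -1.0]), 0.0,
                  np.array([[2.0, 1.0], [1.0, 2.0]])))   # -> [ 1 -1]
```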

2.1 New fuzzy SVM (NFSVM)

However, one of the main drawbacks of the conventional SVM is that it is sensitive to outliers and noise. In order to deal with this problem, fuzzy support vector machines (FSVM) have been proposed [22–26]. In addition, Tao and Wang [28] proposed the new fuzzy support vector machine (NFSVM). NFSVM considers the binary classification problem for a set of training samples:

$${T^*} = \{ ({x_1},{m_1}),({x_2},{m_2}), \ldots ,({x_l},{m_l})\}$$
(4)

where \({x_i} \in {\Re ^n}\) are independently distributed observations, \({m_i} \in [0,1]\) is the fuzzy membership that measures the contribution of the i-th observation \({x_i}\) to the positive class, and \(l\) is the number of training samples. In order to obtain a formulation similar to the classical SVM, NFSVM uses a new label \({y_i} = 2{m_i} - 1\) in place of the fuzzy membership \({m_i}\). The primal problem of NFSVM can then be expressed as

$$\begin{gathered} \mathop {\min }\limits_{w,b,\xi } \quad \frac{1}{2}{w^T}w + C\sum\limits_{i = 1}^l {{\xi _i}} \hfill \\ s.t.\quad {y_i}({w^T}{x_i} + b) \geqslant y_i^2 - y_i^2{\xi _i},\; \hfill \\ \quad \quad {\xi _i} \geqslant 0,\;i = 1,2, \ldots ,l \hfill \\ \end{gathered}$$
(5)

where \(C\) is the penalty parameter, \({\xi _i}\) are slack variables, \(w \in {\Re ^n}\) is the normal vector and \(b \in \Re\) is the offset.

To solve this primal problem, NFSVM solves its Lagrangian dual problem:

$$\begin{gathered} \mathop {\min }\limits_\alpha \quad \frac{1}{2}\sum\limits_{i = 1}^l {\sum\limits_{j = 1}^l {{\alpha _i}} } {\alpha _j}{y_i}{y_j}({x_i} \cdot {x_j}) - \sum\limits_{i = 1}^l {{\alpha _i}} y_i^2 \hfill \\ s.t.\quad \;\sum\limits_{i = 1}^l {{\alpha _i}} {y_i} = 0\; \hfill \\ \quad \quad \;0 \leqslant {\alpha _i}y_i^2 \leqslant C,\;i = 1,2, \ldots ,l \hfill \\ \end{gathered}$$
(6)

where \(\alpha \in {\Re ^l}\) is the vector of Lagrangian multipliers [28].
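To make the role of the relabeling \({y_i} = 2{m_i} - 1\) concrete, the following sketch assembles the quantities appearing in the dual (6). The helper name is hypothetical, the degenerate case \({y_i} = 0\) (i.e. \({m_i} = 0.5\)) is handled by our own convention, and the equality constraint \(\sum\nolimits_i {{\alpha _i}{y_i}} = 0\) would be passed to the QP solver separately:

```python
import numpy as np

def nfsvm_dual_terms(X, m, C):
    """Build the Hessian, linear term and upper bounds of the NFSVM dual (6)."""
    y = 2.0 * m - 1.0                           # relabeling y_i = 2 m_i - 1
    Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j (x_i . x_j)
    d = y ** 2                                  # linear term: sum_i alpha_i y_i^2
    upper = np.full_like(y, np.inf)             # from 0 <= alpha_i y_i^2 <= C
    nz = y != 0                                 # y_i = 0 leaves alpha_i unconstrained above
    upper[nz] = C / y[nz] ** 2
    return Q, d, upper
```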

2.2 TWSVM

Different from the single hyperplane SVM discussed above, linear twin support vector machine (TWSVM) [8] aims to find a pair of nonparallel hyperplanes:

$$w_1^Tx + {b_1} = 0\quad {\text{and}}\quad w_2^Tx + {b_2} = 0$$
(7)

such that each hyperplane is proximal to the training samples of one class and is as far as possible from the samples of the other class. More specifically, we consider the training of a binary classifier from \(p\) positive samples and \(q\) negative samples. The training samples in positive class are denoted by matrix \(A \in {\Re ^{p \times n}}\) and the training samples in negative class are denoted by matrix \(B \in {\Re ^{q \times n}}\). The primal problem of linear TWSVM can be expressed as:

$$\begin{gathered} \mathop {\min }\limits_{{w_1},{b_1},{\xi _2}} \,\,\frac{1}{2}||A{w_1} + {e_1}{b_1}||_2^2 + {c_1}e_2^T{\xi _2} \hfill \\ \;s.t.\quad - (B{w_1} + {e_2}{b_1}) + {\xi _2} \geqslant {e_2},{\xi _2} \geqslant 0 \hfill \\ \end{gathered}$$
(8)
$$\begin{gathered} \mathop {\min }\limits_{{w_2},{b_2},{\xi _1}} \,\,\frac{1}{2}||B{w_2} + {e_2}{b_2}||_2^2 + {c_2}e_1^T{\xi _1} \hfill \\ \;s.t.\quad (A{w_2} + {e_1}{b_2}) + {\xi _1} \geqslant {e_1},{\xi _1} \geqslant 0 \hfill \\ \end{gathered}$$
(9)

where \({c_1} > 0\) and \({c_2} > 0\) are penalty parameters, \({\xi _1}\) and \({\xi _2}\) are slack variables, and \({e_1}\), \({e_2}\) are vectors of ones of appropriate dimensions. Using the method of Lagrangian multipliers, the corresponding Wolfe duals of QPPs (8) and (9) can be written as

$$\begin{gathered} \mathop {\max }\limits_\alpha \,\,e_2^T\alpha - \frac{1}{2}{\alpha ^T}G{({H^T}H)^{ - 1}}{G^T}\alpha \hfill \\ \;s.t.\quad 0 \leqslant \alpha \leqslant {c_1}{e_2} \hfill \\ \end{gathered}$$
(10)
$$\begin{gathered} \mathop {\max }\limits_\gamma \,\,e_1^T\gamma - \frac{1}{2}{\gamma ^T}H{({G^T}G)^{ - 1}}{H^T}\gamma \hfill \\ \;s.t.\quad 0 \leqslant \gamma \leqslant {c_2}{e_1} \hfill \\ \end{gathered}$$
(11)

where \(G = [B\;{e_2}]\), \(H = [A\;{e_1}]\), and \(\alpha \in {\Re ^q},\;\gamma \in {\Re ^p}\) are Lagrangian multipliers [8].

2.3 TBSVM

In order to embody the structural risk minimization principle of statistical learning theory and improve classification performance, the twin bounded support vector machine (TBSVM) was proposed [10]. TBSVM minimizes the structural risk by adding a regularization term, with the idea of maximizing the one-sided margin. For a linear binary classification problem, TBSVM solves the following two primal problems:

$$\begin{gathered} \mathop {\min }\limits_{{w_1},{b_1},{\xi _2}} \,\,\frac{1}{2}{c_3}(||{w_1}||_2^2 + b_1^2) + \frac{1}{2}||A{w_1} + {e_1}{b_1}||_2^2 + {c_1}e_2^T{\xi _2} \hfill \\ \;s.t.\quad - (B{w_1} + {e_2}{b_1}) + {\xi _2} \geqslant {e_2},{\xi _2} \geqslant 0 \hfill \\ \end{gathered}$$
(12)
$$\begin{gathered} \mathop {\min }\limits_{{w_2},{b_2},{\xi _1}} \,\,\frac{1}{2}{c_4}(||{w_2}||_2^2 + b_2^2) + \frac{1}{2}||B{w_2} + {e_2}{b_2}||_2^2 + {c_2}e_1^T{\xi _1} \hfill \\ \;s.t.\quad (A{w_2} + {e_1}{b_2}) + {\xi _1} \geqslant {e_1},{\xi _1} \geqslant 0 \hfill \\ \end{gathered}$$
(13)

where \({c_1},\;{c_2},\;{c_3}\) and \({c_4}\) are positive penalty parameters, \({\xi _1},\;{\xi _2}\) are slack variables, and \({e_1},\;{e_2}\) are vectors of ones. Using the method of Lagrangian multipliers, the Wolfe duals of QPPs (12) and (13) can be written as

$$\begin{gathered} \mathop {\max }\limits_\alpha \,\,e_2^T\alpha - \frac{1}{2}{\alpha ^T}G{({H^T}H + {c_3}I)^{ - 1}}{G^T}\alpha \hfill \\ \;s.t.\quad 0 \leqslant \alpha \leqslant {c_1}{e_2} \hfill \\ \end{gathered}$$
(14)
$$\begin{gathered} \mathop {\max }\limits_\gamma \,\,e_1^T\gamma - \frac{1}{2}{\gamma ^T}H{({G^T}G + {c_4}I)^{ - 1}}{H^T}\gamma \hfill \\ \;s.t.\quad 0 \leqslant \gamma \leqslant {c_2}{e_1} \hfill \\ \end{gathered}$$
(15)

where \(G = [B\;{e_2}],\,H = [A\;{e_1}]\), \(\alpha \in {\Re ^q},\;\gamma \in {\Re ^p}\) are Lagrangian multipliers, and \(I\) is an identity matrix of appropriate dimensions [10].
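As a sketch of how the two planes might be recovered from the dual solutions of (14) and (15) (our naming; the minus sign for the first plane follows from the direction of the constraint in (12), as in the original TWSVM):

```python
import numpy as np

def tbsvm_planes(A, B, alpha, gamma, c3, c4):
    """Recover [w1; b1] and [w2; b2] from the TBSVM dual solutions."""
    H = np.hstack([A, np.ones((A.shape[0], 1))])   # H = [A e1]
    G = np.hstack([B, np.ones((B.shape[0], 1))])   # G = [B e2]
    I = np.eye(H.shape[1])
    # Stationarity of (12): (H^T H + c3 I) v1 + G^T alpha = 0
    v1 = -np.linalg.solve(H.T @ H + c3 * I, G.T @ alpha)
    # Stationarity of (13): (G^T G + c4 I) v2 - H^T gamma = 0
    v2 = np.linalg.solve(G.T @ G + c4 * I, H.T @ gamma)
    return (v1[:-1], v1[-1]), (v2[:-1], v2[-1])    # (w1, b1), (w2, b2)
```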

3 New fuzzy twin support vector machine (NFTSVM)

3.1 Linear NFTSVM

In real-world applications, we often face situations where patterns belonging to one class play a more significant role in classification. In such cases, the main concern is how to determine the final classes by assigning different degrees of importance to the training data. A well-established approach to this challenge is to use fuzzy membership functions. Moreover, the existing nonparallel-hyperplane support vector machines are often more efficient to solve than the conventional SVM; for example, TWSVM is approximately four times faster. In order to handle such applications with high solving efficiency, we incorporate fuzzy set theory into NFSVM and propose a new fuzzy twin support vector machine (NFTSVM) for binary classification. As discussed in Sect. 2.1, in fuzzy binary classification \({m_i}\) is the fuzzy membership that measures the membership of the corresponding observation \({x_i}\) to the positive class. Given \(p\) positive samples \(\{ ({\tilde x_1},{\tilde m_1}),({\tilde x_2},{\tilde m_2}), \ldots ,({\tilde x_p},{\tilde m_p})\}\) and \(q\) negative samples \(\{ ({\hat x_1},{\hat m_1}),({\hat x_2},{\hat m_2}), \ldots ,({\hat x_q},{\hat m_q})\}\), we form the diagonal matrices \({Y_1} = diag({\tilde y_1},{\tilde y_2}, \ldots ,{\tilde y_p})\) and \({Y_2} = diag({\hat y_1},{\hat y_2}, \ldots ,{\hat y_q})\), where \({\tilde y_i} = 2{\tilde m_i} - 1\;(i = 1,2, \ldots ,p)\) and \({\hat y_j} = 2{\hat m_j} - 1\;(j = 1,2, \ldots ,q)\), respectively.
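For concreteness, a small sketch of this construction (the function name is ours):

```python
import numpy as np

def fuzzy_label_matrices(m_pos, m_neg):
    """Diagonal matrices Y1, Y2 from the memberships of the p positive and q negative samples."""
    Y1 = np.diag(2.0 * np.asarray(m_pos) - 1.0)   # y~_i = 2 m~_i - 1, i = 1..p
    Y2 = np.diag(2.0 * np.asarray(m_neg) - 1.0)   # y^_j = 2 m^_j - 1, j = 1..q
    return Y1, Y2
```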

Similar to NFSVM [28], TWSVM [8] and TBSVM [10], the goal of the proposed linear NFTSVM is to find two nonparallel hyperplanes (7) using the following primal problems:

$$\begin{gathered} \mathop {\min }\limits_{{w_1},{b_1},{\xi _2}} \,\,\frac{1}{2}||A{w_1} + {e_1}{b_1}||_2^2 + \frac{1}{2}{c_1}(||{w_1}||_2^2 + b_1^2) + {c_2}e_2^T{\xi _2} \hfill \\ \;s.t.\quad {Y_2}(B{w_1} + {e_2}{b_1}) \geqslant Y_2^2{e_2} - Y_2^2{\xi _2},{\xi _2} \geqslant 0 \hfill \\ \end{gathered}$$
(16)
$$\begin{gathered} \mathop {\min }\limits_{{w_2},{b_2},{\xi _1}} \,\,\frac{1}{2}||B{w_2} + {e_2}{b_2}||_2^2 + \frac{1}{2}{c_3}(||{w_2}||_2^2 + b_2^2) + {c_4}e_1^T{\xi _1} \hfill \\ \;s.t.\quad {Y_1}(A{w_2} + {e_1}{b_2}) \geqslant Y_1^2{e_1} - Y_1^2{\xi _1},{\xi _1} \geqslant 0 \hfill \\ \end{gathered}$$
(17)

where \({c_1},\;{c_2},\;{c_3}\) and \({c_4}\) are positive penalty parameters, \({\xi _1}\) and \({\xi _2}\) are slack variables, and \({e_1}\), \({e_2}\) are vectors of ones.

Let us compare the objective functions and constraints in (12), (13) and (16), (17). Their objective functions are identical, while their constraints differ. For a classical binary classification problem, the fuzzy memberships are \({\tilde m_i} = 1\) and \({\hat m_j} = 0\), so the labels are \({\tilde y_i} = 1\;(i = 1,2, \ldots ,p)\) and \({\hat y_j} = - 1\;(j = 1, \ldots ,q)\). In this case, the constraints in (16), (17) coincide with those in (12), (13). That is, TBSVM is a special case of our NFTSVM when the parameters of NFTSVM are appropriately selected. In practical problems, however, the fuzzy memberships \({\tilde m_i}\,(i = 1,2, \ldots ,p)\) are not all 1 and \({\hat m_j}\,(j = 1,2, \ldots ,q)\) are not all 0. This means that different samples are penalized with different coefficients, which may improve the classification accuracy in practical problems.

To obtain the solutions of (16) and (17), we derive their dual problems. For (16), introducing the method of Lagrangian multipliers yields the Lagrangian function:

$$\begin{aligned} L({w_1},{b_1},{\xi _2},\alpha ,\beta ) = \frac{1}{2}||A{w_1} + {e_1}{b_1}||_2^2 + \frac{1}{2}{c_1}(||{w_1}||_2^2 + b_1^2) + {c_2}e_2^T{\xi _2} \hfill \\ \quad \quad \quad \quad \quad \quad \quad - {\alpha ^T}[{Y_2}(B{w_1} + {e_2}{b_1}) - Y_2^2{e_2} + Y_2^2{\xi _2}] - {\beta ^T}{\xi _2}\end{aligned}$$
(18)

where \(\alpha \in {\Re ^q}\) and \(\beta \in {\Re ^q}\) are Lagrangian multipliers. With the Karush-Kuhn-Tucker (KKT) conditions [31], we have:

$$\frac{{\partial L}}{{\partial {w_1}}} = {A^T}(A{w_1} + {e_1}{b_1}) + {c_1}{w_1} - {B^T}Y_2^T\alpha = 0,$$
(19)
$$\frac{{\partial L}}{{\partial {b_1}}} = e_1^T(A{w_1} + {e_1}{b_1}) + {c_1}{b_1} - e_2^TY_2^T\alpha = 0,$$
(20)
$$\frac{{\partial L}}{{\partial {\xi _2}}} = {c_2}{e_2} - Y_2^2\alpha - \beta = 0,$$
(21)
$${Y_2}(B{w_1} + {e_2}{b_1}) \geqslant Y_2^2{e_2} - Y_2^2{\xi _2},\;{\xi _2} \geqslant 0,$$
(22)
$${\alpha ^T}[{Y_2}(B{w_1} + {e_2}{b_1}) - Y_2^2{e_2} + Y_2^2{\xi _2}] = 0,$$
(23)
$$\alpha \geqslant 0,\beta \geqslant 0.$$
(24)

Since \(\beta \geqslant 0\), from (21) and (24), we have:

$$0 \leqslant \alpha \leqslant {c_2}{(Y_2^2)^{ - 1}}{e_2}$$
(25)

Let \(G = [B\;{e_2}],\;H = [A\;{e_1}]\) and \({v_1} = {[w_1^T\;{b_1}]^T}\); then (19) and (20) imply:

$$({H^T}H + {c_1}I){v_1} - {G^T}Y_2^T\alpha = 0$$
(26)

Thus, we can get the augmented vector

$${v_1} = {({H^T}H + {c_1}I)^{ - 1}}{G^T}Y_2^T\alpha$$
(27)

where \(I\) is an identity matrix of appropriate dimensions. Then, we obtain the Wolfe dual problem of (16) by substituting (27) into (18) and using (19) and (21):

$$\begin{gathered} \mathop {\max }\limits_\alpha \,\,e_2^TY_2^T\alpha - \frac{1}{2}{\alpha ^T}{Y_2}G{({H^T}H + {c_1}I)^{ - 1}}{G^T}Y_2^T\alpha \hfill \\ \;s.t.\quad 0 \leqslant \alpha \leqslant {c_2}{(Y_2^2)^{ - 1}}{e_2} \hfill \\ \end{gathered}$$
(28)

In the same fashion, we can obtain the Wolfe dual problem of (17):

$$\begin{gathered} \mathop {\max }\limits_\gamma \,\,e_1^TY_1^T\gamma - \frac{1}{2}{\gamma ^T}{Y_1}H{({G^T}G + {c_3}I)^{ - 1}}{H^T}Y_1^T\gamma \hfill \\ \;s.t.\quad 0 \leqslant \gamma \leqslant {c_4}{(Y_1^2)^{ - 1}}{e_1} \hfill \\ \end{gathered}$$
(29)

where \(\alpha \in {\Re ^q},\gamma \in {\Re ^p}\) are Lagrangian multipliers.

The nonparallel hyperplanes (7) can be obtained from the solutions of \(\alpha\) and \(\gamma\) in (28) and (29) by:

$${v_1} = {({H^T}H + {c_1}I)^{ - 1}}{G^T}Y_2^T\alpha ,\quad {\rm{where }}\,{v_1} = {[w_1^T\,{b_1}]^T}$$
(30)
$${v_2} = {({G^T}G + {c_3}I)^{ - 1}}{H^T}Y_1^T\gamma ,\quad {\rm{where}}\;{v_2} = {[w_2^T\,{b_2}]^T}$$
(31)

Once \({v_1}\) and \({v_2}\) are obtained, the two nonparallel hyperplanes (7) are known. A new sample \(x \in {\Re ^n}\) is assigned a positive or negative label depending on which of the hyperplanes (7) it lies closest to, i.e.

$$x \in {W_k},\,\,k = \mathop {\arg \min }\limits_{i = 1,2} \,\left\{ {\frac{{|w_1^Tx + {b_1}|}}{{||{w_1}||}},\,\frac{{|w_2^Tx + {b_2}|}}{{||{w_2}||}}} \right\}$$
(32)

where \(| \cdot |\) is the absolute value.
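The following sketch assembles (30)–(32) under our naming; the dual solutions \(\alpha\) and \(\gamma\) of (28) and (29) are assumed to come from a QP solver such as the SOR routine discussed in Sect. 3.3:

```python
import numpy as np

def nftsvm_linear_planes(A, B, Y1, Y2, alpha, gamma, c1, c3):
    """Recover the two hyperplanes from the dual solutions via (30) and (31)."""
    H = np.hstack([A, np.ones((A.shape[0], 1))])   # H = [A e1]
    G = np.hstack([B, np.ones((B.shape[0], 1))])   # G = [B e2]
    I = np.eye(H.shape[1])
    v1 = np.linalg.solve(H.T @ H + c1 * I, G.T @ Y2.T @ alpha)   # (30)
    v2 = np.linalg.solve(G.T @ G + c3 * I, H.T @ Y1.T @ gamma)   # (31)
    return (v1[:-1], v1[-1]), (v2[:-1], v2[-1])    # (w1, b1), (w2, b2)

def nftsvm_linear_predict(x, w1, b1, w2, b2):
    """Decision rule (32): assign x to the class of its nearest hyperplane."""
    d1 = abs(w1 @ x + b1) / np.linalg.norm(w1)
    d2 = abs(w2 @ x + b2) / np.linalg.norm(w2)
    return 1 if d1 <= d2 else -1
```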

3.2 Nonlinear NFTSVM

In this subsection, we show that our linear NFTSVM can be easily extended to the nonlinear case. Here, we consider the following kernel-generated hyperplanes

$$K({x^T},{C^T}){w_1} + {b_1} = 0\quad {\text{and}}\quad K({x^T},{C^T}){w_2} + {b_2} = 0,$$
(33)

where \(C = [A\;B] \in \Re ^{{(p + q) \times n}}\) and \(K\) is an appropriately chosen kernel function. Similar to the linear case, the nonlinear optimization problems can be expressed as:

$$\begin{gathered} \mathop {\min }\limits_{{w_1},{b_1},{\eta _2}} \,\,\frac{1}{2}||K(A,{C^T}){w_1} + {e_1}{b_1}||_2^2 + \frac{1}{2}{c_1}(||{w_1}||_2^2 + b_1^2) + {c_2}e_2^T{\eta _2} \hfill \\ \;s.t.\quad {Y_2}(K(B,{C^T}){w_1} + {e_2}{b_1}) \geqslant Y_2^2{e_2} - Y_2^2{\eta _2},{\eta _2} \geqslant 0, \hfill \\ \end{gathered}$$
(34)
$$\begin{gathered} \mathop {\min }\limits_{{w_2},{b_2},{\eta _1}} \,\,\frac{1}{2}||K(B,{C^T}){w_2} + {e_2}{b_2}||_2^2 + \frac{1}{2}{c_3}(||{w_2}||_2^2 + b_2^2) + {c_4}e_1^T{\eta _1} \hfill \\ \;s.t.\quad {Y_1}(K(A,{C^T}){w_2} + {e_1}{b_2}) \geqslant Y_1^2{e_1} - Y_1^2{\eta _1},{\eta _1} \geqslant 0 \hfill \\ \end{gathered}$$
(35)

With the Lagrangian method and KKT conditions, we can obtain the corresponding Wolfe dual problems

$$\begin{gathered} \mathop {\max }\limits_\alpha \,\,e_2^TY_2^T\alpha - \frac{1}{2}{\alpha ^T}{Y_2}\tilde G{({{\tilde H}^T}\tilde H + {c_1}I)^{ - 1}}{{\tilde G}^T}Y_2^T\alpha \hfill \\ \;s.t.\quad 0 \leqslant \alpha \leqslant {c_2}{(Y_2^2)^{ - 1}}{e_2}, \hfill \\ \end{gathered}$$
(36)
$$\begin{gathered} \mathop {\max }\limits_\gamma \,\,e_1^TY_1^T\gamma - \frac{1}{2}{\gamma ^T}{Y_1}\tilde H{({{\tilde G}^T}\tilde G + {c_3}I)^{ - 1}}{{\tilde H}^T}Y_1^T\gamma \hfill \\ \;s.t.\quad 0 \leqslant \gamma \leqslant {c_4}{(Y_1^2)^{ - 1}}{e_1}, \hfill \\ \end{gathered}$$
(37)

where \(\tilde G = [K(B,{C^T})\;{e_2}]\) and \(\tilde H = [K(A,{C^T})\;{e_1}]\).

According to (34)–(37), the augmented vectors \({v_1} = {[w_1^T\,{b_1}]^T}\) and \({v_2} = {[w_2^T\,{b_2}]^T}\) can be obtained by

$${v_1} = {({\tilde H^T}\tilde H + {c_1}I)^{ - 1}}{\tilde G^T}Y_2^T\alpha ,$$
(38)
$${v_2} = {({\tilde G^T}\tilde G + {c_3}I)^{ - 1}}{\tilde H^T}Y_1^T\gamma ,$$
(39)

Once the vectors \({v_1}\) and \({v_2}\) are obtained, the two nonparallel hyperplanes (33) are known. A new sample \(x \in {\Re ^n}\) is assigned a positive or negative label depending on which of the hyperplanes (33) it lies closest to, i.e.

$$x \in {W_k},\,\,k = \mathop {\arg \min }\limits_{i = 1,2} \,\left\{ {\frac{{|w_1^TK(x,{C^T}) + {b_1}|}}{{\sqrt {w_1^TK(C,{C^T}){w_1}} }},\,\frac{{|w_2^TK(x,{C^T}) + {b_2}|}}{{\sqrt {w_2^TK(C,{C^T}){w_2}} }}} \right\}$$
(40)

where \(| \cdot |\) is the absolute value.
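A sketch of the kernel decision rule (40) with the Gaussian kernel used in Sect. 4 (function names are ours; \(w_1, b_1, w_2, b_2\) are the solutions from (38) and (39)):

```python
import numpy as np

def gaussian_kernel(X, Z, sigma):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def nftsvm_kernel_predict(x, C_mat, w1, b1, w2, b2, sigma):
    """Decision rule (40); C_mat stacks all p + q training samples (C = [A; B])."""
    k = gaussian_kernel(x[None, :], C_mat, sigma).ravel()   # row vector K(x^T, C^T)
    KCC = gaussian_kernel(C_mat, C_mat, sigma)              # Gram matrix K(C, C^T)
    d1 = abs(k @ w1 + b1) / np.sqrt(w1 @ KCC @ w1)
    d2 = abs(k @ w2 + b2) / np.sqrt(w2 @ KCC @ w2)
    return 1 if d1 <= d2 else -1
```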

3.3 Implementation

In this subsection, we discuss the implementation of our proposed NFTSVM. In NFTSVM, the dual problems can be rewritten in the following unified form

$$\begin{gathered} \mathop {\min }\limits_\alpha \quad \tfrac{1}{2}{\alpha ^T}Q\alpha - {d^T}\alpha \hfill \\ s.t.\quad \;0 \leqslant \alpha \leqslant ce\;, \hfill \\ \end{gathered}$$
(41)

where \(Q\) is a positive definite matrix and \(d\) is a vector. For example, setting \(Q = {Y_2}G{({H^T}H + {c_1}I)^{ - 1}}{G^T}Y_2^T\), \(d = {Y_2}{e_2}\) and \(c = {c_2}{(Y_2^2)^{ - 1}}\) makes (41) equivalent to (28), and setting \(Q = {Y_2}\tilde G{({\tilde H^T}\tilde H + {c_1}I)^{ - 1}}{\tilde G^T}Y_2^T\) with the same \(d\) and \(c\) makes (41) equivalent to (36).
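For example, the following sketch (our naming) maps the linear dual (28) onto the unified form (41); the kernel dual (36) is obtained by substituting \(\tilde G\) and \(\tilde H\) for \(G\) and \(H\):

```python
import numpy as np

def unified_qp_terms(H, G, Y2, c1, c2):
    """Map dual (28) onto the unified form (41): returns Q, d and the box bound c*e."""
    I = np.eye(H.shape[1])
    M = np.linalg.solve(H.T @ H + c1 * I, G.T @ Y2.T)  # (H^T H + c1 I)^{-1} G^T Y2^T
    Q = Y2 @ G @ M                                     # Q = Y2 G (H^T H + c1 I)^{-1} G^T Y2^T
    d = Y2 @ np.ones(G.shape[0])                       # d = Y2 e2
    ce = c2 / np.diag(Y2) ** 2                         # c2 (Y2^2)^{-1} e2 (assumes y^_j != 0)
    return Q, d, ce
```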

In our proposed NFTSVM, the most expensive part in terms of computational cost is solving the dual QPPs (41). To solve the QPPs more efficiently, we use a state-of-the-art optimization technique, the successive overrelaxation (SOR) algorithm [10, 14, 32]. SOR is an excellent QPP solver because it can handle large-scale problems without storing all the data in memory [32]. The experimental results in the following section show that SOR accelerates our proposed NFTSVM.
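A minimal sketch of a projected SOR sweep for (41), in the spirit of [32]; the relaxation factor \(\omega \in (0,2)\), the stopping rule and all names are our assumptions rather than the exact routine used in the experiments:

```python
import numpy as np

def sor_qp(Q, d, ce, omega=1.0, tol=1e-5, max_iter=1000):
    """Solve min 0.5 a^T Q a - d^T a  s.t. 0 <= a <= ce by projected SOR sweeps."""
    alpha = np.zeros_like(d)
    for _ in range(max_iter):
        alpha_prev = alpha.copy()
        for i in range(len(d)):
            g_i = Q[i] @ alpha - d[i]          # i-th gradient coordinate, latest values
            alpha[i] = np.clip(alpha[i] - omega * g_i / Q[i, i], 0.0, ce[i])
        if np.linalg.norm(alpha - alpha_prev) < tol:
            break
    return alpha
```

Because each sweep updates one coordinate at a time using the most recent values, only rows of \(Q\) are touched in turn, which is what makes the method attractive for large-scale problems.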

For further analysis and understanding, we give some remarks on our NFTSVM. First, if we define the membership \({m_i} = 1\) when \({x_i}\) is a positive sample, the corresponding label is \({y_i} = 2{m_i} - 1 = 1\); if we define \({m_i} = 0\) when \({x_i}\) is a negative sample, the corresponding label is \({y_i} = 2{m_i} - 1 = - 1\). Therefore, \({Y_1} = diag\{ 1,1, \ldots ,1\} ,\;{Y_2} = diag\{ - 1, - 1, \ldots , - 1\}\), and the primal problems of NFTSVM become

$$\begin{gathered} \mathop {\min }\limits_{{w_1},{b_1},{\xi _2}} \,\,\frac{1}{2}||A{w_1} + {e_1}{b_1}||_2^2 + \frac{1}{2}{c_1}(||{w_1}||_2^2 + b_1^2) + {c_2}e_2^T{\xi _2} \hfill \\ \;s.t.\quad - (B{w_1} + {e_2}{b_1}) \geqslant {e_2} - {\xi _2},{\xi _2} \geqslant 0, \hfill \\ \end{gathered}$$
(42)
$$\begin{gathered} \mathop {\min }\limits_{{w_2},{b_2},{\xi _1}} \,\,\frac{1}{2}||B{w_2} + {e_2}{b_2}||_2^2 + \frac{1}{2}{c_3}(||{w_2}||_2^2 + b_2^2) + {c_4}e_1^T{\xi _1} \hfill \\ \;s.t.\quad (A{w_2} + {e_1}{b_2}) \geqslant {e_1} - {\xi _1},{\xi _1} \geqslant 0, \hfill \\ \end{gathered}$$
(43)

It is obvious that (42) and (43) are the same as (12) and (13). Moreover, if \({c_1} = {c_3} = 0\), (42) and (43) degenerate to (8) and (9). Thus, the classical TWSVM [8] and TBSVM [10] are special cases of our NFTSVM when its parameters are appropriately selected. Second, for nonlinear NFTSVM, if the number of training points is large, the rectangular kernel technique [33] can be applied to reduce the dimensionality of (36) and (37). Lastly, in practical classification applications we often have only label information and must define the fuzzy memberships based on prior knowledge. Fortunately, we can employ fuzzy membership functions available from fuzzy neural networks. In this paper, we directly use the fuzzy membership function also used in [27, 28]. Keller and Hunt [27] suggested the following way to assign fuzzy membership values such that a fuzzy 2-partition is formed. Given a positive sample \({x_i}\) with positive label \(+ 1\), the fuzzy membership is defined as

$${m_1}({x_i}) = 0.5 + \frac{{\exp ({C_0}({d_{ - 1}}({x_i}) - {d_1}({x_i}))/d) - \exp ( - {C_0})}}{{2(\exp {C_0} - \exp ( - {C_0}))}},\;{m_{ - 1}}({x_i}) = 1 - {m_1}({x_i})$$

Given a negative sample \({x_i}\) with negative label \(- 1\), the fuzzy membership is defined as

$${m_{ - 1}}({x_i}) = 0.5 + \frac{{\exp ({C_0}({d_1}({x_i}) - {d_{ - 1}}({x_i}))/d) - \exp ( - {C_0})}}{{2(\exp {C_0} - \exp ( - {C_0}))}},\;{m_1}({x_i}) = 1 - {m_{ - 1}}({x_i})$$

where \({d_1}({x_i})\) is the distance between \({x_i}\) and the mean of the positive class, \({d_{ - 1}}({x_i})\) is the distance between \({x_i}\) and the mean of the negative class, \(d\) is the distance between the means of the positive and negative classes, and \({C_0}\) is a constant controlling the shape of the membership function.
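A sketch of this membership assignment (our naming), returning each sample's membership to its own class; the membership to the other class is \(1 - m\), so a fuzzy 2-partition is formed:

```python
import numpy as np

def keller_hunt_memberships(X, y, C0):
    """Own-class membership for each sample, following the formulas above."""
    mu_pos = X[y == 1].mean(axis=0)             # mean of the positive class
    mu_neg = X[y == -1].mean(axis=0)            # mean of the negative class
    d = np.linalg.norm(mu_pos - mu_neg)         # distance between the class means
    d1 = np.linalg.norm(X - mu_pos, axis=1)     # d_1(x_i)
    dm1 = np.linalg.norm(X - mu_neg, axis=1)    # d_{-1}(x_i)
    denom = 2.0 * (np.exp(C0) - np.exp(-C0))
    m = np.where(y == 1,
                 0.5 + (np.exp(C0 * (dm1 - d1) / d) - np.exp(-C0)) / denom,
                 0.5 + (np.exp(C0 * (d1 - dm1) / d) - np.exp(-C0)) / denom)
    return m
```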

4 Experimental results

In order to evaluate the performance of our proposed NFTSVM, we investigated its classification accuracy and computational efficiency on both artificial datasets and real-world benchmark datasets. In our experiments, we compared our proposed NFTSVM with several classical classifiers, namely NFSVM [28], TWSVM [8], TBSVM [10], FTWSVM [29] and FTSVM [30]. All methods were implemented in MATLAB R2013a on a PC with an Intel (R) Core (TM) processor (3.40 GHz) and 4 GB RAM. TWSVM, TBSVM and our proposed NFTSVM were solved using the SOR algorithm, while the QPPs in NFSVM, FTWSVM and FTSVM were solved by the QP optimization toolbox in MATLAB. The "Accuracy" used in our experiments is defined as Accuracy = (TP + TN)/(TP + FP + TN + FN), where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively. The parameters were selected by standard tenfold cross-validation [31]: the parameters \({c_i}\) and the Gaussian kernel width \(\sigma\) were selected from the set \(\{ {2^i}|i = - 8, \ldots ,8\}\), and \({C_0}\) from the set \(\{ 0.5,\;1,\;1.5,\;2,\;2.5\}\).
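For reference, a sketch of the accuracy measure and parameter grids described above (names are ours):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Accuracy = (TP + TN) / (TP + FP + TN + FN) for labels in {-1, +1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    return (tp + tn) / len(y_true)

# Parameter grids used in the experiments.
c_grid = [2.0 ** i for i in range(-8, 9)]   # c_i and sigma from {2^i | i = -8, ..., 8}
C0_grid = [0.5, 1.0, 1.5, 2.0, 2.5]         # C0 candidates
```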

4.1 Artificial datasets

In this subsection, three artificial datasets, namely the crossplane (XOR), complex XOR and Ripley's synthetic datasets [34], are used to validate our proposed NFTSVM on linearly inseparable problems. Ripley's synthetic dataset contains 250 training samples and 1000 test samples. The average results of linear NFSVM, TWSVM, TBSVM and our proposed NFTSVM on the crossplane (XOR) and complex XOR datasets are reported in Table 1. The average results of nonlinear NFSVM, TWSVM, TBSVM and our NFTSVM with Gaussian kernels on Ripley's synthetic dataset are presented in Table 2. In all tables, "Acc" and "Std" denote accuracy and standard deviation, respectively.

Table 1 Classification accuracy (training and testing) on crossplane (XOR) datasets
Table 2 Classification accuracy (training and testing) on Ripley datasets

From Table 1, we can conclude that NFSVM cannot deal well with the XOR and complex XOR problems. Although TWSVM and TBSVM perform well on the XOR dataset, their performance decreases on the complex XOR dataset. In contrast, our proposed NFTSVM performs best on both the XOR and complex XOR datasets. According to Table 2, the accuracy of our proposed NFTSVM is 87.60%, which beats all the others and demonstrates the better generalization ability of our method. Figure 1 illustrates the crossplane (XOR) and complex XOR datasets. Ripley's dataset and the hyperplanes obtained by TWSVM, TBSVM and our proposed NFTSVM are shown in Fig. 2.

Fig. 1 Crossplane (a) and complex XOR (b) datasets

Fig. 2 Ripley's dataset: (a) the distribution of the data; (b) the results of TWSVM; (c) the results of TBSVM; and (d) the results of NFTSVM

4.2 UCI datasets

To further compare our proposed NFTSVM with NFSVM, TWSVM, TBSVM, FTWSVM and FTSVM, we selected 15 datasets from the UCI machine learning repository [35]. The numerical results of the linear versions are given in Table 3, with the best results highlighted in bold. The accuracy of linear NFTSVM is better than that of NFSVM, TWSVM, TBSVM, FTWSVM and FTSVM on most of the datasets. A win-tie-loss (W-T-L) summarization based on mean accuracy is also reported at the bottom of Table 3. For example, on the Heart-Statlog dataset, the accuracy of our NFTSVM is 86.30%, while that of NFSVM is 84.07%, TWSVM 84.81%, TBSVM 85.19%, FTWSVM 84.85% and FTSVM 84.81%.

Table 3 Tenfold testing percentage accuracy of linear classifiers

Table 4 shows the experimental results of nonlinear NFSVM, TWSVM, TBSVM, FTWSVM, FTSVM and our NFTSVM on the selected UCI datasets, where the Gaussian kernel \(K(x,y) = {e^{ - ||x - y|{|^2}/2{\sigma ^2}}}\) was used. The accuracy and training time of each method are reported. The results in Table 4 are similar to those in Table 3 and therefore confirm our conclusion again. In addition, from Tables 3 and 4 we can see that the training times of NFSVM, FTWSVM and FTSVM are longer than those of TWSVM, TBSVM and our NFTSVM, which indicates that SOR speeds up the training procedure.

Table 4 Tenfold testing percentage accuracy of nonlinear classifiers

4.3 Image recognition

In this subsection, we test our method on image recognition tasks. Three well-known, publicly available image-classification benchmarks, handwritten digits (USPS), objects (COIL-20) and faces (AR), are used to evaluate our proposed NFTSVM against TWSVM, TBSVM, FTWSVM and FTSVM. The USPS database [36] consists of gray-scale handwritten digit images from 0 to 9, as shown in Fig. 3. Each digit has 1100 images, and the resolution of each image is 16 × 16. We selected five pairwise digit combinations of varying difficulty for odd vs. even digit classification. COIL-20 [37] is a database of gray-scale images of 20 objects, illustrated in Fig. 4. The objects were placed on a motorized turntable against a black background, and images were taken at pose intervals of 5°, corresponding to 72 images per object. In our experiments, we resized each image to 32 × 32. The AR database [38] contains 100 subjects, each with 26 face images taken in two sessions of 13 images each. We selected the 14 unoccluded images from these two sessions, as shown in Fig. 5. The 1400 images are all cropped to the same size of 40 × 30. For each dataset, we randomly split the images of each class into two equal parts, one for training and the other for testing. This process was repeated ten times and the average result is reported. Moreover, we only consider the Gaussian kernel for these methods. Table 5 lists the experimental results of these methods on the USPS, COIL-20 and AR datasets. According to the W-T-L summarization, the proposed NFTSVM obtains the best results on USPS, COIL-20 and AR in most cases.

Fig. 3 Subjects in the USPS database

Fig. 4 The subjects in the COIL-20 database

Fig. 5 An illustration of the selected 14 images of one person from the AR database

Table 5 The classification performance comparison on the COIL-20, AR and USPS datasets

5 Conclusions

Inspired by NFSVM and TWSVM, a new fuzzy twin support vector machine (NFTSVM) was presented in this paper. NFTSVM employs fuzzy memberships to weigh the margin of each training sample, which improves the generalization ability. As discussed in the paper, TWSVM and TBSVM are special cases of the proposed NFTSVM when its parameters are appropriately selected. Experimental results on a number of benchmarks illustrate the superiority of the proposed NFTSVM. In future work, we will address three issues. First, the proposed NFTSVM has five parameters, so parameter selection is a practical problem that should be carefully investigated. Second, we will construct fast algorithms for solving the QPPs, for example based on genetic algorithms (GA) [39]. Third, extensions of NFTSVM to multi-class classification [40] and uncertainty mining [41] are also of interest.