1 Introduction

The support vector machine (SVM), introduced by Vapnik [1], is an extremely powerful kernel-based machine learning technique for data classification. SVM has been successfully applied in various fields, such as pattern recognition [2], text classification [3], and imbalanced classification [4]. The goal of SVM is to find a hyper-plane that separates the two classes well. Many hyper-planes can meet this requirement, and SVM selects the one that follows the maximum margin principle in statistical learning theory [1]. SVM solves a quadratic programming problem (QPP), which guarantees that once an optimal solution is obtained, it is the unique (global) solution. By introducing the kernel trick into the dual QPP, SVM can solve nonlinear classification problems successfully and can also effectively overcome the “curse of dimensionality”. It is well known that the training cost of SVM is O(l^3), where l is the total size of the training data, so training an SVM becomes very time-consuming as the number of training samples increases.

To improve the computational speed of SVM, Jayadeva et al. [5] recently proposed the twin support vector machine (TSVM) for binary classification, based on the idea of GEPSVM [6]. TSVM generates two nonparallel hyper-planes such that each hyper-plane is closer to one class and as far as possible from the other. It is implemented by solving two smaller-sized QPPs, which makes the learning speed of TSVM approximately four times faster than that of the classical SVM. TSVM has now been widely used because of its low computational complexity, and many variants have been proposed, such as smooth TSVM [7], twin support vector regression [8,9,10], and the non-parallel Universum support vector machine [11]. However, each of them minimizes the empirical risk, which easily leads to over-fitting. The twin bounded support vector machine (TBSVM) [12, 13] minimizes the structural risk by adding a regularization term, and Wang et al. [14] proposed an improved ν-twin bounded support vector machine (Iν-TBSVM). Various other algorithms [15,16,17,18,19,20] also implement the structural risk minimization principle and have better generalization ability. However, these algorithms do not sufficiently exploit the prior distribution information of samples within classes.

In fact, different classes may possess diverse structural information when dealing with real-world data. This structural information may be important for classification and has a great impact on the decision function. One obvious disadvantage of TSVM is that it usually concentrates on separating the two classes well but ignores the underlying structural information within classes. Based on studies of structural SVMs, such as SLMM [21] and SRSVM [22], Qi et al. [23] proposed a novel structural twin support vector machine (S-TSVM) in 2013. S-TSVM extracts the structural information by a hierarchical clustering method and minimizes the compactness of each cluster through its covariance matrix. To improve the computational speed of S-TSVM, Xu et al. [24] proposed the structural least squares twin support vector machine (S-LSTSVM). Since S-LSTSVM only needs to solve a pair of systems of linear equations, it greatly accelerates the computation.

Although experimental results reveal that S-TSVM has excellent generalization ability, it has an obvious disadvantage: it ignores the fact that samples in different clusters have different importance, even though exploiting this information can improve the classification accuracy. Ye et al. [25] proposed a weighted TSVM with local information (WLTSVM), which introduces a KNN method that treats the two classes differently in the objective function and in the constraints of the model. Inspired by S-TSVM and WLTSVM, Pan et al. [26] proposed a K-nearest neighbor based structural twin support vector machine (KNN-STSVM). Recently, Mir et al. [27] proposed the KNN-based least squares twin support vector machine (KNN-LSTSVM), which applies the similarity information of samples to LSTSVM, and Tanveer et al. [28] proposed an efficient regularized K-nearest neighbor based weighted twin support vector regression (RKNNWTSVR). By introducing regularization terms into KNNWTSVR [10] and replacing the 1-norm of the slack variables with the 2-norm, RKNNWTSVR not only saves computational cost but also improves the generalization performance.

However, KNN-STSVM relies on the empirical risk minimization principle, which easily leads to over-fitting if outliers exist [29,30,31] and reduces the prediction accuracy of the classifier. RKNNWTSVR alleviates this problem by introducing extra regularization terms into each objective function to implement the structural risk minimization principle, but it ignores the helpful underlying structural information of the data, which may contain useful prior domain knowledge for training a model. Motivated by the studies above, we improve the primal problems of KNN-STSVM by adding regularization terms to the objective functions. By doing this, our new formulation, called RKNN-STSVM, is not only free of singularities but also theoretically supported by statistical learning theory [32,33,34]. Similar to KNN-STSVM, RKNN-STSVM constructs two nonparallel hyper-planes by solving two smaller QPPs. We also incorporate the structural information of the corresponding class into the model. In addition, we not only strengthen the structural information by giving different weights to the samples, but also delete redundant constraints to speed up the computation. The contributions of this paper are as follows: (1) By introducing a regularization term, the matrix appearing in the dual problems is guaranteed to be invertible; without the regularization term this prerequisite cannot always be satisfied. The dual problems of our algorithm can therefore be derived without any extra assumption and need no further modification. (2) In the primal problems of KNN-STSVM, the empirical risk is minimized, whereas in our RKNN-STSVM the structural risk is minimized by adding a regularization term in the spirit of maximizing the margin. (3) To shorten the training time, an effective method, the dual coordinate descent method (DCDM), is applied to our RKNN-STSVM.

Nevertheless, as with SVM, the long training time is still one of the main challenges of our RKNN-STSVM. Many fast optimization algorithms have been presented to accelerate training, including the decomposition method [35], sequential minimal optimization (SMO) [36], geometric algorithms [37], and the dual coordinate descent method (DCDM) [38]. The DCDM algorithm directly solves the dual QPP of SVM by updating one variable at a time; each single-variable sub-problem is solved exactly, so the update yields the largest decrease of the objective value along that coordinate. DCDM shows a fast learning speed and great efficiency in experiments, so we introduce it into our RKNN-STSVM. Comparisons of RKNN-STSVM with several other algorithms in terms of classification accuracy and learning time on several UCI datasets and Caltech image datasets indicate the superiority of our RKNN-STSVM.

The rest of the paper is organized as follows. Section 2 outlines TSVM, S-TSVM and KNN-STSVM. Details of RKNN-STSVM, including both the linear and nonlinear cases, are given in Section 3. Section 4 discusses the experimental results on twenty-seven benchmark datasets and two popular image recognition datasets to investigate the effectiveness of our proposed RKNN-STSVM. In Section 5, the DCDM algorithm is introduced into RKNN-STSVM to accelerate the learning speed. The last section concludes the paper.

2 Related work

In this section, we give a brief description of TSVM, S-TSVM and KNN-STSVM. The training samples are denoted by the set T = {(x1,y1),…,(xl,yl)}, where \(x_{i} \in R^{n}\), \(y_{i} \in \{+1,-1\}\), i = 1,2,…,l. For convenience, the matrix \(A \in R^{l_{1}\times n}\) represents the positive samples and the matrix \(B \in R^{l_{2}\times n}\) represents the negative samples.

2.1 Twin support vector machine

TSVM generates two nonparallel hyper-planes instead of a single one as in the conventional SVMs. The two nonparallel hyper-planes are obtained by solving two smaller sized QPPs as opposed to a single large one in the standard SVMs. The linear TSVM aims to find two nonparallel hyper-planes in n-dimensional input space,

$$ \begin{array}{@{}rcl@{}} {w_{1}^{T}} x+b_{1}= 0,~~{w_{2}^{T}} x+b_{2}= 0, \end{array} $$
(1)

such that each hyper-plane is closer to one class and as far as possible from the other. A new sample is assigned to class + 1 or − 1 depending upon its proximity to the two nonparallel hyper-planes.

The formulations of linear TSVM can be written as follows:

$$ \begin{array}{@{}rcl@{}} \displaystyle{\min_{w_{1}, b_{1},\xi }} ~~&&\frac{1}{2} (Aw_{1}+e_{1}b_{1})^{T}(Aw_{1}+e_{1}b_{1})+c_{1}{e_{2}^{T}}\xi\\ \text{s.t.}~~&& -(Bw_{1}+e_{2}b_{1})+\xi \geqslant e_{2},\xi \geqslant 0, \end{array} $$
(2)

and

$$ \begin{array}{@{}rcl@{}} \displaystyle{\min_{w_{2} ,b_{2} ,\eta }} ~~&&\frac{1}{2} (Bw_{2}+e_{2}b_{2})^{T}(Bw_{2}+e_{2}b_{2})+ c_{2}{e_{1}^{T}} \eta\\ \text{s.t.}~~&& (Aw_{2}+e_{1}b_{2})+\eta \geqslant e_{1},\eta \geqslant 0, \end{array} $$
(3)

where \(c_{1}, c_{2} \geqslant 0\) are pre-specified penalty factors, and e1 and e2 are vectors of ones of appropriate dimensions. By introducing Lagrangian multipliers, the Wolfe duals of the QPPs (2) and (3) can be written as follows:

$$ \begin{array}{@{}rcl@{}} \displaystyle{\max_{\alpha} }~~&&{e_{2}^{T}}\alpha-\frac{1}{2}\alpha^{T}G(H^{T}H )^{-1}G^{T} \alpha\\ \text{s.t.}~~&& 0 \leqslant \alpha \leqslant c_{1}, \end{array} $$
(4)

where G = [B e2] and H = [A e1], and

$$ \begin{array}{@{}rcl@{}} \displaystyle{\max_{\gamma}} ~~&&{e_{1}^{T}}\gamma-\frac{1}{2}\gamma^{T}P(Q^{T}Q )^{-1}P^{T} \gamma\\ \text{s.t.}~~&& 0 \leqslant \gamma\leqslant c_{2}, \end{array} $$
(5)

where P = [A e1] and Q = [B e2].

After solving the QPPs (4) and (5), we obtain

$$ \begin{array}{@{}rcl@{}} \left[\begin{array}{l}w^{(1)}\\b^{(1)} \end{array}\right]=-(H^{T}H)^{-1}G^{T}\alpha, \end{array} $$
(6)

and

$$ \begin{array}{@{}rcl@{}} \left[\begin{array}{l}w^{(2)}\\b^{(2)} \end{array}\right]=(Q^{T}Q)^{-1}P^{T}\gamma. \end{array} $$
(7)

A new testing sample \(x \in R^{n}\) is assigned to class i (i = +1, −1) by comparing the following perpendicular distance, which measures the distance of the sample from the two hyper-planes (1), i.e.,

$$ \begin{array}{@{}rcl@{}} \displaystyle{Class~~i=\arg\min_{i=1,2}}\frac{|x^{T}w^{(i)}+b^{(i)}|}{\|w^{(i)}\|}. \end{array} $$
(8)

In TSVM, if the number of samples in each class is approximately l/2, the computational complexity of TSVM is O(2 × (l/2)^3). Thus the ratio of run-times between SVM and TSVM is l^3/(2 × (l/2)^3) = 4, which implies that TSVM works approximately four times faster than SVM [5].
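As a concrete illustration of the decision rule (8), the following sketch classifies new samples by their perpendicular distances to the two hyper-planes; it assumes the pairs (w1, b1) and (w2, b2) have already been obtained from (6) and (7), and the function name tsvm_predict is ours rather than anything from [5].

```python
import numpy as np

def tsvm_predict(X, w1, b1, w2, b2):
    """Assign each row of X to class +1 or -1 by its perpendicular
    distance to the two nonparallel hyper-planes, as in Eq. (8)."""
    d1 = np.abs(X @ w1 + b1) / np.linalg.norm(w1)  # distance to the class +1 plane
    d2 = np.abs(X @ w2 + b2) / np.linalg.norm(w2)  # distance to the class -1 plane
    return np.where(d1 <= d2, 1, -1)
```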

2.2 Structural twin support vector machine

S-TSVM extracts the structural information of each class by a clustering method. It then incorporates the data distributions of the clusters in the different classes into the objective functions of TBSVM, which makes S-TSVM fully exploit the structural information of the samples. Suppose there are CP clusters in the positive class P and CN clusters in the negative class N, respectively, i.e., \(P=P_{1} \cup {\ldots } P_{i} \cup {\ldots } P_{C_{P}}, N=N_{1} \cup {\ldots } N_{j} \cup {\ldots } N_{C_{N}}\). The optimization problems of the S-TSVM model can then be formulated as follows [23]:

$$ \begin{array}{@{}rcl@{}} \displaystyle{\min_{w_{+} ,b_{+} ,\xi }} ~~&&\frac{1}{2} \|Aw_{+}{\kern1.7pt}+{\kern1.7pt}e_{1}b_{+}\|_{2}^{2}{\kern1.7pt}+{\kern1.7pt}c_{1}{e_{2}^{T}}\xi+\!\frac{1}{2} c_{2}(\|w_{+}\|_{2}^{2}+b_{+}^{2})\\ &&+\frac{1}{2} c_{3}w_{+}^{T}{\Sigma}_{+} w_{+}\\ \text{s.t.}~~&& -(Bw_{+}+e_{2}b_{+})+\xi \geqslant e_{2},\\ ~~&&\xi \geqslant 0, \end{array} $$
(9)

and

$$ \begin{array}{@{}rcl@{}} \displaystyle{\min_{w_{-}, b_{-} ,\eta}} ~~&&\frac{1}{2} \|Bw_{-}+e_{2}b_{-}\|_{2}^{2}+c_{4}{e_{1}^{T}}\eta+\!\frac{1}{2} c_{5}(\|w_{-} \|_{2}^{2}{\kern1.7pt}+{\kern1.7pt}b_{-}^{2})\\ &&+\frac{1}{2} c_{6}w_{-}^{T}{\Sigma}_{-} w_{-}\\ \text{s.t.}~~&& \ (Aw_{-}+e_{1}b_{-})+\eta \geqslant e_{1},\\ ~~&&\eta \geqslant 0, \end{array} $$
(10)

where \(c_{i} \geqslant 0 (i = 1,\ldots ,6)\) are the pre-specified penalty factors, e1 and e2 are vectors of ones of appropriate dimensions, both ξ and η are slack variables. \({\Sigma }_{+}={\Sigma }_{P_{1}} +\ldots +{\Sigma }_{P_{C_{P}}}\), \({\Sigma }_{-}={\Sigma }_{N_{1}}+\ldots +{\Sigma }_{N_{C_{N}}}\), \({\Sigma }_{P_{i}}\) and \({\Sigma }_{N_{j}}\) are respectively the covariance matrices of clusters in the two classes. The corresponding dual problems of S-TSVM are given as follows:

$$ \begin{array}{@{}rcl@{}} \displaystyle{\max_{\alpha}} ~~&&{e_{2}^{T}}\alpha - \frac{1}{2}\alpha^{T}G(H^{T}H+c_{2}I+c_{3}J)^{-1}G^{T} \alpha\\ \text{s.t.}~~&& 0 \leqslant \alpha \leqslant c_{1}e_{2}, \end{array} $$
(11)

and

$$ \begin{array}{@{}rcl@{}} \displaystyle{\max_{\beta}} ~~&& {e_{1}^{T}}\beta - \frac{1}{2}\beta^{T}P(Q^{T}Q+c_{5}I+c_{6}F)^{-1}P^{T} \beta\\ \text{s.t.}~~&& 0 \leqslant \beta \leqslant c_{4}e_{1}, \end{array} $$
(12)

where

$$ \begin{array}{@{}rcl@{}} G&=&[B \ e_{2}], \quad H=[A \ e_{1}], \quad J=\left[\begin{array}{ll} {\Sigma}_{+} & 0 \\ 0 & 0 \end{array}\right], \quad P=[A \ e_{1}],\\ Q&=&[B \ e_{2}], \quad F=\left[\begin{array}{ll} {\Sigma}_{-} & 0 \\ 0 & 0 \end{array}\right]. \end{array} $$

A new data point \(x \in R^{n}\) is classified as the positive class or the negative class depending on which of the two hyper-planes it lies closer to. The decision function of S-TSVM is

$$ \begin{array}{@{}rcl@{}} f(x) = {\arg}\ \min\limits_{\pm}\left\lbrace \frac{\mid x^{T}w_{\pm} + b_{\pm}\mid}{\parallel w_{\pm}\parallel}\right\rbrace . \end{array} $$
(13)

2.3 K-nearest neighbor structural twin support vector machine

To further improve the computational precision and speed of a classifier, KNN-STSVM is proposed in the spirit of S-TSVM. The formulation of KNN-STSVM is similar to TSVM, as it also tries to obtain two nonparallel hyper-planes for two classes by solving a pair of small-sized QPPs as follows [26]:

$$ \begin{array}{@{}rcl@{}} \displaystyle{\min_{w_{+} ,b_{+} ,\xi}} ~~&&\frac{1}{2} {\sum\limits_{i=1}^{l_{1}}}{\sum\limits_{j=1}^{l_{1}}}{W_{s,ij}^{(1)}}(w_{+}^{T}x_{j}^{(1)}{+}b_{+})^{2}{+} c_{1}e_{-}^{T}\xi{+}\frac{1}{2}c_{2}w_{+}^{T}{\Sigma}_{+}w_{+}\\ \text{s.t.}~~&& \ -f_{j}^{(2)}(w_{+}^{T}x_{j}^{(2)}+b_{+})+\xi_{j} \geqslant f_{j}^{(2)},\\ ~~&&\xi_{j} \geqslant 0, j=1, \ldots, l_{2}, \end{array} $$
(14)

and

$$ \begin{array}{@{}rcl@{}} \displaystyle{\min_{w_{-} ,b_{-} ,\eta}} ~~&&\frac{1}{2} {\sum\limits_{i=1}^{l_{2}}}{\sum\limits_{j=1}^{l_{2}}}{W_{s,ij}^{(2)}}(w_{-}^{T}x_{j}^{(2)}{+}b_{-})^{2}{+}c_{3}e_{+}^{T}\eta{+}\frac{1}{2}c_{4}w_{-}^{T}{\Sigma}_{-}w_{-}\\ \text{s.t.}~~&& \ f_{j}^{(1)}(w_{-}^{T}x_{j}^{(1)}+b_{-})+\eta_{j} \geqslant f_{j}^{(1)},\\ ~~&&\eta_{j} \geqslant 0, j=1, \ldots, l_{1}, \end{array} $$
(15)

where the parameters \(c_{1}, c_{2}, c_{3}, c_{4}\geqslant 0\) determine the penalty weights, and ξ, η are slack variables. Both e+ and e− are vectors of ones of appropriate dimensions, Ws,ij is the intra-class weight matrix, and f is a vector whose entries are either 0 or 1. \({\Sigma }_{+}={\Sigma }_{P_{1}}+\ldots +{\Sigma }_{P_{C_{P}}}\), \({\Sigma }_{-}={\Sigma }_{N_{1}}+\ldots +{\Sigma }_{N_{C_{N}}}\), where \({\Sigma }_{P_{i}}\) and \({\Sigma }_{N_{j}}\) are respectively the covariance matrices corresponding to the i th and the j th clusters in the two classes. The corresponding dual problems of KNN-STSVM are given as follows:

$$ \begin{array}{@{}rcl@{}} \displaystyle{\max_{\alpha}} ~~&& e_{-}^{T}F\alpha - \frac{1}{2}\alpha^{T}(F^{T}G)(H^{T}DH+c_{2}J_{+})^{-1}(G^{T} F)\alpha \\ \text{s.t.}~~&& 0 \leqslant \alpha \leqslant c_{1}e_{-}, \end{array} $$
(16)

and

$$ \begin{array}{@{}rcl@{}} \displaystyle{\max_{\gamma }} ~~&& e_{+}^{T}P\gamma-\frac{1}{2}\gamma^{T}(P^{T}H)(G^{T}QG+c_{4}J_{-})^{-1}(H^{T}P) \gamma \\ \text{s.t.}~~&& 0 \leqslant \gamma\leqslant c_{3}e_{+}. \end{array} $$
(17)

A new testing sample \(x \in R^{n}\) is assigned to class i (i = +1, −1) by comparing the following perpendicular distance, which measures the distance of the sample from the two hyper-planes, i.e.,

$$ \begin{array}{@{}rcl@{}} \displaystyle{Class~~i=\arg\min_{i=1,2}}\frac{|x^{T}w^{(i)}+b^{(i)}|}{\|w^{(i)}\|}. \end{array} $$
(18)

3 An efficient regularized K-nearest neighbor structural twin support vector machine

Motivated by [23, 25, 26, 28], we propose a reasonable and efficient variant of KNN-STSVM called regularized KNN-based weighted structural twin support vector machine, RKNN-STSVM for short. RKNN-STSVM has three steps: clustering, applying KNN tricks, and learning. RKNN-STSVM adopts some clustering techniques to capture the data distribution within classes, and KNN tricks are also applied to our model to improve the generalization performance.

3.1 Clustering

Structural support vector machines usually consider the data distribution through clustering methods. Here, we use the same clustering method as S-TSVM (Ward's linkage clustering, a hierarchical clustering method) to find the clusters in each individual class. The main advantage of Ward's linkage clustering (WIL) [39] is that the clusters it derives are compact and spherical, which provides a reliable basis for calculating the covariance matrices. Concretely, if S and T are two clusters with means μS and μT, the Ward's linkage W(S,T) between S and T is computed as [21]

$$ \begin{array}{@{}rcl@{}} ~~&&W(S,T)=\frac{|S| \cdot |T| \cdot \|\mu_{S}-\mu_{T}\|^{2}}{|S|+|T|}. \end{array} $$
(19)

Initially, each individual sample is considered as a cluster. The Ward's linkage of two samples xi and xj is \(W(x_{i},x_{j}) = \|x_{i}-x_{j}\|^{2}/2\). When two clusters A and B are merged into a new cluster A′, the linkage W(A′,C) with any remaining cluster C can be expressed in terms of W(A,C), W(B,C) and W(A,B) as [21]

$$ \begin{array}{@{}rcl@{}} ~~&&W(A^{\prime},C)=\frac{(|A|+|C|)W(A,C)+(|B|+|C|)W(B,C)-|C|W(A,B)}{|A|+|B|+|C|}.\\ \end{array} $$
(20)

The Ward's linkage between the clusters to be merged increases as the number of clusters decreases during the hierarchical clustering. A relation curve between the merge distance and the number of clusters can be drawn, and the number of clusters is determined by finding the knee point of this curve [40]. Furthermore, WIL is also applicable to the nonlinear case.
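A minimal sketch of this clustering step is given below. It relies on SciPy's Ward linkage and simply sums the per-cluster covariance matrices to form Σ+ (or Σ−); the number of clusters is assumed to be supplied externally, e.g. from the knee point of the merge-distance curve, and whether the covariances are normalized by cluster size is a modeling choice not fixed by the text. The function name is ours.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def class_covariance(X, n_clusters):
    """Cluster the samples of one class with Ward's linkage and return the
    sum of the per-cluster covariance matrices (Sigma_+ or Sigma_-)."""
    Z = linkage(X, method='ward')                         # hierarchical Ward clustering
    labels = fcluster(Z, t=n_clusters, criterion='maxclust')
    sigma = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(labels):
        Xc = X[labels == c]
        if len(Xc) > 1:
            sigma += np.cov(Xc, rowvar=False, bias=True)  # covariance of one cluster
    return sigma
```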

3.2 K-nearest neighbor method

Weighted TSVM [25] mines as much underlying similarity information within samples as possible. It first constructs two neighbor graphs in the input space, the intra-class graph Gs and the inter-class graph Gd, to determine the weights of samples from one class and to extract the possible SVs residing in the other class. For each sample xi in class +1, two nearest-neighbor sets are defined: Neas(xi), which contains its neighbors in class +1, and Nead(xi), which contains its neighbors in class −1. Specifically,

$$ \begin{array}{@{}rcl@{}} Nea_{s}(x_{i})&=&\{{x^{j}_{i}}|\ \ \text{if}\ {x^{j}_{i}}\ \text{and} \ x_{i}\ \text{belong to the same class},\\ && 0 \leq j \leq k_{1}\}, \end{array} $$
(21)

and

$$ \begin{array}{@{}rcl@{}} Nea_{d}(x_{i})&=&\{{x^{j}_{i}}|\ \ \text{if}\ {x^{j}_{i}}\ \text{and} \ x_{i}\ \text{belong to the different classes},\\ && 0 \leq j \leq k_{2}\}. \end{array} $$
(22)

Clearly, Neas denotes the set of its k1-nearest neighbors in class +1, and Nead stands for the set of its k2-nearest neighbors in class −1. The two adjacency (i.e., similarity) matrices of class +1 corresponding to Gs and Gd can be defined [41], respectively, as follows:

$$ \begin{array}{@{}rcl@{}} W_{s,ij}=\left\{\begin{array}{ll} &1, \quad \text{if} \ x_{j}\in {Nea_{s}(x_{i})}\ \text{or} \ x_{i}\in {Nea_{s}(x_{j})},\\ &0, \quad \text{otherwise}, \end{array}\right. \end{array} $$
(23)

and

$$ \begin{array}{@{}rcl@{}} W_{d,ij}=\left\{\begin{array}{ll} &1, \quad \text{if} \ x_{j}\in {Nea_{d}(x_{i})}\ \text{or} \ x_{i}\in {Nea_{d}(x_{j})}.\\ &0, \quad \text{otherwise}. \end{array}\right. \end{array} $$
(24)

When Ws,ij = 1 or Wd,ij = 1, an undirected edge between xi and xj is added to the corresponding graph. To find the possible SVs among the samples in class −1, we summarize the weight matrix Wd,ij of Gd by the indicator fj defined as follows:

$$ \begin{array}{@{}rcl@{}} f_{j}=\left\{\begin{array}{ll} &1, \quad \text{if}\ \exists \ j, \quad W_{d,ij} \neq 0,\hspace{1.2in} \\ &0, \quad \text{otherwise}.\hspace{1.2in} \end{array}\right. \end{array} $$
(25)
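The following sketch (the function name knn_weights is ours) builds the intra-class weights d_j = Σ_i W_s,ij used later in (28) and the inter-class indicator f of (25) with scikit-learn's nearest-neighbor search; A and B stand for the positive and negative sample matrices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_weights(A, B, k1=5, k2=5):
    """Intra-class weights d_j for class A and the 0/1 indicator f for
    class B, following Eqs. (23)-(25)."""
    # intra-class graph: symmetric k1-NN adjacency within A
    nn_s = NearestNeighbors(n_neighbors=k1 + 1).fit(A)   # +1 because a point is its own neighbour
    _, idx = nn_s.kneighbors(A)
    Ws = np.zeros((len(A), len(A)))
    for i, neigh in enumerate(idx):
        Ws[i, neigh[1:]] = 1
    Ws = np.maximum(Ws, Ws.T)                            # "or" symmetrisation, Eq. (23)
    d = Ws.sum(axis=0)                                   # d_j = sum_i W_s,ij
    # inter-class graph: which B samples are k2-NN of some A sample
    nn_d = NearestNeighbors(n_neighbors=k2).fit(B)
    _, idx_d = nn_d.kneighbors(A)
    f = np.zeros(len(B))
    f[np.unique(idx_d)] = 1                              # Eq. (25)
    return d, f
```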

3.3 Linear case

By introducing the extra regularization terms \((\|w_{+}\|_{2}^{2}+b_{+}^{2})\) and \((\|w_{-}\|_{2}^{2}+b_{-}^{2})\) into primal problems (14) and (15), respectively, our RKNN-STSVM can be expressed as follows:

$$ \begin{array}{@{}rcl@{}} \displaystyle{\min_{w_{+}, b_{+},\xi}} ~~&&\frac{1}{2} {\sum\limits_{i=1}^{l_{1}}}{\sum\limits_{j=1}^{l_{1}}}{W_{s,ij}^{(1)}}(w_{+}^{T}x_{j}^{(1)}+b_{+})^{2}+ c_{1}e_{-}^{T}\xi\\ &&+\frac{1}{2}c_{2}w_{+}^{T}{\Sigma}_{+}w_{+}+\frac{1}{2}c_{3}(\|w_{+}\|_{2}^{2}+b_{+}^{2})\\ \text{s.t.}~~&& \ -f_{j}^{(2)}(w_{+}^{T}x_{j}^{(2)} + b_{+})+\xi_{j} \geqslant f_{j}^{(2)},\\ ~~&&\xi_{j} \geqslant 0, j=1, \ldots, l_{2}, \end{array} $$
(26)

and

$$ \begin{array}{@{}rcl@{}} \displaystyle{\min_{w_{-}, b_{-}, \eta}} ~~&&\frac{1}{2} {\sum\limits_{i=1}^{l_{2}}}{\sum\limits_{j=1}^{l_{2}}}{W_{s,ij}^{(2)}}(w_{-}^{T}x_{j}^{(2)}+b_{-})^{2}+c_{4}e_{+}^{T}\eta\\ &&+ \frac{1}{2}c_{5}w_{-}^{T}{\Sigma}_{-}w_{-}+\frac{1}{2}c_{6}(\|w_{-}\|_{2}^{2}+b_{-}^{2})\\ \text{s.t.}~~&& \ f_{j}^{(1)}(w_{-}^{T}x_{j}^{(1)}+b_{-})+\eta_{j} \geqslant f_{j}^{(1)},\\ &&\eta_{j} \geqslant 0, j=1, \ldots, l_{1}, \end{array} $$
(27)

where \(c_{1}, c_{2},\ldots , c_{6}\geqslant 0\) are positive parameters given in advance, which control the tradeoff among the terms in the objective functions. ξ and η are slack variables, e+ and e− are vectors of ones of appropriate dimensions, and Ws,ij and fj are defined as in Section 3.2. \({\Sigma }_{+}={\Sigma }_{P_{1}}+\ldots +{\Sigma }_{P_{C_{P}}} \), \({\Sigma }_{-}={\Sigma }_{N_{1}} +\ldots +{\Sigma }_{N_{C_{N}}} \), where \({\Sigma }_{P_{i}}\) and \({\Sigma }_{N_{j}}\) are respectively the covariance matrices corresponding to the i th and the j th clusters in the two classes.

To solve (26), the Lagrangian function is defined as:

$$ \begin{array}{@{}rcl@{}} L_{1}=&&\frac{1}{2} {\sum\limits_{j=1}^{l_{1}}}d_{j}^{(1)}(w_{+}^{T}x_{j}^{(1)} + b_{+})^{2}+ c_{1}e_{-}^{T}\xi+ \frac{1}{2}c_{2}w_{+}^{T}{\Sigma}_{+}w_{+}\\ &&+\frac{1}{2}c_{3}(\|w_{+}\|_{2}^{2}+b_{+}^{2})\\ &&-{\sum\limits_{j=1}^{l_{2}}}\alpha_{j}(-f_{j}^{(2)}(w_{+}^{T}x_{j}^{(2)} + b_{+})+ \xi_{j} - f_{j}^{(2)})-\beta^{T}\xi,\\ \end{array} $$
(28)

where \(d_{j}^{(1)}={{\sum }_{i=1}^{l_{1}}}W_{s,ij}^{(1)}\), and α, β are Lagrangian multipliers. Differentiating the Lagrangian function L1 with respect to the variables w+, b+ and ξ yields the following Karush-Kuhn-Tucker (KKT) conditions:

$$ \begin{array}{@{}rcl@{}} \frac{\partial L_{1}}{\partial w_{+}}&=&{\sum\limits_{j=1}^{l_{1}}}d_{j}^{(1)}x_{j}^{(1)} (w_{+}^{T}x_{j}^{(1)}+b_{+}) +c_{2} {\Sigma}_{+}w_{+}\\ &&+ {\sum\limits_{j=1}^{l_{2}}}\alpha_{j} f_{j}^{(2)}x_{j}^{(2)} + c_{3}w_{+}=0, \end{array} $$
(29)
$$ \begin{array}{@{}rcl@{}} \frac{\partial L_{1}}{\partial b_{+}}={\sum\limits_{j=1}^{l_{1}}}d_{j}^{(1)}(w_{+}^{T}x_{j}^{(1)}+b_{+})+{\sum\limits_{j=1}^{l_{2}}}\alpha_{j} f_{j}^{(2)}+c_{3}b_{+}=0 , \end{array} $$
(30)
$$ \frac{\partial L_{1}}{\partial \xi}=c_{1}e_{-}-\alpha-\beta=0. $$
(31)

Rewriting (29) and (30) in matrix form, we obtain the following equations:

$$ A^{T}D(Aw_{+} + e_{+}b_{+})+B^{T}F\alpha+c_{2} {\Sigma}_{+}w_{+}+c_{3}w_{+}=0 , $$
(32)
$$ e_{+}^{T}D(Aw_{+}+e_{+}b_{+})+e_{-}^{T}F\alpha+c_{3}b_{+}=0 , $$
(33)

where \(D= diag(d_{1}^{(1)}, d_{2}^{(1)}, \ldots , d_{l_{1}}^{(1)})\) and \(F=diag(f_{1}^{(2)}, f_{2}^{(2)}, \ldots , f_{l_{2}}^{(2)})\) are diagonal matrices, and \({f_{j}^{(2)}}(j = 1, 2, \ldots , l_{2})\) is either 0 or 1. The following equation is obtained by combining (32) and (33).

$$ \begin{array}{@{}rcl@{}} H^{T}DHU+G^{T}F\alpha+(c_{2}J_{+}+c_{3}I)U=0,\\ \end{array} $$
(34)

i.e.,

$$ \begin{array}{@{}rcl@{}} \left[ \begin{array}{l} w_{+}\\ b_{+} \end{array} \right] = -(H^{T}DH +c_{2}J_{+} +c_{3}I)^{-1}G^{T}F\alpha, \end{array} $$
(35)

where

$$ G=[B \ e_{-}], \quad H=[A \ e_{+}], \quad J_{+}=\left[\begin{array}{ll}{\Sigma}_{+} & 0 \\0 & 0 \end{array}\right]~\text{and}~U=[w_{+} \ b_{+}]^{T}. $$

Finally, we can derive the dual formulation of (26) as follows:

$$ \begin{array}{@{}rcl@{}} \displaystyle{\max_{\alpha}} ~~&& e_{-}^{T}F\alpha - \frac{1}{2}\alpha^{T}(F^{T}G)(H^{T}DH + c_{2}J_{+} +c_{3}I)^{-1}(G^{T} F)\alpha \\ \text{s.t.}~~&& 0 \leqslant \alpha \leqslant c_{1}e_{-}. \end{array} $$
(36)

Similarly, the Lagrangian function for (27) is defined as

$$ \begin{array}{@{}rcl@{}} L_{2}=&&\frac{1}{2} {\sum\limits_{j=1}^{l_{2}}}d_{j}^{(2)}(w_{-}^{T}x_{j}^{(2)}+b_{-})^{2}+c_{4}e_{+}^{T}\eta+\frac{1}{2}c_{5}w_{-}^{T}{\Sigma}_{-}w_{-}\\ &&+\frac{1}{2}c_{6}(\|w_{-}\|_{2}^{2}+b_{-}^{2})\\ &&-{\sum\limits_{j=1}^{l_{1}}}\gamma_{j}(-f_{j}^{(1)}(w_{-}^{T}x_{j}^{(1)}+b_{-})+\eta_{j} - f_{j}^{(1)})-\nu^{T}\eta,\\ \end{array} $$
(37)

where \(d_{j}^{(2)}={{\sum }_{i=1}^{l_{2}}}W_{s,ij}^{(2)}\), and γ, ν are Lagrangian multipliers. After differentiating the Lagrangian function (37) with respect to the variables w−, b− and η, we obtain the simplified dual QPP of (27), which is

$$ \begin{array}{@{}rcl@{}} \displaystyle{\max_{\gamma }} ~~&& e_{+}^{T}P\gamma{-}\frac{1}{2}\gamma^{T}(P^{T}H)(G^{T}QG{+} c_{5}J_{-}+c_{6}I)^{-1}(H^{T}P) \gamma \\ \text{s.t.}~~&& 0 \leqslant \gamma\leqslant c_{4}e_{+}, \end{array} $$
(38)

where \(Q=diag(d_{1}^{(2)}, d_{2}^{(2)}, \cdots , d_{l_{2}}^{(2)})\) and \(P=diag (f_{1}^{(1)},f_{2}^{(1)}, \cdots , f_{l_{1}}^{(1)})\) are diagonal matrices, \(f_{j}^{(1)}\) is either 0 or 1, and \(J_{-}=\left [\begin {array}{ll} {\Sigma }_{-} & 0\\ 0 & 0 \end {array}\right ]\).

Once the solution γ is calculated, we will get the following augmented vector,

$$ \begin{array}{@{}rcl@{}} \left[ \begin{array}{l} w_{-}\\ b_{-} \end{array} \right]=(G^{T}QG +c_{5}J_{-} +c_{6}I)^{-1}H^{T}P\gamma. \end{array} $$
(39)
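A compact sketch of solving the dual (36) and recovering (w+, b+) through (35) in the linear case is shown below. The weights d and the indicator f are the outputs of the KNN step of Section 3.2 and Sigma_plus comes from the clustering step of Section 3.1; a generic box-constrained quasi-Newton solver from SciPy stands in for quadprog or the DCDM solver of Section 5, and the function name and argument layout are illustrative only.

```python
import numpy as np
from scipy.optimize import minimize

def rknn_stsvm_plus_plane(A, B, d, f, Sigma_plus, c1, c2, c3):
    """Solve the dual (36) for the positive-class hyper-plane and recover
    (w_+, b_+) via Eq. (35)."""
    l1, n = A.shape
    l2 = B.shape[0]
    H = np.hstack([A, np.ones((l1, 1))])                  # H = [A e_+]
    G = np.hstack([B, np.ones((l2, 1))])                  # G = [B e_-]
    D = np.diag(d)                                        # KNN weights of class +1
    F = np.diag(f)                                        # 0/1 indicator for class -1
    J = np.zeros((n + 1, n + 1)); J[:n, :n] = Sigma_plus  # J_+ = diag(Sigma_+, 0)
    M = H.T @ D @ H + c2 * J + c3 * np.eye(n + 1)         # regularized, hence invertible
    K = (F @ G) @ np.linalg.solve(M, G.T @ F)             # dual Hessian of (36)
    q = F @ np.ones(l2)                                   # linear term e^T F alpha

    obj = lambda a: 0.5 * a @ K @ a - q @ a               # minimize the negative of (36)
    grad = lambda a: K @ a - q
    res = minimize(obj, np.zeros(l2), jac=grad,
                   bounds=[(0, c1)] * l2, method='L-BFGS-B')
    u = -np.linalg.solve(M, G.T @ F @ res.x)              # Eq. (35)
    return u[:n], u[n]                                    # w_+, b_+
```

The negative-class plane is obtained symmetrically from (38) and (39) by exchanging the roles of A and B.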

3.4 Nonlinear case

For the nonlinear case, the two nonparallel hyper-planes are modified as the following kernel-generated expressions:

$$ \begin{array}{@{}rcl@{}} K(x)\mu_{+}+b_{+} = 0, \quad K(x)\mu_{-}+b_{-} = 0, \end{array} $$
(40)

where

$$ \begin{array}{@{}rcl@{}} K(x) = [K(x_{1},x), K(x_{2},x),..., K(x_{l},x)], \end{array} $$
(41)

and K(⋅) stands for a chosen kernel function. The primal QPPs of nonlinear RKNN-STSVM corresponding to the hyper-planes (40) are respectively given as follows:

$$ \begin{array}{@{}rcl@{}} \displaystyle{\min_{\mu_{+} ,b_{+} ,\xi}} ~~&&\frac{1}{2} {\sum\limits_{i=1}^{l_{1}}}{\sum\limits_{j=1}^{l_{1}}}{W_{s,ij}^{(1)}}(\mu_{+}^{T}K(x_{j}^{(1)})+b_{+})^{2}+ c_{1}e_{-}^{T}\xi \\ ~~&& + \frac{1}{2}c_{2}\mu_{+}^{T}\phi(M)^{T}{\Sigma}_{+}^{\phi}\phi(M)\mu_{+}+\frac{1}{2}c_{3}(\|\mu_{+}\|_{2}^{2}+b_{+}^{2})\\ \text{s.t.}~~&& \ -f_{j}^{(2)}(\mu_{+}^{T}K(x_{j}^{(2)})+b_{+})+\xi_{j} \geqslant f_{j}^{(2)}, \\ ~~&&\xi_{j} \geqslant 0, j=1, \ldots, l_{2}, \end{array} $$
(42)

and

$$ \begin{array}{@{}rcl@{}} \displaystyle{\min_{\mu_{-} ,b_{-} ,\eta}} ~~&&\frac{1}{2} {\sum\limits_{i=1}^{l_{2}}}{\sum\limits_{j=1}^{l_{2}}}{W_{s,ij}^{(2)}}(\mu_{-}^{T}K(x_{j}^{(2)})+b_{-})^{2}+c_{4}e_{+}^{T}\eta \\ ~~&&+ \frac{1}{2}c_{5}\mu_{-}^{T}\phi(M)^{T}{\Sigma}_{-}^{\phi}\phi(M)\mu_{-}+\frac{1}{2}c_{6}(\|\mu_{-}\|_{2}^{2}+b_{-}^{2})\\ \text{s.t.}~~&& \ f_{j}^{(1)}(\mu_{-}^{T}K(x_{j}^{(1)})+b_{-})+\eta_{j} \geqslant f_{j}^{(1)},\\ ~~&&\eta_{j} \geqslant 0, j=1, \ldots, l_{1}, \end{array} $$
(43)

where \(c_{1}, c_{2}, \ldots, c_{6} \geqslant 0\) are predefined parameters, e+ and e− are vectors of ones of appropriate dimensions, and Ws,ij and fj are defined as in the linear case. \({\Sigma }_{+}^{\phi }\) and \({\Sigma }_{-}^{\phi }\) are respectively the covariance matrices of the clusters in the two classes obtained by the kernel Ward's linkage clustering.

The Wolfe dual of problem (42) is formulated as follows:

$$ \begin{array}{@{}rcl@{}} \displaystyle{\max_{\alpha }} ~~&& e_{-}^{T}F\alpha - \frac{1}{2}\alpha^{T}(F^{T}G_{\phi})(H_{\phi}^{T}DH_{\phi}+c_{2}J_{+}^{\phi}+c_{3}I)^{-1}(G_{\phi}^{T} F)\alpha \\ \text{s.t.}~~&& \quad 0 \leqslant \alpha \leqslant c_{1}e_{-}, \end{array} $$
(44)

where Hϕ = [K(A) e+], Gϕ = [K(B) e], \(J_{+}^{\phi }=\left [\begin {array}{ll}{\Sigma }_{+}^{\phi } & 0 \\0 & 0 \end {array}\right ]\), then we can further get

$$ \begin{array}{@{}rcl@{}} \left[ \begin{array}{l} \mu_{+}\\ b_{+} \end{array} \right]=-(H_{\phi}^{T}DH_{\phi}+c_{2}J_{+}^{\phi}+c_{3}I)^{-1}G_{\phi}^{T}F\alpha. \end{array} $$
(45)

In a similar manner, the dual formulation of (43) is

$$ \begin{array}{@{}rcl@{}} \displaystyle{\max_{\gamma}} ~~&& e_{+}^{T}P\gamma {-} \frac{1}{2}\gamma^{T}(P^{T}H_{\phi})(G_{\phi}^{T}QG_{\phi}{+}c_{5}J_{-}^{\phi}{+}c_{6}I)^{-1}(H_{\phi}^{T}P) \gamma \\ \text{s.t.}~~&& \quad 0 \leqslant \gamma\leqslant c_{4}e_{+}. \end{array} $$
(46)

Once the solution γ is found, we get the following result,

$$ \begin{array}{@{}rcl@{}} \left[ \begin{array}{l} \mu_{-} \\ b_{-} \end{array} \right] = (G_{\phi}^{T}QG_{\phi} +c_{5}J_{-}^{\phi} +c_{6}I)^{-1}H_{\phi}^{T}P\gamma. \end{array} $$
(47)

Here, specifications of the matrices D and Q are analogous to the linear case.

A new point \(x \in R^{n}\) is classified as the positive class or the negative class depending on which of the two kernel-generated surfaces, given by (45) and (47), it lies closer to. The decision function of RKNN-STSVM is

$$ \begin{array}{@{}rcl@{}} f(x) = {\arg}\ \min\limits_{\pm}\left\lbrace \frac{\mid K(x)\mu_{\pm} + b_{\pm}\mid}{\parallel \mu_{\pm}\parallel}\right\rbrace . \end{array} $$
(48)
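A small sketch of this kernel decision rule is given below, using the Gaussian kernel employed later in the experiments; normalizing by ∥μ±∥ follows the simplified form of (48), and the function names are ours.

```python
import numpy as np

def gaussian_kernel(X, M, gamma):
    """K(x) = [K(x_1, x), ..., K(x_l, x)] for every row x of X,
    with the Gaussian kernel k(u, v) = exp(-||u - v||^2 / gamma^2)."""
    sq = ((X[:, None, :] - M[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / gamma ** 2)

def rknn_stsvm_kernel_predict(X, M, mu_p, b_p, mu_m, b_m, gamma):
    """Decision rule (48): assign each x to the class whose
    kernel-generated surface it lies closer to."""
    K = gaussian_kernel(X, M, gamma)
    d_p = np.abs(K @ mu_p + b_p) / np.linalg.norm(mu_p)
    d_m = np.abs(K @ mu_m + b_m) / np.linalg.norm(mu_m)
    return np.where(d_p <= d_m, 1, -1)
```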

3.5 Analysis of algorithm

We first discuss the differences between TSVM, S-TSVM, KNN-STSVM, and our RKNN-STSVM.

(1) Optimization of the traditional TSVM costs around \(O(l^{3}/4)\) if we assume that the numbers of samples in the two classes are approximately equal. For S-TSVM, the computational complexity of the clustering step is \(O(({l^{2}_{1}}+{l^{2}_{2}})n)\), so the computational complexity of S-TSVM is \(O(({l^{2}_{1}}+{l^{2}_{2}})n+l^{3}/4)\). Thus, the overall computational complexity of KNN-STSVM and RKNN-STSVM is around \(O(l^{2}\log(l)+({l^{2}_{1}}+{l^{2}_{2}})n+l^{3}/4)\). This conclusion holds under the assumption that no constraints are deleted. In fact, some components of f are set to 0, which implies that the corresponding constraints are redundant, so the actual computational cost of KNN-STSVM and RKNN-STSVM is lower than this bound.

(2) Unlike TSVM and KNN-STSVM, our RKNN-STSVM implements the structural risk minimization principle by introducing an extra regularization term in each objective function, so that the problem becomes well-posed. This not only helps to alleviate over-fitting and improve the generalization performance, as in the conventional SVM, but also guarantees invertibility in the dual formulation.

(3) The two parameters c3 and c6 introduced in our RKNN-STSVM are the weights between the regularization terms and the empirical risk, so they can be chosen to improve the performance of RKNN-STSVM.

4 Numerical experiments

To validate the superiority of our algorithm, in this section we compare our proposed RKNN-STSVM with TSVM, Iν-TBSVM, S-TSVM, KNN-STSVM and KNN-LSTSVM on twenty-seven benchmark datasets from the UCI machine learning repository or the LIBSVM database [42]. They are Australian (A), BCI-1a (B), Breast Cancer1 (B1), Breast Cancer2 (B2), Breast Cancer3 (B3), Balance Scale (BS), BUPA (BU), DBworld e-mail (D), Echocardiogram (E), Fertility (F), Hepatitis (H), Haberman (HA), Heart (HE), Ionosphere (I), IRIS (IR), LSVT (L), Liver Disorder (LD), Monks (M), Pima (P), Parkinson (PA), Planning Relax (PR), Sonar (S), Spect Heart (SH), Spambase (SP), Vertebral (V), WPBC (W), Waveform (WA). For convenience, we use these abbreviations in the rest of the paper. Table 1 lists the characteristics of these datasets. To further evaluate our algorithm, we also conduct experiments and make comparisons on the popular Caltech image datasets.

Table 1 The statistics of twenty-seven benchmark datasets

To make the results more convincing, we use 5-fold cross validation to obtain the classification accuracy. More specifically, we split each dataset into five subsets randomly; one of them is reserved as the test set and the remaining ones are used for training. This process is repeated five times until each subset has served as the test set once. All experiments are carried out in Matlab R2017a on Windows 10.

4.1 Parameters selection

The performance of the six algorithms depends heavily on the choice of kernel functions and their parameters. In this paper, we only consider the Gaussian kernel function \(k(x_{i},x_{j}) = \exp(-\|x_{i}-x_{j}\|^{2}/\gamma^{2})\) for these datasets. We choose the optimal parameter values by the grid search method. The Gaussian kernel parameter γ is selected from the set {2^i | i = −1, 0, …, 8}. For brevity's sake, we let c1 = c4, c2 = c5, c3 = c6 in S-TSVM and RKNN-STSVM, c1 = c3, c2 = c4 in KNN-STSVM, r1 = r2, ν1 = ν2 in Iν-TBSVM, and c1 = c2 in TSVM and KNN-LSTSVM. The parameters c1 and c2 are selected from the set {2^i | i = −1, 0, …, 8}, and r1 and r2 are selected from the set {10^i | i = −6, −5, …, 0}. To save calculation time while maintaining accuracy, we plot the accuracy on the Breast Cancer1 and Echocardiogram datasets against the parameter c3 in Fig. 1; since the optimal values fall between 10^−6 and 1, c3 is selected from the set {10^i | i = −6, −5, …, 0}. We also plot the accuracy on the Australian and Breast Cancer1 datasets against the parameter k in Fig. 2; the optimal value of k in KNN-STSVM, KNN-LSTSVM and RKNN-STSVM is chosen from the set {3, 4, 5, 6, 10, 20}. The parameter ν1 of Iν-TBSVM is searched from the set {0.1, 0.2, …, 0.7}. For large-scale datasets, the parameter ranges are narrowed uniformly because of the long running time.
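The grid search with 5-fold cross validation described above can be sketched as follows; train_and_score is a hypothetical callback that trains RKNN-STSVM with a given parameter combination and returns the test accuracy, and the grids mirror the ranges listed in this subsection.

```python
import itertools
import numpy as np
from sklearn.model_selection import KFold

def grid_search(X, y, train_and_score, grids):
    """Pick the parameter combination with the best mean 5-fold CV accuracy."""
    best_acc, best_params = -np.inf, None
    for values in itertools.product(*grids.values()):
        params = dict(zip(grids.keys(), values))
        accs = []
        for tr, te in KFold(n_splits=5, shuffle=True).split(X):
            accs.append(train_and_score(X[tr], y[tr], X[te], y[te], params))
        if np.mean(accs) > best_acc:
            best_acc, best_params = np.mean(accs), params
    return best_params, best_acc

# grids used in the paper (with c1 = c4, c2 = c5, c3 = c6 tied):
grids = {'c1': [2.0 ** i for i in range(-1, 9)],
         'c2': [2.0 ** i for i in range(-1, 9)],
         'c3': [10.0 ** i for i in range(-6, 1)],
         'gamma': [2.0 ** i for i in range(-1, 9)],
         'k': [3, 4, 5, 6, 10, 20]}
```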

Fig. 1 Changes of accuracy with the growth of c3 on the Breast Cancer1 and Echocardiogram datasets

Fig. 2 Changes of accuracy with the growth of k on the Australian and Breast Cancer1 datasets

4.2 Result comparison and discussion

The experimental results on the twenty-seven datasets are summarized in Table 2, where “Accuracy” denotes the mean of the five testing results plus or minus the standard deviation, and “Time” denotes the mean running time of the six methods, consisting of both training and testing time. The optimal parameters of the six algorithms are shown in Table 3.

Table 2 Performance comparisons of six algorithms on twenty-seven datasets. Bold type shows the best result for each dataset
Table 3 The optimal parameters of six algorithms used in the experiments

From the perspective of prediction accuracy in Table 2, we can see that our proposed RKNN-STSVM outperforms the other five algorithms, i.e., TSVM, S-TSVM, Iν-TBSVM, KNN-STSVM and KNN-LSTSVM, on most datasets. KNN-STSVM follows; it produces better testing accuracy than TSVM, Iν-TBSVM, S-TSVM and KNN-LSTSVM in most cases. TSVM yields the lowest testing accuracy and KNN-LSTSVM the lowest computational cost. Two main reasons lead RKNN-STSVM to produce better accuracy. One is that different weights are assigned to the samples according to their numbers of K-nearest neighbors, a sample being more important if it has more neighbors. The other is that our RKNN-STSVM implements the structural risk minimization principle by introducing an extra regularization term in each objective function.

Compared with the other five algorithms, our RKNN-STSVM yields the highest testing accuracy on the high-dimensional datasets BCI-1a and Breast Cancer3, which implies that RKNN-STSVM is also suitable for high-dimensional data. However, it yields the second lowest testing accuracy on the high-dimensional DBworld e-mail dataset; the main reason is that this dataset contains only 64 samples but 4702 features. On the large datasets Spambase (SP) and Waveform (WA), RKNN-STSVM yields the highest testing accuracy and its computational time is also lower than those of S-TSVM and KNN-STSVM. However, the large number of parameters, which leads to long training time, is the major drawback of our algorithm. In Section 5, we employ the fast DCDM algorithm to accelerate the computation of our RKNN-STSVM.

4.3 Friedman tests

From Table 2, one can easily observe that our proposed RKNN-STSVM does not outperform the other five algorithms on every dataset in terms of testing accuracy. To analyze the performance of the six algorithms on multiple datasets statistically, as suggested by Demšar [43] and García and Fernández [44], we use the Friedman test with the corresponding post hoc test, which is considered a simple, nonparametric yet safe test. For this, the average ranks of the six algorithms on accuracy over all datasets are calculated and listed in Table 4. Under the null hypothesis that all the algorithms are equivalent, one can compute the Friedman statistic [43] according to (49):

$$ \begin{array}{@{}rcl@{}} {\chi^{2}_{F}}=\frac{12N}{k(k+1)}\left[\sum\limits_{j}{R^{2}_{j}}-\frac{k(k+1)^{2}}{4}\right], \end{array} $$
(49)

where \(R_{j}=\frac {1}{N}{\sum }_{i}{r^{j}_{i}}\), and \({r^{j}_{i}}\) denotes the rank of the j th of the k algorithms on the i th of the N datasets. Friedman's \({\chi ^{2}_{F}}\) is undesirably conservative, and a better statistic is

$$ \begin{array}{@{}rcl@{}} F_{F}=\frac{(N-1){\chi^{2}_{F}}}{N(k-1)-{\chi^{2}_{F}}}, \end{array} $$
(50)

which is distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom.

Table 4 Average rank on classification accuracy of six algorithms

We obtain \({\chi ^{2}_{F}}=59.7356\) and FF = 20.6356 according to (49) and (50), where FF is distributed according to the F-distribution with (5,130) degrees of freedom. The critical value of F(5,130) is 2.2839 for the significance level α = 0.05, 2.6656 for α = 0.025, and 3.1612 for α = 0.01. Since the value of FF is much larger than the critical value, there is a significant difference between the six algorithms. Note that the average rank of RKNN-STSVM is far lower than those of the remaining algorithms, which means that our RKNN-STSVM is more effective than the other five algorithms.
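These statistics follow directly from (49) and (50); a short sketch of the computation, with the critical value taken from SciPy's F-distribution, is given below (the function name is ours).

```python
import numpy as np
from scipy.stats import f as f_dist

def friedman_test(ranks, alpha=0.05):
    """Friedman chi-square (49) and F-statistic (50) from an
    N-datasets-by-k-algorithms matrix of ranks."""
    N, k = ranks.shape
    R = ranks.mean(axis=0)                                  # average rank per algorithm
    chi2 = 12 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4)
    Ff = (N - 1) * chi2 / (N * (k - 1) - chi2)
    crit = f_dist.ppf(1 - alpha, k - 1, (k - 1) * (N - 1))  # e.g. F(5, 130) for k=6, N=27
    return chi2, Ff, crit
```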

4.4 Influence of parameter k

To investigate the influence of the parameter k, we conduct experiments on twenty of the twenty-seven benchmark datasets. The value of k is chosen from the set {3, 4, 5, 6, 10, 20}.

Table 5 shows that the accuracy of our proposed RKNN-STSVM is closely related to the parameter k; this phenomenon is more pronounced on the DBworld e-mail dataset. It can be noticed that as k gradually increases, the accuracy on DBworld e-mail also increases, while a value of k that is too large or too small reduces the prediction accuracy. Therefore, an appropriate parameter k is very important for our model.

Table 5 Performance comparisons of RKNN-STSVM with different parameter k on twenty datasets. Bold type shows the best result for each dataset

4.5 Image recognition datasets

We conduct experiments on the Caltech image datasets, including the Caltech 101 [45] and Caltech 256 [46] datasets. The Caltech 101 dataset contains 102 categories with about 40 to 800 images per category, and the size of each image is about 300 × 200 pixels. The Caltech 256 dataset has 256 categories with at least 80 images per category. In addition, the Caltech 256 dataset has a clutter category, which can be seen as noise or background. In our experiments, we choose Emu and Pigeon from Caltech 101, and Bonsai-101 and Cactus from Caltech 256. We show some image samples in Fig. 3; every row of image samples comes from the same class.

Fig. 3 Image samples in Caltech datasets

Due to the limited number of samples in some categories of the Caltech 101 dataset, we use no more than 50 samples from each category. The Caltech 256 dataset provides more categories and images, and we use no more than 80 samples from each category. For image feature extraction, the popular dense-SIFT algorithm is used to extract features from those images. Then, we quantize those features into 1000 visual words with a bag-of-words model. We also reduce the feature vectors to appropriate dimensions with PCA, retaining 97% of the variance. We can then use the kernel trick to improve classification accuracy.
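A rough sketch of this bag-of-words pipeline is given below; the dense-SIFT descriptors are assumed to be extracted elsewhere with an external SIFT implementation, and the function name is ours.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def bag_of_words_features(descriptor_sets, n_words=1000, var_ratio=0.97):
    """Quantize per-image local descriptors (e.g. dense-SIFT) into a
    1000-word histogram per image, then reduce dimension with PCA keeping
    97% of the variance. descriptor_sets is a list of (n_i x d) arrays."""
    codebook = KMeans(n_clusters=n_words, n_init=3).fit(np.vstack(descriptor_sets))
    hists = np.array([np.bincount(codebook.predict(d), minlength=n_words)
                      for d in descriptor_sets], dtype=float)
    hists /= hists.sum(axis=1, keepdims=True)    # normalize each image histogram
    return PCA(n_components=var_ratio).fit_transform(hists)
```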

The experimental results on the Caltech datasets are shown in Table 6. The experimental process and parameter selection are the same as in the previous benchmark experiments. From the perspective of prediction accuracy in Table 6, we can see that our RKNN-STSVM yields the highest testing accuracy on most of the image datasets compared with the other five algorithms. In addition, our proposed RKNN-STSVM improves the classification accuracy of KNN-STSVM while keeping the computational time no higher than that of KNN-STSVM. It can be noticed that TSVM yields the lowest testing accuracy and KNN-LSTSVM the lowest computational cost. Finally, our proposed RKNN-STSVM performs better than TSVM, Iν-TBSVM and KNN-LSTSVM on the four image datasets. The optimal parameters of the six algorithms used in the experiments are shown in Table 7.

Table 6 Performance comparisons of six algorithms on four image datasets. Bold type shows the best result for each dataset
Table 7 The optimal parameters of six algorithms used in the experiments

5 The DCDM for accelerating RKNN-STSVM

5.1 The DCDM algorithm

Although Matlab's built-in solver quadprog gives a generally precise solution, it is time-consuming for large-scale problems. In this section, to further improve the solving efficiency, a fast algorithm, the dual coordinate descent method (DCDM) [38], is embedded into the learning process of our RKNN-STSVM to accelerate the learning speed. DCDM is a coordinate descent method: it decomposes an optimization problem into a series of single-variable sub-problems, that is, in each iteration DCDM updates only one variable. By solving these sub-problems efficiently, the optimization can be completed in less time. The procedure for embedding DCDM into our RKNN-STSVM is shown below.

Each of the dual QPPs of RKNN-STSVM can be written in the following form [38]:

$$ \begin{array}{@{}rcl@{}} \displaystyle{\min_{\alpha}} ~~&& f(\alpha)=\frac{1}{2}\alpha^{T}Q\alpha-e^{T}\alpha \\ \text{s.t.}~~&& 0 \leqslant \alpha \leqslant C. \end{array} $$
(51)

The optimization process begins with an initial point α0 ∈ Rn. The process from αk to αk+1 is referred to as an outer iteration. In each outer iteration there are n inner iterations, so that α1, α2, …, αn are updated sequentially. Each outer iteration thus generates vectors αk,i ∈ Rn, i = 1,2,…,n+1, such that αk,1 = αk, αk,n+1 = αk+1, and

$$ \alpha^{k,i}=[\alpha^{k+1}_{1}, \ldots, \alpha^{k+1}_{i-1}, {\alpha^{k}_{i}},\ldots, {\alpha^{k}_{n}}]^{T}, \forall i=2, \ldots, n. $$

For updating αk,i to αk,i+ 1, we solve the following one-variable sub-problem:

$$ \begin{array}{@{}rcl@{}} \displaystyle{\min_{d}} ~~&& f(\alpha^{k,i}+de_{i})\\ \text{s.t.}~~&& 0 \leqslant \alpha^{k,i}_{i}+d \leqslant C_{i}, \end{array} $$
(52)

where ei = [0,…,0,1,0,…,0]T. The objective function of (52) is a simple quadratic function of d:

$$ \begin{array}{@{}rcl@{}} f(\alpha^{k,i}+de_{i})=\frac{1}{2}Q_{ii}d^{2}+\triangledown_{i}f(\alpha^{k,i})d+constant, \end{array} $$
(53)

where \(\triangledown _{i}f\) is the i th component of the gradient \(\triangledown f\). Hence (52) has an optimum at d = 0 (i.e., there is no need to update αi) if and only if

$$ \begin{array}{@{}rcl@{}} {\triangledown^{P}_{i}}f(\alpha^{k,i})=0, \end{array} $$
(54)

where \(\triangledown ^{P}f(\alpha )\) means the projected gradient

$$ \begin{array}{@{}rcl@{}} {\triangledown^{P}_{i}}f(\alpha)= \left\{\begin{aligned} &\triangledown_{i}f(\alpha), \qquad \qquad \ \ \ \quad if \quad 0 < \alpha_{i} < C_{i},\\ &min(0, \triangledown_{i}f(\alpha)), \ \qquad if \quad \alpha_{i}=0,\\ &max(0, \triangledown_{i}f(\alpha)),\qquad if \quad \alpha_{i}=C_{i}. \end{aligned}\right. \end{array} $$
(55)

If (54) holds, we move to the index i + 1 without updating \(\alpha ^{k,i}_{i}\). Otherwise, we must find the optimal solution of (52). If Qii > 0, the solution is:

$$\alpha^{k,i+1}_{i}=min(max(\alpha^{k,i}_{i}-\frac{\triangledown_{i}f(\alpha^{k,i})}{Q_{ii}},0),C_{i}).$$

The stopping criterion we set for DCDM is \(\|\alpha^{k+1}-\alpha^{k}\| < \epsilon\) (𝜖 is a sufficiently small real number) or the number of iterations reaching the maximum of 1000. A flowchart of DCDM is given in Algorithm 1.

Algorithm 1 The dual coordinate descent method (DCDM) for problem (51)
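A minimal Python sketch of the DCDM procedure described above is shown next; it updates the variables cyclically, uses the projected gradient (55) to skip variables that are already optimal, and applies the stopping rule stated earlier. The function name dcdm is ours.

```python
import numpy as np

def dcdm(Q, e, C, eps=1e-2, max_iter=1000):
    """Dual coordinate descent for min 0.5*a'Qa - e'a s.t. 0 <= a <= C,
    i.e. problem (51). C may be a scalar or a per-component vector."""
    n = len(e)
    C = np.full(n, float(C)) if np.isscalar(C) else np.asarray(C, float)
    alpha = np.zeros(n)
    for _ in range(max_iter):
        alpha_old = alpha.copy()
        for i in range(n):                          # one inner iteration per variable
            g = Q[i] @ alpha - e[i]                 # nabla_i f(alpha), cf. Eq. (53)
            pg = g                                  # projected gradient, Eq. (55)
            if alpha[i] == 0:
                pg = min(0.0, g)
            elif alpha[i] == C[i]:
                pg = max(0.0, g)
            if pg != 0.0 and Q[i, i] > 0:           # exact single-variable update
                alpha[i] = min(max(alpha[i] - g / Q[i, i], 0.0), C[i])
        if np.linalg.norm(alpha - alpha_old) < eps:  # stopping criterion of Sect. 5.1
            break
    return alpha
```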

5.2 Experiments of the DCDM algorithm for RKNN-STSVM

Now we introduce the DCDM algorithm into the dual QPPs of our RKNN-STSVM, which are given by (44) and (46). Comparing (44) with (51), the matrix Q in (51) is replaced by \(G_{\phi }(H_{\phi }^{T}DH_{\phi }+c_{2}J_{+}^{\phi }+c_{3}I)^{-1}G_{\phi }^{T}\). Note that \(F=diag(f^{(2)}_{1}, f^{(2)}_{2}, \ldots , f^{(2)}_{l_{2}})\) in our QPP is a diagonal matrix and \(f^{(2)}_{j}\) is either 0 or 1. That is to say, the components of α corresponding to zero entries of F make no contribution to the optimization problem (44), so they can be ignored during the DCDM iterations, which further improves the computing speed.
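A sketch of this reduction is shown below: the rows of Gφ whose indicator fj is zero are dropped before forming the dual Hessian, and the remaining box-constrained problem is passed to the dcdm() sketch of Section 5.1. The function name and argument layout are ours.

```python
import numpy as np

def dcdm_rknn_stsvm(G_phi, H_phi, D, f, J_plus, c1, c2, c3, eps=1e-2):
    """Solve the kernel dual (44) with DCDM, skipping the components of alpha
    whose indicator f_j is zero (they cannot affect the objective)."""
    keep = np.flatnonzero(f)                         # indices with f_j = 1
    M = H_phi.T @ D @ H_phi + c2 * J_plus + c3 * np.eye(H_phi.shape[1])
    Gk = G_phi[keep]                                 # rows that actually constrain
    Q = Gk @ np.linalg.solve(M, Gk.T)                # reduced dual Hessian
    alpha = np.zeros(len(f))
    alpha[keep] = dcdm(Q, np.ones(len(keep)), c1, eps=eps)
    u = -np.linalg.solve(M, G_phi.T @ (f * alpha))   # Eq. (45)
    return u[:-1], u[-1]                             # mu_+, b_+
```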

In this experiment, twenty-five datasets are selected to validate the effectiveness of applying the DCDM algorithm to speed up the computation. The initial value of α is a random vector with entries between 0 and c1, and the stopping condition of the DCDM algorithm is chosen as [38]

$$\|\alpha^{k+1}-\alpha^{k}\|<\epsilon, \text{or maximum iterations reaching 1000}. $$

We set 𝜖 = 10^−2 in the experiments. Averages and standard deviations of the test accuracies (in %) and the computation times obtained by RKNN-STSVM and by RKNN-STSVM solved with DCDM are shown in Table 8. The last column shows the speedup of the DCDM algorithm. These results indicate that the DCDM algorithm gives testing accuracy similar to that of the Matlab built-in solver while making our RKNN-STSVM learn much faster; the largest speedup reaches almost 5.76 times.

Table 8 Performance comparisons of RKNN-STSVM and DCDM+RKNN-STSVM on twenty-five datasets

6 Conclusion

In this paper, we present an improved version of KNN-STSVM. We reformulate the primal problems of KNN-STSVM by adding regularization terms to the objective functions so as to implement the structural risk minimization principle and to remove the singularity in the solution. Therefore, our RKNN-STSVM avoids the over-fitting problem to a certain degree and yields higher testing accuracy. Numerical experiments on twenty-seven benchmark datasets and on image recognition datasets demonstrate the feasibility and validity of our RKNN-STSVM. The fast DCDM algorithm is further employed to speed up our RKNN-STSVM. The selection of an appropriate parameter k can greatly improve the testing accuracy of our algorithm. It should be pointed out that there are several parameters in our RKNN-STSVM, so it is meaningful to address parameter selection in future work.