1 Introduction

The support vector machine (SVM) [27, 34], a computationally powerful tool for pattern recognition, has achieved excellent performance in many fields [12, 13, 17, 29, 35]. The main idea of SVM is to seek an optimal hyperplane that separates two classes of samples with maximal margin. The hyperplane is obtained by solving a quadratic programming problem (QPP). SVM can also solve nonlinear classification problems by using the kernel trick. If the size of the training set is n, the learning complexity of the classical SVM is \(O(n^3)\); hence one of the key issues for SVM is its slow learning speed on large-scale training datasets. To improve the learning speed of the classical SVM while keeping comparable classification accuracy, many efficient training algorithms have been proposed, such as sequential minimal optimization (SMO) [3, 8, 23], decomposition methods [6, 16], and geometric algorithms [7, 15].

Recently, a generalized eigenvalue proximal SVM (GEPSVM) was proposed [14]. Its main idea is to construct a pair of nonparallel hyperplanes such that each hyperplane is proximal to the samples of its own class and far from the samples of the other class. The two nonparallel hyperplanes of GEPSVM are obtained efficiently by solving two generalized eigenvalue problems, but its classification accuracy is poor in many practical problems compared with the classical SVM. A twin SVM (TWSVM) was proposed by Jayadeva et al. for binary classification [5]. Similar to GEPSVM in spirit, TWSVM also seeks two nonparallel hyperplanes, each of which is closer to one class and at a distance of at least one from the other. The two nonparallel hyperplanes of TWSVM are obtained by solving two smaller sized QPPs. The experimental results in [5] show that TWSVM is faster than the classical SVM and compares favorably with it in terms of classification accuracy. Extensions of TWSVM include the smooth TWSVM [9], least squares TWSVM [10], localized TWSVM [33], twin bounded SVM [25], twin parametric-margin SVM [19], \(\nu \)-TWSVM [18], structural TWSVM [24], nonparallel SVM [26], twin Mahalanobis distance-based SVM [20], multi-label TSVM [2], and twin support vector clustering [28].

Different from TWSVM, which seeks two nonparallel hyperplanes, Peng proposed the twin-hypersphere support vector machine (THSVM) [22], which uses two hyperspheres to describe the two classes of samples; this idea may be more reasonable for many practical datasets. The two hyperspheres are obtained by solving two QPPs. THSVM avoids the matrix inversions that appear in TWSVM, which makes it more efficient. Recently, the THKSVM [30] and the Pin-M3HM [31] were proposed as extensions of THSVM.

In this paper, we propose a novel classifier, called THSVM with local density information (LDTHSVM), for binary classification. First, we extract the local density of each sample and treat it as that sample's weight; then we prune the training dataset according to these local density degrees; finally, we introduce the local density degrees into THSVM and reconstruct a more robust classification model. Computational comparisons with several classical classification algorithms on synthetic and publicly available benchmark datasets indicate that LDTHSVM has better classification performance.

The remaining parts of this paper are organized as follows. Section 2 introduces the classical THSVM. Section 3 discusses the local density degrees of training samples and the pruning method for the training dataset. Section 4 derives LDTHSVM in detail. Section 5 gives the computational complexity of LDTHSVM. Section 6 reports experimental results on synthetic and publicly available benchmark datasets, and conclusions are drawn in Sect. 7.

2 Related works

2.1 Notations

In this paper, we consider the binary classification problem with the dataset \(D = \{ ({x_i},{y_i})\}_{i=1}^{l} \), where \({x_i} \in {R^d}\) is a training sample with label \({y_i} \in \{ 1, - 1\} \). Further, we denote by the matrices A \(\in R^{l_+ \times d}\) and B \(\in R^{l_- \times d}\) the positive and negative samples, respectively. Finally, a mapping \(\varphi ( \cdot )\) is introduced to map \({R^d}\) into some feature space Z. A kernel function \(K({x_i},{x_j})\) can be used to represent the inner product in Z, i.e., \(K({x_i},{x_j}) = \varphi {({x_i})^T}\varphi ({x_j})\).
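For illustration, a kernel matrix can be computed as in the short Python sketch below, taking the Gaussian kernel used later in Sect. 6 as an example; the function names are ours and the snippet is not part of the original formulation.

```python
import numpy as np

def gaussian_kernel(x1, x2, gamma=1.0):
    # K(x1, x2) = exp(-gamma * ||x1 - x2||^2), the kernel used in Sect. 6
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def kernel_matrix(X1, X2, gamma=1.0):
    # Gram matrix whose (i, j) entry is K(X1[i], X2[j]) = phi(X1[i])^T phi(X2[j])
    sq = (np.sum(X1 ** 2, axis=1)[:, None]
          + np.sum(X2 ** 2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * sq)
```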

2.2 Review of THSVM

The THSVM [22] determines two hyperspheres, rather than two nonparallel hyperplanes, to describe two classes of samples in the feature space Z:

$$\begin{aligned} \left\| {\varphi (x) - {a_ + }} \right\| ^2 = R_ + ^2 \quad \hbox {and} \quad \left\| {\varphi (x) - {a_ - }} \right\| ^2 = R_ - ^2, \end{aligned}$$
(1)

where \({a_ \pm }\) and \({R_ \pm }\) are, respectively, the centers and radii of the corresponding hyperspheres.

The THSVM classifier is obtained by solving a pair of QPPs as follows:

(2)
(3)

where \({c_1,c_2,v_1,v_2 > 0}\) are penalty parameters specified in advance and \({\xi _i,\xi _j}\) are slack variables.

The dual QPPs of (2) and (3) can be obtained:

(4)
(5)

Once QPPs (4) and (5) are solved, the decision function of THSVM can be written as:

$$\begin{aligned} f(x)&= \hbox {sgn}\left\{ \frac{({\varphi (x) - {a_+}})^\mathrm{T}({\varphi (x) - {a_+}}) }{{R_+ ^2}}\right. \nonumber \\&\quad \left. -\frac{({\varphi (x) - {a_- }})^\mathrm{T}({\varphi (x)-{a_-}}) }{{R_- ^2}}\right\} . \end{aligned}$$
(6)

3 Pruning method of training dataset

3.1 Local density of training dataset

Noise samples, which may be caused by sampling or instrument errors, strongly affect the classification ability of THSVM; that is, the classification accuracy is reduced when there are many noise samples in the training set. Noise samples are mainly of two types: isolated samples with abnormal features, and samples with wrong labels. In this paper, we use sample weights to reduce this effect. The weight of a sample is obtained by estimating its local density degree. The weights are calculated as follows:

Firstly, the Euclidean distances \({d_{ij}}\) between samples \({x_i}\) and \({x_j}\) in the feature space are calculated:

$$\begin{aligned} d_{ij}^2= & {} {\left\| {\varphi ({x_i}) - \varphi ({x_j})} \right\| ^2}\nonumber \\= & {} K({x_i},{x_i})-2K({x_i},{x_j})\nonumber \\&+K({x_j},{x_j}), \quad i,j = 1,\ldots ,l \; \hbox {and} \; i \ne j. \end{aligned}$$
(7)

Secondly, find the k-nearest-neighbor region \({\Omega _i}\) of sample \({x_i}\) and the radius \({r_i}\) of this region according to (7):

$$\begin{aligned}&{\Omega _i} = \{ {x_j}\,|\,{x_j}\;\hbox {is one of the}\;k\;\hbox {nearest neighbors of}\;{x_i}\}, \end{aligned}$$
(8)
$$\begin{aligned}&{r_i} = \max (d_{ij}^2),\; j \in {\Omega _i}, \end{aligned}$$
(9)

where k is a predetermined value.

Then, obtain the intra-class nearest neighbors of sample \({x_i}\):

$$ \begin{aligned} {\Lambda _i} = \{ {x_j}\,|\,{x_j} \in {\Omega _i}\;\hbox {and}\;{x_j},\,{x_i}\;\hbox {belong to the same class}\}. \end{aligned}$$
(10)

Finally, calculate the local density degree \({d_i}\) of sample \({x_i}\) according to the following formula:

$$\begin{aligned} {d_i} = \sum \limits _{j \in {\Lambda _i}} {\exp \{ - \omega *d_{ij}^2/r\} }, \end{aligned}$$
(11)

where \(r = \frac{1}{l}\sum \nolimits _{i = 1}^l {{r_i}} \) and \(\omega \) is a weight.

Clearly, this method gives a higher local density degree \({d_i}\) to samples in higher density regions: a sample with smaller distances to its k nearest neighbors has a higher \({d_i}\). Moreover, a smaller \(\omega \) produces higher local density degrees. Generally, \(\omega \) and k are set to 1 and 7, respectively [11, 21].
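As a concrete illustration of Eqs. (7)–(11), the following Python sketch computes the local density degrees from a precomputed kernel matrix; the function and variable names are ours and the implementation is a minimal, unoptimized version.

```python
import numpy as np

def local_density_degrees(K, y, k=7, omega=1.0):
    """Compute local density degrees d_i following Eqs. (7)-(11).

    K     : (l, l) kernel matrix, K[i, j] = K(x_i, x_j)
    y     : (l,) numpy array of labels in {+1, -1}
    k     : number of nearest neighbors (Sect. 3.1 suggests k = 7)
    omega : weight in Eq. (11) (Sect. 3.1 suggests omega = 1)
    """
    l = K.shape[0]
    diag = np.diag(K)
    # Squared distances in the feature space, Eq. (7)
    d2 = diag[:, None] + diag[None, :] - 2.0 * K
    np.fill_diagonal(d2, np.inf)          # exclude the case i == j

    # k nearest neighbors Omega_i and region radii r_i, Eqs. (8)-(9)
    nn_idx = np.argsort(d2, axis=1)[:, :k]
    r_i = np.take_along_axis(d2, nn_idx, axis=1).max(axis=1)
    r = r_i.mean()                        # r in Eq. (11)

    d = np.zeros(l)
    for i in range(l):
        # Intra-class nearest neighbors Lambda_i, Eq. (10)
        same = nn_idx[i][y[nn_idx[i]] == y[i]]
        # Local density degree, Eq. (11)
        d[i] = np.sum(np.exp(-omega * d2[i, same] / r))
    return d
```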

3.2 Pruning training dataset

To improve classification efficiency, it is necessary to prune samples from large-scale training sets. In this paper, we propose a pruning method based on local density information, as follows:

Firstly, use the algorithm presented in Sect. 3.1 to obtain the local density degree \({d_i}\) of each sample \({x_i}\).

Then, prune from the training dataset the samples whose local density degrees are smaller than \(\sigma \) and retain the samples whose local density degrees are greater than or equal to \(\sigma \), where the pruning threshold \(\sigma \) is determined by the user according to the problem at hand.

Finally, let the pruned training dataset be \(\bar{D} = \{ ({\bar{x}_i},{\bar{y}_i},d_{i}^{'})\}_{i=1}^{\bar{l}} \), denote by the matrices \(\bar{A} \in R^{\bar{l}_+ \times d}\) and \(\bar{B} \in R^{\bar{l}_- \times d}\) the positive and negative samples in \(\bar{D}\), respectively, and let \(d^{'}_{+}\) and \(d^{'}_{-}\) be the weights of \(\bar{A}\) and \(\bar{B}\), respectively. The weights of the remaining samples are scaled as follows:

(12)
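A minimal sketch of this pruning step is given below; the rescaling of the retained weights in Eq. (12) is not reproduced here, and the function name is ours.

```python
import numpy as np

def prune_by_density(X, y, d, sigma):
    """Keep only samples whose local density degree is >= sigma (Sect. 3.2).

    Returns the retained samples, their labels and their raw density
    degrees; the subsequent rescaling of the weights, Eq. (12), is
    applied afterwards and is not reproduced in this sketch.
    """
    keep = d >= sigma
    return X[keep], y[keep], d[keep]
```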

4 The LDTHSVM classifier

To obtain a more accurate and efficient classifier for large-scale training datasets with noise, THSVM is extended to an enhanced version using local density information, called LDTHSVM. Similar to THSVM in spirit, LDTHSVM also constructs a pair of hyperspheres, one for each class, such that each hypersphere covers as many samples of the corresponding class as possible.

Consider the binary classification problem with the pruned dataset \(\bar{D} = \{ ({\bar{x}_i},{\bar{y}_i},{\bar{d}_i})\}_{i=1}^{\bar{l}}\), obtained by the pruning method of Sect. 3.2. Further, denote by the matrices \(\bar{A} \in R^{\bar{l}_+ \times d}\) and \(\bar{B} \in R^{\bar{l}_- \times d}\) the positive and negative samples in \(\bar{D}\), respectively, and let \(\bar{d}^{+} \in R^{\bar{l}_+}\) and \(\bar{d}^{-} \in R^{\bar{l}_-}\) be the weight vectors of \(\bar{A}\) and \(\bar{B}\). LDTHSVM is then formulated as follows:

(13)
(14)

From the primal problem (13) we notice that, unlike THSVM, LDTHSVM does not employ all training samples but only the pruned training dataset, which makes the classifier more robust and efficient when the large-scale training set contains many noise samples. Moreover, the local density degrees of the negative and positive samples are added to the second and third terms of the objective function, respectively; this pushes the positive center away from negative samples with higher local density degrees and encourages the positive hypersphere to cover positive samples with higher local density degrees. Similar conclusions can be drawn from the primal problem (14). Consequently, the LDTHSVM classifier is more suitable for practical applications.

The Lagrangian function of (13) is given by

(15)

where \({\alpha _i} \ge 0\), \({r_i} \ge 0\), \(\lambda \ge 0\), \(i = 1,\ldots ,{\bar{l}_+}\) are the Lagrangian multipliers. According to the Karush–Kuhn–Tucker Theorem, the following conditions are satisfied:

(16)
(17)
$$\begin{aligned}&\frac{{{c_1}}}{{{{\bar{l}}_ + }}}{\bar{d}_i^+}\!-\!{\alpha _i}\!-\!{r_i}\!=\!0\!\Rightarrow \!0\!\le \!{\alpha _i}\!\le \!\frac{{{c_1}}}{{{{\bar{l}}_ + }}}{\bar{d}_i^+}, \; i\!=\!1,\ldots ,\bar{l}_ +, \end{aligned}$$
(18)
$$\begin{aligned}&\left\| {\varphi ({{\bar{A}}_i}) - {a_ + }} \right\| ^2 \le R_ + ^2 + {\xi _i}, \; i=1,\ldots ,\bar{l}_ +, \end{aligned}$$
(19)
$$\begin{aligned}&{\alpha _i}\left( \left\| {\varphi ({{\bar{A}}_i}) - {a_ + }} \right\| ^2 - R_ + ^2 - {\xi _i}\right) = 0, \; {\alpha _i} \ge 0, \; i=1,\ldots ,\bar{l}_ +, \nonumber \\\end{aligned}$$
(20)
$$\begin{aligned}&{r_i}{\xi _i} = 0, \; {\xi _i} \ge 0, \; {r_i} \ge 0, \; i=1,\ldots ,\bar{l}_ +, \end{aligned}$$
(21)
$$\begin{aligned}&\lambda R_ + ^2 = 0, \; R_ + ^2 \ge 0, \; \lambda \ge 0. \end{aligned}$$
(22)

According to (16), (17) and (22), the center of the positive hypersphere is obtained as follows:

(23)

   Substituting (17), (18) and (23) into (15) and discarding the constant terms, we obtain the following dual problem of (13):

(24)

where and .

According to (18)–(21), we obtain

(25)

where \(\bar{I}_R^ + = \left\{ {i|0< {\alpha _i} < \frac{{{c_1}}}{{{{\bar{l}}_ + }}}{{\bar{d}}_i^+}, i=1,\ldots ,\bar{l}_ +} \right\} \).

Similarly, we can get the simplified dual optimal problem of (14) as follows:

(26)

where and .

Similarly, the center \({a_ - }\) and radius \({R_ - }\) of the negative class are calculated as follows:

(27)
$$\begin{aligned} R_ - ^2= & {} \frac{1}{{\left| {\bar{I}_R^ - } \right| }}\sum \limits _{j=1}^{{\left| {\bar{I}_R^ - } \right| }} {{{\left\| {\varphi ({{\bar{B}}_j}) - {a_ - }} \right\| }^2}}, \end{aligned}$$
(28)

where \(\bar{I}_R^ - = \left\{ {j|0< {\beta _j} < \frac{{{c_2}}}{{{{\bar{l}}_ - }}}{{\bar{d}}_j^-},j=1,\ldots ,\bar{l}_ -} \right\} \).

A new sample \({x} \in {R^d}\) is assigned to the positive class or negative class, depending on which of the two hyperspheres it lies closest to. Therefore, the decision function is defined as follows:

$$\begin{aligned} f(x)= & {} \hbox {sgn} \left\{ \frac{({\varphi (x) - {a_+}})^T({\varphi (x) - {a_+}}) }{{R_+ ^2}}\right. \nonumber \\&-\left. \frac{({\varphi (x) - {a_- }})^T({\varphi (x)-{a_-}}) }{{R_- ^2}}\right\} , \end{aligned}$$
(29)

where

(30)
(31)
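Since the centers produced by the duals lie in the span of the mapped training samples, each center can be written as \(a_\pm = \sum _i c_i^\pm \varphi (z_i)\) for some coefficient vector obtained from (23) or (27). Under that standard representation (the explicit coefficients, as well as (30) and (31), are not reproduced here), the decision rule (29) can be evaluated using kernel values only, as in the following Python sketch; the helper names and argument layout are our own.

```python
import numpy as np

def sq_dist_to_center(Kxx, Kxz, Kzz, c):
    # ||phi(x) - a||^2 with a = sum_i c[i] * phi(z_i), computed from kernel values:
    # K(x, x) - 2 * sum_i c[i] K(x, z_i) + sum_{i,j} c[i] c[j] K(z_i, z_j)
    return Kxx - 2.0 * Kxz @ c + c @ Kzz @ c

def ldthsvm_predict(Kxx, Kxz, Kzz, c_pos, c_neg, R2_pos, R2_neg):
    # Assign x to the class of the hypersphere it lies relatively closest to,
    # i.e. the one with the smaller radius-normalized squared distance,
    # which is the comparison carried out by the sign in Eq. (29).
    d_pos = sq_dist_to_center(Kxx, Kxz, Kzz, c_pos) / R2_pos
    d_neg = sq_dist_to_center(Kxx, Kxz, Kzz, c_neg) / R2_neg
    return 1 if d_pos < d_neg else -1
```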

5 Computational complexity of LDTHSVM

In this section, we further analyze the computational complexity of our LDTHSVM. There are three main steps in our LDTHSVM:

  1. Calculating the local density degrees of all training samples;

  2. Pruning the training dataset;

  3. Solving the optimization problems (24) and (26).

The main computational cost lies in calculating the k nearest neighbors of all training samples in step 1 and in solving the optimization problems in step 3. The computational complexity of finding the k nearest neighbors of all training samples is \(O({l^2}\log l)\), and the computational complexity of solving the optimization problems is \(O({\bar{l}_ + }^3 + {\bar{l}_ - }^3)\), where \({\bar{l}_ + } \ll {l_ + }\) and \({\bar{l}_ - } \ll {l_ - }\) when the training dataset contains many noise samples. Therefore, the overall computational complexity of LDTHSVM is about \({O({l^2}\log l + {\bar{l}_ + }^3 + {\bar{l}_ - }^3)}\).

6 Experiments

In this section, we investigate the classification performance of our LDTHSVM on publicly available benchmark datasets as well as on a synthetic dataset. In the experiments, we compare LDTHSVM with other classical algorithms, including THSVM, WLTSVM [32], TWSVM and SVM. Parameter selection is very important for these algorithms, and exhaustive search is still the most popular method for determining the parameters [5, 10, 22, 24, 25]. To reduce the computational cost of parameter selection, we set \({c_1} = {c_2} = c\) and \({v_1} = {v_2} = v\). For each algorithm, the optimal parameter c is searched from the set \(\{ {2^i}|i = 0,1, \ldots ,10\} \), v from the set \(\{ 0.1,0.2, \ldots ,0.9\} \) and the pruning threshold \(\sigma \) from the set \(\{ 0.1,0.2, \ldots ,0.5\} \) on a validation set comprising 30% of the training samples. Once all parameters are determined, the validation sets are returned to the training datasets to construct the final classifiers.
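This selection procedure can be organized as a simple exhaustive grid search on a 30% validation split, roughly as in the sketch below; the routines train_fn and score_fn are hypothetical placeholders for training and evaluating an LDTHSVM model.

```python
import itertools
import numpy as np

# Candidate grids from Sect. 6 (with c1 = c2 = c and v1 = v2 = v)
c_grid = [2.0 ** i for i in range(0, 11)]
v_grid = [0.1 * i for i in range(1, 10)]
sigma_grid = [0.1 * i for i in range(1, 6)]

def select_parameters(X, y, train_fn, score_fn, seed=0):
    """Exhaustive search on a validation split of 30% of the training samples.

    train_fn(X, y, c, v, sigma) -> model and score_fn(model, X, y) -> accuracy
    are placeholders for the LDTHSVM training and evaluation routines.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(0.3 * len(y))
    val, tr = idx[:n_val], idx[n_val:]
    best_params, best_acc = None, -np.inf
    for c, v, sigma in itertools.product(c_grid, v_grid, sigma_grid):
        model = train_fn(X[tr], y[tr], c, v, sigma)
        acc = score_fn(model, X[val], y[val])
        if acc > best_acc:
            best_params, best_acc = (c, v, sigma), acc
    return best_params
```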

Fig. 1 Classification results of the LDTHSVM, THSVM, TWSVM and SVM for the synthetic dataset with linear kernels. The thick curves represent the separating plane, while the thin curves represent two nonparallel hyperplanes or a pair of hyperspheres. For LDTHSVM, the circled samples are pruned

6.1 Synthetic data with noise

In this subsection, to show the effectiveness of LDTHSVM intuitively, we use a synthetic dataset. The toy 2-D dataset is randomly generated from two Gaussian distributions: positive class \(N({(0,0)^T},\hbox {diag}\{ 0.5,0.5\} )\) and negative class \(N({(2,2)^T},\hbox {diag}\{ 0.25,0.25\} )\). The training dataset consists of 440 samples (220 per class, of which 20 samples have wrong labels), and the test dataset consists of 4000 samples (2000 per class).
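For reproducibility, the toy dataset can be generated roughly as follows (a sketch; the random seed and the way the 20 wrong labels are injected are our own choices, since the text does not specify them):

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is our choice

def make_toy_set(n_per_class, n_flipped=0):
    # Two Gaussian classes as described in Sect. 6.1
    pos = rng.multivariate_normal([0, 0], np.diag([0.5, 0.5]), n_per_class)
    neg = rng.multivariate_normal([2, 2], np.diag([0.25, 0.25]), n_per_class)
    X = np.vstack([pos, neg])
    y = np.hstack([np.ones(n_per_class), -np.ones(n_per_class)])
    # Flip the labels of n_flipped randomly chosen samples (label noise)
    flip = rng.choice(len(y), size=n_flipped, replace=False)
    y[flip] = -y[flip]
    return X, y

X_train, y_train = make_toy_set(220, n_flipped=20)  # 440 training samples
X_test, y_test = make_toy_set(2000)                 # 4000 test samples
```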

Figure 1 shows the classification results of the LDTHSVM, THSVM, TWSVM and SVM on the two-Gaussian dataset with linear kernels. From Fig. 1, we can draw the following conclusions. First, THSVM, TWSVM and SVM are significantly influenced by the noise samples, especially by the samples with wrong labels. Second, LDTHSVM effectively removes most of the wrongly labeled samples and a small portion of the isolated samples, and it further suppresses the interference of the remaining noise samples by introducing the local density of samples into the classifier; this makes the separating curves around the positive and negative classes tighter and moves the centers of the positive and negative hyperspheres closer to the means of the two Gaussian distributions. In other words, LDTHSVM effectively depicts the true distribution of the two classes. Table 1 reports the detailed classification results, from which we observe that LDTHSVM obtains better accuracy than THSVM, TWSVM and SVM. Although the training speed of LDTHSVM is slightly slower than that of THSVM and TWSVM, it is significantly faster than that of SVM.

Table 1 Classification performance of the LDTHSVM, THSVM, TWSVM and SVM for the synthetic dataset with linear kernels

6.2 Benchmark datasets

In this subsection, to further investigate the classification performance of LDTHSVM, we run LDTHSVM, THSVM, WLTSVM, TWSVM and SVM on publicly available benchmark datasets from the UCI Repository. In these experiments, we only consider the Gaussian kernel \(K(x_1,x_2) = {e^{ - \gamma {{\left\| {x_1 - x_2} \right\| }^2}}}\), and the parameter \(\gamma \) is selected from the range \(\{ {2^i}|i = - 9, - 8, \ldots ,10\} \). We use tenfold cross-validation to estimate the classification accuracy of each algorithm. Table 2 lists the classification accuracies and training times of LDTHSVM, THSVM, WLTSVM, TWSVM and SVM. From Table 2, it can be observed that our LDTHSVM obtains better classification accuracies on most datasets than THSVM, WLTSVM, TWSVM and SVM, which indicates that LDTHSVM can effectively suppress the interference of noise samples. Furthermore, LDTHSVM is less efficient than THSVM; one possible reason is that these training datasets do not contain many noise samples that can be pruned. Even so, the training time of LDTHSVM is close to that of TWSVM.

Table 2 Classification performance of the LDTHSVM, THSVM, WLTSVM, TWSVM and SVM for the benchmark datasets with Gaussian kernels
Table 3 Rank on classification accuracy of five classifiers for benchmark datasets

6.3 Friedman test

From Table 2, we notice that no single algorithm outperforms all the others on every dataset in terms of classification accuracy. In this subsection, to statistically analyze the classification performance of the five algorithms on multiple datasets, we use the Friedman test [1, 4]. The ranks of the five classifiers on classification accuracy for all datasets are listed in Table 3. The Friedman statistic is calculated according to (32):

$$\begin{aligned} \chi _F^2 = \frac{{12q}}{{p(p + 1)}}\left[ \sum \limits _{i = 1}^p {R_i^2} - \frac{{p{{(p + 1)}^2}}}{4}\right] , \end{aligned}$$
(32)

where \({R_i} = \frac{1}{q}\sum \nolimits _{j = 1}^q {r_i^j}\) and \(r_i^j\) denotes the rank of the ith of p classifiers on the jth of q datasets. Friedman's \(\chi _F^2\) is undesirably conservative, so we use the better statistic

$$\begin{aligned} {F_F} = \frac{{(q - 1)\chi _F^2}}{{q(p - 1) - \chi _F^2}}, \end{aligned}$$
(33)
which is distributed according to \(F(p-1,(p-1)(q-1))\).

Fig. 2 Classification accuracy for different values of the parameter k on different datasets

Fig. 3 Illustration of the handwritten digits

Table 4 Classification results of the LDTHSVM, THSVM, TWSVM and SVM on the USPS dataset with a linear kernel

According to (32) and (33), we obtain \(\chi _F^2 = 14.95\) and \(F_F = 6.87\), where \(F_F\) is distributed according to F(4, 24). The critical value of F(4, 24) is 2.19 at the significance level \(\alpha = 0.1\) and 2.78 at \(\alpha = 0.05\). Since the critical value is smaller than \(F_F\), there is a significant difference among the five classifiers. Moreover, Table 3 shows that the average rank of LDTHSVM is lower than that of the other classifiers, which implies that LDTHSVM has better accuracy than the other classifiers.
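For completeness, these statistics can be recomputed from the ranks in Table 3 (not reproduced here) with p = 5 classifiers and q = 7 datasets, so that \((p-1)(q-1) = 24\); a minimal sketch:

```python
import numpy as np

def friedman_statistics(ranks):
    """ranks: (q, p) array, ranks[j, i] = rank of classifier i on dataset j.

    Returns (chi_F^2, F_F) as defined in Eqs. (32) and (33).
    """
    q, p = ranks.shape
    R = ranks.mean(axis=0)  # average rank of each classifier
    chi2 = 12.0 * q / (p * (p + 1)) * (np.sum(R ** 2) - p * (p + 1) ** 2 / 4.0)
    F = (q - 1) * chi2 / (q * (p - 1) - chi2)
    return chi2, F

# With the ranks of Table 3 (p = 5, q = 7), this yields
# chi_F^2 = 14.95 and F_F = 6.87, to be compared with F(4, 24).
```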

6.4 Analysis on the parameter k

In this subsection, we further analyze the effect of the parameter k on classification performance. Figure 2 shows the classification accuracies for different values of k on different datasets. From Fig. 2, we observe that smaller values of k rarely give the best results, mainly because many useful neighbors are lost when calculating the local density degrees of the training samples; larger values of k also cannot guarantee the best classification accuracy, mainly because many noise samples may be introduced. Through a large number of experiments, we notice that \(k=7\) generally obtains satisfactory performance.

6.5 Handwritten digits recognition

We use LDTHSVM to recognize handwritten digits in this subsection. The USPS dataset, a publicly available database for handwritten digit recognition, is used to evaluate our LDTHSVM. The USPS dataset contains 11000 8-bit grayscale images of handwritten digits, with 1100 images per digit, as shown in Fig. 3.

The classification results of the four classifiers are shown in Table 4. From Table 4, we can see that our LDTHSVM achieves better classification accuracy than SVM, TWSVM and THSVM.

7 Conclusions

In this paper, an improved THSVM classifier, called LDTHSVM, has been presented. The proposed LDTHSVM inherits the good properties of THSVM: it solves a pair of smaller sized optimization problems, avoids matrix inversions in its dual QPPs, and directly uses the kernel trick to solve nonlinear problems, as in the SVM. Moreover, unlike THSVM, LDTHSVM prunes the training dataset according to the local density degrees of the training samples, introduces these local density degrees into THSVM, and reconstructs the classification model on the pruned training dataset. The classification results on synthetic and publicly available benchmark datasets show that LDTHSVM obtains better classification performance than THSVM, WLTSVM, TWSVM and SVM, especially on large-scale datasets that contain many noise samples.