1 Introduction

As a powerful supervised learning algorithm built on solid theoretical foundations, the support vector machine (SVM) (Cristianini and Shawe-Taylor 2000; Vapnik 1995) has gained substantial attention in many research areas. By simultaneously minimizing a regularization term and the hinge loss function, SVM implements the structural risk minimization principle rather than the traditional empirical risk minimization principle. Due to its remarkable characteristics, such as good generalization performance, the absence of local minima and the sparse representation of its solution, SVM and its variants (Zhao et al. 2019; Melki et al. 2018) have been successfully applied to a wide range of applications, including classification, regression, clustering, representation learning and so on. However, there is still considerable room for improvement in the traditional SVM.

On the one hand, the use of the hinge loss function makes SVM sensitive to re-sampling and to noise around the decision boundary. To alleviate this disadvantage, various methods have been proposed so far (Bi and Zhang 2004; Shivaswamy et al. 2006; Xu et al. 2009; Zhong 2012; Wang et al. 2015; Wang and Zhong 2014). Recently, motivated by the link between the hinge loss and the shortest distance, Huang et al. (2014) proposed a new SVM classifier with the pinball loss (Pin-SVM). The pinball loss, which shares many good properties and brings noise insensitivity to classification, is related to the quantile distance (Koenker 2005; Jumutc et al. 2013) between two classes. Theoretical analysis and experimental results show that Pin-SVM is more stable than the traditional SVM on noise-corrupted data. Later, to speed up the training of Pin-SVM, Huang et al. (2014) further exploited the expectile distance as a surrogate of the quantile distance and proposed a new SVM classifier with the asymmetric squared loss (aLS-SVM). This is motivated by the fact that the expectile value, which is related to minimizing the asymmetric squared loss (Lu et al. 2018), has similar statistical properties to the quantile value. In other words, aLS-SVM is an approximation of Pin-SVM that can be solved efficiently.

On the other hand, a main challenge for the standard SVM is its dependence on sufficient supervised information. In many real-world applications, such as natural language parsing (Tur et al. 2005), spam filtering (Guzella and Caminhas 2009), and video surveillance (Zhang et al. 2011), acquiring enough labeled data is usually difficult while unlabeled data are available in large quantities. In such situations, the performance of SVM usually deteriorates since the information carried by the unlabeled data is simply ignored. To handle this problem, semi-supervised learning (SSL) has been proposed and has become an efficient paradigm (Chapelle et al. 2006; Scardapane et al. 2016; Li et al. 2017; Calma et al. 2018), especially SSL with manifold regularization (MR) (Belkin et al. 2006; Chen et al. 2014), which captures the geometric information of both labeled and unlabeled data and enforces the smoothness of classifiers along the intrinsic manifold via an additional regularization term. By adding the MR term to the conventional SVM, Belkin et al. (2006) presented the classical Laplacian SVM (LapSVM), in which the geometric information embedded in the abundant unlabeled data is fully exploited to build more reasonable classifiers. Since then, a number of SVM-based SSL algorithms have emerged under the semi-supervised MR learning framework (Sun 2013; Khemchandani and Pal 2016; Pei et al. 2017a, b). However, similar to SVM, LapSVM is sensitive to noise-corrupted data, since it also measures the margin between two classes by the minimal distance, which is related to the hinge loss.

In this paper, inspired by the studies above, we propose a novel semi-supervised support vector machine with the asymmetric squared loss (asy-LapSVM). Our motivation rests on two facts: the MR term is able to encode the geometric information embedded in the unlabeled data, and the expectile distance is stable to noise-corrupted data. In other words, the proposed asy-LapSVM is expected to make full use of the abundant unlabeled data while remaining stable on noise-corrupted data. Moreover, to speed up the training process, we present a simple and efficient functional iterative method that replaces the traditional quadratic programming (QP) for solving the involved optimization problems. The convergence of the iterative method is verified both theoretically and experimentally. Experimental results on a number of commonly used datasets show that the proposed asy-LapSVM achieves significant performance improvements compared with several popular supervised learning (SL) and SSL algorithms.

In summary, by combining SSL with the asymmetric squared loss, the proposed algorithm offers the following advantages:

  • We construct a robust LapSVM framework by adopting an asymmetric squared loss, and it can be solved efficiently with the help of a simple functional iterative method.

  • The proposed model belongs to inductive learning and naturally handles out-of-sample data, which avoids expensive graph recomputation.

  • The extensive comparison experiments with several related methods on widely used benchmark datasets demonstrate the effectiveness of the proposed method.

Moreover, compared with the very recently published algorithms, such as the pure SSL algorithms in references (Huang et al. 2014; Ma et al. 2019), the proposed asy-LapSVM not only maintains their primary advantages, such as the ability to make use of unlabeled data and to handle unseen data directly in the testing phase, but also makes them less sensitive to noise-corrupted data (feature noise). As for the robust SSL algorithms in references (Du et al. 2019; Pei et al. 2018; Gu et al. 2019), they are all designed for outliers among the labeled data (label noise).

The remainder of this paper is organized as follows. Section 2 briefly outlines the background. The novel asy-LapSVM is proposed in Sect. 3, which includes both the linear and nonlinear cases. Moreover, the proof of convergence of the iterative algorithm is given. After presenting the experimental results on multiple datasets in Sect. 4, we conclude this paper in Sect. 5.

2 Background

In this section, some background knowledge of the proposed algorithm including the asymmetric squared loss function and the semi-supervised manifold regularization learning framework is briefly reviewed.

2.1 Asymmetric squared loss function

In the field of machine learning, loss function is usually one of the key issues in designing learning algorithms since most problems require it to describe the cost of the discrepancy between the prediction and the observation. In fact, the use of the loss function can be traced back to a long time ago. For example, the least-square loss function for regression was already employed by Legendre, Gauss, and Adrain in the early 19th century (Steinwart and Christmann 2008). At present, various margin-based loss functions, such as hinge loss, squared loss, exponential loss, logistic loss, and brown boost loss have been exploited to search for the optimal classification and regression functions.

In a binary classification problem, a large margin between two classes plays an important role in obtaining a good classifier. Traditionally, the margin is measured by the minimal distance between the two sets, which is related to the hinge loss (1) or the squared hinge loss (2):

$$\begin{aligned} v_{\mathrm{hin}}(r)= & {} \max \{0,r\}, \end{aligned}$$
(1)
$$\begin{aligned} v_{\mathrm{shin}}(r)= & {} \max \{0,r\}^{2}, \end{aligned}$$
(2)

where \(r=1-yf({\varvec{x}})\), in which \({\varvec{x}}\in R^{d}\) is the input sample, \(y\in \{+1,-1\}\) is the corresponding output, and \(f({\varvec{x}})\) is the prediction. However, measuring the margin by the minimal value makes the corresponding classifiers sensitive to re-sampling and to noise around the decision boundary. To overcome this weakness, Huang et al. (2014) employed the quantile value, which has been deeply studied and widely applied in regression problems, as a surrogate of the minimal value to measure the margin between classes, and proposed the following pinball loss (3), which brings noise insensitivity:

$$\begin{aligned} v_{\mathrm{pin}}(r)= {\left\{ \begin{array}{ll} ur,&{}\quad r\ge 0, \\ -(1-u)r, &{}\quad r<0 , \end{array}\right. } \end{aligned}$$
(3)

where \(0<u< 1\). Different from the hinge loss, the pinball loss gives an additional penalty to correctly classified samples, so it can be regarded as an extension of the hinge loss. However, the pinball loss is non-smooth and its minimization is more difficult than that of smooth loss functions. Hence, Huang et al. (2014) modified the margin measure by adopting the expectile value, which is related to the following smooth asymmetric squared loss:

$$\begin{aligned} v_{\mathrm{asy}}(r)= {\left\{ \begin{array}{ll} ur^{2},&{}\quad r \ge 0, \\ (1-u)r^{2}, &{}\quad r<0 . \end{array}\right. } \end{aligned}$$
(4)

The property of asymmetric squared loss is similar to that of the pinball loss. From the definition of \(v_{\mathrm{asy}}(\cdot )\), one observes that when the value of u is 1, it becomes the squared hinge loss \(v_{\mathrm{shin}}(\cdot )\). Similar to the difference between the pinball loss and the hinge loss, the asymmetric squared loss, which gives an asymmetric penalty for negative and positive losses, can be seen as a generalized squared hinge loss. The plots of \(v_{\mathrm{hin}}(r)\), \(v_{\mathrm{shin}}(r)\), \(v_{\mathrm{pin}}(r)\) and \(v_{\mathrm{asy}}(r)\) with \(u=0.83\) are shown in Fig. 1.

Fig. 1 Plots of loss functions
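
For concreteness, a minimal NumPy sketch of the four loss functions (1)–(4) is given below; the function names and the evaluation grid are only illustrative.

```python
import numpy as np

def hinge(r):
    # Hinge loss (1): max{0, r}
    return np.maximum(0.0, r)

def squared_hinge(r):
    # Squared hinge loss (2): max{0, r}^2
    return np.maximum(0.0, r) ** 2

def pinball(r, u=0.83):
    # Pinball loss (3): u*r for r >= 0 and -(1-u)*r for r < 0
    return np.where(r >= 0, u * r, -(1.0 - u) * r)

def asymmetric_squared(r, u=0.83):
    # Asymmetric squared loss (4): u*r^2 for r >= 0 and (1-u)*r^2 for r < 0
    return np.where(r >= 0, u * r ** 2, (1.0 - u) * r ** 2)

# r = 1 - y * f(x); evaluate the losses on a grid of residuals
r = np.linspace(-2.0, 2.0, 9)
print(asymmetric_squared(r, u=0.83))
```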

2.2 Semi-supervised manifold regularization learning framework

The idea of regularization, which is widely used in machine learning, has its root in mathematics to solve ill-posed problems (Tikhonov 1963). A number of popular learning algorithms can be interpreted as a supervised regularization learning framework that consists of different loss functions and complexity measures in an appropriately chosen Reproducing Kernel Hilbert Space (RKHS).

Given a set of labeled samples \({\mathcal {S}}_{l}=\{{\varvec{x}}_{i},y_{i}\}_{i=1}^{l}\) generated according to a probability distribution, the standard supervised regularization learning framework can be written as:

$$\begin{aligned} \min \limits _{f \in {\mathcal {H}}_{k} }&\sum \limits ^{l}_{i=1}v(r_{i})+\lambda _{1}\Vert f\Vert ^{2}_{k}, \end{aligned}$$
(5)

where \(r_{i}=1-y_{i}f({\varvec{x}}_{i})\) for \(i=1,\ldots ,l\), v stands for some loss function on the labeled samples, and \(\lambda _{1}\) is the weight of \(\Vert f\Vert ^{2}_{k}\), which controls the complexity of the unknown function f; here \(\Vert \cdot \Vert _{k}\) is the norm induced by a Mercer kernel k in the RKHS \({\mathcal {H}}_{k}\). Common choices for k include the linear kernel, the polynomial kernel (POLY) and the radial basis function (RBF) kernel. In the numerical experiments, the RBF kernel (6), which has been reported to outperform the others (Lu et al. 2018), will be employed:

$$\begin{aligned} k({\varvec{x}}_{i},{\varvec{x}}_{j})=\mathrm{exp}(-\sigma \Vert {\varvec{x}}_{i}-{\varvec{x}}_{j}\Vert ^{2}), \end{aligned}$$
(6)

where \(\sigma \) is the kernel parameter.
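
As an illustration, the Gram matrix induced by the RBF kernel (6) can be computed as in the following NumPy sketch (the function name is ours):

```python
import numpy as np

def rbf_gram(X, sigma):
    # K_ij = exp(-sigma * ||x_i - x_j||^2) as in (6); X is an (n, d) array
    sq_dists = np.sum(X ** 2, axis=1, keepdims=True) \
               + np.sum(X ** 2, axis=1) - 2.0 * X @ X.T
    return np.exp(-sigma * np.maximum(sq_dists, 0.0))
```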

Clearly, the excellent performance of the standard supervised regularization learning framework (5) is premised on sufficient labeled samples. However, in many real-world applications, the acquisition of labeled data is generally more difficult than the collection of unlabeled data. In such situations, a learning framework that can make full use of unlabeled data to improve recognition performance is of great potential significance. In light of this idea, by adding a manifold regularization term \(\Vert f\Vert ^{2}_{{\mathcal {I}}}\) to formulation (5), Belkin et al. (2006) proposed the following semi-supervised manifold regularization learning framework:

$$\begin{aligned} \min \limits _{f \in {\mathcal {H}}_{k} }&\sum \limits ^{l}_{i=1}v(r_{i})+ \lambda _{1}\Vert f\Vert ^{2}_{k}+ \lambda _{2}\Vert f\Vert ^{2}_{{\mathcal {I}}}, \end{aligned}$$
(7)

where \(\lambda _{2}\) is the weight of the manifold regularization term \(\Vert f\Vert ^{2}_{{\mathcal {I}}}\) (the intrinsic norm on the low-dimensional manifold), which enforces the smoothness of f along the intrinsic manifold.

Given l labeled and m unlabeled data in the training set

$$\begin{aligned} {\mathcal {S}}={\mathcal {S}}_{l}\cup {\mathcal {S}}_{u}=\{{\varvec{x}}_{i},y_{i}\}_{i=1}^{l}\cup \{{\varvec{x}}_{i}\}_{i=l+1}^{n}, \end{aligned}$$
(8)

where \({\varvec{x}}_{i}\in R^{d}, i=1,\ldots ,n\) (\(n=l+m\) is the number of training samples), and \(y_{i}\in \{-1,1\}\) for \(i=1,\ldots ,l\). \({\mathcal {S}}_{u}\) denotes a set of m unlabeled data which are drawn according to a marginal distribution. The manifold regularization term \(\Vert f\Vert ^{2}_{{\mathcal {I}}}\) can be reexpressed as:

$$\begin{aligned} \Vert f\Vert ^{2}_{{\mathcal {I}}} =\sum \limits ^{n}_{i=1}\sum \limits ^{n}_{j=1} W_{ij}(f({\varvec{x}}_{i})-f({\varvec{x}}_{j}))^2={\mathbf{f }}^{T}L{\mathbf{f }}, \end{aligned}$$
(9)

where \({\mathbf{f }}=[f({\varvec{x}}_{1}), \ldots , f({\varvec{x}}_{n})]^{T}\) is the vector of the n values of f on the training data and L is the graph Laplacian associated with \({\mathcal {S}}\), given by \(L=G-W\). Here W is the adjacency matrix, which can be defined by the p nearest neighbors, and its non-negative edge weight \(W_{ij}\) represents the similarity between each pair of input instances; G is a diagonal matrix whose i-th diagonal element \(G_{ii}=\mathop \sum \nolimits ^{n}_{j=1}W_{ij}\) is the weight degree of vertex i. When the manifold regularization term (9) is used in the semi-supervised regularization learning framework (7), it can be understood as follows: if the samples \({\varvec{x}}_{i}\) and \({\varvec{x}}_{j}\) are highly similar (\(W_{ij}\) is large), then a large difference between \(f({\varvec{x}}_{i})\) and \(f({\varvec{x}}_{j})\) incurs a heavy penalty.
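
To make the construction of \(L=G-W\) concrete, the sketch below builds a p-nearest-neighbor adjacency matrix with binary edge weights (the setting used later in Step 1 of the algorithm summary) and the corresponding graph Laplacian; it is an illustrative implementation under these assumptions, not the authors' code.

```python
import numpy as np

def graph_laplacian(X, p=5):
    # X: (n, d) array containing all labeled and unlabeled samples
    n = X.shape[0]
    sq_dists = np.sum(X ** 2, axis=1, keepdims=True) \
               + np.sum(X ** 2, axis=1) - 2.0 * X @ X.T
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(sq_dists[i])[1:p + 1]  # p nearest neighbors, excluding x_i itself
        W[i, nn] = 1.0                         # binary edge weights
    W = np.maximum(W, W.T)                     # symmetrize the adjacency matrix
    G = np.diag(W.sum(axis=1))                 # degree matrix, G_ii = sum_j W_ij
    return G - W                               # graph Laplacian L = G - W
```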

By applying the hinge loss function to the semi-supervised manifold regularization learning framework (7), Belkin et al. (2006) proposed the classical LapSVM, which exploits the geometric information of the marginal distribution embedded in unlabeled data to construct a more reasonable classifier. Due to its excellent performance, LapSVM has become a popular choice for SSL. More details about LapSVM can be found in Belkin et al. (2006).

3 Semi-supervised SVM with asymmetric squared loss

In this section, we elaborate on the formulation of the proposed semi-supervised SVM with the asymmetric squared loss (asy-LapSVM) for semi-supervised binary classification problems. After giving the detailed derivations of the proposed asy-LapSVM in the linear and nonlinear cases, we present a simple and efficient functional iterative method to optimize them. We then prove the convergence of the proposed functional iterative method in theory, and finally summarize the proposed asy-LapSVM.

3.1 Linear asy-LapSVM

Although LapSVM has shown good generalization, the hinge loss function it employs is sensitive to noise. To further improve its generalization performance, we provide a stable asy-LapSVM, especially for noise-corrupted data. Specifically, unlike LapSVM, we exploit the asymmetric squared loss (4) as a surrogate of the hinge loss in the proposed linear asy-LapSVM. Meanwhile, we maximize the margin measured by the expectile distance between the two classes by optimizing with respect to both the weight vector \({\varvec{w}}\) and the bias term b of the linear asy-LapSVM classifier \(f({\varvec{x}})= {\varvec{w}}^{T}{\varvec{x}}+b\). Thus, the regularization term \(\Vert f\Vert ^{2}_{k}\) can be expressed as:

$$\begin{aligned} \Vert f\Vert ^{2}_{k}=\frac{1}{2}(\Vert {\varvec{w}}\Vert ^{2}+b^{2}). \end{aligned}$$
(10)

As for the manifold regularization term \(\Vert f\Vert ^{2}_{{\mathcal {I}}}\), it has the following form:

$$\begin{aligned} \Vert f\Vert ^{2}_{{\mathcal {I}}}={\mathbf{f }}^{T}L{\mathbf{f }}={\varvec{w}}^{T}D^{T}LD{\varvec{w}}, \end{aligned}$$
(11)

where \(D=[{\varvec{x}}_{1},\ldots ,{\varvec{x}}_{n}]^{T}\) and \({\mathbf{f }}=[f({\varvec{x}}_{1}), \dots ,f({\varvec{x}}_{n})]^{T}= [{\varvec{w}}^{T}{\varvec{x}}_{1}, \dots ,{\varvec{w}}^{T}{\varvec{x}}_{n}]^{T}=D{\varvec{w}}\), in which the bias term b is deliberately omitted for convenience.

By substituting the asymmetric squared loss (4), the regularization term (10) and the manifold regularization term (11) into the semi-supervised manifold regularization learning framework (7), the primal problem of the linear asy-LapSVM can be formulated as

$$\begin{aligned} \begin{array}{llll} \min \limits _{{\varvec{w}},b,\xi } &{} \frac{1}{2}(\Vert {\varvec{w}}\Vert ^{2}+b^{2})+ \frac{c}{2}{\varvec{\xi }}^{T}{\varvec{\xi }}+\frac{\lambda }{2}{\varvec{w}}^{T}D^{T}LD{\varvec{w}}\\ \mathrm{s.t.} &{} Y(D{\varvec{w}}+{\varvec{e}}b)\ge {\varvec{e}}_{l}-\frac{1}{u}{\varvec{\xi }},\\ &{} Y(D{\varvec{w}}+{\varvec{e}}b)\le {\varvec{e}}_{l}+\frac{1}{1-u}{\varvec{\xi }}, \end{array} \end{aligned}$$
(12)

where \(0<u<1\), c and \(\lambda \) are the regularization parameters, \({\varvec{\xi }}\) is the error variable vector, \({\varvec{e}}_{l}\) is the vector of ones of dimension l, \({\varvec{e}}\) is the vector of ones of dimension n, and \(Y\in R^{l\times n}\) is a matrix with elements \(Y_{ii}=y_{i}\) and zeros elsewhere.

Introducing the nonnegative Lagrange parameter vectors \({\varvec{\alpha }}_{1}\) and \({\varvec{\alpha }}_{2}\), the Lagrangian function for the problem (12) can be expressed as

$$\begin{aligned} \begin{aligned} {\mathcal {L}}({\varvec{w}},b,{\varvec{\xi }},{\varvec{\alpha }}_{1},{\varvec{\alpha }}_{2})&=\frac{1}{2}(\Vert {\varvec{w}}\Vert ^{2}+b^{2})+\frac{c}{2}{\varvec{\xi }}^{T}{\varvec{\xi }}+ \frac{\lambda }{2}{\varvec{w}}^{T}D^{T}LD{\varvec{w}} \\&\quad -{\varvec{\alpha }}_{1}^{T}\left[ Y(D{\varvec{w}} +{\varvec{e}}b)-{\varvec{e}}_{l}+\frac{1}{u}{\varvec{\xi }}\right] \\&\quad -{\varvec{\alpha }}_{2}^{T} \left[ -Y(D{\varvec{w}}+{\varvec{e}}b)+{\varvec{e}}_{l}+ \frac{1}{1-u}{\varvec{\xi }}\right] . \end{aligned} \end{aligned}$$
(13)

According to the following Karush–Kuhn–Tucker (KKT) conditions

$$\begin{aligned}&\frac{\partial {\mathcal {L}}}{\partial {\varvec{w}}} =(I+\lambda D^{T}LD){\varvec{w}}-D^{T}Y^{T}({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})=0, \end{aligned}$$
(14)
$$\begin{aligned}&\frac{\partial {\mathcal {L}}}{\partial b} =b-{\varvec{e}}^{T}Y^{T}({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})=0, \end{aligned}$$
(15)
$$\begin{aligned}&\frac{\partial {\mathcal {L}}}{\partial {\varvec{\xi }}}=c{\varvec{\xi }}- \left( \frac{{\varvec{\alpha }}_{1}}{u}+\frac{{\varvec{\alpha }}_{2}}{1-u}\right) =0, \end{aligned}$$
(16)

we get

$$\begin{aligned}&{\varvec{w}}=(I+\lambda D^{T}LD)^{-1}D^{T}Y^{T}({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2}), \end{aligned}$$
(17)
$$\begin{aligned}&b={\varvec{e}}^{T}Y^{T}({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2}), \end{aligned}$$
(18)
$$\begin{aligned}&{\varvec{\xi }}=\frac{1}{c}\left( \frac{{\varvec{\alpha }}_{1}}{u} +\frac{{\varvec{\alpha }}_{2}}{1-u}\right) . \end{aligned}$$
(19)

Then the following dual problem of (12) can be derived by substituting the above Eqs. (17)–(19) into the Lagrangian function (13)

$$\begin{aligned} \begin{array}{llll} \min \limits _{{\varvec{\alpha }}_{1},{\varvec{\alpha }}_{2}} &{} \frac{1}{2} ({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})^{T}F({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})+ \frac{1}{2c}\left( \frac{{\varvec{\alpha }}_{1}}{u}+\frac{{\varvec{\alpha }}_{2}}{1-u}\right) ^{T} \left( \frac{{\varvec{\alpha }}_{1}}{u}+\frac{{\varvec{\alpha }}_{2}}{1-u}\right) - e_{l}^{T}({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})\\ \mathrm{s.t.} &{} {\varvec{\alpha }}_{1} \ge 0{\varvec{e}}_{l},\\ &{} {\varvec{\alpha }}_{2} \ge 0{\varvec{e}}_{l}, \end{array} \end{aligned}$$
(20)

where \(F=Y(D(I+\lambda D^{T}LD)^{-1}D^{T}+{\varvec{e}}{\varvec{e}}^{T})Y^{T}\).

Although the dual problem of the proposed asy-LapSVM can be solved by classical quadratic programming, inspired by the idea in Balasundaram and Benipal (2016), we adopt a simple functional iterative method to solve the dual problem (20). The method amounts to minimizing a differentiable convex function in a space whose dimensionality equals the number of classified points, and in some cases it is dramatically faster than a standard quadratic programming SVM solver. Specifically, based on the Karush–Kuhn–Tucker (KKT) necessary and sufficient optimality conditions for the dual problem (20), we have

$$\begin{aligned} \begin{aligned}&0{\varvec{e}}_{l}\le {\varvec{\alpha }}_{1} \perp \left( F+\frac{I}{cu^{2}}\right) {\varvec{\alpha }}_{1}- \left( F-\frac{I}{cu(1-u)}\right) {\varvec{\alpha }}_{2}-e_{l}\ge 0{\varvec{e}}_{l},\\&0{\varvec{e}}_{l}\le {\varvec{\alpha }}_{2} \perp \left( F+\frac{I}{c(1-u)^{2}}\right) {\varvec{\alpha }}_{2}- \left( F-\frac{I}{cu(1-u)}\right) {\varvec{\alpha }}_{1}+e_{l}\ge 0{\varvec{e}}_{l}, \end{aligned} \end{aligned}$$
(21)

\(\Leftrightarrow \)

$$\begin{aligned} \begin{aligned}&0{\varvec{e}}_{l}\le \frac{{\varvec{\alpha }}_{1}}{cu^{2}(1-u)} \perp \frac{{\varvec{\alpha }}_{1}}{cu^{2}(1-u)}+\left( F-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})-{\varvec{e}}_{l}\ge 0{\varvec{e}}_{l},\\&0{\varvec{e}}_{l}\le \frac{{\varvec{\alpha }}_{2}}{cu(1-u)^{2}} \perp \frac{{\varvec{\alpha }}_{2}}{cu(1-u)^{2}}-\left( F-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})+{\varvec{e}}_{l}\ge 0{\varvec{e}}_{l}, \end{aligned}\nonumber \\ \end{aligned}$$
(22)

\(\Leftrightarrow \)

$$\begin{aligned} 0{\varvec{e}}_{l}&\le \frac{{\varvec{\alpha }}_{1}}{cu^{2}(1-u)^{2}} \perp \frac{{\varvec{\alpha }}_{1}}{cu^{2}(1-u)^{2}} \nonumber \\&\quad +\frac{1}{(1-u)} \left( \left( F-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})-{\varvec{e}}_{l}\right) \ge 0{\varvec{e}}_{l},\nonumber \\ 0{\varvec{e}}_{l}&\le \frac{{\varvec{\alpha }}_{2}}{cu^{2}(1-u)^{2}} \perp \frac{{\varvec{\alpha }}_{2}}{cu^{2}(1-u)^{2}}-\frac{1}{u} \left( \left( F-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})-{\varvec{e}}_{l}\right) \nonumber \\&\ge 0{\varvec{e}}_{l}, \end{aligned}$$
(23)

where the symbol “\(\perp \)” indicates that the two vectors are orthogonal. By exploiting the easily established identity between any two vectors \({\varvec{a}}\) and \({\varvec{b}}\) (Fung and Mangasarian 2004):

$$\begin{aligned} 0\le {\varvec{a}}\perp ({\varvec{a}}-{\varvec{b}})\ge 0 \Leftrightarrow {\varvec{a}}={\varvec{b}}_{+}=\frac{{\varvec{b}}+|{\varvec{b}}|}{2}, \end{aligned}$$
(24)

where \({\varvec{b}}_{+}=\max \{{\varvec{0}},{\varvec{b}}\}\) and \(|{\varvec{b}}|\) denotes the vector whose components are the absolute values of the components of \({\varvec{b}}\), the optimality conditions (23) can be written in the following equivalent form:

$$\begin{aligned} \begin{aligned} \frac{{\varvec{\alpha }}_{1}}{cu^{2}(1-u)^{2}}&=\left( \frac{1}{(1-u)}\left( -\left( F-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}- {\varvec{\alpha }}_{2})+{\varvec{e}}_{l}\right) \right) _{+}\\&=\frac{1}{2(1-u)}\left( -\left( F-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}- {\varvec{\alpha }}_{2})+{\varvec{e}}_{l} \right. \\&\quad \left. +\left| -\left( F-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}- {\varvec{\alpha }}_{2})+{\varvec{e}}_{l}\right| \right) ,\\ \frac{{\varvec{\alpha }}_{2}}{cu^{2}(1-u)^{2}}&=\left( \frac{1}{u}\left( \left( F-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})-{\varvec{e}}_{l}\right) \right) _{+} \\&=\frac{1}{2u}\left( \left( F-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}- {\varvec{\alpha }}_{2})-{\varvec{e}}_{l} \right. \\&\quad \left. +\left| \left( F-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})-{\varvec{e}}_{l}\right| \right) . \end{aligned} \end{aligned}$$
(25)

Next, letting \({\varvec{\alpha }}={\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2}\), we obtain the following absolute value equation problem:

$$\begin{aligned} \frac{{\varvec{\alpha }}}{cu^{2}(1-u)^{2}}&=\frac{1}{2(1-u)}\left( -\left( F-\frac{I}{cu(1-u)}\right) {\varvec{\alpha }}+{\varvec{e}}_{l} \right. \nonumber \\&\quad \left. +\left| -\left( F-\frac{I}{cu(1-u)}\right) {\varvec{\alpha }}+{\varvec{e}}_{l}\right| \right) \nonumber \\&\quad -\frac{1}{2u}\left( \left( F-\frac{I}{cu(1-u)}\right) {\varvec{\alpha }}-{\varvec{e}}_{l} \right. \nonumber \\&\quad \left. + \left| \left( F-\frac{I}{cu(1-u)}\right) {\varvec{\alpha }}-{\varvec{e}}_{l}\right| \right) \nonumber \\&= \frac{1}{2u(1-u)}\left( -\left( F-\frac{I}{cu(1-u)}\right) {\varvec{\alpha }}+{\varvec{e}}_{l} \right. \nonumber \\&\quad \left. + (2u-1)\left| \left( F-\frac{I}{cu(1-u)}\right) {\varvec{\alpha }}-{\varvec{e}}_{l}\right| \right) , \end{aligned}$$
(26)

\(\Leftrightarrow \)

$$\begin{aligned} \left( F+\frac{I}{cu(1-u)}\right) {\varvec{\alpha }} ={\varvec{e}}_{l}+(2u-1) \left| \left( F-\frac{I}{cu(1-u)}\right) {\varvec{\alpha }}-{\varvec{e}}_{l}\right| . \end{aligned}$$
(27)

The problem (27) can be solved by a simple functional iterative method given by

$$\begin{aligned}&{\varvec{\alpha }}^{i+1}=\left( F+\frac{I}{cu(1-u)}\right) ^{-1} \left( {\varvec{e}}_{l}+(2u-1) \left| \left( F-\frac{I}{cu(1-u)} \right) {\varvec{\alpha }}^{i}-{\varvec{e}}_{l}\right| \right) , \nonumber \\&\quad i=0,1,2,\ldots , \end{aligned}$$
(28)

After the optimal solution \({\varvec{\alpha }}^{*}\) is obtained, we get the following linear asy-LapSVM classifier which is represented by the dual variables

$$\begin{aligned} f({\varvec{x}})= {\varvec{w}}^{T}{\varvec{x}}+b={{\varvec{\alpha }}^{*}}^{T}Y(D(I+\lambda D^{T}LD)^{-1}{\varvec{x}}+{\varvec{e}}). \end{aligned}$$
(29)
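
As an illustration of how the linear case can be trained in practice, the sketch below implements the functional iteration (28) and recovers w and b through (17) and (18); the function name, argument layout and stopping rule (at most 15 iterations, tolerance 1e-4, mirroring the algorithm summary given later) are our assumptions.

```python
import numpy as np

def linear_asy_lapsvm(D, L, Y, c, lam, u, max_iter=15, tol=1e-4):
    # D: (n, d) data matrix, L: (n, n) graph Laplacian, Y: (l, n) label matrix
    n, d = D.shape
    l = Y.shape[0]
    e, e_l = np.ones(n), np.ones(l)
    # F = Y (D (I + lam D^T L D)^{-1} D^T + e e^T) Y^T, as below (20)
    M = np.linalg.solve(np.eye(d) + lam * D.T @ L @ D, D.T)
    F = Y @ (D @ M + np.outer(e, e)) @ Y.T
    A = F + np.eye(l) / (c * u * (1 - u))
    B = F - np.eye(l) / (c * u * (1 - u))
    alpha = np.linalg.solve(A, e_l)            # starting vector
    for _ in range(max_iter):                  # functional iteration (28)
        alpha_new = np.linalg.solve(A, e_l + (2 * u - 1) * np.abs(B @ alpha - e_l))
        if np.linalg.norm(alpha_new - alpha) < tol:
            alpha = alpha_new
            break
        alpha = alpha_new
    w = M @ (Y.T @ alpha)                      # Eq. (17) with alpha = alpha_1 - alpha_2
    b = e @ (Y.T @ alpha)                      # Eq. (18)
    return w, b                                # classifier f(x) = w^T x + b
```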

3.2 Nonlinear asy-LapSVM

As in the linear case, the asymmetric squared loss (4) is exploited in the proposed nonlinear asy-LapSVM, and the margin measured by the expectile distance is maximized by optimizing with respect to both the weight vector \({\varvec{w}}\) and the bias term b of the nonlinear asy-LapSVM classifier \(f({\varvec{x}})= {\varvec{w}}^{T}\phi ({\varvec{x}})+b\), where \(\phi (\cdot )\) is a nonlinear mapping from the low-dimensional input space to a high-dimensional Hilbert space \({\mathcal {H}}\). Thus, the regularization term \(\Vert f\Vert ^{2}_{k}\) can be expressed as:

$$\begin{aligned} \Vert f\Vert ^{2}_{k}=\frac{1}{2}(\Vert {\varvec{w}}\Vert ^{2}+b^{2}). \end{aligned}$$
(30)

As for the manifold regularization term \(\Vert f\Vert ^{2}_{{\mathcal {I}}}\), on the basis of the constructed graph Laplacian matrix L, it is defined as:

$$\begin{aligned} \Vert f\Vert ^{2}_{{\mathcal {I}}}={\mathbf{f }}^{T}L{\mathbf{f }}= {\varvec{w}}^{T}D_{1}^{T}LD_{1}{\varvec{w}}, \end{aligned}$$
(31)

where \(D_{1}=[\phi ({\varvec{x}}_{1}),\ldots ,\phi ({\varvec{x}}_{n})]^{T}\) includes all of the labeled and unlabeled samples in \({\mathcal {S}}\) and \({\mathbf{f }}=[f({\varvec{x}}_{1}),\dots ,f({\varvec{x}}_{n})]^{T}=[{\varvec{w}}^{T}\phi ({\varvec{x}}_{1}),\dots ,{\varvec{w}}^{T}\phi ({\varvec{x}}_{n})]^{T}=D_{1}{\varvec{w}}\), in which the bias term b is deliberately omitted for convenience.

By substituting the asymmetric squared loss (4), the regularization term (30) and the manifold regularization term (31) into the semi-supervised manifold regularization learning framework (7), the primal problem of the nonlinear asy-LapSVM can be formulated as

$$\begin{aligned} \begin{array}{llll} \min \limits _{{\varvec{w}},b,{\varvec{\xi }}} &{} \frac{1}{2}(\Vert {\varvec{w}}\Vert ^{2}+b^{2}) +\frac{c}{2}{\varvec{\xi }}^{T}{\varvec{\xi }}+\frac{\lambda }{2}{\varvec{w}}^{T}D_{1}^{T}LD_{1}{\varvec{w}}\\ \mathrm{s.t.} &{} Y(D_{1}{\varvec{w}}+{\varvec{e}}b)\ge {\varvec{e}}_{l}-\frac{1}{u}{\varvec{\xi }},\\ &{} Y(D_{1}{\varvec{w}}+{\varvec{e}}b)\le {\varvec{e}}_{l}+\frac{1}{1-u}{\varvec{\xi }}, \end{array} \end{aligned}$$
(32)

where the constant u, the parameters c and \(\lambda \), the vectors \({\varvec{\xi }}\), \({\varvec{e}}_{l}\) and \({\varvec{e}}\), and the matrix Y have the same meanings as in problem (12).

According to the Representer Theorem (Belkin et al. 2006), \({\varvec{w}}\) can be expressed as \({\varvec{w}}=\mathop \sum \nolimits ^{n}_{i=1}\rho _{i}\phi ({\varvec{x}}_{i})=D_{1}^{T}{\varvec{\rho }}\), where \({\varvec{\rho }}\in R^{n}\) is a parameter vector. Then the terms containing \({\varvec{w}}\) in the optimization problem (32) can be rewritten as

$$\begin{aligned} \frac{1}{2}\Vert {\varvec{w}}\Vert ^{2}= & {} \frac{1}{2}(D_{1}^{T}{\varvec{\rho }})^{T} (D_{1}^{T}{\varvec{\rho }})=\frac{1}{2}{\varvec{\rho }}^{T}K{\varvec{\rho }}, \end{aligned}$$
(33)
$$\begin{aligned} \frac{\lambda }{2}{\varvec{w}}^{T}D_{1}^{T}LD_{1}{\varvec{w}}= & {} \frac{\lambda }{2}(D_{1}^{T} {\varvec{\rho }})^{T}D_{1}^{T}LD_{1}(D_{1}^{T}{\varvec{\rho }})= \frac{\lambda }{2}{\varvec{\rho }}^{T}KLK{\varvec{\rho }}, \end{aligned}$$
(34)

where \(K\in R^{n\times n}\) is a Gram matrix with elements \(K_{i,j}=k({\varvec{x}}_{i},{\varvec{x}}_{j})\).

Based on the above analysis, the nonlinear optimization problem (32) can be converted into the following form:

$$\begin{aligned} \begin{array}{llll} \min \limits _{{\varvec{{\varvec{\rho }}}},b,{\varvec{\xi }}} &{} \frac{1}{2}{\varvec{\rho }}^{T} F_{1}{\varvec{{\varvec{\rho }}}}+\frac{1}{2}b^{2}+\frac{c}{2}{\varvec{\xi }}^{T}{\varvec{\xi }} \\ \mathrm{s.t.} &{} Y(K{\varvec{{\varvec{\rho }}}}+{\varvec{e}}b)\ge {\varvec{e}}_{l}-\frac{1}{u}{\varvec{\xi }},\\ &{} Y(K{\varvec{{\varvec{\rho }}}}+{\varvec{e}}b)\le {\varvec{e}}_{l}+\frac{1}{1-u}{\varvec{\xi }}, \end{array} \end{aligned}$$
(35)

where \(F_{1}=K+\lambda KLK\) is a symmetric positive semi-definite matrix since the Gram matrix K and the graph Laplacian L are two symmetric positive semi-definite matrices.

The Lagrange function of the optimization problem (35) can be written as

$$\begin{aligned} \begin{aligned} {\mathcal {L}}({\varvec{\rho }},b,{\varvec{\xi }},{\varvec{\alpha }}_{1},{\varvec{\alpha }}_{2})&=\frac{1}{2}{\varvec{\rho }}^{T}F_{1}{\varvec{\rho }}+\frac{1}{2}b^{2}+ \frac{c}{2}{\varvec{\xi }}^{T}{\varvec{\xi }}-{\varvec{\alpha }}_{1}^{T} \left[ Y(K{\varvec{\rho }}+{\varvec{e}}b)-{\varvec{e}}_{l}+\frac{1}{u}{\varvec{\xi }}\right] \\&\quad -{\varvec{\alpha }}_{2}^{T}\left[ -Y(K{\varvec{\rho }}+{\varvec{e}}b)+{\varvec{e}}_{l}+ \frac{1}{1-u}{\varvec{\xi }}\right] , \end{aligned}\nonumber \\ \end{aligned}$$
(36)

where \({\varvec{\alpha }}_{1}\) and \({\varvec{\alpha }}_{2}\) are Lagrange parameter vectors.

Differentiating the Lagrange function (36) with respect to \({\varvec{\rho }}\), b, \({\varvec{\xi }}\) and setting them equal to zero, we can obtain

$$\begin{aligned}&\frac{\partial {\mathcal {L}}}{\partial {\varvec{\rho }}} =F_{1}{\varvec{\rho }}- KY^{T}({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})=0\Rightarrow {\varvec{\rho }}=F_{1}^{-1}KY^{T}({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2}), \end{aligned}$$
(37)
$$\begin{aligned}&\frac{\partial {\mathcal {L}}}{\partial b} =b-{\varvec{e}}^{T}Y^{T} ({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})=0\Rightarrow b={\varvec{e}}^{T}Y^{T}({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2}), \end{aligned}$$
(38)
$$\begin{aligned}&\frac{\partial {\mathcal {L}}}{\partial {\varvec{\xi }}}=c{\varvec{\xi }}- \left( \frac{{\varvec{\alpha }}_{1}}{u}+\frac{{\varvec{\alpha }}_{2}}{1-u}\right) =0\Rightarrow {\varvec{\xi }}=\frac{1}{c}\left( \frac{{\varvec{\alpha }}_{1}}{u}+ \frac{{\varvec{\alpha }}_{2}}{1-u}\right) . \end{aligned}$$
(39)

It is worth noting that although the matrix \(F_{1}\) in (37) is always positive semi-definite, it may not be well conditioned in some situations. In the light of the idea of regularization, \(F_{1}^{-1}\) can be revised by \((\epsilon I+F_{1})^{-1}\), where \(\epsilon I\) \((\epsilon >0)\) is a regularization term. In the following, we shall continue to use \(F_{1}^{-1}\) with the understanding that, if need be, \((\epsilon I+F_{1})^{-1}\) is to be used.

Then the following dual problem of (35) can be derived by substituting the above Eqs. (37)–(39) into the Lagrangian function (36)

$$\begin{aligned} \begin{array}{llll} \min \limits _{{\varvec{\alpha }}_{1},{\varvec{\alpha }}_{2}} &{} \frac{1}{2}({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})^{T} F_{2}({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})+\frac{1}{2c} \left( \frac{{\varvec{\alpha }}_{1}}{u}+\frac{{\varvec{\alpha }}_{2}}{1-u}\right) ^{T} \left( \frac{{\varvec{\alpha }}_{1}}{u}+\frac{{\varvec{\alpha }}_{2}}{1-u}\right) - {\varvec{e}}_{l}^{T}({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})\\ \mathrm{s.t.} &{} {\varvec{\alpha }}_{1} \ge 0{\varvec{e}}_{l},\\ &{} {\varvec{\alpha }}_{2} \ge 0{\varvec{e}}_{l}. \end{array} \end{aligned}$$
(40)

where \(F_{2}=Y(KF_{1}^{-1}K+ee^{T})Y^{T}\).

We solve the dual problem (40) by the same functional iterative method as in the linear case. Specifically, based on the Karush–Kuhn–Tucker (KKT) necessary and sufficient optimality conditions for the dual problem (40), we have

$$\begin{aligned}&0{\varvec{e}}_{l}\le {\varvec{\alpha }}_{1} \perp \left( F_{2}+ \frac{I}{cu^{2}}\right) {\varvec{\alpha }}_{1}- \left( F_{2}- \frac{I}{cu(1-u)}\right) {\varvec{\alpha }}_{2}-{\varvec{e}}_{l}\ge 0{\varvec{e}}_{l},\nonumber \\&0{\varvec{e}}_{l}\le {\varvec{\alpha }}_{2} \perp \left( F_{2}+\frac{I}{c(1-u)^{2}}\right) {\varvec{\alpha }}_{2}- \left( F_{2}-\frac{I}{cu(1-u)}\right) {\varvec{\alpha }}_{1}+ {\varvec{e}}_{l}\ge 0{\varvec{e}}_{l}, \end{aligned}$$
(41)

\(\Leftrightarrow \)

$$\begin{aligned} \begin{aligned}&0{\varvec{e}}_{l}\le \frac{{\varvec{\alpha }}_{1}}{cu^{2}(1-u)} \perp \frac{{\varvec{\alpha }}_{1}}{cu^{2}(1-u)}+\left( F_{2}-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})-{\varvec{e}}_{l}\ge 0{\varvec{e}}_{l},\\&0{\varvec{e}}_{l}\le \frac{{\varvec{\alpha }}_{2}}{cu(1-u)^{2}} \perp \frac{{\varvec{\alpha }}_{2}}{cu(1-u)^{2}}- \left( F_{2}-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})+{\varvec{e}}_{l}\ge 0{\varvec{e}}_{l}, \end{aligned}\nonumber \\ \end{aligned}$$
(42)

\(\Leftrightarrow \)

$$\begin{aligned} \begin{aligned}&0{\varvec{e}}_{l}\le \frac{{\varvec{\alpha }}_{1}}{cu^{2}(1-u)^{2}} \perp \frac{{\varvec{\alpha }}_{1}}{cu^{2}(1-u)^{2}}+\frac{1}{(1-u)} \left( \left( F_{2}-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})-{\varvec{e}}_{l}\right) \\&\quad \ge 0{\varvec{e}}_{l},\\&0{\varvec{e}}_{l}\le \frac{{\varvec{\alpha }}_{2}}{cu^{2}(1-u)^{2}} \perp \frac{{\varvec{\alpha }}_{2}}{cu^{2}(1-u)^{2}}-\frac{1}{u}\left( \left( F_{2}- \frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})-{\varvec{e}}_{l}\right) \ge 0{\varvec{e}}_{l}. \end{aligned}\nonumber \\ \end{aligned}$$
(43)

By exploiting the identity (24), the optimality conditions (43) can be written in the following equivalent form:

$$\begin{aligned} \frac{{\varvec{\alpha }}_{1}}{cu^{2}(1-u)^{2}}&=\left( \frac{1}{(1-u)}\left( -\left( F_{2}-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})+{\varvec{e}}_{l}\right) \right) _{+}\nonumber \\&=\frac{1}{2(1-u)}\left( -\left( F_{2}-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})+{\varvec{e}}_{l} \right. \nonumber \\&\quad \left. +\left| -\left( F_{2}- \frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})+ {\varvec{e}}_{l}\right| \right) ,\nonumber \\ \frac{{\varvec{\alpha }}_{2}}{cu^{2}(1-u)^{2}}&=\left( \frac{1}{u}\left( \left( F_{2}-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2})-{\varvec{e}}_{l}\right) \right) _{+} \nonumber \\&=\frac{1}{2u}\left( \left( F_{2}-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}- {\varvec{\alpha }}_{2})-{\varvec{e}}_{l} \right. \nonumber \\&\quad \left. + \left| \left( F_{2}- \frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}_{1}- {\varvec{\alpha }}_{2})- {\varvec{e}}_{l}\right| \right) . \end{aligned}$$
(44)

Letting \({\varvec{\alpha }}={\varvec{\alpha }}_{1}-{\varvec{\alpha }}_{2}\), we obtain the following absolute value equation problem:

$$\begin{aligned} \begin{aligned} \frac{{\varvec{\alpha }}}{cu^{2}(1-u)^{2}}&=\frac{1}{2(1-u)} \left( -\left( F_{2}- \frac{I}{cu(1-u)}\right) {\varvec{\alpha }}+{\varvec{e}}_{l} \right. \\&\quad \left. +\left| -\left( F_{2}- \frac{I}{cu(1-u)}\right) {\varvec{\alpha }}+{\varvec{e}}_{l}\right| \right) \\&\quad -\frac{1}{2u} \left( \left( F_{2}-\frac{I}{cu(1-u)}\right) {\varvec{\alpha }}-{\varvec{e}}_{l} \right. \\&\quad \left. + \left| \left( F_{2}-\frac{I}{cu(1-u)}\right) {\varvec{\alpha }}-{\varvec{e}}_{l}\right| \right) \\&= \frac{1}{2u(1-u)}\left( -\left( F_{2}-\frac{I}{cu(1-u)}\right) {\varvec{\alpha }}+ {\varvec{e}}_{l}\right. \\&\quad \left. +\,(2u-1)\left| \left( F_{2}-\frac{I}{cu(1-u)}\right) {\varvec{\alpha }}-{\varvec{e}}_{l}\right| \right) , \end{aligned} \end{aligned}$$
(45)

\(\Leftrightarrow \)

$$\begin{aligned} \left( F_{2}+\frac{I}{cu(1-u)}\right) {\varvec{\alpha }} ={\varvec{e}}_{l}+(2u-1) \left| \left( F_{2}-\frac{I}{cu(1-u)}\right) {\varvec{\alpha }}-{\varvec{e}}_{l}\right| . \end{aligned}$$
(46)

The problem (46) can be solved by the simple functional iterative method given by

$$\begin{aligned}&{\varvec{\alpha }}^{i+1} = \left( F_{2}+\frac{I}{cu(1-u)}\right) ^{-1} \left( {\varvec{e}}_{l}+(2u-1)\left| \left( F_{2}- \frac{I}{cu(1-u)}\right) {\varvec{\alpha }}^{i}-{\varvec{e}}_{l}\right| \right) , \nonumber \\&\quad i=0,1,2,\ldots . \end{aligned}$$
(47)

After the optimal solution \({\varvec{\alpha }}^{*}\) is obtained, we get the following nonlinear asy-LapSVM classifier which is represented by the dual variables

$$\begin{aligned} f({\varvec{x}})= {\varvec{w}}^{T}\phi ({\varvec{x}})+b= \sum \limits ^{n}_{i=1}\rho ^{*}_{i}k({\varvec{x}}_{i},{\varvec{x}})+b^{*} \end{aligned}$$
(48)

where \({\varvec{\rho }}^{*}=F_{1}^{-1}KY^{T}{\varvec{\alpha }}^{*}\) and \(b^{*}={\varvec{e}}^{T}Y^{T}{\varvec{\alpha }}^{*}\).

Next, we prove the convergence of the proposed functional iterative method in the following Theorem.

Theorem 1

Assume that \({\varvec{\alpha }}\in R^{l}\) is the solution of the absolute value equation (46) and that \(0.5<u<1\). Then, for any starting vector \({\varvec{\alpha }}^{0}\in R^{l}\), the sequence of iterates generated by (47) converges to \({\varvec{\alpha }}\) at a linear rate.

Proof

According to the conditions given in the above theorem, by using (46) and (47), we get

$$\begin{aligned}&\left( F_{2}+\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}^{i+1}- {\varvec{\alpha }}) \\&\quad =(2u-1) \left( \left| \left( F_{2}- \frac{I}{cu(1-u)}\right) {\varvec{\alpha }}^{i}- {\varvec{e}}_{l}\right| -\left| \left( F_{2}- \frac{I}{cu(1-u)}\right) {\varvec{\alpha }}-{\varvec{e}}_{l}\right| \right) , \\&\qquad i=0,1,2,\ldots . \end{aligned}$$

Since

$$\begin{aligned}&\left| \left( \left| \left( F_{2}-\frac{I}{cu(1-u)}\right) {\varvec{\alpha }}^{i}-{\varvec{e}}_{l}\right| - \left| \left( F_{2}-\frac{I}{cu(1-u)}\right) {\varvec{\alpha }}-{\varvec{e}}_{l}\right| \right) \right| \\&\quad \le \left| \left( F_{2}-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}^{i}-{\varvec{\alpha }})\right| , \end{aligned}$$

we have

$$\begin{aligned} \begin{aligned}&\left\| \left( F_{2}+\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}^{i+1}- {\varvec{\alpha }})\right\| \\&\quad \le (2u-1) \left\| \left( F_{2}-\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}^{i}-{\varvec{\alpha }})\right\| \\&\quad =(2u-1) \\&\qquad \left\| \left( F_{2}-\frac{I}{cu(1-u)}\right) \left( F_{2}+\frac{I}{cu(1-u)}\right) ^{-1}\left( F_{2}+ \frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}^{i}-{\varvec{\alpha }})\right\| \\&\quad \le (2u-1) \left\| \left( F_{2}-\frac{I}{cu(1-u)}\right) \left( F_{2}+\frac{I}{cu(1-u)}\right) ^{-1}\right\| \\&\qquad \left\| \left( F_{2}+\frac{I}{cu(1-u)}\right) ({\varvec{\alpha }}^{i}-{\varvec{\alpha }})\right\| . \end{aligned} \end{aligned}$$

Let \(\{q_{1},\ldots ,q_{l}\}\) be the set of the nonnegative eigenvalues of the positive semi-definite matrix \(F_{2}\). Clearly, the eigenvalues of \((F_{2}-\frac{I}{cu(1-u)})(F_{2}+\frac{I}{cu(1-u)})^{-1}\) will become:

$$\begin{aligned} \left\{ \frac{q_{1}-\frac{1}{cu(1-u)}}{q_{1}+\frac{1}{cu(1-u)}}, \ldots ,\frac{q_{l}-\frac{1}{cu(1-u)}}{q_{l}+\frac{1}{cu(1-u)}}\right\} , \end{aligned}$$

So, \((2u-1) \Vert (F_{2}-\frac{I}{cu(1-u)}) (F_{2}+ \frac{I}{cu(1-u)})^{-1}\Vert =(2u-1) \max \{|\frac{q_{1}- \frac{1}{cu(1-u)}}{q_{1}+ \frac{1}{cu(1-u)}}|, \ldots ,|\frac{q_{l}- \frac{1}{cu(1-u)}}{q_{l}+\frac{1}{cu(1-u)}}|\}<1\) always holds. This shows that Theorem 1 holds. \(\square \)
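
The contraction factor appearing in the proof can also be checked numerically: for any positive semi-definite \(F_{2}\) and \(0.5<u<1\), the quantity \((2u-1)\Vert (F_{2}-\frac{I}{cu(1-u)})(F_{2}+\frac{I}{cu(1-u)})^{-1}\Vert \) stays strictly below 1. A small sketch with a random \(F_{2}\) and arbitrary c, u (our choices, for illustration only) is:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))
F2 = A @ A.T                      # a random symmetric positive semi-definite matrix
c, u = 1.0, 0.83                  # any c > 0 and 0.5 < u < 1
t = 1.0 / (c * u * (1 - u))
I = np.eye(20)
factor = (2 * u - 1) * np.linalg.norm((F2 - t * I) @ np.linalg.inv(F2 + t * I), 2)
print(factor < 1)                 # True, in line with the bound used in Theorem 1
```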

Although we only prove the convergence for the nonlinear case, the convergence of the linear case can be obtained in a similar way. Finally, we summarize the proposed nonlinear asy-LapSVM as follows:

Input. A set of l labeled data and m unlabeled data \({\mathcal {S}}={\mathcal {S}}_{l}\cup {\mathcal {S}}_{u}= \{({\varvec{x}}_{i},y_{i})\}_{i=1}^{l} \cup \{{\varvec{x}}_{i}\}_{i=l+1}^{n}\), the constant u, the positive parameters c and \(\lambda \), the kernel parameter \(\sigma \), the number of the nearest neighbors p, the vector of ones of l dimension \({\varvec{e}}_{l}\), the vector of ones of n dimension e, the matrix \(Y\in R^{l\times n}\) with elements \(Y_{ii}=y_{i}\) according to the class of each sample, the maximum number of iterations \(maxI =15\), the tolerance \(tol=10^{-4}\), and initial iteration number \(t=0\).

Step 1. Compute the graph Laplacian matrix \(L=G-W\), in which W denotes the adjacency matrix constructed from the p nearest neighbors of the n nodes in \({\mathcal {S}}\), with binary edge weights \(W_{ij}\). G denotes the diagonal matrix whose diagonal elements \(G_{ii}=\mathop \sum \nolimits ^{n}_{j=1}W_{ij}\) represent the weight degree of vertex i.

Step 2. Choose a proper kernel function \(k(\cdot ,\cdot )\), and compute the Gram matrix K.

Step 3. Compute the matrices

$$\begin{aligned} F_{2}= & {} Y(K(K+\lambda KLK)^{-1}K+{\varvec{e}}{\varvec{e}}^{T})Y^{T}, \\ Q= & {} \left( F_{2}+\frac{I}{cu(1-u)}\right) ^{-1}, \\ Q_{1}= & {} F_{2}-\frac{I}{cu(1-u)} , \end{aligned}$$

where \(I \in {\mathcal {R}}^{l\times l}\) is an identity matrix.

Step 4. Compute the initial vector

$$\begin{aligned} {\varvec{\alpha }}^{0} =Q{\varvec{e}}_{l}. \end{aligned}$$

Step 5. Compute \({\varvec{\alpha }}^{t+1}\) via

$$\begin{aligned} {\varvec{\alpha }}^{t+1} =Q({\varvec{e}}_{l}+(2u-1)| Q_{1}{\varvec{\alpha }}^{t}-{\varvec{e}}_{l}|), ~~t=0,1,2,\ldots . \end{aligned}$$

Step 6. If \(\Vert {\varvec{\alpha }}^{t+1}-{\varvec{\alpha }}^{t}\Vert < tol\) or \(t> maxI\), stop; else, let \(t=t+1\) and go to Step 5.

Step 7. Compute the bias term \(b={\varvec{e}}^{T}Y^{T}{\varvec{\alpha }}^{t}\).

Step 8. Derive the nonlinear asy-LapSVM classifier

$$\begin{aligned} f({\varvec{x}})= {\varvec{w}}^{T}\phi ({\varvec{x}})+b=\sum \limits ^{n}_{i=1} \rho _{i}k({\varvec{x}}_{i},{\varvec{x}})+b, \end{aligned}$$

where \({\varvec{\rho }}=(K+\lambda KLK)^{-1}KY^{T}{\varvec{\alpha }}^{t}\).
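Under the same assumptions as in the sketches above (an RBF Gram matrix K, a graph Laplacian L built from the p-nearest-neighbor graph, and a label matrix Y with \(Y_{ii}=y_{i}\)), Steps 1–8 can be translated into NumPy roughly as follows; this is an illustrative sketch, not the authors' MATLAB implementation, and the small ridge eps follows the remark below (37).

```python
import numpy as np

def nonlinear_asy_lapsvm(K, L, Y, c, lam, u, eps=1e-6, max_iter=15, tol=1e-4):
    # K: (n, n) Gram matrix, L: (n, n) graph Laplacian, Y: (l, n) label matrix
    n, l = K.shape[0], Y.shape[0]
    e, e_l, I_l = np.ones(n), np.ones(l), np.eye(l)
    # Step 3: F2 = Y (K (K + lam K L K)^{-1} K + e e^T) Y^T, with a small ridge eps*I
    F1 = K + lam * K @ L @ K
    F1invK = np.linalg.solve(F1 + eps * np.eye(n), K)      # (K + lam K L K)^{-1} K
    F2 = Y @ (K @ F1invK + np.outer(e, e)) @ Y.T
    Q = np.linalg.inv(F2 + I_l / (c * u * (1 - u)))
    Q1 = F2 - I_l / (c * u * (1 - u))
    # Steps 4-6: functional iteration (47)
    alpha = Q @ e_l
    for _ in range(max_iter):
        alpha_new = Q @ (e_l + (2 * u - 1) * np.abs(Q1 @ alpha - e_l))
        if np.linalg.norm(alpha_new - alpha) < tol:
            alpha = alpha_new
            break
        alpha = alpha_new
    # Steps 7-8: recover rho and b; f(x) = sum_i rho_i k(x_i, x) + b
    rho = F1invK @ (Y.T @ alpha)       # rho = (K + lam K L K)^{-1} K Y^T alpha
    b = e @ (Y.T @ alpha)
    return rho, b
```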

4 Numerical experiments

In this section, we verify the validity of the proposed asy-LapSVM by comparing it with several closely related algorithms on different kinds of datasets with noise of different variances. After the experimental setup and data are described in Sects. 4.1 and 4.2, we carefully analyze the performance of the proposed asy-LapSVM in the following subsections.

4.1 Experimental setup

We compare the proposed asy-LapSVM with a group of popular supervised learning (SL) algorithms [SVM (Cristianini and Shawe-Taylor 2000), regularized least squares (RLS) (Belkin et al. 2006), extreme learning machine (ELM) (Huang et al. 2006), aLS-SVM (Huang et al. 2014)] and semi-supervised learning (SSL) algorithms (LapSVM (Belkin et al. 2006), Laplacian RLS (LapRLS) (Belkin et al. 2006), semi-supervised ELM (SSELM) (Huang et al. 2014)). In the experiments, the sigmoid activation function is used for the two ELM-based algorithms, and the number of hidden neurons is set to 1000. For the remaining algorithms and the proposed asy-LapSVM, the RBF kernel (6), which satisfies Mercer's theorem, is employed. For the choice of parameters, a set of candidate values is first predefined, and 5-fold cross validation is used for all the compared methods. To avoid bias caused by a particular sample partition, the 5-fold cross validation is repeated 10 times with random partitions, and the average testing accuracies are computed to obtain the final evaluation results. The ranges of the five parameters are listed as follows:

  • \(c: \{2^{-3},2^{-2},2^{-1},2^{0},2^{1},2^{2},2^{3}\}\),

  • \(\lambda : \{2^{-3},2^{-2},2^{-1},2^{0},2^{1},2^{2},2^{3}\}\),

  • \(p: \{5,10,15,20\}\),

  • \(\sigma : \{2^{-3},2^{-2},2^{-1},2^{0},2^{1},2^{2},2^{3}\}\),

  • \(u: \{0.55, 0.65, 0.75, 0.83, 0.95, 0.99\}\).

In model training, we randomly select 10% of the training samples as labeled samples, and the remaining training samples are regarded as unlabeled samples. For supervised algorithms, only the selected labeled samples are used to train the classifier, while for semi-supervised algorithms, the whole training set with both labeled and unlabeled samples is used. The simulations of all the algorithms are carried out in MATLAB R2016a on a personal computer with the following configuration: Intel Core i7 (3.6 GHz) and 8 GB random access memory. We use the quadratic programming (QP) solver embedded in MATLAB to solve all the QP problems and the MATLAB operator “\(\setminus \)” to compute the matrix inverses involved in the algorithms.

4.2 Data specification

In this subsection, several publicly available datasets, including one artificial dataset, six UCI datasets and five image datasets, are employed to study the performance of the proposed asy-LapSVM. The G50C artificial dataset is generated from two unit-covariance normal distributions with equal probabilities, and the class means are adjusted so that the Bayes error is 5%. For the image datasets, the Coil-20 dataset includes 1440 grey-scale images sampled from 20 objects, and each object has 72 images of size \(32\times 32\). The USPS handwritten digit dataset consists of 7,291 training images and 2,007 test images of size \(16\times 16\). The YaleB face recognition dataset, which is known to contain manifold structures, includes 2414 grey-scale facial images sampled from 38 persons, and each person has about 64 different images of size \(32\times 32\). The Multiple Features (MF) handwritten numeral dataset consists of 10 classes and 2,000 samples. The NUS-WIDE-OBJECT (NWO) dataset contains 31 categories and 30,000 web images (17,927 for training and 12,703 for testing) created by the media search laboratory of the National University of Singapore.

To fit the binary classification setting, we choose two pairwise digits from the USPS dataset, three pairwise objects from the Coil-20 dataset, three pairwise facial images from the YaleB face recognition dataset, eight pairwise subsets from the MF dataset, and seven pairwise subsets from the NWO dataset to constitute 23 binary-class image datasets, namely, USPS1, USPS2, Coil-201, Coil-202, Coil-203, YaleB1, YaleB2, YaleB3, MF1, MF2, MF3, MF4, MF5, MF6, MF7, MF8, NWO1, NWO2, NWO3, NWO4, NWO5, NWO6, and NWO7. To facilitate the computation, the well-known principal component analysis (PCA) is adopted to preprocess these image datasets, and the dimension and accumulative contribution rate after processing are displayed in Table 1. The features of each dataset are corrupted by zero-mean Gaussian noise, and the ratio of the noise variance to the feature variance, denoted as z, is set to 0 (i.e., noise-free), 0.05, and 0.1. The training and test data are corrupted by the same noise. The important statistics of the employed datasets are summarized in Table 1, where “No.” denotes the ordinal numbers of the datasets, “Selected class” and “Selected size” denote the classes and sizes selected from the original datasets, respectively, and “Dimension” denotes the number of original features.
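
The feature-noise corruption described above can be reproduced with a few lines, as in the sketch below; the function name and the fixed random seed are our own choices.

```python
import numpy as np

def add_feature_noise(X, z, seed=0):
    # Zero-mean Gaussian noise whose variance is z times the per-feature variance
    if z == 0:
        return X                                   # z = 0 is the noise-free setting
    rng = np.random.default_rng(seed)
    return X + rng.normal(size=X.shape) * np.sqrt(z * X.var(axis=0))
```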

Table 1 Information of the employed datasets

4.3 Experimental results and discussions

The results of the eight algorithms on the aforementioned datasets with different noise levels are shown in Table 2, where the best results are presented in bold. In addition, the learning time (in seconds) taken by each method under the optimal parameters obtained by one 5-fold cross validation is listed in Table 3. For better interpretation, the learning times of the proposed asy-LapSVM and the other three SSL algorithms in the absence of noise are visualized in Fig. 2, where the indexes on the horizontal axis denote the ordinal numbers of the employed datasets.

Table 2 Comparison of accuracies (%) of eight algorithms on the employed datasets
Table 3 The comparison of learning time (s) of eight algorithms on the employed datasets
Fig. 2 The learning time (s) of the proposed asy-LapSVM and the three SSL algorithms on the employed datasets with \(z=0\)

From the experimental results and running times reported in Tables 2 and 3 and Fig. 2, we can draw the following conclusions:

  1. The SSL algorithms, including the proposed asy-LapSVM, perform better than the corresponding SL ones in most cases, which can be attributed to the manifold regularization that helps the classifiers exploit the geometric information embedded in unlabeled data to achieve better performance.

  2. The proposed asy-LapSVM outperforms the other three SSL algorithms (LapSVM, LapRLS and SSELM) in 54 cases, which implies that adopting the asymmetric squared loss can improve the prediction accuracy, especially for noise-corrupted data. For example, the accuracy of asy-LapSVM on the YaleB3 dataset with \(z=0.10\) is 92.55%, nearly 2.55% higher than that of the second best method, LapSVM.

  3. From Table 3, it is easy to see that the SSL algorithms generally consume more time than the SL algorithms. The major reason is that, in addition to the few labeled data considered by the SL algorithms, the SSL algorithms also need to process a large amount of unlabeled data.

  4. From Fig. 2, we can see that although the presented functional iterative method for the proposed asy-LapSVM costs more time than LapRLS, it shows a faster learning speed than the other two SSL algorithms in most cases, which supports the feasibility of the presented functional iterative method.

To sum up, the proposed method achieves the best classification accuracy in most cases and is also preferable in terms of learning time among all the SSL algorithms, which implies that asy-LapSVM is a powerful SSL algorithm for semi-supervised binary classification problems in the presence of noise.

4.4 Statistical test

To further compare the performance of the proposed asy-LapSVM with the relevant SSL algorithms, we employ the well-known nonparametric Friedman test with the corresponding post hoc tests (Demsar 2006). The average ranks of the four methods on the 90 employed datasets in terms of accuracy are shown in Table 4. Under the null hypothesis that all the algorithms are equivalent, we compute the Friedman statistics

$$\begin{aligned} {\mathcal {T}}_{{\mathcal {X}}^{2}} =\frac{12\times 90}{4\times 5}\left[ 2.63^{2}+2.79^{2}+2.54^{2}+2.03^{2} -\frac{4\times 5^{2}}{4}\right] ={14.769} \end{aligned}$$
(49)

and

$$\begin{aligned} {\mathcal {T}}_{F} =\frac{89\times {\mathcal {T}}_{{\mathcal {X}}^{2}}}{90\times 3-{\mathcal {T}}_{{\mathcal {X}}^{2}}} ={5.15} \end{aligned}$$
(50)

which is distributed according to the \({\mathcal {F}}\)-distribution with \((3, 267)\) degrees of freedom.

For \(\alpha =0.05\), \({\mathcal {F}}_{\alpha }(3,267)=2.6384<{5.15}\), so we reject the null hypothesis. Next, the Nemenyi post-hoc test is exploited to further compare the four algorithms in pairs. Based on the Studentized range statistic divided by \(\sqrt{2}\), we know \(q_{\alpha }=2.569\) and the critical difference

$$\begin{aligned} CD=q_{\alpha }\sqrt{\frac{4\times 5}{6\times 90}}=0.4944. \end{aligned}$$
(51)

Thus, if the average ranks of two algorithms differ by at least CD, then their performance is significantly different. From Table 4, we can derive the differences between the proposed asy-LapSVM and other three SSL algorithms as follows:

$$\begin{aligned} \begin{aligned} \mathrm{d}(\mathrm{LapSVM}{-}\text {asy-LapSVM})= {2.63-2.03=0.60}>0.4944,\\ \mathrm{d}(\mathrm{LapRLS}{-}\text {asy-LapSVM})= {2.79-2.03=0.76}>0.4944,\\ \mathrm{d}(\mathrm{SSELM}{-}\text {asy-LapSVM})= {2.54-2.03=0.51}>0.4944, \end{aligned} \end{aligned}$$
(52)

where \(\mathrm d(a-b)\) denotes the difference between the average ranks of two algorithms \(\mathrm{a}\) and \(\mathrm{b}\). We can therefore conclude that the proposed asy-LapSVM performs significantly better than the other SSL algorithms on the employed datasets with noise of different variances.
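
The statistics above reduce to a few lines of arithmetic, as sketched below with N = 90 datasets, k = 4 algorithms and the average ranks from Table 4:

```python
import numpy as np

N, k = 90, 4                                    # number of datasets and of compared SSL algorithms
ranks = np.array([2.63, 2.79, 2.54, 2.03])      # LapSVM, LapRLS, SSELM, asy-LapSVM (Table 4)

chi2 = 12 * N / (k * (k + 1)) * (np.sum(ranks ** 2) - k * (k + 1) ** 2 / 4)
T_F = (N - 1) * chi2 / (N * (k - 1) - chi2)
CD = 2.569 * np.sqrt(k * (k + 1) / (6 * N))     # Nemenyi critical difference, q_0.05 = 2.569

print(round(chi2, 3), round(T_F, 2), round(CD, 4))   # approx. 14.769, 5.15, 0.4944
```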

Table 4 Average ranks of the accuracies of four SSL algorithms on the employed datasets

4.5 Effect of the parameter u

In this subsection, we study the influence of the parameter u in the loss function on the performance of the proposed asy-LapSVM. In the experiments, the value of u is tuned in the range \(\{0.55,0.65,0.75,0.83,0.95,0.99\}\). By varying u over this range and fixing the other parameters at the optimal values obtained by one 5-fold cross validation, we estimate its influence on seven employed datasets with \(z=0\); the results are displayed in Fig. 3. As shown in Fig. 3, all the curves are nearly flat across the different values of u, which means that the value of u has little effect on the performance of the proposed algorithm. In other words, the proposed asy-LapSVM is insensitive to the parameter u.

Fig. 3 The classification accuracies of asy-LapSVM with respect to u on seven employed datasets with \(z=0\)

Fig. 4 Analysis for the convergence of asy-LapSVM on four datasets with \(z=0\)

4.6 Analysis for the convergence

In this subsection, following an empirical justification (Ye 2005), we discuss the convergence of the proposed asy-LapSVM on four employed datasets with \(z=0\). The learning results are illustrated in Fig. 4, where the abscissa denotes the number of iterations and the ordinate denotes the logarithm of the objective function value of (35). From Fig. 4, it can be clearly seen that the logarithm of the objective function value changes with the iterations and becomes steady within fewer than 10 iterations. Therefore, we can conclude that the proposed algorithm converges within a limited number of iterations.

5 Conclusion

A novel asy-LapSVM is proposed in this paper to enhance the generalization performance of the classical LapSVM. To our knowledge, this is among the first attempts to employ the asymmetric squared loss to improve the performance of SSL algorithms. Moreover, we present a simple and efficient functional iterative method to solve the proposed asy-LapSVM, and we further investigate the convergence of the functional iterative method from both theoretical and experimental aspects. The validity of the proposed asy-LapSVM and the feasibility of the presented functional iterative method are demonstrated by numerical experiments on a series of popular datasets with noise of different variances.

However, the proposed asy-LapSVM still has limitations; for example, it is not suitable for online learning. Research along this line will be our future work.