1 Introduction

Due to its simple structure, low computational complexity and good generalization, the extreme learning machine (ELM) [1,2,3,4,5,6] has been successfully applied in many fields [7,8,9,10]. Compared with traditional neural networks, the main advantages of ELM are that it runs fast, attains a globally optimal solution and is easy to implement: its hidden nodes and input weights are randomly generated and the output weights are expressed analytically. Recently, many researchers have studied ELM in depth. For example, Wang et al. [7] proposed a self-adaptive extreme learning machine (SaELM), which selects the best number of hidden-layer neurons to construct the optimal network; it trains fast and obtains the global optimal solution with good generalization. Zhang et al. [8] proposed ELM+, which introduces privileged information into the traditional ELM method. Zhang et al. [9] proposed a memetic algorithm based extreme learning machine (M-ELM), which embeds a local search strategy into a global optimization framework to obtain optimal network parameters. Huang et al. [2] proposed an ELM based on optimization theory (OPT-ELM) by introducing the hinge loss into the ELM framework. It minimizes the norm of the output weights to find a separating hyperplane with the maximal margin between two classes of data, which is similar to the idea employed in the support vector machine (SVM) [11]. Compared to ELM, minimizing the norm of the output weights enables OPT-ELM to achieve better generalization performance. OPT-ELM solves a quadratic programming (QP) problem, which assures that a global optimal solution can be found.

The manifold regularization method has been widely used for semi-supervised learning tasks. One of the most popular manifold regularization methods is Laplacian regularization [12,13,14,15,16,17,18], which utilizes the graph Laplacian to capture the geometric information of the data. ELMs are very popular in many fields but are mainly applied to supervised learning tasks, which greatly limits their applicability. Many researchers have therefore introduced Laplacian regularization into the ELM framework for semi-supervised learning tasks [19,20,21,22,23].

Lagrangian support vector machine (LSVM) [24] is a computationally powerful machine learning tool. It minimizes an unconstrained differentiable convex function in a space of dimensionality equal to the number of classified points. In recent years, some researchers have extended the idea of LSVM to twin support vector machines [25,26,27,28,29,30,31,32,33].

In this paper, inspired by the studies above, we propose two novel ELM formulations for supervised and semi-supervised classifications. The main contributions of this paper are summarized as follows:

  1. (1)

    The first is called lagrangian extreme learning machine (LELM), which is based on optimality conditions and dual theory. Then LELM is generalized to semi-supervised setting to obtain a semi-supervised lagrangian extreme learning machine (Lap-LELM), which incorporates the manifold regularization into LELM to improve performance when insufficient training information is available.

  2. (2)

    In order to avoid the inconvenience caused by matrix inversion, Sherman-Morrison-Woodbury (SMW) [24] identity is used in LELM and Lap-LELM optimization problems, which leads to two smaller sized unconstrained minimization problems. The proposed models are solvable in a space of dimensionality equal to the number of samples.

  3. (3)

    Two fast and simple algorithms are designed to optimize the proposed LELM and Lap-LELM, which require only iteratively solving linear equations rather than a quadratic program as in OPT-ELM. The resulting iterative algorithms converge globally and have low computational burden.

  4. (4)

    Different from the Lagrangian SVM (LSVM), which has difficulty in dealing with nonlinear problems, the proposed LELM and Lap-LELM have an explicit kernel function form and are convenient to use for nonlinear classification.

The rest of this paper is organized as follows. Section 2 briefly reviews OPT-ELM, LSVM and MR. We propose LELM and Lap-LELM in Sections 3 and 4, respectively. The experimental results and discussions are presented in Section 5. Finally, the conclusion is drawn in Section 6.

2 Related work

In order to propose an improved version of OPT-ELM, we review OPT-ELM in Section 2.1, and introduce LSVM and MR in Sections 2.2 and 2.3, respectively.

2.1 Optimization method based ELM

Consider a supervised learning problem with training data \(T_{l}=\{(\boldsymbol {x}_{i},y_{i})\}_{i = 1}^{l}\), where \(x_{i}\in \mathbb {R}^{n}\), yi ∈ {− 1, + 1}, and Tl denotes a set of l labeled samples.

Huang et al. [2] proposed an ELM framework based on optimization theory (called OPT-ELM), in which the hinge loss function is introduced. This leads to the following optimization problem:

$$\begin{array}{@{}rcl@{}} \begin{array}{llll} \displaystyle{\min\limits_{\beta,\xi}} ~~&&\frac{1}{2}\parallel\beta\parallel^{2}+C{\sum}_{i = 1}^{l}\xi_{i}\\ \text{s.t.}~~&& y_{i}h(x_{i})^{T}\beta \geq1-\xi_{i},\ \xi_{i}\geq0,\ \ i = 1,\ldots,l \end{array} \end{array} $$
(1)

where \(h(x)=(g({w_{1}^{T}}x+b_{1}),\ldots ,g({w_{L}^{T}}x+b_{L}))^{T}\) maps the data from the n-dimensional input space to the ELM feature space; L is the number of hidden layer nodes; ξ = (ξ1, … , ξl) is the vector of slack variables; C is a positive penalty parameter. This is a quadratic programming problem with a global solution.

According to the optimization theory, the dual problem of optimization problem (1) is

$$\begin{array}{@{}rcl@{}} \begin{array}{llll} \displaystyle{\min_{\alpha}} ~~&&\frac{1}{2}{\sum}_{i = 1}^{l}{\sum}_{j = 1}^{l}y_{i}y_{j}\alpha_{i}\alpha_{j}h(\mathbf{x}_{i})h(\mathbf{x}_{j})-{\sum}_{i = 1}^{l}\alpha_{i}\\ \text{s.t.}~~&& 0\leq\alpha_{i}\leq C,\ \ i = 1, ... , l. \end{array} \end{array} $$
(2)

We can define the ELM kernel function as:

$$ K_{ELM}(x_{i},x_{j}) = h(x_{i})^{T}h(x_{j}) = [G(a_{1}, b_{1},x_{i}),\ldots,G(a_{L}, b_{L},x_{i})]\cdot[G(a_{1}, b_{1},x_{j}),\ldots,G(a_{L}, b_{L},x_{j})]^{T} $$
(3)

where G(a, b, x) is a nonlinear piecewise continuous function satisfying ELM universal approximation capability theorems [1,2,3,4,5,6] and \(\{(a_{i},b_{i})\}_{i = 1}^{L}\) are randomly generated according to any continuous probability distribution.

Thus, we can get the following optimization problem:

$$\begin{array}{@{}rcl@{}} \begin{array}{llll} \displaystyle{\min_{\alpha}} ~~&&\frac{1}{2}{}{\sum}_{i = 1}^{l}{}{\sum}_{j = 1}^{l}y_{i}y_{j}K_{ELM}(\mathbf{x}_{i},\mathbf{x}_{j})\alpha_{i}\alpha_{j}{}-{}{\sum}_{i = 1}^{l}\alpha_{i}\\ \text{s.t.}~~&& 0\leq\alpha_{i}\leq C,\ \ i = 1, ... , l. \end{array} \end{array} $$
(4)

The decision function of OPT-ELM is

$$ f(x)=sign\left( \sum\limits_{i = 1}^{l}\alpha_{i} y_{i}K_{ELM}(x,x_{i })\right) $$
(5)
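
To make the construction concrete, the following minimal sketch computes the random ELM feature map, the ELM kernel (3) and the decision function (5). The sigmoid activation, the uniform sampling of the hidden-layer parameters and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def elm_feature_map(X, A, b):
    """Random ELM feature map h(x) with sigmoid nodes G(a, b, x); X is (m, n),
    A is (L, n) and b is (L,), both drawn once at random and never tuned."""
    return 1.0 / (1.0 + np.exp(-(X @ A.T + b)))          # shape (m, L)

def elm_kernel(X1, X2, A, b):
    """K_ELM(x_i, x_j) = h(x_i)^T h(x_j) for all pairs, as in (3)."""
    return elm_feature_map(X1, A, b) @ elm_feature_map(X2, A, b).T

def opt_elm_decision(X_test, X_train, y_train, alpha, A, b):
    """Decision function (5): f(x) = sign(sum_i alpha_i y_i K_ELM(x, x_i))."""
    K = elm_kernel(X_test, X_train, A, b)                # (m_test, l)
    return np.sign(K @ (alpha * y_train))

# Drawing the random hidden-layer parameters for L nodes and n-dimensional inputs.
rng = np.random.default_rng(0)
L, n = 200, 10
A = rng.uniform(-1.0, 1.0, size=(L, n))
b = rng.uniform(-1.0, 1.0, size=L)
```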

2.2 Lagrangian support vector machine

Mangasarian and Musicant modified the standard SVM and proposed the Lagrangian support vector machine (LSVM) [24]:

$$\begin{array}{@{}rcl@{}} \begin{array}{llll} \displaystyle{\min\limits_{\mathbf{w},b,\xi}} ~~&&\nu\frac{\xi^{T}\xi}{2}+\frac{1}{2}(\mathbf{w}^{T}\mathbf{w}+b^{2})\\ \text{s.t.}~~&& D(A\mathbf{w}-eb)+\xi\geq e \end{array} \end{array} $$
(6)

where ν > 0 is a parameter, \(A\in \mathbb {R}^{m\times n}\) is the matrix of m training samples, D is the diagonal matrix of the labels ± 1, e is a vector of ones, w is the normal to the bounding planes and b determines their location relative to the origin. According to duality theory, the dual of this problem is:

$$ \min\limits_{0\leq u\in\mathbb{R}^{m}}f(u)=\frac{1}{2}u^{T}Qu-e^{T}u. $$
(7)

where \(Q=\left (\frac {I}{\nu }+D(AA^{T}+ee^{T})D\right )\).

The LSVM Algorithm is based directly on the Karush-Kuhn-Tucker necessary and sufficient optimality conditions of the dual problem (7):

$$ 0\leq u\perp Qu-e\geq0 $$

These optimality conditions lead to the following very simple iterative scheme which constitutes LSVM Algorithm:

$$u^{i + 1} = Q^{-1}(e+((Qu^{i}-e)-\alpha u^{i})_{+}); i = 0,1,\ldots,$$

for which global linear convergence from any starting point was established under the easily satisfied condition:

$$0<\alpha<\frac{2}{\nu}$$

More details can be found in [24].
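
The LSVM iteration above can be sketched as follows; the construction of Q, the step size α = 1.9/ν and the recovery of (w, b) from the dual variable u follow the description in [24], while the function name, tolerance and iteration cap are our own illustrative choices.

```python
import numpy as np

def lsvm_train(A, y, nu, alpha=None, n_iter=1000, tol=1e-6):
    """Iterative LSVM scheme: u^{i+1} = Q^{-1}(e + ((Q u^i - e) - alpha u^i)_+),
    with A the (m, n) data matrix and y the labels in {-1, +1}."""
    m = A.shape[0]
    D = np.diag(y.astype(float))
    e = np.ones(m)
    Q = np.eye(m) / nu + D @ (A @ A.T + np.outer(e, e)) @ D
    Q_inv = np.linalg.inv(Q)
    alpha = 1.9 / nu if alpha is None else alpha         # must satisfy 0 < alpha < 2/nu
    u = np.zeros(m)
    for _ in range(n_iter):
        u_new = Q_inv @ (e + np.maximum((Q @ u - e) - alpha * u, 0.0))
        if np.linalg.norm(u_new - u) < tol:
            u = u_new
            break
        u = u_new
    w = A.T @ (y * u)                                    # w = A^T D u
    b = -np.sum(y * u)                                   # b = -e^T D u
    return w, b, u
```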

2.3 Manifold regularization

Consider a semi-supervised learning problem with training data \(T=T_{l}\cup T_{u}=\{(\boldsymbol {x}_{i},y_{i})\}_{i = 1}^{l}\cup \{\boldsymbol {x}_{j}\}_{j=l + 1}^{l+u}\), where \(x_{i}, x_{j}\in \mathbb {R}^{n}\), yi ∈{− 1,+ 1}, Tl denotes a set of l labeled samples and Tu denotes a set of u unlabeled samples. The manifold regularization approach [13] takes advantage of the geometry of the marginal distribution \(\mathcal {P}_{X}\). We assume that the support of the data probability distribution has the geometry of a Riemannian manifold \(\mathcal {M}\). The labels of two samples that are close in the intrinsic geometry of \(\mathcal {P}_{X}\) should be the same or similar, which means that the conditional probability distribution \(\mathcal {P}(y\mid x)\) should vary little between two such points. As a result, we have

$$ \parallel f{\parallel_{I}^{2}}=\frac{1}{(l+u)^{2}}\sum\limits^{l+u }_{i = 1}\sum\limits^{l+u}_{j = 1}W_{ij}(f(x_{i})-f(x_{j}))^{2}=\frac{1}{(l+u)^{2}}f^{T}\mathbf{L}_{p}f $$
(8)

where Lp = D − W is the graph Laplacian; D is the diagonal degree matrix of W given by \(\mathbf {D}_{ii} = {\sum }^{l+u}_{j = 1} \mathbf {W}_{ij}\), and Dij = 0 for i ≠ j; the normalizing coefficient \(\frac {1}{(l+u)^{2}}\) is the natural scale factor for the empirical estimate of the Laplace operator. If a kernel function k(⋅,⋅) is given, we estimate the target function by minimizing

$$ f^{\ast}=\arg\min\limits_{f\in \mathcal{H}_{k}}\sum\limits_{i = 1}^{l}V(x_{i},y_{i},f)+\gamma_{A}\parallel f{\parallel_{A}^{2}}+\gamma_{I}\parallel f{\parallel_{I}^{2}} $$
(9)

where V is the loss function, γA controls the complexity of the function in the ambient space and γI controls its complexity in the intrinsic geometry of the sample probability distribution.
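
As an illustration of how the graph Laplacian L_p = D − W is obtained in practice, the sketch below builds a k-nearest-neighbour graph with binary edge weights (the construction used later in Section 5.3); the brute-force distance computation and function names are our own choices.

```python
import numpy as np

def knn_graph_laplacian(X, k=9):
    """Graph Laplacian L_p = D - W of a symmetrized k-NN graph with binary weights."""
    m = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # pairwise squared distances
    np.fill_diagonal(dist2, np.inf)                       # exclude self-edges
    W = np.zeros((m, m))
    for i in range(m):
        W[i, np.argsort(dist2[i])[:k]] = 1.0              # connect i to its k neighbours
    W = np.maximum(W, W.T)                                # make the graph undirected
    D = np.diag(W.sum(axis=1))
    return D - W

def manifold_penalty(f, L_p):
    """Quadratic manifold penalty f^T L_p f appearing in (8) and (32)."""
    return float(f @ L_p @ f)
```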

3 Lagrangian extreme learning machine

In this section, based on optimization theory, we propose a new Lagrangian extreme learning machine (LELM). Then, we compare LELM with other related algorithms.

3.1 Lagrangian extreme learning machine

We slightly modify the OPT-ELM with the hinge loss function. First, we change the l1-norm of ξ to the squared l2-norm, which makes the constraint ξ ≥ 0 redundant and guarantees the strict convexity of the objective function. Based on optimization theory, we obtain the following optimization problem:

$$\begin{array}{@{}rcl@{}} \begin{array}{llll} \displaystyle{\min\limits_{\beta,\xi}} ~~&&\frac{1}{2}\parallel\beta\parallel^{2}+\frac{C}{2}\parallel \xi\parallel^{2}\\ \text{s.t.}~~&& D(H\beta)+\xi \geq e \end{array} \end{array} $$
(10)

where C is a tradeoff parameter, Dl×l is the diagonal matrix with yi(i = 1, 2, … , l) along its diagonal, H = [h(x1), … , h(xl)]T is the hidden-layer output matrix, and e is a column vector of ones of appropriate dimension. The objective function above is strictly convex, which guarantees that (10) has a unique solution.

The Lagrange function of the primal LELM optimization (10) is

$$ L(\beta,\xi,\alpha)=\frac{1}{2}\parallel\beta\parallel^{2}+\frac{C}{2}\parallel \xi\parallel^{2}-\alpha^{T}(DH\beta+\xi -e) $$
(11)

where α = (α1, … , αl)T are the Lagrange multipliers with non-negative values.

Based on Karush-Kuhn-Tucker (KKT) condition we have:

$$ \frac{\partial L}{\partial \beta}=\beta-H^{T}D\alpha= 0\Rightarrow \beta=H^{T}D\alpha $$
(12)
$$ \frac{\partial L}{\partial \xi}=C\xi-\alpha= 0\Rightarrow \xi=\frac{\alpha}{C} $$
(13)

Substituting (12) and (13) into (11) we can get the dual form of LELM:

$$ \min\limits_{\alpha\geq0}\frac{1}{2}\alpha^{T}\left( \frac{I}{C}+DHH^{T}D\right)\alpha-e^{T}\alpha $$
(14)

Before stating our algorithm we define two matrices to simplify notation as follows:

$$ M=[DH,\mathbf{0}],\ \ \ Q=\frac{I}{C}+MM^{T}. $$
(15)

With these definitions, the dual problem (14) can be written as follows:

$$ \min\limits_{\alpha\geq0} \frac{1}{2}\alpha^{T} Q\alpha-e^{T}\alpha $$
(16)

Similarly, the Lagrange function of the optimization problem (16) is

$$ L(\alpha,\gamma)=\frac{1}{2}\alpha^{T} Q\alpha-e^{T}\alpha-\gamma^{T}\alpha $$

Based on the Karush-Kuhn-Tucker necessary and sufficient optimality conditions we can get

$$\frac{\partial L(\alpha,\gamma)}{\partial\alpha}=Q\alpha-e-\gamma= 0\Rightarrow \gamma=Q\alpha-e\geq0, $$
$$\gamma^{T}\alpha= 0, $$
$$\alpha\geq0. $$

Thus, we have

$$ 0\leq\alpha\bot Q\alpha-e\geq0. $$
(17)

For any two real numbers or vectors x and y, the following identity can be established

$$0\leq x\bot y\geq0\Leftrightarrow x=(x-ay)_{+},\ a>0. $$

Thus, the optimality condition (17) can be written in the following equivalent form for any positive θ:

$$ Q\alpha-e=((Q\alpha-e)-\theta\alpha)_{+} $$
(18)

These optimality conditions lead to the following very simple iterative scheme which constitutes our LELM Algorithm:

$$ \alpha^{i + 1}=Q^{-1}[e+((Q\alpha^{i}-e)-\theta\alpha^{i})_{+}],\ \ i = 0,1,...., $$
(19)

where θ satisfies the condition

$$ 0<\theta<\frac{2}{C}. $$
(20)

Next, we will introduce the projection theorem [34] to prepare for the subsequent proof of the global convergence of our algorithm.

Theorem 1

(Projection Theorem [34]) Let X be a nonempty, closed, and convex subset of \(\mathbb {R}^{n}\).

  1. (a)

    For every \(z\in \mathbb {R}^{n}\), there exists a unique x∗ ∈ X that minimizes ∥z − x∥ over all x ∈ X. This vector is called the projection of z on X and is denoted by [z]+.

  2. (b)

    Given some \(z\in \mathbb {R}^{n}\), a vector x∗ ∈ X is equal to the projection [z]+ if and only if

    $$ (z-x^{\ast})^{T}(x-x^{\ast})\leq 0,\ \ \forall x\in X $$
    (21)
  3. (c)

    The mapping \(f: \mathbb {R}^{n}\mapsto X\) defined by f(x) = [x]+ is continuous and nonexpansive, that is,

    $$ \parallel[x]_{+}-[y]_{+}\parallel\leq\parallel x-y\parallel,\ \ \forall x,y\in\mathbb{R}^{n}. $$
    (22)
  4. (d)

    In the case where X is a subspace, a vector \(x^{\ast }\in \mathbb {R}^{n}\) is equal to the projection [z]+ if and only if z − x∗ is orthogonal to X, that is,

    $$ (z-x^{\ast})^{T}x = 0,\ \ \forall x\in X. $$
    (23)
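
The (⋅)+ operator used in the iterative schemes is exactly the projection onto the nonnegative orthant, so part (c) applies to it directly; the following tiny check is only an illustration.

```python
import numpy as np

def proj_plus(z):
    """Projection [z]_+ onto the nonnegative orthant, i.e. componentwise max(z, 0)."""
    return np.maximum(z, 0.0)

# Numerical illustration of the nonexpansiveness property (22).
rng = np.random.default_rng(1)
x, y = rng.normal(size=50), rng.normal(size=50)
assert np.linalg.norm(proj_plus(x) - proj_plus(y)) <= np.linalg.norm(x - y)
```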

Theorem 2

(Global Convergence of LELM) Let Q be the symmetric positive definite matrix defined by (15) and let (20) hold. Starting from an arbitrary α0, the iterates αi of (19) converge to the unique solution \(\bar {\alpha }\) of (16) at the linear rate:

$$ \parallel Q\alpha^{i + 1}-Q\bar{\alpha}\parallel \leq \parallel I-\theta Q^{-1}\parallel\cdot\parallel Q\alpha^{i}-Q\bar{\alpha}\parallel. $$
(24)

Proof

Since \(\bar {\alpha }\) is the solution of (16), it must satisfy the optimality condition (18) for any θ > 0. So, we have

$$ Q\alpha^{i + 1}-e=((Q\alpha^{i}-e)-\theta\alpha^{i})_{+} $$
(25)

and

$$ Q\bar{\alpha}-e=((Q\bar{\alpha}-e)-\theta\bar{\alpha})_{+}, $$
(26)

From (25) and (26) we can get:

$$ \parallel Q\alpha^{i + 1}-Q\bar{\alpha}\parallel=\parallel(Q\alpha^{i}-e-\theta\alpha^{i})_{+}- (Q\bar{\alpha}-e-\theta\bar{\alpha})_{+}\parallel. $$
(27)

The Projection Theorem 1 (part (c)) states that the distance between any two points in \(\mathbb {R}^{n}\) is not less than the distance between their projections onto any convex set in \(\mathbb {R}^{n}\). We can thus obtain the following inequality:

$$\begin{array}{@{}rcl@{}} \parallel Q\alpha^{i + 1}-Q\bar{\alpha}\parallel&\leq&\parallel(Q-\theta I)(\alpha^{i}-\bar{\alpha})\parallel \\ &\leq& \parallel I-\theta Q^{-1}\parallel\cdot\parallel Q(\alpha^{i}-\bar{\alpha})\parallel \end{array} $$
(28)

Now we only need to prove ∥ I − θQ− 1 ∥< 1, which follows from (20) as follows. Noting the definition (15) of Q and letting λi, i = 1, ... , l, denote the nonnegative eigenvalues of MMT, all we need is:

$$ 0<\theta(\frac{1}{C}+\lambda_{i})^{-1}<2, $$
(29)

which is satisfied under the assumption (20). □

Based on the above discussion, the LELM is summarized as Algorithm 1.

Algorithm 1 The LELM algorithm
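
A minimal sketch of Algorithm 1 is given below. It assumes a sigmoid hidden layer as in the experiments of Section 5, uses the SMW identity to invert Q = I/C + MMT when the number of hidden nodes is smaller than the number of samples, and sets θ = 1.7/C by default; these choices and all names are illustrative, not the authors' reference implementation.

```python
import numpy as np

def lelm_train(X, y, C, L=1000, theta=None, n_iter=1000, tol=1e-6, seed=0):
    """Sketch of Algorithm 1 (LELM): iterate (19) until convergence, then recover beta."""
    rng = np.random.default_rng(seed)
    l, n = X.shape
    A = rng.uniform(-1.0, 1.0, size=(L, n))               # random input weights
    b = rng.uniform(-1.0, 1.0, size=L)                    # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ A.T + b)))              # hidden-layer output matrix (l, L)
    M = y[:, None] * H                                    # M = DH; the zero block in (15) is dropped
    e = np.ones(l)
    Q = np.eye(l) / C + M @ M.T                           # Q = I/C + M M^T
    if L < l:                                             # SMW: Q^{-1} = C(I - M(I/C + M^T M)^{-1} M^T)
        Q_inv = C * (np.eye(l) - M @ np.linalg.inv(np.eye(L) / C + M.T @ M) @ M.T)
    else:
        Q_inv = np.linalg.inv(Q)
    theta = 1.7 / C if theta is None else theta           # must satisfy 0 < theta < 2/C
    alpha = np.zeros(l)
    for _ in range(n_iter):                               # iteration (19)
        alpha_new = Q_inv @ (e + np.maximum((Q @ alpha - e) - theta * alpha, 0.0))
        if np.linalg.norm(alpha_new - alpha) < tol:
            alpha = alpha_new
            break
        alpha = alpha_new
    beta = H.T @ (y * alpha)                              # beta = H^T D alpha, from (12)
    return beta, A, b

def lelm_predict(X, beta, A, b):
    """Decision rule sign(h(x)^T beta) with the same random hidden layer."""
    H = 1.0 / (1.0 + np.exp(-(X @ A.T + b)))
    return np.sign(H @ beta)
```

The SMW route only pays off when the number of hidden nodes L is smaller than the number of samples l; otherwise a direct factorization of the l × l matrix Q is just as cheap, which is why both branches are kept in the sketch.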

3.2 Comparison with other related algorithms

We compare the proposed Lagrangian extreme learning machine (LELM) with other related algorithms: the Lagrangian support vector machine (LSVM) [24], the unconstrained Lagrangian twin support vector machine (ULTWSVM) [28], the optimization method based extreme learning machine (OPT-ELM) [2] and the L2-regularized extreme learning machine (ELM) [1].

Compared with LSVM and ULTWSVM:

  1. (1)

    Obviously, the objectives are different. The bias b is not required in LELM since the separating hyperplane βTh(x) = 0 passes through the origin in the ELM feature space, while LSVM needs the bias b to determine the hyperplane. ULTWSVM looks for two non-parallel classification hyperplanes, whereas the proposed LELM looks for a single classification hyperplane passing through the origin.

  2. (2)

    Different from LSVM and ULTWSVM, which are difficult to optimize for nonlinear problems because of the unknown implicit mapping and the kernel parameters, the kernel function of the proposed LELM (10) has the explicit form KELM(xi, xj) = h(xi)Th(xj), and its network parameters are randomly generated without tuning.

Compared with OPT-ELM and ELM:

  1. (1)

    Compared with OPT-ELM, we slightly modify the hinge-loss formulation: we replace the l1-norm of the slack variable ξ with its squared l2-norm weighted by \(\frac {C}{2}\), which guarantees the strict convexity of the objective function. This leads to an LELM optimization problem with a unique solution.

  2. (2)

    The traditional ELM tends to reach zero training error; in LELM, however, the training errors are generally not equal to zero, which leads to better generalization performance on testing data.

4 Laplacian lagrangian extreme learning machine

In this section, we propose a semi-supervised Lagrangian extreme learning machine (Lap-LELM) by extending LELM to a semi-supervised learning framework. The proposed Lap-LELM incorporates the manifold regularization to leverage unlabeled data and improve the classification accuracy when labeled data are scarce. Then, we compare Lap-LELM with other related algorithms.

4.1 Laplacian lagrangian extreme learning machine

For binary classification applications, the decision function of ELM is f(x) = sign(βTh(x)), where h(x) maps the data from the n-dimensional input space to the L-dimensional hidden layer ELM feature space. By means of the Representer Theorem [2], the output weights β can be expressed in the dual problem as an expansion over labeled and unlabeled samples

$$\beta=\sum\limits_{i = 1}^{l+u}\alpha_{i}h(x_{i}),$$

where h(x) = [h(x1), … , h(xl + u)]T and α = [α1, … , αl + u]T. Then, the decision function is

$$ f(x)=\sum\limits_{i = 1}^{l+u}\alpha_{i}K(x_{i},x) $$
(30)

and K is the kernel matrix formed by the kernel functions K(xi, xj) = h(xi) ⋅ h(xj). Therefore, the regularization term can be expressed by the kernel matrix and the expansion coefficients:

$$ \parallel f\parallel^{2}_{\mathcal{H}}=\parallel \beta\parallel^{2}=(h(x)\alpha)^{T}(h(x)\alpha)=\alpha^{T}\mathbf{K}\alpha . $$
(31)

The geometry of the data is represented by a graph, whose nodes represent labeled and unlabeled samples connected by weights Wij. The graph regularization term follows from the manifold assumption. Via spectral graph theory [13], the manifold term can be written as

$$ \parallel f{\parallel^{2}_{I}}=\frac{1}{(l+u)^{2}}\sum\limits_{i,j = 1}^{l+u}\mathbf{W}_{ij}(f(x_{i})-f(x_{j}))^{2}= \frac{1}{(l+u)^{2}}f^{T}\mathbf{L}_{p}f, $$
(32)

where Lp = D − W is the graph Laplacian; D is the diagonal degree matrix of W given by \(\mathbf {D}_{ii} = {\sum }^{l+u}_{j = 1} \mathbf {W}_{ij}\), and Dij = 0 for i ≠ j; the normalizing coefficient \(\frac {1}{(l+u)^{2}}\) is the natural scale factor for the empirical estimate of the Laplace operator; and f = Kα.

Therefore, based on the above (9), (31) and (32) we propose the following Lap-LELM:

$$\begin{array}{@{}rcl@{}} \begin{array}{llll} \displaystyle{\min\limits_{\xi\in\mathbb{R}^{l},\alpha\in\mathbb{R}^{l+u}}} ~~&&\frac{1}{l}{\sum}^{l}_{i = 1}\xi_{i}+\gamma_{A}\alpha^{T}\mathbf{K}\alpha +\frac{\gamma_{I}}{(l+u)^{2}}\alpha^{T}\mathbf{K}^{T}\mathbf{L}_{p}\mathbf{K}\alpha\\ \ \text{s.t.}~~&& y_{i}{\sum}^{l+u}_{j = 1}\alpha_{j}\mathbf{K}(x_{j},x_{i})\geq 1-\xi_{i}, \ \ i = 1,\ldots, l,\\ ~~&& \xi_{i}\geq0, \ \ i = 1,\ldots, l, \end{array} \end{array} $$
(33)

where γA and γI are regularization parameters and K is the kernel matrix formed by the kernel functions K(xi, xj) = h(xi) ⋅ h(xj).

The Lagrange function of the primal Lap-LELM optimization (33) is

$$ L(\alpha,\xi,\lambda,\theta)=\frac{1}{l}\sum\limits^{l}_{i = 1}\xi_{i}+\gamma_{A}\alpha^{T}\mathbf{K}\alpha +\frac{\gamma_{I}}{(l+u)^{2}}\alpha^{T}\mathbf{K}^{T}\mathbf{L}_{p}\mathbf{K}\alpha - \sum\limits_{i = 1}^{l}\lambda_{i}\left( y_{i}\sum\limits^{l+u}_{j = 1}\alpha_{j}\mathbf{K}(x_{j},x_{i})- 1+\xi_{i}\right)-\sum\limits_{i = 1}^{l}\theta_{i}\xi_{i} $$
(34)

where λ = [λ1, … , λl]T and θ = [θ1, … , θl]T are the nonnegative Lagrange multipliers. In order to find the optimal solution of (34) we should have

$$ \frac{\partial L}{\partial\xi_{i}}=\frac{1}{l}-\lambda_{i}-\theta_{i}= 0\Rightarrow \theta_{i}=\frac{1}{l}-\lambda_{i} $$
(35)

Substituting (35) into (34), we have

$$ \min\limits_{\alpha,\lambda}\frac{1}{2}\alpha^{T}(2\gamma_{A}\mathbf{K}+\frac{2\gamma_{I}}{(l+u)^{2}} \mathbf{K}^{T}\mathbf{L}_{p}\mathbf{K})\alpha-\alpha^{T}\mathbf{K}\mathbf{J}^{T} \mathbf{Y}\lambda+\sum\limits_{i = 1}^{l}\lambda_{i} $$
(36)

where J = [I, 0] is the l × (l + u) matrix composed of the l × l identity matrix I and the l × u zero matrix 0, and Y = diag(y1, … , yl). Taking derivatives again with respect to α, we obtain the solution

$$ \alpha=(2\gamma_{A}\mathbf{I}+\frac{2\gamma_{I}}{(l+u)^{2}}\mathbf{L}_{p}\mathbf{K})^{-1} \mathbf{J}^{T}\mathbf{Y}\lambda $$
(37)

Thus, we obtain the following quadratic programming problem:

$$\begin{array}{@{}rcl@{}} \begin{array}{llll} \displaystyle{\min\limits_{\lambda}} ~~&&\frac{1}{2}\lambda^{T}\mathbf{S}\lambda-{\sum}_{i = 1}^{l}\lambda_{i}\\ \ \text{s.t.} ~~&& 0\leq\lambda_{i}\leq\frac{1}{l} \end{array} \end{array} $$
(38)

where \(\mathbf {S}=\mathbf {YJK}(2\gamma _{A}\mathbf {I}+\frac {2\gamma _{I}}{(l+u)^{2}}\mathbf {L}_{p}\mathbf {K})^{-1} \mathbf {J}^{T}\mathbf {Y}\). The Lagrange function of the optimization (38) is

$$ L(\lambda,\eta)=\frac{1}{2}\lambda^{T}\mathbf{S}\lambda-e^{T}\lambda-\eta^{T}\lambda $$
(39)

where η = [η1, … , ηl]T are the Lagrange multipliers. According to the Karush-Kuhn-Tucker Conditions (KKT conditions), we can get

$$\begin{array}{@{}rcl@{}} \frac{\partial L}{\partial\lambda} &=& \mathbf{S}\lambda-e-\eta= 0 \end{array} $$
(40)
$$\begin{array}{@{}rcl@{}} \lambda^{T}\eta &=& 0 \end{array} $$
(41)
$$\begin{array}{@{}rcl@{}} \lambda &\geq& 0 \end{array} $$
(42)

From (40)–(42), we have

$$ 0\leq\lambda\bot \mathbf{S}\lambda-e\geq0 $$
(43)

The optimality condition (43) can be written in the following equivalent form for any positive 𝜗:

$$ \mathbf{S}\lambda-e=(\mathbf{S}\lambda-e-\vartheta\lambda)_{+} $$
(44)

As mentioned above, these optimality conditions will result in a very simple iteration scheme for our Lap-LELM algorithm:

$$ \lambda^{i + 1}= S^{-1}[e + (S\lambda^{i}- e-\vartheta{\lambda}^{i})_{+}], i = 0,1,\ldots $$
(45)

The algorithm converges globally and linearly from any starting point provided that 𝜗 satisfies the following condition:

$$ 0<\vartheta<\frac{2}{C} $$
(46)

Thus, we can get the output weights

$$ \beta=h(x)\alpha=h(x)(2\gamma_{A}\mathbf{I}+\frac{2\gamma_{I}}{(l+u)^{2}}\mathbf{L}_{p}\mathbf{K})^{-1} \mathbf{J}^{T}\mathbf{Y}\lambda. $$
(47)

Based on the above discussion, the Lap-LELM algorithm is summarized as Algorithm 2.

Algorithm 2 The Lap-LELM algorithm
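
A minimal sketch of Algorithm 2 is given below. It assumes that the (l+u) × (l+u) ELM kernel matrix K, the graph Laplacian L_p and the l labels are already available (e.g. from the sketches in Sections 2.1 and 2.3); the default stopping rule and all names are illustrative, not the authors' reference implementation.

```python
import numpy as np

def lap_lelm_train(K, L_p, y_lab, gamma_A, gamma_I, vartheta, n_iter=1000, tol=1e-6):
    """Sketch of Algorithm 2 (Lap-LELM): iterate (45) for lambda, then recover alpha via (37).
    The first l rows/columns of K are assumed to correspond to the labeled samples."""
    n = K.shape[0]                                        # n = l + u
    l = y_lab.shape[0]
    J = np.hstack([np.eye(l), np.zeros((l, n - l))])      # J = [I, 0]
    Y = np.diag(y_lab.astype(float))
    G = 2.0 * gamma_A * np.eye(n) + (2.0 * gamma_I / n ** 2) * (L_p @ K)
    G_inv_JtY = np.linalg.solve(G, J.T @ Y)               # (2*gamma_A I + 2*gamma_I/(l+u)^2 L_p K)^{-1} J^T Y
    S = Y @ J @ K @ G_inv_JtY                             # matrix S of problem (38)
    S_inv = np.linalg.inv(S)
    e = np.ones(l)
    lam = np.zeros(l)
    for _ in range(n_iter):                               # iteration (45)
        lam_new = S_inv @ (e + np.maximum(S @ lam - e - vartheta * lam, 0.0))
        if np.linalg.norm(lam_new - lam) < tol:
            lam = lam_new
            break
        lam = lam_new
    alpha = G_inv_JtY @ lam                               # expansion coefficients, (37)
    return alpha

def lap_lelm_predict(K_test, alpha):
    """Decision function (30): sign(sum_i alpha_i K(x_i, x)); K_test is (m_test, l+u)."""
    return np.sign(K_test @ alpha)
```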

4.2 Comparison with other related algorithms

We compare the Laplacian Lagrangian extreme learning machine (Lap-LELM) with other related algorithms: the Laplacian support vector machine (Lap-SVM) [13], the manifold proximal support vector machine (MPSVM) [18], the semi-supervised extreme learning machine (SS-ELM) [21], and the manifold regularized extreme learning machine (MR-ELM) [23].

Compared with Lap-SVM and MPSVM:

  1. (1)

    The bias b is not required in Lap-LELM since the separating hyperplane βTh(x) = 0 passes through the origin in the ELM feature space, while Lap-SVM and MPSVM need a bias term to determine the hyperplane.

  2. (2)

    Different from Lap-SVM and MPSVM, which are difficult to optimize for nonlinear problems because of the unknown implicit mapping, the kernel function of the proposed Lap-LELM (33) has the explicit form KELM(xi, xj) = h(xi)Th(xj). Thus the proposed Lap-LELM runs fast and has fewer decision variables than the existing Lap-SVM and MPSVM.

Compared with SS-ELM and MR-ELM:

  1. (1)

    SS-ELM and MR-ELM are based on the traditional ELM framework, whereas our proposed Lap-LELM is based on the OPT-ELM framework. Relative to SS-ELM and MR-ELM, Lap-LELM is more robust, since the quadratic loss function is used in SS-ELM and MR-ELM, while the hinge loss is used in the proposed Lap-LELM.

  2. (2)

    Compared with SS-ELM and MR-ELM, Lap-LELM has the advantages of fast operation, global convergence and low computational burden. Lap-LELM is solvable in a space of dimensionality equal to the number of sample points, whereas the SS-ELM and MR-ELM models are solved in a space whose dimension is greater than the number of sample points, so their calculation is more complicated and their computational burden is higher.

5 Numerical results

To evaluate the accuracy and efficiency of the LELM and Lap-LELM algorithms, we performed experiments on benchmark datasets and four NIR spectroscopy datasets. All algorithms were implemented in MATLAB R2014b on a 3.40 GHz machine with 8 GB of memory.

5.1 Experimental design

The evaluation criteria and datasets description should be specified before presenting the experimental results.

5.1.1 Algorithm evaluation criteria

In order to evaluate the effectiveness of the proposed methods, the evaluation criteria used in this paper are ACC (accuracy), F1-measure and MCC (Matthews correlation coefficient): ACC is the overall recognition rate, F1 is the harmonic mean of precision and recall, and MCC is a comprehensive evaluation criterion. The CPU time also serves as an evaluation indicator. They are defined as:

$$ACC=\frac{TP+TN}{TP+FN+TN+FP},$$
$$MCC=\frac{(TP\cdot TN)-(FP\cdot FN)}{\sqrt{(TP+FP)\cdot(TP+FN)\cdot(TN+FP)\cdot(TN+FN)}}, $$
$$F_{1}=\frac{2TP}{2TP+FP+FN}. $$
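
A straightforward implementation of the three criteria from the confusion matrix is sketched below (labels assumed to be coded as ±1; the function name is our own).

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """ACC, MCC and F1 computed from the confusion matrix of a {-1, +1} classification."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc_den = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = ((tp * tn) - (fp * fn)) / mcc_den if mcc_den > 0 else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0.0
    return acc, mcc, f1
```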

5.1.2 Datasets description

Our experiments are performed on eleven UCI datasets and four NIR spectroscopy datasets [5], respectively. The eleven UCI datasets are listed in Table 1.

Table 1 Description of UCI datasets

The artificial dataset consists of two classes generated from different 2-dimensional normal distributions N(μ1, σ1) and N(μ2, σ2), where μ1 = [4.5, 1.5], μ2 = [1.5, − 0.5] and σ1 = σ2 = [0.5, 0.5], and each class contains 1000 sample points.
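
For reproducibility, the artificial dataset can be generated as follows (the random seed and function name are our own choices).

```python
import numpy as np

def make_artificial_dataset(n_per_class=1000, seed=0):
    """Two 2-D Gaussian classes: means [4.5, 1.5] and [1.5, -0.5], std 0.5 per dimension."""
    rng = np.random.default_rng(seed)
    X1 = rng.normal(loc=[4.5, 1.5], scale=[0.5, 0.5], size=(n_per_class, 2))
    X2 = rng.normal(loc=[1.5, -0.5], scale=[0.5, 0.5], size=(n_per_class, 2))
    X = np.vstack([X1, X2])
    y = np.hstack([np.ones(n_per_class), -np.ones(n_per_class)])
    return X, y
```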

Licorice is a traditional Chinese herbal medicine. We use 244 licorice seeds, including 122 hard seeds and 122 soft seeds, in our experiment. The near-infrared (NIR) spectroscopy datasets of licorice seeds were obtained with an MPA spectrometer. The NIR spectral range of 5000-10000 cm− 1 is recorded with a resolution of 4 cm− 1. The initial spectra are digitized by the OPUS 5.5 software. For comprehensive validation, numerical experiments are carried out in four different spectral regions: 5000-6000 cm− 1, 6000-7000 cm− 1, 7000-8000 cm− 1 and 8000-10000 cm− 1. The corresponding spectral regions are denoted as regions A, B, C and D, respectively. Their information is summarized in Table 2.

Table 2 Near-infrared spectral sample regions of licorice seeds

5.2 Supervised learning results

Before the experiments, all datasets are normalized to the range of [0, 1]. We choose LSVM [24], OPT-ELM [2], NLTWSVM [28], SLTWSVM [28] and SVM [11] as the baseline methods. We conduct numerical experiments on UCI datasets and NIR spectroscopy datasets, respectively. We perform ten-fold cross validation on all considered datasets; in other words, each dataset is split randomly into ten subsets, and one of those subsets is reserved as the test set in turn.

In our experiments, the RBF kernel \(K(x_{i},x_{j})=e^{-\sigma ||x_{i}-x_{j}||_{2}^{2}}\) is used in SVM, LSVM, NLTWSVM and SLTWSVM. In the LELM and OPT-ELM models we use the Sigmoid function 1/(1 + exp(−(wx + b))) (w and b are randomly generated) as the activation function. For SVM, LSVM, NLTWSVM, SLTWSVM and LELM, we carry out a grid search with 10-fold cross-validation on the training sets to get the optimal parameters (C, ν, L, σ) with the highest accuracy. All of the following experimental results are obtained with these optimal parameters. The parameter C is selected from {10^{−5}, ⋯ , 10^{5}}. The parameter ν is selected from {2^{0}, 2^{1}, ⋯ , 2^{8}}. The RBF kernel parameter σ is selected from {2^{−5}, ⋯ , 2^{5}}. The number of hidden layer nodes L is selected from {200, 300, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000}. Theoretically, the larger L is, the better the generalization performance of LELM [5]. For the algorithms LSVM and LELM we set the parameters \(0<\delta <\frac {1.9}{\nu }\) and \(0<\theta <\frac {1.7}{C}\).
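
The model-selection protocol described above (grid search with 10-fold cross validation) can be sketched as follows for LELM, reusing the lelm_train/lelm_predict sketches from Section 3; the fold splitting and helper names are illustrative.

```python
import numpy as np

def ten_fold_accuracy(X, y, C, L, seed=0):
    """Average 10-fold cross-validated accuracy of LELM for one (C, L) pair."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, 10)
    accs = []
    for k in range(10):
        test = folds[k]
        train = np.hstack([folds[j] for j in range(10) if j != k])
        beta, A, b = lelm_train(X[train], y[train], C=C, L=L)
        accs.append(np.mean(lelm_predict(X[test], beta, A, b) == y[test]))
    return float(np.mean(accs))

def grid_search(X, y):
    """Pick (C, L) with the highest 10-fold CV accuracy over the grids given in the text."""
    C_grid = [10.0 ** k for k in range(-5, 6)]
    L_grid = [200, 300, 500, 1000, 1500, 2000, 2500,
              3000, 3500, 4000, 4500, 5000]
    return max(((C, L) for C in C_grid for L in L_grid),
               key=lambda p: ten_fold_accuracy(X, y, *p))
```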

5.2.1 Experiments on artificial dataset

In order to verify the performance of the proposed LELM, we compare LELM with SVM, LSVM, OPT-ELM, NLTWSVM and SLTWSVM on the artificial dataset. The results of this experiment are shown in Table 3. We can see that LELM outperforms SVM, LSVM, OPT-ELM, NLTWSVM and SLTWSVM in terms of ACC and CPU time.

Table 3 Performance comparison of SVM, LSVM, OPT-ELM, NLTWSVM, SLTWSVM and LELM on the artificial dataset

5.2.2 Experiments on UCI datasets

We compare the proposed LELM with OPT-ELM and LSVM on eleven UCI datasets. All experimental results are shown in Figs. 1, 2, 3 and 4. Compared with OPT-ELM and LSVM, Fig. 1 illustrates that LELM improves the generalization ability and has higher ACC values on most datasets. In Fig. 2, we compare the MCC values of the proposed LELM with those of OPT-ELM and LSVM on each dataset. As can be seen from Fig. 2, except for Pima, the MCC values of LELM are higher than those of OPT-ELM and LSVM in most cases. Further, we can also find that OPT-ELM outperforms LSVM on most datasets except Australian and Vote. Figure 3 presents a comparison of the F1 values of the proposed LELM with LSVM and OPT-ELM on each dataset. From the experimental results in Fig. 3, we can find that, in the comparison of F1 values, LELM performs better than OPT-ELM and LSVM in most cases. Figure 4 compares the number of support vectors of the proposed LELM with those of OPT-ELM and LSVM on each dataset. From Fig. 4, we find that LELM has fewer support vectors (SVs) than OPT-ELM and LSVM on most datasets. It can be further found that the numbers of support vectors of LELM and OPT-ELM are similar on some datasets; more precisely, the former performs better.

Fig. 1 Performance comparison of ACC (%) of OPT-ELM, LSVM and LELM on eleven UCI datasets, where 1 denotes Diabetes, 2 Australian, 3 Balance, 4 Breast Cancer, 5 German, 6 Ionosphere, 7 Pima, 8 QSAR, 9 Vote, 10 WDBC and 11 Wholesale

Fig. 2 Performance comparison of MCC (%) of OPT-ELM, LSVM and LELM on eleven UCI datasets, where 1 denotes Diabetes, 2 Australian, 3 Balance, 4 Breast Cancer, 5 German, 6 Ionosphere, 7 Pima, 8 QSAR, 9 Vote, 10 WDBC and 11 Wholesale

Fig. 3 Performance comparison of F1 (%) of OPT-ELM, LSVM and LELM on eleven UCI datasets, where 1 denotes Diabetes, 2 Australian, 3 Balance, 4 Breast Cancer, 5 German, 6 Ionosphere, 7 Pima, 8 QSAR, 9 Vote, 10 WDBC and 11 Wholesale

Fig. 4 Performance comparison of SVs (%) of OPT-ELM, LSVM and LELM on eleven UCI datasets, where 1 denotes Diabetes, 2 Australian, 3 Balance, 4 Breast Cancer, 5 German, 6 Ionosphere, 7 Pima, 8 QSAR, 9 Vote, 10 WDBC and 11 Wholesale

To further verify the performance of our proposed LELM, we compared it with SVM, LSVM, OPT-ELM, NLTWSVM and SLTWSVM on seven UCI datasets. The experimental results are presented in Table 4. The classification accuracies and CPU times in Table 4 are averaged over five runs. From Table 4, we can find that LELM is superior to the other algorithms on the seven datasets in terms of classification accuracy and CPU time. It is further found that the performance of the proposed LELM on most of the datasets is similar to that of NLTWSVM and SLTWSVM; to be precise, LELM is slightly better than NLTWSVM and SLTWSVM. The above analysis of the experimental results further validates that our proposed algorithm is effective and reliable, which supports the correctness of our theory.

Table 4 Performance comparison of SVM, LSVM, OPT-ELM, NLTWSVM, SLTWSVM and LELM on UCI datasets

5.2.3 Experiments on NIR spectroscopy datasets

We compared the proposed LELM with SVM, LSVM, OPT-ELM, NLTWSVM and SLTWSVM on the NIR spectroscopy datasets. The experimental results are presented in Table 5. We find from Table 5 that LELM achieves better performance than SVM, LSVM, OPT-ELM, NLTWSVM and SLTWSVM with respect to ACC and CPU time. We further find that the performance of our proposed LELM is very similar to that of NLTWSVM and SLTWSVM on the NIR spectroscopy datasets; more precisely, our algorithm is slightly better than NLTWSVM and SLTWSVM.

Table 5 Performance comparison of SVM, LSVM, OPT-ELM, NLTWSVM, SLTWSVM and LELM on NIR spectroscopy datasets

5.3 Semi-supervised learning results

In order to verify the generalization performance of the proposed semi-supervised LELM algorithm, we conducted experiments on eleven UCI datasets and on the COIL20(B) and USPST(B) datasets, using five runs of 10-fold cross validation. We choose LapSVM [13], SS-ELM [21] and MR-ELM [23] as the benchmark comparison algorithms.

Before the experiments, all datasets are normalized to the range of [0,1]. LapSVM, SS-ELM, MR-ELM and Lap-LELM construct the data adjacency graphs using k-nearest neighbors. Binary edge weights are chosen, and the neighborhood size k is set to 9 for all datasets. We use the Sigmoid function 1/(1 + exp(−(wx + b))) as the activation function for SS-ELM, MR-ELM, LELM and Lap-LELM. The RBF kernel \(K(x_{i},x_{j})=e^{-\sigma ||x_{i}-x_{j}||_{2}^{2}}\) is used in LapSVM. We carry out a grid search with ten-fold cross-validation on the training sets to get the optimal parameters (C, γA, γI, σ, λ) with the highest accuracy. γA and γI are selected from {10^{0}, ⋯ , 10^{3}}; C and λ are selected from {10^{−5}, ⋯ , 10^{5}}; the RBF kernel parameter σ is selected from {2^{−5}, ⋯ , 2^{5}}. The number of hidden nodes L is selected from {200, 300, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000}. All the following experimental results are obtained with the optimal parameters.

5.3.1 Experiments on artificial datasets

In order to further verify the generalization performance of the proposed Lap-LELM, we performed experiments on the artificial dataset. The experimental results are shown in Table 6. We find that Lap-LELM is better than LapSVM, LELM, SS-ELM and MR-ELM in classification accuracy.

Table 6 Performance comparison of LapSVM, LELM, SS-ELM, MR-ELM and Lap-LELM on artificial dataset

5.3.2 Experiments on UCI datasets

In this subsection, we run LELM, LapSVM, SS-ELM, MR-ELM and Lap-LELM on eleven UCI datasets. Most of these datasets have appeared in the previous experiments. We conduct the experiments with different proportions of labeled samples, i.e., 10% and 20%. The testing accuracy is computed using standard 10-fold cross validation. For the supervised algorithm (LELM), the training samples were divided into 10 parts, one part (two parts when 20% of the samples were labeled) for training and the rest for testing; for the semi-supervised algorithms (LapSVM, SS-ELM, MR-ELM and Lap-LELM), all samples are involved in training.

We compare the classification accuracy of the proposed Lap-LELM with that of LELM, LapSVM, SS-ELM and MR-ELM. All experimental results are based on the optimal parameters, and the classification accuracy is averaged over five runs. The results are listed in Table 7. We find from Table 7 that the performance of Lap-LELM is better than that of LELM on all datasets. This shows that in semi-supervised learning problems the use of the graph Laplacian operator can improve the classification accuracy of the model. It is further found that the proposed Lap-LELM is superior to the traditional semi-supervised learning algorithm Lap-SVM on the vast majority of datasets. We can also find that Lap-LELM outperforms the ELM-based semi-supervised algorithms SS-ELM and MR-ELM on most datasets. The analysis of the experimental results further validates that the introduction of manifold regularization into the LELM framework can effectively improve the performance of LELM when the sample information is insufficient.

Table 7 Performance comparison of LapSVM, LELM, SS-ELM, MR-ELM and Lap-LELM on UCI datasets

To further verify the performance of our proposed Lap-LELM, we compared it with LELM and LapSVM on six UCI datasets with different proportions of labeled samples (10%, 30%, 50%, 70%, 90%). The experimental results are presented in Fig. 5.

Fig. 5 Testing ACC with respect to different numbers of labeled data

Figure 5 shows the performance of LapSVM, LELM and Lap-LELM on Balance, Breast Cancer, Australian, WDBC, Vote and Ionosphere with different numbers of labeled samples. In this experiment, all the settings are similar to the above, except that we vary the proportion of labeled data in the training set. We can observe that LapSVM and Lap-LELM outperform LELM significantly when there is only a small amount of labeled data. Further, we can find that the classification accuracy of LELM grows as the number of labeled samples gradually increases.

5.3.3 Experimental results on COIL20(B) and USPST(B) datasets

COIL20(B) and USPST(B) are widely used benchmark datasets for evaluating semi-supervised learning algorithms. We compared the proposed Lap-LELM with Lap-SVM, SS-ELM and MR-ELM on the COIL20(B) and USPST(B) datasets.

The COIL20(B) and USPST(B) datasets are described as follows:

The Columbia object image library (COIL20) is a set of 1440 gray-scale images of 20 different objects. Each sample represents a 32 × 32 gray scale image of an object acquired from a specific view. The COIL20(B) is a binary data set generated by grouping the first 10 objects in COIL20 to class 1 and the remaining objects to class 2.

The USPST data set is a collection of hand-written digits from the USPS postal system. Each digit image is represented by a resolution of 16 × 16 pixels. The USPST(B) is a binary data set which was built by grouping the first 5 digits to class 1 and the remaining digits to class 2.
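
The binary versions of the two datasets can be obtained by the following simple relabelling; the assumed label encodings (object indices 1-20 for COIL20, digits 0-9 for USPST) and names are illustrative.

```python
import numpy as np

def binarize_labels(y, first_group):
    """Map multi-class labels to {+1, -1}: classes in first_group -> +1, the rest -> -1."""
    return np.where(np.isin(y, list(first_group)), 1, -1)

# y_coil20b = binarize_labels(y_coil20, range(1, 11))   # first 10 objects vs the rest
# y_uspstb  = binarize_labels(y_uspst, range(0, 5))     # digits 0-4 vs the rest
```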

The proposed Lap-LELM is compared with Lap-SVM, MR-ELM and SS-ELM. We use the classification accuracy on the USPST(B) and COIL20(B) datasets to evaluate the performance of these algorithms. The experimental results are shown in Table 8. All experimental results are obtained under the optimal parameters. From Table 8, we can see that our method is better than the semi-supervised learning algorithm Lap-SVM on both USPST(B) and COIL20(B). It can be further found that the proposed method performs well compared to the other ELM-based semi-supervised algorithms. The above experimental analysis further validates that our proposed Lap-LELM is effective and reliable.

Table 8 Performance comparison of Lap-SVM, SS-ELM, MR-ELM and Lap-LELM on the COIL20(B) and USPST(B) datasets

6 Conclusion

In this paper, we first proposed a new Lagrangian extreme learning machine (LELM) based on optimization theory. Then, a semi-supervised Lagrangian extreme learning machine (Lap-LELM) was proposed by extending LELM to a semi-supervised learning framework, which incorporates manifold regularization into LELM to improve performance when insufficient labeled samples are available. Compared to existing supervised and semi-supervised ELM algorithms, the proposed LELM and Lap-LELM maintain almost all the advantages of ELMs, such as the remarkable training efficiency for binary classification problems. In addition, through the SMW identity, LELM and Lap-LELM are transformed into two smaller unconstrained optimization problems, and two very simple iterative algorithms are constructed to solve them. Theoretical analysis and numerical experiments show that our iterative algorithms converge globally, have low computational burden and achieve competitive generalization performance compared with traditional learning algorithms.

In the near future, we will further optimize our proposed framework and study the sparse regularization problem for our framework. In addition, we will extend our method to multi-class classification and some practical applications.