Abstract
Two extreme learning machine (ELM) frameworks are proposed to handle supervised and semi-supervised classification. The first, called the Lagrangian extreme learning machine (LELM), is based on optimality conditions and duality theory. LELM is then extended to the semi-supervised setting to obtain a semi-supervised extreme learning machine (called Lap-LELM), which incorporates manifold regularization into LELM to improve performance when insufficient training information is available. To avoid the inconvenience caused by matrix inversion, the Sherman-Morrison-Woodbury (SMW) identity is used in LELM and Lap-LELM, which leads to two smaller-sized unconstrained minimization problems. The proposed models are solvable in a space of dimensionality equal to the number of sample points, and the resulting iterative algorithms converge globally with low computational burden. To verify the feasibility and effectiveness of the proposed methods, we perform a series of experiments on a synthetic dataset, near-infrared (NIR) spectroscopy datasets and benchmark datasets. Experimental results demonstrate that the proposed methods outperform traditional supervised and semi-supervised methods on most of the considered datasets.
1 Introduction
Due to its simple structure, low computational complexity and good generalization, the extreme learning machine (ELM) [1,2,3,4,5,6] has been successfully applied in many fields [7,8,9,10]. Compared with traditional neural networks, the main advantages of ELM are that it runs fast, attains a global optimal solution and is easy to implement. Its hidden nodes and input weights are randomly generated, and the output weights are expressed analytically. Recently, many researchers have done in-depth research on ELM. For example, Wang et al. [7] proposed a self-adaptive extreme learning machine (SaELM), which selects the best number of hidden-layer neurons to construct the optimal network; it trains fast and obtains a global optimal solution with good generalization. Zhang et al. [8] proposed an ELM+ which introduces privileged information into the traditional ELM method. Zhang et al. [9] proposed a memetic-algorithm-based extreme learning machine (M-ELM), which embeds a local search strategy into a global optimization framework to obtain optimal network parameters. Huang et al. [2] proposed an ELM based on optimization theory (OPT-ELM) by introducing the hinge loss into the ELM framework. It minimizes the norm of the output weights to find a separating hyperplane with maximal margin between two classes of data, which is similar to the idea employed in the support vector machine (SVM) [11]. Compared to ELM, minimizing the norm of the output weights enables OPT-ELM to achieve better generalization performance. OPT-ELM solves a quadratic programming (QP) problem, which assures that a global optimal solution can be found.
The manifold regularization method has been widely used for semi-supervised learning tasks. One of the most popular manifold regularization methods is Laplacian regularization [12,13,14,15,16,17,18], which utilizes the graph Laplacian to capture the geometric information of the data. ELMs are very popular in many fields but are mainly used for supervised learning tasks, which greatly limits their applicability. Many researchers have therefore introduced Laplacian regularization into the ELM framework for semi-supervised learning tasks [19,20,21,22,23].
Lagrangian support vector machine (LSVM) [24] is a computationally powerful machine learning tool. It minimizes an unconstrained differentiable convex function in a space of dimensionality equal to the number of classified points. In recent years, some researchers have extended the idea of LSVM to twin support vector machines [25,26,27,28,29,30,31,32,33].
In this paper, inspired by the studies above, we propose two novel ELM formulations for supervised and semi-supervised classifications. The main contributions of this paper are summarized as follows:
-
(1)
The first, called the Lagrangian extreme learning machine (LELM), is based on optimality conditions and duality theory. LELM is then generalized to the semi-supervised setting to obtain a semi-supervised Lagrangian extreme learning machine (Lap-LELM), which incorporates manifold regularization into LELM to improve performance when insufficient training information is available.
-
(2)
In order to avoid the inconvenience caused by matrix inversion, the Sherman-Morrison-Woodbury (SMW) identity [24] is used in the LELM and Lap-LELM optimization problems, which leads to two smaller-sized unconstrained minimization problems. The proposed models are solvable in a space of dimensionality equal to the number of samples.
-
(3)
Two fast and simple algorithms are designed to optimize the proposed LELM and Lap-LELM, which require only iteratively solving linear equations rather than a quadratic programming problem as in OPT-ELM. The resulting iterative algorithms converge globally and have low computational burden.
-
(4)
Different from the Lagrangian SVM (LSVM), which has difficulty in dealing with nonlinear problems, the proposed LELM and Lap-LELM have an explicit kernel function form and are convenient to use for nonlinear classification.
The rest of this paper is organized as follows. Section 2 briefly reviews OPT-ELM, LSVM and MR. We propose LELM and Lap-LELM in Sections 3 and 4, respectively. The experimental results and discussions are presented in Section 5. Finally, the conclusion is drawn in Section 6.
2 Related work
In order to propose an improved version of OPT-ELM, we review OPT-ELM in Section 2.1, and introduce LSVM and MR in Sections 2.2 and 2.3, respectively.
2.1 Optimization method based ELM
Consider a supervised learning problem with training data \(T_{l}=\{\boldsymbol {x}_{i},y_{i}\}_{i = 1}^{l}\), where \(x_{i}\in \mathbb {R}^{n}\), yi ∈ {− 1, + 1}, and Tl denotes a set of l labeled samples.
Huang et al. [2] proposed an ELM framework based on optimization theory (called OPT-ELM), in which the hinge loss function is introduced. This leads to the following optimization problem:
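Spelled out (a restatement of problem (1) in the standard OPT-ELM form of [2], using the notation defined immediately below):

$$ \min_{\beta,\,\xi}\ \frac{1}{2}\|\beta\|^{2}+C\sum_{i = 1}^{l}\xi_{i}\qquad \text{s.t.}\ \ y_{i}\,\beta^{T}h(x_{i})\geq 1-\xi_{i},\quad \xi_{i}\geq 0,\quad i = 1,\ldots,l. $$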
where \(h(x)=(g({w_{1}^{T}}x+b_{1}),\ldots ,g({w_{L}^{T}}x+b_{L}))^{T}\) maps the data from the n-dimensional input space to the ELM feature space; L is the number of hidden-layer nodes; ξ = (ξ1, … , ξl) is the vector of slack variables; and C is a positive penalty parameter. This is a quadratic program that admits a global solution.
According to the optimization theory, the dual problem of optimization problem (1) is
We can define the ELM kernel function as:
where G(a, b, x) is a nonlinear piecewise continuous function satisfying ELM universal approximation capability theorems [1,2,3,4,5,6] and \(\{(a_{i},b_{i})\}_{i = 1}^{L}\) are randomly generated according to any continuous probability distribution.
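As an illustration, a minimal NumPy sketch of the random ELM feature map and the induced kernel K_ELM(x_i, x_j) = h(x_i)^T h(x_j) (the explicit kernel form used later in Section 3.2). The sigmoid choice of G and all sizes here are illustrative assumptions; function names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_feature_map(X, A, b):
    """Random ELM feature map h(x) = (G(a_1, b_1, x), ..., G(a_L, b_L, x))
    with the sigmoid G(a, b, x) = 1 / (1 + exp(-(a.x + b)))."""
    return 1.0 / (1.0 + np.exp(-(X @ A.T + b)))

def elm_kernel(X1, X2, A, b):
    # K_ELM(x_i, x_j) = h(x_i)^T h(x_j)
    return elm_feature_map(X1, A, b) @ elm_feature_map(X2, A, b).T

# hypothetical sizes: n = 2 input dimensions, L = 50 hidden nodes
n, L = 2, 50
A = rng.standard_normal((L, n))   # (a_1, ..., a_L), drawn from a continuous distribution
b = rng.standard_normal(L)        # (b_1, ..., b_L)
X = rng.standard_normal((5, n))
K = elm_kernel(X, X, A, b)        # 5 x 5 symmetric positive semidefinite kernel matrix
```

Since the hidden parameters are fixed once drawn, the kernel is deterministic given (A, b), and the resulting Gram matrix is symmetric positive semidefinite by construction.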
Thus, we can get the following optimization problem:
The decision function of OPT-ELM is
2.2 Lagrangian support vector machine
Mangasarian and Musicant modified the standard SVM and proposed the Lagrangian support vector machine (LSVM) [24]:
where parameter ν > 0, w is the normal to the bounding planes and b determines their location relative to the origin. According to duality theory, the dual of this problem is:
where \(Q=\left (\frac {I}{\nu }+D(AA^{T}+ee^{T})D\right )\).
The LSVM Algorithm is based directly on the Karush-Kuhn-Tucker necessary and sufficient optimality conditions of the dual problem (7):
These optimality conditions lead to the following very simple iterative scheme which constitutes LSVM Algorithm:
for which we will establish global linear convergence from any starting point under the easily satisfiable condition:
More details can be found in [24].
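For completeness, the iterative scheme of [24] referenced above takes the closed form

$$ u^{i+1}=Q^{-1}\left(e+\left((Qu^{i}-e)-\alpha u^{i}\right)_{+}\right),\quad i = 0,1,\ldots, $$

with global linear convergence from any starting point whenever \(0<\alpha <\frac {2}{\nu }\).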
2.3 Manifold regularization
Consider a semi-supervised learning problem with training data \(T=T_{l}\cup T_{u}=\{\boldsymbol {x}_{i},y_{i}\}_{i = 1}^{l}\cup \{\boldsymbol {x}_{j}\}_{j=l + 1}^{l+u}\), where \(x_{i}\in \mathbb {R}^{n}\), \(x_{j}\in \mathbb {R}^{n}\), yi ∈{− 1,+ 1}; Tl denotes a set of l labeled samples and Tu denotes a set of u unlabeled samples. The manifold regularization approach [13] takes advantage of the geometry of the marginal distribution \(\mathcal {P}_{X}\). We assume that the support of the data probability distribution has the geometry of a Riemannian manifold \(\mathcal {M}\). The labels of two samples that are close in the intrinsic geometry of \(\mathcal {P}_{X}\) should be the same or similar, which means that the conditional probability distribution \(\mathcal {P}(y\mid x)\) should vary little between two such points. As a result, we have
where Lp = D −W is the graph Laplacian; D is the diagonal degree matrix of W given by \(\mathbf {D}_{ii} = {\sum }^{l+u}_{j = 1} \mathbf {W}_{ij}\), and Dij = 0 for i ≠ j; the normalizing coefficient \(\frac {1}{(l+u)^{2}}\) is the natural scale factor for the empirical estimate of the Laplace operator. If kernel function k(⋅,⋅) is given, we estimate the target function by minimizing
where V is a loss function, γA controls the complexity of functions in the ambient space, and γI controls the complexity of functions in the intrinsic geometry of the sample probability distribution.
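Concretely, in the standard notation of [13], the minimization just described reads

$$ f^{\ast}=\arg\min_{f}\ \frac{1}{l}\sum_{i = 1}^{l}V(x_{i},y_{i},f)+\gamma_{A}\|f\|_{K}^{2}+\frac{\gamma_{I}}{(l+u)^{2}}\,\mathbf{f}^{T}\mathbf{L}_{p}\mathbf{f}, $$

where \(\mathbf {f}=(f(x_{1}),\ldots ,f(x_{l+u}))^{T}\) is the vector of function values on the labeled and unlabeled samples.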
3 Lagrangian extreme learning machine
In this section, based on optimization theory, we propose a new Lagrangian extreme learning machine (LELM). Then, we compare LELM with other related algorithms.
3.1 Lagrangian extreme learning machine
We slightly modify the OPT-ELM with the hinge loss function. First, we change the l1-norm of ξ to the squared l2-norm, which makes the constraint ξ ≥ 0 redundant and guarantees the strict convexity of the objective function. Therefore, based on optimization theory, we obtain the following optimization problem:
where C is a tradeoff parameter, D ∈ ℝl×l is a diagonal matrix with yi(i = 1, 2, … , l) along its diagonal, H = [h(x1), … , h(xl)]T is the output matrix, and e is a column vector of ones of appropriate dimension. The objective function above is strictly convex, which guarantees that (10) has a unique solution.
The Lagrange function of the primal LELM optimization (10) is
where α = (α1, … , αl)T are the Lagrange multipliers with non-negative values.
Based on Karush-Kuhn-Tucker (KKT) condition we have:
Substituting (12) and (13) into (11), we obtain the dual form of LELM:
Before stating our algorithm we define two matrices to simplify notation as follows:
With these definitions, the dual problem (14) can be written as follows:
Similarly, the Lagrange function of the optimization problem (16) is
Based on the Karush-Kuhn-Tucker necessary and sufficient optimality conditions we can get
Thus, we have
For any two real numbers or vectors x and y, the following identity can be established
Thus, the optimality condition (17) can be written in the following equivalent form for any positive θ:
These optimality conditions lead to the following very simple iterative scheme which constitutes our LELM Algorithm:
where θ satisfies the condition
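By direct analogy with the LSVM scheme of [24], the iteration and the step-size condition above presumably take the form

$$ \alpha^{i+1}=Q^{-1}\left(e+\left((Q\alpha^{i}-e)-\theta\alpha^{i}\right)_{+}\right),\qquad 0<\theta<\frac{2}{C}, $$

which is consistent with the practical choice \(0<\theta <\frac {1.7}{C}\) used in the experiments of Section 5.2.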
Next, we will introduce the projection theorem [34] to prepare for the subsequent proof of the global convergence of our algorithm.
Theorem 1
(Projection Theorem [34]) LetX be a nonempty, closed, and convex subset of\(\mathbb {R}^{n}\).
-
(a)
For every\(z\in \mathbb {R}^{n}\), there exists a uniquex∗∈ Xthat minimizes ∥ z − x ∥ over allx ∈ X. This vector is called the projection ofz onX and is denoted by [z]+.
-
(b)
Given some\(z\in \mathbb {R}^{n}\), a vectorx∗ ∈ Xis equal to the projection [z]+if and only if
$$ (z-x^{\ast})^{T}(x-x^{\ast})\leq 0,\quad \forall x\in X. $$(21) -
(c)
The mapping\(f: \mathbb {R}^{n}\mapsto X\)defined byf(x) = [x]+is continuous and nonexpansive, that is
$$ \parallel[x]_{+}-[y]_{+}\parallel\leq\parallel x-y\parallel,\ \ \forall x,y\in\mathbb{R}^{n}. $$(22) -
(d)
In the case where X is a subspace, a vector\(x^{\ast }\in X\)is equal to the projection [z]+if and only if z − x∗is orthogonal to X, that is,
$$ (z-x^{\ast})^{T}x = 0,\quad \forall x\in X. $$(23)
Theorem 2
(Global Convergence of LELM) LetQ be the symmetric positive definite matrix defined by (15) and let (19) hold. Starting with an arbitraryα0, the iteratesαiof (20) converge to the unique solution\(\bar {\alpha }\)of (16) at the linear rate:
Proof
Suppose \(\bar {\alpha }\) is the solution of (16), it must satisfy the optimality condition (18) for any θ > 0. So, we have
and
From (25) and (26) we can get:
The Projection Theorem 1 states that the distance between any two points in \(\mathbb {R}^{n}\) is not less than the distance between their projections on any convex set in \(\mathbb {R}^{n}\). We can obtain the following inequality:
Now we only need to prove ∥I − θQ^{-1}∥ < 1, which follows from (20). Noting the definition (15) of Q and letting λi, i = 1, ... , m, denote the nonnegative eigenvalues of MM^T, all we need is:
which is satisfied under the assumption (20). □
Based on the above discussion, the LELM is summarized as Algorithm 1.
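As a concrete illustration, here is a minimal NumPy sketch of Algorithm 1 under our reading of the derivation: the LSVM-style fixed-point iteration combined with the SMW inversion of Q = I/C + MM^T with M = DH. Function and variable names are ours, not the paper's, and the default step size follows the practical bound 0 < θ < 1.7/C from Section 5.2:

```python
import numpy as np

def lelm_fit(H, y, C=10.0, theta=None, tol=1e-6, max_iter=2000):
    """LELM training sketch: minimize 0.5*a'Qa - e'a subject to a >= 0,
    with Q = I/C + D H H' D, via a <- Q^{-1}(e + ((Qa - e) - theta*a)_+)."""
    l, L_dim = H.shape
    D = np.diag(y.astype(float))
    M = D @ H                                   # l x L
    # SMW identity: (I/C + M M')^{-1} = C (I - M (I/C + M'M)^{-1} M'),
    # so only an L x L system is solved instead of an l x l inversion.
    Q_inv = C * (np.eye(l) - M @ np.linalg.solve(np.eye(L_dim) / C + M.T @ M, M.T))
    Q = np.eye(l) / C + M @ M.T
    theta = 1.6 / C if theta is None else theta  # satisfies 0 < theta < 1.7/C
    alpha, e = np.zeros(l), np.ones(l)
    for _ in range(max_iter):
        alpha_new = Q_inv @ (e + np.maximum((Q @ alpha - e) - theta * alpha, 0.0))
        if np.linalg.norm(alpha_new - alpha) < tol:
            alpha = alpha_new
            break
        alpha = alpha_new
    beta = H.T @ D @ alpha                      # output weights from the KKT condition
    return beta

def lelm_predict(H_test, beta):
    # decision rule f(x) = sign(beta . h(x)); hyperplane passes through the origin
    return np.sign(H_test @ beta)
```

Here H is the hidden-layer output matrix (e.g. random sigmoid features as in Section 5.2); the whole iteration runs in a space of dimensionality equal to the number of samples, as the paper emphasizes.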
3.2 Compared with other related algorithms
We compared the proposed Lagrangian extreme learning machine (LELM) with other related algorithms: Lagrangian support vector machine (LSVM) [24], Unconstrained Lagrangian twin support vector machine (ULTWSVM) [28], Optimization method based extreme learning machine (OPT-ELM) [2] and L2-Regularized Extreme Learning Machine (ELM) [1].
Compared with LSVM and ULTWSVM:
-
(1)
Obviously, the objectives are different. The bias b is not required in LELM since the separating hyperplane βTh(x) = 0 passes through the origin of the LELM feature space, whereas LSVM needs the bias b to determine its hyperplane. ULTWSVM seeks two non-parallel classification hyperplanes, while the proposed LELM seeks a single classification hyperplane passing through the origin.
-
(2)
Different from LSVM and ULTWSVM, which are difficult to optimize for nonlinear problems because of the unknown implicit mapping and the kernel parameters, the kernel function of the proposed LELM (10) has the explicit form KELM(xi, xj) = h(xi)Th(xj), and its network parameters are randomly generated without tuning.
Compared with OPT-ELM and ELM:
-
(1)
Compared with OPT-ELM, we slightly modify the hinge-loss formulation. First, we replace the l1-norm of the slack variable ξ with its squared l2-norm weighted by \(\frac {C}{2}\), which guarantees the strict convexity of the objective function. This leads to an LELM optimization problem with a unique solution.
-
(2)
The traditional ELM tends to reach zero training error; in LELM, however, the training error is generally nonzero, which leads to better generalization performance on test data.
4 Laplacian lagrangian extreme learning machine
In this section, we propose a semi-supervised Lagrangian extreme learning machine (Lap-LELM) by extending LELM to a semi-supervised learning framework. The proposed Lap-LELM incorporates manifold regularization to leverage unlabeled data and improve classification accuracy when labeled data are scarce. Then, we compare Lap-LELM with other related algorithms.
4.1 Laplacian lagrangian extreme learning machine
For binary classification applications, the decision function of ELM is f(x) = sign(β ⋅ h(x)), where h(x) maps the data from the n-dimensional input space to the L-dimensional hidden-layer ELM feature space. By means of the Representer Theorem [2], the output weights β can be expressed in the dual problem as an expansion over the labeled and unlabeled samples
where h(x) = [h(x1), … , h(xl + u)]T and α = [α1, … , αl + u]T. Then, the decision function is
and K is the kernel matrix formed by kernel functions K(xi, xj) = h(xi) ⋅ h(xj). Therefore, the regularization term can be expressed by the kernel matrix and the expansion coefficient:
The geometry of the data is represented by a graph, where nodes represent labeled and unlabeled samples connected by weights Wij. Regularizing the graph follows from the manifold assumption. We can get the manifold term via spectral graph theory [13],
where Lp = D −W is the graph Laplacian; D is the diagonal degree matrix of W given by \(\mathbf {D}_{ii} = {\sum }^{l+u}_{j = 1} \mathbf {W}_{ij}\), and Dij = 0 for i ≠ j; the normalizing coefficient \(\frac {1}{(l+u)^{2}}\) is the natural scale factor for the empirical estimate of the Laplace operator; and f = Kα.
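The graph quantities above can be sketched as follows, using the graph construction of the experiments in Section 5.3 (binary k-nearest-neighbour weights); the helper name and the brute-force distance computation are ours:

```python
import numpy as np

def knn_graph_laplacian(X, k=9):
    """Build W with binary (symmetrized) k-nearest-neighbour weights,
    the degree matrix D with D_ii = sum_j W_ij, and return L_p = D - W."""
    m = X.shape[0]
    # pairwise squared Euclidean distances (brute force, fine for small m)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude each point from its own neighbours
    W = np.zeros((m, m))
    idx = np.argsort(d2, axis=1)[:, :k]   # indices of the k nearest neighbours
    W[np.repeat(np.arange(m), k), idx.ravel()] = 1.0
    W = np.maximum(W, W.T)                # symmetrize the adjacency graph
    D = np.diag(W.sum(axis=1))
    return D - W
```

By construction L_p is symmetric and positive semidefinite with zero row sums, which is what makes the manifold term f^T L_p f a smoothness penalty over the graph.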
Therefore, based on the above (9), (31) and (32) we propose the following Lap-LELM:
where γA and γI are regularization parameters, and K is the kernel matrix formed by the kernel functions K(xi, xj) = h(xi) ⋅ h(xj).
The Lagrange function of the primal Lap-LELM optimization (33) is
where λ = [λ1, … , λl]T are the Lagrange multipliers. In order to find the optimal solutions of (34) we should have
Substituting (35) into (34), we have
where J = [I, 0] is of size l × (l + u), with I of size l × l and 0 of size l × u, and Y = diag(y1, … , yl). Taking derivatives again with respect to α, we obtain the solution
Thus, we obtain the following quadratic programming problem:
where \(\mathbf {S}=\mathbf {YJK}(2\gamma _{A}\mathbf {I}+\frac {2\gamma _{I}}{(l+u)^{2}}\mathbf {L}_{p}\mathbf {K})^{-1} \mathbf {J}^{T}\mathbf {Y}\). The Lagrange function of the optimization (38) is
where η = [η1, … , ηl]T are the Lagrange multipliers. According to the Karush-Kuhn-Tucker Conditions (KKT conditions), we can get
The optimality condition (43) can be written in the following equivalent form for any positive 𝜗:
As mentioned above, these optimality conditions will result in a very simple iteration scheme for our Lap-LELM algorithm:
Our algorithm converges globally and linearly provided the step-size parameter satisfies the following condition:
Thus, we obtain the output weights
Based on the above discussion, the Lap-LELM algorithm is summarized as Algorithm 2.
4.2 Compared with other related algorithms
We compare the Laplacian Lagrangian extreme learning machine (Lap-LELM) with other related algorithms: Laplacian support vector machine (Lap-SVM) [13], Manifold proximal support vector machine (MPSVM) [18], Semi-supervised extreme learning machine (SS-ELM) [21], and Manifold regularized extreme learning machine (MR-ELM) [23].
Compared with Lap-SVM and MPSVM:
-
(1)
The bias b is not required in Lap-LELM since the separating hyperplane βTh(x) = 0 passes through the origin of the ELM feature space, while Lap-SVM and MPSVM need a threshold to determine the hyperplane.
-
(2)
Different from Lap-SVM and MPSVM, which are difficult to optimize for nonlinear problems because of the unknown implicit mapping, the kernel function of the proposed Lap-LELM (33) has the explicit form KELM(xi, xj) = h(xi)Th(xj). Thus the proposed Lap-LELM has fewer decision variables and runs faster than the existing Lap-SVM and MPSVM.
Compared with SS-ELM and MR-ELM:
-
(1)
SS-ELM and MR-ELM are based on the traditional ELM framework, whereas our proposed Lap-LELM is based on the OPT-ELM framework. Relative to SS-ELM and MR-ELM, the proposed Lap-LELM is more robust, since SS-ELM and MR-ELM use the quadratic loss function while Lap-LELM uses the hinge loss.
-
(2)
Compared with SS-ELM and MR-ELM, Lap-LELM has the advantages of fast operation, global convergence and low computational burden. Lap-LELM is solvable in a space of dimensionality equal to the number of sample points, whereas the SS-ELM and MR-ELM models are solved in a space of dimension greater than the number of sample points, so their computation is more complicated and their computational burden is higher.
5 Numerical results
To evaluate the accuracy and efficiency of the LELM and Lap-LELM algorithm, we performed experiments on benchmark datasets and four NIR spectroscopy datasets. All algorithms were implemented using MATLAB R2014b on a 3.40 GHz machine with 8 GB of memory.
5.1 Experimental design
The evaluation criteria and datasets description should be specified before presenting the experimental results.
5.1.1 Algorithm evaluation criteria
In order to evaluate the effectiveness of the proposed methods, the evaluation criteria used in this paper are ACC (accuracy), F1-measure and MCC (Matthews correlation coefficient): ACC is the overall recognition rate of the two classes, F1 is the harmonic mean of precision and recall, and MCC is a comprehensive evaluation criterion. CPU time also serves as an evaluation indicator. They are defined as:
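In terms of the confusion-matrix counts TP, TN, FP and FN, the three criteria have the standard definitions, sketched below (function names are ours; the zero-denominator edge cases are not handled in this sketch):

```python
import math

def acc(tp, tn, fp, fn):
    # overall recognition rate
    return (tp + tn) / (tp + tn + fp + fn)

def f1(tp, tn, fp, fn):
    # harmonic mean of precision and recall
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient in [-1, 1]
    return (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
```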
5.1.2 Datasets description
Our experiments are performed on eleven UCI datasets and four NIR spectroscopy datasets [5], respectively. The eleven UCI datasets are listed in Table 1.
For the synthetic dataset, the two classes of data are generated from different 2-dimensional normal distributions N(μ1, σ1) and N(μ2, σ2), where μ1 = [4.5, 1.5], μ2 = [1.5, − 0.5] and σ1 = σ2 = [0.5, 0.5], and each class contains 1000 sample points.
Licorice is a traditional Chinese herbal medicine. We utilize 244 licorice seeds, including 122 hard seeds and 122 soft seeds, in our experiment. Near-infrared (NIR) spectra of the licorice seeds were obtained via an MPA spectrometer. The NIR spectral range of 5000-10000 cm^{-1} was recorded with a resolution of 4 cm^{-1}, and the initial spectra were digitized by OPUS 5.5 software. For comprehensive validation, numerical experiments are carried out in four different spectral regions: 5000-6000 cm^{-1}, 6000-7000 cm^{-1}, 7000-8000 cm^{-1} and 8000-10000 cm^{-1}, denoted regions A, B, C and D, respectively. Information about them is summarized in Table 2.
5.2 Supervised learning results
Before the experiments, all datasets are normalized to the range [0, 1]. We choose LSVM [24], OPT-ELM [2], NLTWSVM [28], SLTWSVM [28] and SVM [11] as the baseline methods, and conduct numerical experiments on the UCI datasets and NIR spectroscopy datasets, respectively. We perform ten-fold cross-validation on all considered datasets; that is, each dataset is split randomly into ten subsets, and one of those subsets is reserved as the test set.
In our experiments, the RBF kernel \(K(x_{i},x_{j})=e^{-\sigma ||x_{i}-x_{j}||_{2}^{2}}\) is used in SVM, LSVM, NLTWSVM and SLTWSVM. In the LELM and OPT-ELM models we use the Sigmoid function 1/(1 + exp(−(w⋅x + b))) (w and b are randomly generated) as the activation function. For SVM, LSVM, NLTWSVM, SLTWSVM and LELM, we carry out a grid search with 10-fold cross-validation on the training sets to obtain the optimal parameters (C, ν, L, σ) with the highest accuracy; all of the following experimental results are obtained with these optimal parameters. The parameter C is selected from {10^{−5}, ⋯ , 10^{5}}, the parameter ν from {2^{0}, 2^{1}, ⋯ , 2^{8}}, and the RBF kernel parameter σ from {2^{−5}, ⋯ , 2^{5}}. The number of hidden-layer nodes L is selected from {200, 300, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000}; theoretically, the larger L is, the better the generalization performance of LELM [5]. For the LSVM and LELM algorithms we set the parameters \(0<\delta <\frac {1.9}{\nu }\) and \(0<\theta <\frac {1.7}{C}\), respectively.
5.2.1 Experiments on artificial dataset
In order to verify the performance of the proposed LELM, we compared LELM with SVM, LSVM, OPT-ELM, NLTWSVM and SLTWSVM on the artificial dataset. The results of this experiment are shown in Table 3. We can see that LELM outperforms SVM, LSVM, OPT-ELM, NLTWSVM and SLTWSVM in terms of both ACC and CPU time.
5.2.2 Experiments on UCI datasets
We compare the proposed LELM with OPT-ELM and LSVM on eleven UCI datasets. All experimental results are shown in Figs. 1, 2, 3 and 4. Figure 1 illustrates that, compared with OPT-ELM and LSVM, LELM improves generalization ability and achieves higher ACC values on most datasets. In Fig. 2, we compare the MCC values of the proposed LELM with those of OPT-ELM and LSVM on each dataset. As can be seen from Fig. 2, except on Pima, the MCC values of LELM are higher than those of OPT-ELM and LSVM in most cases. We can also find that OPT-ELM outperforms LSVM on most datasets except Australian and Vote. Figure 3 presents a comparison of the F1 values of the proposed LELM with those of LSVM and OPT-ELM on each dataset; LELM performs better than OPT-ELM and LSVM in most cases. Figure 4 compares the number of support vectors of the proposed LELM with those of OPT-ELM and LSVM on each dataset. From Fig. 4, we find that LELM has fewer support vectors (SVs) than OPT-ELM and LSVM on most datasets. The numbers of support vectors of LELM and OPT-ELM are similar on some datasets; more precisely, the former performs better.
To further verify the performance of the proposed LELM, we compared it with OPT-ELM, LSVM, NLTWSVM and SLTWSVM on seven UCI datasets. The experimental results are presented in Table 4; the classification accuracies and CPU times are averages over five runs. From Table 4, we find that LELM is superior to the other algorithms in both classification accuracy and running time on the seven datasets. The performance of LELM on most of these datasets is similar to that of NLTWSVM and SLTWSVM; to be precise, LELM is slightly better. This analysis of the experimental results further validates that the proposed algorithm is effective and reliable.
5.2.3 Experiments on NIR spectroscopy datasets
We compared the proposed LELM with SVM, LSVM, OPT-ELM, NLTWSVM and SLTWSVM on the NIR spectroscopy datasets. The experimental results are presented in Table 5, from which we find that LELM achieves better performance than SVM, LSVM, OPT-ELM, NLTWSVM and SLTWSVM with respect to both ACC and CPU time. We further find that the performance of the proposed LELM is very similar to that of NLTWSVM and SLTWSVM on the NIR spectroscopy datasets; more precisely, our algorithm is slightly better.
5.3 Semi-supervised learning results
In order to verify the generalization performance of the proposed semi-supervised algorithm, we conducted experiments on eleven UCI datasets and on the COIL20(B) and USPST(B) datasets, using five runs of 10-fold cross-validation. We choose LapSVM [13], SS-ELM [21] and MR-ELM [23] as the benchmark comparison algorithms.
Before the experiments, all datasets are normalized to the range [0, 1]. LapSVM, SS-ELM, MR-ELM and Lap-LELM construct data adjacency graphs using k-nearest neighbors; binary edge weights are chosen, and the neighborhood size k is set to 9 for all datasets. We use the Sigmoid function 1/(1 + exp(−(w⋅x + b))) as the activation function for SS-ELM, MR-ELM, LELM and Lap-LELM, and the RBF kernel \(K(x_{i},x_{j})=e^{-\sigma ||x_{i}-x_{j}||_{2}^{2}}\) in LapSVM. We carry out a grid search with ten-fold cross-validation on the training sets to obtain the optimal parameters (C, γA, γI, σ, λ) with the highest accuracy: γA and γI are selected from {10^{0}, ⋯ , 10^{3}}; C and λ from {10^{−5}, ⋯ , 10^{5}}; the RBF kernel parameter σ from {2^{−5}, ⋯ , 2^{5}}; and the number of hidden nodes L from {200, 300, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000}. All the following experimental results are obtained with the optimal parameters.
5.3.1 Experiments on artificial datasets
In order to further verify the generalization performance of the proposed Lap-LELM, we performed experiments on the artificial dataset. The experimental results are shown in Table 6. We find that Lap-LELM is better than LapSVM, LELM, SS-ELM and MR-ELM in classification accuracy.
5.3.2 Experiments on UCI datasets
In this subsection, we run LELM, LapSVM, SS-ELM, MR-ELM and Lap-LELM on eleven UCI datasets, most of which have appeared in the previous experiments. We conduct the experiments with different proportions of labeled samples, i.e., 10% and 20%. The testing accuracy is computed using standard 10-fold cross-validation. For the supervised algorithm (LELM), the training samples were divided into ten parts, one part (two parts when 20% of samples were labeled) for training and the rest for testing; for the semi-supervised algorithms (LapSVM, SS-ELM, MR-ELM and Lap-LELM), all samples are involved in training.
We compare the classification accuracy of the proposed Lap-LELM with LELM, LapSVM, SS-ELM and MR-ELM. All experimental results are based on the optimal parameters, and the classification accuracy is the average of five runs. Results are listed in Table 7. We find from Table 7 that Lap-LELM performs better than LELM on all datasets, which shows that in semi-supervised learning problems the use of the graph Laplacian can improve the classification accuracy of the model. We further find that the proposed Lap-LELM is superior to the traditional semi-supervised algorithm Lap-SVM on the vast majority of datasets, and that Lap-LELM outperforms the ELM-based semi-supervised algorithms SS-ELM and MR-ELM on most datasets. This analysis further validates that introducing manifold regularization into the LELM framework can effectively improve the performance of LELM when sample information is insufficient.
To further verify the performance of the proposed Lap-LELM, we compared it with LELM and Lap-SVM on six UCI datasets with different proportions of labeled samples (10%, 30%, 50%, 70%, 90%). The experimental results are presented in Fig. 5.
Figure 5 shows the performance of LapSVM, LELM and Lap-LELM on Balance, Breast Cancer, Australian, WDBC, Vote and Ionosphere with different numbers of labeled samples. In this experiment, all the settings are similar to those above, except that we varied the proportion of labeled data in the training set. We can observe that LapSVM and Lap-LELM outperform LELM significantly when there is only a small amount of labeled data. Further, the classification accuracy of LELM grows as the number of labeled samples gradually increases.
5.3.3 Experimental results on COIL20(B) and USPST(B) datasets
COIL20(B) and USPST(B) are widely used datasets for evaluating semi-supervised learning algorithms. We compared the proposed Lap-LELM with Lap-SVM, SS-ELM and MR-ELM on the COIL20(B) and USPST(B) datasets.
The COIL20(B) and USPST(B) datasets are described as follows:
The Columbia object image library (COIL20) is a set of 1440 gray-scale images of 20 different objects. Each sample represents a 32 × 32 gray scale image of an object acquired from a specific view. The COIL20(B) is a binary data set generated by grouping the first 10 objects in COIL20 to class 1 and the remaining objects to class 2.
The USPST data set is a collection of hand-written digits from the USPS postal system. Each digit image is represented by a resolution of 16 × 16 pixels. The USPST(B) is a binary data set which was built by grouping the first 5 digits to class 1 and the remaining digits to class 2.
The proposed Lap-LELM is compared with Lap-SVM, MR-ELM and SS-ELM, using classification accuracy on the USPST(B) and COIL20(B) datasets to evaluate the performance of these algorithms. The experimental results are shown in Table 8; all results are obtained under the optimal parameters. From Table 8, we can see that our method is better than the Lap-SVM semi-supervised learning algorithm on both USPST(B) and COIL20(B). It can further be found that the proposed method performs well compared to the other ELM-based semi-supervised algorithms. The above experimental analysis further validates that the proposed Lap-LELM is effective and reliable.
6 Conclusion
In this paper, we first proposed a new Lagrangian extreme learning machine (LELM) based on optimization theory. Then, a semi-supervised Lagrangian extreme learning machine (Lap-LELM) was proposed by extending LELM to a semi-supervised learning framework, incorporating manifold regularization into LELM to improve performance when insufficient labeled samples are available. Compared to existing supervised and semi-supervised ELM algorithms, the proposed LELM and Lap-LELM maintain almost all the advantages of ELMs, such as the remarkable training efficiency for binary classification problems. In addition, through the SMW identity, LELM and Lap-LELM are transformed into two smaller unconstrained optimization problems, and two very simple iterative algorithms are constructed to solve them. Theoretical analysis and numerical experiments show that our iterative algorithms converge globally, have a low computational burden and achieve competitive generalization performance compared with traditional learning algorithms.
In the near future, we will further optimize our proposed framework and study the sparse regularization problem for our framework. In addition, we will extend our method to multi-class classification and some practical applications.
References
Huang G, Zhu QY, Siew CK (2006) Extreme learning machine: theory and applications. Neurocomputing 70(1):489–501
Huang G, Ding XJ, Zhou HM (2010) Optimization method based extreme learning machine for classification. Neurocomputing 74:155–163
Huang G, Huang G, Song S, You KY (2015) Trends in extreme learning machines: a review. Neural Netw 61:32–48
Yang L, Zhang S (2017) A smooth extreme learning machine framework. J Intell Fuzzy Syst 33(6):3373–3381
Yang L, Zhang S (2016) A sparse extreme learning machine framework by continuous optimization algorithms and its application in pattern recognition. Eng Appl Artif Intel 53(C):176–189
Wang Y, Cao F, Yuan Y (2011) A study on effectiveness of extreme learning machine. Neurocomputing 74(16):2483–2490
Wang G, Lu M, Dong YQ, Zhao XJ (2016) Self-adaptive extreme learning machine. Neural Comput Appl 27(2):291–303
Zhang W, Ji H, Liao G, Zhang Y (2015) A novel extreme learning machine using privileged information. Neurocomputing 168(C):823–828
Zhang Y, Wu J, Cai Z, Zhang P, Chen L (2016) Memetic extreme learning machine. Pattern Recogn 58(C):135–148
Ding XJ, Lan Y, Zhang ZF, Xu X (2017) Optimization extreme learning machine with ν regularization. Neurocomputing
Vapnik VN (1995) The nature of statistical learning theory. Springer, New York
Belkin M, Niyogi P (2004) Semi-supervised learning on riemannian manifolds. Mach Learn 56(1-3):209–239
Belkin M, Niyogi P, Sindhwani V (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res 7:2399–2434
Zhu X (2006) Semi-supervised learning literature survey. Technical report, Computer Sciences, University of Wisconsin-Madison
Chapelle O, Sindhwani V, Keerthi SS (2008) Optimization techniques for semi-supervised support vector machines. J Mach Learn Res 9(1):203–233
Wang G, Wang F, Chen T, Yeung DY, Lochovsky FH (2012) Solution path for manifold regularized semisupervised classification. IEEE Trans Syst Man Cybern Part B Cybern 42(2):308
Melacci S, Belkin M (2011) Laplacian support vector machines trained in the primal. J Mach Learn Res 12:1149–1184
Chen WJ, Shao YH, Xu DK, Fu YF (2014) Manifold proximal support vector machine for semi-supervised classification. Appl Intell 40(4):623–638
Deng W, Zheng Q, Chen L (2009) Regularized extreme learning machine. In: IEEE Symposium on computational intelligence and data mining, 2009. CIDM ’09. IEEE, pp 389–395
Iosifidis A, Tefas A, Pitas I (2014) Semi-supervised classification of human actions based on neural networks. In: International conference on pattern recognition. IEEE, pp 1336–1341
Huang G, Song S, Gupta JND, Wu C (2014) Semi-supervised and unsupervised extreme learning machines. IEEE Trans Cybern 44(12):2405
Zhou Y, Liu B, Xia S, Liu B (2015) Semi-supervised extreme learning machine with manifold and pairwise constraints regularization. Neurocomputing 149(PA):180–186
Liu B, Xia SX, Meng FR, Zhou Y (2016) Manifold regularized extreme learning machine. Neural Comput Applic 27(2):255–269
Mangasarian OL, Musicant DR (2001) Lagrangian support vector machines. J Mach Learn Res 1(3):161–177
Balasundaram S, Tanveer M (2013) On lagrangian twin support vector regression. Neural Comput Appl 22(1):257–267
Tanveer M, Shubham K, Aldhaifallah M, Nisar KS (2016) An efficient implicit regularized lagrangian twin support vector regression. Appl Intell 44(4):1–18
Shao YH, Chen WJ, Zhang JJ, Wang Z, Deng NY (2014) An efficient weighted lagrangian twin support vector machine for imbalanced data classification. Pattern Recogn 47(9):3158–3167
Balasundaram S, Gupta D, Prasad SC (2016) A new approach for training lagrangian twin support vector machine via unconstrained convex minimization. Appl Intell 46(1):1–11
Balasundaram S, Gupta D (2014) On implicit lagrangian twin support vector regression by Newton method. Int J Comput Intell Syst 7(1):50–64
Tanveer M, Shubham K (2017) A regularization on lagrangian twin support vector regression. Int J Mach Learn Cybern 8(3):807–821
Balasundaram S, Gupta D (2014) Training lagrangian twin support vector regression via unconstrained convex minimization. Knowl-Based Syst 59(59):85–96
Tanveer M (2015) Newton method for implicit lagrangian twin support vector machines. Int J Mach Learn Cybern 6(6):1029–1040
Shao YH, Hua XY, Liu LM, Yang ZM, Deng NY (2015) Combined outputs framework for twin support vector machines. Appl Intell 43(2):424–438
Bertsekas DP (1995) Nonlinear programming. Athena Scientific, Belmont
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (No. 11471010) and the Chinese Universities Scientific Fund.
Ma, J., Wen, Y. & Yang, L. Lagrangian supervised and semi-supervised extreme learning machine. Appl Intell 49, 303–318 (2019). https://doi.org/10.1007/s10489-018-1273-4