Abstract
Extreme learning machine (ELM), a learning paradigm for generalized single hidden layer feedforward neural networks, has been widely studied for its fast training, good generalization, and universal approximation/classification capability. In this paper, a novel framework of discriminative extreme learning machine (DELM) is developed for pattern classification. In DELM, the margins between different classes are enlarged as much as possible by learning a nonnegative label relaxation matrix through a technique called ε-dragging. DELM is further extended to pruning DELM (P-DELM) by introducing L2,1-norm regularization; the resulting model naturally distinguishes the importance of different hidden neurons and removes the worthless ones, leading to a more compact network. The performance of DELM is compared with several state-of-the-art methods on public face databases, and the results show its effectiveness for face recognition under posture, facial expression, and illumination variations. Experiments on several benchmark datasets further show that P-DELM achieves promising performance with fewer hidden neurons and less prediction time.
Introduction
Cognitive computation has been emerging as a discipline involving neurobiology, cognitive psychology, and artificial intelligence [1, 2]. A cognitive system is, broadly speaking, something that seeks to mimic or better understand the way that humans process complex situations. Extensive efforts have been made in the study of cognitive-inspired techniques and systems in the past few decades [3,4,5]. As a type of cognitive-inspired computation technique, feedforward neural networks (FNNs) have been widely investigated and applied since the introduction of the well-known back-propagation (BP) algorithm [6]. However, these gradient descent-based methods may suffer from problems of local minima, learning rate selection, stopping criteria, and learning epochs [7].
Recently, extreme learning machine (ELM) has been proposed for training single hidden layer feedforward neural networks (SLFNs). Unlike other traditional learning algorithms, e.g., BP-based neural networks (NNs) or support vector machine (SVM), the parameters of the hidden layer in ELM are randomly generated without tuning. The hidden nodes in ELM can thus be established independently of the training data [8, 9]. Huang et al. [10, 11] have theoretically proved that SLFNs with randomly generated hidden neurons and output weights calculated by regularized least squares maintain their universal approximation capability. Concrete biological evidence for ELM has also been reported [12, 13]. With this learning theory, ELM tends to train faster and generalize better than BP-based NNs and SVM. ELMs have been extensively studied and demonstrated to have excellent learning accuracy and speed in a variety of applications, such as semisupervised and unsupervised learning [14], multilayer perceptron [15], dimensionality reduction [16], visual tracking [17], tactile object recognition [18], and transfer learning [19, 20].
In the architectural design of an ELM network, a key problem is to determine a suitable number of hidden neurons. Too few or too many hidden neurons would lead to underfitting or overfitting [21]. The suitable number is usually determined with human intervention in a trial-and-error way. There are mainly two heuristic techniques for this problem, i.e., constructive (growing) methods and destructive (pruning) methods. Huang et al. [10] presented an incremental ELM (I-ELM), where hidden nodes are added incrementally and the output weights are determined analytically. Lan et al. [22] proposed a constructive hidden node selection method for ELM (CS-ELM) that selects the optimal number of hidden nodes when the unbiased risk estimation-based criterion C_p reaches its minimum. Obviously, both I-ELM and CS-ELM are constructive methods. There are also pruning methods. Miche et al. [23] proposed an optimally pruned ELM (OP-ELM), which first ranks the hidden neurons using the multiresponse sparse regression (MRSR) algorithm and then selects hidden neurons through leave-one-out (LOO) validation. Recently, a pruning ensemble model of ELM with L1/2 regularizer (PE-ELMR) was proposed in [24]; PE-ELMR incorporates the L1/2 regularizer into a preliminary ELM, and the hidden neurons are pruned with the ensemble model.
From the viewpoint of geometry, it is desirable that the distances between data points in different classes be as large as possible after transformation [25]. However, the traditional ELM assumes that the hidden layer output can be exactly regressed onto the strict label matrix and does not consider such a geometrical criterion. These observations motivate us to introduce the geometrical criterion into classical ELM to fully exploit the discriminant information in data.
To this end, one feasible way is to enlarge the distances between the regression labels of different classes. Figure 1 shows the idea of our method. For a two-class classification problem, the regression labels in the original ELM are coded as [+1, −1] and [−1, +1], denoted as red points in the figure. The maximal distance between them is fixed, with little freedom. After being dragged to the blue points along a proper direction and by a proper amount, the distance between them is enlarged (d2 > d1). With this strategy, ELM is expected to achieve better generalization by fully exploiting the discriminative information in data. In detail, a technique called ε-dragging is employed to learn a nonnegative label relaxation matrix, which makes the regression labels of different classes move in opposite directions. A slack label matrix is embedded into the ELM framework so that the distances between different classes can be enlarged. Besides, the proposed discriminative extreme learning machine (DELM) method is further extended for the architectural design of the ELM network in a destructive manner. We introduce a structured norm regularization, namely the L2,1-norm, into the DELM model to learn a row-sparse output weight matrix. The method, termed pruning DELM (P-DELM), can distinguish the importance of different hidden neurons in information transmission. As a result, worthless neurons can be adaptively removed for a more compact network. It is worth noting that there are major differences between our work and the work in [17], though both adopt L2,1-norm regularization for neuron pruning. Firstly, we focus on pattern classification with single-task ELM in a supervised way, whereas the model in [17] targets visual tracking with multitask ELM in a semisupervised manner. Secondly, we first design a novel DELM model with label relaxation and then introduce L2,1-norm regularization into the developed DELM; the obtained P-DELM network tends to be more compact and to generalize better. Thirdly, the model in [17] first ranks the hidden neurons and then selects a fixed number of them, while we develop an adaptive neuron selection method based on the obtained row-sparse output weight matrix. Moreover, the working principle of L2,1-norm regularization is analyzed in detail in this paper. The proposed P-DELM might be introduced into the model in [17] for visual tracking.
Several characteristics of the proposed DELM and P-DELM are as follows:
1. DELM inherits the merits of ELM, including feature mapping with randomly generated input weights and biases, and good generalization.
2. The Hadamard product of matrices is introduced into DELM to perform ε-dragging. A slack variable matrix is constructed so that the margins between different classes can be enlarged. The resulting optimization problem is solved iteratively by variable decoupling. Both theoretical analysis and experimental evaluation show the effectiveness of DELM.
3. DELM is extended to P-DELM based on L2,1-norm regularization. Worthless hidden neurons can be removed via the obtained row-sparse output weight matrix, yielding a more compact network.
The remainder of this paper is outlined as follows. “Extreme Learning Machine and Discriminative Extreme Learning Machine” section reviews related works on ELM and presents our discriminative ELM (DELM) model. Our pruning DELM (P-DELM) model is introduced in “Neuron Pruning-Inspired Discriminative Extreme Learning Machine” section. “Experiments” section reports the experimental results. Conclusions are drawn in “Conclusions” section.
Extreme Learning Machine and Discriminative Extreme Learning Machine
Extreme Learning Machine
ELMs are a type of FNNs characterized by a random initialization of their hidden layer weights and a fast training algorithm for the output weights. The optimization function of ELM is
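The optimization function itself did not survive extraction; a hedged reconstruction, consistent with the variable definitions that follow and with the standard regularized ELM of [10, 11], is:

```latex
\min_{\boldsymbol{\beta},\,\boldsymbol{\xi}} \;
\frac{1}{2}\left\Vert \boldsymbol{\beta} \right\Vert^{2}
+ C \cdot \frac{1}{2}\left\Vert \boldsymbol{\xi} \right\Vert^{2}
\quad \text{s.t.} \quad
\mathbf{H}\boldsymbol{\beta} = \mathbf{T} - \boldsymbol{\xi}
\tag{1}
```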
where β ∈ ℜ^{L×c} denotes the output weights between the hidden layer and the output layer, ξ = [ξ_1, ξ_2, …, ξ_N]^T ∈ ℜ^{N×c} is the prediction error matrix with respect to the training data, C is a penalty constant on the training errors, and H ∈ ℜ^{N×L} is the hidden layer output matrix, computed as
where X = [x_1, x_2, …, x_N] ∈ ℜ^{d×N} is the training dataset falling into c categories with label matrix T = [t_1, t_2, …, t_N]^T ∈ ℜ^{N×c} and t_i = [−1, …, +1, …, −1] ∈ ℜ^c. Only the jth entry of t_i is +1, indicating that sample x_i comes from the jth class. By substituting the constraints of (1) into its objective function, we get the following equivalent unconstrained optimization problem
The previous problem is widely known as ridge regression or regularized least squares. By setting the gradient of the objective function with respect to β to zero, we have
The closed-form solution for β can be obtained in two circumstances. If the number of training samples N is larger than L (N > L), the gradient equation is overdetermined, and the closed-form solution can be calculated as
where I_{L×L} denotes the identity matrix of size L and H^† is the Moore-Penrose generalized inverse of H. If the number N of training patterns is smaller than L (N < L), an underdetermined least squares problem must be handled. One can restrict β to be a linear combination of the rows of H as β = H^T α (α ∈ ℜ^{N×c}). By substituting β = H^T α into (3) and multiplying both sides by (HH^T)^{−1}H, we have
Then, we get the solution as
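The two closed-form solutions did not survive extraction; hedged reconstructions, consistent with the overdetermined (N > L) and underdetermined (N < L) cases discussed above, are:

```latex
\boldsymbol{\beta} =
\left(\frac{\mathbf{I}_{L\times L}}{C} + \mathbf{H}^{\mathrm{T}}\mathbf{H}\right)^{-1}
\mathbf{H}^{\mathrm{T}}\mathbf{T}
\;\; (N > L),
\qquad
\boldsymbol{\beta} =
\mathbf{H}^{\mathrm{T}}
\left(\frac{\mathbf{I}_{N\times N}}{C} + \mathbf{H}\mathbf{H}^{\mathrm{T}}\right)^{-1}
\mathbf{T}
\;\; (N < L)
```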
Discriminative Extreme Learning Machine
Conventional ELM assumes that the hidden layer output h(x_i) (i = 1, 2, …, N) can be exactly transformed into the strict label matrix as in (2). However, this assumption may be too rigid. We relax the strict label matrix into a slack label matrix by introducing a nonnegative relaxation matrix M, which not only provides more freedom for β but also enlarges the distances between different classes as much as possible.
In implementation, we push the +1/−1 label outputs far away along two opposite directions. Specifically, with a positive slack variable ε_i, we hope the output will become 1 + ε_i for a sample grouped into “1” and −1 − ε_i for a sample grouped into “−1.” In this way, the distance between data points from different classes is enlarged. By introducing an ε-dragging term into the optimization function, the distances between different classes are expected to be enlarged. Table 1 illustrates an example of this case.
Before ε-dragging, the maximal distance between the first and third hidden layer outputs (randomly projected features) is \( \sqrt{{\left(1-\left(-1\right)\right)}^2+{\left(-1-1\right)}^2+{\left(-1-\left(-1\right)\right)}^2}=2\sqrt{2} \). After ε-dragging, the distance becomes
It can be seen that the distance between the first and third randomly projected features becomes larger after ε-dragging. This shows that the nonnegative label relaxation matrix allows the margins between different classes to be enlarged. Concretely, we introduce an auxiliary matrix B defined as follows: if T_ij = 1, then B_ij = +1, indicating the positive dragging direction; if T_ij = −1, then B_ij = −1, indicating the negative dragging direction. We record the nonnegative learnable dragging values ε in a matrix M ∈ ℜ^{N×c} and obtain the relaxed label matrix T° = T + B ⊙ M, where ⊙ denotes the Hadamard product of matrices. By substituting T° into (2), we obtain the following discriminative ELM (DELM) model
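The DELM objective referenced as (7) did not survive extraction; a reconstruction consistent with the ridge-regression form (2) and the relaxed labels T + B ⊙ M is:

```latex
\min_{\boldsymbol{\beta},\,\mathbf{M}} \;
\frac{1}{2}\left\Vert \boldsymbol{\beta} \right\Vert^{2}
+ C \cdot \frac{1}{2}\left\Vert \mathbf{H}\boldsymbol{\beta}
- \left(\mathbf{T} + \mathbf{B} \odot \mathbf{M}\right)\right\Vert^{2}
\quad \text{s.t.} \quad \mathbf{M} \ge \mathbf{0}
\tag{7}
```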
Compared with (2), an ε-dragging-related term B ⊙ M is integrated into (7) to enlarge the distances between different classes in label space. When solving (7), we iteratively update each variable while fixing the other. An iterative optimization method for problem (7) is presented as follows. Given M, problem (7) becomes
Let Q = T + B ⊙ M, and denote the objective function as \( \ell \left(\boldsymbol{\beta} \right)=\frac{1}{2}{\left\Vert \boldsymbol{\beta} \right\Vert}^2+ C\cdot \frac{1}{2}{\left\Vert \mathbf{H}\boldsymbol{\beta } -\mathbf{Q}\right\Vert}^2 \). The solution can be achieved by setting\( \frac{\partial \ell \left(\boldsymbol{\beta} \right)}{\partial \boldsymbol{\beta}}=\mathbf{0} \), then
Given β, problem (7) becomes
Letting R = Hβ − T, we have
Because the squared Frobenius norm of a matrix can be decoupled element by element, (11) can be decoupled equivalently into N × c subproblems. For the element in the ith row and jth column of M, we have
where R_ij and B_ij are the (i, j)th elements of R and B, respectively. Since B_ij² = 1, we have (R_ij − B_ij M_ij)² = (B_ij R_ij − M_ij)²; thus, we get
Based on the nonnegative constraint on M_ij, M is finally obtained as
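The closed-form M-update is missing from the extraction; putting the decoupled subproblems together with the nonnegative constraint, the elementwise solution is:

```latex
\mathbf{M} = \max\!\left(\mathbf{B} \odot \mathbf{R},\, \mathbf{0}\right)
= \max\!\left(\mathbf{B} \odot \left(\mathbf{H}\boldsymbol{\beta} - \mathbf{T}\right),\, \mathbf{0}\right)
```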
The complete algorithm for solving the optimization problem (7) is described in Algorithm 1.
Once the optimal β and M are obtained, we have Q = T + B ⊙ M, and the predicted output of a new test sample z can be computed as
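The prediction rule was lost in extraction; consistent with the network output used throughout this section, the output for a test sample z and its predicted class are:

```latex
\mathbf{y}(\mathbf{z}) = \mathbf{h}(\mathbf{z})\,\boldsymbol{\beta},
\qquad
\operatorname{label}(\mathbf{z}) = \arg\max_{j \in \{1,\dots,c\}} y_j(\mathbf{z})
```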
Neuron Pruning-Inspired Discriminative Extreme Learning Machine
In classic ELM, the number of hidden neurons is usually determined with human intervention in a trial-and-error way, which is tedious. Besides, an overlarge network brings longer prediction time, unnecessary memory requirements, and higher hardware cost. An alternative is to train a network larger than necessary and then prune the unnecessary neurons. In this section, we devise a novel neuron pruning-inspired discriminative ELM based on the structured sparse model, which has been widely studied in pattern recognition and machine learning [26, 27, 42].
Model Formulation
Suppose that we are given N training samples {(x_i, t_i)}, i = 1, 2, …, N, belonging to c (≥2) classes. Here, x_i ∈ ℜ^m is a data point and t_i ∈ ℜ^c is its label vector. The ELM feature mapping h(x_i) ∈ ℜ^L maps the sample x_i from the m-dimensional input space to the L-dimensional hidden-layer feature space, called the ELM feature space. Suppose that ELM can approximate the data labels. The relation between the estimated outputs and the actual outputs is
where β_j is the jth row of the output weight matrix β. The previous N equations can be written compactly as
where
As shown in (16), each neuron has its own response to a sample, and the responses of all neurons produce the final label prediction t_i through the output weight matrix β. If some rows of β, i.e., β_j (j = 1, 2, …, L), equal zero, the corresponding neuron responses make no contribution to the estimated output. These irrelevant hidden neurons can therefore be removed; i.e., neurons can be pruned to get rid of useless ones. In this way, we endow the output weights β with the function of neuron selection by considering the relevance of hidden neurons to the class labels. Specifically, a new vector \( \overline{\boldsymbol{\beta}} \), which collects the L2 norms of the row vectors of β, is constructed as
where \( {\left\Vert {\boldsymbol{\beta}}_j\right\Vert}_2=\sqrt{\sum_{k=1}^c{\beta}_{j k}^2} \), j = 1, 2, …, L. Constraining β to have d nonzero rows is equivalent to requiring the number of nonzero entries in \( \overline{\boldsymbol{\beta}} \) to be d.
Nevertheless, solving a problem with an L0 norm constraint is NP-hard. Alternatively, we approximate the L0 norm with the L1 norm and adopt the following L2,1 norm of the matrix β
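The norm definition referenced as (20) is missing from the extraction; the standard L2,1 norm of β ∈ ℜ^{L×c} is the sum of the L2 norms of its rows:

```latex
\left\Vert \boldsymbol{\beta} \right\Vert_{2,1}
= \sum_{j=1}^{L} \left\Vert \boldsymbol{\beta}_j \right\Vert_2
= \sum_{j=1}^{L} \sqrt{\sum_{k=1}^{c} \beta_{jk}^{2}}
\tag{20}
```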
Therefore, our pruning discriminative ELM (P-DELM) model is formulated as follows:
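The P-DELM objective itself did not survive extraction; a hedged reconstruction, obtained by replacing the Frobenius regularizer of DELM with the L2,1 norm (the relative weighting of the two terms is our assumption), is:

```latex
\min_{\boldsymbol{\beta},\,\mathbf{M}} \;
\left\Vert \boldsymbol{\beta} \right\Vert_{2,1}
+ C \cdot \frac{1}{2}\left\Vert \mathbf{H}\boldsymbol{\beta}
- \left(\mathbf{T} + \mathbf{B} \odot \mathbf{M}\right)\right\Vert^{2}
\quad \text{s.t.} \quad \mathbf{M} \ge \mathbf{0}
```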
The errors between target output and actual output are taken into consideration, which provides more freedom for β.
Optimization for Pruning Discriminative Extreme Learning Machine
Our P-DELM model can be optimized in a similar way to DELM. Given M, let Q = T + B ⊙ M. The optimization problem becomes
Obviously, the objective function is differentiable with respect to β [28]. First, we consider the derivative of ‖β‖2,1 w.r.t. β. According to the definition of the L2,1-norm in (20), the derivative of ‖β‖2,1 with respect to the entry β_jk can be calculated as
Then, we get the derivative of ‖β‖2,1 w.r.t. β as
where Σ is a diagonal matrix in ℜ^{L×L} whose jth diagonal component is computed as
This shows that ‖β‖2,1 can be written as \( {\left\Vert \boldsymbol{\beta} \right\Vert}_{2,1}=\frac{1}{2}\mathrm{tr}\left({\boldsymbol{\beta}}^{\mathrm{T}}\mathbf{\sum}\boldsymbol{\beta } \right) \), where Σ is defined in (25). By setting the derivative of the objective function w.r.t. β to 0, we have
We further obtain the expression of β as follows:
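The β-update expression did not survive extraction; one consistent reconstruction (the constant inside Σ_jj depends on how the trace identity is normalized, so the exact scaling here is our assumption) is:

```latex
\boldsymbol{\beta} =
\left(\frac{\boldsymbol{\Sigma}}{C} + \mathbf{H}^{\mathrm{T}}\mathbf{H}\right)^{-1}
\mathbf{H}^{\mathrm{T}}\mathbf{Q},
\qquad
\Sigma_{jj} = \frac{1}{\left\Vert \boldsymbol{\beta}_j \right\Vert_2}
```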
Note that Σ depends on β; it can be iteratively determined using the β from the previous optimization step. Then, we fix β and solve the following problem to update M.
Letting R = Hβ − T, we have
Similarly, M is obtained as
Figure 2 shows the difference between the output weight matrices obtained by the original ELM and by our method. Figure 2a shows the normalized L2-norms of the rows of β, i.e., \( \overline{\boldsymbol{\beta}} \) defined in (18), obtained from the original ELM. Most of its entries are nonzero under the Frobenius norm constraint, which can only enforce β to be small. In contrast, the L2,1-norm in our model yields a row-sparse β, as illustrated in Fig. 2b, and distinguishes the importance of different hidden neurons. In our P-DELM method, a few hidden neurons can undertake the task of information transmission from the input space to the ELM feature space.
After the optimal β is obtained, d neurons can be selected from the L original neurons. We next present a pruning method to obtain a suitable number of valuable hidden neurons. We first normalize \( \overline{\boldsymbol{\beta}} \) by dividing it by the sum of all its entries and then sort the entries in descending order. The sorted vector is denoted as \( \overline{\overline{\boldsymbol{\beta}}}=\left[{\overline{\overline{\beta}}}_1,{\overline{\overline{\beta}}}_2,\dots, {\overline{\overline{\beta}}}_L\right] \). Then, we select the hidden neurons by a threshold η. The ratio of the sum of the first d entries to the sum of all entries in \( \overline{\overline{\boldsymbol{\beta}}} \) is formulated as
Obviously, the denominator of (31) is 1, and \( {\eta}_d={\sum}_{q=1}^d{\overline{\overline{\beta}}}_q \). Given a threshold η in (0, 1), the number d of valuable hidden neurons is obtained as
Once the d selected hidden neurons are determined, the corresponding hidden layer output matrix \( \tilde{\mathbf{H}} \) is utilized to update the output weight matrix as
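The threshold-based selection and the final refit can be sketched as follows. `select_neurons` and `refit_output_weights` are our own illustrative names, this is a sketch rather than the authors' code, and the clamp of d to L is our safeguard against floating-point round-off:

```python
import numpy as np

def select_neurons(beta, eta=0.9999):
    """Pick the d most important hidden neurons from a row-sparse beta.

    beta : (L, c) output weight matrix learned by P-DELM
    eta  : threshold in (0, 1) on the cumulative normalized row norms
    Returns the sorted indices of the retained neurons.
    """
    row_norms = np.linalg.norm(beta, axis=1)      # L2 norm of each row of beta
    order = np.argsort(row_norms)[::-1]           # neuron indices, descending importance
    normed = row_norms[order] / row_norms.sum()   # normalize so the entries sum to 1
    cum = np.cumsum(normed)                       # eta_d: sum of the first d entries
    # smallest d with eta_d >= eta, clamped to L for numerical safety
    d = min(int(np.searchsorted(cum, eta)) + 1, beta.shape[0])
    return np.sort(order[:d])

def refit_output_weights(H, Q, keep, C=1.0):
    """Recompute the output weights using only the retained hidden neurons."""
    Hs = H[:, keep]                               # pruned hidden layer output matrix
    I = np.eye(len(keep))
    return np.linalg.solve(I / C + Hs.T @ Hs, Hs.T @ Q)
```

For example, a β whose rows are all tiny except two dominant ones keeps exactly those two neurons at a moderate threshold.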
The structure of the proposed P-DELM is shown in Fig. 3. We first adopt P-DELM to identify the valuable hidden neurons and then update the output weight matrix using the remaining hidden layer output matrix as in (33). The complete optimization algorithm for P-DELM is described in Algorithm 2.
Experiments
Experimental Results for Discriminative Extreme Learning Machine
Face recognition (FR) is one of the classical problems in computer vision [29]. Facial images have large within-class scatter and small between-class scatter, which poses great difficulties for FR. In this section, four popular face databases, i.e., the ORL [30], Extended Yale B [31], CMU PIE [32], and AR [33] databases, are employed to evaluate the performance of different methods. The ORL face database contains 400 images from 40 subjects; each subject has ten images acquired at different times. The face images on the ORL database are 32 × 32 pixels. The Extended Yale B database consists of 2414 frontal facial images of 38 individuals; each individual has about 64 images, taken under various laboratory-controlled lighting conditions. In our experiments, each image is manually cropped and resized to 32 × 32 pixels. The AR face database consists of more than 4000 color images of 126 subjects. The CMU PIE database contains over 40,000 facial images of 68 individuals; images of each individual were acquired across 13 different poses, under 43 different illumination conditions, and with 4 different expressions. Figure 4a–d shows several example images of one subject from the ORL, Extended Yale B, CMU PIE, and AR databases, respectively.
We compare our algorithm with least squares regression (LSR) [25], discriminative least squares regression (DLSR) [25], SVM [34], LSSVM [35], and the classic ELM, all on the eigenface feature [36]. For SVM and LSSVM, the Libsvm-3.12 and LSSVM-1.7 toolboxes were used, respectively. For LSR and DLSR, the Matlab codes provided by the authors [25] are used: the optimal linear transformation is first learned, and the facial images under this transformation are classified by a 1-NN classifier.
For the ORL face database, we randomly select l (= 4, 5, 6) images per subject for training and the remainder for testing. For each given l, we independently run all the methods 10 times and report the average results. Two dimensions of the eigenface feature, i.e., 50 and 100, are tested. Table 2 lists the recognition results of the different approaches.
For the Extended Yale B database, we randomly select l (= 10, 20, 30) images per subject for training and the remainder for testing. For each given l, we independently run all the methods 10 times and report the average recognition results. Two dimensions of the eigenface feature, i.e., 50 and 100, are tested. Table 3 shows the recognition results of the different methods.
For the CMU PIE database, we use a near frontal pose subject, namely C07, for experiments, which contains 1629 images of 68 individuals. Each individual has about 24 images. A random subset with l (= 8, 10, 12) images for each individual is selected for training and the rest for testing. For each given l, we independently perform all the methods 10 times and report the average recognition rates. Two dimensions of eigenface feature, i.e., 50 and 100, are tested. Table 4 lists the recognition accuracy together with the standard deviation obtained by different methods.
For the AR face database, a subset that contains 50 male subjects and 50 female subjects is chosen in our experiments. For each subject, seven images from session 1 are used for training, with other seven images from session 2 for testing. The size of image is 60 × 43. The recognition results of different methods are given in Table 5.
From Tables 2, 3, 4, and 5, one can conclude that the proposed DELM method can achieve promising performance. Moreover, with the increase of training samples per class and dimension of eigenface feature, all the methods tend to achieve higher recognition accuracy. DELM outperforms all the compared methods on most of the dimensions under different training sets. In comparison with DLSR, the proposed DELM performs better as a whole, which reveals the effect of executing an explicit mapping from the input space to a higher-dimensional ELM feature space. The proposed DELM also outperforms classical ELM. The gain mainly benefits from the enlarged margin between different classes by introducing a nonnegative label relaxation matrix.
Parameter Analysis for Discriminative Extreme Learning Machine
Similar to ELM, the proposed DELM algorithm has two key parameters, namely the number of hidden neurons L and the penalty constant C in (7). We further conduct experiments to investigate the effect of these parameters on the final recognition accuracy. Eleven different values of C (0.001, 0.01, 1, 100, 200, 500, 1000, 2000, 3000, 4000, and 5000) and seven different values of L (100, 500, 1000, 2000, 3000, 4000, and 5000) have been tried, resulting in 77 different pairs in total. The experiments on the previously mentioned databases for parameter analysis are performed, respectively.
Figure 5 shows the relationship between the recognition rate and the parameter pair (L, C). From Fig. 5, one can see that the recognition accuracy tends to increase as L and C increase. DELM is not especially sensitive to changes of (L, C) over a large range, and it performs stably when L and C are assigned relatively large values.
Experimental Results for Pruning Discriminative Extreme Learning Machine
The Selection of Parameter η
In this section, we study the characteristics of the key parameter η in P-DELM; the experiments are carried out on the Diabetes dataset from the University of California at Irvine (UCI) Machine Learning Repository [37]. With 50 initialized hidden neurons, we record the number of selected hidden neurons and the corresponding testing accuracy of P-DELM. The results, averaged over 100 repeated experiments, are shown in Fig. 6. We observe that the number of selected hidden neurons and the testing accuracy rise with the increase of the threshold η. When η approaches 1, both tend to reach a plateau. Notably, when η = 0.9999, only about (14.56/50) × 100% = 29.12% of the original hidden neurons are selected. These observations indicate that the L2,1-norm regularization yields a quite sparse \( \overline{\boldsymbol{\beta}} \), which distinguishes the importance of different hidden neurons in information transmission. We therefore empirically set η = 0.9999 in the method, which guarantees good performance, as the following experimental results demonstrate.
The Performance of Pruning Discriminative Extreme Learning Machine for Pattern Classification
In this section, the performance of P-DELM is evaluated on public benchmark classification datasets from the UCI Machine Learning Repository [37], in comparison with the original ELM, DELM, and OP-ELM. The information and characteristics of the datasets are summarized in Table 6. L2,1-DELM denotes an L2,1-norm regularized DELM model without the neuron pruning process.
The results are averaged over 100 repeated experiments. We report the mean number of hidden neurons used, the testing accuracy/STD, and the CPU time for training and testing in Table 7. From the results, one can conclude that DELM, L2,1-DELM, and P-DELM always achieve higher testing accuracy than ELM. Meanwhile, P-DELM is faster in testing, with comparable or even better performance using fewer hidden neurons. In general, OP-ELM does not perform as well as P-DELM, showing lower testing accuracy with more hidden neurons.
The Performance of Pruning Discriminative Extreme Learning Machine for Image Classification
We further conduct experiments to validate the performance of DELM, L2,1-DELM, and P-DELM on image datasets, including COIL20 and Caltech256. The COIL20 database has 20 objects, and each object has 72 images obtained by rotating the object through 360° in 5° steps (1440 images in total) [38]. The size of each image on COIL20 is 32 × 32 pixels. A subset of the Caltech256 database [39, 40], which has 20 classes with 100 samples per category, is used in our experiment. We directly use the grayscale images as features on the COIL20 database, while 2048-dimensional PiCoDes [41] features are adopted to represent the images in the Caltech256 database. We randomly select half of the images per class for training and the rest for testing. The experimental results, reported in Table 8, are promising. We also note that our methods consume more time for training; reducing the computational complexity is left for future work.
Conclusions
In this paper, we have presented a framework of DELM for pattern classification. DELM aims to enlarge the distances between different classes as much as possible by learning a nonnegative label relaxation matrix. The performance of DELM is compared with several state-of-the-art methods on public face databases under different experimental settings. The results demonstrate the effectiveness of DELM for FR when there are posture, facial expression, and illumination variations. In addition, we develop a novel method for the problem of architectural design of ELM network by introducing L2,1-norm regularization into the DELM model. The obtained P-DELM model can distinguish the importance of different hidden neurons. Worthless neurons are then pruned for a more compact network. Experimental results show that P-DELM can achieve promising performance for pattern classification with fewer hidden neurons and less prediction time.
References
Taylor JG. Cognitive computation. Cogn Comput. 2009;1(1):4–16.
Clark A. Mindware: an introduction to the philosophy of cognitive science. New York: Oxford University Press; 2001.
Luo B, Hussain A, Mahmud M, et al. Advances in brain-inspired cognitive systems. Cogn Comput. 2016;8(5):795–6.
Zhang HY, Ji P, Wang JQ, et al. A neutrosophic normal cloud and its application in decision-making. Cogn Comput. 2016;8(4):1–21.
Gepperth A, Karaoguz C. A bio-inspired incremental learning architecture for applied perceptual problems. Cogn Comput. 2016;8(5):924–34.
Rumelhart DE, Hinton GE, Williams RJ. Learning representation by backpropagating errors. Nature. 1986;323(6088):533–6.
Zhang L, Zhang D, Tian F. SVM and ELM: who wins? Object recognition with deep convolutional features from ImageNet. Comput Sci. 2015.
Huang G-B, Zhou H, Ding X, Zhang R. Extreme learning machine for regression and multiclass classification. IEEE Trans Syst Man CybernB Cybern. 2012;42(2):513–29.
Huang GB. What are extreme learning machines? Filling the gap between Frank Rosenblatt’s dream and John von Neumann’s puzzle. Cogn Comput. 2015;7(3):263–78.
Huang GB, Chen L, Siew CK. Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Trans Neural Netw. 2006;17(4):879–92.
Huang GB, Li MB, Chen L, Siew CK. Incremental extreme learning machine with fully complex hidden nodes. Neurocomputing. 2008;71(4–6):576–83.
Huang GB. An insight into extreme learning machines: random neurons, random features and kernels. Cogn Comput. 2014;6(3):376–90.
Fusi S, Miller EK, Rigotti M. Why neurons mix: high dimensionality for higher cognition. Curr Opin Neurobiol. 2016;37:66.
Huang G, Song S, Gupta JND, et al. Semi-supervised and unsupervised extreme learning machines. IEEE Trans Cybern. 2014;44(12):2405–17.
Tang J, Deng C, Huang GB. Extreme learning machine for multilayer perceptron. IEEE Trans Neural Netw Learn Syst. 2016;27(4):809–21.
Lekamalage CL, Yang Y, Huang GB, et al. Dimension reduction with extreme learning machine. IEEE Trans Image Process. 2016;25(8):3906–18.
Liu H, Sun F, Yu Y. Multitask extreme learning machine for visual tracking. Cogn Comput. 2014;6(3):391–404.
Liu H, Qin J, Sun F, et al. Extreme kernel sparse learning for tactile object recognition. IEEE Trans Cybern. 2016.
Zhang L, Zhang D. Domain adaptation extreme learning machines for drift compensation in E-nose systems. IEEE Trans Instrum Meas. 2015;64:1790–801.
Zhang L, Zhang D. Robust visual knowledge transfer via extreme learning machine based domain adaptation. IEEE Trans Image Process. 2016;25(10):4959–73.
Rong HJ, Ong YS, Tan AH, et al. A fast pruned-extreme learning machine for classification problem. Neurocomputing. 2008;72(1–3):359–66.
Lan Y, Soh YC, Huang GB. Constructive hidden nodes selection of extreme learning machine for regression. Neurocomputing. 2010;73(16–18):3191–9.
Miche Y, Sorjamaa A, Bas P, et al. OP-ELM: optimally pruned extreme learning machine. IEEE Trans Neural Netw. 2010;21(1):158–62.
He B, Sun T, Yan T, et al. A pruning ensemble model of extreme learning machine with L1/2 regularizer. Multidimension Syst Signal Process. 2016:1–19.
Xiang S, Nie F, Meng G, et al. Discriminative least squares regression for multiclass classification and feature selection. IEEE Trans Neural Netw Learn Syst. 2012;23(11):1738–54.
Liu H. Robust exemplar extraction using structured sparse coding. IEEE Trans Neural Netw Learn Syst. 2015;26(8):1816–21.
Liu H, Guo D, Sun F. Object recognition using tactile measurements: kernel sparse coding methods. IEEE Trans Instrum Meas. 2016;65(3):656–65.
Nie F, Huang H, Cai X, et al. Efficient and robust feature selection via joint ℓ2,1-norms minimization. Adv Neural Info Process Syst. 2010:1813–21.
Zhao W, Chellappa R, Phillips J, Rosenfeld A. Face recognition: a literature survey. ACM Comput Surv. 2003;35(4):399–458.
Samaria FS, Harter AC. Parameterisation of a stochastic model for human face identification. In: Proceedings of the Second IEEE Workshop on Applications of Computer Vision; 1994. p. 138–42.
Rizon M, Hashim MF, Saad P, et al. Face recognition using eigenfaces and neural networks. Am J Appl Sci. 2006;3(6):586–91.
Georghiades AS, Belhumeur PN, Kriegman DJ. From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE Trans Pattern Anal Mach Intell. 2001;23(6):643–60.
Martinez AM, Benavente R. The AR face database. CVC Technical Report 24; 1998.
Vapnik V. Statistical learning theory. New York: Wiley; 1998.
Suykens JAK, Vandewalle J. Least squares support vector machine classifiers. Neural Process Lett. 1999;9(3):293–300.
Turk M, Pentland A. Face recognition using eigenfaces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 1991. p. 586–91.
Blake C, Merz C. UCI repository of machine learning databases [online]. Irvine: Department of Information and Computer Sciences, University of California; 1998. http://www.ics.uci.edu/mlearn/MLRepository.html.
Nene SA, Nayar SK, Murase H. Columbia object image library (COIL-20). Technical Report CUCS-005-96; 1996.
Yu J, Tao D, Wang M. Adaptive hypergraph learning and its application in image classification. IEEE Trans Image Process. 2012;21(7):3262–72.
Griffin G, Holub A, Perona P. Caltech-256 object category dataset. Technical Report, California Institute of Technology; 2007.
Bergamo A, Torresani L, Fitzgibbon A. PICODES: learning a compact code for novel-category recognition. Adv Neural Info Process Syst. 2011:2088–2096.
Liu H, Yu Y, Sun F, et al. Visual-tactile fusion for object recognition. IEEE Trans Autom Sci Eng. 2016:1–13.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 61571069, 61401048), Chongqing University Postgraduates’ Innovation Project (No. CYB15030), and in part by the Fundamental Research Funds for the Central Universities.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Ethical Approval
This article does not contain any studies with animals performed by any of the authors.
Informed Consent
Informed consent was obtained from all individual participants included in the study.
Cite this article
Guo, T., Zhang, L. & Tan, X. Neuron Pruning-Based Discriminative Extreme Learning Machine for Pattern Classification. Cogn Comput 9, 581–595 (2017). https://doi.org/10.1007/s12559-017-9474-4