Introduction

Cognitive computation has been emerging as a discipline involving neurobiology, cognitive psychology, and artificial intelligence [1, 2]. A cognitive system is, broadly speaking, one that seeks to mimic or better understand the way humans process complex situations. Extensive efforts have been devoted to the study of cognitive-inspired techniques and systems in the past few decades [3,4,5]. As a type of cognitive-inspired computation technique, feedforward neural networks (FNNs) have been widely investigated and applied since the introduction of the well-known back-propagation (BP) algorithm [6]. However, these gradient descent-based methods may face problems with local minima, learning rate selection, stopping criteria, and learning epochs [7].

Recently, the extreme learning machine (ELM) has been proposed for training single hidden layer feedforward neural networks (SLFNs). Unlike traditional learning algorithms, e.g., BP-based neural networks (NNs) or the support vector machine (SVM), the parameters of the hidden layer in ELM are randomly generated without tuning, so the hidden nodes can be established independently of the training data [8, 9]. Huang et al. [10, 11] have theoretically proved that SLFNs with randomly generated hidden neurons and output weights calculated by regularized least squares maintain their universal approximation capability. Concrete biological evidence for ELM has also been reported [12, 13]. With this learning theory, ELM tends to achieve faster training and better generalization performance than NNs and SVM. ELMs have been extensively studied and demonstrated to have excellent learning accuracy and speed in a variety of applications, such as semisupervised and unsupervised learning [14], multilayer perceptrons [15], dimensionality reduction [16], visual tracking [17], tactile object recognition [18], and transfer learning [19, 20].

In the architectural design of an ELM network, a key problem is to determine a suitable number of hidden neurons. Too few or too many hidden neurons would lead to underfitting or overfitting, respectively [21]. The suitable number of hidden neurons is usually determined with human intervention in a trial-and-error way. There are mainly two heuristic techniques for this problem, i.e., constructive methods (or growing methods) and destructive methods (or pruning methods). Huang et al. [10] presented an incremental ELM (I-ELM), where the hidden nodes are added incrementally and the output weights are determined analytically. Lan et al. [22] proposed a constructive hidden node selection method for ELM (CS-ELM), which selects the optimal number of hidden nodes when the unbiased risk estimation-based criterion \( C_p \) reaches its minimum value. Both I-ELM and CS-ELM are constructive methods. There are also pruning methods. Miche et al. [23] proposed an optimally pruned ELM (OP-ELM), which first ranks the hidden neurons using the multi-response sparse regression (MRSR) algorithm and then selects the hidden neurons through leave-one-out (LOO) validation. Recently, a pruning ensemble model of ELM with L1/2 regularizer (PE-ELMR) was proposed in [24]. PE-ELMR incorporates the L1/2 regularizer into the preliminary ELM, and the hidden-layer neurons are pruned with the ensemble model.

From the viewpoint of geometry, it is expected that the distances between data points in different classes are as large as possible after they are transformed [25]. However, the traditional ELM assumes that the hidden layer output can be exactly transformed into a strict label matrix and does not consider such a geometrical criterion. These observations motivate us to introduce the geometrical criterion into the classical ELM to fully exploit the discriminant information in the data.

To this end, one feasible way is to enlarge the distances between the regression labels of different classes. Figure 1 shows the idea of our method. For a two-class classification problem, the regression labels in the original ELM are coded as [+1, −1] and [−1, +1]. They are denoted as red points in the figure. The maximal distance between them is fixed, with little freedom. After being dragged to the blue ones with a proper dragging direction and value, the distance between them is enlarged (d2 > d1). With this strategy, ELM is expected to have better generalization ability by fully exploiting the discriminative information in the data. In detail, a technique called ε-dragging is employed to learn a nonnegative label relaxation matrix, which encourages the regression labels of different classes to move in opposite directions. A slack label matrix is embedded into the ELM framework so that the distances between different classes can be enlarged. Besides, the proposed discriminative extreme learning machine (DELM) method is further extended to the architectural design of the ELM network in a destructive manner. We introduce a structured norm regularization, namely the L2,1-norm, into the DELM model to learn a row-sparse output weight matrix. The method, termed pruning DELM (P-DELM), can distinguish the importance of different hidden neurons in information transmission. As a result, worthless neurons can be adaptively removed for a more compact network. It is worth noting that there are major differences between our work and the work in [17], though both adopt L2,1-norm regularization for the purpose of neuron pruning. Firstly, we focus on pattern classification with a single-task ELM in a supervised way, whereas the model in [17] targets visual tracking with a multitask ELM in a semisupervised manner. Secondly, we first design a novel DELM model with label relaxation and then introduce L2,1-norm regularization into the developed DELM; the obtained P-DELM network tends to be more compact and to have better generalization ability. Thirdly, the model in [17] first ranks the neurons in the hidden layer and then selects a fixed number of neurons, whereas we develop an adaptive neuron selection method based on the obtained row-sparse output weight matrix. Moreover, the working principle of L2,1-norm regularization is analyzed in detail in this paper. The proposed P-DELM might be introduced into the model in [17] for visual tracking.

Fig. 1

A simplified illustration of the proposed DELM. The regression labels in classic ELM are denoted as red points. After being dragged to the blue ones, the distance between the samples' regression labels is enlarged. With this strategy, DELM is expected to fully exploit the discriminative information in the data

Several characteristics of the proposed DELM and P-DELM are as follows:

  1.

    DELM inherits the merits of ELM, including the feature mapping with randomly generated input weights and biases, and good generalization.

  2.

    The Hadamard product of matrices is introduced into DELM to perform ε-dragging. A slack variable matrix is constructed, and thus the margins between different classes can be enlarged. The resulting optimization problem can be solved iteratively by variable decoupling. Both theoretical analysis and experimental evaluations show the effectiveness of DELM.

  3.

    DELM is extended to P-DELM based on L2,1-norm regularization. Worthless hidden neurons can be removed according to the obtained row-sparse output weight matrix, yielding a more compact network.

The remainder of this paper is organized as follows. “Extreme Learning Machine and Discriminative Extreme Learning Machine” section reviews related work on ELM and presents our discriminative ELM (DELM) model. Our pruning DELM (P-DELM) model is introduced in “Neuron Pruning-Inspired Discriminative Extreme Learning Machine” section. “Experiments” section reports the experimental results. Conclusions are drawn in “Conclusions” section.

Extreme Learning Machine and Discriminative Extreme Learning Machine

Extreme Learning Machine

ELMs are a type of FNN characterized by a random initialization of the hidden layer weights and a fast training algorithm for the output weights. The ELM optimization problem is

$$ \begin{array}{c}{\min}_{\boldsymbol{\beta} \in {\Re}^{L\times c}}\ \frac{1}{2}{\left\Vert \boldsymbol{\beta} \right\Vert}^2+C\cdot \frac{1}{2}{\sum}_{i=1}^N{\left\Vert {\boldsymbol{\xi}}_i\right\Vert}^2\\ \mathrm{s}.\mathrm{t}.\kern0.5em \mathbf{h}\left({\boldsymbol{x}}_i\right)\boldsymbol{\beta} ={\boldsymbol{t}}_i-{\boldsymbol{\xi}}_i,\kern0.5em i=1,2,\dots, N\kern0.5em \iff \kern0.5em \mathbf{H}\boldsymbol{\beta} =\mathbf{T}-\boldsymbol{\xi} \end{array} $$
(1)

where \( \boldsymbol{\beta} \in {\Re}^{L\times c} \) denotes the output weights between the hidden layer and the output layer, \( \boldsymbol{\xi} ={\left[{\boldsymbol{\xi}}_1,{\boldsymbol{\xi}}_2,\dots, {\boldsymbol{\xi}}_N\right]}^{\mathrm{T}}\in {\Re}^{N\times c} \) is the prediction error matrix with respect to the training data, C is a penalty constant on the training errors, and \( \mathbf{H}\in {\Re}^{N\times L} \) is the hidden layer output matrix, computed as

$$ \mathbf{H}=\left[\begin{array}{c}\mathbf{h}\left({\boldsymbol{x}}_1\right)\\ \vdots \\ \mathbf{h}\left({\boldsymbol{x}}_N\right)\end{array}\right]=\left[\begin{array}{ccc}{h}_1\left({\boldsymbol{x}}_1\right)& \cdots & {h}_L\left({\boldsymbol{x}}_1\right)\\ \vdots & \ddots & \vdots \\ {h}_1\left({\boldsymbol{x}}_N\right)& \cdots & {h}_L\left({\boldsymbol{x}}_N\right)\end{array}\right] $$

where \( \mathbf{X}=\left[{\boldsymbol{x}}_1,{\boldsymbol{x}}_2,\dots, {\boldsymbol{x}}_N\right]\in {\Re}^{d\times N} \) is the training dataset falling into c categories with label matrix \( \mathbf{T}={\left[{\boldsymbol{t}}_1,{\boldsymbol{t}}_2,\dots, {\boldsymbol{t}}_N\right]}^{\mathrm{T}}\in {\Re}^{N\times c} \) and \( {\boldsymbol{t}}_i=\left[-1,\dots, +1,\dots, -1\right]\in {\Re}^c \). Note that only the jth entry of \( {\boldsymbol{t}}_i \) is +1, which indicates that sample \( {\boldsymbol{x}}_i \) comes from the jth class. By substituting the constraints of (1) into its objective function, we get the following equivalent unconstrained optimization problem

$$ { \min}_{\boldsymbol{\beta} \in {\Re}^{L\times c}}\frac{1}{2}{\left\Vert \boldsymbol{\beta} \right\Vert}^2+ C\cdot \frac{1}{2}{\left\Vert \mathbf{H}\boldsymbol{\beta } -\mathbf{T}\right\Vert}^2 $$
(2)

This problem is widely known as ridge regression or regularized least squares. By setting the gradient of the objective function with respect to β to zero, we have

$$ {\boldsymbol{\beta}}^{\ast }+ C\cdot {\mathbf{H}}^{\mathrm{T}}\left(\mathbf{H}\boldsymbol{\beta } -\mathbf{T}\right)=\mathbf{0} $$
(3)

The closed-form solution \( {\boldsymbol{\beta}}^{\ast} \) can be obtained under two circumstances. If the number of training samples N is larger than L (N > L), the least squares problem is overdetermined, and the closed-form solution can be calculated as

$$ {\boldsymbol{\beta}}^{\ast }={\mathbf{H}}^{\dagger}\mathbf{T}={\left({\mathbf{H}}^{\mathrm{T}}\mathbf{H}+\frac{{\mathbf{I}}_{L\times L}}{C}\right)}^{-1}{\mathbf{H}}^{\mathrm{T}}\mathbf{T} $$
(4)

where \( {\mathbf{I}}_{L\times L} \) denotes the identity matrix of size L and \( {\mathbf{H}}^{\dagger} \) is the Moore-Penrose generalized inverse of H. If the number N of training patterns is smaller than L (N < L), an underdetermined least squares problem must be handled. One can restrict β to be a linear combination of the rows of H, i.e., \( \boldsymbol{\beta} ={\mathbf{H}}^{\mathrm{T}}\boldsymbol{\alpha} \) with \( \boldsymbol{\alpha} \in {\Re}^{N\times c} \). By substituting \( \boldsymbol{\beta} ={\mathbf{H}}^{\mathrm{T}}\boldsymbol{\alpha} \) into (3) and multiplying both sides by \( {\left(\mathbf{H}{\mathbf{H}}^{\mathrm{T}}\right)}^{-1}\mathbf{H} \), we have

$$ {\boldsymbol{\alpha}}^{\ast }- C\cdot \left(\mathbf{T}-{\mathbf{HH}}^{\mathrm{T}}{\boldsymbol{\alpha}}^{\ast}\right)=\mathbf{0} $$
(5)

Then, we get the solution as

$$ {\boldsymbol{\beta}}^{\ast }={\mathbf{H}}^{\mathrm{T}}\boldsymbol{\alpha} ={\mathbf{H}}^{\mathrm{T}}{\left({\mathbf{H}\mathbf{H}}^{\mathrm{T}}+\frac{{\mathbf{I}}_{N\times N}}{C}\right)}^{-1}\mathbf{T} $$
(6)
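To make the two closed-form cases above concrete, the following NumPy sketch (our own illustration; the sigmoid activation, function names, and default parameters are assumptions rather than details of any original implementation) generates the random feature mapping and applies (4) or (6) depending on whether N > L:

```python
import numpy as np

def elm_feature_map(X, W, b):
    """Hidden layer output h(X): sigmoid of random affine projections.
    X: (N, d) data, W: (d, L) random input weights, b: (L,) random biases."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def elm_train(X, T, L=1000, C=100.0, seed=0):
    """Basic ELM: random hidden parameters, ridge-regression output weights."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W = rng.standard_normal((d, L))
    b = rng.standard_normal(L)
    H = elm_feature_map(X, W, b)                                  # (N, L)
    if N > L:   # overdetermined case, Eq. (4)
        beta = np.linalg.solve(H.T @ H + np.eye(L) / C, H.T @ T)
    else:       # underdetermined case, Eq. (6)
        beta = H.T @ np.linalg.solve(H @ H.T + np.eye(N) / C, T)
    return W, b, beta

def elm_predict(X_test, W, b, beta):
    """Predicted class = index of the largest output, matching the +1/-1 target coding."""
    return np.argmax(elm_feature_map(X_test, W, b) @ beta, axis=1)
```

Any nonlinear piecewise-continuous activation could replace the sigmoid here; only the output weights are learned.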

Discriminative Extreme Learning Machine

Conventional ELM assumes that the hidden layer outputs \( \mathbf{h}\left({\boldsymbol{x}}_i\right)\ \left(i=1,2,\dots, N\right) \) can be exactly transformed into the strict label matrix as in (2). However, this assumption may be too rigid. We relax the strict label matrix into a slack label matrix by introducing a nonnegative relaxation matrix M, which not only provides more freedom for β but also enlarges the distances between different classes as much as possible.

In implementation, we push the +1/−1 label outputs far apart along two opposite directions. Specifically, with a positive slack variable \( {\varepsilon}_i \), we hope the output becomes \( 1+{\varepsilon}_i \) for a sample grouped into “1” and \( -1-{\varepsilon}_i \) for a sample grouped into “−1.” In this way, the distance between data points from different classes is enlarged. By introducing the ε-dragging term into the optimization function, the distances between different classes are expected to be enlarged. Table 1 illustrates an example of this case.

Table 1 A case for ε-dragging

Before ε-dragging, the maximal distance between the first and third hidden layer outputs (randomly projected features) is \( \sqrt{{\left(1-\left(-1\right)\right)}^2+{\left(-1-1\right)}^2+{\left(-1-\left(-1\right)\right)}^2}=2\sqrt{2} \), while after ε-dragging, the distance becomes

$$ \sqrt{{\left(\left(1+{\varepsilon}_{11}\right)-\left(-1-{\varepsilon}_{31}\right)\right)}^2+{\left(\left(-1-{\varepsilon}_{12}\right)-\left(1+{\varepsilon}_{32}\right)\right)}^2+{\left(\left(-1-{\varepsilon}_{13}\right)-\left(-1-{\varepsilon}_{33}\right)\right)}^2}\ge 2\sqrt{2} $$

It can be seen that the distance between the first and third randomly projected features becomes larger after ε-dragging. This shows that the nonnegative label relaxation matrix allows the margins between different classes to be enlarged. Concretely, we introduce an auxiliary matrix B defined as follows: if \( {\mathbf{T}}_{ij}=1 \), then \( {\mathbf{B}}_{ij}=+1 \), indicating the positive dragging direction; if \( {\mathbf{T}}_{ij}=-1 \), then \( {\mathbf{B}}_{ij}=-1 \), indicating the negative dragging direction. We record the nonnegative learnable dragging values ε in a matrix \( \mathbf{M}\in {\Re}^{N\times c} \) and obtain the relaxed label matrix T + B ⊙ M, where ⊙ is the Hadamard (element-wise) product of matrices. By substituting the relaxed label matrix into (2), we obtain the following discriminative ELM (DELM) model

$$ {\min}_{\boldsymbol{\beta} \in {\Re}^{L\times c},\ \mathbf{M}}\ \frac{1}{2}{\left\Vert \boldsymbol{\beta} \right\Vert}^2+C\cdot \frac{1}{2}{\left\Vert \mathbf{H}\boldsymbol{\beta} -\mathbf{T}-\mathbf{B}\odot \mathbf{M}\right\Vert}^2\kern1em \mathrm{s}.\mathrm{t}.\kern0.5em \mathbf{M}\ge \mathbf{0} $$
(7)
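Before turning to the optimization, the construction of B, M, and the relaxed targets can be sketched in a few lines of NumPy (the ε values below are arbitrary and purely illustrative; in DELM they are learned, and the three-class coding mirrors the case of Table 1):

```python
import numpy as np

# +1/-1 target coding for three samples from three classes.
T = np.array([[ 1., -1., -1.],
              [-1.,  1., -1.],
              [-1., -1.,  1.]])

B = np.where(T > 0, 1.0, -1.0)     # dragging directions: +1 for the true class, -1 otherwise
M = np.array([[0.5, 0.2, 0.0],     # nonnegative epsilon values; arbitrary here,
              [0.0, 0.0, 0.0],     # learned during training in DELM
              [0.3, 0.1, 0.0]])

T_relaxed = T + B * M              # the slack label matrix T + B ⊙ M used in Eq. (7)

# Dragging can only increase the distance between labels of different classes.
d_before = np.linalg.norm(T[0] - T[2])                  # 2*sqrt(2)
d_after  = np.linalg.norm(T_relaxed[0] - T_relaxed[2])
assert d_after >= d_before
```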

Compared with (2), an ε-dragging-related term B ⊙ M is integrated into (7) to enlarge the distances between different classes in the label space. When solving (7), we update each variable iteratively while fixing the other. The iterative optimization method for problem (7) is presented as follows. Given M, problem (7) becomes

$$ { \min}_{\boldsymbol{\beta} \in {\mathbf{\Re}}^{L\times c}}\frac{1}{2}{\left\Vert \boldsymbol{\beta} \right\Vert}^2+ C\cdot \frac{1}{2}{\left\Vert \mathbf{H}\boldsymbol{\beta } -\mathbf{T}-\mathbf{B}\odot \mathbf{M}\right\Vert}^2 $$
(8)

Let Q = T + B ⊙ M, and denote the objective function as \( \ell \left(\boldsymbol{\beta} \right)=\frac{1}{2}{\left\Vert \boldsymbol{\beta} \right\Vert}^2+ C\cdot \frac{1}{2}{\left\Vert \mathbf{H}\boldsymbol{\beta} -\mathbf{Q}\right\Vert}^2 \). The solution is obtained by setting \( \frac{\partial \ell \left(\boldsymbol{\beta} \right)}{\partial \boldsymbol{\beta}}=\mathbf{0} \), which gives

$$ {\boldsymbol{\beta}}^{\ast }={\left({\mathbf{H}}^{\mathrm{T}}\mathbf{H}+\frac{{\mathbf{I}}_{L\times L}}{C}\right)}^{-1}{\mathbf{H}}^{\mathrm{T}}\mathbf{Q} $$
(9)

Given β, problem (7) becomes

$$ {\min}_{\mathbf{M}}\ \frac{C}{2}{\left\Vert \mathbf{H}\boldsymbol{\beta} -\mathbf{T}-\mathbf{B}\odot \mathbf{M}\right\Vert}^2\kern1em \mathrm{s}.\mathrm{t}.\kern0.5em \mathbf{M}\ge \mathbf{0} $$
(10)

Letting R = Hβ − T, we have

$$ {\min}_{\mathbf{M}}\ \frac{C}{2}{\left\Vert \mathbf{R}-\mathbf{B}\odot \mathbf{M}\right\Vert}^2\kern1em \mathrm{s}.\mathrm{t}.\kern0.5em \mathbf{M}\ge \mathbf{0} $$
(11)

Since the squared Frobenius norm of a matrix decouples element by element, (11) can be split equivalently into N × c subproblems. For the element in the ith row and jth column of M, we have

$$ {\min}_{{\mathbf{M}}_{ij}}\ {\left({\mathbf{R}}_{ij}-{\mathbf{B}}_{ij}{\mathbf{M}}_{ij}\right)}^2\kern1em \mathrm{s}.\mathrm{t}.\kern0.5em {\mathbf{M}}_{ij}\ge 0 $$
(12)

where \( {\mathbf{R}}_{ij} \) and \( {\mathbf{B}}_{ij} \) are the elements in the ith row and jth column of R and B, respectively. Noting that \( {\mathbf{B}}_{ij}^2=1 \), we have \( {\left({\mathbf{R}}_{ij}-{\mathbf{B}}_{ij}{\mathbf{M}}_{ij}\right)}^2={\left({\mathbf{B}}_{ij}{\mathbf{R}}_{ij}-{\mathbf{M}}_{ij}\right)}^2 \); thus, we get

$$ {\mathbf{M}}_{ij}= \max \left({\mathbf{B}}_{ij}{\mathbf{R}}_{ij},0\right) $$
(13)

Based on the nonnegative constraint on \( {\mathbf{M}}_{ij} \), M is finally obtained as

$$ \mathbf{M}= \max \left(\mathbf{B}\odot \mathbf{R},\mathbf{0}\right) $$
(14)

The complete algorithm for solving the optimization problem (7) is described in Algorithm 1.

Algorithm 1 The optimization algorithm for DELM
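A minimal NumPy sketch of this alternating scheme, written from our reading of the updates (9) and (14) (the fixed iteration count and all names are illustrative choices, not the paper's stopping rule), is:

```python
import numpy as np

def delm_train(H, T, C=100.0, n_iter=20):
    """Alternately update beta via Eq. (9) and M via Eq. (14), given hidden output H."""
    N, L = H.shape
    B = np.where(T > 0, 1.0, -1.0)             # dragging-direction matrix
    M = np.zeros_like(T)                       # nonnegative relaxation matrix, initialized to 0
    A = H.T @ H + np.eye(L) / C                # (L x L) system matrix, fixed across iterations
    beta = None
    for _ in range(n_iter):
        Q = T + B * M                          # current relaxed labels
        beta = np.linalg.solve(A, H.T @ Q)     # Eq. (9)
        M = np.maximum(B * (H @ beta - T), 0)  # Eq. (14), with R = H beta - T
    return beta, M
```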

Once the optimal β and M are obtained, we have Q = T + B ⊙ M, and the predicted output of a new test sample z can be computed as

$$ \boldsymbol{y}=\mathbf{h}\left(\boldsymbol{z}\right)\boldsymbol{\beta} =\mathbf{h}\left(\boldsymbol{z}\right)\cdot {\left({\mathbf{H}}^{\mathrm{T}}\mathbf{H}+\frac{{\mathbf{I}}_{L\times L}}{C}\right)}^{-1}{\mathbf{H}}^{\mathrm{T}}\mathbf{Q} $$
(15)
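Continuing the sketches above (elm_train, elm_feature_map, elm_predict, and delm_train are the hypothetical helpers defined earlier), a toy end-to-end usage with random data, shown only to illustrate the shapes involved, looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 30))                 # 200 samples, 30 features
labels = rng.integers(0, 3, size=200)              # 3 classes
T = -np.ones((200, 3))
T[np.arange(200), labels] = 1                      # +1/-1 target coding

W, b, _ = elm_train(X, T, L=100)                   # random hidden layer from the ELM sketch
H = elm_feature_map(X, W, b)
beta, M = delm_train(H, T, C=100.0)                # DELM output weights and relaxation matrix
predicted = elm_predict(X, W, b, beta)             # Eq. (15) followed by an argmax
```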

Neuron Pruning-Inspired Discriminative Extreme Learning Machine

In the classic ELM, the number of hidden neurons is usually determined with human intervention in a trial-and-error way. It is tedious to select a suitable number of hidden neurons manually. Besides, an overlarge network brings longer prediction times, unnecessary memory requirements, and higher hardware cost. An alternative is to train a network larger than necessary and then prune the unnecessary neurons. In this section, we devise a novel neuron pruning-inspired discriminative ELM based on a structured sparse model, which has been widely studied in pattern recognition and machine learning [26, 27, 42].

Model Formulation

Suppose that we are given N training samples \( \left\{\left({\boldsymbol{x}}_i,{\boldsymbol{t}}_i\right)\right\},\ i=1,2,\dots, N \), belonging to c (≥2) classes. Here, \( {\boldsymbol{x}}_i\in {\Re}^m \) is a data point and \( {\boldsymbol{t}}_i\in {\Re}^c \) is its label vector. The ELM feature mapping \( \mathbf{h}\left({\boldsymbol{x}}_i\right)\in {\Re}^L \) maps the sample \( {\boldsymbol{x}}_i \) from the m-dimensional input space to the L-dimensional hidden-layer feature space, called the ELM feature space. Suppose that ELM can approximate the data labels. The relation between the estimated outputs and the actual outputs is

$$ {\boldsymbol{t}}_i={\boldsymbol{\beta}}_1{h}_1\left({\boldsymbol{x}}_i\right)+{\boldsymbol{\beta}}_2{h}_2\left({\boldsymbol{x}}_i\right)+\dots +{\boldsymbol{\beta}}_L{h}_L\left({\boldsymbol{x}}_i\right)={\sum}_{j=1}^L{\boldsymbol{\beta}}_j{h}_j\left({\boldsymbol{x}}_i\right),\kern0.5em i=1,2,\dots, N $$
(16)

where \( {\boldsymbol{\beta}}_j \) is the jth row of the output weight matrix β. The above N equations can be written compactly as

$$ \mathbf{H}\boldsymbol{\beta } =\mathbf{T} $$
(17)

where

$$ \mathbf{H}=\left[\begin{array}{c}\mathbf{h}\left({\boldsymbol{x}}_1\right)\\ \vdots \\ \mathbf{h}\left({\boldsymbol{x}}_N\right)\end{array}\right]=\left[\begin{array}{ccc}{h}_1\left({\boldsymbol{x}}_1\right)& \cdots & {h}_L\left({\boldsymbol{x}}_1\right)\\ \vdots & \ddots & \vdots \\ {h}_1\left({\boldsymbol{x}}_N\right)& \cdots & {h}_L\left({\boldsymbol{x}}_N\right)\end{array}\right]\kern1em \boldsymbol{\beta} =\left[\begin{array}{c}{\boldsymbol{\beta}}_1^{\mathrm{T}}\\ \vdots \\ {\boldsymbol{\beta}}_L^{\mathrm{T}}\end{array}\right]\kern1em \mathbf{T}=\left[\begin{array}{c}{\boldsymbol{t}}_1^{\mathrm{T}}\\ \vdots \\ {\boldsymbol{t}}_N^{\mathrm{T}}\end{array}\right] $$

As shown in (16), each neuron has its own response to a sample, and the responses of all neurons produce the final label prediction \( {\boldsymbol{t}}_i \) according to the output weight matrix β. If some rows of β, i.e., \( {\boldsymbol{\beta}}_j\ \left(j=1,2,\dots, L\right) \), are equal to zero, the corresponding neuron responses make no contribution to the estimated output. As a result, these irrelevant hidden neurons can be removed; i.e., a pruning of neurons can be conducted to get rid of useless neurons. In this way, we endow the output weight matrix β with the function of neuron selection by considering the relevance of hidden neurons to the class labels. Specifically, a new vector \( \overline{\boldsymbol{\beta}} \), which collects the L2 norms of the row vectors of β, is constructed as

$$ \overline{\boldsymbol{\beta}}={\left[{\left\Vert {\boldsymbol{\beta}}_1\right\Vert}_2,{\left\Vert {\boldsymbol{\beta}}_2\right\Vert}_2,\dots, {\left\Vert {\boldsymbol{\beta}}_L\right\Vert}_2\right]}^{\mathrm{T}}\in {\mathbf{\Re}}^L $$
(18)

where \( {\left\Vert {\boldsymbol{\beta}}_j\right\Vert}_2=\sqrt{\sum_{k=1}^c{\beta}_{jk}^2},\ j=1,2,\dots, L \). Requiring d nonzero rows in β is equivalent to requiring the number of nonzero entries in \( \overline{\boldsymbol{\beta}} \) to be d:

$$ {\left\Vert \overline{\boldsymbol{\beta}}\right\Vert}_0= d $$
(19)

Nevertheless, solving a problem with an L0-norm constraint is NP-hard. Alternatively, we approximate the L0 norm with the L1 norm and adopt the following L2,1 norm of the matrix β

$$ {\left\Vert \overline{\boldsymbol{\beta}}\right\Vert}_1={\sum}_{j=1}^L\sqrt{\sum_{k=1}^c{\beta}_{j k}^2}={\left\Vert \boldsymbol{\beta} \right\Vert}_{2,1} $$
(20)
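In NumPy terms, \( \overline{\boldsymbol{\beta}} \) and the L2,1 norm are each a one-liner; the small illustration below (toy matrix, our own names) also shows how a zero row identifies a removable neuron:

```python
import numpy as np

beta = np.array([[0.8, -0.2,  0.1],    # toy (L=4, c=3) output weight matrix
                 [0.0,  0.0,  0.0],    # a zero row: this hidden neuron contributes nothing
                 [0.3,  0.4, -0.5],
                 [0.0,  0.0,  0.0]])

row_norms = np.linalg.norm(beta, axis=1)   # beta_bar in Eq. (18)
l21_norm = row_norms.sum()                 # Eq. (20): ||beta||_{2,1}
prunable = np.where(row_norms == 0)[0]     # neurons whose responses never reach the output
```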

Therefore, our pruning discriminative ELM (P-DELM) model is formulated as follows:

$$ {\min}_{\boldsymbol{\beta},\ \mathbf{M}}\ {\left\Vert \boldsymbol{\beta} \right\Vert}_{2,1}+\frac{C}{2}{\left\Vert \mathbf{H}\boldsymbol{\beta} -\mathbf{T}-\mathbf{B}\odot \mathbf{M}\right\Vert}_F^2\kern1em \mathrm{s}.\mathrm{t}.\kern0.5em \mathbf{M}\ge \mathbf{0} $$
(21)

The errors between target output and actual output are taken into consideration, which provides more freedom for β.

Optimization for Pruning Discriminative Extreme Learning Machine

Our P-DELM model can be optimized in a similar way to DELM. Given M, let Q = T + B ⊙ M. The optimization problem becomes

$$ {\min}_{\boldsymbol{\beta}}\ {\left\Vert \boldsymbol{\beta} \right\Vert}_{2,1}+\frac{C}{2}{\left\Vert \mathbf{H}\boldsymbol{\beta} -\mathbf{Q}\right\Vert}_F^2 $$
(22)

Obviously, the objective function is differentiable with respect to β [28]. First, we consider the derivative of the term \( {\left\Vert \boldsymbol{\beta} \right\Vert}_{2,1} \) w.r.t. β. According to the definition of the L2,1-norm in (20), the derivative of \( {\left\Vert \boldsymbol{\beta} \right\Vert}_{2,1} \) with respect to the entry \( {\beta}_{jk} \) can be calculated as

$$ \frac{\partial {\left\Vert \boldsymbol{\beta} \right\Vert}_{2,1}}{\partial {\beta}_{j k}}={\beta}_{j k}{\left({\sum}_{l=1}^c{\beta}_{j l}^2\right)}^{-1/2}=\frac{\beta_{j k}}{{\left\Vert {\boldsymbol{\beta}}_j\right\Vert}_2} $$
(23)

Then, the derivative of \( {\left\Vert \boldsymbol{\beta} \right\Vert}_{2,1} \) w.r.t. β is obtained as

$$ \frac{\partial {\left\Vert \boldsymbol{\beta} \right\Vert}_{2,1}}{\partial \boldsymbol{\beta}}=\mathbf{\sum}\boldsymbol{\beta } $$
(24)

where ∑ is a diagonal matrix in \( {\Re}^{L\times L} \) whose jth diagonal element is computed as

$$ {\mathbf{\sum}}_{j j}=\frac{1}{{\left\Vert {\boldsymbol{\beta}}_j\right\Vert}_2} $$
(25)
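A short verification of this reweighting step (added here for completeness): since ∑ is diagonal with the entries given in (25), treating it as a constant within an iteration yields

$$ \mathrm{tr}\left({\boldsymbol{\beta}}^{\mathrm{T}}\boldsymbol{\Sigma}\boldsymbol{\beta} \right)={\sum}_{j=1}^L{\boldsymbol{\Sigma}}_{jj}{\left\Vert {\boldsymbol{\beta}}_j\right\Vert}_2^2={\sum}_{j=1}^L\frac{{\left\Vert {\boldsymbol{\beta}}_j\right\Vert}_2^2}{{\left\Vert {\boldsymbol{\beta}}_j\right\Vert}_2}={\left\Vert \boldsymbol{\beta} \right\Vert}_{2,1} $$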

This shows that ‖β2 , 1 can be written as \( {\left\Vert \boldsymbol{\beta} \right\Vert}_{2,1}=\frac{1}{2}\mathrm{tr}\left({\boldsymbol{\beta}}^{\mathrm{T}}\mathbf{\sum}\boldsymbol{\beta } \right) \), where ∑ is defined in (25). By setting the derivation of the objection function (6) w.r.t β to 0, we have

$$ \mathbf{\sum}\boldsymbol{\beta } + C\cdot {\mathbf{H}}^{\mathrm{T}}\left(\mathbf{H}\boldsymbol{\beta } -\mathbf{Q}\right)=\mathbf{0} $$
(26)

We further obtain the expression of β as follows:

$$ {\boldsymbol{\beta}}^{\ast }= C\cdot {\left(\mathbf{\sum}+ C\cdot {\mathbf{H}}^{\mathrm{T}}\mathbf{H}\right)}^{-1}{\mathbf{H}}^{\mathrm{T}}\mathbf{Q} $$
(27)

Note that ∑ depends on β, so it is determined iteratively using the β from the previous optimization step. Then, we fix β and solve the following problem to update M.

$$ {\min}_{\mathbf{M}}\ \frac{C}{2}{\left\Vert \mathbf{H}\boldsymbol{\beta} -\mathbf{T}-\mathbf{B}\odot \mathbf{M}\right\Vert}_F^2\kern1em \mathrm{s}.\mathrm{t}.\kern0.5em \mathbf{M}\ge \mathbf{0} $$
(28)

Letting R = Hβ − T, we have

$$ {\min}_{\mathbf{M}}\ \frac{C}{2}{\left\Vert \mathbf{R}-\mathbf{B}\odot \mathbf{M}\right\Vert}_F^2\kern1em \mathrm{s}.\mathrm{t}.\kern0.5em \mathbf{M}\ge \mathbf{0} $$
(29)

Similarly, M is obtained as

$$ \mathbf{M}= \max \left(\mathbf{B}\odot \mathbf{R},\mathbf{0}\right) $$
(30)

Figure 2 shows the difference between the output weight matrices obtained by the original ELM and by our method. Figure 2a shows the normalized L2 norms of the rows of β, i.e., \( \overline{\boldsymbol{\beta}} \) defined in (18), obtained from the original ELM. Most of its entries are nonzero under the Frobenius norm constraint, which only enforces β to be small. In contrast, the L2,1-norm in our model yields a row-sparse β, as illustrated in Fig. 2b, and distinguishes the importance of different hidden neurons. In our P-DELM method, a few hidden neurons can undertake the task of information transmission from the input space to the ELM feature space.

Fig. 2

The normalized L2 norm of rows of β, namely \( \overline{\boldsymbol{\beta}} \) defined in (18). a is obtained by the classical ELM, and b is obtained by our model shown in (21)

After the optimal β is obtained, d neurons can be selected from the L original neurons. We next present a pruning method to obtain a suitable number of valuable hidden neurons. We first normalize \( \overline{\boldsymbol{\beta}} \) by dividing it by the sum of all its entries and then sort the entries of \( \overline{\boldsymbol{\beta}} \) in descending order. The sorted vector is denoted as \( \overline{\overline{\boldsymbol{\beta}}}=\left[{\overline{\overline{\beta}}}_1,{\overline{\overline{\beta}}}_2,\dots, {\overline{\overline{\beta}}}_L\right] \). Then, we select the hidden neurons by a threshold η. The ratio of the sum of the first d entries to the sum of all entries in \( \overline{\overline{\boldsymbol{\beta}}} \) is

$$ {\eta}_d=\frac{\sum_{q=1}^d{\overline{\overline{\beta}}}_q}{\sum_{q=1}^L{\overline{\overline{\beta}}}_q} $$
(31)

Obviously, the denominator of (31) equals 1, so \( {\eta}_d={\sum}_{q=1}^d{\overline{\overline{\beta}}}_q \). Given a threshold η in (0, 1), the number of valuable hidden neurons is determined as

$$ d= \min \left\{ d|{\eta}_d\ge \eta \right\} $$
(32)
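This selection rule can be sketched as follows (a NumPy illustration under our own naming; η = 0.9999 is the value adopted later in the experiments):

```python
import numpy as np

def select_neurons(beta, eta=0.9999):
    """Smallest d such that the top-d normalized row norms of beta accumulate
    at least eta of the total, following Eqs. (31)-(32)."""
    row_norms = np.linalg.norm(beta, axis=1)        # beta_bar, Eq. (18)
    weights = row_norms / row_norms.sum()           # normalize so the entries sum to 1
    order = np.argsort(weights)[::-1]               # sort neurons from large to small
    eta_d = np.cumsum(weights[order])               # eta_d for d = 1, ..., L
    d = int(np.searchsorted(eta_d, eta) + 1)        # smallest d with eta_d >= eta
    return order[:d]                                # indices of the retained hidden neurons
```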

Once the d selected hidden neurons are determined, the corresponding hidden layer output matrix \( \tilde{\mathbf{H}} \) is used to update the output weight matrix as

$$ \tilde{\boldsymbol{\beta}}={\left({\tilde{\mathbf{H}}}^{\mathrm{T}}\tilde{\mathbf{H}}+\frac{{\mathbf{I}}_{d\times d}}{C}\right)}^{-1}{\tilde{\mathbf{H}}}^{\mathrm{T}}\mathbf{T} $$
(33)

The structure of the proposed P-DELM is shown in Fig. 3. We first use P-DELM to identify the valuable hidden neurons and then update the output weight matrix using the remaining hidden layer output matrix as in (33). The complete optimization algorithm for P-DELM is described in Algorithm 2.

Algorithm 2 The optimization algorithm for P-DELM
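Our reading of the overall procedure, as a hedged NumPy sketch (the warm start from the plain ELM solution and the small eps guarding against division by zero are our own choices, not stated in the paper):

```python
import numpy as np

def pdelm_train(H, T, C=100.0, n_iter=20, eps=1e-8):
    """Iteratively solve Eq. (21): reweighted beta update (27) and M update (30)."""
    N, L = H.shape
    B = np.where(T > 0, 1.0, -1.0)
    M = np.zeros_like(T)
    beta = np.linalg.solve(H.T @ H + np.eye(L) / C, H.T @ T)          # warm start (plain ELM)
    for _ in range(n_iter):
        Q = T + B * M
        Sigma = np.diag(1.0 / (np.linalg.norm(beta, axis=1) + eps))   # Eq. (25)
        beta = C * np.linalg.solve(Sigma + C * (H.T @ H), H.T @ Q)    # Eq. (27)
        M = np.maximum(B * (H @ beta - T), 0.0)                       # Eq. (30)
    return beta, M

# The row-sparse beta is then passed to select_neurons (see the earlier sketch) to pick
# the d valuable neurons, and the reduced hidden output matrix is used to retrain the
# output weights as in Eq. (33).
```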
Fig. 3

The structure of proposed pruning DELM (P-DELM)

Experiments

Experimental Results for Discriminative Extreme Learning Machine

Face recognition (FR) is one of the classical problems in computer vision [29]. Facial images often exhibit large within-class scatter and small between-class scatter, which makes FR difficult. In this section, four popular face databases, i.e., the ORL [30], Extended Yale B [31], CMU PIE [32], and AR [33] databases, are employed to evaluate the performance of different methods. The ORL face database contains 400 images from 40 subjects. Each subject has ten images acquired at different times. The face images in the ORL database are 32 × 32 pixels. The Extended Yale B database consists of 2414 frontal facial images of 38 individuals. Each individual has about 64 images, taken under various laboratory-controlled lighting conditions. In our experiments, each image is manually cropped and resized to 32 × 32 pixels. The AR face database consists of more than 4000 color images of 126 subjects. The CMU PIE database contains over 40,000 facial images of 68 individuals. Images of each individual were acquired across 13 different poses, under 43 different illumination conditions, and with 4 different expressions. Figure 4a–d shows several example images of one subject from the ORL, Extended Yale B, CMU PIE, and AR databases, respectively.

Fig. 4

Some example images used in our experiments. a The ORL database. b The Extended Yale B database. c The CMU PIE database. d The AR database

We compare our algorithm with least squares regression (LSR) [25], discriminative least squares regression (DLSR) [25], SVM [34], LSSVM [35], and the classic ELM on the eigenface feature [36]. For SVM and LSSVM, the Libsvm-3.12 and LSSVM-1.7 toolboxes were used, respectively. For LSR and DLSR, the Matlab codes provided by the authors [25] were used. The optimal linear transformation is first learned, and the facial images under this transformation are classified by a 1-NN classifier.

For the ORL face database, we randomly select l (= 4, 5, 6) images per subject for training and the remainder for testing. For each given l, we independently run all the methods 10 times and report the average results. Two dimensions of the eigenface feature, i.e., 50 and 100, are tested. Table 2 lists the recognition results of the different approaches.

Table 2 Recognition results of different methods on ORL database

For the Extended Yale B database, we randomly select l (= 10, 20, 30) images per subject for training and the remainder for testing. For each given l, we independently run all the methods 10 times and report the average recognition results. Two dimensions of the eigenface feature, i.e., 50 and 100, are tested. Table 3 shows the recognition results of different methods.

Table 3 Recognition results of different methods on Extended Yale B database

For the CMU PIE database, we use a near-frontal pose subset, namely C07, which contains 1629 images of 68 individuals. Each individual has about 24 images. A random subset with l (= 8, 10, 12) images per individual is selected for training and the rest for testing. For each given l, we independently run all the methods 10 times and report the average recognition rates. Two dimensions of the eigenface feature, i.e., 50 and 100, are tested. Table 4 lists the recognition accuracy together with the standard deviation obtained by different methods.

Table 4 Recognition results of different methods on CMU PIE database

For the AR face database, a subset containing 50 male subjects and 50 female subjects is chosen for our experiments. For each subject, seven images from session 1 are used for training, and the other seven images from session 2 are used for testing. The size of each image is 60 × 43. The recognition results of different methods are given in Table 5.

Table 5 Recognition results of different methods on AR database

From Tables 2, 3, 4, and 5, one can conclude that the proposed DELM method achieves promising performance. Moreover, as the number of training samples per class and the dimension of the eigenface feature increase, all methods tend to achieve higher recognition accuracy. DELM outperforms all the compared methods on most dimensions under different training sets. In comparison with DLSR, the proposed DELM performs better overall, which reveals the effect of executing an explicit mapping from the input space to a higher-dimensional ELM feature space. The proposed DELM also outperforms the classical ELM. The gain mainly comes from the enlarged margins between different classes achieved by introducing a nonnegative label relaxation matrix.

Parameter Analysis for Discriminative Extreme Learning Machine

Similar to ELM, the proposed DELM algorithm has two key parameters, namely the number of hidden neurons L and the penalty constant C in (7). We further conduct experiments to investigate the effect of these parameters on the final recognition accuracy. Eleven different values of C (0.001, 0.01, 1, 100, 200, 500, 1000, 2000, 3000, 4000, and 5000) and seven different values of L (100, 500, 1000, 2000, 3000, 4000, and 5000) have been tried, resulting in 77 different pairs in total. Parameter analysis experiments are performed on each of the previously mentioned databases.

Figure 5 shows the relationship between the recognition rate and the parameter pair (L, C). From Fig. 5, one can see that the recognition accuracy tends to increase as L and C increase. DELM is not especially sensitive to changes of (L, C) over a large range, and it performs stably when L and C are assigned relatively large values.

Fig. 5

The influence of the tunable parameters (L, C) on DELM for different face databases. a The ORL database, b the Extended Yale B database, c the CMU PIE database, and d the AR database

Experimental Results for Pruning Discriminative Extreme Learning Machine

The Selection of Parameter η

In this section, we study the characteristics of the key parameter η in P-DELM; the experiments are carried out using the Diabetes dataset from the University of California at Irvine (UCI) Machine Learning Repository [37]. With 50 initial hidden neurons, we record the number of selected hidden neurons and the corresponding testing accuracy of P-DELM. The results are averaged over 100 repeated experiments and shown in Fig. 6. From the results, we observe that the number of selected hidden neurons and the testing accuracy rise as the threshold η increases. When η approaches 1, the number of selected hidden neurons and the testing accuracy tend to reach a plateau. Notably, when η = 0.9999, only about (14.56/50) × 100% = 29.12% of the original hidden neurons are selected. These observations indicate that the introduction of L2,1-norm regularization results in a quite sparse \( \overline{\boldsymbol{\beta}} \), which distinguishes the importance of different hidden neurons in information transmission. As a result, we empirically set η = 0.9999 in the method, which guarantees good performance, as the following experimental results demonstrate.

Fig. 6

Effect of the threshold η on the testing accuracy and number of selected hidden neurons. a Threshold η versus number of reserved hidden neurons. b Threshold η versus testing accuracy

The Performance of Pruning Discriminative Extreme Learning Machine for Pattern Classification

In this section, the performance of P-DELM is evaluated on public benchmark classification problems, comparing it with the original ELM, DELM, and OP-ELM on several datasets from the UCI Machine Learning Repository [37]. The information and characteristics of the datasets are summarized in Table 6. L2,1-DELM denotes the L2,1-norm regularized DELM model without the neuron pruning step.

Table 6 Specification of classification benchmark problems

The results are averaged over 100 repeated experiments. We report the mean number of hidden neurons used, the testing accuracy/STD, and the CPU time for training and testing in Table 7. From the results, one can conclude that our DELM, L2,1-DELM, and P-DELM consistently achieve higher testing accuracy than ELM. Meanwhile, P-DELM is faster in testing, with comparable or even better performance using fewer hidden neurons. In general, OP-ELM does not perform as well as P-DELM, with lower testing accuracy and more hidden neurons.

Table 7 The performance comparison between ELM, DELM, L2,1-DELM, and P-DELM on UCI datasets

The Performance of Pruning Discriminative Extreme Learning Machine for Image Classification

We further conduct experiments to validate the performance of DELM, L2,1-DELM, and P-DELM on image datasets, including COIL20 and Caltech256. The COIL20 database has 20 objects, and each object has 72 images obtained by rotating the object through 360° in 5° steps (1440 images in total) [38]. The size of each image in COIL20 is 32 × 32 pixels. A subset of the Caltech256 database [39, 40], which has 20 classes with 100 samples per category, is used in our experiment. In the experiments, we directly use the grayscale images as features on the COIL20 database, while 2048-dimensional PiCoDes features [41] are adopted to represent the images in the Caltech256 database. We randomly select half of the images per class for training and the rest for testing. The experimental results are reported in Table 8. The experiments on these image datasets show promising results. We also note that our methods consume more time for training; future work should reduce this computational complexity.

Table 8 The performance comparison between ELM, DELM, L2,1-DELM, and P-DELM on COIL20 database and Caltech256 database

Conclusions

In this paper, we have presented a DELM framework for pattern classification. DELM aims to enlarge the distances between different classes as much as possible by learning a nonnegative label relaxation matrix. The performance of DELM is compared with several state-of-the-art methods on public face databases under different experimental settings. The results demonstrate the effectiveness of DELM for FR under posture, facial expression, and illumination variations. In addition, we develop a novel method for the architectural design of the ELM network by introducing L2,1-norm regularization into the DELM model. The obtained P-DELM model can distinguish the importance of different hidden neurons. Worthless neurons are then pruned for a more compact network. Experimental results show that P-DELM achieves promising performance for pattern classification with fewer hidden neurons and less prediction time.