1 Introduction

A natural approach to the automatic selection of optimal kernels is to learn a linear combination \(K = \sum\nolimits_{j = 1}^{m} {\mu_{j} } K_{j}\) with mixing coefficients μ together with the model parameters. This framework, named multiple-kernel learning (MKL), was first introduced by [1], where two kinds of constraints on the combination coefficients and on K were considered, leading to semi-definite programming and QCQP approaches, respectively. For appropriately designed sub-kernels K_j, the optimized combination coefficients can then be used to understand which features of the examples are important for discrimination: if one is able to obtain an accurate classification with a sparse weighting μ_j, then one can quite easily interpret the resulting decision function. Intuitively, sparseness of μ makes sense when the expected number of meaningful kernels is small. Requiring that only a small number of kernels contribute to the final kernel implicitly assumes that a few of the candidate kernels are, by themselves, sufficiently informative. In other words, sparseness is good when the kernels already contain a couple of good features that alone capture almost all of the characteristic traits of the problem; this also implies that the features are highly redundant. MKL is now generally recognized as a powerful tool for various machine learning problems [2, 3].
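To make this setting concrete, the following minimal sketch (the function and variable names are ours, not from [1]) forms such a combined kernel from precomputed sub-kernel matrices; a sparse weight vector μ directly reveals which sub-kernels drive the decision function:

```python
import numpy as np

def combine_kernels(kernel_list, mu):
    """Form K = sum_j mu_j * K_j from a list of precomputed sub-kernel matrices.

    A sparse weight vector mu (many entries exactly zero) shows directly which
    sub-kernels, and hence which feature representations, drive the final
    decision function.
    """
    mu = np.asarray(mu, dtype=float)
    assert np.all(mu >= 0) and np.isclose(mu.sum(), 1.0), "expect a convex combination"
    return sum(m * K for m, K in zip(mu, kernel_list))
```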

Extreme learning machine (ELM) was proposed by Huang [4, 5]. ELM is a type of single hidden layer feed-forward neural network (SLFN) whose core is a fixed hidden layer containing a large number of nonlinear nodes. The input weights and hidden layer biases of ELM are chosen randomly beforehand, and only the output weights need to be calculated. The only factor that needs to be set by users is the size of ELM (the number of hidden nodes) [6]. Benefiting from this simple structure and effective training method, ELM has been successfully used in many practical tasks such as prediction, fault diagnosis, recognition, classification, signal processing, and so on.

Although ELM has achieved notable success, there is still room for improvement. Some scholars have focused on optimizing the learning algorithm of ELM. Han et al. [7] encoded a priori information to improve the function approximation of ELM. Kim et al. [8] introduced a variable projection method to reduce the dimension of the parameter space. Zhu et al. [9] used a differential evolutionary algorithm to select the input weights of ELM.

Other scholars have focused on optimizing the structure of ELM. Wang et al. [10] made a proper selection of the input weights and biases of ELM in order to improve its performance. Li et al. [11] proposed a structure-adjustable online ELM learning method, which can adjust the number of hidden layer RBF nodes. Huang et al. [12, 13] proposed an incremental-structure ELM, which adds hidden nodes gradually. Meanwhile, another incremental approach, referred to as error-minimized extreme learning machine (EM-ELM), was proposed by Feng et al. [14]. All these incremental ELM variants start from a small hidden layer and add random hidden nodes to it; during the growth of the network, the output weights are updated incrementally. Alternatively, the structure of ELM can be optimized by training an ELM that is larger than necessary and then pruning the unnecessary nodes during learning. A pruned ELM (PELM) was proposed by Rong et al. [15, 16] for classification problems. Yoan et al. [17] proposed an optimally pruned extreme learning machine (OP-ELM) methodology. Besides, there are still other attempts to optimize the structure of ELM, such as CS-ELM [18] proposed by Lan et al., which uses a subset model selection method. Zong et al. [29] put forward the weighted extreme learning machine for imbalanced learning. The kernel trick applied to ELM was introduced in previous work [30].

Lanckriet et al. [1] pioneered the work of learning parameterized combinations of kernels, in which the optimal kernel is obtained as a linear combination of pre-specified kernels. This procedure, named multiple-kernel learning (MKL) in the literature, allows kernels to be chosen more automatically based on the data and enhances the autonomy of the machine learning process [27]. MKL is also particularly valuable in applications [32]: by learning a linear combination of kernel matrices, MKL offers the advantage of integrating heterogeneous data from multiple sources such as vectors, strings, trees, and graphs. Recently, MKL has been successfully applied to combine various heterogeneous data sources in practice [19]. Compared with the existing L1-norm MKL method, L2-norm MKL leads to non-sparse solutions and offers advantages when applied to biomedical problems. Ye et al. [20] showed that multiple-kernel learning for LS-SVM can be formulated as an SDP problem, which has much lower computational complexity than that of C-SVM. Sonnenburg et al. [22] applied a semi-infinite linear programming (SILP) strategy that reuses SVM implementations to solve the subproblems inside the MKL optimization more efficiently, which made MKL applicable to large-scale data sets. Yang et al. [26] used the elastic net regularizer on the kernel combination coefficients as a constraint for MKL. Gu et al. [28] aimed to learn, based on multi-kernel regression, the map between the space of high-magnification sampling rate image patches and the space of blurred high-magnification sampling rate image patches, which are the interpolation results generated from the corresponding low-resolution images. Liu et al. [33] designed sparse, non-sparse, and radius-incorporated MK-ELM algorithms.

We address multiple-kernel learning for ELM by formulating it as a semi-infinite linear program (SILP). In addition, the proposed algorithm optimizes the regularization parameter together with the kernels in a unified framework, which makes the learning system more automatic. Empirical results on benchmark data sets show that multiple-kernel learning for ELM (MKL-ELM) is competitive with the traditional ELM algorithm.

2 Kernelized extreme learning machine

First, we briefly review the ELM proposed in [5, 6, 31]. The essence of ELM is that the hidden layer need not be tuned. The output function of ELM for generalized SLFNs is

$$f_{L} (x_{j} ) = \sum\limits_{i = 1}^{L} {\beta_{i} } h_{i} (x_{j} ) = \sum\limits_{i = 1}^{L} {\beta_{i} } h(w_{i} \cdot x_{j} + b_{i} ) = h(x_{j} )\beta ,\quad j = 1, \ldots ,N$$
(1)

where \(w_{i} \in R^{n}\) is the weight vector connecting the input nodes and the ith hidden node, \(b_{i} \in R\) is the bias of the ith hidden node, \(\beta_{i} \in R\) is the weight connecting the ith hidden node and the output node, and \(f_{L} (x) \in R\) is the output of the SLFN. Here \(w_{i} \cdot x_{j}\) denotes the inner product of w_i and x_j; the hidden node parameters w_i and b_i are randomly chosen before learning.

If a standard SLFN with at most \(\tilde{N}\) hidden nodes can approximate these N samples with zero error, then there exist β_i, w_i, and b_i such that

$$\sum\limits_{i = 1}^{L} {\beta_{i} } h(w_{i} \cdot x_{j} + b_{i} ) = t_{j} ,\quad j = 1, \ldots ,N$$
(2)

Equation (2) can be written compactly as

$${\mathbf{H}}\beta = {\mathbf{T}}.$$
(3)

Equation (3) is then a linear system, and the smallest norm least squares solution of this linear system is

$$\beta = {\mathbf{H}}^{\dag } {\mathbf{T}},$$
(4)

where \({\mathbf{H}}^{\dag }\) is the Moore–Penrose generalized inverse [4] of the hidden layer output matrix H.

$$H = \left( {\begin{array}{*{20}c} {h(x_{1} )} \\ \vdots \\ {h(x_{N} )} \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} {h(w_{1} ,b_{1} ,x_{1} )} & \cdots & {h(w_{L} ,b_{L} ,x_{1} )} \\ \vdots & \ddots & \vdots \\ {h(w_{1} ,b_{1} ,x_{N} )} & \cdots & {h(w_{L} ,b_{L} ,x_{N} )} \\ \end{array} } \right)_{N \times L} ,$$

\({\mathbf{T}} = [t_{1} , \ldots ,t_{N} ]^{T}\), and \(\beta = [\beta_{1} ,\beta_{2} , \ldots ,\beta_{L} ]^{T} .\)

H is called the hidden layer output matrix of the network [4, 5]; the ith column of H is the ith hidden node’s output vector with respect to input \(x_{1} ,x_{2} , \ldots ,x_{N}\); and the jth row of H is the output vector of the hidden layer with respect to input x j .

It has been proven in [5, 6] that SLFNs with randomly generated hidden nodes have the universal approximation capability; the hidden nodes can be randomly generated at the beginning of learning. As introduced in [6], one method to calculate the Moore–Penrose generalized inverse of a matrix is the orthogonal projection method: \({\mathbf{H}}^{\dag } = {\mathbf{H}}^{T} ({\mathbf{HH}}^{T} )^{ - 1} .\)

According to the ridge regression theory [25], one can add a positive value to the diagonal of \({\mathbf{HH}}^{T}\); the resultant solution is more stable and tends to have better generalization performance:

$$f(x) = h(x)\beta = h(x){\mathbf{H}}^{T} \left( {\frac{{\mathbf{I}}}{C} + {\mathbf{HH}}^{T} } \right)^{ - 1} {\mathbf{T}}.$$
(5)

The feature mapping \(h(x)\) is usually known to users in ELM. However, if a feature mapping \(h(x)\) is unknown to users, a kernel matrix for ELM can be defined as follows [6]:

$$\Omega _{\text{ELM}} = {\mathbf{HH}}^{T} :\quad \Omega _{{{\text{ELM}}_{i,j} }} = h(x_{i} ) \cdot h(x_{j} ) = K(x_{i} ,x_{j} ).$$
(6)

Thus, the output function of kernel ELM classifier can be written compactly as:

$$f(x) = h(x){\mathbf{H}}^{T} \left( {\frac{{\mathbf{I}}}{C} + {\mathbf{HH}}^{T} } \right)^{ - 1} {\mathbf{T}} = \left[ {\begin{array}{*{20}c} {K(x,x_{1} )} \\ \vdots \\ {K(x,x_{N} )} \\ \end{array} } \right]^{T} \left( {\frac{{\mathbf{I}}}{C} + \varOmega_{\text{ELM}} } \right)^{ - 1} {\mathbf{T}}.$$
(7)
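As a concrete illustration of Eq. (7), the following sketch (our own naming, using a Gaussian kernel as an example, with np.linalg.solve in place of an explicit matrix inverse) trains and applies a kernel ELM without ever forming the feature mapping h(x) explicitly:

```python
import numpy as np

def rbf_kernel(X, Z, sigma2=20.0):
    """Gaussian kernel matrix K(x, z) = exp(-||x - z||^2 / sigma2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def kernel_elm_fit(X, T, C=1.0, sigma2=20.0):
    """Compute (I/C + Omega_ELM)^{-1} T as in Eq. (7)."""
    Omega = rbf_kernel(X, X, sigma2)
    A = np.eye(len(X)) / C + Omega
    return np.linalg.solve(A, T)            # coefficients multiplying [K(x, x_1), ..., K(x, x_N)]

def kernel_elm_predict(X_train, coeffs, X_new, sigma2=20.0):
    """f(x) = [K(x, x_1), ..., K(x, x_N)] (I/C + Omega_ELM)^{-1} T."""
    return rbf_kernel(X_new, X_train, sigma2) @ coeffs
```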

ELM aims to minimize the training error as well as the norm of the output weights [6]:

$${\text{Minimize}}:{\kern 1pt} \quad {\kern 1pt} \left\| {{\mathbf{H}}\beta - {\mathbf{T}}} \right\|^{2} \quad {\text{and}}\quad \left\| \beta \right\|.$$
(8)

In fact, according to (8), the classification problem for ELM with multi-output nodes can be formulated as:

$$\begin{array}{*{20}l} {{\text{Minimize:}}} \hfill & {L_{{P_{\text{ELM}} }} = \frac{1}{2}\left\| \beta \right\|^{2} + C\frac{1}{2}\sum\limits_{i = 1}^{N} {\left\| {{\varvec{\upxi}}_{i} } \right\|^{2} } } \hfill & {} \hfill \\ {{\text{S}} . {\text{t}} . :} \hfill & {h(x_{i} )\beta = {\mathbf{t}}_{i}^{T} - {\varvec{\upxi}}_{i}^{T} ,} \hfill & {i = 1, \ldots ,N} \hfill \\ \end{array}$$
(9)

where \({\varvec{\upxi}}_{i} = [\xi_{i,1} , \ldots ,\xi_{i,m} ]^{T}\) is the training error vector of the m output nodes with respect to the training sample x_i. Based on the KKT conditions, training ELM is equivalent to solving the following dual optimization problem:

$${\kern 1pt} \begin{array}{*{20}c} {\text{Minimize:}} & {L_{{D_{\text{ELM}} }} = \frac{1}{2}\left\| \beta \right\|^{2} + C\frac{1}{2}\sum\limits_{i = 1}^{N} {\left\| {{\varvec{\upxi}}_{i} } \right\|^{2} } - \sum\limits_{i = 1}^{N} {\sum\limits_{j = 1}^{m} {\alpha_{i,j} } } (h(x_{i} )\beta_{j} - t_{i,j} + \xi_{i,j} ).} \\ \end{array}$$
(10)

Algorithm 1: Given a training set \(\left\{ {(x_{i} ,t_{i} )} \right\}_{i = 1}^{N} \subset R^{n} \times R\), an activation function \(g( \cdot )\), and the hidden node number L:

  • Step 1: Randomly assign input weight w i and bias \(b_{i} ,{\kern 1pt} {\kern 1pt} i = 1, \ldots ,L.\)

  • Step 2: Calculate the hidden layer output matrix H.

  • Step 3: Calculate the output weight β: \(\beta = {\mathbf{H}}^{\dag } {\mathbf{T}}.\)
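For completeness, here is a minimal sketch of Algorithm 1 with a sigmoid activation; the function names, the Gaussian initialization of the random parameters, and the use of np.linalg.pinv for the Moore–Penrose inverse are our own choices rather than part of the original implementation:

```python
import numpy as np

def elm_train(X, T, L=100, seed=0):
    """Steps 1-3 of Algorithm 1: random (w_i, b_i), build H, then beta = pinv(H) T."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], L))    # step 1: random input weights w_i
    b = rng.standard_normal(L)                  # step 1: random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))      # step 2: hidden layer output matrix H
    beta = np.linalg.pinv(H) @ T                # step 3: smallest-norm least-squares solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```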

3 Multiple-kernel-learning algorithms for extreme learning machine

3.1 Minimum norm least squares (LS) solution of SLFNs

It is interesting and perhaps surprising that, unlike the common understanding that all the parameters of SLFNs need to be adjusted, the input weights w_i and the hidden layer biases b_i need not be tuned, and the hidden layer output matrix H can remain unchanged once random values have been assigned to these parameters at the beginning of learning. For fixed input weights w_i and hidden layer biases b_i, as seen from Eqs. (11) and (12), training an SLFN is simply equivalent to finding a least squares solution \(\hat{\beta }\) of the linear system \({\mathbf{\rm H}}\beta = {\mathbf{T}}\):

$$\begin{aligned} & \left\| {{\rm H}(\hat{w}_{1} , \ldots ,\hat{w}_{{\tilde{N}}} ,\hat{b}_{1} , \ldots ,\hat{b}_{{\tilde{N}}} )\hat{\beta } - {\rm T}} \right\| \\ & \quad = \mathop {\hbox{min} }\limits_{{w_{i} ,b_{i} ,\beta }} \left\| {{\rm H}(w_{1} , \ldots ,w_{{\tilde{N}}} ,b_{1} , \ldots ,b_{{\tilde{N}}} )\beta - T} \right\| \\ \end{aligned}$$
(11)
$$\begin{aligned} & \left\| {{\rm H}(w_{1} , \ldots ,w_{{\tilde{N}}} ,b_{1} , \ldots ,b_{{\tilde{N}}} )\hat{\beta } - {\mathbf{\rm T}}} \right\| \\ & {\kern 1pt} \quad = \mathop {\hbox{min} }\limits_{\beta } \left\| {{\rm H}(w_{1} , \ldots ,w_{{\tilde{N}}} ,b_{1} , \ldots ,b_{{\tilde{N}}} )\beta - {\mathbf{\rm T}}} \right\|. \\ \end{aligned}$$
(12)

If the number of hidden nodes is equal to the number of distinct training samples (\(\tilde{N} = N\)), matrix H is square and invertible when the input weight vectors w_i and the hidden biases b_i are randomly chosen, and the SLFN can approximate these training samples with zero error.

However, in most cases the number of hidden nodes is much less than the number of distinct training samples (\(\tilde{N} \ll N\)), \({\mathbf{\rm H}}\) is a non-square matrix, and there may not exist w_i, b_i, β_i (\(i = 1, \ldots ,\tilde{N}\)) such that \({\mathbf{\rm H}}\beta = {\mathbf{T}}\). The smallest norm least squares solution of the above linear system is

$$\hat{\beta } = {\mathbf{\rm H}}^{\dag } {\mathbf{\rm T}},$$
(13)

where \({\mathbf{\rm H}}^{\dag }\) is the Moore–Penrose generalized inverse of matrix H [21].

3.2 QCQP formulation

In the ELM setting, the duality gap between Eq. (9) and its dual program is zero since the Slater constraint qualification holds, and we get

$$\begin{aligned} & \mathop {\min }\limits_{\omega ,\xi } \,\mathop {\max }\limits_{\alpha } L(\omega ,\xi ;\alpha ) = \mathop {\max }\limits_{\alpha } \,\mathop {\min }\limits_{\omega ,\xi } L(\omega ,\xi ;\alpha ) \\ & \mathop {\max }\limits_{\alpha } \quad \alpha^{T} y - \frac{1}{2}\alpha^{T} K\alpha - \frac{1}{2C}\alpha^{T} \alpha \\ & {\text{s}} . {\text{t}} .\quad \sum\nolimits_{i = 1}^{n} {\alpha_{i} } = 0. \\ \end{aligned}$$
(14)

The Lagrangian dual yields:

$${\kern 1pt} \begin{array}{*{20}c} {\mathop {\hbox{min} }\limits_{\lambda } \mathop {\hbox{max} }\limits_{\alpha } } & {\alpha^{T} y - \frac{1}{2}\alpha^{T} K\alpha - \frac{1}{2C}\alpha^{T} \alpha + \lambda \alpha^{T} 1_{n} ,{\kern 1pt} } & {\lambda \ge 0} \\ \end{array}$$
(15)

The Lagrangian function is given as

$$L(\lambda ,\alpha ) = \alpha^{T} y - \frac{1}{2}\alpha^{T} K\alpha - \frac{1}{2C}\alpha^{T} \alpha + \lambda \alpha^{T} {\mathbf{1}}_{n} ,{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} \lambda \ge 0$$
(16)

To obtain the dual, the derivative of the Lagrangian function with respect to α must vanish:

$$\frac{\partial L(\lambda ,\alpha )}{\partial \alpha } = 0{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} \to {\kern 1pt} y - K\alpha - \frac{1}{C}\alpha + \lambda {\mathbf{1}}_{n} = 0,$$
(17)
$$\begin{gathered} \left( {K + \frac{1}{C}I} \right)\alpha = y + \lambda {\mathbf{1}}_{n} , \hfill \\ \alpha = \left( {K + \frac{1}{C}I} \right)^{ - 1} (y + \lambda {\mathbf{1}}_{n} ). \hfill \\ \end{gathered}$$
(18)

According to (18), Eq. (16) can be rewritten as:

$$\begin{aligned} L(\lambda ,\alpha ) & = \alpha^{T} y - \frac{1}{2}\alpha^{T} (y + \lambda {\mathbf{1}}_{n} ) + \lambda \alpha^{T} {\mathbf{1}}_{n} \\ & = \alpha^{T} y - \frac{1}{2}\alpha^{T} y - \frac{1}{2}\lambda \alpha^{T} {\mathbf{1}}_{n} + \lambda \alpha^{T} {\mathbf{1}}_{n} \\ & = \frac{1}{2}\alpha^{T} y + \frac{1}{2}\lambda \alpha^{T} {\mathbf{1}}_{n} \\ & = \frac{1}{2}\left( {\left( {K + \frac{1}{C}I} \right)^{ - 1} \left( {y + \lambda {\mathbf{1}}_{n} } \right)} \right)^{T} \left( {y + \lambda {\mathbf{1}}_{n} } \right) \\ & = \frac{1}{2}(y + \lambda {\mathbf{1}}_{n} )^{T} \left( {K + \frac{1}{C}I} \right)^{ - 1} \left( {y + \lambda {\mathbf{1}}_{n} } \right) \\ \end{aligned}$$
(19)

We get its dual program:

$$\begin{array}{*{20}c} {\mathop {\hbox{min} }\limits_{\lambda } \frac{1}{2}(y + \lambda {\mathbf{1}}_{n} )^{T} } & {\left( {K + \frac{1}{C}I} \right)^{ - 1} (y + \lambda {\mathbf{1}}_{n} ),} \\ \end{array}$$
(20)

since the Slater constraint qualification holds. We consider the parameterization \(K = \sum\nolimits_{i = 1}^{p} {\mu_{i} K_{i} }\) with the additional affine constraint \(\sum\nolimits_{i = 1}^{p} {\mu_{i} } = 1\) and the positive semi-definiteness constraint \(K \ge 0\). Attaching these constraints to Eq. (20) and minimizing with respect to μ gives:

$$\begin{array}{*{20}l} {\mathop {\hbox{min} }\limits_{\mu ,\lambda } } \hfill & {\frac{1}{2}(y + \lambda {\mathbf{1}}_{n} )^{T} \left( {\sum\limits_{i = 1}^{p} {\mu_{i} K_{i} } + \frac{1}{C}I} \right)^{ - 1} (y + \lambda {\mathbf{1}}_{n} )} \hfill & {} \hfill \\ {s.t.} \hfill & {\sum\limits_{i = 1}^{p} {\mu_{i} K_{i} } \ge 0,} \hfill & {} \hfill \\ {} \hfill & {\sum\limits_{i = 1}^{p} {\mu_{i} } = 1,} \hfill & {} \hfill \\ \end{array}$$
(21)

i.e.,

$$\begin{array}{*{20}l} {\mathop {\hbox{min} }\limits_{\mu ,\lambda ,t} {\kern 1pt} } \hfill & t \hfill & {} \hfill \\ {s.t.{\kern 1pt} } \hfill & {\left( {\begin{array}{*{20}c} {\frac{1}{2}\sum\limits_{i = 1}^{p} {\mu_{i} K_{i} + \frac{1}{2C}I} } & {y + \lambda {\mathbf{1}}_{n} } \\ {(y + \lambda {\mathbf{1}}_{n} )^{T} } & t \\ \end{array} } \right) \ge 0,} \hfill & {} \hfill \\ {} \hfill & {\sum\limits_{i = 1}^{p} {\mu_{i} K_{i} \ge 0,{\kern 1pt} {\kern 1pt} } \sum\limits_{i = 1}^{p} {\mu_{i} = 1.{\kern 1pt} {\kern 1pt} } } \hfill & {} \hfill \\ \end{array}$$
(22)

When the weights μ_i are constrained to be nonnegative and the K_i are positive semi-definite, the constraint \(\sum\nolimits_{i = 1}^{p} {\mu_{i} K_{i} } \ge 0\) is satisfied automatically. In that case, we again consider the parameterization \(K = \sum\nolimits_{i = 1}^{p} {\mu_{i} K_{i} }\) with the constraints \(\sum\nolimits_{i = 1}^{p} {\mu_{i} } = 1\) and μ_i ≥ 0. Substituting this into Eq. (20) and minimizing with respect to μ gives the following formulation:

$$\begin{aligned} & \mathop {\hbox{min} }\limits_{{\mu :\mu \ge 0,\mu^{T} {\kern 1pt} {\mathbf{1}}_{p} = 1}} \,\mathop {\hbox{max} }\limits_{{\alpha :\alpha^{T} {\kern 1pt} {\mathbf{1}}_{n} = 0}} \left\{ { - \frac{1}{2}\sum\limits_{i = 1}^{p} {\mu_{i} \alpha^{T} K_{i} \alpha + \alpha^{T} y - \frac{1}{2C}\alpha^{T} \alpha } } \right\} \\ {\kern 1pt} & \quad = \mathop {\hbox{max} }\limits_{{\alpha :\alpha^{T} {\mathbf{1}}_{n} = 0}} {\kern 1pt} \,\mathop {\hbox{min} }\limits_{{\mu :\mu \ge 0,\mu^{T} {\kern 1pt} {\mathbf{1}}_{p} = 1}} \,\left\{ {\alpha^{T} y - \frac{1}{2C}\alpha^{T} \alpha - \frac{1}{2}\sum\limits_{i = 1}^{p} {\mu_{i} } \alpha^{T} K_{i} \alpha } \right\} \\ & \quad = \mathop {\hbox{max} }\limits_{{\alpha :\alpha^{T} {\mathbf{1}}_{n} = 0}} \left\{ {\alpha^{T} y - \frac{1}{2C}\alpha^{T} \alpha - \frac{1}{2}\mathop {\hbox{max} }\limits_{{{\kern 1pt} \mu :\mu \ge 0,\mu^{T} {\mathbf{1}}_{p} = 1}} \sum\limits_{i = 1}^{p} {\mu_{i} } \alpha^{T} K_{i} \alpha } \right\} \\ & \quad = {\kern 1pt} \mathop {\hbox{max} }\limits_{{\alpha :\alpha^{T} {\mathbf{1}}_{n} = 0}} \left\{ {\alpha^{T} y - \frac{1}{2C}\alpha^{T} \alpha - \frac{1}{2}\mathop {\hbox{max} }\limits_{p \ge i \ge 1} \alpha^{T} K_{i} \alpha } \right\}, \\ \end{aligned}$$
(23)

where \(\mu^{T} {\mathbf{1}}_{p} = 1\) means \(\sum\nolimits_{i = 1}^{p} {\mu_{i} = 1}\) (p is the number of kernels). Here, it should be noted that the order of the minimization and the maximization is interchanged in the first equation. This is valid because the objective is convex in μ and concave in α, the minimization problem is strictly feasible in μ, and the maximization problem is strictly feasible in α [1]. Equation (23) can be reformulated as the following QCQP

$${\kern 1pt} \begin{array}{*{20}l} {\mathop {\hbox{max} }\limits_{\alpha ,t} } \hfill & { - \frac{1}{2}t + \alpha^{T} y - \frac{1}{2C}\alpha^{T} \alpha } \hfill & {} \hfill \\ {{\text{s}} . {\text{t}} .} \hfill & {t \ge \alpha^{T} K_{i} \alpha ,} \hfill & {i = 1, \ldots ,p,} \hfill \\ {} \hfill & {\alpha^{T} {\mathbf{1}}_{n} = 0.} \hfill & {} \hfill \\ \end{array}$$
(24)

Such a QCQP problem can be solved efficiently with general-purpose optimization software packages, such as MOSEK [23], which solve the primal and dual problems simultaneously using interior point methods. The obtained dual variables can be used to recover the optimal kernel coefficients.
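As an illustration, the QCQP in Eq. (24) can be prototyped with an off-the-shelf modeling layer such as CVXPY; the sketch below is under our own naming and assumes the kernel matrices are exactly positive semi-definite (any conic solver, e.g. MOSEK, can be plugged in behind it). The dual variables of the quadratic constraints recover the kernel weights μ_i.

```python
import cvxpy as cp
import numpy as np

def solve_mkl_elm_qcqp(kernels, y, C):
    """Sketch of Eq. (24): max -t/2 + alpha^T y - alpha^T alpha / (2C)
       s.t. alpha^T K_i alpha <= t, 1^T alpha = 0."""
    n = len(y)
    alpha, t = cp.Variable(n), cp.Variable()
    obj = cp.Maximize(-0.5 * t + alpha @ y - cp.sum_squares(alpha) / (2 * C))
    cons = [cp.quad_form(alpha, K) <= t for K in kernels]   # kernels assumed PSD
    cons.append(cp.sum(alpha) == 0)
    cp.Problem(obj, cons).solve()
    # Stationarity in t gives sum of the constraint duals = 1/2, so mu_i = 2 * dual_i.
    mu = 2 * np.array([c.dual_value for c in cons[:-1]])
    return alpha.value, mu
```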

In some cases, the performance of ELM depends critically on the value of C. We show that the formulations (22) and (24) can be slightly reformulated, and this new formulation naturally leads to the estimation of the regularization parameter C in a joint framework. As can be seen from Eq. (22), the identity matrix appears in exactly the same form as the kernel matrices; we can therefore treat the regularization parameter as one more kernel coefficient and optimize all coefficients simultaneously. This leads to the following formulation:

$$\begin{array}{*{20}l} {\mathop {\hbox{min} }\limits_{\mu ,\lambda ,t} } \hfill & t \hfill & {} \hfill \\ {{\text{s}} . {\text{t}} .{\kern 1pt} } \hfill & {\left( {\begin{array}{*{20}c} {\sum\limits_{i = 0}^{p} {\mu_{i} K_{i} } } & {y + \lambda {\mathbf{1}}_{n} } \\ {(y + \lambda {\mathbf{1}}_{n} )^{T} } & t \\ \end{array} } \right) \ge 0,} \hfill & {} \hfill \\ {} \hfill & {\sum\limits_{i = 1}^{p} {\mu_{i} K_{i} } \ge 0,\;\sum\limits_{i = 0}^{p} {\mu_{i} = 1,{\kern 1pt} } } \hfill & {\mu_{0} \ge 0,{\kern 1pt} } \hfill \\ \end{array}$$
(25)

where \(\mu_{0} = \tfrac{1}{C}\) and K 0 = I. To optimize the regularization parameter in Eq. (24), we modify Eq. (23) slightly:

$$\begin{aligned} & \mathop {\hbox{min} }\limits_{{\mu :\mu \ge 0,\mu^{T} {\kern 1pt} {\mathbf{1}}_{p + 1} = 1}} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} \mathop {\hbox{max} }\limits_{{\alpha :\alpha^{T} {\kern 1pt} {\mathbf{1}}_{n} = 0}} \left\{ { - \frac{1}{2}\sum\limits_{i = 0}^{p} {\mu_{i} \alpha^{T} K_{i} \alpha + \frac{1}{2}\alpha^{T} y} } \right\} \\ & = {\kern 1pt} \mathop {\hbox{max} }\limits_{{\alpha :\alpha^{T} {\mathbf{1}}_{n} = 0}} {\kern 1pt} {\kern 1pt} {\kern 1pt} \mathop {\hbox{min} }\limits_{{\mu :\mu \ge 0,\mu^{T} {\kern 1pt} {\mathbf{1}}_{p + 1} = 1}} {\kern 1pt} \left\{ { - \frac{1}{2}\sum\limits_{i = 0}^{p} {\mu_{i} \alpha^{T} K_{i} \alpha + \frac{1}{2}\alpha^{T} y} } \right\} \\ {\kern 1pt} & = \mathop {\hbox{max} }\limits_{{\alpha :\alpha^{T} {\mathbf{1}}_{n} = 0}} \left\{ {\frac{1}{2}\alpha^{T} y - \frac{1}{2}\mathop {\hbox{max} }\limits_{{\mu :\mu \ge 0,\mu^{T} {\mathbf{1}}_{p + 1} = 1}} {\kern 1pt} \sum\limits_{i = 0}^{p} {\mu_{i} \alpha^{T} K_{i} \alpha } {\kern 1pt} {\kern 1pt} } \right\} \\ & = \mathop {\hbox{max} }\limits_{{\alpha :\alpha^{T} {\mathbf{1}}_{n} = 0}} \left\{ {\frac{1}{2}\alpha^{T} y - \frac{1}{2}\mathop {\hbox{max} }\limits_{p \ge i \ge 0} {\kern 1pt} \alpha^{T} K_{i} \alpha } \right\}, \\ \end{aligned}$$
(26)

where K_0 stands for the identity matrix I and μ_0 denotes the reciprocal of the regularization parameter. Replacing \(\mathop {\hbox{max} }\nolimits_{i} \alpha^{T} K_{i} \alpha\) by t and moving it to the constraints, we obtain the following quadratically constrained linear program:

$$\begin{array}{*{20}c} {\hbox{max} } & {\alpha^{T} y - t} & {} \\ {{\text{s}} .{\kern 1pt} {\text{t}} .} & {t \ge \alpha^{T} K_{i} \alpha ,} & {{\kern 1pt} i = 0, \ldots ,p,} \\ {} & {\alpha^{T} {\mathbf{1}}_{n} = 0.} & {} \\ \end{array}$$
(27)

We will show that the joint optimization of C works better in most cases in comparison with the approach of pre-specifying C.
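Continuing the CVXPY sketch above (so cvxpy and numpy are assumed imported), optimizing C jointly as in Eq. (27) amounts to appending the identity matrix as an extra kernel and dropping the explicit α^Tα/(2C) term; the weight recovered for the identity kernel then plays the role of 1/C. The names below are again ours.

```python
def solve_mkl_elm_joint_c(kernels, y):
    """Sketch of Eq. (27): append K_0 = I as one more kernel, so that mu_0 = 1/C."""
    n = len(y)
    all_kernels = [np.eye(n)] + list(kernels)
    alpha, t = cp.Variable(n), cp.Variable()
    obj = cp.Maximize(alpha @ y - t)
    cons = [cp.quad_form(alpha, K) <= t for K in all_kernels]
    cons.append(cp.sum(alpha) == 0)
    cp.Problem(obj, cons).solve()
    mu = np.array([c.dual_value for c in cons[:-1]])   # here the duals already sum to 1
    C_est = 1.0 / mu[0] if mu[0] > 0 else np.inf       # regularization recovered from mu_0
    return alpha.value, mu[1:], C_est
```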

3.3 SILP formulation

In this section, we show how these problems can be resolved by considering a novel dual formulation of the QCQP as a semi-infinite linear programming (SILP) problem. In the following formulation, K_j represents the jth kernel matrix in a set of p + 1 kernels, with the (p + 1)th kernel being the identity matrix. The MKL-ELM is formulated as

$$\begin{array}{*{20}l} {\mathop {\hbox{max} {\kern 1pt} }\limits_{\theta ,u} } \hfill & u \hfill & {} \hfill \\ {{\text{s}} . {\text{t}} .} \hfill & {\theta_{j} \ge 0,} \hfill & {j = 1, \ldots ,p + 1} \hfill \\ {} \hfill & {\sum\limits_{j = 1}^{p + 1} {\theta_{j}^{2} } \le 1,} \hfill & {} \hfill \\ {} \hfill & {\frac{1}{2}{\kern 1pt} \sum\limits_{j = 1}^{p + 1} {\theta_{j} f_{j} (\beta_{q} )} - \frac{1}{2}\sum\limits_{q = 1}^{k} {\beta_{q}^{T} Y_{q}^{ - 1} {\mathbf{1}}_{q} } \ge u,} \hfill & {\forall \beta_{q} ,{\kern 1pt} \,q = 1, \ldots ,k} \hfill \\ {} \hfill & {f_{j} (\beta_{q} ) = \sum\limits_{q = 1}^{k} {(\frac{1}{2}\beta_{q}^{T} } K_{j} \beta_{q} ),} \hfill & {j = 1, \ldots ,p + 1,\,q = 1, \ldots ,k} \hfill \\ {} \hfill & {} \hfill & {} \hfill \\ \end{array}$$
(28)

The MKL-ELM procedure is summarized in Algorithm 2.

Algorithm 2 MKL-ELM

Step 1 optimizes θ via linear programming. Since the regularization coefficient is automatically estimated as θ_{p+1}, step 3 simplifies to the linear system

$$\left[ {\begin{array}{*{20}c} 0 & {{\mathbf{1}}^{T} } \\ {\mathbf{1}} & {\varOmega^{(l)} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} 0 \\ {\beta^{(l)} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} 0 \\ {Y^{ - 1} {\mathbf{1}}} \\ \end{array} } \right]$$
(29)
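Under our reading of Eq. (29), with the bias entry pinned to zero and the regularization already absorbed into the θ_{p+1}-weighted identity kernel, step 3 reduces to a single dense linear solve. The sketch below uses hypothetical names and assumes binary labels in {−1, +1}, for which \(Y^{-1}{\mathbf{1}} = y\).

```python
import numpy as np

def mkl_elm_step3(kernels, theta, y):
    """Step 3 as we read Eq. (29): solve Omega^(l) beta^(l) = Y^{-1} 1."""
    Omega = sum(th * K for th, K in zip(theta, kernels))   # last kernel is the identity
    rhs = 1.0 / y            # Y = diag(y); for labels in {-1, +1} this equals y itself
    return np.linalg.solve(Omega, rhs)
```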

4 Empirical results

In this section, we perform classification experiments on the following data sets: Banana, Breast Cancer, Titanic, Waveform, German, Image, Heart, Diabetes, Ringnorm, Thyroid, Twonorm, Flare Solar, and Splice (Table 1).

Table 1 Data sets used in the experiments

In the experiments, we will compare the following:

  1. SK-ELM: single-kernel ELM

  2. MK-ELM(MEB) [33]: the radius-incorporated multiple-kernel ELM

4.1 Multiple different kinds of kernels

Here, four different kinds of kernel functions, i.e., the polynomial kernel \(k_{1} (x,y) = (1 + x^{T} y)^{d} ,\) the Gaussian kernel \(k_{2} (x,y) = \exp \left( {{{ - \left\| {x - y} \right\|^{2} } \mathord{\left/ {\vphantom {{ - \left\| {x - y} \right\|^{2} } {\sigma^{2} }}} \right. \kern-0pt} {\sigma^{2} }}} \right),\) the linear kernel \(k_{3} (x,y) = x^{T} y,\) and the Laplacian kernel \(k_{4} (x,y) = \exp ({{ - \left\| {x - y} \right\|} \mathord{\left/ {\vphantom {{ - \left\| {x - y} \right\|} {p)}}} \right. \kern-0pt} {p)}},\) are selected to construct the multiple kernel \(k = \sum\nolimits_{i = 1}^{4} {\mu_{i} k_{i} } ,\) where the corresponding kernel parameters are specified as d = 2, \(\sigma^{2} = 20,\) and p = 5 before the experiments. All kernel matrices K_i are normalized by replacing \(K_{i} (m,n)\) with \({{K_{i} (m,n)} \mathord{\left/ {\vphantom {{K_{i} (m,n)} {\sqrt {K_{i} (m,m)K_{i} (n,n)} }}} \right. \kern-0pt} {\sqrt {K_{i} (m,m)K_{i} (n,n)} }}\) so that each kernel matrix has a unit diagonal [1, 24].

Table 2 gives the results of the single-kernel ELM experiments on the Image, Ringnorm, and other data sets. The criteria TRA (%) and TSA (%) denote the accuracy on the training set and on the testing set, respectively. As can be seen from Table 2, the proposed algorithm is a promising tool. Most of the testing rates are high, basically attaining 90 % in the Laplacian kernel cases. A testing rate of about 88.66 % is obtained for the Ringnorm set in the ELM case, while for the Image data set this criterion exceeds 90.64 %. On the one hand, the comparisons in Table 2 suggest that the MKL-ELM has a higher (or at least the same) testing rate than the single kernel; on the other hand, the former is more time-consuming than the latter in parameter selection. Therefore, the MKL-ELM again proves to be a more promising tool than the single-kernel one. Furthermore, most experimental results indicate that the proposed algorithm has higher accuracy, as well as more robust stability, than the MK-ELM (MEB), SK-ELM, and ELM on the multi-class problems, which can be observed from the multiple-kernel experimental results in Table 2. Moreover, the proposed algorithm is less time-consuming than the MK-ELM (MEB).
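The kernel construction and normalization described at the start of this subsection can be sketched in a few lines (a sketch only; the parameter defaults d = 2, σ² = 20, p = 5 follow the values stated above, and the function names are ours):

```python
import numpy as np

def normalize_kernel(K):
    """Replace K(m, n) by K(m, n) / sqrt(K(m, m) K(n, n)) to obtain a unit diagonal."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def base_kernels(X, d=2, sigma2=20.0, p=5.0):
    """Polynomial, Gaussian, linear, and Laplacian kernel matrices on the rows of X."""
    G = X @ X.T
    sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G   # squared pairwise distances
    dist = np.sqrt(np.maximum(sq, 0))
    Ks = [(1 + G) ** d, np.exp(-sq / sigma2), G, np.exp(-dist / p)]
    return [normalize_kernel(K) for K in Ks]
```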

Table 2 Experimental results of our method, MK-ELM(MEB), SK-ELM, and ELM: multi-class data sets

Table 3 reports the optimal kernel weights \(\left\{ {\mu_{i} } \right\}_{i = 1}^{4}\) for each kernel function and the experimental results obtained through MKL, i.e., the SILP formulation in the ELM-SILP. The sum of the average values of μ_1, μ_2, μ_3, μ_4, and μ_0 is not equal to 1. The fact that the optimal kernel weights satisfy μ_1 = μ_2 = 0 while μ_3, μ_4 ≠ 0 helps explain why the MKL-ELM is likely to be better than the SK-ELM. Moreover, the TRA of the MKL-ELM attains the maximum TRA of the SK-ELM. A higher TRA with MK indicates better fitting capability of the MKL-ELM, while a higher TSA indicates better prediction capacity. The above experimental results suggest that the MKL-ELM has a lower upper bound on the expected risk and may give a potentially better data representation than the SK-ELM.

Table 3 Experimental results of multiple different types of kernels

4.2 Fusing kernel experiments

In this section, experiments are conducted with fusing kernel learning (FKL) to exhibit the performance of the fusing kernel. In every feature set, five Gaussian kernel functions with bandwidths \(\sigma^{2} \in \{ 0.1,\;1,\;10,\;20,\;40\}\) are used to form the fusing kernel \(k = \sum\nolimits_{i = 1}^{2} {\left( {\sum\nolimits_{j = 1}^{5} {\mu_{ij} k_{ij} } } \right)}\). On the basis of the results mentioned above, it is clear that the fusing kernel not only provides a good data representation of the feature sets but also distinguishes which kernels are fit for the underlying problem. Insight into the reasoning made by the fusing kernel can be gained from the kernel weights reported in Table 4, which lists the kernel weights, TRA, TSA, and CPU(s) results. Obviously, very high TRA and TSA performance is obtained by the fusing kernel. Of course, it should be pointed out that the accuracy improvement of the fusing kernel model comes at the cost of increased model complexity, which is reflected in the increased CPU time in Table 4.
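A sketch of how such a fusing kernel can be assembled (reusing normalize_kernel and the numpy import from the earlier sketch; the weights mu_ij would come from the SILP or QCQP solver, and the names are ours):

```python
def fusing_kernel(feature_sets, mu, bandwidths=(0.1, 1, 10, 20, 40)):
    """k = sum_i sum_j mu_ij k_ij: one Gaussian kernel per (feature set, bandwidth) pair."""
    K = 0.0
    for i, X in enumerate(feature_sets):
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances in feature set i
        for j, s2 in enumerate(bandwidths):
            K = K + mu[i][j] * normalize_kernel(np.exp(-sq / s2))
    return K
```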

Table 4 Experimental results of fusing kernels

Figure 1 shows the training time with a varying number of samples for QCQP and SILP. In terms of maximum sample size, QCQP handles up to 400 samples, while the SILP approach handles up to 1,300. In terms of training time, our approach costs the least time, three times faster than QCQP. In fact, our algorithm can solve problems of various scales. We find that the SILP terminates because of out-of-memory errors in Matlab, and the QCQP terminates because of the same problem in MOSEK.

Fig. 1
figure 1

Training time with varying number of samples for algorithms

The benchmark data set in Fig. 2 was constructed from 2,000 samples. We used different kernel widths to construct the RBF kernel matrices and increased the number of kernel matrices from 2 to 200. The QCQP formulation ran into memory issues when the number of kernels was larger than 80.

Fig. 2
figure 2

Comparison of the QCQP and SILP formulations for solving the MKL-ELM with different numbers of kernels

In Fig. 3, the benchmark data were made up of two linear kernel matrices and 1,000 samples. The samples were equally and randomly divided into various numbers of classes. The number of classes gradually increased from 2 to 20.

Fig. 3
figure 3

Comparison of the QCQP and SILP formulations for solving the MKL-ELM on data sets containing different numbers of classes

5 Conclusions and future research

We have investigated multiple-kernel learning based on ELM for classification in this paper. The problem is formulated as convex programs, and thus globally optimal solutions are guaranteed. The final kernel functions for the multiple-kernel models are determined by black-box learning from the initially proposed kernel functions. Theoretically, the kernel functions used to construct the multiple-kernel models may differ across problems, so problem insight could guide the selection of kernel functions to improve performance. Although some convex optimization problems are computationally expensive, the proposed algorithm can be solved efficiently. Furthermore, we consider the problem of optimizing the kernel, thus approaching the desirable goal of automated model selection. To evaluate the proposed algorithms, we have conducted extensive experiments, which demonstrate their effectiveness.

The proposed MKL-ELM classifier learns the optimal combination of multiple kernels on large-scale data sets. Although multiple-kernel ELM displays some advantages over single-kernel ELM both theoretically and experimentally, much work remains for future investigation, such as developing efficient algorithms to deal with the large-scale training problems that result from an increasing number of kernels.

From the practical point of view, our method can easily be applied in many applications, such as pattern recognition, time series prediction, high-magnification sample image inpainting, and bioinformatics.