
1 Introduction

Kernel methods such as support vector machines (SVM) and kernel Fisher discriminant analysis (KFDA) have been successfully applied to a wide variety of machine learning problems [1]. These methods map data points from the input space to a feature space, i.e., a higher-dimensional reproducing kernel Hilbert space (RKHS), where even relatively simple algorithms such as linear methods can deliver very impressive performance. The mapping is determined implicitly by a kernel function (or simply a kernel), which computes the inner product of data points in the feature space. Despite the popularity of kernel methods, there is not yet a well-established mechanism to guide kernel learning and selection. It is well known that selecting an appropriate kernel, and thereby an appropriate feature space, is of great importance to the success of kernel methods [2]. To address this issue, recent years have witnessed active research on learning effective kernels automatically from data. One popular technique for kernel learning and selection is multiple kernel learning (MKL) [3,4,5], which aims at learning a linear or nonlinear combination of a set of predefined kernels (base kernels) in order to identify a good target kernel for the application at hand. Compared with traditional kernel methods employing a fixed kernel, MKL offers the flexibility of automated kernel learning, and also reflects the fact that typical learning problems often involve multiple, heterogeneous data sources.

The idea of MKL can be applied to virtually all kernel methods, such as the commonly used SVM and KFDA, leading to SVM-based MKL and discriminant MKL, respectively. This paper focuses on the SVM-based MKL formulation. Specifically, we present a two-stage multiple kernel learning model based on the Hilbert-Schmidt independence criterion (HSIC), called HSIC-MKL. HSIC, which was initially introduced for measuring the statistical dependence between random variables or random processes [6], has been successfully applied in various machine learning problems [7], such as feature selection, clustering and subspace learning. This success rests on the fact that many existing learning tasks can be cast as problems of dependence maximization (or minimization). Motivated by this, in the first stage, we propose an HSIC-Lasso-based MKL formulation, which not only has a clear statistical interpretation, namely that minimally redundant kernels with maximum dependence on the output labels are found and combined, but whose globally optimal solution can also be computed efficiently by solving a Lasso optimization problem. In the second stage, an SVM is trained with the learned composite kernel to obtain the final decision function for classification. It should be pointed out that the HSIC Lasso [8, 9] was originally proposed for high-dimensional feature selection, where the kernels (for example, a Gaussian kernel for inputs and a delta kernel for outputs) must be predefined before feature selection, whereas our work employs the HSIC Lasso for MKL, aiming to learn an optimal composite kernel with which to train a kernel classifier.

2 Multiple Kernel Learning

In this section, we briefly review the MKL. Suppose we are given a set of labeled training samples \( \{ (\varvec{x}_{i} ,y_{i} )\}_{i = 1}^{n} \) in a binary classification problem, where \( \varvec{x}_{i} \in \text{X} \subset R^{d} \) is the input data and \( y_{i} \in \{ + 1, - 1\} \) is the corresponding class label. The goal of the SVM is to find an optimal hyperplane \( \varvec{w}^{{\rm T}} \phi (\varvec{x}) + b = 0 \) that separates the training points into two classes with the maximal margin, where \( \varvec{w} \) is the normal vector of the hyperplane, \( b \) is a bias, and \( \phi \) is a feature map which maps \( \varvec{x}_{i} \) to a high-dimensional feature space. This hyperplane can be obtained by solving the following optimization problem

$$ \begin{aligned} & { \hbox{min} }\frac{1}{2}\left\| \varvec{w} \right\|^{2} + C\sum\limits_{i = 1}^{n} {\xi_{i} } \\ & \text{s.t.}\quad y_{i} (\varvec{w}^{\text{T}} \phi (\varvec{x}_{i} ) + b) \ge 1 - \xi_{i} \, \\ & \quad \quad \xi_{i} \ge 0,\quad i = 1, \cdot \cdot \cdot ,n \\ \end{aligned} $$
(1)

where \( \varvec{\xi}= (\xi_{1} , \cdot \cdot \cdot ,\xi_{n} )^{\text{T}} \) is the vector of slack variables and \( C \) is the regularization parameter used to impose a trade-off between the training error and generalization.

To solve the SVM optimization problem, let \( \alpha_{i} \) be the Lagrange multiplier corresponding to the \( i\text{th} \) inequality constraint in (1); the dual problem of (1) can then be written as

$$ \begin{aligned} & \hbox{max} \sum\limits_{i = 1}^{n} {\alpha_{i} } - \frac{1}{2}\sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {y_{i} y_{j} \alpha_{i} \alpha_{j} k(\varvec{x}_{i} ,\varvec{x}_{j} )} } \\ & \text{s.t.}\quad \sum\limits_{i = 1}^{n} {\alpha_{i} } y_{i} = 0, \\ & \quad \quad 0 \le \alpha_{i} \le C,\quad i = 1, \cdot \cdot \cdot ,n. \\ \end{aligned} $$
(2)

where \( k(\varvec{x}_{i} ,\varvec{x}_{j} ) = \phi (\varvec{x}_{i} )^{\text{T}} \phi (\varvec{x}_{j} ) \) is the kernel function which implicitly defines the feature map \( \phi \).
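For concreteness, the following minimal sketch trains a kernel SVM of the form (2) with a single precomputed Gaussian kernel; the toy data, kernel width, and regularization value are illustrative only and are not part of the original text.

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel(X1, X2, sigma=1.0):
    """k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

# Toy data, for illustration only.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 3))
y_train = np.where(X_train[:, 0] > 0, 1, -1)

K_train = gaussian_kernel(X_train, X_train)          # n x n kernel matrix
svm = SVC(C=1.0, kernel="precomputed")               # solves the dual in (2)
svm.fit(K_train, y_train)

X_test = rng.normal(size=(10, 3))
K_test = gaussian_kernel(X_test, X_train)            # rows: test, columns: train
y_pred = svm.predict(K_test)
```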

Instead of formulating an optimization criterion with a fixed kernel \( k \), one can leave the kernel \( k \) as a combination of a set of predefined kernels, which leads to MKL [3,4,5]. MKL maps each sample to a multiple-kernel-induced feature space, and a linear classifier is learned in this space. The feature mapping used in MKL takes the form \( \phi ( \cdot ) = [\phi_{1}^{\text{T}} ( \cdot ), \cdots ,\phi_{M}^{\text{T}} ( \cdot )]^{\text{T}} \), which is induced by \( M \) predefined base kernels \( \{ k_{m} ( \cdot , \cdot )\}_{m = 1}^{M} \) with different kernel forms or different kernel parameters. The linear combination of the base kernels is given by \( k = \sum\nolimits_{m = 1}^{M} {\mu_{m} k_{m} } \), where \( \mu_{m} \) is the corresponding combination coefficient. Let \( \varvec{\mu}= (\mu_{1} , \cdots ,\mu_{M} )^{\text{T}} \in \Delta \), where \( \Delta \) is the domain of \( \varvec{\mu} \). Different constraints on \( \varvec{\mu} \) yield different MKL models. For example, when \( \varvec{\mu} \) lies in the simplex, i.e.,

$$ \Delta = \left\{ {\varvec{\mu}:\left\|\varvec{\mu}\right\|_{1} = \sum\limits_{m = 1}^{M} {\mu_{m} = 1} ,\mu_{m} \ge 0} \right\} $$
(3)

we call this the \( L_{1} \)-norm constraint on the kernel weights and the resulting model \( L_{1} \)-MKL [10]. Most MKL methods fall into this category. When

$$ \Delta = \left\{ {\varvec{\mu}:\left\|\varvec{\mu}\right\|_{p} \le 1,p > 1,\mu_{m} \ge 0} \right\} $$
(4)

we call this the \( L_{p} \)-norm constraint on the kernel weights and the resulting model \( L_{p} \)-MKL [11].
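As a small illustration of these constraint sets, the sketch below forms the combined kernel \( k = \sum\nolimits_{m = 1}^{M} \mu_{m} k_{m} \) from precomputed base kernel matrices and rescales a nonnegative weight vector so that it satisfies (3) or (4). The helper names are ours, and the rescaling is a simple normalization rather than a Euclidean projection.

```python
import numpy as np

def combine_kernels(base_kernels, mu):
    """Combined kernel matrix K = sum_m mu_m * K_m."""
    return sum(m * K for m, K in zip(mu, base_kernels))

def normalize_l1(mu):
    """Rescale nonnegative weights so that ||mu||_1 = 1, as in (3)."""
    mu = np.maximum(np.asarray(mu, dtype=float), 0.0)
    return mu / mu.sum()

def normalize_lp(mu, p=2):
    """Rescale nonnegative weights so that ||mu||_p <= 1, as in (4)."""
    mu = np.maximum(np.asarray(mu, dtype=float), 0.0)
    norm = np.sum(mu**p) ** (1.0 / p)
    return mu / norm if norm > 1.0 else mu
```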

As with the SVM, the dual problem of MKL can be written as

$$ \begin{aligned} & \hbox{max} \sum\limits_{i = 1}^{n} {\alpha_{i} } - \frac{1}{2}\sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {y_{i} y_{j} \alpha_{i} \alpha_{j} \sum\limits_{m = 1}^{M} {\mu_{m} k_{m} } (\varvec{x}_{i} ,\varvec{x}_{j} )} } \\ & \text{s.t.}\quad \sum\limits_{i = 1}^{n} {\alpha_{i} } y_{i} = 0,\;\varvec{\mu}\in \Delta , \\ & \quad \quad 0 \le \alpha_{i} \le C,\quad i = 1, \cdot \cdot \cdot ,n. \\ \end{aligned} $$
(5)

The goal of training MKL is to learn \( \mu_{m} \), \( \alpha_{i} \) and \( b \) with the given \( M \) base kernels, and the final decision function is given by

$$ f(\varvec{x}) = {\text{sgn}}\left( {\sum\limits_{i = 1}^{n} {\alpha_{i} y_{i} \sum\limits_{m = 1}^{M} {\mu_{m} k_{m} (\varvec{x}_{i} ,\varvec{x})} } + b} \right) $$
(6)

where the samples \( \varvec{x}_{\varvec{i}} \) with \( \alpha_{i} > 0 \) are called support vectors.
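The decision function (6) can be evaluated directly once the support vectors, their coefficients, the bias, and the kernel weights are available. The following naive reference sketch assumes these quantities come from a trained MKL model and that the base kernels are supplied as callables; it is not an optimized implementation.

```python
import numpy as np

def mkl_decision(x, X_sv, y_sv, alpha_sv, b, mu, base_kernels):
    """Evaluate (6): sgn( sum_i alpha_i y_i sum_m mu_m k_m(x_i, x) + b ).

    X_sv, y_sv, alpha_sv: support vectors, their labels and multipliers;
    base_kernels: list of callables k_m(x_i, x) returning a scalar.
    All of these are assumed to come from a trained MKL model.
    """
    score = b
    for x_i, y_i, a_i in zip(X_sv, y_sv, alpha_sv):
        k_val = sum(m * k(x_i, x) for m, k in zip(mu, base_kernels))
        score += a_i * y_i * k_val
    return np.sign(score)
```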

3 HSIC-MKL

In this section, we discuss in detail the two-stage MKL method (HSIC-MKL) for learning kernels in the form of a linear combination of \( M \) base kernels \( \{ k_{m} ( \cdot , \cdot )\}_{m = 1}^{M} \) or kernel matrices \( \{ {\mathbf{K}}_{m} \}_{m = 1}^{M} \). Each combination coefficient \( \mu_{m} \) is selected subject to the condition \( \mu_{m} \ge 0 \). In the first stage, the algorithm determines the combination coefficients \( \mu_{m} \), and in the second stage, an SVM is trained with the learned kernel.

We first introduce the notion of the HSIC. Let \( {\mathbf{e}} = (1, \cdots ,1)^{\text{T}} \in \text{R}^{n} \) and let \( {\mathbf{I}} \in \text{R}^{n \times n} \) be the identity matrix. Given the centering matrix \( {\mathbf{H}} = {\mathbf{I}} - {\mathbf{ee}}^{\text{T}} /n \in \text{R}^{n \times n} \), the centered kernel matrix associated with \( {\mathbf{K}} \) is given by \( {\bar{\mathbf{K}}} = {\mathbf{HKH}} \). Given two kernel matrices \( {\mathbf{K}}_{1} \) and \( {\mathbf{K}}_{2} \), the (empirical) HSIC between them is defined as

$$ HSIC({\mathbf{K}}_{1} ,{\mathbf{K}}_{2} ) = \frac{1}{{n^{2} }}\text{tr}({\mathbf{K}}_{1} {\mathbf{HK}}_{2} {\mathbf{H}}) $$
(7)
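A direct implementation of (7) is straightforward; the sketch below computes the empirical HSIC of two kernel matrices using the centering matrix \( {\mathbf{H}} \) defined above.

```python
import numpy as np

def hsic(K1, K2):
    """Empirical HSIC of (7): tr(K1 H K2 H) / n^2, with H = I - e e^T / n."""
    n = K1.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K1 @ H @ K2 @ H) / n**2
```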

Let \( {\bar{\mathbf{L}}} = {\mathbf{HLH}} \) and \( {\bar{\mathbf{K}}} = {\mathbf{HKH}} \), where \( {\mathbf{K}} \) and \( {\mathbf{L}} \) are the kernel matrices for the input data and for the output labels, respectively. We propose using the HSIC Lasso [8, 9] for estimating the combination coefficients \( \varvec{\mu} \):

$$ \begin{aligned} & \hbox{min} \;\frac{1}{2}\left\| {{\bar{\mathbf{L}}} - \sum\limits_{m = 1}^{M} {\mu_{m} {\bar{\mathbf{K}}}_{m} } } \right\|_{\text{F}}^{2} + \lambda \left\|\varvec{\mu}\right\|_{1} \\ & \text{s.t.}\quad \mu_{1} , \cdot \cdot \cdot ,\mu_{M} \ge 0 \, \\ \end{aligned} $$
(8)

where \( || \cdot ||_{\text{F}} \) is the Frobenius norm and \( \lambda > 0 \) is the regularization parameter. In (8), the first term states that we align the centered output kernel matrix \( {\bar{\mathbf{L}}} \) with a linear combination of the centered input base kernel matrices \( \{ {\bar{\mathbf{K}}}_{m} \}_{m = 1}^{M} \), while the second term drives the combination coefficients of irrelevant base kernels to zero, since the \( L_{1} \)-regularizer tends to produce a sparse solution. After estimating \( \varvec{\mu} \), we normalize each element of \( \varvec{\mu} \) as \( \mu_{m} \to \mu_{m} / \sum\nolimits_{m = 1}^{M} {\mu_{m} } \).
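Since \( \varvec{\mu} \ge 0 \), the \( L_{1} \) term in (8) is linear and the problem reduces to a nonnegative quadratic program. The sketch below solves it with a generic bound-constrained solver (L-BFGS-B), which is only one of many possible choices and is not the DAL solver mentioned later; the constant \( \tfrac{1}{2}\|{\bar{\mathbf{L}}}\|_{\text{F}}^{2} \) is dropped since it does not affect the minimizer, and the value of \( \lambda \) is illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def hsic_lasso_weights(Kbars, Lbar, lam=1e-2):
    """Solve (8) up to the constant 0.5 * ||Lbar||_F^2:
    min_{mu >= 0} 0.5 * mu^T A mu - c^T mu + lam * sum(mu),
    where A[m, o] = <Kbar_m, Kbar_o>_F and c[m] = <Kbar_m, Lbar>_F."""
    M = len(Kbars)
    A = np.array([[np.sum(Km * Ko) for Ko in Kbars] for Km in Kbars])
    c = np.array([np.sum(Km * Lbar) for Km in Kbars])

    def fun_and_grad(mu):
        value = 0.5 * mu @ A @ mu - c @ mu + lam * mu.sum()
        grad = A @ mu - c + lam
        return value, grad

    res = minimize(fun_and_grad, x0=np.zeros(M), jac=True,
                   method="L-BFGS-B", bounds=[(0.0, None)] * M)
    mu = res.x
    return mu / mu.sum() if mu.sum() > 0 else mu   # normalization step from the text
```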

Noting that \( < {\bar{\mathbf{K}}},{\bar{\mathbf{L}}} >_{\text{F}} = < {\bar{\mathbf{K}}},{\mathbf{L}} >_{\text{F}} = < {\mathbf{K}},{\bar{\mathbf{L}}} >_{\text{F}} = {\text{tr}}({\mathbf{KHLH}}) = n^{2} HSIC({\mathbf{K}},{\mathbf{L}}) \), we can rewrite the first term of (8) as

$$ \begin{aligned} & \frac{1}{2}\left\| {{\bar{\mathbf{L}}} - \sum\limits_{m = 1}^{M} {\mu_{m} {\bar{\mathbf{K}}}_{m} } } \right\|_{\text{F}}^{2} \\ & = \frac{1}{2}\left\langle {{\bar{\mathbf{L}}} - \sum\limits_{m = 1}^{M} {\mu_{m} {\bar{\mathbf{K}}}_{m} } ,{\bar{\mathbf{L}}} - \sum\limits_{m = 1}^{M} {\mu_{m} {\bar{\mathbf{K}}}_{m} } } \right\rangle_{\text{F}} \\ & = \frac{1}{2}\left\langle {{\bar{\mathbf{L}}},{\bar{\mathbf{L}}}} \right\rangle_{\text{F}} - \left\langle {{\bar{\mathbf{L}}},\sum\limits_{m = 1}^{M} {\mu_{m} {\bar{\mathbf{K}}}_{m} } } \right\rangle_{\text{F}} + \frac{1}{2}\left\langle {\sum\limits_{m = 1}^{M} {\mu_{m} {\bar{\mathbf{K}}}_{m} } ,\sum\limits_{m = 1}^{M} {\mu_{m} {\bar{\mathbf{K}}}_{m} } } \right\rangle_{\text{F}} \\ & = \frac{1}{2}\left\langle {{\bar{\mathbf{L}}},{\bar{\mathbf{L}}}} \right\rangle_{\text{F}} - \sum\limits_{m = 1}^{M} {\mu_{m} \left\langle {{\bar{\mathbf{L}}},{\bar{\mathbf{K}}}_{m} } \right\rangle_{\text{F}} } + \frac{1}{2}\sum\limits_{m = 1}^{M} {\sum\limits_{o = 1}^{M} {\mu_{m} \mu_{o} \left\langle {{\bar{\mathbf{K}}}_{m} ,{\bar{\mathbf{K}}}_{o} } \right\rangle_{\text{F}} } } \\ & = \frac{{n^{2} }}{2}HSIC({\mathbf{L}},{\mathbf{L}}) - n^{2} \sum\limits_{m = 1}^{M} {\mu_{m} HSIC({\mathbf{L}},{\mathbf{K}}_{m} ) + \frac{{n^{2} }}{2}\sum\limits_{m = 1}^{M} {\sum\limits_{o = 1}^{M} {\mu_{m} \mu_{o} HSIC({\mathbf{K}}_{m} ,{\mathbf{K}}_{o} )} } } \, \\ \end{aligned} $$
(9)

In (9), \( n^{2} \) is a constant and \( HSIC({\mathbf{L}},{\mathbf{L}}) \) does not depend on \( \varvec{\mu} \), so both can be ignored. The MKL formulation based on the HSIC Lasso therefore has a clear statistical interpretation. First, if the m-th kernel matrix \( {\mathbf{K}}_{m} \) has high dependence on the output matrix \( {\mathbf{L}} \), then \( HSIC({\mathbf{L}},{\mathbf{K}}_{m} ) \) takes a large value, and \( \mu_{m} \) should also be large so that (9) is minimized. On the other hand, if \( {\mathbf{K}}_{m} \) and \( {\mathbf{L}} \) are independent, \( HSIC({\mathbf{L}},{\mathbf{K}}_{m} ) \) is close to zero and \( \mu_{m} \) tends to be removed by the \( L_{1} \)-regularizer. This means that relevant kernels with strong dependence on the output \( {\mathbf{L}} \) tend to be selected by the HSIC Lasso. Second, if \( {\mathbf{K}}_{m} \) and \( {\mathbf{K}}_{o} \) are strongly dependent, i.e., one of them is redundant, then \( HSIC({\mathbf{K}}_{m} ,{\mathbf{K}}_{o} ) \) takes a large value and either \( \mu_{m} \) or \( \mu_{o} \) tends to be zero. This means that redundant kernels tend to be removed by the HSIC Lasso. In short, the HSIC Lasso tends to find non-redundant kernels with strong dependence on the output \( {\mathbf{L}} \), which is a preferable property in kernel learning.

To solve the HSIC Lasso problem in (8), many Lasso optimization techniques can be applied in practice, such as the dual augmented Lagrangian (DAL) method [12, 13], which has been successfully employed for high-dimensional feature selection [8, 9].

We sketch the overall procedure of the proposed HSIC-MKL in Algorithm 1, where the centered kernel matrix can be calculated by

$$ {\bar{\mathbf{K}}}_{ij} = {\mathbf{K}}_{ij} - \frac{1}{n}\sum\limits_{a = 1}^{n} {{\mathbf{K}}_{aj} } - \frac{1}{n}\sum\limits_{b = 1}^{n} {{\mathbf{K}}_{ib} } + \frac{1}{{n^{2} }}\sum\limits_{a = 1}^{n} {\sum\limits_{b = 1}^{n} {{\mathbf{K}}_{ab} } } $$
(10)
Algorithm 1. The proposed HSIC-MKL algorithm
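To make the two stages concrete, the following sketch strings together kernel centering via (10), the HSIC Lasso of (8) (reusing the hsic_lasso_weights helper from the sketch after Eq. (8)), and SVM training with the learned composite kernel. The choice of the output kernel \( {\mathbf{L}} = \varvec{yy}^{\text{T}} \) is our assumption for binary labels; the text only states that \( {\mathbf{L}} \) is the kernel matrix of the output labels.

```python
import numpy as np
from sklearn.svm import SVC

def center(K):
    """Centered kernel matrix as in (10): Kbar = H K H with H = I - e e^T / n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def hsic_mkl_train(base_kernels, y, lam=1e-2, C=1.0):
    """Two-stage HSIC-MKL sketch.

    base_kernels: precomputed n x n base kernel matrices on the training set.
    y: labels in {+1, -1}.  The output kernel L = y y^T is our assumption for
    binary labels; lam and C would be tuned by cross-validation.
    """
    # Stage 1: estimate the combination coefficients with the HSIC Lasso (8).
    Lbar = center(np.outer(y, y).astype(float))
    Kbars = [center(K) for K in base_kernels]
    mu = hsic_lasso_weights(Kbars, Lbar, lam=lam)   # helper from the sketch after (8)

    # Stage 2: train an SVM with the learned composite kernel.
    K_combined = sum(m * K for m, K in zip(mu, base_kernels))
    svm = SVC(C=C, kernel="precomputed").fit(K_combined, y)
    return mu, svm
```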

We analyze the computational complexity of Algorithm 1 in \( O \) notation. First, the computational complexity of calculating the centered kernel matrices in Step 3 is \( O(Mn^{2} ) \). Second, the complexity of the quadratic programming solver in Step 4 is \( O(TM^{3} ) \), where \( T \) is the number of iterations needed to solve (8). Finally, noting that the empirical SVM training complexity is \( O(n^{2.3} ) \) [14], the computational complexity of Step 6 is \( O(M + n^{2.3} ) \). Thus, the total computational complexity of the proposed HSIC-MKL is

$$ O(Mn^{2} ) + O(TM^{3} ) + O(M + n^{2.3} ) = O(Mn^{2} + TM^{3} + n^{2.3} ) $$
(11)

It should be noted that we assume the multiple base kernels (kernel matrices) are precomputed and loaded into memory before HSIC-MKL training. Hence, the computational cost of calculating the base kernels is not included in the above analysis.

4 Experimental Evaluation

In this section, we perform extensive experiments on binary classification problems to evaluate the efficacy of the proposed HSIC-MKL approach. We compare HSIC-MKL with the following state-of-the-art kernel learning algorithms:

  • AvgMKL: The average combination of multiple base kernels. It was reported that AvgMKL is competitive with many algorithms [3, 4].

  • SimpleMKL [10]: An algorithm that reformulates the mixed-norm regularization of the MKL problem as a weighted 2-norm regularization, with an \( L_{1} \)-norm constraint imposed on the kernel weights.

  • LpMKL [11]: An algorithm that generalizes the standard \( L_{1} \)-norm MKL to arbitrary \( L_{p} \)-norm \( (p > 1) \) MKL. We adopt the cutting plane algorithm with a second-order Taylor approximation of \( L_{p} \).

  • CKA-MKL [15]: The two-stage MKL with centered kernel alignment. A two-stage MKL method first learns the optimal kernel weights according to some criterion, and then applies the learned optimal kernel to train a kernel classifier.

For parameter settings, the regularization parameters \( C \) and \( \lambda \) are determined by 5-fold cross-validation on the training set. Specifically, we perform a grid search in one dimension (i.e., a line search) to choose the regularization parameter \( C \) from the set \( \{ 10^{ - 2} ,10^{ - 1} , \cdots ,10^{2} \} \) for all the compared methods. For the proposed HSIC-MKL approach, we perform a grid search over two dimensions, i.e., \( C \in \{ 10^{ - 2} ,10^{ - 1} , \cdots ,10^{2} \} \) and \( \lambda \in \{ 10^{ - 2} ,10^{ - 1} , \cdots ,10^{2} \} \). In addition, for LpMKL, we examine \( p = 2,3,4 \) and report the best results. All the methods are implemented in MATLAB within the framework of the SVM-KM toolbox. Note that SimpleMKL has its own implementation in the SimpleMKL software package, which builds on the SVM-KM toolbox.
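The parameter selection described above can be written as a simple cross-validation loop; the code below is a rough Python rendering of that procedure (the actual experiments use MATLAB and the SVM-KM toolbox), reusing the hsic_mkl_train helper from the earlier sketch.

```python
import numpy as np
from itertools import product
from sklearn.model_selection import StratifiedKFold

def select_C_lambda(base_kernels, y, C_grid, lam_grid, n_splits=5):
    """5-fold cross-validation over (C, lambda); hsic_mkl_train is the
    helper from the earlier sketch."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    best_params, best_acc = None, -np.inf
    for C, lam in product(C_grid, lam_grid):
        fold_accs = []
        for tr, va in skf.split(np.zeros((len(y), 1)), y):
            # Restrict each precomputed kernel matrix to the fold indices.
            K_tr = [K[np.ix_(tr, tr)] for K in base_kernels]
            K_va = [K[np.ix_(va, tr)] for K in base_kernels]
            mu, svm = hsic_mkl_train(K_tr, y[tr], lam=lam, C=C)
            K_va_comb = sum(m * K for m, K in zip(mu, K_va))
            fold_accs.append(np.mean(svm.predict(K_va_comb) == y[va]))
        if np.mean(fold_accs) > best_acc:
            best_acc, best_params = np.mean(fold_accs), (C, lam)
    return best_params

C_grid = [10.0**k for k in range(-2, 3)]    # 1e-2, ..., 1e2
lam_grid = [10.0**k for k in range(-2, 3)]
```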

We select eight popular binary classification data sets, i.e., Australian Credit Approval, Breast Cancer Wisconsin (Original), Pima Indians Diabetes, German Credit Data, Heart, Ionosphere, Liver Disorders, and Sonar, from the UCI machine learning repository [16]. For Breast Cancer Wisconsin (Original), we removed the samples containing missing attribute values. Table 1 summarizes these data sets, listing for each one its short name, the number of samples, the number of features, and its original name.

Table 1. Statistics of the selected eight data sets from UCI

For each data set, we partition it into a training set and a test set by stratified sampling (so that the split follows the class prior probabilities): 50% of the data set serves as the training set and the remaining 50% as the test set. The training samples are normalized to zero mean and unit variance, and the test samples are normalized using the mean and variance computed on the training data. Following the settings of previous MKL studies [10], we use the Gaussian kernel \( k(\varvec{x}_{i} ,\varvec{x}_{j} ) = \exp ( - \left\| {\varvec{x}_{i} - \varvec{x}_{j} } \right\|^{2} / (2\sigma^{2} )) \) and the polynomial kernel \( k(\varvec{x}_{i} ,\varvec{x}_{j} ) = (\varvec{x}_{i} \cdot \varvec{x}_{j} + 1)^{d} \) as the base kernels:

  • Gaussian kernels with ten different widths \( \sigma \in \{ 2^{ - 3} ,2^{ - 2} , \cdots ,2^{6} \} \) for each individual dimension as well as all dimensions.

  • Polynomial kernels with three different degrees \( d \in \{ 1,2,3\} \) for each individual feature as well as all features.

All kernel matrices are normalized to unit trace and precomputed prior to running the algorithms.
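A possible construction of this base kernel set is sketched below: Gaussian kernels with the ten listed widths and polynomial kernels of degrees 1 to 3, each computed on every single feature and on all features, with every kernel matrix normalized to unit trace. The function names are ours.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    sq = (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-sq / (2.0 * sigma**2))

def polynomial_kernel(X1, X2, degree):
    return (X1 @ X2.T + 1.0) ** degree

def build_base_kernels(X):
    """Base kernel set described above: Gaussian kernels with ten widths and
    polynomial kernels of degrees 1-3, on each single feature and on all
    features, each matrix normalized to unit trace."""
    n, d = X.shape
    feature_views = [X[:, [j]] for j in range(d)] + [X]
    kernels = []
    for V in feature_views:
        for sigma in [2.0**k for k in range(-3, 7)]:        # 2^-3, ..., 2^6
            kernels.append(gaussian_kernel(V, V, sigma))
        for degree in (1, 2, 3):
            kernels.append(polynomial_kernel(V, V, degree))
    return [K / np.trace(K) for K in kernels]
```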

To obtain stable results, we independently repeat the splitting of each data set and run each algorithm 20 times. The average classification accuracies and standard deviations of each algorithm are reported in Table 2, where bold numbers denote the best performance among the MKL methods on each data set. To conduct a rigorous comparison, the paired t-test [17] is performed. The paired t-test assesses whether the difference between two compared algorithms on one data set is significant; its p-value is the probability that the two sets of compared results come from distributions with an equal mean, and a p-value below 0.05 is considered statistically significant. The win-tie-loss (W-T-L) summarizations based on the paired \( t \)-test are listed in Table 3, comparing HSIC-MKL against SimpleMKL, LpMKL, and CKA-MKL, respectively. For two compared algorithms, say Algorithm 1 vs. Algorithm 2, a win or a loss means that Algorithm 1 is significantly better or worse than Algorithm 2 on a data set, while a tie means that the two algorithms perform comparably.
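The win-tie-loss decision for a single data set can be reproduced with a standard paired t-test routine; in the sketch below the accuracy vectors are illustrative placeholders (the actual experiments use 20 repetitions).

```python
import numpy as np
from scipy.stats import ttest_rel

# Accuracies of two algorithms over repeated splits of one data set
# (illustrative values only; the experiments use 20 repetitions).
acc_hsic_mkl = np.array([0.86, 0.84, 0.85, 0.87, 0.85])
acc_baseline = np.array([0.83, 0.82, 0.84, 0.85, 0.83])

t_stat, p_value = ttest_rel(acc_hsic_mkl, acc_baseline)
if p_value < 0.05:
    outcome = "win" if acc_hsic_mkl.mean() > acc_baseline.mean() else "loss"
else:
    outcome = "tie"
print(outcome, p_value)
```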

Table 2. Classification accuracy comparison among different MKL algorithms on UCI data sets
Table 3. Significance test of classification results on UCI data sets

From Tables 2 and 3, we find that the proposed HSIC-MKL consistently achieves the best overall classification performance. Among the eight evaluated data sets, SimpleMKL, LpMKL and CKA-MKL report 1, 2 and 1 best results, respectively, while HSIC-MKL reports 4 best results. From the viewpoint of the significance test, we make the following observations. Although HSIC-MKL is outperformed by LpMKL on the German and Sonar data sets, it produces significantly better classification performance than LpMKL on the Australian, Diabetes, Heart, Ionosphere and Liver data sets. Compared with SimpleMKL, HSIC-MKL is significantly better on the Australian, Diabetes, German, Ionosphere, Liver and Sonar data sets, and yields the same performance on the remaining data sets. Compared with CKA-MKL, HSIC-MKL is significantly better on the Diabetes, Heart, Ionosphere and Liver data sets, and yields the same performance on the remaining data sets. Overall, HSIC-MKL outperforms SimpleMKL, LpMKL and CKA-MKL.

5 Conclusion

We have presented an effective two-stage MKL algorithm based on the notion of HSIC. By examining the connection between MKL and the HSIC Lasso, we showed that the proposed algorithm not only has a clear statistical interpretation, namely that minimally redundant kernels with maximum dependence on the output labels are found and combined, but also that its globally optimal solution can be computed efficiently by solving a Lasso optimization problem. Comprehensive experiments on a number of benchmark data sets demonstrate the promise of the proposed algorithm. Future work will focus on further validating the proposed algorithm in more real-world applications, such as computer vision, speech and signal processing, and natural language processing. Moreover, extending the proposed model to the extreme learning machine and to domain transfer learning, as well as investigating the theoretical properties of the proposed algorithm, are important directions for future research.