1 Introduction

Document classification is an important research problem in information retrieval. Many classification techniques based on machine learning have been proposed [1,2,3,4,5,6,7]. Support Vector Machine (SVM) [8] and k-Nearest Neighbor (k-NN) [9] are two popular methods that have been widely applied to the classification of documents such as text and images [10,11,12,13], yet both rely heavily on a good distance metric in the feature space. It is therefore important to learn a suitable distance metric from the training data rather than simply using the Euclidean metric in document classification; metric learning is proposed to address this issue.

Metric learning (ML) aims to learn a proper distance metric from the structure of the training data to measure the distance or similarity between samples. The Mahalanobis distance [14] is the most widely studied metric, and it is usually learned by learning a symmetric positive semi-definite matrix. Xing et al. [15] propose the first metric learning method for the Mahalanobis distance based on pairwise constraints; the resulting optimization problem is convex and is solved by a projected gradient descent method. Weinberger and Saul [16] propose Large Margin Nearest Neighbors (LMNN), which is inspired by Neighbourhood Component Analysis (NCA) [17] and is designed using pairwise constraints and target neighbors. The optimization problem of LMNN can be reformulated as a semi-definite program by introducing slack variables, but these slack variables lead to many more constraints. These methods can effectively improve classification accuracy; however, the distance matrix learned at each iteration is not always positive semi-definite, so it must be projected onto the cone of positive semi-definite matrices before the next iteration. Moreover, the learned metrics are all linear, which generally do not suit nonlinear problems such as document classification.

Many nonlinear methods have been proposed for metric learning. The kernel method is widely used; its main idea is the kernelization of linear metric learning methods. When a kernel function is given, it induces a Hilbert space in which a linear metric can be learned by a kernel-based metric learning method. Popular kernel-based metric learning methods include Kernel Discriminative Component Analysis (KDCA) [18], Large Margin Component Analysis (LMCA) [19], and Information-Theoretic Metric Learning (ITML) [20]. The kernel method is also widely adopted in document classification [21, 22]. However, the performance of these methods depends on the choice of kernel function, and Multiple Kernel Learning (MKL) was proposed to address this problem. In MKL, a proper kernel is learned as a combination of given basic kernels [23,24,25]. Inspired by MKL for SVM, Wang et al. [26] propose a framework combining ML and MKL for nonlinear metric learning, in which a desirable Mahalanobis distance is learned in a reproducing kernel Hilbert space (RKHS) induced by the basic kernels. This framework is general, yet its efficiency in real-world applications is still strongly influenced by the adopted metric learning method.

Motivated by the above, in this paper we propose a nonlinear metric learning method for the document classification problem. The proposed optimization problem can be decomposed into two subproblems. The first is to learn a linear metric in the kernel space induced by the basic kernels; here we learn a positive definite matrix on the group of positive definite matrices, rather than a positive semi-definite matrix, which avoids the projection onto the cone of positive semi-definite matrices. The second is to learn the combination coefficients of the basic kernels so as to provide a proper kernel function. Our optimization problem is solved by alternating between these two subproblems.

Our work is inspired by LMNN [16] and the methods of Wang et al. [26] and Ying et al. [27]. The main contributions of our work are summarized as follows. First, the nonlinear metric in our method is represented by a weighted linear combination of basic kernels, and this weighted combination better reflects the relationships among the basic kernels and the distribution of the data. Second, our optimization problem adopts triplet constraints rather than pairwise constraints, so it can be solved without introducing the slack variables mentioned in [16]; in particular, it can be transformed into an unconstrained optimization on the group of positive definite matrices. Last, our algorithm uses an adaptive step size rather than the constant step size in Ying et al. [27], which ensures that the search step in each iteration is optimal; this avoids the reduced classification accuracy caused by a too-large constant step size and the increased number of iterations caused by a too-small one.

The rest of this paper is organized as follows. Section 2 reviews related work and introduces our nonlinear metric learning method, Sect. 3 proposes an improved intrinsic steepest descent algorithm, Sect. 4 presents our experimental results on document classification, and Sect. 5 concludes this paper.

2 Nonlinear Metric Learning Method with Multiple Kernel

2.1 Mahalanobis Distance via Multiple Kernel

Let \(\mathcal {X}=\{x_{1}, \ldots ,x_{n}\}\subseteq \mathbb {R}^d\) be the set of samples, where \(x_{i}\in \mathbb {R}^d\), \(i=1,\ldots , n\), and let \(X=\left[ x_{1}, \ldots , x_{n}\right] \) be the data matrix corresponding to \(\mathcal {X}\). Then the Mahalanobis distance is defined by

$$\begin{aligned} D(x_{i},x_{j}) = \sqrt{(x_{i}-x_{j})^T\mathbf M (x_{i}-x_{j})} \end{aligned}$$
(1)

where \(\mathbf M \) is a symmetric positive semi-definite matrix, denoted by \(\mathbf M \succeq 0\). \(\mathbf M \) can be decomposed as \(\mathbf M = \textit{L}^T\textit{L}\), where \(\textit{L}: \mathbb {R}^d \rightarrow \mathbb {R}^d \) is a linear transformation, so learning \(\mathbf M \) is equivalent to learning L.
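As a concrete illustration, the following is a minimal NumPy sketch of computing (1) through the factorization \(\mathbf M = L^TL\); the data and the matrix L here are randomly generated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x_i, x_j = rng.normal(size=d), rng.normal(size=d)

L = rng.normal(size=(d, d))
M = L.T @ L                                    # M = L^T L is symmetric positive semi-definite

diff = x_i - x_j
mahalanobis = np.sqrt(diff @ M @ diff)         # D(x_i, x_j) as in Eq. (1)
after_transform = np.linalg.norm(L @ x_i - L @ x_j)   # Euclidean distance after applying L
assert np.isclose(mahalanobis, after_transform)
```

The assertion reflects the equivalence noted above: the Mahalanobis distance under \(\mathbf M = L^TL\) equals the Euclidean distance after the linear transformation L.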

Let \(\mathcal {H}\) be the feature space of \(\mathcal {X}\). A kernel function in \(\mathcal {H}\) is defined by

$$\begin{aligned} \textit{k}(x_{i},x_{j}) = \langle \phi (x_{i}),\phi (x_{j})\rangle = \phi (x_{i})^T\phi (x_{j}), x_{i}, x_{j} \in \mathcal {X} \end{aligned}$$

where \(\phi :x \rightarrow \phi (x) \in \mathcal {H}\) is a nonlinear map. The kernel matrix is \(\mathbf K = (\textit{k}_{ij})_{n\times n}\) with \(\textit{k}_{ij} = \textit{k}(x_{i},x_{j})\), which is a symmetric positive semi-definite matrix. Then the Mahalanobis distance in \(\mathcal {H}\) is

$$\begin{aligned} D(\phi (x_{i}),\phi (x_{j})) = \sqrt{(\phi (x_{i})-\phi (x_{j}))^T\mathbf M (\phi (x_{i})-\phi (x_{j}))} \end{aligned}$$
(2)

It is difficult to find a proper mapping \(\phi \) for Eq. (2). Jain et al. [28] present the form \(\mathbf M = \eta \mathbf I + \varPhi (\mathbf X )\mathbf A \varPhi (\mathbf X )^T \), where \(\mathbf A \succeq 0\), \(\varPhi (\mathbf X ) = \left[ \phi (x_{1}), \ldots , \phi (x_{n})\right] \), \(\mathbf I \) is the identity matrix, and \(\eta \) is a constant, usually \(\eta = 0\). Then learning \(\mathbf M \) in \(\mathcal {H}\) only requires learning \(\mathbf A \). Let \( K^i \) be the i-th column of \(\mathbf K \); the square of (2) can be rewritten as

$$\begin{aligned} D^{2}(\phi ({x_{i}}),\phi ({x_{j}})) &= (\phi ({x_{i}}) - \phi ({x_{j}}))^T\varPhi (\mathbf X )\mathbf A \varPhi (\mathbf X )^T(\phi ({x_{i}}) - \phi ({x_{j}}))\nonumber \\ &= ({K^i} - {K^j})^T\mathbf A ({K^i} - {K^j}) \end{aligned}$$
(3)

which is the expression of the Mahalanobis distance in the kernel space. From (3), once a kernel function is given, a kernel matrix \(\mathbf K \) is defined, and it then suffices to learn \(\mathbf A \) in the kernel space to solve the nonlinear classification problem; this result is proved in [26]. So the key problem is how to choose a proper kernel function. Cross-validation is a common choice, yet it incurs high computational costs.

Multiple kernel learning aims to learn a desirable kernel from given basic kernels; it is an efficient approach that not only learns a proper kernel but also reduces the computational cost. The kernel function in MKL is usually defined as

$$\begin{aligned} k({x_{i}},{x_{j}}) = \sum \limits _{s = 1}^{m} {{\mu _{s}}{k_s}({x_{i}},{x_{j}})}, \quad {\mu _{s}} \ge 0. \end{aligned}$$
(4)

where \(\{ {k_s}({x_{i}},{x_{j}})\} _{s = 1}^m\) are the basic kernels. Using this linear combination of basic kernels, we extend (3) to a new representation in the kernel space induced by the basic kernels as follows.

$$\begin{aligned} D^{2}(\phi ({x_{i}}),\phi ({x_{j}})) &= \sum \limits _{s = 1}^{m} {\mu _{s}}\left( K_{s}^{i} - K_{s}^{j}\right) ^T \mathbf A \sum \limits _{s = 1}^{m} {\mu _{s}}\left( K_{s}^{i} - K_{s}^{j}\right) \nonumber \\ &= \left( \sum \limits _{s = 1}^{m} {\mu _{s}}K_{s}^{i} - \sum \limits _{s = 1}^{m} {\mu _{s}}K_{s}^{j}\right) ^T \mathbf A \left( \sum \limits _{s = 1}^{m} {\mu _{s}}K_{s}^{i} - \sum \limits _{s = 1}^{m} {\mu _{s}}K_{s}^{j}\right) \end{aligned}$$
(5)

where \(K_{s}^{i}\) is the i-th column of the kernel matrix corresponding to the s-th basic kernel. Then we have a kernel matrix \(\mathbf K =\sum _{s = 1}^{m} {\mu _{s}K_{s}}\). Learning a Mahalanobis distance now amounts to learning \(\mathbf A \) and the combination coefficients \(\mu _{s}, s=1,\ldots ,m \) from the kernel matrix \(\mathbf K \).
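The distance (5) is straightforward to compute once the basic kernel matrices are available. The sketch below is illustrative only, assuming precomputed kernel matrices \(K_{s}\), weights \(\mu \), and a positive definite \(\mathbf A \); the function name is ours.

```python
import numpy as np

def mk_distance_sq(K_list, mu, A, i, j):
    """Squared multiple-kernel Mahalanobis distance of Eq. (5).

    K_list : list of n x n basic kernel matrices K_s
    mu     : non-negative combination coefficients, one per basic kernel
    A      : positive definite matrix learned in the kernel space
    """
    K = sum(w * Ks for w, Ks in zip(mu, K_list))   # combined kernel K = sum_s mu_s K_s
    diff = K[:, i] - K[:, j]                       # K^i - K^j, the i-th and j-th columns
    return float(diff @ A @ diff)
```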

2.2 Nonlinear Metric Learning Model

Let \(\mathcal {Y}=\{y_{1}, \ldots , y_{l}\}\) be the class labels corresponding to \(\mathcal {X}\). Two types of constraints are commonly used in metric learning: pairwise constraints and triplet constraints. In a pairwise constraint \((x_{i},x_{j})\), the pair is labeled as belonging to the same class or not, and the metric is learned so that samples with the same label are kept closer while samples with different labels are pushed farther apart. LMNN also learns a metric from pairwise constraints, with the goal that the k nearest neighbors always belong to the same class while samples from different classes are separated by a large margin [16], which makes the computation efficient. In a triplet constraint \((x_{i},x_{j},x_{k})\), \((x_{i},x_{j})\) and \((x_{i},x_{k})\) are pairwise constraints, and the metric is learned so that \(x_{i}\) and \(x_{j}\) are closer than \(x_{i}\) and \(x_{k}\). Wang et al. [29] propose a training goal that pushes \(x_{i}\) and \(x_{j}\) with the same label together and pulls \(x_{i}\) and \(x_{k}\) with a different label apart, so that under the learned distance \(x_{j}\) is closer to \(x_{i}\) than \(x_{k}\) is; this yields the triplet constraint \(\mathcal {T}=\{(x_{i},x_{j},x_{k}): 1+D_{ij}^{2}< D_{ik}^{2}\}\). Yet the constraint \(\mathcal {T}\) is strongly influenced by the distribution of the data. Ying et al. [27] revise \(\mathcal {T}\) to

$$\begin{aligned} \mathcal {T}'=\{(x_{i},x_{j},x_{k}): D_{ij}^{2}< \gamma D_{ik}^{2}, 0< \gamma <1 \} \end{aligned}$$
(6)

This triplet constraint balances the influence between intra-class and inter-class data through the parameter \(\gamma \), thereby overcoming the dependence of \(\mathcal {T}\) on the distribution of the data.

When learning the Mahalanobis distance, our goal is to minimize the distances between an input sample and its target neighbors, while keeping the distances to target neighbors with the same label smaller than the distances to neighbors with a different label; that is, the input sample and its target neighbors must satisfy the triplet constraint \(\mathcal {T}'\). Therefore, we propose the following linear supervised metric learning method.

$$\begin{aligned}&\min \limits _\mathbf{A } ~~\lambda \sum \limits _{i, j \in \mathcal{N}(i)}^{} {D_{ij}^{2}} + (1 - \lambda )\sum \limits _{i, j, k \in \mathcal{N}(i),\mathcal {T}'}\left( D_{ij}^{2}-\gamma D_{ik}^{2}\right) \end{aligned}$$
(7)
$$\begin{aligned}&\text {s.t.}\quad \mathbf A \succ 0. \end{aligned}$$
(8)

where \(\mathcal {N}(i)\) is the nearest-neighbor list of \(x_{i}\), \(\lambda \) is a trade-off parameter between the first and second terms, \(0\le \lambda \le 1\), and \(\mathbf A \) is a positive definite matrix, denoted by \(\mathbf A \succ 0\). Here the constraint is \(\mathbf A \succ 0\) rather than \(\mathbf A \succeq 0\), so we do not need a projection to ensure that \(\mathbf A \) remains positive semi-definite after each iteration. The first term of the cost function ensures that the distances between an input sample and its target neighbors are sufficiently small; the second term pulls an input sample and its target neighbors with the same label closer while pushing neighbors with different labels farther apart.

Similar to LMNN, we define \(\eta _{ij}\) as follows: if \(x_{j}\) is a target neighbor of \(x_{i}\), then \(\eta _{ij}=1\); otherwise \(\eta _{ij}=0\). The target-neighbor matrix is \(H = {({\eta _{ij}})_{n \times n}}\). We also define \(y_{ij}\) as follows: if \(x_{i}\) and \(x_{j}\) have the same label, then \(y_{ij}=1\); otherwise \(y_{ij}=0\). The indicator matrix is \(Y = {({y_{ij}})_{n \times n}}\). Then, using the Mahalanobis distance (5), we develop problem (7)–(8) into the following nonlinear optimization problem in the kernel space induced by the basic kernels.

$$\begin{aligned}&\min \limits _\mathbf{A ,\mu } ~~\lambda \sum \limits _{i = 1}^n {\sum \limits _{j = 1}^n {{\eta _{ij}}D_{ij}^{2}} } + (1 - \lambda )\sum \limits _{i,j,k = 1}^n {{\eta _{ij}}} \left( 1 - {y_{ik}}\right) \left( D_{ij}^{2} - \gamma D_{ik}^{2}\right) \end{aligned}$$
(9)
$$\begin{aligned}&\text {s.t.}\quad \mathbf A \succ 0, {\mu _{s}} \ge 0, \sum \limits _{s = 1}^{m} {{\mu _{s}}} = 1. \end{aligned}$$
(10)

Since \(\mathbf A \succ 0\), it can be decomposed as \(\mathbf A =\textit{Q}\textit{Q}^T\), and we can further rewrite (5) as

$$\begin{aligned} D^{2}(\phi ({x_{i}}),\phi ({x_{j}})) &= \left( \sum \limits _{s = 1}^{m} {\mu _{s}}K_{s}^{i} - \sum \limits _{s = 1}^{m} {\mu _{s}}K_{s}^{j}\right) ^TQ{Q^{T}}\left( \sum \limits _{s = 1}^{m} {\mu _{s}}K_{s}^{i} - \sum \limits _{s = 1}^{m} {\mu _{s}}K_{s}^{j}\right) \\ &= \sum \limits _{s,t = 1}^m {\mu _{s}}{\mu _{t}}\left( {K_{s}^{i}}^{T}Q{Q^{T}}K_{t}^{i} + {K_{s}^{j}}^{T}Q{Q^{T}}K_{t}^{j} - 2{K_{s}^{i}}^{T}Q{Q^{T}}K_{t}^{j}\right) \end{aligned}$$

Then, in problem (9)–(10), the first term can be reformulated to

$$\begin{aligned} \sum \limits _{i = 1}^n {\sum \limits _{j = 1}^n {{\eta _{ij}}D_{ij}^{2}} } &= 2\sum \limits _{s,t = 1}^m {\mu _{s}}{\mu _{t}}\sum \limits _{i,j = 1}^n {\eta _{ij}}\left( {K_{s}^{i}}^{T}Q{Q^{T}}K_{t}^{i} - {K_{s}^{i}}^{T}Q{Q^{T}}K_{t}^{j}\right) \\ &= 2\,tr\left( {Q^{T}}\mathbf K ({D_{H}} - H)\mathbf K ^{T}Q\right) = 2\,tr\left( \mathbf K ({D_{H}} - H) \mathbf K ^{T}\mathbf A \right) \end{aligned}$$

where \(D_{H} = diag(He)\), e is the column vector of all ones, and tr is the trace operator. In the same way, the second term can be reformulated as

$$\begin{aligned} \sum \limits _{i,j,k}^n {\eta _{ij}}\left( 1 - {y_{ik}}\right) \left( D_{ij}^{2} - \gamma D_{ik}^{2}\right) =2\,tr\left( {Q^{T}}\mathbf K \left( {D_{c}} - C\right) \mathbf K ^{T}Q\right) =2\,tr\left( \mathbf K \left( {D_{c}} - C\right) \mathbf K ^{T}\mathbf A \right) \end{aligned}$$

where \({D_{c}}=diag(Ce)\), \(C = diag((e{e^{T}}-Y)e)H-\gamma diag(He)(e{e^{T}}-Y)\).
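For concreteness, H, Y, and C (together with \(D_{H}\) and \(D_{c}\)) can be assembled as in the following sketch; the neighborhood size k and the function names are our own illustrative choices, not notation from the paper.

```python
import numpy as np

def build_H_Y(X, labels, k=3):
    """H = (eta_ij): x_j is one of the k nearest same-label neighbors of x_i; Y = (y_ij): same label."""
    n = X.shape[0]
    Y = (labels[:, None] == labels[None, :]).astype(float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)              # a sample is not its own target neighbor
    H = np.zeros((n, n))
    for i in range(n):
        same = np.flatnonzero((Y[i] == 1) & np.isfinite(dist[i]))
        H[i, same[np.argsort(dist[i, same])[:k]]] = 1.0
    return H, Y

def build_C(H, Y, gamma):
    """C = diag((ee^T - Y)e) H - gamma diag(He)(ee^T - Y), together with D_H = diag(He), D_c = diag(Ce)."""
    n = H.shape[0]
    e, J = np.ones(n), np.ones((n, n))
    C = np.diag((J - Y) @ e) @ H - gamma * np.diag(H @ e) @ (J - Y)
    return C, np.diag(H @ e), np.diag(C @ e)    # C, D_H, D_c
```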

Now we rewrite problem (9)–(10) as follows.

$$\begin{aligned}&\min \limits _\mathbf{A ,\mu } ~~\lambda tr\left( \mathbf K \left( {D_{H}} - H\right) \mathbf K ^{T} \mathbf A \right) + \left( 1 - \lambda \right) tr\left( \mathbf K \left( {D_{c}} - C\right) \mathbf K ^{T}{} \mathbf A \right) \end{aligned}$$
(11)
$$\begin{aligned}&\text {s.t.}\quad \mathbf A \succ 0, {\mu _{s}} \ge 0, \sum \limits _{s = 1}^{m} {{\mu _{s}}} = 1. \end{aligned}$$
(12)
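Given a combined kernel matrix \(\mathbf K \) and a candidate \(\mathbf A \), the cost in (11) can then be evaluated directly; this sketch reuses build_C from above and assumes the constraints (12) are handled elsewhere.

```python
import numpy as np

def objective(K, A, H, Y, lam=0.5, gamma=0.5):
    """Value of the cost function in problem (11)-(12) for a given K and A."""
    C, D_H, D_c = build_C(H, Y, gamma)
    return (lam * np.trace(K @ (D_H - H) @ K.T @ A)
            + (1 - lam) * np.trace(K @ (D_c - C) @ K.T @ A))
```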

3 Algorithms

The optimization problem (11)–(12) can be decomposed into two subproblems. Given \(\mu \), the first is problem (13)–(14).

$$\begin{aligned}&\mathop {\min }\limits _\mathbf{A } \mathrm{{ }}~\lambda tr\left( \mathbf K \left( {D_{H}} - H\right) \mathbf K ^{T}{} \mathbf A \right) +\left( 1 - \lambda \right) tr\left( \mathbf K \left( {D_{c}} - C\right) \mathbf K ^{T}{} \mathbf A \right) \end{aligned}$$
(13)
$$\begin{aligned}&~\text {s.t.} ~~ \mathbf A \succ 0 \end{aligned}$$
(14)

which is an optimization problem over the positive definite matrix \(\mathbf A \) in the kernel space. Although the cost function is linear in \(\mathbf A \), the positive definiteness of the learned \(\mathbf A \) at each iteration still cannot be guaranteed in the kernel space. Below we treat this subproblem as an unconstrained optimization on the group of positive definite matrices, so that the learned \(\mathbf A \) at each iteration is necessarily positive definite.

Given \(\mathbf A \), the second subproblem is (15)–(16).

$$\begin{aligned}&\mathop {\min }\limits _{\mu }~ \mathrm{{ }}~\lambda tr\left( \mathbf K \left( {D_{H}} - H\right) \mathbf K ^{T}{} \mathbf A \right) +\left( 1 - \lambda \right) tr\left( \mathbf K \left( {D_{c}} - C\right) \mathbf K ^{T}{} \mathbf A \right) \end{aligned}$$
(15)
$$\begin{aligned}&~\text {s.t.} ~~~ {\mu _{s}} \ge 0, \sum \limits _{s = 1}^{m} {{\mu _{s}}}= 1 \end{aligned}$$
(16)

which can be rewritten as a quadratic programming problem as follows.

$$\begin{aligned} \mathop {\min }\limits _{\mu }~&\lambda \sum \limits _{s,t = 1}^m {\mu _{s}}{\mu _{t}}\,tr\left( {K_{s}}\left( {D_{H}} - H\right) {K_{t}}^{T}\mathbf A \right) \nonumber \\&+ \left( 1 - \lambda \right) \sum \limits _{s,t = 1}^m {\mu _{s}}{\mu _{t}}\,tr \left( {K_{s}} \left( {D_{c}} - C\right) {K_{t}}^{T}\mathbf A \right) \end{aligned}$$
(17)
$$\begin{aligned}&\text {s.t.} ~~~ {\mu _{s}} \ge 0, \sum \limits _{s = 1}^{m} {{\mu _{s}}} = 1 \end{aligned}$$
(18)
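One way to solve (17)–(18) for fixed \(\mathbf A \) is a generic constrained solver over the simplex; the sketch below uses scipy.optimize.minimize as an illustrative choice, not necessarily the solver used in our experiments.

```python
import numpy as np
from scipy.optimize import minimize

def solve_mu(K_list, A, H, Y, lam=0.5, gamma=0.5):
    """Solve the quadratic subproblem (17)-(18) for the kernel weights mu, with A fixed."""
    m = len(K_list)
    C, D_H, D_c = build_C(H, Y, gamma)          # from the earlier sketch
    # Q[s, t] collects the coefficient of mu_s * mu_t in the objective (17)
    Q = np.zeros((m, m))
    for s in range(m):
        for t in range(m):
            Q[s, t] = (lam * np.trace(K_list[s] @ (D_H - H) @ K_list[t].T @ A)
                       + (1 - lam) * np.trace(K_list[s] @ (D_c - C) @ K_list[t].T @ A))
    res = minimize(lambda mu: mu @ Q @ mu,
                   np.full(m, 1.0 / m),                      # start from uniform weights
                   bounds=[(0.0, None)] * m,                 # mu_s >= 0
                   constraints=({'type': 'eq', 'fun': lambda mu: mu.sum() - 1.0},))  # sum to 1
    return res.x
```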

So we can learn \(\mathbf A \) and \(\mu \) by an alternating process similar to the two-step iterative algorithm proposed in [26], whose convergence is guaranteed by [30].
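A skeleton of this alternation is sketched below, assuming solve_A implements the intrinsic steepest descent of Sect. 3 (sketched there) and solve_mu the subproblem above; the initialization choices are ours.

```python
import numpy as np

def alternate(K_list, H, Y, lam=0.5, gamma=0.5, outer_iters=10):
    """Two-step alternation: update A with mu fixed, then mu with A fixed."""
    m, n = len(K_list), H.shape[0]
    mu = np.full(m, 1.0 / m)           # start from the unweighted combination
    A = np.eye(n)                      # any positive definite initialization
    for _ in range(outer_iters):
        K = sum(w * Ks for w, Ks in zip(mu, K_list))   # combined kernel matrix
        A = solve_A(K, A, H, Y, lam, gamma)            # subproblem (13)-(14), see Sect. 3
        mu = solve_mu(K_list, A, H, Y, lam, gamma)     # subproblem (17)-(18)
    return A, mu
```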

Let us denote \(f(\mathbf A )= \lambda tr(\mathbf K ({D_{H}} - H)\mathbf K ^{T}\mathbf A ) + (1 - \lambda )tr(\mathbf K ({D_{c}} - C)\mathbf K ^{T}\mathbf A )\). Since \(f(\mathbf A )\) is linear in \(\mathbf A \), its gradient can be calculated by

$$\begin{aligned} \nabla f(\mathbf A )=\lambda \left( \mathbf K ({D_{H}} - H)\mathbf K ^{T}\right) ^T +(1-\lambda )\left( \mathbf K ({D_{c}} - C)\mathbf K ^{T}\right) ^T \end{aligned}$$
(19)

Our work focuses on the solution of problem (13)–(14). As mentioned in [27], on the group of positive definite matrices \(\mathcal {P}(n)=\{P\in \mathbb {R}^{n\times n}\mid P = P^{T}, P\succ 0\}\), a geodesic P(t) can be defined by

$$\begin{aligned} P(t)=P^{\frac{1}{2}} \exp \left( tP^{-\frac{1}{2}}SP^{-\frac{1}{2}}\right) P^{\frac{1}{2}} \end{aligned}$$
(20)

where \(\exp \) is the exponential map, \(P(0)=P \in \mathcal {P}(n) \), \(\dot{P}(0)=S \in T_{P}\mathcal {P}(n)\) is a direction, \(T_{P}\mathcal {P}(n)\) is the tangent space at point P.

Therefore \(\mathbf A \) can be learned by

$$\begin{aligned} \mathbf A (t+1)=\mathbf A (t)^{\frac{1}{2}}\exp \left[ \alpha (t) \cdot \mathbf A (t)^{-\frac{1}{2}}S(t)\mathbf A (t)^{-\frac{1}{2}}\right] \mathbf A (t) ^{\frac{1}{2}} \end{aligned}$$
(21)

where \(\alpha (t)\) is the step size at iteration t and \(\mathbf A (t)^{-\frac{1}{2}}S(t)\mathbf A (t)^{-\frac{1}{2}}\) is a descent direction. On the group of positive definite matrices, Ying et al. [27] propose an intrinsic steepest descent algorithm, which ensures that \(\mathbf A (t+1)\) in (21) is always a positive definite matrix. However, the step size \(\alpha (t)\) is set to a constant, which reduces classification accuracy when the constant is too large and increases the number of iterations when it is too small. To address these problems, we propose an intrinsic steepest descent algorithm with an adaptive step size. In our algorithm, we use (22) to search for the optimal value \(\alpha (t)\) as the next step size. Furthermore, we set a precision parameter \(\epsilon >0\); once the corresponding stopping criterion is satisfied, the iteration terminates.

$$\begin{aligned} {\alpha (t)} = \arg \mathop {\min }\limits _{\alpha \ge 0} f\left( \mathbf A (t)^{\frac{1}{2}}\exp \left[ \alpha \cdot \mathbf A (t)^{-\frac{1}{2}}S(t)\mathbf A (t)^{-\frac{1}{2}}\right] \mathbf A (t)^{\frac{1}{2}}\right) \end{aligned}$$
(22)

Our proposed algorithm, an improvement of the algorithm in [27], is presented in Algorithm 1. We have proved that it converges linearly by means of a special Taylor formula on the group of positive definite matrices; the proof is omitted here due to space limitations.

Algorithm 1. Intrinsic steepest descent with adaptive step size (pseudocode figure)
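The following is a minimal sketch of Algorithm 1 under our reading of (19)–(22): the gradient is a constant matrix because f is linear in \(\mathbf A \), the update follows the geodesic (21), and a bounded one-dimensional search stands in for the exact line search (22). build_C is reused from the sketch in Sect. 2.2, and alpha_max, iters, and eps are hypothetical parameters.

```python
import numpy as np
from scipy.linalg import expm
from scipy.optimize import minimize_scalar

def _spd_sqrt(A):
    """Square root and inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.sqrt(w)) @ V.T, V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def solve_A(K, A, H, Y, lam=0.5, gamma=0.5, iters=50, eps=1e-6, alpha_max=1.0):
    """Sketch of the intrinsic steepest descent with adaptive step size (our reading of Algorithm 1)."""
    C, D_H, D_c = build_C(H, Y, gamma)              # from the sketch in Sect. 2.2
    # f is linear in A, so its gradient (19) is a constant matrix.
    G = lam * (K @ (D_H - H) @ K.T).T + (1 - lam) * (K @ (D_c - C) @ K.T).T
    f = lambda M: (lam * np.trace(K @ (D_H - H) @ K.T @ M)
                   + (1 - lam) * np.trace(K @ (D_c - C) @ K.T @ M))
    for _ in range(iters):
        A_half, A_inv_half = _spd_sqrt(A)
        S = -(G + G.T) / 2.0                        # symmetrized steepest descent direction
        direction = A_inv_half @ S @ A_inv_half
        geodesic = lambda a: A_half @ expm(a * direction) @ A_half   # update (21)
        # Bounded one-dimensional search standing in for the exact line search (22).
        alpha = minimize_scalar(lambda a: f(geodesic(a)),
                                bounds=(0.0, alpha_max), method='bounded').x
        A_next = geodesic(alpha)
        if abs(f(A_next) - f(A)) < eps:             # stop once the precision epsilon is reached
            return A_next
        A = A_next
    return A
```

Because the matrix exponential of a symmetric matrix is positive definite and congruence by \(\mathbf A (t)^{\frac{1}{2}}\) preserves positive definiteness, every iterate produced by the geodesic update stays positive definite, as stated above.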

4 Experimental Results

To evaluate the performance of our proposed nonlinear metric learning method, we apply it to document classification with a k-NN classifier on two real document data sets: the USPS handwritten digit datasetFootnote 1 and the C-Cube handwriting datasetFootnote 2. The document features of each sample are represented by a fixed-length feature vector. Each dataset is randomly split into two parts: 70% for training and the remaining 30% for testing. The basic kernels are ten polynomial kernels with degrees from one to ten and ten Gaussian kernels with bandwidth \(\sigma \in \{0.5, 1, 2, 5, 7, 10, 12, 15, 17, 20\}\), the same as in [26, 31]. All experimental results are averaged over 30 runs.
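The bank of basic kernels described above can be generated as in the following sketch; the exact kernel parameterizations are an assumption on our part (the standard \((1 + x_{i}^Tx_{j})^d\) polynomial and \(\exp (-\Vert x_{i}-x_{j}\Vert ^2/2\sigma ^2)\) Gaussian forms).

```python
import numpy as np

def basic_kernels(X):
    """20 basic kernel matrices: polynomial degrees 1-10 and the Gaussian bandwidths listed above."""
    gram = X @ X.T
    sq = np.sum(X ** 2, axis=1)
    dist_sq = np.maximum(sq[:, None] + sq[None, :] - 2.0 * gram, 0.0)   # pairwise squared distances
    kernels = [(1.0 + gram) ** d for d in range(1, 11)]                 # polynomial kernels
    kernels += [np.exp(-dist_sq / (2.0 * s ** 2))                       # Gaussian (RBF) kernels
                for s in (0.5, 1, 2, 5, 7, 10, 12, 15, 17, 20)]
    return kernels
```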

We choose k-NN with the Euclidean distance as the baseline and compare our method (SIMKML, for short) with the following methods: (1) LMNN [16]; (2) ITML [20]; (3) the method for optimization problem (7)–(8) (SIML, for short); (4) a kernel method using the best single kernel for problem (7)–(8) (BSKML, for short), where the best kernel is chosen by cross-validation; (5) a kernel method with a composite kernel for problem (9)–(10) (USMKML, for short), where the kernel is the unweighted sum of all basic kernels, i.e. \(\mu = \mathbf 1 \); and (6) a kernel method with concatenated kernel matrices (CMKML, for short), where the kernel matrix is the unweighted concatenation of all basic kernel matrices, \(\mathbf K = {(K_1^T,\ldots ,K_m^T)^T}\), similar to \(\textit{NR}\)-\(\textit{ML}_{h}\)-\(\textit{MKL}_{\mu }\) in [26]. Since the dimensionality of this \(\mathbf K \) is \(mn \times n\), we use only the polynomial kernels of degree one and two and the Gaussian kernel with \(\sigma =0.5\) in this experiment.

All the experimental results with 1-NN are presented in Table 1. The classification error rates of the different methods over 30 runs are displayed as box-plots in Fig. 1. The variation of classification error rates with different values of “k” in the k-NN classifier on USPS and C-Cube is shown in Fig. 2.

Table 1 Classification average error rates on document datasets (mean ± SD, in %)

USPS dataset The United States Postal Service (USPS) database is a handwritten digit data set. It contains 7319 training samples, 2007 testing samples, and ten classes from 0 to 9. Each image is normalized to a grayscale image of size \(16\times 16\). In the experiment, 1000 samples are randomly chosen from USPS; 70% of them are then randomly chosen for training and the remaining 30% for testing. From Table 1 and Fig. 1a, we see that the classification error rate of SIMKML is higher than that of LMNN, which indicates that the USPS data set may be more suited to the linear method LMNN. We also observe that our method has the second lowest classification error rate, which shows that it is a reasonable nonlinear metric learning method. The results in Fig. 2a show that, like the other methods, the proposed method is sensitive to the number “k” of the k-NN classifier on the USPS dataset.

Fig. 1 Classification error rates of different metrics on a USPS and b C-Cube. The eight methods are (1) Euclidean, (2) LMNN, (3) ITML, (4) SIML, (5) BSKML, (6) USMKML, (7) CMKML, (8) SIMKML

Fig. 2 The variation of classification error rates with different “k” in the k-NN classifier on a USPS and b C-Cube

C-Cube dataset Cursive Character Challenge (C-Cube) is a benchmark for machine learning and pattern recognition algorithms formed by 57,293 cursive characters extracted from handwritten words, including both upper- and lower-case versions of each letter [32]. C-Cube has been randomly split into a training set of 38,160 characters and a test set of 19,133 characters. In our experiments, we choose the lower-case versions from a to z as the experimental data, which contain 11,162 training binary images and 22,273 testing binary images in 26 classes. The distribution of the C-Cube data is therefore more nonlinear than that of the USPS data, which consists only of the digits 0 to 9. Because the images have different sizes, all of them are normalized to the same size \(209 \,\times \, 123\) by calibrating the image centers and filling the missing parts with zeros. Furthermore, 1000 samples are randomly chosen from C-Cube; 70% of them are randomly chosen as the training set and the remaining 30% as the testing set. From Table 1 and Fig. 1b, we observe that our proposed linear method is not well suited to the C-Cube data set, given its higher classification error rate, but the proposed nonlinear method SIMKML outperforms all the other methods, which shows that our nonlinear method handles nonlinear classification problems efficiently by using the revised triplet constraint, multiple kernels, and the intrinsic steepest descent algorithm with adaptive step size. From Fig. 2b, we see that our method is stable and not sensitive to the number “k” of the k-NN classifier, which shows that the proposed nonlinear method is better suited to the nonlinear C-Cube data.

5 Conclusion

In this paper, we propose a nonlinear metric learning method based on multiple kernel learning for document classification. By defining a kernel as a weighted linear combination of basic kernels, a new expression of the distance is presented. Then, using k-nearest neighbors and a revised triplet constraint, an optimization model for classification is proposed, which avoids introducing slack variables and reduces the computational cost. Furthermore, an intrinsic steepest descent algorithm on the group of positive definite matrices is provided, in which the search step size is chosen optimally, ensuring that classification accuracy is not degraded. The experimental results show that our method is not only effective for classifying documents whose features are represented by fixed-length feature vectors, but also provides an alternative way to choose a proper kernel function for nonlinear classification problems.

For future work, one direction is to design a parallel algorithm based on our proposed algorithm to improve computational performance; another is to study document classification based on semi-supervised metric learning.