1 Introduction

Machine learning aims to train a predictive model from a training set of input-output pairs and then use the model to predict an unknown output from a given test input [1, 13, 14, 17, 31, 34, 37]. In this paper, we focus on the machine learning problem of binary pattern classification. In this problem, each input is the feature vector of a data point, and each output is the binary class label of a data point, either positive or negative [5, 19, 21, 26, 28–30]. To learn the predictive model, i.e., the classification model, we usually compare the true class label of a data point against the predicted label using a loss function, for example, the hinge loss, the logistic loss, or the squared \(\ell _2\) norm loss. By minimizing the loss function over the training set with regard to the parameter of the classification model, we obtain an optimal classification model. To evaluate the performance of the model, we apply it to a set of test data points to predict their class labels and then compare the predicted class labels to the true class labels. This comparison can be conducted using a multivariate performance measure, for example, prediction accuracy, F-score, area under the receiver operating characteristic curve (AUROC) [2, 7, 22, 23], or the precision–recall curve break-even point (PRBEP) [18, 20, 27, 33]. A problem with this machine learning procedure is that in the training process we optimize a simple loss function, such as the hinge loss, but in the test process we use a different and more complex performance measure to evaluate the prediction results. Optimizing the simple loss function does not necessarily optimize the performance measure. For example, in the formulation of the support vector machine (SVM), the hinge loss is minimized, but in the test procedure the AUROC is often used as the performance measure. However, in many real-world applications, the optimization of a specific performance measure is desired. To solve this problem, the direct optimization of complex loss functions corresponding to the desired performance measures has been studied. These methods place a complex loss function, corresponding directly to the performance measure, in the objective function. By minimizing this loss directly when learning the predictive model, the desired performance measure is optimized by the predictive model. In this paper, we study this problem and propose a novel method based on sparse learning and maximum likelihood optimization.

1.1 Related works

Existing works that propose to optimize complex multivariate loss functions are briefly introduced as follows.

  • Joachims [8] proposed to learn a support vector machine to optimize a complex loss function. In the proposed model, the complexity of the predictive model is reduced by minimizing the squared \(\ell _2\) norm of the model parameter. To minimize the complex loss, its upper bound is approximated and minimized.

  • Mao and Tsang [16] improved Joachims’s work by integrating feature selection into the support vector machine for complex loss optimization. A weight is assigned to each feature before the predictive model is learned, and the feature weights and the predictive model parameter are learned jointly in an iterative algorithm.

  • Li et al. [12] proposed a classifier adaptation method to extend Joachims’s work. The predictive model is a combination of a base classifier and an adaptation function, and the learning of the optimal model is transferred to the learning of the parameter of the adaptation function.

  • Zhang et al. [36] proposed a novel smoothing strategy based on Nesterov’s accelerated gradient method to improve the convergence rate of the method proposed by Joachims [8]. According to the results reported in [36], this method converges significantly faster than Joachims’s method [8] without sacrificing generalization ability.

Almost all the existing methods are limited to support vector machines for multivariate complex loss functions. These methods use a linear function to construct the predictive model and seek both minimum complexity and minimum loss.

1.2 Contribution

In this paper, we propose a novel predictive model to optimize a complex loss function. The model is based on the likelihood of the positive or negative class given the input feature vector of a data point, and the likelihood function is constructed as a sigmoid of a linear function. Given a group of data points, we organize them as a data tuple, and the predicted class label tuple is the one that maximizes the logistic likelihood of the data tuple. The learning target is a predictive model parameter such that, with the corresponding predicted class label tuple, the complex loss function is minimized. Moreover, we also want the model parameter to be as sparse as possible, so that only the useful features are kept in the model. To this end, we construct an objective function composed of two terms. The first term is the \(\ell _1\) norm of the parameter, which imposes sparsity on the parameter, and the second term is the complex loss function, which seeks the optimal desired performance measure. The learning problem is thus a minimization problem of this objective function with regard to the parameter. To solve it, we first approximate the upper bound of the complex loss as a logistic function of the parameter and then optimize it by using the fast iterative shrinkage-thresholding algorithm (FISTA). The novelty of this paper is summarized as follows:

  1. For the first time, we propose to use the maximum likelihood model to construct a predictive model for the optimization of complex losses.

  2. We construct a novel optimization problem for the learning of the model parameter by considering the sparsity of the model and the minimization of the complex loss jointly.

  3. We develop a novel iterative algorithm to optimize the proposed minimization problem, and a novel method to approximate the upper bound of the complex loss. The approximation of the upper bound of the complex loss is obtained as a logistic function, and the problem is optimized by a FISTA algorithm.

1.3 Paper organization

This paper is organized as follows: In Sect. 2, we introduce the proposed method; in Sect. 3, we evaluate the proposed method on three real-world applications; and in Sect. 4, the paper is concluded and some future work is discussed.

2 Proposed method

2.1 Problem formulation

Suppose we have a data set of n data points, denoted as \(\{({\mathbf{x}}_i,y_i)\}|_{i=1}^n\), where \({\mathbf{x}}_i \in {\mathbb {R}}^d\) is the d-dimensional feature vector of the ith data point and \(y_i\in \{+1,-1\}\) is its corresponding class label. We consider the data points as a data tuple, \(\overline{{\mathbf{x }}}=({\mathbf{x }}_1, \ldots , {\mathbf{x }}_n)\), and their corresponding class labels as a label tuple, \(\overline{y}=(y_1, \ldots , y_n)\). Under the framework of complex performance measure optimization, we try to learn a multivariate mapping function that maps the data tuple \(\overline{{\mathbf{x }}}\) to a class label tuple \(\overline{y}^* = (y_1^*,\ldots ,y_n^*)\in {\mathcal {Y}}\), where \(y_i^*\in \{+1,-1\}\) is the predicted label of the ith data point and \({\mathcal {Y}}= \{+1,-1\}^n\). To measure the performance of the multivariate mapping function, \(\overline{h}(\overline{{\mathbf{x }}})\), we use a predefined complex loss function \(\Delta (\overline{y},\overline{y}^*)\) to compare the true class label tuple \(\overline{y}\) against the predicted class label tuple \(\overline{y}^*\).

To construct the multivariate mapping function \(\overline{h}(\overline{{\mathbf{x }}})\), we propose to apply a linear discriminant function to match the ith data point \({\mathbf{x }}_i\) against the ith class label \(y_i'\) in a candidate tuple \(\overline{y}'=(y_1', \ldots , y_n')\),

$$\begin{aligned} f_{\mathbf{w }}({\mathbf{x }}_i,y_i') = y_i' {\mathbf{w }}^\top {\mathbf{x }}_i, \end{aligned}$$
(1)

where \({\mathbf{w }}= [w_1,\ldots ,w_d]\in {\mathbb {R}}^d\) is the parameter vector of the function. We then apply a sigmoid function to the response of this function to map it to the range [0, 1],

$$\begin{aligned} g({\mathbf{x }}_i,y_i')&= \frac{1}{1+ \exp \left( - f({\mathbf{x }}_i,y_i') \right) } \\&= \frac{1}{1+ \exp \left( -y_i' {\mathbf{w }}^\top {\mathbf{x }}_i\right) }. \end{aligned}$$
(2)

Moreover,

$$\begin{aligned} g({\mathbf{x }}_i,+1)&= \frac{1}{1+ \exp \left( - {\mathbf{w }}^\top {\mathbf{x }}_i\right) }\\&= \frac{\left( 1+ \exp \left( - {\mathbf{w }}^\top {\mathbf{x }}_i\right) \right) - \exp \left( - {\mathbf{w }}^\top {\mathbf{x }}_i\right) }{1+ \exp \left( - {\mathbf{w }}^\top {\mathbf{x }}_i\right) }\\&= 1-\frac{ \exp \left( - {\mathbf{w }}^\top {\mathbf{x }}_i\right) }{1+ \exp \left( - {\mathbf{w }}^\top {\mathbf{x }}_i\right) }\\&= 1-\frac{1}{1+ \exp \left( {\mathbf{w }}^\top {\mathbf{x }}_i\right) }\\&= 1-g({\mathbf{x }}_i,-1), \end{aligned}$$
(3)

thus we can treat \(g({\mathbf{x }}_i,y_i')\) as the conditional probability of \(y= y_i'\) given \({\mathbf{x }}= {\mathbf{x }}_i\),

$$\begin{aligned} Pr (y=y_i'| {\mathbf{x }}= {\mathbf{x }}_i) = g({\mathbf{x }}_i,y_i'). \end{aligned}$$
(4)

We also assume that the data points in the tuple \(\overline{{\mathbf{x }}}\) are conditionally independent of each other, and thus the conditional probability of \(\overline{y} = \overline{y}'\) given \(\overline{{\mathbf{x }}}\) is

$$\begin{aligned} Pr (\overline{y}=\overline{y}'| \overline{{\mathbf{x }}} )&= \prod _{i=1}^n Pr (y=y_i'| {\mathbf{x }}= {\mathbf{x }}_i)\\&= \prod _{i=1}^n \frac{1}{1+ \exp \left( -y_i' {\mathbf{w }}^\top {\mathbf{x }}_i\right) }. \end{aligned}$$
(5)

To construct the multivariate mapping function, we map the data tuple to the class label tuple \(\overline{y}^*\) that gives the maximum log-likelihood,

$$\begin{aligned} \overline{y}^*\leftarrow \overline{h}(\overline{{\mathbf{x }}})&= \underset{\overline{y}'\in {\mathcal {Y}}}{\arg \max }\, \log \left( Pr (\overline{y}=\overline{y}'| \overline{{\mathbf{x }}} ) \right) \\&= \underset{\overline{y}'\in {\mathcal {Y}}}{\arg \max } \, \log \left( \prod _{i=1}^n \frac{1}{1+ \exp \left( -y_i' {\mathbf{w }}^\top {\mathbf{x }}_i\right) } \right) . \end{aligned}$$
(6)

In this way, we seek the maximum likelihood estimator of the class label tuple as the mapping result for a data tuple.
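Because the likelihood in (5) factorizes over the data points, the maximization in (6) decomposes, and each predicted label \(y_i^*\) is simply the sign of \({\mathbf{w}}^\top {\mathbf{x}}_i\). The following minimal sketch (Python/NumPy, with our own variable names, not the authors' code) computes the per-point probabilities of (2) and the maximum likelihood label tuple of (6):

```python
import numpy as np

def conditional_prob(w, x, label):
    """Pr(y = label | x) from Eq. (2)/(4): a sigmoid of label * w^T x."""
    return 1.0 / (1.0 + np.exp(-label * np.dot(w, x)))

def predict_label_tuple(w, X):
    """Maximum likelihood label tuple of Eq. (6).

    Because the log-likelihood is a sum over data points, the argmax over
    {+1, -1}^n decomposes: each y_i* is whichever label gives the larger
    per-point probability, i.e., the sign of w^T x_i.
    """
    scores = X @ w                        # n responses of the linear function
    return np.where(scores >= 0, 1, -1)

# Toy usage: 4 points in 3 dimensions.
X = np.array([[1.0, 0.2, -0.5],
              [0.3, 1.1,  0.4],
              [-0.7, 0.1, 0.9],
              [0.0, -1.2, 0.3]])
w = np.array([0.8, -0.1, 0.5])
print(predict_label_tuple(w, X))          # prints the label tuple [ 1  1 -1  1]
```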

To learn the parameter of the linear discriminant function, \({\mathbf{w }}\), so that the complex loss function \(\Delta (\overline{y},\overline{y}^*)\) can be minimized, we consider the following problems,

  • Encouraging sparsity of \({\mathbf{w }}\) We assume that in the feature vector of a data point, only a few features are useful, while most of the remaining features are useless. Thus, we need a feature selection procedure that removes the useless features and keeps the useful ones, so that we obtain a sparse feature representation. In our method, instead of seeking sparsity of the feature vectors, we seek sparsity of the parameter vector \({\mathbf{w }}\). With a sparse \({\mathbf{w }}\), we can also control which features effectively contribute to the prediction results. To encourage the sparsity of \({\mathbf{w }}\), we use the \(\ell _1\) norm of \({\mathbf{w }}\) to measure its sparsity and minimize this \(\ell _1\) norm,

    $$\begin{aligned} \min _{{\mathbf{w }}}&\left\{ \frac{1}{2} \left\| {\mathbf{w }}\right\| _1 = \frac{1}{2} \sum _{j=1}^d |w_j| \right. \\&\left. =\frac{1}{2} \sum _{j=1}^d \frac{w_j^2}{|w_j|} =\frac{1}{2} {\mathbf{w }}^\top diag \left( \frac{1}{|w_1|},\ldots , \frac{1}{|w_d|} \right) {\mathbf{w }}\right. \\&\left. = \frac{1}{2} {\mathbf{w }}^\top {\varLambda } {\mathbf{w }}\right\} , \end{aligned}$$
    (7)

    where \(diag \left( \frac{1}{|w_1|},\ldots , \frac{1}{|w_d|} \right) \in {\mathbb {R}}^{d\times d}\) is a diagonal matrix with its diagonal elements as \(\frac{1}{|w_1|},\ldots , \frac{1}{|w_d|}\), and

    $$\begin{aligned} {\varLambda } = diag \left( \frac{1}{|w_1|},\ldots , \frac{1}{|w_d|} \right) \end{aligned}$$
    (8)

    When the \(\ell _1\) norm of \({\mathbf{w }}\) is minimized, most elements of \({\mathbf{w }}\) shrink to zero, leading to a sparse \({\mathbf{w }}\).

  • Minimizing the complex performance loss \(\Delta (\overline{y},\overline{y}^*)\) Given the predicted label tuple \(\overline{y}^*\), we can measure the prediction performance by comparing it against the true label tuple \(\overline{y}\) using a complex performance measure. To obtain an optimal mapping function, we minimize the complex loss corresponding to the complex performance measure, \(\Delta (\overline{y},\overline{y}^*)\),

    $$\begin{aligned} \min _{{\mathbf{w }}}\,&{\Delta }(\overline{y},\overline{y}^*) \end{aligned}$$
    (9)

    Due to its complexity, we minimize its upper bound instead of the loss itself. We have the following theorem to define the upper bound of \({\Delta }(\overline{y},\overline{y}^*)\).

Theorem 1

\({\Delta }(\overline{y},\overline{y}^*)\) satisfies

$$\begin{aligned} {\Delta }(\overline{y},\overline{y}^*)\,\le\,& \max _{\overline{y}'\in {\mathcal {Y}}} \left\{ \log \left( \prod _{i=1}^n \frac{1}{1+ \exp \left( -y_i' {\mathbf{w }}^\top {\mathbf{x }}_i\right) } \right)\right. \\ & -\left. \log \left( \prod _{i=1}^n \frac{1}{1+ \exp \left( -y_i {\mathbf{w }}^\top {\mathbf{x }}_i\right) } \right) + {\Delta }(\overline{y},\overline{y}')\right\} \\= & \,\left\{ \log \left( \prod _{i=1}^n \frac{1}{1+ \exp \left( -y_i'' {\mathbf{w }}^\top {\mathbf{x }}_i\right) } \right)\right. \\&\left. - \log \left( \prod _{i=1}^n \frac{1}{1+ \exp \left( -y_i {\mathbf{w }}^\top {\mathbf{x }}_i\right) } \right) + {\Delta }(\overline{y},\overline{y}'') \right\} , \end{aligned}$$
(10)

where \(\overline{y}' = (y_1',\ldots ,y_n')\), and \(\overline{y}'' = (y_1'',\ldots ,y_n'')\),

$$\begin{aligned} \overline{y}'' =&\arg \max _{\overline{y}'\in {\mathcal {Y}}} \left\{ \log \left( \prod _{i=1}^n \frac{1}{1+ \exp \left( -y_i' {\mathbf{w }}^\top {\mathbf{x }}_i\right) } \right) \right. \\&\left. - \log \left( \prod _{i=1}^n \frac{1}{1+ \exp \left( -y_i {\mathbf{w }}^\top {\mathbf{x }}_i\right) } \right) + {\Delta }(\overline{y},\overline{y}') \right\} \end{aligned}$$
(11)

The proof of this theorem can be found in the Appendix.

After we have the upper bound of the loss function, we minimize it instead of \({\Delta }(\overline{y},\overline{y}^*)\) to obtain the mapping function parameter, \({\mathbf{w }}\),

$$\begin{aligned} \min _{{\mathbf{w }}}&\left\{ \log \left( \prod _{i=1}^n \frac{1}{1+ \exp \left( -y_i'' {\mathbf{w }}^\top {\mathbf{x }}_i\right) } \right) \right. \\&- \log \left( \prod _{i=1}^n \frac{1}{1+ \exp \left( -y_i {\mathbf{w }}^\top {\mathbf{x }}_i\right) } \right) + {\Delta }(\overline{y},\overline{y}'') \\&\left. =\sum _{i=1}^n \log \left( \frac{1+\exp (-y_i{\mathbf{w }}^\top {\mathbf{x }}_i)}{1+\exp (-y_i''{\mathbf{w }}^\top {\mathbf{x }}_i)} \right) +{\Delta }(\overline{y},\overline{y}'') \right\} . \end{aligned}$$
(12)

Please note that \(\overline{y}''\) is also a function of \({\mathbf{w }}\).
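For a small number of data points, the maximizer \(\overline{y}''\) in (11) can be found by exhaustive search over all \(2^n\) candidate tuples; for realistic n, a search tailored to the structure of the particular loss (as studied in [8]) is needed. The sketch below illustrates the exhaustive version, using a \(1-\)F-score loss purely as an example of \(\Delta\); all names are illustrative:

```python
import itertools
import numpy as np

def log_likelihood(w, X, labels):
    """log prod_i 1 / (1 + exp(-y_i w^T x_i)), as in Eq. (5)."""
    margins = labels * (X @ w)
    return -np.sum(np.log1p(np.exp(-margins)))

def f_score_loss(y_true, y_cand):
    """Example complex loss: 1 - F-score of Eq. (21)."""
    tp = np.sum((y_true == 1) & (y_cand == 1))
    fp = np.sum((y_true == -1) & (y_cand == 1))
    fn = np.sum((y_true == 1) & (y_cand == -1))
    if 2 * tp + fp + fn == 0:
        return 1.0
    return 1.0 - 2.0 * tp / (2 * tp + fp + fn)

def loss_augmented_argmax(w, X, y_true, loss=f_score_loss):
    """Brute-force maximizer y'' of Eq. (11); only feasible for small n."""
    n = X.shape[0]
    base = log_likelihood(w, X, y_true)
    best_val, best_y = -np.inf, None
    for cand in itertools.product([1, -1], repeat=n):
        y_cand = np.array(cand)
        val = log_likelihood(w, X, y_cand) - base + loss(y_true, y_cand)
        if val > best_val:
            best_val, best_y = val, y_cand
    return best_y
```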

The overall optimization problem is obtained by combining the problems in (7) and (12),

$$\begin{aligned} &\min _{{\mathbf{w }}} \left\{ f({\mathbf{w }})= \frac{1}{2} {\mathbf{w }}^\top {\varLambda } {\mathbf{w }}\right.\\& \quad + \left. C \left[ \sum _{i=1}^n \log \left( \frac{1+\exp (-y_i{\mathbf{w }}^\top {\mathbf{x }}_i)}{1+\exp (-y_i''{\mathbf{w }}^\top {\mathbf{x }}_i)} \right) +{\Delta }(\overline{y},\overline{y}'') \right] \right\} \end{aligned}$$
(13)

where C is a trade-off parameter. Please note that in this objective, both \({\varLambda }\) and \(\overline{y}''\) are functions of \({\mathbf{w }}\). In the first term of the objective, we impose the sparsity of \({\mathbf{w }}\), and in the second term, we minimize the upper bound of \({\Delta }(\overline{y},\overline{y}^*)\).
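For a fixed \({\varLambda }\) and \(\overline{y}''\) (as in the alternating strategy described below in Sect. 2.2), the objective (13) can be evaluated directly. A minimal sketch, with a small constant added to \(|w_j|\) to avoid division by zero (our own naming, not the authors' code):

```python
import numpy as np

def reweighting_matrix(w, eps=1e-8):
    """Lambda of Eq. (8); as eps -> 0, w^T Lambda w equals ||w||_1."""
    return np.diag(1.0 / (np.abs(w) + eps))

def objective(w, X, y, y_dd, delta_value, C, eps=1e-8):
    """f(w) of Eq. (13); Lambda is evaluated at the current w, while the
    loss-augmented labels y_dd and delta_value = Delta(y, y'') are supplied
    by the caller (e.g., via the loss_augmented_argmax sketch above)."""
    Lam = reweighting_matrix(w, eps)
    sparsity_term = 0.5 * w @ Lam @ w
    m_true = y * (X @ w)            # margins under the true labels
    m_aug = y_dd * (X @ w)          # margins under the loss-augmented labels
    upper_bound = np.sum(np.log1p(np.exp(-m_true))
                         - np.log1p(np.exp(-m_aug))) + delta_value
    return sparsity_term + C * upper_bound
```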

2.2 Optimization

To solve the problem of (13), we employ the FISTA algorithm with a constant step size to minimize the objective \(f({\mathbf{w }})\). This is an iterative algorithm: in each iteration, we first update a search point according to the previous solution of the parameter vector and then update the parameter vector based on the search point. The basic procedure is summarized as the following three steps:

  1. Search point step In this step, we assume the previous solution of \({\mathbf{w }}\) is \({\mathbf{w }}_{pre}\), and we seek a search point \({\mathbf{u }}^*\in {\mathbb {R}}^d\) based on \({\mathbf{w }}_{pre}\) and a step size L.

  2. Weighting factor step In this step, we assume we have the weighting factor of the previous iteration, \(\tau _{pre}\), and we update it to a new weighting factor \(\tau _{cur}\).

  3. Solution update step In this step, we update the new solution of the variable according to the search point. The updated solution is a weighted combination of the current and previous search points, weighted by the weighting factors.

In the following, we discuss how to implement these three steps.

2.2.1 Search point step

In this step, when we want to minimize an objective function \(f({\mathbf{w }})\) with regard to a variable vector \({\mathbf{w }}\) with a step size L and a previous solution \({\mathbf{w }}_{pre}\), we seek a search point \({\mathbf{u }}^*\) as follows,

$$\begin{aligned} {\mathbf{u }}^* = \arg \min _{{\mathbf{u }}} \,&\left\{ \frac{L}{2} \left\| {\mathbf{u }}- \left( {\mathbf{w }}_{pre} - \frac{1}{L} \nabla f({\mathbf{w }}_{pre})\right) \right\| _2^2 \right\} , \end{aligned}$$
(14)

where \(\nabla f({\mathbf{w }})\) is the gradient function of \(f({\mathbf{w }})\). Due to the complexity of the function \(f({\mathbf{w }})\), the closed form of the gradient \(\nabla f({\mathbf{w }})\) is difficult to obtain. Thus, instead of seeking the gradient directly, we seek a sub-gradient of this function. To this end, we use an EM-style alternating strategy. In each iteration, we first fix \({\mathbf{w }}\) as \({\mathbf{w }}_{pre}\) and calculate \({\varLambda }\) according to (8) and \(y_i''|_{i=1}^n\) according to (11). Then we fix \({\varLambda }\) and \(y_i''|_{i=1}^n\) and compute the sub-gradient \(\nabla f({\mathbf{w }})\),

$$\begin{aligned} \nabla f({\mathbf{w }})=&\, {\varLambda }{\mathbf{w }}+ C \sum _{i=1}^n \left( \frac{y_i'' {\mathbf{x }}_i \exp (-y_i''{\mathbf{w }}^\top {\mathbf{x }}_i)}{1+\exp (-y_i''{\mathbf{w }}^\top {\mathbf{x }}_i)} \right. \\ & \left.-\, \frac{y_i {\mathbf{x }}_i \exp (-y_i{\mathbf{w }}^\top {\mathbf{x }}_i)}{1+\exp (-y_i{\mathbf{w }}^\top {\mathbf{x}}_i)} \right) . \end{aligned}$$
(15)

After we have the sub-gradient \(\nabla f({\mathbf{w }})\), we substitute it into (14), and we have

$$\begin{aligned} {\mathbf{u }}^*= & \arg \min _{{\mathbf{u }}} \left\{ \frac{L}{2} \left\| {\mathbf{u }}- \left( {\mathbf{w }}_{pre} - \frac{1}{L} \nabla f({\mathbf{w }}_{pre})\right) \right\| _2^2 \right\} \\= & \arg \min _{{\mathbf{u }}} \left\{ \frac{L}{2} \left\| {\mathbf{u }}- \left[ {\mathbf{w }}_{pre} - \frac{1}{L} \left( {\varLambda }{\mathbf{w }}_{pre} \right. \right. \right. \right. \\&\left. \left. \left. \left. + C \sum _{i=1}^n \left( \frac{y_i'' {\mathbf{x }}_i \exp (-y_i''{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)}{1+\exp (-y_i''{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)} \right. \right. \right. \right. \right. \\&\left. \left. \left. \left. \left. - \frac{y_i {\mathbf{x }}_i \exp (-y_i{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)}{1+\exp (-y_i{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)} \right) \right) \right] \right\| _2^2 \right\} \\= & \arg \min _{{\mathbf{u }}} \left\{ \frac{L}{2} \left\| {\mathbf{u }}- \left[ \left( I - \frac{1}{L} {\varLambda } \right) {\mathbf{w }}_{pre} \right. \right. \right. \\&\left. \left. \left. - \frac{C}{L} \sum _{i=1}^n \left( \frac{y_i'' {\mathbf{x }}_i \exp (-y_i''{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)}{1+\exp (-y_i''{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)} \right. \right. \right. \right. \\&\left. \left. \left. \left. - \frac{y_i {\mathbf{x }}_i \exp (-y_i{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)}{1+\exp (-y_i{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)} \right) \right] \right\| _2^2 = g({\mathbf{u }}) \right\} . \end{aligned}$$
(16)

To solve this problem, we set the gradient function of the objective function \(g({\mathbf{u }})\) to zero,

$$\begin{aligned} \nabla g({\mathbf{u }})= & \, L \left\{ {\mathbf{u }}- \left[ \left( I - \frac{1}{L} {\varLambda } \right) {\mathbf{w }}_{pre} \right. \right. \\&\left. \left. - \frac{C}{L} \sum _{i=1}^n \left( \frac{y_i'' {\mathbf{x }}_i \exp (-y_i''{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)}{1+\exp (-y_i''{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)} \right. \right. \right. \\&\left. \left. \left. - \frac{y_i {\mathbf{x }}_i \exp (-y_i{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)}{1+\exp (-y_i{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)} \right) \right] \right\} = 0 \\ \Rightarrow {\mathbf{u }}^*= & \left( I - \frac{1}{L} {\varLambda } \right) {\mathbf{w }}_{pre} - \frac{C}{L} \sum _{i=1}^n \left( \frac{y_i'' {\mathbf{x }}_i \exp (-y_i''{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)}{1+\exp (-y_i''{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)} \right. \\&\left. - \frac{y_i {\mathbf{x }}_i \exp (-y_i{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)}{1+\exp (-y_i{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)} \right) . \end{aligned}$$
(17)

In this way, we obtain the search point \({\mathbf{u }}^*\).
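With \({\varLambda }\) and \(y_i''|_{i=1}^n\) held fixed, the sub-gradient (15) and the closed-form search point (17) reduce to a gradient step of size 1/L. A minimal sketch, assuming \({\varLambda }\) and \(\overline{y}''\) have already been computed (illustrative names only):

```python
import numpy as np

def subgradient(w, X, y, y_dd, Lam, C):
    """Sub-gradient of Eq. (15) with Lambda and y'' held fixed."""
    def sigmoid_weight(labels):
        # exp(-y w^T x) / (1 + exp(-y w^T x)) for each data point
        m = labels * (X @ w)
        return np.exp(-m) / (1.0 + np.exp(-m))
    term_aug = (y_dd * sigmoid_weight(y_dd))[:, None] * X
    term_true = (y * sigmoid_weight(y))[:, None] * X
    return Lam @ w + C * np.sum(term_aug - term_true, axis=0)

def search_point(w_pre, X, y, y_dd, Lam, C, L):
    """Closed-form minimizer u* of Eq. (14)/(17): a gradient step of size 1/L."""
    return w_pre - subgradient(w_pre, X, y, y_dd, Lam, C) / L
```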

2.2.2 Weighting factor step

We assume that the weighting factor of the previous iteration is \(\tau _{pre}\), and we obtain the weighting factor of the current iteration, \(\tau _{cur}\), as follows,

$$\begin{aligned} \tau _{cur}= \frac{1+\sqrt{1+4{\tau _{pre}}^2}}{2}. \end{aligned}$$
(18)

2.2.3 Solution update step

After we have the search point of the current iteration, \({\mathbf{u }}^*\), the search point of the previous iteration, \({{\mathbf{u }}^*}_{pre}\), and the weighting factors of the current and previous iterations, \(\tau _{cur}\) and \(\tau _{pre}\), we have the following update procedure for the solution of this iteration,

$$\begin{aligned} {\mathbf{w }}_{cur}= & \,{\mathbf{u }}^* + \left( \frac{\tau _{pre} - 1}{\tau _{cur}}\right) \left( {\mathbf{u }}^* - {\mathbf{u }}^*_{pre}\right) \\= & \, \left( \frac{\tau _{cur} + \tau _{pre} - 1}{\tau _{cur}}\right) {\mathbf{u }}^* - \left( \frac{\tau _{pre} - 1}{\tau _{cur}}\right) {\mathbf{u }}^*_{pre}. \end{aligned}$$
(19)

In this equation, we can see that the updated solution of \({\mathbf{w }}_{cur}\) is a weighted version of the current search point, \({\mathbf{u }}^*\), and the previous search point, \({\mathbf{u }}^*_{pre}\).
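Both updates are one-line computations; a small sketch (names are ours, not the authors' code):

```python
import numpy as np

def update_weighting_factor(tau_pre):
    """Eq. (18): the standard FISTA momentum sequence."""
    return (1.0 + np.sqrt(1.0 + 4.0 * tau_pre ** 2)) / 2.0

def update_solution(u_cur, u_pre, tau_cur, tau_pre):
    """Eq. (19): extrapolate the current search point with the previous one."""
    return u_cur + ((tau_pre - 1.0) / tau_cur) * (u_cur - u_pre)
```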

2.3 Iterative algorithm

With the optimization in the previous section, we summarize the iterative algorithm to optimize the problem in (13). The iterative algorithm is given in Algorithm 1.

[Algorithm 1: the iterative SMLM learning algorithm]

In this algorithm, we can see that in each iteration, we first update \({\varLambda }\) and \({y_i''}|_{i=1}^n\) and then use them to update the search point. With the search point and an updated weighting factor, we update the mapping function parameter vector, \({\mathbf{w }}\). This algorithm is called learning of sparse maximum likelihood model (SMLM).
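To make the loop structure of Algorithm 1 concrete, below is a minimal self-contained sketch of the SMLM iteration. To keep the loss-augmented step cheap, it uses a per-point (Hamming-type) loss as the example \(\Delta\), for which the maximizer in (11) decomposes over data points; for genuinely multivariate measures such as F-score or AUROC, this step must be replaced by a loss-specific search. All function and variable names are ours and purely illustrative:

```python
import numpy as np

def smlm_train(X, y, C=1.0, L=10.0, n_iter=100, eps=1e-8):
    """Sketch of Algorithm 1 (SMLM) with FISTA-style updates.

    X: (n, d) feature matrix, y: (n,) labels in {+1, -1}.
    Uses a Hamming-type Delta so that y'' of Eq. (11) decomposes per point.
    """
    n, d = X.shape
    w = np.zeros(d)
    u_pre = np.zeros(d)
    tau_pre = 1.0
    for _ in range(n_iter):
        # (a) fix w, update Lambda (Eq. 8) and y'' (Eq. 11)
        Lam = np.diag(1.0 / (np.abs(w) + eps))
        scores = X @ w
        y_dd = np.empty(n)
        for i in range(n):
            best_val, best_lab = -np.inf, 1
            for lab in (1, -1):
                val = -np.log1p(np.exp(-lab * scores[i]))   # log Pr(lab | x_i)
                val += 0.0 if lab == y[i] else 1.0 / n       # per-point Delta
                if val > best_val:
                    best_val, best_lab = val, lab
            y_dd[i] = best_lab
        # (b) sub-gradient (Eq. 15) and search point (Eq. 17)
        def weight(labels):
            m = labels * scores
            return np.exp(-m) / (1.0 + np.exp(-m))
        grad = Lam @ w + C * np.sum(
            (y_dd * weight(y_dd))[:, None] * X - (y * weight(y))[:, None] * X,
            axis=0)
        u_cur = w - grad / L
        # (c) weighting factor (Eq. 18) and solution update (Eq. 19)
        tau_cur = (1.0 + np.sqrt(1.0 + 4.0 * tau_pre ** 2)) / 2.0
        w = u_cur + ((tau_pre - 1.0) / tau_cur) * (u_cur - u_pre)
        u_pre, tau_pre = u_cur, tau_cur
    return w
```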

2.4 Scaling up to big data based on Hadoop

In this section, we discuss how to fit the proposed algorithm to a big data set. We assume that the number of training data points, n, is extremely large. A single machine is not able to store the entire data set, so the data set is split into m subsets and stored on m different clusters. The clusters are managed by a big data platform, Hadoop [4, 10, 25, 35]. Hadoop is software for distributed data management and processing. Given a large data set, it splits the data into subsets and stores them on different clusters. To process the data and obtain a final output, it uses the MapReduce framework [3, 6, 15, 24]. This framework requires a Map program and a Reduce program from the user. The Hadoop software delivers the Map program to each cluster and uses it to process the local subset to produce intermediate results, and then uses the Reduce program to combine the intermediate results into the final output. Using the MapReduce framework, by defining our own Map and Reduce functions, we can implement the critical steps of Algorithm 1. For example, in sub-step (c) of step k, we need to calculate \({\mathbf{u }}_k\) from (17). In this step, the most time-consuming part is to calculate the summation of a function over all the data points,

$$\begin{aligned} output= & \sum _{i=1}^n \left( \frac{y_i'' {\mathbf{x }}_i \exp (-y_i''{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)}{1+\exp (-y_i''{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)} - \frac{y_i {\mathbf{x }}_i \exp (-y_i{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)}{1+\exp (-y_i{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)} \right) \\= & \sum _{i=1}^n function ({\mathbf{x }}_i, y_i, y_i''), \end{aligned}$$
(20)

where \(\begin{aligned}function({\mathbf{x }}_i, y_i, y_i'') =& \frac{y_i'' {\mathbf{x }}_i \exp (-y_i''{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)}{1+\exp (-y_i''{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)} \\ & - \frac{y_i {\mathbf{x }}_i \exp (-y_i{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)}{1+\exp (-y_i{\mathbf{w }}_{pre}^\top {\mathbf{x }}_i)}\end{aligned}\) is the function applied to each data point. Since the entire data set is split into m subsets, \({\mathcal {X}}_j|_{j=1}^m\), we can design a Map function to calculate the summation over each subset and then design a Reduce function to combine the partial sums into the final output. The Map and Reduce functions are as follows.

Map function applied to the j-th subset

  1. Input: data points of the jth subset, \(\{({\mathbf{x }}_i, y_i,y_i'')\}|_{i:{\mathbf{x }}_i\in {\mathcal {X}}_j}\).

  2. Input: previous parameter, \({\mathbf{w }}_{pre}\).

  3. Initialize: \(Output_j = 0\).

  4. For \(i:{\mathbf{x }}_i\in {\mathcal {X}}_j\)

     (a) \(Output _j = Output _j + function ({\mathbf{x }}_i, y_i, y_i'')\);

  5. Endfor

  6. Output: \(Output _j\)

Reduce function to calculate the final output

  1. Input: intermediate outputs of the m Map functions, \(Output_j|_{j=1}^m\).

  2. Initialize: \(Output = 0\).

  3. For \(j=1,\ldots , m\)

     (a) \(Output = Output + Output _j\);

  4. Endfor

  5. Output: \(Output\)
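To make the data flow concrete, the Map/Reduce pair above can be simulated locally as follows; this is a plain-Python stand-in for the actual Hadoop jobs (which would normally be written against the Hadoop streaming or Java API), with illustrative names only:

```python
import numpy as np

def per_point_function(x, y_true, y_aug, w_pre):
    """function(x_i, y_i, y_i'') of Eq. (20), evaluated for one data point."""
    def weight(label):
        m = label * np.dot(w_pre, x)
        return np.exp(-m) / (1.0 + np.exp(-m))
    return y_aug * x * weight(y_aug) - y_true * x * weight(y_true)

def map_subset(subset, w_pre):
    """Map: partial sum of per_point_function over one subset."""
    output_j = np.zeros_like(w_pre)
    for x, y_true, y_aug in subset:
        output_j += per_point_function(x, y_true, y_aug, w_pre)
    return output_j

def reduce_outputs(partial_outputs):
    """Reduce: combine the m partial sums into the final output of Eq. (20)."""
    return sum(partial_outputs)

# Toy usage with m = 2 subsets of (x_i, y_i, y_i'') triples.
w_pre = np.array([0.5, -0.2])
subsets = [
    [(np.array([1.0, 0.0]), 1, -1), (np.array([0.0, 1.0]), -1, -1)],
    [(np.array([0.5, 0.5]), 1, 1)],
]
print(reduce_outputs([map_subset(s, w_pre) for s in subsets]))
```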

3 Experiment

In this section, we evaluate the proposed SMLM for the optimization of complex loss functions. Three different applications are considered: aircraft event recognition, intrusion detection in wireless mesh networks, and image classification.

3.1 Aircraft event recognition

Recognizing the aircraft event of an aircraft landing is an important problem in the area of civil aviation safety research. It provides important information for fault diagnosis and structural maintenance of aircraft [32]. Given a landing condition, we want to predict whether it is normal or abnormal. To this end, we extract some features and use them to predict whether the aircraft event is normal or abnormal. In this experiment, we evaluate the proposed algorithm in this application and use it as a model for aircraft event recognition.

3.1.1 Data set

In this experiment, we collect a data set of 160 data points. Each data point is a landing condition, and we describe the landing condition by five features: vertical acceleration, vertical speed, lateral acceleration, roll angle, and pitch rate. The data points are classified into two classes, normal and abnormal. The normal class is treated as the positive class, while the abnormal class is treated as the negative class. The number of positive data points is 108, and the number of negative data points is 52.

3.1.2 Experiment setup

In this experiment, we use tenfold cross-validation. The data set is split into ten folds randomly, and each fold contains 16 data points. Each fold is used as the test set in turn, and the remaining nine folds are combined and used as the training set. The proposed model is trained over the training set and then used to predict the class labels of the data points in the test set. The prediction results are evaluated by a performance measure, which compares the true class labels of the test data points against the predicted class labels. In the training procedure, the complex loss function corresponding to the performance measure is minimized.
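A minimal sketch of this splitting protocol (a hypothetical helper, not the authors' experiment code):

```python
import numpy as np

def tenfold_splits(n, seed=0):
    """Yield (train_idx, test_idx) pairs for tenfold cross-validation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    folds = np.array_split(perm, 10)
    for k in range(10):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(10) if j != k])
        yield train_idx, test_idx

# With n = 160 as in this data set, each test fold has 16 points.
for train_idx, test_idx in tenfold_splits(160):
    assert len(test_idx) == 16 and len(train_idx) == 144
```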

In our experiments, we consider three performance measures: F-score, area under the receiver operating characteristic curve (AUROC), and the precision–recall curve break-even point (PRBEP). To define these performance measures, we first need to define the following items,

  • true positive (TP), the number of correctly predicted positive data points,

  • true negative (TN), the number of correctly predicted negative data points,

  • false positive (FP), the number of negative data points wrongly predicted to positive data points, and

  • false negative (FN), the number of positive data points wrongly predicted to negative data points.

With these measures, we can define F-score as follows,

$$\begin{aligned} F = \frac{2\times TP }{2\times TP + FP + FN }. \end{aligned}$$
(21)

Moreover, we can also define true positive rate (TPR) and the false positive rate (FPR) as follows,

$$\begin{aligned} TPR = \frac{ TP }{ TP + FN }, \quad FPR = \frac{ FP }{ FP + TN }. \end{aligned}$$
(22)

With different thresholds, we can have different pairs of TPR and FPR values. By plotting the TPR against the FPR values, we obtain the receiver operating characteristic (ROC) curve. The area under this curve is the AUROC. The recall and precision are defined as follows,

$$\begin{aligned} recall = \frac{ TP }{ TP + FN }, \quad precision = \frac{ TP }{ TP + FP }. \end{aligned}$$
(23)

With different thresholds, we can also have different pairs of recall and precision values. We can obtain a recall–precision (RP) curve by plotting the precision values against the recall values. PRBEP is the value at the point of the RP curve where recall and precision are equal to each other.
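For reference, the three measures can be computed from the real-valued responses \({\mathbf{w}}^\top {\mathbf{x}}_i\) and the true labels. The sketch below uses the rank-sum formula for AUROC and a simple threshold sweep for PRBEP; it is illustrative, not the evaluation code used in the experiments:

```python
import numpy as np

def f_score(y_true, y_pred):
    """F-score of Eq. (21) from hard predictions in {+1, -1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    return 2.0 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def auroc(y_true, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formula."""
    pos = scores[y_true == 1]
    neg = scores[y_true == -1]
    # fraction of (positive, negative) pairs ranked correctly, ties count 1/2
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

def prbep(y_true, scores):
    """Precision-recall break-even point: sweep thresholds and return the
    precision/recall value where the two are closest to equal."""
    best = (np.inf, 0.0)
    for t in np.sort(scores):
        y_pred = np.where(scores >= t, 1, -1)
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fp = np.sum((y_true == -1) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == -1))
        if tp == 0:
            continue
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        if abs(prec - rec) < best[0]:
            best = (abs(prec - rec), (prec + rec) / 2.0)
    return best[1]
```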

3.1.3 Experiment result

We compare the proposed algorithm, SMLM, against several state-of-the-art complex loss optimization methods, including support vector machine for multivariate performance optimization \((\hbox {SVM}_{multi})\) [9], classifier adaptation for multivariate performance optimization (CAPO) [12], and feature selection for multivariate performance optimization \((\hbox {FS}_{multi})\) [16]. The boxplots of the optimized F-scores of tenfold cross-validation of the different algorithms on the aircraft event recognition problem are given in Fig. 1, those of the optimized AUROC are given in Fig. 2, and those of the optimized PRBEP are given in Fig. 3. From these figures, we can see that the proposed method, SMLM, outperforms the compared algorithms on all three optimized performance measures. For example, in Fig. 3, the boxplot of PRBEP of SMLM is significantly higher than those of the other methods: its median value is almost 0.6, while those of the other methods are much lower than 0.6. In Fig. 2, we have a similar observation, and the overall AUROC values optimized by SMLM are much higher than those of the other methods. A reason for this superior performance is that our method seeks the maximum likelihood and the sparsity of the model simultaneously.

Fig. 1 Boxplots of F-scores of the compared methods on the aircraft event recognition problem

Fig. 2 Boxplots of AUROC of the compared methods on the aircraft event recognition problem

Fig. 3 Boxplots of PRBEP of the compared methods on the aircraft event recognition problem

3.2 Intrusion detection in wireless mesh networks

The wireless mesh network (WMN) is a new-generation wireless network technology, and it has been used in many different applications. However, due to the openness of wireless communication, it is vulnerable to intrusions, and thus it is extremely important to detect intrusions in WMNs. Given an attack record, the problem of intrusion detection is to classify it into one of the following classes: denial-of-service attacks, probing (detection) attacks, attacks that obtain root privileges, and remote unauthorized access attacks. In this paper, we use the proposed method, SMLM, for the intrusion detection problem.

3.2.1 Data set

In this experiment, we use the KDD Cup 1999 data set. This data set contains 40,000 attack records, with 10,000 records for each class. For each record, we first preprocess it and then convert its features into a digital signature used as the new features.

3.2.2 Experiment setup

In this experiment, we also use tenfold cross-validation, and we again use the F-score, AUROC, and PRBEP performance measures.

3.2.3 Experiment result

The boxplots of the optimized F-scores of tenfold cross-validation are given in Fig. 4, the boxplots of AUROC are given in Fig. 5, and the boxplots of PRBEP are given in Fig. 6. Similar to the results on the aircraft event recognition problem, the improvement of the proposed algorithm, SMLM, over the other methods is also significant. This is strong evidence of the advantages of sparse learning and maximum likelihood.

Fig. 4 Boxplots of F-scores of the compared methods on the intrusion detection problem

Fig. 5 Boxplots of AUROC of the compared methods on the intrusion detection problem

Fig. 6 Boxplots of PRBEP of the compared methods on the intrusion detection problem

3.3 ImageNet image classification

In the third experiment, we use a large image set to test the performance of the proposed algorithm with big data.

3.3.1 Data set

In this experiment, we use a large data set, ImageNet [11]. This data set contains over 15 million images belonging to 22,000 classes. The images were collected from Web pages and labeled manually. The entire data set is split into three subsets: a training set, a validation set, and a testing set. The training set contains 1.2 million images, the validation set contains 50,000 images, and the testing set contains 150,000 images. To represent each image, we use the bag-of-features method: local SIFT features are extracted from each image and quantized into a histogram. The features can be downloaded directly from http://image-net.org/download-features.

3.3.2 Experiment setup

In this experiment, we do not use tenfold cross-validation but use the given training/validation/testing split. We first apply the proposed algorithm to the training set to learn the classifier, then use the validation set to select the optimal trade-off parameter, and finally test the classifier over the testing set. The F-score, AUROC, and PRBEP performance measures are considered in this experiment. To handle the multi-class problem, we construct a binary classification problem for each class, in which the considered class is the positive class and the combination of all other classes is the negative class, as sketched below.
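The one-versus-rest reduction amounts to relabeling the data once per class; a small illustrative sketch (names are ours):

```python
import numpy as np

def one_vs_rest_labels(class_ids, positive_class):
    """Binary labels in {+1, -1}: +1 for the considered class, -1 for the rest."""
    return np.where(class_ids == positive_class, 1, -1)

# Example: a few hypothetical class ids, with class 7 as the positive class.
class_ids = np.array([7, 3, 7, 12, 3])
print(one_vs_rest_labels(class_ids, positive_class=7))   # [ 1 -1  1 -1 -1]
```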

3.3.3 Experiment results

The boxplots of the optimized F-score, AUROC, and PRBEP over the different classes are given in Figs. 7, 8, and 9. From these figures, we clearly see that the proposed algorithm outperforms the competing methods. This is further strong evidence of the effectiveness of the SMLM algorithm. Moreover, it also shows that the proposed algorithm works well on big data.

Fig. 7 Boxplots of F-scores of the compared methods on the ImageNet image classification problem

Fig. 8 Boxplots of AUROC of the compared methods on the ImageNet image classification problem

Fig. 9 Boxplots of PRBEP of the compared methods on the ImageNet image classification problem

3.4 Running time

The running time of the proposed algorithm on the three data sets is given in Fig. 10. It can be observed from this figure that the first two experiments do not consume much time, while the third, large-scale experiment costs considerably more. This is natural, because in each iteration of the algorithm, we evaluate a function for each data point and sum the responses, so the cost grows with the size of the data set.

Fig. 10 Running time of the SMLM algorithm in the three experiments

4 Conclusion

In this paper, we investigate the problem of optimizing a complex loss corresponding to a complex multivariate performance measure. We propose a novel predictive model to solve this problem. The model is based on the maximum likelihood of a class label tuple given an input data tuple. To learn the model parameter, we propose an optimization problem based on the approximation of the upper bound of the loss function and the sparsity of the model. Moreover, an iterative algorithm is developed to solve it. Experiments on three real-world applications show its advantages over the state of the art.