As the rapidly growing computational power of computers compensates for the human brain's limitations in calculation, data mining, a major component of data science, has emerged to meet the demands of the times: it can extract novel, useful, and potentially valuable knowledge from large-scale complex data. From the mathematical perspective, however, some data mining methods, such as decision trees, genetic algorithms, and association rules, can be considered heuristic algorithms: they select a "better solution" from several alternative solutions as the criterion of classification. These methods do not explore how to locate the "best solution" systematically.

Based on [1] and [2], this chapter describes advanced techniques that apply multi-criteria decision making (MCDM) methods and multi-criteria mathematical programming to the data mining process, selecting the "best solution" from multiple alternatives instead of relying on heuristic algorithms. Section 2.1 covers Multi-Criteria Linear Programming (MCLP) for supervised learning, including an error correction method in classification using Multiple-Criteria and Multiple-Constraint Levels Linear Programming (MC2LP) [3], multi-instance classification based on Regularized Multiple Criteria Linear Programming (RMCLP) [4], supportive instances for RMCLP classification [5], and kernel-based simple RMCLP for binary classification and regression [6]. Section 2.2 then describes a group of knowledge-incorporated MCLP classifiers [7] and decision rule extraction for the RMCLP model [1]. Finally, Sect. 2.3 summarizes three methods of MCDM-based data analytics: an MCDM approach for estimating the number of clusters [8], a parallel RMCLP classification algorithm [9], and an effective intrusion detection framework based on MCLP and support vector machine.

1 Multi-criteria Linear Programming for Supervised Learning

1.1 Error Correction Method in Classification by Using Multiple-Criteria and Multiple-Constraint Levels Linear Programming

First, the MCLP model for classification is outlined as below [10, 11]:

Given a set of l records on n variables, X^T = (x_1, x_2, …, x_l), let x_i = (x_{i1}, x_{i2}, …, x_{in})^T be one data sample, where i = 1, 2, …, l and l is the sample size. In linear discriminant analysis, data separation can be achieved by two opposite objectives, namely minimizing the sum of the deviations (MSD) and maximizing the minimum distances (MMD) of observations from the critical value. That is, to solve the classification problem, we need to minimize the overlapping of the data, i.e. α, and at the same time maximize the distances from the well-classified points to the hyperplane, i.e. β.

However, it is difficult for traditional linear programming to optimize MMD and MSD simultaneously. According to the concept of Pareto optimality, we can examine all possible trade-offs between the objective functions by using a multiple-criteria linear programming algorithm. The MCLP model is illustrated by Fig. 2.1.

Fig. 2.1
figure 1

MCLP model

Moreover, the first Multiple Criteria Linear Programming (MCLP) model can be described as follows:

$$ {\displaystyle \begin{array}{c}\min \sum \limits_i{\alpha}_i\\ {}\max \sum \limits_i{\beta}_i\\ {}s.t.{A}_iX=b+{\alpha}_i-{\beta}_i,{A}_i\in Bad,\\ {}{A}_iX=b-{\alpha}_i+{\beta}_i,{A}_i\in Good,\\ {}{\alpha}_i,{\beta}_i\ge 0,i=1,2,\dots, l\end{array}} $$

Here, α i is the overlapping and β i is the distance from the training sample xi to the discriminator (w · x i) = b (classification separating hyperplane).
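As a concrete illustration, the weighted-sum compromise of this model (λ₁Σα − λ₂Σβ with λ₁ > λ₂, the scalarization used later in this section) can be solved with an off-the-shelf LP solver. The sketch below is an illustrative assumption, not code from the text: the cutoff b is fixed, and a box bound on X is added as a normalization, since the unregularized LP is otherwise unbounded on separable data.

```python
import numpy as np
from scipy.optimize import linprog

def mclp_fit(A, labels, b=1.0, lam1=0.9, lam2=0.1, box=10.0):
    """Weighted-sum MCLP:  min lam1*sum(alpha) - lam2*sum(beta)
    s.t.  A_i X = b + alpha_i - beta_i  for Bad  (labels[i] == -1)
          A_i X = b - alpha_i + beta_i  for Good (labels[i] == +1)
    with alpha, beta >= 0.  X is box-bounded as a normalization."""
    l, n = A.shape
    s = np.where(labels == -1, 1.0, -1.0)          # +1 on Bad rows, -1 on Good rows
    # variable order: X (n), alpha (l), beta (l)
    c = np.concatenate([np.zeros(n), lam1 * np.ones(l), -lam2 * np.ones(l)])
    A_eq = np.hstack([A, np.diag(-s), np.diag(s)])  # A_i X - s_i*alpha_i + s_i*beta_i = b
    bounds = [(-box, box)] * n + [(0, None)] * (2 * l)
    res = linprog(c, A_eq=A_eq, b_eq=np.full(l, b), bounds=bounds)
    X = res.x[:n]
    return X, res.x[n:n + l], res.x[n + l:]         # weights, alpha, beta
```

With λ₁ > λ₂ the solver drives every α_i to zero on separable data, so Good samples end up with A_iX ≥ b and Bad samples with A_iX ≤ b.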

Then, the MC2LP model for classification is introduced in [10].

According to the discussion above, a non-fixed b is very important to our problem. At the same time, for simplicity and for the existence of a solution, b should be confined to some interval.

As a result, for different data we fix b in different intervals [b_l, b_u], where b_l and b_u are two fixed numbers. The problem now is to search for the best cutoff between b_l and b_u at every level of their trade-offs, that is, to test every point in the interval [b_l, b_u]. We keep the same multiple criteria as in MCLP, namely MMD and MSD, and pose the following model:

$$ {\displaystyle \begin{array}{c}\min \sum \limits_i{\alpha}_i\\ {}\max \sum \limits_i{\beta}_i\\ {}s.t.{A}_iX=\left[{b}_l,{b}_u\right]+{\alpha}_i-{\beta}_i,{A}_i\in Bad,\\ {}{A}_iX=\left[{b}_l,{b}_u\right]-{\alpha}_i+{\beta}_i,{A}_i\in Good,\\ {}{\alpha}_i,{\beta}_i\ge 0,i=1,2,\dots, l\end{array}} $$

where A i, b l and b u are given, and X is unrestricted.

In the model, [b_l, b_u] represents a certain trade-off within the interval. By virtue of the technique of multiple-criteria and multiple-constraint levels linear programming (MC2LP), we can test each trade-off between the multiple criteria and the multiple constraint levels as follows:

$$ {\displaystyle \begin{array}{c}\min {\lambda}_1\sum \limits_i{\alpha}_i-{\lambda}_2\sum \limits_i{\beta}_i\\ {}s.t.{A}_iX={\gamma}_1{b}_l+{\gamma}_2{b}_u+{\alpha}_i-{\beta}_i,{A}_i\in Bad,\\ {}{A}_iX={\gamma}_1{b}_l+{\gamma}_2{b}_u-{\alpha}_i+{\beta}_i,{A}_i\in Good,\\ {}{\alpha}_i,{\beta}_i\ge 0,i=1,2,\dots, l\end{array}} $$

Here, the parameters (λ, γ) are fixed for each programming problem. Moreover, the advantage of MC2LP is that it can systematically find the potential solutions for all possible trade-offs in the parameter space [12, 13], where the parameter space is

$$ \left\{\left(\uplambda, \upgamma \right)|{\uplambda}_1+{\uplambda}_2=1,{\upgamma}_1+{\upgamma}_2=1\right\}. $$

Of course, in this model, choosing a suitable parameter pair for the goal problem is a key issue and requires domain knowledge. Consequently, an MC2LP method that does not require manual parameter choice should be proposed.
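The systematic scan over the parameter space {(λ, γ) : λ₁ + λ₂ = 1, γ₁ + γ₂ = 1} can be sketched as a grid search: each grid point fixes b = γ₁b_l + γ₂b_u and solves the weighted LP. The grid resolution, the box bound on X, and the rule for picking the "best" trade-off (fewest training misclassifications) are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np
from scipy.optimize import linprog

def mc2lp_scan(A, labels, b_l=0.5, b_u=1.5, steps=5, box=10.0):
    """Grid scan of the MC2LP parameter space: for each (lam1, gam1) solve
      min lam1*sum(alpha) - (1-lam1)*sum(beta)   with  b = gam1*b_l + (1-gam1)*b_u.
    Returns the (X, b) pair with the fewest training misclassifications."""
    l, n = A.shape
    s = np.where(labels == -1, 1.0, -1.0)            # +1 for Bad rows, -1 for Good
    A_eq = np.hstack([A, np.diag(-s), np.diag(s)])
    bounds = [(-box, box)] * n + [(0, None)] * (2 * l)
    best = None
    for lam1 in np.linspace(0.55, 0.95, steps):      # keep lam1 > lam2
        c = np.concatenate([np.zeros(n), lam1 * np.ones(l), -(1 - lam1) * np.ones(l)])
        for gam1 in np.linspace(0.0, 1.0, steps):
            b = gam1 * b_l + (1 - gam1) * b_u
            res = linprog(c, A_eq=A_eq, b_eq=np.full(l, b), bounds=bounds)
            if not res.success:
                continue
            X = res.x[:n]
            errs = int(np.sum(np.sign(A @ X - b) != labels))
            if best is None or errs < best[0]:
                best = (errs, X, b)
    return best[1], best[2], best[0]                 # weights, cutoff, error count
```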

For the original MCLP model, a single cutoff b is used to predict a new sample's class; that is, there is only one hyperplane. The MC2LP model above points out that we can define two cutoffs b_l and b_u instead of the original single cutoff, and a systematic method can then be used to solve the problem. Consequently, all potential solutions at each constraint-level trade-off can be acquired. However, one problem remains: how to find the cutoffs b_l and b_u.

On the one hand, we utilize two cutoffs to discover a solution of higher accuracy; on the other hand, we hope the cutoffs can be obtained from the system directly. Inspired by this idea, we propose our first MC2LP model, which solves the classification problem in two steps.

In the first step, the MCLP model is used to find the vector of external deviations α, which is a function of λ. For simplicity, we set b = 1 and fix the parameter λ to obtain one potential solution; a parameter-free vector of external deviations α is thus acquired. A component α_i > 0 means that the corresponding sample in the training set is misclassified; in other words, a Type I or Type II error occurs. According to the idea of MC2LP, we can inspect the result of every single MCLP by fixing the parameter γ at each level in the interval [b_l, b_u]. Now, we find the maximal component of α:

$$ {\alpha}_{\mathrm{max}}=\max \left\{{\alpha}_i,1\le i\le l\right\}. $$
(2.1)

Indeed, the smaller the weight of external deviations is, the bigger α max is.

The misclassified samples are all projected into the interval [1 − α_max, 1 + α_max] according to the weight vector X obtained from the MCLP model. We therefore define b_l and b_u as 1 − α_max and 1 + α_max, respectively. It is easy to see that if we want to reduce the two types of error, we only need to inspect the cutoffs by altering the cutoff within the interval

$$ \left[1-{\alpha}_{max},1+{\alpha}_{max}\right]. $$
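The first step of this procedure — solve one MCLP with b = 1, read off the external deviations, and form the search interval — amounts to the following small helper (the function and variable names are illustrative):

```python
import numpy as np

def error_interval(A, labels, X, b=1.0):
    """External deviation alpha_i of each sample from the cutoff b
    (alpha_i > 0 exactly when sample i is misclassified), together with
    the interval [b - alpha_max, b + alpha_max] used as [b_l, b_u]."""
    scores = A @ X
    alpha = np.where(labels == 1,
                     np.maximum(0.0, b - scores),   # Good sample falling below b
                     np.maximum(0.0, scores - b))   # Bad sample rising above b
    a_max = float(alpha.max())
    return alpha, (b - a_max, b + a_max)
```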

Moreover, for the second step, a new MC2LP classification model can be stated as follows:

$$ {\displaystyle \begin{array}{c}\min {\lambda}_1\sum \limits_i{\alpha}_i-{\lambda}_2\sum \limits_i{\beta}_i\\ {}s.t.{A}_iX=\left[1-{\alpha}_{\mathrm{max}},1+{\alpha}_{\mathrm{max}}\right]+{\alpha}_i-{\beta}_i,{A}_i\in Bad,\\ {}{A}_iX=\left[1-{\alpha}_{\mathrm{max}},1+{\alpha}_{\mathrm{max}}\right]-{\alpha}_i+{\beta}_i,{A}_i\in Good,\\ {}{\alpha}_i,{\beta}_i\ge 0,i=1,2,\dots, l\end{array}} $$

where A_i and α_max are given, and X is unrestricted; [1 − α_max, 1 + α_max] means a certain trade-off in the interval. At the same time, λ = (λ_1, λ_2) is the parameter chosen in the first step.

The most direct modification of the new MC2LP model is to turn the single objective function into a multiple-criteria one. Because the vector of external deviations is a function of λ, it is easy to observe that if the weight between external and internal deviations changes, α changes, and consequently α_max alters. The ideal α is one that keeps α_max from being too large; in other words, we should not check weights with λ_1 too small. In fact, it has been proved that if λ_1 > λ_2, then α · β = 0, which makes the model meaningful [14]. As a result, we only need to check the objective-function parameters that keep α_max not too big, in short, not too far away from the original one.

On the other hand, we do not want α_max to be too small; that is, we hope the model retains some generalization ability. Hence, two small positive numbers ϵ_1 and ϵ_2 are chosen manually, and the interval is built with bounds ranging over [1 − α_max − ϵ_1, 1 − α_max + ϵ_1] and [1 + α_max − ϵ_2, 1 + α_max + ϵ_2]. This means that the lower and upper bounds of the interval are themselves trade-offs over intervals, i.e. the multiple constraint levels are actually multiple constraint intervals. Indeed, checking every trade-off of these intervals is the same as checking every trade-off between 1 − α_max − ϵ_1 and 1 + α_max + ϵ_2. In this case, we can treat the objective function as a multiple-criteria one, stated as follows:

$$ {\displaystyle \begin{array}{c}\min \sum \limits_i{\alpha}_i\\ {}\max \sum \limits_i{\beta}_i\\ {}s.t.{A}_iX=\left[1-{\alpha}_{\mathrm{max}}-{\varepsilon}_1,1+{\alpha}_{\mathrm{max}}+{\varepsilon}_2\right]+{\alpha}_i-{\beta}_i,{A}_i\in Bad,\\ {}{A}_iX=\left[1-{\alpha}_{\mathrm{max}}-{\varepsilon}_1,1+{\alpha}_{\mathrm{max}}+{\varepsilon}_2\right]-{\alpha}_i+{\beta}_i,{A}_i\in Good,\\ {}{\alpha}_i,{\beta}_i\ge 0,i=1,2,\dots, l\end{array}} $$
(2.2)

where A i, α max, ϵ 1 and ϵ 2 are given, and X is unrestricted. Here, ϵ 1 and ϵ 2 are two nonnegative numbers.

Lemma 2.1

For a certain trade-off between the objective functions, if b keeps the same sign, then the hyperplanes obtained from the MCLP model remain the same. Furthermore, different signs of b result in different hyperplanes.

Proof

Assume that the trade-off between the objective functions is λ = (λ_1, λ_2) and that X_1 is the solution obtained by fixing b to be 1. Then let b_1 be an arbitrary positive number. The MCLP model can be transformed as follows:

$$ {\displaystyle \begin{array}{c}\min {\lambda}_1\sum \limits_i{\alpha}_i-{\lambda}_2\sum \limits_i{\beta}_i\\ {}s.t.{A}_iX={b}_1+{\alpha}_i-{\beta}_i,{A}_i\in Bad,\\ {}{A}_iX={b}_1-{\alpha}_i+{\beta}_i,{A}_i\in Good,\\ {}{\alpha}_i,{\beta}_i\ge 0,i=1,2,\dots, l\end{array}} $$

The problem above is the same as:

$$ {\displaystyle \begin{array}{c}\min {\lambda}_1\frac{\sum \limits_i{\alpha}_i}{b_1}-{\lambda}_2\frac{\sum \limits_i{\beta}_i}{b_1}\\ {}s.t.{A}_i\frac{X}{b_1}=1+\frac{\alpha_i}{b_1}-\frac{\beta_i}{b_1},{A}_i\in Bad,\\ {}{A}_i\frac{X}{b_1}=1-\frac{\alpha_i}{b_1}+\frac{\beta_i}{b_1},{A}_i\in Good,\\ {}{\alpha}_i,{\beta}_i\ge 0,i=1,2,\dots, l\end{array}} $$

Then, letting αi′ = \( \frac{\alpha_i}{b_1} \), βi′ = \( \frac{\beta_i}{b_1} \), X′ = \( \frac{X}{b_1} \), the transformed problem is identical to the b = 1 problem, so X′ = X_1, i.e. X = b_1X_1, and the hyperplane AX = b_1 is the same as AX_1 = 1.

Similarly, we can prove that when b is a negative number, the solution is the same as the one that is obtained from b = 1.

As a result, we just need to compare the solutions (hyperplanes) resulting from b = 1 and b = −1. In this case, it is easy to see that the signs before α_i and β_i swap when we transform b = 1 into b = −1. The objective function then changes into \( -{\lambda}_1\sum \limits_i{\alpha}_i+{\lambda}_2\sum \limits_i{\beta}_i \), which means the solutions will be different.

According to the lemma, we have the theorem below:

Theorem 2.1

For our MC2LP model ( 2.2 ) above, the space of γ is divided, according to the solutions (hyperplanes), into two non-intersecting parts.

Remark 2.1

When [1 − α max, 1 + α max] is obtained, ϵ 1 and ϵ 2 are chosen so that 0 is contained in the interval [1 − α max − ϵ 1, 1 + α max + ϵ 2]. In this case, for any λ, the solutions belonging to trade-offs with the same sign result in the same hyperplane. In other words, there are only two different hyperplanes corresponding to model (2.2); in short, the flexibility of model (2.2) is limited.

In many classification models, including the original MCLP model, the two types of error are a big issue. In credit card account classification, correcting the two types of error can not only improve the accuracy of classification but also help to find some important accounts.

Accordingly, many researchers have focused on this topic. Based on this consideration, more attention should be paid to the samples located between the two hyperplanes acquired by the original MCLP model, that is, the points in the grey zone [15]. Consequently, we define the external and internal deviations related to two different hyperplanes, the left one and the right one: α^l, α^r, β^l and β^r.

Definition 2.1

The conditions the deviations should satisfy are stated as follows:

$$ {\displaystyle \begin{array}{l}{\alpha}_i^l=\left\{\begin{array}{ll}0,& {A}_iX<1-{\alpha}_{max}\ \mathrm{and}\ {A}_i\in Bad;\\ {}{A}_iX-\left(1-{\alpha}_{max}\right),& {A}_iX\ge 1-{\alpha}_{max}\ \mathrm{and}\ {A}_i\in Bad;\\ {}0,& {A}_iX\ge 1-{\alpha}_{max}\ \mathrm{and}\ {A}_i\in Good;\\ {}\left(1-{\alpha}_{max}\right)-{A}_iX,& {A}_iX<1-{\alpha}_{max}\ \mathrm{and}\ {A}_i\in Good.\end{array}\right.\\ {}{\alpha}_i^r=\left\{\begin{array}{ll}0,& {A}_iX<1+{\alpha}_{max}\ \mathrm{and}\ {A}_i\in Bad;\\ {}{A}_iX-\left(1+{\alpha}_{max}\right),& {A}_iX\ge 1+{\alpha}_{max}\ \mathrm{and}\ {A}_i\in Bad;\\ {}0,& {A}_iX\ge 1+{\alpha}_{max}\ \mathrm{and}\ {A}_i\in Good;\\ {}\left(1+{\alpha}_{max}\right)-{A}_iX,& {A}_iX<1+{\alpha}_{max}\ \mathrm{and}\ {A}_i\in Good.\end{array}\right.\\ {}{\beta}_i^l=\left\{\begin{array}{ll}\left(1-{\alpha}_{max}\right)-{A}_iX,& {A}_iX<1-{\alpha}_{max}\ \mathrm{and}\ {A}_i\in Bad;\\ {}0,& {A}_iX\ge 1-{\alpha}_{max}\ \mathrm{and}\ {A}_i\in Bad;\\ {}{A}_iX-\left(1-{\alpha}_{max}\right),& {A}_iX\ge 1-{\alpha}_{max}\ \mathrm{and}\ {A}_i\in Good;\\ {}0,& {A}_iX<1-{\alpha}_{max}\ \mathrm{and}\ {A}_i\in Good.\end{array}\right.\\ {}{\beta}_i^r=\left\{\begin{array}{ll}\left(1+{\alpha}_{max}\right)-{A}_iX,& {A}_iX<1+{\alpha}_{max}\ \mathrm{and}\ {A}_i\in Bad;\\ {}0,& {A}_iX\ge 1+{\alpha}_{max}\ \mathrm{and}\ {A}_i\in Bad;\\ {}{A}_iX-\left(1+{\alpha}_{max}\right),& {A}_iX\ge 1+{\alpha}_{max}\ \mathrm{and}\ {A}_i\in Good;\\ {}0,& {A}_iX<1+{\alpha}_{max}\ \mathrm{and}\ {A}_i\in Good.\end{array}\right.\end{array}} $$
Fig. 2.2
figure 2

MC2LP model

Figure 2.2 is a sketch of the model. In the graph, the green and red lines are the left and right hyperplanes, b_l and b_r respectively, which are trade-offs in the two intervals [1 − α_max − ϵ_2, 1] and [1, 1 + α_max + ϵ_1], and all the deviations, shown in different colors, are measured with respect to them. For instance, if a sample in the "Good" class is misclassified as "Bad", then α_i^r > β_i^l ≥ 0 and α_i^l = β_i^r = 0. If a sample in the "Bad" class is misclassified as "Good", then α_i^l > β_i^r ≥ 0 and α_i^r = β_i^l = 0. Thus, for the misclassified samples, α_i^r + α_i^l − β_i^r − β_i^l should be minimized.
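The four case definitions above collapse to simple positive-part expressions: relative to each hyperplane t ∈ {1 − α_max, 1 + α_max}, a Bad sample's external deviation is (A_iX − t)⁺ and its internal deviation (t − A_iX)⁺, with the roles swapped for a Good sample. A small helper (names illustrative) makes this explicit:

```python
def deviations(score, is_bad, a_max):
    """Left/right deviations of Definition 2.1 for one sample with
    projection score = A_i X, relative to the hyperplanes 1 - a_max
    and 1 + a_max."""
    out = {}
    for side, t in (("l", 1.0 - a_max), ("r", 1.0 + a_max)):
        if is_bad:                     # Bad: external deviation points right
            out["alpha_" + side] = max(0.0, score - t)
            out["beta_" + side] = max(0.0, t - score)
        else:                          # Good: external deviation points left
            out["alpha_" + side] = max(0.0, t - score)
            out["beta_" + side] = max(0.0, score - t)
    return out
```

For a Good sample in the grey zone (score between 1 − α_max and 1) this reproduces the pattern α_i^r > β_i^l ≥ 0 with α_i^l = β_i^r = 0 described above.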

As a result, a more meticulous model could be stated as follows:

$$ {\displaystyle \begin{array}{c}\min \sum \limits_i\left({\alpha}_i^r+{\alpha}_i^l\right)\\ {}\min \sum \limits_i\left({\alpha}_i^l-{\beta}_i^r\right)\\ {}\min \sum \limits_i\left({\alpha}_i^r-{\beta}_i^l\right)\\ {}\max \sum \limits_i\left({\beta}_i^r+{\beta}_i^l\right)\\ {}s.t.{A}_iX=1+\left[0,{\alpha}_{\mathrm{max}}+{\varepsilon}_1\right]+{\alpha}_i^r-{\beta}_i^r,{A}_i\in Bad,\\ {}{A}_iX=1-\left[0,{\alpha}_{\mathrm{max}}+{\varepsilon}_2\right]+{\alpha}_i^l-{\beta}_i^l,{A}_i\in Bad,\\ {}{A}_iX=1+\left[0,{\alpha}_{\mathrm{max}}+{\varepsilon}_1\right]-{\alpha}_i^r+{\beta}_i^r,{A}_i\in Good,\\ {}{A}_iX=1-\left[0,{\alpha}_{\mathrm{max}}+{\varepsilon}_2\right]-{\alpha}_i^l+{\beta}_i^l,{A}_i\in Good,\\ {}{\alpha}_i^r,{\alpha}_i^l,{\beta}_i^r,{\beta}_i^l\ge 0,i=1,2,\dots, l.\end{array}} $$

where A i, α max, ϵ 1 > 0, ϵ 2 > 0 are given, and X is unrestricted.

In Fig. 2.2, for each point, at most two kinds of deviations are nonzero. The objective functions deal with the deviations according to the positions shown in Fig. 2.2, but they also carry their own special meaning: the second and third objective functions measure the two types of error to some degree. As a result, in this new version of MC2LP, we not only consider the deviations separately, but also take into account, in the objective functions, the relationship between the deviations and the two types of error. By virtue of the MC2LP method, each trade-off between 1 − α_max − ϵ_2 and 1 for the left hyperplane, as well as each trade-off between 1 and 1 + α_max + ϵ_1 for the right hyperplane, can be checked.

After obtaining the weight vector X of the hyperplane, AX = 1 is still used as the classification hyperplane. However, in our new model, we minimize the distance between the left hyperplane and the right one; in other words, we discover the hyperplane that generates the smallest grey area.

Actually, in statistics, Type I and Type II errors are two opposing objectives; it is very hard to correct both of them at the same time. As a result, we modify the former model into two different models, each focusing on one type of error, as follows:

$$ {\displaystyle \begin{array}{c}\min \sum \limits_i\left({\alpha}_i^r+{\alpha}_i^l\right)\\ {}\min \sum \limits_i\left({\alpha}_i^l-{\beta}_i^r\right)\\ {}\max \sum \limits_i\left({\beta}_i^r+{\beta}_i^l\right)\\ {}s.t.{A}_iX=1+\left[0,{\alpha}_{\mathrm{max}}+\varepsilon \right]+{\alpha}_i^r-{\beta}_i^r,{A}_i\in Bad,\\ {}{A}_iX=1+{\alpha}_i^l-{\beta}_i^l,{A}_i\in Bad,\\ {}{A}_iX=1+\left[0,{\alpha}_{\mathrm{max}}+\varepsilon \right]-{\alpha}_i^r+{\beta}_i^r,{A}_i\in Good,\\ {}{A}_iX=1-{\alpha}_i^l+{\beta}_i^l,{A}_i\in Good,\\ {}{\alpha}_i^r,{\alpha}_i^l,{\beta}_i^r,{\beta}_i^l\ge 0,i=1,2,\dots, l.\end{array}} $$
(2.3)

where A i, α max and ϵ > 0 are given, and X is unrestricted. In this model, \( {\sum}_i{\alpha}_i^r-{\beta}_i^l \) is not contained in the objective functions. This model can deal with Type II error, that is, classifying a "Good" point as a "Bad" one.

Model (2.3) can correct Type II error to some degree. We conclude this in the proposition below.

Proposition 2.1

Model ( 2.3 ) can correct Type II error by moving the right hyperplane to the right based on the concept of multiple-constraint levels.

Note that the second objective function in model (2.3) is nonzero for the samples in class "Bad" and becomes negative as the right hyperplane moves to the right; that is to say, we tolerate some Type I errors. At the same time, the first objective function in model (2.3) imposes an increasing penalty on Type II errors as the right hyperplane moves to the right. As a result, the model can correct Type II error to some degree.

Similar to model (2.3), model (2.4) is posed to deal with Type I error as follows:

$$ {\displaystyle \begin{array}{c}\min \sum \limits_i\left({\alpha}_i^r+{\alpha}_i^l\right)\\ {}\min \sum \limits_i\left({\alpha}_i^r-{\beta}_i^l\right)\\ {}\max \sum \limits_i\left({\beta}_i^r+{\beta}_i^l\right)\\ {}s.t.\kern1.5em {A}_iX=1+{\alpha}_i^r-{\beta}_i^r,{A}_i\in Bad,\\ {}{A}_iX=1-\left[0,{\alpha}_{\mathrm{max}}+{\varepsilon}_2\right]+{\alpha}_i^l-{\beta}_i^l,{A}_i\in Bad,\\ {}{A}_iX=1-{\alpha}_i^r+{\beta}_i^r,{A}_i\in Good,\\ {}{A}_iX=1-\left[0,{\alpha}_{\mathrm{max}}+{\varepsilon}_2\right]-{\alpha}_i^l+{\beta}_i^l,{A}_i\in Good,\\ {}{\alpha}_i^r,{\alpha}_i^l,{\beta}_i^r,{\beta}_i^l\ge 0,i=1,2,\dots, l.\end{array}} $$
(2.4)

where A i, α max and ϵ 2 > 0 are given, and X is unrestricted. In this model, \( {\sum}_i{\alpha}_i^l-{\beta}_i^r \) is not contained in the objective functions. This model focuses on Type I error, that is, classifying a "Bad" point as a "Good" one.

The numerical examples to illustrate the theoretical results of this section can be found in [3].

1.2 Multi-instance Classification Based on Regularized Multiple Criteria Linear Programming

Multi-instance learning (MIL) has recently received intense interest in the field of machine learning. The idea was originally proposed for handwritten digit recognition by [16]. The term multi-instance learning was first introduced by [17] in an investigation of the binding-ability problem in drug activity prediction. In the MIL framework, the training set consists of positive and negative bags of points in the n-dimensional real space R n, and each bag contains a number of points (instances). A positive training bag contains at least one positive instance, whereas a negative bag contains only negative instances. The aim of MIL is to construct a classifier from the training set that correctly labels unseen bags. Multi-instance learning has been found useful in diverse domains such as object detection, text categorization, image categorization, image retrieval, web mining, and computer-aided medical diagnosis [12–14, 18].

In this subsection, we propose a novel multi-instance learning method based on Regularized Multiple Criteria Linear Programming (called MI-RMCLP), which includes two algorithms for the linear and nonlinear cases separately. To our knowledge, MI-RMCLP is the first RMCLP-based implementation of MIL, and a useful extension of RMCLP. The original MI-RMCLP model is itself a nonconvex optimization problem. By an appropriate modification, we derive from it two quadratic programming subproblems and arrive at the optimal value through an iterative strategy that solves these subproblems sequentially. All preliminary numerical experiments show that our approach is competitive with other multi-instance learning formulations.

We first give a brief introduction to RMCLP. For classification, consider the training data:

$$ T=\left\{\left({x}_1,{y}_1\right),\cdots, \left({x}_l,{y}_l\right)\right\}\in {\left({R}^n\times y\right)}^l, $$

where x i ∈ R n and y i ∈ Y = {1, −1}, i = 1, ⋯, l, data separation can be achieved by two opposite objectives. The first objective separates the observations by minimizing the sum of the deviations (MSD) among the observations. The second maximizes the minimum distances (MMD) of observations from the critical value [19]. The overlapping of data u should be minimized, while the distance v has to be maximized. However, it is difficult for traditional linear programming to optimize MMD and MSD simultaneously. According to the concept of Pareto optimality, we can seek the best trade-off between the two measurements [2, 20]. So the MCLP model can be described as follows:

$$ \underset{u}{\min }{e}^Tu\&\underset{v}{\max }{e}^Tv, $$
(2.5)
$$ s.t.\left(w\cdot {x}_i\right)+\left({u}_i-{v}_i\right)=b,\mathrm{for}\kern0.5em \left\{i\left|{y}_i=1\right.\right\}, $$
(2.6)
$$ \left(w\cdot {x}_i\right)-\left({u}_i-{v}_i\right)=b,\mathrm{for}\ \left\{i\left|{y}_i=-1\right.\right\}, $$
(2.7)
$$ u,v\ge 0, $$
(2.8)

where e ∈ R l is the vector whose elements are all 1, w and b are unrestricted, u i is the overlapping, and v i is the distance from the training sample x i to the discriminator (w ⋅ x i) = b (the classification separating hyperplane). By introducing penalty parameters c, d > 0, MCLP has the following version:

$$ \underset{u,v}{\min }{ce}^Tu-{de}^Tv, $$
(2.9)
$$ s.t.\left(w\cdot {x}_i\right)+\left({u}_i-{v}_i\right)=b,\mathrm{for}\kern0.5em \left\{i\left|{y}_i=1\right.\right\}, $$
(2.10)
$$ \left(w\cdot {x}_i\right)-\left({u}_i-{v}_i\right)=b,\mathrm{for}\ \left\{i\left|{y}_i=-1\right.\right\}, $$
(2.11)
$$ u,v\ge 0. $$
(2.12)

The geometric meaning of the model is shown in Fig. 2.3.

Fig. 2.3
figure 3

Geometric meaning of MCLP

Many empirical studies have shown that MCLP is a powerful tool for classification. However, this model does not always have a solution for every training set. To ensure the existence of a solution, Shi et al. recently proposed the RMCLP model by adding two regularized terms, \( \frac{1}{2}{w}^T Hw \) and \( \frac{1}{2}{u}^T Qu \), to MCLP as follows (more theoretical explanation of this model can be found in [2]):

$$ \underset{z}{\min}\frac{1}{2}{w}^T Hw+\frac{1}{2}{u}^T Qu+{de}^Tu-{ce}^Tv, $$
(2.13)
$$ s.t.\left(w\cdot {x}_i\right)+\left({u}_i-{v}_i\right)=b,\mathrm{for}\ \left\{i\left|{y}_i=1\right.\right\}, $$
(2.14)
$$ \left(w\cdot {x}_i\right)-\left({u}_i-{v}_i\right)=b,\mathrm{for}\ \left\{i\left|{y}_i=-1\right.\right\}, $$
(2.15)
$$ u,v\ge 0, $$
(2.16)

where z = (w T, u T, v T, b)T ∈ R n + l + l + 1, and H ∈ R n × n, Q ∈ R l × l are symmetric positive definite matrices. Obviously, the regularized MCLP is a convex quadratic program.
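Because (2.13)–(2.16) with H = Q = I is a small convex quadratic program, it can be solved directly with a general-purpose constrained optimizer. The sketch below uses SciPy's SLSQP as an off-the-shelf substitute for a dedicated QP solver; the toy data and the penalty values c < d in the usage are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def rmclp_fit(Xmat, y, c=0.5, d=1.0):
    """RMCLP (2.13)-(2.16) with H = Q = I, solved as a small convex QP.
    Variables z = (w, u, v, b); u, v >= 0, w and b unrestricted."""
    l, n = Xmat.shape

    def unpack(z):
        return z[:n], z[n:n + l], z[n + l:n + 2 * l], z[-1]

    def obj(z):
        w, u, v, _ = unpack(z)
        return 0.5 * w @ w + 0.5 * u @ u + d * u.sum() - c * v.sum()

    def eq(z):                         # (w . x_i) + y_i (u_i - v_i) - b = 0
        w, u, v, b = unpack(z)
        return Xmat @ w + y * (u - v) - b

    z0 = np.zeros(n + 2 * l + 1)       # feasible start: all constraints hold
    bnds = [(None, None)] * n + [(0, None)] * (2 * l) + [(None, None)]
    res = minimize(obj, z0, method="SLSQP", bounds=bnds,
                   constraints={"type": "eq", "fun": eq},
                   options={"maxiter": 500})
    w, u, v, b = unpack(res.x)
    return w, b
```

On a linearly separable toy set the optimizer should return (w, b) with y_i((w · x_i) − b) ≥ 0 for every sample, since u can be driven to zero.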

Compared with the traditional SVM, the RMCLP model is similar in form to the support vector machine in that it minimizes the overlapping of the data. However, RMCLP tries to measure all possible distances v from the training samples x i to the separating hyperplane, while SVM fixes the distance as 1 (through the bounding planes (w ⋅ x) = b ± 1) from the support vectors. Although interpretations vary, RMCLP has more control parameters than SVM, which may provide more flexibility for better separation of data within the mathematical programming framework. In addition, unlike SVM, RMCLP considers all the samples when solving the classification problem. These properties make RMCLP less sensitive to outliers.

One of the drawbacks of applying a supervised learning model is that it is not always possible for a teacher to provide labeled examples for training. Multiple instance learning (MIL) provides a new way of modeling the teacher's weakness. MIL considers a particular form of weak supervision in which training class labels are associated with sets of patterns, or bags, instead of individual patterns. A negative bag consists only of negative instances, whereas a positive bag may comprise both positive and negative instances. The goal of MIL is to find a separating hyperplane that can decide the label of any new instance.

In the following, we give a formal description of the multiple instance learning problem. Given a training set

$$ \left\{{\mathbf{B}}_1^{+},\cdots, {\mathbf{B}}_{m^{+}}^{+},{\mathbf{B}}_1^{-},\cdots, {\mathbf{B}}_{m^{-}}^{-}\right\} $$
(2.17)

where a bag \( {\mathbf{B}}_i^{+}=\left\{{x}_{i1},\cdots, {x}_{im_i^{+}}\right\},{x}_{ij}\in {R}^n,j=1,\cdots, {m}_i^{+},i=1,\cdots, {m}^{+} \); \( {\mathbf{B}}_i^{-}=\left\{{x}_{i1},\cdots, {x}_{im_i^{-}}\right\},{x}_{ij}\in {R}^n,j=1,\cdots, {m}_i^{-},i=1,\cdots, {m}^{-} \). B + means that the positive bag B + contains at least one positive instance x ij; B − means that all instances x ij of the negative bag B − are negative. The goal is to induce a decision function

$$ y=\operatorname{sgn}\left(\mathbf{g}(x)\right) $$
(2.18)

such that the label of any instance x in R n space can be predicted.

Now we rewrite the training set (2.17) as

$$ { {\begin{array}{ll}\mathrm{Train}=\left\{{\mathbf{B}}_1^{+},\cdots, {\mathbf{B}}_{m^{+}}^{+},{\mathbf{B}}_{m^{+}+1}^{-},\cdots, {\mathbf{B}}_{m^{+}+{m}^{-}}^{-}\right\}=\left\{{\mathbf{B}}_1^{+},\cdots, {\mathbf{B}}_{m^{+}}^{+},{x}_{z+1},\cdots, {x}_{z+f}\right\}\end{array}}} $$
(2.19)

where z is the number of instances in all positive bags and f is the number of instances in all negative bags.
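Under this setup, once a linear discriminant (w, b) has been learned, the bag-level decision rule implied by the max-constraint below — a bag is positive iff its best-scoring instance lies on the positive side — is a one-liner (names illustrative):

```python
import numpy as np

def predict_bag(w, b, bag):
    """Bag-level MIL prediction: positive iff max_j (w . x_j) - b > 0,
    i.e. at least one instance falls on the positive side."""
    return 1 if max(np.dot(w, x) for x in bag) - b > 0 else -1
```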

The set of subscripts of the instances in bag B i is expressed as:

$$ \Im (i)=\left\{j\left|{x}_j\in {\mathbf{B}}_i\right.\right\} $$
(2.20)

For a separable multi-instance classification problem, if a positive bag can be correctly classified, it should satisfy the following constraint:

$$ \underset{j\in \Im (i)}{\max}\left(w\cdot {x}_j\right)-b>0. $$
(2.21)

In RMCLP, v i means the distance from the training sample x i to the separating hyperplane and is a nonnegative number. Thus, we can always find an appropriate v i such that

$$ \underset{j\in \Im (i)}{\max}\left(w\cdot {x}_j\right)-b={v}_i. $$
(2.22)

For the nonseparable multi-instance classification problem, we need to add the corresponding slack variable u i ≥ 0. Finally, (2.22) is expressed as

$$ \underset{j\in \Im (i)}{\max}\left(w\cdot {x}_j\right)-b={v}_i-{u}_i. $$
(2.23)

Similar to [21], this is equivalent to the existence of convex combination coefficients \( \left\{{\lambda}_j^i\left|j\in \Im (i),i=1,\cdots, {m}^{+}\right.\right\} \) such that

$$ \left(w\cdot \sum \limits_{j\in \Im (i)}{\lambda}_j^i{x}_j\right)+{u}_i-{v}_i=b, $$
(2.24)
$$ {\lambda}_j^i\ge 0,\sum \limits_{j\in \Im (i)}{\lambda}_j^i=1. $$
(2.25)

For solving the multi-instance classification problem, (2.6)–(2.9) can thus be converted into:

$$ \underset{z}{\min}\frac{1}{2}{\left\Vert w\right\Vert}^2+\frac{1}{2}{\left\Vert u\right\Vert}^2+d\sum \limits_{i=1}^{m^{+}}{u}_i+d\sum \limits_{i=z+1}^{z+f}{u}_i-c\sum \limits_{i=1}^{m^{+}}{v}_i-c\sum \limits_{i=z+1}^{z+f}{v}_i, $$
(2.26)
$$ s.t.\left(w\cdot \sum \limits_{j\in \Im (i)}{\lambda}_j^i{x}_j\right)+\left({u}_i-{v}_i\right)=b,i=1,\cdots, {m}^{+}, $$
(2.27)
$$ \left(w\cdot {x}_i\right)-\left({u}_i-{v}_i\right)=b,i=z+1,\cdots, z+f, $$
(2.28)
$$ {\lambda}_j^i\ge 0,j\in \Im (i),i=1,\cdots, {m}^{+}, $$
(2.29)
$$ \sum \limits_{j\in \Im (i)}{\lambda}_j^i=1,i=1,\cdots, {m}^{+}, $$
(2.30)
$$ u,v\ge 0, $$
(2.31)

where \( z={\left({w}^T,{u}^T,{v}^T,b,{\lambda}^T\right)}^T \), \( \lambda =\left\{{\lambda}_j^i\left|j\in \Im (i)\right.,i=1,\cdots, {m}^{+}\right\} \), and \( \Im (i)=\left\{j\left|{x}_j\in {\mathbf{B}}_i^{+}\right.\right\} \).

As both \( {\lambda}_j^i \) and w are variables, constraint (2.27) is no longer linear, and (2.26)–(2.31) becomes a nonlinear optimization problem.

In the following, we give an approximate iterative solution via a sequence of quadratic programming problems. First, we fix λ and solve a quadratic program with respect to w, u, v, b; then we fix w and solve a quadratic program with respect to u, v, b, λ.
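A minimal sketch of this alternating scheme is given below. To stay self-contained it makes two simplifying substitutions, both illustrative assumptions rather than the authors' exact subproblems: the (w, u, v, b)-subproblem is replaced by a ridge least-squares fit instead of the exact RMCLP quadratic program, and the λ-step uses the one-hot (vertex) choice that puts all weight on each positive bag's current best-scoring instance.

```python
import numpy as np

def mi_alternate(pos_bags, neg_instances, iters=10, ridge=1e-3):
    """Alternating scheme for MI-RMCLP-style training (simplified sketch):
    step 1 fixes the convex weights lambda (here: one-hot on each positive
    bag's current best instance) and refits (w, b); step 2 re-selects the
    weights under the new w."""
    neg = np.asarray(neg_instances, float)
    n = neg.shape[1]
    w, b = np.zeros(n), 0.0
    reps = [np.asarray(bag[0], float) for bag in pos_bags]  # initial lambda: first instance
    for _ in range(iters):
        # (w, b)-step: fit scores +1 on bag representatives, -1 on negatives
        X = np.vstack([reps, neg])
        y = np.concatenate([np.ones(len(reps)), -np.ones(len(neg))])
        Xa = np.hstack([X, np.ones((len(X), 1))])           # absorb the intercept
        wb = np.linalg.solve(Xa.T @ Xa + ridge * np.eye(n + 1), Xa.T @ y)
        w, b = wb[:n], -wb[n]
        # lambda-step: one-hot weight on each bag's best-scoring instance
        reps = [max((np.asarray(x, float) for x in bag),
                    key=lambda x: x @ w) for bag in pos_bags]
    return w, b
```

The one-hot λ-step is a vertex of the simplex constraint (2.30), so each iterate stays feasible for the convex-combination constraints while the representative instances and the hyperplane are refined in turn.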

  1.

    For fixed \( {\lambda}_j^i,i=1,\cdots, {m}^{+},j\in \Im (i) \), we can obtain

    $$ {\hat{x}}_i=\sum \limits_{j\in \Im (i)}{\lambda}_j^i{x}_j,i=1,\cdots, {m}^{+}, $$
    (2.32)

    So the problem (2.26)–(2.31) can be written as

    $$ \underset{z}{\min}\frac{1}{2}{w}^T Hw+\frac{1}{2}{u}^T Qu+{de}^Tu-{ce}^Tv, $$
    (2.33)
    $$ s.t.\left(w\cdot {\hat{x}}_i\right)+\left({u}_i-{v}_i\right)=b,i=1,\cdots, {m}^{+}, $$
    (2.34)
    $$ \left(w\cdot {\hat{x}}_i\right)-\left({u}_i-{v}_i\right)=b,i=z+1,\cdots, z+f, $$
    (2.35)
    $$ u,v\ge 0, $$
    (2.36)

    The problem (2.33)–(2.36) is a standard quadratic programming problem, the same as RMCLP. We choose H and Q to be identity matrices. Its dual problem can be formulated as

    $$ \underset{\alpha, u}{\max }-\frac{1}{2}\sum \limits_{i=1}^{m^{+}}\sum \limits_{j=1}^{m^{+}}\left(\left({\hat{x}}_i\cdot {\hat{x}}_j\right)+1\right){\alpha}_i{\alpha}_j $$
    $$ -\frac{1}{2}\sum \limits_{i=1}^{m^{+}}\sum \limits_{j=z+1}^{z+f}\left(\left({\hat{x}}_i\cdot {\hat{x}}_j\right)+1\right){\alpha}_i{\alpha}_j $$
    $$ -\frac{1}{2}\sum \limits_{i=z+1}^{z+f}\sum \limits_{j=1}^{m^{+}}\left(\left({\hat{x}}_i\cdot {\hat{x}}_j\right)+1\right){\alpha}_i{\alpha}_j $$
    (2.37)
    $$ -\frac{1}{2}\sum \limits_{i=z+1}^{z+f}\sum \limits_{j=z+1}^{z+f}\left(\left({\hat{x}}_i\cdot {\hat{x}}_j\right)+1\right){\alpha}_i{\alpha}_j $$
    $$ -\frac{1}{2}\sum \limits_{i=1}^{m^{+}}\sum \limits_{j=1}^{m^{+}}{u}_i{u}_j-\frac{1}{2}\sum \limits_{i=1}^{m^{+}}\sum \limits_{j=z+1}^{z+f}{u}_i{u}_j $$
    $$ -\frac{1}{2}\sum \limits_{i=z+1}^{z+f}\sum \limits_{j=1}^{m^{+}}{u}_i{u}_j-\frac{1}{2}\sum \limits_{i=z+1}^{z+f}\sum \limits_{j=z+1}^{z+f}{u}_i{u}_j $$
    $$ s.t.-{u}_i-d\le {\alpha}_i\le -c,i=1,\cdots, {m}^{+}, $$
    (2.38)
    $$ -{u}_i-d\le -{\alpha}_i\le -c,i=z+1,\cdots, z+f, $$
    (2.39)

    where c, d > 0. We can compute \( \hat{\alpha}={\left({\hat{\alpha}}_1,\cdots, {\hat{\alpha}}_{m^{+}},{\hat{\alpha}}_{z+1},\cdots, {\hat{\alpha}}_{z+f}\right)}^T \) by solving the problem (2.37)–(2.39), and (w, b) can be expressed as

    $$ \hat{w}=-\sum \limits_{i=1}^{m^{+}}{\hat{\alpha}}_i{\hat{x}}_i-\sum \limits_{i=z+1}^{z+f}{\hat{\alpha}}_i{\hat{x}}_i, \vspace*{-12pt}$$
    (2.40)
    $$ \hat{b}=\sum \limits_{i=1}^{m^{+}}{\hat{\alpha}}_i+\sum \limits_{i=z+1}^{z+f}{\hat{\alpha}}_i, $$
    (2.41)

    \( \hat{w},\hat{b} \) is the updated value of (w, b).
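The dual-to-primal recovery in (2.40) and (2.41) amounts to one weighted sum and one plain sum over all multipliers. A minimal sketch with made-up multipliers, where the two index ranges are stacked into a single array:

```python
import numpy as np

def primal_from_dual(alpha, X_hat):
    """Recover (w_hat, b_hat) from dual multipliers via (2.40)-(2.41).

    alpha : (m,) multipliers for all constraints (both groups stacked)
    X_hat : (m, n) representatives/instances in the same order
    """
    w_hat = -X_hat.T @ alpha   # w_hat = -sum_i alpha_i * x_hat_i
    b_hat = alpha.sum()        # b_hat =  sum_i alpha_i
    return w_hat, b_hat

alpha = np.array([0.5, -0.5, 1.0])            # toy multipliers
X_hat = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w_hat, b_hat = primal_from_dual(alpha, X_hat)
```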

  2.

    For fixed w, the problem (2.26)–(2.31) reduces to:

    $$ { {\begin{array}{ll}\displaystyle\underset{\lambda, u,v,b}{\min}\frac{1}{2}\sum \limits_{i=1}^{m^{+}}\sum \limits_{j=1}^{m^{+}}{u}_i{u}_j+\frac{1}{2}\sum \limits_{i=1}^{m^{+}}\sum \limits_{j=z+1}^{z+f}{u}_i{u}_j+\frac{1}{2}\sum \limits_{i=z+1}^{z+f}\sum \limits_{j=1}^{m^{+}}{u}_i{u}_j+\frac{1}{2}\sum \limits_{i=z+1}^{z+f}\sum \limits_{j=z+1}^{z+f}{u}_i{u}_j\end{array}}} \vspace*{-12pt}$$
    $$ +d\sum \limits_{i=1}^{m^{+}}{u}_i+d\sum \limits_{i=z+1}^{z+f}{u}_i-c\sum \limits_{i=1}^{m^{+}}{v}_i-c\sum \limits_{i=z+1}^{z+f}{v}_i $$
    (2.42)
    $$ s.t.\left(w\cdot \sum \limits_{j\in \Im (i)}{\lambda}_j^i{x}_j\right)+\left({u}_i-{v}_i\right)=b,i=1,\cdots, {m}^{+}, \vspace*{-12pt}$$
    (2.43)
    $$ \left(w\cdot {x}_i\right)-\left({u}_i-{v}_i\right)=b,i=z+1,\cdots, z+f, \vspace*{-12pt}$$
    (2.44)
    $$ {\lambda}_j^i\ge 0,j\in \Im (i),i=1,\cdots, {m}^{+}, \vspace*{-12pt}$$
    (2.45)
    $$ \sum \limits_{j\in \Im (i)}{\lambda}_j^i=1,i=1,\cdots, {m}^{+}, \vspace*{-12pt}$$
    (2.46)
    $$ u,v\ge 0, $$
    (2.47)

    thus we can establish the following Algorithm 2.1 based on the formulas above.

Algorithm 2.1 Linear MI-RMCLP

Initialize: Given a training set (see (2.19));

Choose appropriate penalty parameters c, d > 0;

Choose Q and H to be identity matrices;

Setting initial values for λ (k = 1), where \( \left\{{\lambda}_j^i(1)\left|j\in \Im (i),i=1,\cdots, {m}^{+}\right.\right\} \);

Process: 1. For fixed \( \lambda (k)=\left\{{\lambda}_j^i(k)\right\} \), the goal is to compute w(k):

1.1. Compute \( \left\{{\hat{x}}_1,\cdots, {\hat{x}}_{m^{+}},{\hat{x}}_{z+1},\cdots, {\hat{x}}_{z+f}\right\} \) by (2.32);

1.2. Solve quadratic programming (2.37) ~ (2.39),

 obtaining the solution \( \hat{\alpha}={\left({\hat{\alpha}}_1,\cdots, {\hat{\alpha}}_{m^{+}},{\hat{\alpha}}_{z+1},\cdots, {\hat{\alpha}}_{z+f}\right)}^T \);

1.3. Compute \( \hat{w} \) from (2.40);

1.4. Set \( w(k)=\hat{w} \).

2. For fixed w(k), the goal is to compute λ(k + 1):

2.1. Solve quadratic programming (2.42) ~ (2.47) with the

 variables λ, u, v, b, obtaining the solution \( \hat{\lambda},\hat{b} \).

2.2. Set \( \lambda \left(k+1\right)=\hat{\lambda},b\left(k+1\right)=\hat{b} \);

3. If |λ(k + 1) − λ(k)| < ε, go to Output; otherwise,

 go to step 1, setting k = k + 1.

Output: Obtain the decision function f(x) =  sgn ((w∗ ⋅ x) + b∗),

 where w∗ = w(k), b∗ = b(k).
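The outer loop of Algorithm 2.1 is a plain alternating-minimization scheme with the stopping rule |λ(k + 1) − λ(k)| < ε. The skeleton below sketches only that loop; `solve_qp_w` and `solve_qp_lambda` are hypothetical stand-ins for the two quadratic programs (2.33)–(2.36) and (2.42)–(2.47), replaced here by toy fixed-point updates so the loop can be exercised:

```python
import numpy as np

def mi_rmclp_alternate(lam0, solve_qp_w, solve_qp_lambda,
                       eps=1e-4, max_iter=100):
    """Alternating scheme of Algorithm 2.1: fix lambda -> solve for w,
    then fix w -> solve for (lambda, b); stop when lambda stabilizes."""
    lam = lam0
    w = b = None
    for k in range(max_iter):
        w = solve_qp_w(lam)               # step 1: QP with lambda fixed
        lam_new, b = solve_qp_lambda(w)   # step 2: QP with w fixed
        if np.max(np.abs(lam_new - lam)) < eps:   # |lam(k+1) - lam(k)| < eps
            return w, b, lam_new, k + 1
        lam = lam_new
    return w, b, lam, max_iter

# Toy stand-ins: each "QP" is a damped fixed-point update, purely to
# exercise the loop and its stopping rule (not real QP solvers).
solve_w = lambda lam: lam.mean()
solve_lam = lambda w: (0.5 * (np.array([0.3, 0.7]) + np.array([w, 1 - w])), w)
w, b, lam, iters = mi_rmclp_alternate(np.array([0.5, 0.5]), solve_w, solve_lam)
```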

For nonlinear MI-RMCLP, we first introduce the kernel function K(x, x′) = (Φ(x) ⋅ Φ(x′)) to replace (x ⋅ x′), where Φ(x) is a mapping from the input space R n to some Hilbert space ℍ:

$$ \Phi :{R}^n\to \mathrm{\mathbb{H}} \vspace*{-12pt}$$
$$ x\to \mathrm{x}=\Phi (x) $$
(2.48)
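The mapping Φ never has to be formed explicitly; the algorithms only need kernel values K(x, x′) = (Φ(x) ⋅ Φ(x′)). A minimal sketch of a Gram-matrix computation for the RBF kernel used in the experiments below; the parameterization exp(−‖x − x′‖²/2σ²) is one common convention, assumed here:

```python
import numpy as np

def rbf_gram(X, Z, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
K = rbf_gram(X, X, sigma=1.0)
```

By construction the matrix is symmetric with unit diagonal, which is a quick sanity check for any kernel implementation.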

Therefore, the problem (2.26)–(2.31) can be expressed as

$$ \underset{z}{\min}\frac{1}{2}{\left\Vert w\right\Vert}^2+\frac{1}{2}{\left\Vert u\right\Vert}^2+d\sum \limits_{i=1}^{m^{+}}{u}_i+d\sum \limits_{i=z+1}^{z+f}{u}_i-c\sum \limits_{i=1}^{m^{+}}{v}_i-c\sum \limits_{i=z+1}^{z+f}{v}_i \vspace*{-12pt}$$
(2.49)
$$ s.t.\left(\mathrm{w}\cdot \sum \limits_{j\in \Im (i)}{\lambda}_j^i\Phi \left({x}_j\right)\right)+\left({u}_i-{v}_i\right)=b,i=1,\cdots, {m}^{+}, \vspace*{-12pt}$$
(2.50)
$$ \left(\mathrm{w}\cdot \Phi \left({x}_i\right)\right)-\left({u}_i-{v}_i\right)=b,i=z+1,\cdots, z+f, \vspace*{-12pt}$$
(2.51)
$$ {\lambda}_j^i\ge 0,j\in \Im (i),i=1,\cdots, {m}^{+}, \vspace*{-12pt}$$
(2.52)
$$ \sum \limits_{j\in \Im (i)}{\lambda}_j^i=1,i=1,\cdots, {m}^{+}, \vspace*{-12pt}$$
(2.53)
$$ u,v\ge 0, $$
(2.54)

Similar to Algorithm 2.1, for a given λ, the current problem can be solved by the following quadratic programming problem:

$$ \underset{\alpha, u}{\max }-\frac{1}{2}\sum \limits_{i=1}^{m^{+}}\sum \limits_{j=1}^{m^{+}}\left(\sum \limits_{k\in \Im (i)}{\lambda}_k^i\sum \limits_{l\in I(j)}{\lambda}_l^jK\left({x}_k\cdot {x}_l\right)+1\right){\alpha}_i{\alpha}_j \vspace*{-12pt}$$
$$ -\frac{1}{2}\sum \limits_{i=1}^{m^{+}}\sum \limits_{j=z+1}^{z+f}\left(\sum \limits_{k\in \Im (i)}{\lambda}_k^iK\left({x}_k\cdot {x}_j\right)+1\right){\alpha}_i{\alpha}_j \vspace*{-12pt}$$
$$ -\frac{1}{2}\sum \limits_{i=z+1}^{z+f}\sum \limits_{j=1}^{m^{+}}\left(\sum \limits_{l\in I(j)}{\lambda}_l^jK\left({x}_i\cdot {x}_l\right)+1\right){\alpha}_i{\alpha}_j \vspace*{-12pt}$$
(2.55)
$$ -\frac{1}{2}\sum \limits_{i=z+1}^{z+f}\sum \limits_{j=z+1}^{z+f}\left(K\left({x}_i\cdot {x}_j\right)+1\right){\alpha}_i{\alpha}_j \vspace*{-12pt}$$
$$ -\frac{1}{2}\sum \limits_{i=1}^{m^{+}}\sum \limits_{j=1}^{m^{+}}{u}_i{u}_j-\frac{1}{2}\sum \limits_{i=1}^{m^{+}}\sum \limits_{j=z+1}^{z+f}{u}_i{u}_j \vspace*{-12pt}$$
$$ -\frac{1}{2}\sum \limits_{i=z+1}^{z+f}\sum \limits_{j=1}^{m^{+}}{u}_i{u}_j-\frac{1}{2}\sum \limits_{i=z+1}^{z+f}\sum \limits_{j=z+1}^{z+f}{u}_i{u}_j \vspace*{-12pt}$$
(2.56)
$$ s.t.-{u}_i-d\le {\alpha}_i\le -c,i=1,\cdots, {m}^{+}, \vspace*{-12pt}$$
(2.57)
$$ -{u}_i-d\le -{\alpha}_i\le -c,i=z+1,\cdots, z+f, $$
(2.58)

We can obtain a solution of \( \left(\hat{w},\hat{b}\right) \) by computing

$$ \hat{\mathrm{w}}=-\sum \limits_{i=1}^{m^{+}}{\hat{\alpha}}_i\sum \limits_{j\in \Im (i)}{\lambda}_j^i\Phi \left({x}_j\right)-\sum \limits_{i=z+1}^{z+f}{\hat{\alpha}}_i\Phi \left({x}_i\right), \vspace*{-12pt}$$
(2.59)
$$ \hat{b}=\sum \limits_{i=1}^{m^{+}}{\hat{\alpha}}_i+\sum \limits_{i=z+1}^{z+f}{\hat{\alpha}}_i, $$
(2.60)

where \( \hat{\alpha}={\left({\hat{\alpha}}_1,\cdots, {\hat{\alpha}}_{m^{+}},{\hat{\alpha}}_{z+1},\cdots, {\hat{\alpha}}_{z+f}\right)}^T \) is a solution of the problem (2.55)–(2.58).

For fixed w, the problem (2.49)–(2.54) can be written as

$$ {\displaystyle \begin{array}{l}\underset{\lambda, u,v,b}{\min}\frac{1}{2}\sum \limits_{i=1}^{m^{+}}\sum \limits_{j=1}^{m^{+}}{u}_i{u}_j+\frac{1}{2}\sum \limits_{i=1}^{m^{+}}\sum \limits_{j=z+1}^{z+f}{u}_i{u}_j+\frac{1}{2}\sum \limits_{i=z+1}^{z+f}\sum \limits_{j=1}^{m^{+}}{u}_i{u}_j+\frac{1}{2}\sum \limits_{i=z+1}^{z+f}\sum \limits_{j=z+1}^{z+f}{u}_i{u}_j\\ {}+d\sum \limits_{i=1}^{m^{+}}{u}_i+d\sum \limits_{i=z+1}^{z+f}{u}_i-c\sum \limits_{i=1}^{m^{+}}{v}_i-c\sum \limits_{i=z+1}^{z+f}{v}_i\end{array}} \vspace*{-12pt}$$
(2.61)
$$ { {\begin{array}{ll}\displaystyle s.t.-\sum \limits_{j=1}^{m^{+}}{\hat{\alpha}}_j\sum \limits_{k\in I(j)}{\tilde{\lambda}}_k^j\sum \limits_{l\in \Im (i)}{\lambda}_l^iK\left({x}_k,{x}_l\right)-\sum \limits_{j=z+1}^{z+f}{\hat{\alpha}}_j\sum \limits_{l\in \Im (i)}{\lambda}_l^iK\left({x}_j,{x}_l\right)+\left({u}_i-{v}_i\right)=b,\end{array}}} \vspace*{-12pt}$$
$$ i=1,\cdots, {m}^{+}, \vspace*{-12pt}$$
(2.62)
$$ { {\begin{array}{ll}\displaystyle -\sum \limits_{j=1}^{m^{+}}{\hat{\alpha}}_j\sum \limits_{k\in I(j)}{\tilde{\lambda}}_k^jK\left({x}_k,{x}_i\right)-\sum \limits_{j=z+1}^{z+f}{\hat{\alpha}}_jK\left({x}_j,{x}_i\right)-\left({u}_i-{v}_i\right)=b,i=z+1,\cdots, z{+}f\end{array}}} \vspace*{-12pt}$$
(2.63)
$$ {\lambda}_j^i{\ge} 0,j\in \Im (i),i=1,\cdots, {m}^{+}, \vspace*{-12pt}$$
(2.64)
$$ \sum \limits_{j\in \Im (i)}{\lambda}_j^i=1,i=1,\cdots, {m}^{+}, \vspace*{-12pt}$$
(2.65)
$$ u,v\ge 0, $$
(2.66)

where \( \tilde{\lambda}=\left({\tilde{\lambda}}_j^i\left|j\in \Im (i),\right.i=1,\dots, {m}^{+}\right) \) and \( \hat{\alpha}={\left({\hat{\alpha}}_1,\cdots, {\hat{\alpha}}_{m^{+}},{\hat{\alpha}}_{z+1},\cdots, {\hat{\alpha}}_{z+f}\right)}^T \) are known.

The ultimate separating hypersurface can be expressed as

$$ g(x)=-\sum \limits_{j=1}^{m^{+}}{\hat{\alpha}}_j\sum \limits_{k\in I(j)}{\tilde{\lambda}}_k^jK\left({x}_k,x\right)-\sum \limits_{j=z+1}^{z+f}{\hat{\alpha}}_jK\left({x}_j,x\right)+\hat{b}, $$
(2.67)
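Evaluating the separating hypersurface (2.67) at a new point x requires only the multipliers \( \hat{\alpha} \), the bag weights \( \tilde{\lambda} \), and kernel values. A numpy-based sketch with made-up toy values; a linear kernel is used here purely for the quick check:

```python
import numpy as np

def g_nonlinear(x, alpha_pos, bag_pts, bag_lams, alpha_neg, neg_pts, b, kernel):
    """Evaluate the separating hypersurface g(x) of Eq. (2.67).

    alpha_pos : multipliers for the positive-bag constraints
    bag_pts   : list of (k_j, n) arrays, the instances of each positive bag
    bag_lams  : list of (k_j,) convex weights for each bag
    alpha_neg : multipliers for the negative instances
    neg_pts   : (f, n) array of negative instances
    kernel    : function (u, v) -> scalar
    """
    val = 0.0
    for a_j, pts, lam in zip(alpha_pos, bag_pts, bag_lams):
        val -= a_j * sum(l * kernel(p, x) for l, p in zip(lam, pts))
    for a_j, p in zip(alpha_neg, neg_pts):
        val -= a_j * kernel(p, x)
    return val + b

lin = lambda u, v: float(np.dot(u, v))   # linear kernel for a quick check
g = g_nonlinear(np.array([1.0, 1.0]),
                alpha_pos=[-1.0],
                bag_pts=[np.array([[1.0, 0.0], [0.0, 1.0]])],
                bag_lams=[np.array([0.5, 0.5])],
                alpha_neg=[0.5], neg_pts=np.array([[2.0, 0.0]]),
                b=0.25, kernel=lin)
```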

In the following, we present Algorithm 2.2 for nonlinear MI-RMCLP.

Algorithm 2.2 Nonlinear MI-RMCLP

Initialize: Given a training set (see (2.19));

Choose appropriate penalty parameters c, d > 0;

Choose Q and H to be identity matrices;

Choose an appropriate kernel function K(x, x′);

Setting initial values for λ (k = 1), where \( \left\{{\lambda}_j^i(1)\left|j\in \Im (i),i=1,\cdots, {m}^{+}\right.\right\} \);

Process: 1. For fixed \( \lambda (k)=\left\{{\lambda}_j^i(k)\right\} \), the goal is to compute w(k):

1.1. Solve quadratic programming (2.55) ~ (2.58), obtaining the solution.

\( \hat{\alpha}={\left({\hat{\alpha}}_1,\cdots, {\hat{\alpha}}_{m^{+}},{\hat{\alpha}}_{z+1},\cdots, {\hat{\alpha}}_{z+f}\right)}^T \);

1.2. Set \( \tilde{\lambda}=\lambda (k) \);

2. For fixed \( \hat{\alpha},\tilde{\lambda} \), the goal is to compute \( \hat{\lambda}=\left\{{\lambda}_j^i\right\} \):

2.1. Solve quadratic programming (2.61) ~ (2.66) with the

 variables (λ, u, v, b), obtaining the solution \( \hat{\lambda}=\left\{{\lambda}_j^i\right\} \) and \( \hat{b} \).

2.2. Set \( \lambda \left(k+1\right)=\hat{\lambda},b\left(k+1\right)=\hat{b} \);

3. If |λ(k + 1) − λ(k)| < ε, go to Output; otherwise,

 go to step 1, setting k = k + 1.

Output: Obtain the decision function f(x) =  sgn (g(x)),

 where g(x) is given by (2.67).

To demonstrate the capabilities of our algorithm, we report results on 12 data sets: 2 from the UCI machine learning repository [22] and 10 from [23]. The “Elephant,” “Fox,” and “Tiger” data sets come from an image annotation task in which the goal is to determine whether or not a given animal is present in an image. The other seven data sets come from the OHSUMED data, where the task is to learn binary concepts associated with the Medical Subject Headings of MEDLINE documents. The “Musk1” and “Musk2” data sets from the UCI machine learning repository, which involve bags of molecules and their activity levels and are commonly used in multi-instance classification, are used to test our nonlinear multi-instance RMCLP. Detailed information about these data sets can be found in [21].

Our algorithm was implemented in MATLAB 2010 and run on an Intel Core i5 CPU with 2 GB of memory. The “quadprog” function in MATLAB is employed to solve the quadratic programming problems in this section. The testing accuracies for our method are computed using standard tenfold cross-validation [24]. The RBF kernel parameter σ is selected from the set {2i|i =  − 7, ⋯, 7} by tenfold cross-validation on a tuning set comprising a random 10% of the training data. Once the parameters are selected, the tuning set is returned to the training set to learn the final decision function. Both c and d are set to 1. The algorithms stop when the difference between two successive iterations is less than 10−4 or the number of iterations K exceeds 100.
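The tuning protocol above (tenfold cross-validation with σ drawn from {2i|i =  − 7, ⋯, 7}) can be sketched as a simple grid search; `evaluate` is a hypothetical stand-in for training the classifier without the held-out fold and scoring on it:

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """Split range(n) into k roughly equal disjoint folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def select_sigma(n, evaluate, k=10):
    """Pick sigma from {2^i | i = -7..7} by k-fold cross-validation."""
    sigmas = [2.0 ** i for i in range(-7, 8)]
    folds = kfold_indices(n, k)
    return max(sigmas,
               key=lambda s: np.mean([evaluate(s, test_idx) for test_idx in folds]))

# Hypothetical scorer peaking at sigma = 2**3, just to exercise the search
score = lambda s, idx: -abs(np.log2(s) - 3.0)
best_sigma = select_sigma(100, score)
```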

We compare our results with MICA [21], mi-SVM [23], MI-SVM [25], EM-DD [25] and SVM-CC [26]; MI-RMCLP denotes our method in Table 2.1 and Fig. 2.4. The tenfold cross-validation accuracies are listed in Table 2.1 and Fig. 2.4. The results for mi-SVM, MI-SVM and EM-DD are taken from [21].

Table 2.1 Results of all methods in the case of rbf kernel
Fig. 2.4

Results of all methods in the case of linear kernel. X-axis represents different methods: 1: MICA; 2: mi-SVM; 3: MI-SVM; 4: EM-DD; 5: SVM-CC; 6: MI-RMCLP. Y-axis represents the accuracy

1.3 Supportive Instances for Regularized Multiple Criteria Linear Programming Classification

Although RMCLP performs excellently in classifying many benchmark datasets, its shortcomings are also obvious. Because it takes every training instance into consideration, RMCLP is sensitive to noisy and imbalanced training samples. In other words, the classification boundary may shift significantly even if there is merely a slight change in the training sample. This difficulty is illustrated in Fig. 2.5. Assume a two-group classification problem, where the first group is denoted by “.” and the second group is denoted by “☆”. The dataset is linearly separable and the classification boundary is denoted by a line “/”. Figure 2.5a shows that on an ideal training sample, RMCLP successfully classifies all the instances. In Fig. 2.5b, when we add some noisy instances to the first group, the classification boundary shifts towards the first group, so that more instances in the first group are misclassified. In Fig. 2.5c, when we add instances to the second group so that the two groups become imbalanced, the classification boundary also changes significantly, causing a great number of misclassifications. In Fig. 2.5d, if we choose some representative instances (also called supportive instances) for RMCLP, which lie inside the blue circle, then even as more noisy and imbalanced instances are added to the training sample, the classification boundary remains unchanged and retains good predictive ability. That is to say, building the RMCLP model only on supportive instances can improve its accuracy and stability.

Fig. 2.5

(a) The original RMCLP model built on an ideal training sample; (b) when adding two noisy instances in the left side, the classification boundary shifts towards the left side; (c) when the training sample is imbalanced, the boundary also shifts significantly; (d) if we select representative training instances which locate around the distribution centers (inside the circle), the classification boundary becomes satisfactory

According to the above observation, in this subsection we propose a clustering-based sample selection method, which chooses the instances near the clustering center as the supportive instances (just as SVM [27] chooses the support vectors to draw a classification boundary). Experimental results on synthetic and real-life datasets show that our new method not only significantly improves the prediction accuracy, but also dramatically reduces the number of training instances.

Many empirical studies have shown that MCLP is a powerful tool for classification. However, there is no theoretical work on whether MCLP can always find an optimal solution under different kinds of training samples. To overcome this difficulty, [2] recently proposed the RMCLP model by adding two regularization terms \( \frac{1}{2}{x}^T Hx \) and \( \frac{1}{2}{\alpha}^T Q\alpha \) to MCLP as follows:

$$ \boldsymbol{\operatorname{Minimize}}\frac{1}{2}{x}^T Hx+\frac{1}{2}{\alpha}^T Q\alpha +{d}^T\alpha -{c}^T\beta \vspace*{-12pt}$$
(2.68)
$$ \boldsymbol{Subject}\kern0.5em \boldsymbol{to}:{\displaystyle \begin{array}{c}{A}_ix-{\alpha}_i+{\beta}_i=b,\forall {A}_i\in {G}_1;\\ {}{A}_ix+{\alpha}_i-{\beta}_i=b,\forall {A}_i\in {G}_2;\\ {}{\alpha}_i,{\beta}_i\ge 0.\end{array}} $$

where H ∈ R r ∗ r and Q ∈ R n ∗ n are symmetric positive definite matrices and d, c ∈ R n. The RMCLP model is a convex quadratic program. Theoretical studies [2] have shown that RMCLP can always find a global optimal solution.

Besides the two-group classification problem, a recent work [28] also introduced a multiple-group RMCLP model. Consider first the three-group classification problem: we find a projection direction x and a pair of hyperplanes (b 1, b 2) such that, for an arbitrary training instance A i, if A i x < b 1 then A i ∈ G 1; if b 1 ≤ A i x < b 2 then A i ∈ G 2; and if A i x ≥ b 2 then A i ∈ G 3. Extending this method to n-group classification, we can likewise find a direction x and an (n − 1)-dimensional vector b = [b 1, b 2, …, b n − 1] ∈ R n − 1 such that, for any training instance A i:

$$\vspace*{-3pt} {\displaystyle \begin{array}{l}{A}_ix<{b}_1,\forall {A}_i\in {G}_1;\\ {}{b}_{j-1}\le {A}_ix<{b}_j,\forall {A}_i\in {G}_j,1<j<n;\\ {}{A}_ix\ge {b}_{n-1},\forall {A}_i\in {G}_n;\end{array}} \vspace*{-3pt}$$
(2.69)
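The decision rule (2.69) is a simple thresholding of the projected value A i x against the sorted cut points b. A minimal numpy sketch with toy data:

```python
import numpy as np

def assign_group(A, x, b):
    """Assign each instance row A_i to a group via Eq. (2.69):
    group 1 if A_i x < b_1, group j if b_{j-1} <= A_i x < b_j, group n otherwise."""
    scores = A @ x
    # searchsorted with side='right' counts thresholds <= score,
    # which matches the half-open intervals [b_{j-1}, b_j)
    return np.searchsorted(b, scores, side="right") + 1

A = np.array([[1.0, 0.0], [2.0, 0.0], [5.0, 0.0]])   # toy instances
x = np.array([1.0, 0.0])                              # projection direction
b = np.array([1.5, 3.0])                              # two thresholds -> three groups
groups = assign_group(A, x, b)
```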

We first define \( {c}_i=\frac{b_{i-1}+{b}_i}{2} \) as the midline of group i (1 < i < n). Then, for the misclassified records, we define \( {\alpha}_i^{+} \) as the distance from c i to A i x, which equals (c i − A i x), when a group i record is misclassified into group j (j < i), and we define \( {\alpha}_i^{-} \) as the distance from A i x to c i, which equals (A i x − c i), when a group i record is misclassified into group j (j > i). Similarly, for the correctly classified records, we define \( {\beta}_i^{-} \) when A i is on the left side of c i, and \( {\beta}_i^{+} \) when A i is on the right side of c i. For an n-group training sample of size m, we have \( \alpha =\left\{{\alpha}_i^{+},{\alpha}_i^{-}\right\}\in {R}^{m\ast 2} \), \( \beta =\left\{{\beta}_i^{+},{\beta}_i^{-}\right\}\in {R}^{m\ast 2} \), and we can build a multiple-group Regularized Multi-Criteria Linear Programming (SRMCLP) model as follows:

$$ \boldsymbol{\operatorname{Minimize}}\kern0.5em \frac{1}{2}{x}^T Hx+\frac{1}{2}{\alpha}^T Q\alpha +{d}^T\alpha +{c}^T\beta \vspace*{-12pt}$$
$$ { {\begin{array}{ll}\displaystyle \boldsymbol{Subject}\ \boldsymbol{to}:{\displaystyle \begin{array}{c}{A}_ix-{\alpha}_i^{-}-{\beta}_i^{-}+{\beta}_i^{+}=\frac{1}{2}{b}_1,\forall {A}_i\in {G}_1;\\ {}{A}_ix-{\alpha}_i^{-}+{\alpha}_i^{+}-{\beta}_i^{-}+{\beta}_i^{+}=\frac{1}{2}\left({b}_{i-1}+{b}_i\right),\forall {A}_i\in {G}_i,1<i<n;\\ {}{A}_ix+{\alpha}_i^{+}-{\beta}_i^{-}+{\beta}_i^{+}=2{b}_{n-1},\forall {A}_i\in {G}_n;\\ {}{\alpha}_i^{-},{\alpha}_i^{+},{\beta}_i^{-},{\beta}_i^{+}\ge 0.\end{array}}\end{array}}} $$
(2.70)

Since this multiple-group RMCLP model is mainly designed for ordinally separable datasets, we also call it the Ordinal RMCLP model [28].

Figure 2.6 gives the whole procedure of the sample selection algorithm. The main idea is to iteratively discard the training instances in each group that are far away from the clustering center until the clustering center of each group is stable (within a given threshold ε); the remaining instances are then taken as the supportive instances (just as the support vectors in SVM) and used to build a classifier. From Fig. 2.6, we can observe that this algorithm is similar to the well-known k-means algorithm. The main difference is that our algorithm works in a supervised learning framework, while k-means is an unsupervised learning algorithm. In our algorithm, although the clustering centers shift in each iteration, each instance keeps a constant class label, whereas in k-means the class label of each instance may change frequently. An important issue in k-means clustering is how to choose the initial points: a good initial point can lead to a global optimal solution, otherwise we may only get a local optimal solution. By contrast, our sample selection method avoids this problem and always leads to a global minimal solution.
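For a single group, the procedure of Fig. 2.6 can be sketched as the following loop, which repeatedly drops the fraction s of instances farthest from the current centroid until the centroid stabilizes. This is an illustrative reconstruction from the description above, not the authors' Fortran implementation:

```python
import numpy as np

def supportive_instances(X, s=0.1, eps=1e-3, max_iter=100):
    """Iteratively drop the fraction s of instances farthest from the
    current group centroid until the centroid moves less than eps.
    Returns the indices of the retained (supportive) instances."""
    idx = np.arange(len(X))
    center = X[idx].mean(axis=0)
    for _ in range(max_iter):
        dist = np.linalg.norm(X[idx] - center, axis=1)
        keep = max(1, int(np.ceil(len(idx) * (1.0 - s))))
        idx = idx[np.argsort(dist)[:keep]]     # keep the closest instances
        new_center = X[idx].mean(axis=0)
        if np.linalg.norm(new_center - center) < eps:
            break
        center = new_center
    return idx

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               np.array([[25.0, 25.0]])])      # one far-away noisy point
kept = supportive_instances(X, s=0.1)
```

The far-away noisy point is discarded in the first pass, which mirrors how Fig. 2.5b's outliers would be excluded from the supportive set.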

Fig. 2.6

Clustering method to get the supportive instances

There are some important parameters in our algorithm. The first is ε, which determines when the algorithm stops. The second is the exclusion percentage s, which indicates how many instances far away from the clustering center should be discarded in each iteration. This parameter, in fact, determines the convergence speed: the larger the value of s, the faster the algorithm converges. To analyze the computational complexity of our new algorithm, consider an extreme situation. Assume there are n instances in the training sample and we assign the values s = 1 and ε = 0. Then the algorithm discards only one instance in each iteration. In the worst case, after n iterations, the algorithm converges to the clustering center. In the i-th iteration, it needs to examine (n − i) instances to get the clustering center, so we can roughly infer that the computational complexity is about O(n 2).

To investigate whether our new algorithm works, we use two synthetic datasets and a well-known US bank’s real-life credit card dataset for testing. In our experiments, RMCLP is implemented in Visual Fortran 6.5.

The 6000 credit card records are randomly selected from 25,000 real-life credit card records of a major US bank. Each record has 113 variables: 38 original variables and 65 derived variables. The 38 original variables cover balance, purchase, payment, cash advance, and related items; each of the former five items has six variables representing raw data of six consecutive months, and the last item includes interest charges, date of last payment, times of cash advance, account open date, and so on. The 65 derived variables (CHAR01–CHAR65) are derived from the 38 original variables using simple arithmetic methods to reinforce the comprehension of cardholders’ behaviors. In this section, we use the 65 derived variables. We then define five classes for this dataset using a label variable, The Number of Over-limits: Bankrupt charge-off accounts (THE NUMBER OF OVER-LIMITS ≥ 13), Non-bankrupt charge-off accounts (7 ≤ THE NUMBER OF OVER-LIMITS ≤ 12), Delinquent accounts (3 ≤ THE NUMBER OF OVER-LIMITS ≤ 6), Current accounts (1 ≤ THE NUMBER OF OVER-LIMITS ≤ 2), and Outstanding accounts (no over limit). Non-bankrupt charge-off accounts are accounts that have been written off by credit card issuers for reasons other than bankruptcy claims; the charge-off policy may vary among authorized institutions. Delinquent accounts are accounts that have not paid the minimum balances for more than 90 days. Current accounts are accounts that have paid the minimum balances. Outstanding accounts are accounts that have no balances. In our randomly selected 6000 records, there are 72 Bankrupt charge-off accounts, 205 Non-bankrupt charge-off accounts, 454 Delinquent accounts, 575 Current accounts and 4694 Outstanding accounts.

Two-group credit card dataset

To acquire a two-group training sample, we combine the Bankrupt charge-off, Non-bankrupt charge-off and Delinquent accounts into a “bad” group, and the Current and Outstanding accounts into a “good” group. Following previous research on this dataset, we first randomly select a benchmark training set of 700 bad records and 700 good records; the remaining 4600 records are used to test the performance. We then examine three questions: first, are the randomly selected 1400 points suitable for building the model? Second, are there any noisy instances in this randomly selected dataset? Third, can we reduce the 1400 points to a much smaller set and improve the accuracy at the same time? Experimental results in Table 2.2 give the answers. The first column of Table 2.2 is the current training sample’s size, from 1400 instances down to 140 instances; the second and third columns list the performance on the different training samples, and the fourth and fifth columns give the performance on the same 4600 testing instances. The experiment is conducted as follows: first, we build a RMCLP model on all 1400 training instances and obtain a benchmark accuracy of 72.78%. Then we call our sample selection algorithm with parameters s = 1 and ε = 0.1, and run experiments on nine reduced datasets containing 10%, 20%, …, 90% of the original 1400 training samples. The performance of RMCLP is listed in Table 2.2. Intuitively, one might think that the larger the training sample, the more information we could get, and thus the more accurate the model’s predictions. However, from Table 2.2 we can see that the 1400 randomly selected instances are not the best training set for the RMCLP model; there exist noisy and useless instances which deteriorate its performance. Our new sample selection method reduces the training sample continuously. When we keep 20% of the original training sample (that is, 280 instances), we can build a RMCLP model with the highest accuracy of 88.54% on the testing set.

Table 2.2 Comparison of different percentage of training instances

Multiple-group credit card dataset

Besides the two-group RMCLP model, in this part we also study the performance of our new algorithm on the multiple-group RMCLP model. For three-group classification, we choose the Bankrupt charge-off accounts as the first group, the Non-bankrupt charge-off accounts as the second group and the Delinquent accounts as the third group. Based upon the three-group dataset, we construct the four-group dataset by adding the Current accounts as the fourth group. Finally, we construct a five-group dataset by adding the Outstanding accounts as the fifth group.

Tables 2.3, 2.4 and 2.5 list the results of the comparisons. The second and third columns list the results of the original RMCLP method; the fourth and fifth columns list the results of RMCLP after selecting the supportive instances. We can observe that in three-group classification, the original RMCLP’s average accuracy is 72.32%, while that on the supportive instances is 85.71%, an improvement of 13.39%. In four-group classification, the average accuracy of the original RMCLP is 57.05%; by contrast, after selecting the supportive instances the accuracy rises to 82.00%, an improvement of 24.95%. For five-group classification, the improvement after selecting supportive instances is 4.31%. From these comprehensive results, we can validate our former conclusion that selecting supportive instances for RMCLP can significantly improve its accuracy.

Table 2.3 Comparison on three groups credit card dataset
Table 2.4 Comparison on four groups credit card dataset
Table 2.5 Comparison on five groups credit card dataset

1.4 Kernel Based Simple Regularized Multiple Criteria Linear Programming for Binary Classification and Regression

In this section, a novel kernel based regularized multiple criteria linear program is proposed for both classification and regression scenarios.

Given an observed dataset T = {(x 1, y 1), (x 2, y 2), …, (x l, y l)} with l instances, each instance x i belongs to category y i, where x i ∈ χ ⊆ R n holds the n attribute values and y i ∈ y is the corresponding label of instance i. The goal of the classification problem is to predict the label y j ∈ y when a new instance x j ∈ χ arrives. When Card(y) = 2, we have a binary classification problem; to facilitate the description, we let y = {−1, 1} in what follows. In this binary classification problem, suppose the number of positive instances is l 1 and the number of negative instances is l 2, where l 1 + l 2 = l. For two correctly classified points A and B, ξ A = 0 and ξ B = 0, which are not marked in the picture.

In contrast to points A and B, points C and D are improperly predicted, hence their distances satisfy β C = 0, β D = 0 and ξ C > 0, ξ D > 0. In summary, following the idea described above, the basic MCLP model [29] for classification can be written as:

$$ \underset{w,b,\xi, \beta }{\min}\sum \limits_{i=1}^l{\xi}_i \vspace*{-12pt}$$
$$ \underset{w,b,\xi, \beta }{\max}\sum \limits_{i=1}^l{\beta}_i \vspace*{-12pt}$$
(2.71)
$$ s.t.{y}_i\left({x}_i^Tw+b\right)={\beta}_i-{\xi}_i, \vspace*{-12pt}$$
$$ {\xi}_i\ge 0,{\beta}_i\ge 0,i=1,\cdots, l; $$

Here w and b can be seen as the slope and intercept of the discriminant hyperplane. One of the objectives, ∑ξ i, can be considered a measure of misclassification, so we minimize it to avoid an inappropriate model.

The other goal, ∑β i, maximizes the generalization capability of the chosen classification function. As introduced before, no single solution can make both of these conflicting goals optimal at the same time. In [30, 31], a compromise solution is introduced and analyzed for the multiple objective model Eq. (2.71). However, the algorithms that obtain a compromise solution are usually time consuming and not suitable for real-world applications.
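For any fixed (w, b), the constraint of (2.71) simply splits each signed margin m i = y i(x i Tw + b) into its positive part β i and negative part ξ i. A minimal numpy sketch with toy data:

```python
import numpy as np

def margin_split(X, y, w, b):
    """Split m_i = y_i (x_i^T w + b) into beta_i = max(m_i, 0) (correct side)
    and xi_i = max(-m_i, 0) (misclassification), so m_i = beta_i - xi_i."""
    m = y * (X @ w + b)
    return np.maximum(m, 0.0), np.maximum(-m, 0.0)

X = np.array([[2.0, 0.0], [-1.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, -1.0])
beta, xi = margin_split(X, y, w=np.array([1.0, 0.0]), b=0.0)
```

The third toy point lies on the wrong side of the hyperplane, so only its ξ is positive, matching the roles of points C and D above.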

As a result, many methods convert model Eq. (2.71) into a single objective linear program:

$$ \underset{w,b,\xi, \beta }{\min}\sum \limits_{i=1}^l{\xi}_i-\gamma \sum \limits_{i=1}^l{\beta}_i \vspace*{-12pt}$$
$$ s.t.{y}_i\left({x}_i^Tw+b\right)={\beta}_i-{\xi}_i, \vspace*{-12pt}$$
(2.72)
$$ {\xi}_i\ge 0,{\beta}_i\ge 0,i=1,\cdots, l; $$

Unfortunately, the naive model Eq. (2.72) suffers from an unsolvability defect arising from the nature of linear programming, so more sophisticated approaches need to be investigated. An improved model is presented in the next section.

Although model Eq. (2.72) avoids the computational cost of multiple objectives, it has a fatal solvability problem. Therefore, we add a new quadratic term to the objective function and propose a new simple regularized MCLP model, shown below:

$$ \underset{w,b,\xi, \beta }{\min}\sum \limits_{i=1}^l{\xi}_i-\gamma \sum \limits_{i=1}^l{\beta}_i+\frac{1}{2}{\tau \beta}^T H\beta \vspace*{-12pt}$$
$$ s.t.{y}_i\left({x}_i^Tw+b\right)={\beta}_i-{\xi}_i, \vspace*{-12pt}$$
(2.73)
$$ {\xi}_i\ge 0,{\beta}_i\ge 0,i=1,\cdots, l; \vspace*{-12pt}$$
$$ b\in \left\{-1,1\right\}. $$

Furthermore, to keep the slope of the hyperplane from becoming too large, we make the regularization term w T Kw a part of the objective and obtain the following model:

$$ \underset{w,b,\xi, \beta }{\min}\sum \limits_{i=1}^l{\xi}_i-\gamma \sum \limits_{i=1}^l{\beta}_i+\frac{1}{2}{\tau \beta}^T H\beta +\frac{1}{2}\kappa {w}^T Kw \vspace*{-12pt}$$
$$ s.t.{y}_i\left({x}_i^Tw+b\right)={\beta}_i-{\xi}_i, \vspace*{-12pt}$$
(2.74)
$$ {\xi}_i\ge 0,{\beta}_i\ge 0,i=1,\cdots, l; \vspace*{-12pt}$$
$$ b\in \left\{-1,1\right\}; $$

In order to write the formulas in matrix form, we let

$$ A={\left[\begin{array}{c}{x}_1^T\\ {}{x}_2^T\\ {}\vdots \\ {}{x}_l^T\end{array}\right]}_{l\ast n},Y={\left[\begin{array}{cccc}{y}_1& 0& \cdots & 0\\ {}0& {y}_2& \cdots & 0\\ {}\cdots & \cdots & \cdots & \cdots \\ {}0& \cdots & 0& {y}_l\end{array}\right]}_{l\ast l} $$
(2.75)

So model Eq. (2.74) can be rewritten as:

$$ \underset{w,\beta, \xi }{\min}\frac{1}{2}{w}^T Kw+\frac{1}{2}{\lambda}_1{\beta}^T H\beta -{\lambda}_2{e}^T\beta +{\lambda}_3{e}^T\xi \vspace*{-12pt}$$
$$ s.t.Y\left( Aw+ be\right)-\beta +\xi =0, \vspace*{-12pt}$$
(2.76)
$$ b\in \left\{-1,1\right\},\beta \ge 0,\xi \ge 0 $$

where w ∈ R n, β ∈ R l, ξ ∈ R l, and \( e={\left[1,\cdots, 1\right]}_l^T \) is the vector of all ones. K and H are n × n and l × l positive definite matrices, respectively; we simply set H and K in model Eq. (2.76) to identity matrices. To solve this problem with inequality-type constraints, we find the saddle point of the Lagrangian function of model Eq. (2.76):

$$ L\left(w,\beta, \xi, {\alpha}_{equ},{\alpha}_{\beta },{\alpha}_{\xi}\right)=\left(\frac{1}{2}{w}^Tw+\frac{1}{2}{\lambda}_1{\beta}^T\beta -{\lambda}_2{e}^T\beta +{\lambda}_3{e}^T\xi \right) \vspace*{-12pt}$$
$$ +{\alpha}_{equ}^T\left(Y\left( Aw+ be\right)-\beta +\xi \right)-{\alpha}_{\beta}^T\beta -{\alpha}_{\xi}^T\xi $$
(2.77)

where α equ is free, and α β ≥ 0, α ξ ≥ 0 are Lagrangian multipliers. Minimization with respect to w, β, ξ implies the following:

$$ {\nabla}_wL\left(w,\beta, \xi, {\alpha}_{equ},{\alpha}_{\beta },{\alpha}_{\xi}\right)=w+{A}^TY{\alpha}_{equ}=0 \vspace*{-12pt}$$
(2.78)
$$ {\nabla}_{\beta }L\left(w,\beta, \xi, {\alpha}_{equ},{\alpha}_{\beta },{\alpha}_{\xi}\right)={\lambda}_1\beta -{\lambda}_2e-{\alpha}_{equ}-{\alpha}_{\beta }=0 \vspace*{-12pt}$$
(2.79)
$$ {\nabla}_{\xi }L\left(w,\beta, \xi, {\alpha}_{equ},{\alpha}_{\beta },{\alpha}_{\xi}\right)={\lambda}_3e+{\alpha}_{equ}-{\alpha}_{\xi }=0 $$
(2.80)

Substituting Eq. (2.78) into Eq. (2.77), we get

$$ L\left(w,\beta, \xi, {\alpha}_{equ},{\alpha}_{\beta },{\alpha}_{\xi}\right)=-\frac{1}{2}{\alpha}_{equ}^T{YAA}^TY{\alpha}_{equ}-\frac{1}{2}{\lambda}_1{\beta}^T\beta +{be}^TY{\alpha}_{equ} $$

Therefore, the dual problem for model Eq. (2.76) is obtained as

$$ \max -\frac{1}{2}{\alpha}_{equ}^T{YAA}^TY{\alpha}_{equ}-\frac{1}{2}{\lambda}_1{\beta}^T\beta +{be}^TY{\alpha}_{equ} \vspace*{-12pt}$$
$$ s.t.{\lambda}_1\beta -{\lambda}_2e-{\alpha}_{equ}\ge 0, \vspace*{-12pt}$$
$$ {\lambda}_3e+{\alpha}_{equ}\ge 0, \vspace*{-12pt}$$
(2.81)
$$ \beta \ge 0, \vspace*{-12pt}$$
$$ b\in \left\{-1,1\right\} $$

According to Eq. (2.78), the decision function is

$$ f(x)=\mathit{\operatorname{sign}}\left(w\cdot x+b\right)=\mathit{\operatorname{sign}}\left(-{\alpha}_{equ}^T YAx+b\right). $$

When we introduce the kernel mapping

$$ {R}^n\to H \vspace*{-12pt}$$
$$ x\to \Phi (x) $$
(2.82)

We have K(x i, x j) = Φ(x i) ⋅ Φ(x j). Therefore, the dual problem Eq. (2.81) could be rewritten as

$$ \min \frac{1}{2}{\alpha}_{equ}^T YK\left(A,A\right)Y{\alpha}_{equ}+\frac{1}{2}{\lambda}_1{\beta}^T\beta -{be}^TY{\alpha}_{equ} \vspace*{-12pt}$$
$$ s.t.{\lambda}_1\beta -{\lambda}_2e-{\alpha}_{equ}\ge 0, \vspace*{-12pt}$$
$$ {\lambda}_3e+{\alpha}_{equ}\ge 0, \vspace*{-12pt}$$
(2.83)
$$ \beta \ge 0, \vspace*{-12pt}$$
$$ b\in \left\{-1,1\right\} $$

Furthermore, the decision function becomes

$$ f(x)=\mathit{\operatorname{sign}}\left(w\cdot \Phi (x)+b\right)=\mathit{\operatorname{sign}}\left(-{\alpha}_{equ}^T YK\left(A,x\right)+b\right). $$
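In practice K(A, x) is just the vector of kernel values between each training row of A and the new point x, and K(A, A) is the l × l Gram matrix. Below is a minimal sketch with an RBF kernel; the kernel choice, the parameter q, and the values of α equ and b passed in are illustrative assumptions:

```python
import numpy as np

def rbf_gram(A, B, q=1.0):
    """Gram matrix K[i, j] = exp(-q * ||A_i - B_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-q * d2)

def decide(A, Y, alpha_equ, b, x, q=1.0):
    """Kernel decision rule f(x) = sign(-alpha_equ^T Y K(A, x) + b)."""
    k = rbf_gram(A, x[None, :], q).ravel()   # K(A, x), length l
    return np.sign(-alpha_equ @ Y @ k + b)
```

For any A, rbf_gram(A, A) is symmetric with unit diagonal, as a Gram matrix of this kernel must be.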

Theorem 2.2

Given the solution of the dual problem Eq. ( 2.83 ) as \( \left({\alpha}_{equ}^{\ast },{\beta}^{\ast}\right) \) , the solution of its corresponding primal problem w.r.t. H space can be obtained as below:

$$ {w}^{\ast }=-Y\Phi {(A)}^T{\alpha}_{equ}^{\ast } $$
(2.84)

Proof

From dual problem Eq. (2.83), we can get its Lagrangian function as:

$$ L\left({\alpha}_{equ},\beta, {\alpha}_1,{\alpha}_2\right)=\frac{1}{2}{\alpha}_{equ}^T YK\left(A,A\right)Y{\alpha}_{equ}+\frac{1}{2}{\lambda}_1{\beta}^T\beta -{be}^TY{\alpha}_{equ} \vspace*{-12pt}$$
$$ -{\alpha}_1^T\left({\lambda}_1\beta -{\lambda}_2e-{\alpha}_{equ}\right)-{\alpha}_2^T\left({\lambda}_3e+{\alpha}_{equ}\right)-{\alpha}_3^T\beta $$
(2.85)

Where α 1 ≥ 0, α 2 ≥ 0, α 3 ≥ 0. From the KTT condition, we have the equations below:

$$ {\lambda}_1\beta -{\lambda}_2e-{\alpha}_{equ}\ge 0 \vspace*{-12pt}$$
(2.86)
$$ {\lambda}_3e+{\alpha}_{equ}\ge 0 \vspace*{-12pt}$$
(2.87)
$$ \beta \ge 0 \vspace*{-12pt}$$
(2.88)
$$ {\left({\lambda}_1\beta -{\lambda}_2e-{\alpha}_{equ}\right)}^T{\alpha}_1=0 \vspace*{-12pt}$$
(2.89)
$$ {\left({\lambda}_3e+{\alpha}_{equ}\right)}^T{\alpha}_2=0 \vspace*{-12pt}$$
(2.90)
$$ {\beta}^T{\alpha}_3=0 \vspace*{-12pt}$$
(2.91)
$$ {\nabla}_{\alpha_{equ}}L\left({\alpha}_{equ},\beta, {\alpha}_1,{\alpha}_2\right)= YK\left(A,A\right)Y{\alpha}_{equ}- bYe+{\alpha}_1-{\alpha}_2=0 \vspace*{-12pt}$$
(2.92)
$$ {\nabla}_{\beta }L\left({\alpha}_{equ},\beta, {\alpha}_1,{\alpha}_2\right)={\lambda}_1\beta -{\lambda}_1{\alpha}_1-{\alpha}_3=0 $$
(2.93)

Substituting Eq. (2.84) into Eq. (2.92), we obtain

$$ {\nabla}_{\alpha_{equ}}L\left({\alpha}_{equ},\beta, {\alpha}_1,{\alpha}_2\right)= YK\left(A,A\right)Y{\alpha}_{equ}- bYe+{\alpha}_1-{\alpha}_2 \vspace*{-12pt}$$
$$ =-Y\left({w}^{\ast}\cdot \Phi (A)+ be\right)+{\alpha}_1^{\ast }-{\alpha}_2^{\ast }=0 $$
(2.94)

This satisfies the constraints of problem Eq. (2.76) when \( \beta ={\alpha}_1^{\ast } \), \( \xi ={\alpha}_2^{\ast } \). Therefore, \( \left({w}^{\ast },{\alpha}_1^{\ast },{\alpha}_2^{\ast}\right) \) is a feasible solution of the primal problem Eq. (2.76) w.r.t. H space. Furthermore, applying Eqs. (2.89), (2.90) and (2.92), the objective function of the primal problem Eq. (2.76) becomes:

$$ \frac{1}{2}{w}^{\ast T}{w}^{\ast }+\frac{1}{2}{\lambda}_1{\beta}^{\ast T}{\beta}^{\ast }-{\lambda}_2{e}^T{\beta}^{\ast }+{\lambda}_3{e}^T{\xi}^{\ast } \vspace*{-12pt}$$
$$ =-\frac{1}{2}{\alpha}_{equ}^{\ast T} YK\left(A,A\right)Y{\alpha}_{equ}^{\ast }-\frac{1}{2}{\lambda}_1{\beta}^{\ast T}{\beta}^{\ast }+{be}^TY{\alpha}_{equ}^{\ast } $$
(2.95)

As a result, the objective value of the primal problem at the point (w ∗, β ∗, ξ ∗) is the optimal value of its dual problem at the point (α equ ∗, β ∗) w.r.t. H space.

Based on Theorem 2.2, we introduce Algorithm 2.3, the kernel-based simple regularized multiple criteria linear programming (KSRMCLP) algorithm for the binary classification problem.

Given a training set {(x 1, y 1), ⋯, (x l, y l)}, regression differs from classification in that a newly arriving instance x i is assigned not a category label but a real value, y i ∈ R. That is, the possible set of y i changes from the finite label set y to the infinite set R. Following the idea of the ϵ-tube, a model for the regression problem can be constructed from a binary classification model [32]. Given a real number ϵ, two different category points can be generated by adding and subtracting ϵ from the regression output y i. From l instances {(x 1, y 1), ⋯, (x l, y l)} for regression, 2 × l instances {(x 1, y 1 + ϵ)pos, ⋯, (x l, y l + ϵ)pos, (x 1, y 1 − ϵ)neg, ⋯, (x l, y l − ϵ)neg} can be constructed. According to the binary classification model proposed in the previous section, a model for the regression problem is given as:

\( \min \frac{1}{2}{w}^T Hw+\frac{1}{2}{\lambda}_1{\beta}^T K\beta -{\lambda}_2{e}^T\beta +{\lambda}_3{e}^T\xi \)

$$ s.t.Y\left({A}_{reg}w+ be\right)=\beta -\xi, $$
(2.96)
$$ \beta \ge 0,\xi \ge 0 $$

Algorithm 2.3 KSRMCLP Algorithm for Binary Classification

Input:

Training dataset S = {(x 1, y 1), (x 2, y 2), ⋯, (x l, y l)} with l instances, x i ∈ R n and y i ∈ {−1, 1}, kernel function K θ(x i, x j) and its parameters θ, model parameters λ 1 ≥ 0, λ 2 ≥ 0, λ 3 ≥ 0.

Output:

Binary classification discriminant function f(x).

1: Begin

2: Construct data matrix A, label matrix Y according to Eq. (2.75).

\( A={\left[\begin{array}{c}{x}_1^T\\ {}{x}_2^T\\ {}\vdots \\ {}{x}_l^T\end{array}\right]}_{l\ast n} \), \( Y={\left[\begin{array}{cccc}{y}_1& 0& \cdots & 0\\ {}0& {y}_2& \cdots & 0\\ {}\cdots & \cdots & \cdots & \cdots \\ {}0& \cdots & 0& {y}_l\end{array}\right]}_{l\ast l} \)

3: Construct and solve the optimization problem according to model Eq. (2.83).

\( \min \frac{1}{2}{\alpha}_{equ}^T{YK}_{\theta}\left(A,A\right)Y{\alpha}_{equ}+\frac{1}{2}{\lambda}_1{\beta}^T\beta -{be}^TY{\alpha}_{equ} \),

s. t. λ 1 β − λ 2 e − α equ ≥ 0,

λ 3 e + α equ ≥ 0,

β ≥ 0,

$$ b\in \left\{-1,1\right\} $$

4: Obtain the decision function f(x) =  sign (−α equ T YK θ(A, x) + b).

5: End
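Under the hood, Algorithm 2.3 amounts to solving one small quadratic program per candidate b ∈ {−1, 1} and keeping the better of the two. The following is a minimal end-to-end sketch, assuming NumPy/SciPy; the RBF kernel, the SLSQP solver, and the default parameter values are illustrative assumptions, not part of the original algorithm:

```python
import numpy as np
from scipy.optimize import minimize

def ksrmclp_fit(A, y, lam1=1.0, lam2=1.0, lam3=1.0, q=1.0):
    """Sketch of Algorithm 2.3: solve dual (2.83) once per b in {-1, +1}."""
    l = len(y)
    Y = np.diag(y)
    d2 = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-q * d2)                      # RBF Gram matrix K_theta(A, A)
    M = Y @ K @ Y
    e = np.ones(l)

    def solve_for_b(b):
        # Variable z stacks alpha_equ (first l entries) and beta (last l).
        def obj(z):
            a, beta = z[:l], z[l:]
            return 0.5 * a @ M @ a + 0.5 * lam1 * beta @ beta - b * (e @ Y @ a)
        cons = [  # the three inequality blocks of model (2.83)
            {'type': 'ineq', 'fun': lambda z: lam1 * z[l:] - lam2 * e - z[:l]},
            {'type': 'ineq', 'fun': lambda z: lam3 * e + z[:l]},
            {'type': 'ineq', 'fun': lambda z: z[l:]},
        ]
        z0 = np.concatenate([np.zeros(l), (lam2 / lam1) * np.ones(l)])  # feasible start
        res = minimize(obj, z0, method='SLSQP', constraints=cons)
        return res.fun, res.x, b

    _, z, b = min(solve_for_b(-1), solve_for_b(+1), key=lambda t: t[0])
    alpha = z[:l]

    def predict(x):
        k = np.exp(-q * ((A - x) ** 2).sum(axis=1))   # K_theta(A, x)
        return np.sign(-alpha @ Y @ k + b)

    return predict, z, b
```

A general-purpose NLP solver is used here only for brevity; since the dual is a convex quadratic program, any QP solver would do.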

where w ∈ R n + 1, β, ξ ∈ R 2l, and

$$ {A}_{reg}={\left[\begin{array}{c}{x}_1^T,{y}_1+\upepsilon \\ {}\vdots \\ {}{x}_l^T,{y}_l+\upepsilon \\ {}{x}_1^T,{y}_1-\upepsilon \\ {}\vdots \\ {}{x}_l^T,{y}_l-\upepsilon \end{array}\right]}_{2l\times \left(n+1\right)},Y={\left[\begin{array}{cc}{I}_{l\times l}& O\\ {}O& -{I}_{l\times l}\end{array}\right]}_{2l\times 2l} $$
(2.97)
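The ϵ-tube construction of Eq. (2.97) is mechanical: every regression instance is duplicated, once with its target shifted up by ϵ (labeled positive) and once shifted down by ϵ (labeled negative). A minimal sketch assuming NumPy:

```python
import numpy as np

def build_regression_matrices(A, y, eps):
    """Assemble A_reg and the block-diagonal Y of Eq. (2.97):
    2l augmented instances from l regression instances."""
    l = len(y)
    pos = np.hstack([A, (y + eps)[:, None]])   # rows (x_i^T, y_i + eps)
    neg = np.hstack([A, (y - eps)[:, None]])   # rows (x_i^T, y_i - eps)
    A_reg = np.vstack([pos, neg])              # 2l x (n + 1)
    Y = np.diag(np.concatenate([np.ones(l), -np.ones(l)]))
    return A_reg, Y
```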

The constraints of Eq. (2.96) can be divided into two parts, the positive and the negative. For positive points the corresponding target value is y i + ϵ; for negative points it is y i − ϵ. The matrix Y is therefore no longer needed, and the variables β, ξ split into β pos, β neg, ξ pos, ξ neg. Model Eq. (2.96) can then be written as,

\( \min \frac{1}{2}{w}^T Hw+\frac{1}{2}{\lambda}_1{\beta}_{pos}^TK{\beta}_{pos}+\frac{1}{2}{\lambda}_1{\beta}_{neg}^TK{\beta}_{neg}-{\lambda}_2{e}^T\left({\beta}_{pos}+{\beta}_{neg}\right)+{\lambda}_3{e}^T\left({\xi}_{pos}+{\xi}_{neg}\right) \)

$$ s.t. Aw+ be+\eta \left(y+\upepsilon e\right)={\beta}_{pos}-{\xi}_{pos} \vspace*{-12pt}$$
(2.98)
$$ Aw+ be+\eta \left(y-\upepsilon e\right)=-\left({\beta}_{neg}-{\xi}_{neg}\right) \vspace*{-12pt}$$
$$ {\beta}_{pos}\ge 0,{\beta}_{neg}\ge 0,{\xi}_{pos}\ge 0,{\xi}_{neg}\ge 0 $$

where w ∈ R n, β pos, β neg, ξ pos, ξ neg ∈ R l, b ∈ R are variables.

$$ A={\left[\begin{array}{c}{x}_1^T\\ {}{x}_2^T\\ {}\vdots \\ {}{x}_l^T\end{array}\right]}_{l\times n},y={\left[\begin{array}{c}{y}_1\\ {}{y}_2\\ {}\vdots \\ {}{y}_l\end{array}\right]}_{l\times 1} $$
(2.99)

Since η ≠ 0 and w, b, β pos, β neg, ξ pos, ξ neg are all variables, η can be removed from the expression. Model Eq. (2.98) turns into:

\( \min \frac{1}{2}{w}^T Hw+\frac{1}{2}{\lambda}_1{\beta}_{pos}^TK{\beta}_{pos}+\frac{1}{2}{\lambda}_1{\beta}_{neg}^TK{\beta}_{neg}-{\lambda}_2{e}^T\left({\beta}_{pos}+{\beta}_{neg}\right)+{\lambda}_3{e}^T\left({\xi}_{pos}+{\xi}_{neg}\right) \)

$$ s.t. Aw+ be+\left(y+\upepsilon e\right)={\beta}_{pos}-{\xi}_{pos} \vspace*{-12pt}$$
(2.100)
$$ Aw+ be+\left(y-\upepsilon e\right)=-\left({\beta}_{neg}-{\xi}_{neg}\right), \vspace*{-12pt}$$
$$ {\beta}_{pos}\ge 0,{\beta}_{neg}\ge 0,{\xi}_{pos}\ge 0,{\xi}_{neg}\ge 0 $$

where w ∈ R n, β pos, β neg, ξ pos, ξ neg ∈ R l, b ∈ R are variables, and ϵ, λ 1, λ 2, λ 3 ∈ R and the positive definite matrices H, K are given in advance. Following the same procedure as in the previous part, we set K and H to identity matrices; the Lagrangian function of model Eq. (2.100) is then derived as

$$ { {\begin{array}{ll}L\left(w,{\beta}_{pos},{\beta}_{neg},{\xi}_{pos},{\xi}_{neg}\right)=-\frac{1}{2}{\left({\alpha}_{pos}+{\alpha}_{neg}\right)}^T{AA}^T\left({\alpha}_{pos}+{\alpha}_{neg}\right)-\frac{1}{2}{\lambda}_1{\beta}_{pos}^T{\beta}_{pos}\end{array}}} \vspace*{-12pt}$$
$$ { {\begin{array}{ll}-\frac{1}{2}{\lambda}_1{\beta}_{neg}^T{\beta}_{neg}+{\left( be+y\right)}^T\left({\alpha}_{pos}+{\alpha}_{neg}\right)+\upepsilon {e}^T\left({\alpha}_{pos}-{\alpha}_{neg}\right)\end{array}}} $$
(2.101)

where α pos, α neg are free variables and \( {\alpha}_{\beta_{pos}}\ge 0,{\alpha}_{\beta_{neg}}\ge 0,{\alpha}_{\xi_{pos}}\ge 0,{\alpha}_{\xi_{neg}}\ge 0 \) are the corresponding Lagrangian multipliers. From the KKT conditions, we have

$$ {\nabla}_wL\left(w,{\beta}_{pos},{\beta}_{neg},{\xi}_{pos},{\xi}_{neg}\right)=w+{A}^T\left({\alpha}_{pos}+{\alpha}_{neg}\right)=0 \vspace*{-12pt}$$
(2.102)
$$ {\nabla}_{\beta_{pos}}L\left(w,{\beta}_{pos},{\beta}_{neg},{\xi}_{pos},{\xi}_{neg}\right)={\lambda}_1{\beta}_{pos}-{\lambda}_2e-{\alpha}_{pos}-{\alpha}_{\beta_{pos}}=0 \vspace*{-12pt}$$
(2.103)
$$ {\nabla}_{\beta_{neg}}L\left(w,{\beta}_{pos},{\beta}_{neg},{\xi}_{pos},{\xi}_{neg}\right)={\lambda}_1{\beta}_{neg}-{\lambda}_2e+{\alpha}_{neg}-{\alpha}_{\beta_{neg}}=0 \vspace*{-12pt}$$
(2.104)
$$ {\nabla}_{\xi_{pos}}L\left(w,{\beta}_{pos},{\beta}_{neg},{\xi}_{pos},{\xi}_{neg}\right)={\lambda}_3e+{\alpha}_{pos}-{\alpha}_{\xi_{pos}}=0 \vspace*{-12pt}$$
(2.105)
$$ {\nabla}_{\xi_{neg}}L\left(w,{\beta}_{pos},{\beta}_{neg},{\xi}_{pos},{\xi}_{neg}\right)={\lambda}_3e-{\alpha}_{neg}-{\alpha}_{\xi_{neg}}=0 $$
(2.106)

Therefore, the dual problem for model Eq. (2.100) is obtained:

$$ \max -\frac{1}{2}{\left({\alpha}_{pos}+{\alpha}_{neg}\right)}^T{AA}^T\left({\alpha}_{pos}+{\alpha}_{neg}\right)-\frac{1}{2}{\lambda}_1\left({\beta}_{pos}^T{\beta}_{pos}+{\beta}_{neg}^T{\beta}_{neg}\right)+ \vspace*{-12pt}$$
$$ {\left( be+y\right)}^T\left({\alpha}_{pos}+{\alpha}_{neg}\right)+\upepsilon {e}^T\left({\alpha}_{pos}-{\alpha}_{neg}\right) \vspace*{-12pt}$$
$$ s.t.{\lambda}_1{\beta}_{pos}-{\lambda}_2e-{\alpha}_{pos}\ge 0, \vspace*{-12pt}$$
$$ {\lambda}_1{\beta}_{neg}-{\lambda}_2e+{\alpha}_{neg}\ge 0, \vspace*{-12pt}$$
(2.107)
$$ {\lambda}_3e+{\alpha}_{pos}\ge 0, \vspace*{-12pt}$$
$$ {\lambda}_3e-{\alpha}_{neg}\ge 0, \vspace*{-12pt}$$
$$ {\beta}_{pos}\ge 0, \vspace*{-12pt}$$
$$ {\beta}_{neg}\ge 0, \vspace*{-12pt}$$
$$ b\in \left\{-1,1\right\} $$

where α pos, α neg, β pos, β neg ∈ R l, b ∈ R are variables. And ϵ ≥ 0, λ 1 ≥ 0, λ 2 ≥ 0, λ 3 ≥ 0 are given in advance.

When introducing kernel function Eq. (2.82), model Eq. (2.107) turns into

\( \min \frac{1}{2}{\left({\alpha}_{pos}+{\alpha}_{neg}\right)}^TK\left(A,A\right)\left({\alpha}_{pos}+{\alpha}_{neg}\right)+\frac{1}{2}{\lambda}_1\left({\beta}_{pos}^T{\beta}_{pos}+{\beta}_{neg}^T{\beta}_{neg}\right) \vspace*{-12pt}\)

$$ -{\left( be+y\right)}^T\left({\alpha}_{pos}+{\alpha}_{neg}\right)-\upepsilon {e}^T\left({\alpha}_{pos}-{\alpha}_{neg}\right) \vspace*{-12pt}$$
$$ s.t.{\lambda}_1{\beta}_{pos}-{\lambda}_2e-{\alpha}_{pos}\ge 0, \vspace*{-20pt}$$
$$ {\lambda}_3e+{\alpha}_{pos}\ge 0, \vspace*{-12pt}$$
(2.108)
$$ {\lambda}_3e-{\alpha}_{neg}\ge 0, \vspace*{-12pt}$$
$$ {\beta}_{pos}\ge 0, \vspace*{-12pt}$$
$$ {\beta}_{neg}\ge 0, \vspace*{-12pt}$$
$$ b\in \left\{-1,1\right\} $$

From the decision hyperplane w ⋅ x + b + y = 0, the regression function could be obtained as

$$ f(x)=-\left(w\cdot x+b\right)={A}^T\left({\alpha}_{pos}+{\alpha}_{neg}\right)\cdot x-b $$

With a kernel function, the regression function can be derived as

$$ f(x)=\Phi {(A)}^T\left({\alpha}_{pos}+{\alpha}_{neg}\right)\Phi (x)-b=K\left(A,x\right)\left({\alpha}_{pos}+{\alpha}_{neg}\right)-b $$
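Once the dual variables are available, evaluating the kernel regression function requires only kernel evaluations against the training rows of A. A minimal sketch assuming an RBF kernel (the α values passed in would come from solving the dual problem):

```python
import numpy as np

def regress(A, alpha_pos, alpha_neg, b, x, q=1.0):
    """Kernel regression output f(x) = K(A, x)^T (alpha_pos + alpha_neg) - b."""
    k = np.exp(-q * ((A - x) ** 2).sum(axis=1))   # K(A, x), length l
    return k @ (alpha_pos + alpha_neg) - b
```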

Theorem 2.3

Given the solution of Dual Problem Eq. ( 2.108 ) \( \big({\alpha}_{pos}^{\ast },{\alpha}_{neg}^{\ast },{\beta}_{pos}^{\ast }, {\beta}_{neg}^{\ast}\big) \) , the solution of its corresponding primal problem w.r.t. H space can be obtained as below:

$$ {w}^{\ast }=-\Phi \left({A}^T\right)\left({\alpha}_{pos}^{\ast }+{\alpha}_{neg}^{\ast}\right) $$
(2.109)

Proof

From dual problem Eq. (2.108), we can get its Lagrangian function as:

$$ L\left({\alpha}_{pos},{\alpha}_{neg},{\beta}_{pos},{\beta}_{neg}\right)=\frac{1}{2}{\left({\alpha}_{pos}+{\alpha}_{neg}\right)}^TK\left(A,A\right)\left({\alpha}_{pos}+{\alpha}_{neg}\right) \vspace*{-12pt}$$
$$ +\frac{1}{2}{\lambda}_1\left({\beta}_{pos}^T{\beta}_{pos}+{\beta}_{neg}^T{\beta}_{neg}\right) \vspace*{-12pt}$$
$$ -{\left( be+y\right)}^T\left({\alpha}_{pos}+{\alpha}_{neg}\right)-\upepsilon {e}^T\left({\alpha}_{pos}-{\alpha}_{neg}\right) \vspace*{-12pt}$$
$$ -{\alpha}_1^T\left({\lambda}_1{\beta}_{pos}-{\lambda}_2e-{\alpha}_{pos}\right) \vspace*{-12pt}$$
(2.110)
$$ -{\alpha}_2^T\left({\lambda}_1{\beta}_{neg}-{\lambda}_2e+{\alpha}_{neg}\right) \vspace*{-12pt}$$
$$ -{\alpha}_3^T\left({\lambda}_3e+{\alpha}_{pos}\right) \vspace*{-12pt}$$
$$ -{\alpha}_4^T\left({\lambda}_3e-{\alpha}_{neg}\right) \vspace*{-12pt}$$
$$ -{\alpha}_5^T{\beta}_{pos} \vspace*{-12pt}$$
$$ -{\alpha}_6^T{\beta}_{neg} $$

where α 1 ≥ 0, α 2 ≥ 0, α 3 ≥ 0, α 4 ≥ 0, α 5 ≥ 0, α 6 ≥ 0. From the KKT conditions, we have the equations below:

$$ {\lambda}_1{\beta}_{pos}-{\lambda}_2e-{\alpha}_{pos}\ge 0 \vspace*{-12pt}$$
(2.111)
$$ {\lambda}_1{\beta}_{neg}-{\lambda}_2e+{\alpha}_{neg}\ge 0 \vspace*{-12pt}$$
(2.112)
$$ {\lambda}_3e+{\alpha}_{pos}\ge 0 \vspace*{-12pt}$$
(2.113)
$$ {\lambda}_3e-{\alpha}_{neg}\ge 0 \vspace*{-12pt}$$
(2.114)
$$ {\beta}_{pos}\ge 0 \vspace*{-12pt}$$
(2.115)
$$ {\beta}_{neg}\ge 0 \vspace*{-12pt}$$
(2.116)
$$ {\alpha}_1^T\left({\lambda}_1{\beta}_{pos}-{\lambda}_2e-{\alpha}_{pos}\right)=0 \vspace*{-12pt}$$
(2.117)
$$ {\alpha}_2^T\left({\lambda}_1{\beta}_{neg}-{\lambda}_2e+{\alpha}_{neg}\right)=0 \vspace*{-12pt}$$
(2.118)
$$ {\alpha}_3^T\left({\lambda}_3e+{\alpha}_{pos}\right)=0 \vspace*{-12pt}$$
(2.119)
$$ {\alpha}_4^T\left({\lambda}_3e-{\alpha}_{neg}\right)=0 \vspace*{-12pt}$$
(2.120)
$$ {\alpha}_5^T{\beta}_{pos}=0 \vspace*{-12pt}$$
(2.121)
$$ {\alpha}_6^T{\beta}_{neg}=0 \vspace*{-12pt}$$
(2.122)
$$ {\nabla}_{\alpha_{pos}}L=K\left(A,A\right)\left({\alpha}_{pos}+{\alpha}_{neg}\right)-\left( be+y\right)-\upepsilon e+{\alpha}_1-{\alpha}_3=0 \vspace*{-12pt}$$
(2.123)
$$ {\nabla}_{\alpha_{neg}}L=K\left(A,A\right)\left({\alpha}_{pos}+{\alpha}_{neg}\right)-\left( be+y\right)+\upepsilon e-{\alpha}_2+{\alpha}_4=0 \vspace*{-12pt}$$
(2.124)
$$ {\nabla}_{\beta_{pos}}L={\lambda}_1{\beta}_{pos}-{\lambda}_1{\alpha}_1-{\alpha}_5=0 \vspace*{-12pt}$$
(2.125)
$$ {\nabla}_{\beta_{neg}}L={\lambda}_1{\beta}_{neg}-{\lambda}_1{\alpha}_2-{\alpha}_6=0 $$
(2.126)

Substituting Eq. (2.109) into Eqs. (2.123) and (2.124), we have

$$ {\nabla}_{\alpha_{pos}}L\left({\alpha}_{pos},{\alpha}_{neg},{\beta}_{pos},{\beta}_{neg}\right) \vspace*{-12pt}$$
$$ =K\left(A,A\right)\left({\alpha}_{pos}+{\alpha}_{neg}\right)-\left( be+y\right)-\upepsilon e+{\alpha}_1-{\alpha}_3 \vspace*{-12pt}$$
(2.127)
$$ =-\left({w}^{\ast}\cdot \Phi (A)+ be+\left(y+\upepsilon e\right)-{\alpha}_1+{\alpha}_3\right)=0 \vspace*{-12pt}$$
$$ {\nabla}_{\alpha_{neg}}L\left({\alpha}_{pos},{\alpha}_{neg},{\beta}_{pos},{\beta}_{neg}\right) \vspace*{-12pt}$$
$$ =K\left(A,A\right)\left({\alpha}_{pos}+{\alpha}_{neg}\right)-\left( be+y\right)+\upepsilon e-{\alpha}_2+{\alpha}_4 \vspace*{-12pt}$$
(2.128)
$$ =-\left({w}^{\ast}\cdot \Phi (A)+ be+\left(y-\upepsilon e\right)+{\alpha}_2-{\alpha}_4\right)=0 $$

This satisfies the constraints of primal problem Eq. (2.100), so \( \big({w}^{\ast },{\alpha}_1^{\ast },{\alpha}_2^{\ast },{\alpha}_3^{\ast }, {\alpha}_4^{\ast}\big) \) is a feasible solution of the primal problem Eq. (2.100) w.r.t. H space. Furthermore, applying Eqs. (2.117)–(2.122), the objective function of the primal problem Eq. (2.100) becomes:

$$ \frac{1}{2}{w}^Tw+\frac{1}{2}{\lambda}_1\left({\beta}_{pos}^T{\beta}_{pos}+{\beta}_{neg}^T{\beta}_{neg}\right)-{\lambda}_2{e}^T\left({\beta}_{pos}+{\beta}_{neg}\right)+{\lambda}_3{e}^T\left({\xi}_{pos}+{\xi}_{neg}\right) \vspace*{-12pt}$$
$$ =-\frac{1}{2}{\left({\alpha}_{pos}+{\alpha}_{neg}\right)}^TK\left(A,A\right)\left({\alpha}_{pos}+{\alpha}_{neg}\right)-\frac{1}{2}{\lambda}_1\left({\beta}_{pos}^T{\beta}_{pos}+{\beta}_{neg}^T{\beta}_{neg}\right) \vspace*{-12pt}$$
$$ +{\left( be+y\right)}^T\left({\alpha}_{pos}+{\alpha}_{neg}\right)+\upepsilon {e}^T\left({\alpha}_{pos}-{\alpha}_{neg}\right) $$

As a result, the objective value of the primal problem at the point \( \big({w}^{\ast },{\beta}_{pos}^{\ast },{\beta}_{neg}^{\ast },{\xi}_{pos}^{\ast }, {\xi}_{neg}^{\ast}\big) \) is the optimal value of its dual problem at the point \( \left({\alpha}_{pos}^{\ast },{\alpha}_{neg}^{\ast },{\beta}_{pos}^{\ast },{\beta}_{neg}^{\ast}\right) \).

Based on Theorem 2.3, we introduce Algorithm 2.4, the kernel-based simple regularized multiple criteria linear programming (KSRMCLP) algorithm for the regression problem.

Algorithm 2.4 KSRMCLP Algorithm for Regression

Input:

Training dataset S = {(x 1, y 1), (x 2, y 2), ⋯, (x l, y l)}, x i ∈ R n and y i ∈ R. Kernel function K θ(x i, x j) and its parameters θ, model parameters ϵ ≥ 0, λ 1 ≥ 0, λ 2 ≥ 0, λ 3 ≥ 0.

Output:

Regression estimated function f(x).

1: Begin

2: Construct the data matrix A and target value vector y according to the formulas below:

\( A={\left[\begin{array}{c}{x}_1^T\\ {}{x}_2^T\\ {}\vdots \\ {}{x}_l^T\end{array}\right]}_{l\ast n} \), \( y={\left[\begin{array}{c}{y}_1\\ {}{y}_2\\ {}\vdots \\ {}{y}_l\end{array}\right]}_{l\times 1} \)

3: Construct and solve the optimization problem according to Eq. (2.108).

\( \min \frac{1}{2}{\left({\alpha}_{pos}+{\alpha}_{neg}\right)}^T{K}_{\theta}\left(A,A\right)\left({\alpha}_{pos}+{\alpha}_{neg}\right)+\frac{1}{2}{\lambda}_1\left({\beta}_{pos}^T{\beta}_{pos}+{\beta}_{neg}^T{\beta}_{neg}\right) \)

−(be + y)T(α pos + α neg) − ϵe T(α pos − α neg),

s. t. λ 1 β pos − λ 2 e − α pos ≥ 0,

λ 1 β neg − λ 2 e + α neg ≥ 0,

λ 3 e + α pos ≥ 0,

λ 3 e − α neg ≥ 0,

β pos ≥ 0,

β neg ≥ 0,

b ∈ {−1, 1}

4: Obtain the decision function f(x) = K θ(A, x)(α pos + α neg) − b.

5: End

2 Multiple Criteria Linear Programming with Expert and Rule Based Knowledge

2.1 A Group of Knowledge-Incorporated Multiple Criteria Linear Programming Classifier

Prior knowledge in some classifiers usually consists of a set of rules, such as: if A then x ∈ G (or x ∈ B), where condition A is relevant to the attributes of the input data. One example of this form of knowledge is the prediction of breast cancer recurrence or nonrecurrence. Usually, doctors can judge whether the cancer will recur in terms of some measured attributes of the patients. The prior knowledge used by doctors in the breast cancer dataset includes two rules, which depend on two features of the total 32 attributes: tumor size (T) and lymph node status (L). The rules are [33]:

  • If L ≥ 5 and T ≥ 4 Then RECUR
  • If L = 0 and T ≤ 1.9 Then NONRECUR

The conditions L ≥ 5 and T ≥ 4 (or L = 0 and T ≤ 1.9) in the above rules can be written as an inequality Cx ≤ c, where C is a matrix derived from the condition, x represents an individual sample, and c is a vector. For example, suppose each sample x is expressed by a vector [x 1, …, x L, …, x T, …, x r]T, where x L and x T are the values of attributes L and T of a certain sample and r is the number of attributes. Then the rule "if L ≥ 5 and T ≥ 4 then RECUR" means: if x L ≥ 5 and x T ≥ 4, then x ∈ RECUR, and its corresponding inequality Cx ≤ c can be written as:

$$ \left[\begin{array}{ccccccc}0& \dots & -1& \dots & 0& \dots & 0\\ {}0& \dots & 0& \dots & -1& \dots & 0\end{array}\right]x\le \left[\begin{array}{c}-5\\ {}-4\end{array}\right]. $$

where x is the vector of r attributes, including the two features relevant to the prior knowledge.

Similarly, the condition L = 0 and T ≤ 1.9 can also be reformulated as inequalities. To express the condition L = 0 in the form Cx ≤ c, we replace it with the pair of conditions L ≥ 0 and L ≤ 0. The condition L = 0 and T ≤ 1.9 can then be represented by two inequalities, C 1 x ≤ c 1 and C 2 x ≤ c 2, as follows:

$$ \left[\begin{array}{ccccccc}0& \dots & -1& \dots & 0& \dots & 0\\ {}0& \dots & 0& \dots & 1& \dots & 0\end{array}\right]x\le \left[\begin{array}{c}0\\ {}1.9\end{array}\right]\mathrm{and}\ \left[\begin{array}{ccccccc}0& \dots & 1& \dots & 0& \dots & 0\\ {}0& \dots & 0& \dots & 1& \dots & 0\end{array}\right]x\le \left[\begin{array}{c}0\\ {}1.9\end{array}\right] $$

Note that the set {x| Cx ≤ c} is a polyhedral convex set. In Fig. 2.7, the triangle and the rectangle are such sets.
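Since each rule condition is a small linear system, membership in a knowledge set {x| Cx ≤ c} can be checked directly. A sketch for the RECUR rule above, assuming r = 4 attributes with L and T at (hypothetical) positions 1 and 2 of x:

```python
import numpy as np

def in_knowledge_set(C, c, x):
    """True iff x satisfies every row of Cx <= c."""
    return bool(np.all(C @ x <= c))

# Rule "if L >= 5 and T >= 4 then RECUR", written as Cx <= c.
C = np.array([[0.0, -1.0, 0.0, 0.0],    # -x_L <= -5  <=>  L >= 5
              [0.0, 0.0, -1.0, 0.0]])   # -x_T <= -4  <=>  T >= 4
c = np.array([-5.0, -4.0])
```

For example, a sample with L = 6 and T = 4.5 falls inside the set, while one with L = 2 does not.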

Fig. 2.7 The classification result by MCLP (line a) and knowledge-incorporated MCLP (line b)

In a two-class classification problem, the result RECUR or NONRECUR corresponds to the expression x ∈ B or x ∈ G. So, according to the above rules, we have:

$$ Cx\le c\Rightarrow x\in G\kern1em \left( or\kern1em x\in B\right) $$
(2.129)

In the MCLP classifier, if the classes are linearly separable, then x ∈ G is equivalent to x T w ≥ b and, similarly, x ∈ B is equivalent to x T w ≤ b. That is, the following implication must hold:

$$ Cx\le c\Rightarrow {x}^Tw\ge b\kern1em \left( or\kern1em {x}^Tw\le b\right) $$
(2.130)

For a given (w, b), if the implication Cx ≤ c ⇒ x T w ≥ b holds, then the system Cx ≤ c, x T w < b has no solution x. According to the nonhomogeneous Farkas theorem, we can conclude that the system C T u + w = 0, c T u + b ≤ 0, u ≥ 0 has a solution (u, w) [33].
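The Farkas condition is itself a small linear feasibility problem, so for a candidate (w, b) it can be verified numerically. A sketch using SciPy's linprog; the example region and bounding plane in the usage note are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def knowledge_holds(C, c, w, b):
    """Check existence of u >= 0 with C^T u + w = 0 and c^T u + b <= 0,
    i.e. the implication  Cx <= c  =>  x^T w >= b  (nonhomogeneous Farkas)."""
    m = C.shape[0]
    res = linprog(np.zeros(m),               # pure feasibility problem
                  A_ub=c.reshape(1, -1), b_ub=np.array([-b]),
                  A_eq=C.T, b_eq=-w,
                  bounds=[(0, None)] * m, method='highs')
    return res.status == 0                   # 0 means a feasible u was found
```

For the region {x | x 1 ≥ 5, x 2 ≥ 4} (C = −I, c = (−5, −4)), the plane x 1 + x 2 ≥ 9 holds everywhere on it, so knowledge_holds returns True with w = (1, 1), b = 9, and False with b = 10.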

The above statement can be added to the constraints of an optimization problem. In this way, the prior knowledge, in the form of equalities and inequalities in the constraints, is embedded into the original multiple criteria linear programming (MCLP) model. The knowledge-incorporated MCLP model is described in the following.

Knowledge-incorporated MCLP model

Now we explain the knowledge-incorporated MCLP model. This model deals with linear knowledge and linearly separable data. Combining the two kinds of input can help improve the performance of both methods.

Suppose there are a series of knowledge sets as follows:

$$ {\displaystyle \begin{array}{l}\mathrm{If}\kern0.5em {C}^ix\le {c}^i,i=1,\dots, k\kern1em \mathrm{Then}\ x\in G\\ {}\mathrm{If}\ {D}^jx\le {d}^j,j=1,\dots, l\kern1em \mathrm{Then}\ x\in B\end{array}} $$

This knowledge means that the convex sets {x| C i x ≤ c i}, i = 1, …, k lie on the G side of the bounding plane and the convex sets {x| D j x ≤ d j}, j = 1, …, l lie on the B side.

Based on the theory in the last section, we convert the knowledge into the following constraints:

There exist u i, i = 1, …, k, v j, j = 1, …, l, such that:

$$ {\displaystyle \begin{array}{l}{C}^{iT}{u}^i+w=0,\kern1.5em {c}^{iT}{u}^i+b\le 0,\kern1.5em {u}^i\ge 0,\kern1.5em i=1,\dots, k\\ {}{D}^{jT}{v}^j-w=0,\kern1.5em {d}^{jT}{v}^j-b\le 0,\kern1.5em {v}^j\ge 0,\kern1.5em j=1,\dots, l\end{array}} $$
(2.131)

However, there is no guarantee that such bounding planes precisely separate all the points. Therefore, some error variables need to be added to the above formulas. The constraints are further revised to be:

There exist u i, r i, ρ i, i = 1, …, k   and v j, s j, σ j, j = 1, …, l, such that:

$$ {\displaystyle \begin{array}{l}-{r}^i\le {C}^{iT}{u}^i+w\le {r}^i,\kern1.5em {c}^{iT}{u}^i+b\le {\rho}^i,\kern1.5em {u}^i\ge 0,\kern1.5em i=1,\dots, k\\ {}-{s}^j\le {D}^{jT}{v}^j-w\le {s}^j,\kern1.5em {d}^{jT}{v}^j-b\le {\sigma}^j,\kern1.5em {v}^j\ge 0,\kern1.5em j=1,\dots, l\end{array}} $$
(2.132)

After that, we embed the above constraints into the MCLP classifier and obtain the knowledge-incorporated MCLP classifier:

$$ {\displaystyle \begin{array}{l}\operatorname{Minimize}\kern1em {d}_{\alpha}^{+}+{d}_{\alpha}^{-}+{d}_{\beta}^{+}+{d}_{\beta}^{-}+C\left(\sum \left({r}_{\mathrm{i}}+{\rho}^{\mathrm{i}}\right)+\sum \left({s}^{\mathrm{j}}+{\sigma}^{\mathrm{j}}\right)\right)\\ {}\mathrm{Subject}\ \mathrm{to}:\\ {}\kern3.75em {\alpha}^{\ast }+\sum \limits_{i=1}^n{\alpha}_i={d}_{\alpha}^{-}-{d}_{\alpha}^{+}\\ {}\kern2.5em {\beta}^{\ast }-\sum \limits_{i=1}^n{\beta}_i={d}_{\beta}^{-}-{d}_{\beta}^{+}\kern0.5em \\ {}\kern4.5em {x}_{11}{w}_1+\dots +{x}_{1\mathrm{r}}{w}_{\mathrm{r}}=\mathrm{b}+{\alpha}_1-{\beta}_1,\kern1.5em \mathrm{for}\ {\mathrm{A}}_1\in \mathrm{B},\\ {}\kern9em .\\ {}\kern9em .\\ {}\kern9em .\\ {}\kern4.5em {x}_{\mathrm{n}1}{w}_1+\dots +{x}_{\mathrm{n}\mathrm{r}}{w}_{\mathrm{r}}=\mathrm{b}-{\alpha}_{\mathrm{n}}+{\beta}_{\mathrm{n}},\kern0.75em \mathrm{for}\ {\mathrm{A}}_{\mathrm{n}}\in \mathrm{G},\\ {}\kern4.5em -{r}^{\mathrm{i}}\le {\mathrm{C}}^{\mathrm{i}\hbox{'}}{u}^{\mathrm{i}}+w\le {r}^{\mathrm{i}},\kern2.5em \mathrm{i}=1,\dots, \mathrm{k}\\ {}\kern4.5em {\mathrm{c}}^{\mathrm{i}\hbox{'}}{u}^{\mathrm{i}}+b\le {\rho}^{\mathrm{i}}\\ {}\kern4.5em -{s}^{\mathrm{j}}\le {\mathrm{D}}^{\mathrm{j}\hbox{'}}{v}^{\mathrm{j}}-w\le {s}^{\mathrm{j}},\kern2.5em \mathrm{j}=1,\dots, l\\ {}\kern4.5em {\mathrm{d}}^{\mathrm{j}\hbox{'}}{v}^{\mathrm{j}}-b\le {\sigma}^{\mathrm{j}}\\ {}\kern4.5em {\alpha}_1,\dots, {\alpha}_{\mathrm{n}}\ge 0,\kern1.5em {\beta}_1,\dots, {\beta}_{\mathrm{n}}\ge 0,\kern1.5em \left({u}^{\mathrm{i}},{v}^{\mathrm{j}},{r}^{\mathrm{i}},{\rho}^{\mathrm{i}},{s}^{\mathrm{j}},{\sigma}^{\mathrm{j}}\right)\ge 0\end{array}} $$
(2.133)

In this model, all the inequality constraints are derived from the prior knowledge. The last term of the objective, C(∑(r i + ρ i) +  ∑ (s j + σ j)), penalizes the slack error variables added to the original knowledge equality constraints and attempts to drive them to zero. By solving this model, we want to obtain the best bounding plane (w, b) to separate the two classes.

Note that if we set the parameter C to zero, the knowledge is ignored and the model reduces to the original MCLP model. Theoretically, the larger the value of C, the greater the impact of the knowledge sets on the classification result.

Knowledge-incorporated KMCLP Model

If the data set is not linearly separable, the above model is inapplicable. We need to figure out how to embed prior knowledge into the KMCLP model, which can solve nonlinearly separable problems.

As shown above, in generating the KMCLP model we suppose:

$$ w=\sum \limits_{i=1}^n{\lambda}_i{y}_i{X}_i $$
(2.134)

Expressed in matrix form, the above formulation becomes:

$$ w={X}^T Y\lambda $$
(2.135)

where Y is an n*n diagonal matrix whose diagonal elements are the class labels (+1 or −1) of the corresponding sample data, X is the n*r input matrix with n samples and r attributes, and λ = (λ 1, λ 2, …, λ n)T is an n-dimensional vector.

$$ Y=\left[\begin{array}{cccc}{y}_1& 0& \dots & 0\\ {}0& {y}_2& \dots & 0\\ {}\vdots & \vdots & \ddots & \vdots \\ {}0& 0& \dots & {y}_n\end{array}\right],\kern3.5em X=\left[\begin{array}{cccc}{x}_{11}& {x}_{12}& \dots & {x}_{1r}\\ {}{x}_{21}& {x}_{22}& \dots & {x}_{2r}\\ {}\vdots & \vdots & \ddots & \vdots \\ {}{x}_{n1}& {x}_{n2}& \dots & {x}_{nr}\end{array}\right] $$

Therefore, w in the original MCLP model is replaced by X T Yλ, thus forming the KMCLP model. In this new model, the value of each λ i is to be worked out by the optimization model.
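The equivalence of Eqs. (2.134) and (2.135) is easy to confirm numerically; the data below are randomly generated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                 # n = 5 samples, r = 3 attributes
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])   # class labels
lam = rng.uniform(size=5)

w_matrix = X.T @ np.diag(y) @ lam                      # Eq. (2.135)
w_sum = sum(lam[i] * y[i] * X[i] for i in range(5))    # Eq. (2.134)
assert np.allclose(w_matrix, w_sum)
```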

In order to incorporate prior knowledge into the KMCLP model, the inequalities expressing the knowledge must be transformed into a form involving λ i instead of w. Inspired by the KMCLP model, we also introduce kernels into the expressions of knowledge. First, the equalities in (2.131) are multiplied by the input matrix X [34]. Then, replacing w with X T Yλ, (2.131) becomes:

$$ {\displaystyle \begin{array}{l}{XC}^{iT}{u}^i+{XX}^T Y\lambda =0,\kern1.5em {c}^{iT}{u}^i+b\le 0,\kern1.5em {u}^i\ge 0,\kern1.5em i=1,\dots, k\\ {}{XD}^{jT}{v}^j-{XX}^T Y\lambda =0,\kern1.5em {d}^{jT}{v}^j-b\le 0,\kern1.5em {v}^j\ge 0,\kern1.5em j=1,\dots, l\end{array}} $$
(2.136)

A kernel function is introduced here to replace XC iT and XX T. Slack errors are also added to the expressions, and the following constraints are formulated:

$$ {\displaystyle \begin{array}{l}-{r}^{\mathrm{i}}\le K\left(X,{\mathrm{C}}^{\mathrm{i}T}\right){u}^{\mathrm{i}}+K\left(X,{X}^T\right) Y\lambda \le {r}^{\mathrm{i}},\kern2.5em \mathrm{i}=1,\dots, \mathrm{k}\\ {}{\mathrm{c}}^{\mathrm{i}\mathrm{T}}{u}^{\mathrm{i}}+b\le {\rho}^{\mathrm{i}}\\ {}-{s}^{\mathrm{j}}\le K\left(X,{\mathrm{D}}^{\mathrm{j}T}\right){v}^{\mathrm{j}}-K\left(X,{X}^T\right) Y\lambda \le {s}^{\mathrm{j}},\kern2.5em \mathrm{j}=1,\dots, l\\ {}{\mathrm{d}}^{\mathrm{j}T}{v}^{\mathrm{j}}-b\le {\sigma}^{\mathrm{j}}\end{array}} $$
(2.137)

These constraints can easily be embedded into the KMCLP model as the constraints acquired from prior knowledge.

Knowledge-incorporated KMCLP classifier:

$$ {\displaystyle \begin{array}{l}\kern4.5em \operatorname{Min}\left({d}_{\alpha}^{+}+{d}_{\alpha}^{-}+{d}_{\beta}^{+}+{d}_{\beta}^{-}\right)\kern0.5em +C\left(\sum \limits_{i=1}^k\left({r}_{\mathrm{i}}+{\rho}^{\mathrm{i}}\right)+\sum \limits_{j=1}^l\left({s}^{\mathrm{j}}+{\sigma}^{\mathrm{j}}\right)\right)\\ {}\mathrm{s}.\mathrm{t}.\kern3em {\lambda}_1{y}_1K\left({X}_1,{X}_1\right)+\dots +{\lambda}_n{y}_nK\left({X}_n,{X}_1\right)=b+{\alpha}_1-{\beta}_1,,\kern1.5em \mathrm{for}\ {X}_1\in \mathrm{B},\\ {}\kern9em .\\ {}\kern9em .\\ {}\kern9em .\\ {}\kern4.5em {\lambda}_1{y}_1K\left({X}_1,{X}_n\right)+\dots +{\lambda}_n{y}_nK\left({X}_n,{X}_n\right)=b-{\alpha}_n+{\beta}_n,\kern0.75em \mathrm{for}\ {X}_{\mathrm{n}}\in \mathrm{G},\\ {}\kern4.5em {\alpha}^{\ast }+\sum \limits_{i=1}^n{\alpha}_i={d}_{\alpha}^{-}-{d}_{\alpha}^{+},\\ {}\kern4.5em {\beta}^{\ast }-\sum \limits_{i=1}^n{\beta}_i={d}_{\beta}^{-}-{d}_{\beta}^{+},\\ {}\kern4.5em -{r}^{\mathrm{i}}\le K\left(X,{\mathrm{C}}^{\mathrm{i}T}\right){u}^{\mathrm{i}}+K\left(X,{X}^T\right) Y\lambda \le {r}^{\mathrm{i}},\kern2.5em \mathrm{i}=1,\dots, \mathrm{k}\\ {}\kern4.5em {\mathrm{c}}^{\mathrm{i}\mathrm{T}}{u}^{\mathrm{i}}+b\le {\rho}^{\mathrm{i}}\\ {}\kern4.5em -{s}^{\mathrm{j}}\le K\left(X,{\mathrm{D}}^{\mathrm{j}T}\right){v}^{\mathrm{j}}-K\left(X,{X}^T\right) Y\lambda \le {s}^{\mathrm{j}},\kern2.5em \mathrm{j}=1,\dots, l\\ {}\kern4.5em {\mathrm{d}}^{\mathrm{j}T}{v}^{\mathrm{j}}-b\le {\sigma}^{\mathrm{j}}\\ {}\kern4.5em {\alpha}_1,\dots, {\alpha}_{\mathrm{n}}\ge 0,\kern1.5em {\beta}_1,\dots, {\beta}_{\mathrm{n}}\ge 0,\kern1.5em {\lambda}_1,\dots, {\lambda}_{\mathrm{n}}\ge 0,\\ {}\kern4.5em \left({u}^{\mathrm{i}},{v}^{\mathrm{j}},{r}^{\mathrm{i}},{\rho}^{\mathrm{i}},{s}^{\mathrm{j}},{\sigma}^{\mathrm{j}}\right)\ge 0\\ {}\kern4.5em {d}_{\alpha}^{-},{d}_{\alpha}^{+},{d}_{\beta}^{-},{d}_{\beta}^{+}\ge 0\end{array}} $$
(2.138)

In this model, all the inequality constraints are derived from prior knowledge, and u i, v j ∈ R p, where p is the number of conditions in one piece of knowledge. For example, in the knowledge "if x L ≥ 5 and x T ≥ 4, then x ∈ RECUR", the value of p is 2. r i, ρ i, s j and σ j are all real numbers, and the last objective term Min ∑ (r i + ρ i) +  ∑ (s j + σ j) penalizes the slack error variables added to the original knowledge equality constraints. As discussed in the last section, the larger the value of C, the greater the impact of the knowledge sets on the classification result.

In this model, several parameters need to be set before the optimization process. Apart from C, discussed above, they are the kernel parameter q (if we choose the RBF kernel) and the ideal compromise solutions α* and β*. By solving this model we obtain the best bounding plane (λ, b) that separates the two classes. The discrimination function of the two classes is:

$$ {\displaystyle \begin{array}{l}{\lambda}_1{y}_1K\left({X}_1,z\right)+\dots +{\lambda}_n{y}_nK\left({X}_n,z\right)\le b,\kern1.5em then\kern1em z\in B\\ {}{\lambda}_1{y}_1K\left({X}_1,z\right)+\dots +{\lambda}_n{y}_nK\left({X}_n,z\right)\ge b,\kern1.5em then\kern1em z\in G\end{array}} $$
(2.139)

where z is the input data point to be evaluated, with r attributes; X i represents the ith training sample, and y i is the class label of the ith sample.
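As an illustration, the discrimination function (2.139) can be sketched in Python with an RBF kernel. This is only a sketch: the training samples `X`, labels `y`, multipliers `lam`, and offset `b` stand in for the values a solved KMCLP model would provide.

```python
import numpy as np

def rbf_kernel(u, v, q=1.0):
    # RBF kernel: K(u, v) = exp(-q * ||u - v||^2)
    return np.exp(-q * np.sum((u - v) ** 2))

def kmclp_score(z, X, y, lam, q=1.0):
    # Left-hand side of (2.139): sum_i lambda_i * y_i * K(X_i, z)
    return sum(l * yi * rbf_kernel(xi, z, q) for xi, yi, l in zip(X, y, lam))

def classify(z, X, y, lam, b, q=1.0):
    # Score <= b puts z in class B (coded -1 here); otherwise class G (+1)
    return -1 if kmclp_score(z, X, y, lam, q) <= b else 1
```

A point close to a negative training sample gets a negative score and falls in B; one close to a positive sample falls in G.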

In the above models, the prior knowledge we deal with is linear. That is, the conditions in the above rules can be written as inequalities of the form Cx ≤ c, where C is a matrix derived from the condition, x represents an individual sample, and c is a vector. The set {x| Cx ≤ c} is a polyhedral convex set, a linear geometry in the input space. But if the region representing the knowledge is nonlinear in shape, for example {x| ||x||2 ≤ c}, how can such knowledge be handled?

Suppose the region is a nonlinear convex set described by g(x) ≤ 0. If a data point lies in this region, it must belong to class B. Such nonlinear knowledge may then take the form:

$$ {\displaystyle \begin{array}{l}g(x)\le 0\kern1em \Rightarrow \kern1em x\in B\kern0.5em \\ {}h(x)\le 0\kern1em \Rightarrow \kern1em x\in G\end{array}} $$
(2.140)

Here g(x): R r → R p (x ∈ Γ) and h(x): R r → R q (x ∈ Δ) are functions defined on subsets Γ and Δ of R r that determine the regions in the input space. All data satisfying g(x) ≤ 0 must belong to class B, and all data satisfying h(x) ≤ 0 to class G.

With the KMCLP classifier, this knowledge is equivalent to:

$$ {\displaystyle \begin{array}{l}g(x)\le 0\kern1em \Rightarrow \kern1em {\lambda}_1{y}_1K\left({X}_1,x\right)+\dots +{\lambda}_n{y}_nK\left({X}_n,x\right)\le b,\kern0.5em \left(x\in \Gamma \right)\kern1em \\ {}h(x)\le 0\kern1em \Rightarrow \kern1em {\lambda}_1{y}_1K\left({X}_1,x\right)+\dots +{\lambda}_n{y}_nK\left({X}_n,x\right)\ge b,\kern0.5em \left(x\in \Delta \right)\end{array}} $$
(2.141)

This implication can be written in the following equivalent logical form [35]:

$$ {\displaystyle \begin{array}{l}g(x)\le 0\kern1em ,\kern0.5em {\lambda}_1{y}_1K\left({X}_1,x\right)+\dots +{\lambda}_n{y}_nK\left({X}_n,x\right)-b>0,\mathrm{has}\ \mathrm{no}\ \mathrm{solution}\ x\in \Gamma .\\ {}h(x)\le 0\kern1em ,\kern0.5em {\lambda}_1{y}_1K\left({X}_1,x\right)+\dots +{\lambda}_n{y}_nK\left({X}_n,x\right)-b<0,\mathrm{has}\ \mathrm{no}\ \mathrm{solution}\ x\in \Delta .\end{array}} $$
(2.142)

If the above expressions hold, then there exist v ∈ R p and r ∈ R q with v, r ≥ 0 such that:

$$ {\displaystyle \begin{array}{l}-{\lambda}_1{y}_1K\left({X}_1,x\right)-\dots -{\lambda}_n{y}_nK\left({X}_n,x\right)+b+{v}^Tg(x)\ge 0,\kern0.5em \left(x\in \Gamma \right)\kern0.5em \\ {}{\lambda}_1{y}_1K\left({X}_1,x\right)+\dots +{\lambda}_n{y}_nK\left({X}_n,x\right)-b+{r}^Th(x)\ge 0,\kern0.5em \left(x\in \Delta \right)\end{array}} $$
(2.143)

Adding slack variables s and t to the above two inequalities converts them to:

$$ {\displaystyle \begin{array}{l}-{\lambda}_1{y}_1K\left({X}_1,x\right)-\dots -{\lambda}_n{y}_nK\left({X}_n,x\right)+b+{v}^Tg(x)+s\ge 0,\kern0.5em \left(x\in \Gamma \right)\\ {}{\lambda}_1{y}_1K\left({X}_1,x\right)+\dots +{\lambda}_n{y}_nK\left({X}_n,x\right)-b+{r}^Th(x)+t\ge 0,\kern0.5em \left(x\in \Delta \right)\end{array}} $$
(2.144)

These statements can then be added as constraints to an optimization problem.

Suppose there is a series of knowledge sets as follows:

$$ \mathrm{If}\ {g}_i(x)\le 0,\mathrm{Then}\ x\in B\kern0.5em \left({g}_i(x):{R}^r\to {R}^{p_i}\ \left(x\in {\Gamma}_i\right),i=1,\dots, k\right) $$
$$ \mathrm{If}\ {h}_j(x)\le 0,\mathrm{Then}\ x\in G\kern0.5em \left({h}_j(x):{R}^r\to {R}^{q_j}\ \left(x\in {\Delta}_j\right),j=1,\dots, l\right) $$

Based on the theory of the last section, this knowledge is converted to the following constraints:

There exist \( {v}_i\in {R}^{p_i} \), i = 1, …, k, and \( {r}_j\in {R}^{q_j} \), j = 1, …, l, with v i, r j ≥ 0, such that:

$$ {\displaystyle \begin{array}{l}-{\lambda}_1{y}_1K\left({X}_1,x\right)-\dots -{\lambda}_n{y}_nK\left({X}_n,x\right)+b+{v_i}^T{g}_i(x)+{s}_i\ge 0,\kern0.5em \left(x\in {\Gamma}_i\right)\\ {}{\lambda}_1{y}_1K\left({X}_1,x\right)+\dots +{\lambda}_n{y}_nK\left({X}_n,x\right)-b+{r_j}^T{h}_j(x)+{t}_j\ge 0,\kern0.5em \left(x\in {\Delta}_j\right)\end{array}} $$
(2.145)

These constraints can easily be imposed on the KMCLP model as the constraints acquired from prior knowledge.

Nonlinear knowledge in KMCLP classifier [36]:

$$ {\displaystyle \begin{array}{l}\kern3em \operatorname{Min}\left({d}_{\alpha}^{+}+{d}_{\alpha}^{-}+{d}_{\beta}^{+}+{d}_{\beta}^{-}\right)\kern0.5em +C\left(\sum \limits_{i=1}^k{s}_i+\sum \limits_{j=1}^l{t}_j\right)\\ {}\mathrm{s}.\mathrm{t}.\kern3em {\lambda}_1{y}_1K\left({X}_1,{X}_1\right)+\dots +{\lambda}_n{y}_nK\left({X}_n,{X}_1\right)=b+{\alpha}_1-{\beta}_1,\kern1.5em \mathrm{for}\ {X}_1\in \mathrm{B},\\ {}\kern9em \vdots \\ {}\kern3em {\lambda}_1{y}_1K\left({X}_1,{X}_n\right)+\dots +{\lambda}_n{y}_nK\left({X}_n,{X}_n\right)=b-{\alpha}_n+{\beta}_n,\kern0.75em \mathrm{for}\ {X}_{\mathrm{n}}\in \mathrm{G},\\ {}\kern3em {\alpha}^{\ast }+\sum \limits_{i=1}^n{\alpha}_i={d}_{\alpha}^{-}-{d}_{\alpha}^{+},\\ {}\kern3em {\beta}^{\ast }-\sum \limits_{i=1}^n{\beta}_i={d}_{\beta}^{-}-{d}_{\beta}^{+},\\ {}\kern3em -{\lambda}_1{y}_1K\left({X}_1,x\right)-\dots -{\lambda}_n{y}_nK\left({X}_n,x\right)+b+{v_i}^T{g}_i(x)+{s}_i\ge 0,\kern2.5em \mathrm{i}=1,\dots, \mathrm{k}\\ {}\kern3em {s}_i\ge 0,\kern2.5em \mathrm{i}=1,\dots, \mathrm{k}\kern3em \\ {}\kern3em {\lambda}_1{y}_1K\left({X}_1,x\right)+\dots +{\lambda}_n{y}_nK\left({X}_n,x\right)-b+{r_j}^T{h}_j(x)+{t}_j\ge 0,\kern2.5em \mathrm{j}=1,\dots, l\\ {}\kern3em {t}_j\ge 0,\kern2em \mathrm{j}=1,\dots, l\\ {}\kern3em {\alpha}_1,\dots, {\alpha}_{\mathrm{n}}\ge 0,\kern1.5em {\beta}_1,\dots, {\beta}_{\mathrm{n}}\ge 0,\kern1.5em {\lambda}_1,\dots, {\lambda}_{\mathrm{n}}\ge 0,\\ {}\kern3em \left({v}_i,{r}_j\right)\ge 0\\ {}\kern3em {d}_{\alpha}^{-},{d}_{\alpha}^{+},{d}_{\beta}^{-},{d}_{\beta}^{+}\ge 0\end{array}} $$
(2.146)

In this model, all the inequality constraints are derived from the prior knowledge. The last objective term \( C\left(\sum \limits_{i=1}^k{s}_i+\sum \limits_{j=1}^l{t}_j\right) \) penalizes the slack errors. Theoretically, the larger the value of C, the greater the impact of the knowledge sets on the classification result.

The parameters that need to be set before the optimization process are C, q (if we choose the RBF kernel), α*, and β*. The best bounding plane of this model, determined by (λ, b), discriminates the two classes in the same way as formula (2.139).

2.2 Decision Rule Extraction for Regularized Multiple Criteria Linear Programming Model

In this section, we present a clustering-based rule extraction method to generate decision rules from the black-box RMCLP model. Our method can improve the interpretability of the RMCLP model by producing explicit and explainable decision rules. To achieve this goal, a clustering algorithm is first used to generate prototypes (the clustering centers) for each group of examples identified by the RMCLP model. Then, hyper cubes (whose edges are parallel to the axes) are extracted around each prototype. This procedure is repeated until all the training examples are covered by a hyper cube. Finally, the hyper cubes are translated into a set of if-then decision rules. Experiments on both synthetic and real-world data sets demonstrate the effectiveness of our rule extraction method.

For ease of description, we first introduce some notation. Assume an r-dimensional space in which the coordinate of the clustering center p is p = (p 1, …, p r), and the classification hyper plane is \( {\sum}_{i=1}^r{a}_i{x}_i=b \) (where the x i define the direction of the hyper plane). For each class, we prefer hyper cubes that cover as many examples as possible. Intuitively, if we pick a point u on the classification boundary and then draw a cube based on both the clustering center p and u, the generated hyper cube will cover the largest area with respect to the current prototype p. The distance from p to the hyper plane can be calculated by Eq. (2.147) as follows:

$$ d= Distance\left(f,p\right)=\frac{\left|{\sum}_{i=1}^r{p}_i{x}_i-b\right|}{\sqrt{{\sum}_{i=1}^r{x}_i^2}} $$
(2.147)

After computing d, Step 2.3 draws hyper cubes H = DrawHC(d, P i ) using the prototype point P i as the central point; each edge has length \( \sqrt{2}d \) and is parallel to an axis. In this way, we obtain if-then rules that are easily understood. For example, for a specific example a 1 ∈ G 1, a decision rule can be described in the following form:

$$ {\displaystyle \begin{array}{l} if\kern0.5em \left({l}_1\le {a}_{11}\le {u}_1\right)\ and\ \left({l}_2\le {a}_{12}\le {u}_2\right)\dots \dots and\ \left({l}_r\le {a}_{1r}\le {u}_r\right)\kern0.5em \\ {} then\ {a}_1\ belongs\ to\ class\ 1\end{array}} $$
(2.148)
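The distance computation (2.147) and the hypercube construction of Step 2.3 can be sketched as follows. Here `w` plays the role of the direction vector (x 1, …, x r) of the hyper plane, and the function names are illustrative, not from the original implementation.

```python
import numpy as np

def distance_to_plane(p, w, b):
    # Eq. (2.147): distance from prototype p to the hyper plane w . x = b
    return abs(np.dot(w, p) - b) / np.linalg.norm(w)

def draw_hypercube(p, d):
    # Axis-parallel hypercube centred at prototype p with edge length
    # sqrt(2) * d; the (lower, upper) corners translate directly into an
    # if-then rule of the form (2.148)
    half = np.sqrt(2) * d / 2.0
    return p - half, p + half
```

In two dimensions the cube is a square whose vertex can touch the boundary, which is why the edge length is \( \sqrt{2}d \) rather than 2d.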

Figure 2.8 illustrates an example with two dimensions. Examples in G1 (a i ∈ G 1) are covered by hyper cubes with a central point as its clustering center and a vertex on the hyper plane \( {\sum}_{i=1}^r{a}_i{x}_i=b \).

Fig. 2.8
figure 8

An illustration of Algorithm 2.5 which generates hyper cubes from RMCLP models. Based on the RMCLP model’s decision boundary (the red line), Algorithm 2.5 first calculates several clustering centers for each class (e.g., the red circle in Group 1), then it calculates the distance d from the classification boundary to the clustering center (the blue line). After that, it generates a series of hyper cubes. Each hyper cube’s edge is parallel to the axes and the length is \( \sqrt{2}d \). Finally, the hyper cubes can be easily translated into rules that are explainable and understandable

The main computational cost of Algorithm 2.5 is from Steps 2.1~2.3, where a K-Means clustering model and two distance functions are calculated. Assume there are l iterations of K-Means. In each iteration, there are k clusters. Therefore, the total time complexity of K-Means will be O(lknr), where n is the number of training examples, r is the number of dimensions.

On the other hand, calculating the distance d for each clustering center by (2.147) takes linear time, so the computational cost of Step 2.2 is O(k) for k clustering centers. Finally, the time cost of extracting hyper cubes in Step 2.3 is O(kr) for k clustering centers in r-dimensional space. In sum, the total computational complexity of Algorithm 2.5 is given by (2.149),

$$ O(lknr)+O(k)+O(kr)=O(lknr) $$
(2.149)

The above analysis indicates that the hyper cube extraction in Steps 2.2 and 2.3 is dominated by the K-Means clustering in Step 2.1, and that the algorithm is linear in the number of training examples.

Algorithm 2.5 Extract Rules from MCLP Models

Input: The data set A = {a 1, a 2, …, a n}, RMCLP model f

Output: Rule Set {w}

Begin

Step 1. Classify all the examples in A using model f;

Step 2. Define Covered set C = Φ, Uncovered set U = A;

Step 3.While (U is not empty) do

Step 3.1  For each group G i,

       Calculate the clustering center P i = K-means(G i ∩ U);

     End for

Step 3.2  Calculate distances between each P i and boundary d = Distance(f, P i );

Step 3.3  Draw a new hypercube H = DrawHC(d, P i );

Step 3.4  For all the examples a i ∈ U,

     If a i is covered by H

       U = U\a i, C = C ∪ a i ;

      End If

    End For

  End While

Step 4  Translate each hypercube H into rule;

Step 5  Return the rule set {w}

End
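Algorithm 2.5 can be sketched in Python under a simplifying assumption: one clustering centre per group, since k-means with k = 1 reduces to the group mean. The vector `w` and scalar `b` describe a linear decision boundary standing in for the trained RMCLP model f; a safety guard is added so the sketch always terminates.

```python
import numpy as np

def extract_rules(A, labels, w, b):
    # Returns rules as (lower corner, upper corner, group) triples
    uncovered, rules = set(range(len(A))), []
    while uncovered:                                        # Step 3
        removed = False
        for g in np.unique(labels):
            idx = [i for i in uncovered if labels[i] == g]
            if not idx:
                continue
            p = A[idx].mean(axis=0)                         # Step 3.1: prototype
            d = abs(np.dot(w, p) - b) / np.linalg.norm(w)   # Step 3.2: distance
            half = max(np.sqrt(2) * d / 2, 1e-6)            # Step 3.3: edge / 2
            lo, hi = p - half, p + half
            rules.append((lo, hi, g))
            for i in list(uncovered):                       # Step 3.4: cover
                if labels[i] == g and np.all(A[i] >= lo) and np.all(A[i] <= hi):
                    uncovered.discard(i)
                    removed = True
        if not removed:        # guard for this sketch: stop if nothing covered
            break
    return rules               # Steps 4-5: each box reads off as an if-then rule
```

Each returned box (lo, hi, g) translates directly into a rule of the form (2.148).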

To demonstrate the effectiveness of the proposed rules extraction method, we will test our method on both synthetic and real-world data sets. The whole testing system is implemented in a Java environment by integrating WEKA data mining tools [37]. The clustering method used in our experiments is the simple K-Means package in WEKA.

As shown in Fig. 2.9a, we generate a 2-dimensional, 2-class data set containing 60 examples, 30 for each class. In each class, we use 50% of the examples to train the RMCLP model; that is, 30 training examples in total. All examples follow a Gaussian distribution x~N(μ, Σ), where μ is the mean vector and Σ is the covariance matrix. The first group is generated with mean vector μ 1 = [1,1] and covariance matrix \( {\Sigma}_1=\left[\begin{array}{l}\ 0.1\kern0.75em 0\kern0.5em \\ {}\kern0.5em 0\kern0.75em 0.1\end{array}\right] \). The second group is generated with mean vector μ 2 = [2,2] and covariance matrix Σ2 = Σ1.
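A data set with the same distributional parameters can be reproduced along these lines; the seed and generator choice are arbitrary, not taken from the original experiment.

```python
import numpy as np

rng = np.random.default_rng(0)                    # arbitrary seed
mu1, mu2 = np.array([1.0, 1.0]), np.array([2.0, 2.0])
cov = 0.1 * np.eye(2)                             # Sigma_1 = Sigma_2 = diag(0.1, 0.1)
g1 = rng.multivariate_normal(mu1, cov, size=30)   # group 1
g2 = rng.multivariate_normal(mu2, cov, size=30)   # group 2
X = np.vstack([g1, g2])                           # 60 examples, 2 dimensions
y = np.array([1] * 30 + [2] * 30)                 # class labels
```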

Fig. 2.9
figure 9

(a) The synthetic dataset; (b) Experimental results. The straight line is the RMCLP model’s classification boundary, and the squares are hyper cubes generated by using Algorithm 2.5. All the examples are covered by the squares whose edges are parallel to the axes

Here we only discuss the two-group classification problem; it is not difficult to extend the method to multiple-group classification. We expect to extract knowledge from the RMCLP model in the form:

$$ \boldsymbol{if}\left(a\le x1\le b,c\le x2\le d\right)\kern0.5em \boldsymbol{then}\kern0.5em Definition\ 1 $$
(2.150)

The result is shown in Fig. 2.9b. We can observe that, of the 60 examples in total, three examples in group 1 and one example in group 2 are misclassified by the RMCLP model. That is, the accuracy of RMCLP on this synthetic dataset is 56/60 = 93.3%. Using our rule extraction algorithm, we generate nine squares: four for group 1 and five for group 2. All the squares can be translated into explainable rules in the form of (2.150) as follows:

  • K1: if 0.6 ≤ x 1 ≤ 0.8 and 2 ≤ x 2 ≤ 2.8, then x ∈ G 1;

  • K2: if 1.1 ≤ x 1 ≤ 1.3 and 1.8 ≤ x 2 ≤ 2.1, then x ∈ G 1;

  • K3: if 0.4 ≤ x 1 ≤ 1.5 and −1 ≤ x 2 ≤ 1.6, then x ∈ G 1;

  • K4: if 0.9 ≤ x 1 ≤ 2.2 and −0.8 ≤ x 2 ≤ 0, then x ∈ G 1;

  • K5: if 1.2 ≤ x 1 ≤ 1.6 and 2.2 ≤ x 2 ≤ 3.2, then x ∈ G 2;

  • K6: if 1.4 ≤ x 1 ≤ 1.6 and 1.8 ≤ x 2 ≤ 2.0, then x ∈ G 2;

  • K7: if 1.7 ≤ x 1 ≤ 2.8 and 1.0 ≤ x 2 ≤ 4.0, then x ∈ G 2;

  • K8: if 1.9 ≤ x 1 ≤ 2.0 and 0.7 ≤ x 2 ≤ 0.8, then x ∈ G 2;

  • K9: if 2.1 ≤ x 1 ≤ 2.4 and 0.1 ≤ x 2 ≤ 0.5, then x ∈ G 2;

where K i (i = 1, …, 9) denotes the ith rule. From the results on this synthetic data set, we can observe that the proposed rule extraction method not only obtains prediction results from RMCLP, but also comprehensible rules.
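Once extracted, rules of this form are trivial to apply. A sketch, encoding each rule as a (lower bounds, upper bounds, group) triple, using K1 and K7 from the list above:

```python
def apply_rules(x, rules):
    # Return the group of the first rule whose box contains x, else None
    for lo, hi, g in rules:
        if all(l <= xi <= h for xi, l, h in zip(x, lo, hi)):
            return g
    return None

rules = [((0.6, 2.0), (0.8, 2.8), "G1"),   # K1: 0.6 <= x1 <= 0.8, 2 <= x2 <= 2.8
         ((1.7, 1.0), (2.8, 4.0), "G2")]   # K7: 1.7 <= x1 <= 2.8, 1 <= x2 <= 4
```

A point outside every box is left unclassified by the rule set, which is consistent with the algorithm covering only the training examples.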

As one of the basic services offered by the Internet, E-Mail is increasingly widely used. Along with constant global network expansion and network technology improvement, people's expectations of an E-Mail service are increasingly demanding. E-Mail is no longer merely a communication tool for people to share ideas and information; its wide acceptance and technological advancement have given it the characteristics of a business service [38], and it is being commercialized as a technological product.

At the same time, many business and specialized personal users of E-Mail want an E-Mail account that is safe, reliable, and equipped with first-class customer support. Therefore, many websites have developed their own user-pays E-Mail services to satisfy this market demand. According to statistics, the Chinese network advanced so much in the past few years that, by 2005, the total market size of Chinese VIP E-Mail services had reached 640 million RMB. This enormous market demand and prospect also means increasing competition among suppliers. How to analyze the patterns of lost customer accounts and decrease the customer loss rate has become a focal point of competition in today's market [39, 40].

Our partner company's VIP E-Mail data are mainly stored in two kinds of repository systems: customer databases and log files. They are mainly composed of automatically machine-recorded customer activity journals and a large number of manually recorded tables. These data are distributed among servers located in different departments of our partner companies, covering more than 30 kinds of transaction data charts and journal documents with over 600 attributes.

If we were to analyze these data directly, it would lead to the "curse of dimensionality", that is, a drastic rise in computational complexity and classification error with high-dimensional data. Hence, the dimensionality of the feature space must be reduced before classification is undertaken. According to the accumulated experience functions, we eventually selected 230 attributes from the original 600 attributes.

Figure 2.10 displays the feature-selection procedure for the VIP E-Mail dataset. We selected a subset of the data charts and journal documents from the VIP E-Mail system. The upper left part of Fig. 2.10 displays the three login journal documents and two E-Mail transaction journal documents: when the user logs into the pop3 server, the machine records the login in the log file pop3login; similarly, logins to the smtp server are recorded in the log file smtplogin, and logins to the E-Mail system through the http protocol in the log file weblogin. When the user successfully sends an E-Mail via the smtp protocol, the system records it in the log file smtprcptlog; when a letter is received, it is recorded in the log file mx_rcptlog.

Fig. 2.10
figure 10

The roadmap of the VIP Email Dataset

We extracted 37 attributes from each of these five log files, that is, 185 attributes in total, to describe user logins and transactions. From the databases, shown in the lower left section of Fig. 2.10, we extracted six features about "customer complaints about the VIP E-Mail service", 24 features about "customer payment", and 16 features about "customer personal information" (for example, age, gender, occupation, income, etc.) to form the operational table. Thus, the 185 features from log files and the 65 features from databases eventually formed the Large Table, and the 230 attributes depicted the features of the customers. The accumulated experience functions used in the feature selection are confidential, and further discussion of them is beyond the scope of this section.

Considering the completeness of the customer records, we eventually extracted two groups from the huge volume of data: the current and the lost. 10,996 customers, 5498 from each class, were chosen from the dataset. Combining the 10,996 SSNs with the 230 features, we eventually acquired the Large Table with 5498 current records and 5498 lost records, which became the dataset for data mining.

Table 2.6 lists the ten-fold cross-validation results of the RMCLP model's performance on the VIP E-Mail dataset. The columns "LOST" and "CURRENT" give the number of records correctly classified as "lost" and "current", respectively. The column "Accuracy" is the number of correctly classified records divided by the total number of records in that class. From Table 2.6, we can observe that the average prediction accuracy of RMCLP on this data set is 80.67% on the first class and 87.15% on the second class. That is, over the whole 10,996 test examples, the average accuracy of RMCLP is 82.91%.

Table 2.6 Ten-Fold Cross-Validation on the VIP E-Mail Dataset

As discussed above, decision trees are widely used to extract rules from training examples. In the following experiments, we compare our method with a decision tree (implemented by the WEKA J48 package).

Table 2.7 shows the comparison between our method and the decision tree. Using our rule extraction method, we obtain more than 20 hyper cubes. Due to space limitations, we only list the two most representative rules (i.e., Rule 1 for class "LOST" and Rule 6 for class "CURRENT") on the left side of Table 2.7. We then find the corresponding rules from the decision tree (i.e., Rule 1′ for class "LOST" and Rule 6′ for class "CURRENT") and list them on the right side of Table 2.7.

Table 2.7 Comparisons between RMCLP’s Rule and Decision Tree’s Rule

From these results, we can observe that our rule extraction method acquires much more accurate rules than the decision tree method. For example, comparing Rule 1 with Rule 1′, Rule 1 is supported by 81.6% of the examples in the "LOST" class; by contrast, the corresponding rule from the decision tree is supported by only 74.6% of the examples. Similarly, comparing Rule 6 with Rule 6′, our method also achieves better support than the decision tree.

At the bottom of Table 2.7, we list the average accuracy of the two methods. The average accuracy of the rules extracted from RMCLP is 80.90%, better than the decision tree's accuracy of 74.25%. Moreover, compared to the RMCLP model's performance in Table 2.6 (82.91%), the average accuracy of the extracted rules (80.90%) suffers only a small loss. Therefore, our rule extraction method can effectively extract comprehensible rules from the RMCLP model.

3 Multiple-Criteria Decision Making Based Data Analysis

3.1 A Multicriteria Decision Making Approach for Estimating the Number of Clusters

Estimating the number of clusters for a given data set is closely related to the validity measures and the data set structures. Many validity measures have been proposed and can be classified into three categories: external, internal, and relative [41]. External measures use predefined class labels to examine the clustering results. Because external validation uses the true class labels in the comparison, it is an objective indicator of the true error rate of a clustering algorithm. Internal measures evaluate clustering algorithms by measuring intra- and inter-cluster similarity. An algorithm is regarded as good if the resulting clusters have high intra-class similarities and low inter-class similarities. Relative measures try to find the best clustering structure generated by a clustering algorithm using different parameter values. Extensive reviews of cluster validation techniques can be found in [41] and [42, 43].

Although external measures performed well in predicting clustering error in previous studies, they require an a priori structure for a data set and can only be applied to data sets with class labels. Since this study concentrates on data sets without class labels, it utilizes relative validity measures. The proposed approach can be applied to a wide variety of clustering algorithms; for simplicity, this study chooses the well-known k-means clustering algorithm. Figure 2.11 describes the MCDM-based approach for determining the number of clusters in a data set. For a given data set, the different numbers of clusters are treated as alternatives, and the performances of the k-means clustering algorithm on the relative measures with different numbers of clusters serve as the criteria for the MCDM methods. The output is a ranking of the numbers of clusters, which evaluates the appropriateness of different numbers of clusters for the given data set based on their overall performance on multiple criteria (i.e., the selected relative measures).

Fig. 2.11
figure 11

A MCDM-based approach for determining the number of clusters in a dataset

3.1.1 MCDM Methods

This study chooses three MCDM methods for estimating the number of clusters for a data set. This section introduces the selected MCDM methods (i.e., WSM, PROMETHEE, and TOPSIS) and explains how they are used to estimate the optimal number of clusters for a given data set.

3.1.1.1 MCDM Method 1: Weighted Sum Method (WSM)

The weighted sum method (WSM) was introduced by Zadeh [44]. It is the most straightforward and widely-used MCDM method for evaluating alternatives. When an MCDM problem involves both benefit and cost criteria, two approaches can be used to deal with conflicting criteria. One is the benefit to cost ratio and the other is the benefit minus cost [45]. For the estimation of optimal number of clusters for a data set, the relative indices Dunn, silhouette, and PBM are benefit criteria and have to be maximized, while Hubert, normalized Hubert, Davies-Bouldin index, SD, S_Dbw, CS, and C-index are cost criteria and have to be minimized. This study chooses the benefit minus cost approach and applies the following formulations to rank different numbers of clusters.

Suppose there are m alternatives, k benefit criteria, and n cost criteria. The total benefit of alternative A i, denoted \( {A}_i^{benefit} \), is defined as follows:

$$ {A}_i^{benefit}=\sum \limits_{j=1}^k{w}_j{a}_{ij}, for\kern0.5em i=1,2,3,\dots, m $$

where a ij represents the performance measure of the jth criterion for alternative A i. Similarly, the total cost of alternative A i, denoted \( {A}_i^{\cos t} \), is defined as follows:

$$ {A}_i^{\cos t}=\sum \limits_{j=1}^n{w}_j{a}_{ij}, for\kern0.5em i=1,2,3,\dots, m $$

where \( \sum \limits_{j=1}^k{w}_j+\sum \limits_{j=1}^n{w}_j=1 \) and 0 < w j ≤ 1. Then the overall importance of alternative A i, its WSM score \( {A}_i^{WSM- score} \), is defined as follows:

$$ {A}_i^{WSM- score}={A}_i^{benefit}-{A}_i^{\cos t}, for\kern0.5em i=1,2,3,\dots, m $$

The best alternative is the one with the largest WSM score [45].
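The benefit-minus-cost WSM score can be computed in a few lines. This is a sketch: criteria flagged as benefits enter with a positive sign and cost criteria with a negative sign, and the weights are assumed to sum to 1.

```python
import numpy as np

def wsm_scores(perf, weights, is_benefit):
    # perf: m alternatives x (k + n) criteria; weights sum to 1
    # WSM score = weighted benefits minus weighted costs
    perf = np.asarray(perf, float)
    sign = np.where(is_benefit, 1.0, -1.0)
    return perf @ (np.asarray(weights, float) * sign)
```

The alternative with the largest returned score is the preferred number of clusters.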

3.1.1.2 MCDM Method 2: Preference Ranking Organization Method for Enrichment of Evaluations (PROMETHEE)

Brans proposed PROMETHEE I and PROMETHEE II, which use pairwise comparisons and outranking relationships to choose the best alternative [46]. The final selection is based on the positive and negative preference flows of each alternative. The positive preference flow indicates how much an alternative outranks all the other alternatives, and the negative preference flow indicates how much it is outranked by them [47]. While PROMETHEE I obtains only a partial ranking because it does not compare conflicting actions [48], PROMETHEE II ranks alternatives according to the net flow, which equals the balance of the positive and negative preference flows. An alternative with a higher net flow is better [47]. Since the goal of this study is to provide a complete ranking of different numbers of clusters, PROMETHEE II is utilized. The following procedure, presented by Brans and Mareschal [47], is used in the experimental study:

  • Step 1. Define aggregated preference indices.

    Let a,b ∈A, and let

    $$ \left\{\begin{array}{c}\pi \left(a,b\right)=\sum \limits_{j=1}^k{p}_j\left(a,b\right){w}_j,\\ {}\pi \left(b,a\right)=\sum \limits_{j=1}^k{p}_j\left(b,a\right){w}_j.\end{array}\right. $$

    where A is a finite set of possible alternatives {a1, a2,…, an}, k represents the number of evaluation criteria, and w j is the weight of each criterion. For estimating the number of clusters for a given data set, the alternatives are the different numbers of clusters and the criteria are the relative indices. Arbitrary numbers for the weights can be assigned by decision-makers; the weights are then normalized to ensure that \( {\sum}_{j=1}^k{w}_j=1 \). π(a, b) indicates how much a is preferred to b over all the criteria, and π(b, a) indicates how much b is preferred to a. p j(a, b) and p j(b, a) are the preference functions for alternatives a and b. The relative indices Dunn, silhouette, and PBM have to be maximized, and Hubert, normalized Hubert, DB, SD, S_Dbw, CS, and C-index have to be minimized.

  • Step 2. Calculate π(a, b) and π(b, a) for each pair of alternatives of A. There are six types of preference functions, and the decision-maker needs to choose one type for each criterion, together with the values of the corresponding parameters [49]. The usual preference function, which requires no input parameter, is used for all criteria in the experiment.

  • Step 3. Define the positive and the negative outranking flow as follows:

    The positive outranking flow:

  • $$ {\phi}^{+}(a)=\frac{1}{n-1}\sum \limits_{x\in A}\pi \left(a,x\right), $$

    The negative outranking flow:

    $$ {\phi}^{-}(a)=\frac{1}{n-1}\sum \limits_{x\in A}\pi \left(x,a\right), $$
  • Step 4. Compute the net outranking flow for each alternative as follows:

    $$ \phi (a)={\phi}^{+}(a)-{\phi}^{-}(a). $$

    When ϕ(a) > 0, a outranks the other alternatives over all the evaluation criteria more than it is outranked; when ϕ(a) < 0, a is more outranked than outranking.
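Steps 1-4 above can be sketched as follows, using the usual preference function (p j(a, b) = 1 when a strictly beats b on criterion j, and 0 otherwise). Cost criteria are negated first so that every criterion is maximized; the function name is illustrative.

```python
import numpy as np

def promethee2_net_flows(perf, weights, maximize):
    # perf: n alternatives x k criteria; weights sum to 1
    perf = np.asarray(perf, float).copy()
    perf[:, ~np.asarray(maximize)] *= -1.0     # turn cost criteria into benefits
    n = perf.shape[0]
    pi = np.zeros((n, n))                      # aggregated preference indices
    for a in range(n):
        for b in range(n):
            if a != b:                         # usual preference function
                pi[a, b] = np.dot(weights, (perf[a] > perf[b]).astype(float))
    phi_pos = pi.sum(axis=1) / (n - 1)         # positive outranking flow
    phi_neg = pi.sum(axis=0) / (n - 1)         # negative outranking flow
    return phi_pos - phi_neg                   # net flow: higher is better
```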

3.1.1.3 MCDM Method 3: Technique for Order Preference by Similarity to Ideal Solution (TOPSIS)

The Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) was proposed by Hwang and Yoon [50] to rank alternatives over multiple criteria. It finds the best alternative by minimizing the distance to the ideal solution and maximizing the distance to the nadir, or negative-ideal, solution [37]. This section uses the following TOPSIS procedure, adopted from [51] and [37], in the empirical study:

  • Step 1. Calculate the normalized decision matrix. The normalized value r ij is calculated as

    $$ {r}_{ij}={x}_{ij}/\sqrt{\sum \limits_{i=1}^J{x}_{ij}^2},j=1,\dots, J;i=1,..,n. $$
  • Step 2. Develop a set of weights wi for each criterion and calculate the weighted normalized decision matrix. The weighted normalized value vij is calculated as:

    $$ {v}_{ij}={w}_i{r}_{ij},j=1,..,J;i=1,..,n. $$

    where w i is the weight of the ith criterion, and \( {\sum}_{i=1}^n{w}_i=1 \).

  • Step 3. Find the ideal alternative solution S+, which is calculated as:

    $$ {S}^{+}=\left\{{v}_1^{+},\dots, {v}_n^{+}\right\}=\left\{\left(\underset{j}{\max }{v}_{ij}|i\in {I}^{\hbox{'}}\right),\Big(\underset{j}{\min }{v}_{ij}|i\in {I}^{\hbox{'}\hbox{'}}\Big)\right\} $$

    where I′ is associated with benefit criteria and I″ is associated with cost criteria. In this study, benefit and cost criteria of TOPSIS are defined the same as the benefit and cost criteria in WSM.

  • Step 4. Find the negative-ideal alternative solution S−, which is calculated as:

    $$ {S}^{-}=\left\{{v}_1^{-},\dots, {v}_n^{-}\right\}=\left\{\left(\underset{j}{\min }{v}_{ij}|i\in {I}^{\hbox{'}}\right),\Big(\underset{j}{\max }{v}_{ij}|i\in {I}^{\hbox{'}\hbox{'}}\Big)\right\} $$
  • Step 5. Calculate the separation measures, using the n-dimensional Euclidean distance. The separation of each alternative from the ideal solution is calculated as:

    $$ {D}_j^{+}=\sqrt{\sum \limits_{i=1}^n{\left({v}_{ij}-{v}_i^{+}\right)}^2},j=1,\dots, J. $$

    The separation of each alternative from the negative-ideal solution is calculated as:

    $$ {D}_j^{-}=\sqrt{\sum \limits_{i=1}^n{\left({v}_{ij}-{v}_i^{-}\right)}^2},j=1,\dots, J. $$
  • Step 6. Calculate a ratio \( {R}_j^{+} \) that measures the relative closeness to the ideal solution and is calculated as:

    $$ {R}_j^{+}={D}_j^{-}/\left({D}_j^{+}+{D}_j^{-}\right),j=1,\dots, J. $$
  • Step 7. Rank alternatives by maximizing the ratio \( {R}_j^{+} \).
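Steps 1 through 7 amount to the following sketch: vector-norm normalization, weighting, distances to the ideal and negative-ideal solutions, and finally the closeness ratio R+; the function name is illustrative.

```python
import numpy as np

def topsis_closeness(X, w, is_benefit):
    X = np.asarray(X, float)
    R = X / np.sqrt((X ** 2).sum(axis=0))      # Step 1: normalize columns
    V = R * np.asarray(w, float)               # Step 2: apply weights
    benefit = np.asarray(is_benefit)
    s_pos = np.where(benefit, V.max(axis=0), V.min(axis=0))  # Step 3: ideal
    s_neg = np.where(benefit, V.min(axis=0), V.max(axis=0))  # Step 4: nadir
    d_pos = np.sqrt(((V - s_pos) ** 2).sum(axis=1))          # Step 5
    d_neg = np.sqrt(((V - s_neg) ** 2).sum(axis=1))
    return d_neg / (d_pos + d_neg)             # Step 6: closeness ratio R+
```

Ranking the alternatives by decreasing R+ (Step 7) gives the preferred number of clusters first.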

3.1.2 Clustering Algorithm

The k-means algorithm, the most well-known partitioning method, is an iterative distance-based technique [32]. The input parameter k predefines the number of clusters. First, k objects are randomly chosen as the centers of these clusters. All objects are then partitioned into k clusters based on the minimum squared-error criterion, which measures the distance between an object and the cluster center. The new mean of each cluster is then calculated, and the whole process iterates until the cluster centers remain unchanged [11, 52]. Let X = {x i} be the n objects to be clustered and C = {C 1, C 2, …, C k} the set of clusters. Let μ i be the mean of cluster C i. The squared error between μ i and the objects in cluster C i is defined as:

$$ WCSS\left({C}_i\right)=\sum \limits_{x_j\in {C}_i}{\left\Vert {x}_j-{\mu}_i\right\Vert}^2 $$

Then the aim of k-means algorithm is to minimize the sum of the squared error over all k clusters, that is

$$ \underset{C}{\min }\  WCSS(C)=\underset{C}{\min}\sum \limits_{i=1}^k\sum \limits_{x_j\in {C}_i}{\left\Vert {x}_j-{\mu}_i\right\Vert}^2 $$

where WCSS denotes the sum of the squared error in the inner-cluster.

Two critical steps of k-means algorithm have impact on the sum of squared error. First, generate a new partition by assigning each observed point to its closest cluster center, the formula is as follows:

$$ {C_i}^{(t)}=\left\{{x}_j:\left\Vert {x}_j-{m_i}^{(t)}\right\Vert \le \left\Vert {x}_j-{m_{i\ast}}^{(t)}\right\Vert\ \mathrm{for}\ \mathrm{all}\ i\ast =1,\dots, k\right\} $$

where m i (t) denotes the mean of the i th cluster at the t th iteration, while C i (t) represents the set of objects assigned to the i th cluster at the t th iteration. Second, compute the new cluster mean centers using the following formula:

$$ {m_i}^{\left(t+1\right)}=\frac{1}{\mid {C_i}^{(t)}\mid}\sum \limits_{x_j\in {C}_i^{(t)}}{x}_j $$

where m i (t + 1) denotes the mean of the i th cluster at the (t + 1)th iteration. The algorithm is implemented using WEKA (Waikato Environment for Knowledge Analysis), a free machine learning software suite [53].
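
The two iterative steps above can be sketched as a minimal NumPy implementation. This is an illustrative sketch only (the helper `kmeans` and its parameters are assumptions, not WEKA's implementation), assuming numeric data in a NumPy array:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal k-means: assign each object to its nearest center, then
    recompute each center as the mean of its cluster, until convergence."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # random initial centers
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # update step: new mean of each cluster (keep old center if empty)
        new = np.array([X[labels == i].mean(0) if (labels == i).any()
                        else centers[i] for i in range(k)])
        if np.allclose(new, centers):   # cluster centers unchanged: converged
            break
        centers = new
    wcss = ((X - centers[labels]) ** 2).sum()   # within-cluster sum of squares
    return centers, labels, wcss
```

The returned `wcss` is the quantity the algorithm locally minimizes; different random initializations can yield different local minima.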

3.1.3 Clustering Validity Measures

Ten relative measures are selected for the experiment, namely, the Hubert Γ statistic, the normalized Hubert Γ, the Dunn’s index, the Davies-Bouldin index, the CS measure, the SD index, the S_Dbw index, the silhouette index, PBM, and the C-index. Relative measures can also be used to identify the optimal number of clusters in a data set and some of them, such as the C-index and silhouette, have exhibited good performance in previous studies. The following paragraphs define these relative measures.

  • Hubert Γ statistic [54]:

    $$ \Gamma =\left(1/M\right)\sum \limits_{i=1}^{n-1}\sum \limits_{j=i+1}^nP\left(i,j\right)\cdot Q\left(i,j\right) $$

    where n is the number of objects in a data set, M = n(n − 1)/2, P is the proximity matrix of the data set, and Q is an n × n matrix whose (i, j) element is equal to the distance between the representative points (v ci, v cj) of the clusters to which the objects x i and x j belong [42]. Γ indicates the agreement between P and Q.

  • Normalized Hubert Γ:

    $$ \hat{\Gamma}=\frac{\left[\left(1/M\right)\sum \limits_{i=1}^{n-1}\sum \limits_{j=i+1}^n\left(P\right(i,j\left)-{\mu}_P\right)\cdot \left(Q\right(i,j\left)-{\mu}_Q\right)\right]}{\sigma_P{\sigma}_Q} $$

    where μ P, μ Q, σ P, and σ Q represent the respective means and standard deviations of the P and Q matrices [43].

  • Dunn’s index [55] evaluates the quality of clusters by measuring inter-cluster distance and intra-cluster diameter:

    $$ D=\underset{i=1,\dots, K}{\min}\left\{\underset{j=i+1,\dots, K}{\min}\left[\frac{d\left({C}_i,{C}_j\right)}{\underset{l=1,\dots, K}{\max } diam\left({C}_l\right)}\right]\right\} $$

    where K is the number of clusters, Ci is the i th cluster, d(Ci,Cj) is the distance between cluster Ci and Cj, and diam(C l) is the diameter of the lth cluster. Larger values of D suggest good clusters, and a D larger than 1 indicates compact separated clusters.

  • Davies-Bouldin index is defined as [56]:

    $$ {DB}_K=\frac{1}{K}\sum \limits_{i=1}^K{R}_i,\kern0.5em {R}_i=\underset{j=1,\dots, K,j\ne i}{\max }{R}_{ij},\kern0.5em {R}_{ij}=\frac{s_i+{s}_j}{d_{ij}},\kern0.5em i=1,\dots, K $$

    where K is the number of clusters, si and sj represent the respective dispersion of clusters i and j, dij measures the dissimilarity between two clusters, and Rij measures the similarity between two clusters [42, 43]. It is the average similarity between each cluster and its most similar one.

  • The CS measure is proposed to evaluate clusters with different densities and/or sizes [57]. It is computed as:

    $$ CS=\frac{\sum \limits_{i=1}^K\left\{\frac{1}{N_i}\sum \limits_{x_j\in {C}_i}\underset{x_k\in {C}_i}{\max}\left\{d\left({x}_j,{x}_k\right)\right\}\right\}}{\sum \limits_{i=1}^K\left\{\underset{j\in \left\{1,2,\dots, K\right\},j\ne i}{\min}\left\{d\left({v}_i,{v}_j\right)\right\}\right\}},{v}_i=\frac{1}{N_i}\sum \limits_{x_j\in {C}_i}{x}_j $$

    where Ni is the number of objects in cluster i and d is a distance function. The smallest CS measure indicates a valid optimal clustering.

  • SD index combines the measurements of average scattering for clusters and total separation between clusters [42]:

    $$ SD(K)= Dis\left({c}_{\mathrm{max}}\right)\times Scat(K)+ Dis(K) $$

    where cmax is the maximum number of input clusters,

    $$ Scat(K)=\frac{1}{K}\sum \limits_{i=1}^K\left\Vert \sigma \left({v}_i\right)\right\Vert /\left\Vert \sigma (X)\right\Vert \kern0.5em \mathrm{and} $$
    $$ Dis(K)=\frac{D_{\mathrm{max}}}{D_{\mathrm{min}}}\sum \limits_{k=1}^K{\left(\sum \limits_{z=1}^K\left\Vert {v}_k-{v}_z\right\Vert \right)}^{-1} $$

    D max is the maximum distance between cluster centers and D min is the minimum distance between cluster centers.

  • S_Dbw index is similar to the SD index and is defined as [42]:

    $$ {\displaystyle \begin{array}{l}S\_ Dbw(K)= Scat(K)+ Dens\_ bw(K),\\ {} Dens\_ bw(K)=\frac{1}{K\cdot \left(K-1\right)}\sum \limits_{i=1}^K\left(\sum \limits_{\begin{array}{l}j=1\\ {}j\ne i\end{array}}^K\frac{density\left({u}_{ij}\right)}{\max \left\{ density\left({v}_i\right), density\left({v}_j\right)\right\}}\right),\\ {} density(u)=\sum \limits_{l=1}^{N_{ij}}f\left({x}_l,u\right)\end{array}} $$

    where Nij is the number of objects that belong to the cluster Ci and Cj, and function f(x,u) is defined as:

    $$ f\left(x,u\right)=\left\{\begin{array}{c}0, if\kern0.5em d\left(x,u\right)> stdev\\ {}1,\operatorname{} otherwise\end{array}\right., stdev=\frac{1}{K}\sqrt{\sum \limits_{i=1}^K\left\Vert \sigma \left({v}_i\right)\right\Vert } $$

  • Silhouette is an internal graphic display for evaluating clustering methods. It represents each cluster by a silhouette, which shows how well objects lie within their clusters. It is defined as [58]:

    $$ s(i)=\frac{b(i)-a(i)}{\max \left\{a(i),b(i)\right\}} $$

    where i represents any object in the data set, a(i) is the average dissimilarity of i to all other objects in the same cluster A, and b(i) is the average dissimilarity of i to all objects in the neighboring cluster B, which is defined as the cluster that has the smallest average dissimilarity of i to all objects in it. Note that A ≠ B and the dissimilarity is computed using distance measures. Since a(i) measures how dissimilar i is to its own cluster and b(i) measures how dissimilar i is to its neighboring cluster, an s(i) close to one indicates a good clustering method. The average s(i) of the whole data set measures the quality of clusters.

  • PBM was developed by [40] and is based on the intra-cluster and inter-cluster distances:

    $$ {\displaystyle \begin{array}{l} PBM={\left(\frac{1}{K}\frac{E_1}{E_K}{D}_K\right)}^2\\ {}\mathrm{where}\kern0.5em {E}_1={\sum}_{i=1}^N\left\Vert {x}_i-\overline{x}\right\Vert, {E}_K={\sum}_{l=1}^K{\sum}_{x_i\in {C}_l}\left\Vert {x}_i-{\overline{x}}_l\right\Vert, \\ {}{D}_K=\underset{l,m=1,\dots, K}{\max}\left\Vert {\overline{x}}_l-{\overline{x}}_m\right\Vert \end{array}} $$

  • The C-index [59] is based on intra-cluster distances and their maximum and minimum possible values [60]:

    $$ CI=\frac{\theta -\min \theta }{\max \theta -\min \theta },\theta =\sum \limits_{i=1}^{n-1}\sum \limits_{j=i+1}^n{q}_{i,j}\left\Vert {x}_i-{x}_j\right\Vert $$

    where q i, j = 1 if x i and x j belong to the same cluster and 0 otherwise, and min θ and max θ are the sums of the same number of smallest and largest pairwise distances, respectively.
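
As one concrete example of these validity measures, the silhouette index defined above can be computed directly from pairwise distances. This is an illustrative sketch; the convention s(i) = 0 for singleton clusters is a common choice, not something fixed by the text:

```python
import numpy as np

def silhouette(X, labels):
    """Average silhouette s(i) over the data set: a(i) is the mean distance
    from object i to its own cluster, b(i) the smallest mean distance from
    i to any other cluster, s(i) = (b - a) / max(a, b)."""
    n = len(X)
    # full pairwise Euclidean distance matrix
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    scores = []
    for i in range(n):
        same = (labels == labels[i])
        same[i] = False                 # exclude i itself from a(i)
        if not same.any():              # singleton cluster: s(i) taken as 0
            scores.append(0.0)
            continue
        a = dist[i, same].mean()
        b = min(dist[i, labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Values near 1 indicate compact, well-separated clusters; values near 0 indicate objects lying between clusters.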

3.2 Parallel Regularized Multiple Criteria Linear Programming Classification Algorithm

In this section, the focus is on RMCLP and the design of a parallel version of the RMCLP algorithm (PRMCLP). In order to overcome the computation and storage requirements, which increase rapidly with the number of training samples, the second strategy is adopted, inspired by some findings in [61].

Let us give a brief introduction to MCLP as follows. For classification of the training data:

$$ T=\left\{\left({x}_1,{y}_1\right),\dots, \left({x}_l,{y}_l\right)\right\}\in {\left({\Re}^n\times y\right)}^l $$
(2.151)

where x i ∈ ℜ n, y i ∈ y = {1, −1}, i = 1, …, l, data separation can be achieved by two opposite objectives. The first objective separates the observations by minimizing the sum of the deviations (MSD) among the observations. The second maximizes the minimum distances (MMD) of observations from the critical value [62]. The overlapping of data ξ (1) should be minimized while the distance ξ (2) has to be maximized. However, it is difficult for traditional linear programming to optimize MMD and MSD simultaneously. According to the concept of Pareto optimality, we can seek the best trade-off between the two measurements [10, 63]. The MCLP model can thus be described as follows:

$$ \min {e}^{\mathrm{T}}{\xi}^{(1)}\&\max {e}^{\mathrm{T}}{\xi}^{(2)} \vspace*{-12pt}$$
(2.152)
$$ s.t.\left(w\cdot {x}_i\right)+\left({\xi_i}^{(1)}-{\xi_i}^{(2)}\right)=b, for\left\{i|{y}_i=1\right\}, \vspace*{-12pt}$$
(2.153)
$$ \left(w\cdot {x}_i\right)-\left({\xi_i}^{(1)}-{\xi_i}^{(2)}\right)=b, for\left\{i|{y}_i=-1\right\}, \vspace*{-12pt}$$
(2.154)
$$ {\xi}^{(1)},{\xi}^{(2)}\ge 0 $$
(2.155)

where e ∈ R l is the vector whose elements are all 1, w and b are unrestricted, \( {\xi}_i^{(1)} \) is the overlapping and \( {\xi}_i^{(2)} \) is the distance from the training sample x i to the discriminator (w ⋅ x i) = b (the classification separating hyperplane). By introducing penalty parameters C, D > 0, MCLP has the following version:

$$ \underset{{\xi_i}^{(1)},{\xi_i}^{(2)}}{\min }{Ce}^{\mathrm{T}}{\xi}^{(1)}-{De}^{\mathrm{T}}{\xi}^{(2)}, \vspace*{-12pt}$$
(2.156)
$$ s.t.\left(w\cdot {x}_i\right)+\left({\xi_i}^{(1)}-{\xi_i}^{(2)}\right)=b, for\left\{i|{y}_i=1\right\}, \vspace*{-12pt}$$
(2.157)
$$ \left(w\cdot {x}_i\right)-\left({\xi_i}^{(1)}-{\xi_i}^{(2)}\right)=b, for\left\{i|{y}_i=-1\right\}, \vspace*{-12pt}$$
(2.158)
$$ {\xi}^{(1)},{\xi}^{(2)}\ge 0 $$
(2.159)

Many empirical studies have shown that MCLP is a powerful tool for classification. However, this model cannot be guaranteed to have a solution for every kind of training sample. To ensure the existence of a solution, Shi et al. recently proposed the RMCLP model by adding two regularization terms, \( \frac{1}{2}{w}^{\mathrm{T}} Hw \) and \( \frac{1}{2}{\xi}^{(1)\mathrm{T}}Q{\xi}^{(1)} \), to MCLP as follows (more theoretical explanation of this model can be found in [63]):

$$ \underset{z}{\min}\frac{1}{2}{w}^{\mathrm{T}} Hw+\frac{1}{2}{\xi}^{(1)\mathrm{T}}Q{\xi}^{(1)}+\frac{1}{2}{b}^2+{Ce}^{\mathrm{T}}{\xi}^{(1)}-{De}^{\mathrm{T}}{\xi}^{(2)}, \vspace*{-12pt}$$
(2.160)
$$ s.t.\left(w\cdot {x}_i\right)+\left({\xi_i}^{(1)}-{\xi_i}^{(2)}\right)=b, for\left\{i|{y}_i=1\right\}, \vspace*{-12pt}$$
(2.161)
$$ \left(w\cdot {x}_i\right)-\left({\xi_i}^{(1)}-{\xi_i}^{(2)}\right)=b, for\left\{i|{y}_i=-1\right\}, \vspace*{-12pt}$$
(2.162)
$$ {\xi}^{(1)},{\xi}^{(2)}\ge 0 $$
(2.163)

where z = (w Τ, ξ (1)T, ξ (2)T, b)Τ ∈ R n + l + l + 1 and H ∈ R n × n is a symmetric positive definite matrix. Obviously, the regularized MCLP is a convex quadratic program. According to the dual theorem, (2.160)–(2.163) can be formulated as:

$$ \underset{\alpha, {\xi}^{(1)}}{\min}\frac{1}{2}{\alpha}^{\mathrm{T}}\left(K\left(A,{A}^{\mathrm{T}}\right)+{ee}^{\mathrm{T}}\right)\alpha +\frac{1}{2}{\xi}^{(1)\mathrm{T}}Q{\xi}^{(1)}, \vspace*{-12pt}$$
(2.164)
$$ s.t.-Q{\xi}^{(1)}- Ce\le E\alpha \le - De, \vspace*{-24pt}$$
(2.165)
$$ {\displaystyle \begin{array}{l}\mathrm{where}\ A={\left[{x}_1^{\mathrm{T}},\dots, {x}_l^{\mathrm{T}}\right]}^{\mathrm{T}}\in {R}^{l\times n},E=\mathit{\operatorname{diag}}\left\{{y}_1,\dots, {y}_l\right\}\\ {}\mathrm{and}\ \\ {}K\left(A,{A}^{\mathrm{T}}\right)=\Phi (A)\Phi {(A)}^{\mathrm{T}}={\left(\Phi (A)\cdot \Phi {(A)}^{\mathrm{T}}\right)}_{l\times l}\end{array}} $$

and Φ is a mapping from the input space Rn to some Hilbert space H [64].

In order to realize the parallelization of RMCLP, we first transform RMCLP into an unconstrained optimization problem. To simplify, (2.164) can be rewritten as

$$ {\displaystyle \begin{array}{l}\underset{\pi }{\min}\frac{1}{2}{\pi}^{\mathrm{T}}\Lambda \pi, \\ {}s.t. G\pi - Ce\le 0,\\ {} H\pi + De\le 0,\end{array}} $$
(2.166)

where π = [α Τ, ξ (1)Τ]Τ, and G = [−Q, −E], H = [E, O], O ∈ R l × l is a null matrix, Λ is written as

$$ \left(\begin{array}{cc}K\left(A,{A}^{\mathrm{T}}\right)+{ee}^{\mathrm{T}}& 0\\ {}0& Q\end{array}\right) $$
(2.167)

Next, we represent the objective (2.164) as the following unconstrained optimization problem

$$ \underset{\pi }{\min }f\left(\pi \right)=\frac{1}{2}{\pi}^{\mathrm{T}}\Lambda \pi +{\lambda}^{\mathrm{T}}\max {\left\{ G\pi - Ce,0\right\}}^2+{\mu}^{\mathrm{T}}\max {\left\{ H\pi + De,0\right\}}^2 $$
(2.168)

where C, D ∈ R are the artificial parameters, and λ = (λ 1, …, λ l), μ = (μ 1, …, μ l).

Let d be the search direction of the optimization problem (2.168); here, we choose the normalized negative gradient direction as the feasible direction:

$$ d=-\nabla f\left(\pi \right)/\left\Vert \nabla f\left(\pi \right)\right\Vert $$
(2.169)

where

$$ { {\begin{array}{ll}\nabla f\left(\pi \right)=\Lambda \pi +2{\lambda}^{\mathrm{T}}\mathit{\operatorname{diag}}\left({G}^{\mathrm{T}}\max \left\{ G\pi - Ce,0\right\}\right)+2{\mu}^{\mathrm{T}}\mathit{\operatorname{diag}}\left({H}^{\mathrm{T}}\max \left\{ H\pi + De,0\right\}\right)\end{array}}} $$
(2.170)
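
A single-processor sketch of descending the penalized objective (2.168) along the direction (2.169) may look as follows. The gradient is written in the equivalent vector form 2Gᵀ(λ ⊙ max{Gπ − Ce, 0}) rather than the diag notation of (2.170); all names and defaults are illustrative, and the parallel PVD variant discussed next would split the components of π into chunks, one per processor:

```python
import numpy as np

def grad_descent(Lam, G, H, c, d, lam, mu, steps=500, lr=0.01):
    """Normalized negative-gradient descent on
    f(pi) = 0.5 pi' Lam pi + sum_i lam_i max{(G pi - c)_i, 0}^2
                          + sum_i mu_i  max{(H pi + d)_i, 0}^2."""
    n = Lam.shape[0]
    pi = np.zeros(n)                    # illustrative starting point
    for _ in range(steps):
        # gradient of the quadratic term plus the two squared-hinge penalties
        g = (Lam @ pi
             + 2 * G.T @ (lam * np.maximum(G @ pi - c, 0.0))
             + 2 * H.T @ (mu * np.maximum(H @ pi + d, 0.0)))
        norm = np.linalg.norm(g)
        if norm < 1e-10:                # stationary point reached
            break
        pi = pi - lr * g / norm         # step along d = -grad f / ||grad f||
    return pi
```

With a fixed step length the iterates oscillate within `lr` of the minimizer, so in practice a decreasing step or a line search would be used.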

Now, we use the PVD idea to split our model [61]. Suppose we can use p processors; then the variables of the unconstrained optimization problem (2.168) can be divided into p chunks, where the dimension of the i th chunk is m i:

$$ \pi =\left\{{\pi}_1,\dots, {\pi}_p\right\},{\pi}_i\in {R}^{m_i},i=1,\dots, p,\sum \limits_{i=1}^p{m}_i=2l $$
(2.171)

In the next step, we allocate the i th chunk of variables to the i th processor and decompose problem (2.168) into subproblems of m i dimensions. Each processor solves its corresponding subproblem: besides computing its own m i variables, it updates the other variables according to fixed rules. After all processors finish updating, a quick synchronization step is performed: the results obtained by each processor are collected and the current solution is computed. Repeating this procedure, our algorithm can be described as follows.

Theorem 2.4

The sequence {π k} generated by Algorithm 2.4 either terminates at a stationary point \( {\pi}^{\overline{k}} \), or is an infinite sequence whose accumulation points are stationary and \( \underset{k\to \infty }{\lim}\nabla f\left({\pi}^k\right)=0 \) .

Proof

$$ {\displaystyle \begin{array}{l}\mathrm{For}\ \forall \pi, {\pi}^{\hbox{'}}\in {R}^{2l},\mathrm{we}\ \mathrm{have}\\ {}\nabla f\left(\pi \right)=\Lambda \pi +2{\lambda}^{\mathrm{T}}\mathit{\operatorname{diag}}\left({G}^{\mathrm{T}}\max \left\{ G\pi - Ce,0\right\}\right)+2{\mu}^{\mathrm{T}}\mathit{\operatorname{diag}}\left({H}^{\mathrm{T}}\max \left\{ H\pi + De,0\right\}\right)\\ {}\mathrm{So}\end{array}} \vspace*{-15pt}$$
(2.172)
$$ {\displaystyle \begin{array}{l}\left\Vert \nabla f\left(\pi \right)-\nabla f\left({\pi}^{\hbox{'}}\right)\right\Vert =\Big\Vert \Lambda \left(\pi -{\pi}^{\hbox{'}}\right)+2{\lambda}^{\mathrm{T}}\mathit{\operatorname{diag}}\left({G}^{\mathrm{T}}\left(\max \left\{ G\pi - Ce,0\right\}-\max \left\{{G\pi}^{\hbox{'}}- Ce,0\right\}\right)\right)\\ {}+2{\mu}^{\mathrm{T}}\mathit{\operatorname{diag}}\left({H}^{\mathrm{T}}\left(\max \left\{ H\pi + De,0\right\}-\max \left\{{H\pi}^{\hbox{'}}+ De,0\right\}\right)\right)\Big\Vert \\ {}\le \left\Vert \Lambda \right\Vert\ \left\Vert \pi -{\pi}^{\hbox{'}}\right\Vert +2\left\Vert {\lambda}^{\mathrm{T}}\right\Vert\ \left\Vert \mathit{\operatorname{diag}}\left({G}^{\mathrm{T}}\left(\max \left\{ G\pi - Ce,0\right\}-\max \left\{{G\pi}^{\hbox{'}}- Ce,0\right\}\right)\right)\right\Vert \\ {}+2\left\Vert {\mu}^{\mathrm{T}}\right\Vert\ \left\Vert \mathit{\operatorname{diag}}\left({H}^{\mathrm{T}}\left(\max \left\{ H\pi + De,0\right\}-\max \left\{{H\pi}^{\hbox{'}}+ De,0\right\}\right)\right)\right\Vert \end{array}} $$
(2.173)
$$ {\displaystyle \begin{array}{l}\mathrm{i}\Big)\kern0.5em \mathrm{For}\ \mathrm{any}\ G{\pi}_i,G{\pi_i}^{\hbox{'}}\le Ce,\ \mathrm{where}\ i=1,\dots, m,\mathrm{we}\ \mathrm{have}\\ {}\left\Vert \mathit{\operatorname{diag}}\left({G}^{\mathrm{T}}\left(\max \left\{ G\pi - Ce,0\right\}-\max \left\{{G\pi}^{\hbox{'}}- Ce,0\right\}\right)\right)\right\Vert =0\le \left\Vert {G}^{\mathrm{T}}G\left({\pi}_i-{\pi_i}^{\hbox{'}}\right)\right\Vert \end{array}} \vspace*{-12pt}$$
(2.174)
$$ {\displaystyle \begin{array}{l}\mathrm{ii}\Big)\kern0.5em \mathrm{For}\ \mathrm{any}\ G{\pi}_i,G{\pi_i}^{\hbox{'}}> Ce,\ \mathrm{where}\ i=1,\dots, m,\mathrm{we}\ \mathrm{have}\\ {}\left\Vert \mathit{\operatorname{diag}}\left({G}^{\mathrm{T}}\left(\max \left\{ G\pi - Ce,0\right\}-\max \left\{{G\pi}^{\hbox{'}}- Ce,0\right\}\right)\right)\right\Vert =\left\Vert {G}^{\mathrm{T}}G\left({\pi}_i-{\pi_i}^{\hbox{'}}\right)\right\Vert \end{array}} \vspace*{-12pt}$$
(2.175)
$$ {\displaystyle \begin{array}{l}\mathrm{Taken}\ \mathrm{together},\mathrm{we}\ \mathrm{can}\ \mathrm{obtain}\\ {}\left\Vert \mathit{\operatorname{diag}}\left({G}^{\mathrm{T}}\left(\max \left\{ G\pi - Ce,0\right\}-\max \left\{{G\pi}^{\hbox{'}}- Ce,0\right\}\right)\right)\right\Vert \\ {}\le \left\Vert {G}^{\mathrm{T}}G\left(\pi -{\pi}^{\hbox{'}}\right)\right\Vert \le \left\Vert {G}^{\mathrm{T}}\right\Vert\ \left\Vert G\right\Vert\ \left\Vert \pi -{\pi}^{\hbox{'}}\right\Vert \end{array}} \vspace*{-12pt}$$
(2.176)
$$ {\displaystyle \begin{array}{l}\mathrm{Similarly},\mathrm{we}\ \mathrm{have}\\ {}\left\Vert \mathit{\operatorname{diag}}\left({H}^{\mathrm{T}}\left(\max \left\{ H\pi + De,0\right\}-\max \left\{{H\pi}^{\hbox{'}}+ De,0\right\}\right)\right)\right\Vert \\ {}\le \left\Vert {H}^{\mathrm{T}}H\left(\pi -{\pi}^{\hbox{'}}\right)\right\Vert \le \left\Vert {H}^{\mathrm{T}}\right\Vert\ \left\Vert H\right\Vert\ \left\Vert \pi -{\pi}^{\hbox{'}}\right\Vert \end{array}} \vspace*{-12pt}$$
(2.177)
$$ {\displaystyle \begin{array}{l}\mathrm{As}\ \mathrm{a}\ \mathrm{result},\mathrm{let}\ K=\left\Vert \Lambda \right\Vert +2\left\Vert \lambda \right\Vert \left\Vert {G}^{\mathrm{T}}\right\Vert \left\Vert G\right\Vert +2\left\Vert \mu \right\Vert \left\Vert {H}^{\mathrm{T}}\right\Vert \left\Vert H\right\Vert; \\ {}\mathrm{then}\ \mathrm{we}\ \mathrm{obtain}\end{array}} \vspace*{-12pt}$$
$$ \left\Vert \nabla f\left(\pi \right)-\nabla f\left({\pi}^{\hbox{'}}\right)\right\Vert \le K\left\Vert \pi -{\pi}^{\hbox{'}}\right\Vert $$
(2.178)

According to the Theorem 2.2 in [19], {π k} either terminates at a stationary point \( \left\{{\pi}^{\overline{k}}\right\} \), or is an infinite sequence, whose accumulation point is stationary and \( \underset{k\to \infty }{\lim}\nabla f\left({\pi}^k\right)=0 \).

Theorem 2.5

If Λ of Algorithm 2.4 is positive definite, then the sequence of iterates {π k } generated by the subproblems of ( 2.168 ) converges linearly to the unique solution \( \overline{\pi} \) , and the rate of convergence is

$$ \left\Vert {\pi}^k-\overline{\pi}\right\Vert \le {\left(\frac{2}{\gamma}\left(f\left({\pi}^k\right)-f\left(\overline{\pi}\right)\right)\right)}^{\frac{1}{2}}{\left(1-\frac{1}{p}{\left(\frac{\gamma }{K}\right)}^2\right)}^{\frac{1}{2}}, $$
(2.179)

where γ,K > 0 are constants.

Proof

For

$$ {\displaystyle \begin{array}{l}\forall \pi, {\pi}^{\hbox{'}}\in {R}^{2l},\\ {}{\left(\nabla f\left(\pi \right)-\nabla f\left({\pi}^{\hbox{'}}\right)\right)}^{\mathrm{T}}\left(\pi -{\pi}^{\hbox{'}}\right)={\left(\pi -{\pi}^{\hbox{'}}\right)}^{\mathrm{T}}\Lambda \left(\pi -{\pi}^{\hbox{'}}\right)\\ {}+\left(2{\lambda}^{\mathrm{T}}\mathit{\operatorname{diag}}\left({G}^{\mathrm{T}}\left(\max \left\{ G\pi - Ce,0\right\}-\max \left\{{G\pi}^{\hbox{'}}- Ce,0\right\}\right)\right)\right.\\ {}\left.+2{\mu}^{\mathrm{T}}\mathit{\operatorname{diag}}\left({H}^{\mathrm{T}}\left(\max \left\{ H\pi + De,0\right\}-\max \left\{{H\pi}^{\hbox{'}}+ De,0\right\}\right)\right)\right)\left(\pi -{\pi}^{\hbox{'}}\right)\end{array}} $$
(2.180)

It is known that

$$ {\displaystyle \begin{array}{l}\mathit{\operatorname{diag}}\left({G}^{\mathrm{T}}\left(\max \left\{ G\pi - Ce,0\right\}-\max \left\{{G\pi}^{\hbox{'}}- Ce,0\right\}\right)\right)\left(\pi -{\pi}^{\hbox{'}}\right)\ge 0,\\ {}\mathit{\operatorname{diag}}\left({H}^{\mathrm{T}}\left(\max \left\{ H\pi + De,0\right\}-\max \left\{{H\pi}^{\hbox{'}}+ De,0\right\}\right)\right)\left(\pi -{\pi}^{\hbox{'}}\right)\ge 0\end{array}} $$
(2.181)

Since Λ is a positive definite matrix, we have

$$ {\displaystyle \begin{array}{l}{\left(\nabla f\left(\pi \right)-\nabla f\left({\pi}^{\hbox{'}}\right)\right)}^{\mathrm{T}}\left(\pi -{\pi}^{\hbox{'}}\right)\ge {\left(\pi -{\pi}^{\hbox{'}}\right)}^{\mathrm{T}}\Lambda \left(\pi -{\pi}^{\hbox{'}}\right)\ge \frac{\gamma }{2}{\left\Vert \pi -{\pi}^{\hbox{'}}\right\Vert}^2,\\ {}\forall \pi, {\pi}^{\hbox{'}}\in {R}^{2l}\end{array}} $$
(2.182)

where γ > 0 is a constant. As a result, the subproblems of (2.168) converge linearly to the unique solution \( \overline{\pi} \), and the rate of convergence is

$$ \left\Vert {\pi}^k-\overline{\pi}\right\Vert \le {\left(\frac{2}{\gamma}\left(f\left({\pi}^k\right)-f\left(\overline{\pi}\right)\right)\right)}^{\frac{1}{2}}{\left(1-\frac{1}{p}{\left(\frac{\gamma }{K}\right)}^2\right)}^{\frac{1}{2}} $$
(2.183)

3.3 An Effective Intrusion Detection Framework Based on Multiple Criteria Linear Programming and Support Vector Machine

The main contributions of this section include the following:

  1. (a) Modifications to the chaos particle swarm optimization have been proposed by adopting the time-varying inertia weight factor (TVIW) and time-varying acceleration coefficients (TVAC), namely TVCPSO, to make the search for the optimum faster and to avoid being trapped in a local optimum.

  2. (b) A weighted objective function is proposed that simultaneously takes into account the trade-off between maximizing the detection rate and minimizing the false alarm rate, along with the number of features, in order to eliminate redundant and irrelevant features as well as to increase the attack detection rate.

  3. (c) An extended version of multiple criteria linear programming, namely PMCLP, has been adopted to increase the performance of this classifier in dealing with the imbalanced intrusion detection dataset.

  4. (d) The proposed TVCPSO has been adopted to provide an effective IDS framework by determining parameters and selecting a subset of features for multiple criteria linear programming and support vector machines.

In recent years, biology-inspired approaches have been used to solve complex problems in a variety of domains such as computer science, medicine, finance and engineering [65]. Swarm intelligence is considered an artificial intelligence technique inspired by a flock of birds, a school of fish or a colony of ants, and by their unique capability to solve complex problems [65]. Briefly, swarm intelligence (SI) comprises methodologies, techniques and algorithms inspired by the study of collective behaviors in decentralized systems [66]. Particle swarm optimization is one of these techniques, introduced by Eberhart and Kennedy in 1995 [67]. Particle swarm optimization is a population-based meta-heuristic optimization technique that simulates the social behavior of individuals, namely, particles. Compared with the other algorithms in this group, this technique has several advantages, such as simplicity of implementation, scalability, robustness, quickness in finding approximately optimal solutions, and flexibility [39].

In particle swarm optimization, each individual of a population, considered as a representative of a potential solution, moves through an n-dimensional search space. After the initialization of the population, at each iteration a particle seeks the optimal solution by changing its direction, which consists of its velocity and position, according to two factors: its own best previous experience (pbest) and the best experience of all particles (gbest). Equations (2.184) and (2.185) represent updating the velocity and position, respectively, of each particle at iteration [t + 1]. At the end of each iteration the performance of all particles is evaluated by predefined fitness functions.

$$ {{\begin{array}{ll}{v}^{id}\left[t+1\right]=w.{v}^{id}\ \left[t\right]+{c}_1\ {r}_1\left({p}^{id, best}\left[t\right]-{x}^{id}\left[t\right]\right)\\ \quad{}+{c}_2\ {r}_2\left({p}^{gd, best}\left[t\right]-{x}^{id}\left[t\right]\right)\kern1em \mathrm{d}=1,2,\dots, \mathrm{D}\end{array}}} \vspace*{-12pt}$$
(2.184)
$$ {x}^{id}\left[t+1\right]={x}^{id}\left[t\right]+{v}^{id}\left[t+1\right]\kern0.75em \mathrm{d}=1,2,\dots, \mathrm{D} $$
(2.185)

Where i = 1, 2, …, N and N is the size of the swarm population. In a D-dimensional search space, x i[t] = {x i1[t], x i2[t], …, x iD[t]} represents the current position of the i th particle at iteration [t]. Likewise, the velocity vector of each particle at iteration [t] is represented by v i[t] = {v i1[t], v i2[t], …, v iD[t]}. p i, best[t] = {p i1[t], p i2[t], …, p iD[t]} represents the best position that particle i has obtained until iteration t, and p g, best[t] = {p g1[t], p g2[t], …, p gD[t]} represents the best position of the whole swarm until iteration t.

To control the balance between local and global search, the concept of an inertia weight w was introduced into the PSO algorithm by [68]. r 1 and r 2 are two D-dimensional vectors of random numbers between 0 and 1. c 1 and c 2 are positive acceleration coefficients, called the cognitive parameter and the social parameter, respectively. In fact, these two parameters control the importance of a particle's self-learning versus learning from the whole swarm.
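
One combined velocity-and-position update of Eqs. (2.184)–(2.185) for a whole swarm can be sketched as follows. The helper name `pso_step` is illustrative; `p_best` holds the per-particle best positions and `g_best` the swarm-wide best position:

```python
import numpy as np

def pso_step(x, v, p_best, g_best, w, c1, c2, rng):
    """One PSO iteration for N particles in a D-dimensional space.
    x, v, p_best are N x D arrays; g_best broadcasts against x."""
    r1 = rng.random(x.shape)            # random vectors r1, r2 in [0, 1)
    r2 = rng.random(x.shape)
    # Eq. (2.184): inertia + cognitive pull + social pull
    v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
    # Eq. (2.185): move each particle by its new velocity
    return x + v, v
```

In a full optimizer this step is followed by a fitness evaluation that refreshes `p_best` and `g_best` before the next iteration.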

In this research, in order to balance global exploration and local exploitation, time-varying acceleration coefficients (TVAC) [68, 69] and a time-varying inertia weight (TVIW) [69, 70] are adopted to adjust the acceleration coefficients and inertia weight, respectively. Both of these concepts help the PSO algorithm to find the region of the global optimum without being trapped in local minima [68, 69, 71].

In TVAC, the acceleration coefficients are adjusted by decreasing the value of c 1 from an initial value c 1i to c 1f, while the value of c 2 increases from its initial value c 2i to c 2f, as shown in Eqs. (2.186) and (2.187). Moreover, in TVIW, the inertia weight w is updated according to Eq. (2.188): a large inertia weight gives PSO more global search ability at the beginning of the run, and linearly decreasing it gives PSO better local search ability later.

$$ {c}_1={c}_{1i}+\frac{t}{t_{max}}\left({c}_{1f}-{c}_{1i}\right) \vspace*{-12pt}$$
(2.186)
$$ {c}_2={c}_{2i}+\frac{t}{t_{max}}\left({c}_{2f}-{c}_{2i}\right) \vspace*{-12pt}$$
(2.187)
$$ w={w}_{max}-\frac{t}{t_{max}}\ \left({w}_{max}-{w}_{min}\right) $$
(2.188)

Here, t represents the current iteration and t max the maximum number of iterations; c 1i, c 1f, c 2i, c 2f are constants, and w max, w min are the predefined maximum and minimum inertia weights.
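
The three schedules (2.186)–(2.188) are straightforward to compute together. The boundary values below (c1: 2.5 → 0.5, c2: 0.5 → 2.5, w: 0.9 → 0.4) are common choices in the TVAC/TVIW literature, not values fixed by the text:

```python
def tvac_tviw(t, t_max, c1i=2.5, c1f=0.5, c2i=0.5, c2f=2.5,
              w_max=0.9, w_min=0.4):
    """Time-varying coefficients of Eqs. (2.186)-(2.188) at iteration t."""
    frac = t / t_max
    c1 = c1i + frac * (c1f - c1i)       # cognitive weight decreases over time
    c2 = c2i + frac * (c2f - c2i)       # social weight increases over time
    w = w_max - frac * (w_max - w_min)  # inertia weight decays linearly
    return w, c1, c2
```

At t = 0 the swarm explores globally (large w, large c1); by t = t max it exploits locally around the swarm best (small w, large c2).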

3.3.1 Discrete Binary PSO

Although the original PSO was proposed to act in continuous space, Kennedy and Eberhart [67] also proposed a discrete binary version of PSO. In this model a particle moves in a state space restricted to zero and one on each dimension, in terms of the changes in the probabilities that a bit will be in one state or the other. The velocity update of Eq. (2.184) remains unchanged, except that x id[t], p gd, best[t] and p id, best[t] ∈ {0, 1} and v id is restricted to the range [0.0, 1.0] [15, 65]. By introducing the sigmoid function, the velocity is mapped from a continuous space to a probability space as follows:

$$ sig\ \left({v}^{id}\right)=\frac{1}{1+{e}^{\left(-{v}^{id}\right)}}\kern1em \mathrm{d}=1,2,\dots, \mathrm{D} $$
(2.189)

The new particle position is calculated by using the following rule:

$$ {x}^{id}\left[t+1\right]=\left\{\begin{array}{c}1,\kern0.75em if\ rnd\left(\right)< sig\left({v}^{id}\right)\\ {}0\kern1.25em if\ rnd\left(\right)\ge sig\left({v}^{id}\right)\end{array}\right.,\mathrm{d}=1,2,\dots, \mathrm{D} $$
(2.190)

Where sig(v id) is the sigmoid function and rnd( ) is a random number in the range [0.0, 1.0].
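
Eqs. (2.189)–(2.190) together give the binary position update, sketched here with an illustrative helper name:

```python
import numpy as np

def binary_pso_position(v, rng):
    """Binary position rule: each bit is set to 1 with probability
    sigmoid(v_id), Eq. (2.189), via the threshold test of Eq. (2.190)."""
    prob = 1.0 / (1.0 + np.exp(-v))                   # Eq. (2.189)
    return (rng.random(v.shape) < prob).astype(int)   # Eq. (2.190)
```

Large positive velocities drive a bit toward 1 almost surely, large negative velocities toward 0, and a velocity of zero leaves the bit uniformly random.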

Although traditional PSO achieves considerable results in different fields, its performance depends on the preset parameters and it often suffers from being trapped in local optima. In order to further enhance the search ability of the swarm in PSO and to avoid the search being trapped in a local optimum, the chaotic concept has been introduced by [68, 69, 71]. Here, chaos is characterized by ergodicity, randomicity and regularity.

In this section, the Logistic equation, a typical chaotic system, is adopted to perform the chaotic local search, as represented in the following:

$$ {z}_{j+1}=\mu {z}_j\left(1-{z}_j\right)\kern0.5em j=1,2,\dots m $$
(2.191)

Here, considering the n-dimensional vector z j = (z j1, z j2, …, z jn), each component of this system is a random value in the range [0, 1]; μ is the control parameter, and the system of Eq. (2.191) has been proved to be completely chaotic when 0 ≤ z 0 ≤ 1 and μ = 4. Chaos queues z 1, z 2, z 3, …, z m are generated by iteration of the Logistic equation.
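
Generating a chaos queue from the Logistic equation (2.191) with μ = 4 amounts to a simple iteration (the helper name `chaos_queue` is illustrative):

```python
def chaos_queue(z0, m, mu=4.0):
    """Iterate the logistic map z_{j+1} = mu * z_j * (1 - z_j) starting
    from z0 in [0, 1], returning the queue z_1, ..., z_m."""
    zs, z = [], z0
    for _ in range(m):
        z = mu * z * (1.0 - z)
        zs.append(z)
    return zs
```

With μ = 4 the iterates stay in [0, 1] while visiting the interval ergodically, which is what makes the queue useful for diversifying initial positions and for the chaotic local search.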

In fact, the basic chaotic ideas adopted in this section are described as follows:

  • Chaos initialization: Unlike standard PSO, in which the particles’ positions in the search space are initialized randomly, here chaos initialization is adopted to better initialize the position of each particle and to increase the diversity of the population.

  • Chaotic local search (CLS): Using the chaos queues helps PSO avoid being trapped in a local optimum and lets it search for the optimum quickly. This is done by generating the chaos queues based on the optimal position (p g, best) and then replacing the position of one particle of the population with the best position of the chaos queues.

Although different performance metrics have been proposed to evaluate the effectiveness of IDSs, the two most popular of these metrics are detection rate (DR) and false alarm rate (FAR). By comparing the actual nature of a given record, where “Positive” means an attack class and “Negative” means a normal record, with the predicted one, it is possible to consider four outcomes for this situation, as shown in Table 2.8, which is known as the confusion matrix.

Table 2.8 Confusion matrix

Here, true positive (TP) and true negative (TN) mean that the records are correctly labeled as attack and normal, respectively; that is, the IDS predicts the labels perfectly. False positive (FP) refers to a normal record being considered as an attack, and false negative (FN) means an attack record falsely considered as a normal one.

A well-performing IDS should have a high detection rate (DR) as well as a low false positive rate. In the intrusion detection domain the false positive rate is typically named the false alarm rate (FAR). Thus, particles with a higher detection rate, a lower false positive rate and a small number of selected features produce a high objective function value. Hence, in this research a weighted objective function that simultaneously takes into account the trade-off between maximizing the detection rate and minimizing the false alarm rate, along with the number of features, is proposed according to the following equation:

$$ {\displaystyle \begin{array}{l}\mathrm{Objective}\ \mathrm{function}\ \left({\mathrm{F}}_{fit}\right)=\\ {}\kern2em {w}_{DR}.\left[\frac{TP}{\left(\mathrm{TP}+\mathrm{FN}\right)}\right]+{w}_{FAR}.\left[1-\frac{FP}{\left(\mathrm{FP}+\mathrm{TN}\right)}\right]+{w}_F.\left[1-\frac{\sum_{i=1}^{nF}{\mathrm{f}}_i}{n_F}\right]\end{array}} $$
(2.192)

Since each of these three elements of the objective function has a different effect on the performance of the IDS, we convert this multiple criteria problem into a single weighted fitness function that combines the three goals linearly into one, where w DR, w FAR and w F represent the importance of the detection rate, the false alarm rate and the number of selected features in the objective function. Detection rate, or sensitivity in biomedical informatics terms, is known as the true positive rate (TPR), the ratio of true positive recognitions to the total actual positive class: \( \frac{TP}{\left(\mathrm{TP}+\mathrm{FN}\right)} \). False alarm rate (FAR), or false positive rate (FPR), is defined as \( \frac{FP}{\left(\mathrm{FP}+\mathrm{TN}\right)} \). f i represents the value of the feature mask (“1” represents that feature i is selected and “0” that it is not), and n F indicates the number of features.
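
Eq. (2.192) can be written directly from a confusion matrix. The weight values below are illustrative placeholders, since the text only requires them to reflect the relative importance of the three goals:

```python
def fitness(tp, fn, fp, tn, n_selected, n_features,
            w_dr=0.6, w_far=0.3, w_f=0.1):
    """Weighted objective of Eq. (2.192): reward high detection rate,
    low false alarm rate, and a small feature subset."""
    dr = tp / (tp + fn)     # detection rate (true positive rate)
    far = fp / (fp + tn)    # false alarm rate (false positive rate)
    return (w_dr * dr
            + w_far * (1.0 - far)
            + w_f * (1.0 - n_selected / n_features))
```

For example, with TP = 90, FN = 10, FP = 5, TN = 95 and 10 of 41 features selected, the three terms contribute 0.6·0.9, 0.3·0.95 and 0.1·(31/41), respectively.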

The specific steps of TVCPSO–MCLP and TVCPSO–SVM are described as follows:

  • Step 1: Chaotic initialization of the n + 2 particle dimensions: for the MCLP algorithm, the first two parameters are α and β, and for the SVM algorithm they are c and γ. The remaining n dimensions form the binary feature mask of the feature set, here the 41 features of the NSL-KDD Cup 99 dataset. In the binary feature mask, 1 and 0 denote selected and discarded features, respectively.

    1. (a)

      Initialize a vector \( {z}_0=\left({z}_{01},{z}_{02},\dots, {z}_{0n}\right) \), where each component is set to a random value in the range [0, 1]; by iterating the Logistic equation, a chaos queue \( {z}_1,{z}_2,\dots, {z}_n \) is obtained.

    2. (b)

      In order to map the chaos queue \( {z}_j \) into the parameter's range, the following equation is used:

$$ {\hat{Z}}_{jk}={a}_k+\left({b}_k-{a}_k\right).{z}_{jk}\kern2.25em \left(k=1,2,\dots, n\right) $$
(2.193)
  • where the value range of the k-th dimension of each particle is defined by \( \left[{a}_k,{b}_k\right] \).

  • Step 2: Compute the fitness value of the initial vector \( {\hat{Z}}_j\ \left(j=1,2,\dots, m\right) \) and then choose the best M solutions as the initial positions of M particles.

  • Step 3: Randomly initialize the velocities of the M particles, here \( {v}_j=\left({v}_{j1},{v}_{j2},\dots, {v}_{jn}\right),\kern0.5em j=1,2,\dots, M \).

  • Step 4: Update the velocity and position of each classifier’s parameters (α, β in MCLP and c, γ in SVM) according to Eqs. (2.184) and (2.185); to update the velocity and position of the features in each particle, Eqs. (2.184) and (2.190) are used, respectively.

  • Step 5: Evaluate the fitness of each particle according to Eq. (2.192) and then compare each particle’s evaluated fitness value (personal optimal fitness, pfit) with its personal best position \( {p}_{i, best} \):

    1. (a)

      If pfit is better than \( {p}_{i, best} \), then update \( {p}_{i, best} \) to the current position; otherwise keep the previous one in memory.

    2. (b)

      If pfit is better than \( {p}_{g, best} \), then update \( {p}_{g, best} \) to the current position; otherwise keep the previous \( {p}_{g, best} \).

  • Step 6: Optimize p g, best by chaos local search according to the following steps:

    1. (a)

      Set T = 0 and scale \( {p}_k^{g, best} \) into the range [0, 1] by \( {z}_k^T=\frac{p_k^{g, best}-{a}_k}{b_k-{a}_k}\kern0.5em \left(k=1,2,\dots, n\right). \)

    2. (b)

      Generate the chaos queues \( {Z}_j^T\left(T=1,2,\dots, m\right) \) by iterating the Logistic equation.

    3. (c)

      Obtain the solution set \( p=\left({p}_1,{p}_2,\dots, {p}_m\right) \) by scaling the chaotic variables \( {Z}_j^T \) into the decision variables according to \( {p}_k^T={a}_k+\left({b}_k-{a}_k\right)\cdot {z}_k^T \).

    4. (d)

      Evaluate the fitness value of each feasible solution \( p=\left({p}_1,{p}_2,\dots, {p}_m\right) \) and obtain the best solution \( {\hat{p}}^{g, best} \).

  • Step 7: If the stopping criteria are satisfied, stop the algorithm and return the global optimum, i.e., the optimal parameter values (α, β in MCLP; c, γ in SVM) and the most appropriate subset of features. Otherwise, go to Step 4.
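
The chaotic pieces of the procedure above (the logistic-map initialization of Step 1 with the scaling of Eq. (2.193), and the chaos local search of Step 6) can be sketched in Python as follows. This is a minimal illustration with hypothetical helper names and a generic `fitness_fn` callable; the control parameter mu = 4 is the standard choice for fully chaotic behaviour of the logistic map:

```python
import random

MU = 4.0  # logistic-map control parameter; mu = 4 gives chaotic dynamics on (0, 1)

def logistic_step(z):
    """One iteration of the Logistic equation z' = mu * z * (1 - z), componentwise."""
    return [MU * zk * (1.0 - zk) for zk in z]

def scale_up(z, a, b):
    """Eq. (2.193): map chaotic variables in [0, 1] into parameter ranges [a_k, b_k]."""
    return [a[k] + (b[k] - a[k]) * z[k] for k in range(len(z))]

def chaotic_init(n, a, b):
    """Step 1: start from a random z_0 in (0, 1)^n and take one chaotic step."""
    z0 = [random.uniform(0.01, 0.99) for _ in range(n)]  # avoid the degenerate seeds 0 and 1
    return scale_up(logistic_step(z0), a, b)

def chaos_local_search(p_g_best, a, b, fitness_fn, m=20):
    """Step 6: chaotic walk around the global best; keep the best candidate found."""
    n = len(p_g_best)
    # (a) scale p_g_best into [0, 1]
    z = [(p_g_best[k] - a[k]) / (b[k] - a[k]) for k in range(n)]
    best, best_fit = list(p_g_best), fitness_fn(p_g_best)
    for _ in range(m):                # (b) generate the chaos queue
        z = logistic_step(z)
        cand = scale_up(z, a, b)      # (c) scale back into the decision space
        f = fitness_fn(cand)          # (d) evaluate and keep the best solution
        if f > best_fit:
            best, best_fit = cand, f
    return best
```

By construction the local search never returns a solution worse than the incumbent \( {p}^{g, best} \), since the incumbent itself is kept whenever no chaotic candidate improves on it.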