1 Introduction

Classification is one of the fundamental tasks of the data mining and machine learning community. The need for accurate and effective solutions of classification problems proliferates throughout the business world, engineering, and the sciences. In this paper, we propose a new approach to classification problems with the aim of developing a methodology for reliable and robust risk-averse classifier design which allows the users to choose a tailored risk measurement for misclassification in the various classes. Classification problems are based on observed data; they use approximations for the true distribution of the populations to be separated. Creating a good approximation or taking into account the uncertainty of the approximated distribution is important for drawing proper conclusions. The uncertainty is exacerbated when the data is high dimensional but some or all populations are represented by small (relative to the dimensionality) samples. Furthermore, we stipulate that misclassification in different classes is associated with different risk. Naturally, when the sample sizes of the populations are imbalanced, the statistical estimates associated with the small samples carry more risk than those based on large samples. Our approach contributes to the methods for classification of imbalanced classes. Outside of that scenario, misclassification in different classes may be associated with dramatically different costs, which should be taken into account when designing a classifier. We comment further on that issue in due course.

The proposed approach has its foundation in the theory of coherent measures of risk and risk sharing. Although this theory is well advanced in the fields of mathematical finance and actuarial analysis, the classification problem does not fit the problem setting analyzed in those fields and the theoretical results on risk sharing are inapplicable here. The classification problem raises new issues, poses new challenges, and requires a dedicated analysis. We employ risk functionals, non-linear with respect to the probability, that are specific to each class. We analyze the structure of the new classifier design problem and establish its theoretical relation to the risk-neutral design problem. In particular, we show that the risk-sharing classification problem is equivalent to an optimization problem with unequal, implicitly defined weights for each data point. We implement our methodology in a binary classification scenario on several different data sets and carry out a numerical comparison with classifiers which are obtained using the Huber loss function and other popular loss functions. In these applications, we use linear support vector machines in order to demonstrate the proposed approach.

Our paper is organized as follows. In Sect. 2, we introduce the loss function and provide a formal definition of the problem, illustrating it by examples. In Sect. 3, we introduce the necessary notions; we recall the notion of law-invariant coherent measures of risk and their dual representation, which illustrates how those risk measures provide robustness to the solution of a stochastic optimization problem. The main results of our paper are contained in Sect. 4. In Sect. 5, we derive confidence intervals for the misclassification risk as measured by the coherent measures of risk. In Sect. 6, we specifically address risk-averse binary classification by proposing a risk-averse problem formulation which can be solved efficiently numerically. In Sect. 7, we refer to some related work in the areas of robust statistics, robust optimization, and measures of risk. Additionally, we discuss the problem of risk sharing and risk allocation in financial institutions. Finally, Sect. 8 reports on our numerical experiments.

2 Problem setting

We consider labeled data consisting of k subsets \(S_1,\dots , S_k\) of n-dimensional vectors. The cardinality of \(S_i\) is \(|S_i|=m_i\), \(i=1,\dots , k\). Analytically, the classification problem consists of identifying a mapping \(\phi \), whose image is partitioned into k subsets corresponding to each class of data, so that \(\phi (\cdot )\) can be used as an indicator function of each class. We adopt the following definition.

Definition 1

A classifier is a vector function \(\varphi :\mathbb {R}^n\rightarrow \mathbb {R}^d\) such that \(\varphi (x) \in K_i\) for all \(x\in S_i\), \(i=1,\dots ,k\), where \(K_i\subset \mathbb {R}^d\) and \(K_i\cap K_j=\emptyset \) for all \(i,j=1,\dots ,k\) and \(i\ne j\).

In our discussion, we assume that the classifier belongs to a certain functional family depending on a finite number of parameters, which we denote by \(\pi \in \mathbb {R}^s\). The task is to choose a suitable value for the parameter \(\pi \).

Some examples of this point of view are the following. When a support vector machine is formulated, we seek to distinguish two classes, i.e., \(k=2\). The classifier is a linear function \(\varphi (x;\pi ) :{\mathbb {R}}^n\rightarrow {\mathbb {R}}\), defined by setting

$$\begin{aligned} \varphi (x;\pi )= v^\top x -\gamma \quad \text { for any }\,\, x\in {\mathbb {R}}^n. \end{aligned}$$

The classifier is determined by the parameters \(\pi = (v,\gamma )\in \mathbb {R}^{n+1}\). The regions the classifier maps to are \(K_1= [0,+\infty )\), \(K_2= (-\infty ,0)\).

Another example arises when we wish to separate many classes (\(k\ge 3\)) by a linear classifier, which is created on the principle “one vs. all”. Then effectively, our goal is to determine functions \(\varphi _i(x;a^i,b_i):=\langle a^i, x \rangle - b_i\), where x is a data point, \(a^i\in {\mathbb {R}}^n\), \(i=1,\dots, k-1\), are the normals of the separating planes and \(b_i\) determine the location of the i-th plane. Plane i is meant to separate the data points of class i from the rest of the data points. This means that

$$\begin{aligned} \varphi _i(x;a^i,b_i) ={\left\{ \begin{array}{ll}\ge 0 &{} \text { for } x\in S_i\\ < 0 &{} \text { for } x\not \in S_i. \end{array}\right. } \end{aligned}$$
(1)

We define a \((k-1)\times n\) matrix A whose rows are the vectors \(a^i\), and a vector \(b\in {\mathbb {R}}^{k-1}\) whose components are \(b_i\). The classifier for this problem can be viewed as a vector function \(\varphi (\cdot ; A,b): {\mathbb {R}}^n\rightarrow {\mathbb {R}}^{k-1}\) defined by setting \(\varphi (x; A,b) = Ax-b\). The parameter is of the form \(\pi =(A,b)\in {\mathbb {R}}^{(k-1)(n+1)}\). Requirement (1) means that the regions \(K_i\) are the orthants

$$\begin{aligned} K_i&= \{z\in {\mathbb {R}}^{k-1}: \, z_i\ge 0, \, z_j< 0,\;\; j\ne i, \;j =1,\dots , k-1\}, \quad i=1,\dots, k-1;\\ K_k&= \{z\in {\mathbb {R}}^{k-1}: \, z_i < 0, \; i =1,\dots , k-1\}. \end{aligned}$$

This setting may be used for classification in some anomaly detection scenarios. Two approaches are known. One setting may require distinguishing between several distinct normal regimes or features of normal operational status. In that case, class k may contain the anomalous instances, while classes \(i=1,\dots, k-1\) represent normal operation. Another problem deals with several rare undesirable phenomena with distinct features. In such a scenario, we may associate classes \(i=1,\dots, k-1\) with those anomalous events and class k with normal operation. When kernels are used, the mapping \(\varphi (x;\pi )\) becomes a composition of the embedding mapping from \(\mathbb {R}^n\) to the new feature space and another classifier mapping in the feature space.

For a random observation \(z\in \mathbb {R}^n\), we calculate \(\varphi (z;\pi )\) and note that misclassification occurs when \(\varphi (z;\pi )\not \in K_i\), while \(z\in S_i\), for some \(i\in \{1,\dots , k\}\). The classification error can be defined as the distance of a particular record to the classification set to which it should belong. This definition is in harmony with the notion of an error in statistics, when a model is fit to data, in which case the error is defined as the distance of the model prediction to the realizations of the predicted random variable. Here the distance from a point r to a set K is defined by using a suitable norm in \(\mathbb {R}^n\):

$$\begin{aligned} \mathrm{dist}(r,K) =\min \{ \Vert r-a\Vert : a\in K\}. \end{aligned}$$

The distance is well-defined when the set K is convex and closed.

Fig. 1 Classification error calculation

As the records in every data class \(S_i\), \(i=1,\dots , k\) constitute a sample of an unknown distribution of a random vector \(X^i\) defined on a probability space \((\varOmega ,\mathcal {F},P)\), the following random variables:

$$\begin{aligned} Z^i(\pi )=\mathrm{dist}(\varphi (X^i;\pi ),{K_i}),\,\, i=1,\dots k, \end{aligned}$$
(2)

represent the (random) misclassification of data points in class i when parameter \(\pi \) is used. These univariate random variables are defined on a common probability space and are represented by the sampled observations

$$\begin{aligned} z^i_j(\pi )=\mathrm{dist}(\varphi (x^i_j;\pi ),{K_i}) \quad \text { with } \quad x^i_j\in S_i\quad j=1,\dots , m_i. \end{aligned}$$

The distance of \(\varphi (x^i_j;\pi )\) to \(K_i\) is the smallest translation needed to eliminate misclassification of the point \(x^i_j\). Figure 1 illustrates how the classification error for a certain binary classifier is measured. In the support vector machine, the classification error is computed by

$$\begin{aligned} \mathrm{dist}\big (\varphi (x;v,\gamma ), K_i\big )= {\left\{ \begin{array}{ll} \max (0, \gamma -\langle v, x\rangle ) &{} \text { for } x\in S_1,\\ \max (0, \langle v, x \rangle -\gamma ) &{} \text { for } x\in S_2. \end{array}\right. } \end{aligned}$$

We classify every new observation x in \(S_i\), if \(\mathrm{dist}\big (\varphi (x;v,\gamma ), K_i\big ) =0\), \(i=1,2\). In the case of SVM, the regions cover the entire image space of the classifier: \(\mathbb {R}=K_1\cup K_2.\) Therefore, the condition \(\mathrm{dist}\big (\varphi (x;v,\gamma ), K_i\big ) =0\), \(i=1,2\), always holds for exactly one class (Fig. 1).
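
For concreteness, the following short Python sketch (with illustrative data and variable names) evaluates the classification error of a linear classifier for an observation from either class, following the formulas above; it is an illustration only.

```python
import numpy as np

def binary_classification_error(x, v, gamma, label):
    """Distance of phi(x) = <v, x> - gamma to the region of the labeled class.

    Class 1 corresponds to K_1 = [0, +inf), class 2 to K_2 = (-inf, 0).
    """
    score = np.dot(v, x) - gamma
    if label == 1:
        return max(0.0, -score)   # error if the score is negative
    return max(0.0, score)        # error if the score is non-negative

# Illustrative usage with made-up data.
v, gamma = np.array([1.0, -2.0]), 0.5
print(binary_classification_error(np.array([2.0, 0.3]), v, gamma, label=1))
print(binary_classification_error(np.array([0.1, 1.0]), v, gamma, label=2))
```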

Observe that in the multi-class example, the regions \(K_i\), \(i=1,\dots k\) do not cover the entire image space of the classifier. Therefore, it is possible to observe a future instance x such that \(\mathrm{dist}\big (\varphi (x; A,b), K_i\big )>0\) for all \(i=1,\dots k\). In that case, we could classify according to the smallest distance

$$\begin{aligned} x\in S_j\quad \text {iff} \quad \mathrm{dist}\big (\varphi (x; A,b), K_j\big )= \min _{1\le i\le k} \mathrm{dist}\big (\varphi (x; A,b), K_i\big ),\quad j\in \{1,\dots ,k\}. \end{aligned}$$

Another problem arises if the minimum distance is achieved for several classes. The ambiguity could be resolved in several ways, e.g., by a sequential classification procedure, but this question is beyond the scope of our study.

If the distributions of the vectors \(X^i\), \(i=1,\dots , k\), were known, then the optimal risk-neutral classifier would be obtained by minimizing the expected error. This would be the solution of the following optimization problem:

$$\begin{aligned} \min \,\left\{ \sum _{i=1}^k {\mathbb {E}}\big [ Z^i(\pi )\big ] : \; Z^i(\pi ) = \mathrm{dist}(\varphi (X^i;\pi ),{K_i}),\; i=1,\dots, k,\; \pi \in \mathcal D \right\} . \end{aligned}$$
(3)

Here, a closed convex set \(\mathcal D\subseteq \mathbb {R}^s\) describes the set of feasible parameters \(\pi \). Our goal is to introduce a family of risk-averse classifiers, where the expectation is replaced by law-invariant coherent measures of risk.

We start with the formulation of an optimization problem for binary classification, in which the (estimated) expected total error is minimized.

$$\begin{aligned} \begin{aligned} \min _{v,\gamma ,Z^1,Z^2}\,&\, \frac{1}{m_1} \sum _{j=1}^{m_1} z_j^1 + \frac{1}{m_2} \sum _{j=1}^{m_2} z_j^2 \\ \text {s. t. }\,&\, \langle v, x_j^1\rangle - \gamma + z_j^1 \ge 0, \quad j=1,\dots ,m_1,\\&\, \langle v, x_{j}^2\rangle -\gamma - z_{j}^2 \le 0, \quad j=1,\dots ,m_2,\\&\, \Vert v\Vert =1,\quad Z^1\ge 0, \quad Z^2\ge 0. \end{aligned} \end{aligned}$$
(4)

In this formulation, \(Z^1\) and \(Z^2\) are random variables expressing the classification error for class 1 and class 2, respectively. Those variables have realizations \(z_j^1\) and \(z_j^2\). Note that \(z_j^1\) and \(z_j^2\) satisfy Eq. (2) only if the Euclidean norm of v is used in (4).

The soft-margin SVM with parameters \(M>0\) and \(\delta >0\) is formulated as follows:

$$\begin{aligned} \begin{aligned} \min _{v,\gamma ,Z^1,Z^2}\,&\, M \left( \sum _{j=1}^{m_1} z_j^1 + \sum _{j=1}^{m_2} z_j^2 \right) +\delta \Vert v\Vert ^2 \\ \text {s. t. }\,&\, \langle v, x_j^1\rangle - \gamma + z_j^1 \ge 1, \quad j=1,\dots ,m_1,\\&\, \langle v, x_{j}^2\rangle -\gamma - z_{j}^2 \le -1, \quad j=1,\dots ,m_2,\\&\, Z^1\ge 0, \quad Z^2\ge 0. \end{aligned} \end{aligned}$$
(5)

In problem (5), the normal vector v of the separating hyperplane can be of any positive length. Observe that multiplying the solution of problem (5), v and \(\gamma \), by a positive constant does not change the separating plane. In problem (5), the estimated expected total classification error equals

$$\begin{aligned} \frac{1}{m_1\Vert v\Vert } \sum _{j=1}^{m_1} \max (z_j^1-1, 0) + \frac{1}{m_2\Vert v\Vert } \sum _{j=1}^{m_2} \max (z_j^2-1, 0). \end{aligned}$$

This means that the objective function does not necessarily minimize the expected classification error although the variables \(z^1_j\) and \(z^2_j\) are indicative of misclassification occurrence.
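
For illustration, the soft-margin formulation (5) can be written directly in a convex-optimization modeling language. The following minimal sketch uses cvxpy with synthetic data (one row per observation); it is an illustration of (5) only, not the implementation used in our experiments.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=1.0, size=(40, 2))   # class 1 sample, one row per point
X2 = rng.normal(loc=-1.0, size=(30, 2))  # class 2 sample
M, delta = 1.0, 0.1                      # parameters of problem (5)

v = cp.Variable(2)
gamma = cp.Variable()
z1 = cp.Variable(X1.shape[0], nonneg=True)
z2 = cp.Variable(X2.shape[0], nonneg=True)

objective = cp.Minimize(M * (cp.sum(z1) + cp.sum(z2)) + delta * cp.sum_squares(v))
constraints = [X1 @ v - gamma + z1 >= 1,
               X2 @ v - gamma - z2 <= -1]
cp.Problem(objective, constraints).solve()
print(v.value, gamma.value)
```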

We propose a new family of risk functionals for classification: coherent measures of risk, representing the point of view that misclassification errors in each class may be treated with a different attitude to risk. While the total expected error is the sum of the expected misclassification errors in each class, the risk in a system measured by a coherent risk measure is not the sum of the risks of its components. That is why we do not simply minimize the sum of risks for each class. We adopt a point of view on optimality of risk allocation as the one in risk-sharing theory in mathematical finance. However, we emphasize that the problem setting and the results associated with risk sharing of losses in financial institutions are inapplicable to the classification problem, as will become clear in due course.

3 Coherent measures of risk

Measures of risk are widely used in finance and insurance. Additionally, the signal-to-noise measures used in engineering and statistics (the Fano factor, Fano 1947, or the index of dispersion, Cox and Lewis 1966) are of similar spirit. An axiomatic theory of measures of risk is presented in Ogryczak and Ruszczyński (1999), Artzner et al. (1999), Föllmer and Schied (2011), Kijima and Ohnishi (1993), Rockafellar et al. (2006). In a more general setting, risk measures are analyzed in Ruszczyński and Shapiro (2006). For \(p\in [1,\infty ]\) and a probability space \((\varOmega ,{\mathcal {F}},P)\), we use the notation \({\mathcal {L}}_p(\varOmega ,{\mathcal {F}},P)\) for the space of random variables with finite p-th moments.

Definition 2

A coherent measure of risk is a functional \(\varrho : \mathcal {L}_p(\varOmega )\rightarrow \mathbb {R}\) satisfying the following axioms:

  • Convexity: For all X, Y and \(\gamma \in [0,1]\), \(\varrho (\gamma X + (1-\gamma )Y) \le \gamma \varrho (X) + (1-\gamma )\varrho (Y).\)

  • Monotonicity: If \(X_\omega \ge Y_\omega \) for P-a.a \(\omega \in \varOmega \), then \(\varrho (X)\ge \varrho (Y)\).

  • Translation Equivariance: For any \(a\in \mathbb {R}\), \(\varrho (X+a)=\varrho (X)+ a\) for all X.

  • Positive Homogeneity: If \(t>0\) then \(\varrho (t X)=t\varrho (X)\) for any X.

For an overview of the theory of coherent measures of risk, we refer to Shapiro et al. (2014) and the references therein.

A risk measure \(\varrho (\cdot )\) is called law-invariant if \(\varrho (X)=\varrho (Y)\) whenever the random variables X and Y have the same distributions. It is clear that in our context, only law invariant measures of risk are relevant.

The following result is known as a dual representation of coherent measures of risk (cf. Shapiro et al. 2014). The space \({\mathcal {L}}_p(\varOmega )\) and the space \({\mathcal {L}}_q(\varOmega )\) with \(\frac{1}{p} + \frac{1}{q} =1 \) are viewed as paired vector spaces with respect to the bilinear form

$$\begin{aligned} \langle \zeta ,Z \rangle =\int _\varOmega \zeta (\omega ) Z(\omega ) dP(\omega ),\; \zeta \in \mathcal {L}_q(\varOmega ),\;Z\in \mathcal {L}_p(\varOmega ). \end{aligned}$$
(6)

For any \(\zeta \in \mathcal {L}_q(\varOmega )\), we can view \(\langle \zeta , Z\rangle \) as the expectation \(\mathbb {E}_Q[Z]\) taken with respect to the probability measure \(dQ=\zeta dP\), defined by the density \(\zeta \), i.e., Q is absolutely continuous with respect to P and its Radon–Nikodym derivative is \(dQ/dP=\zeta \). For any finite-valued coherent measure of risk \(\varrho \), a convex subset \(\mathcal {A}\) of probability density functions \(\zeta \in \mathcal {L}_q(\varOmega )\) exists, such that for any random variable \(Z\in {\mathcal {L}}_p(\varOmega )\), it holds

$$\begin{aligned} \varrho (Z)=\sup _{\zeta \in \mathcal {A}} \langle \zeta , Z\rangle =\sup _{dQ/dP\in \mathcal {A}}\mathbb {E}_Q [Z]. \end{aligned}$$
(7)

This result reveals how measures of risk provide robustness with respect to the changes of the distribution. Their application constitutes a new approach to robust statistical inference.

For a random variable \(X\in \mathcal {L}_p (\varOmega )\) with distribution function \(F_X(\eta )=P\{X \le \eta \} \), we consider its survival function \( \bar{F}_X(\eta ) = P(X>\eta ) \) and the left-continuous inverse of the cumulative distribution function defined as follows:

$$\begin{aligned} F^{(-1)}_X(\alpha ) = \inf \ \{ \eta : F_X(\eta ) \ge \alpha \} \quad \text {for}\quad 0< \alpha < 1, \end{aligned}$$

i.e., \(F^{(-1)}_X(\alpha )\) is the left \(\alpha \)-quantile of X.

We intend to investigate the distribution of classification errors and, therefore, we have a preference for small outcomes (small errors). Following Shapiro et al. (2014), the Value at Risk at level \(\alpha \) of a random error X is defined by setting

$$\begin{aligned} \mathrm{VaR}_{\alpha }(X) = F^{(-1)}_X(1-\alpha ). \end{aligned}$$

The risk here is understood as the probability of the error X being large. This point of view corresponds to minimizing the probability of misclassification. Although Value at Risk is an intuitively appealing measure, it is not coherent.

In the theory of measures of risk, a special role is played by the functional called the Conditional Value-at-Risk, denoted \(\mathrm{CVaR}_{\alpha }(\cdot )\) (see Ogryczak and Ruszczyński 2002; Rockafellar and Uryasev 2002). The Conditional Value-at-Risk of X at level \(\alpha \) is defined as

$$\begin{aligned} \mathrm{CVaR}_{\alpha }(X) = \frac{1}{\alpha } \int _0^{\alpha } \mathrm{VaR}_t(X)\, dt = \min _{\eta \in \mathbb {R}} \left\{ \eta + \frac{1}{\alpha } \mathbb {E}\big [(X-\eta )_+\big ] \right\} . \end{aligned}$$
(8)

For the second equality, we refer to Dentcheva and Martinez (2012), Ogryczak and Ruszczyński (2002). This representation (cf. also Shapiro et al. 2014) is suitable for optimization problems. Due to the Kusuoka theorem (Kusuoka 2001; Shapiro et al. 2014, Thm. 6.24), every law-invariant, finite-valued coherent measure of risk on \(\mathcal {L}^{p}(\varOmega )\) for a non-atomic probability space can be represented as a mixture of Conditional Value-at-Risk at all probability levels. This result can be extended to finite probability spaces with equally likely observations.
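
As a small illustration, the Conditional Value-at-Risk of an empirical sample of errors can be computed from the extremal (minimization) representation of CVaR quoted above. The following Python sketch (illustrative data only) scans the sample points for the minimizing \(\eta \):

```python
import numpy as np

def empirical_cvar(errors, alpha):
    """Empirical CVaR_alpha: average of the worst alpha-fraction of errors.

    Uses eta + E[(X - eta)_+] / alpha; for an empirical distribution the
    minimum over eta is attained at one of the sample points.
    """
    errors = np.asarray(errors, dtype=float)
    values = [eta + np.mean(np.maximum(errors - eta, 0.0)) / alpha
              for eta in errors]
    return min(values)

sample = np.array([0.0, 0.0, 0.1, 0.4, 1.2, 2.5])
print(empirical_cvar(sample, alpha=0.5))   # mean of the worst half of the errors
```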

A popular class of coherent measures of risk is based on the semideviations of a random variable. The upper semideviation of order p is defined as

$$\begin{aligned} \sigma ^+_p[Z] :=\Big ( \mathbb {E}\Big [\big (Z-\mathbb {E}[Z]\big )_+^p\Big ]\Big )^{1/p}, \end{aligned}$$
(9)

where \(p\in [1,\infty )\) is a fixed parameter. It is well defined for all random variables Z with finite p-th order moments. The mean–upper-semideviation measure has the general form

$$\begin{aligned} \mathbb {E}[Z] + {c} \sigma ^+_p[Z], \quad \text {for some constant } c\in [0,1]. \end{aligned}$$
(10)

The mean–upper-semideviation measure reflects a preference for small realizations of the random variable: it penalizes the excess over the expected value when Z depends on the choice of a decision maker and the choice is made so as to minimize this risk measure. This measure of risk is suited for our purposes since, in the setting of misclassification, small values are preferred. A higher value of the constant c corresponds to stronger risk aversion, while \(c=0\) corresponds to the risk-neutral attitude.
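
The following few lines (Python, illustrative only) evaluate (9)–(10) on a sample of misclassification errors, assuming equally weighted observations:

```python
import numpy as np

def mean_upper_semideviation(errors, c=0.5, p=2):
    """Sample version of E[Z] + c * (E[(Z - E[Z])_+^p])^(1/p), see (9)-(10)."""
    z = np.asarray(errors, dtype=float)
    mean = z.mean()
    semidev = np.mean(np.maximum(z - mean, 0.0) ** p) ** (1.0 / p)
    return mean + c * semidev

print(mean_upper_semideviation([0.0, 0.0, 0.1, 0.4, 1.2, 2.5], c=0.5, p=2))
```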

In order to illustrate the connection to robust statistics and robust optimization, we provide the dual representation of the risk measures which we shall use in our numerical study. For parameters \(\alpha ,\beta , c\), all contained in [0, 1] and \(p\ge 1\), we have:

$$\begin{aligned}&\mathrm{CVaR}_{\alpha }(Z) = \sup \{\mathbb {E}_\zeta [Z]\,:\, \zeta \in \mathcal {L}_{\infty }(\varOmega ),\; \zeta (\omega )\in [0,\alpha ^{-1}]\;\text {a.e.}\};\nonumber \\&(1-\beta )\mathbb {E}[Z]+ \beta \mathrm{CVaR}_\alpha (Z)\nonumber \\&\qquad = \sup \{\mathbb {E}_\zeta [Z]\,:\, \zeta \in \mathcal {L}_{\infty }(\varOmega ),\; \zeta (\omega )\in [1-\beta ,1 + \beta (1-\alpha )\alpha ^{-1}]\;\text {a.e.}\};\nonumber \\&\mathbb {E}[Z] + {c}\, \sigma ^+_p[Z] = \sup \{\mathbb {E}_\zeta [Z]\,:\, \zeta \in \mathcal {L}_q(\varOmega ),\; \zeta =1+\xi -\mathbb {E}[\xi ],\;\Vert \xi \Vert _q\le c\} \end{aligned}$$
(11)

Note that using these measures results in taking the worst expected misclassification when it is evaluated not only with the empirical probability mass function but with all mass functions satisfying the conditions in (11).

Statistical estimators of spectral law-invariant measures of risk using Kusuoka representations are proposed in Dentcheva and Penev (2010). Furthermore, central limit theorems for general composite risk functionals, which incorporate the risk measures used in this paper are established in Dentcheva et al. (2017). Other classes of coherent measures of risk were proposed and analyzed in Dentcheva et al. (2010), Krokhmal (2007), Ogryczak and Ruszczyński (2002), Shapiro et al. (2014) and the references therein.

4 Risk sharing in classification

First, we introduce the notion of a risk-averse classifier and show that risk-averse classifiers are associated with minimal points of the attainable errors, where the minimality is with respect to a suitable stochastic order. Let a set of labeled data, a parametric classifier family \(\varphi (\cdot ;\pi )\) with the associated collection of sets \(K_i\), \(i=1,\dots ,k\), and the law-invariant coherent risk measures \(\varrho _i\), \(i=1,\dots ,k\), be given. The presumption is that we have different attitudes to misclassification risk in the various classes and the total risk is shared among the classes according to risk-averse preferences. Risk preferences in classification have been previously specified via cost-sensitive objective functions, which may place different costs on the misclassification errors of each class. However, it is difficult to determine a proper weighting that truly reflects the risk preferences of the decision maker. In this section, we show that using risk functionals leads to a cost-sensitive formulation, where the weighting is implied by the choice of risk measure.

First, we show that \(Z^i(\pi )\in \mathcal {L}_1(\varOmega ,\mathcal {F},P)\), \(i=1,\dots , k\), under appropriate conditions.

Theorem 1

Assume that \(X^i\in \mathcal {L}_p(\varOmega ,\mathcal {F},P)\), \(K_i\) are closed convex sets, \(i=1,\dots , k\), and that the function \(\varphi (\cdot ;\pi )\) satisfies the following growth condition

$$\begin{aligned} \Vert \varphi (x,\pi ) \Vert \le C_1(\pi ) + C_2(\pi )\Vert x\Vert ^p, \quad x\in \cup _{i=1}^k~\mathrm{supp~}{X^i}, \end{aligned}$$
(12)

where \(C_1(\pi )\) and \(C_2(\pi )\) are constants depending on \(\pi \) and \(\Vert \varphi (x,\pi ) \Vert \) refers to the Euclidean norm in \(\mathbb {R}^d\). Then \(Z^i(\pi )\in \mathcal {L}_1(\varOmega ,\mathcal {F},P)\), \(i=1,\dots , k\).

Proof

The distance functions \(z\mapsto \mathrm{dist}(z,K_i)\), \(i=1,\dots , k\), are continuous convex functions (see, e.g., Beer 1993) and \(\mathrm{dist}(z,K_i)<\infty \) for all \(z\in \mathbb {R}^n\). Furthermore, let \(\hat{x}_i\) be the norm-minimal element of \(K_i\). Then

$$\begin{aligned} \mathrm{dist}(z,K_i)\le \Vert z-\hat{x}_i\Vert \le \Vert z\Vert + \Vert \hat{x}_i\Vert . \end{aligned}$$

For all \(i=1,\dots , k,\) we obtain the estimate over the support of the random vector \(X^i\):

$$\begin{aligned} \mathrm{dist}(\varphi (x;\pi ),{K_i})\le \Vert \varphi (x;\pi )\Vert + \Vert \hat{x}_i\Vert \le C_1(\pi ) + C_2(\pi )\Vert x\Vert ^p + \Vert \hat{x}_i\Vert . \end{aligned}$$

We conclude that \(Z^i(\pi )\), \(i=1,\dots , k\), are integrable because \(\Vert X^i\Vert _p\) is finite. \(\square \)

Let \(\mathcal {Y}\) denote the set of all random vectors \((Z^1(\pi ),\dots ,Z^k(\pi ))\) obtained as \(Z^i(\pi )=\mathrm{dist}(\varphi (X^i;\pi ),{K_i})\) for some \(\pi \in \mathcal D\), i.e., \(\mathcal {Y}\) is the set of all attainable classification errors considered as random vectors in the corresponding probability space. In the classification problem, we deal with their representation from the available sample calculated as follows:

$$\begin{aligned} z^i_j(\pi )=\mathrm{dist}(\varphi (x^i_j;\pi ),{K_i}),\quad x^i_j\in S_i,\quad j=1,\dots , m_i,\;\; i=1,\dots, k, \end{aligned}$$

for a given parameter \(\pi \in \mathcal D\).

Definition 3

A vector \(w\in \mathbb {R}^k\) represents an attainable risk allocation for the classification problem, if a parameter \(\pi \in \mathcal D\) exists such that

$$\begin{aligned} w = \big (\varrho _1(Z^1(\pi )), \dots , \varrho _k(Z^k(\pi ))\big )\in \mathbb {R}^k\quad \text {for}\quad \big (Z^1(\pi ),\dots ,Z^k(\pi )\big )\in \mathcal {Y}. \end{aligned}$$

We denote the set of all attainable risk allocations by \(\mathcal {X}\). Assume that a partial order on \(\mathbb {R}^k\) is induced by a pointed convex cone \(\mathcal {K}\subset \mathbb {R}^k\), i.e.,

$$\begin{aligned} v\preceq _\mathcal {K}w \text { if and only if } w-v\in \mathcal {K}. \end{aligned}$$

Recall that a point \(v\in A\subset \mathbb {R}^k\) is called a \(\mathcal {K}\)-minimal point of the set A if no point \(w\in A\), \(w\ne v\), exists such that \(v-w\in \mathcal {K}\). If \(\mathcal {K}=\mathbb {R}^k_+\), then the notion of \(\mathcal {K}\)-minimal points of a set corresponds to the well-known notion of Pareto-efficiency or Pareto-optimality in \(\mathbb {R}^k\).

Definition 4

A classifier \(\varphi (\cdot ;\pi )\) is called \(\mathcal {K}\)-optimal risk-averse classifier, if its risk-allocation is a \(\mathcal {K}\)-minimal element of \(\mathcal {X}\). If \(\mathcal {K}=\mathbb {R}^k_+\), then the classifier is called Pareto-optimal.

From now on, we focus on Pareto-optimality, but our results are extendable to the case of more general orders defined by pointed cones.

Definition 5

A risk-sharing classification problem (RSCP) is given by the set of labeled data, a parametric classifier family \(\varphi (\cdot ;\pi )\) with the associated collection of sets \(K_i\), \(i=1\dots ,k\), and a set of law-invariant risk measures \(\varrho _i\), \(i=1\dots ,k\). The risk-sharing classification problem consists of identifying a parameter \(\pi \in \mathcal D\) resulting in a Pareto-optimal classifier \(\varphi (\cdot ;\pi )\).

We shall see that the Pareto-minimal risk allocations are produced by random vectors, which are minimal points in the set \(\mathcal {Y}\) with respect to the usual stochastic order, defined next.

Definition 6

A random variable Z is stochastically larger than a random variable \(Z'\) with respect to the usual stochastic order (denoted \(Z\succeq _{(1)}Z'\)), if

$$\begin{aligned} {\mathbb {P}} (Z>\eta )\ge {\mathbb {P}} (Z' >\eta )\quad \forall \, \eta \in \mathbb {R}, \end{aligned}$$
(13)

or, equivalently, \(F_{Z}(\eta )\le F_{Z'}(\eta )\). The relation is strict (denoted \(Z\succ _{(1)}Z'\)), if additionally, inequality (13) is strict for some \( \eta \in \mathbb {R}\).

A random vector \(\mathbf {Z}= (Z_1,\dots Z_k)\) is stochastically larger than a random vector \(\mathbf {Z}'= (Z'_1,\dots Z'_k)\) (denoted \(\mathbf {Z} \succeq \mathbf {Z}'\)) if \(Z_i\succeq _{(1)}Z_i'\) for all \(i=1,\dots k\). The relation is strict if for some component \(Z_i\succ _{(1)}Z_i'\).

The random vectors of \(\mathcal {Y}\), which are non-dominated with respect to this order will be called minimal points of \(\mathcal {Y}\).

For more information on stochastic orders see, e.g., Shaked and Shanthikumar (2007).

The following result is known for atomless probability spaces. We verify it for a sample space in order to deal with the empirical distributions.

Theorem 2

Suppose the probability space \((\varOmega ,\mathcal {F},P)\) is finite with equal probabilities of all simple events. Then every law-invariant risk functional \(\varrho \) is consistent with the usual stochastic order if and only if it satisfies the monotonicity axiom. If \(\varrho \) is strictly monotonic with respect to the almost sure relation, then \(\varrho \) is consistent with the strict dominance relation, i.e. \(\varrho (Z_1)<\varrho (Z_2)\) whenever \(Z_2\succ _{(1)}Z_1\).

Proof

Assuming that \(\varOmega =\{ \omega _1,\dots ,\omega _m\}\), let the random variable \(U(\omega _i)=\frac{i}{m}\) for all \(i=1,\dots , m\). If \(Z_2\succeq _{(1)}Z_1\), then defining \(\hat{Z}_1:=F_{Z_1}^{-1}(U)\) and \(\hat{Z}_2:=F_{Z_2}^{-1}(U)\), we obtain \(\hat{Z}_2(\omega )\ge \hat{Z}_1(\omega )\) for all \(\omega \in \varOmega \). Due to the monotonicity axiom, \(\varrho (\hat{Z_2}) \ge \varrho (\hat{Z_1})\). The random variables \(\hat{Z}_i\) and \(Z_i\), \(i=1,2\), have the same distribution by construction. This entails that \(\varrho ({Z_2}) \ge \varrho ({Z_1})\) because the risk measure is law invariant. Consequently, the risk measure \(\varrho \) is consistent with the usual stochastic order. The other direction is straightforward. \(\square \)

This observation justifies our restriction to risk measures which are consistent with the usual stochastic order, also known as the first-order stochastic dominance relation. Furthermore, when dealing with non-negative random variables, as in the context of classification, strictly monotonic risk measures assign zero risk only when no misclassification occurs, as shown by the following statement.

Lemma 1

If \(\varrho \) is a law invariant strictly monotonic coherent measure of risk, then

$$\begin{aligned} \begin{aligned} \varrho (Z)&> 0 \quad \text { for all random variables } Z\ge 0\,\, a.s., \quad Z\not \equiv 0\\ \varrho (Z)&< 0 \quad \text { for all random variables } Z\le 0\,\, a.s., \quad Z\not \equiv 0. \end{aligned} \end{aligned}$$
(14)

Proof

Denote the random variable, which is identically equal zero by \(\mathbf {0}\). Notice that \(\varrho (\mathbf {0}) = \varrho (2\cdot \mathbf {0}) = 2\varrho (\mathbf {0})\), which implies that \(\varrho (\mathbf {0})=0\). If \(Z\ge 0\) a.s. and \(Z\not \equiv 0\), then \(\varrho (Z)> \varrho (\mathbf {0})=0\) by the strict monotonicity of \(\varrho \). The second statement follows analogously. \(\square \)

This statement implies that \(\varrho _i(Z^i(\pi ))\ge 0\), \(i=1,\dots, k\), for all \(\pi \in \mathcal D\) and, therefore, the attainable allocations lie in the non-negative orthant, i.e., \(\mathcal {X}\subseteq {\mathbb {R}}^k_+\). Consequently, the set \(\mathcal {X}\) has minimal elements with respect to the Pareto-order. From now on, we adopt the following assumptions:

  1. (A1)

    The risk measures \(\varrho ^i\) used for evaluation of classification errors in classes \(i=1,...,k\) are coherent, law invariant, and finite-valued.

  2. (A2)

    The sets \(K_i\), \(i=1,\dots k\) and \(\mathcal {D}\subseteq \mathbb {R}^s\) are non-empty, closed and convex.

  3. (A3)

    The function \(\varphi (\cdot ;\pi )\) satisfies the growth condition (12).

We point out that Assumptions (A2) and (A3) are satisfied for the examples given in Sect. 2.

Theorem 3

Assume (A1)–(A3). If the function \(\varphi (x,\cdot )\) is continuous for every argument \(x\in \mathbb {R}^n\), then the components of the attainable risk allocations \(\varrho _i(Z^i(\cdot ))\), \(i=1,\dots k\), are continuous functions. If additionally, each component of the vector function \(\varphi (x,\cdot )\) is an affine function, then \(\varrho _i(Z^i(\cdot ))\), \(i=1,\dots k\) are convex functions.

Proof

Recall again that the distance functions \(z\mapsto \mathrm{dist}(z,K_i)\) are continuous convex functions and \(\mathrm{dist}(z,K_i)<\infty \) for all \(z\in \mathbb {R}^n\). Thus, the composition of the distance function with the continuous function \(\varphi (x;\cdot )\) is continuous, meaning that the random variable \(Z^i (\pi ) = \mathrm{dist}(\varphi (X^i;\pi ),K_i)\) has realizations which are continuous functions of \(\pi \). The variables \(Z^i\) are integrable due to Theorem 1. Therefore, \(Z^i (\cdot )\) is continuous with respect to the norm in the space \(\mathcal {L}_1(\varOmega )\). Since the risk measures \(\varrho _i(\cdot )\) are convex and finite-valued, they are continuous on \(\mathcal {L}_1\). We conclude that the composition \(\varrho _i(Z^i (\cdot ))\) is continuous.

In order to prove convexity, let \(\lambda \in (0,1)\) and let \(\pi _\lambda = \lambda \pi +(1-\lambda )\pi '\).

Let \(z^i(\pi ), z^i(\pi ')\in K_i\) be the points such that

$$\begin{aligned} \Vert \varphi (x;\pi ) -z^i(\pi )\Vert = \min _{z\in K_i}\Vert \varphi (x;\pi ) -z\Vert \end{aligned}$$
(15)
$$\begin{aligned} \Vert \varphi (x;\pi ') -z^i(\pi ')\Vert = \min _{z\in K_i}\Vert \varphi (x;\pi ') -z\Vert \end{aligned}$$
(16)

We define \(z_\lambda = \lambda z^i(\pi )+(1-\lambda ) z^i(\pi ')\). Due to the convexity of \(K_i\), we have \(z_\lambda \in K_i.\) As \(\varphi (x,\cdot )\) is affine, we obtain

$$\begin{aligned} \varphi (x;\pi _\lambda ) = \lambda \varphi (x;\pi ) +(1-\lambda )\varphi (x;\pi '). \end{aligned}$$

This entails the following inequality for all \(i=1,\dots, k\):

$$\begin{aligned} \min _{z\in K_i}\Vert \varphi (x;\pi _\lambda ) -z\Vert&\le \Vert \varphi (x;\pi _\lambda ) -z_\lambda \Vert = \Vert \varphi (x;\pi _\lambda ) -\lambda z^i(\pi )-(1-\lambda ) z^i(\pi ')\Vert \\&= \big \Vert \lambda \big (\varphi (x;\pi )-z^i(\pi )\big ) +(1-\lambda )\big (\varphi (x;\pi ')-z^i(\pi ')\big )\big \Vert \\&\le \lambda \Vert \varphi (x;\pi )-z^i(\pi )\Vert +(1-\lambda )\Vert \varphi (x;\pi ')-z^i(\pi ')\Vert \\&= \lambda \min _{z\in K_i}\Vert \varphi (x;\pi )-z\Vert +(1-\lambda )\min _{z\in K_i}\Vert \varphi (x;\pi ')-z\Vert . \end{aligned}$$

Therefore,

$$\begin{aligned} \mathrm{dist}(\varphi (x;\pi _\lambda ),K_i) \le \lambda \mathrm{dist}(\varphi (x;\pi ),K_i) +(1-\lambda )\mathrm{dist}(\varphi (x;\pi '),K_i). \end{aligned}$$

The monotonicity and convexity axioms for the risk measures imply that

$$\begin{aligned}&\varrho _i \big (\mathrm{dist}(\varphi (X^i;\pi _\lambda ),K_i) \big ) \\&\quad \le \lambda \varrho _i\big (\mathrm{dist}(\varphi (X^i;\pi ),K_i) \big ) + (1-\lambda )\varrho _i\big (\mathrm{dist}(\varphi (X^i;\pi '),K_i)\big ) \end{aligned}$$

for all \(i=1,\dots , k.\)\(\square \)

This result implies the existence of a Pareto-optimal classifier. Furthermore, the convexity property allows us to identify the Pareto-optimal risk allocations by using scalarization techniques.

Corollary 1

Assume (A1)–(A3) and let the function \(\varphi (x,\cdot )\) be affine for every argument \(x\in \mathbb {R}^n\). Then a parameter \(\pi \) defines a Pareto-optimal classifier \(\varphi (\cdot , \pi )\) for the given RSCP if and only if a scalarization vector \(w\in \mathbb {R}^k_+\) exists with \(\sum _{i=1}^k w_i=1\), such that \(\pi \) is a solution of the problem

$$\begin{aligned} \min _{\pi \in \mathcal {D}} \sum _{i=1}^k w_i\varrho _i\big (\mathrm{dist}(\varphi (X^i;\pi ),K_i)\big ). \end{aligned}$$
(17)

Proof

The statement follows from the well-known scalarization theorem for vector optimization problems (Miettinen 1999) and Theorem 3. \(\square \)

Theorem 4

Assume that the risk measures \(\varrho _i\) are law invariant and strictly monotonic for all \(i=1,\dots k.\) If a classifier \(\varphi (\cdot ;\pi )\) is Pareto-optimal, then its corresponding random vector \((Z^1(\pi ),\dots , Z^k(\pi ))\) is a minimal point of \(\mathcal {Y}\) with respect to the order of Definition 6.

Proof

Suppose that \(\varphi (\cdot ;\pi )\) is Pareto-optimal and the point \(Z(\pi ) =(Z^1(\pi ),\dots , Z^k(\pi ))\) is not minimal. Then a parameter \(\pi '\) exists such that the corresponding vector \(Z(\pi ')\) is strictly stochastically dominated by \(Z(\pi )\), which implies \(Z^i(\pi ) \succeq _{(1)}Z^i(\pi ') \) with a strict relation for some component. We obtain \(\varrho _i(Z^i(\pi )) \ge \varrho _i(Z^i(\pi '))\) for all \(i=1,\dots , k\), with a strict inequality for some i, due to the consistency of the strictly monotonic coherent measures of risk with the strict stochastic order relation (Theorem 2), which contradicts the Pareto-optimality of \(\varphi (\cdot ;\pi )\). \(\square \)

We consider the sample space \(\varOmega =\prod _{i=1}^k \varOmega _i\) where \((\varOmega _i,\mathcal {F}_i,P_i)\) is a finite space with \(m_i\) simple events \(\omega _j\in \varOmega _i\), \(P_i(\omega _j)=\frac{1}{m_i}\), and \(\mathcal {F}_i\) consisting of all subsets of \(\varOmega _i\).

Theorem 5

Assume (A1)–(A3). Suppose each component of the vector function \(\varphi (x,\cdot )\) is affine for every \(x\in \mathbb {R}^n\). If the parameter \(\hat{\pi }\) defines a Pareto-optimal classifier \(\varphi (\cdot , \hat{\pi })\) for the RSCP, then a probability measure \(\mu \) on \(\varOmega \) exists so that \(\hat{\pi }\) is an optimal solution for the problem

$$\begin{aligned} \min _{\pi \in \mathcal {D}} \sum _{i=1}^k \sum _{j=1}^{m_i} \mu ^i_j \mathrm{dist}(\varphi (x^i_j;\pi ),K_i). \end{aligned}$$
(18)

Proof

Since the parameter \(\hat{\pi }\) defines a Pareto-optimal classifier \(\varphi (\cdot , \hat{\pi })\) for the RSCP and all conditions of Corollary 1 are satisfied, \(\hat{\pi }\) is an optimal solution of problem (17) for some scalarization w. Let \(\mathcal {A}_i\) denote the set of probability densities corresponding to the risk measure \(\varrho _i\), \(i=1,\dots , k\), in representation (7). Since the risk measures \(\varrho _i\) take finite values, the sets \(\mathcal {A}_i\) are non-empty and compact. Thus, the supremum in the dual representation (7) is achieved at some elements \(\zeta ^i\in \mathcal {A}_i\). We have \(\zeta ^i_j\ge 0\), \(\sum _{j=1}^{m_i} \frac{\zeta ^i_j}{m_i} =1\) because \(\zeta ^i\) are probability densities. We obtain

$$\begin{aligned} \varrho _i(\mathrm{dist}(\varphi (X^i;\pi ),K_i)) = \sum _{j=1}^{m_i} \frac{\zeta ^i_j}{m_i} \mathrm{dist}(\varphi (x^i_j;\pi ),K_i). \end{aligned}$$

Setting \(\displaystyle { \mu ^i_j=w_i\frac{\zeta ^i_j}{m_i},\; j=1,\dots , m_i,\; i=1,\dots , k,} \) we observe that the vector \(\mu \in \mathbb {R}^{m_1+\cdots +m_k}\) constitutes a probability mass function. Thus, problem (17) can be reformulated as (18). \(\square \)

This result shows that the RSCP can be viewed as a classification problem in which the expectation error is minimized. However, the expectation is not calculated with respect to the empirical distribution but with respect to another measure \(\mu \), which is implicitly determined by the chosen measures of risk. It is the worst expectation according to our risk-averse preferences, which are represented by the choice of the measures \(\varrho _i\), \(i=1,\dots , k.\)

The composite nature of problem (17) makes it difficult to solve directly, and that is why we reformulate the problem. We introduce auxiliary variables \(Y^i\in \mathcal L_1(\varOmega , \mathcal F, P;\mathbb {R}^{d})\), \(i=1,\dots, k\), which satisfy the constraints:

$$\begin{aligned} \varphi (X^i;\pi )+Y^i \in K_i \quad \forall i=1,\dots , k. \end{aligned}$$

Problem (17) can be reformulated as

$$\begin{aligned} \begin{aligned} \min _{\pi ,Y}\,&\,\sum _{i=1}^k w_i\varrho _i(\Vert Y^i\Vert )\\ \text {s.t. }&\, \varphi (X^i;\pi )+Y^i\in K_i, \quad \forall i=1,\dots , k,\\&\, \pi \in \mathcal {D}. \end{aligned} \end{aligned}$$
(19)

We show that this problem is equivalent to (17).

Lemma 2

Assume that \(K_i\), \(i=1,\dots , k,\) are non-empty, closed convex sets. For any solution \(\hat{\pi }\) of problem (17), random vectors \(\hat{Y}^i\) exist so that \((\hat{\pi },\hat{Y})\), with \(\hat{Y}= (\hat{Y}^1,\dots , \hat{Y}^k)\), solves problem (19) as well. Conversely, for any solution \((\hat{\pi },\hat{Y})\) of problem (19), the vector \(\hat{\pi }\) is a solution of problem (17).

Proof

Observe that for any fixed point \(\pi \in \mathcal {D}\), the function \(\sum _{i=1}^k w_i\varrho _i(\Vert Y^i\Vert )\) achieves its minimal value with respect to the constraints on the variables \(Y^i\) by using the projections of the realizations of \(\varphi (X^i;\pi )\) onto \(K_i\):

$$\begin{aligned} Y^i (\omega ) = \mathrm{Proj}_{K_i}\big (\varphi (X^i(\omega );{\pi })\big ) - \varphi (X^i(\omega );{\pi }). \end{aligned}$$
(20)

Here \(\mathrm{Proj}_{K_i}(z)\) denotes the Euclidean projection of the point z onto the set \(K_i.\) Then, \(\Vert Y^i \Vert = \mathrm{dist}(\varphi (X^i;\pi ),K_i)\) and the objective functions of both problems have the same value. Therefore, the minimal value is achieved at the same point \(\hat{\pi }\) and the corresponding \(\hat{Y}^i_j\) is obtained from Eq. (20). \(\square \)

Recall that the normal cone to a set \(\mathcal {D}\subset \mathbb {R}^s\) is defined as

$$\begin{aligned} \mathcal {N}_{\mathcal {D}}(\pi ) = \{a\in \mathbb {R}^s: \langle a, d - \pi \rangle \le 0\;\text { for all } d\in \mathcal {D}\}. \end{aligned}$$

For brevity, we denote the normal cone to the feasible set of problem (19) by \(\mathcal {N}\) and the normal cones to the sets \(K_i\) by \(\mathcal {N}_i\), \(i=1,\dots , k\). We formulate optimality conditions for problem (19).

We denote the realizations of the random vectors \(Y^i\), \(i=1,\dots , k\) when \(\pi \) is used, by \(y^i_j(\pi )\), \(j=1,\dots m_i\), \(i=1,\dots , k\). More precisely, we have

$$\begin{aligned} y^i_j(\pi ) = \mathrm{Proj}_{K_i}\big (\varphi (x^i_j;{\pi })\big ) - \varphi (x^i_j;{\pi }),\quad j=1,\dots, m_i,\; i=1,\dots , k. \end{aligned}$$

We suppress the argument \(\pi \) whenever it does not lead to confusion. Additionally, we denote the Jacobian of \(\varphi \) with respect to \(\pi \) by \( D\varphi (x;{\pi })\). Consider the sample-based version of problem (19):

$$\begin{aligned} \begin{aligned} \min _{\pi ,Y}\,&\,\sum _{i=1}^k w_i\varrho _i(\Vert Y^i\Vert )\\ \text {s.t. }&\, \varphi (x^i_j;\pi )+y^i_j\in K_i, \quad \forall j=1,\dots , m_i,\; i=1,\dots , k,\\&\, \pi \in \mathcal {D}. \end{aligned} \end{aligned}$$
(21)

Theorem 6

Assume that the sets \(K_i\), \(i=1,\dots , k\) are closed convex polyhedral cones and \(\varphi (x; \cdot )\) is an affine vector function. A feasible point \((\hat{\pi },\hat{Y})\) is optimal for problem (21) if and only if probability mass functions \(\zeta ^i\in \partial \varrho _i (0)\) and vectors \(g^i_j\) from \(\partial \Vert \hat{y}^i_j\Vert \) exist such that

$$\begin{aligned}&0\in -\sum _{i=1}^k \sum _{j=1}^{m_i} w_i \zeta ^i_j(g^i_j)^\top D\varphi (x^i_j;\hat{\pi }) +\mathcal N_{\mathcal {D}}(\hat{\pi }) \end{aligned}$$
(22)
$$\begin{aligned}&-w_i\zeta ^i_j g^i_j \in \mathcal N_i\big (\varphi (x^i_j;\hat{\pi })+\hat{y}^i_j \big ) \quad \text { for all }\; j=1,\dots, m_i,\quad i=1,\dots, k. \end{aligned}$$
(23)

Proof

We assign Lagrange multipliers \(\lambda ^i_j\) to the inclusion constraints and define the Lagrange function as follows:

$$\begin{aligned} L(\pi , Y,\lambda ) = \sum _{i=1}^k \left( w_i\varrho _i(\Vert Y^i\Vert ) + \sum _{j=1}^{m_i} \big \langle \varphi (x^i_j;\pi )+{y}^i_j , \lambda ^i_j\big \rangle \right) . \end{aligned}$$

Using optimality conditions, we obtain that \((\hat{\pi },\hat{Y})\) is optimal for problem (21) if and only if \(\hat{\lambda }\) exists such that

$$\begin{aligned}&0\in \partial _{(\pi ,Y)} L(\hat{\pi },\hat{Y},\hat{\lambda }) +\mathcal {N}(\hat{\pi },\hat{Y}),\\&\hat{\lambda }^i_j \in \mathcal N_i \big (\varphi (x^i_j;\hat{\pi }) +\hat{y}^i_j \big ). \end{aligned}$$

Considering the partial derivatives of the Lagrangian with respect to the two components, we obtain

$$\begin{aligned}&0\in \sum _{i=1}^k \sum _{j=1}^{m_i} (\hat{\lambda }^i_j)^\top D\varphi (x^i_j;\hat{\pi })+\mathcal N_{\mathcal {D}} (\hat{\pi }) \end{aligned}$$
(24)
$$\begin{aligned}&0\in w_i\partial _{Y^i} \varrho _i (\Vert Y^i\Vert ) +\hat{\lambda }^i,\quad i=1,\dots, k, \end{aligned}$$
(25)
$$\begin{aligned}&\hat{\lambda }^i_j \in \mathcal N_i\big (\varphi (x^i_j;\hat{\pi }) +\hat{y}^i_j \big ) , \quad j=1,\dots , m_i,\; i=1,\dots k. \end{aligned}$$
(26)

We calculate the multipliers \(\hat{\lambda }^i \) from Eq. (25) using elements \(\zeta ^i\in \partial \varrho _i (0)\) and \(g^i_j\in \partial \Vert \hat{y}^i_j\Vert \). We obtain:

$$\begin{aligned} \hat{\lambda }^i_j= - w_i \zeta ^i_j g^i_j,\quad j=1,\dots , m_i,\; i=1,\dots k. \end{aligned}$$

Notice that \( g^i_j = \frac{\hat{y}^i_j}{\Vert \hat{y}^i_j\Vert }\) whenever \(\hat{y}^i_j \ne 0\); otherwise \(g^i_j\in \mathbb {R}^d\) can be any vector with \(\Vert g^i_j\Vert \le 1\). Substituting the value of \(\hat{\lambda }^i \) into (24) and (26), we obtain conditions (22) and (23). \(\square \)

We note that we can define again a probability mass function \(\mu \) by setting \(\mu ^i_j= w_i\zeta ^i_j\) and interpret the Karush–Kuhn–Tucker conditions as follows:

$$\begin{aligned}&\mathbb {E}_{\mu }\big [ (g^i_j)^\top D\varphi (x^i_j;\hat{\pi })\big ] \in \mathcal N_{\mathcal {D}}(\hat{\pi }),\\&\quad -\mu ^i_j g^i_j \in \mathcal N_i\big (\varphi (x^i_j;\hat{\pi })+\hat{y}^i_j \big ) \text { for all }\; j=1,\dots, m_i,\; i=1,\dots, k. \end{aligned}$$

Problem (21) can be reformulated as a risk-averse two-stage optimization problem (cf. Shapiro et al. 2009). The first stage decision is \(\pi \) and the first stage problem is

$$\begin{aligned} \min _{ \pi \in \mathcal {D}} \sum _{i=1}^k w_i\varrho _i\big (Z^i(\pi )\big ) . \end{aligned}$$
(27)

Given \(\pi \), the calculation of each realization of \(Z^i(\pi )\) amounts to solving the following problem

$$\begin{aligned} z^i_j(\pi )= \min _{y\in K_i} \Vert \varphi (x^i_j;\pi ) -y\Vert , \quad j=1,\dots m_i,\; i=1,\dots k. \end{aligned}$$
(28)

Calculating \(z^i_j(\pi )\) might be very easy for specific regions \(K_i\), such as the cones in the example of the polyhedral classifier. There, every component of the residual vector \(\hat{z}^i_j\) associated with problem (28) can be computed as follows:

$$\begin{aligned} (\hat{z}^i_j)_\ell = {\left\{ \begin{array}{ll} \max \{0, -(\varphi (x^i_j;\pi ))_\ell \} &{}\text { for } \ell = i;\\ \max \{0, (\varphi (x^i_j;\pi ))_\ell \} &{}\text { for } \ell \ne i; \end{array}\right. } \quad \ell =1,\dots , k-1. \end{aligned}$$
(29)

Then the optimal value of (28) is

$$\begin{aligned} z^i_j(\pi ) = \left( \sum _{\ell =1}^{k-1} (\hat{z}^i_j)_\ell ^2\right) ^{\frac{1}{2}}. \end{aligned}$$

This point of view facilitates the application of stochastic optimization methods to solve the problem.
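
As an illustration, the second-stage computation (28)–(29) for the polyhedral regions of Sect. 2 reduces to componentwise clipping. The following Python sketch (with illustrative data; indices are 0-based in the code) evaluates the residuals to each region and classifies a point by the smallest distance, as discussed in Sect. 2:

```python
import numpy as np

def residual_to_region(phi, i):
    """Componentwise residuals of phi = A x - b (in R^{k-1}) to the region K_i.

    K_i (i <= k-1) requires component i to be >= 0 and all others to be
    negative; K_k requires all components to be negative (0-based i here).
    """
    r = np.maximum(phi, 0.0)          # penalize positive components
    if i < len(phi):                  # classes 1,...,k-1 (0-based: i < k-1)
        r[i] = max(0.0, -phi[i])      # component i must be non-negative
    return r

def classify_min_distance(x, A, b):
    """Assign x to the class whose region is closest to A x - b."""
    phi = A @ x - b
    k = A.shape[0] + 1
    dists = [np.linalg.norm(residual_to_region(phi, i)) for i in range(k)]
    return int(np.argmin(dists)) + 1  # 1-based class label

A = np.array([[1.0, 0.0], [0.0, 1.0]])   # k - 1 = 2 separating planes, k = 3
b = np.array([0.5, 0.5])
print(classify_min_distance(np.array([2.0, -1.0]), A, b))  # expected: class 1
```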

5 Confidence intervals for the risk

In this section, we analyze the risk-averse classification problem when the sample sizes increase and derive confidence intervals for the misclassification risk. We use the results on statistical inference for composite risk functionals presented in Dentcheva et al. (2017), where a composite risk functional is defined in the following way.

$$\begin{aligned} \varrho (X) = \mathbb {E} \left[ f_{1}\left( \mathbb {E} \left[ f_{2}\left( \mathbb {E} \left[ \cdots f_{\ell }\left( \mathbb {E}\left[ f_{\ell +1}\left( X \right) \right] ,X\right) \right] \cdots ,X \right) \right] ,X\right) \right] \end{aligned}$$
(30)

where X is an n-dimensional random vector with unknown distribution, \(P_{X}\). The functions \(f_{j}\) are such that \(f_{j}(\eta _{j},x):\mathbb {R}^{n_{j}}\times \mathbb {R}^{n} \rightarrow \mathbb {R}^{n_{j-1}}\) for \(j = 1,\ldots ,\ell \) and \(n_{0} = 1\). The function \(f_{\ell +1}\) is such that \(f_{\ell +1}(x):\mathbb {R}^{n} \rightarrow \mathbb {R}^{n_{\ell }}\).

A law-invariant risk-measure \(\varrho (X)\) is an unknown characteristic of the distribution \(P_{X}\). The empirical estimate of \(\varrho (X)\) given N independent and identically distributed observations of X is given by the plug-in estimate

$$\begin{aligned} \begin{aligned} \varrho ^{(N)} =&\sum _{i_0=1}^N\frac{1}{N}\left[ f_1\left( \sum _{i_1=1}^N\frac{1}{N}\left[ f_2\left( \sum _{i_2=1}^N\frac{1}{N}\left[ \;\cdots f_\ell \left( \sum _{i_\ell =1}^N\frac{1}{N}f_{\ell +1}(X_{i_\ell }),X_{i_{\ell -1}}\right) \right] \right. \right. \right. \right. \\&\left. \left. \left. \left. \;\cdots ,X_{i_1}\right) \right] ,X_{i_0}\right) \right] \end{aligned} \end{aligned}$$
(31)

It is shown in Dentcheva et al. (2017) that the most popular measures of risk fit the structure (30). It is established that the plug-in estimator satisfies a central limit formula and the limiting distribution is described. This is the distribution of the Hadamard-directional derivative of the risk functional \(\varrho \) when a normal random variable is plugged in. Recall the notion of Hadamard directional derivatives of the functions \(f_{j}\big (\cdot ,x)\) at points \(\mu _{j+1}\) in directions \(\zeta _{j+1}\). It is given by

$$\begin{aligned} f'_{j}\big (\mu _{j+1},x;\zeta _{j+1}) = {\mathop {\mathop {\lim }\limits _{t\downarrow 0}}\limits _{s\rightarrow \zeta _{j+1}}}\frac{1}{t} \big [ f_{j}\big (\mu _{j+1}+ts,x) - f_{j}\big (\mu _{j+1},x)\big ]. \end{aligned}$$

The central limit formula holds under the following conditions:

  1. (i)

    \(\int \Vert f_j(\eta _j,x)\Vert ^2 \;P(dx)<\infty \) for all \(\eta _j\in I_j\), and \(\int \mathrm{dist}^2(\varphi (X^i;\pi ),{K_i}) P(dx)<\infty \);

  2. (ii)

    For all realizations x of \(X^i\), the functions \(f_j(\cdot ,x)\), \(j=1,\dots ,\ell \), are Lipschitz continuous:

    $$\begin{aligned} \Vert f_j(\eta _j',x)- f_j(\eta _j'',x)\Vert \le \gamma _j(x) \Vert \eta _j'-\eta _j''\Vert ,\quad \forall \; \eta _j',\eta _j'', \end{aligned}$$

    and \(\int \gamma _j^2(x)\;P(dx) <\infty \).

  3. (iii)

    For all realizations x of \(X^i\), the functions \(f_j(\cdot ,x)\), \(j=1,\dots ,\ell \), are Hadamard directionally differentiable.

These properties are satisfied for the mean-semideviation risk measures, as shown in Dentcheva et al. (2017). Furthermore, it is shown there that a similar construction represents the Conditional Value-at-Risk.
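
To illustrate the plug-in construction (31), the following sketch evaluates a generic nested estimator and instantiates it for the mean–upper-semideviation measure; the particular decomposition into functions \(f_1, f_2, f_3\) below is our illustrative choice and not necessarily the decomposition used in the cited reference.

```python
import numpy as np

def plugin_estimate(sample, fs, f_inner):
    """Plug-in estimator (31): replace every expectation in (30) by a sample
    average, working from the innermost function outward."""
    sample = np.asarray(sample, dtype=float)
    eta = np.mean([f_inner(x) for x in sample])          # innermost expectation
    for f in reversed(fs):                               # f_ell, ..., f_1
        eta = np.mean([f(eta, x) for x in sample])
    return eta

# Mean-upper-semideviation of order p as a composite functional (ell = 2):
#   f_3(x) = x,  f_2(eta, x) = (x - eta)_+^p,  f_1(eta, x) = x + c * eta^(1/p).
c, p = 0.5, 2
f3 = lambda x: x
f2 = lambda eta, x: max(x - eta, 0.0) ** p
f1 = lambda eta, x: x + c * eta ** (1.0 / p)

errors = np.array([0.0, 0.0, 0.1, 0.4, 1.2, 2.5])
print(plugin_estimate(errors, fs=[f1, f2], f_inner=f3))
# Direct check of E[Z] + c * sigma_p^+[Z] on the same sample:
print(errors.mean() + c * np.mean(np.maximum(errors - errors.mean(), 0) ** p) ** (1 / p))
```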

For every parameter \(\pi \), the risk of misclassification for a given class \(i=1,\dots , k\) can be fit to the setting (30) by choosing the innermost function \(f_{\ell +1}:\mathbb {R}^{n} \rightarrow \mathbb {R}\) to be \(f_{\ell +1}(x) = \mathrm{dist}(\varphi (x;\pi ),{K_i})\), whenever properties (i)–(iii) are satisfied.

In our setting, each misclassification risk \(\varrho _i\Big (\mathrm{dist}\big (\varphi (X^i;\pi ),{K_i}\big )\Big )\) is estimated by \(\varrho _i^{(m_i)}\big (\Vert \hat{Y}^i\Vert \big )\), where \((\hat{Y}^i;\hat{\pi })\) is the solution of problem (21). Denoting the estimated variance of the limiting distribution of \(\varrho _i^{(m_i)}\big (\Vert \hat{Y}^i\Vert \big )\) (briefly \(\varrho _i^{(m_i)}\)) by \(\sigma _i^2\), we obtain the following confidence interval:

$$\begin{aligned} \Big [ \;\varrho _i^{(m_i)} - t_{\alpha ,\mathrm{df}}\frac{\sigma _i}{\sqrt{m_i}},\quad \varrho _i^{(m_i)} + t_{\alpha ,\mathrm{df}}\frac{\sigma _i}{\sqrt{m_i}}\;\; \Big ]. \end{aligned}$$

Here \(\alpha \) is the desired level of confidence, and \(t_{\alpha ,\mathrm{df}}\) is the corresponding quantile of the t-distribution with df degrees of freedom. The degrees of freedom depend on the choice of risk measure and can be calculated as \(\mathrm{df}=m_i-\ell \), where \(\ell \) is the number of compositions in formula (31). The decrease of the degrees of freedom from \(m_i\) is due to the estimation of the expected value associated with each composition. The total risk is estimated by

$$\begin{aligned} \hat{\varrho }= \sum _{i=1}^k w_i\varrho _i^{(m_i)}\big (\Vert \hat{Y}^i\Vert \big ). \end{aligned}$$

We obtain that \(\hat{\varrho }\) has an approximately normal distribution with expected value \(\varrho \) and variance \(\sum _{i=1}^k \frac{w_i^2\sigma _i^2}{m_i}.\) A confidence interval for the entire risk \(\varrho \), associated with the optimal classifier, is given by

$$\begin{aligned} \left[ \;\hat{\varrho }- t_{\alpha ,\mathrm{df}}\sqrt{\sum _{i=1}^k \frac{w_i^2\sigma _i^2}{m_i}},\quad \hat{\varrho } + t_{\alpha ,\mathrm{df}} \sqrt{\sum _{i=1}^k \frac{w_i^2\sigma _i^2}{m_i}}\;\; \right] . \end{aligned}$$

We can use the confidence interval to evaluate how well the risk is estimated. The quantity \(\frac{w_i^2\sigma _i^2}{m_i}\) gives guidance about the need for additional observations from class i; such observations would effectively reduce the size of the confidence interval of the risk and, thus, help us to evaluate the risk more precisely.
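
The following short sketch computes the per-class and total confidence intervals from estimated risks, variances, sample sizes, and weights; the two-sided quantile convention and the pooled degrees of freedom for the total risk are our assumptions, since the text above leaves them implicit.

```python
import numpy as np
from scipy import stats

def risk_confidence_intervals(rho, sigma, m, w, ell=2, alpha=0.05):
    """Per-class and total confidence intervals for the misclassification risk.

    rho, sigma, m, w: per-class risk estimates, estimated standard deviations
    of the limiting distributions, sample sizes, and weights.
    Assumptions: two-sided intervals with the t-quantile at level 1 - alpha/2;
    degrees of freedom pooled as sum(m) - k * ell for the total risk.
    """
    rho, sigma, m, w = map(np.asarray, (rho, sigma, m, w))
    per_class = []
    for r, s, mi in zip(rho, sigma, m):
        half = stats.t.ppf(1 - alpha / 2, df=mi - ell) * s / np.sqrt(mi)
        per_class.append((r - half, r + half))
    total = float(np.dot(w, rho))
    total_sd = float(np.sqrt(np.sum(w**2 * sigma**2 / m)))
    df_total = int(m.sum() - len(m) * ell)
    half = stats.t.ppf(1 - alpha / 2, df=df_total) * total_sd
    return per_class, (total - half, total + half)

print(risk_confidence_intervals(rho=[0.8, 1.4], sigma=[0.3, 0.6],
                                m=[200, 50], w=[0.5, 0.5]))
```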

6 Risk sharing in SVM

We analyze the SVM problem in more detail. We consider only strictly monotonic coherent measures of risk \(\varrho _1, \varrho _2\) for the two classes \(S_1\) and \(S_2\).

The risk-sharing SVM problem (RSSVM) consists in identifying a parameter \(\pi =(v,\gamma )\in \mathbb {R}^{n+1}\) corresponding to a Pareto-minimal point of the attainable risk allocations for the affine classifier \(\varphi (z;\pi )=\langle v, z\rangle -\gamma \). Due to Corollary 1, we can determine a risk-averse classifier by solving the following problem:

$$\begin{aligned} \min _{v,\gamma ,Z^1,Z^2}&\, \lambda \varrho _1(Z^1) + (1-\lambda )\varrho _2(Z^2) \end{aligned}$$
(32)
$$\begin{aligned} \text {s. t. }&\, \langle v, x^1_j\rangle - \gamma + z_j^1 \ge 0, \quad j=1,\dots ,m_1, \end{aligned}$$
(33)
$$\begin{aligned}&\, \langle v, x^2_{j}\rangle -\gamma - z_{j}^2 \le 0, \quad j=1,\dots ,m_2, \end{aligned}$$
(34)
$$\begin{aligned}&\, \langle v,v\rangle =1, \end{aligned}$$
(35)
$$\begin{aligned}&\, Z^1\ge 0,\quad \, Z^2\ge 0. \end{aligned}$$
(36)

Here \(\lambda \in (0,1)\) is a parameter representing the scalarization and is indicative of the risk sharing between the classes. The random variables \(Z^i\) can be represented by deterministic vectors stacking all realizations \(z_j^i\) as components. Abusing notation, we shall use \(Z^i\) also for those vectors in \({\mathbb {R}}^{m_i}\), \(i=1,2\).

We note that the normalization of the vector v automatically bounds \(\gamma \) because for any fixed v, the component \(\gamma \) can be considered restricted in a compact set \([\gamma _m(v), \gamma _M(v)] \), where

$$\begin{aligned} \gamma _M(v) = \max _{1\le j\le m_i,\; i=1,2} v^\top x^i_j,\qquad \gamma _m(v) =\min _{1\le j\le m_i,\; i=1,2} v^\top x^i_j. \end{aligned}$$
(37)

Thus, in this case, we can set \(\mathcal {D}=\mathbb {R}^n.\)

We also consider a soft-margin risk-averse SVM based on problem (4), although the classification error might not be calculated properly. The problem reads

$$\begin{aligned} \min _{v,\gamma ,Z^1,Z^2}\ \Big \{ \lambda \varrho _1(Z^1) + (1-\lambda )\varrho _2(Z^2) +\delta \Vert v\Vert ^2 {:}\; (33), (34),(36)\Big \} \end{aligned}$$
(38)

In this problem, \(\delta >0\) is a small number. The objective function grows to infinity when the norm of v increases. Thus, we do not need to bound the norm of the vector v. It also automatically bounds \(\gamma \), similar to problem (32)–(36).

We obtain a counterpart of the result in Jouini et al. (2008) for the risk sharing of random losses among constituents. We observe that the parameter \((v,\gamma )\) for each Pareto-optimal classifier can be obtained by solving the following problem:

$$\begin{aligned} \begin{aligned} \min _{v,\gamma ,Z^1,Z^2}&\, \varrho _1(Z^1) + \varrho _2(Z^2) \\ \text {s. t. }&\, \langle v, x^1_i\rangle - \gamma + \frac{1}{\lambda }z_i^1 \ge 0, \quad i=1,\dots ,m_1,\\&\, \langle v, x^2_{j}\rangle -\gamma - \frac{1}{1-\lambda }z_{j}^2 \le 0, \quad j=1,\dots ,m_2,\\&\, \langle v,v\rangle =1,\quad \; Z^1\ge 0,\quad \, Z^2\ge 0. \end{aligned} \end{aligned}$$
(39)

Lemma 3

Problem (39) is equivalent to problem (32)–(36).

Proof

The equivalence follows from the axiom of positive homogeneity for the risk measures:

$$\begin{aligned} \lambda \varrho _1(Z^1) = \varrho _1(\lambda Z^1) \quad \text {and}\quad (1-\lambda ) \varrho _2(Z^2) = \varrho _2((1-\lambda ) Z^2). \end{aligned}$$

Defining new random variables \(\tilde{Z}^1= \lambda Z^1\) and \(\tilde{Z}^2= (1-\lambda ) Z^2\), we can rescale the variables in their respective inequality constraints. \(\square \)

Although problem (32)–(36) is non-convex due to the presence of constraint (35), we can solve it by a dedicated numerical method using sequential local convex approximations. Let \(\bar{v}\) be a fixed point. The non-convex constraint can be approximated locally by using the Taylor expansion:

$$\begin{aligned} \langle {v},{v}\rangle -1 \approx \langle \bar{v},\bar{v}\rangle -1 +2\langle \bar{v}, v-\bar{v}\rangle = 2\langle \bar{v}, v\rangle -\langle \bar{v}, \bar{v}\rangle -1. \end{aligned}$$

If \(\Vert \bar{v}\Vert =1\), then we obtain:

$$\begin{aligned} \langle {v},{v}\rangle -1 \approx 2(\langle \bar{v}, v\rangle -1). \end{aligned}$$

For the sake of brevity, we denote the objective function by f:

$$\begin{aligned} f(v,\gamma ,Z^1,Z^2) = \lambda \varrho _1(Z^1) + (1-\lambda )\varrho _2(Z^2) \end{aligned}$$

Observe that for a feasible vector \((v, \gamma )\) with \(\Vert v\Vert \ne 1\), the misclassification errors are \(\frac{1}{\Vert v\Vert }Z^1\) and \(\frac{1}{\Vert v\Vert }Z^2\). Thus, using the positive homogeneity property of the risk measures, the true risk of misclassification is

$$\begin{aligned} \frac{1}{\Vert v\Vert } \big (\lambda \varrho _1(Z^1) + (1-\lambda )\varrho _2(Z^2)\big ). \end{aligned}$$

We propose the following method for solving problem (32)–(36); a schematic code rendering of the loop is given after the listing.

  • Risk averse binary classification method   

  • Step 0 Set \(\ell =1\), \(v^0 = \frac{1}{m_1}\sum _{i=1}^{m_1}x_i^1-\frac{1}{m_2}\sum _{i=1}^{m_2}x_i^2\), calculate \(\gamma ^0 \), \(Z^{1,0} \), and \(Z^{2,0} \) as a solution of the following problem

    $$\begin{aligned} \begin{aligned} \min _{\gamma ,Z^1,Z^2}&\, f(v^0,\gamma ,Z^1,Z^2)\\ \text {s. t. }&\, \langle v^0, x^1_i\rangle - \gamma + z_i^1 \ge 0, \quad i=1,\dots ,m_1,\\&\, \langle v^0, x^2_{j}\rangle -\gamma - z_{j}^2 \le 0, \quad j=1,\dots ,m_2. \end{aligned} \end{aligned}$$
    (40)
  • Step 1 Solve the convex approximation problem

    $$\begin{aligned} \begin{aligned} \min _{v,\gamma ,Z^1,Z^2}&\, f(v,\gamma ,Z^1,Z^2) \\ \text {s. t. }&\, \langle v^{\ell -1}, v\rangle =1,\; (33),\; (34),\; (36). \end{aligned} \end{aligned}$$
    (41)

    Denote its solution by \(\hat{\xi }^\ell = (\hat{v}^{\ell },\hat{\gamma }^{\ell }, \hat{Z}^{1,\ell },\hat{Z}^{2,\ell })\).

  • Step 2 If \(f(\hat{\xi }^\ell )= f(\xi ^{\ell -1})\), then stop; otherwise set

    $$\begin{aligned} v^{\ell }= \frac{1}{\Vert \hat{v}^{\ell }\Vert }\hat{v}^{\ell },\quad \gamma ^{\ell } = \frac{1}{\Vert \hat{v}^{\ell }\Vert }\hat{\gamma }^{\ell },\quad Z^{i,\ell } = \frac{1}{\Vert \hat{v}^{\ell }\Vert } \hat{Z}^{i,\ell },\; i=1,2. \end{aligned}$$

    Increase \(\ell \) by one and go to Step 1.
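The loop below is a schematic Python rendering of Steps 0–2; solve_step0 and solve_linearized are placeholders for convex solvers of problems (40) and (41), respectively (for instance, built with CVXPY as in the sketch above), and the tolerance tol is a practical surrogate for the exact equality test in Step 2. This is a sketch of the iteration logic only, not a complete implementation.

```python
import numpy as np

def risk_averse_binary_classification(X1, X2, solve_step0, solve_linearized, tol=1e-8):
    """Sketch of the sequential convex approximation method (Steps 0-2)."""
    # Step 0: initial direction = difference of the class sample means.
    v = X1.mean(axis=0) - X2.mean(axis=0)
    gamma, Z1, Z2, f_prev = solve_step0(v)            # solves problem (40) for fixed v
    while True:
        # Step 1: solve problem (41) with the linearized constraint <v_prev, v> = 1.
        v_hat, gamma_hat, Z1_hat, Z2_hat, f_hat = solve_linearized(v)
        # Step 2: stop if no progress; otherwise renormalize and iterate.
        if f_prev - f_hat <= tol:
            return v, gamma, Z1, Z2
        nrm = np.linalg.norm(v_hat)
        v, gamma = v_hat / nrm, gamma_hat / nrm
        Z1, Z2 = Z1_hat / nrm, Z2_hat / nrm
        f_prev = f_hat / nrm    # value of f at the renormalized point (positive homogeneity)
```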

Theorem 7

If the method stops at iteration \(\ell \), the point \(\xi ^{\ell -1}=({v}^{\ell -1},{\gamma }^{\ell -1},{Z}^{1,\ell -1},{Z}^{2,\ell -1})\) satisfies the optimality conditions for problem (32)–(36). Otherwise, the method generates a sequence of points \(\left\{ ({v}^{\ell }, {\gamma }^{\ell },{Z}^{1,\ell },{Z}^{2,\ell })\right\} _{\ell =1}^\infty \) such that every accumulation point satisfies the optimality conditions for problem (32)–(36).

Proof

First, we formulate optimality conditions for problem (32)–(36). Assign Lagrange multipliers \(\mu ^1\in \mathbb {R}^{m_1}_+\), \(\mu ^2\in \mathbb {R}^{m_2}_+\), and \(\mu ^3\in \mathbb {R}\) to the constraints (33) (reformulated as \(\le \)), (34), and (35), respectively. The Lagrange function has the form

$$\begin{aligned} \varLambda (v,\gamma ,Z^1,Z^2)= & {} \lambda \varrho _1(Z^1) + (1-\lambda )\varrho _2(Z^2) +\sum _{i=1}^{m_1} \mu ^1_i (-\langle v, x^1_i\rangle + \gamma -z_i^1) \\&+\sum _{j=1}^{m_2} \mu ^2_j (\langle v, x^2_j\rangle - \gamma - z_j^2) +\mu ^3 (\langle v,v\rangle -1) \\&= \lambda \varrho _1(Z^1) + (1-\lambda )\varrho _2(Z^2) - \sum _{i=1}^{m_1} \mu ^1_i z_i^1 -\sum _{j=1}^{m_2} \mu ^2_j z_j^2\\&+\sum _{i=1}^{m_1} \mu ^1_i (-\langle v, x^1_i\rangle + \gamma ) + \sum _{j=1}^{m_2} \mu ^2_j (\langle v, x^2_j\rangle - \gamma ) +\mu ^3 (\langle v,v\rangle -1). \end{aligned}$$

If the point \((\bar{v},\bar{\gamma },\bar{Z}^1,\bar{Z}^2)\) is optimal, then we have

$$\begin{aligned}&0= \nabla _{v,\gamma } \varLambda (\bar{v},\bar{\gamma },\bar{Z}^1,\bar{Z}^2); \end{aligned}$$
(42)
$$\begin{aligned}&0\in \partial _{Z^1,Z^2} \varLambda (\bar{v},\bar{\gamma },\bar{Z}^1,\bar{Z}^2) +\mathcal {N}_{\mathbb {R}_+^{m_1+m_2}} (\bar{Z}^1,\bar{Z}^2), \end{aligned}$$
(43)
$$\begin{aligned}&\mu ^1_i (-\langle v, x^1_i\rangle + \gamma -z_i^1)=0,\quad i=1,\dots , m_1, \end{aligned}$$
(44)
$$\begin{aligned}&\mu ^2_j (\langle v, x^2_j\rangle - \gamma - z_j^2)=0,\quad j=1,\dots , m_2. \end{aligned}$$
(45)

Condition (42) has the form

$$\begin{aligned}&-\sum _{i=1}^{m_1}\mu ^1_i x_i^1 + \sum _{i=1}^{m_2}\mu ^2_i x^2_i +2\mu ^3 \bar{v } = 0 \end{aligned}$$
(46)
$$\begin{aligned}&\sum _{i=1}^{m_1}\mu _i^1 =\sum _{i=1}^{m_2}\mu ^2_i. \end{aligned}$$
(47)

Condition (43) is equivalent to the existence of subgradients \(\zeta ^1\in \partial \varrho _1(\bar{Z}^1)\) and \(\zeta ^2\in \partial \varrho _2(\bar{Z}^2)\) such that

$$\begin{aligned} \begin{aligned}&\lambda \zeta ^1 \ge \mu ^1 \ge 0,\quad \langle \lambda \zeta ^1 -\mu ^1, \bar{z}^1\rangle =0, \\&(1-\lambda ) \zeta ^2 \ge \mu ^2 \ge 0,\quad \langle (1-\lambda ) \zeta ^2 -\mu ^2,\bar{z}^2 \rangle =0. \end{aligned} \end{aligned}$$
(48)

It is easy to see that, for \(\bar{v}= v^{\ell -1}\), conditions (46)–(48) coincide with the optimality conditions for problem (41) at iteration \(\ell \), with optimal Lagrange multipliers \(\mu ^1,\mu ^2\) and \(2\mu ^3\in \mathbb {R}\).

If the method stops at iteration \(\ell \), we have \(f(\hat{\xi }^\ell )= f(\xi ^{\ell -1})\). The point \(\xi ^{\ell -1}\) is feasible for problem (41) at iteration \(\ell \). Therefore, \(\xi ^{\ell -1}\) is optimal for (41) and satisfies its optimality conditions. Since \(\Vert {v}^{\ell -1}\Vert =1\), the point \(\xi ^{\ell -1}\) is feasible for problem (32)–(36). Therefore, the optimality conditions for (32)–(36) are also satisfied at \(\xi ^{\ell -1}\).

Now consider the case when the method generates an infinite sequence of points \(\{\xi ^\ell \}\). For any solution \(\hat{\xi }^\ell \) of problem (41), we have

$$\begin{aligned} 1=\langle v^{\ell -1}, \hat{v}^\ell \rangle \le \Vert v^{\ell -1}\Vert \,\Vert \hat{v}^\ell \Vert = \Vert \hat{v}^\ell \Vert . \end{aligned}$$

Thus, if the method does not stop at iteration \(\ell \), the following inequality holds:

$$\begin{aligned} f(\xi ^{\ell }) = \frac{1}{\Vert \hat{v}^\ell \Vert } f(\hat{\xi }^\ell )\le f(\hat{\xi }^\ell ) < f({\xi }^{\ell -1}). \end{aligned}$$

Consequently, the sequence \(\{ f(\xi ^{\ell }) \}\) is monotonically decreasing.

Since \(\Vert v^{\ell } \Vert =1\), the values \(\gamma ^{\ell }\), as well as \(Z^{1,\ell }\), \( Z^{2,\ell }\), are bounded (cf. (37), (29)). Thus, the sequence \(\{ \xi ^\ell \}\) has a convergent subsequence indexed by \(\mathscr {L}\subset \{1,2,\dots \}\). Let \(\xi ^*= (v^*,\gamma ^*,Z^1_*, Z^2_*)\) be an accumulation point of \(\{ \xi ^\ell \}\). Due to the continuity of f and the monotonicity of \(\{ f(\xi ^{\ell }) \}\), we have

$$\begin{aligned} \lim _{\ell \rightarrow \infty } f(\xi ^{\ell }) = f(\xi ^*). \end{aligned}$$

Furthermore, the point \(\xi ^*\) is feasible for (32)–(36) because all points \(\xi ^\ell \) are feasible and the constraint functions are continuous.

Each point \(\hat{\xi }^\ell \) satisfies the optimality conditions for problem (41). Let \(\zeta ^{1,\ell }\in \partial \varrho _1(\hat{Z}^{1,\ell })\) and \(\zeta ^{2,\ell }\in \partial \varrho _2(\hat{Z}^{2,\ell })\) be the optimal subgradients from the corresponding condition (48). Due to the positive homogeneity of the risk measures, it holds

$$\begin{aligned} \zeta ^{1,\ell }\in \partial \varrho _1({Z}^{1,\ell })\; \quad \text { and }\quad \; \zeta ^{2,\ell }\in \partial \varrho _2({Z}^{2,\ell }). \end{aligned}$$

Using the fact that the subdifferential mapping of a convex function is upper semi-continuous with compact values, we obtain that the sequence \(\{ (\zeta ^{1,\ell },\zeta ^{2,\ell })\}\), \({\ell \in \mathscr {L}}\), has a convergent subsequence indexed by \(\mathscr {L}_1\subset \mathscr {L}\) by virtue of the Berge theorem, i.e., \(\lim _{\ell \in \mathscr {L}_1} (\zeta ^{1,\ell },\zeta ^{2,\ell }) = (\zeta ^1_*, \zeta ^2_*)\). This implies that the optimal Lagrange multipliers \(\mu ^{1,\ell }, \mu ^{2,\ell }\) are bounded due to the inequalities (48) in the optimality conditions. Therefore, the optimal Lagrange multipliers \(\{\mu ^{1,\ell }, \mu ^{2,\ell }\}\) converge to \(\mu ^{1}_*, \mu ^{2}_*\) along a further subsequence \(\mathscr {L}_2\subset \mathscr {L}_1\). Finally, \(\mu ^{3,\ell }\) converges to some number \(\mu ^{3}_*\) for \(\ell \in \mathscr {L}_2\), which is obtained by passing to the limit in Eq. (46). We conclude that the point \(\xi ^*\) satisfies the optimality conditions for (32)–(36) with Lagrange multipliers \(\mu ^{1}_*, \mu ^{2}_*\), \(\mu ^{3}_*\) and subgradients \((\zeta ^1_*, \zeta ^2_*)\). \(\square \)

7 Related work

The design of robust estimators, and of robust classifiers in particular, has attracted the attention of statisticians as well as of data scientists. Misclassification may lead to different cost distributions for the different types of errors. An example illustrating this point is the damage caused by a hurricane. The cost of hurricane damage depends on features which are used for classification, and it is highly non-linear with respect to those features (see Davis and Uryasev 2016). Therefore, if we fail to predict correctly that a hurricane will strike a certain region, the cost is quite different from the cost induced by an incorrect hurricane alarm. Furthermore, different predictions of the hurricane’s location lead to different costs. A risk-averse model prediction would take this fact into account (see, e.g., Chambers and Quiggin 2000). Another example is the classification of the credit-worthiness of bank customers. If a customer, who might be a company requesting a substantial loan, is incorrectly classified as credit-worthy, the bank may experience a substantial loss, while not providing a loan to a credit-worthy customer results in a lost opportunity; the two losses are quite different (Oguz et al. 2008).

A differentiated attitude to errors in model fitting was proposed a long time ago in statistics, and this point of view is accepted and used in various approaches, most notably in robust statistics. We refer to Huber (2011), El Ghaoui et al. (2003), Gotoh and Uryasev (2017), Hastie et al. (2009) and the references therein for methods of robust classifier design for binary classification. Support vector machines are one of the most popular classification tools. They appear as part of sequential classification methods for multiple classes (Sculley et al. 2011). Recent developments include the use of SVMs as part of deep learning architectures (Kim et al. 2015; Qi et al. 2016). Other recent papers (Zareapoor et al. 2018) leverage the power of Deep Belief Networks as input to SVMs in ensemble algorithms. For information on kernel methods in classification and support vector learning, we refer to Schölkopf et al. (1999), Maji et al. (2008), Muandet et al. (2012). Additionally, there is a substantial research effort into incremental learning for SVM and Support Vector Regression (Liang and Li 2009; Gu et al. 2015a, b).

Many papers deal with robust binary classification. One possibility is provided by the tools of robust statistics, for example, employing the Huber risk function (Huber 2011). We refer to Zhang (2004) for a detailed discussion on robustness and the choice of loss functions. In Lanckriet et al. (2003) and El Ghaoui et al. (2003), the tools of robust optimization are employed. The idea there is that a future instance will come from a distribution which is close to the observed empirical distribution in some sense. Therefore, a set of acceptable distributions, called an uncertainty set, is constructed, and the worst misclassification error is minimized over all distributions in that set. In Lanckriet et al. (2003) and El Ghaoui et al. (2003), the uncertainty sets are defined by allowing all distributions on the sample space which have the same mean and the same covariance as the estimated ones. In Ma et al. (2011), the authors look at the median rather than the expected value, and the sum of the two median errors is minimized. In Katsumata and Takeda (2015), the authors allow for uncertainty sets of different diameter for each class.

Our proposed approach minimizes the classification error in a risk-averse manner. In Rockafellar et al. (2008), the use of coherent measures of risk for generalized regression and model fit was proposed. This point of view was also utilized for SVMs in Gotoh and Uryasev (2017). While those works recognize the need for expressing different attitudes to errors in fitting statistical models, the authors propose using one overall measure of risk as an objective in the regression problem, respectively in the SVM problem. A classifier design based on a single measure of risk does not allow for differentiation between the classes, while in our view different attitudes should be allowed toward the classification errors of the different classes.

The topic of risk sharing is a subject of intensive investigations in the community of economics, quantitative finance and risk management. This is due to the fact that the sum of the risk of each component in a system does not equal the risk of the entire system. The main focus in the extant literature on risk-sharing is on the choice of decomposition of a random variable X into k terms \(X=X^1+\dots + X^k\), so that when each component is measured by a specific risk measure, the associated total risk is in some sense optimal. The variable X represents the total random loss of the firm and the question addressed is about splitting the loss among the constituents. Assigning coherent measures of risk \(\varrho _i\) to each term \(X^i\), the adopted point of view is that the outcome \(\big (\varrho _1(X^1), \dots , \varrho _k(X^k)\big )\) should be Pareto-optimal among the feasible allocations.

The main results in risk-sharing theory accomplish the decomposition of X into terms by looking at the infimal convolution of the measures of risk. It is observed (see, e.g., Landsberger and Meilijson 1994; Ludkovski and Rüschendorf 2008) that the random variables \(X^i\), \(i=1,\dots , k\), which solve this problem, satisfy a co-monotonicity property as follows

$$\begin{aligned} \big (X^i(\omega ) -X^i(\omega ')\big )\big (X^j(\omega ) -X^j(\omega ')\big ) \ge 0,\quad \text {for all } \omega ,\omega '\in \varOmega ,\quad i,j=1,\dots ,k. \end{aligned}$$

While we adopt a similar point of view on the optimality of the risk allocation, it is clear that the problem setting, and consequently the results associated with risk sharing in financial institutions, are inapplicable to the classification problem. We cannot expect co-monotonicity properties of the class errors because not every decomposition of the total random error can be obtained via some classifier. The presence of constraints in the optimization problem, the functional dependence of the misclassification error on the classifier’s parameters, and the complex nature of the design problem require a dedicated analysis.

8 Numerical experiments

In the previous sections, we have laid out the theoretical foundation supporting our approach. In this section, we demonstrate the performance of the proposed framework, as well as its flexibility. To this end, we use several publicly available data sets and compare the performance of our approach to some existing formulations in terms of performance metrics frequently used in classification. Further, we showcase the flexibility of the framework by exploring the Pareto-efficient frontier of various classifiers derived from it. In our numerical experiments, we have used the Conditional Value-at-Risk and the mean semi-deviation of order one.

8.1 Data

We compare our approach to other known approaches on several datasets. More specifically, we use three data sets obtained from the UCI Machine Learning Repository (Lichman 2013). These data sets exhibit different degrees of class imbalance, that is, different proportions of records in one class versus the other. A summary of basic characteristics of the data sets is shown in Table 1.

In the sections to follow, we make reference to the default and target class for each dataset. This nomenclature is consistent with the group represented by the class itself. In other words, the default class (Class0) in the WDBC dataset contains the observations where the diagnosis was “benign”, while the target class (Class1) represents observations with a “malignant” diagnosis. Similarly, for the other two datasets, the default class represents the healthy or normal state, while the target class represents the class of interest in the context of the data. For the “pima-indians-diabetes” dataset, the target class is the set of diabetics among the sampled individuals, while for the “seismic-bumps” dataset the target class is the set of shifts during which high-energy seismic bumps occurred.
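As a small illustration of the preprocessing (not taken from the paper), the sketch below builds the target and default samples for the WDBC data; the file name and the column layout (ID, diagnosis, thirty features) are assumptions about a local copy of the UCI file.

```python
import pandas as pd

# Hypothetical loading step: the path and column layout are placeholders for a local
# copy of the UCI "wdbc.data" file (column 0 = ID, column 1 = diagnosis, rest = features).
wdbc = pd.read_csv("wdbc.data", header=None)
y = (wdbc[1] == "M").astype(int).to_numpy()    # target class Class1 = "malignant"
X = wdbc.drop(columns=[0, 1]).to_numpy()

X1, X2 = X[y == 1], X[y == 0]                  # target (Class1) and default (Class0) samples
print(len(X1), len(X2))                        # degree of class imbalance
```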

Table 1 Data summary

8.2 Model formulations

We consider several scenarios for choices of measures of risk. In the first scenario, we treat the default class (Class0) in a risk neutral manner, while applying the mean-semi-deviation measure to the classification error of the target class. We call this loss function “asym_risk” (see Table 2). In the same table, we provide the risk measure combinations for other loss functions which we have used in our numerical experiments. The loss functions called “risk_cvar” and “two_cvar” use a convex combination of the expected error and the Conditional Value-at-Risk of the classification error. These convex combinations use an additional model parameter \(\beta \in (0,1)\). The formulation (38) for these loss functions uses the variational form of the Conditional Value-at-Risk at level \(\alpha \in (0,1)\). Table 2 displays the chosen combinations of risk measure pairs for the binary classification scenario in order to give an easy overview.

Table 2 Risk measure combinations used as loss functions in the experiments

We note that calculation of the first order semi-deviation and the conditional value-at-risk can be formulated as linear optimization problems. Therefore, their application does not increase the complexity of RSSVM in comparison to the soft-margin SVM. However, if we use higher order semi-deviations or higher order inverse risk measures, the problem becomes more difficult.

Note that the conditional value-at-risk at level \(\alpha \) puts weight on the largest quantiles of the error distribution (the upper \(\alpha \)-portion of them). This means the classifier focuses mainly on eliminating the worst classification errors. Depending on our risk aversion, we can include a smaller or larger portion of the quantiles by controlling the level \(\alpha \). Additionally, by taking convex combinations of the expected value and the conditional value-at-risk (using \(\beta >0\)), we include all quantiles with some additional weight on the largest ones. The higher the value of \(\beta \), the less weight is put on the largest misclassification errors; \(\beta =1\) corresponds to the risk-neutral SVM. The mean-semideviation measure adds weight to all classification errors that are above average. As already mentioned, a higher value of the constant c entails a larger penalty for deviations above the mean. Furthermore, using higher-order semi-deviations results in a non-linear (power-function) penalty for those deviations.
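For illustration only, the snippet below evaluates these risk measures on a hypothetical sample of nonnegative error realizations: the empirical CVaR at level \(\alpha \) via its variational formula, the \(\beta \)-mixture of the expectation and CVaR, and the first-order mean semi-deviation with constant c; the data and parameter values are arbitrary.

```python
import numpy as np

z = np.abs(np.random.default_rng(0).normal(size=200))   # hypothetical error realizations

def cvar(z, alpha):
    """Empirical CVaR_alpha: variational formula evaluated at the empirical (1-alpha)-quantile."""
    t = np.quantile(z, 1 - alpha)
    return t + np.mean(np.maximum(z - t, 0.0)) / alpha

def mean_semidev(z, c=1.0):
    """First-order mean semi-deviation: E[z] + c * E[(z - E[z])_+]."""
    return z.mean() + c * np.mean(np.maximum(z - z.mean(), 0.0))

alpha, beta = 0.25, 0.5
mixture = beta * z.mean() + (1 - beta) * cvar(z, alpha)  # beta = 1 recovers the risk-neutral case
print(cvar(z, alpha), mixture, mean_semidev(z))
```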

We compare our results against three different benchmarks: two risk-neutral formulations and one risk-averse formulation with a single risk measure. The first risk-neutral formulation is the soft-margin SVM as formulated in (4). The second risk-neutral formulation uses the Huber loss function and leads to the following problem formulation

$$\begin{aligned} \min _{v,\gamma ,Z^1,Z^2}\;\left\{ \frac{1}{m_1} \sum _{i=1}^{m_1} \min \big (z_i^1,(z_i^1)^2\big ) + \frac{1}{m_2} \sum _{j=1}^{m_2} \min \big (z_j^2,(z_j^2)^2\big ) {:} \; (33), (34),(36)\right\} .\nonumber \\ \end{aligned}$$
(49)

The third benchmark uses a single risk measure applied to the total error, as proposed in Gotoh and Uryasev (2017). It has the following formulation.

$$\begin{aligned} \begin{aligned} \min _{v,\gamma ,t,Z^1,Z^2, Y^1,Y^2}&\, \beta \left( \frac{1}{m_1}\sum _{j=1}^{m_1} z^1_j + \frac{1}{m_2}\sum _{j=1}^{m_2} z^2_j \right) \\&\,\qquad +(1-\beta )\left( t + \frac{1}{\alpha (m_1+m_2)}\left( \sum _{j=1}^{m_1} y^1_j + \sum _{j=1}^{m_2} y^2_j \right) \right) +\delta \Vert v\Vert ^2\\ \text {s. t. }&\, y_j^i\ge z_j^i-t,\quad j=1,\dots ,m_i,\; i=1,2,\\&\, (33), (34), (36),\; Y^1\ge 0,\quad \, Y^2\ge 0. \end{aligned}\nonumber \\ \end{aligned}$$
(50)

Interestingly, both risk-neutral formulations produce nearly identical results on all data sets; consequently, we report only one of them under the name “exp_val”. In the figures and tables presented below, we refer to the loss function consisting of a single Conditional Value-at-Risk measure as “joint_cvar”.

The problem formulations used in our experiments are the following.

  • Expected value vs. mean-semi-deviation—“asym_risk”

    $$\begin{aligned} \begin{aligned} \min _{v,\gamma ,Z^1,Z^2,Y}\quad&\, \frac{\lambda }{m_1}\sum _{j=1}^{m_1} z^1_j+ \frac{1-\lambda }{m_2}\sum _{j=1}^{m_2} (y_j+z^2_j) \\ \text {s. t. }\quad&\, y_j \ge z_j^2-\frac{1}{m_2}\sum _{k=1}^{m_2} z^2_k,\quad j=1,\dots ,m_2,\\&\, (33), (34),(35), (36),\; Y\ge 0. \end{aligned} \end{aligned}$$
    (51)
  • Mean-semi-deviation vs. Conditional Value-at-Risk—“one_cvar”

    $$\begin{aligned} \begin{aligned} \min _{v,\gamma ,t,Z^1,Z^2,Y^1,Y^2}\quad&\, \frac{\lambda }{m_1}\sum _{j=1}^{m_1} (y^1_j+z^1_j) + (1-\lambda )\Bigg (t + \frac{1}{\alpha m_2}\sum _{j=1}^{m_2} y^2_j\Bigg )\\ \text {s. t. }\quad&\, y_j^1 \ge z_j^1-\frac{1}{m_1}\sum _{j=1}^{m_1} z^1_j,\quad j=1,\dots ,m_1,\\&\, y_j^2 \ge z_j^2-t,\quad j=1,\dots ,m_2,\\&\, (33), (34), (35), (36),\; Y^1\ge 0,\quad \, Y^2\ge 0. \end{aligned} \end{aligned}$$
    (52)
  • Mean-semi-deviation vs. combination of the expectation and CVaR—“risk_cvar”

    $$\begin{aligned} \begin{aligned} \min _{v,\gamma ,t,Z^1,Z^2,Y^1,Y^2}\quad&\, \frac{\lambda }{m_1}\sum _{j=1}^{m_1} (y^1_j+z^1_j) + \frac{\beta (1-\lambda )}{m_2}\sum _{j=1}^{m_2} z_j^2\\&\, \qquad + (1-\beta )(1-\lambda )\left( t + \frac{1}{\alpha m_2}\sum _{j=1}^{m_2} y^2_j\right) \\ \text {s. t. }\quad&\, y_j^1 \ge z_j^1-\frac{1}{m_1}\sum _{j=1}^{m_1} z^1_j,\quad j=1,\dots ,m_1,\\&\, y_j^2 \ge z_j^2-t,\quad j=1,\dots ,m_2,\\&\, (33), (34), (35), (36),\; Y^1\ge 0,\quad \, Y^2\ge 0. \end{aligned} \end{aligned}$$
    (53)
  • Mean-semi-deviation for both classes—“two_risk”

    $$\begin{aligned} \begin{aligned} \min _{v,\gamma ,Z^1,Z^2,Y^1,Y^2}\quad&\, \frac{\lambda }{m_1}\sum _{j=1}^{m_1} (y^1_j+z^1_j) + \frac{1-\lambda }{m_2}\sum _{j=1}^{m_2} (y^2_j+z^2_j) \\ \text {s. t. }\quad&\, y_j^i \ge z_j^i-\frac{1}{m_i}\sum _{j=1}^{m_i} z_j^i,\quad j=1,\dots ,m_i,\; i=1,2,\\&\, (33), (34), (35), (36),\; Y^1\ge 0,\, Y^2\ge 0. \end{aligned} \end{aligned}$$
    (54)
  • Conditional-Value at Risk for both classes—“two_cvar”

    $$\begin{aligned} \begin{aligned} \min _{v,\gamma ,t_1,t_2,Z^1,Z^2,Y^1,Y^2}\quad&\, \frac{\lambda \beta _1}{m_1} \sum _{j=1}^{m_1} z^1_j + \lambda (1-\beta _1)\left( t_1 + \frac{1}{\alpha m_1}\sum _{j=1}^{m_1} y^1_j\right) \\&\, \quad + \frac{(1-\lambda )\beta _2}{m_2} \sum _{j=1}^{m_2} z^2_j + (1-\lambda )(1-\beta _2)\left( t_2 + \frac{1}{\alpha m_2}\sum _{j=1}^{m_2} y^2_j\right) \\ \text {s. t. }\quad&\, y_j^i \ge z_j^i-t_i,\quad j=1,\dots ,m_i,\; i=1,2,\\&\, (33), (34), (35), (36),\; Y^1\ge 0,\, Y^2\ge 0. \end{aligned} \end{aligned}$$
    (55)
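As an illustration of how the formulations above translate into code, the following CVXPY sketch writes the convex subproblem obtained from the “one_cvar” formulation (52) by linearizing the normalization constraint (35) around a given point \(\bar{v}\), as in Step 1 of the method of Sect. 6; the function name and default parameter values are placeholders.

```python
import cvxpy as cp

def one_cvar_subproblem(X1, X2, v_bar, lam=0.5, alpha=0.25):
    """Formulation (52) with the normalization (35) linearized as <v_bar, v> = 1."""
    m1, n = X1.shape
    m2, _ = X2.shape
    v, gamma, t = cp.Variable(n), cp.Variable(), cp.Variable()
    Z1, Y1 = cp.Variable(m1, nonneg=True), cp.Variable(m1, nonneg=True)
    Z2, Y2 = cp.Variable(m2, nonneg=True), cp.Variable(m2, nonneg=True)

    objective = (lam / m1) * cp.sum(Y1 + Z1) \
        + (1 - lam) * (t + cp.sum(Y2) / (alpha * m2))
    constraints = [Y1 >= Z1 - cp.sum(Z1) / m1,    # mean semi-deviation excesses, class 1
                   Y2 >= Z2 - t,                  # CVaR excesses, class 2
                   X1 @ v - gamma + Z1 >= 0,      # (33)
                   X2 @ v - gamma - Z2 <= 0,      # (34)
                   v_bar @ v == 1]                # linearized (35)
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return v.value, gamma.value
```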
Table 3 Main results table for the WDBC dataset: displaying the model parameters and the performance metrics for each model formulation

8.3 Performance

We perform k-fold cross-validation and all reported results are out of sample. Furthermore, our method determines only normalized optimal classifiers, and all misclassification numbers and reported risk are computed with respect to such classifiers. In Tables 3, 5, and 7, we report the \(\hbox {F}_1\)-score and AUC, along with recall (True Positive Rate), precision, as well as the false positive rate (FPR) for all loss functions. Additionally, we report the number of misclassified observations, as well as the chosen parameters, where applicable. In light of the fact that the \(\hbox {F}_1\)-score and AUC are competing metrics, for each dataset we present two sets of results, one optimized for each metric. We use this to highlight the additional flexibility offered by the proposed method, as discussed in the next section.

We recall that precision is the ratio of true positives to the total number of positively classified points. The \(\hbox {F}_1\)-score, sometimes referred to as the F-measure, is the harmonic mean of precision and recall:

$$\begin{aligned} \hbox {F}_1\hbox {-score} = \frac{2 \cdot \text {precision} \cdot \text {recall}}{\text {precision} + \text {recall}} \end{aligned}$$

AUC stands for the Area Under the [Receiver Operating Characteristic] Curve, and is defined as the area under the curve traced by the True Positive Rate (TPR) and the False Positive Rate (FPR) as functions of the decision threshold T:

$$\begin{aligned} \text {AUC} = \int _0^1 \text {TPR}\, d(\text {FPR}) = -\int ^{\infty }_{-\infty } \text {TPR}(T)\,\text {FPR}'(T)\,dT. \end{aligned}$$
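For completeness, a minimal sketch of the out-of-sample evaluation using scikit-learn; the functions fit and predict_scores are placeholders standing for training one of the formulations above and computing the decision values \(\langle v, x\rangle -\gamma \), and the fold count is illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

def evaluate(X, y, fit, predict_scores, n_splits=5):
    """k-fold cross-validated F1-score and AUC; fit and predict_scores are placeholders."""
    f1s, aucs = [], []
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train, test in folds.split(X, y):
        model = fit(X[train], y[train])
        scores = predict_scores(model, X[test])   # decision values <v, x> - gamma
        f1s.append(f1_score(y[test], scores > 0)) # threshold at zero
        aucs.append(roc_auc_score(y[test], scores))
    return np.mean(f1s), np.mean(aucs)
```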

In Table 3, we show the best value for each metric for each set in bold face. We observe that, for this particular dataset, the best performing model formulation with respect to the \(F_1\)-score is the “risk_cvar” model, outperforming the risk-neutral formulations by more than 0.04. On the other hand, if we consider the AUC to be the target metric, we notice the “two_cvar” formulation has the highest value. Further, we note that the “one_cvar” model has the same parameters for both target metrics. We find this to be unusual in our experiments. While this formulation does not have the best value for the target metric, it too significantly outperforms the risk-neutral formulations. Further, this formulation does have the best value for the competing metric in both cases. The respective ROC curves for each of the classifiers are displayed in Fig. 2. The color on each curve represents the value of the \(F_1\)-score. High values are represented by a bright green color, and low values are represented by a dark red color. The two dotted lines indicate the threshold at which the classifier is set to operate.

Fig. 2
figure 2

ROC plots for the best performing model formulations on the WDBC data: “risk_cvar” with the best \(F_1\)-score, “two_cvar” with the best AUC value, and “one_cvar” for the alternate metric

We can certainly see that the classifier performs very well on this data. Table 4 contains the calculations of risk with respect to each model formulation. More specifically, for each obtained classifier we calculate the value of the risk functionals on the out-of-sample data points during cross-validation. We consider the raw expectation, the mean semi-deviation, as well as the conditional value-at-risk for the \(\alpha \) quantiles 0.75, 0.85, and 0.95.

Table 4 Risk evaluation for the WDBC data set: displaying the expectation of error, mean semi-deviation, and average value at risk for the \(\alpha \) quantiles 0.75, 0.85, and 0.95

Indeed, we can observe that our models reduce the risk for each class with respect to each risk calculation, compared to the benchmarks. More specifically, the “one_cvar” model, which does not attain the best performance in terms of \(F_1\)-score, does in fact attain the lowest total risk value. Its value is approximately one half of that of the risk-neutral formulation and of the other benchmark. The “risk_cvar” model performs nearly identically, albeit at slightly larger values across the board. Further, we note that the “two_cvar” model, which performs best with respect to the AUC metric, is the worst performing, benchmarks excluded. Looking closely at the corresponding ROC curve in Fig. 2, one can argue that the performance with respect to the AUC metric comes at the expense of robustness and generalization.

Looking at the results on the “pima-indians-diabetes” data set in Table 5, we observe that the best performing model with respect to the \(F_1\)-score is again the “risk_cvar” model, with 0.68581 compared to 0.66785 for the risk-neutral formulations. Similarly, the “one_cvar” model is again second in this context, while at the same time having the largest AUC value in the group. Surprisingly, the benchmark formulation “joint_cvar” has the lowest score here. Switching attention to the AUC section of the table, we notice that “one_cvar” is the best performing model in that regard as well, with “risk_cvar” being second best. However, the gain in AUC value with the changed parameters is minimal and comes with a considerable reduction in the alternate target metric: “one_cvar” shifts from an \(F_1\)-score of 0.68581 to 0.65377 in exchange for a 0.0027 gain in AUC, and “risk_cvar” shifts from 0.68781 to 0.65504 for a gain of 0.003.

Table 5 Main results table for the “pima-indians-diabetes” dataset: displaying the model parameters and the performance metrics for each model formulation

Figure 4 shows the empirical distribution of error realizations from applying the classifier to out-of-sample records on the left, and the overlaid ROC curves for the various classifiers on the right. Negative values indicate correctly classified observations, while positive values indicate misclassification. We compare the selected loss functions to each other and to the benchmarks. Looking closely at the ROC curves in Fig. 3, we can see that the AUC-prioritized “one_cvar” does not actually classify at its maximum potential in terms of \(F_1\)-score, indicated by the fact that the threshold is not at the lightest green segment of the curve. This requires additional investigation and exploration. The shapes of the ROC curves for the various classifiers are relatively similar; see the right panel of Fig. 4. However, looking at the error distribution plot on the left of the figure, we notice that the two benchmarks, “exp_val” and “joint_cvar”, misclassify less of the default class and more of the target class. It is important to show the similarity of the ROC curves in order to highlight how the choice of risk measure may impact the convergence to a specific threshold. More specifically, we see the “two_cvar” formulation underperforming, relative to the target metric (\(F_1\)-score) and to the best performing formulation “risk_cvar”, by misclassifying too much of the default class (Table 6).

Fig. 3
figure 3

ROC plots for the best performing model formulations on the “pima-indians-diabetes” data: “risk_cvar” with the best \(F_1\)-score, “one_cvar” featuring both parameter sets, and finally the “asym_risk” formulation featuring the best AUC value

Fig. 4
figure 4

Empirical distribution of error realizations comparing risk-averse loss function formulations to benchmarks [\(F_1\)-score] on the “pima-indians-diabetes” dataset (left) and the corresponding ROC curves (right)

Table 6 Risk evaluation for the “pima-indians-diabetes” data set: displaying the expectation of error, mean semi-deviation, and average value at risk for the \(\alpha \) quantiles 0.75, 0.85, and 0.95
Table 7 Main results table for the “seismic-bumps” dataset: displaying the model parameters and the performance metrics for each model formulation

Table 6 contains the risk functional evaluation for the “pima-indians-diabetes” data. It is interesting that the “two_risk” model has the lowest total risk with respect to every risk functional, despite the fact that it is not the best performing model in terms of \(F_1\)-score or AUC. This leads us to believe that there may be room for additional exploration with regard to performance metrics and evaluation.

We continue with the performance evaluation on the third and final dataset, whose main performance metrics are shown in Table 7. One can immediately observe that no model performs particularly well on this dataset. We have chosen this data set because it is particularly imbalanced and contains categorical variables.

Again, we see the “risk_cvar” formulation as having the best \(F_1\)-score, followed very closely by the “joint_cvar” formulation. In terms of AUC, it is the “two_cvar” formulation that leads the group, but again at a significant cost in \(F_1\)-score. Looking at Fig. 5, we can see room for improvement by changing the threshold on the AUC-prioritized “two_cvar” model. We observe that, in terms of stability in that respect, the “asym_risk” formulation along with the “joint_cvar” benchmark exhibit less variation. Turning attention to the risk functional evaluation in Table 8, we observe that the “exp_val” benchmark model has the lowest total risk on the “seismic-bumps” data. However, since this dataset is very imbalanced, we can see how significantly the risk functional evaluation differs between the two classes for each model formulation.

Notice, in Fig. 6, how the “exp_val” benchmark stands alone compared to the well-grouped risk-aware models, which include the benchmark formulation “joint_cvar”. As on the previous dataset, the ROC curves are closely grouped.

In summary, the \(F_1\)-score prioritized model consistently provides a small but significant improvement over the baseline models.

Fig. 5
figure 5

ROC plots for the best performing model formulations on the “seismic-bumps” data: “risk_cvar” with the best \(F_1\)-score, “one_cvar”, “joint_cvar”, “two_cvar” formulation featuring the best AUC value

Table 8 Risk Evaluation for the “seismic-bumps” data set: displaying the expectation of error, mean semi-deviation, and conditional value at risk for the \(\alpha \) quantiles 0.75, 0.85, and 0.95

8.4 Flexibility

Our approach provides additional flexibility which is generally not available in classification methods such as the soft-margin SVM. It allows the user to implement a predetermined attitude toward the risk of misclassification and to explore the Pareto-efficient frontier of classifiers.

We traverse the Pareto frontier by varying \(\lambda \) from 0.4 to 0.7 and observe that the solution is rather sensitive to the scalarization used in the loss function. In Fig. 7, we show the resulting error densities from such a traversal. We can observe how varying the weight between the two risk measures allows us to obtain a family of risk-averse Pareto-optimal classifiers. The efficient frontier can be used to choose a risk-averse classifier according to an additional criterion, such as the \(\hbox {F}_1\)-score, AUC, or another performance metric, by selecting a specific parameter \(\lambda \), as discussed in the previous section.
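A minimal sketch of such a traversal: solving the risk-sharing problem over a grid of \(\lambda \) values and recording the risk of each class yields points approximating the Pareto frontier; solve_rssvm, rho1, and rho2 are placeholders for any of the formulations and risk measures above.

```python
import numpy as np

def pareto_frontier(X1, X2, solve_rssvm, rho1, rho2, lambdas=np.linspace(0.4, 0.7, 7)):
    """Traverse the scalarization parameter and collect the attained risk allocations."""
    frontier = []
    for lam in lambdas:
        v, gamma, Z1, Z2 = solve_rssvm(X1, X2, lam)   # placeholder solver for (32)-(36) or (38)
        frontier.append((lam, rho1(Z1), rho2(Z2)))    # risk allocation of the two classes
    return frontier
```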

The Pareto frontier looks substantially different when different combinations of risk measures are used. Further research would reveal the effect of higher-order risk measures and their ability to produce a classifier with high discriminative power.

We have chosen the probability level for the Conditional Value-at-Risk in a similar way. We observe that the loss function “one_cvar” consistently provides the best performance. A close second is the loss function “risk_cvar”, which has a similar structure. Interestingly, using the same risk measure on both classes does not perform as well.

Fig. 6
figure 6

Empirical distribution of error realizations comparing risk-averse loss function formulations to benchmarks [\(F_1\)-score] on the “seismic-bumps” dataset (left) and the corresponding ROC curves (right)

Fig. 7
figure 7

The distribution of error displayed as a smoothed histogram for each of the five proposed formulations for the risk-averse SVM problem, i.e., “asym_risk”, “one_cvar”, “risk_cvar”, “two_risk”, and “two_cvar”, all using the same set of \(\lambda \) values, with other parameters fixed, on the “seismic-bumps” dataset