1 Introduction

Background. During the last two decades, support vector machines (SVMs) have become a popular methodology for binary classification, and a number of modified formulations have been derived. To find a decision function that can predict the class labels of unseen data, every SVM in general solves a bi-objective minimization defined with a set of given data samples, \((\varvec{x}_1,y_1),\ldots ,(\varvec{x}_m,y_m)\), usually referred to as the regularized empirical risk minimization (ERM) or the structural risk minimization (see e.g., Christopher 1998). Typically, the decision function is represented by a discriminant hyperplane, \(\{\varvec{x}\in {\mathbb {R}}^n:\varvec{w}^\top \varvec{x}=b\}\), which separates \({\mathbb {R}}^n\) into two half-spaces, each corresponding to a class label. To obtain it, the regularized ERM minimizes the sum of an Empirical Risk and a Regularizer:

$$\begin{aligned} \underbrace{\mathcal{F}(\varvec{L}(\varvec{w},b))}_{{\mathrm{[Empirical Risk]}}} \,+\, \underbrace{\gamma (\varvec{w}),}_{{\mathrm{[Regularizer]}}} \end{aligned}$$

over \((\varvec{w},b)\). Here, \(\varvec{L}(\varvec{w},b)\) is a vector in \({\mathbb {R}}^m\), representing a degree of misclassification over the given data samples with respect to the hyperplane, \(\mathcal{F}\) is a function of \(\varvec{L}\), gauging the aversion to the vector and referred to as risk function in this paper, and \(\gamma \) is a function regularizing \(\varvec{w}\). Intuitively, minimization of Empirical Risk, the first term of the objective, seeks a hyperplane which would have smaller in-sample misclassification, while the Regularizer, the second term, prevents the hyperplane from overfitting to the samples.

This simple principle allows a large degree of freedom in the choice of Empirical Risk and Regularizer. Despite this generality, only a few choices are popular in the literature. For example, the C-SVM (Cortes and Vapnik 1995), the most prevailing formulation, employs as Empirical Risk the hinge loss, \(\mathcal{F}(\varvec{L})=\frac{C}{m}\sum _{i=1}^m\max \{L_i+1,0\}\), or its squared variant, \(\mathcal{F}(\varvec{L})=\frac{C}{m}\sum _{i=1}^m(\max \{L_i+1,0\})^2\), where \(C>0\) is a user-defined constant.

As for the Regularizer, the use of the square of the \(\ell _2\)-norm (or the Euclidean norm) of the normal vector, e.g., \(\gamma (\varvec{w})=\frac{1}{2}\Vert \varvec{w}\Vert _2^2\), is dominant. Although the use of the \(\ell _2\)-norm naturally leads to the so-called kernel trick (see e.g., Christopher 1998), other norms are viable alternatives. For example, the \(\ell _1\)-norm is popular since its use leads to a sparse solution. Besides, the use of any norm can be justified along the lines of geometric margin maximization by supposing that its dual norm is employed in gauging the distance of a data sample to the discriminant hyperplane (see e.g., Mangasarian 1999; Pedroso and Murata 2001).
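For instance, since the \(\ell _1\)- and the \(\ell _\infty \)-norms are dual to each other (and the \(\ell _2\)-norm is self-dual), measuring the sample-to-hyperplane distance in the \(\ell _\infty \)-norm corresponds, in this margin-maximization view, to regularizing with \(\Vert \varvec{w}\Vert _1\), and vice versa; more generally, the \(\ell _p\)-norm pairs with the \(\ell _q\)-norm satisfying \(1/p+1/q=1\).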

Motivation and proposed scheme. The primary purpose of this paper is to seek reasonable forms of the regularized ERM on the basis of a generalized formulation. One motivation comes from the pursuit of a tractable SVM formulation in which parametrized families of polyhedral norms, recently studied by Gotoh and Uryasev (2016), are employed as regularizers. A merit of using such parametrized families of norms lies in the tuning of regularizers. The \(\ell _p\)-norm family has been used for this purpose (e.g., Kloft et al. 2011). Considering the current status of algorithmic studies and solver software, however, the \(\ell _p\)-norm with \(p\ne 1,2,\infty \) has no advantage over the \(\ell _1\)-, the \(\ell _2\)-, or the \(\ell _\infty \)-norm in that its nonlinearity with respect to the parameter p may prevent an efficient parametric optimization. In contrast, the new families can be associated with linear programming (LP) and admit efficient parametric optimization (with respect to alternative parameters), and are thus more advantageous for the tuning of regularizers.

However, the introduction of such new norms requires a prudent approach. Indeed, the form of the regularizer and/or the choice of the empirical risk function can affect the validity of the optimization problem and/or the meaning of the resulting classifier. To address this issue, we reexamine the basic formulation of SVMs. Additionally, to retain the tractability of the popular SVMs, we limit our attention to the case where both empirical risk functions and regularizers are convex.

Method. The development of the formulations is based on convex analysis, especially the Fenchel duality (e.g., Rockafellar 1970). Although convex analysis is not typically used in the machine learning literature, it is becoming popular (e.g., Rifkin and Lippert 2007; Kloft et al. 2011). As Rifkin and Lippert (2007) claim, the use of Fenchel duality is advantageous in developing duality theorems and establishing optimality conditions because we can derive most of the results in a standard setting just by applying established patterns of function operations. Thus, by treating both norms and empirical risk functions in a more general manner, we enjoy these advantages as well.

A novelty is the employment of the convex risk function theory established in mathematical finance (e.g., Föllmer and Schied 2002) and stochastic optimization (e.g., Ruszczyński and Shapiro 2005, 2006; Rockafellar and Uryasev 2013). The linkage to the risk function theory yields new perspectives on the SVM formulations. For example, the so-called \(\nu \)-property of the \(\nu \)-SVM (Schölkopf et al. 2000) can be analyzed via the connection to the conditional value-at-risk (CVaR), a popular risk measure in the financial context (Rockafellar and Uryasev 2000, 2002), since the \(\nu \)-SVM virtually employs a CVaR (Gotoh and Takeda 2005; Takeda and Sugiyama 2008).

More notable benefits of our approach relate to three properties of risk functions: Monotonicity, Translation Invariance, and Positive Homogeneity, all of which are often referred to in the context of financial risk management. We draw several insights on the regularized ERM, especially in relation to robust optimization modeling and to geometric and probabilistic interpretations.

Perspectives on robust optimization modeling. SVMs are often viewed as a data-driven optimization based on i.i.d. data samples, and are likely to be vulnerable to some perturbation of the data samples or deviation from the i.i.d. assumption. To cope with such a situation, the idea of optimizing the worst case, known as robust optimization, is a popular choice. In this paper we provide two insights on the relation to the robust modeling. First, considering a worst-case perturbation of the given data samples, we derive an interpretation of the regularized ERM as a robust optimization of the (unregularized) ERM, which is parallel to what Xu et al. (2009a) show by focusing on the use of the hinge loss. The perspective that the regularizer would make the ERM robust is enhanced in a broad way with the help of the risk function theory. Second, we demonstrate that with some risk functions \(\mathcal {F}\), our framework can straightforwardly treat another type of robust optimization modeling, called distributionally robust optimization. This type of robust optimization assumes uncertainty in the probability measures whereas the first robust modeling approach assumes uncertainty in the support (or observed values of samples). We show that with a class of risk functions the distributionally robustified formulations can be established within convex optimization.

Novelty in the context of machine learning. The term “convex risk” itself is not new in the context of machine learning. Indeed, the empirical risk of the separable form \(\mathcal{{F}}(\varvec{L})=\frac{1}{m}\sum _{i=1}^mv(L_i)\) has been discussed where v is a convex function on \({\mathbb {R}}\) (e.g., Christmann and Steinwart 2004; Zhang 2004; Bartlett et al. 2006; Rifkin and Lippert 2007; Kloft et al. 2011). This formulation, however, does not include some important convex risk functions that appear in machine learning methods. Indeed, the \(\nu \)-SVM corresponds to an inseparable risk function \(\mathcal{{F}}(\varvec{L})=\min _{\rho }\{-\rho \nu +\frac{1}{m}\sum _{i=1}^{m}\max \{L_i+\rho ,0\}\}\), where \(\nu \in (0,1]\). If we expand the coverage beyond the separable functions, we can treat for instance the log-sum-exp (or entropic) function, i.e., \(\mathcal{{F}}(\varvec{L})=\ln \sum _{i=1}^{m}\exp (L_i)\), without removing the ‘\(\ln \)’-operator. Needless to say, the minimization of \(\ln \sum _{i=1}^{m}\exp (L_i)\) is equivalent to the minimization of \(\sum _{i=1}^{m}\exp (L_i)\). However, even a difference based on such a monotonic transformation may result in a different consequence because of the consideration of regularizer, which we will discuss in Sect. 3.1.

In addition, a class of inseparable risk functions includes, as special cases, several existing formulations recently studied in Kanamori et al. (2013). Their paper shows the convergence of the obtained classifier to the classifier attaining the smallest expected misprediction. This indicates that the generalized formulation developed in the current paper is, at least partly, justified in a statistical way.

Moreover, with the help of the risk function theory, we show that dual formulations of such inseparable risk functions can be connected to a geometric or probabilistic interpretation. The geometric interpretation extends the existing papers (e.g., Crisp and Burges 2000; Bennett and Bredensteiner 2000; Takeda et al. 2013; Kanamori et al. 2013), while the probabilistic interpretation relates to the minimization of the \(\varphi \)-divergence (or, originally, f-divergence) (Csiszár 1967), as will be shown in Sect. 5. Interestingly, the probabilistic interpretation is further connected to the distributionally robust extension.

Further merits in practice. The presentation of general formulations, defined with only elementary operations, fits a recent trend of optimization software packages. Indeed, various convex functions are available as built-in functions in some software packages [e.g., PSG (American Optimal Decisions, Inc. 2009) and CVX (Grant and Boyd 2012)]. With these software packages, users can easily customize SVMs. Additionally, such a presentation may potentially fit recently developed algorithms for, e.g., nonsmooth and/or stochastic optimization. For example, the recent development of \(\ell _1\)-minimization algorithms, such as the Fast Iterative Shrinkage Thresholding Algorithm (Beck and Teboulle 2009), suggests directly handling the subgradient of the \(\ell _1\)-norm.

Difference from relevant studies. Let us mention several related papers. Following Gotoh and Takeda (2005) and Takeda and Sugiyama (2008), which derive the \(\nu \)-SVM (Schölkopf et al. 2000) and the E\(\nu \)-SVM (Perez-Cruz et al. 2003) as CVaR minimizations, Gotoh et al. (2014) extend CVaR to the coherent risk measures (Artzner et al. 1999), a subclass of convex risk functions, while preserving the nonconvexity present in the formulations of the preceding studies. Also, Tsyurmasto et al. (2013) explore positively homogeneous risk functions, which are not necessarily convex. In contrast to the above papers, our paper sets aside nonconvexity, while generalizing the class of risk functions.

Xu et al. (2009b) mention the axioms of risk functions in relation to their robust optimization formulation. Takeda et al. (2013) propose a unified view using a presentation with uncertainty sets, although that paper does not relate to risk functions. A recent paper of Kanamori et al. (2013) studies a duality correspondence between empirical risk functions and uncertainty sets, sharing part of its formulation with ours. An advantage of the current paper over the above ones is its larger coverage and more systematic presentation on the basis of the theory of convex risk functions.

As for the use of general norms, numerous papers deal with non-\(\ell _2\)-norms. Among them, Zhou et al. (2002) present a couple of formulations, which are shared with ours, and show some generalization bounds for them. However, to the best of our knowledge, the existing papers focus on the \(\ell _p\)-norms, and only the \(\ell _1\)- and the \(\ell _\infty \)-norms are employed for LP formulations of SVMs. In contrast, we employ other LP-representable norms such as the CVaR norm and the deltoidal norm (Gotoh and Uryasev 2016), both of which include the \(\ell _1\)- and the \(\ell _\infty \)-norms as special limiting cases.

Structure of the paper. The structure of this paper is as follows. Section 2 poses a general formulation of SVMs and explains how it includes existing ones. In particular, Sects. 2.2 and 2.3 introduce risk functions and regularizers, respectively, which are or can be used for SVMs. Section 3 examines the form of the regularizer from two perspectives: Sect. 3.1 discusses the incompatibility of a homogeneous empirical risk and regularizer, while Sect. 3.2 reveals a condition under which the regularized ERM can be viewed as a robust ERM. Section 4 derives the duality theorem and the optimality condition as well as the dual formulation. Section 5 is devoted to interpretations of the dual formulation and some relations to distributionally robust optimization. Section 6 concludes the paper. Proofs of propositions are given in “Appendix”.

To keep the manuscript concise, this paper focuses on the presentation of the general formulation. Readers interested in the proximity between the use of a non-\(\ell _2\)-norm and that of the \(\ell _2\)-norm, or in remarks on the kernel trick and some numerical examples illustrating the theoretical results, are referred to Gotoh and Uryasev (2013), the discussion version of this paper.

Notations. A vector in \({\mathbb {R}}^n\) is denoted in boldface and is written as a column vector in inner products. In particular, \(\varvec{1}\) and \(\varvec{0}\) are the column vectors with all components equal to 1 and 0, respectively. Matrices are also denoted by boldface. In particular, we denote by \({\mathrm{diag}}(\varvec{x})\) the square matrix whose diagonal elements are given by \(\varvec{x}\) and whose off-diagonal elements are all 0. The superscript ‘\(\top \)’ denotes the transpose of vectors and matrices (e.g., \(\varvec{x}^\top =(x_1,\ldots ,x_n)\)). The inequality \(\varvec{x}\ge \varvec{y}\) denotes \(x_i\ge y_i,i=1,\ldots ,n\), and \({\mathbb {R}}_+^n:=[0,+\infty )^n=\{\varvec{x}\in {\mathbb {R}}^n:\varvec{x}\ge \varvec{0}\}\); we denote by \(\Pi ^m\) the unit simplex in \({\mathbb {R}}^m\), i.e., \(\Pi ^m:=\{\varvec{p}\in {\mathbb {R}}^m:\varvec{1}^\top \varvec{p}=1,\varvec{p}\ge \varvec{0}\}\). For a set C, its relative interior is denoted by \({\mathrm{ri}}(C)\). We denote the \(\ell _2\)-, the \(\ell _1\)-, and the \(\ell _\infty \)-norms by \(\Vert \cdot \Vert _2\), \(\Vert \cdot \Vert _1\), and \(\Vert \cdot \Vert _\infty \), respectively. The notation \(\Vert \cdot \Vert \) is reserved for an arbitrary norm on \({\mathbb {R}}^n\). \((x)_+:=\max \{x,0\}\). \(\delta _C\) denotes the (0-\(\infty \)) indicator function of a set \(C\subset {\mathbb {R}}^n\), i.e., \(\delta _C(\varvec{x})=0\) if \(\varvec{x}\in C\); \(+\infty \) otherwise. With a slight abuse of notation, we sometimes denote by \(\delta _{c(\cdot )}\) the indicator function of a condition \(c(\cdot )\), i.e., \(\delta _{c(\cdot )}(\varvec{x})=0\) if \(c(\varvec{x})\) is true; \(+\infty \) otherwise. As a convention inspired by MATLAB, we extensively apply a function on \({\mathbb {R}}\) to a vector in \({\mathbb {R}}^m\). For example, with a function v on \({\mathbb {R}}\), we define \(v(\varvec{L}):=(v(L_1),\ldots ,v(L_m))^\top \). Besides, we employ the notations ‘\(./\)’ and ‘\(^{.k}\)’ for the component-wise division of two vectors \(\varvec{x}\) and \(\varvec{y}\) and the component-wise power, respectively, i.e., \(\varvec{x}./\varvec{y}=(x_1/y_1,\ldots ,x_n/y_n)^\top \) and \(\varvec{x}^{.k}=(x_1^k,\ldots ,x_n^k)^\top \).

2 A general primal formulation of SVMs

This section introduces a general primal formulation of the binary classification and explains its motivations.

2.1 Loss, risk function, and regularized ERM

Suppose that a data set \(\{(\varvec{x}_1,y_1),\ldots ,(\varvec{x}_m,y_m)\}\) is given, where \(\varvec{x}_i\in {\mathbb {R}}^n\) denotes the attributes of sample i and \(y_i\in \{\pm 1\}\) denotes its binary label, which represents the class that sample i belongs to, \(i=1,\ldots ,m\). Then the (binary) classification problem is formulated as the problem of finding a decision function \(d:{\mathbb {R}}^n\rightarrow \{\pm 1\}\) defined as

$$\begin{aligned} d(\varvec{x}):={\mathrm{sign}}(\varvec{w}^\top \varvec{x}-b)=\left\{ \begin{array}{ll} +1, &{} \text{ if } \varvec{w}^\top \varvec{x}\ge b,\\ -1, &{} \text{ if } \varvec{w}^\top \varvec{x}< b,\\ \end{array} \right. \end{aligned}$$
(1)

for predicting binary labels of unseen samples, \(\varvec{x}_{m+1},\ldots ,\varvec{x}_{\ell }\). Note that finding d is equivalent to finding a vector \(\varvec{w}\ne \varvec{0}\) and a scalar b.

To formulate the problem as an optimization, we quantify the misspecification of the sample labels by the decision function d by using some function \(\mathcal{{F}}\) of a loss \(\varvec{L}\) associated with \((\varvec{w},b)\).

Let us first introduce the loss. Since a sample \(\varvec{x}_i\) can be considered misclassified if its margin, denoted by \(y_i(\varvec{x}_i^\top \varvec{w}-b)\), is negative, it is natural to define \(\varvec{L}\) as

$$\begin{aligned} \varvec{L}=-\varvec{Y}(\varvec{X}\varvec{w}-\varvec{1}b)\, \leftrightarrow \, L_i=-y_i(\varvec{x}_i^\top \varvec{w}-b),~i=1,\ldots ,m, \end{aligned}$$
(2)

where \(\varvec{Y}:=\mathrm{diag}(y_1,\ldots ,y_m)\) and \(\varvec{X}:=(\varvec{x}_1,\ldots ,\varvec{x}_m)^\top \). More precisely, the loss represents the degree of misclassification so that \(L_i>0\) means that the i-th sample is misclassified, while \(L_i<0\) implies correct classification.

Alternatively, with a kernel function k on \({\mathbb {R}}^n\times {\mathbb {R}}^n\) and \(\varvec{\alpha }:=(\alpha _1,\ldots ,\alpha _m)^\top \in {\mathbb {R}}^m\), the loss can be associated with a nonlinear discriminant surface defined by \(\{\varvec{x}\in {\mathbb {R}}^n:\sum _{h=1}^my_hk(\varvec{x}_h,\varvec{x})\alpha _h=b\}\). Denoting the kernel matrix by \(\varvec{K}=(y_iy_jk(\varvec{x}_i,\varvec{x}_j))_{i,j=1,\ldots ,m}\in {\mathbb {R}}^{m\times m}\), the corresponding loss can be defined by

$$\begin{aligned} \varvec{L}=-(\varvec{K}\varvec{\alpha }-\varvec{y}b)\, \leftrightarrow \, L_i=-y_i\left( \sum _{h=1}^my_hk(\varvec{x}_h,\varvec{x}_i)\alpha _h-b\right) ,~i=1,\ldots ,m. \end{aligned}$$
(3)

It is noteworthy that with a suitable matrix \(\varvec{G}\) and a vector \(\varvec{w}\) of decision variables (standing for \(\varvec{\alpha }\) in the case of (3)), both (2) and (3) have the form

$$\begin{aligned} \varvec{L}=-(\varvec{G}^\top \varvec{w}-\varvec{y}b), \end{aligned}$$
(4)

which is linear with respect to \((\varvec{w},b)\). Accordingly, we hereafter suppose that the loss \(\varvec{L}\) has the linear form (4).
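As a minimal illustration of the linear loss form (4), the following sketch in Python/NumPy builds \(\varvec{G}=\varvec{X}^\top \varvec{Y}\) for the linear-kernel case (2) and evaluates the loss vector; the toy data and variable names are ours and not part of the formulation.

```python
import numpy as np

# Toy data: m = 4 samples in R^2 with labels in {+1, -1} (illustrative values only).
X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, -1.0], [-2.0, 0.5]])  # m x n
y = np.array([1.0, 1.0, -1.0, -1.0])                                # m

# Linear-kernel case (2): G = X^T Y, so that L = -(G^T w - y b) = -Y (X w - 1 b).
G = X.T * y                    # n x m, equals X^T @ diag(y)

w, b = np.array([1.0, 0.2]), 0.1
L = -(G.T @ w - y * b)         # loss vector (4); L_i > 0 flags a misclassified sample i
print(L)
```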

We are now in a position to formally introduce the empirical risk.

Let \(\mathcal{{F}}\) be a function on \({\mathbb {R}}^m\) that can take ‘\(+\infty \)’ and is proper and l.s.c. (lower semi-continuous):

  • \(\mathcal{{F}}\) is proper if \(\mathcal{{F}}(\varvec{L})>-\infty \) for all \(\varvec{L}\in {\mathbb {R}}^m\) and dom \(\mathcal{{F}}\ne \emptyset \), where ‘dom \(\mathcal{{F}}\)’ denotes the effective domain of \(\mathcal{{F}}\), i.e., dom \(\mathcal{{F}}:=\{\varvec{L}\in {\mathbb {R}}^m:\mathcal{{F}}(\varvec{L})<+\infty \}\).

  • \(\mathcal{{F}}\) is l.s.c. if \(\mathcal{{F}}(\varvec{L})\le \liminf \limits _{i\rightarrow \infty }\mathcal{{F}}(\varvec{L}_i)\) for any \(\varvec{L}\in {\mathbb {R}}^m\) and any sequence \(\varvec{L}_1,\varvec{L}_2,\ldots \in {\mathbb {R}}^m\) converging to \(\varvec{L}\).

The empirical risk is then defined as \(\mathcal{{F}}(\varvec{L})\), and we call the function \(\mathcal {F}\) a risk function. By construction, \(\mathcal{{F}}(\varvec{L})\) represents the undesirability of the loss defined above, i.e., less is better. Here we would like to note that the usage of the words ‘loss’ and ‘risk’ is slightly different from the usual convention of the machine learning literature.

With a function \(\gamma :{\mathbb {R}}^n\rightarrow [0,\infty ]\), we consider a general SVM for binary classification of the following regularized ERM form:

$$\begin{aligned} p^\star :=\underset{\varvec{w},b}{\inf }\quad \mathcal{{F}}(-(\varvec{G}^\top \varvec{w}-\varvec{y}b))+\gamma (\varvec{w}), \end{aligned}$$
(5)

which is a minimization of a function consisting of two terms, as sketched in the introduction.

For example, the C-SVM (Cortes and Vapnik 1995) for the binary classification is formulated with the following (convex) quadratic programming (QP) problem:

$$\begin{aligned} \begin{array}{rll} \bar{p}^\star :=&{}\underset{\varvec{w},b,\varvec{z}}{\text{ minimize }} &{} \frac{1}{2}\varvec{w}^\top \varvec{w}+\frac{C}{m}\varvec{1}^\top \varvec{z}\\ &{}\text{ subject } \text{ to } &{} \varvec{z}\ge -\varvec{Y}(\varvec{X}\varvec{w}-\varvec{1}b)+\varvec{1},~\varvec{z}\ge \varvec{0},\\ \end{array} \end{aligned}$$
(6)

where \(C>0\) is a user-defined parameter. This can be equivalently presented as

$$\begin{aligned} \underset{\varvec{w},b}{\text{ minimize }}\, \underbrace{\frac{C}{m}\varvec{1}^\top (-\varvec{Y}(\varvec{X}\varvec{w}-\varvec{1}b)+\varvec{1})_+}_{{\mathrm{Empirical Risk}}}\, +\, \underbrace{\frac{1}{2}\Vert \varvec{w}\Vert _2^2}_{{\mathrm{Regularizer}}}, \end{aligned}$$

which corresponds to (5) with \(\mathcal{{F}}(\varvec{L})=(C/m)\varvec{1}^\top (\varvec{L}+\varvec{1})_+\), \(\varvec{G}=\varvec{X}^\top \varvec{Y}\), and \(\gamma (\cdot )=\frac{1}{2}\Vert \cdot \Vert _2^2\). On the other hand, the \(\nu \)-SVM (Schölkopf et al. 2000) solves another QP:

$$\begin{aligned} \begin{array}{l@{~}ll} \tilde{p}^\star :=&{}\underset{\varvec{w},b,\rho ,\varvec{z}}{\text{ minimize }} &{} \frac{1}{2}\varvec{w}^\top \varvec{w}-\rho +\frac{1}{m\nu }\varvec{1}^\top \varvec{z}\\ &{}\text{ subject } \text{ to } &{} \varvec{z}\ge -\varvec{Y}(\varvec{X}\varvec{w}-\varvec{1}b)+\varvec{1}\rho ,~\varvec{z}\ge \varvec{0},\\ \end{array} \end{aligned}$$
(7)

where \(\nu \in (0,1]\) is a user-defined parameter. Similarly to the C-SVM, the QP (7) can be viewed as a regularized ERM:

$$\begin{aligned} \underset{\varvec{w},b}{\text{ minimize }}\, \underbrace{\min _\rho \big \{-\rho +\frac{1}{\nu m}\varvec{1}^\top ((-\varvec{Y}(\varvec{X}\varvec{w}-\varvec{1}b)+\rho \varvec{1})_+)\big \}}_{{\mathrm{Empirical Risk}}}\, +\, \underbrace{\frac{1}{2}\Vert \varvec{w}\Vert _2^2}_{{\mathrm{Regularizer}}}. \end{aligned}$$

Note that the risk function here coincides with the CVaR (see the list below for its definition); to our knowledge, no specific name has been given to it in the SVM context, whereas the risk function of (6) is known as the hinge loss.
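To make the correspondence with (5) concrete, the following sketch writes the C-SVM (6) and the \(\nu \)-SVM (7) directly in the ‘Empirical Risk \(+\) Regularizer’ form. It assumes the CVXPY modeling package; the synthetic data and the variable names are ours and only serve as an illustration.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, n = 40, 3
X = rng.normal(size=(m, n))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=m) >= 0, 1.0, -1.0)  # toy labels

w, b = cp.Variable(n), cp.Variable()
L = -cp.multiply(y, X @ w - b)                 # loss vector (2)

# C-SVM (6): Hinge1-type risk + squared l2-norm regularizer.
C = 1.0
hinge_risk = (C / m) * cp.sum(cp.pos(L + 1))
cp.Problem(cp.Minimize(hinge_risk + 0.5 * cp.sum_squares(w))).solve()
print("C-SVM objective :", hinge_risk.value + 0.5 * np.sum(w.value ** 2))

# nu-SVM (7): CVaR risk (an inseparable risk function) + squared l2-norm regularizer.
nu = 0.5
rho = cp.Variable()
cvar_risk = -rho + (1.0 / (nu * m)) * cp.sum(cp.pos(L + rho))
cp.Problem(cp.Minimize(cvar_risk + 0.5 * cp.sum_squares(w))).solve()
print("nu-SVM objective:", cvar_risk.value + 0.5 * np.sum(w.value ** 2))
```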

Considering the tractability in optimization and duality, we assume convexity of \(\mathcal {F}\) and \(\gamma \) throughout the paper:

  • \(\mathcal{{F}}\) is convex if \((1-\tau )\mathcal{{F}}(\varvec{L})\,+\,\tau \mathcal{{F}}(\varvec{L}')\ge \mathcal{{F}}((1-\tau )\varvec{L}\,+\,\tau \varvec{L}')\) for all \(\varvec{L},\varvec{L}'\in {\mathbb {R}}^m,~\tau \in (0,1)\).

In the remainder of this section, let us review other existing choices of the convex risk function \(\mathcal{{F}}\) (Sect. 2.2) and the convex regularizer \(\gamma \) (Sect. 2.3), which are covered by the generalized formulation (5).

2.2 Convex risk functions and their basic properties

Below, we give some examples of convex risk functions in binary classification.

[Displayed list of example convex risk functions: \({\mathrm{Hinge1}}_{(t,\varvec{p})}\), \({\mathrm{Hinge2}}_{(t,\varvec{p})}\), \({\mathrm{LSSVM}}_{(t,\varvec{p})}\), \({\mathrm{LR}}_{(t,\varvec{p})}\), \({\mathrm{LSE}}_{(t,\varvec{p})}\), \({\mathrm{CVaR}}_{(\alpha ,\varvec{p})}\), among others,]

where \(t>0\) and \(\alpha \in [0,1)\) are user-defined parameters, and \({\mathbb {E}}_{\varvec{p}}(\cdot )\) denotes the mathematical expectation under a probability measure \(\varvec{p}\), i.e., \({\mathbb {E}}_{\varvec{p}}(\varvec{x}):=\varvec{p}^\top \varvec{x}\). See the “Appendix” of Gotoh and Uryasev (2013) for additional explanation of CVaR and for a list of other risk functions which have the potential to be useful.

Here we emphasize that in addition to \({\mathrm{CVaR}}_{(\alpha ,\varvec{p})}\), which appears in the \(\nu \)-SVM, the Log-Sum-Exp risk function \({\mathrm{LSE}}_{(1,\varvec{p})}\), which appears in AdaBoost (Freund and Schapire 1997), is another notable inseparable-form risk function.

In addition to convexity, the following three properties are frequently considered in the context of financial risk management (e.g., Artzner et al. 1999).

  • \(\mathcal{{F}}\) is monotonic if \(\mathcal{{F}}(\varvec{L})\ge \mathcal{{F}}(\varvec{L}') \text{ for } \text{ all } \varvec{L},\varvec{L}'\in {\mathbb {R}}^m\) such that \(\varvec{L}\ge \varvec{L}'\).

  • \(\mathcal{{F}}\) is translation invariant if \(\mathcal{{F}}(\varvec{L}+\tau \varvec{1})=\mathcal{{F}}(\varvec{L})+\tau \) for all \(\tau \in {\mathbb {R}},~\varvec{L}\in {\mathbb {R}}^m\).

  • \(\mathcal{{F}}\) is positively homogeneous if \(\mathcal{{F}}(\tau \varvec{L})=\tau \mathcal{{F}}(\varvec{L})\) for all \(\tau >0,~\varvec{L}\in {\mathbb {R}}^m\).

In particular, a proper l.s.c. convex risk function satisfying the above three properties is said to be coherent.

  • \(\mathcal{{F}}\) is coherent if it is a proper l.s.c. convex risk function satisfying monotonicity, translation invariance and positive homogeneity.

While the above properties make sense in financial risk management (see Artzner et al. 1999, for interpretations of these properties in the financial context), we need to examine the rationale of their role in the context of SVMs, i.e., with which properties \(\mathcal {F}\) can be reasonable as a risk function for SVMs.

Among the three properties, the monotonicity seems to be less arguable since it requires that a larger misclassification, \(L_i\), should be more penalized in the ERM. On the other hand, there seems to be no strong motivation for the other two properties at this point, unless they lead to tractable optimization problems. However, as will be shown in Sects. 3 to 5, these properties play crucial roles in the interpretation of the primal or dual formulation and in the validity of the combination of risk function and regularizer. Because of those facts, this paper considers the above properties. While the term “coherence” may be confusing, we simply use it to emphasize commonality with the financial risk function theory. We do not use this term to insist that “coherence” implies a more legitimate choice of function in the context of the classification task.

For later reference, we observe the following facts.

Proposition 1

For a function \(\mathcal{{V}}\) on \({\mathbb {R}}^m\), let us define another function on \({\mathbb {R}}^m\) of the form:

$$\begin{aligned} \mathcal{{F}}(\varvec{L})=\inf _c\{c+{\mathcal{{V}}}(\varvec{L}-c\varvec{1})\}. \end{aligned}$$
(8)

Then, \(\mathcal {F}\) is convex if \(\mathcal{{V}}\) is convex; \(\mathcal {F}\) is monotonic if \(\mathcal{{V}}\) is monotonic; \(\mathcal {F}\) is translation invariant for any \(\mathcal{{V}}\); \(\mathcal {F}\) is positively homogeneous if \(\mathcal{{V}}\) is positively homogeneous.

This proposition is a minor modification of the Quadrangle Theorem of Rockafellar and Uryasev (2013), which restricts \(\mathcal {F}\) to be convex (the above proposition does not make this assumption).

We should notice that we can view the formula (8) as an operation for making any function translation invariant.

In particular, we will refer to a special case of (8) having the form

$$\begin{aligned} \mathcal{{F}}(\varvec{L})= \mathcal{{F}}_{\varvec{p}}(\varvec{L}):=\inf _c\{c+{\mathbb {E}}_{\varvec{p}}(v(\varvec{L}-c\varvec{1}))\} \equiv \inf _c\left\{ c+\sum _{i=1}^{m}p_{i}v(L_i-c)\right\} , \end{aligned}$$
(9)

where \(v:{\mathbb {R}}\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is a proper l.s.c. convex function. As an easy extension of Proposition 1, we can confirm the following properties.

Corollary 1

Function (9) is translation invariant for any v. (9) is convex if v is convex; (9) is monotonic if v is monotonic; (9) is positively homogeneous if v is positively homogeneous; if v satisfies \(v(z)\ge z+B\) for any z and a constant B, (9) is proper.

This corollary is also a minor modification of the Expectation Theorem of Rockafellar and Uryasev (2013), where v is assumed to be convex and to satisfy the condition of the last statement above with \(B=0\).

We would emphasize that \({\mathrm{CVaR}}_{(\alpha ,\varvec{p})}\) and \({\mathrm{LSE}}_{(1,\varvec{p})}\) can be represented in the form of (9) with \(v(z)=\frac{1}{1-\alpha }(z)_+\) and \(\exp (z)-1\), respectively, with each satisfying the condition.

In addition, via the formula (9), non-translation-invariant functions such as Hinge1, Hinge2, and LSSVM can be transformed into translation invariant ones. For example, by employing \(v(z)=(1+z)_+/t\) in (9), \({\mathrm{Hinge1}}_{(t,\varvec{p})}\) is transformed into the translation invariant risk function \({\mathrm{Hinge1}}^{\mathrm{OCE}}_{(t,\varvec{p})}(\varvec{L})=\inf _c\{c+\frac{1}{t}\varvec{p}^\top ((1-c)\varvec{1}+\varvec{L})_+\}\). Note that this is equal to \(1+{\mathrm{CVaR}}_{(1-t,\varvec{p})}(\varvec{L})\). Namely, CVaR can be considered as Hinge1 transformed by (9).
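The identity \({\mathrm{Hinge1}}^{\mathrm{OCE}}_{(t,\varvec{p})}(\varvec{L})=1+{\mathrm{CVaR}}_{(1-t,\varvec{p})}(\varvec{L})\) is easy to verify numerically. The sketch below (our own helper, with uniform \(\varvec{p}=\varvec{1}/m\) and \(t\in (0,1)\)) evaluates both sides of (9) exactly by exploiting the fact that the inner minimization over c is piecewise linear and convex, so its minimum is attained at a breakpoint.

```python
import numpy as np

def oce(L, v, breakpoints):
    """Formula (9) with uniform p = 1/m: inf_c { c + mean(v(L - c)) }.
    For the piecewise-linear v's used here, the infimum is attained at a breakpoint."""
    return min(c + np.mean(v(L - c)) for c in breakpoints)

rng = np.random.default_rng(1)
L, t = rng.normal(size=200), 0.3

# Hinge1 transformed by (9): v(z) = (1 + z)_+ / t, with breakpoints at c = L_i + 1.
hinge1_oce = oce(L, lambda z: np.maximum(1 + z, 0) / t, L + 1)

# CVaR with alpha = 1 - t: v(z) = (z)_+ / (1 - alpha) = (z)_+ / t, breakpoints at c = L_i.
cvar = oce(L, lambda z: np.maximum(z, 0) / t, L)

print(hinge1_oce, 1 + cvar)   # the two values coincide
```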

Such transformed functions are shown to be related to the uncertainty set-based representation of SVMs. Indeed, Kanamori et al. (2013) consider an SVM formulation which employs the risk function of the form (9) with \(\varvec{p}=(2/m)\varvec{1}\). More precisely, one of their examples called Truncated Quadratic Loss corresponds to Hinge2 transformed by (9). An extension based on the above propositions will be discussed in Sect. 5.

Figure 1 illustrates relationships among some risk functions on the basis of their properties, indicating relations to several existing regularized ERMs. For the risk functions which are not mentioned in this paper, see the “Appendix” of Gotoh and Uryasev (2013).

Fig. 1 Classification of convex risk functions and corresponding regularized ERMs

In addition to the properties described in the text, ‘Regular’ refers to l.s.c. convex risk functions \(\mathcal{{F}}\) satisfying the regularity conditions of the regular measures of risk of Rockafellar and Uryasev (2013).

Rockafellar and Uryasev (2013) show that with a (proper) l.s.c. convex \(v:{\mathbb {R}}\rightarrow {\mathbb {R}}\cup \{+\infty \}\) such that \(v(0)=0\) and \(v(x)>x\) when \(x\ne 0\), the risk function defined in (9) is a regular measure of risk. In order to avoid unnecessary confusion, we do not use the term ‘risk measure’ in this paper, but use the term ‘risk function’ instead. Besides, in Sect. 5, we show that the dual formulations of the set of monotonic translation-invariant risk functions, which is shaded in the figure, can be interpreted as optimizations over probability distributions.

2.3 Regularizers with general norms and classification of SVMs

Following Rifkin and Lippert (2007), the regularizer \(\gamma \) is assumed to have the following properties.

$$\begin{aligned} \gamma :{\mathbb {R}}^n\rightarrow [0,+\infty ] \text{ is } \text{ an } \text{ l.s.c. } \text{ convex } \text{ function } \text{ such } \text{ that } \gamma (\varvec{0})=0. \end{aligned}$$
(10)

In particular, we below consider the case where the regularizer \(\gamma (\varvec{w})\) is associated with an arbitrary norm as follows.

$$\begin{aligned} \gamma (\varvec{w})=\iota (\Vert \varvec{w}\Vert ), \end{aligned}$$

where \(\Vert \cdot \Vert \) is an arbitrary norm on \({\mathbb {R}}^n\), and \(\iota : [0,+\infty )\rightarrow [0,+\infty ]\) is non-decreasing and convex. In the following, we pay special attention to the following three regularizers.

$$\begin{aligned} \text{(a) }&\quad \gamma (\varvec{w})=\frac{1}{2}\Vert \varvec{w}\Vert _2^2\\ \text{(b) }&\quad \gamma (\varvec{w})=\Vert \varvec{w}\Vert \\ \text{(c) }&\quad \gamma (\varvec{w})=\delta _{\Vert \cdot \Vert \le 1}(\varvec{w}) \end{aligned}$$

where \(\delta _{\Vert \cdot \Vert \le 1}\) denotes the indicator function defined as

$$\begin{aligned} \delta _{\Vert \cdot \Vert \le 1}(\varvec{w})= \left\{ \begin{array}{ll} 0, &{}\Vert \varvec{w}\Vert \le 1,\\ \infty ,&{}\Vert \varvec{w}\Vert >1.\\ \end{array} \right. \end{aligned}$$

Note that \(\Vert \cdot \Vert \) denotes an arbitrary norm, while \(\Vert \cdot \Vert _2\) denotes the \(\ell _2\)-norm.

The cases (a) and (b) are categorized as the Tikhonov regularization, where the norm appears in the objective of the primal formulation, while the case (c) is categorized as the Ivanov regularization, where the norm appears in a constraint of the formulation, i.e., \(\Vert \varvec{w}\Vert \le 1\). These two styles often lead to the same result (see e.g., Proposition 12 of Kloft et al. 2011). However, we have to pay attention to the difference because such equivalence depends on the risk function employed, as will be discussed in Sect. 3.1.

Despite several restrictions on the forms of the loss \(\varvec{L}\), the risk function \(\mathcal {F}\), and the regularizer \(\gamma \), the general formulation (5) covers a variety of optimization problem formulations for binary classification, as follows.

  • 1-C-SVM (6): \(\mathcal{{F}}={\mathrm{Hinge1}}_{(t,\frac{\varvec{1}}{m})}\), \(\gamma (\cdot )=\frac{1}{2}\Vert \cdot \Vert _2^2\);

  • 2-C-SVM: \(\mathcal{{F}}={\mathrm{Hinge2}}_{(t,\frac{\varvec{1}}{m})}\), \(\gamma (\cdot )=\frac{1}{2}\Vert \cdot \Vert _2^2\);

  • \(\nu \)-SVM (7): \(\mathcal{{F}}={\mathrm{CVaR}}_{(1-\nu ,\frac{\varvec{1}}{m})}\), \(\gamma (\cdot )=\frac{1}{2}\Vert \cdot \Vert _2^2\);

  • \(\ell _1\)-regularized logistic regression (e.g., Koh et al. (2007)): \(\mathcal{{F}}={\mathrm{LR}}_{(1,\frac{\varvec{1}}{m})}\), \(\gamma (\cdot )=t\Vert \cdot \Vert _1\);

  • AdaBoost (Freund and Schapire 1997): \(\mathcal{{F}}={\mathrm{LSE}}_{(1,\frac{\varvec{1}}{m})}\), \(\gamma =\delta _{\Vert \cdot \Vert _1\le 1}+\delta _{{\mathbb {R}}_+^n}\);

  • LPBoost (Rätsch et al. 2000): \(\mathcal{{F}}={\mathrm{CVaR}}_{(1-\nu ,\frac{\varvec{1}}{m})}\), \(\gamma =\delta _{\Vert \cdot \Vert _1\le 1}+\delta _{{\mathbb {R}}_+^n}\);

  • LS-SVM (Suykens and Vandewalle 1999): \(\mathcal{{F}}={\mathrm{LSSVM}}_{(t,\frac{\varvec{1}}{m})}\), \(\gamma (\cdot )=\frac{1}{2}\Vert \cdot \Vert _2^2\),

where \(\gamma =\delta _{\Vert \cdot \Vert _1\le 1}+\delta _{{\mathbb {R}}_+^n}\) corresponds to a regularizer, explicitly given by

$$\begin{aligned} (\delta _{\Vert \cdot \Vert _1\le 1}+\delta _{{\mathbb {R}}_+^n})(\varvec{w})=\left\{ \begin{array}{ll} 0,&{}\text{ if } \varvec{1}^\top \varvec{w}\le 1,~\varvec{w}\ge \varvec{0},\\ +\infty ,&{}\text{ otherwise }.\\ \end{array} \right. \end{aligned}$$
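As an illustration of a non-\(\ell _2\) instance from the above list, the following sketch poses the LPBoost-type combination, i.e., the CVaR risk together with the regularizer \(\delta _{\Vert \cdot \Vert _1\le 1}+\delta _{{\mathbb {R}}_+^n}\), as an instance of (5); the regularizer simply becomes the linear constraints \(\varvec{1}^\top \varvec{w}\le 1\), \(\varvec{w}\ge \varvec{0}\), so the whole problem is an LP. The sketch assumes CVXPY, and the data and names are ours.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
m, n, nu = 60, 5, 0.3
X = rng.normal(size=(m, n))
y = np.where(X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) >= 0, 1.0, -1.0)

w, b, rho = cp.Variable(n), cp.Variable(), cp.Variable()
L = -cp.multiply(y, X @ w - b)

# CVaR empirical risk in its minimization form (as in the nu-SVM) ...
risk = -rho + (1.0 / (nu * m)) * cp.sum(cp.pos(L + rho))

# ... with the Ivanov-type regularizer delta_{||.||_1 <= 1} + delta_{R_+^n},
# i.e., the constraints 1^T w <= 1 and w >= 0: overall a linear program.
prob = cp.Problem(cp.Minimize(risk), [cp.sum(w) <= 1, w >= 0])
prob.solve()
print("optimal value:", prob.value)
```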

With such a wide range of possible risk functions and regularizers, the first research question we will consider is formulated as follows: what properties should \(\mathcal{{F}}\) and \(\gamma \) have for the regularized ERM (5) to be reasonable or interpretable?

3 Insights on the general regularizer

To answer the question given at the end of the previous section, we first consider the regularized ERM (5) and later its dual formulation. In particular, this section draws two insights on the primal formulation (5): (1) an incompatible choice of the empirical risk and the regularizer; (2) a perspective as a robust optimization.

3.1 General formulations with non-\(\ell _2\)-norm regularizers

Let us start with the following simple, but suggestive fact.

Proposition 2

Suppose that both the regularizer \(\gamma \) and the risk function \(\mathcal{{F}}\) are positively homogeneous. Then the primal (5) either attains the optimal objective value 0, in which case a solution \((\varvec{w}^\star ,b^\star )\) with \(\varvec{w}^\star =\varvec{0}\) is optimal, or is unbounded, i.e., \(p^\star =-\infty \).

See Proof of Proposition 2 in “Appendix” for the proof.
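The intuition behind Proposition 2 can be sketched in one line (an informal argument only; the formal proof is the one referred to above): if both terms are positively homogeneous, then for any \(\tau >0\),

$$\begin{aligned} \mathcal{{F}}(-(\varvec{G}^\top (\tau \varvec{w})-\varvec{y}(\tau b)))+\Vert \tau \varvec{w}\Vert =\tau \left[ \mathcal{{F}}(-(\varvec{G}^\top \varvec{w}-\varvec{y}b))+\Vert \varvec{w}\Vert \right] , \end{aligned}$$

so the objective scales linearly along every ray through the origin; hence the infimum is 0 when the bracketed value is nonnegative for all \((\varvec{w},b)\) and \(-\infty \) otherwise.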

The above proposition shows a situation where (5) would be meaningless, having no optimal solution or resulting in a trivial solution satisfying \(\varvec{w}=\varvec{0}\). Even if an optimization algorithm returns an optimal solution with \(\varvec{w}\ne \varvec{0}\), such a solution is considered to be fragile since the all-zero solution is another optimal solution. Accordingly, the combination of a positively homogeneous function \(\mathcal{{F}}\), such as CVaR, and a regularizer of the form \(\gamma (\varvec{w})=\Vert \varvec{w}\Vert \) is not adequate for the classification problem. On the other hand, with a non-homogeneous \(\iota \), the regularizer given in the form \(\gamma (\varvec{w})=\iota (\Vert \varvec{w}\Vert )\) makes sense. For example, the case where \(\gamma (\varvec{w})=\frac{1}{2}\Vert \varvec{w}\Vert _2^2\), corresponding to \(\iota (z)=\frac{1}{2}z^2\), works for the \(\nu \)-SVM, which employs CVaR. (See also Remark 1 below.)

Thus, we may apply such a non-homogeneous \(\iota \) to non-\(\ell _2\)-norm, such as the \(\ell _1\)- or the \(\ell _\infty \)-norm, so as not to make the Tikhonov regularizer positively homogeneous. However, such a strategy leads to a non-linear formulation and may reduce the advantage of using polyhedral norms. Therefore, below we consider the case of the Ivanov regularization, i.e., \(\gamma (\varvec{w})=\delta _{\Vert \cdot \Vert \le 1}(\varvec{w})\).

The Tikhonov and Ivanov regularizations are often considered identical. However, as the above observation indicates, a careful treatment is required. A notion of ‘equivalence’ only holds if a meaningful optimal solution is attained.

Remark 1

Tsyurmasto et al. (2013) consider the case where \(\mathcal {F}\) is positively homogeneous (not necessarily convex) and the \(\ell _2\)-norm is employed for the regularizer. They show that under some mild conditions, the following formulations are equivalent. By ‘equivalent’ we mean that the formulations provide the same (set of) classifiers.

$$\begin{aligned} \begin{array}{ll} \underset{\varvec{w},b}{\text{ minimize }} &{} \mathcal{{F}}(-\varvec{G}^\top \varvec{w}+\varvec{y}b)\\ \text{ subject } \text{ to } &{} \Vert \varvec{w}\Vert _2\le E,\\ \end{array} \end{aligned}$$
(11)
$$\begin{aligned} \begin{array}{ll} \underset{\varvec{w},b}{\text{ minimize }} &{} C\cdot \mathcal{{F}}(-\varvec{G}^\top \varvec{w}+\varvec{y}b)+\frac{1}{2}\Vert \varvec{w}\Vert _2^2,\\ \end{array} \end{aligned}$$
(12)
$$\begin{aligned} \begin{array}{ll} \underset{\varvec{w},b}{\text{ minimize }} &{} \frac{1}{2}\Vert \varvec{w}\Vert _2^2\\ \text{ subject } \text{ to } &{} \mathcal{{F}}(-\varvec{G}^\top \varvec{w}+\varvec{y}b)\le -D.\\ \end{array} \end{aligned}$$
(13)

Here E, C, and D are positive constants, although the equivalence is independent of E, C, and D and therefore we can set \(E=C=D=1\). This is a virtue of the positive homogeneity of the risk function.

On the other hand, without positive homogeneity, the above independence does not hold. For example, with \(\mathcal{{F}}={\mathrm{Hinge1}}_{(1,\varvec{1}C/m)}\), which is not homogeneous, (12) is equal to the C-SVM (6). However, the equivalence to (11) or to (13) depends on E or D, respectively.

3.2 Interpretation of regularizers based on robust optimization modeling

Xu et al. (2009a) show that a regularized minimization of Hinge1 can be viewed as a robust optimization modeling. They suppose that the given data set \(\{(\varvec{x}_1,y_1),\ldots ,(\varvec{x}_m,y_m)\}\) suffers from some perturbation of the form \(\{(\varvec{x}_1-\varvec{\delta }_1,y_1),\ldots ,(\varvec{x}_m-\varvec{\delta }_m,y_m)\}\) with some \((\varvec{\delta }_1,\ldots ,\varvec{\delta }_m)\) belonging to

$$\begin{aligned} \mathcal{T}:=\left\{ (\varvec{\delta }_1,\ldots ,\varvec{\delta }_m):\sum _{i=1}^m\Vert \varvec{\delta }_i\Vert \le C\right\} , \end{aligned}$$

where \(C>0\) is a parameter determining the size of the set and \(\Vert \cdot \Vert \) is a norm. Under this uncertainty, they consider minimizing the worst-case ERM with Hinge1, namely, \(\max _{\varvec{\Delta }\in \mathcal{T}}{\mathrm{Hinge1}}_{(m,\varvec{1}/m)}(-\varvec{Y}\{(\varvec{X}-\varvec{\Delta })\varvec{w}-\varvec{1}b\})\), where \(\varvec{\Delta }=(\varvec{\delta }_1,\ldots ,\varvec{\delta }_m)^\top \).

Theorem 1

(Xu et al. 2009a) Suppose that there is no decision function (1) correctly mapping all given samples, \(\varvec{x}_1,\ldots ,\varvec{x}_m\), to their labels, \(y_1,\ldots ,y_m\), i.e., \(\not \exists (\varvec{w},b)\in (\mathbb {R}^{n}\setminus \{\varvec{0}\})\times \mathbb {R}\) such that \(y_i=d(\varvec{x}_i),~i=1,\ldots ,m\). Then the following two optimization problems over \((\varvec{w},b)\) are equivalent.

$$\begin{aligned} \underset{\varvec{w},b}{\mathrm{minimize}}&\quad \max _{(\varvec{\delta }_1,\ldots ,\varvec{\delta }_m)\in \mathcal{T}} \sum _{i=1}^m (1-y_i\{\varvec{w}^\top (\varvec{x}_i-\varvec{\delta }_i)-b\})_+,\\ \underset{\varvec{w},b}{\mathrm{minimize}}&\quad C\Vert \varvec{w}\Vert ^\circ +\sum _{i=1}^m (1-y_i(\varvec{w}^\top \varvec{x}_i-b))_+, \end{aligned}$$

where \(\Vert \cdot \Vert ^\circ \) is the dual norm of \(\Vert \cdot \Vert \), i.e., another norm defined by \(\Vert \varvec{w}\Vert ^\circ :=\max _{\varvec{x}}\{\varvec{x}^\top \varvec{w}:\Vert \varvec{x}\Vert \le {1}\}\).

In this subsection, we derive similar results for the case of monotonic and translation invariant risk functions. Note that Hinge1, which Xu et al. (2009a) employed, is not covered here since it is not translation invariant. In place of \(\mathcal T\), we consider the following uncertainty set.

$$\begin{aligned} \mathcal{S}:=\{(\varvec{\delta }_1,\ldots ,\varvec{\delta }_m):\Vert \varvec{\delta }_i\Vert \le C,i=1,\ldots ,m\}. \end{aligned}$$

Note that \(\mathcal{S}\) is called the box uncertainty in Xu et al. (2009a), and \(\mathcal{S}\supset \mathcal{T}\) holds for the same C.

Theorem 2

Let the function \(\mathcal{{F}}\) be monotonic and translation invariant as well as proper, l.s.c. and convex. Then, for any \((\varvec{w},b)\), we have

$$\begin{aligned} \max _{\varvec{\Delta }\in \mathcal{S}}~ \mathcal{{F}}(-\varvec{Y}\{(\varvec{X}-\varvec{\Delta })\varvec{w}-\varvec{1}b\})=C\Vert \varvec{w}\Vert ^\circ +\mathcal{{F}}(-\varvec{Y}(\varvec{X}\varvec{w}-\varvec{1}b)), \end{aligned}$$
(14)

where \(\varvec{\Delta }=(\varvec{\delta }_1,\ldots ,\varvec{\delta }_m)^\top \in {\mathbb {R}}^{m\times n}\).

See Proof of Theorem 2 in “Appendix” for the proof.

Theorem 2 shows that the Tikhonov regularization \(\gamma (\varvec{w})=\Vert \varvec{w}\Vert \) can be interpreted as a consequence of robustifying the (non-regularized) ERM, not only for Hinge1 but also for a variety of other risk functions satisfying monotonicity and translation invariance. In particular, it is interesting to note that both approaches induce the same Tikhonov-type regularizer if the same norm is employed for defining the uncertainty sets \(\mathcal{S}\) and \(\mathcal{T}\). In this sense, the use of a norm in the Tikhonov regularization can be viewed simply as a consequence of the choice of uncertainty set. Moreover, Theorem 2 derives the same regularizer in a simpler way, under a class of risk functions and a larger uncertainty set.
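Identity (14) can also be checked numerically for a concrete choice of \(\mathcal{{F}}\). The sketch below (our own construction, using the self-dual \(\ell _2\)-norm and the CVaR risk with uniform \(\varvec{p}\)) builds the worst-case perturbation in \(\mathcal{S}\) explicitly, \(\varvec{\delta }_i=C\,y_i\,\varvec{w}/\Vert \varvec{w}\Vert _2\), which raises every loss component by \(C\Vert \varvec{w}\Vert _2\), and compares the two sides of (14):

```python
import numpy as np

def cvar(L, alpha):
    """CVaR_(alpha, 1/m)(L) via its minimization form; the minimizing c is one of the L_i."""
    return min(c + np.mean(np.maximum(L - c, 0)) / (1 - alpha) for c in L)

rng = np.random.default_rng(3)
m, n = 50, 4
X = rng.normal(size=(m, n))
y = np.where(rng.normal(size=m) >= 0, 1.0, -1.0)
w, b, C, alpha = rng.normal(size=n), 0.2, 0.7, 0.8

L = -y * (X @ w - b)

# Worst-case perturbation within the box S for the l2-norm: delta_i = C * y_i * w / ||w||_2.
Delta = C * np.outer(y, w) / np.linalg.norm(w)
L_worst = -y * ((X - Delta) @ w - b)

lhs = cvar(L_worst, alpha)                     # worst case over S, attained at Delta above
rhs = C * np.linalg.norm(w) + cvar(L, alpha)   # regularized form, right-hand side of (14)
print(lhs, rhs)                                # the two values coincide
```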

On the other hand, there is a noteworthy fact in light of Proposition 2. We see that if \(\mathcal {F}\) is coherent (i.e., positively homogeneous in addition to the two properties supposed in Theorem 2), the unconstrained minimization of the worst-case empirical risk (14) is not adequate for the classification task. A similar generalization is also considered by Livni et al. (2012) on the basis of a probabilistic interpretation. However, their formulation cannot deal with positively homogeneous risk functions for this reason.

To make sense of the employment of (14) as the empirical risk term when \(\mathcal {F}\) is coherent, the addition of an Ivanov regularizer or non-homogeneous Tikhonov regularizer is required. With this view, a formulation

$$\begin{aligned} \frac{1}{2}\Vert \varvec{w}\Vert _2^2+C\Vert \varvec{w}\Vert +{\mathrm{CVaR}}_{(\alpha ,\varvec{p})}(-\varvec{Y}(\varvec{X}\varvec{w}-\varvec{1}b)), \end{aligned}$$

for example, makes sense in the light of Proposition 2 and can be viewed as a robust version of the \(\nu \)-SVM. It is noteworthy that a composite regularizer of the form \(\frac{1}{2}\Vert \varvec{w}\Vert _2^2+C\Vert \varvec{w}\Vert _1\) is known as the elastic net-type regularizer in the machine learning community and is also used for SVMs.

4 Dual formulation and Fenchel duality

Dual SVM formulations are frequently considered. For example, the dual problem to the C-SVM (6) is given by another QP:

$$\begin{aligned} \begin{array}{l@{~}ll} \bar{d}^\star :=&{}\underset{\varvec{\lambda }}{\text{ maximize }} &{} \varvec{1}^\top \varvec{\lambda }-\frac{1}{2}\varvec{\lambda }^\top \varvec{Y}\varvec{X}\varvec{X}^\top \varvec{Y}\varvec{\lambda }\\ &{}\text{ subject } \text{ to } &{} \varvec{y}^\top \varvec{\lambda }=0,~\varvec{0}\le \varvec{\lambda }\le \frac{C}{m}\varvec{1},\\ \end{array} \end{aligned}$$
(15)

Strong duality between (6) and (15), i.e., \(\bar{p}^\star =\bar{d}^\star \), holds under a mild condition. More importantly, with the optimality condition, which will be described later in Theorem 4, we have

$$\begin{aligned} \varvec{w}^\star =\varvec{X}^\top \varvec{Y}\varvec{\lambda }^\star , \end{aligned}$$
(16)

where \(\varvec{w}^\star \) is an optimal solution to (6) and \(\varvec{\lambda }^\star \) is an optimal solution to (15). This equation leads to the so-called representer theorem, which provides a building block for the kernel-based nonlinear classification (e.g., Burges 1998). In fact, putting the condition (16) into (1), the decision function (1) can be rewritten with the optimal dual variables \(\varvec{\lambda }^\star \) as \(d(\varvec{x})={\mathrm{sign}}(\varvec{x}^\top \varvec{X}^\top \varvec{Y}\varvec{\lambda }^\star -b^\star )\). See e.g., Chen et al. (2005) for the calculation of \(b^\star \) on the basis of the dual solution \(\varvec{\lambda }^\star \).
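A quick numerical sanity check of the strong duality \(\bar{p}^\star =\bar{d}^\star \) and of the parallelism (16) can be obtained by solving (6) and (15) side by side. The sketch assumes CVXPY; the data are synthetic and the names are ours. In the dual, we use \(\varvec{\lambda }^\top \varvec{Y}\varvec{X}\varvec{X}^\top \varvec{Y}\varvec{\lambda }=\Vert \varvec{X}^\top \varvec{Y}\varvec{\lambda }\Vert _2^2\).

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
m, n, C = 30, 3, 1.0
X = rng.normal(size=(m, n))
y = np.where(X[:, 0] - X[:, 1] >= 0, 1.0, -1.0)
G = (X * y[:, None]).T                 # G = X^T Y  (n x m)

# Primal C-SVM (6).
w, b, z = cp.Variable(n), cp.Variable(), cp.Variable(m)
primal = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + (C / m) * cp.sum(z)),
                    [z >= -cp.multiply(y, X @ w - b) + 1, z >= 0])
primal.solve()

# Dual C-SVM (15).
lam = cp.Variable(m)
dual = cp.Problem(cp.Maximize(cp.sum(lam) - 0.5 * cp.sum_squares(G @ lam)),
                  [y @ lam == 0, lam >= 0, lam <= C / m])
dual.solve()

print(primal.value, dual.value)                        # strong duality: the values agree
print(np.allclose(w.value, G @ lam.value, atol=1e-4))  # parallelism (16): w* = X^T Y lambda*
```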

On the other hand, the dual formulation of the \(\nu \)-SVM (7) is given by another QP:

$$\begin{aligned} \begin{array}{l@{~}ll} \tilde{d}^\star :=&{}\underset{\varvec{\lambda }}{\text{ maximize }} &{} -\frac{1}{2}\varvec{\lambda }^\top \varvec{Y}\varvec{X}\varvec{X}^\top \varvec{Y}\varvec{\lambda }\\ &{}\text{ subject } \text{ to } &{} \varvec{y}^\top \varvec{\lambda }=0,~\varvec{1}^\top \varvec{\lambda }=1,~\varvec{0}\le \varvec{\lambda }\le \frac{1}{m\nu }\varvec{1},\\ \end{array} \end{aligned}$$
(17)

Let \((\varvec{w}^\star ,b^\star ,\rho ^\star ,\varvec{z}^\star )\) and \(\varvec{\lambda }^\star \) be optimal solutions to (7) and (17), respectively. Similarly to the C-SVM, under a mild condition, we have the strong duality between (7) and (17), i.e., \(\tilde{p}^\star =\tilde{d}^\star \), and the parallelism (16) between \(\varvec{w}^\star \) and \(\varvec{X}^\top \varvec{Y}\varvec{\lambda }^\star \), again. Likewise, we can obtain a decision function on the basis of a dual solution to (17).

Note that the dual formulations and the optimality conditions for the \(\ell _2\)-regularized SVMs, such as (15) and (17), are often derived via the Lagrangian duality theory (see e.g., Burges 1998). In contrast, as shown below, the Fenchel duality theory is beneficial for deriving dual formulations and optimality conditions under any combination of \(\mathcal{{F}}\) and \(\gamma \).

4.1 Formulations and duality of convex risk function-based SVMs

The dual problem to the general SVM (5) is derived as

$$\begin{aligned} d^\star :=\underset{\varvec{\lambda }}{\sup }\, -\gamma ^*(\varvec{G}\varvec{\lambda })-\mathcal{{F}}^*(\varvec{\lambda })-\delta _{\varvec{y}^\top (\cdot )=0}(\varvec{\lambda }), \end{aligned}$$
(18)

where \(\gamma ^*\) and \(\mathcal{{F}}^*\) denote the conjugate functions of \(\gamma \) and \(\mathcal{{F}}\), respectively, namely,

$$\begin{aligned} \gamma ^*(\varvec{w}):=\sup _{\varvec{s}}\{\varvec{w}^\top \varvec{s}-\gamma (\varvec{s})\},\quad \mathcal{{F}}^*(\varvec{\lambda }):=\sup _{\varvec{L}}\{\varvec{\lambda }^\top \varvec{L}-\mathcal{{F}}(\varvec{L})\}. \end{aligned}$$
(19)

Since \(\mathcal{{F}}\) and \(\gamma \) are proper l.s.c. convex functions, both \(\gamma ^*\) and \(\mathcal{{F}}^*\) are proper, l.s.c. and convex (e.g., Section 12 of Rockafellar 1970), and the dual (18) is a convex optimization problem. Table 1 lists conjugates of the aforementioned risk functions. As for the regularizers (a) to (c) introduced in Sect. 2.3, we have the following conjugate relations.

$$\begin{aligned} \begin{array}{llll} \mathrm{(a)}~ &{} \gamma (\varvec{w})=\frac{1}{2}\Vert \varvec{w}\Vert _2^2 &{} \, \leftrightarrow \, &{} \gamma ^*(\varvec{w})=\frac{1}{2}\Vert \varvec{w}\Vert _2^2,\\ \mathrm{(b)}~ &{} \gamma (\varvec{w})=\Vert \varvec{w}\Vert &{} \, \leftrightarrow \, &{} \gamma ^*(\varvec{w})=\delta _{\Vert \cdot \Vert ^\circ \le 1}(\varvec{w}),\\ \mathrm{(c)}~ &{} \gamma (\varvec{w})=\delta _{\Vert \cdot \Vert \le 1}(\varvec{w}) &{} \, \leftrightarrow \, &{} \gamma ^*(\varvec{w})=\Vert \varvec{w}\Vert ^\circ . \end{array} \end{aligned}$$

Note especially that the conjugate of \(\gamma \) also becomes a regularizer, i.e., satisfies the condition (10), if \(\gamma \) is a regularizer. In this sense, the squared \(\ell _2\)-regularizer (a) is self-dual, while the Tikhonov regularizer (b) and the Ivanov regularizer (c) are dual to each other.
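For readability, we record the short standard computation behind relation (b); it uses only the definition (19) and the generalized Cauchy–Schwarz inequality \(\varvec{w}^\top \varvec{s}\le \Vert \varvec{w}\Vert ^\circ \Vert \varvec{s}\Vert \):

$$\begin{aligned} \gamma ^*(\varvec{w})=\sup _{\varvec{s}}\{\varvec{w}^\top \varvec{s}-\Vert \varvec{s}\Vert \} =\left\{ \begin{array}{ll} 0, &{} \text{ if } \Vert \varvec{w}\Vert ^\circ \le 1,\\ +\infty , &{} \text{ otherwise },\\ \end{array} \right. \end{aligned}$$

since \(\Vert \varvec{w}\Vert ^\circ \le 1\) makes the supremum 0 (attained at \(\varvec{s}=\varvec{0}\)), while \(\Vert \varvec{w}\Vert ^\circ >1\) yields some \(\varvec{s}\) with \(\varvec{w}^\top \varvec{s}>\Vert \varvec{s}\Vert \), which, scaled up, drives the supremum to \(+\infty \). Relation (c) then follows by biconjugation (\(\gamma ^{**}=\gamma \) for proper l.s.c. convex \(\gamma \)).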

Table 1 Conjugates of the example convex risk functions

Obviously the known dual formulations such as (15) and (17) can be readily derived just by applying the above established patterns of conjugation. For example, with \({\mathrm{Hinge1}}^*_{(t,\varvec{p})}\) and \(\gamma ^*(\varvec{w})=\frac{1}{2}\Vert \varvec{w}\Vert _2^2\), we can reach the dual (15) of the C-SVM (6).

Given a pair of the primal and dual formulations, (5) and (18), respectively, we can describe the weak and strong duality theorems, as follows.

Proposition 3

(Weak duality) The weak duality holds between (5) and (18), i.e., we have \(p^\star \ge d^\star \).

Theorem 3

(Strong duality) The strong duality holds between (5) and (18), i.e., we have \(p^\star =d^\star \), if either of the following conditions is satisfied:

  1. (a)

    There exists a \((\varvec{w},b)\) such that \(\varvec{w}\in {{\mathrm{ri}}}({{\mathrm{dom}}}\,\gamma )\) and \(-\varvec{G}^\top \varvec{w}+\varvec{y}b\in {{\mathrm{ri}}} ({{\mathrm{dom}}}\,\mathcal{{F}})\).

  2. (b)

    There exists a \(\varvec{\lambda }\in {{\mathrm{ri}}}({{\mathrm{dom}}} \mathcal{{F}}^*)\) such that \(\varvec{y}^\top \varvec{\lambda }=0\).

Under (a), the supremum in (18) is attained at some \(\varvec{\lambda }\), while under (b), the infimum in (5) is attained at some \((\varvec{w},b)\). In addition, if \(\mathcal{{F}}\) (or equivalently, \(\mathcal{{F}}^*\)) is polyhedral, “ri” can be omitted.

Proposition 3 follows straightforwardly from Fenchel's inequality (see Sections 12 and 31 of Rockafellar 1970). Theorem 3 can be obtained from the Fenchel–Rockafellar duality theorem. See the proof of Theorem 3 and the modification of condition (a) for the Ivanov regularization in “Appendix” for details.

4.2 Duality correspondence for the case with the Ivanov regularizers

Regarding the incompatibility between the Tikhonov regularization of the form \(\gamma (\varvec{w})=\Vert \varvec{w}\Vert \) and the positively homogeneous risk function \(\mathcal {F}\) (Proposition 2), let us consider the SVM with the Ivanov regularization. The primal (5) and the dual (18) then become

$$\begin{aligned} \begin{array}{l@{~}ll} p^\star :=&{}\underset{\varvec{w},b}{\text{ minimize }} &{} \mathcal{{F}}(-\varvec{G}^\top \varvec{w}+\varvec{y}b) \\ &{}\text{ subject } \text{ to } &{} \Vert \varvec{w}\Vert \le 1, \\ \end{array} \end{aligned}$$
(20)

and

$$\begin{aligned} \begin{array}{l@{~}ll} d^\star :=&{}\underset{\varvec{\lambda }}{\text{ maximize }} &{} -\Vert \varvec{G}\varvec{\lambda }\Vert ^\circ -\mathcal{{F}}^*(\varvec{\lambda }) \\ &{}\text{ subject } \text{ to } &{} \varvec{y}^\top \varvec{\lambda }=0,\\ \end{array} \end{aligned}$$
(21)

respectively. We denote by \((\mathcal{{F}},\Vert \cdot \Vert )\) the pair of the primal and dual formulations (20) and (21) for an SVM.

For the Ivanov regularization case, the condition (a) of Theorem 3 can be a little more specific.

  1. (a)

    There exists a \((\varvec{w},b)\) such that \(\Vert \varvec{w}\Vert <1\) and \(-\varvec{G}^\top \varvec{w}+\varvec{y}b\in {\mathrm{ri}}({\mathrm{dom}}\,\mathcal{{F}})\).

With the help of the Fenchel duality, the optimality conditions can be derived in a similar manner. When the Tikhonov-type \(\ell _2\)-regularization, \(\gamma (\varvec{w})=\frac{1}{2}\Vert \varvec{w}\Vert _2^2\), is employed, the condition (16) is derived. As for the Ivanov regularization case, the condition is derived as follows.

Theorem 4

(Optimality condition) Suppose (20) and (21). In order that \((\varvec{w}^\star ,b^\star )\) and \(\varvec{\lambda }^\star \) be vectors such that

$$\begin{aligned} \mathcal{{F}}(-(\varvec{G}^\top \varvec{w}^\star -\varvec{y}b^\star ))+\delta _{\Vert \cdot \Vert \le 1}(\varvec{w}^\star )= -\Vert \varvec{G}\varvec{\lambda }^\star \Vert ^\circ -\mathcal{{F}}^*(\varvec{\lambda }^\star )-\delta _{\varvec{y}^\top (\cdot )=0}(\varvec{\lambda }^\star ), \end{aligned}$$

it is necessary and sufficient that \((\varvec{w}^\star ,b^\star )\) and \(\varvec{\lambda }^\star \) satisfy the conditions:

$$\begin{aligned} \varvec{G}\varvec{\lambda }^\star \in \mathcal{N}(\varvec{w}^\star ), \quad \Vert \varvec{w}^\star \Vert \le 1, \quad \varvec{y}^\top \varvec{\lambda }^\star =0,\quad -\varvec{G}^\top \varvec{w}^\star +\varvec{y}b^\star \in \partial \mathcal{{F}}^*(\varvec{\lambda }^\star ), \end{aligned}$$
(22)

where \(\mathcal{N}(\varvec{w}^\star ):=\{\varvec{u}:\varvec{u}^\top \varvec{w}^\star =\Vert \varvec{u}\Vert ^\circ \}\), and \(\partial \mathcal{{F}}^*(\varvec{\lambda }^\star )\) is the subdifferential of \(\mathcal{{F}}^*\) at \(\varvec{\lambda }^\star \), i.e., \(\partial \mathcal{{F}}^*(\varvec{\lambda }^\star ):=\{\varvec{L}:\mathcal{{F}}^*(\varvec{\lambda })\ge \mathcal{{F}}^*(\varvec{\lambda }^\star )+\varvec{L}^\top (\varvec{\lambda }-\varvec{\lambda }^\star ) \text{ for } \text{ all } \varvec{\lambda }\}\).

This theorem is also straightforward from Theorem 31.3 of Rockafellar (1970). See the appendix for the detailed correspondence.

Note that the first and the second conditions in (22) can be rewritten by

$$\begin{aligned} \varvec{w}^\star \in \arg \max _{\varvec{w}} \{(\varvec{\lambda }^\star )^\top \varvec{G}^\top \varvec{w}:\Vert \varvec{w}\Vert \le 1\}. \end{aligned}$$

In particular, if we employ the \(\ell _2\)-norm Ivanov regularization, this condition implies

$$\begin{aligned} \varvec{w}^\star =\frac{\varvec{G}\varvec{\lambda }^\star }{\Vert \varvec{G}\varvec{\lambda }^\star \Vert _2}. \end{aligned}$$
(23)

This condition, which claims a parallelism between \(\varvec{w}^\star \) and \(\varvec{G}\varvec{\lambda }^\star \), corresponds to the one given in (16). Accordingly, as long as the \(\ell _2\)-norm is employed, the two regularizations result in the same decision function.

In contrast, if we employ a non-\(\ell _2\)-norm, we have to pay attention to the deviation from the parallelism (23). See Section 6 of Gotoh and Uryasev (2013) for a discussion of the proximity of the parallelism for a parametrized class of LP-representable norms.

Example 1

Employing \(\mathcal{{F}}=\text{ LSE }_{(t,\varvec{p})}\), we have an SVM \(({\mathrm{LSE}}_{(t,\varvec{p})},\Vert \cdot \Vert )\), where its dual is obtained as

$$\begin{aligned} \begin{array}{l@{~}ll} d^\star :=&{}\underset{\varvec{\lambda }}{\text{ maximize }} &{} -\Vert \varvec{G}\varvec{\lambda }\Vert ^\circ -\frac{1}{t}\varvec{\lambda }^\top \ln (\varvec{\lambda }./\varvec{p})\\ &{}\text{ subject } \text{ to } &{} \varvec{y}^\top \varvec{\lambda }=0,~\varvec{1}^\top \varvec{\lambda }=1,~\varvec{\lambda }\ge \varvec{0},\\ \end{array} \end{aligned}$$
(24)

Let us consider the optimality condition (22) for \(({\mathrm{LSE}}_{(t,\varvec{p})},\Vert \cdot \Vert )\) in (24). Noting that at any \(\varvec{\lambda }\in {\mathrm{ri}}(\Pi ^m)\), the function \(\mathcal{{F}}^*(\varvec{\lambda })={\mathrm{KL}}_{(t,\varvec{p})}(\varvec{\lambda })=\frac{1}{t}\varvec{\lambda }^\top \ln (\varvec{\lambda }./\varvec{p})+\delta _{{\mathrm{ri}}(\Pi ^m)}(\varvec{\lambda })\) has the subdifferential \(\partial \mathcal{{F}}^*(\varvec{\lambda })=\{\frac{1}{t}(\ln (\varvec{\lambda }./\varvec{p})+\varvec{1})+k\varvec{1}:k\in {\mathbb {R}}\}\), the optimality condition is explicitly given by

$$\begin{aligned} \begin{array}{l} (\varvec{\lambda }^\star )^\top \varvec{G}^\top \varvec{w}^\star =\Vert \varvec{G}\varvec{\lambda }^\star \Vert ^\circ ,\quad \Vert \varvec{w}^\star \Vert \le 1,\quad \varvec{y}^\top \varvec{\lambda }^\star =0,\\ -\varvec{G}^\top \varvec{w}^\star +\varvec{y}b^\star =\frac{1}{t}(\ln (\varvec{\lambda }^\star ./\varvec{p})+\varvec{1})+\varvec{1}k^\star ,\quad \varvec{1}^\top \varvec{\lambda }^\star =1,\quad \varvec{\lambda }^\star >\varvec{0}. \end{array} \end{aligned}$$

Furthermore, consider the situation where the \(\ell _2\)-norm is employed in \(({\mathrm{LSE}}_{(t,\varvec{p})},\Vert \cdot \Vert _2)\), and there exists a solution \(\varvec{\lambda }^\star >\varvec{0}\) such that \(\Vert \varvec{G}\varvec{\lambda }^\star \Vert _2>0\), then we can find an optimal solution by solving a system of \(n+m+2\) equalities:

$$\begin{aligned} \varvec{w}^\star =\frac{\varvec{G}\varvec{\lambda }^\star }{\Vert \varvec{G}\varvec{\lambda }^\star \Vert _2}, \quad \varvec{y}^\top \varvec{\lambda }^\star =0, \quad -\varvec{G}^\top \varvec{w}^\star +\varvec{y}b^\star =\frac{1}{t}(\ln (\varvec{\lambda }^\star ./\varvec{p})+\varvec{1})+\varvec{1}k^\star , \quad \varvec{1}^\top \varvec{\lambda }^\star =1, \end{aligned}$$

and the optimal decision function is given by \(d(\varvec{x})={\mathrm{sign}}(\frac{\varvec{x}^\top \varvec{G}\varvec{\lambda }^\star }{\Vert \varvec{G}\varvec{\lambda }^\star \Vert _2}-b^\star )\). \(\square \)
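Since the system of \(n+m+2\) equalities above is smooth in \((\varvec{w},b,\varvec{\lambda },k)\) once \(\varvec{\lambda }>\varvec{0}\) is enforced, a generic root finder can be applied to it directly. The following is a minimal numerical sketch of this step; the synthetic Gaussian data, the values \(t=5\) and \(\varvec{p}=\varvec{1}/m\), and the use of scipy.optimize.fsolve are illustrative assumptions, and convergence from the chosen starting point is not guaranteed.

```python
import numpy as np
from scipy.optimize import fsolve

# Synthetic two-class data (illustrative assumption): rows of X are samples.
rng = np.random.default_rng(0)
n, m, t = 2, 20, 5.0
X = np.vstack([rng.normal(+1.0, 1.0, (m // 2, n)),   # class +1
               rng.normal(-1.0, 1.0, (m // 2, n))])  # class -1
y = np.r_[np.ones(m // 2), -np.ones(m // 2)]
p = np.full(m, 1.0 / m)        # reference probability measure
G = X.T * y                    # G = X^T Y, shape (n, m); column i is y_i * x_i

def residuals(z):
    # Unknowns: w (n), b (1), k (1), and log(lambda) (m) to keep lambda > 0.
    w, b, k = z[:n], z[n], z[n + 1]
    lam = np.exp(z[n + 2:])
    Gl = G @ lam
    r1 = w - Gl / np.linalg.norm(Gl)                         # w = G lam / ||G lam||_2
    r2 = np.array([y @ lam])                                 # y^T lam = 0
    r3 = -G.T @ w + y * b - (np.log(lam / p) + 1.0) / t - k  # stationarity in lam
    r4 = np.array([lam.sum() - 1.0])                         # 1^T lam = 1
    return np.concatenate([r1, r2, r3, r4])                  # n + m + 2 equations

z0 = np.concatenate([np.zeros(n + 2), np.log(p)])            # start from lambda = p
z_star = fsolve(residuals, z0)
w_star, b_star = z_star[:n], z_star[n]
print("decision function: sign(x^T w* - b*) with w* =", w_star, "b* =", b_star)
```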

5 Perspectives on the dual formulation from various viewpoints

In this section, we demonstrate connections between the dual formulation (18) and the existing literature, basing our arguments on the correspondence between the dual formulation and the properties of the risk function \(\mathcal{{F}}\).

5.1 Correspondence between risk function properties and dual formulations

By using the conjugate of \(\mathcal{{F}}\), monotonicity, translation invariance, and positive homogeneity can be characterized in a dual manner as follows.

Theorem 5

(Ruszczyński and Shapiro 2006) Suppose that \(\mathcal{{F}}\) is l.s.c., proper, and convex. Then we have

  1.

    \(\mathcal{{F}}\) is monotonic if and only if dom \(\mathcal{{F}}^*\) is in the nonnegative orthant;

  2.

    \(\mathcal{{F}}\) is translation invariant if and only if \(\forall \varvec{\lambda }\in {\mathrm{dom}}\,\mathcal{{F}}^*\), \(\varvec{1}^\top \varvec{\lambda }=1\);

  3.

    \(\mathcal{{F}}\) is positively homogeneous if and only if it can be represented in the form

    $$\begin{aligned} \mathcal{{F}}(\varvec{L})=\sup _{\varvec{\lambda }} \{ \varvec{L}^\top \varvec{\lambda } : \varvec{\lambda }\in {\mathrm{dom}}\,\mathcal{{F}}^* \}, \end{aligned}$$
    (25)

    or equivalently, \(\mathcal{{F}}^*(\varvec{\lambda })=\delta _\mathcal{Q}(\varvec{\lambda })\) for a convex set \(\mathcal{Q}\) in \({\mathbb {R}}^m\).

Note that the first and second statements of Theorem 5 imply the following expressions, and combining them yields the third.

  1.

    \(\mathcal{{F}}\) is monotonic if and only if \(\mathcal{{F}}(\varvec{L})=\sup _{\varvec{\lambda }}\{\varvec{L}^\top \varvec{\lambda }-\mathcal{{F}}^*(\varvec{\lambda }):\varvec{\lambda }\ge \varvec{0}\}\);

  2.

    \(\mathcal{{F}}\) is translation invariant if and only if \(\mathcal{{F}}(\varvec{L})=\sup _{\varvec{\lambda }}\{\varvec{L}^\top \varvec{\lambda }-\mathcal{{F}}^*(\varvec{\lambda }):\varvec{1}^\top \varvec{\lambda }=1\}\);

  3.

    \(\mathcal{{F}}\) is monotonic and translation invariant if and only if \(\mathcal{{F}}(\varvec{L})=\sup _{\varvec{\lambda }}\{\varvec{L}^\top \varvec{\lambda }-\mathcal{{F}}^*(\varvec{\lambda }):\varvec{\lambda }\in \Pi ^m\}\).

From Theorem 5, we see that dom \(\mathcal{{F}}^*\) plays an important role in characterizing risk functions. Let us denote this effective domain by \(\mathcal{Q}_\mathcal{{F}}\) and call it the risk envelope, i.e., \(\mathcal{Q}_\mathcal{{F}}= {\mathrm{dom}}\,\mathcal{{F}}^*\). In particular, combining this notion with Theorem 5, any coherent risk function can be characterized by a set of probability measures.

Corollary 2

(Artzner et al. 1999) For any coherent risk function \(\mathcal{{F}}\), we have \(\mathcal{Q}_\mathcal{{F}}\subset \Pi ^m\). On the other hand, for any set \(\mathcal{Q}\subset \Pi ^m\), the risk function defined as \(\mathcal{{F}}(\varvec{L}):=\sup \{\varvec{L}^\top \varvec{\lambda }:\varvec{\lambda }\in \mathcal{Q}\}=\sup \{\varvec{L}^\top \varvec{\lambda }:\varvec{\lambda }\in {\mathrm{conv}}(\mathcal{Q})\}=\max \{\varvec{L}^\top \varvec{\lambda }:\varvec{\lambda }\in {\mathrm{cl}}({\mathrm{conv}}(\mathcal{Q}))\}\) is coherent, where \({\mathrm{conv}}(\mathcal{Q})\) denotes the convex hull of a set \(\mathcal Q\) and \({\mathrm{cl}}(\mathcal{Q})\) denotes the closure of a set \(\mathcal Q\).

For example, CVaR is coherent and can be represented with the risk envelope

$$\begin{aligned} \mathcal{Q}_\mathcal{{F}}=\mathcal{Q}_{{\mathrm{CVaR}}(\alpha ,\varvec{p})}:=\left\{ \varvec{q}\in \Pi ^m: \varvec{q}\le \varvec{p}/(1-\alpha )\right\} , \end{aligned}$$
(26)

i.e., \({\mathrm{CVaR}}_{(\alpha ,\varvec{p})}(\varvec{L})=\max _{\varvec{q}}\{{\mathbb {E}}_{\varvec{q}}(\varvec{L}):\varvec{q}\in \mathcal{Q}_{{\mathrm{CVaR}}(\alpha ,\varvec{p})}\}\).
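As a concrete illustration of (26), take the uniform measure \(\varvec{p}=\varvec{1}/m\) and \(\alpha =1-k/m\) for an integer \(k\in \{1,\ldots ,m\}\); then the envelope is a box-constrained simplex and the maximization simply averages the k largest losses:

$$\begin{aligned} {\mathrm{CVaR}}_{(1-k/m,\varvec{1}/m)}(\varvec{L})=\max _{\varvec{q}}\left\{ \varvec{L}^\top \varvec{q}:\varvec{q}\in \Pi ^m,~\varvec{q}\le \tfrac{1}{k}\varvec{1}\right\} =\frac{1}{k}\sum _{i=1}^{k}L_{(i)}, \end{aligned}$$

where \(L_{(1)}\ge \cdots \ge L_{(m)}\) denote the components of \(\varvec{L}\) sorted in nonincreasing order.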

Based on Theorem 5, we can associate the constraints of dual formulations with the properties of the risk functions \(\mathcal{{F}}\) employed in the primal formulation (5).

Proposition 4

  1.

    If \(\mathcal{{F}}\) is monotonic, the dual problem (18) can be represented as

    $$\begin{aligned} \underset{\varvec{\lambda }}{\sup }~-\gamma ^*(\varvec{G}\varvec{\lambda })-\mathcal{{F}}^*(\varvec{\lambda }) -\delta _{C}(\varvec{\lambda }) \text{ with } C=\{\varvec{\lambda }\in {\mathbb {R}}^m:\varvec{y}^\top \varvec{\lambda }=0,\varvec{\lambda }\ge \varvec{0}\}; \end{aligned}$$
  2.

    If \(\mathcal{{F}}\) is translation invariant, the dual problem (18) can be represented as

    $$\begin{aligned} \underset{\varvec{\lambda }}{\sup }~-\gamma ^*(\varvec{G}\varvec{\lambda })-\mathcal{{F}}^*(\varvec{\lambda }) -\delta _{C}(\varvec{\lambda }) \text{ with } C=\{\varvec{\lambda }\in {\mathbb {R}}^m:\varvec{y}^\top \varvec{\lambda }=0,\varvec{1}^\top \varvec{\lambda }=1\}; \end{aligned}$$
  3.

    If \(\mathcal{{F}}\) is positively homogeneous, the dual problem (18) can be represented as

    $$\begin{aligned} \underset{\varvec{\lambda }}{\sup }~-\gamma ^*(\varvec{G}\varvec{\lambda })- \delta _{C}(\varvec{\lambda }) \text{ with } C=\{\varvec{\lambda }\in {\mathbb {R}}^m:\varvec{y}^\top \varvec{\lambda }=0\}\cap \mathcal{Q}_\mathcal{{F}}. \end{aligned}$$

Example 2

If we limit our attention to positively homogeneous risk functions and the Ivanov regularization, the dual formulation (21) can be simplified with the help of its risk envelope \(\mathcal{Q}_\mathcal{{F}}\):

$$\begin{aligned} \underset{\varvec{w},b}{\text{ minimize }}~\mathcal{{F}}(-(\varvec{G}^\top \varvec{w}-\varvec{y}b)) \quad \text{ subject } \text{ to } \quad \Vert \varvec{w}\Vert \le 1, \end{aligned}$$
(27)

and

$$\begin{aligned} \underset{\varvec{\lambda }}{\text{ maximize }}~-\Vert \varvec{G}\varvec{\lambda }\Vert ^\circ \quad \text{ subject } \text{ to } \quad \varvec{y}^\top \varvec{\lambda }=0,~\varvec{\lambda }\in \mathcal{Q}_\mathcal{{F}}. \end{aligned}$$
(28)

There is a symmetric dual correspondence between the primal (27) and the dual (28). Precisely, the primal (27) has a norm constraint with \(\Vert \cdot \Vert \) while its dual norm \(\Vert \cdot \Vert ^\circ \) appears in the objective of the dual (28); the positively homogeneous convex risk function \(\mathcal{{F}}\) in the primal’s objective corresponds to its risk envelope \(\mathcal{Q}_\mathcal{{F}}\) in the dual’s constraint.

Correspondingly, the condition (b) of Theorem 3 can be replaced with

  (b)

    There exists a \(\varvec{\lambda }\) such that \(\varvec{y}^\top \varvec{\lambda }=0\) and \(\varvec{\lambda }\in {\mathrm{ri}}\mathcal{Q}_\mathcal{{F}}\),

and the optimality condition (22) of Theorem 4 can be rewritten as follows:

$$\begin{aligned} \varvec{G}\varvec{\lambda }^\star \in \mathcal{N}(\varvec{w}^\star ), \Vert \varvec{w}^\star \Vert \le 1, \varvec{y}^\top \varvec{\lambda }^\star =0, -\varvec{G}^\top \varvec{w}^\star +\varvec{y}b^\star \in N_{\mathcal{Q}_\mathcal{{F}}}(\varvec{\lambda }^\star ), \varvec{\lambda }^\star \in \mathcal{Q}_\mathcal{{F}}, \end{aligned}$$
(29)

where \(N_{\mathcal{Q}_\mathcal{{F}}}(\varvec{\lambda }^\star )\) denotes the normal cone to the set \(\mathcal{Q}_\mathcal{{F}}\) at a point \(\varvec{\lambda }^\star \in \mathcal{Q}_\mathcal{{F}}\), i.e., \(N_{\mathcal{Q}_\mathcal{{F}}}(\varvec{\lambda }^\star ):=\{\varvec{L}\in {\mathbb {R}}^m:\varvec{L}^\top (\varvec{\lambda }-\varvec{\lambda }^\star )\le 0, \text{ for } \text{ all } \varvec{\lambda }\in \mathcal{Q}_\mathcal{{F}}\}\).

The change in the final part of (29) comes from the fact that the subdifferential of the indicator function of a non-empty convex set is given by the normal cone to it (see p. 215 of Rockafellar 1970, for the details of the subdifferential of the indicator function). \(\square \)

Remark 2

The primal formulation (27) with \(\mathcal{Q}_\mathcal{{F}}\subset \Pi ^m\) is a convex relaxation of the formulation developed by Gotoh et al. (2014), where the negative geometric margin and a coherent risk function are employed as the loss and the risk measure, respectively. On the other hand, the dual formulation is not mentioned in their paper since their formulation involves some nonconvexity. \(\square \)

If monotonicity and translation invariance are simultaneously imposed on \(\mathcal {F}\), the dual variable \(\varvec{\lambda }\) can be considered as a probability measure, i.e., \(\varvec{\lambda }\in \Pi ^m\).

Corollary 3

If \(\mathcal{{F}}\) is monotonic and translation invariant, the dual problem (18) can be rewritten as

$$\begin{aligned} \underset{\varvec{\lambda }}{\sup }~-\gamma ^*(\varvec{G}\varvec{\lambda })-\mathcal{{F}}^*(\varvec{\lambda })-\delta _{C}(\varvec{\lambda }) \text{ with } C=\{\varvec{\lambda }\in {\mathbb {R}}^m:\varvec{y}^\top \varvec{\lambda }=0\}\cap \Pi ^m. \end{aligned}$$

Furthermore, if \(\mathcal{{F}}\) is coherent, the third statement of Proposition 4 is valid with \(\mathcal{Q}_\mathcal{{F}}\) such that \(\mathcal{Q}_\mathcal{{F}}\subset \Pi ^m\).

To deepen this probabilistic view, let us introduce the \(\varphi \)-divergence (Csiszár 1967; Ben-Tal and Teboulle 2007). Let \(\varphi :{\mathbb {R}}\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be an l.s.c. convex function satisfying \(\varphi (1)=0\). With such \(\varphi \), the \(\varphi \)-divergence of \(\varvec{q}\in {\mathbb {R}}^m\) relative to \(\varvec{p}\in \Pi ^m_+\) is defined by

$$\begin{aligned} \mathcal{I}_\varphi (\varvec{q},\varvec{p}):= \left\{ \begin{array}{ll} {\mathbb {E}}_{\varvec{p}}(\varphi (\varvec{q}./\varvec{p})) \equiv \sum \limits _{i=1}^{m}p_i\varphi \left( \frac{q_i}{p_i}\right) ,&{}\text{ if } \varvec{q} \text{ satisfies } \varvec{1}^\top \varvec{q}=1,\\ +\infty ,&{}\text{ otherwise }.\\ \end{array} \right. \end{aligned}$$

The \(\varphi \)-divergence generalizes the relative entropy. Indeed, with \(\varphi (s)=s\log s-s+1\) (and \(0\ln 0=0\)), \(\mathcal{I}_\varphi (\varvec{q},\varvec{p})\) is the Kullback-Leibler divergence, i.e., \(\mathrm{KL}_{(1,\varvec{p})}(\varvec{q})\), while with \(\varphi (s)=(s-1)^2\), \(\mathcal{I}_\varphi (\varvec{q},\varvec{p})\) is the modified \(\chi ^2\)-divergence, i.e., \(\chi ^2_{(1,\varvec{p})}(\varvec{q})\). See e.g., Table 2 of Reid and Williamson (2011), for the other examples.
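For instance, substituting \(\varphi (s)=s\ln s-s+1\) into the definition and using \(\varvec{1}^\top \varvec{q}=\varvec{1}^\top \varvec{p}=1\) gives

$$\begin{aligned} \mathcal{I}_\varphi (\varvec{q},\varvec{p})=\sum _{i=1}^{m}p_i\left( \frac{q_i}{p_i}\ln \frac{q_i}{p_i}-\frac{q_i}{p_i}+1\right) =\sum _{i=1}^{m}q_i\ln \frac{q_i}{p_i}-1+1=\varvec{q}^\top \ln (\varvec{q}./\varvec{p}), \end{aligned}$$

which is exactly \({\mathrm{KL}}_{(1,\varvec{p})}(\varvec{q})\).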

Theorem 6

Let v be a proper l.s.c. convex function on \({\mathbb {R}}\) such that \(v(z)\ge z+B\) with some \(B\in {\mathbb {R}}\). Then the risk function (9) is proper, l.s.c., and convex, and it holds that \(\mathcal{{F}}_{\varvec{p}}^*(\varvec{\lambda })=\mathcal{I}_{v^*}(\varvec{\lambda },\varvec{p})\). Namely,

$$\begin{aligned} \mathcal{{F}}_{\varvec{p}}(\varvec{L})=\inf _c\{c+{\mathbb {E}}_{\varvec{p}}(v(\varvec{L}-c\varvec{1}))\}= \sup _{\varvec{q}}\{ \varvec{q}^\top \varvec{L}-\mathcal{I}_{v^*}(\varvec{q},\varvec{p})\}. \end{aligned}$$
(30)

Furthermore, if there exists \(z^\star \) such that \(v(z^\star )=z^\star +B\), i.e., B is the minimum of \(v(z)-z\) and \(z^\star \) is a minimizer, then the \(\varphi \)-divergence \(\mathcal{I}_{v^*}(\varvec{q},\varvec{p})\) attains its minimum \(-B\) at \(\varvec{q}=\varvec{p}\). Moreover, \(\mathcal{{F}}_{\varvec{p}}(\varvec{L})\) is monotonic if \({\mathrm{dom}}\,v^*\subset {\mathbb {R}}_+\), and \(\mathcal{{F}}_{\varvec{p}}(\varvec{L})\) is positively homogeneous if \(v^*=\delta _{[a,a']}\), where \(a:=\inf \{s:s\in {\mathrm{dom}}\,v^*\}\) and \(a':=\sup \{s:s\in {\mathrm{dom}}\,v^*\}\).

See the “Proof of Theorem 6” section in the Appendix for the proof.

Formula (30) indicates that a risk function of the form (9) can be interpreted as a worst-case expected loss from which the \(\varphi \)-divergence with \(\varphi =v^*\) is deducted. In this view, we can associate each coherent risk function with a \(\varphi \)-divergence that is represented as the indicator function of a (closed convex) set. Table 2 demonstrates the correspondence of \(\mathcal{{F}}_{\varvec{p}}\), \(\mathcal{I}_{v^*}(\varvec{q},\varvec{p})\), \(v^*\) and v for the CVaR, LSE and MV risk functions. For any \(\alpha \in [0,1)\) and \(t>0\), the function \(v(z)-z\) of both \({\mathrm{CVaR}}_{(\alpha ,\varvec{p})}\) and \({\mathrm{MV}}_{(t,\varvec{p})}\) attains its minimum \(B=0\) at \(z^\star =0\). On the other hand, that of \({\mathrm{LSE}}_{(t,\varvec{p})}\) attains the minimum \(B=(1+\ln t)/t-1\) at \(z^\star =(1/t)\ln (1/t)\) for any \(t>0\). In every case, \(\mathcal{I}_{v^*}(\varvec{q},\varvec{p})\) attains its minimum at \(\varvec{q}=\varvec{p}\). Accordingly, all of these functions in Table 2 can be related to divergences relative to \(\varvec{p}\). See Ben-Tal and Teboulle (2007) for the details and an interpretation as the Optimized Certainty Equivalent based on a concave utility function \(u(z):=-v(-z)\).

Table 2 Examples of risk functions \(\mathcal{{F}}_{\varvec{p}}\) satisfying (30) and their \(\mathcal{I}_{v^*}\), \(v^*\) and v

With the \(\varphi \)-divergence and Theorem 6, the second statement of Proposition 4 can be specified as follows.

Corollary 4

If the risk function \(\mathcal{{F}}\) is written in the form (30) with \(v^*\) such that v is monotonic, then the dual problem (18) can be represented as

$$\begin{aligned} \underset{\varvec{\lambda }}{\sup }~-\gamma ^*(\varvec{G}\varvec{\lambda })-\mathcal{I}_{v^*}(\varvec{\lambda },\varvec{p}) -\delta _{C}(\varvec{\lambda }) \text{ with } C=\{\varvec{\lambda }\in {\mathbb {R}}^m:\varvec{y}^\top \varvec{\lambda }=0\}\cap \Pi ^m. \end{aligned}$$

Note that this corollary is applicable to CVaR and LSE since they are both monotonic and translation invariant, which can also be confirmed from Theorem 6 by observing that dom \(v^*\) of CVaR and LSE lie in the nonnegative orthant (Table 2).

5.2 A connection to geometric interpretation

Crisp and Burges (2000), Bennett and Bredensteiner (2000), Takeda et al. (2013), and Kanamori et al. (2013) show that the dual problems of several SVMs can be interpreted as the problem of finding two nearest points over two sets, each generated by the data samples sharing the same label, indexed by \(I_+:=\{i\in \{1,\ldots ,m\}:y_i=+1\}\) and \(I_-:=\{i\in \{1,\ldots ,m\}:y_i=-1\}\). With the dual formulation (18), we can derive a similar interpretation in a general manner.

To that end, let us consider the case where \(\mathcal{{F}}\) is translation invariant and \(\varvec{G}=\varvec{X}^\top \varvec{Y}\). In this case, the constraint \(\varvec{y}^\top \varvec{\lambda }=0\) of (18) can be represented by \(\sum _{i\in I_-}\lambda _i=\sum _{i\in I_+}\lambda _i=\frac{1}{2}\), while the first term of the objective, i.e., \(-\gamma ^*(\varvec{G}\varvec{\lambda })\), can be written as \(-\gamma ^*(\sum _{i\in I_+}\varvec{x}_i\lambda _{i}-\sum _{h\in I_-}\varvec{x}_h\lambda _{h})\). Consequently, with the change of variables \(\mu _{+,i}:=2\lambda _i\) for \(i\in I_+\) and \(\mu _{-,i}:=2\lambda _i\) for \(i\in I_-\), (18) (or (30)) can be represented as

where \(\varvec{\mu }:=2\varvec{\lambda }\).

As a further concrete interpretation, let us consider the case where \(\gamma (\varvec{w})=\delta _{\Vert \cdot \Vert \le 1}(\varvec{w})\) and \(\mathcal{{F}}\) is given in the form of (9) with monotonic v. Then (18) is represented as

$$\begin{aligned} \underset{\varvec{\mu }_+\in \Pi ^{|I_+|},\,\varvec{\mu }_-\in \Pi ^{|I_-|}}{\text{ minimize }}~\frac{1}{2}\Big \Vert \sum _{i\in I_+}\varvec{x}_{i}\mu _{+,i}-\sum _{h\in I_-}\varvec{x}_{h}\mu _{-,h}\Big \Vert ^\circ +\mathcal{I}_{v^*}\Big (\frac{1}{2}\varvec{\mu },\varvec{p}\Big ), \end{aligned}$$
(31)

where \(\varvec{\mu }_+\) and \(\varvec{\mu }_-\) are vectors consisting of \(\mu _{+,i}\) and \(\mu _{-,i}\), respectively. It is noteworthy that the formulation (31) is close to what Kanamori et al. (2013) demonstrate. Precisely, employing the \(\ell _2\)-norm, they virtually present the regularized ERM of the form \(\min _\rho \{-2\rho +\frac{1}{m}(\sum _{i=1}^mv(-y_i(\varvec{x}_i^\top \varvec{w}-b))+\rho )_+\}\) subject to \(\Vert \varvec{w}\Vert _2^2\le t\) with \(t>0\). It is easy to see that (31) contains their formulation as a special case.

Besides, the geometric interpretation of the \(\nu \)-SVM demonstrated by Crisp and Burges (2000), Bennett and Bredensteiner (2000) can also be derived straightforwardly. For example, \(\nu \)-SVM with \(\gamma (\varvec{w})=\delta _{\Vert \cdot \Vert \le 1}(\varvec{w})\) can be explicitly represented as the geometric problem of the form

$$\begin{aligned} \min _{\varvec{z}_+,\varvec{z}_-} \Vert \varvec{z}_+-\varvec{z}_-\Vert ^\circ ~ \text{ subject } \text{ to } \varvec{z}_+\in \mathcal{Q}_+,\varvec{z}_-\in \mathcal{Q}_-, \end{aligned}$$

with

$$\begin{aligned} \begin{array}{l} \mathcal{Q}_+:=\left\{ \varvec{z}\in {\mathbb {R}}^n:\varvec{z} =\sum \nolimits _{i\in I_+}\varvec{x}_{i}\mu _{+,i},\varvec{\mu }_+\in \mathcal{Q}_{{\mathrm{CVaR}}(1-\nu ,2\varvec{p}_+)}\right\} ;\\ \mathcal{Q}_-:=\left\{ \varvec{z}\in {\mathbb {R}}^n:\varvec{z} =\sum \nolimits _{i\in I_-}\varvec{x}_{i}\mu _{-,i},\varvec{\mu }_-\in \mathcal{Q}_{{\mathrm{CVaR}}(1-\nu ,2\varvec{p}_-)}\right\} ,\\ \end{array} \end{aligned}$$

where \(\varvec{p}_+:=(p_i)_{i\in I_+}\) and \(\varvec{p}_-:=(p_i)_{i\in I_-}\) are supposed to have the elements in the same order as \(\varvec{\mu }_+\) and \(\varvec{\mu }_-\), respectively, and \(\mathcal{Q}_{{\mathrm{CVaR}}(\alpha ,2\varvec{p}_+)}\subset \Pi ^{|I_+|}\) and \(\mathcal{Q}_{{\mathrm{CVaR}}(\alpha ,2\varvec{p}_-)}\subset \Pi ^{|I_-|}\). (Here we admit an abuse of the notation. Precisely, we put \(2\varvec{p}_+\) or \(2\varvec{p}_-\) in the place of a probability measure \(\varvec{p}\).) Note that \(\mathcal{Q}_+\) and \(\mathcal{Q}_-\) are exactly the reduced convex hulls in Crisp and Burges (2000), Bennett and Bredensteiner (2000). Note that along this line, we can derive the geometric interpretation of an SVM defined with a coherent risk function and a general norm, which is parallel to Kanamori et al. (2013).
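The geometric problem above can be solved directly with an off-the-shelf convex solver. Below is a hedged sketch in which the data, \(\nu =0.4\), the uniform \(\varvec{p}=\varvec{1}/m\), and the choice \(\Vert \cdot \Vert =\Vert \cdot \Vert _\infty \) (so that \(\Vert \cdot \Vert ^\circ =\Vert \cdot \Vert _1\)) are illustrative assumptions; cvxpy is used only as a generic modeling tool.

```python
import numpy as np
import cvxpy as cp

# Synthetic two-class data (illustrative assumption).
rng = np.random.default_rng(1)
Xp = rng.normal(+1.0, 1.0, (15, 2))   # samples with label +1
Xm = rng.normal(-1.0, 1.0, (15, 2))   # samples with label -1
nu, m = 0.4, Xp.shape[0] + Xm.shape[0]
cap = 2.0 / (m * nu)                  # upper bound from Q_CVaR(1-nu, 2p) with p = 1/m

mu_p = cp.Variable(Xp.shape[0], nonneg=True)
mu_m = cp.Variable(Xm.shape[0], nonneg=True)
z_gap = Xp.T @ mu_p - Xm.T @ mu_m     # z_+ - z_-
prob = cp.Problem(
    cp.Minimize(cp.norm(z_gap, 1)),   # dual norm ||.||° taken to be the l1-norm here
    [cp.sum(mu_p) == 1, cp.sum(mu_m) == 1, mu_p <= cap, mu_m <= cap],
)
prob.solve()
print("distance between reduced convex hulls:", prob.value)
```

Replacing cp.norm(z_gap, 1) with cp.norm(z_gap, 2) corresponds to the usual \(\ell _2\)-norm case.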

In addition, the formulation (31) bridges the geometric interpretation and an information theoretic interpretation. Noting the relation

$$\begin{aligned} \mathcal{I}_{v^*}\left( \frac{1}{2}\varvec{\mu },\varvec{p}\right) =\sum _{i\in I_+}p_i{v^*}\left( \frac{\mu _i}{2p_i}\right) +\sum _{h\in I_-}p_h{v^*}\left( \frac{\mu _h}{2p_h}\right) , \end{aligned}$$

the second term of the objective of (31) can be interpreted as a penalty on the deviations between the weight vectors \(\varvec{\mu }_+\) (or \(\varvec{\mu }_-\)) and \(2\varvec{p}_+\) (or \(2\varvec{p}_-\), respectively). If we further suppose that

$$\begin{aligned} \sum _{i\in I_+}p_i=\sum _{h\in I_-}p_h\left( =\frac{1}{2}\right) , \end{aligned}$$
(32)

the penalty term is rewritten as \(\mathcal{I}_\varphi (\varvec{\mu }_+,\varvec{r}_+)+\mathcal{I}_\varphi (\varvec{\mu }_-,\varvec{r}_-)\) where \(\varvec{r}_+:=(r_i)_{i\in I_+}\) with \(r_i=p_i/\sum _{h\in I_+}p_h\) for \(i\in I_+\) and \(\varvec{r}_-:=(r_i)_{i\in I_-}\) with \(r_i=p_i/\sum _{h\in I_-}p_h\) for \(i\in I_-\). Namely, we can symbolically recast (31) as the minimization of

$$\begin{aligned} \frac{1}{2}\Vert {\mathbb {E}}_{\varvec{\mu }}(\varvec{x}|y=+1)-{\mathbb {E}}_{\varvec{\mu }}(\varvec{x}|y=-1)\Vert ^\circ + \mathcal{I}_\varphi (\varvec{\mu },\varvec{p}|y=+1)+\mathcal{I}_\varphi (\varvec{\mu },\varvec{p}|y=-1), \end{aligned}$$

over \(\varvec{\mu }\in \Pi ^m\), where \({\mathbb {E}}_{\varvec{\mu }}(\varvec{x}|y=a)\) denotes the conditional expectation and \(\mathcal{I}_\varphi (\varvec{\mu },\varvec{p}|y=a)\) denotes the divergence between the conditional distributions, “\(\varvec{\mu }|y=a\)” and “\(\varvec{p}|y=a\)”, with \(a=+1\) or \(a=-1\).

It is worth mentioning that the approach of Bennett and Mangasarian (1992) virtually employs the condition (32) and attains a nice performance on the breast cancer data set (see Wolberg et al. 2013).

5.3 Distributionally robust SVMs

Apart from the robust optimization modeling described in Sect. 3.2, the so-called distributionally robust optimization is also popular in the literature. In this subsection, we show that the class of generalized SVM formulations described in this paper also fits into this robust optimization modeling approach.

In existing SVMs, the samples are usually assumed to be independently drawn from an unknown distribution, and the empirical (uniform) probability \(\varvec{p}=\varvec{1}/m\) is employed. However, such an i.i.d. assumption is often unfulfilled. For example, we can consider a situation where the samples are i.i.d. within each class while the (prior) distribution of labels, \(\vartheta :={\mathbb {P}}\{y=+1\}(=1-{\mathbb {P}}\{y=-1\})\), is not known. Namely, we can assume \(p_i=\vartheta /|I_+|\) for \(y_i=+1\) and \(p_i=(1-\vartheta )/|I_-|\) for \(y_i=-1\), but \(\vartheta \) is uncertain. In such a case, the choice of the uniform distribution may not be the best.

In general, let us consider the case where \(\varvec{p}\) is under uncertainty of the form: \(\varvec{p}+\varvec{\delta }\in P\) with some P satisfying \(\varvec{p}\in P\subset \Pi ^{m}\). Similarly to Sect. 3.2, one reasonable strategy is to consider the worst case over the set P. Let us list examples of the uncertainty set P.

$$\begin{aligned}{}\begin{array}[t]{ll} \bullet ~P=\mathcal{Q}_{{\mathrm{Fi}}(\varvec{p}_1,\ldots ,\varvec{p}_K)}:=\{\varvec{p}_1,\ldots ,\varvec{p}_K\},&{}\text{ with } \varvec{p}_1,\ldots ,\varvec{p}_K\in \Pi ^m;\\ \bullet ~P=\mathcal{Q}_{{\mathrm{Dist}}(\Vert \cdot \Vert ',\varvec{A},\varvec{p})}:=\{\varvec{\pi }\in \Pi ^m:\varvec{\pi }=\varvec{p}+\varvec{A}\varvec{\zeta },\Vert \varvec{\zeta }\Vert '\le 1\}, &{}\text{ with } \varvec{A}\in {\mathbb {S}}^m_{++},~\Vert \cdot \Vert ' \text{: } \text{ a } \text{ norm };\\ \bullet ~P=\mathcal{Q}_{\mathcal{I}_\varphi (t,\varvec{p})}:=\left\{ \varvec{\pi }\in \Pi ^m:\mathcal{I}_{\varphi }(\varvec{\pi },\varvec{p})\le t\right\} , &{}\text{ with } t>0,\\ \end{array} \end{aligned}$$

where \({\mathbb {S}}^m_{++}\) denotes the set of \(m\times m\) real symmetric positive definite matrices. The first example indicates the situation where K candidates \(\varvec{p}_1,\ldots ,\varvec{p}_K\) for \(\varvec{p}\) are possible. The second and third examples are the cases where the possible deviations are given by convex sets defined with some norm \(\Vert \cdot \Vert '\) and some \(\varphi \)-divergence, respectively. Specifically, \(\mathcal{Q}_{{\mathrm{Dist}}(\Vert \cdot \Vert ',\varvec{A},\varvec{p})}\) denotes the set of probability measures whose distance from \(\varvec{p}\) is at most 1 under the norm \(\Vert \cdot \Vert '\) and the metric \((\varvec{A}^{-1})^2\). Especially when \(\Vert \cdot \Vert '=\Vert \cdot \Vert _\infty \) and \(\varvec{A}={\mathrm{diag}}(\overline{\varvec{\zeta }})\) with some \(\overline{\varvec{\zeta }}\ge \varvec{0}\), the set forms a box-type constraint, i.e., \(\mathcal{Q}_{{\mathrm{Dist}}(\Vert \cdot \Vert _{\infty },{\mathrm{diag}} (\overline{\varvec{\zeta }}),\varvec{p})}=\Pi ^m\cap [\varvec{p}-\overline{\varvec{\zeta }},\varvec{p}+\overline{\varvec{\zeta }}]\).
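For the finite set \(P=\mathcal{Q}_{{\mathrm{Fi}}(\varvec{p}_1,\ldots ,\varvec{p}_K)}\), taking the worst case over P amounts to taking the largest of the K candidate risk values \(\mathcal{{F}}_{\varvec{p}_1}(\varvec{L}),\ldots ,\mathcal{{F}}_{\varvec{p}_K}(\varvec{L})\). The following is a small numerical sketch of this computation with \(\mathcal{{F}}={\mathrm{CVaR}}\); the loss vector, the two candidate measures, and \(\alpha =0.7\) are illustrative assumptions.

```python
import numpy as np

def cvar(L, p, alpha):
    # CVaR_{(alpha,p)}(L) = min_c { c + E_p[(L - c)_+] / (1 - alpha) },
    # attained at any alpha-quantile c of L under p.
    order = np.argsort(L)
    L_sorted, p_sorted = L[order], p[order]
    c = L_sorted[np.searchsorted(np.cumsum(p_sorted), alpha)]
    return c + np.maximum(L - c, 0.0) @ p / (1.0 - alpha)

L = np.array([-1.2, 0.3, 0.8, -0.5, 1.5])
P = [np.full(5, 0.2),                           # uniform candidate measure
     np.array([0.1, 0.1, 0.3, 0.2, 0.3])]       # a skewed candidate measure
print("Worst-case CVaR over P:", max(cvar(L, p, alpha=0.7) for p in P))
```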

Let us now consider the case where the risk function \(\mathcal{{F}}_{\varvec{p}}\) has the form (30), and consider a distributionally robust version of the primal formulation:

$$\begin{aligned} \underset{\varvec{w},b}{\text{ minimize }}&\quad \sup _{\varvec{\pi }\in P}\mathcal{{F}}_{\varvec{\pi }}(\varvec{L}) + \gamma (\varvec{w}). \end{aligned}$$
(33)

Note that the worst-case risk function is given as

$$\begin{aligned} {\mathrm{Worst}\text {-}}\mathcal{{F}}_P(\varvec{L}):= \sup _{\varvec{\pi }\in P}\mathcal{{F}}_{\varvec{\pi }}(\varvec{L})= \sup _{\varvec{\pi }\in P,\varvec{q}}\{\varvec{q}^\top \varvec{L}-\mathcal{I}_\varphi (\varvec{q},\varvec{\pi })\}= \sup _{\varvec{q}}\{\varvec{q}^\top \varvec{L}-\inf _{\varvec{\pi }\in P}\mathcal{I}_\varphi (\varvec{q},\varvec{\pi })\}. \end{aligned}$$

The last part indicates that \(({\mathrm{Worst}\text {-}}\mathcal{{F}}_P)^*(\varvec{q})=\inf _{\varvec{\pi }\in P}\mathcal{I}_\varphi (\varvec{q},\varvec{\pi })\), where we can independently show that this is convex in \(\varvec{q}\) as long as the \(\varphi \)-divergence is given with a convex \(\varphi \). (See e.g., Section 3.2.6 of Boyd and Vandenberghe (2004), for the details.)

Proposition 5

If P is a convex set, the dual formulation of the distributionally robust version (33) is given as the following convex minimization

If \(P=\mathcal{Q}_{\mathrm{Fi}(\varvec{p}_1,\ldots ,\varvec{p}_K)}\), the distributionally robust version of the generalized dual formulation is rewritten as

Although we describe the case of \(P=\mathcal{Q}_{\mathrm{Fi}(\varvec{p}_1,\ldots ,\varvec{p}_K)}\) separately from the case where P is a convex set, we can treat \({\mathrm{Worst}\text {-}}\mathcal{{F}}_P\) in a unified manner when the original risk function \(\mathcal{{F}}_{\varvec{p}}\) is positively homogeneous. Indeed, we then have

$$\begin{aligned} {\mathrm{Worst}\text {-}}\mathcal{{F}}_P(\varvec{L})= \sup _{\varvec{p}\in P}\sup _{\varvec{q}\in \mathcal{Q}_\mathcal{{F}}(\varvec{p})}\varvec{q}^\top \varvec{L}= \sup \limits _{\varvec{q}}\{\varvec{L}^\top \varvec{q}:\varvec{q}\in \underset{\varvec{p}\in P}{\cup }\mathcal{Q}_\mathcal{{F}}(\varvec{p})\}. \end{aligned}$$
(34)

The union, \(\cup _{\varvec{p}\in P}\mathcal{Q}_\mathcal{{F}}(\varvec{p})\), in (34) can be a nonconvex set. However, the convex hull of the union provides an equivalent coherent risk function. Namely, we have \( {\mathrm{Worst}\text {-}}\mathcal{{F}}_P(\varvec{L})= \sup _{\varvec{q}}\{\varvec{q}^\top \varvec{L}:\varvec{q}\in \text{ conv }( \cup _{\varvec{p}\in P}\mathcal{Q}_\mathcal{{F}}(\varvec{p}))\}. \) Since the convex hull of the risk envelopes becomes another (possibly larger) risk envelope, the distributionally robust coherent risk function-based SVM is again a coherent risk function-based SVM. Accordingly, with a positively homogeneous risk function \(\mathcal {F}\), the dual of the distributionally robust version is given by

(35)

For example, if we employ the uncertainty sets P listed above, the distributionally robust versions of the \(\nu \)-SVM, \(({\mathrm{Worst}\text {-}}{\mathrm{CVaR}}_{(1-\nu ,P)},\Vert \cdot \Vert )\), are represented in the following dual forms, respectively:

where the first one with \(\Vert \cdot \Vert ^\circ =\Vert \cdot \Vert _2\) is presented in Wang (2012), where it is also extended to a multi-class classification setting.

The distributionally robust SVMs presented above differ from existing ones (e.g., Wang et al. 2015) in that their dual formulations are easily obtained from the dual representation (30) of the inseparable risk function. As seen in the preceding sections, (33) can be associated with \(\varphi \)-divergences, and (35) is obtained straightforwardly with the help of the Fenchel duality.

It is noteworthy that the distributional robustification technique above incorporates prior knowledge on the distribution without significantly increasing the complexity of the optimization problem, specifically when such information is given by moments. For example,

  • When the average of the j-th attribute of the samples having the label \(y_i=+1\) lies in a certain interval \([l^+_j,u^+_j]\), we can include this information in the dual problem as the constraint:

    $$\begin{aligned} l^+_j\le \sum _{i\in I_+}\pi _ix_{ij}\le u^+_j. \end{aligned}$$
  • When the prior probability of a sample being drawn from the group of label \(y_i=+1\) is twice to thrice as large as that from \(y_i=-1\), we can include this information in the dual problem as the constraint:

    $$\begin{aligned} 2 \sum _{i\in I_-}\pi _i \le \sum _{i\in I_+}\pi _i \le 3 \sum _{i\in I_-}\pi _i. \end{aligned}$$

Although it is known that simple robust optimization modeling often leads to excessively conservative results, adding experts’ knowledge as constraints in this way can help avoid such situations.
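As a hedged illustration of the two bullets above, the snippet below encodes both pieces of prior knowledge as linear constraints on the dual variable \(\varvec{\pi }\) and, as a quick consistency check, projects the uniform measure onto the resulting set; the data, the attribute index j, and the bounds are illustrative assumptions, and in practice these constraints would be appended to a distributionally robust dual of the kind discussed above.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
m, n, j = 30, 4, 0
X = rng.normal(size=(m, n))                    # illustrative data matrix
y = np.r_[np.ones(15), -np.ones(15)]
sel_p = (y > 0).astype(float)                  # indicator of I_+
sel_m = (y < 0).astype(float)                  # indicator of I_-
l_j, u_j = -0.5, 0.5                           # assumed bounds on the class-wise average

pi = cp.Variable(m, nonneg=True)
knowledge = [
    cp.sum(pi) == 1,
    (X[:, j] * sel_p) @ pi >= l_j,             # l_j <= sum_{i in I_+} pi_i x_ij
    (X[:, j] * sel_p) @ pi <= u_j,             # sum_{i in I_+} pi_i x_ij <= u_j
    sel_p @ pi >= 2 * (sel_m @ pi),            # prior mass of +1 at least twice that of -1
    sel_p @ pi <= 3 * (sel_m @ pi),            # and at most three times
]
prob = cp.Problem(cp.Minimize(cp.sum_squares(pi - 1.0 / m)), knowledge)
prob.solve()
print("a measure consistent with the prior knowledge:", pi.value)
```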

6 Concluding remarks

This paper studies formulations of SVMs for binary classification in a unified way, particularly considering the capability of inseparable risk functions and non-\(\ell _2\)-norms, while also providing insights on the formulations from various perspectives.

  • When using positively homogeneous risk functions, the choice of the form of the regularizer requires careful attention (Sect. 3.1).

  • Corresponding to the dual characterizations of the three properties of the risk function (monotonicity, translation invariance, and positive homogeneity), we can express the dual formulation in interpretable ways (Sect. 5.1). More specifically, monotonic and translation invariant risk functions are shown to be associated with geometric and probabilistic interpretations (Sect. 5.2).

  • In relation to robust optimization modeling, we draw two perspectives. With monotonic and translation invariant risk functions, the regularized ERM formulation can be viewed as a robust optimization (Sect. 3.2). Additionally, for these risk functions the distributionally robust modeling can be easily incorporated into the dual formulation (Sect. 5.3).

As stated in the Introduction, a motivation of this study was the use of recently developed polyhedral norms for the regularizer. We see that the Ivanov regularization seems to be the unique choice for the combination with positively homogeneous risk functions, which is why we focused on that case in the analysis (e.g., Sect. 4.2). Through an experiment, which is not reported in the current manuscript, we observed that within a comparable amount of time the use of a family of polyhedral norms could achieve a better out-of-sample performance than the standard \(\ell _2\)-regularized SVM, which differs only in the regularizer. See Gotoh and Uryasev (2013) for the details.

While we supposed that the argument of \(\mathcal {F}\) was of the form \(\varvec{L}=-(\varvec{G}^\top \varvec{w}-\varvec{y}b)\) and that only \(\varvec{w}\) was regularized (i.e., b was not regularized), this setting covers a variety of existing formulations. On the other hand, excluded classes of risk functions or losses, such as

  • \(L_{i,j}=\varvec{w}^\top \varvec{x}_i-\varvec{w}^\top \varvec{x}_j\), where i indexes samples with \(y_i=-1\) and j indexes samples with \(y_j=+1\),

remain to be investigated.

This framework can be extended to other types of machine learning tasks, such as multi-class classification and regression, in a similar manner. In particular, the application of the CVaR norms and the deltoidal norms to multiple kernel learning (Kloft et al. 2011) is a promising direction.