1 Introduction

The Newton method is a classical numerical scheme for solving systems of nonlinear equations and smooth optimization problems [47, 50]. However, at least two obstacles prevent such methods from being applied to large-scale problems. Firstly, while these methods often have a fast local convergence rate, which can be up to quadratic, their global convergence has not been well understood [46]. In practice, one can use a damped-step scheme that exploits the Lipschitz constant of the objective derivatives to compute a suitable step-size, as is common in gradient-type methods, or combine the algorithm with a globalization strategy such as line-search, trust-region, or filter techniques to guarantee a descent property [47]. Both strategies allow us to prove global convergence of the underlying Newton-type method in some sense. Unfortunately, there exist several problems whose objective function does not have a globally Lipschitz gradient or Hessian, such as logarithmic or reciprocal functions. This class of problems does not provide the uniform bounds needed to obtain a constant step-size in optimization algorithms. On the other hand, using a globalization strategy to determine step-sizes often requires centralized computation such as function evaluations, which prevents the use of distributed computation and stochastic descent methods. Secondly, Newton algorithms are second-order methods that often incur a high per-iteration complexity due to operations on the Hessian mapping of the objective function or its approximations. In addition, these methods require the underlying functionals to be smooth up to a given smoothness level, which often does not hold in practical models.

Motivation In recent years, there has been great interest in Newton-type methods for solving convex optimization problems and monotone equations due to the development of new techniques and mathematical tools in optimization, machine learning, and randomized algorithms [6, 11, 16, 18, 34, 42, 43, 54, 55, 57, 58, 61]. Several combinations of Newton-type methods with other techniques such as proximal operators [8], cubic regularization [42], gradient regularization [55], randomized techniques such as sketching [54], subsampling [18], and fast eigen-decomposition [26] have opened up a new research direction and attracted great attention in solving nonsmooth and large-scale problems. Hitherto, research in this direction remains focused on specific classes of problems where standard assumptions such as nonsingularity and Hessian Lipschitz continuity are preserved. However, such assumptions do not hold for many other examples, as shown in [62]. Moreover, even when they are satisfied, the resulting worst-case bounds on the possible step-sizes are often conservative, which may lead to poor performance, especially on large-scale problems.

In the seminal work [45], Nesterov and Nemirovskii showed that the class of log-barriers does not satisfy the standard assumptions of the Newton method if the solution of the underlying problem is close to the boundary of the domain of the barrier function. They introduced a powerful concept called “self-concordance” to overcome this drawback and developed new Newton schemes that achieve global and local convergence without requiring any additional assumption or a globalization strategy. While the self-concordance notion was initially invented to study interior-point methods, it is less well-known in other communities. Recent works [1, 14, 38, 62, 67, 72] have popularized this concept to solve other problems arising from machine learning, statistics, image processing, scientific computing, and variational inequalities.

Our goals In this paper, motivated by [1, 63, 72], we aim at generalizing the self-concordance concept in [45] to a broader class of smooth and convex functions. To illustrate our idea, we consider a univariate smooth and convex function \(\varphi : \mathbb {R}\rightarrow \mathbb {R}\). If \(\varphi \) satisfies the inequality \(\vert \varphi '''(t)\vert \le M_{\varphi }\varphi ''(t)^{3/2}\) for all t in the domain of \(\varphi \) and for a given constant \(M_{\varphi }\ge 0\), then we say that \(\varphi \) is self-concordant (in Nesterov and Nemirovskii’s sense [45]). We instead generalize this inequality to

$$\begin{aligned} \vert \varphi '''(t)\vert \le M_{\varphi }\varphi ''(t)^{\frac{\nu }{2}}, \end{aligned}$$
(1)

for all t in the domain of \(\varphi \), and for given constants \(\nu > 0\) and \(M_{\varphi }\ge 0\).
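As a quick sanity check, inequality (1) can be verified numerically for a concrete function. The sketch below (assuming NumPy) checks the logistic loss \(\varphi (t) = \log (1 + e^{-t})\), which, as discussed in Example 1(b) below, satisfies (1) with \(M_{\varphi } = 1\) and \(\nu = 2\):

```python
import numpy as np

# Check inequality (1) for the logistic loss phi(t) = log(1 + exp(-t)),
# which satisfies it with M_phi = 1 and nu = 2 (see Example 1(b)).
def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-30.0, 30.0, 2001)
s = sigma(t)
phi2 = s * (1.0 - s)                      # phi''(t) = sigma(t)(1 - sigma(t))
phi3 = s * (1.0 - s) * (1.0 - 2.0 * s)    # phi'''(t) = phi''(t)(1 - 2 sigma(t))
assert np.all(np.abs(phi3) <= 1.0 * phi2 ** (2.0 / 2.0) + 1e-12)
```

Here the bound holds because \(\vert \varphi '''(t)\vert = \varphi ''(t)\,\vert 1 - 2\sigma (t)\vert \le \varphi ''(t)\).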

We emphasize that generalizing from univariate to multivariate functions in the standard self-concordant case (i.e., \(\nu = 3\)) [45] preserves several important properties, including the multilinear symmetry [40, Lemma 4.1.2], while, unfortunately, these properties do not hold for the case \(\nu \ne 3\). Therefore, we modify the definition in [45] to overcome this drawback. Note that a similar idea has also been studied in [1, 63] for a class of logistic-type functions. Nevertheless, the definition used in these papers is limited, and still creates certain difficulties for developing further theory in general cases.

Our second goal is to develop a unified mechanism to analyze the convergence (including global and local convergence) of the following Newton-type scheme:

$$\begin{aligned} x^{k+1} := x^k - s_kF'(x^k)^{-1}F(x^k), \end{aligned}$$
(2)

where F can be represented as the right-hand side of a smooth monotone equation \(F(x) = 0\), or the optimality condition of a convex optimization or a convex–concave saddle-point problem, \(F'\) is the Jacobian map of F, and \(s_k\in (0, 1]\) is a given step-size. Although the Newton scheme (2) is invariant to a change of variables [16], its convergence properties rely on the growth of the Hessian mapping along the Newton iterative process. In classical settings, the Lipschitz continuity and the non-degeneracy of the Hessian mapping in a neighborhood of a given solution are the key assumptions used to achieve a local quadratic convergence rate [16]. These assumptions are considered standard, but they are often very difficult to check in practice, especially the second one. A natural idea is to classify the functionals of the underlying problem into a known class of functions so as to choose a suitable method for minimizing them. While first-order methods for convex optimization essentially rely on the Lipschitz continuity of the gradient, Newton schemes usually use the Lipschitz continuity of the Hessian mapping and its non-degeneracy to obtain a well-defined Newton direction, as mentioned above. For self-concordant functions, the second condition automatically holds, whereas the first one fails; nevertheless, both full-step and damped-step Newton methods still work in this case once a suitable metric is chosen. This situation has been observed, and the standard assumptions have been modified in different directions to still guarantee convergence of Newton-type methods; see [16] for an intensive study of generic Newton-type methods, and [40, 45] for the self-concordant function class.
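For intuition, scheme (2) can be sketched in a few lines. The example below is a hypothetical setup (assuming NumPy): it applies (2) with \(F = \nabla f\) to a small ridge-regularized logistic model, and picks \(s_k\) by a simple residual backtracking rather than by the step-size rules developed later in this paper:

```python
import numpy as np

# Sketch of scheme (2) with F = grad f for a toy ridge-regularized logistic
# model f(x) = sum_i log(1 + exp(-b_i a_i^T x)) + (mu/2) ||x||^2.
# Data, mu, and the residual backtracking are illustrative choices only.
def sigma(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))        # feature rows a_i
b = np.sign(rng.standard_normal(50))    # labels b_i in {-1, +1}
mu = 0.1

def F(x):                               # F(x) = grad f(x)
    return -A.T @ (b * sigma(-b * (A @ x))) + mu * x

def Fprime(x):                          # F'(x) = Hessian of f at x
    w = sigma(A @ x) * sigma(-(A @ x))  # per-sample phi''(a_i^T x)
    return A.T @ (A * w[:, None]) + mu * np.eye(3)

x = np.zeros(3)
for k in range(50):
    d = np.linalg.solve(Fprime(x), F(x))
    s = 1.0                             # try the full step s_k = 1 first
    while np.linalg.norm(F(x - s * d)) > np.linalg.norm(F(x)) and s > 1e-8:
        s *= 0.5                        # damp until the residual decreases
    x = x - s * d

assert np.linalg.norm(F(x)) < 1e-6
```

Once the iterates are close to the solution, the full step \(s_k = 1\) is accepted and the residual decreases quadratically, mirroring the local behavior analyzed in the paper.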

Our approach We attempt to develop some background theory for a broad class of smooth and convex functions under the structure (1). By adopting the local norm defined via the Hessian mapping of such a convex function from [45], we can prove lower and upper bound estimates for the local norm distance between two points in the domain as well as for the growth of the Hessian mapping. Together with this background theory, we also identify a class of functions used in generalized linear models [37, 39] as well as in empirical risk minimization [68] that falls into our generalized self-concordance class for many well-known loss-type functions, as listed in Table 2.

Applying our generalized self-concordant theory, we develop a class of Newton-type methods to solve the following composite convex minimization problem:

$$\begin{aligned} F^{\star }:=\min _{x\in \mathbb {R}^p}\Big \{ F(x):=f(x)+g(x) \Big \}, \end{aligned}$$
(3)

where f is a generalized self-concordant function in our context, and g is a proper, closed, and convex function that can be referred to as a regularization term. We consider two cases. The first case is a non-composite convex problem in which g vanishes (i.e., \(g = 0\)). In the second case, we assume that g is equipped with a “tractable” proximal operator [see (34) for the definition].

Our contribution Our main contribution can be summarized as follows.

  1. (a)

We generalize the self-concordant notion in [40] to a broader class of smooth convex functions, which we call generalized self-concordance. We identify several loss-type functions that can be cast into our generalized self-concordant class. We also prove several fundamental properties and show that the sum and linear transformation of generalized self-concordant functions are generalized self-concordant for a given range of \(\nu \) or under suitable assumptions.

  2. (b)

    We develop lower and upper bounds on the Hessian mapping, the gradient mapping, and the function values for generalized self-concordant functions. These estimates are key to analyze several numerical optimization methods including Newton-type methods.

  3. (c)

We propose a class of Newton methods including full-step and damped-step schemes to minimize a generalized self-concordant function. We explicitly show how to choose a suitable step-size to guarantee a descent direction in the damped-step scheme, and prove local quadratic convergence for both the damped-step and the full-step schemes using a suitable metric.

  4. (d)

    We also extend our Newton schemes to handle the composite setting (3). We develop both full-step and damped-step proximal Newton methods to solve this problem and provide a rigorous theoretical convergence guarantee in both local and global sense.

  5. (e)

    We also study a quasi-Newton variant of our Newton scheme to minimize a generalized self-concordant function. Under a modification of the well-known Dennis–Moré condition [15] or a BFGS update, we show that our quasi-Newton method locally converges at a superlinear rate to the solution of the underlying problem.

Let us emphasize the following aspects of our contribution. Firstly, since the self-concordance notion is a powerful concept that has been widely used in interior-point methods as well as in other optimization schemes [28, 35, 62, 72], generalizing it to a broader class of smooth convex functions can substantially cover a number of new applications and lead to new methods for solving old problems, including logistic and multinomial logistic regression, optimization involving exponential objectives, and distance-weighted discrimination problems in support vector machines (see Table 2 below). Secondly, since verifying the theoretical assumptions behind convergence guarantees of a Newton method is not trivial, our theory allows one to classify the underlying functions into different subclasses via the parameters \(\nu \) and \(M_{\varphi }\) in order to choose suitable algorithms for the corresponding optimization problem. Thirdly, the theory developed in this paper can potentially be applied to other optimization methods such as gradient-type, sketching and subsampling Newton, and Frank–Wolfe algorithms, as done in the literature [49, 54, 57, 58, 62]. Finally, we also show that it is possible to impose additional structure such as a self-concordant barrier to develop path-following schemes or interior-point-type methods for solving a subclass of composite convex minimization problems of the form (3). We believe that our theory is not limited to convex optimization, but can be extended to convex–concave saddle-point problems and monotone equations/inclusions involving generalized self-concordant functions [67].

Summary of generalized self-concordant properties We provide a short summary on the main properties of generalized self-concordant (gsc) functions in Table 1.

Table 1 A summary of generalized self-concordant properties

Although several results hold for a different range of \(\nu \), the complete theory only holds for \(\nu \in [2, 3]\). However, this is sufficient to cover two important cases: \(\nu = 2\) in [1, 2] and \(\nu = 3\) in [45].

Related work Since the self-concordance concept was introduced in the 1990s [45], its first extension was perhaps proposed in [1] for a class of logistic regression problems. In [63], the authors extended [1] to study a proximal Newton method for logistic, multinomial logistic, and exponential loss functions. By augmenting a strongly convex regularizer, Zhang and Lin [72] showed that the regularized logistic loss function is indeed standard self-concordant. In [2], Bach continued to exploit his result in [1] to show that the averaged stochastic gradient method can achieve the same best-known convergence rate as in the strongly convex case without adding a regularizer. In [62], the authors exploited the standard self-concordance theory in [45] to develop several classes of optimization algorithms, including proximal Newton, proximal quasi-Newton, and proximal gradient methods, to solve composite convex minimization problems. In [35], Lu extended [62] to study randomized block coordinate descent methods. In a recent paper [22], Gao and Goldfarb investigated quasi-Newton methods for self-concordant problems. As another example, [53] proposed an alternative to standard self-concordance, called self-regularity, and applied this theory to develop a new paradigm for interior-point methods. The theory developed in this paper is, on the one hand, a generalization of the well-known self-concordance notion developed in [45]; on the other hand, it also covers the work in [1, 61, 72] as specific examples. Several concrete applications and extensions of the self-concordance notion can also be found in the literature, including [28, 32, 49, 53]. Recently, [14] exploited smooth structures of exponential functions to design interior-point methods for solving two fundamental problems in scientific computing called matrix scaling and balancing.

Paper organization The rest of this paper is organized as follows. Section 2 develops the foundation theory for our generalized self-concordant functions including definitions, examples, basic properties, Fenchel’s conjugate, smoothing technique, and key bounds. Section 3 is devoted to studying full-step and damped-step Newton schemes to minimize a generalized self-concordant function including their global and local convergence guarantees. Section 4 considers the composite setting (3) and studies proximal Newton-type methods, and investigates their convergence guarantees. Section 5 deals with a quasi-Newton scheme for solving the noncomposite problem of (3). Numerical examples are provided in Sect. 6 to illustrate advantages of our theory. Finally, for clarity of presentation, several technical results and proofs are moved to the appendix.

2 Theory of generalized self-concordant functions

We generalize the class of self-concordant functions introduced by Nesterov and Nemirovskii in [40] to a broader class of smooth and convex functions. We identify several examples of such functions. Then, we develop several properties of this function class by utilizing our new definitions.

Notation Given a proper, closed, and convex function \(f:\mathbb {R}^p\rightarrow \mathbb {R}\cup \{+\,\infty \}\), we denote by \(\mathrm {dom}(f) := \left\{ x\in \mathbb {R}^p \mid f(x) <+\,\infty \right\} \) the domain of f, and by \(\partial {f}(x) := \big \{w\in \mathbb {R}^p \mid f(y) \ge f(x) + \langle w, y- x\rangle ,~\forall y\in \mathrm {dom}(f) \big \}\) the subdifferential of f at \(x\in \mathrm {dom}(f)\). We use \(\mathcal {C}^3(\mathrm {dom}(f))\) to denote the class of three times continuously differentiable functions on its open domain \(\mathrm {dom}(f)\). We denote by \(\nabla {f}\) its gradient map, by \(\nabla ^2{f}\) its Hessian map, and by \(\nabla ^3{f}\) its third-order derivative. For a twice continuously differentiable convex function f, \(\nabla ^2{f}\) is symmetric positive semidefinite, and can be written as \(\nabla ^2{f}(\cdot ) \succeq 0\). If it is positive definite, then we write \(\nabla ^2{f}(\cdot ) \succ 0\).

Let \(\mathbb {R}_+\) and \(\mathbb {R}_{++}\) denote the sets of nonnegative and positive real numbers, respectively. We use \(\mathcal {S}^p_{+}\) and \(\mathcal {S}^p_{++}\) to denote the sets of symmetric positive semidefinite and symmetric positive definite matrices of the size \(p\times p\), respectively. Given a \(p\times p\) matrix \(H\succ 0\), we define a weighted norm with respect to \(H\) as \(\left\| u\right\| _{H} := \left\langle Hu,u\right\rangle ^{1/2}\) for \(u\in \mathbb {R}^p\). The corresponding dual norm is \(\left\| v\right\| _{H}^{*} := \left\langle H^{-1}v,v\right\rangle ^{1/2}\). If \(H= \mathbb {I}\), the identity matrix, then \(\left\| u\right\| _{H}=\left\| u\right\| _{H}^{*} =\left\| u\right\| _2\), where \(\left\| \cdot \right\| _2\) is the standard Euclidean norm. Note that \(\Vert \cdot \Vert _2^{*} = \Vert \cdot \Vert _2\).

We say that f is strongly convex with the strong convexity parameter \(\mu _f > 0\) if \(f(\cdot ) - \frac{\mu _f}{2}\left\| \cdot \right\| ^2\) is convex. We also say that f has Lipschitz gradient if \(\nabla {f}\) is Lipschitz continuous with the Lipschitz constant \(L_f \in [0, +\,\infty )\), i.e., \(\Vert \nabla {f}(x) - \nabla {f}(y)\Vert ^{*} \le L_f\left\| x - y\right\| \) for all \(x, y\in \mathrm {dom}(f)\).

For \(f\in \mathcal {C}^3(\mathrm {dom}(f))\), if \(\nabla ^2 f(x)\succ 0\) at a given \(x\in \mathrm {dom}(f)\), then we define a local norm \(\left\| u\right\| _{x} := \langle \nabla ^2{f}(x)u, u\rangle ^{1/2}\) as a weighted norm of \(u\) with respect to \(\nabla ^2{f}(x)\). The corresponding dual norm \(\left\| v\right\| ^{*}_{x}\), is defined as \(\left\| v\right\| ^{*}_{x}:=\max \left\{ \langle v, u\rangle \mid \left\| u\right\| _{x}\le 1\right\} =\left\langle \nabla ^2 f(x)^{-1}v,v\right\rangle ^{1/2}\) for \(v\in \mathbb {R}^p\).
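The local norm and its dual can be computed directly from a positive definite Hessian. A minimal sketch with a hypothetical matrix \(H\) standing in for \(\nabla ^2{f}(x)\) (assuming NumPy):

```python
import numpy as np

# Local norm ||u||_x = <H u, u>^{1/2} and dual norm ||v||_x^* = <H^{-1} v, v>^{1/2}
# for a (hypothetical) positive definite Hessian H.
H = np.array([[2.0, 0.5],
              [0.5, 1.0]])
u = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])

norm_u = np.sqrt(u @ H @ u)                  # ||u||_x
dual_v = np.sqrt(v @ np.linalg.solve(H, v))  # ||v||_x^*

# Generalized Cauchy-Schwarz: <v, u> <= ||v||_x^* ||u||_x
assert abs(v @ u) <= dual_v * norm_u + 1e-12
```

The final assertion is the generalized Cauchy–Schwarz inequality implied by the \(\max \) characterization of the dual norm above.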

2.1 Univariate generalized self-concordant functions

Let \(\varphi : \mathbb {R}\rightarrow \mathbb {R}\) be a three times continuously differentiable function on the open domain \(\mathrm {dom}(\varphi )\). Then, we write \(\varphi \in \mathcal {C}^{3}\left( \mathrm {dom}(\varphi )\right) \). In this case, \(\varphi \) is convex if and only if \(\varphi ''(t) \ge 0\) for all \(t\in \mathrm {dom}(\varphi )\). We introduce the following definition.

Definition 1

Let \(\varphi : \mathbb {R}\rightarrow \mathbb {R}\) be a univariate \(\mathcal {C}^{3}\left( \mathrm {dom}(\varphi )\right) \) function with open domain \(\mathrm {dom}(\varphi )\), and let \(\nu > 0\) and \(M_{\varphi } \ge 0\) be two constants. We say that \(\varphi \) is \((M_{\varphi },\nu )\)-generalized self-concordant if

$$\begin{aligned} \vert \varphi {'''}(t)\vert \le M_{\varphi }\varphi {''}(t)^{\frac{\nu }{2}},~~~\forall t\in \mathrm {dom}(\varphi ). \end{aligned}$$
(4)

The inequality (4) also indicates that \(\varphi ''(t) \ge 0\) for all \(t\in \mathrm {dom}(\varphi )\). Hence, \(\varphi \) is convex. Clearly, if \(\varphi (t) = \frac{a}{2}t^2 + bt\) for constants \(a\ge 0\) and \(b\in \mathbb {R}\), then \(\varphi ''(t) = a\) and \(\varphi '''(t) = 0\), so (4) is automatically satisfied for any \(\nu > 0\) and \(M_{\varphi } \ge 0\), and the smallest value of \(M_{\varphi }\) is zero. Hence, any convex quadratic function is \((0, \nu )\)-generalized self-concordant for any \(\nu > 0\). While (4) also holds for any constant \(\hat{M}_{\varphi } \ge M_{\varphi }\), we often require \(M_{\varphi }\) to be the smallest constant satisfying (4).

Example 1

Let us now provide some common examples satisfying Definition 1.

  1. (a)

Standard self-concordant functions If we choose \(\nu = 3\), then (4) becomes \(\vert \varphi '''(t)\vert \le M_{\varphi }\varphi ''(t)^{3/2}\), which is the definition of standard self-concordant functions on \(\mathbb {R}\) introduced in [45].

  2. (b)

    Logistic functions In [1], Bach modified the standard self-concordant inequality in [45] to obtain \(\left| \varphi '''(t)\right| \le M_{\varphi }\varphi ''(t)\), and showed that the well-known logistic loss \(\varphi (t) := \log (1 + e^{-t})\) satisfies this definition. In [63] the authors also exploited this definition, and developed a class of first-order and second-order methods to solve composite convex minimization problems. Hence, \(\varphi (t) := \log (1 + e^{-t})\) is a generalized self-concordant function with \(M_{\varphi } = 1\) and \(\nu = 2\).

  3. (c)

Exponential functions The exponential function \(\varphi (t) := e^{-t}\) also satisfies (4) with \(M_{\varphi } = 1\) and \(\nu = 2\). This function is often used, e.g., in AdaBoost [33], or in matrix scaling [14].

  4. (d)

Distance-weighted discrimination (DWD) We consider a more general function \(\varphi (t) := \frac{1}{t^q}\) on \(\mathrm {dom}(\varphi ) = \mathbb {R}_{++}\) with \(q \ge 1\), studied in [36] for DWD as used in support vector machines. As shown in Table 2, this function satisfies Definition 1 with \(M_{\varphi } = \frac{q+2}{\root (q+2) \of {q(q+1)}}\) and \(\nu = \tfrac{2(q+3)}{q+2} \in (2, 3)\).

  5. (e)

    Entropy function We consider the well-known entropy function \(\varphi (t) := t\ln (t)\) for \(t > 0\). We can easily show that \(\vert \varphi '''(t)\vert = \tfrac{1}{t^2} = \varphi ''(t)^2\). Hence, it is generalized self-concordant with \(\nu = 4\) and \(M_{\varphi } = 1\) in the sense of Definition 1.

  6. (f)

    Arcsine distribution We consider the function \(\varphi (t) := \frac{1}{\sqrt{1 - t^2}}\) for \(t \in (-\,1, 1)\). This function is convex and smooth. Moreover, we verify that it satisfies Definition 1 with \(\nu = \frac{14}{5} \in (2, 3)\) and \(M_{\varphi } = \frac{3\sqrt{495-105\sqrt{21}}}{(7-\sqrt{21})^{7/5}} < 3.25\). We can generalize this function to \(\varphi (t) := \left[ (t-a)(b-t)\right] ^{-q}\) for \(t\in (a, b)\), where \(a < b\) and \(q > 0\). Then, we can show that \(\nu = \frac{2(q+3)}{q+2} \in (2, 3)\).

  7. (g)

Robust regression Consider a monomial function \(\varphi (t):=t^q\) for \(q\in (1,2)\), studied in [71] for robust regression as used in statistics. Then, \(M_{\varphi } = \frac{2-q}{\root (2-q) \of {q(q-1)}}\) and \(\nu = \frac{2(3-q)}{2-q}\in (4,+\,\infty )\).
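The constants claimed in items (c) and (e) are easy to confirm numerically. A small sketch (assuming NumPy) checks both on a grid in \(\mathbb {R}_{++}\):

```python
import numpy as np

# Numerical check of items (c) and (e): exp(-t) with (M, nu) = (1, 2),
# and t*ln(t) with (M, nu) = (1, 4), on a grid of positive t values.
t = np.linspace(0.01, 10.0, 1000)

# (c) phi(t) = exp(-t): phi''(t) = exp(-t), phi'''(t) = -exp(-t)
phi2, phi3 = np.exp(-t), -np.exp(-t)
assert np.all(np.abs(phi3) <= 1.0 * phi2 ** (2.0 / 2.0) + 1e-12)

# (e) phi(t) = t*ln(t): phi''(t) = 1/t, phi'''(t) = -1/t^2 = -(phi''(t))^2
phi2, phi3 = 1.0 / t, -1.0 / t ** 2
assert np.all(np.abs(phi3) <= 1.0 * phi2 ** (4.0 / 2.0) + 1e-12)
```

In both cases the bound holds with equality, so \(M_{\varphi } = 1\) is the smallest admissible constant.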

As concrete examples, the following table, Table 2, provides a non-exhaustive list of generalized self-concordant functions used in the literature.

Table 2 Examples of univariate generalized self-concordant functions (\(\mathcal {F}^{1,1}_L\) means that \(\nabla {\varphi }\) is Lipschitz continuous)

Remark 1

All examples given in Table 2 fall into the case \(\nu \ge 2\). However, we note that Definition 1 also covers [72, Lemma 1] as a special case when \(\nu \in (0, 2)\). Unfortunately, as we will see in what follows, it is unclear how to generalize several properties of generalized self-concordance from univariate to multivariate functions for \(\nu \in (0, 2)\), except for strongly convex functions.

Table 2 only provides common generalized self-concordant functions used in practice. However, it is possible to combine these functions to obtain mixture functions that preserve the generalized self-concordant inequality given in Definition 1. For instance, the barrier entropy \(t\ln (t) - \ln (t)\) is a standard self-concordant function, and it is the sum of the entropy \(t\ln (t)\) and the negative logarithmic function \(-\ln (t)\), which are generalized self-concordant with \(\nu = 4\) and \(\nu = 3\), respectively.
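The mixture claim for the barrier entropy can also be checked numerically. The sketch below (assuming NumPy, and the classical normalization \(M_{\varphi } = 2\) for \(\nu = 3\) from [45]) verifies the standard self-concordant inequality on a grid:

```python
import numpy as np

# Check that the barrier entropy t*ln(t) - ln(t) is standard self-concordant
# (nu = 3); M = 2 is the classical normalization of [45].
t = np.linspace(0.01, 100.0, 5000)
phi2 = 1.0 / t + 1.0 / t ** 2            # second derivative
phi3 = -1.0 / t ** 2 - 2.0 / t ** 3      # third derivative
assert np.all(np.abs(phi3) <= 2.0 * phi2 ** 1.5 + 1e-12)
```

Indeed, the inequality \(t + 2 \le 2(t+1)^{3/2}\) holds for all \(t > 0\), with equality as \(t \rightarrow 0^{+}\).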

2.2 Multivariate generalized self-concordant functions

Let \(f : \mathbb {R}^p\rightarrow \mathbb {R}\) be a \(\mathcal {C}^3(\mathrm {dom}(f))\) smooth and convex function with open domain \(\mathrm {dom}(f)\). Given the Hessian \(\nabla ^2f\) of f, \(x\in \mathrm {dom}(f)\), and \(u,v\in \mathbb {R}^p\), we consider the function \(\psi (t) := \langle \nabla ^2{f}(x+ tv)u, u\rangle \). Then, it is straightforward to show that

$$\begin{aligned} \psi '(t) := \langle \nabla ^3f(x+ tv)[v]u, u\rangle , \end{aligned}$$

for \(t \in \mathbb {R}\) such that \(x+ tv\in \mathrm {dom}(f)\), where \(\nabla ^3f\) is the third-order derivative of f. It is clear that \(\psi (0) = \langle \nabla ^2{f}(x)u, u\rangle = \left\| u\right\| _{x}^2\). By using the local norm, we generalize Definition 1 to multivariate functions \(f : \mathbb {R}^p\rightarrow \mathbb {R}\) as follows.

Definition 2

A \(\mathcal {C}^3\)-convex function \(f : \mathbb {R}^p\rightarrow \mathbb {R}\) is said to be an \((M_f, \nu )\)-generalized self-concordant function of the order \(\nu > 0\) and the constant \(M_f \ge 0\) if, for any \(x\in \mathrm {dom}(f)\) and \(u,v\in \mathbb {R}^p\), it holds

$$\begin{aligned} \left| \langle \nabla ^3f(x)[v]u, u\rangle \right| \le M_f\left\| u\right\| _{x}^2\left\| v\right\| _{x}^{\nu -2}\left\| v\right\| _2^{3-\nu }. \end{aligned}$$
(5)

Here, we use a convention that \(\frac{0}{0} = 0\) for the case \(\nu < 2\) or \(\nu > 3\). We denote this class of functions by \(\widetilde{\mathcal {F}}_{M_f,\nu }(\mathrm {dom}(f))\) (shortly, \(\widetilde{\mathcal {F}}_{M_f,\nu }\) when \(\mathrm {dom}(f)\) is explicitly defined).

Let us consider the following two extreme cases:

  1. 1.

    If \(\nu = 2\), (5) leads to \(\left| \langle \nabla ^3f(x)[v]u, u\rangle \right| \le M_f\left\| u\right\| _{x}^2\left\| v\right\| _2\) which collapses to the definition introduced in [1] by letting \(u= v\).

  2. 2.

If \(\nu = 3\) and \(u= v\), then (5) reduces to \(\left| \langle \nabla ^3f(x)[u]u, u\rangle \right| \le M_f\left\| u\right\| _{x}^3\), and Definition 2 becomes the standard self-concordant definition introduced in [40, 45].

We emphasize that Definition 2 is not symmetric, but can avoid the use of multilinear mappings as required in [1, 45]. However, by [45, Proposition 9.1.1] or [40, Lemma 4.1.2], Definition 2 with \(\nu =3\) is equivalent to [40, Definition 4.1.1] for standard self-concordant functions.
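Definition 2 can be probed numerically. The sketch below (assuming NumPy) checks (5) with \(\nu = 2\) for \(f(x) = \sum _i e^{a_i^{\top }x}\), using the constant \(M_f = \max _i \Vert a_i\Vert _2\) suggested by the calculus rules established in Sect. 2.3:

```python
import numpy as np

# Verify inequality (5) with nu = 2 for f(x) = sum_i exp(a_i^T x), which the
# sum and affine-transformation rules suggest is (max_i ||a_i||_2, 2)-gsc.
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 4))          # rows a_i
M = np.max(np.linalg.norm(A, axis=1))

for _ in range(200):
    x, u, v = rng.standard_normal((3, 4))
    e = np.exp(A @ x)
    H = A.T @ (A * e[:, None])           # Hessian of f at x
    D3 = np.sum(e * (A @ v) * (A @ u) ** 2)  # <grad^3 f(x)[v]u, u>
    # nu = 2: ||v||_x^{nu-2} = 1, so the bound is M ||u||_x^2 ||v||_2
    assert abs(D3) <= M * (u @ H @ u) * np.linalg.norm(v) + 1e-9
```

The inequality holds deterministically here, since \(\vert \sum _i e_i (a_i^{\top }v)(a_i^{\top }u)^2\vert \le \max _i \vert a_i^{\top }v\vert \cdot \langle \nabla ^2 f(x) u, u\rangle \).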

2.3 Basic properties of generalized self-concordant functions

We first show that if \(f_1\) and \(f_2\) are two generalized self-concordant functions, then \(\beta _1 f_1 + \beta _2 f_2\) is also generalized self-concordant for any \(\beta _1, \beta _2 > 0\) according to Definition 2.

Proposition 1

(Sum of generalized self-concordant functions) Let \(f_i\) be \((M_{f_i},\nu )\)-generalized self-concordant functions satisfying (5), where \(M_{f_i} \ge 0\) and \(\nu \ge 2\) for \(i=1,\ldots , m\). Then, for \(\beta _i > 0\), \(i=1,2,\ldots ,m\), the function \(f(x) := \sum _{i=1}^m\beta _i f_i(x)\) is well-defined on \(\mathrm {dom}(f ) = \bigcap _{i=1}^m\mathrm {dom}(f_i)\), and is \((M_f, \nu )\)-generalized self-concordant with the same order \(\nu \ge 2\) and the constant

$$\begin{aligned} M_f := \max \left\{ \beta _i^{1-\frac{\nu }{2}}M_{f_i} \mid 1 \le i \le m\right\} \ge 0. \end{aligned}$$

Proof

It is sufficient to prove for \(m=2\). For \(m > 2\), it follows from \(m=2\) by induction. By [40, Theorem 3.1.5], f is a closed and convex function. In addition, \(\mathrm {dom}(f) = \mathrm {dom}(f_1)\cap \mathrm {dom}(f_2)\). Let us fix some \(x\in \mathrm {dom}(f)\) and \(u,v\in \mathbb {R}^p\). Then, by Definition 2, we have

$$\begin{aligned} \left| \langle \nabla ^3f_i(x)[v]u, u\rangle \right| \le M_{f_i}\langle \nabla ^2 f_i(x)u,u\rangle \langle \nabla ^2 f_i(x)v,v\rangle ^{\frac{\nu -2}{2}}\left\| v\right\| _2^{3 - \nu }, ~~~i=1,2. \end{aligned}$$

Denote \(w_i := \langle \nabla ^2{f}_i(x)u,u\rangle \ge 0\) and \(s_i := \langle \nabla ^2{f}_i(x)v,v\rangle \ge 0\) for \(i=1,2\). We can derive

$$\begin{aligned} \frac{\left| \langle \nabla ^3f(x)[v]u, u\rangle \right| }{\langle \nabla ^2 f(x)u,u\rangle \langle \nabla ^2 f(x)v,v\rangle ^{\frac{\nu -2}{2}}}\le & {} \frac{\beta _1\left| \langle \nabla ^3f_1(x)[v]u, u\rangle \right| +\beta _2\left| \langle \nabla ^3f_2(x)[v]u, u\rangle \right| }{\langle \nabla ^2 f(x)u,u\rangle \langle \nabla ^2 f(x)v,v\rangle ^{\frac{\nu -2}{2}}} \nonumber \\\le & {} \left[ \frac{M_{f_1}\beta _1w_1s_1^{\frac{\nu -2}{2}} + M_{f_2}\beta _2w_2s_2^{\frac{\nu -2}{2}}}{(\beta _1w_1+\beta _2w_2)(\beta _1s_1+\beta _2s_2)^{\frac{\nu -2}{2}}}\right] _{[T]}\left\| v\right\| _2^{3-\nu }.\nonumber \\ \end{aligned}$$
(6)

Let \(\xi := \frac{\beta _1w_1}{\beta _1w_1+\beta _2w_2} \in [0, 1]\) and \(\eta := \frac{\beta _1s_1}{\beta _1s_1+\beta _2s_2} \in [0, 1]\). Then, \(\tfrac{\beta _2w_2}{\beta _1w_1+\beta _2w_2} = 1 - \xi \ge 0\) and \(\tfrac{\beta _2s_2}{\beta _1s_1 + \beta _2s_2} = 1 - \eta \ge 0\). Hence, the term [T] in the square brackets of (6) becomes

$$\begin{aligned} h(\xi ,\eta ):=\beta _1^{1-\frac{\nu }{2}}M_{f_1}\xi \eta ^{\frac{\nu -2}{2}}+\beta _2^{1-\frac{\nu }{2}}M_{f_2}(1-\xi )(1-\eta )^{\frac{\nu -2}{2}},~~~ \xi ,\eta \in [0,1]. \end{aligned}$$

Since \(\nu \ge 2\) and \(\xi , \eta \in [0, 1]\), we can upper bound \(h(\xi ,\eta )\) as

$$\begin{aligned} h(\xi ,\eta ) \le \beta _1^{1-\frac{\nu }{2}}M_{f_1}\xi + \beta _2^{1-\frac{\nu }{2}}M_{f_2}(1-\xi ), ~~~\forall \xi \in [0, 1]. \end{aligned}$$

The right-hand side function is linear in \(\xi \) on [0, 1]. It achieves the maximum at its boundary. Hence, we have

$$\begin{aligned} \max _{\xi \in [0,1],\eta \in [0,1]}h(\xi ,\eta ) \le \max \left\{ \beta _1^{1-\frac{\nu }{2}}M_{f_1}, \beta _2^{1-\frac{\nu }{2}}M_{f_2}\right\} . \end{aligned}$$

Using this estimate into (6), we can show that \(f(\cdot ) := \beta _1f_1(\cdot ) + \beta _2f_2(\cdot )\) is \((M_f,\nu )\)-generalized self-concordant with \(M_f := \max \left\{ \beta _1^{1-\frac{\nu }{2}}M_{f_1}, \beta _2^{1-\frac{\nu }{2}}M_{f_2}\right\} \). \(\square \)

Using Proposition 1, we can also see that if f is \((M_f,\nu )\)-generalized self-concordant, and \(\beta > 0\), then \(g(x) := \beta f(x)\) is also \((M_g,\nu )\)-generalized self-concordant with the constant \(M_g := \beta ^{1-\frac{\nu }{2}}M_f\). The convex quadratic function \(q(x) := \frac{1}{2}\langle Qx,x\rangle + c^{\top }x\) with \(Q\in \mathcal {S}^p_{+}\) is \((0, \nu )\)-generalized self-concordant for any \(\nu > 0\). Hence, by Proposition 1, if f is \((M_f,\nu )\)-generalized self-concordant, then \(f(x) + \frac{1}{2}\langle Qx,x\rangle + c^{\top }x\) is also \((M_f, \nu )\)-generalized self-concordant.
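The scaling rule \(M_g = \beta ^{1-\frac{\nu }{2}}M_f\) can be verified directly for the entropy function. A small numerical sketch (assuming NumPy):

```python
import numpy as np

# Check the scaling rule: if f is (M_f, nu)-gsc, then g = beta*f is
# (beta^{1 - nu/2} M_f, nu)-gsc. Entropy example: f(t) = t*ln(t) is
# (1, 4)-gsc, so g = beta*f should be (beta^{-1}, 4)-gsc.
beta, nu, Mf = 4.0, 4.0, 1.0
t = np.linspace(0.01, 50.0, 2000)
g2 = beta * (1.0 / t)                 # g''(t) = beta * f''(t)
g3 = beta * (-1.0 / t ** 2)           # g'''(t) = beta * f'''(t)
Mg = beta ** (1.0 - nu / 2.0) * Mf    # = 1/beta = 0.25
assert np.all(np.abs(g3) <= Mg * g2 ** (nu / 2.0) + 1e-12)
```

Here \(\vert g'''(t)\vert = \beta /t^2\) and \(M_g\, g''(t)^{2} = \beta ^{-1}\beta ^2/t^2 = \beta /t^2\), so the bound is tight.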

Next, we consider an affine transformation of a generalized self-concordant function.

Proposition 2

(Affine transformation) Let \(\mathcal {A}(x) := Ax+ b\) be an affine transformation from \(\mathbb {R}^p\) to \(\mathbb {R}^q\), and f be an \((M_f,\nu )\)-generalized self-concordant function with \(\nu > 0\). Then, the following statements hold:

  1. (a)

    If \(\nu \in (0, 3]\), then \(g(x) := f(\mathcal {A}(x))\) is \((M_{g},\nu )\)-generalized self-concordant with \(M_g := M_f\left\| A\right\| ^{3-\nu }\).

  2. (b)

    If \(\nu > 3\) and \(\lambda _{\min }(A^{\top }A) > 0\), then \(g(x) := f(\mathcal {A}(x))\) is \((M_{g},\nu )\)-generalized self-concordant with \(M_g := M_f\lambda _{\min }(A^{\top }A)^{\frac{3-\nu }{2}}\), where \(\lambda _{\min }(A^{\top }A)\) is the smallest eigenvalue of \(A^{\top }A\).

Proof

Since \(g(x) = f(\mathcal {A}(x)) = f(Ax+ b)\), it is straightforward to show that \(\nabla ^2g(x) = A^{\top }\nabla ^2f(\mathcal {A}(x))A\) and \(\nabla ^3g(x)[v] = A^{\top }(\nabla ^3f(\mathcal {A}(x))[Av])A\). Let us denote \(\tilde{x} := Ax+ b\), \(\tilde{u} := Au\), and \(\tilde{v} := Av\). Then, using Definition 2, we have

$$\begin{aligned} \left| \langle \nabla ^3g(x)[v]u, u\rangle \right| = \left| \langle \nabla ^3f(\tilde{x})[\tilde{v}]\tilde{u}, \tilde{u}\rangle \right| \le M_f\langle \nabla ^2f(\tilde{x})\tilde{u},\tilde{u}\rangle \langle \nabla ^2f(\tilde{x})\tilde{v},\tilde{v}\rangle ^{\frac{\nu -2}{2}}\Vert \tilde{v}\Vert _2^{3-\nu } = M_f\langle \nabla ^2g(x)u,u\rangle \langle \nabla ^2g(x)v,v\rangle ^{\frac{\nu -2}{2}}\Vert Av\Vert _2^{3-\nu }. \end{aligned}$$
(7)

(a) If \(\nu \in (0, 3]\), then we have \(\Vert Av\Vert _2^{3-\nu } \le \Vert A\Vert ^{3-\nu }\Vert v\Vert _2^{3-\nu }\). Hence, the last inequality (7) implies

$$\begin{aligned} \vert \langle \nabla ^3g(x)[v]u, u\rangle \vert \le M_f \Vert A\Vert ^{3-\nu }\langle \nabla ^2g(x)u, u\rangle \langle \nabla ^2g(x)v, v\rangle ^{\frac{\nu }{2}-1}\Vert v\Vert _2^{3-\nu }, \end{aligned}$$

which shows that g is \((M_g, \nu )\)-generalized self-concordant with \(M_g := M_f \Vert A\Vert ^{3-\nu }\).

(b) Note that \(\Vert Av\Vert _2^2 = v^{\top }A^{\top }Av\ge \lambda _{\min }(A^{\top }A)\left\| v\right\| _2^2 \ge 0\), where \(\lambda _{\min }(A^{\top }A)\) is the smallest eigenvalue of \(A^{\top }A\). If \(\lambda _{\min }(A^{\top }A) > 0\) and \(\nu > 3\), then we have \(\Vert Av\Vert _2^{3-\nu } \le \lambda _{\min }(A^{\top }A)^{\frac{3-\nu }{2}}\left\| v\right\| _2^{3-\nu }\). Combining this estimate and (7), we can show that g is \((M_g, \nu )\)-generalized self-concordant with \(M_g := M_f\lambda _{\min }(A^{\top }A)^{\frac{3-\nu }{2}}\). \(\square \)

Remark 2

Proposition 2 shows that generalized self-concordance is preserved under affine transformations if \(\nu \in (0, 3]\). If \(\nu > 3\), then it additionally requires \(A\) to have full column rank, i.e., \(\lambda _{\min }(A^{\top }A) > 0\). Hence, the theory developed in the sequel remains applicable for \(\nu > 3\) whenever \(A\) has full column rank.
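The two cases of Proposition 2 can be packaged into a small helper; the snippet below is an illustrative sketch (the function name is ours), using the spectral norm \(\Vert A\Vert \) for \(\nu \le 3\) and \(\lambda _{\min }(A^{\top }A)\) for \(\nu > 3\).

```python
import numpy as np

def affine_gsc_constant(M_f, nu, A):
    """Constant M_g for g(x) = f(Ax + b) as in Proposition 2 (illustrative sketch)."""
    if nu <= 3.0:
        return M_f * np.linalg.norm(A, 2) ** (3.0 - nu)      # spectral norm ||A||
    lam_min = float(np.linalg.eigvalsh(A.T @ A).min())
    if lam_min <= 0.0:
        raise ValueError("for nu > 3, A must have full column rank")
    return M_f * lam_min ** ((3.0 - nu) / 2.0)

A = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
M1 = affine_gsc_constant(2.0, 3.0, A)   # nu = 3: M_g = M_f, since ||A||^0 = 1
M2 = affine_gsc_constant(2.0, 4.0, A)   # nu = 4: M_g = M_f * lam_min(A^T A)^{-1/2}
```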

The following result is an extension of standard self-concordant functions \((\nu = 3)\), whose proof is very similar to that of [40, Theorems 4.1.3, 4.1.4], replacing the parameters \(M_f = 2\) and \(\nu = 3\) with the general parameters \(M_f \ge 0\) and \(\nu > 0\) (or \(\nu \ge 2\)), respectively. We omit the detailed proof.

Proposition 3

Let f be an \((M_f,\nu )\)-generalized self-concordant function with \(\nu > 0\). Then:

  1. (a)

    If \(\nu \ge 2\) and \(\mathrm {dom}(f)\) contains no straight line, then \(\nabla ^2{f}(x) \succ 0\) for any \(x\in \mathrm {dom}(f)\).

  2. (b)

If \(\mathrm {bd}(\mathrm {dom}(f))\), the boundary of \(\mathrm {dom}(f)\), is nonempty, then, for any \(\bar{x}\in \mathrm {bd}(\mathrm {dom}(f))\) and any sequence \(\left\{ x_k\right\} \subset \mathrm {dom}(f)\) such that \(\lim _{k\rightarrow \infty }x_k = \bar{x}\), we have \(\lim _{k\rightarrow \infty }f(x_k) = +\infty \).

Note that Proposition 3(a) only holds for \(\nu \ge 2\). If we consider \(g(x) := f(\mathcal {A}(x))\) for a given affine operator \(\mathcal {A}(x) = Ax + b\), then the nondegeneracy of \(\nabla ^2g\) is only guaranteed if A has full column rank. Otherwise, \(\nabla ^2g\) is nondegenerate only on a subspace determined by A.

2.4 Generalized self-concordant functions with special structures

We first show that if a generalized self-concordant function is strongly convex or has a Lipschitz gradient, then it can be cast into the special case \(\nu = 2\) or \(\nu = 3\).

Proposition 4

Let \(f\in \widetilde{\mathcal {F}}_{M_f,\nu }\) be an \((M_f,\nu )\)-generalized self-concordant function with \(\nu > 0\). Then:

  1. (a)

    If \(\nu \in (0, 3]\) and f is also strongly convex on \(\mathrm {dom}(f)\) with the strong convexity parameter \(\mu _f > 0\) in \(\ell _2\)-norm, then f is also \((\hat{M}_f, \hat{\nu })\)-generalized self-concordant with \(\hat{\nu } = 3\) and \(\hat{M}_f := \frac{M_f}{(\sqrt{\mu _f})^{3-\nu }}\).

  2. (b)

    If \(\nu \ge 2\) and \(\nabla {f}\) is Lipschitz continuous with the Lipschitz constant \(L_f\in [0,+\,\infty )\) in \(\ell _2\)-norm, then f is also \((\hat{M}_f, \hat{\nu })\)-generalized self-concordant with \(\hat{\nu } = 2\) and \(\hat{M}_f := M_fL_f^{\frac{\nu }{2}-1}\).

Proof

(a) If f is strongly convex with the strong convexity parameter \(\mu _f > 0\) in \(\ell _2\)-norm, then we have \(\langle \nabla ^2{f}(x)v, v\rangle \ge \mu _f\Vert v\Vert _2^2\) for any \(v\in \mathbb {R}^p\). Hence, \(\frac{\left\| v\right\| _{2}}{\Vert v\Vert _{x}} \le \frac{1}{\sqrt{\mu _f}}\). In this case, (5) leads to

$$\begin{aligned} \left| \langle \nabla ^3f(x)[v]u, u\rangle \right| \le M_f\left\| u\right\| _{x}^2\left( \frac{\left\| v\right\| _2}{\Vert v\Vert _{x}}\right) ^{3-\nu }\left\| v\right\| _{x} \le \frac{M_f}{(\sqrt{\mu _f})^{3-\nu }}\Vert u\Vert _{x}^2\Vert v\Vert _{x}. \end{aligned}$$

Hence, f is \((\hat{M}_f, \hat{\nu })\)-generalized self-concordant with \(\hat{\nu } = 3\) and \(\hat{M}_f := \frac{M_f}{(\sqrt{\mu _f})^{3-\nu }}\).

(b) Since \(\nabla {f}\) is Lipschitz continuous with the Lipschitz constant \(L_f\in [0,+\infty )\) in \(\ell _2\)-norm, we have \(\Vert v\Vert _{x}^2 = \langle \nabla ^2{f}(x)v, v\rangle \le L_f\Vert v\Vert _2^2\) for all \(v\in \mathbb {R}^p\), which leads to \(\frac{\Vert v\Vert _{x}}{\Vert v\Vert _2} \le \sqrt{L_f}\) for all \(v\in \mathbb {R}^p\). On the other hand, since \(f\in \widetilde{\mathcal {F}}_{M_f,\nu }\) with \(\nu \ge 2\), we can show that

$$\begin{aligned} \left| \langle \nabla ^3f(x)[v]u, u\rangle \right| \le M_f\left\| u\right\| _{x}^2\left( \frac{\left\| v\right\| _{x}}{\Vert v\Vert _2}\right) ^{\nu -2}\left\| v\right\| _2 \le M_fL_f^{\frac{\nu -2}{2}}\Vert u\Vert _{x}^2\Vert v\Vert _2. \end{aligned}$$

Hence, f is also \((\hat{M}_f, \hat{\nu })\)-generalized self-concordant with \(\hat{\nu } = 2\) and \(\hat{M}_f := M_fL_f^{\frac{\nu -2}{2}}\). \(\square \)

Proposition 4 provides two important properties. If the gradient map \(\nabla {f}\) of a generalized self-concordant function f is Lipschitz continuous, then we can always reduce f to the special case \(\nu = 2\). Therefore, we can exploit both structures, generalized self-concordance and Lipschitz gradient, to develop better algorithms. The same idea applies to generalized self-concordant and strongly convex functions.
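The parameter conversions of Proposition 4 are simple enough to state as code; the following sketch (with hypothetical helper names) maps \((M_f, \nu )\) together with either \(\mu _f\) or \(L_f\) to the new pair \((\hat{M}_f, \hat{\nu })\).

```python
def to_nu_3(M_f, nu, mu_f):
    """Proposition 4(a): nu in (0, 3] plus mu_f-strong convexity gives nu_hat = 3."""
    assert 0.0 < nu <= 3.0 and mu_f > 0.0
    return M_f / mu_f ** ((3.0 - nu) / 2.0), 3.0

def to_nu_2(M_f, nu, L_f):
    """Proposition 4(b): nu >= 2 plus an L_f-Lipschitz gradient gives nu_hat = 2."""
    assert nu >= 2.0 and L_f >= 0.0
    return M_f * L_f ** (nu / 2.0 - 1.0), 2.0
```

For instance, `to_nu_3(1.0, 2.0, 0.25)` returns `(2.0, 3.0)`, since \(\hat{M}_f = M_f/\sqrt{\mu _f} = 1/0.5\).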

Given n smooth convex univariate functions \(\varphi _i : \mathbb {R}\rightarrow \mathbb {R}\) satisfying (4) for \(i=1,\ldots , n\) with the same order \(\nu > 0\), we consider the function \(f : \mathbb {R}^p\rightarrow \mathbb {R}\) defined by the following form:

$$\begin{aligned} f(x) := \frac{1}{n}\sum _{i=1}^n\varphi _i(a_i^{\top }x+ b_i), \end{aligned}$$
(8)

where \(a_i\in \mathbb {R}^p\) and \(b_i\in \mathbb {R}\) are given vectors and numbers, respectively, for \(i=1,\ldots , n\). This convex function is called a finite-sum and is widely used in machine learning and statistics. The decomposable structure in (8) often appears in generalized linear models [7, 11] and empirical risk minimization [72], where \(\varphi _i\) is referred to as a loss function, as can be found, e.g., in Table 2.

Next, we show that if \(\varphi _i\) is generalized self-concordant with \(\nu \in [2,3]\), then f is also generalized self-concordant. This result is a direct consequence of Propositions 1 and 2.

Corollary 1

If \(\varphi _i\) in (8) satisfies (4) for \(i=1,\ldots , n\) with the same order \(\nu \in [2, 3]\) and \(M_{\varphi _i} \ge 0\), then f defined by (8) is also \((M_f,\nu )\)-generalized self-concordant in the sense of Definition 2 with the same order \(\nu \) and the constant \(M_f := n^{\frac{\nu }{2}-1}\max \left\{ M_{\varphi _i}\left\| a_i\right\| _2^{3-\nu } \mid 1 \le i \le n\right\} \).

Finally, we show that if we regularize f in (8) by a strongly convex quadratic term, then the resulting function becomes self-concordant. The proof can follow the same path as [72, Lemma 2].

Proposition 5

Let \(f(x) := \frac{1}{n}\sum _{i=1}^n\varphi _i(a_i^{\top }x+ b_i) + \psi (x)\), where \(\psi (x) := \frac{1}{2}\langle Qx,x\rangle + c^{\top }x\) is a strongly convex quadratic function with \(Q\in \mathcal {S}^p_{++}\). If \(\varphi _i\) satisfies (4) for \(i=1,\ldots , n\) with the same order \(\nu \in (0, 3]\) and a constant \(M_{\varphi _i} > 0\), then f is \((\hat{M}_f, 3)\)-generalized self-concordant in the sense of Definition 2 with \(\hat{M}_f := \lambda _{\min }(Q)^{\frac{\nu - 3}{2}}\max \left\{ M_{\varphi _i} \Vert a_i\Vert _2^{3 - \nu } \mid 1\le i\le n\right\} \).
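The constants in Corollary 1 and Proposition 5 can be computed directly from \(M_{\varphi _i}\) and \(\Vert a_i\Vert _2\); the snippet below is a small illustrative sketch with hypothetical function names and arbitrary example data.

```python
def finite_sum_constant(M_phis, a_norms, nu, n):
    # Corollary 1: M_f = n^{nu/2 - 1} * max_i { M_phi_i * ||a_i||_2^{3 - nu} }
    return n ** (nu / 2.0 - 1.0) * max(M * na ** (3.0 - nu)
                                       for M, na in zip(M_phis, a_norms))

def regularized_constant(M_phis, a_norms, nu, lam_min_Q):
    # Proposition 5: M_hat = lam_min(Q)^{(nu-3)/2} * max_i { M_phi_i * ||a_i||_2^{3 - nu} }
    return lam_min_Q ** ((nu - 3.0) / 2.0) * max(M * na ** (3.0 - nu)
                                                 for M, na in zip(M_phis, a_norms))

# arbitrary example data: three losses with nu = 2 and M_phi_i = 1
M_f = finite_sum_constant([1.0, 1.0, 1.0], [1.0, 2.0, 0.5], 2.0, 3)        # -> 2.0
M_hat = regularized_constant([1.0, 1.0, 1.0], [1.0, 2.0, 0.5], 2.0, 4.0)   # -> 1.0
```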

2.5 Fenchel’s conjugate of generalized self-concordant functions

The primal-dual theory is fundamental in convex optimization. Hence, it is important to study the Fenchel conjugate of generalized self-concordant functions.

Let \(f : \mathbb {R}^p\rightarrow \mathbb {R}\) be an \((M_f, \nu )\)-generalized self-concordant function. We consider Fenchel’s conjugate \(f^{*}\) of f as

$$\begin{aligned} f^{*}(x) = \sup _u\left\{ \langle x, u\rangle - f(u) \mid u\in \mathrm {dom}(f) \right\} . \end{aligned}$$
(9)

Since f is proper, closed, and convex, \(f^{*}\) is well-defined and also proper, closed, and convex. Moreover, since f is smooth and convex, by Fermat’s rule, if \(u^{*}(x)\) satisfies \(\nabla {f}(u^{*}(x)) = x\), then \(f^{*}\) is well-defined at x. This shows that \(\mathrm {dom}(f^{*}) = \left\{ x\in \mathbb {R}^p \mid \nabla {f}(u^{*}(x)) = x~\text {is solvable}\right\} \).

Example 2

Let us look at some univariate functions. By using (9), we can directly show that:

  1. 1.

    If \(\varphi (s) = \log (1 + e^{s})\), then \(\varphi ^{*}(t) = t\log (t) + (1-t)\log (1-t)\).

  2. 2.

    If \(\varphi (s) = s\log (s)\), then \(\varphi ^{*}(t) = e^{t-1}\).

  3. 3.

    If \(\varphi (s) = e^s\), then \(\varphi ^{*}(t) = t\log (t) - t\).

Intuitively, these examples show that if \(\varphi \) is generalized self-concordant, then its conjugate \(\varphi ^{*}\) is also generalized self-concordant. For more examples, we refer to [3, Chapter 13]. Let us generalize this result in the following proposition whose proof is given in Appendix A.1.
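One can check such conjugate pairs numerically by maximizing \(\langle x, u\rangle - f(u)\) over a grid, as in (9). The snippet below does this for \(\varphi (s) = e^s\); the grid bounds and the test point are arbitrary choices of ours.

```python
import numpy as np

# phi(s) = e^s has Fenchel conjugate phi*(t) = t ln t - t (for t > 0).
t = 2.5
s = np.linspace(-10.0, 10.0, 200_001)        # grid for the sup in (9)
numeric = float(np.max(t * s - np.exp(s)))   # sup_s { t*s - e^s }, approximately
closed_form = t * np.log(t) - t              # the sup is attained at s = ln t
```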

Proposition 6

If f is \((M_f, \nu )\)-generalized self-concordant in \(\mathrm {dom}(f)\subseteq \mathbb {R}^p\) such that \(\nabla ^2{f}(x)\succ 0\) for \(x\in \mathrm {dom}(f)\), then the conjugate function \(f^{*}\) of f given by (9) is well-defined, and \((M_{f^{*}}, \nu _{*})\)-generalized self-concordant on

$$\begin{aligned} \mathrm {dom}(f^{*}) := \left\{ x\in \mathbb {R}^p \mid f(u) - \langle x, u\rangle ~\text {is bounded from below on}~\mathrm {dom}(f)\right\} , \end{aligned}$$

where \(M_{f^{*}} = M_f\) and \(\nu _{*} = 6 - \nu \) provided that \(\nu \in [3, 6)\) if \(p > 1\) and \(\nu \in (0, 6)\) if \(p = 1\).

Moreover, we have \(\nabla {f^{*}}(x) = u^{*}(x)\) and \(\nabla ^2{f^{*}}(x) = \nabla ^2{f}(u^{*}(x))^{-1}\), where \(u^{*}(x)\) is a unique solution of the maximization problem \(\max _{u}\left\{ \langle x, u\rangle - f(u) \mid u\in \mathrm {dom}(f)\right\} \) in (9) for any \(x\in \mathrm {dom}(f^{*})\).

Proposition 6 allows us to apply our generalized self-concordance theory in this paper to the dual problem of a convex problem involving generalized self-concordant functions, especially, when the objective function of the primal problem is generalized self-concordant with \(\nu \in (3, 4]\). The Fenchel conjugates are certainly useful when we develop optimization algorithms to solve constrained convex optimization involving generalized self-concordant functions, see, e.g., [65, 66].

2.6 Generalized self-concordant approximation of nonsmooth convex functions

Several well-known convex functions are nonsmooth. However, they can be approximated (up to an arbitrary accuracy) by a generalized self-concordant function via smoothing. Smoothing techniques clearly allow us to enrich the applicability of our theory to nonsmooth convex problems.

Let \(f : \mathbb {R}^p\rightarrow \mathbb {R}\cup \{+\infty \}\) be a proper, closed, possibly nonsmooth, convex function. One can smooth f using the following Nesterov's smoothing technique [41]:

$$\begin{aligned} f_{\gamma }(x) := \sup _{u\in \mathrm {dom}(f^{*})}\left\{ \langle x, u\rangle - f^{*}(u) - \gamma \omega (u)\right\} , \end{aligned}$$
(10)

where \(f^{*}\) is the Fenchel conjugate of f, \(\omega : \mathrm {dom}(\omega ) \subseteq \mathbb {R}^p\rightarrow \mathbb {R}\) is a smooth convex function such that \(\mathrm {dom}(f^{*})\subseteq \mathrm {dom}(\omega )\), and \(\gamma > 0\) is called a smoothness parameter. In particular, if f is Lipschitz continuous, then \(\mathrm {dom}(f^{*})\) is bounded [3]. Hence, the \(\sup \) operator in (10) reduces to the \(\max \) operator.

Our goal is to choose an appropriate smoothing function \(\omega \) such that the smoothed function \(f_{\gamma }\) is well-defined and generalized self-concordant for any fixed smoothness parameter \(\gamma > 0\).

Example 3

Let us provide a few examples of well-known nonsmooth convex functions:

  1. (a)

     Consider the \(\ell _1\)-norm function \(f(x) := \left\| x\right\| _1\) in \(\mathbb {R}^p\). Then, it can be rewritten as

    $$\begin{aligned} \left\| x\right\| _1= & {} \max _{u}\left\{ \langle x, u\rangle \mid \left\| u\right\| _{\infty } \le 1\right\} \\= & {} \max _{u, v}\big \{ \langle x, u - v\rangle \mid u_i + v_i = 1,~i = 1,\ldots ,p, ~u,v\in \mathbb {R}^p_{+}\big \}. \end{aligned}$$

    We can smooth this function by choosing \(\omega (u, v) := p\ln (2) + \sum _{i=1}^p(u_i\ln (u_i) + v_i\ln (v_i))\). In this case, we obtain \(f_{\gamma }(x) = \gamma \sum _{i=1}^p\ln \left( e^{x_i/\gamma } + e^{-x_i/\gamma }\right) - \gamma p\ln (2)\). This function is clearly generalized self-concordant with \(\nu = 2\), see [63, Lemma 4].

    However, if we choose \(\omega (u) := p - \sum _{i=1}^p\sqrt{1 - u_i^2}\), then we get \(f_{\gamma }(x) = \sum _{i=1}^p\sqrt{x_i^2 + \gamma ^2} - \gamma p\). In this case, \(f_{\gamma }\) is also generalized self-concordant with \(\nu = \frac{8}{3}\) and \(M_{f_{\gamma }} = 3\gamma ^{-\frac{2}{3}}\).

  2. (b)

    The hinge loss function \(\varphi (t) := \max \left\{ 0, 1 - t\right\} \) can be written as \(\varphi (t) = \frac{1}{2}\left| 1 - t\right| + \frac{1}{2}(1 - t)\). Hence, we can smooth this function by \(\varphi _{\gamma }(t) := \frac{\gamma }{2}\ln \left( \frac{e^{\frac{(1-t)}{\gamma }} + e^{-\frac{(1-t)}{\gamma }}}{2}\right) + \frac{1}{2}(1-t)\) with a smoothness parameter \(\gamma > 0\), so that \(\varphi _{\gamma }(t)\rightarrow \varphi (t)\) as \(\gamma \downarrow 0\). Clearly, \(\varphi _{\gamma }\) is generalized self-concordant with \(\nu = 2\).
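As an illustration of the second smoothing in Example 3(a), the snippet below implements \(f_{\gamma }(x) = \sum _{i}\sqrt{x_i^2+\gamma ^2} - \gamma p\) and checks the elementary uniform bound \(|f_{\gamma }(x) - \Vert x\Vert _1| \le \gamma p\), which follows from \(|x_i| \le \sqrt{x_i^2+\gamma ^2} \le |x_i| + \gamma \); the test vector is arbitrary.

```python
import numpy as np

def l1_smoothed(x, gamma):
    """f_gamma(x) = sum_i sqrt(x_i^2 + gamma^2) - gamma * p (Example 3(a), second smoothing)."""
    return float(np.sum(np.sqrt(x ** 2 + gamma ** 2)) - gamma * x.size)

x = np.array([1.0, -2.0, 0.0, 0.5])   # arbitrary test vector
for gamma in (1.0, 0.1, 0.01):
    # uniform approximation: |f_gamma(x) - ||x||_1| <= gamma * p
    assert abs(l1_smoothed(x, gamma) - np.sum(np.abs(x))) <= gamma * x.size
```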

In many practical problems, the conjugate \(f^{*}\) of f can be written as the sum \(f^{*} = \varphi + \delta _{\mathcal {U}}\), where \(\varphi \) is a generalized self-concordant function, and \(\delta _{\mathcal {U}}\) is the indicator function of a given nonempty, closed, and convex set \(\mathcal {U}\). In this case, \(f_{\gamma }\) in (10) becomes

$$\begin{aligned} f_{\gamma }(x) := \sup _{u}\left\{ \langle x, u\rangle - \varphi (u) - \gamma \omega (u) \mid u\in \mathcal {U}\right\} . \end{aligned}$$
(11)

If \(\omega \) is a generalized self-concordant function such that \(\nu _{\varphi } = \nu _{\omega }\), and \( \mathcal {U}= \overline{\mathrm {dom}(\omega )\cap \mathrm {dom}(\varphi )}\), then \(f_{\gamma }\) is generalized self-concordant with \(\nu _{f_{\gamma }} = 6 - \nu _{\varphi }\) as shown in Proposition 6.

2.7 Key bounds on Hessian map, gradient map, and function values

Now, we develop some key bounds on the local norms, Hessian map, gradient map, and function values of generalized self-concordant functions. In this subsection, we assume that the Hessian map \(\nabla ^2{f}\) of f is nondegenerate at any point in its domain.

For this purpose, given \(\nu \ge 2\), we define the following quantity for any \(x, y\in \mathrm {dom}(f)\):

$$\begin{aligned} d_{\nu }(x,y) := {\left\{ \begin{array}{ll} M_f\left\| y- x\right\| _2 &{}\quad \text {if }\,\,\nu = 2\\ \left( \frac{\nu }{2}-1\right) M_f\left\| y-x\right\| _2^{3-\nu }\Vert y- x\Vert _{x}^{\nu -2} &{}\quad \text {if }\,\,\nu > 2. \end{array}\right. } \end{aligned}$$
(12)

Here, if \(\nu > 3\), then we require \(x\ne y\). Otherwise, we set \(d_{\nu }(x, y) := 0\) if \(x= y\). In addition, we also define the function \(\bar{\bar{\omega }}_{\nu } : \mathbb {R}\rightarrow \mathbb {R}_{+}\) as

$$\begin{aligned} \bar{\bar{\omega }}_{\nu }(\tau ) := {\left\{ \begin{array}{ll} \frac{1}{\left( 1-\tau \right) ^{\frac{2}{\nu -2}}} &{}\quad \text {if }\,\,\nu > 2\\ e^{\tau } &{}\quad \text {if }\,\,\nu = 2, \end{array}\right. } \end{aligned}$$
(13)

with \(\mathrm {dom}(\bar{\bar{\omega }}_{\nu }) = (-\infty , 1)\) if \(\nu > 2\), and \(\mathrm {dom}(\bar{\bar{\omega }}_{\nu }) = \mathbb {R}\) if \(\nu = 2\). We also adopt the Dikin ellipsoid notion from [40], defining \(W^0(x;r):=\left\{ y\in \mathbb {R}^p \mid d_{\nu }(x,y) < r\right\} \).
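For later reference, the quantities (12) and (13) can be implemented directly; the snippet below is a sketch with hypothetical names, taking the Hessian \(\nabla ^2 f(x)\) as an explicit matrix.

```python
import numpy as np

def d_nu(nu, M_f, x, y, hess_x):
    """The quantity d_nu(x, y) from (12); hess_x is the Hessian of f at x."""
    diff = y - x
    if nu == 2.0:
        return M_f * float(np.linalg.norm(diff))
    norm_x = float(np.sqrt(diff @ hess_x @ diff))   # local norm ||y - x||_x
    return ((nu / 2.0 - 1.0) * M_f
            * float(np.linalg.norm(diff)) ** (3.0 - nu) * norm_x ** (nu - 2.0))

def omega_bb(nu, tau):
    """The function (13); domain (-inf, 1) when nu > 2, all of R when nu = 2."""
    if nu == 2.0:
        return float(np.exp(tau))
    assert tau < 1.0
    return (1.0 - tau) ** (-2.0 / (nu - 2.0))
```

Note that for \(\nu = 3\) the definition collapses to \(d_3(x,y) = \tfrac{1}{2}M_f\Vert y-x\Vert _x\).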

The next proposition provides a bound on the local norm defined by a generalized self-concordant function f. This bound is given for the local distances \(\Vert y- x\Vert _{x}\) and \(\Vert y-x\Vert _{y}\) between two points \(x\) and \(y\) in \(\mathrm {dom}(f)\).

Proposition 7

(Bound of local norms) If \(\nu > 2\), then, for any \(x\in \mathrm {dom}(f)\), we have \(W^0(x; 1)\subseteq \mathrm {dom}(f)\). For any \(x,y\in \mathrm {dom}(f)\), let \(d_{\nu }(x,y)\) be defined by (12), and \(\bar{\bar{\omega }}_{\nu }(\cdot )\) be defined by (13). Then, we have

$$\begin{aligned} \bar{\bar{\omega }}_{\nu }\left( -d_{\nu }(x,y)\right) ^{\frac{1}{2}}\left\| y-x\right\| _{x} \le \left\| y- x\right\| _{y} \le \bar{\bar{\omega }}_{\nu }\left( d_{\nu }(x,y)\right) ^{\frac{1}{2}}\left\| y-x\right\| _{x}. \end{aligned}$$
(14)

If \(\nu > 2\), then the right-hand side inequality of (14) holds if \(d_{\nu }(x, y) < 1\).

Proof

We first consider the case \(\nu > 2\). Let \(u\in \mathbb {R}^p\) and \(u\ne 0\). Consider the following univariate function

$$\begin{aligned} \phi (t) := \left\langle \nabla ^2f(x+tu)u, u\right\rangle ^{1 - \tfrac{\nu }{2}}=\left\| u\right\| _{x+tu}^{2-\nu }. \end{aligned}$$

It is easy to compute the derivative of this function, and obtain

$$\begin{aligned} \phi '(t) = \left( \frac{2-\nu }{2}\right) \frac{\langle \nabla ^3f(x+tu)[u]u,u\rangle }{\left\langle \nabla ^2f(x+tu)u, u\right\rangle ^{\frac{\nu }{2}}} = \left( \frac{2-\nu }{2}\right) \frac{\langle \nabla ^3f(x+tu)[u]u,u\rangle }{\left\| u\right\| _{x+tu}^{\nu }}. \end{aligned}$$

Using Definition 2 with \(u= v\) and \(x+tu\) instead of \(x\), we have \(\left| \phi '(t)\right| \le \frac{\nu -2}{2}M_f\left\| u\right\| _2^{3-\nu }\). This implies that \(\phi (t) \ge \phi (0) - \frac{\nu -2}{2}M_f\left\| u\right\| _2^{3-\nu }\left| t\right| \). On the other hand, we can see that \(\mathrm {dom}(\phi ) = \left\{ t \in \mathbb {R}\mid \phi (t) > 0\right\} \). Hence, \(\mathrm {dom}(\phi )\) contains the interval \(\left( -\frac{2\phi (0)}{(\nu -2)M_f\left\| u\right\| _2^{3-\nu }}, \frac{2\phi (0)}{(\nu -2)M_f\left\| u\right\| _2^{3-\nu }}\right) \). Using this fact and the definition of \(\phi \), we can show that \(\mathrm {dom}(f)\) contains \(\left\{ y := x + tu \mid \left| t\right| < \frac{2\left\| u\right\| _{x}^{2-\nu }}{(\nu -2)M_f\left\| u\right\| _2^{3-\nu }} \right\} \). Moreover, since \(\left| t\right| = \frac{\left\| y-x\right\| _x^{\nu -2}}{\left\| u\right\| _x^{\nu -2}}\frac{\left\| y-x\right\| _2^{3-\nu }}{\left\| u\right\| _2^{3-\nu }}\), the condition \(\left| t\right| < \frac{2\left\| u\right\| _{x}^{2-\nu }}{(\nu -2)M_f\left\| u\right\| _2^{3-\nu }}\) is equivalent to \(d_{\nu }(x,y) < 1\). This shows that \(W^0(x; 1)\subseteq \mathrm {dom}(f)\).

Since \(\big \vert \int _0^1\phi '(t)\mathrm {d}t\big \vert \le \int _0^1\left| \phi '(t)\right| \mathrm {d}t\), integrating \(\phi '(t)\) over the interval [0, 1] we get

$$\begin{aligned} \Big \vert \left\| u\right\| _{x+u}^{2-\nu }-\left\| u\right\| _{x}^{2-\nu } \Big \vert \le \frac{\nu -2}{2}M_f\left\| u\right\| _2^{3-\nu }. \end{aligned}$$

Using \(u=y-x\) in the last inequality, we get \(\vert \left\| y-x\right\| _{y}^{2-\nu }-\left\| y-x\right\| _{x}^{2-\nu }\vert \le \frac{\nu -2}{2}M_f\left\| y-x\right\| _2^{3-\nu }\) which is equivalent to

$$\begin{aligned} \begin{aligned} \left\| y-x\right\| _{y}^{\nu - 2}&\le \left\| y-x\right\| _{x}^{\nu -2}\left( 1 - \frac{\nu -2}{2}M_f\left\| y-x\right\| _{x}^{\nu -2}\left\| x-y\right\| _2^{3-\nu }\right) ^{-1}\\&= \left\| y-x\right\| _{x}^{\nu -2}\left( 1 - d_{\nu }(x,y)\right) ^{-1} \\ \left\| y-x\right\| _{y}^{\nu -2}&\ge \left\| y-x\right\| _{x}^{\nu -2}\left( 1 + \frac{\nu -2}{2}M_f\left\| y-x\right\| _{x}^{\nu -2}\left\| x-y\right\| _2^{3-\nu }\right) ^{-1}\\&= \left\| y-x\right\| _{x}^{\nu -2}\left( 1 + d_{\nu }(x,y)\right) ^{-1}, \end{aligned} \end{aligned}$$

given that \(d_{\nu }(x,y)<1\). Taking the power of \(\tfrac{1}{\nu -2} > 0\) in both sides, we get (14) for the case \(\nu > 2\).

Now, we consider the case \(\nu = 2\). Let \(0\ne u\in \mathbb {R}^p\). We consider the following function

$$\begin{aligned} \phi (t) := \ln \left( \left\langle \nabla ^2f(x+tu)u,u\right\rangle \right) = \ln \big (\Vert u\Vert _{x+tu}^2\big ). \end{aligned}$$

Clearly, it is easy to show that \(\phi '(t) = \frac{ \langle \nabla ^3f(x+tu)[u]u,u\rangle }{\left\langle \nabla ^2f(x+tu)u, u\right\rangle }=\frac{ \langle \nabla ^3f(x+tu)[u]u,u\rangle }{\left\| u\right\| _{x+tu}^2}\). Using again Definition 2 with \(u= v\) and \(x+tu\) instead of \(x\), we obtain \(\left| \phi '(t)\right| \le M_f\left\| u\right\| _2\).

Since \(\big \vert \int _0^1\phi '(t)\mathrm {d}t \big \vert \le \int _0^1\left| \phi '(t)\right| \mathrm {d}t\), integrating \(\phi '(t)\) over the interval [0, 1] we get

$$\begin{aligned} \left| \ln \left( \left\| u\right\| _{x+u}^2\right) - \ln \left( \left\| u\right\| _{x}^2\right) \right| \le M_f\left\| u\right\| _2. \end{aligned}$$

Substituting \(u=y-x\) into this inequality, we get \(\big \vert \ln \left\| y-x\right\| _{y} - \ln \left\| y-x\right\| _{x} \big \vert \le \tfrac{M_f}{2}\left\| y-x\right\| _2\). Hence, \(\ln \left\| y-x\right\| _{x} -\tfrac{M_f}{2}\left\| y-x\right\| _2\le \ln \left\| y-x\right\| _{y} \le \ln \left\| y-x\right\| _{x} + \tfrac{M_f}{2}\left\| y-x\right\| _2\). This inequality leads to (14) for the case \(\nu = 2\). \(\square \)
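A one-dimensional numerical check of (14), of our own making: \(f(t) = e^t\) is \((1, 2)\)-generalized self-concordant (since \(|f'''| = f''\)), with \(\Vert h\Vert _{x} = |h|e^{x/2}\) and \(d_2(x,y) = |y-x|\), so (14) reduces to \(e^{-|y-x|/2} \le e^{(y-x)/2} \le e^{|y-x|/2}\). The helper below verifies this at random points.

```python
import numpy as np

def local_norm_bound_holds(x, y, M_f=1.0):
    # For f(t) = exp(t): ||h||_x = |h| * sqrt(f''(x)) = |h| * exp(x/2),
    # and d_2(x, y) = M_f |y - x|; check both sides of (14).
    h = abs(y - x)
    nx, ny = h * np.exp(x / 2.0), h * np.exp(y / 2.0)
    d2 = M_f * h
    return (np.exp(-d2 / 2.0) * nx <= ny + 1e-12) and (ny <= np.exp(d2 / 2.0) * nx + 1e-12)

rng = np.random.default_rng(0)
assert all(local_norm_bound_holds(*rng.normal(size=2)) for _ in range(100))
```

Both sides of (14) are tight for this example, which reflects that the constant \(M_f = 1\) cannot be improved for \(e^t\).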

Next, we develop new bounds for the Hessian map of f in the following proposition.

Proposition 8

(Bounds of Hessian map) For any \(x,y\in \mathrm {dom}(f)\), let \(d_{\nu }(x,y)\) be defined by (12), and \(\bar{\bar{\omega }}_{\nu }(\cdot )\) be defined by (13). Then, we have

$$\begin{aligned}&\left[ 1 - d_{\nu }(x,y)\right] ^{\frac{2}{\nu -2}}\nabla ^2f(x) \preceq \nabla ^2f(y) \preceq \left[ 1 - d_{\nu }(x, y)\right] ^{\frac{-2}{\nu -2}}\nabla ^2f(x)~~~&\quad \text {if}\, \nu > 2,\qquad \end{aligned}$$
(15)
$$\begin{aligned}&e^{-d_{\nu }(x,y)}\nabla ^2f(x) \preceq \nabla ^2f(y) \preceq e^{d_{\nu }(x,y)}\nabla ^2f(x)~~~&\quad \text {if }\,\nu = 2,\qquad \end{aligned}$$
(16)

where (15) holds if \(d_{\nu }(x, y) < 1\) for the case \(\nu > 2\).

Proof

Let \(\nu > 2\) and \(0\ne u\in \mathbb {R}^n\). Consider the following univariate function on [0, 1]:

$$\begin{aligned} \psi (t) := \left\langle \nabla ^2f(x+t(y-x))u,u\right\rangle ,~~t\in [0,1]. \end{aligned}$$

If we denote by \(y_t := x+t(y-x)\), then \(y_t - x= t(y- x)\), \(\psi (t)=\left\| u\right\| _{y_t}^2\), and \(\psi '(t) = \langle \nabla ^3f(y_t)[y-x]u,u\rangle \). By Definition 2, we have

$$\begin{aligned} \left| \psi '(t)\right|&\le M_f\left\| u\right\| _{y_t}^2\left\| y-x\right\| _{y_t}^{\nu -2}\left\| y-x\right\| _2^{3-\nu } = M_f \psi (t)\left[ \tfrac{\left\| y_t-x\right\| _{y_t}}{t}\right] ^{\nu -2}\left\| y-x\right\| _2^{3-\nu } \end{aligned}$$

which implies

$$\begin{aligned} \left| \frac{d\ln \psi (t)}{dt}\right| \le M_f \left[ \tfrac{\left\| y_t-x\right\| _{y_t}}{t}\right] ^{\nu -2}\left\| y-x\right\| _2^{3-\nu }. \end{aligned}$$
(17)

Assume that \(d_{\nu }(x, y) < 1\). Then, by the definition of \(y_t\) and \(d_{\nu }(\cdot )\), we have \(d_{\nu }(x, y_t) = td_{\nu }(x, y)\) and \(\left\| y_t - x\right\| _{x} = t\left\| y- x\right\| _{x}\). Using Proposition 7, we can derive

$$\begin{aligned} \begin{array}{ll} \frac{1}{t}\left\| y_t-x\right\| _{y_t} &{} \le \frac{1}{t}\left[ 1 - \left( \frac{\nu }{2}-1\right) \left\| y_t - x\right\| _2^{3-\nu }\left\| y_t - x\right\| _{x}^{\nu -2}\right] ^{-\frac{1}{\nu - 2}}\left\| y_t - x\right\| _{x}\\ &{} = \frac{1}{t}\left[ 1 - d_{\nu }(x, y_t)\right] ^{-\frac{1}{\nu -2}}\left\| y_t - x\right\| _{x}\\ &{} = \left[ 1 - d_{\nu }(x, y)t\right] ^{-\frac{1}{\nu -2}}\left\| y- x\right\| _{x}. \end{array} \end{aligned}$$

Hence, we can further derive

$$\begin{aligned} \left[ \frac{1}{t}\left\| y_t-x\right\| _{y_t}\right] ^{\nu -2} \le \frac{\left\| y-x\right\| _{x}^{\nu -2}}{1 - d_{\nu }(x,y)t} \end{aligned}$$

Integrating \(\frac{d\ln \psi (t)}{dt}\) with respect to t on [0, 1] and using the last inequality and (17), we get

$$\begin{aligned} \left| \int _0^1\frac{d\ln \psi (t)}{dt}dt\right|\le & {} \int _0^1\left| \frac{d\ln \psi (t)}{dt}\right| dt\\\le & {} M_f\left\| y-x\right\| _{x}^{\nu -2}\left\| y-x\right\| _2^{3-\nu }\int _0^1\frac{dt}{1 - d_{\nu }(x,y)t}. \end{aligned}$$

Clearly, we can compute this integral explicitly as

$$\begin{aligned} \left| \ln \left[ \frac{\left\| u\right\| ^2_{y}}{\left\| u\right\| ^2_{x}}\right] \right|= & {} \left| \ln \left[ \frac{\psi (1)}{\psi (0)}\right] \right| \le \frac{-2d_{\nu }(x,y)}{(\nu -2)d_{\nu }(x,y)}\ln \left[ 1 - d_{\nu }(x,y)\right] \\= & {} \ln \left[ \left( 1 - d_{\nu }(x,y)\right) ^{\frac{-2}{\nu -2}}\right] . \end{aligned}$$

Rearranging this inequality, we obtain

$$\begin{aligned} \left[ 1 - d_{\nu }(x,y)\right] ^{\frac{2}{\nu -2}} \le \frac{\left\| u\right\| ^2_{y}}{\left\| u\right\| ^2_{x}} \equiv \frac{\langle \nabla ^2f(y)u,u\rangle }{\langle \nabla ^2f(x)u,u\rangle } \le \left[ 1 - d_{\nu }(x,y)\right] ^{\frac{-2}{\nu -2}}. \end{aligned}$$

Since this inequality holds for any \(0\ne u\in \mathbb {R}^p\), it implies (15). If \(u=0\), then (15) obviously holds.

Now, we consider the case \(\nu = 2\). It follows from (17) that

$$\begin{aligned} \left| \ln \left[ \frac{\left\| u\right\| ^2_{y}}{\left\| u\right\| ^2_{x}}\right] \right|= & {} \left| \int _0^1\frac{d\ln \psi (t)}{dt}dt\right| \le \int _0^1\left| \frac{d\ln \psi (t)}{dt}\right| dt \\\le & {} M_f\int _0^1\left\| y-x\right\| _2dt = M_f\left\| y- x\right\| _2. \end{aligned}$$

Since this inequality holds for any \(u\in \mathbb {R}^p\), it implies (16). \(\square \)

The following corollary provides a bound on the mean of the Hessian map \(G(x, y) := \int _0^1\nabla ^2{f}(x+ \tau (y- x))d\tau \) whose proof is moved to Appendix A.2.

Corollary 2

For any \(x,y\in \mathrm {dom}(f)\), let \(d_{\nu }(x,y)\) be defined by (12). Then, we have

$$\begin{aligned} \underline{\kappa }_{\nu }(d_{\nu }(x,y))\nabla ^2f(x) \preceq \int _0^1\nabla ^2{f}(x+ \tau (y- x))d\tau \preceq \overline{\kappa }_{\nu }(d_{\nu }(x,y))\nabla ^2f(x), \end{aligned}$$
(18)

where

$$\begin{aligned} \underline{\kappa }_{\nu }(t) := {\left\{ \begin{array}{ll} \frac{1-e^{-t} }{t} &{}\quad \text {if}\,\, \nu = 2\\ \frac{1-(1-t)^2}{2t} &{}\quad \text {if}\,\, \nu = 4\\ \frac{\nu - 2}{\nu }\left[ \frac{1 - (1 - t)^{\frac{\nu }{\nu - 2}}}{t}\right] &{}\quad \text {if }\,\nu> 2~\text {and}~\nu \ne 4, \end{array}\right. } \quad \text {and}\quad \overline{\kappa }_{\nu }(t) := {\left\{ \begin{array}{ll} \frac{e^{t}-1}{t} &{}\quad \text {if}\,\, \nu = 2 \\ \frac{-\ln (1 - t)}{t} &{}\quad \text {if}\,\, \nu = 4 \\ \frac{\nu - 2}{\nu - 4}\left[ \frac{1 - (1 - t)^{\frac{\nu -4}{\nu -2}}}{t}\right] &{}\quad \text {if}\, \nu > 2~\text {and}~\nu \ne 4. \end{array}\right. } \end{aligned}$$

Here, if \(\nu > 2\), then (18) additionally requires \(d_{\nu }(x, y) < 1\) for \(x,y\in \mathrm {dom}(f)\).
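The two bounding functions of Corollary 2 are straightforward to implement; the sketch below (function names ours) also checks (18) on the one-dimensional example \(f(t) = e^t\) with \(\nu = 2\), \(M_f = 1\), for which the mean Hessian is \((e^y - e^x)/(y-x)\).

```python
import numpy as np

def kappa_lower(nu, t):
    if nu == 2.0:
        return (1.0 - np.exp(-t)) / t
    if nu == 4.0:
        return (1.0 - (1.0 - t) ** 2) / (2.0 * t)
    return ((nu - 2.0) / nu) * (1.0 - (1.0 - t) ** (nu / (nu - 2.0))) / t

def kappa_upper(nu, t):
    if nu == 2.0:
        return (np.exp(t) - 1.0) / t
    if nu == 4.0:
        return -np.log(1.0 - t) / t
    return ((nu - 2.0) / (nu - 4.0)) * (1.0 - (1.0 - t) ** ((nu - 4.0) / (nu - 2.0))) / t

# 1-D check of (18) for f(t) = exp(t) (nu = 2, M_f = 1): the mean Hessian is
# G = (exp(y) - exp(x)) / (y - x) and d_2(x, y) = |y - x|.
x, y = 0.5, 1.3
G = (np.exp(y) - np.exp(x)) / (y - x)
d = abs(y - x)
assert kappa_lower(2.0, d) * np.exp(x) <= G + 1e-9
assert G <= kappa_upper(2.0, d) * np.exp(x) + 1e-9
```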

We prove a bound on the gradient inner product of a generalized self-concordant function f.

Proposition 9

(Bounds of gradient map) For any \(x, y\in \mathrm {dom}(f)\), we have

$$\begin{aligned} \bar{\omega }_{\nu }\left( -d_{\nu }(x, y)\right) \left\| y-x\right\| _{x}^2\le & {} \left\langle \nabla f(y) - \nabla f(x),y- x\right\rangle \le \bar{\omega }_{\nu }\left( d_{\nu }(x, y)\right) \left\| y-x\right\| _{x}^2,\quad \end{aligned}$$
(19)

where, if \(\nu > 2\), then the right-hand side inequality of (19) holds if \(d_{\nu }(x, y) < 1\), and

$$\begin{aligned} \bar{\omega }_{\nu }(\tau ) := {\left\{ \begin{array}{ll} \frac{e^{\tau }-1}{\tau } &{}\quad \text {if}\,\, \nu = 2 \\ \frac{\ln \left( 1 - \tau \right) }{-\tau } &{}\quad \text {if}\,\, \nu = 4 \\ \left( \tfrac{\nu -2}{\nu -4}\right) \frac{1 - \left( 1 - \tau \right) ^{\frac{\nu -4}{\nu -2}}}{\tau } &{}\quad \text {otherwise}. \end{array}\right. } \end{aligned}$$
(20)

Here, \(\bar{\omega }_{\nu }(\tau ) \ge 0\) for all \(\tau \in \mathrm {dom}(\bar{\omega }_{\nu })\).

Proof

Let \(y_t := x+ t(y- x)\). By the mean-value theorem, we have

$$\begin{aligned} \left\langle \nabla f(y)-\nabla f(x),y-x\right\rangle = \int _0^1\left\langle \nabla ^2 f(y_t)(y-x),y-x\right\rangle \mathrm {d}t = \int _0^1\frac{1}{t^2}\left\| y_t-x\right\| _{y_t}^2\mathrm {d}t. \end{aligned}$$
(21)

We consider the function \(\bar{\bar{\omega }}_{\nu }\) defined by (13). It follows from Proposition 7 that

$$\begin{aligned} \bar{\bar{\omega }}_{\nu }\left( -d_{\nu }(x,y_t)\right) \left\| y_t-x\right\| _{x}^2 \le \left\| y_t -x\right\| _{y_t}^2 \le \bar{\bar{\omega }}_{\nu }\left( d_{\nu }(x,y_t)\right) \left\| y_t-x\right\| _{x}^2. \end{aligned}$$

Now, noting that \(d_{\nu }(x,y_t) = td_{\nu }(x,y)\) and \(\left\| y_t-x\right\| _{x} = t\left\| y-x\right\| _{x}\), the last estimate leads to

$$\begin{aligned} \bar{\bar{\omega }}_{\nu }\left( -td_{\nu }(x,y)\right) \left\| y- x\right\| _{x}^2 \le \frac{1}{t^2}\left\| y_t -x\right\| _{y_t}^2 \le \bar{\bar{\omega }}_{\nu }\left( td_{\nu }(x,y)\right) \left\| y-x\right\| _{x}^2. \end{aligned}$$

Substituting this estimate into (21), we obtain

$$\begin{aligned} \left\| y- x\right\| _{x}^2\int _0^1\bar{\bar{\omega }}_{\nu }\left( -td_{\nu }(x,y)\right) dt\le & {} \left\langle \nabla f(y)-\nabla f(x),y-x\right\rangle \\\le & {} \left\| y- x\right\| _{x}^2\int _0^1\bar{\bar{\omega }}_{\nu }\left( td_{\nu }(x,y)\right) dt. \end{aligned}$$

Using the function \(\bar{\bar{\omega }}_{\nu }(\tau )\) from (13) to compute the left-hand side and the right-hand side integrals, we obtain (19). \(\square \)

Finally, we prove a bound on the function values of an \((M_f, \nu )\)-generalized self-concordant function f in the following proposition.

Proposition 10

(Bounds of function values) For any \(x, y\in \mathrm {dom}(f)\), we have

$$\begin{aligned}&\omega _{\nu }\left( -d_{\nu }(x, y)\right) \left\| y-x\right\| _{x}^2 \le f(y) - f(x) - \left\langle \nabla f(x),y- x\right\rangle \le \omega _{\nu }\left( d_{\nu }(x, y)\right) \left\| y-x\right\| _{x}^2,\qquad \quad \end{aligned}$$
(22)

where, if \(\nu > 2\), then the right-hand side inequality of (22) holds if \(d_{\nu }(x, y) < 1\). Here, \(d_{\nu }(x,y)\) is defined by (12) and \(\omega _{\nu }\) is defined by

$$\begin{aligned} \omega _{\nu }(\tau ) := {\left\{ \begin{array}{ll} \frac{e^{\tau }-\tau - 1}{\tau ^2} &{}\quad \text {if}\,\, \nu = 2 \\ \frac{-\tau -\ln (1-\tau )}{\tau ^2} &{}\quad \text {if }\,\nu = 3 \\ \frac{(1-\tau )\ln \left( 1 - \tau \right) + \tau }{\tau ^2} &{}\quad \text {if}\,\, \nu = 4 \\ \left( \tfrac{\nu -2}{4-\nu }\right) \frac{1}{\tau }\left[ \frac{\nu -2}{2(3-\nu )\tau }\left( (1 - \tau )^{\frac{2(3-\nu )}{2-\nu }} - 1\right) - 1\right] &{}\quad \text {otherwise}. \end{array}\right. } \end{aligned}$$
(23)

Note that \(\omega _{\nu }(\tau ) \ge 0\) for all \(\tau \in \mathrm {dom}(\omega _{\nu })\).

Proof

For any \(x,y\in \mathrm {dom}(f)\), let \(y_t := x+ t(y- x)\). Then, \(y_t - x= t(y- x)\). By the mean-value theorem, we have

$$\begin{aligned} f(y)-f(x)-\left\langle \nabla f(x),y-x\right\rangle = \int _0^1 \tfrac{1}{t}\langle \nabla f(y_t)-\nabla f(x),y_t-x\rangle dt. \end{aligned}$$

Now, by Proposition 9, we have

$$\begin{aligned} \bar{\omega }_{\nu }\left( - d_{\nu }(x,y_t)\right) \left\| y_t - x\right\| _{x}^2 \le \langle \nabla f(y_t)-\nabla f(x),y_t-x\rangle \le \bar{\omega }_{\nu }\left( d_{\nu }(x,y_t)\right) \left\| y_t - x\right\| _{x}^2. \end{aligned}$$

Clearly, by the definition (12), we have \(d_{\nu }(x, y_t) = td_{\nu }(x,y)\) and \(\left\| y_t-x\right\| _{x} = t\left\| y-x\right\| _{x}\). Combining these relations, and the above two inequalities, we can show that

$$\begin{aligned} \left\| y-x\right\| _{x}^2\int _0^1 t\bar{\omega }_{\nu }\left( -td_{\nu }(x,y)\right) dt\le & {} f(y)-f(x)-\left\langle \nabla f(x),y-x\right\rangle \\\le & {} \left\| y-x\right\| _{x}^2\int _0^1 t\bar{\omega }_{\nu }\left( td_{\nu }(x,y)\right) dt. \end{aligned}$$

By integrating the left-hand side and the right-hand side of this estimate using the definition (20) of \(\bar{\omega }_{\nu }(\tau )\), we obtain (22). \(\square \)

3 Generalized self-concordant minimization

We apply the theory developed in the previous sections to design new Newton-type methods to minimize a generalized self-concordant function. More precisely, we consider the following non-composite convex problem:

$$\begin{aligned} f^{\star } := \min _{x\in \mathbb {R}^p} f(x), \end{aligned}$$
(24)

where \(f : \mathbb {R}^p\rightarrow \mathbb {R}\) is an \((M_f,\nu )\)-generalized self-concordant function in the sense of Definition 2 with \(\nu \in [2, 3]\) and \(M_f \ge 0\). Since f is smooth and convex, the optimality condition \(\nabla {f}(x^{\star }_f) = 0\) is necessary and sufficient for \(x^{\star }_f\) to be an optimal solution of (24).

The following theorem shows the existence and uniqueness of the solution \(x^{\star }_f\) of (24). It can be considered as a special case of Theorem 4 below with \(g\equiv 0\).

Theorem 1

Suppose that \(f\in \widetilde{\mathcal {F}}_{M_f,\nu }(\mathrm {dom}(f))\) for given parameters \(M_f > 0\) and \(\nu \in [2, 3]\). Denote by \(\sigma _{\min }(x) := \lambda _{\min }(\nabla ^2 f(x))\) and \(\lambda (x) := \Vert \nabla {f}(x)\Vert ^{*}_{x}\) for \(x\in \mathrm {dom}(f)\). Suppose further that there exists \(x\in \mathrm {dom}(f)\) such that \(\sigma _{\min }(x) > 0\) and

$$\begin{aligned} \lambda (x) < \frac{2\left[ \sigma _{\min }(x)\right] ^{\frac{3-\nu }{2}}}{(4-\nu )M_f}. \end{aligned}$$

Then, problem (24) has a unique solution \(x_f^{\star }\) in \(\mathrm {dom}(f)\).

We say that the unique solution \(x^{\star }_f\) of (24) is strongly regular if \(\nabla ^2{f}(x^{\star }_f) \succ 0\). The strong regularity of \(x^{\star }_f\) for (24) is equivalent to the strong second order optimality condition. Theorem 1 covers [40, Theorem 4.1.11] for standard self-concordant functions as a special case.

We consider the following Newton-type scheme to solve (24). Starting from an arbitrary initial point \(x^0\in \mathrm {dom}(f)\), we generate a sequence \(\left\{ x^k\right\} _{k\ge 0}\) as follows:

$$\begin{aligned} x^{k+1} := x^k + \tau _kn_{\mathrm {nt}}^k,~~~~~\text {where}~~n_{\mathrm {nt}}^k := -\nabla ^2f(x^k)^{-1}\nabla {f}(x^k), \end{aligned}$$
(25)

and \(\tau _k \in (0, 1]\) is a given step-size. We call \(n_{\mathrm {nt}}^k\) a Newton direction.

  • If \(\tau _k = 1\) for all \(k\ge 0\), then we call (25) a full-step Newton scheme.

  • Otherwise, i.e., \(\tau _k \in (0, 1)\), we call (25) a damped-step Newton scheme.

Clearly, computing the Newton direction \(n_{\mathrm {nt}}^k\) requires solving the following linear system:

$$\begin{aligned} \nabla ^2{f}(x^k)n_{\mathrm {nt}}^k = -\nabla {f}(x^k). \end{aligned}$$
(26)

Next, we define a Newton decrement \(\lambda _k\) and a quantity \(\beta _k\), respectively, as

$$\begin{aligned} \lambda _k := \Vert n_{\mathrm {nt}}^k\Vert _{x^k} = \Vert \nabla {f}(x^k)\Vert _{x^k}^{*}~~~\text {and}~~~\beta _k := M_f\Vert n_{\mathrm {nt}}^k\Vert _2 = M_f\Vert \nabla ^2f(x^k)^{-1}\nabla {f}(x^k)\Vert _2.\qquad \end{aligned}$$
(27)

With \(\lambda _k\) and \(\beta _k\) given by (27), we also define

$$\begin{aligned} d_k := {\left\{ \begin{array}{ll} \beta _k &{}\quad \text {if}\,\, \nu = 2\\ \left( \frac{\nu }{2} - 1\right) M_f^{\nu -2}\lambda _k^{\nu -2}\beta _k^{3-\nu } &{}\quad \text {if}\, \nu \in (2, 3]. \end{array}\right. } \end{aligned}$$
(28)

Let us first show how to choose a suitable step-size \(\tau _k\) in the damped-step Newton scheme and prove its convergence properties in the following theorem whose proof can be found in Appendix A.5.

Theorem 2

Let \(\left\{ x^k\right\} \) be the sequence generated by the damped-step Newton scheme (25) with the following step-size:

$$\begin{aligned} \tau _k := {\left\{ \begin{array}{ll} \frac{1}{\beta _k}\ln (1 + \beta _k) &{}\quad \text {if}\,\, \nu = 2 \\ \frac{1}{d_k}\left[ 1-\left( 1+\frac{4-\nu }{\nu -2}d_k\right) ^{-\frac{\nu -2}{4-\nu }}\right] &{}\quad \text {if}\, \nu \in (2, 3], \end{array}\right. } \end{aligned}$$
(29)

where \(\lambda _k\), \(\beta _k\) are defined by (27), and \(d_k\) is defined by (28). Then, \(\tau _k\in (0,1]\), the sequence \(\left\{ x^k\right\} \) stays in \(\mathrm {dom}(f)\), and this step-size guarantees the following descent property:

$$\begin{aligned} f(x^{k+1}) \le f(x^k) - \varDelta _k, \end{aligned}$$
(30)

where \(\varDelta _k := \lambda _k^2\tau _k - \omega _{\nu }\left( \tau _k d_k\right) \tau ^2_k\lambda _k^2 > 0\) with \(\omega _{\nu }\) defined by (23).

Assume that the unique solution \(x^{\star }_f\) of (24) exists. Then, there exists a neighborhood \(\mathcal {N}(x^{\star }_f)\) such that if we initialize the Newton scheme (25) at \(x^0\in \mathcal {N}(x^{\star }_f)\cap \mathrm {dom}(f)\), then the whole sequence \(\left\{ x^k\right\} \) converges to \(x^{\star }_f\) at a quadratic rate.
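As an illustration, the damped-step scheme (25) with the analytic step-size (29) can be sketched in a few lines of numpy. The names `grad` and `hess` are assumed user-supplied callables returning the gradient and Hessian of f; this is a minimal sketch for \(\nu \in [2, 3]\), not the implementation used in our experiments in Sect. 6.

```python
import numpy as np

def damped_newton(grad, hess, x0, Mf, nu, tol=1e-8, max_iter=100):
    """Damped-step Newton scheme (25) with the analytic step-size (29).

    `grad`, `hess`, `Mf`, and `nu` are assumed user-supplied inputs;
    an illustrative sketch, not the code used in the experiments.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        n = np.linalg.solve(H, -g)          # Newton direction from system (26)
        lam = np.sqrt(n @ (H @ n))          # Newton decrement lambda_k in (27)
        if lam <= tol:
            break
        beta = Mf * np.linalg.norm(n)       # beta_k in (27)
        if nu == 2:
            tau = np.log1p(beta) / beta     # step-size (29) for nu = 2
        else:
            # d_k from (28) and the step-size (29) for nu in (2, 3];
            # at nu = 3 this reduces to tau = 1 / (1 + 0.5 * Mf * lam)
            d = (nu / 2 - 1) * Mf**(nu - 2) * lam**(nu - 2) * beta**(3 - nu)
            a = (4 - nu) / (nu - 2)
            tau = (1.0 - (1.0 + a * d)**(-1.0 / a)) / d
        x = x + min(tau, 1.0) * n
    return x
```

Note that no linesearch or function evaluation is needed: the step-size comes entirely from \(\lambda _k\) and \(\beta _k\).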

Example 4

(Better step-size for regularized logistic and exponential models) Consider the minimization problem (24) with the objective function \(f(\cdot ) := \phi (\cdot ) + \frac{\gamma }{2}\Vert \cdot \Vert _2^2\), where \(\phi \) is defined as in (8) with \(\varphi _i(t) = \log (1 + e^{-t})\) being the logistic loss. That is

$$\begin{aligned} f(x) := \frac{1}{n}\sum _{i=1}^n\log (1 + e^{-a_i^{\top }x}) + \frac{\gamma }{2}\Vert x\Vert _2^2. \end{aligned}$$

As shown in Sect. 2, f is generalized self-concordant with either \(\nu = 2\) or \(\nu = 3\), but with different constants \(M_f\).

Let us define \(R_A := \max \left\{ \Vert a_i\Vert _2 \mid 1 \le i \le n\right\} \). Then, if we consider \(\nu = 2\), then we have \(M_f^{(2)} = R_A\) due to Corollary 1, while if we choose \(\nu = 3\), then \(M_f^{(3)} = \frac{1}{\sqrt{\gamma }}R_A\) due to Proposition 4. By the definition of f, we have \(\nabla ^2{f}(x) \succeq \gamma \mathbb {I}\). Hence, using this inequality and the definition of \(\lambda _k\) and \(\beta _k\) from (27), we can show that

$$\begin{aligned} \beta _k = M_f^{(2)}\Vert \nabla ^2{f}(x^k)^{-1}\nabla {f}(x^k)\Vert _2 \le \tfrac{R_A}{\sqrt{\gamma }}\lambda _k = M_f^{(3)}\lambda _k. \end{aligned}$$
(31)

For any \(\tau > 0\), we have \(\frac{\ln (1 + \tau )}{\tau } > \frac{1}{1 + 0.5\tau }\). Using this elementary result and (31), we obtain

$$\begin{aligned} \tau _k^{(2)} = \tfrac{\ln (1 + \beta _k)}{\beta _k} > \tfrac{1}{1 + 0.5\beta _k} \ge \tfrac{1}{1 + 0.5M^{(3)}_f\lambda _k} = \tau ^{(3)}_k. \end{aligned}$$

This inequality shows that the step-sizes given by Theorem 2 satisfy \(\tau _k^{(2)} > \tau _k^{(3)}\), where \(\tau _k^{(\nu )}\) is the step-size computed by (29) for \(\nu = 2\) and \(\nu = 3\), respectively. This confirms that the damped-step Newton method using \(\tau _k^{(2)}\) is theoretically better than the one using \(\tau _k^{(3)}\). This result will be empirically confirmed by our experiments in Sect. 6. \(\square \)
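The comparison in Example 4 is easy to check numerically. In the script below, the values of \(R_A\), \(\gamma \), and \(\lambda _k\) are hypothetical illustrations; it evaluates both step-sizes from (29) in the worst case \(\beta _k = M_f^{(3)}\lambda _k\) permitted by (31).

```python
import numpy as np

# Hypothetical illustration values: R_A for the normalized data, gamma the
# regularization parameter, and lam_k the current Newton decrement.
R_A, gamma, lam_k = 1.0, 1e-5, 0.5

Mf3 = R_A / np.sqrt(gamma)              # M_f^{(3)} as in Example 4
beta_k = Mf3 * lam_k                    # worst case permitted by (31)

tau2 = np.log1p(beta_k) / beta_k        # step-size (29) for nu = 2
tau3 = 1.0 / (1.0 + 0.5 * Mf3 * lam_k)  # step-size (29) for nu = 3

assert tau2 > tau3                      # the nu = 2 step-size dominates
```

With a small \(\gamma \), the gap between the two step-sizes becomes large, matching the behavior observed in Sect. 6.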

Next, we study the full-step Newton scheme derived from (25) by setting the step-size \(\tau _k = 1\) for all \(k\ge 0\). Let

$$\begin{aligned} \underline{\sigma }_k := \lambda _{\min }\left( \nabla ^2f(x^k)\right) \end{aligned}$$

be the smallest eigenvalue of \(\nabla ^2{f}(x^k)\). Since \(\nabla ^2{f}(x^k)\succ 0\), we have \(\underline{\sigma }_k > 0\). The following theorem shows a local quadratic convergence of the full-step Newton scheme (25) for solving (24) whose proof can be found in Appendix A.6.

Theorem 3

Let \(\left\{ x^k\right\} \) be the sequence generated by the full-step Newton scheme (25) by setting the step-size \(\tau _k = 1\) for \(k\ge 0\). Let \(d_{\nu }^k := d_{\nu }(x^k, x^{k+1})\) be defined by (12) and \(\lambda _k\) be defined by (27). Then, the following statements hold:

  1. (a)

    If \(\nu = 2\) and the starting point \(x^0\) satisfies \(\underline{\sigma }_0^{-1/2}\lambda _0 < \frac{d_2^{\star }}{M_f}\), then both sequences \(\left\{ \underline{\sigma }_k^{-1/2}\lambda _k\right\} \) and \(\left\{ d_2^k\right\} \) decrease and quadratically converge to zero, where \(d_2^{\star }\approx 0.12964\).

  2. (b)

    If \(2< \nu < 3\), and the starting point \(x^0\) satisfies \(\underline{\sigma }_0^{-\frac{3-\nu }{2}}\lambda _0 < \frac{1}{M_f}\min \left\{ \frac{2d_{\nu }^{\star }}{\nu -2},\frac{1}{2}\right\} \), then both sequences \(\left\{ \underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k\right\} \) and \(\left\{ d_{\nu }^k\right\} \) decrease and quadratically converge to zero, where \(d_{\nu }^{\star }\) is the unique solution of the equation \(\left( \nu -2\right) R_{\nu }(d_{\nu })= 4(1-d_{\nu })^{\frac{4-\nu }{\nu -2}}\) in \(d_{\nu }\) with \(R_{\nu }(\cdot )\) given by (56).

  3. (c)

    If \(\nu = 3\) and the starting point \(x^0\) satisfies \(\lambda _0 < \frac{1}{2M_f}\), then the sequence \(\left\{ \lambda _k\right\} \) decreases and quadratically converges to zero.

As a consequence, if \(\left\{ d^k_{\nu }\right\} \) locally converges to zero at a quadratic rate, then \(\big \{\Vert x^k - x^{\star }_f\Vert _{H_k}\big \}\) also locally converges to zero at a quadratic rate, where \(H_k = \mathbb {I}\), the identity matrix, if \(\nu = 2\); \(H_k = \nabla ^2{f}(x^k)\) if \(\nu = 3\); and \(H_k = \nabla ^2f(x^k)^{\frac{\nu }{2}-1}\) if \(2< \nu < 3\). Hence, \(\left\{ x^k\right\} \) locally converges to \(x^{\star }_f\), the unique solution of (24), at a quadratic rate.

If we combine the results of Theorem 2 and Theorem 3, then we can design a two-phase Newton algorithm for solving (24) as follows:

  • Phase 1 Starting from an arbitrary initial point \(x^0 \in \mathrm {dom}(f)\), we perform the damped-step Newton scheme (25) until the condition in Theorem 3 is satisfied.

  • Phase 2 Use the output \(x^j\) of Phase 1 as the initial point for the full-step Newton scheme (25) with \(\tau _k = 1\), and run this scheme until it achieves an \(\varepsilon \)-solution \(x^k\) to (24).

Note that the damped-step Newton scheme (25) can also achieve a local quadratic convergence, as shown in Theorem 2. Hence, we combine this fact with the above two-phase scheme to derive the Newton method presented in Algorithm 1 below.

Algorithm 1 (Newton algorithm for solving (24))

Per-iteration complexity The main step of Algorithm 1 is the solution of the symmetric positive definite linear system (26). This system can be solved by either a Cholesky factorization or a conjugate gradient method, which, in the worst case, requires \(\mathcal {O}(p^3)\) operations. Computing \(\lambda _k\) requires the inner product \(\langle n_{\mathrm {nt}}^k, \nabla {f}(x^k)\rangle \), which needs \(\mathcal {O}(p)\) operations.

Conceptually, the two-phase option of Algorithm 1 requires the smallest eigenvalue of \(\nabla ^2{f}(x^k)\) to terminate Phase 1. However, switching from Phase 1 to Phase 2 can be done automatically by allowing some tolerance in the step-size \(\tau _k\). Indeed, the step-size \(\tau _k\) given by (29) converges to 1 as \(k\rightarrow \infty \). Hence, when \(\tau _k\) is close to 1, e.g., \(\tau _k \ge 0.9\), we can automatically set it to 1 and remove the computation of \(\lambda _k\) to reduce the computational time.

In the one-phase option, we can always perform only Phase 1 until achieving an \(\varepsilon \)-optimal solution as shown in Theorem 2. Therefore, the per-iteration complexity of Algorithm 1 is \(\mathcal {O}(p^3) + \mathcal {O}(p)\) in the worst-case. A careful implementation of conjugate gradient methods with a warm-start can significantly reduce this per-iteration computation complexity.
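For instance, a matrix-free conjugate gradient solver for (26) needs only Hessian-vector products and accepts a warm-start. The sketch below is generic, with an assumed callable `Hmv` computing the product of \(\nabla ^2{f}(x^k)\) with a vector; it is not the solver used in our experiments.

```python
import numpy as np

def cg(Hmv, b, x0=None, tol=1e-10, max_iter=200):
    """Conjugate gradient for the Newton system (26), H n = b, using only
    Hessian-vector products `Hmv(v)` (an assumed callable) and an optional
    warm-start x0, e.g. the previous Newton direction. A generic sketch.
    """
    x = np.zeros_like(b) if x0 is None else np.asarray(x0, dtype=float).copy()
    r = b - Hmv(x)                      # initial residual
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol * max(1.0, np.linalg.norm(b)):
            break                       # relative residual small enough
        Hp = Hmv(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs) * p       # new conjugate search direction
        rs = rs_new
    return x
```

Warm-starting from the previous direction and stopping at a loose relative residual is what makes the inexact variant of Remark 3 cheap in practice.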

Remark 3

(Inexact Newton methods) We can allow Algorithm 1 to compute the Newton direction \(n_{\mathrm {nt}}^k\) approximately. In this case, we approximately solve the symmetric positive definite system (26). By an appropriate choice of stopping criterion, we can still prove convergence of Algorithm 1 under inexact computation of \(n_{\mathrm {nt}}^k\). For instance, the following criterion is often used in inexact Newton methods [16], but defined via the local dual norm of f:

$$\begin{aligned} \Vert \nabla ^2{f}(x^k)n_{\mathrm {nt}}^k + \nabla {f}(x^k)\Vert _{x^k}^{*} \le \kappa \Vert \nabla {f}(x^k)\Vert _{x^k}^{*}, \end{aligned}$$

for a given relaxation parameter \(\kappa \in [0, 1)\). This extension can be found in our forthcoming work.

4 Composite generalized self-concordant minimization

Let \(f\in \widetilde{\mathcal {F}}_{M_f,\nu }(\mathrm {dom}(f))\), and g be a proper, closed, and convex function. We consider the composite convex minimization problem (3), which we recall here for ease of reference:

$$\begin{aligned} F^{\star } := \min _{x\in \mathbb {R}^p}\Big \{ F(x) := f(x) + g(x) \Big \}. \end{aligned}$$
(32)

Note that \(\mathrm {dom}(F) := \mathrm {dom}(f) \cap \mathrm {dom}(g)\) may be empty. To make this problem nontrivial, we assume that \(\mathrm {dom}(F)\) is nonempty. The optimality condition for (32) can be written as follows:

$$\begin{aligned} 0\in \nabla f(x^{\star })+\partial g(x^{\star }). \end{aligned}$$
(33)

Under the qualification condition \(0 \in \mathrm {ri}\left( \mathrm {dom}(g) - \mathrm {dom}(f)\right) \), (33) is necessary and sufficient for \(x^{\star }\) to be an optimal solution of (32), where \(\mathrm {ri}\left( \mathcal {X}\right) \) is the relative interior of \(\mathcal {X}\).

4.1 Existence, uniqueness, and regularity of optimal solutions

Assume that \(\nabla ^2{f}(x)\) is positive definite (i.e., nonsingular) at some point \(x\in \mathrm {dom}(F)\). We prove in the following theorem that problem (32) has a unique solution \(x^{\star }\). The proof can be found in Appendix A.4. This theorem can also be considered as a generalization of [40, Theorem 4.1.11] and [62, Lemma 4] in standard self-concordant settings in [40, 62].

Theorem 4

Suppose that the function f of (32) is \((M_f, \nu )\)-generalized self-concordant with \(M_f > 0\) and \(\nu \in [2, 3]\). Denote by \(\sigma _{\min }(x) := \lambda _{\min }(\nabla ^2 f(x))\) and \(\lambda (x) := \Vert \nabla {f}(x) + v\Vert ^{*}_{x}\) for \(x\in \mathrm {dom}(F)\) and \(v\in \partial {g}(x)\). Suppose further that there exists \(x\in \mathrm {dom}(F)\) such that \(\sigma _{\min }(x) > 0\) and

$$\begin{aligned} \lambda (x) < \frac{2\left[ \sigma _{\min }(x)\right] ^{\frac{3-\nu }{2}}}{(4-\nu )M_f}. \end{aligned}$$

Then, problem (32) has a unique solution \(x^{\star }\) in \(\mathrm {dom}(F)\).

Now, we recall a condition under which the solution \(x^{\star }\) of (32) is strongly regular in the sense of Robinson [56]. We say that the optimal solution \(x^{\star }\) of (32) is strongly regular if there exists a neighborhood \(\mathcal {U}(\varvec{0})\) of zero such that, for any \(\delta \in \mathcal {U}(\varvec{0})\), the following perturbed problem

$$\begin{aligned} \min _{x\in \mathbb {R}^p} \left\{ \langle \nabla {f}(x^{\star }) - \delta , x- x^{\star }\rangle + \tfrac{1}{2}\langle \nabla ^2{f}(x^{\star })(x- x^{\star }), x- x^{\star }\rangle + g(x) \right\} \end{aligned}$$

has a unique solution \(x^{*}(\delta )\), and this solution is Lipschitz continuous on \(\mathcal {U}(\varvec{0})\).

If \(\nabla ^2{f}(x^{\star }) \succ 0\), then \(x^{\star }\) is strongly regular; however, strong regularity of \(x^{\star }\) is a weaker condition than \(\nabla ^2{f}(x^{\star }) \succ 0\). For further details of the regularity theory, we refer the reader to [56].

4.2 Scaled proximal operators

Given a matrix \(H\in \mathcal {S}^p_{++}\), we define a scaled proximal operator of g in (32) as

$$\begin{aligned} \mathrm {prox}_{H^{-1} g}(x):=\mathrm {arg}\min _{z}\left\{ g(z) + \tfrac{1}{2}\left\| z-x\right\| ^2_{H}\right\} . \end{aligned}$$
(34)

Using the optimality condition of the minimization problem under (34), we can show that

$$\begin{aligned} y= & {} \mathrm {prox}_{H^{-1} g}(x)\iff 0 \in H(y- x) + \partial {g}(y) \iff x\in y+ H^{-1}\partial g(y)\\\equiv & {} (\mathbb {I}+ H^{-1}\partial {g})(y). \end{aligned}$$

Since g is proper, closed, and convex, \(\mathrm {prox}_{H^{-1} g}\) is well-defined and single-valued. In particular, if we take \(H=\mathbb {I}\), the identity matrix, then \(\mathrm {prox}_{H^{-1}g}(\cdot ) = \mathrm {prox}_{g}(\cdot )\), the standard proximal operator of g. If we can efficiently compute \(\mathrm {prox}_{H^{-1}g}(\cdot )\) by a closed form or by polynomial time algorithms, then we say that g is proximally tractable. There exist several convex functions whose proximal operator is tractable. Examples such as \(\ell _1\)-norm, coordinate-wise separable convex functions, and the indicator of simple convex sets can be found in the literature including [3, 21, 51].
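As a concrete instance, when \(H\) is diagonal and \(g = \mu \Vert \cdot \Vert _1\), the minimization in (34) separates across coordinates and \(\mathrm {prox}_{H^{-1}g}\) reduces to coordinate-wise soft-thresholding with thresholds \(\mu /h_i\). The sketch below covers only this diagonal case; for a general \(H\), (34) must be solved numerically.

```python
import numpy as np

def prox_scaled_l1(x, h, mu):
    """Scaled proximal operator (34) for g = mu * ||.||_1 with H = diag(h).

    The problem separates coordinate-wise, so the solution is
    soft-thresholding with per-coordinate thresholds mu / h_i.
    A sketch for the diagonal case only.
    """
    t = mu / h                                    # per-coordinate threshold
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
```

Taking \(h_i = 1\) for all i recovers the standard proximal operator \(\mathrm {prox}_{\mu \Vert \cdot \Vert _1}\).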

4.3 Proximal Newton methods

The proximal Newton method can be considered as a special case of the variable metric proximal method in the literature [8]. This method has previously been studied by many authors, see, e.g., [8, 34]. However, the convergence guarantee often requires certain assumptions as used in standard Newton-type methods. In this section, we develop a proximal Newton algorithm to solve the composite convex minimization problem (32) where f is a generalized self-concordant function. This problem covers [62, 64] as special cases.

Given \(x^k\in \mathrm {dom}(F)\), we first approximate f at \(x^k\) by the following convex quadratic surrogate:

$$\begin{aligned} Q_f(x;x^k):= f(x^k)+\left\langle \nabla f(x^k),x-x^k\right\rangle +\tfrac{1}{2}\left\langle \nabla ^2 f(x^k)(x-x^k),x-x^k\right\rangle . \end{aligned}$$

Next, the main step of the proximal Newton method requires solving the following subproblem:

$$\begin{aligned} z^k:= & {} \mathrm {arg}\min _{x\in \mathrm {dom}(g)}\Big \{Q_f(x;x^k)+g(x) \Big \} = \mathrm {prox}_{\nabla ^2{f}(x^k)^{-1}g}\Big (x^k - \nabla ^2{f}(x^k)^{-1}\nabla {f}(x^k)\Big ).\qquad \quad \end{aligned}$$
(35)

The optimality condition for this subproblem is the following linear monotone inclusion:

$$\begin{aligned} 0\in \nabla f(x^k)+\nabla ^2 f(x^k)(z^k-x^k)+\partial g(z^k). \end{aligned}$$
(36)

Here, we note that \(\mathrm {dom}(Q_f(\cdot ;x^k)) = \mathbb {R}^p\). Hence, \(\mathrm {dom}(Q_f(\cdot ;x^k) + g(\cdot )) = \mathrm {dom}(g)\). In the setting (32), \(z^k\) may not be in \(\mathrm {dom}(F)\). Our next step is to update the next iteration \(x^{k+1}\) as

$$\begin{aligned} x^{k+1}:=x^k+\tau _kn_{\mathrm {pnt}}^k=(1-\tau _k)x^k+\tau _kz^k, \end{aligned}$$
(37)

where \(n_{\mathrm {pnt}}^k := z^k - x^k\) is the proximal Newton direction, and \(\tau _k\in (0,1]\) is a given step-size.

Associated with the proximal Newton direction \(n_{\mathrm {pnt}}^k\), we define the following proximal Newton decrement and the \(\ell _2\)-norm quantity of \(n_{\mathrm {pnt}}^k\) as

$$\begin{aligned} \lambda _k := \Vert n_{\mathrm {pnt}}^k\Vert _{x^k}~~~~\text { and }~~~~\beta _k := M_f\Vert n_{\mathrm {pnt}}^k\Vert _2. \end{aligned}$$
(38)

Our first goal is to show that we can explicitly compute the step-size \(\tau _k\) in (37) using \(\lambda _k\) and \(\beta _k\) such that we obtain a descent property for F. This statement is presented in the following theorem whose proof is deferred to Appendix A.7.

Theorem 5

Let \(\left\{ x^k\right\} \) be the sequence generated by the proximal Newton scheme (37) starting from \(x^0\in \mathrm {dom}(F)\). If we choose the step-size \(\tau _k\) as in (29) of Theorem 2, then \(\tau _k \in (0, 1]\), the sequence \(\left\{ x^k\right\} \) stays in \(\mathrm {dom}(F)\), and

$$\begin{aligned} F(x^{k+1}) \le F(x^k) - \varDelta _k, \end{aligned}$$
(39)

where \(\varDelta _k := \lambda _k^2\tau _k - \omega _{\nu }\left( \tau _k d_k\right) \tau ^2_k\lambda _k^2 > 0\) for \(\tau _k > 0\) and \(d_k\) as defined in Theorem 2.

There exists a neighborhood \(\mathcal {N}(x^{\star })\) of the unique solution \(x^{\star }\) of (32) such that if we initialize the scheme (37) at \(x^0\in \mathcal {N}(x^{\star })\cap \mathrm {dom}(F)\), then \(\left\{ x^k\right\} \) quadratically converges to \(x^{\star }\).

Next, we prove a local quadratic convergence of the full-step proximal Newton method (37) with the unit step-size \(\tau _k = 1\) for all \(k\ge 0\). The proof is given in Appendix A.8.

Theorem 6

Suppose that the sequence \(\left\{ x^k\right\} \) is generated by (37) with full-step, i.e., \(\tau _k=1\) for \(k\ge 0\). Let \(d_{\nu }^k:=d_{\nu }(x^k,x^{k+1})\) be defined by (12) and \(\lambda _k\) be defined by (38). Then, the following statements hold:

  1. (a)

    If \(\nu = 2\) and the starting point \(x^0\) satisfies \(\underline{\sigma }_0^{-1/2}\lambda _0 < d_2^{\star }/M_f\), then both sequences \(\left\{ \underline{\sigma }_k^{-1/2}\lambda _k\right\} \) and \(\left\{ d_2^k\right\} \) decrease and quadratically converge to zero, where \(d_2^{\star }\approx 0.35482\).

  2. (b)

    If \(2< \nu < 3\), and the starting point \(x^0\) satisfies \(\underline{\sigma }_0^{-\frac{3-\nu }{2}}\lambda _0 < \frac{1}{M_f}\min \left\{ \frac{2d_{\nu }^{\star }}{\nu -2},\frac{1}{2}\right\} \), then both sequences \(\left\{ \underline{\sigma }_k^{-\frac{3-\nu }{2}}\lambda _k\right\} \) and \(\left\{ d_{\nu }^k\right\} \) decrease and quadratically converge to zero, where \(d_{\nu }^{\star }\) is the unique solution to the equation \(\left( \nu -2\right) R_{\nu }(d_{\nu })= 4(1-d_{\nu })^{\frac{4-\nu }{\nu -2}}\) in \(d_{\nu }\) with \(R_{\nu }(\cdot )\) given in (56).

  3. (c)

    If \(\nu = 3\) and the starting point \(x^0\) satisfies \(\lambda _0 < \frac{2d_3^{\star }}{M_f}\), then the sequence \(\left\{ \lambda _k\right\} \) decreases and quadratically converges to zero, where \(d_3^{\star }\approx 0.20943\).

As a consequence, if \(\left\{ d^k_{\nu }\right\} \) locally converges to zero at a quadratic rate, then \(\big \{\Vert x^k - x^{\star }\Vert _{H_k}\big \}\) also locally converges to zero at a quadratic rate, where \(H_k = \mathbb {I}\), the identity matrix, if \(\nu = 2\); \(H_k = \nabla ^2{f}(x^k)\) if \(\nu = 3\); and \(H_k = \nabla ^2f(x^k)^{\frac{\nu }{2}-1}\) if \(2< \nu < 3\). Hence, \(\left\{ x^k\right\} \) locally converges to \(x^{\star }\), the unique solution of (32), at a quadratic rate.

Similar to Algorithm 1, we can also combine the results of Theorems 5 and 6 to design a proximal Newton algorithm for solving (32). This algorithm is described in Algorithm 2 below.

Algorithm 2 (Proximal Newton algorithm for solving (32))

Implementation remarks The main step of Algorithm 2 is the computation of the proximal Newton step \(n_{\mathrm {pnt}}^k\), or the trial point \(z^k\) in (35). This step requires solving the composite quadratic-convex minimization problem (35), whose objective function is strongly convex. If g is proximally tractable, then we can apply proximal-gradient methods or splitting techniques [3, 4, 44] to solve this problem. We can also combine accelerated proximal-gradient methods with a restarting strategy [19, 23, 48] to accelerate the performance of these algorithms. These methods will be used in our numerical experiments in Sect. 6.
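As a sketch of such an inner solver, the subproblem (35) can be tackled by a plain proximal-gradient loop on the quadratic model \(Q_f(\cdot ;x^k)\). Here `prox_g(v, t)` is an assumed callable for the standard proximal operator of g with step t; no acceleration or restarting is included.

```python
import numpy as np

def prox_newton_subproblem(xk, g_k, H, prox_g, iters=500):
    """Solve subproblem (35), min_z Q_f(z; x^k) + g(z), by proximal gradient.

    `g_k` = grad f(x^k), `H` = hess f(x^k), and `prox_g(v, t)` (the standard
    proximal operator of g with step t) are assumed inputs. A plain sketch
    without the acceleration/restarting refinements mentioned above.
    """
    L = np.linalg.eigvalsh(H).max()      # Lipschitz constant of grad Q_f
    z = np.asarray(xk, dtype=float).copy()
    for _ in range(iters):
        grad_Q = g_k + H @ (z - xk)      # gradient of the quadratic model
        z = prox_g(z - grad_Q / L, 1.0 / L)
    return z
```

The returned \(z^k\) gives the proximal Newton direction \(n_{\mathrm {pnt}}^k = z^k - x^k\) used in (37).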

As noticed in Remark 3, we can also develop an inexact proximal Newton variant for Algorithm 2 by approximately solving the subproblem (35). We leave this extension to our forthcoming work.

5 Quasi-Newton methods for generalized self-concordant minimization

This section studies quasi-Newton variants of Algorithm 1 for solving (24). Extensions to the composite form (32) can be done by combining the result in this section and the approach in [62].

A quasi-Newton method for solving (24) updates the sequence \(\left\{ x^k\right\} \) using

$$\begin{aligned} x^{k+1} := x^k - \tau _kB_k\nabla {f}(x^k),~~~\text {where}~~B_k := H_k^{-1}~\text {and}~~H_k \approx \nabla ^2{f}(x^k), \end{aligned}$$
(40)

where the step-size \(\tau _k\in (0, 1]\) is appropriately chosen, and \(x^0\in \mathrm {dom}(f)\) is a given starting point.

Matrix \(H_k\) is symmetric and positive definite, and it approximates the Hessian matrix \(\nabla ^2{f}(x^k)\) of f at the iteration \(x^k\) in some sense. The most common approximation sense is that \(H_k\) satisfies the well-known Dennis–Moré condition [15]. In the context of generalized self-concordant functions, we can modify this condition by imposing:

$$\begin{aligned} \lim _{k\rightarrow \infty } \frac{\Vert (H_k - \nabla ^2{f}(x^{\star }_f))(x^k - x^{\star }_f)\Vert _{\hat{x}}^{*}}{\Vert x^k - x^{\star }_f\Vert _{\hat{x}}} = 0,~~\text {where}\, \hat{x} = x^{\star }_f\, \hbox {or}\, \hat{x} = x^k. \end{aligned}$$
(41)

Clearly, if we have \(\lim _{k\rightarrow \infty }\Vert H_k - \nabla ^2{f}(x^k)\Vert _{\hat{x}} = 0\), then, with a simple argument, we can show that (41) automatically holds. In practice, we can update \(H_k\) to maintain the following secant equation:

$$\begin{aligned} H_{k+1}s^k = y^k, ~~~\text {where}~~s^k := x^{k+1} - x^k, ~~\text {and}~~~y^k := \nabla {f}(x^{k+1}) - \nabla {f}(x^k). \end{aligned}$$
(42)

There are several candidates to update \(H_k\) to maintain this secant equation, see, e.g., [47]. Here, we propose to use a BFGS update as

$$\begin{aligned} H_{k+1} := H_k + \frac{y^k(y^k)^{\top }}{\langle y^k,s^k\rangle } - \frac{(H_ks^k)(H_ks^k)^{\top }}{\langle H_ks^k, s^k\rangle }. \end{aligned}$$
(43)

In practice, to avoid the inverse \(B_k = H^{-1}_k\), we can update this inverse directly [47] in lieu of updating \(H_k\) as in (43). Note that the BFGS update (43) or its inverse version may not maintain the sparsity or block pattern structures of the sequence \(\left\{ H_k\right\} \) or \(\left\{ B_k\right\} \) even if \(\nabla ^2{f}\) is sparse.
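For completeness, this inverse counterpart of (43) can be written in the standard rank-two (Sherman-Morrison-type) form, which maintains \(B_k \approx H_k^{-1}\) directly and still satisfies the secant equation (42). The sketch below assumes the curvature condition \(\langle y^k, s^k\rangle > 0\) holds, as guaranteed along the iterates by Theorem 7(b).

```python
import numpy as np

def bfgs_inverse_update(B, s, y):
    """Inverse BFGS update: returns B_{k+1} approximating H_{k+1}^{-1},
    so that the secant equation (42) holds as B_{k+1} y = s.

    Standard Sherman-Morrison-type form; assumes the curvature
    condition <y, s> > 0. An illustrative sketch.
    """
    rho = 1.0 / (y @ s)                  # curvature condition: y @ s > 0
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ B @ V.T + rho * np.outer(s, s)
```

A direct computation shows \(B_{k+1}y^k = s^k\), i.e., the secant equation (42) in inverse form.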

The following result shows that the quasi-Newton method (40) achieves a superlinear convergence whose proof can be found in Appendix A.9.

Theorem 7

Assume that \(x^{\star }_f \in \mathrm {dom}(f)\) is the unique solution of (24) and is strongly regular. Let \(\left\{ x^k\right\} \) be the sequence generated by (40). Then, the following statements hold:

  1. (a)

    Assume, in addition, that the sequence of matrices \(\left\{ H_k\right\} \) satisfies the Dennis–Moré condition (41) with \(\hat{x} = x^{\star }_f\). Then, there exist \(\bar{r} > 0\) and \(\bar{k} \ge 0\) such that, for all \(k \ge \bar{k}\), we have \(\Vert x^k - x^{\star }_f\Vert _{x^{\star }_f} \le \bar{r}\) and \(\left\{ x^k\right\} \) locally converges to \(x^{\star }_f\) at a superlinear rate.

  2. (b)

    Suppose that \(H_0\) is chosen such that \(H_0\in \mathcal {S}^p_{++}\). Then, \(\langle y^k, s^k\rangle > 0\) for all \(k\ge 0\), and hence, the sequence \(\left\{ H_k\right\} \) generated by (43) is symmetric positive definite and satisfies the secant equation (42). Moreover, if the sequence \(\left\{ x^k\right\} \) generated by (40) satisfies \(\sum _{k=0}^{\infty }\Vert x^k - x^{\star }_f\Vert _{x^{\star }_f} < +\infty \), then \(\left\{ x^k\right\} \) locally converges to the unique solution \(x^{\star }_f\) of (24) at a superlinear rate.

Note that the condition \(\sum _{k=0}^{\infty }\Vert x^k - x^{\star }_f\Vert _{x^{\star }_f} < +\infty \) in Theorem 7(b) can be guaranteed if \(\Vert x^{k+1} - x^{\star }_f\Vert _{x^{\star }_f} \le \rho \Vert x^k -x^{\star }_f\Vert _{x^{\star }_f}\) for some \(\rho \in (0, 1)\) and all \(k\ge \bar{k} \ge 0\). Hence, if \(\left\{ x^k\right\} \) locally converges to \(x^{\star }_f\) at a linear rate, then it also locally converges to \(x^{\star }_f\) at a superlinear rate.

6 Numerical experiments

We provide five examples to verify our theoretical results and compare our methods with existing methods in the literature. Our algorithms are implemented in Matlab 2014b running on a MacBook Pro (Retina, 2.7 GHz Intel Core i5, 16 GB 1867 MHz DDR3 memory).

6.1 Comparison with [72] on regularized logistic regression

In this example, we empirically show that our theory provides a better step-size for logistic regression than [72], as theoretically shown in Example 4. In addition, our step-size can be used to guarantee global convergence of the Newton method without linesearch. It can also be used as a lower bound for backtracking or forward linesearch to enhance the performance of Algorithm 1.

To illustrate these aspects, we consider the following regularized logistic regression problem:

$$\begin{aligned} f^{\star } := \min _{x\in \mathbb {R}^p}\Big \{ f(x) := \frac{1}{n}\sum _{i=1}^n\ell ( y_i(a_i^{\top }x+ \mu )) + \frac{\gamma }{2}\Vert x\Vert _2^2 \Big \}, \end{aligned}$$
(44)

where \(\ell (s) = \log (1 + e^{-s})\) is the logistic loss, \(\mu \) is a given intercept, \(y_i \in \left\{ -1,1\right\} \) and \(a_i\in \mathbb {R}^p\) are given as input data for \(i=1,\ldots , n\), and \(\gamma > 0\) is a given regularization parameter.

As shown previously in Proposition 5, f can be cast into an \((M^{(3)}_f, 3)\)-generalized self-concordant function with \(M^{(3)}_f = \frac{1}{\sqrt{\gamma }}\max \left\{ \Vert a_i\Vert _2 \mid 1\le i\le n\right\} \). On the other hand, f can also be considered as an \((M_f^{(2)}, 2)\)-generalized self-concordant function with \(M_f^{(2)} := \max \left\{ \Vert a_i\Vert _2 \mid 1\le i\le n\right\} \).

We implement Algorithm 1 using two different step-sizes \(\tau _k^{(2)} = \frac{\ln (1 + \beta _k)}{\beta _k}\) and \(\tau ^{(3)}_k := \frac{1}{1 + 0.5M^{(3)}_f\lambda _k}\) as suggested by Theorem 2 for \(\nu = 2\) and \(\nu = 3\), respectively. We terminate Algorithm 1 if \(\Vert \nabla {f}(x^k)\Vert _2 \le 10^{-8}\max \left\{ 1, \Vert \nabla {f}(x^0)\Vert _2\right\} \), where \(x^0 = \varvec{0}\) is an initial point. To solve the linear system (26), we apply a conjugate gradient method to avoid computing the inverse \(\nabla ^2{f}(x^k)^{-1}\) of the Hessian matrix \(\nabla ^2{f}(x^k)\) in large-scale problems. We also compare our algorithms with the fast gradient method in [40] using an optimal step-size for strongly convex functions which has an optimal linear convergence rate.

We test all algorithms on a binary classification dataset downloaded from [12] at https://www.csie.ntu.edu.tw/~cjlin/libsvm/. As suggested in [72], we normalize the data such that each row \(a_i\) has \(\Vert a_i\Vert _2 = 1\) for \(i=1,\ldots , n\). The parameter is set to \(\gamma := 10^{-5}\) as in [72].

The convergence behavior of Algorithm 1 for \(\nu = 2\) and \(\nu = 3\) is plotted in Figure 1 for the news20 problem.

Fig. 1 The convergence of Algorithm 1 for news20.binary (left: relative objective residuals, middle: relative norms of gradient, and right: step-sizes)

Table 3 The performance and results of the three algorithms for solving the logistic regression problem (44)

As we can see from this figure, Algorithm 1 with \(\nu = 2\) outperforms the case \(\nu = 3\). The left-most plot shows the relative objective residual \(\frac{f(x^k) - f^{\star }}{\max \left\{ 1, \left| f^{\star }\right| \right\} }\), the middle one shows the relative gradient norm \(\frac{\Vert \nabla {f}(x^k)\Vert _2}{\max \left\{ 1, \Vert \nabla {f}(x^0)\Vert _2\right\} }\), and the right-most plot displays the step-sizes \(\tau _k^{(2)}\) and \(\tau _k^{(3)}\). Note that the step-size \(\tau _k^{(3)}\) of Algorithm 1 depends on the regularization parameter \(\gamma \). If \(\gamma \) is small, then \(\tau _k^{(3)}\) is also small. In contrast, the step-size \(\tau _k^{(2)}\) of Algorithm 1 is independent of \(\gamma \).

Our second test is performed on six problems with different sizes. Table 3 shows the performance and results of three algorithms: Algorithm 1 with \(\nu = 2\), Algorithm 1 with \(\nu = 3\), and the fast-gradient method in [40]. Here, n is the number of data points, p is the number of variables, iter is the number of iterations, error is the training error measured by \(\frac{1}{2n}\sum _{i=1}^n(1-\mathrm {sign}(y_i(a_i^{\top }x + \mu )))\), and \(f(x^k)\) is the objective value achieved by these three algorithms.

We observe that our step-size \(\tau _k^{(2)}\) using \(\nu = 2\) works much better than \(\tau ^{(3)}_k\) using \(\nu = 3\) as in [72]. This confirms the theoretical analysis in Example 4. This step-size can be useful for parallel and distributed implementations, where evaluating the objective value often requires high computational effort due to communication and data transfer. Note that the computation of the step-size \(\tau ^{(2)}_k\) in Algorithm 1 needs only \(\mathcal {O}(p)\) operations and does not require a pass over all the data points. Algorithm 1 with \(\nu = 2\) also works better than the fast gradient method [40] in this experiment, especially for the case \(n\gg 1\). Note that the fast gradient method uses the optimal step-size and has a linear convergence rate in this case.

Finally, we show that our step-size \(\tau _k^{(2)}\) can be used as a lower bound to enhance a backtracking linesearch procedure in Newton methods. The Armijo linesearch condition is given as

$$\begin{aligned} f(x^k + \tau _kn_{\mathrm {nt}}^k) \le f(x^k) - c_1\tau _k\nabla {f}(x^k)^{\top }n_{\mathrm {nt}}^k, \end{aligned}$$
(45)

where \(c_1 \in (0, 1)\) is a given constant. Here, we use \(c_1 = 10^{-6}\) which is sufficiently small.

  • In our backtracking linesearch variant, we search for the best step-size \(\tau \in [\tau _k^{(2)}, 1]\). This variant requires computing \(\tau _k^{(2)}\), which needs \(\mathcal {O}(p)\) operations.

  • In the standard backtracking linesearch routine, we search for the best step-size \(\tau \in (0, 1]\).

Both strategies use a bisection rule \(\tau \leftarrow \tau /2\), starting from \(\tau \leftarrow 1\). The results on three problems are reported in Table 4.
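The two variants can be sketched in one routine, assuming a generic objective f, its gradient grad at \(x^k\), and a Newton direction n_nt (all names here are our own illustration): setting tau_min = 0 recovers the standard backtracking routine, while tau_min = \(\tau _k^{(2)}\) gives our variant. The Armijo test follows (45) literally.

```python
import numpy as np

def backtracking_ls(f, x, n_nt, grad, tau_min=0.0, c1=1e-6, max_halvings=50):
    """Bisection backtracking tau <- tau/2, starting from tau = 1, that
    enforces the Armijo condition (45) and never lets the step drop
    below the lower bound tau_min."""
    fx = f(x)
    tau, n_evals = 1.0, 1
    for _ in range(max_halvings):
        n_evals += 1
        if f(x + tau * n_nt) <= fx - c1 * tau * grad.dot(n_nt):
            return tau, n_evals
        if tau / 2.0 < tau_min:   # next halving would violate the bound
            break
        tau /= 2.0
    return max(tau_min, tau / 2.0), n_evals
```

The routine also returns the number of function evaluations, which is the quantity compared in Table 4.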

Table 4 The performance and results of the two linesearch variants of Algorithm 1 for solving (44)

As shown in Table 4, using the step-size \(\tau _k^{(2)}\) as a lower bound for the backtracking linesearch also reduces the number of function evaluations in these three problems. Note that the number of function evaluations depends on the starting point \(x^0\) as well as on the factor \(c_1\) in (45). If we set \(c_1\) too small, then the decrease in f can be small. Conversely, if we set \(c_1\) too high, then the decrement \(c_1\tau _k\nabla {f}(x^k)^{\top }n_{\mathrm {nt}}^k\) may never be achieved, and the linesearch condition fails to hold. If we change the starting point \(x^0\), the number of function evaluations can increase significantly.

6.2 The case \(\nu = 2\): matrix balancing

We consider the following convex optimization problem originated from matrix balancing [14]:

$$\begin{aligned} f^{\star } := \min _{x\in \mathbb {R}^p}\Big \{ f(x):=\sum _{1\le i,j\le p}a_{ij}e^{x_i-x_j} \Big \}, \end{aligned}$$
(46)

where \(A = (a_{ij})_{p\times p}\) is a nonnegative square matrix in \(\mathbb {R}^{p\times p}\). Although (46) is an unconstrained smooth convex problem, its objective function f is neither strongly convex nor equipped with a Lipschitz continuous gradient. Hence, existing gradient-type methods have neither a theoretical convergence guarantee nor a rule to compute step-sizes. Nevertheless, (46) is an important problem in scientific computing.
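To make the objective concrete, the following sketch evaluates f and its gradient for (46). The vectorized form via the scaled matrix \(E = DAD^{-1}\) with \(D = \mathrm {diag}(e^{x_1},\ldots ,e^{x_p})\) is our own illustration, not code from the paper.

```python
import numpy as np

def matrix_balancing(A, x):
    """Objective f(x) = sum_{i,j} a_ij * e^{x_i - x_j} of (46) and its
    gradient, via the scaled matrix E = D A D^{-1} with D = diag(e^x)."""
    E = np.exp(x)[:, None] * A * np.exp(-x)[None, :]  # E_ij = a_ij e^{x_i - x_j}
    grad = E.sum(axis=1) - E.sum(axis=0)              # row sums minus column sums
    return E.sum(), grad
```

At a balancing point the row sums of E equal its column sums, which is exactly the first-order condition \(\nabla f(x) = 0\).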

By Proposition 1 and Corollary 1, f is generalized self-concordant with \(M=\sqrt{2}\) and \(\nu =2\). We implement Algorithm 1 and the most recent method proposed in [14] (called the Boxed-constrained Newton method (BCNM)) to solve (46). Note that [14] is not directly applicable to (46); instead, it solves a regularization of this problem. Since \(\nabla ^2f(x)\) is not positive definite, we use a preconditioned conjugate gradient (PCG) method to solve the linear system in Algorithm 1. We use an accelerated projected gradient method (FISTA) [4] to solve the subproblem of the method in [14]. We terminate PCG and FISTA using either a tolerance of \(10^{-9}\) or a maximum of 200 iterations. For the outer loop, we terminate Algorithm 1 and BCNM using the same stopping criterion: \(\delta {f_k'} :=\Vert \nabla f(x^k)\Vert _2/\max \left\{ 1,\Vert \nabla f(x^0)\Vert _2\right\} \le 10^{-8}\). We choose \(x^0 := \varvec{0}^p\) as the initial point.

We test both algorithms on several synthetic and real datasets. The synthetic data is generated as in [52] with different structures. The basic matrix \(H = (H_{ij})_{p\times p}\) is a \(p\times p\) upper Hessenberg matrix defined as \(H_{ij}=0\) if \(j< i-1\), and \(H_{ij} = 1\) otherwise. \(H_1\) differs from H only in that \(H_{11}\) is replaced by \(p^2\); \(H_2\) differs from H only in that \(H_{12}\) is replaced by \(p^2\); and \(H_3 = H + (p^2-1)\mathbb {I}_p\). We use these matrices for A in (46). We take \(p = 1000\), 5000, 10000, and 15000. We name each problem instance by “Hdy”, where H stands for Hessenberg, and \(\texttt {y} = 10^{-3}p\).
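The synthetic family above can be generated directly; the following is a minimal sketch (0-based indexing, so the entries \(H_{11}\) and \(H_{12}\) of the text correspond to positions (0, 0) and (0, 1)):

```python
import numpy as np

def hessenberg_family(p):
    """Test matrices as in [52]: H_ij = 0 if j < i - 1 and 1 otherwise
    (upper Hessenberg); H1 replaces H_11 by p^2, H2 replaces H_12 by
    p^2, and H3 = H + (p^2 - 1) I."""
    i, j = np.indices((p, p))
    H = np.where(j < i - 1, 0.0, 1.0)
    H1, H2 = H.copy(), H.copy()
    H1[0, 0] = p ** 2          # entry H_{11} in 1-based notation
    H2[0, 1] = p ** 2          # entry H_{12} in 1-based notation
    H3 = H + (p ** 2 - 1) * np.eye(p)
    return H, H1, H2, H3
```
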

The real data is downloaded from https://math.nist.gov/MatrixMarket/searchtool.html with different structures from different application fields, suggested by [13]. Since we require the matrix A to be nonnegative, we take \(A_0 := \max \left\{ 0, A\right\} \) (entry-wise). For the real data, if A is highly ill-conditioned, then we add a uniform noise \(\mathcal {U}[0, \sigma ]\) to A, where \(\sigma = 10^{-5}\max \{a_{ij} |1\le i, j\le p\}\).

The final results of both algorithms are reported in Table 5, where p is the size of the matrix A; iter/siter is the maximum number of Newton-type iterations / PCG or FISTA iterations; time[s] is the computational time in seconds; \(\delta {f_k'}\) is the relative gradient norm defined above; \(t_{\mathrm {rat}}\) is the ratio of the computational times of Algorithm 1 and BCNM; and \(\delta {x^k}\) is the relative difference between the iterates \(x^k\) given by Algorithm 1 and BCNM.

Table 5 Summary of the results of Algorithm 1 and BCNM on 10 synthetic and 30 real problem instances

As we can see from our experiment, both methods give almost the same results in terms of the objective values \(f(x^k)\) and approximate solutions \(x^k\). Given the same stopping criterion and solution quality, Algorithm 1 outperforms BCNM on all datasets in terms of average computational time, as quantified by \(t_{\text {rat}} =\frac{\text {time}_{\mathrm {BCNM}}}{\text {time}_{\mathrm {Alg.~1}}}\). In particular, for many asymmetric and/or ill-conditioned datasets (e.g., H2d5 or bwm), Algorithm 1 is approximately 8 to 17 times faster than BCNM.

6.3 The case \(\nu \in (2, 3)\): distance-weighted discrimination regression.

In this example, we test the performance of Algorithm 1 on the distance-weighted discrimination (DWD) problem introduced in [36]. In order to directly use Algorithm 1, we slightly modify the setting in [36] to obtain the following form:

$$\begin{aligned}&f^{\star } := \min _{x= [w, \xi , \mu ]^{\top }\in \mathbb {R}^p}\left\{ f(x) := \frac{1}{n}\sum _{i=1}^n \frac{1}{(a_i^{\top }w+ \mu y_i + \xi _i)^q} + c^{\top }\xi \right. \nonumber \\&\qquad \quad \left. + \frac{1}{2}\left( \gamma _1 \Vert w\Vert _2^2 + \gamma _2\mu ^2 + \gamma _3\Vert \xi \Vert _2^2\right) \right\} , \end{aligned}$$
(47)

where \(q > 0\), \(a_i, y_i\) (\(i=1,\ldots , n)\) and \(c\) are given, and \(\gamma _s > 0\) (\(s = 1,2,3\)) are three regularization parameters for \(w\), \(\mu \), and \(\xi \), respectively. The variable \(x\) consists of the support vector \(w\), the intercept \(\mu \), and the slack variable \(\xi \) as used in [36], but we penalize these variables by least-squares terms instead of the \(\ell _1\)-penalty term of [36]. Note that the setting (47) is not limited to the DWD application above; it can also be used to formulate other practical models, such as time-optimal path-planning problems in robotics [69], if we choose an appropriate parameter q.

Since \(\varphi (t) := \frac{1}{t^q}\) is \((M_{\varphi }, \nu )\)-generalized self-concordant with \(M_{\varphi } := \frac{q+2}{\root (q+2) \of {q(q+1)}}n^{\frac{1}{q+2}}\) and \(\nu := \tfrac{2(q+3)}{q+2} \in (2, 3)\), using Proposition 1, we can show that f is \((M_f, \tfrac{2(q+3)}{q+2})\)-generalized self-concordant with \(M_f := \frac{q+2}{\root (q+2) \of {q(q+1)}}n^{\frac{1}{q+2}}\max \left\{ \left\| (a_i^{\top }, y_i, e_i^{\top })^{\top }\right\| _2^{q/(q+2)} \mid 1\le i\le n\right\} \) (here, \(e_i\) is the i-th unit vector). Problem (47) can be transformed into a second-order cone program [25], and can be solved by interior-point methods. For instance, if we choose \(q = 1\), then, by introducing intermediate variables \(s_i\) and \(r_i\), we can transform (47) into a second-order cone program using the fact that \(\frac{1}{r_i} \le s_i\) is equivalent to \(\sqrt{(r_i-s_i)^2 + 2^2} \le (r_i + s_i)\).
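For completeness, the equivalence invoked for \(q = 1\) follows by squaring both sides (valid since \(r_i + s_i > 0\)):

```latex
\sqrt{(r_i - s_i)^2 + 2^2} \le r_i + s_i
\;\Longleftrightarrow\; (r_i - s_i)^2 + 4 \le (r_i + s_i)^2
\;\Longleftrightarrow\; 4 \le 4\, r_i s_i
\;\Longleftrightarrow\; \frac{1}{r_i} \le s_i
\qquad (r_i, s_i > 0),
```

using \((r_i + s_i)^2 - (r_i - s_i)^2 = 4 r_i s_i\).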

We implement Algorithm 1 to solve (47) and compare it with the interior-point method implemented in the commercial software Mosek. We observed that Mosek is much faster than other interior-point solvers such as SDPT3 [60] or SDPA [70] in this test; for instance, Mosek is 52 to 125 times faster than SDPT3 in this example. Hence, we only present the results of Mosek.

We also incorporate Algorithm 1 with a backtracking linesearch using our step-size \(\tau _k\) (LS with \(\tau _k\)) as a lower bound. Note that since f does not have a Lipschitz gradient map, we cannot apply gradient-type methods to solve (47) due to the lack of a theoretical guarantee.

Since we cannot run Mosek on big datasets, we test our algorithms and this interior-point solver on six small and medium-size problems using data from [12] (https://www.csie.ntu.edu.tw/~cjlin/libsvm/). We choose the regularization parameters as \(\gamma _1 = \gamma _2 = 10^{-5}\) and \(\gamma _3 = 10^{-7}\). Note that if the dataset has size (n, p), then the number of variables in (47) becomes \(p+n+1\). Hence, we use a built-in Matlab conjugate gradient solver to compute the Newton direction \(n_{\mathrm {nt}}^k\). The initial point \(x^0\) is chosen as \(w^0 := \varvec{0}\), \(\mu ^0 := 0\), and \(\xi ^0 := \varvec{1}\). In our algorithms, we use \(\Vert \nabla {f}(x^k)\Vert _2 \le 10^{-8}\max \left\{ 1, \Vert \nabla {f}(x^0)\Vert _2\right\} \) as the stopping criterion.

Note that, by the choice of \(\gamma _i\) for \(i=1,2,3\) with \(\gamma _{\min } := \min \left\{ \gamma _1,\gamma _2, \gamma _3\right\} = 10^{-7} > 0\), the objective function of (47) is strongly convex. By Proposition 4(a), we can cast this function into an \((\hat{M}_f, \hat{\nu })\)-generalized self-concordant function with \(\hat{\nu } = 3\) and \(\hat{M}_f := \gamma _{\min }^{\frac{-q}{2(q+2)}}M_f\), where \(M_f\) is given above. We also implement Algorithm 1 using \(\hat{\nu } = 3\) to solve (47).

The results and performance of the four algorithms are reported in Table 6 for two cases: \(q=1\) and \(q=2\). We can see that Algorithm 1 with \(\nu = 2\) outperforms the case \(\hat{\nu } = 3\) in terms of iterations: the case \(\nu = 2\) is approximately 3 to 13 times faster than the case \(\hat{\nu } = 3\). This is not surprising, since \(\hat{M}_f\) depends on \(\gamma _{\min }\) and is large when \(\gamma _{\min }\) is small. Hence, the step-size \(\tau _k^{(3)}\) computed using \(\hat{M}_f\) is smaller than \(\tau _k^{(2)}\) computed from \(M_f\), as we saw in the first example. Mosek performs very well in this example and is slightly better than Algorithm 1 with \(\nu = 2\); however, if we combine Algorithm 1 with a backtracking linesearch, then this variant outperforms Mosek. All the algorithms achieve a very high accuracy, up to \(10^{-8}\), in terms of the relative gradient norm \(\frac{\Vert \nabla {f}(x^k)\Vert _2}{\Vert \nabla {f}(x^0)\Vert _2}\). We emphasize that our methods are highly parallelizable, and their performance can be improved by exploiting this structure, as studied in [72] for the logistic case.

Table 6 The performance and results of the four methods for solving the DWD problem (47)

6.4 The case \(\nu = 3\): portfolio optimization with logarithmic utility functions.

In this example, we aim to verify Algorithm 2 for solving the composite generalized self-concordant minimization problem (32) with \(\nu = 3\). We illustrate this algorithm on the following portfolio optimization problem with logarithmic utility functions [59] (scaled by a factor of \(\frac{1}{n}\)):

$$\begin{aligned} f^{\star } = \min _{x\in \mathbb {R}^p}\left\{ f(x) := -\sum _{i=1}^n\log (w_i^{\top }x) \mid x\ge 0, ~~\varvec{1}^{\top }x= 1 \right\} , \end{aligned}$$
(48)

where \(w_i\in \mathbb {R}^p_+\) for \(i=1,\ldots , n\) are given vectors representing the returns of the assets in the portfolio at the i-th period. More precisely, as indicated in [9], \(w_i\) measures the return as the ratio \(w_{ij} = v_{i,j}/v_{i-1,j}\) between the closing prices \(v_{i,j}\) and \(v_{i-1,j}\) of the stocks on the current day i and the previous day \(i-1\), respectively; \(\varvec{1}\in \mathbb {R}^p\) is the vector of all ones. The aim is to find an optimal strategy for assigning the proportions of the assets in order to maximize the expected return among all portfolios.

Note that problem (48) can be cast into an online optimization model [27], and the authors of [27] proposed an online Newton method to solve it. In this case, the regret of the online algorithm, which measures the difference between the objective value of the online counterpart and that of (48), converges to zero at a rate of \(\frac{1}{\sqrt{n}}\) as \(n\rightarrow \infty \). If n is relatively small (e.g., \(n=1000\)), then the online Newton method does not provide a good approximation to (48).

Let \(\varDelta :=\left\{ x\in \mathbb {R}^p \mid x\ge 0, ~~\varvec{1}^{\top }x= 1\right\} \) be the standard simplex, and \(g(x) := \delta _{\varDelta }(x)\) be the indicator function of \(\varDelta \). Then, we can formulate (48) into (32). The function f defined in (48) is \((M_f,\nu )\)-generalized self-concordant with \(\nu = 3\) and \(M_f=2\).
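For reference, the smooth part of (48) and its gradient can be sketched as follows (our own vectorized illustration of the model, not the authors' implementation; the rows of W are the return vectors \(w_i\)):

```python
import numpy as np

def log_utility(W, x):
    """Smooth part f(x) = -sum_i log(w_i^T x) of (48) and its gradient
    grad f(x) = -sum_i w_i / (w_i^T x); the rows of W are the w_i."""
    t = W.dot(x)                                    # t_i = w_i^T x (must be > 0)
    return -np.log(t).sum(), -(W / t[:, None]).sum(axis=0)
```

For \(x\) in the interior of the simplex \(\varDelta \) and nonnegative, nonzero rows \(w_i\), all inner products \(t_i\) are positive, so f is well defined.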

We implement Algorithm 2 using an accelerated projected gradient method [4, 40] to compute the proximal Newton direction. We also implement the Frank–Wolfe algorithm and its linesearch variant in [20, 30], and a projected gradient method using Barzilai and Borwein's step-size to solve (48). We name these algorithms FW, FW-LS, and PG-BB, respectively.

We emphasize that neither PG-BB nor FW-LS has a theoretical guarantee when solving (48). FW has a theoretical guarantee, as recently proved in [49], but the complexity bound is rather pessimistic. We terminate all algorithms using \(\Vert x^{k+1}-x^k\Vert _2 \le \varepsilon \max \left\{ 1,\Vert x^k\Vert _2\right\} \), where \(\varepsilon = 10^{-8}\) in Algorithm 2, \(\varepsilon = 10^{-6}\) in PG-BB, and \(\varepsilon = 10^{-4}\) in FW and FW-LS. We choose different accuracies for these methods because the last three, being first-order methods, have difficulty attaining highly accurate solutions.

We test these algorithms on two categories of datasets: synthetic and real stock data. For the synthetic data, we generate the matrix \(W\) of price ratios described above in Matlab. More precisely, we generate \(W:= \mathrm {ones}(n, p) + \mathcal {N}(0, 0.1)\), which allows the closing prices to vary by about \(10\%\) between two consecutive periods. We test three instances, where \((n, p) = (1000,800)\), (1000, 1000), and (1000, 1200), and name these datasets PortfSyn1, PortfSyn2, and PortfSyn3, respectively. For the real data, we download a US stock dataset using the excel tool http://www.excelclout.com/historical-stock-prices-in-excel/, which gives the closing prices of the US stock market in a given period of time. We generate three datasets of different sizes using different numbers of stocks from 2005 to 2016, as described in [9], and pre-process the data by removing stocks with empty or missing information in the specified time period. We name these datasets Stock1, Stocks2, and Stocks3, respectively.
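The Matlab generation above can be mimicked as follows; we assume \(\mathcal {N}(0, 0.1)\) denotes i.i.d. Gaussian entries with standard deviation 0.1 (consistent with the stated 10% variation), and the seed is our own addition for reproducibility.

```python
import numpy as np

def make_portfolio_synthetic(n, p, sigma=0.1, seed=0):
    """W = ones(n, p) + sigma * N(0, 1): price ratios fluctuating by
    roughly 10% around 1 between consecutive periods."""
    rng = np.random.default_rng(seed)
    return np.ones((n, p)) + sigma * rng.standard_normal((n, p))
```
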

The results and the performance of the four algorithms are given in Table 7. Here, iter gives the number of iterations, time is the computational time in seconds, and error measures the relative difference between the approximate solution \(x^k\) given by each algorithm and the interior-point solution provided by CVX [25] with the high-precision configuration (up to \(1.8\times 10^{-12})\): \(\left\| x^k -x^{*}_{\mathrm {cvx}}\right\| /\max \left\{ 1,\left\| x^{*}_{\mathrm {cvx}}\right\| \right\} \).

Table 7 The performance and results of the four algorithms for solving the portfolio optimization problem (48)

From Table 7, we can see that Algorithm 2 has performance comparable to the first-order methods FW-LS and PG-BB. While our method has a rigorous convergence guarantee, these first-order methods still lack one. Note that Algorithm 2 and PG-BB are faster than the FW method and its linesearch variant, although the optimal solution \(x^{\star }\) of this problem is very sparse. We also note that PG-BB gives a smaller error with respect to the CVX solution. This CVX solution is not the ground truth \(x^{\star }\) but a highly accurate approximation of it; in fact, the CVX solution is dense. Hence, it is not clear whether PG-BB produces a better solution than the other methods.

6.5 Proximal Quasi-Newton method for sparse multinomial logistic regression.

We apply our proximal Newton and proximal quasi-Newton methods to solve the following sparse multinomial logistic problem studied in various papers including [31]:

$$\begin{aligned}&F^{\star } := \min _{x} \Big \{ F(x) := \underbrace{\frac{1}{n}\sum _{j=1}^n\Big (\log \big ( \sum _{i=1}^{m}e^{\langle w^{(j)}, x^{(i)}\rangle } \big ) - \sum _{i=1}^{m}y_i^{(j)}\langle w^{(j)}, x^{(i)}\rangle \Big )}_{f(x)} \nonumber \\&\qquad + \underbrace{\gamma \Vert \mathrm {vec}(x)\Vert _1}_{g(x)} \Big \}, \end{aligned}$$
(49)

where \(x\) can be considered as a matrix variable of size \(m\times p\) formed from \(x^{(1)}, \cdots , x^{(m)}\), \(\mathrm {vec}(\cdot )\) is the vectorization operator, and \(\gamma > 0\) is a regularization parameter. Both \(y_i^{(j)} \in \left\{ 0, 1\right\} \) and \(w^{(j)}\) are given as input data for \(i=1,\ldots , m\) and \(j=1,\ldots , n\).

The function f defined in (49) has a closed-form Hessian matrix. However, forming the full Hessian matrix \(\nabla ^2{f}(x)\) requires intensive computation in large-scale problems when \(n \gg 1\). Hence, we apply our proximal quasi-Newton methods in this case. As shown in [63, Lemma 4], the function f is \((M_f, \nu )\)-generalized self-concordant with \(\nu = 2\) and \(M_f := \frac{\sqrt{6}}{n}\max \left\{ \Vert w^{(j)}\Vert _2 \mid 1 \le j \le n\right\} \).
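A direct evaluation of F in (49) can be sketched as follows. This is our own vectorized illustration: W stacks the \(w^{(j)}\) as rows, X stacks the \(x^{(i)}\) as rows, and Y holds the labels \(y_i^{(j)}\); the log-sum-exp is stabilized by subtracting the row maximum.

```python
import numpy as np

def multinomial_logistic_obj(W, Y, X, gamma):
    """F(x) = f(x) + g(x) of (49).  W is (n, p) with rows w^(j),
    Y is (n, m) with entries y_i^(j), X is (m, p) with rows x^(i)."""
    Z = W.dot(X.T)                                   # Z[j, i] = <w^(j), x^(i)>
    zmax = Z.max(axis=1, keepdims=True)              # stabilized log-sum-exp
    lse = zmax[:, 0] + np.log(np.exp(Z - zmax).sum(axis=1))
    f = (lse - (Y * Z).sum(axis=1)).mean()           # smooth part f(x)
    return f + gamma * np.abs(X).sum()               # g(x) = gamma * ||vec(x)||_1
```
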

We implement our proximal quasi-Newton methods to solve (49) and compare them with the accelerated first-order methods implemented in a well-established software package called TFOCS [5]. We use three different variants of TFOCS: TFOCS with N07 (using Nesterov’s 2007 method with two proximal operations per iteration), TFOCS with N83 (using Nesterov’s 1983 method with one proximal operation per iteration), and TFOCS with AT (using Auslender and Teboulle’s accelerated method).

We test on a collection of 26 multi-class datasets downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvm/. We set the parameter \(\gamma \) in (49) to \(\gamma := \frac{0.5}{\sqrt{N}}\) after fine-tuning. We terminate all algorithms if \(\Vert x^{k+1} - x^k\Vert \le 10^{-6}\max \left\{ 1, \Vert x^k\Vert \right\} \).

We first plot, in Figure 2 (left), the convergence behavior in terms of iterations of the three proximal Newton-type algorithms proposed in this paper for the dna problem with 3 classes, 2000 data points, and 180 features.

Fig. 2

Left: convergence behavior of three methods,  right: performance profile in time [second] of 5 methods

As we can see from this figure, the proximal Newton method takes fewer iterations than the other two methods. However, each of its iterations is more expensive than those of the proximal quasi-Newton methods due to the evaluation of the Hessian matrix. In our experiment, the quasi-Newton method with L-BFGS outperforms the one with BFGS.

Next, we build a performance profile in time [second] to compare five different algorithms: two proximal quasi-Newton methods proposed in this paper (BFGS and L-BFGS), and three variants of the accelerated first-order methods implemented in TFOCS.

Performance profiles were introduced in [17] and can be considered a standard way to compare different optimization algorithms. A performance profile is built from a set \(\mathcal {S}\) of \(n_s\) algorithms (solvers) and a collection \(\mathcal {P}\) of \(n_p\) problems. We build a profile based on computational time. Let \(T_{ij}\) denote the computational time required to solve problem i by solver j. We compare the performance of solver j on problem i with the best performance of any algorithm on this problem; that is, we compute the performance ratio \(r_{ij} := \frac{T_{ij}}{\min \{T_{ik} \mid k \in \mathcal {S}\}}\). Now, let \(\tilde{\rho }_j(\tilde{\tau }) := \frac{1}{n_p}\mathrm {size}\left\{ i\in \mathcal {P} \mid r_{ij}\le \tilde{\tau }\right\} \) for \(\tilde{\tau } \in \mathbb {R}_{+}\). The function \(\tilde{\rho }_j:\mathbb {R}\rightarrow [0,1]\) is the probability for solver j that its performance ratio is within a factor \(\tilde{\tau }\) of the best possible ratio. We use the term "performance profile" for the distribution function \(\tilde{\rho }_j\) of this performance metric. In the following numerical examples, we plot the performance profiles in \(\log _2\)-scale, i.e., \(\rho _j(\tau ) := \frac{1}{n_p}\mathrm {size}\left\{ i\in \mathcal {P} \mid \log _2(r_{i,j})\le \tau := \log _2\tilde{\tau }\right\} \).
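The profile construction just described can be sketched in a few lines, under the assumption that T stores the solve times \(T_{ij}\) with one row per problem and one column per solver:

```python
import numpy as np

def performance_profile(T, taus):
    # rho_j(tau) = fraction of problems i with log2(r_ij) <= tau,
    # where r_ij = T_ij / min_k T_ik is the performance ratio.
    ratios = np.log2(T / T.min(axis=1, keepdims=True))
    return np.array([(ratios <= tau).mean(axis=0) for tau in taus])
```

At \(\tau = 0\) the profile gives the fraction of problems on which each solver is (tied for) fastest; as \(\tau \rightarrow \infty \) every column approaches 1.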

Figure 2 (right) shows the performance profile of the five algorithms on the collection of 26 problems indicated above. The proximal quasi-Newton method with L-BFGS achieves the best performance on 13/26 problems (\(50\%\)), while the BFGS variant does so on 10/26 (\(38\%\)). In terms of computational time, both proximal quasi-Newton methods outperform the optimal proximal gradient methods in this experiment. It is also clear that our proximal quasi-Newton-type methods achieve more accurate solutions in this experiment than the accelerated proximal gradient-type methods implemented in TFOCS.

7 Conclusion

We have generalized the self-concordance notion in [45] to a broader class of smooth convex functions. This class covers several well-known instances, including logistic, exponential, reciprocal, and standard self-concordant functions, to name a few. We have developed a unified theory with several basic properties that reveal the smoothness structure of this function class, and we have provided several key bounds on local norms and on the Hessian mapping, gradient mapping, and function values of this class. We have then illustrated our theory by applying it to a class of smooth convex minimization problems and its composite setting. We believe that our theory provides an appropriate way to exploit the curvature of these problems and allows us to compute an explicit step-size in Newton-type methods with a global convergence guarantee, even for functions without Lipschitz continuous gradient or Hessian. While our theory remains valid for the case \(\nu > 3\), we have not yet found a representative application in a high-dimensional space. Therefore, we limit our consideration to Newton and proximal Newton methods for \(\nu \in [2, 3]\), although our key bounds in Sect. 2.7 remain valid for any \(\nu > 0\).

Our future research will focus on several aspects. Firstly, we plan to exploit this theory to develop more practical inexact and quasi-Newton-type methods that can easily capture practical applications in large-scale settings. Secondly, we will combine our approach with stochastic, randomized, and coordinate descent methods to develop new variants of algorithms that scale better in high-dimensional spaces. Thirdly, by exploiting generalized self-concordance, Lipschitz gradient continuity, and strong convexity, one can also develop first-order methods for convex optimization problems. Finally, we plan to generalize our theory to primal-dual settings and monotone operators in order to cover other classes of convex problems such as convex–concave saddle points, constrained convex optimization, and monotone equations and inclusions.