1 Introduction

In this paper, we consider the following constrained convex optimization problem:

$$\begin{aligned} f^{\star } := \min _{{\mathbf {x}}\in {\mathcal {X}}}f({\mathbf {x}}). \end{aligned}$$
(1)

Here we assume that \({\mathcal {X}}\) is a nonempty, closed, and convex subset in \({\mathbb {R}}^p\) and \(f : {\mathbb {R}}^p\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is a smooth and convex function such that \(\mathrm {dom}(f)\cap {\mathcal {X}}\ne \emptyset \). We emphasize that, in our setting, \(\mathrm {dom}(f)\) does not necessarily contain \({\mathcal {X}}\). Among first-order methods, the Frank–Wolfe (FW) method [14] (or more generally, the conditional gradient method) has gained tremendous popularity lately due to its scalability and its theoretical guarantees when the objective is L-smooth (i.e., its gradient \(\nabla {f}\) is Lipschitz continuous with a Lipschitz constant L) on \({\mathcal {X}}\). The scalability of FW is mainly due to its computational primitive, called the linear minimization oracle (LMO):

$$\begin{aligned} {\mathcal {L}}_{{\mathcal {X}}}({\mathbf {s}}) := \mathrm {arg}\!\displaystyle \min _{{\mathbf {u}}\in {\mathcal {X}}} \langle {\mathbf {s}}, {\mathbf {u}}\rangle . \end{aligned}$$
(2)

There are many applications, such as the latent group LASSO and simplex-constrained problems, where computing the LMO is significantly cheaper than projecting onto the constraint set \({\mathcal {X}}\). If \({\mathcal {X}}\) is polyhedral, then evaluating \({\mathcal {L}}_{{\mathcal {X}}}({\mathbf {s}})\) requires solving a linear program, which can be done in polynomial time up to very high accuracy. In many cases, evaluating \({\mathcal {L}}_{{\mathcal {X}}}({\mathbf {s}})\) can be done in closed form or with a low-order polynomial-time algorithm such as quick-sort; see, e.g., [26] and its subsequent references.
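As a concrete illustration, consider the standard simplex \(\Delta _p := \{{\mathbf {x}}\in {\mathbb {R}}^p \mid {\mathbf {x}}\ge 0, \sum _{i=1}^p{\mathbf {x}}_i = 1\}\): a linear function attains its minimum over \(\Delta _p\) at a vertex, so the LMO reduces to a single argmin. A minimal Python sketch (the function name is ours):

```python
import numpy as np

def lmo_simplex(s):
    """LMO over the standard simplex {x >= 0, sum(x) = 1}: the linear
    function <s, .> is minimized at the vertex e_i with i = argmin_j s_j."""
    v = np.zeros_like(s, dtype=float)
    v[np.argmin(s)] = 1.0
    return v
```

This costs \({\mathcal {O}}(p)\) time, whereas Euclidean projection onto the simplex is typically implemented with a sort and costs \({\mathcal {O}}(p\log p)\).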

While existing Frank–Wolfe methods can handle a sufficiently large class of convex problems, there are many machine learning problems whose objective functions involve logarithmic, ridge-regularized exponential, or log-determinant terms. Such problems have so far been unable to exploit the convergence rate and the scalability of the FW algorithm or its key variants. Our work precisely bridges this gap by focusing on objective functions where \(f : {\mathbb {R}}^p\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is standard self-concordant (see Definition 1) and \({\mathcal {X}}\) is a nonempty, compact, and convex set in \({\mathbb {R}}^p\). We emphasize that the class of self-concordant functions intersects the class of functions with Lipschitz continuous gradients, but neither class contains the other. In particular, we assume:

Assumption 1

The solution set \({\mathcal {X}}^{\star }\) of (1) is nonempty. The function f in (1) is standard self-concordant (cf. Definition 1) and its Hessian \(\nabla ^2 f({\mathbf {x}})\) is nondegenerate (i.e., \(\nabla ^2 f({\mathbf {x}})\) is positive definite) for any \({\mathbf {x}}\in \mathrm {dom}(f)\). The constraint set \({\mathcal {X}}\) is closed and bounded, and its LMO defined by (2) can be computed efficiently and accurately.

Note that when \({\mathcal {X}}\not \subseteq \mathrm {dom}(f)\), we cannot guarantee that the Hessian \(\nabla ^2{f}\) is bounded on \({\mathcal {X}}\). For instance, the univariate function \(f(x) := -\log (x) - \log (1-x)\) is self-concordant with domain \(\mathrm {dom}(f) = (0, 1)\). If we consider \({\mathcal {X}}:= [0,1]\), then f is not L-smooth on \({\mathcal {X}}\), since \(f''(x) = x^{-2} + (1-x)^{-2}\) blows up as x approaches the boundary of \({\mathcal {X}}\).

Under Assumption 1, problem (1) covers various applications in statistics and machine learning such as D-optimal experimental design [23, 31], minimum-volume enclosing ellipsoid [9], quantum tomography [21], logistic regression with elastic-net regularizer [42], portfolio optimization [37], and optimal transport [36].

Related work Motivated by the fact that, for many convex sets, including the simplex, polytopes, and the spectrahedron, computing a linear minimization oracle is much more efficient than evaluating the projection [25, 26], various linear minimization oracle-based algorithms have been proposed; see, e.g., [14, 25, 26, 29, 30]. Recently, such approaches have been extended to the primal-dual setting in [44, 45].

The most classical one is the Frank–Wolfe algorithm, proposed in [14] for minimizing a quadratic function over a polytope. It has been shown that the convergence rate of this method is \({\mathcal {O}}\left( 1/k\right) \) and that this rate is tight under the L-smoothness assumption, where k is the iteration counter. Since then, many works have attempted to improve the convergence rate of the Frank–Wolfe algorithm and its variants by imposing further assumptions or exploiting the underlying problem structures. For instance, [3] showed linear convergence of the Frank–Wolfe method under the assumption that f is a quadratic function and the optimal solution \({\mathbf {x}}^{\star }\) is in the interior of \({\mathcal {X}}\). [22] first proposed a variant of the Frank–Wolfe method with away-steps and proved a linear convergence rate in the objective value if f is strongly convex, \({\mathcal {X}}\) is a polytope, and the optimal solution \({\mathbf {x}}^{\star }\) is in the interior of \({\mathcal {X}}\).

Recently, [15, 28] showed that the result of [22] still holds even when \({\mathbf {x}}^{\star }\) is on the boundary of \({\mathcal {X}}\). This can be viewed as the first general global linear convergence result of Frank–Wolfe algorithms. [16] showed that the convergence rate of the Frank–Wolfe algorithm can be accelerated up to \({\mathcal {O}}\left( 1/k^2\right) \) if f is strongly convex and \({\mathcal {X}}\) is a “strongly convex set” (see their definition).

All the results mentioned above rely on the L-smoothness assumption on the objective function f. Moreover, the primal-dual methods in [44, 45] have difficulty establishing convergence rates in our setting, since they can only handle a self-concordant objective by splitting it and then relying on the proximal operator of the self-concordant part.

For the non-L-smooth case, the literature is minimal. Notably, [34] is, to the best of our knowledge, the first work that proved that the Frank–Wolfe method converges at an \({\mathcal {O}}\left( 1/k\right) \) rate for the Poisson phase retrieval problem, where f is a logarithmic objective. This result relies on the specific simplex structure of the feasible set \({\mathcal {X}}\) and on showing that the objective function f is eventually L-smooth on \({\mathcal {X}}\). However, the worst-case bound is rather loosely estimated. In addition, [9] showed linear convergence of the Frank–Wolfe method with away-steps for the minimum-volume enclosing ellipsoid problem with a log-determinant objective. The algorithms and analyses in these papers exploit the specific cost function and structure, and it is not clear how they can handle more general self-concordant objectives. Note that since the objective functions in both of the aforementioned works are self-concordant, they are covered by our framework in this paper. Another related work is [13], which became available several months after our paper was put online. However, the algorithm and its analysis in [13] are different from ours, and they rely on additional assumptions.

In terms of algorithms, several papers combine the Frank–Wolfe method with other schemes to solve different problems. For instance, [17, 20] propose to combine the Frank–Wolfe method with a (quasi-)Newton scheme to solve constrained nonlinear systems, where local and global convergence rates are established, respectively. [11, 19] further generalize the Frank–Wolfe methods of [17, 20] to an inexact projection framework. Notice that these algorithms are fundamentally different from our method. In fact, they first solve the Newton system and then apply a Frank–Wolfe method to approximate the projection, while our method uses a Frank–Wolfe scheme directly to solve the constrained quadratic subproblem (4). In the concurrent work [18], a Frank–Wolfe variant is proposed as a subsolver for the subproblem of the underlying quasi-Newton method, which is similar to ours. However, [18] does not establish an explicit convergence rate for the proposed method and uses a different set of assumptions.

Our goal and approach Our first goal is to tackle an important class of instances of (1) for which existing LMO-based methods have no convergence guarantees. Our results are advantageous when computing the LMO is cheaper than computing projections; otherwise, first-order methods, e.g., those of [41], can also be applied. For this purpose, we apply a projected Newton method to solve (1) and use the Frank–Wolfe method on the subproblems to approximate the projected Newton direction. This approach leads to a double-loop algorithm, where the outer loop performs an inexact projected Newton scheme, and the inner loop carries out an adaptive Frank–Wolfe scheme that automatically adjusts the inner accuracy.

Notice that our algorithm enjoys several additional computational advantages. When the feasible set \({\mathcal {X}}\) is a polytope, our subproblem becomes minimizing a quadratic function over a polytope. By the result of [28], we can use the Frank–Wolfe algorithm with away-steps to attain linear convergence without sacrificing the overall complexity. Since the objective function of our subproblem is quadratic, the optimal step-size at each iteration has a closed-form expression, leading to structure-exploiting variants (see Algorithm 2). Finally, our algorithm can enhance Frank–Wolfe-type approaches by using the inexact projected Newton direction.

Our contribution To this end, our contribution can be summarized as follows:

(a) We propose a double-loop algorithm to solve (1) when f is self-concordant (see Definition 1) and \({\mathcal {X}}\) is equipped with a tractable linear minimization oracle. The proposed algorithm is self-adaptive, i.e., it does not require tuning of the step-size or of the accuracy of the subproblem.

(b) We prove that the gradient and Hessian complexity of our method is \({\mathcal {O}}\left( \log \left( \frac{1}{\varepsilon }\right) \right) \), while the LMO complexity is \({\mathcal {O}}\big (\varepsilon ^{-(1 + \nu )}\big )\), where \(\nu := \frac{\log (1-2\beta )}{\log (2\beta )}\) and \(\beta > 0\) can be chosen sufficiently small. As \(\beta \) approaches zero, the complexity bound approaches the \({\mathcal {O}}\left( \frac{1}{\varepsilon }\right) \) bound of Frank–Wolfe methods in the L-smooth case.

To the best of our knowledge, this is the first work studying LMO-based methods for solving (1) with objectives whose gradients are not Lipschitz continuous on a general convex set \({\mathcal {X}}\). It also covers the models in [9, 34] as special cases, via a completely different approach.

Paper outline The rest of this paper is organized as follows. Section 2 recalls some basic notation and preliminaries of self-concordant functions. Section 3 presents the main algorithm. Section 4 proves the local linear convergence of the outer loop and gives a rigorous analysis of the total oracle complexity. Three numerical experiments are given in Sect. 5. Finally, we draw some conclusions in Sect. 6. For the sake of presentation, all the technical proofs are deferred to the “Appendix”.

2 Theoretical background

Basic notation We work with Euclidean spaces \({\mathbb {R}}^p\) and \({\mathbb {R}}^n\), equipped with the standard inner product \(\langle \cdot ,\cdot \rangle \) and the Euclidean norm \(\left\| \cdot \right\| \). For a given proper, closed, and convex function \(f : {\mathbb {R}}^p\rightarrow {\mathbb {R}}\cup \{+\infty \}\), \(\mathrm {dom}(f) := \left\{ {\mathbf {x}}\in {\mathbb {R}}^p \mid f({\mathbf {x}}) < +\infty \right\} \) denotes the domain of f, \(\partial {f}\) denotes the subdifferential of f, and \(f^{*}\) is its Fenchel conjugate. For a symmetric matrix \({\mathbf {A}}\in {\mathbb {R}}^{n\times n}\), \(\lambda _{\max }({\mathbf {A}})\) denotes the largest eigenvalue of \({\mathbf {A}}\). We use [k] to denote the set \(\left\{ 1,\cdots ,k\right\} \), and \({\mathbf {e}}\) to denote the vector whose entries are all 1. For a vector \({\mathbf {u}}\in {\mathbb {R}}^p\), \(\mathrm {Diag}({\mathbf {u}})\) is the \(p\times p\) diagonal matrix formed from \({\mathbf {u}}\). We also define two nonnegative and monotonically increasing functions \(\omega (\tau ) := \tau - \log (1 + \tau )\) for \(\tau \in [0, \infty )\) and \(\omega _{*}(\tau ) := -\tau - \log (1 - \tau )\) for \(\tau \in [0,1)\). We use \({\mathbf {H}}\succeq 0\) (resp., \({\mathbf {H}}\succ 0\)) to denote a symmetric positive semidefinite (resp., positive definite) matrix \({\mathbf {H}}\).

2.1 Self-concordant functions

Our class of objective functions in (1) is that of self-concordant functions. Hence, we recall here the definition of self-concordant functions introduced in [33].

Definition 1

A three times continuously differentiable univariate function \(\varphi : {\mathbb {R}}\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is said to be self-concordant with a parameter \(M_{\varphi } \ge 0\) if \(\vert \varphi '''(\tau )\vert \le M_{\varphi }\varphi ''(\tau )^{3/2}\) for all \(\tau \in \mathrm {dom}(\varphi )\). A three times continuously differentiable function \(f : {\mathbb {R}}^p\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is said to be self-concordant with a parameter \(M_f \ge 0\) if \(\varphi (\tau ) := f({\mathbf {x}}+ \tau {\mathbf {v}})\) is self-concordant with the same parameter \(M_f\) for any \({\mathbf {x}}\in \mathrm {dom}(f)\) and \({\mathbf {v}}\in {\mathbb {R}}^p\). If \(M_f = 2\), then we say that f is standard self-concordant.

Note that any self-concordant function f can be rescaled to the standard form as \({\hat{f}}(\cdot ) := (M_f^2/4)f(\cdot )\). When \(\mathrm {dom}(f)\) does not contain a straight line, \(\nabla ^2{f}({\mathbf {x}})\) is nondegenerate (i.e., positive definite) [32, Theorem 4.1.3], and therefore we can define a local norm associated with f, together with its dual norm, as follows:

$$\begin{aligned} \Vert {\mathbf {u}}\Vert _{{\mathbf {x}}} := \big ({\mathbf {u}}^{\top }\nabla ^2f({\mathbf {x}}){\mathbf {u}}\big )^{1/2} \qquad \text {and}\qquad \Vert {\mathbf {u}}\Vert _{{\mathbf {x}}}^{*} := \big ({\mathbf {u}}^{\top }\nabla ^2f({\mathbf {x}})^{-1}{\mathbf {u}}\big )^{1/2}. \end{aligned}$$

These weighted norms satisfy the Cauchy–Schwarz inequality \(\langle {\mathbf {u}}, {\mathbf {v}}\rangle \le \Vert {\mathbf {u}}\Vert _{\mathbf {x}}\Vert {\mathbf {v}}\Vert _{\mathbf {x}}^{*}\) for all \({\mathbf {u}},{\mathbf {v}}\in {\mathbb {R}}^p\).

The class of self-concordant functions is sufficiently broad to cover many important applications. It is closed under nonnegative combinations and affine transformations. All linear and convex quadratic functions are self-concordant. The univariate functions \(f_1(x) := -\log (x)\) and \(f_2(x) := x\log (x) - \log (x)\) are self-concordant. On the cone of symmetric positive definite matrices, \(f_3({\mathbf {X}}) := -\log \det ({\mathbf {X}})\) is also self-concordant; it is widely used in covariance estimation-type and experimental design problems. In statistical learning, the regularized logistic regression model with \(f_4({\mathbf {x}}) := \frac{1}{n}\sum _{i=1}^n\log (1 + \exp (-y_i{\mathbf {a}}_i^{\top }{\mathbf {x}})) + \frac{\mu _f}{2}\Vert {\mathbf {x}}\Vert ^2\) and the regularized Poisson regression model with \(f_5({\mathbf {x}}) := \frac{1}{n}\sum _{i=1}^n\left( y_i\exp (\frac{-{\mathbf {a}}_i^{\top }{\mathbf {x}}}{2}) + \exp (\frac{{\mathbf {a}}_i^{\top }{\mathbf {x}}}{2})\right) + \frac{\mu _f}{2}\Vert {\mathbf {x}}\Vert ^2\) are both self-concordant. Note that none of the functions introduced above is globally L-smooth on its domain, except \(f_4\). In addition, any three times continuously differentiable, strongly convex function with Lipschitz continuous Hessian is also self-concordant. We refer the reader to [35, 40] for more examples and theoretical results.
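As a quick check of Definition 1, consider \(f_1(x) = -\log (x)\) with \(\varphi = f_1\):

$$\begin{aligned} \varphi ''(\tau ) = \tau ^{-2}, \qquad \varphi '''(\tau ) = -2\tau ^{-3}, \qquad \vert \varphi '''(\tau )\vert = 2\big (\varphi ''(\tau )\big )^{3/2} \quad \text {for } \tau > 0, \end{aligned}$$

so \(f_1\) is standard self-concordant (\(M_{f_1} = 2\)), even though \(\varphi ''\) is unbounded as \(\tau \rightarrow 0^{+}\), i.e., \(f_1\) is not L-smooth on its domain.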

2.2 Approximate solutions

Since \(\nabla ^2 f({\mathbf {x}})\) is nondegenerate, f is strictly convex and (1) has a unique optimal solution \({\mathbf {x}}^{\star }\). Moreover, \(\nabla ^2{f}({\mathbf {x}}^{\star }) \succ 0\). Our goal is to design an algorithm that approximates \({\mathbf {x}}^{\star }\) in the following sense:

Definition 2

Given a tolerance \(\varepsilon > 0\), we say that \({\mathbf {x}}_{\varepsilon }^{\star }\) is an \(\varepsilon \)-solution of (1) if

$$\begin{aligned} \Vert {\mathbf {x}}_{\varepsilon }^{\star } - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \varepsilon . \end{aligned}$$
(3)

Unlike existing Frank–Wolfe methods, where an approximate solution \({\mathbf {x}}_{\varepsilon }^{\star }\) is defined by \(f({\mathbf {x}}_{\varepsilon }^{\star }) - f^{\star } \le \varepsilon \), we define it via the local norm at \({\mathbf {x}}^{\star }\). However, we show in Theorem 4 that these two concepts are related to each other.

3 The proposed Newton Frank–Wolfe algorithm

Since f in (1) is standard self-concordant, we first approximate it by a quadratic surrogate and apply a projected Newton method to solve (1). More precisely, given \({\mathbf {x}}\in \mathrm {dom}(f)\cap {\mathcal {X}}\), the projected Newton method computes a search direction at \({\mathbf {x}}\) by solving the following constrained convex quadratic program:

$$\begin{aligned} \begin{array}{ll} T({\mathbf {x}}) := \displaystyle \mathrm {arg}\!\displaystyle \min _{{\mathbf {u}}\in {\mathcal {X}}}\Big \{ Q_f({\mathbf {u}}; {\mathbf {x}}) := \left\langle \nabla f({\mathbf {x}}), {\mathbf {u}}- {\mathbf {x}}\right\rangle + \frac{1}{2}({\mathbf {u}}- {\mathbf {x}})^{\top }\nabla ^2f({\mathbf {x}})({\mathbf {u}}- {\mathbf {x}}) \Big \}. \end{array} \end{aligned}$$
(4)

Since \(\nabla ^2{f}({\mathbf {x}})\) is positive definite by Assumption 1, \(T({\mathbf {x}})\) is the unique solution of (4). However, this problem often does not admit a closed-form solution, and we need to approximate it up to a given accuracy. Since we aim at exploiting the LMO of \({\mathcal {X}}\), we apply a Frank–Wolfe scheme to solve (4). The optimality condition of (4) reads

$$\begin{aligned} \langle \nabla Q_f(T({\mathbf {x}});{\mathbf {x}}), T({\mathbf {x}}) - {\mathbf {u}}\rangle \le 0,~~\forall {\mathbf {u}}\in {\mathcal {X}}, \end{aligned}$$
(5)

where \(\nabla Q_f(T({\mathbf {x}});{\mathbf {x}}) = \nabla {f}({\mathbf {x}}) + \nabla ^2{f}({\mathbf {x}})(T({\mathbf {x}}) - {\mathbf {x}})\). Using this optimality condition, we can define an inexact solution of (4) as follows:

Definition 3

Given a tolerance \(\eta > 0\), we say that \(T_{\eta }({\mathbf {x}})\) is an \(\eta \)-solution of (4) if

$$\begin{aligned} \max _{{\mathbf {u}}\in {\mathcal {X}}}\langle \nabla Q_f(T_{\eta }({\mathbf {x}});{\mathbf {x}}), T_{\eta }({\mathbf {x}}) - {\mathbf {u}}\rangle \le \eta ^2. \end{aligned}$$
(6)

Note that the left-hand side of (6) is the standard Frank–Wolfe gap of the subproblem (4), which can be evaluated with a single LMO call since \(\max _{{\mathbf {u}}\in {\mathcal {X}}}\langle \nabla Q_f(T_{\eta }({\mathbf {x}});{\mathbf {x}}), T_{\eta }({\mathbf {x}}) - {\mathbf {u}}\rangle = \langle \nabla Q_f(T_{\eta }({\mathbf {x}});{\mathbf {x}}), T_{\eta }({\mathbf {x}}) - {\mathcal {L}}_{{\mathcal {X}}}(\nabla Q_f(T_{\eta }({\mathbf {x}});{\mathbf {x}}))\rangle \). The following lemma shows that the distance between \(T({\mathbf {x}})\) and \(T_{\eta }({\mathbf {x}})\) is bounded by \(\eta \), which justifies Definition 3.

Lemma 1

Let \(T_{\eta }({\mathbf {x}})\) be an \(\eta \)-solution defined by Definition 3 and \(T({\mathbf {x}})\) be the exact solution of (4). Then, it holds that \(\Vert T_{\eta }({\mathbf {x}}) - T({\mathbf {x}})\Vert _{{\mathbf {x}}} \le \eta \).

Proof

From Definition 3, we have \(\langle \nabla Q_f(T_{\eta }({\mathbf {x}}); {\mathbf {x}}), T_{\eta }({\mathbf {x}}) - T({\mathbf {x}})\rangle \le \eta ^2\). Since \(Q_f(\cdot ;{\mathbf {x}})\) is a convex quadratic function, it is easy to show that

$$\begin{aligned} \begin{array}{l} \langle \nabla Q_f(T({\mathbf {x}}); {\mathbf {x}}) + \nabla ^2 f({\mathbf {x}})(T_{\eta }({\mathbf {x}}) - T({\mathbf {x}})), T_{\eta }({\mathbf {x}}) - T({\mathbf {x}})\rangle \\ = \langle \nabla Q_f(T_{\eta }({\mathbf {x}});{\mathbf {x}}), T_{\eta }({\mathbf {x}}) - T({\mathbf {x}})\rangle \le \eta ^2. \end{array} \end{aligned}$$

Substituting \(T_{\eta }({\mathbf {x}})\) for \({\mathbf {u}}\) in the optimality condition (5), we obtain

$$\begin{aligned} \langle \nabla {Q}_f(T({\mathbf {x}}); {\mathbf {x}}), T_{\eta }({\mathbf {x}}) - T({\mathbf {x}})\rangle \ge 0. \end{aligned}$$

Combining the above two inequalities, we finally get

$$\begin{aligned} \langle \nabla ^2f({\mathbf {x}})(T_{\eta }({\mathbf {x}}) - T({\mathbf {x}})), T_{\eta }({\mathbf {x}}) - T({\mathbf {x}})\rangle \le \eta ^2, \end{aligned}$$

which is equivalent to \(\Vert T_{\eta }({\mathbf {x}}) - T({\mathbf {x}})\Vert _{{\mathbf {x}}} \le \eta \). \(\square \)

Now, we combine our inexact projected Newton scheme with the well-known Frank–Wolfe algorithm to develop a new method, presented in Algorithm 1.

Algorithm 1 (Newton Frank–Wolfe algorithm) and Algorithm 2 (Adaptive Frank–Wolfe subroutine).
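Since Algorithms 1 and 2 appear here only as figures, the following Python sketch reconstructs their main steps from remarks (a)–(g) below, equations (4)–(7), and Theorems 1–2. It is a minimal sketch under stated assumptions, not the authors' listing: the helper names (`lmo`, `grad_f`, `hess_mv`), the termination test, and the bookkeeping at the stage switch are our assumptions.

```python
import numpy as np

def afw_subroutine(h, H_mv, u0, tol, lmo, max_iter=100000):
    """Sketch of Algorithm 2: Frank-Wolfe on the quadratic model
    psi(u) = <h, u - u0> + (1/2)<H(u - u0), u - u0>, stopped once the
    Frank-Wolfe gap falls below tol (= eta^2), cf. criterion (6)."""
    u = u0.copy()
    for _ in range(max_iter):
        g = h + H_mv(u - u0)                 # gradient of the quadratic model
        v = lmo(g)                           # one LMO call, cf. (2)
        gap = g @ (u - v)                    # Frank-Wolfe gap
        if gap <= tol:
            break
        d = v - u
        u = u + min(1.0, gap / (d @ H_mv(d))) * d   # exact line search
    return u

def newton_frank_wolfe(grad_f, hess_mv, lmo, x0, beta, sigma, C, C1, delta,
                       h_inv_beta, eps):
    """Sketch of Algorithm 1: inexact projected Newton with a FW subsolver.
    Assumes eps < beta; see Theorems 1-3 for the roles of the parameters."""
    x, lam, eta, full_step = x0.copy(), beta, C1 * h_inv_beta, False
    while lam > eps:
        H_mv = lambda v, x=x: hess_mv(x, v)  # Hessian-vector operator at x^k
        z = afw_subroutine(grad_f(x), H_mv, x, eta**2, lmo)  # eta-solution of (4)
        d = z - x
        gamma = np.sqrt(d @ H_mv(d))         # gamma_k = ||z^k - x^k||_{x^k}
        if not full_step and gamma + eta <= h_inv_beta:
            full_step = True                 # now ||x^k - x*||_{x*} <= beta
            eta = beta / C                   # start eta_k = beta * sigma^k / C
        if full_step:                        # full step: alpha_k = 1
            x, lam, eta = z, sigma * lam, sigma * eta
        else:                                # damped step, alpha_k in (0, 1)
            alpha = delta * (gamma**2 - eta**2) / (gamma**3 + gamma**2 - eta**2 * gamma)
            x = x + alpha * d
    return x
```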

Let us make a few remarks on Algorithm 1.

(a) Discussion on structure. Algorithm 1 integrates both damped-step and full-step inexact projected Newton schemes. First, it performs the damped-step scheme to generate \(\{{\mathbf {x}}^k\}\) starting from an initial point \({\mathbf {x}}^0\) that may be far from the optimal solution \({\mathbf {x}}^{\star }\). Then, once \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \) is satisfied, it switches to the full-step scheme. For the damped-step stage, we will show later that Algorithm 1 only performs a finite number of iterations.

(b) Discussion on the Newton decrement \(\lambda _k\). Due to the update rule of \(\lambda _k\) in Algorithm 1, we have \(\lambda _k := \beta \sigma ^k\). As proved in (12) of Theorem 2 below, one has \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \lambda _k\) in the full-step stage. Since \(\lambda _k\) is decreased geometrically by a factor \(\sigma \in (0,1)\) as \(\lambda _k := \sigma \lambda _{k-1}\), \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\) converges linearly to zero (see Theorem 2). Notice that in the damped-step stage, we keep \(\lambda _k\) unchanged; therefore, \(\lambda _k\) does not upper bound \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\) in that stage.

(c) Discussion on the inner accuracy \(\eta _k\). The quantity \(\eta _k\) is used to measure \(\Vert T({\mathbf {x}}^k) - {\mathbf {z}}^k\Vert _{{\mathbf {x}}^k}\) (see (4) for the definition of \(T({\mathbf {x}}^k)\)). In Algorithm 1, \({\mathbf {z}}^k\) is calculated by Algorithm 2 as

    $$\begin{aligned} {\mathbf {z}}^k := \mathbf{Adaptive}\_Frank\_Wolfe\_Subroutine (\nabla f({\mathbf {x}}^k), \nabla ^2f({\mathbf {x}}^k)[\cdot ], {\mathbf {x}}^k,\eta _k^2). \end{aligned}$$
    (7)

From the stopping criterion of Algorithm 2, \({\mathbf {z}}^k\) is an \(\eta _k\)-approximate solution in the sense of Definition 3 at \({\mathbf {x}}= {\mathbf {x}}^k\), and it may not lie in \(\mathrm {dom}(f)\). According to Lemma 1, we have \(\Vert {\mathbf {z}}^k - T({\mathbf {x}}^k)\Vert _{{\mathbf {x}}^k} \le \eta _k\). Therefore, \(\eta _k\) measures the accuracy of solving the subproblem. In the damped-step stage, we keep \(\eta _k\) constant. In the full-step stage, \(\eta _k\) is decreased by a factor of \(\sigma \in (0,1)\) at each iteration to guarantee a more accurate projected Newton direction as the algorithm approaches the optimal solution \({\mathbf {x}}^{\star }\). The following lemma shows that our choice of step-size guarantees \({\mathbf {x}}^k\in \mathrm {dom}(f)\cap {\mathcal {X}}\), regardless of whether a full-step or a damped-step is taken.

Lemma 2

Let \(\{{\mathbf {x}}^{k}\}\) be generated by Algorithm 1. Then \(\{ {\mathbf {x}}^{k} \} \subset \mathrm {dom}(f)\cap {\mathcal {X}}\).

Proof

We prove this lemma by induction. Due to the initialization of Algorithm 1, we have \({\mathbf {x}}^0 \in \mathrm {dom}(f)\cap {\mathcal {X}}\). Assume that \({\mathbf {x}}^k\in \mathrm {dom}(f)\cap {\mathcal {X}}\) for some \(k\ge 0\). We now show that \({\mathbf {x}}^{k+1} \in \mathrm {dom}(f)\cap {\mathcal {X}}\). Since \({\mathcal {X}}\) is convex, \({\mathbf {x}}^{k+1} = (1-\alpha _k){\mathbf {x}}^k + \alpha _k{\mathbf {z}}^k\), \({\mathbf {x}}^k \in {\mathcal {X}}\), \({\mathbf {z}}^k \in {\mathcal {X}}\), and \(\alpha _k \in (0, 1]\), it is obvious that \({\mathbf {x}}^{k+1} \in {\mathcal {X}}\). We only need to show that \({\mathbf {x}}^{k+1} \in \mathrm {dom}(f)\). If we update \({\mathbf {x}}^{k+1}\) by the damped-step, then by Algorithm 1, we have

$$\begin{aligned} \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{k}\Vert _{{\mathbf {x}}^k} = \alpha _k\Vert {\mathbf {z}}^k - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k} = \alpha _k \gamma _k = \frac{\delta (\gamma _k^2 - \eta _k^2)}{\gamma _k^2 - \eta _k^2 + \gamma _k} < 1. \end{aligned}$$

Alternatively, if we update \({\mathbf {x}}^{k+1}\) by the full-step, then by (12) of Theorem 2 below, we have \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{k}\Vert _{{\mathbf {x}}^k} \le 2\beta \sigma ^k < 1\). In both cases, we have \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{k}\Vert _{{\mathbf {x}}^k} < 1\). Hence, by using [32, Theorem 4.1.5], we conclude that \({\mathbf {x}}^{k+1} \in \mathrm {dom}(f)\). Consequently, \({\mathbf {x}}^{k+1} \in \mathrm {dom}(f)\cap {\mathcal {X}}\). By induction, we have \(\{{\mathbf {x}}^k\} \subset \mathrm {dom}(f)\cap {\mathcal {X}}\). \(\square \)

(d) Discussion on the switching condition \(\gamma _k + \eta _k \le h^{-1}(\beta )\). When \(\gamma _k + \eta _k > h^{-1}(\beta )\), we use a damped-step scheme with the step-size

    $$\begin{aligned} \alpha _k := \frac{\delta (\gamma _k^2 - \eta _k^2)}{\gamma _k^3 + \gamma _k^2 - \eta _k^2\gamma _k}. \end{aligned}$$

This step-size is derived from Lemma 3 in the “Appendix” and lies in (0, 1). Once \(\gamma _k + \eta _k \le h^{-1}(\beta )\) is satisfied, we switch to the full-step stage and no longer use damped steps. In addition, from Lemma 4 in the “Appendix”, we can see that if \(\gamma _k + \eta _k \le h^{-1}(\beta )\), then \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \), which means that we have already found a good initial point for the full-step stage.

(e) Discussion on the Frank–Wolfe subroutine. The subroutine (7) is an adaptive Frank–Wolfe variant, customized to solve the following constrained convex quadratic program:

    $$\begin{aligned} \min _{{\mathbf {x}}\in {\mathcal {X}}} \big \{\psi ({\mathbf {x}}): = \langle {\mathbf {h}}, {\mathbf {x}}- {\mathbf {u}}^0\rangle + \tfrac{1}{2}\langle {\mathbf {H}}({\mathbf {x}}- {\mathbf {u}}^0), {\mathbf {x}}- {\mathbf {u}}^0\rangle \big \}. \end{aligned}$$

The step-size \(\tau _t\) in (7) is computed via the following exact linesearch condition (see [29] for further details; a closed form is derived after this list):

    $$\begin{aligned} \tau _t := \mathrm {arg}\!\displaystyle \min _{\alpha \in [0,1]} \left\{ \psi ({\mathbf {u}}^t + \alpha ({\mathbf {v}}^t - {\mathbf {u}}^t))\right\} . \end{aligned}$$
(f) Discussion on the Hessian evaluation \(\nabla ^2f(\cdot )\). In practice, we do not need to evaluate the full Hessian \(\nabla ^2{f}({\mathbf {x}}^k)\) at each iteration k; we only need the matrix-vector operator \({\mathbf {v}}\mapsto \nabla ^2{f}({\mathbf {x}}^k){\mathbf {v}}\) (see the sketch after this list). Similarly, the computation of \(\gamma _k\) does not incur significant cost: since \(\nabla ^2{f}({\mathbf {x}}^k){\mathbf {d}}^k\) has already been computed in (7), computing \(\gamma _k\) requires only one additional inner product \(\langle \nabla ^2{f}({\mathbf {x}}^k){\mathbf {d}}^k, {\mathbf {d}}^k\rangle \).

(g) Faster Frank–Wolfe variants. Since \({\mathcal {X}}\) in (1) is a general convex set, our analysis below is based on the standard Frank–Wolfe variant [26]. However, when possible (e.g., when \({\mathcal {X}}\) is a polytope or a strongly convex set [16]), we can replace this standard Frank–Wolfe subroutine by a faster variant. For instance, if \({\mathcal {X}}\) is a polytope, then we can use an away-step variant, which often achieves a linear convergence rate [28]. If \({\mathcal {X}}\) is strongly convex [16], then we can apply an accelerated variant, which can achieve up to an \({\mathcal {O}}\left( 1/T^2\right) \) convergence rate, where T is the inner iteration counter. In both cases, the LMO complexity stated in Theorems 3 and 4 still holds (up to a constant factor), or can even be improved.

4 Convergence and complexity analysis

Our analysis closely follows the outline below:

  • Given \(\beta \in (0,1)\), we show that we only need a finite number of damped-steps to reach \({\mathbf {x}}^k\) such that \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \). We call it the damped-step stage.

  • Once \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \) is satisfied, we prove a linear convergence of the full-step projected Newton scheme. We call this the full-step stage.

  • We finally estimate the overall LMO complexity of Algorithm 1.

4.1 Finite complexity of damped-step stage

Before we present the main theorem of this section, let us first define a univariate function \(h : {\mathbb {R}}_{+} \rightarrow {\mathbb {R}}_{+}\), whose shape is shown in Fig. 1, as

$$\begin{aligned} h(\tau ) := \frac{\tau (1 -2\tau + 2\tau ^2)}{(1 - 2\tau )(1 - \tau )^2 - \tau ^2}. \end{aligned}$$
(8)

As Fig. 1 shows, h is nonnegative and monotonically increasing on \([0, C_2)\), where \(C_2 \in (0.3, 0.4)\) is the constant such that \((1 - 2C_2)(1 - C_2)^2 - C_2^2 = 0\).

Fig. 1: The shape of h (left) and the feasible region of \((\beta ,\sigma )\) for (11) when \(C = 10\) (right).
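Algorithm 1 needs the scalar \(h^{-1}(\beta )\); since h is increasing on \([0, C_2)\) and \(h(0) = 0\), it can be inverted by one-dimensional root-finding. A minimal sketch (assuming SciPy; the bracket 0.35 lies inside \([0, C_2)\) since \(C_2 > 0.35\)):

```python
from scipy.optimize import brentq

def h(t):
    """The function h in (8); its denominator is positive on [0, C_2)."""
    return t * (1 - 2*t + 2*t**2) / ((1 - 2*t) * (1 - t)**2 - t**2)

def h_inv(beta):
    """Solve h(t) = beta on [0, 0.35] by bracketing; valid in particular for
    the small values beta in (0, 0.1] recommended in Sect. 4.3."""
    return brentq(lambda t: h(t) - beta, 0.0, 0.35)
```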

The following theorem states that Algorithm 1 only needs a finite number of LMO calls \({\mathcal {T}}_1\) to reach \({\mathbf {x}}^k\) such that \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \). Although \({\mathcal {T}}_1\) is independent of the tolerance \(\varepsilon \), it depends on the pre-defined constants \(\beta \) and \(C_1\) of the algorithm and on the structure of f and \({\mathcal {X}}\).

Theorem 1

Let \(\omega (\tau ) := \tau - \log (1 + \tau )\). If we choose the parameters as in Algorithm 1, then after at most

$$\begin{aligned} K : = \frac{f({\mathbf {x}}^0) - f({\mathbf {x}}^{\star })}{\delta \omega \left( \frac{1 - 2C_1}{1-C_1}h^{-1}(\beta )\right) } \end{aligned}$$
(9)

outer iterations of the damped-step scheme, we can guarantee that \(\gamma _k + \eta _k \le h^{-1}(\beta )\) for some \(k \in [K]\), which implies that \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \). Moreover, the total number of LMO calls is at most

$$\begin{aligned} {\mathcal {T}}_1 := \frac{6D_{{\mathcal {X}}}^2\lambda _{\max }(\nabla ^2 f({\mathbf {x}}^0))}{(C_1h^{-1}(\beta ))^2}\frac{1 - (1-\delta )^{2K+1}}{\delta (1-\delta )^{2K}}, \end{aligned}$$
(10)

where \(D_{{\mathcal {X}}} := \displaystyle \max _{{\mathbf {x}},{\mathbf {y}}\in {\mathcal {X}}}\Vert {\mathbf {x}}-{\mathbf {y}}\Vert \). The number of gradient \(\nabla {f}({\mathbf {x}}^k)\) and Hessian \(\nabla ^2{f}({\mathbf {x}}^k)\) evaluations is also at most K.

Proof

Notice that in Algorithm 1, we always have \(\eta _k \le C_1h^{-1}(\beta )\) in the damped-step stage, where \(C_1 \in (0,0.5)\). Clearly, if \(\gamma _k + \eta _k > h^{-1}(\beta )\), then \(\gamma _k > h^{-1}(\beta ) - \eta _k \ge (1 -C_1)h^{-1}(\beta )\). Therefore, we can show that

$$\begin{aligned} \frac{\gamma _k^2 - \eta _k^2}{\gamma _k} \ge \frac{((1 -C_1)h^{-1}(\beta ))^2 - (C_1h^{-1}(\beta ))^2}{(1-C_1)h^{-1}(\beta )} = \frac{1 - 2C_1}{1-C_1}h^{-1}(\beta ). \end{aligned}$$

Using Lemma 3 in the “Appendix” and the monotonicity of \(\omega \) we also have

$$\begin{aligned} f({\mathbf {x}}^{k+1}) \overset{(26)}{\le } f({\mathbf {x}}^k) - \delta \omega (\frac{\gamma _k^2 - \eta _k^2}{\gamma _k}) \le f({\mathbf {x}}^k) - \delta \omega (\frac{1 - 2C_1}{1-C_1}h^{-1}(\beta )). \end{aligned}$$

Consequently, we need at most \(K := \frac{f({\mathbf {x}}^0) - f({\mathbf {x}}^{\star })}{\delta \omega \left( \frac{1 - 2C_1}{1-C_1}h^{-1}(\beta )\right) } \) outer iterations to get \(\gamma _k + \eta _k \le h^{-1}(\beta )\) as stated in (9).

From Lemma 6 in the “Appendix”, the number of LMO calls needed at the k-th outer iteration is at most \(T_k := \frac{6\lambda _{\max }(\nabla ^2f({\mathbf {x}}^k))D_{{\mathcal {X}}}^2}{\eta _k^2}\). Since f is self-concordant, we have

$$\begin{aligned} \nabla ^2 f({\mathbf {x}}^{k+1}) \preceq \frac{\nabla ^2 f({\mathbf {x}}^k)}{(1 - \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k})^2} = \frac{\nabla ^2 f({\mathbf {x}}^k)}{(1 - \alpha _k\gamma _k)^2} \preceq \frac{\nabla ^2 f({\mathbf {x}}^k)}{(1 - \delta )^2}, \end{aligned}$$

which implies that \(\nabla ^2 f({\mathbf {x}}^k) \preceq \frac{\nabla ^2 f({\mathbf {x}}^0)}{(1 - \delta )^{2k}}\). Hence, the total number of LMO calls can be bounded as

$$\begin{aligned} \begin{array}{lcl} {\mathcal {T}}_1 &{} := &{} \sum _{k=0}^KT_k = 6D_{{\mathcal {X}}}^2\sum _{k = 0}^{K} \frac{\lambda _{\max }(\nabla ^2 f({\mathbf {x}}^k))}{\eta _k^2} \\ &{} \le &{} \frac{6D_{{\mathcal {X}}}^2}{(C_1h^{-1}(\beta ))^2}\sum _{k = 0}^{K} \frac{\lambda _{\max }(\nabla ^2 f({\mathbf {x}}^0))}{(1-\delta )^{2k}} \\ &{} \le &{}\frac{6D_{{\mathcal {X}}}^2}{(C_1h^{-1}(\beta ))^2}\sum _{k = 0}^{2K} \frac{\lambda _{\max }(\nabla ^2 f({\mathbf {x}}^0))}{(1-\delta )^{k}} \\ &{} = &{} \frac{6D_{{\mathcal {X}}}^2\lambda _{\max }(\nabla ^2 f({\mathbf {x}}^0))}{(C_1h^{-1}(\beta ))^2}\frac{1 - (1-\delta )^{2K+1}}{\delta (1-\delta )^{2K}}. \end{array} \end{aligned}$$

Finally, if \(\gamma _k + \eta _k \le h^{-1}(\beta )\), then we have \({\bar{\lambda }}_k := \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\overset{(30)}{\le } h(\gamma _k + \eta _k) \le h(h^{-1}(\beta )) = \beta \). \(\square \)

4.2 Linear convergence of full-step stage

Theorem 1 shows that we only need a finite number of damped steps to obtain \({\mathbf {x}}^k\) such that \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \). Therefore, without loss of generality, we assume that \(\Vert {\mathbf {x}}^0 - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \) in the rest of this paper. Under this assumption, we analyze the convergence rate of \(\{{\mathbf {x}}^k\}\) to the unique optimal solution \({\mathbf {x}}^{\star }\) of (1). In this case, Algorithm 1 always chooses full steps, i.e., \({\mathbf {x}}^{k+1} := {\mathbf {x}}^k + {\mathbf {d}}^k = {\mathbf {z}}^k\).

The following theorem states the linear convergence of \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }}\) and \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\). The convergence of \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k}\) will be used in Theorem 3 to bound \(\{\nabla ^2 f({\mathbf {x}}^k)\}\), which is key to our LMO complexity analysis.

Theorem 2

Suppose that \(\Vert {\mathbf {x}}^0 - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \) and the triple \((\sigma , \beta , C)\) satisfies:

$$\begin{aligned} \left\{ \begin{array}{l} \sigma \in (0,1), ~~\beta \in (0, 0.5), ~~C > 1, \\ \frac{1}{C(1-\beta )} + \frac{\beta }{(1-2\beta )(1-\beta )^2}\le \sigma , \\ \frac{1}{C} + \frac{1}{(1-2\beta )}\le 2. \end{array}\right. \end{aligned}$$
(11)

Let \(\eta _k := \frac{\beta \sigma ^k}{C}\) and \(\{{\mathbf {x}}^k\}\) be updated by the full-step stage in Algorithm 1. Then, for \(k\ge 0\), the following bounds hold:

$$\begin{aligned} \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \sigma ^k ~~\text {and}~~\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k} \le 2\beta \sigma ^k. \end{aligned}$$
(12)

Proof

We prove this theorem by induction. First, we have \(\Vert {\mathbf {x}}^0 - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \sigma ^0 = \beta < 1\) by assumption. Next, suppose that \({\bar{\lambda }}_k := \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \sigma ^k\) for some \(k \ge 0\). Then we can derive

$$\begin{aligned} \begin{array}{lcl} {\bar{\lambda }}_{k+1} &{}:= &{} \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \\ &{} \overset{(35)}{\le } &{} \frac{\eta _k}{1-{\bar{\lambda }}_k} + \frac{{\bar{\lambda }}_k^2}{(1 - {\bar{\lambda }}_k)^2(1-2{\bar{\lambda }}_k)} \\ &{}= &{} \frac{\beta \sigma ^k}{C(1-{\bar{\lambda }}_k)} + \frac{{\bar{\lambda }}_k^2}{(1 - {\bar{\lambda }}_k)^2(1-2{\bar{\lambda }}_k)} \\ &{}\le &{} \left( \frac{1}{C(1-{\bar{\lambda }}_k)} + \frac{{\bar{\lambda }}_k}{(1 - {\bar{\lambda }}_k)^2(1-2{\bar{\lambda }}_k)}\right) \beta \sigma ^k \qquad (\text {by induction})\\ &{}\le &{} \left( \frac{1}{C(1-\beta )} + \frac{\beta }{(1 - \beta )^2(1-2\beta )}\right) \beta \sigma ^k \qquad (\text {by induction})\\ &{}\overset{(11)}{\le } &{} \beta \sigma ^{k+1}, \end{array} \end{aligned}$$

which proves the first estimate of (12).

Similarly, we also have

$$\begin{aligned} \begin{array}{lcl} \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k} &{}\overset{(36)}{\le } &{} \eta _k + \frac{{\bar{\lambda }}_k^2}{(1-2{\bar{\lambda }}_k)(1-{\bar{\lambda }}_k)} + \frac{{\bar{\lambda }}_k}{1 - {\bar{\lambda }}_k} \\ &{}= &{} \frac{\beta \sigma ^k}{C} + \frac{{\bar{\lambda }}_k^2}{(1-2{\bar{\lambda }}_k)(1-{\bar{\lambda }}_k)} + \frac{{\bar{\lambda }}_k}{1 - {\bar{\lambda }}_k} \\ &{}\le &{} \left( \frac{1}{C} + \frac{{\bar{\lambda }}_k}{(1 - {\bar{\lambda }}_k)(1-2{\bar{\lambda }}_k)} + \frac{1}{1 - {\bar{\lambda }}_k} \right) \beta \sigma ^k \qquad (\text {by induction})\\ &{}\le &{} \left( \frac{1}{C} + \frac{\beta }{(1 - \beta )(1-2\beta )} + \frac{1}{1 - \beta } \right) \beta \sigma ^k \qquad (\text {by induction})\\ &{}= &{} \left( \frac{1}{C} + \frac{1}{(1 - 2\beta )} \right) \beta \sigma ^k \\ &{} \overset{(11)}{\le } &{} 2\beta \sigma ^{k}, \end{array} \end{aligned}$$

which proves the second estimate of (12). \(\square \)

Theorem 2 shows that \(\{{\mathbf {x}}^k\}\) converges linearly to \({\mathbf {x}}^{\star }\) with a contraction factor \(\sigma \in (0, 1)\) chosen according to (11). Figure 1 shows the feasible region of \((\beta , \sigma )\) for (11) when \(C = 10\). From this figure, we can see that (11) always holds once \(\beta \) is sufficiently small. Therefore, theoretically, we can let \(\beta \) be arbitrarily close to 0.

4.3 Overall LMO complexity analysis

This subsection analyzes the LMO complexity of Algorithm 1. We first show that Algorithm 1 needs \({\mathcal {O}}\big (\varepsilon ^{-2(1+\nu )}\big )\) LMO calls to reach an \(\varepsilon \)-solution defined by (3), where \(\nu := \frac{\log (1-2\beta )}{\log (\sigma )}\). Consequently, we show that it needs \({\mathcal {O}}\big (\varepsilon ^{-(1+\nu )}\big )\) LMO calls to find an \(\varepsilon \)-solution \({\mathbf {x}}^{\star }_{\varepsilon }\) such that \(f({\mathbf {x}}^{\star }_{\varepsilon }) - f^{\star } \le \varepsilon \).

Theorem 3

Suppose that \(\Vert {\mathbf {x}}^0 - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \). If we choose the parameters \(\beta \), \(\sigma \), C, and \(\{\eta _k\}\) as in Theorem 2 and update \(\{{\mathbf {x}}^k\}\) by the full-step stage, then to obtain an \(\varepsilon \)-solution \({\mathbf {x}}_{\varepsilon }^{\star } := {\mathbf {x}}^k\) defined by (3), it requires

$$\begin{aligned} \left\{ \begin{array}{ll} {\mathcal {O}}\left( \log (\varepsilon ^{-1})\right) ~~\text {gradient evaluations } \nabla {f}({\mathbf {x}}^k), \\ {\mathcal {O}}\left( \log (\varepsilon ^{-1})\right) ~~\text {Hessian evaluations } \nabla ^2{f}({\mathbf {x}}^k), \text { and} \\ {\mathcal {O}}\left( \varepsilon ^{-2(1+\nu )}\right) ~~\text {LMO calls, with } \nu := \frac{\log (1-2\beta )}{\log (\sigma )}. \end{array}\right. \end{aligned}$$

Proof

By self-concordance of f, using [32, Theorem 4.1.6], it holds that

$$\begin{aligned} \nabla ^2 f({\mathbf {x}}^{k+1}) \preceq \frac{\nabla ^2 f({\mathbf {x}}^k)}{(1 - \Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _{{\mathbf {x}}^k})^2} \overset{(12)}{\preceq } \frac{\nabla ^2 f({\mathbf {x}}^k)}{(1 - 2\beta \sigma ^k)^2} \preceq \frac{\nabla ^2 f({\mathbf {x}}^k)}{(1 - 2\beta )^2}. \end{aligned}$$

By induction, we have

$$\begin{aligned} \nabla ^2 f({\mathbf {x}}^k) \preceq \left( \frac{1}{1 - 2\beta }\right) ^{2k} \nabla ^2 f({\mathbf {x}}^0). \end{aligned}$$

Therefore, we can bound the maximum eigenvalue \(\lambda _{\max }(\nabla ^2{f}({\mathbf {x}}^k))\) of \(\nabla ^2{f}({\mathbf {x}}^k)\) as

$$\begin{aligned} \lambda _{\max }(\nabla ^2 f({\mathbf {x}}^k)) \le \left( \frac{1}{1 - 2\beta }\right) ^{2k} \lambda _{\max }(\nabla ^2 f({\mathbf {x}}^0)). \end{aligned}$$
(13)

Let \({\hat{\lambda }}_0 := \lambda _{\max }(\nabla ^2 f({\mathbf {x}}^0))\). Then, from Lemma 6 in the “Appendix”, the number of LMO calls at the k-th outer iteration is at most

$$\begin{aligned} {\mathcal {O}}_k := \frac{6\lambda _{\max }(\nabla ^2 f({\mathbf {x}}^k))D_{{\mathcal {X}}}^2}{\eta _k^2} \overset{(13)}{\le } \frac{6{\hat{\lambda }}_0D_{{\mathcal {X}}}^2}{(1 - 2\beta )^{2k}\eta _k^2} = \frac{6C^2{\hat{\lambda }}_0D_{{\mathcal {X}}}^2}{\beta ^2((1 - 2\beta )\sigma )^{2k}}, \end{aligned}$$
(14)

where the last equality holds because we set \(\eta _k := \beta \sigma ^k/C\) in Theorem 2.

To obtain an \(\varepsilon \)-solution \({\mathbf {x}}^k\) defined by (3), we need to impose \(\beta \sigma ^k \le \varepsilon \) (recall that \(\Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \sigma ^k\) by Theorem 2), which is equivalent to \(k \ge \frac{\log (\beta /\varepsilon )}{\log (1/\sigma )}\). Since \(\beta \in (0, 1)\), the outer iteration number is at most \(\frac{\log (1/\varepsilon )}{\log (1/\sigma )} = \frac{\log (\varepsilon )}{\log (\sigma )}\). This number is also the total number of gradient and Hessian evaluations.

Finally, by (14), the total number of LMO calls of Algorithm 1 is estimated as

$$\begin{aligned} \begin{array}{lcl} \sum _{k = 0}^{\frac{\log (\varepsilon )}{\log (\sigma )}} \frac{6C^2{\hat{\lambda }}_0 D_{{\mathcal {X}}}^2}{\beta ^2((1 - 2\beta )\sigma )^{2k}} &{}= &{} \frac{6C^2{\hat{\lambda }}_0 D_{{\mathcal {X}}}^2}{\beta ^2} \sum _{k = 0}^{\frac{\log (\varepsilon )}{\log (\sigma )}} \left( \frac{1}{(1 - 2\beta )\sigma }\right) ^{2k} \\ &{} \le &{} \frac{3C^2{\hat{\lambda }}_0 D_{{\mathcal {X}}}^2}{\beta ^3}\left( \frac{1}{(1 - 2\beta )\sigma }\right) ^{\frac{2\log (\varepsilon )}{\log (\sigma )}} \\ &{}= &{} \frac{3C^2{\hat{\lambda }}_0 D_{{\mathcal {X}}}^2}{\beta ^3}\left( \frac{1}{\varepsilon }\right) ^{2\left( 1 + \frac{\log (1-2\beta )}{\log (\sigma )}\right) }, \end{array} \end{aligned}$$

where the last equality holds since \(\tau ^{\alpha \log (s)} = s^{\alpha \log (\tau )}\). \(\square \)

From Theorem 3, we observe that a small value of \(\beta \) gives a better oracle complexity bound, but increases the number of oracle calls in the damped-step stage. Hence, we need to trade off between the damped-step stage and the full-step stage. In practice, we do not recommend choosing an extremely small \(\beta \), but rather a value in the range [0.01, 0.1].

Finally, the following theorem states the LMO complexity of Algorithm 1 in terms of the objective residual.

Theorem 4

Suppose that \(\Vert {\mathbf {x}}^0 - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \). If we choose \(\sigma \), \(\beta \), C, and \(\{\eta _k\}\) as in Theorem 2 and update \(\{{\mathbf {x}}^k\}\) by the full-step stage, then we have

$$\begin{aligned} f({\mathbf {x}}^{k+1}) - f({\mathbf {x}}^{\star }) \le \left( \frac{12\beta ^3}{1-2\beta } + \frac{\beta ^2}{C^2} + \beta ^2\right) \sigma ^{2k}. \end{aligned}$$

Consequently, the total LMO complexity of Algorithm 1 to achieve an \(\varepsilon \)-solution \({\mathbf {x}}_{\varepsilon }^{\star } := {\mathbf {x}}^k\) such that \(f({\mathbf {x}}_{\varepsilon }^{\star }) - f^{\star } \le \varepsilon \) is \({\mathcal {O}}\left( \varepsilon ^{-(1+\nu )}\right) \), where \(\nu := \frac{\log (1-2\beta )}{\log (\sigma )}\).

Proof

It is easy to check that \(\omega _{*}(\tau ) \le \tau ^2\) for \(0< \tau < 0.5\). Therefore, \(\omega _{*}(\beta \sigma ^k) \le (\beta \sigma ^k)^2\) for \(k \ge 0\). Since \(\eta _k := \frac{\beta \sigma ^k}{C}\), \(\gamma _k \le 2\beta \sigma ^k\), and \({\bar{\lambda }}_k := \Vert {\mathbf {x}}^k - {\mathbf {x}}^{\star }\Vert _{{\mathbf {x}}^{\star }} \le \beta \sigma ^k\) by Theorem 2, for \(k\ge 0\), we have

$$\begin{aligned} \begin{array}{ll} f({\mathbf {x}}^{k+1}) - f({\mathbf {x}}^{\star }) &{}\overset{(44)}{\le } \frac{\gamma _k^2(\gamma _k + {\bar{\lambda }}_k)}{1 - \gamma _k} + \eta _k^2 + \omega _{*}({\bar{\lambda }}_{k+1}) \\ &{} \overset{(12)}{\le } \frac{12\beta ^3\sigma ^{3k}}{1-2\beta \sigma ^k} + \frac{\beta ^2\sigma ^{2k}}{C^2} + \omega _{*}(\beta \sigma ^{k+1}) \\ &{}\le \left( \frac{12\beta ^3\sigma ^{k}}{1-2\beta \sigma ^k} + \frac{\beta ^2}{C^2} +\beta ^2\sigma ^2\right) \sigma ^{2k} \\ &{}\le \left( \frac{12\beta ^3}{1-2\beta } + \frac{\beta ^2}{C^2} +\beta ^2\right) \sigma ^{2k}. \end{array} \end{aligned}$$
(15)

Let \(C_1 > \frac{12\beta ^3}{1-2\beta } + \frac{\beta ^2}{C^2} +\beta ^2\) be a constant. To guarantee \(f({\mathbf {x}}^{k+1}) - f({\mathbf {x}}^{\star }) \le \varepsilon \), we impose \(C_1\sigma ^{2k} \le \varepsilon \), i.e., \(k \ge \frac{\log (\varepsilon /C_1)}{2\log (\sigma )}\). Therefore, the number of outer iterations is at most \(\frac{\log (\varepsilon /C_1)}{2\log (\sigma )} + 1\). Using (14), the total number of LMO calls is

$$\begin{aligned} \begin{array}{lclcl} {\mathcal {T}}_2 &{} := &{} \sum _{k = 0}^{\frac{\log (\varepsilon /C_1)}{2\log (\sigma )} + 1} \frac{6C^2\lambda _{\max }(\nabla ^2 f({\mathbf {x}}^0))D_{{\mathcal {X}}}^2}{\beta ^2((1 - 2\beta )\sigma )^{2k}} &{}= &{} {\mathcal {O}}\left( \sum _{k = 0}^{\frac{\log (\varepsilon /C_1)}{2\log (\sigma )} + 1} \left( \frac{1}{(1 - 2\beta )\sigma }\right) ^{2k}\right) \\ &{} = &{} {\mathcal {O}}\left( \left( \frac{1}{(1 - 2\beta )\sigma }\right) ^{\frac{\log (\varepsilon /C_1)}{\log (\sigma )}}\right) &{}= &{} {\mathcal {O}}\left( \left( \frac{1}{\varepsilon }\right) ^{\frac{\log ((1-2\beta )\sigma )}{\log (\sigma )}}\right) , \end{array} \end{aligned}$$
(16)

where the last equality follows from the fact that \(\tau ^{\alpha \log (s)} = s^{\alpha \log (\tau )}\). \(\square \)

4.4 Trade-off between the damped-step and full-step stages

Fix \(\beta \in (0, 0.1]\) and let us choose

$$\begin{aligned} \sigma = 2\beta \quad \text {and}\quad C = \frac{1}{\sigma }\frac{2(1-2\beta )(1-\beta )}{2(1-2\beta )(1-\beta )^2 - 1} > 0. \end{aligned}$$

It is easy to verify that (11) still holds with this choice. The overall complexity in Theorem 4 becomes \({\mathcal {O}}\left( \varepsilon ^{-(1+\nu )}\right) = {\mathcal {O}}\left( \varepsilon ^{-(1 + \frac{\log (1-2\beta )}{\log (2\beta )})}\right) \). Here, since \(\beta \in (0, 0.1]\), we have \(\nu := \frac{\log (1 - 2\beta )}{\log (2\beta )} \le 0.139\). As a concrete example, if we choose \(\beta := 0.05\), then the conditions (11) of Theorem 2 hold with \((C, \sigma ) = (27.3814, 0.1)\). In this case, \(\nu := \frac{\log (1 - 2\beta )}{\log (\sigma )} = 0.0458\), which is very close to zero.
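These parameter choices are easy to verify numerically. A small check (our own illustration; a tiny tolerance absorbs round-off, since the choice of C above makes the second condition in (11) tight at \(\sigma = 2\beta \)):

```python
import math

def conditions_11(beta, sigma, C, tol=1e-12):
    """Check the three conditions in (11) for a triple (sigma, beta, C)."""
    return (0 < sigma < 1 and 0 < beta < 0.5 and C > 1
            and 1/(C*(1 - beta)) + beta/((1 - 2*beta)*(1 - beta)**2) <= sigma + tol
            and 1/C + 1/(1 - 2*beta) <= 2 + tol)

beta = 0.05
sigma = 2 * beta
C = (1/sigma) * 2*(1 - 2*beta)*(1 - beta) / (2*(1 - 2*beta)*(1 - beta)**2 - 1)
nu = math.log(1 - 2*beta) / math.log(sigma)
print(conditions_11(beta, sigma, C), round(nu, 4))  # -> True 0.0458 (C approx. 27.38)
```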

Now we show that the LMO complexity \({\mathcal {T}}_2\) in (16) of the full-step stage dominates the LMO complexity \({\mathcal {T}}_1\) in (10) of the damped-step stage. Let us choose \(\delta := \varepsilon \). Then, the step-size of the damped-step stage becomes \(\alpha _k = \frac{\varepsilon (\gamma _k^2 - \eta _k^2)}{\gamma _k^3 + \gamma _k^2 - \eta _k^2\gamma _k}\), which is proportional to \(\varepsilon \). In this case, the number of iterations K of the damped-step stage in Theorem 1 is

$$\begin{aligned} K = \frac{R}{\varepsilon } = {\mathcal {O}}\left( \frac{1}{\varepsilon }\right) , \quad \text {where}~R := \frac{f({\mathbf {x}}^0) - f({\mathbf {x}}^{\star })}{\omega \big (\frac{1 - 2C_1}{1-C_1}h^{-1}(\beta )\big )}~\text {is a fixed constant}. \end{aligned}$$

Moreover, for a sufficiently small \(\varepsilon \), we have \((1-\delta )^{2K} = (1-\varepsilon )^{\frac{2R}{\varepsilon }} = \Theta \left( \frac{1}{e^{2R}}\right) \). Hence, by Theorem 1, the total LMO calls of the damped-step stage can be bounded by

$$\begin{aligned} {\mathcal {T}}_1 := {\mathcal {O}}\left( \frac{1}{\delta (1-\delta )^{2K}}\right) = {\mathcal {O}}\left( \frac{e^{2R}}{\varepsilon }\right) = {\mathcal {O}}\left( \frac{1}{\varepsilon }\right) . \end{aligned}$$

Therefore, the LMO complexity \({\mathcal {T}}_2 := {\mathcal {O}}\left( \varepsilon ^{-(1+\nu )}\right) \) of the full-step stage dominates the complexity \({\mathcal {T}}_1 = {\mathcal {O}}\left( \varepsilon ^{-1}\right) \) of the damped-step stage. Overall, the total complexity of Algorithm 1 is \({\mathcal {O}}\left( \varepsilon ^{-(1+\nu )}\right) \), as stated in Theorem 4.

5 Numerical experiments

We provide three numerical examples to illustrate the performance of Algorithm 1. We emphasize that the objective function f in the first two examples is not globally L-smooth; hence, existing Frank–Wolfe and projected gradient-based methods may not have theoretical guarantees. In the following experiments, we implement Algorithm 1 and its competitors in Matlab running on a Linux desktop with a 3.6GHz Intel Core i7-7700 and 16GB of memory. Our code is available at https://github.com/unc-optimization/FWPN.

5.1 Portfolio optimization

Consider the following portfolio optimization model studied in [40, Section 6.4]:

$$\begin{aligned} \left\{ \begin{array}{l} {\displaystyle \min _{{\mathbf {x}}\in {\mathbb {R}}^p}}\Big \{ f({\mathbf {x}}) := -\sum _{i=1}^n\log ({\mathbf {a}}_i^{\top }{\mathbf {x}}) \Big \} \\ \text {s.t.}~~ \sum _{i=1}^p{\mathbf {x}}_i = 1, ~{\mathbf {x}}\ge 0, \end{array}\right. \end{aligned}$$
(17)

where \({\mathbf {a}}_i \in {\mathbb {R}}^p\) for \(i=1,\cdots , n\). Let \({\mathbf {A}}:= [{\mathbf {a}}_1,\cdots ,{\mathbf {a}}_n]^{\top } \in {\mathbb {R}}^{n\times p}\). In this portfolio optimization model, \({\mathbf {A}}_{ij}\) represents the return of stock j in scenario i, and \(\log (\cdot )\) is the utility function. The goal is to allocate assets across stocks so as to maximize the expected return.

We implement Algorithm 1, abbreviated FWPN, to solve (17). We also implement the standard projected Newton method, which uses an accelerated projected gradient method to compute the search direction, the Frank–Wolfe algorithm [14] and its linesearch variant [26], the projected gradient method with Barzilai–Borwein step-size [1, 38], the nonmonotone spectral projected gradient method [6], and the inexact variable metric method [18]. We denote these algorithms by PN, FW, FW-LS, PG-BB, nSPG, and IVM, respectively. For PN and PG-BB, we use the algorithm in [7] to compute the projection onto the simplex.

We test these algorithms on both synthetic and real data. For the real data, we download three US stock datasets from http://www.excelclout.com/historical-stock-prices-in-excel/, named real1, real2, and real3. We generate the synthetic datasets as follows. We generate a matrix \({\mathbf {A}}\) as \({\mathbf {A}}:= \text {ones}(n,p) + {\mathcal {N}}(0,0.1)\), which allows each stock to vary by about 10% across scenarios. We test six examples, where \((n,p)=(7\times 10^2,10^3)\), \((10^3,10^3)\), \((5\times 10^3,10^4)\), \((10^4,10^3)\), \((10^4,10^4)\), and \((10^5,10^3)\), respectively; we call these datasets syn1, syn2, syn3, syn4, syn5, and syn6. The results and the performance of these algorithms are shown in Fig. 2.

Fig. 2: A comparison between 7 methods for solving problem (17) on 9 datasets.

From Fig. 2, one can observe that our algorithm, FWPN, clearly outperforms the other competitors on both real and synthetic datasets. In our algorithm, we use a Frank–Wolfe method with away-steps to solve the simplex-constrained quadratic subproblem, which converges linearly as proved in [28]. As we can see from Fig. 2, PG-BB, nSPG, and PN work relatively well compared to the other candidates on the real datasets. nSPG works quite well when the data is well-conditioned (see the plots for the syn4 and syn6 datasets) but performs poorly when the condition number is large (see the plots for the syn3 and syn5 datasets). Also notice that the IVM method is slightly worse than our FWPN method in most cases. In fact, both methods share the same subproblem, and we apply the same subsolver to both; however, due to different step-size strategies, their performance is not identical. As expected, the standard FW method and its linesearch variant cannot reach a highly accurate solution.

5.2 D-optimal experimental design

Our second example in this section is the following convex optimization model in D-optimal experimental design:

$$\begin{aligned} \left\{ \begin{array}{ll} {\displaystyle \min _{{\mathbf {x}}\in {\mathbb {R}}^p}} &{}\Big \{ f({\mathbf {x}}) := -\log \det ({\mathbf {A}}{\mathbf {X}}{\mathbf {A}}^{\top }) \Big \} \\ \text {s.t.} &{}\sum _{j=1}^p{\mathbf {x}}_j = 1, ~{\mathbf {x}}\ge 0, \end{array}\right. \end{aligned}$$
(18)

where \({\mathbf {A}}:= [{\mathbf {a}}_1,\cdots ,{\mathbf {a}}_p] \in {\mathbb {R}}^{n\times p}\), \({\mathbf {X}}:= \mathrm {Diag}({\mathbf {x}})\), and \({\mathbf {a}}_i \in {\mathbb {R}}^n\) for \(i = 1,\cdots ,p\). It is well-known that the dual problem of (18) is the minimum-volume enclosing ellipsoid (MVEE) problem:

$$\begin{aligned} \left\{ \begin{array}{ll} {\displaystyle \min _{{\mathbf {H}}\succeq 0}} &{}\Big \{ g({\mathbf {H}}) := -\log \det ({\mathbf {H}}) \Big \} \\ \text {s.t.} &{}{\mathbf {a}}_i^{\top }{\mathbf {H}}{\mathbf {a}}_i \le n, ~~i = 1,\cdots ,p. \end{array}\right. \end{aligned}$$
(19)

The objective of this problem is to find the minimum-volume ellipsoid that covers the points \({\mathbf {a}}_1, \cdots , {\mathbf {a}}_p \in {\mathbb {R}}^n\). The datasets \(\left\{ {\mathbf {a}}_i\right\} _{i=1}^{p}\) are generated from a multivariate Gaussian distribution \({\mathcal {N}}({\mathbf {0}},{\varvec{\Sigma }})\) as in [9]. To solve (18), one state-of-the-art solver is the Frank–Wolfe algorithm with away-steps [28]. We observe that the linesearch problem for computing the optimal step-size \(\tau \):

$$\begin{aligned} \min _{\tau \in [0,1]} f((1 - \tau ){\mathbf {x}}+ \tau {\mathbf {e}}_j) \end{aligned}$$

has a closed-form solution (see [27] for more details). Therefore, we do not have to carry out a linesearch at each iteration of the Frank–Wolfe algorithm.
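For completeness, the closed form can be recovered from the matrix determinant lemma (our own short derivation). With \(\kappa _j := {\mathbf {a}}_j^{\top }({\mathbf {A}}{\mathbf {X}}{\mathbf {A}}^{\top })^{-1}{\mathbf {a}}_j\), one has

$$\begin{aligned} f((1 - \tau ){\mathbf {x}}+ \tau {\mathbf {e}}_j) = f({\mathbf {x}}) - (n-1)\log (1-\tau ) - \log \big (1 + \tau (\kappa _j - 1)\big ), \end{aligned}$$

and setting the derivative with respect to \(\tau \) to zero yields \(\tau ^{\star } = \frac{\kappa _j - n}{n(\kappa _j - 1)}\), clipped to [0, 1] if necessary.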

Recently, [9] showed that the Frank–Wolfe algorithm with away-steps has a linear convergence rate for this specific problem. Figure 3 shows the performance of our algorithm (FWPN), the Frank–Wolfe algorithm with away-steps, the nonmonotone spectral projected gradient method (nSPG) [6], and the inexact variable metric method (IVM) [18] on three datasets, where the dimension n varies from 100 to 5,000. Note that the existing literature only tested problems with \(n \le 500\). As far as we are aware, this is the first attempt to solve problem (18) with n up to 5,000.

Fig. 3: A comparison between 4 algorithms for solving (18) on three datasets.

From Fig. 3, our method outperforms the other three competitors on both large and small datasets, including IVM. Figure 3 also shows that when the problem size is small, our algorithm is slightly better than the Frank–Wolfe method with away-steps. However, when the problem size becomes large, our algorithm significantly outperforms the Frank–Wolfe method in terms of computational time. This is due to the small number of projected Newton steps, while each inner iteration requires only a small computational cost.

5.3 Logistic regression with elastic-net regularizer

Finally, let us consider the following logistic regression problem with an elastic-net regularizer:

$$\begin{aligned} \min _{{\mathbf {x}}\in {\mathbb {R}}^p} \Big \{ \frac{1}{n}{\mathbf {e}}^{\top }\log ({\mathbf {e}}+ \exp ({\mathbf {A}}^{\top }{\mathbf {x}})) + \frac{\mu }{2}\Vert {\mathbf {x}}\Vert ^2 + \rho \Vert {\mathbf {x}}\Vert _1 \Big \}, \end{aligned}$$
(20)

where \({\mathbf {e}}:= (1,1,\cdots , 1)^{\top }\in {\mathbb {R}}^n\), \({\mathbf {A}}:= [-y_1{\mathbf {a}}_1,\cdots ,-y_n{\mathbf {a}}_n] \in {\mathbb {R}}^{p\times n}\), and \(({\mathbf {a}}_i, y_i) \in {\mathbb {R}}^p\times \left\{ -1,1\right\} \) for \(i = 1, \cdots ,n\).

It is well known that (20) is the Lagrangian formulation of the following constrained problem for a suitable penalty parameter \(\rho > 0\) [24, Section 3.4.2]. Although [24] only considers the standard linear regression problem, it is straightforward to extend this to logistic regression of the form:

$$\begin{aligned} \left\{ \begin{array}{ll} \displaystyle \min _{{\mathbf {x}}\in {\mathbb {R}}^p} &{}\Big \{ f({\mathbf {x}}) : = \frac{1}{n}{\mathbf {e}}^{\top }\log ({\mathbf {e}}+ \exp ({\mathbf {A}}^{\top }{\mathbf {x}})) + \frac{\mu }{2}\Vert {\mathbf {x}}\Vert ^2 \Big \} \\ \text {s.t.} &{}\Vert {\mathbf {x}}\Vert _1 \le \rho _1. \end{array}\right. \end{aligned}$$
(21)

It has been shown in [40] that \(f({\mathbf {x}}) := \frac{1}{n}{\mathbf {e}}^{\top }\log ({\mathbf {e}}+ \exp ({\mathbf {A}}^{\top }{\mathbf {x}})) + \frac{\mu }{2}\Vert {\mathbf {x}}\Vert ^2\) is self-concordant. Therefore, (21) fits into our template (1) with \({\mathcal {X}}:= \left\{ {\mathbf {x}}\in {\mathbb {R}}^p \mid \Vert {\mathbf {x}}\Vert _1 \le \rho _1\right\} \).
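Since \({\mathcal {X}}\) is the \(\ell _1\)-ball of radius \(\rho _1\), its LMO also has a closed form: a linear function attains its minimum over the ball at one of the 2p vertices \(\pm \rho _1{\mathbf {e}}_i\). A minimal sketch (the function name is ours):

```python
import numpy as np

def lmo_l1_ball(s, rho1):
    """LMO over {x : ||x||_1 <= rho1}: <s, .> is minimized at the vertex
    -rho1 * sign(s_i) * e_i, where i = argmax_j |s_j|."""
    i = np.argmax(np.abs(s))
    v = np.zeros_like(s, dtype=float)
    v[i] = -rho1 * np.sign(s[i])
    return v
```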

For this example, the objective function f is also L-smooth and strongly convex. Hence, we can compare Algorithm 1 (FWPN) with the standard proximal-gradient method [4], the accelerated proximal-gradient method with linesearch and restart [5, 39], the nonmonotone spectral projected gradient method [6], and the inexact variable metric method [18]. These methods are abbreviated PG, APG-LSRS, nSPG, and IVM, respectively. We use the binary classification datasets a1a, a9a, w1a, w8a, covtype, news20, and real-sim from [8], and we generate the datasets mnist17 and mnist38 from the mnist dataset, where the digits are chosen from \(\left\{ 1,7\right\} \) and \(\left\{ 3,8\right\} \), respectively. We set \(\mu := \frac{1}{n}\) as in [10], and \(\rho _1 := 10\), which keeps the sparsity of the solution between \(1\%\) and \(10\%\).

Since PG and APG-LSRS require a projection onto the \(\ell _1\)-norm ball at each iteration, we use the algorithm of [12], which only needs \({\mathcal {O}}(p)\) time. For our algorithm, since the \(\ell _1\)-norm ball is a polytope, we can solve the subproblem at a linear rate by using the Frank–Wolfe algorithm with away-steps from [28]. The performance of these algorithms on the above datasets is presented in Fig. 4 in terms of objective residuals against CPU time.

Fig. 4: A comparison between 5 methods for solving (21) on 9 different datasets.

From Fig. 4, one can observe that our algorithm outperforms PG, APG-LSRS, and nSPG on all datasets, thanks to the low computational cost of the linear minimization oracle and the linear convergence of the FW method with away-steps. We also notice that our algorithm is still better than IVM on most datasets, except mnist38. It is interesting that although our algorithm is a hybrid of second-order and first-order methods, it can still solve high-dimensional problems (e.g., \(p = 1{,}355{,}191\) for the news20 dataset), as is typical of first-order methods. We gain this efficiency from the use of Hessian-vector products instead of full Hessian evaluations.

6 Conclusions

In this paper, we have combined the well-known Frank–Wolfe scheme (a variant of the conditional gradient method) with an inexact projected Newton (second-order) method to develop a novel hybrid algorithm for solving a class of constrained convex optimization problems with self-concordant objective functions. Our approach differs from existing methods that rely heavily on the L-smoothness assumption. Under this new setting, we have derived the first rigorous convergence and complexity analysis for the proposed method. Remarkably, the LMO complexity of our algorithm remains comparable to that of Frank–Wolfe algorithms for a different class of problems. In addition, our algorithm enjoys several computational advantages on specific problems, as supported by the three numerical examples in Sect. 5. Moreover, the last example shows that our algorithm still outperforms first-order methods on large-scale instances. Our findings suggest that it is sometimes worthwhile to carefully combine first-order and second-order methods for solving large-scale problems in non-standard settings.