1 Introduction

In minimizing a univariate function f with an iteration \(x_{k+1} = x_k - f'(x_k)/g(x_k)\), possibilities for g include

$$\begin{aligned}&\text {gradient:}&g(x_k)&= 1, \\&\text {secant:}&g(x_k)&= \frac{f'(x_k) - f'(x_{k-1})}{x_k - x_{k-1}}, \\&\text {Newton:}&g(x_k)&= f''(x_k), \\&\text {Steffensen:} \qquad&g(x_k)&= \frac{f'(x_k + f'(x_k)) - f'(x_k)}{f'(x_k)} , \end{aligned}$$

with different orders of convergence q, i.e., \(|x_{k+1} - x^* |\le c|x_k - x^* |^q\). Gradient descent has \(q =1\), secant method \(q = (1+\sqrt{5})/2\), Newton and Steffensen methods both have \(q = 2\).
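
To make the comparison concrete, here is a minimal numerical sketch of the four updates; the toy objective \(f(x) = x^2/4 + x^3/3\) (so \(f'(x) = x/2 + x^2\), \(f''(x) = 1/2 + 2x\), and \(x^* = 0\)), the starting points, and the iteration count are assumptions chosen purely for illustration.

```python
# Toy objective (an assumption for illustration): f(x) = x^2/4 + x^3/3,
# with f'(x) = x/2 + x^2, f''(x) = 1/2 + 2x, and minimizer x* = 0.
fp  = lambda x: 0.5 * x + x**2
fpp = lambda x: 0.5 + 2 * x

def gradient(x, x_prev):    # g(x_k) = 1
    return x - fp(x)

def secant(x, x_prev):      # g(x_k) = [f'(x_k) - f'(x_{k-1})] / (x_k - x_{k-1})
    return x - fp(x) * (x - x_prev) / (fp(x) - fp(x_prev))

def newton(x, x_prev):      # g(x_k) = f''(x_k)
    return x - fp(x) / fpp(x)

def steffensen(x, x_prev):  # g(x_k) = [f'(x_k + f'(x_k)) - f'(x_k)] / f'(x_k)
    return x - fp(x)**2 / (fp(x + fp(x)) - fp(x))

for name, step in [("gradient", gradient), ("secant", secant),
                   ("Newton", newton), ("Steffensen", steffensen)]:
    x_prev, x = 0.5, 0.4
    for _ in range(6):
        x_prev, x = x, step(x, x_prev)
    print(f"{name:10s} |x_6 - x*| = {abs(x):.2e}")
```

After six iterations the printed errors reflect the orders above: the gradient iterate is still around \(10^{-3}\) while the Newton and Steffensen iterates are many orders of magnitude closer to \(x^*\).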

Steffensen method [1, 2] is a surprise. It achieves quadratic convergence without requiring second derivatives (unlike Newton) and without resorting to multiple steps (unlike secant). In other words, the kth Steffensen iterate depends only on \(x_k\) and not on \(x_{k-1}, x_{k-2}\), etc.

Nevertheless, while the other three methods have widely used multivariate generalizations (the secant method has several: quasi-Newton methods, the Barzilai–Borwein step size, etc.), all existing multivariate generalizations of Steffensen method [3,4,5,6,7,8,9,10,11,12,13,14,15] involve multivariate divided differences that require \(O(n^2)\) function evaluations and are no less expensive than using the full Hessian. Furthermore, these multivariate generalizations are no longer one-step methods. As a result they have not found widespread use.

Our contributions are as follows:

(i) We show that by incorporating a carefully chosen step size parameter the convergence of Steffensen method may be further improved beyond quadratic to \(q = 1 + \sqrt{2}\).

(ii) We extend Steffensen method to a multivariate setting as an adaptive learning rate, avoiding divided differences, requiring just two gradient evaluations, and remaining a one-step method.

(iii) We show that when used in a randomized setting, our methods outperform SGD, SVRG, and SLBFGS on a variety of standard machine learning tasks on real data sets.

The performance in (iii) is measured in actual running time. But aside from speed, our methods have two advantages over SLBFGS, which has become a gold standard in machine learning:

(a) Quasi-Newton methods may involve matrix–vector products, i.e., a two-loop recursion with \(O(d^2)\) computation. Although deterministic LBFGS does not form matrix–vector products explicitly, stochastic LBFGS does. Our multivariate Steffensen methods, whether deterministic or stochastic, are free of such products.

(b) Quasi-Newton methods come in two flavors: Hessian or inverse Hessian updates. The latter seems a no-brainer as it avoids matrix inversion, but this is a fallacy. It is common knowledge among practitioners [16, Section 4.5.2.2] that the inverse Hessian version often conceals an ill-conditioned approximate Hessian; one should instead update the Cholesky factors of the approximate Hessian in order to detect ill-conditioning. By its design, LBFGS inevitably uses the inverse Hessian version. Our multivariate Steffensen methods are not quasi-Newton methods and do not involve approximate Hessians, avoiding this issue entirely.

A theoretical counterpart to (iii) is that, when measured by the number of stochastic gradient evaluations, we have the following complexity estimates:

$$\begin{aligned}&\text{ SSM/SSBB: } \qquad {}&{} O\bigl ((n + \kappa ^2)\log (1/\varepsilon )\bigr ),\\ {}&\text{ SVRG--BB: }{}&{} O\bigl ((n + \kappa ^3) \log (1/\varepsilon )\bigr ),\\ {}&\text{ SLBFGS: }{}&{} O\bigl ((n + \kappa ^{2 + 2(d+h )}) \log (1/\varepsilon ) \bigr ), \end{aligned}$$

to minimize a d-variate convex function of the form \(f = f_1 + \dots + f_n\) to \(\varepsilon \)-accuracy. So our proposed methods SSM and SSBB are at least an order of magnitude faster than SVRG–BB and SLBFGS in terms of the condition number \(\kappa \). Here h refers to the ‘history size’ of SLBFGS. The algorithms and quantities involved will all be explained in due course.

Johan Steffensen first proposed his eponymous method [1] in 1933. See [19] for an informative history of the method and a biography of its inventor. The method was described in the classic books of Henrici [5, pp. 91–95] and Householder [20, p. 164] but has remained more of a textbook curiosity. One reason, as we mentioned above and will elaborate in Sect. 2.2, is that there has been no viable multivariate version.

Another reason, we speculate, is that much as the Kaczmarz method [21, 22] for the iterative solution of linear systems lingered in relative obscurity until it was randomized [23], Steffensen method is also most effective in a randomized setting. This is in fact more than an analogy; we will show in Sect. 2.4 that the stochastic Steffensen method we propose reduces to randomized Kaczmarz method when applied to a quadratic objective, which is not true of SGD, SVRG, or SLBFGS. So one may also view our stochastic Steffensen method as a generalization of randomized Kaczmarz method to arbitrary differentiable objective functions.

In Sect. 3 we will supply proofs of linear convergence of our methods and a theoretical comparison with other existing methods. In Sect. 4 we show how to adapt our methods for nonsmooth functions. In the numerical experiments in Sect. 5, we will see that our stochastic Steffensen method compares favorably with SGD, SVRG (with or without Barzilai–Borwein step size), and SLBFGS across different tasks in the LIBSVM datasets: ridge regression, logistic regression, and support vector machines with squared hinge loss.

1.1 Background

As in the usual setting for stochastic gradient descent and its variants, our goal is to minimize an objective function of the form

$$\begin{aligned} f(x)= \frac{1}{n} \sum _{i=1}^n f_i(x), \end{aligned}$$
(1)

where \(x\in \mathbb {R}^d\) is the model parameter. Such functions are ubiquitous in machine learning, arising from the empirical risk minimization (ERM) problem where \(f_i\) takes the form

$$\begin{aligned} f_i(x) = \ell (w_i^{\scriptscriptstyle \textsf{T}}x; y_i) + \lambda R(x), \end{aligned}$$

with \(\ell :\mathbb {R}\times \mathbb {R}\rightarrow \mathbb {R}_+\) the loss function, \(R: \mathbb {R}^d\rightarrow \mathbb {R}_+\) the regularizer, \(\lambda \ge 0\) the regularization parameter, and \(\{(w_i,y_i)\in \mathbb {R}^d \times \mathbb {R}:i=1,\dots , n\}\) the training set with labels. Different choices of \(\ell \) and R give \(l^2\)-regularized logistic regression, lasso regression, soft-margin support vector machine, etc.
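
For concreteness, a minimal sketch of one such component \(f_i\) for \(l^2\)-regularized logistic regression, i.e., \(\ell (t; y) = \log (1 + e^{-yt})\) and \(R(x) = \frac{1}{2}\Vert x\Vert ^2\); the data \((w_i, y_i)\) and the value of \(\lambda \) below are made-up placeholders.

```python
import numpy as np

def f_i(x, w_i, y_i, lam=1e-3):
    """One ERM component: logistic loss plus l2 regularization."""
    t = w_i @ x
    return np.log1p(np.exp(-y_i * t)) + 0.5 * lam * (x @ x)

def grad_f_i(x, w_i, y_i, lam=1e-3):
    """Its gradient, using d/dt log(1 + exp(-y t)) = -y / (1 + exp(y t))."""
    t = w_i @ x
    return -y_i / (1.0 + np.exp(y_i * t)) * w_i + lam * x

rng = np.random.default_rng(0)
w_i, y_i, x = rng.standard_normal(5), 1.0, np.zeros(5)
print(f_i(x, w_i, y_i), grad_f_i(x, w_i, y_i))
```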

The challenge here is that the dimension d and sample size n are extremely large in modern situations, mandating the use of first-order methods that rely only on first derivatives. But when n is large, even computing the full gradient of all \(f_1,\dots , f_n\) is intractable, and we need stochastic optimization methods that update x after processing only a small subset of the data, permitting progress in the time it takes a deterministic method to make a single step. Consequently, stochastic first-order methods have become the method of choice, with stochastic gradient descent (SGD) [24] and its many variants [25,26,27] and various stochastic quasi-Newton methods [28,29,30] ruling the day.

Stochastic optimization has grown into a vast subject. We have limited our comparison in this article to stochastic variants of classical methods that rely primarily on gradients. We did not include more sophisticated stochastic optimization algorithms that bring in additional features like moments [31, 32] or momentum [33,34,35,36], for two reasons. Firstly, these more sophisticated algorithms invariably require heavy tuning compared to purely gradient-based methods. Secondly, we view them as enhancements to gradient-based methods, and our proposed stochastic Steffensen methods likewise lend themselves to such enhancements. As such, the most appropriate and equitable comparisons for us would be the aforementioned gradient-based methods.

1.2 Convention

In this article, we use the terms learning rate and step size slightly differently. Take for example our Steffensen–Barzilai–Borwein iteration in (10):

$$\begin{aligned} x_{k+1} = x_k - \frac{\beta _k \Vert \nabla f(x_k)\Vert ^2}{[\nabla f(x_k + \beta _k \nabla f(x_k)) - \nabla f(x_k)]^{\scriptscriptstyle \textsf{T}}\nabla f(x_k)} \nabla f(x_k), \end{aligned}$$

the coefficient

$$\begin{aligned} \eta ^{{\scriptscriptstyle \textsf{SBB}}}_k {:}{=}\frac{\beta _k \Vert \nabla f(x_k)\Vert ^2}{[\nabla f(x_k + \beta _k \nabla f(x_k)) - \nabla f(x_k)]^{\scriptscriptstyle \textsf{T}}\nabla f(x_k)} \end{aligned}$$

will be called a learning rate whereas the coefficient

$$\begin{aligned} \beta _k {:}{=}\frac{\Vert x_k - x_{k-1}\Vert ^2}{[\nabla f(x_k) - \nabla f(x_{k-1})]^{\scriptscriptstyle \textsf{T}}(x_k - x_{k-1})} \end{aligned}$$

will be called a step size. In general, the term ‘learning rate’ will be used exclusively to refer to the coefficient of a search direction, which may be a gradient, a stochastic gradient, a variance-reduced stochastic gradient, etc. The term ‘step size’ will be used for coefficients in other contexts like \(\beta _k\) in the definition of the learning rate \(\eta ^{{\scriptscriptstyle \textsf{SBB}}}_k\).

We will use \(\eta _k\) to denote a general learning rate. For the learning rate of a particular algorithm, we will indicate the algorithm in superscript. For example, \(\eta ^{{\scriptscriptstyle \textsf{SBB}}}_k\) above is the learning rate of Steffensen–Barzilai–Borwein method (SBB). The Barzilai–Borwein step size above will always be denoted \(\beta _k\) throughout.

2 Stochastic multivariate Steffensen methods

Our three-step strategy is to (a) push the convergence order of the univariate Steffensen method to its limit, (b) extend the resulting method to a multivariate setting, and then (c) randomize the multivariate algorithm. For (a), we are led naturally to the Barzilai–Borwein step size; for (b), we emulate the multivariate extension of secant method into quasi-Newton method; and for (c), we draw inspiration from stochastic gradient descent and its various derivatives.

2.1 Deterministic univariate setting

As we saw in Sect. 1, univariate Steffensen method:

$$\begin{aligned} x_{k+1} = x_k - \frac{f'(x_k)^2}{f'(x_k + f'(x_k)) - f'(x_k)} \end{aligned}$$
(2)

avoids second-order derivatives and yet preserves quadratic convergence with the use of two first-order derivatives \(f'(x_k + f'(x_k))\) and \(f'(x_k)\). With modern hindsight, it is clear that we may obtain an immediate improvement in (2), one that is essentially free, by incorporating a coefficient \(\beta _k\) that only depends on quantities already computed. The analysis in the next two results will lead us to an appropriate choice of \(\beta _k\). Note that although the algorithms require only first derivatives of f, the convergence results assume that f has a higher degree of smoothness.

Proposition 1

(Convergence order of Steffensen method) Let f be a function that is \(C^3\) in a neighborhood of a stationary point \(x^*\) with \(f'(x^*) = 0\) and \(f''(x^*) \ne 0\). Let \(\alpha \in \mathbb {R}\) be a nonzero constant parameter and

$$\begin{aligned} x_{k+1} {:}{=}x_k - \frac{\alpha f'(x_k)^2}{f'\bigl (x_k + \alpha f'(x_k)\bigr ) - f'(x_k)} \end{aligned}$$

for \(k = 0,1,2,\dots .\) If \(\lim _{k \rightarrow \infty } x_k = x^*\), then

$$\begin{aligned} \lim _{k\rightarrow \infty }\frac{|\varepsilon _{k+1}|}{|\varepsilon _k^2|}=\frac{1}{2}\left| \frac{f'''(x^*)}{f''(x^*)}\right| \left| 1 + \alpha f''(x^*)\right| , \end{aligned}$$

where \(\varepsilon _k {:}{=}x_k - x^*\) denotes the error in iteration k.

Proof

Let \(\varepsilon _k = x_k - x^*\). Subtracting \(x^*\) from both sides, we get

$$\begin{aligned} \varepsilon _{k+1} = \varepsilon _{k} - \frac{\alpha f'(x_k)^2}{f'(x_k + \alpha f'(x_k)) - f'(x_k)}. \end{aligned}$$

Taylor expanding \(f'(x_k + \alpha f'(x_k))\) about \(x_k\), we get

$$\begin{aligned} f'(x_k + \alpha f'(x_k)) = f'(x_k) + f''(x_k)\alpha f'(x_k) + \frac{f'''(\xi _k)}{2}\alpha ^2f'(x_k)^2 \end{aligned}$$

for some \(\xi _k\) between \(x_k\) and \(x_k + \alpha f'(x_k)\). Combining the previous two equations, we have

$$\begin{aligned} \begin{aligned} \varepsilon _{k+1}&= \varepsilon _{k} - \frac{f'(x_k)}{f''(x_k) + \frac{f'''(\xi _k)}{2}\alpha f'(x_k)} \\ {}&= \frac{-f'(x_k) + f''(x_k)\varepsilon _k + \frac{1}{2} f'''(\xi _k)\alpha f'(x_k)\varepsilon _k}{f''(x_k) + \frac{1}{2} f'''(\xi _k)\alpha f'(x_k)}. \end{aligned} \end{aligned}$$
(3)

Taylor expanding \(f'\) about \(x_k\), we get

$$\begin{aligned} 0 = f'(x^*) = f'(x_k) - f''(x_k)\varepsilon _k + \frac{f'''(\xi _k^*)}{2}\varepsilon _k^2 \end{aligned}$$

for some \(\xi _k^*\) between \(x_k\) and \(x^*\). Plugging \(f'(x_k)\) into (3) gives us

$$\begin{aligned} \varepsilon _{k+1} = \frac{f'''(\xi _k^*)\varepsilon _k^2 + \alpha f'''(\xi _k)f''(x_k)\varepsilon _k^2 - \frac{\alpha }{2} f'''(\xi _k)f'''(\xi _k^*)\varepsilon _k^3}{2f''(x_k) + f'''(\xi _k)\alpha f'(x_k)}. \end{aligned}$$

Taking limit \(k \rightarrow \infty \) and using continuity of \(f'\), \(f''\), and \(f'''\) at \(x^*\), we have

$$\begin{aligned} \lim _{k\rightarrow \infty }\frac{|\varepsilon _{k+1}|}{|\varepsilon _k^2|}&= \lim _{k\rightarrow \infty }\biggl |\frac{f'''(\xi _k^*) + \alpha f'''(\xi _k)f''(x_k) - \frac{\alpha }{2} f'''(\xi _k)f'''(\xi _k^*)\varepsilon _k}{2f''(x_k) + f'''(\xi _k)\alpha f'(x_k)}\biggr |\\ {}&=\frac{1}{2}\biggl |\frac{f'''(x^*)}{f''(x^*)}\biggr | |1 + \alpha f''(x^*)| \end{aligned}$$

as required. \(\square \)

We next show that with an appropriate choice of \(\alpha \), we can push Steffensen method into the superquadratically convergent regime. The quadratic convergence in Proposition 1 is independent of the value \(\alpha \) and we may thus choose a different \(\alpha \) at every step. Of course if we simply set \(\alpha _k = -1/f''(x_k)\) in Proposition 1, we will obtain a cubically convergent algorithm. However since we want a first-order method whose learning rate depends only on previously computed quantities, we set \(\alpha _k = -(x_k - x_{k-1})/[f'(x_k) - f'(x_{k-1})]\) to be the finite difference to avoid second derivatives — as it turns out, this improves convergence order to \(1 + \sqrt{2}\).

Theorem 2

(Convergence order of Steffensen method with Barzilai–Borwein step size) Let f be a function that is \(C^4\) in a neighborhood of a stationary point \(x^*\) with \(f'(x^*) = 0\) and \(f''(x^*) \ne 0\). Let

$$\begin{aligned} \beta _k =-\frac{x_k - x_{k-1}}{f'(x_k) - f'(x_{k-1})} \end{aligned}$$

and

$$\begin{aligned} x_{k+1} = x_k - \frac{\beta _k f'(x_k)^2}{f'\bigl (x_k + \beta _k f'(x_k)\bigr ) - f'(x_k)} \end{aligned}$$
(4)

for \(k =0,1,2,\dots .\) If \(\lim _{k \rightarrow \infty } x_k = x^*\), then

$$\begin{aligned} \lim _{k\rightarrow \infty } \frac{|\varepsilon _{k+1}|}{|\varepsilon _k^2\varepsilon _{k-1}|} = \Bigl (\frac{f'''(x^*)}{2f''(x^*)}\Bigr )^2. \end{aligned}$$

In particular, the order of convergence of (4) is superquadratic, namely \(1+\sqrt{2} \approx 2.414\).

Proof

Taylor expanding \(f'(x_k+\beta _kf'(x_k))\) at \(x_k\), we get

$$\begin{aligned} f'(x_k + \beta _k f'(x_k))= & {} f'(x_k) + f''(x_k)\beta _k f'(x_k) +\frac{f^{(3)}(x_k)}{2}\beta _k^2f'(x_k)^2 \\ {}{} & {} + \frac{f^{(4)}(\xi _k)}{6}\beta _k^3f'(x_k)^3 \end{aligned}$$

for some \(\xi _k\) between \(x_k\) and \(x_k + \beta _k f'(x_k)\). Letting \(\varepsilon _k = x_k - x^*\), we have

$$\begin{aligned} \begin{aligned} \varepsilon _{k+1}&= \varepsilon _{k} - \frac{f'(x_k)}{f''(x_k) + \frac{1}{2} f^{(3)}(x_k) \beta _k f'(x_k) + \frac{1}{6} f^{(4)}(\xi _k)\beta _k^2f'(x_k)^2}\\&= \frac{-f'(x_k)+f''(x_k)\varepsilon _k + \frac{1}{2} f^{(3)}(x_k)\beta _kf'(x_k)\varepsilon _k+\frac{1}{6} f^{(4)}(\xi _k) \beta _k^2f'(x_k)^2\varepsilon _k}{f''(x_k) + \frac{1}{2} f^{(3)}(x_k) \beta _k f'(x_k) + \frac{1}{6} f^{(4)}(\xi _k)\beta _k^2f'(x_k)^2}. \end{aligned} \end{aligned}$$
(5)

Taylor expanding \(f'(x^*)\) at \(x_k\) to 4th, 3rd, and 2nd order, we get

$$\begin{aligned} 0 = f'(x^*)&= f'(x_k) - f''(x_k)\varepsilon _k + \frac{f^{(3)}(x_k)}{2}\varepsilon _k^2 -\frac{f^{(4)}(\xi _k^*)}{6}\varepsilon _k^3,\\ 0 = f'(x^*)&= f'(x_k) - f''(x_k)\varepsilon _k + \frac{f^{(3)}(\xi _k')}{2}\varepsilon _k^2,\\ 0 = f'(x^*)&= f'(x_k) - f''(\xi _k^\dagger )\varepsilon _k. \end{aligned}$$

Plugging these into (5) and defining

$$\begin{aligned} A_k&{:}{=}f''(x_k) + \frac{f^{(3)}(x_k)}{2}\beta _k f'(x_k) + \frac{f^{(4)}(\xi _k)\beta _k^2f'(x_k)^2}{6},\\ B_k&{:}{=}\frac{f^{(4)}(\xi _k)}{6}\beta _k^2f''(\xi _k^\dagger )^2\varepsilon _k^3 - \frac{f^{(4)}(\xi _k^*)}{6}\varepsilon _k^3 - \frac{f^{(3)}(x_k)}{4}f^{(3)}(\xi _k')\beta _k\varepsilon _k^3, \end{aligned}$$

we obtain

$$\begin{aligned} \varepsilon _{k+1} = \frac{\frac{1}{2} f^{(3)}(x_k)\varepsilon _k^2(f''(x_k)\beta _k+1) + B_k}{A_k}. \end{aligned}$$

Since \(\beta _k = -(x_k - x_{k-1})/(f'(x_k) - f'(x_{k-1}))\), we may Taylor expand \(f'(x_{k-1})\) at \(x_k\) to get

$$\begin{aligned} f'(x_{k-1})&= f'(x_k) + f''(x_k)(\varepsilon _{k - 1} - \varepsilon _k) +\frac{f^{(3)}(\xi _k^\ddagger )}{2}(\varepsilon _{k-1} - \varepsilon _k)^2 \end{aligned}$$

for some \(\xi _k^\ddagger \) between \(x_{k-1}\) and \(x_k\). Plugging it into

$$\begin{aligned} \beta _k = -\frac{1}{f''(x_k)+\frac{1}{2} f^{(3)}(\xi _k^\ddagger )(\varepsilon _{k-1} - \varepsilon _k)} \end{aligned}$$

gives us

$$\begin{aligned} \varepsilon _{k+1} =\frac{\dfrac{f^{(3)}(x_k)f^{(3)}(\xi _k^\ddagger )\varepsilon _k^2(\varepsilon _{k-1} - \varepsilon _k)}{2(2f''(x_k)+f^{(3)}(\xi _k^\ddagger )(\varepsilon _{k-1} - \varepsilon _k))} + B_k}{A_k}. \end{aligned}$$

We deduce that

$$\begin{aligned} \lim _{k\rightarrow \infty }\frac{|\varepsilon _{k}|}{|\varepsilon _{k-1}|} = 0, \qquad \lim _{k\rightarrow \infty }\frac{|B_k|}{|\varepsilon _k^2\varepsilon _{k-1}|} = 0, \end{aligned}$$

and therefore

$$\begin{aligned} \lim _{k\rightarrow \infty } \frac{|\varepsilon _{k+1}|}{|\varepsilon _k^2\varepsilon _{k-1}|} = \Bigl (\frac{f^{(3)}(x^*)}{2f^{(2)}(x^*)}\Bigr )^2. \end{aligned}$$

Hence the convergence order is \(1 + \sqrt{2}\). \(\square \)
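
As a quick numerical illustration of Theorem 2, a minimal sketch of iteration (4) on an assumed toy derivative \(f'(x) = x/2 + x^2\) (minimizer \(x^* = 0\)); the starting points are arbitrary, and in floating point the error bottoms out near machine precision after a handful of steps.

```python
# Steffensen iteration with the univariate Barzilai-Borwein step size, i.e.,
# update (4); the toy f'(x) = x/2 + x^2 is an assumption made for illustration.
fp = lambda x: 0.5 * x + x**2

x_prev, x = 0.5, 0.4
for k in range(5):
    beta = -(x - x_prev) / (fp(x) - fp(x_prev))     # BB step size of Theorem 2
    denom = fp(x + beta * fp(x)) - fp(x)
    x_prev, x = x, x - beta * fp(x)**2 / denom      # update (4)
    print(f"k = {k + 1}, error = {abs(x):.2e}")     # decays superquadratically
```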

The choice of \(\beta _k\) above is exactly the Barzilai–Borwein (BB) step size for a univariate function [37]. In the multivariate setting, \(\beta _k\) will be replaced by the multivariate BB step size. Theorem 2 provides the impetus for a first-order method with Steffensen updates and BB step size, namely, it is superquadratically convergent for univariate functions. Such a high convergence order is clearly an overkill for a deterministic algorithm but our experiments in Sect. 5 show that they are rewarding when the algorithm is randomized, as randomization inevitably compromises convergence speed. For easy comparison, we tabulate the convergence order, i.e., the largest q such that \(|\varepsilon _{k+1} |\le c|\varepsilon _k |^q\) for some \(c >0\) and all k sufficiently large, of various methods below:

Method | Convergence | Derivatives | Steps
Steepest descent | 1 | 1st | Single step
Secant = Barzilai–Borwein = quasi-Newton | \((1 + \sqrt{5})/2\) | 1st | Multistep
Newton | 2 | 2nd | Single step
Steffensen | 2 | 1st | Single step
Steffensen–Barzilai–Borwein | \(1 + \sqrt{2}\) | 1st | Multistep
Note that for a univariate function, the Barzilai–Borwein step size and any quasi-Newton method with Broyden class updates (including BFGS, DFP, SR1) reduce to the secant method. In particular, they are all two-step methods, i.e., the iterate at step k depends on both \(x_k\) and \(x_{k-1}\). As a result, Steffensen–Barzilai–Borwein method is also a two-step method, as it involves the Barzilai–Borwein step size, but Steffensen method is a one-step method.

2.2 Deterministic multivariate setting

There has been no shortage of proposals for extending Steffensen method to a multivariate or even infinite-dimensional setting [3,4,5,6,7,8,9,10,11,12,13,14,15]. However, all of them rely on various multivariate versions of divided differences that require evaluation and storage of \(O(n^2)\) first derivatives in each step. Although they do avoid second derivatives, computationally they are just as expensive as Newton method and are unsuitable for modern large-scale applications like training deep neural networks.

We will propose an alternative class of multivariate Steffensen methods that use only O(n) first derivatives, by emulating quasi-Newton methods [38,39,40,41] and Barzilai–Borwein method [37] respectively. Our observation is that expensive multivariate divided differences can be completely avoided if we just use the ideas in Sect. 2.1 to define learning rates. Another advantage is that these learning rates could be readily used in conjunction with existing stochastic optimization methods, as we will see in Sect. 2.3.

The key idea behind quasi-Newton method is the extension of univariate secant method to a multivariate objective function \(f:\mathbb {R}^d\rightarrow \mathbb {R}\) by replacing the finite difference approximation of \(f''(x_k)\), i.e., \(h_k = [f'(x_k) - f'(x_{k-1})]/(x_k - x_{k-1})\), with the secant equation \(H_k s_k = y_k\) or

$$\begin{aligned} B_k y_k = s_k \end{aligned}$$
(6)

where \(s_k = x_k - x_{k-1}\) and \(y_k = \nabla f(x_k) - \nabla f(x_{k-1})\), avoiding the need to divide vectorial quantities. Here \(H_k\) (resp. \(B_k\)) is the approximate Hessian (resp. approximate inverse Hessian).

We use the same idea to extend Steffensen method to a multivariate setting, solving (6) with

$$\begin{aligned} s_k = \nabla f(x_k), \qquad y_k = \nabla f(x_k + \nabla f(x_k)) - \nabla f(x_k). \end{aligned}$$

Note that with these choices, (6) roughly says that “\(B_k = s_k/y_k = \nabla f(x_k)/[\nabla f(x_k + \nabla f(x_k)) - \nabla f(x_k) ]\),” which gives us \(f'(x_k)/[f'(x_k + f'(x_k)) - f'(x_k)]\) as in the univariate Steffensen method when \(d=1\) but is of course meaningless when \(d > 1\). Nevertheless we may pick a minimum-norm solution to (6), which is easily seen to be given by the rank-one matrix

$$\begin{aligned} B_k = \mathop {\textrm{argmin}}\limits _{B y_k = s_k} \Vert B \Vert = \frac{s_k y_k^{\scriptscriptstyle \textsf{T}}}{y_k^{\scriptscriptstyle \textsf{T}}y_k} \end{aligned}$$

regardless of whether \(\Vert \, \cdot \, \Vert \) is the Frobenius or spectral norm. Hence we obtain a multivariate analogue of Steffensen method (2) as

$$\begin{aligned} x_{k+1} = x_k - B_k \nabla f(x_k) = x_k - \frac{[\nabla f(x_k + \nabla f(x_k)) - \nabla f(x_k)]^{\scriptscriptstyle \textsf{T}}\nabla f(x_k)}{\Vert \nabla f(x_k + \nabla f(x_k)) - \nabla f(x_k) \Vert ^2} \nabla f(x_k). \end{aligned}$$
(7)

We will call this quasi-Steffensen method in analogy with quasi-Newton methods.
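
A minimal sketch of the quasi-Steffensen update (7); the strongly convex quadratic \(f(x) = \frac{1}{2}x^{\scriptscriptstyle \textsf{T}}Ax - b^{\scriptscriptstyle \textsf{T}}x\) below, with synthetic A and b, is an assumption made only to have something concrete to run the iteration on.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)        # symmetric positive definite
b = rng.standard_normal(d)
grad = lambda x: A @ x - b

x = np.zeros(d)
for k in range(100):
    g = grad(x)
    y = grad(x + g) - g            # y_k = grad f(x_k + grad f(x_k)) - grad f(x_k)
    eta = (y @ g) / (y @ y)        # coefficient of the rank-one B_k applied to g, as in (7)
    x = x - eta * g
print(np.linalg.norm(grad(x)))     # the gradient norm should be small
```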

The key idea behind the Barzilai–Borwein method [37] is an alternative way of treating the secant equation (6), whereby the approximate Hessian \(B_k\) is assumed to take the form \(B_k = \sigma _k I\) for some scalar \(\sigma _k > 0\). Since in general it is not possible to find \(\sigma _k\) so that (6) holds exactly with \(B_k = \sigma _k I\), a best approximation is used instead. We seek \(\sigma _k\) so that the residual of the secant equation \(\Vert y_k - (1/\sigma _k) s_k\Vert ^2\) or \(\Vert \sigma _k y_k - s_k\Vert ^2\) is minimized. The first minimization problem gives us

$$\begin{aligned} \sigma _k = \mathop {\textrm{argmin}}\limits _{\sigma >0} \; \Vert y_k - (1/\sigma ) s_k\Vert ^2 = \frac{s_k^{\scriptscriptstyle \textsf{T}}s_k}{s_k^{\scriptscriptstyle \textsf{T}}y_k} =\frac{\Vert \nabla f(x_k)\Vert ^2}{[\nabla f(x_k + \nabla f(x_k)) - \nabla f(x_k)]^{\scriptscriptstyle \textsf{T}}\nabla f(x_k)}, \end{aligned}$$
(8)

and the second minimization gives the same expression as (7). We will call the resulting iteration

$$\begin{aligned} x_{k+1} = x_k - \frac{\Vert \nabla f(x_k)\Vert ^2}{[\nabla f(x_k + \nabla f(x_k)) - \nabla f(x_k)]^{\scriptscriptstyle \textsf{T}}\nabla f(x_k)} \nabla f(x_k) \end{aligned}$$

Steffensen method since it most resembles the univariate Steffensen method in (2). Note that the Barzilai–Borwein step size derived in [37] is

$$\begin{aligned} \beta _k = \frac{\Vert x_k - x_{k-1}\Vert ^2}{[\nabla f(x_k) - \nabla f(x_{k-1})]^{\scriptscriptstyle \textsf{T}}(x_k - x_{k-1})} \end{aligned}$$
(9)

and differs significantly from (8). In particular, \(x_{k+1} = x_k - \beta _k \nabla f(x_k)\) is a multistep method whereas \(x_{k+1} = x_k - \sigma _k \nabla f(x_k)\) remains a single step method.

Both (7) and (8) reduce to (2) when f is univariate. Motivated by the univariate discussion before Theorem 2, we combine features from (8) and (9) to obtain a Steffensen–Barzilai–Borwein method in analogy with the univariate case (4):

$$\begin{aligned} x_{k+1} = x_k - \frac{\beta _k \Vert \nabla f(x_k)\Vert ^2}{[\nabla f(x_k + \beta _k \nabla f(x_k)) - \nabla f(x_k)]^{\scriptscriptstyle \textsf{T}}\nabla f(x_k)} \nabla f(x_k). \end{aligned}$$
(10)

Note that (10) reduces to (4) when f is univariate. The stochastic version of (10) will be our method of choice, supported by extensive empirical evidence some of which we will present in Sect. 5.
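
The deterministic iteration (10) is equally short to sketch; the quadratic test problem is again a synthetic assumption, and the two starting points only need to be distinct so that the first Barzilai–Borwein step size is defined.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)            # symmetric positive definite
b = rng.standard_normal(d)
grad = lambda x: A @ x - b

x_prev, x = np.zeros(d), 0.1 * rng.standard_normal(d)
for k in range(100):
    s, y = x - x_prev, grad(x) - grad(x_prev)
    beta = (s @ s) / (y @ s)                       # BB step size (9)
    g = grad(x)
    denom = (grad(x + beta * g) - g) @ g
    eta = beta * (g @ g) / denom                   # learning rate in (10)
    x_prev, x = x, x - eta * g
print(np.linalg.norm(grad(x)))                     # the gradient norm should be small
```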

In summary, we have four plausible learning rates:

$$\begin{aligned}&\text {Quasi-Steffensen:}&\eta ^{{\scriptscriptstyle \textsf{qS}}}_k&= \frac{[\nabla f(x_k + \nabla f(x_k)) - \nabla f(x_k)]^{\scriptscriptstyle \textsf{T}}\nabla f(x_k)}{\Vert \nabla f(x_k + \nabla f(x_k)) - \nabla f(x_k) \Vert ^2}, \\&\text {Quasi-Steffensen--Barzilai--Borwein:} \quad&\eta ^{{\scriptscriptstyle \textsf{qSBB}}}_k&= \frac{\beta _k [\nabla f(x_k + \beta _k \nabla f(x_k)) - \nabla f(x_k)]^{\scriptscriptstyle \textsf{T}}\nabla f(x_k)}{\Vert \nabla f(x_k + \beta _k \nabla f(x_k)) - \nabla f(x_k) \Vert ^2}, \\&\text {Steffensen:}&\eta ^{{\scriptscriptstyle \textsf{S}}}_k&= \frac{\Vert \nabla f(x_k)\Vert ^2}{[\nabla f(x_k + \nabla f(x_k)) - \nabla f(x_k)]^{\scriptscriptstyle \textsf{T}}\nabla f(x_k)}, \\&\text {Steffensen--Barzilai--Borwein:}&\eta ^{{\scriptscriptstyle \textsf{SBB}}}_k&= \frac{\beta _k \Vert \nabla f(x_k)\Vert ^2}{[\nabla f(x_k + \beta _k \nabla f(x_k)) - \nabla f(x_k)]^{\scriptscriptstyle \textsf{T}}\nabla f(x_k)}. \end{aligned}$$

Here \(\beta _k\) is the Barzilai–Borwein step size in (9). For a univariate function, the iterations with \(\eta ^{{\scriptscriptstyle \textsf{qS}}}_k\) and \(\eta ^{{\scriptscriptstyle \textsf{S}}}_k\) reduce to (2) whereas those with \(\eta ^{{\scriptscriptstyle \textsf{qSBB}}}_k\) and \(\eta ^{{\scriptscriptstyle \textsf{SBB}}}_k\) reduce to (4). The computational costs of all four learning rates are the same: two gradient evaluations and two inner products.

Note that our multivariate Steffensen and quasi-Steffensen methods are one-step methods — \(\eta ^{{\scriptscriptstyle \textsf{S}}}_k\) and \(\eta ^{{\scriptscriptstyle \textsf{qS}}}_k\) depend only on \(x_k\) — just like the univariate Steffensen method. Steffensen–Barzilai–Borwein and quasi-Steffensen–Barzilai–Borwein are inevitably two-step methods because they involve the Barzilai–Borwein step size \(\beta _k\), which has a two-step formula.

The main difference between our multivariate Steffensen methods and those in the literature [3,4,5,6,7,8,9,10,11,12,13,14,15] is that ours are encapsulated as learning rates and avoid expensive multivariate divided differences. Recall that for \(g = (g_1,\dots ,g_n): \mathbb {R}^n \rightarrow \mathbb {R}^n\), its divided difference [42] at \(x,y\in \mathbb {R}^n\) is the matrix \(\llbracket x,y \rrbracket \in \mathbb {R}^{n \times n}\) whose (ij)th entry is

$$\begin{aligned} \llbracket x,y \rrbracket _{ij} {:}{=}{\left\{ \begin{array}{ll} \dfrac{g_i(x_1,\dots ,x_j,y_{j+1},\dots ,y_n)-g_i(x_1,\dots ,x_{j-1},y_{j},\dots ,y_n)}{x_j-y_j} &{}x_j \ne y_j,\\ \dfrac{\partial g_i}{\partial x_j} (x_1,\dots ,x_j,y_{j+1},\dots ,y_n) &{}x_j = y_j, \end{array}\right. } \end{aligned}$$

for \(i,j =1,\dots ,n\).
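
For comparison, a sketch of this divided difference: filling the \(n\times n\) matrix column by column already takes \(O(n^2)\) component-function evaluations, which is precisely the cost the learning-rate formulation avoids. The linear map g below is a made-up example, and the forward difference used when \(x_j = y_j\) is a numerical stand-in for the partial derivative in the definition.

```python
import numpy as np

def divided_difference(g, x, y):
    n = len(x)
    D = np.zeros((n, n))
    for j in range(n):
        hi = np.concatenate([x[: j + 1], y[j + 1 :]])   # (x_1,...,x_j, y_{j+1},...,y_n)
        lo = np.concatenate([x[:j], y[j:]])             # (x_1,...,x_{j-1}, y_j,...,y_n)
        if x[j] != y[j]:
            D[:, j] = (g(hi) - g(lo)) / (x[j] - y[j])
        else:
            eps = 1e-7                                  # numerical partial derivative
            e = np.zeros(n)
            e[j] = eps
            D[:, j] = (g(hi + e) - g(hi)) / eps
    return D

A = np.array([[2.0, 1.0], [1.0, 3.0]])
g = lambda z: A @ z                                     # for a linear map, [[x, y]] = A
x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(divided_difference(g, x, y))
```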

In a stochastic setting, the learning rates \(\eta ^{{\scriptscriptstyle \textsf{S}}}_k,\eta ^{{\scriptscriptstyle \textsf{qS}}}_k,\eta ^{{\scriptscriptstyle \textsf{SBB}}}_k,\eta ^{{\scriptscriptstyle \textsf{qSBB}}}_k\) share the same upper and lower bounds in Lemma 7 and as a result, the linear convergence conclusion in Theorem 10 applies alike to all four of them. Our experiments also indicate that \(\eta ^{{\scriptscriptstyle \textsf{qS}}}_k\) and \(\eta ^{{\scriptscriptstyle \textsf{S}}}_k \) have similar performance and likewise for \(\eta ^{{\scriptscriptstyle \textsf{qSBB}}}_k\) and \(\eta ^{{\scriptscriptstyle \textsf{SBB}}}_k \), although there is a slight difference between \(\eta ^{{\scriptscriptstyle \textsf{S}}}_k \) and \(\eta ^{{\scriptscriptstyle \textsf{SBB}}}_k\). One conceivable advantage of the ‘quasi’ variants is that for a given \(\nabla f(x_k)\), the denominator vanishes only at a single point, e.g., when \(\nabla f(x_k +\nabla f(x_k)) = \nabla f(x_k)\), as opposed to a whole hyperplane, e.g., whenever \(\nabla f(x_k + \nabla f(x_k)) - \nabla f(x_k) \perp \nabla f(x_k)\). Nevertheless, in all our experiments on their stochastic variants, this has never been an issue.

We prefer the slightly simpler expressions of the Steffensen and Steffensen–Barzilai–Borwein methods and will focus our subsequent discussions on them. Their ‘quasi’ variants may be taken as nearly equivalent alternatives for users who may have some other reasons to favor them.

2.3 Stochastic multivariate setting

Encapsulating Steffensen method in the form of learning rates offers an additional advantage — it is straightforward to incorporate them into many stochastic optimization algorithms, which we will do next.

Standard gradient descent applied to (1) requires the evaluation of n gradients. The stochastic gradient descent (SGD), instead of using the full gradient \(\nabla f(x_k)\), relies on an unbiased estimator \(g_k\) with \(\mathbb {E}[g_k] = \nabla f(x_k)\) [24]. One common randomization is to draw \(i_k \in \{1, \dots , n\}\) randomly and set \(g_k = \nabla f_{i_{k}}(x_k)\), giving the update:

$$\begin{aligned} x_{k+1} = x_k-\eta _k \nabla f_{i_{k}}(x_k). \end{aligned}$$

Note that \(\mathbb {E}[\nabla f_{i_k}(x_k) \mid x_k] = \nabla f(x_k)\) and its obvious advantage is that each step relies only on a single gradient \(\nabla f_{i_k}\), resulting in a computational cost that is 1/n that of the standard gradient descent. While we could adopt this procedure to randomize our Steffensen and Steffensen–Barzilai–Borwein iterations, we will use a slightly more sophisticated variant with variance reduction and minibatching.

The price of randomization is paid in the form of variance, as the stochastic gradient \(\nabla f_{i_k}(x_k)\) equals the gradient \(\nabla f(x_k)\) only in expectation but each \(\nabla f_{i_k}(x_k)\) is different. Of the many variance reduction strategies, one of the best known and simplest is the stochastic variance reduced gradient method (SVRG) [27], based on the tried-and-tested notion of control variates in Monte Carlo methods. We will emulate SVRG to randomize (7) and (10).

The basic idea of SVRG is to compute the full gradient once every m iterations for some fixed m and use it to generate stochastic gradients with lower variance in the next m iterations:

$$\begin{aligned} x_{k+1} = x_k - \eta _k\bigl (\nabla f_{i_k}(x_k) - \nabla f_{i_k}(\widetilde{x}) + \nabla f(\widetilde{x})\bigr ). \end{aligned}$$

Here \(\widetilde{x}\) denotes the point where the full gradient is computed. Notice that as \(k\rightarrow \infty \), both \(x_k\) and \(\widetilde{x}\) approach the optimal point \(x^*\). Since \(x_k\) and \(\widetilde{x}\) are highly correlated, the variability of the stochastic gradient is reduced [27].

We may similarly randomize multivariate Steffensen method. Our stochastic Steffensen method (SSM) in Algorithm 1 operates in two nested loops. In the kth iteration of the outer loop, we compute two full gradients \(\nabla f(x_k)\) and \(\nabla f(x_k + \nabla f(x_k))\). Note that \(x_k\) plays the role of \(\widetilde{x}\) in the above paragraph. These two terms are used for computing the Steffensen learning rate:

$$\begin{aligned} \eta ^{{\scriptscriptstyle \textsf{SS}}}_{k} = \frac{1}{\sqrt{m}} \cdot \frac{\Vert \nabla f(x_k)\Vert ^2}{[\nabla f(x_k + \nabla f(x_k)) - \nabla f(x_k)]^{\scriptscriptstyle \textsf{T}}\nabla f(x_k)}. \end{aligned}$$
(11)

In the \((t+1)\)th iteration of the inner loop, we use \(\nabla f(x_k)\) to generate the stochastic gradient with lower variance

$$\begin{aligned} v_{k,t} = \nabla f_{i_t}(x_{k,t}) - \nabla f_{i_t}(x_k) + \nabla f(x_k), \end{aligned}$$

with \(i_t \in \{1,\dots ,n\}\) sampled uniformly. The updating rule takes the form

$$\begin{aligned} x_{k,t+1} = x_{k,t} - \eta ^{{\scriptscriptstyle \textsf{SS}}}_{k} v_{k,t} \end{aligned}$$

where the search direction is known as the variance-reduced stochastic gradient. Note that the learning rate \(\eta ^{{\scriptscriptstyle \textsf{SS}}}_{k}\) given by (11) has an extra \(1/\sqrt{m}\) factor to guarantee linear convergence. The \(1/\sqrt{m}\) factor may be replaced by \(1/m^p\) for any \(p \in (0,1)\), although not \(p =1\), while preserving linear convergence. The optimal complexity, as measured by the number of stochastic gradient evaluations, is achieved when \(p = 1/2\), with details to follow in Sect. 3.

Algorithm 1 Stochastic Steffensen Method (SSM)
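
Since the pseudocode is displayed as a figure, here is a minimal sketch of Algorithm 1 in the notation of the text. Taking the next outer iterate to be a uniformly sampled inner iterate is an assumption consistent with the identity \(\mathbb {E}[f(x_{k+1})] = \frac{1}{m}\sum _{t=1}^m f(x_{k,t})\) used in Sect. 3, and the least-squares test problem is synthetic.

```python
import numpy as np

def ssm(grad_f, grad_fi, x0, n, m, outer_iters, rng):
    """Sketch of SSM: the outer loop computes the learning rate (11) from two
    full gradients; the inner loop takes SVRG-style variance-reduced steps."""
    x = x0.copy()
    for k in range(outer_iters):
        g = grad_f(x)                                   # full gradient at x_k
        y = grad_f(x + g) - g                           # Steffensen-style difference
        eta = (g @ g) / (y @ g) / np.sqrt(m)            # learning rate (11)
        xs = x.copy()                                   # inner iterate x_{k,0} = x_k
        inner = []
        for t in range(m):
            i = rng.integers(n)                         # uniformly sampled index i_t
            v = grad_fi(i, xs) - grad_fi(i, x) + g      # variance-reduced gradient v_{k,t}
            xs = xs - eta * v
            inner.append(xs)
        x = inner[rng.integers(m)]                      # next outer iterate (assumed convention)
    return x

# Synthetic least-squares instance: f_i(x) = (w_i'x - y_i)^2 / 2.
rng = np.random.default_rng(0)
n, d = 200, 10
W = rng.standard_normal((n, d))
yv = W @ rng.standard_normal(d)
grad_fi = lambda i, x: (W[i] @ x - yv[i]) * W[i]
grad_f = lambda x: W.T @ (W @ x - yv) / n
x = ssm(grad_f, grad_fi, np.zeros(d), n=n, m=100, outer_iters=30, rng=rng)
print(np.linalg.norm(grad_f(x)))
```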

Aside from variance reduction, we include another common enhancement called minibatching. Minibatched SGD is a trade-off between SGD and gradient descent (GD) where the cost function (and therefore its gradient) is averaged over a small number of samples. SGD has a batch size of one whereas GD has a batch size that includes all training samples. In each iteration, we sample a minibatch \(S_{k}\subseteq \{1,\dots ,n\}\) with \(|S_{k}|=b\) a small number and update

$$\begin{aligned} x_{k+1} = x_k - \eta _k\frac{1}{|S_{k}|}\sum _{j\in S_{k}}\nabla f_{j}(x_k) {=}{:}x_k - \eta _k \nabla f_{S_k}(x_k). \end{aligned}$$

Minibatched SGD smooths out some of the noise in SGD but maintains the ability to escape local minima. The minibatch size b is kept small, thus preserving the cost-saving benefits of SGD. As in the discussion after (11), the coefficient \(1/\sqrt{m}\) may be replaced by \(1/m^p\) for any \(p\in (0,1)\) while preserving linear convergence, but \(p = 1/2\) gives the minimum number of stochastic gradient evaluations, as we will see in Sect. 3.

Upon incorporating (i) a Barzilai–Borwein step size, (ii) variance reduction, and (iii) minibatching, we arrive at the stochastic Steffensen–Barzilai–Borwein method (SSBB) in Algorithm 2. This is our method of choice in this article.

Algorithm 2 Stochastic Steffensen–Barzilai–Borwein Method (SSBB)
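
Likewise, a minimal sketch of Algorithm 2. Computing \(\beta _k\) from consecutive outer iterates, as in SVRG–BB, and sampling the next outer iterate from the inner iterates are assumed conventions, as is the synthetic least-squares data; minibatches are drawn with replacement for simplicity.

```python
import numpy as np

def ssbb(grad_f, grad_batch, x0, n, m, b, outer_iters, rng):
    """Sketch of SSBB: SSM plus the BB step size (9) inside the learning rate
    and minibatched variance-reduced gradients."""
    x_prev = x0.copy()
    x = x0 + 1e-3 * rng.standard_normal(x0.shape)        # distinct point so the first BB step is defined
    g_prev = grad_f(x_prev)
    for k in range(outer_iters):
        g = grad_f(x)
        s, y = x - x_prev, g - g_prev
        beta = (s @ s) / (y @ s)                         # BB step size (9) from outer iterates
        denom = (grad_f(x + beta * g) - g) @ g
        eta = beta * (g @ g) / denom / np.sqrt(m)        # SSBB learning rate, cf. (10) and Lemma 9
        xs = x.copy()
        inner = []
        for t in range(m):
            S = rng.integers(n, size=b)                  # minibatch S_{k,t} of size b
            v = grad_batch(S, xs) - grad_batch(S, x) + g # minibatched variance-reduced gradient
            xs = xs - eta * v
            inner.append(xs)
        x_prev, g_prev = x, g
        x = inner[rng.integers(m)]
    return x

# Same synthetic least-squares instance as in the SSM sketch.
rng = np.random.default_rng(0)
n, d = 200, 10
W = rng.standard_normal((n, d))
yv = W @ rng.standard_normal(d)
grad_f = lambda x: W.T @ (W @ x - yv) / n
grad_batch = lambda S, x: W[S].T @ (W[S] @ x - yv[S]) / len(S)
x = ssbb(grad_f, grad_batch, np.zeros(d), n=n, m=100, b=10, outer_iters=30, rng=rng)
print(np.linalg.norm(grad_f(x)))
```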

Although we did not include minibatching in Algorithm 1’s pseudocode to avoid clutter, we will henceforth assume that it is also minibatched. The randomization, variance reduction, and minibatching all apply verbatim when the learning rates in Algorithms 1 and 2 are replaced respectively by the quasi-Steffensen and quasi-Steffensen–Barzilai–Borwein learning rates on p. 10. Nevertheless, as we have mentioned, our numerical experiments do not show that the resulting algorithms differ in performance from Algorithms 1 and 2.

2.4 Randomized Kaczmarz method as a special case

Given \(A\in \mathbb {R}^{m \times n}\) of full row rank with row vectors \(a_1,\dots ,a_m \in \mathbb {R}^n\) and \(b\in \mathbb {R}^m\) in the image of A, the Kaczmarz method [21, 22] solves the consistent linear system \(Ax=b\) via

$$\begin{aligned} x_{k+1} = x_k+ \frac{b_i- a_i^{\scriptscriptstyle \textsf{T}}x_k}{\Vert a_{i} \Vert ^2} a_i, \end{aligned}$$

with \(i \equiv k \pmod m\), \(i \in \{1,\dots , m\}\). The iterative method had remained relatively obscure, almost unheard of in numerical linear algebra, until it was randomized in [23], which essentially does

$$\begin{aligned} x_{k+1} = x_k+ \frac{b_{i_k} - a_{i_k}^{\scriptscriptstyle \textsf{T}}x_k}{\Vert a_{i_k} \Vert ^2} a_{i_k}, \end{aligned}$$

where \(i_k \in \{1,\dots ,m\}\) is now sampled with probability \(\Vert a_{i_k}\Vert ^2/\Vert A \Vert ^2\).
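
A minimal sketch of this randomized iteration on a synthetic consistent system (taken overdetermined with full column rank so that the solution is unique):

```python
import numpy as np

rng = np.random.default_rng(0)
m_rows, n_cols = 100, 20
A = rng.standard_normal((m_rows, n_cols))
x_true = rng.standard_normal(n_cols)
b = A @ x_true                                   # consistent right-hand side

p = np.sum(A**2, axis=1) / np.sum(A**2)          # row-sampling probabilities ||a_i||^2 / ||A||_F^2
x = np.zeros(n_cols)
for k in range(2000):
    i = rng.choice(m_rows, p=p)
    x = x + (b[i] - A[i] @ x) / (A[i] @ A[i]) * A[i]
print(np.linalg.norm(x - x_true))                # should be close to zero
```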

We will see that randomized Kaczmarz method is equivalent to applying stochastic Steffensen method, with or without Barzilai–Borwein step size, to minimize the quadratic function \(f: \mathbb {R}^n \rightarrow \mathbb {R}\),

$$\begin{aligned} f(x) {:}{=}\frac{1}{2}\sum _{i = 1}^m f_i(x) = \frac{1}{2}\sum _{i = 1}^m (a_i^{\scriptscriptstyle \textsf{T}}x - b_i)^2. \end{aligned}$$

While it is sometimes claimed that SGD has this property, this is not quite true. If \(i_k \in \{1,\dots ,m\}\) is the random row index sampled at the kth step, the update rule in SGD gives

$$\begin{aligned} x_{k+1} = x_k - \eta _k (a_{i_k}^{\scriptscriptstyle \textsf{T}}x_k - b_{i_k}) a_{i_k}, \end{aligned}$$

and the update rule in SLBFGS is even further from this. So one needs to impose further assumptions [43] on the learning rate to get randomized Kaczmarz method, namely that \(\eta _k = 1/\Vert a_{i_k}\Vert ^2\). If we use the Steffensen learning rate (8), applied to the component \(f_{i_k}\), we get

$$\begin{aligned} \eta ^{{\scriptscriptstyle \textsf{S}}}_k = \frac{\Vert \nabla f_{i_k}(x_k)\Vert ^2}{[\nabla f_{i_k}\bigl (x_k + \nabla f_{i_k}(x_k)\bigr ) - \nabla f_{i_k}(x_k)]^{\scriptscriptstyle \textsf{T}}\nabla f_{i_k}(x_k)} = \frac{1}{\Vert a_{i_k}\Vert ^2}; \end{aligned}$$

and using Steffensen–Barzilai–Borwein method makes no difference:

$$\begin{aligned} \eta ^{{\scriptscriptstyle \textsf{SBB}}}_k = \frac{\beta _k\Vert \nabla f_{i_k}(x_k)\Vert ^2}{[\nabla f_{i_k}\bigl (x_k + \beta _k\nabla f_{i_k}(x_k)\bigr ) - \nabla f_{i_k}(x_k)]^{\scriptscriptstyle \textsf{T}}\nabla f_{i_k}(x_k)} = \frac{1}{\Vert a_{i_k}\Vert ^2}, \end{aligned}$$

as the factor \(\beta _k\) cancels: since \(\nabla f_{i_k}\bigl (x_k + \beta _k\nabla f_{i_k}(x_k)\bigr ) - \nabla f_{i_k}(x_k) = \beta _k a_{i_k}a_{i_k}^{\scriptscriptstyle \textsf{T}}\nabla f_{i_k}(x_k)\) and \(\nabla f_{i_k}(x_k)\) is a scalar multiple of \(a_{i_k}\), the quotient again reduces to \(1/\Vert a_{i_k}\Vert ^2\).
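
This is easy to verify numerically; the sketch below takes the component as \(f_i(x) = \frac{1}{2}(a_i^{\scriptscriptstyle \textsf{T}}x - b_i)^2\), consistent with the SGD update displayed above, and uses made-up data.

```python
import numpy as np

rng = np.random.default_rng(2)
a, b_i = rng.standard_normal(6), rng.standard_normal()
grad_i = lambda x: (a @ x - b_i) * a             # gradient of f_i(x) = (a'x - b_i)^2 / 2

x = rng.standard_normal(6)
g = grad_i(x)
eta_S = (g @ g) / ((grad_i(x + g) - g) @ g)      # componentwise Steffensen learning rate
print(eta_S, 1.0 / (a @ a))                      # the two values agree
```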

3 Convergence analysis

In this section, we establish the linear convergence of our stochastic Steffensen methods Algorithm 1 (SSM) and Algorithm 2 (SSBB) for solving (1) under standard assumptions. We would like to stress that these convergence results are intended to provide a minimal theoretical guarantee and do not really do justice to the actual performance of SSBB. The experiments in Sect. 5 indicate that the convergence of SSBB is often superior to other existing methods like SGD and SVRG, with or without Barzilai–Borwein step size, or even SLBFGS. However, we are unable to prove this theoretically, only that it is linearly convergent like the other methods.

For easy reference, we reproduce the minibatched SVRG algorithm in [44, Algorithm 1] as Algorithm 3.

Algorithm 3 Minibatched SVRG

We need to establish the linear convergence of Algorithm 3 for our own convergence results in Sects. 3.1 and 3.2 but we are unable to find such a result in the literature. In particular, the convergence results in [44, Propositions 2–4] and [45, Theorem 1] are for more sophisticated variants of Algorithm 3. So we will provide a version following the same line of arguments in [45, Theorem 1] but tailored to our own requirements.

Our linear convergence proofs for SSM and SSBB are a combination of the proofs in [45, 46] adapted for our purpose. In particular, we quote [46, Lemma A] and prove a simplified version of [45, Lemma 3] for easy reference.

Lemma 3

[Nitanda] Let \(\xi _1,\dots ,\xi _n \in \mathbb {R}^{d}\) and \(\bar{\xi } {:}{=}\frac{1}{n} \sum _{i=1}^n\xi _i\). Let S be a b-element subset chosen uniformly at random from all b-element subsets of \(\{1,2, \ldots , n\}\). Then

$$\begin{aligned} \mathbb {E}_S\Bigl \Vert \frac{1}{b} \sum \nolimits _{i \in S} \xi _{i}-\bar{\xi }\Bigr \Vert ^2=\frac{n-b}{b(n-1)} \mathbb {E}_{i}\bigl \Vert \xi _{i}-\bar{\xi }\bigr \Vert ^2. \end{aligned}$$

Here \(\mathbb {E}_{S}\) denotes expectation over the random b-element subset \(S \subseteq \{1,\dots ,n\}\) and \(\mathbb {E}_i\) that over the uniform random variable \(i \in \{1,\dots ,n\}\). More specifically, if \(S = \{i_1, \dots , i_b\}\), then

$$\begin{aligned} \mathbb {E}_S\Bigl \Vert \frac{1}{b} \sum _{i \in S} \xi _{i}-\bar{\xi }\Bigr \Vert ^2&= \frac{b!(n-b)!}{n!}\sum _{S\subseteq \{1,\dots ,n\},\, \vert S \vert = b}\Bigl \Vert \frac{1}{b} \sum _{j = 1}^b \xi _{i_j}-\bar{\xi }\Bigr \Vert ^2,\\ \mathbb {E}_i\bigl \Vert \xi _{i}-\bar{\xi }\bigr \Vert ^2&= \frac{1}{n}\sum _{j=1}^n \bigl \Vert \xi _{j}-\bar{\xi }\bigr \Vert ^2. \end{aligned}$$
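
The identity can be checked directly on a small made-up sample by enumerating all b-element subsets:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, b, d = 6, 2, 3
xi = rng.standard_normal((n, d))
xbar = xi.mean(axis=0)

lhs = np.mean([np.sum((xi[list(S)].mean(axis=0) - xbar) ** 2)
               for S in combinations(range(n), b)])
rhs = (n - b) / (b * (n - 1)) * np.mean(np.sum((xi - xbar) ** 2, axis=1))
print(lhs, rhs)                                  # the two values coincide
```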

For the rest of this section, we will need to assume, as is customary in such proofs of linear convergence, that our objective f is \(\mu \)-strongly convex, the gradient of each additive component \(f_i\) is L-Lipschitz continuous (and therefore so is \(\nabla f\)), and that all iterates are well-defined (the denominators appearing in our learning rates \(\eta _k \) are nonzero).

Assumption 1

Assume that the function f in (1) satisfies

$$\begin{aligned} f(w)&\ge f(v)+\nabla f(v)^{\scriptscriptstyle \textsf{T}}(w-v)+\frac{\mu }{2}\Vert v-w\Vert ^2,\\&\quad \Vert \nabla f_i(v)-\nabla f_i(w)\Vert \le L\Vert v-w\Vert \end{aligned}$$

for any \(v, w \in \mathbb {R}^{d}\), \(i =1,\dots ,n\).

Applying Lemma 3 with \(\xi _i = v_i^{k,t}\) and [45, Corollary 3], we may bound the variance of a minibatched variance-reduced gradient as follows.

Lemma 4

Let f be as in Assumption 1 with \(x^*{:}{=}\mathop {\textrm{argmin}}\limits _{x} f(x)\). Let

$$\begin{aligned} v_i^{k,t} = \nabla f_i(x_{k,t}) - \nabla f_i(x_k) + \nabla f(x_k), \qquad v_{k,t} = \frac{1}{b}\sum _{i\in S_{k,t}} v_{i}^{k,t}. \end{aligned}$$

Then

$$\begin{aligned} \mathbb {E}\Vert v_{k, t}-\nabla f(x_{k,t})\Vert ^2 \le \frac{4 L}{b} \bigl [f(x_{k, t})-f(x^*)+f(x_k)-f(x^*)\bigr ]. \end{aligned}$$

The next lemma, a simplified version of [45, Lemma 3], gives a lower bound of the optimal value \(f(x^*)\) useful in our proof of linear convergence.

Lemma 5

Let \(\Delta _{k,t} {:}{=}v_{k,t} - \nabla f(x_{k,t})\) and \(\eta _k\) be a learning rate with \(0<\eta _k \le 1 / L\). Then, with the same assumptions and notation as in Lemma 4, we have

$$\begin{aligned} f(x^*)\ge f(x_{k,t+1})+v_{k,t}^{{\scriptscriptstyle \textsf{T}}}(x^*-x_{k,t})+\frac{\eta _k}{2}\Vert v_{k,t}\Vert ^2+\frac{\mu }{2}\Vert x^*-x_{k,t}\Vert ^2+\Delta _{k,t}^{{\scriptscriptstyle \textsf{T}}}(x_{k,t+1}-x^*). \end{aligned}$$

Proof

By the strong convexity of f, we have

$$\begin{aligned} f(x^*) \ge f(x_{k,t}) + \nabla f(x_{k,t})^{{\scriptscriptstyle \textsf{T}}}(x^* - x_{k,t}) + \frac{\mu }{2}\Vert x^* - x_{k,t}\Vert ^2. \end{aligned}$$

By the smoothness of f, we have

$$\begin{aligned} f(x_{k,t}) \ge f(x_{k,t+1}) - \nabla f(x_{k,t})^{{\scriptscriptstyle \textsf{T}}}(x_{k,t+1} - x_{k,t}) - \frac{L}{2}\Vert x_{k,t+1} - x_{k,t}\Vert ^2. \end{aligned}$$

Summing the two inequalities, we get

$$\begin{aligned} f(x^*)\ge f(x_{k,t+1}) + \nabla f(x_{k,t})^{{\scriptscriptstyle \textsf{T}}}(x^* - x_{k,t+1}) + \frac{\mu }{2}\Vert x^* - x_{k,t}\Vert ^2 - \frac{L\eta _k^2}{2}\Vert v_{k,t}\Vert ^2. \end{aligned}$$

The second term on the right simplifies as

$$\begin{aligned} \nabla f(x_{k,t})^{{\scriptscriptstyle \textsf{T}}}(x^* - x_{k,t+1})&= \nabla f(x_{k,t})^{{\scriptscriptstyle \textsf{T}}}(x^* - x_{k,t+1}) + (v_{k,t} - v_{k,t})^{{\scriptscriptstyle \textsf{T}}}(x^* - x_{k,t+1})\\ {}&= v_{k,t}^{{\scriptscriptstyle \textsf{T}}}(x^* - x_{k,t+1}) + (v_{k,t} - \nabla f(x_{k,t}))^{{\scriptscriptstyle \textsf{T}}}(x_{k,t+1} - x^*)\\ {}&= v_{k,t}^{{\scriptscriptstyle \textsf{T}}}(x^* - x_{k,t}) + \eta _k\Vert v_{k,t}\Vert ^2 + \Delta _{k,t}^{{\scriptscriptstyle \textsf{T}}}(x_{k,t+1} - x^*), \end{aligned}$$

where the last step uses \(x_{k,t} - x_{k,t+1} = \eta _k v_{k,t}\).

If the learning rate satisfies \(0 < \eta _k \le 1/L\), then

$$\begin{aligned} f(x^*)&\ge f(x_{k,t+1})+v_{k,t}^{{\scriptscriptstyle \textsf{T}}}(x^*-x_{k,t})+\frac{\eta _k}{2}(2-L\eta _k)\left\| v_{k,t}\right\| ^2\\&\qquad \qquad +\frac{\mu }{2}\left\| x^*-x_{k,t}\right\| ^2+\Delta _{k,t}^{{\scriptscriptstyle \textsf{T}}}(x_{k,t+1}-x^*)\\&\ge f(x_{k,t+1})+v_{k,t}^{{\scriptscriptstyle \textsf{T}}}(x^*-x_{k,t})+\frac{\eta _k}{2}\left\| v_{k,t}\right\| ^2\\&\qquad \qquad +\frac{\mu }{2}\left\| x^*-x_{k,t}\right\| ^2+\Delta _{k,t}^{{\scriptscriptstyle \textsf{T}}}(x_{k,t+1}-x^*), \end{aligned}$$

as required. \(\square \)

Theorem 6

(Linear convergence of Algorithm 3) Let f be as in Assumption 1 with \(x^*{:}{=}\mathop {\textrm{argmin}}\limits _{x} f(x)\). For the \((k+1)\)th iteration of outer loop in Algorithm 3,

$$\begin{aligned} \mathbb {E}[f(x_{k+1}) - f(x^*)] \le \left[ \frac{b}{m\mu \eta _k(b - 4L\eta _k)} + \frac{4(m+1)L\eta _k}{m(b-4L\eta _k)}\right] [f(x_k) - f(x^*)]. \end{aligned}$$

If m, \(\eta _k\), and b are chosen so that

$$\begin{aligned} \rho _k = \frac{b}{m\mu \eta _k(b - 4L\eta _k)} + \frac{4(m+1)L\eta _k}{m(b-4L\eta _k)} \le \rho< 1,\qquad \eta _k < \min \Bigl ( \frac{b}{4L}, \frac{1}{L} \Bigr ), \end{aligned}$$

then Algorithm 3 converges linearly in expectation with

$$\begin{aligned} \mathbb {E}[f(x_k) - f(x^*)]\le \rho ^k[f(x_0) - f(x^*)]. \end{aligned}$$

Proof

For the iteration in the inner loop, we apply Lemma 5 to get

$$\begin{aligned} \begin{aligned} \left\| x_{k,t+1}-x^*\right\| ^2&= \left\| x_{k,t} - x^*\right\| ^2 - 2\eta _kv_{k,t}^{{\scriptscriptstyle \textsf{T}}}(x_{k,t} - x^*) + \eta _k^2\left\| v_{k,t}\right\| ^2 \\&\le \left\| x_{k,t} - x^*\right\| ^2 + 2\eta _k[f(x^*) - f(x_{k,t+1})] - 2\eta _k\Delta _{k,t}^{{\scriptscriptstyle \textsf{T}}}(x_{k,t+1} - x^*). \end{aligned} \end{aligned}$$
(12)

Lemma 5 requires that the learning rate \(\eta _k\le 1/L\). Let \(\bar{x}_{k,t+1} {:}{=}x_{k,t}-\eta _k\nabla f(x_{k,t})\). Then the last term in (12) may be written as

$$\begin{aligned} -2\eta _k\Delta _{k,t}^{{\scriptscriptstyle \textsf{T}}}(x_{k,t+1} - x^*)&= -2\eta _k\Delta _{k,t}^{{\scriptscriptstyle \textsf{T}}}(x_{k,t+1} - \bar{x}_{k,t+1}) - 2\eta _k\Delta _{k,t}^{{\scriptscriptstyle \textsf{T}}}(\bar{x}_{k,t+1} - x^*)\\&= 2\eta _k^2\left\| \Delta _{k,t}\right\| ^2 - 2\eta _k\Delta _{k,t}^{{\scriptscriptstyle \textsf{T}}}(\bar{x}_{k,t+1} - x^*). \end{aligned}$$

Plugging this into (12) and taking expectations on both sides conditioned on \(x_{k,t}\) and \(x_k\) respectively, we get

$$\begin{aligned} \mathbb {E}\Vert x_{k,t+1}-x^*\Vert ^2&\le \Vert x_{k,t} - x^*\Vert ^2 + 2\eta _k[\eta _k\mathbb {E}\Vert \Delta _{k,t}\Vert ^2\\&\qquad \qquad - \mathbb {E}[\Delta _{k,t}^{{\scriptscriptstyle \textsf{T}}}(\bar{x}_{k,t+1} - x^*)] - (f(x_{k,t+1}) - f(x^*))]\\&= \Vert x_{k,t} - x^*\Vert ^2 + 2\eta _k[\eta _k\mathbb {E}\Vert \Delta _{k,t}\Vert ^2 - (f(x_{k,t+1}) - f(x^*))], \end{aligned}$$

where the last equality follows from \(\mathbb {E}[\Delta _{k,t}] = 0\). Set \(\gamma {:}{=}8L\eta _k^2/b\). By Lemma 4, we have

$$\begin{aligned}&\mathbb {E}\Vert x_{k,t+1}-x^*\Vert ^2 \le \Vert x_{k,t} - x^*\Vert ^2 + \gamma [f(x_{k,t}) - f(x^*) + f(x_k) - f(x^*)] \\&\quad - 2\eta _k\mathbb {E}[f(x_{k,t+1}) - f(x^*)]. \end{aligned}$$

For \(t=0,\dots ,m-1\), we have

$$\begin{aligned}&\mathbb {E}\left\| x_{k,t+1}-x^*\right\| ^2 + 2\eta _k\mathbb {E}[f(x_{k,t+1}) - f(x^*)] \\&\qquad \le \Vert x_{k,t} - x^*\Vert ^2 + \gamma [f(x_{k,t}) - f(x^*) + f(x_k) - f(x^*)]. \end{aligned}$$

Summing this inequality over all \(t=0,\dots ,m-1\), the left hand side becomes

$$\begin{aligned} \text {LHS} = \sum _{t=0}^{m-1}\mathbb {E}\left\| x_{k,t+1}-x^*\right\| ^2 + 2\eta _k\sum _{t=0}^{m-1}\mathbb {E}[f(x_{k,t+1}) - f(x^*)], \end{aligned}$$

and the right hand side becomes

$$\begin{aligned} \text {RHS} = \sum _{t=0}^{m-1}\Vert x_{k,t} - x^*\Vert ^2 + \gamma \sum _{t=0}^{m-1}\mathbb {E}[f(x_{k,t}) - f(x^*)] + \gamma m\mathbb {E}[f(x_k) - f(x^*)]. \end{aligned}$$

By the definition of \(x_{k+1}\) in Algorithm 3,

$$\begin{aligned} \mathbb {E}[f(x_{k+1})] = \frac{1}{m}\sum _{t=1}^mf(x_{k,t}), \end{aligned}$$

and so, bearing in mind that \(\text {LHS} \le \text {RHS}\),

$$\begin{aligned} \mathbb {E}\Vert x_{k,m}-x^*\Vert ^2&+ 2\eta _km\mathbb {E}[f(x_{k+1}) - f(x^*)]\\&\le \mathbb {E}\left\| x_{k,0}-x^*\right\| ^2 + \gamma m\mathbb {E}[f(x_k) - f(x^*)] + \gamma \sum _{t=0}^{m-1}\mathbb {E}[f(x_{k,t}) - f(x^*)]\\&\le \mathbb {E}\left\| x_{k,0}-x^*\right\| ^2 + \gamma m\mathbb {E}[f(x_k) - f(x^*)] + \gamma m\mathbb {E}[f(x_{k+1}) - f(x^*)] \\&\qquad \qquad + \gamma [f(x_k) - f(x^*)], \end{aligned}$$

where the last step follows by replacing \(\sum _{t=0}^{m-1}\) with \(\sum _{t=0}^{m}\), which preserves the inequality. Thus

$$\begin{aligned} 2\eta _km\mathbb {E}[f(x_{k+1}) - f(x^*)]&\le 2\eta _km\mathbb {E}[f(x_{k+1}) - f(x^*)] + \mathbb {E}\left\| x_{k,m}-x^*\right\| ^2\\&\le \mathbb {E}\left\| x_k-x^*\right\| ^2 + \gamma (m+1)\mathbb {E}[f(x_k) - f(x^*)]\\&\qquad \qquad + \gamma m\mathbb {E}[f(x_{k+1}) - f(x^*)]. \end{aligned}$$

Rearranging terms and applying strong convexity of f, we have

$$\begin{aligned} \biggl (2\eta _k - \frac{8L\eta _k^2}{b} \biggr )&m\mathbb {E}[f(x_{k+1}) - f(x^*)] \\&\le \mathbb {E}\left\| x_k-x^*\right\| ^2 + \frac{8(m+1)L\eta _k^2}{b}\mathbb {E}[f(x_k) - f(x^*)]\\&\le \frac{2}{\mu }[f(x_k) - f(x^*)] + \frac{8(m+1)L\eta _k^2}{b}\mathbb {E}[f(x_k) - f(x^*)]. \end{aligned}$$

Here we require that \(2\eta _k > 8L\eta _k^2 / b\) and thus \(\eta _k < b / (4\,L)\), leading to

$$\begin{aligned} \mathbb {E}[f(x_{k+1}) - f(x^*)] \le \rho _k [f(x_k) - f(x^*)] \end{aligned}$$

with

$$\begin{aligned} \rho _k {:}{=}\frac{b}{m\mu \eta _k(b - 4L\eta _k)} + \frac{4(m+1)L\eta _k}{m(b-4L\eta _k)}. \end{aligned}$$

Choosing \(m\) and \(\eta _k\) so that \(\rho _k \le \rho < 1\) and applying the last inequality recursively, we get

$$\begin{aligned} \mathbb {E}[f(x_k) - f(x^*)]\le \rho ^k[f(x_0) - f(x^*)] \end{aligned}$$

as required. \(\square \)

3.1 Linear convergence of stochastic Steffensen method

With Theorem 6, we may deduce the linear convergence of Algorithm 1 as a special case of Algorithm 3 with \(b = 1\) (no minibatching) and \(\eta _k = \eta ^{{\scriptscriptstyle \textsf{SS}}}_k\) (SSM learning rate).

Lemma 7

Let f be as in Assumption 1. Then the stochastic Steffensen learning rate

$$\begin{aligned} \eta ^{{\scriptscriptstyle \textsf{SS}}}_k = \frac{1}{\sqrt{m}}\cdot \frac{\Vert \nabla f(x_k)\Vert ^2}{{\nabla f(x_k)}^{\scriptscriptstyle \textsf{T}}(\nabla f(x_k + \nabla f(x_k)) - \nabla f(x_k))} \end{aligned}$$

satisfies

$$\begin{aligned} \frac{1}{\sqrt{m}L} \le \eta ^{{\scriptscriptstyle \textsf{SS}}}_k \le \frac{1}{\sqrt{m}\mu }. \end{aligned}$$

Proof

Since \(\nabla f\) is L-Lipschitz, a lower bound is given by

$$\begin{aligned} \eta ^{{\scriptscriptstyle \textsf{SS}}}_k \ge \frac{1}{\sqrt{m}}\cdot \frac{\Vert \nabla f(x_k)\Vert ^2}{L\Vert \nabla f(x_k)\Vert ^2} = \frac{1}{\sqrt{m}L}. \end{aligned}$$

The required upper bound follows from the \(\mu \)-strong convexity of f. \(\square \)

Corollary 8

(Linear convergence of SSM) Let f be as in Assumption 1 with \(x^*{:}{=}\mathop {\textrm{argmin}}\limits _{x} f(x)\). If m is chosen so that

$$\begin{aligned} \rho {:}{=}\frac{(5+4/m)\kappa }{\sqrt{m} -4 \kappa } < 1, \end{aligned}$$

where \(\kappa = L / \mu \) is the condition number, then Algorithm 1 converges linearly in expectation with

$$\begin{aligned} \mathbb {E}[f(x_k) - f(x^*)]\le \rho ^k[f(x_0) - f(x^*)]. \end{aligned}$$

Proof

By Theorem 6, we have

$$\begin{aligned} \mathbb {E}[f(x_{k+1}) - f(x^*)] \le \left[ \frac{1}{m\mu \eta ^{{\scriptscriptstyle \textsf{SS}}}_k(1 - 4L\eta ^{{\scriptscriptstyle \textsf{SS}}}_k)} + \frac{4(m+1)L\eta ^{{\scriptscriptstyle \textsf{SS}}}_k}{m(1-4L\eta ^{{\scriptscriptstyle \textsf{SS}}}_k)}\right] \mathbb {E}[f(x_k) - f(x^*)] \end{aligned}$$

as long as \(\eta ^{{\scriptscriptstyle \textsf{SS}}}_k < 1/(4L)\). Lemma 7 shows that this holds for \(m > 16\kappa ^2\), which follows from \(\rho <1\). The upper and lower bounds in Lemma 7 also give

$$\begin{aligned} \rho ^{{\scriptscriptstyle \textsf{SS}}}_k&= \frac{1}{m\mu \eta ^{{\scriptscriptstyle \textsf{SS}}}_k(1 - 4L\eta ^{{\scriptscriptstyle \textsf{SS}}}_k)} + \frac{4(m+1)L\eta ^{{\scriptscriptstyle \textsf{SS}}}_k}{m(1-4L\eta ^{{\scriptscriptstyle \textsf{SS}}}_k)}\\&\le \frac{1}{m\mu \frac{1}{\sqrt{m}L}(1-4L\frac{1}{\sqrt{m}\mu })} + \frac{4(m + 1)L\frac{1}{\sqrt{m}\mu }}{m(1-4L\frac{1}{\sqrt{m}\mu })} = \frac{(5+4/m)\kappa }{\sqrt{m} -4 \kappa }. \end{aligned}$$

Hence if m is chosen so that \(\rho < 1\), we have

$$\begin{aligned} \mathbb {E}[f(x_k) - f(x^*)]\le \rho ^k[f(x_0) - f(x^*)] \end{aligned}$$

as required. \(\square \)

3.2 Linear convergence of stochastic Steffensen–Barzilai–Borwein

The linear convergence of Algorithm  2 likewise follows from Theorem 6 with \(\eta _k = \eta ^{{\scriptscriptstyle \textsf{SSBB}}}_k\).

Lemma 9

Let f be as in Assumption 1. Then the stochastic Steffensen–Barzilai–Borwein learning rate

$$\begin{aligned} \eta ^{{\scriptscriptstyle \textsf{SSBB}}}_k = \frac{1}{\sqrt{m}}\cdot \frac{\beta _k\Vert \nabla f(x_k)\Vert ^2}{[\nabla f(x_k + \beta _k\nabla f(x_k)) - \nabla f(x_k)]^{\scriptscriptstyle \textsf{T}}\nabla f(x_k)} \end{aligned}$$

satisfies

$$\begin{aligned} \frac{1}{\sqrt{m}L} \le \eta _k^{{\scriptscriptstyle \textsf{SSBB}}} \le \frac{1}{\sqrt{m}\mu }. \end{aligned}$$

Proof

Similar to that of Lemma 7. \(\square \)

Corollary 10

(Linear convergence of SSBB) Let f be as in Assumption 1 with \(x^*{:}{=}\mathop {\textrm{argmin}}\limits _{x} f(x)\). If m and b are chosen so that

$$\begin{aligned} \rho {:}{=}\frac{(b + 4/m + 4)\kappa }{\sqrt{m}b - 4\kappa } < 1, \end{aligned}$$

where \(\kappa = L / \mu \) is the condition number, then Algorithm 2 converges linearly in expectation with

$$\begin{aligned} \mathbb {E}[f(x_k) - f(x^*)]\le \rho ^k[f(x_0) - f(x^*)]. \end{aligned}$$

Proof

Since SSBB is a special case of Algorithm 3, Theorem 6 gives

$$\begin{aligned} \mathbb {E}[f(x_{k+1}) - f(x^*)] \le \left[ \frac{b}{m\mu \eta ^{{\scriptscriptstyle \textsf{SSBB}}}_k(b - 4L\eta ^{{\scriptscriptstyle \textsf{SSBB}}}_k)} + \frac{4(m+1)L\eta ^{{\scriptscriptstyle \textsf{SSBB}}}_k}{m(b-4L\eta ^{{\scriptscriptstyle \textsf{SSBB}}}_k)}\right] \mathbb {E}[f(x_k) - f(x^*)] \end{aligned}$$

when \(\eta ^{{\scriptscriptstyle \textsf{SSBB}}}_k < b/(4\,L)\) and \(\eta ^{{\scriptscriptstyle \textsf{SSBB}}}_k < 1/L\). From Lemma 9, this is valid for \(m > \max (\kappa ^2, 16\kappa ^2/b^2)\), which holds because \(\rho < 1\). Also, from Lemma 9, we have

$$\begin{aligned} \rho ^{{\scriptscriptstyle \textsf{SSBB}}}_k&= \frac{b}{m\mu \eta ^{{\scriptscriptstyle \textsf{SSBB}}}_k(b - 4L\eta ^{{\scriptscriptstyle \textsf{SSBB}}}_k)} + \frac{4(m+1)L\eta ^{{\scriptscriptstyle \textsf{SSBB}}}_k}{m(b-4L\eta ^{{\scriptscriptstyle \textsf{SSBB}}}_k)} \\ {}&\le \frac{b}{m\mu \frac{1}{\sqrt{m}L}(b - 4L\frac{1}{\sqrt{m}\mu })} + \frac{4(m+1)L\frac{1}{\sqrt{m}\mu }}{m(b - 4L\frac{1}{\sqrt{m}\mu })} = \frac{(b + 4/m + 4)\kappa }{\sqrt{m}b - 4\kappa }. \end{aligned}$$

Hence if m and b are chosen so that \(\rho < 1\), we have

$$\begin{aligned} \mathbb {E}[f(x_k) - f(x^*)]\le \rho ^k[f(x_0) - f(x^*)] \end{aligned}$$

as required. \(\square \)

3.3 Optimal number of stochastic gradient evaluations

Observe that in the proofs of Corollaries 8 and 10, we may replace \(1/\sqrt{m}\) by \(1/m^p\) for any \(p \in (0,1)\) without affecting the linear convergence conclusion. More precisely, to reach \(\varepsilon \)-accuracy, the proofs of Corollaries 8 and 10 show that when we set \(b = O(1)\) and \(m = O\bigl (\max (\kappa ^{1/p}, \kappa ^{1/(1-p)})\bigr )\), then both SSM and SSBB require evaluation of

$$\begin{aligned} O\bigl ((n + \max (\kappa ^{1/p}, \kappa ^{1/(1-p)}))\log (1/\varepsilon )\bigr ) \end{aligned}$$

stochastic gradients. Clearly, this is minimized when \(p = 1/2\). It follows that for \(p = 1/2\), \(b = O(1)\), and \(m = O(\kappa ^2)\), both SSM and SSBB require evaluation of

$$\begin{aligned} O\bigl ((n + \kappa ^2)\log (1/\varepsilon )\bigr ), \end{aligned}$$

stochastic gradients to reach \(\varepsilon \)-accuracy.

3.4 Comparison with other methods

For a theoretical comparison with other methods on equal footing, we will have to limit ourselves to the ones that do not leave the step size \(\eta _k\) unspecified. This automatically excludes SGD and SVRG, which treat \(\eta _k\) as a hyperparameter to be tuned separately. A standard choice is to choose \(\eta _k\) to be the Barzilai–Borwein step size, resulting in the SVRG–BB method [47], which requires evaluation of

$$\begin{aligned} O\bigl ((n + \kappa ^3) \log (1/\varepsilon )\bigr ) \end{aligned}$$

stochastic gradients to achieve \(\varepsilon \)-accuracy when \(\kappa \) is sufficiently large. On the other hand, the SLBFGS method [28] requires evaluation of

$$\begin{aligned} O\bigl ((n + \kappa ^{2 + 2(d + h )}) \log (1/\varepsilon ) \bigr ), \end{aligned}$$

stochastic gradients, where h is the ‘history size’, i.e., the number of previous updates kept in LBFGS. Evidently both SVRG–BB and SLBFGS are at least an order of magnitude slower than SSM and SSBB as measured by the condition number \(\kappa \). It is worth noting that for a d-variate objective, the number of stochastic gradients required by SLBFGS depends on d. Our methods, like SVRG–BB, are free of such dependence. Note that the lower bound of \(O \bigl ( n + \sqrt{n\kappa } \log (1/\epsilon ) \bigr )\) in [18] and the upper bound of \(O \bigl ( (n + \sqrt{n\kappa }) \log (1/\epsilon ) \bigr )\) in [17] assume that the stochastic gradient is multiplied by a constant throughout. We restrict our comparison to adaptive methods, where the stochastic gradient is scaled by a nonconstant (scalar or matrix) multiple that changes from step to step.

4 Proximal variant

As shown in [45], SGD and SVRG may be easily extended to cover nondifferentiable objective functions of the form

$$\begin{aligned} F(x) = f(x) + R(x) = \frac{1}{n}\sum _{i=1}^nf_i(x) + R(x), \end{aligned}$$
(13)

where f satisfies Assumption 1 and R is a nondifferentiable function such as \(R(x) = \Vert x\Vert _1\). In this section we will see that SSBB may likewise be extended, and the linear convergence is preserved.

To solve (13), the proximal gradient method iterates

$$\begin{aligned} x_k = {{\,\textrm{prox}\,}}_{\eta R}\bigl (x_{k-1} - \eta \nabla f(x_{k-1})\bigr ), \end{aligned}$$

with a proximal map defined by

$$\begin{aligned} {{\,\textrm{prox}\,}}_{R}(y) = \mathop {\textrm{argmin}}\limits _{x\in \mathbb {R}^d}\biggl \{\frac{1}{2}\Vert x - y\Vert ^2 + R(x)\biggr \}. \end{aligned}$$
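For concreteness, here is a minimal sketch (ours, not the authors' code) of a single proximal gradient step for the common choice \(R(x) = \lambda \Vert x\Vert _1\) mentioned above; its proximal map is the entrywise soft-thresholding operator, and the gradient oracle grad_f and step size eta are assumed to be given.

```python
import numpy as np

def prox_l1(y, t):
    """Proximal map of t*||.||_1, i.e., entrywise soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def proximal_gradient_step(x, grad_f, eta, lam):
    """One step x <- prox_{eta*lam*||.||_1}(x - eta*grad_f(x))."""
    return prox_l1(x - eta * grad_f(x), eta * lam)
```

This soft-thresholding map is also the only proximal map needed for the \(l^1\)-regularized logistic regression in Sect. 5.4.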

As in [45], we replace the update rule \(x_{k,t+1} = x_{k,t} - \eta ^{{\scriptscriptstyle \textsf{SSBB}}}_kv_{k,t}\) in Algorithm 2 by

$$\begin{aligned} x_{k,t+1} = {{\,\textrm{prox}\,}}_{\eta ^{{\scriptscriptstyle \textsf{SSBB}}}_k R}\bigl (x_{k,t} - \eta ^{{\scriptscriptstyle \textsf{SSBB}}}_kv_{k,t}\bigr ). \end{aligned}$$
(14)
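Since Algorithm 4 is only referenced below and not typeset here, the following rough sketch illustrates its inner loop; it assumes, as in Algorithm 2, that \(v_{k,t}\) is an SVRG-style variance-reduced stochastic gradient, and it treats the step size \(\eta ^{{\scriptscriptstyle \textsf{SSBB}}}_k\) (computed in the outer loop) and the proximal map as given inputs. It is meant only to illustrate the update (14), not to reproduce the authors' implementation.

```python
import numpy as np

def prox_ssbb_inner_loop(x_k, minibatch_grad, full_grad, eta_k, prox_R, n, m, b, rng):
    """Sketch of the prox-SSBB inner loop built around the update (14).

    minibatch_grad(x, idx): (1/|idx|) * sum_{i in idx} grad f_i(x)
    full_grad:              grad f(x_k), the full gradient at the snapshot x_k
    prox_R(y, t):           the proximal map prox_{t R}(y)
    eta_k:                  the SSBB step size computed in the outer loop (not shown)
    """
    x = x_k.copy()
    for _ in range(m):
        idx = rng.integers(0, n, size=b)  # sample a minibatch of indices
        # SVRG-style variance-reduced stochastic gradient v_{k,t} (assumed form)
        v = minibatch_grad(x, idx) - minibatch_grad(x_k, idx) + full_grad
        x = prox_R(x - eta_k * v, eta_k)  # proximal update (14)
    return x
```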

We will see that the resulting algorithm, which we will call prox-SSBB, remains linearly convergent as long as the following assumption holds for some \(\mu > 0\).

Algorithm 4 Proximal Stochastic Steffensen–Barzilai–Borwein Method (prox-SSBB)

Assumption 2

The function R is \(\mu \)-strongly convex in the sense that

$$\begin{aligned} R(y) \ge R(x) + g(x)^{\scriptscriptstyle \textsf{T}}(y-x) + \frac{\mu }{2}\Vert y - x\Vert ^2 \end{aligned}$$

for all \(x\in {{\,\textrm{dom}\,}}(R)\), \(g(x)\in \partial R(x)\), and \(y\in \mathbb {R}^d\), with the convention that \(R(y) {:}{=}+\infty \) whenever \(y \notin {{\,\textrm{dom}\,}}(R)\). Here \(\partial R(x)\) denotes the subdifferential of R at x.

It is a standard fact [48, p. 340] that if R is a closed convex function on \(\mathbb {R}^d\), then

$$\begin{aligned} \Vert {{\,\textrm{prox}\,}}_R(x) - {{\,\textrm{prox}\,}}_R(y)\Vert \le \Vert x - y\Vert \end{aligned}$$
(15)

for all \(x,y\in \mathbb {R}^d\). We will write \(\mu _f\) for the strong convexity parameter of f in Assumption 1 and \(\mu _R\) for that of R in Assumption 2. Since strong convexity is additive, the overall objective function \(F = f + R\) is strongly convex with parameter \(\mu \ge \mu _f + \mu _R\).

To establish linear convergence for prox-SSBB, we need an analogue of Lemma 5, which is provided by [45, Lemma 3], reproduced here for easy reference.

Lemma 11

(Xiao–Zhang) Let f be as in Assumption 1, R as in Assumption 2, and \(F = f + R\) with \(x^*{:}{=}\mathop {\textrm{argmin}}\limits _{x} F(x)\). Let \(\Delta _{k,t} {:}{=}v_{k,t} - \nabla f(x_{k,t})\) and

$$\begin{aligned} g_{k,t} {:}{=}\frac{1}{\eta _k}(x_{k,t} - x_{k,t+1}) = \frac{1}{\eta _k}\bigl (x_{k,t} - {{\,\textrm{prox}\,}}_{\eta _kR}(x_{k,t} - \eta _k v_{k,t})\bigr ). \end{aligned}$$

If \(0< \eta _k < 1/L\), then

$$\begin{aligned} F(x^*)&\ge \ F(x_{k,t+1}) + g_{k,t}^{\scriptscriptstyle \textsf{T}}(x^* - x_{k,t}) + \frac{\eta _k}{2}\Vert g_{k,t}\Vert ^2\\&\quad + \frac{\mu _f}{2}\Vert x_{k,t} - x^*\Vert ^2 + \frac{\mu _R}{2}\Vert x_{k,t+1} - x^*\Vert ^2 + \Delta _{k,t}^{\scriptscriptstyle \textsf{T}}(x_{k,t+1} - x^*). \end{aligned}$$

Corollary 12

(Linear convergence of prox-SSBB) Let F and \(x^*\) be as in Lemma 11 and \(\eta _k =\eta ^{{\scriptscriptstyle \textsf{SSBB}}}_k\). Then Corollary 10 holds verbatim with F in place of f.

Proof

To apply Lemma 11, we need \(\eta _k\le 1/L\), and this holds since \(m \ge (L/\mu )^2 = \kappa ^2\) is among the assumptions of Lemma 9. In the notation of Lemma 11, the update (14) is equivalent to \(x_{k,t+1} = x_{k,t} - \eta _kg_{k,t}\). So

$$\begin{aligned} \Vert x_{k,t+1} - x^*\Vert ^2 = \Vert x_{k,t} - x^*\Vert ^2 - 2\eta _kg_{k,t}^{\scriptscriptstyle \textsf{T}}(x_{k,t} - x^*) + \eta _k^2\Vert g_{k,t}\Vert ^2. \end{aligned}$$

By Lemma 11, we have

$$\begin{aligned}&\quad - g_{k,t}^{\scriptscriptstyle \textsf{T}}(x_{k,t} - x^*) + \frac{\eta _k}{2}\Vert g_{k,t}\Vert ^2\\ {}&\le \ F(x^*) - F(x_{k,t+1}) - \frac{\mu _f}{2}\Vert x_{k,t} - x^*\Vert ^2 - \frac{\mu _R}{2}\Vert x_{k,t+1}-x^*\Vert ^2 - \Delta _{k,t}^{\scriptscriptstyle \textsf{T}}(x_{k,t+1} - x^*). \end{aligned}$$

Therefore,

$$\begin{aligned} \Vert x_{k,t+1} - x^*\Vert ^2 \le \Vert x_{k,t} - x^*\Vert ^2 - 2\eta _k\Delta _{k,t}^{\scriptscriptstyle \textsf{T}}(x_{k,t+1} - x^*) + 2\eta _k[F(x^*) - F(x_{k,t+1})]. \end{aligned}$$

We bound the middle term on the right. Let \(\bar{x}_{k,t+1} {:}{=}{{\,\textrm{prox}\,}}_{\eta _kR}(x_{k,t} - \eta _k\nabla f(x_{k,t}))\). Then

$$\begin{aligned} \begin{aligned} - 2\eta _k\Delta _{k,t}^{\scriptscriptstyle \textsf{T}}(x_{k,t+1} - x^*)&= -2\eta _k\Delta _{k,t}^{\scriptscriptstyle \textsf{T}}(x_{k,t+1} - \bar{x}_{k,t+1}) - 2\eta _k\Delta _{k,t}^{\scriptscriptstyle \textsf{T}}(\bar{x}_{k,t+1} - x^*)\\&\le 2\eta _k\Vert \Delta _{k,t}\Vert \Vert x_{k,t+1} - \bar{x}_{k,t+1}\Vert - 2\eta _k\Delta _{k,t}^{\scriptscriptstyle \textsf{T}}(\bar{x}_{k,t+1} - x^*)\\&\le 2\eta _k\Vert \Delta _{k,t}\Vert \Vert (x_{k,t} - \eta _kv_{k,t}) - (x_{k,t} - \eta _k\nabla f(x_{k,t}))\Vert \\&\qquad \qquad - 2\eta _k\Delta _{k,t}^{\scriptscriptstyle \textsf{T}}(\bar{x}_{k,t+1} - x^*)\\&= 2\eta _k^2\Vert \Delta _{k,t}\Vert ^2 - 2\eta _k\Delta _{k,t}^{\scriptscriptstyle \textsf{T}}(\bar{x}_{k,t+1} - x^*), \end{aligned} \end{aligned}$$

where the first inequality is Cauchy–Schwarz and the second follows from the nonexpansiveness of the proximal map in (15). The remaining steps are as in the proofs of Theorem 6 and Corollary 10 with F in place of f. \(\square \)

5 Numerical experiments

As mentioned earlier, for smooth objectives, our method of choice is Algorithm 2, the stochastic Steffensen–Barzilai–Borwein method (SSBB) with minibatching. We will compare it with several benchmark algorithms: stochastic gradient descent (SGD), stochastic variance reduced gradient (SVRG) [27], stochastic LBFGS [28], and the first two with the Barzilai–Borwein step size (SGD–BB and SVRG–BB) [47]. For nonsmooth objectives, we compare Algorithm 4, the proximal stochastic Steffensen–Barzilai–Borwein method (prox-SSBB), with proximal variants of the previously mentioned algorithms: prox-SGD, prox-SVRG [45], prox-SLBFGS, and prox-SVRG–BB.

We test these algorithms on popular empirical risk minimization problems (ridge regression, logistic regression, support vector machines with squared hinge loss, and \(l^1\)-regularized logistic regression) on standard datasets from LIBSVM. The parameters involved are summarized in Table 1. Our experiments show that SSBB and prox-SSBB compare favorably with these benchmark algorithms. All our codes are available at https://github.com/Minda-Zhao/stochastic-steffensen.

Table 1 Sample size n, dimension d, batch size b, \(l_2\)-regularization parameter \(\lambda _2\), \(l_1\)-regularization parameter \(\lambda _1\)

For a fair comparison, all algorithms are minibatched. We set a batch size of \(b = 4\) for ridge regression, \(b = 16\) for logistic loss and squared hinge loss, and \(b=32\) for \(l^1\)-regularized logistic loss. The inner loop size is set at \(m = 2n\) or 4n. The learning rates in SGD, SVRG, and SLBFGS are hyperparameters that require separate tuning; we pick the best possible values with a grid search. SLBFGS requires more hyperparameters: as suggested by the authors of [28], we set the Hessian update interval to \(L = 10\), the Hessian batch size to \(b_H= Lb\), and the history size to \(h = 10\). All experiments are initialized with \(x_0 =0\). We repeat every experiment ten times and report average results.
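For reference, the hyperparameters just described can be collected in a small configuration; the values below are exactly those stated above, and the learning rates of SGD, SVRG, and SLBFGS are omitted since they are tuned separately by grid search.

```python
# Hyperparameter settings used in the experiments of Sect. 5.
EXPERIMENT_CONFIG = {
    "batch_size": {"ridge": 4, "logistic": 16, "squared_hinge": 16, "l1_logistic": 32},
    "inner_loop_size": "2n or 4n",      # m, as a multiple of the sample size n
    "slbfgs": {
        "hessian_update_interval": 10,  # L = 10, as suggested in [28]
        "hessian_batch_size": "10 * b", # b_H = L * b
        "history_size": 10,             # h = 10
    },
    "initialization": "x0 = 0",
    "repetitions": 10,
}
```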

In all figures, we present the convergence trajectory of each method. The vertical axis represents, on a log scale, the value \(f(x_k) - f(x^*)\), where we estimate \(f(x^*)\) by running full gradient descent or Newton's method multiple times. The horizontal axis represents computational cost, measured either by the number of gradient computations divided by n or by the actual running time; we present both. In all experiments, the convergence trajectories of SSBB and prox-SSBB agree with the linear convergence established in Sects. 3 and 4.

Fig. 1 Ridge regression on synthetic dataset regularized with \(\lambda _2 = 10^{-5}\). Left: number of passes through data. Right: running time

5.1 Ridge regression

Figure 1 shows a simple ridge regression on a synthetic dataset generated in a controlled way so that the ground truth is known. We generate \(x^* \in \mathbb {R}^d\) with \(x^*_i\sim \mathcal {N}(0,1)\) and \(A\in \mathbb {R}^{n\times d}\) with row vectors \(a_1,\dots ,a_n \in \mathbb {R}^d\) and entries \(a_{ij} \sim \mathcal {N}(0,1)\). We form \(y = Ax^* + b\) with the noise vector b an n-dimensional standard normal variate. We then attempt to recover \(x^*\) from A and y by optimizing, with \(\lambda _2 = 10^{-5}\),

$$\begin{aligned} \min _{x\in \mathbb {R}^d}\frac{1}{n}\sum _{i=1}^n (y_i - a_i^{\scriptscriptstyle \textsf{T}}x)^2 + \frac{\lambda _2}{2}\Vert x\Vert _2^2. \end{aligned}$$
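The synthetic setup can be reproduced with a few lines of NumPy (a sketch following the stated distributions; the particular n and d below are placeholders, the actual values being those in Table 1).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam2 = 10_000, 100, 1e-5             # n, d are placeholders; see Table 1

x_star = rng.standard_normal(d)            # ground truth x* with x*_i ~ N(0, 1)
A = rng.standard_normal((n, d))            # rows a_1, ..., a_n with entries ~ N(0, 1)
y = A @ x_star + rng.standard_normal(n)    # y = A x* + standard normal noise

def ridge_objective(x):
    """(1/n) * sum_i (y_i - a_i^T x)^2 + (lam2/2) * ||x||_2^2"""
    r = y - A @ x
    return r @ r / n + 0.5 * lam2 * (x @ x)
```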

5.2 Logistic regression

Figure 2 shows the results of a binary classification problem on the w6a dataset from LIBSVM using an \(l^2\)-regularized binary logistic regression. The associated optimization problem with regularization \(\lambda _2 = 10^{-4}\) and labels \(y_i\in \{-1, +1\}\) is

$$\begin{aligned} \min _{x\in \mathbb {R}^d}\frac{1}{n}\sum _{i=1}^n\log \bigl (1 + e^{-y_i(a_i^{\scriptscriptstyle \textsf{T}}x)}\bigr ) + \frac{\lambda _2}{2}\Vert x\Vert _2^2. \end{aligned}$$
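For reference, a minimal sketch (not the authors' code) of the per-sample \(l^2\)-regularized logistic loss and its gradient, which is all that the stochastic methods compared here require:

```python
import numpy as np

def logistic_loss_and_grad(x, a_i, y_i, lam2):
    """Per-sample l2-regularized logistic loss and gradient, y_i in {-1, +1}.
    (No overflow safeguards; this is only a sketch.)"""
    margin = y_i * (a_i @ x)
    loss = np.log1p(np.exp(-margin)) + 0.5 * lam2 * (x @ x)
    sigma = 1.0 / (1.0 + np.exp(margin))   # d/dm log(1 + e^{-m}) = -sigma
    grad = -y_i * sigma * a_i + lam2 * x
    return loss, grad
```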
Fig. 2 \(l^2\)-regularized logistic regression on w6a dataset from LIBSVM regularized with \(\lambda _2 = 10^{-4}\). Left: number of passes through data. Right: running time

Fig. 3 \(l^2\)-regularized squared hinge loss on a6a from LIBSVM regularized with \(\lambda _2 = 10^{-3}\). Left: number of passes through data. Right: running time

5.3 Squared hinge loss

Figure 3 shows the results of a support vector machine classifier with \(l^2\)-regularized squared hinge loss and \(\lambda _2=10^{-3}\) on the a6a dataset from LIBSVM. The optimization problem in this case is

$$\begin{aligned} \min _{x\in \mathbb {R}^d} \frac{1}{n} \sum _{i=1}^{n}[(1-y_{i} a_{i}^{\scriptscriptstyle \textsf{T}}x)_{+}]^2+\frac{\lambda _2}{2}\Vert x\Vert _2^2. \end{aligned}$$
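Similarly, a minimal sketch of the per-sample \(l^2\)-regularized squared hinge loss and its gradient; unlike the plain hinge loss, the squared hinge loss is differentiable, so ordinary gradients suffice.

```python
import numpy as np

def squared_hinge_loss_and_grad(x, a_i, y_i, lam2):
    """Per-sample l2-regularized squared hinge loss and gradient, y_i in {-1, +1}."""
    slack = max(0.0, 1.0 - y_i * (a_i @ x))     # (1 - y_i a_i^T x)_+
    loss = slack ** 2 + 0.5 * lam2 * (x @ x)
    grad = -2.0 * y_i * slack * a_i + lam2 * x  # vanishes when the margin is satisfied
    return loss, grad
```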

The results are clear: in all experiments, SSBB solves the problems to high accuracy and is the fastest, whether measured by running time or by the number of passes through the data. When measured by running time, SLBFGS performs relatively poorly because of the additional computational cost of its matrix–vector products, which the other methods avoid.

5.4 Proximal variant

Figure 4 shows the results of a binary classification problem on the w6a dataset from LIBSVM using a binary logistic regression with both \(l^2\)- and \(l^1\)-regularizations, a problem considered in [45]:

$$\begin{aligned} \min _{x\in \mathbb {R}^d}\frac{1}{n}\sum _{i=1}^n\log \bigl (1 + e^{-y_i(a_i^{\scriptscriptstyle \textsf{T}}x)}\bigr ) + \frac{\lambda _2}{2}\Vert x\Vert _2^2 + \lambda _1 \Vert x\Vert _1. \end{aligned}$$

We set the regularization parameters to \(\lambda _2 = 10^{-4}\) and \(\lambda _1 = 10^{-4}\).

Fig. 4 Logistic regression with \(l^2\) and \(l^1\) regularizations on w6a dataset from LIBSVM regularized with \(\lambda _2 = 10^{-4}\) and \(\lambda _1 = 10^{-4}\). Left: number of passes through data. Right: running time

The results obtained are consistent with those in Sects. 5.1, 5.2 and 5.3, demonstrating that prox-SSBB solves the problem to high levels of accuracy and is the fastest among all algorithms compared, whether measured by running time or by the number of passes through data.

6 Conclusion

The stochastic Steffensen methods introduced in this article are (i) simple to implement, (ii) efficient to compute, (iii) easy to incorporate, and (iv) tailored for massive data and high dimensions; they have (v) minimal memory requirements and (vi) a negligible number of hyperparameters to tune. The last point is in contrast to more sophisticated methods involving moments [31, 32] or momentum [33,34,35,36], which require heavy tuning of many more hyperparameters. SSM and SSBB require just two: the minibatch size b and the inner loop size m.

Point (iii) also deserves special mention. Since SSM and SSBB are ultimately encapsulated in the respective learning rates \(\eta _k^{{\scriptscriptstyle \textsf{SS}}}\) and \(\eta _k^{{\scriptscriptstyle \textsf{SSBB}}}\), they are versatile enough to be incorporated into other methods such as those in [31,32,33,34,35,36], assuming that we are willing to pay the price in hyperparameter tuning. We hope to explore this in future work.