1 Introduction and Background

Consider the following composite convex optimization problem

$$\begin{aligned} \min _{x\in \mathbb {R}^d} \left\{ F(x) \equiv f(x) + R(x) \right\} , \end{aligned}$$
(1)

where \(f: \mathbb {R}^{d} \rightarrow \mathbb {R}\) is smooth and convex and \(R: \mathbb {R}^{d} \rightarrow (-\infty , \infty ]\) is a proper closed convex function with an easy-to-compute proximal operator. This problem often arises in training machine learning models, where f is a loss function and R is a regularization term, e.g., \(\ell _1\)-regularized logistic regression [33], LASSO regression [41] and elastic net regression [47]. It also covers projected gradient descent: if R is the indicator function of a convex set, the proximal step reduces to a projection onto that set.

A natural algorithm well suited for solving (1) is proximal gradient descent, which iteratively takes a step in the direction of steepest descent followed by a proximal step. Unfortunately, this method requires computing the gradient \(\nabla f\) at each iteration, which can be computationally expensive or even impossible in several settings. This has sparked interest in developing cheaper, practical methods that need only a stochastic unbiased estimate \(g_k \in \mathbb {R}^d\) of the gradient at each iteration. These methods can be written as

$$\begin{aligned} x_0 \in \mathbb {R}^d, \quad x_{k+1} = \textrm{prox}_{\gamma _k R} \left( x_k - \gamma _k g_k \right) , \end{aligned}$$
(2)

where \(\left( \gamma _k \right) _k\) is a sequence of step sizes. This estimate \(g_k\) can take on many different forms depending on the problem of interest. Here we list a few.
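Before detailing these settings, we give a minimal sketch of the template (2), assuming a user-supplied unbiased gradient estimator and proximal operator (the names grad_estimate and prox are ours, for illustration):

```python
import numpy as np

def prox_sgd(x0, grad_estimate, prox, step_size, num_iters):
    """Template (2): x_{k+1} = prox_{gamma_k R}(x_k - gamma_k * g_k)."""
    x = np.array(x0, dtype=float)
    for k in range(num_iters):
        gamma = step_size(k)            # step size gamma_k
        g = grad_estimate(x, k)         # unbiased estimate g_k of grad f(x_k)
        x = prox(x - gamma * g, gamma)  # proximal step
    return x
```

Each of the settings below corresponds to a different choice of grad_estimate.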

1.1 Stochastic Approximation

Most machine learning problems can be cast as minimizing the generalization error of some underlying model, where \(f_z(x)\) is the loss over a sample z and

$$\begin{aligned} f(x) = \mathbb {E}_{z \sim \mathcal {D}} \left[ f_{z}(x)\right] . \end{aligned}$$
(3)

Since \(\mathcal {D}\) is an unknown distribution, computing this expectation is impossible in general. However, by sampling \(z \sim \mathcal {D}\), we can compute a stochastic gradient \(\nabla f_z(x)\). Using method (2) with \(g_k = \nabla f_{z_k}(x_k)\) and \(R \equiv 0\) gives the simplest stochastic gradient method: stochastic gradient descent (SGD) [29, 35].

1.2 Finite-Sum Minimization

Since the expectation (3) cannot be computed in general, one well-studied approach is to approximate it with a Monte Carlo estimator:

$$\begin{aligned} f(x) = \frac{1}{n}\sum _{i=1}^n f_i(x), \end{aligned}$$
(4)

where n is the number of samples and \(f_i(x)\) is the loss at x on the ith drawn sample. When R is a regularization function, problem (1) with f defined in (4) is often referred to as regularized empirical risk minimization (R-ERM) [39]. For the approximation (4) to be accurate, we would like n to be as large as possible, which in turn makes computing the full gradient extremely costly. In this setting, for low-precision problems, SGD scales much more favorably than gradient descent (GD), since an iteration of SGD requires \(\mathcal {O}(d)\) flops compared to \(\mathcal {O}(nd)\) for GD. Moreover, techniques such as importance sampling and minibatching [12, 21, 28, 46] have made SGD the preferred choice for solving problem (1) with f as in (4). However, one major drawback of SGD is that, with a fixed step size, it does not converge: the iterates oscillate in a neighborhood of a minimizer. To remedy this problem, variance reduced methods [3, 8, 19, 32, 36] were developed. These algorithms get the best of both worlds: the global convergence properties of GD and the low per-iteration cost of SGD. In the smooth case, they all share the distinguishing property that the variance of their stochastic gradients \(g_k\) converges to 0. This feature allows them to converge to a minimizer with a fixed step size, at the cost of some extra storage or computation compared to SGD.

1.3 Distributed Optimization

Another setting where the exact gradient \(\nabla f\) is impossible to compute is distributed optimization. The objective function in distributed optimization can be formulated exactly as (4), where each \(f_i\) is the loss on the data stored on the ith node. Each node computes the gradient of its local loss, and these gradients are aggregated by the master node. When the number of nodes n is large, the bottleneck of the optimization becomes the cost of communicating the individual gradients. To remedy this issue, various compression techniques were proposed [1, 2, 15, 22, 38, 43, 45], most of which can be modeled as applying a random transformation \(Q: \mathbb {R}^d \rightarrow \mathbb {R}^d\) to each gradient \(\nabla f_i(x_k)\) or to a noisy estimate \(g_i^k\) of this gradient. Thus, many proximal quantized stochastic gradient methods fit the form (2) with

$$g_k = \sum _{i=1}^n Q(g_i^k).$$

While quantized stochastic gradient methods have been widely used in machine learning applications, it was not until the DIANA algorithm [26, 27] that a distributed method was shown to converge to the neighborhood of a minimizer for strongly convex functions. Moreover, in the case where each \(f_i\) is itself a finite average of local functions, variance reduced versions of DIANA, called VR-DIANA [18], were recently developed and proved to converge sublinearly with a fixed step size for convex functions.
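For illustration, one standard compression operator of this type is random sparsification, which keeps s out of d coordinates chosen uniformly at random and rescales them by d/s so that the compressed vector remains unbiased; a minimal sketch (the construction is standard, and \(w = d/s - 1\) is the usual variance constant attached to it):

```python
import numpy as np

def rand_s_sparsify(v, s, rng):
    """Random-s sparsification: E[Q(v)] = v and
    E||Q(v) - v||^2 = (d/s - 1) * ||v||^2."""
    d = v.size
    mask = np.zeros(d)
    mask[rng.choice(d, size=s, replace=False)] = d / s  # keep s coords, rescale
    return mask * v
```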

1.4 High-Dimensional Function Minimization

Lastly, regardless of the structure of f, if the dimension d of the problem is very high, it is sometimes impossible to compute or even to store the gradient at each iteration. Instead, in some cases, one can efficiently compute some coordinates of the gradient and perform a gradient step on the selected coordinates only. These methods are known as (randomized) coordinate descent (RCD) methods [30, 44]. They also fit the form (2), for example, with

$$g_k = \nabla _{i_k} f(x_k)\, e_{i_k},$$

where \(\left( e_{i} \right) _i\) is the canonical basis of \(\mathbb {R}^d\), \(\nabla _{i} f\) denotes the ith partial derivative of f, and \(i_k \in [d]\) is sampled randomly at each iteration. Though RCD methods fit the form (2), their analysis is often very different from that of other stochastic gradient methods. One exception to this observation is SEGA [16], the first RCD method known to converge for strongly convex functions with nonseparable regularizers.
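For illustration, a minimal sketch of a coordinate estimator of this kind: with \(i_k\) sampled uniformly, the raw estimate \(\nabla _{i_k} f(x_k) e_{i_k}\) has expectation \(\nabla f(x_k)/d\), so we rescale by d to make it unbiased (the rescaling and function names are our additions; methods such as SEGA correct for this bias differently):

```python
import numpy as np

def rcd_estimate(partial_derivative, x, rng):
    """Unbiased coordinate estimate: g = d * (df/dx_i)(x) * e_i."""
    d = x.size
    i = rng.integers(d)                  # i_k sampled uniformly from [d]
    g = np.zeros(d)
    g[i] = d * partial_derivative(x, i)  # rescaled partial derivative
    return g
```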

While all the methods presented above have been discovered and analyzed independently, most of them rely on the same assumptions and share a similar analysis. It is this observation and the results derived for strongly convex functions by [11] that motivate this work.

2 Contributions

We now summarize the key contributions of this paper.

2.1 Unified Analysis of Stochastic Gradient Algorithms

Under a unified assumption on the gradient estimates \(g_k\), it was shown by [11] that stochastic gradient methods which fit the form (2) converge linearly to a neighborhood of the minimizer for quasi-strongly convex functions when using a fixed step size. We extend this line of work to the convex setting, and further generalize it by allowing for decreasing step sizes. As a result, for all the methods which satisfy our assumptions, we are able to prove either sublinear convergence to a neighborhood of the minimum with a fixed step size, or exact convergence with decreasing step sizes.

2.2 Analysis of SGD Without the Bounded Gradients Assumption

Most existing analyses of SGD assume a uniform bound on the second moments of the stochastic gradients or on their variance. Indeed, for the analysis of stochastic (sub)gradient descent, such a bound is often necessary to apply the classical convergence proofs. However, for large classes of convex functions, these assumptions have been shown not to hold [20, 31]. As a result, there has been a recent surge of work aiming to avoid these assumptions for several classes of smooth functions: strongly convex [14, 25, 31], convex [14, 25, 40, 42], and even nonconvex functions [20, 24, 25]. Surprisingly, a general analysis of SGD with proximal iterations for convex functions without these bounded gradient assumptions is still lacking. As a special case of our unified analysis, assuming only convexity and smoothness, we provide a general analysis of proximal SGD in the convex setting. Moreover, using the arbitrary sampling framework of [12], we are able to prove convergence rates for SGD under minibatching, importance sampling, or virtually any form of sampling.

2.3 Extension of the Analysis of Existing Algorithms to the Convex Case

As another special case of our analysis, we also provide the first convergence rates for the (variance reduced) stochastic coordinate descent method SEGA [16] and the distributed (variance reduced) compressed SGD method DIANA [26] in the convex setting. Our results can also be applied to all the recent methods developed in [11].

2.4 Optimal Minibatches for L-SVRG and SAGA in the Convex Setting

With a unifying convergence theory in hand, we can now ask sweeping questions across families of algorithms. We demonstrate this by answering the question

What is the optimal minibatch size for variance reduced methods?

Recently, precise estimates of the minibatch sizes which minimize the total complexity of SAGA [8] and SVRG [4, 19, 34] applied to strongly convex functions were derived in [9] and [37]. We showcase the flexibility of our unifying framework by deriving new optimal minibatch sizes for SAGA [8] and L-SVRG [17, 23] in the general convex setting. Unlike prior work in the strongly convex setting [9, 37], our resulting optimal minibatch sizes can be computed using only the smoothness constants. To verify the validity of our claims, we show through extensive experiments that our theoretically derived optimal minibatch sizes are competitive against a grid search.

3 Unified Analysis for Proximal Stochastic Gradient Methods

3.1 Notation

The Bregman divergence associated with f is the mapping

$$D_{f} (x, y) \; \overset{\text {def}}{=}\; f(x) - f(y) - \left\langle \nabla f(y), x - y \right\rangle , \quad x,y\in \mathbb {R}^d,$$

and the proximal operator of \(\gamma R\) is the function

$$\textrm{prox}_{\gamma R} \left( x \right) \; \overset{\text {def}}{=}\; {{\,\mathrm{\arg \!\min }\,}}_{u \in \mathbb {R}^d} \left\{ \gamma R(u) + \frac{1}{2} \left\Vert x-u\right\Vert ^2\right\} .$$
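For example, for the \(\ell _1\)-regularizer \(R = \lambda \left\Vert \cdot \right\Vert _1\), this definition yields the well-known coordinate-wise closed form (soft-thresholding):

$$\textrm{prox}_{\gamma \lambda \left\Vert \cdot \right\Vert _1} \left( x \right) _i = \textrm{sign}(x_i) \max \left( |x_i| - \gamma \lambda ,\, 0\right) , \quad i \in [d].$$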

Let \([n] \overset{\text {def}}{=}\left\{ 1,\dots ,n\right\} \). We denote the expectation of a random variable X by \(\mathbb {E}_{} \left[ X \right] \) and the conditional expectation of X given a random variable Y by \(\mathbb {E}_{} \left[ X \mid Y\right] \).

[11] analyze stochastic gradient methods that fit the form (2) for smooth quasi-strongly convex functions. In this work, we extend these results to the general convex setting. We formalize our assumptions on f and R in the following.

Assumption 1

The function f is L–smooth and convex:

$$\begin{aligned} f(y)&\le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \left\Vert y-x\right\Vert ^2, \quad \text{ for } \text{ all } x,y \in \mathbb {R}^d, \end{aligned}$$
(5)
$$\begin{aligned} f(y )&\ge f(x) + \langle \nabla f(x), y - x \rangle , \quad \text{ for } \text{ all } x,y \in \mathbb {R}^d. \end{aligned}$$
(6)

The function R is convex:

$$\begin{aligned} R(\alpha x + (1- \alpha ) y) \le \alpha R(x) + (1-\alpha ) R(y), \quad \text{ for } \text{ all } x,y \in \mathbb {R}^d, \alpha \in [0,\,1]. \end{aligned}$$

When f has the form (4), we assume that for all \(i \in [n]\), \(f_i\) is \(L_i\)-smooth and convex, and we denote \(L_{\max } \overset{\text {def}}{=}\underset{i\in [n]}{\max }\,L_i\).

The innovation introduced by [11] is the following unifying assumption on the stochastic gradients \(g_k\) used in (2), which makes it possible to simultaneously analyze classical SGD, variance reduced methods, quantized stochastic gradient methods, and some randomized coordinate descent methods.

Assumption 2

(Assumption 4.1 in [11]) Consider the iterates \(\left( x_k \right) _k\) and gradients \(\left( g_k \right) _k\) in (2).

  1. The gradient estimates are conditionally unbiased:

    $$\begin{aligned} \mathbb {E}_{} \left[ g_k \mid x_k\right] = \nabla f(x_k). \end{aligned}$$
    (7)
  2. There exist constants \(A,B,C,D_1, D_2, \rho \ge 0\) and a sequence of random variables \(\sigma ^2_k \ge 0 \) such that, for all minimizers \(x_{*}\) of F:

    $$\begin{aligned} \mathbb {E}_{} \left[ \left\Vert g_k - \nabla f(x_*)\right\Vert ^2 \mid x_k \right]&\le 2 A D_{f} (x_k, x_*) + B \sigma _k^2 + D_1, \end{aligned}$$
    (8)
    $$\begin{aligned} \mathbb {E}_{} \left[ \sigma _{k+1}^2 \mid x_k \right]&\le \left( 1 - \rho \right) \sigma _k^2 + 2 C D_{f} (x_k, x_*) + D_2. \end{aligned}$$
    (9)

Though we chose to present Eqs. (7), (8) and (9) as an assumption, we show throughout the main paper and in the appendix that, for all the algorithms we consider (excluding DIANA), these equations hold with known constants whenever Assumption 1 holds. An extensive yet nonexhaustive list of algorithms satisfying Assumption 2, with the corresponding constants, can be found in [11, Table 2]. In Section B of the appendix, we report these constants for five algorithms: SGD, the variance reduced methods L-SVRG and SAGA, the distributed method DIANA, and the coordinate descent-type method SEGA.

We now state our main theorem.

Theorem 3.1

Suppose that Assumptions 1 and 2 hold. Let \(M \overset{\text {def}}{=}B/\rho \) and let \((\gamma _k)_{k\ge 0}\) be a decreasing, strictly positive sequence of step sizes chosen such that

$$\begin{aligned} 0< \gamma _0 < \min \left\{ \frac{1}{2 (A + MC)}, \frac{1}{L} \right\} . \end{aligned}$$

The iterates given by (2) satisfy

$$\begin{aligned} \mathbb {E}_{} \left[ F(\bar{x}_t) - F(x_*)\right] \le \frac{{\left\Vert x_0 - x_*\right\Vert }^2 + 2\gamma _0\left( \delta _0 + \gamma _0M\sigma _0^2 \right) + 2\left( D_1 + 2 M D_2 \right) \sum \limits _{k=0}^{t-1}\gamma _k^2}{2\sum \nolimits _{i=0}^{t-1}\left( 1 - 2\gamma _i\left( A+MC \right) \right) \gamma _i}, \end{aligned}$$
(10)

where \(\bar{x}_t \overset{\text {def}}{=}\sum \nolimits _{k=0}^{t-1} \frac{\left( 1 - 2\gamma _k(A+MC) \right) \gamma _k}{\sum \nolimits _{i=0}^{t-1}\left( 1 - 2\gamma _i(A+MC) \right) \gamma _i}x_k\) and \({\delta _0 \overset{\text {def}}{=}F(x_0) - F(x_*)}\).

The proof of Theorem 3.1 is deferred to the appendix (Section C).
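Note that the guarantee (10) is stated for the weighted average \(\bar{x}_t\) rather than the last iterate. For concreteness, here is a minimal sketch of how \(\bar{x}_t\) can be computed from stored iterates, with the weights taken from the definition above (names are ours; in practice the average would be maintained online to avoid storing all iterates):

```python
import numpy as np

def weighted_average(iterates, gammas, A, M, C):
    """Weighted average of Theorem 3.1: xbar_t is the convex combination of
    x_0, ..., x_{t-1} with weights w_k = (1 - 2*gamma_k*(A + M*C)) * gamma_k."""
    gammas = np.asarray(gammas)
    w = (1.0 - 2.0 * gammas * (A + M * C)) * gammas
    X = np.asarray(iterates)                 # shape (t, d)
    return (w[:, None] * X).sum(axis=0) / w.sum()
```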

4 The Main Corollaries

In contrast to [11], our analysis allows both constant and decreasing step sizes. In this section, we will present two corollaries corresponding to these two choices of step sizes and discuss the resulting convergence rates depending on the constants obtained from Assumption 2. Then, we specialize our theorem to SGD, which allows us to recover the first analysis of proximal SGD without the bounded gradients or bounded gradient variance assumptions in the general convex setting. We apply the same analysis to DIANA and present convergence results for this algorithm in the convex setting.

First, we show that by using a constant step size the average of iterates of any stochastic gradient method of the form (2) satisfying Assumptions 1 and 2 converges sublinearly to the neighborhood of the minimum.

Corollary 4.1

Consider the setting of Theorem 3.1. Let \(M = B/\rho \). Choose step sizes \(\gamma _k = \gamma > 0\) for all k, where \({\gamma \le \min \left\{ \frac{1}{4 (A + MC)}, \frac{1}{2L} \right\} }\). Then, substituting into the rate (10), we have

$$\begin{aligned} \mathbb {E}_{} \left[ F(\bar{x}_t) - F(x_*)\right] \le \frac{2\gamma \left( \delta _0 + \gamma M\sigma _0^2 \right) +{\left\Vert x_0 - x_*\right\Vert }^2}{\gamma t} + 2 \gamma \left( D_1 + M D_2 \right) . \end{aligned}$$

One can already see that to ensure convergence with a fixed step size, we need to have \(D_1 = D_2 = 0\). The only known stochastic gradient methods which satisfy this property are variance reduced methods, as we show in Sect. 5. When \(D_1 \ne 0\) or \(D_2 \ne 0\), which is the case for SGD and DIANA (see Section B), the solution to ensure anytime convergence is to use decreasing step sizes.

Corollary 4.2

Consider the setting of Theorem 3.1. Let \(M = B/\rho \). Choose step sizes \(\gamma _k = \frac{\gamma }{\sqrt{k+1}}\) for all \(k \ge 0\), where \(\gamma \le \min \left\{ \frac{1}{4 (A + MC)}, \frac{1}{2L} \right\} \). Then, substituting into the rate (10), we have

$$\begin{aligned} \mathbb {E}_{} \left[ F(\bar{x}_t) - F(x_*)\right]&\le \frac{\gamma \left( \delta _0 + \gamma M\sigma _0^2 \right) +{\left\Vert x_0 - x_*\right\Vert }^2 + \left( \tfrac{D_1}{2} + M D_2 \right) \left( \log (t)+1 \right) }{\gamma \left( \sqrt{t} - 1 \right) } \\&\sim \mathcal {O}\left( \frac{\log (t)}{\sqrt{t}}\right) . \end{aligned}$$

4.1 SGD Without the Bounded Gradients Assumption

To better illustrate the significance of the convergence rates derived in Corollaries 4.1 and 4.2, consider the SGD method for the finite sum setting (4):

$$\begin{aligned} x_0 \in \mathbb {R}^d, \quad x_{k+1} = \textrm{prox}_{\gamma _k R} \left( x_k - \gamma _k \nabla f_{i_k}(x_k) \right) , \end{aligned}$$
(11)

where \(i_k\) is sampled uniformly at random from [n]. Note that we consider the finite sum setting (4) just for illustration; our results continue to hold in the more general setting (3).

Lemma 4.1

Assume that f has a finite sum structure (4) and that Assumption 1 holds. The iterates defined by (11) satisfy Assumption 2 with

$$\begin{aligned} A = 2L_{\max }, \; B = 0, \; \rho = 1, \; C = 0, \; D_1 = 2\sigma ^2, \; D_2 = 0, \end{aligned}$$
(12)

where \(\sigma ^2 = \underset{x_* \in X^*}{\sup }\, \frac{1}{n}\sum \nolimits _{i=1}^n{\left\Vert \nabla f_i(x_*)\right\Vert }^2\), with \(X^{*}\) the set of minimizers of F, and \(L_{\max } = \underset{i\in [n]}{\max }\, L_i\).

Proof

See Lemma A.1 in [11]. \(\square \)

This analysis can be easily extended to include minibatching, importance sampling, and virtually all forms of sampling by using the constants given in (12), with the exception of \(L_{\max }\), which should be replaced by the expected smoothness constant [12]; a schematic sketch of importance sampling is given below. Due to lack of space, we defer this general analysis of SGD to the appendix (Sections A and B).
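As an illustration of such a sampling, the following minimal sketch implements single-element importance sampling, where sample i is drawn with probability \(p_i \propto L_i\) and the gradient is rescaled by \(1/(np_i)\) to keep the estimate unbiased (a standard construction; the function names are ours):

```python
import numpy as np

def importance_sampling_estimate(grad, x, L, rng):
    """Unbiased SGD estimate with importance sampling: draw i with
    probability p_i = L_i / sum_j L_j and return grad f_i(x) / (n * p_i),
    whose expectation is (1/n) sum_i grad f_i(x) = grad f(x)."""
    p = np.asarray(L, dtype=float)
    p /= p.sum()
    n = len(p)
    i = rng.choice(n, p=p)
    return grad(i, x) / (n * p[i])
```

Using Theorem 3.1 and Lemma 4.1, we arrive at the following result.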

Corollary 4.3

Let \((\gamma _k)_k\) be a decreasing sequence of step sizes such that \(0<\gamma _0 \le \frac{1}{4L_{\max }}\). Let Assumption 1 hold. The iterates of (11) satisfy

$$\begin{aligned} \mathbb {E}_{} \left[ F(\bar{x}_t) - F(x_*)\right] \le \frac{{\left\Vert x_0 - x_*\right\Vert }^2 + 2\gamma _0\left( F(x_0) - F(x_*) \right) }{\sum \nolimits _{i=0}^{t-1}\gamma _i} + \frac{2\sigma ^2 \sum \nolimits _{k=0}^{t-1}\gamma _k^2}{\sum \nolimits _{i=0}^{t-1}\gamma _i}. \end{aligned}$$

As in Corollaries 4.1 and 4.2, we can then show sublinear convergence to a neighborhood of the minimum with a fixed step size, or \(\mathcal {O}(\log (k)/\sqrt{k})\) convergence to the minimum with the step sizes \(\gamma _k = \frac{\gamma }{\sqrt{k+1}}\). If, in addition, we know the stopping time of the algorithm, we can derive an \(\mathcal {O}(1/\sqrt{k})\) upper bound as done in [29].

Corollary 4.3 fills a gap in the theory of SGD. Indeed, to the best of our knowledge, this is the first analysis of proximal SGD in the convex setting which assumes neither bounded gradients nor bounded variance (as done in, e.g., [10, 29]). Instead, it relies only on convexity and smoothness. The closest results to ours are Theorem 1.6 in [14] and Theorem 5 in [40], both of which are in the same setting as Lemma 4.1 but study more restrictive variants of proximal SGD: [14] studies SGD with projection onto closed convex sets, and [40] studies vanilla SGD, without proximal or projection operators. Unfortunately, neither result extends easily to proximal operators, and hence our results required a different approach. When specialized to the setting where \(R \equiv 0\) (i.e., no prox) and L-smoothness (\(A=L\), \(B=0\)), Corollary 4.1 gives us the following guarantee after T steps of SGD:

$$\begin{aligned} \mathbb {E}_{} \left[ f(\bar{x}_T) - f(x_*)\right] \le \frac{2\gamma \delta _0+{\left\Vert x_0 - x_*\right\Vert }^2}{\gamma T} + 2 \gamma D_1. \end{aligned}$$

Observe that by the smoothness of f we have \(\delta _0 \le \frac{L}{2} {\left\Vert x_0 - x_{*}\right\Vert }^2\), therefore

$$\begin{aligned} \mathbb {E}_{} \left[ f(\bar{x}_T) - f(x_*)\right] \le \frac{(1+\gamma L) {\left\Vert x_0 - x_*\right\Vert }^2}{\gamma T} + 2 \gamma D_1. \end{aligned}$$

Optimizing over \(\gamma \), we get that for \(\gamma = \min \left\{ \frac{1}{2\,L}, \frac{\left\Vert x_0 - x_{*}\right\Vert }{\sqrt{2 D_1 T}} \right\} \) we have the convergence rate

$$\begin{aligned} \mathbb {E}_{} \left[ f(\bar{x}_T) - f(x_{*})\right] \le \frac{3L {\left\Vert x_0 - x_{*}\right\Vert }^2}{T} + \frac{3 \sqrt{2} \left\Vert x_0 - x_{*}\right\Vert \sqrt{D_1}}{\sqrt{T}}. \end{aligned}$$

This matches the guarantees given in [14, 40] up to constants. Our analysis is therefore more general, at no cost in the convergence rate.

4.2 Convergence of DIANA in the Convex Setting

DIANA was the first distributed quantized stochastic gradient method proven to converge to the minimizer in the strongly convex case and to a critical point in the nonconvex case [26]. See Section B.2 in the appendix for the definition of DIANA and its parameters.
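Schematically, one step of DIANA can be sketched as follows (our paraphrase, not verbatim Algorithm 4; we aggregate by averaging, and Q, \(\alpha \), \(\gamma \) match the parameters of Lemma 4.2 below). Each node i maintains a shift \(h_i\) and communicates only the compressed difference \(Q(g_i^k - h_i)\), from which the master forms a bias-corrected gradient estimate:

```python
import numpy as np

def diana_step(x, h, local_grad, Q, alpha, gamma, prox):
    """One schematic DIANA step (a paraphrase, not verbatim Algorithm 4)."""
    n = len(h)
    g_hat = np.zeros_like(x)
    for i in range(n):
        delta = Q(local_grad(i, x) - h[i])  # the only vector node i transmits
        g_hat += (h[i] + delta) / n         # bias-corrected local estimate
        h[i] = h[i] + alpha * delta         # shift update, mirrored on the master
    return prox(x - gamma * g_hat, gamma), h
```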

Lemma 4.2

Assume that f has a finite sum structure and that Assumption 1 holds. The iterates of DIANA (Algorithm 4) satisfy Assumption 2 with constants:

$$\begin{aligned} A = \left( 1+\frac{2w}{n} \right) L_{\max }, \quad B = \frac{2w}{n}, \quad \rho = \alpha , \quad C = L_{\max }\alpha , \quad D_1 = \frac{(1+w)\sigma ^2}{n}, \quad D_2 = \alpha \sigma ^2, \end{aligned}$$

where \(w > 0\) and \(\alpha \le \frac{1}{1+w}\) are parameters of Algorithm 4 and \(\sigma ^2\) is such that

$$\forall k \in \mathbb {N}, \quad \frac{1}{n}\sum _{i=1}^n\mathbb {E}_{} \left[ {\left\Vert g_i^k - \nabla f(x_k)\right\Vert }^2\right] \le \sigma ^2.$$

Proof

See Lemma A.12 in [11]. \(\square \)

As yet another corollary of Theorem 3.1, we can extend the results of [26] to the convex case and show that DIANA converges sublinearly to the neighborhood of the minimum using a fixed step size, or to the minimum exactly using a decreasing step size.

Corollary 4.4

Assume that f has a finite sum structure (4) and that Assumption 1 holds. Let \((\gamma _k)_{k\ge 0}\) be a decreasing, strictly positive sequence of step sizes chosen such that

$$\begin{aligned} 0< \gamma _0 < \frac{1}{4(1 + \frac{4w}{n})L_{\max }}. \end{aligned}$$

By Theorem 3.1 and Lemma 4.2, we have that the iterates given by Algorithm 4 verify

$$\begin{aligned} \mathbb {E}_{} \left[ F(\bar{x}_t) - F(x_*)\right] \le \frac{{\left\Vert x_0 - x_*\right\Vert }^2 + 2\gamma _0\left( F(x_0) - F(x_*)+ \frac{2w\gamma _0}{\alpha n}\sigma _0^2 \right) + \frac{2\left( 1+5w \right) \sigma ^2}{n} \sum \nolimits _{k=0}^{t-1}\gamma _k^2}{\sum \nolimits _{i=0}^{t-1}\gamma _i}. \end{aligned}$$

5 Optimal Minibatch Sizes for Variance Reduced Methods

Variance reduced methods are of particular interest because they do not require a decreasing step size in order to ensure convergence. This is because for variance reduced methods we have \(D_1 = D_2 = 0\), and thus, these methods converge sublinearly with a fixed step size.

Variance reduced methods were designed for solving (1) in the special case where f has a finite sum structure. In this case, in order to further improve the convergence properties of variance reduced methods, several techniques can be applied, such as adding momentum [3] or using importance sampling [13], but by far the most popular such technique is minibatching. Minibatching has been used in conjunction with variance reduced methods since their inception [21], but it was not until [9, 37] that a theoretical justification for the effectiveness of minibatching was proved for SAGA [8] and SVRG [19] in the strongly convex setting. In this section, we show how our theory allows us to determine the optimal minibatch sizes which minimize the total complexity of any variance reduced method, yielding the first estimates of these minibatch sizes in the nonstrongly convex setting. For simplicity, in the remainder of this section, we consider the special case where \(R = 0\). Hence, in this section

$$\begin{aligned} F(x) = f(x) \equiv \frac{1}{n}\sum _{i=1}^n f_i(x). \end{aligned}$$

To derive a meaningful optimal minibatch size from our theory, we need to use the tightest possible upper bounds on the total complexity. When \(R = 0\), we can derive a slightly tighter upper bound than the one we obtained in Theorem 3.1 as follows.

Proposition 5.1

Let \(R = 0\) and \(M=B/(2\rho )\). Suppose that Assumption 2 holds with \(D_1 = D_2 = 0\), and let the step sizes be constant: \(\gamma _k = \gamma \le 1/(4(A+MC))\) for all \(k \in \mathbb {N}\). Then,

$$\begin{aligned} \mathbb {E}_{} \left[ f(\bar{x}_k) - f(x_*)\right] \le \frac{{\left\Vert x_0 - x_*\right\Vert }^2 + 2M\gamma ^2\sigma _0^2}{\gamma k}. \end{aligned}$$
(13)

We can translate this upper bound into a convenient complexity result as follows.

Corollary 5.1

Assume that there exists a constant \(G \ge 0\) such that

$$\begin{aligned} \sigma _0^2 \le G \, {\left\Vert x_0 - x_*\right\Vert }^2. \end{aligned}$$
(14)

Let \(\epsilon > 0\) and \(\gamma = \frac{1}{4(A+\frac{BC}{2\rho })}\). It follows that

$$\begin{aligned} k \ge \left( 4\left( A+ \frac{BC}{2\rho }\right) + \frac{BG}{2(2\rho A+BC)} \right) \frac{{\left\Vert x_0 - x_*\right\Vert }^2}{\epsilon } \end{aligned}$$
(15)
$$\begin{aligned} \implies \; \mathbb {E}_{} \left[ f(\bar{x}_k) - f(x_*)\right] \le \epsilon . \end{aligned}$$
(16)

Proof

The result follows from taking \(\gamma = \frac{1}{4(A+\frac{BC}{2\rho })}\) and upper bounding \(\sigma _0^2\) by \(G {\left\Vert x_0 - x_*\right\Vert }^2\) in (13). \(\square \)

In the same way we specialized the general convergence rate of Theorem 3.1 to SGD and DIANA in Sect. 4, we can specialize the iteration complexity result (15) to any method which satisfies \(D_1 = D_2 = 0\). Due to their popularity, we chose to analyze minibatch variants of SAGA [8] and L-SVRG [17, 23], a single-loop variant of the original SVRG algorithm [19]. The pseudocode for these algorithms is presented in Algorithms 1 and 2. For any subset \(B \subseteq [n]\) of size \(b = |B|\), we define the minibatch average of f over B as \(f_B(x) \overset{\text {def}}{=}\frac{1}{b}\sum _{i \in B} f_i(x).\)

Algorithm 1: b-SAGA (pseudocode omitted)

Algorithm 2: b-L-SVRG (pseudocode omitted)
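To complement the pseudocode, here is a minimal sketch of the gradient estimators at the heart of Algorithms 1 and 2, under our own naming conventions (for b-L-SVRG, the reference point w is additionally refreshed to the current iterate with a small probability at each step, commonly taken of order b/n; that schedule is not shown):

```python
import numpy as np

def saga_estimate(grad, x, batch, J):
    """b-SAGA estimate: g = (1/b) sum_{i in B} (grad f_i(x) - J_i) + (1/n) sum_j J_j.
    The stored gradients J_i for i in B are refreshed in place."""
    b = len(batch)
    g = J.mean(axis=0).copy()   # (1/n) sum_j J_j; in practice kept as a running sum
    for i in batch:
        g_i = grad(i, x)
        g += (g_i - J[i]) / b
        J[i] = g_i              # refresh the table entry for sample i
    return g

def lsvrg_estimate(grad, x, batch, w, grad_f_w):
    """b-L-SVRG estimate: g = grad f_B(x) - grad f_B(w) + grad f(w), where w is the
    reference point and grad_f_w = grad f(w) is recomputed whenever w is refreshed."""
    b = len(batch)
    g = grad_f_w.copy()
    for i in batch:
        g += (grad(i, x) - grad(i, w)) / b
    return g
```

Both estimates are conditionally unbiased, and their variance vanishes as the iterates and the stored information approach \(x_*\), which is what yields \(D_1 = D_2 = 0\).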

As we will show next, the iterates of Algorithms 1 and 2 satisfy Assumption 2 with constants which depend on the minibatch size b. These constants will depend on the following expected smoothness and expected residual constants \(\mathcal {L}(b)\) and \(\zeta (b)\) used in the analysis of SAGA and SVRG in [9, 37]:

$$\begin{aligned} \mathcal {L}(b) \overset{\text {def}}{=}\frac{1}{b}\frac{n-b}{n-1}L_{\max } + \frac{n}{b}\frac{b-1}{n-1}L, \quad \text{ and } \quad \zeta (b) \;\overset{\text {def}}{=}\; \frac{1}{b}\frac{n-b}{n-1}L_{\max }. \end{aligned}$$
(17)

5.1 Optimal Minibatch Size for b-SAGA

Consider the b-SAGA method in Algorithm 1. Define

$$H(x) \overset{\text {def}}{=}\left[ f_1(x),\dots ,f_n(x)\right] \in \mathbb {R}^{n}$$

and let \(\nabla H(x) \in \mathbb {R}^{d\times n}\) denote the Jacobian of H. Let \(J_k = [J_k^1, \ldots , J_k^n] \in \mathbb {R}^{d\times n}\) be the current stochastic Jacobian.

Lemma 5.1

The iterates of Algorithm 1 satisfy Assumption 2 and Eq. (14) with

$$\begin{aligned} \sigma _k^2 = \frac{1}{nb} \frac{n-b}{n-1}\left\Vert J_k - \nabla H(x_*)\right\Vert _{\textrm{Tr}}^2, \end{aligned}$$
(18)

where for all \(Z \in \mathbb {R}^{d\times n},\) \(\left\Vert Z\right\Vert _{\textrm{Tr}}^2 = \textrm{tr}\left( {ZZ^\top } \right) \), and constants

$$\begin{aligned} A = 2\mathcal {L}(b), \; B = 2, \; \rho = \frac{b}{n}, \; C = \frac{b\zeta (b)}{n}, \; D_1 = D_2 = 0, \; G = \zeta (b)L. \end{aligned}$$
(19)

Using Corollary 5.1, we can determine the iteration complexity of Algorithm 1.

Corollary 5.2

(Iteration complexity of b-SAGA) Consider the iterates of Algorithm 1 with the step size \(\gamma = \frac{1}{4(2\mathcal {L}(b)+\zeta (b))}\). Given the constants obtained for Algorithm 1 in (19), by Corollary 5.1 we have that

$$\begin{aligned} k \ge \left( 4(2\mathcal {L}(b)+\zeta (b)) + \frac{n\zeta (b)L}{2b\left( 2\mathcal {L}(b) + \zeta (b) \right) } \right) \frac{{\left\Vert x_0 - x_*\right\Vert }^2}{\epsilon } \; \implies \; \mathbb {E}_{} \left[ F(\bar{x}_k) - F(x_*)\right] \le \epsilon . \end{aligned}$$

We define the total complexity as the number of stochastic gradients computed per iteration (here b) times the number of iterations required to reach an \(\epsilon \)-approximate solution. Thus, multiplying the iteration complexity in Corollary 5.2 by b and plugging in (17), the total complexity for Algorithm 1 is upper bounded by

$$\begin{aligned} K_{saga}(b) \overset{\text {def}}{=}\left( \frac{4\left( 3(n-b)L_{\max } + 2n(b-1)L \right) }{n-1} + \frac{n(n-b)L_{\max }L}{2\left( 3(n-b)L_{\max } + 2n(b-1)L \right) } \right) \frac{{\left\Vert x_0 - x_*\right\Vert }^2}{\epsilon }. \end{aligned}$$
(20)

Minimizing this upper bound in the minibatch size b gives us an estimate of the optimal empirical minibatch size, which we verify in our experiments.
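Before stating the closed form, note that the bound (20) is also straightforward to minimize numerically; a minimal sketch (we drop the factor \({\left\Vert x_0 - x_*\right\Vert }^2/\epsilon \), which does not affect the argmin):

```python
def K_saga(b, n, L, Lmax):
    """Total complexity bound (20), up to the factor ||x0 - x*||^2 / epsilon."""
    lead = 3 * (n - b) * Lmax + 2 * n * (b - 1) * L
    return 4 * lead / (n - 1) + n * (n - b) * Lmax * L / (2 * lead)

def best_minibatch_saga(n, L, Lmax):
    """Brute-force argmin of K_saga over b in [n]."""
    return min(range(1, n + 1), key=lambda b: K_saga(b, n, L, Lmax))
```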

Proposition 5.2

Let \(b_{saga}^* = \underset{b \in [n]}{{{\,\mathrm{\arg \!\min }\,}}}\, K_{saga}(b)\), where \(K_{saga}(b)\) is defined in (20).

  • If \(L_{\max } \le \frac{2nL}{3}\) then

    $$\begin{aligned} b_{saga}^* = \left\{ \begin{array}{ll} 1 & \text{ if } \quad b_1< 2 \\ \left\lfloor b_1 \right\rfloor & \text{ if } \quad 2 \le b_1 < n \\ n & \text{ if } \quad b_1 \ge n, \end{array} \right. \end{aligned}$$
    (21)

    where

    $$b_1 \, \overset{\text {def}}{=}\, \frac{n\left( (n-1)L\sqrt{L_{\max }} - 2\sqrt{2nL - 3L_{\max }}(3L_{\max } - 2\,L) \right) }{2(2nL - 3L_{\max })^{\frac{3}{2}}}.$$
  • Otherwise, if \(L_{\max } > \frac{2nL}{3}\), then \(b_{saga}^* = n\).

5.2 Optimal Minibatch Size for b-L-SVRG

Since the analysis for Algorithm 2 is similar to that of Algorithm 1, we defer its details to the appendix and only present the total complexity and the optimal minibatch size. Indeed, as shown in Section E.3, an upper bound on the total complexity to find an \(\epsilon \)-approximate solution for Algorithm 2 is given by

$$\begin{aligned} K_{svrg}(b) \overset{\text {def}}{=}\left( 1+2b \right) \left( \frac{12\left( (n-b)L_{\max } + n(b-1)L \right) }{b(n-1)}+\frac{nL}{6} \right) \frac{{\left\Vert x_0 - x_*\right\Vert }^2}{\epsilon }. \end{aligned}$$

Proposition 5.3

Let \(b_{svrg}^* = \underset{b \in [n]}{{{\,\mathrm{\arg \!\min }\,}}}\, K_{svrg}(b)\), where \(K_{svrg}(b)\) is defined above. Then,

$$\begin{aligned} b_{svrg}^* = 6\sqrt{\frac{n\left( L_{\max } - L \right) }{72\left( nL-L_{\max } \right) + n(n-1)L}}. \end{aligned}$$
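Note that this expression need not be an integer in [n]. A small helper, assuming we round down and clip to the admissible range (the rounding convention is our choice; the square root is well defined since \(L \le L_{\max } \le nL\) under Assumption 1):

```python
import math

def b_svrg_star(n, L, Lmax):
    """Optimal b-L-SVRG minibatch size (Proposition 5.3), floored and clipped to [1, n]."""
    b = 6.0 * math.sqrt(n * (Lmax - L) / (72.0 * (n * L - Lmax) + n * (n - 1) * L))
    return min(max(int(b), 1), n)
```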

6 Experiments

Here we test our new formula (21) for the optimal minibatch size of SAGA against the best minibatch size found over a grid search. We used logistic regression with no regularization (\(\lambda =0\)), on data sets taken from the LIBSVM collection [7], to emphasize that our results hold for nonstrongly convex functions. For each data set, we ran minibatch SAGA with the step size given in Corollary 5.2 until a solution with

$$F(x_t) -F(x_*) < 10^{-4}\left( F(x_0) -F(x_*) \right) $$

was reached.

Fig. 1: Comparing the theoretical optimal batchsize (21) with the best over a grid

In Fig. 1 we plot the total complexity (number of iterations times the minibatch size) needed to reach this tolerance for each minibatch size on the grid. For ijcnn and phishing, the theoretically optimal minibatch size \(b^{*}_{\text {theory}} = b^{*}_{saga}\) given by (21) is remarkably close to the best minibatch size over the grid, \(b^{*}_{\text {empirical}}\). Even when \(b^{*}_{\text {theory}}\) is not close to \(b^{*}_{\text {empirical}}\), as on the YearPredictionMSD problem, the resulting total complexity is still very close to that of \(b^{*}_{\text {empirical}}\).