1 Introduction

Many problems in computational statistics reduce to the maximization of a criterion

$$\begin{aligned} {\text {argmax}}_{\theta \in \mathbb {R}^d} F(\theta ), \quad \text {where} \ F :=\ell - g, \end{aligned}$$
(1)

and the functions \(\ell , g\) satisfy

H1

The function \(g\,{:}\,\mathbb {R}^d\rightarrow \left[ 0,+\infty \right] \) is convex, not identically \(+\infty \), and lower semi-continuous.

H2

The function \(\ell \,{:}\,\mathbb {R}^d \rightarrow \mathbb {R}\cup \{-\infty \}\) is continuously differentiable on \(\Theta :=\{\theta \in \mathbb {R}^d\,{:}\,g(\theta ) + |\ell (\theta )| < \infty \}\), and its gradient is of the form

$$\begin{aligned} \begin{aligned} \nabla \ell (\theta )&= \nabla \phi (\theta ) + \varPsi (\theta ) {\bar{S}}(\theta ),\\&\quad \text {with} \ {\bar{S}}(\theta ) :=\int _\mathsf {Z}S(z) \pi _\theta (z) \nu (\mathrm {d}z); \end{aligned} \end{aligned}$$
(2)

\(\nabla \) denotes the gradient operator, and \(\pi _\theta \mathrm {d}\nu \) is a probability distribution on a measurable subset \((\mathsf {Z}, {\mathcal {Z}})\) of \(\, \mathbb {R}^p\). The measurable functions \(\nabla \phi \,{:}\,\mathbb {R}^d \rightarrow \mathbb {R}^d\) and \(\varPsi \,{:}\,\mathbb {R}^d\rightarrow \mathbb {R}^{d \times q}\) are known but the expectation \({\bar{S}}\) of the function \(S\,{:}\,\mathsf {Z}\rightarrow \mathbb {R}^q\) with respect to \(\pi _\theta \mathrm {d}\nu \) may be intractable. Furthermore, there exists a finite nonnegative constant L such that for all \(\theta , \theta ' \in \Theta \),

$$\begin{aligned} \Vert \nabla \ell (\theta ) - \nabla \ell (\theta ')\Vert \le L \Vert \theta - \theta '\Vert ; \end{aligned}$$
(3)

\(\Vert \cdot \Vert \) is the Euclidean norm.

Examples of functions \(\ell \) satisfying Eq. (2) are given below. We are interested in numerical methods for solving Eq. (1), robust to the case when neither \(\ell \) nor its gradient has an explicit expression.

Such an optimization problem occurs for example when computing a penalized maximum-likelihood estimator in a parametric model indexed by \(\theta \in \mathbb {R}^d\): \(\ell \) denotes the log likelihood of the observations \(\mathsf {Y}\) (the dependence upon \(\mathsf {Y}\) is omitted) and g is the penalty term.

The optimization problem Eq. (1) covers the computation of the maximum when the parameter \(\theta \) is restricted to a closed convex subset \(\Theta \) of \(\mathbb {R}^d\); in that case, g is the characteristic function of \(\Theta \), i.e., \(g(\theta ) =0\) for any \(\theta \in \Theta \) and \(g(\theta ) =+ \infty \) otherwise. It also covers the case when g is the ridge, the lasso or the elastic net penalty; and more generally, the case when g is the sum of lower semi-continuous nonnegative convex functions.

A first example of such a function \(\ell \) is given by the log likelihood in a latent variable model with complete likelihood from the q-parameter exponential family (see, e.g., Bickel and Doksum 2015; Bartholomew et al. 2011 and the references therein). In that case, \(\ell \) is of the form

$$\begin{aligned} \theta \mapsto \ell (\theta ) :=\log \int _\mathsf {Z}\exp \left( \phi (\theta ) + \left<S(z),\psi (\theta )\right> \right) \nu (\mathrm {d}z), \end{aligned}$$
(4)

where \(\left<a,b\right>\) denotes the scalar product of two vectors \(a,b \in \mathbb {R}^l\); \(\phi \,{:}\,\mathbb {R}^d \rightarrow \mathbb {R}\), \(\psi \,{:}\,\mathbb {R}^d\rightarrow \mathbb {R}^q\) and \(S\,{:}\,\mathsf {Z}\rightarrow \mathbb {R}^q\) are measurable functions, and \(\nu \) is a \(\sigma \)-finite positive measure on \((\mathsf {Z},{\mathcal {Z}})\). The function \(\theta \mapsto \phi (\theta ) + \left<S(Z),\psi (\theta )\right>\) is known as the complete log likelihood, where \(Z\) is the latent data vector. Under regularity conditions, we have

$$\begin{aligned} \begin{aligned} \nabla \ell (\theta )&= \nabla \phi (\theta ) + {\text {J}} \psi (\theta ) \int _\mathsf {Z}S(z) \ \pi _{\theta }(z) \nu (\mathrm {d}z), \\&\text {with} \ \pi _{\theta }(z) :=\frac{\exp (\left<S(z),\psi (\theta )\right>)}{\int _\mathsf {Z}\exp (\left<S(u),\psi (\theta )\right>) \nu (\mathrm {d}u)}, \end{aligned} \end{aligned}$$
(5)

where \({\text {J}} \psi (\theta )\) denotes the transpose of the Jacobian matrix of the function \(\psi \) at \(\theta \).
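For illustration, the gradient in Eq. (5) can be estimated by averaging S over draws (exact or MCMC) from the a posteriori distribution \(\pi _\theta \mathrm {d}\nu \). A minimal sketch, in which all callables (`grad_phi`, `J_psi_T`, `S`, `sample_posterior`) are hypothetical placeholders for a user-specified model:

```python
import numpy as np

def grad_ell_mc(theta, grad_phi, J_psi_T, S, sample_posterior, m, rng):
    """Monte Carlo estimate of Eq. (5): grad ell(theta) is approximated by
    grad_phi(theta) + J_psi_T(theta) @ mean_j S(Z_j), where J_psi_T(theta)
    is the matrix J psi(theta) of the text (transpose of the Jacobian) and
    Z_1, ..., Z_m are drawn (at least approximately) from pi_theta d nu."""
    S_bar = np.mean([S(sample_posterior(theta, rng)) for _ in range(m)], axis=0)
    return grad_phi(theta) + J_psi_T(theta) @ S_bar
```

When the draws come from an MCMC kernel rather than i.i.d. sampling, this estimator is biased; that situation is precisely the one analyzed in Sect. 3.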

A second example is given by the log likelihood of N independent observations \((\mathsf {Y}_{1}, \ldots , \mathsf {Y}_{N})\) from a log-linear model for Markov random fields. In this model, \(\ell \) is given by

$$\begin{aligned} \theta \mapsto \ell (\theta ):= & {} \sum _{k=1}^N \left<S(\mathsf {Y}_{k}),\theta \right>\nonumber \\&\quad -\, N \log \int _\mathsf {Z}\exp \left( \left<S(z),\theta \right> \right) \nu (\mathrm {d}z). \end{aligned}$$
(6)

The function \(\theta \mapsto \int _\mathsf {Z}\exp \left( \left<S(z),\theta \right> \right) \nu (\mathrm {d}z)\) is known as the partition function. Under regularity conditions, we have

$$\begin{aligned} \begin{aligned} \nabla \ell (\theta )&= \sum _{k=1}^N S(\mathsf {Y}_{k} ) - N \int _\mathsf {Z}S(z) \ \pi _\theta (z) \nu (\mathrm {d}z), \\&\quad \text {with} \, \pi _\theta (z) :=\frac{\exp \left( \left<S(z),\theta \right> \right) }{\int \exp \left( \left<S(u),\theta \right> \right) \, \nu (\mathrm {d}u)}. \end{aligned} \end{aligned}$$
(7)
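As a sanity check of Eq. (7), consider a toy log-linear model of our own choosing with \(\mathsf {Z}=\{0,1\}\), \(S(z)=z\) and \(\nu \) the counting measure; the partition function is then \(1+e^{\theta }\) and every term is explicit:

```python
import numpy as np

def grad_ell_loglinear(theta, Y):
    """Exact gradient Eq. (7) for the toy model Z in {0, 1}, S(z) = z,
    nu = counting measure: the expectation of S under pi_theta is
    pi_theta(1) = exp(theta) / (1 + exp(theta))."""
    Y = np.asarray(Y, dtype=float)
    pi1 = np.exp(theta) / (1.0 + np.exp(theta))
    return Y.sum() - len(Y) * pi1
```

As expected for an exponential family, the gradient vanishes when the model mean \(\pi _\theta (1)\) matches the empirical mean of the \(S(\mathsf {Y}_k)\).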

In these two examples, the integrals in Eqs. (4)–(7) are intractable except for toy examples: neither the function \(\ell \) nor its gradient is available. Nevertheless, all the integrals in Eqs. (4)–(7) can be approximated by a Monte Carlo sum (see, e.g., Robert and Casella 2004). In the first example, this Monte Carlo approximation consists in imputing the missing variables z; it is known that such an imputation is far more efficient when the Monte Carlo samples are drawn under \(\pi _\theta \mathrm {d}\nu \), i.e., the a posteriori distribution of the missing variables given the observations (see Eq. 5), than when they are drawn under the a priori distribution. This remark is the essence of the expectation maximization (EM) algorithm (introduced in Dempster et al. 1977), a popular iterative procedure for maximizing the log likelihood \(\ell \) in latent variable models.

In this paper, we are interested in first-order optimization methods to solve Eq. (1), that is, methods based on the gradient. In Sect. 2.1, we describe two stochastic first-order descent methods, which are stochastic perturbations of the proximal-gradient (PG) algorithm (introduced in Combettes and Pesquet 2011; see also Beck and Teboulle 2009; Parikh and Boyd 2013 for literature reviews on proximal-gradient algorithms). The two algorithms are the Monte Carlo proximal-gradient algorithm (MCPG) and the stochastic approximation proximal-gradient algorithm (SAPG), which differ in the approximation of the gradient \(\nabla \ell \) and, more precisely, of the intractable integral \({\bar{S}}(\theta )\) (see Eq. 2). In MCPG, at each iteration n of the algorithm, this expectation evaluated at the current point \(\theta _n\) is approximated by a Monte Carlo sum computed from samples \(\{Z_{1,n}, \ldots , Z_{m_{n+1},n}\}\) approximating \(\pi _{\theta _n} \mathrm {d}\nu \). In SAPG, the approximation is computed as a Monte Carlo sum based on all the points drawn during all the previous iterations of the algorithm \(\{Z_{i,j}, i \le m_{j+1}, j \le n\}\).

When \(\ell \) is the log likelihood of a latent variable model, we prove in Sect. 2.2 that our algorithms are generalized EM algorithms (see, e.g., McLachlan and Krishnan 2008; Ng et al. 2012) combined with a stochastic E-step: in MCPG and SAPG, the stochastic E-step mimics, respectively, the E-step of the Monte Carlo EM (Wei and Tanner 1990; Levine and Fan 2004) and the E-step of the stochastic approximation EM (see, e.g., Delyon et al. 1999).

Section 3 is devoted to the convergence analysis of MCPG and SAPG. These algorithms can be seen as perturbed proximal-gradient algorithms, where the perturbation comes from replacing the exact quantity \({\bar{S}}(\theta _n)\) by a Monte Carlo approximation \(S_{n+1}\) at each iteration of the algorithm. Our convergence analysis covers the case when the points \(\{Z_{1,n}, \ldots , Z_{m_{n+1},n}\}\) are sampled from a Markov chain Monte Carlo (MCMC) sampler with target distribution \(\pi _{\theta _n} \mathrm {d}\nu \)—and therefore, it also covers the case of i.i.d. draws. This implies that the estimator \(S_{n+1}\) of \({\bar{S}}(\theta _n)\) may be biased. There exist many contributions in the literature on the convergence of perturbed proximal-gradient algorithms when \(\ell \) is concave, but except in the works by Atchadé et al. (2017) and Combettes and Pesquet (2015), most of them assume that the error \(S_{n+1} - {\bar{S}}(\theta _n)\) is unbiased and gets small when \(n \rightarrow \infty \) (see, e.g., Rosasco et al. 2014; Combettes and Pesquet 2016; Rosasco et al. 2016; Lin et al. 2015). In this paper, we provide sufficient conditions for the almost-sure convergence of MCPG and SAPG under the assumption that \(\ell \) is concave and with no assumptions on the bias of \(S_{n+1} - {\bar{S}}(\theta _n)\). The convergence analysis of MCPG is a special case of Atchadé et al. (2017, Section 4); to the best of our knowledge, the convergence of SAPG is a new result.

Practical implementation is discussed in Sect. 4. Some guidelines are given in Sect. 4.2 to choose the sequences involved in the stochastic approximation procedures. Then, MCPG and SAPG are compared through a toy example in Sect. 4.3. A more challenging application to penalized inference in a mixed effect model is detailed in Sect. 5. Mixed models are applied to analyze repeated data in a population of subjects. The N independent vectors of observations \(( \mathsf {Y}_k, k=1, \ldots , N)\) of the N subjects are modeled by

$$\begin{aligned} \mathsf {Y}_k = f(t_k, Z^{(k)}) + \varepsilon _k, \end{aligned}$$
(8)

with individual latent variable \(Z^{(k)}\) independent of the measurement error vector \(\varepsilon _k\), and f the regression function that depends on the vector of observation times \(t_k\). Mixed models thus enter the class of models given by Eq. (4) with latent variables \(Z= (Z^{(1)}, \ldots , Z^{(N)})\). When a covariate model is introduced, the number of covariates can be large, with only a few of them being influential. This is a sparse estimation problem, and the selection problem can be treated through the optimization of a penalized version of the log likelihood Eq. (4). In nonlinear mixed models, the optimization problem is not explicit, and stochastic penalized versions of EM (Bertrand and Balding 2013; Ollier et al. 2016; Chen et al. 2017) have been proposed. To the best of our knowledge, stochastic proximal-gradient algorithms have not been proposed for mixed models.

2 Stochastic proximal-gradient based algorithms

In this section, we describe first-order based algorithms for solving Eq. (1) under the assumptions H1 and H2, when the expectation \({\bar{S}}(\theta )\) in Eq. (2) is intractable.

2.1 The MCPG and SAPG algorithms

Both MCPG and SAPG are iterative algorithms, and each update relies on the combination of a gradient step and a proximal operator. The proximal map (Moreau 1962, see also Bauschke and Combettes 2011; Parikh and Boyd 2013) associated with a convex function g is defined for any \(\gamma >0\) and \(\theta \in \mathbb {R}^d\) by

$$\begin{aligned} {\text {Prox}}_{\gamma ,g}(\theta ) :={\text {argmin}}_{\tau \in \Theta } \left\{ g(\tau ) + \frac{1}{2\gamma } \Vert \theta - \tau \Vert ^2 \right\} . \end{aligned}$$
(9)
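Definition (9) can be checked numerically in one dimension by brute-force minimization over a grid; with \(g = |\cdot |\) it recovers the soft-thresholding map. A small sketch (grid search, for illustration only):

```python
import numpy as np

def prox_numeric(theta, gamma, g, grid):
    """Brute-force evaluation of Eq. (9) in one dimension:
    argmin over the grid of g(tau) + (theta - tau)^2 / (2 gamma)."""
    return grid[np.argmin(g(grid) + (theta - grid) ** 2 / (2.0 * gamma))]

grid = np.linspace(-5.0, 5.0, 200001)
# For g = |.|, the prox is soft-thresholding: here max(|2| - 1, 0) = 1.
p = prox_numeric(2.0, 1.0, np.abs, grid)
```

With \(g \equiv 0\) the same routine returns \(\theta \) itself, consistent with the prox of a null penalty being the identity.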

Note that under H1, for any \(\gamma >0\) and \(\theta \in \mathbb {R}^d\), there exists a unique point \(\tau \) minimizing the RHS of Eq. (9). This proximal operator may have an explicit expression. When g is the characteristic function

$$\begin{aligned} g(\theta ) :=\left\{ \begin{array}{l@{\quad }l} 0 &{} \quad \text {if}\,\,\theta \in \Theta \\ + \infty &{} \quad \text {otherwise}, \end{array} \right. \end{aligned}$$

for some closed convex set \(\Theta \subseteq \mathbb {R}^d\), then \({\text {Prox}}_{\gamma ,g}(\theta )\) is the projection of \(\theta \) on \(\Theta \). This projection is explicit for example when \(\Theta \) is a hyper-rectangle. Another example of an explicit proximal operator is the one associated with the so-called elastic net penalty, i.e., \(g_{\lambda ,\alpha }(\theta ) :=\lambda \left( \frac{1-\alpha }{2} \sum _{i=1}^d \theta _i^2+ \alpha \sum _{i=1}^d |\theta _i| \right) \) with \(\theta =(\theta _1, \ldots , \theta _d)\), \(\lambda >0\) and \(\alpha \in \left( 0,1\right] \); then for any component \(i \in \{1, \ldots , d \}\),

$$\begin{aligned}&\left( {\text {Prox}}_{\gamma ,g_{\lambda , \alpha }}(\theta ) \right) _i \\&\quad = \frac{1}{1+\gamma \lambda (1-\alpha )} \left\{ \begin{array}{l@{\quad }l} 0 &{}\quad \text {if}\,\,|\theta _i| \le \gamma \lambda \alpha , \\ \theta _i - \gamma \lambda \alpha &{}\quad \text {if}\,\,\theta _i \ge \gamma \lambda \alpha , \\ \theta _i + \gamma \lambda \alpha &{}\quad \text {if}\,\,\theta _i \le -\gamma \lambda \alpha . \end{array} \right. \end{aligned}$$
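In code, the displayed formula is a componentwise soft-thresholding at level \(\gamma \lambda \alpha \) followed by a multiplicative shrinkage; a minimal sketch:

```python
import numpy as np

def prox_elastic_net(theta, gamma, lam, alpha):
    """Closed-form proximal operator of the elastic net penalty
    g(theta) = lam * ((1 - alpha)/2 * sum theta_i^2 + alpha * sum |theta_i|):
    soft-threshold each component at gamma*lam*alpha, then shrink by
    1 / (1 + gamma*lam*(1 - alpha))."""
    soft = np.sign(theta) * np.maximum(np.abs(theta) - gamma * lam * alpha, 0.0)
    return soft / (1.0 + gamma * lam * (1.0 - alpha))
```

With \(\alpha = 1\) the shrinkage factor equals one and this reduces to the lasso proximal map.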

The proximal-gradient algorithm for solving the optimization problem Eq. (1) produces a sequence \(\{\theta _n, n \ge 0 \}\) as follows: given a \(\left( 0,1/L\right] \)-valued sequence \(\{\gamma _n, n \ge 0 \}\),

$$\begin{aligned} \theta _{n+1}= & {} {\text {Prox}}_{\gamma _{n+1},g} \left( \theta _n + \gamma _{n+1} \nabla \ell (\theta _n) \right) \nonumber \\= & {} {\text {Prox}}_{\gamma _{n+1},g} \left( \theta _n + \gamma _{n+1} \{ \nabla \phi (\theta _n) + \varPsi (\theta _n) {\bar{S}}(\theta _n) \} \right) .\nonumber \\ \end{aligned}$$
(10)

This update scheme can be explained as follows: by H2 and since \(\gamma _{n+1} \in \left( 0,1/L\right] \), we have

$$\begin{aligned} F(\theta )= & {} \ell (\theta ) -g(\theta ) \ge \ell (\theta _n) + \left<\nabla \ell (\theta _n),\theta -\theta _n\right>\\&-\, \frac{1}{2 \gamma _{n+1}} \Vert \theta - \theta _n \Vert ^2 -g(\theta ). \end{aligned}$$

This minorizing function is equal to \(F(\theta _n)\) at the point \(\theta _n\); the maximization (w.r.t. \(\theta \)) of the RHS yields \(\theta _{n+1}\) given by Eq. (10). The proximal-gradient algorithm is therefore a minorization–maximization (MM) algorithm and the ascent property holds: \(F(\theta _{n+1}) \ge F(\theta _n)\) for all n. Sufficient conditions for the convergence of the proximal-gradient algorithm Eq. (10) can be derived from the results by Combettes and Wajs (2005) and Parikh and Boyd (2013) or from convergence analysis of MM algorithms (see, e.g., Zangwill 1969; Meyer 1976).
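For intuition, the exact recursion Eq. (10) can be run on a toy concave problem of our own choosing, \(\ell (\theta ) = -\tfrac{1}{2}\Vert y - X\theta \Vert ^2\) with a lasso penalty, so that the proximal map is soft-thresholding:

```python
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def proximal_gradient(grad_ell, prox_g, theta0, gammas):
    """Exact recursion Eq. (10):
    theta_{n+1} = Prox_{gamma_{n+1}, g}(theta_n + gamma_{n+1} grad_ell(theta_n))."""
    theta = theta0
    for gam in gammas:
        theta = prox_g(theta + gam * grad_ell(theta), gam)
    return theta

# Toy concave instance: ell(theta) = -0.5 ||y - X theta||^2, g = lam ||theta||_1,
# so Prox_{gamma, g} is soft-thresholding at level gamma * lam.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
theta_star = np.array([2.0, 0.0, 0.0, -1.0, 0.0])
y = X @ theta_star
lam = 1.0
L = np.linalg.norm(X.T @ X, 2)          # Lipschitz constant of grad ell
theta_hat = proximal_gradient(
    grad_ell=lambda th: X.T @ (y - X @ th),
    prox_g=lambda th, gam: soft_threshold(th, gam * lam),
    theta0=np.zeros(5),
    gammas=[1.0 / L] * 500,
)
```

Each iterate increases \(F = \ell - g\), in line with the ascent property discussed above.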

When \({\bar{S}}(\theta )\) cannot be computed, we describe two strategies for a Monte Carlo approximation. At iteration \(n+1\), given the current value of the parameter \(\theta _n\), \(m_{n+1}\) points \(\{Z_{1,n}, \ldots , Z_{m_{n+1},n} \}\) from the path of a Markov chain with target distribution \(\pi _{\theta _n} \mathrm {d}\nu \) are sampled. A first strategy consists in replacing \({\bar{S}}(\theta _n)\) by a Monte Carlo mean:

$$\begin{aligned} S_{n+1}^{\mathrm {mc}} :=\frac{1}{m_{n+1}} \sum _{j=1}^{m_{n+1}} S(Z_{j,n}). \end{aligned}$$
(11)

A second strategy, inspired by stochastic approximation methods (see, e.g., Benveniste et al. 1990; Kushner and Yin 2003), consists in replacing \({\bar{S}}(\theta _n)\) by a stochastic approximation

$$\begin{aligned} S_{n+1}^{\mathrm {sa}} :=(1-\delta _{n+1}) S_n^{\mathrm {sa}} + \frac{\delta _{n+1}}{m_{n+1}} \sum _{j=1}^{m_{n+1}} S(Z_{j,n}), \end{aligned}$$
(12)

where \(\{\delta _n, n \ge 0 \}\) is a deterministic \(\left[ 0,1\right] \)-valued sequence. These two strategies yield, respectively, the Monte Carlo proximal-gradient (MCPG) algorithm (see Algorithm 1) and the stochastic approximation proximal-gradient (SAPG) algorithm (see Algorithm 2).
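Both statistics are one-line updates once the batch \(\{S(Z_{1,n}), \ldots , S(Z_{m_{n+1},n})\}\) has been computed; a minimal sketch (the sampler itself is left abstract):

```python
import numpy as np

def update_s_mc(S_batch):
    """Eq. (11): plain Monte Carlo mean of S over the current batch."""
    return np.mean(S_batch, axis=0)

def update_s_sa(S_prev, S_batch, delta):
    """Eq. (12): stochastic-approximation update, a convex combination of
    the running statistic and the current batch mean (delta in [0, 1])."""
    return (1.0 - delta) * S_prev + delta * np.mean(S_batch, axis=0)
```

Taking \(\delta _{n+1} = 1\) in Eq. (12) recovers Eq. (11); smaller values of \(\delta _n\) average over the whole past.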

(Algorithm 1: MCPG; pseudocode figure omitted)
(Algorithm 2: SAPG; pseudocode figure omitted)

In Sect. 3, we prove the convergence of MCPG to the maximum points of F when \(\ell \) is concave, for different choices of the sequences \(\{\gamma _n, m_n, n\ge 0\}\) including decreasing or constant step sizes \(\{\gamma _n,n \ge 0 \}\) and, respectively, constant or increasing batch size \(\{m_n, n \ge 0\}\). We also establish the convergence of SAPG to the maximum points (in the concave case); only the case of a constant batch size \(\{m_n,n \ge 0\}\) and a decreasing step size \(\{\gamma _n, n \ge 0\}\) is studied, since this framework corresponds to the stochastic approximation one from which the update rule Eq. (12) is inherited (see details in Delyon et al. 1999). From a numerical point of view, the choice of the sequences \(\{\gamma _n, n\ge 0\}, \{\delta _n, n \ge 0 \}\) and \(\{m_n, n \ge 0\}\) is discussed in Sect. 4: guidelines are given in Sect. 4.2 and the behavior of the algorithm is illustrated through a toy example in Sect. 4.3.

2.2 Case of latent variable models from the exponential family

In this section, we consider the case when \(\ell \) is given by Eq. (4). A classical approach to solve penalized maximum-likelihood problems in latent variable models with complete likelihood from the exponential family is the expectation–maximization (EM) algorithm or a generalization called the generalized EM (GEM) algorithm (Dempster et al. 1977; McLachlan and Krishnan 2008; Ng et al. 2012). Our goal here is to show that MCPG and SAPG are stochastic perturbations of a GEM algorithm.

The EM algorithm is an iterative algorithm: at each iteration, given the current parameter \(\theta _n\), the quantity \({\mathcal {Q}}(\theta \vert \theta _n)\), defined as the conditional expectation of the complete log likelihood under the a posteriori distribution for the current fit of the parameters, is computed:

$$\begin{aligned} {\mathcal {Q}}(\theta \vert \theta ') :=\phi (\theta ) + \left<{\bar{S}}(\theta '),\psi (\theta )\right>. \end{aligned}$$
(13)

The EM sequence \(\{\theta _n, n\ge 0\}\) for the maximization of the penalized log likelihood \(\ell -g\) is given by (see McLachlan and Krishnan 2008, Section 1.6.1)

$$\begin{aligned} \theta _{n+1} = {\text {argmax}}_{\theta \in \Theta } \left\{ \phi (\theta ) + \left<{\bar{S}}(\theta _n),\psi (\theta )\right> -g(\theta ) \right\} . \end{aligned}$$
(14)

When \({\bar{S}}(\theta )\) is intractable, it was proposed to replace \({\bar{S}}(\theta _n)\) in this EM-penalized algorithm by an approximation \(S_{n+1}\)—see Algorithm 3. When \(S_{n+1} = S_{n+1}^{\mathrm {mc}}\) (see Eq. 11), this yields the so-called Monte Carlo EM-penalized algorithm (MCEM-pen), trivially adapted from the MCEM proposed by Wei and Tanner (1990) and Levine and Fan (2004). Another popular strategy is to replace \({\bar{S}}(\theta _n)\) by \(S_{n+1}^\mathrm {sa}\) (see Eq. 12), yielding the so-called stochastic approximation EM-penalized algorithm (SAEM-pen) (see Delyon et al. 1999 for the unpenalized version).

(Algorithm 3: EM-pen with approximated E-step; pseudocode figure omitted)

When the maximization of Eq. (13) is not explicit, the update of the parameter is modified as follows, yielding the generalized EM-penalized algorithm (GEM-pen):

$$\begin{aligned}&\theta _{n+1} \ \text {s.t.} \phi (\theta _{n+1}) + \left<{\bar{S}}(\theta _n),\psi (\theta _{n+1})\right> -g(\theta _{n+1}) \nonumber \\&\quad \ge \phi (\theta _n) + \left<{\bar{S}}(\theta _n),\psi (\theta _n)\right> -g(\theta _n). \end{aligned}$$
(15)

This update rule still produces a sequence \(\{\theta _n, n\ge 0\}\) satisfying the ascent property \(F(\theta _{n+1}) \ge F(\theta _n)\), which is the key property for the convergence of EM (see, e.g., Wu 1983). Here again, the approximations defined in Eqs. (11) and (12) can be plugged into the GEM-pen update Eq. (15) when \({\bar{S}}\) is not explicit.

We show in the following proposition that the sequence \(\{\theta _n, n \ge 0\}\) produced by the proximal-gradient algorithm Eq. (10) is a GEM-pen sequence since it satisfies the inequality Eq. (15). As a consequence, MCPG and SAPG are stochastic GEM-pen algorithms.

Proposition 1

Let g satisfy H1 and let \(\ell \) be of the form Eq. (4) with continuously differentiable functions \(\phi \,{:}\,\mathbb {R}^d \rightarrow \mathbb {R}\), \(\psi \,{:}\,\mathbb {R}^d \rightarrow \mathbb {R}^q\) and \(S\,{:}\,\mathsf {Z}\rightarrow \mathbb {R}^q\). Set \(\Theta :=\{ g +|\ell | < \infty \}\). Define \({\bar{S}}\,{:}\,\Theta \rightarrow \mathbb {R}^q\) by \({\bar{S}}(\theta ) :=\int _\mathsf {Z}S(z) \, \pi _\theta (z) \, \nu (\mathrm {d}z)\) where \(\pi _\theta \) is given by Eq. (5). Assume that there exists a constant \(L>0\) such that for any \(s \in {\bar{S}}(\Theta )\) and any \(\theta , \theta ' \in \Theta \),

$$\begin{aligned}&\Vert \nabla \phi (\theta ) - \nabla \phi (\theta ') + \left( {\text {J}} \psi (\theta ) - {\text {J}} \psi (\theta ')\right) s \Vert \nonumber \\&\quad \le L \Vert \theta - \theta ' \Vert . \end{aligned}$$
(16)

Let \(\{\gamma _n, n \ge 0 \}\) be a (deterministic) positive sequence such that \(\gamma _n \in \left( 0, 1/L\right] \) for all \(n \ge 0\).

Then the proximal-gradient algorithm Eq. (10) is a GEM-pen algorithm for the maximization of \(\ell -g\).

The proof is postponed to “Appendix A”. The assumption Eq. (16) holds when \(\Theta \) is compact and \({\bar{S}}\) (resp. \(\phi \) and \(\psi \)) is continuous (resp. twice continuously differentiable). Note also that \(\left( {\text {J}} \psi (\theta ) - {\text {J}} \psi (\theta ')\right) s =0\) for any \(\theta , \theta ' \in \Theta \) and \(s \in {\bar{S}}(\Theta )\) as soon as \({\bar{S}}(\theta ) \in \mathrm {Ker}( {\text {J}} \psi (\theta '))\) for any \(\theta , \theta ' \in \Theta \).

3 Convergence of MCPG and SAPG

The convergence of MCPG and SAPG is established by applying recent results from Atchadé et al. (2017) on the convergence of perturbed proximal-gradient algorithms. Atchadé et al. (2017, Theorem 2), applied to the case where \(\nabla \ell (\theta )\) is of the form \(\nabla \phi (\theta ) + \varPsi (\theta ) {\bar{S}}(\theta )\), where \({\bar{S}}(\theta )\) is an intractable expectation and \(\nabla \phi , \varPsi \) are explicit, yields the following result.

Theorem 1

Assume H1, H2, that \(\theta \mapsto \ell (\theta )\) is concave, and that the set \({{\mathcal {L}}} :={\text {argmax}}_{\theta \in \Theta } F(\theta )\) is a non-empty subset of \(\Theta \). Let \(\{\theta _n, n \ge 0 \}\) be given by

$$\begin{aligned} \theta _{n+1} = {\text {Prox}}_{\gamma _{n+1},g}\left( \theta _n + \gamma _{n+1}\left\{ \nabla \phi (\theta _n) + \varPsi (\theta _n) S_{n+1} \right\} \right) , \end{aligned}$$

with a \(\left( 0,1/L\right] \)-valued stepsize sequence \(\{\gamma _n, n \ge 0 \}\) satisfying \( \sum _n \gamma _n = + \infty \). If the series

$$\begin{aligned}&\sum _n \gamma _{n+1} \left< \varPsi (\theta _n) \left( S_{n+1} - {\bar{S}}(\theta _n) \right) , T_{\gamma _{n+1}}(\theta _n) \right>, \\&\sum _n \gamma _{n+1} \varPsi (\theta _n) \left( S_{n+1} - {\bar{S}}(\theta _n) \right) , \\&\sum _n \gamma _{n+1}^2 \Vert \varPsi (\theta _n) \left( S_{n+1} - {\bar{S}}(\theta _n) \right) \Vert ^2, \end{aligned}$$

converge, where

$$\begin{aligned} T_\gamma (\theta ) :={\text {Prox}}_{\gamma ,g}(\theta +\gamma \left\{ \nabla \phi (\theta ) + \varPsi (\theta ) {\bar{S}}(\theta )\right\} ), \end{aligned}$$

then there exists \(\theta _\infty \in {{\mathcal {L}}}\) such that \(\lim _n \theta _n = \theta _\infty \).

We check the conditions of Theorem 1 in the case where \(S_{n+1}\) is given by Eq. (11) for the proof of MCPG and by Eq. (12) for the proof of SAPG. Our convergence analysis is restricted to the case where \(\ell \) is concave; to the best of our knowledge, the convergence of perturbed proximal-gradient algorithms when \(\ell \) is not concave is an open question.

The novelty in this section lies in Proposition 2 and Theorem 4, which provide, respectively, a control of the \(L_2\)-norm of the error \(S_{n+1}^\mathrm {sa}- {\bar{S}}(\theta _n)\) and the convergence of SAPG. These results rely on a rewriting of \(\left( S_{n+1}^\mathrm {sa}- {\bar{S}}(\theta _n) \right) \) taking into account that \(S_{n+1}^\mathrm {sa}\) is a weighted sum of the function S evaluated at all the samples \(\{Z_{i,j}, i \le m_{j+1}, j \le n \}\) drawn since the initialization of the algorithm. This approximation differs from the more classical Monte Carlo approximation (see Theorems 2 and 3 for the convergence of MCPG, which are special cases of the results in Atchadé et al. 2017).

We allow the simulation step of MCPG and SAPG to rely on Markov chain Monte Carlo sampling: at iteration \((n+1)\), the conditional distribution of \(Z_{j+1,n}\) given the past is \(P_{\theta _n}(Z_{j,n}, \cdot )\), where \(P_\theta \) is a Markov transition kernel having \(\pi _\theta \mathrm {d}\nu \) as its unique invariant distribution. The control of the quantities \(S_{n+1} - {\bar{S}}(\theta _n)\) requires some ergodic properties on the kernels \(\{P_{\theta _n}, n \ge 0\}\) along the path \(\{\theta _n, n \ge 0 \}\) produced by the algorithm. These properties have to be uniform in \(\theta \), a requirement often called the “containment condition” (see, e.g., the literature on the convergence of adaptive MCMC samplers, for example Andrieu and Moulines 2006; Roberts and Rosenthal 2007; Fort et al. 2011a). There are three main strategies to prove the containment condition. In the first strategy, \(\Theta \) is assumed to be bounded, and a uniform ergodic assumption on the kernels \(\{P_\theta , \theta \in \Theta \}\) is made. In the second one, there is no boundedness assumption on \(\Theta \) but the property \({\mathbb {P}}(\limsup _n \Vert \theta _n \Vert < \infty )=1\) has to be established prior to the proof of convergence; a kind of local boundedness condition on the sequence \(\{\theta _n, n\ge 0\}\) is then applied—see, e.g., Andrieu and Moulines (2006) and Fort et al. (2011a). The last strategy consists in showing that \({\mathbb {P}}( \sup _n \rho _n \Vert \theta _n \Vert < \infty ) =1\) for some deterministic sequence \(\{\rho _n, n \ge 0\}\) vanishing to zero when \(n \rightarrow \infty \) at a rate compatible with the decaying ergodicity rate—see, e.g., Saksman and Vihola (2010). The last two strategies are quite technical and require a strong background in controlled Markov chain theory; for pedagogical purposes, we therefore decided to state our results in the first context: we will assume that \(\Theta \) is bounded.

By allowing MCMC approximations, we propose a theory which covers the case of a biased approximation, called below the biased case: conditionally on the past

$$\begin{aligned} {\mathcal {F}}_n :=\sigma \left( Z_{i,j}, i \le m_{j+1}, j \le n-1\right) , \end{aligned}$$
(17)

the expectation of \(S_{n+1}\) is not \({\bar{S}}(\theta _n)\): \({\mathbb {E}}\left[ S_{n+1} \vert {\mathcal {F}}_n \right] \ne {\bar{S}}(\theta _n)\). As soon as the samplers \(\{P_\theta , \theta \in \Theta \}\) are ergodic enough (for example, under (H4a) and (H4b)), the bias vanishes when the number of Monte Carlo points \(m_n\) tends to infinity. Therefore, the proof for the biased case when the sequence \(\{m_n,n \ge 0 \}\) is constant is the most technical situation since the bias does not decay. It relies on a specific decomposition of the error \(S_{n+1} - {\bar{S}}(\theta _n)\) into a martingale increment with bounded \(L^2\)-moments, and a remainder term which vanishes when \(n \rightarrow \infty \) even when the batch size \(m_n\) is constant. Such a behavior of the remainder term is a consequence of regularity properties on the functions \(\nabla \phi , \varPsi , {\bar{S}}\) (see H3c), on the proximity operator (see H3d) and on the kernels \(\{P_\theta , \theta \in \Theta \}\) (see H4c).

Our theory also covers the unbiased case, i.e., when

$$\begin{aligned}{\mathbb {E}}\left[ S_{n+1} \vert {\mathcal {F}}_n \right] = {\bar{S}}(\theta _n). \end{aligned}$$

We therefore establish the convergence of MCPG and SAPG by strengthening the conditions H1 and H2 with

H3

(a)

    \(\ell \) is concave and the set \({{\mathcal {L}}} :={\text {argmax}}_{\Theta } F\) is a non-empty subset of \(\Theta \).

(b)

    \(\Theta \) is bounded.

(c)

    There exists a constant L such that for any \(\theta ,\theta ' \in \Theta \),

    $$\begin{aligned}&\Vert \nabla \phi (\theta ) - \nabla \phi (\theta ') \Vert + \Vert \varPsi (\theta ) - \varPsi (\theta ') \Vert \\&\quad +\,\Vert {\bar{S}}(\theta ) - {\bar{S}}(\theta ')\Vert \le L \Vert \theta - \theta ' \Vert , \end{aligned}$$

    where for a matrix \(A, \Vert A\Vert \) denotes the operator norm associated with the Euclidean vector norm.

(d)

    \(\sup _{\gamma \in \left( 0,1/L\right] } \sup _{\theta \in \Theta } \gamma ^{-1} \Vert {\text {Prox}}_{\gamma ,g}(\theta ) - \theta \Vert < \infty \).

Note that the assumptions (H3b)–(H3c) imply Eq. (3) and \(\sup _{\theta \in \Theta } \left( \Vert \nabla \phi (\theta ) \Vert + \Vert \varPsi (\theta ) \Vert + \Vert {\bar{S}}(\theta ) \Vert \right) < \infty \). When \(\Theta \) is a compact convex set, (H3d) holds for the elastic net, the lasso or the fused lasso penalty. Atchadé et al. (2017, Proposition 11) give general conditions for (H3d) to hold.

Before stating the ergodicity conditions on the kernels \(\{P_\theta , \theta \in \Theta \}\), let us recall some basic properties of Markov kernels. A Markov kernel P on the measurable set \((\mathsf {Z}, {\mathcal {Z}})\) is a function on \(\mathsf {Z}\times {\mathcal {Z}}\), taking values in \(\left[ 0,1\right] \), such that for any \(x \in \mathsf {Z}\), \(P(x,\cdot )\) is a probability measure on \({\mathcal {Z}}\), and for any \(A \in {\mathcal {Z}}\), \(x \mapsto P(x,A)\) is measurable. Furthermore, if P is a Markov kernel, \(P^k\) denotes the kth iterate of P defined by induction as

$$\begin{aligned}&P^0(x,A) :=\mathbb {1}_A(x), \\&P^k(x,A) :=\int P^{k-1}(x,\mathrm {d}z)P(z,A), \quad k \ge 1. \end{aligned}$$

Finally, the kernel P acts on probability measures: for any probability measure \(\xi \) on \({\mathcal {Z}}\), \(\xi P\) is the probability measure defined by

$$\begin{aligned} \xi P(A) :=\int \xi (\mathrm {d}z) P(z,A), \quad A \in {\mathcal {Z}}; \end{aligned}$$

and P acts on positive measurable functions: for a measurable function \(f\,{:}\,\mathsf {Z}\rightarrow \mathbb {R}_+\), \(Pf\) is the measurable function defined by

$$\begin{aligned} Pf(z) :=\int f(y) \, P(z, \mathrm {d}y). \end{aligned}$$
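On a finite state space, these three actions reduce to matrix operations, which may help fix ideas (the two-state kernel below is an arbitrary toy example):

```python
import numpy as np

# A Markov kernel on the two-point space {0, 1}, stored as a row-stochastic
# matrix: P[x, y] = P(x, {y}).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

P3 = np.linalg.matrix_power(P, 3)   # the iterate P^3

xi = np.array([1.0, 0.0])           # a probability measure on {0, 1}
xiP = xi @ P                        # action on measures: (xi P)(A)

f = np.array([2.0, 5.0])            # a nonnegative function on {0, 1}
Pf = P @ f                          # action on functions: (P f)(z)
```

The invariant distribution solves \(\pi = \pi P\); H4 below asks for quantitative versions of such ergodic properties, uniformly in \(\theta \).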

We refer the reader to Meyn and Tweedie (2009) for the definitions and basic properties on Markov chains. Given a measurable function \(W\,{:}\,\mathsf {Z}\rightarrow \left[ 1,+\infty \right) \), define the W-norm of a signed measure \(\nu \) on \({\mathcal {Z}}\) and the W-norm of a function \(f\,{:}\, \mathsf {Z}\rightarrow \mathbb {R}^d\):

$$\begin{aligned} \left| f \right| _W :=\sup _{\mathsf {Z}} \frac{\Vert f\Vert }{W}, \quad \left\| \nu \right\| _W :=\sup _{f\,{:}\,\left| f \right| _W \le 1} \left| \int f \mathrm {d}\nu \right| ; \end{aligned}$$

these norms generalize, respectively, the supremum norm of a function and the total variation norm of a measure.

Our results are derived under the following conditions on the kernels:

H4

(a)

    There exist \(\lambda \in \left( 0,1\right] , b < \infty \) and a measurable function \(W\,{:}\,\mathsf {Z}\rightarrow \left[ 1,+\infty \right) \) such that

    $$\begin{aligned} \left| S \right| _{\sqrt{W}} < \infty , \quad \sup _{\theta \in \Theta } P_\theta W \le \lambda W + b. \end{aligned}$$
(b)

    There exist constants \(C< \infty \) and \(\rho \in \left( 0,1\right) \) such that for any \(z \in \mathsf {Z}\) and \(n \ge 0\),

    $$\begin{aligned} \sup _{\theta \in \Theta } \left\| P_\theta ^n(z, \cdot ) - \pi _\theta \right\| _W \le C \, \rho ^n \, W(z). \end{aligned}$$
(c)

    There exists a constant C such that for any \(\theta , \theta ' \in \Theta \),

    $$\begin{aligned}&\left\| \pi _\theta - \pi _{\theta '} \right\| _{\sqrt{W}} + \sup _{z \in \mathsf {Z}} \frac{\left\| P_\theta (z,\cdot ) - P_{\theta '}(z,\cdot ) \right\| _{\sqrt{W}}}{\sqrt{W}(z)} \\&\quad \le C \, \Vert \theta - \theta ' \Vert . \end{aligned}$$

Sufficient conditions for the uniform-in-\(\theta \) ergodic behavior (H4b) are given, e.g., in Fort et al. (2011b, Lemma 2.3): this lemma shows how to deduce such a control from a minorization condition and a drift inequality on the Markov kernels. Examples of MCMC kernels \(P_\theta \) satisfying these assumptions can be found in Andrieu and Moulines (2006, Proposition 12) and Saksman and Vihola (2010, Proposition 15) for the adaptive Hastings–Metropolis algorithm, in Fort et al. (2011b, Proposition 3.1) for an interactive tempering sampler, in Schreck et al. (2013, Proposition 3.2) for the equi-energy sampler, and in Fort et al. (2015, Proposition 3.1) for a Wang–Landau type sampler.

Theorem 2 establishes the convergence of MCPG when the number of points in the Monte Carlo sum \(S_{n+1}^\mathrm {mc}\) is constant over iterations and the step size sequence \(\{\gamma _n, n \ge 0\}\) vanishes at a convenient rate. It is proved in Atchadé et al. (2017, Theorem 4).

Theorem 2

Assume H1, H2, (H3a–c) and (H4a–b). Let \(\{\theta _n, n \ge 0 \}\) be the sequence given by Algorithm 1 with a \(\left( 0,1/L\right] \)-valued sequence \(\{\gamma _n, n \ge 0 \}\) such that \(\sum _n \gamma _n = + \infty \) and \(\sum _n \gamma _n^2<\infty \), and with a constant sequence \(\{m_n, n \ge 0 \}\).

In the biased case, assume also (H3d) and (H4c) and \(\sum _n |\gamma _{n+1} - \gamma _n |< \infty \).

Then, with probability one, there exists \(\theta _\infty \in {{\mathcal {L}}}\) such that \(\lim _n \theta _n = \theta _\infty \).

Theorem 3 establishes the convergence of MCPG when the number of points in the Monte Carlo sum \(S_{n+1}^\mathrm {mc}\) is increasing; it allows a constant stepsize sequence \(\{\gamma _n, n \ge 0\}\). It is proved in Atchadé et al. (2017, Theorem 6).

Theorem 3

Assume H1, H2, (H3a–c) and (H4a–b). Let \(\{\theta _n, n \ge 0 \}\) be the sequence given by Algorithm 1 with a \(\left( 0,1/L\right] \)-valued sequence \(\{\gamma _n, n \ge 0 \}\) and an integer valued sequence \(\{m_n,n \ge 0\}\) such that \(\sum _n \gamma _n = +\infty \) and \(\sum _n \gamma _{n}^2/m_{n} < \infty \).

In the biased case, assume also \(\sum _n \gamma _{n}/m_{n}< \infty \).

Then, with probability one, there exists \(\theta _\infty \in {{\mathcal {L}}}\) such that \(\lim _n \theta _n = \theta _\infty \).

MCPG and SAPG differ in their approximation of \({\bar{S}}(\theta _n)\) at each iteration. We provide below a control of this error for a constant or a polynomially increasing batch size \(\{m_n, n \ge 0 \}\), and polynomially decreasing stepsize sequences \(\{\gamma _n, n \ge 0 \}\) and \(\{\delta _n, n \ge 0 \}\).

Proposition 2

Let \(\gamma _\star , \delta _\star , m_\star \) be positive constants and \(\beta \in \left[ 0,1\right) , \alpha \ge \beta , c\ge 0\). Set \(\gamma _n = \gamma _\star n^{-\alpha }, \delta _n = \delta _\star n^{-\beta }\) and \(m_n = m_\star n^{c}\). Assume H1 to H4. Then

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}\left[ \Vert S_{n+1}^\mathrm {mc}- {\bar{S}}(\theta _n)\Vert ^2 \right] = O\left( n^{-c} \right) , \\&{\mathbb {E}}\left[ \Vert S_{n+1}^\mathrm {sa}- {\bar{S}}(\theta _n) \Vert ^2\right] =O\left( n^{- \{ 2(\alpha -\beta ) \wedge (\beta +c)\}} \right) . \end{aligned} \end{aligned}$$

The proof is given in “Appendix C”. This proposition shows that when MCPG is applied with a constant batch size \((c=0)\), the error \(S_{n+1}^\mathrm {mc}- {\bar{S}}(\theta _n)\) does not vanish; this is not the case for SAPG: even when \(c=0\), the error \(S_{n+1}^\mathrm {sa}- {\bar{S}}(\theta _n)\) vanishes as soon as \(\alpha> \beta >0\). Since a constant batch size is the usual choice of practitioners in order to reduce the computational cost of the algorithm, this proposition supports the use of SAPG over MCPG.
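The qualitative difference can be reproduced on a stylized simulation (all values below are hypothetical, and the target \({\bar{S}}\) is held fixed, which ignores the coupling with \(\theta _n\)): the plain Monte Carlo estimate with constant batch size keeps a mean squared error of order \(1/m\), while the stochastic approximation average with \(\delta _n = n^{-\beta }\) drives it to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
s_bar, m, beta = 2.0, 60, 0.5        # hypothetical target and tuning values
n_iter, reps = 2000, 200

sq_err_mc = sq_err_sa = 0.0
for _ in range(reps):
    S_sa = 0.0
    for n in range(1, n_iter + 1):
        # mean of a batch of m unit-variance draws centered at s_bar
        batch_mean = s_bar + rng.normal() / np.sqrt(m)
        S_sa += n ** (-beta) * (batch_mean - S_sa)   # SA averaging (SAPG-like)
    sq_err_mc += (batch_mean - s_bar) ** 2           # plain MC uses the last batch only
    sq_err_sa += (S_sa - s_bar) ** 2

mse_mc, mse_sa = sq_err_mc / reps, sq_err_sa / reps  # mse_mc stays near 1/m
```

The MC error stabilizes around \(1/m\), whereas the SA error is an order of magnitude smaller at the final iteration.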

We finally study the convergence of SAPG without assuming that the batch size sequence \(\{m_{n}, n \ge 0\}\) is constant; this leads to the following assumption on the sequences \(\{\gamma _n, \delta _n, m_n, n \ge 0\}\).

H 5

The step size sequences \(\{\gamma _n, n \ge 0 \}, \{\delta _n, n \ge 0 \}\) and the batch size sequence \(\{m_n, n \ge 0 \}\) satisfy

  1. (a)

    \(\gamma _n \in \left( 0,1/L\right] , \delta _n \in \left( 0,1\right) , m_n \in \mathbb {N}, \sum _n \gamma _n = + \infty , \sum _n \gamma _n^2 < \infty \),

    $$\begin{aligned}&\sum _{n } \left( \gamma _{n-1} \gamma _{n} + \gamma _{n-1}^2 + |\gamma _n - \gamma _{n-1}| \right) \mathsf {D}_n< \infty , \\&\quad \sum _n \gamma _n^2 \delta _n^2 (1+\mathsf {D}_{n+1})^2 m_n^{-1} < \infty , \end{aligned}$$

    where \(\mathsf {D}_n :=\sum _{k \ge n} \left( \prod _{j=n}^k (1-\delta _j)\right) \).

  2. (b)

    Furthermore,

    $$\begin{aligned}&\sum _n \gamma _{n+1} |m_{n+1}^{-1} \delta _{n+1} - m_n^{-1} \delta _n|< \infty , \\&\sum _n \gamma _{n+1} |m_{n+1}^{-1} \delta _{n+1} \mathsf {D}_{n+2} - m_n^{-1} \delta _n \mathsf {D}_{n+1}|< \infty , \\&\sum _n \left( \gamma _{n-1} \gamma _{n} + \gamma _{n-1}^2 + |\gamma _n - \gamma _{n-1}| \right) \ldots \\&\quad \times m_{n-1}^{-1} \, \delta _{n-1} (1 + \mathsf {D}_{n}) < \infty . \end{aligned}$$

Let us comment on this assumption in the case where the batch size sequence \(\{m_n, n \ge 0 \}\) is constant. This situation corresponds to the “stochastic approximation regime”, where the number of draws at each iteration is \(m_n=1\) (or, say, \(m_n = m\) for any n); it is also the usual practice to reduce the computational cost. When \(\delta _n= \delta _\star \in \left( 0,1\right) \) for any \(n \ge 0\), then \(\mathsf {D}_n = \delta _\star ^{-1}\) for any \(n \ge 0\). Consequently, condition H5 is satisfied with polynomially decreasing sequences \(\gamma _n \sim \gamma _\star / n^\alpha \), \(\alpha \in \left( 1/2,1\right] \) (and \(m_n = m\) for any n).

When \(\delta _n \sim \delta _\star \, n^{-\beta }\) for \(\beta \in \left( 0,1\right) \), then \(\mathsf {D}_n = O( n^\beta )\) (see Lemma 3). Hence, using Lemma 3, (H5a) and (H5b) are satisfied with \(\gamma _n \sim \gamma _\star n^{-\alpha }\) where \(\beta< (1+\beta )/2 < \alpha \le 1\), and \(m_n = m\) for any n.

We cannot have \(\delta _n = \delta _\star n^{-1}\) since it implies \(\mathsf {D}_n = + \infty \) for any \(n \ge 0\).
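The quantity \(\mathsf {D}_n\) is easy to evaluate numerically, which is a convenient way to check candidate step size sequences against H5. A sketch (with the truncated sum computed exactly as displayed, the constant-step value is \((1-\delta _\star )/\delta _\star \), i.e., of order \(\delta _\star ^{-1}\), the exact constant depending on whether the first factor is counted; for \(\delta _j = \delta _\star j^{-\beta }\), \(\mathsf {D}_n\) grows like \(n^\beta \)):

```python
def D(n, delta, max_terms=10**6, tol=1e-14):
    """Truncated evaluation of D_n = sum_{k>=n} prod_{j=n}^k (1 - delta(j))."""
    total, prod = 0.0, 1.0
    for k in range(n, n + max_terms):
        prod *= 1.0 - delta(k)
        total += prod
        if prod < tol:          # remaining terms are negligible
            break
    return total

delta_star, beta = 0.5, 0.5

# Constant step: geometric sum, equal to (1 - delta_star)/delta_star here.
d_const = D(10, lambda j: delta_star)

# Polynomial step delta_j = delta_star * j**(-beta): D_n grows like n**beta.
d100 = D(100, lambda j: delta_star * j ** (-beta))
d400 = D(400, lambda j: delta_star * j ** (-beta))
ratio = d400 / d100             # should be close to (400/100)**beta = 2
```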

Theorem 4

Assume H1, H2, H3 and (H4a–b). Let \(\{\theta _n, n \ge 0 \}\) be the sequence given by Algorithm 2 and applied with sequences \(\{\gamma _n, \delta _n, m_n, n \ge 0\}\) verifying (H5a).

In the biased case, assume also (H4c) and (H5b).

Then with probability one, there exists \(\theta _\infty \in {{\mathcal {L}}}\) such that \(\lim _n \theta _n = \theta _\infty \).

Proof

The proof is in Section D. \(\square \)

4 Numerical illustration in the convex case

In this section, we illustrate the behavior of the algorithms MCPG and SAPG on a toy example. We first introduce the example and then give some guidelines for a specific choice of the sequences \(\{\delta _n, n \ge 0\}, \{\gamma _n, n \ge 0\}\). Finally, the algorithms are compared more systematically on repeated simulations.

4.1 A toy example

The example is a mixed model in which the regression function is linear in the latent variable \(Z\). More precisely, we observe data \((\mathsf {Y}_1, \ldots , \mathsf {Y}_N)\) from N subjects, the data of each individual being a vector of size J: \(\mathsf {Y}_k :=(\mathsf {Y}_{k1}, \ldots , \mathsf {Y}_{kJ})\). For subject \(k\), \(k=1, \ldots , N\), \(\mathsf {Y}_{kj}\) is the jth measurement, taken at time \(t_{kj}\), \(j=1, \ldots , J\). It is assumed that \(\{\mathsf {Y}_{k}, k=1, \ldots , N\}\) are independent and that for all \(k=1, \ldots , N\),

$$\begin{aligned} \begin{aligned}&\mathsf {Y}_{kj} \vert Z^{(k)} {\mathop {\sim }\limits ^{\mathrm{ind}}} {\mathcal {N}}\left( \left<Z^{(k)},{\bar{t}}_{kj}\right> ,1\right) , \\&\bar{t}_{kj} :=\begin{bmatrix} 1 \\ t_{kj} \end{bmatrix} \quad j=1, \ldots , J; \end{aligned} \end{aligned}$$
(18)

that is, a linear regression model with an individual random intercept and slope, gathered in the \(\mathbb {R}^2\)-valued vector \(Z^{(k)}\). The latent variable is \(Z=(Z^{(1)}, \ldots , Z^{(N)})\). Furthermore,

$$\begin{aligned} Z^{(k)} {\mathop {\sim }\limits ^{\mathrm{ind}}} {\mathcal {N}}_2( X_k \theta , I_2); \end{aligned}$$
(19)

here, \(\theta \in \mathbb {R}^{2(D+1)}\) is an unknown parameter and the design matrix \(X_k \in \mathbb {R}^{2 \times 2(D+1)}\) is known:

$$\begin{aligned} X_k :=\begin{bmatrix} 1&X_{k1}&\ldots&X_{k D}&0&0&\ldots&0 \\ 0&0&\ldots&0&1&X_{k1}&\ldots&X_{k D} \end{bmatrix}. \end{aligned}$$
(20)

The optimization problem of the form Eq. (1) that we consider is the maximization of the log likelihood \(\ell (\theta )\) penalized by a lasso penalty; the objective is the selection of the influential covariates

$$\begin{aligned}(X_{k1}, \ldots , X_{kD})\end{aligned}$$

on the two components of \(Z^{(k)}\). We thus penalize all the elements except \(\theta _1\) and \(\theta _{D+2}\) which correspond to the two intercepts; hence, we set

$$\begin{aligned} g(\theta ) :=\lambda \sum _{r \notin \{1, D+2 \}} |\theta _r|. \end{aligned}$$
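The proximal operator of \(\gamma g\) for this penalty is componentwise soft-thresholding that leaves the two intercept coordinates untouched. A minimal sketch (0-based indexing, so the unpenalized intercepts sit at positions 0 and \(D+1\); the function name is ours):

```python
import numpy as np

def prox_penalty(theta, step, lam, D):
    """Soft-thresholding for g(theta) = lam * sum over penalized r of |theta_r|.

    With 0-based indexing, the two intercepts sit at positions 0 and D+1
    and are left unpenalized.
    """
    out = np.sign(theta) * np.maximum(np.abs(theta) - step * lam, 0.0)
    out[0] = theta[0]          # intercept of the random intercept part
    out[D + 1] = theta[D + 1]  # intercept of the random slope part
    return out

theta = np.array([3.0, 0.2, -0.5, 2.5, 0.1, -1.0])   # toy vector with D = 2
res = prox_penalty(theta, step=0.1, lam=2.0, D=2)
```

Coordinates below the threshold \(\gamma \lambda \) are set exactly to zero, which is what produces the sparse supports reported below.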

The above model is a latent variable model whose complete log likelihood is equal, up to an additive constant, to

$$\begin{aligned} - \frac{1}{2}\sum _{k=1}^N\left\{ \sum _{j=1}^{J} \left( \mathsf {Y}_{kj} - \left<Z^{(k)},{\bar{t}}_{kj}\right>\right) ^2 + \left\| Z^{(k)} - X_k \theta \right\| ^2 \right\} . \end{aligned}$$

It is of the form \(\phi (\theta ) + \left<S(z),\psi (\theta )\right>\) by setting (with \((\cdot )'\) denoting the transpose of a matrix)

$$\begin{aligned} \begin{aligned}&\phi (\theta ) :=-\frac{1}{2} \theta ' \left( \sum _{k=1}^N X_k' X_k\right) \theta - \frac{1}{2} \sum _{k=1}^N \sum _{j=1}^J \mathsf {Y}_{kj}^2, \\&\psi (\theta ) :=\begin{bmatrix} 1 \\ \theta \end{bmatrix} \in \mathbb {R}^{1+2(D+1)}, \\&S(z^{(1)}, \ldots , z^{(N)}) \\&\quad :=- \frac{1}{2} \sum _{k=1}^N \begin{bmatrix} {z^{(k)}}' (I+T_k) z^{(k)}- 2 \left<z^{(k)},{\bar{\mathsf {Y}}}_k\right> \\ - 2 X_k' z^{(k)} \end{bmatrix}, \\&T_k :=\sum _{j=1}^{J} {\bar{t}}_{kj} \bar{t}_{kj}',\\&{\bar{\mathsf {Y}}}_k :=\sum _{j=1}^{J} \mathsf {Y}_{kj} {\bar{t}}_{kj}. \end{aligned} \end{aligned}$$

The a posteriori distribution \(\pi _\theta \) is a Gaussian distribution on \(\mathbb {R}^{2N}\), equal to the product of N Gaussian distributions on \(\mathbb {R}^2\):

$$\begin{aligned}&\pi _\theta (z^{(1)}, \ldots , z^{(N)}) :=\prod _{k=1}^N {\mathcal {N}}_2\nonumber \\&\quad \left( (I+T_k)^{-1} \left( {\bar{\mathsf {Y}}}_k + X_k \theta \right) , (I+T_k)^{-1} \right) [z^{(k)}]. \end{aligned}$$
(21)

Hence, \({\bar{S}}(\theta )\) is explicit and given by

$$\begin{aligned}&{\bar{S}}(\theta ) =-\frac{1}{2} \nonumber \\&\quad \sum _{k=1}^N \begin{bmatrix} \mathrm {Trace}((I+T_k) \Sigma _k) -2 {\bar{\mathsf {Y}}}_k' (I+T_k)^{-1} \left( {\bar{\mathsf {Y}}}_k + X_k \theta \right) \\ -2 X_k' (I+T_k)^{-1} \left( {\bar{\mathsf {Y}}}_k + X_k \theta \right) \end{bmatrix}\nonumber \\ \end{aligned}$$
(22)

with

$$\begin{aligned} \Sigma _k:= & {} (I+T_k)^{-1} + (I+T_k)^{-1} \left( {\bar{\mathsf {Y}}}_k + X_k \theta \right) \nonumber \\&\left( {\bar{\mathsf {Y}}}_k + X_k \theta \right) ' (I+T_k)^{-1}. \end{aligned}$$
(23)
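As a sanity check of Eqs. (21)–(23), the closed-form \({\bar{S}}(\theta )\) must agree with a Monte Carlo average of \(S(z)\) under exact draws from \(\pi _\theta \). The sketch below does this on a small synthetic instance (sizes, times and designs are arbitrary, not the simulation setting of Sect. 4.1):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, J, M = 3, 1, 4, 400_000
d = 2 * (D + 1)
theta = rng.normal(size=d)

S_closed = np.zeros(1 + d)
S_mc = np.zeros(1 + d)
for k in range(N):
    t = rng.uniform(0, 4, size=J)
    tbar = np.vstack([np.ones(J), t])                 # columns are (1, t_kj)'
    Xk = np.zeros((2, d))
    Xk[0, :D + 1] = [1.0, rng.normal()]               # Eq. (20) with D = 1
    Xk[1, D + 1:] = Xk[0, :D + 1]
    Y = rng.normal(size=J)
    Tk = tbar @ tbar.T                                # T_k = sum_j tbar tbar'
    Ybar = tbar @ Y                                   # Ybar_k = sum_j Y_kj tbar
    C = np.linalg.inv(np.eye(2) + Tk)                 # posterior covariance, Eq. (21)
    mk = C @ (Ybar + Xk @ theta)                      # posterior mean, Eq. (21)
    Sigma = C + np.outer(mk, mk)                      # Eq. (23)
    S_closed += -0.5 * np.concatenate((
        [np.trace((np.eye(2) + Tk) @ Sigma) - 2 * Ybar @ mk],
        -2 * Xk.T @ mk))                              # Eq. (22)
    Z = rng.multivariate_normal(mk, C, size=M)        # exact draws from pi_theta
    quad = np.einsum('ij,jk,ik->i', Z, np.eye(2) + Tk, Z).mean()
    S_mc += -0.5 * np.concatenate((
        [quad - 2 * Z.mean(axis=0) @ Ybar],
        -2 * Xk.T @ Z.mean(axis=0)))
```

Both vectors agree up to Monte Carlo error, confirming the closed form term by term.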

Finally, note that in this example, the function \(\ell \) is explicit and given by (up to an additive constant)

$$\begin{aligned} \begin{aligned} \ell (\theta ) =&- \frac{1}{2} \theta ' \left( \sum _{k=1}^N X_k' X_k \right) \theta \\&+ \frac{1}{2} \sum _{k=1}^N ({\bar{\mathsf {Y}}}_k + X_k \theta )' (I+T_k)^{-1} ({\bar{\mathsf {Y}}}_k + X_k \theta ). \end{aligned} \end{aligned}$$

Thus \(\ell \) is a concave function. Furthermore, in this toy example, \(\theta \mapsto \nabla \ell (\theta )\) is linear so that the Lipschitz constant L is explicit and equal to

$$\begin{aligned} L = \Vert - \sum _{k=1}^N X_k' X_k + \sum _{k=1}^N X_k' (I+T_k)^{-1} X_k \Vert _2, \end{aligned}$$
(24)

where for a matrix A, \(\Vert A \Vert _2\) denotes the spectral norm. Finally, we restricted the parameter set to \(\Theta = \{ \theta \in \mathbb {R}^{2(D+1)} \vert \Vert \theta \Vert < 10^4 \}\) in order to fulfill the theoretical boundedness assumption. The MCMC algorithm includes a projection step onto \(\Theta \) if necessary; in practice, this projection was never triggered.
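Since \(\nabla \ell \) is linear, \(\nabla \ell (\theta ) - \nabla \ell (\theta ') = H (\theta - \theta ')\) with \(H = -\sum _k X_k' X_k + \sum _k X_k'(I+T_k)^{-1} X_k\), so Eq. (24) can be checked directly. A sketch on a hypothetical small instance (none of the values below come from the simulation study):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, J = 3, 1, 4
d = 2 * (D + 1)

Xs, Ts, Ybars = [], [], []
for k in range(N):
    t = rng.uniform(0, 4, size=J)
    tbar = np.vstack([np.ones(J), t])
    Xk = np.zeros((2, d))
    Xk[0, :D + 1] = [1.0, rng.normal()]
    Xk[1, D + 1:] = Xk[0, :D + 1]
    Xs.append(Xk)
    Ts.append(tbar @ tbar.T)
    Ybars.append(tbar @ rng.normal(size=J))

# Hessian of ell (constant, since the gradient is linear) and L from Eq. (24)
H = (-sum(Xk.T @ Xk for Xk in Xs)
     + sum(Xk.T @ np.linalg.inv(np.eye(2) + Tk) @ Xk for Xk, Tk in zip(Xs, Ts)))
L = np.linalg.norm(H, 2)                # spectral norm

def grad_ell(th):
    """Gradient of the explicit log likelihood of this toy model."""
    g = -sum(Xk.T @ Xk for Xk in Xs) @ th
    for Xk, Tk, Yb in zip(Xs, Ts, Ybars):
        g += Xk.T @ np.linalg.inv(np.eye(2) + Tk) @ (Yb + Xk @ th)
    return g
```

The Lipschitz inequality Eq. (3) then holds with this L for every pair of parameters.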

A data set is simulated from this model with \(N=40, J=8, D=300\) and \(t_{kj} \in \{ 0.25,4,6,8,10,12,14,16 \}, \forall k \in \{1,\ldots ,N\}\). The design components \((X_{k1}, \ldots , X_{kD})\) (see Eq. 20) are drawn from a centered Gaussian distribution with covariance matrix \(\Gamma \) defined by \(\Gamma _{rr'} = 0.5^{\vert r - r' \vert }\) (\(r,r'=1,\ldots ,300\)). To sample the observations, we use a parameter vector \(\theta ^\star \) defined as follows: \(\theta _1^\star = \theta _{D+2}^\star =1\); the other components are set to zero, except 12 randomly selected components (6 among the components \(\{2, \ldots , D+1\}\) and 6 among the components \(\{D+3, \ldots , 2D+2 \}\)), whose values are drawn uniformly in \(\left[ 0.5,1.5\right] \) (see the bottom row in Fig. 7).

4.2 Guidelines for the implementation

In this section, we give some guidelines on the choice of the sequences \(\{\delta _n, n \ge 0\}\) and \(\{\gamma _n, n \ge 0\}\). We illustrate the results on single runs of each algorithm. We use the same random draws for all the algorithms to avoid potential differences due to the randomness of the simulations. Similar results have been observed when simulations are replicated. We refer to Sect. 4.3 for replicated simulations.

Classical sequences \(\{\delta _n, n \ge 0\}\) and \(\{\gamma _n, n \ge 0\}\) are of the form:

$$\begin{aligned} \gamma _{n+1}&= \left\{ \begin{array}{l@{\quad }l} \gamma _\star &{}\quad \text { if } n \le n_\alpha , \\ \gamma _\star (n-n_\alpha )^{-\alpha } &{} \quad \text { if } n>n_\alpha , \end{array} \right. \end{aligned}$$
(25)
$$\begin{aligned} \delta _{n+1}&= \left\{ \begin{array}{l@{\quad }l} \delta _\star &{} \quad \text { if } n \le n_\beta , \\ \delta _\star (n-n_\beta )^{-\beta } &{}\quad \text { if } n>n_\beta . \end{array} \right. \end{aligned}$$
(26)
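Both schedules share the same shape (a constant plateau, then polynomial decay), so a single helper suffices; a minimal sketch, with the function name ours:

```python
def stepsize(n, star, exponent, n_switch):
    """Eqs. (25)-(26): the value used at iteration n + 1 is constant up to
    n_switch, then decays polynomially in (n - n_switch)."""
    if n <= n_switch:
        return star
    return star * (n - n_switch) ** (-exponent)

# e.g. gamma_{n+1} with gamma_star = 0.009, alpha = 0.75, n_alpha = 0
gammas = [stepsize(n, 0.009, 0.75, 0) for n in range(6)]
```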

Impact of \(\gamma _\star \) and \(\delta _\star \) on the transient phase: The theoretical study of the asymptotic behavior of SAPG and MCPG is derived under the assumption that \(\gamma _n \le 1/L\): when \(\alpha >0\), this property holds for all n large enough. In this section, we illustrate the role of \(\gamma _n, \delta _n\) for small values of n, that is, in the transient phase of the algorithm. In Fig. 1, we display the behavior of MCPG and SAPG for two different values of the initial point \(\theta _{n=0}\): on the left, a standard initialization (\(\theta _{n=0} = (0, \ldots , 0)\)); on the right, a poor initialization, which mimics what may happen in practice in challenging numerical applications.

Fig. 1
figure 1

Estimation of the component \(\# 245\) of the vector \(\theta \) along 10,000 iterations (x-axis is in \(\log _{10}\) scale) for MCPG and SAPG. MCPG is represented in green solid line. A run of SAPG is displayed in dashed red line in the case \(\delta _\star =0.8\); in dashed-dotted yellow line in the case \(\delta _\star =0.5\) and in dotted blue line in the case \(\delta _\star =0.2\). For all runs, \(\gamma _{\star }=0.009, n_\alpha = n_\beta = 0, (\alpha , \beta ) = (0.75, 0.5)\) and \(m_n = 60\). The vertical black dashed line corresponds to the smallest n such that \(\gamma _{n} \le 1/L\). Left: standard initialization of \(\theta \) (\(\theta _{n=0}\) is the null vector). Right: poor initialization of \(\theta \). The penalty term \(\lambda \) was set to 50 for all the runs. (Color figure online)

Fig. 2
figure 2

Estimation of the component \(\# 245\) of the vector \(\theta \) along 10,000 iterations (x-axis is in \(\log _{10}\) scale). Runs of SAPG with \((\gamma _\star , \delta _\star )=(0.015,0.5), n_\alpha = n_\beta =0\) and \(m_n=60\) and different values of \((\alpha , \beta )\). (Left) \(\alpha = 1\) and \(\beta =0.9\) (green solid line), \(\beta =0.6\) (red dashed line), \(\beta =0.3\) (yellow dotted line), \(\beta =0\) (blue dash-dotted line). (Middle) \(\alpha = 0.8\) and \(\beta =0.5\) (green solid line), \(\beta =0.3\) (red dashed line), \(\beta =0.1\) (yellow dotted line), \(\beta =0\) (blue dash-dotted line). (Right) \(\alpha = 0.6\) and \(\beta =0.2\) (green solid line), \(\beta =0.1\) (yellow dotted line), \(\beta =0\) (blue dash-dotted line). The penalty term \(\lambda \) was set to 50 for all the runs. (Color figure online)

On both plots, we indicate by a vertical line the smallest n such that \(\gamma _{n} \le 1/L\)—remember that in this example, L is explicit (see Eq. 24). The plots show the estimation of component #245, as a function of the number of iterations n. In all cases, \(n_\alpha = n_\beta =0, \alpha = 0.75, m_n= 60\), and for SAPG, \(\beta = 0.5\). The dotted blue curve displays a run of SAPG when \((\gamma _\star ,\delta _\star ) = (0.009,0.2)\); the dashed-dotted yellow curve displays a run of SAPG when \((\gamma _\star ,\delta _\star ) = (0.009,0.5)\); the dashed red curve displays a run of SAPG when \((\gamma _\star ,\delta _\star ) =(0.009,0.8)\); the green solid curve displays a run of MCPG when \(\gamma _\star = 0.009\).

The stability of MCPG during the transient phase depends crucially on the first values of the sequence \(\{\gamma _n, n \ge 0\}\). When n is large enough that \(\gamma _{n} \le 1/L\) (after the vertical line), MCPG becomes more stable and its path smoother. For SAPG, a small value of \(\delta _\star \) gives the initial point \(\theta _{n=0}\) a long-lasting influence: when this initial point is poorly chosen, a small value of \(\delta _\star \) delays the convergence of SAPG. A value of \(\delta _\star \) around 0.5 is a good compromise.

Role of \({\alpha }\) and \({\beta }\): Figure 2 displays the behavior of SAPG for different values of \(\alpha \) and \(\beta \), with \((\gamma _\star , \delta _\star )=(0.015, 0.5), n_\alpha = n_\beta =0\) and \(m_n=60\). The plots show that the larger \(\alpha \) is, the longer the transient phase lasts. We therefore recommend setting \(\alpha \) close to 0.6. The parameter \(\beta \) seems to have an impact only when \(\alpha \) is close to 1. We therefore recommend keeping \(\delta _n\) constant during the transient phase (\(n_\beta >0\)) and then decreasing it rapidly in the convergence phase.

Random stepsize sequence \(\{\gamma _n, n \ge 0\}\): The convergence of the SAPG algorithm can suffer from scale differences between the components of the parameter when the same stepsize sequence \(\{\gamma _n, n \ge 0\}\) is applied to each component of \(\theta _n\).

Ideally, each component of \(\theta _n\) should have its own \(\gamma _n\) value, adapted to its scale, but hand-tuning a sequence that ensures fast and stable convergence of the algorithm can be time-consuming. As an alternative, we suggest using a matrix-valued random sequence \(\{\Gamma ^n, n\ge 0\}\) and replacing the update rule of SAPG by

$$\begin{aligned} (\theta _{n+1})_i = {\text {Prox}}_{ \Gamma ^{n+1}_{ii} g} \left( (\theta _n)_i + \Gamma ^{n+1}_{ii} \left( \nabla \phi (\theta _n) + \varPsi (\theta _n) S_{n+1}^\mathrm {sa}\right) _i \right) . \end{aligned}$$

We propose to define the matrix \(\Gamma ^{n+1}\) as a diagonal matrix with entries \( \Gamma ^{n+1}_{ii}\) depending on \(H^n_{ii}\), where \(H^n\) is an approximation of the Hessian of \(-\ell (\theta )\) (we give an example of such an approximation in Sect. 5). Through numerical experiments, we observed that \(H^n\) converges asymptotically. Hence, to ensure a stepsize sequence decaying like \(O(n^{-\alpha })\) asymptotically, we propose the following definition of the random sequence:

$$\begin{aligned} \Gamma ^{n+1}_{ii} = \left\{ \begin{array}{l@{\quad }l} 1/\vert H^n_{ii}\vert &{}\quad \text { if } n \le n_{\alpha }, \\ \left( (n-n_{\alpha })^{\alpha } \vert H^n_{ii}\vert \right) ^{-1} &{}\quad \text { if } n>n_{\alpha }. \end{array} \right. \end{aligned}$$
(27)
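For the lasso-type penalties used here, the componentwise update with the diagonal matrix \(\Gamma ^{n+1}\) of Eq. (27) reduces to soft-thresholding with a per-coordinate threshold. A sketch, assuming the diagonal of the Hessian approximation \(H^n\) is supplied as a vector (the function names and the handling of unpenalized indices are ours):

```python
import numpy as np

def gamma_diag(n, H_diag, alpha, n_alpha):
    """Diagonal stepsizes of Eq. (27); H_diag stands for diag(H^n)."""
    g = 1.0 / np.abs(H_diag)
    if n > n_alpha:
        g = g * (n - n_alpha) ** (-alpha)
    return g

def precond_update(theta, drift, H_diag, lam, free_idx, n, alpha, n_alpha):
    """One preconditioned proximal update; `drift` stands for the vector
    grad phi(theta_n) + Psi(theta_n) S^sa_{n+1}, and free_idx lists the
    unpenalized coordinates."""
    g = gamma_diag(n, H_diag, alpha, n_alpha)
    z = theta + g * drift
    out = np.sign(z) * np.maximum(np.abs(z) - g * lam, 0.0)  # prox, threshold g*lam
    out[free_idx] = z[free_idx]
    return out
```

Each coordinate is thus thresholded at its own level \(\Gamma ^{n+1}_{ii}\lambda \), which is exactly what compensates scale differences between components.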
Fig. 3
figure 3

Evolution of the Monte Carlo approximation of \(\Vert S_{n+1} - \bar{S}(\theta _n)\Vert _{2}\) with iterations n for algorithms MCPG (solid green), SAEM-pen (solid yellow), SAPG (dashed blue), implemented with \((\alpha ,\beta ) = (0.9, 0.4)\) [left], \((\alpha , \beta ) = (0.6, 0.1)\) [center] and \((\alpha , \beta ) = (0.5, 0.5)\) [right]; for MCPG and SAPG, the batch size is fixed \(m_n = 60\). (Color figure online)

4.3 Long-time behavior of the algorithm

In this section, we numerically illustrate the theoretical results on the long-term convergence of the algorithms MCPG, SAPG, SAEM-pen (i.e., Algorithm 3 applied with \(S_{n+1} = S_{n+1}^\mathrm {sa}\)) and EM-pen on the toy model. In this example, the exact algorithm EM-pen (see Eq. 14) applies: the quantity \({\bar{S}}(\theta )\) is an explicit expectation under a Gaussian distribution \(\pi _\theta \). We therefore use this example (i) to illustrate the convergence of the three stochastic methods to the same limit point as EM-pen, (ii) to compare the two approximations \(S_{n+1}^\mathrm {mc}\) and \(S_{n+1}^\mathrm {sa}\) of \({\bar{S}}(\theta _n)\) in a GEM-pen approach, and (iii) to study the effect of relaxing the M-step by comparing the GEM-pen and EM-pen approaches, namely SAPG and SAEM-pen.

The sequences \(\{\gamma _n, n \ge 0 \}\) and \(\{\delta _n, n \ge 0 \}\) are defined as follows: \((\gamma _\star , \delta _\star ) = (0.004, 0.5)\), and \(n_\alpha = n_\beta = 0\); three different pairs \((\alpha , \beta )\) are considered: \((\alpha , \beta )=(0.9, 0.4), (\alpha , \beta )= (0.6,0.1)\), and \((\alpha , \beta )=(0.5,0.5)\). The algorithms are implemented with a fixed batch size \(m_n=60\). 100 independent runs of each algorithm are performed. For the penalty term, we set \(\lambda =50\). In MCPG, SAPG and SAEM-pen, the simulation step at iteration \((n+1)\) relies on exact sampling from \(\pi _{\theta _n}\)—see Eq. (21); therefore, in this toy example, the Monte Carlo approximation of \({\bar{S}}(\theta _n)\) is unbiased.

In Fig. 3, for the three algorithms MCPG, SAPG and SAEM-pen, the evolution of an approximation of \(\Vert S_{n+1} - {\bar{S}}(\theta _n)\Vert _{2}\) with iterations n is plotted, where, for a random variable \(U, \Vert U\Vert _2 :=\sqrt{{\mathbb {E}}\left[ \Vert U \Vert ^2 \right] }\). This \(L_2\)-norm is approximated by a Monte Carlo sum computed from 100 independent realizations of \(S_{n+1}\); here, \({\bar{S}}(\theta _n)\) is explicit (see Eq. 22). SAEM-pen and SAPG behave similarly; the \(L_2\)-norm converges to 0, and the convergence is slower when \((\alpha ,\beta )=(0.6,0.1)\)—this plot illustrates the result stated in Proposition 2, Sect. 3. This convergence does not hold for MCPG because the size \(m_n\) of the Monte Carlo approximation is kept fixed.

We compared the limiting vectors \(\lim _n \theta _n\) obtained by each algorithm over the 100 independent runs. They are all equal, and the limiting vector is also the limiting value \(\theta _\infty \) of the EM-pen algorithm. In order to discuss the rate of convergence, we show the behavior of the algorithms when estimating the component \(\# 245\) of the regression coefficients; this component was chosen among the non-null components of \(\theta _\infty \). Figure 4 shows the boxplot of 100 estimations of the component \(\# 245\) of the vector \(\theta _n\), for \(n = 5, 25, 50, 500\), obtained by the algorithms MCPG, SAPG and SAEM-pen with \((\alpha , \beta ) = (0.9, 0.4)\). Here, SAPG and MCPG behave similarly, with a smaller variability among the 100 runs than SAEM-pen. SAEM-pen converges faster than SAPG and MCPG, which was expected since SAEM-pen is a stochastic perturbation of the EM-pen algorithm while SAPG and MCPG are stochastic perturbations of the GEM-pen algorithm. Figure 5 shows the boxplot of 100 estimations by MCPG, SAPG and SAEM-pen of the component \(\# 245\) after \(n=500\) iterations, for different values of the parameters \(\alpha \) and \(\beta \). The three algorithms give similar final estimates for the three settings of \(\alpha \) and \(\beta \). This is due to the fact that, with \(n_\alpha = n_\beta = 200\), the algorithms have already reached the convergence phase when \(n=200\), which allows them to converge quickly toward the limit points for \(n>200\).

Fig. 4
figure 4

Estimation of the component \(\# 245\) of the vector \(\theta _n\) when \(n=5, 25, 50, 500\). MCPG (green, left), SAPG (blue, middle) and SAEM-pen (yellow, right) are implemented with \((\alpha ,\beta ) = (0.9, 0.4)\); for MCPG and SAPG, the batch size is fixed \(m_n = 60\). Each boxplot is computed from 100 independent runs. Black dashed line corresponds to the value obtained with EM-pen algorithm at iteration 500. (Color figure online)

Fig. 5
figure 5

Estimation of the component \(\# 245\) of the vector \(\theta _n\) when \(n=500\). MCPG (green), SAPG (blue) and SAEM-pen (yellow) are implemented with \((\alpha ,\beta ) = (0.9, 0.4)\) [left], \((\alpha , \beta ) = (0.6, 0.1)\) [center] and \((\alpha , \beta ) = (0.5, 0.5)\) [right]; for MCPG and SAPG, the batch size is fixed \(m_n = 60\). Each boxplot is computed from 100 independent runs. Black dashed line corresponds to the value obtained with EM-pen algorithm at iteration 500. (Color figure online)

Figure 6 shows the convergence of a Monte Carlo approximation of \(n \mapsto {\mathbb {E}}\left[ F(\theta _n)\right] \), based on 100 independent estimations \(\theta _n\), obtained by four different algorithms: EM-pen, MCPG, SAPG and SAEM-pen, run with \((\alpha , \beta )=(0.9,0.4)\) and \(m_n = 60\). Here again, all the algorithms converge to the same value, and EM-pen and SAEM-pen converge faster than MCPG and SAPG. We observe that the path of SAPG is far smoother than the path of MCPG.

Fig. 6
figure 6

Monte Carlo approximation of \({\mathbb {E}}\left[ F(\theta _n)\right] \) (based on 100 independent samples) along the iterations n, for algorithms EM-pen (solid black), MCPG (solid green), SAEM-pen (dash-dotted yellow), SAPG (dashed blue), implemented with \((\alpha ,\beta )=(0.9,0.4)\) and \(m_n=60\). (Color figure online)

Finally, Fig. 7 shows the support of the vector \(\lim _n \theta _n\) (where the components \(\theta _1\) and \(\theta _{302}\) are removed) estimated by MCPG, SAPG, SAEM-pen and EM-pen (the estimated support is the same for the four algorithms). For each component, the frequency, among 100 independent runs, with which it belongs to the support of the limit value \(\lim _n \theta _n\) is displayed. The algorithms are implemented with \((\alpha , \beta ) = (0.9,0.4)\) and \(m_n = 60\). For all algorithms, we observe that most of the non-null components of \(\lim _n \theta _n\) are non-null components of \(\theta ^\star \). Note also that the stochastic algorithms MCPG, SAPG and SAEM-pen converge to the same vector as EM-pen.

Fig. 7
figure 7

(Top) Support of \(\lim _n \theta _n\) estimated by all the algorithms MCPG, SAPG, SAEM-pen and EM-pen over 100 runs for \((\alpha ,\beta )=(0.9,0.4)\) and \(m_n = 60\). (Bottom) The support of \(\theta ^\star \) used to produce the observations. On both rows, the components 1 and \(D+2\) are not displayed

5 Inference in nonlinear mixed models for pharmacokinetic data

In this section, SAPG is applied to solve a more challenging problem. The objective is to illustrate the algorithm in cases that are not covered by the theory. The application is in pharmacokinetic analysis, with nonlinear mixed effect models (NLMEM); in this application, the penalized maximum-likelihood inference is usually solved by the SAEM-pen algorithm, possibly combined with an approximation of the M-step when it is non-explicit. This section also provides a numerical comparison of SAPG and SAEM-pen. Both algorithms have a simulation step; in this more challenging application, it will rely on a Markov chain Monte Carlo (MCMC) sampler—see Sect. 5.1. Therefore, for both algorithms, \({\bar{S}}(\theta )\) is approximated by a biased Monte Carlo sum.

We start with a presentation of the statistical analysis and its translation into an optimization problem; we then propose a modification of the SAPG by allowing a random choice of the stepsize sequence \(\{\gamma _n, n \ge 0\}\), to improve the numerical properties of the algorithm. We conclude the section by a comparison of the methods on a pharmacokinetic real data set.

5.1 The nonlinear mixed effect model

Pharmacokinetic data are observed over time for N patients. Let \(\mathsf {Y}_{k}\) be the vector of the J drug concentrations observed at times \(t_{kj}\) (\(j \in \{1,\ldots ,J \}\)) for the kth patient (\(k \in \{1,\ldots ,N\}\)). The kinetics of the drug concentration is described by a nonlinear pharmacokinetic regression model f, a function of time t and of unobserved pharmacokinetic parameters \(Z^{(k)}\). These parameters are typically the rates of absorption or elimination of the drug by the body; an example is detailed below. The variability among patients is modeled by the randomness of the hidden variables \(Z^{(k)}\). These pharmacokinetic parameters may be influenced by covariates such as age and gender, but also by genomic variables. Among these high-dimensional factors, only a few are correlated with \(Z^{(k)}\). Their selection can thus be performed by optimizing the likelihood with a sparsity-inducing penalty, an optimization problem of the form Eq. (1). However, the likelihood is generally not concave; through this example, we therefore explore cases beyond the framework in which we are able to prove the convergence of MCPG and SAPG (see Sect. 3).

Let us now detail the model and the optimization problem. The mixed model is defined as

$$\begin{aligned} \mathsf {Y}_{kj} = f(t_{kj},Z^{(k)}) + \epsilon _{kj} , \quad \epsilon _{kj} \sim {\mathcal {N}}(0,\sigma ^2) \hbox { (iid)}, \end{aligned}$$
(28)

where the measurement errors \(\epsilon _{kj}\) are centered, independent and identically normally distributed with variance \(\sigma ^2\). The individual parameter \(Z^{(k)}\) of the kth subject is an R-dimensional random vector, independent of \(\epsilon _{kj}\). In a high-dimensional context, the \(Z^{(k)}\)’s depend on covariates (typically genomic variables) gathered in a design matrix \(X_k\in \mathbb {R}^{R \times (D+1)R}\). The distribution of \(Z^{(k)}\) is usually assumed to be normal with independent components

$$\begin{aligned} Z^{(k)} {\mathop {\sim }\limits ^{\mathrm{ind}}} {\mathcal {N}}_R(X_{k} \mu ,\varOmega ) \end{aligned}$$
(29)

where \(\mu \in \mathbb {R}^{(D+1)R}\) is the mean parameter vector and \(\varOmega \) is the covariance matrix of the random parameters \(Z^{(k)}\), assumed to be diagonal. The unknown parameters are \(\theta = \left( \mu , \varOmega _{11}, \ldots , \varOmega _{RR}, \sigma ^2 \right) \in {\mathbb {R}}^{R(D+1)} \times \left( 0,+\infty \right) ^{R+1}\).

A typical function f is the two-compartment pharmacokinetic model with first-order absorption, describing the distribution of a drug administered orally: the drug is absorbed from the gut and reaches the blood circulation, from which it can spread to peripheral tissues. This model corresponds to \(f= \frac{A_\mathrm{c}}{V_\mathrm{c}}\) with \(A_\mathrm{c}\) defined by

$$\begin{aligned} \frac{\mathrm{d}A_\mathrm{d}}{\mathrm{d}t}= & {} -k_\mathrm{a} \, A_\mathrm{d},\nonumber \\ \frac{\mathrm{d}A_\mathrm{c}}{\mathrm{d}t}= & {} k_\mathrm{a} \, A_\mathrm{d} + \frac{Q}{V_\mathrm{p}}A_\mathrm{p} - \frac{Q}{V_\mathrm{c}}A_\mathrm{c}- \frac{Cl}{V_\mathrm{c}}A_\mathrm{c} , \nonumber \\ \frac{\mathrm{d}A_\mathrm{p}}{\mathrm{d}t}= & {} \frac{Q}{V_\mathrm{c}}A_\mathrm{c} - \frac{Q}{V_\mathrm{p}}A_\mathrm{p}, \end{aligned}$$
(30)

with \(A_\mathrm{d}(0) = \mathrm{Dose}, A_\mathrm{c}(0) = 0, A_\mathrm{p}(0) = 0\), and where \(A_\mathrm{d}, A_\mathrm{c}, A_\mathrm{p}\) are the amounts of drug in the depot, central and peripheral compartments, respectively; \(V_\mathrm{c}\) and \(V_\mathrm{p}\) are the volumes of the central and peripheral compartments, respectively; and Q and Cl are the intercompartmental and global elimination clearances, respectively. To ensure positivity of the parameters, the hidden vector is

$$\begin{aligned} z=(\log (V_\mathrm{c}), \log (V_\mathrm{p}), \log (Q), \log (Cl), \log (k_\mathrm{a})). \end{aligned}$$
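A sketch of the corresponding forward model, integrating Eq. (30) and returning the observed concentration, could read as follows; the fixed-step RK4 integrator and its step size are illustrative choices, not part of the model:

```python
import numpy as np

def pk_concentration(t_obs, dose, z, dt=0.001):
    """Concentration f = A_c / V_c obtained by integrating the ODE system
    of Eq. (30) with a fixed-step RK4 scheme (a sketch; an adaptive or
    stiff solver may be preferable in practice).
    z = (log Vc, log Vp, log Q, log Cl, log ka), ensuring positivity."""
    Vc, Vp, Q, Cl, ka = np.exp(z)

    def rhs(A):
        Ad, Ac, Ap = A
        return np.array([-ka * Ad,
                         ka * Ad + (Q / Vp) * Ap - (Q / Vc) * Ac - (Cl / Vc) * Ac,
                         (Q / Vc) * Ac - (Q / Vp) * Ap])

    A = np.array([dose, 0.0, 0.0])   # initial amounts (depot, central, peripheral)
    t, out = 0.0, []
    for t_target in np.sort(np.asarray(t_obs, dtype=float)):
        while t < t_target:
            h = min(dt, t_target - t)
            k1 = rhs(A)
            k2 = rhs(A + 0.5 * h * k1)
            k3 = rhs(A + 0.5 * h * k2)
            k4 = rhs(A + h * k3)
            A = A + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
            t += h
        out.append(A[1] / Vc)
    return np.array(out)
```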

It is easy to show that the model described by Eqs. (28)–(29) belongs to the curved exponential family (see Eq. 4) with minimal sufficient statistics:

$$\begin{aligned}&S_{1k}(z) = z^{(k)}, \quad S_{2}(z) = \sum _{k=1}^N z^{(k)} \, z^{(k)'}, \\&S_{3}(z) = \sum _{k=1}^N\sum _{j=1}^{J} (\mathsf {Y}_{kj}-f(t_{kj}, z^{(k)}))^2; \\&\psi _{1k}(\theta ) = ( X_{k} \mu )'\varOmega ^{-1}, \quad \psi _2(\theta ) = -\frac{1}{2}\varOmega ^{-1}, \\&\psi _3(\theta ) = - \frac{1}{2\sigma ^2}, \end{aligned}$$

and \(S(z) :=\mathrm {Vect}\left( S_{11}(z), \ldots , S_{1N}(z), S_2(z), S_3(z) \right) , \psi :=\mathrm {Vect}\left( \psi _{11}, \ldots , \psi _{1N},\psi _2, \psi _3\right) \). The function \(\phi \) is given by \(\phi (\theta ) = -J N \log (\sigma ) - \frac{N}{2}\log (\vert \varOmega \vert ) - \frac{1}{2}\sum _{k} ( X_{k} \mu )'\varOmega ^{-1}( X_{k} \mu )\). The selection of the genomic variables that influence the coordinates of \(Z^{(k)}\) can be performed by optimizing the log likelihood penalized by the function \(g(\theta )= \lambda \Vert \mu \Vert _{1}\), the \(L_1\) norm of \(\mu \), with \(\lambda \) a regularization parameter.
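As a concrete reading of these formulas, the statistics can be evaluated from a draw of the hidden vectors; a minimal sketch, in which the data layout and the structural function f are schematic placeholders:

```python
import numpy as np

def sufficient_statistics(Z, Y, t, f):
    """Minimal sufficient statistics of the complete model:
    S_1k = z^(k); S_2 = sum_k z^(k) z^(k)'; S_3 = residual sum of squares.
    Z: (N, R) hidden vectors, Y: (N, J) observations, t: (J,) sampling
    times, f: structural model f(t, z) returning the (J,) predicted curve."""
    N = Z.shape[0]
    S1 = [Z[k] for k in range(N)]                                  # S_{1k}(z)
    S2 = sum(np.outer(Z[k], Z[k]) for k in range(N))               # S_2(z)
    S3 = sum(((Y[k] - f(t, Z[k])) ** 2).sum() for k in range(N))   # S_3(z)
    return S1, S2, S3
```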

However, this estimator is not invariant under a scaling transformation (i.e., \({{\tilde{Z}}}^{(k)} = bZ^{(k)} \text{, } \tilde{\mu }= b\mu \text{ and } {{\tilde{\varOmega }}}_{rr}^{1/2} = b\varOmega _{rr}^{1/2}\)) (see, e.g., Lehmann and Casella 2006). In our high-dimensional experiments, the scale of the hidden variables has a non-negligible influence on the selection of the support. To be more precise, let us denote, for \(r \in \{1, \ldots , R\}\),

$$\begin{aligned}\mu _{(r)} :=(\mu _{(r-1)(D+1)+1}, \ldots , \mu _{r(D+1)})\end{aligned}$$

the coordinates corresponding to the rth pharmacokinetic parameter of the function f. When the variance \(\varOmega _{rr}\) of the random parameters \(Z_r^{(k)}\) is low, the algorithms tend to select too many covariates. This phenomenon is amplified when the number of subjects is small, as the random effect variances are then more difficult to estimate. A solution is to consider the following penalty

$$\begin{aligned} \lambda \sum _{r=1}^{R} \varOmega _{rr}^{-\frac{1}{2} } \Vert \mu _{(r)} \Vert _{1} , \end{aligned}$$

which makes the estimator invariant under scaling transformations. It was initially proposed by Städler et al. (2010) to estimate the regression coefficients and the residual error variance in a mixture of penalized regression models. However, the resulting optimization problem is difficult to solve directly because the variance \(\varOmega _{rr}\) of the random effect appears in the penalty term. Therefore, we propose a new parameterization

$$\begin{aligned} {\tilde{\mu }}_{(r)} :=\mu _{(r)}\varOmega _{rr} ^{-\frac{1}{2} }, \quad \Sigma _{rr} :=\varOmega _{rr} ^{-\frac{1}{2} } \end{aligned}$$

and \({\tilde{\theta }} :=\{ {\tilde{\mu }}, \Sigma _{11}, \ldots , \Sigma _{RR}, \sigma ^2 \} \in \mathbb {R}^{R(D+1)} \times \left( 0,+\infty \right) ^{R+1}\). Then, the optimization problem is the following:

$$\begin{aligned} \underset{{\tilde{\theta }}}{{\text {Argmax}}} \left( \ell ({{\tilde{\theta }}}) - g({\tilde{\theta }}) \right) , \quad \hbox { with } g({\tilde{\theta }})= \lambda \Vert \tilde{\mu } \Vert _{1}. \end{aligned}$$
(31)
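The change of variables above, and the scale invariance it provides, can be sketched as follows (array shapes are illustrative):

```python
import numpy as np

def reparameterize(mu, omega_diag, D):
    """Map (mu, Omega) to (mu_tilde, Sigma) as in the text:
    mu_tilde_(r) = mu_(r) * Omega_rr^{-1/2}, Sigma_rr = Omega_rr^{-1/2}.
    mu: vector of length R*(D+1); omega_diag: the R diagonal entries of Omega."""
    R = omega_diag.size
    Sigma = omega_diag ** -0.5
    mu_tilde = (mu.reshape(R, D + 1) * Sigma[:, None]).ravel()
    return mu_tilde, Sigma
```

Rescaling the hidden variables by b (so \(\mu \rightarrow b\mu \), \(\varOmega _{rr}^{1/2} \rightarrow b\varOmega _{rr}^{1/2}\)) leaves \({\tilde{\mu }}\), and hence the penalty \(\lambda \Vert {\tilde{\mu }}\Vert _1\), unchanged.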

This problem can be solved using MCPG, SAPG or SAEM-pen algorithms. Indeed, the complete log likelihood is now—up to an additive constant—

$$\begin{aligned} \begin{aligned} \log p(\mathsf {Y},Z ; {\tilde{\theta }})=&- J N \log (\sigma ) \\&-\frac{1}{2}\sum _{k=1}^N \sum _{j=1}^J \frac{ \left( Y_{kj} - f(t_{kj}, Z^{(k)} )\right) ^{2} }{\sigma ^2} \\&+ N\log (\vert \Sigma \vert ) - \frac{1}{2}\sum _{k=1}^N \Vert \Sigma Z^{(k)} - X_{k} \tilde{\mu }\Vert ^2 \end{aligned} \end{aligned}$$

It is again a complete likelihood from the exponential family, with the statistic S unchanged and the functions \(\phi \) and \(\psi \) given by—up to an additive constant—

$$\begin{aligned}&\phi ({\tilde{\theta }}) = -J N \log (\sigma ) + N\log (\vert \Sigma \vert ) - \frac{1}{2}\sum _{k=1}^N\Vert X_{k} {\tilde{\mu }} \Vert ^2, \\&\psi _{1k}({\tilde{\theta }}) = \Sigma ( X_{k} {\tilde{\mu }})^{t}, \quad \psi _2({\tilde{\theta }}) = -\frac{1}{2}\Sigma ^{2} , \quad \psi _3({\tilde{\theta }}) = - \frac{1}{2\sigma ^2}. \end{aligned}$$

With these definitions of \(\phi , \psi \) and g, the M-step of SAEM-pen amounts to computing the optimum of a convex function, which is solved numerically by a call to the cyclical coordinate descent implemented in the R package glmnet (Friedman et al. 2010).
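As an illustration of the underlying idea (not the exact glmnet objective, which also handles observation weights and standardization), cyclical coordinate descent on a generic lasso subproblem reduces to soft-thresholding one coordinate at a time:

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator, the proximal map of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(A, b, lam, n_sweeps=200):
    """Cyclical coordinate descent for
        argmin_w 0.5 * ||A w - b||^2 + lam * ||w||_1,
    the generic subproblem behind glmnet-style solvers (a sketch:
    no convergence check, and columns of A are assumed nonzero)."""
    n, d = A.shape
    w = np.zeros(d)
    r = b.copy()                      # residual b - A w
    col_sq = (A ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(d):
            r += A[:, j] * w[j]       # remove coordinate j from the fit
            rho = A[:, j] @ r
            w[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= A[:, j] * w[j]       # put the updated coordinate back
    return w
```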

MCMC sampler In the context of nonlinear mixed models, simulation from \(\pi _{\theta _n}\mathrm {d}\nu \) cannot be performed directly as in the toy example. We therefore use an MCMC sampler based on a Metropolis–Hastings algorithm to perform the simulation step. Two proposal kernels are used successively during the iterations of the Metropolis–Hastings algorithm. The first kernel corresponds to the prior distribution of \(\Sigma Z^{(k)}\), that is, the Gaussian distribution \({\mathcal {N}}(X_{k} {\tilde{\mu }}_n , I)\). The second kernel corresponds to a succession of R one-dimensional random walks that update each component of \(Z^{(k)}\) in turn. The variance of each random walk is automatically tuned to reach a target acceptance ratio, following the principle of adaptive MCMC algorithms (Andrieu and Thoms 2008).
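The adaptive tuning of a random-walk variance can be sketched as follows; the Robbins-Monro adaptation of the proposal log-scale, the target ratio 0.44 and the decay exponent are illustrative choices, not taken from the paper:

```python
import numpy as np

def adaptive_rw_metropolis(logpi, z0, n_iter=20000, target=0.44, kappa=0.6):
    """Random-walk Metropolis for one component, with the proposal
    log-scale adapted toward a target acceptance ratio (adaptive MCMC
    in the spirit of Andrieu and Thoms 2008)."""
    rng = np.random.default_rng(0)
    z, log_s = z0, 0.0
    chain = np.empty(n_iter)
    for n in range(n_iter):
        prop = z + np.exp(log_s) * rng.standard_normal()
        log_alpha = logpi(prop) - logpi(z)
        if np.log(rng.uniform()) < log_alpha:
            z = prop
        # Robbins-Monro step: push the acceptance ratio toward `target`
        acc = np.exp(min(0.0, log_alpha))
        log_s += (acc - target) / (n + 1) ** kappa
        chain[n] = z
    return chain, np.exp(log_s)
```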

Adaptive random stepsize sequences In the context of NLMEM, numerical experiments reveal that choosing a deterministic sequence \(\{\gamma _n, n\ge 0\}\) that achieves a fast convergence of the SAPG algorithm can be difficult. Indeed, the parameters to estimate are on different scales. For example, random effect and residual variances are constrained to be positive; some of them are close to zero, some are not. As explained in Sect. 4.2, an alternative is to implement a matrix-valued random sequence \(\{\Gamma ^n, n\ge 0\}\). The gradient and the Hessian of the likelihood \(\ell (\theta )\) can be approximated by stochastic approximation using the Louis principle (see McLachlan and Krishnan 2008, Chapter 4). Let \(H^n\) denote the stochastic approximation of the Hessian obtained at iteration n, as explained by Samson et al. (2007). Note that no additional random samples are required to obtain this approximation. Along the iterations, each diagonal entry of the matrix \(H^{n}\) converges; the limiting value can be seen as a simple way to automatically tune a good, parameter-specific \(\gamma _{\star }\). The entries \(\Gamma ^{n+1}_{ii}\) are then defined by Eq. (27).
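A sketch of this construction is given below; since Eq. (27) is not reproduced in this section, the post-burn-in decay shown here is only indicative:

```python
import numpy as np

def sa_hessian_update(H, H_draw, delta):
    """Stochastic approximation (Robbins-Monro) update of the Hessian
    estimate: H^{n+1} = H^n + delta_{n+1} (H_draw - H^n)."""
    return H + delta * (H_draw - H)

def adaptive_gamma(H_diag, n, n0, beta=0.499, floor=1e-8):
    """Parameter-specific step sizes built from the Hessian diagonal:
    gamma_star_i ~ 1 / |H_ii|, kept constant up to iteration n0, then
    decreased polynomially (indicative schedule standing in for Eq. (27))."""
    gamma_star = 1.0 / np.maximum(np.abs(H_diag), floor)
    if n <= n0:
        return gamma_star
    return gamma_star / (n - n0) ** beta
```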

5.2 Simulated data set

The convergence of the corresponding algorithms is illustrated on simulated data. Data are generated with the model defined by Eq. (30) and \(N = 40, J = 12, D = 300\). The design matrix \(X_k\) is defined by Eq. (20), with components \((X_{k1}, \ldots , X_{kD})\) drawn from \({\mathcal {N}}(0,\Gamma )\) with \(\Gamma _{ii'} = 0.5^{\vert i - i' \vert }\) (\(i,i'=1,\ldots ,300\)). Parameter values are

$$\begin{aligned} \begin{aligned}&[\mu _1, \mu _{1+(D+1)} , \mu _{1+2(D+1)} , \mu _{1+3(D+1)} , \mu _{1+4(D+1)} ] \\&\quad = [ 6.61 , 6.96, 5.77, 5.42 ,-0.51]; \end{aligned} \end{aligned}$$

the other components are set to zero, except \(\mu _{4}\) and \(\mu _{912}\) that are set to 1. The matrix \(\varOmega \) is diagonal with diagonal elements equal to (0.16, 0.16, 0.16, 0.04, 0.04).
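The covariate part of this simulated design can be generated as follows (a sketch; the seed and sizes are arbitrary):

```python
import numpy as np

def ar1_design(N, D, rho=0.5, seed=0):
    """Draw N covariate rows (X_k1, ..., X_kD) from N(0, Gamma) with
    Gamma_{ii'} = rho**|i - i'|, using a Cholesky factor of Gamma."""
    idx = np.arange(D)
    Gamma = rho ** np.abs(idx[:, None] - idx[None, :])
    L = np.linalg.cholesky(Gamma)
    rng = np.random.default_rng(seed)
    return rng.standard_normal((N, D)) @ L.T
```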

The penalty function is set to

$$\begin{aligned} g({\tilde{\theta }}) :=\lambda \sum _{\ell \ne \{1+r(D+1), r=0, \ldots , 4 \}} |{\tilde{\mu }}_\ell |, \end{aligned}$$
(32)

only the parameters corresponding to a covariate effect being penalized. The optimization problem Eq. (1) with regularization parameter \(\lambda = 190\) is solved on this dataset with SAEM-pen and SAPG; we run SAPG with the random sequence \(\{\Gamma ^n, n \ge 0\}\) described above (see Eq. 27) with \(n_0 = 9500\). For both algorithms, the stochastic approximation step size was set to:

$$\begin{aligned} \delta _{n+1} = \left\{ \begin{array}{l@{\quad }l} 0.5 &{} \quad \text { if } n\le n_0 \\ \frac{0.5}{(n - n_0)^ {\beta }} &{} \quad \text { if } n>n_0 \end{array} \right. \end{aligned}$$
(33)

We set \(\alpha = 0.75\) and \(\beta = 0.499\). Figure 8 shows the convergence of SAEM-pen and of three parameterizations of SAPG: (i) a version with \(\gamma ^\star =0.005\) for all the components of \(\theta \), (ii) a version with \(\gamma ^\star =0.005\) for \(\tilde{\mu }\), \(\gamma ^\star =0.0005\) for \(\Sigma \) and \(\gamma ^\star =0.03\) for \(\sigma \), and (iii) a version with adaptive random step sizes. For the four algorithms, all the parameters corresponding to a covariate effect are estimated to zero except the two components \(\mu _{4}\) and \(\mu _{912}\). The version of SAPG with the same \(\gamma ^\star \) for all components is the one that converges the most slowly. When \(\gamma ^\star \) is tuned differently according to the type of parameter, the convergence of SAPG is accelerated. The SAEM-pen algorithm and SAPG with adaptive random step sizes have similarly fast convergence profiles.
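The step-size schedule of Eq. (33) can be written as:

```python
def sa_stepsize(n, n0=9500, beta=0.499, delta0=0.5):
    """Stochastic approximation step size of Eq. (33): constant during
    the first n0 iterations, then polynomially decreasing."""
    return delta0 if n <= n0 else delta0 / (n - n0) ** beta
```

With \(\beta < 1/2\) excluded only marginally (\(\beta = 0.499\)), the decay is close to the slowest rate usually allowed for stochastic approximation.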

Figure 9 presents the evolution of four entries of the matrix \(\Gamma ^n\) along the iterations of SAPG, corresponding to the components \(\tilde{\mu }_{904}, \tilde{\mu }_{912}, \Sigma _{44}\) and \(\sigma \). Note that they are not on the same scale. They vary during the first iterations and converge to limiting values before iteration \(n_0=9500\). The step sizes then decrease to 0, following the definition given in Eq. (27).

Fig. 8

Path of a run of SAEM-pen [left column] and of three different parameterizations of SAPG: (i) with \(\gamma ^\star =0.005\) for all the components of \(\theta \) [middle left column], (ii) with \(\gamma ^\star =0.005\) for \(\tilde{\mu }\), \(\gamma ^\star =0.0005\) for \(\Sigma \) and \(\gamma ^\star =0.03\) for \(\sigma \) [middle right column] and (iii) with a random sequence \(\{\Gamma ^n,n \ge 0 \}\) [right column]. For each algorithm: estimation of the standard deviation of the residual error \(\sigma \) [third row]; the variances of the \(Z^{(k)}\)’s, \(\varOmega _{11}, \ldots , \varOmega _{RR}\) [fourth row]; the path of the covariate parameters \(\mu _i\) for \(i \notin \{1, 1+(D+1), \ldots , 1+4(D+1) \}\) [first row]; the path of the intercept parameters \(\mu _{i}, i \in \{ 1, 1+(D+1), \ldots , 1+4(D+1)\}\) [second row]. Each color corresponds to a specific parameter: orange for Cl, red for \(V_\mathrm{c}\), blue for \(k_\mathrm{a}\), yellow for Q and green for \(V_\mathrm{p}\). Note that the paths of all the covariate parameters are zero except for two components. The x-axis is in \(\log _{10}\) scale. (Color figure online)

Fig. 9

Evolution of \(\Gamma ^n_{ii}\) with iterations n of SAPG, for four different values of i, corresponding to the components \(\tilde{\mu }_{904}\) [left]; \(\tilde{\mu }_{912}\) [middle left]; \(\Sigma _{44}\) [middle right]; \(\sigma \) [right]. Both x-axis and y-axis are in \(\log _{10}\) scale

5.3 Application to real data

Algorithms SAEM-pen and SAPG with the matrix-valued random sequence \(\{\Gamma ^n, n\ge 0\}\) are applied to real pharmacokinetic data of dabigatran (DE) from two crossover clinical trials (Delavenne et al. 2013; Ollier et al. 2015). These two trials studied the drug–drug interaction between DE and different Pgp-inhibitors. From these two trials, the pharmacokinetics of DE are extracted for 15 subjects with no concomitant Pgp-inhibitor treatment. The concentration of dabigatran is measured at 9 sampling times for each patient. Each subject is genotyped using the DMET\(^{\textregistered }\) microarray from Affymetrix. Single nucleotide polymorphisms (SNPs) showing no variability between subjects are removed, and 264 SNPs are included in the analysis.

Function f of the nonlinear mixed model is defined as the two-compartment pharmacokinetic model with first-order absorption previously described (see Eq. 30) (Delavenne et al. 2013). The penalty function g is defined by Eq. (32).

Because of the limited number of subjects, the influence of genetic covariates is only studied on the \(V_\mathrm{c}\) and Cl parameters, which characterize the elimination process and are the most likely to be influenced by genetics. Finally, the random effect variances of Q and \(V_\mathrm{p}\) are set to 0.01, in accordance with a previously published population pharmacokinetic analysis of dabigatran (Delavenne et al. 2013). The other variance parameters are estimated. The penalized likelihood problem (Eq. 31) is solved on the data with the SAEM-pen and SAPG algorithms for 40 different values of the parameter \(\lambda \). The SAPG algorithm is run using the random sequence \(\{\Gamma ^n, n\ge 0\}\) given in Eq. (27). The best regularization parameter \(\lambda \) is chosen with a data-driven approach based on the EBIC criterion (Chen and Chen 2008).
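For reference, one common form of the extended BIC is sketched below; the exact variant used in the paper is not spelled out in this section, so this is illustrative only:

```python
from math import comb, log

def ebic(loglik, k, n, p, gamma=1.0):
    """Extended BIC of Chen and Chen (2008) for a model with k selected
    covariate parameters out of p candidates, fitted on n observations:
    EBIC = -2 loglik + k log n + 2 gamma log C(p, k).
    Lower values indicate a better trade-off between fit and sparsity."""
    return -2.0 * loglik + k * log(n) + 2.0 * gamma * log(comb(p, k))
```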

Fig. 10

Regularization path of covariate parameters (Cl parameter on top, \(V_\mathrm{c}\) parameter on bottom) obtained on dabigatran pharmacokinetic data for both SAEM-pen and SAPG algorithms. Black vertical dashed line corresponds to the \(\lambda \) value selected by EBIC. (Color figure online)

Figure 10 shows the results. The regularization paths of the Cl and \(V_\mathrm{c}\) parameters, obtained with both algorithms, show the evolution of the covariate coefficient estimates as a function of \(\lambda \). They are reconstructed with low noise for both algorithms and are very similar for large values of \(\lambda \), but less so for smaller values of \(\lambda \).

Finally, the selected model has all covariate parameters set to zero. This means that none of the genetic covariates influence the distribution of the individual parameters. This result is not surprising given the low number of subjects and the fact that a large part of the interindividual variability is due to the dissolution process of the drug (Ollier et al. 2015) and is therefore not influenced by genetic covariates. This lack of relationship between dabigatran's pharmacokinetic parameters and genetic covariates has already been highlighted in another study (Gouin-Thibault et al. 2017).

6 Conclusion

In this work, we propose a new stochastic proximal-gradient algorithm to solve penalized maximum-likelihood problems when the likelihood is intractable: the gradient is approximated through a stochastic approximation scheme. We provide a theoretical convergence analysis of this new algorithm and illustrate the results numerically on a simulated toy example with a concave likelihood function. Robustness to the non-concave case is explored through a more challenging application to population pharmacokinetic analysis, relying on penalized inference in nonlinear mixed effects models.