1 Introduction

Many problems in computational statistics reduce to the maximization of a criterion

$$\begin{aligned} {\text {argmax}}_{\theta \in \mathbb {R}^d} F(\theta ), \quad \text {where} \ F :=\ell - g, \end{aligned}$$
(1)

and the functions \(\ell , g\) satisfy

H1

The function \(g\,{:}\,\mathbb {R}^d\rightarrow \left[ 0,+\infty \right] \) is convex, not identically \(+\infty \), and lower semi-continuous.

H2

The function \(\ell \,{:}\,\mathbb {R}^d \rightarrow \mathbb {R}\cup \{-\infty \}\) is continuously differentiable on \(\Theta :=\{\theta \in \mathbb {R}^d\,{:}\,g(\theta ) + |\ell (\theta )| < \infty \}\), and its gradient is of the form

$$\begin{aligned} \begin{aligned} \nabla \ell (\theta )&= \nabla \phi (\theta ) + \varPsi (\theta ) {\bar{S}}(\theta ),\\&\quad \text {with} \ {\bar{S}}(\theta ) :=\int _\mathsf {Z}S(z) \pi _\theta (z) \nu (\mathrm {d}z); \end{aligned} \end{aligned}$$
(2)

\(\nabla \) denotes the gradient operator, and \(\pi _\theta \mathrm {d}\nu \) is a probability distribution on a measurable subset \((\mathsf {Z}, {\mathcal {Z}})\) of \(\, \mathbb {R}^p\). The measurable functions \(\nabla \phi \,{:}\,\mathbb {R}^d \rightarrow \mathbb {R}^d\) and \(\varPsi \,{:}\,\mathbb {R}^d\rightarrow \mathbb {R}^{d \times q}\) are known but the expectation \({\bar{S}}\) of the function \(S\,{:}\,\mathsf {Z}\rightarrow \mathbb {R}^q\) with respect to \(\pi _\theta \mathrm {d}\nu \) may be intractable. Furthermore, there exists a finite nonnegative constant L such that for all \(\theta , \theta ' \in \Theta \),

$$\begin{aligned} \Vert \nabla \ell (\theta ) - \nabla \ell (\theta ')\Vert \le L \Vert \theta - \theta '\Vert ; \end{aligned}$$
(3)

\(\Vert \cdot \Vert \) is the Euclidean norm.

Examples of functions \(\ell \) satisfying Eq. (2) are given below. We are interested in numerical methods for solving Eq. (1), robust to the case when neither \(\ell \) nor its gradient has an explicit expression.

Such an optimization problem occurs for example when computing a penalized maximum-likelihood estimator in a parametric model indexed by \(\theta \in \mathbb {R}^d\): \(\ell \) denotes the log likelihood of the observations \(\mathsf {Y}\) (the dependence upon \(\mathsf {Y}\) is omitted) and g is the penalty term.

The optimization problem Eq. (1) covers the computation of the maximum when the parameter \(\theta \) is restricted to a closed convex subset \(\Theta \) of \(\mathbb {R}^d\); in that case, g is the characteristic function of \(\Theta \), i.e., \(g(\theta ) =0\) for any \(\theta \in \Theta \) and \(g(\theta ) =+ \infty \) otherwise. It also covers the case when g is the ridge, the lasso or the elastic net penalty; and more generally, the case when g is the sum of lower semi-continuous nonnegative convex functions.

A first example of such a function \(\ell \) is given by the log likelihood in a latent variable model with complete likelihood from the q-parameter exponential family (see, e.g., Bickel and Doksum 2015; Bartholomew et al. 2011 and the references therein). In that case, \(\ell \) is of the form

$$\begin{aligned} \theta \mapsto \ell (\theta ) :=\log \int _\mathsf {Z}\exp \left( \phi (\theta ) + \left<S(z),\psi (\theta )\right> \right) \nu (\mathrm {d}z), \end{aligned}$$
(4)

where \(\left<a,b\right>\) denotes the scalar product of two vectors \(a,b \in \mathbb {R}^l\); \(\phi \,{:}\,\mathbb {R}^d \rightarrow \mathbb {R}\), \(\psi \,{:}\,\mathbb {R}^d\rightarrow \mathbb {R}^q\) and \(S\,{:}\,\mathsf {Z}\rightarrow \mathbb {R}^q\) are measurable functions, and \(\nu \) is a \(\sigma \)-finite positive measure on \((\mathsf {Z},{\mathcal {Z}})\). The function \(\theta \mapsto \phi (\theta ) + \left<S(Z),\psi (\theta )\right>\) is known as the complete log likelihood, where \(Z\) is the latent data vector. Under regularity conditions, we have

$$\begin{aligned} \begin{aligned} \nabla \ell (\theta )&= \nabla \phi (\theta ) + {\text {J}} \psi (\theta ) \int _\mathsf {Z}S(z) \ \pi _{\theta }(z) \nu (\mathrm {d}z), \\&\text {with} \ \pi _{\theta }(z) :=\frac{\exp (\left<S(z),\psi (\theta )\right>)}{\int _\mathsf {Z}\exp (\left<S(u),\psi (\theta )\right>) \nu (\mathrm {d}u)}, \end{aligned} \end{aligned}$$
(5)

where \({\text {J}} \psi (\theta )\) denotes the transpose of the Jacobian matrix of the function \(\psi \) at \(\theta \).
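For illustration, the gradient in Eq. (5) can be estimated by averaging S over draws (exact or MCMC) from the a posteriori distribution \(\pi _\theta \mathrm {d}\nu \). A minimal sketch, in which all callables (`grad_phi`, `J_psi_T`, `S`, `sample_posterior`) are hypothetical placeholders for a user-specified model:

```python
import numpy as np

def grad_ell_mc(theta, grad_phi, J_psi_T, S, sample_posterior, m, rng):
    """Monte Carlo estimate of Eq. (5): grad ell(theta) is approximated by
    grad_phi(theta) + J_psi_T(theta) @ mean_j S(Z_j), where J_psi_T(theta)
    is the matrix J psi(theta) of the text (transpose of the Jacobian) and
    Z_1, ..., Z_m are drawn (at least approximately) from pi_theta d nu."""
    S_bar = np.mean([S(sample_posterior(theta, rng)) for _ in range(m)], axis=0)
    return grad_phi(theta) + J_psi_T(theta) @ S_bar
```

When the draws come from an MCMC kernel rather than i.i.d. sampling, this estimator is biased; that situation is precisely the one analyzed in Sect. 3.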

A second example is given by the log likelihood of N independent observations \((\mathsf {Y}_{1}, \ldots , \mathsf {Y}_{N})\) from a log-linear model for Markov random fields. In this model, \(\ell \) is given by

$$\begin{aligned} \theta \mapsto \ell (\theta ):= & {} \sum _{k=1}^N \left<S(\mathsf {Y}_{k}),\theta \right>\nonumber \\&\quad -\, N \log \int _\mathsf {Z}\exp \left( \left<S(z),\theta \right> \right) \nu (\mathrm {d}z). \end{aligned}$$
(6)

The function \(\theta \mapsto \int _\mathsf {Z}\exp \left( \left<S(z),\theta \right> \right) \nu (\mathrm {d}z)\) is known as the partition function. Under regularity conditions, we have

$$\begin{aligned} \begin{aligned} \nabla \ell (\theta )&= \sum _{k=1}^N S(\mathsf {Y}_{k} ) - N \int _\mathsf {Z}S(z) \ \pi _\theta (z) \nu (\mathrm {d}z), \\&\quad \text {with} \, \pi _\theta (z) :=\frac{\exp \left( \left<S(z),\theta \right> \right) }{\int \exp \left( \left<S(u),\theta \right> \right) \, \nu (\mathrm {d}u)}. \end{aligned} \end{aligned}$$
(7)
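As a sanity check of Eq. (7), consider a toy log-linear model of our own choosing with \(\mathsf {Z}=\{0,1\}\), \(S(z)=z\) and \(\nu \) the counting measure; the partition function is then \(1+e^{\theta }\) and every term is explicit:

```python
import numpy as np

def grad_ell_loglinear(theta, Y):
    """Exact gradient Eq. (7) for the toy model Z in {0, 1}, S(z) = z,
    nu = counting measure: the expectation of S under pi_theta is
    pi_theta(1) = exp(theta) / (1 + exp(theta))."""
    Y = np.asarray(Y, dtype=float)
    pi1 = np.exp(theta) / (1.0 + np.exp(theta))
    return Y.sum() - len(Y) * pi1
```

As expected for an exponential family, the gradient vanishes when the model mean \(\pi _\theta (1)\) matches the empirical mean of the \(S(\mathsf {Y}_k)\).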

In these two examples, the integrals in Eqs. (4)–(7) are intractable except for toy examples: neither the function \(\ell \) nor its gradient is available. Nevertheless, all the integrals in Eqs. (4)–(7) can be approximated by a Monte Carlo sum (see, e.g., Robert and Casella 2004). In the first example, this Monte Carlo approximation consists in imputing the missing variables z; it is known that such an imputation is far more efficient when the Monte Carlo samples are drawn under \(\pi _\theta \mathrm {d}\nu \), i.e., the a posteriori distribution of the missing variables given the observations (see Eq. 5), than when they are drawn under the a priori distribution. This remark is the essence of the expectation maximization (EM) algorithm (introduced in Dempster et al. 1977), a popular iterative procedure for maximizing the log likelihood \(\ell \) in latent variable models.

In this paper, we are interested in first-order optimization methods to solve Eq. (1), that is, methods based on the gradient. In Sect. 2.1, we describe two stochastic first-order descent methods, which are stochastic perturbations of the proximal-gradient (PG) algorithm (introduced in Combettes and Pesquet 2011; see also Beck and Teboulle 2009; Parikh and Boyd 2013 for literature reviews on proximal-gradient algorithms). The two algorithms are the Monte Carlo proximal-gradient algorithm (MCPG) and the stochastic approximation proximal-gradient algorithm (SAPG), which differ in the approximation of the gradient \(\nabla \ell \) and, more precisely, of the intractable integral \({\bar{S}}(\theta )\) (see Eq. 2). In MCPG, at each iteration n of the algorithm, this expectation evaluated at the current point \(\theta _n\) is approximated by a Monte Carlo sum computed from samples \(\{Z_{1,n}, \ldots , Z_{m_{n+1},n}\}\) approximating \(\pi _{\theta _n} \mathrm {d}\nu \). In SAPG, the approximation is computed as a Monte Carlo sum based on all the points drawn during all the previous iterations of the algorithm \(\{Z_{i,j}, i \le m_{j+1}, j \le n\}\).

When \(\ell \) is the log likelihood of a latent variable model, we prove in Sect. 2.2 that our algorithms are generalized EM algorithms (see, e.g., McLachlan and Krishnan 2008; Ng et al. 2012) combined with a stochastic E-step: in MCPG and SAPG, the stochastic E-step mimics, respectively, the E-step of the Monte Carlo EM (Wei and Tanner 1990; Levine and Fan 2004) and the E-step of the stochastic approximation EM (see, e.g., Delyon et al. 1999).

Section 3 is devoted to the convergence analysis of MCPG and SAPG. These algorithms can be seen as perturbed proximal-gradient algorithms, where the perturbation comes from replacing the exact quantity \({\bar{S}}(\theta _n)\) by a Monte Carlo approximation \(S_{n+1}\) at each iteration of the algorithm. Our convergence analysis covers the case when the points \(\{Z_{1,n}, \ldots , Z_{m_{n+1},n}\}\) are sampled from a Markov chain Monte Carlo (MCMC) sampler with target distribution \(\pi _{\theta _n} \mathrm {d}\nu \)—and therefore, it also covers the case of i.i.d. draws. This implies that the estimator \(S_{n+1}\) of \({\bar{S}}(\theta _n)\) may be biased. There exist many contributions in the literature on the convergence of perturbed proximal-gradient algorithms when \(\ell \) is concave, but except in the works by Atchadé et al. (2017) and Combettes and Pesquet (2015), most of them assume that the error \(S_{n+1} - {\bar{S}}(\theta _n)\) is unbiased and gets small when \(n \rightarrow \infty \) (see, e.g., Rosasco et al. 2014; Combettes and Pesquet 2016; Rosasco et al. 2016; Lin et al. 2015). In this paper, we provide sufficient conditions for the almost-sure convergence of MCPG and SAPG under the assumption that \(\ell \) is concave and with no assumptions on the bias of \(S_{n+1} - {\bar{S}}(\theta _n)\). The convergence analysis of MCPG is a special case of Atchadé et al. (2017, Section 4); to the best of our knowledge, the convergence of SAPG is a new result.

Practical implementation is discussed in Sect. 4. Some guidelines are given in Sect. 4.2 to choose the sequences involved in the stochastic approximation procedures. Then, MCPG and SAPG are compared through a toy example in Sect. 4.3. A more challenging application to penalized inference in a mixed effect model is detailed in Sect. 5. Mixed models are applied to analyze repeated data in a population of subjects. The N independent vectors of observations \(( \mathsf {Y}_k, k=1, \ldots , N)\) of the N subjects are modeled by

$$\begin{aligned} \mathsf {Y}_k = f(t_k, Z^{(k)}) + \varepsilon _k, \end{aligned}$$
(8)

with individual latent variable \(Z^{(k)}\) independent of the measurement error vector \(\varepsilon _k\), and f the regression function that depends on the vector of observation times \(t_k\). Mixed models thus enter the class of models given by Eq. (4) with latent variables \(Z= (Z^{(1)}, \ldots , Z^{(N)})\). When a covariate model is introduced, the number of covariates can be large, with only a few of them being influential. This is a sparse estimation problem, and the selection problem can be treated through the optimization of a penalized version of the log likelihood Eq. (4). In nonlinear mixed models, the optimization problem is not explicit, and stochastic penalized versions of EM (Bertrand and Balding 2013; Ollier et al. 2016; Chen et al. 2017) have been proposed. To the best of our knowledge, stochastic proximal-gradient algorithms have not been proposed for mixed models.

2 Stochastic proximal-gradient based algorithms

In this section, we describe first-order based algorithms for solving Eq. (1) under the assumptions H1 and H2, when the expectation \({\bar{S}}(\theta )\) in Eq. (2) is intractable.

2.1 The MCPG and SAPG algorithms

Both MCPG and SAPG are iterative algorithms, and each update relies on the combination of a gradient step and a proximal operator. The proximal map (Moreau 1962, see also Bauschke and Combettes 2011; Parikh and Boyd 2013) associated with a convex function g is defined for any \(\gamma >0\) and \(\theta \in \mathbb {R}^d\) by

$$\begin{aligned} {\text {Prox}}_{\gamma ,g}(\theta ) :={\text {argmin}}_{\tau \in \Theta } \left\{ g(\tau ) + \frac{1}{2\gamma } \Vert \theta - \tau \Vert ^2 \right\} . \end{aligned}$$
(9)
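Definition (9) can be checked numerically in one dimension by brute-force minimization over a grid; with \(g = |\cdot |\) it recovers the soft-thresholding map. A small sketch (grid search, for illustration only):

```python
import numpy as np

def prox_numeric(theta, gamma, g, grid):
    """Brute-force evaluation of Eq. (9) in one dimension:
    argmin over the grid of g(tau) + (theta - tau)^2 / (2 gamma)."""
    return grid[np.argmin(g(grid) + (theta - grid) ** 2 / (2.0 * gamma))]

grid = np.linspace(-5.0, 5.0, 200001)
# For g = |.|, the prox is soft-thresholding: here max(|2| - 1, 0) = 1.
p = prox_numeric(2.0, 1.0, np.abs, grid)
```

With \(g \equiv 0\) the same routine returns \(\theta \) itself, consistent with the prox of a null penalty being the identity.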

Note that under H1, for any \(\gamma >0\) and \(\theta \in \mathbb {R}^d\), there exists a unique point \(\tau \) minimizing the RHS of Eq. (9). This proximal operator may have an explicit expression. When g is the characteristic function

$$\begin{aligned} g(\theta ) :=\left\{ \begin{array}{l@{\quad }l} 0 &{} \quad \text {if}\,\,\theta \in \Theta \\ + \infty &{} \quad \text {otherwise}, \end{array} \right. \end{aligned}$$

for some closed convex set \(\Theta \subseteq \mathbb {R}^d\), then \({\text {Prox}}_{\gamma ,g}(\theta )\) is the projection of \(\theta \) on \(\Theta \). This projection is explicit for example when \(\Theta \) is a hyper-rectangle. Another example of an explicit proximal operator is the one associated with the so-called elastic net penalty, i.e., \(g_{\lambda ,\alpha }(\theta ) :=\lambda \left( \frac{1-\alpha }{2} \sum _{i=1}^d \theta _i^2+ \alpha \sum _{i=1}^d |\theta _i| \right) \) with \(\theta =(\theta _1, \ldots , \theta _d)\), \(\lambda >0\) and \(\alpha \in \left( 0,1\right] \); then for any component \(i \in \{1, \ldots , d \}\),

$$\begin{aligned}&\left( {\text {Prox}}_{\gamma ,g_{\lambda , \alpha }}(\theta ) \right) _i \\&\quad = \frac{1}{1+\gamma \lambda (1-\alpha )} \left\{ \begin{array}{l@{\quad }l} 0 &{}\quad \text {if}\,\,|\theta _i| \le \gamma \lambda \alpha , \\ \theta _i - \gamma \lambda \alpha &{}\quad \text {if}\,\,\theta _i \ge \gamma \lambda \alpha , \\ \theta _i + \gamma \lambda \alpha &{}\quad \text {if}\,\,\theta _i \le -\gamma \lambda \alpha . \end{array} \right. \end{aligned}$$
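In code, the displayed formula is a componentwise soft-thresholding at level \(\gamma \lambda \alpha \) followed by a multiplicative shrinkage; a minimal sketch:

```python
import numpy as np

def prox_elastic_net(theta, gamma, lam, alpha):
    """Closed-form proximal operator of the elastic net penalty
    g(theta) = lam * ((1 - alpha)/2 * sum theta_i^2 + alpha * sum |theta_i|):
    soft-threshold each component at gamma*lam*alpha, then shrink by
    1 / (1 + gamma*lam*(1 - alpha))."""
    soft = np.sign(theta) * np.maximum(np.abs(theta) - gamma * lam * alpha, 0.0)
    return soft / (1.0 + gamma * lam * (1.0 - alpha))
```

With \(\alpha = 1\) the shrinkage factor equals one and this reduces to the lasso proximal map.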

The proximal-gradient algorithm for solving the optimization problem Eq. (1) produces a sequence \(\{\theta _n, n \ge 0 \}\) as follows: given a \(\left( 0,1/L\right] \)-valued sequence \(\{\gamma _n, n \ge 0 \}\),

$$\begin{aligned} \theta _{n+1}= & {} {\text {Prox}}_{\gamma _{n+1},g} \left( \theta _n + \gamma _{n+1} \nabla \ell (\theta _n) \right) \nonumber \\= & {} {\text {Prox}}_{\gamma _{n+1},g} \left( \theta _n + \gamma _{n+1} \{ \nabla \phi (\theta _n) + \varPsi (\theta _n) {\bar{S}}(\theta _n) \} \right) .\nonumber \\ \end{aligned}$$
(10)

This update scheme can be explained as follows: by H2 and since \(\gamma _{n+1} \in \left( 0,1/L\right] \), we have

$$\begin{aligned} F(\theta )= & {} \ell (\theta ) -g(\theta ) \ge \ell (\theta _n) + \left<\nabla \ell (\theta _n),\theta -\theta _n\right>\\&-\, \frac{1}{2 \gamma _{n+1}} \Vert \theta - \theta _n \Vert ^2 -g(\theta ). \end{aligned}$$

This minorizing function is equal to \(F(\theta _n)\) at the point \(\theta _n\); the maximization (w.r.t. \(\theta \)) of the RHS yields \(\theta _{n+1}\) given by Eq. (10). The proximal-gradient algorithm is therefore a minorization–maximization (MM) algorithm and the ascent property holds: \(F(\theta _{n+1}) \ge F(\theta _n)\) for all n. Sufficient conditions for the convergence of the proximal-gradient algorithm Eq. (10) can be derived from the results by Combettes and Wajs (2005) and Parikh and Boyd (2013) or from convergence analysis of MM algorithms (see, e.g., Zangwill 1969; Meyer 1976).
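For intuition, the exact recursion Eq. (10) can be run on a toy concave problem of our own choosing, \(\ell (\theta ) = -\tfrac{1}{2}\Vert y - X\theta \Vert ^2\) with a lasso penalty, so that the proximal map is soft-thresholding:

```python
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def proximal_gradient(grad_ell, prox_g, theta0, gammas):
    """Exact recursion Eq. (10):
    theta_{n+1} = Prox_{gamma_{n+1}, g}(theta_n + gamma_{n+1} grad_ell(theta_n))."""
    theta = theta0
    for gam in gammas:
        theta = prox_g(theta + gam * grad_ell(theta), gam)
    return theta

# Toy concave instance: ell(theta) = -0.5 ||y - X theta||^2, g = lam ||theta||_1,
# so Prox_{gamma, g} is soft-thresholding at level gamma * lam.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
theta_star = np.array([2.0, 0.0, 0.0, -1.0, 0.0])
y = X @ theta_star
lam = 1.0
L = np.linalg.norm(X.T @ X, 2)          # Lipschitz constant of grad ell
theta_hat = proximal_gradient(
    grad_ell=lambda th: X.T @ (y - X @ th),
    prox_g=lambda th, gam: soft_threshold(th, gam * lam),
    theta0=np.zeros(5),
    gammas=[1.0 / L] * 500,
)
```

Each iterate increases \(F = \ell - g\), in line with the ascent property discussed above.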

When \({\bar{S}}(\theta )\) cannot be computed, we describe two strategies for a Monte Carlo approximation. At iteration \(n+1\), given the current value of the parameter \(\theta _n\), \(m_{n+1}\) points \(\{Z_{1,n}, \ldots , Z_{m_{n+1},n} \}\) from the path of a Markov chain with target distribution \(\pi _{\theta _n} \mathrm {d}\nu \) are sampled. A first strategy consists in replacing \({\bar{S}}(\theta _n)\) by a Monte Carlo mean:

$$\begin{aligned} S_{n+1}^{\mathrm {mc}} :=\frac{1}{m_{n+1}} \sum _{j=1}^{m_{n+1}} S(Z_{j,n}). \end{aligned}$$
(11)

A second strategy, inspired by stochastic approximation methods (see, e.g., Benveniste et al. 1990; Kushner and Yin 2003), consists in replacing \({\bar{S}}(\theta _n)\) by a stochastic approximation

$$\begin{aligned} S_{n+1}^{\mathrm {sa}} :=(1-\delta _{n+1}) S_n^{\mathrm {sa}} + \frac{\delta _{n+1}}{m_{n+1}} \sum _{j=1}^{m_{n+1}} S(Z_{j,n}), \end{aligned}$$
(12)

where \(\{\delta _n, n \ge 0 \}\) is a deterministic \(\left[ 0,1\right] \)-valued sequence. These two strategies yield, respectively, the Monte Carlo proximal-gradient (MCPG) algorithm (see Algorithm 1) and the stochastic approximation proximal-gradient (SAPG) algorithm (see Algorithm 2).
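Both statistics are one-line updates once the batch \(\{S(Z_{1,n}), \ldots , S(Z_{m_{n+1},n})\}\) has been computed; a minimal sketch (the sampler itself is left abstract):

```python
import numpy as np

def update_s_mc(S_batch):
    """Eq. (11): plain Monte Carlo mean of S over the current batch."""
    return np.mean(S_batch, axis=0)

def update_s_sa(S_prev, S_batch, delta):
    """Eq. (12): stochastic-approximation update, a convex combination of
    the running statistic and the current batch mean (delta in [0, 1])."""
    return (1.0 - delta) * S_prev + delta * np.mean(S_batch, axis=0)
```

Taking \(\delta _{n+1} = 1\) in Eq. (12) recovers Eq. (11); smaller values of \(\delta _n\) average over the whole past.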

(Algorithm 1: MCPG; pseudocode figure omitted)
(Algorithm 2: SAPG; pseudocode figure omitted)

In Sect. 3, we prove the convergence of MCPG to the maximum points of F when \(\ell \) is concave, for different choices of the sequences \(\{\gamma _n, m_n, n\ge 0\}\) including decreasing or constant step sizes \(\{\gamma _n,n \ge 0 \}\) and, respectively, constant or increasing batch size \(\{m_n, n \ge 0\}\). We also establish the convergence of SAPG to the maximum points (in the concave case); only the case of a constant batch size \(\{m_n,n \ge 0\}\) and a decreasing step size \(\{\gamma _n, n \ge 0\}\) is studied, since this framework corresponds to the stochastic approximation one from which the update rule Eq. (12) is inherited (see details in Delyon et al. 1999). From a numerical point of view, the choice of the sequences \(\{\gamma _n, n\ge 0\}, \{\delta _n, n \ge 0 \}\) and \(\{m_n, n \ge 0\}\) is discussed in Sect. 4: guidelines are given in Sect. 4.2 and the behavior of the algorithm is illustrated through a toy example in Sect. 4.3.

2.2 Case of latent variable models from the exponential family

In this section, we consider the case when \(\ell \) is given by Eq. (4). A classical approach to solve penalized maximum-likelihood problems in latent variable models with complete likelihood from the exponential family is the expectation–maximization (EM) algorithm or a generalization called the generalized EM (GEM) algorithm (Dempster et al. 1977; McLachlan and Krishnan 2008; Ng et al. 2012). Our goal here is to show that MCPG and SAPG are stochastic perturbations of a GEM algorithm.

The EM algorithm is an iterative algorithm: at each iteration, given the current parameter \(\theta _n\), the quantity \({\mathcal {Q}}(\theta \vert \theta _n)\), defined as the conditional expectation of the complete log likelihood under the a posteriori distribution for the current fit of the parameters, is computed:

$$\begin{aligned} {\mathcal {Q}}(\theta \vert \theta ') :=\phi (\theta ) + \left<{\bar{S}}(\theta '),\psi (\theta )\right>. \end{aligned}$$
(13)

The EM sequence \(\{\theta _n, n\ge 0\}\) for the maximization of the penalized log likelihood \(\ell -g\) is given by (see McLachlan and Krishnan 2008, Section 1.6.1)

$$\begin{aligned} \theta _{n+1} = {\text {argmax}}_{\theta \in \Theta } \left\{ \phi (\theta ) + \left<{\bar{S}}(\theta _n),\psi (\theta )\right> -g(\theta ) \right\} . \end{aligned}$$
(14)

When \({\bar{S}}(\theta )\) is intractable, it was proposed to replace \({\bar{S}}(\theta _n)\) in this EM-penalized algorithm by an approximation \(S_{n+1}\)—see Algorithm 3. When \(S_{n+1} = S_{n+1}^{\mathrm {mc}}\) (see Eq. 11), this yields the so-called Monte Carlo EM-penalized algorithm (MCEM-pen), trivially adapted from the MCEM proposed by Wei and Tanner (1990) and Levine and Fan (2004). Another popular strategy is to replace \({\bar{S}}(\theta _n)\) by \(S_{n+1}^\mathrm {sa}\) (see Eq. 12), yielding the so-called stochastic approximation EM-penalized algorithm (SAEM-pen) (see Delyon et al. 1999 for the unpenalized version).

(Algorithm 3: EM-pen with approximated E-step; pseudocode figure omitted)

When the maximization of Eq. (13) is not explicit, the update of the parameter is modified as follows, yielding the generalized EM-penalized algorithm (GEM-pen):

$$\begin{aligned}&\theta _{n+1} \ \text {s.t.} \phi (\theta _{n+1}) + \left<{\bar{S}}(\theta _n),\psi (\theta _{n+1})\right> -g(\theta _{n+1}) \nonumber \\&\quad \ge \phi (\theta _n) + \left<{\bar{S}}(\theta _n),\psi (\theta _n)\right> -g(\theta _n). \end{aligned}$$
(15)

This update rule still produces a sequence \(\{\theta _n, n\ge 0\}\) satisfying the ascent property \(F(\theta _{n+1}) \ge F(\theta _n)\), which is the key property for the convergence of EM (see, e.g., Wu 1983). Here again, the approximations defined in Eqs. (11) and (12) can be plugged into the GEM-pen update Eq. (15) when \({\bar{S}}\) is not explicit.

We show in the following proposition that the sequence \(\{\theta _n, n \ge 0\}\) produced by the proximal-gradient algorithm Eq. (10) is a GEM-pen sequence since it satisfies the inequality Eq. (15). As a consequence, MCPG and SAPG are stochastic GEM-pen algorithms.

Proposition 1

Let g satisfy H1 and let \(\ell \) be of the form Eq. (4) with continuously differentiable functions \(\phi \,{:}\,\mathbb {R}^d \rightarrow \mathbb {R}\), \(\psi \,{:}\,\mathbb {R}^d \rightarrow \mathbb {R}^q\) and \(S\,{:}\,\mathsf {Z}\rightarrow \mathbb {R}^q\). Set \(\Theta :=\{ g +|\ell | < \infty \}\). Define \({\bar{S}}\,{:}\,\Theta \rightarrow \mathbb {R}^q\) by \({\bar{S}}(\theta ) :=\int _\mathsf {Z}S(z) \, \pi _\theta (z) \, \nu (\mathrm {d}z)\) where \(\pi _\theta \) is given by Eq. (5). Assume that there exists a constant \(L>0\) such that for any \(s \in {\bar{S}}(\Theta )\) and any \(\theta , \theta ' \in \Theta \),

$$\begin{aligned}&\Vert \nabla \phi (\theta ) - \nabla \phi (\theta ') + \left( {\text {J}} \psi (\theta ) - {\text {J}} \psi (\theta ')\right) s \Vert \nonumber \\&\quad \le L \Vert \theta - \theta ' \Vert . \end{aligned}$$
(16)

Let \(\{\gamma _n, n \ge 0 \}\) be a (deterministic) positive sequence such that \(\gamma _n \in \left( 0, 1/L\right] \) for all \(n \ge 0\).

Then the proximal-gradient algorithm Eq. (10) is a GEM-pen algorithm for the maximization of \(\ell -g\).

The proof is postponed to “Appendix A”. The assumption Eq. (16) holds when \(\Theta \) is compact and \({\bar{S}}\) (resp. \(\phi \) and \(\psi \)) is continuous (resp. twice continuously differentiable). Note also that \(\left( {\text {J}} \psi (\theta ) - {\text {J}} \psi (\theta ')\right) s =0\) for any \(\theta , \theta ' \in \Theta \) and \(s \in {\bar{S}}(\Theta )\) as soon as \({\bar{S}}(\theta ) \in \mathrm {Ker}( {\text {J}} \psi (\theta '))\) for any \(\theta , \theta ' \in \Theta \).

3 Convergence of MCPG and SAPG

The convergence of MCPG and SAPG is established by applying recent results from Atchadé et al. (2017) on the convergence of perturbed proximal-gradient algorithms. Atchadé et al. (2017, Theorem 2), applied to the case where \(\nabla \ell (\theta )\) is of the form \(\nabla \phi (\theta ) + \varPsi (\theta ) {\bar{S}}(\theta )\), where \({\bar{S}}(\theta )\) is an intractable expectation and \(\nabla \phi , \varPsi \) are explicit, yields the following result.

Theorem 1

Assume H1, H2, that \(\theta \mapsto \ell (\theta )\) is concave, and that the set \({{\mathcal {L}}} :={\text {argmax}}_{\theta \in \Theta } F(\theta )\) is a non-empty subset of \(\Theta \). Let \(\{\theta _n, n \ge 0 \}\) be given by

$$\begin{aligned} \theta _{n+1} = {\text {Prox}}_{\gamma _{n+1},g}\left( \theta _n + \gamma _{n+1}\left\{ \nabla \phi (\theta _n) + \varPsi (\theta _n) S_{n+1} \right\} \right) , \end{aligned}$$

with a \(\left( 0,1/L\right] \)-valued stepsize sequence \(\{\gamma _n, n \ge 0 \}\) satisfying \( \sum _n \gamma _n = + \infty \). If the series

$$\begin{aligned}&\sum _n \gamma _{n+1} \left< \varPsi (\theta _n) \left( S_{n+1} - {\bar{S}}(\theta _n) \right) , T_{\gamma _{n+1}}(\theta _n) \right>, \\&\sum _n \gamma _{n+1} \varPsi (\theta _n) \left( S_{n+1} - {\bar{S}}(\theta _n) \right) , \\&\sum _n \gamma _{n+1}^2 \Vert \varPsi (\theta _n) \left( S_{n+1} - {\bar{S}}(\theta _n) \right) \Vert ^2, \end{aligned}$$

converge, where

$$\begin{aligned} T_\gamma (\theta ) :={\text {Prox}}_{\gamma ,g}(\theta +\gamma \left\{ \nabla \phi (\theta ) + \varPsi (\theta ) {\bar{S}}(\theta )\right\} ), \end{aligned}$$

then there exists \(\theta _\infty \in {{\mathcal {L}}}\) such that \(\lim _n \theta _n = \theta _\infty \).

We check the conditions of Theorem 1 in the case where \(S_{n+1}\) is given by Eq. (11) for the proof of MCPG and by Eq. (12) for the proof of SAPG. Our convergence analysis is restricted to the case where \(\ell \) is concave; to the best of our knowledge, the convergence of perturbed proximal-gradient algorithms when \(\ell \) is not concave is an open question.

The novelty in this section lies in Proposition 2 and Theorem 4, which provide, respectively, a control of the \(L_2\)-norm of the error \(S_{n+1}^\mathrm {sa}- {\bar{S}}(\theta _n)\) and the convergence of SAPG. These results rely on a rewriting of \(\left( S_{n+1}^\mathrm {sa}- {\bar{S}}(\theta _n) \right) \) taking into account that \(S_{n+1}^\mathrm {sa}\) is a weighted sum of the function S evaluated at all the samples \(\{Z_{i,j}, i \le m_{j+1}, j \le n \}\) drawn since the initialization of the algorithm. This approximation differs from the more classical Monte Carlo approximation (see Theorems 2 and 3 for the convergence of MCPG, which are special cases of the results in Atchadé et al. 2017).

We allow the simulation step of MCPG and SAPG to rely on Markov chain Monte Carlo sampling: at iteration \((n+1)\), the conditional distribution of \(Z_{j+1,n}\) given the past is \(P_{\theta _n}(Z_{j,n}, \cdot )\), where \(P_\theta \) is a Markov transition kernel having \(\pi _\theta \mathrm {d}\nu \) as its unique invariant distribution. The control of the quantities \(S_{n+1} - {\bar{S}}(\theta _n)\) requires some ergodic properties on the kernels \(\{P_{\theta _n}, n \ge 0\}\) along the path \(\{\theta _n, n \ge 0 \}\) produced by the algorithm. These properties have to be uniform in \(\theta \), a requirement often called the “containment condition” (see, e.g., the literature on the convergence of adaptive MCMC samplers, for example Andrieu and Moulines 2006; Roberts and Rosenthal 2007; Fort et al. 2011a). There are three main strategies to prove the containment condition. In the first strategy, \(\Theta \) is assumed to be bounded, and a uniform ergodic assumption on the kernels \(\{P_\theta , \theta \in \Theta \}\) is made. In the second one, there is no boundedness assumption on \(\Theta \) but the property \({\mathbb {P}}(\limsup _n \Vert \theta _n \Vert < \infty )=1\) has to be established prior to the proof of convergence; a kind of local boundedness condition on the sequence \(\{\theta _n, n\ge 0\}\) is then applied—see, e.g., Andrieu and Moulines (2006) and Fort et al. (2011a). The last strategy consists in showing that \({\mathbb {P}}( \sup _n \rho _n \Vert \theta _n \Vert < \infty ) =1\) for some deterministic sequence \(\{\rho _n, n \ge 0\}\) vanishing to zero when \(n \rightarrow \infty \) at a rate compatible with the decaying ergodicity rate—see, e.g., Saksman and Vihola (2010). The last two strategies are quite technical and require a strong background in controlled Markov chain theory; for pedagogical purposes, we therefore decided to state our results in the first context: we will assume that \(\Theta \) is bounded.

By allowing MCMC approximations, we propose a theory which covers the case of a biased approximation, called below the biased case: conditionally on the past

$$\begin{aligned} {\mathcal {F}}_n :=\sigma \left( Z_{i,j}, i \le m_{j+1}, j \le n-1\right) , \end{aligned}$$
(17)

the expectation of \(S_{n+1}\) is not \({\bar{S}}(\theta _n)\): \({\mathbb {E}}\left[ S_{n+1} \vert {\mathcal {F}}_n \right] \ne {\bar{S}}(\theta _n)\). As soon as the samplers \(\{P_\theta , \theta \in \Theta \}\) are ergodic enough (for example, under (H4a) and (H4b)), the bias vanishes when the number of Monte Carlo points \(m_n\) tends to infinity. Therefore, the proof for the biased case when the sequence \(\{m_n,n \ge 0 \}\) is constant is the most technical situation since the bias does not decay. It relies on a specific decomposition of the error \(S_{n+1} - {\bar{S}}(\theta _n)\) into a martingale increment with bounded \(L^2\)-moments, and a remainder term which vanishes when \(n \rightarrow \infty \) even when the batch size \(m_n\) is constant. Such a behavior of the remainder term is a consequence of regularity properties on the functions \(\nabla \phi , \varPsi , {\bar{S}}\) (see H3c), on the proximity operator (see H3d) and on the kernels \(\{P_\theta , \theta \in \Theta \}\) (see H4c).

Our theory also covers the unbiased case, i.e., when

$$\begin{aligned}{\mathbb {E}}\left[ S_{n+1} \vert {\mathcal {F}}_n \right] = {\bar{S}}(\theta _n). \end{aligned}$$

We therefore establish the convergence of MCPG and SAPG by strengthening the conditions H1 and H2 with

H3

(a)

    \(\ell \) is concave and the set \({{\mathcal {L}}} :={\text {argmax}}_{\Theta } F\) is a non-empty subset of \(\Theta \).

(b)

    \(\Theta \) is bounded.

(c)

    There exists a constant L such that for any \(\theta ,\theta ' \in \Theta \),

    $$\begin{aligned}&\Vert \nabla \phi (\theta ) - \nabla \phi (\theta ') \Vert + \Vert \varPsi (\theta ) - \varPsi (\theta ') \Vert \\&\quad +\,\Vert {\bar{S}}(\theta ) - {\bar{S}}(\theta ')\Vert \le L \Vert \theta - \theta ' \Vert , \end{aligned}$$

    where for a matrix \(A, \Vert A\Vert \) denotes the operator norm associated with the Euclidean vector norm.

(d)

    \(\sup _{\gamma \in \left( 0,1/L\right] } \sup _{\theta \in \Theta } \gamma ^{-1} \Vert {\text {Prox}}_{\gamma ,g}(\theta ) - \theta \Vert < \infty \).

Note that the assumptions (H3b)–(H3c) imply Eq. (3) and \(\sup _{\theta \in \Theta } \left( \Vert \nabla \phi (\theta ) \Vert + \Vert \varPsi (\theta ) \Vert + \Vert {\bar{S}}(\theta ) \Vert \right) < \infty \). When \(\Theta \) is a compact convex set, (H3d) holds for the elastic net, the lasso or the fused lasso penalty. Atchadé et al. (2017, Proposition 11) give general conditions for (H3d) to hold.

Before stating the ergodicity conditions on the kernels \(\{P_\theta , \theta \in \Theta \}\), let us recall some basic properties of Markov kernels. A Markov kernel P on the measurable set \((\mathsf {Z}, {\mathcal {Z}})\) is a function on \(\mathsf {Z}\times {\mathcal {Z}}\), taking values in \(\left[ 0,1\right] \), such that for any \(x \in \mathsf {Z}\), \(P(x,\cdot )\) is a probability measure on \({\mathcal {Z}}\), and for any \(A \in {\mathcal {Z}}\), \(x \mapsto P(x,A)\) is measurable. Furthermore, if P is a Markov kernel, \(P^k\) denotes the kth iterate of P defined by induction as

$$\begin{aligned}&P^0(x,A) :=\mathbb {1}_A(x), \\&P^k(x,A) :=\int P^{k-1}(x,\mathrm {d}z)P(z,A), \quad k \ge 1. \end{aligned}$$

Finally, the kernel P acts on probability measures: for any probability measure \(\xi \) on \({\mathcal {Z}}\), \(\xi P\) is the probability measure defined by

$$\begin{aligned} \xi P(A) :=\int \xi (\mathrm {d}z) P(z,A), \quad A \in {\mathcal {Z}}; \end{aligned}$$

and P acts on positive measurable functions: for a measurable function \(f\,{:}\,\mathsf {Z}\rightarrow \mathbb {R}_+\), \(Pf\) is the measurable function defined by

$$\begin{aligned} Pf(z) :=\int f(y) \, P(z, \mathrm {d}y). \end{aligned}$$
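On a finite state space, these three actions reduce to matrix operations, which may help fix ideas (the two-state kernel below is an arbitrary toy example):

```python
import numpy as np

# A Markov kernel on the two-point space {0, 1}, stored as a row-stochastic
# matrix: P[x, y] = P(x, {y}).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

P3 = np.linalg.matrix_power(P, 3)   # the iterate P^3

xi = np.array([1.0, 0.0])           # a probability measure on {0, 1}
xiP = xi @ P                        # action on measures: (xi P)(A)

f = np.array([2.0, 5.0])            # a nonnegative function on {0, 1}
Pf = P @ f                          # action on functions: (P f)(z)
```

The invariant distribution solves \(\pi = \pi P\); H4 below asks for quantitative versions of such ergodic properties, uniformly in \(\theta \).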

We refer the reader to Meyn and Tweedie (2009) for the definitions and basic properties on Markov chains. Given a measurable function \(W\,{:}\,\mathsf {Z}\rightarrow \left[ 1,+\infty \right) \), define the W-norm of a signed measure \(\nu \) on \({\mathcal {Z}}\) and the W-norm of a function \(f\,{:}\, \mathsf {Z}\rightarrow \mathbb {R}^d\):

$$\begin{aligned} \left| f \right| _W :=\sup _{\mathsf {Z}} \frac{\Vert f\Vert }{W}, \quad \left\| \nu \right\| _W :=\sup _{f\,{:}\,\left| f \right| _W \le 1} \left| \int f \mathrm {d}\nu \right| ; \end{aligned}$$

these norms generalize, respectively, the supremum norm of a function and the total variation norm of a measure.

Our results are derived under the following conditions on the kernels:

H4

(a)

    There exist \(\lambda \in \left( 0,1\right] , b < \infty \) and a measurable function \(W\,{:}\,\mathsf {Z}\rightarrow \left[ 1,+\infty \right) \) such that

    $$\begin{aligned} \left| S \right| _{\sqrt{W}} < \infty , \quad \sup _{\theta \in \Theta } P_\theta W \le \lambda W + b. \end{aligned}$$
(b)

    There exist constants \(C< \infty \) and \(\rho \in \left( 0,1\right) \) such that for any \(z \in \mathsf {Z}\) and \(n \ge 0\),

    $$\begin{aligned} \sup _{\theta \in \Theta } \left\| P_\theta ^n(z, \cdot ) - \pi _\theta \right\| _W \le C \, \rho ^n \, W(z). \end{aligned}$$
(c)

    There exists a constant C such that for any \(\theta , \theta ' \in \Theta \),

    $$\begin{aligned}&\left\| \pi _\theta - \pi _{\theta '} \right\| _{\sqrt{W}} + \sup _{z \in \mathsf {Z}} \frac{\left\| P_\theta (z,\cdot ) - P_{\theta '}(z,\cdot ) \right\| _{\sqrt{W}}}{\sqrt{W}(z)} \\&\quad \le C \, \Vert \theta - \theta ' \Vert . \end{aligned}$$

Sufficient conditions for the uniform-in-\(\theta \) ergodic behavior (H4b) are given, e.g., in Fort et al. (2011b, Lemma 2.3): this lemma shows how to deduce such a control from a minorization condition and a drift inequality on the Markov kernels. Examples of MCMC kernels \(P_\theta \) satisfying these assumptions can be found in Andrieu and Moulines (2006, Proposition 12) and Saksman and Vihola (2010, Proposition 15) for the adaptive Hastings–Metropolis algorithm, in Fort et al. (2011b, Proposition 3.1) for an interactive tempering sampler, in Schreck et al. (2013, Proposition 3.2) for the equi-energy sampler, and in Fort et al. (2015, Proposition 3.1) for a Wang–Landau type sampler.

Theorem 2 establishes the convergence of MCPG when the number of points in the Monte Carlo sum \(S_{n+1}^\mathrm {mc}\) is constant over iterations and the step size sequence \(\{\gamma _n, n \ge 0\}\) vanishes at a convenient rate. It is proved in Atchadé et al. (2017, Theorem 4).

Theorem 2

Assume H1, H2, (H3a–c) and (H4a–b). Let \(\{\theta _n, n \ge 0 \}\) be the sequence given by Algorithm 1 with a \(\left( 0,1/L\right] \)-valued sequence \(\{\gamma _n, n \ge 0 \}\) such that \(\sum _n \gamma _n = + \infty \) and \(\sum _n \gamma _n^2<\infty \), and with a constant sequence \(\{m_n, n \ge 0 \}\).

In the biased case, assume also (H3d) and (H4c) and \(\sum _n |\gamma _{n+1} - \gamma _n |< \infty \).

Then, with probability one, there exists \(\theta _\infty \in {{\mathcal {L}}}\) such that \(\lim _n \theta _n = \theta _\infty \).

Theorem 3 establishes the convergence of MCPG when the number of points in the Monte Carlo sum \(S_{n+1}^\mathrm {mc}\) is increasing; it allows a constant stepsize sequence \(\{\gamma _n, n \ge 0\}\). It is proved in Atchadé et al. (2017, Theorem 6).

Theorem 3

Assume H1, H2, (H3a–c) and (H4a–b). Let \(\{\theta _n, n \ge 0 \}\) be the sequence given by Algorithm 1 with a \(\left( 0,1/L\right] \)-valued sequence \(\{\gamma _n, n \ge 0 \}\) and an integer valued sequence \(\{m_n,n \ge 0\}\) such that \(\sum _n \gamma _n = +\infty \) and \(\sum _n \gamma _{n}^2/m_{n} < \infty \).

In the biased case, assume also \(\sum _n \gamma _{n}/m_{n}< \infty \).

Then, with probability one, there exists \(\theta _\infty \in {{\mathcal {L}}}\) such that \(\lim _n \theta _n = \theta _\infty \).

MCPG and SAPG differ in their approximation of \({\bar{S}}(\theta _n)\) at each iteration. We provide below a control of this error for a constant or a polynomially increasing batch size \(\{m_n, n \ge 0 \}\), and polynomially decreasing stepsize sequences \(\{\gamma _n, n \ge 0 \}\) and \(\{\delta _n, n \ge 0 \}\).

Proposition 2

Let \(\gamma _\star , \delta _\star , m_\star \) be positive constants and \(\beta \in \left[ 0,1\right) , \alpha \ge \beta , c\ge 0\). Set \(\gamma _n = \gamma _\star n^{-\alpha }, \delta _n = \delta _\star n^{-\beta }\) and \(m_n = m_\star n^{c}\). Assume H1 to H4. Then

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}\left[ \Vert S_{n+1}^\mathrm {mc}- {\bar{S}}(\theta _n)\Vert ^2 \right] = O\left( n^{-c} \right) , \\&{\mathbb {E}}\left[ \Vert S_{n+1}^\mathrm {sa}- {\bar{S}}(\theta _n) \Vert ^2\right] =O\left( n^{- \{ 2(\alpha -\beta ) \wedge (\beta +c)\}} \right) . \end{aligned} \end{aligned}$$

The proof is given in “Appendix C”. This proposition shows that when MCPG is applied with a constant batch size \((c=0)\), the error \(S_{n+1}^\mathrm {mc}- {\bar{S}}(\theta _n)\) does not vanish; this is not the case for SAPG: even when \(c=0\), the error \(S_{n+1}^\mathrm {sa}- {\bar{S}}(\theta _n)\) vanishes as soon as \(\alpha> \beta >0\). Since a constant batch size is the usual choice of practitioners in order to reduce the computational cost of the algorithm, this proposition supports the use of SAPG over MCPG.
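The qualitative difference can be reproduced on a stylized simulation (all values below are hypothetical, and the target \({\bar{S}}\) is held fixed, which ignores the coupling with \(\theta _n\)): the plain Monte Carlo estimate with constant batch size keeps a mean squared error of order \(1/m\), while the stochastic approximation average with \(\delta _n = n^{-\beta }\) drives it to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
s_bar, m, beta = 2.0, 60, 0.5        # hypothetical target and tuning values
n_iter, reps = 2000, 200

sq_err_mc = sq_err_sa = 0.0
for _ in range(reps):
    S_sa = 0.0
    for n in range(1, n_iter + 1):
        # mean of a batch of m unit-variance draws centered at s_bar
        batch_mean = s_bar + rng.normal() / np.sqrt(m)
        S_sa += n ** (-beta) * (batch_mean - S_sa)   # SA averaging (SAPG-like)
    sq_err_mc += (batch_mean - s_bar) ** 2           # plain MC uses the last batch only
    sq_err_sa += (S_sa - s_bar) ** 2

mse_mc, mse_sa = sq_err_mc / reps, sq_err_sa / reps  # mse_mc stays near 1/m
```

The MC error stabilizes around \(1/m\), whereas the SA error is an order of magnitude smaller at the final iteration.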

We finally study the convergence of SAPG without assuming that the batch size sequence \(\{m_{n}, n \ge 0\}\) is constant; this leads to the following assumption on the sequences \(\{\gamma _n, \delta _n, m_n, n \ge 0\}\).

H 5

The step size sequences \(\{\gamma _n, n \ge 0 \}, \{\delta _n, n \ge 0 \}\) and the batch size sequence \(\{m_n, n \ge 0 \}\) satisfy

  1. (a)

    \(\gamma _n \in \left( 0,1/L\right] , \delta _n \in \left( 0,1\right) , m_n \in \mathbb {N}, \sum _n \gamma _n = + \infty , \sum _n \gamma _n^2 < \infty \),

    $$\begin{aligned}&\sum _{n } \left( \gamma _{n-1} \gamma _{n} + \gamma _{n-1}^2 + |\gamma _n - \gamma _{n-1}| \right) \mathsf {D}_n< \infty , \\&\quad \sum _n \gamma _n^2 \delta _n^2 (1+\mathsf {D}_{n+1})^2 m_n^{-1} < \infty , \end{aligned}$$

    where \(\mathsf {D}_n :=\sum _{k \ge n} \left( \prod _{j=n}^k (1-\delta _j)\right) \).

  2. (b)

    Furthermore,

    $$\begin{aligned}&\sum _n \gamma _{n+1} |m_{n+1}^{-1} \delta _{n+1} - m_n^{-1} \delta _n|< \infty , \\&\sum _n \gamma _{n+1} |m_{n+1}^{-1} \delta _{n+1} \mathsf {D}_{n+2} - m_n^{-1} \delta _n \mathsf {D}_{n+1}|< \infty , \\&\sum _n \left( \gamma _{n-1} \gamma _{n} + \gamma _{n-1}^2 + |\gamma _n - \gamma _{n-1}| \right) \ldots \\&\quad \times m_{n-1}^{-1} \, \delta _{n-1} (1 + \mathsf {D}_{n}) < \infty . \end{aligned}$$

Let us comment on this assumption in the case where the batch size sequence \(\{m_n, n \ge 0 \}\) is constant. This situation corresponds to the “stochastic approximation regime”, where the number of draws at each iteration is \(m_n=1\) (or, say, \(m_n = m\) for any n); it is also the usual practice to reduce the computational cost. When \(\delta _n= \delta _\star \in \left( 0,1\right) \) for any \(n \ge 0\), then \(\mathsf {D}_n = \delta _\star ^{-1}\) for any \(n \ge 0\). Consequently, condition H5 is satisfied with polynomially decreasing sequences \(\gamma _n \sim \gamma _\star / n^\alpha \), \(\alpha \in \left( 1/2,1\right] \) (and \(m_n = m\) for any n).

When \(\delta _n \sim \delta _\star \, n^{-\beta }\) for \(\beta \in \left( 0,1\right) \), then \(\mathsf {D}_n = O( n^\beta )\) (see Lemma 3). Hence, using Lemma 3, (H5a) and (H5b) are satisfied with \(\gamma _n \sim \gamma _\star n^{-\alpha }\) where \(\beta< (1+\beta )/2 < \alpha \le 1\), and \(m_n = m\) for any n.

We cannot have \(\delta _n = \delta _\star n^{-1}\) since it implies \(\mathsf {D}_n = + \infty \) for any \(n \ge 0\).
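The quantity \(\mathsf {D}_n\) is easy to evaluate numerically, which is a convenient way to check candidate step size sequences against H5. A sketch (with the truncated sum computed exactly as displayed, the constant-step value is \((1-\delta _\star )/\delta _\star \), i.e., of order \(\delta _\star ^{-1}\), the exact constant depending on whether the first factor is counted; for \(\delta _j = \delta _\star j^{-\beta }\), \(\mathsf {D}_n\) grows like \(n^\beta \)):

```python
def D(n, delta, max_terms=10**6, tol=1e-14):
    """Truncated evaluation of D_n = sum_{k>=n} prod_{j=n}^k (1 - delta(j))."""
    total, prod = 0.0, 1.0
    for k in range(n, n + max_terms):
        prod *= 1.0 - delta(k)
        total += prod
        if prod < tol:          # remaining terms are negligible
            break
    return total

delta_star, beta = 0.5, 0.5

# Constant step: geometric sum, equal to (1 - delta_star)/delta_star here.
d_const = D(10, lambda j: delta_star)

# Polynomial step delta_j = delta_star * j**(-beta): D_n grows like n**beta.
d100 = D(100, lambda j: delta_star * j ** (-beta))
d400 = D(400, lambda j: delta_star * j ** (-beta))
ratio = d400 / d100             # should be close to (400/100)**beta = 2
```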

Theorem 4

Assume H1, H2, H3 and (H4a–b). Let \(\{\theta _n, n \ge 0 \}\) be the sequence given by Algorithm 2 and applied with sequences \(\{\gamma _n, \delta _n, m_n, n \ge 0\}\) verifying (H5a).

In the biased case, assume also (H4c) and (H5b).

Then with probability one, there exists \(\theta _\infty \in {{\mathcal {L}}}\) such that \(\lim _n \theta _n = \theta _\infty \).

Proof

The proof is in Section D. \(\square \)

4 Numerical illustration in the convex case

In this section, we illustrate the behavior of the algorithms MCPG and SAPG on a toy example. We first introduce the example and then give some guidelines for a specific choice of the sequences \(\{\delta _n, n \ge 0\}, \{\gamma _n, n \ge 0\}\). Finally, the algorithms are compared more systematically on repeated simulations.

4.1 A toy example

The example is a mixed model in which the regression function is linear in the latent variable \(Z\). More precisely, we observe data \((\mathsf {Y}_1, \ldots , \mathsf {Y}_N)\) from N subjects, the data of each individual being a vector of size J: \(\mathsf {Y}_k :=(\mathsf {Y}_{k1}, \ldots , \mathsf {Y}_{kJ})\). For subject \(k\), \(k=1, \ldots , N\), \(\mathsf {Y}_{kj}\) is the jth measurement, taken at time \(t_{kj}\), \(j=1, \ldots , J\). It is assumed that \(\{\mathsf {Y}_{k}, k=1, \ldots , N\}\) are independent and that for all \(k=1, \ldots , N\),

$$\begin{aligned} \begin{aligned}&\mathsf {Y}_{kj} \vert Z^{(k)} {\mathop {\sim }\limits ^{\mathrm{ind}}} {\mathcal {N}}\left( \left<Z^{(k)},{\bar{t}}_{kj}\right> ,1\right) , \\&\bar{t}_{kj} :=\begin{bmatrix} 1 \\ t_{kj} \end{bmatrix} \quad j=1, \ldots , J; \end{aligned} \end{aligned}$$
(18)

that is, a linear regression model with an individual random intercept and slope, gathered in the \(\mathbb {R}^2\)-valued vector \(Z^{(k)}\). The latent variable is \(Z=(Z^{(1)}, \ldots , Z^{(N)})\). Furthermore,

$$\begin{aligned} Z^{(k)} {\mathop {\sim }\limits ^{\mathrm{ind}}} {\mathcal {N}}_2( X_k \theta , I_2); \end{aligned}$$
(19)

here, \(\theta \in \mathbb {R}^{2(D+1)}\) is an unknown parameter and the design matrix \(X_k \in \mathbb {R}^{2 \times 2(D+1)}\) is known:

$$\begin{aligned} X_k :=\begin{bmatrix} 1&X_{k1}&\ldots&X_{k D}&0&0&\ldots&0 \\ 0&0&\ldots&0&1&X_{k1}&\ldots&X_{k D} \end{bmatrix}. \end{aligned}$$
(20)

The optimization problem of the form Eq. (1) that we consider is the maximization of the log likelihood \(\ell (\theta )\) penalized by a lasso penalty; the objective is the selection of the influential covariates

$$\begin{aligned}(X_{k1}, \ldots , X_{kD})\end{aligned}$$

on the two components of \(Z^{(k)}\). We thus penalize all the elements except \(\theta _1\) and \(\theta _{D+2}\) which correspond to the two intercepts; hence, we set

$$\begin{aligned} g(\theta ) :=\lambda \sum _{r \notin \{1, D+2 \}} |\theta _r|. \end{aligned}$$
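The proximal operator of \(\gamma g\) for this penalty is componentwise soft-thresholding that leaves the two intercept coordinates untouched. A minimal sketch (0-based indexing, so the unpenalized intercepts sit at positions 0 and \(D+1\); the function name is ours):

```python
import numpy as np

def prox_penalty(theta, step, lam, D):
    """Soft-thresholding for g(theta) = lam * sum over penalized r of |theta_r|.

    With 0-based indexing, the two intercepts sit at positions 0 and D+1
    and are left unpenalized.
    """
    out = np.sign(theta) * np.maximum(np.abs(theta) - step * lam, 0.0)
    out[0] = theta[0]          # intercept of the random intercept part
    out[D + 1] = theta[D + 1]  # intercept of the random slope part
    return out

theta = np.array([3.0, 0.2, -0.5, 2.5, 0.1, -1.0])   # toy vector with D = 2
res = prox_penalty(theta, step=0.1, lam=2.0, D=2)
```

Coordinates below the threshold \(\gamma \lambda \) are set exactly to zero, which is what produces the sparse supports reported below.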

The above model is a latent variable model whose complete log likelihood is equal, up to an additive constant, to

$$\begin{aligned} - \frac{1}{2}\sum _{k=1}^N\left\{ \sum _{j=1}^{J} \left( \mathsf {Y}_{kj} - \left<Z^{(k)},{\bar{t}}_{kj}\right>\right) ^2 + \left\| Z^{(k)} - X_k \theta \right\| ^2 \right\} . \end{aligned}$$

It is of the form \(\phi (\theta ) + \left<S(z),\psi (\theta )\right>\) by setting (with \((\cdot )'\) denoting the transpose of a matrix)

$$\begin{aligned} \begin{aligned}&\phi (\theta ) :=-\frac{1}{2} \theta ' \left( \sum _{k=1}^N X_k' X_k\right) \theta - \frac{1}{2} \sum _{k=1}^N \sum _{j=1}^J \mathsf {Y}_{kj}^2, \\&\psi (\theta ) :=\begin{bmatrix} 1 \\ \theta \end{bmatrix} \in \mathbb {R}^{1+2(D+1)}, \\&S(z^{(1)}, \ldots , z^{(N)}) \\&\quad :=- \frac{1}{2} \sum _{k=1}^N \begin{bmatrix} {z^{(k)}}' (I+T_k) z^{(k)}- 2 \left<z^{(k)},{\bar{\mathsf {Y}}}_k\right> \\ - 2 X_k' z^{(k)} \end{bmatrix}, \\&T_k :=\sum _{j=1}^{J} {\bar{t}}_{kj} \bar{t}_{kj}',\\&{\bar{\mathsf {Y}}}_k :=\sum _{j=1}^{J} \mathsf {Y}_{kj} {\bar{t}}_{kj}. \end{aligned} \end{aligned}$$

The a posteriori distribution \(\pi _\theta \) is a Gaussian distribution on \(\mathbb {R}^{2N}\), equal to the product of N Gaussian distributions on \(\mathbb {R}^2\):

$$\begin{aligned}&\pi _\theta (z^{(1)}, \ldots , z^{(N)}) :=\prod _{k=1}^N {\mathcal {N}}_2\nonumber \\&\quad \left( (I+T_k)^{-1} \left( {\bar{\mathsf {Y}}}_k + X_k \theta \right) , (I+T_k)^{-1} \right) [z^{(k)}]. \end{aligned}$$
(21)

Hence, \({\bar{S}}(\theta )\) is explicit and given by

$$\begin{aligned}&{\bar{S}}(\theta ) =-\frac{1}{2} \nonumber \\&\quad \sum _{k=1}^N \begin{bmatrix} \mathrm {Trace}((I+T_k) \Sigma _k) -2 {\bar{\mathsf {Y}}}_k' (I+T_k)^{-1} \left( {\bar{\mathsf {Y}}}_k + X_k \theta \right) \\ -2 X_k' (I+T_k)^{-1} \left( {\bar{\mathsf {Y}}}_k + X_k \theta \right) \end{bmatrix}\nonumber \\ \end{aligned}$$
(22)

with

$$\begin{aligned} \Sigma _k:= & {} (I+T_k)^{-1} + (I+T_k)^{-1} \left( {\bar{\mathsf {Y}}}_k + X_k \theta \right) \nonumber \\&\left( {\bar{\mathsf {Y}}}_k + X_k \theta \right) ' (I+T_k)^{-1}. \end{aligned}$$
(23)
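As a sanity check of Eqs. (21)–(23), the closed-form \({\bar{S}}(\theta )\) must agree with a Monte Carlo average of \(S(z)\) under exact draws from \(\pi _\theta \). The sketch below does this on a small synthetic instance (sizes, times and designs are arbitrary, not the simulation setting of Sect. 4.1):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, J, M = 3, 1, 4, 400_000
d = 2 * (D + 1)
theta = rng.normal(size=d)

S_closed = np.zeros(1 + d)
S_mc = np.zeros(1 + d)
for k in range(N):
    t = rng.uniform(0, 4, size=J)
    tbar = np.vstack([np.ones(J), t])                 # columns are (1, t_kj)'
    Xk = np.zeros((2, d))
    Xk[0, :D + 1] = [1.0, rng.normal()]               # Eq. (20) with D = 1
    Xk[1, D + 1:] = Xk[0, :D + 1]
    Y = rng.normal(size=J)
    Tk = tbar @ tbar.T                                # T_k = sum_j tbar tbar'
    Ybar = tbar @ Y                                   # Ybar_k = sum_j Y_kj tbar
    C = np.linalg.inv(np.eye(2) + Tk)                 # posterior covariance, Eq. (21)
    mk = C @ (Ybar + Xk @ theta)                      # posterior mean, Eq. (21)
    Sigma = C + np.outer(mk, mk)                      # Eq. (23)
    S_closed += -0.5 * np.concatenate((
        [np.trace((np.eye(2) + Tk) @ Sigma) - 2 * Ybar @ mk],
        -2 * Xk.T @ mk))                              # Eq. (22)
    Z = rng.multivariate_normal(mk, C, size=M)        # exact draws from pi_theta
    quad = np.einsum('ij,jk,ik->i', Z, np.eye(2) + Tk, Z).mean()
    S_mc += -0.5 * np.concatenate((
        [quad - 2 * Z.mean(axis=0) @ Ybar],
        -2 * Xk.T @ Z.mean(axis=0)))
```

Both vectors agree up to Monte Carlo error, confirming the closed form term by term.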

Finally, note that in this example, the function \(\ell \) is explicit and given by (up to an additive constant)

$$\begin{aligned} \begin{aligned} \ell (\theta ) =&- \frac{1}{2} \theta ' \left( \sum _{k=1}^N X_k' X_k \right) \theta \\&+ \frac{1}{2} \sum _{k=1}^N ({\bar{\mathsf {Y}}}_k + X_k \theta )' (I+T_k)^{-1} ({\bar{\mathsf {Y}}}_k + X_k \theta ). \end{aligned} \end{aligned}$$

Thus \(\ell \) is a concave function. Furthermore, in this toy example, \(\theta \mapsto \nabla \ell (\theta )\) is linear so that the Lipschitz constant L is explicit and equal to

$$\begin{aligned} L = \Vert - \sum _{k=1}^N X_k' X_k + \sum _{k=1}^N X_k' (I+T_k)^{-1} X_k \Vert _2, \end{aligned}$$
(24)

where for a matrix A, \(\Vert A \Vert _2\) denotes the spectral norm. Finally, we restricted the parameter set to \(\Theta = \{ \theta \in \mathbb {R}^{2(D+1)} \vert \Vert \theta \Vert < 10^4 \}\) in order to fulfill the theoretical boundedness assumption. The MCMC algorithm includes a projection step onto \(\Theta \) if necessary; in practice, this projection was never triggered.
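Since \(\nabla \ell \) is linear, \(\nabla \ell (\theta ) - \nabla \ell (\theta ') = H (\theta - \theta ')\) with \(H = -\sum _k X_k' X_k + \sum _k X_k'(I+T_k)^{-1} X_k\), so Eq. (24) can be checked directly. A sketch on a hypothetical small instance (none of the values below come from the simulation study):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, J = 3, 1, 4
d = 2 * (D + 1)

Xs, Ts, Ybars = [], [], []
for k in range(N):
    t = rng.uniform(0, 4, size=J)
    tbar = np.vstack([np.ones(J), t])
    Xk = np.zeros((2, d))
    Xk[0, :D + 1] = [1.0, rng.normal()]
    Xk[1, D + 1:] = Xk[0, :D + 1]
    Xs.append(Xk)
    Ts.append(tbar @ tbar.T)
    Ybars.append(tbar @ rng.normal(size=J))

# Hessian of ell (constant, since the gradient is linear) and L from Eq. (24)
H = (-sum(Xk.T @ Xk for Xk in Xs)
     + sum(Xk.T @ np.linalg.inv(np.eye(2) + Tk) @ Xk for Xk, Tk in zip(Xs, Ts)))
L = np.linalg.norm(H, 2)                # spectral norm

def grad_ell(th):
    """Gradient of the explicit log likelihood of this toy model."""
    g = -sum(Xk.T @ Xk for Xk in Xs) @ th
    for Xk, Tk, Yb in zip(Xs, Ts, Ybars):
        g += Xk.T @ np.linalg.inv(np.eye(2) + Tk) @ (Yb + Xk @ th)
    return g
```

The Lipschitz inequality Eq. (3) then holds with this L for every pair of parameters.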

A data set is simulated from this model with \(N=40, J=8, D=300\) and \(t_{kj} \in \{ 0.25,4,6,8,10,12,14,16 \}, \forall k \in \{1,\ldots ,N\}\). The design components \((X_{k1}, \ldots , X_{kD})\) (see Eq. 20) are drawn from a centered Gaussian distribution with covariance matrix \(\Gamma \) defined by \(\Gamma _{rr'} = 0.5^{\vert r - r' \vert }\) (\(r,r'=1,\ldots ,300\)). To sample the observations, we use a parameter vector \(\theta ^\star \) defined as follows: \(\theta _1^\star = \theta _{D+2}^\star =1\); the other components are set to zero, except 12 randomly selected components (6 among the components \(\{2, \ldots , D+1\}\) and 6 among the components \(\{D+3, \ldots , 2D+2 \}\)), whose values are drawn uniformly in \(\left[ 0.5,1.5\right] \) (see the bottom row in Fig. 7).

4.2 Guidelines for the implementation

In this section, we give some guidelines on the choice of the sequences \(\{\delta _n, n \ge 0\}\) and \(\{\gamma _n, n \ge 0\}\). We illustrate the results on single runs of each algorithm. We use the same random draws for all the algorithms to avoid potential differences due to the randomness of the simulations. Similar results have been observed when simulations are replicated. We refer to Sect. 4.3 for replicated simulations.

Classical sequences \(\{\delta _n, n \ge 0\}\) and \(\{\gamma _n, n \ge 0\}\) are of the form:

$$\begin{aligned} \gamma _{n+1}&= \left\{ \begin{array}{l@{\quad }l} \gamma _\star &{}\quad \text { if } n \le n_\alpha , \\ \gamma _\star (n-n_\alpha )^{-\alpha } &{} \quad \text { if } n>n_\alpha , \end{array} \right. \end{aligned}$$
(25)
$$\begin{aligned} \delta _{n+1}&= \left\{ \begin{array}{l@{\quad }l} \delta _\star &{} \quad \text { if } n \le n_\beta , \\ \delta _\star (n-n_\beta )^{-\beta } &{}\quad \text { if } n>n_\beta . \end{array} \right. \end{aligned}$$
(26)
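Both schedules share the same shape (a constant plateau, then polynomial decay), so a single helper suffices; a minimal sketch, with the function name ours:

```python
def stepsize(n, star, exponent, n_switch):
    """Eqs. (25)-(26): the value used at iteration n + 1 is constant up to
    n_switch, then decays polynomially in (n - n_switch)."""
    if n <= n_switch:
        return star
    return star * (n - n_switch) ** (-exponent)

# e.g. gamma_{n+1} with gamma_star = 0.009, alpha = 0.75, n_alpha = 0
gammas = [stepsize(n, 0.009, 0.75, 0) for n in range(6)]
```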

Impact of \(\gamma _\star \) and \(\delta _\star \) on the transient phase: The theoretical study of the asymptotic behavior of SAPG and MCPG is derived under the assumption that \(\gamma _n \le 1/L\): when \(\alpha >0\), this property holds for all n large enough. In this section, we illustrate the role of \(\gamma _n, \delta _n\) for small values of n, that is, in the transient phase of the algorithm. In Fig. 1, we display the behavior of MCPG and SAPG for two different values of the initial point \(\theta _{n=0}\): on the left, a standard initialization (\(\theta _{n=0} = (0, \ldots , 0)\)); on the right, a poor initialization, which mimics what may happen in practice in challenging numerical applications.

Fig. 1
figure 1

Estimation of the component \(\# 245\) of the vector \(\theta \) along 10,000 iterations (x-axis is in \(\log _{10}\) scale) for MCPG and SAPG. MCPG is represented in green solid line. A run of SAPG is displayed in dashed red line in the case \(\delta _\star =0.8\); in dashed-dotted yellow line in the case \(\delta _\star =0.5\) and in dotted blue line in the case \(\delta _\star =0.2\). For all runs, \(\gamma _{\star }=0.009, n_\alpha = n_\beta = 0, (\alpha , \beta ) = (0.75, 0.5)\) and \(m_n = 60\). The vertical black dashed line corresponds to the smallest n such that \(\gamma _{n} \le 1/L\). Left: standard initialization of \(\theta \) (\(\theta _{n=0}\) is the null vector). Right: poor initialization of \(\theta \). The penalty term \(\lambda \) was set to 50 for all the runs. (Color figure online)

Fig. 2
figure 2

Estimation of the component \(\# 245\) of the vector \(\theta \) along 10,000 iterations (x-axis is in \(\log _{10}\) scale). Runs of SAPG with \((\gamma _\star , \delta _\star )=(0.015,0.5), n_\alpha = n_\beta =0\) and \(m_n=60\) and different values of \((\alpha , \beta )\). (Left) \(\alpha = 1\) and \(\beta =0.9\) (green solid line), \(\beta =0.6\) (red dashed line), \(\beta =0.3\) (yellow dotted line), \(\beta =0\) (blue dash-dotted line). (Middle) \(\alpha = 0.8\) and \(\beta =0.5\) (green solid line), \(\beta =0.3\) (red dashed line), \(\beta =0.1\) (yellow dotted line), \(\beta =0\) (blue dash-dotted line). (Right) \(\alpha = 0.6\) and \(\beta =0.2\) (green solid line), \(\beta =0.1\) (yellow dotted line), \(\beta =0\) (blue dash-dotted line). The penalty term \(\lambda \) was set to 50 for all the runs. (Color figure online)

On both plots, we indicate by a vertical line the smallest n such that \(\gamma _{n} \le 1/L\)—remember that in this example, L is explicit (see Eq. 24). The plots show the estimation of component #245, as a function of the number of iterations n. In all cases, \(n_\alpha = n_\beta =0, \alpha = 0.75, m_n= 60\), and for SAPG, \(\beta = 0.5\). The dotted blue curve displays a run of SAPG when \((\gamma _\star ,\delta _\star ) = (0.009,0.2)\); the dashed-dotted yellow curve displays a run of SAPG when \((\gamma _\star ,\delta _\star ) = (0.009,0.5)\); the dashed red curve displays a run of SAPG when \((\gamma _\star ,\delta _\star ) =(0.009,0.8)\); the green solid curve displays a run of MCPG when \(\gamma _\star = 0.009\).

The stability of MCPG during the transient phase depends crucially on the first values of the sequence \(\{\gamma _n, n \ge 0\}\). When n is large enough that \(\gamma _{n} \le 1/L\) (after the vertical line), MCPG becomes more stable and its path smoother. For SAPG, a small value of \(\delta _\star \) gives the initial point \(\theta _{n=0}\) a long-lasting influence: when this initial point is poorly chosen, a small value of \(\delta _\star \) delays the convergence of SAPG. A value of \(\delta _\star \) around 0.5 is a good compromise.

Role of \({\alpha }\) and \({\beta }\): Figure 2 displays the behavior of SAPG for different values of \(\alpha \) and \(\beta \), with \((\gamma _\star , \delta _\star )=(0.015, 0.5), n_\alpha = n_\beta =0\) and \(m_n=60\). The plots show that the larger \(\alpha \) is, the longer the transient phase lasts. We therefore recommend setting \(\alpha \) close to 0.6. The parameter \(\beta \) seems to have an impact only when \(\alpha \) is close to 1. We therefore recommend keeping \(\delta _n\) constant during the transient phase (\(n_\beta >0\)) and then decreasing it rapidly in the convergence phase.

Random stepsize sequence \(\{\gamma _n, n \ge 0\}\): The convergence of the SAPG algorithm can suffer from scale differences between the components of the parameter when the same stepsize sequence \(\{\gamma _n, n \ge 0\}\) is applied to each component of \(\theta _n\).

Ideally, each component of \(\theta _n\) should have its own \(\gamma _n\) value, adapted to its scale, but hand-tuning a sequence that ensures fast and stable convergence of the algorithm can be time-consuming. As an alternative, we suggest using a matrix-valued random sequence \(\{\Gamma ^n, n\ge 0\}\) and replacing the update rule of SAPG by

$$\begin{aligned} (\theta _{n+1})_i = {\text {Prox}}_{ \Gamma ^{n+1}_{ii} g} \left( (\theta _n)_i + \Gamma ^{n+1}_{ii} \left( \nabla \phi (\theta _n) + \varPsi (\theta _n) S_{n+1}^\mathrm {sa}\right) _i \right) . \end{aligned}$$

We propose to define the matrix \(\Gamma ^{n+1}\) as a diagonal matrix with entries \( \Gamma ^{n+1}_{ii}\) depending on \(H^n_{ii}\), where \(H^n\) is an approximation of the Hessian of \(-\ell (\theta )\) (we give an example of such an approximation in Sect. 5). Through numerical experiments, we observed that \(H^n\) converges asymptotically. Hence, to ensure a stepsize sequence decaying like \(O(n^{-\alpha })\) asymptotically, we propose the following definition of the random sequence:

$$\begin{aligned} \Gamma ^{n+1}_{ii} = \left\{ \begin{array}{l@{\quad }l} 1/\vert H^n_{ii}\vert &{}\quad \text { if } n \le n_{\alpha }, \\ \left( (n-n_{\alpha })^{\alpha } \vert H^n_{ii}\vert \right) ^{-1} &{}\quad \text { if } n>n_{\alpha }. \end{array} \right. \end{aligned}$$
(27)
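For the lasso-type penalties used here, the componentwise update with the diagonal matrix \(\Gamma ^{n+1}\) of Eq. (27) reduces to soft-thresholding with a per-coordinate threshold. A sketch, assuming the diagonal of the Hessian approximation \(H^n\) is supplied as a vector (the function names and the handling of unpenalized indices are ours):

```python
import numpy as np

def gamma_diag(n, H_diag, alpha, n_alpha):
    """Diagonal stepsizes of Eq. (27); H_diag stands for diag(H^n)."""
    g = 1.0 / np.abs(H_diag)
    if n > n_alpha:
        g = g * (n - n_alpha) ** (-alpha)
    return g

def precond_update(theta, drift, H_diag, lam, free_idx, n, alpha, n_alpha):
    """One preconditioned proximal update; `drift` stands for the vector
    grad phi(theta_n) + Psi(theta_n) S^sa_{n+1}, and free_idx lists the
    unpenalized coordinates."""
    g = gamma_diag(n, H_diag, alpha, n_alpha)
    z = theta + g * drift
    out = np.sign(z) * np.maximum(np.abs(z) - g * lam, 0.0)  # prox, threshold g*lam
    out[free_idx] = z[free_idx]
    return out
```

Each coordinate is thus thresholded at its own level \(\Gamma ^{n+1}_{ii}\lambda \), which is exactly what compensates scale differences between components.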
Fig. 3
figure 3

Evolution of the Monte Carlo approximation of \(\Vert S_{n+1} - \bar{S}(\theta _n)\Vert _{2}\) with iterations n for algorithms MCPG (solid green), SAEM-pen (solid yellow), SAPG (dashed blue), implemented with \((\alpha ,\beta ) = (0.9, 0.4)\) [left], \((\alpha , \beta ) = (0.6, 0.1)\) [center] and \((\alpha , \beta ) = (0.5, 0.5)\) [right]; for MCPG and SAPG, the batch size is fixed \(m_n = 60\). (Color figure online)

4.3 Long-time behavior of the algorithm

In this section, we numerically illustrate the theoretical results on the long-term convergence of the algorithms MCPG, SAPG, SAEM-pen (i.e., Algorithm 3 applied with \(S_{n+1} = S_{n+1}^\mathrm {sa}\)) and EM-pen on the toy model. In this example, the exact algorithm EM-pen (see Eq. 14) applies: the quantity \({\bar{S}}(\theta )\) is an explicit expectation under a Gaussian distribution \(\pi _\theta \). We therefore use this example (i) to illustrate the convergence of the three stochastic methods to the same limit point as EM-pen, (ii) to compare the two approximations \(S_{n+1}^\mathrm {mc}\) and \(S_{n+1}^\mathrm {sa}\) of \({\bar{S}}(\theta _n)\) in a GEM-pen approach, and (iii) to study the effect of relaxing the M-step by comparing the GEM-pen and EM-pen approaches, namely SAPG and SAEM-pen.

The sequences \(\{\gamma _n, n \ge 0 \}\) and \(\{\delta _n, n \ge 0 \}\) are defined as follows: \((\gamma _\star , \delta _\star ) = (0.004, 0.5)\), and \(n_\alpha = n_\beta = 0\); three different pairs \((\alpha , \beta )\) are considered: \((\alpha , \beta )=(0.9, 0.4), (\alpha , \beta )= (0.6,0.1)\), and \((\alpha , \beta )=(0.5,0.5)\). The algorithms are implemented with a fixed batch size \(m_n=60\). 100 independent runs of each algorithm are performed. For the penalty term, we set \(\lambda =50\). In MCPG, SAPG and SAEM-pen, the simulation step at iteration \((n+1)\) relies on exact sampling from \(\pi _{\theta _n}\)—see Eq. (21); therefore, in this toy example, the Monte Carlo approximation of \({\bar{S}}(\theta _n)\) is unbiased.

In Fig. 3, for the three algorithms MCPG, SAPG and SAEM-pen, the evolution of an approximation of \(\Vert S_{n+1} - {\bar{S}}(\theta _n)\Vert _{2}\) with iterations n is plotted, where, for a random variable \(U, \Vert U\Vert _2 :=\sqrt{{\mathbb {E}}\left[ \Vert U \Vert ^2 \right] }\). This \(L_2\)-norm is approximated by a Monte Carlo sum computed from 100 independent realizations of \(S_{n+1}\); here, \({\bar{S}}(\theta _n)\) is explicit (see Eq. 22). SAEM-pen and SAPG behave similarly; the \(L_2\)-norm converges to 0, and the convergence is slower when \((\alpha ,\beta )=(0.6,0.1)\)—this plot illustrates the result stated in Proposition 2, Sect. 3. This convergence does not hold for MCPG because the size \(m_n\) of the Monte Carlo approximation is kept fixed.

We compared the limiting vectors \(\lim _n \theta _n\) obtained by each algorithm over the 100 independent runs. They are all equal, and the limiting vector is also the limiting value \(\theta _\infty \) of the EM-pen algorithm. In order to discuss the rate of convergence, we show the behavior of the algorithms when estimating the component \(\# 245\) of the regression coefficients; this component was chosen among the non-null components of \(\theta _\infty \). Figure 4 shows the boxplot of 100 estimations of the component \(\# 245\) of the vector \(\theta _n\), for \(n = 5, 25, 50, 500\), obtained by the algorithms MCPG, SAPG and SAEM-pen with \((\alpha , \beta ) = (0.9, 0.4)\). Here, SAPG and MCPG behave similarly, with a smaller variability among the 100 runs than SAEM-pen. SAEM-pen converges faster than SAPG and MCPG, which was expected since SAEM-pen is a stochastic perturbation of the EM-pen algorithm while SAPG and MCPG are stochastic perturbations of the GEM-pen algorithm. Figure 5 shows the boxplot of 100 estimations by MCPG, SAPG and SAEM-pen of the component \(\# 245\) after \(n=500\) iterations, for different values of the parameters \(\alpha \) and \(\beta \). The three algorithms give similar final estimates for the three settings of \(\alpha \) and \(\beta \). This is due to the fact that, with \(n_\alpha = n_\beta = 200\), the algorithms have already reached the convergence phase when \(n=200\), which allows them to converge quickly toward the limit points for \(n>200\).

Fig. 4
figure 4

Estimation of the component \(\# 245\) of the vector \(\theta _n\) when \(n=5, 25, 50, 500\). MCPG (green, left), SAPG (blue, middle) and SAEM-pen (yellow, right) are implemented with \((\alpha ,\beta ) = (0.9, 0.4)\); for MCPG and SAPG, the batch size is fixed \(m_n = 60\). Each boxplot is computed from 100 independent runs. Black dashed line corresponds to the value obtained with EM-pen algorithm at iteration 500. (Color figure online)

Fig. 5
figure 5

Estimation of the component \(\# 245\) of the vector \(\theta _n\) when \(n=500\). MCPG (green), SAPG (blue) and SAEM-pen (yellow) are implemented with \((\alpha ,\beta ) = (0.9, 0.4)\) [left], \((\alpha , \beta ) = (0.6, 0.1)\) [center] and \((\alpha , \beta ) = (0.5, 0.5)\) [right]; for MCPG and SAPG, the batch size is fixed \(m_n = 60\). Each boxplot is computed from 100 independent runs. Black dashed line corresponds to the value obtained with EM-pen algorithm at iteration 500. (Color figure online)

Figure 6 shows the convergence of a Monte Carlo approximation of \(n \mapsto {\mathbb {E}}\left[ F(\theta _n)\right] \), based on 100 independent estimations \(\theta _n\), obtained by four different algorithms: EM-pen, MCPG, SAPG and SAEM-pen, run with \((\alpha , \beta )=(0.9,0.4)\) and \(m_n = 60\). Here again, all the algorithms converge to the same value, and EM-pen and SAEM-pen converge faster than MCPG and SAPG. We observe that the path of SAPG is far smoother than the path of MCPG.

Fig. 6
figure 6

Monte Carlo approximation of \({\mathbb {E}}\left[ F(\theta _n)\right] \) (based on 100 independent samples) along the iterations n, for algorithms EM-pen (solid black), MCPG (solid green), SAEM-pen (dash-dotted yellow), SAPG (dashed blue), implemented with \((\alpha ,\beta )=(0.9,0.4)\) and \(m_n=60\). (Color figure online)

Finally, Fig. 7 shows the support of the vector \(\lim _n \theta _n\) (where the components \(\theta _1\) and \(\theta _{302}\) are removed) estimated by MCPG, SAPG, SAEM-pen and EM-pen (the estimated support is the same for the four algorithms). For each component, the frequency, among 100 independent runs, with which it belongs to the support of the limit value \(\lim _n \theta _n\) is displayed. The algorithms are implemented with \((\alpha , \beta ) = (0.9,0.4)\) and \(m_n = 60\). For all algorithms, we observe that most of the non-null components of \(\lim _n \theta _n\) are non-null components of \(\theta ^\star \). Note also that the stochastic algorithms MCPG, SAPG and SAEM-pen converge to the same vector as EM-pen.

Fig. 7
figure 7

(Top) Support of \(\lim _n \theta _n\) estimated by all the algorithms MCPG, SAPG, SAEM-pen and EM-pen over 100 runs for \((\alpha ,\beta )=(0.9,0.4)\) and \(m_n = 60\). (Bottom) The support of \(\theta ^\star \) used to produce the observations. On both rows, the components 1 and \(D+2\) are not displayed

5 Inference in nonlinear mixed models for pharmacokinetic data

In this section, SAPG is applied to solve a more challenging problem. The objective is to illustrate the algorithm in cases that are not covered by the theory. The application is in pharmacokinetic analysis, with nonlinear mixed effect models (NLMEM); in this application, the penalized maximum-likelihood inference is usually solved by the SAEM-pen algorithm, possibly combined with an approximation of the M-step when it is non-explicit. This section also provides a numerical comparison of SAPG and SAEM-pen. Both algorithms have a simulation step; in this more challenging application, it will rely on a Markov chain Monte Carlo (MCMC) sampler—see Sect. 5.1. Therefore, for both algorithms, \({\bar{S}}(\theta )\) is approximated by a biased Monte Carlo sum.

We start with a presentation of the statistical analysis and its translation into an optimization problem; we then propose a modification of the SAPG by allowing a random choice of the stepsize sequence \(\{\gamma _n, n \ge 0\}\), to improve the numerical properties of the algorithm. We conclude the section by a comparison of the methods on a pharmacokinetic real data set.

5.1 The nonlinear mixed effect model

Pharmacokinetic data are observed over time for N patients. Let \(\mathsf {Y}_{k}\) be the vector of the J drug concentrations observed at times \(t_{kj}\) (\(j \in \{1,\ldots ,J \}\)) for the kth patient (\(k \in \{1,\ldots ,N\}\)). The kinetics of the drug concentration is described by a nonlinear pharmacokinetic regression model f, a function of time t and of unobserved pharmacokinetic parameters \(Z^{(k)}\). These parameters are typically the rates of absorption or elimination of the drug by the body; an example is detailed below. The variability among patients is modeled by the randomness of the hidden variables \(Z^{(k)}\). These pharmacokinetic parameters may be influenced by covariates such as age and gender, but also by genomic variables. Among these high-dimensional factors, only a few are correlated with \(Z^{(k)}\). Their selection can thus be performed by optimizing the likelihood with a sparsity-inducing penalty, an optimization problem of the form Eq. (1). However, the likelihood is generally not concave; through this example, we therefore explore cases beyond the framework in which we are able to prove the convergence of MCPG and SAPG (see Sect. 3).

Let us now detail the model and the optimization problem. The mixed model is defined as

$$\begin{aligned} \mathsf {Y}_{kj} = f(t_{kj},Z^{(k)}) + \epsilon _{kj} , \quad \epsilon _{kj} \sim {\mathcal {N}}(0,\sigma ^2) \hbox { (iid)}, \end{aligned}$$
(28)

where the measurement errors \(\epsilon _{kj}\) are centered, independent and identically normally distributed with variance \(\sigma ^2\). The individual parameter \(Z^{(k)}\) of the kth subject is an R-dimensional random vector, independent of \(\epsilon _{kj}\). In a high-dimensional context, the \(Z^{(k)}\)’s depend on covariates (typically genomic variables) gathered in a design matrix \(X_k\in \mathbb {R}^{R \times (D+1)R}\). The distribution of \(Z^{(k)}\) is usually assumed to be normal with independent components

$$\begin{aligned} Z^{(k)} {\mathop {\sim }\limits ^{\mathrm{ind}}} {\mathcal {N}}_R(X_{k} \mu ,\varOmega ) \end{aligned}$$
(29)

where \(\mu \in \mathbb {R}^{(D+1)R}\) is the mean parameter vector and \(\varOmega \) is the covariance matrix of the random parameters \(Z^{(k)}\), assumed to be diagonal. The unknown parameters are \(\theta = \left( \mu , \varOmega _{11}, \ldots , \varOmega _{RR}, \sigma ^2 \right) \in {\mathbb {R}}^{R(D+1)} \times \left( 0,+\infty \right) ^{R+1}\).

A typical function f is the two-compartment pharmacokinetic model with first-order absorption, describing the distribution of a drug administered orally: the drug is absorbed from the gut and reaches the blood circulation, from which it can spread to peripheral tissues. This model corresponds to \(f= \frac{A_\mathrm{c}}{V_\mathrm{c}}\) with \(A_\mathrm{c}\) defined by

$$\begin{aligned} \frac{\mathrm{d}A_\mathrm{d}}{\mathrm{d}t}= & {} -k_\mathrm{a} \, A_\mathrm{d},\nonumber \\ \frac{\mathrm{d}A_\mathrm{c}}{\mathrm{d}t}= & {} k_\mathrm{a} \, A_\mathrm{d} + \frac{Q}{V_\mathrm{p}}A_\mathrm{p} - \frac{Q}{V_\mathrm{c}}A_\mathrm{c}- \frac{Cl}{V_\mathrm{c}}A_\mathrm{c} , \nonumber \\ \frac{\mathrm{d}A_\mathrm{p}}{\mathrm{d}t}= & {} \frac{Q}{V_\mathrm{c}}A_\mathrm{c} - \frac{Q}{V_\mathrm{p}}A_\mathrm{p}, \end{aligned}$$
(30)

with \(A_\mathrm{d}(0) = \mathrm{Dose}, A_\mathrm{c}(0) = 0, A_\mathrm{p}(0) = 0\), and where \(A_\mathrm{d}, A_\mathrm{c}, A_\mathrm{p}\) are the amounts of drug in the depot, central and peripheral compartments, respectively; \(V_\mathrm{c}\) and \(V_\mathrm{p}\) are the volumes of the central and peripheral compartments, respectively; and Q and Cl are the intercompartmental and global elimination clearances, respectively. To ensure positivity of the parameters, the hidden vector is

$$\begin{aligned} z=(\log (V_\mathrm{c}), \log (V_\mathrm{p}), \log (Q), \log (Cl), \log (k_\mathrm{a})). \end{aligned}$$
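A sketch of the corresponding forward model, integrating Eq. (30) and returning the observed concentration, could read as follows; the fixed-step RK4 integrator and its step size are illustrative choices, not part of the model:

```python
import numpy as np

def pk_concentration(t_obs, dose, z, dt=0.001):
    """Concentration f = A_c / V_c obtained by integrating the ODE system
    of Eq. (30) with a fixed-step RK4 scheme (a sketch; an adaptive or
    stiff solver may be preferable in practice).
    z = (log Vc, log Vp, log Q, log Cl, log ka), ensuring positivity."""
    Vc, Vp, Q, Cl, ka = np.exp(z)

    def rhs(A):
        Ad, Ac, Ap = A
        return np.array([-ka * Ad,
                         ka * Ad + (Q / Vp) * Ap - (Q / Vc) * Ac - (Cl / Vc) * Ac,
                         (Q / Vc) * Ac - (Q / Vp) * Ap])

    A = np.array([dose, 0.0, 0.0])   # initial amounts (depot, central, peripheral)
    t, out = 0.0, []
    for t_target in np.sort(np.asarray(t_obs, dtype=float)):
        while t < t_target:
            h = min(dt, t_target - t)
            k1 = rhs(A)
            k2 = rhs(A + 0.5 * h * k1)
            k3 = rhs(A + 0.5 * h * k2)
            k4 = rhs(A + h * k3)
            A = A + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
            t += h
        out.append(A[1] / Vc)
    return np.array(out)
```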

It is easy to show that the model described by Eqs. (28)–(29) belongs to the curved exponential family (see Eq. 4) with minimal sufficient statistics:

$$\begin{aligned}&S_{1k}(z) = z^{(k)}, \quad S_{2}(z) = \sum _{k=1}^N z^{(k)} \, z^{(k)'}, \\&S_{3}(z) = \sum _{k=1}^N\sum _{j=1}^{J} (\mathsf {Y}_{kj}-f(t_{kj}, z^{(k)}))^2; \\&\psi _{1k}(\theta ) = ( X_{k} \mu )'\varOmega ^{-1}, \quad \psi _2(\theta ) = -\frac{1}{2}\varOmega ^{-1}, \\&\psi _3(\theta ) = - \frac{1}{2\sigma ^2}, \end{aligned}$$

and \(S(z) :=\mathrm {Vect}\left( S_{11}(z), \ldots , S_{1N}(z), S_2(z), S_3(z) \right) , \psi :=\mathrm {Vect}\left( \psi _{11}, \ldots , \psi _{1N},\psi _2, \psi _3\right) \). The function \(\phi \) is given by \(\phi (\theta ) = -J N \log (\sigma ) - \frac{N}{2}\log (\vert \varOmega \vert ) - \frac{1}{2}\sum _{k} ( X_{k} \mu )'\varOmega ^{-1}( X_{k} \mu )\). The selection of the genomic variables that influence the coordinates of \(Z^{(k)}\) can be performed by optimizing the log likelihood penalized by the function \(g(\theta )= \lambda \Vert \mu \Vert _{1}\), the \(L_1\) norm of \(\mu \), with \(\lambda \) a regularization parameter.
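As a concrete reading of these formulas, the statistics can be evaluated from a draw of the hidden vectors; a minimal sketch, in which the data layout and the structural function f are schematic placeholders:

```python
import numpy as np

def sufficient_statistics(Z, Y, t, f):
    """Minimal sufficient statistics of the complete model:
    S_1k = z^(k); S_2 = sum_k z^(k) z^(k)'; S_3 = residual sum of squares.
    Z: (N, R) hidden vectors, Y: (N, J) observations, t: (J,) sampling
    times, f: structural model f(t, z) returning the (J,) predicted curve."""
    N = Z.shape[0]
    S1 = [Z[k] for k in range(N)]                                  # S_{1k}(z)
    S2 = sum(np.outer(Z[k], Z[k]) for k in range(N))               # S_2(z)
    S3 = sum(((Y[k] - f(t, Z[k])) ** 2).sum() for k in range(N))   # S_3(z)
    return S1, S2, S3
```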

However, this estimator is not invariant under a scaling transformation (i.e., \({{\tilde{Z}}}^{(k)} = bZ^{(k)} \text{, } \tilde{\mu }= b\mu \text{ and } {{\tilde{\varOmega }}}_{rr}^{1/2} = b\varOmega _{rr}^{1/2}\)) (see, e.g., Lehmann and Casella 2006). In our high-dimensional experiments, the scale of the hidden variables has a non-negligible influence on the selection of the support. To be more precise, let us denote, for \(r \in \{1, \ldots , R\}\),

$$\begin{aligned}\mu _{(r)} :=(\mu _{(r-1)(D+1)+1}, \ldots , \mu _{r(D+1)})\end{aligned}$$

the coordinates corresponding to the rth pharmacokinetic parameter of the function f. When the variance \(\varOmega _{rr}\) of the random parameters \(Z_r^{(k)}\) is low, the algorithms tend to select too many covariates. This phenomenon is amplified when the number of subjects is small, as the random effect variances are then more difficult to estimate. A solution is to consider the following penalty

$$\begin{aligned} \lambda \sum _{r=1}^{R} \varOmega _{rr}^{-\frac{1}{2} } \Vert \mu _{(r)} \Vert _{1} , \end{aligned}$$

which makes the estimator invariant under scaling transformations. It was initially proposed by Städler et al. (2010) to estimate the regression coefficients and the residual error variance in a mixture of penalized regression models. However, the resulting optimization problem is difficult to solve directly because the variance \(\varOmega _{rr}\) of the random effect appears in the penalty term. Therefore, we propose a new parameterization

$$\begin{aligned} {\tilde{\mu }}_{(r)} :=\mu _{(r)}\varOmega _{rr} ^{-\frac{1}{2} }, \quad \Sigma _{rr} :=\varOmega _{rr} ^{-\frac{1}{2} } \end{aligned}$$

and \({\tilde{\theta }} :=\{ {\tilde{\mu }}, \Sigma _{11}, \ldots , \Sigma _{RR}, \sigma ^2 \} \in \mathbb {R}^{R(D+1)} \times \left( 0,+\infty \right) ^{R+1}\). Then, the optimization problem is the following:

$$\begin{aligned} \underset{{\tilde{\theta }}}{{\text {Argmax}}} \left( \ell ({{\tilde{\theta }}}) - g({\tilde{\theta }}) \right) , \quad \hbox { with } g({\tilde{\theta }})= \lambda \Vert \tilde{\mu } \Vert _{1}. \end{aligned}$$
(31)
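The change of variables above, and the scale invariance it provides, can be sketched as follows (array shapes are illustrative):

```python
import numpy as np

def reparameterize(mu, omega_diag, D):
    """Map (mu, Omega) to (mu_tilde, Sigma) as in the text:
    mu_tilde_(r) = mu_(r) * Omega_rr^{-1/2}, Sigma_rr = Omega_rr^{-1/2}.
    mu: vector of length R*(D+1); omega_diag: the R diagonal entries of Omega."""
    R = omega_diag.size
    Sigma = omega_diag ** -0.5
    mu_tilde = (mu.reshape(R, D + 1) * Sigma[:, None]).ravel()
    return mu_tilde, Sigma
```

Rescaling the hidden variables by b (so \(\mu \rightarrow b\mu \), \(\varOmega _{rr}^{1/2} \rightarrow b\varOmega _{rr}^{1/2}\)) leaves \({\tilde{\mu }}\), and hence the penalty \(\lambda \Vert {\tilde{\mu }}\Vert _1\), unchanged.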

This problem can be solved using MCPG, SAPG or SAEM-pen algorithms. Indeed, the complete log likelihood is now—up to an additive constant—

$$\begin{aligned} \begin{aligned} \log p(\mathsf {Y},Z ; {\tilde{\theta }})=&- J N \log (\sigma ) \\&-\frac{1}{2}\sum _{k=1}^N \sum _{j=1}^J \frac{ \left( Y_{kj} - f(t_{kj}, Z^{(k)} )\right) ^{2} }{\sigma ^2} \\&+ N\log (\vert \Sigma \vert ) - \frac{1}{2}\sum _{k=1}^N \Vert \Sigma Z^{(k)} - X_{k} \tilde{\mu }\Vert ^2 \end{aligned} \end{aligned}$$

It is again a complete likelihood from the exponential family, with the statistic S unchanged and the functions \(\phi \) and \(\psi \) given by—up to an additive constant—

$$\begin{aligned}&\phi ({\tilde{\theta }}) = -J N \log (\sigma ) + N\log (\vert \Sigma \vert ) - \frac{1}{2}\sum _{k=1}^N\Vert X_{k} {\tilde{\mu }} \Vert ^2, \\&\psi _{1k}({\tilde{\theta }}) = \Sigma ( X_{k} {\tilde{\mu }})^{t}, \quad \psi _2({\tilde{\theta }}) = -\frac{1}{2}\Sigma ^{2} , \quad \psi _3({\tilde{\theta }}) = - \frac{1}{2\sigma ^2}. \end{aligned}$$

With these definitions of \(\phi , \psi \) and g, the M-step of SAEM-pen amounts to computing the optimum of a convex function, which is solved numerically by a call to the cyclical coordinate descent implemented in the R package glmnet (Friedman et al. 2010).
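As an illustration of the underlying idea (not the exact glmnet objective, which also handles observation weights and standardization), cyclical coordinate descent on a generic lasso subproblem reduces to soft-thresholding one coordinate at a time:

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator, the proximal map of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(A, b, lam, n_sweeps=200):
    """Cyclical coordinate descent for
        argmin_w 0.5 * ||A w - b||^2 + lam * ||w||_1,
    the generic subproblem behind glmnet-style solvers (a sketch:
    no convergence check, and columns of A are assumed nonzero)."""
    n, d = A.shape
    w = np.zeros(d)
    r = b.copy()                      # residual b - A w
    col_sq = (A ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(d):
            r += A[:, j] * w[j]       # remove coordinate j from the fit
            rho = A[:, j] @ r
            w[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= A[:, j] * w[j]       # put the updated coordinate back
    return w
```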

MCMC sampler In the context of nonlinear mixed models, simulation from \(\pi _{\theta _n}\mathrm {d}\nu \) cannot be performed directly as in the toy example. We therefore use an MCMC sampler based on a Metropolis–Hastings algorithm to perform the simulation step. Two proposal kernels are used successively during the iterations of the Metropolis–Hastings algorithm. The first kernel corresponds to the prior distribution of \(\Sigma Z^{(k)}\), that is, the Gaussian distribution \({\mathcal {N}}(X_{k} {\tilde{\mu }}_n , I)\). The second kernel corresponds to a succession of R one-dimensional random walks that update each component of \(Z^{(k)}\) in turn. The variance of each random walk is automatically tuned to reach a target acceptance ratio, following the principle of adaptive MCMC algorithms (Andrieu and Thoms 2008).
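The adaptive tuning of a random-walk variance can be sketched as follows; the Robbins-Monro adaptation of the proposal log-scale, the target ratio 0.44 and the decay exponent are illustrative choices, not taken from the paper:

```python
import numpy as np

def adaptive_rw_metropolis(logpi, z0, n_iter=20000, target=0.44, kappa=0.6):
    """Random-walk Metropolis for one component, with the proposal
    log-scale adapted toward a target acceptance ratio (adaptive MCMC
    in the spirit of Andrieu and Thoms 2008)."""
    rng = np.random.default_rng(0)
    z, log_s = z0, 0.0
    chain = np.empty(n_iter)
    for n in range(n_iter):
        prop = z + np.exp(log_s) * rng.standard_normal()
        log_alpha = logpi(prop) - logpi(z)
        if np.log(rng.uniform()) < log_alpha:
            z = prop
        # Robbins-Monro step: push the acceptance ratio toward `target`
        acc = np.exp(min(0.0, log_alpha))
        log_s += (acc - target) / (n + 1) ** kappa
        chain[n] = z
    return chain, np.exp(log_s)
```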

Adaptive random stepsize sequences In the context of NLMEM, numerical experiments reveal that choosing a deterministic sequence \(\{\gamma _n, n\ge 0\}\) that achieves a fast convergence of the SAPG algorithm can be difficult. Indeed, the parameters to estimate are on different scales. For example, random effect and residual variances are constrained to be positive; some of them are close to zero, some are not. As explained in Sect. 4.2, an alternative is to implement a matrix-valued random sequence \(\{\Gamma ^n, n\ge 0\}\). The gradient and the Hessian of the likelihood \(\ell (\theta )\) can be approximated by stochastic approximation using the Louis principle (see McLachlan and Krishnan 2008, Chapter 4). Let \(H^n\) denote the stochastic approximation of the Hessian obtained at iteration n, as explained by Samson et al. (2007). Note that no additional random samples are required to obtain this approximation. Along the iterations, each diagonal entry of the matrix \(H^{n}\) converges; the limiting value can be seen as a simple way to automatically tune a good, parameter-specific \(\gamma _{\star }\). The entries \(\Gamma ^{n+1}_{ii}\) are then defined by Eq. (27).
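A sketch of this construction is given below; since Eq. (27) is not reproduced in this section, the post-burn-in decay shown here is only indicative:

```python
import numpy as np

def sa_hessian_update(H, H_draw, delta):
    """Stochastic approximation (Robbins-Monro) update of the Hessian
    estimate: H^{n+1} = H^n + delta_{n+1} (H_draw - H^n)."""
    return H + delta * (H_draw - H)

def adaptive_gamma(H_diag, n, n0, beta=0.499, floor=1e-8):
    """Parameter-specific step sizes built from the Hessian diagonal:
    gamma_star_i ~ 1 / |H_ii|, kept constant up to iteration n0, then
    decreased polynomially (indicative schedule standing in for Eq. (27))."""
    gamma_star = 1.0 / np.maximum(np.abs(H_diag), floor)
    if n <= n0:
        return gamma_star
    return gamma_star / (n - n0) ** beta
```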

5.2 Simulated data set

The convergence of the corresponding algorithms is illustrated on simulated data. Data are generated with the model defined by Eq. (30) and \(N = 40, J = 12, D = 300\). The design matrix \(X_k\) is defined by Eq. (20), with components \((X_{k1}, \ldots , X_{kD})\) drawn from \({\mathcal {N}}(0,\Gamma )\) with \(\Gamma _{ii'} = 0.5^{\vert i - i' \vert }\) (\(i,i'=1,\ldots ,300\)). Parameter values are

$$\begin{aligned} \begin{aligned}&[\mu _1, \mu _{1+(D+1)} , \mu _{1+2(D+1)} , \mu _{1+3(D+1)} , \mu _{1+4(D+1)} ] \\&\quad = [ 6.61 , 6.96, 5.77, 5.42 ,-0.51]; \end{aligned} \end{aligned}$$

the other components are set to zero, except \(\mu _{4}\) and \(\mu _{912}\) that are set to 1. The matrix \(\varOmega \) is diagonal with diagonal elements equal to (0.16, 0.16, 0.16, 0.04, 0.04).
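The covariate part of this simulated design can be generated as follows (a sketch; the seed and sizes are arbitrary):

```python
import numpy as np

def ar1_design(N, D, rho=0.5, seed=0):
    """Draw N covariate rows (X_k1, ..., X_kD) from N(0, Gamma) with
    Gamma_{ii'} = rho**|i - i'|, using a Cholesky factor of Gamma."""
    idx = np.arange(D)
    Gamma = rho ** np.abs(idx[:, None] - idx[None, :])
    L = np.linalg.cholesky(Gamma)
    rng = np.random.default_rng(seed)
    return rng.standard_normal((N, D)) @ L.T
```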

The penalty function is set to

$$\begin{aligned} g({\tilde{\theta }}) :=\lambda \sum _{\ell \ne \{1+r(D+1), r=0, \ldots , 4 \}} |{\tilde{\mu }}_\ell |, \end{aligned}$$
(32)

only the parameters corresponding to a covariate effect being penalized. The optimization problem Eq. (1) with regularization parameter \(\lambda = 190\) is solved on this dataset with SAEM-pen and SAPG; we run SAPG with the random sequence \(\{\Gamma ^n, n \ge 0\}\) described above (see Eq. 27) with \(n_0 = 9500\). For both algorithms, the stochastic approximation step size was set to:

$$\begin{aligned} \delta _{n+1} = \left\{ \begin{array}{l@{\quad }l} 0.5 &{} \quad \text { if } n\le n_0 \\ \frac{0.5}{(n - n_0)^ {\beta }} &{} \quad \text { if } n>n_0 \end{array} \right. \end{aligned}$$
(33)

We set \(\alpha = 0.75\) and \(\beta = 0.499\). Figure 8 shows the convergence of SAEM-pen and of three parameterizations of SAPG: (i) a version with \(\gamma ^\star =0.005\) for all the components of \(\theta \), (ii) a version with \(\gamma ^\star =0.005\) for \(\tilde{\mu }\), \(\gamma ^\star =0.0005\) for \(\Sigma \) and \(\gamma ^\star =0.03\) for \(\sigma \), and (iii) a version with adaptive random step sizes. For the four algorithms, all the parameters corresponding to a covariate effect are estimated to zero except the two components \(\mu _{4}\) and \(\mu _{912}\). The version of SAPG with the same \(\gamma ^\star \) for all components is the one that converges the most slowly. When \(\gamma ^\star \) is tuned differently according to the type of parameter, the convergence of SAPG is accelerated. The SAEM-pen algorithm and SAPG with adaptive random step sizes have similarly fast convergence profiles.
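The step-size schedule of Eq. (33) can be written as:

```python
def sa_stepsize(n, n0=9500, beta=0.499, delta0=0.5):
    """Stochastic approximation step size of Eq. (33): constant during
    the first n0 iterations, then polynomially decreasing."""
    return delta0 if n <= n0 else delta0 / (n - n0) ** beta
```

With \(\beta < 1/2\) excluded only marginally (\(\beta = 0.499\)), the decay is close to the slowest rate usually allowed for stochastic approximation.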

Figure 9 presents the evolution of four entries of the matrix \(\Gamma ^n\) along the iterations of SAPG, corresponding to the components \(\tilde{\mu }_{904}, \tilde{\mu }_{912}, \Sigma _{44}\) and \(\sigma \). Note that they are not on the same scale. They vary during the first iterations and converge to limiting values before iteration \(n_0=9500\). The step sizes then decrease to 0, following the definition given in Eq. (27).

Fig. 8

Path of a run of SAEM-pen [left column] and of three different parameterizations of SAPG: (i) with \(\gamma ^\star =0.005\) for all the components of \(\theta \) [middle left column], (ii) with \(\gamma ^\star =0.005\) for \(\tilde{\mu }\), \(\gamma ^\star =0.0005\) for \(\Sigma \) and \(\gamma ^\star =0.03\) for \(\sigma \) [middle right column] and (iii) with a random sequence \(\{\Gamma ^n,n \ge 0 \}\) [right column]. For each algorithm: estimation of the standard deviation of the residual error \(\sigma \) [third row]; the variances of the \(Z^{(k)}\)’s, \(\varOmega _{11}, \ldots , \varOmega _{RR}\) [fourth row]; the path of the covariate parameters \(\mu _i\) for \(i \notin \{1, 1+(D+1), \ldots , 1+4(D+1) \}\) [first row]; the path of the intercept parameters \(\mu _{i}, i \in \{ 1, 1+(D+1), \ldots , 1+4(D+1)\}\) [second row]. Each color corresponds to a specific parameter: orange for Cl, red for \(V_\mathrm{c}\), blue for \(k_\mathrm{a}\), yellow for Q and green for \(V_\mathrm{p}\). Note that the paths of all the covariate parameters are zero except for two components. The x-axis is in \(\log _{10}\) scale. (Color figure online)

Fig. 9

Evolution of \(\Gamma ^n_{ii}\) with iterations n of SAPG, for four different values of i, corresponding to the components \(\tilde{\mu }_{904}\) [left]; \(\tilde{\mu }_{912}\) [middle left]; \(\Sigma _{44}\) [middle right]; \(\sigma \) [right]. Both x-axis and y-axis are in \(\log _{10}\) scale

5.3 Application to real data

Algorithms SAEM-pen and SAPG with the matrix-valued random sequence \(\{\Gamma ^n, n\ge 0\}\) are applied to real pharmacokinetic data of dabigatran (DE) from two crossover clinical trials (Delavenne et al. 2013; Ollier et al. 2015). These two trials studied the drug–drug interaction between DE and different Pgp-inhibitors. From these two trials, the pharmacokinetics of DE are extracted for 15 subjects with no concomitant Pgp-inhibitor treatment. The concentration of dabigatran is measured at 9 sampling times for each patient. Each subject is genotyped using the DMET\(^{\textregistered }\) microarray from Affymetrix. Single nucleotide polymorphisms (SNPs) showing no variability between subjects are removed, and 264 SNPs are included in the analysis.

Function f of the nonlinear mixed model is defined as the two-compartment pharmacokinetic model with first-order absorption previously described (see Eq. 30) (Delavenne et al. 2013). The penalty function g is defined by Eq. (32).

Because of the limited number of subjects, the influence of genetic covariates is only studied on the \(V_\mathrm{c}\) and Cl parameters, which characterize the elimination process and are the most likely to be influenced by genetics. Finally, the random effect variances of Q and \(V_\mathrm{p}\) are set to 0.01, in accordance with a previously published population pharmacokinetic analysis of dabigatran (Delavenne et al. 2013). The other variance parameters are estimated. The penalized likelihood problem (Eq. 31) is solved on the data with the SAEM-pen and SAPG algorithms for 40 different values of the parameter \(\lambda \). The SAPG algorithm is run using the random sequence \(\{\Gamma ^n, n\ge 0\}\) given in Eq. (27). The best regularization parameter \(\lambda \) is chosen with a data-driven approach based on the EBIC criterion (Chen and Chen 2008).
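For reference, one common form of the extended BIC is sketched below; the exact variant used in the paper is not spelled out in this section, so this is illustrative only:

```python
from math import comb, log

def ebic(loglik, k, n, p, gamma=1.0):
    """Extended BIC of Chen and Chen (2008) for a model with k selected
    covariate parameters out of p candidates, fitted on n observations:
    EBIC = -2 loglik + k log n + 2 gamma log C(p, k).
    Lower values indicate a better trade-off between fit and sparsity."""
    return -2.0 * loglik + k * log(n) + 2.0 * gamma * log(comb(p, k))
```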

Fig. 10

Regularization path of covariate parameters (Cl parameter on top, \(V_\mathrm{c}\) parameter on bottom) obtained on dabigatran pharmacokinetic data for both SAEM-pen and SAPG algorithms. Black vertical dashed line corresponds to the \(\lambda \) value selected by EBIC. (Color figure online)

Figure 10 shows the results. The regularization paths of the Cl and \(V_\mathrm{c}\) parameters, obtained with both algorithms, show the evolution of the covariate coefficient estimates as a function of \(\lambda \). They are reconstructed with low noise for both algorithms and are very similar for large values of \(\lambda \), but less so for smaller values of \(\lambda \).

Finally, the selected model has all covariate parameters set to zero. This means that none of the genetic covariates influence the distribution of the individual parameters. This result is not surprising given the low number of subjects and the fact that a large part of the interindividual variability is due to the dissolution process of the drug (Ollier et al. 2015) and is therefore not influenced by genetic covariates. This lack of relationship between dabigatran's pharmacokinetic parameters and genetic covariates has already been highlighted in another study (Gouin-Thibault et al. 2017).

6 Conclusion

In this work, we propose a new stochastic proximal-gradient algorithm to solve penalized maximum-likelihood problems when the likelihood is intractable: the gradient is approximated through a stochastic approximation scheme. We provide a theoretical convergence analysis of this new algorithm and illustrate the results numerically on a simulated toy example with a concave likelihood function. Robustness to the non-concave case is explored through a more challenging application to population pharmacokinetic analysis, relying on penalized inference in nonlinear mixed effects models.