1 Introduction

Hierarchical mixed-effects nonlinear regression models are widely used to analyze complex data involving longitudinal or repeated measures, which often arise in pharmacokinetics and in medical, biological and other similar applications (see for example Davidian and Giltinan (2003)). In such studies, the sampling units are typically "subjects" drawn from the relevant population of interest, and statistical inference, primarily the estimation of various model parameters, is sought on certain characteristics of that underlying population. In this context, the hierarchical nonlinear model can be viewed as an extension of the ordinary nonlinear regression model, constructed to handle and 'aggregate' data obtained from several individuals. Modeling this type of data usually involves a 'functional' relationship between at least one predictor variable, x, and the measured response, y. As is often the case, the assumed 'functional' model between the response y and the predictor x is based on physical or mechanistic grounds and is usually nonlinear in its parameters. For instance, in pharmacokinetics, a typical (compartmental) model of a drug's concentration in the plasma is obtained from a set of differential equations reflecting the nonlinear time-dependency of the drug's disposition in the body. Figure 1 below illustrates such plasma concentration profiles for a group of \(N=12\) patients (the sample), each observed at \(n=11\) time points following the administration of the drug under study (the Theophylline study; see for example Boeckmann et al. (1994), Davidian and Giltinan (1995) and also Sect. 5.3 for more details about this well-known data set).

Fig. 1

The Theophylline data—drug plasma concentrations (ng/mL) profiles of \(N=12\) patients recorded over time (hr)

The primary aim of such pharmacokinetic studies, with data as depicted in Fig. 1, is to make, based on the N patients' data, generalizations about the drug disposition in the population of interest to which the group of patients belongs. Such studies therefore require a valid and reliable procedure for estimating the population's "typical" and variability values for each of the underlying pharmacokinetic parameters (e.g., the 'typical' rates of absorption, elimination, and clearance), usually reflecting the population (hierarchical) distribution of the relevant model's parameters. In this context, there are three basic types of 'typical' population pharmacokinetic parameters: fixed-effect parameters, which quantify the population average kinetics of a drug; inter-individual random-effect parameters, which quantify the typical magnitude of inter-individual variability in the pharmacokinetic parameters; and the intra-individual random-effect parameter, which quantifies the typical magnitude of the intra-individual variability (the experimental error).

The basic hierarchical linear regression model for pharmacokinetic applications was pioneered by Sheiner et al. (1972), and accounted for both types of variation: within and between subjects. The nonlinear case received widespread attention in later developments. Lindstrom and Bates (1990) proposed a general nonlinear mixed-effects model for repeated measures data, with estimators combining least squares and maximum likelihood estimation (under a specific normality assumption). Vonesh and Carter (1992) discussed a nonlinear mixed-effects model for unbalanced repeated measures. Additional related references include: Mallet (1986), Davidian and Gallant (1993a); Davidian and Giltinan (1993b, 1995).

In all, the standard approach for inference in hierarchical nonlinear models is typically based on full distributional assumptions for both the intra-individual and the inter-individual random components. The most common assumption is that both random components are normally distributed. However, this assumption is questionable in many cases. Our main results in this work offer a more general framework that does not hinge on the normality of the various random terms. In fact, the rigorous asymptotic results we obtain are established under only minimal moment conditions on the random errors and random-effect components of the underlying model, and thus could be construed as a distribution-free approach.

One simple approach to estimation in such hierarchical 'population' models is the so-called two-stage estimation method. At the first stage one estimates the 'individual-level' parameters and then, at the second stage, combines them in some manner to obtain the 'population-level' parameter estimates. However, despite its simplicity, the main challenge for such a two-stage estimation approach is in obtaining the sampling distributions and related properties (accuracy, precision, consistency, etc.) of the final estimators, either in finite or in large sample settings. For the most part, the performance of these two-stage estimation methods has been evaluated primarily via Monte-Carlo simulations; see related references including: Sheiner and Beal (1981, 1982, 1983), Steimer et al. (1984), and Davidian and Giltinan (1995, 2003). Hence, an alternative and more data-oriented evaluation methodology should be considered in assessing this type of hierarchical models. Using a variant of the random weighting technique, Bar-Lev and Boukai (2015) proposed a re-sampling scheme, termed herein recycling, as a valuable and valid alternative methodology for the evaluation and comparison of estimation procedures. Zhang and Boukai (2019b) studied the validity and established the asymptotic consistency and asymptotic normality of the recycled estimates in a one-layered nonlinear regression model.

In the present paper we extend the Bar-Lev and Boukai (2015) approach to include general random weights in the case of hierarchical nonlinear regression models with minimal moment assumptions on the random error terms and random effects. In Sect. 2, we present the basic framework for hierarchical nonlinear regression models with fixed and random effects. In Sect. 3, we describe the Standard Two-Stage (STS) estimation procedure for the population parameters appropriate in this hierarchical nonlinear regression setting. Along these lines, we introduce a corresponding re-sampling scheme, especially devised, based on general random weights, to obtain the recycled version of the STS estimators. In Sect. 4, we establish the asymptotic consistency and asymptotic normality of the STS estimators in such general settings. As mentioned above, our rigorous results do not depend on specifying the distribution(s) of the random component terms in the model (both errors and effects), but rather are obtained largely under minimal moment assumptions. As far as we know, these are the first provably valid asymptotic results concerning the sampling distribution and implied sampling properties of the estimators obtained via the STS procedure in the context of hierarchical nonlinear regression. In addition, we establish the asymptotic consistency and normality of the recycled version of the STS estimators. These results enable us to use the sampling distribution of the recycled version of the STS estimators to approximate the unknown sampling distribution of the actual STS estimators in a general 'distribution-free' framework. The results of extensive simulation studies and a detailed application to the Theophylline data are provided in Sect. 5. The proofs of our main results, along with many other technical details, are provided in the "Appendix".

2 The basic hierarchical (population) model

Consider a study involving a random sample of N individuals, where the nonlinear regression model (as in Zhang and Boukai (2019b)) is assumed to hold for each individual. That is, for the i-th individual (\(i=1, 2, \dots , N\)), we have available the \(n_i\) (repeated) observations on the response variable in the form of \(\mathbf{y}_i:=(y_{i1}, y_{i2}, \dots , y_{i n_i})^\mathbf{t}\), where

$$\begin{aligned} y _{i j}=f(\mathbf{x}_{ij}; \, {\varvec{\theta }}_i)+\epsilon _{ij}, \ \ \ \ j=1,\dots , n_i, \end{aligned}$$
(1)

and \( \mathbf{x}_{ij} \) is the j-th covariate for the i-th individual, which gives rise to the response, \( y_{ij}\), for \(j=1, \dots , n_i\) and \(i=1, \dots , N\). Here, \(f(\cdot )\) is a given nonlinear function and \(\epsilon _{ij}\) denote some i.i.d. \((0, \sigma ^2)\) error-terms. That is, if we set \({{\varvec{\epsilon }}_{i}}:=(\epsilon _{i1},\epsilon _{i2},\dots ,\epsilon _{in_i})^\mathbf{t}\), then \(E({{\varvec{\epsilon }}_{i}}) = \mathbf{0}\) and \(Var({{\varvec{\epsilon }}_{i}})\equiv Cov({{\varvec{\epsilon }}_{i}} {{\varvec{\epsilon }}_{i}}^\mathbf{t}) = \sigma ^2\mathbf{I}_{n_i}\) . In the current context of hierarchical modeling, the parameter vector \({{\varvec{\theta }}_i=(\theta _{i1}, \theta _{i2}, \dots , \theta _{ip})}^\mathbf{t}\in \Theta \subset {{I\!R^p}}\), (with \(p< n_i\)), can vary from individual to individual, so that \({\varvec{\theta }}_i\) is seen as the individual-specific realization of \({\varvec{\theta }}\). More specifically, it is assumed that, independent of the error terms, \({{\varvec{\epsilon }}_{i}}\),

$$\begin{aligned} {\varvec{\theta }}_i:= \varvec{\theta }_0+\mathbf{b}_i, \end{aligned}$$
(2)

where \(\varvec{\theta }_0:=(\theta _{01}, \theta _{02}, \dots , \theta _{0p})^\mathbf{t}\), is a fixed population parameter, though unknown, and \(\mathbf{b}_i=(b_{i1}, b_{i2}, \dots , b_{ip})^\mathbf{t}\) is a \(p\times 1\) vector representing the random effects associated with i-th individual. It is assumed that the random effects, \(\mathbf{b}_1, \mathbf{b}_2, \dots , \mathbf{b}_N\) are independent and identically distributed random vectors satisfying,

$$\begin{aligned} E(\mathbf{b}_i)=\mathbf{0} \ \ \ \text {and} \ \ \ Var(\mathbf{b}_i)\equiv Cov( \mathbf{b}_i, \mathbf{b}_i^t) = \mathbf{D}. \end{aligned}$$

Here \(\sigma ^2\) represents the within individual variability and \(\mathbf{D}\) describes the between individuals variability. Thus, \({\varvec{\theta }}_1, {\varvec{\theta }}_2, \dots , {\varvec{\theta }}_N\) are i.i.d. random vectors with

$$\begin{aligned} E({\varvec{\theta }}_i)=\varvec{\theta }_0\ \ \ \text {and} \ \ \ Var({\varvec{\theta }}_i) = \mathbf{D}. \end{aligned}$$

In the simple (i.e.: standard) hierarchical modeling it is often assumed that \(\mathbf{D}\) is some diagonal matrix of the form \(\mathbf{D}=Diag(\lambda _1^2,\lambda _2^2, \dots , \lambda _p^2)\) or even simpler, as \(\mathbf{D}=\lambda ^2 \mathbf{I}_p\) for some \(\lambda >0\), and that \(Var({{\varvec{\epsilon }}_{i}})=\sigma ^2\mathbf{I}_{n_i}\) for each \(i=1, \dots , N\) for some \(\sigma >0\).
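To fix ideas, model (1)-(2) under the simple structure \(\mathbf{D}=\lambda ^2 \mathbf{I}_p\) can be simulated directly. The Python sketch below is purely illustrative: the mean function \(f\), the parameter values, and the Gaussian draws for \(\mathbf{b}_i\) and \(\epsilon _{ij}\) are all assumptions of the example (the theory requires only the stated moment conditions).

```python
import numpy as np

def simulate_hierarchical(theta0, lam, sigma, x, N, rng):
    """Simulate model (1)-(2): theta_i = theta_0 + b_i and
    y_ij = f(x_ij; theta_i) + eps_ij, with b_i ~ (0, lam^2 I_p) and
    eps_ij ~ (0, sigma^2); Gaussian draws are used here only for illustration."""
    p = len(theta0)
    # Illustrative nonlinear mean function (a hypothetical choice):
    # f(x; theta) = exp(theta_1) * exp(-exp(theta_2) * x)
    f = lambda x, th: np.exp(th[0]) * np.exp(-np.exp(th[1]) * x)
    thetas = theta0 + lam * rng.standard_normal((N, p))   # individual parameters
    y = np.array([f(x, th) for th in thetas]) \
        + sigma * rng.standard_normal((N, len(x)))        # responses y_ij
    return thetas, y

rng = np.random.default_rng(0)
thetas, y = simulate_hierarchical(np.array([1.0, 0.8]), 0.1, 0.1,
                                  np.linspace(0.5, 8.0, 11), 12, rng)
```

The returned array `y` has one row per individual, mirroring the \(N=12\), \(n=11\) layout of the Theophylline data in Fig. 1.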

In more complex hierarchical modeling, more general structures of the within-individual variability, \(Var({{\varvec{\epsilon }}_{i}})=\Gamma _i\) (for some \(\Gamma _i\)), and of the between-individuals variability, \(\mathbf{D}\), are possible. However, even in the simplest structure, the available estimation methods for the model's parameters, \(\varvec{\theta }_0, \sigma ^2\) and \(\mathbf{D}\), are typically highly iterative in nature and are based on variations of least squares estimation. This remains true even under specific distributional assumptions, such as that both the error terms, \({{\varvec{\epsilon }}_{i}}\), and the random effects, \(\mathbf{b}_i\), are normally distributed, so that \({{\varvec{\epsilon }}_{i}} \sim \mathcal{N}_{n_i}(\mathbf{0}, \sigma ^2\mathbf{I}_{n_i})\) and \(\mathbf{b}_i\sim \mathcal{N}_p(\mathbf{0}, \mathbf{D})\) for each \(i=1, \dots , N\). In fact, many of the available results in the literature hinge on the specific normality assumption and on the ability to effectively 'linearize' the regression function \(f(\cdot )\) (see for example Bates and Watts (2007)) in order to obtain some assessment of the resulting sampling distributions of the parameters' estimates. We point out that here we require no specific distributional assumptions (such as normality, or otherwise) on either the intra-individual error terms, \({{\varvec{\epsilon }}_{i}}\), or the inter-individual random effects, \(\mathbf{b}_i\).

3 The two-stage estimation procedure

For each \(i=1, \dots , N\), let \(\mathbf{f}_i({\varvec{\theta }})\) denote the \(n_i\times 1\) vector whose elements are \(f(\mathbf{x}_{ij}, {\varvec{\theta }})\), \(j=1, \dots , n_i\). Then model (1) can be written more succinctly as

$$\begin{aligned} \mathbf{y}_i=\mathbf{f}_i({\varvec{\theta }}_i)+{{\varvec{\epsilon }}_{i}} \end{aligned}$$
(3)

Accordingly, the STS estimation procedure can be described as follows:

On Stage I::

For each \(i=1, \dots , N\) obtain \({\hat{{\varvec{\theta }}}}_{ni}\) as the minimizer of

$$\begin{aligned} Q_i({\varvec{\theta }}):= (\mathbf{y}_i- \mathbf{f}_i({\varvec{\theta }}))^\mathbf{t}(\mathbf{y}_i-\mathbf{f}_i({\varvec{\theta }}))\equiv \sum _{j=1}^{n_i}(y_{ij}-f(\mathbf{x}_{ij}, {\varvec{\theta }}))^2 , \end{aligned}$$
(4)

so as to form \({\hat{{\varvec{\theta }}}}_{n1}, {\hat{{\varvec{\theta }}}}_{n2}, \dots , {\hat{{\varvec{\theta }}}}_{nN}\), based on all the \(M:=\sum _{i=1}^N n_i\) available observations. Next, estimate the within-individual variability component, \(\sigma ^2\), by

$$\begin{aligned} {\hat{\sigma }}_M^2:=\frac{1}{M-pN} \sum _{i=1}^N Q_i(\hat{{\varvec{\theta }}}_{ni}). \end{aligned}$$
On Stage II::

Estimate the 'population' parameter \(\varvec{\theta }_0\) by the average

$$\begin{aligned} \hat{{\varvec{\theta }}}_{{STS}}:=\frac{1}{N} \sum _{i=1}^{N}\hat{{\varvec{\theta }}}_{ni}. \end{aligned}$$
(5)

Next, estimate \(Var(\hat{{\varvec{\theta }}}_{{STS}})\) by \(\mathbf{S}^2_{{STS}}/N\), where

$$\begin{aligned} \mathbf{S}^2_{{STS}}:= \sum _{i=1}^{N}(\hat{{\varvec{\theta }}}_{ni}-\hat{{\varvec{\theta }}}_{{STS}})(\hat{{\varvec{\theta }}}_{ni}-\hat{{\varvec{\theta }}}_{{STS}})^\mathbf{t}. \end{aligned}$$

Finally estimate the between-individual variability component, \(\mathbf{D}\), by

$$\begin{aligned} {\hat{\varvec{D}}}=\mathbf{S}^{2}_{{STS}}- \min\, ({\hat{\nu }} , {\hat{\sigma }}_M^{2})\, {\hat{{\varvec{\Sigma }}}}_N, \end{aligned}$$
(6)

where \({\hat{{\varvec{\Sigma }}}}_N:= \frac{1}{N}\sum _{i=1}^{N}{\varvec{\Sigma }}_{n_i}(\hat{{\varvec{\theta }}}_{ni})\), with \({\varvec{\Sigma }}^{-1}_{n_i}\) defined as,

$$\begin{aligned} \varvec{\Sigma }^{-1}_{n_i}({\varvec{\theta }}):= \frac{1}{n_i}\sum _{j=1}^{n_i}\nabla f_{ij}({\varvec{\theta }})\nabla f_{ij}({\varvec{\theta }})^\mathbf{t}, \end{aligned}$$
(7)

and where \({\hat{\nu }}\) is the smallest root of the equation \(|\mathbf{S}^2_{{STS}}-\nu {\hat{{\varvec{\Sigma }}}}_N|=0\); see Davidian and Giltinan (2003) for details.
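Stages I and II above can be sketched compactly. The Python snippet below is a minimal, self-contained illustration: the one-exponential mean function `f`, the simulated Gaussian data, and the \(1/(N-1)\) normalization of the between-individual scatter are all assumptions of the example (the \({\hat{\nu }}\) correction of (6) is omitted for brevity).

```python
import numpy as np
from scipy.optimize import least_squares

# Illustrative mean function (a hypothetical one-exponential f):
def f(x, th):
    return np.exp(th[0]) * np.exp(-np.exp(th[1]) * x)

def sts_estimate(x, y, theta_init):
    """Stages I-II of the STS procedure, as in (4)-(5)."""
    N, n = y.shape
    p = len(theta_init)
    theta_hat = np.empty((N, p))
    rss = np.empty(N)
    for i in range(N):
        fit = least_squares(lambda th, i=i: y[i] - f(x, th), theta_init)
        theta_hat[i] = fit.x                    # individual LS estimate (Stage I)
        rss[i] = np.sum(fit.fun ** 2)           # Q_i at the minimizer
    M = N * n
    sigma2_hat = rss.sum() / (M - p * N)        # within-individual variance
    theta_sts = theta_hat.mean(axis=0)          # population estimate, eq. (5)
    dev = theta_hat - theta_sts
    S2_sts = dev.T @ dev / (N - 1)              # between-individual scatter
    return theta_sts, sigma2_hat, S2_sts

# Simulated balanced data (Gaussian draws purely for illustration).
rng = np.random.default_rng(1)
x = np.linspace(0.1, 4.0, 11)
theta_true = np.array([1.0, 0.8])
thetas = theta_true + 0.1 * rng.standard_normal((12, 2))
y = np.array([f(x, th) for th in thetas]) + 0.05 * rng.standard_normal((12, 11))
theta_sts, sigma2_hat, S2_sts = sts_estimate(x, y, np.array([0.5, 0.5]))
```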

Bar-Lev and Boukai (2015) provided a numerical study of this two-stage estimation procedure in the context of a (hierarchical) pharmacokinetics modeling under the normality assumption. They also proposed a corresponding two-stage re-sampling scheme based on specific \(\mathcal{{D}}irichlet({\varvec{1}})\) random weights. However, in this paper we consider a more general framework for the random weights to be used.

As in Zhang and Boukai (2019b), we let, for each \(n\ge 1\), the random weights \(\mathbf{w}_n=(w_{1:n}, w_{2:n}, \dots , w_{n:n})^\mathbf{t}\) be a vector of exchangeable nonnegative random variables with \(E(w_{i:n})=1\) and \(Var(w_{i:n}):= \tau _n^2\), and let \(W_{i}\equiv W_{i:n}=(w_{i:n}-1)/\tau _n\) be the standardized version of \(w_{i:n}\), \(i=1, \dots , n\). In addition we also assume:

Assumption W

The underlying distribution of the random weights \(\mathbf{w}_n\) satisfies

  1. For all \(n\ge 1\), the random weights \(\mathbf{w}_n\) are independent of \((\epsilon _1, \epsilon _2, \dots , \epsilon _n)^\mathbf{t}\);

  2. \(\tau ^2_n=o(n)\), \(E(W_iW_j)=O(n^{-1})\) and \(E(W_i^2W_j^2)\rightarrow 1\) for all \(i\ne j\), and \(E(W_i^4)<\infty \) for all i.

Some examples of random weights \(\mathbf{w}_n\) that satisfy the conditions of Assumption W are: the Multinomial weights, \(\mathbf{w}_n \sim \mathcal{{M}}ultinomial(n, 1/n, 1/n, \dots , 1/n)\), which correspond to the classical bootstrap of Efron (1979), and the Dirichlet weights, \(\mathbf{w}_n\equiv n\times \mathbf{z}_n\) where \(\mathbf{z}_n\sim \mathcal{{D}}irichlet(\alpha , \alpha , \dots , \alpha )\) with \(\alpha >0\), which is often referred to as the Bayesian bootstrap (see Rubin (1981), and its variants as in Zheng and Tu (1988) and Lo (1991)).
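Both families of weights are straightforward to generate; the Python sketch below (with hypothetical helper names) draws each kind and makes the moment structure explicit in the comments.

```python
import numpy as np

def multinomial_weights(n, rng):
    """Efron (1979) bootstrap weights: counts of a Multinomial(n; 1/n,...,1/n)
    draw, so E(w_i) = 1 and tau_n^2 = Var(w_i) = 1 - 1/n = o(n)."""
    return rng.multinomial(n, np.ones(n) / n).astype(float)

def dirichlet_weights(n, rng, alpha=1.0):
    """Bayesian-bootstrap weights: w_n = n * z_n with
    z_n ~ Dirichlet(alpha,...,alpha), so E(w_i) = 1; alpha = 1 recovers
    Rubin (1981)."""
    return n * rng.dirichlet(np.full(n, alpha))

rng = np.random.default_rng(0)
w_mult = multinomial_weights(50, rng)
w_dir = dirichlet_weights(50, rng)
```

By construction both weight vectors are nonnegative and sum (exactly, or up to floating-point error) to n, consistent with the mean-one requirement of Assumption W.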

We assume throughout this paper that all the random weights we use in the sequel satisfy Assumption W. With such random weights \(\mathbf{w}_n\) at hand, we define, in similarity to (4), the recycled version \(\hat{{\varvec{\theta }}}^*_n\) of \(\hat{{\varvec{\theta }}}_n\) as the minimizer of the randomly weighted least squares criterion. With such general random weights, the recycled version of the STS estimation procedure described in (4)-(7) above is:

On Stage I\(^*\)::

For each \(i=1, \dots , N\), independently generate random weights \(\mathbf{w}_i=(w_{i1},w_{i2},\dots ,w_{in_i})^\mathbf{t}\) that satisfy Assumption W with \(Var(w_{ij})=\tau ^2_{n_i}\), and obtain \(\hat{{\varvec{\theta }}}^*_{ni}\) as the minimizer of

$$\begin{aligned} Q_i^*({\varvec{\theta }}):= \sum _{j=1}^{n_i}w_{ij}(y_{ij}-f(\mathbf{x}_{ij}, {\varvec{\theta }}))^2 , \end{aligned}$$
(8)

so as to form \(\hat{{\varvec{\theta }}}^*_{n1}, \hat{{\varvec{\theta }}}^*_{n2}, \dots , \hat{{\varvec{\theta }}}^*_{nN}\).

On Stage II\(^*\)::

Independent of Stage I\(^*\), generate random weights \(\mathbf{u}=(u_{1},u_{2},\dots ,u_N)^\mathbf{t}\) that satisfy Assumption W with \(Var(u_{i})=\tau ^2_N\), and obtain the recycled version of \(\hat{{\varvec{\theta }}}_{{STS}}\) as:

$$\begin{aligned} \hat{{\varvec{\theta }}}^*_{{RTS}}:=\frac{1}{N} \sum _{i=1}^{N}u_i\hat{{\varvec{\theta }}}^*_{ni} \end{aligned}$$
(9)

The recycled version \(\hat{\mathbf{D}}^*\) of \(\hat{\mathbf{D}}\) can subsequently be obtained as described in Stage II above.
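One recycled replication of Stages I\(^*\)-II\(^*\) can be sketched as follows. This Python snippet is illustrative only: it uses a hypothetical one-exponential mean function, Dirichlet(1, ..., 1) weights for both stages, and simulated Gaussian data.

```python
import numpy as np
from scipy.optimize import least_squares

# Illustrative mean function (a hypothetical one-exponential f):
def f(x, th):
    return np.exp(th[0]) * np.exp(-np.exp(th[1]) * x)

def recycled_sts(x, y, theta_init, rng):
    """One recycled replication: weighted least squares per individual
    (eq. (8)), then the randomly weighted average of eq. (9)."""
    N, n = y.shape
    theta_star = np.empty((N, len(theta_init)))
    for i in range(N):
        w = n * rng.dirichlet(np.ones(n))            # Stage I* weights
        fit = least_squares(
            lambda th, i=i, w=w: np.sqrt(w) * (y[i] - f(x, th)), theta_init)
        theta_star[i] = fit.x
    u = N * rng.dirichlet(np.ones(N))                # Stage II* weights
    return (u[:, None] * theta_star).sum(axis=0) / N # eq. (9)

# Simulated data and a handful of recycled draws (Gaussian draws for illustration).
rng = np.random.default_rng(2)
x = np.linspace(0.1, 4.0, 11)
theta_true = np.array([1.0, 0.8])
thetas = theta_true + 0.1 * rng.standard_normal((12, 2))
y = np.array([f(x, th) for th in thetas]) + 0.05 * rng.standard_normal((12, 11))
draws = np.array([recycled_sts(x, y, np.array([0.5, 0.5]), rng)
                  for _ in range(20)])
```

Repeating the call B times, as in Sect. 5.2, yields an empirical approximation to the sampling distribution of the recycled estimator.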

4 Consistency of the STS and the recycled estimation procedures

In this section we present asymptotic results that establish and validate the consistency and asymptotic normality of the STS estimator, \(\hat{{\varvec{\theta }}}_{{STS}}\) (Theorems 1 and 2), and of its recycled version, \(\hat{{\varvec{\theta }}}^*_{{RTS}}\) (Theorems 3 and 4), obtained using general random weights satisfying the premises of Assumption W. We establish these results without the 'typical' normality assumption on the within-individual error terms, \(\epsilon _{ij}\), or on the between-individual random effects, \(\mathbf{b}_i\). However, for simplicity of the exposition, we state these results in the case of \(p=1\), so that \(\Theta \subset {I\!R}\). With that in mind, we denote for each \(i=1, \dots , N\),

$$\begin{aligned} f_{ij}(\theta )\equiv f(x_{ij}, \theta ), \ \ \text {for} \ \ j=1, \dots , n_i. \end{aligned}$$

Accordingly, the least squares criterion in (4) becomes

$$\begin{aligned} Q_{ni}(\theta ):= \sum _{j=1}^{n_i}(y_{ij}-f_{ij}(\theta ))^2, \end{aligned}$$

and the LS estimator \(\hat{\theta }_{ni}\) is readily seen as the solution of

$$\begin{aligned} Q_{ni}'(\theta ):= 2\sum _{j=1}^{n_i}\phi _{ij}(\theta )=0 \end{aligned}$$
(10)

where,

$$\begin{aligned} \phi _{ij}(\theta ):=-(y_{ij}-f_{ij}(\theta ))f^{'}_{ij}(\theta ), \ \ \ \ \end{aligned}$$
(11)

with \(f^{'}_{ij}(\theta ):= d f_{ij}(\theta )/d \theta \), for \(j=1,\dots , n_i\) and for each \(i=1,\dots , N\). We write \(f^{''}_{ij}(\theta ):= d f^\prime _{ij}(\theta )/d \theta \) and \(\phi ^\prime _{ij}(\theta ):= d \phi _{ij}(\theta )/d \theta \), etc. As in Zhang and Boukai (2019b), we also assume that \( f_{ij}^{'}(\theta )\) and \(f_{ij}^{''}(\theta )\) exist for all \( \theta \) near \(\theta _0\). However, to account for the inclusion of the \((0, \lambda ^2)\) random-effect term, \(b_i\), in the model, we also assume:

Assumption A

For each \(i=1, \dots , N\)

  1. \(a_{n_i}^{2}:=\sigma ^2\sum _{j=1}^{n_i}E(f^{'2}_{ij}(\theta _0+b_i))\rightarrow \infty \) as \(n_i\rightarrow \infty \);

  2. \(\underset{n_i\rightarrow \infty }{\limsup } \ \ a_{n_i}^{-2}\sum _{j=1}^{n_i}\underset{|\theta -\theta _0-b_i|\le \delta }{\sup } f_{ij}^{''2}(\theta ) <\infty \);

  3. \( a_{n_i}^{-2}\sum _{j=1}^{n_i}f_{ij}^{'2}(\theta ) \rightarrow \frac{1}{\sigma ^2}\) uniformly in \(|\theta -\theta _0-b_i|\le \delta \).

In the following two theorems we establish, under the conditions of Assumption A, the asymptotic consistency and normality of \({\hat{\theta }}_{{STS}}\). Their proofs and some related technical results are given in Sect. 7.1.

Theorem 1

Suppose that Assumption A holds, then there exists a sequence \({{\hat{\theta }}_{ni}}\) of solutions of (10) such that \({{\hat{\theta }}_{ni}}=\theta _0+b_i+a_{ni}^{-1}T_{ni}\), where \(|T_{ni}|<K\) in probability, for each \(i=1,2,\dots ,N\). Further, there exists a sequence \({\hat{\theta }}_{{STS}}\) as expressed in (5) such that \({\hat{\theta }}_{{STS}}-\theta _0\overset{p}{\rightarrow }0, \) as \(n_i\rightarrow \infty \), for \(i=1,2,\dots ,N\), and as \(N\rightarrow \infty \).

Theorem 2

Suppose that Assumption A holds. If \(\underset{N,n_i\rightarrow \infty }{\lim } N/a_{n_i}^2<\infty \) for all \(i=1,2,\dots ,N\), then there exists a sequence \({\hat{\theta }}_{{STS}}\) as expressed in (5) such that \({\hat{\theta }}_{{STS}}-\theta _0= \frac{1}{N}\sum _{i=1}^{N}b_i-\psi _{N,n_i},\) where \(\sqrt{N}\psi _{N,n_i}\overset{p}{\rightarrow }0\). Further,

$$\begin{aligned} \mathcal{{R}}_N:=\frac{\sqrt{N}}{\lambda }({\hat{\theta }}_{{STS}}-\theta _0)\Rightarrow {{\mathcal {N}}}(0,1) \end{aligned}$$

as \(n_i\rightarrow \infty \), for \(i=1,2,\dots ,N\), and as \(N\rightarrow \infty \).

For the recycled STS estimation procedure as described in Sect. 3, the recycled version \(\hat{\theta }^*_{ni}\) of \(\hat{\theta }_{ni}\) is the minimizer of (8), or alternatively, the direct solution of

$$\begin{aligned} Q_{i}^{*\prime }(\theta ):= 2\sum _{j=1}^{n_i}w_{ij}\phi _{ij}(\theta )=0, \end{aligned}$$
(12)

where \(\mathbf{w}_i=(w_{i1},w_{i2},\dots ,w_{in_i})^\mathbf{t}\) are the randomly drawn weights (satisfying Assumption W) for the i-th individual, \(i=1, 2, \dots , N\). To establish results comparable to those of Theorems 1 and 2 for the recycled version, \({\hat{\theta }}^*_{{RTS}} =\sum _{i=1}^{N}u_i \hat{\theta }^*_{ni}/N\), of \({\hat{\theta }}_{{STS}}=\sum _{i=1}^{N}\hat{\theta }_{ni}/N\), with the random weights \(\mathbf{u}=(u_1, u_2, \dots , u_N)^\mathbf{t}\) as in Stage II\(^*\), we need the following additional assumptions.

Assumption B

In addition to Assumption A, we assume that \(E(\epsilon _{ij}^4)<\infty \) and that for each \(i=1, 2, \dots ,N,\)

  1. \(\underset{n_i\rightarrow \infty }{\limsup } \ \ a_{n_i}^{-2}\sum _{j=1}^{n_i}\underset{|\theta -\theta _0-b_i|\le \delta }{\sup } f_{ij}^{'4}(\theta ) <\infty \);

  2. \(\underset{n_i\rightarrow \infty }{\limsup } \ \ a_{n_i}^{-2}\sum _{j=1}^{n_i}\underset{|\theta -\theta _0-b_i|\le \delta }{\sup } f_{ij}^{''4}(\theta ) <\infty \);

  3. \({n_i}{a_{n_i}^{-2}}\rightarrow c_i\ge 0\) as \(n_i\rightarrow \infty \).

In Theorems 3 and 4 below we establish, under the conditions of Assumptions A and B, the asymptotic consistency and normality of the recycled estimator \({\hat{\theta }}^*_{{RTS}}\). Their proofs and some related technical results are given in Sect. 7.2.

Theorem 3

Suppose that Assumptions A and B hold. Then there exists a sequence \({\hat{\theta }}_{ni}^*\) of solutions of (12) such that \({\hat{\theta }}_{ni}^*={{\hat{\theta }}_{ni}}+a_{ni}^{-1}T^*_{ni}\), where \(|T_{ni}^*|<K\tau _{n_i}\) in probability, for \(i= 1,\dots , N\). Further, for any \(\epsilon >0\), we have \(P^*(|{\hat{\theta }}^*_{{RTS}}-\theta _0|>\epsilon )=o_p(1)\) as \(n_i\rightarrow \infty \), for \(i=1,2,\dots ,N\), and as \(N\rightarrow \infty \).

Theorem 4

Suppose that Assumptions A and B hold. If for each \(i=1, 2, \dots , N\), \(\frac{\tau _{n_i}}{\tau _{N}}=o(\sqrt{n_i})\), then we have \( {\hat{\theta }}^*_{{RTS}}-{\hat{\theta }}_{{STS}}= \frac{1}{N}\sum _{i=1}^{N}(u_i-1){{\hat{\theta }}_{ni}}-\psi ^*_{N,n_i}, \) where \(\frac{\sqrt{N}}{\tau _{N}}\psi ^*_{N,n_i}\overset{p^*}{\rightarrow } 0\) as \(N,n_i\rightarrow \infty \). Additionally,

$$\begin{aligned} \mathcal{{R}^*_N}:=\frac{\sqrt{N}}{\lambda \tau _{N}} ({\hat{\theta }}^*_{{RTS}}-{\hat{\theta }}_{{STS}})\Rightarrow {{\mathcal {N}}}(0,1), \end{aligned}$$

as \(n_i\rightarrow \infty \), for \(i=1,2,\dots ,N\), and as \(N\rightarrow \infty \).

The following corollary is an immediate consequence of the above results. It suggests that the sampling distribution of \({\hat{\theta }}_{{STS}}\) can be well approximated by that of its recycled, or re-sampled, version \({\hat{\theta }}^*_{{RTS}}\).

Corollary 1

For all \(t\in I\!R\), let \(\mathcal{{H}}_N(t)=P\left( \mathcal{{R}}_N\le t\right) \) and \( \mathcal{{H}}_N^*(t)=P^*\left( \mathcal{{R}}^*_N\le t\right) \) denote the corresponding c.d.f.s of \(\mathcal{{R}}_N\) and \(\mathcal{{R}}^*_N\), respectively. Then by Theorems 2 and 4,

$$\begin{aligned} \underset{t}{\sup }| \mathcal{{H}}_N^*(t)-\mathcal{{H}}_N(t)|\rightarrow 0 \ \ \ \text {in probability}. \end{aligned}$$
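The sup-distance appearing in Corollary 1 is the familiar two-sample Kolmogorov-Smirnov statistic between empirical c.d.f.s, and can be computed directly from two sets of draws (e.g. Monte Carlo draws of \(\mathcal{{R}}_N\) and recycled draws of \(\mathcal{{R}}^*_N\)). A minimal Python sketch, with hypothetical function names:

```python
import numpy as np

def sup_cdf_distance(a, b):
    """sup_t |H*(t) - H(t)| between the empirical c.d.f.s of two samples.
    Both ECDFs are step functions, so the supremum is attained at one of
    the pooled sample points (evaluated from the right)."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    Ha = np.searchsorted(a, grid, side="right") / a.size
    Hb = np.searchsorted(b, grid, side="right") / b.size
    return float(np.abs(Ha - Hb).max())

rng = np.random.default_rng(0)
# Two samples from the same distribution: the distance should be small.
same = sup_cdf_distance(rng.standard_normal(2000), rng.standard_normal(2000))
# A clearly shifted alternative: the distance should be large.
shifted = sup_cdf_distance(rng.standard_normal(2000),
                           2.0 + rng.standard_normal(2000))
```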

5 Implementation and numerical results

5.1 Illustrating the STS estimation procedure

To illustrate the main results of Sect. 4 for the hierarchical nonlinear regression model and the corresponding STS estimation procedure described in (4)-(7) above, we consider a typical compartmental model from pharmacokinetics. In the standard two-compartment model, the relationship between the measured drug concentration and the post-dosage time t (following an intravenous administration) can be described through a nonlinear function of the form:

$$\begin{aligned} {f(t;{\varvec{\eta }})=\eta _1e^{-\eta _2t}+\eta _3e^{-\eta _4 t}}, \end{aligned}$$
(13)

with \({\varvec{\eta }}:={(\eta _1, \eta _2, \eta _3, \eta _4)^\prime }\) being the parameter vector representing the various kinetic rate constants, such as the rate of elimination, rate of absorption, clearance, volume, etc. Since these constants (i.e., parameters) must be positive, we re-parametrize the model with \({\varvec{\theta }}\equiv \log (\varvec{\eta })\) (with \(\theta _k=\log (\eta _k)\), \(k=1,2,3,4\)), so that with \(t>0\),

$$\begin{aligned} f(t; {\varvec{\theta }})=exp(\theta _{1})exp\{-exp(\theta _{2})t\}+exp(\theta _{3})exp\{-exp(\theta _{4})t\}, \end{aligned}$$
(14)

with \({\varvec{\theta }}=(\theta _1, \theta _2, \theta _3, \theta _4)^\mathbf{t}\in {I\!R}^4\). For the simulation study, we consider a situation in which the (plasma) drug concentrations \(\{y_{ij} \}\) of N individuals were measured at post-dose times \(t_{ij}\) and are related, as in model (1), via the nonlinear regression model,

$$\begin{aligned} y_{ij}=f(t_{ij}; {\varvec{\theta }}_{i})+\epsilon _{ij}, \end{aligned}$$

for \(j=1, \dots , n_i\) and \(i=1, \dots , N\). Here, as in Sect. 2, \(\epsilon _{ij}\) are i.i.d. \((0, \sigma ^2)\) random error terms and \({\varvec{\theta }}_i=\varvec{\theta }_0+\mathbf{b}_i\), where \(\mathbf {b}_i\) are independent identically distributed random-effect terms with mean \(\mathbf {0}\) and unknown variance \(\lambda ^2{{\varvec{I}}}_{4\times 4}\). Accordingly, we have in all a total of 6 unknown parameters, namely \(\varvec{\theta }_0=(\theta _{10}, \theta _{20}, \theta _{30}, \theta _{40})^\mathbf{t}, \ \sigma \) and \(\lambda \).
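The mean function (14) and one simulated data set from this design can be coded directly. In the Python sketch below, the Gaussian draws for \(\mathbf{b}_i\) and \(\epsilon _{ij}\) are used purely for illustration (the simulations in this section use truncated Normal, Normal and Laplace draws).

```python
import numpy as np

def f14(t, theta):
    """The log-reparametrized two-compartment mean function (14)."""
    return (np.exp(theta[0]) * np.exp(-np.exp(theta[1]) * t)
            + np.exp(theta[2]) * np.exp(-np.exp(theta[3]) * t))

# One simulated data set from the design of Sect. 5.1.
rng = np.random.default_rng(0)
theta0 = np.array([1.0, 0.8, -0.5, -1.0])
N, n, sigma, lam = 5, 15, 0.1, 0.1
t = rng.uniform(0.0, 8.0, size=(N, n))              # post-dose sampling times
b = lam * rng.standard_normal((N, 4))               # random effects b_i
y = np.array([f14(t[i], theta0 + b[i]) for i in range(N)]) \
    + sigma * rng.standard_normal((N, n))           # y_ij = f(t_ij; theta_i) + eps_ij
```

At \(t=0\) the mean response reduces to \(e^{\theta _1}+e^{\theta _3}\), which provides a quick sanity check on the implementation.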

Since \(\sigma \) and \(\lambda \) represent the variation within and between individuals (respectively), different settings for these two parameters lead to very different situations. For instance, Fig. 2a below depicts the situation for \(N=5\) individuals with \(n_i\equiv n =15\) observations each, when \(\sigma =0.1\) and \(\lambda =0.1\), so that the variation between individuals is similar to the variation within individuals. Figure 2b depicts the situation with \(\sigma =0.05, \lambda =1\), so that the variation between individuals is much larger than the variation within individuals.

For the simulation, we set \(\varvec{\theta }_0=(1,0.8,-0.5,-1)^\mathbf{t}\), and for each i, the times \(t_{ij}, j=1,\dots ,n\), were generated uniformly from the [0, 8] interval. To allow for different 'distributions', the error terms, \(\epsilon _{ij}\), as well as the random-effect terms, \(\mathbf{b}_i\), were generated from either the (a) Truncated Normal, (b) Normal or (c) Laplace distributions, all in consideration of Assumption A in our main results.

For each simulation run, with the Truncated Normal distribution for the error terms and the random-effect terms, we calculated the value of \({\hat{{\varvec{\theta }}}}^k_{{STS}}\) as an estimator of \(\varvec{\theta }_0\), and repeated this procedure \(M=1000\) times to calculate the corresponding Mean Square Error (MSE) as follows:

$$\begin{aligned} MSE=\frac{1}{M}\sum _{k=1}^{M}||{\hat{{\varvec{\theta }}}}^k_{{STS}}-\varvec{\theta }_0||^2 \end{aligned}$$

The corresponding simulation results obtained for various values of N and n, are presented in Table 1 for \(\sigma =0.1, \lambda =0.1\) and \(\sigma =0.05, \lambda =1\).

Fig. 2
figure 2

Illustrating drug plasma concentration vs time for 5 individuals (colored) for the cases: a \(\sigma =0.1, \lambda =0.1\) and b \(\sigma =0.05, \lambda =1\)

Table 1 The MSE of STS estimates for truncated Normal error-terms/effects with \(\sigma =0.1, \lambda =0.1\) and \(\sigma =0.05, \lambda =1.0\)

From Table 1 we see that as n and N both increase, the MSE decreases, as expected. However, when \(\sigma =0.05, \lambda =1\), increasing n for a fixed N does not contribute to a smaller MSE. This is consistent with our main result, Theorem 1: the STS estimate is not consistent with only \(n_i\rightarrow \infty \) (this effect is more obvious when \(\lambda \) is relatively large).

To simulate the results of Theorem 2, we chose \(\theta _{2}\) to be the unknown parameter, and used the main result to construct the \(95\%\) confidence interval

$$\begin{aligned} (\hat{\theta }_{{STS}}-1.96\frac{\hat{\lambda }}{\sqrt{N}},\ \hat{\theta }_{{STS}}+1.96\frac{\hat{\lambda }}{\sqrt{N}}), \end{aligned}$$

where

$$\begin{aligned} \hat{\lambda }^2=\frac{1}{N-1}\sum _{i=1}^{N}({{\hat{\theta }}_{ni}}-{\hat{\theta }}_{{STS}})^2. \end{aligned}$$

The estimate \(\hat{\lambda }\) used here is the simple STS estimate, not the corrected one as in (6). \(M=1{,}000\) replications of such simulations were executed to determine the percentage of times the true value of the parameter was contained in the interval. We used \(\sigma =0.5, \lambda =0.5\); the observed coverage percentages are provided in Table 2.
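This coverage experiment can be sketched compactly. The Python snippet below is a simplified, illustrative version of the study: only \(\theta _2\) is treated as unknown (the remaining kinetic constants of (14) are held fixed at hypothetical values), a fixed time grid replaces the uniform sampling times, Gaussian draws stand in for the truncated Normal ones, and a small replication count keeps the run short.

```python
import numpy as np
from scipy.optimize import least_squares

def coverage_theorem2(M=80, N=20, n=20, sigma=0.2, lam=0.5, theta0=0.8, seed=0):
    """Monte Carlo coverage of the 95% interval built from Theorem 2, for a
    single unknown parameter theta_2 (an illustrative simplification of the
    study design of Sect. 5.1)."""
    rng = np.random.default_rng(seed)
    tgrid = np.linspace(0.1, 8.0, n)
    def f(t, th):   # model (14) with theta_1, theta_3, theta_4 held fixed
        return (np.exp(1.0) * np.exp(-np.exp(th) * t)
                + np.exp(-0.5) * np.exp(-np.exp(-1.0) * t))
    hits = 0
    for _ in range(M):
        th_i = theta0 + lam * rng.standard_normal(N)   # theta_i = theta_0 + b_i
        y = f(tgrid, th_i[:, None]) + sigma * rng.standard_normal((N, n))
        th_hat = np.array([least_squares(
            lambda th, i=i: y[i] - f(tgrid, th[0]), [0.5],
            bounds=(-4.0, 4.0)).x[0] for i in range(N)])
        th_sts = th_hat.mean()                         # STS estimate of theta_0
        lam_hat = np.sqrt(np.sum((th_hat - th_sts) ** 2) / (N - 1))
        half = 1.96 * lam_hat / np.sqrt(N)             # Theorem 2 interval
        hits += (th_sts - half <= theta0 <= th_sts + half)
    return hits / M

cov = coverage_theorem2()
```

With both N and n moderately large the empirical coverage should settle near the nominal 0.95, echoing the pattern reported in Table 2.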

Table 2 Coverage Percentage of the CI for the truncated Normal and Normal error-terms/effects with \(\sigma =0.5, \lambda =0.5\)

From these results we observe that as n and N both increase, the coverage percentage approaches 0.95. However, when n is small (\(n=15\)), increasing N drifts the coverage percentage farther away from the desired level of 0.95. This finding is consistent with our main result: the convergence requires the condition \(\underset{N,n_i\rightarrow \infty }{\lim } N/a_{n_i}^2<\infty \), which in this case, since \(n/a_{n}^2\) converges to a constant, amounts to requiring \(\underset{N,n\rightarrow \infty }{\lim } N/n<\infty \). Hence, when N is much larger than n, this condition does not hold. Although for this model, error terms that follow the normal distribution do not satisfy Assumption A, we used normal error terms in the simulations and reported the resulting MSE and coverage percentage for the 95% confidence interval in Tables 2 and 3. From these results we observe that with n and N increasing, the MSE becomes smaller and the coverage percentages are closer to 0.95.

Table 3 The MSE of STS estimates for Normal error-terms/effects with \(\sigma =0.1, \lambda =0.1\)

We further considered simulations using the Laplace distribution for the error terms and the random-effects terms. The complete results are reported in Zhang and Boukai (2019b) and indicate similar conclusions.

5.2 Illustrating the recycled STS estimation procedure

Here we provide the results of the simulation studies corresponding to Theorems 3 and 4 concerning the recycled STS estimator, \({\hat{\theta }}^*_{{RTS}}\). We considered the same nonlinear (compartmental) model as in the previous subsection, again with \(p=1\). Accordingly, we chose \(\theta _2\) to represent the model's unknown parameter and set, for the simulations, \(\theta _0=0.8\) for each i. As before, we generated the values of \(\{t_{ij},j=1,\dots ,n\}\) uniformly from the [0, 8] interval, and drew the error terms, \(\epsilon _{ij}\), and the random-effects terms, \(b_i\), from the truncated Normal distribution.

For each simulation run, we calculated the value of \(\hat{\theta }_{{STS}}\) as in Sect. 3. Then, with \(B=1{,}000\), we generated \(B\times N\) independent replications of the random weights \(\mathbf{w}_i=(w_{i1},w_{i2},\dots ,w_{in})\) and \(B=1{,}000\) independent replications of the random weights \(\mathbf{u}=(u_1,u_2,\dots ,u_N)\), to obtain \(\hat{\theta }^{*1}_{{STS}}\), \(\hat{\theta }^{*2}_{{STS}}\), \(\dots \), \(\hat{\theta }^{*B}_{{STS}}\). The corresponding 95% confidence intervals were then formed. With \(\sigma = 1\), \(\lambda = 1\), a total of \(M=2{,}000\) replications of such simulations were executed to determine the percentage of times the true value of the parameter was contained in the interval, and the average confidence interval length was calculated. The Coverage Percentages with average confidence interval lengths are reported in Tables 4 and 5.
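As a schematic illustration of this recycling loop (not the actual nonlinear fits), the sketch below replaces the first-stage weighted minimization of \(Q_i^*\) with a closed-form weighted mean; the Dirichlet(1) weights, the two-stage structure, and the percentile interval taken from the B recycled draws follow the procedure described above, while the data model and the stand-in "fit" are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, B = 12, 15, 500        # subjects, observations per subject, recycled draws

# Hypothetical data: y_ij = theta_0 + b_i + eps_ij, with theta_0 = 0.8.
theta0 = 0.8
y = theta0 + 0.5 * rng.standard_normal((N, 1)) + 0.5 * rng.standard_normal((N, n))

theta_star = np.empty(B)
for b in range(B):
    # Stage 1: Dirichlet(1) weights w_i within each subject; here the
    # "fit" minimizing Q_i* reduces to a weighted mean of subject i's data.
    w = rng.dirichlet(np.ones(n), size=N)      # shape (N, n), rows sum to 1
    theta_i_star = (w * y).sum(axis=1)         # per-subject recycled estimates
    # Stage 2: Dirichlet(1) weights u across the N subjects.
    u = rng.dirichlet(np.ones(N))
    theta_star[b] = u @ theta_i_star

# Percentile CI from the recycled sampling distribution
lo, hi = np.quantile(theta_star, [0.025, 0.975])
```

In the actual procedure, each `theta_i_star` would come from a numerical minimization of the weighted criterion \(Q_i^*({\varvec{\theta }})\) in (8) rather than from this closed-form stand-in.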

Table 4 demonstrates the asymptotic results of Sect. 4. Table 5 provides Coverage Percentages with average confidence interval lengths, with the random weights set to be Multinomial, Dirichlet or Exponential distributed. From these results we can see that as N and n both increase, the Coverage Percentages converge to 0.95, as expected (see Corollary 5). Also notice that the Coverage Percentages derived from the recycled STS are more accurate (closer to 0.95) than those from the asymptotic result, especially when n and N are small.

We further considered the case when n is even smaller. Table 6 provides Coverage Percentages and average confidence interval lengths when \(n=10\) for the cases of Multinomial, Dirichlet or Exponential distributed random weights. As can be seen, in these cases our procedure produces reasonable results. However, we must point out that the effects of a small sample size on our procedure depend also on the dimensionality, p, of the parameter \({\varvec{\theta }}\), on the "nature" of the non-linear regression function \(\mathbf{f}_i(\cdot )\) and its gradients, and on the particular minimization (optimization) algorithm used on \(Q_i({\varvec{\theta }})\) in (4) and on \(Q_i^*({\varvec{\theta }})\) in (8). Clearly, further numerical experimentation could be instructive in these regards.

Table 4 Simulated Coverage Percentage of the CI for the truncated Normal error-terms/effects with \(\sigma =1, \lambda =1\)
Table 5 Coverage Percentage of the CI for the truncated Normal error-terms/effects with \(\sigma =1, \lambda =1\) and with Multinomial random weights
Table 6 Coverage Percentage of the CI for the truncated Normal error-terms/effects with \(\sigma =0.05, \lambda =1\) and with different choices of random weights

5.3 An example – the Theophylline data

We illustrate our proposed recycled two-stage estimation procedure with the Theophylline data set (widely available as the Theoph data set in R; R Core Team (2020)). This well-known data set provides the concentration-time profiles (see Fig. 1) as were obtained in the pharmacokinetic study of the anti-asthmatic agent Theophylline, reported by Boeckmann et al. (1994) and subsequently analyzed by Davidian and Giltinan (1995) (NLME), Kurada and Chen (2018) (NLMIXED), as well as by Adeniyi et al. (2018). In this experiment, the drug was administered orally to \(N=12\) subjects, and serum concentrations were measured at 11 time points per subject over the subsequent 25 hours. However, as in Davidian and Giltinan (1995), we also excluded here the zero time point from the analysis to simplify the modeling of the within-subject mean-variance relationship.

For the analysis, a one-compartment version of the model in (13), in which \({\eta _1=-\eta _3}\), was fitted to the data. The resulting pharmacokinetic model is described by the three parameters \({\varvec{\theta }}_i=(K_{a_i}, K_{e_i}, Cl_i)^\prime \) (with \( K_{a_i}> K_{e_i}\)), representing the absorption rate (1/hr), the elimination rate (1/hr) and the clearance (L/hr), for each of the N individuals under study. Often, however, the model parametrization is given in terms of the compartmental volume, V, where \(V=Cl/K_e\) (L). Thus, the mean concentration at time \(t_{ij}\), \((i=1,\dots ,N, j=1,\dots ,n)\), following a single dose of size \(d_0\) administered at time \(t_{i1}\) to the i-th individual, \((i=1,\dots ,N)\), is,

$$\begin{aligned} f(t_{ij}; \, {\varvec{\theta }}_i)\equiv {d_0 K_{a_i}K_{e_i}\over Cl_i(K_{a_i}-K_{e_i})}(\exp (-K_{e_i}t_{ij})-\exp (-K_{a_i}t_{ij})). \end{aligned}$$
(15)
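For concreteness, the mean-concentration function in (15) can be coded directly; the parameter values in the usage lines below are illustrative only, not the fitted estimates for these data.

```python
import numpy as np

def one_compartment(t, ka, ke, cl, d0=1.0):
    """Mean concentration at times t after a single oral dose d0 at t = 0,
    for a one-compartment model with first-order absorption (Eq. 15);
    requires ka > ke > 0 and clearance cl > 0."""
    t = np.asarray(t, dtype=float)
    return (d0 * ka * ke) / (cl * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

# Illustrative (not fitted) values: ka = 1.5 /hr, ke = 0.08 /hr, cl = 0.04 L/hr
times = np.linspace(0.25, 24.0, 10)
conc = one_compartment(times, ka=1.5, ke=0.08, cl=0.04)
```

Since \(K_{a_i}>K_{e_i}\), the concentration rises from zero, peaks, and then decays at the slower elimination rate, matching the profiles in Fig. 1.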

The statistical model accounts for the errors between the true and the observed drug concentrations, as well as for the inter-individual variability in the model's parameters. To deal with the first, it is assumed that for each \(i = 1, ..., N\),

$$\begin{aligned} y_{ij}=f(t_{ij};\, {\varvec{\theta }}_i)+\epsilon _{ij} \end{aligned}$$

where \(y_{ij}\) is the observed \(j^{th}\) drug concentration of the \(i^{th}\) individual, obtained at time \(t_{ij}\), and where the \(\epsilon _{ij}\) are some i.i.d. random error terms with mean 0 and variance \(\sigma _\epsilon ^2\). Here \(\sigma _\epsilon ^2\) is assumed to be the only intra-individual random-effect parameter of concern. Similarly, for modeling the inter-individual variability in the parameters, we assume that \(\varvec{\beta _i} := \log (\varvec{\theta _i})\equiv (lK_{a} , lK_{e} , lCl)^\prime \) represent some random effects with \(E(\varvec{\beta _i})=\varvec{\beta _0}\equiv (lK_{a0} , lK_{e0} , lCl_0)^\prime \) and with \(Var(\varvec{\beta _i})=\mathbf{D}\equiv diag(\sigma ^2_{lK_a} ,\sigma ^2_{lK_e} ,\sigma ^2_{lCl})\), for each \(i=1,...,N\). Accordingly, \(\varvec{\theta }_0\equiv \exp (\varvec{\beta _0}) = (K_{a0} , K_{e0} , Cl_0 )^\prime \) represents the fixed-effect population parameter. In all, there are seven population parameters, namely: \(K_{a0} , K_{e0} , Cl_0\), \(\sigma ^2_{lK_a} ,\sigma ^2_{lK_e} ,\sigma ^2_{lCl}\) and \(\sigma _\epsilon \). Because of the logarithmic scale, all these standard deviations are dimensionless quantities and may be regarded as approximate coefficients of variation. We emphasize that, unlike the other cited approaches (namely NLME and NLMIXED), our modeling here does not depend on any specific distributional assumption (i.e., normality) for the random effects, \(\varvec{\beta _i}\), nor for the error terms, \(\varvec{\epsilon _i}\).

In fact, after standardization to a unit dose (so that \(d_0=1\) in (15)), the data on each individual may be viewed as consisting of 10 observations. Using the Dirichlet(1) weights in \(B=1000\) iterations, we obtained the recycled two-stage estimates \(\hat{{\varvec{\theta }}}^*_{{RTS}}\) of \(\varvec{\theta }_0=E({\varvec{\theta }}_i)\) as well as the STS estimates, \(\hat{{\varvec{\theta }}}_{{STS}}\), for these data. The results are presented in Table 7, which also provides the estimates for the variance components in \(\mathbf{D}\) and in \(\mathbf{D}^*\), as well as the \(95\%\) confidence intervals for the fixed parameters \(K_{e0}, K_{a0}\) and \(Cl_0\), as were obtained directly from the corresponding recycled sampling distributions. For the sake of comparison, we also provide in Table 8 the results of the NLME estimation procedure as in Lindstrom and Bates (1990) (using the nlme R package) and those obtained from the NLMIXED estimation procedure as were reported by Kurada and Chen (2018). We point out again that, while the results in Tables 7 and 8 are largely similar, the estimation procedures utilized in Table 8 (NLME and NLMIXED) hinge on the normality assumptions for the random-effect terms (both within and between). In contrast, the results presented in Table 7 using our recycled two-stage estimation procedure are entirely free of such specific distributional assumptions.

Table 7 The Recycled STS estimation from the PK-data on Theophylline
Table 8 Estimation Result of the PK-data on Theophylline using NLME and NLMIXED (published) procedures

6 Summary and discussion

We considered the general random weights approach as a viable re-sampling technique in the case of hierarchical nonlinear regression models involving fixed and random effects. We revisited the Standard Two-Stage (STS) estimation procedure for the population parameters, say \(\varvec{\theta }_0\), appropriate in this hierarchical nonlinear regression setting. While intuitively appealing, this STS approach was studied in the literature primarily via simulations and with an underlying normality assumption. Here, we first established the asymptotic consistency and the asymptotic normality of the STS estimator, \(\hat{{\varvec{\theta }}}_{{STS}}\), in the more general context. Our rigorous results, as stated in Theorems 1 and 2, do not hinge on any specific distributional assumptions (e.g., normality) on the random component terms in the model (both error terms and random effects); rather, they are obtained largely under minimal moment assumptions. Next, we presented the recycled (or re-sampled) version, \(\hat{{\varvec{\theta }}}^*_{{RTS}}\), of the STS estimates, \(\hat{{\varvec{\theta }}}_{{STS}}\), in this hierarchical nonlinear regression context and established its applicability under a general random weighting scheme (Assumption W). In Theorems 3 and 4 we established the consistency and asymptotic normality of the corresponding re-sampled estimator, \(\hat{{\varvec{\theta }}}^*_{{RTS}}\). These results enable us to use the recycled sampling distribution of \(\hat{{\varvec{\theta }}}^*_{{RTS}}\), as generated by the re-sampling procedure using the random weights technique, to approximate the actual, though unknown, sampling distribution of the STS estimator, \(\hat{{\varvec{\theta }}}_{{STS}}\) (see Corollary 5), thereby allowing us to validly assess the sampling properties of \(\hat{{\varvec{\theta }}}_{{STS}}\), such as precision and coverage probabilities, based on the re-sampled (via the random weights) data.
Toward that end, we augmented our rigorous theoretical results with a detailed simulation study (covering various sample sizes) illustrating the properties of the estimators \(\hat{{\varvec{\theta }}}_{{STS}}\) and \(\hat{{\varvec{\theta }}}^*_{{RTS}}\) under various scenarios involving normal as well as non-normal error terms and utilizing different choices of random weights (Multinomial, Dirichlet and Exponential). Clearly, the effects of the choice of random weights on the numerical minimization (optimization) procedures used by various software will depend also on the non-linear regression function, its curvature, and the number of data points used. However, this choice could be instructed by experimentation. Additionally, we provided a detailed application of our two-stage recycled estimation procedure to the data of the Theophylline study, and provided a comparison with the (normality-based) estimation procedures NLME and NLMIXED. This real-data example, with \(N=12\) and \(n=10\), also illustrates the applicability of our approach even to data involving small sample sizes. In any case, we believe that the gamut of results presented here, both theoretical and numerical, is indicative of the potential and promise of the random weighting recycled (re-sampled) STS estimation procedure for other, more complex hierarchical non-linear regression models involving more structured mixed-effects parameters; for instance, extensions to cases in which (2) is generalized to \({\varvec{\theta }}_i=\mathbf{A}_i\varvec{\theta }_0+\mathbf{B}_i\mathbf{b}_i\), where \(\mathbf{A}_i, \ \mathbf{B}_i\) are some design matrices. However, for the sake of scope and space, this and other related issues will have to be pursued elsewhere.