1 Estimating equations and empirical likelihood

Maximum-likelihood and least-squares estimation methods are two fundamental pillars of the modern statistical sciences. Suppose that \((y_1,\dots ,y_n)\) is an independent and identically distributed (iid) sample from a random variable Y with an assumed parametric distribution \(f(y;\theta )\). Under certain regularity conditions, the maximum-likelihood estimator \({\hat{\theta }}\) of \(\theta \), which maximizes the likelihood function \(L(\theta ) = \prod _{i=1}^nf(y_i;\theta )\), is the solution to the score equations:

$$\begin{aligned} \frac{\partial }{\partial \theta } \log L(\theta ) = \sum _{i=1}^n \frac{\partial }{\partial \theta }\log f(y_i;\theta ) = \mathbf{0 }. \end{aligned}$$
(1)

When the response variable \(y_i\) is related to a vector of covariates \(\mathbf{x }_i\) and the main objective is to explore relations between y and \(\mathbf{x }\), a semiparametric regression model can be specified through the first two conditional moments \(E_{\xi }(y_i \mid \mathbf{x }_i) = \mu (\mathbf{x }_i;\theta )\) and \(V_{\xi }(y_i \mid \mathbf{x }_i) = v_i\sigma ^2\), where \(\mu (\mathbf{x }_i;\theta )\) is the mean function, which can be linear or nonlinear in the vector of parameters \(\theta \), and \(v_i\) are known constants which might depend on the given \(\mathbf{x }_i\). The notations \(E_{\xi }(\cdot )\) and \(V_{\xi }(\cdot )\) refer to expectation and variance under the assumed semiparametric model, \(\xi \). The weighted least-squares estimator \({{\hat{\theta }}}\) of \(\theta \), which minimizes the weighted sum of squares of residuals \(Q(\theta ) = \sum _{i=1}^n\{y_i-\mu (\mathbf{x }_i;\theta )\}^2/v_i\), is the solution to the normal equations:

$$\begin{aligned} \frac{\partial }{\partial \theta } Q(\theta ) = -2 \sum _{i=1}^n \mathbf{D }(\mathbf{x }_i;\theta ) v_i^{-1}\{y_i-\mu (\mathbf{x }_i;\theta )\} = \mathbf{0 }, \end{aligned}$$
(2)

where \(\mathbf{D }(\mathbf{x }_i;\theta ) = \partial \mu (\mathbf{x }_i;\theta )/\partial \theta \). For linear regression models where \(\mu (\mathbf{x }_i;\theta ) = \mathbf{x }_i'\theta \), we have \(\mathbf{D }(\mathbf{x }_i;\theta ) = \mathbf{x }_i\). For generalized linear models with \(\mu _i = \mu (\mathbf{x }_i;\theta ) = \mu (\mathbf{x }_i'\theta ) \) and \(v_i = v(\mu _i)\), where \(\mu (\cdot )\) is the inverse of the link function and \(v(\cdot )\) is a variance function, the solution to (2) is called the quasi-maximum-likelihood estimator of \(\theta \) (McCullagh and Nelder 1983).
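For the linear mean function, the normal equations (2) have the familiar closed-form weighted least-squares solution. A minimal sketch in Python; the function name and the simulated data are ours, purely for illustration:

```python
import numpy as np

# Closed-form weighted least-squares solution of the normal equations (2)
# for the linear mean function mu(x; theta) = x'theta, so D(x; theta) = x.
def wls(y, X, v):
    """Solve sum_i x_i {y_i - x_i' theta} / v_i = 0 for theta."""
    w = 1.0 / np.asarray(v)
    XtWX = X.T @ (w[:, None] * X)
    XtWy = X.T @ (w * y)
    return np.linalg.solve(XtWX, XtWy)

# Simulated data (illustrative): intercept 1, slope 2, unit variances v_i = 1.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta0 = np.array([1.0, 2.0])
v = np.full(n, 1.0)
y = X @ theta0 + rng.normal(size=n)
theta_hat = wls(y, X, v)
```

At the solution, the weighted residuals are orthogonal to the columns of the design matrix, which is exactly the statement of (2).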

The score Eq. (1) and the normal equations (2) can be unified through a common form:

$$\begin{aligned} \mathbf{G }_n(\theta ) = \frac{1}{n}\sum _{i=1}^n \mathbf{g }(y_i,\mathbf{x }_i; \theta ) = \mathbf{0 }, \end{aligned}$$
(3)

where the estimating functions \(\mathbf{g }(y,\mathbf{x }; \theta )\) are unbiased, i.e., \(E_{\xi }\{\mathbf{g }(y,\mathbf{x }; \theta _0)\}=\mathbf{0 }\) under the assumed model, \(\xi \), where \(\theta _0\) denotes the true value of \(\theta \). The factor 1/n in (3) is redundant but is included so that the asymptotic order of \(\mathbf{G }_n(\theta _0)\) is \(O_p(n^{-1/2})\). Godambe (1960) was the first to study the optimality properties of the score functions given in (1). Some early results on theoretical and applied aspects of estimating functions were collected in the book edited by Godambe (1991).
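In practice, (3) is solved numerically unless a closed form exists. As a toy sketch (simulated data, our own variable names), consider the scalar case \(g(y;\theta ) = y - \theta \), whose solution is the sample mean:

```python
import numpy as np
from scipy.optimize import brentq

# Numerical solution of the just-identified estimating equation G_n(theta) = 0
# in (3), for the scalar example g(y; theta) = y - theta. The exact root is
# the sample mean, which lets us verify the numerical solver.
rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=200)   # illustrative data

def G_n(theta):
    return np.mean(y - theta)

# G_n is decreasing in theta and changes sign on [min(y), max(y)].
theta_hat = brentq(G_n, y.min(), y.max())
```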

The general theory of estimating equations has much broader scope than the maximum-likelihood and the least-squares methods. Let \(\mathbf{g }(y,\mathbf{x }; \theta )\) be an \(r\times 1\) vector of estimating functions; let \(\theta \) be a \(k\times 1\) vector of unknown parameters; let \(\mathbf{G }(\theta ) = E_{\xi }\{\mathbf{g }(y,\mathbf{x }; \theta )\}\) under the assumed model, \(\xi \). The true value \(\theta _0\) of the vector parameter satisfies \(\mathbf{G }(\theta _0) = \mathbf{0 }\). The over-identified scenarios with \(r>k\) are often of interest and will be discussed in detail in the paper. For just-identified cases where \(r=k\), the so-called m-estimator \({\hat{\theta }}\) of \(\theta _0\) based on the random sample \(\{(y_i,\mathbf{x }_i),i=1,\dots ,n\}\) is the solution to \(\mathbf{G }_n(\theta ) = \mathbf{0 }\) as specified by the estimating equations (3). Theoretical properties of the m-estimators with independent samples can be found in Newey and McFadden (1994), van der Vaart (2000), and Tsiatis (2006). Under-identified scenarios with \(r<k\) are not of interest for this paper.

Empirical likelihood is one of the major statistical advances of the past 30 years. It was first proposed by Owen (1988) for iid samples. While the development of empirical likelihood has been the collective effort of many contributors, as evidenced in the book by Owen (2001), there were two major milestones that established the approach as a general inference tool. The first major milestone was the result proved by Owen (1988), analogous to Wilks’ theorem for parametric models, showing that the nonparametric empirical likelihood ratio statistic has a \(\chi ^2\) limiting distribution. Let \(\mathbf{p } = (p_1,\dots ,p_n)\) be the discrete probability measure over the iid sample \((y_1,\dots ,y_n)\) from a random variable Y with mean \(\mu _0 = E_{\xi }(Y)\). The distribution of Y based on the sample data is represented by \(F_n(t) = \sum _{i=1}^n p_i I(y_i\le t)\), \(t\in (-\infty ,\infty )\), which is the empirical likelihood estimator of \(F(t) = P(Y\le t)\). The maximum value of the empirical likelihood function \(L(\mathbf{p }) = \prod _{i=1}^np_i\) under the normalization constraint:

$$\begin{aligned} \sum _{i=1}^np_i = 1 \;\;\;\; (p_i \ge 0) \end{aligned}$$
(4)

is achieved at \({\hat{p}}_i=n^{-1}\), \(i=1,\dots ,n\). The maximum empirical likelihood estimator of F(t) reduces to \(F_n(t) = n^{-1}\sum _{i=1}^n I(y_i\le t)\), the customary empirical distribution of Y. Let \({\hat{p}}(\mu )\) be the maximizer of \(L(\mathbf{p })\) under the normalization constraint (4) and the constraint induced by the parameter of interest:

$$\begin{aligned} \sum _{i=1}^np_i y_i= \mu \end{aligned}$$
(5)

for a given \(\mu \). Owen (1988) showed that, under mild moment conditions on Y, the empirical likelihood ratio statistic \(r(\mu ) = -2\{\log L(\hat{\mathbf{p }}(\mu )) - \log L(\hat{\mathbf{p }})\}\) converges in distribution to a \(\chi ^2\) random variable with one degree of freedom when \(\mu =\mu _0\).
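The profile computation behind \(r(\mu )\) can be sketched as follows. Standard Lagrange-multiplier arguments give the maximizer under (4) and (5) as \({\hat{p}}_i(\mu ) = 1/[n\{1+\lambda (y_i-\mu )\}]\), with \(\lambda \) solving \(\sum _i (y_i-\mu )/\{1+\lambda (y_i-\mu )\} = 0\). The Python sketch below uses simulated data and our own function name; it assumes \(\mu \) lies strictly between the sample minimum and maximum:

```python
import numpy as np
from scipy.optimize import brentq

# Profile empirical likelihood ratio r(mu) for a scalar mean (Owen 1988):
# p_i = 1 / {n (1 + lam (y_i - mu))}, with lam the Lagrange multiplier solving
# sum_i (y_i - mu) / {1 + lam (y_i - mu)} = 0. A sketch, not production code.
def el_ratio_mean(y, mu):
    z = y - mu                 # requires min(y) < mu < max(y)
    n = len(y)
    # lam must keep every p_i in (0, 1), i.e., 1 + lam * z_i > 1/n
    lo = (1.0 / n - 1.0) / z.max() + 1e-10
    hi = (1.0 / n - 1.0) / z.min() - 1e-10
    lam = brentq(lambda l: np.sum(z / (1.0 + l * z)), lo, hi)
    return 2.0 * np.sum(np.log1p(lam * z))

rng = np.random.default_rng(2)
y = rng.normal(loc=5.0, size=100)   # illustrative data with mu_0 = 5
r = el_ratio_mean(y, 5.0)           # approximately chi-square(1) at mu = mu_0
```

At \(\mu = {\bar{y}}\) the multiplier is \(\lambda = 0\), all \({\hat{p}}_i = n^{-1}\), and the ratio statistic is zero, which serves as a quick sanity check.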

The second major milestone is the paper by Qin and Lawless (1994) on combining empirical likelihood with general estimating equation theory for parameters defined through unbiased estimating functions. Suppose that the \(k\times 1\) vector of parameters \(\theta \) satisfies \(E_{\xi }\{\mathbf{g }(y,\mathbf{x }; \theta )\} = \mathbf{0 }\) when \(\theta = \theta _0\). The empirical likelihood function for \(\theta \) is computed as \(L(\hat{\mathbf{p }}(\theta ))\), where \(\hat{\mathbf{p }}(\theta ) = ({\hat{p}}_1(\theta ),\dots ,{\hat{p}}_n(\theta ))\) maximizes \(L(\mathbf{p })\) subject to the normalization constraint (4) and the parameter constraint given by:

$$\begin{aligned} \sum _{i=1}^np_i \, \mathbf{g }(y_i,\mathbf{x }_i; \theta ) = \mathbf{0 } \end{aligned}$$
(6)

for the given \(\theta \). The maximum empirical likelihood estimator \({\hat{\theta }}\) of \(\theta \) is obtained as the maximum point of \(L(\hat{\mathbf{p }}(\theta ))\). There are several impactful consequences from combining estimating equations with empirical likelihood. First, it provides a general approach for dealing with different inferential problems through estimating functions. Second, the \(r\times 1\) estimating functions \(\mathbf{g }(y,\mathbf{x }; \theta )\) can be over-identified (i.e., \(r>k\)), which becomes convenient for incorporating auxiliary information and known moment conditions through additional estimating equations. Third, it allows inferences on key parameters of interest while treating others as nuisance parameters. And finally, it opens the door for exploring other advanced inferential procedures such as variable selection and Bayesian analysis through empirical likelihood.

Historically, the same concept of empirical likelihood was first discussed in survey sampling under the name “scale-load approach” by Hartley and Rao (1968, 1969). They focused on point estimation and showed that a constrained maximization problem with the known population mean of the auxiliary variable used in a calibration equation leads to the maximum scale-load estimator which is asymptotically equivalent to the regression estimator. This result was later “re-discovered” by Chen and Qin (1993) using the empirical likelihood formulation of Owen (1988).

2 Design-based inference with survey data

A survey population consists of a finite number N of units. Values of the variables of interest are attached to the units, and it is assumed that these values are fixed for each unit and can be measured without error. Let \(\mathbf{S }\) be the set of n units in the survey sample selected by a probability sampling method. Let \(\{(y_i,\mathbf{x }_i),i\in \mathbf{S }\}\) be the survey dataset. We assume that the first-order and the second-order inclusion probabilities \(\pi _i\) and \(\pi _{ij}\) are available, and the survey design leads to a fixed sample size n. Let \(d_i=1/\pi _i\) be the basic survey design weights, \(i\in \mathbf{S }\).

Traditional design-based estimation with survey data focuses on descriptive finite population parameters such as the population mean \(\mu _y = N^{-1}\sum _{i=1}^Ny_i\), the finite population distribution function \(F_{{\scriptscriptstyle N}}(t) = N^{-1}\sum _{i=1}^NI(y_i\le t)\) where \(I(\cdot )\) is the indicator function, and the \(100\alpha \)th finite population quantile \(t_{\alpha } = F_{{\scriptscriptstyle N}}^{-1}(\alpha ) = \inf \{t\mid F_{{\scriptscriptstyle N}}(t) \ge \alpha \}\) with \(\alpha \in (0,1)\). The finite population and the finite population parameters are viewed as fixed, and randomization is induced by the probability sampling design for selecting the survey sample. The survey weighted estimator of \(\mu _y\) is given by \({\hat{\mu }}_y = \sum _{i\in \mathbf{S }}d_iy_i / \sum _{i\in \mathbf{S }}d_i\), and the estimator of \(F_{{\scriptscriptstyle N}}(t)\) for a given t has the same form as \({\hat{\mu }}_y\) but with \(y_i\) replaced by \(I(y_i\le t)\).
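These survey weighted (Hájek) estimators can be sketched in a few lines. The five data points and weights below are made up purely for illustration:

```python
import numpy as np

# Survey weighted (Hajek) estimators of the population mean and of the finite
# population distribution function, with design weights d_i = 1/pi_i.
def hajek_mean(y, d):
    return np.sum(d * y) / np.sum(d)

def hajek_cdf(y, d, t):
    return np.sum(d * (y <= t)) / np.sum(d)

# Illustrative sample of n = 5 units with design weights summing to 100.
y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
d = np.array([10.0, 20.0, 10.0, 20.0, 40.0])
mu_hat = hajek_mean(y, d)       # 310 / 100 = 3.1
F_hat = hajek_cdf(y, d, 3.0)    # weight of {y <= 3} is 50 out of 100
```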

Most descriptive finite population parameters can be defined through a (single) census estimating equation in the general form of:

$$\begin{aligned} \mathbf{G }_{{\scriptscriptstyle N}}(\theta ) = \frac{1}{N}\sum _{i=1}^N \mathbf{g }(y_i,\mathbf{x }_i; \theta ) = \mathbf{0 }\,. \end{aligned}$$
(7)

The factor \(N^{-1}\) used in (7) as well as (8) below is for the convenience of asymptotic development and is not required for computations. The finite population mean \(\theta _{{\scriptscriptstyle N}} = \mu _y\) corresponds to \(\mathbf{g }(y_i,\mathbf{x }_i; \theta ) = y_i - \theta \). The \(100\alpha \)th finite population quantile \(\theta _{{\scriptscriptstyle N}} = t_{\alpha }\) is defined through \(\mathbf{g }(y_i,\mathbf{x }_i; \theta ) = I(y_i \le \theta ) - \alpha \). For the population quantiles, the estimating function is not continuous in \(\theta \) and the equation \(\mathbf{G }_{{\scriptscriptstyle N}}(\theta ) = 0\) may not hold exactly for any \(\theta \). An alternative solution can be defined through \(\theta _{{\scriptscriptstyle N}} = \inf \{ \theta \mid \mathbf{G }_{{\scriptscriptstyle N}}(\theta ) \ge 0\}\), which satisfies \(\mathbf{G }_{{\scriptscriptstyle N}}(\theta _{{\scriptscriptstyle N}}) = O(N^{-1})\). This modification to (7) does not change the asymptotic results on \(\theta _{{\scriptscriptstyle N}}\). The design-based estimator \({\hat{\theta }}\) of \(\theta _{{\scriptscriptstyle N}}\) can be obtained as the solution to the survey weighted estimating equation:

$$\begin{aligned} \mathbf{G }_{n}(\theta ) = \frac{1}{N}\sum _{i\in \mathbf{S }} d_i \mathbf{g }(y_i,\mathbf{x }_i; \theta ) = \mathbf{0 }\,. \end{aligned}$$
(8)

Design-based estimation of finite population parameters can be carried out under the unified framework of estimating equations as specified by (7) and (8).
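For the quantile example, the survey weighted analogue of \(\theta _{{\scriptscriptstyle N}} = \inf \{\theta \mid \mathbf{G }_{{\scriptscriptstyle N}}(\theta ) \ge 0\}\) amounts to inverting the survey weighted distribution function. A small sketch (the function name and the data are ours):

```python
import numpy as np

# Design-based quantile estimation: with g(y; theta) = I(y <= theta) - alpha,
# the sample analogue of theta_N = inf{theta : G_N(theta) >= 0} is the
# smallest y-value at which the Hajek CDF reaches alpha.
def weighted_quantile(y, d, alpha):
    order = np.argsort(y)
    y_sorted = y[order]
    cdf = np.cumsum(d[order]) / np.sum(d)         # Hajek CDF at sorted points
    return y_sorted[np.searchsorted(cdf, alpha)]  # first y with F_hat >= alpha

y = np.array([7.0, 2.0, 9.0, 4.0, 5.0])
d = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
med = weighted_quantile(y, d, 0.5)   # equal weights: the sample median
```

With unequal weights the estimated quantile shifts toward heavily weighted units, as one would expect from the definition.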

Large-scale complex survey data are often used for inferences on model parameters. This is the so-called analytic use of survey data. The survey variables \((y,\mathbf{x })\) are assumed to follow a model, called the superpopulation model, denoted as \(\xi \). The model parameters \(\theta \) may be defined through a set of unbiased estimating functions, i.e., \(\mathbf{G }(\theta ) = E_{\xi }\{\mathbf{g }(y,\mathbf{x }; \theta )\} = \mathbf{0 }\). When a conditional model of y given \(\mathbf{x }\) is used, the model parameters can be specified through \(E_{\xi }\{\mathbf{g }(y,\mathbf{x }; \theta ) \mid \mathbf{x }\} = \mathbf{0 }\). One of the statistical questions is how to make inferences on the model parameters \(\theta \) using a probability survey sample \(\mathbf{S }\) selected from a particular finite population. Godambe and Thompson (1986, 2009) proposed to focus on finite population parameters \(\theta _{{\scriptscriptstyle N}}\) defined through census estimating equations using design-based methods. If the superpopulation model holds for the survey population and the population size N is large, inferences on \(\theta _{{\scriptscriptstyle N}}\) are essentially the same as for the model parameters \(\theta \). If the finite population does not follow the model \(\xi \), the finite population parameters \(\theta _{{\scriptscriptstyle N}}\) are well defined and may still be of interest for the survey population. Design-based inferences remain valid for the latter cases. We consider two practically important scenarios of the analytic use of survey data.

Linear regression analysis. Suppose that the study variable y and a set of covariates \(\mathbf{x }\) are measured for all units in the survey sample, \(\mathbf{S }\). For notational simplicity and without loss of generality, we assume that the vector \(\mathbf{x }\) contains 1 as its first component. The linear regression model is assumed to hold for the finite population, i.e., \(y_i = \mathbf{x }_i'\beta + \varepsilon _i\), \(i=1,\dots ,N\), where the \(\varepsilon _i\)’s are iid with \(E_{\xi }(\varepsilon _i)=0\) and \(V_{\xi }(\varepsilon _i) = \sigma ^2\). Here, \(\beta \) and \(\sigma ^2\) are the superpopulation parameters. The estimating functions for \(\beta \) under the least-squares estimation framework are given by \(\mathbf{g }(y,\mathbf{x }; \beta ) = \mathbf{x }(y - \mathbf{x }'\beta )\). The finite population regression coefficients \(\beta _{{\scriptscriptstyle N}}\) are the solution to \(\sum _{i=1}^N \mathbf{x }_i(y_i - \mathbf{x }_i'\beta ) = \mathbf{0 }\), which leads to the closed form expression \(\beta _{{\scriptscriptstyle N}} = \big (\sum _{i=1}^N\mathbf{x }_i\mathbf{x }_i'\big )^{-1}\sum _{i=1}^N\mathbf{x }_iy_i\). This is the least-squares estimator of the model parameters \(\beta \) if we treat the finite population as an iid sample of size N from the linear regression model. The survey weighted estimator \({\hat{\beta }}\) is the solution to \(\sum _{i\in \mathbf{S }} d_i \mathbf{x }_i(y_i - \mathbf{x }_i'\beta ) = \mathbf{0 }\), and is given by \({\hat{\beta }} = \big (\sum _{i\in \mathbf{S }} d_i \mathbf{x }_i\mathbf{x }_i'\big )^{-1}\sum _{i\in \mathbf{S }} d_i \mathbf{x }_iy_i\).
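The closed-form survey weighted estimator can be sketched directly. The data and design weights below are simulated for illustration only:

```python
import numpy as np

# Survey weighted least-squares coefficients in closed form:
# beta_hat = (sum_S d_i x_i x_i')^{-1} sum_S d_i x_i y_i.
def svy_lm(y, X, d):
    XtDX = X.T @ (d[:, None] * X)
    XtDy = X.T @ (d * y)
    return np.linalg.solve(XtDX, XtDy)

rng = np.random.default_rng(3)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # 1 as first component
d = rng.uniform(1.0, 5.0, size=n)                       # illustrative weights
beta0 = np.array([0.5, -1.5])
y = X @ beta0 + rng.normal(size=n)
beta_hat = svy_lm(y, X, d)
```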

The linear regression model \(\xi \) may not hold for the finite population from which the survey sample is selected. This can happen, for instance, if crucial covariates are not measured in the survey or if the true relation involves higher order or interaction terms omitted from the model. However, the finite population regression coefficients \(\beta _{{\scriptscriptstyle N}}\) are still meaningful parameters for the survey population and the design-based estimator \({\hat{\beta }}\) remains consistent for \(\beta _{{\scriptscriptstyle N}}\).

Logistic regression analysis. Suppose that the study variable y is binary and \(p_i = P(y_i =1 \mid \mathbf{x }_i)\) depends on \(\mathbf{x }_i\) through the logit link function, i.e., \(p_i = p(\mathbf{x }_i'\beta ) = 1 - \big \{1+\exp (\mathbf{x }_i'\beta )\big \}^{-1}\). The estimating functions for the model parameters \(\beta \) under the quasi-maximum likelihood framework of (2) with \(v_i = p_i(1-p_i)\) are given by \(\mathbf{g }(y,\mathbf{x }; \beta ) = \mathbf{x }\{y - p(\mathbf{x }'\beta )\}\). The finite population regression coefficients \(\beta _{{\scriptscriptstyle N}}\) under the assumed logistic regression model are the solution to \(\sum _{i=1}^N \mathbf{x }_i\{y_i - p(\mathbf{x }_i'\beta )\} = \mathbf{0 }\), which does not have a closed form expression. The design-based estimator \({\hat{\beta }}\) of \(\beta _{{\scriptscriptstyle N}}\) is the solution to \(\sum _{i\in \mathbf{S }} d_i \mathbf{x }_i\{y_i - p(\mathbf{x }_i'\beta )\} = \mathbf{0 }\). Finding the solution requires an iterative computational procedure.
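One standard choice for the iterative procedure is Newton–Raphson applied to the survey weighted score. A minimal sketch on simulated data (the function name and data are ours, not from any survey package):

```python
import numpy as np

# Newton-Raphson iterations for the survey weighted logistic score equations
# sum_S d_i x_i {y_i - p(x_i' beta)} = 0; each step uses the weighted
# information matrix sum_S d_i p_i (1 - p_i) x_i x_i'.
def svy_logit(y, X, d, n_iter=25):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (d * (y - p))
        info = X.T @ ((d * p * (1.0 - p))[:, None] * X)
        beta = beta + np.linalg.solve(info, score)
    return beta

rng = np.random.default_rng(4)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta0 = np.array([-0.5, 1.0])
prob = 1.0 / (1.0 + np.exp(-X @ beta0))
y = (rng.uniform(size=n) < prob).astype(float)
d = rng.uniform(1.0, 3.0, size=n)          # illustrative design weights
beta_hat = svy_logit(y, X, d)
```

A fixed iteration count keeps the sketch short; a production implementation would instead stop when the weighted score is sufficiently close to zero.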

When the estimating functions \(\mathbf{g }(y,\mathbf{x }; \theta )\) are differentiable in \(\theta \) and the estimating equation system is just-identified (i.e., \(r=k\)), the design-based estimator \({\hat{\theta }}\) obtained by solving (8) is design-consistent for \(\theta _{{\scriptscriptstyle N}}\) with the design-based variance–covariance matrix given by the sandwich form (Binder 1983):

$$\begin{aligned} V_p\big ({\hat{\theta }}\big ) = \{\mathbf{H }_{{\scriptscriptstyle N}}(\theta _{{\scriptscriptstyle N}})\}^{-1} V_p\big \{\mathbf{G }_n\big (\theta _{{\scriptscriptstyle N}}\big )\big \} \{\mathbf{H }_{{\scriptscriptstyle N}}'(\theta _{{\scriptscriptstyle N}})\}^{-1} \,, \end{aligned}$$
(9)

where

$$\begin{aligned} \mathbf{H }_{{\scriptscriptstyle N}}(\theta ) = \frac{\partial }{\partial \theta } \mathbf{G }_{{\scriptscriptstyle N}}(\theta ) = \frac{1}{N}\sum _{i=1}^N \frac{\partial }{\partial \theta } \mathbf{g }(y_i,\mathbf{x }_i; \theta )\,, \end{aligned}$$
(10)

and \(V_p(\cdot )\) denotes variance under the probability sampling design. The design-based variance estimator is computed as:

$$\begin{aligned} v_p\big ({\hat{\theta }}\big ) = \{\mathbf{H }_n({\hat{\theta }})\}^{-1} v_p\big \{\mathbf{G }_n\big ({\hat{\theta }}\big )\big \} \{\mathbf{H }_n'({\hat{\theta }})\}^{-1} \,, \end{aligned}$$

where

$$\begin{aligned} \mathbf{H }_n(\theta ) = \frac{\partial }{\partial \theta } \mathbf{G }_n(\theta ) = \frac{1}{N}\sum _{i\in \mathbf{S }} d_i \Big \{ \frac{\partial }{\partial \theta } \mathbf{g }(y_i,\mathbf{x }_i; \theta )\Big \} \,, \end{aligned}$$
(11)

and \(v_p\big \{\mathbf{G }_n\big ({\hat{\theta }}\big )\big \}\) is the design-based estimator of the variance–covariance matrix for the Horvitz–Thompson estimator \(\mathbf{G }_n\big (\theta \big )\) evaluated at \(\theta = {\hat{\theta }}\).
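Putting the pieces together for the linear regression case gives the following sketch. The design-based variance of \(\mathbf{G }_n\) is approximated here by the familiar with-replacement formula; this is an assumption made for the illustration, since exact variance estimators depend on the actual sampling design, and the data and weights are simulated:

```python
import numpy as np

# Plug-in sandwich variance estimator (Binder 1983) for the survey weighted
# linear regression coefficients, following (9)-(11). v_p{G_n} is replaced by
# a with-replacement approximation (an assumption for this sketch).
rng = np.random.default_rng(7)
n, N = 400, 100_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
d = rng.uniform(N / (2.0 * n), 2.0 * N / n, size=n)   # illustrative weights
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ (d[:, None] * X), X.T @ (d * y))
z = X * (d * (y - X @ beta_hat))[:, None] / N      # z_i = d_i g_i / N
H = -(X.T @ (d[:, None] * X)) / N                  # H_n(beta_hat), as in (11)
t = n * z                                          # G_n is the mean of the t_i
tc = t - t.mean(axis=0)
Sigma_hat = tc.T @ tc / (n * (n - 1.0))            # with-replacement v_p{G_n}
Hinv = np.linalg.inv(H)
V_hat = Hinv @ Sigma_hat @ Hinv.T                  # sandwich form (9)
```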

3 General inferential procedures

In this section, we discuss two empirical likelihood-based inferential problems for parameters \(\theta _{{\scriptscriptstyle N}}\) defined through the census estimating equations (7). We consider the general setting where \(r\ge k\) and the estimating functions \(\mathbf{g }(y,\mathbf{x }; \theta )\) can be smooth or non-differentiable. The asymptotic framework assumes that there is a sequence of finite populations and a sequence of probability survey samples, indexed by \(\nu \). Both the population size \(N_{\nu }\) and the sample size \(n_{\nu }\) go to infinity as \(\nu \rightarrow \infty \). All limiting processes are understood as \(\nu \rightarrow \infty \); see Fuller (2009) for further details. The index \(\nu \) will be dropped for notational simplicity and the limiting processes are denoted interchangeably as \(N\rightarrow \infty \) or \(n \rightarrow \infty \).

3.1 Empirical likelihood-based inferences with survey data

Standard empirical likelihood methods for independent sample data with parameters defined through estimating equations consist of three main components: (i) the empirical likelihood function \(L(\mathbf{p }) = \prod _{i=1}^np_i\); (ii) the normalization constraint (4); and (iii) the parameter constraints (6). When the methods are applied directly to survey data, the resulting estimator \({\hat{\theta }}\) is not design-consistent unless the sample is selected by simple random sampling. There are two possible modifications to make the methods applicable to survey data analysis. One is to modify the empirical likelihood function \(L(\mathbf{p })\) to take into account the survey design features, and the other is to use a survey weighted version for the parameter constraints.

Pseudo empirical likelihood methods. Chen and Sitter (1999) proposed to replace the empirical log-likelihood function \(\ell (\mathbf{p }) = \sum _{i=1}^n\log (p_i)\) by the pseudo empirical log-likelihood function \(\ell _{{\scriptscriptstyle PEL}}(\mathbf{p }) = \sum _{i\in \mathbf{S }}d_i \log (p_i)\) while keeping the normalization constraint (4) and the parameter constraints (6) unchanged. The method leads to design-consistent point estimators. Pseudo empirical likelihood ratio confidence intervals were discussed by Wu and Rao (2006) for a scalar parameter. Generalizations to vector parameters defined through estimating equations were given in Zhao and Wu (2019).

Sample empirical likelihood methods. The sample empirical likelihood was first briefly mentioned by Chen and Kim (2014) as an alternative approach to the population empirical likelihood methods discussed in their paper. The methods were formally studied by Zhao et al. (2019) and Zhao and Wu (2019). The sample empirical likelihood uses the same form \(L(\mathbf{p })\) as for iid data and the standard normalization constraint (4), but replaces the parameter constraints (6) by a survey weighted version. These methods also lead to design-consistent point estimators.

Our discussions for the rest of the paper are formulated under the sample empirical likelihood. The empirical likelihood methods discussed by Berger and De La Riva Torres (2016) and Oguz-Alper and Berger (2016) are also closely related to the sample empirical likelihood methods.

3.2 Point estimation

We first consider point estimation for finite population parameters \(\theta _{{\scriptscriptstyle N}}\) defined through the census estimating equations (7). The sample empirical log-likelihood function is given by \(\ell _{{\scriptscriptstyle SEL}}(\mathbf{p }) = \sum _{i\in \mathbf{S }}\log (p_i)\). The sample empirical likelihood function of \(\theta \) is defined as:

$$\begin{aligned} \ell _{{\scriptscriptstyle SEL}}(\theta ) = \ell _{{\scriptscriptstyle SEL}}\{\hat{\mathbf{p }}(\theta )\} = \sum _{i\in \mathbf{S }} \log \{{\hat{p}}_i(\theta )\}\,, \end{aligned}$$
(12)

where \(\hat{\mathbf{p }}(\theta ) = ({\hat{p}}_1(\theta ),\dots ,{\hat{p}}_n(\theta ))\) maximizes \(\ell _{{\scriptscriptstyle SEL}}(\mathbf{p }) = \sum _{i\in \mathbf{S }}\log (p_i)\) subject to the normalization constraint \(\sum _{i \in \mathbf{S }}p_{i} = 1\) and the survey weighted parameter constraints:

$$\begin{aligned} \sum _{i\in \mathbf{S }} p_i \{d_i \mathbf{g }(y_i,\mathbf{x }_i;\theta )\} = \mathbf{0 } \end{aligned}$$
(13)

for the given \(\theta \). The maximum sample empirical likelihood estimator \({\hat{\theta }}\) of \(\theta _{{\scriptscriptstyle N}}\) is the maximum point of \(\ell _{{\scriptscriptstyle SEL}}(\theta )\), i.e., \({\hat{\theta }} = \arg \max _{\theta \in \Theta } \ell _{{\scriptscriptstyle SEL}}(\theta )\), where \(\Theta \) is the parameter space.

The design-based validity of the maximum sample empirical likelihood estimator \({\hat{\theta }}\) can be informally justified by two special cases. When the estimating equation system (7) is just-identified (i.e., \(r=k\)), the global maximum of \(\ell _{{\scriptscriptstyle SEL}}(\mathbf{p })\) is achieved at \({\hat{p}}_i=n^{-1}\) for all \(i\in \mathbf{S }\), and the maximum sample empirical likelihood estimator \({\hat{\theta }}\) is the solution to the survey weighted estimating equations (8), which is design-consistent under suitable regularity conditions. A practically important over-identified estimating equation system arises from the use of known auxiliary population information in survey data analysis. Let \(\mathbf{g }(y,\mathbf{x },\mathbf{z }; \theta ) = (\mathbf{g }_1'(y,\mathbf{x };\theta ), \mathbf{g }_2'(\mathbf{z }))'\), where \(\mathbf{g }_1(y,\mathbf{x };\theta )\) are the \(k\times 1\) estimating functions for defining the \(k\times 1\) parameters \(\theta _{{\scriptscriptstyle N}}\), and \(\mathbf{g }_2(\mathbf{z })\) are \((r-k)\times 1\) estimating functions which do not involve the parameters \(\theta \) and satisfy the moment condition \(N^{-1}\sum _{i=1}^N\mathbf{g }_2(\mathbf{z }_i) = \mathbf{0 }\). For instance, we may have \(\mathbf{g }_2(\mathbf{z }_i) = \mathbf{z }_i - \mu _\mathbf{z }\) where the finite population means \( \mu _\mathbf{z }\) for the \(\mathbf{z }\) variables are known and can be used in benchmark constraints. The parameter constraints under the current setting are given by:

$$\begin{aligned} \sum _{i\in \mathbf{S }}p_i\{d_i \mathbf{g }(y_i,\mathbf{x }_i,\mathbf{z }_i; \theta )\} = \mathbf{0 } \,, \end{aligned}$$

which is an over-identified system. It can be shown that the maximum sample empirical likelihood estimator \({\hat{\theta }}\) solves the first part of the just-identified equations system:

$$\begin{aligned} \sum _{i\in \mathbf{S }}{\hat{p}}_i\{d_i \mathbf{g }_1(y_i,\mathbf{x }_i;\theta )\} = \mathbf{0 } \,, \end{aligned}$$
(14)

where \(\hat{\mathbf{p }} = ({\hat{p}}_1,\dots ,{\hat{p}}_n)\) is the maximizer of \(\ell _{{\scriptscriptstyle SEL}}(\mathbf{p })\) under the normalization constraint (4) and the benchmark constraints (second part of the equations system):

$$\begin{aligned} \sum _{i\in \mathbf{S }} p_i \{d_i \mathbf{g }_2(\mathbf{z }_i)\} = \mathbf{0 } \,. \end{aligned}$$

The combined components \({\hat{p}}_id_i\) can be viewed as the calibration weights, and the solution \({\hat{\theta }}\) to the estimating equations (14) is design-consistent for \(\theta _{{\scriptscriptstyle N}}\) defined through \(N^{-1}\sum _{i=1}^N \mathbf{g }_1(y_i,\mathbf{x }_i;\theta ) = \mathbf{0 }\).
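The calibration computation can be sketched for a scalar auxiliary variable. Standard Lagrange arguments give \({\hat{p}}_i = 1/\{n(1+\lambda u_i)\}\) with \(u_i = d_i(z_i - \mu _z)\) and \(\lambda \) solving \(\sum _{i\in \mathbf{S }} u_i/(1+\lambda u_i) = 0\). The data below are simulated and the variable names are ours:

```python
import numpy as np
from scipy.optimize import brentq

# Sample empirical likelihood with a benchmark constraint: maximize
# sum_S log(p_i) subject to sum p_i = 1 and sum_S p_i {d_i (z_i - mu_z)} = 0.
# The mean of y is then estimated from (14) with g_1(y; theta) = y - theta.
rng = np.random.default_rng(5)
n = 200
z = rng.normal(size=n)                      # auxiliary variable, known mu_z = 0
y = 2.0 + z + rng.normal(scale=0.5, size=n)
d = rng.uniform(1.0, 4.0, size=n)           # illustrative design weights
u = d * (z - 0.0)                           # u_i = d_i (z_i - mu_z)

# lam must keep every 1 + lam*u_i above 1/n so that p_i is in (0, 1)
lo = (1.0 / n - 1.0) / u.max() + 1e-10
hi = (1.0 / n - 1.0) / u.min() - 1e-10
lam = brentq(lambda l: np.sum(u / (1.0 + l * u)), lo, hi)
p_hat = 1.0 / (n * (1.0 + lam * u))

# p_hat * d are the calibration weights; the estimator solves (14)
theta_hat = np.sum(p_hat * d * y) / np.sum(p_hat * d)
```

Note that the normalization constraint is satisfied automatically by this form of \({\hat{p}}_i\) once \(\lambda \) solves the benchmark equation.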

An over-identified estimating equation system does not always have a partition \((\mathbf{g }_1,\mathbf{g }_2)\) with the calibration equations described above. For instance, if the parameter \(\theta \) is the mean of a Poisson random variable y, then the single \(\theta \) satisfies two moment conditions: \(E_{\xi }(y-\theta )=0\) and \(E_{\xi }\{(y-\theta )^2-\theta \}=0\). In another example, if \(\theta \) is the mean of the variable y with a known variance \(\sigma ^2_0\), then the parameter \(\theta \) also satisfies two moment conditions: \(E_{\xi }(y-\theta ) = 0\) and \(E_{\xi }\{(y-\theta )^2-\sigma ^2_0\} = 0\). General estimation results which cover over-identified estimating equation systems are both theoretically and practically important.

Under suitable regularity conditions on the estimating functions \(\mathbf{g }(y,\mathbf{x };\theta )\), the probability sampling design, and the finite population as described in Zhao et al. (2019), the maximum sample empirical likelihood estimator \({\hat{\theta }}\) is design-consistent with design-based variance–covariance matrix given by:

$$\begin{aligned} \mathbf{V }= \big (\mathbf{H }'\mathbf{W }^{-1}\mathbf{H }\big )^{-1}\mathbf{H }'\mathbf{W }^{-1}{\varvec{\Sigma }} \mathbf{W }^{-1}\mathbf{H } \big (\mathbf{H }'\mathbf{W }^{-1}\mathbf{H }\big )^{-1}\,, \end{aligned}$$
(15)

where \(\mathbf{H } = \mathbf{H }_{{\scriptscriptstyle N}}(\theta _{{\scriptscriptstyle N}})\) and \(\mathbf{H }_{{\scriptscriptstyle N}}(\theta )\) is defined in (10), \(\mathbf{W } = nN^{-2}\sum _{i=1}^N d_i\mathbf{g }_i \mathbf{g }_i'\) with \(\mathbf{g }_i = \mathbf{g }(y_i,\mathbf{x }_i;\theta _{{\scriptscriptstyle N}})\), and \({\varvec{\Sigma }} = V_p\big \{\mathbf{G }_n\big (\theta _{{\scriptscriptstyle N}}\big )\big \}\) as given previously in (9). It should be noted that \(\mathbf{H }\) is \(r\times k\), \(\mathbf{W }\) is \(r\times r\), and \({\varvec{\Sigma }}\) is \(r\times r\), resulting in a \(k\times k\) matrix for \(\mathbf{V }\).

If the estimating equation system is just-identified (i.e., \(r=k\)), the variance–covariance matrix given in (15) reduces to \(\mathbf{V } = \mathbf{H }^{-1} {\varvec{\Sigma }} (\mathbf{H }')^{-1}\), which is the same as \(V_p({\hat{\theta }})\) given in (9). In general, variance estimation requires plug-in estimators for the three components \(\mathbf{H }\), \(\mathbf{W }\) and \({\varvec{\Sigma }}\), which are, respectively, given by \(\hat{\mathbf{H }} = \mathbf{H }_n({\hat{\theta }})\) as defined in (11), \(\hat{\mathbf{W }} = nN^{-2}\sum _{i\in \mathbf{S }} d_i^2\hat{\mathbf{g }}_i \hat{\mathbf{g }}_i'\) with \(\hat{\mathbf{g }}_i = \mathbf{g }(y_i,\mathbf{x }_i;{\hat{\theta }})\), and \(\hat{\varvec{\Sigma }} = v_p\big \{\mathbf{G }_n\big ({\hat{\theta }}\big )\big \}\). The definitions of \(\mathbf{H }\) and \(\hat{\mathbf{H }}\) through (10) and (11) cannot be used when the estimating functions \(\mathbf{g }_i = \mathbf{g }(y_i,\mathbf{x }_i;\theta )\) are non-differentiable in \(\theta \). The asymptotic result under those cases involves \(\mathbf{H }(\theta ) = \partial \mathbf{G }(\theta )/\partial \theta \), where \(\mathbf{G }(\theta ) = \lim _{{\scriptscriptstyle N}\rightarrow \infty } \mathbf{G }_{{\scriptscriptstyle N}}(\theta )\). Zhao and Wu (2019) contains details on how to estimate \(\mathbf{H }\) for non-smooth estimating functions and additional discussions on estimating the design-based variance–covariance matrix \({\varvec{\Sigma }}\) under commonly used survey designs.
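Given plug-in components, (15) is a few lines of linear algebra. The numerical matrices below are arbitrary and serve only to illustrate the dimensions and the just-identified reduction:

```python
import numpy as np

# Direct computation of the variance-covariance matrix (15),
# V = (H'W^{-1}H)^{-1} H'W^{-1} Sigma W^{-1} H (H'W^{-1}H)^{-1},
# from plug-in components H (r x k) and W, Sigma (r x r symmetric).
def el_sandwich(H, W, Sigma):
    WinvH = np.linalg.solve(W, H)       # W^{-1} H
    A = np.linalg.inv(H.T @ WinvH)      # (H' W^{-1} H)^{-1}, symmetric
    B = WinvH @ A                       # W^{-1} H (H' W^{-1} H)^{-1}
    return B.T @ Sigma @ B              # the k x k matrix V

# Just-identified case (r = k = 2): V reduces to H^{-1} Sigma H'^{-1},
# independent of W, matching the reduction to (9) noted in the text.
H = np.array([[2.0, 0.0], [1.0, 3.0]])
W = np.diag([4.0, 9.0])
Sigma = np.eye(2)
V = el_sandwich(H, W, Sigma)
```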

3.3 Hypothesis tests

Hypothesis testing is a common inferential task in building statistical models and answering specific scientific questions. With complex survey data, the problems can be formulated for finite population parameters defined through census estimating equations under the design-based framework. When the assumed superpopulation model holds for the survey population, the inferential results can be extended to the superpopulation model parameters as discussed in Sect. 2.

The general results on sample empirical likelihood ratio tests and the required regularity conditions are discussed in Zhao et al. (2019) and Zhao and Wu (2019). The sample empirical likelihood ratio statistic for testing \(H_0: \theta _{{\scriptscriptstyle N}} = \theta _{{\scriptscriptstyle N0}}\) versus \(H_1: \theta _{{\scriptscriptstyle N}} \ne \theta _{{\scriptscriptstyle N0}}\) for a pre-specified \(\theta _{{\scriptscriptstyle N0}}\) is computed as:

$$\begin{aligned} r_{{\scriptscriptstyle SEL}}(\theta _{{\scriptscriptstyle N0}}) = -2 \Big \{ \ell _{{\scriptscriptstyle SEL}}(\theta _{{\scriptscriptstyle N0}}) - \ell _{{\scriptscriptstyle SEL}}({\hat{\theta }})\Big \} \,, \end{aligned}$$

where \(\ell _{{\scriptscriptstyle SEL}}(\theta )\) is defined in (12) and \({\hat{\theta }}\) is the maximum sample empirical likelihood estimator of \(\theta _{{\scriptscriptstyle N}}\). It can be shown that:

$$\begin{aligned} r_{{\scriptscriptstyle SEL}}(\theta _{{\scriptscriptstyle N0}}) = \mathbf{Q }' {\varvec{\Delta }} \mathbf{Q } + o_p(1)\,, \end{aligned}$$

where \(\mathbf{Q } \sim \mathbf{N }(\mathbf{0 }, \mathbf{I }_r)\), the standard multivariate normal distribution, and:

$$\begin{aligned} {\varvec{\Delta }} = n {\varvec{\Sigma }}^{1/2}\mathbf{W }^{-1}\mathbf{H } \big (\mathbf{H }'\mathbf{W }^{-1}\mathbf{H }\big )^{-1} \mathbf{H }'\mathbf{W }^{-1} {\varvec{\Sigma }}^{1/2} \,. \end{aligned}$$

The sampling distribution of \(r_{{\scriptscriptstyle SEL}}(\theta _{{\scriptscriptstyle N0}})\) is asymptotically equivalent to the distribution of a quadratic form, which can be re-expressed as:

$$\begin{aligned} \mathbf{Q }' {\varvec{\Delta }} \mathbf{Q } = \sum _{j=1}^k \delta _j\chi ^2_j\,, \end{aligned}$$

where \(\delta _j\), \(j=1,\dots ,k\) are the non-zero eigenvalues of \({\varvec{\Delta }}\), and \(\chi ^2_j\), \(j=1,\dots ,k\) are independent \(\chi ^2\) random variables with one degree of freedom. For just-identified cases with \(r=k\), the matrix \({\varvec{\Delta }}\) reduces to \({\varvec{\Delta }} = n{\varvec{\Sigma }}^{1/2}\mathbf{W }^{-1}{\varvec{\Sigma }}^{1/2}\). It can further be shown that, under single-stage PPS sampling with a negligible sampling fraction, we have \({\varvec{\Sigma }} = n^{-1}\mathbf{W } + o(n^{-1})\), and consequently, the sample empirical likelihood ratio statistic \(r_{{\scriptscriptstyle SEL}}(\theta _{{\scriptscriptstyle N0}})\) converges in distribution to a standard \(\chi ^2\) random variable with k degrees of freedom.

The sample empirical likelihood ratio statistic \(r_{{\scriptscriptstyle SEL}}(\theta _{{\scriptscriptstyle N}} \mid H_0)\) for testing a general hypothesis \(H_0: \mathbf{K }(\theta _{{\scriptscriptstyle N}}) = \mathbf{0 }\) versus \(H_1: \mathbf{K }(\theta _{{\scriptscriptstyle N}}) \ne \mathbf{0 }\), where \(\mathbf{K }(\theta _{{\scriptscriptstyle N}}) = \mathbf{0 }\) imposes \(k_1\) (\(\le k\)) linear or nonlinear constraints on the \(k\times 1\) parameter vector \(\theta _{{\scriptscriptstyle N}}\), is computed as follows. Let \({\hat{\theta }}\) be the (unrestricted) maximum sample empirical likelihood estimator of \(\theta _{{\scriptscriptstyle N}}\) over the parameter space \(\Theta \), and let \({\hat{\theta }}^* =\arg \max _{\theta \in \Theta ^*}\ell _{{\scriptscriptstyle SEL}}(\theta )\) be the restricted maximum sample empirical likelihood estimator of \(\theta _{{\scriptscriptstyle N}}\) under the restricted parameter space \(\Theta ^* = \{\theta \mid \theta \in \Theta \;\; \mathrm{and} \;\; \mathbf{K }(\theta ) = \mathbf{0 }\}\). We have:

$$\begin{aligned} r_{{\scriptscriptstyle SEL}}(\theta _{{\scriptscriptstyle N}} \mid H_0) = -2 \Big \{ \ell _{{\scriptscriptstyle SEL}}({\hat{\theta }}^*) - \ell _{{\scriptscriptstyle SEL}}({\hat{\theta }})\Big \}. \end{aligned}$$

It can be shown that \(r_{{\scriptscriptstyle SEL}}(\theta _{{\scriptscriptstyle N}} \mid H_0) = \mathbf{Q }' {\varvec{\Delta }}^* \mathbf{Q } + o_p(1)\), where \(\mathbf{Q } \sim \mathbf{N }(\mathbf{0 },\mathbf{I }_r)\) and:

$$\begin{aligned} {\varvec{\Delta }}^* = n {\varvec{\Sigma }}^{1/2}\mathbf{W }^{-1}\mathbf{H } {\varvec{\Gamma }} {\varvec{\Phi }}' \big ({\varvec{\Phi }} {\varvec{\Gamma }} {\varvec{\Phi }}' \big )^{-1} {\varvec{\Phi }} {\varvec{\Gamma }} \mathbf{H }'\mathbf{W }^{-1}{\varvec{\Sigma }}^{1/2} \,, \end{aligned}$$

where \({\varvec{\Phi }} = \{\partial \mathbf{K }(\theta ) /\partial \theta \} |_{\theta = \theta _{{\scriptscriptstyle N}}}\) and \({\varvec{\Gamma }} = \mathbf{H }'\mathbf{W }^{-1}\mathbf{H }\), with \(\mathbf{H }\), \(\mathbf{W }\) and \({\varvec{\Sigma }}\) defined the same as before. If \(r=k\) and the survey design is single-stage PPS sampling with a small sampling fraction, the sample empirical likelihood ratio statistic \(r_{{\scriptscriptstyle SEL}}(\theta _{{\scriptscriptstyle N}} \mid H_0)\) follows asymptotically a standard \(\chi ^2\) distribution with \(k_1\) degrees of freedom.

Linear hypotheses are most commonly encountered in practice, where \(\mathbf{K }(\theta )\) \( = \mathbf{0 }\) has the form \(\mathbf{A }\theta = \mathbf{b }\), with \(\mathbf{A }\) being a \(k_1 \times k\) matrix and \(\mathbf{b }\) being a \(k_1\times 1\) vector, both pre-specified. In this case, we have \({\varvec{\Phi }} = \partial \mathbf{K }(\theta ) /\partial \theta = \mathbf{A }\). The hypothesis \(H_0: \theta _{{\scriptscriptstyle N}} = \theta _{{\scriptscriptstyle N0}}\) is equivalent to letting \(\mathbf{A } = \mathbf{I }_k\) and \(\mathbf{b } = \theta _{{\scriptscriptstyle N0}}\).

Implementations of the sample empirical likelihood ratio tests generally require the estimation of the matrix \({\varvec{\Delta }}\) or \({\varvec{\Delta }}^*\), which amounts to estimating \(\mathbf{H }\), \(\mathbf{W }\) and \({\varvec{\Sigma }}\). The sampling distribution of the test statistic \(r_{{\scriptscriptstyle SEL}}(\theta _{{\scriptscriptstyle N0}})\) or \(r_{{\scriptscriptstyle SEL}}(\theta _{{\scriptscriptstyle N}} \mid H_0)\) can be approximated by simulating the distribution of the quadratic form \(\mathbf{Q }' {\varvec{\Delta }} \mathbf{Q }\) or \(\mathbf{Q }' {\varvec{\Delta }}^* \mathbf{Q }\), or equivalently, of the linear combination of independent \(\chi ^2\) random variables with weights given by the estimated eigenvalues of \({\varvec{\Delta }}\) or \({\varvec{\Delta }}^*\). Analytic approximation methods for the distribution of a weighted sum of \(\chi ^2\) random variables, such as those described in Rao and Scott (1981, 1984) and Bodenham and Adams (2016), may also be considered.
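For illustration, the simulation-based approximation of the p-value can be sketched as follows in Python, assuming an estimate of \({\varvec{\Delta }}\) is available as a NumPy array (the function name and arguments are our illustrative choices):

```python
import numpy as np

def sel_ratio_pvalue(r_obs, delta_hat, n_sim=100_000, seed=0):
    """Approximate the p-value of the sample empirical likelihood ratio
    statistic by simulating the weighted sum of independent chi-squared(1)
    variables, with weights given by the non-zero eigenvalues of Delta."""
    rng = np.random.default_rng(seed)
    # Non-zero eigenvalues of the (symmetric) estimated Delta matrix.
    eig = np.linalg.eigvalsh(delta_hat)
    delta = eig[eig > 1e-10]
    # Each row is one draw of sum_j delta_j * chi^2_1.
    draws = rng.chisquare(df=1, size=(n_sim, delta.size)) @ delta
    return np.mean(draws >= r_obs)

# When Delta = I_k the quadratic form is exactly chi-squared with k degrees
# of freedom, so the simulated p-value matches the chi-squared tail area.
```

The same routine covers both test statistics, since only the estimated eigenvalues change between \({\varvec{\Delta }}\) and \({\varvec{\Delta }}^*\).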

4 Design-based variable selection

Complex survey data often contain information on a large number of variables, especially for health and social science-related surveys where many factors are deemed potentially important for scientific investigations. For instance, surveys of the International Tobacco Control (ITC) Policy Evaluation Project (Thompson et al. 2006) collect data on many variables related to demographic, psychosocial, behavioral, and health aspects of the units as well as measures of knowledge and attitude towards smoking. Variable selection is an important problem at the initial stage of model building to identify relevant factors for a particular response variable such as addiction or quitting behaviors.

Design-based variable selection using survey data focuses on the finite population regression coefficients for linear regression models, logistic regression models, or other generalized linear models as discussed in Sect. 2. Under standard settings with independent sample data, the basic aim of variable selection is to identify covariates in a regression model for which the coefficients are zero. The finite population regression coefficients \(\theta _{{\scriptscriptstyle N}}\) defined as the solution to the census estimating equations, however, are usually not exactly zero even if the corresponding superpopulation parameters are zero: when the model holds for the finite population, the components of \(\theta _{{\scriptscriptstyle N}}\) corresponding to zero model parameters are typically of the order \(O(N^{-1/2})\). For design-based variable selection, we therefore treat population regression coefficients as practically zero if their theoretical values are of the order \(O(N^{-1/2})\).

The most widely known variable selection method that is a product of an estimation technique is the least absolute shrinkage and selection operator (LASSO) by Tibshirani (1996). Variable selection through penalized empirical likelihood with independent data has been studied by Tang and Leng (2010) and Leng and Tang (2012). The general procedures require that the un-penalized method provides consistent point estimators of the regression coefficients, and the penalized method forces estimators with small values to be zero. The sample empirical likelihood fits into this framework very naturally for design-based variable selection with the population regression coefficients defined through census estimating equations.

Let \(p_{\tau }(\cdot )\) be a pre-specified penalty function with regularization parameter \(\tau \). Let \(\mathbf{g }(y,\mathbf{x };\theta )\) be the estimating functions for defining \(\theta _{{\scriptscriptstyle N}}\). The penalized sample empirical likelihood function (omitting the constant term \( - n\log (n)\)) is defined as:

$$\begin{aligned} \ell _{{\scriptscriptstyle PSEL}}(\theta ) = - \sum _{i\in \mathbf{S }} \log \big [1+\lambda '\{d_i \mathbf{g }(y_i,\mathbf{x }_i;\theta )\}\big ] - n\sum _{j=1}^k p_{\tau }(| \theta _j |), \end{aligned}$$

where \(\theta _j\) is the jth component of \(\theta \) and the Lagrange multiplier \(\lambda \) is the solution to (17) as described in Sect. 6. The smoothly clipped absolute deviation (SCAD) penalty function proposed by Fan and Li (2001) has been shown to achieve variable selection and unbiased parameter estimation simultaneously under standard settings. Zhao et al. (2019) showed that the SCAD penalty also works well for the penalized sample empirical likelihood method. The SCAD penalty function \(p_{\tau }(t)\) satisfies \(p_{\tau }(0) = 0\) and has its first-order derivative given by:

$$\begin{aligned} p'_{\tau }(t)=\tau \left\{ I(t \le \tau )+\frac{(a\tau -t)_+}{(a-1)\tau }I(t >\tau )\right\} , \end{aligned}$$

where \((b)_+ = b\) if \(b\ge 0\) and \((b)_+ = 0\) if \(b<0\). The penalty function contains two regularization parameters: a and \(\tau \). The choice \(a=3.7\) works well under the universal thresholding \(\tau = \{2\log (k)\}^{1/2}\) when \(k\le 100\). More refined data-driven choices of \((a,\tau )\) can be determined using criteria such as BIC or generalized cross-validation. See Fan and Li (2001) and Tang and Leng (2010) for further details.
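The derivative \(p'_{\tau }(t)\) above translates directly into code; the following sketch (function name ours) uses the default \(a = 3.7\):

```python
import numpy as np

def scad_deriv(t, tau, a=3.7):
    """First-order derivative p'_tau(t) of the SCAD penalty of
    Fan and Li (2001), for t >= 0:
        p'_tau(t) = tau                              for t <= tau,
        p'_tau(t) = (a*tau - t)_+ / (a - 1)          for t > tau."""
    t = np.asarray(t, dtype=float)
    return tau * np.where(t <= tau, 1.0,
                          np.maximum(a * tau - t, 0.0) / ((a - 1) * tau))
```

The derivative is constant at \(\tau \) on \([0,\tau ]\), decays linearly to zero on \((\tau , a\tau ]\), and vanishes beyond \(a\tau \), which is what produces unbiased estimation of large coefficients.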

The maximum penalized sample empirical likelihood estimator of \(\theta _{{\scriptscriptstyle N}}\) is given by \({\hat{\theta }}_{{\scriptscriptstyle PSEL}} = \arg \max _{\theta \in \Theta } \ell _{{\scriptscriptstyle PSEL}}(\theta )\). Zhao et al. (2019) showed that the procedure possesses oracle properties for variable selection under the design-based framework in the sense that zero components of \(\theta _{{\scriptscriptstyle N}}\) will be correctly identified with probability approaching 1 as n grows large. In addition, the penalized estimator for the non-zero components of \(\theta _{{\scriptscriptstyle N}}\) is design-consistent.

5 Bayesian inferences

Bayesian inferences require a likelihood function for the observed sample data. With a chosen prior distribution, inferences on the parameters based on the posterior distribution are conditional on the given sample data. Bayesian inferences for finite population parameters with desirable frequentist properties under the design-based framework, however, are very difficult to achieve, as shown by Godambe (1966, 1968) and Ericson (1969).

The sample empirical likelihood provides a convenient tool for defining a profile likelihood function for finite population parameters through survey weighted estimating equations. With a suitably chosen prior distribution, the likelihood leads to Bayesian posterior inferences which are valid under the design-based framework for certain survey designs. The approach is particularly appealing for parameters involving non-smooth estimating functions, since the computational procedures do not incur any additional difficulties. Upon omitting the constant term \( - n\log (n)\), the profile sample empirical log-likelihood function for \(\theta \) defined in (12) is given by:

$$\begin{aligned} \ell (\theta ) = - \sum _{i\in \mathbf{S }} \log \left[ 1+\lambda '\big \{d_i \mathbf{g }(y_i,\mathbf{x }_i;\theta )\big \}\right] , \end{aligned}$$

where the Lagrange multiplier \(\lambda = \lambda (\theta )\) with the given \(\theta \) is the solution to (17). The maximum sample empirical likelihood estimator \({\hat{\theta }}\) is the maximum point of \(\ell (\theta )\).

5.1 Bayesian inference with a fixed prior

Let \(\mathbf{g }_i(\theta ) = \mathbf{g }(y_i,\mathbf{x }_i;\theta )\); let \(\pi (\theta )\) be a fixed prior distribution which is independent of the sample size n. The posterior distribution of \(\theta \) for the given sample \(\mathbf{S }\) has the form \(\pi (\theta \mid \mathbf{S }) \propto \pi (\theta ) \exp \{\ell (\theta )\}\) and is given by:

$$\begin{aligned} \pi (\theta \mid \mathbf{S }) = c(\mathbf{S }) \exp \left[ \log \big \{\pi (\theta )\big \} - \sum _{i\in \mathbf{S }}\log \big \{1+ \lambda 'd_i\mathbf{g }_i(\theta )\big \}\right] , \end{aligned}$$
(16)

where \(c(\mathbf{S })\) is a normalizing constant depending on \(\{(y_i,\mathbf{x }_i,d_i),i\in \mathbf{S }\}\), such that \(\int \pi (\theta \mid \mathbf{S })\mathrm{d}\theta = 1\).

It is shown by Zhao et al. (2020) that the posterior density function given in (16) with a fixed prior has the following asymptotic expansion:

$$\begin{aligned} \pi (\theta \mid \mathbf{S }) \propto \exp \left[ - \frac{1}{2} \big (\theta -{\hat{\theta }} \big )' \mathbf{J }_n \big ( \theta -{\hat{\theta }} \big ) + R_n \right] , \end{aligned}$$

where \(\mathbf{J }_n = n\mathbf{H }'\mathbf{W }^{-1}\mathbf{H }\) and \(R_n=o_p(1)\), with \(\mathbf{H }\) and \(\mathbf{W }\) defined the same as before. The posterior distribution of \(\theta \) is asymptotically equivalent to a multivariate normal distribution with mean \({\hat{\theta }}\) and variance–covariance matrix \(\mathbf{J }_n^{-1}\). The fixed prior distribution \(\pi (\theta )\) has no impact on the posterior distribution under large samples.

The asymptotic expansion of the posterior density function shows that the posterior variance of \(\theta \) matches the design-based variance of the posterior mean under single-stage PPS sampling without replacement with negligible sampling fractions. Consequently, Bayesian inference with any fixed prior has valid design-based frequentist properties under such survey designs.

5.2 Bayesian inference with an n-dependent prior

A fixed prior has an impact on the analysis when the sample size is small or moderate, but its influence diminishes under large samples. A stronger version of prior distributions is the so-called n-dependent prior, denoted by \(\pi _n(\theta )\), for which the variance of the prior distribution shrinks as n gets large. There are practical scenarios where an n-dependent prior might arise naturally. For instance, a previous survey or a pilot survey might be available, taken from the same finite population and with a set of variables in common with the current survey. It is then possible to obtain a point estimate with an estimated variance for the parameters of interest from that survey and to use the estimates to form a prior distribution. This approach was used by Rao and Ghangurde (1972) for Bayesian optimization in sampling finite populations.

The n-dependent prior \(\pi _n(\theta )\) is assumed to satisfy that (i) the function \(\log \{\pi _n(\theta )\}\) is twice continuously differentiable; (ii) the prior density has bounded mode \(\mathbf{m }_0 = \arg \max _{\theta }\pi _n(\theta )\); and (iii) the information matrix satisfies:

$$\begin{aligned} \mathbf{H }_0 = - \left[ \frac{\partial ^2}{\partial \theta \partial \theta '}\log \left\{ \pi _n(\theta )\right\} \right] \Big |_{\theta =\mathbf{m }_0} = O(n). \end{aligned}$$
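For illustration, consider a Gaussian n-dependent prior \(\pi _n(\theta ) = N(\mathbf{m }_0, {\varvec{\Sigma }}_0/n)\), for which conditions (i)–(iii) hold with \(\mathbf{H }_0 = n{\varvec{\Sigma }}_0^{-1} = O(n)\). A minimal NumPy sketch (function names ours) verifies this by a finite-difference Hessian of the log prior density at the mode:

```python
import numpy as np

def log_prior(theta, m0, Sigma0, n):
    """Log density (up to a constant) of the n-dependent Gaussian prior
    pi_n(theta) = N(m0, Sigma0 / n)."""
    d = theta - m0
    return -0.5 * n * d @ np.linalg.inv(Sigma0) @ d

def neg_hessian_at_mode(m0, Sigma0, n, h=1e-4):
    """Numerical negative Hessian of log pi_n at the mode m0; for the
    Gaussian prior this recovers H_0 = n * Sigma0^{-1}, which is O(n)."""
    k = m0.size
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            e_i, e_j = np.eye(k)[i] * h, np.eye(k)[j] * h
            # Central finite difference for the mixed second derivative.
            H[i, j] = -(log_prior(m0 + e_i + e_j, m0, Sigma0, n)
                        - log_prior(m0 + e_i - e_j, m0, Sigma0, n)
                        - log_prior(m0 - e_i + e_j, m0, Sigma0, n)
                        + log_prior(m0 - e_i - e_j, m0, Sigma0, n)) / (4 * h * h)
    return H
```

Doubling n doubles the recovered information matrix, confirming the \(O(n)\) order required by condition (iii).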

It is shown by Zhao et al. (2020) that the posterior density \(\pi (\theta \mid \mathbf{S })\) given in (16) but with the n-dependent prior \(\pi _n(\theta )\) has the following asymptotic expansion:

$$\begin{aligned} \pi (\theta \mid \mathbf{S }) \propto \exp \left[ -\frac{1}{2}\big (\theta - \mathbf{m }_n \big )' \mathbf{K }_n \big ( \theta - \mathbf{m }_n \big ) + R_n \right] , \end{aligned}$$

where \(\mathbf{K }_n = \mathbf{H }_0 + \mathbf{J }_n\), \(\mathbf{m }_n = \mathbf{K }_n^{-1}\big (\mathbf{H }_0\mathbf{m }_0+\mathbf{J }_n{\hat{\theta }}\big )\), \(R_n=o_p(1)\), and \(\mathbf{J }_n\) is defined in Sect. 5.1. The posterior distribution of \(\theta \) is asymptotically equivalent to a multivariate normal distribution, of which the mean is a matrix-weighted convex combination of the prior mode \(\mathbf{m }_0\) and the maximum sample empirical likelihood estimator \({\hat{\theta }}\), and the variance \(\mathbf{K }_n^{-1}\) is the inverse of the sum of the information matrix \(\mathbf{H }_0\) of the prior and the information matrix \(\mathbf{J }_n\) of the sample empirical likelihood.

The asymptotic expansion of the posterior density with an n-dependent prior shows that the impact of the prior distribution is asymptotically negligible if the information matrix of the prior satisfies \(\mathbf{H }_0 = o(n)\). This leads to another crucial observation: the condition \(\mathbf{m }_0 = \theta _{{\scriptscriptstyle N}}+O_p(n^{-1/2})\) on the prior distribution is necessary for the validity of design-based frequentist interpretation for Bayesian inference if the variance of the prior distribution is chosen with the order \(O(n^{-1})\). If the variance of the prior distribution goes to 0 faster than \(n^{-1}\), the posterior mean will be dominated by the prior mean under large samples. For finite samples, the impact of the n-dependent prior \(\pi _n(\theta )\) depends largely on the mode \(\mathbf{m }_0\) of the distribution and, to a lesser extent, on the variance of the distribution or the information matrix \(\mathbf{H }_0\).

6 Computational notes

The first major computational task is to maximize \(\ell _{{\scriptscriptstyle SEL}}(\mathbf{p }) = \sum _{i\in \mathbf{S }}\log (p_i)\) under the constraints (4) and (13) with a given \(\theta \). It can be shown using the Lagrange multiplier method that the solution is given by:

$$\begin{aligned} {\hat{p}}_i(\theta ) = \frac{1}{n\big [1+\lambda '\big \{d_i \mathbf{g }(y_i,\mathbf{x }_i;\theta )\big \}\big ]} \end{aligned}$$

for \( i\in \mathbf{S }\), where the Lagrange multiplier \(\lambda = \lambda (\theta )\), which depends on \(\theta \), is the solution to:

$$\begin{aligned} \mathbf{D }_1(\theta ,\lambda ) = \frac{1}{n} \sum _{i\in \mathbf{S }} \frac{d_i \mathbf{g }(y_i,\mathbf{x }_i;\theta )}{1+\lambda '\big \{d_i \mathbf{g }(y_i,\mathbf{x }_i;\theta )\big \}} = \mathbf{0 }. \end{aligned}$$
(17)

The modified Newton–Raphson procedure proposed by Chen et al. (2002) is designed to solve (17) to obtain \(\lambda \) with a given \(\theta \).
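As a computational sketch, the inner step of solving (17) for \(\lambda \) with a given \(\theta \) can be written in Python as follows. This is a simplified step-halving variant, not the full procedure of Chen et al. (2002), and the function and argument names are ours:

```python
import numpy as np

def solve_lambda(G, d, n_iter=50, tol=1e-10):
    """Solve (17) for the Lagrange multiplier lambda with a given theta.
    G is the m x r matrix with rows g(y_i, x_i; theta) for i in S, and d
    holds the design weights d_i.  Newton steps are halved whenever a
    candidate lambda makes some 1 + lambda' d_i g_i non-positive."""
    U = G * d[:, None]                        # rows: u_i = d_i * g_i(theta)
    m, r = U.shape
    lam = np.zeros(r)
    for _ in range(n_iter):
        w = 1.0 + U @ lam
        D1 = U.T @ (1.0 / w) / m              # left-hand side of (17)
        if np.linalg.norm(D1) < tol:
            break
        J = -(U / w[:, None] ** 2).T @ U / m  # Jacobian of D1 w.r.t. lambda
        step = np.linalg.solve(J, -D1)
        t = 1.0                               # step-halving for feasibility
        while np.any(1.0 + U @ (lam + t * step) <= 0):
            t /= 2.0
        lam = lam + t * step
    return lam
```

The step-halving keeps every \(1+\lambda '\{d_i\mathbf{g }_i(\theta )\}\) strictly positive, so the weights \({\hat{p}}_i(\theta )\) remain well defined throughout the iteration.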

The second major computational task is to find the maximum sample empirical likelihood estimator \({\hat{\theta }} = \arg \max _{\theta \in \Theta }\ell _{{\scriptscriptstyle SEL}}(\theta )\). It can be shown that setting \(\partial \ell _{{\scriptscriptstyle SEL}}(\theta ) / \partial \theta = \mathbf{0 }\) leads to:

$$\begin{aligned} \mathbf{D }_2(\theta ,\lambda ) = \left\{ \sum _{i\in \mathbf{S }} {\hat{p}}_i(\theta ) d_i \frac{\partial }{\partial \theta } \mathbf{g }(y_i,\mathbf{x }_i;\theta )\right\} '\lambda = \mathbf{0 }. \end{aligned}$$
(18)

Note that \(\mathbf{D }_1(\theta ,\lambda )\) and \(\lambda \) are both \(r\times 1\), and \(\mathbf{D }_2(\theta ,\lambda )\) and \(\theta \) are both \(k\times 1\). The estimator \({\hat{\theta }}\) can be obtained by treating \(\theta \) and \(\lambda \) as separate parameters and solving (17) and (18) simultaneously.
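The nested computation can be illustrated in the simplest case of a population mean, with the scalar estimating function \(g(y;\theta ) = y - \theta \) (our illustrative choice). The sketch profiles out \(\lambda \) with an inner Newton loop and maximizes \(\ell _{{\scriptscriptstyle SEL}}(\theta )\) by a plain grid search rather than by solving (17) and (18) jointly:

```python
import numpy as np

def ell_sel(theta, y, d, n_iter=40):
    """Profile sample empirical log-likelihood l_SEL(theta) for the scalar
    population-mean estimating function g(y; theta) = y - theta.  The inner
    Newton loop with step-halving solves (17) for the Lagrange multiplier."""
    u = d * (y - theta)                      # u_i = d_i * g(y_i; theta)
    lam = 0.0
    for _ in range(n_iter):
        w = 1.0 + lam * u
        d1 = np.mean(u / w)                  # left-hand side of (17)
        step = d1 / np.mean(u ** 2 / w ** 2)
        while np.any(1.0 + (lam + step) * u <= 0):
            step /= 2.0                      # keep all 1 + lambda * u_i > 0
        lam += step
    return -np.sum(np.log(1.0 + lam * u))

# With equal weights d_i, l_SEL attains its maximum value 0 at the sample
# mean, so a grid search over theta should peak near ybar.
```

In practice, solving (17) and (18) jointly by Newton's method over \((\theta ,\lambda )\) replaces the grid search; the sketch above trades efficiency for transparency.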

Variable selection using penalized sample empirical likelihood requires maximization of the penalized sample empirical likelihood \(\ell _{{\scriptscriptstyle PSEL}}(\theta )\) with respect to \(\theta \). The SCAD penalty function of Fan and Li (2001) allows for a quadratic approximation given by:

$$\begin{aligned} p_{\tau }(|\theta _j|) \doteq p_{\tau }(|\theta _{j0}|) + \frac{1}{2}\left\{ p'_{\tau }(|\theta _{j0}|)/|\theta _{j0}|\right\} (\theta _j^2 - \theta _{j0}^2), \end{aligned}$$

when \(\theta _j\) is close to \(\theta _{j0}\), an important feature for efficient computation. We can then replace (18) by \(\partial \ell _{{\scriptscriptstyle PSEL}}(\theta ) / \partial \theta = \mathbf{0 }\) using the quadratic approximation.
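For reference, the SCAD penalty and its local quadratic approximation can be coded as follows (a sketch with our function names; the closed form of \(p_{\tau }(t)\) is from Fan and Li (2001)):

```python
import numpy as np

def scad(t, tau, a=3.7):
    """SCAD penalty p_tau(t) of Fan and Li (2001), evaluated at |t|."""
    t = np.abs(t)
    return np.where(
        t <= tau, tau * t,
        np.where(t <= a * tau,
                 (2 * a * tau * t - t ** 2 - tau ** 2) / (2 * (a - 1)),
                 tau ** 2 * (a + 1) / 2))

def scad_quad(theta, theta0, tau, a=3.7):
    """Local quadratic approximation of p_tau(|theta|) around theta0:
    p_tau(|t0|) + 0.5 * {p'_tau(|t0|) / |t0|} * (theta^2 - t0^2)."""
    t0 = abs(theta0)
    dp = tau * (1.0 if t0 <= tau else max(a * tau - t0, 0.0) / ((a - 1) * tau))
    return scad(t0, tau, a) + 0.5 * (dp / t0) * (theta ** 2 - theta0 ** 2)
```

The approximation is exact at \(\theta _{j0}\) and accurate nearby, which is what makes the Newton-type update of the penalized objective tractable.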

Bayesian inferences based on the posterior distribution \(\pi (\theta \mid \mathbf{S })\) given in (16) can be carried out through an MCMC procedure. The full posterior distribution of \(\theta _{{\scriptscriptstyle N}}\) can be simulated using an acceptance–rejection sampling method. Details can be found in Zhao et al. (2020).
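A minimal random-walk Metropolis sketch for the scalar population-mean case with a flat prior is given below. The estimating function \(g(y;\theta ) = y-\theta \), the proposal scale, and all function names are our illustrative choices, not the specific procedure of Zhao et al. (2020):

```python
import numpy as np

def log_post(theta, y, d, n_iter=40):
    """Log posterior (16) with a flat prior and g(y; theta) = y - theta:
    the profile sample empirical log-likelihood at theta (scalar case)."""
    u = d * (y - theta)
    if np.min(u) >= 0 or np.max(u) <= 0:      # (17) has no solution here
        return -np.inf
    lam = 0.0
    for _ in range(n_iter):                   # Newton loop for (17)
        w = 1.0 + lam * u
        step = np.mean(u / w) / np.mean(u ** 2 / w ** 2)
        while np.any(1.0 + (lam + step) * u <= 0):
            step /= 2.0
        lam += step
    return -np.sum(np.log(1.0 + lam * u))

def metropolis(y, d, theta0, scale, n_draws=4000, seed=0):
    """Random-walk Metropolis sampler for pi(theta | S)."""
    rng = np.random.default_rng(seed)
    draws = np.empty(n_draws)
    cur, cur_lp = theta0, log_post(theta0, y, d)
    for b in range(n_draws):
        prop = cur + scale * rng.normal()
        prop_lp = log_post(prop, y, d)
        if np.log(rng.uniform()) < prop_lp - cur_lp:
            cur, cur_lp = prop, prop_lp
        draws[b] = cur
    return draws
```

Proposals for which the Lagrange multiplier in (17) has no solution receive log posterior \(-\infty \) and are rejected automatically, so the chain stays inside the feasible region.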

7 Additional remarks

Empirical likelihood and estimating equations are powerful statistical tools for data analysis. Their applications to survey data analysis require careful adaptations to take account of the survey design features under a suitable framework. Chapters 7 and 8 of Wu and Thompson (2020) contain additional materials on regression analysis, estimating equations, and empirical likelihood with complex survey data.

The pseudo empirical likelihood and the sample empirical likelihood approaches can be applied to survey data with a complex design involving stratification, clustering, and unequal probability selection as characterized by the first- and the second-order inclusion probabilities. The formulation of the sample empirical likelihood through survey weighted estimating equations only involves the first-order inclusion probabilities. This is sufficient for point estimation. Tests of statistical hypotheses require variance estimation, which further requires the second-order inclusion probabilities unless the survey design permits variance approximations without such information. A reviewer raised the interesting question of two-phase sampling designs, where a large first phase sample with information on auxiliary variables is available. Applications of the empirical likelihood methods to two-phase survey data require a careful formulation of constraints similar to those presented in Wu and Luan (2003). They also require detailed derivations of the variance components under two-phase sampling.

Large-scale survey data are often made available to public users who explore different aspects of the data. Public-use survey data files are created to include the basic design weights or the calibration weights as a separate column in addition to all other variables measured by the survey. Variance estimation is typically handled by using additional columns of replication weights supplied by the data file producers. If such weights are available, the inferential procedures described in this paper can readily be applied to public-use survey data files. Zhao et al. (2020) contains further details on empirical likelihood methods with public-use survey data.