1 Introduction

The use of an auxiliary variable X that is correlated with the study variable Y is well known to increase the efficiency of an estimator. When the relationship between them is linear but the line does not pass through the origin, the linear regression estimator can be used, provided the population mean \(\mu _x\) is known. However, \(\mu _x\) is usually unknown in practice, so double sampling is used, in which a preliminary sample is taken to estimate it.

There are also situations where partial information about \(\mu _x\) is available; we can then perform a preliminary test and construct an estimator based on its result. Han (1973a) assumes that \((x_i,y_i);i=1,2,...,n\) is a bivariate normal sample with unknown means \(\mu _x,\mu _y\), known variances \(\sigma _x^2,\sigma _y^2\) and correlation coefficient \(\rho \), and considers a preliminary test for

$$\begin{aligned} H_0:\mu _x=\mu _0 \ \text {against}\ H_1:\mu _x\ne \mu _0 \end{aligned}$$

where \(\mu _0\) is the value of \(\mu _x\) obtained from the partial information on it. Assuming \(\mu _0=0\) without loss of generality, the preliminary test estimator (PTE) is constructed as

$$\begin{aligned} \bar{y}^{*}=\left\{ \begin{array}{rcl} \bar{y}-\rho \bar{x} \ \ \text {if}\ \ \vert \sqrt{n}\bar{x}\vert \le Z_{\alpha }\\ \bar{y}\ \ \text {if}\ \ \vert \sqrt{n}\bar{x}\vert > Z_{\alpha } \end{array}\right. \end{aligned}$$

where \(\bar{x}\) and \(\bar{y}\) are the sample means of X and Y respectively, \(\alpha \) is the level of the preliminary test and \(Z_{\alpha }\) is the \(100(1-\alpha /2)\%\) point of N(0, 1). When \(H_0\) is accepted, the estimator that uses the prior value \(\mu _0\) is chosen; when \(H_0\) is rejected, \(\bar{y}\) is used as the estimator.
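The decision rule above can be sketched in a few lines of Python. This is only an illustration under the stated simplifications (unit variances, \(\mu _0=0\), known \(\rho \)); the function name `preliminary_test_estimate` and its argument names are hypothetical and do not come from Han (1973a).

```python
from math import sqrt
from statistics import NormalDist, fmean


def preliminary_test_estimate(x, y, rho, alpha):
    """Preliminary test estimate of mu_y, assuming unit variances and mu_0 = 0."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    # Two-sided critical value: the 100(1 - alpha/2)% point of N(0, 1).
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    if abs(sqrt(n) * x_bar) <= z_alpha:
        # H0 accepted: regression estimator with mu_x replaced by the prior value 0.
        return y_bar - rho * x_bar
    # H0 rejected: fall back on the plain sample mean.
    return y_bar
```

With a small sample whose \(\bar{x}\) is close to zero the first branch is taken, whereas a sample with \(\bar{x}\) far from zero returns \(\bar{y}\) unchanged.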

In another paper, Han (1973b) uses double sampling to estimate the unknown \(\mu _x\). When partial information on \(\mu _x\) is also available, he performs the same preliminary test and defines the estimator as

$$\begin{aligned} \bar{y}^{**}=\left\{ \begin{array}{rcl} \bar{y}-\rho \bar{x} \ \ \text {if}\ \ \vert \sqrt{n'}\bar{x}'\vert \le Z_{\alpha } \\ \bar{y}+\rho (\bar{x}'-\bar{x})\ \ \text {if}\ \ \vert \sqrt{n'}\bar{x}'\vert > Z_{\alpha } \end{array}\right. \end{aligned}$$

where \(\bar{x}'\) is the sample mean of \(n'\) observations from double sampling. If \(H_0\) is accepted, the prior value \(\mu _0\) is used; otherwise the sample mean \( \overline{x}^{\prime} \) based on the preliminary sample in double sampling is used.

For two auxiliary variables, Das and Bez (1995) have constructed a preliminary test regression estimator with double sampling which is found to be more efficient than the ordinary regression estimator. Khongji and Das (2012) have also worked on some preliminary test estimators in double sampling using stratification.

In this paper, we construct a preliminary test regression estimator in two-stage cluster sampling, using ranked set sampling (RSS) in the second stage together with double sampling. RSS was introduced by McIntyre in 1952 for estimating yield where measurement of the study variable is difficult but judgment ranking is possible. Stokes (1977) proved that the variance of the RSS sample mean, with ranking based on a concomitant variable, is smaller than that of the sample mean under simple random sampling (SRS).

Sud and Mishra (2006) focused on the estimation of the finite population mean in two-stage RSS designs. Nematollahi et al. (2008) used RSS with replacement in the second stage of two-stage sampling and showed that two-stage cluster sampling with RSS (TSCRSS) is more efficient than SRS in two-stage sampling. Ozturk (2019) also worked on RSS with and without replacement in two-stage cluster sampling. He used the sample mean as an estimator and found that the efficiency depends on the intra-cluster correlation coefficient and on the sampling design.

Since we use a regression estimator of \(\mu _y\), the population mean of the auxiliary variable must be known. However, it is usually unknown. An experimenter may have partial information on it from other sources and believe that it equals some value, which we call a prior value, without knowing this for certain. A hypothesis test is therefore needed to assess the prior value. Under the assumption that we have partial information on the auxiliary variables, we study whether the proposed estimator can be better than the conventional regression estimator in two-stage sampling with RSS without replacement in the second stage, even when there is error in ranking. Throughout this paper we consider the situation where ranking is imperfect. When the ranking is no better than random, the sample obtained through RSS can be regarded as almost a simple random sample.

2 Development of the Proposed Estimator

Suppose X and Z are two auxiliary variables and we have partial information on both. Following Han (1973b), we can construct a preliminary test estimator using double sampling to utilize the partial information. Let (X, Y, Z) follow a trivariate normal distribution with means \(\mu _x, \mu _y,\mu _z\), variances \(\sigma _x^2,\sigma _y^2,\sigma _z^2\) respectively and correlation coefficients \(\rho _{yx},\rho _{yz},\rho _{xz}\). Taking \(\mu _{0x}\) and \(\mu _{0z}\) as the prior values of \(\mu _x\) and \(\mu _z\) respectively, we can perform a preliminary test on the hypotheses

$$\begin{aligned} H_{01}:\mu _x=\mu _{0x}\ \ \text {and} \ \ H_{02}:\mu _z=\mu _{0z} \end{aligned}$$
(1)

For selecting samples we will implement RSS in the second stage of the two-stage sampling:

Assume that a population has N primary sampling units (PSUs) or first stage units, each having \(M_i;i=1,2,...,N\) secondary stage units (SSUs). In the first stage sampling, n units are selected from N PSUs with SRS without replacement (SRSWOR).

In the second stage, we select a sample of SSUs from each selected PSU. However, \(\mu _x\) and \(\mu _z\) are assumed to be unknown. Thus, double sampling is used to first estimate \(\mu _x\) and \(\mu _z\) by selecting \(m_i'\) SSUs from each selected PSU using SRSWOR. Let \(\bar{x}'\) and \(\bar{z}'\) be the sample means on X and Z respectively, obtained through double sampling from the preliminary samples of total size \(m'=\sum _{i=1}^nm_i'\). Since we are using RSS, the preliminary samples within each selected PSU are ranked according to the auxiliary variable. Here we have two auxiliary variables, so the one that is more highly correlated with the study variable is used for ranking. Select the largest unit from the first PSU, the second largest from the second PSU and continue until the smallest of the \(i^{th}\) PSU \(;i=1,2,...,n\) is chosen. We use different sample sizes for each cluster.

Suppose we select \(m_i(< m_i')\) units using RSS such that \(m_i=r_im_i''\), where \(m_i''\) is the number of units selected in each of the \(r_i\) cycles from the \(i^{th}\) selected PSU, and \(m=\sum _{i=1}^nm_i\). If X is used for ranking, then

$$\begin{aligned}&X_{il(j)}=\text { value of}\ X\ \text {for the}\ j^{th} \text {rank in the} \ l^{th}\ \text {cycle of the}\ i^{th}\ \text {selected PSU;}\\&\qquad i=1,2,...,n , l=1,2,...,r_i , j=1,2,...,{m_i}''. \end{aligned}$$

Then, \(Y_{il[j]}\) and \(Z_{il(j)}\) are the values of Y and Z respectively corresponding to \(X_{il(j)}\). Let \(\bar{x}_{2srss},\bar{y}_{2srss}\) and \(\bar{z}_{2srss}\) denote the corresponding sample means of the observations obtained on X, Y and Z such that

$$\begin{aligned}&\bar{y}_{2srss}=\frac{1}{n\bar{M}}\sum _{i=1}^n\sum _{l=1}^{r_i} \sum _{j=1}^{m_i''}\frac{M_i}{r_im_i''}Y_{il[j]},\ \bar{x}_{2srss} =\frac{1}{n\bar{M}}\sum _{i=1}^n\sum _{l=1}^{r_i}\sum _{j=1}^{m_i''} \frac{M_i}{r_im_i''}X_{il(j)}\\&\quad \text {and}\ \bar{z}_{2srss}=\frac{1}{n\bar{M}}\sum _{i=1}^n \sum _{l=1}^{r_i}\sum _{j=1}^{m_i''}\frac{M_i}{r_im_i''}Z_{il(j)} \ \ \text {where}\ \bar{M}=\frac{1}{N}\sum _{i=1}^N M_i \end{aligned}$$
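The selection step and the weighted means above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes that each ranking set is a disjoint group of \(m_i''\) units taken from the preliminary sample (so that \(m_i'\ge r_i{m_i''}^2\)), that ranks are taken in ascending order of the ranking variable, and that the sizes \(M_i\) of all N PSUs are available to compute \(\bar{M}\). The names `rss_select` and `two_stage_rss_mean` are hypothetical.

```python
import random


def rss_select(units, set_size, cycles, rank_key, rng=None):
    """Ranked set sample without replacement from one PSU's preliminary sample.

    Assumes disjoint ranking sets, so cycles * set_size**2 units are needed.
    Returns a list of cycles, each holding set_size selected units.
    """
    rng = rng or random.Random()
    if len(units) < cycles * set_size * set_size:
        raise ValueError("preliminary sample too small for disjoint ranking sets")
    pool = list(units)
    rng.shuffle(pool)
    selected, pos = [], 0
    for _ in range(cycles):
        cycle = []
        for j in range(set_size):
            ranked = sorted(pool[pos:pos + set_size], key=rank_key)
            pos += set_size
            cycle.append(ranked[j])  # keep the unit of rank j+1 from the j-th set
        selected.append(cycle)
    return selected


def two_stage_rss_mean(samples, sampled_sizes, all_sizes):
    """Weighted two-stage mean: (1 / (n * M_bar)) * sum_i M_i * (mean of PSU i).

    samples[i][l][j] is the value at rank j in cycle l of selected PSU i,
    sampled_sizes[i] is M_i for that PSU, all_sizes lists M_1, ..., M_N.
    """
    n = len(samples)
    m_bar = sum(all_sizes) / len(all_sizes)
    total = 0.0
    for psu_cycles, big_m in zip(samples, sampled_sizes):
        count = sum(len(cycle) for cycle in psu_cycles)  # r_i * m_i''
        psu_sum = sum(value for cycle in psu_cycles for value in cycle)
        total += big_m * psu_sum / count
    return total / (n * m_bar)
```

The same `two_stage_rss_mean` function applies to the X, Y and Z values of the selected units, since the three means share the weights \(M_i/(r_im_i'')\).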

Considering that the covariance matrix \(\Sigma \) is known and \(\sigma _x^2=\sigma _y^2=\sigma _z^2=1\) without loss of generality, the joint distribution of \((\bar{x}',\bar{x}_{2srss},\bar{y}_{2srss}, \bar{z}',\bar{z}_{2srss})\) follows a multivariate normal distribution with mean \((\mu _x,\mu _x, \mu _y,\mu _z,\mu _z)\) and covariance matrix given by

$$\begin{aligned} \Sigma =\begin{pmatrix} \frac{1}{m'} &{}\frac{1}{m'}&{} \frac{\rho _{yx}}{m'} &{}\frac{\rho _{xz}}{m'} &{} \frac{\rho _{xz}}{m'}\\ \frac{1}{m'} &{}\frac{1}{m} &{} \frac{\rho _{yx}}{m} &{}\frac{\rho _{xz}}{m'} &{}\frac{\rho _{xz}}{m}\\ \frac{\rho _{yx}}{m'}&{}\frac{\rho _{yx}}{m}&{}\frac{1}{m}&{}\frac{\rho _{yz}}{m'} &{}\frac{\rho _{yz}}{m}\\ \frac{\rho _{xz}}{m'}&{}\frac{\rho _{xz}}{m'}&{}\frac{\rho _{yz}}{m'}&{}\frac{1}{m'}&{}\frac{1}{m'}\\ \frac{\rho _{xz}}{m'}&{}\frac{\rho _{xz}}{m}&{}\frac{\rho _{yz}}{m}&{}\frac{1}{m'}&{}\frac{1}{m} \end{pmatrix} \end{aligned}$$

The preliminary test of the hypotheses (1) at the \(\alpha \) level of significance is carried out as follows. The null hypothesis \(H_{01}\) is accepted when

$$\begin{aligned} \left| \frac{\bar{x}'-\mu _x}{SE(\bar{x}')}\right| \le Z_{\alpha } \end{aligned}$$

and \(H_{02}\) may be accepted when

$$\begin{aligned} \left| \frac{\bar{z}'-\mu _z}{SE(\bar{z}')}\right| \le Z_{\alpha } \end{aligned}$$

If the null hypotheses are accepted, \(\mu _{0x}\) and \(\mu _{0z}\) are used instead of \(\mu _x\) and \(\mu _z\) in the proposed estimator; otherwise, the sample means \(\bar{x}'\) and \(\bar{z}'\) based on the preliminary samples are used. Thus, it follows that

$$\begin{aligned} P_{H_{01}}\left[ \bar{x}'-\frac{Z_{\alpha }}{\sqrt{m'}} \le \mu _x\le \bar{x}'+\frac{Z_{\alpha }}{\sqrt{m'}}\right] =1-\alpha \end{aligned}$$
(2)

and

$$\begin{aligned} P_{H_{02}}\left[ \bar{z}'-\frac{Z_{\alpha }}{\sqrt{m'}} \le \mu _z\le \bar{z}'+\frac{Z_{\alpha }}{\sqrt{m'}}\right] =1-\alpha \end{aligned}$$
(3)

Since we have partial information on \(\mu _x\) and \(\mu _z\), we can also assume \(\mu _{0x}=\mu _{0z}=0\) without loss of generality such that

$$\begin{aligned} H_{01}:\mu _x=0\ \ \text {and}\ \ H_{02}:\mu _z=0 \end{aligned}$$

Equations (2) and (3) imply that

$$\begin{aligned} P_{H_{01}}\left[ -\frac{Z_{\alpha }}{\sqrt{m'}}\le \bar{x}' \le \frac{Z_{\alpha }}{\sqrt{m'}}\right] =1-\alpha \end{aligned}$$

and

$$\begin{aligned} P_{H_{02}}\left[ -\frac{Z_{\alpha }}{\sqrt{m'}}\le \bar{z}' \le \frac{Z_{\alpha }}{\sqrt{m'}}\right] =1-\alpha \end{aligned}$$

Hence, it follows that the hypothesis \(H_{01}\) can be accepted if \(\vert \bar{x}'\vert \le \frac{Z_{\alpha }}{\sqrt{m'}}\) and \(H_{02}\) will be accepted if \(\vert \bar{z}'\vert \le \frac{Z_{\alpha }}{\sqrt{m'}}\).

Therefore, under the above assumptions, we propose the estimator

$$\begin{aligned} T=\left\{ \begin{aligned} \bar{y}_{2srss}-B_{yx}\bar{x}_{2srss}-B_{yz}\bar{z}_{2srss} \ \ \text {if} \,\vert \bar{x}'\vert \le \frac{Z_{\alpha }}{\sqrt{m'}}, \,\vert \bar{z}'\vert&\le \frac{Z_{\alpha }}{\sqrt{m'}}\\ \bar{y}_{2srss}+B_{yx}(\bar{x}'-\bar{x}_{2srss})-B_{yz}\bar{z}_{2srss} \ \ \text {if}\,\vert \bar{x}'\vert>\frac{Z_{\alpha }}{\sqrt{m'}}, \, \vert \bar{z}'\vert&\le \frac{Z_{\alpha }}{\sqrt{m'}}\\ \bar{y}_{2srss}-B_{yx}\bar{x}_{2srss}+B_{yz}(\bar{z}'-\bar{z}_{2srss}) \ \ \text {if} \,\vert \bar{x}'\vert \le \frac{Z_{\alpha }}{\sqrt{m'}}, \,\vert \bar{z}'\vert&> \frac{Z_{\alpha }}{\sqrt{m'}}\\ \bar{y}_{2srss}+B_{yx}(\bar{x}'-\bar{x}_{2srss})+B_{yz}(\bar{z}' -\bar{z}_{2srss})\ \ \text {if}\,\vert \bar{x}'\vert> \frac{Z_{\alpha }}{\sqrt{m'}},\, \vert \bar{z}'\vert&>\frac{Z_{\alpha }}{\sqrt{m'}} \end{aligned}\right. \end{aligned}$$

Here, \(B_{yx}=\frac{\rho _{yx}-\rho _{yz}\rho _{xz}}{1-{\rho _{xz}}^2}\ \ \text {and}\ \ B_{yz}=\frac{\rho _{yz}-\rho _{yx} \rho _{xz}}{1-{\rho _{xz}}^2}\) are known population regression coefficients of Y on X and Z respectively.
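The four branches of T collapse into one expression once each test decides which value stands in for the unknown mean: the prior value 0 on acceptance, the preliminary mean on rejection. The sketch below illustrates this under the assumptions of this section (unit variances, \(\mu _{0x}=\mu _{0z}=0\), known correlations); the function and argument names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proposed_estimate(y_bar, x_bar, z_bar, x_prelim, z_prelim, m_prelim,
                      rho_yx, rho_yz, rho_xz, alpha):
    """Preliminary test regression estimate T with two auxiliary variables."""
    denom = 1 - rho_xz ** 2
    b_yx = (rho_yx - rho_yz * rho_xz) / denom
    b_yz = (rho_yz - rho_yx * rho_xz) / denom
    # Acceptance region |mean| <= Z_alpha / sqrt(m') for both tests.
    cutoff = NormalDist().inv_cdf(1 - alpha / 2) / sqrt(m_prelim)
    # Accepted hypothesis: use the prior value 0; rejected: use the preliminary mean.
    mu_x_used = 0.0 if abs(x_prelim) <= cutoff else x_prelim
    mu_z_used = 0.0 if abs(z_prelim) <= cutoff else z_prelim
    return y_bar + b_yx * (mu_x_used - x_bar) + b_yz * (mu_z_used - z_bar)
```

Writing the estimator this way makes it clear that the two tests act independently on the X and Z adjustments.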

2.1 Bias and Mean Square Error of T

We know that

$$\begin{aligned} E(T)&=E\left( \bar{y}_{2srss}-B_{yx}\bar{x}_{2srss} -B_{yz}\bar{z}_{2srss}\right) +B_{yx}E\left( \bar{x}'\mid \vert \bar{x}'\vert>\frac{Z_{\alpha }}{\sqrt{m'}}\right) \\&\quad P\left( \vert \bar{x}'\vert>\frac{Z_{\alpha }}{\sqrt{m'}}\right) +B_{yz}E\left( \bar{z}'\mid \vert \bar{z}'\vert>\frac{Z_{\alpha }}{\sqrt{m'}}\right) P\left( \vert \bar{z}'\vert >\frac{Z_{\alpha }}{\sqrt{m'}}\right) \\&=\mu _y-Bias(T)\\ \text {where},\, Bias(T)&=B_{yx}\mu _x\{\phi (a)-\phi (b)\}-B_{yx} \frac{\varphi (a)-\varphi (b)}{\sqrt{m'}}+B_{yz}\mu _z\{\phi (A)-\phi (B)\}\\&\quad -B_{yz}\frac{\varphi (A)-\varphi (B)}{\sqrt{m'}} \end{aligned}$$

where \(\varphi (.)\) denotes the density function and \(\phi (.)\) is the cumulative distribution function of N(0, 1)

$$\begin{aligned} \text {and}\ \ a=Z_\alpha -\mu _x\sqrt{m'}; b=-Z_\alpha -\mu _x\sqrt{m'}; A=Z_\alpha -\mu _z\sqrt{m'}; B=-Z_\alpha -\mu _z\sqrt{m'} \end{aligned}$$
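The bias expression can be evaluated numerically with the standard normal distribution from the Python standard library. The sketch below is an illustration only; it assumes unit variances and writes the Z term by symmetry with the X term, including the factor \(\mu _z\). The function name `bias_t` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def bias_t(mu_x, mu_z, m_prelim, b_yx, b_yz, alpha):
    """Bias(T) as defined through E(T) = mu_y - Bias(T), unit variances assumed."""
    std = NormalDist()
    z_alpha = std.inv_cdf(1 - alpha / 2)
    root = sqrt(m_prelim)

    def term(mu, coef):
        upper = z_alpha - mu * root    # a (or A)
        lower = -z_alpha - mu * root   # b (or B)
        prob = std.cdf(upper) - std.cdf(lower)   # cumulative distribution part
        dens = std.pdf(upper) - std.pdf(lower)   # density part
        return coef * (mu * prob - dens / root)

    return term(mu_x, b_yx) + term(mu_z, b_yz)
```

At \(\mu _x=\mu _z=0\) both terms vanish, so T is unbiased when the prior values are correct, and the bias changes sign when the signs of \(\mu _x\) and \(\mu _z\) are reversed.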

Then, the mean square error can be obtained as

$$\begin{aligned} MSE(T)&=Var(T)+\{Bias(T)\}^2\\&=E(T^2)-\mu _y^2+2\mu _y Bias(T) \end{aligned}$$

Using multivariate normal distribution and differentiation under integral sign,

$$\begin{aligned} MSE(T)= \left( \frac{1-B}{m}+\frac{B}{m'}\right) +H \end{aligned}$$

where \(B=2B_{yx}\rho _{yx}+2B_{yz}\rho _{yz}-B_{yx}^2-B_{yz}^2 -2B_{yx}B_{yz}\rho _{xz};\)

$$\begin{aligned} H&=\{\phi (a)-\phi (b)\}\left[ B_{yx}^2\left( \mu _x^2+\frac{1}{m'}\right) -2B_{yx}\rho _{yx}+B_{yx}B_{yz}\left( \mu _x\mu _z+\frac{\rho _{xz}}{m'}\right) \right] \\&\quad + \frac{\varphi (a)-\varphi (b)}{\sqrt{m'}}\left[ -2B_{yx}^2\mu _x +2B_{yx}\rho _{yx}\mu _x-B_{yx}B_{yz}(\mu _z+\rho _{xz}\mu _x)\right] +\frac{a\varphi (a)-b\varphi (b)}{m'}\\&\quad (-B_{yx}^2+2B_{yx}\rho _{yx}-B_{yx}B_{yz}\rho _{xz}) +\{\phi (A)-\phi (B)\}\left[ B_{yz}^2\left( \mu _z^2+\frac{1}{m'}\right) -2B_{yz}\rho _{yz}\right. \\&\quad \left. +B_{yx}B_{yz}\left( \mu _x\mu _z+\frac{\rho _{xz}}{m'}\right) \right] +\frac{\varphi (A)-\varphi (B)}{\sqrt{m'}}\left[ -2B_{yz}^2\mu _z +2B_{yz}\rho _{yz}\mu _z-B_{yx}B_{yz}\right. \\&\quad \left. (\mu _x +\rho _{xz}\mu _z)\right] +\frac{A\varphi (A)-B\varphi (B)}{m'} (-B_{yz}^2+2B_{yz}\rho _{yz}-B_{yx}B_{yz}\rho _{xz}) \end{aligned}$$

3 Comparison

The first quantity in MSE(T) is the variance of the regression estimator in two-stage sampling with RSS using double sampling for two auxiliary variables, i.e.,

$$\begin{aligned} T_1=\bar{y}_{2srss}+B_{yx}(\bar{x}'-\bar{x}_{2srss})+B_{yz}(\bar{z}'-\bar{z}_{2srss}) \end{aligned}$$

Thus, the proposed estimator is compared with the regression estimator \(T_1\). We define the relative efficiency of T with respect to \(T_1\) as

$$\begin{aligned} e=\frac{\frac{1-B}{m}+\frac{B}{m'}}{\left( \frac{1-B}{m} +\frac{B}{m'}\right) +H}=\frac{G}{G+H}\ \ \text {(say)} \end{aligned}$$
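As a numerical cross-check of the quantity G, the variance of \(T_1\) can be computed directly as the quadratic form \(c^{\top }\Sigma c\) with coefficient vector \(c=(B_{yx},-B_{yx},1,B_{yz},-B_{yz})\) in the order \((\bar{x}',\bar{x}_{2srss},\bar{y}_{2srss},\bar{z}',\bar{z}_{2srss})\). The sketch below assumes unit variances and takes \(Var(\bar{y}_{2srss})=1/m\); all function names are hypothetical, and H must be supplied separately.

```python
def regression_coefficients(rho_yx, rho_yz, rho_xz):
    """Coefficients B_yx and B_yz for unit variances."""
    denom = 1 - rho_xz ** 2
    return (rho_yx - rho_yz * rho_xz) / denom, (rho_yz - rho_yx * rho_xz) / denom


def covariance_matrix(m, m_prelim, rho_yx, rho_yz, rho_xz):
    """Covariance of (x', x_2srss, y_2srss, z', z_2srss) with unit variances."""
    p, q = 1 / m_prelim, 1 / m
    return [
        [p, p, rho_yx * p, rho_xz * p, rho_xz * p],
        [p, q, rho_yx * q, rho_xz * p, rho_xz * q],
        [rho_yx * p, rho_yx * q, q, rho_yz * p, rho_yz * q],
        [rho_xz * p, rho_xz * p, rho_yz * p, p, p],
        [rho_xz * p, rho_xz * q, rho_yz * q, p, q],
    ]


def variance_t1(m, m_prelim, rho_yx, rho_yz, rho_xz):
    """Var(T_1) as the quadratic form c' Sigma c."""
    b_yx, b_yz = regression_coefficients(rho_yx, rho_yz, rho_xz)
    c = [b_yx, -b_yx, 1.0, b_yz, -b_yz]
    sigma = covariance_matrix(m, m_prelim, rho_yx, rho_yz, rho_xz)
    return sum(c[i] * sigma[i][j] * c[j] for i in range(5) for j in range(5))


def relative_efficiency(g, h):
    """e = G / (G + H), with H computed elsewhere from its definition."""
    return g / (g + h)
```

For any valid correlations the quadratic form equals \((1-R^2)/m+R^2/m'\), where \(R^2=B_{yx}\rho _{yx}+B_{yz}\rho _{yz}\) is the squared multiple correlation of Y on X and Z; for example, with \(m=20\), \(m'=40\), \(\rho _{yx}=0.7\), \(\rho _{yz}=0.8\) and \(\rho _{xz}=0.6\) it gives approximately 0.0321.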

As we can see from the expression of MSE, the values of e depend on \(m,m',B_{yx},B_{yz},\alpha ,\mu _x,\mu _z\).

The relative efficiency e is computed for different values of \(m',m=rm'',\mu _x,\mu _z,\alpha \). Table 1 presents the values of e when \(\alpha =0.01\) and for different values of \(\rho _{yx}=0.7({\textbf {0.3}}),\rho _{xz}=0.6({\textbf {0.5}})\) and \(\rho _{yz}=0.8({\textbf {0.4}})\).

Table 1 Relative efficiency of T to \(T_1\)

As shown in Table 1, the value of e is very close to 1 at \(\mu _x=\mu _z=0\) but decreases as the values of \(\mu _x\) and \(\mu _z\) increase, which is typical behavior of a preliminary test estimator. The relative efficiency attains the maximum at \(\mu _x=\mu _z=1\), but the efficiency is higher when the correlation between X, Y and Z is high. Also, when the sample size m increases, e increases for each value of \(\mu _x\) and \(\mu _z\).

4 Conclusion

The proposed regression estimator is more efficient than the ordinary preliminary test regression estimator under two-stage sampling along with RSS under certain conditions. In general, however, the population mean of an auxiliary variable is not always available, although the experimenter may have some prior information on it. In such a situation, the partial information can be utilized, and the efficiency of an estimator based on the preliminary test can be quite high. The preliminary test is also convenient to employ. Moreover, the estimator in two-stage sampling with RSS in the second stage is also more efficient than the one based on SRS.

5 Empirical Study

Consider the total number of agricultural labourers as the study variable Y, taken from the 2011 census of Imphal West District, Manipur, India. Let \(X=\) total number of cultivators and \(Z=\) total number of households be the two auxiliary variables. The whole district is divided into 13 subdivisions/towns (PSUs), with villages as SSUs. PSUs with very few SSUs are combined.

It is assumed that the population means of X and Z are partially known. When \(\mu _x\) and \(\mu _z\) are unknown, we can estimate them by double sampling. In this data set, we first select 8 out of 13 PSUs in the first stage. In the second stage, we apply the double sampling procedure and select \(m_i'\) SSUs from each selected PSU. The estimates of \(\mu _x\) and \(\mu _z\) are \(\bar{x}'=161.8\) and \(\bar{z}'=303.1\) respectively.

Then, all \(m_i'\) SSUs are ranked based on the values of X. Samples are selected using RSS without replacement. We get

$$\begin{aligned} \bar{y}_{2srss}=21.1,\ \bar{x}_{2srss}=157.2,\ \bar{z}_{2srss}=347.5. \end{aligned}$$

Suppose we have partial information on the auxiliary variables from the previous census data (2001 Census, Imphal West District, Manipur), from which the prior values are computed as \(\mu _{0x}=101.7\) and \(\mu _{0z}=329.5\). We need to utilize the preliminary samples to test the hypotheses

$$\begin{aligned}&H_{01}:\mu _x=\mu _{0x}\ \text {against}\ H_{11}:\mu _x\ne \mu _{0x}\\&\quad \text {and}\ H_{02}:\mu _z=\mu _{0z}\ \text {against}\ H_{21}:\mu _z\ne \mu _{0z} \end{aligned}$$

At \(\alpha =0.01\), the test statistic, say Z, satisfies \(\vert Z\vert >Z_\alpha \) under \(H_{01}\). Therefore, \(H_{01}\) is rejected. Since reliable information on \(\mu _x\) is not available, \(\bar{x}'\) is used instead of \(\mu _{0x}\) in the estimator T.

Under \(H_{02}\), \(\vert Z\vert <Z_\alpha \). Since reliable partial information on \(\mu _z\) is available, \(H_{02}\) is accepted and \(\mu _{0z}\) is used in the estimator T. The corresponding \(99\%\) confidence interval for \(\mu _z\) is (243.3, 362.8), which contains \(\mu _{0z}=329.5\).
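The decision for \(H_{02}\) is equivalent to checking whether the prior value lies inside the reported confidence interval, which can be sketched as below. Only figures quoted in this section are used; the standard error is not reported in the text, so the value backed out from the interval width is an assumption of this sketch, as is the helper name `accepts_prior`.

```python
from statistics import NormalDist


def accepts_prior(prior_value, interval):
    """True when the prior value falls inside the confidence interval."""
    lower, upper = interval
    return lower <= prior_value <= upper


ci_z = (243.3, 362.8)   # reported 99% interval for mu_z
mu_0z = 329.5           # prior value from the 2001 census
z_alpha = NormalDist().inv_cdf(1 - 0.01 / 2)

# Standard error implied by the interval half-width (not reported in the text).
implied_se = (ci_z[1] - ci_z[0]) / 2 / z_alpha

print(accepts_prior(mu_0z, ci_z))  # True
```

The midpoint of the interval, about 303.05, agrees with \(\bar{z}'=303.1\) up to rounding.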

Thus, the preliminary test regression estimator T gives 14.8 as the estimate of the population mean of Y, while the usual regression estimator gives \(T_1=4.3\). Further, \(MSE(T)=0.3\) and \(MSE(T_1)=1.5\); hence, our proposed estimator T is more efficient than \(T_1\).