Abstract
This paper proposes a preliminary test regression estimator of the population mean of a study variable Y under two-stage sampling, with ranked set sampling used in the second stage, when partial information is available on both of two auxiliary variables X and Z. The variables (X, Y, Z) are assumed to follow a trivariate normal distribution. The bias and mean square error of the proposed estimator are derived, and the estimator is compared, analytically and numerically, with the usual regression estimator with two auxiliary variables under the same sampling scheme.
1 Introduction
The use of an auxiliary variable X that is correlated with the study variable Y is well known to increase the efficiency of an estimator. When the relationship between them is linear but the line does not pass through the origin, the linear regression estimator can be used, provided the population mean \(\mu _x\) is known. In practice, however, \(\mu _x\) is usually unknown, so double sampling is used: a preliminary sample is taken to estimate it.
Partial information about \(\mu _x\) may also be available, in which case we can perform a preliminary test and construct an estimator based on its outcome. Han (1973a) takes \((x_i,y_i);i=1,2,...,n\) to be a bivariate normal sample with unknown means \(\mu _x,\mu _y\), known variances \(\sigma _x^2,\sigma _y^2\) and correlation coefficient \(\rho \), and considers a preliminary test of \(H_0:\mu _x=\mu _0\) against \(H_1:\mu _x\ne \mu _0\),
where \(\mu _0\) is the value of \(\mu _x\) obtained from the partial information on it. Assuming \(\mu _0=0\) without loss of generality, the preliminary test estimator (PTE) is constructed as
where \(\bar{x}\) and \(\bar{y}\) are the sample means of X and Y respectively, \(\alpha \) is the level of the preliminary test and \(Z_{\alpha }\) is the \(100(1-\alpha /2)\%\) point of N(0, 1). The estimator with \(\mu _0\) is used when \(H_0\) is accepted; if \(H_0\) is rejected, \(\bar{y}\) is used as the estimator.
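The accept/reject rule described here can be sketched in code. This is only an illustration under stated assumptions, not Han's published formula: the regression-type estimate used on acceptance is assumed to be \(\bar{y}-\rho (\sigma _y/\sigma _x)(\bar{x}-\mu _0)\), the usual regression estimator with a known coefficient, and the function name and arguments are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def preliminary_test_estimate(x_bar, y_bar, n, rho, sigma_x, sigma_y,
                              mu_0=0.0, alpha=0.05):
    """Return (estimate, accepted) for a one-auxiliary-variable preliminary test."""
    # Two-sided critical value: the 100(1 - alpha/2)% point of N(0, 1).
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Standardised distance of the sample mean of X from the prior value mu_0.
    statistic = sqrt(n) * (x_bar - mu_0) / sigma_x
    accepted = abs(statistic) <= z_alpha
    if accepted:
        # H0 accepted: plug the prior value into a regression-type estimate
        # (assumed form, with known coefficient rho * sigma_y / sigma_x).
        return y_bar - rho * (sigma_y / sigma_x) * (x_bar - mu_0), True
    # H0 rejected: fall back on the plain sample mean of Y.
    return y_bar, False
```

For example, with \(n=25\), unit variances and \(\rho =0.8\), a sample mean \(\bar{x}=0.1\) gives a standardised statistic of 0.5, so \(H_0\) is accepted at the 5% level and the regression-type estimate is returned.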
In a second paper, Han (1973b) uses double sampling to estimate the unknown \(\mu _x\). When partial information on \(\mu _x\) is also available, he performs the same preliminary test and defines the estimator as
where \(\bar{x}'\) is the sample mean of \(n'\) observations from double sampling. If \(H_0\) is accepted, the prior value \(\mu _0\) is used; otherwise the sample mean \( \overline{x}^{\prime} \) based on the preliminary sample in double sampling is used.
For two auxiliary variables, Das and Bez (1995) have constructed a preliminary test regression estimator with double sampling which is found to be more efficient than the ordinary regression estimator. Khongji and Das (2012) have also worked on some preliminary test estimators in double sampling using stratification.
In this paper, we construct a preliminary test regression estimator in two-stage cluster sampling, using ranked set sampling (RSS) in the second stage together with double sampling. RSS was introduced by McIntyre (1952) for estimating yield in settings where measuring the study variable is difficult but judgment ranking is possible. Stokes (1977) showed that the variance of the RSS sample mean, with ranking based on a concomitant variable, is smaller than that of the sample mean under simple random sampling (SRS).
Sud and Mishra (2006) studied the estimation of a finite population mean in two-stage RSS designs. Nematollahi et al. (2008) used RSS with replacement in the second stage of two-stage sampling and showed that two-stage cluster sampling with RSS (TSCRSS) is more efficient than two-stage sampling with SRS. Ozturk (2019) considered RSS with and without replacement in two-stage cluster sampling; using the sample mean as the estimator, he found that the efficiency depends on the intra-cluster correlation coefficient and on the sampling design.
Since we use a regression estimator of \(\mu _y\), the population mean of the auxiliary variable must be known, yet it is usually unknown. An experimenter may have partial information on it from other sources and may believe that it equals some value, which we call a prior value, without knowing this for certain. A hypothesis test is therefore needed to assess the prior value. Assuming that partial information is available on the auxiliary variables, we study whether the proposed estimator can outperform the conventional regression estimator in two-stage sampling with RSS without replacement in the second stage, even when there is error in ranking. Throughout this paper we consider the situation where ranking is imperfect. When the ranking is no better than random, the sample obtained through RSS can be regarded as almost a simple random sample.
2 Development of the Proposed Estimator
Suppose X and Z are two auxiliary variables and we have partial information on both. Following Han (1973b), we can construct a preliminary test estimator using double sampling to utilize the partial information. Let (X, Y, Z) follow a trivariate normal distribution with means \(\mu _x, \mu _y,\mu _z\), variances \(\sigma _x^2,\sigma _y^2,\sigma _z^2\) respectively and correlation coefficients \(\rho _{yx},\rho _{yz},\rho _{xz}\). Taking \(\mu _{0x}\) and \(\mu _{0z}\) to be the prior values of \(\mu _x\) and \(\mu _z\) respectively, we can perform a preliminary test of the hypotheses \(H_{01}:\mu _x=\mu _{0x}\) and \(H_{02}:\mu _z=\mu _{0z}\), referred to below as (1).
For selecting samples we will implement RSS in the second stage of the two-stage sampling:
Assume that a population has N primary sampling units (PSUs) or first stage units, each having \(M_i;i=1,2,...,N\) secondary stage units (SSUs). In the first stage sampling, n units are selected from N PSUs with SRS without replacement (SRSWOR).
In the second stage, we select a sample of SSUs from each selected PSU. Since \(\mu _x\) and \(\mu _z\) are assumed unknown, double sampling is used to estimate them first, by selecting \(m_i'\) SSUs from each selected PSU using SRSWOR. Let \(\bar{x}'\) and \(\bar{z}'\) be the sample means of X and Z respectively obtained from these preliminary samples, of total size \(m'=\sum _{i=1}^nm_i'\). Since we are using RSS, the preliminary sample within each selected PSU is ranked according to an auxiliary variable. With two auxiliary variables available, the one more highly correlated with the study variable is used for ranking. Select the largest unit from the first PSU, the second largest from the second PSU, and continue until the smallest of the \(i^{th}\) PSU, \(i=1,2,...,n\), is chosen. Different sample sizes are used for each cluster.
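As an illustration of ranking on an auxiliary variable, the following is a minimal sketch of standard ranked set sampling (in McIntyre's sense) within a single PSU. It is not necessarily the exact selection scheme used in this paper. The function name, the `set_size` and `cycles` parameters and the dictionary layout of the units are all hypothetical. In each cycle, `set_size` disjoint sets of `set_size` units are drawn without replacement, each set is ranked on the auxiliary variable only, and the unit with rank j is kept from the j-th set.

```python
import random


def ranked_set_sample(units, set_size, cycles, rank_key, seed=None):
    """Draw a ranked set sample of size set_size * cycles from `units`."""
    rng = random.Random(seed)
    needed = cycles * set_size * set_size
    if needed > len(units):
        raise ValueError("not enough units for the requested sets and cycles")
    # Draw without replacement, so no unit can enter two different sets.
    pool = rng.sample(units, needed)
    selected = []
    for c in range(cycles):
        for j in range(set_size):
            start = (c * set_size + j) * set_size
            # Rank one set of `set_size` units on the auxiliary variable only.
            ranked = sorted(pool[start:start + set_size], key=rank_key)
            # Keep the unit holding rank j + 1 in the j-th set of this cycle.
            selected.append(ranked[j])
    return selected
```

A call such as `ranked_set_sample(psu_units, set_size=3, cycles=2, rank_key=lambda u: u["x"])` would return six units. The study variable is read only for these six, which is the practical appeal of ranking on a cheap concomitant.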
Suppose we select \(m_i(< m_i')\) units using RSS such that \(m_i=r_im_i''\), where \(m_i''\) is the number of units selected in each of the \(r_i\) cycles from the \(i^{th}\) selected PSU, and \(m=\sum _{i=1}^nm_i\). If X is used for ranking, then
Then, \(Y_{il[j]}\) and \(Z_{il(j)}\) are the values of Y and Z respectively corresponding to \(X_{il(j)}\). Let \(\bar{x}_{2srss},\bar{y}_{2srss}\) and \(\bar{z}_{2srss}\) denote the corresponding sample means of the observations obtained on X, Y and Z such that
Assuming that the covariance matrix \(\Sigma \) is known and, without loss of generality, that \(\sigma _x^2=\sigma _y^2=\sigma _z^2=1\), the vector \((\bar{x}',\bar{x}_{2srss},\bar{y}_{2srss}, \bar{z}',\bar{z}_{2srss})\) follows a multivariate normal distribution with mean vector \((\mu _x,\mu _x, \mu _y,\mu _z,\mu _z)\) and covariance matrix given by
The preliminary test of the hypotheses (1) at level of significance \(\alpha \) is carried out as follows. The null hypothesis \(H_{01}\) can be accepted when
and \(H_{02}\) may be accepted when
If the null hypotheses are accepted, \(\mu _{0x}\) and \(\mu _{0z}\) are used instead of \(\mu _x\) and \(\mu _z\) in the proposed estimator; otherwise, the sample means \(\bar{x}'\) and \(\bar{z}'\) based on the preliminary samples are used. Thus, it follows that
and
Since we have partial information on \(\mu _x\) and \(\mu _z\), we can also assume \(\mu _{0x}=\mu _{0z}=0\) without loss of generality such that
Equations (2) and (3) imply that
and
Hence, it follows that the hypothesis \(H_{01}\) can be accepted if \(\vert \bar{x}'\vert \le \frac{Z_{\alpha }}{\sqrt{m'}}\) and \(H_{02}\) will be accepted if \(\vert \bar{z}'\vert \le \frac{Z_{\alpha }}{\sqrt{m'}}\).
Therefore, we propose the following estimator under the above assumptions as
Here, \(B_{yx}=\frac{\rho _{yx}-\rho _{yz}\rho _{xz}}{1-{\rho _{xz}}^2}\ \ \text {and}\ \ B_{yz}=\frac{\rho _{yz}-\rho _{yx} \rho _{xz}}{1-{\rho _{xz}}^2}\) are known population regression coefficients of Y on X and Z respectively.
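The acceptance rules and coefficients above can be combined into a short sketch. The form of T below is an assumption pieced together from the surrounding description (prior value used when the corresponding hypothesis is accepted, preliminary-sample mean otherwise, unit variances). It should be read as an illustration rather than the authors' exact expression, and all function and argument names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def regression_coefficients(rho_yx, rho_yz, rho_xz):
    """Partial regression coefficients of Y on X and Z for unit variances."""
    denom = 1.0 - rho_xz ** 2
    b_yx = (rho_yx - rho_yz * rho_xz) / denom
    b_yz = (rho_yz - rho_yx * rho_xz) / denom
    return b_yx, b_yz


def preliminary_test_regression(y_bar, x_bar, z_bar, x_prelim, z_prelim,
                                m_prelim, b_yx, b_yz,
                                mu_0x=0.0, mu_0z=0.0, alpha=0.01):
    """Assumed two-auxiliary-variable preliminary test regression estimate."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Acceptance threshold Z_alpha / sqrt(m') for unit variances.
    threshold = z_alpha / sqrt(m_prelim)
    # Use the prior value when its hypothesis is accepted,
    # otherwise the preliminary-sample mean.
    x_ref = mu_0x if abs(x_prelim - mu_0x) <= threshold else x_prelim
    z_ref = mu_0z if abs(z_prelim - mu_0z) <= threshold else z_prelim
    return y_bar - b_yx * (x_bar - x_ref) - b_yz * (z_bar - z_ref)
```

With \(\rho _{yx}=0.7\), \(\rho _{yz}=0.8\) and \(\rho _{xz}=0.6\), `regression_coefficients` gives \(B_{yx}=0.34375\) and \(B_{yz}=0.59375\). The two hypotheses are tested separately, so the prior value may be used for one auxiliary variable and the preliminary-sample mean for the other.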
2.1 Bias and Mean Square Error of T
We know that
where \(\varphi (.)\) denotes the density function and \(\phi (.)\) is the cumulative distribution function of N(0, 1)
Then, the mean square error can be obtained as
Using multivariate normal distribution and differentiation under integral sign,
where \(B=B_{yx}^2+B_{yz}^2-2B_{yx}\rho _{yx}-2B_{yz}\rho _{yz} +2B_{yx}B_{yz}\rho _{xz};\)
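As a check on the constant B, the following short derivation uses only the definitions of \(B_{yx}\) and \(B_{yz}\) given earlier. It is added for orientation and is not taken from the original derivation. Because the two coefficients satisfy the normal equations, B reduces to a simpler form that is never positive.

```latex
\begin{align*}
B_{yx}+B_{yz}\rho_{xz} &= \rho_{yx}, \qquad B_{yx}\rho_{xz}+B_{yz} = \rho_{yz},\\
B_{yx}^2+B_{yz}^2+2B_{yx}B_{yz}\rho_{xz}
  &= B_{yx}\left(B_{yx}+B_{yz}\rho_{xz}\right)+B_{yz}\left(B_{yz}+B_{yx}\rho_{xz}\right)
   = B_{yx}\rho_{yx}+B_{yz}\rho_{yz},\\
B &= -\left(B_{yx}\rho_{yx}+B_{yz}\rho_{yz}\right).
\end{align*}
```

For instance, with \(\rho _{yx}=0.7\), \(\rho _{yz}=0.8\) and \(\rho _{xz}=0.6\) we have \(B_{yx}=0.34375\) and \(B_{yz}=0.59375\), so \(B=-0.715625\).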
3 Comparison
The first term of MSE(T) is the variance of the regression estimator in two-stage sampling with RSS using double sampling for two auxiliary variables, i.e.,
Thus, the proposed estimator is compared with the regression estimator \(T_1\). We define the relative efficiency of T with respect to \(T_1\) as
As we can see from the expression of MSE, the values of e depend on \(m,m',B_{yx},B_{yz},\alpha ,\mu _x,\mu _z\).
The relative efficiency e is computed for different values of \(m',m=rm'',\mu _x,\mu _z,\alpha \). Table 1 presents the values of e when \(\alpha =0.01\) for two sets of correlations: \(\rho _{yx}=0.7,\rho _{xz}=0.6,\rho _{yz}=0.8\) and, shown in bold, \(\rho _{yx}=0.3,\rho _{xz}=0.5,\rho _{yz}=0.4\).
As shown in Table 1, the value of e is very close to 1 at \(\mu _x=\mu _z=0\) but decreases as the values of \(\mu _x\) and \(\mu _z\) increase, which is typical behavior of a preliminary test estimator. The relative efficiency attains its maximum at \(\mu _x=\mu _z=1\), and it is higher when the correlations among X, Y and Z are high. Also, as the sample size m increases, e increases for each value of \(\mu _x\) and \(\mu _z\).
4 Conclusion
The proposed preliminary test regression estimator is more efficient than the ordinary regression estimator under two-stage sampling with RSS under certain conditions. In general, the population mean of an auxiliary variable is not always known, but the experimenter may have some prior information on it. In such a situation the partial information can be utilized through a preliminary test, which is convenient to employ and yields an estimator of quite high efficiency. Moreover, the estimator under two-stage sampling with RSS in the second stage is also more efficient than its counterpart under SRS.
5 Empirical Study
Consider the total number of agricultural labourers, from the 2011 census of Imphal West District, Manipur, India, as the study variable Y. Let \(X=\) total number of cultivators and \(Z=\) total number of households be the two auxiliary variables. The whole district is divided into 13 subdivisions/towns (PSUs), with villages as SSUs. PSUs with very few SSUs are combined.
It is assumed that the population means of X and Z are partially known. When \(\mu _x\) and \(\mu _z\) are unknown, we can estimate them using double sampling. In these data, we first select 8 of the 13 PSUs in the first stage. In the second stage, we apply the double sampling procedure, selecting \(m_i'\) SSUs from each selected PSU. The estimates of \(\mu _x\) and \(\mu _z\) are \(\bar{x}'=161.8\) and \(\bar{z}'=303.1\) respectively.
Then, all \(m_i'\) SSUs are ranked based on the values of X. Samples are selected using RSS without replacement. We get
Suppose we have partial information on the auxiliary variables from the previous census data (2001 Census, Imphal West District, Manipur), which give the prior values \(\mu _{0x}=101.7\) and \(\mu _{0z}=329.5\). We use the preliminary samples to test the hypotheses \(H_{01}:\mu _x=101.7\) and \(H_{02}:\mu _z=329.5\).
At \(\alpha =0.01\), the test statistic for \(H_{01}\) satisfies \(\vert Z\vert >Z_\alpha \), so \(H_{01}\) is rejected. Since the prior information on \(\mu _x\) is not found to be reliable, \(\bar{x}'\) is used instead of \(\mu _{0x}\) in the estimator T.
For \(H_{02}\), \(\vert Z\vert <Z_\alpha \), so \(H_{02}\) is accepted: the partial information on \(\mu _z\) is taken to be reliable, and \(\mu _{0z}\) is used in the estimator T. The corresponding \(99\%\) confidence interval for \(\mu _z\) is (243.3, 362.8).
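The reported figures for Z can be cross-checked with a few lines of arithmetic. The sketch below assumes that the interval is the symmetric normal-theory interval \(\bar{z}'\pm Z_\alpha \times \) (standard error). The standard error is not stated in the text, so the value used here is only the one implied by the interval's half-width.

```python
from statistics import NormalDist

alpha = 0.01
z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 2.576

lower, upper = 243.3, 362.8  # reported 99% confidence interval
z_prelim = 303.1             # reported preliminary-sample mean of Z
mu_0z = 329.5                # prior value from the 2001 census

# Assumption: the interval is z_prelim +/- z_alpha * SE,
# so its half-width recovers the standard error.
half_width = (upper - lower) / 2.0
implied_se = half_width / z_alpha

# Standardised distance between the preliminary mean and the prior value.
statistic = abs(z_prelim - mu_0z) / implied_se
accept_h02 = statistic <= z_alpha
contains_prior = lower <= mu_0z <= upper
print(accept_h02, contains_prior)  # prints: True True
```

Under this reading the interval's midpoint agrees with \(\bar{z}'\) up to rounding, and the prior value 329.5 lies inside the interval, which is equivalent to accepting \(H_{02}\) at the 1% level.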
Thus, the preliminary test regression estimator gives \(T=14.8\) as the estimate of the population mean of Y, while the usual regression estimator gives \(T_1=4.3\). Furthermore, \(MSE(T)=0.3\) and \(MSE(T_1)=1.5\); hence, the proposed estimator T is more efficient than \(T_1\).
References
Cochran WG (1977) Sampling techniques, 3rd edn. John Wiley and Sons, New York
Das G, Bez K (1995) Preliminary test estimators in double sampling with two auxiliary variables. Commun Stat Theory Methods 24(5):1211–1226
Han CP (1973a) Regression estimation for bivariate normal distributions. Ann Inst Stat Math 25(1):335–344
Han CP (1973b) Double sampling with partial information on auxiliary variables. J Am Stat Assoc 68(344):914–918
Khongji P, Das G (2012) Studies on some preliminary test estimators in double sampling. J Indian Soc Agric Stat 66(3):381–390
McIntyre GA (1952) A method for unbiased selective sampling, using ranked sets. Aust J Agric Res 3(4):385–390. https://doi.org/10.1198/000313005X54180
Nematollahi N, Salehi M, Aliakbari Saba R (2008) Two-stage cluster sampling with ranked set sampling in the secondary sampling frame. Commun Stat Theory Methods 37(15):2404–2415. https://doi.org/10.1080/03610920801919684
Ozturk O (2019) Two-stage cluster samples with ranked set sampling designs. Ann Inst Stat Math 71(1):63–91. https://doi.org/10.1007/s10463-017-0623-z
Stokes SL (1977) Ranked set sampling with concomitant variables. Commun Stat Theory Methods 6(12):1207–1211
Sud U, Mishra D (2006) Estimation of finite population mean using ranked set two stage sampling designs. J Indian Soc Agric Stat 60
Sukhatme PV, Sukhatme BV (1970) Sampling theory of surveys with applications. Iowa State University Press, Ames
Funding
No funds, grants, or other support was received.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
About this article
Cite this article
Yanglem, W., Khongji, P. Preliminary Test Regression Estimator in Double Sampling Based on Two-Stage and Ranked Set Sampling with Two Auxiliary Variables. J Indian Soc Probab Stat 24, 55–63 (2023). https://doi.org/10.1007/s41096-022-00144-3
DOI: https://doi.org/10.1007/s41096-022-00144-3