Abstract
Analysis of censored environmental data has been of special interest to many scientists and practitioners in recent years. Numerous works have been published on modeling bivariate environmental data when variables of interest are below some detection limits. Depending on the problem, one or both of the variables may be unobserved. These situations arise especially in modeling the joint distributions of environmental variables in areas such as flood, drought and epidemiological studies. Some values of these variables cannot be observed because they fall below certain detection thresholds. Because of this censored structure, it is difficult to assess the validity of proposed bivariate distributions. Moreover, researchers working on practical environmental problems need a simple goodness-of-fit test. This motivates us to propose a goodness-of-fit test for location-scale type bivariate distributions with censored data. The asymptotic null distribution of the proposed test statistic is shown to be Chi-square. A simulation study is carried out to show the power performance of the test. A real environmental data set from the literature is analyzed to illustrate the efficacy of our proposed test.
1 Introduction
Bivariate distributions are very often used for describing and modeling environmental phenomena in areas such as drought, flood, environmental health, epidemiology and ocean research. Therefore, it is very important to test the fit and choose the right bivariate distribution for statistical modeling. This is achieved through goodness-of-fit tests, many of which have been proposed for testing multivariate normality; see Sürücü (2006). A significant number of environmental data sets contain censored observations. Censoring is mostly due to observations falling below or above a threshold point, which necessitates a new easy-to-use test of fit.
In the literature, many researchers have worked on bivariate environmental data sets. In particular, the bivariate normal (mostly after a transformation) and bivariate lognormal distributions have been widely used because they represent the behaviour of various environmental variables well. Sackl and Bergmann (1987) and Goel et al. (1998) used the bivariate normal distribution to represent the joint distribution of flood peaks and volumes. Yue (1999) applied the bivariate normal distribution to the joint distribution of flood peak and volume, as well as to that of flood volume and duration, in flood frequency analysis; see also the work by Karmakar and Simonovic (2008) for bivariate flood modeling. Yue (2000, 2002) utilized the bivariate lognormal distribution in multivariate flood and storm problems. Johansen (2004) also worked on bivariate flood frequency analysis by assuming a bivariate lognormal structure. Repko et al. (2004) used the bivariate lognormal distribution for the bivariate description of extreme wave heights and wave periods. Vangelis et al. (2011) assumed a bivariate normal distribution to model the joint distribution of precipitation and potential evapotranspiration for drought severity assessment.
Censored environmental data have also received attention in recent work; see, for example, Frances and Salas (1994), Wang (1990, 1996), Durrans (1996), Zafirakou-Koulouris et al. (1998), Smith and Burns (1998) and Tate and Freeman (2000). Many researchers have worked specifically on bivariate distributions with censored data. Lu et al. (1999) analyzed low-flow quantiles below a limit for water resources planning and management. They used the bivariate lognormal distribution to model the two-dimensional censored data. Beersma and Buishand (2004) fitted a joint density to rainfall and runoff deficits. They used a bivariate normal distribution (transformed data) to jointly model the maximum precipitation deficit and discharge deficit for the assessment of the economic damage due to drought. Some values of discharge deficit were censored at a low threshold value. Freeman and Modarres (2006) utilized several bivariate distributions, including the bivariate normal, to model censored weekly water quality monitoring data. Chu et al. (2008) considered a left-censored bivariate normal model to estimate the correlation coefficient between two biomarkers, which are very important in environmental health and epidemiological studies. They also discussed the importance of estimating the distributions of such left-censored data sets.
There are very few works in the literature on goodness-of-fit tests for multivariate distributions with censored data. Tempelman and Akritas (1996), for example, proposed a class of goodness-of-fit procedures for testing multivariate distributions under random censoring in their theoretical work. In another work, Andersen et al. (2005) proposed a class of goodness-of-fit tests for a copula based on bivariate right-censored data. Wang (2010) also developed a goodness-of-fit test for the Archimedean copula family with right-censored bivariate data. In this paper, we propose a bivariate goodness-of-fit test for censored data sets (type I censoring) when the distributions are of location-scale type. Since the bivariate normal distribution is extensively used in environmental modeling, we develop our statistic under the assumption of bivariate normality (also for bivariate lognormality). The proposed test is simple to compute, which makes it practical for researchers working in applied fields.
2 Tests based on complete samples
The probability density function (pdf) of the bivariate normal distribution is given by
where
\(\mu _1\) and \(\sigma _1\) being the mean and the SD of the random variable X, \(\mu _2\) and \(\sigma _2\) being the mean and the SD of the random variable Y, and \(\rho \) being the correlation coefficient between X and Y.
It is also possible to write (1) as
or
where
and
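For reference, a standard way to write the bivariate normal density with the parameters defined above, together with its factorization into the marginal density of X and the conditional density of Y given X, is sketched below. The arrangement and the symbol \(\sigma_{2|1}\) for the conditional SD are assumptions made for illustration and need not match the exact form of (1)-(3).

```latex
f(x,y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}
  \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[
    \left(\frac{x-\mu_1}{\sigma_1}\right)^2
    - 2\rho \left(\frac{x-\mu_1}{\sigma_1}\right)\left(\frac{y-\mu_2}{\sigma_2}\right)
    + \left(\frac{y-\mu_2}{\sigma_2}\right)^2 \right] \right\},
  \qquad -\infty < x, y < \infty .

% Factorization into marginal and conditional parts:
f(x,y) = f(x)\, f(y \mid x), \qquad
X \sim N(\mu_1, \sigma_1^2), \qquad
Y \mid X = x \sim N\!\left(\mu_2 + \beta (x-\mu_1),\ \sigma_{2|1}^2\right),

\beta = \rho\,\frac{\sigma_2}{\sigma_1}, \qquad
\sigma_{2|1}^2 = \sigma_2^2 (1-\rho^2).
```

Under this factorization, \(Y-\beta X\) has zero covariance with X, which is the property that the construction of \(q_1\) and \(q_2\) relies on.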
Sürücü (2006) proposed a test procedure for assessing multivariate distributions of location-scale type. For the bivariate normal case, he wrote the joint distribution as in (2) (or 3) and considered the random observations \(q_{1i}=x_i\) and \(q_{2i}=y_i-{\hat{\beta }}{x_i}\) \((i=1,\ldots ,n)\) where \(\hat{\beta }\) is the least squares estimator of the regression coefficient \(\beta =\rho {\sigma _2 \over \sigma _1}\). Under bivariate normality, it was noted that \(q_1\) and \(q_2\) are independently distributed as normal for large n. Conducting independent normality tests on the samples \(q_{1i}\) and \(q_{2i}\) and combining them through a linear form led to the very powerful test statistics \(Z_2\), \(C_2\) and \(R_2\) for testing bivariate normality (Sürücü 2006). In particular, the test \(Z_2\) is very useful for testing other location-scale type multivariate distributions and can also be extended to censored and truncated structures.
\(Z_2\) statistic: The statistic \(Z_2\) defined by Sürücü (2006) is very flexible for complete, truncated and censored samples. It is given by
where
is the Tiku statistic defined for testing any univariate location-scale distribution; see Tiku (1980). The quantity \(G_{ij}={{q_{j(i+1)}-q_{j(i)}}\over {\mu _{i+1:n}-\mu _{i:n}}}\) describes the sample spacings where \(q_{j(i)}\) is the ith order statistic of the sample \(q_j\) \((j=1,2)\) and \(\mu _{i:n}\) is the expected value of the ith order statistic of a sample of size n from the standard normal distribution. The asymptotic null distribution of \(Z_j^*\) is known to be normal N(1, V); see Tiku (1980) and Sürücü (2008). Then, the asymptotic null distribution of \(Z_2\) is Chi-square with 2 degrees of freedom (Sürücü 2006).
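The spacings-based statistic described above can be sketched in a few lines. The sketch assumes the complete-sample form usually attributed to Tiku (1980), \(Z^*=2\sum_{i=1}^{n-2}(n-1-i)G_i/\{(n-2)\sum_{i=1}^{n-1}G_i\}\), and replaces the exact expected order statistics \(\mu_{i:n}\) by Blom's approximation; both choices, and the function names `normal_scores` and `tiku_z`, are assumptions made for illustration.

```python
from statistics import NormalDist


def normal_scores(n):
    """Approximate E[i-th standard normal order statistic] by Blom's formula (an assumption)."""
    std = NormalDist()
    return [std.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def tiku_z(sample, scores=None):
    """Spacings-ratio statistic Z* for one sample (assumed complete-sample form).

    `scores` may supply other expected order statistics; by default the
    approximate standard normal scores are used.
    """
    q = sorted(sample)
    n = len(q)
    if n < 3:
        raise ValueError("at least three observations are needed")
    mu = normal_scores(n) if scores is None else list(scores)
    # g[i - 1] holds the normalized spacing G_i, i = 1, ..., n-1
    g = [(q[i + 1] - q[i]) / (mu[i + 1] - mu[i]) for i in range(n - 1)]
    # weighted sum over i = 1, ..., n-2 with weights (n - 1 - i)
    weighted = sum((n - 1 - i) * g[i - 1] for i in range(1, n - 1))
    return 2.0 * weighted / ((n - 2) * sum(g))
```

Because the statistic is a ratio of linear functions of spacings, it does not change when the sample is shifted or rescaled, and it equals 1 when every normalized spacing is the same.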
3 Tests based on censored samples
Censored bivariate normal (or lognormal) data often arise in environmental studies. Some of these censoring schemes happen in the following ways:
(i) One of the bivariate normal variables may depend on the other one. Some observations of the independent variable may not be observed as they are below (or above) a detection limit.
(ii) One of the bivariate normal variables may depend on the other one. Some observations of the dependent variable may not be observed as they are below (or above) a detection limit.
(iii) Two components of the bivariate normal may simply have a correlation with each other but there is no dependency structure. Some observations of one of the variables may not be observed as they are below (or above) a detection limit.
(iv) For both bivariate normal variables, there can be some unobservable values as mentioned in the cases above; see for example, Lu et al. (1999).
In this paper, we will consider the cases (i) and (iii) with unobservable variables below a detection limit (left censoring). The same procedures can easily be applied to cases with right censoring. The cases (ii) and (iv) are, however, difficult to handle since they deal with the distribution of the order statistics given their concomitants.
Case (i): Let us assume the random variable Y depends on the random variable X and r observations of the independent random variable X are unobservable since they are below the detection limit d. Then, the likelihood function can be written as
where
and
\(x_{(i)}\) being the ith order statistic of X, \(y_{[i]}\) being the concomitant value for Y corresponding to \(x_{(i)}\), \(\beta = \rho {\sigma _2\over \sigma _1}\) and F is the cumulative distribution function (cdf) of the normal distribution. The conditional variance of \(Y_i\) given \(x_i\) is denoted by \(\sigma _{2|1} ^2.\) Define \(q_{1i}^*=x_{(i)}\) and \(q_{2i}^*=y_{[i]}-\hat{\beta }x_{(i)}\) \((i=r+1,\ldots ,n).\) As in the case of complete samples, note that \(q_{1i}^*\) and \(q_{2i}^*\) are asymptotically independent for large n.
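The construction of \(q_{1i}^*\) and \(q_{2i}^*\) from left-censored pairs can be illustrated as follows. This is a minimal sketch: the function name `censored_q_samples` is hypothetical, and the slope is estimated by ordinary least squares on the retained pairs as a simple stand-in for the estimator of \(\beta\) used in the paper.

```python
def censored_q_samples(x, y, d):
    """Build q1* and q2* from pairs whose x-value is at or above the detection limit d.

    Returns (r, q1, q2, beta_hat), where r is the number of censored pairs.
    beta_hat is an ordinary least squares slope (a stand-in, not the paper's estimator).
    """
    # Keep observable pairs and order them by x, so each y is the concomitant of x_(i).
    pairs = sorted((xi, yi) for xi, yi in zip(x, y) if xi >= d)
    r = len(x) - len(pairs)
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    m = len(xs)
    xbar = sum(xs) / m
    ybar = sum(ys) / m
    sxx = sum((xi - xbar) ** 2 for xi in xs)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys))
    beta_hat = sxy / sxx
    q1 = xs                                              # q1* = x_(i)
    q2 = [yi - beta_hat * xi for xi, yi in zip(xs, ys)]  # q2* = y_[i] - beta_hat * x_(i)
    return r, q1, q2, beta_hat
```

Selecting pairs on the value of x alone leaves the regression of Y on X unchanged, which is why a slope fitted to the retained pairs is a reasonable illustrative choice here.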
A goodness-of-fit test for testing the bivariate normality of the censored sample will be based on conducting independent goodness-of-fit tests on the samples \(q_{1i}^*\) and \(q_{2i}^*\) and combining them as was done in Sürücü (2006). However, the problem for this kind of censored bivariate data is different from that for complete bivariate data. We have to consider the following two different test approaches for \(q_{1i}^*\) and \(q_{2i}^*\).
Test for censored \(q_{1i}^*\) data: Since the data are type I censored, a goodness-of-fit test based on type I censoring should be conducted on \(q_{1i}^*.\) Note that a type I censored sample behaves like a truncated sample for large n, which means that conducting a test on type I censored data is asymptotically equivalent to conducting a test on truncated data. Therefore, one needs a powerful test statistic for testing the truncated normal distribution. In this sense, Tiku’s \(Z^*\) statistic provides a very powerful test statistic. Moreover, it has the attractive property that its asymptotic null distribution is the same for complete, truncated and censored samples coming from location-scale families and is asymptotically normal N(1, V); see Tiku (1980). The statistic is given by
where
is the ith spacing and \(\mu _i^T\) \((i=1,\ldots ,n)\) is the expected value of the ith order statistic for the truncated standard normal distribution with truncation point \(t=(d-\hat{\mu })/\hat{\sigma }\) where \(\hat{\mu }\) and \(\hat{\sigma }\) are the estimators (to be mentioned shortly) of \(\mu \) and \(\sigma ,\) respectively.
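One simple way to approximate the expected order statistics of a left-truncated standard normal is to push plotting positions through the truncated quantile function. The sketch below does this with Blom-type positions; the approximation, the function name `truncated_normal_scores` and the use of the number of retained observations as the sample size are assumptions made for illustration, and exact values could differ.

```python
from statistics import NormalDist


def truncated_normal_scores(m, t):
    """Approximate expected order statistics for m draws from N(0, 1) truncated below at t.

    Uses plotting positions (i - 0.375) / (m + 0.25), an assumed approximation.
    """
    std = NormalDist()
    p_t = std.cdf(t)  # probability mass removed by truncation at t
    scores = []
    for i in range(1, m + 1):
        p = (i - 0.375) / (m + 0.25)
        # Quantile of the truncated law: Phi^{-1}(Phi(t) + p * (1 - Phi(t)))
        scores.append(std.inv_cdf(p_t + p * (1.0 - p_t)))
    return scores
```

In practice the truncation point would be computed as \(t=(d-\hat{\mu })/\hat{\sigma }\) from the estimates described in the text.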
Test for \(q_{2i}^*\) data: The \(q_{2i}^*\) data are also censored, but their censoring mechanism is not known. Moreover, the distribution theory is complicated because the structure involves both concomitants and order statistics. For the distributional theory of concomitants whose covariances depend on the corresponding order statistics, one can refer to David (1993). On the other hand, it is known that the concomitants \(y_{[i]}\) \((i=1,2,\ldots ,k)\) are conditionally independent given \(x_{(i)}\) \((i=1,2,\ldots ,k)\) for any \(k\le n;\) see Bhattacharya (1974) and Wang (2008). It directly follows from this result that \(q_{2i}^*\) \((i=r+1,\ldots ,n)\) are asymptotically independently distributed normal variates. Note also that the \(q_{2i}^*\) are identically distributed. Therefore, one can conduct an independent normality test on \(q_{2i}^*\) as for a complete sample of size \(n-r.\) Since \(q_{2i}^*=y_{[i]}-\hat{\beta }x_{(i)},\) we need to find the maximum likelihood (ML) estimator \(\hat{\beta }\) by using the asymptotic independence of \(L_1\) and \(L_2\) in (4). However, the equations are intractable and need to be solved by an approximate method. Therefore, we use the modified maximum likelihood (MML) estimation method proposed by Tiku (1967); see also Tiku and Suresh (1992). To estimate \(\beta ,\) the ML equations are
and
Solving (5) and (6) simultaneously, we obtain
To obtain the MML estimator \(\hat{\mu }_1\) in (7), we need to solve the likelihood equation
where \(z_{(i)}=(x_{(i)}-\mu _1)/\sigma _1\), \(z^*=(d-\mu _1)/\sigma _1\) and \(g(z^*)=f_Z(z^*)/F_Z(z^*);\) \(f_Z(z)\) and \(F_Z(z)\) being the pdf and the cdf of the standard normal distribution, respectively. Note that the ML equation in (8) is intractable because of the term \(g(z^*).\) For large n, however, \(z^*\) is very likely to lie in the interval (a, b) where
and
\(\bar{x}_1\) and \(s_1\) being the mean and the SD of the censored sample \(x_{(i)}\) \((i=r+1,\ldots ,n),\) respectively (Tiku 1994; Tiku and Akkaya 2004, p. 182). By using the linear approximation
where \(\beta =(g(b)-g(a))/(b-a)\) and \(\alpha =g(a)+a\beta ,\) we write the MML equation
and obtain the MML estimator of \(\mu _{1}\) as
where
and
Then, the univariate Tiku statistic for the complete sample size of \(n-r\) can be written as
where
is the ith spacing and \(\mu _{i:n-r}\) \((i=1,\ldots ,n-r)\) is the expected value of the ith order statistic for the standard normal distribution based on a sample of size \(n-r\). For the estimation of \(\beta ,\) one can utilize a slightly less efficient estimator, which does not affect the efficiency of the goodness-of-fit test significantly; see Eren (2009).
Combined bivariate test: By combining the two tests defined above, we propose the following \(Z_{2c}^*\) statistic to test for bivariate normality based on censored data:
\(E_T\) being the mean of \(Z_T^*\), and \(V_T\) and V being the variances of \(Z_T^*\) and \(Z^*,\) respectively; see “Appendix”. The mean of \(Z^*\) is 1 for all sample sizes and censoring proportions. The asymptotic null distribution of \(Z_{2c}^*\) is Chi-square with 2 degrees of freedom since both \(Z_T^*\) and \(Z^*\) are asymptotically normal variates.
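The combination step can be illustrated numerically. The sketch below assumes that the combined statistic is the sum of the two squared standardized components, \(\{(Z_T^*-E_T)/\sqrt{V_T}\}^2+\{(Z^*-1)/\sqrt{V}\}^2\), which is the form suggested by the description above; the function names and the input values of \(Z_T^*\) and \(Z^*\) in the usage lines are hypothetical. The p-value uses the fact that a Chi-square variable with 2 degrees of freedom has survival function \(\exp(-x/2)\).

```python
import math


def z2c_statistic(z_t, z, e_t, sd_t, sd):
    """Sum of squared standardized components (assumed form of the combined statistic)."""
    return ((z_t - e_t) / sd_t) ** 2 + ((z - 1.0) / sd) ** 2


def chi2_2df_pvalue(stat):
    """Upper-tail probability of a Chi-square(2) variable: exp(-stat / 2)."""
    return math.exp(-stat / 2.0)


# 5% critical value of Chi-square(2), obtained by solving exp(-c / 2) = 0.05
CRITICAL_5PCT = -2.0 * math.log(0.05)

# Hypothetical component values, with moments of the size reported in the application section
stat = z2c_statistic(z_t=1.05, z=0.95, e_t=0.988, sd_t=0.074, sd=0.086)
p_value = chi2_2df_pvalue(stat)
reject = stat > CRITICAL_5PCT  # reject bivariate normality at the 0.05 level if True
```

Large values of the statistic indicate departure from the hypothesized distribution in either component, so the test is carried out as an upper-tail test.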
Censored dependent variable: Now assume the random variable Y depends on the random variable X and r observations of the dependent random variable Y are unobservable since they are below the detection limit d. Then, the likelihood function can be written as
where
and
\(y_{(i)}\) being the ith order statistic of Y, \(x_{[i]}\) being the concomitant value for X corresponding to \(y_{(i)}\) and \(\beta =\rho {\sigma _2 \over \sigma _1}\) . Define \(q_{1i}^{**}=x_i\) and \(q_{2i}^{**}=y_{(i)}-\hat{\beta }x_{[i]}\) \((i=r+1,\ldots ,n).\) We have a complete sample of size n for the independent variable X, which means that a goodness-of-fit test (\(Z_1^{**}\)) based on a complete sample of size n should be conducted for the sample \(q_{1i}^{**}.\) For \(q_{2i}^{**},\) on the other hand, we have to work on the distribution of order statistics given their concomitants, which is difficult to handle due to dependency. The same situation is relevant for case (iv). Therefore, we are not able to construct our test statistics for these two cases as we have done in the previous case. However, this problem can be worked out in a future study.
Case (iii): Since there is no dependency structure assumed for X and Y, we may simply write the likelihood as
where \(L_1''\) and \(L_2''\) stand for the likelihoods of the uncensored and censored random variables, respectively. The goodness-of-fit statistic for testing bivariate normality of the censored data is exactly the same as given in (15).
4 Simulations
In this section, we give the results of a simulation study for various alternative bivariate distributions which were also discussed in Sürücü (2006). For this, let \(U_1\) and \(U_2\) be two identically and independently distributed normal random variables. Then, \(X=U_1\) and \(Y= \rho U_1 +\sqrt{1-\rho ^2}U_2\) have a bivariate normal distribution with correlation coefficient \(\rho ;\) see Sürücü (2006). In the simulation study, we consider the following alternative marginal distributions for \(U_1\) and \(U_2\) as well as some known bivariate distributions:
Bivariate alternative distributions constructed by univariate marginals: normal, Chi-square(1), Chi-square(4), lognormal, exponential, Weibull(0.5),Weibull(2), Gamma(2), Gamma(5), beta(3,2), logistic and uniform.
Bivariate alternative distributions: bivariate t(2), bivariate t(6), bivariate F (2,2;8), bivariate Gamma(0.5), bivariate Gamma(2), bivariate exponential, bivariate logistic and bivariate uniform.
If we assume that r observations of X are not observed, then one can take the censoring point to be \(d=x_{(r)}+\varDelta \) where \(\varDelta \in (0,x_{(r+1)}-x_{(r)})\). It should also be mentioned that the censoring point d does not play an important role in the structure of the test statistic as long as we know the number of censored observations below it. That is why we show the power values of the test statistic only for various censoring proportions r/n.
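The simulation design described above can be sketched as follows: pairs are generated through \(X=U_1\) and \(Y=\rho U_1+\sqrt{1-\rho ^2}U_2\), and the censoring point is placed between \(x_{(r)}\) and \(x_{(r+1)}\). The function names, the optional `draw` hook for non-normal components and the choice of \(\varDelta \) as a fraction of the gap are illustrative assumptions, not the code used for the reported study.

```python
import math
import random


def simulate_pairs(n, rho, rng, draw=None):
    """Generate n pairs with X = U1 and Y = rho * U1 + sqrt(1 - rho^2) * U2.

    `draw` returns one value of U; standard normal by default, and it can be
    replaced to build alternatives with other marginals.
    """
    draw = draw or (lambda: rng.gauss(0.0, 1.0))
    xs, ys = [], []
    for _ in range(n):
        u1, u2 = draw(), draw()
        xs.append(u1)
        ys.append(rho * u1 + math.sqrt(1.0 - rho ** 2) * u2)
    return xs, ys


def censoring_point(xs, r, frac=0.5):
    """Return d = x_(r) + Delta with Delta = frac * (x_(r+1) - x_(r)), 0 < frac < 1, 1 <= r < n."""
    s = sorted(xs)
    return s[r - 1] + frac * (s[r] - s[r - 1])


rng = random.Random(2015)
xs, ys = simulate_pairs(n=50, rho=0.5, rng=rng)
d = censoring_point(xs, r=5)              # censoring proportion r/n = 0.10
n_censored = sum(1 for x in xs if x < d)  # number of x-values below d
```

Any value of `frac` strictly between 0 and 1 leaves the same r observations below d, in line with the remark that only the number of censored observations matters.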
Table 1 shows the power values of the proposed test statistic \(Z_{2c}^*\) to test for the bivariate normal distribution with censored data (case i). The statistic \(Z_{2c}^*\) is very powerful especially against skew alternatives. It is less powerful for symmetric alternatives when compared to skew ones. However, its power increases as the sample size gets larger. It should also be noted that the power does not decrease as expected when the censoring percentage increases. Moreover, \(Z_{2c}^*\) is very practical for those who work in applied fields and it can easily be calculated by Eq. (15). One only needs to know the means and variances of \(Z_{T}^*\) and \(Z^*,\) which are provided in the “Appendix”.
The power properties of \(Z_{2c}^*\) are essentially the same for other type I error levels. Also, we do not give the results of the simulation study conducted for case (iii), as they are the same as those of case (i).
5 Application to real data
Freeman and Modarres (2006) analyzed 39 weekly measurements of biochemical oxygen demand (BOD) and total suspended solids (TSS) from water quality monitoring samples. Three of the BOD values are below the detection limit of 2 mg/L. The authors tested several candidate bivariate distributions for this data set. As an application of our method, we choose bivariate normal [transformed with Box-Cox transformation \(\lambda =(0,1/3)\)] and bivariate lognormal distributions to test the censored water quality data set. Since there is no specific dependence structure mentioned, we can consider case (iii). To calculate our \(Z_{2c}^*\) statistic, we simulated the means and variances of \(Z_T^*\) and \(Z^*\) in Eq. (15). We obtained \(E_T=0.988\), \(\sqrt{V_T}=0.074\) and \(\sqrt{V}=0.086,\) which can also easily be obtained from Table 3 in “Appendix” by a rough interpolation. According to the results in Table 2, we do not reject the bivariate lognormal distribution at the 0.05 type I error level. Freeman and Modarres (2006), however, reject the bivariate lognormal distribution in their work.
6 Conclusion
In this paper, we have introduced a new goodness-of-fit test for the bivariate normal distribution with unobservable variables below a detection point. Because of the simple structure of the test statistic, the test can easily be extended to other bivariate location-scale distributions with censored observations. The procedure used here can also be applied to type II censored bivariate data coming from location-scale families. The only difference in that case will be in the estimation of \(\beta =\rho \frac{\sigma _2}{\sigma _1}\) due to the slightly different likelihood function. The test statistic \(Z_{2c}^*\) is very easy to calculate and therefore very useful for researchers working on environmental applications.
References
Andersen PK, Ekstrøm CT, Klein JP, Shu Y, Zhang M-J (2005) A class of goodness-of-fit tests for a copula based on bivariate right-censored data. Biom J 47:815–824
Beersma JJ, Buishand TA (2004) The joint probability of rainfall and runoff deficits in the Netherlands. Critical Transitions in Water and Environmental Resources Management. In: Sehlke G, Hayes DF, Stevens DK (eds) World Water & Environmental Resources Congress 2004, June 27–July 1, Salt Lake City, pp 1–10
Bhattacharya PK (1974) Convergence of sample paths of normalized sums of induced order statistics. Ann Stat 2:1034–1039
Chu H, Nie L, Zhu M (2008) On estimation of bivariate biomarkers with known detection limits. Environmetrics 19:301–317
David HA (1993) Concomitant of order statistics: review and recent developments. In: Hoppe FM (ed) Multiple comparisons, selection and application in biometry. Dekker, New York, pp 507–518
Durrans SR (1996) Low-flow analysis with a conditional Weibull tail model. Water Resour Res 32(6):1749–1760
Eren E (2009) Effect of estimation in goodness-of-fit tests. M.S. Thesis, METU, Ankara, Turkey
Frances F, Salas JD (1994) Flood frequency analysis with systematic and historical or paleoflood data based on the two-parameter general extreme value models. Water Resour Res 30(6):1653–1664
Freeman J, Modarres R (2006) Estimating the bivariate mean vector of censored environmental data with Box–Cox transformations and E-M algorithm. Environmetrics 17:405–416
Goel NK, Seth SM, Chandra S (1998) Multivariate modeling of flood flows. J Hydraul Eng 124(2):146–155
Johansen SS (2004) Bivariate frequency analysis of flood characteristics at Glomma and Gudbrandsdalslagen. Thesis, Hovedoppgave i Institutt for Geofag-Universitetet i Oslo
Karmakar S, Simonovic SP (2008) Bivariate flood frequency analysis. Part 1: determination of marginals by parametric and nonparametric techniques. J Flood Risk Man 1(4):190–200
Lu JC, Liu S, Yin M, Hughes-Oliver JM (1999) Modelling restricted bivariate censored low flow data. Environmetrics 10(2):125–136
Repko A, Van Gelder PHAJM, Voortman HG, Vrijling JK (2004) Bivariate description of offshore wave conditions with physics-based extreme value statistics. App Ocean Res 26(3–4):162–170
Sackl B, Bergmann H (1987) A bivariate flood model and its application. In: Singh VP (ed) Hydrology frequency modelling. Reidel, Dordrecht, pp 571–582
Smith DE, Burns KC (1998) Estimating percentiles from composite environmental samples when all observations are nondetectable. Environ Ecol Stat 5(3):227–243
Sürücü B (2006) Goodness-of-fit tests for multivariate distributions. Commun Stat-Theory Meth 35:1319–1331
Sürücü B (2008) A power comparison and simulation study of goodness-of-fit tests. Comput Math Appl 56:1617–1625
Tate EL, Freeman SN (2000) Three modelling approaches for seasonal streamflow droughts in southern Africa: the use of censored data. Hydrol Sci J 45(1):27–42
Tempelman AA, Akritas MG (1996) Model testing for multivariate censored data. Probab Theory Relat Fields 106:351–369
Tiku ML (1967) Estimating the mean and standard deviation from a censored normal sample. Biometrika 54:155–165
Tiku ML (1980) Goodness-of-fit statistics based on the spacings of complete or censored samples. Aust J Stat 22:260–275
Tiku ML (1994) Estimation for Bivariate Normal based on Truncated or Type I Censored Samples. Gujarat Statistical Review (Professor Khatri Memorial Volume), pp 244–255
Tiku ML, Akkaya AD (2004) Robust estimation and hypothesis testing. New Age International Publishers, New Delhi
Tiku ML, Suresh RP (1992) A new method of estimation for location and scale parameters. J Stat Plan Inference 30:281–292
Vangelis H, Spiliotis M, Tsakiris G (2011) Drought severity assessment based on bivariate probability analysis. J Water Res Man 25:357–371
Wang QJ (1990) Estimation of the GEV distribution from censored samples by method of partial probability weighted moments. J Hydrol 120:103–114
Wang QJ (1996) Using partial probability weighted moments to fit the extreme value distributions to censored samples. Water Resour Res 32(6):1767–1771
Wang K (2008) On concomitants of order statistics. Ph.D. Thesis, Graduate School of The Ohio State University
Wang A (2010) Goodness-of-fit tests for Archimedean copula models. Stat Sin 20:441–453
Yue S (1999) Applying bivariate normal distribution to flood frequency analysis. Water Int 24(3):248–254
Yue S (2000) The bivariate lognormal distribution to model a multivariate flood episode. Hydrol Process 14:2575–2588
Yue S (2002) The bivariate lognormal distribution for describing joint statistical properties of a multivariate storm event. Environmetrics 13:811–819
Zafirakou-Koulouris A, Vogel RM, Craig SM, Habermeier J (1998) L-moment diagrams for censored observations. Water Resour Res 34(5):1241–1249
Additional information
Handling Editor: Ashis SenGupta.
Appendix
See Table 3.
Cite this article
Sürücü, B. Testing for censored bivariate distributions with applications to environmental data. Environ Ecol Stat 22, 637–649 (2015). https://doi.org/10.1007/s10651-015-0323-x
DOI: https://doi.org/10.1007/s10651-015-0323-x