Introduction

Recreational water quality is determined by the presence of microorganisms in the water, such as fecal coliforms, streptococci, total coliforms and enterococci. General guidelines and standards exist for measuring the microbial quality of water in order to prevent disease outbreaks. Their values are derived from studies that link exposure, water quality and the diseases associated with the presence of microorganisms.

Many agencies have chosen the 95th percentile to measure the quality of recreational waters. Recreational water quality can then be assessed by comparing the observed percentile values with guideline values.

The theoretical 95th percentile is the value such that the probability that the variable falls below it equals 0.95, and the observed percentile is the value that leaves 95 % of the observations below it.

Because the distribution of bacterial counts is markedly asymmetric, in practice the percentiles are calculated using the log-normal method: a logarithmic transformation is applied to the data so that they become approximately normally distributed. A percentile obtained in this way is called a parametric percentile. A limitation of this method is that the original scale of the data is lost and the inverse transformation has to be applied.

A broader approach consists in applying the Box and Cox (1964) power transformation, which contains the logarithmic one as a particular case. After a transformation, one is often interested in inference on the original scale. Taylor (1985) defines a measure of location on the original scale by applying the inverse Box–Cox transformation to the center of the data transformed to symmetry.

A frequently applied alternative strategy is to calculate non-parametric percentiles. Some of them are due to Hazen, Blom, Tukey and Weibull (Hunter 2002). The disadvantage of non-parametric methods is that they generally ignore relevant information in the data and therefore yield less accurate estimates.

In this article we propose applying a Tweedie model (Tweedie 1984), which gives a parametric percentile estimate and needs no transformation. This family of distributions allows modeling positive data with skewed distributions by choosing an optimal value of the parameter \(p\) from an infinite range of possible values. The gamma, normal, Poisson and inverse Gaussian distributions are particular cases. In this way, we respect the original scale of the data and estimate the 95th percentile from the Tweedie distribution that best fits the data.

We review existing methods for calculating the percentile estimate in the “Methods for Calculating the 95th Percentile of a Data-Set: Literature Review” section. In the “Proposed Methodology: Estimating the Percentile from a Tweedie Model” section we introduce Tweedie models and define the percentile estimator based on them. The “Simulation Study” section presents a simulation study that compares the estimators described above, and in the “Application to Real Data” section we obtain all these estimators for real data sets from the beaches of Mar del Plata. Finally, a discussion is presented in the “Discussion and Conclusions” section.

Methods for Calculating the 95th Percentile of a Data-Set: Literature Review

Non-parametric Percentile

Several methods of estimating percentiles employ non-parametric statistics (Ellis 1989); we describe the Hazen, Blom, Tukey and Weibull methods. In all of them, the percentile estimator is calculated by a two-step procedure. The first step consists in obtaining a number \(r\), defined in each case by:

$$\begin{aligned}&\hbox {Hazen:}\;r_H =0.5+0.95n\end{aligned}$$
(1)
$$\begin{aligned}&\hbox {Blom:}\;r_B =3/8+0.95(n+0.25)\end{aligned}$$
(2)
$$\begin{aligned}&\hbox {Tukey:}\;r_T =1/3+0.95(n+1/3)\end{aligned}$$
(3)
$$\begin{aligned}&\hbox {Weibull:}\;r_W =0.95(n+1) \end{aligned}$$
(4)

where \(n\) is the sample size.

Once the value of \(r\) is known, the corresponding percentile is calculated as follows:

$$\begin{aligned} P_{*}=\left( {1-rf_{*}} \right) X_{ri_{*}} +rf_{*}\,X_{ri_{*}+1} \end{aligned}$$
(5)

where \(X_{k}\) denotes the \(k\)-th smallest observation of the original variable \(X\), the subscript * stands for \(H\), \(B\), \(T\) or \(W\), respectively, \(ri\) indicates the integer part of \(r\) and \(rf\) indicates the fractional part of \(r\).
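As an illustration, the two-step procedure of Eqs. (1)–(5) can be sketched in Python. This is a minimal sketch and not the routines used in this study; the names `RANK_RULES` and `nonparametric_percentile` are hypothetical, the percentile level is exposed as an argument `p` instead of the fixed 0.95, and clamping \(r\) to the range of available ranks is an added assumption for small samples.

```python
import math

# Rank rules from Eqs. (1)-(4); p is the percentile level (0.95 in the text).
RANK_RULES = {
    "hazen": lambda n, p: 0.5 + p * n,
    "blom": lambda n, p: 3 / 8 + p * (n + 0.25),
    "tukey": lambda n, p: 1 / 3 + p * (n + 1 / 3),
    "weibull": lambda n, p: p * (n + 1),
}


def nonparametric_percentile(data, method="hazen", p=0.95):
    """Interpolate between two neighbouring order statistics, as in Eq. (5)."""
    x = sorted(data)
    n = len(x)
    r = RANK_RULES[method](n, p)
    # Assumption: keep r inside [1, n] so that small samples
    # never index beyond the largest observation.
    r = min(max(r, 1.0), float(n))
    ri = int(math.floor(r))      # integer part of r
    rf = r - ri                  # fractional part of r
    lower = x[ri - 1]            # X_(ri); ranks are 1-based
    upper = x[min(ri, n - 1)]    # X_(ri+1), or the maximum when ri == n
    return (1 - rf) * lower + rf * upper
```

For a sample of size 100 the four rules give ranks between 95.5 and 95.95, so all four estimators interpolate between the 95th and 96th ordered observations and differ only in the interpolation weight.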

Estimated Percentile from Anti-logarithmic Transformation

For normally distributed data, the 95th percentile can be easily calculated from the mean \(\left( m \right) \) and standard deviation \(\left( s \right) \) of the data using the formula \(P=m+sz\) (\(P\): parametric percentile), where \(z=1.6449\) is the 0.95 quantile of the standard normal distribution.

However, bacterial counts do not follow a normal distribution, and a logarithmic transformation is often used to approach normality. Thus, we estimate the percentile from the transformed data with \(P^{\prime }=m^{\prime }+s^{\prime }z\), where \(m^{\prime }\) and \(s^{\prime }\) are the mean and standard deviation of the base-10 logarithm of the data, respectively, and \(z\) is the same as above. The estimated percentile on the original scale is then obtained via the inverse transformation: \(P_{log} =10^{P^{{\prime }}}.\) This approach is outlined in Bartram and Rees (2000).

An important limitation is that a logarithmic transformation does not always yield normally distributed data.
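The log-normal calculation described in this section can be sketched in a few lines of Python. The function name `lognormal_percentile` is hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not state which divisor is used.

```python
import math
import statistics

Z_95 = 1.6449  # 0.95 quantile of the standard normal, as quoted in the text


def lognormal_percentile(counts, z=Z_95):
    """Return 10 ** (m' + s' * z), computed on base-10 logarithms."""
    logs = [math.log10(c) for c in counts]  # requires strictly positive counts
    m = statistics.mean(logs)
    s = statistics.stdev(logs)  # assumption: sample (n - 1) standard deviation
    return 10 ** (m + s * z)
```

Zero counts would have to be handled before calling this function (for example by a detection-limit convention), because the logarithm is undefined at zero.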

Estimated Percentile from Inverse Box–Cox Transformation

A more general approach is given by Box–Cox transformations (see Box and Cox 1964) defined as:

$$\begin{aligned} Y=\left\{ {{\begin{array}{l@{\quad }l} {\frac{\left( {X^{\lambda }-1} \right) }{\lambda }}&{} \mathrm{when}\;\lambda \ne 0 \\ {ln\left( X \right) }&{} \mathrm{when}\; \lambda =0 \\ \end{array} }} \right. \end{aligned}$$
(6)

where \(X\) is a positive random variable. It can be proved that there exists an optimal value of \(\lambda \) such that the distribution of the transformed variable \(Y\) is as close as possible to a normal distribution with mean \(\mu \) and variance \(\sigma ^{2}\). Note that the logarithmic transformation is the particular case \(\lambda =0\).
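The transformation in Eq. (6) and its inverse can be sketched as follows. The sketch also includes one common way of selecting \(\lambda \), namely maximizing the normal profile log-likelihood over a grid of candidate values; this selection rule, the grid from −2 to 2 and all function names are assumptions made for illustration and are not taken from this paper.

```python
import math
import statistics


def box_cox(x, lam):
    """Eq. (6) for a single positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam


def inverse_box_cox(y, lam):
    """Undo Eq. (6); for lam != 0 this needs lam * y + 1 > 0."""
    return math.exp(y) if lam == 0 else (lam * y + 1) ** (1 / lam)


def profile_loglik(data, lam):
    """Normal profile log-likelihood of lam, up to an additive constant."""
    y = [box_cox(x, lam) for x in data]
    n = len(data)
    sigma2 = statistics.pvariance(y)  # ML variance estimate (divides by n)
    jacobian = (lam - 1) * sum(math.log(x) for x in data)
    return -0.5 * n * math.log(sigma2) + jacobian


def best_lambda(data, grid=None):
    """Grid search; the default grid (-2 to 2, step 0.1) is an arbitrary choice."""
    grid = grid or [k / 10 for k in range(-20, 21)]
    return max(grid, key=lambda lam: profile_loglik(data, lam))
```

A finer grid or a one-dimensional optimizer could replace the grid search; the grid is used here only to keep the sketch dependency-free.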

The distribution of the back-transformed data belongs to the power normal (PN) family, and a detailed description of these variables can be found in Freeman and Modarres (2006). These authors also consider the quantile functions, which can be applied in statistical modeling when interest focuses particularly on the extreme observations in the tails of the data (Modarres et al. 2002), as in our case.

The quantile function of \(PN\left( {\lambda ,\mu ,\sigma ^{2}} \right) \) is given by

$$\begin{aligned} P_{BC}^\lambda \left( p \right) =\left\{ {{\begin{array}{ccc} {\left( {\lambda \left( {\sigma \varPhi ^{-1}\left( {V\left( p \right) } \right) +\mu } \right) +1} \right) ^{1/\lambda }}&{} {\lambda >0} \\ {exp\left( {\mu +\sigma \varPhi ^{-1}\left( p \right) } \right) }&{} {\lambda =0} \\ {\left( {\lambda \left( {\sigma \varPhi ^{-1}\left( p \right) +\mu } \right) +1} \right) ^{1/\lambda }}&{} {\lambda <0} \\ \end{array} }} \right. \end{aligned}$$
(7)

where \(\varPhi \) is the standard normal cumulative distribution function and \(V\left( p \right) =1-\left( {1-p} \right) \varPhi \left( T \right) \), for \(0<p<1\), with \(T=\frac{1}{\lambda \sigma }+\frac{\mu }{\sigma }\) the truncation point. The estimator \(\hat{P}_{BC}^{\lambda } \left( p \right) \), obtained by replacing \(\mu \) and \(\sigma ^{2}\) with their maximum likelihood estimators (MLEs) \( \hat{\mu } \) and \(\hat{\sigma }^{2}\) on the normal scale, is known to be asymptotically normal. When \(p\) is replaced by 0.95, an estimator of the 95th percentile is obtained, and in the particular case \(\lambda =0\) we obtain the same estimator as in the “Estimated Percentile from Anti-logarithmic Transformation” section.
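A direct transcription of Eq. (7) into Python may help to see how the three branches are evaluated. The function name `pn_quantile` is hypothetical; the sketch follows the equation as printed and adds no safeguards for parameter combinations in which the base of the power becomes non-positive or \(V(p)\) rounds to 1.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def pn_quantile(p, lam, mu, sigma):
    """Quantile of PN(lam, mu, sigma^2), following the branches of Eq. (7)."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    if lam == 0:
        return math.exp(mu + sigma * STD_NORMAL.inv_cdf(p))
    if lam > 0:
        t = 1 / (lam * sigma) + mu / sigma    # truncation point T
        v = 1 - (1 - p) * STD_NORMAL.cdf(t)   # V(p)
        z = STD_NORMAL.inv_cdf(v)
    else:
        z = STD_NORMAL.inv_cdf(p)
    return (lam * (sigma * z + mu) + 1) ** (1 / lam)
```

In practice `mu` and `sigma` would be replaced by their estimates on the transformed scale and `p` by 0.95.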

Proposed Methodology: Estimating the Percentile from a Tweedie Model

Tweedie models form a subclass of the exponential dispersion models: they are the exponential dispersion models whose unit variance function has a simple power form. More precisely, an exponential dispersion model with unit variance function \(V\) is called a Tweedie model of order \(p\in R\setminus \left( {0,1} \right) \) if \(V\left( \mu \right) =\mu ^{p}\) for \(\mu \in \Omega \), where \(\Omega \) is the parameter space.

Tweedie models include most of the usual distributions such as normal \(\left( {p=0} \right) \), Poisson \(\left( {p=1} \right) \), gamma \(\left( {p=2} \right) \) and inverse Gaussian \(\left( {p=3} \right) \). Their density is given by

$$\begin{aligned} p_p \left( {y,\theta ,\lambda } \right) =c_p \left( {y,\lambda } \right) exp\left( {\lambda \left( {y\theta -\kappa _p \left( \theta \right) } \right) } \right) \end{aligned}$$
(8)

where \(y\in R^{+},\theta \in R\) is the position parameter, \(\lambda >0\) the dispersion parameter and the function \(\kappa _p \left( \theta \right) \) is given by

$$\begin{aligned} \kappa _p \left( \theta \right) =\left\{ {{\begin{array}{c@{\quad }l} {e^{\theta }}&{} {\mathrm{for}\; p=1} \\ {-log\left( {-\theta } \right) }&{} {\mathrm{for}\; p=2} \\ {\frac{1}{2-p}\left( {\left( {1-p} \right) \theta } \right) ^{\frac{p-2}{p-1}}}&{} {\mathrm{for}\; p \notin \left\{ {1;2} \right\} } \\ \end{array} }} \right. \end{aligned}$$
(9)

The function \(c_p \left( {y,\lambda } \right) \) is obtained using the Fourier inversion formula (Feller 1978, p. 581). If \(p>2\), then, writing \(\alpha =\left( {p-2} \right) /\left( {p-1} \right) \), it is of the form

$$\begin{aligned} c_p \left( {y,\lambda } \right) = \frac{1}{\pi \lambda y}\sum _{k=1}^\infty \frac{\Gamma \left( {1+\alpha k} \right) }{k!}\lambda ^{k}\kappa _{p}^{k}\left( {-\frac{1}{\lambda y}} \right) \sin \left( {-k\pi \alpha } \right) \end{aligned}$$
(10)

For a random variable \(Y\) with Tweedie distribution the notation \(Y\sim Tw_p (\theta ,\phi )\) will be used with

$$\begin{aligned} \phi =1/\lambda \end{aligned}$$

The mean and variance are given by

$$\begin{aligned} E\left( Y \right) =\mu =\left\{ {{\begin{array}{cc} {\left( {\left( {1-p} \right) \theta } \right) ^{\frac{1}{1-p}}}&{} {p\ne 1} \\ {e^{\theta }}&{} {p=1} \\ \end{array} }} \right. \end{aligned}$$

and

$$\begin{aligned} Var\left( Y\right) =\phi \mu ^{p}=\phi V\left( \mu \right) . \end{aligned}$$
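The relation between the cumulant function in Eq. (9), the mean and the unit variance function can be checked numerically: the first derivative of \(\kappa _p\) with respect to \(\theta \) should equal \(\mu \), and the second derivative should equal \(V(\mu )=\mu ^{p}\). The following Python sketch does this with finite differences. The function names and step sizes are assumptions, and the code assumes \(\left( {1-p} \right) \theta >0\) whenever \(p\notin \{1,2\}\).

```python
import math


def kappa(theta, p):
    """Cumulant function of Eq. (9)."""
    if p == 1:
        return math.exp(theta)
    if p == 2:
        return -math.log(-theta)  # needs theta < 0
    return ((1 - p) * theta) ** ((p - 2) / (p - 1)) / (2 - p)


def tweedie_mean(theta, p):
    """Mean mu written as a function of theta."""
    if p == 1:
        return math.exp(theta)
    return ((1 - p) * theta) ** (1 / (1 - p))


def first_derivative(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2


p, theta = 2.5, -1.0
mu = tweedie_mean(theta, p)
# Each printed pair should agree up to finite-difference error.
print(mu, first_derivative(lambda t: kappa(t, p), theta))
print(mu ** p, second_derivative(lambda t: kappa(t, p), theta))
```

The same check can be repeated for other admissible pairs of \(p\) and \(\theta \), for instance \(p=2\) with \(\theta <0\), where \(\mu =-1/\theta \).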

A detailed discussion of these models can be found in Jørgensen (1997). A fundamental property is their scale invariance: if \(Y\) belongs to a given Tweedie family, then for any positive real number \(c\), \(cY\) also belongs to a family of this class. They also arise as limiting distributions, in the sense that they have domains of attraction. In practical applications, such models are often needed for skewed positive continuous data.

However, expression (8) is clearly not simple to evaluate, which may be the main factor limiting the use of these models with real data. A method for evaluating the density was developed by Dunn and Smyth (2005), and it is implemented in the R package “tweedie” (R Development Core Team 2006).

Outside the interval (0,1), each real value of \(p\) generates a family. Given a set of observed data, the optimal value of \(p\) can be determined via profile likelihood estimation (Dunn 2004). This numerical method yields a distribution that is closely “tailored” to a data set with a skewed distribution, through the chosen optimal value of the parameter \(p\).

Given a data set, we propose the following strategy:

  1. Obtain the optimal value of the parameter \(p\) via profile likelihood estimation, so that \(Y\sim Tw_p (\theta ,\phi )\).

  2. Calculate the theoretical 95th percentile \(P_{T_w } \), that is, the value such that \(P\left( {Y \le P_{T_w } } \right) =0.95\), namely

    $$\begin{aligned} 0.95=\int _0^{P_{T_w } } c_p \left( {y,\lambda } \right) \exp \left( {\lambda \left( {y\theta -\kappa _p \left( \theta \right) } \right) } \right) dy \end{aligned}$$
    (11)

In this way, we preserve the original scale and estimate the 95th percentile from the Tweedie distribution that best fits the data.
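For general \(p\), solving Eq. (11) requires a numerical evaluation of the Tweedie distribution function, which is beyond a short example. The idea of step 2 can nevertheless be illustrated in the special case \(p=3\) (inverse Gaussian), where the distribution function has a closed form and the percentile can be found by bisection. The Python sketch below covers that special case only and is not the general procedure; the function names are hypothetical, and the model is parameterized by the mean \(\mu \) and by \(\lambda \) as in Eq. (8).

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def inverse_gaussian_cdf(y, mu, lam):
    """Distribution function of the p = 3 Tweedie model (inverse Gaussian)."""
    if y <= 0:
        return 0.0
    root = math.sqrt(lam / y)
    first = STD_NORMAL.cdf(root * (y / mu - 1))
    # exp(2 * lam / mu) can overflow for very large lam / mu; not handled here.
    second = math.exp(2 * lam / mu) * STD_NORMAL.cdf(-root * (y / mu + 1))
    return first + second


def tweedie_p3_percentile(mu, lam, prob=0.95, tol=1e-10):
    """Solve F(P) = prob for P by bisection (Eq. (11) with p = 3)."""
    lo, hi = 0.0, mu
    while inverse_gaussian_cdf(hi, mu, lam) < prob:  # expand until bracketed
        hi *= 2
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if inverse_gaussian_cdf(mid, mu, lam) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The same bisection would apply to any other value of \(p\) once a routine for the corresponding distribution function is available.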

Simulation Study

We performed a Monte Carlo simulation to compare the performance of the different percentile estimators. The routines were written in R, and the package “tweedie” was used to generate the data (R Development Core Team 2006). We ran 1,000 iterations, each time generating a sample of 100 observations from a Tweedie distribution with parameters \(p=2.5,\mu =1,\phi =0.58\). The theoretical 95th percentile [see (11)] was calculated.

The mean squared error (MSE) of each estimator with respect to the theoretical percentile was computed to compare their performance. Table 1 shows the results, and Fig. 1 shows a box plot for each percentile estimator, allowing their properties to be compared. As can be seen, the MSE of the percentile obtained from the Tweedie model is the smallest.
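The structure of such a Monte Carlo comparison can be sketched in Python. Because the standard library has no Tweedie generator, this sketch deliberately swaps in log-normal data and compares only two of the estimators (Weibull and log-normal); it therefore illustrates the MSE computation and does not reproduce the results in Table 1. All function names, the seed and the default parameters are assumptions.

```python
import math
import random
import statistics

Z_95 = 1.6449  # 0.95 quantile of the standard normal


def weibull_percentile(data, p=0.95):
    """Non-parametric estimator with r = p * (n + 1), interpolated as in Eq. (5)."""
    x = sorted(data)
    n = len(x)
    r = min(max(p * (n + 1), 1.0), float(n))  # assumption: clamp r to [1, n]
    ri = int(r)
    rf = r - ri
    return (1 - rf) * x[ri - 1] + rf * x[min(ri, n - 1)]


def lognormal_percentile(data, z=Z_95):
    """Parametric estimator; natural logs give the same result as base 10."""
    logs = [math.log(v) for v in data]
    return math.exp(statistics.mean(logs) + z * statistics.stdev(logs))


def mse(estimates, truth):
    """Mean squared error of a collection of estimates around the true value."""
    return statistics.mean((e - truth) ** 2 for e in estimates)


def simulate(n_iter=1000, n_obs=100, mu=0.0, sigma=1.0, seed=0):
    """Return the MSE of each estimator over n_iter simulated samples."""
    rng = random.Random(seed)
    truth = math.exp(mu + Z_95 * sigma)  # theoretical 95th percentile
    estimates = {"weibull": [], "lognormal": []}
    for _ in range(n_iter):
        sample = [rng.lognormvariate(mu, sigma) for _ in range(n_obs)]
        estimates["weibull"].append(weibull_percentile(sample))
        estimates["lognormal"].append(lognormal_percentile(sample))
    return {name: mse(values, truth) for name, values in estimates.items()}
```

Calling `simulate()` returns a dictionary with one MSE per estimator; replacing the data generator and adding the remaining estimators would give the layout of the study described above.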

Fig. 1 Boxplots for each estimator, for comparison. The horizontal line indicates the theoretical percentile value (\(P=2.5\))

Table 1 The 95th percentile estimators obtained using the proposed methods from a simulation study with parameters \(p=2.5,\mu =1,\phi =0.58\)

Application to Real Data

The data were obtained from a study consisting of the microbiological monitoring and sampling of water from the beaches of Mar del Plata between 1999 and 2007, always in winter. Four groups of bacteria were recorded: fecal coliforms, streptococci, total coliforms and enterococci. Descriptive statistics are shown in Table 2.

Table 2 Descriptive statistics of four groups of bacteria: fecal coliforms, streptococci, total coliforms and enterococci

As a first step, we calculated the optimal value of \(p\) for each bacterial group, in order to find the most suitable Tweedie distribution to fit the data. We obtained \(p=2.21\) for total coliforms, \(p=2.071\) for fecal coliforms and \(p=2.5\) for streptococci and enterococci. Figure 2 shows the histograms with the theoretical densities for the corresponding \(p\) superimposed; it can be seen that the fit is more than acceptable.

Fig. 2 Histograms of bacteria concentrations with the density curve of a Tweedie distribution with the corresponding value of \(p\) superimposed

Next, we calculated the percentiles using all the above methods and compared them with the observed percentile of the data (Table 3). The percentile obtained from the Tweedie distribution is the one closest to the observed percentile for all groups.

Table 3 The 95th percentile estimator obtained using the proposed methods, from four data sets of bacteriological counts from beaches of Mar del Plata

Discussion and Conclusions

It has been found that bacterial counts are neither normally nor log-normally distributed.

Among others, Chawla and Hunter (2005) found that their datasets “were not log normally distributed on at least 85 % of occasions and these finding fatally undermine the validity of using a parametric method for calculating 95th percentiles to classify bathing water quality”.

Other frequently used percentile estimators have been proposed in the literature (Hunter 2002). They are non-parametric and use a limited amount of information, because they consider only the order of each observation, not its exact value. Crabtree et al. (1987) affirm that “the arbitrary use of non-parametric techniques may fail to make the most effective use of the information contained in the data”. Along the same lines, Beamonte et al. (2007) state that parametric methods gave better results than non-parametric ones.

Another alternative is to back-transform percentiles obtained from data that have been transformed to approach normality (see Taylor 1985).

In this paper, we suggest estimating the percentile of bacteriological counts in water from a probability density function that takes into account the asymmetric distribution of this kind of data. We used the Tweedie family proposed by Tweedie (1984) and characterized as an exponential dispersion model by Jørgensen (1992, 1997). This model is appropriate for fitting asymmetric data sets and eliminates the need to alter the original scale of the data by applying transformations.

Comparing the MSE of the different percentile estimators, we found that the lowest value was obtained with the Tweedie family. We therefore conclude that, in the simulated setting, this estimator is the better one in the sense that it is more precise.

Its calculation is also more direct. The numerical method implemented in the R package allows choosing the optimal value of the parameter \(p\) as the one that maximizes the profile likelihood curve. The 95th percentile estimator can then easily be obtained from the distribution function of the optimal model.