1 Introduction

Capture–recapture (CR) methods have become increasingly popular in recent decades and have been adopted in a wide range of applications focusing on the estimation of the size of hidden populations. Underlining the importance of CR methods in data analysis, two books have recently been published by McCrea and Morgan (2014) and Böhning et al. (2018), in which CR methods are introduced and extensively discussed. CR analyses are based on repeated sampling from a population and, consequently, on the use of recapture information to infer the number of uncaptured units. Throughout the paper, we consider the following CR setting. The target population is sampled over a certain number of capture occasions, and at each occasion, captured units are counted only once. Moreover, we consider a closed population, i.e. the unknown population size is assumed to be constant (with no births/deaths during sampling stages), misclassification is not allowed and all units act independently. In capture–recapture analysis, the assumption of homogeneous catchability or identifiability across members of the target population is frequently in question. In these cases we speak of heterogeneity. Heterogeneity may influence capture probabilities, and failure to acknowledge this may lead to biased estimates of the unknown population size (Anan et al. 2017a; Farcomeni and Scacciatelli 2013; Hwang and Huggins 2005).

CR data are usually collected as follows. For each unit \(i\ (i = 1,\ldots , N)\) at occasion \(t\ (t = 1,\ldots , T)\), we record a binary indicator variable, say \(y_{it}\), where \(y_{it} = 1\) means that the i-th unit has been identified at the t-th occasion. It is further assumed that \(y_{i}=\sum _{t=1}^Ty_{it}\) is observed only if \(y_i>0\), that is, if \(y_{it}=1\) for at least one \(t=1,\ldots ,T\). When \(y_{i1}=y_{i2}=\cdots =y_{iT}=0\), the i-th unit remains unobserved. The number of sampling occasions T may or may not be known a priori. In the following, we focus on the random variable X representing the number of captures, i.e. X is a count variable.

To introduce heterogeneity, let us consider the pair \((x,\lambda )\), where x is a realization of X, and \(\lambda \ge 0\) is an unobserved realization of a non-negative random variable \(\varLambda \), the parameter of the count data distribution, i.e. X depends on the parameter \(\lambda \). It follows that the joint density \(f(x,\lambda )\) can be written as \(f(x\mid \lambda )g(\lambda )\), where \(g(\lambda )\) is the marginal density of \(\varLambda \). As \(\lambda \) is not observed, we marginalize \(f(x\mid \lambda )g(\lambda )\) over \(\lambda \), leading to the mixture

$$\begin{aligned} \Pr (X=x) = \kappa _x = \int _{0}^{\infty } f(x\mid \lambda )g(\lambda )d\lambda ,\end{aligned}$$
(1)

for \(x\in \{0,1,2,\ldots \}\), the non-negative integers. In mixture model (1) we call \(g(\lambda )\) the mixing distribution and \(f(x\mid \lambda )\) the mixture kernel. In the following, we consider \(f(x\mid \lambda ) = \frac{\exp (-\lambda )\lambda ^x}{x!}\), \(\lambda \ge 0\), and \(g(\lambda ) = \frac{1}{\theta }\exp \left( -\frac{\lambda }{\theta }\right) \), \(\theta >0\), accounting for departures from assumptions implied by \(f(x\mid \lambda )\), such that \(\kappa _x = (1-p)^xp\), with \(p = \frac{1}{1+\theta } \in (0,1)\), i.e. \(\kappa _x\) follows a geometric distribution. Note that this result—mixing the Poisson with an exponential distribution leads to the geometric—is a special case of mixing a Poisson with a Gamma distribution which leads to a negative binomial (Fisher et al. 1943).
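The Poisson–exponential identity above can be checked numerically. The following sketch is our own illustration (not taken from the paper), using an arbitrary mixing mean \(\theta =2\): it draws Poisson counts whose rates are exponential, and compares the empirical probability of a zero count with the geometric value \(\kappa _0 = p = 1/(1+\theta )\).

```python
import math
import random

# Hedged numerical sketch (assumed illustration): mixing a Poisson rate with an
# Exponential(theta) density should reproduce kappa_x = (1-p)^x * p, p = 1/(1+theta).
theta = 2.0
p = 1.0 / (1.0 + theta)

def poisson_draw(lam, rng):
    # Knuth's multiplication method; adequate for the moderate rates drawn here
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(42)
draws = [poisson_draw(rng.expovariate(1.0 / theta), rng) for _ in range(100_000)]
emp_p0 = draws.count(0) / len(draws)   # empirical Pr(X = 0)
print(round(emp_p0, 3), round(p, 3))   # the two values should be close
```

With 100,000 draws the Monte Carlo error of the zero-class proportion is of order 0.002, so the agreement with \(\kappa _0\) is visible at two decimal places.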

Of course, the binomial distribution can also be considered as the reference distribution to estimate population sizes. A justification for considering the Poisson distribution instead is as follows. Suppose the observational window consists of a large number T of trapping occasions, each with the same positive capture probability \(\theta \), so that T is the largest possible number of identifications. If \(T\theta = \lambda \) remains constant as T becomes large, the binomial distribution converges to the Poisson distribution with parameter \(\lambda \). X is again the count of identifications per member of the target population. The only difference is that we do not know what the largest possible count could have been. The underlying assumption remains that identification occurs independently across occasions and with the same probability \(\theta \).

We are interested in using the geometric distribution in the capture–recapture setting. Let \(X_1,\ldots ,X_N\) be a sample from the geometric distribution. Here N is the size of the target population of interest and \(X_i\) is the count of identifications of the i-th member during the sampling period. The way identification occurs is determined by the application: it could be a live-trap, a hospital register, a police database, etc. In a recent paper, Coumans et al. (2017) considered estimating the numbers of homeless people in the Netherlands. In this case, \(X_i\) represents the number of nights stayed in a homeless shelter for homeless person i. However, not all members are identified, i.e. have a value of \(X_i>0\). Hence, we observe a zero-truncated sample \(X_1,\ldots ,X_n\), where we have, without loss of generality, \(X_{n+1}=\cdots =X_N=0\). We know the size n of the zero-truncated sample, but we do not know N, which needs to be estimated, and this is what this work is about. In the following we will use \(f_x=\#\{x_i|x_i=x\}\), the frequency of counts exactly equal to x for \(x=0,1,2,\ldots \).

Whereas the Poisson distribution has been used frequently, we think that the geometric is more flexible, as it already incorporates some form of heterogeneity. Niwitpong et al. (2013) discuss various estimators for model (1), including a form of Mantel–Haenszel estimation.

In this work, we introduce a Turing-type estimator coping with heterogeneity, in the sense that the Poisson parameter of the conditional count-of-identifications distribution is mixed with an exponential density, leading to a geometric distribution. We argue that the geometric distribution is better suited for count distributions in the capture–recapture context as it can cope with simple forms of heterogeneity. We also derive its asymptotic variance, so that a measure of precision is available. The Turing estimator has been used under several distributional assumptions on count data (Böhning et al. 2013; Hwang et al. 2015), in particular under the Poisson assumption, where it is given by \(\hat{N}_{\textit{Turing}}=\frac{n}{1-f_1/\sum _{x=1}^mxf_x}=\frac{n}{1-\hat{p}_0}\), where m is the maximum observed count. Turing’s estimator has the benefit of being easy to calculate in closed form, with no need for an iterative procedure. Moreover, under the Poisson assumption, it often underestimates less than the maximum likelihood estimate and has comparable precision (Böhning et al. 2018, pp. 13–14). We show in several case studies with known population size that the geometric-based Turing estimator outperforms other well-established estimators.
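As a small illustration (our own sketch, with made-up toy frequencies rather than data from the paper), the classical Poisson Turing estimator above can be computed directly from the zero-truncated frequency distribution:

```python
# Hedged sketch of the classical (Poisson) Turing estimator
# N_hat = n / (1 - f1 / S), with S = sum_x x * f_x; toy frequencies are assumed.
def turing_poisson(freqs):
    """freqs: dict mapping count x (>= 1) to observed frequency f_x."""
    n = sum(freqs.values())                   # number of observed units
    S = sum(x * f for x, f in freqs.items())  # total number of captures
    p0_hat = freqs.get(1, 0) / S              # estimated probability of a zero count
    return n / (1.0 - p0_hat)

toy = {1: 40, 2: 20, 3: 10}                   # hypothetical zero-truncated counts
print(round(turing_poisson(toy), 1))          # n = 70, S = 110, so N_hat = 110.0
```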

We investigate the empirical behavior of the introduced estimator by a large-scale simulation study with respect to several factors, such as the population size and the capture probabilities. To show the practical usefulness of this new estimator, we compare its performance to a few alternative estimators, widely used in the capture–recapture framework (see the simulation study in Sect. 3). Finally, we apply the proposal to several real datasets, often used as benchmarks in the capture–recapture framework and check the appropriateness of considering a geometric distribution to estimate the population size.

2 A Turing-type estimator under the geometric distribution

To generalise the Turing estimator under the geometric distribution, let us note that \(\kappa _{0}=p, \kappa _{1}=(1-p)p\) and \(E(X)=\frac{1-p}{p}\) and, accordingly,

$$\begin{aligned} \kappa _{0}=p= \sqrt{p^{2}}=\sqrt{\frac{(1-p)p^{2}}{(1-p)}} =\sqrt{\frac{(1-p)p}{(1-p)/p}}=\sqrt{\frac{\kappa _{1}}{E(X)}}. \end{aligned}$$
(2)

In practice, the \(\kappa _x\)s can be estimated by the relative frequencies so that

$$\begin{aligned} \widehat{\kappa }_{0}=\sqrt{\frac{f_{1}/N}{S/N}} =\sqrt{\frac{f_{1}}{S}}, \end{aligned}$$
(3)

where \(S=\sum \nolimits _{x=0}^{m}xf_{x} =\sum \nolimits _{x=1}^{m}xf_{x}\). Hence, the resulting Turing estimator under the geometric distribution (TG) is given as

$$\begin{aligned} \widehat{N}_{\textit{TG}} = \frac{n}{1-\sqrt{\frac{f_{1}}{S}}}. \end{aligned}$$
(4)

The form of the resulting TG estimator resembles its specification under the Poisson assumption. It uses the frequency \(f_1\) of units observed only once, which is usually a large quantity. It also uses all information in the sample, by including S. This is in contrast with other well-established estimators which use only frequencies of ones and twos to estimate \(f_0\).
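The TG estimator of Eq. (4) is straightforward to implement. The sketch below is our own illustration; as input it uses the zero-truncated golf tees frequencies analysed later in Sect. 4.1 (\(f_1,\ldots ,f_8\)).

```python
import math

# Hedged sketch of the geometric Turing (TG) estimator of Eq. (4):
# N_hat_TG = n / (1 - sqrt(f1 / S)), with S = sum_x x * f_x.
def turing_geometric(freqs):
    """freqs: dict {x: f_x} for observed counts x >= 1."""
    n = sum(freqs.values())
    S = sum(x * f for x, f in freqs.items())
    p0_hat = math.sqrt(freqs.get(1, 0) / S)   # kappa_0 estimate from Eq. (3)
    return n / (1.0 - p0_hat)

# Golf tees frequencies (f_1, ..., f_8) from Sect. 4.1, zero class excluded
golf = {1: 46, 2: 28, 3: 21, 4: 13, 5: 23, 6: 14, 7: 6, 8: 11}
print(round(turing_geometric(golf), 1))
```

Here \(n=162\) and \(S=546\), so the estimate lies between the Poisson-based Turing value and the true size \(N=250\) discussed in Sect. 4.1.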

Theorem 1

The TG estimator is asymptotically unbiased under the geometric distribution, i.e.

$$\begin{aligned} \lim _{N\rightarrow \infty }\frac{E(\widehat{N}_{\textit{TG}})}{N} = 1, \end{aligned}$$

with \(\hat{N}_{\textit{TG}}>\hat{N}_{\textit{Turing}}\).

Proof

We have that \(E(X)=E(S/N) = (1-p)/p\), \(E(f_{1})=Np(1-p)\) so that \(\sqrt{\frac{E(f_{1}/N)}{E(S/N)}} = \sqrt{ \frac{p(1-p)}{(1-p)/p}} =p = {\kappa }_0\) and \(E(n/N)=(1-{\kappa }_0)=(1-p)\). Therefore,

$$\begin{aligned} E\left( \frac{\widehat{N}_{\textit{TG}}}{N} \right)= & {} E \left( \frac{ \frac{n}{1-\sqrt{f_{1}/S}}}{N} \right) =E\left( \frac{n}{N} \frac{1}{1-\sqrt{f_{1}/S}} \right) \xrightarrow [N\rightarrow \infty ]{} (1-p)\frac{1}{1-p} = 1. \end{aligned}$$

This proves that the TG estimator is asymptotically unbiased under the geometric distribution. To show that \(\hat{N}_{\textit{TG}}>\hat{N}_{\textit{Turing}}\), let us assume that \(f_x>0\) for some \(x>1\). The estimated probability of zero counts according to the (original) Turing estimator under the Poisson distribution is \(\widehat{p}_{0,\textit{Turing}} = \frac{f_{1}}{S}\), whereas the estimated probability of zero counts according to the TG estimator is \(\widehat{p}_{0,\textit{TG}} =\sqrt{\frac{f_{1}}{S}}.\) Since \(f_x>0\) for some \(x>1\), we have \(f_{1} < S=\sum _{x=1}^{m}x{f_{x}}\), so that \(\frac{f_{1}}{S}\in (0,1)\) and therefore \(\frac{f_{1}}{S} < \sqrt{\frac{f_{1}}{S}}\). Then, we have that

$$\begin{aligned} \widehat{N}_{\textit{TG}} = \frac{n}{1-\sqrt{\frac{f_{1}}{S}}}>\frac{n}{1-\frac{f_{1}}{S}}=\widehat{N}_{\textit{Turing}}. \end{aligned}$$

\(\square \)

This property underlines the importance of defining an estimator able to handle heterogeneity, as it is widely acknowledged that the Poisson-based Turing estimator is biased downward (Böhning et al. 2013) and is a lower bound estimator in the presence of heterogeneity (Puig and Kokonendji 2018).

Following Böhning (2008), we derive the variance of the estimator by using a conditional technique mixed with the delta method.

Proposition 1

The variance of the TG estimator is given as

$$\begin{aligned} \widehat{Var}(\widehat{N}_{\textit{TG}}) = \frac{n\sqrt{\frac{f_{1}}{S}}}{\left( 1-\sqrt{\frac{f_{1}}{S}}\right) ^{2}} +n^{2} \left\{ \frac{S+f_{1}}{4S^{2}\left( 1-\sqrt{\frac{f_{1}}{S}} \right) ^{4}}\right\} . \end{aligned}$$
(5)

Proof

The proof is given in the “Appendix”. \(\square \)
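Eq. (5) gives a plug-in standard error for the TG estimator. The following sketch (our own illustration) computes both the point estimate and the approximate standard error from the observed frequencies:

```python
import math

# Hedged sketch of the variance approximation of Eq. (5) for the TG estimator;
# returns (N_hat, SE_hat). freqs is a dict {x: f_x} for observed counts x >= 1.
def tg_with_se(freqs):
    n = sum(freqs.values())
    S = sum(x * f for x, f in freqs.items())
    f1 = freqs.get(1, 0)
    r = math.sqrt(f1 / S)                     # estimated kappa_0
    N_hat = n / (1.0 - r)
    var = (n * r / (1.0 - r) ** 2
           + n ** 2 * (S + f1) / (4 * S ** 2 * (1.0 - r) ** 4))
    return N_hat, math.sqrt(var)

# Golf tees frequencies (Sect. 4.1) as a worked input
N_hat, se = tg_with_se({1: 46, 2: 28, 3: 21, 4: 13, 5: 23, 6: 14, 7: 6, 8: 11})
print(round(N_hat, 1), round(se, 1))
```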

3 Simulation study

A simulation study is undertaken to investigate the performance of the proposed estimator and its competitors. The count datasets were generated from the geometric distribution with a variety of parameters, that is, \(X \sim \textit{Geo}(p)\) with \(p = 0.1, 0.15, 0.2, 0.25, 0.3, 0.5\). The population size N is set to \(N = 100, 250\) for small sizes, \(N= 500, 1000\) for medium sizes, and \(N=5000, 10{,}000\) for large sizes. Each dataset is rearranged in the form of frequencies \(f_{0}, f_1, f_{2}, f_{3}, \ldots , f_{m}\), corresponding to the counts \(0,1,2,3,\ldots ,m\). The frequency of zero counts \(f_{0}\) was omitted before estimating the population size \(\widehat{N}\). The aim is to investigate the finite sample behavior of the proposed estimator and to show how it may differ from other well-established estimators known to work well under the geometric distribution. Moreover, we also look at how well we can approximate the uncertainty surrounding the estimates.
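One replicate of this protocol can be sketched as follows; the implementation details (inversion-free geometric sampling, dictionary of frequencies) are our own assumptions, not the paper's code.

```python
import random

# Hedged sketch of one simulation replicate: draw N geometric counts,
# drop the zeros (they are unobserved), and tabulate frequencies f_1, f_2, ...
def simulate_truncated_geometric(N, p, rng):
    freqs = {}
    for _ in range(N):
        # number of failures before the first success: Pr(X = x) = (1-p)^x * p
        x = 0
        while rng.random() > p:
            x += 1
        if x > 0:                       # zero counts never enter the sample
            freqs[x] = freqs.get(x, 0) + 1
    return freqs

rng = random.Random(1)
freqs = simulate_truncated_geometric(1000, 0.3, rng)
n = sum(freqs.values())
print(n)                                # roughly N * (1 - p) observed units
```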

To further underline the usefulness of the proposed estimator, we also generate counts from a negative binomial model \(\frac{\varGamma (x+\nu )}{\varGamma (\nu )\,x!}p^{\nu }(1-p)^x\) for \(x=0,1,\ldots \), with \(\nu =2, 5\) and \(p = 0.1, 0.5, 0.7\), and estimate the population size using the proposed estimator.

Fig. 1 Relative bias of six estimators with different parameters following the geometric distribution

Fig. 2 Relative variance of six estimators with different parameters following the geometric distribution

Fig. 3 Relative root mean square error of six estimators with different parameters following the geometric distribution

3.1 Simulation results to investigate the performance of the estimator

To study the performance of the proposed estimator, 5000 samples were drawn from the geometric distribution for each combination of parameters. For each scenario the relative bias, relative variance and relative root mean squared error are computed. In the following, we show the simulation study results, summarized in Figs. 1, 2 and 3 and, with a focus on the proposed estimator, in Table 1. The linear regression Conway–Maxwell–Poisson-based estimator (LCMP; \(\widehat{N}_{\textit{LCMP}} = n + f_1\exp (-\hat{\lambda })\); Anan et al. 2017a, b), the maximum likelihood estimator under the geometric distribution (MLEGeo; \(\widehat{N}_{\textit{MLEGeo}} = \frac{n}{1-n/\sum _{x=1}^m xf_x}\)) and a non-parametric estimator based on Chao’s lower bound under the geometric distribution (CG; \(\widehat{N}_{\textit{CG}} = n+\frac{f_{1}^{2}}{f_{2}}\)) are compared with our Turing-based proposal (TG) and the extended Zelterman estimator based on the zero-truncated geometric distribution (ZG; \(\widehat{N}_{\textit{ZG}} = \frac{nf_1}{f_2}\)). Among these estimators, the least-known one is likely that based on the Conway–Maxwell–Poisson distribution (Shmueli et al. 2005). Its probability distribution \(\mathrm{CMP}(\lambda ,\nu )\) is given by

$$\begin{aligned} \kappa _x=\frac{\lambda ^x}{(x!)^{\nu }}\frac{1}{z(\lambda ,\nu )},\quad x = 0,1,2,\ldots ; \ \lambda > 0; \ \nu \ge 0 \end{aligned}$$

where the normalizing constant

$$\begin{aligned} z(\lambda ,\nu ) = \sum _{j=0}^{\infty }\frac{\lambda ^j}{(j!)^{\nu }} \end{aligned}$$

is a generalization of well-known infinite sums. Anan et al. (2017a) introduced a population size estimator based on this distribution.
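Since \(z(\lambda ,\nu )\) has no closed form in general, it is usually evaluated by truncating the series. The following sketch (our own, with an assumed tolerance) builds the terms incrementally; at \(\nu =1\) the sum reduces to \(\exp (\lambda )\) and at \(\nu =0\) with \(\lambda <1\) to the geometric series \(1/(1-\lambda )\).

```python
# Hedged sketch of the CMP normalizing constant z(lambda, nu), truncating the
# infinite sum once terms fall below a tolerance (tolerance value is assumed).
def cmp_z(lam, nu, tol=1e-12, max_terms=10_000):
    total, term, j = 0.0, 1.0, 0
    while term > tol and j < max_terms:
        total += term
        j += 1
        term *= lam / j ** nu           # builds lambda^j / (j!)^nu incrementally
    return total

print(round(cmp_z(0.5, 0.0), 6))        # geometric series: 1 / (1 - 0.5) = 2.0
```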

Table 1 Simulation results: population size estimates \(E(\widehat{N})\), approximated standard errors \(E[ \widehat{SE}(\widehat{N})]\) and true standard errors \(\textit{SE}(\widehat{N})\)

The relative bias (\(\textit{RBias}\)), relative variance (\(\textit{RVar}\)) and relative root mean squared error (\(\textit{RRMSE}\)) are calculated as

$$\begin{aligned} \textit{RBias}(\widehat{N})= & {} \frac{1}{N} \left[ E(\widehat{N})-N \right] ,\\ \textit{RVar}(\widehat{N})= & {} \frac{1}{N^{2}} \left( \widehat{Var}(\widehat{N}) \right) , \end{aligned}$$

and

$$\begin{aligned} \textit{RRMSE}\{\widehat{N}\} = \frac{1}{N} \sqrt{Var\{ \widehat{N} \} + \{\textit{bias}(\widehat{N})\}^{2} }, \end{aligned}$$

where \(\textit{bias}(\widehat{N}) = E(\widehat{N})-N\).
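These performance measures can be sketched as below; the averaging conventions (mean and variance over replicates) are our assumptions about how the expectations in the formulas are approximated.

```python
import math

# Hedged sketch of the performance measures above; N_hats is the list of
# estimates across replicates and N the true population size.
def relative_metrics(N_hats, N):
    R = len(N_hats)
    mean = sum(N_hats) / R
    var = sum((v - mean) ** 2 for v in N_hats) / R
    rbias = (mean - N) / N                          # RBias
    rvar = var / N ** 2                             # RVar
    rrmse = math.sqrt(var + (mean - N) ** 2) / N    # RRMSE
    return rbias, rvar, rrmse

rbias, rvar, rrmse = relative_metrics([95, 100, 105], 100)  # toy replicates
print(rbias, rvar, rrmse)
```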

Results are displayed and summarized in Figs. 1, 2 and 3. All the considered estimators are asymptotically unbiased. They show a slight overestimation of the population size for small population sizes and a reduced bias as the population size increases. The TG estimator shows the most accurate behavior, with the least bias on average. The relative biases of the LCMP and TG estimators are very similar when p is small (i.e. \(p = 0.1, 0.15\)). The MLEGeo provides the smallest variance under all settings, as expected and in line with the literature. Another interesting point is that the TG estimator not only has a small bias but also provides an estimated variance close to that of the MLEGeo.

To summarize the performance of the different estimators, we use the relative root mean square error (see Fig. 3). The simulation results show that the TG and MLEGeo estimators are likely to be the best choices to estimate the population size. The LCMP also shows a reasonable behavior, slightly worse than, but comparable to, the TG and MLEGeo estimators. This happens because the Conway–Maxwell–Poisson distribution contains the geometric distribution as a special boundary case. It remains a reasonable estimator for small values of p and performs better than the CG estimator.

3.2 Variance approximation and confidence intervals

The aim of this section is to investigate the performance of the variance approximation introduced in Eq. (5). The full set of simulation results, with the averaged estimates and standard errors of the new estimator, is provided in Table 1. Additionally, we assess the validity of the approximate variance estimator, used to derive confidence intervals, by investigating the ratio between approximated and sampled standard errors. This investigation shows that the approximated standard error of the TG estimator performs more than reasonably: although it slightly underestimates the sample variability on average, it provides a satisfactory approximation to the true standard error, i.e. the ratio between the approximated and the sample standard errors is fairly close to one.

Table 2 Comparing the performance of confidence intervals of five estimators when data are generated from the geometric distribution
Fig. 4
figure 4

Coverage probabilities of 95% confidence interval when data is generated from the geometric distribution

We further look at 95% confidence intervals for the proposed estimator of the population size N. We compare our proposal with other estimators often used to deal with population size estimation, derived under the Poisson and the geometric distributions. A 95% approximate confidence interval of population size N is constructed under the symmetric normal approximation as:

$$\begin{aligned} 95\% \, \textit{CI} = (\widehat{N}_{L}, \widehat{N}_{U}) = (\widehat{N} \pm Z_{.975} S.E (\widehat{N})), \end{aligned}$$
(6)

where, for the proposed Turing-based estimator, \(S.E(\widehat{N}) \) is approximated by the square root of (5). Their performances are quantified using the coverage probability \((\textit{Cov})\) and the average length \((\textit{AL})\) defined as follows:

$$\begin{aligned} \textit{Cov}= \frac{ \sum _{r=1}^{R}{A_{(r)}}}{R} \times 100, \end{aligned}$$
(7)

where \(A_{(r)}\) is equal to 1 if the true value N lies in the r-th confidence interval, and 0 otherwise; and R is the number of replicates. The confidence intervals are based on the assumption of asymptotic normality, which may affect the coverage probability for small population sizes. It can be seen from Table 2 and Fig. 4 that the coverage probabilities of almost all estimators fall below the nominal confidence level when the population size is \(N=100\), and that convergence towards the nominal level improves with increasing N. Overall, the simulation results suggest that the MLEGeo estimator provides the best performance with respect to the coverage probability of confidence intervals. As can be seen, the coverage probability of the MLEGeo is close to the nominal level on average and its intervals have the shortest length. The proposed TG estimator has a satisfying and comparable behavior, even for small population sizes. The CG estimator requires large population sizes to provide a good performance of the 95% confidence interval. Finally, the LCMP estimator shows strongly anti-conservative behavior, with lower coverage than the nominal confidence level for small population sizes. Additionally, it tends to show higher coverage than the nominal level as the population size increases, as a consequence of the large standard errors.
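Eqs. (6)–(7) can be sketched as follows; the replicate pairs below are made-up toy values, not simulation output from the paper.

```python
# Hedged sketch of Eqs. (6)-(7): a symmetric normal 95% interval per replicate
# and the resulting coverage percentage over R replicates (toy inputs assumed).
Z975 = 1.959963985                      # 0.975 standard normal quantile

def covers(N_hat, se, N_true, z=Z975):
    return (N_hat - z * se) <= N_true <= (N_hat + z * se)

replicates = [(248.0, 12.0), (260.0, 15.0), (210.0, 10.0)]  # (N_hat, SE) toys
N_true = 250.0
cov = 100.0 * sum(covers(nh, se, N_true) for nh, se in replicates) / len(replicates)
print(cov)                              # two of three toy intervals cover N_true
```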

To draw further conclusions on the usefulness of the proposed Turing-type estimator, we discuss results based on data simulated from a misspecified model, i.e. data drawn from the negative binomial distribution (see Table 3). A crucial role is played by the probability of success in each trial, which drives the percentage of zeros generated in the data. For lower values of p, i.e. for a lower percentage of zeros, the estimator performs better; for higher values of p and higher proportions of zeros it overestimates the true population size, though still reasonably. This is also somewhat expected: as shown in Böhning (2015), even sampling under the negative binomial distribution may lead to severe bias in the population size estimate, due to a boundary problem.

Table 3 Simulation results: population size estimates \(E(\widehat{N})\), approximated standard errors \(E[ \widehat{\textit{SE}}(\widehat{N})]\) and true standard errors \(\textit{SE}(\widehat{N})\)

4 Applications

In this section, we first consider two widely analysed datasets to show the appropriateness and importance of our proposal in empirical data analyses, and then provide further examples to better understand the behavior of the proposed estimator. We compare our estimator with other well-established ones, based on homogeneous and heterogeneous Poisson models and on the geometric model. Turing’s estimator and the maximum likelihood estimator under a Poisson model, with parameter \(\lambda \), are considered as estimators in the homogeneous case. Estimators for heterogeneous populations, such as Zelterman’s estimator (Zelterman 1988) and Chao’s lower bound estimator (Chao 1987, 1989; Chao and Colwell 2017), are considered as well. The maximum likelihood estimator under a geometric distribution and the LCMP estimator, with parameters \(\lambda \) and \(\nu \), are included as potential competitors to our estimator, as is Chao’s estimator under the geometric distribution. We also report an estimate of the variance associated with the population size estimate, based on the so-called imputed bootstrap (\(\textit{SE}_{\textit{boot}}\); Norris and Pollock 1996). In our setting, the imputed bootstrap is based on \(\widehat{N}\) and the corresponding estimate \(\hat{f}_0\) of \(f_0\). We draw 1000 samples of \(\widehat{N}\) observations each from a multinomial distribution with parameters \(\widehat{N}\) and \(\{\hat{f}_0/\widehat{N},\hat{f}_1/\widehat{N},\ldots , \hat{f}_m/\widehat{N}\}\). For each bootstrapped sample we compute the estimate \(\widehat{N}\), and then take the variance over all 1000 samples. We remark that the imputed bootstrap performs well only if the model is valid (Anan et al. 2017b).
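The imputed bootstrap can be sketched as below. The implementation details (rounding \(\widehat{N}\), sampling the multinomial by inversion, re-truncating the zero class) are our own assumptions; the demonstration pairs it with the CG estimator \(n+f_1^2/f_2\) defined in Sect. 3 on a toy frequency table.

```python
import random

# Hedged sketch of the imputed bootstrap: impute f0_hat = N_hat - n, resample
# N_hat counts from the fitted multinomial, re-truncate the zeros, re-estimate,
# and report the standard deviation of the bootstrap estimates.
def imputed_bootstrap_se(freqs, estimator, B=1000, seed=0):
    rng = random.Random(seed)
    N_hat = round(estimator(freqs))
    cells = [(0, N_hat - sum(freqs.values()))] + sorted(freqs.items())
    probs = [f / N_hat for _, f in cells]
    boot = []
    for _ in range(B):
        counts = {}
        for _ in range(N_hat):                 # one multinomial draw of size N_hat
            u, acc = rng.random(), 0.0
            for (x, _), pr in zip(cells, probs):
                acc += pr
                if u <= acc:
                    counts[x] = counts.get(x, 0) + 1
                    break
            else:                              # guard against float round-off
                counts[cells[-1][0]] = counts.get(cells[-1][0], 0) + 1
        counts.pop(0, None)                    # the zero class is unobserved
        boot.append(estimator(counts))
    mean = sum(boot) / B
    return (sum((b - mean) ** 2 for b in boot) / (B - 1)) ** 0.5

def chao_geometric(freqs):                     # the CG estimator of Sect. 3
    return sum(freqs.values()) + freqs.get(1, 0) ** 2 / max(freqs.get(2, 0), 1)

se = imputed_bootstrap_se({1: 46, 2: 28, 3: 21}, chao_geometric, B=200, seed=1)
print(round(se, 1))
```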

4.1 Golf tees data

In a field experiment, \(N=250\) groups of golf tees were placed in a survey region, either exposed above the surrounding grass or hidden by it. They were surveyed by the 1999 statistics honor class at the University of St Andrews (Scotland); see Borchers et al. (2004). A total of \(n=162\) groups of tees were observed, but a number were missed, and this number needs to be estimated. The corresponding frequency distribution is given by \((f_0,\ldots ,f_8)=(88, 46, 28, 21, 13, 23, 14, 6, 11)\). This example is very useful for comparing the performance of several estimators, as the true value of the population size is known. In the following, we compare several estimators based on the Poisson or other distributions accounting for heterogeneity. The population size estimators under the geometric distribution (i.e., TG, MLEGeo, LCMP and CG) provide population size estimates close to the true number \(N=250\), suggesting that the homogeneity assumption on the capture probabilities, as in the Poisson distribution, is unreliable. In detail, the proposed TG estimator and the MLEGeo estimator show a negligible difference in terms of both the estimate of the population size and its variability. Comparing these results with those from other estimators, the TG and MLEGeo are the best for estimating the number of golf tees. With respect to the other estimators, it is no surprise that the original Turing and MLEPoi underestimate the population size, and their confidence intervals do not even cover the true value: \(\hat{N}_{\textit{Turing}}\), \(\hat{N}_{\textit{MLEPoisson}}\) and \(\hat{N}_{\textit{Chao}}\) are lower bound estimates for mixed (heterogeneous) Poisson distributions, of which the geometric distribution is a special case (Puig and Kokonendji 2018). Although the Zelterman estimator provides an estimated population size closer to the true value, its standard errors are very large, leading to a wide 95% confidence interval. The LCMP estimator might be an alternative choice for estimating the number of golf tees, albeit with a slight bias (Table 4).

Table 4 Golf tees data: estimated population size, standard errors, confidence intervals and lengths of confidence interval from eight different estimators

4.2 Bowel cancer data

Over several years, from 1984 onwards, about 50,000 subjects were screened for bowel cancer at St Vincent’s Hospital in Sydney (Australia); see Lloyd and Frommer (2004). The screening procedure was based on a sequence of binary diagnostic tests, self-administered on \(T = 6\) successive days. Since no screening test is 100% accurate, replications of the diagnostic test over a number of days may help identify most cases. On each of the six occasions, the presence of blood in feces was recorded. People with six negative tests were not further assessed, so their true disease status remains unknown, while people with at least one positive test had their true disease status determined by physical examination, sigmoidoscopy and colonoscopy. The aim is to estimate how many (say \(f_0\)) cancer patients have been missed by this screening procedure. Lloyd and Frommer (2004) mention that 122 patients with confirmed cancer status were screened again using the identical screening procedure. We focus on this secondary distribution, as \(f_0\) is known there, with \((f_0,\ldots ,f_6)=(22, 8, 12, 16, 21, 12, 31)\).

From Table 5, it is clear that ignoring any source of heterogeneity leads to an underestimation of the population size. The Turing and MLEPoi estimators fail to recover the unknown \(f_0\). The situation does not improve even if we consider estimators that relax the Poisson assumption: Zelterman’s and Chao’s estimators do not show a satisfying behavior either. The true value is not covered by the confidence interval of any of these estimators. The proposed TG estimator clearly outperforms its competitors, with \(\hat{f}_0=17\) and \(\hat{N}=117\), and both the MLEGeo and the LCMP estimators behave reasonably.

Table 5 Bowel cancer data: estimated population size, standard errors, confidence intervals and lengths of confidence interval from eight different estimators

4.3 Other examples where the number of zeros is known

The geometric distribution works well for the two illustrative datasets considered above. A more extended range of examples is provided in the following to support the view that the geometric is useful generally, though it may lead to non-optimal estimates when the Poisson assumption is tenable.

We are going to check the performance of our estimators with four real data sets where the numbers of zeros are known. The first two were analysed in Böhning and Schön (2005) in order to check the performance of the estimators introduced there. The last two (number of dicentrics) were analysed in Puig and Barquinero (2011) using rth-order Hermite distributions. The data sets are as follows:

  1. Daily numbers of deaths in 1989 of women with brain vessel disease in West Berlin: \((f_0,\ldots ,f_{14})=(1,4,15,31,39,55,54,49,47,31,16, 9,8,4,3)\).

  2. Weekly number of packs of a product that were purchased within the previous 7 days in 456 stores. Each frequency is the number of stores that sold exactly x packages: \((f_0,\ldots ,f_{20})=(102,54,49,62,44, 25,26,15,15,10,10,10,10,3,3,5,5,4,1,2,1)\).

  3. Number of dicentric chromosomes after exposure to a radiation dose of 0.405 Gy. Each frequency is the number of cells having exactly x dicentric chromosomes: \((f_0,\ldots ,f_4)=(437,66,15,1,1)\).

  4. Number of dicentric chromosomes after exposure to a radiation dose of 0.600 Gy. The frequencies are as follows: \((f_0,\ldots ,f_4)=(473,119,34,3,2)\).

All the results given by our estimators are shown in Table 6.

Table 6 Other examples where the number of zeros is known: estimated population size and standard errors

There is clear evidence that all the Poisson-based estimators provide a generally good performance for the first dataset (brain vessel). The similarity of all the estimates could be a sign of an underlying Poisson distribution, as already noticed by Böhning and Schön (2005). The proposed estimator overestimates the true population size, driven by an allowance for heterogeneity that is not present in the data. As a further remark, we notice that, for the Zelterman estimator, the standard error estimated by the conditioning method is very different from that obtained through the bootstrap. This is because the standard error estimated by the conditioning method is based on \(f_1\) and \(f_2\), which usually are the highest frequencies in the data but are not in this specific example, leading to an underestimation of the uncertainty. The number of packs data are included here to show that our estimator outperforms its Poisson-based counterpart, which does not provide a value close to the true number of zeros, and performs similarly to the MLEGeo estimator. Poor results were obtained by Böhning and Schön (2005) and Puig and Kokonendji (2018); both works highlighted the presence of heterogeneity (mainly due to zero-inflation) which is difficult to capture and to model properly. Our proposal, instead, seems able to capture this data feature. The last two datasets come from an experiment where the counts of chromosome aberrations (dicentrics) can be modelled by a physical mechanism leading to compound-Poisson distributions (Puig and Barquinero 2011). In both examples, the geometric-based estimators provide much better results than the Poisson-based ones, capturing the heterogeneity in the data.

4.4 The ratio-plot under the geometric distribution

As extensively discussed throughout the paper, the geometric distribution is a potentially suitable candidate for the distribution of the count of identifications, as it naturally captures some of the heterogeneity present in the population. Hence, in some sense, the geometric distribution should be preferred to the Poisson distribution. Nevertheless, the appropriateness of the geometric distribution still needs to be checked for each real data example.

In Böhning et al. (2013) a diagnostic tool, called the ratio plot, was suggested to investigate whether a count dataset follows a specific distribution. For the geometric distribution, it is built on the observation that the ratios of neighboring probabilities are constant: \(r_x = \frac{\kappa _{x+1}}{\kappa _x} = 1-p\), for \(x = 0,1,\ldots ,m\), with p the parameter of the geometric distribution. Note that these ratios do not depend on whether the untruncated or the truncated distribution is considered. A natural estimate of \(r_x\) arises when replacing the unknown probabilities by the estimates \(f_x/N\),

$$\begin{aligned} \hat{r}_x =\frac{f_{x+1}}{f_x} \end{aligned}$$

as the unknown N cancels out.
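The empirical ratios behind the plot can be sketched as below (our own illustration); under a geometric distribution they should hover around the constant \(1-p\). The demonstration uses the golf tees frequencies of Sect. 4.1, where the zero class is known.

```python
# Hedged sketch of the empirical ratio-plot ordinates r_hat_x = f_{x+1} / f_x.
def geometric_ratios(freqs):
    """freqs: dict {x: f_x}; returns {x: f_{x+1}/f_x} where both cells are positive."""
    return {x: freqs[x + 1] / freqs[x]
            for x in sorted(freqs)
            if freqs.get(x + 1, 0) > 0 and freqs[x] > 0}

# Golf tees frequencies including the known zero class (Sect. 4.1)
golf_full = {0: 88, 1: 46, 2: 28, 3: 21, 4: 13, 5: 23, 6: 14, 7: 6, 8: 11}
ratios = geometric_ratios(golf_full)
print({x: round(r, 2) for x, r in ratios.items()})
```

Plotting these values against x (with the error bars discussed below) gives the geometric ratio plot; a horizontal point cloud supports the geometric assumption.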

Fig. 5 Poisson and geometric ratio plots. Left panel: golf tees data. Right panel: bowel cancer data

An obvious question relates to the fact that we could have used another popular distribution for modelling count data, e.g. the Poisson model. The Poisson distribution is given as \(\kappa _x = \exp (-\lambda )\lambda ^x/x!\), so that \(r_x = (x + 1)\kappa _{x+1}/\kappa _x = \lambda \), and we expect \(\hat{r}_x = (x + 1)f_{x+1}/f_x\) to show a horizontal line pattern.

Fig. 6 Ratio plots for a geometric distribution under the null hypothesis with 95% point-wise error bars (see Böhning and Punyapornwithaya 2018 for further details): data in Sects. 4.1 and 4.2

Fig. 7 Ratio plots for a geometric distribution under the null hypothesis with 95% point-wise error bars (see Böhning and Punyapornwithaya 2018 for further details): data in Sect. 4.3

Figure 5 compares both ratio plots for the datasets considered in Sects. 4.1 and 4.2, and there is clear evidence of a positive trend in the Poisson ratio plots. Consequently, we argue that the geometric distribution, whose ratio plot is an almost constant line, is more appropriate in these case studies.

Whereas the ratio plot focuses on whether an empirical, nonparametric estimate of the ratio follows a straight line, a major difficulty in interpreting ratio plots is the qualitative judgment on constancy across the count range. Böhning and Punyapornwithaya (2018) constructed a diagnostic device, namely the ratio plot under the null hypothesis, which shows the observed ratios within the limits expected if the data followed a geometric distribution, so that we can examine more easily whether the observed ratios lie in the specified geometric-defined region. This is achieved by considering 95% point-wise error bars. Figure 6 shows the error bars for the (log-)ratio plots for the golf tees and bowel cancer data, and supports the use of the geometric distribution in these case studies.

To provide further evidence for the results discussed in Sect. 4.3, we show the ratio plots under the null hypothesis for those data as well (see Fig. 7). The geometric distribution looks appropriate for three out of four datasets, while it does not seem appropriate for the brain vessel one, confirming the results previously discussed. To conclude, this graphical device could be used as a preliminary check of the adequacy of the geometric distribution, avoiding the use of geometric-based estimators when it is not appropriate.