1 Introduction

The spatial scan statistic proposed by Kulldorff (1997) has been widely used in spatial disease or crime surveillance and other spatial cluster detection applications. One of the versions of the scan statistic was developed for inhomogeneous Poisson processes. However, in practice, data may present substantial departure from the assumed underlying Poisson process.

The Poisson distribution is often used for analysis of count data. However, this distribution presents some limitations. One of the most well known is the fact that the variance of the distribution is equal to the mean.

Count data, like contingency tables, often contain cells having zero counts. However, a zero count is called a sampling zero when a zero is drawn from a distribution with positive mass at zero. According to Agresti (1990), a zero for a cell in which it is theoretically impossible to have observations is called a structural zero. Agresti (1990) also remarks “a sampling zero is an observation having value 0, and we regard it as one of the observed counts. A structural zero is not an observation, however.”

Özmen and Famoye (2007) worked with regression models for count data related to zoological data. In their application, the number of C. caretta hatchlings dying from exposure to the sun was also modeled. The data included both structural zeros and sampling zeros. If no C. caretta hatchlings emerge from the nests, the number of C. caretta hatchlings dying from exposure to the sun was automatically zero. If C. caretta hatchlings emerged from the nests, the number of C. caretta hatchlings dying from exposure to the sun could be zero or greater than zero. In the application, since there are two kinds of zero counts, the data is considered to be zero-inflated and the Poisson model is no longer good enough to correctly predict the occurrence of zero counts.

Lambert (1992) introduced the zero-inflated Poisson (ZIP) regression model to account for excess zeros in counts of manufacturing defects. The model has been applied to inumerous situations. Böhning et al. (1999) used a zero-inflated Poisson model in dental epidemiology data.

Several studies have dealt with spatial count data with excess zeros. Agarwal et al. (2002) formulated a Bayesian zero-inflated regression model for spatial count data. The authors applied their methods to counts of nest burrows in a given area where more than 82 % of the counts are 0. Rathbun and Fei (2006) remark that, in ecological surveys, count data often include excess zeros. It can occur either due to the inclusion of a habitat unsuitable to the species, or due to the limited ability of the species to disperse into all parts of the study region. The authors used a spatial zero-inflated Poisson regression model to determine an oak species range. In the context of the study of rare diseases, Gómez-Rubio and López-Quílez (2010) argue that, among other methods for the detection of disease clusters, the scan statistic by Kulldorff (1997) may not be suitable for problems involving very rare diseases. According to the authors, for diseases with very low prevalence, the number of cases may be very low and excess zeros may cause bias in the inferences. However, the authors argue that “a likelihood ratio test similar to that of the spatial scan statistic may be difficult to develop in closed form.”

In this work, we build upon the developments in Kulldorff (1997) and propose two new scan statistics, denoted as Scan-ZIP and Scan-ZIP+EM, allowing for the occurrence of zero-inflated spatial data. Based on simulated data, we found that the Scan-Poisson statistic (Kulldorff 1997) steadily deteriorates as the number of structural zeros increases, producing biased inferences. Moreover, when the data include structural zeros, the Scan-Poisson statistic is more likely to detect spurious significant clusters than the Scan-ZIP and Scan-ZIP+EM.

This article is organized as follows. In Sect. 2, we review the Kulldorff scan statistic. The Scan-ZIP statistic is described in Sect. 3. Some theoretical properties are discussed in Sect. 4, and applications to simulated and real data are presented in Sect. 5.

2 Kulldorff scan statistic

Consider an inhomogeneous Poisson point process over \(k\) regions or locations in a study area. Let \(x_i\) be the number of cases in region \(i\) with corresponding at-risk population \(n_i\) under unit-specific relative risk \(\theta _i\) such that \(x_i\sim \hbox {Poisson}(n_i\theta _i)\). Further, let \(Z\) be a subset of the indexes \(\{1,2,\ldots ,k\}\), describing a given zone, which represent a putative cluster. Define \(\mathcal {Z}\) as a collection of zones in the study area.

Kulldorff (1997) formulated a scan statistic that compares the total number of case-counts in zone \(Z\), \(x_Z=\sum _{i \in Z}^{} x_i\), with the total number of counts outside \(Z\), \(x_{\overline{Z}}=\sum _{i\in \overline{Z}}^{} x_i\), given the corresponding population counts, that is, \(n_Z=\sum _{i \in Z}^{} n_i\) and \(n_{\overline{Z}}=\sum _{i\in \overline{Z}}^{} n_i\). Let \(n=n_{\overline{Z}}+n_{Z}\) and \(x=x_{\overline{Z}} + x_{Z}\), and assume that \(\theta _i=\theta _Z\) for every region \(i\) \(\in \) \(Z\) and that \(\theta _i=\theta _0\) for every region \(i\) \(\notin \) \(Z\). The hypotheses of interest are given by

$$\begin{aligned} H_0: \theta _Z=\theta _0 \quad \hbox {v.s.}\quad H_a: \theta _Z>\theta _0, \quad Z \quad \in \quad \mathcal {Z}, \end{aligned}$$

where \(H_0\) implies that there is a constant risk, while \(H_a\) implies that there is at least one cluster defined by a zone \(Z\ \in \ \mathcal {Z}\) such that \(\theta _Z > \theta _0\).

Since the putative cluster \(Z\) is actually unknown, it is considered a parameter, and the likelihood function \(L(Z)=L(Z,\theta _0,\theta _Z)\) is given by

$$\begin{aligned} L(Z)&= L(Z,\theta _0,\theta _Z)= \displaystyle \left[ \,\,\prod _{i \in Z}^{} \frac{e^{-n_i\theta _Z}(n_i\theta _Z)^{x_i}}{x_i!} \right] \left[ \displaystyle \prod _{j \notin Z}^{} \frac{e^{-n_j\theta _0}(n_j\theta _0)^{x_j}}{x_j!}\right] . \end{aligned}$$
(1)

The maximum likelihood estimators of \(\theta _0\) and \(\theta _Z\) are, respectively, \(\hat{\theta }_0=x_{\overline{Z}}/n_{\overline{Z}}\) and \(\hat{\theta }_Z=x_{Z}/n_{Z}\), and the most likely cluster is the solution of \(\hat{Z}=\{Z: L(Z) \ge L(Z^{\prime }) \ \ \forall \ \ Z^{\prime } \in \mathcal {Z}\}\) where \(\hat{Z}\) is the maximum likelihood estimator of parameter \(Z\).

Besides showing how to find the most likely cluster, Kulldorff (1997) also developed a likelihood ratio test that allows us to decide whether or not the most likely cluster is statistically significant, that is, if the area included in the most likely cluster really incorporates an abnormally high number of cases.

For a given zone \(Z\), we have

$$\begin{aligned} \lambda _Z \!=\!\frac{\hbox {sup}_{\ \theta _Z > \theta _0}L(Z,\theta _0,\theta _Z)}{\hbox {sup}_{\ \theta _Z=\ \theta _0}L(Z,\theta _0,\theta _Z)}\!=\!\left( \frac{x_Z/n_Z}{x/n}\right) ^{x_Z} \left( \frac{x_{\overline{Z}}/n_{\overline{Z}}}{x/n}\right) ^{x_{\overline{Z}}} \!\times I\left( x_Z/n_Z > x_{\overline{Z}}/n_{\overline{Z}}\right) .\nonumber \\ \end{aligned}$$
(2)

Kulldorff’s scan statistic is described by

$$\begin{aligned} \lambda&= \sup _{Z}\lambda _Z=\frac{\hbox {sup}_{\ Z \in \mathcal {Z}, \ \theta _Z > \theta _0}L(Z,\theta _0,\theta _Z)}{\hbox {sup}_{\ Z \in \mathcal {Z}, \ \theta _Z= \theta _0}L(Z,\theta _0,\theta _Z)}. \end{aligned}$$
(3)

3 The Scan-ZIP statistic

Now, in order to describe our scan statistic, consider the same notation as in Sect. 2, but now assume that the case-counts in the regions follow independent ZIP random variables with the same probability \(p\) of a structural zero, that is, \(X_i\sim \hbox {ZIP}(p, n_i\theta _i)\) where \(X_i\)s are independent. Thus, \(P(X_i=x_i \mid p, n_i\theta _i) = P(X_i)\) is given by

$$\begin{aligned} \nonumber P(X_i=0 \mid p, n_i\theta _i)&= p + (1-p)e^{-n_i\theta _i};\\ P(X_i=x_i \mid p, n_i\theta _i)&= (1-p)\frac{e^{-n_i\theta _i}(n_i\theta _i)^{x_i}}{x_i!}, \quad x_i>0. \end{aligned}$$
(4)

The ZIP model allows for additional flexibility when compared to the Poisson model. If \(X_i \sim \hbox {ZIP}(p, n_i\theta _i)\), then \(E(X_i{\mid } p, n_i\theta _i)=(1-p)n_i\theta _i\) and \(V(X_i{\mid } p, n_i\theta _i)=(1-p)n_i\theta _i(1+pn_i\theta _i)\).

It can be verified that \(E(X_i {\mid } p, n_i\theta _i)<n_i\theta _i\) when \(p>0\). Therefore, when structural zeros occur, the ZIP model correctly accounts, on average, for a reduction in the case-counts. Further, since \(V(X_i{\mid } p, n_i\theta _i)>E(X_i {\mid } p, n_i\theta _i)\) when \(p>0\), the ZIP model allows for overdispersion or extra-Poisson variation. The Poisson model often underestimates the observed dispersion.

Regarding the likelihood ratio test formulation for the ZIP model, assume, as usual for the Kulldorff scan statistic, that \(\theta _i=\theta _Z\) for every region \(i\in {Z}\) and that \(\theta _j=\theta _0\) for every region \(j\notin {Z}\). The hypotheses of interest are given by

$$\begin{aligned} H_0:\theta _Z=\theta _0\quad \hbox {v.s.}\quad H_a: \theta _Z>\theta _0,\quad Z\quad \in \quad \mathcal {Z}. \end{aligned}$$

In the consequent sections, we describe our Scan-ZIP statistic in two settings: (a) when we know when a zero count is a structural one, and (b) when we do not know, for sure, whether or not a zero count is a structural one.

3.1 Incomplete-data likelihood

Assume that data include both, sampling and structural zeros, and that we are not able to tell the nature of each kind of zero. Then, the likelihood conditioned on zone \(Z\) and the observed data \(X\) is given by

$$\begin{aligned} \nonumber L(p, \theta _0, \theta _Z) = L(p, \theta _0,\theta _Z \mid Z, X)&= \left[ \,\,\displaystyle \prod _{i \in {Z}}^{} P(X_i)\right] \left[ \, \displaystyle \prod _{i \notin {Z}}^{} P(X_i)\right] , \end{aligned}$$

where \(X = (X_1,\ldots ,X_k)\). Thus, the log-likelihood is given by

$$\begin{aligned} \ell (p, \theta _0,\theta _Z)&= \sum _{\scriptscriptstyle {\begin{array}{c} i \in {Z}\\ x_i=0\end{array}}} \log (P(X_i)) + \displaystyle \sum _{\begin{array}{c} i \in {Z}\\ x_i>0\end{array}} \log (P(X_i))\nonumber \\&+ \displaystyle \sum _{\begin{array}{c} j \notin {Z}\\ x_j=0\end{array}}\log (P(X_j))+ \displaystyle \sum _{\begin{array}{c} j \notin {Z}\\ x_j>0\end{array}} \log (P(X_j)) \nonumber \\ \nonumber&= \displaystyle \sum _{\begin{array}{c} i \in {Z}\\ x_i=0\end{array}} \log [p+(1-p)e^{-n_i\theta _Z}]\nonumber \\&+\displaystyle \sum _{\begin{array}{c} i \in {Z}\\ x_i>0\end{array}} [\log (1-p)-n_i\theta _Z + x_i\log (n_i\theta _Z)]\nonumber \\&+ \displaystyle \sum _{\begin{array}{c} j \notin {Z}\\ x_j=0\end{array}}\log [p+(1-p)e^{-n_j\theta _0}]\nonumber \\&+ \displaystyle \sum _{\begin{array}{c} j \notin {Z}\\ x_j>0\end{array}}[\log (1-p)-n_j\theta _0 + x_j\log (n_j\theta _0)]. \end{aligned}$$
(5)

Using this likelihood, it is not possible to find closed form MLE estimators for the parameters. However, following the approach developed by Lambert (1992), we describe a new likelihood under the assumption that we know in advance the nature of each zero, whether structural or sampling.

3.2 Complete-data likelihood

Let \(\delta =(\delta _1,\ldots ,\delta _k)\) where \(\delta _i\) is a binary variable that assumes \(\delta _i=1\) for a structural zero count in region \(i\). Thus, \(\delta _i \sim \hbox {Bernoulli}(p)\). Consider that \(\delta =(\delta _1,\ldots ,\delta _k)\) is observed, so that now we work with bivariate data, that is, \((X_i,\delta _i), \ \ i=1,\ldots ,k\). The likelihood function for zone \(Z \in \mathcal {Z}\) is given by

$$\begin{aligned}&L(p, \theta _0,\theta _Z) =L(p, \theta _0,\theta _Z\mid Z, X,\delta ) =\left[ \,\displaystyle \prod _{i \in {Z}}^{} P(X_i,\delta _i) \right] \left[ \, \displaystyle \prod _{i \notin {Z}}^{} P(X_i,\delta _i) \right] \nonumber \\&\quad =\left[ \,\displaystyle \prod _{i \in {Z}}^{} P(X_i\!=\!x_i {\mid } \delta _i\!=\!d_i)P(\delta _i\!=\!d_i)\!\right] \left[ \, \displaystyle \prod _{i \notin {Z}}^{} P(X_i=x_i {\mid } \delta _i=d_i)P(\delta _i=d_i)\!\right] \nonumber \\&\quad =\left[ \,\displaystyle \prod _{i \in {Z}}^{} p^{d_i} \left[ \,(1-p)\frac{e^{-n_i\theta _Z}(n_i\theta _Z)^{x_i}}{x_i!}\right] ^{(1-d_i)}\right] \nonumber \\&\qquad \times \left[ \,\displaystyle \prod _{i \notin {Z}}^{} p^{d_i}\left[ \,(1-p)\frac{e^{-n_i\theta _0} (n_i\theta _0)^{x_i}}{x_i!}\right] ^{(1-d_i)}\right] , \end{aligned}$$
(6)

since \(P(X_i=0,\delta _i=1)=p\) and \(P(X_i=x_i,\delta _i=0)=(1-p)\frac{e^{-n_i\theta _i}(n_i\theta _i)^{x_i}}{x_i!}\), \(x_i\ge 0\). Notice that \(P(X_i>0,\delta _i=1)=0\).

Under \(H_a\), the likelihood is given by

$$\begin{aligned} L_{a}(p,\theta _0,\theta _Z)&= p^{\sum _{i=1}^{k}d_i}\ (1-p)^{k-\sum _{i=1}^{k}d_i} e^{-\theta _Z\sum _{i \in {Z}}^{}n_i(1-d_i)}\ \theta _Z^{\sum _{i \in {Z}}^{}x_i(1-d_i)}\\&\times \, e^{-\theta _0\sum _{i \notin {Z}}^{}n_i(1-d_i)}\ \theta _0^{\sum _{i\notin {Z}}^{}x_i(1-d_i)}, \end{aligned}$$

and, under \(H_0\), it is as follows:

$$\begin{aligned} L_{0}(p, \theta _0)&= p^{\sum _{i=1}^{k}d_i}\ (1-p)^{k-\sum _{i=1}^{k}d_i} \times e^{-\theta _0\sum _{i=1}^{k}n_i(1-d_i)}\ \theta _0^{\sum _{i=1}^{k}x_i(1-d_i)}. \end{aligned}$$

Under \(H_0\), the maximum likelihood estimators (MLEs) are

$$\begin{aligned} \hat{\theta }_0=\frac{\sum _{i=1}^{k}x_i(1-d_i)}{\sum _{i=1}^{k}n_i(1-d_i)} \quad \hbox {and}\quad \hat{p}=\frac{\sum _{i=1}^{k}d_i}{k}, \end{aligned}$$
(7)

while, under \(H_a\), the MLEs are

$$\begin{aligned} \hat{\theta }_z=\frac{\sum _{i \in {Z}}^{}x_i(1-d_i)}{\sum _{i \in {Z}}^{}n_i(1-d_i)},\quad \hat{\theta }_0=\frac{\sum _{j \notin {Z}}^{}x_j(1-d_j)}{\sum _{j \notin {Z}}^{}n_j(1-d_j)}, \quad \hbox {and}\quad \hat{p}=\frac{\sum _{i=1}^{k}d_i}{k}. \end{aligned}$$
(8)

Notice that, by the Factorization Theorem (see Casella and Berger 1990, pp. 250), \(\sum _{i=1}^{k}x_i(1-d_i)\), \(\sum _{i=1}^{k}n_i(1-d_i)\), and \(\sum _{i=1}^{k}d_i\) are jointly sufficient for \(({\theta }_0,p)\), while \(\sum _{i \in {Z}}^{}x_i(1-d_i)\), \(\sum _{i\in {Z}}^{}n_i(1-d_i)\), \(\sum _{j \notin {Z}}^{}x_j(1-d_j)\), \(\sum _{j\notin {Z}}^{}n_j(1-d_j)\), and \(\sum _{i=1}^{k}d_i\) are jointly sufficient for \(({\theta }_z,{\theta }_0,p)\).

3.3 Likelihood ratio test when \(\delta \) is known

Let \(L_{a}^{S}(p, \theta _0, \theta _Z\mid Z, X, \delta ) = L_{a}(\hat{p}, \hat{\theta }_0,\hat{\theta }_Z\mid Z, X, \delta )\) and \(L_{0}^{S}(p, \theta _0\mid X, \delta ) =L_{0}(\hat{p}, \hat{\theta }_0 \mid X, \delta )\) be the likelihoods, under \(H_a\) and under \(H_0\), respectively, evaluated at their respective MLEs. Then,

$$\begin{aligned} \nonumber \lambda&= \frac{\hbox {sup}_{Z \in \mathcal {Z}, \ \theta _Z > \theta _0} L(p,\theta _0,\theta _Z\mid Z, X,\delta )}{\hbox {sup}_{Z \in \mathcal {Z}, \ \theta _Z= \theta _0} L(p,\theta _0,\theta _Z\mid Z, X,\delta )} = \frac{\hbox {sup}_{Z \in \mathcal {Z}} L_{a}^{S}(p, \theta _0,\theta _Z \mid Z, X,\delta )}{L_{0}^{S}(p,\theta _0 \mid X,\delta )}\\ \nonumber&= \hbox {sup}_{Z \in \mathcal {Z}} \frac{\left[ \,\frac{\sum _{i \in {Z}}^{}x_i(1-d_i)}{\sum _{i \in {Z}}^{}n_i(1-d_i)} \right] ^{\sum _{i \in {Z}}^{}x_i(1-d_i)} \left[ \,\frac{\sum _{j \notin {Z}}^{}x_j(1-d_j)}{\sum _{j \notin {Z}}^{}n_j(1-d_j)} \right] ^{\sum _{j \notin {Z}}^{}x_j(1-d_j)}}{\left[ \,\frac{\sum _{i=1}^{k}x_i(1-d_i)}{\sum _{i=1}^{k}n_i(1-d_i)} \right] ^{\sum _{i=1}^{k}x_i(1-d_i)}}\\&\times \, I\left( \frac{\sum _{i \in {Z}}^{}x_i(1-d_i)}{\sum _{i \in {Z}}^{}n_i(1-d_i)}>\frac{\sum _{j \notin {Z}}^{}x_j(1-d_j)}{\sum _{j \notin {Z}}^{}n_j(1-d_j)}\right) , \end{aligned}$$
(9)

if there is at least one zone \(Z\) such that \(\frac{\sum _{i \in {Z}}^{}x_i(1-d_i)}{\sum _{i \in {Z}}^{}n_i(1-d_i)} >\frac{\sum _{j \notin {Z}}^{}x_j(1-d_j)}{\sum _{j \notin {Z}}^{}n_j(1-d_j)}\), and \(\lambda =1\) otherwise.

As in the likelihood ratio test developed by Kulldorff (1997), it is not possible to find the exact distribution associated with \(\lambda \). However, in the present case, in which the locations of excess zeros are known, we can just remove the sites with structural zeros from the data set and use the likelihood ratio test developed by Kulldorff.

Using the EM algorithm, in the following sections, we describe a closed-form Scan-ZIP statistic for the case in which \(\delta _i\)s are not known.

3.4 Likelihood ratio test when \(\delta \) is unknown

When \(\delta \) is unknown, the likelihood ratio test for the Scan-ZIP statistic via the EM algorithm is given by

E-step: We estimate \(\delta _i\) by the conditional expectation of \(\delta _i\) given \(X_i\). Since \((\delta _i \mid X_i) \sim \hbox {Bernoulli}(\zeta _i)\), then

$$\begin{aligned} \nonumber \zeta _i&= E(\delta _i \mid X_i)=P(\delta _i=1 \mid X_i)\\ \nonumber&= \frac{P(X_i=x_i\mid \delta _i=1)P(\delta _i=1)}{P(X_i=x_i\mid \delta _i=1)P(\delta _i=1)+P(X_i=x_i\mid \delta _i=0)P(\delta _i=0)}I(x_i=0) \\ \nonumber&= \frac{p}{p+(1-p)e^{-n_i\theta _i}} I(x_i=0)\\&= \left\{ \begin{array}{l@{\quad }l} \frac{p}{p+(1-p)e^{-n_i\theta _i}}, &{} x_i=0 \\ 0,&{} x_i=1,2,\ldots \end{array} \right. . \end{aligned}$$
(10)

Therefore, at the \(m\)th iteration of the EM algorithm,

$$\begin{aligned} \hat{\delta }_i^{(m)}=\frac{\hat{p}^{(m)}}{\hat{p}^{(m)}+(1-\hat{p}^{(m)}) e^{-n_i\hat{\theta }_i^{(m)}}}I(x_i=0), \quad i=1,\ldots ,k. \end{aligned}$$
(11)

M-step: Conditional on vector \(\hat{\delta }^{(m)}=(\hat{\delta }_1^{(m)},\ldots ,\hat{\delta }_k^{(m)})\), at the (\(m+1\))th iteration, the MLEs of \(\theta _0\), \(\theta _Z\), and \(p\) are obtained according to the expressions given in (7) (under \(H_0\)) and (8) (under \(H_a\)), but with \(d_i\) replaced by \(\hat{\delta }_i^{(m)}\). Thus, considering \(H_a\) we have

$$\begin{aligned} \hat{\delta }_i^{(m+1)}&= \left\{ \begin{array}{l@{\quad }l} \frac{\hat{p}^{(m+1)}}{\hat{p}^{(m+1)}+(1-\hat{p}^{(m+1)})e^{-n_i\hat{\theta }_Z^{(m+1)}}}, &{} x_i=0 \quad \hbox {and} \quad i \in {Z}\\ \frac{\hat{p}^{(m+1)}}{\hat{p}^{(m+1)}+(1-\hat{p}^{(m+1)})e^{-n_i\hat{\theta }_0^{(m+1)}}}, &{} x_i=0 \quad \hbox {and} \quad i \notin {Z}\\ 0,&{} x_i=1,2,\ldots \end{array} \right. . \end{aligned}$$
(12)

The E-step and M-step are performed until convergence for each candidate cluster \(Z\). In our experiments and examples, the EM procedure was initialized with \(\hat{\delta }_i^{(0)}=0.5\) if \(x_i=0\), and \(\hat{\delta }_i=0\) otherwise. Convergence was assumed when \(|\delta _i^{(m+1)}-\delta _i^{(m)}|<0.01\), \(i=1,\ldots ,k\).

The likelihood ratio test for the case of unknown \(\delta \) is also given by (9), but with \(d_i\) replaced by \(\hat{\delta }_i^{(s)}\), where \(s\) is the iteration in which convergence is achieved. Thus, given any proposed set \(\mathcal {Z}\) of candidate clusters, Algorithm 1 should be run in order to determine the most likely cluster.

figure d

Again, the distribution of \(\lambda \) cannot be exactly determined and a Monte Carlo simulation is conducted to obtain critical value \(\lambda ^*\). As we do not know the exact number of structural zeros, nor their locations, the Monte Carlo replicas are constructed using a parametric bootstrap-like framework. Thus, given \(\hat{p}\), the estimate of \(p\) for the observed data, each region of the map is randomly assigned as a structural zero location with probability \(\hat{p}\). This procedure is repeated \(B\) times and will generate \(B\) maps with as many structural zeros, on average, as it was estimated for the observed data. The estimated expected number of cases in each region, under \(H_0\), will be \(\hat{E}_0(X_i)=(1-\hat{p})n_i\hat{\theta }_0\), where \(\hat{p}\) and \(\hat{\theta }_0\) are obtained as defined in (7). Now, given the structural zeros in each of the \(B\) bootstrap samples, we perform a Monte Carlo simulation. This procedure is summarized by Algorithm 2.

figure e

4 Properties of the Scan-ZIP test statistic

4.1 Detection versus inference

Kulldorff (1997) remarks that his formulation of the spatial scan statistic enables not only the location of the most likely cluster but also inferences related to it, in the sense that when the null hypothesis is rejected it is possible to locate the region that causes the rejection. According to the author, most of the statistical methods for cluster analysis of spatial point processes do not possess both of these features.

In his Theorem 1, Kulldorff (1997) shows that when the null hypothesis is rejected for a given data set, say \(A\), whose most likely cluster is \(\hat{Z}_A\), then any other data set that shares all the cases in \(\hat{Z}_A\) also implies that \(H_0\) will also be rejected regardless of the locations of the remaining cases in the alternative data set \(\bar{A}\), provided that the total number of points in the map is preserved. Next, we present a new version of this theorem that has been generalized to the case of the Scan-ZIP statistic.

Theorem 1

Consider an inhomogeneous Poisson point process observed over \(k\) regions or locations in a study area. Let \(D\) be a data set composed by triplets \(\{(x_i,n_i,d_i)\}\) for which \(\hat{Z}\), a subset of \(\Omega =\{1, 2, \ldots , k\}\), is the most likely cluster. Similarly, let \(D'\) be an alternative data set such that \(\sum _{i=1}^k x_i'(1-d_i')=\sum _{i=1}^k x_i(1-d_i)\), \(\sum _{i=1}^{k}n_i(1-d_i)=\sum _{j=1}^{k}n_j'(1-d_j')\), and \(n_i'=n_i\), for all \(i\). If the null hypothesis is rejected under \(D\) then it is also rejected under \(D'\) whenever \(\sum _{i \in \hat{Z}} x_i'(1-d_i')\ge \sum _{i \in \hat{Z}} x_i(1-d_i)\).

In other words, Theorem 1 states that, if population \(n_i\) is kept unchanged for all regions in the map and the adjusted number of cases \(\sum _{i\in \hat{Z}} x_i(1-d_i)\) inside \(\hat{Z}\) is not decreased, then \(\lambda (\hat{Z})\) cannot decrease. The proof for Theorem 1 can be found in the Appendix and consists of showing that the test statistic is an increasing function on \(\sum _{i\in \hat{Z}} x_i(1-d_i)\), the adjusted number of cases inside the zone \(\hat{Z}\).

4.2 Power

For the spatial scan statistic, Kulldorff (1997) defined the so-called individually most powerful test (IMP) as a proxy for dealing with the limitations in the technique that prevents finding an uniformly most powerful (UMP) test. As noted by Kulldorff, we cannot expect to find a UMP test except for the special case when there is only one cluster under the alternative hypothesis. The IMP tests are constructed as follows:

  1. (a)

    Partition the subset of the parameter space related to the composite alternative hypothesis, that is, \(W=\{(Z,\theta _Z, \theta _0):\theta _Z>\theta _0, \ Z \ \in \ \mathcal {Z}\},\) into a countable number of sets \(\{A_j\}\) to form a partition of \(W\), that is, \(A_i \cap A_j = \emptyset \) for all \(i \ne j\) and \(W=\cup _j A_{j}^{}\). Let \(\Psi =\overline{W}\cup W\) represent the whole parameter space.

  2. (b)

    Define a partition \(\{R_j\}\) of the critical region \(R\) (see Sect. 3.3 for the Scan-ZIP statistic).

  3. (c)

    Take an alternative critical region \(R^{\prime }=\cup _j R^{\prime }_{j}\) in which \(\{R^{\prime }_j\}\) represents a partition of \(R^{\prime }\).

Let \(\pi (\theta )=P(w \in R \mid \theta )\), with \(\theta \in \Psi \), be the power function of a hypothesis test with rejection region \(R\) (similarly, \(\pi ^{\prime }\) for the rejection region \(R^{\prime }\)). The following definition, restated from Kulldorff (1997), completes the description of the IMP test.

Definition 1

For a particular significance level \(\alpha \), a test is IMP with respect to a partition \(\{A_j\}\) of \(W\) and a partition \(\{R_j\}\) of the critical region \(R\), if for each \(A_i\) there are no sets \(R^{\prime }\) and \(\{R^{\prime }_j\}\) such that

  1. 1.

    \(R_j=R^{\prime }_j\) for all \(j\ne i\);

  2. 2.

    \(P( w \in R^{\prime } \mid H_0)=\pi ^{\prime }(\theta )=\alpha \), if \(\theta \in \) \(\overline{W}\);

  3. 3.

    \(P( w \in R^{\prime }_i \mid (Z,\theta _0,\theta _Z))>P( w \in R_i \mid (Z,\theta _0,\theta _Z))\) for any \((Z,\theta _0,\theta _Z) \ \in A_i\).

Definition 1 implies that no other test of size \(\alpha \) exists, such that the three conditions above can be simultaneously true.

Now, let \(A_Z=\{(Z,\theta _Z,\theta _0): \theta _Z>\theta _0\}\) and \(A_0=\{(Z,\theta _Z,\theta _0): \theta _Z=\theta _0\}\). Let \(R_Z\), a subset in a partition of \(R\) described by the intersection of the critical region \(R\) and a subset of the sample space in which \(Z\) is the most likely cluster. However, let \(R^{\prime }_Z\) be any subset in a partition of \(R^{\prime }\). We now enunciate Kulldorff’s Theorem 2 for the case of the Scan-ZIP statistic.

Theorem 2

The test based on \(\lambda \) forms an IMP test with respect to partitions \(\{ A_Z\}\) and \(\{ R_Z\}\).

The proof for Theorem 2 can be found in the Appendix.

5 Simulation study

In order to test our methodology, we constructed different artificial clusters—which we will call true clusters—using a map consisting of 203 hexagonal cells arranged in a regular grid, each cell with population 1,000. Thus, the total population in the whole map is 203,000. We will assume that the distance between two cells is the Euclidean distance measured between their centroids. The centroid of each hexagonal cell is initially defined as its center. In this framework the distances from any cell to its neighbors are always the same. So, to avoid any tie when measuring distances between any two cells, we introduced a random disturbance in the \(x\) and \(y\) coordinates of the cells’ centroids.

For each cluster, the cases (e.g., a given disease, crime, etc.) are randomly distributed over the map according to a multinomial distribution whose probabilities are proportional to the relative risks assigned to each region. These risks are computed following the procedure described by Kulldorff et al. (2003), assigning high risks to the regions inside the cluster and lower risks to the remaining regions. The risks of regions inside the cluster are set high enough so that conditioned on the total number of cases, the null hypothesis would be rejected with probability 0.999 when using a binomial test.

A structural zero inside the cluster can be interpreted as a high risk area whose case count is unavailable. In this sense, if there are any regions inside the cluster with structural zeros, then the attributed risk is also high, but the total number of cases is distributed only among the non-structural zero regions of the map. If it happens that a case is assigned to a structural zero region, than that case is rejected and another one is generated. Thus, the case counts of the structural zero regions are always set to zero.

The referred random distribution of cases is repeated thousands of times (say \(N\) times), generating thousands of simulated data sets which are used to compute power, sensitivity, and positive predicted value (PPV).

Let us denote the cluster detection algorithms as Scan-Poisson (Kulldorff’s standard version of the Poisson-scan algorithm), Scan-ZIP (when the structural zeros are known), and Scan-ZIP+EM (when the structural zeros are unknown). All the three algorithms were used along with circular windows to build the set \(\mathcal {Z}\) of candidate clusters, as described by Algorithm 3.

For each of the \(N\) random distributions of cases over the map, test statistic \(\lambda _i\), \(i=1,\ldots , N\), is computed by running the circular cluster detection algorithm. Then, the obtained \(\lambda _i\)s are compared to critical value \(\lambda ^*\), obtained by the Monte Carlo simulation under \(H_0\). The power is then given by

$$\begin{aligned} \hbox {Power}=\displaystyle \frac{\sum _{i=1}^{N} I(\lambda _i>\lambda ^*)}{N}, \end{aligned}$$

where \(I(\lambda _i>\lambda ^*)=1\) if \(\lambda _i>\lambda ^*\) and 0 otherwise. Sensitivity and PPV are defined in terms of the population of the true and detected clusters (Kulldorff et al. 2009) as

$$\begin{aligned} \hbox {Sensitivity}&= \frac{\hbox {Pop(Detected Cluster} \cap \hbox {True Cluster)}}{\hbox {Pop(True Cluster)}},\\ \hbox {PPV}&= \frac{\hbox {Pop(Detected Cluster} \cap \hbox {True Cluster)}}{\hbox {Pop(Detected Cluster)}}. \end{aligned}$$

where Pop() denotes the population. The reported values of sensitivity and PPV are the average values of these proportions along the \(N\) Monte Carlo simulations.

figure f

5.1 Simulation design

The three algorithms were used to detect the artificial clusters \(A\), \(B\), \(C\), and \(D\), which are described below and illustrated in Fig. 1.

  • Scenario \(A\) is a circular cluster crossed by a set of contiguous regions with structural zeros disconnecting the cluster;

  • Scenario \(B\) is a circular cluster with “random” structural zeros;

  • Scenario \(C\) is a small circular cluster with a single structural zero region in the middle; and

  • Scenario \(D\) is an irregular L-shaped cluster with a structural zero in the middle of the upper part and with two structural zeros disconnecting the lower left part.

Fig. 1
figure 1

Artificial scenarios \(A\) (top-left), \(B\) (top-right), \(C\) (bottom-left), and \(D\) (bottom-right). Regions in gray represent the cluster and the \(\times \)s indicate structural zeros

Notice, from Fig. 1, that in the four scenarios described above, the structural zero regions are the same, and thus the scenarios differ only by the locations of the high-risk regions. These scenarios were designed in order to simulate realistic situations: a cluster disconnected by a systematic pattern (\(A\)), random lack of information (\(B\)), a cluster with a hole (\(C\)), and a problematic irregular cluster (\(D\)).

In order to illustrate the deteriorating performance of the Scan-Poisson algorithm as we progressively add structural zeros to the true cluster, we designed four extra progressive scenarios, \(A_1\)\(A_4\) (see Fig. 2). These scenarios are modified versions of scenario \(A\), in the sense that high-risk regions are the same, and thus that these scenarios differ only by the locations of the structural zero regions. Starting from scenario \(A_1\), structural zeros are gradually moved from outside to inside the true cluster, thereby preserving the total number of structural zeros in the map. Further, we have included scenario \(A_0\), also designed from scenario \(A\), but with no structural zeros.

All scenarios have 15 structural zero regions (except for \(A_0\)) with the total number of cases \(M\) fixed at 0.25 % of the total population: 507 cases. For each scenario we run 10,000 Monte Carlo replications under the null hypothesis for each method to compute the critical value, and another 10,000 replications under the alternative hypothesis to compute power, sensitivity, and PPV. The random distributions of cases are always the same for the three methods.

Fig. 2
figure 2

Artificial progressive scenarios \(A_1\) (top-left), \(A_2\) (top-right), \(A_3\) (bottom-left), and \(A_4\) (bottom-right). Regions in gray represent the cluster and the \(\times \)s indicate structural zeros

5.2 Analysis for the simulated data

The results obtained using the simulated data are presented in Table 1, in terms of power, sensitivity, and PPV, for the three different methods applied to the nine designed scenarios. As we can observe, regarding the estimates of power, sensitivity, and PPV, both Scan-ZIP and Scan-ZIP+EM algorithms are systematically superior to the Scan-Poisson algorithm, the last one being comparable only when the true cluster presents none or few structural zeros. Besides, the performance of Scan-ZIP+EM is very close to that of Scan-ZIP.

Table 1 Power, sensitivity and positive predictive value obtained by the three methods for artificial clusters \(A_0\), \(A\), \(B\), \(C\) and \(D\) and progressive \(A_1\)\(A_4\)

Regarding the progressive scenarios, the Scan-Poisson algorithm steadily deteriorates as the number of structural zeros increases. The Scan-ZIP algorithm is able to maintain its good properties irrespective of the number of structural zeros, while the Scan-ZIP+EM algorithm deteriorates much slower than the Scan-Poisson algorithm, remaining close to the Scan-ZIP algorithm. This suggests that the structural zero estimates obtained by the EM procedure are quite reasonable and that the Scan-ZIP+EM algorithm is clearly capable of capturing the presence of structural zeros.

The results obtained for scenario \(A_0\) show that the lack of structural zeros does not compromise the performance of the Scan-ZIP+EM method, as its performance is very close to that of the Scan-Poisson algorithm. This indicates that the proposed statistic can be possibly used as a more general model than the standard Poisson approach. For this scenario, the results for the Scan-ZIP method are obviously identical to those of the Scan-Poisson algorithm.

5.3 Discussing the Scan-Poisson type I error

It is important to study the performance of the discussed methods for maps that include structural zero regions but with nonexistent clusters. To that purpose, we conducted another experiment in which we considered 15 structural zero regions, and according to the ZIP model, we randomly distributed 507 cases over the map under the null hypothesis. This experiment was repeated 10,000 times. In this setting, the expected number of cases was the same for all regions, except for the regions presenting a structural zero. We then applied the three scan methods to each one of the 10,000 generated samples in order to compute the probability of type I error for each approach.

For \(LR\) tests with \(\alpha =0.05\), the Scan-ZIP and Scan-ZIP+EM methods presented type I error rates of 0.0548 and 0.0561, respectively. Thus, both methods produced error rates close to the nominal significance level. However, the Scan-Poisson type I error proportion was 0.1006.

The Scan-Poisson performance was poorer because the simulated process under \(H_0\) was not a homogeneous one, an assumption required for building the Scan-Poisson \(LR\) empirical distribution. This violation in the homogeneity assumption caused a substantial bias in the \(LR\) estimated distribution. As a consequence, the Scan-Poisson method was more likely to detect spurious significant clusters than it would be expected in normal conditions. This problem obviously did not affect the Scan-ZIP method that proved to be more robust than the Scan-Poisson method to the occurrences of type I error. The Scan-ZIP+EM method is apparently able to adequately estimate the model.

6 Application

In many practical situations, we do not actually know whether or not the data include structural zeros. However, the ZIP model can be used to account for excess zeros when the practitioner suspects that there are more zeros than that expected for count data under the assumption of a Poisson distribution. For example, in small towns, due to the lack of a health center designed to diagnose a given disease, the number of reported cases may be set as zero. Further, limited health insurance over community members may prevent access to health center facilities, thereby causing a downward bias in the notification statistics provided by this health center.

Another source of negative bias in the notification statistics is due to under-reporting. This may occur despite state and local laws requiring medical providers to report notifiable cases to public health authorities and can be attributed to incomplete reporting by the health staff. We would like to stress the fact that in Eq. (4), the probability of zero cases includes two possible sources: the zero that occurs due to a random or stochastic situation and the zero that is due to the structural (and almost deterministic) nature of the event.

6.1 Breast cancer deaths in northern Brazil

In order to illustrate some of these points, consider the northern area of Brazil, divided into municipalities. For each municipality the number of breast cancer deaths from 2008 to 2011 are recorded. Areas with non-zero counts are then displayed in a map. We observe that most of these areas in the map (see Fig. 3) present zero cases. This has nothing to do with possible immunity to this disease. Actually, that is just an artifact caused by the absence of mammographers in these areas.

Fig. 3
figure 3

Northern region of Brazil. Gray regions indicate municipalities with at least one death due to breast cancer

For example, according to the Brazilian National Cancer Institute (INCA), in 2011, only 86 mammographers were available for the entire northern region that includes a total of 450 municipalities. Furthermore, the northern region of Brazil is an enormous territory (approximately 40 % of the US area), with a very low population density and encompassing most of the Amazon forest, thus making mobility very difficult. Further, given an imperfect system of notification of cases (e.g., not pointing out the municipality of residence), and we will certainly observe “false” zeros, caused by a series of external factors. As we can observe from Table 2, only 16 of the 450 municipalities reported some notification records concerning deaths caused by breast cancer.

Table 2 Number of breast cancer death notifications in the northern area of Brazil from 2008 to 2011

We applied our methods to the Tocantins state (highlighted in Fig. 3). Tocantins comprises 139 municipalities and for each one we considered the administrative center as the centroid. According to the recorded data, Tocantins presented 14 deaths by breast cancer during 2008, for a population at risk of 690,104 women. We considered that municipalities with no available mammographers present structural zeros. In fact, we observed that all municipalities without functional mammographers presented no records of deaths by breast cancer during 2008.

The most likely clusters detected by the Scan-Poisson, Scan-ZIP, and Scan-ZIP+EM methods are shown in Fig. 4. The cluster detected by the Scan-Poisson method, including only one municipality (Araguaína), is significant (\(p\) value \(=\) 0.0021), while the most likely clusters detected by the other methods are not: \(p\) value = 0.7712 for the Scan-ZIP and 0.1750 for the Scan-ZIP+EM method. The non-significant cluster detected by the Scan-ZIP method includes the municipality of Araguaína and some of its neighbours.

Fig. 4
figure 4

Most likely clusters of breast cancer deaths found by the Scan-Poisson (left), Scan-ZIP (center), and Scan-ZIP+EM (right) method in the state of Tocantins, Brazil

The disagreement between the Scan-Poisson method and the other methods can be justified by the results of our last simulations: as we mentioned at the end of Sect. 5, the Scan-Poisson method presents some bias towards the detection of spurious significant clusters when structural zeros are present in the data. These findings reinforce the importance of our proposed methodology as a useful tool in preventing incorrect identification of clusters when zero-inflation is present in the data.

6.2 Oral cancer cases in Georgia, United States

We now present an application to oral cancer cases data in Georgia, United States. According to the International Classification of Diseases, oral cancer is a malignant neoplasm of lip, oral cavity, or pharynx, and can be described as an abnormal, malignant tissue growth in the mouth. The data set used in our application was obtained at http://oasis.state.ga.us and consists of 430 deaths by oral cancer between 1994 and 1995, for all races, ethnicities, and life stages. The resulting map is divided into 159 counties, each of which had the centroid set to the county seat. The population at risk is 7,157,165. Figure 5 shows the population distribution over the map of Georgia and the disease incidence rates. Our guess is that the data may be affected by under-reporting and possible presence of structural zeros. This can be, for example, due to the lack of health insurance coverage among some citizens in Georgia, thereby preventing them from obtaining any kind of help, which leads to there being no diagnosed cause of death. Another possibility is under-reporting, for example, due to failures in filling medical protocols by the medical staff.

Fig. 5
figure 5

Population distribution (left) and oral cancer death rates (right) in Georgia, United States, between 1994 and 1995

We then run the cluster detection algorithms using the Poisson and ZIP models. Since we have no information about the location of possible structural zeros, we used the Scan-ZIP+EM version for the ZIP case. The most likely clusters found by each method are shown in Fig. 6. The black dots indicate regions with zero case counts. The cluster produced by the Scan-Poisson algorithm is not significant at \(\alpha = 0.05\) (\(p\) value = 0.056) while the cluster detected by the Scan-ZIP+EM algorithm is highly significant (\(p\) value = 0.002). Besides, the cluster produced by the Scan-Poisson algorithm differs substantially from the one that was produced by the Scan-ZIP+EM algorithm, and the former seems to misrepresent a high-incidence oral cancer area.

Fig. 6
figure 6

Clusters found by the Scan-Poisson (left) and Scan-ZIP+EM (right) algorithms. Black dots indicate zero case counts

We notice from Fig. 6 that the presence of some regions with zero case counts prevents the Scan-Poisson method from finding a solution (or cluster) that includes the high-incidence areas in the lower-middle part of the map. This happens because the inclusion of these regions lowers the test statistic (\(\lambda \)). However, the solution provided by the Scan-ZIP+EM algorithm is very robust in the sense that it does include several zero case-count areas, since their impact in reducing \(\lambda \) is easily compensated by the inclusion of the high incidence-regions.

7 Discussion

In this article, we focus on extensions of the scan statistic proposed by Kulldorff (1997) for cluster detection of spatial count data. The new scan statistics, denoted as Scan-ZIP and Scan-ZIP+EM, allow for the occurrence of zero-inflated spatial data. Our simulations revealed that the Scan-Poisson statistic proposed by Kulldorff (1997) steadily deteriorates as the number of structural zeros increases, producing biased inferences. This can be evaluated in terms of power, sensitivity, and PPV measures. Our simulations also revealed that the Scan-ZIP and Scan-ZIP+EM statistics are, most of the time, either superior or comparable to the Scan-Poisson statistic for all scenarios. When the structural zeros are known, Scan-ZIP is far superior to the other algorithms. The real data application reinforces these conclusions: for high-incidence areas, which nonetheless present zero case counts, the Scan-Poisson method fails to provide a reasonable solution or cluster.

Another important finding shown in our simulations is that when the data include structural zeros, the Scan-Poisson method is more likely to detect spurious significant clusters than the Scan-ZIP and Scan-ZIP+EM methods.

Like Kulldorff’s scan statistic, the ZIP scan statistic can be linked to more sophisticated cluster detection techniques, allowing for the detection of irregularly shaped clusters. Moreover, it seems quite straightforward to make extensions to space-time and prospective space-time versions. However, it should be borne in mind that the computational effort required by the EM algorithm increases as the number of candidate clusters increases, since we must estimate structural zeros for each candidate cluster. For instance, for the map used in our simulations with 203 regions and using a C++ implementation optimized for a Core i7 PC with 4GB RAM, running 10,000 simulations under \(H_0\) takes about 8 seconds for the Scan-Poisson or the Scan-ZIP. In our experiments, the Scan-ZIP+EM was about eight times slower than the Scan-Poisson and Scan-ZIP. A copy of our codes can be made available upon request.

In this work, we used area-aggregated data for simulation and applications. With the formulation presented in this paper, the proposed method is restricted to this type of data. For instance, our methodology is not suitable for point pattern data. In the future, a more general approach could be developed to account for point pattern data sets.

In Sect. 3, we assume that the case counts in the regions follow independent ZIP random variables with the same probability \(p\) of a structural zero. However, following Lambert (1992), a more flexible model can be devised by considering the case where probability \(p_i\) of a structural zero in region \(i\) is modelled as a function of covariates using a logistic regression.