1 Introduction: Spatial Outliers

A local or spatial outlier [3, 6] is an observation that differs from its neighbors, i.e., \(z(s_0)\), the value of the variable of interest Z at location \(s_0\), is a local outlier if it differs from \(z(s_0+\varDelta s_0)\), where \(\varDelta s_0\) defines a neighborhood of location \(s_0\).

The usual method for detecting local outliers is somewhat involved: first, we have to define what a neighborhood is, i.e., what “close” means; then, we have to select some locations inside the neighborhood and compute and compare the values of Z at these locations.

In the first part of the paper we propose two novel GIS-based techniques for easily and quickly detecting possible local outliers. The first one, developed in Sect. 2, is based on building a geographical map in which the heights of the ground correspond to the observations. This map of separate heights is completed by means of a Triangulated Irregular Network (TIN) interpolation. Once the geographical map has been built, local outliers are easily identified as hills with steep slopes.

The second technique, developed in Sect. 3, consists of fitting a robust GAM to the observations and then applying the previous process (interpolation plus detection of outlying slopes) to the residuals of this robust fit.

These ideas have been used before (with some variants) in [5, 10, 12]. Here we extend them by considering a more general model, a GAM, because this is the model usually considered when fitting spatial data.

Once possible local outliers have been identified, we compute, in Sect. 4, the probability of such an extreme slope according to a model fitted to the data. If, according to this model (i.e., assuming that the model is correct), the probability of such an extreme slope is small, the hotspot is labelled as a local outlier.

2 Spatial Outlier Detection by Interpolation

We propose, first, to interpolate the observations \(z(s_i)\) using a TIN interpolation, which is implemented in Quantum GIS (QGIS) and essentially means interpolating the observations with triangles. Then we use the Geographic Resources Analysis Support System (GRASS) to compute the slopes of all the triangles obtained with the previous TIN interpolation. Finally, we reclassify the slopes, using GRASS again, looking for outlying slopes. All locations with steep slopes will be considered hotspots, i.e., potential outliers.
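The TIN-plus-slope workflow can be mimicked outside a GIS: a Delaunay triangulation plays the role of the TIN, the gradient of the plane over each triangle gives its slope, and an extreme-slope threshold stands in for the GRASS reclassification. A minimal sketch with synthetic data (the planted outlier and the 95 % threshold are illustrative, not part of the procedure above):

```python
# Sketch of the TIN + slope + reclassification idea on synthetic data.
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
pts = rng.uniform(0, 10, size=(60, 2))        # observation locations s_i
z = np.sin(pts[:, 0]) + 0.1 * rng.standard_normal(60)
z[7] += 5.0                                   # plant a local outlier

tri = Delaunay(pts)                           # the "TIN"
slopes = []
for simplex in tri.simplices:
    p = pts[simplex]                          # the 3 vertices of the triangle
    v = z[simplex]
    # Solve z = a*x + b*y + c on the triangle; (a, b) is the plane's gradient.
    A = np.column_stack([p, np.ones(3)])
    a, b, _ = np.linalg.solve(A, v)
    slopes.append(np.hypot(a, b))             # slope = gradient norm
slopes = np.array(slopes)

# "Reclassification": vertices of triangles with extreme slopes are hotspots.
thr = np.percentile(slopes, 95)
hot = np.unique(tri.simplices[slopes > thr])  # candidate local outliers
```

The planted spike at index 7 produces steep triangles around that location, so it ends up among the flagged vertices, together with some of its neighbors, exactly the "hills with steep slopes" behavior described above.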

Other interpolation procedures could be used, such as Inverse Distance Weighting (IDW), but TIN works well for data in which each observation bears some relationship to the others across the grid, which should be the kind of data usually considered in a spatial data problem [8].

2.1 Multivariate Spatial Outliers

If we have multivariate observations, we first transform them into the scores \(PC_1\), ..., \(PC_p\) obtained from a Principal Component Analysis. With this process, similar to Principal Components Regression, we can apply the previous QGIS method to each one-dimensional independent variable \(PC_i\), obtaining p layers of hotspots (one layer for each \(PC_i\)). The intersection of all of them will be the set of possible multivariate outliers. Moreover, in this way we also obtain a marginal analysis for each univariate variable.
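A minimal sketch of this multivariate step, with the GIS slope detection replaced, for brevity, by a simple extreme-score threshold (the function names, the planted outlier and the 95 % quantile are illustrative, not the paper's procedure):

```python
# PCA scores, per-component hotspot layers, and their intersection.
import numpy as np

def pc_scores(X):
    """Scores of a Principal Component Analysis of the data matrix X (n x p)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt.T                            # column i holds the PC_i scores

def hotspots(values, q=0.95):
    """Indices with extreme |value| -- a crude stand-in for the slope step."""
    return set(np.flatnonzero(np.abs(values) > np.quantile(np.abs(values), q)))

rng = np.random.default_rng(1)
X = rng.standard_normal((85, 2))                # two variables, 85 "departments"
X[3] = [6.0, -6.0]                              # plant a multivariate outlier

S = pc_scores(X)
layers = [hotspots(S[:, i]) for i in range(S.shape[1])]
common = set.intersection(*layers)              # candidate multivariate outliers
```

Each `layers[i]` also provides the marginal analysis for \(PC_i\) mentioned above; `common` is the intersection of all the layers.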

Example 1

Let us consider the Guerry data [9], available in the R package of the same name. This data set was analyzed in [6] and, as there, here we only use 85 departments, excluding Corsica. The two variables considered are also “population per crime against persons” (PER) and “population per crime against property” (PROP).

As we mentioned before, the descriptive process for detecting possible outliers, i.e., hotspots, consists in using QGIS: (a) first incorporating into QGIS the vector data, france1.txt, of the scores, after transforming the original observations with the two Principal Components \(PC_1\) and \(PC_2\); (b) computing a TIN interpolation for each new variable \(PC_1\) and \(PC_2\); (c) computing with GRASS the slopes from a Digital Elevation Model (DEM); (d) using GRASS again to reclassify the slopes into two groups: small slopes and big slopes.

Fig. 1 Slopes reclassification (\(PC_1\) and \(PC_2\))

The details of the computations of all the examples in the paper are at http://www.uned.es/pfacs-estadistica-aplicada/smps.htm.

In these computations we obtain for \(PC_1\) a plot (and a table) of departments with slopes higher than 30 \(\%\) and, for \(PC_2\), slopes higher than 19 \(\%\). The intersection of both layers is shown in Fig. 1, where the outlying slopes (the unfilled circles) correspond to the departments Ain, Ardeche, Correze, Creuse, Indre, Isere, Jura, Loire, Rhone, Saone-et-Loire and Haute-Vienne.

3 Spatial Outlier Detection by a Robust GAM

The method proposed in the previous section is an exploratory technique based only on a GIS. In this section we propose to fit a robust GAM to the spatial observations \(z_i=Z(s_i)\). In this way, large local residuals will point to possible spatial outliers. We consider a GAM because this type of model is generally used for modeling spatial data.

With a GAM, [11], we assume that (univariate) observations are explained as

$$\begin{aligned} z_i = h(s_i) + h({u_1}_i) + \cdots + h({u_k}_i) + e_i \end{aligned}$$
(1)

where \(s_i=({x}_i,{y}_i)\) are the coordinates of \(z_i\); \(u= (u_1,\ldots , u_k)\) is a vector of covariates, and h is a smooth function that is expressed in terms of a basis \(\{ b_1,\ldots, b_q\}\) as

$$\begin{aligned} h(u) = \sum _{j=1}^q b_j(u) \beta _j \end{aligned}$$
(2)

for some unknown parameters \(\beta _j\) ([15], p. 122). The errors \(e_i\) are assumed, as usual, to be i.i.d. \(N(0,\sigma )\) random variables.

A key point in our proposal is to consider the coordinates \(s_i=({x}_i,{y}_i)\) of the observations \(z_i\) as a covariate in model (1).

The function h could be different for each covariate and, in some cases, the coordinates covariate is split into two covariates, the model then being

$$z_i = h_1(x_i) + h_2(y_i)+ h_3({u_1}_i) + \cdots + h_{k+2}({u_k}_i) + e_i.$$

We can summarize model (1) as \(\, z_i = H(s_i,{u_1}_i, \ldots ,{u_k}_i) + e_i \,\). This approach extends the ideas of [7], who consider (p. 52) a linear regression model. Some aspects of the papers [12] and [5] are also extended in this way.

The robust GAM that we shall fit is the model proposed in [13, 14], although other possible robust GAMs are those proposed in [1] or [4].

The robust M-type estimators \(\hat{\varvec{\beta }}\) for the GAM proposed by Wong are the solution of the following system of estimating equations

$$ \sum _{i=1}^n \left[ w(\mu _i) \, \nu (z_i,\mu _i) \, {\varvec{\mu }_{{\varvec{i}}}'}-a(\varvec{\beta }) -\frac{1}{n} \mathbf{S} {\varvec{\beta }} \right] = \mathbf{0} $$

where

\(\mu _i = E[z_i|\mathbf{u}_i]\); \(\varvec{\beta }=(\beta _1,\ldots ,\beta _q)^t\); \({\varvec{\mu }_{{\varvec{i}}}'}=\partial \mu _i/\partial \varvec{\beta }\); \(\nu (z_i,\mu _i) =(z_i - \mu _i)/V(\mu _i)\);

$$w(\mu _i) = \frac{1}{E[\varphi _c'((z_i - \mu _i)/V^{1/2}(\mu _i))]}$$
Fig. 2 Slopes reclassification of the scores of the residuals (\(PC_1\) and \(PC_2\))

$$a(\varvec{\beta }) = \frac{1}{n} \sum _{i=1}^n E_{z_i|\mathbf{u}_i}[\nu (z_i,\mu _i)] \, w(\mu _i) \, {\varvec{\mu }_{{\varvec{i}}}'}$$

\(\varphi _c\) is the Huber-type function with tuning constant c, and \(\mathbf{S}= 2 \lambda \mathbf{D}\), where \(\lambda \) is a smoothing parameter and \(\mathbf{D}\) a pre-specified penalty matrix.

Hence, the previous system of estimating equations is formed by the robust quasi-likelihood equations introduced in [2], plus the usual penalized GAM part.

Once we have a good fit, the residuals, i.e., the differences between the observed and the predicted values, will help us to detect possible spatial outliers. To do this we compute the residuals (or the scores of the residuals if \(\mathbf{z}(s_i)\) is multivariate), incorporate them into QGIS and follow the same process as in the previous section: a TIN interpolation, the slopes obtained with GRASS and, finally, a reclassification with GRASS looking for outlying slopes.
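The robust estimator of [13, 14] is beyond a short snippet, but its key behavior, downweighting observations with large residuals so that outliers are not absorbed by the fit and survive in the residuals, can be sketched with a Huber-type IRLS fit of a simple polynomial surface in the coordinates. This is a crude stand-in for the GAM smooth, not the paper's estimator; all names and settings are illustrative:

```python
# Huber-weighted IRLS fit of a polynomial surface; the residuals of this
# robust fit preserve the planted outlier for the TIN/slope step.
import numpy as np

def huber_weights(r, c=1.345):
    a = np.abs(r)
    w = np.ones_like(a)
    w[a > c] = c / a[a > c]                   # downweight large residuals
    return w

def robust_surface_fit(s, z, degree=3, iters=20):
    """IRLS fit of a degree-3 polynomial surface in s = (x, y); returns residuals."""
    x, y = s[:, 0], s[:, 1]
    B = np.column_stack([x**i * y**j
                         for i in range(degree + 1)
                         for j in range(degree + 1 - i)])
    w = np.ones(len(z))
    for _ in range(iters):
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(B * sw[:, None], z * sw, rcond=None)
        r = z - B @ beta
        scale = max(1.4826 * np.median(np.abs(r - np.median(r))), 1e-8)
        w = huber_weights(r / scale)          # MAD-standardized Huber weights
    return z - B @ beta

rng = np.random.default_rng(2)
s = rng.uniform(0, 1, size=(120, 2))
z = np.sin(3 * s[:, 0]) + s[:, 1]**2 + 0.05 * rng.standard_normal(120)
z[11] += 2.0                                  # plant a local outlier

res = robust_surface_fit(s, z)                # feed these to the TIN/slope step
```

Because the Huber weights shrink the influence of the spiked observation, the fitted surface tracks the bulk of the data and the residual at index 11 stays close to the planted jump.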

Example 2

Let us consider the Guerry data again [9]. We first fit a robust GAM [13, 14] for each dependent variable, PER and PROP, and compute the residuals of each fit. We then compute the scores of these residuals and, again with QGIS, we obtain the departments with slopes higher than 30 \(\%\) for \(PC_1\) and higher than 13 \(\%\) for \(PC_2\), Fig. 2. The hotspots obtained correspond to the departments Hautes-Alpes, Ardeche, Creuse, Indre, Loire, Rhone, Saone-et-Loire, Seine and Haute-Vienne.

4 Identification of Spatial Outliers

With the procedures considered in the two previous sections we obtain a set of possible local outliers. In this section we assess whether the behavior around a hotspot is unlikely enough to label it an actual spatial outlier, computing the probability of obtaining a slope as big as the one observed at a given location \(s_0\). In the framework of the last section, a large (positive or negative) slope, i.e., a large derivative of the function H (h in fact) at \(s_0\), will give us a good indication of whether \(z(s_0)\) is a local outlier or not.

To compute the probabilities of large slopes at the hotspots previously identified, we first fit a classical GAM. We now consider a classical GAM fit instead of a robust one in order to magnify the slopes, because the classical model is more sensitive than the robust one and its slopes are less smoothed. Moreover, we know the (asymptotic) distribution of the estimators of the parameters in a classical GAM but not in the robust one.

From a mathematical point of view, the slope at a point \(s_0\) in the direction v is given by the directional derivative along the unit vector v at \(s_0\).

If we denote, as usual, by \(D_v h(s_0)\) the directional derivative of the function h (assuming it is differentiable) along the direction of a unit vector v at \(s_0\), and by MS the maximum slope, i.e., \(\, MS(s_0) =\sup _v |D_v h(s_0)| \,\), we compute the probability of obtaining the observed maximum slope \(ms(s_0)\), i.e., \( \, P\{MS(s_0) \ge ms(s_0) \}\). If this probability is low (for instance, lower than 0.05), we shall label \(z(s_0)\) as a local outlier (more formally, we could say that we reject the hypothesis that the slope at \(s_0\) is zero, i.e., that \(z(s_0)\) is not a local outlier); the smaller the probability, the more strongly \(z(s_0)\) should be considered a local outlier.

Because we have assumed that the smooth function h has a representation in terms of a basis, (2), the slope will depend on the estimators of the parameters \( \beta _j\), estimators that are approximately normally distributed ([15], p. 189) if the \(z_i\) are normal.

From vector calculus, we know that the largest value of the slope at a location \(s_0\) is the gradient norm, i.e.,

$$MS(s_0) = \sup _v |D_v h(s_0)| = ||\nabla h(s_0)|| = \sqrt{ \left( \left. \frac{\partial }{\partial x} h(x,y)\right| _{{s}_0}\right) ^2 + \left( \left. \frac{\partial }{\partial y} h(x,y)\right| _{{s}_0}\right) ^2}$$

and because h is expressed in terms of a basis, the probability that we have to compute refers to the random variable

$$\begin{aligned} \sqrt{ \left( \sum _{j=1}^q \frac{\partial }{\partial x} b_j(s_0) \cdot \widehat{\beta _j} \right) ^2 + \left( \sum _{j=1}^q \frac{\partial }{\partial y} b_j(s_0) \cdot \widehat{\beta _j} \right) ^2} \end{aligned}$$
(3)

If the probability that this variable exceeds the observed maximum slope is low, \(z(s_0)\) will be labelled as a local outlier.
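Under the approximate normality of \(\hat{\beta }\), the tail probability \(P\{MS(s_0) \ge ms(s_0)\}\) of the gradient-norm statistic (3) can be approximated by Monte Carlo. A hedged sketch, in which the basis derivatives, the coefficients and the covariance matrix are random placeholders rather than values from an actual fit:

```python
# Monte Carlo approximation of the tail probability of the statistic (3),
# simulating the coefficients under the hypothesis of zero slope at s0.
import numpy as np

rng = np.random.default_rng(3)
q = 6
beta_hat = rng.standard_normal(q)             # stand-in for the GAM estimates
V = 0.05 * np.eye(q)                          # stand-in for Cov(beta_hat)
gx = rng.standard_normal(q)                   # d b_j/dx evaluated at s0
gy = rng.standard_normal(q)                   # d b_j/dy evaluated at s0

# Observed maximum slope at s0: the gradient norm of Eq. (3).
ms_obs = np.hypot(gx @ beta_hat, gy @ beta_hat)

# Simulate the statistic with coefficients centered at zero (zero slope)
# and the estimated covariance, then count exceedances.
draws = rng.multivariate_normal(np.zeros(q), V, size=20000)
ms_sim = np.hypot(draws @ gx, draws @ gy)
p_value = np.mean(ms_sim >= ms_obs)           # small => label a local outlier
```

In practice `beta_hat`, `V`, `gx` and `gy` would come from the fitted classical GAM and the derivatives of its basis functions at \(s_0\).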

4.1 Cubic Regression Splines

We shall use cubic regression splines to model the function h in the fit of a GAM to the observations \(z_i\). For this we shall use the R function gam of the R package mgcv. The cubic spline function with k knots \(v_1, \ldots , v_k\) that we fit ([15], pp. 149–150) is (\(v_j \le v \le v_{j+1}\))

$$P(v) = \frac{v_{j+1} - v}{h_j} \, \beta _j + \frac{v- v_{j}}{h_j} \, \beta _{j+1} + \left[ \frac{(v_{j+1} - v)^3}{h_j} - h_j \, (v_{j+1} - v) \right] \, \frac{\delta _j}{6} $$
$$+ \left[ \frac{(v - v_j)^3}{h_j} - h_j \, (v - v_j) \right] \, \frac{\delta _{j+1}}{6}$$

where \(h_j = v_{j+1}-v_j\), \(j=1,\ldots ,k-1\) and \(\delta _j= P''(v_j)\).

The first derivative of P (partial derivative in formula (3)) is

$$P'(v) = \frac{\beta _{j+1} - \beta _j}{h_j} + \left[ -\frac{3 (v_{j+1} - v)^2}{h_j} + h_j \right] \, \frac{\delta _j}{6} + \left[ \frac{3 (v- v_j)^2}{h_j} - h_j \right] \, \frac{\delta _{j+1}}{6}$$

and considering the locations as knots \(v_j\),

$$P'(v_j) = \frac{\beta _{j+1} - \beta _j}{h_j} - \frac{\delta _j \, h_j}{3} - \frac{\delta _{j+1} \, h_j}{6}.$$

If the terms \(\; \delta _j \, h_j/3 \; \) and \(\; \delta _{j+1} \, h_j/6 \; \) are negligible, we have to compute the probabilities

$$P\left\{ (\hat{\beta }_{j+1} - \hat{\beta }_j)/h_j > \text { observed slope} \right\} $$

based on a normal model because ([15], p. 189) \(\hat{\beta }_j \) is approximately normally distributed with mean \(\beta _j\).
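The resulting probability is then a two-sided normal tail area for the difference quotient. A minimal sketch, where the inputs are hypothetical numbers and the standard error `se` is a placeholder for the value that would be derived from the covariance matrix of the fitted GAM:

```python
# Two-sided p-value for H0: beta_{j+1} - beta_j = 0, assuming the
# difference quotient (beta_{j+1} - beta_j)/h_j is approximately normal.
from scipy.stats import norm

def slope_p_value(beta_j, beta_j1, h_j, se):
    slope = (beta_j1 - beta_j) / h_j          # estimated slope at the knot
    return 2 * norm.sf(abs(slope) / se)       # normal tail area, both sides

# Illustrative values only (slope = 2.4, z = 3.0).
p = slope_p_value(beta_j=0.2, beta_j1=1.4, h_j=0.5, se=0.8)
```

A small `p` (say below 0.05) labels the corresponding observation as a spatial outlier, as in Table 1.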

Table 1 Probability of a big slope for both variables

Example 3

Let us consider the Guerry data again. The set of all departments detected as possible outliers by at least one of the two methods explained in Sects. 2 and 3, together with the probabilities of such slopes (i.e., the p-values of the two-sided test of the null hypothesis \(H_0:\beta _{j+1} - \beta _j =0\)), is given in Table 1.

Hence, we can label as spatial outliers the observations at Jura, Rhone and Seine. As remarked in [6], Seine (together with Ain, Haute-Loire and Creuse) is both a global outlier and a local one.

Hence, if we do not consider the department of Seine (because it is a global outlier), we have two departments that can be considered spatial outliers: Jura and Rhone, two departments in what is called the Rhône-Alpes area, i.e., the same result as in [6].