1 Introduction

An important aspect of analyzing data collected from different geographical regions, known as spatial data, is the emergence of spatial autocorrelation: a situation where the values of a variable are correlated according to their geographical positions, creating clusters of observations. As Anselin (1988) reports, spatial autocorrelation is attributed to spatial dependence, which, along with spatial heterogeneity, is an outcome of the spatial effects inherent in this type of data. The presence of spatially nonindependent observations causes serious problems in quantitative analysis, since the sample contains less information than a counterpart with independent spatial elements and, moreover, the concept of the random sample is violated, as Schabenberger and Gotway (2005) have indicated. Therefore, any conventional statistical inference will produce unreliable results unless spatial dependence is incorporated into the model, just as time autocorrelation must be incorporated when modeling time-series data. For this reason, a spatial sample should be treated as a realization of a spatial process and not as a random sample, in analogy with time-series analysis.

The traditional Box and Jenkins (1976) methodology for time-series analysis has been extended to spatial analysis in an effort to model spatial dependence between observations of the same variable. In this spirit, spatial correlograms and spatial partial correlograms are constructed, using Moran’s I spatial autocorrelation coefficient or other measures presented by Cliff and Ord (1981), to identify the most adequate spatial generating mechanism for an observed dataset. However, such diagrams may often fail to identify the underlying mechanism correctly and thus produce confusing results, just as in time-series analysis, where the issue of selecting the best fitted model is addressed by several information criteria.

Hence, it is of considerable interest to examine and evaluate, in the context of spatial analysis, the performance of the three information criteria most often used in practice, namely the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the Hannan and Quinn information criterion (HQC), in terms of selecting pure spatial processes. Indeed, the behavior of these criteria has been investigated thoroughly in the literature for time-series processes and regression models but not for spatial models, apart from the studies of Hoeting et al. (2006) and Lee and Ghosh (2008), which considered geostatistical models, i.e., models for point-referenced geostatistical data in which the dependence is captured by a covariance function that relates observations at different locations according to their distance. Using a Monte Carlo analysis, this study finds that these information criteria can successfully contribute to spatial modeling, although their overall behavior depends not only on the sample size but also on the magnitude of the spatial parameters of the true generating processes.

The remainder of the paper is organized as follows. Section 2 presents the most important spatial processes that will be considered in the simulation analysis and introduces the three aforementioned information criteria. Section 3 describes the design of the simulation analysis and discusses the results. Finally, the concluding remarks are presented in Sect. 4.

2 Spatial processes and information criteria

Spatial processes can be regarded as multidirectional extensions of the well-known time-series processes onto geographical space, meaning that the dependence among the values of a variable is expressed according to their geographical positions rather than their chronological order. For example, for a sample of n cross-sectional observations collected from n different geographical units, spatial dependence is incorporated into the process through the definition of an (n × n) spatial weights matrix W that captures the interaction between neighboring locations. The matrix W is usually used in its row-standardized form, with its element Wij different from zero if locations i and j are neighbors and zero otherwise; the determination of the neighbors of each spatial unit is clearly the most important issue in constructing such a matrix. Indeed, a variety of criteria have been proposed in the literature for forming spatial weights, including boundary contiguity and distance measures, as can be seen, for example, in Cliff and Ord (1981) and in Anselin (1988). Contiguity criteria consider as neighbors those spatial units that share common borders, while distance-based criteria define the neighborhood according to the distance between two regions. A brief presentation of the three most commonly used spatial processes that express spatial dependence, namely the spatial autoregressive process of order 1, the spatial moving average process of order 1 and the mixed spatial autoregressive moving average process of orders 1 and 1, is given below.
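As a concrete illustration of the contiguity-based construction (a sketch in Python/numpy rather than the authors' R code; the 3 × 3 lattice size and the function name are assumptions made only for this example), a binary rook matrix can be built and row-standardized as follows:

```python
import numpy as np

def rook_weights(nrow, ncol):
    """Binary rook-contiguity matrix over a regular lattice:
    units sharing a common edge (up/down/left/right) are neighbors."""
    n = nrow * ncol
    W = np.zeros((n, n))
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrow and 0 <= cc < ncol:
                    W[i, rr * ncol + cc] = 1.0
    return W

W = rook_weights(3, 3)
W_std = W / W.sum(axis=1, keepdims=True)  # row standardization: rows sum to one
```

Note that the binary matrix is symmetric, whereas the row-standardized version is generally asymmetric, since two neighboring units may have different numbers of neighbors.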

The spatial autoregressive process of order 1, i.e., SAR(1), was initially introduced by Whittle (1954) and Besag (1974) as an extension of the autoregressive process of order 1, i.e., AR(1), from time-series analysis to the geographical context. Using matrix notation and assuming a zero mean for the examined variable, the SAR(1) process, as presented by LeSage and Pace (2009), is defined as:

$$ {\mathbf{y}} = \rho {\mathbf{Wy}} + {{\varvec{\upvarepsilon}}} $$

where \({\mathbf{y}}\) is an (\(n \times 1\)) vector of observations of the process collected from n geographical points, W is the (n × n) spatial weights matrix, \(\rho\) is the spatial autoregressive parameter and ε is an (\(n \times 1\)) white noise random vector. The vector \({\mathbf{Wy}}\) is called the spatial lag; for a row-standardized \({\mathbf{W}}\), each of its elements is a weighted average of the \({\mathbf{y}}\) values in the neighboring units of the corresponding region.
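To make the definition concrete, the following sketch (Python/numpy is assumed here as a stand-in for the authors' R code; the four-unit ring neighborhood is purely illustrative) generates a small SAR(1) sample by solving \({\mathbf{y}} = \rho {\mathbf{Wy}} + {{\varvec{\upvarepsilon}}}\) for \({\mathbf{y}}\) and checks that each element of the spatial lag is the average of the neighboring values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 4, 0.5
# toy row-standardized weights matrix: units on a ring, two neighbors each
W = np.array([[0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0]])
eps = rng.standard_normal(n)

# solving y = rho*W*y + eps gives y = (I - rho*W)^{-1} eps
y = np.linalg.solve(np.eye(n) - rho * W, eps)

lag = W @ y                                   # the spatial lag Wy
assert np.isclose(lag[0], (y[1] + y[3]) / 2)  # average over unit 0's neighbors
assert np.allclose(y, rho * lag + eps)        # y satisfies the SAR(1) definition
```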

The log-likelihood function of a SAR(1) process, assuming that \({{\varvec{\upvarepsilon}}} \sim N\left( {{\mathbf{0}},\sigma^{2} {\mathbf{\rm I}}} \right)\) with \({\mathbf{\rm I}}\) being the identity matrix and \(\sigma^{2}\) a constant variance, is obtained as:

$$ \ln L\left( {\rho ,\sigma^{2} } \right) = - \frac{n}{2}\ln \left( {2{\uppi }} \right) - \frac{n}{2}\ln \sigma^{2} - \frac{{\left( {{\mathbf{y}} - \rho {\mathbf{Wy}}} \right)^{\prime } \left( {{\mathbf{y}} - \rho {\mathbf{Wy}}} \right)}}{{2\sigma^{2} }} + \ln \left| {{\mathbf{\rm I}} - \rho {\mathbf{W}}} \right| $$

where \(\left| {{\mathbf{\rm I}} - \rho {\mathbf{W}}} \right|\) is the Jacobian determinant for the transformation of the random vector ε into the vector \({\mathbf{y}}\). Substituting the maximum likelihood estimator of the variance of the process, i.e., \(\sigma_{ML}^{2} = {{\varvec{\upvarepsilon}}}^{\prime} {{\varvec{\upvarepsilon}}}/n\) with \({{\varvec{\upvarepsilon}}} = {\mathbf{y}} - \rho {\mathbf{Wy}}\), the log-likelihood function becomes:

$$ \ln L\left( \rho \right) = - \left( {{n \mathord{\left/ {\vphantom {n 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\ln \left( {2{\uppi }} \right) - \left( {{n \mathord{\left/ {\vphantom {n 2}} \right. \kern-\nulldelimiterspace} 2}} \right) - \left( {{n \mathord{\left/ {\vphantom {n 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\ln \left( {\left( {{1 \mathord{\left/ {\vphantom {1 n}} \right. \kern-\nulldelimiterspace} n}} \right)\left( {{\mathbf{y}} - \rho {\mathbf{Wy}}} \right)^{\prime } \left( {{\mathbf{y}} - \rho {\mathbf{Wy}}} \right)} \right) + \ln \left| {{\mathbf{\rm I}} - \rho {\mathbf{W}}} \right| $$

which is clearly only a function of the parameter ρ.
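The concentrated log-likelihood can be coded directly. The sketch below (Python/numpy is an assumption, as is the function name; the toy ring weights matrix is illustrative only) also verifies that it coincides with the full log-likelihood evaluated at \(\sigma_{ML}^{2}\):

```python
import numpy as np

def sar_concentrated_loglik(rho, y, W):
    """Concentrated SAR(1) log-likelihood ln L(rho), following the formula above."""
    n = len(y)
    e = y - rho * (W @ y)                       # residuals eps = y - rho*W*y
    sign, logdet = np.linalg.slogdet(np.eye(n) - rho * W)
    assert sign > 0                             # rho inside the feasible interval
    return (-(n / 2) * np.log(2 * np.pi) - n / 2
            - (n / 2) * np.log(e @ e / n) + logdet)

# check against the full log-likelihood at sigma2 = e'e/n
rng = np.random.default_rng(1)
W = np.array([[0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0]])
rho, n = 0.4, 4
y = np.linalg.solve(np.eye(n) - rho * W, rng.standard_normal(n))
e = y - rho * (W @ y)
s2 = e @ e / n
full = (-(n / 2) * np.log(2 * np.pi) - (n / 2) * np.log(s2)
        - (e @ e) / (2 * s2) + np.linalg.slogdet(np.eye(n) - rho * W)[1])
assert np.isclose(sar_concentrated_loglik(rho, y, W), full)
```

The agreement follows because, at \(\sigma_{ML}^{2} = {{\varvec{\upvarepsilon}}}^{\prime} {{\varvec{\upvarepsilon}}}/n\), the quadratic term collapses to n/2.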

Next, the spatial moving average process of order 1, i.e., SMA(1), is defined as:

$$ {\mathbf{y}} = {{\varvec{\upvarepsilon}}} - \lambda {\mathbf{W\varepsilon }} $$

where again \({{\varvec{\upvarepsilon}}} \sim N\left( {{\mathbf{0}},\sigma^{2} {\mathbf{\rm I}}} \right)\) and λ is the spatial moving average coefficient, as presented by Haining (1978). The log-likelihood function of a SMA(1) process is:

$$ \ln L\left( {\lambda ,\sigma^{2} } \right) = - \frac{n}{2}\ln \left( {2{\uppi }} \right) - \frac{n}{2}\ln \sigma^{2} - \frac{{\left[ {\left( {{\mathbf{\rm I}} - \lambda {\mathbf{W}}} \right)^{ - 1} {\mathbf{y}}} \right]^{\prime } \left( {{\mathbf{\rm I}} - \lambda {\mathbf{W}}} \right)^{ - 1} {\mathbf{y}}}}{{2\sigma^{2} }} - \ln \left| {{\mathbf{\rm I}} - \lambda {\mathbf{W}}} \right| $$

which can be written as:

$$ \ln L\left( \lambda \right) = - \left( {{n \mathord{\left/ {\vphantom {n 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\ln \left( {2{\uppi }} \right) - \left( {{n \mathord{\left/ {\vphantom {n 2}} \right. \kern-\nulldelimiterspace} 2}} \right) - \left( {{n \mathord{\left/ {\vphantom {n 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\ln \left( {\left( {{1 \mathord{\left/ {\vphantom {1 n}} \right. \kern-\nulldelimiterspace} n}} \right)\left[ {\left( {{\mathbf{\rm I}} - \lambda {\mathbf{W}}} \right)^{ - 1} {\mathbf{y}}} \right]^{\prime } \left( {{\mathbf{\rm I}} - \lambda {\mathbf{W}}} \right)^{ - 1} {\mathbf{y}}} \right) - \ln \left| {{\mathbf{\rm I}} - \lambda {\mathbf{W}}} \right| $$

for \(\sigma_{ML}^{2} = {{\varvec{\upvarepsilon}}}^{\prime} {{\varvec{\upvarepsilon}}}/n\) with \({{\varvec{\upvarepsilon}}} = \left( {{\mathbf{\rm I}} - \lambda {\mathbf{W}}} \right)^{ - 1} {\mathbf{y}}\) and \(\left| {{\mathbf{\rm I}} - \lambda {\mathbf{W}}} \right|\) being the Jacobian determinant. Note that, as in the case of the SAR(1), the log-likelihood of the SMA(1) is a function solely of the parameter λ.
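A corresponding sketch for the SMA(1) concentrated log-likelihood (again Python/numpy and the function name are assumptions) highlights the two differences from the SAR(1) case: the error vector is recovered by inverting \({\mathbf{\rm I}} - \lambda {\mathbf{W}}\), and the Jacobian term enters with a negative sign:

```python
import numpy as np

def sma_concentrated_loglik(lam, y, W):
    """Concentrated SMA(1) log-likelihood ln L(lambda), following the formula above."""
    n = len(y)
    B = np.eye(n) - lam * W
    e = np.linalg.solve(B, y)          # eps = (I - lam*W)^{-1} y
    sign, logdet = np.linalg.slogdet(B)
    assert sign > 0                    # lambda inside the feasible interval
    return (-(n / 2) * np.log(2 * np.pi) - n / 2
            - (n / 2) * np.log(e @ e / n) - logdet)

# generate an SMA(1) sample with a toy ring weights matrix and check the inverse map
rng = np.random.default_rng(2)
W = np.array([[0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0]])
lam, n = 0.3, 4
eps = rng.standard_normal(n)
y = (np.eye(n) - lam * W) @ eps
e = np.linalg.solve(np.eye(n) - lam * W, y)
assert np.allclose(e, eps)             # the transformation recovers the errors exactly
ll = sma_concentrated_loglik(lam, y, W)
```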

Lastly, mixed models are also defined in spatial analysis, in analogy with time-series analysis; these are known as spatial autoregressive moving average (SARMA) processes, as introduced by Huang (1984). The simplest mixed spatial autoregressive moving average process is the SARMA(1, 1) process, defined as:

$$ {\mathbf{y}} = \rho {\mathbf{W}}_{1} {\mathbf{y}} + {{\varvec{\upvarepsilon}}} - \lambda {\mathbf{W}}_{2} {{\varvec{\upvarepsilon}}} $$

where \({\mathbf{W}}_{1}\) and \({\mathbf{W}}_{2}\) denote (n × n) spatial weights matrices for the autoregressive and the moving average term, respectively. It should be pointed out that the SARMA(1, 1) process is properly defined if and only if different weights matrices are used for the two components of the process, as can be seen in Mur and Angulo (2007). Unlike an ARMA(1, 1) process in time-series analysis, whose proper definition requires the autoregressive parameter to differ from the moving average parameter, the SARMA(1, 1) process is well defined even for equal values of ρ and λ, provided that the weights matrices are different, i.e., \({\mathbf{W}}_{1} \ne {\mathbf{W}}_{2}\). In effect, a SARMA(1, 1) process combines global and local effects, since the SAR(1) component expresses global spatial dependence, i.e., influences from one geographical point that spread and affect the whole study region, whereas the SMA(1) component defines local spatial dependence, with effects covering only the neighboring regions.

The log-likelihood function of a SARMA(1, 1) process is obtained as:

$$ \ln L\left( {\rho ,\lambda ,\sigma^{2} } \right) = - \left( {{n \mathord{\left/ {\vphantom {n 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\ln \left( {2{\uppi }} \right) - \left( {{n \mathord{\left/ {\vphantom {n 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\ln \sigma^{2} - \left( {{1 \mathord{\left/ {\vphantom {1 {2\sigma^{2} }}} \right. \kern-\nulldelimiterspace} {2\sigma^{2} }}} \right)\left[ {{\mathbf{B}}^{ - 1} {\mathbf{Ay}}} \right]^{\prime } {\mathbf{B}}^{ - 1} {\mathbf{Ay}} + \ln \left| {\mathbf{A}} \right| - \ln \left| {\mathbf{B}} \right| $$

where \({\mathbf{A}} = {\mathbf{\rm I}} - \rho {\mathbf{W}}_{1}\) and \({\mathbf{B}} = {\mathbf{\rm I}} - \lambda {\mathbf{W}}_{2}\); after substituting \(\sigma_{ML}^{2}\), it takes the following form:

$$ \ln L\left( {\rho ,\lambda } \right) = - \left( {{n \mathord{\left/ {\vphantom {n 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\ln \left( {2{\uppi }} \right) - \left( {{n \mathord{\left/ {\vphantom {n 2}} \right. \kern-\nulldelimiterspace} 2}} \right) - \left( {{n \mathord{\left/ {\vphantom {n 2}} \right. \kern-\nulldelimiterspace} 2}} \right)\ln \left( {\left( {{1 \mathord{\left/ {\vphantom {1 n}} \right. \kern-\nulldelimiterspace} n}} \right)\left[ {{\mathbf{B}}^{ - 1} {\mathbf{Ay}}} \right]^{\prime } {\mathbf{B}}^{ - 1} {\mathbf{Ay}}} \right) + \ln \left| {\mathbf{A}} \right| - \ln \left| {\mathbf{B}} \right| $$

which is a function only of the spatial parameters.

To estimate the spatial coefficients ρ and λ for an observed dataset, the log-likelihood functions of the three processes presented above are maximized numerically; the most demanding element of this maximization is the log-determinant of the Jacobian matrix. Ord (1975) has proposed a convenient method for dealing with this issue using the eigenvalues of \({\mathbf{W}}\). For example, the log-determinant term for the spatial parameter ρ can be decomposed as:

$$ \ln \left| {{\mathbf{\rm I}} - \rho {\mathbf{W}}} \right| = \sum\limits_{j = 1}^{n} {\ln \left( {1 - \rho \,\omega_{j} } \right)} $$

where ωj are the eigenvalues of the weights matrix \({\mathbf{W}}\), provided that these values are real numbers. If, on the other hand, the eigenvalues are complex, Bivand et al. (2013) have suggested an alternative method of computing the log-determinant of the Jacobian matrix as follows:

$$ \ln \left| {{\mathbf{\rm I}} - \rho {\mathbf{W}}} \right| = L_{1} + L_{2} $$

where

$$ L_{1} = \sum\limits_{j = 1}^{k} {\ln \left[ {\left( {1 - \rho \,\omega_{j} } \right)\left( {1 - \rho \,\overline{\omega }_{j} } \right)} \right]} = \sum\limits_{j = 1}^{k} {\ln \left[ {\left( {1 - \rho \alpha_{j} } \right)^{2} + \left( {\rho b_{j} } \right)^{2} } \right]} $$

and

$$ L_{2} = \sum\limits_{j = k + 1}^{n - k} {\ln \left( {1 - \rho \,\zeta_{j} } \right)} $$

where \(\omega_{j} = \alpha_{j} + ib_{j}\) and \(\overline{\omega }_{j} = \alpha_{j} - ib_{j}\) denote the 2k complex eigenvalues, i is the imaginary unit and \(\zeta_{j}\) represents a real eigenvalue.
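The eigenvalue decomposition of the log-determinant is easy to verify numerically. In the sketch below (Python/numpy assumed; the small binary matrix is an illustrative stand-in for a nearest-neighbor matrix), the weights matrix is asymmetric, so its eigenvalues may come in complex conjugate pairs, yet the imaginary parts cancel and the sum matches the directly computed log-determinant:

```python
import numpy as np

# small asymmetric binary matrix in the spirit of a nearest-neighbor definition
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=1, keepdims=True)   # row-standardized, generally asymmetric

rho = 0.4
omega = np.linalg.eigvals(W)           # eigenvalues, possibly complex
eig_sum = np.sum(np.log(1 - rho * omega))
# conjugate pairs contribute conjugate log terms, so the sum is real
assert np.isclose(eig_sum.imag, 0.0)

sign, logdet = np.linalg.slogdet(np.eye(4) - rho * W)
assert sign > 0 and np.isclose(eig_sum.real, logdet)
```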

However, it should be mentioned that in spatial analysis, unlike time-series analysis, the values of the coefficients ρ and λ are not necessarily restricted to the interval (−1, +1); the estimation process can be implemented provided that the Jacobian matrix is nonsingular, a condition related to the eigenvalues of the spatial weights matrices. Row-standardized spatial weights matrices always have their largest eigenvalue equal to unity, which ensures that the upper limit of the interval is always +1, while the lower limit is not known a priori and is often smaller than −1. Ord (1975) demonstrated that, for matrices that are symmetric before standardization, the spatial parameters can take values within the interval \(\left( {{1 \mathord{\left/ {\vphantom {1 {\omega_{\min } \,,{1 \mathord{\left/ {\vphantom {1 {\omega_{\max } }}} \right. \kern-\nulldelimiterspace} {\omega_{\max } }}}}} \right. \kern-\nulldelimiterspace} {\omega_{\min } \,,{1 \mathord{\left/ {\vphantom {1 {\omega_{\max } }}} \right. \kern-\nulldelimiterspace} {\omega_{\max } }}}}} \right)\), where ωmin and ωmax are the smallest and largest real eigenvalues of \({\mathbf{W}}\). On the other hand, for asymmetric row-standardized weights matrices with complex eigenvalues, LeSage and Pace (2009) suggested that the Jacobian is nonsingular when the spatial parameters take values in the interval \(\left( {{1 \mathord{\left/ {\vphantom {1 {r_{s} \,,\,\,1}}} \right. \kern-\nulldelimiterspace} {r_{s} \,,\,\,1}}} \right)\), where rs is the most negative purely real eigenvalue of \({\mathbf{W}}\). Lastly, if the parameters take values inside the feasible interval corresponding to the applied weights matrix, the Jacobian determinant is positive, its logarithm exists, and the likelihood function of the process is well defined.
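The feasible interval can be read off the eigenvalues of a row-standardized matrix. In the sketch below (Python/numpy as a stand-in for the authors' R code; the 3 × 3 lattice and the function name are assumptions), a queen-contiguity matrix illustrates an upper bound of exactly +1 together with a lower bound below −1:

```python
import numpy as np

def queen_weights(nrow, ncol):
    """Binary queen-contiguity matrix: a common edge or vertex defines a neighbor."""
    n = nrow * ncol
    W = np.zeros((n, n))
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr, dc) != (0, 0) and 0 <= rr < nrow and 0 <= cc < ncol:
                        W[i, rr * ncol + cc] = 1.0
    return W

W = queen_weights(3, 3)
W_std = W / W.sum(axis=1, keepdims=True)

eigs = np.linalg.eigvals(W_std).real       # real here, up to rounding
lower, upper = 1 / eigs.min(), 1 / eigs.max()
assert np.isclose(upper, 1.0)              # largest eigenvalue of a row-standardized W is 1
assert lower < -1                          # the lower bound can lie below -1
```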

Once the log-likelihood functions are maximized, the best fitted model is selected according to the minimum value of one of the widely used information criteria. The criterion most often encountered in practice is the Akaike information criterion (AIC), suggested by Akaike (1973), which is based on the Kullback–Leibler divergence measure for evaluating the discrepancy between the true model and a candidate model. The AIC is computed as follows:

$$ {\text{AIC}} = - 2\ln \hat{L} + 2p $$

where \(\ln \hat{L}\) is the maximized value of the log-likelihood function and p is the number of parameters of the process. The other two widely applied criteria are the Bayesian information criterion (BIC), suggested by Schwarz (1978) as an attempt to improve on the AIC’s performance, and the Hannan and Quinn information criterion (HQC), suggested by Hannan and Quinn (1978); they are defined, respectively, as follows:

$$ {\text{BIC}} = - 2\ln \hat{L} + p\ln n $$

and

$$ {\text{HQC}} = - 2\ln \hat{L} + 2p\ln \ln n $$

where n is the sample size used for the estimation. Obviously, the number of existing information criteria is not limited to the three presented above, but these are typically the ones most often used in practice, not only because their values are easy to compute but also because they are reported by almost every statistical package. However, since the AIC is known to be strongly negatively biased in small samples, as shown by Sugiura (1978) and Hurvich and Tsai (1989), the bias-corrected version of the AIC proposed by Hurvich and Tsai (1989), denoted AICc, i.e.,

$${\text{AICc}} = {\text{AIC}} + \frac{{2p\left( {p + 1} \right)}}{n - p - 1}$$

is also considered.
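The four criteria are one-liners in code. The sketch below (Python is assumed; the log-likelihood value, p and n are arbitrary illustrative inputs) also shows the ordering of the penalties for a moderate sample size, where 2p < 2p ln ln n < p ln n:

```python
import numpy as np

def aic(loglik, p):
    return -2 * loglik + 2 * p

def bic(loglik, p, n):
    return -2 * loglik + p * np.log(n)

def hqc(loglik, p, n):
    return -2 * loglik + 2 * p * np.log(np.log(n))

def aicc(loglik, p, n):
    return aic(loglik, p) + 2 * p * (p + 1) / (n - p - 1)

# illustrative values: maximized log-likelihood -150 with p = 2 parameters, n = 100
ll, p, n = -150.0, 2, 100
assert aic(ll, p) == 304.0
assert aic(ll, p) < hqc(ll, p, n) < bic(ll, p, n)   # BIC penalizes hardest here
assert aicc(ll, p, n) > aic(ll, p)                  # small-sample correction is positive
```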

The philosophy of an information criterion is to quantify the goodness of fit of an estimated model while penalizing the number of estimated parameters in relation to the sample size. The selection process consists of estimating several different models, computing the value of the designated criterion for each, and choosing as best fitted the model for which the criterion is minimized; it should be emphasized that alternative information criteria do not always select the same best model. The AIC is known as an asymptotically efficient criterion, meaning that, if the true model is not among the candidate models, the criterion chooses the model with the minimum one-step expected quadratic forecasting error as the sample size increases, i.e., the AIC chooses the model that best approximates the unknown and possibly infinite-dimensional true model. On the other hand, the BIC and HQC are consistent criteria in the sense that, if the true model is among the candidate models, they select it with probability approaching 1 as the sample size increases (see Diebold 2007, Judge et al. 1985 and Burnham and Anderson 2002 for details on their properties).

Overall, the AIC has a tendency to select a more elaborate model; Jones (1975) and Shibata (1976) demonstrated that, for autoregressive time-series processes, the AIC overestimates the true order of the process, while the BIC chooses simpler models, in keeping with the principle of parsimony. In the same vein, Hurvich and Tsai (1989) showed that the AIC is a biased estimator of the Kullback–Leibler information, causing overfitting. It is therefore of great interest to evaluate the performance of these information criteria in spatial analysis, i.e., for observations on a lattice appropriate for economic and regional data, given that their behavior has so far been examined mainly for time-series and regression models.

3 Simulation results

The performance of the three previously presented information criteria, i.e., AIC, BIC and HQC, is investigated in this section for spatial data using a Monte Carlo analysis, in terms of selecting the right spatial process among the three candidate processes, i.e., SAR(1), SMA(1) and SARMA(1, 1). Spatial dependence is introduced into the processes by defining two separate row-standardized spatial weights matrices \({\mathbf{W}}_{1}\) and \({\mathbf{W}}_{2}\) for the autoregressive and moving average terms, respectively. More precisely, the matrices \({\mathbf{W}}_{1}\) and \({\mathbf{W}}_{2}\) are constructed using the rook (four neighbors: common edge) and the queen (eight neighbors: common edge or vertex) contiguity definitions, respectively, over a square regular lattice of dimensions 10 × 10 and 20 × 20, providing samples of 100 and 400 observations. Under this formation, the matrices are originally symmetric; they become asymmetric after row standardization, although their structure still behaves much like that of a symmetric matrix. Moreover, the research is extended to a situation with strongly asymmetric weights matrices in order to further investigate the performance of the information criteria in a realistic environment. For this purpose, the geographical structure of Greece at the local authority district level of the Kallikrates Operational Programme, consisting of 325 municipalities, is considered; the matrices \({\mathbf{W}}_{1}\) and \({\mathbf{W}}_{2}\) are constructed according to the four-nearest-neighbor and the eight-nearest-neighbor definitions, respectively, based on the geographical coordinates of the centroid of each municipality. Note that in this case the matrices are asymmetric both before and after row standardization, since the nearest-neighbor criterion defines spatial relations asymmetrically.

Table 1 presents the feasible ranges of the spatial parameters ρ and λ according to the spatial structures specified above. As can be seen from this table, the lower bounds of the spatial coefficients are significantly less than −1 for all matrices except those created with rook contiguity. Moreover, the row-standardized weights matrices constructed with the rook definition over the regular lattices and with the nearest-neighbor definitions over the Greek geographical structure yield complex eigenvalues.

Table 1 Characteristics of the weights matrices along with the lower and upper limits of the spatial parameters ρ and λ

Having described the formation of the \({\mathbf{W}}_{1}\) and \({\mathbf{W}}_{2}\) matrices, the spatial processes are generated as follows: the SAR(1) process as \({\mathbf{y}} = \left( {{\mathbf{\rm I}} - \rho {\mathbf{W}}_{1} } \right)^{ - 1} {{\varvec{\upvarepsilon}}}\), the SMA(1) process as \({\mathbf{y}} = \left( {{\mathbf{\rm I}} - \lambda {\mathbf{W}}_{2} } \right){{\varvec{\upvarepsilon}}}\) and the SARMA(1, 1) process as \({\mathbf{y}} = \left( {{\mathbf{\rm I}} - \rho {\mathbf{W}}_{1} } \right)^{ - 1} \left( {{\mathbf{\rm I}} - \lambda {\mathbf{W}}_{2} } \right){{\varvec{\upvarepsilon}}}\), where the spatial parameters ρ and λ take values within the feasible intervals defined in Table 1 and the vector of random errors \({{\varvec{\upvarepsilon}}}\) is drawn from \(N\left( {{\mathbf{0}},{\mathbf{\rm I}}} \right)\). The whole simulation is conducted in R using the spdep package developed by Bivand (2015) for the manipulation of the spatial weights matrices as well as for the generation of the processes. Each generated dataset is then estimated under all three model specifications, i.e., the SAR(1) with \({\mathbf{W}}_{1}\), the SMA(1) with \({\mathbf{W}}_{2}\) and the SARMA(1, 1) with \({\mathbf{W}}_{1}\) and \({\mathbf{W}}_{2}\) as the weights matrices, by maximizing the log-likelihood function with the nlminb function in R, so that the values of all three information criteria can be calculated. The best fitted model is selected according to the minimum value of each criterion, based on 1000 replications.
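The three data-generating equations can be sketched compactly as follows (Python/numpy as a stand-in for the spdep-based R code; the two toy ring-type weights matrices are assumptions chosen only to satisfy \({\mathbf{W}}_{1} \ne {\mathbf{W}}_{2}\)):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
I = np.eye(n)
# two distinct row-standardized toy matrices standing in for W1 and W2
W1 = np.roll(I, 1, axis=1)                                 # single "next unit" neighbor
W2 = (np.roll(I, 1, axis=1) + np.roll(I, -1, axis=1)) / 2  # two ring neighbors
eps = rng.standard_normal(n)
rho, lam = 0.5, 0.3

y_sar = np.linalg.solve(I - rho * W1, eps)                     # SAR(1)
y_sma = (I - lam * W2) @ eps                                   # SMA(1)
y_sarma = np.linalg.solve(I - rho * W1, (I - lam * W2) @ eps)  # SARMA(1,1)

# SARMA(1,1) nests both pure processes: set lam = 0 or rho = 0
assert np.allclose(np.linalg.solve(I - rho * W1, (I - 0 * W2) @ eps), y_sar)
assert np.allclose(np.linalg.solve(I - 0 * W1, (I - lam * W2) @ eps), y_sma)
```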

Table 2 presents the percentages of model selection by each of the three criteria when the true generating process is the SAR(1) model. As can be seen from this table, the selection behavior of all three information criteria is quite similar, independent of the construction of the weights matrix, and depends mainly on the magnitude of the parameter ρ, meaning that the true SAR(1) process is selected more often by all criteria as the absolute value of ρ increases for a given sample, reaching almost 99% based on the BIC for n = 400. It turns out that the BIC has the best behavior among the three criteria, in terms of most frequently selecting the true model regardless of the sample size and the value of ρ, a result that confirms the well-known tendency of the BIC to select parsimonious models over the AIC. The HQC behaves closely to the BIC but with a lower probability of selecting the true model. For small absolute values of ρ, the second best fitted model selected by all criteria is the SMA(1) process, whereas for large absolute values of ρ, the second best fitted model, selected mainly by the AIC, is the SARMA(1, 1) process. Finally, as the sample size increases, the true model is selected more often even for small values of ρ; i.e., for ρ = 0.2, the SAR(1) process is selected by the BIC in 64.9% and 84.4% of the replications for n = 100 and 400, respectively. It should be noted that the corrected AIC criterion, although not reported, behaves very similarly, if not identically in most cases, to the AIC. The AICc selects the correct spatial process slightly more frequently than the AIC for small sample sizes, whereas for large sample sizes the selection rates are almost the same. For example, for n = 100 and for values of ρ = −0.9, −0.5, 0.2 and 0.8, the AICc selects the SAR(1) process in 82.9%, 81.3%, 59.4% and 84.5% of the replications, respectively, as opposed to the corresponding AIC selection rates of 80.7%, 80.4%, 58.8% and 83.2% presented in Table 2.

Table 2 Percentage of selections for all three spatial processes based on AIC, BIC and HQC when the true data-generating model is the SAR(1) process for various values of ρ using 1000 replications

Similar conclusions can be drawn for the special case of using a real geographical structure to generate the spatial processes. The main interesting feature of this case is that the SAR(1) process is well defined even for values of ρ smaller than −1, as can be seen from Table 1. For all these values, the behavior of all information criteria is similar, if not identical, to that observed for the largest absolute value of the autoregressive parameter within the range (−1, 1), as can be seen from Table 2 for the indicative values of ρ equal to −1.3 and −1.1. Hence, the presence of these values does not provide any additional information of relevance to the Monte Carlo analysis; they exist mathematically to ensure that the Jacobian determinant remains positive so that the likelihood function can be computed. In practice, however, it is very unlikely to obtain such values as estimates of the spatial autoregressive parameter, although the full feasible range of ρ implied by the construction of the weights matrix, based on a specific geographical structure, is included in the maximization process.

The selection rates for all three spatial processes based on all information criteria when the true generating process is the SMA(1) model are reported in Table 3. It is evident that all three information criteria behave quite similarly, as in the previous case, regardless of the construction of the weights matrix. The SMA(1) process is selected more often as the absolute value of the moving average parameter λ increases for all samples, reaching almost 99% based on the BIC for n = 400. Among all criteria, the SMA(1) process is selected most frequently for any given value of λ and sample size by the BIC, followed by the HQC and the AIC. For small absolute values of λ, the second best fitted model selected by all criteria is the SAR(1) process, just as the SMA(1) process was the second best fitted model in the case of the true generating SAR(1) process for small absolute values of the autoregressive parameter, whereas for large absolute values of λ the SARMA(1, 1) process appears as the second choice for the AIC. It is also worth mentioning that the selection of the true model by all criteria improves as the sample size increases for all values of the moving average parameter; i.e., for λ = 0.2, the SMA(1) process is selected by the AIC in 58.9% and 72.4% of the replications for n = 100 and 400, respectively. As in the case of the SAR(1) process, the AICc selects the correct spatial moving average process slightly more frequently than the AIC only for small sample sizes.

Table 3 Percentage of selections of all three spatial processes based on AIC, BIC and HQC information criteria when the true data generating model is the SMA(1) process for various values of λ using 1000 replications

Moreover, the construction of the \({\mathbf{W}}_{2}\) matrix used to generate the SMA(1) processes, either with the queen definition or, in the case of the real geographical structure, with the eight-nearest-neighbor definition, resulted in values of the moving average parameter smaller than −1 in both cases, as Table 1 reports. As in the SAR(1) case, the use of these values does not provide any additional information to the overall analysis, since the behavior of all information criteria in terms of selecting the best fitted model remains the same as for the largest absolute value of the moving average parameter within the range (−1, 1), as can be seen from Table 3 for selected values of λ equal to −1.4 for the queen formation and to −1.4 and −2.4 for the real geographical structure.

Tables 4 and 5 report the selection percentages of each candidate model based on all three information criteria when the true generating model is the SARMA(1, 1) process, for specific values of ρ and λ, since it was not technically feasible to report results for all possible combinations of ρ and λ in a single table. For this purpose, Tables 4 and 5 report simulation results only for small, moderate and large positive values of ρ and λ, respectively, against several values of the other parameter. As can be seen from these tables, the main findings can be summarized in four points: the percentage of selecting the right model by all information criteria increases, regardless of the construction of the weights matrices, (a) as the absolute values of ρ and λ and the sample size n increase jointly, (b) as the absolute value of λ increases for given values of ρ and n, (c) as the absolute value of ρ increases for given values of λ and n and (d) as the sample size increases for given values of ρ and λ. Unlike the previous two cases, the AIC now plays the dominant role, selecting the right model most frequently, followed by the HQC, especially for small sample sizes, whereas the BIC is in this case the least reliable criterion for selecting the true model, a finding that was expected since, as discussed, the AIC has a tendency to select larger models.

Table 4 Percentage of selections for all three spatial processes based on AIC, BIC and HQC information criteria when the true data generating model is the SARMA(1, 1) process for ρ = 0.2, 0.5 and 0.9 and for various values of λ using 1000 replications
Table 5 Percentage of selections for all three spatial processes based on AIC, BIC and HQC information criteria when the true data generating model is the SARMA(1, 1) process for λ = 0.2, 0.5 and 0.9 and for various values of ρ using 1000 replications

For small values of the moving average parameter as well as of the autoregressive parameter, all criteria select almost exclusively the SAR(1) and/or the SMA(1) process rather than the true SARMA(1, 1) process, especially for small sample sizes. For example, for λ = 0.2 and ρ = 0.9, the SAR(1) process is selected 79.4% of the time by AIC, 94.9% by BIC and 89.4% by HQC, as Table 4 reports for n = 100, whereas for ρ = 0.2 and λ = 0.9 the SMA(1) process is selected 56.1% of the time by AIC, 81% by BIC and 69% by HQC, as Table 5 reports for n = 100. Clearly, when the moving average (autoregressive) parameter is very low, the autoregressive (moving average) term prevails, which is why all information criteria prefer the SAR(1) (SMA(1)) model over the true SARMA(1, 1) model. However, when the values of both parameters are large, all criteria select the true SARMA(1, 1) process more frequently; e.g., for values of λ and ρ equal to (0.8, 0.9) and (0.9, 0.8), the SARMA(1, 1) process is selected 75.6% and 95.9% of the time, respectively, by the AIC criterion, even for n = 100. Unlike the previous two cases, the corrected AIC criterion (AICc) selects the true SARMA(1, 1) process less frequently than the AIC criterion only for small sample sizes; e.g., for n = 100 and for values of ρ and λ equal to (0.2, 0.8) and (0.9, −0.5), the SARMA(1, 1) process is selected by AICc 36.2% and 66.2% of the time, respectively, as opposed to the 37.8% and 68% selection rates by AIC reported in Table 4, and similar results hold for all other values of λ and ρ. For large sample sizes, both criteria select the true model at the same rate.

It should be noted, though, that the whole simulation process is conducted even for identical values of both parameters ρ and λ, because the weights matrices \({\mathbf{W}}_{1}\) and \({\mathbf{W}}_{2}\) are different, thereby excluding the presence of a common spatial root. If both weights matrices were exactly the same, the SARMA(1, 1) process would reduce to a spatial white noise process for all identical values of ρ and λ, causing serious issues in the whole Monte Carlo analysis. Lastly, similar conclusions hold for the special geographical structure used to construct the weights matrices: for values smaller than −1 for both parameters, the selection behavior of all information criteria was similar to that for the extreme absolute values of both parameters and did not deliver any additional information to the overall analysis, as already noted.
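The common-root cancellation is easy to verify numerically: with a single shared weights matrix W and ρ = λ, the SARMA(1, 1) filter \((\mathbf{I} - \rho\mathbf{W})^{-1}(\mathbf{I} - \lambda\mathbf{W})\) collapses to the identity, so the generated series is just the innovation term. A minimal sketch with a hypothetical 4 × 4 row-standardized weights matrix (not one of the paper's matrices):

```python
import numpy as np

# Hypothetical tiny weights matrix, row-standardized.
W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
W /= W.sum(axis=1, keepdims=True)

# With W1 = W2 = W and rho = lambda, the SARMA(1,1) filter
# (I - rho*W)^{-1} (I - lambda*W) is the identity matrix,
# i.e., the process degenerates to spatial white noise.
rho = lam = 0.5
A = np.linalg.inv(np.eye(4) - rho * W) @ (np.eye(4) - lam * W)
print(np.allclose(A, np.eye(4)))  # True: the common spatial root cancels
```

This is exactly why the study constructs \({\mathbf{W}}_{1}\) and \({\mathbf{W}}_{2}\) differently, so that identical values of ρ and λ remain admissible in the simulations.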

4 Concluding remarks

The objective of this study was to investigate the behavior of the three most frequently used information criteria, AIC, BIC and HQC, for model selection among competing models for spatial data, using a Monte Carlo analysis. For this purpose, three spatial processes, the SAR(1), the SMA(1) and the SARMA(1, 1) process, are generated either by using a hypothetical geographical structure based on regular grids or by using a real geographical structure based on a map of the administrative units of Greece. For the autoregressive part, the weights matrix is constructed with a rook definition for the theoretical case of both grids and with the four-closest-neighbors definition for the real case, whereas for the moving average part the weights matrix is constructed with a queen definition for the theoretical case of both grids and with the eight-closest-neighbors definition for the real case. The three spatial models are then estimated, based on the predefined weights matrices, using the nlminb function in R to maximize the log-likelihood function and obtain the corresponding values of all three information criteria. The best fitted model is the one with the minimum value of the information criterion used.
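The selection step described above can be sketched compactly. The three criteria follow their standard formulas; the fitted log-likelihoods below are hypothetical placeholders, not values from the paper's simulations:

```python
import numpy as np

def information_criteria(loglik, k, n):
    """AIC, BIC and HQC from a maximized log-likelihood (standard formulas).

    loglik: maximized log-likelihood value
    k:      number of estimated parameters
    n:      sample size
    """
    return {
        "AIC": -2 * loglik + 2 * k,
        "BIC": -2 * loglik + k * np.log(n),
        "HQC": -2 * loglik + 2 * k * np.log(np.log(n)),
    }

# Hypothetical maximized log-likelihoods and parameter counts for the
# three candidate models at n = 100 (illustration only):
fits = {"SAR(1)": (-151.2, 2), "SMA(1)": (-153.8, 2), "SARMA(1,1)": (-150.9, 3)}
for name in ("AIC", "BIC", "HQC"):
    scores = {m: information_criteria(ll, k, 100)[name] for m, (ll, k) in fits.items()}
    print(name, "selects", min(scores, key=scores.get))
```

The best fitted model under each criterion is simply the candidate attaining the minimum criterion value, mirroring the selection rule used throughout the simulations.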

Simulation results showed that the behavior of these criteria is not the same. BIC performs better at selecting models with a small number of parameters, whereas AIC works better for large models, findings that coincide with the general knowledge of the performance of these criteria. The HQC criterion, on the other hand, always comes second among the three, either right after BIC for small models or right after AIC for large models. However, the selection of the true generating process involves considerable ambiguity, especially when the sample size is small and/or when the values of both parameters are small, indicating that all three criteria have difficulty in successfully recognizing the true generating process. Things are even more complicated for spatial analysis.

First, one may argue that the results are sensitive to the construction of the weights matrices, although two different types of structures are employed for the theoretical part of this simulation process. The truth is that in spatial analysis spatial dependence is defined exogenously for every model through a variety of neighborhood criteria, so these matrices are not uniquely constructed. Moreover, the application of the same spatial neighborhood definition to different spatial structures can lead to completely dissimilar spatial weights matrices. For this purpose, this study additionally considered a real geographical structure, i.e., the spatial structure of Greece, whose many geographical peculiarities result in quite asymmetric spatial weights matrices, both to support the simulated results through an alternative structure and to weaken the force of such an argument. It turns out that the behavior of all information criteria is not affected by the construction of these matrices, since the same selection rates are obtained regardless of whether the geographical structure used was the theoretical one based on regular grids or the practical one based on a real geographical structure, as expressed through the patterns of a map.

The proper definition of the SARMA(1, 1) process is another important element in spatial analysis, and it is highly related to the construction of the weights matrices \({\mathbf{W}}_{1}\) and \({\mathbf{W}}_{2}\). Typically, the SARMA(1, 1) process must be defined with different spatial weights matrices for its two components, so that issues concerning common spatial roots do not appear in the analysis, as was carefully imposed in this simulation process. The process can, however, be estimated even with identical weights matrices, provided the autoregressive and moving average parameters do not take the same or nearly the same values. It remains unclear, though, in what sense two weights matrices can be regarded as different, since dissimilarity is a subjective and ambiguous concept. The truth is that spatial weights matrices are sparse matrices in which only a small number of elements are nonzero, regardless of how the dependence is defined. Constructing the second weights matrix, i.e., the matrix that corresponds to the moving average part of the SARMA(1, 1) process, by an alternative method ensures that the two weights matrices are formally different, but not necessarily so in practice. The \({\mathbf{W}}_{2}\) matrix is constructed in such a way that it is denser than the \({\mathbf{W}}_{1}\) matrix, since it has somewhat more nonzero elements. It is therefore hard to tell whether these two matrices are really different, given that both are sparse, and hence it may be difficult to evaluate the performance of these information criteria under these circumstances.
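The sparsity point is easy to quantify. A minimal sketch, assuming hypothetical rook and queen contiguity matrices on a 10 × 10 regular grid (standing in for \({\mathbf{W}}_{1}\) and \({\mathbf{W}}_{2}\); the paper's real-structure matrices differ):

```python
import numpy as np

def grid_weights(rows, cols, queen=False):
    """Binary contiguity matrix on a regular grid: rook, or queen if queen=True."""
    n = rows * cols
    W = np.zeros((n, n))
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # rook moves
    if queen:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]    # add diagonal moves
    for r in range(rows):
        for c in range(cols):
            for dr, dc in steps:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    W[r * cols + c, rr * cols + cc] = 1
    return W

W1 = grid_weights(10, 10)               # rook: autoregressive part
W2 = grid_weights(10, 10, queen=True)   # queen: moving average part
# Both matrices are overwhelmingly sparse; W2 merely has a few more
# nonzero elements per row than W1 (the diagonal neighbors).
print(np.count_nonzero(W1) / W1.size, np.count_nonzero(W2) / W2.size)
```

Both densities stay in the low single-digit percent range, which illustrates why two "different" weights matrices can still be nearly indistinguishable in practice.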

Another way to distinguish processes is to calculate the Frobenius distance between the covariance matrices of two processes. The Frobenius distance measures the distance between two matrices and is defined as:

$$ F(A,B) = \sqrt {\sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {\left( {a_{ij} - b_{ij} } \right)^{2} } } } $$

where \(A = \{ a_{ij} \}\) and \(B = \{ b_{ij} \}\) are square matrices of the same dimension. Clearly, when the value of F(A, B) is close to zero, the matrices A and B are very similar, and they are identical when F(A, B) = 0. Hence, the larger the value of the Frobenius distance, the more different the matrices are. The covariance matrices for the SAR(1), SMA(1) and SARMA(1, 1) processes are defined, respectively, as:

$$ {\text{E}}\left[ {{\mathbf{yy^{\prime}}}} \right] = \sigma^{2} \left[ {\left( {{\mathbf{\rm I}} - \rho {\mathbf{W^{\prime}}}_{1} } \right)\left( {{\mathbf{\rm I}} - \rho {\mathbf{W}}_{1} } \right)} \right]^{ - 1} $$
$$ {\text{E}}\left[ {{\mathbf{yy^{\prime}}}} \right] = \sigma^{2} \left( {{\mathbf{\rm I}} - \lambda {\mathbf{W}}_{2} } \right)\left( {{\mathbf{\rm I}} - \lambda {\mathbf{W^{\prime}}}_{2} } \right) $$
$${\text{E}}\left[ {{\mathbf{yy^{\prime}}}} \right] = \sigma^{2} \left( {{\mathbf{\rm I}} - \rho {\mathbf{W}}_{1} } \right)^{ - 1} \left( {{\mathbf{\rm I}} - \lambda {\mathbf{W}}_{2} } \right)\left( {{\mathbf{\rm I}} - \lambda {\mathbf{W}}_{2} } \right)^{\prime } \left[ {\left( {{\mathbf{\rm I}} - \rho {\mathbf{W}}_{1} } \right)^{ - 1} } \right]^{\prime }$$

and since \(\sigma^{2}\) is the same for all processes, it can be ignored. Table 6 presents the Frobenius distance for the cases that need extra attention, i.e., those with common values of the two parameters, where the distinction between spatial processes is even more difficult. As can be seen from Table 6, it is easier to distinguish a SAR(1) process from a SMA(1) or a SARMA(1, 1) process than a SMA(1) from a SARMA(1, 1) process, and the value of the Frobenius distance increases as the sample size increases and/or as the absolute value of both parameters increases.
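The distance computation above can be sketched directly from the three covariance formulas. The small circular weights matrices below are hypothetical stand-ins for \({\mathbf{W}}_{1}\) and \({\mathbf{W}}_{2}\) (with \(\sigma^{2} = 1\), since it is common to all processes):

```python
import numpy as np

def frobenius(A, B):
    """Frobenius distance F(A, B) between two equally sized square matrices."""
    return np.sqrt(np.sum((A - B) ** 2))

# Hypothetical row-standardized weights: W1 links each unit to its two ring
# neighbors, W2 to its four nearest ring neighbors (so W2 is denser than W1).
n = 6
W1 = np.zeros((n, n))
for i in range(n):
    W1[i, (i - 1) % n] = W1[i, (i + 1) % n] = 0.5
W2 = np.zeros((n, n))
for i in range(n):
    for d in (-2, -1, 1, 2):
        W2[i, (i + d) % n] = 0.25

I = np.eye(n)
rho = lam = 0.5  # a common-parameter case, as in Table 6

# Covariance matrices of the three processes (sigma^2 = 1):
cov_sar = np.linalg.inv((I - rho * W1.T) @ (I - rho * W1))
cov_sma = (I - lam * W2) @ (I - lam * W2.T)
B = np.linalg.inv(I - rho * W1) @ (I - lam * W2)
cov_sarma = B @ B.T

print(frobenius(cov_sar, cov_sma), frobenius(cov_sma, cov_sarma))
```

Each pairwise distance is strictly positive here, confirming that the processes remain distinguishable even when ρ = λ, provided the two weights matrices differ.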

Table 6 Frobenius distance between covariance matrices of the following processes, A: SAR(1), Β: SMA(1) and C: SARMA(1, 1)

Lastly and most importantly, the estimation procedure is another issue that needs special attention. It is unknown whether the estimated values of the spatial coefficients have been influenced by the very small lower limits of their feasible range intervals, which result from the queen and nearest-neighbors spatial dependence definitions, and/or by the optimization algorithm that the R statistical package uses through the nlminb function. An alternative estimation procedure might produce different results concerning the selection rates of all information criteria, especially for small sample sizes.

Information criteria are by far the best statistical tools for selecting and/or evaluating models in any type of quantitative analysis. Their role is to guide the analyst in discovering the true underlying generating mechanism of a phenomenon from a given dataset. However, it turns out that these criteria occasionally fail to select the true model, especially when the sample does not convey strong qualitative and quantitative evidence of the true population behavior of a variable. In such cases, all criteria will most probably select the next closest behavior, but not the right one, as presented and analyzed in this study. For small sample sizes and for small absolute values of the autoregressive and/or the moving average parameters, the selection rate of the true model was not large enough, whereas in the opposite case the selection rate was so high that in several cases the true model was selected with certainty.