Abstract
Logistic regression is commonly used to estimate the association of one or more independent variables with a binary dependent outcome. In many applications, latent sources are both spatially dependent and non-Gaussian; thus, it is desirable to exploit both properties jointly. Spatial logistic regression is a well-established technique for including spatial dependence in logistic regression models. In this paper, we develop a spatial logistic regression model based on a valid skew-Gaussian random field. For parameter estimation, we use a Monte Carlo extension of the EM algorithm along with an approximation based on the standard logistic function. A simulation study is conducted to assess the performance of the proposed model and to compare the results with those of a recently introduced model with established efficiency. The identifiability of the parameters is investigated as well. For illustration, an application to the Meuse heavy metals dataset is presented.
Supplementary materials accompanying this paper appear online.
1 Introduction
Many practical studies in public health, ecology and many other disciplines rely on binary spatial data. However, most conventional spatial analyses were designed to address the problem of estimation/prediction based on continuous observations. In the case of binary variables, for instance in the diagnosis of groundwater pollution, there are only two possible outcomes: present (denoted as 1) or absent (denoted as 0). The logistic regression model is a well-known and well-documented methodology that is used in many contexts, in particular in the presence of spatial dependence; see, for example, Lin and Clayton (2005), Zhu et al. (2005), Xie et al. (2005), Tayyebi et al. (2010), Wu and Zhang (2013), Diggle and Giorgi (2016).
In a spatial framework, Paciorek (2007) focused on a large binary dataset and compared penalized likelihood and Bayesian models based on fit, speed and ease of implementation. He also devised an effective Markov chain Monte Carlo (MCMC) sampling scheme to address slow mixing of MCMC techniques in a generalized linear mixed model (GLMM). Zhu et al. (2008) studied logistic regression analysis of binary lattice data using a spatial–temporal autologistic regression model in a frequentist approach and used Monte Carlo maximum likelihood estimators for parameter estimation. To handle computational and inferential challenges posed by high-dimensional binary spatial data, Chang et al. (2016) presented a novel calibration method for computer models and applied a generalized principal component-based dimension reduction method. Sengupta et al. (2016) used a reduced-rank spatial random effects model to account for remote sensing datasets that can be massive in size and non-stationary in space. They estimated the parameters using an expectation–maximization (EM) algorithm. Nisa et al. (2019) focused on the estimation of the propensity score, a method used to reduce bias due to confounding factors when estimating the treatment impact from observational data. They incorporated a spatial logistic regression model and used an EM algorithm to handle maximum likelihood estimation. Hardouin (2019) presented a variational method for parameter estimation in a logistic spatial regression, since the expectations in the E-step of the EM algorithm were not available in closed form. Zhang et al. (2021) proposed a multivariate skew-elliptical link model for correlated binary responses, which included the multivariate probit model as a special case.
Intrinsically, inference for a logistic regression model involves a hidden unobserved process, and in all of the aforementioned studies this hidden process has been treated as a convenient Gaussian random field. Nevertheless, in a whole range of applications, non-Gaussianity of the latent component arises explicitly from spatial/spatiotemporal heterogeneities. Efforts to move beyond Gaussianity have therefore called for strategies that avoid some of the potential weaknesses associated with transformation methods. A review of the most recent studies on this topic is given by Tadayon and Torabi (2019) and Tadayon and Rasekh (2019). Mahmoudian (2018) argued that most previous skewed spatial models were ill-defined with respect to the consistency condition of the Kolmogorov existence theorem (Billingsley 2008), as their parametrization of the skewed distributions does not directly allow for an extension to a spatial random field model. Using the multivariate skew-normal distribution of Sahu et al. (2003) (SSN), Mahmoudian (2018) proposed a valid random field model with a skew structure to tackle non-Gaussian features and claimed that this random field is particularly convenient for computation. In addition, Mahmoudian (2018) noted that the skewness induced under this family is not mixed with the spatial correlations.
To the best of our knowledge, the literature on modeling skewness in the case of binary spatial data is very scarce (Hosseini et al. 2011; Afroughi 2015). Such a model is useful when the interest is to capture spatial dependence while avoiding the inefficient estimates that can result from manipulating the data. In this paper, we focus on implementing the valid flexible skew-Gaussian random field introduced by Mahmoudian (2018) to address both spatial dependence and (possible) skewness through a logistic regression model. The remainder of this paper is organized as follows. The following section introduces our proposed spatial logistic regression model based on a valid skew-Gaussian random field and explains our methodology for estimating the model parameters. An analysis of a synthetic dataset is described in Sect. 3. Section 4 analyzes the Meuse heavy metals dataset as an application of our methodology. Finally, the paper ends with some conclusions and final remarks (Sect. 5).
2 The Spatial Model
Logistic regression is a type of multiple regression model used to analyze the relationship between a binary outcome and independent variables. Let \(\mathbf{Z} \left( \mathbf{S}\right) = {\left( {Z\left( {{s_1}} \right) , \ldots ,Z\left( {{s_n}} \right) } \right) ^\mathrm{T}}\) be an observable vector of spatially dependent binary variables at locations \(\mathbf{S} = {\left( {{s_1}, \ldots ,{s_n}} \right) ^\mathrm{T} }\). In a hierarchical setting, it is conventional to model \(\mathbf{Z}\left( \mathbf{S}\right) \) as Bernoulli variables, whose means depend on an underlying spatial process \(\mathbf{Y}\left( \mathbf{S}\right) = {\left( {Y\left( {{s_1}} \right) , \ldots ,Y\left( {{s_n}} \right) } \right) ^\mathrm{T}}\) such that the \(Z\left( {s_i}\right) \) are conditionally independent, given the hidden process \(\mathbf{Y}\left( \mathbf{S}\right) \). Following Tadayon and Torabi (2022), the specific hierarchical model we investigate has the following representation
$$\begin{aligned} Z\left( {s_i}\right) \left| {Y\left( {s_i}\right) } \right.&\sim \mathrm{Bernoulli}\left( {p\left( {s_i}\right) } \right) ,\\ \log \frac{p\left( s\right) }{1 - p\left( s\right) }&= Y\left( s\right) = \mathbf{{x}}\left( s \right) ^\mathrm{T}{\varvec{\beta }} + \gamma W\left( s \right) + \varepsilon \left( s \right) , \end{aligned} \tag{1}$$
where \({\varvec{\beta }}\) is a vector of k unknown parameters with corresponding \(\mathbf{{x}}\left( s \right) \! =\!\) \({\left( {{x_1}\left( s \right) , \ldots ,{x_k}\left( s \right) } \right) ^\mathrm{T} }\) as a vector of known covariates that captures the large-scale spatial variation, \(\gamma \) is a scale parameter, \(W\left( \cdot \right) \) takes account of the non-Gaussian features through a valid skewed random field in a latent mode. Finally, the white noise error \(\varepsilon \left( \cdot \right) \sim \mathrm{N}\left( {0,{\tau ^2}} \right) \) is considered to be independent of \(W\left( \cdot \right) \). Evidently,
We consider \(\mathbf{{W}}\left( \mathbf{{S}} \right) = {\left( {W\left( {{s_1}} \right) , \ldots ,W\left( {{s_n}} \right) } \right) ^\mathrm{T} }\) as the SSN process
with the probability density function
mean \(\mathbf{0}\) and covariance matrix \(H + \left( {1 - {2}/{\pi }} \right) {\delta ^2}{I_n}\), where \(\mathbf{{w}} \in {{{\mathbb {R}}}^n}\), \(\Delta = {I_n} - {\delta ^2}{\left[ {H + {\delta ^2}{I_n}} \right] ^{ - 1}}\), \(\mathbf{1}_n\) denotes an \(n\times 1\) vector of ones and \(I_n\) is the identity matrix. \({\phi _n}\left( \cdot ;\mu ,\Sigma \right) \) and \({\Phi _n}\left( \cdot ;\mu ,\Sigma \right) \) represent the normal density and the normal cumulative distribution function of \(\mathrm{N}_n\left( {\mu ,\Sigma } \right) \), respectively. The second term in the covariance matrix can be viewed as a nugget effect in geostatistics. The exponential correlation function is chosen for the entries of H such that \({H_{ij}} = \exp \left\{ { - \left\| h \right\| /\psi } \right\} = \exp \left\{ { - {{\left\| {{s_i} - {s_j}} \right\| }}/{\psi }} \right\} \), where \(\psi \) is the range parameter. Therefore, the complete log likelihood function of \({\varvec{\eta }}= \left( {{\varvec{\beta }},\gamma ,{\tau ^2},\delta ,\psi } \right) ^\mathrm{T}\) is given by
where \(\left| {\cdot } \right| \) denotes the determinant. Since the likelihood function \(\ell \left( {\varvec{\eta }}\right) \) is analytically intractable, one can use a natural extension of the EM algorithm that employs Monte Carlo methods (MCEM algorithm) to estimate the model parameters \({\varvec{\eta }}\). For completeness, we recall that the EM algorithm operates on the so-called Q-function, which at the tth E-step is defined by
The M-step is to maximize Q with respect to \({\varvec{\eta }}\) to obtain \({{\varvec{\eta }}^{t + 1}} = \arg {\max _{{\varvec{\eta }}\in \Theta }}Q\left( {{\varvec{\eta }}\left| {{{\varvec{\eta }}^t}} \right. } \right) \), where \(\Theta \) is the parameter space. When the integral in equation (4) is analytically intractable or very high dimensional, the MCEM algorithm modifies the EM algorithm by computing the expectation in the E-step numerically through Monte Carlo simulation. By replacing the conditional expectations in (4) with the corresponding Monte Carlo approximations, we can write
and employ an optimization procedure to maximize \(Q\left( {{\varvec{\eta }}\left| {{{\varvec{\eta }}^t}} \right. } \right) \) with respect to \({\varvec{\eta }}\). These steps are repeated until convergence, with the convergence of the MCMC assessed through the Gelman–Rubin convergence diagnostics (Gelman and Rubin 1992). At the tth iteration of the MCEM algorithm, we need to calculate some conditional expectations of the form \({E_i}\left[ {\mathrm{{g}}\left( {\mathbf{{W}},{\varvec{\varepsilon }}} \right) \left| \mathbf{{Z}} \right. } \right] \), \( i\in \left\{ 1,\ldots ,7\right\} \) for some function \(\mathrm {g}\) of \(\mathbf{{W}}\) and \({\varvec{\varepsilon }}\). These conditional expectations, which are shown in Equation (A1) of the Appendix as an extended form of (4), may not have explicit forms and need to be replaced by their Monte Carlo approximations. We use the notation \({\mathbb {E}}_{i}^t\left( \cdot \right) \) to denote the corresponding approximation of the ith conditional expectation \({E}_{i}\left( \cdot \right) \) whenever it does not have a closed form. \({{\mathbb {E}}}_{i}^t\) can be calculated based on samples \(\left\{ {{\mathbf{{W}}^{\left( m \right) }},{{\varvec{\varepsilon }}^{\left( m \right) }}} \right\} _{m = 1}^M\) from the joint distribution \({f_{\mathbf{{W}},{\varvec{\varepsilon }}\left| \mathbf{{Z}},{\varvec{\eta }}\right. }}\) as the sample average
$$\begin{aligned} {{\mathbb {E}}}_{i}^t\left[ {\mathrm{{g}}\left( {\mathbf{{W}},{\varvec{\varepsilon }}} \right) \left| \mathbf{{Z}} \right. } \right] = \frac{1}{M}\sum \nolimits _{m = 1}^M {\mathrm{{g}}\left( {{\mathbf{{W}}^{\left( m \right) }},{{\varvec{\varepsilon }}^{\left( m \right) }}} \right) } . \end{aligned}$$
For details regarding the updates of the model parameters in the M-step, see the Appendix, in which a variational method is used to estimate the parameters. Generating samples from the joint distribution \({f_{\mathbf{{W}},{\varvec{\varepsilon }}\left| \mathbf{{Z}},{\varvec{\eta }}\right. }}\) requires an MCMC algorithm. To that end, we derive the full conditional distributions as follows.
-
\({\varvec{\varepsilon }}\left| {\mathbf{{Z}},\mathbf{{W}},{\varvec{\eta }}} \right. \): According to the details of the variational method described in the Appendix, we can write
$$\begin{aligned} f\left( {{\varvec{\varepsilon }}\left| {\mathbf{{Z}},\mathbf{{W}},{\varvec{\eta }}} \right. } \right)\propto & {} \exp \left\{ { - \frac{1}{{2{\tau ^2}}}\sum \nolimits _i {\varepsilon _i^2} + \sum \nolimits _i {{Z_i}{\varepsilon _i}} - \frac{1}{2}\sum \nolimits _i {{\varepsilon _i}} } \right. \\&\left. {\frac{}{} - \sum \nolimits _i {\lambda \left( {{\theta _i}} \right) \left( {\varepsilon _i^2 + 2\mathbf{{x}}_i^\mathrm{T} {\varvec{\beta }}{\varepsilon _i} + 2\gamma {W_i}{\varepsilon _i}} \right) } } \right\} , \end{aligned}$$
Therefore, the full conditional distribution of each \({\varepsilon _i}\) is approximately normal:
$$\begin{aligned} {\varepsilon _i}\left| {\mathbf{{Z}},\mathbf{{W}},{\varvec{\eta }}} \right. \mathop \simeq \limits ^d \mathrm{{N}}\left[ {\frac{{{Z_i} - 2\lambda \left( {{\theta _i}} \right) \left( {\mathbf{{x}}_i^\mathrm{T} {\varvec{\beta }}+ \gamma {W_i}} \right) - 0.5}}{{{\tau ^{ - 2}} + 2\lambda \left( {{\theta _i}} \right) }},\frac{1}{{{\tau ^{ - 2}} + 2\lambda \left( {{\theta _i}} \right) }}} \right] . \end{aligned}$$
-
\(\mathbf {W} \left| { \mathbf{{Z}},{\varvec{\varepsilon }},{\varvec{\eta }}} \right. \): Using the hierarchical representation of the SSN distribution in terms of a normal and a truncated normal distribution, we can rewrite \(\mathbf {W}\) as
$$\begin{aligned} \mathbf{{W}}\left| {\mathbf{{V}} = \mathbf{{v}}} \right. \sim {\mathrm{{N}}_n}\left[ {{{{\varvec{\mu }}_\mathbf{v}}},H} \right] ,\qquad {{\varvec{\mu }}_\mathbf{v}} = \delta \left( {{\mathbf{v}} - \sqrt{\frac{2}{\pi }} {\mathbf{{1}}_n}} \right) , \end{aligned}$$where \(\mathbf{{V}} \sim {\mathrm{{N}}_n}\left[ {\mathbf{{0}},{I_n}} \right] {\mathbf{{I}}_{\left\{ {{{\mathbb {R}}}_+^ n } \right\} }}\left( \mathbf{{V}} \right) \) and \({\mathbf{{I}}_{\left\{ \cdot \right\} }}\left( \cdot \right) \) denotes the indicator function. Therefore,
$$\begin{aligned} f\left( {\mathbf{{W}}\left| {\mathbf{{Z}},{\varvec{\varepsilon }}} \right. ,{\varvec{\eta }}} \right)\propto & {} f\left( {\mathbf{{Z}}\left| {\mathbf{{W}},{\varvec{\varepsilon }}} \right. ,{\varvec{\eta }}} \right) f\left( {\mathbf{{W}}\left| {\mathbf{{V}},{\varvec{\eta }}} \right. } \right) f\left( {\mathbf{{V}}} \right) \\\propto & {} \exp \left\{ { - \frac{1}{2}\gamma \sum \nolimits _i {{W_i}} - {\gamma ^2}\sum \nolimits _i {\lambda \left( {{\theta _i}} \right) W_i^2} - 2\gamma \sum \nolimits _i {\lambda \left( {{\theta _i}} \right) \mathbf{{x}}_i^\mathrm{T} {\varvec{\beta }}{W_i}} } \right. \\&\left. { \frac{}{}- 2\gamma \sum \nolimits _i {\lambda \left( {{\theta _i}} \right) {\varepsilon _i}{W_i}} + \gamma \sum \nolimits _i {{Z_i}{W_i}} } \right\} \\&\times \exp \left\{ { - \frac{1}{2}\left( {\mathbf{{W^\mathrm{T}}}{H^{ - 1}}{} \mathbf{{W}} - 2{\varvec{\mu }}_\mathbf{v}^\mathrm{T} {H^{ - 1}}{} \mathbf{{W}}} \right) } \right\} . \end{aligned}$$One can synthesize the above terms to obtain \(\mathbf{{W}}\left| {\mathbf{{Z}},\mathbf{{V}},{\varvec{\varepsilon }},{\varvec{\eta }}} \right. \mathop \simeq \limits ^d {\mathrm{{N}}_n}\left[ {{{\varvec{\mu }}_{\mathbf{{w}}\left| \cdot \right. }},{\Sigma _{\mathbf{{w}}\left| \cdot \right. }}} \right] \), where
$$\begin{aligned} {{\varvec{\mu }}_{\mathbf{{w}}\left| \cdot \right. }} = {\Sigma _{\mathbf{{w}}\left| \cdot \right. }} \left( { {H^{ - 1}}{\varvec{\mu }}_\mathbf{v} - \frac{{\mathcal{C}}}{2}} \right) ,\qquad {\Sigma _{\mathbf{{w}}\left| \cdot \right. }} = {\left( {{H^{ - 1}} + {{{\mathcal {D}}}}} \right) ^{ - 1}}, \end{aligned}$$in which \({{{\mathcal {C}}}}\) is an \(n\times 1\) vector with elements \({c_i} \!=\! \gamma \left( {1 + 4\lambda \left( {{\theta _i}} \right) \mathbf{{x}}_i^\mathrm{T} {\varvec{\beta }}+ 4\lambda \left( {{\theta _i}} \right) {\varepsilon _i} - 2{Z_i}} \right) \) and \({{{\mathcal {D}}}}\) is a diagonal matrix as \({{{\mathcal {D}}}} = 2{\gamma ^2}\mathrm{diag} \left( {\lambda \left( {{\theta _1}} \right) , \ldots ,\lambda \left( {{\theta _n}} \right) } \right) \).
-
\(\mathbf {V} \left| { \mathbf{{Z}},\mathbf {W},{\varvec{\varepsilon }},{\varvec{\eta }}} \right. \):
$$\begin{aligned} f\left( {\mathbf{{V}}\left| {\mathbf{{Z}},\mathbf{{W}},{\varvec{\varepsilon }}} \right. ,{\varvec{\eta }}} \right)\propto & {} f\left( {\mathbf{{W}}\left| {\mathbf{{V}},{\varvec{\eta }}} \right. } \right) f\left( \mathbf{{V}} \right) \\\propto & {} \exp \left\{ { - \frac{1}{2}\left( {{\varvec{\mu }}_\mathbf{v}^\mathrm{T} {H^{ - 1}}{{\varvec{\mu }}_\mathbf{v}} - 2{\varvec{\mu }}_\mathbf{v}^\mathrm{T} {H^{ - 1}}{} \mathbf{{W}} + \mathbf{{V^\mathrm{T}V}}} \right) } \right\} {\mathbf{{I}}_{\left\{ {{{\mathbb {R}}}_+^ n } \right\} }}\left( \mathbf{{V}} \right) , \end{aligned}$$
Substituting \({{\varvec{\mu }}_\mathbf{v}} = \delta \left( {\mathbf{V} - \sqrt{\frac{2}{\pi }} {\mathbf{{1}}_n}} \right) \) and collecting the terms in \(\mathbf{V}\) gives \(\mathbf{V} \left| { \mathbf{{Z}},\mathbf {W},{\varvec{\varepsilon }},{\varvec{\eta }}} \right. \sim {\mathrm{{N}}_n}\left[ {{\Sigma _{\mathbf{v}\left| \cdot \right. }}(\delta {H^{ - 1}}{} \mathbf{{W}} + {\delta ^2}\sqrt{\frac{2}{\pi }} {H^{ - 1}}{\mathbf{{1}}_n}),{\Sigma _{\mathbf{v}\left| \cdot \right. }}} \right] {\mathbf{{I}}_{\left\{ {{{\mathbb {R}}}_+^ n } \right\} }}\left( \mathbf{{V}} \right) \), where the covariance matrix can be written as \({\Sigma _{\mathbf{v}\left| \cdot \right. }} = {{({\delta ^2}{H^{ - 1}} + {I_n})}^{ - 1}}\).
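The update for each \({\varepsilon _i}\) and the Monte Carlo averaging used in the E-step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that \(\lambda \left( \cdot \right) \) is the Jaakkola and Jordan (2000) coefficient \(\tanh \left( {\theta /2} \right) /\left( {4\theta } \right) \), whose definition belongs to the Appendix and is not reproduced in this text, and the function names and numerical values are hypothetical.

```python
import math
import random


def jj_lambda(theta):
    """Assumed variational coefficient tanh(theta / 2) / (4 * theta)."""
    if abs(theta) < 1e-8:
        return 0.125  # limiting value as theta -> 0
    return math.tanh(theta / 2.0) / (4.0 * theta)


def eps_conditional(z, xb, w, theta, gamma, tau2):
    """Mean and variance of the approximate normal full conditional of eps_i.

    z is the binary response Z_i, xb is x_i^T beta and w is W_i.
    """
    lam = jj_lambda(theta)
    precision = 1.0 / tau2 + 2.0 * lam
    mean = (z - 2.0 * lam * (xb + gamma * w) - 0.5) / precision
    return mean, 1.0 / precision


def mc_expectation(g, draws):
    """Average g(W, eps) over the M draws to approximate E[g(W, eps) | Z]."""
    return sum(g(w, eps) for w, eps in draws) / len(draws)


# Hypothetical values, chosen only to exercise the functions.
rng = random.Random(0)
mean, var = eps_conditional(z=1, xb=0.3, w=-0.2, theta=1.0, gamma=1.5, tau2=0.5)
eps_draws = [rng.gauss(mean, math.sqrt(var)) for _ in range(100)]
# Pair each eps draw with a fixed W value to form toy (W, eps) samples.
draws = [(-0.2, e) for e in eps_draws]
second_moment = mc_expectation(lambda w, e: e * e, draws)
```

In a full sampler the draws of \(\mathbf{W}\) and \(\mathbf{V}\) would come from their own full conditionals, which require multivariate normal and truncated normal sampling and are omitted here.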
3 Analysis of a Synthetic Dataset
We now assess the performance of the proposed model using a synthetic dataset and compare our results with those obtained by applying the model presented in Hardouin (2019). Thus, the contribution of this section is twofold. First, the performance of the presented model in estimating the parameters is evaluated using the response variable generated from model (1) (using algorithm 1), and the effect of sample size (the number of spatial locations) on model performance is examined. Second, the results are compared with those of its competitor. All computations were performed using the publicly available statistical software R.
To address our goals, we use algorithm 1 to generate spatially correlated binary data \(Z\left( {{s_i}} \right) \) with \(E\left[ {Z\left( {{s_i}} \right) } \right] = p\left( {{s_i}} \right) \) and \(\rho \left[ {Z\left( {{s_i}} \right) ,Z\left( {{s_j}} \right) } \right] ={H_{ij}} = \exp \left\{ { - \left\| h \right\| /\psi } \right\} \). We ran three distinct simulations, each with \({{{\mathcal {R}}}}=500\) generated datasets, for sample sizes of \(n=200, 400\) and 800 observations, respectively. For all three simulations, we set \(M=100\). In each simulation study, the sites are uniformly distributed over the region \(\left( {0,10} \right) \times \left( {0,10} \right) \). The data were simulated from model (1) with \({x_i} \sim N\left( {0,1} \right) \); the true values of the model parameters are shown in Table 1, which also summarizes the results. Notice that choosing \(\psi =3.5\) in each simulation yields approximate values of 0.02 and 0.99 for \(\exp \left\{ { - \left\| h \right\| /\psi } \right\} \), the minimum and maximum dependencies under the exponential correlation function, which correspond to the largest and smallest distances between the selected locations, respectively.
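The simulation design described above can be sketched as follows. This sketch does not reproduce algorithm 1, which is not shown in this text. It assumes that model (1) has the linear predictor \(\beta _0 + \beta _1 x + \gamma W + \varepsilon \) with a logistic link, consistent with the full conditionals of Sect. 2, and it draws \(\mathbf{W}\) through the normal/truncated-normal representation given there. All parameter values are placeholders rather than the values of Table 1, and the function names are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def simulate_binary_field(n, beta0, beta1, gamma, delta, tau2, psi, rng):
    """Simulate sites, the correlation matrix H, a covariate x and responses z."""
    # Sites uniform on (0, 10) x (0, 10), as in the simulation design.
    sites = [(rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0)) for _ in range(n)]
    # Exponential correlation H_ij = exp(-||s_i - s_j|| / psi).
    h = [[math.exp(-math.dist(a, b) / psi) for b in sites] for a in sites]
    low = cholesky(h)
    u = [rng.gauss(0.0, 1.0) for _ in range(n)]
    v = [abs(rng.gauss(0.0, 1.0)) for _ in range(n)]  # N(0, 1) truncated to R+
    shift = math.sqrt(2.0 / math.pi)
    # W | V ~ N(delta * (V - sqrt(2 / pi)), H), using the Cholesky factor of H.
    w = [delta * (v[i] - shift) + sum(low[i][k] * u[k] for k in range(i + 1))
         for i in range(n)]
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = []
    for i in range(n):
        y = beta0 + beta1 * x[i] + gamma * w[i] + rng.gauss(0.0, math.sqrt(tau2))
        p = 1.0 / (1.0 + math.exp(-y))
        z.append(1 if rng.random() < p else 0)
    return sites, h, x, z


# Placeholder parameter values; only psi = 3.5 is taken from the text.
sites, h, x, z = simulate_binary_field(
    n=50, beta0=0.5, beta1=1.0, gamma=1.0, delta=2.0, tau2=0.5, psi=3.5,
    rng=random.Random(2023))
```

A small n keeps the pure-Python Cholesky factorization fast; for the sample sizes used in the study a numerical linear algebra library would be the natural choice.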
Specifically, Table 1 reports the bias criterion for an arbitrary parameter, say \(\vartheta \), as
and also the empirical variance of each estimate as
to assess the performance of the proposed methodology, where \({\bar{\vartheta }} = {{{{\mathcal {R}}}}^{ - 1}}\sum \nolimits _{r = 1}^{{{\mathcal {R}}}} {{{\hat{\vartheta }}^{\left( r \right) }}}\). It is worth mentioning that in the approach of Hardouin (2019) spatial variation is captured through the term \(\varepsilon \left( \cdot \right) \) with the same exponential correlation function, i.e., \({\varvec{\varepsilon }}\sim {\mathrm{N}}\left( {\mathbf{0},H^*} \right) \), where \({H_{ij}^*} = {\tau ^2}\exp \left\{ { - \left\| h \right\| /\psi } \right\} \). The resulting Akaike information criterion (AIC) values were used to compare model performance. This benchmark, a widely used criterion for model assessment in the literature, is calculated as \(\mathrm{AIC}= 2[\#\mathrm{model~parameters}-{{\varvec{\ell }}_{y,\lambda }^{~\cdot }}]\). The AIC values corroborate the better performance of the proposed model compared to its competitor. Note that Table 1 compares parameter estimates for data generated from the proposed model with those of the misspecified model considered in Hardouin (2019), so it is expected that parameter estimates will be biased for the misspecified model. To address this issue, we also assess the suggested strategy through its prediction error. In classification problems, prediction error is commonly defined as the probability of an incorrect classification, also called the misclassification rate (MCR); for binary predictions this coincides with the mean squared prediction error (MSPE), since \({( {\widehat{{Z_i}} - {Z_i}} )^2} = I\left( {\widehat{{Z_i}} \ne {Z_i}} \right) \). To compute MCR, we randomly hold out n/10 observations from each simulated dataset; MCR is then calculated as \({\mathrm{MCR}} = {\left( n/10\right) ^{ - 1}}\sum \nolimits _{i = 1}^{n/10} {I\left( {\widehat{{Z_i}} \ne {Z_i}} \right) }\), where \(I\left( \cdot \right) \) is the indicator function that is equal to one when its input is true. The results, reported as percentages in Table 1, show lower MCRs for the suggested model and also show that MCRs decrease as the sample size increases.
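The performance measures discussed above can be sketched as below. The MCR and AIC functions follow the expressions given in the text. The exact bias and empirical variance formulas are not reproduced here, so the sketch assumes the usual definitions (mean estimate minus true value, and mean squared deviation from \({\bar{\vartheta }}\) with divisor \({{{\mathcal {R}}}}\)); these choices and all function names are assumptions.

```python
def bias(estimates, truth):
    """Assumed form: average estimate over replicates minus the true value."""
    return sum(estimates) / len(estimates) - truth


def empirical_variance(estimates):
    """Assumed form: mean squared deviation from the replicate mean (divisor R)."""
    mean = sum(estimates) / len(estimates)
    return sum((e - mean) ** 2 for e in estimates) / len(estimates)


def misclassification_rate(z_pred, z_true):
    """Share of held-out sites whose predicted label differs from the truth."""
    return sum(p != t for p, t in zip(z_pred, z_true)) / len(z_true)


def aic(n_params, loglik):
    """AIC = 2 * (number of model parameters - log-likelihood)."""
    return 2.0 * (n_params - loglik)


# Toy inputs, for illustration only.
toy_bias = bias([1.0, 2.0, 3.0], truth=1.5)
toy_var = empirical_variance([1.0, 2.0, 3.0])
toy_mcr = misclassification_rate([1, 0, 1, 1], [1, 1, 1, 0])
toy_aic = aic(6, -100.0)
```

For 0/1 predictions, `misclassification_rate` returns the same number as the mean of the squared prediction errors.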
To evaluate the performance of the proposed model under a different scale of spatial dependence, we ran another simulation with \(n=800\). In this case, we fixed the parameters \(\beta _0, \beta _1, \gamma , \delta \) and \(\tau ^2\) at the values considered in Table 1 and only changed \(\psi \) to \(\psi =0.1\), which allows the spatial dependency to vary from almost 0 to 0.5. We are aware that in practice this value should be chosen according to the autocorrelation function relative to the size of the domain; however, it has been chosen here to assess the performance of the proposed model in the case of low spatial dependence. The results, presented in Table 2, can readily be compared with the corresponding part of Table 1. Clearly, the results substantiate the stability of the performance (bias, empirical variance and MCR) of the proposed model in the case of low spatial dependence.
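The effect of the range parameter on the attainable correlations follows directly from the exponential correlation function. The short sketch below, with hypothetical function names, evaluates the correlation at the diagonal of the \(\left( {0,10} \right) \times \left( {0,10} \right) \) region and inverts the function to find the separation at which a given correlation is reached.

```python
import math


def exp_corr(distance, psi):
    """Exponential correlation exp(-distance / psi)."""
    return math.exp(-distance / psi)


def distance_at(rho, psi):
    """Separation at which the exponential correlation equals rho."""
    return -psi * math.log(rho)


max_dist = math.hypot(10.0, 10.0)           # diagonal of the square region
floor_strong = exp_corr(max_dist, 3.5)      # smallest possible value, psi = 3.5
floor_weak = exp_corr(max_dist, 0.1)        # smallest possible value, psi = 0.1
half_corr_dist = distance_at(0.5, 0.1)      # equals 0.1 * ln(2)
near_one_dist = distance_at(0.99, 3.5)      # separation giving 0.99, psi = 3.5
```

Under \(\psi = 0.1\), a correlation of 0.5 requires two sites to lie within \(0.1\ln 2 \approx 0.07\) units of each other, which illustrates why this setting represents low spatial dependence.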
Finally, since identifying the nugget effect components (\({\tau ^2}\) and \(\delta \)) may be challenging, we discuss to what extent information about these parameters can be recovered from data. To assess the identifiability of each of these parameters, say \(\delta \), three datasets (of size 200) were generated from the proposed model with different values of \(\delta \) (and fixed values for the other parameters, as described in Table 1), and the estimated values were then obtained. The same applies to inference on \({\tau ^2}\). Table 3 indicates that the data allow for meaningful inference on the model’s nugget effect components.
4 Application: The Meuse Heavy Metals Data
In this section, we illustrate our proposed methodology using a well-known real dataset from the spatial statistics literature. The Meuse dataset, which has been documented in detail by Rikken and Van Rijn (1993) and Burrough and McDonnell (1998) and has been studied frequently in geostatistical research, comprises heavy metal measurements in the topsoil of a flood plain along the Meuse River west of the municipality of Stein, Limburg, the Netherlands. The dataset is available in the R package sp and can be loaded with the data function as data(meuse). The data consist of 155 soil samples, each a composite taken from an area of approximately \(15\,\mathrm{m} \times 15\,\mathrm{m}\), which were analyzed for their concentration of toxic heavy metals (zinc, lead, copper, and cadmium) in ppm. Figure 1 depicts a schematic description of the region and sampling locations.
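We chose the binary variable lime as our response of interest Z. To find the covariates most related to this response, we used the corrected AIC (AICC) introduced by Hoeting et al. (2006) for geostatistical model selection. With \(\hat{\ell }\) denoting the maximized log-likelihood, AICC is given by
$$\begin{aligned} \mathrm{AICC} = - 2\hat{\ell } + \frac{{2n\left( {p + k + 1} \right) }}{{n - p - k - 2}}, \end{aligned}$$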
where p is the number of regression coefficients including an intercept term, k is the number of parameters associated with the autocorrelation function and n is the number of observed sites. Considering the four variables zinc, lead, copper, and cadmium as potential covariates, we examined all \(2^4 -1\) candidate models; an overall assessment (not presented here) resulted in a model with the two covariates lead and zinc. Although the simulation study showed that parameter estimation does not depend on the initial values, we use the estimates of an ordinary GLM as initial values for the regression coefficients, i.e., \(\beta _0=-3.34, \beta _{lead}=-0.03\) and \(\beta _{zinc}=0.01\). From \(p(s) = {e^{Y(s)}}/\left( {1 + {e^{Y(s)}}} \right) \) and Equation (1) we can write
where \(\widehat{p\left( s \right) } = {{\bar{Z}}} = 0.284\). Now, we define \(V\left( s \right) = \gamma W\left( s \right) + \varepsilon \left( s \right) \), which is approximately given by \(V\left( s \right) = 2.422 + 0.038\,\textit{lead}\left( s \right) - 0.017\,\textit{zinc}\left( s \right) \). We can see that \(V\left( s \right) \) is a member of the SSN family (2). The Q-Q plot and histogram of \(V\left( s \right) \) are shown in Fig. 2. As a simple exploratory data analysis, the histogram shows a non-Gaussian feature, which supports the suitability of the proposed model based on the above-mentioned skew random field. The empirical semi-variogram of \(V\left( s \right) \) is plotted in Fig. 3. The best-fitting model was exponential with parameters \(\textit{nugget effect} = 1.07, \textit{sill} = 6.60\) and \(\textit{range} = 367.24\). Table 4 displays the model parameter estimates and the corresponding standard errors for the proposed model, together with the MCRs of both competing models. The presented estimates and MCRs were calculated as means over \(\mathcal{R}=20\) runs of the program.
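The constant in the expression for \(V\left( s \right) \) can be checked with a short calculation. The sketch below assumes that \(V\left( s \right) \) is obtained as the logit of \(\widehat{p\left( s \right) }\) minus the regression part \(\mathbf{x}\left( s \right) ^\mathrm{T}{\varvec{\beta }}\) evaluated at the GLM initial values quoted above; the function names and the example covariate values are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def latent_residual(p_hat, beta0, beta_lead, beta_zinc, lead, zinc):
    """Assumed form of V(s): logit(p_hat) minus the regression part x(s)^T beta."""
    return logit(p_hat) - (beta0 + beta_lead * lead + beta_zinc * zinc)


# Values quoted in the text: p_hat = 0.284 and the GLM initial estimates.
intercept = logit(0.284) - (-3.34)

# Hypothetical covariate values for a single site, for illustration only.
v_site = latent_residual(0.284, -3.34, -0.03, 0.01, lead=100.0, zinc=300.0)
```

With these rounded inputs the constant is about 2.415 and the slopes are 0.03 and -0.01, close to the reported 2.422, 0.038 and -0.017; the small gaps presumably arise because the coefficients quoted in the text are shortened to two decimals.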
5 Conclusion
The present study concentrated on implementing a valid flexible skew-Gaussian random field, based on the skew-normal family introduced by Sahu et al. (2003), to capture both spatial dependence and (possible) skewness through a logistic regression model. Because directly maximizing the likelihood function of the observed data is intractable, a Monte Carlo extension of the EM algorithm was developed to compute the maximum likelihood estimates of the model parameters. Moreover, a simulation study was conducted to assess the performance of the proposed model and to investigate the effect of sample size on the results. Finally, a real data application regarding the presence of lime in the topsoil along the Meuse River was analyzed, in which the concentrations of the toxic heavy metals zinc and lead were considered as two covariates.
Overall, the proposed model adds flexibility to the class of spatial logistic regression models often considered in the literature to account for binary spatial data. It must be mentioned that, in the spatial context, the asymptotic properties of parameter estimators strongly depend on the asymptotic regime that is considered. Specifically, two regimes can be considered: first, the spatial domain is fixed and bounded and the density of the sampling locations increases with n (the fixed/infill domain); second, the spatial domain of observation is unbounded and grows in size with the sample size n (the increasing domain framework). Whereas under the latter regime the maximum likelihood estimators are consistent and asymptotically normal, subject to some regularity conditions (see, for example, Mardia and Marshall 1984), under the former analogous results do not hold and model parameters cannot be consistently estimated. Besides, in the suggested approach the latent factors are independent across locations, so that mixing conditions are satisfied. However, in the general case, replicates are required to obtain consistent estimates even if the number of locations is large.
An interesting extension of this work is to investigate how the variance process can be allowed to depend on covariates, which opens up an opportunity to interpret the tail behavior of the process as a function of known covariates. Another step forward is to let this covariance-covariate dependence change in time. On the other hand, in the last decade, with the wide usage of mobile applications and Global Positioning System (GPS) devices, as well as the advancement of remote sensors accompanied by cheap data storage and computational devices, large amounts of geo-referenced data are being collected. As a result, there has been growing enthusiasm for modeling spatial big data. A third interesting extension of this work is to scale the proposed model to big data. Moreover, in this study the exponential correlation function was chosen, although this choice could affect the smoothness of the process. One can choose a more flexible spatial correlation structure and compare the results. We plan to study these approaches in future work.
5.1 Supplementary Material
Supplementary materials contain R code for the simulations and the real data application conducted in this paper.
References
Afroughi S (2015) Bayesian inference of spatially correlated binary data using skew-normal latent variables with application in tooth caries analysis. Open J Stat 5:127–139
Billingsley P (2008) Probability and measure. Wiley, Hoboken
Burrough PA, McDonnell RA (1998) Principles of geographical information systems: spatial information systems and geostatistics
Chang W, Haran M, Applegate P, Pollard D (2016) Calibrating an ice sheet model using high-dimensional binary spatial data. J Am Stat Assoc 111(513):57–72
Diggle PJ, Giorgi E (2016) Model-based geostatistics for prevalence mapping in low-resource settings. J Am Stat Assoc 111(515):1096–1120
Gelman A, Rubin DB (1992) Inference from iterative simulation using multiple sequences. Stat Sci 7(4):457–472
Hardouin C (2019) A variational method for parameter estimation in a logistic spatial regression. Spatial Stat 31(1):1–45
Hoeting JA, Davis RA, Merton AA, Thompson SE (2006) Model selection for geostatistical models. Ecol Appl 16(1):87–98
Hosseini F, Eidsvik J, Mohammadzadeh M (2011) Approximate Bayesian inference in spatial GLMM with skew normal latent variables. Comput Stat Data Anal 55(4):1791–1806
Jaakkola TS, Jordan MI (2000) Bayesian parameter estimation via variational methods. Stat Comput 10(1):25–37
Lin P-S, Clayton MK et al (2005) Analysis of binary spatial data by quasi-likelihood estimating equations. Ann Stat 33(2):542–555
Mahmoudian B (2018) On the existence of some skew-Gaussian random field models. Stat Prob Lett 137:331–335
Mardia KV, Marshall RJ (1984) Maximum likelihood estimation of models for residual covariance in spatial regression. Biometrika 71(1):135–146
Nisa H, Mitakda MB, Astutik S, et al. (2019) Estimation of propensity score using spatial logistic regression. In: IOP conference series: materials science and engineering, volume 546, page 052048. IOP Publishing
Paciorek CJ (2007) Computational techniques for spatial logistic regression with large data sets. Comput Stat Data Anal 51(8):3631–3653
Rikken M, Van Rijn R (1993) Soil pollution with heavy metals: an inquiry into spatial variation, cost of mapping and the risk evaluation of Copper, Cadmium, Lead and Zinc in the floodplains of the Meuse West of Stein, The Netherlands: field study report. University of Utrecht, Utrecht
Sahu SK, Dey DK, Branco MD (2003) A new class of multivariate skew distributions with applications to Bayesian regression models. Canadian J Stat 31(2):129–150
Sengupta A, Cressie N, Kahn BH, Frey R (2016) Predictive inference for big, spatial, non-Gaussian data: MODIS cloud data and its change-of-support. Aust New Zealand J Stat 58(1):15–45
Tadayon V, Rasekh A (2019) Non-Gaussian covariate-dependent spatial measurement error model for analyzing big spatial data. J Agric Biol Environ Stat 24(1):49–72
Tadayon V, Torabi M (2019) Spatial models for non-Gaussian data with covariate measurement error. Environmetrics 30(3):e2545
Tadayon V, Torabi M (2022) Sampling strategies for proportion and rate estimation in a spatially correlated population. Spatial Stat 47:100564
Tayyebi A, Delavar MR, Yazdanpanah MJ, Pijanowski BC, Saeedi S, Tayyebi AH (2010) A spatial logistic regression model for simulating land use patterns: a case study of the Shiraz metropolitan area of Iran. Advances in earth observation of global change. Springer, Berlin, pp 27–42
Wu W, Zhang L (2013) Comparison of spatial and non-spatial logistic regression models for modeling the occurrence of cloud cover in north-eastern Puerto Rico. Appl Geogr 37:52–62
Xie C, Huang B, Claramunt C, Chandramouli C (2005) Spatial logistic regression and GIS to model rural-urban land conversion. In: Proceedings of PROCESSUS Second International Colloquium on the Behavioural Foundations of Integrated Land-use and Transportation Models: frameworks, models and applications, pages 12–15. University of Toronto
Zhang Z, Arellano-Valle RB, Genton MG, Huser R (2021) Tractable Bayes of skew-elliptical link models for correlated binary data. arXiv preprint arXiv:2101.02233
Zhu J, Huang H-C, Wu J (2005) Modeling spatial-temporal binary data using Markov random fields. J Agric Biol Environ Stat 10(2):212
Zhu J, Zheng Y, Carroll AL, Aukema BH (2008) Autologistic regression analysis of spatial-temporal binary data via Monte Carlo maximum likelihood. J Agric Biol Environ Stat 13(1):84–98
Acknowledgements
We would like to thank the Associate Editor and two reviewers for the constructive comments and suggestions, which led to an improved version of this paper.
Ethics declarations
Author Contribution
VT conceived the study, conceptualized the review, and reviewed and revised the manuscript. The first author also performed the simulation study, fitted the model to the real data, and documented the whole manuscript. MMS played the major role in the theoretical part of the modeling and identified an appropriate real dataset. The second author also carried out the exploratory data analysis of the real data and provided some parts of the R functions.
Appendix
In what follows, we use the notation \({{{\vartheta }}_i}\) to denote \({{\vartheta }}\left( {{s_i}} \right) \). Equation (4) can be written as
in which the fourth line follows from the properties \(x^\mathrm{T}Ax = \mathrm{trace}\left( {x^\mathrm{T}Ax} \right) \) and \(\mathrm{trace}\left( {AB} \right) = \mathrm{trace}\left( {BA} \right) \). Closer scrutiny shows, however, that one of the main problematic terms of (A1) is \(\sum \nolimits _i {{E_1}\left[ {\left. {\ln \left( {1 + {e^{{Y_i}}}} \right) } \right| \mathbf{{Z}}} \right] }\). Hardouin (2019) proposed a variational method based on replacing this term by an initial approximation of the logistic function \(\kappa \left( x \right) = {e^x}/ \left( {1 + {e^x}} \right) = 1/ \left( {1 + {e^{ - x}}} \right) \), which had been studied by Jaakkola and Jordan (2000) as
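As a point of reference, the Jaakkola and Jordan (2000) bound is commonly stated as \(\ln \kappa \left( x \right) \ge \ln \kappa \left( \theta \right) + \left( {x - \theta } \right) /2 - \lambda \left( \theta \right) \left( {{x^2} - {\theta ^2}} \right) \) with \(\lambda \left( \theta \right) = \tanh \left( {\theta /2} \right) /\left( {4\theta } \right) \), holding with equality at \(x = \pm \theta \). The symbol \(\lambda \) and this particular arrangement are assumptions made here for illustration and may differ from the display used in the paper. A minimal Python sketch, with illustrative function names that do not come from the paper, checks the inequality numerically:

```python
import numpy as np

def log_logistic(x):
    """ln kappa(x) = -ln(1 + exp(-x)), evaluated in a numerically stable way."""
    return -np.logaddexp(0.0, -x)

def jj_lambda(theta):
    """lambda(theta) = tanh(theta/2) / (4*theta), with its limit 1/8 at theta = 0."""
    theta = np.asarray(theta, dtype=float)
    safe = np.where(theta == 0.0, 1.0, theta)  # avoid 0/0; value is overwritten below
    return np.where(theta == 0.0, 0.125, np.tanh(safe / 2.0) / (4.0 * safe))

def jj_lower_bound(x, theta):
    """Lower bound on ln kappa(x) that is quadratic in x (assumed standard form)."""
    return log_logistic(theta) + (x - theta) / 2.0 - jj_lambda(theta) * (x**2 - theta**2)

# Check the inequality on a grid for a few values of the variational parameter.
x = np.linspace(-8.0, 8.0, 401)
for theta in (0.5, 1.0, 3.0):
    gap = log_logistic(x) - jj_lower_bound(x, theta)
    assert np.all(gap >= -1e-12)  # bound never exceeds ln kappa(x)
    # The bound touches ln kappa at x = theta and x = -theta.
    assert abs(log_logistic(theta) - jj_lower_bound(theta, theta)) < 1e-12
    assert abs(log_logistic(-theta) - jj_lower_bound(-theta, theta)) < 1e-12
```

Because the bound is quadratic in \(x\), it trades the intractable logarithmic term for first and second moments, at the price of one extra parameter \(\theta \) per observation.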
This variational lower bound involves the model parameters and the so-called variational parameter \(\theta \). Letting \(\Theta = {\left( {{\theta _1}, \ldots ,{\theta _n}} \right) ^\mathrm{T} }\), we apply this lower bound to \( - \sum \nolimits _i {\ln \left( {1 + {e^{{Y_i}}}} \right) } = \sum \nolimits _i {\ln \kappa \left( { - {Y_i}} \right) }\), the first term of (3). Therefore,
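Assuming the bound takes the standard Jaakkola and Jordan (2000) form with \(\lambda \left( \theta \right) = \tanh \left( {\theta /2} \right) /\left( {4\theta } \right) \) (notation introduced here, not taken from the paper), substituting \(x = - {Y_i}\) and \(\theta = {\theta _i}\) and summing over \(i\) would give an inequality of the following shape:

```latex
\[
 - \sum_i \ln\left( 1 + e^{Y_i} \right)
 = \sum_i \ln \kappa\left( -Y_i \right)
 \ge \sum_i \left\{ \ln \kappa\left( \theta_i \right)
 - \frac{Y_i + \theta_i}{2}
 - \lambda\left( \theta_i \right)\left( Y_i^2 - \theta_i^2 \right) \right\}
\]
```

The right-hand side is quadratic in \({Y_i}\), so its conditional expectation given \(\mathbf{Z}\) requires only the first and second conditional moments of \({Y_i}\).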
By the monotonicity of expectation, taking the conditional expectation of (A2) given \(\mathbf{Z}\) preserves the inequality. We can therefore write \(Q\left( {{{{\varvec{\eta }}}}\left| {{{{{\varvec{\eta }}}}^t}} \right. } \right) \ge {{\tilde{Q}}}\left( {{{{\varvec{\eta }}},\Theta }\left| {{{{{\varvec{\eta }}}}^t}},\Theta ^t \right. } \right) \), where \({{\tilde{Q}}}\) is obtained by replacing the first term of Q with the expectation of the right-hand side of (A2) given \(\mathbf{Z}\). This replacement eliminates \({E_1}\left[ {\left. {\ln \left( {1 + \exp \left\{ {{Y_i}} \right\} } \right) } \right| \mathbf{{Z}}} \right] \) and brings \({E_8}\left[ {{W_i^2}\left| \mathbf{{Z}} \right. } \right] \) and \({E_9}\left[ {{\varepsilon _i^2}\left| \mathbf{{Z}} \right. } \right] \) into the inference. We then use a two-stage estimation procedure in the M-step. In the first stage, \({{\tilde{Q}}}\left( {{{{\varvec{\eta }}},\Theta }\left| {{{{{\varvec{\eta }}}}^t}},\Theta ^t \right. } \right) \) is maximized with respect to the model parameters for fixed \(\Theta \), which yields \({{\tilde{Q}}}\left( {{{{\varvec{\eta }}}^{t+1},\Theta }\left| {{{{{\varvec{\eta }}}}^t},\Theta ^t} \right. } \right) \). In the second stage, the updated variational parameters \(\Theta ^{t+1}\) are obtained by maximizing \({{\tilde{Q}}}\left( {{{{\varvec{\eta }}}^{t+1},\Theta }\left| {{{{{\varvec{\eta }}}}^t},\Theta ^t} \right. } \right) \) with respect to \(\Theta \). The updates of the model parameters are as follows: \({\tau ^{{2^{t + 1}}}} = {n^{ - 1}} {{\mathbb {E}}}_4^t\), and \({\varvec{\beta }}^{t+1}\) can be easily obtained as the solution of the system of linear equations
in which the left-hand side can be rewritten as \(\left[ {\sum \nolimits _i {{\mathbf{{x}}_i}{} \mathbf{{x}}_i^\mathrm{T} } } \right] {{\varvec{\beta }}^{t + 1}}\), then,
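Since the coefficient matrix is \(\sum \nolimits _i {{\mathbf{{x}}_i}{} \mathbf{{x}}_i^\mathrm{T} }\), which equals \({X^\mathrm{T}}X\) when the rows of \(X\) are the \(\mathbf{{x}}_i^\mathrm{T}\), the update for \({\varvec{\beta }}^{t+1}\) amounts to solving a linear system. The Python sketch below illustrates only that step. The names `update_beta` and `rhs` are hypothetical, and the right-hand side is a synthetic stand-in rather than the expression derived in the paper:

```python
import numpy as np

def update_beta(X, rhs):
    """Solve (sum_i x_i x_i^T) beta = rhs, where the rows of X are the x_i^T."""
    gram = X.T @ X  # equals sum_i x_i x_i^T
    # Solving the system directly is preferable to forming an explicit inverse.
    return np.linalg.solve(gram, rhs)

# Synthetic illustration: build a right-hand side from a known beta and recover it.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 sites, 3 covariates (made-up data)
beta_true = np.array([1.0, -2.0, 0.5])
rhs = X.T @ X @ beta_true             # placeholder for the paper's right-hand side
beta_new = update_beta(X, rhs)
```

In an actual implementation the vector `rhs` would be assembled from the conditional expectations computed in the E-step; the solve itself is unchanged.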
Moreover,
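For the second stage, a bound of the standard Jaakkola and Jordan (2000) form has a well-known property: its \({\theta _i}\)-dependent part, \(\ln \kappa \left( {{\theta _i}} \right) - {\theta _i}/2 - \lambda \left( {{\theta _i}} \right) \left( {E\left[ {Y_i^2\left| \mathbf{{Z}} \right. } \right] - \theta _i^2} \right) \) with \(\lambda \left( \theta \right) = \tanh \left( {\theta /2} \right) /\left( {4\theta } \right) \), is maximized at \(\theta _i^2 = E\left[ {Y_i^2\left| \mathbf{{Z}} \right. } \right] \). Whether the paper's update has exactly this form is an assumption here. The Python sketch below, with hypothetical names and a made-up moment value, compares a grid search with this closed form:

```python
import numpy as np

def jj_lambda(theta):
    """lambda(theta) = tanh(theta/2) / (4*theta), for theta > 0."""
    return np.tanh(theta / 2.0) / (4.0 * theta)

def theta_objective(theta, m2):
    """theta-dependent part of the bound for one site; m2 stands for E[Y_i^2 | Z]."""
    log_kappa = -np.logaddexp(0.0, -theta)  # ln kappa(theta)
    return log_kappa - theta / 2.0 - jj_lambda(theta) * (m2 - theta**2)

m2 = 2.7  # hypothetical Monte Carlo estimate of E[Y_i^2 | Z]
grid = np.linspace(0.01, 6.0, 60000)
theta_grid = grid[np.argmax(theta_objective(grid, m2))]  # brute-force maximizer
theta_closed = np.sqrt(m2)                               # closed-form maximizer
```

The two values agree up to the grid resolution, which suggests that, under this assumed form, the variational update needs only the second conditional moment of \({Y_i}\).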
About this article
Cite this article
Tadayon, V., Saber, M.M. A Spatial Logistic Regression Model Based on a Valid Skew-Gaussian Latent Field. JABES 28, 59–73 (2023). https://doi.org/10.1007/s13253-022-00512-3