Introduction

Land use change and urban development are two research areas that attract broad attention, since both can have significant ecological impacts on the environment. Urban development, however, is a special type of land use change: the conversion of mainly agricultural and forested land to residential, commercial-industrial, and recreational uses in cities and along their edges. Its driving forces and radiating impact distinguish it from other kinds of land use change. The major driving forces of urban development include cultural, social, and economic factors, which make the processes of urban development very complex (Cheng and Masser 2003; Fang et al. 2005; Gimblett et al. 2001; Ligtenberg et al. 2001; Rusk 1995; Weber 2003). Although urbanization accounts for a small proportion of the total land use change on the earth’s land surface (Grübler 1994), it can cause larger changes in surrounding environmental conditions than other land use changes (Folke et al. 1997; Heilig 1994; Lambin et al. 2001). Although the two areas can be distinguished for these reasons, researchers in both ultimately study the conversion of land use. They can therefore share techniques, such as those for estimating conversion probabilities and mapping conversions. Logistic regression is one such shared technique.

Logistic regression is a common method for building models that predict the probabilities of categorical variables (responses or events) from numerical (continuous and discrete) and categorical variables (Agresti 2002; Hosmer and Lemeshow 2000; McCullagh 1980). It is also widely used to predict the conversion probabilities of both land use change (Geoghegan et al. 2001; Serneels and Lambin 2001; Verburg et al. 2002) and urban development (Cheng and Masser 2003; Fang et al. 2005; Wu 2002). In studies of land use and urban development, logistic regression is usually used to fit a probability model to sampled data. The conversion probability of land use change and urbanization across a study area can then be predicted by applying the fitted logistic models to the attribute maps. The resulting probability maps can be used to indicate the “hot spots” where change is most likely to occur over a certain duration (Fang et al. 2005).

To date, logistic regression has mainly been used to predict the binomial probability of two responses (in general, “yes” or “no”) of a dependent categorical variable (event) in studies of land use change and urban development (for example, Geoghegan et al. 2001; Verburg et al. 2002; Wu 2002). However, the conversion involved in land use change and urbanization often follows a multinomial distribution (Turner et al. 1996; Wear et al. 1998). The challenge of multinomial probability prediction is that the predicted probabilities of all responses must sum to one. Usually, multinomial probabilities are predicted using a set of logistic models whose dependent variables are based on ratios of the probabilities of paired responses [\(\log(P_i/P_j) = AX\), \(i \ne j\), where A and X are vectors of coefficients and independent variables, respectively], and the models are adjusted after their coefficients are estimated so that the multinomial probabilities are consistent (i.e., sum to one) (Allison 1999; Chomitz and Gray 1996; McCullagh 1980).

This approach to estimating multinomial probability has its limits. First, the adjustment made to reach consistency of the multinomial probabilities assumes independence among the multiple responses (Chomitz and Gray 1996). This assumption does not hold in most situations. Furthermore, when each response has its own unique explanatory variables (Deal et al. 2002), the pair-response-based logistic models and the resulting models (Allison 1999) may make no logical sense and can cause confusion, since the resulting model for a specific response contains explanatory variables that are not defined for that response. Another problem in this situation is over-parameterization, i.e., too many explanatory variables may be included in a model even though some are not significant for prediction.

Finally, in model development and calibration, there is a need for screening techniques, i.e., to select significant independent variables of the logistic models from a number of candidate variables. Although screening techniques are available for binomial logistic regression, they are usually not available for multinomial probability estimation in standard statistical packages (for instance, SAS® and SUDAAN®). Therefore, there is a need for methods that estimate multinomial probability without the theoretical and practical limitations as given above.

Since the estimation of a logistic model for binomial probability is well established and any multi-element system (set) can be divided into two subsystems (subsets) at a time, a bisection system is a promising approach to predicting the multinomial probabilities of land use change. Bisection systems have been widely used in developing regression trees to improve the quality of empirical models (Alexander and Grimshaw 1996; Chaudhuri et al. 1994; De’Ath and Fabricius 2000; Loh and Shih 1997); for machine learning in classification (Perlich et al. 2003); and for classification of vegetation patterns and land use conversion (not prediction of future change) (Lawrence and Wright 2001; McDonald and Urban 2006; Rogan et al. 2003; Taverna et al. 2005). In tree regression, although the dependent variables of the models can be numerical or categorical and individual probabilities of categories can be computed for classification, the consistency of the multinomial probabilities (i.e., that their sum equals 1.0) is never considered. In addition, tree regression divides the data by the independent variables rather than by the (categorical) dependent variables of the models, as the estimation of multinomial probabilities requires.

The objective of this study is to develop a consistency-constrained procedure based on a bisection system for predicting the multinomial probabilities of effect-specified land use conversion. The bisection system is based on the dependent variables of the probability models. The procedure is suited to the general properties of land use change and urbanization, but can also be used for other landscape systems where multinomial probabilities are needed. It has been developed based on conditional probability inference and utilizes existing logistic regression statistical software. A case study is used to evaluate the procedure and to demonstrate its use in predicting multiple probability maps.

Procedure development

When an event has two types of responses (such as “yes” and “no”), it has a binomial distribution and the probability of one type of response can be calculated from that of the other, i.e., \(P_2 = 1 - P_1\). Logistic regression was developed for modeling the probability of binomial distributions, and most major statistical software packages have procedures for this purpose. When an event has more than two types of responses (say, k responses), it has a multinomial distribution and the probabilities of all types of responses must sum to one, i.e., \(\sum\nolimits_{r=1}^{k}{P}_{r}=1,\) where \(P_r\) is the probability of the rth response.

Suppose a special bisection decomposition system as shown in Fig. 1 is constructed. In such a system, the k types of responses of an event are decomposed into k − 1 decomposition levels, each level being a binomial structure. At the first level there are exactly two classes, one containing a single response type and the other containing all remaining k − 1 response types. For the second level, the data for the single response type used at the first level are removed. The class containing k − 1 responses at the first level is then regrouped into exactly two classes: one containing a single response type and the other containing all remaining k − 2 response types. This general form of decomposition is continued for k − 1 levels.

Fig. 1 Special bisection decomposition system designed for predicting multinomial probability of events which have more than two types of responses using logistic probability models and conditional probability inference

Therefore, a separate binomial logistic regression can be applied at each decomposition level to estimate the (conditional) probability of the single response type separated at that level. The logistic model of the first decomposition level estimates the probability of its single response type, and the logistic models beyond the first level estimate the conditional probabilities of the corresponding single response types. Based on this decomposition system (Fig. 1), a total of k − 1 logistic models are needed to estimate the probabilities of all k types of responses of an event. In model development, given a data set, the first model uses the entire data set in the logistic regression, so its prediction is the probability of the first single response type. The second model, based on the second level, uses the sub data set that excludes the data whose response belongs to the first-level single response type. Its predictions are therefore the conditional probabilities of the second-level single response type, given that the response is not of the first-level type. Following this pattern, the last model (level k − 1) uses the sub data set that contains only data whose responses belong to the (k − 1)th and kth types, and predicts the conditional probability of the (k − 1)th type of response, given that the response is none of the 1st, 2nd, ..., and (k − 2)th types. After the probability model for each level is developed and the (conditional) probabilities of the first k − 1 types of responses are predicted, the probabilities of all responses of the event can be computed according to the properties of the binomial distribution and conditional probability:

$$ P_{i}=\begin{cases} P_{i}^{\prime}, & i=1 \\ P_{i}^{\prime}\cdot\prod\limits_{j=1}^{i-1}\left(1-P_{j}^{\prime}\right), & i=2,\ldots,k-1 \\ 1-\sum\limits_{r=1}^{k-1}P_{r}, & i=k \end{cases} \tag{1} $$

where P′ is the (conditional) probability predicted using the logistic models, P is the probability of response types, and subscripts (i, j, r, and k) indicate the types of the responses.
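As an illustration of Eq. 1, the following minimal Python sketch combines k − 1 (conditional) probabilities into k consistent probabilities for a single pixel. The function name and the example values are hypothetical and are not taken from the case study.

```python
def multinomial_from_conditional(cond_probs):
    """Combine P'_1 ... P'_{k-1} into P_1 ... P_k following Eq. 1."""
    probs = []
    remaining = 1.0  # running product of (1 - P'_j) over the levels already processed
    for p_cond in cond_probs:
        if not 0.0 <= p_cond <= 1.0:
            raise ValueError("each (conditional) probability must lie in [0, 1]")
        probs.append(p_cond * remaining)   # P_i = P'_i * prod_{j<i} (1 - P'_j)
        remaining *= 1.0 - p_cond
    probs.append(1.0 - sum(probs))         # P_k is the complement of the first k - 1
    return probs


# Hypothetical conditional probabilities for a four-response event (k = 4).
example = multinomial_from_conditional([0.2, 0.5, 0.25])
print(example, sum(example))
```

By construction the returned probabilities sum to one, which is the consistency constraint the procedure is designed to satisfy.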

With the bisection decomposition system, the logistic regression at each level can have its own specific explanatory variables, since only a single response type is modeled at each level. There is therefore no confusion between the explanatory variables and the probability models, and each model remains interpretable for its response. When a specific set of explanatory variables is not known or is not well defined for a particular response or responses, logistic regression with screening options (for example, forward, backward, and stepwise selection) can be used to develop the logistic models at each level.
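The way the data are divided among the levels can be sketched as follows. This is only an illustration of the subsetting logic, not the software used in the study; the function name, the toy labels, and the decomposition order shown are assumptions made for the example.

```python
def bisection_levels(labels, order):
    """Yield (target, row_indices, binary_targets) for each of the k - 1 levels.

    labels: observed response type of each sampled observation
    order:  response types in the order they are split off; the last
            type is never modeled and receives the complement in Eq. 1
    """
    active = list(range(len(labels)))
    for target in order[:-1]:
        # Binomial structure of this level: target type versus all remaining types.
        binary = [1 if labels[i] == target else 0 for i in active]
        yield target, active, binary
        # Remove the observations of the type just modeled before the next level.
        active = [i for i in active if labels[i] != target]


# Toy data with a hypothetical decomposition order.
labels = ["RES", "NCH", "CI", "NCH", "OS", "RES", "NCH"]
for target, rows, binary in bisection_levels(labels, ["RES", "CI", "OS", "NCH"]):
    print(target, rows, binary)
```

Each yielded pair of row indices and binary targets would be passed, together with the explanatory variables chosen for that response, to any binomial logistic regression routine.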

Procedure evaluation

A case study of urbanization over a 10-year period was conducted in order to evaluate the properties of the procedure proposed in the previous section. In the case study, four types of urbanization land use conversion were considered.

Study area

The study area includes the cities of Columbus, Georgia, and Phenix City, Alabama, and their adjacent area. It lies between latitude 32°25′00′′–32°44′55′′ N and longitude 84°34′18′′–85°04′52′′ W. The land use of the study area in 1980 and 1990 is displayed in Fig. 2A and 2B. Outside the cities, the predominant land use category is forest. Development between 1980 and 1990 within the study area was mainly concentrated inside the cities and their suburbs (Fig. 2C). There are three categories (responses) of urban development: “Residential” (RES), “Commercial-Industrial” (CI), and “Open Space” (OS, urban/recreational grassy area). Adding the response of “No Change” (NCH) in development, a total of four types of development (responses) were considered in this study.

Fig. 2 Land use maps (A and B) of the study area (including cities of Columbus, Georgia (GA) and Phenix City, Alabama (AL), USA) and urbanization development during 1980 and 1990 (C). The land use categories come from the website of USGS (http://edcwww.cr.usgs.gov/programs/lccp/classes.html). Open Space represents urban/recreational grassy areas

Among the total number of 493,600 pixels (177,696 ha) in the study area, 23,810 pixels (8,571 ha) had no data in the 1980 land use map (at the south-east corner) (Fig. 2B). Therefore, that corner was eliminated from the analysis.

Materials

The US Army Construction Engineering Research Laboratory (USACERL) (Lozar et al. 2003) provided the land use maps. The pixel size of the land use maps was 60 m × 60 m. The first three types of urban development (RES, CI, and OS) were modeled and predicted using explanatory variables generated from ten factors. Those factors are City (X1), County Road (X2), Slope (X3), Forest (X4), Ramp (X5), Road Intersection (RI, X6), State Highway (SH, X7), Water (X8), Utilities (X9), and the number of immediate neighbors (Neighbor, X10). They were defined by the LEAM (Land Use Evolution and Impact Assessment Model, see URL “http://www.leam.uiuc.edu/”) research group and their definitions are listed in Table A1 in the Appendix (Deal et al. 2002).

The LEAM group defined scores based on the ten factors that potentially could lead to the conversion to RES, CI, and OS. Those scores served as the direct explanatory variables for the logistic models. At a particular location (pixel), a factor could have different scores for different categories of conversion (Deal et al. 2002). For example, the factor Forest (X4) for a pixel could have a higher score for RES and a lower score for CI, or vice versa. The exceptions were the two factors Utilities (X9) and Neighbor (X10), which had the same scores for different categories at the same location. Therefore, there were a total of 26 (Neighbor + Utilities + (8 factors × 3 categories)) unique scores across the categories, and each category had ten scores (explanatory variables).

The score maps were also provided by the LEAM group. The original resolution of the score maps was 30 m × 30 m. They were scaled up to 60 m × 60 m, using the average value of the merged pixels, to correspond to the pixel size of the land use maps. The score maps were used in two ways: (1) pixels from these score maps were sampled to calibrate the logistic probability models using the bisection system; and (2) using all pixels from the score maps as model inputs, the calibrated logistic probability models were used to predict the probabilities of the four types of urban development for the entire study area.
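The rescaling step amounts to averaging each block of 2 × 2 fine pixels into one coarse pixel. A minimal sketch is given below; it assumes a rectangular grid whose dimensions are divisible by the scaling factor and that contains no missing values, and the function name and toy grid are hypothetical.

```python
def upscale_by_mean(grid, factor=2):
    """Average non-overlapping factor x factor blocks of a 2-D score grid."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be divisible by the factor")
    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            block = [grid[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            coarse_row.append(sum(block) / len(block))  # mean of the merged pixels
        coarse.append(coarse_row)
    return coarse


# Toy 4 x 4 score grid at the fine resolution, reduced to 2 x 2.
fine = [[1, 3, 5, 7],
        [1, 3, 5, 7],
        [2, 2, 4, 4],
        [6, 6, 8, 8]]
print(upscale_by_mean(fine))
```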

Methodology assessment

Three categories of urban development, RES, CI, and OS, were explicitly modeled in this study using the corresponding scores. The probability of the last category, NCH, was not explicitly modeled, since no scores were defined for it and its probability can be calculated as the complement of the first three categories of urban development (see the last line of Eq. 1). For coefficient estimation, three independent samples were randomly drawn from the historical land use and score maps. Each random sample contained 3% (14,100 pixels) of the total pixels in the study area.

The order in which the response variables were considered in the bisection decomposition systems was evaluated. Three separate bisection decomposition systems were constructed. Table 1 lists the decomposition levels and the order in which the response variables were considered in the bisections. According to the procedure, for each of the random samples, three logistic models (as one set) were needed to predict the multinomial probabilities of the categories of urban development for each of the decomposition systems. Thus, three sets of (nine) logistic models were developed with each of the three random samples, resulting in a total of nine sets of (27) logistic models.

Table 1 The order in which the categories were considered for three separate bisection systems

The initial independent variables of the probability models included the scores of the ten explanatory factors and their cross product terms that are listed in Table 2. A stepwise logistic regression was used for selecting significant independent variables for the models and for estimating their coefficients using SAS® (PROCEDURE “LOGISTIC” with the model option “SELECTION=STEPWISE” and default significance level (α = 0.05)). The quality of each model was assessed using (pseudo) R-square and concordance, which is a summary measure of association based on the number of pairs of observations whose predicted probability and response are consistent.
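The idea behind concordance can be illustrated with a short sketch: among all pairs formed by one observation with the response and one without, a pair is concordant when the observation with the response has the higher predicted probability. The code below is a simplified illustration under that definition; statistical packages may group predicted probabilities before comparing them, so their reported values can differ slightly. The function name and toy data are hypothetical.

```python
def percent_concordant(probs, responses):
    """Percentage of (event, non-event) pairs in which the event has the higher probability."""
    events = [p for p, y in zip(probs, responses) if y == 1]
    nonevents = [p for p, y in zip(probs, responses) if y == 0]
    pairs = len(events) * len(nonevents)
    if pairs == 0:
        raise ValueError("need at least one event and one non-event")
    # Brute-force pair comparison; adequate for a sketch, slow for large samples.
    concordant = sum(1 for pe in events for pn in nonevents if pe > pn)
    return 100.0 * concordant / pairs


# Toy predicted probabilities and observed binary responses.
probs = [0.9, 0.6, 0.4, 0.3, 0.1]
responses = [1, 1, 0, 1, 0]
print(percent_concordant(probs, responses))
```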

Table 2 The initial independent variables (scores generated from the factors and the combinations of scores)

The Relative Operating Characteristic (ROC) was used to evaluate the performance of the models. ROC is an index that measures the accuracy of a predicted probability compared to the actual condition (Swets 1988). Pontius and Schneider (2001) interpreted ROC as a measure of the quality of predictions of spatial ecological change, and provided the details for drawing empirical ROC curves and computing their values. Fang et al. (2005) used ROC to compare the performance of urbanization models. In this study, both the predicted probability maps and the historical land use maps were used to draw ROC curves and to compute the ROC value for each category. The details of how this was done can be found in Fang et al. (2005) and Pontius and Schneider (2001). In drawing the ROC curves and computing their values, 11 probability scenarios were adopted: 0.0, 0.1, ..., 0.9, and 1.0.
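A simplified sketch of an empirical ROC computation with 11 scenarios is shown below. It assumes that each scenario is applied as a threshold directly on the predicted probability and that the ROC value is the trapezoidal area under the resulting curve; the exact procedure of Pontius and Schneider (2001) may differ in detail. Function names and data are hypothetical.

```python
def roc_points(probs, actual, thresholds):
    """Return sorted (false positive rate, true positive rate) points, one per threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both changed and unchanged observations")
    points = {(0.0, 0.0), (1.0, 1.0)}  # make sure the curve reaches both corners
    for t in thresholds:
        tp = sum(1 for p, y in zip(probs, actual) if y == 1 and p >= t)
        fp = sum(1 for p, y in zip(probs, actual) if y == 0 and p >= t)
        points.add((fp / negatives, tp / positives))
    return sorted(points)


def roc_area(points):
    """Trapezoidal area under the curve through the sorted ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


thresholds = [i / 10 for i in range(11)]       # 0.0, 0.1, ..., 1.0
probs = [0.95, 0.85, 0.65, 0.45, 0.35, 0.15]   # toy predicted probabilities
actual = [1, 1, 0, 1, 0, 0]                    # 1 = converted, 0 = not converted
print(roc_area(roc_points(probs, actual, thresholds)))
```

A value of 0.5 corresponds to a prediction no better than random, and 1.0 to a perfect ranking of converted over unconverted pixels.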

The quality and performance of the logistic probability models were statistically analyzed using analysis of variance (ANOVA). There were three random samples (Sample), three orders (Order) in which the response variables were considered in the decomposition, and three response variables (Category). Therefore, a three-way ANOVA was used to identify the effects of these variables on the (pseudo) R-squares, percent concordance, and ROC values used for assessing the quality and performance of the logistic regressions. In the ANOVA, the null hypotheses were that the different levels of Sample, Order, and Category individually had the same effect on the three measures. The ANOVAs were conducted using SAS® PROCEDURE “ANOVA”.

Results and analysis

From 1980 to 1990, among the 469,790 pixels (169,124 ha) considered in the study area (excluding the south-east corner as shown in Fig. 2C), 458,139 pixels (164,930 ha) had no change. The major type of urbanization was the conversion to Residential (RES), which accounted for 7,276 pixels (2,619 ha). The number of pixels converted to Commercial-Industrial (CI) was 3,130 (1,127 ha). Open Space (OS) was the least frequent conversion, with only 1,245 pixels (448 ha) changed. The models calibrated with sampled data from the study area reflected this pattern of urbanization.
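The pixel counts and areas above can be cross-checked with a few lines of arithmetic, using the 60 m pixel size (0.36 ha per pixel) stated in the Materials section. The variable names are illustrative only.

```python
PIXEL_AREA_HA = 60 * 60 / 10_000  # one 60 m x 60 m pixel covers 0.36 ha

# Pixel counts per response type as reported in the text.
counts = {"NCH": 458_139, "RES": 7_276, "CI": 3_130, "OS": 1_245}

total_pixels = sum(counts.values())
areas_ha = {name: round(n * PIXEL_AREA_HA) for name, n in counts.items()}

print(total_pixels, round(total_pixels * PIXEL_AREA_HA))
print(areas_ha)
```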

Table 3 lists the descriptive statistics of the quality measures of the (conditional) probability models, which were fitted based on the three selected bisection systems and three random samples. The R-squares of all models were between 0.72 and 0.75. The mean R-squares of the models for the three categories with all samples were also within this interval (Table 3). The standard error of the R-squares of any individual sample was very small (<0.001), which indicates very small variation among the models fitted with the same sample. The concordance was larger than 87% for all models, except for CI with the third random sample. The standard errors of concordance for all models were smaller than 1.2 (Table 3). Since the means of the R-squares and concordance varied considerably across the levels of Sample and Category relative to the overall standard errors, the corresponding F-values were large, leading to small P-values (<0.03) for the variables Sample and Category (Table 4). Therefore, the ANOVA results led to rejecting the null hypotheses for the effects of these two variables at the 0.05 significance level. However, the variable Order had small F-values (<0.11) and very large P-values (>0.9) (Table 4) for both the R-squares and the concordance. Thus, the null hypotheses for the effect of Order could not be rejected even at a 0.50 significance level.

Table 3 The means and standard errors (SE) of the R-squares and percent concordance of the probability models based on the three random samples
Table 4 ANOVA results of the R-squares and percent concordance of the probability models

The means and standard errors of the ROC values, computed from the predicted probability maps and historical land use maps, are listed in Table 5. The mean ROC values for the category RES were the highest among the categories. The logistic models for the category CI had the lowest mean ROC values; their highest mean was still smaller than 0.53, which is very close to the base ROC value of 0.50. The higher ROC values of RES indicated that the RES probability models could explain more of the land use conversion than the CI and OS models based on their explanatory variables. The ROC curves visually demonstrate the performance of the logistic models built using the first random sample according to the locations of the categories in the third decomposition system (Table 1 and Fig. 3). The ANOVA results based on ROC values showed that both the variables Sample and Order had very small F-values (0.37 and 0.08) and large P-values (0.6983 and 0.9219). Therefore, the null hypotheses for the effects of these two variables could not be rejected at the 0.05 significance level, and the differences in ROC values among samples and orders were not significant. However, the variable Category had a very large F-value (693.74) and a very small P-value (<0.0001), which led to a rejection of the corresponding hypothesis at any significance level larger than 0.0001. A pairwise multiple comparison test was performed and the mean ROC values of the categories were all found to differ significantly from each other at the 0.05 level (Table 5). The ANOVA model of ROC was highly significant (P-value <0.0001).

Table 5 The means and standard errors (SE) of ROC values computed based on the historical land use maps and the predicted probabilities according to the three random samples
Fig. 3 ROC graphs based on the historical land use maps and predicted multinomial probabilities of land use converted to Residential (A), Commercial-Industrial (B), and Open Space (C) between 1980 and 1990 in the study area. Models were built using the first random sample according to the location of the categories in the third decomposition system (Table 1)

Figure 4 displays the predicted probability maps of all categories using the probability models based on the first random sample and the last (third) bisection decomposition system given in Table 1. Comparing the probability maps with the actual change between 1980 and 1990 (see Figs. 2C and 4), the predicted maps of RES and OS (especially RES) provided a reasonable prediction of the spatial pattern of the development. The quality of the probability maps was measured by the ROC values and curves (Fig. 3 and Table 5). As these show, the predicted CI probability map captured only a small portion of the actual CI conversion. The probability of No Change (NCH) can be considered a view of the probability of overall urban development. Comparing Figs. 2C and 4D, the major development between 1980 and 1990 was captured by the lower probability of NCH in its probability map. The ROC value from the probability map of NCH was 0.851, which was considerably better than that of RES. This indicated that some CI or OS pixels that were not captured by their corresponding probability models were predicted to have a higher probability of RES, or vice versa. This might have been caused by the similarity among the three categories.

Fig. 4 The maps of the predicted probabilities of city development between 1980 and 1990. Maps A to D are the probabilities of land use converted to Residential, Commercial-industrial, Open Space, and no conversion (No Change), respectively

Discussion and conclusion

The procedure presented for modeling multinomial probabilities has been developed based on conditional probability inference and a special bisection decomposition system. As long as such a decomposition system can be established, multinomial probability problems can be decomposed into binomial and conditional probability problems. Once decomposed, classical methods/techniques can be used for estimation. Theoretically, there was no assumption implied in the procedure, and a special bisection decomposition system could always be built for any multi-response event. The decomposition of multinomial probability into binomial probability made it much more convenient to use screening techniques with logistic regression and to structure specific models for specific responses and their corresponding explanatory factors.

In the evaluation of this procedure, the responses of interest had different numbers of observations in the samples. The most frequent response, Residential (RES), had about five times as many observations as Open Space (OS). Even with this large difference, the evaluation showed that the impact of decomposition order on either the quality or the performance of the logistic models was not significant. It also implies that modeling uncertainty will affect the prediction of individual pixels, but will not reduce the accuracy of predictions for the entire population (across the entire case study area). This property of the procedure adds flexibility in practice: researchers can decompose a multinomial system into a series of binomial systems according to their preference.

The performance of the probability models built with the procedure developed in this study was comparable to that reported in similar studies. Among the modeled categories, the RES probability maps had the highest mean ROC value, 0.78, which was higher than that (0.72) of the RES probability map for Peoria, Illinois, USA (Fang et al. 2005). This comparison of probability models built in different study areas suggests that the procedure developed in this study does not cause difficulties in terms of model performance. Owing to the technical reasons mentioned in the Introduction, no comparison was made between this procedure and other estimation methods for multinomial probability.

The measures of model quality and performance were not consistent in this study. Concordance, which indicated the quality of the probability models, had very different values when different random samples were used in model development. With the third random sample, the logistic models of CI had concordance values less than half of those obtained from the first two samples. However, the R-squares and ROC values of all models for all categories were very stable across samples. The R-squares of the models of all categories were concentrated within a narrow interval (0.72 to 0.75), although the ANOVA indicated statistically significant differences. The ROC values of different categories had more noticeable differences. Since the ROC value was computed directly from probabilities and for the entire study area, it was more reliable than R-square, which is computed from the likelihood in logistic regression, and concordance, which is based on ordinal variables. The consistency of ROC values for any one of the categories indicated that all three random samples represented the study area well.