Introduction

Zaragoza city is located in the central part of the Ebro Basin, in the north-east of the Iberian Peninsula. This triangular-shaped basin is bounded in the north by the Pyrenees, in the south-west by the Iberian Range and in the south-east by the Catalan Coastal Range. The continental sedimentary infill of this basin is composed of conglomerates and sandstones at the margins, grading into clays, marls, evaporites and carbonate facies towards the depocentre of the basin (Benito et al. 1998). In the central part of the basin, the playa-lake deposits of the Zaragoza Formation form the largest gypsum outcrop in the area (Quirantes 1978). In this area, the different terrace levels and glacis lie directly on these Tertiary evaporites (Fig. 1).

Fig. 1 Geomorphology of the study area

Gypsum dissolves in water roughly 30–70 (Klimchouk et al. 1996) to 100–150 (Martínez et al. 1998) times faster than limestone. Karstification processes are therefore especially intense in these covered evaporite karst areas, even though annual precipitation in this semiarid environment supplies less than 350 mm of water.

Four conditions must be met for evaporite karstification to be considered active (Johnson 2005; Lamont-Blanck et al. 2002):

  1. An evaporite deposit on the surface.

  2. Water that is unsaturated with respect to the evaporite mineral.

  3. An outlet for the escape of solvent water.

  4. Energy to cause water to flow through the system.

These conditions are met in the study area. Karstification is an active process that has produced collapse dolines (sinkholes) and subsidence dolines in a wide range of sizes.

In recent decades, subsidence and doline development in this area have led to enormous economic losses caused by the rupture of irrigation infrastructures and water supply networks as well as the destruction of transport infrastructures and buildings. The total economic losses due to all these factors are estimated to be in the order of tens of millions of dollars (Soriano and Simón 1995).

Most of the studies on doline development in this area are aimed at solving specific problems, but many experts agree on the need for a regional analysis of this phenomenon for urban planning (Benito et al. 2000; Guerrero et al. 2004; Gutiérrez-Elorza and Gutiérrez-Santolalla 1998; Simón et al. 1998a; Simón and Soriano 2002; Simón et al. 1998b). This is especially necessary in complex areas characterized by dynamic industrial and urban development and intensive agricultural use, as is the case in the Zaragoza area.

According to Paukštys et al. (1999), the most cost-effective way of planning in these areas is to avoid existing dolines and the most subsidence-prone areas. Applying this preventive philosophy requires recognizing the areas affected by subsidence and producing hazard maps (Guerrero et al. 2004).

In the last decade, several countries have built karst databases integrated in a GIS environment, where new interpretations of the relationships within the data can be readily made (Cooper et al. 2001; Gao et al. 2001, 2005a, 2005b; Green et al. 2002; Lei et al. 2001).

Our main objective is the development of a doline probability map in a GIS environment that can help to differentiate between high- and low-hazard areas as an aid for regional land use management in the Zaragoza area. We aim to improve the existing knowledge about the factors that best determine doline distribution in this dynamic area.

In order to fulfil this main objective, several secondary objectives need to be satisfied as well:

  • A review of the state of the art in the study area to recognize the main factors involved in this phenomenon.

  • A review of possible methodologies to be used in the hazard assessment.

  • Mapping of the dolines present in the study area.

  • The development of models for the different geological, environmental and human factors that can play a role in the hazard development.

  • Multivariate analysis of the information for producing a doline probability map with the use of logistic regression.

Logistic regression relates predictor variables to the occurrence or non-occurrence of dolines and uses the relationship to produce a map showing the probability of dolines. The main advantage of this technique over multiple linear regression is that it allows a qualitative variable (such as the occurrence or non-occurrence of dolines) to be analysed as a function of several qualitative and quantitative explanatory variables.

The study area

General settings

Zaragoza is located in the Ebro corridor, a highly dynamic economic area within the Iberian Peninsula. Its strategic location between three of the most developed areas of the Iberian Peninsula (Madrid, Barcelona and Valencia) has facilitated rapid economic and urban development in recent decades. In 1964, Zaragoza was declared a Focus of Industrial Development, resulting in a rapid increase in population to 500,000 inhabitants in the 1970s. Nowadays, the city has a total population of about 700,000 inhabitants (more than 60% of the population of the Region of Aragón), and this is expected to increase further in the next decade. Zaragoza will be the venue of the International Fair in 2008 with the title “Water and sustainable development”, which implies the creation of many new infrastructures and jobs that will attract new inhabitants. Figure 2 shows the extent of new industrial and urban areas such as the logistic platform (PLAZA) and the recycling technology park (PTR), as well as the assigned location of EXPO 2008.

Fig. 2 Infrastructure development in the surrounding areas of Zaragoza

The climate in this area is semi-arid, with a mean annual precipitation of about 350 mm and a mean annual temperature of about 15°C. The continental Mediterranean climate of Zaragoza is also characterized by an irregular distribution of precipitation, with annual peaks in spring and autumn and considerable variation from year to year. Other characteristics are the large thermal differences between summer (average monthly maximum temperature of 24–25°C in July) and winter (monthly minimum temperature of 5–6°C in January) and the intensity and frequency of the NW wind known as the “Cierzo”.

The combination of high anthropogenic impact and semiarid climate means that the scarce remnants of natural vegetation correspond to sclerophyllous scrub, gypsum steppes and riparian forest and thickets.

The subsidence phenomenon

Hazard description

Alluvial dolines are generated when surficial deposits (terraces and glacis) overlying Neogene evaporites fall into conduits enlarged by dissolution, causing depressions on the surface. Their generation follows two mechanisms (Soriano and Simón 1995):

  (a) Mobilization of material by ground water, which may take place by either dissolution of soluble substrate or piping of the detrital cover.

  (b) Cave-in of the alluvial cover, developed by either sudden collapse or slow subsidence.

According to Soriano and Simón (2002), the forms of the present-day alluvial dolines northwest of Zaragoza correspond to those described in classical papers (Cvijic 1981; Palmquist 1979): pan, funnel and well-shaped types have been identified. Doline diameters range between several meters and 100 m, with depths varying between 1 and 20 m. There are also larger shallow depressions up to 1,100 m long and 600 m wide, most of which were identified as uvalas. Soriano and Simón (2002) observed that, in this sector, doline density is higher on the second terrace level than on the flood plain. In addition, they monitored the subsidence of different dolines over a period of about 4 years and estimated subsidence rates ranging between 21 and 92 mm/year.

In the sector southeast of Zaragoza, most of the dolines are shallow closed depressions, commonly less than 1.5 m deep, with non-scarped edges. They show highly variable geometries and dimensions, reaching up to 0.35 km2 in area and 2 km in length. A small number of scarp-edged collapse sinkholes (up to 7 m in length and 2.5 m in depth) have also been recognized in the floodplain (Gutiérrez-Santolalla et al. 2005b).

Controlling factors

Gutiérrez-Elorza and Gutiérrez-Santolalla (1998) classified the factors that play a role in the dissolution process into geological and environmental factors, whereas Soriano and Simón (1995) distinguished between physical, hydrological and human factors. Despite this difference in classification, their studies in the area agree on the following factors that control the subsidence process:

  1. Geological factors:

    (a) Lithological: characteristics of the substrate and Quaternary cover (texture, porosity and permeability).

    (b) Stratigraphical: thickness of gypsum formations and Quaternary cover, intercalation of insoluble bodies, existence of halite beds.

    (c) Structural: structure of the gypsum formations, discontinuity planes, joints, faults.

    (d) Geomorphological: relief configuration.

    (e) Hydrogeological: flow velocity and regime, water infiltration, hydraulic gradient, depth and seasonal variations of the water table, thickness of the saturated Quaternary deposits, chemical composition of the groundwater.

  2. Environmental factors:

    (a) Climate: existence of stormy events and high temperatures.

  3. Anthropogenic factors: irrigation, pumping, and constructions.

Guerrero et al. (2004) reported that recent studies based on borehole data and interpretations of the paleokarst features highlight the relevance of salt and sodium sulphate karstification in the subsidence phenomena. They confirm the presence of halite close to the surface in some locations along the Huerva Valley like Cadrete village (Fig. 2). It is possible that previously existing halite beds have been removed by dissolution from the outcropping evaporite sequence. The existence of halite and Na-sulphates in the bedrock is a crucial factor for the development of dissolution-induced subsidence phenomena due to their high solubility. Whereas the solubility of gypsum at 25°C is 2.4 g/l, halite, glauberite and thenardite solubility reach 360, 118 and 519 g/l, respectively (Ford and Williams 1989).

Guerrero et al. (2004) also observed that most of the subsidence problems do not occur inside the city but in the peripheral areas of Zaragoza (Fig. 6) where there is a relatively thin alluvial mantle. They attribute this to the fact that the majority of the city is built on thickened and slightly cemented alluvial deposits. This explanation matches the results obtained by Simón et al. (1998b) and Soriano and Simón (1995). However, a recent study (Gutiérrez-Santolalla et al. 2005b) reveals that downstream of the Ebro River flood plain the alluvium thickness does not have a significant influence on the formation of dolines.

Two main sets of vertical joints with prevalent N–S and NW–SE directions have been recognized in the study area (Arlegui and Simón 2000). Several authors agree on the strong morphogenetic control exerted by the NW–SE direction in the study area. Maldonado et al. (2000) studied the dolines developed in La Puebla de Alfindén village (10 km NE of Zaragoza) and found that many of them developed along this NW–SE axis. Soriano (1992) compared the direction of the doline axes with the direction of the fractures and observed a high correlation.

Maldonado et al. (2000) also monitored the dolines’ history and discovered that the subsidence activity was directly related to storm events. Furthermore, Soriano and Simón (2002) reported that subsidence rates correlate with rainfall but they stated that the changes are very subtle, suggesting that the influence of rainfall on doline evolution is not evident.

Another possible factor is the location of areas where the Tertiary aquifer discharges into the alluvial aquifer following the model suggested by Jiménez-Torrecilla et al. (2004).

There is a reciprocal interaction between human activities and doline generation. Many infrastructures are affected by subsidence, but at the same time certain activities favor the generation of dolines. Benito and Gutiérrez-Elorza (1988) observed that dolines preferentially form near unlined canals. Guerrero et al. (2004) observed that the generation of dolines is frequent along the Imperial Canal (Figs. 2, 4), particularly on the downgradient side.

Variations in the water table are another human-induced triggering factor, since they are mainly due to the supply of water by irrigation (Soriano and Simón 1995). A decline in the water level causes a loss of buoyant support to the ground, increases the flow gradient and velocity, facilitates aquifer recharge and reduces the geomechanical strength of the alluvial cover (Gutiérrez-Elorza and Gutiérrez-Santolalla 1998).

The geological hazard

The presence of dolines causes many problems in both farming and civil engineering. Farmers lose arable land and attempt to recover it by filling the depressions each year. In the area upstream of Zaragoza, civil engineering problems have caused increasing concern since the 1970s, when a large number of factories were built. Many buildings were constructed on depressions that had been filled several years before, and no special site investigations were carried out at that time (Soriano and Simón 1995). Nowadays, much damage is evident in the form of fractures that have developed in the walls and floors of some factories (Fig. 3).

Fig. 3 Fractures developed in the parking lot of a factory in the northwest of Zaragoza

Due to the filling of voids by the farmers, the surface of the area occupied by dolines in present air photographs has been reduced with respect to photographs taken in the 1950s. Local farmers fill the depressions with the intention of transforming the land into arable production. However, according to some landowners, this is an ineffective practice, as the filled and levelled dolines often remain unproductive (Gutiérrez-Santolalla et al. 2005b).

In the downstream area of Zaragoza damage to buildings has been reported in La Puebla de Alfindén village (Fig. 2), although in some cases it is difficult to elucidate whether it is due to dissolution subsidence or the hydrocollapse of the gypsiferous silts. The Madrid–Barcelona motorway (A-2) crosses a few shallow closed subsidence depressions (Gutiérrez-Santolalla et al. 2005b).

As to the water management infrastructures, an old section of the Canal Imperial de Aragón, the biggest canal in this region, had to be abandoned and rebuilt several meters to the north of the old section because the construction itself intercepted several collapses (Fig. 4).

Fig. 4 Doline developed in an old section of the Canal Imperial de Aragón

Another doline-related problem was observed in La Cartuja village (Fig. 2). In 2003 a doline developed underneath the municipal swimming pool. Approximately 15 years earlier, the pool had cracked in the middle and had been rebuilt without any major site investigation.

Many examples can be collected from previous studies, newspapers and personal meetings with local residents, confirming the importance of this phenomenon and the need for better knowledge of its spatial distribution for proper land use management.

Methodological background

Many examples of subsidence hazard mapping can be found in the literature, especially from developed countries in Europe and North America.

Soriano and Simón (1995) observed that groundwater sulphate content, grain size of the detrital cover, topography of the Tertiary/Quaternary boundary, annual variation of the water table and the thickness of detrital cover were the main factors controlling doline development. Taking into account these variables, a theoretical spatial hazard model was elaborated and expressed as a mathematical equation. They also produced a 1:50,000 scale doline hazard map of their study area located upstream of Zaragoza.

In subsequent studies, Simón et al. (1998b) stated that this model cannot be applied at a larger, 1:25,000 scale and used a qualitative assessment instead. They considered three main factors in the subsidence hazard, in order of importance:

  1. Topography of the Tertiary–Quaternary contact.

  2. Thickness of Quaternary sediments.

  3. Lutitic percentage in Quaternary deposits.

They also introduced the gradient in groundwater sulphate content as a secondary factor. A potential hazard map was developed based on these factors and overlaid on a map of present hazard derived from air photograph analysis and doline mapping. The resulting map presents seven categories of hazard, from Very high real hazard to Low potential hazard.

Kaufmann and Quinif (2002) also used a GIS-based qualitative assessment in the Tournaisis area. Hydrogeological data and geological mapping information were used to create a geo-hazard map. Records of former collapses were also investigated. These records were of great interest since doline distribution was obviously clustered in the area. Their study showed that zones of high doline occurrence coincide with zones of significant drop of piezometric heads. Combining the density of former collapses with the dewatering of the limestone enabled them to delineate zones of low, moderate and high collapse hazard.

Green et al. (2002) likewise used GIS technology to delineate karst units. Many different overlays of the karst features were combined to better understand the landscape dynamics.

Sinkhole probability maps have also been constructed for south-eastern Minnesota. These maps were constructed as paper maps with boundaries drawn by the maps’ authors using subjective criteria. As part of the transition of this mapping effort into a digital GIS environment, Gao and Alexander (2003) developed a mathematical decision tree model for the construction of maps of relative sinkhole risk based on the distribution of distances to the nearest sinkhole and the sinkhole density.

Whitman and Gubbels (1999) used GIS to investigate the spatial relationships between hydrogeologic factors and sinkhole formation near Orlando, FL. Landsat TM imagery, digital topography, and well data were used to construct a model of head difference between a discontinuous set of surficial aquifers and the Florida aquifer, a regionally extensive confined aquifer.

Lei et al. (2001) studied sinkhole distributions influenced by karstification of bedrock, soil types and properties, groundwater condition, human activities, land use, infrastructures, and economic development. They developed a sinkhole hazard map using a qualitative weighting method within GIS. The three factors they introduced were water level in dry and rainy seasons, thickness of soil and finally the karstification of bedrock together with the fault distance.

Qualitative methods are very flexible and permit a complete inclusion of expert knowledge. Their main pitfall is that they involve a high level of subjectivity, so that the maps produced by different researchers can be very different. Although a fully objective procedure does not exist, quantitative methods such as logistic regression ensure that the same results can be achieved provided the same basic assumptions apply (Beguería and Lorente 2003).

Specific methodology

Logistic regression

Commonly applied linear regression techniques are appropriate only when the dependent variable and the explanatory variables are quantitative and continuous. To analyze a binary qualitative variable (0 or 1) as a function of a number of explanatory variables, special techniques must be used if the analysis is to be performed adequately. One of the most common regression models that accomplish this is the logit or logistic regression model (Bledsoe and Watson 2001).

The logistic regression technique has already been used for many environmental studies, in many cases with more success than multiple linear regression. Battaglin and Goolsby (1997) compared the results of both techniques to identify natural and anthropogenic variables of drainage basins that have strong relations to agricultural chemical concentrations and mass transport measured in rivers. They concluded that logistic regression was somewhat more successful than multiple linear regression.

This technique has also been applied successfully to predict thresholds of channel pattern and instability (Bledsoe and Watson 2001) and to forecast short-term hail risk (Sánchez et al. 1998). However, its most common use in geo-hazards has been the development of landslide hazard maps (Beguería and Lorente 2003; Lee and Min 2001; Ohlmacher and Davis 2003).

Logistic regression establishes a functional relationship between the binary-coded hazard locations (presence or absence of dolines) and the different factors recognized as playing a role in hazard development. It states that the natural logarithm of the odds (the logit) is linearly related to the independent variables (Beguería and Lorente 2003):

$$ {\text{Logit}}\,(P) = \ln \left( \frac{P}{1 - P} \right) = B_{0} + B_{1} X_{1} + \cdots + B_{n} X_{n} , $$
(1)

where P is the probability of occurrence, $X_{1}, \ldots, X_{n}$ are the n independent variables, and $B_{0}, \ldots, B_{n}$ are the n + 1 parameters. Solving expression (1) for P:

$$ P = \exp (B_{0} + B_{1} X_{1} + \cdots + B_{n} X_{n} )/(1 + \exp (B_{0} + B_{1} X_{1} + \cdots + B_{n} X_{n} )). $$
(2)
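As a minimal sketch, expressions (1) and (2) can be evaluated for a single location as follows. The function names, the coefficient values and the predictor values are hypothetical and serve only to illustrate the arithmetic; they are not the fitted parameters of this study.

```python
import math


def doline_probability(coefficients, predictors):
    """Evaluate expression (2) for one location.

    coefficients: [B0, B1, ..., Bn], intercept first.
    predictors:   [X1, ..., Xn].
    """
    if len(coefficients) != len(predictors) + 1:
        raise ValueError("expected one intercept plus one coefficient per predictor")
    # Linear predictor B0 + B1*X1 + ... + Bn*Xn, i.e. the logit of P.
    z = coefficients[0] + sum(b * x for b, x in zip(coefficients[1:], predictors))
    # Numerically stable form of exp(z) / (1 + exp(z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logit(p):
    """Expression (1): natural logarithm of the odds P / (1 - P)."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients and predictor values, for illustration only.
example_b = [-2.0, 0.8, -0.05]
example_x = [1.0, 10.0]
p = doline_probability(example_b, example_x)
```

Applying `logit` to the returned probability recovers the linear predictor, which is a quick way to check that the two expressions are inverses of each other.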

Slope coefficients for the logistic equation are fitted to the categorical data using a maximum likelihood method, which maximizes the probability of obtaining the observed data given the set of slope coefficients (Battaglin and Goolsby 1997).
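The idea of maximum likelihood fitting can be sketched with plain gradient ascent on the log-likelihood. This is an illustration only: statistical packages normally use a Newton-type iteration, and the tiny data set, the learning rate and the number of iterations below are assumptions chosen so that the example converges.

```python
import math


def sigmoid(z):
    """Numerically stable exp(z) / (1 + exp(z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def linear_predictor(coefs, x):
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], x))


def log_likelihood(coefs, rows, labels):
    """Sum over cases of y*z - log(1 + exp(z)), with z the linear predictor."""
    total = 0.0
    for x, y in zip(rows, labels):
        z = linear_predictor(coefs, x)
        log1pexp = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += y * z - log1pexp
    return total


def fit_logistic(rows, labels, rate=0.1, iterations=5000):
    """Gradient ascent on the mean log-likelihood; returns [B0, B1, ..., Bn]."""
    coefs = [0.0] * (len(rows[0]) + 1)
    m = len(rows)
    for _ in range(iterations):
        grad = [0.0] * len(coefs)
        for x, y in zip(rows, labels):
            err = y - sigmoid(linear_predictor(coefs, x))
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        coefs = [b + rate * g / m for b, g in zip(coefs, grad)]
    return coefs


# Hypothetical cases: one predictor per location, 1 = doline, 0 = no doline.
rows = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
fitted = fit_logistic(rows, labels)
```

The hypothetical labels overlap along the predictor on purpose: with perfectly separated classes the likelihood has no finite maximum and the coefficients would grow without bound.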

In ordinary regression analysis, the coefficient of determination (r²) is frequently used as a measure of model performance. In logistic regression, it is common to be more concerned with whether the predictions are correct or incorrect than with how close the predicted values are to the observed (0 or 1) values of the dependent variable. Therefore, r² has little meaning in logistic regression analysis (Bledsoe and Watson 2001).

Goodness-of-fit tests may aid in the interpretation of the results of logistic regression. The likelihood $L_{0}$ for the null model, where all slope parameters are zero, may be directly compared with the likelihood $L_{1}$ of the fitted model. Specifically, one can compute the X² statistic for this comparison as

$$ X^{2} = - 2(\log (L_{0} ) - \log (L_{1} )) $$
(3)

The number of degrees of freedom for this X² value is equal to the number of independent variables in the logistic regression. If the p-value associated with this X² is significant, the estimated model yields a significantly better fit to the data than the null model, and the regression parameters are jointly statistically significant.
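A minimal sketch of this test, assuming the two log-likelihoods are already known, is given below. The upper-tail probability of the chi-square distribution is computed with a standard recurrence for integer degrees of freedom so that no external library is needed. The function names and the numerical values are illustrative assumptions.

```python
import math


def likelihood_ratio_statistic(log_l0, log_l1):
    """Expression (3): X^2 = -2 (log L0 - log L1), natural logarithms."""
    return -2.0 * (log_l0 - log_l1)


def chi_square_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1.

    Uses Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1) with h = x / 2,
    starting from Q(1, h) = exp(-h) for even df or
    Q(1/2, h) = erfc(sqrt(h)) for odd df.
    """
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))
    while a < df / 2.0:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return q


# Hypothetical log-likelihoods of a null model and a model with 3 predictors.
x2 = likelihood_ratio_statistic(-120.0, -100.0)
p_value = chi_square_sf(x2, df=3)
```

With these hypothetical values the statistic is 40 on 3 degrees of freedom, so the p-value is far below the usual 0.05 threshold.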

Dependent and independent variables mapping

In this section, we explain the development of the dependent variable (doline/no doline) and the independent variables that will be introduced in the stepwise logistic regression procedure. Due to the lack of information about some factors, such as the location of salt layers and fractures in the Tertiary substrate, these factors were not included in the model, which may limit the accuracy of the doline probability map.

Digital Terrain Models (DTM) of the different factors that can be regionalized were created. These models have a spatial resolution of 20 m and are restricted to the Ebro alluvial plain, where geological and hydrogeological information is available.

Several methods were used to develop the models. In some cases, existing maps were simply converted into a common reference system or a different data structure (rasterization); in other cases, it was necessary to analyze and digitize air photographs; finally, point data gathered from reports or field analysis (e.g., precipitation from meteorological stations, borehole data, etc.) were interpolated and modeled.

ArcGIS 9.1 (ESRI Inc. 2005) and Gocad (Earth Decision Sciences 2005) were used to implement the model. Figure 5 shows the different steps of the data preparation.

Fig. 5 Scheme of the logistic regression variables development

The information about the dependent and independent variables at every pixel in the DTM was stored in a database and then analyzed with a statistical software package, SPSS 11 (SPSS Inc. 2001). In the database, every row corresponds to a case (a location) and every column represents a different variable.

ERDAS 8.7 (Leica 2003) was used to create the database. This software can import grids into a single layer stack, which can then be exported as ASCII files that are easy to import into SPSS. SPSS performs the logistic regression analysis and returns the required model parameters, i.e., the regression coefficients (see Eq. 1).
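The database layout described above (one row per pixel, one column per variable) can be sketched as follows. This is not the ERDAS export itself. The grids are represented as nested lists, `None` marks NoData, and the layer names and values are hypothetical.

```python
def stack_to_rows(layers):
    """Flatten co-registered grids into one table row per pixel.

    layers: dict mapping a variable name to a 2D list of equal shape.
    Pixels with NoData (None) in any layer are skipped.
    Returns (header, rows), where each row is [row, col, value, value, ...].
    """
    names = list(layers)
    first = layers[names[0]]
    n_rows, n_cols = len(first), len(first[0])
    for name in names:
        grid = layers[name]
        if len(grid) != n_rows or any(len(line) != n_cols for line in grid):
            raise ValueError("layer %r does not match the stack shape" % name)
    header = ["row", "col"] + names
    rows = []
    for r in range(n_rows):
        for c in range(n_cols):
            values = [layers[name][r][c] for name in names]
            if any(v is None for v in values):
                continue
            rows.append([r, c] + values)
    return header, rows


def to_ascii(header, rows, separator="\t"):
    """Serialize the table as delimited text, one case per line."""
    lines = [separator.join(header)]
    lines += [separator.join(str(v) for v in row) for row in rows]
    return "\n".join(lines)


# Hypothetical 2 x 2 layers: the dependent variable and one predictor.
layers = {
    "doline": [[0, 1], [0, 0]],
    "gw_depth_m": [[4.5, 5.0], [None, 6.0]],
}
header, rows = stack_to_rows(layers)
text = to_ascii(header, rows)
```

Keeping the row and column indices in the table makes it possible to map the fitted probabilities back onto the grid after the statistical analysis.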

The last step is the implementation of the model, which was carried out using the raster calculator in ArcGIS.

The dependent variable

For the development of the dependent variable, doline locations and maps from previous studies in the area upstream of Zaragoza were georeferenced and digitized (Simón et al. 1998a, b). For areas with no previous studies available, air photographs from 1984 and 1997 were analyzed. Figure 6 shows that doline development is a severe problem in the region along the Ebro valley. The doline vector map was then converted into raster format and reclassified into two values: presence (1) and absence (0) of dolines.

Fig. 6 Dolines distributed in the surrounding areas of Zaragoza

Geological variables

In the case of the lithological and stratigraphical variables, a 3D model of the Quaternary alluvial deposits of the study area was created. The information from more than 900 boreholes was obtained from the Water Points Inventory (IPA, Inventario Puntos de Agua) of the Ebro Basin Authority (CHE, Confederación Hidrográfica del Ebro). This information was completed with several boreholes collected from different private enterprises (Control-7, Entecsa, Z-amaltea, CTA, ESHYG) and from some previous studies carried out for the construction of several roads (MOP 1967, 1970, 1973, 1994, 2000, 2003).

Since doline development is mainly limited to the Ebro valley, where the major concentration of infrastructures exists, boreholes outside the valley were not considered for the model. A number of additional boreholes had to be omitted because their lithological information was missing, fragmentary or imprecise. In the end, about 400 boreholes could be used for modeling the Quaternary–Tertiary boundary. Together with the digital elevation model (DEM), this boundary represents the outer limit of a 3D model body, which was filled with lithologic information. The spatial construction of the model was carried out with Gocad.

Due to the lack of precise information about the permeability of the Quaternary valley deposits, the permeability of the Quaternary cover was qualitatively assessed using lithological information. The approach uses the mean grain size of the encountered layers. Clay, silt, sand and gravel represent grain-size classes with sedimentologically well-defined upper and lower grain-size limits, which allow the determination of a mean grain size. For layers consisting of more than one grain-size class, the mean layer grain size was determined as a function of the proportion of each grain-size class, which in turn was estimated from the lithological layer description. The mean grain size was then spatially interpolated within the previously modeled 3D body, returning a 3D model of the grain size. In this model, areas with a mean grain size in the clay-silt fraction were categorized as low-permeability, whereas the rest was classified as permeable.
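The weighting step can be sketched as a proportion-weighted mean of class grain sizes. The class limits below (in mm) and the use of the arithmetic midpoint of each class are assumptions made for illustration; the text does not state which limits or which kind of mean were actually used.

```python
# Assumed grain-size class limits in mm (hypothetical; not given in the text).
CLASS_LIMITS_MM = {
    "clay": (0.0, 0.002),
    "silt": (0.002, 0.063),
    "sand": (0.063, 2.0),
    "gravel": (2.0, 63.0),
}
# Assumed upper limit of the clay-silt fraction in mm.
SILT_SAND_LIMIT_MM = 0.063


def class_mean_size(name):
    """Arithmetic midpoint of a class (an assumption of this sketch)."""
    lower, upper = CLASS_LIMITS_MM[name]
    return (lower + upper) / 2.0


def layer_mean_grain_size(proportions):
    """Proportion-weighted mean grain size of one layer, in mm.

    proportions: dict mapping class name to its estimated share of the layer.
    Shares are normalized, so they do not have to sum exactly to 1.
    """
    total = sum(proportions.values())
    if total <= 0:
        raise ValueError("proportions must sum to a positive value")
    weighted = sum(class_mean_size(n) * share for n, share in proportions.items())
    return weighted / total


def permeability_class(mean_size_mm):
    """Two-class rule from the text: clay-silt fraction versus the rest."""
    if mean_size_mm < SILT_SAND_LIMIT_MM:
        return "low permeability"
    return "permeable"


# Hypothetical layer descriptions.
silty_clay = layer_mean_grain_size({"clay": 0.5, "silt": 0.5})
sandy_gravel = layer_mean_grain_size({"gravel": 0.7, "sand": 0.3})
```

Note that an arithmetic mean is dominated by the coarse classes, so under these assumptions even a small share of sand or gravel moves a layer into the permeable class; a geometric or logarithmic mean would behave differently.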

For the geomorphological factors, the 1:50,000-scale geological maps from the National Geological Institute (ITGE 1995, 1998) were used to locate the different terrace levels, in order to assess whether dolines are concentrated on particular terrace levels. This information was originally in ArcInfo vector format and required conversion to raster format.

Several hydrogeological variables were also considered. The information about water table level and hydrochemistry was also obtained from the IPA. Since this inventory compiles hydrogeological information from different studies, the information is not consistent in time and space.

In the case of the water table models, only the points with more than 10 measurements were selected. Figure 7 shows the distribution of the points in the study area. To obtain the water table surfaces, the information from the IPA points must be interpolated, that is, a continuous surface must be developed from point data. Interpolation procedures range from simple deterministic mathematical models (inverse distance weighting, trend surface analysis, Thiessen polygons, etc.) to more complex models (geostatistical methods, such as kriging and thin plate splines; Ninyerola et al. 2000). The mean, maximum and minimum water table level values were queried from the database and interpolated with ordinary kriging, a geostatistical method. Geostatistics is a theory of the statistical behavior of natural phenomena that vary in space. Its fundamental concept is the regionalized variable, which can always be represented by a continuous surface over the map plane.

Fig. 7 Location of boreholes for the groundwater table models

For the creation of the groundwater depth models, a subtraction of the groundwater level from the digital elevation model (DEM and groundwater level model with 20 m pixel size) was conducted (expression 4):

$$ X = A - B, $$
(4)

where X is the groundwater depth, A is the DEM of the surface and B is the groundwater elevation above sea level.

In addition, the minimum water table level model was subtracted from the maximum water table level model to obtain the water table variation between summer and winter.
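Both grid subtractions can be sketched with one cell-by-cell helper. Depth is computed as surface elevation minus groundwater elevation, in line with the description of subtracting the groundwater level from the DEM, so positive values mean that the water table lies below the surface. The grids are small hypothetical examples and `None` marks NoData.

```python
def subtract_grids(a, b):
    """Cell-by-cell a - b for two grids of equal shape.

    A cell is NoData (None) in the result if it is NoData in either input.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("grids must have the same shape")
    return [
        [None if (va is None or vb is None) else va - vb for va, vb in zip(ra, rb)]
        for ra, rb in zip(a, b)
    ]


# Hypothetical 2 x 2 grids, elevations in m above sea level.
dem = [[200.0, 201.0], [202.0, None]]
gw_mean = [[195.5, 196.0], [196.0, 197.0]]
gw_max = [[196.5, 197.0], [197.0, 198.0]]
gw_min = [[195.0, 195.5], [195.5, 196.0]]

# Groundwater depth below the surface.
gw_depth = subtract_grids(dem, gw_mean)
# Water table variation between the maximum and minimum level models.
gw_variation = subtract_grids(gw_max, gw_min)
```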

The water table gradient was created using a focal analysis filter. The standard deviation within a 5 × 5 pixel kernel was used to identify the areas where the water table varies most in space, which implies greater flow velocity.
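A focal standard deviation filter of this kind can be sketched as follows. The window is truncated at the grid edges and the population standard deviation is used; both choices are assumptions of this sketch, since the text does not specify how the GIS filter handled them.

```python
import math


def focal_std(grid, size=5):
    """Standard deviation of the values inside a size x size moving window.

    The window is clipped at the grid border and NoData cells (None) are
    ignored. Returns a grid of the same shape.
    """
    half = size // 2
    n_rows, n_cols = len(grid), len(grid[0])
    out = [[None] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            window = [
                grid[i][j]
                for i in range(max(0, r - half), min(n_rows, r + half + 1))
                for j in range(max(0, c - half), min(n_cols, c + half + 1))
                if grid[i][j] is not None
            ]
            if window:
                mean = sum(window) / len(window)
                variance = sum((v - mean) ** 2 for v in window) / len(window)
                out[r][c] = math.sqrt(variance)
    return out


# Hypothetical water table grids (m): flat, gently sloping, steeply sloping.
flat = [[10.0] * 5 for _ in range(5)]
gentle = [[float(c) for c in range(5)] for _ in range(5)]
steep = [[2.0 * c for c in range(5)] for _ in range(5)]

flat_std = focal_std(flat)
gentle_std = focal_std(gentle)
steep_std = focal_std(steep)
```

A flat water table gives zero everywhere, and doubling the slope doubles the focal standard deviation, which is why the filter can serve as a proxy for the hydraulic gradient.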

With respect to the hydrochemistry, points with several measurements were scarce, so the information from all boreholes was used. To avoid erroneous data, the balance error was calculated following the method of Custodio and Llamas (1983), and all points with an inadmissible error were rejected. The data were interpolated by simple kriging and the model was cross-validated. Leave-one-out cross validation was selected because there was not enough information to divide the sample into training and testing sets. This involves using a single observation from the original sample as the validation data and the remaining observations as the training data, and repeating the procedure so that each observation in the sample is used once as the validation data.
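The leave-one-out procedure can be sketched independently of the interpolator. To keep the example self-contained, inverse distance weighting is used here as a stand-in for simple kriging, which is a deliberate substitution and not the method of this study. The sample coordinates and values are hypothetical.

```python
import math


def idw_predict(points, x, y, power=2.0):
    """Inverse distance weighting estimate at (x, y).

    points: list of (x, y, value) tuples. Stand-in interpolator only.
    """
    numerator = denominator = 0.0
    for px, py, value in points:
        distance = math.hypot(px - x, py - y)
        if distance == 0.0:
            return value
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


def leave_one_out_errors(points, predict=idw_predict):
    """Predict each observation from all the others; return the errors."""
    errors = []
    for i, (x, y, value) in enumerate(points):
        training = points[:i] + points[i + 1:]
        errors.append(predict(training, x, y) - value)
    return errors


def rmse(errors):
    """Root mean square of the cross-validation errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical sulphate measurements: (x in m, y in m, concentration in mg/l).
samples = [
    (0.0, 0.0, 900.0),
    (100.0, 0.0, 950.0),
    (0.0, 100.0, 1000.0),
    (100.0, 100.0, 1100.0),
    (50.0, 50.0, 980.0),
]
errors = leave_one_out_errors(samples)
cv_rmse = rmse(errors)
```

Any function with the same signature as `idw_predict` can be passed as `predict`, so a kriging estimator could be evaluated with the same loop.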

Environmental variables

In the study area, the main environmental triggering factor for doline development is the occurrence of storm events. However, the regression analysis does not include this variable because rainstorms are relatively uniformly distributed throughout the study area.

Anthropogenic variables

The cartography of irrigated areas in 2000, which is available online (CHE 2005), was used for locating irrigated land, since the irrigated land in the alluvial plain has not changed since 2000. This information was converted from vector format to raster format.

In addition, the irrigation amount was approximated. A database with information about the availability of water resources for every agrarian administrative division and irrigation system was available. More details about the origin of this information can be found in Cruz et al. (1997). The irrigation area map was intersected with the agrarian administrative division map, which is also available online (CHE 2005), and the water availability values were assigned to every system in every agrarian division. These values only approximate the real water supply by irrigation, since they represent the usual water availability under regular conditions and, as a consequence, the amount of water usually applied. They do not take into account dry periods, when the real availability is reduced. Therefore, if this factor were selected by the model, the doline susceptibility would be exaggerated in dry periods, when the amount of irrigation is lower than that introduced in the model.

The impermeability of some infrastructures (roads and railways) causes water to concentrate at their edges, which is believed to be an important factor in doline development. In addition, breaks in the irrigation network (canals) represent a great supply of water. Many previous studies have asserted that dolines are located in the proximity of irrigation canals. The infrastructure information was therefore obtained from the digital 1:25,000 scale topographic map of the National Geographical Institute (IGN). The information first had to be updated by digitizing, from air photographs, the new roads and railways constructed after the IGN map was created. Models representing the distance to these infrastructures were produced to assess whether doline development probability increases as the distance to these infrastructures decreases. These models were developed with a distance operation in ArcGIS.
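
The idea behind a distance model can be shown with a brute-force sketch. This is not the ArcGIS operation itself: the function name `distance_raster`, the Euclidean metric and the 20 m cell size are assumptions made for illustration.

```python
import math


def distance_raster(feature_mask, cell_size=20.0):
    """Euclidean distance (in map units) from every cell to the nearest
    cell flagged as infrastructure (truthy value in feature_mask)."""
    features = [
        (r, c)
        for r, row in enumerate(feature_mask)
        for c, flagged in enumerate(row)
        if flagged
    ]
    if not features:
        raise ValueError("the mask contains no infrastructure cells")
    return [
        [
            cell_size * min(math.hypot(r - fr, c - fc) for fr, fc in features)
            for c in range(len(row))
        ]
        for r, row in enumerate(feature_mask)
    ]


# Hypothetical rasterized canal occupying the first column.
canal = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
print(distance_raster(canal)[1])
```

The resulting raster can enter the regression as a continuous variable, so that a negative coefficient would indicate that doline probability rises towards the infrastructure.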

Results

To better understand the factors controlling doline distribution, the models of the different factors were visually analyzed together with the doline map. In addition, some of the variables were classified into categories on the basis of previous studies and the visual analysis. Both the continuous and the categorical variables were introduced in the analysis. Table 1 shows the variables introduced in the model and whether each one is continuous or categorical.

Table 1 Variables introduced in the logistic regression

A randomly selected subset of 60% of the total information was introduced in a stepwise logistic regression procedure. In this stepwise selection method, variables are entered based on the significance of the score statistic and removed based on the probability of a likelihood-ratio statistic computed from conditional parameter estimates.

First, the fit of the logistic regression equation was tested using this 60% sample. For the overall model fit, the goodness-of-fit test called “chi-square” in SPSS was used. This test is simply the chi-square difference between the null model (i.e., with the constant only) and the model containing one or more predictors. It is an application of the likelihood ratio test between two nested models and assesses the improvement in fit between the predicted and observed values obtained by adding the predictor(s). In addition, a classification table of observed dolines (0 and 1) against adjusted probability classes was produced to obtain the overall percentage of correct classifications. The dividing point was chosen so that the number of predicted 1s matches the number of 1s observed in the dependent variable.
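
As an illustration of the model chi-square, the following sketch computes the likelihood ratio statistic from observed outcomes and fitted probabilities. The function names and the data are hypothetical, and the sketch does not reproduce the SPSS output.

```python
import math


def log_likelihood(observed, probabilities):
    """Bernoulli log-likelihood of 0/1 outcomes given fitted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(observed, probabilities)
    )


def model_chi_square(observed, fitted):
    """Likelihood ratio statistic: -2 * (LL_null - LL_model)."""
    # The constant-only model predicts the observed proportion of 1s.
    p_null = sum(observed) / len(observed)
    ll_null = log_likelihood(observed, [p_null] * len(observed))
    ll_model = log_likelihood(observed, fitted)
    return -2.0 * (ll_null - ll_model)


# Hypothetical cells: 1 = doline present, 0 = absent.
y = [1, 0, 1, 0]
print(model_chi_square(y, [0.9, 0.1, 0.9, 0.1]))
```

The statistic is zero when the predictors add nothing to the constant-only model and grows as the fitted probabilities move towards the observed outcomes.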

In the final model, a total of 11 variables were introduced, of which eight were categorical and three continuous. Table 2 shows these variables. β is the coefficient of the logistic equation. The Wald test is the usual test for the significance of a single predictor; it tests whether the predictor contributes significantly to predicting the outcome. In this case, all the variables are significant (p < 0.05), as expected from a stepwise procedure. Finally, exp(β) is the factor by which the odds change for a unit increase in the independent variable. This last statistic shows the relative importance of the variables.
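
The interpretation of exp(β) can be checked numerically. In the sketch below the intercept and coefficient are hypothetical values, not those of Table 2, and the single-predictor model is a simplification.

```python
import math


def probability(intercept, beta, x):
    """Logistic model with a single predictor (simplifying assumption)."""
    return 1.0 / (1.0 + math.exp(-(intercept + beta * x)))


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


# Hypothetical coefficients, chosen only for illustration.
intercept, beta = -2.0, 0.7
ratio = odds(probability(intercept, beta, 3.0)) / odds(probability(intercept, beta, 2.0))
print(ratio, math.exp(beta))  # the two values agree up to rounding
```

Whatever the starting value of the predictor, a unit increase multiplies the odds by exp(β); values above 1 increase the odds and values below 1 reduce them.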

Table 2 Variables introduced by the model

According to the model results, the most significant variables are geomorphological: in order of importance, the location of endorheic areas and the different terrace levels (T4, T3, T2 and T1). The secondary variables are the presence of irrigation and the water table gradient.

The next variable in importance is the percentage of low-permeability layers in the Quaternary cover. This only plays a role when classified into two categories. Thus, a low percentage (less than 10%) of low-permeability (impermeable) layers increases doline probability, according to the positive coefficient.

The factors with negative coefficients are of less importance in this case because of their low exp(β). The results suggest that high groundwater sulphate contents reduce doline probability. Doline probability increases in areas with a lower summer water table, which implies a larger supply by irrigation. Finally, doline probability is significantly reduced in areas where the Quaternary sediment is more than 30 m thick. The variables selected by the model represent the doline distribution in the study area, but this does not imply that they are the only ones that take part in the hazard development. This is exemplified by the water table variations. Despite its importance, this variable is not introduced in the model because it is highly correlated with the water table gradient.

The fitted model was validated with the 40% of the information not introduced in the logistic regression by comparing the predicted outcome (probability value) with reality (occurrence or absence of dolines). Because the two groups (presence or absence of dolines) differ greatly in size, which makes the validation results difficult to interpret, all the doline cases and 12,000 randomly selected cases lacking dolines were used, making a total validation sample of 24,059 cases. The resulting confusion (error) matrix (Table 3) yields an overall map accuracy of more than 68% and a Kappa index of agreement of 0.36, that is, an agreement 36% better than that expected by chance. The confusion matrix is commonly used in classification approaches as a way to test model performance (Beguería and Lorente 2003; Martínez-Casasnovas et al. 2004).
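
The overall accuracy and the Kappa index can be derived from a two-class confusion matrix as in the following sketch. The cell counts are hypothetical and are not those of Table 3; the function name is likewise an assumption.

```python
def accuracy_and_kappa(tp, fn, fp, tn):
    """Overall accuracy and Cohen's Kappa for a 2 x 2 confusion matrix.

    tp, fn: observed dolines predicted as doline / as non-doline.
    fp, tn: observed non-dolines predicted as doline / as non-doline.
    """
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical balanced sample of 200 cells.
print(accuracy_and_kappa(tp=68, fn=32, fp=32, tn=68))
```

When the validation sample is roughly balanced, as it is here, the agreement expected by chance is close to 0.5, so an overall accuracy of 0.68 corresponds to a Kappa of about (0.68 - 0.5) / 0.5 = 0.36.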

Table 3 Confusion matrix

Logistic models are frequently used in a classification approach. This implies selecting given values of the response variable and classifying all the cells into one of the groups formed according to these. The method most commonly adopted in the literature is to divide the probability map histogram into different categories based on expert opinion (Lee and Min 2001; Dai and Lee 2002; Ohlmacher and Davis 2003). In the case of division into two groups, the threshold value is normally a probability of 0.5, since the two sample groups are usually similar in size. Where the two groups are very dissimilar, the proportion of ones in the sample (the proportion of dolines, 0.06 in our study area) can be used instead of the 0.5 value (Beguería and Lorente 2003; Martínez-Casasnovas et al. 2004). According to Ayalew and Yamagishi (2005), this way of converting continuous data into two or more categories does not take into account the relative position of a case within the probability map and is neither fully automated nor statistically tested. They suggested comparing classification systems based on quantiles, natural breaks, equal intervals and standard deviation in order to choose the one that best suits the information and the scale of investigation.

Figure 8 shows the doline probability map with the location of existing dolines superimposed, since these need to be considered as areas with a higher doline probability. In this case the map has been classified into four categories using the mean and standard deviation values. Note that the division into categories has been made for visualization purposes and does not imply a categorization into safe and unsafe areas. In fact, the whole study area has a certain probability of doline development.
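
A classification based on the mean and standard deviation can be sketched as follows. The exact break points used for Fig. 8 are not stated, so placing them at the mean minus one standard deviation, the mean, and the mean plus one standard deviation is an assumption, as are the function name and the sample values.

```python
from statistics import mean, pstdev


def classify_by_std(probabilities):
    """Assign each probability to one of four classes (1 = lowest).

    Assumed breaks: mean - sd, mean, mean + sd.
    """
    m = mean(probabilities)
    sd = pstdev(probabilities)
    breaks = [m - sd, m, m + sd]
    # The class is one plus the number of breaks the value exceeds.
    return [1 + sum(p > b for b in breaks) for p in probabilities]


# Hypothetical cell probabilities.
print(classify_by_std([0.0, 0.4, 0.6, 1.0]))
```

Because the breaks depend on the distribution of the map itself, the classes rank the cells relative to one another and carry no absolute meaning of safety.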

Fig. 8
Doline probability map created with the logistic regression technique. Note that the map does not allow a categorization into safe and unsafe areas (see text)

Discussion and conclusions

One of the major advantages of logistic regression is the possibility of introducing categorical variables in the model. Our results corroborate this, as many of the most important variables in the doline distribution were categorical, e.g. geology, represented by the different terrace levels. In addition, it allowed us to introduce anthropogenic factors that are very important for doline probability, namely the location of irrigated land.

In some cases, it is necessary to categorize continuous variables because the changes produced from one unit to another are insignificant or are not constant along the range of values of the variable. This implies that the relation between the dependent and independent variables is not linear. This is the case of the thickness of Quaternary deposits or the percentage of permeable layers. Our results suggest that there is a threshold at 30 m thickness of Quaternary cover. Thicker covers reduce doline probability while, in the range between 0 and 30 m, dolines develop independently of the sediment cover thickness. Similar results can be found in previous studies (Johnson 2005; Simón et al. 1998b; Soriano and Simón 1995).

Several authors have demonstrated, in different mantled karst areas, that the formation of one doline (the “mother”) promotes subsurface conditions that favor the formation of additional dolines (the “daughters”) in the adjacent area (Cooper and Saunders 1999; Drake and Ford 1972; Gutiérrez-Santolalla et al. 2005a; Kaufmann and Quinif 2002; Palmquist 1979).

Downstream of Zaragoza, Gutiérrez-Santolalla et al. (2005a) found that the dolines show a clear tendency to form clusters. They characterized the spatial distribution with the Clark and Evans index (Clark and Evans 1954), which quantifies the clustering or dispersion of elements in a particular area. In the La Puebla de Alfindén sinkhole field, the Clark and Evans index has a value close to 0 (R = 0.00089), indicating a clustered distribution. Gao et al. (2005a, b), after applying the Clark and Evans index to active karst areas in Minnesota, also concluded that the sinkholes tend to be clustered. However, they found that the pattern changes from clustered to random to regular as the scale of the analysis decreases.
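
For reference, the Clark and Evans index is the ratio between the observed mean nearest-neighbour distance and the distance expected for a random pattern of the same density. The sketch below is a minimal version without edge correction; the function name and the point coordinates are hypothetical.

```python
import math


def clark_evans_index(points, area):
    """R = observed mean nearest-neighbour distance / expected distance.

    The expected distance for a random pattern is 1 / (2 * sqrt(density)).
    No edge correction is applied (simplifying assumption).
    """
    n = len(points)
    nearest = [
        min(
            math.hypot(x - u, y - v)
            for j, (u, v) in enumerate(points)
            if j != i
        )
        for i, (x, y) in enumerate(points)
    ]
    observed = sum(nearest) / n
    expected = 1.0 / (2.0 * math.sqrt(n / area))
    return observed / expected


# Hypothetical dolines packed into one corner of a 100 x 100 area.
cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
print(clark_evans_index(cluster, area=10000.0))
```

Values of R close to 0 indicate clustering, values around 1 a random pattern and values above 1 a regular (dispersed) pattern.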

In this respect, we agree with Gutiérrez-Santolalla et al. (2005b) that mapping the areas affected by subsidence is an effective approach in land-use planning to avoid future subsidence damage. Consequently, these areas have to be considered as areas with a high probability of doline development. However, as they pointed out, these maps have some limitations related to the difficulty of mapping all the subsidence areas, which may be masked by the morphosedimentary activity of the fluvial system and by anthropogenic fills. Therefore, the study of the relationship between doline distribution and different environmental factors is very important. Such studies lead to a better recognition of hazardous areas that would otherwise be considered non-hazardous.

As a result, the doline probability map created with GIS and logistic regression technique provides valuable information for land use planning at a regional scale.

However, there are some limiting factors, such as the impossibility of mapping some of the factors that play a role in doline development. This is the case of the previously mentioned salt layers or faults in the Tertiary evaporites, which could not be introduced in the logistic regression because of the lack of information. Thus, more research is still needed to characterize the geology of this area, since the introduction of these factors may improve the doline susceptibility map.

An excellent example of management in subsidence-prone areas can be found in countries such as Slovenia and China, where governments have founded karst research institutes (Veni 1999). A better management approach in the area surrounding Zaragoza city would be the creation of a karst research institute or organization in the Central Ebro Basin. Its main objective should be the creation of a karst database. This database must integrate the karst features mapped in previous studies carried out separately in different areas around the city of Zaragoza by different research groups. Future studies need to integrate information on the geological, environmental and human factors and to create new data in areas where information is limited, such as the Ebro alluvial plain downstream of Zaragoza. Finally, and no less important, active cooperation between the different research groups and land-use managers is needed.