Introduction

Interest in predicting groundwater vulnerability has increased because of the widespread detection of contaminants and the implications for human and aquatic health and resources. Reports of environmental problems associated with mining communities prompted this groundwater vulnerability study of basement aquifers in the Ilesa gold mining area of southwestern Nigeria. The natural vulnerability of aquifers to contamination is a function of space and time (Civita 1987). In most cases, an accurate prediction of groundwater vulnerability is not feasible because of the complexity of groundwater systems. To provide an accurate and reliable vulnerability prediction for a given area, a suitable model that accounts for the subsurface geology, groundwater flow, and pollutant transport in the area needs to be developed.

A fundamental difficulty in groundwater vulnerability prediction modeling is that the processes of groundwater flow and pollutant transport are intertwined, which is reflected in the influencing factors (Shih-Kai et al. 2013). Most of these factors are evaluated by a number of experts using different approaches. The degree to which each factor contributes to groundwater vulnerability is not the same, and it may also vary from one location to another. Furthermore, the effects of all the important factors that can influence groundwater contamination in the area must be integrated to develop a reliable model. Groundwater vulnerability assessment is therefore a spatial problem that requires data input, processing, and solutions from many experts.

A variety of methods have been developed and used for assessing aquifer vulnerability to contaminants (Twarakavi and Kaluarachchi 2005). Previous methods for estimating aquifer vulnerability to contamination may be classified into the following categories: hydrogeological complex and setting (HCS) methods, parametric system or overlay/index methods, numerical or process-based methods, and statistical methods. HCS methods were developed based on criteria found to be representative of groundwater vulnerability under certain hydrogeological conditions (Gogu and Dassargues 2000). The overlay/index models, such as multi-criteria decision analysis (MCDA) in the context of the analytic hierarchy process (AHP) (Adiat et al. 2012; Adiat et al. 2013; Akinlalu et al. 2017; Adiat et al. 2018) and the DRASTIC model (Mohammad 2017; Malik and Shukla 2019; Hassan et al. 2019), are based on combining maps of various physiographic attributes and assigning weights to each attribute to obtain a final score (Connell and Van den Daele 2003; Thapinta and Hudak 2003; Twarakavi and Kaluarachchi 2005). These methods depend largely on data availability and expert judgment rather than on the controlling physical processes (Twarakavi and Kaluarachchi 2005). Numerical or process-based methods are usually more elaborate than simple overlay or index methods. They require analytical and/or numerical solutions to the governing mathematical equations that represent the coupled processes of contaminant transport (Meeks and Dean 1990; Twarakavi and Kaluarachchi 2005). These methods are computationally costly and demand substantial data. Furthermore, process-oriented numerical models are suited to site-specific studies rather than to evaluating vulnerability on a large scale. The HCS, overlay/index, and process-based methods are all unable to capture the probabilistic nature, or uncertainty, of groundwater vulnerability; consequently, validation may be inherently impossible for methods that assess vulnerability outside a probabilistic framework (Worrall 2002). Statistical methods, on the other hand, are flexible and better suited to accommodating uncertainty in the data.

Uncertainty is inherent to predictions of groundwater vulnerability (Loague 1991; Loague et al. 1996), yet few groundwater vulnerability assessments have accounted for, or reported, associated uncertainty. Statistical methods are based on the concept of uncertainty, which is described in terms of probability distributions for the variable of interest (National Research Council NRC 1993). One possible goal in applying statistical methods to vulnerability assessment is to identify variables that can be used to define the probability of groundwater contamination (Burkart et al. 1999). Statistical methods use response variables such as the frequency of contaminant occurrence, contaminant concentration, or contamination probability.

Statistical methods range from simple summary or descriptive statistics of concentrations of targeted contaminants to more complex regression analyses that incorporate the effects of several predictor variables (Worrall 2002; Worrall and Kolpin 2003). A significant benefit of statistical methods is that predictions of vulnerability are expressed in probabilistic terms. However, not all uncertainty is inherently represented in the resulting probabilistic predictions: unavoidable model and data errors propagate through the calculations, so the predictions of vulnerability remain best estimates. It is therefore reasonable to say that groundwater vulnerability is best estimated using statistical approaches because they accommodate the uncertainties and complexities of the hydrogeological environment. Examples of statistical analysis methods used in groundwater resources research are cluster analysis, factor analysis, discriminant analysis, regression analysis, fuzzy recognition, and back propagation (BP) neural networks (Gui and Chen 2007; Chen et al. 2013; Adiat et al. 2020).

One of the common statistical methods for estimating aquifer vulnerability is binary logistic regression, commonly called logistic regression (LR). LR models relate the probability that a contaminant concentration exceeds a threshold concentration to a set of possible influencing variables. LR analysis is a technique for modeling and analyzing several variables: it predicts the probability of a binary or categorical response based on independent (predictive or influencing) variables. Because it is simpler than other analyses, LR has an important place in categorical data analysis. LR is well suited to groundwater vulnerability assessment because the binary response (or the categorical response in the case of ordinal logistic regression) can be established using a threshold that represents a drinking water standard, laboratory detection level, or relative background concentration (Twarakavi and Kaluarachchi 2005). Often, the objective of a groundwater vulnerability assessment is to predict the occurrence of a water quality constituent above a certain level or threshold. The method allows an acceptable model to be developed that defines the relationship between the dependent (predicted, i.e., contaminant) variable and the independent (predictive) variables with the best fit and the fewest variables. LR has been used by researchers to solve problems related to groundwater studies in different geologic environments in various parts of the world. Twarakavi and Kaluarachchi (2005) used ordinal LR to assess aquifer vulnerability to heavy metals in Washington, USA. Ozdemir (2016) adopted LR to map sinkhole susceptibility in Konya, Turkey. Qian et al. (2018) used LR to predict water shortage risk in situations with insufficient data in Beijing, China. Chenini and Msaddek (2019) mapped groundwater recharge susceptibility using LR and bivariate statistical analysis in Tunisia. Kim et al. (2019) used LR to assess the impacts of climate change on a complex river system in South Korea. However, within the literature reviewed for this study, the application of LR to predict or assess groundwater vulnerability to contamination resulting from gold mining activities in a typical basement complex geologic environment has not been reported for the current study area. Consequently, this study uses LR to predict and assess the vulnerability of the aquifer to contaminant(s) in the gold mining area of Ilesa, a typical basement complex terrain of southwestern Nigeria. The Ilesa Schist belt is one of the major schist belts in Nigeria that have been extensively mapped and studied in detail. The belt contains several occurrences of primary and alluvial gold workings (Akinlalu et al. 2018). Gold mining operations started in the area in the early 1950s (Makinde et al. 2014). This has resulted in various degrees of land degradation (Adeoye 2016) and groundwater contamination (Makinde et al. 2016). The objectives of the study are the following:

  i. generate factors/parameters (independent variables) that can be used to predict aquifer contamination, if there is any;
  ii. identify factors (dependent variable(s)) responsible for the probability of groundwater contamination;
  iii. develop an empirical (LR) model and map that predict the probability of occurrence of the contaminant(s) (identified in ii above) with respect to a threshold level in the groundwater resources of the study area; and
  iv. quantify the prediction accuracy and reliability of the model developed.

Study area description

The study area is located in the southwestern part of Ilesa, Osun State, Nigeria. It lies between longitudes 4° 38′ 0″ E and 4° 43′ 0″ E and latitudes 7° 31′ 30″ N and 7° 36′ 0″ N (Fig. 1). The area is sparsely inhabited, and the main economic activities of the inhabitants are agriculture and mining. Numerous minerals, such as gold (Au), lead (Pb), iron (Fe), nickel (Ni), cadmium (Cd), chromium (Cr), copper (Cu), zinc (Zn), and manganese (Mn), have been reported by the Nigeria Geological Survey Agency (NGSA) to occur in the area (Adekoya et al. 2003). The Ilesa Schist belt of southwestern Nigeria has complex geology and mineralization potential. It is one of the major schist belts in Nigeria and has been extensively mapped and studied in detail; the others are the Maru, Anka, Zuru, Kazaure, Kusheriki, Zungeru, Kushaka, Iseyin, Oyan, and Iwo schist belts. The belt contains several occurrences of primary and alluvial gold workings. The primary gold commonly occurs in quartz veins within several lithologies, and the host rocks to the veins include fine-grained mica schists, amphibolite schists, talc tremolite schists, and several varieties of gneisses (Akinlalu et al. 2018). Gold mining operations started in the study area in the early 1950s (Makinde et al. 2014). More than fifty mining sites located in various parts of the study area were visited. Most of the mines were open pits with an average depth of 3.4 m, and an estimated 25.8 ha of land was degraded across the mining sites (Adeoye 2016).

Fig. 1 Geological map of the study area showing borehole/well and VES locations (modified after geological map of Ilesa SW Sht. 243)

In terms of structural features, lithology, and mineralization, the schist belts of Nigeria show considerable similarities to the Archaean greenstone belts (Rahaman 1989; Olusegun et al. 1995). The area is known to have variable metamorphic mineral assemblages ranging from greenschist to amphibolite facies (Ajibade et al. 1987). Four major rock types are present in the area: the amphibolite and amphibolite schist, the undifferentiated migmatite gneiss, the quartzite, and the quartz schist (Fig. 1). Geology is an important factor controlling groundwater accumulation in an environment, in terms of both quality and quantity. The schist belts, which form part of the Precambrian basement rock units, are notable for clay-rich weathered horizons. The degree of fracturing and weathering of the rocks influences the rate of percolation and infiltration.

The topography of the area varies from heavily forested mountains and gently rolling hills to a vast stream/river coastal plain. The topographic elevation of the area ranges from 278 to 490 m above mean sea level. The drainage pattern of the area is largely dendritic, typical of highly fractured bedrock with flat and undulating terrain.

Methodology

The study was undertaken in two phases: data acquisition and processing, and assessment of groundwater vulnerability through the application of logistic regression. The research integrates ancillary data, water samples, remote sensing data, and subsurface geophysical data to derive the dependent and independent variables. Logistic regression techniques were applied to the results obtained from the analysis of these data to develop groundwater vulnerability prediction models, with a view to selecting a final model based on maximization of test statistics.

Data acquisition and processing techniques

The ancillary data utilized for the study were the geological map, the soil distribution map, and the borehole information of the available wells drilled across the area. The geological and soil distribution maps were georeferenced, clipped to the required boundary, and digitized.

The soil distribution map was categorized based on the two soil associations present in the area. The remote sensing data utilized for the study were the Landsat ETM image, the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) image, and the digital elevation model (DEM) image. The lineaments and drainage were extracted from the Landsat TM images, while the DEM was used to produce the slope map of the area. The remote sensing data were processed using ArcGIS 10.1, ENVI 4.5, and PCI Geomatica 2012. Computer-assisted methods for the detection of structural lineaments were exclusively based on edge enhancement or spatial filtering techniques (directional and/or gradient filters). These methods produce edge maps that require further processing for lineament segments to appear with one-pixel thickness. Optimal edge detectors, e.g., the Canny algorithm (Canny 1986), have already been applied successfully to natural scenes. A composite band combination was used (Süzen and Toprak 1998). The directional filtering and edge sharpening enhancement algorithms of PCI Geomatica were used to extract the lineaments for analysis (Abdullah et al. 2010). Slope was extracted from the DEM using the slope algorithm of ArcGIS 10.1. The lineament and drainage densities were obtained by dividing the total length of the lineaments and of the drainage, respectively, by the area under consideration (Adiat et al. 2012, 2013). The kriging technique was used to produce the lineament and drainage density maps.
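The density calculation described above reduces to a length-over-area ratio. The following minimal Python sketch illustrates it; the function name, the kilometre units, and the sample segment lengths are illustrative assumptions and are not taken from the study's data.

```python
def feature_density(segment_lengths_km, area_km2):
    """Return total feature length per unit area (km per km^2).

    Applies equally to lineament segments and to drainage segments.
    """
    if area_km2 <= 0:
        raise ValueError("area_km2 must be positive")
    return sum(segment_lengths_km) / area_km2


# Hypothetical segment lengths (km) mapped inside a 4 km^2 cell.
lineament_density = feature_density([1.2, 0.8, 2.0], 4.0)
print(f"lineament density: {lineament_density:.2f} km/km^2")
```

Densities computed per cell in this way would then be the point values that an interpolation step such as kriging turns into a continuous map.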

Electrical resistivity data were acquired using the Ohmega Terrameter and its accessories. A total of seventy (70) vertical electrical sounding (VES) stations were occupied (Fig. 1). The Schlumberger array was adopted with electrode spacing (AB/2) ranging from 1 to 100 m. The coordinates of the measurement stations were taken using Garmin GPS 7.0. The data acquired were processed and plotted. Quantitative analysis, involving partial curve matching and computer iteration using the WinRESIST software (Vander Velpen 1998), was adopted to determine the geo-electric characteristics of the study area. From this information, aquifer resistivity, aquifer thickness, unsaturated zone thickness, total longitudinal conductance of the unsaturated zone, and total transverse resistance of the unsaturated zone were estimated. The aquifers were identified using the resistivity range of the subsurface layers as the criterion, guided by the well information obtained from the area. The unsaturated zone thickness was calculated as the sum of the thicknesses of the overlying layers. The longitudinal conductance (S) and transverse resistance (TR) of the unsaturated zone were calculated from the resistivity results using Eqs. 1 and 2 below:

$$ S=\sum \limits_{i=1}^n\frac{h_i}{\rho_i} $$
(1)

The transverse resistance (TR) was determined from the layer parameters as:

$$ \mathrm{TR}=\sum \limits_{i=1}^n{\rho}_i{h}_i $$
(2)

where ρi and hi are the resistivity and thickness of the ith layer, respectively, and n is the number of layers in the unsaturated zone.
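As a worked illustration of Eqs. 1 and 2, the short Python sketch below sums the layer contributions for one sounding. The function name and the two-layer example are hypothetical and are not VES results from the study.

```python
def unsaturated_zone_parameters(layers):
    """Compute S (Eq. 1) and TR (Eq. 2) for the unsaturated zone.

    layers: iterable of (thickness_m, resistivity_ohm_m) pairs, one per
    geo-electric layer above the aquifer.
    """
    layers = list(layers)
    if any(h < 0 or rho <= 0 for h, rho in layers):
        raise ValueError("thickness must be >= 0 and resistivity > 0")
    s = sum(h / rho for h, rho in layers)    # Eq. 1: sum of h_i / rho_i
    tr = sum(rho * h for h, rho in layers)   # Eq. 2: sum of rho_i * h_i
    return s, tr


# Hypothetical two-layer unsaturated zone: 1 m at 100 ohm-m, 4 m at 200 ohm-m.
S, TR = unsaturated_zone_parameters([(1.0, 100.0), (4.0, 200.0)])
print(S, TR)
```

For this example S = 1/100 + 4/200 = 0.03 and TR = 100 × 1 + 200 × 4 = 900 Ωm², which shows how a thick, resistive cover raises TR while contributing little to S.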

Water samples were collected from ten (10) randomly selected domestic drinking water wells at the mining sites and their host communities (Fig. 1). The depths of the wells vary from 10 to 15 m, and all the wells tap water from a localized unconfined aquifer. The water samples were collected on April 21, 2016. Each plastic bottle (2 L) was washed with dilute HCl (0.5 mol/dm³) and rinsed with distilled water. The samples, stored in the rinsed plastic bottles, were taken to the laboratory for analysis to determine the safety or otherwise of the groundwater resources of the area. In the laboratory, the samples were digested for the water quality tests. Physico-chemical tests were performed on all the water samples for the following parameters: temperature, turbidity, conductivity, pH, chloride, total hardness, sulphate, nitrate, phosphate, total solids, total dissolved solids, total suspended solids, and total alkalinity. In addition to these parameters, some inorganic metals (Na, K, Ca, Mg, Zn, Fe, and Cu) were also tested. Atomic absorption spectrometry was also used to test the samples for heavy metals such as Cd, Mn, and Pb. The tests were conducted at the Central Research Laboratory of the Federal University of Technology, Akure, Nigeria. To determine whether the water in the study area was contaminated or safe for consumption, the water quality results were compared with the maximum permissible levels for safe drinking water in the Nigerian Standard for Drinking Water Quality Threshold values guideline (NSDWQ) 2007. Kurtosis and Spearman’s rank correlation analysis were employed to determine, respectively, the non-normality of the physico-chemical parameters and major metallic ions obtained from the water samples and the relationship between the two input variables at the two-tailed significance level (α = 0.05). The results of this analysis produce the dependent variables used for groundwater vulnerability modeling.

Methodologies/steps of logistic regression as adopted in the study

The concept and procedures of logistic regression require several steps to be conducted and this has been explained in detail in Park (2013). Some of these steps found to be suitable to the nature and structure of the data set adopted for this study are presented as follows:

  1. Examination of the basic assumptions of logistic regression, which include:
     (a) binary categorization of the dependent variable and
     (b) examination of the non-normality of the dependent variables and the relationship between the dependent and independent variables
  2. Development of the logistic regression prediction model
  3. Statistical assessment of the prediction model developed, which involves (a) model significance; (b) results for the Hosmer–Lemeshow goodness-of-fit test statistic, R-square values, and model accuracy; and (c) assessment of the reliability of the prediction model
  4. Groundwater vulnerability prediction map and
  5. Validation of the groundwater vulnerability prediction map

The Statistical Package for the Social Sciences (SPSS) was used for the statistical analysis.

Results and Discussions

Independent variables utilized in groundwater vulnerability modeling

The results of the ancillary data are discussed in terms of the independent variables utilized in groundwater vulnerability modeling. The percentage of clay and the particle size distribution of each soil association were adopted to establish the topsoil characteristics of the study area (Ogunsanwo 1989). Two soil series (Itagunmodi and Egbeda) are present in the study area.

The borehole records show that there are two aquifer systems in the area: an unconfined aquifer and a confined aquifer. The unconfined aquifer occurs at depths of 10 to 15 m, and the confined aquifer at 30–40 m. Most of the hand-dug wells in the study area terminate in the unconfined aquifer layer, while the boreholes terminate in the confined aquifer layer. Therefore, hand-dug wells are more susceptible to groundwater contamination than the boreholes in the area.

The independent variables obtained from the remote sensing data are lineament, drainage, and slope representing geomorphological parameters that influence groundwater vulnerability.

The lineaments are concentrated in the southern and western parts of the study area, with few lineaments in the northern and eastern parts. The study area is relatively dense in lineaments, and the lineament density is highest in the eastern and central parts of the study location (Fig. 2). Groundwater in areas with high lineament density is relatively vulnerable to surface contaminants because of the secondary porosity and permeability developed by the lineament features.

Fig. 2 Lineament density map of the study area

The drainage system is largely dendritic, typical of structurally controlled drainage along the sheared zone of metamorphic rock. The drainage system in an area depends on the slope, the nature and attitude of the bedrock, and the regional and local fracture patterns. The study area is well drained. Areas of high drainage density indicate relatively poor groundwater infiltration (Fig. 3), which implies that groundwater in such areas is less vulnerable to surface contaminants. The dominant direction of the drainage pattern in the area is southeast–northwest, which suggests that the rivers and streams are structurally controlled. Four slope classes were obtained in the area: 0–2, 2–8, 8–15, and 15–30, representing flat, undulating, rolling, and moderately steep slopes, respectively (Adiat et al. 2012). The study area is largely characterized by flat to undulating slopes, with little runoff and high infiltration. Areas with low slope tend to retain water for long periods, which favors infiltration of recharge water and contaminant migration. Therefore, the flat to undulating slope characterizing the study area suggests that groundwater in most of the area is relatively vulnerable to contamination.

Fig. 3 Drainage density map of the study area

The geophysical parameters that influence groundwater vulnerability, as obtained from the interpretation of the VES data, are the unsaturated zone thickness, aquifer resistivity, aquifer thickness, longitudinal conductance, and transverse resistance. Based on the depth of occurrence, i.e., the thickness of the unsaturated zone, the aquifers in the area can be categorized into shallow and deep aquifers, with unsaturated zone thicknesses of 1.2–10 m and 10.1–42.8 m, respectively. Deep-seated aquifers are characterized by a thick unsaturated zone. Groundwater in the deep-seated aquifers is more protected because contaminants take longer to percolate into the aquifer, whereas the shallow aquifers are more vulnerable to contamination because contaminants percolate into them within a very short time.

Three aquifer types were identified in the study area. The aquifer media were delineated based on the resistivity values of the geo-electric layers obtained from the study. The resistivity ranges of 67–150 Ωm, 150–600 Ωm, and 600–859 Ωm were classified as weathered basement, fractured basement, and partly weathered basement aquifers, respectively. The aquifer thicknesses vary between 1.2 and 42.8 m. In general, the thicker the aquifer, the higher the transmissivity of the aquifer media and, consequently, the greater the pollution potential. The unsaturated zone layer constitutes the main protective unit.
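The resistivity-based delineation above amounts to a range lookup. The following Python sketch shows one way to express it; the function name is hypothetical, and the handling of the shared boundary values (150 and 600 Ωm are assigned to the upper class) is an assumption, since the text gives only the ranges.

```python
def aquifer_type(resistivity_ohm_m):
    """Map an aquifer-layer resistivity to the aquifer type reported above.

    Assumption: boundary values 150 and 600 ohm-m fall in the upper class.
    """
    if 67 <= resistivity_ohm_m < 150:
        return "weathered basement"
    if 150 <= resistivity_ohm_m < 600:
        return "fractured basement"
    if 600 <= resistivity_ohm_m <= 859:
        return "partly weathered basement"
    return None  # outside the resistivity range reported for the aquifers


for rho in (90, 320, 700, 1500):  # hypothetical layer resistivities
    print(rho, aquifer_type(rho))
```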

The total longitudinal conductance and total transverse resistance of the unsaturated zone were used to characterize the study area. The total longitudinal conductance map was grouped into four vulnerability classes based on the model of Antonio and Richard (2014). The four classes are < 0.1, 0.1–0.3, 0.3–0.7, and 0.7–2.5, representing extreme, high, moderate, and low vulnerability, respectively. Areas with low longitudinal conductance have high permeability and are more vulnerable. The study area is mainly characterized by the extreme to high vulnerability classes (Fig. 4).
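The four-class grouping of total longitudinal conductance can be sketched as a threshold lookup. In the Python example below, the names are hypothetical, and two choices are assumptions rather than statements from the source: a value exactly on a class boundary is assigned to the less vulnerable class, and values above 2.5 are treated as "low".

```python
from bisect import bisect_right

# Upper bounds of the extreme, high, and moderate classes.
CLASS_BOUNDS = [0.1, 0.3, 0.7]
CLASS_LABELS = ["extreme", "high", "moderate", "low"]


def vulnerability_class(total_longitudinal_conductance):
    """Return the vulnerability class for a total longitudinal conductance."""
    if total_longitudinal_conductance < 0:
        raise ValueError("conductance cannot be negative")
    index = bisect_right(CLASS_BOUNDS, total_longitudinal_conductance)
    return CLASS_LABELS[index]


for s in (0.05, 0.1, 0.5, 2.0):  # hypothetical conductance values
    print(s, vulnerability_class(s))
```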

Fig. 4 Total longitudinal conductance map of the study area

The total transverse resistance of the unsaturated zone of the study area was classified into high and low transverse resistance areas. Values above 1000 Ωm² are classified as high transverse resistance, while values below 1000 Ωm² are classified as low transverse resistance. High total transverse resistance dominates the study area, with the exception of a few pockets of low total transverse resistance in the central part. Areas with high total transverse resistance are classified as areas of low infiltration because of their low permeability. Consequently, these areas are less vulnerable to surface contaminants.

The results of the water chemistry laboratory analysis are presented in Table 1. The physico-chemical parameters evaluated were temperature, turbidity, conductivity, pH, chloride, total hardness, sulphate, nitrate, phosphate, total solids, total dissolved solids, total suspended solids, and total alkalinity. In addition to these, the concentrations of major metals, which include sodium, calcium, magnesium, and iron, and of heavy metals, which include copper, zinc, cadmium, lead, and manganese, were also evaluated. The results of the major and heavy metal analysis are presented in Table 2. The results were compared with the maximum permissible levels for safe drinking water established by the Nigerian Standard for Drinking Water Quality Threshold values guideline (NSDWQ) 2007 to determine which of the physico-chemical parameters and major and heavy metals present in the water samples exceeded the maximum permissible level. All the physico-chemical parameters are within the permissible limit. The results of the comparison of the major and heavy metals with the maximum permissible level are presented in Table 3. The table shows that Mg, Cd, and Pb exceeded the permissible limit in all water samples in which they were detected.

Table 1 Physio-chemical parameters obtained from the samples collected from the study area on April 16, 2016
Table 2 Major metals and heavy metal concentration obtained from the samples collected from the study area on April 16, 2016
Table 3 Comparison result of major and heavy metal concentration obtained from the water samples collected from the study area with maximum permitted level

The concentrations of Na, Ca, Fe, Cu, and Mn are within the permissible limit in all water samples. The Zn concentration, on the other hand, is within the permissible limit in some samples and exceeds it in others. This implies that zinc (Zn) concentration is the only dependent variable with two categorical outcomes. It also suggests a relationship between the mining activities and the high zinc concentrations in the area.

Results of logistic regression as adopted in this study are as follows

Results of binary categorization of dependent variable

In logistic regression model development, the dependent variable must have two categorical outcomes. From the results presented in Table 3, zinc concentration is the only dependent variable that satisfies this condition, with its two possible outcomes coded as 0 and 1 (Table 4). Thus, zinc concentration was selected as the dependent (predicted) variable, i.e., the contaminant used for the regression model development. The convention is to associate 1 with “success” (i.e., the vulnerability test is passed; the maximum permitted level of zinc concentration is not exceeded) and 0 with “failure” (i.e., the vulnerability test is failed; the maximum permitted level of zinc concentration is exceeded), as presented in Table 4.

Table 4 Binary categorization of zinc concentration with the maximum permitted level as the threshold for categorization
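The binary coding convention described above can be sketched in a few lines of Python. The 3 mg/L threshold is the maximum permitted zinc level used later in the text; the well labels and concentrations below are hypothetical and are not values from Table 4, and treating a concentration exactly at the limit as "not exceeded" is an assumption.

```python
ZINC_LIMIT_MG_L = 3.0  # maximum permitted zinc concentration used as threshold


def zinc_code(concentration_mg_l):
    """Return 1 ("success": limit not exceeded) or 0 ("failure": exceeded)."""
    return 1 if concentration_mg_l <= ZINC_LIMIT_MG_L else 0


# Hypothetical zinc concentrations (mg/L) for three wells.
samples = {"A": 1.2, "B": 3.0, "C": 4.7}
codes = {well: zinc_code(value) for well, value in samples.items()}
print(codes)
```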

Results of examination of the non-normality of the dependent variables and relationship between the dependent and independent variables

The results of the non-normality tests for the physico-chemical parameters and major ion concentrations show that all the kurtosis values deviate from zero, which indicates that the datasets are not normally distributed. This does not prevent their use in logistic regression modeling, because logistic regression assumes neither normally distributed variables nor a linear relationship between the independent and dependent variables (Park 2013).

The non-normality further suggests that the major ion concentrations in the water samples are not from the same aquifer system, which points to a disjointed relationship between the aquifer systems in the study area. It also reflects the non-parametric nature of the groundwater system in the study area. Since none of the data were normally distributed, Spearman’s rank correlation coefficient was used to determine the relationship between the dependent variable and each of the independent (predictive) variables. The results show that five (5) independent (predictive) variables (percent clay in soil, drainage, slope, unsaturated zone thickness, and total longitudinal conductance) were found to correlate with the dependent variable (zinc concentration). Their respective correlation coefficients at the two-tailed significance level (α = 0.05) are −0.699, −0.047, −0.009, −0.535, and 0.817. On this basis, they were treated as statistically significant and used in the logistic regression model development.
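For readers who wish to reproduce this kind of screening outside a statistics package, the following self-contained Python sketch computes Spearman's rank correlation (as the Pearson correlation of average ranks) and a kurtosis value. All function names and sample values are hypothetical. The kurtosis shown is the simple population form; statistical packages often report a bias-corrected sample version, so their values may differ.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def excess_kurtosis(values):
    """Population excess kurtosis m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3


# Hypothetical predictor values and binary zinc codes for five wells.
conductance = [0.05, 0.20, 0.40, 0.90, 1.50]
zinc_codes = [0, 0, 1, 1, 1]
print(spearman_rho(conductance, zinc_codes))
print(excess_kurtosis(conductance))
```

Ranking before correlating is what makes the measure insensitive to the non-normal distributions noted above, which is why it suits a binary-coded response.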

Results of logistic regression prediction model development

The final model includes the following independent variables: total longitudinal conductance (x1), unsaturated zone thickness (x2), slope (x3), percent of clay in soil (x4), and drainage (x5), as presented in Eqs. 3 and 4.

$$ \mathrm{Logit}\ (P)=\ln \left(\frac{P}{1-P}\right)=\alpha +{b}_1{x}_1+{b}_2{x}_2+{b}_3{x}_3+{b}_4{x}_4+{b}_5{x}_5 $$
(3)

Therefore,

$$ P=\frac{e^{\left(\alpha +{b}_1{x}_1+{b}_2{x}_2+{b}_3{x}_3+{b}_4{x}_4+{b}_5{x}_5\right)}}{1+{e}^{\left(\alpha +{b}_1{x}_1+{b}_2{x}_2+{b}_3{x}_3+{b}_4{x}_4+{b}_5{x}_5\right)}} $$
(4)
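The step from Eq. 3 to Eq. 4 can be made explicit by writing z for the linear predictor; the symbol z is shorthand introduced here for illustration and does not appear in the original equations:

$$ \begin{aligned} z &=\alpha +\sum \limits_{i=1}^5{b}_i{x}_i\\ \ln \left(\frac{P}{1-P}\right)=z\ &\Rightarrow\ \frac{P}{1-P}={e}^z\\ P&={e}^z\left(1-P\right)\\ P\left(1+{e}^z\right)&={e}^z\\ P&=\frac{e^z}{1+{e}^z}=\frac{1}{1+{e}^{-z}} \end{aligned} $$

The last form shows that P always lies between 0 and 1 and equals 0.5 when z = 0.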

The constant (intercept) α of the prediction model is 21.323, and the coefficient bi of each predictive variable is the log odds obtained for that independent variable in the final model. The log odds coefficients of total longitudinal conductance, unsaturated zone thickness, slope, percent of clay in soil, and drainage are 193.397, −2.481, −2.193, 25.156, and −11.933, respectively. From the log odds coefficients, the contribution of each independent variable to the variation of the dependent variable was estimated. Substituting these values into Eq. 4 gives

$$ P=\frac{e^{\left(21.323+193.397{x}_1-2.481\ {x}_2-2.193\ {x}_3+25.156{x}_4-11.933{x}_5\right)}}{1+{e}^{\left(21.323+193.397{x}_1-2.481\ {x}_2-2.193\ {x}_3+25.156{x}_4-11.933{x}_5\right)}} $$
(5)

The values of the predictive variables for each well point were substituted into Eq. 5, and the resulting predicted probability (p) that the dependent variable (zinc concentration) does not exceed 3 mg/L in the groundwater samples of the study area is presented in Table 5. If the p value obtained (last column of Table 5) is close to 1 (i.e., 0.5 ≤ p ≤ 1.0), as at W1, W2, W5, W6, W7, and W10, the zinc concentration is predicted to be within the maximum permitted level (i.e., zinc concentration ≤ 3 mg/L). On the other hand, if the p value obtained is close to 0 (i.e., 0 ≤ p ≤ 0.4), as at W3, W4, W8, and W9, the zinc concentration is predicted to be above the maximum permitted level (i.e., zinc concentration > 3 mg/L).

Table 5 Probability prediction of zinc concentration not exceeding 3 mg/L in groundwater sample of the study area using the model in Eq. 5

Also, the odds ratio of each independent variable was calculated by exponentiating its regression coefficient “b”, i.e., exp(b).

$$ \mathrm{Odds}\ \mathrm{ratio}=\exp \left({b}_i\right) $$
(6)

The odds ratios of total longitudinal conductance, unsaturated zone thickness, slope, percent of clay in soil, and drainage are 9.800e8, 0.084, 0.112, 8.416e10, and 0.002, respectively. The significance of the odds ratio can be expressed in terms of the change in odds: when an independent variable increases by one unit, the odds of the predicted outcome are multiplied by the odds ratio, with the other variables held constant. An odds ratio greater than 1 therefore increases the odds, and an odds ratio less than 1 decreases them. Accordingly, an increase in total longitudinal conductance or percent of clay in soil will significantly increase the odds of the groundwater sample not exceeding 3 mg/L, by factors of 9.800e8 and 8.416e10, respectively. An increase in unsaturated zone thickness or slope will decrease these odds, multiplying them by 0.084 and 0.112, respectively, while an increase in drainage will decrease them most strongly, multiplying them by 0.002.
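Eq. 6 can be checked directly from the coefficients of Eq. 5. The sketch below is a minimal illustration (the dictionary and function names are introduced here, not taken from the study). Exponentiating the rounded coefficients reproduces the reported ratios for unsaturated zone thickness, slope, and percent clay in soil; the ratios reported for total longitudinal conductance and drainage are not reproduced this way and presumably come from the statistical software output.

```python
from math import exp

# Log-odds coefficients as reported for Eq. 5.
coefficients = {
    "total longitudinal conductance": 193.397,
    "unsaturated zone thickness": -2.481,
    "slope": -2.193,
    "percent clay in soil": 25.156,
    "drainage": -11.933,
}

def odds_ratio(b):
    """Odds ratio for a one-unit increase in a predictor with coefficient b (Eq. 6)."""
    return exp(b)

odds_ratios = {name: odds_ratio(b) for name, b in coefficients.items()}

for name, ratio in odds_ratios.items():
    # A ratio above 1 multiplies the odds upward; a ratio below 1 multiplies them downward.
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: exp(b) = {ratio:.3e} ({direction} the odds)")
```

The sign of each coefficient fixes the direction of the effect: positive coefficients give ratios above 1, and negative coefficients give ratios between 0 and 1.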

Results of statistical assessment of the developed prediction model

Results of model significance

The statistical assessments used to evaluate the prediction model are presented in Table 6. The significance tests for the model were evaluated at the α = 0.05 level. The Wald chi-square values of total longitudinal conductance, unsaturated zone thickness, slope, percent of clay in soil, and drainage were 1.07, 1.13, 0.42, 0.61, and 0.41, respectively, while the corresponding P values were 0.03, 0.029, 0.052, 0.035, and 0.052. Total longitudinal conductance, unsaturated zone thickness, and percent of clay in soil are therefore statistically significant (P ≤ 0.05), while slope and drainage (P = 0.052) fall marginally above the threshold; all five variables were treated as having a significant effect.

Table 6 Various statistical assessments utilized to assess the predicted model

Results of Hosmer–Lemeshow goodness-of-fit test statistic, R-square values, and model accuracy

The P value associated with the Hosmer–Lemeshow goodness-of-fit test for the model was 0.99. This value, being greater than 0.05, indicates that the model estimates fit the original data at an acceptable level. The R-square values (Cox and Snell R square and Nagelkerke R square) for the model were 0.65 and 0.87, respectively, indicating that the model had moderately strong predictive power. The overall model prediction accuracy was 85.7%, indicating a good fit. Because of this satisfactory assessment, and hence strong prediction capability, the model was chosen as the final model for the study. Based on this level of reliability, the model can be used to predict the probability of zinc concentration being above or below 3 mg/L in areas where water samples were not taken, provided the values of the independent variables in those areas are known.
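For readers unfamiliar with the two pseudo R-square measures, the sketch below shows their standard definitions in terms of the log-likelihoods of the intercept-only and fitted models. The log-likelihood values and the sample size are hypothetical and were chosen only for illustration; they are not outputs of the study.

```python
from math import exp

def cox_snell_r2(ll_null, ll_model, n):
    """Cox and Snell R^2 = 1 - (L0 / L1)^(2/n), computed from log-likelihoods."""
    return 1.0 - exp(2.0 * (ll_null - ll_model) / n)

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke R^2: Cox and Snell R^2 divided by its maximum, 1 - L0^(2/n)."""
    max_r2 = 1.0 - exp(2.0 * ll_null / n)
    return cox_snell_r2(ll_null, ll_model, n) / max_r2

# HYPOTHETICAL log-likelihoods for a 10-observation fit (not the study's values).
n_obs = 10
ll_null = -6.73   # intercept-only model
ll_model = -1.50  # model with predictors
r2_cs = cox_snell_r2(ll_null, ll_model, n_obs)
r2_n = nagelkerke_r2(ll_null, ll_model, n_obs)
print(round(r2_cs, 2), round(r2_n, 2))
```

The Cox and Snell measure cannot reach 1 for a binary outcome, which is why the Nagelkerke value is always at least as large.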

Results of the assessment of the reliability of the prediction model

Tables 4 and 5 are used to explain the results of the accuracy assessment of the developed model. Whenever the vulnerability test is passed (i.e., the Zn concentration maximum permitted level is not exceeded, as shown in the second column of Table 4), the value of p, shown in the last column of Table 5, is expected to be close to one (i.e., 0.5 ≤ p ≤ 1.0). If the vulnerability test is failed (i.e., the Zn concentration maximum permitted level is exceeded, as shown in the second column of Table 4), the value of p is expected to be close to zero (i.e., 0 ≤ p ≤ 0.4).

It was observed from Table 5 that the measured zinc concentrations agreed with the model-predicted probability that Zn concentration ≤ 3 mg/L in nine out of the ten locations. The disagreement observed at location W5 (Table 4) may be due to other hydrogeological factors which, although not significant in the final model, might contribute to the zinc concentration being greater than 3 mg/L in the groundwater. On this basis, the probability prediction model is considered both accurate and reliable, with a percentage reliability of 90%.
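The 90% figure is a simple agreement rate between predicted and observed classes. A minimal sketch of that calculation follows. The predicted classes come from the probability ranges quoted for Eq. 5; the observed classes are an assumption reconstructed from the statement that only W5 disagrees, since Table 4 is not reproduced here.

```python
# Predicted class per well, from the probability ranges quoted for Eq. 5:
# 1 = Zn predicted within the permitted level (p >= 0.5), 0 = predicted above it.
predicted = {"W1": 1, "W2": 1, "W3": 0, "W4": 0, "W5": 1,
             "W6": 1, "W7": 1, "W8": 0, "W9": 0, "W10": 1}

# Observed class per well. ASSUMPTION: reconstructed from the text, which
# reports a single disagreement at W5 (measured Zn above 3 mg/L).
observed = dict(predicted)
observed["W5"] = 0

def percent_agreement(pred, obs):
    """Share of wells (in %) where predicted and observed classes coincide."""
    matches = sum(1 for well in pred if pred[well] == obs[well])
    return 100.0 * matches / len(pred)

mismatches = [w for w in predicted if predicted[w] != observed[w]]
reliability = percent_agreement(predicted, observed)
print(reliability, mismatches)
```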

Results of the groundwater vulnerability prediction map

The study area was divided into a grid with a cell size of 500 m, with the center of each cell used as the measuring point for that cell. The values of the independent variables at each grid point were estimated and substituted into the model equation to obtain the predicted probability, which was used to produce the zinc concentration probability prediction (groundwater vulnerability prediction) map shown in Fig. 5. High concentrations of zinc (i.e., above the permissible level), typical of contamination, dominated the eastern, western, central, south-western, and north-eastern parts of the study area. Most of the parts predicted by the model to have high zinc concentration are communities where gold mining activities are taking place.
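The gridding step can be sketched as follows. This is only an illustration under stated assumptions: the bounding box is a made-up extent in projected metres, and `sample_predictors` is a hypothetical stand-in for reading the five predictor layers in a GIS (here it simply returns the W6 values everywhere). Only the 500 m cell size and the coefficients of Eq. 5 come from the text.

```python
from math import exp

CELL = 500.0  # grid size in metres, as stated in the text

# Coefficients of Eq. 5: intercept, then b1..b5 for x1..x5.
INTERCEPT = 21.323
COEFFS = (193.397, -2.481, -2.193, 25.156, -11.933)

def grid_centres(xmin, ymin, xmax, ymax, cell=CELL):
    """Yield the centre (x, y) of every full cell inside the bounding box."""
    nx = int((xmax - xmin) // cell)
    ny = int((ymax - ymin) // cell)
    for j in range(ny):
        for i in range(nx):
            yield (xmin + (i + 0.5) * cell, ymin + (j + 0.5) * cell)

def probability(predictors):
    """Eq. 5 evaluated for one tuple (x1, ..., x5)."""
    z = INTERCEPT + sum(b * x for b, x in zip(COEFFS, predictors))
    # Equivalent to e^z / (1 + e^z), written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    return exp(z) / (1.0 + exp(z))

def sample_predictors(x, y):
    """HYPOTHETICAL stand-in for reading the five predictor layers at (x, y)."""
    return (0.0279, 4.99, 4.358, 0.74, 1.6)  # W6 values, reused for the demo

# HYPOTHETICAL 2 km x 1 km extent in projected metres (not the study's extent).
centres = list(grid_centres(0.0, 0.0, 2000.0, 1000.0))
prob_map = {c: probability(sample_predictors(*c)) for c in centres}
print(len(centres), centres[0])
```

In practice the resulting probabilities would be written back to a raster or point layer and contoured to produce the map.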

Fig. 5 Zinc concentration probability prediction (groundwater vulnerability prediction) map of the study area

Results of the validation of groundwater vulnerability prediction map

The predictive model was validated by using the independent variables associated with given locations within the study area to predict the groundwater quality at those locations. Consider the hydrogeological system at well “W6”, where the total longitudinal conductance is 0.0279, the percent of clay in the soil is 0.74, the unsaturated zone thickness is 4.99 m, the slope is 4.358, and the drainage density is 1.6. To examine whether the groundwater quality would pass the test for the permitted zinc concentration level (i.e., belong to category 1 or 0), the values of the independent variables for the location are substituted into the model equation to obtain:

$$ P=\frac{e^{\left(21.323+193.397{x}_1-2.481\ {x}_2-2.193\ {x}_3+25.156{x}_4-11.933{x}_5\right)}}{1+{e}^{\left(21.323+193.397{x}_1-2.481\ {x}_2-2.193\ {x}_3+25.156{x}_4-11.933{x}_5\right)}} $$
(7)
$$ P=\frac{e^{\left(21.323+193.397(0.0279)-2.481\ (4.99)-2.193\ (4.358)+25.156(0.74)-11.933(1.6)\ \right)}}{1+{e}^{\left(21.323+193.397(0.0279)-2.481\ (4.99)-2.193\ (4.358)+25.156(0.74)-11.933(1.6)\right)}} $$
(8)
$$ P=\frac{e^{\left(21.323+5.395-12.380-9.557+18.615-19.093\ \right)}}{1+{e}^{\left(21.323+5.395-12.380-9.557+18.615-19.093\ \right)}}=\frac{e^{(4.303)}}{1+{e}^{\left(4.303\ \right)}}=0.98 $$
(9)

Therefore, the probability that the groundwater quality of well “W6” passes the test for the permitted zinc concentration level is 98%; that is, 98% of locations with such independent variable values would be expected to produce groundwater that passes the test for zinc concentration, based on the maximum permitted levels of inorganic constituents for safe drinking water in the Nigerian Standard for Drinking Water Quality (NSDWQ) 2007 (Fig. 6).

Fig. 6 Zinc concentration probability prediction map with binary categories of zinc concentration of the study area

Similarly, for well “W4”, the total longitudinal conductance is 0.023, the percent of clay in the soil is 0.53, the unsaturated zone thickness is 3.806 m, the slope is 8.356, and the drainage density is 1.728. Substituting these values into the model equation gives

$$ P=\frac{e^{\left(21.323+193.397(0.023)-2.481\ (3.806)-2.193\ (8.356)+25.156(0.53)-11.933(1.728)\ \right)}}{1+{e}^{\left(21.323+193.397(0.023)-2.481\ (3.806)-2.193\ (8.356)+25.156(0.53)-11.933(1.728)\ \right)}} $$
(10)
$$ P=\frac{e^{\left(21.323+4.448-9.443-18.325+13.333-20.620\ \right)}}{1+{e}^{\left(21.323+4.448-9.443-18.325+13.333-20.620\ \right)}}=\frac{e^{\left(-9.284\ \right)}}{1+{e}^{\left(-9.284\ \right)}}\approx 0.0001 $$
(11)

It therefore implies that the probability that the groundwater quality of well “W4” passes the test for the permitted zinc concentration level is about 0.01%; that is, only about 0.01% of locations with such explanatory variable values would be expected to produce groundwater that passes the test. Therefore, the groundwater quality of well “W4” failed the test for zinc concentration, based on the maximum permitted levels of inorganic constituents for safe drinking water in the Nigerian Standard for Drinking Water Quality (NSDWQ) 2007 (Fig. 6).
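The two substitutions above can be checked term by term with a short script. The sketch below is an illustration only (the names are introduced here): it prints each product b·x, the resulting linear predictor, and the probability, so that a slip in any single term is easy to spot. The predictor values are those quoted in the text for W6 and W4, ordered x1 to x5 as in Eq. 5.

```python
from math import exp

INTERCEPT = 21.323
# (name, coefficient) in the order x1..x5 used in Eq. 5.
TERMS = (
    ("total longitudinal conductance", 193.397),
    ("unsaturated zone thickness", -2.481),
    ("slope", -2.193),
    ("percent clay in soil", 25.156),
    ("drainage density", -11.933),
)

def linear_predictor(values):
    """Return (z, contributions), where contributions[i] = b_i * x_i."""
    contributions = [b * x for (_, b), x in zip(TERMS, values)]
    return INTERCEPT + sum(contributions), contributions

def probability(z):
    """Logistic transform of Eq. 4, written as 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + exp(-z))

# Predictor values quoted in the text, ordered x1..x5.
wells = {
    "W6": (0.0279, 4.99, 4.358, 0.74, 1.6),
    "W4": (0.023, 3.806, 8.356, 0.53, 1.728),
}

results = {}
for name, values in wells.items():
    z, parts = linear_predictor(values)
    p = probability(z)
    results[name] = (z, p)
    label = "passes" if p >= 0.5 else "fails"
    print(name, [round(c, 3) for c in parts], round(z, 3), round(p, 4), label)
```

For W6 the script gives a linear predictor of about 4.30, matching Eq. 9, and both wells fall on the expected side of the 0.5 cutoff.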

Conclusions

Reports of environmental problems associated with mining communities prompted the groundwater vulnerability study of the Ilesa gold mining area in the Ilesa schist belt, southwestern Nigeria. The objectives of the study were to generate factors/parameters that can be used to predict aquifer contamination in the area; identify which of the generated factors are responsible for the probability of groundwater contamination; develop an empirical logistic regression (LR) model and map that predict the probability of occurrence of contaminant(s) with respect to a threshold level in the groundwater resources of the study area; and quantify the prediction accuracy and reliability of the developed model. To achieve these objectives, remote sensing, geophysical methods, and chemical analysis of water samples were integrated. Data management and result integration were carried out in a GIS environment. Logistic regression was applied to the results to develop a groundwater vulnerability model for the area. Analysis of the remote sensing and geophysical data assisted in generating the factors/parameters (independent variables) that can be used to predict aquifer contamination in the area; these factors include lineament/lineament density, drainage/drainage density, slope, rock types (geology/lithology), soil type association, aquifer resistivity and thickness, longitudinal conductance, transverse resistance, and coefficient of anisotropy. Analysis of the water samples, on the other hand, assisted in generating the dependent variables (contaminants) used in the study. Of all the dependent variables, zinc concentration (Zn) was the only one with two categorical outcomes. Since two categorical outcomes of the dependent variable are a necessary condition for logistic regression model development, Zn was the contaminant used for the study.
Similarly, only five (5) independent (predictive) variables, namely percent clay in soil, drainage, slope, unsaturated zone thickness, and total longitudinal conductance, were established to be correlated with, and statistically significant for, the dependent variable (the contaminant), and were thus used in logistic regression model development. The quantitative assessment of the developed model established that the overall model prediction accuracy was 85.7%, suggesting that the model had a good fit. The probability prediction model was also found to be reliable, with a percentage reliability of 90%. In conclusion, since the developed model was assessed to be accurate and reliable, the model, and hence the technique, can be replicated in other areas of similar geologic conditions.