Introduction

Water is one of the most essential resources for human life [1]. According to the WHO, in 2017 approximately 2 billion people worldwide lacked access to safe and healthy water [2]. Groundwater is the most important source for various uses, including agriculture, drinking, and industry [3]. Approximately one-third of the world’s population uses groundwater for drinking. The most important reasons are the lack of potable surface water and a general belief that groundwater is purer and safer than surface water because of the protective qualities of the soil cover [4, 5].

With industrial development in recent years, the large-scale use of synthetic fertilizers, pesticides, and insecticides in agriculture has created serious concern about the susceptibility of groundwater to contamination. Changes in groundwater quality are related to rock–water contact and oxidation-reduction reactions during water percolation through aquifers. In addition to these mechanisms, waterborne contaminants and toxic and non-toxic pollutants, which are transported through aquifers from the recharge area to the discharge area, are major factors in water quality [6, 7].

Assessing water quality is one of the major issues in groundwater studies. Evaluation and monitoring of groundwater quality are vital for the sustainable use of these resources [8]. One of the essential approaches to quality control of water resources is to obtain an effective model for predicting groundwater quality [9]. In this study, we used statistical methods to predict the quality characteristics of groundwater in Kermanshah province. Similar studies have been conducted by Conglian et al. [10], Zouheira et al. [11], Yesilnacar et al. [12], and Kisi et al. [13]. Common models in this field include regression models, time series, and newer models such as neural networks and genetic algorithms. Because the multiple regression model uses all of the data in the data set at the same time, it is well suited to capturing the problem; another advantage is its flexibility and good performance with normally distributed variables.

The purpose of this study was to present multiple regression models that predict three qualitative parameters of Kermanshah groundwater resources in the west of Iran: Total dissolved solids (TDS), Sodium adsorption ratio (SAR), and Total hardness (TH). SAR is essential for soil management and stability in agricultural lands, and TDS and TH affect the taste of water. Given the importance of these three parameters, suitable models were developed to predict them for each type of water resource (deep wells, semi-deep wells, springs and aquifers).

Materials and methods

Case study

Kermanshah is a province in the west of Iran. It has an area of about 24,640 \({\text{km}}^{2}\) and lies between latitudes \({33}^{\circ }37^{\prime}\) N and \({35}^{\circ }17^{\prime}\) N and longitudes \({45}^{\circ }20^{\prime}\) E and \({48}^{\circ }1^{\prime}\) E (Fig. 1). The altitude of the study area ranges from 2100 m to 3357 m. Descriptive statistics of the physicochemical groundwater parameters, including Calcium (Ca), pH, Chloride (Cl), Magnesium (Mg), Sodium (Na), Sodium percent (%Na), Electrical conductivity (EC), Sulfate (\({\text{SO}}_{4}\)), Total hardness (TH), Carbonate (\({\text{CO}}_{3}\)), Sodium adsorption ratio (SAR), Bicarbonate (\({\text{HCO}}_{3}\)), Cations, Anions, and Total dissolved solids (TDS), are summarized in Table 1. To prepare the spatial distribution map for each parameter, water quality data from 142 wells (Fig. 2) were analysed.

Fig. 1 Position of the study area

Table 1 Summary statistics of the chemical composition of major ions (mg/l) in the groundwater of the study area

Fig. 2 Local position of the wells in the study area

Multiple linear regression (MLR)

Multivariate statistical analysis is widely used to test the reliability of different processes that affect the mineralization of the groundwater aquifer system [4]. The general purpose of multiple regression is to learn more about the relationship between several independent (predictor) variables and a dependent (observed) variable [10]. In other words, multiple linear regression predicts the value of the dependent variable for a given set of predictors. The model is fitted to the observed dependent variable by adjusting the coefficients of a linear combination of the predictors [14].
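
As an illustration of the fitting step, the following minimal sketch estimates the coefficients of a multiple linear regression by ordinary least squares through the normal equations. It is not the code used in this study, which was carried out in R; the function names `fit_mlr` and `predict` and the small synthetic data set are hypothetical.

```python
def fit_mlr(rows, y):
    """Ordinary least squares for y = b0 + b1*x1 + ... + bp*xp.

    rows: list of predictor rows, y: list of observed responses.
    Returns [b0, b1, ..., bp]. Illustrative only; names are hypothetical.
    """
    n = len(rows)
    design = [[1.0] + [float(v) for v in row] for row in rows]  # intercept column
    p = len(design[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    aug = [
        [sum(design[k][i] * design[k][j] for k in range(n)) for j in range(p)]
        + [sum(design[k][i] * y[k] for k in range(n))]
        for i in range(p)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; X'X is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, p):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, p + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    beta = [0.0] * p
    for i in range(p - 1, -1, -1):
        tail = sum(aug[i][j] * beta[j] for j in range(i + 1, p))
        beta[i] = (aug[i][p] - tail) / aug[i][i]
    return beta


def predict(beta, row):
    """Predicted response for one row of predictors."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


# Synthetic example: the response is exactly 1 + 2*x1 + 3*x2.
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [1 + 2 * a + 3 * b for a, b in rows]
beta = fit_mlr(rows, y)
```

In practice a statistical package would be used instead, since it also reports standard errors and P-values for each coefficient.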

In this study, the multiple regression model was used to predict groundwater quality parameters at a 5% level of significance. The relationships among groundwater quality parameters were examined using the mlr function in R software [15].

The dependent variables in this study were the groundwater TDS, SAR, and TH, and the independent variables (predictors) were \({\text{SO}}_{4}\), pH, \({\text{HCO}}_{3}\), Mg, Ca, Cl, Na, and EC. Modeling was performed for four water resources: deep wells, semi-deep wells, springs and aquifers. The models were evaluated using the coefficient of determination (\({\text{R}}^{2}\)).
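
For reference, the coefficient of determination and its adjusted form can be computed as in the sketch below. The formulas are the standard ones; the function names, the sample values, and the sample size in the example are hypothetical and are not taken from the study data.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R^2 penalizes R^2 for the number of predictors used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical observed and predicted values.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(observed, predicted)

# Hypothetical case: 50 samples and the 8 predictors listed above.
adj = adjusted_r_squared(r2, n_samples=50, n_predictors=8)
```

Adjusted \({\text{R}}^{2}\) never exceeds \({\text{R}}^{2}\) when there is at least one predictor, and the gap widens as predictors are added without a matching gain in fit.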

Inverse distance weighted (IDW)

The spatial distribution of the groundwater quality parameters was mapped using the Spatial Analyst module in ArcGIS 10.4.1.

Inverse distance weighting (IDW) interpolation assumes that points close to each other are more alike than points farther apart. IDW uses the measured values surrounding the prediction location to predict a value for any unmeasured location [16].
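
The idea can be sketched as a weighted average in which each measured value is weighted by the inverse of its distance to the prediction location raised to a power. The sketch below is a minimal illustration and not the ArcGIS implementation; the function name `idw_estimate`, the power of 2, and the sample coordinates and values are assumptions.

```python
import math


def idw_estimate(samples, target, power=2.0):
    """Inverse distance weighted estimate at `target`.

    samples: list of (x, y, value) measurements.
    target: (x, y) location to estimate.
    power: distance exponent; 2 is a common default (assumed here).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in samples:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            return value  # target coincides with a measured point
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Two hypothetical wells with TDS values of 100 and 300 mg/l.
wells = [(0.0, 0.0, 100.0), (2.0, 0.0, 300.0)]
midpoint = idw_estimate(wells, (1.0, 0.0))   # equal distances
near_first = idw_estimate(wells, (0.5, 0.0))  # closer to the first well
```

At the midpoint both wells receive the same weight, so the estimate is their plain average; moving toward the first well pulls the estimate toward its value.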

Statistical analysis

Analysis of variance (ANOVA) is widely applicable to groundwater quality problems as a versatile diagnostic tool. ANOVA tests for significant differences among groups [15]. The Kruskal–Wallis rank-sum test is a non-parametric equivalent of the ANOVA test [3]. To choose between a parametric and a non-parametric test, the homogeneity of variance was checked with the Fligner test. This test gave a significant P-value (< 0.05), so the non-parametric Kruskal–Wallis test was used for the analysis [17].
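
To make the rank-based comparison concrete, the sketch below computes the Kruskal–Wallis H statistic with the usual correction for ties. It is an illustration only and not the procedure run in this study; the function name and the three sample groups are hypothetical. H is compared with a chi-square critical value with (number of groups − 1) degrees of freedom, which is 5.991 for two degrees of freedom at the 5% level.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction.

    groups: list of lists of numeric observations, one list per group.
    """
    pooled = sorted((value, gi) for gi, g in enumerate(groups) for value in g)
    n = len(pooled)
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    rank_sums = [0.0] * len(groups)
    for (_, gi), rank in zip(pooled, ranks):
        rank_sums[gi] += rank
    h = 12.0 / (n * (n + 1)) * sum(
        rs ** 2 / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3.0 * (n + 1)
    correction = 1.0 - tie_term / float(n ** 3 - n)
    if correction == 0.0:
        raise ValueError("all observations are identical")
    return h / correction


# Three hypothetical groups with no overlap between them.
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
CHI2_CRITICAL_DF2_ALPHA05 = 5.991  # chi-square critical value, df = 2
significant = h_stat > CHI2_CRITICAL_DF2_ALPHA05
```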

Results and discussion

Statistical analysis results

The ANOVA and Kruskal–Wallis test findings are presented in Table 2. The degrees of freedom (df) are 5 for the deep wells, semi-deep wells, and aquifers and 15 for the springs. The results showed that the quality parameters differ significantly according to the type of water source.

Table 2 ANOVA and Kruskal–Wallis results for the TDS, SAR, TH models in groundwater for the study area

GIS interpolation model

The IDW model was used as the interpolation model to produce the GIS maps presented in Fig. 3.

Fig. 3 Interpolated maps of the groundwater quality parameters generated using IDW. a TDS, b SAR, c TH

The minimum and maximum IDW outputs were 8.89086 and 901.19 mg/l for TDS, 0.007 and 4.02 for SAR, and 15.80 and 1966.39 mg/l for TH, respectively. All of the parameters had low values in the north and the east of the study area, whereas the maximum values were located in the south and the west; the north and the east therefore had better groundwater quality than the south and the west. One of the most important reasons for the lower quality in the south and the west is the use of nitrate fertilizers and the dissolution of calcareous minerals.

Multiple linear regression model (MLR)

The MLR model is useful for discovering the association between various independent and dependent variables. MLR models of TDS, SAR, and TH for the four water resources (deep wells, semi-deep wells, springs and aquifers) were built in R and are presented in Fig. 4.

Fig. 4 Multiple linear regression TDS, SAR, and TH models in groundwater for the study area

MLR analysis of TDS model

TDS is one of the most important parameters in assessing the suitability of water for irrigation [18] and for overall groundwater quality assessment [19]. TDS is a measure of the combined concentration of cations and anions. In natural water, dissolved solids consist of inorganic salts, small amounts of organic matter, and dissolved materials. Dissolved solids are mainly due to carbonates, chlorides, sulfates, nitrates, phosphates, Ca, Mg, Na, K, Fe, Mn, etc. [5].

TDS in groundwater is mainly due to decaying vegetation and the disposal of industrial effluents. A TDS value of 500 mg/l is the desirable limit, and water containing more than 500 mg/l TDS causes gastrointestinal irritation [20]. A high TDS value influences the taste, hardness, and corrosive property of the water [21].

In Fig. 4, the predicted values are shown graphically against the observed data for the models. There is a good relationship between the predicted values and the observed data for the TDS model. The estimated \({\text{R}}^{2}\) and P-values of this model are presented in Table 3. The most critical factor in determining the success of the model is the adjusted R square, in comparison with multiple R or R square. The adjusted R square is 0.86 for the deep wells, 0.94 for the semi-deep wells, 0.88 for the springs and 0.94 for the aquifers.

Table 3 Performance of the multiple linear regression TDS models in groundwater for the study area

The P-value is less than 2.2e-16, so the model is statistically significant. The independent variables \({\text{SO}}_{4}\), pH, \({\text{HCO}}_{3}\), Mg, Ca, Cl, Na, and EC were significant in predicting the TDS value. The independent variables explain 86% of the variance of TDS for the deep wells, 94% for the semi-deep wells, 88% for the springs, and 94% for the aquifers.

Adhikari et al. (2009) studied statistical approaches for the hydrogeochemical characterization of groundwater in west Delhi, India. The study showed a good correlation between water quality parameters and also showed that multiple regression models can predict quality parameters at the 5% level of significance [22].

The results obtained in the present study are comparable with those of previous studies (Pan et al. (2018) and Zouheira et al. (2017)). According to their reports, an R square of 0.98 shows that about 98% of the total variation in TDS can be explained by the independent variables [10, 11]. TDS models thus provide an accurate prediction of quality parameters with considerably high values of \({\text{R}}^{2}\).

These results are, however, in contrast with those reported by Kadam et al. (2019). According to their report, the multiple \({\text{R}}^{2}\) is 1 and the adjusted \({\text{R}}^{2}\) is 1, and the P-value is less than 2.2e-16.

Kadam et al. found the ANN to be more appropriate than the MLR model, with the ANN counterpart fitting the quality data convincingly. The MLR modelling technique is based on the simple least squares method, whereas the ANN model imitates the functioning of human intelligence. According to their report, the ANN model would be more beneficial for the prediction of water quality [23].

Also, Civelekoglu et al. (2007) indicated that ANN modeling appears to be a strong tool in situations where the relations between variables are nonlinear [24].

The results of the multiple regression model can be used as a predictive tool for determining the chemistry of groundwater if the dependent variable TDS is measured at every location. The proposed TDS model can be used to estimate the TDS content of groundwater obtained from such an area. Consequently, the MLR model can serve as an alternative and cost-effective tool for groundwater quality prediction in circumstances where the available expertise, time constraints, and field data are favourable.

MLR analysis of SAR model

Sodium concentration is one of the important parameters in the classification of irrigation water. Soils containing a large proportion of sodium with carbonate as the predominant anion are termed alkali soils, and those with chloride or sulfate as the predominant anion are termed saline soils. Both affect plant growth [21]. Calculating SAR requires the Na, Mg, and Ca concentrations [13].
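
For reference, SAR is conventionally defined as Na divided by the square root of half the sum of Ca and Mg, with all three concentrations expressed in meq/l. The sketch below applies that standard definition; the function names, the rounded equivalent weights, and the sample concentrations are illustrative assumptions and are not taken from the study data.

```python
import math

# Approximate equivalent weights in mg per meq (atomic mass / valence).
EQUIVALENT_WEIGHT = {"Na": 22.99, "Ca": 20.04, "Mg": 12.15}


def mg_per_l_to_meq_per_l(concentration_mg_l, ion):
    """Convert a concentration from mg/l to meq/l for Na, Ca, or Mg."""
    return concentration_mg_l / EQUIVALENT_WEIGHT[ion]


def sodium_adsorption_ratio(na_meq, ca_meq, mg_meq):
    """SAR = Na / sqrt((Ca + Mg) / 2), all inputs in meq/l."""
    return na_meq / math.sqrt((ca_meq + mg_meq) / 2.0)


# Hypothetical sample: 4 meq/l Na, 6 meq/l Ca, 2 meq/l Mg.
sar = sodium_adsorption_ratio(4.0, 6.0, 2.0)
```

Because of the square root, SAR is not a linear function of Ca and Mg, so a linear regression on these ions can only approximate it.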

The systematic calculation of the correlation coefficient between water quality variables, together with regression analysis, provides an indirect means for rapid monitoring of water quality [21]. The estimated \({\text{R}}^{2}\) and P-values of this model are presented in Table 4. As for the TDS model, the most critical factor in determining the success of the model is the adjusted R square, in comparison with multiple R or R square. The adjusted R square is 0.98 for the deep wells and the aquifers and 0.97 for the semi-deep wells and the springs. The P-value is less than 2.2e-16.

Table 4 Performance of the multiple linear regression SAR models in groundwater for the study area

Based on the coefficient of determination, the independent variables explained 98% of the variance in SAR for the deep wells and the aquifers and 97% for the semi-deep wells and the springs.

The results of the present study can be compared with those of Tabari et al. (2012) [25]. In their study, the correlation coefficient and standard error were 0.74 and 1.35, respectively. Measured and predicted values for the regression models fit relatively well, but at high SAR values the difference between the measured and modelled values increased. A comparison of statistical indicators for the artificial neural network and the regression model showed the superiority of the artificial neural network in simulating and predicting SAR values. The artificial neural network therefore performed better than the regression model for predicting SAR [25].

The results of the present study can also be compared with those of Kisi et al. (2018) [13]. In that study, the quality variables were modeled by a simple ANFIS and by ANFIS trained with evolutionary algorithms, and the models’ performances were evaluated using the determination coefficient (\({\text{R}}^{2}\)). The ANFIS model was used to estimate SAR. The results indicate that Na (0.97 and 0.92) and Cl (0.97 and 0.82) show the highest correlations with EC and SAR. Also, Mg, Cl and Ca are the most appropriate variables for TH [13]. The correlation coefficient between the most effective variables and the outputs is positive. It is noteworthy that potassium (0.24, 0.23 and 28) shows the lowest correlation with EC, SAR and TH, respectively. The results showed that the ANFIS model could be a useful tool to compute and predict the groundwater quality variables [13]. The results of the multiple regression SAR model can likewise be used as a predictive tool in determining the groundwater quality parameters.

MLR analysis of TH model

The hardness of water is determined mainly by calcium and magnesium, the two most dominant cations. Other constituents such as aluminium, iron, and manganese also contribute, but calcium and magnesium have the greatest effect on the hardness of water [13]. Hardness may also be due to calcium and magnesium salts from detergents and soaps used for laundering on the banks of the water body, precipitated as calcium carbonate.
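
For reference, total hardness is commonly calculated from the calcium and magnesium concentrations as TH = 2.497 Ca + 4.118 Mg, with Ca and Mg in mg/l and TH expressed in mg/l as \({\text{CaCO}}_{3}\). The sketch below applies that standard relation. Whether the TH values in this data set were derived in this way is not stated in the study, and the function name and sample concentrations are hypothetical.

```python
def total_hardness_as_caco3(ca_mg_l, mg_mg_l):
    """Total hardness in mg/l as CaCO3 from Ca and Mg in mg/l.

    The factors 2.497 and 4.118 convert each ion to its CaCO3 equivalent.
    """
    return 2.497 * ca_mg_l + 4.118 * mg_mg_l


# Hypothetical sample: 40 mg/l Ca and 12 mg/l Mg.
th = total_hardness_as_caco3(40.0, 12.0)
```

If reported TH values are derived from Ca and Mg in this manner, a linear model that includes both ions as predictors would be expected to reproduce TH almost exactly.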

The maximum allowable limit of TH for drinking purposes is 500 mg/l, and the most desirable limit is 100 mg/l (per the WHO international standard). For total hardness, the most desirable limit is 80–100 mg/l. Groundwater exceeding the limit of 300 mg/l is considered to be very hard [4]. The estimated \({\text{R}}^{2}\) and P-values of this model are presented in Table 5. The most critical factor in determining the success of the model is again the adjusted R square, in comparison with multiple R or R square. The adjusted R square is 1 for the deep wells, the semi-deep wells, the springs and the aquifers. The P-value is less than 2.2e-16.

Table 5 Performance of the multiple linear regression TH models in groundwater for the study area

According to the results, the independent variables explain 100% of the variance in TH for the deep wells, semi-deep wells, springs, and aquifers. The multiple \({\text{R}}^{2}\) value (100%) indicates that all of the variability in TH could be ascribed to the combined effect of \({\text{SO}}_{4}\), pH, \({\text{HCO}}_{3}\), Mg, Ca, Cl, Na, and EC.

The results of the present study can be compared with those of Kadam et al. (2019) and Kisi et al. (2019), who indicated that ANFIS and ANN models could be used as useful tools to predict TH and other groundwater quality variables [13, 23].

Mekparyup et al. (2013) also indicated that all regression coefficients were significant and that there was a highly positive correlation between the response variable and the predictor variables [26].

Therefore, the TH model showed the best performance for the groundwater quality parameters with respect to the other estimation methods.

Conclusions

The aim of this study was to determine groundwater quality and the relationships among groundwater quality variables in Kermanshah province, west of Iran. The spatial distribution of each parameter obtained by IDW showed that groundwater resources in the north and the east of the study area had better quality than those in the south and the west. The relationships between parameters obtained by MLR showed that TDS and the water quality parameters in semi-deep wells and aquifers had a strong positive correlation (r = 0.94, r = 0.98), and there was a strong, significant positive correlation between SAR and the water quality parameters in deep wells and aquifers (r = 0.98, r = 0.99). TH and the water quality parameters in all water sources also had a strong positive correlation (r = 1). Consequently, the MLR model could serve as an alternative and cost-effective tool for groundwater quality prediction in circumstances where the available expertise, time constraints, and field data are favourable.

The present study was carried out on an annual scale. Future research should examine other time scales, including daily and monthly.