1 Introduction

About 97% of the world's water is salt water held in the oceans; the remaining 3% is fresh water. Over 50% of the freshwater is used for drinking, and it is also used for agriculture and irrigation. Water sources are not always available: streams and rivers frequently run dry, and surface water is polluted in various ways. Horvat et al. (2021) investigated water quality parameters in the fourth sector of Lake Palic in Serbia, monitoring 13 parameters over the 9 years from 2011 to 2019. The data were subjected to principal component analysis (PCA) and machine learning (ML) classification algorithms, which identified seasonality in the lake water quality. The water quality index (WQI) was determined using two approaches to provide a general view of the overall quality of the lake, and the final result compared the weighted arithmetic (WA) WQI with the Canadian Council of Ministers of the Environment (CCME) WQI. The WQI categorized the quality of the lake as good or poor. The results did not show an overall temporal trend.

Using WQI and statistical analysis, Ustaoğlu et al. (2020) analyzed the water quality of stream and river systems. From February 2017 to January 2018, biological, water quality, oxygenation and nutrient parameters were examined using standard methods. The Pearson correlation index (PCI), PCA and cluster analysis (CA) were the statistical methods used to analyze the results. The WQI, hazard quotient (HQ) and hazard index (HI) were used to examine stream water quality and to assess the risk to human health. The results showed that the Turnasuyu has very good water quality characteristics and that the identified trace elements are below the levels considered dangerous to public health. Dey and Vijay (2021) monitored water quality parameters using geospatial techniques, which reduced the laboratory and fieldwork required. The water quality was assessed using the bio-optical method and the spectral ratio method. The remote sensing analysis observed temperature, total suspended solids (TSS) and salinity. The results showed that non-optical parameters could be estimated and supported remote sensing as a technique for water quality measurement.

Valentini et al. (2021) adapted a WQI technique for calculating water quality parameters in Brazil. The study was designed to monitor water quality parameters through WQI correlation analysis. Among the monitored locations, the highest coefficient was found at the ML4 point. Using ML models based on big data, Chen et al. (2020) identified water quality parameters and studied surface water quality performance. For lakes and rivers in China from 2012 to 2018, ten ML models were compared, and their predictions of the water quality parameters were evaluated by recall, precision, F1 score and weighted F1 score. Deep Cascade Forest (DCF), Random Forest (RF) and Decision Tree (DT) identified water parameters such as NH3–N, DO and COD.

Shalumon et al. (2021) estimated the water quality parameters for five districts of Kerala using the WQI method. During the monsoon season, 11 water quality parameters and uranium were analyzed in 30 samples. The study reported that the water quality requirements were satisfied, that the uranium concentration in the study area was low, and that the water was acceptable for drinking. Khatri et al. (2020) presented a real-time water quality monitoring system built on a Python framework and a Raspberry Pi 3 board. The measured values of water quality parameters such as pH, electrical conductivity, temperature and dissolved oxygen were validated against benchmark equipment. The developed setup was used for real-time environmental analysis, processing and logging of the data. Contamination detection, distribution network monitoring, water quality monitoring for rivers and lakes, and environmental monitoring are the important applications of real-time monitoring.

Kothari et al. (2021) investigated the water quality parameters for five districts of Uttarakhand in India based on WQI calculation and on biological and hydro-geochemical parameters. The study assessed the suitability of the water for drinking against BIS standards. The WQI was combined with multiple parameters, and correlation coefficients were calculated by statistical analysis. The results showed correlations involving total coliform, total iron concentration and faecal coliform. Correlations with chloride, sulphate and conductivity identified nitrate in the water. For the Awash River in Ethiopia, Shishaye and Asfaw (2020) determined and analyzed various water quality parameters over 10 years. The study also examined the geochemical composition and origin of the river water. Total dissolved solids (TDS), EC and pH were measured in the 10 years of samples. Ray et al. (2021) presented a two-season study of water quality parameters at 66 locations in Kerala. The study compared chemical water parameters with ecological water parameters with the help of PCA and identified fast-growing algal species.

Najafzadeh and Niazmardi (2021) determined water quality parameters using Multiple-Kernel Support Vector Regression (MKSVR), an extension of Support Vector Regression (SVR), as a novel method for solving the underlying optimization problem. MKSVR was used to determine WQI, Biochemical Oxygen Demand (BOD) and Chemical Oxygen Demand (COD). The final results compared MKSVR with random forest regression (RFR) and SVR in terms of root mean squared error (RMSE). Xu et al. (2022) analyzed simple, easily measured water quality parameters and used them to predict the occurrence of trihalomethanes (THMs). To improve the THM models, radial basis function (RBF) artificial neural networks (ANN) and linear/log-linear regression models (LRM) were developed using 64 samples. The results showed that the LRMs had limited prediction ability. For the best results, the Gaussian function and the maximum number of neurons (MN) were used to improve on the different types of LRMs.

Pandey et al. (2020) determined the chemical and physical parameters of pre- and post-monsoon groundwater collected at the Koradi Thermal Power Plant Site in the Nagpur region. The result analysis has been determined during the pre- and post-monsoon. It was analyzed by statistical methods, which showed the electrical conductivity, total hardness and dissolved solids. GW-3, GW-5 and GW-9 were also known as groundwater resources used to determine the parameters. Chatterjee and Lataye (2020) investigated surface water quality parameters in Futala Lake, Nagpur. Eleven water quality parameters were determined between September 2017 and October 2017 using a geographic information system (GIS). The parameters were determined by kriging interpolation across the lake. The percentage difference of parameters was calculated, and the result indicated the water quality. Parameters were compared for immersion and pre-immersion, and affected areas were treated during immersion.

Asadollah et al. (2021) predicted the WQI of rivers using extra tree regression (ETR), an ensemble ML model. The performance of ETR was compared with the classic stand-alone models decision tree regression (DTR) and SVR. The results show that the ETR model produces more accurate WQI predictions in both the training and testing phases. Although using all 10 input variables achieved the highest prediction performance, the combination of only BOD, phosphate and turbidity concentration provided the second-highest prediction accuracy.

Based on sensitivity analysis, Khan et al. (2021) proposed an ANN pattern obtained from the SLA algorithm. To estimate the model’s performance, MATLAB software was used to calculate the coefficient of determination and MSE. The results showed the best performance for EC, pH, TDS and turbidity. The errors were reduced by monitoring the bacteria in real time. An irrigation WQI was developed by Jahin et al. (2020) using MA for surface water in Egypt. Based on the National Sanitation Foundation Water Quality Index (NSFWQI), parameter weights were adopted using FAO 29 guidelines. Twenty-two parameters were collected from 45 geo-referenced locations in the summer, winter, autumn and spring seasons of 2018 and analyzed. FA, CA and PCA were used to analyze the data. The results showed that the tested water quality was good. Seven parameters performed better, and the water quality weights were estimated. Based on the 3D-EEM (Excitation–Emission Matrix) technique and hyperspectral data, Zhang et al. (2020) determined the parameters for surface water quality in China. Forty-eight samples were derived from the hyperspectral data, and 12 water quality parameters were analyzed by WQI. The study improved the Back Propagation-Artificial Neural Network (BPANN) method and determined the water quality parameters (WQP). The outcome showed improved WQP-based water quality estimation and surface water quality monitoring.

The existing studies determined various water quality parameters using various analysis techniques in various places. Some studies (Khatoon et al. 2013; Maji and Chaudhary 2019; Singh et al. 2020; Isaac and Siddiqui 2022) conducted water quality assessments of the Kali, Yamuna and Ganga rivers in Uttar Pradesh state. Other studies (Banerjee et al. 2021; Chaurasia et al. 2021; Ram et al. 2021) estimated water quality in a single district. This study uses a novel hybrid DNN-MBGD model for water quality prediction and establishes the water quality index by WAWQI using 20 water quality parameters in Uttar Pradesh. The hybrid DNN-MBGD model is used to predict the selected parameter with the best performance.

The objective of this study is to check whether the water is suitable for drinking and to predict the water quality parameters using the hybrid DNN-MBGD model. The proposed methodology is given in section 2, the results and discussion are presented in section 3, and section 4 concludes the study.

2 Methodology

The data on the water quality parameters were collected from the surface water quality station-wise report (Jan 2019–Jan 2020). The WQI value was determined from these 20 water quality parameters by solving the WQI equations in Excel 2007. WQI determination involves three steps. The first step is to determine the unit weight (\(w_{i}\)) of each water quality parameter. The second step is to determine the relative weight (\(W_{i}\)) of each water quality parameter and then the quality rating value (\(Q_{i}\)). The third and last step is to compute the WQI from these values. The WQI was used to determine the water quality of the experimental sites. After that, the correlation between the WQI and the 20 water quality parameters was determined. Input water quality parameters are selected by feature importance analysis, and output values are predicted using a novel hybrid DNN-MBGD technique. In this study, the DNN model is combined with MBGD to update the weights and improve the prediction accuracy. The performance of the hybrid DNN-MBGD model was compared with the ANN, SVM and GMDH models. Finally, error analyses were carried out to assess accuracy. Figure 1 shows the proposed research methodology for this study.

Figure 1 Flowchart for proposed research methodology.

2.1 Study area

Uttar Pradesh (27°40′N; 80°00′E) is the study area, and the study was carried out from January 2019 to January 2020. Ten districts of Uttar Pradesh were selected for water quality prediction and WQI estimation because the state is industrialized and is the most populous state in India. Minerals, textiles and electronics are the main industries in Uttar Pradesh. Surface water in this state is easily polluted by the toxic and chemical wastes discharged from these industries. The authors therefore analyzed the water quality in 10 districts (Jaunpur, Deoria, Fatehpur, Prayagraj, Ambedkar Nagar, Gorakhpur, Lucknow, Amroha, Hardoi and Meerut) of Uttar Pradesh to check whether the water is suitable for drinking. The Maighat site is selected for the Jaunpur district.

The following sites are selected for the remaining districts. For Deoria, Fatehpur, Prayagraj, Ambedkar Nagar, Gorakhpur, Lucknow, Amroha, Hardoi and Meerut, the sites are Turtipar, Kora-old, Allahabad, Akbarpur, Birdghat, Lucknow, Garhmukteshwar, Ankinghat and Galeta, respectively. Figure 2 shows the location of Uttar Pradesh in India and its experimental districts.

Figure 2 Uttar Pradesh’s location in India and its experimental districts.

2.2 Dataset collection

Uttar Pradesh is a highly industrialized state, with several industries centred on textiles, minerals and electronics. Surface water in this state is readily contaminated by the chemical waste discharged by these industries. For this reason, 10 districts of Uttar Pradesh were selected for water quality prediction and WQI estimation, and the WQI value was calculated to determine whether the water is fit for human consumption. Between January 2019 and January 2020, data on 20 water quality parameters were gathered weekly at 10 sites. In total, 10400 historical data points were collected for the 20 water quality parameters from the 10 sites, that is, 1040 measurements per site. The entire dataset was used for water quality prediction. The average values of the water quality parameters at the 10 sites from January 2019 to January 2020 are given in table 3.
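As a quick consistency check on the dataset size, the stated counts agree with weekly sampling if the study period is taken as 52 weeks. The 52-week figure is an assumption made for this check; the exact number of sampling dates is not stated.

```python
# Assumption: weekly sampling over the study period gives 52 samples
# per parameter per site (the exact number of sampling dates is not stated).
weeks = 52
parameters = 20  # water quality parameters per sample
sites = 10       # monitoring sites

per_site = weeks * parameters  # measurements per site
total = per_site * sites       # measurements over all sites

print(per_site, total)  # 1040 10400
```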

2.3 Methods

The study aims to forecast water quality using a novel hybrid DNN-MBGD optimization and WAWQI technique to assess WQI in 10 districts of Uttar Pradesh.

2.3.1 Weighted Arithmetic Water Quality Index Method (WAWQI)

The WQI summarizes water quality as a single index number. It is determined from various chemical, biological and physical water quality parameters. WAWQI is reported to give accurate results compared with other methods such as NSF-WQI, CCME WQI and OWQI, so this study used the WAWQI method for WQI calculation. Many studies (Gautam et al. 2021; Menberu et al. 2021; Rojamadhuvanthi et al. 2021; Hagage et al. 2022; Lien et al. 2022; Nandi et al. 2022; Ojukwu et al. 2022; Prathibha et al. 2022; Rahul et al. 2022; Zait et al. 2022) have used the WAWQI method to calculate WQI. The WAWQI method assesses the ecological status of water by transforming multiple WQ parameters into a single mathematical expression. The unit weight \(w_{i}\) is obtained from the constant \(K\) and the standard value \(S_{i}\), as in equation (1). The summation of the unit weights for the 20 parameters is used to determine the relative weight, shown in equation (2). The ratio of the observed value of each parameter to its standard value is known as the quality index, represented in equation (3). Multiplying \(Q_{i}\) and \(W_{i}\) gives the sub-index value shown in equation (4). The sub-indices of the 20 parameters are summed to give the WQI, shown in equation (5).

The following steps calculate the WQI of water quality parameters. The first step is to determine the unit weight (\(w_{i}\)) of each water quality parameter.

$$w_{i} = \frac{K}{{S_{i} }}$$
(1)

where \(w_{i}\) = unit weight of each WQ parameter, K = constant value.

$$K = \frac{1}{{\frac{1}{{S_{1} }} + \frac{1}{{S_{2} }} + \cdots + \frac{1}{{S_{20} }}}} = \frac{1}{{\sum\nolimits_{i = 1}^{20} {\frac{1}{{S_{i} }}} }}.$$

Here, \(S_{i}\) represents the standard value of the ith WQ parameter, and i indexes the parameters used, i = 1, 2, ..., 20.

The second step is to determine the relative weightage (\(W_{i}\)), and the third step is to determine the quality index (\(Q_{i}\)) of each WQ parameter.

$$W_{i} = \frac{{w_{i} }}{{\sum\nolimits_{j = 1}^{20} {w_{j} } }}$$
(2)
$$Q_{i} = \frac{{C_{i} }}{{S_{i} }}$$
(3)

where \(W_{i}\) = relative weightage of each parameter, \(Q_{i}\) = quality index of each parameter, \(C_{i}\) = observed values of each parameter.

After finding \(Q_{i}\), the sub-index SI is determined. The WQI is calculated from the SI value.

$${\text {SI}}_{i} = W_{i} \times Q_{i} ,$$
(4)
$${\text {WQI}} = \sum\limits_{i = 1}^{20} {{\text {SI}}_{i} } .$$
(5)
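Equations (1)–(5) can be sketched in Python as follows. This is a minimal illustration rather than the spreadsheet used in the study: the function name `wawqi` and the `scale` argument are hypothetical, and the relative weight is assumed to be each unit weight divided by the sum of the unit weights. Many WAWQI formulations multiply the quality rating by 100; `scale` defaults to 1.0 so that the code follows equation (3) as printed.

```python
def wawqi(observed, standards, scale=1.0):
    """Weighted arithmetic WQI following equations (1)-(5).

    observed  -- dict mapping parameter name to observed value C_i
    standards -- dict mapping parameter name to standard value S_i
    scale     -- multiplier applied to Q_i (assumption; 1.0 matches eq. 3)
    """
    names = list(standards)
    # Constant K = 1 / sum(1 / S_i)
    k = 1.0 / sum(1.0 / standards[n] for n in names)
    # Equation (1): unit weight w_i = K / S_i
    unit_w = {n: k / standards[n] for n in names}
    # Equation (2), assumed reading: W_i = w_i / sum of all unit weights
    total_w = sum(unit_w.values())
    rel_w = {n: unit_w[n] / total_w for n in names}
    # Equation (3): quality index Q_i = C_i / S_i (optionally scaled)
    q = {n: scale * observed[n] / standards[n] for n in names}
    # Equation (4): sub-index SI_i = W_i * Q_i
    sub_index = {n: rel_w[n] * q[n] for n in names}
    # Equation (5): WQI is the sum of the sub-indices
    return sum(sub_index.values()), rel_w, sub_index
```

When every observed value equals its standard, each \(Q_{i}\) is 1 and the sketch returns a WQI of 1 (or 100 with `scale=100`), which is a convenient sanity check.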

The WAWQI method divides the WQI value into five categories. If the WQI value is greater than 100, the water is not suitable for drinking. The water condition is very poor if the WQI value is between 76 and 100, poor if it is between 51 and 75, and good if it is between 26 and 50. If the WQI value is 25 or less, the water is suitable for drinking. Table 1 explains the five categories of WQI values.
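The five categories can be expressed as a small lookup function. This is a sketch under two assumptions: the bands are treated as contiguous, so a non-integer value such as 75.5 falls into the higher band, and the lowest class is assumed to cover every value of 25 or below. The function name `classify_wqi` and the label strings are illustrative.

```python
def classify_wqi(wqi):
    """Map a WQI value to one of the five WAWQI categories.

    Band edges are assumed contiguous: values between the integer
    limits (e.g. 75.5) are assigned to the higher band.
    """
    if wqi > 100:
        return "unsuitable for drinking"
    if wqi > 75:
        return "very poor"
    if wqi > 50:
        return "poor"
    if wqi > 25:
        return "good"
    return "suitable for drinking"


for value in (120, 80, 60, 30, 10):
    print(value, classify_wqi(value))
```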

Table 1 Classification of WQI by WAWQI method.

2.3.2 Deep learning neural network (DNN)

The deep learning neural network (DNN) used here is a feed-forward neural network (FFNN): data flow only in the forward direction and never backwards. Such a network can be trained on massive amounts of data and can handle unlabeled, unstructured and non-linear data. A DNN has three types of layer: input, hidden and output. Each layer contains several neurons, and each neuron applies a function called the activation function. A signal is passed from one neuron to the next based on the input data, and the output is neglected if the signal value is smaller than the threshold value. The output layer is obtained from the input layer through the weights. The DNN gives a better prediction when the weights are optimized with MBGD optimization. The structure of the DNN is given in figure 3, in which pink circles indicate the input layer, blue circles the hidden layers and green circles the output layer. Each layer is fully connected to the next.

Figure 3 Structure of deep neural network.

To determine the precision of the DNN-MBGD model, the coefficient of determination (R2), RMSE, MAE and MSE were used. Error evaluation is an important step in building a machine learning model. This study examines the R2 and RMSE values for 20 water quality parameters.

The coefficient of determination (R2) describes the proportion of the variance in the dependent variable that is explained by the model. R2 cannot exceed 1, and values closer to 1 indicate a better fit. Equation (6) gives the coefficient of determination, where \(x_{i}\) is an observed value, \(x_{j}\) is the corresponding predicted value and \(x_{k}\) is the mean of the observed values.

$$R^{2} = 1 - \frac{{\sum {(x_{i} - x_{j} )^{2} } }}{{\sum {(x_{i} - x_{k} )^{2} } }}.$$
(6)

RMSE is the square root of MSE and reflects the standard deviation of the prediction errors. Equation (7) gives the RMSE, where \(K\) here denotes the number of samples rather than the WQI constant.

$$\sqrt {\text {MSE}} = {\text {RMSE}} = \sqrt {\frac{1}{K}\sum\limits_{i = 1}^{K} {(x_{i} - x_{j} )^{2} } }.$$
(7)
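Equations (6) and (7) can be sketched in plain Python as follows. The function names are illustrative, and the reading of the symbols (observed values, predicted values and the mean of the observed values) is the standard one for these metrics.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination, equation (6)."""
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: observed minus predicted
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: observed minus mean of observed
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean squared error, equation (7)."""
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return math.sqrt(mse)
```

A perfect prediction gives R2 = 1 and RMSE = 0. R2 can become negative when the model predicts worse than the mean of the observations.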

Equation (8) is used to predict the output of DNN-MBGD. In this equation, \(X\) indicates the prediction of output data, \(C\) indicates the weights of hidden layer inputs, \(Y\) represents the input data and \(D\) indicates the bias.

$$X = CY + D.$$
(8)
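A feed-forward pass built from equation (8) can be sketched as below. Each layer computes \(X = CY + D\), and hidden layers then apply an activation function. ReLU is used here purely as an assumption, since the activation function is not specified; it suppresses signals below a threshold of zero. The function names and the example weights are hypothetical.

```python
def dense(inputs, weights, biases):
    """Equation (8) for one layer: X = C * Y + D.

    weights -- one row of weights per output neuron (C)
    biases  -- one bias per output neuron (D)
    """
    return [
        sum(w * y for w, y in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    """Assumed activation: pass positive signals, zero out the rest."""
    return [max(0.0, v) for v in values]


def forward(inputs, layers):
    """Propagate inputs through (weights, biases) pairs, forward only."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index < len(layers) - 1:  # hidden layers only; linear output
            activations = relu(activations)
    return activations


# Hypothetical two-input network with one hidden layer of two neurons.
example_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.5]),
]
print(forward([2.0, 3.0], example_layers))
```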

R2 and RMSE values are used to identify the input parameter with the best performance. After the output data are predicted, the predicted values are compared with those obtained using the ANN technique.

2.4 Water quality parameters

The following 20 parameters are determined from the water samples: Boron (B), Biochemical Oxygen Demand (BOD), Carbonate (CO3), Calcium (Ca), Chemical Oxygen Demand (COD), Chloride (Cl), Electrical Conductivity (EC), Bicarbonate (HCO3), Potassium (K), Magnesium (Mg), Ammonia (NH3), Sodium (Na), Phosphorus (P), Sulphate (SO4), Silicon dioxide (SiO2), Total Alkalinity (TA), Temperature (T), Total Coliform (TC), Total Dissolved Solids (TDS) and pH. The WQI value is determined using these 20 water quality parameters. The data on water quality parameters for the above sites were collected from the Surface Water Quality Station Wise Report (Jan 2019–Jan 2020).


Boron: It is a chemical parameter denoted by B, and the unit is mg/l. The permissible boron limit of the BIS for drinking water is 1 mg/l. Maximum B values occur at site 7, and minimum B values occur at site 1. The B values for each site do not exceed the standard value.


Biochemical oxygen demand: It is a chemical parameter denoted by BOD, and the unit is mg/l, and the BIS permissible BOD limit for drinking water is 2 mg/l. The highest BOD value is at site 10, the lowest BOD value is at site 3. The BOD value exceeds the standard limit with the exception of site 3.


Carbonate: It is denoted by CO3, the unit is mg/l, and the BIS permissible limit for CO3 is 75 mg/l. The CO3 values for site 1, site 3 and site 4 were 12.43, 12.60 and 5.28, respectively. At the other sites, the CO3 value was zero.


Calcium: It is denoted by Ca, the permissible BIS limit for Ca is 75 mg/l, and the unit is mg/l. The highest amount of Ca occurs at site 10, and the lowest Ca presents at site 8; at all 10 sites, Ca does not exceed the standard limit.


Chemical oxygen demand: It is denoted by COD, the unit is mg/l, and CPCB permissible limit for COD is 250 mg/l. Site 3 has a minimum COD of 5.55, and site 10 has a maximum COD of 131. At all 10 sites, the COD does not exceed the standard limit.


Chloride: It is denoted by Cl, the unit is mg/l, and the BIS permissible limit for Cl is 250 mg/l. The lowest Cl is 10.84 at site 3, and the highest Cl is 88.85 at site 10. At all 10 sites, Cl does not exceed the standard limit.


Electrical conductivity: It is denoted by EC, the unit is µS/cm, and the limit value for EC permitted by the WHO is 400 µS/cm. The minimum EC is 291 at site 3, and the maximum EC is 1209 at site 10. Sites 1, 5, 7 and 10 exceed the standard limit.


Bicarbonate: It is denoted by HCO3, the unit is mg/l, and the BIS permissible limit for HCO3 is 300 mg/l. The lowest HCO3 is 124 at site 3, and the highest HCO3 is 391 at site 10. Site 10 exceeds the standard limit.


Potassium: It is denoted by K, the unit is mg/l, and the WHO permissible K limit is 10 mg/l. Minimum K is 3.71 at site 2, and maximum K is 49.23 at site 10. Site 10 exceeds the standard limit.


Magnesium: It is denoted by Mg, the unit is mg/l, and the BIS permissible limit of Mg is 30 mg/l. Minimum Mg is 8.36 at site 3, and maximum Mg is 32.41 at site 10. Site 10 exceeds the standard limit.


Ammonia: It is denoted by NH3; the unit is mg/l, and the BIS permissible limit of NH3 is 0.5 mg/l. Minimum NH3 is 0.05 at site 3, and maximum NH3 is 16.45 at site 10. Site 10 exceeds the standard limit.


Sodium: It is denoted by Na, the unit is mg/l, and the BIS permissible limit of Na is 200 mg/l. Minimum Na is 7.15 at site 8, and maximum Na is 68.78 at site 10. In all 10 sites, Na does not exceed the standard limit.


Phosphorus: It is denoted by P, the unit is mg/l, and the BIS permissible limit of phosphorus for drinking water is 1 mg/l. Minimum P is 0.08 at site 3, and maximum P is 0.63 at site 10. In all 10 sites, P does not exceed the standard limit.


Sulphate: It is denoted by SO4, the unit is mg/l, and the BIS permissible limit of SO4 is 200 mg/l. Minimum SO4 is 13.42 at site 3, and maximum SO4 is 65.03 at site 10. In all 10 sites, SO4 does not exceed the standard limit.


Silicon dioxide: It is denoted by SiO2, the unit is mg/l, and the WHO permissible SiO2 limit is 9.2 mg/l. Minimum SiO2 is 5.55 at site 3, and maximum SiO2 is 14.29 at site 5. Site 1, site 4, site 5 and site 10 exceed the standard limit.


Total alkalinity: It is denoted by TA, the unit is mg/l, and the BIS permissible limit of TA is 200 mg/l. Minimum TA is 103 at site 8, and maximum TA is 352 at site 10. Site 10 exceeds the standard limit.


Temperature: It is denoted by T, the unit is °C, and the WHO permissible temperature limit is 25°C. Minimum T is 41°C at site 10, and maximum T is 52°C at site 7. All sites exceed the standard limit.


Total coliform: It is denoted by TC, and the BIS permissible limit of TC is 10000. Minimum TC is 1409 at site 2, and maximum TC is 8119000 at site 10. All sites except sites 2, 3, 6, 8 and 9 exceed the standard limit.


Total dissolved solids: It is denoted by TDS, the unit is mg/l and the BIS permissible limit of TDS is 500 mg/l. Minimum TDS is 85 at site 3, and maximum TDS is 753 at site 10. Site 10 exceeds the standard limit of TDS.


pH: It indicates whether the water is acidic or basic and has no unit; the BIS permissible pH range is 6.5–8.5. Minimum pH is 7.12 at site 9, and maximum pH is 8.71 at site 1. Sites 1, 3 and 7 exceed the standard limit.

Site 1 represents the Maighat site in the Jaunpur district. Site 2 represents the Turtipar site in the Deoria district. Site 3 represents the Kora-old site in the Fatehpur district. Site 4 represents the Allahabad site in the Prayagraj district. Site 5 represents the Akbarpur site in the Ambedkar Nagar district. Site 6 represents the Birdghat site in the Gorakhpur district. Site 7 represents the Lucknow site in the Lucknow district. Site 8 represents the Garhmukteshwar site in the Amroha district. Site 9 represents the Ankinghat site in the Hardoi district, and site 10 represents the Galeta site in the Meerut district. The standard values of the 20 parameters are given in table 2. The values of the water quality parameters for the corresponding sites are shown in table 3. Figure 4 shows the variation of water quality parameters in 10 sites.
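The comparison of observed values against permissible limits described above can be sketched as a simple check. The limits and the site 10 values below are the ones quoted in the parameter descriptions for a subset of parameters; the dictionary and function names are illustrative.

```python
# Permissible limits quoted in the text (mg/l), for a subset of parameters.
LIMITS = {
    "COD": 250, "Cl": 250, "HCO3": 300, "K": 10, "Mg": 30, "NH3": 0.5,
    "Na": 200, "P": 1, "SO4": 200, "TA": 200, "TDS": 500,
}

# Site 10 values quoted in the text (mg/l).
SITE_10 = {
    "COD": 131, "Cl": 88.85, "HCO3": 391, "K": 49.23, "Mg": 32.41,
    "NH3": 16.45, "Na": 68.78, "P": 0.63, "SO4": 65.03, "TA": 352,
    "TDS": 753,
}


def exceedances(values, limits):
    """Return the sorted names of parameters whose value exceeds its limit."""
    return sorted(name for name, value in values.items() if value > limits[name])


print(exceedances(SITE_10, LIMITS))
# ['HCO3', 'K', 'Mg', 'NH3', 'TA', 'TDS']
```

Range-type limits such as pH (6.5–8.5) would need a two-sided check and are left out of this sketch.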

Table 2 Drinking water standards for 20 water quality parameters.
Table 3 Average values of water quality parameters from Jan 2019 to Jan 2020.
Figure 4 Variation of water quality parameters in 10 sites.

3 Results and discussion

This section presents the prediction of water quality parameters using the hybrid DNN-MBGD technique and the WQI values for the 10 sites in Uttar Pradesh from January 2019 to January 2020. The WQI values are obtained by determining the unit weight, relative weight and quality index.

3.1 WQI calculation

The WQI expresses water quality as a single index number. This study uses the WAWQI method to calculate the WQI values. WAWQI is reported to give accurate results compared with other methods such as the National Sanitation Foundation-WQI (NSF-WQI), the Canadian Council of Ministers of the Environment Index (CCME WQI) and the Oregon-WQI (OWQI), so the WAWQI method was selected for WQI calculation. The first step of the WQI calculation is to determine the unit weight of each parameter. The second step determines the relative weightage, and the third step determines the quality index of each WQ parameter. Excel 2007 software was used for the WQI calculations.

3.1.1 Calculation of \(w_{i}\) and \(W_{i}\)

\(w_{i}\) and \(W_{i}\) of each parameter are determined by equations (1) and (2). The \(w_{i}\) value is obtained from the constant \(K\) and the standard value \(S_{i}\). The summation of the unit weights for the 20 parameters is used to determine the relative weights, as shown in equation (2). Table 4 shows the \(w_{i}\) and \(W_{i}\) values for each parameter.

Table 4 Examined unit weight and relative weightage of each parameter.

3.1.2 Determination of quality index ( \(Q_{i}\) )

The ratio of the observed values of each parameter to the standard values of each WQ parameter is known as the quality index, shown in equation (3). Table 5 indicates the \(Q_{i}\) values for each site.

Table 5 Examined quality index (\(Q_{i}\)) for 10 sites.

3.1.3 WQI determination

Multiplying \(Q_{i}\) and \(W_{i}\) gives the sub-index value shown in equation (4). The sub-indices of the 20 parameters are summed to give the WQI, as shown in equation (5). If the WQI value exceeds 100, the water is unsuitable for drinking. The water condition is very poor if the WQI value is between 76 and 100, poor if it is between 51 and 75, and good if it is between 26 and 50. If the WQI value is 25 or less, the water is suitable for drinking.

The WQI values of the 10 districts of Uttar Pradesh are shown in table 6. Table 6 indicates that the water from these sites is unsuitable for drinking, domestic and irrigation purposes. The WQI value for site 10 is greater than 100. The water has to be treated at all 10 sites in Uttar Pradesh, and a water purifying system must be installed. Because a water purification system reduces excess physical, chemical and biological parameters, it improves the water quality and minimizes harmful contaminants. Although the water at all 10 sites is unsuitable for drinking, site 3 has the lowest WQI value. The water quality parameters are therefore predicted from site 3 using hybrid DNN-MBGD optimization.

Table 6 WQI for 10 sites in Uttar Pradesh.

3.2 Correlation matrix of 20 parameters and WQI

Correlation describes the relationship between two water quality parameters, including the direction in which they vary together. Correlation values can be positive, negative or zero. When two variables change in the same direction, the correlation is positive (up to 1). When they change in opposite directions, the correlation is negative (down to –1), and when there is no relationship between the two variables, the correlation is zero (0). Some existing studies also determine the correlation between water quality parameters (Patil et al. 2020; Geng et al. 2021; Sayess et al. 2021; Srivastava et al. 2022). In this study, the 20 water quality parameters are correlated with each other and with WQI. The correlation matrix of the 20 parameters is shown in table 7. In table 7, the correlation of each variable with itself is 1. Positive values indicate that two variables change in the same direction, and negative values indicate that they change in opposite directions. The two zero-correlation values, shown in the yellow box, indicate no relationship between the two variables concerned.
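A correlation matrix of this kind can be sketched in plain Python. The Pearson coefficient is assumed here, since the type of correlation is not specified; the function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Constant columns have zero variance and an undefined correlation;
    that case is not handled in this sketch.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping column name to values."""
    names = list(columns)
    return {
        a: {b: pearson(columns[a], columns[b]) for b in names}
        for a in names
    }
```

Two series that rise together give a value of 1, two that move in opposite directions give –1, and every diagonal entry of the matrix is 1.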

Table 7 Correlation matrix of 20 water quality parameters.

3.3 Proposed hybrid deep learning neural network–mini-batch gradient descent optimization (DNN-MBGD)

MBGD is a variant of gradient descent (GD) that is used to update the coefficients of a model so as to reduce the model’s errors. MBGD divides the training data into small batches and updates the model after each batch. This reduces the variance of the gradient updates, and the approach is widely used in deep learning implementations. In this study, MBGD optimization is chosen because it gives a higher update frequency than batch GD and is more computationally efficient than stochastic gradient descent (SGD). The DNN model is hybridized with MBGD to update the weights and to improve the prediction accuracy. Each layer consists of multiple neurons whose outputs are passed through an activation function. In the second hidden layer, the weights of the water quality parameters are updated for better prediction.
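The mini-batch update can be illustrated on a deliberately small problem. The sketch below fits a single-feature linear model by mini-batch gradient descent on the mean squared error; it is not the DNN used in the study, and the function name, batch size, learning rate, epoch count and synthetic data are all assumptions chosen for illustration.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size=4, learning_rate=0.3,
                               epochs=500, seed=0):
    """Fit y ~ weight * x + bias by mini-batch gradient descent on MSE."""
    rng = random.Random(seed)
    weight, bias = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new batch composition every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Prediction errors on this batch only
            errors = [(weight * xs[i] + bias) - ys[i] for i in batch]
            # Gradients of the batch MSE with respect to weight and bias
            grad_w = 2.0 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2.0 * sum(errors) / len(batch)
            # One parameter update per mini-batch
            weight -= learning_rate * grad_w
            bias -= learning_rate * grad_b
    return weight, bias


# Synthetic, noise-free data generated from y = 2x + 1.
xs = [i / 10 for i in range(11)]
ys = [2.0 * x + 1.0 for x in xs]
weight, bias = minibatch_gradient_descent(xs, ys)
print(round(weight, 3), round(bias, 3))
```

With 11 samples and a batch size of 4, each epoch performs three updates, compared with one for batch GD and eleven for SGD.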

In this study, 80% of the dataset was used for training and the remaining 20% for testing. The hybrid DNN-MBGD model was developed using these two subsets, and the results showed that it could predict the various elements of water quality with adequate accuracy. Following dataset division, one hidden layer with a number of neurons corresponding to the input features is considered in the first stage. The effectiveness of the various transfer functions is then evaluated, and the best one is selected, namely K. In the subsequent phase, the network size is changed, by increasing the number of neurons or hidden layers, to increase the accuracy of the created models. The weights are optimized by MBGD optimization in the second hidden layer.
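The 80/20 division can be sketched as follows. Random shuffling before the split is an assumption, since the way the records were assigned to the two subsets is not described; the function name is illustrative.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows and split them into training and testing subsets."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


train_rows, test_rows = train_test_split(range(100))
print(len(train_rows), len(test_rows))  # 80 20
```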

After the best-performing water quality parameter (K) is selected, K is predicted from the other parameters. Feature importance analysis reports how relevant each input feature is for predicting water quality. Based on the feature importance score, less important features are eliminated using a backward elimination process. Feature importance is calculated on a scale from 0 to 1, where 0 means the feature is unimportant and 1 means the feature is essential. A negative value indicates a feature that degrades the performance of the model.
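
A backward elimination loop driven by importance scores might look like the sketch below. The threshold, the function names and the fixed scores are hypothetical. In practice the scoring function would refit a model on the remaining features at every round.

```python
def backward_eliminate(features, importance_fn, threshold=0.1):
    """Drop the least important feature, one per round, until all remaining scores reach the threshold."""
    kept = list(features)
    while len(kept) > 1:
        scores = importance_fn(kept)  # re-score using only the remaining features
        weakest = min(kept, key=lambda name: scores[name])
        if scores[weakest] >= threshold:
            break
        kept.remove(weakest)
    return kept


# Hypothetical, fixed importance scores used in place of a fitted model.
FIXED_SCORES = {"Ca": 0.9, "Mg": 0.7, "noise_a": 0.02, "noise_b": -0.05}


def lookup_importance(names):
    return {name: FIXED_SCORES[name] for name in names}


selected = backward_eliminate(list(FIXED_SCORES), lookup_importance)
print(selected)
```

Here the feature with a negative score is removed first, followed by the near-zero one, and the loop stops once the weakest remaining feature scores above the threshold.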

The random forest classifier is one of the most commonly used methods for determining a feature's relevance, and this study uses it to determine the feature importance of the water quality parameters. Figure 5 shows the feature importance score of the input parameters. According to figure 5, the feature importance is 1 for all input parameters, which means that every input parameter is extremely important for predicting water quality. Therefore, the other water quality components, namely B, BOD, CO3, Ca, COD, Cl, EC, HCO3, Mg, NH3, Na, P, SO4, SiO2, TA, T, TC, TDS and pH, were considered as inputs for the prediction of K, as shown in figure 6.

Figure 5 Feature importance score of input parameters.

Figure 6 DNN model structure for predicting a component of water quality (K).

3.4 Prediction of K

Based on the feature importance score, the water quality component K is predicted by the proposed hybrid DNN-MBGD optimization model. As before, 80% of the dataset was used for training and 20% for testing, and both subsets were employed in developing the hybrid DNN-MBGD model. Using LSM, the prediction of potassium (K) is made by equation (8).

$${\text {K}} = {\text {weight}}\, \times \,{\text {B}} + {\text {bias}}$$
(9)
$${\text {K}} = {\text {weight}}\, \times \,{\text {BOD}} + {\text {bias}}$$
(10)
$${\text {K}} = {\text {weight}}\,\, \times \,{\text {CO}}_{3} + {\text {bias}}$$
(11)
$${\text {K}} = {\text {weight}}\, \times \,{\text {Ca}} + {\text {bias}}$$
(12)
$${\text {K}} = {\text {weight}}\, \times \,{\text {COD}} + {\text {bias}}$$
(13)
$${\text {K}} = {\text {weight}}\, \times \,{\text {Cl}} + {\text {bias}}$$
(14)
$${\text {K}} = {\text {weight}}\, \times \,{\text {EC}} + {\text {bias}}$$
(15)
$${\text {K}} = {\text {weight}}\,\, \times \,{\text {HCO}}_{3} + {\text {bias}}$$
(16)
$${\text {K}} = {\text {weight}}\, \times \,{\text {Mg}} + {\text {bias}}$$
(17)
$${\text {K}} = {\text {weight}}\,\, \times \,{\text {NH}}_{3} + {\text {bias}}$$
(18)
$${\text {K}} = {\text {weight}}\, \times \,{\text {Na}} + {\text {bias}}$$
(19)
$${\text {K}} = {\text {weight}}\, \times \,{\text {P}} + {\text {bias}}$$
(20)
$${\text {K}} = {\text {weight}}\, \times \,{\text {SO}}_{4} + {\text {bias}}$$
(21)
$${\text {K}} = {\text {weight}}\, \times \,{\text {SiO}}_{2} + {\text {bias}}$$
(22)
$${\text {K}} = {\text {weight}}\, \times \,{\text {TA}} + {\text {bias}}$$
(23)
$${\text {K}} = {\text {weight}}\, \times \,{\text {T}} + {\text {bias}}$$
(24)
$${\text {K}} = {\text {weight}}\, \times \,{\text {TC}} + {\text {bias}}$$
(25)
$${\text {K}} = {\text {weight}}\, \times \,{\text {TDS}} + {\text {bias}}$$
(26)
$${\text {K}} = {\text {weight}}\, \times \,{\text {pH}} + {\text {bias}}$$
(27)
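
Each of the relations above has the form K = weight × X + bias for a single input parameter X. Assuming that LSM refers to the ordinary least-squares method, the weight and bias for one parameter can be estimated as in the following sketch. The function name and the Na and K readings are hypothetical and are not data from the study.

```python
def fit_line(xs, ys):
    """Least-squares estimates of weight and bias in y = weight * x + bias."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("the input parameter is constant; the weight is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    weight = sxy / sxx
    bias = mean_y - weight * mean_x
    return weight, bias


# Hypothetical Na and K readings, for illustration only.
na = [10.0, 20.0, 30.0, 40.0]
k = [2.5, 4.5, 6.5, 8.5]
weight, bias = fit_line(na, k)
predicted_k = weight * 25.0 + bias  # K predicted for a new Na reading of 25
print(weight, bias, predicted_k)
```

The same fit would be repeated for each of the input parameters to obtain the corresponding weight and bias.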

Figure 7 shows the prediction result for the water quality parameter K and compares the prediction performance of the hybrid DNN-MBGD model with that of machine learning models (Haghiabi et al. 2018). In figure 7, the blue line with stars indicates the observed K values and the green line with stars shows the values predicted by the hybrid DNN-MBGD model. Similarly, the red, orange and violet lines with stars indicate the predictions of the ANN, SVM and GMDH models. The observed K values and those predicted by the hybrid DNN-MBGD model correspond closely, which shows that the hybrid DNN-MBGD-based prediction model is more accurate than the ANN, SVM and GMDH models, because the weight optimization by MBGD in the DNN improves model accuracy. Accurate water quality prediction is crucial for improving pollution control and water management.

Figure 7 Comparison of prediction performance result with ANN, SVM, and GMDH model.

Figure 8 shows the regression plots of the proposed model for the prediction of K. Regression plots are used to examine and understand complex data relationships, and figure 8 shows that the model’s output values are close to the real target values. Figure 9 shows the training and testing accuracy of the proposed model, which is used to establish the model’s accuracy. The data is divided into training and test sets to avoid underfitting and overfitting issues and to improve model performance. The proposed model achieves high training and testing accuracy. Table 8 shows the model’s performance based on R2, RMSE, MAE, MAPE and MSE. According to table 8, the proposed model has the lowest error, which shows that it performs better than the ANN, SVM and GMDH models, because hybridizing MBGD optimization with the DNN gives higher accuracy than the non-hybrid models.

Figure 8 Regression plots by predicting K.

Figure 9 Training and testing accuracy.

Table 8 Performance of the model.
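
The error measures reported in table 8 can be computed from observed and predicted values as in the sketch below. It assumes the standard definitions: R2 is taken as the coefficient of determination and MAPE is expressed in per cent. The function name and the sample values are hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Return R2, RMSE, MAE, MAPE (in per cent) and MSE for paired values."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    # MAPE requires every observed value to be non-zero.
    mape = 100.0 * sum(abs(e / o) for e, o in zip(errors, observed)) / n
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - (mse * n) / ss_tot  # 1 - SS_res / SS_tot
    return {"R2": r2, "RMSE": rmse, "MAE": mae, "MAPE": mape, "MSE": mse}


# Hypothetical observed and predicted K values, for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
metrics = regression_metrics(observed, predicted)
print({name: round(value, 4) for name, value in metrics.items()})
```

A lower RMSE, MAE, MAPE and MSE and an R2 closer to 1 indicate a better fit, which is the basis on which the models are ranked.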

4 Conclusion

This study determined the WQI and predicted the water quality parameters in 10 districts of Uttar Pradesh using a hybrid DNN-MBGD optimization approach. The estimated WQI value is used to check whether the water is suitable for drinking. After the WQI calculation, the relationships between water quality parameters were determined. A novel hybrid DNN technique predicts the water quality parameters, and its performance is compared with that of the ANN, SVM and GMDH models.

The WAWQI method is used to calculate the WQI in 10 districts of Uttar Pradesh. The results indicate that the water from the 10 sites is unsuitable for drinking and for other purposes such as domestic use and irrigation. Because the WQI values for the 10 sites are greater than 100, the water from these experimental sites must be treated, and installing a water purification system is important. Such a system reduces excess physical, chemical and biological parameters, which increases water quality and minimizes harmful contaminants. The correlation matrix of the 20 parameters with WQI determines the relationship between each pair of components. The performance of DNN-MBGD was determined and compared with that of the ANN, SVM and GMDH models. The prediction of the twenty water quality parameters by DNN-MBGD optimization at the ten sites yields better results. The best-performing parameter, K, is then selected and predicted from the other parameters using a mathematical equation. Overall, the hybrid DNN-MBGD optimization model performs better than the ANN, SVM and GMDH models.