1 Introduction

Groundwater quality assessment is essential for determining the quality of water in a region, as water is not only a basic necessity in every household but also a vital resource for survival. Water conservation has become the need of the hour, as freshwater is being contaminated by factors such as fertilizers, pesticides, and industrial waste. The study of freshwater resources such as groundwater is therefore becoming increasingly important.

The study area for this paper is Raipur, the capital city of Chhattisgarh. In Raipur, the Kharun River is currently the main supply of raw water; groundwater serves as an additional source, with a capacity of 22 million liters per day.

Quality assessment of water from groundwater sources is a tedious task, as it involves manual sampling and analysis. Machine Learning can play a crucial role in automating such tasks, especially in scenarios where a lockdown is imposed due to a pandemic. Prediction using machine learning can keep the analysis of such crucial resources from coming to a halt.

In this paper, we use the concept of the Water Quality Index (WQI) to determine the water quality of 44 different regions of Raipur. The Water Quality Index was calculated from pH, TA, TH, chloride, nitrate, fluoride, and calcium. The quality of groundwater was then assessed using several Machine Learning models, namely Logistic Regression, Decision Tree Classifier, Gaussian Naïve Bayes, Random Forest Classifier, Linear SVC, and XGB Classifier.

2 Methodology

Figure 1 shows the methodology used in this work. The stages include input dataset, data preprocessing, calculating WQI, and classification of groundwater using Machine Learning Models.

Fig. 1 Methodology used: input dataset, data preprocessing, WQI calculation, and classification of groundwater into classes 0-3 (Not Suitable, Good, Excellent, and Poor, respectively)

2.1 Dataset Description

The dataset has been collected from the Geology Department of NIT, Raipur. It consists of groundwater details of samples collected from 44 locations within Raipur, described by 24 features, namely, pH, EC, TDS, TH, TA, \({\text{HCO}}_{3}^{-}\), Cl, \({\text{NO}}_{3}^{-}\), \({\text{SO}}_{4}^{2-}\), F, Ca\(^{2+}\), Mg\(^{2+}\), Na\(^{+}\), K\(^{+}\), Si\(^{4+}\), SSP, SAR, KR, RSC, MR, and CR. Of these, seven features, namely, pH, TA, TH, Cl, \({\text{NO}}_{3}^{-}\), F, and Ca\(^{2+}\), have been used for the calculation of WQI, which further forms the basis of classification.

2.2 Correlation of Features

Figure 2 shows the correlation of the seven parameters used for the calculation of WQI and water quality classification. Correlation measures the strength of the relationship between features, i.e., whether the presence or change of one parameter is associated with the presence or change of another. Correlation is of three types, as listed below; a minimal code sketch of the computation follows the list.

Fig. 2 Correlation matrix of the features (the diagonal entries equal 1.00)

Correlation is of three types:

  • Positive correlation: If two features have a positive correlation, then when the value of one feature increases, the value of the other also tends to rise;

  • Negative correlation: If two features have a negative correlation, then when the value of one feature increases, the value of the other tends to fall;

  • No correlation: When two features have no correlation, their values are independent, i.e., an increase or decrease in one feature has no impact on the value of the other.
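
As a minimal sketch of how the matrix in Fig. 2 can be produced (the CSV filename and column names below are illustrative placeholders, not the actual study files):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the groundwater samples (placeholder filename)
df = pd.read_csv("groundwater.csv")

# The seven parameters used for WQI calculation (placeholder column names)
params = ["pH", "TA", "TH", "Cl", "NO3", "F", "Ca"]

# Pairwise Pearson correlation between the parameters
corr = df[params].corr()

# Heatmap comparable to Fig. 2; the diagonal is 1.00 by definition
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```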

2.3 Calculation of Water Quality Index

Step 1. Calculation of weightage factor

Each parameter is assigned a weight according to its importance for water consumption. The relative weight is given by:

$$ W_{i} = \frac{w_{i} }{{\sum\nolimits_{i = 1}^{n} {w_{i} } }}, $$
(1)

where \(W_i\) denotes the relative weight, \(w_i\) the weight of each parameter, and n the number of parameters. The weights assigned to the seven parameters and the resulting relative weights are shown in Table 1.

Table 1 Weight, relative weight, and BIS standard value for each parameter

Step 2. Calculating sub-index

To obtain the Water Quality Index, a sub-index is first calculated for each parameter, given by:

$$ {\text{SI}}_{i} = W_{i} \times \left( \frac{c}{s} \right) \times 100, $$
(2)

where Wi is the relative weight, c is the value of the parameter in the water sample in mg/l, and s is the standard value of the parameter mentioned in Table 1.

Step 3. Calculating Water Quality Index

$$ {\text{WQI}} = \sum\limits_{i = 1}^{n} {{\text{SI}}_{i} } $$
(3)

The Water Quality Index is used to determine the category of water quality, namely, Excellent, Good, Poor, or Not Suitable for consumption. Table 2 shows the different classes of water quality into which the dataset is bifurcated according to the Water Quality Index.

Table 2 Water quality classes
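
A minimal sketch of the three-step WQI computation is given below. The parameter weights and BIS standard values are illustrative placeholders for the actual entries of Table 1, and the class thresholds are assumed values standing in for the ranges of Table 2:

```python
# Placeholder parameter weights w_i and BIS standards s (see Table 1)
weights = {"pH": 4, "TA": 2, "TH": 2, "Cl": 3, "NO3": 5, "F": 5, "Ca": 2}
standards = {"pH": 8.5, "TA": 200, "TH": 300, "Cl": 250, "NO3": 45, "F": 1.0, "Ca": 75}

def wqi(sample):
    """Steps 1-3: relative weights, sub-indices, and their sum."""
    total_w = sum(weights.values())
    return sum(
        (weights[p] / total_w) * (sample[p] / standards[p]) * 100  # W_i * (c/s) * 100
        for p in weights
    )

def wqi_class(score):
    """Map a WQI score to a quality class (thresholds assumed)."""
    if score < 50:
        return "Excellent"
    if score < 100:
        return "Good"
    if score < 200:
        return "Poor"
    return "Not Suitable"

# Example sample with placeholder measured concentrations (mg/l, pH unitless)
sample = {"pH": 7.4, "TA": 180, "TH": 260, "Cl": 90, "NO3": 20, "F": 0.6, "Ca": 60}
print(wqi(sample), wqi_class(wqi(sample)))
```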

Figure 3 shows the pie chart representation of the occurrence of different categories of water quality. Figure 4 shows the violin plot of the classes of water quality and how they are related to WQI. A violin plot uses density curves to represent numeric data distributions for one or more groups.

Fig. 3 Pie chart representation of the occurrences of different categories of WQ (category 1: 35.4%, category 2: 31.3%, category 3: 18.8%, category 0: 14.6%)

Fig. 4 Violin plot of water quality classes (category_id) based on WQI

Figure 5 shows the distribution and box plot representation of the parameters based on the water quality category. The distribution curves and box plots depict how each feature contributes to the water quality. In both, green represents excellent water quality, blue good, yellow poor, and red water that is not suitable for consumption.

Fig. 5 Density distribution and box plot of each parameter based on water quality category (classes 0-3)

2.4 Machine Learning Models

2.4.1 Logistic Regression

This is a supervised learning-based classification approach. Although it is used for discrete class labels, it outputs probabilistic values between zero and one rather than the labels directly. The threshold value is the most crucial part of Logistic Regression; it lays the foundation for classification and is used to determine whether the outcome is nearer to one or to zero. The precision and recall requirements determine the threshold value; if both precision and recall are one, the threshold is considered perfect.

However, this optimum condition does not always occur; thus, there are two possible approaches: one with high precision but low recall, and the other with low precision but high recall. The threshold is established [1] based on the system's requirements. The sigmoid function (see Fig. 6) is an S-shaped curve used in Logistic Regression to calculate the result relative to the threshold value: the input values are mapped onto the curve and compared against the threshold to see whether they are closer to 1 or to 0. As this description suggests, the Logistic Regression model is ideal for binary classes, but it can also be used for multiple classes by adopting the one-versus-all idea [2].

Fig. 6 Logistic regression: the sigmoid function, an S-shaped curve with the threshold value of 0.5 marked at the center
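
A minimal sketch of this classifier, assuming the DataFrame df and column list params from the correlation sketch, plus a hypothetical category_id label column holding classes 0-3:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = df[params]
y = df["category_id"]  # placeholder label column name

# Hold out a test set; split fractions and seed are illustrative
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# One-vs-rest extends the binary sigmoid model to the multiclass setting
lr = LogisticRegression(multi_class="ovr", max_iter=1000)
lr.fit(X_train, y_train)

# predict_proba returns sigmoid-derived probabilities in [0, 1];
# the class with the highest probability becomes the prediction
print(lr.predict_proba(X_test))
print(lr.score(X_test, y_test))
```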

2.4.2 Decision Tree Classifier

A decision tree is a Supervised Machine Learning approach that produces decisions based on a combination of rules, constructing trees from historical data. The technique has three elements: construction of the maximum tree, selection of the appropriate tree size, and classification of fresh data using the established tree.

Maximum tree creation takes a large amount of time: the observations are split repeatedly until only one class remains at the lowest possible level. Figure 7 depicts the maximum tree's creation. The value to the left of each split node should be smaller than or equal to the value to the right (\(x_i^L \le x_i^R\)); the values referred to here are parameters in the learning sample's parameter matrix.

Fig. 7 Construction of the maximum tree: each parent node splits into left and right children with \(x_i^L \le x_i^R\)

The proper tree size must then be chosen, since the maximum tree can have many layers and be complex; it must be optimized before categorization by shrinking the tree and removing extraneous nodes. Finally, classification is completed, with a class allocated to each observation [3]. A minimal sketch of these three stages is given below.
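
The sketch reuses the train/test split from the Logistic Regression example; the pruning strength ccp_alpha is an illustrative value, not one reported in the paper:

```python
from sklearn.tree import DecisionTreeClassifier

# Stage 1: grow the maximum tree (no depth limit; nodes split until pure)
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)

# Stage 2: select an appropriately sized tree via cost-complexity pruning
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42)
pruned_tree.fit(X_train, y_train)

# Stage 3: classify fresh data using the established tree
print(pruned_tree.predict(X_test))
```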

2.4.3 Gaussian Naïve Bayes

GNB is also known as a statistical, supervised learning-based probabilistic classifier. It is based on Bayes' probability theorem and employs a probabilistic technique for classification. When applied to massive datasets, it produces accurate results in a short amount of time [4].

Four types of probability are involved: the likelihood, which is the probability of the predictor given the class; the predictor prior probability; the class prior probability; and the posterior probability, which is the probability of the class given the predictor. According to the Naïve Bayes assumption, the presence or absence of one characteristic does not affect the presence or absence of another; each characteristic contributes to the categorization in its own way [5].

We employed GNB, a type of NB classifier used for continuous data, which assumes that each class is represented by a Gaussian (normal) distribution with its own parameters. It is simple, computationally fast, and requires only a modest amount of training data. However, since it presupposes feature independence, it may produce inaccurate estimates, hence the label naive. The working of GNB is shown in Fig. 8: each data point (x) is evaluated against the class distributions and assigned to the class under which it is most probable.

Fig. 8 GNB classifier: two overlapping bell-shaped class distributions (class A and class B), with a data point x evaluated against both
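
A minimal sketch, again reusing the train/test split from the Logistic Regression example:

```python
from sklearn.naive_bayes import GaussianNB

# Fit one Gaussian per feature per class from the training data
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Per-class posterior probabilities for each test sample
print(gnb.predict_proba(X_test))
print(gnb.score(X_test, y_test))
```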

2.4.4 Random Forest Classifier

As the name implies, this approach uses many trees to arrive at a classification decision, based on the principle of majority voting. Each tree makes its own prediction, and the class with the highest count is chosen. The overall result is robust because the outputs of the multiple decision trees are combined; even if some trees produce incorrect predictions, the outcomes of the other trees compensate in the final estimate [6]. Because it considers the options of different trees on a majority basis, Random Forest produces superior results. However, when there are many trees, classification can be slow and resource-intensive: this classifier trains quickly, but prediction is slow [7]. Figure 9 depicts the Random Forest Classifier's generic operation.

Fig. 9 Random forest classifier: multiple trees feed their predictions into an aggregation (Σ) block
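
A minimal sketch, reusing the earlier split; the number of trees is an illustrative choice:

```python
from sklearn.ensemble import RandomForestClassifier

# An ensemble of 100 decision trees, each trained on a bootstrap sample
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Prediction is the majority vote across the trees
y_pred = rf.predict(X_test)
print(rf.score(X_test, y_test))
```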

2.4.5 Linear SVC

Linear SVC is a kernel-based classifier designed for linear separation, dividing a dataset into two groups; SVMs are now used to solve a variety of real-world problems [8]. For classification, SVM employs the concept of a hyperplane: it constructs a line or hyperplane that partitions the data into classes, pushing the decision boundary as far as possible from each class so that the gap between the two classes is maximized. It gains efficiency by using a subset of the training points, called support vectors. Because it requires a lot of training time, it performs best on modest datasets. The basic premise of SVM is depicted in Fig. 10, where an optimum hyperplane classifies objects into groups based on the support vectors.

Fig. 10 Linear SVC: an optimum hyperplane separates the two classes, with support vectors on either side defining the maximized margin
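
A minimal sketch, reusing the earlier split; the feature scaling step is a common practice for scale-sensitive SVMs rather than something stated in the text, and the hyperparameters are illustrative:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Standardize the features, then fit a maximum-margin linear separator
svc = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
svc.fit(X_train, y_train)
print(svc.score(X_test, y_test))
```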

2.4.6 XGB

XGB implements gradient-boosted decision trees, which are constructed sequentially. Weights are very significant in XGBoost: all the independent variables are assigned weights and fed into a decision tree, which predicts outcomes. The weights of variables that the tree predicted incorrectly are raised, and the variables are then fed into the next decision tree. The individual classifiers/predictors are finally combined to form a more powerful and precise model. Figure 11 shows the working of XGB [9, 10].

Fig. 11 XGB: training and weighted samples pass through a sequence of weak classifiers (first, second, third, ..., nth), which combine into the final classifier
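
A minimal sketch using the xgboost package and the earlier split; the hyperparameters are illustrative, not those tuned in the study:

```python
from xgboost import XGBClassifier

# Boosted ensemble: each new tree corrects the errors of the previous rounds
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))
```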

2.5 Results

Table 3 compares the different classifiers used for predicting groundwater quality with the stated Machine Learning approach. Random Forest and Decision Tree Classifiers perform best, giving 100 percent accuracy. XGB also performed well, which suggests that with labeled data, i.e., a supervised learning approach, classification works better with the XGB, RF, and DT classifiers.

Table 3 Comparative results of different classifiers

Figure 12 shows the confusion matrix of random forest classification.

Fig. 12 Confusion matrix of RF: the highlighted diagonal entries are 2, 5, 2, and 3
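
A minimal sketch of the evaluation, reusing the fitted models from the earlier sketches, which produces the per-classifier accuracies of Table 3 and the confusion matrix of Fig. 12:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

models = {"LR": lr, "DT": pruned_tree, "GNB": gnb,
          "RF": rf, "SVC": svc, "XGB": xgb}

# Accuracy of each classifier on the held-out test set
for name, model in models.items():
    print(name, accuracy_score(y_test, model.predict(X_test)))

# Confusion matrix of the Random Forest predictions (cf. Fig. 12)
print(confusion_matrix(y_test, rf.predict(X_test)))
```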

2.6 Conclusion and Future Work

Groundwater quality prediction and classification is a multiclass problem. Extreme Gradient Boost, Random Forest, and Decision Tree Classifiers proved to perform best on this problem. The dataset for such problems is usually raw, with some missing values and uneven scales, so data preprocessing needs to be done to get the best classification results. In the future, Deep Learning techniques can be applied in this domain, especially when the dataset is large, and time series forecasting can be explored to track water quality trends over time.
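
As a minimal sketch of the preprocessing noted above (the imputation strategy and scaler are illustrative choices, not the authors' stated pipeline):

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline

# Fill missing values with the column mean, then rescale features to [0, 1]
prep = make_pipeline(SimpleImputer(strategy="mean"), MinMaxScaler())
X_clean = prep.fit_transform(X)
```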