
1 Introduction

India is popularly referred to as the land of rivers, since it has been blessed with several water bodies that not only enhance the beauty of the country but also serve as a source of livelihood for a large number of people. Rivers are a major source of sustenance, especially for farmers, since the lands in their proximity are nourished and fertile. These rivers are also worshipped in India for religious reasons, especially "The Ganges", which is considered the holiest of all.

Indian rivers not only nourish the flora and fauna but also attract tourists from all around the world and play an indispensable role in the economy. They are witnesses to how civilisation evolved, and they are significant not only historically but also culturally and religiously. Even dams have not altered their inherent nature: they still originate in the mountains and gush down through the plains and valleys with the same force as they did centuries ago, nourishing the plains with vitality and fertility.

Yet, day by day, people are forgetting the importance of rivers. The rivers now carry fertilisers, pesticides and many other chemical products, and in a number of instances oil spills have harmed aquatic animals. The banks of the rivers are piled up with non-biodegradable waste. Improving the unhygienic and dirty condition of the rivers is not the sole responsibility of the government; as citizens of this nation, we should take special precautions and actions to improve the water quality of rivers. Even in Western countries, citizens themselves take measures to keep their rivers and river banks clean. We must strive to keep this best gift of nature clean and preserve its water quality.

For the present study, 17 rivers were selected, and their water quality in 2011 was predicted using instances from 2008. The number of monitoring stations used to collect data for each river is given in parentheses:

Beas (19), Satluj (20), Ganga (36), Yamuna (19), Brahmaputra (10), Dhansiri (7), Mahi (7), Narmada (6), Tapi (10), Mahanadi (14), Brahmani (11), Baitarni (5), Subarnarekha (6), Godavari (34), Krishna (22), Pennar (4), Cauvery (20). In total, therefore, 250 instances were used for the analysis.

The rest of this paper is organised as follows: Sect. 2 presents the literature review; Sect. 3 describes the materials and methods; Sect. 4 presents the performance comparison; Sect. 5 presents the results and discussion; and Sect. 6 concludes the paper. These are followed by the acknowledgement and the references.

2 Literature Review

In paper [1], the performance of the CART, J48, REPTree, Bayes Net and Naïve Bayes classification algorithms is compared by applying them to a dataset of only 11 attributes for predicting heart attacks. In that work the prediction algorithms are applied using WEKA, as it provides proficiency in analysing, discovering and predicting patterns. The results of the paper led us to conclude that J48, CART and REPTree show the best results, with little difference in their performance. In paper [2], the author compares the results of two decision trees, ID3 and J48, applied to a dataset of students enrolling for an MCA programme. The work explains how the tree-based classification algorithms ID3 and J48 work and how they are used to analyse the data. From the results it can be concluded that the ID3 decision tree algorithm shows an accuracy of 69.69%, compared with 67.67% for J48. In paper [3], the two data mining algorithms used to produce the classification model are the Naïve Bayesian classifier and a decision tree algorithm, applied to a preprocessed student dataset. The decision tree algorithm achieves an accuracy of 93.33%, against 71.67% for the Naïve Bayesian classifier; hence the decision tree algorithm proves to be the better of the two. In paper [4], the author explains heart diseases and the symptoms of heart attack, and discusses various models developed using different data mining techniques. Paper [5] presents satisfactory modifications for the calculation of the water quality index (WQI): nine parameters are normally required to calculate the general water quality index, but when some parameters are missing or unavailable, the modified formula given in that paper helps the user to calculate the WQI.

The National River Conservation Plan (NRCP) [6] was initiated with the launch of the Ganga Action Plan (GAP) in 1985, and in 1995 the GAP was expanded to cover other rivers of the country. At a sanctioned cost of Rs. 5779.41 crore, the NRCP, excluding the GAP-I, GAP-II and National Ganga River Basin Authority (NGRBA) programmes, presently covers polluted stretches of 40 rivers in 121 towns spread over 19 States.

3 Materials and Methods

For the present study, the dataset was created using data on the ambient water quality of aquatic resources in India in 2008 and 2011 [7]. The data includes values of water quality parameters such as Fecal Coliform, Temperature, Nitrate, Biochemical Oxygen Demand (B.O.D.) and pH. It was published on the data portal on December 22, 2014, released under the National Data Sharing and Accessibility Policy (NDSAP) [8], and its contributor is the Ministry of Environment and Forests, Central Pollution Control Board [9].
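For illustration, a WEKA training file in ARFF format for this data might look like the sketch below; the attribute names, class labels and the single data row are assumptions based on the parameters and quality groups mentioned in this paper, not the exact schema or contents of the published dataset:

    % Hypothetical ARFF sketch of the 2008 training data
    @relation rivers2008

    @attribute Temperature   numeric
    @attribute pH            numeric
    @attribute Nitrate       numeric
    @attribute BOD           numeric
    @attribute FecalColiform numeric
    @attribute WaterQuality  {Excellent, Good, Fair, Average}

    @data
    28.5, 7.4, 0.8, 2.1, 900, Good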

In classification [10], objects are assigned to one of a set of predefined classes on the basis of their attribute values, so that the objects within a class are more similar to one another than to objects in other classes.

3.1 Classification Techniques Used

3.1.1 Naïve Bayes [11]

The Naïve Bayes classifier is a probabilistic classifier based on applying Bayes' theorem with a strong presumption of independence between the features: it assumes that, within a given class, the value of each attribute is independent of the values of the other attributes.
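In standard form, for a feature vector $(x_1, \dots, x_n)$ the predicted class is the one maximising the posterior probability, with the individual likelihoods $P(x_i \mid C)$ estimated from the training data; this is the textbook formulation rather than anything specific to our dataset:

$$\hat{C} = \arg\max_{C}\; P(C)\prod_{i=1}^{n} P(x_i \mid C)$$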

3.1.2 J48 [12]

J48 is an extension of ID3. The features of J48 include decision tree pruning, accounting for missing values, derivation of rules, etc. In WEKA, J48 is the open-source Java implementation of the C4.5 algorithm. WEKA (a data mining tool) provides a number of options for tree pruning, and pruning can be employed as a mechanism to guard against overfitting. The aim of the algorithm is to progressively generalise the decision tree until an accurate and flexible tree is obtained. The algorithm can handle both continuous and discrete attributes.
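As a minimal sketch of how these pruning options can be configured in WEKA (the values shown are WEKA's documented defaults, used here purely for illustration; the object slots into the code of Sect. 3.2.2 in place of the Naïve Bayes classifier):

    import weka.classifiers.trees.J48;

    // J48 (C4.5) decision tree with explicit pruning settings.
    J48 tree = new J48();
    tree.setConfidenceFactor(0.25f); // smaller values prune more aggressively
    tree.setMinNumObj(2);            // minimum number of instances per leaf
    tree.setUnpruned(false);         // keep pruning enabled to curb overfitting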

3.1.3 SMO (Sequential Minimal Optimization) [13]

SMO stands for Sequential Minimal Optimization; John Platt invented this algorithm in 1998 at Microsoft Research. It is mainly used for solving the quadratic programming problem that arises when training a support vector machine. On building the model, nominal attributes are transformed into binary ones, and by default all attributes are normalised. The worst-case running time is O(n³).
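A hedged sketch of configuring SMO in WEKA follows; the complexity constant and polynomial kernel shown are WEKA's defaults, made explicit here only to illustrate the choices involved (again, this object can replace the Naïve Bayes classifier in Sect. 3.2.2):

    import weka.classifiers.functions.SMO;
    import weka.classifiers.functions.supportVector.PolyKernel;

    // SMO support vector classifier; nominal attributes are binarised
    // and attributes are normalised by default, as noted above.
    SMO svm = new SMO();
    svm.setC(1.0);                   // soft-margin complexity constant
    svm.setKernel(new PolyKernel()); // default polynomial kernel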

3.1.4 REPTree [14]

REP stands for Reduced Error Pruning. The algorithm is based on minimising the error arising from variance and calculates the information gain using entropy. REPTree builds several trees in repeated iterations and uses regression tree logic. Missing values are handled by splitting the corresponding instances into fractional pieces.
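A minimal configuration sketch in WEKA, with the documented default option values, is shown below; as with the other classifiers, it can be dropped into the code of Sect. 3.2.2:

    import weka.classifiers.trees.REPTree;

    // REPTree with reduced-error pruning: one fold of the training data
    // is held out for pruning, the rest is used to grow the tree.
    REPTree rep = new REPTree();
    rep.setNumFolds(3);  // folds used for reduced-error pruning (default)
    rep.setMaxDepth(-1); // -1 = no limit on tree depth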

3.2 Software Used

The Eclipse [15] open-source IDE (Integrated Development Environment) was used to compile code written against the API of the Waikato Environment for Knowledge Analysis (WEKA) [16], developed by the University of Waikato, New Zealand. WEKA is commonly used for data mining work, as it provides a number of machine learning algorithms along with tools for preprocessing, classification, visualisation, etc. Eclipse is an IDE mostly used for programming in the Java language, but it also supports many other programming languages.

3.2.1 General Algorithm

  1. Load the training dataset.

  2. Set the class index to the last attribute.

  3. Fetch the number of classes.

  4. Print the class values in the training dataset.

  5. Fetch the class string value using the class index.

  6. Create and build the classifier.

  7. Load the test dataset.

  8. Set the class index to the last attribute.

  9. Loop through the test dataset to make predictions.

  10. Fetch the class value of the current instance.

  11. Fetch the class string value using the class's int value.

  12. Fetch the Instance object of the current instance.

  13. Call classifyInstance, which returns a double value for the class.

  14. Use the double value to get the string value of the predicted class.

3.2.2 Code Used

The following is the code used for the Naïve Bayes classifier. The classifier can be changed as required; the rest of the code remains the same.

    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RiversClassification {
        public static void main(String args[]) throws Exception {
            // Load the training dataset and set the class index (steps 1-2).
            DataSource source = new DataSource("C:\\Users\\Bindra\\Desktop\\data set\\claasification\\train2008.arff");
            Instances trainDataset = source.getDataSet();
            trainDataset.setClassIndex(trainDataset.numAttributes() - 1);

            // Print the class values in the training dataset (steps 3-5).
            int numClasses = trainDataset.numClasses();
            for (int i = 0; i < numClasses; i++) {
                String classValue = trainDataset.classAttribute().value(i);
                System.out.println("Class Value " + i + " is " + classValue);
            }

            // Create and build the classifier; Naive Bayes is used here (step 6).
            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(trainDataset);

            // Load the test dataset and set its class index (steps 7-8).
            DataSource source1 = new DataSource("C:\\Users\\Bindra\\Desktop\\data set\\claasification\\test2011.arff");
            Instances testDataset = source1.getDataSet();
            testDataset.setClassIndex(testDataset.numAttributes() - 1);

            // Loop through the test set, predicting each instance (steps 9-14).
            System.out.println("===================");
            System.out.println("Actual,NB Predicted Class");
            for (int i = 0; i < testDataset.numInstances(); i++) {
                double actualClass = testDataset.instance(i).classValue();
                String actual = testDataset.classAttribute().value((int) actualClass);
                Instance newInst = testDataset.instance(i);
                double predNB = nb.classifyInstance(newInst);
                String predString = testDataset.classAttribute().value((int) predNB);
                System.out.println(actual + ", " + predString);
            }
        }
    }
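Assuming the WEKA library jar (typically named weka.jar) is on the classpath, the program can be compiled and run from the Windows command line roughly as follows; the jar location is an assumption about the local setup:

    javac -cp weka.jar RiversClassification.java
    java -cp .;weka.jar RiversClassification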

4 Performance Comparison

Below is the performance of the various classification techniques on the dataset used (Table 1).

Table 1 Percentage error in classification techniques applied for analysis of water quality of rivers

The table has two columns: technique and error percentage. Naïve Bayes shows the best result, and the maximum error is found with the REPTree technique.

Below are the screenshots of the outputs when the code was run on Eclipse IDE using different classifiers (Figs. 1, 2, 3, and 4).

Fig. 1 Output of the code for the Naïve Bayes classifier

Fig. 2 Output of the code for the J48 classifier

Fig. 3 Output of the code for the SMO classifier

Fig. 4 Output of the code for the REPTree classifier

5 Result and Discussion

In this work, Naïve Bayes proved to be the best technique, with the minimum error. The error percentages of the classification techniques applied in our analysis are nevertheless high, because the input dataset is biased: the numbers of instances in the four water quality groups (excellent, good, fair and average) are not the same, with the average and good groups containing far more instances than the fair and excellent groups.
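The error percentages reported in Table 1 can be reproduced programmatically with WEKA's Evaluation class; the following is a minimal sketch (not necessarily the exact code used to produce the paper's figures), reusing the trainDataset, testDataset and nb objects from the code in Sect. 3.2.2:

    import weka.classifiers.Evaluation;

    // Evaluate the trained classifier on the 2011 test set and report
    // the percentage of misclassified instances.
    Evaluation eval = new Evaluation(trainDataset);
    eval.evaluateModel(nb, testDataset);
    System.out.println("Error percentage: " + eval.pctIncorrect());
    System.out.println(eval.toSummaryString());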

6 Conclusion

Out of all the classification techniques we applied on our dataset, Naïve Bayes showed the results with the least error.

In the future, for effective and accurate analysis, modifications need to be applied to these predefined classification techniques, or new classification techniques need to be devised, in order to form correct classes for such biased datasets.
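One readily available mitigation for such class imbalance, pending better techniques, is WEKA's supervised Resample filter, which can bias the training sample towards a uniform class distribution. The following is a hedged sketch of that idea, not something evaluated in this study; trainDataset is the object from Sect. 3.2.2:

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.supervised.instance.Resample;

    // Resample the training data towards a uniform class distribution
    // before building the classifier.
    Resample resample = new Resample();
    resample.setBiasToUniformClass(1.0); // 1.0 = fully uniform class weights
    resample.setInputFormat(trainDataset);
    Instances balanced = Filter.useFilter(trainDataset, resample);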