
1 Introduction

India is popularly referred to as the land of rivers, since it has been blessed with several water bodies that not only enhance the beauty of the country but also serve as a source of livelihood for a large number of people. Rivers are a major source of sustenance, especially for farmers, since the lands in their proximity are nourished and fertile. These rivers are also worshipped in India for religious reasons, especially "The Ganges", which is considered the holiest of all.

Indian rivers not only nourish the flora and fauna but also attract tourists from all around the world and play an indispensable role in the economy. They are witnesses to how civilisation evolved, and they are significant not only historically but also culturally and religiously. Even dams have not altered their inherent nature: they still originate in the mountains and gush down through the plains and valleys with the same force as they did centuries ago, nourishing the plains with vitality and fertility.

Yet, day by day, people are forgetting the importance of rivers. The rivers now carry fertilisers, pesticides and many other chemical products, and in a number of instances oil spills have harmed aquatic animals. The banks of the rivers are piled up with non-biodegradable waste. Improving the unhygienic and dirty condition of the rivers is not the sole responsibility of the government; as citizens of this nation, we should take special precautions and actions to improve the water quality of rivers. Even in Western countries, citizens themselves take measures to keep their rivers and river banks clean. We must strive to keep this best gift of nature clean and preserve its water quality.

For the present study, 17 rivers were selected, and their water quality in 2011 was predicted using instances from 2008. The number of monitoring stations used to collect data for each river is given in parentheses:

Beas (19), Satluj (20), Ganga (36), Yamuna (19), Brahmaputra (10), Dhansiri (7), Mahi (7), Narmada (6), Tapi (10), Mahanadi (14), Brahmani (11), Baitarni (5), Subarnarekha (6), Godavari (34), Krishna (22), Pennar (4), Cauvery (20). In total, therefore, 250 instances were used for the analysis.

The rest of this paper is organised as follows: Sect. 2 presents the literature review; Sect. 3 describes the materials and methods; Sect. 4 presents the performance comparison; Sect. 5 presents the results and discussion; and Sect. 6 concludes the paper. These are followed by the acknowledgement and the references.

2 Literature Review

In paper [1], the performance of the CART, J48, REPTree, Bayes Net and Naïve Bayes classification algorithms is compared by applying them to a dataset of only 11 attributes for predicting heart attacks. In that work the prediction algorithms are applied using WEKA, as it provides proficiency in analysing, discovering and predicting patterns. The results of the paper led us to conclude that J48, CART and REPTree show the best results, with little difference in their performance. In paper [2], the author compares the results of two decision trees, ID3 and J48, applied to a dataset of students enrolling for an MCA programme. The work explains how the tree-based classification algorithms ID3 and J48 work and how they are used to analyse the data. From the results it can be concluded that the ID3 decision tree algorithm shows an accuracy of 69.69%, compared with 67.67% for J48. In paper [3], the two data mining algorithms used to produce the classification model are the Naïve Bayesian classifier and a decision tree algorithm, applied to a preprocessed student dataset. The decision tree algorithm achieves an accuracy of 93.33%, against 71.67% for the Naïve Bayesian classifier; hence the decision tree algorithm proves to be the better of the two. In paper [4], the author explains heart diseases and the symptoms of heart attack, and discusses various models developed using different data mining techniques. Paper [5] presents satisfactory modifications for the calculation of the water quality index (WQI): nine parameters are normally required to calculate the general water quality index, but when some parameters are missing or unavailable, the modified formula given in that paper helps the user to calculate the WQI.

The National River Conservation Plan (NRCP) [6] was initiated with the launch of the Ganga Action Plan (GAP) in 1985, and in 1995 the GAP was expanded to cover other rivers of the country. At a sanctioned cost of Rs. 5779.41 crore, the NRCP, excluding the GAP-I, GAP-II and National Ganga River Basin Authority (NGRBA) programmes, presently covers polluted stretches of 40 rivers in 121 towns spread over 19 States.

3 Materials and Methods

For the present study, the dataset was created using data on the ambient water quality of aquatic resources in India in 2008 and 2011 [7]. The data includes values of water quality parameters such as Fecal Coliform, Temperature, Nitrate, Biochemical Oxygen Demand (B.O.D.) and pH. It was published on the data portal on December 22, 2014, released under the National Data Sharing and Accessibility Policy (NDSAP) [8], and its contributor is the Ministry of Environment and Forests, Central Pollution Control Board [9].
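For illustration, a WEKA training file in ARFF format for this data might look like the sketch below; the attribute names, class labels and the single data row are assumptions based on the parameters and quality groups mentioned in this paper, not the exact schema or contents of the published dataset:

    % Hypothetical ARFF sketch of the 2008 training data
    @relation rivers2008

    @attribute Temperature   numeric
    @attribute pH            numeric
    @attribute Nitrate       numeric
    @attribute BOD           numeric
    @attribute FecalColiform numeric
    @attribute WaterQuality  {Excellent, Good, Fair, Average}

    @data
    28.5, 7.4, 0.8, 2.1, 900, Good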

In classification [10], objects are assigned to one of a set of predefined classes on the basis of their attribute values, so that the objects within a class are more similar to one another than to objects in other classes.

3.1 Classification Techniques Used

3.1.1 Naïve Bayes [11]

The Naïve Bayes classifier is a probabilistic classifier based on applying Bayes' theorem with a strong presumption of independence between the features: it assumes that, within a given class, the value of each attribute is independent of the values of the other attributes.
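In standard form, for a feature vector $(x_1, \dots, x_n)$ the predicted class is the one maximising the posterior probability, with the individual likelihoods $P(x_i \mid C)$ estimated from the training data; this is the textbook formulation rather than anything specific to our dataset:

$$\hat{C} = \arg\max_{C}\; P(C)\prod_{i=1}^{n} P(x_i \mid C)$$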

3.1.2 J48 [12]

J48 is an extension of ID3. The features of J48 include decision tree pruning, accounting for missing values, derivation of rules, etc. In WEKA, J48 is the open-source Java implementation of the C4.5 algorithm. WEKA (a data mining tool) provides a number of options for tree pruning, and pruning can be employed as a mechanism to guard against overfitting. The aim of the algorithm is to progressively generalise the decision tree until an accurate and flexible tree is obtained. The algorithm can handle both continuous and discrete attributes.
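As a minimal sketch of how these pruning options can be configured in WEKA (the values shown are WEKA's documented defaults, used here purely for illustration; the object slots into the code of Sect. 3.2.2 in place of the Naïve Bayes classifier):

    import weka.classifiers.trees.J48;

    // J48 (C4.5) decision tree with explicit pruning settings.
    J48 tree = new J48();
    tree.setConfidenceFactor(0.25f); // smaller values prune more aggressively
    tree.setMinNumObj(2);            // minimum number of instances per leaf
    tree.setUnpruned(false);         // keep pruning enabled to curb overfitting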

3.1.3 SMO (Sequential Minimal Optimization) [13]

SMO stands for Sequential Minimal Optimization; John Platt invented this algorithm in 1998 at Microsoft Research. It is mainly used for solving the quadratic programming problem that arises when training a support vector machine. On building the model, nominal attributes are transformed into binary ones, and by default all attributes are normalised. The worst-case running time is O(n³).
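A hedged sketch of configuring SMO in WEKA follows; the complexity constant and polynomial kernel shown are WEKA's defaults, made explicit here only to illustrate the choices involved (again, this object can replace the Naïve Bayes classifier in Sect. 3.2.2):

    import weka.classifiers.functions.SMO;
    import weka.classifiers.functions.supportVector.PolyKernel;

    // SMO support vector classifier; nominal attributes are binarised
    // and attributes are normalised by default, as noted above.
    SMO svm = new SMO();
    svm.setC(1.0);                   // soft-margin complexity constant
    svm.setKernel(new PolyKernel()); // default polynomial kernel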

3.1.4 REPTree [14]

REP stands for Reduced Error Pruning. The algorithm is based on minimising the error arising from variance and calculates the information gain using entropy. REPTree builds several trees in repeated iterations and uses regression tree logic. Missing values are handled by splitting the corresponding instances into fractional pieces.
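A minimal configuration sketch in WEKA, with the documented default option values, is shown below; as with the other classifiers, it can be dropped into the code of Sect. 3.2.2:

    import weka.classifiers.trees.REPTree;

    // REPTree with reduced-error pruning: one fold of the training data
    // is held out for pruning, the rest is used to grow the tree.
    REPTree rep = new REPTree();
    rep.setNumFolds(3);  // folds used for reduced-error pruning (default)
    rep.setMaxDepth(-1); // -1 = no limit on tree depth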

3.2 Software Used

The Eclipse [15] open-source IDE (Integrated Development Environment) was used to compile code written against the API of the Waikato Environment for Knowledge Analysis (WEKA) [16], developed by the University of Waikato, New Zealand. WEKA is commonly used for data mining work, as it provides a number of machine learning algorithms along with tools for preprocessing, classification, visualisation, etc. Eclipse is an IDE mostly used for programming in the Java language, but it also supports many other programming languages.

3.2.1 General Algorithm

  1. Load the training dataset.

  2. Set the class index to the last attribute.

  3. Fetch the number of classes.

  4. Print the class values in the training dataset.

  5. Fetch the class string value using the class index.

  6. Create and build the classifier.

  7. Load the test dataset.

  8. Set the class index to the last attribute.

  9. Loop through the test dataset to make predictions.

  10. Fetch the class value of the current instance.

  11. Fetch the class string value using the class's int value.

  12. Fetch the Instance object of the current instance.

  13. Call classifyInstance, which returns a double value for the class.

  14. Use the double value to get the string value of the predicted class.

3.2.2 Code Used

The following is the code used for the Naïve Bayes classifier. The classifier can be changed as required; the rest of the code remains the same.

    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RiversClassification {
        public static void main(String args[]) throws Exception {
            // Load the training dataset and set the class index (steps 1-2).
            DataSource source = new DataSource("C:\\Users\\Bindra\\Desktop\\data set\\claasification\\train2008.arff");
            Instances trainDataset = source.getDataSet();
            trainDataset.setClassIndex(trainDataset.numAttributes() - 1);

            // Print the class values in the training dataset (steps 3-5).
            int numClasses = trainDataset.numClasses();
            for (int i = 0; i < numClasses; i++) {
                String classValue = trainDataset.classAttribute().value(i);
                System.out.println("Class Value " + i + " is " + classValue);
            }

            // Create and build the classifier; Naive Bayes is used here (step 6).
            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(trainDataset);

            // Load the test dataset and set its class index (steps 7-8).
            DataSource source1 = new DataSource("C:\\Users\\Bindra\\Desktop\\data set\\claasification\\test2011.arff");
            Instances testDataset = source1.getDataSet();
            testDataset.setClassIndex(testDataset.numAttributes() - 1);

            // Loop through the test set, predicting each instance (steps 9-14).
            System.out.println("===================");
            System.out.println("Actual,NB Predicted Class");
            for (int i = 0; i < testDataset.numInstances(); i++) {
                double actualClass = testDataset.instance(i).classValue();
                String actual = testDataset.classAttribute().value((int) actualClass);
                Instance newInst = testDataset.instance(i);
                double predNB = nb.classifyInstance(newInst);
                String predString = testDataset.classAttribute().value((int) predNB);
                System.out.println(actual + ", " + predString);
            }
        }
    }
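Assuming the WEKA library jar (typically named weka.jar) is on the classpath, the program can be compiled and run from the Windows command line roughly as follows; the jar location is an assumption about the local setup:

    javac -cp weka.jar RiversClassification.java
    java -cp .;weka.jar RiversClassification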

4 Performance Comparison

Below is the performance of the various classification techniques on the dataset used (Table 1).

Table 1 Percentage error in classification techniques applied for analysis of water quality of rivers

The table has two columns: technique and error percentage. Naïve Bayes shows the best result, and the maximum error is found with the REPTree technique.

Below are the screenshots of the outputs when the code was run on Eclipse IDE using different classifiers (Figs. 1, 2, 3, and 4).

Fig. 1 Output of the code for the Naïve Bayes classifier

Fig. 2 Output of the code for the J48 classifier

Fig. 3 Output of the code for the SMO classifier

Fig. 4 Output of the code for the REPTree classifier

5 Result and Discussion

In this work, Naïve Bayes proved to be the best technique, with the minimum error. The error percentages of the classification techniques applied in our analysis are nevertheless high, because the input dataset is biased: the numbers of instances in the four water quality groups (excellent, good, fair and average) are not the same, with the average and good groups containing far more instances than the fair and excellent groups.
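The error percentages reported in Table 1 can be reproduced programmatically with WEKA's Evaluation class; the following is a minimal sketch (not necessarily the exact code used to produce the paper's figures), reusing the trainDataset, testDataset and nb objects from the code in Sect. 3.2.2:

    import weka.classifiers.Evaluation;

    // Evaluate the trained classifier on the 2011 test set and report
    // the percentage of misclassified instances.
    Evaluation eval = new Evaluation(trainDataset);
    eval.evaluateModel(nb, testDataset);
    System.out.println("Error percentage: " + eval.pctIncorrect());
    System.out.println(eval.toSummaryString());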

6 Conclusion

Out of all the classification techniques we applied on our dataset, Naïve Bayes showed the results with the least error.

In the future, for effective and accurate analysis, modifications need to be applied to these predefined classification techniques, or new classification techniques need to be devised, in order to form correct classes for such biased datasets.
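One readily available mitigation for such class imbalance, pending better techniques, is WEKA's supervised Resample filter, which can bias the training sample towards a uniform class distribution. The following is a hedged sketch of that idea, not something evaluated in this study; trainDataset is the object from Sect. 3.2.2:

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.supervised.instance.Resample;

    // Resample the training data towards a uniform class distribution
    // before building the classifier.
    Resample resample = new Resample();
    resample.setBiasToUniformClass(1.0); // 1.0 = fully uniform class weights
    resample.setInputFormat(trainDataset);
    Instances balanced = Filter.useFilter(trainDataset, resample);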