Abstract
Machine learning is used to analyze data from different perspectives, summarize it into useful information, and use that information to predict the likelihood of future events. Classification is one of the main problems in the field of machine learning. The aim here is to study various classification algorithms in machine learning applied on different kinds of datasets. The algorithms used for this analysis are J48, Naive Bayes, multilayer perceptron, and ZeroR. The performance is analyzed using various metrics such as true positive rate, false positive rate, and error rates such as root mean squared error and mean absolute error. The performance of J48 algorithm is better than other algorithms for large datasets. The proposed algorithm still increases the performance in terms of error rates for large datasets. The contemplated algorithm is eventuated by mutating the splitting paradigm in the tree-based algorithms. The experimental analysis demonstrates that the proposed algorithm has reduced error rate as compared with the traditional J48 algorithm.
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
Keywords
5.1 Introduction and Preliminaries
During the past several years, investigation has been concentrated on diverse groups based on the machine learning algorithms due to the extreme require of accurate prophecies. Machine learning is not about giving tight rules by analyzing the datasets rather it is used to predict the likelihood of future events with some certainty. Classification is a machine learning approach used to fore tell cluster association for documents illustration and is a widely used technique in various fields [1].
Machine learning, pervasive computing, statistical analysis, data analytics, etc., are the applications of artificial intelligence (AI), whereas machine learning allows training and strengthens from practice to estimate the eventualities [2,3,4,5, 5,6,7,8].
A well-known test sample label is correlated with a separate result from the model. The extent of precision of the proportion of instances of the test set is grouped consequently by the framework. If precision is tolerable, then this model is used to separate tuples of data class labels, which are unknown [9, 10].
5.2 Literature Work and Methodologies
Classification has been considered as a seminal issue in the area of machine learning [11]. All the time, there has been absolutely a number of enormous surveys on classification algorithms [12, 13], performance evaluation [14,15,16], collations, and assessment of various classification algorithms [2, 17] beside their uses in figuring out real-life problems in the applications of business [9, 18,19,20,21], engineering, medicine [1, 22, 23], etc.
Amudha and Abdul Rauf [24, 25] applied data mining techniques as an approach for intrusion detection to identify whether the deviation from normal usage patterns can be flagged as intrusions and performed a correlative investigation of various classification algorithms.
Voznika and Viana [26, 27], described different approximation algorithms such as statistical algorithms, genetic programming, neural networks and concluded that the best model can be found by trial and error trying different algorithms in order to obtain the best results possible.
Kesavaraj, Sukumaran [13] performed investigation on multifold categorization methods to furnish an exhaustive analysis of machine learning algorithms.
Chintan Shah and Anjali Jeevani [17] compared decision tree, K-nearest neighbor, Naive Bayesian using parameters like correctly classified illustrations, time taken, relative absolute error, kappa statistic, and root relative absolute error on breast cancer dataset.
Dogan and Tanrikulu [18], performed a study that collate and contrast the precision of the classification algorithms. The application of certain classification models on multiple datasets is done in three stages. The research addressed the reliability of the classifiers, studied by demonstration on various datasets.
Rutvija and Pandya [28] performed the extensive analysis on various categorization techniques.
Keerthana [29] focused on image classification approach in order to identify better algorithm for medical image classification.
Classification is an approach of grouping or allocating class labels to a pattern set under the direction of an instructor. Classification is also termed as supervised learning. The patterns are primarily segregated into training and test sets. Training set is used to prepare the classifier, and the test set is prone to estimate the precision of a classifier. The classifiers are categorized into tree-based, rule-based, Bayes, functions, etc. The algorithms that have been chosen for this predictive data mining task include J48 from trees, multilayer perceptron from functions, ZeroR from rules, and Naive Bayes from Bayes.
The most popular supervised classifier which can work well on noisy data is decision tree classifier. There are various other types of classifiers such as Bayesian classification, neural network-based classifier, and support vector machine. A great deal of research has been done for developing efficient methods in the field of machine learning.
The J48 algorithm uses information gain and gain ratios to construct the decision tree for a given dataset. It works by recursively dividing the data on a single attribute, according to the information gain calculated. Each split in the tree represents a node where a decision must be taken, and you go to the following node and the next till you reach the leaf that expresses you the predicted output.
The steps in the J48 algorithm are as follows:
-
(i)
If the requirements are the identical group, the tree illustrates the leaf so that the leaf is substituted by designating in the identical class.
-
(ii)
The feasible information is intended for every characteristic, determined by a check on the attribute. Then, the gain in information is premeditated that would outcome from a examination on the characteristic (attribute).
-
(iii)
Then the best characteristic (attribute) is identified on the foundation of the current selection criterion and that attribute is adopted for ramification.
5.3 Information Gain and Gain Ratio
The information gain is based on the entropy after a dataset is split on an attribute, where entropy is used to estimate the similarity of a sample. If the instance is completely identical, the entropy is zero, and if the instance is evenly separated, it has entropy of one.
The entropy is calculated using the following formula.
Entropy on the other hand is an estimate of impurity. It is characterized for a binary class with estimates a and b as:
Using the above formula, we calculate two entropy values, namely entropy before and entropy after. The entropy before value is calculated before splitting, and entropy after is computed after considering the split. Now by assimilating the entropy before and after the split, we derive an estimate of information gain as denoted below:
At each node of the tree, this computation is carried out for every feature, and the feature with the largest information gain is chosen for the split in a greedy manner. This process is applied recursively from the root-node down and stops when a leaf node contains instances all having the same class, i.e., it stops when the node cannot be divided further. Constructing a decision tree is all about finding attribute that has the highest information gain.
Gain ratio is a modification of the information gain that reduces its bias. It takes into account the number and size of branches while choosing an attribute. There are chances of getting negative values in the existing information gain and gain ratio algorithms.
The idea of the proposed algorithm is to eliminate the negative values. The accuracy of the algorithm can be improved by eliminating the negative values. The proposed algorithm checks if the entropy before value is less than entropy after value and return 0; otherwise, it returns the unknown rate calculated.
Because of the outliers pruning is a significant step to the result. Some instances are present in all datasets which are not well defined and differ from the other instances on its neighborhood.
The classification is performed on the instances of the training set, and tree is formed. There exist various algorithms for performing classification, extracting salient features, opinion mining, processing of scalable web log data using map reduce framework etc. [30,31,32,33,34,35,36,37]. The pruning is performed for decreasing classification errors which are being produced by specialization in the training set. Pruning is performed for the generalization of the tree.
5.4 Results and Discussion
Evaluation of the datasets has been done by using the proposed classification algorithm. Various classification algorithms analyzed are evaluated by the evaluation criteria such as true positive rate, false positive rate, mean absolute error, and root mean square error. Heart disease, mushrooms, and birds are the datasets used for the analysis. These two datasets are taken from UCI machine learning repository. The classification algorithms are applied on the data using tenfold cross validation technique, and the results are then recorded. A sample description of datasets has been represented in Table 5.1.
The considered sample heart disease dataset has 14 attributes and 303 instances that are categorized into 5 classes. Mushrooms dataset has 23 attributes and 8124 instances that are categorized into 2 classes. Experimentation has been done on each dataset.
5.4.1 Data Set 1 (Heart Disease Dataset)
The considered sample heart disease dataset has 14 attributes and 303 instances that have only 5 classes. This dataset has taken from UCI machine learning repository.
True positive rate, false positive rate, root mean square error, and mean absolute error are calculated for J48, Naive Bayesian, multilayer perceptron, ZeroR, and the proposed algorithm, which are represented in Table 5.2.
True positive rate of J48, Naive Bayesian, multilayer perceptron, ZeroR, and the proposed algorithm are 0.558, 0.559, 0.574, 0.541, and 0.558, respectively. It shows that the proposed modified algorithm is able to show the same performance as J48 algorithm.
False positive rate of J48, Naive Bayesian, Multilayer Perceptron, ZeroR, and the proposed algorithm are 0.238, 0.273, 0.189, 0.541, and 0.238, respectively. It shows that the proposed modified algorithm is able to show the same performance as J48 algorithm.
Mean absolute error of J48, Naive Bayesian, multilayer perceptron, ZeroR, and the proposed algorithm are 0.2, 0.1838, 0.2768, 0.2591, and 0.2, respectively. It shows that J48 and the proposed modified algorithm are better in terms of mean absolute error, compared with ZeroR and multilayer perceptron. But for this dataset Naive Bayesian performs better by exhibiting low mean absolute error, i.e., 0.1838.
Root mean squared error of J48, Naive Bayesian, multilayer perceptron, ZeroR, and the proposed algorithm are 0.3867, 0.3368, 0.378, 0.3592, and 0.3367, respectively. It shows that the proposed modified algorithm is better in terms of root mean squared error, compared with the other algorithms, i.e., 0.3367.
For the considered dataset, the proposed modified algorithm shows better performance than other algorithms in terms of the root mean square error. The root mean square error values of various algorithms are pictorially represented in Fig. 5.1 on heart disease dataset.
5.4.2 Data Set 2 (Birds Dataset)
Birds dataset used here is an images dataset. It contains 600 images of 34 samples of 6 types of birds. We applied some filters before classifying the data. This dataset has taken from a Ponce research group repository. True positive rate, false positive rate, root mean square error, and mean absolute error are calculated for J48, Naive Bayesian, multilayer perceptron, ZeroR, and the proposed algorithm, which are represented in Table 5.3.
True positive rate of J48, Naive Bayesian, multilayer perceptron, ZeroR, and the proposed algorithm are 0.372, 0.325, 0.390, 0.165, and 0.372, respectively. It shows that the proposed modified algorithm is able to show the same performance as J48 algorithm.
False positive rate of J48, Naive Bayesian, multilayer perceptron, ZeroR, and the proposed algorithm are 0.126, 0.095, 0.102, 0.169, and 0.026, respectively. It shows that the proposed modified algorithm is able to show better performance compared with all the considered algorithms.
Mean absolute error of J48, Naive Bayesian, multilayer perceptron, ZeroR, and the proposed algorithm are 0.213, 0.2726, 0.2774, 0.2778, and 0.213, respectively. It shows that J48 and the proposed modified algorithm is better for the considered dataset in terms of mean absolute error, compared with Naive Bayesian, ZeroR, and multilayer perceptron.
Root mean squared error of J48, Naive Bayesian, multilayer perceptron, ZeroR, and the proposed algorithm are 0.2536, 0.3421, 0.3784, 0.3727, and 0.2436, respectively. It shows that the proposed modified algorithm is better in terms of root mean squared error, compared with the other algorithms, i.e., 0.2436.
For the considered dataset, the proposed modified algorithm shows better performance than other algorithms in terms of the root mean square error. The root mean square error values of various algorithms are pictorially represented in Fig. 5.2 on birds dataset.
The experimentation has shown that the proposed algorithm outperforms other existing algorithms on various considered datasets.
5.5 Conclusion
This study focuses on finding the right algorithm for classification of diverse datasets. The datasets that used are heart disease and mushrooms. For mushrooms dataset, the decision tree algorithm J48 and the proposed modified algorithm gave better results in terms of root mean squared error (RMSE) and for heart diseases dataset; the proposed modified algorithm gave reduced error rates than the traditional J48 algorithm, Naïve Bayes, multilayer perceptron, and ZeroR. However, it is noticed that the performance of a classifier depends on the dataset used. Although there are many algorithms available, the best one is often found by trial and error. For better results, one must compare or even combine the available algorithms. The performance of the existing algorithms can even be improved with minor modifications.
References
Vandehei, B., et al.: Leveraging the defects life cycle to label affected versions and defective classes. ACM Trans. Softw. Eng. Methodol. 30, 2 (1–35) (2021), Online publication date: 1-Mar-2021
Jahnavi, Y., Radhika, Y.: A cogitate study on text mining. Int. J. Eng. Adv. Technol. 1(6), 189–196 (2012)
Jahnavi, Y., Radhika, Y.: Hot topic extraction based on frequency, position, scattering and topical weight for time sliced news documents. In: 15th International Conference on Advanced Computing Technologies, ICACT 2013
Jahnavi, Y.: Statistical data mining technique for salient feature extraction. Int. J. Intell. Syst. Technol. Appl. (Inderscience Publishers), 18(4) (2019)
Jahnavi, Y., et al.: A novel ensemble stacking classification of genetic variations using machine learning algorithms. Int. J. Image Graph. 2350015(2021)
Jahnavi, Y., Radhika, Y.: FPST: a new term weighting algorithm for long running and short lived events’. Int. J. Data Anal. Tech. Strategies 7(4), 366–383 (2015)
Jahnavi, Y.: A new algorithm for time series prediction using machine learning models. Evol. Intell. (Springer), Accepted, 2022
Jahnavi, Y.: Analysis of weather data using various regression algorithms. Int. J. Data Sci. (Inderscience Publishers), 4(2) (2019)
Parneetkaur, et al.: Classification and prediction based data mining algorithms to predict slow learns in education sector. In: 3rd International Conference on Recent Trends in Computing, 2015
Shiqun, et al.: Research and implementation of classification algorithm on web text mining. In: IEEE International Conference on Semantics Knowledge and Grid, 2007, pp. 446–449
Singh, P., et al.: Machine learning: a comprehensive survey on existing algorithms (2021)
Gulati, H.: Predictive analytics using data mining technique. In: 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom), 2015, pp. 713–716
Binjubeir, M., et al.: Comprehensive survey on Big Data privacy protection. Access IEEE 8, 20067–20079 (2020)
Kumar, et al.: Performance evaluation of decision tree versus artificial neural network based classifiers in diversity of datasets. In: 2011 World Congress on Information and Communication Technologies, 2011, pp. 798–803. https://doi.org/10.1109/WICT.2011.6141349
Rjeily, B., Andres, E., et al.: Medical data mining for heart diseases and the future of sequential mining in medical field. In: Tsihrintzis, G., Sotiropoulos, D., Jain, L. (eds.) Machine Learning Paradigms. Intelligent Systems Reference Library, vol 149. Springer, Cham (2019). https://doi.org/10.1007/978-3-319-94030-4_4.
Sharma, S., et al.: Machine learning techniques for data mining: a survey. In: IEEE National Conference on Computational Intelligence and Computing Research (ICCIC), 2013
Panda, M., et al.: International proceedings on advances in soft computing. Intell. Syst. Appl. 628, 363 (2018)
Dogan, W., Tanrikulu, Z.: A comparitive analysis of classification algorithms in data mining for accuracy, speed & robustness. Springer 14(2), 105–124 (2013)
Panchal, G., et al.: Performance analysis of classification techniques using different parameters. In: Second International Conference, ICDEM, 2010
Tamizharasi, K., UmaRani, Performance analysis of various data mining algorithms. Int. J. Comput. Commun. Inf. Syst. (IJCCIS) 6(3) (2014)
Korting, T.S.: “C4.5 algorithm and multivariate decision trees”, image processing division. In: National Institute for Space Research--INPE
Ian, et al.: Data Mining Practical Machine Learning Tools and Techniques, 3rd edn. Elsevier (2011)
Zuveda, Comparison of the performance of several data mining methods for bad debt recovery in the health care industry. J. Appl. Bus. Res. 21 (2005)
Gayatri, N., et al.: Performance analysis of data mining algorithms for software quality prediction. In: International Conference on Advances in Recent Technologies in Communication and Computing, ARTCom’09, India, October 2009
Amudha, P., Abdul Rauf, H.: Performance analysis of data mining approaches in intrusion detection. In: PACC, International Conference, 2011
Voznika, F., Viana, L.: Data Mining Classification. Springer (2001)
Peter, T.J., Somasundaram, K.: An empirical study on prediction of heart disease using classification data mining techniques. In: IEEE-International Conference on Advances in Engineering Service and Management, 2012
Pandya, R., Pandya, J.: C5.0 Algorithm to improved decision tree with feature selection and reduced error pruning. Int. J. Comput. Appl. 117(16) (2015), (0975-8887)
Keerthana, P., et al.: Performance analysis of data mining algorithms for medical image classification. Int. J. Comput. Sci. Mob. Comput. 5(3) (2016)
Jahnavi, Y.: A New Term Weighting Algorithm for Identifying Salient Events. LAP LAMBERT Academic Publishing (2018)
Tiwari, V., et al.: Applications of the Internet of Things in healthcare: a review. Turk. J. Comput. Math. Educ. 12(12), 2883–2890 (2021)
Haripriya, et al.: Using social media to promote E-commerce business. Int. J. Recent Res. Aspects 5(1), 211–214 (2018)
Sukanya, et al.: Country location classification on Tweets. Indian J. Public Health Res. Dev. 10(5) (2019)
Vijaya, et al.: Community-based health service for Lexis Gap in online health seekers
Bhargav, et al.: An extensive study for the development of web pages. Indian J. Public Health Res. Dev. 10(5) (2019)
Srivani, et al.: An approach for opinion mining by acumening the data through exerting the insights
Jahnavi, Y., et al.: A novel processing of scalable web log data using map reduce framework. In: Proceedings of CVR 2022, Computer Vision and Robotics, ISBN: 978-981-19-7891-3
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Jahnavi, Y., Lokeswara Reddy, V., Nagendra Kumar, P., Sri Sishvik, N., Srinivasa Prasad, M. (2024). Performance Analysis of Various Machine Learning Classifiers on Diverse Datasets. In: Jha, P.K., Tripathi, B., Natarajan, E., Sharma, H. (eds) Proceedings of Congress on Control, Robotics, and Mechatronics. CRM 2023. Smart Innovation, Systems and Technologies, vol 364. Springer, Singapore. https://doi.org/10.1007/978-981-99-5180-2_5
Download citation
DOI: https://doi.org/10.1007/978-981-99-5180-2_5
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-5520-6
Online ISBN: 978-981-99-5180-2
eBook Packages: EngineeringEngineering (R0)