Introduction

Code smells indicate issues with software design. In addition to making the code more difficult to comprehend, a code smell can make modifications harder and increase proneness to mistakes. By identifying and removing code smells from a program, software engineers can understand the code more efficiently.

Software is becoming more and more complicated because of larger and more numerous modules, more complicated requirements, and code smells, among other things. Challenging requirements are difficult to assess and comprehend, making development hard and sometimes beyond the scope of developers, but code smells can be recognized and refactored to make the software simpler, more straightforward, and easier to produce and maintain [1]. Software engineering principles are required to develop better quality software [2]. In general, developers concentrate on functional requirements and overlook nonfunctional requirements such as maintainability, development, verifiability, reusability, and comprehensibility [3].

The severity of code smells is a significant consideration when reporting the outcome of code smell detection, as it allows refactoring efforts to be prioritized. High-severity code smells can become a significant and complex problem for a software system's maintainability. Thus, detecting code smells as well as their severity is very useful for software developers to minimize maintenance costs and improve software quality [4].

The purpose of this research is to detect the severity of code smells to help software developers minimize maintenance costs and improve software quality.

In the literature, a number of code smell severity detection techniques have been developed [5,6,7,8]. Each technique gives different outcomes because smells can be interpreted subjectively and, therefore, can be described in various ways [4]. To the best of our knowledge and the available literature, only Fontana et al. [4] and Abdou [27] have determined code smell severity on the code smell datasets. They used ordinal and multinomial machine-learning algorithms (MLA) for code smell severity detection. In addition, they used ranking-based correlation to find the best algorithm. The approaches of Fontana et al. [4] and Abdou [27] have the following limitations: they did not present class-wise accuracy in their studies; they did not consider other performance metrics such as precision, recall, and F-measure; and although they used different MLAs, ensemble learning methods were not applied.

This study hypothesizes that the severity of code smells can be detected using machine-learning and ensemble learning methods, and it presents class-wise outcomes.

Contributions: In this work, we have applied seven MLA/ensemble learning models (logistic regression (LR), Random Forest (RF), KNN, decision tree (DT), AdaBoost, XGBoost, and Gradient Boost) with the Chi-square feature selection approach on each dataset. In addition, we applied grid search and random search-based parameter optimization techniques (POT) to see the effect of parameter optimization on the classification results of code smell severity detection. We achieved the highest severity classification accuracy (SCA), 99.12%, with the XGBoost model for the LM dataset.

The benefit of this research is that detecting severity from the code smell severity datasets helps make source code more effective, accurate, and easily understandable by programmers and users.

The following is the outline of this paper: the next section explains related works. The third section describes the datasets' structure and the proposed models. The fourth section describes the experimental results of our proposed models. The fifth section discusses our results and compares them with baseline results, and the final section concludes the work.

Related Work

Various machine learning-based algorithms and techniques have been used by researchers for code smell detection and also for code smell severity classification. The related work section is divided into two parts: the first part discusses research works on code smell detection using machine-learning techniques, and the second part discusses research works related to the severity of code smells.

Machine-Learning Techniques for Code Smell Detection

Fontana et al. [9] presented a comparison among 16 MLAs on 4 code smell datasets derived from 74 Java systems, with manually validated examples in the training dataset for code smell detection. In addition, boosting techniques were applied to the four code smell datasets.

Mhawish et al. [10] suggested MLAs for detecting code smells in source code. They used two feature selection approaches based on the genetic algorithm (GA) and a POT based on grid search. Using the GA_CFS approach, they obtained the highest accuracies of 98.05%, 97.56%, and 94.31% for the data class (DC), god class (GC), and long method (LM) smells, respectively. They also obtained the highest accuracy of 98.38% for the LM using the GA-Naïve Bayes feature selection method. Mhawish et al. [11] presented the prediction of code smells with the help of MLAs and software metrics. They also used GA-based feature selection methods to increase the efficiency of these MLAs by identifying the appropriate features in every dataset. Furthermore, they used a grid search-based POT to improve the performance of the approaches. The RF model obtained the highest accuracies of 99.71% and 99.70% in predicting the DC in the original and refined datasets.

Kaur et al. [12] proposed a correlation-based metric selection strategy and an ensemble learning method for detecting code smells in three publicly available Java datasets. They used bagging and the RF classifier and examined each method with four performance measures: accuracy (P1), G-mean 1 (P2), G-mean 2 (P3), and F-measure (P4). Pushpalatha et al. [13] suggested that the severity of bug reports for closed-source datasets could be predicted. For this, they used the PROMISE repository to obtain the dataset (PITS) for NASA projects. They improved the accuracy by employing ensemble learning strategies and two dimensionality reduction methods, Chi-square analysis and information gain. Alazba et al. [14] applied 14 MLAs and a stacking ensemble learning algorithm on six code smell datasets. The results of the MLAs were compared, and they found that the best accuracy was 99.24%, achieved by the Stack-SVM algorithm on the LM dataset. A search-based technique to enhance code smell detection using the Whale optimization method as a classifier was proposed by Draz et al. [15]. They studied five open-source software projects, detected nine code smell types, and achieved an average of 94.24% precision and 93.4% recall.

Dewangan et al. [16] presented six MLAs with two FSTs (Chi-square and wrapper-based) to pick the significant features from each dataset; the grid search technique was then applied to improve the models' outcomes, and they achieved 100% accuracy using the LR model for the LM dataset. Reis et al. [17] proposed a crowdsmelling approach that uses collective knowledge for code smell detection with the LM, GC, and feature envy (FE) datasets. They applied six MLAs to detect the code smells. They obtained the highest outcome, an ROC area of 0.896, using the Naive Bayes algorithm for GC detection and an ROC area of 0.870 using the AdaBoostM1 algorithm for LM detection. The worst performance was an ROC area of 0.570 using the RF algorithm for FE detection. Oort et al. [18, 19] presented a study to examine the occurrence of code smells in machine-learning projects. They collected 74 machine-learning projects and then applied the Pylint tool to those projects. After this, they assembled the distribution of Pylint messages per category per project, the top 10 code smells in these projects overall, and the top 20 code smells per category. They found that the PEP8 convention for identifier naming may not always be appropriate in machine-learning code because of its similarity to mathematical notation. They also detected serious problems with the management of dependencies (requirements), which present the main threat to the maintainability and reproducibility of Python machine-learning projects.

Boutaib et al. [20] proposed a bi-level multi-label detection of smells (BMLDS) tool to reduce the population of classification series for detecting multi-label smells. They implemented a bi-level scheme in which the upper-level part discovers the best classification for each measured series, and the lower-level part constructs the series.

Abdou et al. [21] proposed three ensemble methods (bagging, boosting, and rotation forest) with a resample technique to detect software defects. They used seven datasets given in the PROMISE repository. They found that the ensemble methods give better accuracy than single learning methods, achieving the highest accuracy of 93.40% with the Random Forest plus resample technique on the KC1 dataset.

Dewangan et al. [22] introduced ensemble and deep learning algorithms as methods for identifying code smells. They applied a Chi-square-based FST to pick the significant features from each code smell dataset, and the SMOTE technique was used to balance the datasets. They achieved an accuracy of 100% with all of the ensemble approaches on the LM dataset.

Dewangan et al. [23] proposed five classification models to detect code smells. They used four code smell datasets: DC, GC, FE, and LM. They obtained the best accuracy of 99.12% using the Random Forest model for the FE dataset.

Dewangan et al. [24] proposed three ML algorithms to detect code smells. A principal component analysis (PCA)-based FST was used to pick the significant features from each code smell dataset. They obtained the best accuracy of 99.97% using the PCA-based logistic regression (PCA_LR) model for the DC dataset.

There is a notable difference between these works and the technique we used in this paper. The majority of the above studies focus on the identification of code smells as described by Fowler et al. [25]. Most of the previous research papers examined only a few systems and applied MLAs. However, they did not address the severity of code smells in their work.

Machine-Learning Techniques for Code Smell Severity Detection

Vidal et al. [5] proposed a tool for detecting code smells using textual analysis. For this purpose, they conducted two separate experiments. They first performed a software repository mining study to examine how engineers spot code smells through textual or structural cues. They then carried out a user study with industrial developers and quality experts to qualitatively examine how they detect code smells using the two different sources of information. They discovered that textual code smells are easier to identify.

Liu et al. [6] proposed severity prediction of bug reports based on feature selection methods. They established a ranking-based strategy to enhance existing feature selection methods and designed an ensemble feature selection method by merging existing ones. They applied eight feature selection methods and found that the ranking-based strategy achieves the highest F1 score of 54.76%. Tiwari et al. [7] proposed a method to find LMs and their severity, which shows the importance of refactoring LMs. They found that the method matches the expert's severity evaluation for half of the methods within a tolerance level of one. In addition, for methods identified with high severity, the evaluation is roughly equivalent to an expert's judgment.

MLAs were suggested by Baarah et al. [8] for closed-source software bug severity detection. They assembled a dataset from past bug information stored in the JIRA bug tracking system connected to different closed-source projects built by INTIX Company in Amman, Jordan. They evaluated eight MLAs, namely Naive Bayes, Naive Bayes Multinomial, SVM, Decision Rules (JRip), Logistic Model Trees, DT (J48), RF, and KNN, in terms of accuracy, F-measure, and area under the curve (AUC). The DT model obtained the highest performance, with an accuracy, AUC, and F-measure of 86.31%, 90%, and 91%, respectively.

Gupta et al. [26] proposed a hybrid method to examine the severity of code smell intensity in the Kotlin language and to find which code smells are equivalent in the Java language. They used five types of code smells in their work: complex method, long parameter list, large class, LM, and string literal duplication. They applied various MLAs and found that the JRip algorithm achieved the best performance, with 96% precision and 97% accuracy. Abdou et al. [27] proposed the classification of code smell severity using MLAs based on regression, multinomial, and ordinal classifiers. In addition, they applied the local interpretable model-agnostic explanations (LIME) approach to explain the MLAs, and prediction rules produced by the PART algorithm to assess the efficiency of the features. The LIME algorithm helps gain a deeper knowledge of the model's decision-making process and the characteristics that affect the model's decisions. They found the highest accuracy of 92–97%, measured by Spearman correlation. Hejres et al. [28] proposed the detection of code smell severity using MLAs. They applied three models, sequential minimal optimization (SMO), artificial neural network (ANN), and J48, to detect code smell severity from four datasets. They obtained the best results for the GC and FE datasets using the SMO model, while for the LM dataset the best accuracy was obtained using an adaptive neural network ensemble (ANNE) with the SMO model.

Nanda et al. [29] proposed a combination of SMOTE and a stacking model to classify the severity of the GC, DC, FE, and LM datasets. They improved the performance from 76 to 92%. Fontana et al. [4] proposed MLAs for classifying the severity of code smells. They implemented different MLAs, spanning from multinomial classification to regression and binary classifiers for ordinal classification. They found that the correspondence between the actual and predicted severity for the top techniques reached 88–96%, calculated by Spearman's ρ.

To the best of our knowledge and the available literature, most authors have used different code smell datasets and code smell severity datasets with different types of MLAs, multinomial, and regression techniques to detect code smells and their severity. It has been observed that the effect of grid search and ensemble learning algorithms on the severity datasets has not been studied earlier. Therefore, to study and analyze the effect of MLA and ensemble learning approaches, we have used MLAs and ensemble learning methods with grid search, random search, and Chi-square feature selection techniques to find the severity of code smells from the code smell severity datasets.

Proposed Model and Dataset Description

The severity of code smells is significant to study when describing code smell detection results, since it helps prioritize refactoring efforts. This research work builds a model for detecting the severity of code smells using machine-learning methods. A step-by-step research framework designed for code smell severity detection is shown in Fig. 1. First, we collected the code smell severity datasets from Fontana et al. [4]. Then, we applied the min–max normalization technique to rescale the feature values in the datasets. After that, we used the Chi-square feature selection algorithm to extract the best features from the datasets. Then, we applied grid search and random search POTs. Then, we applied MLAs with fivefold cross-validation, and finally, we obtained the performance measurements.

Fig. 1 Proposed model
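The following sketch illustrates how the pipeline in Fig. 1 could be assembled with scikit-learn and the xgboost package; the synthetic data generated by make_classification merely stands in for one of the severity datasets, and the parameter grid is illustrative rather than the exact ranges of Table 5.

from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Synthetic stand-in for a severity dataset: 420 instances, 61 metrics, 4 severity levels (0-3)
X, y = make_classification(n_samples=420, n_features=61, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

pipeline = Pipeline([
    ("normalize", MinMaxScaler()),          # min-max normalization (Eq. 1)
    ("select", SelectKBest(chi2, k=10)),    # Chi-square feature selection (top 10 metrics)
    ("model", XGBClassifier()),             # one of the seven MLA/ensemble models
])

# Parameter optimization (grid search here; random search works analogously) with fivefold CV
param_grid = {"model__n_estimators": [50, 100, 200], "model__max_depth": [3, 5, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)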

The following research questions are addressed in this paper.

RQ1: Which MLA/ensemble learning algorithm detects code smell severity best?

Motivation: Baarah et al. [8] and Fontana et al. [4] proposed various MLAs, such as Naive Bayes, Naive Bayes Multinomial, SVM, and Decision Rules (JRip), as well as multinomial classification, regression, etc. Alazba et al. [14] applied both MLAs and stacking ensemble learning algorithms and compared their performance. They discovered that ensemble learning algorithms were more accurate than MLAs. Therefore, to investigate and observe the effect of MLAs and ensemble learning algorithms on code smell severity detection, we applied both MLAs and ensemble learning algorithms.

RQ2: What is the effect of applying the feature selection method in code smell severity detection?

Motivation: Liu et al. [6], Mhawish et al. [10, 11], Kaur et al. [12], and Dewangan et al. [16] studied the influence of various feature selection algorithms on the performance measures. They found an enhancement in performance accuracy when feature selection methods were applied. Therefore, to examine the effect of the feature selection method on each model's accuracy and to extract the software metrics that play an essential role in the code smell severity detection process, we used the Chi-square-based feature selection method in our work.

RQ3: Does the hyper-parameter optimization algorithm enhance the performance of the detection of the code smell’s severity?

Motivation: Mhawish et al. [10, 11] applied grid search-based parameter optimization and found improvements in their results. Therefore, to study and analyze this effect, tuning of the MLA and ensemble learning algorithm parameters has been applied in this work.

Dataset Description

In this study of code smell severity detection, we have taken four datasets from Fontana et al. [4], which consist of two class-level datasets named data class (DC) and god class (GC), and two method-level datasets named feature envy (FE) and long method (LM). These datasets can all be found at http://essere.disco.unimib.it/reverse/MLCSD.html. Fontana et al. [4] selected 76 systems out of 111, characterized by different sizes and a large set of object-oriented features. They considered the Qualitas Corpus of systems collected by Tempero et al. [30] for the system selection. They employed a variety of tools and methods, called advisors, to find code smell severity: iPlasma (GC, Brain Class), Anti-pattern Scanner [31], PMD [32], iPlasma, Fluid Tool [33], and Marinescu detection rules [34]. Table 1 shows the automatic detection tools.

Table 1 Automatic detector tools (advisors)

Code Smells Classification of Severity

After manual assessment of code smells, a severity value between 1 and 4 is allocated to each assessed occurrence. The values and their meanings are described as follows:

  1. Allocated for no smell: a method or class that is unaffected by the code smell.

  2. Allocated for non-severe smell: a method or class that is only slightly impacted by the code smell.

  3. Allocated for smell: the class or method possesses all the properties of a smell.

  4. Allocated for severe smell: there is a smell that is extremely strong in terms of size, complexity, or coupling.

Each code smell dataset has 420 instances (classes or methods), with 63 features for the DC and GC datasets and 84 features for the FE and LM datasets. Details as given by Fontana et al. [4] are shown in Table 2.

Table 2 Dataset configuration [4]

The code smell datasets used in this paper are defined as follows:

Data class (DC): It refers to classes that merely hold data with basic functionality and are heavily relied upon by other classes. A DC reveals many of its attributes, is not complicated, and exposes data via accessor methods [4].

God class (GC): It describes classes that concentrate the system's intelligence. The GC is considered among the most complex code smells because many responsibilities, actions, and tasks accumulate in it. It causes problems related to large code size, coupling, and complexity [4].

Feature envy (FE): It refers to methods that make extensive use of data from classes other than their own. Such a method prefers to use other classes' features, including features accessed through accessor methods [4].

Long method (LM): It refers to methods that tend to centralize a class's functionality. An LM has a lot of code, is complicated and difficult to understand, and heavily relies on data from other classes [4].

Dataset Composition (Structure)

There are 420 manually evaluated classes or methods in each dataset. Table 2 shows the dataset configuration, including the number of instances assigned to each severity level. The least frequent severity level in the datasets is 2, and the two class-based smells (DC and GC) have a different balance from the two method-based smells (FE and LM) with respect to severity levels 1 and 4 [4].

Normalization Technique

The datasets have a wide range of features; therefore, in such a case the features should be normalized before MLAs are used. The min–max feature scaling strategy was used in this paper to rescale the datasets’ range of feature or observable values between 0 and 1 [35]. The min–max formula is shown in Eq. 1 with X′ denoting the normalized value and X representing the initial real value. The feature’s Xmin and Xmax values are altered to “0” and “1”, respectively, while every other value is modified to a decimal between “0” and “1”:

$$X^{\prime}=\frac{X-{X}_{\mathrm{min}}}{{X}_{\mathrm{max}}-{X}_{\mathrm{min}}}$$
(1)
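For illustration, Eq. (1) can be applied column-wise with NumPy, as in the minimal sketch below; scikit-learn's MinMaxScaler performs the same rescaling.

import numpy as np

def min_max_normalize(X):
    # Rescale each feature (column) of X to [0, 1] as in Eq. (1): X' = (X - Xmin) / (Xmax - Xmin)
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)    # note: constant columns would need special handling

# Example: a metric with values 10, 20, 30 becomes 0.0, 0.5, 1.0
print(min_max_normalize([[10], [20], [30]]).ravel())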

Feature Selection Algorithm

Feature selection is used to find the most significant features in a dataset and to refine results by identifying the features that contribute most to distinguishing between classes [36]. We used a Chi-square-based feature selection technique in this study to extract the best features from each dataset.

Typically, Chi-square feature selection is used on categorical datasets. By examining the relationship between each feature and the target class, Chi-square aids in choosing the optimal features. Equation 2 provides the formula for the Chi-square test [36]:

$${X}^{2}=\sum \frac{{(\mathrm{Observed\;frequency}-\mathrm{Expected\;frequency})}^{2}}{\mathrm{Expected\;frequency}}.$$
(2)

We can calculate the observed frequency and the expected frequency for the response and independent variables. The Chi-square statistic measures the difference between these two numbers: the larger the difference between observed and expected frequencies, the more likely the feature and the response variable are dependent on each other.

The top 10 features were extracted from each dataset. Table 3 shows the features extracted by the Chi-square feature selection method from each dataset. A detailed explanation of the selected metrics is given in Table 21 (Appendix).

Table 3 Chi-square feature selection method’s extracted feature
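A minimal sketch of this selection step with scikit-learn follows; the random matrix, severity labels, and metric names are placeholders standing in for the actual (normalized, non-negative) datasets and metric names of Table 3.

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.random((420, 61))                       # hypothetical normalized metric values
y = rng.integers(1, 5, size=420)                # hypothetical severity labels 1-4
feature_names = [f"metric_{i}" for i in range(61)]

selector = SelectKBest(score_func=chi2, k=10)   # keep the 10 highest-scoring metrics
X_selected = selector.fit_transform(X, y)

# Rank metrics by their Chi-square score (mirrors how Table 3 was obtained)
ranked = sorted(zip(feature_names, selector.scores_), key=lambda t: t[1], reverse=True)
for name, score in ranked[:10]:
    print(name, round(score, 3))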

Hyper-parameter Tuning/Optimization

A hyper-parameter is a parameter value that controls the learning process. In machine-learning and ensemble learning methods, hyper-parameter optimization or tuning is used to select the best parameters for each algorithm. The best parameters for every method differ based on the learning dataset. The various parameter value combinations for each algorithm should be tried in order to find the exact parameters that will allow the predictive algorithm to successfully forecast the test dataset [11].

In this study, grid search and random search-based POTs are applied to select the optimal parameter values for every method. Parameter optimization is necessary to select an algorithm's optimal hyper-parameters, which help produce the most accurate results. It relies on an exhaustive search over the candidate parameter values for the combination that yields the best prediction performance [38]. To avoid overfitting the algorithm on the test dataset, fivefold cross-validation is performed; it is applied to determine the efficiency of every conceivable parameter combination.

We applied the grid search and random search algorithms to each machine-learning and ensemble learning algorithm. For numeric parameters, we specified a value range and a number of steps. As indicated in Table 5, the values to be checked are distributed between the range's upper and lower boundaries according to the step count allocated to each parameter. The best-selected parameters and the allotted step counts for the grid search and random search are shown in Tables 4 and 5.

Table 4 Selected best parameters for each model
Table 5 Number of allotted steps to every parameter (parameter tuning)
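The two POTs could be set up as in the following sketch; the Random Forest parameter ranges and the placeholder data are illustrative and do not reproduce the exact values of Tables 4 and 5.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

rng = np.random.default_rng(0)
X_selected = rng.random((420, 10))   # placeholder for the 10 selected metrics
y = rng.integers(1, 5, size=420)     # placeholder severity labels 1-4

param_space = {                      # illustrative ranges, not the exact steps of Tables 4 and 5
    "n_estimators": [10, 50, 100, 200],
    "max_depth": [5, 10, 15, 20],
    "min_samples_split": [2, 5, 10, 15],
}

grid = GridSearchCV(RandomForestClassifier(random_state=0), param_space, cv=5, scoring="accuracy")
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_space,
                          n_iter=20, cv=5, scoring="accuracy", random_state=0)
grid.fit(X_selected, y)              # exhaustive search over all 64 combinations
rand.fit(X_selected, y)              # samples 20 random combinations
print("grid search:  ", grid.best_params_, round(grid.best_score_, 4))
print("random search:", rand.best_params_, round(rand.best_score_, 4))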

Validation Methodology

This study has used a validation methodology to calculate each experiment's performance. MLAs are evaluated using fivefold cross-validation, which divides each dataset into five segments and repeats training five times. In each repetition, one part serves as the test set and the remaining parts form the training set.
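A minimal sketch of this scheme with scikit-learn's cross_val_score follows (the data are placeholders; any of the seven models can be substituted for the decision tree).

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((420, 10))            # placeholder feature matrix (selected metrics)
y = rng.integers(1, 5, size=420)     # placeholder severity labels 1-4

# Fivefold cross-validation: each fold serves once as the test set,
# while the remaining four folds form the training set.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5, scoring="accuracy")
print("per-fold accuracy:", scores, "mean:", scores.mean())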

Performance Capacity

To evaluate the performance capacity of all the models, a set of experimental measurements such as true positive (TP), true negative (TN), false positive (FP), and false negative (FN) are computed with the confusion matrix. In this study, we have evaluated various experiments.

  • TP (true positive) stands for the outcomes in which the system correctly predicts the positive class.

  • TN (true negative) stands for the outcomes in which the system correctly predicts the negative class.

  • FP (false positive) stands for the outcomes in which the system predicts the positive class while the class is actually negative; that is, the system wrongly identifies a positive class.

  • FN (false negative) stands for the outcomes in which the system predicts the negative class while the class is actually positive; that is, the system wrongly identifies a negative class.

To compute the performance capacity of MLAs, four parameters: positive predictive value, true positive rate, F-measure, and SCA, are calculated using TP, TN, FP, and FN. The detailed description and equations of all parameters are given as follows:

Positive predictive value (PPV): PPV measures the proportion of instances predicted as positive by the MLA that actually belong to the positive class. Equation (3) is applied to calculate the PPV:

$$PPV=\frac{TP}{TP+FP}$$
(3)

True positive rate (TPR): TPR measures the proportion of actual positive instances that are correctly recognized by the MLA. Equation (4) is applied to compute TPR:

$$TPR=\frac{TP}{TP+FN}$$
(4)

F-measure: The F-measure is the harmonic mean of PPV and TPR, providing a balance between their respective values. The F-measure is calculated using Eq. (5):

$$F-\mathrm{measure}\left(F\right)=2\times \frac{\mathrm{PPV}\times \mathrm{TPR}}{\mathrm{PPV}+\mathrm{TPR}}$$
(5)

Severity classification accuracy (SCA): SCA denotes the proportion of instances classified correctly across the positive and negative classes. Equation (6) is applied to calculate SCA:

$$\mathrm{SCA}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}$$
(6)
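Given actual and predicted severity levels, the measures of Eqs. (3)-(6) can be computed per class with scikit-learn, as in the following sketch (the toy label vectors are placeholders, not values from our experiments).

from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

y_true = [1, 1, 2, 3, 4, 4, 2, 3]    # actual severity levels (toy example)
y_pred = [1, 1, 2, 3, 4, 3, 2, 3]    # predicted severity levels (toy example)

print(confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4]))   # per-class counts of correct and wrong predictions

# PPV (Eq. 3), TPR (Eq. 4), and F-measure (Eq. 5) for each severity level 1-4
ppv, tpr, f_measure, _ = precision_recall_fscore_support(y_true, y_pred, labels=[1, 2, 3, 4])
print("PPV:", ppv, "TPR:", tpr, "F-measure:", f_measure)

# SCA (Eq. 6): overall proportion of correctly classified instances
print("SCA:", accuracy_score(y_true, y_pred))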

Results of Proposed Model

To answer RQ1, we used four MLAs and three ensemble learning methods. In this paper, we have used four code smell severity datasets (DC, GC, FE, and LM), and each dataset has four severity levels. We have shown individual results for each severity level of each dataset. In addition, the average result over all severity classes is also shown. We computed the severity classification accuracy (SCA) in two ways: (1) SCA with the grid search algorithm and (2) SCA with the random search algorithm. The following subsections report the results obtained by every MLA and ensemble learning algorithm in tabular form.

The following subsections use seven MLAs and ensemble learning methods: logistic regression, Random Forest, K-nearest neighbor, decision tree, AdaBoost, XG Boosting, and Gradient Boosting. The subsections "Logistic Regression" to "Gradient Boosting (GB Algorithm)" present the experimental results of the MLAs and ensemble learning methods in table form.
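For reference, the seven classifiers can be instantiated as follows; this is a sketch using scikit-learn and the xgboost package, with hyper-parameters left at their defaults here and tuned later by the POTs.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "XGBoost": XGBClassifier(),
    "GradientBoost": GradientBoostingClassifier(),
}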

Logistic Regression

Logistic regression (LR) is a classification algorithm; in its multinomial form, it generalizes logistic regression to multi-class problems, i.e., more than two possible discrete outcomes. It is a technique for estimating the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables [38].

Table 6 presents the SCA of each severity class and the average SCA over all severity classes for each dataset using the LR method. The LR algorithm obtained the highest average SCA of 97.34% using the grid search algorithm on the LM dataset, and 91.42% using the random search algorithm on the LM dataset.

Table 6 Severity classification results of LR algorithm

Random Forest

Random Forest (RF) is an ensemble learning method applied to classification and regression. When used for classification, the output of the RF is the class selected by the majority of the trees.

Table 7 shows that the RF algorithm obtained the highest average SCA of 97.34% using the grid search algorithm on the LM dataset, and 96.20% using the random search algorithm on the LM dataset.

Table 7 Severity classification results of RF algorithm

K-Nearest Neighbor

The K-nearest neighbor (KNN) algorithm is an MLA used for classification and regression problems. It is, however, generally applied to classification prediction problems. The KNN method assigns a class to a new data point based on how similar it is to the training points [40].

Table 8 shows that the KNN algorithm obtained the highest average SCA of 90.95% using the grid search algorithm on the LM dataset, and 91.64% using the random search algorithm on the LM dataset.

Table 8 Severity classification results of KNN algorithm

Decision Tree

A decision tree (DT) constructs classification models in the structure of a tree. It divides a dataset into smaller subsets while, at the same time, an associated DT is incrementally developed. The final outcome is a tree with decision nodes and leaf nodes, where each leaf node corresponds to a classification [39].

Table 9 shows that the DT algorithm obtained the highest average SCA of 99.08% using the grid search algorithm on the LM dataset, and 96.00% using the random search algorithm on the FE dataset.

Table 9 Severity classification results of DT algorithm

AdaBoost (Adaptive Boosting)

AdaBoost was the first successful boosting algorithm built for binary classification; it was developed by Yoav Freund and Robert Schapire [41]. It is a popular boosting technique that combines several "weak classifiers" into one "strong classifier".

Table 10 shows that the AdaBoost algorithm obtained the highest average SCA of 98.18% using the grid search method on the LM dataset and 97.61% using the random search algorithm on the LM dataset.

Table 10 Severity classification results of AdaBoost algorithm

XGB Algorithm (XG Boosting)

XGBoost, commonly known as the extreme gradient boosting algorithm, is a tree-based MLA with superior performance and speed. It is a straightforward algorithm that has grown in popularity because it produces effective outcomes for structured and tabular data. Tianqi Chen developed XGBoost, which is primarily maintained by the Distributed Machine-Learning Community (DMLC) organization. It is open-source software [42].

Table 11 shows that the XGB algorithm obtained the highest average SCA of 99.12% using the grid search method on the LM dataset and 98.89% using the random search method on the LM dataset.

Table 11 Severity classification results of XG Boosting algorithm

Gradient Boosting (GB Algorithm)

The gradient boosting (GB) algorithm is one of the most effective ensemble MLAs. The errors in MLAs are mainly separated into two types, i.e., bias error and variance error. The GB algorithm is a boosting technique that can be employed to decrease the bias error of the algorithm. The GB algorithm is used for both categorical and continuous target variables, i.e., for classification and regression [43].

Table 12 shows that the GB algorithm obtained the highest average SCA of 98.67% using the grid search method on the LM dataset and 98.32% using the random search algorithm on the LM dataset.

Table 12 Severity classification results of gradient boosting algorithm

SCA Comparison Between Grid Search and Random Search of All Machine-Learning Methods

Table 13 shows the comparison among the SCA of all the machine-learning methods obtained by Grid search and Random search algorithms. It is observed that some of the machine-learning methods perform better using the grid search algorithm, and some perform better using the random search algorithm.

Table 13 SCA comparison between grid search and random search of all machine-learning methods

From our work, we observed the following: (1) for the DC dataset, the highest severity detection accuracy is 88.22% using the gradient boosting algorithm for the random search method; (2) for the GC dataset, the highest severity detection accuracy is 86.00% using the DT algorithm for the random search method; (3) for the FE dataset, the highest severity detection accuracy is 96.00% using the DT algorithm for the random search method; and (4) for the LM dataset, the highest severity detection accuracy is 99.12% using the XG Boost algorithm for the grid search method.

Comparison Among All the Algorithms Used in this Work

A comparison among all machine-learning methods is shown in Table 14, and Fig. 2 shows the comparative analysis among all the algorithms using a bar graph. The following results are obtained:

  1. The gradient boosting (GB) algorithm achieved the highest SCA of 88.22% for the DC dataset.

  2. The DT approach achieved the maximum SCA of 86.00% for the GC dataset and 96.00% for the FE dataset.

  3. The XGB approach achieved the highest SCA of 99.12% for the LM dataset.

Table 14 Comparison among all algorithms
Fig. 2 Comparison bar chart of all algorithms

Influence of Feature Selection Technique

This section focuses on the influence of the feature selection technique to enhance the results and to identify the features which play an essential role in code smell severity detection. The Chi-square feature selection technique is applied in response to research question 2 (RQ2). The SCA (%) and F-measure (%) of all the algorithms with and without the feature selection technique are compared in Table 15. For this study, ten features are selected from each dataset, as shown in Table 3. This study observed that most algorithms show improved SCA on each dataset when the feature selection technique is applied, and only a few did not perform better. We observed that the highest severity detection performance on the DC dataset was obtained by the gradient boosting model (F-measure 87.00% and accuracy 88.22%). On the GC and FE datasets, the highest performance was obtained by the DT model (F-measure 90.00% and accuracy 86.00% for GC; F-measure 96.00% and accuracy 96.00% for FE). Similarly, the highest performance on the LM dataset was obtained by the XG Boost model (F-measure 100% and accuracy 99.12%).

Table 15 Result comparison with and without applying feature selection technique

Influence of Hyper-parameter Tuning on Algorithms

In response to research question 3 (RQ3), the influence of hyper-parameter tuning on the performance of all algorithms is studied. Table 16 presents the various parameter combinations and the highest SCA obtained for the DT algorithm. The DT algorithm obtained the highest SCA of 99.08% when the maximum level is 20, the number of trees is 20, and the number of splits is 15. Similarly, Tables 17 and 18 show the influence of parameter optimization on the SCA of all algorithms.

Table 16 Tuning parameters applied on DT algorithm
Table 17 POT’s effect on the LR model’s SCA
Table 18 POT’s effect on the RF, AdaBoost, XGBoost, and GB model’s SCA

Discussion

In this paper, we have used four code smell severity datasets, which include god class, data class, feature envy, and long method. The severity composition of each dataset is given in Table 2. In all the datasets, severity 1 has the most instances and severity 2 has the fewest. We have shown separate results for each severity of each dataset in "Results of Proposed Model", and we observed a significant difference between the results of the four severities for each dataset because the distribution of severities differs across datasets. In "Results of Proposed Model", we found the highest accuracy for severity 1 in each dataset, because severity 1 has the most instances, and the lowest accuracy for severity 2, because it has the fewest instances. Thus, we found a class imbalance problem in all the datasets, which we will consider in our future work.

In this experiment, we addressed the three research questions given in "Proposed Model and Dataset Description". To answer research question 1, we applied the seven machine-learning algorithms described in "Logistic Regression" to "Gradient Boosting (GB Algorithm)". We found that the DT algorithm gives the best SCA of 86.00% and 96.00% for the GC and FE datasets (Table 14). Likewise, the gradient boost algorithm gives the best SCA of 88.22% for the DC dataset, and the XG Boost approach gives the maximum SCA of 99.12% for the LM dataset (Table 14). To answer research question 2, we studied the influence of the feature selection technique in Table 15 and found that feature selection gives better performance for most of the algorithms on each dataset. To answer research question 3, we applied the POT to each algorithm. The influence of the POT on each model's performance is shown in Tables 16, 17 and 18.

Result Evaluation of Our Results with Other Related Works

In the earlier literature, many authors have applied different algorithms to different severity datasets, as discussed in detail in "Related Work". This section presents a summarized evaluation of our work in comparison with Fontana et al. [4] and Abdou [27]. Both applied various MLAs, spanning multinomial classification, regression, and binary classifiers for ordinal classification. A linear correlation-based filter method was also applied by them to select the best features. In addition, Abdou [27] applied the PART rule-learning algorithm to learn the efficiency of software metrics for detecting code smells, and the local interpretable model-agnostic explanations (LIME) algorithm was applied to explain the ML models and improve interpretability.

In our approach, we applied four MLAs and three ensemble learning approaches to identify the severity of code smells. A Chi-square-based FST is applied to select the best metrics, and two-parameter optimization techniques (grid search and random search) are applied to optimize the best parameters from each model.

Table 19 compares our results with other works. In our approach, for the DC dataset the GB algorithm achieved 88.22% SCA, while Fontana et al. [4] achieved 77.00% SCA using the O-RF algorithm, and Abdou [27] achieved 93.00% SCA using the O-R-SMO algorithm. Thus, the Abdou [27] model is the best for severity detection on the data class dataset.

Table 19 Result evaluation of our method with other works

For the GC dataset, our approach achieved 86.00% SCA using the DT algorithm, while Fontana et al. [4] achieved 74.00% SCA using the O-DT algorithm, and Abdou [27] achieved 92.00% SCA using the R-B-RF algorithm. Thus, the Abdou [27] model is the best for severity detection on the god class dataset.

For the FE dataset, our approach achieved 96.00% SCA using the DT algorithm, while Fontana et al. [4] achieved 93.00% SCA using the J48-Pruned algorithm, and Abdou [27] achieved 97.00% SCA using the R-B-JRIP and O-R-SMO algorithms. Thus, the Abdou [27] model is the best for severity detection on the FE dataset.

For the LM dataset, our approach achieved 99.12% SCA using the XG Boost algorithm, while Fontana et al. [4] achieved 92.00% SCA using the B-Random Forest algorithm, and Abdou [27] achieved 97.00% SCA using the R-B-JRIP, O-B-RF, and O-R-JRip algorithms. Thus, our model is the best for severity detection on the LM dataset.

Statistical Analysis for Comparing Machine-Learning Models

From Table 15, we observed that for a given dataset there is often not much difference between the results of the two best models. In that case, we used a statistical analysis method to choose the better of the two models for each dataset. We applied a paired T test on the two best models (according to Table 15) for each dataset, i.e., for each dataset we selected the two models that obtained the best accuracy and then applied the paired T test.

A paired T test compares two methods evaluated on the same dataset to determine whether the difference between them is statistically significant [44]. This method can be used to check whether there is a statistically significant distinction between two models, allowing us to choose only the better one. We used mean accuracy as the measurement, calculated across tenfold cross-validation, with the significance level set to α = 0.05.

We developed the following hypotheses (H) values for each comparison:

Null hypothesis (H0): The accuracies of models X and Y are obtained from the same sample. As a result, the difference in accuracy has an expected value of 0 (E[diff] = 0). In essence, the two models are identical [45].

Alternative hypothesis (H1): The prediction accuracies are obtained from two separate distributions, E[diff] ≠ 0. Essentially, the models are distinct, and one is superior to the other [45].

  • H0: ACCURACYx = ACCURACYy (the detection efficiency of the two models is identical).

  • H1: ACCURACYx ≠ ACCURACYy (the detection efficiency of the two models differs significantly).

where x and y are the two models under consideration. If the test returns a p value less than alpha, we can reject the null hypothesis.
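A minimal sketch of the test with SciPy follows; the fold-wise accuracies below are illustrative placeholders rather than values from Table 20, and are assumed to come from identical cross-validation splits for both models.

from scipy.stats import ttest_rel

# Per-fold accuracies of the two best models on the same tenfold cross-validation splits (illustrative values)
scores_x = [0.97, 0.99, 0.98, 0.97, 0.99, 0.98, 0.99, 0.97, 0.98, 0.99]
scores_y = [0.95, 0.97, 0.95, 0.96, 0.96, 0.97, 0.95, 0.96, 0.97, 0.96]

alpha = 0.05
t_stat, p = ttest_rel(scores_x, scores_y)    # paired T test on matched folds
print("t =", t_stat, " p value =", p)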

if p > alpha:
    print("Fail to reject null hypothesis")   # the two models perform equivalently
else:
    print("Reject null hypothesis")           # the accuracy difference is statistically significant

In this study, we calculated two parameters: mean accuracy and p value for each dataset.

Mean accuracy: the model with the higher mean accuracy is considered the better model for a dataset. p value: if the p value is greater than alpha (α = 0.05), the null hypothesis is retained; if the p value is below alpha, the null hypothesis is rejected. We then selected the better of the two models according to mean accuracy.

Table 20 displays the mean accuracy and p value of each classification model across each code smell dataset. From Table 20, we observed that the DC, GC, and LM datasets have p values less than 0.05; therefore, the accuracies of the two models differ significantly for these datasets. For the DC dataset, the mean accuracy of the XGBoost model is higher than that of the gradient boosting model, so the XGBoost model is best for the DC dataset. For the GC dataset, the mean accuracy of the XGBoost model is higher than that of the DT model, so the XGBoost model is best for the GC dataset. For the LM dataset, the mean accuracy of the XGBoost model is higher than that of the DT model, so the XGBoost model is best for the LM dataset. The FE dataset has a p value greater than 0.05; therefore, the mean accuracies of the two models are statistically equivalent (Table 20).

Table 20 Statistical analysis results

Conclusion

In this research work, we proposed a framework for the severity classification of code smells with multi-class classification approaches using four machine-learning and three ensemble learning algorithms, in order to analyze code smell severities, decrease maintenance effort, enhance software quality, and identify the best algorithms for detecting code smell severity. To select the significant features from each dataset, the Chi-square feature selection algorithm is applied. Two parameter optimization algorithms (grid search and random search) with fivefold cross-validation are used to enhance the SCA.

In this study, it is found that the GB method achieves the maximum SCA of 88.22% using the feature selection algorithm, while the AdaBoost algorithm obtains the lowest result of 70.27% without the feature selection algorithm for the DC dataset.

The DT method achieves the maximum SCA of 86.00% using the feature selection algorithm, while the KNN algorithm obtains the lowest result of 61.10% without the feature selection algorithm for the GC dataset.

The DT algorithm achieves a maximum SCA of 96.00% using the feature selection algorithm, while the lowest result of 68.78% is obtained by the KNN algorithm without the feature selection algorithm for the FE dataset.

The XG Boost algorithm achieves a maximum SCA of 99.12% using the feature selection algorithm, while the lowest result of 91.64% is obtained by the KNN model with the feature selection approach for the LM dataset.

This study found that the GB method is best for the DC dataset, the XG Boost model is best for the LM dataset, and the DT model is best for the GC and FE datasets for detecting the severity of code smells. Moreover, the Chi-square feature selection technique is generally helpful for better detection of the severity of code smells.

The limitation of this study is that the code smell severity datasets have a class imbalance problem; therefore, in subsequent work, we intend to enhance the outcomes by utilizing class balancing techniques to address this issue. In addition, other learning algorithms and feature selection strategies should be investigated to determine the most effective methods for code smell severity detection.