Introduction

Code smells indicate issues with software design. In addition to making the code more difficult to comprehend, a code smell can make modifications harder and increase proneness to mistakes. By identifying and removing code smells from a program, software engineers can understand the code more efficiently.

Software is becoming more and more complicated because of larger and more numerous modules, more complicated requirements, and code smells, among other things. Challenging requirements are difficult to assess and comprehend, making development hard and sometimes beyond the scope of developers, but code smells can be recognized and refactored to make the software simpler, more straightforward, and easier to produce and maintain [1]. Software engineering principles are required to develop better quality software [2]. In general, developers concentrate on functional requirements and overlook nonfunctional requirements such as maintainability, development, verifiability, reusability, and comprehensibility [3].

The severity of code smells is a significant consideration when reporting the outcome of code smell detection, as it allows refactoring efforts to be prioritized. High-severity code smells can become a significant and complex problem for a software system's maintainability. Thus, detecting code smells as well as their severity is very useful for software developers to minimize maintenance costs and improve software quality [4].

The purpose of this research is to detect the severity of code smells to help software developers minimize maintenance costs and improve software quality.

In the literature, a number of code smell severity detection techniques have been developed [5,6,7,8]. Each technique gives different outcomes because smells can be interpreted subjectively and, therefore, can be described in various ways [4]. To the best of our knowledge and the available literature, only Fontana et al. [4] and Abdou [27] have determined code smell severity on the code smell datasets. They used ordinal and multinomial machine-learning algorithms (MLA) for code smell severity detection. In addition, they used ranking-based correlation to find the best algorithm. The approaches of Fontana et al. [4] and Abdou [27] have the following limitations: they did not present class-wise accuracy in their studies; they did not consider other performance metrics such as precision, recall, and F-measure; and although they used different MLAs, ensemble learning methods were not applied.

This study hypothesizes that the severity of code smells can be detected using machine-learning and ensemble learning methods, and it presents class-wise outcomes.

Contributions: In this work, we have applied seven MLA/ensemble learning models (logistic regression (LR), Random Forest (RF), KNN, decision tree (DT), AdaBoost, XGBoost, and Gradient Boost) with the Chi-square feature selection approach on each dataset. In addition, we applied grid search and random search-based parameter optimization techniques (POT) to see the effect of parameter optimization on the classification results of code smell severity detection. We achieved the highest severity classification accuracy (SCA), 99.12%, with the XGBoost model for the LM dataset.

The benefit of this research is that detecting severity from the code smell severity datasets helps make source code more effective, accurate, and easily understandable by programmers and users.

The following is the outline of this paper: the next section explains related works. The third section describes the datasets' structure and the proposed models. The fourth section describes the experimental results of our proposed models. The fifth section discusses our results and compares them with baseline results, and the final section concludes the work.

Related Work

Various machine learning-based algorithms and techniques have been used by researchers for code smell detection and also for code smell severity classification. The related work section is divided into two parts: the first part discusses research works on code smell detection using machine-learning techniques, and the second part discusses research works related to the severity of code smells.

Machine-Learning Techniques for Code Smell Detection

Fontana et al. [9] presented a comparison among 16 MLAs on 4 code smell datasets derived from 74 Java systems, with manually validated examples in the training dataset for code smell detection. In addition, boosting techniques were applied to the four code smell datasets.

Mhawish et al. [10] suggested MLAs for detecting code smells in source code. They used two feature selection approaches based on the genetic algorithm (GA) and a POT based on grid search. Using the GA_CFS approach, they obtained the highest accuracies of 98.05%, 97.56%, and 94.31% for the data class (DC), god class (GC), and long method (LM) smells, respectively. They also obtained the highest accuracy of 98.38% for the LM using the GA-Naïve Bayes feature selection method. Mhawish et al. [11] presented the prediction of code smells with the help of MLAs and software metrics. They also used GA-based feature selection methods to increase the efficiency of these MLAs by identifying the appropriate features in every dataset. Furthermore, they used a grid search-based POT to improve the performance of the approaches. The RF model obtained the highest accuracies of 99.71% and 99.70% in predicting the DC in the original and refined datasets.

Kaur et al. [12] proposed a correlation-based metric selection strategy and an ensemble learning method for detecting code smells in three publicly available Java datasets. They used bagging and the RF classifier and examined each method with four performance measures: accuracy (P1), G-mean 1 (P2), G-mean 2 (P3), and F-measure (P4). Pushpalatha et al. [13] suggested that the severity of bug reports for closed-source datasets could be predicted. For this, they used the PROMISE repository to obtain the dataset (PITS) for NASA projects. They improved the accuracy by employing ensemble learning strategies and two dimensionality reduction methods, Chi-square analysis and information gain. Alazba et al. [14] applied 14 MLAs and a stacking ensemble learning algorithm on six code smell datasets. The results of the MLAs were compared, and they found that the best accuracy was 99.24%, achieved by the Stack-SVM algorithm on the LM dataset. A search-based technique to enhance code smell detection using the Whale optimization method as a classifier was proposed by Draz et al. [15]. They studied five open-source software projects, detected nine code smell types, and achieved an average of 94.24% precision and 93.4% recall.

Dewangan et al. [16] presented six MLAs with two FSTs (Chi-square and wrapper-based) to pick the significant features from each dataset; the grid search technique was then applied to improve the models' outcomes, and they achieved 100% accuracy using the LR model for the LM dataset. Reis et al. [17] proposed a crowdsmelling approach that uses collective knowledge for code smell detection with the LM, GC, and feature envy (FE) datasets. They applied six MLAs to detect the code smells. They obtained the highest outcome, an ROC area of 0.896, using the Naive Bayes algorithm for GC detection and an ROC area of 0.870 using the AdaBoostM1 algorithm for LM detection. The worst performance was an ROC area of 0.570 using the RF algorithm for FE detection. Oort et al. [18, 19] presented a study to examine the occurrence of code smells in machine-learning projects. They collected 74 machine-learning projects and then applied the Pylint tool to those projects. After this, they assembled the distribution of Pylint messages per category per project, the top 10 code smells in these projects overall, and the top 20 code smells per category. They found that the PEP8 convention for identifier naming may not always be appropriate in machine-learning code because of its similarity to mathematical notation. They also detected serious problems with the management of dependencies (requirements), which present the main threat to the maintainability and reproducibility of Python machine-learning projects.

Boutaib et al. [20] proposed a bi-level multi-label detection of smells (BMLDS) tool to reduce the population of classification series for detecting multi-label smells. They implemented a bi-level scheme in which the upper-level part discovers the best classification for each measured series, and the lower-level part constructs the series.

Abdou et al. [21] proposed three ensemble methods (bagging, boosting, and rotation forest) with a resample technique to detect software defects. They used seven datasets given in the PROMISE repository. They found that the ensemble methods give better accuracy than single learning methods, achieving the highest accuracy of 93.40% with the Random Forest plus resample technique on the KC1 dataset.

Dewangan et al. [22] introduced ensemble and deep learning algorithms as methods for identifying code smells. They applied a Chi-square-based FST to pick the significant features from each code smell dataset, and the SMOTE technique was used to balance the datasets. They achieved an accuracy of 100% with all of the ensemble approaches on the LM dataset.

Dewangan et al. [23] proposed five classification models to detect code smells. They used four code smell datasets: DC, GC, FE, and LM. They obtained the best accuracy of 99.12% using the Random Forest model for the FE dataset.

Dewangan et al. [24] proposed three ML algorithms to detect code smells. A principal component analysis (PCA)-based FST was used to pick the significant features from each code smell dataset. They obtained the best accuracy of 99.97% using the PCA-based logistic regression (PCA_LR) model for the DC dataset.

There is a notable difference between these works and the technique we used in this paper. The majority of the above studies focus on the identification of code smells as described by Fowler et al. [25]. Most of the previous research papers examined only a few systems and applied MLAs. However, they did not address the severity of code smells in their work.

Machine-Learning Techniques for Code Smell Severity Detection

Vidal et al. [5] proposed a tool for detecting code smells using textual analysis. For this purpose, they conducted two separate experiments. They first performed a software repository mining study to examine how engineers spot code smells through textual or structural cues. They then carried out a user study with industrial developers and quality experts to qualitatively examine how they detect code smells using the two different sources of information. They discovered that textual code smells are easier to identify.

Liu et al. [6] proposed severity prediction of bug reports based on feature selection methods. They established a ranking-based strategy to enhance existing feature selection methods and designed an ensemble feature selection method by merging existing ones. They applied eight feature selection methods and found that the ranking-based strategy achieves the highest F1 score of 54.76%. Tiwari et al. [7] proposed a method to find LMs and their severity, which shows the importance of refactoring LMs. They found that the method matches the expert's severity evaluation for half of the methods within a tolerance level of one. In addition, for methods identified with high severity, the evaluation is roughly equivalent to an expert's judgment.

MLAs were suggested by Baarah et al. [8] for closed-source software bug severity detection. They assembled a dataset from past bug information stored in the JIRA bug tracking system connected to different closed-source projects built by INTIX Company in Amman, Jordan. They evaluated eight MLAs, namely Naive Bayes, Naive Bayes Multinomial, SVM, Decision Rules (JRip), Logistic Model Trees, DT (J48), RF, and KNN, in terms of accuracy, F-measure, and area under the curve (AUC). The DT model obtained the highest performance, with an accuracy, AUC, and F-measure of 86.31%, 90%, and 91%, respectively.

Gupta et al. [26] proposed a hybrid method to examine the severity of code smell intensity in the Kotlin language and to find which code smells are equivalent in the Java language. They used five types of code smells in their work: complex method, long parameter list, large class, LM, and string literal duplication. They applied various MLAs and found that the JRip algorithm achieved the best performance, with 96% precision and 97% accuracy. Abdou et al. [27] proposed the classification of code smell severity using MLAs based on regression, multinomial, and ordinal classifiers. In addition, they applied the local interpretable model-agnostic explanations (LIME) approach to explain the MLAs, and prediction rules produced by the PART algorithm to assess the efficiency of the features. The LIME algorithm helps gain a deeper knowledge of the model's decision-making process and the characteristics that affect the model's decisions. They found the highest accuracy of 92–97%, measured by Spearman correlation. Hejres et al. [28] proposed the detection of code smell severity using MLAs. They applied three models, sequential minimal optimization (SMO), artificial neural network (ANN), and J48, to detect code smell severity from four datasets. They obtained the best results for the GC and FE datasets using the SMO model, while for the LM dataset the best accuracy was obtained using an adaptive neural network ensemble (ANNE) with the SMO model.

Nanda et al. [29] proposed a combination of SMOTE and a stacking model to classify the severity of the GC, DC, FE, and LM datasets. They improved the performance from 76 to 92%. Fontana et al. [4] proposed MLAs for classifying the severity of code smells. They implemented different MLAs, spanning from multinomial classification to regression and binary classifiers for ordinal classification. They found that the correspondence between the actual and predicted severity for the top techniques reached 88–96%, calculated by Spearman's ρ.

To the best of our knowledge and the available literature, most authors have used different code smell datasets and code smell severity datasets with different types of MLAs, multinomial, and regression techniques to detect code smells and their severity. It has been observed that the effect of grid search and ensemble learning algorithms on the severity datasets has not been studied earlier. Therefore, to study and analyze the effect of MLA and ensemble learning approaches, we have used MLAs and ensemble learning methods with grid search, random search, and Chi-square feature selection techniques to find the severity of code smells from the code smell severity datasets.

Proposed Model and Dataset Description

The severity of code smells is significant to study when describing code smell detection results, since it helps prioritize refactoring efforts. This research work builds a model for detecting the severity of code smells using machine-learning methods. A step-by-step research framework designed for code smell severity detection is shown in Fig. 1. First, we collected the code smell severity datasets from Fontana et al. [4]. Then, we applied the min–max normalization technique to rescale the feature values in the datasets. After that, we used the Chi-square feature selection algorithm to extract the best features from the datasets. Then, we applied grid search and random search POTs. Then, we applied MLAs with fivefold cross-validation, and finally, we obtained the performance measurements.

Fig. 1 Proposed model
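The following sketch illustrates how the pipeline in Fig. 1 could be assembled with scikit-learn and the xgboost package; the synthetic data generated by make_classification merely stands in for one of the severity datasets, and the parameter grid is illustrative rather than the exact ranges of Table 5.

from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Synthetic stand-in for a severity dataset: 420 instances, 61 metrics, 4 severity levels (0-3)
X, y = make_classification(n_samples=420, n_features=61, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

pipeline = Pipeline([
    ("normalize", MinMaxScaler()),          # min-max normalization (Eq. 1)
    ("select", SelectKBest(chi2, k=10)),    # Chi-square feature selection (top 10 metrics)
    ("model", XGBClassifier()),             # one of the seven MLA/ensemble models
])

# Parameter optimization (grid search here; random search works analogously) with fivefold CV
param_grid = {"model__n_estimators": [50, 100, 200], "model__max_depth": [3, 5, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)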

The following research questions are addressed in this paper.

RQ1: Which MLA/ensemble learning algorithm detects code smell severity best?

Motivation: Baarah et al. [8] and Fontana et al. [4] proposed various MLAs, such as Naive Bayes, Naive Bayes Multinomial, SVM, and Decision Rules (JRip), as well as multinomial classification, regression, etc. Alazba et al. [14] applied both MLAs and stacking ensemble learning algorithms and compared their performance. They discovered that ensemble learning algorithms were more accurate than MLAs. Therefore, to investigate and observe the effect of MLAs and ensemble learning algorithms on code smell severity detection, we applied both MLAs and ensemble learning algorithms.

RQ2: What is the effect of applying the feature selection method in code smell severity detection?

Motivation: Liu et al. [6], Mhawish et al. [10, 11], Kaur et al. [12], and Dewangan et al. [16] studied the influence of various feature selection algorithms on the performance measures. They found an enhancement in performance accuracy when feature selection methods were applied. Therefore, to examine the effect of the feature selection method on each model's accuracy and to extract the software metrics that play an essential role in the code smell severity detection process, we used the Chi-square-based feature selection method in our work.

RQ3: Does the hyper-parameter optimization algorithm enhance the performance of the detection of the code smell’s severity?

Motivation: Mhawish et al. [10, 11] applied grid search-based parameter optimization and found improvements in their results. Therefore, to study and analyze this effect, tuning of the MLA and ensemble learning algorithm parameters has been applied in this work.

Dataset Description

In this study of code smell severity detection, we have taken four datasets from Fontana et al. [4], which consist of two class-level datasets named data class (DC) and god class (GC), and two method-level datasets named feature envy (FE) and long method (LM). These datasets can all be found at http://essere.disco.unimib.it/reverse/MLCSD.html. Fontana et al. [4] selected 76 systems out of 111, characterized by different sizes and a large set of object-oriented features. They considered the Qualitas Corpus of systems collected by Tempero et al. [30] for the system selection. They employed a variety of tools and methods, called advisors, to find code smell severity: iPlasma (GC, Brain Class), Anti-pattern Scanner [31], PMD [32], iPlasma, Fluid Tool [33], and Marinescu detection rules [34]. Table 1 shows the automatic detection tools.

Table 1 Automatic detector tools (advisors)

Code Smells Classification of Severity

After manual assessment of code smells, a severity value between 1 and 4 is allocated to each assessed occurrence. The values and their meanings are described as follows:

  1. Allocated for no smell: a method or class that is unaffected by the code smell.

  2. Allocated for non-severe smell: a method or class that is only slightly impacted by the code smell.

  3. Allocated for smell: the class or method possesses all the properties of a smell.

  4. Allocated for severe smell: there is a smell that is extremely strong in terms of size, complexity, or coupling.

Each code smell dataset has 420 instances (classes or methods), with 63 features for the DC and GC datasets and 84 features for the FE and LM datasets. Details as given by Fontana et al. [4] are shown in Table 2.

Table 2 Dataset configuration [4]

The code smell datasets used in this paper are defined as follows:

Data class (DC): It refers to classes that merely hold data with basic functionality and are heavily relied upon by other classes. A DC reveals many of its attributes, is not complicated, and exposes data via accessor methods [4].

God class (GC): It describes classes that concentrate the system's intelligence. The GC is considered among the most complex code smells because many responsibilities, actions, and tasks accumulate in it. It causes problems related to large code size, coupling, and complexity [4].

Feature envy (FE): It refers to methods that make extensive use of data from classes other than their own. Such a method prefers to use other classes' features, including features accessed through accessor methods [4].

Long method (LM): It refers to methods that tend to centralize a class's functionality. An LM has a lot of code, is complicated and difficult to understand, and heavily relies on data from other classes [4].

Dataset Composition (Structure)

There are 420 manually evaluated classes or methods in each dataset. Table 2 shows the dataset configuration, including the number of instances assigned to each severity level. The least frequent severity level in the datasets is 2, and the two class-based smells (DC and GC) have a different balance from the two method-based smells (FE and LM) with respect to severity levels 1 and 4 [4].

Normalization Technique

The datasets have a wide range of features; therefore, in such a case the features should be normalized before MLAs are used. The min–max feature scaling strategy was used in this paper to rescale the datasets’ range of feature or observable values between 0 and 1 [35]. The min–max formula is shown in Eq. 1 with X′ denoting the normalized value and X representing the initial real value. The feature’s Xmin and Xmax values are altered to “0” and “1”, respectively, while every other value is modified to a decimal between “0” and “1”:

$$X^{\prime}=\frac{X-{X}_{\mathrm{min}}}{{X}_{\mathrm{max}}-{X}_{\mathrm{min}}}$$
(1)
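For illustration, Eq. (1) can be applied column-wise with NumPy, as in the minimal sketch below; scikit-learn's MinMaxScaler performs the same rescaling.

import numpy as np

def min_max_normalize(X):
    # Rescale each feature (column) of X to [0, 1] as in Eq. (1): X' = (X - Xmin) / (Xmax - Xmin)
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)    # note: constant columns would need special handling

# Example: a metric with values 10, 20, 30 becomes 0.0, 0.5, 1.0
print(min_max_normalize([[10], [20], [30]]).ravel())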

Feature Selection Algorithm

Feature selection is used to find the most significant features in a dataset and to refine results by identifying the features that contribute most to distinguishing between classes [36]. We used a Chi-square-based feature selection technique in this study to extract the best features from each dataset.

Typically, Chi-square feature selection is used on categorical datasets. By examining the relationship between each feature and the target class, Chi-square aids in choosing the optimal features. Equation 2 provides the formula for the Chi-square test [36]:

$${X}^{2}=\sum \frac{{(\mathrm{Observed\;frequency}-\mathrm{Expected\;frequency})}^{2}}{\mathrm{Expected\;frequency}}.$$
(2)

We can calculate the observed frequency and the expected frequency for the response and independent variables. The Chi-square statistic measures the difference between these two numbers: the larger the difference between observed and expected frequencies, the more likely the feature and the response variable are dependent on each other.

The top 10 features were extracted from each dataset. Table 3 shows the features extracted by the Chi-square feature selection method from each dataset. A detailed explanation of the selected metrics is given in Table 21 (Appendix).

Table 3 Chi-square feature selection method’s extracted feature
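A minimal sketch of this selection step with scikit-learn follows; the random matrix, severity labels, and metric names are placeholders standing in for the actual (normalized, non-negative) datasets and metric names of Table 3.

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.random((420, 61))                       # hypothetical normalized metric values
y = rng.integers(1, 5, size=420)                # hypothetical severity labels 1-4
feature_names = [f"metric_{i}" for i in range(61)]

selector = SelectKBest(score_func=chi2, k=10)   # keep the 10 highest-scoring metrics
X_selected = selector.fit_transform(X, y)

# Rank metrics by their Chi-square score (mirrors how Table 3 was obtained)
ranked = sorted(zip(feature_names, selector.scores_), key=lambda t: t[1], reverse=True)
for name, score in ranked[:10]:
    print(name, round(score, 3))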

Hyper-parameter Tuning/Optimization

A hyper-parameter is a parameter value that controls the learning process. In machine-learning and ensemble learning methods, hyper-parameter optimization or tuning is used to select the best parameters for each algorithm. The best parameters for every method differ based on the learning dataset. The various parameter value combinations for each algorithm should be tried in order to find the exact parameters that will allow the predictive algorithm to successfully forecast the test dataset [11].

In this study, grid search and random search-based POTs are applied to select the optimal parameter values for every method. Parameter optimization is necessary to select an algorithm's optimal hyper-parameters, which help produce the most accurate results. It relies on an exhaustive search over the candidate parameter values for the combination that yields the best prediction performance [38]. To avoid overfitting the algorithm on the test dataset, fivefold cross-validation is performed; it is applied to determine the efficiency of every conceivable parameter combination.

We applied the grid search and random search algorithms to each machine-learning and ensemble learning algorithm. For numeric parameters, we specified a value range and a number of steps. As indicated in Table 5, the values to be checked are distributed between the range's upper and lower boundaries according to the step count allocated to each parameter. The best-selected parameters and the allotted step counts for the grid search and random search are shown in Tables 4 and 5.

Table 4 Selected best parameters for each model
Table 5 Number of allotted steps to every parameter (parameter tuning)
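The two POTs could be set up as in the following sketch; the Random Forest parameter ranges and the placeholder data are illustrative and do not reproduce the exact values of Tables 4 and 5.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

rng = np.random.default_rng(0)
X_selected = rng.random((420, 10))   # placeholder for the 10 selected metrics
y = rng.integers(1, 5, size=420)     # placeholder severity labels 1-4

param_space = {                      # illustrative ranges, not the exact steps of Tables 4 and 5
    "n_estimators": [10, 50, 100, 200],
    "max_depth": [5, 10, 15, 20],
    "min_samples_split": [2, 5, 10, 15],
}

grid = GridSearchCV(RandomForestClassifier(random_state=0), param_space, cv=5, scoring="accuracy")
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_space,
                          n_iter=20, cv=5, scoring="accuracy", random_state=0)
grid.fit(X_selected, y)              # exhaustive search over all 64 combinations
rand.fit(X_selected, y)              # samples 20 random combinations
print("grid search:  ", grid.best_params_, round(grid.best_score_, 4))
print("random search:", rand.best_params_, round(rand.best_score_, 4))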

Validation Methodology

This study has used a validation methodology to calculate each experiment's performance. MLAs are evaluated using fivefold cross-validation, which divides each dataset into five segments and repeats training five times. In each repetition, one part serves as the test set and the remaining parts form the training set.
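A minimal sketch of this scheme with scikit-learn's cross_val_score follows (the data are placeholders; any of the seven models can be substituted for the decision tree).

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((420, 10))            # placeholder feature matrix (selected metrics)
y = rng.integers(1, 5, size=420)     # placeholder severity labels 1-4

# Fivefold cross-validation: each fold serves once as the test set,
# while the remaining four folds form the training set.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5, scoring="accuracy")
print("per-fold accuracy:", scores, "mean:", scores.mean())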

Performance Capacity

To evaluate the performance capacity of all the models, a set of experimental measurements such as true positive (TP), true negative (TN), false positive (FP), and false negative (FN) are computed with the confusion matrix. In this study, we have evaluated various experiments.

  • TP (true positive) stands for the outcomes in which the system correctly predicts the positive class.

  • TN (true negative) stands for the outcomes in which the system correctly predicts the negative class.

  • FP (false positive) stands for the outcomes in which the system predicts the positive class while the class is actually negative; that is, the system wrongly identifies a positive class.

  • FN (false negative) stands for the outcomes in which the system predicts the negative class while the class is actually positive; that is, the system wrongly identifies a negative class.

To compute the performance capacity of MLAs, four parameters: positive predictive value, true positive rate, F-measure, and SCA, are calculated using TP, TN, FP, and FN. The detailed description and equations of all parameters are given as follows:

Positive predictive value (PPV): PPV measures the proportion of instances predicted as positive by the MLA that actually belong to the positive class. Equation (3) is applied to calculate the PPV:

$$PPV=\frac{TP}{TP+FP}$$
(3)

True positive rate (TPR): TPR measures the proportion of actual positive instances that are correctly recognized by the MLA. Equation (4) is applied to compute TPR:

$$TPR=\frac{TP}{TP+FN}$$
(4)

F-measure: The F-measure is the harmonic mean of PPV and TPR, providing a balance between their respective values. The F-measure is calculated using Eq. (5):

$$F-\mathrm{measure}\left(F\right)=2\times \frac{\mathrm{PPV}\times \mathrm{TPR}}{\mathrm{PPV}+\mathrm{TPR}}$$
(5)

Severity classification accuracy (SCA): SCA denotes the proportion of instances classified correctly across the positive and negative classes. Equation (6) is applied to calculate SCA:

$$\mathrm{SCA}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}$$
(6)
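Given actual and predicted severity levels, the measures of Eqs. (3)-(6) can be computed per class with scikit-learn, as in the following sketch (the toy label vectors are placeholders, not values from our experiments).

from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

y_true = [1, 1, 2, 3, 4, 4, 2, 3]    # actual severity levels (toy example)
y_pred = [1, 1, 2, 3, 4, 3, 2, 3]    # predicted severity levels (toy example)

print(confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4]))   # per-class counts of correct and wrong predictions

# PPV (Eq. 3), TPR (Eq. 4), and F-measure (Eq. 5) for each severity level 1-4
ppv, tpr, f_measure, _ = precision_recall_fscore_support(y_true, y_pred, labels=[1, 2, 3, 4])
print("PPV:", ppv, "TPR:", tpr, "F-measure:", f_measure)

# SCA (Eq. 6): overall proportion of correctly classified instances
print("SCA:", accuracy_score(y_true, y_pred))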

Results of Proposed Model

To answer RQ1, we used four MLAs and three ensemble learning methods. In this paper, we have used four code smell severity datasets (DC, GC, FE, and LM), and each dataset has four severity levels. We have shown individual results for each severity level of each dataset. In addition, the average result over all severity classes is also shown. We computed the severity classification accuracy (SCA) in two ways: (1) SCA with the grid search algorithm and (2) SCA with the random search algorithm. The following subsections report the results obtained by every MLA and ensemble learning algorithm in tabular form.

The following subsections use seven MLAs and ensemble learning methods: logistic regression, Random Forest, K-nearest neighbor, decision tree, AdaBoost, XG Boosting, and Gradient Boosting. The subsections "Logistic Regression" to "Gradient Boosting (GB Algorithm)" present the experimental results of the MLAs and ensemble learning methods in table form.
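For reference, the seven classifiers can be instantiated as follows; this is a sketch using scikit-learn and the xgboost package, with hyper-parameters left at their defaults here and tuned later by the POTs.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "XGBoost": XGBClassifier(),
    "GradientBoost": GradientBoostingClassifier(),
}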

Logistic Regression

Logistic regression (LR) is a classification algorithm; in its multinomial form, it generalizes logistic regression to multi-class problems, i.e., more than two possible discrete outcomes. It is a technique for estimating the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables [38].

Table 6 presents the SCA of each severity class and the average SCA over all severity classes for each dataset using the LR method. The LR algorithm obtained the highest average SCA of 97.34% using the grid search algorithm on the LM dataset, and 91.42% using the random search algorithm on the LM dataset.

Table 6 Severity classification results of LR algorithm

Random Forest

Random Forest (RF) is an ensemble learning method applied to classification and regression. When used for classification, the output of the RF is the class selected by the majority of the trees.

Table 7 shows that the RF algorithm obtained the highest average SCA of 97.34% using the grid search algorithm on the LM dataset, and 96.20% using the random search algorithm on the LM dataset.

Table 7 Severity classification results of RF algorithm

K-Nearest Neighbor

The K-nearest neighbor (KNN) algorithm is an MLA used for classification and regression problems. It is, however, generally applied to classification prediction problems. The KNN method assigns a class to a new data point based on how similar it is to the training points [40].

Table 8 shows that the KNN algorithm obtained the highest average SCA of 90.95% using the grid search algorithm on the LM dataset, and 91.64% using the random search algorithm on the LM dataset.

Table 8 Severity classification results of KNN algorithm

Decision Tree

A decision tree (DT) constructs classification models in the structure of a tree. It divides a dataset into smaller subsets while, at the same time, an associated DT is incrementally developed. The final outcome is a tree with decision nodes and leaf nodes, where each leaf node corresponds to a classification [39].

Table 9 shows that the DT algorithm obtained the highest average SCA of 99.08% using the grid search algorithm on the LM dataset, and 96.00% using the random search algorithm on the FE dataset.

Table 9 Severity classification results of DT algorithm

AdaBoost (Adaptive Boosting)

AdaBoost was the first successful boosting algorithm built for binary classification; it was developed by Yoav Freund and Robert Schapire [41]. It is a popular boosting technique that combines several "weak classifiers" into one "strong classifier".

Table 10 shows that the AdaBoost algorithm obtained the highest average SCA of 98.18% using the grid search method on the LM dataset and 97.61% using the random search algorithm on the LM dataset.

Table 10 Severity classification results of AdaBoost algorithm

XGB Algorithm (XG Boosting)

XGBoost, commonly known as the extreme gradient boosting algorithm, is a tree-based MLA with superior performance and speed. It is a straightforward algorithm that has grown in popularity because it produces effective outcomes for structured and tabular data. Tianqi Chen developed XGBoost, which is primarily maintained by the Distributed Machine-Learning Community (DMLC) organization. It is open-source software [42].

Table 11 shows that the XGB algorithm obtained the highest average SCA of 99.12% using the grid search method on the LM dataset and 98.89% using the random search method on the LM dataset.

Table 11 Severity classification results of XG Boosting algorithm

Gradient Boosting (GB Algorithm)

The gradient boosting (GB) algorithm is one of the most effective ensemble MLAs. The errors in MLAs are mainly separated into two types, i.e., bias error and variance error. The GB algorithm is a boosting technique that can be employed to decrease the bias error of the algorithm. The GB algorithm is used for both categorical and continuous target variables, i.e., for classification and regression [43].

Table 12 shows that the GB algorithm obtained the highest average SCA of 98.67% using the grid search method on the LM dataset and 98.32% using the random search algorithm on the LM dataset.

Table 12 Severity classification results of gradient boosting algorithm

SCA Comparison Between Grid Search and Random Search of All Machine-Learning Methods

Table 13 shows the comparison among the SCA of all the machine-learning methods obtained by Grid search and Random search algorithms. It is observed that some of the machine-learning methods perform better using the grid search algorithm, and some perform better using the random search algorithm.

Table 13 SCA comparison between grid search and random search of all machine-learning methods

From our work, we observed the following: (1) for the DC dataset, the highest severity detection accuracy is 88.22% using the gradient boosting algorithm for the random search method; (2) for the GC dataset, the highest severity detection accuracy is 86.00% using the DT algorithm for the random search method; (3) for the FE dataset, the highest severity detection accuracy is 96.00% using the DT algorithm for the random search method; and (4) for the LM dataset, the highest severity detection accuracy is 99.12% using the XG Boost algorithm for the grid search method.

Comparison Among All the Algorithms Used in this Work

A comparison among all machine-learning methods is shown in Table 14, and Fig. 2 shows the comparative analysis among all the algorithms using a bar graph. The following results are obtained:

  1. The gradient boosting (GB) algorithm achieved the highest SCA of 88.22% for the DC dataset.

  2. The DT approach achieved the maximum SCA of 86.00% for the GC dataset and 96.00% for the FE dataset.

  3. The XGB approach achieved the highest SCA of 99.12% for the LM dataset.

Table 14 Comparison among all algorithms
Fig. 2 Comparison bar chart of all algorithms

Influence of Feature Selection Technique

This section focuses on the influence of the feature selection technique to enhance the results and to identify the features which play an essential role in code smell severity detection. The Chi-square feature selection technique is applied in response to research question 2 (RQ2). The SCA (%) and F-measure (%) of all the algorithms with and without the feature selection technique are compared in Table 15. For this study, ten features are selected from each dataset, as shown in Table 3. This study observed that most algorithms show improved SCA on each dataset when the feature selection technique is applied, and only a few did not perform better. We observed that the highest severity detection performance on the DC dataset was obtained by the gradient boosting model (F-measure 87.00% and accuracy 88.22%). On the GC and FE datasets, the highest performance was obtained by the DT model (F-measure 90.00% and accuracy 86.00% for GC; F-measure 96.00% and accuracy 96.00% for FE). Similarly, the highest performance on the LM dataset was obtained by the XG Boost model (F-measure 100% and accuracy 99.12%).

Table 15 Result comparison with and without applying feature selection technique

Influence of Hyper-parameter Tuning on Algorithms

In response to research question 3 (RQ3), the influence of hyper-parameter tuning on the performance of all algorithms is studied. Table 16 presents the various parameter combinations and the highest SCA obtained for the DT algorithm. The DT algorithm obtained the highest SCA of 99.08% when the maximum level is 20, the number of trees is 20, and the number of splits is 15. Similarly, Tables 17 and 18 show the influence of parameter optimization on the SCA of all algorithms.

Table 16 Tuning parameters applied on DT algorithm
Table 17 POT’s effect on the LR model’s SCA
Table 18 POT’s effect on the RF, AdaBoost, XGBoost, and GB model’s SCA

Discussion

In this paper, we have used four code smell severity datasets, which include god class, data class, feature envy, and long method. The severity composition of each dataset is given in Table 2. In all the datasets, severity 1 has the most instances and severity 2 has the fewest. We have shown separate results for each severity of each dataset in "Results of Proposed Model", and we observed a significant difference between the results of the four severities for each dataset because the distribution of severities differs across datasets. In "Results of Proposed Model", we found the highest accuracy for severity 1 in each dataset, because severity 1 has the most instances, and the lowest accuracy for severity 2, because it has the fewest instances. Thus, we found a class imbalance problem in all the datasets, which we will consider in our future work.

In this experiment, we addressed the three research questions given in "Proposed Model and Dataset Description". To answer research question 1, we applied the seven machine-learning algorithms described in "Logistic Regression" to "Gradient Boosting (GB Algorithm)". We found that the DT algorithm gives the best SCA of 86.00% and 96.00% for the GC and FE datasets (Table 14). Likewise, the gradient boost algorithm gives the best SCA of 88.22% for the DC dataset, and the XG Boost approach gives the maximum SCA of 99.12% for the LM dataset (Table 14). To answer research question 2, we studied the influence of the feature selection technique in Table 15 and found that feature selection gives better performance for most of the algorithms on each dataset. To answer research question 3, we applied the POT to each algorithm. The influence of the POT on each model's performance is shown in Tables 16, 17 and 18.

Result Evaluation of Our Results with Other Related Works

In the earlier literature, many authors have applied different algorithms to different severity datasets, as discussed in detail in "Related Work". This section presents a summarized evaluation of our work in comparison with Fontana et al. [4] and Abdou [27]. Both applied various MLAs, spanning multinomial classification, regression, and binary classifiers for ordinal classification. A linear correlation-based filter method was also applied by them to select the best features. In addition, Abdou [27] applied the PART rule-learning algorithm to learn the efficiency of software metrics for detecting code smells, and the local interpretable model-agnostic explanations (LIME) algorithm was applied to explain the ML models and improve interpretability.

In our approach, we applied four MLAs and three ensemble learning approaches to identify the severity of code smells. A Chi-square-based FST is applied to select the best metrics, and two-parameter optimization techniques (grid search and random search) are applied to optimize the best parameters from each model.

Table 19 compares our results with other works. In our approach, for the DC dataset the GB algorithm achieved 88.22% SCA, while Fontana et al. [4] achieved 77.00% SCA using the O-RF algorithm, and Abdou [27] achieved 93.00% SCA using the O-R-SMO algorithm. Thus, the Abdou [27] model is the best for severity detection on the data class dataset.

Table 19 Result evaluation of our method with other works

For the GC dataset, our approach achieved 86.00% SCA using the DT algorithm, while Fontana et al. [4] achieved 74.00% SCA using the O-DT algorithm, and Abdou [27] achieved 92.00% SCA using the R-B-RF algorithm. Thus, the Abdou [27] model is the best for severity detection on the god class dataset.

For the FE dataset, our approach achieved 96.00% SCA using the DT algorithm, while Fontana et al. [4] achieved 93.00% SCA using the J48-Pruned algorithm, and Abdou [27] achieved 97.00% SCA using the R-B-JRIP and O-R-SMO algorithms. Thus, the Abdou [27] model is the best for severity detection on the FE dataset.

For the LM dataset, our approach achieved 99.12% SCA using the XG Boost algorithm, while Fontana et al. [4] achieved 92.00% SCA using the B-Random Forest algorithm, and Abdou [27] achieved 97.00% SCA using the R-B-JRIP, O-B-RF, and O-R-JRip algorithms. Thus, our model is the best for severity detection on the LM dataset.

Statistical Analysis for Comparing Machine-Learning Models

From Table 15, we observed that for a given dataset there is often not much difference between the results of the two best models. In that case, we used a statistical analysis method to choose the better of the two models for each dataset. We applied a paired T test on the two best models (according to Table 15) for each dataset, i.e., for each dataset we selected the two models that obtained the best accuracy and then applied the paired T test.

A paired T test compares two methods evaluated on the same dataset to determine whether the difference between them is statistically significant [44]. This method can be used to check whether there is a statistically significant distinction between two models, allowing us to choose only the better one. We used mean accuracy as the measurement, calculated across tenfold cross-validation, with the significance level set to α = 0.05.

We developed the following hypotheses (H) values for each comparison:

Null hypothesis (H0): The accuracies of models X and Y are obtained from the same sample. As a result, the difference in accuracy has an expected value of 0 (E[diff] = 0). In essence, the two models are identical [45].

Alternative hypothesis (H1): The prediction accuracies are obtained from two separate distributions, E[diff] ≠ 0. Essentially, the models are distinct, and one is superior to the other [45].

  • H0: ACCURACYx = ACCURACYy (the detection efficiency of the two models is identical).

  • H1: ACCURACYx ≠ ACCURACYy (the detection efficiency of the two models differs significantly).

where x and y are the two models under consideration. If the test returns a p value less than alpha, we can reject the null hypothesis.
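A minimal sketch of the test with SciPy follows; the fold-wise accuracies below are illustrative placeholders rather than values from Table 20, and are assumed to come from identical cross-validation splits for both models.

from scipy.stats import ttest_rel

# Per-fold accuracies of the two best models on the same tenfold cross-validation splits (illustrative values)
scores_x = [0.97, 0.99, 0.98, 0.97, 0.99, 0.98, 0.99, 0.97, 0.98, 0.99]
scores_y = [0.95, 0.97, 0.95, 0.96, 0.96, 0.97, 0.95, 0.96, 0.97, 0.96]

alpha = 0.05
t_stat, p = ttest_rel(scores_x, scores_y)    # paired T test on matched folds
print("t =", t_stat, " p value =", p)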

if p > alpha:
    print("Fail to reject null hypothesis")   # the two models perform equivalently
else:
    print("Reject null hypothesis")           # the accuracy difference is statistically significant

In this study, we calculated two parameters: mean accuracy and p value for each dataset.

Mean accuracy: the model with the higher mean accuracy is considered the better model for a dataset. p value: if the p value is greater than alpha (α = 0.05), the null hypothesis is retained; if the p value is below alpha, the null hypothesis is rejected. We then selected the better of the two models according to mean accuracy.

Table 20 displays the mean accuracy and p value of each classification model across each code smell dataset. From Table 20, we observed that the DC, GC, and LM datasets have p values less than 0.05; therefore, the accuracies of the two models differ significantly for these datasets. For the DC dataset, the mean accuracy of the XGBoost model is higher than that of the gradient boosting model, so the XGBoost model is best for the DC dataset. For the GC dataset, the mean accuracy of the XGBoost model is higher than that of the DT model, so the XGBoost model is best for the GC dataset. For the LM dataset, the mean accuracy of the XGBoost model is higher than that of the DT model, so the XGBoost model is best for the LM dataset. The FE dataset has a p value greater than 0.05; therefore, the mean accuracies of the two models are statistically equivalent (Table 20).

Table 20 Statistical analysis results

Conclusion

In this research work, we proposed a framework for the severity classification of code smells with multi-class classification approaches using four machine-learning and three ensemble learning algorithms, in order to analyze code smell severities, decrease maintenance effort, enhance software quality, and identify the best algorithms for detecting code smell severity. To select the significant features from each dataset, the Chi-square feature selection algorithm is applied. Two parameter optimization algorithms (grid search and random search) with fivefold cross-validation are used to enhance the SCA.

In this study, it is found that the GB method achieves the maximum SCA of 88.22% using the feature selection algorithm, while the AdaBoost algorithm obtains the lowest result of 70.27% without the feature selection algorithm for the DC dataset.

The DT method achieves the maximum SCA of 86.00% using the feature selection algorithm, while the KNN algorithm obtains the lowest result of 61.10% without the feature selection algorithm for the GC dataset.

The DT algorithm achieves a maximum SCA of 96.00% using the feature selection algorithm, while the lowest result of 68.78% is obtained by the KNN algorithm without the feature selection algorithm for the FE dataset.

The XG Boost algorithm achieves a maximum SCA of 99.12% using the feature selection algorithm, while the lowest result of 91.64% is obtained by the KNN model with the feature selection approach for the LM dataset.

This study found that the GB method is best for the DC dataset, the XG Boost model is best for the LM dataset, and the DT model is best for the GC and FE datasets for detecting the severity of code smells. Moreover, the Chi-square feature selection technique is generally helpful for better detection of the severity of code smells.

The limitation of this study is that the code smell severity datasets have a class imbalance problem; therefore, in subsequent work, we intend to enhance the outcomes by utilizing class balancing techniques to address this issue. In addition, other learning algorithms and feature selection strategies should be investigated to determine the most effective methods for code smell severity detection.