1 Introduction

Code smell refers to an anomaly in the source code that indicates a violation of basic design principles such as abstraction, hierarchy, encapsulation, modularity, and modifiability (Booch 1980). Even if the design principles are known to developers, they are often violated because of inexperience, deadline pressure, and heavy competition in the market. Fowler et al. (1999) defined 22 informal code smells. These smells have different granularities based on the affected element, such as class-level (God class, data class, etc.) and method-level (long method, feature envy, etc.) code smells. One way to remove them is refactoring (Opdyke 1992), i.e., techniques that improve the internal structure (design quality) of the code without altering the external behavior of the software.

In the literature, there are several techniques (Kessentini et al. 2014) and tools (Fontana et al. 2012) available to detect different code smells. Each technique and tool produces different results. According to Kessentini et al. (2014), code smell detection techniques can be classified into seven categories, which differ in the underlying algorithm: cooperative based (Abdelmoez et al. 2014), visualization based (Murphy-Hill and Black 2010), search based (Palomba et al. 2015; Liu et al. 2013; Palomba et al. 2013), probabilistic based (Rao and Reddy 2007), metric based (Marinescu 2004; Moha et al. 2010a; Tsantalis and Chatzigeorgiou 2009), symptoms based (Moha et al. 2010b), and manual techniques (Travassos et al. 1999; Ciupke 1999). Bowes et al. (2013) compared two smell detection tools on message chaining and showed the disparity between their results. Because of such differing results, Rasool and Arshad (2015) classified, compared, and evaluated existing detection tools and techniques to better understand this categorization. There are three main reasons for the disparity in the results: (1) developers can interpret code smells subjectively and hence detect them in different ways, (2) agreement between detectors is low, i.e., different tools or rules recognize different smells for different code elements, and (3) the threshold value for identifying a smell can vary from one detector to another.

To address the above limitations, in particular the subjective nature, Fontana et al. (2016b) proposed a machine learning (ML) technique (supervised classification) to detect four code smells with the help of 32 classification techniques. The authors showed that most of the classifiers achieved more than 95% performance in terms of accuracy and F-measure. After observing the results, the authors suggested that ML classifiers are the most suitable approach for code smell detection. Di Nucci et al. (2018) addressed some of the limitations of Fontana et al. (2016b); one of the drawbacks reported is that the prepared datasets do not represent a real-world scenario, i.e., in the datasets, the metric distributions of smelly and non-smelly instances differ strongly. This may enable the ML classifiers to easily distinguish the two classes (smelly and non-smelly). In a realistic environment, the boundary between smelly and non-smelly characteristics is not always clear (Tufano et al. 2017; Fontana et al. 2016a). To avoid this limitation and simulate more realistic datasets, Di Nucci et al. (2018) reconfigured the datasets of Fontana et al. (2016b) by merging the class-level and the method-level datasets, respectively. The merged datasets have a less separable metric distribution and contain more than one type of code smell instance. The authors experimented with the same ML techniques as Fontana et al. (2016b) on the revised datasets and achieved an average of 76% accuracy across all models. The authors claimed that the performance of ML classifiers is reduced when the dataset represents a realistic scenario.

In this work, we investigate why the ML classifiers performed worse in Di Nucci et al. (2018) and show that ML classifiers perform well even under a realistic scenario. In the merged datasets of Di Nucci et al. (2018), some instances are identical but are assigned different decision labels (one smelly and the other non-smelly); we call these disparity instances. Because of the disparity instances in the datasets, the classifiers performed poorly in their work. In the merging of Di Nucci et al. (2018), if the long method dataset contains an instance labeled as smelly and the same instance also exists in the feature envy dataset (whether smelly or not), that feature envy instance is merged into the long method dataset as non-smelly. The long method dataset then contains the same instance with two different decision labels, i.e., a disparity instance. This disparity confuses the ML algorithms and explains the poor performance reported by Di Nucci et al. (2018). In this paper, we removed the disparity instances from the method-level merged datasets and experimented with the same tree-based classification techniques on them. The performance achieved in our work is similar to the performance of Fontana et al. (2016b).

From the datasets of Fontana et al. (2016b) and Di Nucci et al. (2018), we observed that there are 395 common instances at the method level, which can be labeled with both smells. These instances led to the idea of forming a multi-label dataset. Through this dataset, the disparity can be eliminated, a metric distribution similar to that of Di Nucci et al. (2018) can be maintained, and more than one smell can be detected. In the literature (Azeem et al. 2019; Pecorelli et al. 2019b; Zaidi and Colomo-Palacios 2019), only one code smell is detected for a given method with the help of an ML (single-label) classifier, and no one has detected code smells by considering the correlation among them. In this paper, we therefore formulate code smell detection as a multi-label classification (MLC) problem. Both factors (multiple smell detection and correlation) can be addressed through the methods of MLC. It is important for developers to detect multiple code smells so that they can schedule the smells accordingly for refactoring. The effort required for refactoring varies from one smell to another because the smells are correlated (one may influence the other), and refactoring such correlated smells reduces the developer's effort.

For our study, we considered two method-level code smell (long method and feature envy) datasets from Fontana et al. (2016b) and converted them into a multi-label dataset (MLD). From the MLD, we found (by using the lift measure) that there is a positive correlation between the considered smells. Three MLC methods (binary relevance (BR), classifier chain (CC), and label combination (LC)) are applied to the MLD using 10-fold cross-validation with ten iterations. In the classification phase, among the three methods, BR does not consider the correlation. The other two approaches (CC, LC) take advantage of the positive correlation and achieve better performance (95% on average) than BR (91% on average).

The structure of the paper is organized as follows: The second section explains the background of ML classification techniques and reviews work related to the detection of code smells using ML techniques; the third section defines the code smells used in the preparation of the multi-label dataset; the fourth section describes the proposed approach and states the research questions; the fifth section describes the experimental setup of the multi-label classification; the sixth section presents the results of the proposed study and answers the research questions; the seventh section outlines the threats to the validity of our work; the final section gives conclusions and future directions.

2 Related work

Over the past fifteen years, researchers have presented various tools and techniques for detecting code smells. According to Kessentini et al. (2014), there are seven different classification categories to detect code smells. They are cooperative-based approaches, visualization-based approaches, machine learning–based approaches, probabilistic approaches, metric-based approaches, symptoms-based approaches, and manual approaches. In this section, we only consider the machine learning approaches for detecting the code smells.

2.1 Machine learning (supervised classification) approaches to detect code smells

Supervised classification is the task of using algorithms that allow the machine to learn associations between instances and decision labels. Supervision comes in the form of previously labeled instances, from which an algorithm builds a model to predict the labels of new instances automatically. Figure 1 shows the working procedure of a supervised classification algorithm. In ML, classification is of three types: binary (yes or no), multi-class, and multi-label classification (MLC). In the literature (Azeem et al. 2019), code smell detection is treated as a single-label (binary) classification problem, which detects only one type of code smell (its presence or absence). Below is a summary of the related work on code smell detection using single-label classifiers.

Fig. 1
figure 1

Working procedure of ML supervised classification technique
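As a minimal illustration of the workflow in Fig. 1 (a sketch we add for clarity, not part of the original study), the following Python snippet trains a decision tree on previously labeled metric vectors and predicts the label of a new instance; the metric values, feature choice, and labels are invented for the example.

```python
# Minimal sketch of supervised (binary) classification with scikit-learn.
# The metric values and labels below are invented purely for illustration.
from sklearn.tree import DecisionTreeClassifier

# Each row describes a code element by metric values (e.g., lines of code, fan-out).
X_train = [[250, 12], [30, 2], [410, 20], [45, 1]]
y_train = [1, 0, 1, 0]  # previously labeled instances: 1 = smelly, 0 = non-smelly

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)        # learn associations between instances and labels

print(model.predict([[300, 15]]))  # predict the label of a new, unseen instance
```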

Kreimer (2005) introduced an adaptive method to find design flaws (viz., big class/large class and long method) by combining known metric-based approaches with a classification technique called decision trees. The IYC system and the WEKA package are the two software systems used for the analysis.

Khomh et al. (2009) proposed a Bayesian approach to detect occurrences of the Blob anti-pattern in open-source programs (GanttProject v1.10.2 and Xerces v2.7.0). Khomh et al. (2011) presented BDTEX (Bayesian Detection Expert), a Goal Question Metric approach to build Bayesian Belief Networks from the definitions of anti-patterns, and validated BDTEX with the blob, functional decomposition, and spaghetti code anti-patterns on two open-source programs.

Maneerat and Muenchaisri (2011) collected datasets from the literature for the evaluation of seven bad smells and applied seven machine learning algorithms to predict them, using 27 design model metrics (extracted by a tool) as independent variables. The authors did not make any explicit reference to the dataset.

Maiga et al. (2012) introduced an SVM-based detection approach that uses a support vector machine to detect anti-patterns. They used the open-source programs ArgoUML, Azureus, and Xerces to study the blob, functional decomposition, spaghetti code, and Swiss army knife anti-patterns. They later extended their work with SMURF, which takes practitioners' feedback into account.

Wang et al. (2012) proposed a method that helps in understanding the harmful nature of intended cloning operations by using Bayesian networks and a set of features related to the code, the destination, and the history of the cloning operation.

Yang et al. (2015) studied the decisions of individual users by applying machine learning algorithms to individual code clones. White et al. (2016) detected code clones using deep learning techniques; the authors sampled 398 file-level and 480 method-level pairs across eight real-world Java software systems.

Amorim et al. (2015) studied how to recognize code smells using decision tree algorithms. The authors experimented on four open-source projects and compared the results with a manual oracle, with existing detection approaches, and with other machine learning algorithms.

Fontana et al. (2016b) experimented with 16 ML classification techniques on four code smell datasets (viz., data class, long method, feature envy, and God class) to detect them. For this study, the authors used 74 Java systems belonging to the Qualitas Corpus (Tempero et al. 2010).

Fontana and Zanoni (2017) classified code smell severity using machine learning methods. This approach can help software developers prioritize or rank classes and methods. Multinomial classification and regression techniques were applied for code smell severity classification.

Di Nucci et al. (2018) addressed some of the limitations of Fontana et al. (2016b). The authors reconfigured the datasets of Fontana et al. and provided new datasets that better reflect realistic scenarios.

Pecorelli et al. (2019a) investigated several techniques to handle data imbalance issues to understand their impact on ML classifiers for code smell detection.

The major difference between the previous work and the proposed approach is that code smell detection is viewed here as a multi-label classification problem. No other approach has considered the correlation among the smells when detecting them. Table 1 presents a comparison of our study with the referenced papers.

Table 1 Comparison with prior work and this paper

3 Evaluated code smells

In this paper, two method-level code smells have been used to create the multi-label dataset. These code smells are frequent and fault prone or change prone according to the literature (Ferme 2013). They also cover different object-oriented quality dimension problems such as complexity, size, coupling, encapsulation, cohesion, and data abstraction (Marinescu 2005). Table 2 lists the two selected method-level code smells with their affected entities and quality dimensions. The list below reports the characteristics of the selected code smells.

  • Long method: A method is said to be a long method when it has many lines of code and uses much of the data of other classes. This increases the functional complexity of the method and makes it difficult to understand.

  • Feature envy: Feature envy is a method-level smell in which a method uses more data from other classes than from its own class, i.e., it accesses more foreign data than local data.

Table 2 Selection of code smells

4 Multi-label classification approach for code smell detection

This section describes the methodology followed in the empirical study: the reference datasets whose instances are used to prepare the multi-label dataset, and the multi-label classification models applied to it. Figure 2 shows the flow graph (principal steps) of code smell detection using the multi-label classification approach, and the following subsections describe these steps briefly:

Fig. 2
figure 2

Flow graph of code smell detection using multi-label classification approach

4.1 Reference datasets

In this paper, we have considered two method-level datasets (long method and feature envy) from Fontana et al. (2016b). In existing literature, these datasets are used for the detection of one smell. In the following subsections, we briefly describe the data preparation methodology of Fontana et al. (2016b). These datasets are available at https://drive.google.com/file/d/15aXc_el-nx4tQwU3khunQ-I5ObSA1-Zb/view [or] http://www.essere.disco.unimib.it/machine-learning-for-code-smell-detection/

4.1.1 Selection of systems

Fontana et al. (2016b) analyzed the software systems of the Qualitas Corpus collected by Tempero et al. (2010). Among the 111 systems of the corpus, 74 are considered for smell detection; the remaining 37 systems are excluded because they do not compile successfully, so code smells cannot be detected in them. The sizes (lines of code, etc.) of the 74 Java projects are shown in Table 3. These projects also cover different application domains such as databases, tools, middleware, and games. The complete characteristics (size, release date, etc.) of each project and the domain it belongs to are available at https://github.com/thiru578/Multilabel-Dataset

Table 3 Summary of 74 projects

4.1.2 Metric extraction

For the given 74 software systems, Fontana et al. (2016b) computed metrics at all levels (project, package, class, method) by using the tool “Design Features and Metrics for Java” (DFMC4J). This tool parses the source code of Java projects through the Eclipse JDT library. The computed metrics became the features (attributes) of the datasets and cover different quality dimensions, as shown in Fig. 3. The detailed computation of each metric is available at https://github.com/thiru578/Multilabel-Dataset.

Fig. 3
figure 3

Features of the multi-label dataset

4.1.3 Dataset preparation

Automatic code smell detection tools were used by Fontana et al. (2016b) to detect whether a source code element is smelly or not. Table 4 reports the detection tools used to build the code smell datasets. The tools produced some false positives, i.e., they may report elements that are not actually affected by a code smell. To cope with this problem, the authors applied stratified random sampling to the elements of the considered systems. The sampling produced 1986 elements (826 smelly and 1160 non-smelly), which were manually validated by the authors. To normalize the training datasets, the authors randomly removed smelly and non-smelly elements, resulting in the formation of 4 datasets. Each dataset has 420 instances; among them, 1/3 (140) are smelly and 2/3 (280) are non-smelly. These training datasets are given as input to the ML classification techniques.
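The balancing step can be sketched as follows, assuming the 1986 validated elements are available as a CSV file with a binary label column; the file name and the 'is_smelly' column are placeholders, not the actual artifacts of Fontana et al. (2016b).

```python
# Sketch of building one balanced 420-instance dataset (1/3 smelly, 2/3 non-smelly).
# File name and column name are assumptions made for illustration.
import pandas as pd

validated = pd.read_csv("validated_elements.csv")  # hypothetical 1986 validated elements

smelly = validated[validated["is_smelly"] == 1].sample(n=140, random_state=42)
non_smelly = validated[validated["is_smelly"] == 0].sample(n=280, random_state=42)

dataset = pd.concat([smelly, non_smelly]).sample(frac=1, random_state=42)  # shuffle rows
print(len(dataset))  # 420 instances
```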

Table 4 Automatic code smell detector tools

4.2 Proposed approach

In this section, we discuss the construction of a multi-label code smell dataset and then present a multi-label classification approach to detect multiple smells on the prepared dataset.

4.2.1 Construction of multi-label dataset

The considered long method and feature envy datasets have 420 instances each, which are used for the construction of a multi-label dataset. The following are the steps involved in the creation of the multi-label dataset.

  1. Initially, each dataset has 420 instances. From those, the 395 common instances are added to the multi-label dataset with their corresponding two decision labels.

  2. The remaining 25 uncommon instances of the LM dataset act as input to the top classifier model of FE (Fontana et al. 2016b), which predicts the FE label. Similarly, the 25 uncommon instances of the FE dataset are given as input to the top classifier model of LM (Fontana et al. 2016b), which predicts the LM label. The 50 uncommon instances with their predicted LM and FE decision labels are included in the multi-label dataset.

An overview of the procedure is depicted in Fig. 4. The result of the above steps forms the training dataset for the multi-label classification models, which is described below.

Fig. 4
figure 4

Construction of multi-label dataset
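A sketch of this construction is given below; it assumes the reference datasets are available as pandas DataFrames with a shared list of metric columns and one label column each ('is_long_method' / 'is_feature_envy'), and that pre-trained single-label models for the opposite smell are provided. All identifiers are placeholders, not the authors' original code.

```python
# Sketch of the two construction steps above (placeholder names, illustration only).
import pandas as pd

def build_multilabel_dataset(lm, fe, fe_model, lm_model, metric_cols):
    # metric_cols: list of metric column names shared by both DataFrames.
    # Step 1: the common instances keep both of their original decision labels.
    common = lm.merge(fe, on=metric_cols)
    mld = common[metric_cols + ["is_long_method", "is_feature_envy"]]

    # Step 2: label the uncommon instances of each dataset with the other smell's model.
    keys = set(map(tuple, common[metric_cols].values))
    lm_only = lm[~lm[metric_cols].apply(tuple, axis=1).isin(keys)].copy()
    fe_only = fe[~fe[metric_cols].apply(tuple, axis=1).isin(keys)].copy()
    lm_only["is_feature_envy"] = fe_model.predict(lm_only[metric_cols])
    fe_only["is_long_method"] = lm_model.predict(fe_only[metric_cols])

    # In our setting: 395 common + 25 + 25 uncommon instances = 445 rows.
    return pd.concat([mld, lm_only, fe_only], ignore_index=True)
```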

Training dataset

Figure 5 shows the representation of the multi-label dataset (training dataset). In the dataset, each row represents a method instance and the columns represent the features (metrics at all levels) and the decision variables (whether the method instance is smelly or not). The dataset contains 445 method instances, 82 metrics, and 2 decision variables. By the criteria of Di Nucci et al. (2018), the MLD is also suitable for realistic scenarios (reduced metric distribution and presence of different types of smell instances). In general, for each instance (either class or method), the metrics of the respective containers are included as features in the dataset, as shown in Fig. 6. Our work is on method-level code smell instances; hence, we include metrics of the method, class, package, and project because of the containment relation, which states that a method is contained in a class, a class is contained in a package, and a package is contained in a project.

Fig. 5
figure 5

Multi-label dataset

Fig. 6
figure 6

Class and method metric instance

4.2.2 Multi-label classification approach

Multi-label classification (MLC) is a way to learn from instances that are associated with a set of labels (predictive classes). That is, for every instance, there can be one or more labels associated with it. MLC is frequently used in application areas such as multimedia classification, medical diagnosis, text categorization, and semantic scene classification. Similarly, in the code smell detection domain, the instances are code elements and the set of labels are code smells, i.e., a code element can contain more than one type of smell, which was not addressed by earlier approaches.

The advantage of MLC over single-label classification (SLC) is that, in the classification phase, the methods of MLC consider the correlation among the decision labels (code smells), whereas in SLC each smell is classified independently. In the literature, Palomba et al. (2017) stated that the considered code smells frequently co-occur. The reason behind this co-occurrence is that the smells defined in Section 3 share a common symptom (Access to Foreign Data (ATFD)). The correlation between the smells can improve the classifier performance (Zhang and Zhou 2013) in MLC.

Methods of multi-label classification

Two approaches are widely used to handle multi-label classification problems (Tsoumakas and Katakis 2007): the Problem Transformation Method (PTM) and the Algorithm Adaptation Method (AAM). In PTM, the multi-label dataset is transformed into single-label problems that are solved using appropriate classifiers. In AAM, a single-label classification algorithm is adapted so that it can handle multi-label data directly. In this paper, only PTM is considered.

We have identified a set of specific research questions that guide our classification of code smells using the multi-label approach:

  • RQ1: How many disparity instances exist in the configured datasets for the concerned code smells in Di Nucci et al. (2018)?

  • RQ2: What would be the performance improvement after removing the disparity instances?

  • RQ3: What percent (confidence) of the method-level code smells are correlated to one another?

  • RQ4: What would be the classifier performance with and without correlation consideration?

5 Experimental setup

The MLC approach is a new perspective for detecting code smells. In this section, we explain the MLC experimental setup in detail so that it can help the research community in further investigations. Figure 7 represents the flow of the experimental setup for MLC. The phases of the MLC experimental setup are explained as follows:

Fig. 7
figure 7

MLC experimental setup

Pre-processing:

The constructed MLD training dataset has 82 metrics. Among them, 25 are class-level, 5 are package-level, and 6 are project-level metrics, which are irrelevant to method-level code smells because they cannot contribute to their detection. The method metrics cover all the structural information (coupling with other classes, etc.) of the methods. Therefore, before applying the ML classifiers to our MLD, the class-, package-, and project-level metrics are removed from the dataset.
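A small sketch of this filtering step, assuming the MLD is loaded as a pandas DataFrame and that the class-, package-, and project-level metric columns can be listed by name (the names passed in are placeholders):

```python
# Sketch of the pre-processing step: keep only method-level metrics and the labels.
import pandas as pd

def drop_non_method_metrics(mld: pd.DataFrame, class_cols, package_cols, project_cols):
    # class_cols, package_cols, project_cols: lists of column names to discard
    # (25 + 5 + 6 = 36 columns in our dataset, leaving the method metrics and 2 labels).
    return mld.drop(columns=list(class_cols) + list(package_cols) + list(project_cols))
```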

Problem transformation method:

In PTM, the multi-label dataset is transformed into single-label problems that are solved using appropriate single-label classifiers. Many methods fall under the PTM category. Among them, two methods can be seen as the foundation of many others. (1) Binary relevance method (Godbole and Sarawagi 2004): it converts a multi-label dataset into as many binary datasets as there are different labels. The predictions of the resulting binary classifiers are merged to obtain the outcome. (2) Label powerset method (Boutell et al. 2004): it converts a multi-label dataset into a multi-class dataset by using the label set of each instance as a class identifier; any multi-class classifier can then be trained, and its predicted classes are transformed back into label sets. In the transformation phase, these two methods do not lose information, unlike max, min, random, etc., where information can be lost.

Several methods have been developed on top of the binary relevance and label powerset methods. In this paper, two methods are considered under BR, namely (1) binary relevance (BR), which is a method in itself, and (2) classifier chains (CC), and one method is considered under LP, namely label combination (LC). A short description of these three methods is reported below; the MEKA tool (Read et al. 2016) provides implementations of the selected methods.

  • Binary relevance: This is the simplest technique; it treats each label as a separate binary (single-label) classification problem. It does not model label correlation, i.e., smell correlation. Figure 8 shows an example of the BR transformation method procedure.

  • Classifier chains (Read et al. 2011): This algorithm tries to enhance binary relevance by considering the label correlation. To predict the labels, Q classifiers are trained (Q is the number of binary datasets into which the multi-label dataset is split, one per label) and connected to one another in such a way that the prediction of each classifier is added to the dataset as a new feature for the next classifier. Figure 9 shows an example of the CC transformation method procedure.

  • Label combination (Boutell et al. 2004): This method treats each label combination as a single class in a multi-class learning scheme. The set of possible class values is the powerset of the labels. Figure 10 shows an example of the LC transformation method procedure.

The reason for choosing CC and LC methods is that they capture the label dependencies (correlation or co-occurrence) during classification to improve the classification performance (Guo and Gu 2011).
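As a sketch of how the three methods can be instantiated (the study itself used the MEKA tool), the following Python code builds BR and CC with scikit-learn and implements label combination manually by treating each label set as a single multi-class value. The feature matrix X and the binary label matrix Y (columns LM and FE) are assumed to be available.

```python
# Sketch of the three problem transformation methods (the paper used MEKA, not this code).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain

base = DecisionTreeClassifier(random_state=0)

# Binary relevance: one independent binary classifier per label (ignores correlation).
br = MultiOutputClassifier(base)

# Classifier chain: each classifier's prediction is appended as a feature for the next one.
cc = ClassifierChain(base, order=[0, 1], random_state=0)

# Label combination (label powerset): treat every distinct label set as one class.
class LabelCombination:
    def __init__(self, base):
        self.base = base
    def fit(self, X, Y):
        self.classes_ = np.unique(Y, axis=0)                 # distinct label sets
        keys = {tuple(row): i for i, row in enumerate(self.classes_)}
        self.base.fit(X, np.array([keys[tuple(row)] for row in Y]))
        return self
    def predict(self, X):
        return self.classes_[self.base.predict(X)]           # decode classes back to label sets

lc = LabelCombination(DecisionTreeClassifier(random_state=0))
# Usage: br.fit(X, Y); cc.fit(X, Y); lc.fit(X, Y); then call .predict on new instances.
```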

Single label classifiers:

After the transformation, the top 5 tree-based (single-label) classifiers are used as base learners within the multi-label methods (BR, CC, LC). A previous study (Azeem et al. 2019) shows that these classifiers achieve high performance in code smell classification.

Validation procedure:

To test the performance of the different prediction models built, we applied 10-fold cross-validation and repeated it 10 times to cope with randomness (Hall et al. 2011), i.e., the process is executed 100 times in total (10 × 10). Figure 11 shows the procedure of 10-fold cross-validation. For each repetition of the cross-validation, a differently randomized dataset is used, and each fold is created by stratified sampling of the dataset.
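A sketch of this validation procedure, assuming X (features) and Y (binary label matrix) as NumPy arrays and `model` as any of the multi-label methods above; stratifying on the label combination is our way of approximating the stratified sampling mentioned in the text, not necessarily the exact original setup.

```python
# Sketch of 10 x 10-fold cross-validation with an exact-match score per run.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

def repeated_cv_exact_match(model, X, Y, folds=10, repeats=10, seed=42):
    strata = ["".join(map(str, row)) for row in Y]   # label combination used for stratification
    cv = RepeatedStratifiedKFold(n_splits=folds, n_repeats=repeats, random_state=seed)
    scores = []
    for train_idx, test_idx in cv.split(X, strata):  # 10 folds x 10 repetitions = 100 runs
        model.fit(X[train_idx], Y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(np.mean(np.all(pred == Y[test_idx], axis=1)))
    return float(np.mean(scores))
```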

Evaluation measures:

The evaluation metrics of MLC differ from those of single-label classification since, for each instance, there are multiple labels, which may be classified as partly correct or partly incorrect. MLC evaluation metrics are classified into two groups: (1) example-based metrics and (2) label-based metrics. In example-based metrics, the metric is calculated for each instance and the average over all instances gives the final outcome. Label-based metrics are computed for each label instead of each instance. In this work, we use example-based measures, because label-based measures fail to directly address the correlations among the different classes (Sorower 2010). Equations 1, 2, and 3, which belong to the example-based metrics, are used to measure the performance of the MLC methods. In these equations, D denotes the set of instances, L the set of labels, Yi the predicted label set for instance i, and Zi the true label set for instance i. A detailed discussion of all other measures can be found in Sorower (2010).

  • Accuracy: The proportion of correctly predicted labels with respect to the total number of predicted and true labels for each instance, averaged over all instances.

    $$ \text{Accuracy} = \frac{1}{|D|} \sum\limits_{i=1}^{|D|} \frac{|Y_{i} \cap Z_{i}|} {|Y_{i} \cup Z_{i}|} $$
    (1)
  • Hamming loss: The prediction error (an incorrect label is predicted) and the missing error (a relevant label is not predicted), normalized over the total number of labels and the total number of instances.

    $$ \text{Hamming loss} = \frac{1}{|D|} \sum\limits_{i=1}^{|D|} \frac{|Y_{i} {\Delta} Z_{i}|} {|L|} $$
    (2)
  • Exact match ratio: The proportion of instances whose predicted label set is identical to the actual label set. It is the strictest evaluation metric.

    $$ \text{Exact match ratio} = \frac{1}{|D|} \sum\limits_{i=1}^{|D|} I(Y_{i}=Z_{i}) $$
    (3)
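The three example-based measures of Eqs. 1–3 can be computed directly, as in the following sketch; Y_true and Y_pred are binary NumPy arrays of shape (|D|, |L|), and the code is for illustration only.

```python
# Direct implementations of Eqs. 1-3 for binary label matrices (illustration only).
import numpy as np

def mlc_accuracy(Y_true, Y_pred):        # Eq. 1: per-instance |Y ∩ Z| / |Y ∪ Z|, averaged
    inter = np.logical_and(Y_true, Y_pred).sum(axis=1)
    union = np.logical_or(Y_true, Y_pred).sum(axis=1)
    return float(np.mean(np.where(union == 0, 1.0, inter / np.maximum(union, 1))))

def hamming_loss(Y_true, Y_pred):        # Eq. 2: symmetric difference normalized by |L| and |D|
    return float(np.mean(Y_true != Y_pred))

def exact_match_ratio(Y_true, Y_pred):   # Eq. 3: all labels of an instance must match
    return float(np.mean(np.all(Y_true == Y_pred, axis=1)))
```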
Fig. 8
figure 8

Procedure of BR transformation method

Fig. 9
figure 9

Procedure of CC transformation method

Fig. 10
figure 10

Procedure of LC transformation method

Fig. 11
figure 11

Procedure of 10-fold cross-validation

6 Experimental results

6.1 Dataset results

To answer RQ1, we considered the configured datasets of Di Nucci et al. (2018). The authors merged the feature envy (FE) dataset into the long method (LM) dataset and vice versa. The merged datasets are listed in Table 5. Each dataset has 840 instances; among them, 140 are smelly (affected) and 700 are non-smelly. While merging, there are 395 common instances, of which 132 appear as smelly instances in the LM dataset. In the same way, when LM is merged into FE, 125 of the common instances appear as smelly in the FE dataset. These 132 and 125 instances suffer from disparity, i.e., the same instance carries two class labels (smelly and non-smelly). These disparity instances (Di Nucci et al. 2018) induce a performance loss for standard single-label ML classification techniques. The merged datasets are available at https://figshare.com/articles/Detecting_Code_Smells_using_Machine_Learning_Techniques_Are_We_There_Yet_/5786631
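A sketch of how such disparity instances can be counted (RQ1), assuming a merged dataset loaded as a pandas DataFrame whose metric columns describe the instance and whose binary label column is named 'is_smelly' (a placeholder for the actual column name):

```python
# Sketch: count smelly instances whose exact metric vector also occurs as non-smelly.
import pandas as pd

def count_disparity(df: pd.DataFrame, label_col: str = "is_smelly") -> int:
    metric_cols = [c for c in df.columns if c != label_col]
    # Number of distinct labels observed for each identical metric vector.
    label_variety = df.groupby(metric_cols)[label_col].transform("nunique")
    # Smelly rows whose metric vector also appears with the non-smelly label are
    # disparity instances (132 in the LM dataset and 125 in the FE dataset above).
    return int(((label_variety > 1) & (df[label_col] == 1)).sum())
```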

Table 5 Configured datasets

In our work, we create a multi-label dataset by merging the 395 common and the 50 uncommon (25 each) instances of LM and FE; all put together, there are 445 instances. Table 6 shows the number and percentage of affected instances in the dataset. Out of 445 instances, 100 are affected by both smells. Considered individually, 162 and 140 instances are affected by the LM and FE smells, respectively.

Table 6 Number of instances affected in multi-label dataset

6.2 Multi-label dataset statistics

Table 7 lists the basic measures characterizing the multi-label training dataset. The basic measures of a single-label dataset are the numbers of attributes, instances, and labels. In addition, other measures are defined for multi-label datasets (Tsoumakas and Katakis 2007). In the table, cardinality indicates the average number of active labels per instance. Dividing this measure by the number of labels in a dataset gives a dimensionless measure known as density. With two labels, there are four possible label combinations (label sets) in our dataset. The mean imbalance ratio (MeanIR) indicates whether the dataset is imbalanced or not. According to Charte et al. (2015), any multi-label dataset with a MeanIR value higher than 1.5 should be considered imbalanced. By this criterion, the prepared multi-label dataset is well balanced, since the MeanIR value in our case is 1.07, which is less than 1.5.
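The statistics of Table 7 can be reproduced from the label matrix as in this sketch, where Y is the binary label matrix of the MLD (one column per smell); the code is an illustration, not the tool actually used.

```python
# Sketch of cardinality, density, and MeanIR for a binary label matrix Y.
import numpy as np

def mld_statistics(Y):
    Y = np.asarray(Y)
    cardinality = Y.sum(axis=1).mean()        # average number of active labels per instance
    density = cardinality / Y.shape[1]        # cardinality divided by the number of labels
    counts = Y.sum(axis=0)                    # number of instances carrying each label
    mean_ir = float(np.mean(counts.max() / counts))  # MeanIR; > 1.5 suggests imbalance
    return float(cardinality), float(density), mean_ir
```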

Table 7 Statistics of multi-label dataset

6.3 Co-occurrence analysis (market basket analysis)

To answer RQ3, in this section we investigate the relation between the two smells through association (support, confidence) and correlation (lift) analysis.

Association analysis:

How often the presence of long method implies the existence of feature envy, and vice versa, is shown below. The support and confidence measures are used to evaluate the association among the smells.

  1. Long method (LM) ⇒ Feature envy (FE):

    $$ \text{Confidence} = \frac{\text{Support} (LM \cap FE)}{\text{Support}(LM)} = 100/162 = 61.7\% $$

  2. Feature envy (FE) ⇒ Long method (LM):

    $$ \text{Confidence} = \frac{\text{Support} (LM \cap FE)}{\text{Support}(FE)} = 100/140 = 71.4\% $$

In the above equations, support indicates how frequently the smells occur together and confidence shows how often the if-then pattern holds.

Correlation analysis:

Lift is a measure used to evaluate the correlation among the smells, i.e., whether long method and feature envy are really related rather than coincidentally occurring together.

$$ \begin{aligned} \text{Lift}(LM \Rightarrow FE) &= \frac{\text{Support} (LM \cap FE)}{\text{Support}(LM) \times \text{Support}(FE)} \\ &= \frac{100/445}{(162/445) \times (140/445)} = \frac{0.22}{0.36 \times 0.31} \approx 1.9 \end{aligned} $$

Lift(LM, FE) > 1 (Sheikh et al. 2004) indicates that LM and FE are positively correlated, i.e., the occurrence of one implies the occurrence of the other.
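The support, confidence, and lift values above can be recomputed from the two label columns of the MLD as in the following sketch (lm and fe are 0/1 NumPy arrays; the code is an illustration of the measures, not the original analysis script):

```python
# Sketch of the association (support, confidence) and correlation (lift) measures.
import numpy as np

def association_measures(lm, fe):
    n = len(lm)
    support_lm = lm.sum() / n                         # fraction of LM-affected instances
    support_fe = fe.sum() / n                         # fraction of FE-affected instances
    support_both = np.logical_and(lm, fe).sum() / n   # fraction affected by both smells
    conf_lm_fe = support_both / support_lm            # Confidence(LM => FE), ~61.7% here
    conf_fe_lm = support_both / support_fe            # Confidence(FE => LM), ~71.4% here
    lift = support_both / (support_lm * support_fe)   # > 1 means positive correlation (~1.9)
    return conf_lm_fe, conf_fe_lm, lift
```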

6.4 Performance improvement in existing datasets

To answer RQ2, we removed the 132 and 125 disparity instances from the merged datasets of the long method (LM) and feature envy (FE), respectively. The LM dataset then has 708 instances, among which 140 are positive (smelly) and 568 are negative (non-smelly); the FE dataset has 715 instances, among which 140 are positive and 575 are negative. After that, we applied single-label ML techniques (tree-based classifiers) to those datasets. The performance improved significantly on both datasets, as shown in Tables 8 and 9. Earlier, the performance on the long method and feature envy datasets was on average 73% and 75%, respectively, using tree-based classifiers. After removing the disparity instances from both datasets, we obtained on average 95% and 98%, respectively. This is evidence that the disparity in Di Nucci et al. (2018) had reduced the performance on the concerned code smell datasets.

Table 8 Long method results
Table 9 Feature envy results

6.5 Performances of multi-label classification

To answer RQ4, three problem transformation methods (BR, CC, LC) are used to transform the multi-label training dataset into a set of binary or multi-class datasets. Then, the top 5 tree-based classification techniques are applied to the transformed datasets. The performances of these techniques are shown in Tables 10, 11, and 12, respectively. To evaluate them, we ran ten iterations of 10-fold cross-validation and report the average accuracy, Hamming loss, and exact match ratio over those 100 runs. In addition, we also list the label-based measures of the BR, CC, and LC methods in the Appendix, Tables 13, 14, and 15, respectively. The results for all other measures are available for download at https://github.com/thiru578/Multilabel-Dataset/blob/master/MLD_Results.csv

Table 10 Results of BR method using top 5 single label classifiers
Table 11 Results of CC method using top 5 single label classifiers
Table 12 Results of LC method using top 5 single label classifiers

From Tables 10, 11, and 12, we observe that all top 5 classifiers perform well under the three methods in all evaluation measures. In the tables, we have highlighted the best classifier of each method in italic font. When we compare the three methods, CC and LC perform better than BR in all three evaluation measures. The reason is that the CC and LC methods consider the correlation during classification, which BR does not. When we compare the CC and LC methods, LC performs slightly better than CC. From these results, we observe that the correlation among the smells can improve the classifier performance in detecting multiple smells.

7 Threats to validity

In this section, we discuss the threats that might affect our empirical study and how we mitigated them.

Threats to internal validity

In the MLD construction, there are 395 common instances in the considered datasets. These instances can be clearly distinguished as smelly or non-smelly, and any ML technique can easily separate the two classes. This kind of dataset is not suitable for a realistic scenario and is similar to the limitation of the Fontana et al. (2016b) datasets pointed out by Di Nucci et al. (2018). To mitigate this issue, we added the 50 uncommon instances with their predicted labels to the MLD. The predicted labels are assigned with the help of the long method and feature envy classifier models (Fontana et al. 2016b); the prediction procedure is discussed in step 2 of Section 4.2.1. After adding the uncommon instances, the MLD is more suitable for realistic use due to the reduced metric distribution.

Threats to external validity

With respect to the generalizability of the findings (external validity), we used two code smell datasets constructed from 74 open-source Java projects for our experimentation. However, we cannot claim that our results generalize to other programming languages, practitioners, or industrial practice. Future replications of this study are necessary to confirm the generalizability of our findings.

8 Conclusion and future directions

In this paper, two method-level code smells have been detected using multi-label classification approaches. The mechanisms proposed in the literature detect only a single type of code smell for a given element (class or method). The main contributions of this paper can be summarized as follows:

  • We addressed the disparity instances in the existing method-level datasets and improved the ML classifier performances from 73–75% to 95–98% by removing them. From these results, we conclude that ML classifiers still give good accuracy even for datasets that reflect realistic use-case scenarios. This work also opens a new perspective for code smell detection, namely multi-label classification (MLC).

  • For MLC, we constructed a multi-label dataset (MLD) by considering method-level smell datasets from single-type detectors. We experimented with three multi-label classification methods (BR, CC, LC) on the constructed dataset. The main advantage of the CC and LC methods is that they take the smell correlation (about 70% confidence found in the MLD) into account during classification and produce higher accuracy than the BR method. The BR method does not consider the correlation and classifies each decision label (code smell) independently by splitting the labels. With the correlation considered, the LC and CC methods gave on average 96% and 95.5% classifier performance for the exact match evaluation measure, respectively. Without considering the correlation, the BR method gave on average 92.7% classifier performance.

Though the proposed approach detected two code smells, it is not limited to two. The methods of MLC can also consider negative correlation among smells. For example, the God class and data class smells cannot co-occur (negative correlation) in the same class. In the future, we want to investigate negatively correlated smells with the proposed approach. In addition, our findings have important implications for future research: (1) after the detection of the code smells, analyze which smell has to be refactored first so as to reduce developer effort, since different refactoring orders require different effort; (2) identify (or prioritize) the critical code elements for refactoring based on the number of code smells detected in an element, i.e., an element affected by more design problems (code smells) gets the highest priority for refactoring; (3) in code smell detection, the number of code elements affected by a code smell in a software system is usually small. For example, Palomba et al. (2018) have shown that the God class code smell affects less than 1% of the total number of classes in a system. This results in imbalanced datasets for ML classification. To handle such imbalanced datasets, multi-label classification methods like RAkEL (Tsoumakas et al. 2011) and PS (Read et al. 2008) can be used. Apart from these methods, there are other techniques, such as SMOTE, which is used in Charte et al. (2015) to handle imbalanced datasets. In our study, the multi-label dataset (MLD) is well balanced because we considered only two code smells for the experimentation. As the number of code smells increases, the MLD becomes imbalanced.

The datasets, which are free from disparity instances, are available at https://github.com/thiru578/Datasets-LM-FE.

The multi-label dataset is available for download at https://github.com/thiru578/Multilabel-Dataset.