1 Introduction

Identifying code smells is a fundamental software maintenance activity. Although code smells are characterized as suboptimal design choices [1] that do not affect the functionality of the software, they still pose both short- and long-term threats. Numerous studies have delved into the effects of code smells in software and have highlighted their harmful impact on software quality as a primary consequence [2,3,4]. Various dimensions of software quality are affected, encompassing maintainability, correctness and program comprehension [5,6,7]. Additionally, smelly code is more prone to both faults and changes [8], with a notable association with the emergence of technical debt. Technical debt denotes a situation where the long-term quality of the code is sacrificed in favour of short-term objectives [9], which in turn results in a growing burden of deferred costs that manifest in the future.

Considering the significant threats posed by code smells, various aspects of their existence within software have been examined. These aspects include the origins of their introduction [10], their relative severity [11, 12], identification methods [13,14,15,16,17] and their mitigation through refactoring [18]. Among these aspects, the identification of code smells has gained substantial interest due to its causal relationship with the other aspects. According to Fontana et al. [15], previous research on code smell identification can be categorized into two primary groups. The first category comprises rule-based approaches, which rely on metrics and require domain experts. The second category involves machine learning-based approaches, which learn from training data and have the advantage of reducing the cognitive load required by the first category.

In the literature, most of the approaches that fall into the machine learning-based category are designed to identify just one type of code smell within a single software artefact, such as a class or a method. However, in practice, a software artefact typically exhibits multiple code smells simultaneously. The authors of [19] quantified the diffuseness of code smells in software projects and found that 59% of smelly classes are affected by more than one smell. As a result, identifying a single code smell per artefact cannot capture the complexity found in real-world projects. This highlights the importance of addressing the issue from a different perspective and emphasizes the need for approaches that can simultaneously identify multiple code smells within a single artefact.

An important way of framing this problem is multi-label learning [20], a machine learning paradigm capable of assigning multiple labels to a single instance simultaneously. Despite its potential, only a limited number of studies have employed multi-label learning for code smell identification, and they have typically focused on detecting a few types of code smells, either at the class or the method level. However, the field of code smells is far more expansive, encompassing a comprehensive spectrum of distinct types spanning various levels of granularity; Fowler's catalogue alone comprises 22 types [1]. This variety, coupled with the high diffuseness of code smells within projects, accentuates the need to address both their identification and the co-occurrence among them.

In this paper, we target 8 class-level code smells by means of three different multi-label learning methods: problem transformation, ensemble and algorithm adaptation. Specifically, for the problem transformation method, we use Binary Relevance, Classifier Chain, Label PowerSet and HOMER. For the ensemble method, we apply Ensemble of Classifier Chains, Ensemble of Pruned Sets, RAkEL and AdaBoost MH. Lastly, it is worth noting that the algorithm adaptation method, unlike the first two, has not been employed so far for code smell detection; for it, we select BPMLL, BRkNN, IBLR-ML and MLkNN. Our experiments were carried out on 30 open-source Java projects of different sizes and from different domains.

The main contributions of this work are as follows:

  • Explore the co-occurrence of code smells at the class-level to show which code smells frequently appear together.

  • Investigate the importance of correlations between different code smells and how they influence prediction outcomes.

  • Evaluate and compare different multi-label learning methods to determine the most efficient approach for identifying code smells, allowing us to discern whether the results are influenced more by data transformation or by method adaptation.

The remainder of this paper is organized as follows. Section 2 presents related work on learning-based approaches. Section 3 outlines the methodology applied in this study. Section 4 describes the construction of the multi-label dataset. The experimental findings and discussion are presented in Sect. 5. Section 6 addresses potential threats to the validity of our findings. Finally, in Sect. 7, we conclude the paper.

2 Related work

Based on the number of identified code smells, machine learning-based methods can be classified into two categories. The first category involves the detection of one code smell within an artefact at a time, referred to as single code smell identification (SCSI). The second category involves identifying multiple code smells simultaneously within an artefact, known as multi code smell identification (MCSI). Table 1 lists the studies falling in these categories.

The SCSI category has a larger number of works compared to the MCSI category. Kreimer [21] introduced a decision tree-based method for detecting Long Method and Large Class. The method was evaluated on the IYC system and the WEKA package. Khomh et al. [22, 23] extended the DECOR (DEtection & CORrection) approach [13] to handle uncertainty in the detection process. They transformed rule card specifications into BBNs (Bayesian Belief Networks), introducing BDTEX (Bayesian Detection Expert) based on the GQM (Goal Question Metric) technique, which allowed systematic BBN construction without relying on rule cards. GanttProject and Xerces were used to evaluate the detection of Blob, Functional Decomposition and Spaghetti Code. Hassaine et al. [24] applied artificial immune system algorithms, inspired by the human immune system, to detect code smells in GanttProject and Xerces. Oliveto et al. [25] proposed ABS (Anti-pattern identification using B-Splines), a method using interpolation curves and metrics to identify anti-pattern signatures for detecting Blob in Java systems. Maiga et al. [26, 27] introduced a support vector machine-based approach and later expanded it by proposing SMURF, which incorporates practitioner feedback. Both approaches were validated on ArgoUML, Azureus and Xerces.

Fontana et al. [15] employed 16 machine learning algorithms for detecting four code smells. Notably, the authors found that J48 and Random Forest delivered the highest performance, while support vector machines exhibited lower performance. The experiments were conducted on four datasets extracted from a large repository of 74 software systems. These four datasets were subsequently adopted by various works. For instance, the authors of [28] used them to apply six machine learning models along with two feature selection techniques, Chi-squared and Wrapper-based methods, targeting code smells at both class and method levels.

Barbez et al. [29] have introduced SMAD (SMart Aggregation of Anti-patterns Detectors), a method that unifies three detection tools into a single tool using a boosting ensemble model. The outcomes produced by these tools are aggregated into a single vector, which is then employed as input for a Multi-Layered Perceptron.

Compared to the SCSI category, the MCSI category contains significantly fewer studies. Guggulothu et al. [30] explored code smell detection through a multi-label classification approach, applying it to determine whether a method-level element exhibits Long Method and/or Feature Envy. To address this, they employed three techniques derived from the problem transformation method, namely Binary Relevance, Classifier Chains and Label Combination. The authors combined these techniques with different base classifiers and adopted two datasets proposed by Fontana et al. [15] for the experiments. Similarly, Kiyak et al. [31] adopted the four datasets put forth by Fontana et al. [15]. The authors employed problem transformation and ensemble techniques, which were tested in conjunction with different base classifiers, including Decision Tree, Random Forest, Naive Bayes, Support Vector Machine and Neural Network. Besides these machine learning-based approaches, the multi-label problem has also been framed as a search-based, bi-level optimization problem [32]. In another work, the problem has been addressed using deep learning, with the introduction of a Hybrid Model with Multi-Level code representation (HMML) [33]. This model integrates Graph Convolutional Neural Networks and Bi-directional Long Short-Term Memory Networks with an Attention Mechanism to perform multi-label classification of method-level code smells.

Unlike previous works in the MCSI category that identify a limited number of code smells within a code fragment, our research tackles a more extensive set, specifically eight code smells at the class level. To accomplish this, we created a large multi-label dataset derived from 30 open-source projects. Concerning dataset balance, we applied a sampling technique tailored for multi-label datasets. Furthermore, we selected various techniques from the different multi-label methods, notably including algorithm adaptation techniques. Our experiments aim to explore not only code smell identification but also the relationships between the smells. We also assess the impact of the choice of technique by inspecting whether the results depend more on data transformation or on method adaptation.

Table 1 Machine learning-based detection approaches

3 Methodology

In this section, we start by formulating our research questions. Based on these questions, we develop our proposed methodology, which is presented in more detail in the following sub-sections.

Three research questions (RQs) are addressed in this paper:

  • RQ1: Which code smells frequently co-occur in class-level artefacts?

  • RQ2: What is the role of correlation in influencing the outcomes of code smell identification?

  • RQ3: Does the choice of a multi-label learning method significantly impact code smell identification results, and is this influence attributable more to data transformation or to method adaptation?

3.1 Overview of the proposed approach

As illustrated in Fig. 1, our process starts with data retrieval from public repositories. The selected projects undergo statistical analysis and subsequent annotation for the identification of specific code smells. This procedure results in the creation of the dataset. Since our problem is framed as a multi-label learning problem, the dataset takes the form of a multi-label dataset (see Sect. 4). Following that, we conduct a quantitative analysis using a variety of multi-label learning techniques to answer the addressed research questions.

Fig. 1 Overview of the proposed approach

3.2 Statistical analysis

Statistical code analysis involves the extraction of software metrics. These metrics are crucial to capture diverse properties of the source code and serve as quantitative measures of different software aspects. Among the wide range of metrics available in the literature, we primarily focus on the Chidamber and Kemerer (CK) metrics [34]. The CK metrics are a set of well-established software measures that provide a thorough analysis of the codebase. We opted for this metric suite based on the findings of the systematic literature review by Azeem et al. [35], where it was identified as the most commonly used metric suite for code smell detection. The CK metrics cover various aspects, including coupling, cohesion and complexity.

In our research, the CK metrics are used as features in our dataset. To compute them, we employ the CK tool [36], an open-source tool (Footnote 1). Specifically, this tool analyses Java projects through static analysis, working on the source code rather than the compiled version. It provides both class-level and method-level metrics, but since our focus is on class code smells, we employ only the class-level metrics outlined in Table 2 (a usage sketch follows the table).

Table 2 Class-level CK metrics
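For illustration, the following minimal sketch shows how such class-level metrics could be collected programmatically. It assumes the CK library's callback-based API (the CK, CKNotifier and CKClassResult names and getters follow its public documentation, though constructor variants and getter names may differ across versions), and the project path is hypothetical:

```java
import com.github.mauricioaniche.ck.CK;
import com.github.mauricioaniche.ck.CKClassResult;
import com.github.mauricioaniche.ck.CKNotifier;

public class ClassMetricsExtractor {

    public static void main(String[] args) {
        String projectDir = "path/to/java/project";   // hypothetical project path

        // Run CK's static analysis over the source tree; each analysed class
        // is reported back through the notifier callback.
        new CK().calculate(projectDir, new CKNotifier() {
            @Override
            public void notify(CKClassResult result) {
                // Keep only the class-level measures used as dataset features.
                System.out.printf("%s,%d,%d,%d,%d,%d%n",
                        result.getClassName(),
                        result.getWmc(),    // weighted methods per class
                        result.getDit(),    // depth of inheritance tree
                        result.getCbo(),    // coupling between objects
                        result.getRfc(),    // response for a class
                        result.getLcom());  // lack of cohesion of methods
            }

            @Override
            public void notifyError(String sourceFilePath, Exception e) {
                System.err.println("Failed to analyse " + sourceFilePath);
            }
        });
    }
}
```

The tool also offers a command-line interface that exports the same metrics as CSV, which can then be joined with the oracle labels.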

3.3 Annotation of code smells

The annotation of code smells is accomplished through the labelling of true positive samples. The labelling process, also called oracle creation, is a challenging and time-consuming task demanding significant expertise in this specific field. The resulting oracle serves as the basis for assessing the performance of learning models. There are three approaches to performing the labelling task [37]: (i) the manual approach involves experienced developers with considerable knowledge in analysing software design problems, (ii) the tool-based approach relies on existing code smell detection tools, and (iii) the mixed approach combines the first two, where the detection results of tools are subsequently evaluated by developers.

More recently, a systematic literature review of code smell datasets has been conducted [38]. The authors compared the existing datasets according to different factors, including availability, data source, recency and completeness of labelling. Considering these criteria, they selected two datasets as the most comprehensive and adequate code smell datasets:

  • The first dataset, proposed by Palomba et al. [3], involves the labelling of 13 types of code smells at both class and method levels across different releases of 30 open source projects. The labelling approach is mixed, wherein a tool is employed to detect code smells in order to generate a list of potential candidates. Subsequently, these results undergo manual validation to classify them as true or false positives.

  • The second dataset, called MLCQ, is proposed by Madeyski et al. [39]. This dataset focuses on four code smells, comprising two at the class-level and two at the method level. Instances of code smells are extracted from 792 industry-relevant projects. The oracle for this dataset is manually curated by experienced developers with industrial expertise.

Although both datasets are large and heterogeneous, the first one stands out by encompassing a greater diversity of code smells. In our study, the problem is cast as a multi-label learning task aiming to simultaneously identify multiple code smells, which requires a substantial number of labels, i.e., distinct code smell types. The first dataset [3] is therefore particularly advantageous in this context, as its oracle provides a more extensive range of identified code smells, aligning well with our research objectives.

In our study, the selected dataset comprises 13 code smells at both class and method levels. We retained the 8 class-level smells that pertain to individual code fragments, excluding the method-level smells and the smells whose definition involves other classes. Further details of the selected smells can be found in Table 3.

Table 3 Description of subject class code smells

3.4 Multi-label learning methods

Learning from multi-label data can be achieved using various methods, including problem transformation, ensemble and algorithm adaptation methods [40]. Each method involves distinct techniques for its application. In our study, we have used all three methods and for each, we have selected four different techniques.

  • Problem Transformation Method (PTM):

    In the problem transformation method, a multi-label problem is converted into one or more single-label classification tasks [20, 41]. This transformation allows the application of traditional supervised machine learning classifiers which are originally designed for single-label problems. The outputs of these single-label classifiers are then aggregated to address the objective of the initial multi-label classification problem. For this method, we have selected four different techniques:

    • Binary Relevance (BR) [20]: The multi-label dataset is decomposed into multiple binary datasets, where each corresponds to one label. Next, a single-label learning algorithm is employed to address each individual binary dataset.

    • Classifier Chain (CC) [40]: It operates by connecting binary classifiers in a chain in order to exploit label correlations. Each binary classifier receives the previously predicted labels as supplementary features.

    • Label PowerSet (LP) [42]: Also known as Label Combination, it transforms a multi-label dataset into a multi-class dataset. It creates a new class for each distinct combination of labels, treating each combination as a unique class in a multi-class problem.

    • Hierarchy Of Multi-label classifiERs (HOMER) [43]: It constructs a hierarchy of multi-label classifiers, where each node deals with a smaller subset of labels obtained by recursively partitioning the label set.

    The primary distinction among these four problem transformation techniques lies in whether they preserve label correlation: Binary Relevance is the only one that does not. A toy sketch of the Binary Relevance and Label PowerSet transformations is given after this list.

  • Ensemble Method (EM):

    The ensemble method involves the combination of several classifiers [44]. The selected techniques are:

    • Ensemble of Classifier Chains (ECC) [40]

    • Ensemble of Pruned Sets (EPS) [45]

    • RAndom k-labELsets (RAkEL) [44]

    • AdaBoost MH [46]

  • Algorithm Adaptation Method (AAM):

    This method adapts existing single-label classification algorithms to directly handle multiple labels [20, 41]. Unlike the problem transformation method, the algorithm adaptation is classifier dependent. The selected algorithms belong to different learning families:

    • Back-Propagation Multi-Label Learning (BPMLL) [47]

    • BRkNN [48]: Binary Relevance implementation of the k Nearest Neighbours algorithm

    • Instance-Based Learning by Logistic Regression-ML (IBLR-ML) [49]

    • Multi-Label k Nearest Neighbours (MLkNN) [50]
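To make the difference between these transformation strategies concrete, the following minimal, self-contained Java sketch shows how Binary Relevance and Label PowerSet would reshape the same toy label matrix; the smell names and label values are purely illustrative and independent of our dataset:

```java
import java.util.*;

public class TransformationSketch {

    // Toy label matrix: each row is a class instance, each column a code smell
    // (illustrative values only).
    static final int[][] LABELS = {
            {1, 1, 0},
            {0, 1, 1},
            {1, 1, 0},
            {0, 0, 0}
    };
    static final String[] SMELLS = {"LargeClass", "ComplexClass", "SpaghettiCode"};

    public static void main(String[] args) {
        // Binary Relevance: one independent binary target per label.
        for (int l = 0; l < SMELLS.length; l++) {
            List<Integer> binaryTarget = new ArrayList<>();
            for (int[] row : LABELS) binaryTarget.add(row[l]);
            System.out.println("BR dataset for " + SMELLS[l] + ": " + binaryTarget);
        }

        // Label PowerSet: each distinct label combination becomes one class
        // of a single multi-class problem.
        Map<String, Integer> classOf = new LinkedHashMap<>();
        List<Integer> multiClassTarget = new ArrayList<>();
        for (int[] row : LABELS) {
            String combination = Arrays.toString(row);
            classOf.putIfAbsent(combination, classOf.size());
            multiClassTarget.add(classOf.get(combination));
        }
        System.out.println("LP classes: " + classOf);
        System.out.println("LP dataset target: " + multiClassTarget);
    }
}
```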

4 Dataset construction

In this section, we describe the construction of the multi-label dataset, encompassing project details and the dataset generation process. Subsequently, we extract the dataset characteristics given their importance in the experiments.

4.1 Generation of multi-label dataset

As mentioned in Sect. 3.3, we will use 30 open-source Java projects from the selected dataset. These projects are heterogeneous as they vary in size and belong to diverse application domains. They can be downloaded from GitHub (Footnote 2) and SourceForge (Footnote 3). The complete list of these projects is provided in Table 4.

Table 4 Description of 30 open-source projects

As illustrated in Fig. 2, the process begins with each project undergoing statistical analysis using the CK tool. Classes are extracted as dataset instances and CK metrics are computed for each of them. The labelling is then conducted based on the adopted oracle, where each of the eight code smell types corresponds to a label. After processing all projects and code smells, binary datasets are obtained in which each instance has a single label. These binary datasets are then merged per code smell type across all projects and ultimately transformed into one unified multi-label dataset (MLD), where each instance is associated with 8 labels representing code smells (a minimal sketch of this merging step is given after Fig. 2).

Fig. 2 Construction of multi-label dataset
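As a minimal illustration of the merging step (with only two smells and three classes for brevity, all names illustrative), the sketch below turns per-smell binary oracles into one label vector per class instance:

```java
import java.util.*;

public class DatasetMerger {

    public static void main(String[] args) {
        // Per-smell binary oracles: smell name -> set of class names flagged
        // as smelly (illustrative content only).
        Map<String, Set<String>> binaryOracles = new LinkedHashMap<>();
        binaryOracles.put("GodClass", new HashSet<>(List.of("org.foo.Engine")));
        binaryOracles.put("ComplexClass",
                new HashSet<>(List.of("org.foo.Engine", "org.foo.Parser")));
        // ... one entry per targeted smell, built from the adopted oracle

        // Every analysed class becomes one multi-label instance whose label
        // vector has one binary position per smell.
        for (String className : List.of("org.foo.Engine", "org.foo.Parser", "org.foo.Util")) {
            List<Integer> labelVector = new ArrayList<>();
            for (Set<String> smelly : binaryOracles.values()) {
                labelVector.add(smelly.contains(className) ? 1 : 0);
            }
            System.out.println(className + " -> " + labelVector);
        }
    }
}
```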

4.2 Characteristics of multi-label dataset

There exist various metrics that capture specific characteristics of a multi-label dataset. These metrics provide important information, such as label distribution, inter-label relationships and the level of imbalance [20, 51], which serves as crucial criteria for the subsequent experimental steps. The most important metrics include (a small computation sketch is given after the list):

  • Cardinality assesses the average number of active labels \(\left| Y_i \right|\) per sample, where D is the dataset and N is the number of instances.

    $$\begin{aligned} Card(D) = \frac{1}{N}\sum _{i=1}^{N}\left| Y_i \right| \end{aligned}$$
    (1)
  • Density is the cardinality divided by the number of labels \(\left| \mathcal {L} \right|\).

    $$\begin{aligned} Dens(D) = \frac{Card(D)}{\left| \mathcal {L} \right| } = \frac{1}{\left| \mathcal {L}\right| }\frac{1}{N}\sum _{i=1}^{N}\left| Y_i \right| \end{aligned}$$
    (2)
  • Mean Imbalance Ratio (MeanIR) describes the mean ratio of imbalance among the labels \(\mathcal {L}\). The higher the value is, the more imbalanced the MLD is.

    $$\begin{aligned} MeanIR = \frac{1}{\left| \mathcal {L} \right| } \sum _{l \in \mathcal {L}}^{}IRLbl(l) \quad where \quad IRLbl(l) =\frac{\underset{l' \in \mathcal {L}}{max}(\sum _{i=1}^{N}\llbracket l' \in Y_i \rrbracket )}{\sum _{i=1}^{N}\llbracket l \in Y_i \rrbracket } \end{aligned}$$
    (3)
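The following minimal sketch computes these three characteristics for a toy label matrix, mirroring Eqs. (1)-(3); the matrix values are purely illustrative:

```java
public class MldCharacteristics {

    public static void main(String[] args) {
        // Toy label matrix: rows = instances, columns = labels (illustrative only).
        int[][] y = {
                {1, 1, 0},
                {0, 1, 1},
                {1, 1, 0},
                {0, 1, 0}
        };
        int n = y.length, numLabels = y[0].length;

        // Cardinality: average number of active labels per instance (Eq. 1).
        double active = 0;
        for (int[] row : y) for (int v : row) active += v;
        double card = active / n;

        // Density: cardinality normalised by the number of labels (Eq. 2).
        double dens = card / numLabels;

        // MeanIR: mean over labels of IRLbl, i.e. the frequency of the most
        // frequent label divided by the frequency of the current label (Eq. 3).
        int[] counts = new int[numLabels];
        for (int[] row : y) for (int l = 0; l < numLabels; l++) counts[l] += row[l];
        int maxCount = 0;
        for (int c : counts) maxCount = Math.max(maxCount, c);
        double meanIr = 0;
        for (int c : counts) meanIr += (double) maxCount / c;
        meanIr /= numLabels;

        System.out.printf("Card=%.2f Dens=%.2f MeanIR=%.2f%n", card, dens, meanIr);
    }
}
```

For this toy matrix, Card = 1.75, Dens ≈ 0.58 and MeanIR ≈ 2.33, which would already exceed the 1.5 imbalance threshold discussed below.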

To calculate these metrics, we used the mldr package [52]. As shown in Table 5, the MLD comprises 23 distinct label sets, i.e. observed label combinations. According to Charte et al. [51], an MLD is considered imbalanced if its MeanIR exceeds 1.5. By this criterion, our MLD is imbalanced, meaning that some labels have a high frequency while others are under-represented. To deal with this, we applied the Multilabel Synthetic Minority Over-sampling Technique (MLSMOTE) [53], a multi-label oversampling algorithm that generates synthetic instances from randomly chosen instances containing minority labels and their nearest neighbours.

Table 5 Characteristics of MLD

5 Experiments and results

In this section, we present the results and discuss the research questions that have been addressed. Before doing so, we provide the context of the experimentation by presenting the experimental set-up and the evaluation metrics.

5.1 Experimental settings

In our experimentation, we applied the three multi-label learning methods: PTM, EM and AAM. Within PTM, we selected BR, CC, LP and HOMER. For EM, we employed ECC, EPS, RAkEL and AdaBoost MH. For AAM, we chose BPMLL, BRkNN, IBLR-ML and MLkNN. Some of these techniques require a base classifier, for which we opted for Random Forest. Our selection is motivated by two key factors. Firstly, Random Forest yielded significant results in [30], particularly in effectively detecting both Long Method and Feature Envy. Secondly, we aimed to explore diverse multi-label techniques built upon the same base classifier, allowing a fair comparison among them.

The implementation of these techniques was accomplished using MULAN 1.5.0 [54], a Java library designed for learning from multi-label data. MULAN is built on top of the WEKA library [55] and provides a diverse range of classification and ranking algorithms. For validation, we used a 5-fold cross-validation approach, where the dataset is split into five folds: four folds are used for training and the remaining fold for testing.
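As an illustration of this set-up, the sketch below instantiates one representative technique from each method category on top of WEKA's Random Forest and runs the 5-fold cross-validation through MULAN. It is a minimal sketch rather than our exact experimental harness: the ARFF/XML file names are placeholders, and the constructor parameters (number of chains in ECC, number of neighbours and smoothing in MLkNN) are assumptions shown with typical values.

```java
import mulan.classifier.MultiLabelLearner;
import mulan.classifier.lazy.MLkNN;
import mulan.classifier.transformation.BinaryRelevance;
import mulan.classifier.transformation.EnsembleOfClassifierChains;
import mulan.data.MultiLabelInstances;
import mulan.evaluation.Evaluator;
import mulan.evaluation.MultipleEvaluation;
import weka.classifiers.trees.RandomForest;

public class ExperimentSketch {

    public static void main(String[] args) throws Exception {
        // Multi-label dataset: features in ARFF, label definitions in XML
        // (file names are illustrative placeholders).
        MultiLabelInstances mld =
                new MultiLabelInstances("codesmells.arff", "codesmells.xml");

        // One technique per method category; transformation-based techniques
        // are built on the same Random Forest base classifier.
        MultiLabelLearner[] learners = {
                new BinaryRelevance(new RandomForest()),                           // PTM
                new EnsembleOfClassifierChains(new RandomForest(), 10, true, true), // EM
                new MLkNN(10, 1.0)                                                  // AAM
        };

        // 5-fold cross-validation, as in the experimental settings.
        for (MultiLabelLearner learner : learners) {
            MultipleEvaluation results = new Evaluator().crossValidate(learner, mld, 5);
            System.out.println(learner.getClass().getSimpleName() + ":\n" + results);
        }
    }
}
```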

5.2 Evaluation metrics

The assessment of multi-label learning techniques involves distinct metrics compared to single-label learning. Because each sample is associated with multiple labels simultaneously, evaluating performance in multi-label learning is more complex; the metrics fall into two broad categories: example-based and label-based metrics [56]. Example-based metrics average the differences between actual and predicted label sets across the samples in the dataset. This category includes two sub-categories: classification metrics (SubsetAccuracy, HammingLoss, \(F-Measure\), Accuracy) and ranking metrics (Coverage, AveragePrecision, RankingLoss). In the second category, label-based metrics, the performance for each label is calculated individually and then averaged over all labels (Macro/MicroAveraging) [20, 57]. In the equations, for a given instance (\(x_{i}\)), (\(Z_{i}\)) denotes the set of predicted labels and (\(Y_{i}\)) denotes the set of actual labels. The total number of instances and the total number of labels are denoted by (N) and (\(\left| \mathcal {L} \right|\)), respectively. A small numerical sketch of the example-based classification metrics is given at the end of this sub-section.

  • Example-based classification metrics

    • HammingLoss (\(\searrow\)) is the size of the symmetric difference (\(\Delta\)) between the predicted (\(Z_{i}\)) and actual (\(Y_{i}\)) label sets, averaged over the total number of labels (\(\left| \mathcal {L} \right|\)) and the total number of instances (N). Lower HammingLoss indicates better performance.

      $$\begin{aligned} HammingLoss = \frac{1}{N}\frac{1}{\left| \mathcal {L}\right| }\sum _{i=1}^{N}{\left| {Y}_{i} \Delta Z_{i} \right| } \end{aligned}$$
      (4)
    • SubsetAccuracy (\(\nearrow\)), also called ExactMatchRatio, is one of the strictest evaluation measures. It is the proportion of samples, across all samples, whose predicted label set exactly matches the actual label set.

      $$\begin{aligned} Subset Accuracy = \frac{1}{N}\sum _{i=1}^{N} \llbracket {Y}_{i} = Z_{i} \rrbracket \end{aligned}$$
      (5)
    • \(F-Measure\) (\(\nearrow\)) represents the harmonic mean of precision and recall. Precision is the ratio of correctly predicted labels to the total number of predicted labels, while recall is the ratio of correctly predicted labels to the total number of actual labels, both averaged across all instances.

      $$\begin{aligned} F-measure = 2 \ * \ \frac{Precision \ * \ Recall}{Precision \ + \ Recall} \end{aligned}$$
      (6)
    • Accuracy (\(\nearrow\)) is the ratio of the intersection to the union of the predicted and actual label sets, averaged over all instances.

      $$\begin{aligned} Accuracy = \frac{1}{N}\sum _{i=1}^{N}\frac{\left| {Y}_{i} \cap Z_{i} \right| }{\left| {Y}_{i} \cup Z_{i} \right| } \end{aligned}$$
      (7)
  • Example-based ranking metrics

    • Coverage (\(\searrow\)) measures how far down the ranked list of predicted labels one needs to go, on average, to cover all the true labels of an instance. A lower coverage value indicates better performance.

      $$\begin{aligned} Coverage = \frac{1}{N}\sum _{i=1}^{N}\underset{y\in Y_i}{\max }\,(rank(x_i,y))-1 \end{aligned}$$
      (8)
    • AveragePrecision (\(\nearrow\)) computes, for each relevant label, the proportion of relevant labels ranked at or above it, and averages this value over all relevant labels and all instances.

      $$\begin{aligned} Average Precision = \frac{1}{N}\sum _{i=1}^{N}\frac{1}{\left| Y_i \right| }\sum _{y\in Y_i}^{}\frac{\left| \left\{ y'|rank(x_i,y') \leqslant rank(x_i,y),y' \in Y_i \right\} \right| }{rank(x_i,y)} \end{aligned}$$
      (9)
    • RankingLoss (\(\searrow\)) measures the proportion of label pairs that are ordered incorrectly, i.e. an irrelevant label is ranked above a relevant one.

      $$\begin{aligned} RLoss = \frac{1}{N}\sum _{i=1}^{N}\frac{1}{\left| Y_i \right| \left| \overline{Y_i} \right| }\left| \left\{ (y_a,y_b):rank(x_i,y_a)>rank(x_i,y_b),\ (y_a,y_b)\in Y_i \times \overline{Y_i} \right\} \right| \end{aligned}$$
      (10)
  • Label-based metrics

    • \(Micro/Macro \quad averaging\) (\(\nearrow\)): the macro approach computes the metric for each label and averages the values over all labels, while the micro approach aggregates the TP, TN, FP and FN counts over all labels and instances and then computes the measure once from these aggregated counts.

      $$\begin{aligned} F1-Macro = \frac{1}{\left| \mathcal {L} \right| }\sum _{l\in \mathcal {L}}^{} F1(TP_l, FP_l, TN_l, FN_l) \end{aligned}$$
      (11)
      $$\begin{aligned} F1-Micro = F1 \left( \sum _{l \in \mathcal {L}}^{}TP_{l}, \sum _{l \in \mathcal {L}}^{}FP_{l}, \sum _{l \in \mathcal {L}}^{}TN_{l}, \sum _{l \in \mathcal {L}}^{}FN_{l} \right) \end{aligned}$$
      (12)
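As a worked illustration of the example-based classification metrics, the following sketch evaluates toy predictions for three instances and four labels following Eqs. (4)-(7); here, precision, recall and the F-measure are computed per instance and then averaged, which is one common formulation of the example-based F-measure:

```java
public class ExampleBasedMetrics {

    public static void main(String[] args) {
        // Toy actual (Y) and predicted (Z) label sets for N=3 instances and
        // |L|=4 labels (illustrative only).
        int[][] y = { {1, 0, 1, 0}, {0, 1, 0, 0}, {1, 1, 0, 1} };
        int[][] z = { {1, 0, 0, 0}, {0, 1, 0, 0}, {1, 0, 0, 1} };
        int n = y.length, numLabels = y[0].length;

        double hamming = 0, subset = 0, accuracy = 0, fMeasure = 0;
        for (int i = 0; i < n; i++) {
            int diff = 0, inter = 0, union = 0, actual = 0, predicted = 0;
            for (int l = 0; l < numLabels; l++) {
                if (y[i][l] != z[i][l]) diff++;
                if (y[i][l] == 1 && z[i][l] == 1) inter++;
                if (y[i][l] == 1 || z[i][l] == 1) union++;
                actual += y[i][l];
                predicted += z[i][l];
            }
            hamming += (double) diff / numLabels;                  // Eq. (4), per instance
            subset += (diff == 0) ? 1 : 0;                         // Eq. (5)
            accuracy += union == 0 ? 1 : (double) inter / union;   // Eq. (7)
            double precision = predicted == 0 ? 0 : (double) inter / predicted;
            double recall = actual == 0 ? 0 : (double) inter / actual;
            fMeasure += (precision + recall) == 0 ? 0
                    : 2 * precision * recall / (precision + recall); // Eq. (6)
        }
        System.out.printf("HammingLoss=%.3f SubsetAccuracy=%.3f Accuracy=%.3f F=%.3f%n",
                hamming / n, subset / n, accuracy / n, fMeasure / n);
    }
}
```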

5.3 Co-occurrence of code smells at class-level

To address the first research question RQ1, we use a chord diagram to offer a comprehensive view of potential co-occurrences among two or more code smells at the class level. Chord diagrams visually connect entities through chords, providing a graphical representation of relationships. It is important to note that the thickness of a chord reflects the frequency of co-occurrence between code smells: thicker chords denote more frequent co-occurrences, while thinner chords indicate less common associations.

In Fig. 3, the chord diagram depicts 8 entities, i.e. class code smells, interconnected by chords that illustrate the strength of their relationships. Notably, our analysis reveals that, on average, half of the instances affected by Spaghetti Code co-occur with Large Class, while the remaining half co-occur with Complex Class. We also found that approximately 35% of smelly Complex Class instances co-occur with Spaghetti Code, while others are associated with Message Chain and the remainder with Large Class. Through our analysis, three prominent and recurrent co-occurrences have emerged: {Complex Class and Large Class}, {Spaghetti Code, Large Class and Complex Class} and {Spaghetti Code and Large Class}. A sketch of how such co-occurrence counts can be derived from the label matrix is given at the end of this sub-section.

Fig. 3 Chord diagram of class code smells co-occurrences

The presence of the pair {Complex Class and Large Class} indicates a consistent relationship, implying that when Complex Class is present in a class, the likelihood of Large Class being present is notably high. The frequent co-occurrence of these two smells often stems from their inherent software design characteristics. For instance, a Complex Class, characterized by its high cyclomatic complexity, may tend to assume multiple responsibilities, thereby giving rise to the appearance of the Large Class smell. The additional occurrences of {Spaghetti Code, Large Class and Complex Class} and {Spaghetti Code and Large Class} highlight important patterns for developers to consider during maintenance activities. This may suggest that when a class is affected by the first co-occurrence of Large Class and Complex Class, it tends to lead to the emergence of lengthy methods, thereby elevating the overall complexity of the class. This, in turn, may ultimately pave the way for the emergence of Spaghetti Code. Instances where an artefact is affected by more than one code smell are considered critical, emphasizing the need for a high priority in the refactoring process.
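The chord widths in Fig. 3 are derived from how often pairs of smells are present in the same instance. The following minimal sketch shows how such pairwise co-occurrence counts can be obtained from a label matrix; the smell names and values are purely illustrative:

```java
public class CoOccurrenceCounter {

    public static void main(String[] args) {
        // Toy label matrix: rows = classes, columns = smells (illustrative only).
        String[] smells = {"LargeClass", "ComplexClass", "SpaghettiCode"};
        int[][] y = { {1, 1, 0}, {1, 1, 1}, {1, 0, 1}, {0, 1, 0} };

        // Pairwise co-occurrence counts: the (a, b) entry is the number of
        // instances in which both smell a and smell b are present.
        for (int a = 0; a < smells.length; a++) {
            for (int b = a + 1; b < smells.length; b++) {
                int count = 0;
                for (int[] row : y) {
                    if (row[a] == 1 && row[b] == 1) count++;
                }
                System.out.println(smells[a] + " & " + smells[b] + ": " + count);
            }
        }
    }
}
```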

5.4 Role of correlation between code smells in prediction results

The presence or absence of correlations between specific code smells can significantly impact the code smell identification task. Multi-label learning algorithms may either preserve or ignore these correlations during the detection process. To address RQ2, our investigation therefore involves a comparative analysis, based on various evaluation metrics, between algorithms that preserve correlations and those that do not.

Tables 6, 7 and 8 present the identification results grouped by evaluation metric category. Table 6 focuses on example-based classification metrics. Among these, Subset Accuracy is the strictest metric; ECC is the top-performing algorithm, closely followed by BR, with values above 0.96. Similarly, for the Accuracy and F-measure metrics, ECC and BR demonstrate superior performance. For Hamming Loss, where lower values approaching zero are desired, BR outperforms ECC.

Table 6 Example-based classification results

Table 7 presents the evaluation of the algorithms using the three example-based ranking metrics. Lower values of Coverage and Ranking Loss indicate better performance. Notably, the ensemble technique ECC provides the top performance, followed by the algorithm adaptation technique BPMLL. For Average Precision, the problem transformation technique CC and its ensemble version exhibit superior performance, followed by BR. Across all three evaluation metrics, AdaBoost MH is ranked as the lowest-performing algorithm.

Table 7 Example-based ranking results

In Table 8, in contrast to the results in Tables 6 and 7, the results show a decrease for both macro and micro averaging metrics. This shift occurs because label-based metrics are computed for each label rather than each instance. Regarding micro-averaged F-measure, the top three algorithms are associated with the ensemble method: EPS, followed by ECC and RAkEL. On the other hand, for macro-averaged F-measure, RAkEL takes the first place, followed by CC and HOMER. Across all metrics in both macro and micro averaging, AdaBoost MH consistently ranks as the lowest-performing algorithm.

Table 8 Label-based results

To establish a comprehensive comparison, we employed the Friedman test and the post-hoc Nemenyi test [58] to evaluate the overall performance of all algorithms across the metric categories. The Friedman test determines whether significant differences exist among the algorithms based on their rankings, while the Nemenyi post-hoc test identifies statistically significant differences between pairs of algorithms (a sketch of the average-rank computation underlying these tests is given at the end of this sub-section). As depicted in Fig. 4, the average ranking of each technique within every metric category is used to determine which classifiers perform better than others. The statistical significance test highlights that ECC, BR and CC are the top-ranked algorithms, while AdaBoost MH is consistently ranked as the lowest-performing algorithm across all evaluation categories. In summary, addressing RQ2 regarding the potential impact of the considered correlations on algorithm performance, our analysis suggests that the correlations taken into account by the algorithms do not have a substantial effect on their overall performance.

Fig. 4 Average rank diagrams for (a) example-based classification metrics, (b) example-based ranking metrics and (c) label-based metrics
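For illustration, the sketch below shows how per-setting ranks can be averaged per algorithm and plugged into the Friedman statistic that precedes the Nemenyi test; the score matrix is purely illustrative, higher scores are assumed to be better, and ties are not handled:

```java
import java.util.Arrays;

public class AverageRanks {

    public static void main(String[] args) {
        // Toy score matrix: rows = evaluation settings (e.g. metrics or folds),
        // columns = algorithms; values are illustrative only.
        String[] algorithms = {"ECC", "BR", "CC", "AdaBoostMH"};
        double[][] scores = {
                {0.97, 0.96, 0.95, 0.80},
                {0.95, 0.96, 0.94, 0.78},
                {0.96, 0.95, 0.93, 0.81}
        };
        int n = scores.length, k = algorithms.length;

        // Rank algorithms within each setting (rank 1 = best) and average.
        double[] avgRank = new double[k];
        for (double[] row : scores) {
            Double[] sorted = Arrays.stream(row).boxed()
                    .sorted((a, b) -> Double.compare(b, a)).toArray(Double[]::new);
            for (int j = 0; j < k; j++) {
                avgRank[j] += Arrays.asList(sorted).indexOf(row[j]) + 1;
            }
        }
        for (int j = 0; j < k; j++) avgRank[j] /= n;

        // Friedman statistic: chi2 = 12N/(k(k+1)) * (sum R_j^2 - k(k+1)^2/4).
        double sumSq = 0;
        for (double r : avgRank) sumSq += r * r;
        double chi2 = 12.0 * n / (k * (k + 1)) * (sumSq - k * (k + 1) * (k + 1) / 4.0);

        for (int j = 0; j < k; j++)
            System.out.printf("%s: avg rank %.2f%n", algorithms[j], avgRank[j]);
        System.out.printf("Friedman chi-square = %.3f%n", chi2);
    }
}
```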

5.5 Comparing multi-label learning methods

The last question, RQ3, focuses on another factor that potentially influences the results, taking a higher level of abstraction than the previous question. Instead of focusing on individual algorithms, RQ3 centers on the broader method categories to which these algorithms belong. As mentioned earlier, the algorithms fall into three categories: PTM, EM and AAM. In the first two, PTM and EM, the multi-label dataset undergoes a transformation to suit the problem, while in AAM, the algorithm itself is adapted to operate directly on the multi-label dataset. To conduct a comprehensive comparison, boxplots are employed across all evaluation metrics, as illustrated in Fig. 5.

The boxplot diagrams for the example-based classification and label-based metrics show close distributions, with PTM slightly outperforming EM in terms of variance reduction, followed by AAM. In terms of subset accuracy, F-measure and accuracy, the mean value achieved by PTM is 0.96, whereas AAM attains 0.94. Regarding Hamming loss, where lower values indicate better performance, there is a notable distinction: PTM yields a mean value of 0.005425, whereas AAM stands at 0.007325. For the example-based ranking metrics, however, PTM and EM exhibit the largest interquartile ranges compared to AAM. Specifically, for the Coverage measure, the values of EM range between 0.0173 and 0.105, while those of AAM fall between 0.0309 and 0.0607. The same observations made for Coverage also apply to Ranking Loss and Average Precision. Notably, the techniques within the AAM category show a narrower range, reflecting less variability than the other two methods.

Thus, for RQ3, our findings suggest that the choice of a multi-label learning method can impact the results: the problem transformation and ensemble methods demonstrate better results on example-based classification and label-based metrics but lower results on example-based ranking metrics compared to the algorithm adaptation method.

Fig. 5 Boxplots for comparing multi-label learning methods

6 Threats to validity

In this section, we discuss potential threats to the validity of our study.

  • Construct Validity: The construction of the selected oracle combines automated and manual processes. A tool identifies potential code smells, generating a candidate list, which is then subjected to manual validation. While the presence of false positives and false negatives in the oracle cannot be ruled out, this oracle is widely recognized in the literature as a well-established one for the code smell identification problem.

  • Internal Validity: In our work, the multi-label dataset exhibits a considerable Mean Imbalance Ratio, a common issue in multi-label learning due to label distribution. To address this, we implemented MLSMOTE, a Synthetic Minority Over-sampling Technique designed for multi-label learning, which has reduced the imbalance ratio.

  • External Validity: Our experiments are carried out on 30 Java open-source projects, limiting the generalizability of our findings to other programming languages or industrial projects. Further research is needed to explore this potential limitation.

7 Conclusion

In this study, we presented a multi-label learning-based approach to identify eight class code smells across a diverse set of 30 open-source Java projects. Employing 12 different algorithms, with four selected from each of the three existing multi-label learning methods, our investigation delved into the co-occurrence of code smells at the class-level. Our analysis revealed three significant and recurring co-occurrences: {Complex Class and Large Class}, {Spaghetti Code, Large Class and Complex Class} and {Spaghetti Code and Large Class}.

We further explored the influence of correlations between various code smells on prediction outcomes. Across different evaluation metrics spanning diverse categories, we found that ECC, BR and CC emerged as the top-ranked algorithms showing that the consideration of correlations by the algorithms did not significantly impact their overall performance.

Additionally, our investigation extended to the evaluation and comparison of different multi-label learning methods, aiming to discern the effect of data transformation versus method adaptation on identification results. The results suggest that the choice of a multi-label learning method can indeed impact the outcomes: the problem transformation and ensemble methods exhibit superior performance on example-based classification and label-based metrics but lower results on example-based ranking metrics compared to the algorithm adaptation method.

For future work, we aim to broaden our research by incorporating other types of code smells and applying the approach to different programming languages. Furthermore, we plan to utilize our research findings to create a recommendation system for prioritizing code refactoring operations.