1 Introduction

Software systems are becoming increasingly large and complex over time, and maintaining such systems is enormously challenging for software professionals. Some software organizations may be unable to begin new projects because most of their resources are devoted to maintaining old systems. Software maintainability prediction has therefore become an important area in software engineering. It focuses on the design and development of prediction models that forecast software maintainability while the software is still in the early stages of its development.

Knowing the high-maintainability effort classes in advance helps an organization allocate its limited resources optimally to these classes, resulting in good-quality, highly maintainable software delivered within time and budget. Over the years, there has been debate on how to measure software maintainability. Software maintainability has been viewed as a software quality attribute and defined according to numerous facets. According to Coleman et al. (1994), maintainability is defined as “the ease with which a software component can be modified to correct the existing faults after delivery.” Aggarwal et al. (2002) suggested that maintainability is an integrated measure of software characteristics such as the readability of source code, software understandability, and quality of documentation. According to Schneberger (1997), software maintainability is the degree of difficulty in understanding and performing changes in the software. Oman and Hagemeister (1994) proposed the maintainability index (MI) to assess and quantify maintainability. The MI is a polynomial equation computed from software metrics that describe characteristics such as the average Halstead volume of a program, the average cyclomatic complexity, the average number of lines of source code, and the average number of comment lines. Evaluating the polynomial yields a single number that indicates maintainability: the lower the MI value, the lower the maintainability of the software, and vice versa (Oman and Hagemeister 1994). Later, Ash et al. (1994) and Coleman et al. (1995) revised the MI proposed by Oman and Hagemeister (1994) and validated it on software written in procedural programming languages such as Pascal, C, Ada, and Fortran. However, the validity of MI for software systems implemented in object-oriented (OO) languages has not been established to the same extent. Li and Henry (1993) defined software maintainability as the number of lines of source code changed during the maintenance period to correct faults. Their study advocated that software maintainability has a strong correlation with OO metrics describing software characteristics such as inheritance, coupling, and cohesion. Later, various researchers (Van Koten and Gray 2006; Zhou and Leung 2007; Thwin and Quah 2005) measured maintainability as the number of lines of source code changed during maintenance: the more changes a class undergoes in the maintenance phase, the more maintenance effort it requires, and vice versa. The ultimate goal of developing these models is to accurately predict the software classes that require high maintainability effort. In the literature, various maintainability prediction models have been developed with statistical, ML, evolutionary, and hybridized techniques, using software metrics as predictors (Wang et al. 2009; Dagpinar and Jahnke 2003; Kumar and Rath 2015; Malhotra and Lata 2017). High-maintainability effort classes are critical for any project because they must be tested carefully to decrease the probability of faults. Such classes should also be well documented to improve understandability for future maintenance activities.
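For illustration only, one commonly cited later form of the MI (the exact coefficients vary between the original proposal and its revisions, so this equation is indicative rather than the one used in the cited studies) is

$$\mathrm{MI} = 171 - 5.2\ln(\overline{V}) - 0.23\,\overline{V(g)} - 16.2\ln(\overline{\mathrm{LOC}}) + 50\sin\!\left(\sqrt{2.4\,\overline{\mathrm{CM}}}\right)$$

where $\overline{V}$ is the average Halstead volume per module, $\overline{V(g)}$ the average cyclomatic complexity, $\overline{\mathrm{LOC}}$ the average lines of code, and $\overline{\mathrm{CM}}$ the average percentage of comment lines.
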
Software metrics ranging from procedural metrics such as the number of unique operators, the number of unique operands, and the cyclomatic complexity of a module (McCabe 1976; Halstead 1977; Schneidewind 1979) to OO metrics (Chidamber and Kemerer 1994; Henderson-Sellers 1996; Martin 2002) characterize and quantify various aspects of software systems and play a vital role in model development. The dataset used for training maintainability prediction models should contain sufficient instances of both high- and low-maintainability effort classes to train the model effectively. In reality, however, only a few software classes demand complex interventions during maintenance, resulting in more changes to the code lines, i.e., high maintainability effort. Therefore, there is an imbalance between the number of instances requiring high maintainability effort (minority class) and low maintainability effort (majority class), resulting in an imbalanced dataset. Using such imbalanced data, it is challenging to train prediction models that predict unseen data points of these classes with reasonable accuracy.

Therefore, this study is important because it deals with the development of effective SMP models that treat imbalanced datasets to predict high-maintainability effort classes accurately. Identification of high-maintainability effort classes is crucial because these classes need more attention during the software maintenance and testing phases, as they are likely to be sources of defects and future enhancements (Eski and Buzluca 2011). Appropriate distribution of resources to these classes helps enhance the quality of the software product. However, with imbalanced datasets, many ML techniques encounter enormous trouble (Chawla et al. 2004; Fawcett and Provost 1997; Kubat et al. 1998), and the prediction models obtain high prediction accuracy only for the majority class rather than for both classes. Software maintainability prediction models developed using imbalanced data have little practical significance because they may misclassify minority class (high-maintainability effort) instances. Such misclassification may lead to improper resource allocation to the misclassified classes, resulting in poor-quality software products. Accurate early prediction of high-maintainability effort classes before the software product is released helps software professionals test these classes critically. Software developers can also effectively refactor such classes to improve their maintainability.

In the software engineering domain, the imbalanced class problem has been addressed to build competent models for predicting faulty and change-prone classes (Malhotra and Khanna 2017; Choeikiwong and Vateekul 2015; Gao et al. 2015). However, no study has dealt with the imbalanced class problem in SMP. Therefore, to treat the imbalanced data problem in SMP, this study applies various data resampling methods, including oversampling, undersampling, and hybrid resampling, before learning the SMP models to improve their performance.

Data resampling: Data resampling methods modify the training dataset so that it includes a sufficient number of data points of both the minority and majority classes. These methods include oversampling, undersampling, and hybrid resampling. In oversampling techniques, new data points of the rare or minority class are produced so that the dataset contains a proportionate number of instances of the minority and majority classes. Undersampling methods work by removing some data points of the majority class to make the dataset proportionate. Hybrid resampling combines the oversampling and undersampling strategies (Kotsiantis et al. 2006).
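As an illustration of these three families, the following minimal Python sketch uses the imbalanced-learn package (the study itself used the KEEL tool); the toy dataset and its 9:1 class weights are assumptions chosen only for demonstration.

```python
# Minimal sketch of the three resampling families with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE                 # oversampling
from imblearn.under_sampling import RandomUnderSampler   # undersampling
from imblearn.combine import SMOTEENN                    # hybrid resampling

# Toy imbalanced dataset standing in for OO-metric features (X) and the
# binary maintainability-effort label (y); roughly a 9:1 imbalance ratio.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("original:", Counter(y))

for name, sampler in [("SMOTE", SMOTE(random_state=42)),
                      ("RUS", RandomUnderSampler(random_state=42)),
                      ("SMOTE-ENN", SMOTEENN(random_state=42))]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))
```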

The study has the following objectives:

  • To construct SMP models to predict high maintainability effort classes by treating the imbalanced datasets with data resampling techniques.

  • To assess the predictive performance of the developed SMP models and validate them statistically.

  • To investigate the improvement in the predicting performance of the built SMP models after data resampling.

We achieve the above-specified objectives by finding answers to the following research questions (RQs).

  • RQ1: What is the performance of SMP models developed using ML techniques on original imbalanced datasets?

  • RQ2a: What is the performance of SMP models developed using ML techniques after balancing the datasets with data resampling methods?

  • RQ2b: Which data resampling method improves the performance of the prediction models the most?

To answer the above research questions, we build SMP models that use OO metrics as predictors and software maintainability as the outcome. The datasets extracted from eight Apache open-source software packages are used to develop SMP models with the application of ML techniques. The stable evaluators Balance and G-mean are used in this study to evaluate the predictive performance of the SMP models. The study also conducts a statistical analysis of the constructed models to strengthen the conclusions. The organization of the remaining paper is given below:

Section 2 presents related work. Section 3 describes the research methodology. Section 4 describes the results of the study. Section 5 presents the threats to validity, and Section 6 describes the conclusions and future work.

2 Related work

We present the related work in two sections. The first section discusses the state of the art of SMP models, whereas the second section discusses studies that have addressed the class imbalance problem.

2.1 Literature work related to studies predicting software maintainability

This section discusses various studies that have proposed models to predict software maintainability. Learning techniques ranging from statistical to ML and hybridized techniques have been used to construct models by relating software metrics to maintainability. An empirical analysis of the dataset extracted from two software systems written in Java was conducted by Dagpinar and Jahnke (2003). The study revealed that coupling and size are strong maintainability predictors. The study by Elish and Elish (2009) proposed the TreeNet classifier; its outcome evidenced that OO metrics are good predictors of maintainability.

A non-linear model, projection pursuit regression, was proposed by Wang et al. (2009) to build SMP models. The study developed SMP models using OO metrics extracted from two commercial software systems. The study by Jin and Liu (2010) used clustering techniques to predict software maintainability and empirically validated OO metrics collected from software projects written in C++. A comprehensive statistical comparison of 27 different ML techniques for developing maintainability forecasting models was conducted by Kaur and Kaur (2013); the study revealed that an instance-based classifier performs best for predicting maintainability. The study by Olatunji and Ajasin (2013) proposed extreme learning machines to develop SMP models using OO metrics. The study by Zhang et al. (2015) suggested a framework, SMPlearner, in which 44 metrics collected at different hierarchy levels were employed to develop SMP models; SMPlearner was validated on eight datasets pertaining to open-source software systems. Kumar and Rath (2015) built SMP models by applying hybridized techniques; this study was carried out on two commercial datasets widely used in the literature. Wang et al. (2019) proposed a fuzzy network framework to predict software maintainability using two widely used commercial datasets, and the study advocated that the proposed framework improves the accuracy of SMP models compared with standard fuzzy-based models. Kumar et al. (2019) used class-level software metrics with three different types of neural networks to train SMP models; a genetic algorithm with a gradient descent approach was used to find the optimal weights of the neural networks. Schnappinger et al. (2019) extracted software metrics using static analysis tools and predicted software maintainability using diverse ML techniques. Thus, various models to predict software maintainability have been successfully developed and validated on software metrics. Like previous studies (Zhang et al. 2015; Kumar and Rath 2015; Wang et al. 2009; Kaur and Kaur 2013), this study predicts software maintainability using the internal characteristics of software systems. However, the imbalanced data problem has not been addressed in any of the published studies. This problem is taken care of in this investigation to develop effective prediction models to forecast software maintainability.

2.2 Literature work related to studies taking care of class imbalance problem

The imbalanced class problem arises when, in a particular dataset, the number of data points of one class is far larger than that of the other. With such datasets, many ML techniques encounter serious trouble (Laurikkala 2001). Often, with imbalanced datasets, ML techniques learn to predict only the dominant (majority) class instances, whereas instances of the class of interest (minority class) are discarded by being treated as noise (Maloof 2003). Researchers have proposed various solutions to the imbalanced class problem at the algorithm level and the data level. The data-level solutions employ various kinds of data resampling strategies to get rid of the imbalanced data issue, while the algorithmic solutions include regulating the costs of both classes to tackle the problem (Chawla et al. 2004). For predictive modeling in software engineering, the imbalanced class problem is encountered in the prediction of change-prone and defective classes, and it has been solved in different ways to uplift the performance of the predictive models. Choeikiwong and Vateekul (2015) proposed an algorithm-level solution to the class imbalance problem for software fault prediction. They implemented a classifier in which the separation hyperplane of the Threshold Adjustment Support Vector Machine (R-SVM) is adjusted to reduce the bias toward the dominant class. This study was performed on 12 datasets, and its findings showed that R-SVM improved the prediction rate of models for predicting faulty modules. Gao et al. (2015) examined four different scenarios of feature selection and data sampling to boost the predictive capability of defect prediction models developed with imbalanced datasets. This study confirmed that feature selection on resampled data improves the predictive capability of the models. Laradji et al. (2015) proposed an average probability ensemble (APE) incorporating several base classifiers to cope with the imbalanced class problem; to further improve prediction capability, feature selection was combined with APE in this study. Siers and Islam (2015) proposed a cost-sensitive classifier, an ensemble of decision trees, to tackle the problem. Pelayo and Dick (2007) investigated synthetic minority oversampling, which balances the proportion of defective and non-defective modules.

The paper by Sun et al. (2012) used ensemble and coding schemes to handle class imbalance for predicting defective classes. The study first converted the imbalanced binary-class data into balanced multiclass data using coding-based schemes and then trained the defect prediction models on this multiclass data. Tan et al. (2015) employed three data resampling approaches and updatable classification techniques to boost the predictive capability of fault prediction models learned from imbalanced class datasets. Wang and Yao (2013) investigated different methods, including resampling, ensembles, and threshold moving, to improve defect prediction models; the study also proposed a dynamic version of AdaBoost to handle the imbalanced class issue in defect prediction. A paper by Zheng (2010) proposed three cost-sensitive boosting techniques to improve the prediction rate of neural networks. Malhotra and Khanna (2017) developed software change prediction models from imbalanced data by employing three data resampling methods and meta-cost learners. In this way, various studies in the literature have dealt with class imbalance in predictive modeling in the software engineering domain to improve defect prediction and change prediction models. However, the imbalanced class issue remains unaddressed in the maintainability prediction literature; therefore, this is the first study to deal with the imbalanced class problem for SMP.

3 Research methodology

The research methodology comprises all the components of the study: the experimental design, the data resampling methods, and the ML techniques used for developing the SMP models.

3.1 Components of the empirical study

3.1.1 Predictor and response variable

Training a prediction model for a predictive task requires a dataset comprising predictor (independent) and response (dependent) variables. For software quality prediction models, the predictor variables are software metrics. Software metrics quantify various aspects of software systems and are used to predict and estimate different software characteristics (Chidamber and Kemerer 1991; Ebert and Dumke 2007; Fenton and Bieman 2014). Over the years, different software metrics (procedural and OO) have been proposed, and their relationship with software maintainability has been established.

3.1.2 Predictor variables

We use OO metrics as the independent variables to develop prediction models in this study, as the study is carried out on OO systems developed in the Java programming language. The OO metrics used in the study include the Chidamber and Kemerer (C&K) metric suite (Chidamber and Kemerer 1994), the Quality Model for Object-Oriented Design (QMOOD) metric suite (Bansiya and Davis 2002), and metrics proposed by Henderson-Sellers (Henderson-Sellers 1996) and Martin (Martin 2002). The C&K metric suite includes the metrics WMC: weighted methods per class, DIT: depth of inheritance tree, NOC: number of children of a class, CBO: coupling between objects, LCOM: lack of cohesion in methods, and RFC: response for class. The QMOOD metric suite includes the metrics MOA: measure of aggregation, DAM: data access metric, MFA: measure of functional abstraction, NPM: number of public methods, and CAM: cohesion among methods of a class. The Martin metrics are Ce: efferent coupling and Ca: afferent coupling. A few other metrics used in the study are IC: inheritance coupling, CBM: coupling between class methods, AMC: average method complexity, LOC: lines of source code, and LCOM3, a variation of LCOM given by Henderson-Sellers. These metrics describe different aspects of OO systems, namely, cohesion, coupling, size, inheritance, composition, and encapsulation.

The metrics WMC, NPM, LOC, DAM, and AMC are indicators of the size of a class. The metrics CBO, RFC, Ca, Ce, IC, and CBM measure coupling. The inheritance property is measured with the help of NOC, DIT, and MFA. The metrics LCOM, CAM, and LCOM3 are indicators of class cohesion, whereas MOA measures composition. These metrics, which quantify the different characteristics of a class, are regarded as internal quality attributes. Class qualities such as testability, reliability, reusability, and maintainability belong to a set of quality attributes called external quality attributes (Al-Dallal 2013). The present study is based on Morasca’s (2009) suggestion to predict software maintainability, an external quality attribute, by constructing probabilistic models. In this study, the prediction models use the above OO metrics as predictor variables to estimate the external quality attribute, namely, software maintainability. The internal quality attributes used have a significant relationship with software maintainability. For instance, the attributes WMC, NPM, LOC, DAM, and AMC measure the size of a class; if the size increases, the code is likely to be less maintainable, i.e., likely to require high maintainability effort (Al-Dallal 2013). Table 1 shows a brief explanation of the predictors used in this paper. Researchers have extensively used these metrics for predictive modeling in the software engineering domain (Singh et al. 2010; Kpodjedo et al. 2011; Giger et al. 2012; Gyimothy et al. 2005; Olague et al. 2007; Elish and Al-Rahman Al-Khiaty 2013), which motivates us to use them in our study. Radjenovic et al. (2013) conducted a review of 106 papers predicting defects in classes; this review revealed that C&K metrics have frequently been used for predicting faults in classes. Therefore, our study also takes C&K metrics and validates them for predicting software maintainability. The paper by Lu et al. (2012) assessed sixty-two OO metrics for estimating change-prone classes and discovered that LOC, CBO, LCOM, and CAM are significant metrics. The C&K metrics, combined with QMOOD metrics, were used by Eski and Buzluca (2011) to predict change-prone classes; the study advocated that the combination of C&K and QMOOD metrics is a competent set of predictors for classes that are likely to change in the future. Hence, our paper uses an effective combination of predictors that have been successfully validated in the literature for predictive modeling tasks in software engineering.

Table 1 OO metrics studied


Let us see how we can compute the values for a few of the OO metrics used in this study, for example, WMC and DIT. Suppose class A has three methods M1, M2, and M3 with complexities X1, X2, and X3, respectively. Then, WMC for class A is given as WMC = sum(X1, X2, X3). Now consider another metric. Take an OO program with six classes: CL1, CL2, CL3, CL4, CL5, and CL6. The classes CL2, CL3, and CL4 are derived from CL1, and CL5 and CL6 are derived from CL4. DIT computes the depth of a class in the inheritance hierarchy, so DIT(CL1) = 0, DIT(CL2) = 1, DIT(CL3) = 1, DIT(CL4) = 1, DIT(CL5) = 2, and DIT(CL6) = 2. Deeper classes add more design complexity. Table 1 presents a brief definition of the OO metrics used.
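The toy example above can be reproduced with a short sketch. This is an illustration only, not the CKJM tool used in the study, and the per-method complexities X1 = 2, X2 = 1, X3 = 4 are hypothetical values chosen for demonstration.

```python
# Illustrative computation of WMC and DIT for the toy example above.

def wmc(method_complexities):
    """WMC = sum of the complexities of a class's methods."""
    return sum(method_complexities)

def dit(cls, parent):
    """DIT = number of inheritance edges from the class up to the root."""
    depth = 0
    while parent.get(cls) is not None:
        cls = parent[cls]
        depth += 1
    return depth

# Class A with methods M1, M2, M3 of hypothetical complexities 2, 1, 4.
print(wmc([2, 1, 4]))                      # -> 7

# Inheritance hierarchy: CL2, CL3, CL4 derive from CL1; CL5, CL6 from CL4.
parent = {"CL1": None, "CL2": "CL1", "CL3": "CL1",
          "CL4": "CL1", "CL5": "CL4", "CL6": "CL4"}
print([dit(c, parent) for c in parent])    # -> [0, 1, 1, 1, 2, 2]
```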

3.2 Response variable

In this study, maintainability is measured using the change count (CC) metric, defined as CC = LOCadded + LOCdeleted + LOCmodified, where LOCadded is the number of lines of source code added during the maintenance phase, LOCdeleted is the number of lines of source code deleted during the maintenance period, and LOCmodified is the number of lines of code modified in the maintenance period. Many studies in the literature (Kumar and Rath 2015; Elish and Elish 2009; Kaur and Kaur 2013) measure maintainability in this way. The response variable in the study is maintainability, a binary variable obtained by discretizing the CC metric into two values: low maintainability effort and high maintainability effort. The low-maintainability effort classes are those that require few changes to the lines of code during the maintenance phase, whereas high-maintainability effort classes require more changes to the source code during maintenance. The study aims at developing efficient prediction models that predict high-maintainability effort classes with good accuracy. Early prediction of such classes helps to allocate resources to them optimally, thereby producing high-quality and maintainable software.
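A minimal sketch of computing CC and forming the binary response is given below; the LOC figures and the cut-off of 50 are hypothetical placeholders, since the actual discretization used in the study is described in Section 3.3.2.

```python
# Sketch of the CC metric and its binarization into the response variable.

def change_count(loc_added, loc_deleted, loc_modified):
    """CC = LOC added + LOC deleted + LOC modified during maintenance."""
    return loc_added + loc_deleted + loc_modified

def maintainability_label(cc, threshold):
    """Binary response: 'high' maintainability effort above the cut-off."""
    return "high" if cc > threshold else "low"

cc = change_count(loc_added=40, loc_deleted=10, loc_modified=25)   # -> 75
print(cc, maintainability_label(cc, threshold=50))                 # 75 high
```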

3.2.1 Software system studied

We undertook this study using OO metrics extracted from open-source software. A brief description of the software systems considered in the present study is as follows:

  • Apache Bcel (Byte Code Engineering Library) is intended to give users a convenient way to create, analyze, and manipulate Java class files (files ending with .class). An object representing a class contains all its information, such as fields, methods, and, in particular, the bytecode instructions of that class.

  • Apache Betwixt provides a way of turning beans into XML and producing digester rules automatically. Just as the BeanInfo mechanism can be used to customize the default introspection on a Java object, the digester rules can be customized on a per-type basis.

  • Apache Io is a Java library that includes numerous classes. The library enables developers to do various everyday tasks efficiently with much less effort. Different functions, such as input, output, utility classes, and comparators, are included in this library.

  • Apache Ivy is a robust tool for recording, tracking, resolving, and reporting various project dependencies. It is very flexible, configurable, and has strong integration with Apache Ant, a Java build tool.

  • Apache Jcs is a Java Caching System written in Java. It speeds up applications by managing cache data of different forms. In addition to cache data management, it provides various other features such as memory management, element grouping, and remote synchronization, to name a few.

  • Apache Lang provides various methods that the standard Java libraries do not deliver. These functions include basic numerical methods, string methods, and concurrency control.

  • Apache Log4j is a logging framework that can be configured through external configuration files dynamically. It provides a convenient way to direct logging information to various destinations such as console and database.

  • Apache Ode (Orchestration Director Engine) is a software component intended to execute BPEL (Business Process Execution Language) business processes. It sends and receives messages to and from web services, manipulates data, and takes care of error handling.

3.2.2 Performance metrics

Prediction of maintainability is treated as a classification problem in this study. Given training data points of classes labeled as low maintainability effort or high maintainability effort, a classifier can be learned from the data points and used to classify unknown classes as low maintainability effort or high maintainability effort. The classifier’s performance is assessed by examining the confusion matrix. Table 2 depicts the confusion matrix for a two-class classification task. The confusion matrix expresses the class values in the form of positives and negatives; for the present study, the positive class value corresponds to high-maintainability effort instances, and the negative class value corresponds to low-maintainability effort instances. Some widely used traditional performance metrics for evaluating a classifier, such as accuracy and error rate, are calculated from the confusion matrix. Accuracy and error rate assume a uniform class distribution of the positive and negative classes. Using these measures to evaluate a classifier can give rise to misleading conclusions on imbalanced data, as they are strongly inclined toward the majority class. Consider a case where 99% of the data points in a dataset belong to the majority class: a classifier that predicts the label of every unseen test data point as the majority class label still achieves 99% accuracy. For this reason, accuracy and error rate are not recommended for imbalanced data. The minority class is of primary interest in the imbalanced case, yet accuracy and error rate give equal importance to classification errors in either class, which makes them questionable evaluators for the class imbalance problem. The confusion matrix is used to derive performance evaluators that assess the classifier’s performance separately on positive and negative instances: sensitivity (true positive rate, TPR) and specificity (true negative rate, TNR). However, for imbalanced data, there is a trade-off between TPR and TNR. Therefore, many researchers have adopted more stable performance measures to evaluate prediction models in the class imbalance situation. The paper by He and Garcia (2009) advocated G-mean as a stable metric for assessing prediction models developed from imbalanced data, and Menzies et al. (2007) proposed Balance, a robust evaluator for prediction models developed on imbalanced datasets. Our study therefore assesses the maintainability prediction models using the stable and strong evaluators Balance and G-mean. The formulas for these performance evaluators are given in Table 3.
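The sketch below shows the widely used formulations of these two evaluators; it assumes the standard definitions of G-mean and Balance, and Table 3 of the paper remains authoritative for the exact formulas used in the study.

```python
# Widely used definitions of G-mean (He and Garcia 2009) and Balance
# (Menzies et al. 2007), computed from the confusion matrix counts.
import math

def gmean_and_balance(tp, fn, fp, tn):
    tpr = tp / (tp + fn)          # sensitivity: recall of the positive class
    tnr = tn / (tn + fp)          # specificity
    pf = fp / (fp + tn)           # false positive rate
    g_mean = math.sqrt(tpr * tnr)
    # Balance: rescaled distance of (pf, tpr) from the ideal point (0, 1).
    balance = 1 - math.sqrt((0 - pf) ** 2 + (1 - tpr) ** 2) / math.sqrt(2)
    return g_mean, balance

# Example confusion matrix: positives = high-maintainability effort classes.
print(gmean_and_balance(tp=8, fn=2, fp=15, tn=85))
```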

Table 2 Confusion matrix
Table 3 Performance metrics

3.2.3 Statistical tests

In empirical research, deriving conclusions entirely from the empirical results without applying statistical analysis can be misleading (Lessmann et al. 2008). Statistical tests establish confidence in the outcomes of an empirical investigation and help to validate the hypotheses formed. We apply statistical analysis at different stages in this study and confirm the corresponding hypotheses, e.g., in the first stage, the null hypothesis tested is “the performance of the ML techniques does not differ significantly after applying data resampling methods compared to the situation when no resampling is employed.” This hypothesis is validated with the Friedman test, a distribution-free test that makes no assumptions about the distribution of the performance measures. If the test statistic obtained from the Friedman test is sufficiently substantial to reject the null hypothesis, the variance in the ML techniques’ performance after data resampling is non-random. In such a situation, we carry out a post hoc examination with the Wilcoxon signed-rank test, applied after Bonferroni correction, to find the pairwise differences between the best resampling method and the other resampling methods.
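As an illustration of this procedure, the following minimal Python sketch applies the Friedman test and the Bonferroni-corrected Wilcoxon post hoc comparisons with SciPy; the matrix `perf` of performance values, its dimensions, and the random placeholder numbers are hypothetical and do not represent the study’s data.

```python
# Sketch of the statistical procedure: Friedman test followed by
# Bonferroni-corrected Wilcoxon signed-rank post hoc comparisons.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
# Rows: dataset/ML-technique combinations; columns: resampling treatments
# (including a no-resampling column). Placeholder values only.
perf = rng.uniform(0.4, 0.9, size=(80, 4))

# Friedman test across the four treatments (columns).
stat, p = friedmanchisquare(*[perf[:, j] for j in range(perf.shape[1])])
print(f"Friedman: chi2={stat:.2f}, p={p:.4f}")

# Post hoc: compare the best-performing column against each of the others
# at a Bonferroni-corrected significance level.
best = perf.mean(axis=0).argmax()
alpha = 0.05 / (perf.shape[1] - 1)
for j in range(perf.shape[1]):
    if j != best:
        _, p_pair = wilcoxon(perf[:, best], perf[:, j])
        print(f"treatment {j}: p={p_pair:.4f}, significant={p_pair < alpha}")
```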

3.3 Experimental setting

3.3.1 Data collection

This investigation is carried out using eight Apache application packages: Apache Bcel, Betwixt, Io, Ivy, Jcs, Lang, Log4j, and Ode. We analyzed two subsequent versions of each of these software systems; Section 3.2.1 describes the systems investigated in this study. All of these are large-scale application packages and give us sufficient data points to develop prediction models. These software systems are OO and written in the Java programming language, and the prevalence of OO programming these days influenced the project selection for this study. We used the DCRS (Data Collection and Reporting System) tool (Malhotra et al. 2014) for data extraction from the software systems. Data extraction has been successfully carried out from various open-source repositories using this tool; the only prerequisite for using it is that the repository must use GIT as its version control system. The DCRS tool is therefore successfully employed for data extraction from the Apache application packages because Apache uses GIT.

Two successive versions of each software product are fed as input to the DCRS tool. The change log corresponding to the classes common to the two analyzed versions is extracted in the form of log records by the DCRS tool. The log records contain information such as the list of modified files, the CC (change count), and a description of the changes made. The CKJM tool (http://gromit.iiar.pwr.wroc.pl/p_inf/ckjm/), open-source software for the extraction of OO metrics, is embedded in the DCRS tool to collect the OO metrics corresponding to the Java classes. The OO metrics extracted by the CKJM tool are quantitative measures of the cohesion, coupling, inheritance, etc. of a class. Each class common to the two versions yields one data point for our analysis. Each such data point comprises a combination of OO metrics and the CC metric representing the lines of code changed. The OO metrics included in the data point are used as the predictors, and CC forms the response variable (maintainability). Each generated data point also contains a variable named “alter” with two values, “yes” and “no”: if there is a change in a class, alter is assigned the value yes; otherwise, no.

3.3.2 Data pre-processing

The data pre-processing steps carried out on the datasets before the analysis are as follows:

  • Removal of no-change classes: As discussed in Section 3.3.1, the OO metrics and CC metric corresponding to the classes common to two successive versions form the data points, and each data point also contains the alter variable with the value yes or no. For our analysis, we considered only those data points where the alter variable is yes, which means the class was changed at least once. In this way, we include only the data points of changed classes. The details of the software projects, with their names, versions analyzed, #common classes (number of common classes), and #common classes changed (number of changed common classes), are given in Table 4.

  • Outlier detection and removal: Outlier analysis is necessary to develop generalized prediction models. Outliers exhibit extreme variability relative to the remaining data points in the dataset (Rousseeuw and Leroy 2005), and their detection and removal are essential for building a competent prediction model. Outlier analysis was performed with the interquartile range filter of the WEKA (Waikato Environment for Knowledge Analysis) tool (http://www.cs.waikato.ac.nz/ml/weka/). In this way, 26, 34, 24, 58, 22, 47, 32, and 115 outliers were detected and removed from the Apache Bcel, Apache Betwixt, Apache Io, Apache Ivy, Apache Jcs, Apache Lang, Apache Log4j, and Apache Ode datasets, respectively.

  • Data discretization: After data extraction, the CC metric values ranged from tens of changed lines of code to thousands of modifications in all eight software projects; classes in which thousands of lines of code were changed are very few. The dependent variable (maintainability) used in this study is formed by discretizing the CC metric into two bins: low maintainability effort and high maintainability effort. The label low maintainability effort corresponds to those classes that require less maintenance effort (fewer changes in LOC), whereas the label high maintainability effort corresponds to the classes requiring more maintenance effort (more effort in the form of LOC changed). The details of the software systems after this step are shown in Table 5. The low-maintainability effort data points are regarded as majority class data points because they are larger in number compared with the high-maintainability effort (minority class) data points in all eight software systems used in the study. It is evident from Table 5 that the imbalance ratio (i.e., the number of low-maintainability effort data points divided by the number of high-maintainability effort data points) varies from 6.29 to 23.80 (a small sketch of these pre-processing steps follows this list).
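A rough pandas analogue of these pre-processing steps is sketched below; the column names, the 1.5 × IQR rule, and the cut-off of 30 changed lines are hypothetical illustrations only, since the study used the WEKA InterquartileRange filter and its own discretization of CC.

```python
# Rough pandas analogue of the pre-processing steps (illustration only).
import pandas as pd

df = pd.DataFrame({"wmc": [5, 7, 6, 90, 8],
                   "cc": [12, 300, 8, 15, 40],
                   "alter": ["yes", "yes", "no", "yes", "yes"]})

df = df[df["alter"] == "yes"]                      # keep changed classes only

q1, q3 = df["cc"].quantile([0.25, 0.75])           # interquartile-range filter
iqr = q3 - q1
df = df[df["cc"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Hypothetical cut-off of 30 changed lines for the binary response.
df["effort"] = (df["cc"] > 30).map({True: "high", False: "low"})
counts = df["effort"].value_counts()
print("imbalance ratio:", counts.get("low", 0) / max(counts.get("high", 1), 1))
```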

Table 4 Details of software projects
Table 5 Data discretization results

3.3.3 Applying data resampling methods

The next step in our experimental setup involves applying data resampling methods to balance the numbers of minority and majority examples. We use fourteen resampling methods comprising oversampling, undersampling, and hybrid resampling in this study.

3.3.4 Maintainability prediction model development and evaluation

After data resampling, the next step is the development of the SMP models. To develop the SMP models, we apply ML techniques commonly used in the literature; the details of the techniques are given in Section 3.5. A tenfold cross-validation strategy is employed during prediction model development: the available training instances are randomly separated into ten equal parts, nine of which are used for model development while the model is validated on the tenth partition. This process is carried out ten times to ensure a low bias from random partitioning. We appraise the performance of the SMP models through G-mean and Balance. Lastly, we conduct a statistical analysis to establish confidence in the results produced.
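A minimal sketch of this evaluation step is given below, assuming the Python imbalanced-learn pipeline (the study itself used the WEKA/KEEL implementations); placing the sampler inside the pipeline ensures that resampling is applied only to the training folds of each of the ten splits.

```python
# 10-fold cross-validation with resampling applied only to training folds.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.metrics import geometric_mean_score

# Toy imbalanced dataset standing in for an OO-metrics dataset.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=1)

model = Pipeline([("resample", SMOTE(random_state=1)),
                  ("clf", DecisionTreeClassifier(random_state=1))])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, cv=cv,
                         scoring=make_scorer(geometric_mean_score))
print("mean G-mean over 10 folds:", scores.mean())
```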

3.4 Data resampling methods used

The study applies fourteen resampling techniques for handling imbalanced datasets using the KEEL (Knowledge Extraction based on Evolutionary Learning) tool with default parameter settings (https://www.keel.es).

  • Adaptive synthetic sampling (ADASYN) adaptively generates synthetic examples for minority class instances. It employs a weighted density distribution to decide the number of synthetic cases to be created for each minority class instance: more examples are created for harder-to-learn cases and fewer for easy-to-learn examples (He et al. 2008).

  • Synthetic minority oversampling technique (SMOTE) oversamples the rare class by introducing artificial instances into the dataset. The artificial instances are formed along the line joining a minority class example with its k-nearest neighbors, from which the requisite number of neighbors is taken. For instance, if 200% oversampling is needed, two out of the five nearest neighbors of a particular minority class instance are randomly picked, and an artificial example is generated corresponding to each. The formation of an artificial example consists of two steps: first, the difference between the selected example and its chosen nearest neighbor is computed; then, this difference is multiplied by a random number between 0 and 1 and the result is added to the initially selected instance (Chawla et al. 2002). (A bare-bones sketch of this interpolation step appears after this list.)

  • Borderline synthetic minority oversampling technique (Border-Line-SMOTE) oversamples the minority class instances that lie on the borderline by introducing synthetic samples using SMOTE. A minority class instance is said to lie on the borderline when most of its k-nearest neighbors belong to the majority class (Han et al. 2005).

  • Safe-Level-SMOTE: Unlike SMOTE, Safe-Level-SMOTE computes the safe level of each minority class instance before producing synthetic instances. The safe level of a minority class instance is the number of minority class instances among its k-nearest neighbors (Bunkhumpornpat et al. 2009).

  • Synthetic minority oversampling technique + edited nearest neighbor (SMOTE-ENN) is a hybridized technique based on SMOTE. In this technique, synthetic samples are generated using SMOTE, and Wilson’s edited nearest neighbor rule is then applied to filter out the noisy instances (Batista et al. 2004).

  • Synthetic minority oversampling technique + Tomek’s modification of condensed nearest neighbors (SMOTE-TL) is a hybrid method based on SMOTE. Synthetic examples are generated using SMOTE, after which Tomek links are detected and removed from the dataset. Two instances belonging to different classes form a Tomek link if the distance between them is smaller than the distance from either of them to any other sample in the dataset (Batista et al. 2004).

  • Random oversampling (ROS) randomly replicates rare class instances until the rare and dominant classes have the same number of examples (Batista et al. 2004).

  • Selective pre-processing of imbalanced data (SPIDER) involves oversampling the minority class and filtering instances of the majority class depending on whether they are safe or noisy, using kNN classification; the noisy samples are then removed (Stefanowski and Wilk 2008).

  • Selective pre-processing of imbalanced data II (SPIDER-II) involves two phases of pre-processing the minority and majority class examples. First, noisy instances of the dominant class are identified and either removed or relabeled according to the relabel option. Afterward, noisy instances of the rare class are identified and replicated according to the ampl option (Napierala et al. 2010).

  • Random undersampling (RUS) randomly removes instances from the dominant class until the rare and dominant classes have the same number of samples (Batista et al. 2004).

  • Condensed nearest neighbor (CNN) is based on the nearest-neighbor rule. It eliminates certain instances from the dataset without affecting the NN classification performance (Hart 1968).

  • Condensed nearest neighbors with Tomek’s modification (CNN-T) combines condensed nearest neighbors and Tomek links. Based on the information given by the Tomek links, certain instances are removed from the dataset without affecting the NN classification performance (Batista et al. 2004).

  • Class purity maximization clustering (CPM). In this technique, two random instances, one from the majority class and one from the minority class, are designated as the initial cluster centers. The remaining samples are partitioned into two subgroups according to the nearest center, with the precondition that one subgroup should have higher class purity. The process is repeated recursively until the two subsets can no longer be split into clusters such that the class purity of at least one subgroup exceeds that of the parent cluster (Yoon and Kwek 2005).
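To make the SMOTE interpolation step referred to above concrete, the following bare-bones Python sketch generates synthetic minority samples as new point = x + rand(0, 1) × (neighbor − x); it illustrates the idea only, not the KEEL implementation used in the study, and the feature vectors are random placeholders.

```python
# Bare-bones illustration of SMOTE-style interpolation (not the KEEL tool).
import numpy as np

rng = np.random.default_rng(7)

def smote_like(minority, n_new, k=5):
    """Generate n_new synthetic minority samples by interpolation."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # Distances to all other minority samples; pick one of the k nearest.
        d = np.linalg.norm(minority - x, axis=1)
        neighbors = np.argsort(d)[1:k + 1]
        nb = minority[rng.choice(neighbors)]
        # Difference scaled by a random factor in [0, 1) and added to x.
        synthetic.append(x + rng.random() * (nb - x))
    return np.array(synthetic)

minority = rng.normal(size=(20, 3))          # toy minority-class feature vectors
print(smote_like(minority, n_new=5).shape)   # -> (5, 3)
```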

3.5 ML techniques

In this study, we explored different categories of ML techniques, namely, neural networks, instance-based learners, ensembles, and decision trees; in addition, the study selected one statistical technique. The predictive capability of the chosen techniques is well established in software quality research for predictive modeling tasks. Neural networks (NNs) have proven their outstanding capability to derive meaningful insights and extract patterns from complex data, and many studies in the literature have used them to build successful models for predicting software maintainability (e.g., Thwin and Quah 2005; Zhou and Leung 2007; Ahmed and Al-Jamimi 2013). A study by Malhotra (2015) reviewed ML techniques for predicting defective classes in software; the review revealed that C4.5 was the most popular technique in the decision tree category and has an outstanding capability to predict defect-prone classes. The C4.5 decision tree is simple to implement and offers comprehensive predictive capability (Arisholm et al. 2007). In the category of instance-based learners, we selected two techniques, KStar (KS) and k-nearest neighbor. Kaur and Kaur (2013) statistically compared 27 different ML techniques for predicting maintainability and discovered the outstanding performance of KS. All of the above techniques are distribution-free and require no prior assumptions on the statistical distribution of the dataset. We also used two ensemble techniques in this study. Ensemble learners combine several classifiers whose predictions are aggregated to obtain a single consolidated decision (Oza and Tumer 2008). Several studies have advocated the remarkable applicability of ensembles for improved quality modeling in software engineering (Catolino and Ferrucci 2018; Xu et al. 2010; Peng et al. 2011). Given the vast number of techniques that could potentially be selected for developing SMP models, we had to strike a balance between curtailing the choice of techniques based on existing empirical studies and covering a wide diversity of ML techniques based on their properties. Hence, the primary reason for selecting these techniques was their establishment and popularity in the literature for predictive modeling in the software engineering domain. A brief description of all the techniques used is given below (a sketch instantiating roughly analogous open-source implementations appears after this list):

  • C4.5 is an enhancement of the ID3 decision tree. The ID3 algorithm has a few limitations: (i) it works only for nominal attributes, and (ii) the dataset should not have any missing values. Ross Quinlan, the inventor of ID3, improved ID3 to overcome these limitations and proposed a modified version called C4.5. This algorithm creates more generalized prediction models and can handle continuous-valued attributes as well as missing values (Quinlan 2014).

  • Multilayer perceptron with conjugate gradient learning (MLP-CG) is a feed-forward neural network for classification and prediction. MLP-CG uses conjugate gradient methods for training the weights of a multilayer perceptron. Conjugate gradient methods are characterized by low memory requirements and fast global and local convergence (Moller 1993).

  • Radial basis function neural network (RBFNN) is a type of non-linear feed-forward neural network with a single hidden layer. Its learning is guaranteed because the single layer of adjustable weights can be calculated by solving a linear optimization problem. This neural network can represent non-linear transformations (Broomhead and Lowe 1988).

  • Incremental radial basis function neural network (IRBFNN) is a type of NN that learns by allocating new units and adjusting the parameters of existing units. If the network performs inadequately on a pattern presented to it, a new unit is allocated that corrects the response to the presented pattern; if the network performs well, the existing parameters are updated (Platt 1991).

  • BG (bagging) is an ensemble technique that creates individual subsets of the training dataset randomly with replacement. A predictor is developed for each subset, and the results of the individual predictors are then either averaged or combined using majority voting (Breiman 1996).

  • AB (AdaBoost) is based on the boosting method. The boosting technique develops a powerful classification model from several weak classifiers; AdaBoost improves the predictive power of the weak classifiers. The weak classifiers are learned using weighted training data samples, and the misclassification rate of the individual classification models is determined. The algorithm includes a weight-updating process: correctly classified data points get small weights, and wrongly classified data points get large weights. In this manner, the AB technique focuses more on the difficult-to-learn data points (Quinlan 2014).

  • KNN (k-nearest neighbor) offers high interpretability and low calculation time. KNN classifies an instance using majority voting among its neighboring examples: an instance is assigned to the class that is most common among its k-nearest neighbors (Cover and Hart 1967).

  • KS belongs to the category of instance-based classifiers. In KS, the output label of a test example is determined based on the output labels of the training examples that are similar to the given test example. The KS technique uses entropy as a measure of dissimilarity (Cleary and Trigg 1995).

  • LR (logistic regression) with a ridge estimator is a prominent method for binary classification. Ridge estimators enhance the parameter estimates and lower the error when maximum-likelihood estimators cannot fit the data (Le Cessie and Van Houwelingen 1992).
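For readers who wish to experiment, the sketch below instantiates roughly analogous scikit-learn estimators; this mapping is an assumption made for illustration only, since the study used the WEKA/KEEL implementations, and KStar, RBFNN, and IRBFNN have no direct scikit-learn counterparts.

```python
# Roughly analogous scikit-learn estimators (illustrative mapping only).
from sklearn.tree import DecisionTreeClassifier          # ~ C4.5-style tree
from sklearn.neural_network import MLPClassifier         # ~ MLP-CG
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier       # KNN
from sklearn.linear_model import LogisticRegression      # LR (L2 ~ ridge)

classifiers = {
    "C4.5-like tree": DecisionTreeClassifier(),
    "MLP": MLPClassifier(max_iter=500),
    "Bagging (BG)": BaggingClassifier(),
    "AdaBoost (AB)": AdaBoostClassifier(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "LR (L2)": LogisticRegression(penalty="l2", max_iter=500),
}
for name, clf in classifiers.items():
    print(name, "->", type(clf).__name__)
```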

4 Results and analysis

We developed the SMP models using the original imbalanced datasets and again after applying data resampling methods. The performance of the SMP models is compared with respect to the performance measures G-mean and Balance.

4.1 RQ1: What is the performance of SMP models developed using ML techniques on original imbalanced datasets?

To answer this research question, we develop SMP models on the original imbalanced datasets using the ML techniques discussed in Section 3.5. Tables 6 and 7 show the predictive performance of the ML techniques for SMP models based on G-mean and Balance, respectively. In this experiment, we observe that the ML techniques, without applying resampling methods, have inferior performance in terms of both G-mean and Balance. On analyzing Table 6, we see that in 61% of the cases, the G-mean values are less than 50%; similarly, in 66.66% of the cases (Table 7), the Balance value is less than 50%. Figure 1 shows the boxplots of the performance of the maintainability prediction models in terms of Balance and G-mean in the imbalanced case. It is evident from Fig. 1 that the G-mean values drop to 0% for a few of the cases, whereas Balance has 29% as its lowest value. Also, the median of G-mean and Balance for all ML techniques is approximately 40%. These poor results of the ML techniques for maintainability prediction arise because the datasets are imbalanced in nature and the prediction models are unable to learn the minority class instances (class = high maintainability effort) properly, i.e., very few minority class instances are presented to the classifier during training. Therefore, such prediction models cannot be used to make future predictions for unknown instances.

Table 6 G-mean results of SMP model developed using the imbalanced datasets
Table 7 Balance results of SMP model developed using imbalanced datasets
Fig. 1
figure 1

Boxplots G-mean and Balance of original imbalanced data

4.2 RQ2a: What is the performance of SMP models developed using ML techniques after balancing the datasets with data resampling methods?

In this section, we assess the ML techniques’ performance for predicting software maintainability after applying various data resampling methods to balance the datasets. Tables 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, and 23 show the SMP models’ performance in terms of the performance measures G-mean and Balance after applying the data resampling methods. It is evident from Tables 8, 9, 10, 11, 12, 13, 14, and 15 that the G-mean values are higher than 50% in 85%, 72%, 84%, 73%, 95%, 95%, 95%, and 83% of the cases for the Apache Bcel, Apache Betwixt, Apache Io, Apache Ivy, Apache Jcs, Apache Lang, Apache Log4j, and Apache Ode datasets, respectively. Similarly, the Balance values are higher than 50% in 81%, 68%, 83%, 66%, 95%, 95%, 94%, and 87% of the cases for the Apache Bcel, Apache Betwixt, Apache Io, Apache Ivy, Apache Jcs, Apache Lang, Apache Log4j, and Apache Ode datasets, respectively, after data resampling, as shown in Tables 16, 17, 18, 19, 20, 21, 22, and 23.

Table 8 Performance of SMP models based on G-mean for Apache Bcel dataset after resampling
Table 9 Performance of SMP models based on G-mean for Apache Betwixt dataset after resampling
Table 10 Performance of SMP models based on G-mean for Apache Io dataset after resampling
Table 11 Performance of SMP models based on G-mean for Apache Ivy dataset after resampling
Table 12 Performance of SMP models based on G-mean for Apache Jcs dataset after resampling
Table 13 Performance of SMP models based on G-mean for Apache Lang dataset after resampling
Table 14 Performance of SMP models based on G-mean for Apache Log4j dataset after resampling
Table 15 Performance of SMP models based on G-mean for Apache Ode dataset after resampling
Table 16 Performance of SMP models based on Balance for Apache Bcel dataset after resampling
Table 17 Performance of SMP models based on Balance for Apache Betwixt dataset after resampling
Table 18 Performance of SMP models based on Balance for Apache Io dataset after resampling
Table 19 Performance of SMP models based on Balance for Apache Ivy dataset after resampling
Table 20 Performance of SMP models based on Balance for Apache Jcs dataset after resampling
Table 21 Performance of SMP models based on Balance for Apache Lang dataset after resampling
Table 22 Performance of SMP models based on Balance for Apache Log4j dataset after resampling
Table 23 Performance of SMP models based on Balance for Apache Ode dataset after resampling

Figures 2 and 3 show the boxplots of the performance of the maintainability prediction models in terms of G-mean and Balance after data resampling. It is evident from Fig. 2 that G-mean reaches up to 80% in most of the datasets. Also, Balance reaches up to 70 to 80% in all eight datasets, as shown in Fig. 3. It is also quite evident from Figs. 2 and 3 that the medians of G-mean and Balance are considerably higher after data resampling than before.

Fig. 2
figure 2

Boxplots for G-mean results after data resampling

Fig. 3
figure 3

Boxplots for Balance results after data resampling

The median of G-mean is greater than 60% for the Apache Jcs, Apache Lang, and Apache Log4j datasets and approximately equal to 60% for the Apache Bcel, Apache Betwixt, Apache Io, Apache Ivy, and Apache Ode datasets (Fig. 2). Similarly, the median of Balance is higher than 60% for the Apache Jcs, Apache Lang, and Apache Log4j datasets and nearly 60% for the Apache Bcel, Apache Betwixt, Apache Io, Apache Ivy, and Apache Ode datasets after applying the data resampling methods (Fig. 3). To conclude, there is a vast improvement in the performance of the SMP models after applying data resampling methods compared with the situation when no data resampling is used.

4.3 RQ2b: Which data resampling method improves the performance of the prediction models the most?

To assess the performance of data resampling methods used in the study, we perform Friedman’s test concerning performance metrics G-mean and Balance for all eight datasets used in the study along with the scenario when no data resampling is used. In this direction, the following hypotheses are formed and tested.

H0: Null hypothesis—There is no significant difference in the predictive performance of SMP models developed with original imbalanced datasets and after applying data resampling methods concerning performance measures G-mean and Balance.

Ha: Alternate hypothesis—There is a significant difference in the performance of SMP models developed with original imbalanced datasets and after applying data resampling methods concerning performance measures G-mean and Balance.

The above-stated hypotheses are tested at a confidence level of 95% (α = 0.05) using the values of the performance metrics Balance and G-mean for all datasets used in the study. Tables 24 and 25 show the Friedman test results for G-mean and Balance, respectively. The mean rank attained by each data resampling method is shown in parentheses; the higher the rank obtained by a resampling method, the better that method.

Table 24 Friedman’s test results for G-mean
Table 25 Friedman’s test results for Balance

On conducting the Friedman test for the different data resampling methods on the G-mean measure over all eight datasets, the p value obtained is 0.00 (p < 0.05), which means the results of the Friedman test are significant. It is evident from Table 24 that Safe-Level-SMOTE achieves the best rank, whereas the worst rank is obtained for no resampling. Similarly, on conducting the Friedman test for the different data resampling methods on the Balance measure over all eight datasets, the p value obtained is 0.00 (p < 0.05), which means, again, the results of the Friedman test are significant. The mean ranks obtained for the different data resampling methods, along with the no-resampling scenario, with respect to Balance are shown in Table 25. Again, Safe-Level-SMOTE yields the best rank for the Balance measure, and the worst rank is obtained for the no-resampling situation. As the test statistics of the Friedman test are significant for both G-mean and Balance, the null hypothesis (H0) is rejected and the alternate hypothesis (Ha) is accepted. We therefore observe a significant improvement in the performance of SMP models developed after applying data resampling methods to imbalanced datasets. It is observed that the enhanced version of SMOTE, namely, Safe-Level-SMOTE, and the hybrid resampling methods SMOTE-TL and SMOTE-ENN are among the top-ranked methods according to the rankings obtained from the Friedman test for both the G-mean and Balance measures. The Safe-Level-SMOTE method emerges as the best technique for improving the performance of the prediction models. To extend our analysis further, i.e., to gain insight into whether the Safe-Level-SMOTE method is statistically better than the other resampling methods used in the study, we apply the Wilcoxon signed-rank test at the 95% level of confidence (α = 0.05) with Bonferroni correction. Using the Wilcoxon signed-rank test, a pairwise comparison between the Safe-Level-SMOTE method and each of the other resampling methods is computed on the G-mean and Balance measures of all ML techniques for all datasets. The test statistics of the Wilcoxon signed-rank test are reported in Table 26 for both G-mean and Balance. In Table 26, S+ indicates a significant difference in the performance of the corresponding pair of resampling methods, and NS signifies that there is no significant difference. The results depict that Safe-Level-SMOTE significantly outperforms ADASYN, SMOTE, Border-Line-SMOTE, SPIDER, SPIDER-II, ROS, CNN, CNN-T, CPM, NCL, and no resampling on both G-mean and Balance. The test results also depict that Safe-Level-SMOTE does not significantly outperform SMOTE-TL, SMOTE-ENN, and RUS; the performance of SMOTE-TL, SMOTE-ENN, and RUS is comparable with that of Safe-Level-SMOTE.

Table 26 Wilcoxon’s signed-rank test results

4.4 Discussion on results

For the imbalanced datasets, the performance of the SMP models is poor in terms of both performance measures, G-mean and Balance. An analysis of Table 6 indicates that for the Apache Bcel dataset, the G-mean values of SMP models developed with the ML techniques ranged from 0 to 51.86%, and except for IRBFNN and MLP-CG, the G-mean results of all techniques were less than 50%. A similar trend is observed for the imbalanced Apache Betwixt dataset, for which the SMP models’ G-mean values ranged from 0 to 55.30% in the imbalanced scenario. For the Apache Io dataset, the lowest G-mean value, 29.29%, was reported for the SMP models developed with the KS and BAGG techniques, and the G-mean results ranged from 29.29 to 57.56%; except for the KNN and RBFNN techniques, the G-mean values of the remaining techniques were far less than 50%. For the Apache Ivy dataset, the G-mean values ranged from 26.37 to 49.71%. It is worth noting that for the Ivy dataset, the SMP models’ performance is inferior in terms of G-mean, and none of the techniques could achieve a G-mean value of even 50%. For the Apache Log4j dataset, in the imbalanced scenario, the G-mean results of the SMP models developed with the different ML techniques ranged from 0 to 69.42%, and for the Apache Ode dataset, the G-mean values ranged from 0 to 24.56%. On analyzing the Apache Jcs results, the G-mean values were in the range of 0–76.25%. For the Apache Lang dataset, the SMP models developed with the different ML techniques gave G-mean values in the range of 0–81.56%. The performance of the SMP models developed from the imbalanced datasets in terms of the Balance performance measure is also very poor.

For the Apache Bcel dataset, the Balance values of SMP models developed with the ML techniques ranged from 29.29 to 48.88%, and for all techniques the Balance results were less than 50%. For the Apache Betwixt dataset, in the imbalanced scenario, the Balance values of the SMP models ranged from 29.29 to 52.50%, and for all techniques except MLP-CG, the Balance results were less than 50%. For the Apache Io dataset, the lowest Balance value, 29.00%, was reported for the SMP model developed with the C4.5 technique, and, mirroring the trend observed for G-mean on this dataset, the Balance values of all techniques except KNN and RBFNN were far less than 50%. For the Apache Ivy dataset, the Balance values ranged from 34.38 to 46.96%; the performance of the SMP models for this dataset is very poor in terms of Balance, and none of the techniques could achieve a Balance value of even 50%. For the Apache Log4j dataset, the Balance results of the SMP models developed using the different ML techniques ranged from 29.29 to 65.59%, and for the Apache Ode dataset, the Balance results ranged from 29.29 to 46.59%. On analyzing the Apache Jcs results, the Balance values were in the range of 29.29–73.71%, and for the Apache Lang dataset, the SMP models developed with the different ML techniques gave Balance values in the range of 29.29–78.66%.

Therefore, looking at the SMP models’ performance on the imbalanced datasets, the models perform poorly in terms of both G-mean and Balance. This low performance is due to the skewed distribution of high maintainability effort and low maintainability effort data points in the datasets.

As the datasets have insufficient data points for the high maintainability effort classes, the SMP models cannot learn to predict the high maintainability effort classes competently, resulting in low sensitivity (true positive rate) values and, consequently, low G-mean and Balance results.
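To make the link between sensitivity and the two measures explicit, the small helper below computes G-mean and Balance from a confusion matrix using the standard definitions (G-mean as the geometric mean of sensitivity and specificity; Balance as one minus the normalized distance from the ideal point of full sensitivity and zero false-positive rate). The example counts are hypothetical.

```python
# G-mean = sqrt(sensitivity * specificity)
# Balance = 1 - sqrt((0 - pf)^2 + (1 - pd)^2) / sqrt(2), where pd is the
# true positive rate and pf the false positive rate (standard definitions).
import math

def gmean_and_balance(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    sensitivity = tp / (tp + fn)      # pd, true positive rate
    specificity = tn / (tn + fp)
    pf = fp / (fp + tn)               # false positive rate
    g_mean = math.sqrt(sensitivity * specificity)
    balance = 1 - math.sqrt((0 - pf) ** 2 + (1 - sensitivity) ** 2) / math.sqrt(2)
    return g_mean, balance

# Example: few true positives (low sensitivity) drags both measures down,
# even when specificity alone looks acceptable.
print(gmean_and_balance(tp=5, fn=45, tn=90, fp=10))   # ~ (0.30, 0.36)
```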

However, the use of data resampling techniques (RQ2) enhanced the performance of the ML techniques for building SMP models. For the Apache Bcel dataset, the G-mean values ranged from 50.32 to 80.14% and the Balance values from 50.65 to 79.31% for the majority of the cases after data resampling. For the Apache Betwixt dataset, the G-mean values ranged from 50.01 to 72.54% and the Balance values from 50.17 to 72.48% for most of the cases after data resampling. On analyzing the SMP models’ results for the Apache Io dataset after data resampling, we observed that for most of the cases, the G-mean and Balance values ranged from 50.96 to 87.82% and from 50.11 to 86.31%, respectively. For the Apache Ivy dataset, the G-mean and Balance values ranged from 50.16 to 73.44% and from 50.74 to 73.39%, respectively, for most of the cases after data resampling. In the case of the Apache Jcs dataset, the G-mean and Balance values ranged from 55.17 to 87.07% and from 60.54 to 86.97%, respectively, for the majority of the cases after data resampling. For the Apache Lang dataset, the G-mean and Balance values ranged from 60.70 to 85.41% and from 60.01 to 83.68%, respectively, for the majority of the cases after data resampling. For the Apache Log4j dataset, the G-mean values ranged from 60.07 to 78.73% and the Balance values from 60.02 to 78.39% for the majority of the cases after data resampling. For the Apache Ode dataset, the G-mean and Balance values ranged from 50.09 to 74.06% and from 50.11 to 72.69%, respectively, for most of the cases after data resampling.

Therefore, the G-mean and Balance results improved for all datasets when data resampling techniques were used. The improvement in G-mean and Balance after data resampling is due to an increase in sensitivity and specificity. When the datasets were imbalanced, the SMP models gave lower sensitivity values because they had too few instances of the high maintainability effort classes to learn the positive examples properly. After data resampling, sensitivity increased, which raised the G-mean of the SMP models, as G-mean is the geometric mean of sensitivity and specificity. The rise in sensitivity, without a corresponding rise in the false-positive rate, also improved the Balance.

On analyzing the results of the study, it was found that the models developed after resampling with Safe-Level-SMOTE performed well on all the datasets. The Safe-Level-SMOTE technique improved the performance of the models in terms of G-mean and Balance. The statistical analysis carried out with the Friedman test yields the same conclusion: the Safe-Level-SMOTE technique achieved the highest rank in terms of G-mean and Balance, whereas the no-resampling situation attained the worst rank. Therefore, these results support the use of Safe-Level-SMOTE. We also carried out a pairwise comparison of the performance of Safe-Level-SMOTE with all other resampling methods used in the study, and Safe-Level-SMOTE performed better than all other resampling methods except SMOTE-TL, SMOTE-ENN, and RUS, whose performance is comparable with that of the top-ranked technique. The Safe-Level-SMOTE technique does not create the same number of synthetic instances for each minority instance; instead, it emphasizes the instances that fall in the safe region and discounts the instances that are noise. The superiority of the Safe-Level-SMOTE technique indicates that a data resampling method should generate synthetic instances in a manner that avoids noise and redundancy.
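For illustration, a much-simplified sketch of this safe-level idea is given below. It is not the full Safe-Level-SMOTE algorithm of Bunkhumpornpat et al., and all names and parameters are illustrative: each minority instance is simply weighted by the number of minority neighbours among its k nearest neighbours, so synthetic points are generated mainly in safe regions while isolated (noisy) minority points are skipped.

```python
# Simplified sketch of the safe-level idea (not the full Safe-Level-SMOTE):
# weight minority instances by their number of minority neighbours and
# interpolate new points only around instances that lie in safe regions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def safe_level_oversample(X_min, X_maj, k=5, n_new=100, seed=None):
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_min, X_maj])
    is_min = np.array([True] * len(X_min) + [False] * len(X_maj))

    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_all)
    _, idx = nn.kneighbors(X_min)                 # neighbours of each minority point
    safe_level = is_min[idx[:, 1:]].sum(axis=1)   # minority neighbours, excluding self

    candidates = np.where(safe_level > 0)[0]      # skip pure-noise minority points
    weights = safe_level[candidates] / safe_level[candidates].sum()

    synthetic = []
    for i in rng.choice(candidates, size=n_new, p=weights):
        # interpolate towards a random minority neighbour of the chosen point
        neigh_ids = idx[i, 1:][is_min[idx[i, 1:]]]
        j = rng.choice(neigh_ids)
        gap = rng.random()
        synthetic.append(X_min[i] + gap * (X_all[j] - X_min[i]))
    return np.array(synthetic)
```

In the actual algorithm, the safe levels of both the selected instance and its neighbour additionally control where along the connecting line the synthetic point is placed.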

The results of this work signify the importance of balanced data with an appropriate number of instances of low maintainability effort and high maintainability effort classes for constructing competent SMP models using ML techniques.

5 Threats to validity

The predictor variables used for the development of the prediction models in this study have already been analyzed and validated as useful in the software quality domain. The response variable used in this study is formed by discretizing a continuous variable, “change,” which is extracted by scanning the change logs with the help of the DCRS tool. The DCRS tool has been used successfully for data collection in many empirical studies. Therefore, a threat to construct validity concerning the predictors and the response variable is not present in this study. Threats to the generalizability of the results weaken the external validity of empirical research. All eight datasets used in this study are extracted from application packages of Apache open-source software. Therefore, there exists an external validity threat that the results may vary for proprietary software and for software written in a programming language other than Java. However, the ML techniques are used with their default parameter settings in this study for developing the prediction models, which minimizes the threat to the generalizability of the results. The degree to which the conclusions drawn after conducting research are believable or credible is called conclusion validity; it is also referred to as statistical validity. The threat to conclusion validity does not exist in this study, as the results are supported by appropriate statistical analysis.

6 Conclusions and future work

Early prediction of software classes needing high or low maintainability effort is an essential activity in software development, as it allows such classes to be designed in a better manner. In many software projects, classes requiring high maintainability effort are in the minority, resulting in an imbalanced dataset. Therefore, in this direction, the study assesses the impact of applying data resampling methods to balance the class distribution in the datasets and compares the performance of maintainability prediction models before and after applying the data resampling methods.

Fourteen data resampling methods, including oversampling, undersampling, and hybrid resampling methods, are used in the study. The SMP models are developed using nine ML techniques on the original imbalanced datasets and after employing the data resampling methods. Tenfold cross-validation is used to partition the data for training and testing the maintainability prediction models. The results of the developed prediction models are assessed with the stable and robust performance evaluators Balance and G-mean. The study uses statistical tests to strengthen the conclusions and enhance the credibility of the results.
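A hedged sketch of this evaluation setup is given below, assuming the class-level metrics and the binarized maintainability label are available as a feature matrix and a label vector. SMOTE-ENN from imbalanced-learn stands in for the resampling step (Safe-Level-SMOTE itself would require a custom sampler), and the decision tree is only a stand-in for the nine ML techniques used in the study; the synthetic dataset is illustrative.

```python
# Tenfold cross-validation with resampling applied to the training folds only;
# the sampler, learner, and toy data are illustrative stand-ins.
from imblearn.combine import SMOTEENN
from imblearn.metrics import geometric_mean_score
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data: ~10% "high maintainability effort" (positive) instances.
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1],
                           random_state=42)

pipeline = Pipeline([
    ("resample", SMOTEENN(random_state=42)),         # applied inside each training fold
    ("clf", DecisionTreeClassifier(random_state=42)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
gmean = cross_val_score(pipeline, X, y, cv=cv,
                        scoring=make_scorer(geometric_mean_score))
print(f"mean G-mean over 10 folds: {gmean.mean():.3f}")
```

Wrapping the sampler and the learner in an imbalanced-learn Pipeline ensures that resampling never touches the test folds, which would otherwise inflate the reported G-mean and Balance.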

The experimental results on all eight datasets showed that the performance of the ML techniques for predicting software maintainability improves significantly after employing data resampling methods. The Safe-Level-SMOTE method outperformed all the other data resampling methods used in this study in terms of the performance measures G-mean and Balance. Safe-Level-SMOTE is an enhanced version of SMOTE that determines the safe level of each minority class instance before generating the synthetic samples. The performance of two hybrid resampling techniques, SMOTE-ENN and SMOTE-TL, and of an undersampling method, RUS, is also comparable with that of Safe-Level-SMOTE, as indicated by the results of the Wilcoxon signed-rank test. The study advocates the use of the Safe-Level-SMOTE method to handle imbalanced data and improve the performance of ML techniques, so that efficient maintainability prediction models can forecast, at the early stages of software development, the high maintainability effort classes that are crucial for any software project.

Thus, the study results would help in the accurate identification of the classes that have low maintainability and involve a large share of the maintenance effort. The accurate identification of such classes would enable software practitioners to improve the design and code of these classes. Also, software developers can devote extra time in the testing phase to the low maintainability classes, which would lessen the chances of faults surfacing in these classes during software maintenance. Early prediction of low maintainability classes would also assist software developers in strategically utilizing their resources, enhancing process efficiency, and optimizing the associated maintenance costs. Lastly, software practitioners are encouraged to document low maintainability classes in a better manner in order to reduce the time needed to comprehend the code and carry out the essential modifications during the software maintenance phase.

We plan to replicate this work to examine the effectiveness of data resampling methods used in this study with evolutionary and search-based learning techniques to predict software maintainability.