1 Introduction

Software defect prediction is an important research topic in software engineering (Lyu 2007; Seliya et al. 2010; Nam et al. 2013). By means of metric-based classification, software modules can typically be classified into two categories: defective and non-defective. Metric-based defect prediction predicts the defect proneness of new software modules from historical defect data, which is an efficient means to relieve the burden of software code inspection and testing (Catal and Diri 2009; Hall et al. 2011). The historical defect data are obtained with McCabe (1976), Halstead (1977) and other static software metrics. How to fully analyze and utilize the existing historical defect data and build more precise and effective classifiers has attracted considerable interest from both academia and industry.

Software historical defect data has a complicated structure (Gray et al. 2011; Shepperd et al. 2013), which negatively influences the decisions of classifiers unless it is accounted for when designing a software defect prediction algorithm (Ma et al. 2012; Luo et al. 2012; Gao et al. 2011; Ren et al. 2014). Kernel-based learning methods (Schoelkopf et al. 1998; Scholkopf et al. 1999; Muller et al. 2001), however, can map these data into a higher- or even infinite-dimensional kernel feature space while preserving the original data distribution information. In this way more information is extracted and classification performance is improved. Compared with single kernel learning, multiple kernel learning (Ong et al. 2005; Lewis et al. 2006; Rakotomamonjy et al. 2007; Zien and Ong 2007; Gönen and Alpaydin 2008; Damoulas and Girolami 2008; Gehler and Nowozin 2009; Kembhavi et al. 2009) combines the mapping ability of each basic kernel function, which effectively solves the kernel selection problem. It can also assemble kernel functions with different characteristics, achieve better mapping capability by exploiting the advantages of each basic kernel, represent the data better in the new feature space, and improve prediction accuracy significantly. In general, multiple kernel learning (MKL) can handle complicated situations such as heterogeneous or irregular multi-dimensional data, large-scale problems, and non-flat sample distributions.

Software historical defect data also exhibits marked class imbalance (Khoshgoftaar et al. 2010; Zhou and Liu 2006; Menzies et al. 2007; He and Garcia 2009; Gao et al. 2012), which means that defective modules are far fewer than non-defective ones in most software systems. The class-imbalance problem negatively influences the decisions of classifiers (Ren et al. 2014). Many methods, including resampling techniques, ensemble learning and cost-sensitive learning, have been proposed to address it (Sun et al. 2012; Dietterich 2000; Valentini and Masulli 2002; Rokach 2010). Compared with other methods, ensemble learning can obtain better predictive performance by integrating the results of a series of weak classifiers. As one of the most popular ensemble learning methods, Adaptive Boosting (known as "AdaBoost") updates the sample weights dynamically according to the error rate of the last learning round, thus reducing the bias toward the majority class and mitigating the class-imbalance problem (Freund and Schapire 1997). Research shows that ensemble learning has better predictive capability for class-imbalanced software historical defect data in software defect prediction (Rokach 2010).

1.1 Motivation

Combining multiple kernel learning with ensemble learning is therefore a natural way to process software historical defect data. It can not only map the historical defect data into a higher-dimensional feature space to extract more useful information, but also address the class-imbalance problem. Moreover, the multiple kernel learning problem can be solved through an ensemble learning method, known as multiple kernel boosting (Xia and Hoi 2013; Bi et al. 2004; Bennett et al. 2002). Compared with ordinary multiple kernel learning methods, multiple kernel boosting avoids the complex parameter optimization problem and can adjust the accuracy of the classifier according to different application requirements.

Additionally, misclassifying a defective software module as non-defective is much more dangerous than misclassifying a non-defective module as defect-prone in software defect prediction (Zheng 2010). Classifying a module as defect-prone merely implies that more testing effort should be invested in verification activities; in contrast, misclassifying a defective module as non-defective carries the risk of system failure. However, this issue has not yet been addressed in the sample weight updating of general multiple kernel boosting methods, where the sample weights are adjusted directly according to the classification results. When multiple kernel boosting is used for software defect prediction, this issue should therefore be taken into account.

1.2 Contribution

In this paper, we propose a multiple kernel ensemble learning (MKEL) approach for software defect prediction. The contributions are summarized as follows:

1. To the best of our knowledge, this is the first attempt to introduce the multiple kernel learning technique into the field of software defect prediction, although it has been effectively applied in other domains. Through boosting we obtain the optimal multiple kernel classifier; consequently, our MKEL approach has the advantages of both multiple kernel learning and ensemble learning for software defect prediction.

2. Considering the cost of risk in software defect prediction, we design a new sample weight updating strategy. In the training process of MKEL, we increase the weights of defective samples and decrease the weights of non-defective samples according to the classification results, so as to focus more on defective modules, reduce the risk cost of misclassifying defective modules as non-defective ones, and obtain better predictions.

In this paper, we conduct experiments on twelve NASA datasets, which are public and widely used for software defect prediction. The experimental results demonstrate that the proposed approach outperforms several representative methods. The remainder of this paper is organized as follows. Section 2 introduces the related work. Section 3 describes the proposed approach. Section 4 introduces the experimental setup. Section 5 presents the experimental results and analysis. Section 6 discusses threats to validity, and the conclusion is drawn in Section 7.

2 Related work

Software defect prediction technology can generally be categorized into two types: static and dynamic defect prediction. A typical static defect prediction process is illustrated in Fig. 1. The first step is to collect software modules from software archives and label them: a module is labeled as defective or non-defective according to whether it contains defects. Then, the values of defect prediction metrics, such as McCabe and Halstead metrics, are used as module attributes. The module attributes and labels are used to train a learner and build a prediction model. With the prediction model, new query modules can be predicted as either defective or non-defective.

Fig. 1 Software Defect Prediction Process

Many traditional classification methods from machine learning have been adopted for static software defect prediction, such as SVM (Xing et al. 2005; Elish and Elish 2008; Gray et al. 2009; Yan et al. 2010), Bayes (Turhan and Bener 2007, 2009; Amasaki et al. 2003; Wang and Li 2010), decision trees (Khoshgoftaar and Seliya 2002a, b; Wang et al. 2012; Breiman 2001; Gayatri et al. 2010), neural networks (Thwin and Quah 2005; Paikari et al. 2012), and dictionary learning (Jing et al. 2014). To address the class-imbalance problem, many methods have been applied to software defect prediction, such as resampling (Menzies et al. 2007), ensemble learning (Sun et al. 2012; He and Garcia 2009; Wang and Yao 2013; Seiffert et al. 2009; Aljamaan and Elish 2009), and cost-sensitive learning (Zheng 2010; Bezerra et al. 2011; Seliya and Khoshgoftaar 2011; Sun et al. 2007). Sun et al. (2012) presented a coding-based ensemble learning method, which converts imbalanced binary-class data into balanced multiclass data and builds a defect predictor on the multiclass data with a specific coding scheme. Based on the successful class-imbalance learning method AdaBoost.NC (He and Garcia 2009), Wang and Yao (2013) presented a dynamic version of AdaBoost.NC, which adjusts its parameter automatically during training. Zheng (2010) presented a cost-sensitive boosting algorithm to improve neural network classifiers for defect prediction, which incorporates misclassification costs into the weight-update rule of boosting, so that classification performance on samples with higher misclassification costs is improved.

To improve the accuracy of the prediction model, some researchers have introduced kernel methods into software defect prediction: the historical defect data are first mapped into a higher-dimensional feature space and the software modules are then classified and predicted in that kernel space. Ma et al. (2012) used a single kernel function to map the defect prediction data into a higher-dimensional feature space, and then applied kernel principal component analysis (KPCA) for software defect prediction. Luo et al. (2012) mapped the defect prediction data into a higher-dimensional feature space in the same way, but used kernel partial least squares (KPLS) for software defect prediction. Ren et al. (2014) used an asymmetric kernel partial least squares classifier (AKPLSC) and an asymmetric kernel principal component analysis classifier (AKPCAC) to address the class-imbalance problem by applying kernel functions to the classifiers. All of them constructed an asymmetric classifier in kernel space, so as to improve software defect prediction by compensating for the bias of the regression model caused by class imbalance.

In general, multiple kernel learning has obvious advantages over single kernel learning, so in this paper we use multiple kernel learning to predict the defect proneness of software modules and propose the multiple kernel ensemble learning (MKEL) algorithm. Different from previous algorithms, MKEL integrates the advantages of multiple kernel learning and ensemble learning, and applies multiple kernel learning to software defect prediction in a boosting framework for the first time. Moreover, in the weight vector updating step, according to the characteristics of historical defect data, we take the cost of risk in software defect prediction into account to improve defect prediction performance.

3 Our approach

3.1 Problem formulation

Consider a given set of training software modules \(D=\{(x_i ,y_i ),i=1,2,\cdots ,N\}\) and a collection of \(M\) kernel functions \(K=\{k_j :X\times X\rightarrow {\mathbb {R}}, j=1,2,\cdots ,M\}\), where \(x\) is a vector of module attributes, \(y\in \{-1,+1\}\) is the module label, and \(k_j\) is a basic single kernel function. MKEL aims to learn a multi-kernel-based classifier \(f(x)\), which is an ensemble of kernel classifiers using the collection of \(M\) kernels trained from the given training software historical defect data, and then to predict the defect proneness of new software modules with \(f(x)\). Typically, we express such a multi-kernel-based classifier as:

$$\begin{aligned} f(x)=\sum _{t=1}^T {\alpha _t f_t (x)} , \end{aligned}$$
(1)

where \(T\) is the total number of boosting trials, \(f_t \) is a kernel-based hypothesis learned from the \(t\)th \((1\le t\le T)\) boosting trial, and \(\alpha _t \) is its associated weight in the final classifier. The main challenge of MKEL is to develop an effective boosting scheme to learn the optimal kernel-based hypothesis \(f_t \) and its combination weight \(\alpha _t \) at each boosting trial. When \(T\) boosting trials are completed, we obtain \(T\) kernel-based hypotheses and their combination weights; the final MKEL classifier is an ensemble of them. Once the MKEL classifier is obtained, it can be used to predict the defect proneness of new query modules.

3.2 Multiple kernel learning

Given a supervised machine learning problem \((x_i ,y_i )\in X\times Y\), we can map the input samples into a new feature space \(F=\{\Phi (x)|x\in X\}\) through a nonlinear mapping \(\Phi :X\rightarrow F,\;x\mapsto \Phi (x)\), and then consider the original learning problem in the new representation. The kernel trick replaces the dot product \(<\Phi (x),\Phi (x^{\prime })>\) in the kernel feature space by a kernel function \(k(x,x^{\prime })\) evaluated in the original feature space. Commonly used kernels include polynomial kernels and Gaussian radial basis function (RBF) kernels. The polynomial kernel has the form \(k(x,x^{\prime })=(<x,x^{\prime }>+\theta )^{v}\), and the RBF kernel can be expressed as \(k(x,x^{\prime })=\exp (-\gamma \parallel x-x^{\prime }\parallel _2^2 )\). Combining kernel functions with different characteristics exploits the advantages of multiple kernels and achieves better mapping capability, which is known as multiple kernel learning.
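For concreteness, the following minimal Python/NumPy sketch (function names and parameter values are our own illustration, not part of the original formulation) computes the two base kernels above and a weighted combination of the kind used in multiple kernel learning.

import numpy as np

def rbf_kernel(X, Xp, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2), computed for all pairs of rows
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Xp**2, axis=1)[None, :]
                - 2.0 * X @ Xp.T)
    return np.exp(-gamma * sq_dists)

def poly_kernel(X, Xp, theta=1.0, v=2):
    # k(x, x') = (<x, x'> + theta)^v
    return (X @ Xp.T + theta) ** v

def combined_kernel(X, Xp, kernels, betas):
    # Weighted sum of base kernel matrices, K = sum_m beta_m * k_m(x, x')
    return sum(b * k(X, Xp) for k, b in zip(kernels, betas))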

The synthetic kernel in multiple kernel learning is a convex combination of different kernel functions. It can be expressed as the weighted sum of basic kernels, \(K(x,x^{\prime })=\sum _{m=1}^M {\beta _m k_m (x,x^{\prime })} \), where \(\beta _m \) is the weight of the \(m\)th basic kernel. Within the multiple kernel framework, representing the samples in the feature space reduces to selecting the basic kernels and determining their weights. The goal of regular multiple kernel learning is to identify the optimal combination of the \(M\) basic kernels, denoted by \(\theta =(\theta _1 ,\cdots ,\theta _M )\), according to the maximum margin learning principle, which can be cast into the following optimization problem:

$$\begin{aligned} \mathop {\min }\limits _{\theta \in \Delta } \;\mathop {\min }\limits _{f\in H_{K(\theta )} } \frac{1}{2}\parallel f\parallel _{H_{K(\theta )} }^2 +\;C\sum _{i=1}^N {l(f(x_i ),y_i )} , \end{aligned}$$
(2)

where \(\Delta =\{\theta \in {\mathbb {R}}_+^M |\theta ^{\mathrm{T}}e_M=1\}\), \(K(\theta )(\cdot ,\cdot )=\sum _{j=1}^M {\theta _j k_j (\cdot ,\cdot )} \), \(l(f(x_i ),y_i )=\max (0,1-y_i f(x_i ))\), and \(e_M\) is an \(M\)-dimensional vector with all elements equal to 1. Formulation (2) can also be turned into the following min-max optimization task:

$$\begin{aligned} \mathop {\min }\limits _{\theta \in \Delta } \;\mathop {\max }\limits _{\alpha \in \Xi } \left\{ {\alpha ^{\mathrm{T}}e_N -\frac{1}{2}(\alpha \circ y)^{\mathrm{T}}\left( {\sum _{j=1}^M {\theta _j K^{j}} } \right) (\alpha \circ y)} \right\} , \end{aligned}$$
(3)

where \(K^{j}\in {\mathbb {R}}^{N\times N}\) with \(K_{p,q}^j =k_j (x_p ,x_q )\), \(\Xi =\{\alpha |\alpha \in [0,C]^{N}\}\), and \(\circ \) denotes the element-wise product between two vectors. Solving this problem as a joint optimization task involves a large amount of complex computation; to avoid this drawback, a boosting method is used to determine the parameters of the synthetic kernel in multiple kernel learning.

3.3 Multiple kernel ensemble learning

To learn a classifier with multiple kernels, we follow the typical procedure of a popular and successful boosting algorithm, AdaBoost. Specifically, after the training set is initialized, we repeatedly learn basic kernel classifiers through a series of boosting trials and then integrate them according to their combination weights to obtain the final MKEL classifier. The whole process is illustrated in Fig. 2.

Fig. 2 Schematic representation of AdaBoost

Before the boosting trials, the training set needs to be initialized. We perform a random sampling strategy directly on the entire training set and take the selected samples as the initial MKEL training set. After the training set initialization is completed, a distribution of weights \(D_t \) is maintained to indicate the importance of the training samples for learning. Initially, these weights are all equal. At each boosting trial, we adjust the weight vector \(D_t \) according to a specific strategy, so as to focus on the samples that need attention in the next boosting trial.

Once the initial training set and the weight vector \(D_t \) are obtained, the boosting trials can start. The key issue of the \(t\)th boosting trial is how to learn the kernel-based classifier \(f_t (x)\) from the training data. In the single kernel case, \(f_t (x)\) can be learned by applying any regular kernel method, e.g., the SVM used in our study, but in the multiple kernel case the situation becomes more complex. To obtain \(f_t (x)\), we first learn one single kernel classifier \(f_t^j (x)\) for each single kernel \(k_t^j \) using a regular kernel method. Based on this set of \(M\) base classifiers, we then measure the misclassification rate of each classifier \(f_t^j (x)\) with kernel \(k_t^j \) over the distribution \(D_t \) on the whole collection of training data:

$$\begin{aligned} \varepsilon _t =\varepsilon \Big (f_t^j (x)\Big )=\sum _{i=1}^N {D_t (i)} \Big (f_t^j (x_i )\ne y_i \Big ). \end{aligned}$$
(4)

As a result, the classifier with the smallest misclassification rate is selected as the weak classifier \(f_t (x)\) for the \(t\)th boosting trial:

$$\begin{aligned} f_t (x)=\mathop {\arg \min }\limits _{f_t^j ,j\in \{1,\cdots ,M\}} \varepsilon (f_t^j (x)). \end{aligned}$$
(5)
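A minimal sketch of this per-trial selection step is given below; it assumes precomputed kernel matrices and uses scikit-learn's SVC as the weighted SVM solver (the paper itself uses LIBSVM), with illustrative names.

import numpy as np
from sklearn.svm import SVC

def best_kernel_classifier(kernel_mats, y, D_t, C=1.0):
    """Pick the base kernel whose SVM has the smallest weighted error (Eqs. 4-5).

    kernel_mats: list of precomputed N x N kernel matrices, one per base kernel.
    y: labels in {-1, +1} as a NumPy array; D_t: current sample weight distribution.
    """
    best = None
    for j, K in enumerate(kernel_mats):
        clf = SVC(C=C, kernel="precomputed")
        clf.fit(K, y, sample_weight=D_t)           # weighted SVM on the j-th kernel
        err = np.sum(D_t * (clf.predict(K) != y))  # Eq. (4): weighted training error
        if best is None or err < best[1]:
            best = (clf, err, j)
    return best  # (classifier f_t, its error eps_t, index of the chosen kernel)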

To obtain the final classification from these weak classifiers, each classifier is assigned a weight \(\alpha \) in AdaBoost. For the \(t\)th boosting trial, the combination weight \(\alpha _t \) of \(f_t (x)\) is determined from its error rate by the following formula:

$$\begin{aligned} \alpha _t =\frac{1}{2}\ln \left( {\frac{1-\varepsilon _t }{\varepsilon _t }} \right) . \end{aligned}$$
(6)

After \(f_t (x)\) and its combination weight \(\alpha _t \) are obtained, the \(t\)th boosting trial is finished. Before the next trial starts, we must update the weight vector \(D_{t+1} \) for the \((t+1)\)th boosting trial according to the result of the \(t\)th trial. General boosting methods adjust the weight vector directly according to the classification results alone, so that samples correctly classified in the \(t\)th boosting trial decrease in weight and misclassified samples increase in weight. The goal is to focus on the misclassified samples in the next boosting trial (Sun et al. 2007).

In software defect prediction we should pay more attention to defective samples, because misclassifying a defective software module as non-defective is much more dangerous than misclassifying a non-defective module as defect-prone. To put more emphasis on defective samples, the class labels need to be taken into account when adjusting the weight vector \(D_{t+1} \). For defective samples, if they are incorrectly predicted in the \(t\)th boosting trial, their weight is increased; if they are correctly predicted, their weight is kept unchanged. For non-defective samples, if they are incorrectly predicted in the \(t\)th boosting trial, their weight is kept unchanged; if they are correctly predicted, their weight is decreased. The sample weight vector \(D_{t+1} \) is calculated by:

if the module is defective \((y_i =1)\):

$$\begin{aligned} D_{t+1} (i)=\frac{D_t (i)}{Sum(D)}\times \left\{ {{\begin{array}{ll} 1 &{}if\,\, f(x_i )=y_i \\ e^{\alpha _t }&{}if\,\, f(x_i )\ne y_i \\ \end{array} }} \right. , \end{aligned}$$
(7)

if the module is non-defective \((y_i =-1)\):

$$\begin{aligned} D_{t+1} (i)=\frac{D_t (i)}{Sum(D)}\times \left\{ {{\begin{array}{ll} e^{-\alpha _t }&{} if\,\, f(x_i )=y_i \\ 1 &{}if\,\, f(x_i )\ne y_i \\ \end{array} }} \right. . \end{aligned}$$
(8)
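The cost-sensitive update of Eqs. (7)-(8) can be sketched as follows (Python/NumPy; we normalize after applying the multiplicative factors so that the weights remain a distribution, which is our reading of the Sum(D) term).

import numpy as np

def update_weights(D_t, y, pred, alpha_t):
    """Cost-sensitive weight update of Eqs. (7)-(8).

    Defective samples (y = +1): weight grows by exp(alpha_t) if misclassified,
    otherwise stays the same. Non-defective samples (y = -1): weight shrinks by
    exp(-alpha_t) if correctly classified, otherwise stays the same.
    """
    factor = np.ones_like(D_t)
    factor[(y == 1) & (pred != y)] = np.exp(alpha_t)    # missed defects: emphasize
    factor[(y == -1) & (pred == y)] = np.exp(-alpha_t)  # easy non-defects: de-emphasize
    D_next = D_t * factor
    return D_next / D_next.sum()  # normalize so the weights form a distribution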

After \(D_{t+1} \) is calculated, AdaBoost starts the \((t+1)\)th iteration. When all of the training and weight-adjusting iterations are completed, \(T\) basic kernel classifiers and their combination weights are obtained, and the final MKEL classifier is an ensemble of them:

$$\begin{aligned} f(x)=sign\left( {\sum _{t=1}^T {\alpha _t f_t (x)} } \right) , \end{aligned}$$
(9)

and the details of the proposed MKEL algorithm are shown in Algorithm 1.

Algorithm 1 Multiple kernel ensemble learning (MKEL)
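Since Algorithm 1 is only available as a figure, the following sketch restates its main loop under our reading of Sects. 3.1-3.3; it reuses best_kernel_classifier and update_weights from the earlier sketches, and is an illustration rather than the authors' reference implementation (initialization of the boosting training set is described in Sect. 4.3 and omitted here).

import numpy as np

def train_mkel(kernel_mats, y, T=100, C=1.0):
    """Boosting loop of MKEL: at each of T trials, pick the best base-kernel SVM,
    weight it by alpha_t (Eq. 6), and update sample weights by Eqs. (7)-(8)."""
    N = len(y)
    D_t = np.full(N, 1.0 / N)           # initially all sample weights are equal
    ensemble = []                        # list of (alpha_t, kernel index, classifier)
    for t in range(T):
        clf, eps_t, j = best_kernel_classifier(kernel_mats, y, D_t, C=C)
        eps_t = np.clip(eps_t, 1e-10, 1 - 1e-10)      # guard against degenerate errors
        alpha_t = 0.5 * np.log((1 - eps_t) / eps_t)   # Eq. (6)
        pred = clf.predict(kernel_mats[j])
        D_t = update_weights(D_t, y, pred, alpha_t)   # Eqs. (7)-(8)
        ensemble.append((alpha_t, j, clf))
    return ensemble

def predict_mkel(ensemble, test_kernel_mats):
    """Eq. (9): sign of the weighted vote of the T weak kernel classifiers.
    test_kernel_mats[j] is the kernel matrix between test and training modules."""
    score = sum(a * clf.predict(test_kernel_mats[j]) for a, j, clf in ensemble)
    return np.sign(score)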

4 Experimental setup

In this section, we describe the experimental setup in detail, including the benchmark datasets, evaluation measures, and experiment design.

4.1 Benchmark datasets

In this paper, we experiment with 12 datasets from the NASA Metrics Data Program (MDP) to verify the effectiveness of the proposed algorithm. The NASA benchmark datasets are publicly available and have been widely used for software defect prediction. Each dataset represents a NASA software system or sub-system and contains the corresponding defect-marking data and various static code metrics (Gray et al. 2011). The static code metrics are measured by static software metric methods including Halstead and McCabe measures; they cover lines of code (LOC), operand and operator counts, readability, complexity, and so on. We list the 20 common basic metrics and their descriptions in Table 1.

Table 1 20 Metrics selected from NASA datasets

Table 2 gives the basic properties of the 12 NASA datasets, including the total number of attributes, the numbers of defective and non-defective modules, and their ratio. It is obvious that every dataset is class-imbalanced. It should be noted that the original data contain various duplicate entries; to make our results more credible, these datasets have been cleaned to remove the duplicates (Gray et al. 2011).

Table 2 NASA benchmark datasets after data cleaning

4.2 Evaluation measures

In software defect prediction, probability of detection (Pd), probability of false alarm (Pf), precision, and accuracy are four important measures for evaluating the performance of a prediction model. They can be defined using \(A\), \(B\), \(C\), and \(D\) in Table 3, where \(A\), \(B\), \(C\), and \(D\) are the number of defective modules predicted as defective, the number of defective modules predicted as defect-free, the number of defect-free modules predicted as defective, and the number of defect-free modules predicted as defect-free, respectively.

Table 3 Defect prediction metric

Pd is the ratio of the number of defective modules correctly classified as defective to the total number of defective modules, defined as \(A/(A+B)\). Pf is the ratio of the number of defect-free modules wrongly classified as defective to the total number of defect-free modules, defined as \(C/(C+D)\). Precision is the ratio of the number of defective modules correctly classified as defective to the number of modules classified as defective, defined as \(A/(A+C)\). Accuracy is the ratio of the number of correctly classified modules to the total number of modules, defined as \((A+D)/(A+B+C+D)\).

From the definitions of Pd and precision, a higher Pd means the prediction model aims to find as many defective modules as possible, and a higher precision means the prediction model aims to predict defective modules as correctly as possible. For software defect prediction, we hope that the prediction model can find more defective modules while making fewer mistakes, i.e., a good prediction model should achieve high values of both Pd and precision. However, there is a trade-off between these two measures: a high Pd often comes at the expense of a low precision, and vice versa. Therefore, a comprehensive measure combining Pd and precision is necessary when we evaluate the performance of a defect prediction model. This is the F-measure, the harmonic mean of Pd and precision, defined as:

$$\begin{aligned} F\hbox {-}measure=\frac{2\times \hbox {Pd}\times precision}{\hbox {Pd}+precision}. \end{aligned}$$

The value of the F-measure ranges from 0 to 1; the higher the value, the better the prediction performance. Compared with other evaluation measures, the F-measure evaluates the prediction model comprehensively and effectively; meanwhile, it is reasonably stable and not susceptible to the influence of parameter adjustment.
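For reference, the following small helper (our own illustration) computes all five measures from the confusion-matrix counts A, B, C and D of Table 3.

def defect_prediction_measures(A, B, C, D):
    """A: defective predicted defective, B: defective predicted defect-free,
    C: defect-free predicted defective, D: defect-free predicted defect-free."""
    pd = A / (A + B)                     # probability of detection
    pf = C / (C + D)                     # probability of false alarm
    precision = A / (A + C)
    accuracy = (A + D) / (A + B + C + D)
    f_measure = 2 * pd * precision / (pd + precision)  # harmonic mean of Pd and precision
    return pd, pf, precision, accuracy, f_measure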

4.3 Experimental design

To verify the effectiveness of the MKEL approach, we conduct a series of experiments. The experimental setup is as follows: for each of the NASA MDP datasets, we randomly select 50 % of the defective and non-defective samples as the training set, and the remaining 50 % are used for testing. Moreover, to obtain more general results, we repeat each algorithm 20 times on every dataset and report average performance.

We create a set of 30 base kernels, i.e., RBF kernels with 21 different widths \((2^{-10},2^{-9},\;\cdots ,\;2^{10})\) on all features and polynomial kernels of degrees 1 to 9 on all features. We map every NASA MDP dataset into higher-dimensional kernel spaces using these 30 base kernels. For SVM, the popular LIBSVM toolbox is adopted as the solver.
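A possible construction of this base kernel set is sketched below; it reuses the rbf_kernel and poly_kernel helpers from Sect. 3.2 and interprets the listed widths as the RBF parameter gamma, which is our assumption.

def build_base_kernels(X):
    """30 precomputed kernel matrices: 21 RBF widths 2^-10..2^10 and
    polynomial degrees 1..9, following the configuration described above."""
    kernels = [rbf_kernel(X, X, gamma=2.0 ** g) for g in range(-10, 11)]
    kernels += [poly_kernel(X, X, theta=1.0, v=d) for d in range(1, 10)]
    return kernels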

The boosting training set is initialized as follows: for each NASA MDP training set, we randomly select 40 % of the training samples, as suggested in Xia and Hoi (2013), and take them as the initial boosting training set. By default, we set the total number of boosting trials \(T\) to 100, so the final MKEL classifier is an ensemble of 100 kernel-based weak classifiers.
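Putting the setup together, a single experimental run could look like the following sketch; it reuses the helpers defined earlier, omits stratification of the 50 % split for brevity, and all names and details are illustrative rather than the exact experimental code.

import numpy as np

def run_once(X, y, seed=0, T=100):
    rng = np.random.default_rng(seed)
    # 50/50 random split into training and test modules (stratification omitted)
    idx = rng.permutation(len(y))
    train, test = idx[: len(y) // 2], idx[len(y) // 2:]
    # 40 % of the training samples form the initial boosting training set
    boot = rng.choice(train, size=int(0.4 * len(train)), replace=False)
    kernels = build_base_kernels(X[boot])
    ensemble = train_mkel(kernels, y[boot], T=T)
    # kernel matrices between test and boosting-set modules, one per base kernel
    test_kernels = ([rbf_kernel(X[test], X[boot], gamma=2.0 ** g) for g in range(-10, 11)]
                    + [poly_kernel(X[test], X[boot], theta=1.0, v=d) for d in range(1, 10)])
    return predict_mkel(ensemble, test_kernels), y[test]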

5 Experimental results

To evaluate the performance of our MKEL approach, we compare it with two state-of-the-art class-imbalance learning methods for software defect prediction, coding-based ensemble learning (CEL) (Sun et al. 2012) and the dynamic version of AdaBoost.NC (DVA) (Wang and Yao 2013). The comparison also includes several representative software defect prediction methods: weighted Naïve Bayes (NB) (Wang and Li 2010), compressed C4.5 decision tree (CC4.5) (Wang et al. 2012), cost-sensitive boosting neural network (CBNN) (Zheng 2010) and asymmetric kernel principal component classification (AKPCC) (Ma et al. 2012). For these previous algorithms, we adopt their default settings and the parameters suggested in the literature (random forest is used as the base classifier of CEL since it performed best in the original study; DVA employs 10-fold cross-validation, using 9/10 of the data at each time of building models, of which 8/9 serves as a training set and 1/9 as a validation set; CBNN employs fivefold cross-validation with a cost ratio varying from 1 to 20; an RBF kernel is used for AKPCC). For all the experiments, we repeat each algorithm 20 times on every NASA MDP dataset. In this section, we present the experimental results of our MKEL approach and the compared methods.

Table 4 shows the Pd and Pf values of our approach and the compared methods on the 12 NASA datasets. For each dataset, the Pd and Pf values of all methods are mean values over 20 runs. The average Pd and Pf values of MKEL across all datasets are 0.68 and 0.26. From Table 4, we can observe that the Pd values of MKEL are higher than the corresponding values of all other methods. This indicates that the proposed MKEL approach takes the misclassification costs into consideration, which makes the predictor tend to classify modules as defective in order to obtain higher Pd values. The Pf results suggest that, despite not achieving the best Pf values on most datasets, MKEL still achieves comparatively good results in contrast with the other methods.

Table 4 Experimental results: Pd and Pf comparisons on NASA’s 12 datasets

Table 5 shows the mean and standard deviation of the F-measure values of our approach and the compared methods over 20 random runs on the 12 NASA datasets. The average F-measure value of MKEL across all datasets is 0.48, and the corresponding standard deviation is 0.03. Our approach achieves higher F-measure values than all compared methods on all datasets, which indicates that the proposed approach achieves preferable prediction performance.

Table 5 F-measure values on 12 NASA datasets from different prediction methods

To statistically analyze the F-measure results in Table 5, we conduct a statistical test, McNemar's test (Yambor et al. 2000), which provides the statistical significance of the differences between MKEL and the other methods. Here, McNemar's test uses a significance level of 0.05, that is, if the p value is below 0.05, the performance difference between two compared methods is considered statistically significant. Table 6 shows the p values between MKEL and the other compared methods on the 12 NASA datasets, where only one value is slightly above 0.05. According to Table 6, the proposed approach indeed makes a statistically significant difference in comparison with the other methods for software defect prediction.
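For reference, one common chi-square formulation of McNemar's test with continuity correction can be sketched as follows (the paper does not state which variant of the test was used).

from scipy.stats import chi2

def mcnemar_p_value(b, c):
    """b: samples misclassified by method 1 only; c: misclassified by method 2 only.
    Chi-square approximation with continuity correction, 1 degree of freedom."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return chi2.sf(stat, df=1)  # survival function gives the p value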

Table 6 P values between MKEL and other compared methods on 12 NASA datasets

In the above experiments, we create a set of 30 base kernels for multiple kernel learning, randomly select 40% of the training samples as the initial boosting training set, and set the total number of boosting trials \(T\) to 100. When updating the sample weight vector, we use the MKEL weight updating strategy, which gradually increases the weights of the defective samples and decreases the weights of the non-defective samples. We use SVM as the basic kernel classifier. The average F-measure value of MKEL across all datasets is 0.48 under these settings. To assess the influence of these factors, we repeat the experiment with different numbers of base kernels, weight updating strategies, sampling ratios, numbers of boosting trials, and basic kernel classifiers. The average F-measure values are shown in Table 7, where \(M\) is the number of base kernels (\(M=1\) means a single kernel) and \(\mu \) is the initial sampling ratio. The general boosting weight updating strategy adjusts the weight vector directly according to the classification results alone (if \(f(x_i )=y_i \), \(D_{t+1} (i)=D_t (i) e^{-\alpha _t }/Sum(D)\); if \(f(x_i )\ne y_i \), \(D_{t+1} (i) = D_t(i) e^{\alpha _t }/Sum(D)\)). The results in Table 7 indicate that multiple kernels achieve better performance than a single kernel. The use of multiple kernels and the weight vector updating strategy contribute most to MKEL's performance; by contrast, the initial sampling ratio, the basic classifier, and the number of boosting trials influence performance relatively little. We conduct a statistical test to analyze the F-measure results in Table 7; the p values are shown in Table 8.

Table 7 Average F-measure value of 12 NASA datasets by using different parameters
Table 8 P values between MKEL and other different settings on 12 NASA datasets

6 Threats to validity

The study has limitations that are common to most of the empirical studies in the literature. The proposed MKEL approach uses a random selection strategy to initialize the training set, which means that the initial training set is built by randomly selecting samples from the original data. This is particularly problematic for the heavily class-imbalanced MC1, PC1 and PC2 datasets, even after data cleaning: for these datasets, it is more likely that no defective modules are selected into the initial training set. When the boosting training set is initialized, steps must therefore be taken to ensure that both defective and non-defective modules are included.

The analysis and conclusions presented in this paper are based on the metrics and defect data obtained from NASA projects. Although we expect our empirical results to generalize, the same analysis for another software system may yield different results, especially in a different application domain. Nevertheless, the proposed MKEL approach can be applied to other software systems to pursue higher quality with less testing effort. A software quality practitioner can use it to develop a useful defect predictor in the presence of class imbalance or when there are a large number of software metrics to work with.

7 Conclusion

Although multiple kernel learning has been shown to be effective in other domains, to the best of our knowledge this is the first attempt to improve software defect prediction performance by introducing the multiple kernel learning technique. Targeting the characteristics of defect data, we design a multiple kernel ensemble learning (MKEL) classifier to predict defective modules. By using the multiple kernel trick, it can fully exploit the information in historical data to improve predictive power. During training, we devise a new initialization strategy to balance defective and non-defective software modules, and we also use a new weight update strategy that keeps the focus on defective software modules, so MKEL provides an effective solution for software defect prediction.

Compared with several state-of-the-art and representative software defect prediction methods, the experiments on 12 NASA datasets show that the proposed MKEL approach performs better under the same experimental environment and significantly improves the average F-measure values on all datasets. In addition, the initialization and weight update strategies used during training alleviate the class-imbalance problem and effectively decrease the cost of misclassification risk, and they also improve the average F-measure values on all datasets compared with the general strategies. All of these results confirm that our MKEL approach can fully exploit the characteristics of historical data and improve prediction performance, making it an effective solution for the software defect prediction task.