1 Introduction

Testing is mostly performed during the development phase, with little done after maintenance. It can help to detect future defects in the system and can also help to build a model for defect prediction [1].

Most of today's organizations are still searching for a defect prediction model that can be applied to any type of software, but such a model is still under development. Prediction models generally focus on the following aspects:

  1. Finding the bugs in the software system;

  2. Checking the reliability of the software over a time frame;

  3. Understanding the effect of the design process on defects and failures.

The most widely used technique for defect detection is testing. Testing effort depends on the size of the project and may be simple or complicated accordingly. If a relation can be established between software metrics and defects, defects can be predicted more easily in other projects or in other parts of the same project [2].

Object-oriented patterns are widely used in software development, and software metrics serve as quality predictors for object-oriented (OO) software. Researchers and practitioners have suggested various metrics to measure software quality [3].

Software metrics are collected with the help of automated tools and are used by defect prediction models to predict defects in the system. A fault prediction model generally has one dependent variable and several independent variables. The dependent variable states whether a software module is faulty or not, while metrics such as process metrics and product metrics serve as independent variables; cyclomatic complexity and lines of code, for example, are method-level product metrics [4].
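
As a minimal illustration of this arrangement (the column names and metric values below are hypothetical, not taken from this study), such a dataset places the metrics as independent variables and a binary fault label as the dependent variable:

```python
import pandas as pd

# Hypothetical module-level records: product metrics as independent
# variables, a binary "faulty" flag as the dependent variable.
data = pd.DataFrame({
    "lines_of_code":         [120, 45, 300, 80],
    "cyclomatic_complexity": [14,   3,  27,  6],
    "faulty":                [1,    0,   1,  0],   # 1 = defective module
})

X = data[["lines_of_code", "cyclomatic_complexity"]]  # independent variables
y = data["faulty"]                                     # dependent variable
```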

Cross-company defect prediction (CCDP) builds defect predictors using data from other software systems and companies, and this data may be heterogeneous in nature. Cross-project defect prediction uses data from within-company projects or cross-company projects. Cross-company (CC) data involves facts from many different projects and is more diversified than within-company (WC) data [5]. Handling this heterogeneity requires considerable analysis, and various members of the research community have addressed it. A generalized defect prediction model would help maintenance and testing teams to analyse software most effectively. This study focuses on using software metrics to develop prediction models for defect prediction, covering within-company, cross-project and cross-company defect prediction. Further, an effort has been made to check and compare the accuracy of each using various mathematical parameters.

2 Literature survey

Zimmermann et al. [6] evaluated the performance of cross-project defect prediction using data from 12 projects (622 combinations). Of these combinations, only 21 pairs resulted in efficient prediction performance. The data distributions of the source and target projects differ, which results in low prediction performance. It is usually assumed that training and test data follow the same distribution; this assumption holds for within-project prediction but may not be suitable for cross-project prediction. Cross-project prediction can be considered along two dimensions: the domain and the company. Zimmermann et al. noted that many software companies may not have local data for defect prediction because they are small or do not have any historical data. They examined data sets from F. and IE, experimented on these web browsers, and found that F. data could predict defects in IE very well, but the reverse was not true. They concluded that "building a model from a small population to predict a larger one is likely more difficult than the reverse direction".

Zhang et al. [7] proposed a universal model for defect prediction that can be used for within-company and cross-company projects. One issue in building a cross-company defect predictor is the variation in data distribution. To overcome this, the authors first suggested collecting data and then transforming the training and testing data to make their distributions more similar. They proposed a context-free rank transformation to limit changes in the data distribution before applying the data to the universal defect prediction model. The authors used six context factors for prediction, along with 21 code metrics and 5 process metrics. Their experimental results show higher AUC values and higher recall than within-project models, and better AUC for cross-project prediction.

Ma et al. [8] proposed a Transfer Naive Bayes (TNB) algorithm, a novel transfer learning algorithm for defect prediction in cross-company projects. The advantage of transfer learning is that it allows the training and testing data to be heterogeneous. They used an instance-transfer approach, which assigns weights to source instances according to their contribution to the prediction model. They used four performance metrics, probability of detection (PD), F-measure, probability of false alarms (PF) and AUC, to measure the performance of the defect predictor, and showed that TNB gives good performance.

Mahaweerawat et al. [9] introduced a new approach for predicting faults in object-oriented software systems. In this approach, a neural network with supervised learning is used: a multi-layer perceptron (MLP) neural network with back-propagation identifies fault-prone classes, and a radial basis function (RBF) neural network clusters faults of the same type. Their experimental results show 90% accuracy in predicting the faultiness of a module/class.

Aggarwal et al. [10] proposed a model to discover the dependency of faults on the object-oriented design metrics of a software product. They used data from Java applications containing 136 classes and applied the principal component method for data preprocessing. Univariate logistic regression was used to check the effect of individual software metrics on fault proneness, and the prediction model was developed using multivariate logistic regression. The model gives a sensitivity of 86.5% and specificity above 90%.

Singh and Salaria [11] used various software metrics and machine learning methods to find software defects. They discussed the uses of neural networks in fields such as data mining and image processing. Experimental data was collected from the PROMISE repository and divided in a 17:3 ratio for training and testing. The Levenberg–Marquardt (LM) algorithm was used for training, resulting in 88% accuracy.

Canfora et al. [12] proposed a multi-objective approach for cross-company defect prediction that uses a multivariate logistic regression model trained with a genetic algorithm. The approach addresses both defect prediction and cost-effectiveness. A multi-objective genetic algorithm (GA) is used for training, with software metrics as independent variables. They used ten datasets from the PROMISE repository and performed data standardization to reduce the effect of heterogeneity. The model achieves better cost-effectiveness than within-project predictors and gives better results than single-objective predictors.

Lessmann et al. [13] used metric-based classification for defect prediction. They designed a defect prediction study in which 22 classification methods were tested on 10 public-domain data sets collected from the NASA Metrics Data Program (MDP) and PROMISE repositories. They used the area under the receiver operating characteristics curve (AUC) to measure model performance. The classifiers were divided into several categories such as neural networks, statistical approaches, support vector machines, ensembles, nearest-neighbor methods and tree-based methods. The data was split randomly into 2/3 for training and 1/3 for performance evaluation. Results show AUC above 0.7 for most of the classifiers. They observed that sophisticated classifiers such as RndFor, LS-SVMs, MLPs and Bayesian networks produce better results, while simple classifiers are still good enough to analyze the correlation between static code attributes and software defects.

Kumar et al. [14] compared fuzzy logic and artificial neural network methods for predicting the defect density (DD) of software, using mean absolute error (MAE), root mean square error (RMSE) and graphical analysis for performance measurement. Defect density (DD) is an attribute used to assess the reliability of a software product. They used data from two projects of different domains. The fuzzy inference system (FIS) gives between 73% and 77% accuracy, whereas the ANN gives up to 85% accuracy with an RMSE of 0.3872.

Kaur et al. [15] presented a survey of object-oriented metrics as software quality factors. Using 22 software metrics and data from three projects, they identified the metrics that can be used to assess the quality level of software.

Kaur et al. [16] designed a framework that identifies software code smells to analyze software quality. They used a feed-forward neural network (FFNN) with eight object-oriented metrics. Their framework provides good results and also demonstrates a relationship between object-oriented metrics and bad smells.

3 Collection of data

Data is collected from the Bugzilla database for two versions of Mozilla Firefox (2.0 and 3.0) and for one version of Mozilla SeaMonkey. The Bugzilla database contains detailed information on all errors (bugs) found in these projects. Another system, Licq (UNIX based), was chosen for cross-company and cross-domain analysis; its bug data is obtained from the GitHub community. Licq has only 280 classes and is the smallest of all the datasets (Table 1).

Table 1 Dataset details

4 Multi-layer perceptron (MLP) model based on neural network

Various machine learning techniques have been adopted for defect prediction, such as Naïve Bayes, Random Forest and SVM. In this paper an FFNN-based model is proposed.

An FFNN uses a back-propagation learning algorithm and is applied to a wide variety of problems. Its neurons are organized into different layers, namely an input layer, hidden layers and an output layer, with adjacent layers connected to each other. Figure 1 shows the model used in this paper.

Fig. 1
figure 1

Feed forward neural network

A weight coefficient wij is assigned to each connection between two neurons (i, j); it determines the importance of that connection in the FFNN. The following equation can be used to determine the output of a neuron in a layer

$$a = x_{1} w_{1} + x_{2} w_{2} + x_{3} w_{3} + \cdots + x_{n} w_{n}.$$
(1)

In this paper the input layer consists of seven neurons and the hidden layer contains three neurons. The input neurons correspond to the object-oriented metrics NOC [3], RFC [3], DIT [3], WMC [3], CBO [3], LCOM [3] and LCOM5 [17, 18].

Many artificial neural network (ANN) experiments use the multi-layer perceptron (MLP) method. An MLP is an FFNN that uses back-propagation as its training algorithm. The FFNN provides a general framework for representing mappings between input and output variables; activation functions are used to express a nonlinear function of several variables as compositions of nonlinear functions of a single variable.

In this paper, Hyperbolic Tangent Sigmoid Function (tansig) and Linear Transfer Function (purelin) are used as activation functions for the layers.
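
A minimal NumPy sketch of the forward pass of such a 7–3–1 network is given below. It assumes, as is common, that tansig is applied on the hidden layer and purelin on the output layer; the weights and metric values are random placeholders, not the trained values from this study.

```python
import numpy as np

def tansig(a):
    # Hyperbolic tangent sigmoid activation (MATLAB-style tansig)
    return np.tanh(a)

def purelin(a):
    # Linear transfer function: output equals the weighted sum
    return a

rng = np.random.default_rng(0)

# 7 inputs (NOC, RFC, DIT, WMC, CBO, LCOM, LCOM5) -> 3 hidden neurons -> 1 output
W_hidden = rng.normal(size=(3, 7))   # placeholder weights; learned by back-propagation in practice
b_hidden = rng.normal(size=3)
W_out = rng.normal(size=(1, 3))
b_out = rng.normal(size=1)

x = np.array([2, 35, 3, 18, 7, 12, 0.6])   # one class described by its 7 metric values (hypothetical)

a_hidden = W_hidden @ x + b_hidden         # weighted sums, Eq. (1), one per hidden neuron
h = tansig(a_hidden)                       # hidden-layer activations
y = purelin(W_out @ h + b_out)             # predicted faultiness score
```

In practice the weights would be fitted by back-propagation on the training data rather than drawn at random.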

5 Performance evaluation parameters

Performance evaluation parameters are needed to validate the proposed model. In this paper these parameters are calculated from the confusion matrix; without them, the performance of the proposed model cannot be evaluated. This section defines the parameters used to measure the model's performance.

Table 2 shows the confusion matrix for fault prediction. The confusion matrix has four categories, as follows:

Table 2 Confusion matrix for defect prediction
  1. True positives (TP): number of faulty classes correctly classified as faulty.

  2. False positives (FP): number of non-faulty classes incorrectly predicted as faulty.

  3. True negatives (TN): number of non-faulty classes correctly predicted as non-faulty.

  4. False negatives (FN): number of faulty classes incorrectly predicted as non-faulty [19].
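
As an illustrative sketch (the label vectors are hypothetical), these four counts can be obtained directly from the actual and predicted class labels, for example with scikit-learn:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = faulty class, 0 = non-faulty (hypothetical labels)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # labels predicted by the model

# With labels ordered as [0, 1] the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tp, fp, tn, fn)
```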

The following performance parameters are used to measure the proposed model's performance:

Precision

Precision refers to the closeness of the predictions to each other; in defect prediction it corresponds to the fraction of classes predicted as faulty that are actually faulty. It is defined as:

$${\text{Precision }} = \frac{TP}{FP + TP}.$$
(2)

Accuracy

It is the ratio of correctly classified modules to the total number of modules. It is defined as:

$${\text{Accuracy }} = \frac{TN + TP}{TP + TN + FP + FN}.$$
(3)

Mean square error (MSE)

The MSE is the average of the squared "errors", i.e. the squared differences between the actual and predicted values:

$$MSE = \frac{1 }{n}\mathop \sum \limits_{i = 1}^{n} \left( {y_{i}^{'} - y_{i} } \right)^{2} ,$$
(4)

where n is the total number of samples, y is the output generated by the model and y′ is the expected output.
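
Eqs. (2)–(4) translate directly into code; the confusion-matrix counts and label vectors below are hypothetical and consistent with the sketch above.

```python
import numpy as np

# Hypothetical counts taken from a confusion matrix (see previous sketch)
tp, fp, tn, fn = 3, 1, 3, 1

precision = tp / (tp + fp)                   # Eq. (2)
accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. (3)

# Eq. (4): average squared difference between expected (y') and predicted (y) outputs
y_expected = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)
y_predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0], dtype=float)
mse = np.mean((y_predicted - y_expected) ** 2)

print(precision, accuracy, mse)   # 0.75, 0.75, 0.25
```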

Receiver operating characteristics (ROC)

The ROC curve shows the performance of a binary classifier in graphical form. It is plotted using the true positive rate (TPR) and the false positive rate (FPR). The area under the ROC curve (AUC) is used to compare different ROC curves; higher AUC values indicate a better classifier [20].
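
A short sketch of how the ROC curve and AUC are typically computed with scikit-learn; the label and score vectors here are hypothetical, and in this study the continuous FFNN output would play the role of the score.

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # actual fault labels (hypothetical)
y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7, 0.1]   # continuous model outputs (hypothetical)

fpr, tpr, thresholds = roc_curve(y_true, y_score)    # points of the ROC curve
auc = roc_auc_score(y_true, y_score)                 # area under the ROC curve
print(auc)
```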

6 Results

In this paper accuracy, precision, MSE and the ROC curve are used to evaluate the performance of the model; higher accuracy means the model performs better.

Table 3 shows the results for 500 epochs, where the model is trained on one dataset and tested on all datasets; this train-on-one, test-on-all protocol is sketched after the table.

Table 3 Results for 500 epochs
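
The following sketch outlines the train-on-one, test-on-all protocol behind Tables 3, 4 and 5. The use of scikit-learn's MLPClassifier and the dataset-loading interface are illustrative assumptions standing in for the actual FFNN and data used in this study.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# datasets: mapping of name -> (metric matrix X, fault labels y); loading omitted
def evaluate_all(datasets, epochs=500):
    results = {}
    for train_name, (X_train, y_train) in datasets.items():
        # 7 metric inputs -> 3 hidden neurons, tanh activation, as in the proposed model
        model = MLPClassifier(hidden_layer_sizes=(3,), activation="tanh", max_iter=epochs)
        model.fit(X_train, y_train)
        for test_name, (X_test, y_test) in datasets.items():
            acc = accuracy_score(y_test, model.predict(X_test))
            results[(train_name, test_name)] = acc
    return results
```

Pairs where the training and testing names match correspond to within-project prediction; the remaining pairs correspond to cross-project or cross-company prediction.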

From these results it can be seen that the Licq dataset has the highest precision, i.e. 1, meaning that Licq produces outputs closest to each other, but it does not achieve high accuracy. Ver. 3.0 has the lowest precision, i.e. 0.0139, but it achieves the highest accuracy for defect prediction, i.e. 98.77%, when tested using Ver. 2.0 and Ver. 3.0. The accuracy of the proposed model reaches up to 99%, which means the model trained for 500 epochs has a good accuracy rate when applied to the same or a subsequent version. The corresponding graphs of these data are shown in Figs. 2, 3, 4 and 5.

Fig. 2
figure 2

a Training performance of Ver. 2.0, ROC curve for b Ver. 2.0 using Ver. 2.0, c Ver. 3.0 using Ver. 2.0, d SM 1.0.1 using Ver. 2.0, e Licq using Ver. 2.0

Fig. 3
figure 3

a Training performance of Ver. 3.0, ROC curve for b Ver. 2.0 using Ver. 3.0, c Ver. 3.0 using Ver. 3.0, d SM 1.0.1 using Ver. 3.0, e Licq using Ver. 3.0

Fig. 4
figure 4

a Training performance of SM 1.0.1, ROC curve for b Ver. 2.0 using SM 1.0.1, c Ver. 3.0 using SM 1.0.1, d SM 1.0.1 using SM 1.0.1, e Licq using SM 1.0.1

Fig. 5
figure 5

a Training performance of Licq, ROC curve for b Ver. 2.0 using Licq, c Ver. 3.0 using Licq, d SM 1.0.1 using Licq, e Licq using Licq

Table 4 shows the results of the proposed model for 1000 epochs, i.e. the model is trained for 1000 iterations.

Table 4 Results for 1000 Epochs

From these results it can be seen that the proposed model gives accuracy in the range of 55 to 99%, with the highest accuracy shown by Ver. 2.0 and Ver. 3.0. The high accuracy values indicate that the model proposed in this paper effectively identifies defective modules in the software. The corresponding graphs of these data are shown in Figs. 6, 7, 8 and 9.

Fig. 6
figure 6

a Training performance of Ver. 2.0, ROC curve for b Ver. 2.0 using Ver. 2.0, c Ver. 3.0 using Ver. 2.0, d SM 1.0.1 using Ver. 2.0, e Licq using Ver. 2.0

Fig. 7
figure 7

a Training performance of Ver. 3.0, ROC curve for b Ver. 2.0 using Ver. 3.0, c Ver. 3.0 using Ver. 3.0, d SM 1.0.1 using Ver. 3.0, e Licq using Ver. 3.0

Fig. 8
figure 8

a Training performance of SM 1.0.1, ROC curve for b Ver. 2.0 using SM 1.0.1, c Ver. 3.0 using SM 1.0.1, d SM 1.0.1 using SM 1.0.1, e Licq using SM 1.0.1

Fig. 9
figure 9

a Training performance of Licq, ROC curve for b Ver. 2.0 using Licq, c Ver. 3.0 using Licq, d SM 1.0.1 using Licq, e Licq using Licq

Table 5 shows the results of the proposed model for 2000 epochs, i.e. the training is performed for 2000 iterations.

Table 5 Results for 2000 Epochs

From Table 5 it can be seen that the models have MSE values below 0.4; Ver. 3.0 has an MSE of 0.0123 using Ver. 2.0 and Ver. 3.0, which is the lowest of all. The model performs better for cross-project prediction but does not achieve good accuracy for defect prediction across cross-company projects. The Licq dataset has the highest precision, i.e. 1, when tested using the same version, and SM 1.0.1 has the lowest precision, i.e. 0.0174, when tested using Licq. The related graphs for this table are shown in Figs. 10, 11, 12 and 13.

Fig. 10
figure 10

a Training performance of Ver. 2.0, ROC curve for b Ver. 2.0 using Ver. 2.0, c Ver. 3.0 using Ver. 2.0, d SM 1.0.1 using Ver. 2.0, e Licq using Ver. 2.0

Fig. 11
figure 11

a Training performance of Ver. 3.0, ROC curve for b Ver. 2.0 using Ver. 3.0, c Ver. 3.0 using Ver. 3.0, d SM 1.0.1 using Ver. 3.0, e Licq using Ver. 3.0

Fig. 12
figure 12

a Training performance of SM 1.0.1, ROC curve for b Ver. 2.0 using SM 1.0.1, c Ver. 3.0 using SM 1.0.1, d SM 1.0.1 using SM 1.0.1, e Licq using SM 1.0.1

Fig. 13
figure 13

a Training performance of Licq, ROC curve for b Ver. 2.0 using Licq, c Ver. 3.0 using Licq, d SM 1.0.1 using Licq, e Licq using Licq

7 Conclusion

The results show that more training produces better results: with 500 epochs the model performs well only for within-company projects, whereas with 1000 and 2000 epochs it also works well for cross-project prediction, as shown in Tables 4 and 5. In the case of the Licq dataset, however, the results do not improve; the reason may be the small size of the Licq dataset.

Compared to previous work, these results are better. The proposed model gives an AUC of 0.821 when Firefox Ver. 3.0 is tested using Firefox Ver. 3.0, and 0.815 for SM 1.0.1 when the model is trained with Firefox Ver. 2.0. The model proposed in [7], which uses clustering for classification and Cliff ranking for analysis, is tested on only a few datasets, so it may not be applicable to other datasets; the authors themselves state this as a limitation of their analysis. The model proposed in [8], based on the Naïve Bayes technique, also transfers results from one dataset to others to predict defects, but it does not provide a defined model for cross-project and cross-company prediction. The model proposed in [21], based on the Random Forest technique, uses Just-In-Time (JIT) defect prediction, which requires more training for more accurate results. After analyzing these results, we can say that the proposed model is well suited for predicting defects in within-company projects as well as in cross-project settings, but for cross-company projects the results are not as good as those of the within-company and cross-project models. The reasons may lie in the different domains involved and need to be analysed further to achieve higher accuracy. To improve the accuracy of the prediction model, various other machine learning techniques can be tested (Table 6).

Table 6 Comparison to previous work