1 Introduction

In software engineering, defect prediction is an important step toward improving quality in less time and at lower cost. It helps identify sensitive, defect-prone areas during software testing, so that the quality of the software system can be improved at reduced cost. Detecting potential faults at an early stage also supports effective planning, control and execution of software development activities. As software development has grown in importance and keeps pace with demand, reviewing and testing of software systems remain essential and contribute directly to predicting defects. Predicting software defects often involves considerable cost, and correcting software defects is altogether very expensive [25]. Studies carried out in recent years indicate that defect prediction is assuming greater importance than the testing and reviewing of software systems [36, 43]. Accurate prediction of software defects therefore helps to improve software testing, reduce expenses [10] and improve software quality [22].

This paper analyzes and compares research on software defect prediction published from 1992 to 2015, but highlights only the unique and most recent methodologies (2005–2015). Its objective is to critically assess the efficacy of the methods adopted for predicting software defects. The evaluation of the various prediction approaches shows the effectiveness and importance of Advanced Machine Learning, Neural Network and Support Vector Machine methodologies, which are applied more frequently than other techniques to achieve the desired accuracy in predicting defects in software systems. The paper also highlights the need for further work with newer methodologies, since the earlier ones have not led to a defect-free, or at least minimally defective, software system, which is what ultimately yields quality software.

2 Literature Review

To perform the analysis, we explored 102 papers published between 1992 and 2015 in various digital libraries and venues, including IEEE Transactions on Software Engineering, ACM, Springer, Elsevier and Science Direct, as well as international conferences, reports, theses, technical papers and case studies. We found that most research on software defect prediction followed similar patterns, methodologies and techniques and used nearly the same datasets. Papers based on similar patterns, methodologies, techniques or datasets were therefore excluded, and we included only the 49 papers found to be unique and recent (2005–2015) in this field. Various methodologies have been applied to software defect prediction since 1992, and in recent years several of them have proved favourable for the task. Only the methodologies considered unique and recent have been analyzed and compared, and the results help determine which methodologies are the most frequently used and the most effective for predicting software defects.

2.1 Predicting Defects of Software System Using Data Mining (DM)

Campan et al. [9] experimented with Length Ordinal Association Rules on datasets to search for interesting new rules. Song et al. [44] emphasized Rule Mining methodologies for predicting and correcting software defects. Kamei et al. [26] proposed a methodology combining Logistic Regression analysis with Association Rule Mining for predicting software defects. Chang et al. [12] combined Decision Tree and Classification methodologies, as Action Based Defect Prediction (ABDP), with Association Rule Mining to predict and discover software defect patterns under minimum support and confidence. Gray et al. [21] experimented with a Support Vector Machine (SVM) classifier based on static code metrics and NASA datasets to maintain defective classes and remove redundant instances. Riquelme et al. [39] applied Genetic Algorithms to find rules characterizing subgroups that predict defects, using software metrics datasets extracted from the PROMISE repository. Gayatri et al. [19] combined an Induction methodology with Decision Trees; the new feature selection method performed better than the SVM and RELIEF methodologies. Gray et al. [20] analyzed SVM classifiers on NASA datasets to identify software defects; the basic idea was to classify the training data rather than obtain test datasets. Liu et al. [34] experimented with a new Genetic Programming based search methodology for evaluating the quality of software systems and found that the Validation-cum-Voting classifier was better than the Baseline and Validation classifiers. Tao and Wei-Hua [46] found that the Multi-Variants Gauss Naive Bayes methodology was superior to other versions of Naive Bayes and to the J48 algorithm for predicting software defects.
Catal [11] reviewed methodologies used from 1990 to 2009 for predicting software defects, including Logistic Regression, Classification Trees, Optimised Set Reduction (OSR), Artificial Neural Networks and the discriminant model. Kaur and Sandhu [28] found that accuracy was higher for software systems analyzed with K-Means. Tan et al. [45] predicted software defects using functional clusters of programs at the class or file level, which significantly improved recall and precision. Dhiman et al. [15] used a clustering approach in which software defects are categorized and measured separately in each cluster. Kaur and Kumar [30] applied a clustering methodology for forecasting, including error forecasting, in object-oriented software systems. Najadat and Alsmadi [37] showed that the Ridor algorithm, evaluated alongside other classification approaches on NASA datasets, is an effective methodology for predicting software defects with higher accuracy. Sehgal et al. [41] focused on applying the J48 Decision Tree algorithm to defect prediction in software systems; the performance of the new methodology was evaluated against the IDE algorithm and the Natural Growing Gas (NGG) methodology. Banga [7] found that a hybrid architecture called GP-GMDH or GMDH-GP was more effective than other methodologies on the ISBSG datasets. Chug and Dhall [14] applied different supervised and unsupervised learning methodologies to different NASA datasets for defect prediction. Okutan and Yildiz [38] proposed a kernel methodology for predicting software defects based on pre-computed kernel metrics; the proposed methodology was found to be comparable with existing methodologies such as Linear Regression and IBK.
Selvaraj and Thangaraj [42] predicted software defects using SVM and compared its effectiveness with the Naive Bayes and Decision Stump methodologies. Adline and Ramachandran [3] proposed predicting the fault-proneness of program modules when the fault levels of the modules are not available; supervised methodologies such as a Genetic Algorithm were applied for classification and fault prediction. Agarwal and Tomar [4] observed that the Linear Twin Support Vector Machine (LTSVM), combined with F-score based feature selection, was superior to other methodologies. Sankar et al. [40] advocated a feature selection methodology using SVM and Naive Bayes classifiers based on the F-mean metric for predicting and measuring defects in software systems.
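The minimum support and confidence thresholds mentioned for association rule mining in this subsection can be illustrated with a small sketch. It is not taken from any of the cited studies: the module records, the attribute names (`high_complexity`, `large_size`, `defective`) and the threshold values are all hypothetical and chosen only to show how the two measures are computed for a single rule.

```python
# Toy illustration of support and confidence for one association rule.
# All records, attribute names and thresholds are hypothetical.

def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for record in records if items <= record)
    return hits / len(records)

def confidence(records, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    base = support(records, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so its confidence is undefined; report 0
    return support(records, set(antecedent) | set(consequent)) / base

# Each record is the set of attributes observed for one software module.
modules = [
    {"high_complexity", "large_size", "defective"},
    {"high_complexity", "defective"},
    {"high_complexity", "large_size"},
    {"large_size"},
    {"high_complexity", "large_size", "defective"},
    set(),
    {"large_size", "defective"},
    {"high_complexity"},
]

# Rule under test: high_complexity -> defective
rule_support = support(modules, {"high_complexity", "defective"})
rule_confidence = confidence(modules, {"high_complexity"}, {"defective"})

# A rule is kept only if it clears both (hypothetical) thresholds.
MIN_SUPPORT, MIN_CONFIDENCE = 0.3, 0.5
keep_rule = rule_support >= MIN_SUPPORT and rule_confidence >= MIN_CONFIDENCE
print(rule_support, rule_confidence, keep_rule)
```

In this toy data the rule covers 3 of 8 modules, and 3 of the 5 high-complexity modules are defective, so the rule passes both thresholds. A real rule miner would enumerate many candidate rules rather than test a single hand-written one.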

2.2 Predicting Defects of Software System Using Machine Learning (ML)

Boetticher [8] analyzed the K-Nearest Neighbour (K-NN) algorithm, or sampling, for predicting software defects and found that it did not perform well on small datasets. Ardil et al. [5] applied one of the simplest forms of Artificial Neural Network and compared it with other Neural Network models. Chen et al. [13] predicted software defects using Bayesian Networks and Probabilistic Relational Models (PRM). Jianhong et al. [23] showed that the Resilient Backpropagation algorithm, based on neural networks, was a superior methodology for predicting software defects. Xu et al. [47] evaluated the effectiveness of software metrics for defect prediction by applying various Statistical and Machine Learning methodologies. Gao and Khoshgoftaar [17] predicted software defects on class-imbalanced, high-dimensional data; in their approach, modelling and feature selection were carried out using the original and the sampled data alternately. Li et al. [32] found that sample-based methodologies such as the active semi-supervised methodology ACoForest performed better than Random Sampling with both conventional machine learners and semi-supervised learners. Kaur [29] used software metrics together with a Neural Network to identify modules suitable for multiple uses. Abaei and Selamat [1] applied various machine learning and artificial intelligence methodologies to different public NASA datasets for predicting software defects. Askari and Bardsiri [6] predicted software defects using a Multilayer Neural Network; a Support Vector Machine with the learning algorithm and evolutionary methodologies were also used to remove the defects. Gayathri and Sudha [18] applied a Bell function based Multi-Layer Perceptron Neural Network along with Data Mining for predicting defects in software systems and compared its performance with other Machine Learning methodologies.
Jing et al. [24] proposed an efficient model using an Advanced Machine Learning methodology, Collaborative representation classification for Software Defect Prediction (CSDP). Kaur and Kaur [27] predicted defects in classes using Machine Learning methodologies with different classifiers. Li and Wang [33] compared Ensemble Learning methodologies (AdaBoost and SmoothBoost) with SVM, KNN, Naive Bayes, Logistic and C4.5 for predicting software fault-proneness on imbalanced NASA datasets. Malhotra [35] applied different ML methodologies to predict defects and to estimate the relationship among static code measures. Yang et al. [48] used a Learning-to-Rank methodology for predicting defects in software systems and compared its effectiveness with other methodologies. Abaei et al. [2] studied the effectiveness of a new version of a semi-supervised methodology on eight NASA and Turkish datasets and reported high accuracy in predicting software defects. Erturk and Sezer [16] proposed a new methodology, the Adaptive Neuro-Fuzzy Inference System (ANFIS), and compared it with other methodologies (SVM and ANN) on the PROMISE repository for predicting software defects. Laradji et al. [31] found that an Average Probability Ensemble (APE) comprising seven classifiers was superior to the weighted SVM and Random Forest methodologies; a new version of APE incorporating greedy forward selection was more efficient at removing duplicate and unnecessary features. Zhang et al. [49] predicted software effort using a methodology based on Bayesian Regression Expectation Maximization (BREM).
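Several of the studies above report accuracy, precision or recall, and some work on imbalanced datasets. The following minimal sketch, using hypothetical labels rather than any dataset from the cited work, shows how these measures are derived from a confusion matrix and why accuracy alone can be misleading when few modules are defective.

```python
# Evaluation measures for a binary defect predictor (1 = defective, 0 = clean).
# The labels and predictions below are hypothetical.

def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for paired actual and predicted labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1      # defective module correctly flagged
        elif p:
            fp += 1      # clean module wrongly flagged
        elif a:
            fn += 1      # defective module missed
        else:
            tn += 1      # clean module correctly passed
    return tp, fp, tn, fn

def evaluate(actual, predicted):
    """Compute accuracy, precision, recall and F1, guarding against empty denominators."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Imbalanced toy data: 2 defective modules out of 20.
actual = [1, 1] + [0] * 18
all_clean = [0] * 20                # predictor that never flags anything
flagged = [1, 0, 1, 1] + [0] * 16   # predictor that flags three modules

baseline = evaluate(actual, all_clean)
candidate = evaluate(actual, flagged)
print(baseline)
print(candidate)
```

The predictor that never flags a module reaches 90 % accuracy yet finds no defects (recall 0), while the second predictor has lower accuracy but locates one of the two defective modules. This is one reason precision, recall and related measures are commonly reported alongside accuracy.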

3 Methodology

The methodology of this paper was to analyze and compare only the distinct, unique and recent methodologies (2005–2015) for predicting software defects. The methodologies were compared on the basis of the reviewed studies, and the results showed that Advanced Machine Learning, Neural Network and Support Vector Machine methodologies are the most commonly used techniques for predicting software defects. A summary of the major findings is given in Table 1.

Table 1 Summary of major findings of different software defect prediction methodologies

Figure 1 shows the methodologies used in software defect prediction from 2005 to 2015. Comparing them across the reviewed studies shows that Advanced Machine Learning, Neural Network and Support Vector Machine techniques are used more frequently than other techniques for predicting software defects.

Fig. 1 Methods used in software defect prediction

Figure 2 shows the datasets used in software defect prediction. Studies using public datasets account for 64.79 % of the total, whereas studies using private datasets account for 35.21 %. The freely distributed public datasets come mostly from the PROMISE Repository and the NASA Metrics Data Program. Private datasets are not publicly distributed and generally belong to private companies.

Fig. 2 Datasets used in software defect prediction

4 Key Analysis

The analysis of the techniques applied to software defect prediction to date has led to the following observations:

  (a) Accurate prediction of software defects in the initial, design-level phase of the software development lifecycle can improve software quality, increase customer satisfaction and considerably reduce overall cost, time and the need for further work.

  (b) Reducing the effort of defect prediction while increasing accuracy and efficiency requires identifying newer methods and datasets and applying more sophisticated methodologies that are appropriate and have a positive, effective impact on the prediction of software defects.

  (c) Although considerable work has been done on predicting software defects using various parameters, sufficient work has not yet been done on defect prediction for web applications and open source software. Further research is therefore needed to find more effective methodologies that produce better results with higher accuracy in predicting software defects.

5 Challenging Issues

Critical analysis has brought to light several challenging issues that require immediate attention and timely solutions. For various reasons, applying these methodologies is not free of problems. Most studies used open source or public datasets, so their methods may not work effectively on private and commercial datasets. Moreover, owing to privacy issues, proprietary datasets are not publicly available; wider availability of proprietary datasets could help cross-project defect prediction achieve higher accuracy. Although various open or public datasets are available for defect prediction, they do not all have the same number or the same types of metrics. The metrics are collected from different domains, and a defect prediction model based on object-oriented metrics is not applicable to different metrics or a different feature space. Cross-project defect prediction is therefore not easy, and a cross-project defect prediction model is unlikely to gain wide acceptance, although it is acknowledged that such a model would be very useful for industry. None of the defect prediction models proposed so far can guarantee its prediction results. Further studies are essential on new metrics, new models or new development processes that perform better, are result-oriented and are widely acceptable.
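The feature-space mismatch described above can be made concrete with a small sketch. It assumes two hypothetical project datasets whose modules are described by partly different metrics, and it applies one simple workaround, keeping only the metrics that both projects share. This workaround is an illustration only and is not a method proposed in this paper; the metric names and values are invented.

```python
# Aligning two defect datasets that were measured with different metric sets.
# Metric names and values are hypothetical.

def shared_metrics(source_rows, target_rows):
    """Return the sorted metric names present in every row of both datasets."""
    common = None
    for row in list(source_rows) + list(target_rows):
        keys = set(row)
        common = keys if common is None else common & keys
    return sorted(common or [])

def project(rows, metrics):
    """Turn each row (a dict of metric values) into a list ordered by `metrics`."""
    return [[row[name] for name in metrics] for row in rows]

# Source project: size, complexity and an object-oriented metric (wmc).
source = [
    {"loc": 120, "cyclomatic": 14, "wmc": 9},
    {"loc": 40, "cyclomatic": 3, "wmc": 2},
]
# Target project: no object-oriented metrics, but a Halstead measure instead.
target = [
    {"loc": 300, "cyclomatic": 22, "halstead_volume": 1500.0},
]

common = shared_metrics(source, target)
source_matrix = project(source, common)
target_matrix = project(target, common)
print(common, source_matrix, target_matrix)
```

Only `cyclomatic` and `loc` survive the intersection, so a model trained on the source project loses the object-oriented information it may have relied on. This is one way to see why models built on one metric set transfer poorly to projects measured differently.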

6 Conclusion and Future Work

Defect prediction in software systems is crucial because it is an important step toward enhancing software quality. With proper methodologies, it can help greatly in directing test efforts, reducing costs and improving the quality and reliability of software. Research in this field has been under way since 1992, and a large volume of work has been produced over the last 25 years or so, but some areas are still lacking and the associated issues remain to be solved. The unique and recent works (2005–2015) were analyzed separately, and the findings reveal that Advanced Machine Learning (AML), Neural Network (NN) and Support Vector Machine (SVM) methodologies are used more frequently than all other techniques for predicting software defects. It was also observed that studies using public datasets account for 64.79 %, whereas studies using private datasets account for only 35.21 %. We conclude that, although different methodologies have been applied, no single methodology can be considered foolproof for predicting software defects. Further work is essential that applies newer methodologies at the initial stage of defect prediction, with special emphasis on public datasets, and that is more result-oriented and more accurate. This work should facilitate further research and encourage the design of newer software metrics with the potential to achieve higher prediction accuracy.