Abstract
Software defect detection aims to automatically identify defective software modules for efficient software testing, in order to improve the quality of a software system. Although many machine learning methods have been successfully applied to this task, most of them fail to consider two practical yet important issues in software defect detection. First, it is difficult to collect a large amount of labeled training data for learning a well-performing model; second, a software system usually contains far fewer defective modules than defect-free modules, so learning must be conducted over an imbalanced data set. In this paper, we address these two practical issues simultaneously by proposing a novel semi-supervised learning approach named Rocus. This method exploits the abundant unlabeled examples to improve detection accuracy, and employs under-sampling to tackle the class-imbalance problem during learning. Experimental results on real-world software defect detection tasks show that Rocus is effective: its performance is better than that of a semi-supervised learning method that ignores the class-imbalance nature of the task, and of a class-imbalance learning method that does not make effective use of unlabeled data.
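The two ingredients the abstract combines can be sketched in a few lines. The following is a minimal illustrative sketch, not the paper's actual Rocus algorithm: it pairs random under-sampling of the majority (defect-free) class with a self-training-style pseudo-labeling loop over unlabeled modules. The function names, the single-feature 1-nearest-neighbour base learner, and the toy data are all assumptions introduced here for clarity.

```python
import random

def undersample(labeled, seed=0):
    """Randomly drop majority-class (defect-free, label 0) examples
    until both classes are the same size."""
    rng = random.Random(seed)
    pos = [ex for ex in labeled if ex[1] == 1]
    neg = [ex for ex in labeled if ex[1] == 0]
    major, minor = (neg, pos) if len(neg) >= len(pos) else (pos, neg)
    return minor + rng.sample(major, len(minor))

def nn_predict(train, x):
    """1-nearest-neighbour on a single numeric feature (toy base learner)."""
    return min(train, key=lambda ex: abs(ex[0] - x))[1]

def rocus_like(labeled, unlabeled, rounds=3):
    """Each round: rebalance the labeled pool by under-sampling, then
    pseudo-label the unlabeled modules and add them to the pool."""
    pool = list(labeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        balanced = undersample(pool)
        pool += [(x, nn_predict(balanced, x)) for x in unlabeled]
        unlabeled = []
    return undersample(pool)

# Toy data: feature = a code-complexity score, label 1 = defective.
labeled = [(0.9, 1), (0.8, 1), (0.1, 0), (0.2, 0), (0.15, 0), (0.3, 0)]
unlabeled = [0.85, 0.05]
model = rocus_like(labeled, unlabeled)
print(nn_predict(model, 0.95))  # a high-complexity module is flagged defective
```

The final pool is class-balanced by construction, which is the point of the under-sampling step: without it, the pseudo-labeling loop would be dominated by the defect-free majority.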
Additional information
This work was supported by the National Natural Science Foundation of China under Grant Nos. 60975043, 60903103, and 60721002.
Cite this article
Jiang, Y., Li, M. & Zhou, Z.H. Software Defect Detection with Rocus. J. Comput. Sci. Technol. 26, 328–342 (2011). https://doi.org/10.1007/s11390-011-9439-0