Abstract
Analyzing the quality of data prior to constructing data mining models is emerging as an important issue. Algorithms for identifying noise in a given data set can provide a good measure of data quality. Considerable attention has been devoted to detecting class noise or labeling errors. In contrast, limited research work has been devoted to detecting instances with attribute noise, in part due to the difficulty of the problem. We present a novel approach for detecting instances with attribute noise and demonstrate its usefulness with case studies using two different real-world software measurement data sets. Our approach, called the Pairwise Attribute Noise Detection Algorithm (PANDA), is compared with a nearest-neighbor, distance-based outlier detection technique (denoted DM) investigated in related literature. Since what constitutes noise is domain specific, our case studies use a software engineering expert to inspect the instances identified by the two approaches and determine whether they actually contain noise. The results show that PANDA provides better noise detection performance than the DM algorithm.
Author information
Additional information
Jason Van Hulse is a Ph.D. candidate in the Department of Computer Science and Engineering at Florida Atlantic University. His research interests include data mining and knowledge discovery, machine learning, computational intelligence and statistics. He is a student member of the IEEE and IEEE Computer Society. He received the M.A. degree in mathematics from Stony Brook University in 2000, and is currently Director, Decision Science at First Data Corporation.
Taghi M. Khoshgoftaar is a professor at the Department of Computer Science and Engineering, Florida Atlantic University, and the director of the Empirical Software Engineering and Data Mining and Machine Learning Laboratories. His research interests are in software engineering, software metrics, software reliability and quality engineering, computational intelligence, computer performance evaluation, data mining, machine learning, and statistical modeling. He has published more than 300 refereed papers in these subjects. He has been a principal investigator and project leader in a number of projects with industry, government, and other research-sponsoring agencies. He is a member of the IEEE, the IEEE Computer Society, and IEEE Reliability Society. He served as the program chair and general chair of the IEEE International Conference on Tools with Artificial Intelligence in 2004 and 2005, respectively. Also, he has served on technical program committees of various international conferences, symposia, and workshops. He has served as North American editor of the Software Quality Journal, and is on the editorial boards of the journals Empirical Software Engineering, Software Quality, and Fuzzy Systems.
Haiying Huang received the M.S. degree in computer engineering from Florida Atlantic University, Boca Raton, Florida, USA, in 2002. She is currently a Ph.D. candidate in the Department of Computer Science and Engineering at Florida Atlantic University. Her research interests include software engineering, computational intelligence, data mining, software measurement, software reliability, and quality engineering.
About this article
Cite this article
Van Hulse, J.D., Khoshgoftaar, T.M. & Huang, H. The pairwise attribute noise detection algorithm. Knowl Inf Syst 11, 171–190 (2007). https://doi.org/10.1007/s10115-006-0022-x
DOI: https://doi.org/10.1007/s10115-006-0022-x