Abstract
Managing, querying, and mining uncertain data have become important because most real-world data is now accompanied by uncertainty. Uncertainty in data is often caused by deficiencies in the underlying data-collection equipment, or is sometimes introduced deliberately to preserve data privacy. This work addresses the problem of distance-based outlier detection on uncertain datasets whose objects follow Gaussian distributions. The naive approach to distance-based outlier detection on uncertain data is usually infeasible because the distance function is expensive to evaluate. A cell-based approach is therefore proposed to identify outliers quickly. Because the Gaussian distribution has unbounded support, it is difficult to devise effective pruning techniques, so an approximate approach using a bounded Gaussian distribution is also proposed. Approximating the Gaussian distribution by a bounded Gaussian distribution yields an approximate but more efficient cell-based outlier detection approach. An extensive empirical study on synthetic and real datasets shows that the proposed approaches are effective, efficient, and scalable.
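To make the cost of the naive approach concrete, the following is a minimal sketch and not the authors' algorithm. It assumes one-dimensional objects, each Gaussian with its own mean and a shared standard deviation, and it assumes that an object is flagged as an outlier when the expected number of objects within distance D of it does not exceed N(1 - p). The function names, the parameters D and p, and this threshold are illustrative assumptions that the abstract does not state. The sketch relies on the standard fact that the difference of two independent Gaussian variables is itself Gaussian.

```python
import math


def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_within(mu_i, mu_j, sigma, D):
    """P(|X_i - X_j| <= D) for independent X_i ~ N(mu_i, sigma^2)
    and X_j ~ N(mu_j, sigma^2).

    The difference X_i - X_j is distributed as N(mu_i - mu_j, 2 * sigma^2).
    """
    s = math.sqrt(2.0) * sigma  # standard deviation of the difference
    d = mu_i - mu_j             # mean of the difference
    return normal_cdf((D - d) / s) - normal_cdf((-D - d) / s)


def naive_outliers(means, sigma, D, p):
    """Return indices of objects whose expected neighbor count is small.

    Assumed rule (hypothetical, for illustration): object i is an outlier
    if the expected number of other objects within distance D is at most
    N * (1 - p). Every pair is evaluated, so the cost is quadratic in N.
    """
    n = len(means)
    threshold = n * (1.0 - p)
    outliers = []
    for i, mu_i in enumerate(means):
        expected = sum(
            prob_within(mu_i, mu_j, sigma, D)
            for j, mu_j in enumerate(means)
            if j != i
        )
        if expected <= threshold:
            outliers.append(i)
    return outliers


if __name__ == "__main__":
    # Four objects clustered near 0 and one far away at 10.
    print(naive_outliers([0.0, 0.1, 0.2, 0.3, 10.0], sigma=0.1, D=1.0, p=0.9))
```

Every object is compared with every other object, so the number of distance-function evaluations grows quadratically with the dataset size. That quadratic cost is what a cell-based pruning scheme aims to avoid.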
Cite this article
Shaikh, S.A., Kitagawa, H. Efficient distance-based outlier detection on uncertain datasets of Gaussian distribution. World Wide Web 17, 511–538 (2014). https://doi.org/10.1007/s11280-013-0211-y
DOI: https://doi.org/10.1007/s11280-013-0211-y