Abstract
Outlier detection has important applications in data mining, such as fraud detection, customer behavior analysis, and intrusion detection. It is the process of detecting data objects that are grossly different from, or inconsistent with, the rest of the data. Outliers are traditionally considered to be single points; however, many abnormal events have both temporal and spatial locality and may therefore form small clusters that should also be deemed outliers. In other words, not only a single point but also a small cluster can be an outlier. In this paper, we present a new definition of outliers, the cluster-based outlier, which is meaningful and gives weight to local data behavior, and we show how to detect such outliers with the clustering algorithm LDBSCAN (Duan et al. in Inf. Syst. 32(7):978–986, 2007), which is capable of finding clusters and assigning LOF (Breunig et al. in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, ACM Press, pp. 93–104, 2000) to single points.
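To make the LOF score mentioned above concrete, the following is a minimal Python sketch of the local outlier factor as defined by Breunig et al. (2000). It is not the LDBSCAN algorithm and not the authors' implementation. The function name `local_outlier_factors`, the toy data, and the choice k = 3 are assumptions made for illustration, and the neighborhood is simplified to exactly k points (ties at the k-distance are broken by index).

```python
import math


def local_outlier_factors(points, k):
    """Return one LOF score per point, following Breunig et al. (2000).

    Assumptions of this sketch: each neighborhood holds exactly k points
    (ties broken by index), and no point has k or more exact duplicates,
    which would make a reachability sum zero.
    """
    n = len(points)
    neighbors, k_dist = [], []
    for i in range(n):
        # Rank all other points by distance and keep the k nearest.
        ranked = sorted((math.dist(points[i], points[j]), j)
                        for j in range(n) if j != i)[:k]
        neighbors.append([j for _, j in ranked])
        k_dist.append(ranked[-1][0])  # distance to the k-th nearest neighbor

    def reach_dist(i, j):
        # Reachability distance of point i with respect to point j.
        return max(k_dist[j], math.dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbors[i]) for i in range(n)]
    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Toy data (hypothetical): a dense 5 x 5 grid, one isolated point,
# and a small, tight cluster of four points far from the grid.
grid = [(float(x), float(y)) for x in range(5) for y in range(5)]
isolated = [(-15.0, -15.0)]
small_cluster = [(20.0, 20.0), (20.0, 21.0), (21.0, 20.0), (21.0, 21.0)]
points = grid + isolated + small_cluster

scores = local_outlier_factors(points, k=3)
print("grid center :", scores[12])     # point (2, 2)
print("isolated    :", scores[25])
print("small group :", scores[26:30])
```

On this toy data the isolated point receives a LOF far above 1, whereas each member of the four-point cluster scores about 1 when k = 3, because its nearest neighbors are equally dense. This is one way to see why scoring single points alone may overlook small clusters, which is the case the cluster-based outlier definition is meant to address.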
References
Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (1998). Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Record, 27(2), 94–105. doi:10.1145/276305.276314.
Ankerst, M., Breunig, M. M., Kriegel, H., & Sander, J. (1999). OPTICS: ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD international conference on management of data (pp. 49–60). SIGMOD’99, Philadelphia, Pennsylvania, United States, May 31–June 03, 1999. New York: ACM Press.
Barnett, V., & Lewis, T. (1994). Outliers in statistical data. New York: Wiley.
Beyer, K. S., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is “nearest neighbor” meaningful? In C. Beeri & P. Buneman (Eds.), Lecture notes in computer science: Vol. 1540. Proceedings of the 7th international conference on database theory (pp. 217–235). January 10–12, 1999. London: Springer.
Breunig, M. M., Kriegel, H., Ng, R. T., & Sander, J. (2000). LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD international conference on management of data (pp. 93–104). SIGMOD’00, Dallas, Texas, United States, May 15–18, 2000. New York: ACM Press.
Carvalho, R., & Costa, H. (2007). Application of an integrated decision support process for supplier selection. Enterprise Information Systems, 1(2), 197–216. doi:10.1080/17517570701356208.
Crovella, M. E., & Bestavros, A. (1997). Self-similarity in World Wide Web traffic: evidence and possible causes. IEEE/ACM Transactions on Networking, 5(6), 835–846.
Duan, L., Xu, L., Guo, F., Lee, J., & Yan, B. (2007). A local-density based spatial clustering algorithm with noise. Information Systems, 32(7), 978–986. doi:10.1016/j.is.2006.10.006.
Ester, M., Kriegel, H., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. 2nd int. conf. on knowledge discovery and data mining (pp. 226–231). Portland: AAAI Press.
Guha, S., Rastogi, R., & Shim, K. (1998). CURE: an efficient clustering algorithm for large databases. In A. Tiwary & M. Franklin (Eds.), Proceedings of the 1998 ACM SIGMOD international conference on management of data (pp. 73–84). SIGMOD’98 Seattle, Washington, United States, June 01–04, 1998. New York: ACM Press.
Han, J., & Kamber, M. (2006). Data mining: concepts and techniques. Amsterdam: Elsevier.
Hawkins, D. (1980). Identification of outliers. London: Chapman and Hall.
He, Z., Xu, X., & Deng, S. (2003). Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9–10), 1641–1650. doi:10.1016/S0167-8655(02)00160-5.
Hinneburg, A., & Keim, D. (1998). An efficient approach to clustering in large multimedia databases with noise. In Proc. 4th int. conf. on knowledge discovery and data mining (pp. 58–65). New York.
Hinneburg, A., Aggarwal, C. C., & Keim, D. A. (2000). What is the nearest neighbor in high dimensional spaces? In A. E. Abbadi, M. L. Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter, & K. Whang (Eds.), Proceedings of the 26th international conference on very large data bases (pp. 506–515). Very large data bases, September 10–14, 2000. San Francisco: Morgan Kaufmann Publishers.
Hsu, C., & Wallace, W. A. (2007). An industrial network flow information integration model for supply chain management and intelligent transportation. Enterprise Information Systems, 1(3), 327–351. doi:10.1080/17517570701504633.
Jiang, M. F., Tseng, S. S., & Su, C. M. (2001). Two-phase clustering process for outliers detection. Pattern Recognition Letters, 22(6–7), 691–700.
Johnson, T., Kwok, I., & Ng, R. (1998). Fast computation of 2-dimensional depth contours. In Proc. 4th int. conf. on knowledge discovery and data mining (pp. 224–228). New York: AAAI Press.
Knorr, E. M., & Ng, R. T. (1998). Algorithms for mining distance-based outliers in large datasets. In A. Gupta, O. Shmueli, & J. Widom (Eds.), Proceedings of the 24th international conference on very large data bases (pp. 392–403). Very large data bases, August 24–27, 1998. San Francisco: Morgan Kaufmann Publishers.
Knorr, E. M., & Ng, R. T. (1999). Finding intensional knowledge of distance-based outliers. In M. P. Atkinson, M. E. Orlowska, P. Valduriez, S. B. Zdonik, & M. L. Brodie (Eds.), Proceedings of the 25th international conference on very large data bases (pp. 211–222). Very large data bases, September 07–10, 1999. San Francisco: Morgan Kaufmann Publishers.
Li, H., & Xu, L. (2001). Feature space theory—a mathematical foundation for data mining. Knowledge-Based Systems, 14(5–6), 253–257. doi:10.1016/S0950-7051(01)00103-4.
Li, H., Xu, L., Wang, J., & Mo, Z. (2003). Feature space theory in data mining: transformations between extensions and intensions in knowledge representation. Expert Systems, 20(2), 60–71. doi:10.1111/1468-0394.00226.
Luo, J., Xu, L., Jamont, J., Zeng, L., & Shi, Z. (2007). Flood decision support system on agent grid: method and implementation. Enterprise Information Systems, 1(1), 49–68. doi:10.1080/17517570601092184.
Ng, R., & Han, J. (2002). CLARANS: a method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering, 14(5), 1003–1016.
Preparata, F., & Shamos, M. (1988). Computational geometry: an introduction. Berlin: Springer.
Qiu, G., Li, H., Xu, L., & Zhang, W. (2003). A knowledge processing method for intelligent systems based on inclusion degree. Expert Systems, 20(4), 187–195. doi:10.1111/1468-0394.00243.
Ramaswamy, S., Rastogi, R., & Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on management of data (pp. 427–438). SIGMOD’00, Dallas, Texas, United States, May 15–18, 2000. New York: ACM Press.
Sheikholeslami, G., Chatterjee, S., & Zhang, A. (1998). WaveCluster: a multi-resolution clustering approach for very large spatial databases. In A. Gupta, O. Shmueli, & J. Widom (Eds.), Proceedings of the 24th international conference on very large data bases (pp. 428–439). Very large data bases, August 24–27, 1998. San Francisco: Morgan Kaufmann Publishers.
Shi, Z., Huang, Y., He, Q., Xu, L., Liu, S., Qin, L., Jia, Z., Li, J., Huang, H., & Zhao, L. (2007). MSMiner-a developing platform for OLAP. Decision Support Systems, 42(4), 2016–2028. doi:10.1016/j.dss.2004.11.006.
Tukey, J. W. (1977). Exploratory data analysis. Reading: Addison–Wesley.
Wang, W., Yang, J., & Muntz, R. R. (1997). STING: a statistical information grid approach to spatial data mining. In M. Jarke, M. J. Carey, K. R. Dittrich, F. H. Lochovsky, P. Loucopoulos, & M. A. Jeusfeld (Eds.), Proceedings of the 23rd international conference on very large data bases (pp. 186–195). Very large data bases, August 25–29, 1997. San Francisco: Morgan Kaufmann Publishers.
Xu, L. (2006). Advances in intelligent information processing. Expert Systems, 23(5), 249–250. doi:10.1111/j.1468-0394.2006.00405.x.
Xu, L., Liang, N., & Gao, Q. (2008). An integrated approach for agricultural ecosystem management. IEEE Transactions on Systems Man and Cybernetics, Part C, 38(3).
Zhang, M., Xu, L., Zhang, W., & Li, H. (2003). A rough set approach to knowledge reduction based on inclusion degree and evidence reasoning theory. Expert Systems, 20(5), 298–304. doi:10.1111/1468-0394.00254.
Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: an efficient data clustering method for very large databases. In J. Widom (Ed.), Proceedings of the 1996 ACM SIGMOD international conference on management of data (pp. 103–114). SIGMOD’96 Montreal, Quebec, Canada, June 04–06, 1996. New York: ACM Press.
Cite this article
Duan, L., Xu, L., Liu, Y. et al. Cluster-based outlier detection. Ann Oper Res 168, 151–168 (2009). https://doi.org/10.1007/s10479-008-0371-9
DOI: https://doi.org/10.1007/s10479-008-0371-9