An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets

Wang, Xi-Te; Shen, De-Rong; Bai, Mei; Nie, Tie-Zheng; Kou, Yue; Yu, Ge

doi:10.1007/s11390-015-1596-0

An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets

Regular Paper
Published: 20 November 2015

Volume 30, pages 1233–1248, (2015)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Journal of Computer Science and Technology Aims and scope Submit manuscript

An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets

Download PDF

Xi-Te Wang¹,
De-Rong Shen¹,
Mei Bai¹,
Tie-Zheng Nie¹,
Yue Kou¹ &
…
Ge Yu¹

214 Accesses
11 Citations
Explore all metrics

Abstract

The distance-based outlier is a widely used definition of outlier. A point is distinguished as an outlier on the basis of the distances to its nearest neighbors. In this paper, to solve the problem of outlier computing in distributed environments, DBOZ, a distributed algorithm for distance-based outlier detection using Z-curve hierarchical tree (ZH-tree) is proposed. First, we propose a new index, ZH-tree, to effectively manage the data in a distributed environment. ZH-tree has two desirable advantages, including clustering property to help search the neighbors of a point, and hierarchical structure to support space pruning. We also design a bottom-up approach to build ZH-tree in parallel, whose time complexity is linear to the number of dimensions and the size of dataset. Second, DBOZ is proposed to compute outliers in distributed environments. It consists of two stages. 1) To avoid calculating the exact nearest neighbors of all the points, we design a greedy method and a new ZH-tree based k-nearest neighbor searching algorithm (ZHkNN for short) to obtain a threshold LW. 2) We propose a filter-and-refine approach, which first filters out the unpromising points using LW, and then outputs the final outliers through refining the remaining points. At last, the efficiency and the effectiveness of ZH-tree and DBOZ are testified through a series of experiments.

Article PDF

A Fast Outlier Detection Algorithm for Big Datasets

Fast Top-k Distance-Based Outlier Detection on Uncertain Data

Entropy-based outlier detection using spark

Article 16 April 2019

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

References

Hawkins D M. Identification of Outliers. Springer, 1980.
Barnett V, Lewis T. Outliers in Statistical Data. Wiley New York, 1994.
Rousseeuw P J, Leroy A M. Robust Regression and Outlier Detection. John Wiley & Sons, 2003.
Knorr E M, Ng R T. Algorithms for mining distancebased outliers in large datasets. In Proc. the 24th International Conference on Very Large Data Bases, August 1998, pp. 392-403.
Ramaswamy S, Rastogi R, Shim K. Efficient algorithms for mining outliers from large data sets. ACM SIGMOD Record, 2000, 29(2): 427–438.
Article Google Scholar
Angiulli F, Pizzuti C. Outlier mining in large high dimensional data sets. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(2): 203–215.
Article MathSciNet Google Scholar
Angiulli F, Pizzuti C. Fast outlier detection in high dimensional spaces. In Proc. the 6th European Conference on Principles of Data Mining and Knowledge Discovery, August 2002, pp. 15-27.
Schubert E, Zimek A, Kriegel H P. Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection. Data Mining and Knowledge Discovery, 2014, 28(1): 190–237.
Article MathSciNet MATH Google Scholar
Shiokawa H, Fujiwara Y, Onizuka M. Scan++: Efficient algorithm for finding clusters, hubs and outliers on large-scale graphs. Proceedings of the VLDB Endowment, 2015, 8(11): 1178–1189.
Article Google Scholar
Aggarwal C C, Yu P S. Outlier detection for high dimensional data. ACM SIGMOD Record, 2001, 30(2): 37–46.
Article Google Scholar
Breunig M M, Kriegel H P, Ng R T, Sander J. LOF: Identifying density-based local outliers. ACM SIGMOD Record, 2000, 29(2): 93–104.
Article Google Scholar
He Z, Xu X, Deng S. Discovering cluster-based local outliers. Pattern Recognition Letters, 2003, 24(9/10): 1641-1650.
Article MATH Google Scholar
Bay S D, Schwabacher M. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proc. the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2003, pp. 29-38.
Angiulli F, Fassetti F. Very efficient mining of distance based outliers. In Proc. the 16th ACM Conference on Information and Knowledge Management, November 2007, pp. 791-800.
Guttman A. R-trees: A dynamic index structure for spatial searching. ACM SIGMOD Record, 1984, 14(2): 47–57.
Article Google Scholar
Ciaccia P, Patella M, Zezula P. M-tree: An efficient access method for similarity search in metric spaces. In Proc. the 23rd International Conference on Very Large Databases, August 1997, pp. 426-435.
Beyer K, Goldstein J, Ramakrishnan R, Shaft U. When is “nearest neighbor” meaningful? In Proc. the 7th International Conference on Database Theory, January 1999, pp. 217-235.
Wang B, Yang X C, Wang G R, Yu G. Outlier detection over sliding windows for probabilistic data streams. Journal of Computer Science and Technology, 2010, 25(3): 389–400.
Article MathSciNet Google Scholar
Cao K Y, Wang G R, Han D H, Ding G H, Wang A X, Shi L X. Continuous outlier monitoring on uncertain data streams. Journal of Computer Science and Technology, 2014, 29(3): 436–448.
Article MathSciNet Google Scholar
Gupta M, Gao J, Aggarwal C, Han J. Outlier detection for temporal data. Synthesis Lectures on Data Mining and Knowledge Discovery, 2014, 5(1): 1–129.
Article Google Scholar
Yan Y, Zhang J, Huang B, Sun X, Mu J, Zhang Z, Moscibroda T. Distributed outlier detection using compressive sensing. In Proc. the 2015 ACM SIGMOD International Conference on Management of Data, May 31-June 04, 2015, pp. 3-16.
Hung E, Cheung D W. Parallel mining of outliers in large database. Distributed and Parallel Databases, 2002, 12(1): 5–26.
Article MATH Google Scholar
Lozano E, Acuña E. Parallel algorithms for distance-based and density-based outliers. In Proc. the 5th IEEE International Conference on Data Mining, November 2005, pp. 729- 732.
Angiulli F, Basta S, Lodi S, Sartori C. Distributed strategies for mining outliers in large data sets. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(7): 1520–1532.
Article Google Scholar
Jagadish H V, Ooi B C, Tan K, Yu C, Zhang R. iDistance: An adaptive B⁺−tree based indexing method for nearest neighbor search. ACM Transaction Database System, 2005, 30(2): 364–397.
Article Google Scholar
Ramsak F, Markl V, Fenk R, Zirkel M, Elhardt K, Bayer R. Integrating the UB-tree into a database system kernel. In Proc. the 26th International Conference on Very Large Data Bases, September 2000, pp. 263-272.
O’Malley O. Terabyte sort on apache hadoop. http://sortbenchmark.org/YahooHadoop.pdf, October 2015.

Download references

Author information

Authors and Affiliations

College of Information Science and Engineering, Northeastern University, Shenyang, 110819, China
Xi-Te Wang, De-Rong Shen, Mei Bai, Tie-Zheng Nie, Yue Kou & Ge Yu

Authors

Xi-Te Wang
View author publications
You can also search for this author in PubMed Google Scholar
De-Rong Shen
View author publications
You can also search for this author in PubMed Google Scholar
Mei Bai
View author publications
You can also search for this author in PubMed Google Scholar
Tie-Zheng Nie
View author publications
You can also search for this author in PubMed Google Scholar
Yue Kou
View author publications
You can also search for this author in PubMed Google Scholar
Ge Yu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Xi-Te Wang.

Additional information

Special Section on Networking and Distributed Computing for Big Data

This work was supported by the National Basic Research 973 Program of China under Grant No. 2012CB316201, the National Natural Science Foundation of China under Grant Nos. 61033007 and 61472070, and the Fundamental Research Funds for the Central Universities of China under Grant No. N120816001.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Wang, XT., Shen, DR., Bai, M. et al. An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets. J. Comput. Sci. Technol. 30, 1233–1248 (2015). https://doi.org/10.1007/s11390-015-1596-0

Download citation

Received: 15 May 2015
Revised: 14 October 2015
Published: 20 November 2015
Issue Date: November 2015
DOI: https://doi.org/10.1007/s11390-015-1596-0

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets

Abstract

Article PDF

Similar content being viewed by others

A Fast Outlier Detection Algorithm for Big Datasets

Fast Top-k Distance-Based Outlier Detection on Uncertain Data

Entropy-based outlier detection using spark

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

An Efficient Algorithm for Distributed Outlier Detection in Large Multi-Dimensional Datasets

Abstract

Article PDF

Similar content being viewed by others

A Fast Outlier Detection Algorithm for Big Datasets

Fast Top-k Distance-Based Outlier Detection on Uncertain Data

Entropy-based outlier detection using spark

Explore related subjects

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation