Abstract
The graph anomaly detection problem occurs in many application areas and can be solved by spotting outliers in unstructured collections of multi-dimensional data points, which can be obtained by graph analysis algorithms. We implement the algorithm for the small community analysis and the approximate LOF algorithm based on Locality-Sensitive Hashing, apply the algorithms to a real world graph and evaluate scalability of the algorithms. We use Apache Spark as one of the most popular Big Data frameworks.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
D. Reed and J. Dongarra, “Exascale computing and big data: the next frontier,” Commun. ACM 57 (7), 56–68 (2014).
M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets. HotCloud,” in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010, pp. 10–10. https://doi.org/static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf. Accessed 2018.
J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Commun. ACM51, 107–113 (2010).
L. Akoglu, H. Tong, and D. Koutra, “Graph based anomaly detection and description: a survey,” Data Min. Knowl. Disc1ov 29, 626–688 (2015). https://doi.org/arxiv.org/pdf/1404.4679.pdf. Accessed 2018.
Z. Li, H. Xiong, and Y. Liu, “Detecting blackholes and volcanoes in directed networks,” arXiv:1005. 2179 (2010). https://doi.org/arxiv.org/pdf/1005.2179.pdf. Accessed 2018.
L. Akoglu, M. McGlohon, and C. Faloutsos, “OddBall: Spotting Anomalies in Weighted Graphs,” in Proceedings of the 14th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining PAKDD’10, 2010, Part 2, pp. 410–421. https://doi.org/repository.cmu.edu/cgi/viewcontent.-gi?article=3599&context=compsci. Accessed 2018.
M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: identifying density-based local outliers,” in Proceedings of the ACM SIGMOD 2000 International Conference on Management of Data, 2010. https://doi.org/www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf. Accessed 2018.
R. Weber, H.-J. Schek, and S. Blott, “A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces,” in Proceedings of the 24rd International Conference on Very Large Data Bases (Morgan Kaufmann, 1998), pp. 194–205.
D. T. Lee and C. K. Wong, “Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees,” Acta Inf. 9, 23–29 (1977).
H. Koga, T. Ishibashi, and T. Watanabe, “Fast agglomerative hierarchical clustering algorithm using localitysensitive hashing,” Knowledge Inf. Syst. 12, 25–53 (2007).
A. Beygelzimer, S. Kakade, and J. Langford, “Cover trees for nearest neighbor,” in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 97–104. https://doi.org/homes.cs.washington.-edu/sham/papers/ml/cover_tree.pdf. Accessed 2018.
R. Weber and S. Blott, “An approximation-based data structure for similarity search,” Technical Report No. 24, ESPRIT Project HERMES No. 9141 (1997).
S. Ramaswamy and K. Rose, “Adaptive cluster-distance bounding for nearest neighbor search in image databases,” IEEE Int. Conf. Image Process. 6, 381–384 (2007). https://doi.org/citeseerx.ist.psu.edu/viewdoc/-download?doi=10.1.1.80.6562&rep=rep1&type=pdf.
Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, “Multi-probe LSH: efficient indexing for high-dimensional similarity search,” in Proceedings of the VLDB Conference, 2007, pp. 950–961. https://doi.org/www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf.
T. S. Teixeira, G. Teodoro, E. Valle, and J. H. Saltz, “Scalable locality-sensitive hashing for similarity search in high-dimensional, large-scale multimedia datasets,” arXiv:1310. 4136 (2013); http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdfhttps://doi.org/arxiv.org/pdf/1310.4136.pdf.
Z. Yang, W. T. Ooi, and Q. Sun, “Hierarchical, non-uniform locality sensitive hashing and its application to video identification,” in Proceedigns of the IEEE International Conference on Multimedia and Expo ICME, IEEE Cat. No. 04TH8763 (2004), Vol. 1, pp. 743–746. https://doi.org/www.comp.nus.edu.sg/ooiwt/papers/lsh-icme04-final.pdf. Accessed 2018.
V. Stegailov, N. Orekhov, and G. Smirnov, “HPC hardware efficiency for quantum and classical molecular dynamics,” in Proceedigns of the International Conference on Parallel Computing Technologies (Springer, 2015).
G. Smirnov and V. Stegailov, “Efficiency of classical molecular dynamics algorithms on supercomputers,” Math. Models Comput. Simul. 8, 734–743 (2016).
M. Armbrust, R. Xin, C. Lian, Y. Huai, D. Liu, J. Bradley, and M. Zaharia, “Spark sql: Relational data processing in spark,” in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015, pp. 1383–1394. https://doi.org/amplab.cs.berkeley.edu/wpcontent/uploads/2015/03/SparkSQLSigmod2015.pdf. Accessed 2018.
A. Agarkov, T. Ismagilov, D. Makagon, A. Semenov, and A. Simonov, “Performance evaluation of the Angara interconnect,” in Proceedings of the International Conference Russian Supercomputing Days, 2016, pp. 626–639. https://doi.org/www.dislab.org/docs/rsd2016-angara-bench.pdf. Accessed 2018.
P. Erdős and A. Rényi, “On random graphs,” Publ. Math. Debrecen 6, 290–297 (1959). https://doi.org/snap.stanford.edu/class/cs224w-readings/erdos59random.pdf. Accessed 2018.
Author information
Authors and Affiliations
Corresponding author
Additional information
(Submitted by V. V. Voevodin)
Rights and permissions
About this article
Cite this article
Semenov, A., Mazeev, A., Doropheev, D. et al. Unsupervised Graph Anomaly Detection Algorithms Implemented in Apache Spark. Lobachevskii J Math 39, 1262–1269 (2018). https://doi.org/10.1134/S1995080218090184
Received:
Published:
Issue Date:
DOI: https://doi.org/10.1134/S1995080218090184