Abstract
Clustering objects represented by structured data with possibly non-trivial geometry is an interesting task in pattern recognition. Moreover, in the Big Data era, clustering huge amounts of (structured) data challenges computer science and pattern recognition researchers alike. The aim of this paper is to bridge this gap for large-scale structured data clustering. Specifically, following a previous work, a parallel and distributed k-medoids clustering implementation is proposed and tested on real-world biological structured data, namely pathway maps (graphs) and primary structures of proteins (sequences). Furthermore, two methods for medoid evaluation, based on exact and approximate procedures respectively, are proposed and compared in terms of scalability. Computational results show that the proposed implementation is flexible with respect to the dissimilarity measure and the input space adopted, with satisfactory scalability.
Notes
- 1. In Spark, workers are the elementary computational units, which can be either single cores on a CPU or entire computers, depending on the environment configuration (i.e. the Spark Context).
- 2. For a more extensive list, we refer the interested reader to the official Apache Spark RDD API at https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.
- 3. In the following subsections and pseudocodes, every variable whose name ends with RDD shall be considered an RDD, that is, a data structure whose shards are distributed across the workers or, similarly, one for which there is no guarantee that a single worker holds its entire content in memory.
- 4. Not to be confused with i in datasetRDD, distancesRDD and nearestClusterRDD, which is a within-dataset pattern ID.
- 5. Indeed, in many cases the pattern-to-medoid assignments rather than the WCSoD are compared between consecutive iterations.
- 6. In the following, we suppose that the dissimilarity measure \(d(\cdot ,\cdot )\), although it might not be metric, is at least symmetric, i.e. \(d(a,b)=d(b,a)\). For symmetric dissimilarity measures the pairwise distance matrix is symmetric by definition and summing by rows or by columns leads to the same result. In case of non-symmetric dissimilarity measures, it is possible to 'force' symmetry by letting (for example) \(\bar{d}(a,b)=\frac{1}{2}(d(a,b)+d(b,a))\).
- 7. Indeed, for very small clusters a Spark-driven parallelisation is discouraged, as the majority of the execution time will be spent on thread scheduling and worker communication rather than processing.
- 8.
- 9.
- 10. It is sufficient for a given node to exist in a single network to be included.
- 11. For the sake of comparison, it is worth remarking that both the Euclidean distance and the Hamming distance have complexity \(\mathcal {O}(n)\), where n is the number of attributes. However, in this work, the number of attributes ranges from \(2136^2\) to \(4431^2\), whereas in [12] it ranged from 4 to 28102.
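Note 6 sketches how a non-symmetric dissimilarity can be symmetrised and how, for a symmetric measure, the medoid of a cluster follows from the row sums of the pairwise dissimilarity matrix. A minimal sketch in plain Python (the names `symmetrize` and `exact_medoid` are illustrative, not the paper's actual implementation):

```python
def symmetrize(d):
    """Force symmetry of a possibly non-symmetric dissimilarity:
    d_bar(a, b) = (d(a, b) + d(b, a)) / 2."""
    return lambda a, b: 0.5 * (d(a, b) + d(b, a))

def exact_medoid(cluster, d):
    """Exact medoid evaluation: build the row sums of the pairwise
    dissimilarity matrix and return the pattern minimising the sum of
    dissimilarities to all other patterns in the cluster.
    For symmetric d, row sums and column sums coincide."""
    sums = [sum(d(a, b) for b in cluster) for a in cluster]
    return cluster[sums.index(min(sums))]
```

For example, with `d(a, b) = |a - b|` the medoid of `[0, 1, 2, 10]` is `1` (row sums 13, 11, 11, 27; ties resolved by the first minimum).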
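Note 5 mentions that convergence is often checked by comparing pattern-to-medoid assignments between consecutive iterations rather than the WCSoD. A minimal, non-distributed sketch of this check (`assign` and `converged` are hypothetical names; the actual implementation operates on RDDs):

```python
def assign(patterns, medoids, d):
    """Map step: assign each pattern to its nearest medoid (by index)."""
    return [min(range(len(medoids)), key=lambda j: d(p, medoids[j]))
            for p in patterns]

def converged(prev_assignments, curr_assignments):
    """Stop when no pattern changed cluster between consecutive
    iterations (prev_assignments is None at the first iteration)."""
    return prev_assignments is not None and prev_assignments == curr_assignments
```

Comparing two lists of integer cluster indices is cheap, whereas recomputing the WCSoD requires a further pass over the dissimilarities.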
References
Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28, 129–137 (1982)
MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In Cam, L.M.L., Neyman, J. (eds.) Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. University of California Press (1967)
Bradley, P.S., Mangasarian, O.L., Street, W.N.: Clustering via concave minimization. In: Proceedings of the 9th International Conference on Neural Information Processing Systems, NIPS’96, pp. 368–374. MIT Press, Cambridge, MA, USA (1996)
Kaufman, L., Rousseeuw, P.J.: Clustering by means of medoids. In: Statistical Data Analysis Based on the L1-Norm and Related Methods (1987)
Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, vol. 96, pp. 226–231 (1996)
Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: Optics: ordering points to identify the clustering structure. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, SIGMOD ’99, pp. 49–60. ACM, New York, NY, USA (1999)
Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD ’96, pp. 103–114. ACM, New York, NY, USA (1996)
Guha, S., Rastogi, R., Shim, K.: CURE: an efficient clustering algorithm for large databases. SIGMOD Rec. 27, 73–84 (1998)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008)
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. HotCloud 10, 95 (2010)
Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen, S., et al.: MLlib: machine learning in apache spark. J. Mach. Learn. Res. 17, 1–7 (2016)
Martino, A., Rizzi, A., Frattale Mascioli, F.M.: Efficient approaches for solving the large-scale k-medoids problem. In: Proceedings of the 9th International Joint Conference on Computational Intelligence—Volume 1: IJCCI, INSTICC, pp. 338–347. SciTePress (2017)
Zhao, W., Ma, H., He, Q.: Parallel k-means clustering based on MapReduce. In: Jaatun, M.G., Zhao, G., Rong, C. (eds.) Cloud Computing, pp. 674–679. Springer, Berlin, Heidelberg (2009)
Yue, X., Man, W., Yue, J., Liu, G.: Parallel k-medoids++ spatial clustering algorithm based on MapReduce (2016). arXiv:1608.06861
Arbelaez, A., Quesada, L.: Parallelising the k-medoids clustering problem using space-partitioning. In: Sixth Annual Symposium on Combinatorial Search (2013)
Jiang, Y., Zhang, J.: Parallel k-medoids clustering algorithm based on Hadoop. In: 2014 5th IEEE International Conference on Software Engineering and Service Science (ICSESS), pp. 649–652. IEEE (2014)
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis, vol. 344. Wiley (2009)
Martino, A., Rizzi, A., Frattale Mascioli, F.M.: Distance matrix pre-caching and distributed computation of internal validation indices in k-medoids clustering. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2018)
Park, H.S., Jun, C.H.: A simple and fast algorithm for k-medoids clustering. Expert Syst. Appl. 36, 3336–3341 (2009)
Del Vescovo, G., Livi, L., Frattale Mascioli, F.M., Rizzi, A.: On the problem of modeling structured data with the minsod representative. Int. J. Comput. Theory Eng. 6, 9 (2014)
Martino, A., Giuliani, A., Rizzi, A.: Granular computing techniques for bioinformatics pattern recognition problems in non-metric spaces. In: Pedrycz, W., Chen, S.M. (eds.) Computational Intelligence for Pattern Recognition, pp. 53–81. Springer International Publishing, Cham (2018)
Aloise, D., Deshpande, A., Hansen, P., Popat, P.: NP-hardness of Euclidean sum-of-squares clustering. Mach. Learn. 75, 245–248 (2009)
Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics (2007)
Bianchi, F.M., Livi, L., Rizzi, A.: Two density-based k-means initialization algorithms for non-metric data clustering. Pattern Anal. Appl. 3, 745–763 (2016)
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pp. 2–2. USENIX Association (2012)
Thorndike, R.L.: Who belongs in the family? Psychometrika 18, 267–276 (1953)
Livi, L., Rizzi, A.: The graph matching problem. Pattern Anal. Appl. 16, 253–283 (2013)
Livi, L., Del Vescovo, G., Rizzi, A.: Graph recognition by seriation and frequent substructures mining. In: ICPRAM 2012—Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods, vol. 1, pp. 186–191 (2012)
van der Walt, S., Colbert, S.C., Varoquaux, G.: The NumPy array: a structure for efficient numerical computation. Comput. Sci. Eng. 13, 22–30 (2011)
Jones, E., Oliphant, T., Peterson, P., et al.: SciPy: open source scientific tools for Python (2001). Accessed 13 Mar 2018
Millman, K.J., Aivazis, M.: Python for scientists and engineers. Comput. Sci. Eng. 13, 9–12 (2011)
Oliphant, T.E.: Python for scientific computing. Comput. Sci. Eng. 9 (2007)
Cokelaer, T., Pultz, D., Harder, L.M., Serra-Musach, J., Saez-Rodriguez, J.: Bioservices: a common Python package to access biological web services programmatically. Bioinformatics 29, 3241–3242 (2013)
Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., Morishima, K.: KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361 (2016)
Kanehisa, M., Goto, S.: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000)
Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M., Tanabe, M.: KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2015)
Tun, K., Dhar, P.K., Palumbo, M.C., Giuliani, A.: Metabolic pathways variability and sequence/networks comparisons. BMC Bioinform. 7, 24 (2006)
Hamming, R.W.: Error detecting and error correcting codes. Bell Labs Tech. J. 29, 147–160 (1950)
Giuliani, A., Krishnan, A., Zbilut, J.P., Tomita, M.: Proteins as networks: usefulness of graph theory in protein science. Curr. Protein Pept. Sci. 9, 28–38 (2008)
Di Paola, L., De Ruvo, M., Paci, P., Santoni, D., Giuliani, A.: Protein contact networks: an emerging paradigm in chemistry. Chem. Rev. 113, 1598–1613 (2012)
Livi, L., Giuliani, A., Sadeghian, A.: Characterization of graphs for protein structure modeling and recognition of solubility. Curr. Bioinform. 11, 106–114 (2016)
Livi, L., Maiorino, E., Giuliani, A., Rizzi, A., Sadeghian, A.: A generative model for protein contact networks. J. Biomol. Struct. Dyn. 34, 1441–1454 (2016)
Maiorino, E., Rizzi, A., Sadeghian, A., Giuliani, A.: Spectral reconstruction of protein contact networks. Phys. A Stat. Mech. Appl. 471, 804–817 (2017)
Martino, A., Maiorino, E., Giuliani, A., Giampieri, M., Rizzi, A.: Supervised approaches for function prediction of proteins contact networks from topological structure information. In: Sharma, P., Bianchi, F.M. (eds.) Image Analysis, pp. 285–296. Springer International Publishing, Cham (2017)
Martino, A., Rizzi, A., Frattale Mascioli, F.M.: Supervised approaches for protein function prediction by topological data analysis. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2018)
De Santis, E., Martino, A., Rizzi, A., Frattale Mascioli, F.M.: Dissimilarity space representations and automatic feature selection for protein function prediction. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2018)
The UniProt Consortium: UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45, D158–D169 (2017)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10, 707–710 (1966)
Cinti, A., Bianchi, F.M., Martino, A., Rizzi, A.: A novel algorithm for online inexact string matching and its FPGA implementation (2017). arXiv:1712.03560
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Martino, A., Rizzi, A., Frattale Mascioli, F.M. (2019). Efficient Approaches for Solving the Large-Scale k-Medoids Problem: Towards Structured Data. In: Sabourin, C., Merelo, J.J., Madani, K., Warwick, K. (eds) Computational Intelligence. IJCCI 2017. Studies in Computational Intelligence, vol 829. Springer, Cham. https://doi.org/10.1007/978-3-030-16469-0_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-16468-3
Online ISBN: 978-3-030-16469-0
eBook Packages: Intelligent Technologies and Robotics