Fast Implementations of Markov Clustering for Protein Sequence Grouping

Szilágyi, László; Szilágyi, Sándor Miklos

doi:10.1007/978-3-642-41550-0_19

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8234)

Included in the following conference series:

International Conference on Modeling Decisions for Artificial Intelligence

Abstract

Two efficient versions of the Markov clustering algorithm are proposed, suitable for fast and accurate grouping of protein sequences. The first, a matrix splitting approach, consists in optimally reordering the rows and columns of the similarity matrix after every iteration, transforming it into a matrix with several compact blocks along the diagonal and zero similarities outside the blocks. These blocks are treated separately in later iterations, which significantly reduces the overall computational load. Alternatively, a special sparse matrix architecture is employed to represent the similarity matrix of the Markov clustering algorithm, which also eliminates a large amount of unnecessary computation. The proposed algorithms were tested on classifying sequences from protein databases such as SCOP95. Depending on input data size and parameter settings, they achieve a speed-up factor in the range 15-300 compared to a conventional implementation of Markov clustering, without degrading partition accuracy. Convergence is usually reached in 40-50 iterations. Combining the two proposed approaches brings the speed-up ratio close to 1000.
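
As a rough illustration of the two ideas summarized in the abstract, the following Python sketch stores the similarity matrix sparsely and, after every iteration, splits it into independent diagonal blocks that are then processed separately. It is not the authors' implementation: the dict-of-dicts column storage, the function names, the inflation exponent `r`, the pruning threshold `eps`, the tolerance `tol`, and the toy similarity values are all assumptions made for illustration.

```python
from collections import defaultdict


def normalize(cols):
    """Scale each sparse column so that its entries sum to 1."""
    out = {}
    for j, col in cols.items():
        total = sum(col.values())
        out[j] = {i: v / total for i, v in col.items()}
    return out


def expand(cols):
    """Expansion: square the matrix, touching only stored (nonzero) entries."""
    out = {}
    for j, col in cols.items():
        acc = defaultdict(float)
        for k, w in col.items():  # column j of M*M = sum_k M[k][j] * column k
            for i, v in cols[k].items():
                acc[i] += v * w
        out[j] = dict(acc)
    return out


def inflate(cols, r, eps):
    """Inflation: raise entries to power r, prune tiny ones, renormalize."""
    out = {}
    for j, col in cols.items():
        powered = {i: v ** r for i, v in col.items()}
        total = sum(powered.values())
        kept = {i: v for i, v in powered.items() if v / total >= eps}
        out[j] = kept or powered  # never leave a column empty
    return normalize(out)


def blocks(cols):
    """Connected components of the nonzero pattern, i.e. diagonal blocks."""
    adj = defaultdict(set)
    for j, col in cols.items():
        for i in col:
            adj[i].add(j)
            adj[j].add(i)
    seen, comps = set(), []
    for start in cols:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u] - seen:
                seen.add(v)
                stack.append(v)
        comps.append(sorted(comp))
    return comps


def mcl_split(sim, r=2.0, eps=1e-4, tol=1e-9, max_iter=100):
    """Sparse Markov clustering in which each detected block iterates alone."""
    clusters = []
    work = [(normalize(sim), 0)]
    while work:
        cols, it = work.pop()
        new = inflate(expand(cols), r, eps)
        change = max(abs(new[j].get(i, 0.0) - cols[j].get(i, 0.0))
                     for j in cols for i in set(new[j]) | set(cols[j]))
        parts = blocks(new)
        if change < tol or it + 1 >= max_iter:
            clusters.extend(parts)  # block has converged: emit its clusters
        else:
            work.extend(({j: new[j] for j in part}, it + 1) for part in parts)
    return sorted(clusters)


# Toy input (assumed values): two tight groups joined by one weak similarity.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.05)]
sim = {n: {n: 1.0} for n in range(6)}  # self-similarity on the diagonal
for a, b, w in edges:
    sim[a][b] = w
    sim[b][a] = w

clusters = mcl_split(sim)
print(clusters)
```

In this sketch the block structure is found with a plain connected-components search rather than an explicit row and column permutation; the two are equivalent for deciding which entries can be processed independently, but the reordering strategy described in the paper may differ.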

This work was supported by the Hungarian National Research Funds (OTKA) under grant no. PD103921 and by the Hungarian Academy of Sciences through the János Bolyai Fellowship Program.

Author information

Authors and Affiliations

Faculty of Technical and Human Science, Sapientia - Hungarian Science University of Transylvania, Tîrgu-Mureş, Romania
László Szilágyi
Department of Control Engineering and Information Technology, Budapest University of Technology and Economics, Budapest, Hungary
László Szilágyi
Petru Maior University of Tîrgu-Mureş, Romania
Sándor Miklos Szilágyi

Editor information

Editors and Affiliations

IIIA-CSIC, Campus UAB s/n, 08193, Bellaterra, Catalonia, Spain
Vicenç Torra
Toho Gakuen, 3-1-10, Naka, 186-0004, Kunitachi, Tokyo, Japan
Yasuo Narukawa
Departament d’Enginyeria de la Informació i de les Comunicacions, Universitat Autonoma de Barcelona, 08193, Bellaterra, Catalonia, Spain
Guillermo Navarro-Arribas
Internet Interdisciplinary Institute (IN3); Estudis d’Informàtica, Multimèdia i Telecomunicació, Universitat Oberta de Catalunya, Rambla del Poblenou, 156, 08018, Barcelona, Catalonia, Spain
David Megías

About this paper

Cite this paper

Szilágyi, L., Szilágyi, S.M. (2013). Fast Implementations of Markov Clustering for Protein Sequence Grouping. In: Torra, V., Narukawa, Y., Navarro-Arribas, G., Megías, D. (eds) Modeling Decisions for Artificial Intelligence. MDAI 2013. Lecture Notes in Computer Science, vol 8234. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41550-0_19

DOI: https://doi.org/10.1007/978-3-642-41550-0_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41549-4
Online ISBN: 978-3-642-41550-0
eBook Packages: Computer Science, Computer Science (R0)
