Abstract
In this work we study string kernel methods for sequence analysis, focusing on the problem of species-level identification from short DNA fragments known as barcodes. We introduce efficient sorting-based algorithms for exact string k-mer kernels and then describe a divide-and-conquer technique for kernels with mismatches. Our algorithms for mismatch kernel matrix computations improve on the currently known time bounds for these computations. We then consider the mismatch kernel problem with feature selection and present efficient algorithms for it. Our experimental results show that, for string kernels with mismatches, kernel matrices can be computed 100 to 200 times faster than with traditional approaches. Kernel vector evaluations on new sequences show similar computational improvements. On several DNA barcode datasets, k-mer string kernels considerably improve identification accuracy compared to prior results. String kernels with feature selection achieve competitive performance with substantially fewer computations.
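To make the idea of a sorting-based exact k-mer kernel concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard spectrum-kernel definition, K[i][j] = sum over k-mers a of count_i(a) * count_j(a); the function name `spectrum_kernel_matrix` and the toy sequences are hypothetical and chosen only for illustration. Sorting all (k-mer, sequence index) pairs places identical k-mers next to each other, so each group of equal k-mers can be turned into kernel contributions in a single pass.

```python
from collections import Counter
from itertools import groupby


def spectrum_kernel_matrix(seqs, k):
    """Exact k-mer (spectrum) kernel matrix computed by sorting k-mers.

    Illustrative sketch only: K[i][j] is the sum, over all k-mers a,
    of count_i(a) * count_j(a).
    """
    n = len(seqs)
    # One (k-mer, sequence index) pair per position in every sequence.
    # Sequences shorter than k contribute no pairs.
    pairs = [(s[p:p + k], i)
             for i, s in enumerate(seqs)
             for p in range(len(s) - k + 1)]
    # Sorting places all occurrences of the same k-mer in one contiguous run.
    pairs.sort()
    K = [[0] * n for _ in range(n)]
    for _, group in groupby(pairs, key=lambda t: t[0]):
        # Count how often this k-mer occurs in each sequence that contains it.
        counts = Counter(i for _, i in group)
        # Only sequences sharing this k-mer receive a contribution.
        for i, ci in counts.items():
            for j, cj in counts.items():
                K[i][j] += ci * cj
    return K


# Toy example with hypothetical sequences and k = 2.
K = spectrum_kernel_matrix(["ACGT", "ACGA", "TTTT"], 2)
print(K)  # [[3, 2, 0], [2, 3, 0], [0, 0, 9]]
```

The first two sequences share the 2-mers AC and CG, which gives the off-diagonal value 2, while the third shares nothing with either. This sketch covers only exact matching; the divide-and-conquer treatment of mismatches mentioned in the abstract is not reproduced here.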
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kuksa, P., Pavlovic, V. (2007). Fast Kernel Methods for SVM Sequence Classifiers. In: Giancarlo, R., Hannenhalli, S. (eds) Algorithms in Bioinformatics. WABI 2007. Lecture Notes in Computer Science(), vol 4645. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74126-8_22
DOI: https://doi.org/10.1007/978-3-540-74126-8_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74125-1
Online ISBN: 978-3-540-74126-8
eBook Packages: Computer Science, Computer Science (R0)