Abstract
In this paper we present a novel methodology for sequence classification, based on sequential pattern mining and optimization algorithms. The proposed methodology automatically generates a sequence classification model, based on a two stage process. In the first stage, a sequential pattern mining algorithm is applied to a set of sequences and the sequential patterns are extracted. Then, the score of every pattern with respect to each sequence is calculated using a scoring function and the score of each class under consideration is estimated by summing the specific pattern scores. Each score is updated, multiplied by a weight and the output of the first stage is the classification confusion matrix of the sequences. In the second stage an optimization technique, aims to finding a set of weights which minimize an objective function, defined using the classification confusion matrix. The set of the extracted sequential patterns and the optimal weights of the classes comprise the sequence classification model. Extensive evaluation of the methodology was carried out in the protein classification domain, by varying the number of training and test sequences, the number of patterns and the number of classes. The methodology is compared with other similar sequence classification approaches. The proposed methodology exhibits several advantages, such as automated weight assignment to classes using optimization techniques and knowledge discovery in the domain of application.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Fayyad UM, Piatetsky-Shapiro G, Smyth P (1996) From data mining to knowledge discovery: an overview. In: Advances in knowledge discovery and data mining. AAAI Press/MIT Press, Cambridge, pp 1–36
Han J, Kamber M (2000) Data mining: concepts and techniques. Morgan Kaufmann, Menlo Park
Rabiner L (1989) A tutorial on hidden Markov models and selected application in speech recognition. Proc IEEE 77: 257–286
Loewenstern D, Berman H, Hirsh H (1998) Maximum a posteriori classification of DNA structure from sequence information. In: Proceedings of Pacific symposium in Biotech, pp 667–668
Lesh N, Zaki MJ, Ogihara M (1999) Mining features for sequence classification. In: 5th ACM SIGKDD international conference on knowledge discovery and data mining (KDD). San Diego, pp 342–346
Lesh N, Zaki MJ, Ogihara M (2000) Scalable feature mining for sequential data. IEEE Intell Syst 15(2): 48–56
Shin-Mu Tseng V, Lee C-H (2005) CBS: a new classification method by using sequential patterns. In: Proceedings of SIAM international data mining conference. California, USA
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large databases. Santiago, Chile
Agrawal R, Srikant R (1995) Mining sequential patterns. In: Proceedings of the 11th international conference on data engineering, Taiwan, pp 3–14
Bayardo Jr RJ (1997) Brute-force mining of high-confidence classification rules. In: Proceedings of the third international conference on knowledge discovery and data mining, pp 123–126
Quinlan JR (1992) C4.5: Programs for machine learning, Morgan Kaufman, Menlo Park
Zaki MJ (1998) Efficient enumeration of frequent sequences. In: 7th International conference on information and knowledge management, Washington DC, pp 68–75
Liu, B., Hsu, W., Ma, Y.: Integrating Classification and Association Rule Mining. In: Proceedings of fourth international conference on knowledge discovery and data mining. AAAI Press, Menlo Park, pp 80–86 (1998)
Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Dayal U (2004) Mining sequential patterns by pattern-growth: the prefixspan approach. IEEE Trans Knowl Data Eng 16: 1424–1440
Ayres J, Gehrke J, Yiu T, Flannick J (2002) Sequential pattern mining using bitmaps. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining, Canada, pp 429–435
Chen T-Z, Hsu S-C (2007) Mining frequent tree-like patterns in large databases. Data Knowl Eng 62(1): 65–83
Zaki MJ (2000) Sequence mining in categorical domains: incorporating constraints. In: Proceedings of the 9th international conference on information and knowledge management, USA, pp 422–429
Srikant, R., Agrawal, R.: Mining sequential patterns: generalizations and performance improvements. In: Proceedings 5th International Conference Extending Database Technology, EDBT. Springer, Heidelberg, vol 1057, pp 3–17 (1996)
Garofalakis M, Rastogi R, Shim K (1999) SPIRIT: sequential pattern mining with regular expression constraint. In: Proceedings of the 25th international conference on very large databases, pp 223–234
Lin M-Y, Lee S-Y (2005) Efficient mining of sequential patterns with time constraints by delimited pattern growth. Knowl Inf Syst 7: 499–514
Lagarias JC, Reeds JA, Wright MH, Wright PE (1998) Convergence properties of the Nelder-Mead simplex method in low dimensions. SIAM J. Optim. 9(1): 112–147
Exarchos TP, Papaloukas C, Lampros C, Fotiadis DI (2007) Mining sequential patterns for protein fold recognition. J. Biomed. Inf. (in press)
Exarchos TP, Papaloukas C, Lampros C, Fotiadis DI (2006) Protein classification using sequential pattern mining. In: Proceedings of IEEE engineering in medicine and biology conference, New York, USA, pp 5814–5817
Wang K, Hu Y, Hu Yu J (2004) Scalable sequential pattern mining for biological sequences. In: Proceedings of the 13th ACM conference on information and knowledge management, USA, pp 178–187
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The protein data bank. Nucleic Acids Res 28: 235–242
Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540
Hughey R, Krogh A (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. CABIOS 12(2): 95–107
Karplus K, Karchin R, Shackelford G, Hughey R (2005) Calibrating evalues for hidden Markov models using reverse-sequence null models. Bioinformatics 21: 4107–4115
Baum LE (1972) An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3: 1–8
Moult J, Fidelis K, Rost B, Hubbard T, Tramontano A (2005) Critical assessment of methods of protein structure prediction (CASP)-round 6. Proteins 61(Suppl 7): 3–7
Kum, H.-C., Chang, JH., Wang, W.: Intelligent sequential mining via alignment: optimization techniques for very large DB. In: Advances in knowledge discovery and data mining. Lecture Notes in Computer Science, vol 4426, pp 587–597 (2007)
Rigoutsos I, Floratos A (1998) Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics 14(1): 55–67
Tzvetkov P, Yan X, Han J (2005) TSP: mining top-k closed sequential patterns. Knowl Inf Syst 7: 438–457
Luo C, Chung SM (2007) A scalable algorithm for mining maximal frequent sequences using a sample. Knowl Inf Syst (in press)
Ji X, Bailey J, Dong G (2007) Mining minimal distinguishing subsequence patterns with gap constraints. Knowl Inf Syst 11(3): 259–286
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Exarchos, T.P., Tsipouras, M.G., Papaloukas, C. et al. An optimized sequential pattern matching methodology for sequence classification. Knowl Inf Syst 19, 249–264 (2009). https://doi.org/10.1007/s10115-008-0146-2
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-008-0146-2