An optimized sequential pattern matching methodology for sequence classification

Exarchos, Themis P.; Tsipouras, Markos G.; Papaloukas, Costas; Fotiadis, Dimitrios I.

doi:10.1007/s10115-008-0146-2

An optimized sequential pattern matching methodology for sequence classification

Regular Paper
Published: 30 May 2008

Volume 19, pages 249–264, (2009)
Cite this article

Download PDF

Access provided by CONRICYT-eBooks

Knowledge and Information Systems Aims and scope Submit manuscript

An optimized sequential pattern matching methodology for sequence classification

Download PDF

Themis P. Exarchos¹,
Markos G. Tsipouras²,
Costas Papaloukas³ &
…
Dimitrios I. Fotiadis²

202 Accesses
16 Citations
Explore all metrics

Abstract

In this paper we present a novel methodology for sequence classification, based on sequential pattern mining and optimization algorithms. The proposed methodology automatically generates a sequence classification model, based on a two stage process. In the first stage, a sequential pattern mining algorithm is applied to a set of sequences and the sequential patterns are extracted. Then, the score of every pattern with respect to each sequence is calculated using a scoring function and the score of each class under consideration is estimated by summing the specific pattern scores. Each score is updated, multiplied by a weight and the output of the first stage is the classification confusion matrix of the sequences. In the second stage an optimization technique, aims to finding a set of weights which minimize an objective function, defined using the classification confusion matrix. The set of the extracted sequential patterns and the optimal weights of the classes comprise the sequence classification model. Extensive evaluation of the methodology was carried out in the protein classification domain, by varying the number of training and test sequences, the number of patterns and the number of classes. The methodology is compared with other similar sequence classification approaches. The proposed methodology exhibits several advantages, such as automated weight assignment to classes using optimization techniques and knowledge discovery in the domain of application.

Article PDF

Learning sequential classifiers from long and noisy discrete-event sequences efficiently

Article 04 November 2014

Itemset Based Sequence Classification

FS-Tree: Sequential Association Rules and First Applications to Protein Secondary Structure Analysis

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

References

Fayyad UM, Piatetsky-Shapiro G, Smyth P (1996) From data mining to knowledge discovery: an overview. In: Advances in knowledge discovery and data mining. AAAI Press/MIT Press, Cambridge, pp 1–36
Han J, Kamber M (2000) Data mining: concepts and techniques. Morgan Kaufmann, Menlo Park
Rabiner L (1989) A tutorial on hidden Markov models and selected application in speech recognition. Proc IEEE 77: 257–286
Article Google Scholar
Loewenstern D, Berman H, Hirsh H (1998) Maximum a posteriori classification of DNA structure from sequence information. In: Proceedings of Pacific symposium in Biotech, pp 667–668
Lesh N, Zaki MJ, Ogihara M (1999) Mining features for sequence classification. In: 5th ACM SIGKDD international conference on knowledge discovery and data mining (KDD). San Diego, pp 342–346
Lesh N, Zaki MJ, Ogihara M (2000) Scalable feature mining for sequential data. IEEE Intell Syst 15(2): 48–56
Article Google Scholar
Shin-Mu Tseng V, Lee C-H (2005) CBS: a new classification method by using sequential patterns. In: Proceedings of SIAM international data mining conference. California, USA
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large databases. Santiago, Chile
Agrawal R, Srikant R (1995) Mining sequential patterns. In: Proceedings of the 11th international conference on data engineering, Taiwan, pp 3–14
Bayardo Jr RJ (1997) Brute-force mining of high-confidence classification rules. In: Proceedings of the third international conference on knowledge discovery and data mining, pp 123–126
Quinlan JR (1992) C4.5: Programs for machine learning, Morgan Kaufman, Menlo Park
Zaki MJ (1998) Efficient enumeration of frequent sequences. In: 7th International conference on information and knowledge management, Washington DC, pp 68–75
Liu, B., Hsu, W., Ma, Y.: Integrating Classification and Association Rule Mining. In: Proceedings of fourth international conference on knowledge discovery and data mining. AAAI Press, Menlo Park, pp 80–86 (1998)
Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Dayal U (2004) Mining sequential patterns by pattern-growth: the prefixspan approach. IEEE Trans Knowl Data Eng 16: 1424–1440
Article Google Scholar
Ayres J, Gehrke J, Yiu T, Flannick J (2002) Sequential pattern mining using bitmaps. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining, Canada, pp 429–435
Chen T-Z, Hsu S-C (2007) Mining frequent tree-like patterns in large databases. Data Knowl Eng 62(1): 65–83
Article Google Scholar
Zaki MJ (2000) Sequence mining in categorical domains: incorporating constraints. In: Proceedings of the 9th international conference on information and knowledge management, USA, pp 422–429
Srikant, R., Agrawal, R.: Mining sequential patterns: generalizations and performance improvements. In: Proceedings 5th International Conference Extending Database Technology, EDBT. Springer, Heidelberg, vol 1057, pp 3–17 (1996)
Garofalakis M, Rastogi R, Shim K (1999) SPIRIT: sequential pattern mining with regular expression constraint. In: Proceedings of the 25th international conference on very large databases, pp 223–234
Lin M-Y, Lee S-Y (2005) Efficient mining of sequential patterns with time constraints by delimited pattern growth. Knowl Inf Syst 7: 499–514
Article Google Scholar
Lagarias JC, Reeds JA, Wright MH, Wright PE (1998) Convergence properties of the Nelder-Mead simplex method in low dimensions. SIAM J. Optim. 9(1): 112–147
Article MATH MathSciNet Google Scholar
Exarchos TP, Papaloukas C, Lampros C, Fotiadis DI (2007) Mining sequential patterns for protein fold recognition. J. Biomed. Inf. (in press)
Exarchos TP, Papaloukas C, Lampros C, Fotiadis DI (2006) Protein classification using sequential pattern mining. In: Proceedings of IEEE engineering in medicine and biology conference, New York, USA, pp 5814–5817
Wang K, Hu Y, Hu Yu J (2004) Scalable sequential pattern mining for biological sequences. In: Proceedings of the 13th ACM conference on information and knowledge management, USA, pp 178–187
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The protein data bank. Nucleic Acids Res 28: 235–242
Article Google Scholar
Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536–540
Google Scholar
Hughey R, Krogh A (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. CABIOS 12(2): 95–107
Google Scholar
Karplus K, Karchin R, Shackelford G, Hughey R (2005) Calibrating evalues for hidden Markov models using reverse-sequence null models. Bioinformatics 21: 4107–4115
Article Google Scholar
Baum LE (1972) An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3: 1–8
Google Scholar
Moult J, Fidelis K, Rost B, Hubbard T, Tramontano A (2005) Critical assessment of methods of protein structure prediction (CASP)-round 6. Proteins 61(Suppl 7): 3–7
Article Google Scholar
Kum, H.-C., Chang, JH., Wang, W.: Intelligent sequential mining via alignment: optimization techniques for very large DB. In: Advances in knowledge discovery and data mining. Lecture Notes in Computer Science, vol 4426, pp 587–597 (2007)
Rigoutsos I, Floratos A (1998) Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics 14(1): 55–67
Article Google Scholar
Tzvetkov P, Yan X, Han J (2005) TSP: mining top-k closed sequential patterns. Knowl Inf Syst 7: 438–457
Article Google Scholar
Luo C, Chung SM (2007) A scalable algorithm for mining maximal frequent sequences using a sample. Knowl Inf Syst (in press)
Ji X, Bailey J, Dong G (2007) Mining minimal distinguishing subsequence patterns with gap constraints. Knowl Inf Syst 11(3): 259–286
Article Google Scholar

Download references

Author information

Authors and Affiliations

Department of Medical Physics, Medical School, University of Ioannina, 45110, Ioannina, Greece
Themis P. Exarchos
Department of Computer Science, Unit of Medical Technology and Intelligent Information Systems, University of Ioannina, PO Box 1186, 45110, Ioannina, Greece
Markos G. Tsipouras & Dimitrios I. Fotiadis
Department of Biological Applications and Technology, University of Ioannina, 45110, Ioannina, Greece
Costas Papaloukas

Authors

Themis P. Exarchos
View author publications
You can also search for this author in PubMed Google Scholar
Markos G. Tsipouras
View author publications
You can also search for this author in PubMed Google Scholar
Costas Papaloukas
View author publications
You can also search for this author in PubMed Google Scholar
Dimitrios I. Fotiadis
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Dimitrios I. Fotiadis.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Exarchos, T.P., Tsipouras, M.G., Papaloukas, C. et al. An optimized sequential pattern matching methodology for sequence classification. Knowl Inf Syst 19, 249–264 (2009). https://doi.org/10.1007/s10115-008-0146-2

Download citation

Received: 25 September 2007
Revised: 07 March 2008
Accepted: 12 April 2008
Published: 30 May 2008
Issue Date: May 2009
DOI: https://doi.org/10.1007/s10115-008-0146-2

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

An optimized sequential pattern matching methodology for sequence classification

Abstract

Article PDF

Similar content being viewed by others

Learning sequential classifiers from long and noisy discrete-event sequences efficiently

Itemset Based Sequence Classification

FS-Tree: Sequential Association Rules and First Applications to Protein Secondary Structure Analysis

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

An optimized sequential pattern matching methodology for sequence classification

Abstract

Article PDF

Similar content being viewed by others

Learning sequential classifiers from long and noisy discrete-event sequences efficiently

Itemset Based Sequence Classification

FS-Tree: Sequential Association Rules and First Applications to Protein Secondary Structure Analysis

Explore related subjects

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation