Mining probabilistic automata: a statistical view of sequential pattern mining

Jacquemont, Stéphanie; Jacquenet, François; Sebban, Marc

doi:10.1007/s10994-008-5098-y

Mining probabilistic automata: a statistical view of sequential pattern mining

Published: 08 January 2009

Volume 75, pages 91–127, (2009)
Cite this article

Download PDF

Machine Learning Aims and scope Submit manuscript

Mining probabilistic automata: a statistical view of sequential pattern mining

Download PDF

Stéphanie Jacquemont¹,
François Jacquenet¹ &
Marc Sebban¹

915 Accesses
13 Citations
Explore all metrics

Abstract

During the past decade, sequential pattern mining has been the core of numerous research efforts. It is now possible to efficiently extract knowledge of users’ behavior from a huge set of sequences collected over time. This has applications in various domains such as purchases in supermarkets, Web site visits, etc. However, sequence mining algorithms do little to control the risks of extracting false discoveries or overlooking true knowledge. In this paper, the theoretical conditions to achieve a relevant sequence mining process are examined. Then, the article offers a statistical view of sequence mining which has the following advantages: First, it uses a compact and generalized representation of the original sequences in the form of a probabilistic automaton. Second, it integrates statistical constraints to guarantee the extraction of significant patterns. Finally, it provides an interesting solution in a privacy preserving context in order to respect individuals’ information. An application in car flow modeling is presented, showing the ability of our algorithm (acsm) to discover frequent routes without any private information. Comparisons with a classical sequence mining algorithm (spam) are made, showing the effectiveness of our approach.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

References

Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In Proceedings of the 11th international conference on data engineering (pp. 3–14). Los Alamitos: IEEE Computer Society.
Google Scholar
Agrawal, R., & Srikant, R. (2000). Privacy-preserving data mining. In Proceedings of the ACM SIGMOD conference on management of data (pp. 439–450). New York: ACM.
Chapter Google Scholar
Ayres, J., Flannick, J., Gehrke, J., & Yiu, T. (2002). Sequential pattern mining using a bitmap representation. In Proceedings of the 8th international conference on knowledge discovery and data mining (pp. 429–435). New York: ACM.
Google Scholar
Bayardo, R. J., & Agrawal, R. (2005). Data privacy through optimal k-anonymization. In Proceedings of the 21st international conference on data engineering (pp. 217–228). Los Alamitos: IEEE Computer Society.
Google Scholar
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A new and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289–300.
MATH MathSciNet Google Scholar
Borges, J., & Levene, M. (1998). Mining association rules in hypertext databases. In Proceedings of the 4th international conference on knowledge discovery and data mining (pp. 149–153).
Borges, J., & Levene, M. (1999). Data mining of user navigation patterns. In WEBKDD ’99: revised papers from the international workshop on web usage analysis and user profiling (pp. 92–111). Berlin: Springer.
Google Scholar
Borges, J., & Levene, M. (2004). A dynamic clustering-based Markov model for web usage mining. In CoRR: the computing research repository. cs.IR/0406032, June 2004.
Callut, J. (2007). First passage times dynamics in Markov models with applications to HMM: induction, sequence classification and graph mining. PhD thesis, Université Catholique de Louvain.
Carrasco, R. C., & Oncina, J. (1994). Learning stochastic regular grammars by means of a state merging method. In Proceedings of the 2nd international colloquium on grammatical inference (Vol. 862, pp. 139–152). Berlin: Springer.
Google Scholar
de la Higuera, C. (2005). A bibliographical study of grammatical inference. Pattern Recognition, 38(9), 1332–1348.
Article Google Scholar
Dupont, P., Denis, F., & Esposito, Y. (2005). Links between probabilistic automata and hidden Markov models: probability distributions, learning models and induction algorithms. Pattern Recognition, 38(9), 1349–1371.
Article MATH Google Scholar
Dupont, P., Callut, J., Dooms, G., Monette, J.-N., & Deville, Y. (2006). Relevant subgraph extraction from random walks in a graph (Technical Report 2006-2007). UCL/FSA/INGI, November 2006.
Evfimievski, A. V., Srikant, R., Agrawal, R., & Gehrke, J. (2004). Privacy preserving mining of association rules. Information Systems, 29(4), 343–364.
Article Google Scholar
Fisher, R. A. (1922). On the interpretation of chi-square from the contingency tables, and the calculation of P. Journal of the Royal Statistical Society, 85, 87–94.
Article Google Scholar
Garofalakis, M., Rastogi, R., & Shim, K. (2002). Mining sequential patterns with regular expression constraints. IEEE Transactions on Knowledge and Data Engineering, 14(3), 530–552.
Article Google Scholar
Gionis, A., Mannila, H., Mielikainen, T., & Tsaparas, P. (2006). Assessing data mining results via swap randomization. In KDD ’06: proceedings of the 12th international conference on knowledge discovery and data mining (pp. 167–176).
Gold, E. M. (1978). Complexity of automaton identification from given data. Information and Control, 37(3), 302–320.
Article MATH MathSciNet Google Scholar
Han, J., Altman, R. B., Kumar, V., Mannila, H., & Pregibon, D. (2002). Emerging scientific applications in data mining. Communications of the ACM, 45(8), 54–58.
Article Google Scholar
Hingston, P. (2002). Using finite state automata for sequence mining. In Proceedings of the 25th Australasian conference on computer science (pp. 105–110). Australian Computer Society.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13–30.
Article MATH MathSciNet Google Scholar
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70.
MATH MathSciNet Google Scholar
Klemettinen, M., Mannila, H., & Toivonen, H. (1999). Interactive exploration of interesting findings in the telecommunication network alarm sequence analyzer. Information & Software Technology, 41(9), 557–567.
Article Google Scholar
Kosala, R., & Blockeel, H. (2000). Web mining research: a survey. SIGKDD Explorations, 2(1), 1–15.
Article Google Scholar
Laur, P., Nock, R., Symphor, J., & Poncelet, P. (2007a). Mining evolving data streams for frequent patterns. Pattern Recognition, 40(2), 492–503.
Article MATH Google Scholar
Laur, P., Symphor, J., Nock, R., & Poncelet, P. (2007b). Statistical supports for mining sequential patterns and improving the incremental update process on data streams. Intelligent Data Analysis, 11(1), 29–47.
Google Scholar
Mannila, H., Toivonen, H., & Verkamo, A. I. (1997). Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3), 259–289.
Article Google Scholar
Megiddo, N., & Srikant, R. (1998). Discovering predictive association rules. In Knowledge discovery and data mining (pp. 274–278).
Newton, E. M., Sweeney, L., & Malin, B. (2005). Preserving privacy by de-identifying face images. IEEE Transactions on Knowledge and Data Engineering, 17(2), 232–243.
Article Google Scholar
Pearson, K. (1900). On a criterion that a given system of deviations from the probable in the case of correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophy Magazine, 50, 157–175.
Google Scholar
Pei, J., Han, J., & Wang, W. (2002). Mining sequential patterns with constraints in large databases. In Proceedings of the 11th international conference on information and knowledge management (pp. 18–25). New York: ACM.
Google Scholar
Reber, A. S. (1967). Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6, 855–863.
Article Google Scholar
Shaffer, J. (1995). Multiple hypothesis-testing. Annual Review of Psychology, 46, 561–584.
Article Google Scholar
Spiliopoulou, M., & Pohle, C. (2001). Data mining for measuring and improving the success of web sites. Data Mining and Knowledge Discovery, 5(1–2), 85–114.
Article MATH Google Scholar
Srikant, R., & Agrawal, R. (1996). Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the 5th international conference on extending database technology (Vol. 1057, pp. 3–17). Berlin: Springer.
Google Scholar
Sweeney, L. (2002). k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 557–570.
Article MATH MathSciNet Google Scholar
Valiant, L. G. (1984). A theory of the learnable. In Proceedings of the 16th annual ACM symposium on theory of computing (pp. 436–445). New York: ACM.
Google Scholar
Verykios, V. S., Bertino, E., Fovino, I. N., Provenza, L. P., Saygin, Y., & Theodoridis, Y. (2004). State-of-the-art in privacy preserving data mining. SIGMOD Record, 33(1), 50–57.
Article Google Scholar
Webb, G. I. (2007). Discovering significant patterns. Machine Learning, 68(1), 1–33.
Article Google Scholar
Zaki, M. J. (2000). Sequence mining in categorical domains: incorporating constraints. In Proceedings of the 9th international conference on information and knowledge management (pp. 422–429). New York: ACM.
Google Scholar
Zaki, M. J. (2001). Spade: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1–2), 31–60.
Article MATH Google Scholar

Download references

Author information

Authors and Affiliations

Laboratoire Hubert Curien, UMR 5516 Université Jean Monnet, 18, rue Benoît Lauras, Bâtiment F, 42000, Saint-Étienne, France
Stéphanie Jacquemont, François Jacquenet & Marc Sebban

Authors

Stéphanie Jacquemont
View author publications
You can also search for this author in PubMed Google Scholar
François Jacquenet
View author publications
You can also search for this author in PubMed Google Scholar
Marc Sebban
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Stéphanie Jacquemont.

Additional information

Editors: Thomas Gärtner, Gemma C. Garriga.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Jacquemont, S., Jacquenet, F. & Sebban, M. Mining probabilistic automata: a statistical view of sequential pattern mining. Mach Learn 75, 91–127 (2009). https://doi.org/10.1007/s10994-008-5098-y

Download citation

Received: 09 March 2007
Revised: 12 December 2008
Accepted: 16 December 2008
Published: 08 January 2009
Issue Date: April 2009
DOI: https://doi.org/10.1007/s10994-008-5098-y

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Mining probabilistic automata: a statistical view of sequential pattern mining

Abstract

Article PDF

Similar content being viewed by others

Overview on Sequential Mining Algorithms and Their Extensions

Sequential Pattern Mining

Trajectory Data Pattern Mining

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Mining probabilistic automata: a statistical view of sequential pattern mining

Abstract

Article PDF

Similar content being viewed by others

Overview on Sequential Mining Algorithms and Their Extensions

Sequential Pattern Mining

Trajectory Data Pattern Mining

Explore related subjects

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation