Abstract
During the past decade, sequential pattern mining has been the core of numerous research efforts. It is now possible to efficiently extract knowledge of users’ behavior from a huge set of sequences collected over time. This has applications in various domains such as purchases in supermarkets, Web site visits, etc. However, sequence mining algorithms do little to control the risks of extracting false discoveries or overlooking true knowledge. In this paper, the theoretical conditions to achieve a relevant sequence mining process are examined. Then, the article offers a statistical view of sequence mining which has the following advantages: First, it uses a compact and generalized representation of the original sequences in the form of a probabilistic automaton. Second, it integrates statistical constraints to guarantee the extraction of significant patterns. Finally, it provides an interesting solution in a privacy preserving context in order to respect individuals’ information. An application in car flow modeling is presented, showing the ability of our algorithm (acsm) to discover frequent routes without any private information. Comparisons with a classical sequence mining algorithm (spam) are made, showing the effectiveness of our approach.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In Proceedings of the 11th international conference on data engineering (pp. 3–14). Los Alamitos: IEEE Computer Society.
Agrawal, R., & Srikant, R. (2000). Privacy-preserving data mining. In Proceedings of the ACM SIGMOD conference on management of data (pp. 439–450). New York: ACM.
Ayres, J., Flannick, J., Gehrke, J., & Yiu, T. (2002). Sequential pattern mining using a bitmap representation. In Proceedings of the 8th international conference on knowledge discovery and data mining (pp. 429–435). New York: ACM.
Bayardo, R. J., & Agrawal, R. (2005). Data privacy through optimal k-anonymization. In Proceedings of the 21st international conference on data engineering (pp. 217–228). Los Alamitos: IEEE Computer Society.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A new and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289–300.
Borges, J., & Levene, M. (1998). Mining association rules in hypertext databases. In Proceedings of the 4th international conference on knowledge discovery and data mining (pp. 149–153).
Borges, J., & Levene, M. (1999). Data mining of user navigation patterns. In WEBKDD ’99: revised papers from the international workshop on web usage analysis and user profiling (pp. 92–111). Berlin: Springer.
Borges, J., & Levene, M. (2004). A dynamic clustering-based Markov model for web usage mining. In CoRR: the computing research repository. cs.IR/0406032, June 2004.
Callut, J. (2007). First passage times dynamics in Markov models with applications to HMM: induction, sequence classification and graph mining. PhD thesis, Université Catholique de Louvain.
Carrasco, R. C., & Oncina, J. (1994). Learning stochastic regular grammars by means of a state merging method. In Proceedings of the 2nd international colloquium on grammatical inference (Vol. 862, pp. 139–152). Berlin: Springer.
de la Higuera, C. (2005). A bibliographical study of grammatical inference. Pattern Recognition, 38(9), 1332–1348.
Dupont, P., Denis, F., & Esposito, Y. (2005). Links between probabilistic automata and hidden Markov models: probability distributions, learning models and induction algorithms. Pattern Recognition, 38(9), 1349–1371.
Dupont, P., Callut, J., Dooms, G., Monette, J.-N., & Deville, Y. (2006). Relevant subgraph extraction from random walks in a graph (Technical Report 2006-2007). UCL/FSA/INGI, November 2006.
Evfimievski, A. V., Srikant, R., Agrawal, R., & Gehrke, J. (2004). Privacy preserving mining of association rules. Information Systems, 29(4), 343–364.
Fisher, R. A. (1922). On the interpretation of chi-square from the contingency tables, and the calculation of P. Journal of the Royal Statistical Society, 85, 87–94.
Garofalakis, M., Rastogi, R., & Shim, K. (2002). Mining sequential patterns with regular expression constraints. IEEE Transactions on Knowledge and Data Engineering, 14(3), 530–552.
Gionis, A., Mannila, H., Mielikainen, T., & Tsaparas, P. (2006). Assessing data mining results via swap randomization. In KDD ’06: proceedings of the 12th international conference on knowledge discovery and data mining (pp. 167–176).
Gold, E. M. (1978). Complexity of automaton identification from given data. Information and Control, 37(3), 302–320.
Han, J., Altman, R. B., Kumar, V., Mannila, H., & Pregibon, D. (2002). Emerging scientific applications in data mining. Communications of the ACM, 45(8), 54–58.
Hingston, P. (2002). Using finite state automata for sequence mining. In Proceedings of the 25th Australasian conference on computer science (pp. 105–110). Australian Computer Society.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13–30.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70.
Klemettinen, M., Mannila, H., & Toivonen, H. (1999). Interactive exploration of interesting findings in the telecommunication network alarm sequence analyzer. Information & Software Technology, 41(9), 557–567.
Kosala, R., & Blockeel, H. (2000). Web mining research: a survey. SIGKDD Explorations, 2(1), 1–15.
Laur, P., Nock, R., Symphor, J., & Poncelet, P. (2007a). Mining evolving data streams for frequent patterns. Pattern Recognition, 40(2), 492–503.
Laur, P., Symphor, J., Nock, R., & Poncelet, P. (2007b). Statistical supports for mining sequential patterns and improving the incremental update process on data streams. Intelligent Data Analysis, 11(1), 29–47.
Mannila, H., Toivonen, H., & Verkamo, A. I. (1997). Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3), 259–289.
Megiddo, N., & Srikant, R. (1998). Discovering predictive association rules. In Knowledge discovery and data mining (pp. 274–278).
Newton, E. M., Sweeney, L., & Malin, B. (2005). Preserving privacy by de-identifying face images. IEEE Transactions on Knowledge and Data Engineering, 17(2), 232–243.
Pearson, K. (1900). On a criterion that a given system of deviations from the probable in the case of correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophy Magazine, 50, 157–175.
Pei, J., Han, J., & Wang, W. (2002). Mining sequential patterns with constraints in large databases. In Proceedings of the 11th international conference on information and knowledge management (pp. 18–25). New York: ACM.
Reber, A. S. (1967). Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6, 855–863.
Shaffer, J. (1995). Multiple hypothesis-testing. Annual Review of Psychology, 46, 561–584.
Spiliopoulou, M., & Pohle, C. (2001). Data mining for measuring and improving the success of web sites. Data Mining and Knowledge Discovery, 5(1–2), 85–114.
Srikant, R., & Agrawal, R. (1996). Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the 5th international conference on extending database technology (Vol. 1057, pp. 3–17). Berlin: Springer.
Sweeney, L. (2002). k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 557–570.
Valiant, L. G. (1984). A theory of the learnable. In Proceedings of the 16th annual ACM symposium on theory of computing (pp. 436–445). New York: ACM.
Verykios, V. S., Bertino, E., Fovino, I. N., Provenza, L. P., Saygin, Y., & Theodoridis, Y. (2004). State-of-the-art in privacy preserving data mining. SIGMOD Record, 33(1), 50–57.
Webb, G. I. (2007). Discovering significant patterns. Machine Learning, 68(1), 1–33.
Zaki, M. J. (2000). Sequence mining in categorical domains: incorporating constraints. In Proceedings of the 9th international conference on information and knowledge management (pp. 422–429). New York: ACM.
Zaki, M. J. (2001). Spade: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1–2), 31–60.
Author information
Authors and Affiliations
Corresponding author
Additional information
Editors: Thomas Gärtner, Gemma C. Garriga.
Rights and permissions
About this article
Cite this article
Jacquemont, S., Jacquenet, F. & Sebban, M. Mining probabilistic automata: a statistical view of sequential pattern mining. Mach Learn 75, 91–127 (2009). https://doi.org/10.1007/s10994-008-5098-y
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10994-008-5098-y