Abstract
We study efficient discovery of proximity word-association patterns, defined by a sequence of strings and a proximity gap, from a collection of texts with the positive and the negative labels. We present an algorithm that finds alld-stringsk-proximity word-association patterns that maximize the number of texts whose matching agree with their labels. It runs in expected time complexityO(k d−1n logd n) and spaceO(k d−1n) with the total lengthn of texts, if texts are uniformly random strings. We also show that the problem to find one of the best word-association patterns with arbitrarily many strings in MAX SNP-hard.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Agrawal, R., Imielinski, T. and Swami, A., “Mining Association Rules between Sets of Items in Large Databases,” inProc. 1993 SIGMOD, pp. 207–216, 1993.
Ausiello, G., Crescenzi, P. and Protasi, M., “Approximate Solution of NP Optimization Problems,”Theor. Comput. Sc., 150, pp. 1–55, 1995.
Arimura, H., Wataki, A., Fujino, R. and Arikawa, S., “A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases,” inProc. the 9th Int. Workshop on Algorithmic Learning Theory, LNAI 1501, pp. 247–261, 1998.
Arimura, H., Kasai, T., Wataki, A., Fujino, R., Shimozono, S. and Arikawa, S., “An Efficient Tool for Discovering Simple Combinatorial Patterns from Large Text Databases,” inProc. the 1st Discovery Science, LNAI 1532, pp. 393–394, 1998.
Devroye, L., Szpankowski, W. and Rais, B., “A Note on the Height of the Suffix Trees,”SIAM J. Comput.,21, pp. 48–53, 1992.
Dobkin, D. P. Gunopulos, D. and Maass, W., “Computing the Maximum Bichromatic Discrepancy, with Applications to Computer Graphics and Machine Learning,”J. Comut. Sys. Sci., 52, pp. 453–470, 1996.
Fukuda, T. Morimoto, Y., Morishita, S. and Tokuyama, T., “Data Mining Using Two-Dimensional Optimized Association Rules,” inProc. 1996 SIGMOD, pp. 13–23, 1996.
Johnson, D. S., “Approximation Algorithms for Combinatorial Problems,”J. Comut. Sys. Sci., 9, pp. 256–278, 1974.
Kearns, M. J., Shapire, R. E. and Sellie, L. M., “Toward Efficient Agnostic Learning,”Machine Learning, 17, pp. 115–141, 1994.
Manber, U. and Baeza-Yates, R., “An Algorithm for String Matching with a Sequence of Don’t Cares,”Inf. Procc. Lett., 37, pp. 133–136, 1991.
McCreight, E. M., “A Space-Economical Suffix Tree Construction Algorithm,”J. ACM, 23, pp. 262–272, 1976.
Miyano, S., Shinohara, A. and Shinohara, T., “Which Classes of Elementary Formal Systems are Polynomial-Time Learnable,” inProc. 2nd Workshop on Algorithmic Learning Theory, pp. 139–150, 1991.
Papadimitriou, C. H. and Yannakakis, M., “Optimization, Approximation, and Complexity Classes,”J. Comput. Sys. Sci., 43, pp. 425–440, 1991.
Wang, J. T.-L., Chirn, G.-W., Marr, T. G., Shapiro, B., Shasha, D. and Zhang., K., “Combinatorial Pattern Discovery for Scientific Data: Some preliminary results,” inProc. 1994 SIGMOD, pp. 115–125, 1994.
Author information
Authors and Affiliations
Additional information
Shinichi Shimozono, Ph.D.: He is an Associate Professor of the Department of Artificial Intelligence at Kyushu Institute of Technology Iizuka, Japan. He obtained the B.S. degree in Physics from Kyushu University, awarded M.S. degree from Graduate School of Information Science in Kyushu University, and his Dr. Sci. degree in 1996 from Kyushu University. His research interests are primarily in the design and analysis of algorithms for intractable problems.
Hiroki Arimura, Ph.D.: He is an Associate Professor of the Department of Informatics at Kyushu University, Fukuoka, Japan. He is also a researcher with Precursory Research for Embryonic Science and Technology, Japan Science and Technology Corporation (JST) since 1999. He received the B.S. degree in 1988 in Physics, the M.S. degree in 1979 and the Dr.Sci. degree in 1994 in Information Systems from Kyushu University. His research interests include data mining, computational learning theory, and inductive logic programming.
Setsuo Arikawa, Ph.D.: He is a Professor of the Department of Informatics and the Director of University Library at Kyushu University, Fukuoka, Japan. He received the B.S. degree in 1964, the M.S. degree in 1966 and the Dr.Sci. degree in 1969 all in Mathematics from Kyushu University. His research interests include Discovery Science, Algorithmic Learning Theory, Logic and Inference/Reasoning in AI, Pattern Matching Algorithms and Library Science. He is the principal investigator of the Discovery Science Project sponsored by the Grant-in Aid for Scientific Research on Priority Area from the Ministry of ESSC, Japan.
About this article
Cite this article
Shimozono, S., Arimura, H. & Arikawa, S. Efficient discovery of optimal word-association patterns in large text databases. New Gener Comput 18, 49–60 (2000). https://doi.org/10.1007/BF03037568
Received:
Issue Date:
DOI: https://doi.org/10.1007/BF03037568