Abstract
In this paper, we consider the problem of discovering a simple class of combinatorial patterns from a large collection of unstructured text data. As our data mining framework, we adopt optimized pattern discovery, in which a mining algorithm finds the patterns that optimize a given statistical measure within a class of hypothesis patterns on a given data set. We present efficient algorithms for classes of proximity word association patterns and report experiments on keyword discovery from Web data.
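To make the framework concrete, the following is a minimal brute-force sketch of optimized pattern discovery, not the efficient algorithms of the chapter. It rests on several assumptions that the abstract does not state: a pattern is taken to be a pair of words plus a proximity bound k, a document matches when the second word occurs at most k tokens after the first, the data set is split into positive and negative documents, and the statistical measure is the number of misclassified documents. All function names and the sample data are hypothetical.

```python
from itertools import product


def matches(tokens, w1, w2, k):
    """Return True if w2 occurs at most k tokens after some occurrence of w1."""
    positions_w1 = [i for i, t in enumerate(tokens) if t == w1]
    positions_w2 = [j for j, t in enumerate(tokens) if t == w2]
    return any(0 < j - i <= k for i in positions_w1 for j in positions_w2)


def classification_error(pattern, positives, negatives):
    """Assumed measure: positives the pattern misses plus negatives it hits."""
    w1, w2, k = pattern
    missed = sum(not matches(doc, w1, w2, k) for doc in positives)
    false_hits = sum(matches(doc, w1, w2, k) for doc in negatives)
    return missed + false_hits


def best_pattern(positives, negatives, k):
    """Exhaustively search all word pairs drawn from the positive documents.

    This is quadratic in the vocabulary size and is only meant to show
    what "optimizing a measure over a hypothesis class" means.
    """
    vocabulary = sorted({t for doc in positives for t in doc})
    candidates = [(w1, w2, k) for w1, w2 in product(vocabulary, repeat=2)]
    return min(candidates,
               key=lambda p: classification_error(p, positives, negatives))


# Hypothetical toy data: tokenized documents with class labels.
positives = ["data mining from text".split(),
             "mining large text data".split()]
negatives = ["text editor for data entry".split(),
             "coal mining".split()]

result = best_pattern(positives, negatives, k=2)
print(result, classification_error(result, positives, negatives))
```

On this toy data the search selects the pattern ("mining", "text", 2), which matches both positive documents and neither negative one. The exhaustive enumeration is the part an efficient algorithm would replace.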
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Arimura, H., Sakamoto, H., Arikawa, S. (2002). Efficient Data Mining from Large Text Databases. In: Arikawa, S., Shinohara, A. (eds) Progress in Discovery Science. Lecture Notes in Computer Science, vol 2281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45884-0_6
DOI: https://doi.org/10.1007/3-540-45884-0_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43338-5
Online ISBN: 978-3-540-45884-5
eBook Packages: Springer Book Archive