Efficient Data Mining from Large Text Databases

Arimura, Hiroki; Sakamoto, Hiroshi; Arikawa, Setsuo

doi:10.1007/3-540-45884-0_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2281)

Abstract

In this paper, we consider the problem of discovering a simple class of combinatorial patterns from a large collection of unstructured text data. As the data mining framework, we adopt optimized pattern discovery, in which a mining algorithm finds the patterns that optimize a given statistical measure within a class of hypothesis patterns on a given data set. We present efficient algorithms for the classes of proximity word association patterns and report experiments on keyword discovery from Web data.
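
As a rough illustration of optimized pattern discovery, the sketch below searches a small hypothesis class by brute force and returns the pattern with the lowest empirical classification error. It is not the efficient algorithm of the chapter. Several things here are assumptions made for the example: a proximity word association pattern is taken to be a sequence of d words that must occur in order, each at most k tokens after the previous one; the statistical measure is taken to be the number of misclassified documents; and the names `matches`, `best_pattern`, `positives`, and `negatives` are hypothetical.

```python
from itertools import product


def matches(words, pattern, k):
    """Return True if the pattern words occur in `words` in order,
    each one at most k tokens after the previous matched word.
    (Assumed matching semantics, chosen for illustration.)"""

    def search(start, idx, prev):
        if idx == len(pattern):
            return True
        # The first word may occur anywhere; later words must fall
        # within k positions after the previously matched word.
        hi = len(words) if prev is None else min(len(words), prev + k + 1)
        for pos in range(start, hi):
            if words[pos] == pattern[idx] and search(pos + 1, idx + 1, pos):
                return True
        return False

    return search(0, 0, None)


def best_pattern(positives, negatives, d=2, k=2):
    """Brute-force search over all d-word patterns built from the
    vocabulary of the positive documents. Returns the first pattern
    with the fewest misclassified documents, and that error count."""
    vocab = sorted({w for doc in positives for w in doc})
    best, best_err = None, None
    for pattern in product(vocab, repeat=d):
        false_neg = sum(not matches(doc, pattern, k) for doc in positives)
        false_pos = sum(matches(doc, pattern, k) for doc in negatives)
        err = false_neg + false_pos
        if best_err is None or err < best_err:
            best, best_err = pattern, err
    return best, best_err


# Toy data: documents are token lists, split into a target set
# (positives) and a background set (negatives).
positives = [
    "data mining from large text".split(),
    "efficient data stream mining".split(),
    "text data mining algorithms".split(),
]
negatives = [
    "mining coal data report".split(),
    "large text collection".split(),
]

pattern, error = best_pattern(positives, negatives, d=2, k=2)
print(pattern, error)
```

The exhaustive loop examines every d-tuple over the vocabulary, so its cost grows exponentially in d; it is meant only to make the optimization objective concrete, not to scale to large text collections.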



Author information

Authors and Affiliations

Department of Informatics, Kyushu Univ., 812-8581, Fukuoka, Japan
Hiroki Arimura, Hiroshi Sakamoto & Setsuo Arikawa
PRESTO, Japan Science and Technology Corporation, Japan
Hiroki Arimura

Editor information

Editors and Affiliations

Department of Informatics, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, 812-8581, Fukuoka, Japan
Setsuo Arikawa & Ayumi Shinohara

Cite this chapter

Arimura, H., Sakamoto, H., Arikawa, S. (2002). Efficient Data Mining from Large Text Databases. In: Arikawa, S., Shinohara, A. (eds) Progress in Discovery Science. Lecture Notes in Computer Science, vol 2281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45884-0_6

DOI: https://doi.org/10.1007/3-540-45884-0_6
Published: 14 March 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43338-5
Online ISBN: 978-3-540-45884-5
eBook Packages: Springer Book Archive
