
Efficient Data Mining from Large Text Databases

Progress in Discovery Science

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2281)

Abstract

In this paper, we consider the problem of discovering a simple class of combinatorial patterns from a large collection of unstructured text data. As the data mining framework, we adopt optimized pattern discovery, in which a mining algorithm finds the patterns that optimize a given statistical measure over a class of hypothesis patterns on a given data set. We present efficient algorithms for classes of proximity word association patterns and report experiments on keyword discovery from Web data.
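To make the framework concrete, the following is a minimal sketch of optimized pattern discovery by naive enumeration. It is not the efficient algorithm the chapter presents. It assumes word-level two-word patterns, a proximity bound `k` counted in words, and classification error as the statistical measure; the function names and the toy documents are hypothetical and chosen only for illustration.

```python
from itertools import product


def matches(pattern, k, words):
    """Return True if w1 occurs and w2 follows it within the next k words."""
    w1, w2 = pattern
    for i, w in enumerate(words):
        if w == w1 and w2 in words[i + 1 : i + 1 + k]:
            return True
    return False


def classification_error(pattern, k, positives, negatives):
    """Count documents misclassified when a match predicts the positive class."""
    false_neg = sum(not matches(pattern, k, d) for d in positives)
    false_pos = sum(matches(pattern, k, d) for d in negatives)
    return false_neg + false_pos


def best_pattern(positives, negatives, k):
    """Brute-force search over all ordered word pairs drawn from the positive documents."""
    vocabulary = sorted({w for d in positives for w in d})
    candidates = product(vocabulary, repeat=2)
    return min(
        candidates,
        key=lambda p: classification_error(p, k, positives, negatives),
    )


# Toy data (assumed for illustration): tokenized positive and negative documents.
positives = [
    "data mining from text".split(),
    "text data and mining algorithms".split(),
]
negatives = [
    "mining coal data".split(),
    "text editor".split(),
]

best = best_pattern(positives, negatives, k=2)
print(best, classification_error(best, 2, positives, negatives))
```

On this toy input the search selects the pair ("data", "mining") with zero misclassified documents. The enumeration is quadratic in the vocabulary size and rescans every document per candidate, which is exactly the cost an efficient algorithm for large text collections has to avoid.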




Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Arimura, H., Sakamoto, H., Arikawa, S. (2002). Efficient Data Mining from Large Text Databases. In: Arikawa, S., Shinohara, A. (eds) Progress in Discovery Science. Lecture Notes in Computer Science, vol 2281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45884-0_6


  • DOI: https://doi.org/10.1007/3-540-45884-0_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43338-5

  • Online ISBN: 978-3-540-45884-5

  • eBook Packages: Springer Book Archive
