Using IR Techniques for Text Classification in Document Analysis

  • Conference paper
SIGIR ’94

Abstract

This paper presents the INFOCLAS system, which applies statistical methods of information retrieval to classify German business letters into message types such as order, offer, and enclosure. INFOCLAS is a first step towards document understanding, leading to a classification-driven extraction of information. The system is composed of two main modules: the central indexer (extraction and weighting of indexing terms) and the classifier (classification of business letters into given types). The system employs several knowledge sources, including a letter database, word frequency statistics for German, lists of message-type-specific words, morphological knowledge, and the underlying document structure. As output, the system produces a set of weighted hypotheses about the type of the letter at hand. Classification of documents allows the automatic distribution or archiving of letters and is also an excellent starting point for higher-level document analysis.
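The two-module design described in the abstract (an indexer that extracts and weights terms, and a classifier that returns weighted hypotheses) can be illustrated with a minimal sketch. This is not the INFOCLAS implementation: it assumes a generic tf-idf weighting and cosine similarity from the vector space model, plain whitespace tokenization instead of German morphological analysis, and hypothetical names (`build_index`, `classify`) and toy English letters chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens; no morphology or stop words.
    return text.lower().split()


def build_index(labelled_letters):
    """Indexer sketch: idf weights plus one summed tf-idf vector per type."""
    docs = [(label, Counter(tokenize(text))) for label, text in labelled_letters]
    n_docs = len(docs)
    doc_freq = Counter()
    for _, counts in docs:
        doc_freq.update(counts.keys())
    # Smoothed idf, so terms occurring in every letter keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in doc_freq.items()}
    class_vectors = {}
    for label, counts in docs:
        vec = class_vectors.setdefault(label, Counter())
        for term, tf in counts.items():
            vec[term] += tf * idf[term]
    return idf, class_vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, idf, class_vectors):
    """Classifier sketch: (message_type, score) hypotheses, best first."""
    counts = Counter(tokenize(text))
    vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scores = [(label, cosine(vec, cv)) for label, cv in class_vectors.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Toy training letters (hypothetical data, English for readability).
letters = [
    ("order", "we order ten units of paper"),
    ("offer", "we offer a discount price on paper"),
    ("order", "please deliver our order next week"),
]
idf, class_vectors = build_index(letters)
hypotheses = classify("we would like to order twenty units", idf, class_vectors)
```

Here `hypotheses` is a ranked list with one weighted entry per message type, which mirrors the "set of weighted hypotheses" in the abstract. A real system would add the other knowledge sources the abstract lists, such as type-specific word lists and document structure.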

Copyright information

© 1994 Springer-Verlag London Limited

About this paper

Cite this paper

Hoch, R. (1994). Using IR Techniques for Text Classification in Document Analysis. In: Croft, B.W., van Rijsbergen, C.J. (eds) SIGIR ’94. Springer, London. https://doi.org/10.1007/978-1-4471-2099-5_4

  • DOI: https://doi.org/10.1007/978-1-4471-2099-5_4

  • Publisher Name: Springer, London

  • Print ISBN: 978-3-540-19889-5

  • Online ISBN: 978-1-4471-2099-5

  • eBook Packages: Springer Book Archive