Abstract
This paper presents the INFOCLAS system, which applies statistical information retrieval methods to classify German business letters into message types such as order, offer, and enclosure. INFOCLAS is a first step towards the understanding of documents, proceeding to a classification-driven extraction of information. The system comprises two main modules: the central indexer (extraction and weighting of indexing terms) and the classifier (classification of business letters into given types). It employs several knowledge sources, including a letter database, word frequency statistics for German, lists of message-type-specific words, morphological knowledge, and the underlying document structure. As output, the system produces a set of weighted hypotheses about the type of the letter at hand. Classifying documents allows letters to be distributed or archived automatically and is also an excellent starting point for higher-level document analysis.
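The two-module design described above (an indexer that weights terms, and a classifier that returns weighted hypotheses) can be illustrated with a minimal sketch. The abstract does not state the weighting or matching scheme, so the tf-idf weights and cosine similarity used here are assumptions taken from standard vector-space retrieval, not the INFOCLAS method itself. The training letters and the names `TRAINING`, `build_index`, and `classify` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy letters per message type (illustrative only, not from the paper).
TRAINING = {
    "order": ["wir bestellen hiermit zehn drucker",
              "bestellung von papier und toner"],
    "offer": ["wir bieten ihnen folgendes angebot an",
              "unser angebot gilt bis ende mai"],
    "enclosure": ["als anlage erhalten sie den katalog",
                  "in der anlage finden sie die rechnung"],
}


def tokenize(text):
    """Lowercase and split on whitespace (no morphological analysis here)."""
    return text.lower().split()


def build_index(training):
    """Indexer: merge each type's letters and weight terms by tf-idf (assumed scheme)."""
    counts = {mtype: Counter(tok for letter in letters for tok in tokenize(letter))
              for mtype, letters in training.items()}
    n_types = len(counts)
    doc_freq = Counter(term for c in counts.values() for term in c)
    idf = {term: math.log(n_types / df) + 1.0 for term, df in doc_freq.items()}
    vectors = {mtype: {term: tf * idf[term] for term, tf in c.items()}
               for mtype, c in counts.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(letter, idf, vectors):
    """Classifier: return (message type, score) hypotheses, best first."""
    tf = Counter(tok for tok in tokenize(letter) if tok in idf)
    query = {term: n * idf[term] for term, n in tf.items()}
    scores = {mtype: cosine(query, vec) for mtype, vec in vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    idf, vectors = build_index(TRAINING)
    for mtype, score in classify("wir bestellen toner und papier", idf, vectors):
        print(f"{mtype}: {score:.3f}")
```

Returning the full ranked list instead of a single label mirrors the "set of weighted hypotheses" mentioned in the abstract. A fuller treatment would also draw on the other knowledge sources listed there, such as morphological reduction and document structure, which this sketch omits.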
Copyright information
© 1994 Springer-Verlag London Limited
About this paper
Cite this paper
Hoch, R. (1994). Using IR Techniques for Text Classification in Document Analysis. In: Croft, B.W., van Rijsbergen, C.J. (eds) SIGIR ’94. Springer, London. https://doi.org/10.1007/978-1-4471-2099-5_4
DOI: https://doi.org/10.1007/978-1-4471-2099-5_4
Publisher Name: Springer, London
Print ISBN: 978-3-540-19889-5
Online ISBN: 978-1-4471-2099-5
eBook Packages: Springer Book Archive