Abstract
In this paper, we describe a new approach to information extraction that neatly integrates top-down, hypothesis-driven information with bottom-up, data-driven information. The aim of the kelp project is to combine a variety of natural language processing techniques so that we can extract useful elements of information from a collection of documents and then re-present this information in a manner that is tailored to the needs of a specific user. Our focus here is on how we can build richly structured data objects by extracting information from web pages; as an example, we describe our methods in the context of extracting information from web pages that describe laptop computers. Our approach, which we call path-merging, involves using relatively simple techniques for identifying what are normally referred to as named entities, then allowing more sophisticated and intelligent techniques to combine these elements of information. In effect, we view the text as providing a collection of jigsaw-piece-like elements of information, which then have to be combined to produce a representation of the useful content of the document. A principal goal of this work is to separate the different components of the information extraction task so as to increase portability.
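The abstract does not spell out the merging procedure, so the following is only an illustrative sketch of the jigsaw idea, not the authors' algorithm. It assumes that each extracted element can be written as a path of attribute names ending in a value, and that merging means folding those paths into one nested structure while rejecting contradictory values. The function name `merge_paths`, the tuple representation, and the laptop attribute names are all hypothetical.

```python
from typing import Any, Dict, Iterable, Tuple

# Assumed representation: a path is a tuple of attribute names,
# e.g. ("laptop", "memory", "size"). This is not taken from the paper.
Path = Tuple[str, ...]


def merge_paths(fragments: Iterable[Tuple[Path, Any]]) -> Dict[str, Any]:
    """Fold (path, value) fragments into one nested dictionary.

    Fragments that share a prefix end up under the same node; two
    fragments that assign different values to the same path are
    treated as a conflict and raise ValueError.
    """
    root: Dict[str, Any] = {}
    for path, value in fragments:
        if not path:
            raise ValueError("empty path")
        node = root
        # Walk (or create) the intermediate nodes along the path.
        for key in path[:-1]:
            child = node.setdefault(key, {})
            if not isinstance(child, dict):
                raise ValueError(f"conflict at {key!r}: a value is already set")
            node = child
        leaf = path[-1]
        if leaf in node and node[leaf] != value:
            raise ValueError(f"conflict at {'/'.join(path)}")
        node[leaf] = value
    return root


# Hypothetical fragments, as a named-entity pass over a laptop page might yield.
fragments = [
    (("laptop", "memory", "size"), "256 MB"),
    (("laptop", "screen", "size"), "15 inch"),
    (("laptop", "memory", "type"), "SDRAM"),
]
laptop_record = merge_paths(fragments)
```

In this sketch the simple extraction step only has to emit small, independent fragments; all knowledge about how they fit together lives in the merge step, which is one way to read the separation of components that the abstract argues for.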
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dale, R., Paris, C., Tilbrook, M. (2003). Information Extraction via Path Merging. In: Gedeon, T.D., Fung, L.C.C. (eds) AI 2003: Advances in Artificial Intelligence. AI 2003. Lecture Notes in Computer Science, vol 2903. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24581-0_13
DOI: https://doi.org/10.1007/978-3-540-24581-0_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20646-0
Online ISBN: 978-3-540-24581-0
eBook Packages: Springer Book Archive