Abstract
Conventional document search techniques are constrained by attempting to match individual keywords or phrases to source documents. As a result, they miss documents that contain semantically similar terms and so achieve relatively low recall. At the same time, processing capabilities and tools for syntactic and semantic analysis of language have advanced to the point where index-time linguistic analysis of source documents is both feasible and realistic. In this paper, we introduce document dimensions, a means of classifying or grouping terms discovered in documents. Using an enhanced version of Jakarta Lucene [1], we demonstrate that supplementing keyword analysis with some syntactic and semantic information can enhance the quality of information retrieval results.
References
Jakarta Lucene, http://jakarta.apache.org/lucene/docs/index.html
van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Butterworths, London (1980)
Salton, G., Yang, C.S.: On the specification of term values in automatic indexing. Journal of Documentation 29, 351–372 (1973)
Brin, S., Page, L.: Anatomy of a hypertextual web search engine. In: WWW7 (1998)
Brooks, T.: The semantic distance model of relevance assessment. In: Proceedings of the 61st Annual Meeting of ASIS, Pittsburgh, PA. Information Access in the Global Information Economy, vol. 35, pp. 33–44 (1998)
Budanitsky, A.: Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In: Workshop on WordNet and Other Lexical Resources, in NAACL 2000, Pittsburgh, PA, June 2001 (2000)
Dixon, M.: An overview of document mining technology
Rijke, M.V.: Beyond document retrieval. In: Trento, Nice (2003)
Yang, K.: Combining Text-, Link-, and Classification-based Retrieval Methods to Enhance Information Discovery on the Web. PhD thesis, University of North Carolina (2002)
Modelling and mining of network information systems, http://www.mathstat.dal.ca/~mominis/
Lawrence, S., Giles, C.: Indexing and retrieval of scientific literature. In: Eighth International Conference on Information and Knowledge Management (1999)
Lawrence, S.: Context in web search. In: IEEE Data Engineering Bulletin (2000)
Hu, W.: An overview of world wide web search technologies. In: International Conference on Information Systems, Analysis and Synthesis, vol. 12 (2001)
Etzioni, O.: On the instability of search engines. In: Content-Based Multimedia Information Access (RIAO), Paris, France (2000)
WebFountain, http://www.almaden.ibm.com/webfountain/
Eder, J., Koncilia, C.: Evolution of dimension data in temporal datawarehouses. Springer, Heidelberg (1998)
Roellke, T.: The accessibility dimension for structured document retrieval. Journal of Documentation (1998)
Mothé, J.: Information mining: using document dimensions to analyse a document set interactively. In: European Colloquium on IR Research: ECIR, pp. 66–77 (2001)
Mothé, J.: Doccube: Multi-dimensional visualization and exploration of large document sets. In: JASIST (Journal of American Society for Information Science and Technology) (2003)
Tsang, V., Stevenson, S.: Calculating semantic distance between word sense probability distributions. In: Proceedings of CoNLL 2004, Boston, MA, USA (2004)
Heydon, A., Najork, M.: Mercator: A scalable, extensible web crawler. World Wide Web 2, 219–229 (1999)
Mailing list archives of nutch.org, http://sourceforge.net/mailarchive/forum.php?forum_id=13068&viewmonth=%200404&viewday=26
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jayasooriya, T., Manandhar, S. (2004). Using Document Dimensions for Enhanced Information Retrieval. In: Manandhar, S., Austin, J., Desai, U., Oyanagi, Y., Talukder, A.K. (eds) Applied Computing. AACC 2004. Lecture Notes in Computer Science, vol 3285. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30176-9_19
DOI: https://doi.org/10.1007/978-3-540-30176-9_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23659-7
Online ISBN: 978-3-540-30176-9
eBook Packages: Springer Book Archive