Abstract
As more structured documents, such as SGML or XML documents, become available on the Web, there is a growing demand for effective structured document retrieval that exploits both the content and the hierarchical structure of documents and returns document elements at an appropriate granularity. Previous work on partial retrieval of structured documents has limited applicability because it requires structured queries and cannot traverse the document structure according to the query. In this paper, we put forward a method for flexible element retrieval that retrieves relevant document elements of arbitrary granularity in response to natural language queries. The proposed techniques comprise a novel hierarchical index propagation and pruning mechanism and an algorithm for ranking document elements based on the hierarchical index. Experimental results show that our method significantly outperforms existing methods. Our method is also robust to the long-standing problems of text length normalization and threshold setting in structured document retrieval.
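The abstract names the two components (index propagation with pruning, and element ranking) but does not give their formulas. The sketch below is therefore only an illustration of the general idea, not the authors' algorithm: the `Element` class, the `decay` factor, the `min_weight` pruning threshold, and the additive scoring in `rank` are all assumptions introduced here.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """Hypothetical node of a structured document (e.g. an XML element)."""
    name: str
    text: str = ""
    children: List["Element"] = field(default_factory=list)
    index: Counter = field(default_factory=Counter)


def build_index(elem: Element, decay: float = 0.5, min_weight: float = 0.6) -> Element:
    """Build term weights bottom-up.

    Each element indexes its own text, then receives its children's term
    weights scaled by `decay` (assumed). Propagated weights below
    `min_weight` (assumed) are pruned and never reach the parent.
    """
    elem.index = Counter(elem.text.lower().split())
    for child in elem.children:
        build_index(child, decay, min_weight)
        for term, weight in child.index.items():
            propagated = weight * decay
            if propagated >= min_weight:
                elem.index[term] += propagated
    return elem


def rank(root: Element, query: str) -> List[str]:
    """Rank every element, at any depth, by summed query-term weight."""
    terms = query.lower().split()
    scored = []
    stack = [root]
    while stack:
        elem = stack.pop()
        score = sum(elem.index.get(term, 0.0) for term in terms)
        if score > 0:
            scored.append((score, elem.name))
        stack.extend(elem.children)
    # Highest score first; ties broken by element name for determinism.
    return [name for score, name in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Toy document: a root with two sections.
sec1 = Element("sec1", "xml retrieval xml")
sec2 = Element("sec2", "index pruning")
doc = build_index(Element("doc", "structured documents", [sec1, sec2]))

print(rank(doc, "xml retrieval"))  # sec1 ranks above the whole document
print(rank(doc, "pruning"))        # only sec2; the term was pruned from doc
```

In this toy setting a plain keyword query returns elements of different granularity (a section or the whole document) from a single index, and pruning keeps weakly supported terms out of ancestor elements.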
This work was performed when the author was a visiting student at Microsoft Research Asia.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cui, H., Wen, JR., Chua, TS. (2003). Hierarchical Indexing and Flexible Element Retrieval for Structured Document. In: Sebastiani, F. (ed.) Advances in Information Retrieval. ECIR 2003. Lecture Notes in Computer Science, vol 2633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36618-0_6
DOI: https://doi.org/10.1007/3-540-36618-0_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-01274-0
Online ISBN: 978-3-540-36618-8
eBook Packages: Springer Book Archive