EXTIRP: Baseline Retrieval from Wikipedia

Lehtonen, Miro; Doucet, Antoine

doi:10.1007/978-3-540-73888-6_12

Miro Lehtonen¹ & Antoine Doucet¹,²

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4518)

Included in the following conference series:

International Workshop of the Initiative for the Evaluation of XML Retrieval

Abstract

The Wikipedia XML documents are considered an interesting challenge to any XML retrieval system that is capable of indexing and retrieving XML without prior knowledge of the structure. Although the structure of the Wikipedia XML documents is highly irregular and thus unpredictable, EXTIRP manages to handle all the well-formed XML documents without problems. Whether the high flexibility of EXTIRP also implies high performance concerning the quality of IR has so far been a question without definite answers. The initial results do not confirm any positive answers, but instead, they tempt us to define some requirements for the XML documents that EXTIRP is expected to index. The most interesting question stemming from our results is about the line between high-quality XML markup which aids accurate IR and noisy “XML spam” that misleads flexible XML search engines.

Author information

Authors and Affiliations

Department of Computer Science, P.O. Box 68 (Gustaf Hällströmin katu 2b), FI–00014 University of Helsinki, Finland
Miro Lehtonen & Antoine Doucet
IRISA-INRIA, Campus de Beaulieu, F-35042 Rennes Cedex, France
Antoine Doucet

Editor information

Norbert Fuhr, Mounia Lalmas, Andrew Trotman

About this paper

Cite this paper

Lehtonen, M., Doucet, A. (2007). EXTIRP: Baseline Retrieval from Wikipedia. In: Fuhr, N., Lalmas, M., Trotman, A. (eds) Comparative Evaluation of XML Information Retrieval Systems. INEX 2006. Lecture Notes in Computer Science, vol 4518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73888-6_12

DOI: https://doi.org/10.1007/978-3-540-73888-6_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73887-9
Online ISBN: 978-3-540-73888-6
eBook Packages: Computer Science, Computer Science (R0)
