Abstract
The goal of the work presented in this paper is to obtain large amounts of semistructured data from the web. Harvesting semistructured data is a prerequisite for large-scale query answering over web sources. We contrast our approach with that of conventional web crawlers, and describe and evaluate a five-step pipelined architecture for crawling and indexing data from both the traditional web and the Semantic Web.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Harth, A., Umbrich, J., Decker, S. (2006). MultiCrawler: A Pipelined Architecture for Crawling and Indexing Semantic Web Data. In: Cruz, I., et al. The Semantic Web - ISWC 2006. ISWC 2006. Lecture Notes in Computer Science, vol 4273. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11926078_19
DOI: https://doi.org/10.1007/11926078_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49029-6
Online ISBN: 978-3-540-49055-5
eBook Packages: Computer Science, Computer Science (R0)