MultiCrawler: A Pipelined Architecture for Crawling and Indexing Semantic Web Data

Harth, Andreas; Umbrich, Jürgen; Decker, Stefan

doi:10.1007/11926078_19

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4273)

Included in the following conference series:

International Semantic Web Conference

Abstract

The goal of the work presented in this paper is to obtain large amounts of semistructured data from the web. Harvesting semistructured data is a prerequisite to enabling large-scale query answering over web sources. We contrast our approach with conventional web crawlers, and describe and evaluate a five-step pipelined architecture to crawl and index data from both the traditional and the Semantic Web.

Author information

Authors and Affiliations

Digital Enterprise Research Institute, National University of Ireland, Galway
Andreas Harth, Jürgen Umbrich & Stefan Decker

Editor information

Editors and Affiliations

Department of Computer Science, University of Illinois at Chicago, 851 South Morgan Street (M/C 152), 60607, Chicago, IL, USA
Isabel Cruz
Digital Enterprise Research Institute, National University of Ireland, Galway, IDA Business Park, Lower Dangan, Galway, Ireland
Stefan Decker
TopQuadrant, 22314, VA, USA
Dean Allemang
HP Laboratories, Bristol, UK
Chris Preist
Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro (PUC Rio), Caixa Postal 38.097, 22.453-900, Rio de Janeiro, RJ, Brazil
Daniel Schwabe
Yahoo! Research, Barcelona, Spain
Peter Mika
Boeing, Phantom Works, P.O. Box 3707, m/s 7L-40, 98124-2207, Seattle, WA, USA
Mike Uschold
Technische Universiteit Eindhoven, P.O. Box 513, 5600 MB Eindhoven, The Netherlands
Lora M. Aroyo

About this paper

Cite this paper

Harth, A., Umbrich, J., Decker, S. (2006). MultiCrawler: A Pipelined Architecture for Crawling and Indexing Semantic Web Data. In: Cruz, I., et al. The Semantic Web - ISWC 2006. ISWC 2006. Lecture Notes in Computer Science, vol 4273. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11926078_19

DOI: https://doi.org/10.1007/11926078_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49029-6
Online ISBN: 978-3-540-49055-5
eBook Packages: Computer Science (R0)
