Abstract
To help web applications understand the content of HTML pages, an increasing number of websites annotate structured data within their pages using markup formats such as Microdata, RDFa, and Microformats. Google, Yahoo!, Yandex, Bing, and Facebook use these annotations to enrich search results and to display entity descriptions within their applications. In this paper, we present a series of publicly accessible Microdata, RDFa, and Microformats datasets that we have extracted from three large web corpora dating from 2010, 2012, and 2013. Altogether, the datasets consist of almost 30 billion RDF quads. The most recent dataset contains, among other data, over 211 million product descriptions, 54 million reviews, and 125 million postal addresses originating from thousands of websites. The availability of the datasets lays the foundation for further research on integrating and cleansing the data as well as for exploring its utility within different application contexts. As the dataset series covers four years, it can also be used to analyze how adoption of the markup formats has evolved.
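As a rough illustration of how such quads might be consumed, the following minimal sketch counts entity types (for example, products and reviews) in a handful of quads. It assumes the quads are serialized one per line in N-Quads syntax; the sample lines, the `example.com` URLs, and the helper names `parse_quad` and `count_types` are hypothetical and not taken from the datasets, and the regular expression covers only a simplified subset of N-Quads.

```python
import re
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Simplified N-Quads pattern (assumption: one quad per line, graph label present).
QUAD_RE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+'                      # subject: IRI or blank node
    r'(<[^>]*>)\s+'                                # predicate: IRI
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"'            # object: IRI, blank node, or literal
    r'(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)\s+'         # optional datatype or language tag
    r'(<[^>]*>)\s*\.\s*$'                          # graph label: IRI of the source page
)

def parse_quad(line):
    """Return (subject, predicate, object, graph), or None if the line does not match.

    Angle brackets are stripped from IRIs; blank nodes and literals are kept verbatim.
    """
    match = QUAD_RE.match(line)
    if match is None:
        return None
    return tuple(t[1:-1] if t.startswith("<") else t for t in match.groups())

def count_types(lines):
    """Count how often each class IRI appears as the object of rdf:type."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is not None and quad[1] == RDF_TYPE:
            counts[quad[2]] += 1
    return counts

# Hypothetical sample quads, made up for illustration only.
SAMPLE = [
    '_:n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> <http://example.com/p/1> .',
    '_:n1 <http://schema.org/name> "Example widget"@en <http://example.com/p/1> .',
    '_:n2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Review> <http://example.com/p/1> .',
    '_:n3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> <http://example.com/p/2> .',
]

type_counts = count_types(SAMPLE)
for class_iri, n in type_counts.most_common():
    print(class_iri, n)
```

For data at the scale described here, a dedicated RDF parser and streaming over compressed files would be a more robust choice than a regular expression; the sketch only shows the shape of the computation.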
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Meusel, R., Petrovski, P., Bizer, C. (2014). The WebDataCommons Microdata, RDFa and Microformat Dataset Series. In: Mika, P., et al. The Semantic Web – ISWC 2014. ISWC 2014. Lecture Notes in Computer Science, vol 8796. Springer, Cham. https://doi.org/10.1007/978-3-319-11964-9_18
DOI: https://doi.org/10.1007/978-3-319-11964-9_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11963-2
Online ISBN: 978-3-319-11964-9
eBook Packages: Computer Science, Computer Science (R0)