Mining Taxonomies from Web Menus: Rule-Based Concepts and Algorithms

Keller, Matthias; Hartenstein, Hannes

doi:10.1007/978-3-642-39200-9_23

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7977)

Included in the following conference series:

International Conference on Web Engineering

Abstract

The logical hierarchies of Web sites (i.e., Web site taxonomies) are obvious to humans, because humans can distinguish different menu levels and their relationships. Such accurate information about the logical structure is not yet available to machines, however. Many applications would benefit if Web site taxonomies could be mined from menus, but in the past this was an almost unsolvable problem. Today, a tag newly introduced in HTML5 and novel mining methods make it possible to distinguish menus from other content, but how the underlying taxonomies can be extracted from given menus has not yet been researched. In this paper we present the first detailed analysis of the problem and introduce rule-based concepts for addressing each identified subproblem. We report on a large-scale study on mining hierarchical menus of 350 randomly selected domains. Our methods extract Web site taxonomy information that was not previously available, with high precision and high recall.

Author information

Authors and Affiliations

Steinbuch Centre for Computing, Karlsruhe Institute of Technology, D-76128, Karlsruhe, Germany
Matthias Keller & Hannes Hartenstein

Editor information

Editors and Affiliations

University of Trento, Via Sommarive 5, 38123, Povo, TN, Italy
Florian Daniel
Department of Computer Science, Aalborg University, Selma Lagerloefs Vej 300, 9220, Aalborg, Denmark
Peter Dolog
Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave., Kowloon, Hong Kong, China
Qing Li

About this paper

Cite this paper

Keller, M., Hartenstein, H. (2013). Mining Taxonomies from Web Menus: Rule-Based Concepts and Algorithms. In: Daniel, F., Dolog, P., Li, Q. (eds) Web Engineering. ICWE 2013. Lecture Notes in Computer Science, vol 7977. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39200-9_23

DOI: https://doi.org/10.1007/978-3-642-39200-9_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39199-6
Online ISBN: 978-3-642-39200-9
eBook Packages: Computer Science, Computer Science (R0)
