Abstract
Web Directories provide a way of locating relevant information on the Web. Typically, Web Directories rely on humans investing significant time and effort in finding important pages on the Web and categorizing them in the Directory. In this paper we present a method for automating the creation of a Web Directory. At a high level, our method takes as input a subject hierarchy and a collection of pages. We first leverage a variety of lexical resources from the Natural Language Processing community to enrich our hierarchy. We then process the pages and identify sequences of important terms, which are referred to as lexical chains. Finally, we use the lexical chains to decide where in the enriched subject hierarchy each page should be assigned. Our experimental results with real Web data show that our method is promising for assisting humans during page categorization.
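The pipeline outlined in the abstract (enrich a hierarchy, extract lexical chains from a page, assign the page to a topic) can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `RELATED` lexicon and `HIERARCHY` mapping are hypothetical hand-written stand-ins for the lexical resources and the enriched subject hierarchy, the greedy grouping in `build_chains` is a simplified chaining heuristic, and the overlap-based score in `assign_topic` is one plausible choice rather than the authors' actual scoring function.

```python
from collections import defaultdict

# Hypothetical toy lexicon: each term maps to the concepts it is related to.
# A real system would derive such relations from a resource like WordNet.
RELATED = {
    "guitar": {"music", "instrument"},
    "piano": {"music", "instrument"},
    "concert": {"music"},
    "goal": {"football", "sport"},
    "striker": {"football", "sport"},
}

# Hypothetical enriched hierarchy: topic -> terms attached to that topic.
HIERARCHY = {
    "Arts/Music": {"music", "instrument", "guitar", "piano", "concert"},
    "Sports/Football": {"football", "sport", "goal", "striker"},
}


def build_chains(tokens):
    """Greedily group tokens that share at least one related concept."""
    chains = []  # each chain is (member tokens, concepts linking them)
    for tok in tokens:
        concepts = RELATED.get(tok)
        if concepts is None:
            continue  # skip terms the toy lexicon does not know
        for members, linked in chains:
            if linked & concepts:
                members.append(tok)
                linked |= concepts  # in-place update of the chain's concept set
                break
        else:
            chains.append(([tok], set(concepts)))
    return chains


def assign_topic(tokens):
    """Return the topic whose terms overlap most with the page's chains."""
    scores = defaultdict(int)
    for members, linked in build_chains(tokens):
        terms = set(members) | linked
        for topic, topic_terms in HIERARCHY.items():
            # Assumed scoring: longer chains weigh more heavily.
            scores[topic] += len(members) * len(terms & topic_terms)
    return max(scores, key=scores.get) if scores else None


page = "the guitar and piano concert had one goal".split()
print(build_chains(page))
print(assign_topic(page))
```

In this sketch a page with no known terms yields no chains and is left unassigned (`None`), which mirrors the idea that the method assists, rather than replaces, a human categorizer.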
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Stamou, S., Krikos, V., Kokosis, P., Ntoulas, A., Christodoulakis, D. (2005). Web Directory Construction Using Lexical Chains. In: Montoyo, A., Muñoz, R., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2005. Lecture Notes in Computer Science, vol 3513. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11428817_13
DOI: https://doi.org/10.1007/11428817_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26031-8
Online ISBN: 978-3-540-32110-1
eBook Packages: Computer Science, Computer Science (R0)