Abstract
Social tagging systems allow people to classify Web resources with freely chosen terms, commonly called tags. However, shifting the classification task from a small set of experts to a larger, untrained set of people makes the resulting classifications less accurate. The lack of control and guidelines generates noisy tags (i.e., tags without clear semantics), which lower the precision of user-generated classifications. To address this limitation, several tools proposed in the literature suggest to users tags that properly describe a given resource. In contrast, we propose to suggest n-grams (called keyphrases), following the idea that sequences of two or three terms can better resolve potential ambiguities. More specifically, in this work we identify a set of features that characterize n-grams suitable for describing meaningful aspects of Web pages. Using these features, we developed a mechanism that supports people in classifying Web pages by automatically suggesting meaningful keyphrases.
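As a rough illustration of the idea of suggesting n-grams of up to three terms, the following minimal sketch generates candidate n-grams from plain text and attaches a few simple features to each. The stopword list, the function names, and the three features shown (frequency, relative first position, and length) are assumptions chosen for illustration; the abstract does not state which features the authors actually use.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"a", "an", "the", "of", "to", "in", "and", "or",
             "for", "by", "is", "are", "on", "with"}


def candidate_ngrams(text, max_n=3):
    """Return (phrase, start_index) pairs for n-grams of 1..max_n terms.

    Candidates that start or end with a stopword are skipped.
    Also returns the total number of tokens in the text.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            candidates.append((" ".join(gram), i))
    return candidates, len(tokens)


def score_candidates(text, max_n=3):
    """Map each candidate phrase to a dict of illustrative features."""
    candidates, total = candidate_ngrams(text, max_n)
    freq = Counter(phrase for phrase, _ in candidates)

    # Occurrences of a phrase are visited in increasing position,
    # so the first value stored is the earliest occurrence.
    first_pos = {}
    for phrase, pos in candidates:
        first_pos.setdefault(phrase, pos)

    return {
        phrase: {
            "frequency": freq[phrase],
            # Relative position of the first occurrence (0.0 = start of text).
            "first_position": first_pos[phrase] / total if total else 0.0,
            "length": len(phrase.split()),
        }
        for phrase in freq
    }


if __name__ == "__main__":
    sample = ("Social tagging systems let users tag web pages. "
              "Social tagging is noisy.")
    for phrase, feats in sorted(score_candidates(sample).items()):
        print(phrase, feats)
```

A real system would combine such features into a ranking, for instance with a learned or hand-tuned scoring function, before suggesting the top keyphrases to the user; that step is omitted here because the abstract does not describe it.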
Keywords
- Noun Phrase
- Computational Linguistics
- Unsupervised Approach
- Normalized Discounted Cumulative Gain
- Training Document
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ferrara, F., Tasso, C. (2013). Extracting Keyphrases from Web Pages. In: Agosti, M., Esposito, F., Ferilli, S., Ferro, N. (eds) Digital Libraries and Archives. IRCDL 2012. Communications in Computer and Information Science, vol 354. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35834-0_11
DOI: https://doi.org/10.1007/978-3-642-35834-0_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35833-3
Online ISBN: 978-3-642-35834-0
eBook Packages: Computer Science (R0)