Extracting Keyphrases from Web Pages

Ferrara, Felice; Tasso, Carlo

doi:10.1007/978-3-642-35834-0_11


Part of the book series: Communications in Computer and Information Science (CCIS, volume 354)

Included in the following conference series:

Italian Research Conference on Digital Libraries


Abstract

Social tagging systems allow people to classify Web resources using freely chosen terms, commonly called tags. However, shifting the classification task from a small set of experts to a larger, untrained set of people makes the resulting classifications less accurate. The lack of control and guidelines produces noisy tags (i.e., tags without clear semantics), which lower the precision of user-generated classifications. To address this limitation, several tools proposed in the literature suggest tags that properly describe a given resource. In contrast, we propose to suggest n-grams (called keyphrases), on the assumption that sequences of two or three terms are better at resolving potential ambiguities. More specifically, in this work we identify a set of features that characterize n-grams suitable for describing the meaningful aspects of Web pages. Using these features, we developed a mechanism that supports people in classifying Web pages by automatically suggesting meaningful keyphrases.
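As a rough illustration of the general idea only, the following Python sketch enumerates candidate n-grams of one to three terms from a page's text and attaches two simple features to each. The abstract does not list the features the authors actually use, so everything here is an assumption made for illustration: the stopword list, the frequency and first-position features, the scoring formula, and the function names `candidate_ngrams` and `rank`.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "to",
             "and", "or", "by", "is", "are", "with"}


def candidate_ngrams(text, max_n=3):
    """Collect n-grams of 1..max_n terms with two hypothetical features.

    Returns {ngram: {"freq": int, "first_pos": float}}, where first_pos is
    the relative position (0.0 = start of text) of the first occurrence.
    Punctuation is discarded, so n-grams may span sentence boundaries.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stats = defaultdict(lambda: {"freq": 0, "first_pos": None})
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Skip candidates that start or end with a stopword.
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            entry = stats[" ".join(gram)]
            entry["freq"] += 1
            if entry["first_pos"] is None:
                entry["first_pos"] = i / len(tokens)
    return dict(stats)


def rank(stats, top_k=5):
    """Order candidates by an ad hoc score (an assumption for this sketch):
    frequency, weighted by phrase length and by how early the phrase appears.
    """
    def score(item):
        ngram, feats = item
        return feats["freq"] * len(ngram.split()) * (1.0 - feats["first_pos"])

    ranked = sorted(stats.items(), key=score, reverse=True)
    return [ngram for ngram, _ in ranked[:top_k]]
```

Calling `rank(candidate_ngrams(page_text))` on the visible text of a page would return a short list of suggested phrases. A real system would likely add stemming, part-of-speech filtering, and features learned or tuned on annotated pages.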

Author information

Authors and Affiliations

Artificial Intelligence Lab., Department of Mathematics and Computer Science, University of Udine, Italy
Felice Ferrara & Carlo Tasso

Editor information

Editors and Affiliations

Department of Information Engineering, University of Padua, Via Gradenigo, 6/a, 35131, Padua, Italy
Maristella Agosti & Nicola Ferro
Department of Computer Science, University of Bari, Via E. Orabona, 4, 70126, Bari, Italy
Floriana Esposito & Stefano Ferilli

Cite this paper

Ferrara, F., Tasso, C. (2013). Extracting Keyphrases from Web Pages. In: Agosti, M., Esposito, F., Ferilli, S., Ferro, N. (eds) Digital Libraries and Archives. IRCDL 2012. Communications in Computer and Information Science, vol 354. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35834-0_11

DOI: https://doi.org/10.1007/978-3-642-35834-0_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35833-3
Online ISBN: 978-3-642-35834-0
eBook Packages: Computer Science, Computer Science (R0)
