1 Introduction

Keyphrases are words that capture the main topics of a document. Extracting high-quality keyphrases can benefit various natural language processing (NLP) applications: in text summarization, keyphrases are useful as a form of semantic metadata indicating the significance of the sentences and paragraphs in which they appear (Barzilay and Elhadad 1997; Lawrie et al. 2001; D’Avanzo and Magnini 2005); in both text categorization and document clustering, keyphrases offer a means of term dimensionality reduction, and have been shown to improve system efficiency and accuracy (Zhang et al. 2004; Hammouda et al. 2005; Hulth and Megyesi 2006; Wang et al. 2008; Kim et al. 2009); and for search engines, keyphrases can supplement full-text indexing and assist users in formulating queries (Gutwin et al. 1999; Gong and Liu 2008).

Recently, a resurgence of interest in automatic keyphrase extraction has led to the development of several new systems and techniques for the task, as outlined in Sect. 2. However, a common base for evaluation has been missing, which has made it hard to perform comparative evaluation of different systems. In light of these developments, we felt that the time was ripe to conduct a shared task on keyphrase extraction, to provide a standard evaluation framework for the task to benchmark current and future systems against.

For our SemEval-2010 Task 5 on keyphrase extraction, we compiled a set of 244 scientific articles with keyphrase annotations from authors and readers. The task was to develop systems which automatically produce keyphrases for each paper. Each team was allowed to submit up to three system runs, to benchmark the contributions of different parameter settings and approaches. The output for each run was a list of 15 keyphrases per document, ranked by their estimated likelihood of being keyphrases.

In the remainder of the paper, we first detail related work (Sect. 2), then describe the task setup, including how data collection was managed and the evaluation methodology (Sects. 3, 4). We present the results of the shared task and discuss the immediate findings of the competition in Sect. 5. In Sects. 6 and 7, we present a short description of the submitted systems and estimate human performance by comparing reader-assigned keyphrases to those assigned by the authors, giving an approximation of the upper-bound performance for this task. Finally, we conclude in Sect. 8.

2 Related work

Previous work on automatic keyphrase extraction has broken down the task into four components: (1) candidate identification, (2) feature engineering, (3) developing learning models, and (4) evaluating the extracted keyphrases.

Given a document, candidate identification is the task of detecting all keyphrase candidates, in the form of nouns or noun phrases mentioned in the document. The majority of methods are based on n-grams (Frank et al. 1999; Hulth 2003; Tomokiyo and Hurst 2003; Paukkeri et al. 2008) or POS sequences (Turney 1999; Barker and Cornacchia 2000; Nguyen and Kan 2007; Kim and Kan 2009), or both. Some approaches employ heuristics aimed at reducing the number of false-positive candidates while maintaining the true positives. A comprehensive analysis of the accuracy and coverage of candidate extraction methods was carried out by Hulth (2004). She compared three methods: n-grams (excluding those that begin or end with a stop word), pre-defined POS sequences, and NP-chunks excluding initial determiners (a, an and the). No single method dominated, and the best results were achieved by voting across the three methods.
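As a concrete illustration, the following sketch generates n-gram candidates that neither begin nor end with a stop word, one of the filters compared by Hulth (2004); the stop-word list and the maximum n-gram length are illustrative assumptions of ours, not settings used by any particular system:

    import re

    # Illustrative stop-word list; real systems use much larger lists.
    STOP_WORDS = {"the", "a", "an", "of", "in", "for", "and", "to", "is", "on", "from", "with"}

    def ngram_candidates(text, max_n=3):
        """Return unique 1..max_n-grams that neither start nor end with a stop word."""
        tokens = re.findall(r"[a-z][a-z\-]*", text.lower())
        candidates = set()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                    continue
                candidates.add(" ".join(gram))
        return candidates

    print(sorted(ngram_candidates("automatic extraction of keyphrases from scientific articles")))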

The second step of feature engineering involves the development of features with which to characterize individual keyphrase candidates, and has been extensively researched in the literature. The majority of proposed features combine frequency statistics within a single document and across an entire collection, semantic similarity among keyphrases (i.e. keyphrase cohesion), popularity of keyphrases among manually assigned sets, lexical and morphological analysis, and heuristics such as locality and the length of phrases. The most popular and best-performing single feature is TF × IDF, which is often used as a baseline feature (Frank et al. 1999; Witten et al. 1999; Nguyen and Kan 2007; Liu et al. 2009a). TF × IDF highlights those candidate phrases which are particularly frequent in a given document, but less frequent in the overall document collection. Keyphrase cohesion is another widely-used feature. Since keyphrases are intended to capture the topic of a document, they are likely to have higher semantic similarity among themselves than non-keyphrases. Turney (2003) measured keyphrase cohesion within the top-N keyphrase candidates versus the remaining candidates using web frequencies. Others have used term co-occurrence of candidates (Matsuo and Ishizuka 2004; Mihalcea and Tarau 2004; Ercan 2006; Liu et al. 2009a, b) while Ercan (2006) and Medelyan and Witten (2006) used taxonomic relations such as hypernymy and hyponymy. Ercan (2006) additionally built lexical chains based on term senses. As a heuristic feature, the locality of terms is often used. Frank et al. (1999) and Witten et al. (1999) introduced the relative position of the first occurrence of the term, while Nguyen and Kan (2007) and Kim and Kan (2009) analyzed the location and frequency of candidates in terms of document sections, leveraging structure in their dataset (i.e. scientific articles).
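To make two of these features concrete, the sketch below computes TF × IDF and the relative position of first occurrence for single-token candidates over a toy collection; the add-one smoothing of the IDF denominator and the restriction to unigrams are simplifying assumptions for the example:

    import math

    def tf_idf(candidate, doc_tokens, collection):
        """TF x IDF for a single-token candidate; collection is a list of token lists."""
        tf = doc_tokens.count(candidate) / max(len(doc_tokens), 1)
        df = sum(1 for doc in collection if candidate in doc)
        idf = math.log(len(collection) / (1 + df))  # add-one smoothing of the denominator
        return tf * idf

    def first_occurrence(candidate, doc_tokens):
        """Relative position of the first occurrence (0.0 = start of the document)."""
        return doc_tokens.index(candidate) / len(doc_tokens) if candidate in doc_tokens else 1.0

    doc = "grid service discovery with uddi in grid computing".split()
    collection = [doc, "web service discovery".split(), "distributed hash tables".split()]
    print(tf_idf("grid", doc, collection), first_occurrence("grid", doc))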

Keyphrase extraction is generally construed as a ranking problem—i.e. candidates are ranked based on their feature values, and the top-N ranked candidates are returned as keyphrases. As such, the third step is developing learning models with which to rank the candidates. The majority of learning approaches are supervised, with commonly-employed learners being maximum entropy models (Nguyen and Kan 2007; Kim and Kan 2009), naïve Bayes (Frank et al. 1999; Turney 1999; Ercan 2006), decision trees (Turney 1999) and support vector machines (Krapivin et al. 2010). Others have proposed simpler probabilistic models using measures such as pointwise mutual information and KL-divergence (Barker and Cornacchia 2000; Tomokiyo and Hurst 2003; Matsuo and Ishizuka 2004). More recently, unsupervised methods have gained popularity, using graphs and semantic networks to rank candidates (Mihalcea and Tarau 2004; Litvak and Last 2008; Liu et al. 2009a, 2010).

The final step is evaluating the extracted keyphrases. Automatic keyphrase extraction systems have commonly been assessed using the proportion of top-N candidates that exactly match the gold-standard keyphrases (Frank et al. 1999; Witten et al. 1999; Turney 1999). This number is then used to compute the precision, recall and F-score for a keyphrase set. However, the exact matching of keyphrases is problematic because it ignores near matches that are largely semantically identical, such as synonyms, different grammatical forms, or sub/super-strings of keyphrases, e.g. linguistic graduate program versus graduate program. To remedy this, in some cases, inexact matches (sometimes termed “near misses” or “near matches”) have also been considered. Some have suggested treating semantically-similar keyphrases as correct based on similarities computed over a large corpus (Jarmasz and Barriere 2004; Mihalcea and Faruque 2004), or using semantic relations defined in a thesaurus (Medelyan and Witten 2006). Zesch and Gurevych (2009) computed near matches using an n-gram based approach relative to the gold standard. To differentiate between plausible near matches and completely erroneous keyphrases, evaluation metrics have been proposed that take into account semantic similarity and character n-grams (Zesch and Gurevych 2009; Kim et al. 2010). However, these metrics have yet to gain traction in the research community.

3 Keyphrase extraction datasets

3.1 Existing datasets

There are several publicly available datasets for evaluating keyphrase extraction, which we detail below.

Hulth (2003) compiled 2,000 journal article abstracts from Inspec, published between the years 1998 and 2002. The dataset contains keyphrases (i.e. controlled and uncontrolled terms) assigned by professional indexers, and is divided into 1,000 documents for training, 500 for validation and 500 for testing.

Nguyen and Kan (2007) collected a dataset containing 120 computer science articles, ranging in length from 4 to 12 pages. The articles contain author-assigned keyphrases as well as reader-assigned keyphrases contributed by undergraduate CS students. Krapivin et al. (2009) obtained 2,304 articles from the same source from 2003 to 2005, with author-assigned keyphrases. They marked up the document text with sub-document extents for fields such as title, abstract and references.

In the general newswire domain, Wan and Xiao (2008) developed a dataset of 308 documents taken from DUC 2001, with up to 10 manually-assigned keyphrases per document.

Several databases, including the ACM Digital Library, IEEE Xplore, Inspec and PubMed, provide articles with author-assigned keyphrases and, occasionally, reader-assigned keyphrases. Schutz (2008) collected a set of 1,323 medical articles from PubMed with author-assigned keyphrases.

Medelyan et al. (2009) automatically generated a dataset using tags assigned by users of the collaborative citation platform CiteULike. This dataset additionally records how many people have assigned the same keyword to the same publication. In total, 180 full-text publications were annotated by over 300 users.

Despite the availability of these datasets, a standardized benchmark dataset with a well-defined training and test split, along with standardized evaluation scripts, is needed to maximize the comparability of results. This was our primary motivation for running the SemEval-2010 task.

We have consolidated all of the datasets listed above, as well as the new dataset and evaluation scripts used for SemEval-2010, into a single repository for public download. We hope that it will serve as a reference dataset and aid comparative evaluation in future keyphrase extraction research.

3.2 Collecting the SemEval-2010 dataset

To collect the dataset for this task, we downloaded data from the ACM Digital Library (conference and workshop papers) and partitioned it into trial, training and test subsets. The input papers ranged from 6 to 8 pages, including tables and figures. To ensure that a variety of topics was represented in the corpus, we purposefully selected papers from four different research areas. In particular, the selected articles belong to the following four 1998 ACM classifications: C2.4 (Distributed Systems), H3.3 (Information Search and Retrieval), I2.11 (Distributed Artificial Intelligence—Multiagent Systems) and J4 (Social and Behavioral Sciences—Economics). All three datasets (trial, training and test) had an equal distribution of documents across the four categories (see Table 1). This domain-specific information was made available to task participants to see whether customized solutions would work better within specific sub-areas.

Table 1 Number of documents per topic in the trial, training and test datasets, across the four ACM document classifications of C2.4, H3.3, I2.11 and J4

Participants were provided with 40, 144, and 100 articles, respectively, in the trial, training and test data, distributed evenly across the four research areas in each case. Note that the trial data was a subset of the training data that participants were allowed to use in the task. Since the original format of the articles was PDF, we converted them into (UTF-8 encoded) plain text using pdftotext, and systematically restored full words that were originally hyphenated and broken across lines. This policy potentially resulted in valid hyphenated forms having their hyphen removed.
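A minimal sketch of this kind of de-hyphenation is given below; it simply rejoins any word broken across a line, which, as noted above, can also strip the hyphen from legitimately hyphenated forms. It illustrates the policy rather than the exact script used to prepare the data:

    import re

    def dehyphenate(text):
        """Rejoin words split across lines by a trailing hyphen."""
        return re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", text)

    print(dehyphenate("keyphrase ex-\ntraction from scientific docu-\nments"))
    # -> keyphrase extraction from scientific documents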

All of the collected papers contained author-assigned keyphrases as part of the original PDF file, which were removed from the text dump of the paper. We additionally collected reader-assigned keyphrases for each paper. We first performed a pilot annotation task with a group of students to check the stability of the annotations, finalize the guidelines, and discover and resolve potential issues that may occur during the actual annotation. To collect the actual reader-assigned keyphrases, we then hired 50 student annotators from the computer science department of the National University of Singapore. We assigned 5 papers to each annotator, estimating that assigning keyphrases to each paper would take about 10–15 minutes. Annotators were explicitly told to extract keyphrases that actually appeared in the text of each paper, rather than to create semantically-equivalent phrases. They were also told that they could extract phrases from any part of the document inclusive of headers and captions. Despite these directives, 15 % of the reader-assigned keyphrases do not appear in the actual text of the paper, although this is still less than the corresponding figure for author-assigned keyphrases, at 19 %. In other words, the maximum recall that the participating systems can achieve on these documents is 85 and 81 % for the reader- and author-assigned keyphrases, respectively.

As some keyphrases may occur in multiple but semantically-equivalent forms, we expanded the set of keyphrases to include alternative versions of genitive keyphrases: B of A = A B (e.g. policy of school = school policy), and A’s B = A B (e.g. school’s policy = school policy). We chose to implement only this limited form of keyphrase equivalence in our evaluation, as these two alternations both account for a large portion of the keyphrase variation, and were relatively easy to explain to participants and for them to reimplement. Note, however, that the genitive alternation does change the semantics of the candidate phrase in limited cases (e.g. matter of fact versus ?fact matter). To deal with this, we hand-vetted all keyphrases generated through these alternations, and did not include alternative forms that were judged to be semantically distinct.
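The two alternations can be implemented with a couple of pattern rewrites, as in the sketch below; in the task itself every generated variant was additionally hand-vetted, so this should be read as an illustration of the expansion rule rather than the full procedure:

    import re

    def genitive_variants(phrase):
        """Generate the 'B of A' -> 'A B' and "A's B" -> 'A B' alternations."""
        variants = set()
        m = re.match(r"^(.+?) of (.+)$", phrase)
        if m:
            variants.add(f"{m.group(2)} {m.group(1)}")
        m = re.match(r"^(.+?)'s (.+)$", phrase)
        if m:
            variants.add(f"{m.group(1)} {m.group(2)}")
        return variants

    print(genitive_variants("policy of school"))   # {'school policy'}
    print(genitive_variants("school's policy"))    # {'school policy'}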

Table 1 shows the distribution of the trial, training and test documents over the four different research areas, while Table 2 shows the distribution of author- and reader-assigned keyphrases. Interestingly, among the 387 author-assigned keywords, 125 keywords match exactly with reader-assigned keywords, while many more near matches occur.

Table 2 Number of author- and reader-assigned keyphrases in the different portions of the dataset

4 Evaluation method and baseline

For the evaluation we adopt the traditional means of matching auto-generated keyphrases against those assigned by experts (the gold-standard). Prior to computing the matches, all keyphrases are stemmed using the English Porter stemmer. We assume that auto-generated keyphrases are supplied in ranked order starting from the most relevant keyphrase. The top-5, top-10 and top-15 keyphrases are then compared against the gold-standard for the evaluation.

As an example, let us compare a set of 15 top-ranking keyphrases generated by one of the competitors and stemmed using the Porter stemmer:

grid comput, grid, grid servic discoveri, web servic, servic discoveri, grid servic, uddi, distribut hash tabl, discoveri of grid, uddi registri, rout, proxi registri, web servic discoveri, qos, discoveri

with the equivalent gold-standard set of 19 keyphrases (a combined set assigned by both authors and readers):

grid servic discoveri, uddi, distribut web-servic discoveri architectur, dht base uddi registri hierarchi, deploy issu, bamboo dht code, case-insensit search, queri, longest avail prefix, qo-base servic discoveri, autonom control, uddi registri, scalabl issu, soft state, dht, web servic, grid comput, md, discoveri

The system has correctly identified 6 keyphrases, which results in a precision of 40 % (6/15) and a recall of 31.6 % (6/19). Given the results for each individual document, we then calculate the micro-averaged precision, recall and F-score (β = 1) for each cut-off (5, 10 and 15). Note that the maximum recall that could be achieved over the combined keyphrase set was approximately 75 %, because not all keyphrases actually appear in the document.
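The sketch below reproduces this evaluation procedure under two assumptions that are ours rather than the task's: NLTK's Porter stemmer stands in for the stemmer used in the official scripts, and the micro-average is computed by pooling counts over documents. Applied to the single example above (one document, k = 15), it reproduces P = 6/15 and R = 6/19.

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def stem_phrase(phrase):
        return " ".join(stemmer.stem(w) for w in phrase.lower().split())

    def micro_prf(system_outputs, gold_sets, k):
        """Micro-averaged P/R/F at cut-off k over stemmed keyphrase strings."""
        correct = extracted = gold_total = 0
        for ranked, gold in zip(system_outputs, gold_sets):
            top_k = [stem_phrase(p) for p in ranked[:k]]
            gold_stems = {stem_phrase(p) for p in gold}
            correct += sum(1 for p in top_k if p in gold_stems)
            extracted += len(top_k)
            gold_total += len(gold_stems)
        p, r = correct / extracted, correct / gold_total
        f = 2 * p * r / (p + r) if correct else 0.0
        return p, r, f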

Participants were required to extract keyphrases from among the phrases used in a given document. Since it is theoretically possible to access the original PDF articles and extract the author-assigned keyphrases, we evaluate systems over the independently generated reader-assigned keyphrases, as well as the combined set of keyphrases (author- and reader-assigned).

We computed TF × IDF n-gram based baselines using both supervised and unsupervised approaches. First, we generated 1-, 2- and 3-grams as keyphrase candidates for both the test and training data. For the training documents, we labelled candidates as keyphrases using the set of manually-assigned keyphrases for each document. Then, we used a maximum entropy (ME) learner to train a supervised baseline model based on the keyphrase candidates, TF × IDF scores and gold-standard annotations for the training documents. For the unsupervised baseline, we simply ranked the keyphrase candidates by their TF × IDF scores (from higher to lower). In total, there are thus two baselines: one supervised and one unsupervised. The performance of the baselines is presented in Table 3, broken down across reader-assigned keyphrases (Reader), author-assigned keyphrases (Author), and combined author- and reader-assigned keyphrases (Combined).
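For illustration, the sketch below approximates the supervised baseline with scikit-learn's logistic regression (a maximum entropy classifier) over a single TF × IDF feature; the feature values, candidate strings and the choice of library are assumptions made for the example, not details of the official baseline:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical training data: one (TF x IDF score, is-gold-keyphrase) pair per candidate.
    X_train = np.array([[0.31], [0.02], [0.18], [0.01], [0.25], [0.03]])
    y_train = np.array([1, 0, 1, 0, 1, 0])

    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Rank test candidates by the model's keyphrase probability and keep the top 15.
    test_candidates = ["grid servic discoveri", "web servic", "rout"]
    X_test = np.array([[0.28], [0.12], [0.05]])
    scores = model.predict_proba(X_test)[:, 1]
    ranking = [c for _, c in sorted(zip(scores, test_candidates), reverse=True)][:15]
    print(ranking)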

Table 3 Keyphrase extraction performance for baseline unsupervised (TF × IDF) and supervised (ME) systems, in terms of precision (P), recall (R) and F-score (F), given as percentages

5 Competition results

The trial data was downloaded by 73 different teams, of which 36 subsequently downloaded the training and test data. In total, 21 teams participated officially in the final competition, of which two withdrew their systems from the published set of results.

Table 4 shows the performance of the final 19 teams. Of these, 5 teams submitted one run, 6 teams submitted two runs, and 8 teams submitted the maximum of three runs. We rank the best-performing run for each team by micro-averaged F-score over the top-15 candidates. We also show system performance over reader-assigned keywords in Table 5, and over author-assigned keywords in Table 6. In all these tables, P, R and F denote precision, recall and F-score, respectively. The systems are ranked in descending order of their F-score over the top-15 candidates.

Table 4 Performance of the submitted systems over the combined author- and reader-assigned keywords, ranked by Top-15 F-score
Table 5 Performance of the submitted systems over the reader-assigned keywords, ranked by Top-15 F-score
Table 6 Performance of the submitted systems over the author-assigned keywords, ranked by Top-15 F-score

The best results over the reader-assigned and combined keyphrase sets are 23.5 and 27.5 %, respectively, both achieved by the HUMB team. Most systems outperformed the baselines. Systems generally scored better against the combined set, as the availability of a larger gold-standard answer set means that more correct cases could be found among the top-5, 10 and 15 keyphrases, leading to a better balance between precision and recall and hence a higher F-score.

In Tables 7 and 8, we present system rankings across the four ACM document classifications, ranked in order of top-15 F-score. The numbers in parentheses are the actual F-scores for each team. Note that in the case of a tie in F-score, we sub-ranked the teams in descending order of F-score over the full dataset.

Table 7 System ranking (and F-score) for each ACM classification: combined keywords
Table 8 System ranking (and F-score) for each ACM classification: reader-assigned keywords

6 A summary of the submitted systems

The following is an overview of the systems which participated in the task, ranked according to their position in the overall system ranking. They are additionally labelled as being supervised or unsupervised, based on whether they made use of the keyphrase-labelled training data. Systems which did not have an accompanying description paper are omitted.

  • HUMB (Supervised): Candidates are generated based on n-grams (n = 1 to 5), after removing terms with stop words and mathematical symbols. Ranking is implemented using a bagged decision tree over several features, including document structure (e.g. section and position), content (e.g. scores of 2-to-5-grams using the Generalized Dice Coefficient and TF × IDF), and lexical/semantic scores from large term-bases (e.g. the GRISP terminological database and Wikipedia). To further improve the candidate ranking, candidates are re-ranked using a probabilistic model trained over author-assigned keyphrases in an independent collection (Lopez and Romary 2010).

  • WINGNUS (Supervised): Heuristics are used to select candidates, based on occurrence in particular areas of the document, such as the title, abstract and introduction. The algorithm first identifies the key sections and headers, then extracts candidates based on POS tag sequences only in the selected areas. To rank the candidates, the system employs 19 features based on syntactic and frequency statistics such as length, TF × IDF and occurrence in the selected areas of the document (Nguyen and Luong 2010).

  • KP-Miner (Unsupervised): Heuristic rules are used to extract candidates, which are then filtered to remove terms with stop words and punctuation. Further, the candidates are filtered by frequency and their position of first appearance. Finally, candidates are ranked by integrating five factors: the term's weight in the document D_i, its frequency in the document D_i, its IDF, a boosting factor, and its position (El-Beltagy and Rafea 2010).

  • SZTERGAK (Supervised): First, irrelevant sentences are removed based on their relative position in the document. Candidates are then extracted based on n-grams (up to size n = 4), restricted by predefined POS patterns. To rank the candidates, the system employs a large number of features computed by analyzing the term (e.g. word length, POS pattern), the document (e.g. acronymity, collocation score for multiword terms), the corpus (e.g. section-based TF × IDF, and phrasehood in the complete dataset) and external knowledge resources (e.g. Wikipedia entries/redirection) (Berend and Farkas 2010).

  • SEERLAB (Supervised): Document sections are first identified, and n-gram candidates of differing length extracted based on their occurrence in an external scholarly corpus and their frequency in different parts of the document. Finally, the system produces its final ranking of candidates using multiple decision trees with 11 features, primarily based on term frequencies, such as term frequency in section headings and document frequency, as well as heuristics such as the word length and whether the candidate is used as an acronym in the document (Treeratpituk et al. 2010).

  • KX_FBK (Supervised): n-gram candidates are extracted similarly to SZTERGAK, along with simple statistics such as local document frequency and global corpus frequency. The system then ranks candidates using five features: IDF, keyphrase length, position of first occurrence, “shorter concept subsumption” and “longer concept boosting” (whereby a candidate which contains another candidate as a substring receives the score of that substring; a sketch of this heuristic is given after this list) (Pianta and Tonelli 2010).

  • DERIUNLP (Unsupervised): Based on the assumption that keyphrases often occur with “skill types” (important domain words that are general enough to be used in different subfields and that reflect theoretical or practical expertise e.g. analysis, algorithm, methodology in scientific articles), 81 skill type words were manually extracted from the corpus. Next, POS patterns that appear in phrases containing these skill type words were used to identify candidate keyphrases. To rank the candidates, the system introduces a probabilistic model based on TF × IDF, keyphrase length and term frequency in the collection (Bordea and Buitelaar 2010).

  • Maui (Supervised): Maui is an open-source system developed by one of the task organizers prior to and independently of the competition (Medelyan et al. 2009). Maui’s candidates are n-grams, and the keyphrase ranking is generated using bagged decision trees over features such as TF × IDF, location, phrase length, and how often a candidate was chosen as a keyphrase in the training set. The features are enhanced with statistics from Wikipedia.

  • DFKI (Supervised): Candidates are generated using “closed-class forms” (i.e. function words such as conjunctions and prepositions, and suffixes such as plural and tense markers) and four types of nominal groups, all within the first 2000 characters of a document. Candidate selection takes the form of an ordinal regression problem using SVMrank, based on eight features including web counts, the use of special characters, and Wikipedia statistics (Eichler and Neumann 2010).

  • BUAP (Unsupervised): The documents are first pre-processed to remove stop words, punctuation and abbreviations, and then the words are lemmatized and stemmed. Candidates are then selected using heuristic rules to prefer longer sequences which occur above a frequency threshold, based on the local document and the collection. Finally, the candidates are ranked using PageRank (Ortiz et al. 2010).

  • SJTULTLAB (Supervised): OpenNLP is used to extract noun phrase chunks as candidates, which are then filtered using three heuristic rules: phrase length, frequency, and POS patterns. The candidates are then ranked using the top-30 keyphrases extracted by running KEA (Witten et al. 1999), a separate keyphrase extraction system (Wang and Li 2010).

  • UNICE (Supervised): Abbreviations are first identified using ExtractAbbrev (Schwartz and Hearst 2003), then OpenNLP is used for sentence tokenization and POS tagging. Candidates are selected based on POS patterns, and represented in a sentence–term matrix. Clustering algorithms are employed to reduce the dimensionality of the matrix, and Latent Dirichlet allocation (LDA) is applied to identify the topics of each cluster. Finally, candidates are scored using a probabilistic metric based on the topical relatedness of candidates (Pasquier 2010).

  • UNPMC (Supervised): Candidates are selected based on n-grams (n ≤ 3) which do not contain stop words. For each candidate, the frequency within pre-defined sections of the paper (i.e. title, abstract, introduction and conclusion) is computed, as well as the number of sections it appears in. The authors empirically determine the weight of these features and then use them to rank the candidates (Park et al. 2010).

  • Likey (Unsupervised): First, section headings, references, figures, tables, equations, citations and punctuation are removed from the text, and all numbers are replaced with the <NUM> tag. Then, candidates are selected as those words and phrases that appear in a reference corpus based on Europarl (European Parliament plenary speeches). Finally, the system ranks candidates using document and reference corpus frequencies (Paukkeri and Honkela 2010).

  • UvT (Unsupervised): First, URLs and inline references are removed from each document, and section boundaries are detected. Then, candidates are extracted using eight POS patterns. These candidates are further normalized based on lexical and morphological variation (e.g. morphological affixes and hyphenated phrases). Finally, the C-value (Frantzi et al. 2000) probabilistic measure is used to rank candidates (Zervanou 2010).

  • POLYU (Unsupervised): Simplex candidates are selected based on POS tag, and scored by frequency in the title, abstract and body of the document. The top-scoring words are treated as “core words”, which are expanded into keyphrases by appending neighboring words, based on predefined POS patterns (Ouyang et al. 2010).
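As noted above, the sketch below illustrates the “longer concept boosting” heuristic described for KX_FBK, under the assumption that candidates already carry scores and that a longer candidate inherits the score of a contained candidate when that score is higher; it is an interpretation of the description, not the system's actual implementation:

    def boost_longer_concepts(scores):
        """Let each candidate inherit the highest score of any candidate it contains."""
        boosted = dict(scores)
        for longer in scores:
            for shorter, s in scores.items():
                if shorter != longer and shorter in longer:
                    boosted[longer] = max(boosted[longer], s)
        return boosted

    scores = {"discoveri": 0.9, "servic discoveri": 0.4, "grid servic discoveri": 0.2}
    print(boost_longer_concepts(scores))
    # 'servic discoveri' and 'grid servic discoveri' both inherit the higher score 0.9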

7 Discussion of results

The top-performing systems return F-scores in the upper twenties. Superficially, this number is low, and it is instructive to examine how much room there is for improvement. Keyphrase extraction is a subjective task, and an F-score of 100 % is infeasible. On the author-assigned keyphrases in our test collection, the highest a system could theoretically achieve was 81 % recall and 100 % precision, which gives a maximum F-score of 89 %. However, such a high value would only be possible if the number of keyphrases extracted per document could vary; in our task, we fixed the thresholds at 5, 10 or 15 keyphrases.

Another way of computing the upper-bound performance would be to look into how well people perform the same task. We analyzed the performance of our readers, taking the author-assigned keyphrases as the gold standard. The authors assigned an average of 4 keyphrases to each paper, whereas the readers assigned 12 on average. These 12 keyphrases cover 77.8 % of the authors’ keyphrases, which corresponds to a precision of 21.5 %. The F-score achieved by the readers on the author-assigned keyphrases is 33.6 %, whereas the F-score of the best-performing system on the same data is 19.3 % (for top-15, not top-12 keyphrases, see Table 6).
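Both the maximum F-score above and the readers' F-score follow from the usual harmonic mean F = 2PR/(P + R); a quick check, computed here from the rounded values quoted in the text (hence the small discrepancies), is:

    \[
      F_{\max} = \frac{2 \times 1.00 \times 0.81}{1.00 + 0.81} \approx 0.895 \approx 89\,\%,
      \qquad
      F_{\text{reader}} = \frac{2 \times 0.215 \times 0.778}{0.215 + 0.778} \approx 0.337 \approx 33.6\,\%.
    \]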

Reviewing the techniques employed by the 15 submitted systems revealed interesting trends in the different stages of keyphrase extraction: candidate identification, feature engineering and candidate ranking. In the candidate identification step, most systems used either n-grams or POS-based regular expressions, or both. Additionally, there is a clear tendency to apply pre-processing prior to the candidate identification step. For example, dealing with abbreviations seems to be an important step for improving candidate coverage, specifically aimed at scientific papers. Also, filtering candidates by frequency and location in different sections of the document was broadly employed among the participating systems. The majority of systems which used section information found the boundaries with heuristic approaches over the provided text dump, while HUMB and WINGNUS performed section boundary detection over the original PDF files.

In ranking the candidates, the systems applied a variety of features: lexical, structural and statistical. It is particularly interesting that many systems used external information, such as Wikipedia and external corpora. On the other hand, none of the systems made use of the 4 ACM document classifications that the test and training documents were grouped into. Table 9 summarizes the features used by each system, as described in the respective system description papers.

Table 9 The participating systems, ordered by overall rank, with the different feature types used by each system (broken down into Token Scoring, Lexical/Syntactic, Sem(antic), External and Format)

To rank the candidates, supervised systems used learners such as maximum entropy, naïve Bayes and bagged decision trees, all of which are popular approaches for keyphrase extraction. Another approach used for ranking was a learn-to-rank classifier based on SVMrank. Unsupervised systems tended to propose novel probabilistic models to score candidates, mostly based on simple multiplication of feature values, but also including PageRank and topic modeling. It is difficult to gauge the relative superiority of different machine learning approaches over the task, as they were combined with different candidate selection techniques and feature sets. However, the standardized evaluation on the common training and test data does uncover some trends: namely, that document structure and IR-style term weighting approaches appear to be effective across the board. There is no doubt, however, that there is still room for improvement on the task, and we look forward to seeing the dataset used in future experimentation on keyphrase extraction.

For any future shared task on keyphrase extraction, we recommend against fixing a threshold on the number of keyphrases to be extracted per document. Finally, since we use a strict exact-matching metric for evaluation, the presented figures likely underestimate actual system performance, as many semantically-equivalent keyphrases are not counted as correct. For future runs of this challenge, we believe a more semantically-motivated evaluation should be employed to give a more accurate impression of keyphrase acceptability.

8 Conclusion

We describe Task 5 of the Workshop on Semantic Evaluation 2010 (SemEval-2010), focusing on keyphrase extraction. We provided an overview of the keyphrase extraction process and related work in this area. We outlined the design of the datasets used in the shared task and the evaluation metrics, before presenting the official results for the task and summarizing the immediate findings. We also analyzed the upper-bound performance for this task, and demonstrated that there is still room for improvement on the task. We look forward to future advances in automatic keyphrase extraction based on this and other datasets.