
1 Introduction

While considerable progress has been made over the last few years, the task of extracting meaningful keywords is yet to be solved, as the effectiveness of existing algorithms still falls short of that achieved in many other core areas of computer science. Most traditional approaches follow a supervised methodology, which largely depends on access to annotated training text corpora. One of the first approaches was proposed by Turney [5], who developed a custom-designed algorithm named GenEx. The great majority of the approaches developed so far have relied, however, on supervised methods such as Naïve Bayes to select relevant keywords. Arguably the most widespread implementation of such an approach is KEA [7], which uses the Naïve Bayes machine learning algorithm for keyword extraction. Despite their often superior effectiveness, the main limitation of supervised methods is their relatively long training process. This contrasts with general unsupervised algorithms [3, 4, 6], which can be quickly applied to documents across different languages or domains and which demand reduced effort due to their plug-and-play nature. In this paper, we describe YAKE!, an online keyword extraction demo which builds upon statistical text features extracted from a single document to identify and rank the most important keywords. While keyword extraction systems have been extensively studied over the last few years, multilingual online single-document tools are still very rare. YAKE! is an attempt to fill this gap. Overall, it provides a solution which does not need to be trained on a particular set of documents, and thus can be easily applied to single texts, regardless of the existence of a corpus, dictionary or any external collection. In an era of massive but likely unlabeled collections, this can be a great advantage over other approaches, particularly supervised ones. Another important feature is that YAKE! uses neither NER nor PoS taggers, which makes the system language-independent, except for the use of different but static lists of stopwords for each language. This enables an easy adaptation of YAKE! to languages other than English, especially minor languages for which open-source language processing tools are scarce. It is an advantage over supervised methods, which demand training a custom model beforehand. Finally, the fact that YAKE! relies only on statistical features extracted from the text itself allows it to scale easily to vast collections.

2 Keyword Extraction Pipeline

The proposed system has six main components: (1) Text pre-processing; (2) Feature extraction; (3) Individual term scoring; (4) Candidate keyword list generation; (5) Data deduplication; and (6) Ranking. In the following we provide a concise description of each of the six steps, as a detailed discussion of them is beyond the scope of this paper and can be found in [1]. First, we apply a pre-processing step which splits the text into individual terms whenever an empty space or a special-character delimiter (e.g., line breaks, brackets, commas, periods) is found. Second, we devise a set of five features to capture the characteristics of each individual term: (1) Casing; (2) Word Positional; (3) Word Frequency; (4) Word Relatedness to Context; and (5) Word DifSentence. The first one, Casing, reflects the casing aspect of a word. Word Positional gives more importance to words occurring at the beginning of a document, based on the assumption that relevant keywords tend to concentrate there. Word Frequency indicates the frequency of the word, scoring higher those words that occur more often. The fourth feature, Word Relatedness to Context, computes the number of different terms that occur to the left (resp. right) of the candidate word. The greater the number of different terms that co-occur with the candidate word (on both sides), the more meaningless the candidate word is likely to be. Finally, Word DifSentence quantifies how often a candidate word appears in different sentences. Similar to Word Frequency, Word DifSentence scores higher those words that occur in many different sentences. Both features, however, are combined with Word Relatedness to Context, meaning that the more a word occurs in different sentences the better, as long as it does not frequently co-occur with different words on its right or left side (which would resemble the behavior of a stopword). In the third step, we heuristically combine all these features into a single measure such that each term is assigned a score \( S(w) \). This weight feeds the process of generating candidate keywords, which takes place in the fourth step. Here, we consider a sliding window of 3-grams, thus generating contiguous sequences of 1-, 2- and 3-gram candidate keywords. Each candidate keyword is then assigned a final score \( S(kw) \), such that the smaller the score, the more meaningful the keyword. Equation 1 formalizes this:

$$ S(kw) = \frac{\prod_{w \in kw} S(w)}{TF(kw) \times \left( 1 + \sum_{w \in kw} S(w) \right)} \tag{1} $$

where \( S(kw) \) is the score of a candidate keyword, determined by multiplying (in the numerator) the score \( S(w) \) of the first term of the candidate keyword by the scores of the remaining terms. This is divided by the sum of the \( S(w) \) scores to average out with respect to the length of the keyword, so that longer n-grams are not favored simply because they have a higher n. The result is further divided by \( TF(kw) \), the term frequency of the keyword, to penalize less frequent candidates. In the fifth step, we eliminate similar candidates coming from the previous steps. For this, we use the Levenshtein distance [2]. Finally, the system outputs a list of relevant keywords, formed by 1-, 2- and 3-grams, such that the lower the \( S(kw) \) score, the more important the keyword.
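To make the scoring and deduplication steps concrete, the sketch below implements Equation 1 and a Levenshtein-based filter in plain Python. It is an illustrative sketch only, not the authors' implementation: the helpers term_scores (mapping each term w to its \( S(w) \)) and candidate_tf (holding \( TF(kw) \)), as well as the distance threshold, are hypothetical names introduced here for illustration.

```python
# Illustrative sketch of Equation 1 and the deduplication step (not the
# authors' implementation). `term_scores` maps each term w to its S(w);
# `candidate_tf` maps each candidate keyword to its term frequency TF(kw).
# Both are assumed to have been computed in the earlier pipeline steps.

def score_keyword(candidate, term_scores, candidate_tf):
    """Compute S(kw) as in Eq. 1; lower scores mean more relevant keywords."""
    terms = candidate.split()
    product = 1.0
    total = 0.0
    for w in terms:
        product *= term_scores[w]
        total += term_scores[w]
    return product / (candidate_tf[candidate] * (1.0 + total))


def levenshtein(a, b):
    """Standard dynamic-programming Levenshtein (edit) distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def deduplicate(ranked_candidates, max_distance=2):
    """Keep the best-scored candidate among near-duplicates.

    `ranked_candidates` is a list of (keyword, score) pairs sorted by
    ascending S(kw); the distance threshold is an assumed value.
    """
    kept = []
    for cand, score in ranked_candidates:
        if all(levenshtein(cand, k) > max_distance for k, _ in kept):
            kept.append((cand, score))
    return kept
```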

3 Demonstration Overview

In this demonstration, we highlight some of the major features of YAKE!, in particular its independence from a training corpus, dictionary, text size, language and domain. The online demo of YAKE! can be accessed at http://bit.ly/YakeDemoECIR2018. A Python implementation of YAKE! is also available at https://pypi.python.org/pypi/yake/, meaning that our method can already be imported and used as a library. During the demonstration, we will showcase the behavior of our system on different kinds of datasets with various types of settings, and we will show the audience how to interact with YAKE!. All the results are immediately put side by side with the IBM NLU commercial solution and the Rake [4] system for comparison. In a nutshell, users can interact with and test YAKE! under three different scenarios. In the first one, they can try the system by selecting a pre-chosen text from six datasets (500N-KPCrowd-v1.1, INSPEC, Nguyen 2007, SemEval 2010, PubMed, 110-PT-BN-KP), the first five in English and the last in Portuguese. Texts are drawn from the scientific domain (including short abstracts, medium and large-size texts), TV broadcasts and news articles. All the results can be compared against a ground truth, allowing users to analyze the effectiveness of our system with regard to different domains, sizes and languages of the input text. For the second scenario, we arbitrarily chose as input to our system a set of sample texts from different domains (politics, culture, history, tourism, technology, religion, economy, sports, education and biography) and languages (English, Portuguese, German, Italian, Dutch, Spanish, French, Turkish, Polish, Finnish and Arabic), thus enriching our demo with a range of text examples whose characteristics differ from those of formal datasets. Finally, we offer users the chance to test YAKE! in an online environment and in real time with their own text, to see how it responds to different scenarios/texts and languages/domains. Users can input their text either by hand (copy/paste) or by referring to its URL. PDF files are also accepted. As a rule of thumb, the maximum size of n-grams is set to 3. However, this parameter can be adjusted by the user on the opening page. As an add-on, we also offer researchers access to a web service (under the API tab) so that our system can be used for research purposes. A screenshot of the results obtained for a document extracted from the INSPEC dataset is shown in Fig. 1.
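As a hint of how the library mentioned above can be used programmatically, the snippet below gives a minimal sketch. The KeywordExtractor parameters and the order of the returned (keyword, score) tuples follow recent versions of the pip package and may differ in other releases, so the exact names should be treated as assumptions rather than as part of this paper.

```python
# Minimal usage sketch of the YAKE! pip package. Parameter names (lan, n, top)
# and the (keyword, score) tuple order follow recent package versions and are
# assumptions, not guaranteed by this paper.
import yake

text = (
    "Sources tell us that Google is acquiring Kaggle, a platform that "
    "hosts data science and machine learning competitions."
)

# lan: stopword-list language; n: maximum n-gram size; top: number of keywords
extractor = yake.KeywordExtractor(lan="en", n=3, top=10)

for keyword, score in extractor.extract_keywords(text):
    # As in the demo, lower scores indicate more relevant keywords.
    print(f"{score:.4f}\t{keyword}")
```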

Fig. 1. YAKE! interface.

The results of our demo can be explored through three different functionalities: (1) annotated text; (2) word cloud; and (3) comparison of YAKE! with alternative methods. The first one shows the text annotated with the top 10 keywords retrieved by YAKE!. The second uses the relevance score of each keyword retrieved by YAKE! to generate a word cloud, where more important keywords are displayed in a larger size. Finally, we compare the results of YAKE! against IBM NLU and Rake [4]. Each result is assigned a relevance ranking value reflecting the importance of the keyword within the text. Note that in the case of YAKE!, the lower the value, the more important the keyword. As a further feature, we allow the user to explore the results by means of a mouse-hover interaction which highlights both the selected keyword and similar keywords identified by the three systems. In the example in Fig. 1, we highlight a keyword from the ground truth (“dollar general”) and observe its position within the three systems. Though anecdotal, this example shows that YAKE! is able to list this keyword in the 1st position. A formal evaluation, however, is needed to draw valid conclusions.