1 Introduction

Multi-Word Units (MWUs), or Multi-Word Expressions (MWEs), are “idiosyncratic interpretations that cross word boundaries (or spaces)” [6]. Although they consist of many words (graphical units), for some application-dependent reasons they should be listed, described and processed as a single unit at some level of linguistic analysis [1, 8]. One type of MWU is the multi-word entity name. MWUs pose a serious difficulty in many Natural Language Processing tasks [6]. One such difficulty is the morphological analysis of these expressions, especially for languages with rich morphology, such as the Slavic languages.

An example of a task which becomes difficult when dealing with MWUs is their lemmatization. This is due to the fact that the lemma of an MWU may contain words which are not lemmas themselves [8]. Consider the Polish multi-word entity name Organizacji Narodów Zjednoczonych (United Nations in the genitive case). If we lemmatize each word separately and concatenate the resulting lemmas, we obtain the phrase Organizacja Naród Zjednoczyć, which is incorrect according to Polish grammar (the correct lemma is Organizacja Narodów Zjednoczonych). Therefore, trying to obtain the lemma of a phrase simply by lemmatizing each word separately would generally produce a grammatically incorrect phrase.

In this paper, we analyze the problem of lemmatization of multi-word entity names for Polish. As discussed in the Related Work section, a number of approaches to this issue exist. It is commonly acknowledged that, to ensure high accuracy, the inflection of a phrase should be analyzed at a lexical rather than grammatical level. This usually requires a significant amount of manual work. Still, we have not found any evaluation of the accuracy that can be obtained for highly inflective languages, like Polish, when lemmatization is based only on grammatical rules that ignore lexical information. We believe that in some cases such an approach may be sufficient and much less labour-intensive, especially when the inflection rules are automatically extracted from a corpus. Thus, the goal of this paper is to analyze what accuracy may be achieved for Polish using only grammar-based inflection rules automatically extracted from a corpus.

The structure of this article is as follows. First, in Sect. 2, we describe the problem of MWU lemmatization in greater detail. Next, in Sect. 3, a brief analysis of related work is presented. In Sects. 4 and 5 we present the developed approach to grammar-based MWU lemmatization and then analyze the results of the performed experiments. The article is concluded with a short summary.

2 Description of the Encountered Problem

We encountered the problem of multi-word name lemmatization for Polish during our work on a search engine for legislative acts of the Greater Poland Regional Assembly and the Greater Poland Executive Board. We wanted to tag acts with the names of entities mentioned in their titles. In many cases, multi-word names appeared in the titles, usually in an inflected form. The entity names could be easily extracted, because each word in these names started with a capital letter. Having these names, we wanted to present users with tags representing the entities; clicking such a tag would present the user with a list of all acts in whose title the entity was mentioned.

Two problems resulting from the inflection of multi-word entity names arise here:

  1. linking differently inflected forms of the same name together,

  2. presenting the users with lemmatized forms of the entity names.

Table 1. Exemplary three-word-long MWUs (in English: Substance Abuse Treatment Facility, Regional Innovation Strategy and Poznań International Fair). In each of them, a different number of words must be inflected to produce the lemma from the inflected form. Below the phrases, their POS tags (using the NKJP tagset) are presented

The first problem can be solved using text normalization techniques and string similarity measures, such as the Levenshtein distance (a minimal sketch is given after the list below). The second one poses a greater challenge because, as discussed in the Introduction, simple lemmatization of each constituent separately will usually result in a grammatically incorrect phrase rather than the lemma of the MWU. Three main types of decisions must be made to correctly generate a lemma for a given MWU:

  1. Which words from the MWU should be inflected; in different MWUs a different number of words is inflected. For example, for a three-word-long phrase, in some cases only one word must be inflected, while in other cases two or even all three words must be inflected (see Table 1).

  2. If a given word is to be inflected, which form of the word should be chosen, e.g. its grammatical case, number and gender must be determined.

  3. For some languages, inflection may change the order of constituents in the MWU [8, 9]; still, for Polish this is generally not the case and we skip this type of decision in our work.
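Returning to the first of the two problems above (linking differently inflected forms of one name), below is a minimal sketch of how such linking can be done with edit distance. The threshold value and the case-folding step are illustrative choices, not part of the original system.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def same_name(a: str, b: str, max_dist: int = 3) -> bool:
    """Link two phrases as variants of one entity name if their
    case-folded edit distance is small; the threshold is illustrative."""
    return levenshtein(a.lower(), b.lower()) <= max_dist
```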

3 Related Work

Inflection of Multi-Word Units is a well-established problem in Natural Language Processing [1]. Among others, it is often encountered when developing electronic dictionaries. Lemmatization of a phrase is one of the most important steps in this task [9]. A list of all inflected forms of a phrase, together with their inflectional descriptions, is called an inflectional paradigm [8], and the generation of such paradigms has been the goal of a number of previous studies.

3.1 Rule-Based Inflection of MWUs

A basic requirement that has to be met to enable automatic inflection of MWUs is the availability of a comprehensive inflection module or an inflectional dictionary for the single words which are the constituents of MWUs [9]. For Polish, PoliMorf, an open morphological dictionary, may be used for this purpose [10]. Still, lemmatization of single words is much more difficult where proper names are concerned, for example person names [4].

It is generally acknowledged that high accuracy of automatic inflection of MWUs may be achieved only when lexical information is taken into account. In other words, inflection rules must be assigned on a per-phrase basis by a lexicon engineer, which is a labour-intensive task [1]. A survey of such lexical approaches to the inflection of MWUs was published in [8].

An exemplary lexicalized approach to the inflection of MWUs is Multiflex, proposed in [7]. In this approach, each phrase is assigned a so-called inflection graph, which describes the inflectional behavior of the given MWU. The inflection graph is directed and acyclic, and each of its nodes represents a single, possibly inflected, constituent. Each path in the graph corresponds to one or more inflected forms of the whole MWU. There may be many nodes corresponding to a single word in one graph, and each node states whether a given constituent should be inflected and, if so, how. A set of restrictions can be put on constituents, for example ensuring agreement between specific attributes of several constituents, e.g. the grammatical case.

A similar approach was presented in [9], which describes LeXimir, a tool designed to help linguists develop, maintain and exploit e-dictionaries. LeXimir uses a set of rules manually produced by an expert, which deduce the basic structure of a given MWU as well as its additional features. For each phrase, the software offers several lemmas (with assigned inflection rules), from which the user has to choose the correct one [9].

To summarize, a number of approaches to the automatic inflection of MWUs exist. The analyzed approaches achieve high accuracy, but require a lot of manual work to create inflection rules and assign them to individual phrases before automatic inflection can be conducted. It is also worth mentioning that the described approaches are, generally speaking, directed at the lexicon construction task.

3.2 Wikipedia-Based Mappings for Lemmatization of Multi-Word Entity Names

An important resource for lemmatization-related data is Wikipedia. Due to its vast size and semi-structured contents, it is possible to automatically (or semi-automatically) obtain inflected forms of MWUs mapped to their lemmas. This can be done, for example, based on the analysis of inter-wiki links, that is, links in the content of one Wikipedia article pointing to another Wikipedia article. Usually, the title of a Wikipedia article is a lemma, while the anchor text of a link in the contents is often an inflected form (depending on the context in which it appears in the text) of a certain word or MWU. Such links, their targets and anchor texts may be extracted automatically from the HTML contents of Wikipedia articles or based on an analysis of a dump of the Wikipedia database.
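As an illustration of this extraction step, the sketch below pulls (anchor text, article title) pairs from raw wikitext. It is a simplification that ignores redirects, templates and section links, and it is not the exact procedure used to build the resources discussed below.

```python
import re

# Wikitext internal links look like [[Target]] or [[Target|anchor text]].
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:\|([^\[\]]+))?\]\]")

def link_mappings(wikitext: str):
    """Yield (anchor_text, article_title) pairs from one article's wikitext.
    When no explicit anchor is given, the title itself is the anchor."""
    for m in LINK_RE.finditer(wikitext):
        title = m.group(1).strip()
        anchor = (m.group(2) or title).strip()
        yield anchor, title

# link_mappings("[[Organizacja Narodów Zjednoczonych|Organizacji Narodów Zjednoczonych]]")
# yields ('Organizacji Narodów Zjednoczonych', 'Organizacja Narodów Zjednoczonych')
```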

Still, in many cases such mappings will consist not only of (lemma, inflected form) pairs. Many different words or phrases in the text may be used as anchor texts of links, not only inflected forms of the name of the given article. For example, many links on the Polish Wikipedia pointing to the Poznan University of Economics (Uniwersytet Ekonomiczny w Poznaniu) have as their anchor texts an inflected form of one of the older names of the institution (Akademia Ekonomiczna w Poznaniu or Wyższa Szkoła Ekonomiczna). Such mappings may also be useful in some scenarios (we used them for identification and disambiguation of maritime-related entity names in [2]), but they cannot be used for lemmatization purposes. Some kind of filtering must therefore be conducted in order to select only the correct mappings while, at the same time, not erroneously rejecting correct ones.

For Polish, a resource containing Wikipedia-based lemmatization mappings, called NeLexicon, was released as part of the CLARIN project. Version 2.7, available at the time of writing this article, contained 143,301 such mappings for many different types of entities, such as persons, organizations and locations. This resource was utilized for lemmatization, for example, in [3]. The described approach may be extremely useful for lemmatization of MWUs due to its ease of use, but it is limited to those mappings which were found and correctly extracted from Wikipedia. Usually this means that the resource cannot be used for lemmatization of names of lesser-known entities (e.g. organizations or persons) which do not have their own Wikipedia articles or which are not linked to from other articles. Also, in our case, names of departments of some institutions occur frequently in the analyzed dataset, and such types of entities are represented on Wikipedia even less frequently.

4 Proposed Approach

In our work, we decided to try to automatically retrieve a list of lemmatization rules based on corpus analysis. The quality of such rules will be worse than that of rules prepared by an expert. Still, the accuracy of lemma identification performed this way may be sufficient for some tasks, and the approach is much less labour-intensive. Also, we did not find any evaluation of how such an approach may work for morphology-rich languages like Polish, and we hope to fill this gap with the method described below.

4.1 Available Corpus and Data Preparation

As stated in Sect. 2, we were processing legislative acts of the Greater Poland Regional Assembly and the Greater Poland Executive Board. The corpus contained 5172 documents in total. From the titles of these acts, using regular expressions, we extracted 3932 multi-word units, among which there were 942 unique phrases. The acts were well formatted, and in most cases phrases from the titles in which several consecutive words were capitalized were entity names (we extracted only MWUs at least three words long). The extracted entity names were in many cases inflected, but some of them were in their base form.
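The exact regular expressions we used are not reproduced here; the sketch below shows one plausible way to capture runs of at least three consecutive capitalized words (including Polish diacritics) from a title.

```python
import re

# A capitalized token: an uppercase letter (incl. Polish diacritics)
# followed by lowercase letters; three or more in a row form a candidate.
CAP = "[A-ZĄĆĘŁŃÓŚŹŻ][a-ząćęłńóśźż]+"
MWU_RE = re.compile(CAP + r"(?:\s+" + CAP + r"){2,}")

def extract_mwus(title: str):
    """Return all runs of at least three consecutive capitalized words."""
    return MWU_RE.findall(title)

# extract_mwus("Uchwała w sprawie Organizacji Narodów Zjednoczonych z dnia ...")
# -> ['Organizacji Narodów Zjednoczonych']
```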

For each phrase, we first determined whether it was a lemma or an inflected form. We did that using a simple heuristic: if the first word of the MWU was in the nominative case, we considered the phrase to be in its base form; otherwise, the phrase was classified as inflected. For this purpose, we used WCRFT [5], a morpho-syntactic tagger for Polish. We found that this approach allowed us to identify MWUs in lemma form with an accuracy above 95%.
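A minimal sketch of the heuristic follows; the tagged-phrase representation (a list of (word, tag) pairs with positional NKJP-style tags) is an assumption, as the actual WCRFT output format differs in detail.

```python
def is_base_form(tagged_phrase) -> bool:
    """Heuristic from the paper: a phrase is taken to be a lemma if its
    first word is tagged as nominative. `tagged_phrase` is assumed to be
    a list of (word, tag) pairs with positional NKJP-style tags, e.g.
    ('Organizacja', 'subst:sg:nom:f')."""
    first_tag = tagged_phrase[0][1]
    return "nom" in first_tag.split(":")
```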

Identification of the MWUs which already are lemmas immediately gave us two benefits. Firstly, obviously, we did not have to process the lemmas any further. Moreover, having the lemma of a phrase, we could search through all extracted MWUs to find inflected forms of the same phrase. Thus, we could identify further phrases for which we know the lemma.

To identify other MWUs which are inflected forms of a given lemma, we generated simplified forms of phrases, where by the simplified form of a phrase we understand a form in which all words from that phrase are lemmatized separately and then concatenated. For the lemmatization of single words, we used the Hunspell tool. An example of such a simplified form was already given in the Introduction; for the phrases Organizacja Narodów Zjednoczonych and Organizacji Narodów Zjednoczonych, the simplified form is Organizacja Naród Zjednoczyć. If two phrases had the same simplified form (as is the case in the presented example), we assumed that they differ only because of inflection. Thus, we could identify that the lemma for the phrase Organizacji Narodów Zjednoczonych is Organizacja Narodów Zjednoczonych (because the first word of the latter phrase is in the nominative case). We will refer to such identified pairs of phrases as (lemma, inflected form) pairs.
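The pairing step can be sketched as follows; `lemmatize_word` stands for any single-word lemmatizer (the paper used Hunspell) and is injected rather than implemented here.

```python
def simplified_form(phrase: str, lemmatize_word) -> str:
    """Lemmatize every word separately and re-concatenate; e.g.
    'Organizacji Narodów Zjednoczonych' -> 'Organizacja Naród Zjednoczyć'."""
    return " ".join(lemmatize_word(w) for w in phrase.split())

def pair_lemmas_with_inflections(lemmas, inflected, lemmatize_word):
    """An inflected phrase whose simplified form equals that of a known
    lemma is assumed to be an inflected form of that lemma."""
    by_simple = {simplified_form(l, lemmatize_word): l for l in lemmas}
    return [(by_simple[simplified_form(p, lemmatize_word)], p)
            for p in inflected
            if simplified_form(p, lemmatize_word) in by_simple]
```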

Analyzing phrases from the titles of the acts, we found 67 such (lemma, inflected form) pairs. To find additional pairs, we searched through the whole documents (not only the titles) for phrases with the same simplified form as some MWU extracted from the titles. In total, we found 634 different (lemma, inflected form) pairs. Still, for 433 MWUs we did not find any corresponding lemma. For these MWUs, the lemmas had to be generated automatically.

4.2 Generation of Lemmatization Rules

As stated above, after the data preparation steps we had a set of (lemma, inflected form) pairs identified in the corpus. For each phrase, we also had a POS tag sequence generated using the WCRFT tagger. Thus, by analyzing the tag sequences in such pairs, we could identify how POS tag sequences tend to change when a phrase with a certain tag sequence is lemmatized. We will denote the POS tag sequence of a phrase p in its inflected form as \(POS_{p, infl}\), and in its base form as \(POS_{p, lemma}\). Having such pairs of POS tag sequences, we automatically generated four types of lemmatization rules, which are described below.

Each rule consists of two sides: a Left Hand Side (LHS) and a Right Hand Side (RHS), separated from each other by the \(\rightarrow \) sign. Each side of the rule is a sequence of tags. The LHS is used to match a given phrase to a specific rule; that is, having an inflected phrase \(p'\) and its POS tag sequence \(POS_{p', infl}\), we compare it with the LHSs of all rules to find a match. If a match is found, the matched rule is applied to \(p'\), that is, the constituents of the phrase are inflected as stated on the RHS of the rule.

Complete Rules. In this type of rule, we take the POS tag sequences from (lemma, inflected form) pairs and treat them directly as lemmatization rules, as shown in Eq. 1. Examples of such lemmatization rules are presented in Table 1. In each row of this table, below the phrases, there are the POS tag sequences: \(POS_{p, infl}\) in the first column and \(POS_{p, lemma}\) in the second. Using such rules, for each phrase \(p'\) for which we do not know the lemma, we retrieve its POS tag sequence \(POS_{p', infl}\) and search through all complete rules for a rule whose LHS is equal to \(POS_{p', infl}\). We assume that in such a case, if we inflect the words of the MWU according to the RHS of the rule, we will obtain a correct lemma for that phrase.

$$\begin{aligned} POS_{p, infl} \rightarrow POS_{p, lemma} \end{aligned}$$
(1)

An example of the application of this type of rule is the following. Let us assume that we have the inflected MWU \(p' = \) Miejskim Programem Rewitalizacji (Urban Renewal Programme in the instrumental case). Its POS tag sequence \(POS_{p', infl}\) is exactly the same as for the phrase Regionalną Strategią Innowacji in Table 1. A rule generated based on the second row of Table 1 would therefore have an LHS matching \(POS_{p', infl}\). Thus, the lemma for \(p'\) is generated based on the RHS of the rule, that is, using the tags from the lemma column of the same row of Table 1. For example, the first word of \(p'\) (Miejskim) should be inflected to its nominative form, Miejski. By inflecting all words of \(p'\) according to the RHS, we obtain the phrase Miejski Program Rewitalizacji, which is the correct lemma for \(p'\).
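As a concrete illustration of complete rules, the sketch below represents each rule as a mapping from the LHS tag sequence to the RHS tag sequence; the tuple-of-tags representation is our assumption, not the original implementation.

```python
from typing import Dict, List, Optional, Tuple

TagSeq = Tuple[str, ...]  # one positional NKJP tag per constituent

def build_complete_rules(pairs: List[Tuple[TagSeq, TagSeq]]) -> Dict[TagSeq, TagSeq]:
    """Each observed (POS_infl, POS_lemma) pair becomes one rule (Eq. 1)."""
    return {infl: lemma for infl, lemma in pairs}

def apply_complete_rule(pos_infl: TagSeq,
                        rules: Dict[TagSeq, TagSeq]) -> Optional[TagSeq]:
    """A complete rule fires only if its LHS equals the whole tag
    sequence of the inflected phrase; returns the target tags or None."""
    return rules.get(pos_infl)
```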

Partial Rules. Partial rules differ from complete rules in that, having a phrase \(p'\) for which we want to obtain the lemma, we go through all \((POS_{p, infl}, POS_{p, lemma})\) pairs and try to find the longest match between subsequences of \(POS_{p', infl}\) and \(POS_{p, infl}\), where such subsequences always start from the beginning of the sequence. If we denote the subsequence starting at the tag with index \(t_1\) and ending at \(t_2\) as \(POS_{p, infl}[t_1, t_2]\), we look for the pair in which \(POS_{p, infl}[1, t_2] = POS_{p', infl}[1, t_2]\) and \(t_2\) has the highest value. Then, we create the RHS of the rule as a concatenation of two sequences: the subsequence of \(POS_{p, lemma}\) ending at \(t_2\) and the subsequence of \(POS_{p', infl}\) starting at index \(t_2 + 1\) and reaching the end of the sequence, as shown in Eq. 2. Please note that \(POS_{p, infl}\) and \(POS_{p', infl}\) may have different lengths (that is, the phrase being inflected may have a different number of words than the phrase which was used to generate the rule).

$$\begin{aligned} POS_{p, infl}[1, t_2] \rightarrow POS_{p, lemma}[1, t_2] \mathbin {+\!\!+} POS_{p', infl}[t_2+1, \ldots ] \end{aligned}$$
(2)

Such rules are based on the fact that in Polish, when we inflect MWUs, in many cases some number of words at the end of the unit remains unchanged. This can be seen in Table 1, where in the first row the two final words, and in the second row the one final word, remain unchanged. Thus we assume that in many cases we may skip the analysis of some number of POS tags at the end of the sequence and still obtain the proper lemma. On the other hand, it is unlikely that the inflection of an MWU will change some words at the end without affecting the ones at the beginning.
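A sketch of partial-rule application under the same representation follows; it selects the pair sharing the longest tag prefix with the query and builds the target sequence as in Eq. 2.

```python
def apply_partial_rule(pos_infl: TagSeq,
                       pairs: List[Tuple[TagSeq, TagSeq]]) -> Optional[TagSeq]:
    """Find the (POS_infl, POS_lemma) pair sharing the longest common
    prefix with the query, then output lemma tags for that prefix and
    the query's own tags for the remaining positions (Eq. 2)."""
    best_len, best_lemma = 0, None
    for rule_infl, rule_lemma in pairs:
        n = 0
        while (n < len(rule_infl) and n < len(pos_infl)
               and rule_infl[n] == pos_infl[n]):
            n += 1
        if n > best_len:
            best_len, best_lemma = n, rule_lemma
    if best_lemma is None:
        return None
    return best_lemma[:best_len] + pos_infl[best_len:]
```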

Caseless Complete Rules. In this type of rule, having \(POS_{p', infl}\), that is, the sequence of tags for a phrase \(p'\) for which we want to generate the lemma, we analyze the (lemma, inflected form) pairs in search of a pair \((POS_{p, infl}, POS_{p, lemma})\) in which all tags of \(POS_{p, infl}\) are the same as in \(POS_{p', infl}\) apart from the grammatical case. We assume here that the grammatical cases of words in these phrases differ only because the phrases \(p'\) and p were used in the text in different cases, and that if they had been used in the same case, then \(POS_{p, infl}\) and \(POS_{p', infl}\) would be identical. This type of rule is therefore identical to complete rules apart from the fact that we ignore the information about the grammatical case on the LHS of the rule.

Caseless Partial Rules. This type of rule is a variant of partial rules in which the information about grammatical cases on the LHS of the rule is ignored. For each \(POS_{p', infl}\), that is, the sequence of tags for a phrase \(p'\) for which we want to generate the lemma, we analyze the \((POS_{p, infl}, POS_{p, lemma})\) pairs in search of the longest match between subsequences of \(POS_{p', infl}\) and \(POS_{p, infl}\), while in both sequences ignoring the information about the grammatical case.
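Both caseless variants can reuse the matchers above after stripping the case attribute from every tag; the sketch below assumes standard NKJP case values.

```python
def strip_case(tag: str) -> str:
    """Remove the grammatical-case attribute from a positional NKJP tag,
    e.g. 'subst:sg:gen:f' -> 'subst:sg:f'."""
    CASES = {"nom", "gen", "dat", "acc", "inst", "loc", "voc"}
    return ":".join(a for a in tag.split(":") if a not in CASES)

def caseless(seq: TagSeq) -> TagSeq:
    return tuple(strip_case(t) for t in seq)

# Caseless complete/partial matching then amounts to calling
# apply_complete_rule / apply_partial_rule on caseless(...) sequences
# (in a full implementation, the suffix copied from the query in the
# partial case should keep its original, cased tags).
```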

4.3 Generation of Lemmas

Having a phrase in an inflected form, we obtained its lemma in the following manner. First, we searched through all lemmas found in the corpus to check whether the lemma of that phrase had been found somewhere in the corpus. If the lemma was not found, we applied the rules described in the previous section in a cascade manner, in the same order as they were described above. This order was set to ensure that the rules which we assumed would produce better results were applied before the less reliable ones. If we found grounds to apply a rule of a certain type, we generated the lemma for the phrase using the selected rule and ignored the rules of the subsequent types. In some cases, the analyzed phrase could match the LHSs of two different rules of the same type; in such situations, we chose the rule to apply randomly.

Once we decided that a certain rule should be applied, its RHS told us how the words in the phrase should be inflected. For the inflection of single words, we used the PoliMorf [10] dictionary.
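Putting the pieces together, the cascade (after the initial corpus lookup) may be sketched as follows; unlike the paper's random tie-breaking between several matching rules of one type, the dictionary lookup here simply returns one match.

```python
def lemma_tags(pos_infl: TagSeq,
               complete_rules: Dict[TagSeq, TagSeq],
               pairs: List[Tuple[TagSeq, TagSeq]]) -> Optional[TagSeq]:
    """Cascade from Subsect. 4.3: complete -> partial -> caseless
    complete -> caseless partial; returns target tags or None."""
    rhs = apply_complete_rule(pos_infl, complete_rules)
    if rhs is None:
        rhs = apply_partial_rule(pos_infl, pairs)
    if rhs is None:
        caseless_complete = {caseless(l): r for l, r in complete_rules.items()}
        rhs = caseless_complete.get(caseless(pos_infl))
    if rhs is None:
        caseless_pairs = [(caseless(l), r) for l, r in pairs]
        rhs = apply_partial_rule(caseless(pos_infl), caseless_pairs)
    return rhs  # the target tags drive per-word inflection with PoliMorf
```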

5 Evaluation

We performed an experiment in which we wanted to assess what accuracy of lemma identification may be achieved with the described approach. In the experiment, we identified lemmas for all MWUs classified as inflected, according to the procedure described in Subsect. 4.3. Using the described approach, we tried to identify lemmas for 1067 inflected MWUs extracted from the corpus.

Table 2. Accuracy of automatic lemmatization of MWUs using different lemmatization rules, and the percentage of correctly extracted phrases among all phrases that were lemmatized using a given lemmatization rule type

The evaluation of the accuracy of lemmatization was performed manually. A human annotator (a native speaker of Polish) was presented with pairs, each consisting of an inflected phrase and the lemma generated (or identified) for that phrase. The annotator was to assign two annotations to each pair:

  • an annotation stating whether the lemma generated for the given phrase is correct,

  • an annotation stating whether the phrase is processable; by processable we understand phrases which:

    • are correctly extracted, i.e. span the whole entity name; incorrectly extracted phrases are, for example, phrases missing some words of the entity name (for example the first or the last word),

    • contain only words that may be inflected using the available dictionary; many phrases contain non-Polish words or proper names which are impossible to lemmatize correctly without appropriate dictionaries; we decided to annotate such phrases as unprocessable.

The results of the annotation are presented in Table 2. There are four columns with statistics in the table. In the column “accuracy for all phrases” we give the accuracy for all phrases, regardless of whether they were annotated as processable or not. In the column “accuracy for processable phrases” we do not take into account phrases annotated as unprocessable. In the third column, we give the percentage of MWUs lemmatized using a given rule type which were annotated as processable. Finally, the last column states how many phrases were lemmatized using a given lemmatization rule type.

The total accuracy of the proposed approach, when only processable phrases are considered, was above 82%. When taking all phrases into account (including the incorrectly extracted ones and the MWUs containing words which we were not able to inflect), the result was around 76%. For most of the inflected phrases (634 out of 1063), the lemma could be found in the corpus using the proposed approach. In these cases, more than 94% of the lemmas were assigned correctly.

For the remaining inflected MWUs, lemmas had to be generated automatically using the rules described in Subsect. 4.2. The accuracy of lemma generation for phrases annotated as processable was generally between 84% and 92%, except for caseless partial rules, which performed much worse than the other types of rules. To some extent, this is probably due to the fact that rules of this type were executed only when no other rule could lemmatize a given phrase. Because of that, the rules of this type were dealing with the most difficult MWUs. For 89 phrases, we were not able to generate the lemma at all using the developed approach (none of the generated rules matched the POS tag sequences of these phrases).

We also experimented with lemmatizing the MWUs in our dataset using the Wikipedia-based mappings described in Sect. 3.2. We searched the NeLexicon mappings dataset for occurrences of the inflected entity names from our dataset. We found a match in only 12 cases out of 934 searched phrases (only phrases assessed as correctly extracted from the corpus were used). Still, all 12 lemmas obtained for these phrases from the NeLexicon mappings were correct. These results correspond to our expectations described in Sect. 3.2: Wikipedia mappings allow correct lemmas to be obtained, but, in the case of our corpus, these mappings are applicable to only a very limited number of inflected MWUs. An interesting utilization of the NeLexicon dataset would be to use the provided mappings as additional examples for learning the lemmatization rules, as described in Sect. 4.

6 Summary

In this paper, we presented an approach to the automatic lemmatization of Multi-Word Units for Polish and an evaluation of the lemmatization accuracy which may be obtained using the proposed approach. The presented method utilizes rules automatically generated based on corpus analysis. The conducted experiments revealed that the accuracy of automatic lemmatization of MWUs for Polish may reach up to 82%. We believe that such results prove that automatic lemmatization of MWUs may be used for some tasks. When high accuracy is a crucial factor, the proposed method may be followed by an additional step of verification by a human expert. In such a case, the amount of manual work by the expert would be greatly reduced compared to a situation in which they would have to assign lemmas to all phrases without any aid from a computer system.