1 Introduction and Related Work

Setting aside so-called isolating languages such as Mandarin Chinese, which do not have grammatical categories (such as case, gender, number, tense, etc.), languages show varying degrees of inflection (and derivation), ranging from weakly inflected ones such as English to highly inflected ones such as Spanish, Czech, Lithuanian, Turkish, Arabic or Hebrew. In this paper we focus on the fusional Lithuanian language, whose rich inflectional morphology is even more complex than that of Latvian or the Slavic languages [20]. Different morphological categories are expressed by various endings attached to the stable part of a word (i.e., to a root or to a root with affixes). In highly inflectional languages hundreds of word forms can be generated from a single root (e.g., \(\sim \)2–3 million grammatical forms can be derived from a dictionary of \(\sim \)100 thousand words [9]); moreover, these forms often coincide with forms of other grammatical categories or parts-of-speech. Thus, the rate of ambiguous morphological forms in the Lithuanian language reaches \(\sim \)47% [18].

Morphological analysis has enjoyed great success since the invention of Two-Level morphology and the development of the finite-state technology on which the two-level formalism is based [14]. According to how they are created, existing morphological analyzers can be divided into knowledge-based (sometimes called rule-based and/or lexicon-based), supervised, and unsupervised. Unsupervised approaches (which segment raw text into morphs, e.g., un+fail+ing+ly) have recently become very attractive, because they do not require gold morphological labels and an unlimited amount of raw text is available for any language; however, they are more suitable for agglutinative languages [3]. Knowledge-based approaches rely on rules and lexicons prepared by linguist-experts and do not require additional resources. Probably for these reasons, this approach is still the most widespread and is used for many different languages: English [11, 19], French [7], Russian and Spanish [14], Urdu and Hindi [1, 5], Tamil [2], etc.

Corpus-based morphological analyzers are the closest alternative to knowledge-based approaches. Although such systems are built automatically in a supervised manner, the induced rules are based on gold morphological annotations found in the training data. The annotation process itself is very laborious and requires deep language expertise, but such analyzers can easily be retrained and improved once more annotated texts are added. Analyzers of this type are used for many languages: Dutch [6], Swahili [17], Hindi [15], Kazakh [12], Arabic [13], Polish [10], etc.

Morphological analysis is important in NLP applications such as information retrieval, parsing, or machine translation (especially when translating from or into a morphologically rich language). Each module of such a complex system, including the morphological analyzer, has to be as accurate and reliable as possible, because the overall accuracy depends on the cumulative accuracies of the separate modules. The priority is accuracy, regardless of whether the analyzer is developed using a rule-based or a corpus-based approach. The aim of this research is to evaluate and compare the available morphological analyzers for the Lithuanian language and to determine which of them is the most accurate.

2 The Lithuanian Morphological Analyzers

The Lithuanian language is spoken by only \(\sim \)3.2 million people worldwide; therefore it is not very attractive for big companies. Nevertheless, this field of research has a rather long history. The first prototype of a Lithuanian morphological analyzer was created \(\sim \)30 years ago, and since then there have been several attempts to create an accurate tool that copes with the complex Lithuanian morphology. Despite all of those attempts, there are only two reliable morphological analyzers and lemmatizers that are still maintained, updated, and publicly available on-line:

  1.

    Lemuoklis (Footnote 1) was initially a purely rule- and lexicon-based tool (described in detail by its author V. Zinkevičius in [22]) and was later extended with a statistical approach for the disambiguation of morphological homoforms. In Lemuoklis the knowledge about the Lithuanian language is stored in a lexical and grammar database, which contains 6 lexicons: various Lithuanian lexical groups; proper nouns in the forms in which they are found in the corpus (that is, without lemmatization); the stems of these proper nouns; obsolete and dialectal word forms; forms with shortened endings, which appear in literary and colloquial styles; and abbreviations and acronyms. Since the database also contains word stems, each stem can be augmented with affixes (prefixes, suffixes, endings), which, in turn, are determined according to the word's morphological type (i.e., its morphemic structure). Each analyzed word is split according to the various scheme options of the prefix + stem + postfix pattern; therefore the implemented inflectional models not only recognize different inflectional forms of existing words (including obsolete or dialectal ones, e.g., the illative), but also synthesize some derivatives in their various inflected forms. The lexical database contains \(\sim \)91 thousand different headwords in total; however, the number of theoretically possible grammatical word forms can reach several billion.

    Lemuoklis has been used for many practical tasks. One of its first versions was used in preparing the first frequency word list for the Lithuanian language [21], and it was later integrated into the Microsoft Office package and Information Base components, where it was used for automatic spell checking [22]. In 2000–2005 Lemuoklis was applied to a \(\sim \)1 million word corpus, which was afterwards manually corrected by a linguist-expert (for more information see [18]); this led to the creation of the first lemmatized and morphologically annotated gold-standard corpus for the Lithuanian language. This research showed that, of the \(\sim \)89% of Lithuanian words that are recognized automatically, no less than \(\sim \)47% are ambiguous. The disambiguation problem was solved by complementing the rule- and lexicon-based approach with a statistical trigram Hidden Markov Model method (described in [8]): this version reached \(\sim \)94% accuracy for annotation and \(\sim \)99% for lemma assignment.

  2.

    Semantika.lt (Footnote 2) is a morphological analyzer created with the ambition to outperform its predecessor Lemuoklis, which still suffers from shortcomings such as rather low performance on proper nouns. The main reason for designing a new tool was that the Lemuoklis data were hard-coded, which makes it difficult to enrich the lexical database. Semantika.lt also follows a hybrid approach: it is built on the Hunspell open-source platform (consisting of a lexicon and affixes) supplemented with a statistical method for the disambiguation task. The information included in the lexicon was taken from the following sources: the 6th edition of the Modern Lithuanian Dictionary; the Corpus of the Contemporary Lithuanian Language at the Centre of Computational Linguistics of Vytautas Magnus University (\(\sim \)100 million tokens; \(\sim \)600 thousand unique); the database of the Lithuanian Parliamentary documents (\(\sim \)400 million tokens; \(\sim \)1 million unique); and various public Internet sources. The resulting analyzer comprises 429 groups of rules; 1,518 explicit tags for flexing/non-flexing properties; and 5,832 rules for suffix and affix alternation in 16,734 alternation cases. The total number of headwords in Semantika.lt is \(\sim \)146 thousand, of which \(\sim \)38 thousand are common nouns, \(\sim \)67 thousand proper nouns, \(\sim \)12 thousand adjectives, \(\sim \)23 thousand verbs, and \(\sim \)4 thousand words from other classes. As in Lemuoklis, the disambiguation problem is solved using a statistical trigram Markov model with the Viterbi algorithm.
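
To illustrate the disambiguation step that both analyzers share, the following is a minimal sketch of Viterbi decoding over the candidate analyses proposed by a lexicon. It is not the implementation used in Lemuoklis or Semantika.lt: for brevity it uses a bigram rather than a trigram model, and the tag names, the probabilities, and the function name `viterbi` are illustrative assumptions.

```python
import math


def viterbi(candidates, trans, start="<s>", floor=1e-6):
    """Pick the most probable tag sequence (simplified bigram model).

    candidates: one dict per token, {tag: lexical probability}; every token
                is assumed to have at least one candidate analysis.
    trans:      {(previous_tag, tag): transition probability}; unseen
                transitions fall back to the small `floor` value.
    """
    # best[tag] = (log-score of the best path ending in tag, that path)
    best = {start: (0.0, [])}
    for options in candidates:
        new_best = {}
        for tag, lex_p in options.items():
            # Choose the best predecessor for this candidate tag.
            score, path = max(
                (prev_score + math.log(trans.get((prev, tag), floor)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[tag] = (score + math.log(lex_p), path + [tag])
        best = new_best
    return max(best.values())[1]


# Toy data (hypothetical): the second token is lexically more likely a VERB,
# but the context makes the NOUN reading win.
candidates = [
    {"DET": 1.0},
    {"NOUN": 0.4, "VERB": 0.6},
    {"VERB": 0.7, "NOUN": 0.3},
]
trans = {
    ("<s>", "DET"): 0.8,
    ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.05,
    ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.2,
    ("VERB", "NOUN"): 0.4, ("VERB", "VERB"): 0.1,
}
print(viterbi(candidates, trans))
```

A trigram model of the kind mentioned above would condition each transition on the two preceding tags instead of one; the dynamic-programming structure stays the same, with tag pairs as states.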

Thus, in terms of the number of headwords, Semantika.lt clearly surpasses Lemuoklis; on the other hand, Lemuoklis uses the synthesis method to handle some frequent derivation patterns (e.g., some regular agentive and diminutive forms). Besides, although Lemuoklis has been updated over the years, it has not been experimentally evaluated since 2007. Moreover, Semantika.lt has never been fully evaluated on a gold-standard corpus. In general, no evaluation with an explicit methodology has been carried out. Therefore it is currently unclear which analyzer is more accurate and whether the difference in their accuracy is statistically significant.

The contribution of this research is to evaluate both of these analyzers and to compare their results following standard up-to-date methods for tool evaluation. For the research to be carried out correctly, it is important (1) to equalize the experimental conditions for both analyzers (to evaluate them on the same gold-standard corpus, to unify their annotation tags, and to use the same evaluation metrics); and (2) to test them on an unseen corpus that was used neither in the creation of the rules or lexicons nor in training for the disambiguation task. Besides, we anticipate that the publicly available morphologically annotated gold-standard corpus (presented in this paper) could be treated as a benchmark corpus and used for the evaluation and comparison of other existing or forthcoming morphological analyzers.

3 The Comparative Evaluation

The experimental comparison (described in Sect. 3.2) of both Lithuanian morphological analyzers (presented in Sect. 2) was performed on a morphologically annotated gold-standard corpus (described in Sect. 3.1). The discordance between the annotation formats (of the gold corpus and of the texts produced by the two analyzers) was resolved by converting all formats to a single one based on the Leipzig glossing rules [4] used in the Universal Dependencies Project (Footnote 3).
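
The format unification can be pictured as a mapping from each tool's native tags to one shared representation. The sketch below is only an illustration of that idea: the two input formats (`N-m-sg-nom` and `noun;masc;sing;nom`), the mapping tables, and the function name `normalize` are hypothetical and do not reproduce the actual tagsets of Lemuoklis, Semantika.lt, or the gold corpus.

```python
# Hypothetical mapping tables; the real tagsets are not reproduced here.
POS_MAP = {"n": "NOUN", "noun": "NOUN", "v": "VERB", "verb": "VERB"}
FEATURE_MAP = {
    "m": "M", "masc": "M", "f": "F", "fem": "F",
    "sg": "SG", "sing": "SG", "pl": "PL", "plur": "PL",
    "nom": "NOM", "gen": "GEN",
}


def normalize(tag, sep):
    """Convert a native tag string into (part-of-speech, set of features).

    A KeyError is raised for labels missing from the tables, which makes
    unmapped tags visible instead of silently dropping them.
    """
    pos, *features = tag.lower().split(sep)
    return POS_MAP[pos], frozenset(FEATURE_MAP[f] for f in features)


# Two differently formatted tags for the same analysis become comparable.
tag_a = normalize("N-m-sg-nom", "-")
tag_b = normalize("noun;masc;sing;nom", ";")
print(tag_a == tag_b)
```

Representing the features as a set makes the comparison independent of the order in which a tool lists them.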

3.1 The Gold-Standard Corpus

The first morphologically annotated gold-standard corpus (called MATAS; Footnote 4) was prepared by the Centre of Computational Linguistics at Vytautas Magnus University. It contains 1,641,263 words and covers 4 domains, in particular, administrative, fiction, scientific and periodical texts. MATAS was prepared in a semi-automatic manner: the initial annotations were obtained with Lemuoklis and afterwards manually verified and corrected by one linguist-expert.

Unfortunately, we could not use the entire corpus for our experiments, because some parts of it had already been used in training the Semantika.lt morphological analyzer. Thus, we had to select and annotate additional texts, taking into account two important factors: (1) they must not have been used in the creation or training of Lemuoklis and Semantika.lt; (2) the resulting gold-standard corpus has to be balanced (in terms of words) so that the results are not biased towards the largest domains. Hence, for the experiments we selected texts containing \(\sim \)5 thousand words in each domain, resulting in \(\sim \)20 thousand words in total. The statistics of the gold-standard corpus are presented in Table 1.

Table 1. The distribution of total and distinct (in brackets) words over different parts-of-speech and domains in the gold-standard corpus. The "unrecognized words" label denotes foreign-language or misspelled Lithuanian words.
Fig. 1. The lemmatization accuracies for Lemuoklis (white columns) and Semantika.lt (gray columns). Results whose differences are not statistically significant are connected with a solid black line (see the scientific domain).

Fig. 2. The micro-accuracy/micro-f-score values for parts-of-speech (the left diagram) and morphological categories (the right diagram) for Lemuoklis (white columns) and Semantika.lt (gray columns).

3.2 The Experimental Set-Up and Results

In our experiments we compared the gold annotations with the automatic annotations produced by Lemuoklis and Semantika.lt and calculated the accuracy and f-score values.
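
As a rough illustration of how such values can be computed from token-aligned annotations, consider the sketch below. It assumes one label per token and that gold and automatic annotations are aligned one to one; the label names and the helper names `accuracy` and `f_scores` are illustrative, not taken from the evaluation scripts used in the experiments.

```python
from collections import Counter


def accuracy(gold, pred):
    """Share of tokens whose predicted label equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be aligned token by token")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f_scores(gold, pred):
    """Per-label f-score, i.e. the harmonic mean of precision and recall."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted label gains a false positive
            fn[g] += 1  # the gold label gains a false negative
    scores = {}
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy example with hypothetical part-of-speech labels.
gold = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN"]
pred = ["NOUN", "VERB", "ADJ", "ADJ", "NOUN"]
print(accuracy(gold, pred))
print(f_scores(gold, pred))
```

In this toy example four of the five tokens are correct, and the single error lowers the recall of NOUN and the precision of ADJ.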

Moreover, we evaluated whether the differences between the results obtained by the different morphological analyzers are statistically significant. The evaluation was done using the McNemar test [16] with one degree of freedom at the significance level of \(\alpha = 0.05\), meaning that the differences are considered statistically significant if the calculated p-value satisfies \(p < \alpha \).
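
A minimal sketch of such a test on token-aligned outputs is given below. It counts only the tokens on which exactly one analyzer is correct and compares the resulting statistic with the chi-square distribution with one degree of freedom. The continuity correction is an assumption (the text does not state whether it was applied), and the function name `mcnemar` and the toy data are illustrative.

```python
import math


def mcnemar(gold, pred_a, pred_b, correction=True):
    """Return (chi-square statistic, p-value) of the McNemar test."""
    b = c = 0
    for g, a, s in zip(gold, pred_a, pred_b):
        if a == g and s != g:
            b += 1  # only analyzer A is correct
        elif a != g and s == g:
            c += 1  # only analyzer B is correct
    if b + c == 0:
        return 0.0, 1.0  # the analyzers never disagree in correctness
    diff = abs(b - c)
    if correction:
        diff = max(diff - 1, 0)  # continuity correction (assumed)
    stat = diff ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Toy data: A alone is right on 5 tokens, B alone on 20, both on 75.
gold = ["N"] * 100
pred_a = ["N"] * 5 + ["V"] * 20 + ["N"] * 75
pred_b = ["V"] * 5 + ["N"] * 20 + ["N"] * 75
stat, p_value = mcnemar(gold, pred_a, pred_b)
print(stat, p_value, p_value < 0.05)
```

Tokens on which both analyzers are right or both are wrong do not influence the statistic, which is why the test suits paired comparisons on the same corpus.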

The obtained lemmatization accuracies are presented in Fig. 1. The micro-accuracy (or micro-f-score) values for the parts-of-speech (i.e., coarse-grained morphological information) and the morphological categories (i.e., fine-grained information such as case, gender, number, voice, tense, etc.) are summarized in Fig. 2. The f-score values distributed over the different parts-of-speech are presented in Table 2.
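
The interchangeable use of micro-accuracy and micro-f-score can be justified as follows, under the assumption that every token receives exactly one label. Let \(TP_c\), \(FP_c\) and \(FN_c\) denote the true positives, false positives and false negatives of class \(c\), and let \(N\) be the number of tokens (this notation is introduced here for illustration only). Micro-averaging pools the counts over all classes:

\[
P_{\mathrm{micro}} = \frac{\sum_c TP_c}{\sum_c (TP_c + FP_c)}, \qquad
R_{\mathrm{micro}} = \frac{\sum_c TP_c}{\sum_c (TP_c + FN_c)}.
\]

Each misclassified token contributes exactly one false positive (to the predicted class) and one false negative (to the gold class), so \(\sum_c FP_c = \sum_c FN_c = N - \sum_c TP_c\). Hence

\[
P_{\mathrm{micro}} = R_{\mathrm{micro}} = F_{\mathrm{micro}} = \frac{\sum_c TP_c}{N},
\]

which is exactly the accuracy.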

Table 2. The calculated f-score values for various parts-of-speech in different domains. Lem and Sem stand for Lemuoklis and Semantika.lt, respectively.

4 Discussion

As can be seen from Fig. 1, the best lemmatization results are obtained with the Semantika.lt morphological analyzer. Although the difference is very small (only \(\sim \)1.7 percentage points on the entire gold-standard corpus), it is still statistically significant. The superiority of Semantika.lt over Lemuoklis is especially apparent on fiction and periodical texts. Fiction is usually characterized by a rich vocabulary, whereas periodical texts are full of neologisms and specific terminology. Thus, the larger number of headwords incorporated into Semantika.lt gives it an obvious advantage over Lemuoklis. Surprisingly, Semantika.lt slightly underperformed Lemuoklis on the scientific texts, but the difference is not statistically significant. The terminology used in scientific texts is not completely settled: some Anglicisms are more popular than their Lithuanian equivalents, and sometimes Lithuanian equivalents do not even exist and thus are not recorded in the dictionary.

The left diagram in Fig. 2 presents the coarse-grained annotation results. The difference between the results on the entire gold-standard corpus is \(\sim \)2.5 percentage points: the largest gaps are again on fiction (\(\sim \)3.5) and periodicals (\(\sim \)3.1), the smallest on scientific texts (\(\sim \)0.4). As in the lemmatization task, the difference on the scientific texts is again not statistically significant. In the right diagram of Fig. 2, which presents the fine-grained morphological categorization results (determined cases, genders, tenses, etc.), the advantage of Lemuoklis over Semantika.lt on the scientific texts is statistically significant (the difference is \(\sim \)5.3 percentage points). However, on the entire gold-standard corpus (the difference is \(\sim \)8.1 percentage points) and on the other domains, namely fiction (\(\sim \)17.8), administrative (\(\sim \)11.0), and periodicals (\(\sim \)8.9), the superiority of Semantika.lt is apparent. The lower accuracy in the fine-grained annotation is due to the complicated disambiguation problem and to out-of-vocabulary words. The latter are the main drawback of both morphological analyzers: if an analyzer cannot recognize a word and determine its lemma (leaving the word in its original, untouched form), it cannot recover any other morphological information. Thus, errors at the first, lemmatization stage cause errors at the subsequent annotation stages: part-of-speech recognition and then morphological categorization.

The detailed error analysis (see Table 2) reveals some major mistakes. The most complicated issue for both analyzers is the auxiliary words (i.e., conjunctions and particles), which can be assigned to different parts-of-speech without absolutely clear criteria (some numerals also face this problem). In contrast, the Semantika.lt analyzer demonstrates a significant improvement on proper nouns compared to Lemuoklis. A very specific mistake of Lemuoklis is the confusion of one-letter abbreviations with one-letter interjections (even though upper-case interjections at the end of a direct sentence are very rare).

5 Conclusion and Future Work

This comparative research disclosed the strengths and weaknesses of the two most popular publicly available Lithuanian morphological analyzers: Lemuoklis and Semantika.lt. Both analyzers were evaluated on 4 domains of the same gold-standard corpus.

The morphological analyzers Lemuoklis/Semantika.lt achieved \(\sim \)96.3%/\(\sim \)98.0%, \(\sim \)92.8%/\(\sim \)95.3%, and \(\sim \)78.7%/\(\sim \)86.8% accuracy on lemmatization, part-of-speech tagging, and annotation of the morphological categories, respectively. Although Semantika.lt was superior to Lemuoklis on the entire gold-standard corpus and on the administrative, fiction, and periodical texts, Lemuoklis yielded statistically indistinguishable performance on the scientific texts and there even outperformed Semantika.lt on the annotation of the morphological categories.

The experiments with Lemuoklis and Semantika.lt were carried out on normative Lithuanian texts. In future research we plan to test their robustness on more challenging types of texts: forum posts, Internet comments, tweets, etc.