Abstract
In this paper we present a comparative study disclosing the strengths and weaknesses of the two most popular publicly available Lithuanian morphological analyzers, Lemuoklis and Semantika.lt. Their performance in lemmatization, part-of-speech tagging, and fine-grained annotation of morphological categories (such as case, gender, tense, etc.) was evaluated on a morphologically annotated gold-standard corpus covering four domains: administrative, fiction, scientific, and periodical texts. Semantika.lt significantly outperformed Lemuoklis by \(\sim \)1.7%, \(\sim \)2.5%, and \(\sim \)8.1% on the lemmatization, part-of-speech tagging, and fine-grained annotation tasks, achieving \(\sim \)98.0%, \(\sim \)95.3%, and \(\sim \)86.8% accuracy, respectively.
Semantika.lt was also superior on the administrative, fiction, and periodical texts; however, Lemuoklis yielded similar performance on the scientific texts and even outperformed Semantika.lt there in the fine-grained annotation task.
Keywords
- Lithuanian morphological analysers
- Gold-standard corpus
- Experimental evaluation
- The Lithuanian language
1 Introduction and Related Work
Setting aside so-called isolating languages such as Mandarin Chinese, which do not have grammatical categories (such as case, gender, number, tense, etc.), languages show a varied degree of inflection (and derivation), ranging from weakly inflected languages such as English to highly inflected ones such as Spanish, Czech, Lithuanian, Turkish, Arabic, or Hebrew. In this paper we focus on the fusional Lithuanian language, whose rich inflectional morphology is even more complex than that of Latvian or the Slavic languages [20]. Different morphological categories are expressed by various endings attached to the stable part of a word (i.e., to a root or to a root with affixes). In highly inflected languages hundreds of word forms can be generated from a single root (e.g., \(\sim \)2–3 million grammatical forms can be used for a dictionary of \(\sim \)100 thousand words [9]); moreover, these forms often coincide with forms of other grammatical categories or parts-of-speech. Thus, the rate of ambiguous morphological forms for the Lithuanian language reaches as much as \(\sim \)47% [18].
Morphological analysis has enjoyed great success since the invention of Two-Level morphology and the development of the finite-state technology on which the two-level formalism is based [14]. According to their creation method, existing morphological analyzers can be divided into knowledge-based (sometimes called rule-based and/or lexicon-based), supervised, and unsupervised. Although unsupervised approaches (segmenting raw text into morphs, e.g., un+fail+ing+ly) have recently become very attractive (because they do not require gold morphological labels and because text resources for any language are practically unlimited), they are more suitable for agglutinative languages [3]. Knowledge-based approaches rely on rules/lexicons prepared by expert linguists and do not require additional resources. Probably for these reasons, this approach is still the most widespread and is used for many different languages: English [11, 19], French [7], Russian and Spanish [14], Urdu and Hindi [1, 5], Tamil [2], etc.
Corpus-based morphological analyzers are the closest alternative to knowledge-based approaches. Although such systems are built automatically in a supervised manner, the induced rules still depend on gold morphological annotations found in the training data. The annotation process itself is very laborious and requires deep language expertise, but such analyzers can easily be redeveloped and improved by adding more annotated texts. Analyzers of this type are used for many languages: Dutch [6], Swahili [17], Hindi [15], Kazakh [12], Arabic [13], Polish [10], etc.
Morphological analysis is important in NLP applications such as information retrieval, parsing, or machine translation (especially when translating from or into a morphologically rich language). Each module of such a complex system, including the morphological analyzer, has to be as accurate and reliable as possible, because the overall accuracy depends on the cumulative accuracies of the separate modules. The priority is accuracy, no matter whether the analyzer is developed using a rule-based or a corpus-based approach. The aim of this research is to evaluate and compare the morphological analyzers for the Lithuanian language and to determine the most accurate one.
2 The Lithuanian Morphological Analyzers
The Lithuanian language is spoken by only \(\sim \)3.2 million people worldwide; therefore it is not very attractive for big companies. Nevertheless, this field of research has a rather long history. The first prototype of a Lithuanian morphological analyzer was created \(\sim \)30 years ago, and ever since there have been several attempts to create an accurate tool that copes with the complex Lithuanian morphology. Despite all of those attempts, there are only two reliable morphological analyzers and lemmatizers which are still maintained, updated, and publicly available online:
1. Lemuoklis was initially a purely rule- and lexicon-based approach (described in detail by its founder V. Zinkevičius in [22]), later extended with a statistical approach for the disambiguation of morphological homoforms. In Lemuoklis the knowledge about the Lithuanian language is stored in a lexical and grammar database, which contains 6 lexicons: various Lithuanian lexical groups; proper nouns in the forms in which they are found in the corpus (that is, without lemmatization); the stems of these proper nouns; obsolete and dialectal word forms; forms with shortened endings, which appear in literary and colloquial styles; and abbreviations and acronyms. Since the database also contains word stems, each stem can be augmented with affixes (prefixes, suffixes, endings), which, in turn, are determined according to the word’s morphological type (i.e., its morphemic structure). Each analyzed word is segmented according to various scheme options using the prefix + stem + postfix pattern; therefore the implemented inflectional models not only recognize different inflectional forms (including obsolete or dialectal ones, e.g., the illative) of existing words, but also synthesize some derivatives in their various inflected forms. The lexical database contains \(\sim \)91 thousand different headwords in total; however, the number of theoretically possible grammatical word forms can reach several billion.
Lemuoklis has been used for many practical tasks. One of its first versions was used in preparing the first frequency word list for the Lithuanian language [21]; it was later integrated into the Microsoft Office package and Information Base components and used for automatic spell checking [22]. In 2000–2005 Lemuoklis was applied to a \(\sim \)1 million word corpus, which was afterwards manually corrected by a linguist-expert (for more information see [18]); this led to the creation of the first lemmatized and morphologically annotated gold-standard corpus for the Lithuanian language. This research showed that, of the \(\sim \)89% of Lithuanian words that are recognized automatically, no less than \(\sim \)47% are ambiguous. The disambiguation problem was solved by complementing the rule- and lexicon-based approach with a statistical trigram Hidden Markov Model method (described in [8]): this version reached \(\sim \)94% accuracy for annotation and \(\sim \)99% for lemma assignment.
2. The Semantika.lt morphological analyzer was created with the ambition to outperform its predecessor Lemuoklis, which still has not got rid of such shortcomings as rather low performance on proper nouns. The main reason for designing a new tool was that the Lemuoklis data was hard-coded, which makes it difficult to enrich the lexical database. Semantika.lt is also a hybrid approach: it is built on the Hunspell open-source platform (consisting of a lexicon and affixes), supplemented with a statistical method for the disambiguation task. The information included in the lexicon was taken from the following sources: the 6th edition of the Modern Lithuanian Dictionary; the Corpus of the Contemporary Lithuanian Language at the Centre of Computational Linguistics of Vytautas Magnus University (\(\sim \)100 million tokens; \(\sim \)600 thousand unique); the database of the Lithuanian Parliamentary documents (\(\sim \)400 million tokens; \(\sim \)1 million unique); and various public Internet sources. The resulting analyzer comprises 429 groups of rules; 1,518 explicit tags for flexing/non-flexing properties; and 5,832 rules for suffix and affix alternation in 16,734 alternation cases. The total number of headwords in Semantika.lt is \(\sim \)146 thousand, of which \(\sim \)38 thousand are common nouns, \(\sim \)67 thousand proper nouns, \(\sim \)12 thousand adjectives, \(\sim \)23 thousand verbs, and \(\sim \)4 thousand words from other classes. As in Lemuoklis, the disambiguation problem is solved using a statistical trigram Markov model with the Viterbi algorithm.
Thus, in terms of the number of headwords, Semantika.lt clearly surpasses Lemuoklis; on the other hand, Lemuoklis uses the synthesis method to handle some frequent derivation patterns (e.g., some regular agentive and diminutive forms). Besides, although Lemuoklis has been updated over the years, it has not been experimentally evaluated since 2007. Moreover, Semantika.lt has never been fully evaluated on a gold-standard corpus. In general, no evaluation with an explicit methodology has been carried out. Therefore it is currently not clear which analyzer is more accurate and whether the difference in their accuracy is statistically significant.
The contribution of this research is to evaluate both of these analyzers and to compare their results following standard up-to-date methods for tool evaluation. However, for the research to be carried out correctly, it is important (1) to equalize the experimental conditions for both analyzers (to evaluate them on the same gold-standard corpus, to harmonize their annotation tags, and to use the same evaluation metrics); and (2) to test them on an unseen corpus that was used neither in the rule or lexicon creation nor in training for the disambiguation problem. Besides, we anticipate that the publicly available morphologically annotated gold-standard corpus (presented in this paper) could be treated as a benchmark corpus and used for the evaluation and comparison of other existing or forthcoming morphological analyzers.
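The statistical disambiguation step mentioned for both analyzers in this section can be illustrated with a short sketch. It is a deliberate simplification and not either tool's implementation: it uses a bigram rather than a trigram model, omits emission probabilities (all candidate analyses of a word form are treated as equally likely), and every tag and probability in it is invented for illustration.

```python
import math


def viterbi(candidates, trans, start="<s>", floor=1e-6):
    """Pick the most probable tag sequence under a bigram tag model.

    candidates: one non-empty list of possible tags per token, e.g. as
                proposed by a lexicon-based analyzer (hypothetical input).
    trans:      dict mapping (previous_tag, tag) -> transition probability.
    floor:      probability assumed for transitions missing from `trans`.
    """
    # best[tag] = (log-probability of the best path ending in tag, that path)
    best = {start: (0.0, [])}
    for options in candidates:
        new_best = {}
        for tag in options:
            # Extend every surviving path by `tag` and keep the best one.
            score, path = max(
                (
                    (prev_score + math.log(trans.get((prev, tag), floor)), prev_path)
                    for prev, (prev_score, prev_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy input: the second token is ambiguous between NOUN and VERB.
toy_candidates = [["PRON"], ["NOUN", "VERB"], ["NOUN"]]
toy_trans = {
    ("<s>", "PRON"): 0.5,
    ("PRON", "VERB"): 0.6,
    ("PRON", "NOUN"): 0.1,
    ("VERB", "NOUN"): 0.5,
    ("NOUN", "NOUN"): 0.2,
}
toy_result = viterbi(toy_candidates, toy_trans)
print(toy_result)
```

In this toy case the context favours the verbal reading of the ambiguous token. A trigram model, as reported for both analyzers, would condition each transition on the two preceding tags instead of one.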
3 The Comparative Evaluation
The experimental comparison (described in Sect. 3.2) of both Lithuanian morphological analyzers (presented in Sect. 2) was performed on a morphologically annotated gold-standard corpus (described in Sect. 3.1). The discrepancy between the annotation formats (of the gold corpus and of the texts produced by both analyzers) was resolved by converting all formats to a single one based on the Leipzig glossing rules [4] used in the Universal Dependencies Project (see Note 3).
3.1 The Gold-Standard Corpus
The first morphologically annotated gold-standard corpus (called MATAS; see Note 4) was prepared by the Centre of Computational Linguistics at Vytautas Magnus University. It contains 1,641,263 words and covers 4 domains: administrative, fiction, scientific, and periodical texts. MATAS was prepared in a semi-automatic manner: the initial annotations were obtained with Lemuoklis and afterwards manually verified and corrected by one linguist-expert.
Unfortunately, we could not use the entire corpus for our experiments, because some parts of it had already been used in training the Semantika.lt morphological analyzer. Thus, we had to select and annotate additional texts, taking into account two important factors: (1) they must not have been used in the creation/training of Lemuoklis or Semantika.lt; (2) the resulting gold-standard corpus has to be balanced (in terms of words) so that the results are not biased towards the largest domains. Hence, for the experiments we selected texts that contain \(\sim \)5 thousand words in each domain, resulting in \(\sim \)20 thousand words in total. The statistics of the gold-standard corpus are presented in Table 1.
3.2 The Experimental Set-Up and Results
In our experiments we compared the gold annotations with the automatic annotations produced by Lemuoklis and Semantika.lt and calculated the accuracy and f-score values.
Moreover, we evaluated whether the differences between the results obtained by the two morphological analyzers are statistically significant. The evaluation was done using the McNemar test [16] with one degree of freedom at the significance level of \(\alpha = 0.05\), meaning that the differences are considered statistically significant if the calculated p-value satisfies \(p < \alpha \).
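A minimal sketch of such a significance check is given below, assuming token-aligned lists of gold and predicted labels. The function name and the use of the continuity correction are assumptions of this sketch; the paper does not state which variant of the test was applied.

```python
import math


def mcnemar(gold, pred_a, pred_b, continuity=True):
    """McNemar chi-square test (1 degree of freedom) on paired decisions.

    Only discordant tokens matter:
      b = tokens analyzer A gets right and analyzer B gets wrong,
      c = tokens analyzer A gets wrong and analyzer B gets right.
    Returns (chi2, p_value).
    """
    b = sum(1 for g, x, y in zip(gold, pred_a, pred_b) if x == g and y != g)
    c = sum(1 for g, x, y in zip(gold, pred_a, pred_b) if x != g and y == g)
    if b + c == 0:
        return 0.0, 1.0  # the analyzers never disagree in correctness
    diff = abs(b - c) - (1.0 if continuity else 0.0)
    chi2 = max(diff, 0.0) ** 2 / (b + c)
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy data (invented): A is always right, B errs on 12 of 30 tokens.
toy_gold = ["N"] * 30
toy_a = ["N"] * 30
toy_b = ["V"] * 12 + ["N"] * 18
chi2, p = mcnemar(toy_gold, toy_a, toy_b)
print(chi2, p, p < 0.05)
```

Because the test uses only the tokens on which exactly one analyzer is correct, it suits comparisons in which both tools are run on the same corpus.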
The obtained lemmatization accuracies are presented in Fig. 1. The micro-accuracy (or micro-f-score) values for the parts-of-speech (i.e., coarse-grained morphological information) and the morphological categories (i.e., fine-grained information such as case, gender, number, voice, tense, etc.) are summarized in Fig. 2. The f-score values for the individual parts-of-speech are presented in Table 2.
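The metrics named above can be sketched as follows, assuming that every token carries exactly one gold label and exactly one predicted label. Under that assumption the micro-averaged f-score coincides with accuracy, which is presumably why the two terms are used interchangeably here. The function names and the toy labels are illustrative only.

```python
from collections import Counter


def accuracy(gold, pred):
    """Share of tokens whose predicted label equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be token-aligned")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_class_f1(gold, pred):
    """F-score for every label, e.g. for every part-of-speech."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed
    scores = {}
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def micro_f1(gold, pred):
    """F-score computed from counts pooled over all labels."""
    tp = sum(g == p for g, p in zip(gold, pred))
    errors = len(gold) - tp  # each error is one false positive and one false negative
    return 2 * tp / (2 * tp + 2 * errors)


toy_gold = ["NOUN", "VERB", "NOUN", "ADJ"]
toy_pred = ["NOUN", "NOUN", "NOUN", "ADJ"]
toy_scores = per_class_f1(toy_gold, toy_pred)
print(accuracy(toy_gold, toy_pred), micro_f1(toy_gold, toy_pred), toy_scores)
```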
4 Discussion
As can be seen from Fig. 1, the best lemmatization results are obtained with the Semantika.lt morphological analyzer. Although the difference is very small (i.e., only \(\sim \)1.7 percentage points on the entire gold-standard corpus), it is still statistically significant. The superiority of Semantika.lt over Lemuoklis is especially apparent on fiction and periodical texts. Fiction is usually characterized by a rich vocabulary, whereas periodical texts are full of neologisms and specific terminology. Thus, the larger number of headwords incorporated into Semantika.lt gives it an obvious advantage over Lemuoklis. Surprisingly, Semantika.lt slightly underperformed Lemuoklis on the scientific texts, but the difference is not statistically significant. The terminology used in scientific texts is not completely settled: some Anglicisms are more popular than their Lithuanian equivalents, and some Lithuanian equivalents do not even exist and are thus not recorded in the dictionary.
The left diagram in Fig. 2 presents the coarse-grained annotation results. The difference between the results on the entire gold-standard corpus is \(\sim \)2.5%: the largest gaps are again on fiction (\(\sim \)3.5%) and periodicals (\(\sim \)3.1%), the smallest on scientific texts (\(\sim \)0.4%). On the scientific texts, Semantika.lt outperforms Lemuoklis in this part-of-speech tagging task, but the difference, as in lemmatization, is not statistically significant. In the right diagram of Fig. 2, which presents the fine-grained morphological categorization results (determined cases, genders, tenses, etc.), the advantage of Lemuoklis over Semantika.lt on the scientific texts is statistically significant (the difference is \(\sim \)5.3%). However, on the entire gold-standard corpus (the difference is \(\sim \)8.1%) and on the other domains, namely fiction (\(\sim \)17.8%), administrative (\(\sim \)11.0%), and periodicals (\(\sim \)8.9%), the superiority of Semantika.lt is apparent. The lower accuracy in the fine-grained annotation is due to the complicated disambiguation problem and out-of-vocabulary words. The main drawback of both morphological analyzers concerns out-of-vocabulary words: if an analyzer cannot recognize a word and indicate its lemma (leaving it in its original, untouched form), it cannot recognize any other morphological information either. Thus, errors in the first, lemmatization stage cause errors in the following morphological annotation stages: part-of-speech recognition and, subsequently, morphological categorization.
The detailed error analysis (see Table 2) reveals some major mistakes. The most complicated issue for both analyzers is the auxiliary words (i.e., conjunctions and particles), which can be assigned to different parts-of-speech without absolutely clear criteria (some numerals also face this problem). However, the Semantika.lt analyzer demonstrates a significant improvement on proper nouns compared to Lemuoklis. A very specific mistake of Lemuoklis is the confusion of one-letter abbreviations with one-letter interjections (even though upper-case interjections at the end of a sentence are very rare).
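As a consistency check on the fine-grained differences reported above: because the four domains are balanced at \(\sim \)5 thousand words each (Sect. 3.1), the difference on the entire corpus should be close to the mean of the per-domain differences. Counting the scientific texts with a negative sign (Lemuoklis is ahead there) and assuming exactly equal domain sizes, which holds only approximately:

\[
\frac{17.8 + 11.0 + 8.9 - 5.3}{4} = \frac{32.4}{4} = 8.1
\]

This agrees with the reported overall difference of \(\sim \)8.1%.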
5 Conclusion and Future Work
This comparative study disclosed the strengths and weaknesses of the two most popular publicly available Lithuanian morphological analyzers: Lemuoklis and Semantika.lt. Both analyzers were evaluated on 4 domains of the same gold-standard corpus.
The morphological analyzers Lemuoklis/Semantika.lt achieved \(\sim \)96.3%/\(\sim \)98.0%, \(\sim \)92.8%/\(\sim \)95.3%, and \(\sim \)78.7%/\(\sim \)86.8% accuracy on lemmatization, part-of-speech tagging, and annotation of the morphological categories, respectively. Although Semantika.lt was superior to Lemuoklis on the entire gold-standard corpus and on the administrative, fiction, and periodical texts, Lemuoklis yielded comparable performance on the scientific texts and even outperformed Semantika.lt there on the annotation of the morphological categories.
The experiments with Lemuoklis and Semantika.lt were carried out on normative Lithuanian texts. In future research we plan to test their robustness on more challenging types of texts: forum posts, Internet comments, tweets, etc.
Notes
- 3. More about the Universal Dependencies Project is presented in http://universaldependencies.org/.
- 4. The annotated corpus can be downloaded from https://clarin.vdu.lt/xmlui/handle/20.500.11821/9.
References
1. Agarwal, A., Pramila, Singh, S.P., Kumar, A., Darbari, H.: Morphological analyser for Hindi - a rule based implementation. Int. J. Adv. Comput. Res. 4(1), 19–25 (2014)
2. Akilan, R., Naganathan, E.R.: Morphological analyzer for classical Tamil texts: a rule-based approach. IJISET - Int. J. Innovative Sci. Eng. Technol. 1(5), 563–568 (2014)
3. Baisa, V., Suchomel, V.: Large corpora for Turkic languages and unsupervised morphological analysis. In: Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC) (2012)
4. Bickel, B., Comrie, B., Haspelmath, M.: Leipzig Glossing Rules: Conventions for Interlinear Morpheme-by-Morpheme Glosses (2008)
5. Bögel, T., Butt, M., Hautli, A., Sulger, S.: Developing a finite-state morphological analyzer for Urdu and Hindi. In: The 6th International Workshop on Finite-State Methods and Natural Language Processing (FSMNLP 2007), pp. 86–96 (2007)
6. den Bosch, A.V., Daelemans, W.: Memory-based morphological analysis. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics (ACL 1999), pp. 285–292 (1999)
7. Byrd, R.J., Tzoukermann, E.: Adapting an English morphological analyzer for French. In: Proceedings of the 26th Annual Meeting on Association for Computational Linguistics (ACL 1988), pp. 1–6 (1988)
8. Daudaravičius, V., Rimkutė, E., Utka, A.: Morphological annotation of the Lithuanian corpus. In: Proceedings of the Workshop on Balto-Slavonic Natural Language Processing: Information Extraction and Enabling Technologies (ACL 2007), pp. 94–99 (2007)
9. Gelbukh, A., Sidorov, G.: Approach to construction of automatic morphological analysis systems for inflective languages with little effort. In: Gelbukh, A. (ed.) CICLing 2003. LNCS, vol. 2588, pp. 215–220. Springer, Heidelberg (2003). doi:10.1007/3-540-36456-0_21
10. Jędrzejowicz, P., Strychowski, J.: A neural network based morphological analyser of the natural language. In: Proceedings of the International Conference on Intelligent Information Processing and Web Mining (IIPWM 2005), pp. 199–208 (2005)
11. Karp, D., Schabes, Y., Zaidel, M., Egedi, D.: A freely available wide coverage morphological analyzer for English. In: Proceedings of the 14th Conference on Computational Linguistics, vol. 3, pp. 950–955 (1992)
12. Kessikbayeva, G., Cicekli, I.: A rule based morphological analyzer and a morphological disambiguator for Kazakh language. Linguist. Lit. Stud. 4(1), 96–104 (2016)
13. Khoufi, N., Boudokhane, M.: Statistical-based system for morphological annotation of Arabic texts. In: Recent Advances in Natural Language Processing (RANLP 2013), pp. 100–106 (2013)
14. Koskenniemi, K.: Two-level model for morphological analysis. In: Proceedings of the International Joint Conferences on Artificial Intelligence Organization (IJCAI 1983), pp. 683–685 (1983)
15. Malladi, D.K., Mannem, P.: Statistical morphological analyzer for Hindi. In: International Joint Conference on Natural Language Processing (IJCNLP 2013), pp. 1007–1011 (2013)
16. McNemar, Q.M.: Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12(2), 153–157 (1947)
17. Pauw, G.D., de Schryver, G.M.: Improving the computational morphological analysis of a Swahili corpus for lexicographic purposes. Lexikos 18, 303–318 (2008)
18. Rimkutė, E.: Morfologinio daugiareikšmiškumo ribojimas kompiuteriniame tekstyne [The Limitation of the Morphological Disambiguation in the Digitalized Corpus] (in Lithuanian). Ph.D. thesis, Vytautas Magnus University (2006)
19. Russell, G.J., Pulman, S.G., Ritchie, G.D., Black, A.W.: A dictionary and morphological analyser for English. In: Proceedings of the 11th Conference on Computational Linguistics (COLING 1986), pp. 277–279 (1986)
20. Savickienė, I., Kempe, V., Brooks, P.J.: Acquisition of gender agreement in Lithuanian: exploring the effect of diminutive usage in an elicited production task. J. Child Lang. 36, 477–494 (2009)
21. Žilinskienė, V.: Lietuvių kalbos dažninis žodynas [The Frequency Dictionary of the Lithuanian Language] (1990). (in Lithuanian)
22. Zinkevičius, V.: Lemuoklis - morfologinei analizei [Morphological analysis with Lemuoklis]. In: Gudaitis, L. (ed.) Darbai ir Dienos, vol. 24, pp. 246–273 (2000) (in Lithuanian)
Acknowledgments
The authors thank the researchers from LLC Fotonija, especially Virginijus Dadurkevičius, for providing information about the Semantika.lt morphological analyzer.
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Kapočiūtė-Dzikienė, J., Rimkutė, E., Boizou, L. (2017). A Comparison of Lithuanian Morphological Analyzers. In: Ekštein, K., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2017. Lecture Notes in Computer Science(), vol 10415. Springer, Cham. https://doi.org/10.1007/978-3-319-64206-2_6
DOI: https://doi.org/10.1007/978-3-319-64206-2_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64205-5
Online ISBN: 978-3-319-64206-2
eBook Packages: Computer Science (R0)