
1 Introduction

The standard NLP task of morphological segmentation, i.e. dividing words into sequences of the smallest possible meaning-bearing units called morphemesFootnote 1, has recently seen a fair share of renewed interest, as witnessed by recent developments in morphological analysis of underresourced languages and/or multilingual morphological analysis (e.g. [1] or [9]). The state-of-the-art methods of morphological segmentation are, however, usually based on neural networks [1], and hence are neither easily generalizable to languages with fewer resources nor straightforwardly usable for further morphological analysis. As there is a push for the creation and unification of morphologically annotated resources [3, 26], and as adding morphological and syntactic information seems to improve the quality of machine translation for morphologically rich and underresourced languages [8], a need arises for language-independent, low-resource methods of morph classification.

Up to now, relatively little attention has been paid to this task. Classification of morphs aims to classify the individual morphs of already segmented words, with the possible granularity of the classification ranging from the simple binary distinction between free and bound morphemes to, e.g., the very complex and fine-grained Leipzig glossing rules [7]. To the best of our knowledge, there has been no recent attempt to tackle automated classification of morphs of the Czech language apart from [5], where the DeriNet derivational lexicon is used for root morph identification; however, this is only mentioned in passing, and the results are neither evaluated nor discussed. Furthermore, to date, there is only a very limited amount of reasonably high-quality morphologically annotated Czech data available (apart from the UniMorph project [3], whose unsuitability for our purposes we discuss in the next section).

In this paper, we report on a pilot experiment concerned with the classification of morphs of the Czech language. Firstly, we show that root and non-root morphs of pre-segmented words can be distinguished surprisingly well using simple quantitative approaches and a small corpus of other pre-segmented words; we also propose classification methods that exploit existing Czech derivational resources. Secondly, we report on work in progress in which we use the root morph identification methods as a basis for complete morph classification of pre-segmented words, exploiting the available Czech derivational and morphological resources.

Note that the proposed methods differ both in their generality and in their demands on resources. In particular, some of them use two additional resources: the Czech derivational network DeriNet [24, 25] and MorfFlex [12] (for lemmatization and part-of-speech tagging). Nevertheless, as similarly constructed derivational networks are, according to [5], available for at least 11 languages, and as lemmatization and part-of-speech tagging are among the best-explored topics in NLP, we feel confident that these additional requirements are not so stringent as to make adaptations or generalizations of our approach to other languages too costly.

2 Related Work

2.1 Terminology

In the following text, we use the terminology described in [26]. A morpheme is the smallest (in the sense of non-subdivisible) sequence of graphemes associated with a definite meaning. In individual words, morphemes are present in particular forms – morphs [13]. The morphs can be further characterized. We can distinguish free morphemes, which can be used as separate words, and bound morphemes, which cannotFootnote 2. Alternatively, morphemes are either lexical (with a more or less general lexical meaning) or grammatical (with inflectional meaning).

Based on these two distinctions, we distinguish root morphemes (free lexical morphemes, e.g. kůň, plav), derivational affixes (bound lexical morphemes, e.g. pro-, -tel), inflectional affixes (bound grammatical morphemes, e.g. nej-, -ý) and function words (free grammatical morphemes, e.g. s, a). According to their position relative to the root morpheme (or morph), we may divide affixes into prefixes preceding the root morph, interfixes between root morphs, and suffixes following the root. We could also take into account postfixes, i.e. morphemes that appear after inflectional suffixes.

2.2 Data Resources

There are several kinds of relevant data resources. First, there are morphological lexicons. For Czech, there are two main lexicons, unfortunately not available in machine-tractable form: Retrográdní morfematický slovník češtiny [21] and Bázový morfematický slovník češtiny [20]. Furthermore, the Czech language is included in the UniMorph project [3]. Nevertheless, the Czech data in UniMorph, automatically extracted from MorfFlex [12], are segmented in a way that is incompatible with our proposed morph classification: the root in the UniMorph segmentation actually seems to be the lemma or the root of the derivational tree (often including non-inflectional prefixes). As for morphological segmentation, there are 38 000 manually segmented Czech words in the data used for the SIGMORPHON 2022 shared task [1].

In addition, several of the available derivational resources already contain at least some kind of morphological segmentation and classification. A survey of this type of resource can be found in [26]. A good example of the kind of morphological resource we have in mind is the manually created CroDeriV lexicon [22]. It contains over 14 000 Croatian words (all verbs, except for two nouns) segmented into morphs. The morphs are classified as prefixes, stems, suffixes or endings.

The granularity of the classification included in the data differs: the Dictionary of Morphemes of the Russian Language [16] contains over 74 000 segmented Russian lemmas with labeled root morphs (but not affixes). Furthermore, for many languages, either no such resources exist or, as they are created automatically or semi-automatically, they are not straightforwardly usable as a source of gold data, as in the case of the German morphological derivational lexicon Derivbase [27] or the recent multilingual derivational and inflectional database MorphyNet [2]. The latter also includes Czech data; these are, however, incompletely and quite often inaccurately segmented. Therefore, however useful this resource might prove to be for practical purposes, we are reluctant to employ it in our pilot experiment (especially as gold data).

The Czech derivational network DeriNet [25] contains a rough morphological segmentation (more precisely, root morphs are labeled for 250 000 of the lemmas). The methods by which the root morphs were labeled are nevertheless very similar to some of the methods we try (and therefore cannot be used as gold data); furthermore, the relevant articles [5, 25] do not mention the final accuracy of the method.

2.3 Morphological Segmentation and Classification

Some of the classical approaches to morphological analysis either already include, or could be straightforwardly extended to include, the classification of morphs. Thus Goldsmith’s unsupervised morphological segmentation [10] uses minimum description length and several simple heuristics to generate candidate stems and suffixes. Unsupervised morphology induction methods, such as that of Schone and Jurafsky [19] or, more recently, the word-embedding-based induction proposed by Soricut and Och [23], use automatically extracted affixes for morphological rule induction.

For languages for which well-annotated and sufficiently large resources of the abovementioned type are available, supervised machine learning can be used for both tasks: morphological segmentation and classification of morphs. Recently, Bolshakova and Sapin [6] employed a neural model for morphological segmentation and classification of Russian, achieving over 90% word-level classification accuracy.

In comparison, the state-of-the-art results for the Czech language are not so promising. The Czech derivational network DeriNet has been used for morphological segmentation (and partial classification) [5], achieving a word-level segmentation accuracy of 58.9% (the number is only illustrative, as the accuracy of root morph recognition was not measured). This might be caused by the lack of available relevant Czech data: there are almost no data available for morph classification and, until recently [1, 26], there were almost none for segmentation either.

3 Data

We use four data resources in total. First, we use a small set of fully manually segmented and annotated words (316 words in total), which we further subdivide into a dev set and a test set (each containing 158 words).Footnote 3 The morphs are annotated by their type, similarly to CroDeriV [22]. The classes are as follows (an illustrative example is given after the list):

  • R - root morphs,

  • P - derivational prefixes,

  • S - derivational suffixes,

  • I - inflectional affixes,

  • N - interfixes,

  • O - postfixes.
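For illustration, one of the test-set words discussed later in the error analysis, uklidnili, is segmented as u-klid-n-i-l-i and carries the signature PRSSSI. The following is only a minimal sketch of how such an annotated entry might be represented in memory; the actual data format may differ.

```python
# One annotated entry, shown only as an illustrative in-memory representation;
# the real annotation files may use a different layout.
entry = {
    "word": "uklidnili",
    "morphs": ["u", "klid", "n", "i", "l", "i"],
    # P = derivational prefix, R = root, S = derivational suffix, I = inflectional affix
    "classes": ["P", "R", "S", "S", "S", "I"],   # i.e. the signature PRSSSI
}
```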

Secondly, mainly as “training data” (for feature extraction of morphs), we use 10 438 manually segmented Czech non-compound words with manually selected root morphs. Thirdly, in some of our experiments, we use the DeriNet Czech derivational lexicon, which contains over 1M Czech lexemes connected by over 800 000 derivational relations. Fourthly, we use the MorfFlex lexicon [12], which contains 125 348 899 simple lemma - tag - form triples. The tags, as described in [11], are very fine-grained and contain morphological as well as syntactic information.

4 Evaluation

There are two possible levels of evaluation: word-level evaluation and morph-level evaluation. The word-level evaluation measures are less fine-grained, but they offer some desirable properties (e.g. the instances might be weighted by the number of occurrences of a given word in the corpus, or left unweighted, giving each word equal weight). To see the difference between the two levels, imagine two extreme scenarios: A) half of the words are annotated completely right and the other half completely wrong; B) in every word, half of the morphs are annotated right and the other half wrong. The unweighted word-level accuracy would be 50% in scenario A and 0% in scenario B, while the morph-level accuracy would be 50% in both cases.

For both of our experiments, we have selected five simple evaluation measures: word-level accuracy and morph-level precision, recall, F-measure and accuracy. It should be noted, however, that since we use a very small test set, the word-level accuracy, while included for illustration and completeness, should be regarded with some caution.
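To make the two evaluation levels concrete, the following minimal sketch (our own illustration, not the evaluation code we actually used) computes word-level and morph-level accuracy and reproduces the two extreme scenarios described above, assuming gold and predicted labels are given as parallel lists of per-word label sequences.

```python
def word_level_accuracy(gold, pred):
    """Fraction of words whose whole label sequence is predicted correctly.

    `gold` and `pred` are parallel lists; each element is the list of morph
    labels of one word.
    """
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def morph_level_accuracy(gold, pred):
    """Fraction of individual morph labels predicted correctly."""
    pairs = [(g, p) for gw, pw in zip(gold, pred) for g, p in zip(gw, pw)]
    return sum(g == p for g, p in pairs) / len(pairs)

# The two extreme scenarios from the text (two words, two morphs each):
gold       = [["R", "S"], ["R", "S"]]
scenario_a = [["R", "S"], ["S", "R"]]   # one word fully right, one fully wrong
scenario_b = [["R", "R"], ["R", "R"]]   # half of the morphs right in every word
assert word_level_accuracy(gold, scenario_a) == 0.5
assert word_level_accuracy(gold, scenario_b) == 0.0
assert morph_level_accuracy(gold, scenario_a) == 0.5
assert morph_level_accuracy(gold, scenario_b) == 0.5
```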

5 Experiment 1: Root Morph Selection

5.1 Methods

In our first experiment, our goal is to identify the root morph of a word. We started with three baseline heuristics (r1-r3). As a zeroth baseline (r0), we also tried labeling every morph as an affix (since affixes are more common than roots). For all the following heuristics, we compute the morph features on the test set together with the 10k-word training set, without using the root morph annotation.

First (r1), we select all the longest morphs of each segmented word. Secondly (r2), we select the morph with the fewest occurrences in our dictionary, based on the intuition that, since compounds are infrequent in Czech, root morphs usually combine with (a limited number of) affixes, while affixes combine with (a large number of) root morphs. This method was used in the annotation of DeriNet [25], with the first one as the tiebreaker. Thirdly (r3), we estimate the left and right conditional entropy of each morph and select the morph with the smallest difference between the two. This was motivated by the observation that roots usually appear between two affixes or at the beginning/end of the word, while affixes usually appear only on one side of the root morph, with either the beginning/end of the word or another affix on the other side; thus, one would expect the difference between the left and right entropy to be quite high for affixes and quite low for root morphs. Finally (r4), we combine all three heuristics: we normalize them (so that all three sum to 1) and minimize their unweighted sum; instead of the length we use 1/length (so that we may minimize it).
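A rough sketch of how the three heuristics could be combined into r4 is given below; the data structures, the per-word normalization and all names are our own simplifications, not the original implementation.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a dict mapping neighbouring morphs to their counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def pick_root(morphs, freq, left_ctx, right_ctx):
    """Index of the most root-like morph: minimize the sum of the normalized features.

    freq[m]       ... number of occurrences of morph m in the segmented dictionary (r2)
    left_ctx[m]   ... counts of morphs seen immediately to the left of m
    right_ctx[m]  ... counts of morphs seen immediately to the right of m
    """
    feats = []
    for m in morphs:
        feats.append((
            1.0 / len(m),                                        # r1: prefer long morphs
            freq[m],                                             # r2: prefer rare morphs
            abs(entropy(left_ctx[m]) - entropy(right_ctx[m])),   # r3: balanced entropy
        ))
    # r4: normalize each feature over the word (so it sums to 1) and minimize the sum
    sums = [sum(f[j] for f in feats) or 1.0 for j in range(3)]
    scores = [sum(f[j] / sums[j] for j in range(3)) for f in feats]
    return scores.index(min(scores))
```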

Further, we experimented with methods based on the derivational network DeriNet [25]. First, for each word, we find the unmotivated lemma (or “root lemma”) of its derivational treeFootnote 4 in DeriNet and all its children, compute the edit distance between each morph and these words, and select the morph with the shortest edit distance. We use this either by itself (r5) or in combination with the previous three heuristics (r6), in the same way as in r4. Lastly (r7), instead of taking into account only the unmotivated lemma of the current word’s derivation tree and its children, we compute the longest common substring of all descendants of the unmotivated lemma (allowing any character to be replaced by a wildcard to, very roughly, deal with possible allomorphy; thus, e.g., the common substring of “sit” and “sat” would be “s?t”) and then apply r6.
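The wildcard-based common substring can be sketched as follows; this is only an illustration of the idea (for simplicity, at most one extra wildcard is introduced per candidate substring), not the implementation actually used.

```python
import re

def wildcard_lcs(a, b):
    """Longest common substring of a and b, where '?' may stand for any single character."""
    best = ""
    for i in range(len(a)):
        for j in range(i + 1, len(a) + 1):
            if j - i <= len(best):
                continue
            candidates = [a[i:j]]
            # optionally blank out one position to allow for allomorphy
            candidates += [a[i:k] + "?" + a[k + 1:j] for k in range(i, j)]
            for cand in candidates:
                pattern = re.escape(cand).replace(r"\?", ".")
                if re.search(pattern, b):
                    best = cand
                    break
    return best

# folding pairwise over several lemmas is one possible way to extend this to
# all descendants of the unmotivated lemma, e.g.:
#   from functools import reduce; pattern = reduce(wildcard_lcs, descendant_lemmas)
print(wildcard_lcs("sit", "sat"))   # -> "s?t", as in the example above
```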

For all of the DeriNet-based methods (r5–r7), if the processed word is not found in DeriNet, we fall back to the r4 method. For comparison with a supervised approach, we also trained the CRF tagger implemented in NLTK [4] on the training data (with annotated root morphs); that is, we treat the segmented words as sentences and the morphs as tagged words.
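For the supervised comparison, the NLTK CRF tagger can be used roughly as follows. This is a sketch with toy training entries built from the morphs mentioned in Sect. 2.1, not our actual training data; NLTK's CRFTagger additionally requires the python-crfsuite package, and the model file name is arbitrary.

```python
from nltk.tag import CRFTagger

# Each segmented word is treated as a "sentence" whose tokens are morphs;
# for root identification, the tags are simply R (root) vs A (non-root).
train_data = [
    [("pro", "A"), ("plav", "R")],   # toy entry, not a real training example
    [("plav", "R"), ("tel", "A")],   # toy entry, not a real training example
]

tagger = CRFTagger()
tagger.train(train_data, "root_tagger.crf.model")     # file name chosen arbitrarily
print(tagger.tag(["u", "klid", "n", "i", "l", "i"]))  # tag one segmented test word
```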

5.2 Results

As announced in the introduction, the simple quantitative methods gave surprisingly good results (see Table 1). Every method apart from simply taking the longest morphs (r1) achieved higher precision than the CRF tagger. As all of the methods (again apart from r1) were restricted to selecting exactly one root morph, the achieved F-measure is also surprisingly good (94.4%). We can also note that even the best of the fully unsupervised methods (i.e. those not using DeriNet) achieves an F-measure comparable with the supervised CRF tagger.

Table 1. Evaluation of root morph identification.

6 Experiment 2: Morph Classification

6.1 Methods

In our second experiment, we try to expand our root morph recognition methods into a fully-fledged morphological classifier. The most important part of this is the distinction between root morphs, derivational affixes and inflectional affixes.

6.2 Baselines

We have implemented two baseline morph classifiers, a supervised and an unsupervised one. As our first, supervised baseline (Baseline 1), we assign to each morph the tag most commonly associated with it in our development set; if the morph is not present there, we label it as a root morph (the most frequent label in the dev set). As our second, unsupervised baseline (Baseline 2), we designed two versions of a simple unsupervised heuristics-based classifier. First, we decide for each morph in a given word whether it could be a derivational affix, an inflectional affix or a root morph (by the heuristics described in the following subsections). In the second iteration, we assign to each of the positions the tag that is most common for the morph in the processed data. In the B version, we consider only the first and last morphs of the word as possible inflectional morphs.Footnote 5
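Baseline 1 amounts to a simple lookup table built from the dev set. A minimal sketch follows; the function names and data layout are our own, and the single training word is the annotated example shown in Sect. 3.

```python
from collections import Counter, defaultdict

def train_baseline1(dev_words):
    """Build the lookup table; dev_words is a list of lists of (morph, tag) pairs."""
    counts = defaultdict(Counter)
    for word in dev_words:
        for morph, tag in word:
            counts[morph][tag] += 1
    return {morph: tags.most_common(1)[0][0] for morph, tags in counts.items()}

def tag_baseline1(morphs, table):
    # morphs unseen in the dev set default to "R", the most frequent label there
    return [table.get(m, "R") for m in morphs]

table = train_baseline1([[("u", "P"), ("klid", "R"), ("n", "S"),
                          ("i", "S"), ("l", "S"), ("i", "I")]])
print(tag_baseline1(["klid", "xyz"], table))   # -> ['R', 'R']
```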

Derivational Affixes. Using DeriNet, we take the morphs in which the segmented word fully differs from the root lemma; i.e., we compute the minimum edit distance between the word and the root lemma and label the characters that would be added or rewritten during the minimal edit. We regard all morphs that consist only of such characters as derivational morphs. This heuristic is very rough in that most Czech lemmas contain an inflectional affix, which could in this way also get classified as a derivational affix.
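A rough stand-in for this character-labelling step, using difflib's sequence alignment instead of an explicit minimum-edit-distance computation (so only an approximation of the heuristic described above), might look as follows.

```python
from difflib import SequenceMatcher

def derivational_candidates(morphs, root_lemma):
    """Return True for morphs none of whose characters align with the root lemma."""
    word = "".join(morphs)
    kept = [False] * len(word)
    for block in SequenceMatcher(None, word, root_lemma).get_matching_blocks():
        for k in range(block.a, block.a + block.size):
            kept[k] = True          # character survives the "edit" into the root lemma
    flags, pos = [], 0
    for m in morphs:
        flags.append(not any(kept[pos:pos + len(m)]))
        pos += len(m)
    return flags
```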

Inflectional Affixes. For each tag in MorfFlex [12], we take a thousand word forms corresponding to that tag; in addition, for every word form present in the data, we take all its other word forms. The inflectional affixes are those that are (more or less) common for common tags, but differ across the different forms. Namely, we try to extract an ending common to all the words tagged by the same tag; if that fails, we consider every ending common to at least one-fifth of the examples. If even that fails, we try to find the longest uncommon substring (including wildcard characters) of all the forms of the segmented word.
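The ending-extraction part of this heuristic can be sketched as follows; the maximum ending length and the exact fallback behaviour are our own assumptions, and the example forms are merely a handful of dative-plural nouns (in reality the forms would be grouped by MorfFlex tag).

```python
import os
from collections import Counter

def common_ending(forms):
    """Longest ending shared by all given word forms ('' if there is none)."""
    return os.path.commonprefix([f[::-1] for f in forms])[::-1]

def frequent_endings(forms, max_len=4, min_share=0.2):
    """Endings (up to max_len chars) shared by at least min_share of the forms."""
    counts = Counter(f[-k:] for f in forms for k in range(1, max_len + 1) if len(f) > k)
    return sorted(e for e, c in counts.items() if c / len(forms) >= min_share)

print(common_ending(["kostem", "ženám", "plavcům"]))                           # -> "m"
print(frequent_endings(["ženám", "rybám", "kostem", "mořím"], min_share=0.5))  # -> ['m', 'ám']
```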

Roots. For root morph recognition we use the DeriNet-based r7 method from the previous section.

6.3 Finetuning CRF Taggers

To exploit the transition probabilities between adjacent morph tags, we again (as in the first experiment) use a CRF-based tagger, this time the bidirectional LSTM-CRF tagger described in [14]. The CRF tagger used in NLTK does not permit finetuning, so we used the implementation from the bi-lstm-crf Python package [15]. First, we trained it on the 10k training set with the manually annotated root morphs. Then, we finetuned it on the small development set (which contains 158 annotated words). This approach (Semi-supervised CRF), however, presupposes rather large data with annotated root morphs. Since we have developed methods for root morph recognition, we can use them to create such training data.

In our second CRF tagger-based experiment (Supervised CRF), we used the large training data stripped of the manual annotation and annotated automatically using our most successful DeriNet-based method (r7). Thirdly, we trained the CRF tagger only on the dev set (Small CRF). Finally, we pre-trained the CRF tagger on the training data annotated by both of our baseline solutions (and again finetuned it on the dev set).

6.4 Evaluation and Results

Table 2. Evaluation of full classification of morphs.

In the second experiment, we evaluate only word-level and morph-level accuracy (Table 2). The results were somewhat surprising. First of all, no version of the CRF tagger was better than one of our unsupervised baselines, which achieved 88% morph-level accuracy. Secondly, while both pretraining the CRF tagger on root identification and pretraining it on data annotated by the baseline methods seem to have a significant impact on its accuracy, there is no clear correspondence between the overall quality of the pretraining data and the overall quality of the CRF tagger results. Thirdly, there is a big difference in accuracy between the two versions of our unsupervised baseline.

Table 3. Example output (Baseline 2B)

Sometimes, the taggers make mistakes that could be fairly easily filtered out (though not always so easily corrected), such as a suffix before a root, a word without a root (see the last example in Table 3), or a sequence like “Root - Prefix - Suffix”. Introducing simple rules might therefore significantly increase the final accuracy. One such error type accounts for the large difference between the accuracy of the two versions of Baseline 2. A closer look at the data reveals that most of the errors of version A of the unsupervised baseline consisted in misidentifying derivational suffixes as inflectional, which can easily be filtered out by restricting the position of the inflectional affix, as done in the B version; the real accuracy of the A version might, however, be higher, as in some cases the identification of derivational suffixes in the manually annotated test set is disputable (e.g. u klid n i l i is assigned the signature PRSSSI in the test set and PRSIII by the baseline solution; but the suffix -l, expressing past tense, might be said to be inflectional).

Table 4. Errors in morph classification (except rare categories, i.e. infixes and postfixes)

7 Conclusion

We have shown that applying simple quantitative methods to comparatively small and/or unannotated segmented data is sufficient for high-quality root morph identification in Czech, and that these results can be further improved by exploiting the DeriNet derivational lexicon. In our second experiment, we used our root morph identification methods to create training data and to train an LSTM-CRF tagger. It appears that the quality of the output can be increased by pre-training the tagger on root morph identification or on morph data classified by a good-enough baseline solution. Furthermore, the simple supervised baseline was as good as the CRF taggers, while one of the unsupervised baselines was significantly more accurate.

In the future, we would like to make better use of Czech resources like MorfFlex and DeriNet for further morphological analysis (e.g. derivational affixes would appear in many derivational trees in DeriNet but only on the lower levels, while root morphs would appear in only a few trees but on all levels; most of the inflectional affixes would not appear at all, or at no specific level). Also, the morphological tags present in MorfFlex might be useful: a given tag would probably strongly predict the presence of the corresponding endings (as opposed to derivational affixes and root morphs). These observations could then be used either for designing specific tagging methods or for enlarging the training data for machine-learning-based taggers.

Secondly, we would like to extend our approaches to a multilingual setting. We would especially like to use the Universal features included in Universal Dependencies [18]; these could also be used for more fine-grained morphological analysis in the future. Moreover, there are many multilingual derivational resources [17] that could be used for the classification of morphs in a way similar to DeriNet.