
1 Introduction

We present work on unsupervised segmentation for languages with complex morphology. Our ultimate research question is to what extent morphological structure can be induced without supervision from a large body of unannotated text (a corpus). This has implications for the question of whether the morphological system is somehow “inherently encoded” in language, as represented by the corpus.

The paper is organized as follows: we state the morphology induction problem (Sect. 2), review prior work (Sect. 3), present our models (Sects. 4 and 5), present the evaluation and experiments (Sect. 6), and finally discuss future work.

2 Morphological Description

We focus on highly-inflecting languages. In the experiments, we use Finnish and Turkish, which belong to different language families (Uralic and Turkic) and exhibit different morphological phenomena.

Finnish is considered to be agglutinative, although it has some fusional phenomena—morpho-phonological alternation. It has complex derivation and inflection, and productive compounding. Derivation and inflection are achieved via suffixation; prefixes do not exist. Multiple stems in a compound may be inflected, e.g., kunnossapidon = “of keeping (smth.) in usable condition”:
kunno +ssa # pido +n    (kunto-Iness. # pito-Gen.)

where # is a compound boundary, + is a morpheme boundary, Iness. marks inessive case (presence in a location, or being in a state), and Gen. marks the genitive. The morph kunno is a “weak” allomorph of the stem kunto; the weakening is conditioned by the closed syllable environment, i.e., the following double consonant -ss-.

Turkish is similar to Finnish: agglutination, no prefixation, minimal compounding.

Our ultimate goal is to model further aspects of morphology, including classification of morphemes into morphological categories and capturing allomorphy—morpho-phonological alternation. However, at present, as in most prior work, we address the problem of segmentation only, to try to establish a solid baseline. Once we have a good way of segmenting words into morphs, we plan to move on to modeling the more complex morphological phenomena.

3 Prior Work

Interest in unsupervised morphology induction has surged since 2000. Detailed surveys are found, e.g., in [15, 19], and in the proceedings of the series of “Morpho-Challenge” workshops held between 2005 and 2010 [17, 18, 20]. Our approach is founded on the Minimum Description Length principle (MDL) as a measure of model quality, see e.g., [14]. Linguistica [11] also uses MDL, combined with the idea of a signature—a set of affixes that belong to a morphological paradigm; e.g., the suffixes for a certain class of nouns form one signature. Our models also group morphs into different morphological classes, though using a different approach (Sect. 4).

Finite-state machines (FSMs) similar to ours are used in [12]; however, that approach is less general, since it relies on heuristics unsuitable for languages with very rich morphology.

The MDL-based Morfessor and its variants, e.g., [3, 5], are closely related to our work. Unlike [4], we do not posit morphological classes a priori, but allow the model to learn the classes and the distributions of morphs automatically. Word embeddings, which model semantic relations between words, have been combined with orthographic features of words to learn the morphology of languages as “morphological chains” in [21].

Evaluation of morphological learning is a complex challenge in itself [23, 26]; prior work on this topic is described in detail in Sect. 6.

The Morpho-Challenges have seen attempts to model aspects of morphology which we do not address: using analogical reasoning, handling ablaut-type morphology (as in German and the Semitic languages), etc. Beyond segmentation, modeling allomorphy has been attempted, e.g., by [24], but the segmentation performance of the proposed algorithms so far falls short of that of algorithms which do not model allomorphy. Research on induction of morphological structure is driven in part by the observation that children learn it at a very early age, which makes acquisition by machine a fascinating challenge.

4 The Learning Algorithm: SMorph

Morphological systems of many languages can be modeled by a finite-state machine (FSM), where each state corresponds to a morphological class. The seminal approach of Two-Level Morphology [16] represents a morphological grammar by an FSM. We can associate a state with, e.g., a set of verb stems that belong to a certain inflectional paradigm, or the set of suffixes in a certain noun paradigm, etc. The edges of the FSM define the morphotactics of the language—the permissible ordering among the states.

The data \(\mathcal{D}\) (the corpus) is a large list of words in the given language. For every word \(w \in \mathcal{D}\), SMorph tries to learn the most probable sequence of states that generates w: it treats the problem as finding a model that produces, for each word w, the most probable segmentation into “morphs”—fragments of w—and classification—an assignment of the morphs to classes (classes are identified with the states).

The learning algorithm searches for the best model in a certain model class. Thus, the full description of the algorithm must specify: (a) the model class (Sect. 4.1); (b) the objective function—a way of assigning a score, the cost, to each model (Sect. 4.3); and (c) a search strategy for optimizing the objective across the model class (Sect. 4.2).

4.1 Morphology Models

We begin with a hidden Markov model (HMM) with a set of hidden states/classes \(\{{C_i}\}\). States emit morphs with certain emission probabilities, and transitions between states occur with certain transition probabilities. To code each word w, the model starts at the initial state \(C_0\). From state \(C_i\), it can transition to another state \(C_j\) with probability \(P_{tr}(C_j|C_i)\) and emit a morph—a segment of w—from \(C_j\). The probability of emitting a morph \(\mu \) from state \(C_j\) is denoted by \(P_{em}(\mu |C_j)\). Once the entire w has been emitted, the model transitions to the final state \(C_F\) and emits a special end-of-word symbol #.
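
To make the generative view concrete, here is a minimal sketch (our illustration, not the authors' code) of how one segmented and labeled word would be scored under such an HMM; the dictionaries P_tr and P_em and the index of the final state are assumed inputs.

```python
import math

def word_code_length(segmentation, P_tr, P_em, final_state):
    """Code length, in bits, of one word given its segmentation.

    `segmentation` is a list of (state, morph) pairs, e.g.
    [(3, "kunno"), (7, "ssa"), (3, "pido"), (7, "n")];
    P_tr is keyed by (from_state, to_state), P_em by (state, morph);
    state 0 plays the role of the initial state C_0.
    """
    bits = 0.0
    prev = 0                                       # start in C_0
    for state, morph in segmentation:
        bits += -math.log2(P_tr[(prev, state)])    # transition C_prev -> C_state
        bits += -math.log2(P_em[(state, morph)])   # emit the morph from C_state
        prev = state
    bits += -math.log2(P_tr[(prev, final_state)])  # transition to C_F, which emits '#'
    return bits
```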

Ideally, the states will correspond to “true” morphological classes; for example, nouns may fall into different declension paradigms and each paradigm would be assigned its own class/state in the model.

Probabilities \(P_{tr}\) and \(P_{em}\) are determined by counting how words are segmented into morphs, and which states emit which morphs. Many segmentations are possible for the given corpus. We need a way to choose the best segmentation for \(\mathcal{D}\)—the best parameters for the model. To approach this model selection problem via MDL we define a code-length for the data \(\mathcal{D}\), which is the number of bits required to encode it.

4.2 Search

(Pseudocode of the search procedure; figure not reproduced here.)

Convergence is determined by the MDL code-length of the complete model (cost), defined in Sect. 4.3. We fix the number of classes K, and begin with a random segmentation and classification (assignment of morphs to classes). We then greedily re-segment each word \(w \in \mathcal D\), minimizing the MDL code-length of the model plus the data.
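
The published pseudocode is not reproduced above, so the following skeleton is our reconstruction of the loop just described; the `model` object with remove_word/add_word methods and the callables `resegment` (the DP step of Sect. 4.6) and `total_cost` (the MDL cost of Sect. 4.3) are assumed interfaces introduced only for illustration.

```python
def search(corpus, model, resegment, total_cost, max_iters=100, tol=1e-3):
    """Greedy re-segmentation loop: iterate until the MDL cost stops improving."""
    prev = total_cost(model)
    for _ in range(max_iters):
        for w in corpus:
            model.remove_word(w)                    # take w's counts out of the model
            model.add_word(w, resegment(w, model))  # re-segment w and put it back
        cost = total_cost(model)
        if prev - cost < tol:                       # convergence on the total cost
            break
        prev = cost
    return model
```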

4.3 The Objective: Two-Part MDL Cost

Finding the best segmentation and classification can be viewed as the problem of compressing \(\mathcal D\). In MDL two-part coding, we minimize the total code-length: the cost of the model plus the cost of the data given the model. In our case, this means summing the costs of coding the Lexicon, the transitions and the emissions:

$$\begin{aligned} L(\mathcal{D}) = L(M) + L(\mathcal{D} \mid M) = L(Lex) + L(Tr) + L(Em) \end{aligned}$$

The cost of the Lexicon, L(Lex), is the number of bits needed to encode each class, morph by morph, irrespective of the order of the morphs within the class:

$$\begin{aligned} L(Lex) = \sum _{i=1}^K \Bigg [ \sum _{\mu \in C_i} L(\mu ) - \log {|C_i|!} \Bigg ] \end{aligned}$$
(1)

where K is the number of classes, \(\mu \) ranges over all morphs in class \(C_i\), \(L(\mu )\) is the code length of a morph \(\mu \) (Eq. 2), and \(|C_i|\) is the number of morphs in class \(C_i\). The term \(-\log {|C_i|!}\) accounts for the fact that \(C_i\) is a set, so we do not need to code the morphs in \(C_i\) in any particular order.

The code length \(L(\mu )\) of morph \(\mu \) is computed similarly to [5], as the number of bits needed to encode \(\mu \):

$$\begin{aligned} L(\mu ) = (|\mu |+1) \cdot \log (|\varSigma |+1) \end{aligned}$$
(2)

where \(|\mu |\) is the number of symbols in \(\mu \), and \(|\varSigma |\) is the size of the alphabet; one is added to \(|\varSigma |\) to account for a special morph-boundary symbol.
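
Equations 1 and 2 translate directly into code. The sketch below is ours, with log base 2 assumed for bit counts; it takes the classes as a list of sets of morph strings:

```python
import math

def morph_code_length(morph, alphabet_size):
    # Eq. (2): |mu| + 1 symbols, each over the alphabet plus one boundary symbol
    return (len(morph) + 1) * math.log2(alphabet_size + 1)

def lexicon_cost(classes, alphabet_size):
    # Eq. (1): per-class sum of morph code lengths, minus log|C_i|! because
    # the order of the morphs within a class carries no information
    ln2 = math.log(2)
    total = 0.0
    for morphs in classes:                 # `classes` is a list of sets of morphs
        total += sum(morph_code_length(m, alphabet_size) for m in morphs)
        total -= math.lgamma(len(morphs) + 1) / ln2    # log2(|C_i|!)
    return total
```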

Transitions: Given the lexicon, we code the paths of class transitions from \(C_0\) to \(C_F\), from word start to word finish, using Bayesian Marginal Likelihood (ML), as introduced in [9]. In ML, we treat each transition \((C_iC_j)\) in the data as an “event” to be coded. If in a set of events \(\mathcal{E}=\{E_j\}\), each \(E_j\) has a corresponding “prior” count \(\alpha _j\) and count of observed occurrences \(O_j\), the cost of coding \(\mathcal{E}\) is:

$$\begin{aligned} L(\mathcal{E}) = - \sum _j {\log \varGamma (O_j +\alpha _j)} + \sum _j {\log \varGamma (\alpha _j)} + \log \varGamma \Big ( \sum _j (O_j + \alpha _j) \Big ) - \log \varGamma \Big ( \sum _j \alpha _j \Big ) \end{aligned}$$
(3)

where the summations range over the set \(\mathcal{E}\). We use uniform priors, \(\alpha _j=1\) for all j; thus \(\log \varGamma (\alpha _j)=0\), and the second term in this equation is always zero.
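
Equation 3 with these priors can be transcribed directly; the helper below (our sketch, in bits) is applied to one set of events at a time:

```python
import math

def ml_cost(counts, alphas=None):
    """Bayesian Marginal Likelihood code length (Eq. 3) of a set of events;
    `counts` are the observed counts O_j, `alphas` the prior counts (default 1.0)."""
    if alphas is None:
        alphas = [1.0] * len(counts)       # uniform priors: the second term vanishes
    ln2 = math.log(2)
    cost = -sum(math.lgamma(o + a) for o, a in zip(counts, alphas)) / ln2
    cost += sum(math.lgamma(a) for a in alphas) / ln2
    cost += math.lgamma(sum(o + a for o, a in zip(counts, alphas))) / ln2
    cost -= math.lgamma(sum(alphas)) / ln2
    return cost
```

Applying this quantity to the transition counts out of each class, and to the emission counts of each class, and summing over classes yields L(Tr) and L(Em) below.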

To compute the cost of all transitions L(Tr), we apply Eq. 3 to each class \(C_i\), as i ranges from 0 to K; the set of “events” \(\mathcal{E}\) is the set of all classes \(C_j\), which are the targets of transitions from \(C_i\):

$$\begin{aligned} L(Tr) = \sum _{i=0}^K \Bigg [ - \sum _{j=1}^{K+1} \log \varGamma \Big ( f(C_iC_j) + 1 \Big ) + \log \varGamma \left( \sum _{j=1}^{K+1} \Big [ f(C_iC_j)+1 \Big ] \right) - \log \varGamma \left( K \right) \Bigg ] \end{aligned}$$

Emissions: We code the emissions L(Em) analogously:

$$\begin{aligned} L(Em) = \sum _{i=1}^K \Bigg [ - \sum _{\mu \in C_i} \log \varGamma \Big ( f(\mu ,C_i) +1 \Big ) + \log \varGamma \left( \sum _{\mu \in C_i} \Big [ f(\mu ,C_i)+1 \Big ] \right) - \log \varGamma \left( |C_i| \right) \Bigg ] \end{aligned}$$

where \(f(C_iC_j)\) is the count of transitions from \(C_i\) to \(C_j\), and \(f(\mu ,C)\) is the number of times morph \(\mu \) was emitted from class C. The cost L(Em) is computed by applying Eq. 3, where the set of “events” \(\mathcal{E}\) is now the set of emissions \((\mu ,C_i)\) of all morphs \(\mu \) from \(C_i\).

4.4 Input

For each language, we pre-process a large text by collecting its distinct words. We do not model the text—i.e., the distribution of word tokens in the text—but the language, i.e., the set of observed distinct words, as is done in recent prior work. In this paper, corpus refers to this list of distinct words.

4.5 Initialization

The initial segmentation and classification are obtained by randomly placing morph boundaries in each input word w, independently, according to a Bernoulli distribution, and by randomly assigning each morph to one of the classes in the Lexicon.
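
A minimal sketch of this initialization (the boundary probability rho, reported later in the range 0.20–0.25, and the helper names are ours):

```python
import random

def random_segment(word, rho):
    """Place a morph boundary after each symbol independently with probability rho."""
    morphs, start = [], 0
    for i in range(1, len(word)):
        if random.random() < rho:
            morphs.append(word[start:i])
            start = i
    morphs.append(word[start:])
    return morphs

def random_initialization(corpus, K, rho=0.25):
    """Return one (class, morph) sequence per word, with classes drawn uniformly."""
    return [[(random.randrange(1, K + 1), m) for m in random_segment(word, rho)]
            for word in corpus]
```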

4.6 Dynamic Programming Re-segmentation

We compute the most probable segmentation of every word w in the corpus into morphs, at iteration t, given a set of transitions and emissions from iteration \(t-1\).

We apply a Viterbi-like dynamic programming (DP) search for every word w, to compute the most probable path through the HMM, given w, without using the segmentation of w at iteration \(t-1\). Standard Viterbi would only give us the best class assignment given a segmentation. Here, the search algorithm fills in the DP matrix starting from the leftmost column toward the right.

Notation: \(\sigma _a^b\) is the substring of w from position a to position b, inclusive; positions are numbered starting from 1. The shorthand \(\sigma ^b \equiv \sigma _1^b\) is the prefix of w up to b, and \(\sigma _a \equiv \sigma _a^n\) is a suffix, where \(|w|=n\). A single morph \(\mu _a^b\) lies between positions a and b in w. Note that \(\sigma _a^b\) is just a sub-string, and may contain several morphs, or cut across morph boundaries. In the cell (i, j) of the DP matrix, marked X, we compute the cost of the HMM being in state \(C_i\) and having emitted the prefix up to the j-th symbol of w, \(L(C_i|\sigma ^j)\). This cost is computed as the minimum of the following expression, using values computed previously and already available in the columns to the left of \({\sigma }^j\):

$$\begin{aligned} \begin{aligned} L(C_i|\sigma ^j) = \min _{q, b} \Big ( L(C_q | \sigma ^{b}) + L(C_i | C_q) + L(\mu _{b+1}^{j} | C_i) \Big ) \end{aligned} \end{aligned}$$
(4)

This says that the best way of getting to state \(C_i\) and emitting w up to \(\sigma ^j\) is to come from some state \(C_q\) having emitted w up to \(\sigma ^b\), then jump from \(C_q\) to \(C_i\), and emit \(\mu _{b+1}^{j}\) as a single morph from \(C_i\). \(L(C_i | C_q)\) is the cost of transition from state \(C_q\) to \(C_i\). This can be calculated using the Bayesian Marginal Likelihood code-length formula as:

$$\begin{aligned} \begin{aligned} L(C_i | C_q) = \varDelta L= L(Tr \cup t) - L(Tr) =- \log \frac{f(C_qC_i)+1}{\sum _{k=0}^{K-1}\sum _{j=1}^{K}\big (f(C_kC_j)+1\big )} \end{aligned} \end{aligned}$$

where t is the transition \(C_q \rightarrow C_i\). The cost of emitting morph \(\mu \) from class \(C_i\) is:

$$\begin{aligned} L(\mu |C_i)= {\left\{ \begin{array}{ll} - \log \frac{f(\mu , C_i)+1}{f(C_i) + |C_i|} \quad \text {if } \mu \in C_i \\ -\log \frac{|C_i|}{\big (f(C_i)+|C_i|\big )\big (f(C_i)+|C_i|+1\big )} + L(\mu )\quad \text {if } \mu \not \in C_i \end{array}\right. } \end{aligned}$$

The second case applies when \(\mu \) has not yet been emitted from \(C_i\) and does not exist in its lexicon; \(L(\mu )\) is then the cost of adding \(\mu \) to the lexicon.
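
For concreteness, the two incremental costs can be transcribed as follows (our sketch, log base 2; all counts are passed in explicitly rather than taken from a model object):

```python
import math

def transition_cost(f_qi, total_tr):
    """L(C_i | C_q): f_qi = f(C_q C_i); total_tr = sum over all state pairs
    of (f(C_k C_j) + 1), as in the denominator above."""
    return -math.log2((f_qi + 1) / total_tr)

def emission_cost(in_class, f_mu_ci, f_ci, size_ci, morph_bits):
    """L(mu | C_i): f_mu_ci = f(mu, C_i); f_ci = f(C_i), the total number of
    emissions from C_i; size_ci = |C_i|; morph_bits = L(mu) from Eq. 2."""
    if in_class:                            # mu is already in the lexicon of C_i
        return -math.log2((f_mu_ci + 1) / (f_ci + size_ci))
    p_new = size_ci / ((f_ci + size_ci) * (f_ci + size_ci + 1))
    return -math.log2(p_new) + morph_bits   # also pay to add mu to the lexicon
```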

In Eq. 4, the minimum is taken over all states \(q = 0,1,\ldots ,K\), including the initial state \(C_0\), and over all columns b that precede column j: \(b = j-1,\ldots ,0\). Here \(L(\mu _{b+1}^{j} | C_i)\) is the cost of emitting \(\mu _{b+1}^{j}\) from \(C_i\), for some \(b < j\). For the empty string, \(\sigma ^{0} \equiv \epsilon \), we set \(L(C_0 | \sigma ^{0}) \equiv 0\) for the initial state, and \(L(C_q | \sigma ^{0}) \equiv \infty \) for \(q \ne 0\).

The transition to the final state \(C_F\) is computed in the rightmost column of the matrix, marked #, using the transition from the last morph-emitting state—in column \(\sigma ^n\)—to \(C_F\). (State \(C_F\) always emits the word boundary #). Thus, the cost of the best path to generate w is:

$$\begin{aligned} L(w) = \min _{q=1,\ldots ,K} \Big ( L(C_q | \sigma ^{n}) + L(C_F | C_q) + L( \# | C_F) \Big ) \end{aligned}$$

where the last term, \(L( \# | C_F)\), is always 0. In addition to storing \(L(C_i|\sigma ^j)\) in cell (i, j) of the matrix, we also store the “best” (least expensive) state q and the column b from which we arrived at this cell. These back-pointers to the previous state and column allow us to backtrack through the matrix at the end, to reconstruct the lowest-cost—most probable—path through the HMM.
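
Putting the pieces together, the DP step can be sketched as below (our condensed illustration; the cost functions are supplied by the caller, state 0 is C_0 and state K+1 stands for C_F):

```python
def resegment(word, K, trans_cost, emit_cost):
    """Return the lowest-cost (state, morph) sequence for `word`."""
    n, INF = len(word), float("inf")
    # cost[j][i] = L(C_i | sigma^j); back[j][i] stores (b, q) for backtracking
    cost = [[INF] * (K + 1) for _ in range(n + 1)]
    back = [[None] * (K + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0                          # empty prefix, initial state C_0
    for j in range(1, n + 1):                 # right end of the candidate morph
        for i in range(1, K + 1):             # state emitting the morph
            for b in range(j):                # previous column
                for q in range(K + 1):        # previous state (C_0 included)
                    if cost[b][q] == INF:
                        continue
                    c = cost[b][q] + trans_cost(q, i) + emit_cost(word[b:j], i)
                    if c < cost[j][i]:
                        cost[j][i], back[j][i] = c, (b, q)
    # close the path with the transition to C_F; emitting '#' costs nothing
    best_i = min(range(1, K + 1), key=lambda i: cost[n][i] + trans_cost(i, K + 1))
    # backtrack to recover the segmentation and the class assignment
    path, j, i = [], n, best_i
    while j > 0:
        b, q = back[j][i]
        path.append((i, word[b:j]))
        j, i = b, q
    return list(reversed(path))
```

With the directionality heuristic of Sect. 5, the loop over the previous state q would be restricted to states below i; with simulated annealing, the strict minimum is replaced by a randomized choice among the cheapest candidates.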

5 Enhancements to Baseline Model

We next present enhancements to the baseline algorithm described in Sect. 4, which yield improvements in performance.

Simulated Annealing: The greedy search for the best segmentation for all words in the corpus quickly converges to local—far from global—optima. To avoid local optima, we use simulated annealing, with temperature T varying between fixed starting and ending values, \(T_0\) and \(T_F\), under a geometric cooling schedule with rate \(\alpha \). In Eq. 4, rather than using \(\min \) to determine the best cell (q, b) from which to jump to X in the DP matrix, we select a candidate cell at random, depending on its cost, from a gradually narrowing window.

This ensures that the model does not always greedily choose the best solution, and enables it to initially make random jumps to avoid local optima.
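
The windowing scheme is not spelled out in detail above, so the following sketch uses an assumed linear shrinkage of the window with temperature, purely for illustration:

```python
import random

def annealed_choice(candidates, T, T0):
    """Pick a candidate cell (q, b) from `candidates`, a list of
    (cost, payload) pairs; the window of admissible candidates narrows
    from the whole list at T = T0 down to the single best as T -> 0."""
    ranked = sorted(candidates, key=lambda c: c[0])
    window = max(1, round(len(ranked) * T / T0))
    return random.choice(ranked[:window])[1]
```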

Next, as mentioned in the abstract, we introduce heuristics that constrain the search, based on simple yet general linguistic principles.

  • 1. Directionality of the FSM: The FSM must be directional: the morphotactics of any language specify exactly the order in which morphological classes may follow one another. In many Indo-European languages, e.g., a word can have some prefixes, then a stem, then suffixes—always in a fixed order. Further, different kinds of suffixes have strict ordering among them—e.g., derivation precedes inflection.

    To enforce directionality, in Sect. 4.6, we constrain the DP matrix so that the preceding state q in Eq. 4 ranges only from 0 up to \(i-1\), rather than up to K. Since the states are ordered, this blocks transitions from a later state to an earlier one.

  • 2. Natural Classes of Morphs: As a general principle, morph classes fall into two principal kinds: stems vs. affixes. We arbitrarily fix an initial range of states to be prefix states, followed by a range of stem states, followed by suffix states. A simple heuristic based on this is that the HMM must pass through at least one stem state during the DP search.

  • 3. Bulk Re-segmentation: An important linguistic principle is that stems and affixes have very different properties. First, stem classes are usually open—i.e., potentially very large, whereas all affix classes are necessarily closed—very limited. This is reflected, e.g., in borrowing: one language may borrow any number of stems from another freely, but it is extremely unlikely to borrow a suffix.

Second, a randomly chosen affix is in general expected to occur much more frequently in the corpus than a randomly chosen stem.

Based on this principle, we introduce another heuristic to guide the search (see the sketch below): after the normal re-segmentation step, we check all classes for “bad” morphs that violate this principle: very frequent morphs in stem classes, and very rare morphs in affix classes. With a certain probability \(\pi (T)\), which depends only on the simulated annealing temperature, all words that contain a bad morph are removed from the model in bulk (from the lexicon, and from the transition and emission counts), and re-segmented afresh. When T is high, \(\pi (T)\) is small; as T cools, \(\pi (T) \rightarrow 1\).
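
The schedule for \(\pi (T)\) and the frequency thresholds are not specified above, so the sketch below uses assumed values purely to illustrate the mechanism:

```python
import random

def is_bad(morph_freq, in_stem_class, stem_freq_cap=1000, affix_freq_floor=5):
    """A 'bad' morph is very frequent in a stem class or very rare in an affix
    class; both thresholds here are illustrative assumptions."""
    if in_stem_class:
        return morph_freq > stem_freq_cap
    return morph_freq < affix_freq_floor

def bulk_resegment_now(T, T0):
    """Assumed schedule for pi(T): small at high temperature, approaching 1 as T cools."""
    return random.random() < 1.0 - T / T0
```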

6 Evaluation

Evaluation of morphology discovery is challenging, especially when morphological analysis is limited to segmentation—here, as in most prior work—because in general it is not possible to posit definitively “correct” segmentation boundaries, since segmentation by definition ignores allomorphy.

An evaluation based on probabilistic sampling is suggested in [23]. Another scheme is suggested in the papers about the HUTMEGS “Gold-standard” evaluation corpus for Finnish, [6, 7]. However, these approaches to evaluation are problematic, in that they ignore the issue of consistency of the segmentation.

[7] observe correctly that positing a single “proper” morphological analysis for a word w is not possible, in general. A motivating example is English \(w=tries\): it can be analyzed as tri+es or trie+s. In actuality, w has two morphemes, each of which can have more than one allomorph—a stem {try-/tri-} or {try-/trie-}, and a 3rd-person suffix {-s/-es} or {-s}. Restricting morphological analysis to segmentation makes the problem ill-defined: it is not possible to posit a “proper” way to place the morpheme boundary and then expect an automatic system to discover that particular way. HUTMEGS proposes “fuzzy” morpheme boundaries, allowing the system free choice within the bounds of the fuzzy boundary—as long as the system splits the word somewhere inside the boundary, it is not penalized.

Recent work has called into question the validity of evaluation schemes that disregard the consistency of segmentations across the data, considering such schemes too permissive: the system should commit to one way of segmenting similar words—its chosen “theory”—and then consistently segment according to that theory, and it should be penalized for violating its own theory by placing the boundaries differently in similar words. We follow the evaluation scheme in [22], which provides gold-standard segmentations and an evaluation algorithm designed to enforce consistency while giving the learning algorithm maximal benefit of the doubt.

Consider again the example of tries. In the suggested gold standard, one annotates the segmentation neither as tri- es nor as trie- s, but as \(tri\overset{X}{\cdot }e\overset{X}{\cdot }s\)—the special markers (dots) indicate that this segmentation is “ambiguous”: it can be handled in more than one way. The label X identifies this particular kind of ambiguity. (The gold standard defines a separate set of labels for each language.) In this case, the definition of X states that it accepts two “theories”, \(\{10, 01\}\), as valid—i.e., a morpheme boundary in the first position (10), or in the second (01), but not both (not 11; also not 00). Similar words (cries, dries, flies) are then annotated in the same way in the gold standard.

Suppose the words tries, flies, and applies are in the gold standard, annotated with X as above, and the model segments them as trie- s, flie- s, but appli- es. We conclude that (A) its preferred theory for handling X is to put the boundary in the second position (2 out of 3 cases), and (B) when it segmented appli- es it placed the boundary incorrectly: it violated its own preferred theory (defined by the majority of its choices). Thus its accuracy will be 2/3. The label X is used to co-index possible “ambiguous” boundaries in the gold standard, for a particular type of ambiguity. Annotators use these labels consistently across the evaluation corpus.
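
The scoring of one ambiguity label can be reduced to a few lines; the sketch below (ours, simplified to a single label) reproduces the 2/3 accuracy of the example above:

```python
from collections import Counter

def label_accuracy(system_choices, allowed_theories):
    """`system_choices` maps each word to the theory its segmentation implies,
    e.g. {"tries": "01", "flies": "01", "applies": "10"};
    `allowed_theories` is the set of theories the label accepts, e.g. {"10", "01"}."""
    votes = Counter(t for t in system_choices.values() if t in allowed_theories)
    if not votes:
        return 0.0
    preferred, n_consistent = votes.most_common(1)[0]
    return n_consistent / len(system_choices)

# label_accuracy({"tries": "01", "flies": "01", "applies": "10"}, {"10", "01"})
# -> 0.666...  (the preferred theory "01" covers 2 of the 3 words)
```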

6.1 Experiments

We test on data from Finnish and Turkish. For each language, the corpus consists of about 100,000 distinct word forms, from texts of novels. Approximately 1000 words extracted at random from the data were annotated for the gold-standard by at least two annotators with knowledge of morphology and native proficiency.

Table 1. Ablation studies—Finnish

Ablation Studies: The results, shown in Table 1, confirm the gains in performance yielded by the heuristics. The enhancements to the basic algorithm are simulated annealing (SA), directionality (Dir), natural classes (NC) and bulk re-segmentation (Bu). Without SA, performance drops substantially. With SA, the table shows the gain obtained from adding all possible combinations of the heuristics. It might seem that adding directionality alone reduces the scores of the model; however, the heuristics built upon directionality help the model achieve better results than the non-directional model.

Comparison Studies: Results are shown in Fig. 1, for Finnish and Turkish. Each point in the plots represents a single run of an algorithm. The coordinates of each point are its recall and precision; the accuracy of each run is given in its label.

Fig. 1. Precision vs. recall: A—Finnish data, B—Turkish data (Color figure online)

For comparison, we ran Morfessor CatMAP [8] on the same data, since it currently obtains the best performance of all Morfessor variants, as explained in [13]. FlatCat performs better than CatMAP with semi-supervised learning, but falls short of CatMAP's performance in the unsupervised setting. CatMAP has a perplexity threshold parameter, b, “which indicates the point where a morph is as likely to be a prefix as a non-prefix” [8]. This parameter trades off recall for precision; the more data there is, the higher b should be. As b grows, words are split less, giving higher precision but lower recall. Running Morfessor with b varying over \(5 \le b \le 800\) yields the red line in the plots. SMorph also has hyper-parameters, which we vary in the experiments: the probability \(\rho \) of placing a morph boundary between two adjacent symbols during the initial random segmentation (0.20–0.25); the number of classes K (15); and the assignment of classes to prefix, stem and suffix kinds (\(s_{max}\) and \(a_{min}\)).

The blue points in the plots correspond to runs of SMorph, with different settings of the hyper-parameters, which can be optimized further, e.g., on a development corpus.

The runs of SMorph show an improvement in terms of recall and precision over Morfessor CatMAP: the blue points lie above the red curve. For example, at a given level of recall, SMorph reaches higher precision. For Finnish, the gain in precision is 2–8%; for Turkish, 2–7%. Conversely, at a given level of precision, SMorph reaches higher recall; for very large b, Morfessor reaches higher recall, but generally at a loss in precision. SMorph and Morfessor obtain similar accuracy values, though at a fixed level of recall SMorph has higher accuracy. More fine-grained effects of the hyper-parameters on performance will be investigated in future work.

Qualitative Evaluation of Classification: An important feature of SMorph is that the morph classes it learns are of a high quality. Manual inspection confirms that the classes group together morphs of a similar nature: noun-stem classes separate from verb stems, classes of affixes of similar kinds, etc.; hence the high precision.

As is natural in MDL, if some affixes frequently appear together, they will eventually be learned as a single affix; this explains the lower recall. However, this problem may be addressable in a post-processing step, after learning is complete, to be explored in future work. Of course, evaluating the classes quantitatively is difficult; hence we evaluate only the segmentations quantitatively.

6.2 Error Analysis

A non-directional model trained on Finnish is shown in Fig. 2—with only 5 classes, for clearer visualization. Each node shows the 8 most frequent morphs emitted from it, as well as the number of distinct morphs (|Lex|) and the emission frequency (freq). Probabilities of transition between states are shown on the edges. (Edges with probability <0.02 are omitted for clarity.) The model has learned to emit stems mostly from states \(S_1\) and \(S_5\), and suffixes from \(S_2\), \(S_3\) and \(S_4\). As expected, stem states are much larger than suffix states. The two kinds of states also exhibit different properties: \(S_1\) and \(S_5\) have much flatter distributions, while \(S_3\) and \(S_4\) have spiked distributions: a few morphs with very high frequencies, and many with very low frequencies.

Further, \(S_1\) has mostly verbal stems, whereas \(S_5\) has mostly nominal ones; \(S_4\) is heavy on nominal suffixes, while \(S_2\) has mostly verbal ones.

Fig. 2. Visualization of a model trained on Finnish data (with only 5 states)

7 Conclusions and Current Work

We have presented an algorithm for segmentation of a large corpus of words, which improves upon the state of the art. There are several important differences between SMorph and the Morfessor models. SMorph approaches the problem in a systematic way by grouping the discovered morphs into classes that respect general linguistic principles: the directionality of morphotactics and the natural differences between stems and affixes. It starts from a random initial model with no prior assumptions about the language, and learns to segment the data by optimizing a two-part cost function with Bayesian Marginal Likelihood coding, which differs from the coding schemes used in prior work. The model is evaluated using a scheme that avoids some of the problems of earlier evaluations.

To assure replicability, all gold-standard segmentations and code are made publicly available. Future improvements include those mentioned above: learning the optimal number of classes automatically, which should be reflected in the code-length; and modeling compounding and allomorphy.