
1 Introduction

Agglutinative languages, such as Turkish or Japanese, are commonly reported to be challenging for automatic speech recognition (ASR), partly because the vocabulary of these languages cannot be effectively covered by a simple enumeration of words: listing is impractical due to the highly productive derivational and inflectional morphology. This morphological characteristic may pose problems for speech recognition, as the large number of word types can lead to high out-of-vocabulary rates in the pronunciation lexicon and high perplexities in language models.

German is not an agglutinative language, but its relatively complex inflectional system and its productive compounding raise problems similar to those of agglutinative languages [4,5,6, 13, 16]. For example, German adjective modifiers can have different endings according to the gender, number and case of the nouns they modify (e.g., das kalt-e Bier ‘the cold beer [nom]’, dem kalt-en Bier ‘the cold beer [dat]’, kalt-es Wasser ‘cold water [acc]’). Even counting only eleven forms per adjective (null, -e, -en, -em, -er, -es, -ste, -sten, -stem, -ster, -stes) would require adding each adjective eleven times to the lexicon. Compounding can also considerably increase the vocabulary size in German, as spelling conventions require most compounds to be written as single words, without any hint of morpheme boundaries. This practice results in such super-long word formations as the infamous Donaudampfschifffahrtsgesellschaft ‘Danube Steamboat Shipping Company’. An obvious solution to these problems is to split up word forms, that is, to introduce some kind of word segmentation step that provides input for lexicon and language model related tasks.

There are several word segmentation techniques and tools available, ranging from morphological analyzers to completely data-driven, unsupervised segmentation techniques. General-purpose morphological analyzers, such as Tagh [7] for German or ChaSen [11] for Japanese, provide full-fledged, linguistically accurate morphological analyses in which morpheme boundaries can be used as splitting points. Although it is not uncommon for studies to implement custom, morphology-based segmentation tools [2, 14], the costs associated with the development and maintenance of general morphological analyzers are often prohibitive. Languages without appropriate morphological analyzers can be processed by self- or semi-supervised, data-driven algorithms that identify sub-word units automatically, without relying on morphological information. Apart from some sporadic, heuristically formulated attempts [10, 17], Morfessor [3] stands out as an established data-driven segmentation tool that frequently appears in studies concerned with sub-word models for speech recognition [18,19,20]. While data-driven tools are extremely convenient and their performance tends to improve with more data, they can produce unexpected errors, and their behavior is difficult to control.

The current study aims to give a brief overview of how Finite-State Transducers (FSTs) can be used for word segmentation, and to provide a simple performance comparison of the techniques introduced, using German data. FSTs are a convenient mechanism for segmenting words and are often used in morphological analyzers. But FSTs can also operate on bottom-up information, for example in the form of n-gram models. This study introduces and compares two top-down and a range of bottom-up FST models for word segmentation. As the preliminary experiments show, more morphological knowledge leads to better segmentation performance, but self-supervised approaches with no morphological knowledge can perform on par with expert systems.

2 Word Segmentation with Transducers

2.1 Morphological Analysis as Segmentation

The simplest word segmentation transducer can be constructed similarly to a two-level morphological parser [8], except that instead of underlying morphemes and features the output contains only the split-up input. Input and output labels share the same set of characters, with extra segment boundary symbols (e.g. ‘+’) on the output side. The segmentation transducer is defined as a closure over all acceptable segments. Figure 1 demonstrates a sample transducer that splits the input compound zeitraum ‘time period’ into its components: zeit+raum.

Fig. 1. A sample word segmenter FST.
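
To make the construction concrete, the following minimal sketch builds such a segmenter with pynini, the Python interface to OpenFst; the segment inventory is a toy assumption, not the lexicon used in this study.

  import pynini

  # Toy segment inventory; a real system would use a full morpheme lexicon.
  segments = pynini.union("zeit", "raum", "ab", "schaff", "ung")

  # Emit a '+' on the output side between consecutive segments and take the
  # closure over all acceptable segments, as in Fig. 1.
  boundary = pynini.cross("", "+")
  segmenter = (segments + pynini.closure(boundary + segments)).optimize()

  # Compose an input word with the segmenter and read off the best path.
  best = pynini.shortestpath(pynini.accep("zeitraum") @ segmenter)
  print(best.project("output").string())  # zeit+raum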

A transducer with a simple closure over all lexical items, however, is not an effective segmenter, because it accepts any sequence of segments in any order. For example, the transducer above also accepts nonsense words like zeitzeit or raumraumraum. This problem of over-generation can be addressed by incorporating word-formation constraints into the transducer. A widely used technique is to formulate constraints over morphological categories, such as prefixes and suffixes. Figure 2 displays a transducer that accepts prefixes only at the beginning of words and suffixes only at the end. For example, the prefix ab- attaches only to the left side of verbs, and the suffix -ung only to their right side (e.g. ab+schaff+ung).

Fig. 2. Segmenter FST with morphological knowledge about prefixes and suffixes.
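
Under the same toy assumptions as before, the positional constraints of Fig. 2 can be sketched by restricting where each morpheme class may occur:

  import pynini

  # Toy morpheme classes (assumptions for illustration only).
  prefixes = pynini.union("ab", "ver", "un")
  stems = pynini.union("schaff", "zeit", "raum")
  suffixes = pynini.union("ung", "en", "e")
  mark = pynini.cross("", "+")

  # Prefixes may occur only word-initially, suffixes only word-finally;
  # stems may be compounded, mirroring the structure of Fig. 2.
  word = (pynini.closure(prefixes + mark)
          + stems + pynini.closure(mark + stems)
          + pynini.closure(mark + suffixes)).optimize()

  best = pynini.shortestpath(pynini.accep("abschaffung") @ word)
  print(best.project("output").string())  # ab+schaff+ung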

This naive prefix/suffix-only approach still leaves plenty of room for over-generation. The system can be greatly improved by incorporating more fine-grained word-formation rules, taking affix types, part-of-speech categories and other subcategorization features into consideration. Figure 3 shows an excerpt of a more sophisticated morphological approach using various derivational and inflectional suffix types. Although discovering and implementing word-formation rules is a tedious task, it can lead to remarkable segmentation performance, as demonstrated by German morphological analyzers such as Tagh [7].

Fig. 3. Excerpt of a segmenter FST with expert morphological knowledge.

2.2 Supervised Word Segmentation with N-Grams

While morphological analyzers are obvious choices for segmenting words, the analysis they provide is not necessarily optimal for further processing. For instance, word stems combined with morphological features, instead of the written forms, do not provide optimal input for grapheme-to-phoneme algorithms (e.g. \(\textit{wirfst} \rightarrow \textit{werfen}\langle V\rangle \langle 2\rangle \langle Sg\rangle\)). Likewise, very short morphemes can be sub-optimal for speech recognition tasks. These and similar constraints can easily result in disagreements with the morphological analysis. Data-driven segmentation techniques can remedy this problem by providing a means of learning arbitrary segmentation patterns. One way of doing this is to train n-gram models on segmented data. Using n-gram-based segmentation as a text pre-processing step is an established method for Asian languages [9], but it has also been applied to German. Incorporating n-gram models into segmentation FSTs is not a complicated task: FST-based language models are commonly used in various speech and language processing tasks [12, 15]. A notable problem concerning the combination of segmentation and n-gram FSTs is that FST-based segmenters typically operate on characters, while n-gram models are defined over words or morphemes. This mismatch can easily be remedied by rewriting character sequences to morpheme labels in the segmenter, as demonstrated in Fig. 4.

As a first approximation, n-gram information can be integrated into the segmentation process in two steps: first, a lattice of possible segmentations is created; second, this lattice is re-scored with an n-gram FST. The two-step approach, however, is slow and cumbersome. A more elegant approach is to merge the segmenter and the n-gram transducers into a single FST. The merged, or composed, FST preserves the overall structure and weights of the n-gram transducer. Figure 5 displays a fragment of an n-gram FST whose morpheme input arcs were replaced by character sequences.
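
As a sketch of the two variants (not the exact implementation used here), assume seg_fst is a character-to-morpheme segmenter as in Fig. 4 and morph_lm a morpheme-level n-gram model stored as a weighted FST, for instance one built with the OpenGrm NGram tools:

  import pynini

  seg_fst = pynini.Fst.read("segmenter.fst")   # characters -> morpheme labels (Fig. 4)
  morph_lm = pynini.Fst.read("morph_lm.fst")   # morpheme-level n-gram model

  def segment_two_step(word):
      # Step 1: lattice of all candidate segmentations of this word.
      lattice = pynini.accep(word) @ seg_fst
      # Step 2: re-score the lattice with the n-gram model; keep the best path.
      return pynini.shortestpath(lattice @ morph_lm)

  # One-step variant: compose segmenter and language model once, offline,
  # and reuse the merged transducer for every input word.
  seg_lm = pynini.compose(seg_fst, morph_lm)

  def segment_composed(word):
      return pynini.shortestpath(pynini.accep(word) @ seg_lm)

  # The morpheme sequence can be read off the output labels of the best path,
  # assuming the morpheme symbol table is stored with the language model.
  path = segment_composed("zeitraum")
  print(path.project("output").string(token_type=morph_lm.input_symbols()))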

Fig. 4. Word segmenter from Fig. 1 with morpheme-level output labels.

Fig. 5. Fragment of a transducer n-gram model with arcs for an, ab and abend. Epsilon output labels are omitted for clarity.

Making the resulting transducer deterministic and sorting its arcs by input label are useful optimization steps, as they reduce model size and enable faster arc lookup. Figure 6 shows an optimized version of the FST of Fig. 5. A disadvantage of these models is that they require custom search and composition algorithms, as their treatment of back-off and epsilon arcs differs from that of standard FST-based n-gram models.
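
In OpenFst terms these optimization steps might look roughly as follows; the sketch assumes the back-off arcs already carry a special non-epsilon label (as for the models in Sect. 3.3), since plain epsilon back-offs would prevent straightforward determinization.

  import pynini

  # seg_lm.fst: the composed segmenter/language-model transducer from above.
  seg_lm = pynini.Fst.read("seg_lm.fst")

  optimized = pynini.determinize(seg_lm)
  optimized.minimize()                     # merge redundant states
  optimized.arcsort(sort_type="ilabel")    # enable fast arc lookup during search
  optimized.write("seg_lm_optimized.fst")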

Fig. 6. Determinized and weight-pushed version of the transducer in Fig. 5.

3 Experiment

A series of experiments was conducted to compare the performance of top-down and bottom-up approaches to FST-based segmentation. The top-down approach was represented by two FST models that implemented different amounts of morphological knowledge (Naive and Expert levels). The bottom-up approach was represented by transducers based on n-gram models. A relatively small (134k-token) broadcast news corpus was used in a 10-fold cross-validation setup to evaluate segmentation performance. The folds were analyzed for perplexity and OOV rates as well as precision, recall and f-measure. In preparation for these calculations, n-gram models with Katz smoothing were trained on 9 of the 10 folds, and the quality measures were calculated on the retained fold.

3.1 Corpus Data and Segmentation

As there is no standardized way to segment German text, there is also no standardized segmented corpus available. For development and testing purposes, German news broadcast text was collected from the Deutsche Welle news portal www.dw.de between early 2017 and early 2018. The collected texts, extracted from 207 news reports, were manually normalized and segmented. After normalization each file contained on average 646.7 tokens. Segmentation involved only the splitting up of words; no morphological categories or features were added. Some examples from the corpus are: Woche-n-arbeit-s-zeit ‘hours worked per week’, Zahl-reich-e Häuser sind zer-stör-t ‘several houses are destroyed’. Admittedly, this manual segmentation diverged from traditional morphological analyses. For example, in order to keep the lexical model simple, words were kept together if segmentation would have produced an alternative pronunciation, as with Häuser \(\rightarrow\) *Haus+er.

3.2 Perplexity and OOV Rates

The corpus had a relatively small size of circa 134k tokens after text normalization. Segmentation increased the token count to 198k while almost halving the type count. As expected, the segmented corpus had a lower perplexity of 14.1 compared to 21.4 for the original text (Table 1). Perplexity values were calculated using 3-gram language models with Katz smoothing. As shown in Fig. 7, word segmentation achieved a considerable decrease in perplexity on unseen data: from 219.98 to 79.69 on average.
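
For reference, the perplexity of a held-out fold is computed from the per-token log-probabilities assigned by the trained model; a generic sketch of the calculation (independent of the 3-gram Katz models used here) is:

  import math

  def perplexity(log_probs):
      """Perplexity of a held-out fold, given the natural-log probability
      assigned by the language model to each of its tokens."""
      return math.exp(-sum(log_probs) / len(log_probs))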

Table 1. Text-normalized and segmented news broadcast data.
Fig. 7. Perplexity values in unseen folds with segmented and unsegmented text.

OOV type and token ratios were also calculated for the unseen folds. The weighted average of OOV tokens was 7.47%, which dropped to 1.89% after segmentation. A similar decrease was observed with types: from 20.88% to 9.30% on average. Values for each fold are shown in Fig. 8.
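
The OOV figures follow the usual definition; a plain-Python sketch of the computation (our illustration, not the evaluation scripts used in the study) is:

  from collections import Counter

  def oov_rates(train_tokens, test_tokens):
      """Token and type OOV rates of a held-out fold w.r.t. the training vocabulary."""
      vocab = set(train_tokens)
      test_counts = Counter(test_tokens)
      oov_token_count = sum(c for w, c in test_counts.items() if w not in vocab)
      oov_type_count = sum(1 for w in test_counts if w not in vocab)
      return oov_token_count / len(test_tokens), oov_type_count / len(test_counts)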

Fig. 8. Out-of-vocabulary ratios for tokens (left) and for types (right) in unseen folds.

3.3 Segmentation Models

A series of FST-based word segmenters was created following the concepts outlined in Sect. 2. A Naive model was created with an FST structure relying on only three morpheme categories: prefixes, suffixes and stems (cf. Fig. 2). Weights were set to a constant value per segment, so that segmentations with fewer, longer chunks were preferred. The Expert model implemented a thorough, but non-exhaustive, set of morphological rules (see Fig. 3). Its weights were defined manually, based on experimentation. Both the Naive and the Expert model used around 80% of the corpus as a development set. Other sources, such as affix dictionaries and word lists, were also used to define morpheme classes, transducer structure and weights.
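
The effect of the constant weights can be sketched as follows: under the tropical semiring each segment adds the same cost, so the shortest path through the transducer is the segmentation with the fewest (hence longest) chunks. The morpheme lists below are again toy assumptions.

  import pynini

  SEGMENT_COST = 1.0  # constant cost per segment

  def weighted(items):
      return pynini.union(*[pynini.accep(s, weight=SEGMENT_COST) for s in items])

  prefixes = weighted(["ab", "ver", "un"])
  stems = weighted(["schaff", "zeit", "raum", "arbeit"])
  suffixes = weighted(["ung", "en", "e"])
  mark = pynini.cross("", "+")

  # Naive structure as in Fig. 2; shortestpath now prefers fewer segments.
  naive = (pynini.closure(prefixes + mark)
           + stems + pynini.closure(mark + stems)
           + pynini.closure(mark + suffixes)).optimize()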

In addition to the two top-down approaches, five data-driven models were created using 1- to 5-gram language models with Katz smoothing. To prepare these models, transducer-based n-gram models were first trained on the normalized and segmented text of the training folds. Next, word and morpheme labels in the n-gram transducer were replaced by character sequences on the input side (cf. Fig. 4). Finally, the transducers were determinized and minimized, and the weights were pushed forward for faster performance (cf. Fig. 6). A special, non-epsilon symbol was used as the back-off arc label. All transducers and necessary tools were developed using OpenFst [1] and OpenGrm [15].

3.4 Results

Recall, precision and f-measure values were calculated to evaluate segmentation performance. The unseen data folds from the cross-validation setup were used as test sets for the n-gram models. For the Naive and Expert models, the separation of seen and unseen data was not strictly maintained, as parts of the corpus were used, along with other sources, to manually discover morphological generalizations. For easier comparison, the same “unseen” folds were used for all segmentation models. Table 2 summarizes the means of the performance metrics over the test sets. Precision and recall values, with medians, are presented visually in Fig. 9.
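
The exact scoring unit is not spelled out above; assuming boundary-level scoring, where each word contributes the set of split positions it contains, the computation can be sketched in plain Python as follows.

  def split_positions(word):
      """Character offsets (in the unsegmented word) at which boundaries occur."""
      positions, offset = set(), 0
      for segment in word.split("+")[:-1]:
          offset += len(segment)
          positions.add(offset)
      return positions

  def boundary_prf(reference, hypothesis):
      """Boundary-level precision, recall and f-measure for parallel lists of
      segmented words such as 'zeit+raum'."""
      tp = fp = fn = 0
      for ref, hyp in zip(reference, hypothesis):
          ref_b, hyp_b = split_positions(ref), split_positions(hyp)
          tp += len(ref_b & hyp_b)
          fp += len(hyp_b - ref_b)
          fn += len(ref_b - hyp_b)
      precision = tp / (tp + fp) if tp + fp else 0.0
      recall = tp / (tp + fn) if tp + fn else 0.0
      f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
      return precision, recall, f1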

Table 2. Segmentation performance: mean values over “unseen” folds.
Fig. 9. Segmentation performance: recall (left) and precision (right).

3.5 Discussion

In terms of f-measure, the best segmentation performance was achieved by the 2-gram model. This result, however, is not significantly different from those of the higher-order n-gram models. Unquestionably, the Naive approach had the worst performance among the compared models, which is not surprising given its over-simplified morphological model. Incorporating more sophisticated morphological knowledge proved useful, as demonstrated by the performance improvements of the Expert model. The question, of course, is whether such expert systems are worth developing when n-gram models without any morphological knowledge can deliver similar performance.

A closer look at the errors qualifies the interpretation of these seemingly outstanding results. Almost half of the OOV words in the unseen folds were named entities in non-affixed forms. These unsplit OOV items did not contribute to the evaluation, as non-parsable input words were treated as single units; thus neither the reference nor the hypotheses contained morpheme boundaries for them. Provided that the words used for training are segmented correctly, the seen data together with the non-splittable OOV items can account for the seemingly impressive results. The low error rates are attributable to the low number of multi-segment OOV items.

4 Conclusion

The goal of this article was to present a brief overview and a few examples of how FSTs can be used for word segmentation. The top-down and bottom-up approaches introduced, while performing well in the experiments, provide only a limited insight into what FSTs are capable of. For example, top-down models can easily be augmented with stochastic elements; inversely, the n-gram approach can integrate morphological classes. It is also possible to detect word-embedded OOV tokens with fall-back arcs combined with confidence measures. Orthogonal to these technical improvements, another straightforward extension of this research would be the evaluation of segmentation models in context. The low perplexity and OOV rates presented here may imply better ASR performance, but the actual effect on recognition accuracy needs to be verified through experimentation. Although the current literature does not provide a conclusive answer, it seems that segmentation may lead to better ASR performance, but this gain may diminish as vocabulary size increases [19]. While we cannot answer questions related to speech recognition performance at present, we believe that our work provides a useful basis for further studies on word segmentation using finite-state techniques.