Abstract
The article presents a state-of-the-art complete part-of-speech tagger for Polish that uses recurrent neural networks. Unlike a fixed context window, the networks give access to the full left and right context of a sentence. The tagger uses an external morphological analyzer. In contrast to the best Polish taggers, it does not use the word form as a feature for the classifier, it has no separate classifier for unknown words, and its predictions are not limited to the tags provided by the morphological analyzer. Its accuracy is higher: it achieves a 28% error reduction and 7 percentage points higher accuracy for unknown words. The tagger may also run faster than others by utilizing a GPU. The tagger participated in the PolEval 2017 POS Tagging competition and won task B and task C. Additionally, results for the PolEval 2020 task Morphosyntactic tagging of Middle, New and Modern Polish are reported. The paper is an extension of the Language & Technology Conference paper [25].
1 Introduction
Part-of-speech taggers assign a part-of-speech tag to each word (token) in a sentence. For inflectional languages such as Polish or Czech, a tagger also has to recognize values of morphological categories such as gender, number, and case. These categories increase the number of tags to as many as a thousand.
Recurrent neural networks have not yet been applied to natural language processing for Polish, even though they are commonly used in state-of-the-art systems for other languages [6, 7, 9, 23].
Czech is also a Slavic language. The best Czech taggers achieve accuracy above 95%. The Prague Dependency Treebank (PDT) is used as training and test data. The main difference between PDT and the National Corpus of Polish (NCP), which is the primary source of tagging data for Polish, is that PDT is not well balanced and contains only articles from newspapers and journals. In contrast, NCP includes transcriptions of spoken conversations and user generated content from web forums. Both corpora were manually annotated by two annotators, with a third person resolving disagreements. Kobyliński and Kieraś [11] obtained scores higher by 0.75 percentage points by training and testing only on the newspaper subcorpus.
This work presents a part-of-speech tagger for Polish using bidirectional recurrent networks (named KRNNT). Source code and trained models are available at: https://github.com/kwrobel-nlp/krnnt. KRNNT can easily be run in a Docker container and integrated with other systems through an API.
The next section describes the training dataset and tagset, and summarizes the best publicly available Polish taggers. Section 3 formulates recurrent neural networks and their bidirectional extension. Section 4, the main one, describes all modules of the tagger and the training parameters. Evaluation is described in Sect. 5. Section 6 presents the results of the PolEval contest. Finally, Sect. 7 presents conclusions and future work.
2 Polish Tagging
2.1 Dataset
The largest publicly available dataset for Polish is a manually annotated subcorpus of the National Corpus of Polish (NCP) containing over 1 million tokens. The corpus is balanced with respect to genres and subjects: it includes newspaper articles, books, transcriptions of spoken conversations, and user generated content from web forums.
Table 1 presents statistics of NCP. Figure 1 shows the frequency of tags on a logarithmic scale. Full tags, treated as labels in a multi-class classification problem, are very unbalanced.
The NCP tagset consists of 35 grammatical classes, each having a set of grammatical categories. The number of all possible tags (grammatical classes with unique values of grammatical categories) is about 4000, but only 926 distinct tags occur in the NCP texts.
For example, Polish adjectives have 2 numbers (singular, plural), 7 cases (nominative, genitive, dative, accusative, instrumental, locative, vocative), 5 genders (human masculine, animate masculine, inanimate masculine, feminine, neuter), and 3 degrees (positive, comparative, superlative), which gives 2 × 7 × 5 × 3 = 210 variants in total.
Morphosyntactic tags are represented as a sequence of a grammatical class and values of grammatical categories, e.g. adj:sg:nom:m1:pos, where adj is adjective, sg is singular number, nom is nominative case, m1 is human masculine gender and pos is positive degree.
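As an illustration, the tag format and the count of adjective variants can be reproduced with a short Python sketch. The category abbreviations other than sg, nom, m1 and pos are assumptions of this example and are not taken from the text above; the function names are hypothetical.

```python
from itertools import product

# Category values for Polish adjectives as listed in the text.
# Only sg, nom, m1 and pos appear in the source; the other
# abbreviations are assumed for the purpose of this sketch.
NUMBERS = ["sg", "pl"]
CASES = ["nom", "gen", "dat", "acc", "inst", "loc", "voc"]
GENDERS = ["m1", "m2", "m3", "f", "n"]
DEGREES = ["pos", "com", "sup"]


def parse_tag(tag):
    """Split a tag into its grammatical class and its category values."""
    grammatical_class, *values = tag.split(":")
    return grammatical_class, values


def adjective_tags():
    """Enumerate every adjective tag as class:number:case:gender:degree."""
    return [":".join(("adj",) + combo)
            for combo in product(NUMBERS, CASES, GENDERS, DEGREES)]


print(parse_tag("adj:sg:nom:m1:pos"))  # ('adj', ['sg', 'nom', 'm1', 'pos'])
print(len(adjective_tags()))           # 210
```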
2.2 Polish Taggers
The best Polish POS taggers that are publicly available include: Concraft [24], WCRFT [17], OpenNLP [11], WMBT [18, 20], Pantera [1], TaKIPI [15]. Comparisons of Polish taggers were presented in [11, 13, 16].
So far, Concraft has been the best Polish tagger. It uses an extended version of the CRF algorithm [14] to handle the large number of labels efficiently by restricting the solution space to the set of tags defined in a morphosyntactic dictionary. For unknown words, morphosyntactic guessing is employed as a separate classifier.
WMBT is a tiered memory-based tagger. Each tier is assigned to a grammatical class or a category. Separate classifiers are trained for known and unknown tokens. Each tier uses the kNN algorithm.
WCRFT is similar to WMBT, but the kNN algorithm is replaced with CRF. Unknown words are pre-processed by appending potential tags based on an analysis of the training data. Therefore, only one classifier is used while tagging.
Pantera uses a rule induction algorithm driven by a modified version of Brill’s transformation-based learning algorithm [2]. It uses two tiers in the process of tagging and operates on parts of labels, i.e. grammatical class and categories.
TaKIPI employs the C4.5 decision tree algorithm. About 200 classes of ambiguity were defined and a classifier was trained for each of them. The tagger also uses handwritten rules.
The Apache OpenNLP library is a free implementation of NLP algorithms. Its POS tagging algorithm employs a perceptron. No tiers are used: all tags are outputs of the neural network. The main difference from the taggers mentioned above is that the OpenNLP tagger does not use a morphological analyzer; the token is its only input.
3 Recurrent Neural Network
A recurrent neural network (RNN) is an extension of the feedforward neural network that is able to handle variable-length input sequences. The RNN has a hidden state that is updated at each time step, so it carries information about the left context of a sequence. The RNN shares parameters across all steps. By default, an RNN provides an output for each step. However, RNNs are difficult to train to capture long-term dependencies, because the gradients tend to vanish or explode.
Long short-term memory (LSTM) [8] was developed to alleviate the problems of plain RNNs. The LSTM introduces an additional memory cell and gates. The output, forget and input gates modulate how much of the memory cell is used to calculate the output of the LSTM and the new value of the memory cell.
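In the standard formulation, the output of the LSTM \(h_t\) at time t is:
\[ h_t = o_t \odot \tanh (c_t) \]
where \(o_t\) is the output gate and \(c_t\) is the memory cell:
\[ o_t = \sigma (W_o x_t + U_o h_{t-1} + b_o) \]
\[ c_t = f_t \odot c_{t-1} + i_t \odot \tanh (W_c x_t + U_c h_{t-1} + b_c) \]
Here \(x_t\) is the input at time t, \(W\), \(U\) and \(b\) with the respective subscripts are learned weights and biases, \(\odot \) denotes element-wise multiplication, and \(\sigma \) is the logistic sigmoid function. The forget gate \(f_t\) and the input gate \(i_t\) are computed as follows:
\[ f_t = \sigma (W_f x_t + U_f h_{t-1} + b_f) \]
\[ i_t = \sigma (W_i x_t + U_i h_{t-1} + b_i) \]
The Gated Recurrent Unit (GRU) is similar to the LSTM. It does not have a memory cell, but uses update and reset gates. It has fewer parameters than the LSTM and therefore its training is faster.
The output of the GRU is:
\[ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tanh (W x_t + U (r_t \odot h_{t-1}) + b) \]
where \(z_t\) is the update gate and \(r_t\) is the reset gate:
\[ z_t = \sigma (W_z x_t + U_z h_{t-1} + b_z) \]
\[ r_t = \sigma (W_r x_t + U_r h_{t-1} + b_r) \]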
A simple extension of the RNN is the bidirectional recurrent neural network (BDRNN). It contains two RNNs, the first working forward and the second working backward. Their outputs at each step are merged (i.e. concatenated). The forward RNN remembers the left context and the backward RNN remembers the right context. This solves the problem of providing the full context of a token in POS tagging. Many taggers use a context window to incorporate the nearest neighbors of a token, but the window is usually of limited size.
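The following sketch illustrates how the merged output gives each step access to both contexts. It is a toy under stated assumptions, not the KRNNT implementation: the input and the hidden state are scalars, biases are omitted, the GRU update follows the common textbook convention, and all parameter names and values are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_step(x, h_prev, p):
    """One GRU step with a scalar input and a scalar hidden state."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev)             # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev)             # reset gate
    h_cand = math.tanh(p["w"] * x + p["u"] * (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_cand


def run_rnn(xs, p):
    """Return the hidden state after each step, starting from h = 0."""
    h, states = 0.0, []
    for x in xs:
        h = gru_step(x, h, p)
        states.append(h)
    return states


def bidirectional(xs, p_fwd, p_bwd):
    """Pair the forward state (left context) with the backward state (right context)."""
    forward = run_rnn(xs, p_fwd)
    backward = run_rnn(xs[::-1], p_bwd)[::-1]
    return list(zip(forward, backward))


params = {"wz": 0.5, "uz": 0.1, "wr": 0.3, "ur": 0.2, "w": 1.0, "u": 0.7}
a = bidirectional([1.0, -2.0, 0.5], params, params)
b = bidirectional([1.0, -2.0, 3.0], params, params)  # only the last input differs
print(a[0][0] == b[0][0])  # True: the forward state at step 0 ignores later inputs
print(a[0][1] == b[0][1])  # False: the backward state at step 0 sees the last input
```

Changing only the last input leaves the forward state of the first step untouched but changes its backward state, which is exactly the right-context information that a left-to-right RNN lacks.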
4 Polish RNN Tagger
4.1 Tokenization and Morphological Analysis
Text preprocessing is performed by external tools. First, a text is segmented into sentences and tokens using Toki [19]. Second, each token in a sentence is analyzed by the morphological analyzer SGJP. Maca [19] integrates both tools and is used in this tagger (Concraft also uses Maca).
4.2 Morphological Guesser
A morphological guesser is a system that predicts potential tags for a word form. In practice, it is applied only to unknown words, i.e. those not present in the dictionary. In this work, a morphological guesser as a separate step is omitted, so unknown words have no features associated with potential tags. To address this issue, features based on the word form were added: the first and last three letters of the word form. WCRFT and Concraft use a separate classifier that replaces morphological analysis for unknown words.
4.3 Morphological Disambiguator
The morphological disambiguator assigns one tag to each token. The decision is based on features (observations) of tokens. In this work, classification is performed by a recurrent neural network, so there is no need to create a context window, which limits the length of a potential dependency. With a bidirectional recurrent neural network, the classifier has information about the full left and full right context of each token. This is an advantage over the window approach used with CRFs: both WCRFT and Concraft use a context of length 2.
4.4 Lemmatization
Lemmatization is a task closely related to morphological disambiguation. Concraft and WCRFT do not tackle this problem: from a dictionary, they choose all lemmas associated with the disambiguated tag. Unknown words are not lemmatized; to obtain higher scores, the token form is set as the lemma.
In this work, a simple extension is provided. Training data is analyzed by counting lemmas for each pair of a token and disambiguated tag. During tagging, the tagger starts with prediction of a tag and then chooses the most frequent lemma for the pair: token and disambiguated tag.
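The counting scheme described above can be sketched as follows. This is a minimal illustration, not the KRNNT code: the function names and the toy training triples are hypothetical, and falling back to the token form for unseen pairs is an assumption of the sketch.

```python
from collections import Counter, defaultdict


def train_lemmatizer(annotated_tokens):
    """Count lemmas for every (token form, tag) pair seen in the training data."""
    counts = defaultdict(Counter)
    for form, tag, lemma in annotated_tokens:
        counts[(form, tag)][lemma] += 1
    return counts


def lemmatize(counts, form, tag):
    """Return the most frequent lemma for (form, tag)."""
    lemmas = counts.get((form, tag))
    if not lemmas:
        return form  # assumed fallback for pairs unseen in training
    return lemmas.most_common(1)[0][0]


# Toy training data: (token form, disambiguated tag, lemma).
training = [
    ("mam", "fin:sg:pri:imperf", "mieć"),
    ("mam", "subst:pl:gen:f", "mama"),
    ("psy", "subst:pl:nom:m2", "pies"),
    ("psy", "subst:pl:nom:m2", "pies"),
    ("psy", "subst:pl:nom:m2", "psy"),  # simulated annotation noise
]
counts = train_lemmatizer(training)
print(lemmatize(counts, "mam", "subst:pl:gen:f"))   # mama
print(lemmatize(counts, "psy", "subst:pl:nom:m2"))  # pies
print(lemmatize(counts, "koty", "subst:pl:nom:m2")) # koty
```

Keying the counts on the pair rather than on the token alone lets the predicted tag resolve forms such as mam, whose lemma depends on the reading.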
4.5 Network Architecture
The network has two bidirectional GRUs. Dropout (the fraction of units dropped during training) is applied to the linear transformation of the recurrent state. Dropout is also applied to the output of the second bidirectional GRU, which is then processed by a dense layer whose weights are shared across all steps. The network is presented in Fig. 2. All features of words are encoded as one-hot vectors.
4.6 Features
The best Polish taggers use the word form as a feature. In Polish, there are more than 3.8 million unique word forms. Treating them as one-hot vectors would create too many inputs for a neural network. Generally, this dimensionality problem is solved by using word embeddings. However, word embeddings are not used in this work because initial attempts gave worse results: adding word embeddings as an additional input prevented the model from exceeding 90% accuracy.
Eight sets of features were tested (the number of unique features is given in parentheses):
- tags4 (388): each tag is divided into two parts: grammatical class + case or person, and grammatical class + rest of grammatical categories [1],
- tags5 (90): case, and concatenation of number, case and gerund,
- shape (76): collapsed shape of the token, in which upper case letters are represented as u, lower case letters as l, digits as d, and other characters as x (e.g. Wrobel2017 gives ullllldddd and, after collapsing, uld),
- cases (5): information whether the word form is all lower case, all upper case, capitalized, or a number,
- interps (55): individual punctuation marks,
- qubs (226): set of all particle adverbs,
- 3letters (276): first and last three letters of the word form; this feature is derived from the morphological guesser in Concraft, in which prefixes and suffixes are generated,
- separator (2): information about a space before the token.
Table 2 presents features generated for the word obrazki.
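The word-form features from the list above (shape, cases and 3letters) are simple enough to sketch directly. The code below is an illustration rather than the KRNNT implementation: the function names and the label strings are hypothetical, and the fifth casing value ("other") is an assumption, since the list names only four of the five values.

```python
def shape(form):
    """Map each character to u (upper), l (lower), d (digit) or x (other)."""
    symbols = []
    for ch in form:
        if ch.isupper():
            symbols.append("u")
        elif ch.islower():
            symbols.append("l")
        elif ch.isdigit():
            symbols.append("d")
        else:
            symbols.append("x")
    return "".join(symbols)


def collapse(s):
    """Merge runs of identical symbols, e.g. ullllldddd -> uld."""
    return "".join(ch for i, ch in enumerate(s) if i == 0 or ch != s[i - 1])


def casing(form):
    """Classify the casing of a word form ("other" is an assumed fifth value)."""
    if form.isdigit():
        return "number"
    if form.islower():
        return "lower"
    if form.isupper():
        return "upper"
    if form[:1].isupper() and form[1:].islower():
        return "capitalized"
    return "other"


def three_letters(form):
    """First and last three letters of the word form."""
    return form[:3], form[-3:]


print(shape("Wrobel2017"))            # ullllldddd
print(collapse(shape("Wrobel2017")))  # uld
print(casing("obrazki"))              # lower
print(three_letters("obrazki"))       # ('obr', 'zki')
```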
4.7 Output Classes
To reduce the number of outputs, WCRFT classifies each grammatical class and category separately, while Concraft divides tags into two sets (the same as in the tags4 feature).
In contrast to other Polish taggers, KRNNT has undivided tags on output. A drawback of this approach is that tags not occurring in the training data cannot be predicted, even if the morphological analyzer has information about possible correct tags.
4.8 Training
10% of the training data is used as a validation set for early stopping. The early stopping criterion is checked after every 10,000 processed sentences, with a patience of 10. The maximal number of epochs is 150. The loss (objective) function is categorical cross-entropy. The last layer has a softmax activation function.
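The early stopping rule can be illustrated with a minimal sketch. It assumes that the monitored quantity is the validation loss and that each list element is the result of one check (in the setting above, one check per 10,000 sentences); the function name is hypothetical.

```python
def early_stopping(validation_losses, patience=10):
    """Scan validation losses, one per check, and stop after `patience`
    consecutive checks without improvement.

    Returns (index of the best check, number of checks performed).
    """
    best_loss = float("inf")
    best_index = -1
    checks_without_improvement = 0
    for index, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_index = loss, index
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return best_index, index + 1
    return best_index, len(validation_losses)


losses = [0.9, 0.7, 0.6, 0.65, 0.61, 0.62, 0.63]
print(early_stopping(losses, patience=3))  # (2, 6)
```

In this example the best loss is reached at the third check; three further checks bring no improvement, so training stops after six checks and the last value is never evaluated.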
Training is performed using the Nadam optimizer, because it has been shown to be effective for recurrent neural networks [4, 21]. It is a combination of two algorithms: RMSProp and Nesterov momentum.
Training was performed on an NVIDIA Tesla K40 XL GPU and took about 3 h (on NCP).
5 Evaluation
Many experiments testing different sets of features and neural network architectures were conducted (over 40,000 GPU hours). Taggers were assessed in terms of accuracy and tagging speed. The National Corpus of Polish with 10-fold cross-validation was chosen as the training corpus. Sentences from one paragraph are always in the same fold. Sentences incorrectly segmented by Maca (3.41% of NCP) are skipped during training (for simplicity).
Evaluation is performed with the whole pipeline including segmentation, morphological analysis and morphological disambiguation as proposed in [18].
The main metric is the accuracy lower bound (\(Acc_{lower}\)). It penalizes all segmentation errors and is calculated as the percentage of all reference tokens for which the tagger produced the same segmentation and the correct tag. An additional metric, the accuracy upper bound (\(Acc_{upper}\)), treats tokens with segmentation errors as correctly tagged. It shows the potential accuracy for a perfect tokenizer.
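Both bounds can be computed from aligned token spans, as in the sketch below. It assumes that reference and tagger tokens are given as (start, end, tag) triples with character offsets, which is a representation chosen for this example; the function name and the toy tags are hypothetical.

```python
def accuracy_bounds(gold, predicted):
    """Compute (Acc_lower, Acc_upper) as fractions of the reference tokens.

    gold, predicted: lists of (start, end, tag) with character offsets.
    """
    predicted_tags = {(start, end): tag for start, end, tag in predicted}
    correct = 0
    segmentation_errors = 0
    for start, end, tag in gold:
        if (start, end) not in predicted_tags:
            segmentation_errors += 1   # the tagger segmented this span differently
        elif predicted_tags[(start, end)] == tag:
            correct += 1
    total = len(gold)
    return correct / total, (correct + segmentation_errors) / total


gold = [(0, 2, "A"), (3, 6, "B"), (6, 7, "C"), (8, 12, "D")]
# The tagger merges the second and third token and mis-tags the last one.
predicted = [(0, 2, "A"), (3, 7, "B"), (8, 12, "X")]
print(accuracy_bounds(gold, predicted))  # (0.25, 0.75)
```

The gap between the two values corresponds to the share of reference tokens affected by segmentation errors.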
Two additional metrics are also provided: accuracy lower bound for known (\(Acc_{lower}^K\)) and unknown words (\(Acc_{lower}^U\)).
Comparison of results for each tagger is presented in Table 3. Scores for OpenNLP, Pantera, WMBT, and WCRFT originate from [11]. Evaluations for all taggers were performed on NCP and with 10-fold cross-validation scheme.
KRNNT significantly surpasses the scores of other taggers: the error is reduced by 28% in comparison to Concraft. The accuracy of tagging unknown words is 7 percentage points higher than in OpenNLP. A simple voting strategy over 10 models trained on the same data, but with different random initializations, increases the accuracy lower bound to 94.30% (tested without cross-validation).
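The exact voting scheme is not specified here; the sketch below assumes plain plurality voting per token, with ties resolved in favor of the tag proposed by the earliest model. The function name and the example tags are hypothetical.

```python
from collections import Counter


def vote(predictions):
    """Combine tag sequences from several models by plurality voting.

    predictions: one tag sequence per model, all of equal length.
    Ties go to the tag that appears first in model order.
    """
    voted = []
    for tags_at_position in zip(*predictions):
        voted.append(Counter(tags_at_position).most_common(1)[0][0])
    return voted


model_1 = ["subst:sg:nom:f", "fin:sg:ter:imperf"]
model_2 = ["subst:sg:nom:f", "fin:sg:ter:perf"]
model_3 = ["subst:sg:acc:f", "fin:sg:ter:imperf"]
print(vote([model_1, model_2, model_3]))
# ['subst:sg:nom:f', 'fin:sg:ter:imperf']
```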
For 0.55% of tokens, the predicted tag was not in the set returned by the morphological analyzer. In 78.89% of these cases, the tag assigned by KRNNT is nevertheless correct.
NCP has 2.81% unknown words according to Maca.
Kobyliński and Kieraś [11] developed an ensemble of the first 5 taggers from Table 3. The best voting scheme achieves an accuracy lower bound slightly above 92%.
Figure 3 shows the accuracy lower bound as a function of the number of paragraphs used to train the tagger. More training data is needed to determine whether the classifier is saturated.
Table 4 shows the time in seconds needed to tag the whole NCP, including tagger startup. Because Concraft does not use multiple cores, NCP was manually split across separate Concraft processes. KRNNT takes longer to run than Concraft. Analysis showed that the most time-consuming steps are generating the features, running Maca, and parsing its output. Distributing these tasks across additional cores decreases tagging time. Values in parentheses give the percentage of total time that the GPU spends waiting for data. Implementing the tagger in a statically typed language should improve performance.
Processing sentences sorted by the number of words improves tagging time by 22%–37%, because computations on the GPU are performed in batches and all sentences need to be padded to the longest sentence in the batch. For sorted sentences, padding is minimal.
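The effect of sorting on padding can be seen in a small sketch that counts padding positions per batch. The sentence lengths and the batch size are made-up values for illustration, and the function name is hypothetical.

```python
def padded_positions(lengths, batch_size):
    """Total number of padding positions when every batch is padded
    to the length of its longest sentence."""
    padding = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        padding += sum(max(batch) - n for n in batch)
    return padding


lengths = [3, 40, 5, 38, 4, 41, 6, 39]  # sentence lengths in tokens
print(padded_positions(lengths, batch_size=4))          # 148
print(padded_positions(sorted(lengths), batch_size=4))  # 12
```

With mixed lengths, short sentences are padded up to the long ones in the same batch; after sorting, each batch contains sentences of similar length and almost no computation is wasted on padding.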
Figure 4 shows the accuracy lower bound for training and test data as a function of the number of epochs. The neural network cannot fully memorize the training data (98.3%). Most likely this is caused by an insufficient representation of the input (the word form is not a feature). The accuracy on test data stops rising after about 100 epochs and the model does not overfit.
Manual analysis of 100 tagger errors showed that 6% stem from errors in the manual annotation of NCP. For example, in to były ostatnie słowa, jakie wypowiedział (these were the last words that he had spoken), the word jakie (that) is manually annotated as nominative, while KRNNT tags it as accusative, which is correct. 9% of errors could only be avoided by analyzing the whole paragraph. For example, in Na baliku [...] bawiło się około 100 dzieci. Znaleźli się wśród nich (About 100 children played at the ball. There were also among them), the gender of the word nich (them) depends on its reference to dzieci (children). Dependencies longer than 5 words occur in 11% of errors. Including semantic and valency information could potentially reduce errors by 15%. For example, in Władze miasta [...] szukają inwestora (City authorities are looking for an investor), the verb szukać (look for) takes a genitive object in this context, whereas KRNNT assigns the accusative case.
6 PolEval: POS Tagging
PolEval is a Polish counterpart of SemEval, a contest for natural language processing tools. KRNNT participated in the morphosyntactic tagging task [12]. It involves 3 subtasks: morphosyntactic disambiguation and guessing (subtask A), lemmatization (subtask B), and POS tagging (subtask C). Subtasks A and B are tested on gold segmented data, so systems do not need to perform tokenization and morphological analysis. Subtask C is tested on raw text and requires the whole text processing pipeline. The training data is NCP. For subtasks A and B, the model was trained without morphological reanalysis, so segmentation errors do not occur. The organizers prepared a separate test corpus: they annotated over 1626 sentences for subtasks A and B and 1675 sentences for subtask C. The average number of tokens per sentence is around 16.5, which is more than in NCP.
The best submission for subtask A (Table 5) also used a bidirectional neural network. It differs from KRNNT mainly in using word embeddings and in splitting the output into separate grammatical classes and categories. KRNNT was placed second in the ranking.
Despite its simple lemmatization module, KRNNT won subtask B (Table 6). NeuroParser lemmatizes unknown words with accuracy higher by 4 percentage points, so there is room for improvement.
Subtask C was also won by KRNNT (Table 7). However, lemmatization performed by NeuroParser was again better. Third place was taken by MorphoDiTaPL [22], which is based on a framework that achieves state-of-the-art results for Czech.
6.1 PolEval 2020: Morphosyntactic Tagging of Middle, New and Modern Polish
PolEval 2020 hosted a shared task, Morphosyntactic tagging of Middle, New and Modern Polish. The task organizers shared training and development data analyzed by Morfeusz adapted to historical language. Additionally, each paragraph is annotated with its year of creation. Unlike in task A of PolEval 2017, the tokenization is ambiguous. A simple solution (used in Maca) is to calculate the shortest path in a directed acyclic graph representing the possible tokenizations. The lack of sentence boundaries may cause some difficulties.
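The shortest path heuristic can be sketched as a single pass over the edges of the tokenization graph. The sketch assumes that nodes are integers, that every edge (from_node, to_node, form) points from a lower to a higher node, and that the path length is the number of tokens; the function name and the example graph are hypothetical and do not reproduce the Maca implementation.

```python
def shortest_tokenization(edges, start, end):
    """Return the token forms on a path from start to end with the fewest edges.

    edges: list of (from_node, to_node, form) with from_node < to_node.
    Returns None if end is not reachable from start.
    """
    best = {start: (0, None)}  # node -> (path length, incoming edge)
    # Edges sorted by source node are visited in topological order,
    # so best[src] is final by the time its outgoing edges are processed.
    for edge in sorted(edges, key=lambda e: e[0]):
        src, dst, _ = edge
        if src not in best:
            continue  # src is not reachable from start
        candidate = best[src][0] + 1
        if dst not in best or candidate < best[dst][0]:
            best[dst] = (candidate, edge)
    if end not in best:
        return None
    forms, node = [], end
    while node != start:
        src, _, form = best[node][1]
        forms.append(form)
        node = src
    return forms[::-1]


# Illustrative graph: the first word can be read as one token or as two.
edges = [(0, 1, "Miał"), (1, 2, "em"), (0, 2, "Miałem"), (2, 3, "psa")]
print(shortest_tokenization(edges, 0, 3))  # ['Miałem', 'psa']
```

The example also shows the limitation of the heuristic: the path with fewer tokens is always preferred, regardless of which reading is correct in context.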
The training and development data consist of 10755 and 244 paragraphs, respectively. 2% of tokens in development data are unknown to morphological analysis.
KRNNT was trained for 150 epochs using only the training data; the development data was not used for validation. The information about the year of the text is omitted. As the data do not contain sentence boundaries, training is performed on whole paragraphs. This approach should increase accuracy, because information from neighboring sentences may be used to make better predictions. The model was tested on text tokenized with the shortest path method.
Evaluation results are presented in Table 8. Higher results can be expected from more recent models, e.g. those using contextual embeddings, and from utilizing the year of the analyzed text.
7 Conclusion
This work presented KRNNT, a morphosyntactic tagger for Polish. It achieves better accuracy than other publicly available taggers.
PolEval showed that better results can be achieved using bidirectional neural networks.
Lemmatization of unknown words may be improved by a separate neural network using a sequence-to-sequence architecture [3]. A bigger dictionary of named entities should also improve the results.
Although tags have an internal structure, in this work each full tag is treated as a separate class. Therefore the system cannot generalize well: for example, it cannot learn the rule “verb must be in every sentence” and has to learn “one of X tags describing verb must be in sentence” instead. Tags should be partitioned on output, or more fine-grained outputs should be added.
Researchers should also focus on word embeddings for morphologically rich languages. Including them in a tagger should improve accuracy. Moreover, representing tags as embeddings might be beneficial, because embeddings can represent dependencies among tags [5].
RNNs make local decisions for each token; adding a CRF or a hidden Markov model as the last layer would allow labels to be assigned after seeing the whole word sequence [10].
Including information from the whole paragraph is essential for some ambiguities in tagging.
KRNNT is an open-source system and is available as a Docker container. The new model has been trained on NCP and PolEval 2017 data.
References
[1] Acedański, S.: A morphosyntactic Brill tagger for inflectional languages. In: Loftsson, H., Rögnvaldsson, E., Helgadóttir, S. (eds.) NLP 2010. LNCS (LNAI), vol. 6233, pp. 3–14. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14770-8_3
[2] Brill, E.: A simple rule-based part of speech tagger. In: Proceedings of the Workshop on Speech and Natural Language, pp. 112–116. Association for Computational Linguistics (1992)
[3] Cho, K., van Merrienboer, B., Gülçehre, Ç., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078 (2014). http://arxiv.org/abs/1406.1078
[4] Dozat, T.: Incorporating Nesterov Momentum into Adam (2016)
[5] Goldberg, Y.: A primer on neural network models for natural language processing. J. Artif. Intell. Res. (JAIR) 57, 345–420 (2016)
[6] Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: Proceedings of the 31st International Conference on Machine Learning, ICML-14, pp. 1764–1772 (2014)
[7] Graves, A., Mohamed, A., Hinton, G.: Speech recognition with deep recurrent neural networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649. IEEE (2013)
[8] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
[9] Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. CoRR abs/1508.01991 (2015). http://arxiv.org/abs/1508.01991
[10] Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015)
[11] Kobyliński, Ł., Kieraś, W.: Part of speech tagging for Polish: state of the art and future perspectives. In: Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2016, Konya, Turkey (2016)
[12] Kobyliński, Ł., Ogrodniczuk, M.: Results of the PolEval 2017 competition: part-of-speech tagging shared task. In: Proceedings of the 8th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics. Wydawnictwo Poznańskie i Fundacja Uniwersytetu im. A. Mickiewicza, Poznań, Poland (2017)
[13] Kuta, M., Chrzaszcz, P., Kitowski, J.: A case study of algorithms for morphosyntactic tagging of Polish language. Comput. Inform. 26(6), 627–647 (2012)
[14] Lafferty, J., McCallum, A., Pereira, F.C.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data (2001)
[15] Piasecki, M.: Polish tagger TaKIPI: rule based construction and optimisation. Task Q. 11(1–2), 151–167 (2007)
[16] Pohl, A., Ziółko, B.: Using part of speech n-grams for improving automatic speech recognition of Polish. In: Perner, P. (ed.) MLDM 2013. LNCS (LNAI), vol. 7988, pp. 492–504. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39712-7_38
[17] Radziszewski, A.: A tiered CRF tagger for Polish. In: Bembenik, R., Skonieczny, L., Rybinski, H., Kryszkiewicz, M., Niezgodka, M. (eds.) Intelligent Tools for Building a Scientific Information Platform. Studies in Computational Intelligence, vol. 467. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-35647-6_16
[18] Radziszewski, A., Acedański, S.: Taggers gonna tag: an argument against evaluating disambiguation capacities of morphosyntactic taggers. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2012. LNCS (LNAI), vol. 7499, pp. 81–87. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32790-2_9
[19] Radziszewski, A., Śniatowski, T.: Maca – a configurable tool to integrate Polish morphological data. In: Proceedings of the 2nd International Workshop on Free/Open-Source Rule-Based Machine Translation (2011)
[20] Radziszewski, A., Śniatowski, T.: A memory-based tagger for Polish. In: Proceedings of the 5th Language & Technology Conference, Poznań, pp. 29–36 (2011)
[21] Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: International Conference on Machine Learning, pp. 1139–1147 (2013)
[22] Walentynowicz, W.: MorphoDiTa-based tagger for Polish language (2017). http://hdl.handle.net/11321/425. CLARIN-PL digital repository
[23] Wang, D., Nyberg, E.: A long short-term memory model for answer sentence selection in question answering. In: ACL, vol. 2, pp. 707–712 (2015)
[24] Waszczuk, J.: Harnessing the CRF complexity with domain-specific constraints. The case of morphosyntactic tagging of a highly inflected language. In: COLING, pp. 2789–2804 (2012)
[25] Wróbel, K.: KRNNT: Polish recurrent neural network tagger. In: Vetulani, Z., Paroubek, P. (eds.) Proceedings of the 8th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 386–391. Fundacja Uniwersytetu im. Adama Mickiewicza w Poznaniu (2017)
Acknowledgments
This research was supported in part by PLGrid Infrastructure.
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Wróbel, K. (2020). KRNNT: Polish Recurrent Neural Network Tagger Extended. In: Vetulani, Z., Paroubek, P., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2017. Lecture Notes in Computer Science(), vol 12598. Springer, Cham. https://doi.org/10.1007/978-3-030-66527-2_8
DOI: https://doi.org/10.1007/978-3-030-66527-2_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66526-5
Online ISBN: 978-3-030-66527-2