Abstract
This article presents the first one-million-word corpus for the Belarusian NooJ module. The corpus has been built from texts grouped into sections by subject category. From the broad range of possible subject categories, the corpus focuses on fiction and on historical, medical, scientific, and sociological literature, among others. Given the number of subject categories covered, it can be considered the first subject-organized collection of texts for the Belarusian NooJ module.
The text corpus is expected to be suitable for research in the following areas: processing of word polysemy across texts of various kinds, processing of polysemous punctuation marks, and the search for new lexical items.
The first one-million corpus for the Belarusian NooJ module is thus applicable in many fields of linguistic research.
Keywords
- Corpora
- Belarusian NooJ-module
- Statistical analysis
- Part-of-Speech tagging
- Machine-learning algorithms
- Levenshtein algorithm
- Machine-learning model
- Countercheck
- Spelling errors
- Concatenation-in-paradigm
- Unknown words search
- Known words search
- Clustering
- Belarusian N-Corpus
- Text processing
1 Introduction
The subject of this paper, the first one-million corpus for the Belarusian NooJ module, is an important, integral part of future research in the field of speech recognition and synthesis, and a significant step towards new investigations of the Belarusian language within NooJ and linguistics in general.
The purpose of this paper is to describe the design and creation stages of the corpus, as well as the stages of its in-depth analysis and practical application, in the context of different aspects and approaches.
In addition, the corpus will be applicable to fundamental tasks such as optimizing and expanding the development of high-quality linguistic algorithms for electronic text pre-processing in a TTS (Text-to-Speech) system.
Two Belarusian corpora were developed for NooJ [1]: the 1-VERSION corpus (1 million corpus.noc) and the MAIN corpus (First 1MLN Corpus for the Belarusian NooJ Module.noc). To make corpus creation more productive, a special descriptive algorithm was worked out.
2 Descriptive Algorithm for the First One-Million Corpus for the Belarusian NooJ Module
The main work on corpus compilation and analysis with this algorithm was carried out on the 1-VERSION corpus (Table 1).
According to this algorithm, the 1-VERSION corpus was built from 338 unarranged text units, and the MAIN corpus from 1,570 text units grouped into sections by subject category (see Fig. 15 in the Appendix). From the broad range of possible subject categories, the MAIN corpus focuses on fiction and on historical, medical, scientific, and sociological literature.
3 The Dictionary of Naturalized Lexical and Grammatical Information for the Whole List of Unknown Unique Words (File UNKNOWNS.dic)
3.1 “Purity” Check of the Corpus ‘1 million corpus.noc’
To obtain better results in creating the Dictionary of Naturalized Lexical and Grammatical Information, a more extended “purity” check of the corpus ‘1 million corpus.noc’ is necessary. Using the Levenshtein algorithm [2], the corpus was searched for wordforms with a high level of similarity (0.8) to one another. Comparing the resulting dictionaries of known (about 150,000) and unknown (about 50,000) wordforms, the authors found that almost 30 % of the “unknown” wordforms are similar enough to known ones that they should in fact count as known, even though the NooJ program had not recognized them.
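The similarity search can be sketched as follows. This is a minimal illustration rather than the authors' implementation, and the word lists are invented stand-ins for the dictionaries exported from NooJ; the threshold of 0.8 matches the one used in the paper.

```python
# Normalized Levenshtein similarity used to flag "unknown" wordforms that
# are in fact close variants of known ones (e.g. Latin letters typed
# inside Cyrillic words).

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 for completely different ones."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def probably_known(unknown, known, threshold=0.8):
    """Pair each 'unknown' form with its closest known form above the threshold."""
    matches = {}
    for u in unknown:
        best = max(known, key=lambda k: similarity(u, k))
        if similarity(u, best) >= threshold:
            matches[u] = best
    return matches

known = ["мова", "словы", "тэкст"]
unknown = ["слoвы", "тэкcт", "qwerty"]   # first two contain a Latin letter
print(probably_known(unknown, known))    # → {'слoвы': 'словы', 'тэкcт': 'тэкст'}
```

With a 0.8 threshold, single-character deviations in words of five letters or more are caught, which covers the mixed-alphabet and typo cases discussed below.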
Below are the general problem areas for the “purity” check of the whole text corpus, which can be solved rather effectively using the abovementioned Levenshtein algorithm:

1. the occurrence of Latin letters in many words in the texts of the Belarusian corpus;
2. dialectal words of the Belarusian language;
3. Russian words;
4. orthographic mistakes;
5. different letter case processing.
These issues can also be solved within the NooJ program itself, though this takes far more time and effort, because the program has to process a large amount of information.
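Two of the listed problem points, Latin letters inside Cyrillic words and letter-case handling, can be checked mechanically. The sketch below uses Unicode character names to detect mixed scripts; it is an illustration, not part of the described pipeline.

```python
# Detecting Latin letters embedded in Cyrillic wordforms, and normalizing
# case before dictionary lookup. Script detection relies on the standard
# Unicode character names (e.g. "CYRILLIC SMALL LETTER TE").

import unicodedata

def scripts(word):
    """Set of scripts ('LATIN', 'CYRILLIC', ...) used by a word's letters."""
    return {unicodedata.name(ch).split()[0] for ch in word if ch.isalpha()}

def mixed_script(word):
    """True when Latin and Cyrillic letters occur in the same word."""
    return {"LATIN", "CYRILLIC"} <= scripts(word)

print(mixed_script("тэкcт"))   # Latin 'c' inside a Cyrillic word -> True
print(mixed_script("тэкст"))   # pure Cyrillic -> False
print("Мова".lower())          # case folding before lookup -> 'мова'
```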
3.2 Statistical Analysis of the Text Corpus ‘1 million corpus.noc’
The following steps were taken at this stage:

1. Linguistic analysis of the corpus ‘1 million corpus.noc’ (see Fig. 16 in the Appendix).
2. Search for wordforms (all wordforms present in the corpus) using special queries (<WF>, <UNK>, <DIC>, <NOUN>, <VERB>, <ADJECTIVE>, <ADVERB>).
3. Export of the matches into text files (*.txt).
4. Storage of the exported text files in a special database, where unique wordforms were clustered and their occurrences counted.
3.3 Application of Machine-Learning Algorithms to the Part-of-Speech Tagging [3] of Unknown Wordforms
1. The main attribute of a wordform for the Part-of-Speech tagging process, as implemented with the machine-learning algorithms chosen for this purpose, was the three final letters of each wordform in the dictionary of unknown wordforms (see Fig. 1).
2. The following algorithms were applied:
   - Decision Tree;
   - Clustering;
   - Neural Network.
3. First, the dictionary [5, 6] of known wordforms was loaded into the system (Note 1). This dictionary was used to “train” the above-named algorithms on all possible word paradigms; 30 % of the known wordforms were taken for training (see Fig. 2). The remaining 70 % of the data were then run through the derived machine-learning model to verify its accuracy, which was estimated as rather high.
4. After that, the existing dictionary of unknown wordforms (UNKNOWNS.dic) was passed through the machine-learning model. The results showed a rather high rate of correct assignment of unknown wordforms to parts of speech. Even at this elementary level (with the three final letters of each wordform as the only attribute), this confirms the effectiveness of the machine-learning model (see Figs. 3, 4 and 5).
5. Table 2 shows the possible variants of the data-check results produced by the model (using VERB as an example; other parts of speech are treated as OTHER).
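A minimal stand-in for the suffix-based tagging described above can be sketched as follows. The actual work used Decision Tree, Clustering, and Neural Network algorithms; here the model simply memorizes the most frequent part of speech per three-letter suffix, which uses the same feature. The tiny training set is illustrative, not the real dictionary of known wordforms.

```python
# Suffix-based POS tagging sketch: each wordform is reduced to its last
# three letters, and the trained model maps each suffix to the part of
# speech most often seen with it. Unseen suffixes fall back to OTHER,
# mirroring the VERB-vs-OTHER setup of Table 2.

from collections import Counter, defaultdict

def suffix(word, n=3):
    return word[-n:]

def train(tagged_words):
    """Map each suffix to its most frequent tag in the training data."""
    by_suffix = defaultdict(Counter)
    for word, tag in tagged_words:
        by_suffix[suffix(word)][tag] += 1
    return {s: c.most_common(1)[0][0] for s, c in by_suffix.items()}

def tag(model, word, default="OTHER"):
    return model.get(suffix(word), default)

# Illustrative "known wordforms" with their parts of speech.
known = [("рабіць", "VERB"), ("хадзіць", "VERB"),
         ("кніга", "NOUN"), ("дарога", "NOUN")]
model = train(known)
print(tag(model, "насіць"))     # suffix 'іць' seen with VERB -> VERB
print(tag(model, "перамога"))   # suffix 'ога' seen with NOUN -> NOUN
```

In the paper's setup the 30 %/70 % split would be applied to the known-wordform dictionary before training and verification.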
4 Part-of-Speech Tagging Countercheck
The Part-of-Speech tagging countercheck on unknown words was carried out with the help of the Levenshtein algorithm (on the basis of the file UNKNOWNS.dic).
A further task was to work out a dictionary of unknown-word usage. The developers’ assignment is to reduce the size of this dictionary as much as possible and to determine the values of unknown words for their subsequent correction and inclusion in the dictionary of the one-million Belarusian corpus (see Fig. 6).
The main feature of the algorithm applied to the Belarusian one-million corpus is that it does not change the words in the texts after editing, but lets users see comments on the various mistakes made in the texts incorporated in the corpus.
The words included in the dictionary were classified into groups according to the reason they were unknown:
- words written in the Latin alphabet or containing some Latin letters (a bavyazkov, atrgml_vayutsets, an akhoplepa);
- words written in tarashkevitsa, i.e. a non-standard spelling which is nevertheless used by a rather large number of people, especially Internet users; this alternative spelling exists for historical reasons (aбapaнaздoльнacьцi, aбвeшчaньня, aбвяcьцiлi);
- words with spelling errors (aбcлўгoўвaнню, aбяцaюдь);
- words with recognition errors after scanning (maгiлёўcкaгa, vcтaгoддзeм);
- words from foreign languages (perfekt, deutsche, eine);
- proper nouns (Дзятлaвa, Aнaтoлiя), etc.
The main objective at the stage of unknown-word recognition is the definition of their morphological characteristics, i.e. the assignment of part-of-speech values to 49,749 wordforms. The Levenshtein algorithm revealed the parts of speech of unknown words, picked a possible correct form, and also gave an index of the probability that the form is correct (see Figs. 7 and 8).
The stage of manual editing is carried out after the computer-assisted part-of-speech definition, since the algorithm cannot always correctly identify parts of speech on formal grounds alone. The procedure is simple: all part-of-speech assignments are checked by linguistic experts. If the algorithm’s part-of-speech definition is correct, the corresponding line of the table is marked “true” (1); otherwise, “false” (0) (see Figs. 9 and 10).
If the part of speech of a given wordform does not correspond to reality, the editing stage follows, namely the indication of the correct part of speech. If the wordform has no deviations (no spelling errors or unclear symbols, and is not a foreign-language word), its morphological features are simply defined. If a word is misspelled, the correct part of speech is indicated and the label “mistake in spelling” is added. The same applies to words written in tarashkevitsa, only with another label: “Tarashkevitsa”.
The parts of speech marked with the “NULL” category are mainly proper names and are therefore handled by the algorithm described above: indication of the correct part of speech and assignment of the “true” value to the line (see Fig. 11).
If the meaning of a word is unclear or doubtful, it is necessary to consult its context, namely the corpus.
At the end of this stage the number of unknown words had decreased, which made it possible to move on to the following stages of improving the first Belarusian one-million corpus (see Fig. 12).
Based on the resulting data, a special Concatenation-in-Paradigm list was compiled after the countercheck of the unknown words recognized by the Levenshtein algorithm (previously exported from the NooJ dictionary file UNKNOWNS.dic), in order to create the additional NooJ dictionary general_be.nod (see Fig. 13).
This calculation makes it possible to plan future work with this technique and to estimate its overall performance in comparison with other techniques.
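The countercheck output described here, a proposed correct form together with an index of probability, can be sketched as follows. difflib's similarity ratio stands in for the normalized Levenshtein score, and the two-entry dictionary is invented for illustration.

```python
# For an unknown wordform, propose the closest known dictionary form,
# inherit its part of speech, and report a similarity ratio as the
# "index of probability of the correct form".

import difflib

# Hypothetical known-wordform dictionary with POS values.
known = {"абслугоўванню": "NOUN", "абяцаюць": "VERB"}

def suggest(word):
    """Return (closest known form, its POS, similarity index)."""
    match = difflib.get_close_matches(word, known, n=1, cutoff=0.6)
    if not match:
        return (word, None, 0.0)
    best = match[0]
    ratio = difflib.SequenceMatcher(None, word, best).ratio()
    return (best, known[best], round(ratio, 2))

print(suggest("абяцаюдь"))   # misspelling of 'абяцаюць' -> high index
```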
5 Comparison of Lexical and Grammatical Base of the Belarusian N-Corpus [6] with Dictionary Properties’ Definition File of the Belarusian NooJ Module
In a similar manner, the lexical and grammatical base of the Belarusian N-corpus was compared with the dictionary properties’ definition file of the Belarusian NooJ module. The Belarusian N-corpus is the first widely available general corpus of Belarusian. It currently contains about 50,000 texts (about 30,000,000 tokens) taken from fiction, newspapers, journals, and online editions. The texts of the corpus are grammatically annotated and contain metatextual information.
The comparative analysis was performed on the morphological characteristics of the different parts of speech listed in the dictionaries of both resources. After analysing the structure of both the Belarusian N-corpus and the Belarusian NooJ module, it can be concluded that both have quite well-developed systems of part-of-speech characteristics, but some categories still need improvement, as was found in the process of comparing the lexical and grammatical bases. The comparison of the morphological characteristics of the Verb is presented in Fig. 14.
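Such a comparison amounts to set operations over the two feature inventories. The feature names below are placeholders, not the actual categories from Fig. 14.

```python
# Diffing two morphological-category inventories for the Verb: shared
# categories, and categories present in only one of the two resources.

nooj_verb = {"Tense", "Person", "Number", "Aspect"}       # placeholder set
ncorpus_verb = {"Tense", "Person", "Number", "Mood"}      # placeholder set

print(sorted(nooj_verb & ncorpus_verb))   # shared categories
print(sorted(nooj_verb - ncorpus_verb))   # only in the NooJ module
print(sorted(ncorpus_verb - nooj_verb))   # only in the N-corpus
```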
6 Conclusion
In conclusion, the first one-million corpus for the Belarusian NooJ module is suitable for research in the following areas:

1. processing of word polysemy in texts of different subjects;
2. processing of polysemous punctuation marks;
3. search for new lexical items.
Besides, the one-million corpus is valuable for solving other important tasks:

- conducting experiments to determine how efficiently syntactic and morphological grammars can be applied to the texts of each subject in the corpus, at both the minimum and the maximum level;
- taking thorough measures to create a subject-domain generator (which will then be very useful for building special subject-oriented NooJ dictionaries);
- using the corpus as extensively as possible in Text-to-Speech synthesis with the available programs [7] required for that process, and also when testing newly created applications;
- carrying out a comparative analysis of this corpus with similar corpora in other languages (taking into account all the necessary rules, the language features of the texts in each corpus, and the various issues that may emerge while building syntactic and morphological grammars, etc.).
Thus, the first one-million corpus for the Belarusian NooJ module has practical application in many lines of linguistic research. In the near future the corpus is planned to be expanded to approximately 5–10 million words.
Notes
1. The Part-of-Speech tagging process can be realized not only in one particular NLP system but also in many others (including integrated interactive systems) in which the three algorithms mentioned above can be applied.
References
NooJ: a linguistic development environment [Electronic resource] (2015). http://www.NooJ4nlp.net/. Accessed 08 May 2015
The Levenshtein-Algorithm [Electronic resource] (2015). http://www.levenshtein.net/. Accessed 24 Sept 2015
Taylor, P.: Text-to-Speech synthesis. In: Taylor, P. (ed.) Text Decoding, pp. 89–92. Cambridge University Press, Cambridge (2009). Chapter 5
Hetsevich, Y., Hetsevich, S.: Overview of Belarusian and Russian dictionaries and their adaptation for NooJ. In: Vučković, K., Božo, B., Max, S. (eds.) Automatic Processing of Various Levels of Linguistic Phenomena: Selected Papers from the NooJ 2011 International Conference, pp. 29–40. Cambridge Scholars Publishing, Newcastle (2012)
Hetsevich, Y., Hetsevich, S., Lobanov, B., Skopinava, A., Yakubovich, Y.: Accentual expansion of the Belarusian and Russian NooJ dictionaries. In: Donabédian, A., Khurshudian, V., Max, S. (eds.) Formalising Natural Languages with NooJ: Selected Papers from the NooJ 2012 International Conference, pp. 24–36. Cambridge Scholars Publishing, Newcastle (2013)
Automated processing of symbolic expressions in texts for the Belarusian speech synthesis system. Belarusian N-corpus (Бeлapycкi N-кopпyc) [Electronic resource] (2015). http://bnkorpus.info/. Accessed 17 May 2015
Corpus.by [Electronic resource] (2015). http://www.corpus.by/. Accessed 08 May 2015
Acknowledgements
Many thanks to T. Okrut, J. Baradzina, A. Fiodarau for their help in revising the language of this paper.
A Appendix
© 2016 Springer International Publishing Switzerland
Cite this paper
Reentovich, I. et al. (2016). The First One-Million Corpus for the Belarusian NooJ Module. In: Okrut, T., Hetsevich, Y., Silberztein, M., Stanislavenka, H. (eds) Automatic Processing of Natural-Language Electronic Texts with NooJ. NooJ 2015. Communications in Computer and Information Science, vol 607. Springer, Cham. https://doi.org/10.1007/978-3-319-42471-2_1
Print ISBN: 978-3-319-42470-5
Online ISBN: 978-3-319-42471-2