Metalanguage and Knowledgebase for Kazakh Morphology

Yelibayeva, Gaziza; Mukanova, Assel; Sharipbay, Altynbek; Zulkhazhav, Altanbek; Yergesh, Banu; Bekmanova, Gulmira

doi:10.1007/978-3-030-24289-3_51

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11619)

Included in the following conference series:

International Conference on Computational Science and Its Applications

Abstract

The volume of information resources in the Turkic languages is increasing. Processing such resources requires thesauri and corpora built with a single metalanguage (tagging language) and knowledge bases of the relevant subject areas. This article proposes a metalanguage for the morphological concepts of the Turkic languages, using Kazakh as the example, and uses it to create a knowledge base of Kazakh morphology in the Protégé environment. The results were used to develop software applications for semantic search and knowledge extraction, as well as applications for assessing knowledge of Kazakh morphology in an e-learning system.

1 Introduction

Processing natural languages requires a tagging system. Existing metalanguages mainly cover the concepts of the Romance, Germanic, and Slavic language groups. They are not suitable for describing the Turkic languages, many of whose concepts differ from those of these groups. Creating the UniTurk metalanguage for tagging texts in the Turkic languages, including Kazakh, is therefore an urgent task. Such a metalanguage will make it possible to unify tags, make them easier to understand, share common software, and conduct comparative linguistic-statistical studies across the Turkic languages [1,2,3].

The metalanguage is needed to create a common resource with which all the developers of the Turkic electronic corpora could work. Such a resource could serve as a reference system for both developers and users of the Turkic electronic corpora. The most appropriate components of such a resource that meet the required conditions are ontological models of the grammar of the Turkic languages.

Ontologies are used as data sources for many natural language processing applications because they allow complex and diverse information to be processed more efficiently. In managing domain knowledge, ontological models are applied at the structuring stage and are treated as special knowledge bases. An ontology can act as a knowledge base framework, that is, a framework for describing the key concepts of a specific subject area, and can be formally represented as the following triple [4, 5]:

$$ O = \langle X, R, F \rangle \tag{1} $$

where X is a finite set of concepts (terms) of the domain, which is represented by the ontology O; R is a finite set of relationships between the concepts (terms) of a given domain; F is a finite set of interpretation functions defined on concepts or relations of ontology O. The role of the interpretation function can be played by a verbal explanation of a term (annotation), a formula for calculating the meaning of a term, an algorithmic description, and also a definition in the form of a logical formula.
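As a minimal illustration of the triple in Eq. (1), the sketch below stores X, R, and F in a small Python container. The class name `Ontology`, the method `relate`, and the relation label `is_a` are hypothetical names chosen for this example, and F is reduced to verbal annotations only; this is not the Protégé representation used in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class Ontology:
    """A toy container for the triple O = <X, R, F> (illustrative only)."""
    concepts: Set[str] = field(default_factory=set)                    # X
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)  # R
    annotations: Dict[str, str] = field(default_factory=dict)          # F, verbal explanations only

    def relate(self, subject: str, relation: str, obj: str) -> None:
        # Relations may only connect concepts that are already in X.
        for concept in (subject, obj):
            if concept not in self.concepts:
                raise KeyError(f"unknown concept: {concept}")
        self.relations.add((subject, relation, obj))


# Hypothetical usage with three concepts from the morpheme hierarchy.
morphology = Ontology()
morphology.concepts.update({"Morpheme", "Root", "Affix"})
morphology.relate("Root", "is_a", "Morpheme")
morphology.relate("Affix", "is_a", "Morpheme")
morphology.annotations["Root"] = "The main morpheme, which bears the main meaning of the word."
```

Restricting `relate` to known concepts mirrors the requirement that R is defined over the concepts of the same domain.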

Ontological modeling of the morphological concepts of the Kazakh language was carried out in the Protégé environment, which simplifies creating, loading, modifying, and transforming the knowledge base and makes it available for shared viewing and editing [6, 7].

2 Related Works

A metalanguage is designed for tagging natural language texts. By its means it is possible to express everything that is expressible in the object language and to designate all signs and expressions of the object language that have names. In the metalanguage, one can express the properties of object-language expressions and the relations between them, and formulate definitions, notation, and rules of formation and transformation for object-language expressions [8].

Well-known tagging systems include the Penn Treebank tagset [9], the CLAWS tagging system [10], which is used for tagging the British National Corpus [9], and the tagging system of the Brown Corpus in the American National Corpus [11, 12]. These tagging systems are described in more detail in [13, 14]. All of them are used mainly for tagging English.

Currently there are more than ten electronic corpora for Turkic languages, including the Kazakh language corpus [15,16,17], the Tatar language corpus “Tugan tel” [18, 19], the Turkish language corpus [20], the Crimean Tatar language corpus [21], and the Chuvash language corpus [22]. One of the main components of these corpora is the system of morphological, syntactic, and semantic tagging. The morphological tagging systems differ from corpus to corpus.

There are various domain models based on ontology. For example, [23] considers the ontological model of hotel services in order to compare actual reviews with automatically generated ones. Some works [24, 25] present formal models for nouns in the Kazakh language and the use of semantic graphs to describe ontological models. The articles [26, 27] compare the ontological models on the example of the nouns in the Kazakh language with the other Turkic (Kyrgyz and Uzbek) languages, and [28] compares adjectives in the Kazakh and the Turkish languages.

3 Metalanguage for Kazakh Morphology

Table 1 presents the UniTurk tags, using the description of the morphological concepts of the Kazakh language as an example. The first column of the table contains the tag designations, the second the English names of the concepts, and the third the tag names in Kazakh.

Table 1. Tags of the concepts of the Kazakh language morphology.

The text part contains not only the names of the concepts but also their definitions, questions, and examples in Kazakh. The result of the study is a tagging system for the morphological concepts of the Kazakh language (Fig. 1).

4 Knowledgebase for Kazakh Morphology

4.1 Architecture

Morphology is the branch of grammar whose main objects are the words of natural languages, their meaningful parts, and their morphological features. The tasks of morphology therefore include defining the word as a special language object and describing its internal structure [29]. According to this definition, the knowledge base architecture should cover the following concepts: word, morpheme, part of speech, and morphological category.

Word

A word is the central unit of a language.

Morpheme

The minimal meaningful parts of a word are morphemes. The main morpheme is the root, which bears the main meaning of the word; the other morphemes are called affixes (suffix and ending).

1. Morpheme
   1.1 Root
   1.2 Affix
      1.2.1 Suffix
      1.2.2 Ending

Parts of Speech

Parts of speech are classes into which words are distributed according to their grammatical properties. There are nine parts of speech in the Kazakh language:

1. Noun
2. Adjective
3. Numeral
4. Pronoun
5. Verb
6. Adverb
7. Conjunction
8. Interjection
9. Imitative words

Each part of speech has its own properties, and all of these properties are implemented in the ontology.

Morphological Category

A morphological category is a system of affixal variations of a word that expresses a grammatical category of that word class. In Kazakh there are different morphological categories for different parts of speech, for example, for the noun (case, number, person, and possessiveness), the adjective (degree of comparison), and the verb (mood, tense, person, type, and voice).

1. The category of CASE
   1.1 Genitive case
   1.2 Direction-dative case
   1.3 Accusative case
   1.4 Ablative case
   1.5 Locative case
   1.6 Instrumental case
2. The category of NUMBER
3. The category of PERSON
4. The category of POSSESSIVE
5. Degrees of Comparison
6. The category of MOOD
7. The category of TENSE
8. The category of VOICE

All of the above concepts of Kazakh morphology were included in the applied ontology.
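The architecture described in this subsection can be sketched as a nested dictionary with a lookup that returns the path from a top-level concept to any sub-concept. The identifiers below (`TAXONOMY`, `path_to`, and the CamelCase concept names) are hypothetical and chosen for readability; the actual class names of the ontology are those shown in the Protégé figures.

```python
from typing import Dict, Optional, Tuple

# Concept hierarchy taken from the lists in Sect. 4.1; names are illustrative.
TAXONOMY: Dict[str, dict] = {
    "Word": {},
    "Morpheme": {"Root": {}, "Affix": {"Suffix": {}, "Ending": {}}},
    "PartOfSpeech": {
        name: {}
        for name in (
            "Noun", "Adjective", "Numeral", "Pronoun", "Verb",
            "Adverb", "Conjunction", "Interjection", "ImitativeWord",
        )
    },
    "MorphologicalCategory": {
        "Case": {
            name: {}
            for name in (
                "GenitiveCase", "DirectionDativeCase", "AccusativeCase",
                "AblativeCase", "LocativeCase", "InstrumentalCase",
            )
        },
        "Number": {}, "Person": {}, "Possessive": {},
        "DegreesOfComparison": {}, "Mood": {}, "Tense": {}, "Voice": {},
    },
}


def path_to(concept: str, tree: Optional[dict] = None,
            prefix: Tuple[str, ...] = ()) -> Optional[Tuple[str, ...]]:
    """Depth-first search for a concept; returns its path from the top level."""
    tree = TAXONOMY if tree is None else tree
    for name, children in tree.items():
        here = prefix + (name,)
        if name == concept:
            return here
        found = path_to(concept, children, here)
        if found is not None:
            return found
    return None
```

For instance, `path_to("Ending")` walks Morpheme, then Affix, then Ending, which corresponds to items 1, 1.2, and 1.2.2 of the morpheme list.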

4.2 Implementation

The applied ontology “Morphology of the Kazakh language” consists of individuals, properties, and classes, as well as interpretation functions defined on the concepts or relations of the ontology.

The ontology was implemented in the Protégé environment in accordance with the architecture described above. This environment represents knowledge in the form of classes, individuals belonging to classes, and properties between them. All classes of morphological concepts of the Kazakh language are displayed in Fig. 2.

After the classes were created, their properties were defined. Object properties define a relationship between two objects (classes or individuals). For example, the concept “Adj” (adjective) is characterized by the following properties (Fig. 3): its semantic type can be either “qualitative adjective” (Qual) or “relative adjective” (Rel), and the meaning of the adjective changes accordingly. We therefore introduce the property “hasSemanticType” (has a semantic type) and describe it with an existential restriction (a restriction by the existential quantifier):

$$ \exists R.(A \cup B) \tag{2} $$

where $R = hasSemanticType$, $A = Qual$, $B = Rel$.

One can also specify that the adjective has the morphological category of degrees of comparison:

$$ \exists R.A \tag{3} $$

where $R = hasMorphCategory$, $A = DegComp$.
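The existential restrictions in Eqs. (2) and (3) can be illustrated with a small closed-world check in Python. This is only a sketch: the function `some_values_from`, the individuals `t1` and `c1`, and the example adjective жақсы (“good”) are assumptions made for illustration, and an OWL reasoner in Protégé works under the open-world assumption rather than by direct lookup as done here.

```python
from typing import Dict, Set, Tuple

Assertion = Tuple[str, str, str]  # (subject, property, target)


def some_values_from(individual: str, prop: str, fillers: Set[str],
                     assertions: Set[Assertion],
                     types: Dict[str, Set[str]]) -> bool:
    """Closed-world test of an existential restriction (exists prop.(A or B or ...)).

    True if the individual has at least one `prop` successor whose
    asserted types include any of the filler classes.
    """
    return any(
        types.get(target, set()) & fillers
        for subject, p, target in assertions
        if subject == individual and p == prop
    )


# Hypothetical facts about one adjective individual.
assertions = {
    ("жақсы", "hasSemanticType", "t1"),
    ("жақсы", "hasMorphCategory", "c1"),
}
types = {"t1": {"Qual"}, "c1": {"DegComp"}}

# Eq. (2): exists hasSemanticType.(Qual or Rel)
satisfies_eq2 = some_values_from("жақсы", "hasSemanticType", {"Qual", "Rel"}, assertions, types)
# Eq. (3): exists hasMorphCategory.DegComp
satisfies_eq3 = some_values_from("жақсы", "hasMorphCategory", {"DegComp"}, assertions, types)
```

The union in Eq. (2) appears here as a set of filler classes, so a successor typed with either Qual or Rel satisfies the restriction.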

Fig. 4 shows some properties of nouns in the Kazakh language, which are defined by the following formulae:

$$ \exists addSuffixes.(NNWF) \tag{4} $$

$$ \exists addSuffixes.(VNWF) \tag{5} $$

$$ \exists addSuffixes.(WC) \tag{6} $$

$$ \exists added.(Number \cup POSS \cup Person \cup Cases) \tag{7} $$

$$ \exists hasSemanticType.(ABST \cup CNCR) \tag{8} $$

$$ \exists hasSemanticType.(ANIM \cup INAM) \tag{9} $$

$$ \exists hasSemanticType.(CMMN \cup PRPR) \tag{10} $$

$$ \exists hasMorphCategory.(CASES\_category \cup NUMBER\_category \cup PERSON\_category \cup POSS\_category) \tag{11} $$

The ontology also covers substantivization, the transition of other parts of speech (adjectives, verbs, participles, numerals) into the category of nouns through their ability to denote an object directly (that is, to answer the question “who?” or “what?”). For a noun, the necessary condition is that a noun ending (Number, POSS, Person, Cases) is added to it. To obtain the substantivization of other parts of speech, we convert these necessary conditions into necessary and sufficient conditions (Fig. 5).

When the reasoner is run, it infers the substantivization of the adjective (Fig. 6): in the process of word usage some adjectives lose their qualitative semantics and acquire an object meaning, that is, they become substantivized and pass into the category of nouns, thereby replenishing the vocabulary.
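The effect of turning a necessary condition into a necessary and sufficient one can be sketched in Python as follows. With a merely necessary condition, membership in the noun class has to be asserted; with a defined class, any word that satisfies the condition is classified as a noun. The function `infer_types` and the example form жақсылар (“the good ones”, the adjective жақсы with a plural ending) are illustrative assumptions, and the closed-world rule below only approximates what an OWL reasoner does.

```python
from typing import Set

# Ending categories named in Eq. (7).
NOUN_ENDING_CATEGORIES = {"Number", "POSS", "Person", "Cases"}


def infer_types(asserted_types: Set[str], ending_categories: Set[str],
                noun_is_defined_class: bool) -> Set[str]:
    """Return the asserted types plus any type inferred for the noun class N.

    If N is a defined class (necessary AND sufficient condition), carrying
    at least one noun ending category is enough to be classified as N.
    If the condition is only necessary, nothing new is inferred.
    """
    inferred = set(asserted_types)
    if noun_is_defined_class and ending_categories & NOUN_ENDING_CATEGORIES:
        inferred.add("N")
    return inferred


# жақсылар: asserted as an adjective, carries a Number (plural) ending.
only_necessary = infer_types({"Adj"}, {"Number"}, noun_is_defined_class=False)
defined_class = infer_types({"Adj"}, {"Number"}, noun_is_defined_class=True)
```

In the second call the word keeps its asserted type Adj and additionally receives N, which corresponds to the substantivized adjective described above.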

For the morphological category NAbl (noun in the ablative case), the necessary and sufficient conditions are:

$$ NAbl \equiv \exists hasRoot(root \cap (\forall hasPOS(N)) \cap \exists hasEnding(AblEnd)) \tag{12} $$

This makes it possible to classify the word “даладан” (“from outside”), an individual of the concept “word”, into the category NAbl. Fig. 7 presents the description of the category NAbl and its realization after the reasoner is run.
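One plausible reading of Eq. (12) is that a word belongs to NAbl when its root is a noun and its ending is an ablative ending. The Python sketch below encodes that reading. The `Word` record, the function `is_nabl`, and the list of ablative endings are assumptions introduced for this example (the endings follow standard Kazakh grammar and are not listed in this paper); the segmentation into root and ending is given by hand, not computed.

```python
from dataclasses import dataclass

# Assumed ablative endings of Kazakh (not taken from the paper).
ABLATIVE_ENDINGS = ("дан", "ден", "тан", "тен", "нан", "нен")


@dataclass(frozen=True)
class Word:
    form: str      # surface form
    root: str      # value of hasRoot
    root_pos: str  # value of hasPOS for the root, e.g. "N"
    ending: str    # value of hasEnding


def is_nabl(word: Word) -> bool:
    """Defined-class test: noun root plus an ablative ending."""
    return word.root_pos == "N" and word.ending in ABLATIVE_ENDINGS


# The example from the text: дала + дан.
daladan = Word(form="даладан", root="дала", root_pos="N", ending="дан")
# A contrasting locative form of the same root: дала + да.
dalada = Word(form="далада", root="дала", root_pos="N", ending="да")
```

Under these assumptions `is_nabl(daladan)` holds while `is_nabl(dalada)` does not, because the locative ending is not in the ablative set.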

Thus, the created ontological model “Morphology of the Kazakh language” (Fig. 8) includes all the concepts and relations between them that are associated with the morphological features of the Kazakh language [30,31,32].

Morphological rules of natural languages can also be formalized with alternative models, such as formal grammars and functional, logic, and production-rule programming languages.

5 Conclusion

Based on a study of metalanguages and of methods for the ontological modeling of subject areas, the UniTurk metalanguage was developed. It provides notation for all the morphological concepts of the Turkic languages, using Kazakh as the example. In addition, the ontology “Morphology of the Kazakh language” was built.

In the future, all syntactic concepts of the Turkic languages will be added to the UniTurk metalanguage, and a Grammar of the Turkic Languages will be created.

The linguistic resources created can, on the one hand, promote mutual understanding of terminology across the Turkic languages and, on the other hand, serve as a multilingual knowledge base for multilingual search systems, machine translation between the Turkic languages, automatic summarization of Turkic texts, and information and training systems.

References

1. Bekmanova, G., et al.: A uniform morphological analyzer for the Kazakh and Turkish languages. In: Proceedings of the Sixth International Conference on Analysis of Images, Social Networks and Texts, Moscow, Russia, pp. 20–30 (2017)
2. Zhetkenbay, L., Sharipbay, A.A., Bekmanova, G.T., Kazhymukhan, D., Kamanur, U.: Comparison of the morphological rules of the Kazakh and Turkish languages. J. Math. Mech. Comput. Sci. 100(4), 42–51 (2018)
3. Sharipbaev, A.A., Bekmanova, G.T., Buribayeva, A.K., Yergesh, B.Zh., Mukanova, A.S., Kaliyev, A.K.: Semantic neural network model of morphological rules of the agglutinative languages. In: Proceedings of the 6th International Conference on Soft Computing and Intelligent Systems, The 13th International Symposium on Advanced Intelligent Systems, Kobe, Japan, pp. 1094–1099 (2012)
4. Tsukanova, N.I.: Ontological model for knowledge representation and organization, Moscow (2015)
5. Smekhun, Y.A.: Ontologies in the knowledge based systems: possibilities of their application. Int. Res. J. https://doi.org/10.18454/IRJ.2016.47.086. Accessed 25 Feb 2019
6. Protégé. http://protege.stanford.edu. Accessed 10 Mar 2019
7. Gruber, T.R.: Toward principles for the design of ontologies used for knowledge sharing. Int. J. Hum. Comput. Stud. 43(5–6), 907–928 (1995)
8. Zalevskaya, A.A.: Introduction to Psycholinguistics: A Textbook. Moscow (2000)
9. Marcus, M.P., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus of English: the Penn Treebank. Comput. Ling. 19(2), 313–330 (1993)
10. Garside, R.: The CLAWS word-tagging system. In: Garside, R., Leech, G., Sampson, G. (eds.) The Computational Analysis of English: A Corpus-based Approach. Longman, London (1987)
11. The Open American National Corpus. http://www.anc.org. Accessed 10 Dec 2018
12. Ide, N.: The American national corpus: then, now, and tomorrow. In: Selected Proceedings of the 2008 HCSNet Workshop on Designing the Australian National Corpus: Mustering Languages, Cascadilla Proceedings Project, Somerville, MA (2008)
13. Jurafsky, D., Martin, J.H.: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edn. Prentice-Hall (2009)
14. Indurkhya, N., Damerau, F.J.: Handbook of Natural Language Processing, 2nd edn. Chapman & Hall/CRC (2010)
15. Kazakh Language Corpus. http://kazcorpus.kz/. Accessed 10 Dec 2018
16. Makhambetov, O., Makazhanov, A., Yessenbayev, Zh., Matkarimov, B., Sabyrgaliyev, I., Sharafudinov, A.: Assembling the Kazakh language corpus. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1022–1031 (2013)
17. Madiyeva, G.B., Umatova, Zh.M.: About Almaty Corpus of the Kazakh language. KazNU Messenger, “Philology” Series 5(157), 99–103 (2015)
18. Tatar National corpus “Tugan tel”. http://tugantel.tatar. Accessed 10 Dec 2018
19. Galieva, A., Khakimov, B., Gatiatullin, A.: On the way to the relevant grammatical tagset for the Tatar national corpus. In: 8th International Conference on Corpus Linguistics, Málaga, Spain, pp. 121–129 (2016)
20. Turkish National Corpus (TNC). http://www.tnc.org.tr/. Accessed 10 Dec 2018
21. Kubedinova, L., Gatiatullin, A.: Morphological tagging of Crimean Tatar electronic corpus. In: Proceedings of the International Conference “Turkic Languages Processing”, Kazan, Tatarstan, pp. 331–337 (2015)
22. Zheltov, P.: Morphological annotation system for the national corpus of the Chuvash language. In: Proceedings of the International Conference “Turkic Languages Processing”, Kazan, Tatarstan, pp. 328–331 (2015)
23. Bekmanova, G., Sharipbay, A., Omarbekova, A., Yelibayeva, G., Yergesh, B.: Adequate assessment of the customers’ actual reviews through comparing them with the fake ones. In: Proceedings of 2017 International Conference on Engineering & Technology (ICET 2017), Antalya, Turkey (2017). ISBN: 978-1-5386-1949-0
24. Yergesh, B., Mukanova, A., Sharipbay, A., Bekmanova, G., Razakhova, B.: Semantic hyper-graph based representation of nouns in the Kazakh language. Computacion y Sistemas 18(3), 627–635 (2014)
25. Mukanova, A., Yergesh, B., Bekmanova, G., Razakhova, B., Sharipbay, A.: Formal models of nouns in the Kazakh language. Leonardo Electron. J. Practices Technol. 13(25), 264–273 (2014)
26. Aripov, M., Sharipbay, A., Abdurakhmonova, N., Razakhova, B.: Ontology of grammar rules as example of noun of Uzbek and Kazakh languages. In: Abstract of the VI International Conference “Modern Problems of Applied Mathematics and Information Technology - Al-Khorezmiy 2018”, pp. 37–38, Tashkent, Uzbekistan (2018)
27. Sharipbay, A., Yergesh, B., Yelibayeva, G., Israilova, N., Bakasova, P., Zhetkenbay, L.: Ontological models matching of nouns of Kazakh and Kyrgyz languages. In: Proceedings of the International Conference on Computer Processing of Turkic Languages, Tashkent, Uzbekistan, pp. 182–188 (2018)
28. Zhetkenbay, L., Sharipbay, A., Bekmanova, G., Kamanur, U.: Ontological modeling of morphological rules for the adjectives in Kazakh and Turkish languages. J. Theor. Appl. Inf. Technol. 91(2), 257–263 (2016)
29. Momynova, B.K., Satkenova, Zh.B.: Morphology of the Kazakh language: a teaching aid. Almaty, Kazakhstan (2014)
30. Yskakov, A.: Modern Kazakh Language, 2nd edn. Almaty, Kazakhstan (1991)
31. The Kazakh grammar. Phonetics, word formation, morphology, syntax. Astana, Kazakhstan (2002)
32. Institute of State language development: The Kazakh Language (Short grammar reference book). Almaty, Kazakhstan (2010)

Acknowledgments

The work was supported by the grant financing for scientific and technical programs and projects by the Ministry of Science and Education of the Republic of Kazakhstan (Grant No. AP05132249, 2018–2020).

Author information

Authors and Affiliations

L.N. Gumilyov Eurasian National University, Astana, Kazakhstan
Gaziza Yelibayeva, Assel Mukanova, Altynbek Sharipbay, Altanbek Zulkhazhav, Banu Yergesh & Gulmira Bekmanova

Corresponding author

Correspondence to Gaziza Yelibayeva.

Editor information

Editors and Affiliations

Covenant University, Ota, Nigeria
Sanjay Misra
University of Perugia, Perugia, Italy
Osvaldo Gervasi
University of Basilicata, Potenza, Italy
Beniamino Murgante
Saint Petersburg State University, Saint Petersburg, Russia
Elena Stankova
Saint Petersburg State University, Saint Petersburg, Russia
Vladimir Korkhov
Polytechnic University of Bari, Bari, Italy
Carmelo Torre
University of Minho, Braga, Portugal
Ana Maria A.C. Rocha
Monash University, Clayton, VIC, Australia
David Taniar
Kyushu Sangyo University, Fukuoka, Japan
Bernady O. Apduhan
Polytechnic University of Bari, Bari, Italy
Eufemia Tarantino

Cite this paper

Yelibayeva, G., Mukanova, A., Sharipbay, A., Zulkhazhav, A., Yergesh, B., Bekmanova, G. (2019). Metalanguage and Knowledgebase for Kazakh Morphology. In: Misra, S., et al. Computational Science and Its Applications – ICCSA 2019. ICCSA 2019. Lecture Notes in Computer Science, vol 11619. Springer, Cham. https://doi.org/10.1007/978-3-030-24289-3_51

DOI: https://doi.org/10.1007/978-3-030-24289-3_51
Published: 29 June 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-24288-6
Online ISBN: 978-3-030-24289-3
eBook Packages: Computer Science (R0)
