1 Introduction

Processing natural languages requires a tagging system. Existing metalanguages mainly cover the concepts of the Romano-Germanic and Slavic language groups. They are not suitable for describing the Turkic languages, which have many concepts that differ from those of the mentioned groups. Therefore, creating the UniTurk metalanguage for tagging texts of the Turkic languages, including Kazakh, is an urgent task. Such a metalanguage will make it possible to unify tags, make them easier to understand, use common software, and conduct linguistic-statistical comparative studies across the Turkic languages [1,2,3].

The metalanguage is needed to create a common resource with which all the developers of the Turkic electronic corpora could work. Such a resource could serve as a reference system for both developers and users of the Turkic electronic corpora. The most appropriate components of such a resource that meet the required conditions are ontological models of the grammar of the Turkic languages.

Ontologies are used as data sources for many natural language processing applications because they allow complex and diverse information to be processed more efficiently. In knowledge management for a subject area, ontological models are applied at the structuring stage and are treated as special knowledge bases. An ontology can act as a knowledge base framework, that is, a framework for describing the key concepts of a specific subject area, and can be formally represented as the following triple [4, 5]:

$$ O = \langle X, R, F \rangle $$
(1)

where X is a finite set of concepts (terms) of the domain, which is represented by the ontology O; R is a finite set of relationships between the concepts (terms) of a given domain; F is a finite set of interpretation functions defined on concepts or relations of ontology O. The role of the interpretation function can be played by a verbal explanation of a term (annotation), a formula for calculating the meaning of a term, an algorithmic description, and also a definition in the form of a logical formula.
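The triple in (1) can be illustrated with a minimal Python sketch. The class name `Ontology`, its field names, and the `is_a` relation label are illustrative assumptions, not part of the authors' implementation, and the interpretation functions F are reduced here to verbal annotations only.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class Ontology:
    """Illustrative container for O = <X, R, F> (names are hypothetical)."""
    concepts: Set[str] = field(default_factory=set)                    # X
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)  # R as (subject, relation, object)
    interpretations: Dict[str, str] = field(default_factory=dict)      # F, here verbal annotations only

    def add_relation(self, subject: str, name: str, obj: str) -> None:
        # A relation may only connect concepts that already belong to X.
        if subject not in self.concepts or obj not in self.concepts:
            raise ValueError("both endpoints must be known concepts")
        self.relations.add((subject, name, obj))


onto = Ontology()
onto.concepts.update({"Morpheme", "Root", "Affix"})
onto.add_relation("Root", "is_a", "Morpheme")
onto.add_relation("Affix", "is_a", "Morpheme")
onto.interpretations["Root"] = "The main morpheme, which bears the main meaning of the word."
```

Other kinds of interpretation function mentioned above (formulas, algorithmic descriptions, logical definitions) could be stored as callables instead of strings; the sketch keeps only the simplest case.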

Ontological modeling of the morphological concepts of the Kazakh language was carried out in the Protégé environment, which simplifies creating, loading, modifying, and transforming the knowledge base and makes it available for shared viewing and editing [6, 7].

2 Related Works

A metalanguage is designed for tagging natural language texts. With its means, it is possible to express everything that is expressible in the object language and to designate all signs and expressions of the object language that have names. In the metalanguage, one can express the properties of object-language expressions and the relations between them, and formulate definitions, notation, and rules of formation and transformation for object-language expressions [8].

There are well-known tagging systems, such as the Penn Treebank [9], the CLAWS tagging system [10], which is used for tagging the British National Corpus [9], and the tagging system of the Brown Corpus in the US National Corpus [11, 12]. These tagging systems are described in more detail in [13, 14]. All of them are mainly used for tagging English.

Currently there are more than ten electronic corpora for Turkic languages: the Kazakh language corpus [15,16,17], the Tatar language corpus “Tugan tel” [18, 19], the Turkish language corpus [20], the Crimean Tatar language corpus [21], the Chuvash language corpus [22], etc. One of the main components of these corpora is the system of morphological, syntactic, and semantic tagging. Morphological tagging systems vary from corpus to corpus.

There are various ontology-based domain models. For example, [23] considers an ontological model of hotel services in order to compare actual reviews with automatically generated ones. Some works [24, 25] present formal models for nouns in the Kazakh language and the use of semantic graphs to describe ontological models. The articles [26, 27] compare ontological models of nouns in the Kazakh language with those of other Turkic languages (Kyrgyz and Uzbek), and [28] compares adjectives in the Kazakh and Turkish languages.

3 Metalanguage for Kazakh Morphology

Table 1 presents the UniTurk tags, using the description of Kazakh morphology concepts as an example. The first column of the table contains the tag designations, the second contains the concept names in English, and the third contains the tag names in Kazakh.

Table 1. Tags of the concepts of the Kazakh language morphology.

The text part consists not only of the names of the concepts but also of definitions, questions, and examples in the Kazakh language. As a result of the study, a tagging system for morphological concepts of the Kazakh language was obtained (Fig. 1).

Fig. 1. Fragment of a tagging system for morphological concepts of the Kazakh language.

4 Knowledgebase for Kazakh Morphology

4.1 Architecture

Morphology is a section of grammar, the main objects of which are words of the natural languages, their significant parts, and morphological features. The tasks of morphology, therefore, include the definition of a word as a special language object and a description of its internal structure [29]. According to this definition, the following concepts should be covered in the knowledge base architecture: word, morpheme, parts of speech and morphological category.

Word

A word is the central unit of a language.

Morpheme

In words, the minimum significant distinguished parts are morphemes. The main morpheme is the root, which bears the main meaning of the word, and the other morphemes are called affixes (suffix and ending).

  1. Morpheme
    1.1. Root
    1.2. Affix
      1.2.1. Suffix
      1.2.2. Ending

Parts of Speech

Parts of speech are classes into which words are distributed according to their grammatical properties. There are nine parts of speech in the Kazakh language:

  1. Noun
  2. Adjective
  3. Numeral
  4. Pronoun
  5. Verb
  6. Adverb
  7. Conjunction
  8. Interjection
  9. Imitative words

All parts of speech have certain properties, and all these properties are implemented in the ontology.

Morphological Category

A morphological category is a set of affixal variations of a word that expresses a grammatical category of the corresponding word class. The Kazakh language has different morphological categories, for example, for the noun (case, number, person, and possessiveness), the adjective (degree of comparison), and the verb (mood, tense, person, type, voice), etc.

  1. The category of CASE
    1.1. Genitive case
    1.2. Direction-dative case
    1.3. Accusative case
    1.4. Ablative case
    1.5. Locative case
    1.6. Instrumental case
  2. The category of NUMBER
  3. The category of PERSON
  4. The category of POSSESSIVE
  5. Degrees of Comparison
  6. The category of MOOD
  7. The category of TENSE
  8. The category of VOICE

All the above-mentioned concepts of the morphology of the Kazakh language were included in the applied ontology.
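The concept hierarchy described in this subsection can be sketched as a simple parent map with a transitive subclass check. This is only an illustration: the identifiers (`PARENT`, `MorphCategory`, `Case`, and so on) are hypothetical names chosen for readability and are not the class names used in the authors' Protégé ontology, and only part of the hierarchy is included.

```python
from typing import Dict, List, Optional

# Each concept points to its direct superclass (None marks a top-level concept).
# All identifiers are illustrative, not taken from the actual ontology.
PARENT: Dict[str, Optional[str]] = {
    "Morpheme": None,
    "Root": "Morpheme",
    "Affix": "Morpheme",
    "Suffix": "Affix",
    "Ending": "Affix",
    "MorphCategory": None,
    "Case": "MorphCategory",
    "Genitive": "Case",
    "Ablative": "Case",
    "Number": "MorphCategory",
    "Tense": "MorphCategory",
}


def ancestors(concept: str) -> List[str]:
    """Return all superclasses of a concept, nearest first."""
    chain: List[str] = []
    parent = PARENT[concept]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def is_a(concept: str, candidate: str) -> bool:
    """True if `concept` equals `candidate` or is a (transitive) subclass of it."""
    return concept == candidate or candidate in ancestors(concept)
```

For example, `ancestors("Ending")` walks up through `Affix` to `Morpheme`, which mirrors the nesting of the morpheme list above.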

4.2 Implementation

The applied ontology “Morphology of the Kazakh language” consists of individuals, properties, and classes, as well as interpretation functions defined on the ontology concepts or relations.

The ontology was implemented in the Protégé environment in accordance with the architecture described above. This environment is used to represent knowledge in the form of classes, individuals belonging to classes, and properties between them. All classes of morphological concepts of the Kazakh language are displayed in Fig. 2.

Fig. 2. Ontology “Morphology of the Kazakh language”.

After the classes were created, their properties were defined. Object properties define a relationship between two objects (classes or individuals). For example, the concept “Adj” (adjective) is characterized by the following properties (Fig. 3): its semantic type can be either “qualitative adjective” (Qual) or “relative adjective” (Rel), and the meaning of the adjective changes accordingly. Therefore, we introduce the property “hasSemanticType” (has a semantic type) and describe it with an existential restriction (a restriction by the existential quantifier):

Fig. 3. Example of properties of adjectives in the Kazakh language.

$$ \exists R.(A \cup B) $$
(2)

where \( R = hasSemanticType \), \( A = Qual \), \( B = Rel \).

One can also specify that the adjective has the degree-of-comparison category:

$$ \exists R.A $$
(3)

where \( R = hasMorphCategory \), \( A = DegComp \).
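The meaning of the existential restrictions (2) and (3) can be illustrated with a small Python check over explicitly asserted facts. This is a sketch under stated assumptions: the function `satisfies_some` and the individuals `adj_1`, `sem_qual`, and `cat_deg` are hypothetical, and the check is a closed-world test over listed assertions, whereas an OWL reasoner works under the open-world assumption.

```python
from typing import Dict, Iterable, Set, Tuple

Assertion = Tuple[str, str, str]  # (subject, property, object)


def satisfies_some(individual: str,
                   prop: str,
                   fillers: Set[str],
                   assertions: Iterable[Assertion],
                   types: Dict[str, Set[str]]) -> bool:
    """True if `individual` has at least one `prop`-successor whose type is
    among `fillers`, i.e. it satisfies "exists prop.(A or B or ...)"."""
    for subj, p, obj in assertions:
        if subj == individual and p == prop and types.get(obj, set()) & fillers:
            return True
    return False


# Hypothetical individuals: one adjective with a semantic type and a morphological category.
types = {"sem_qual": {"Qual"}, "cat_deg": {"DegComp"}}
assertions = [
    ("adj_1", "hasSemanticType", "sem_qual"),
    ("adj_1", "hasMorphCategory", "cat_deg"),
]

# Restriction (2): exists hasSemanticType.(Qual or Rel)
ok_semantic = satisfies_some("adj_1", "hasSemanticType", {"Qual", "Rel"}, assertions, types)
# Restriction (3): exists hasMorphCategory.DegComp
ok_category = satisfies_some("adj_1", "hasMorphCategory", {"DegComp"}, assertions, types)
```

Passing the fillers as a set models the union in (2): a single successor of either type is enough to satisfy the restriction.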

Fig. 4 shows some properties of nouns in the Kazakh language, which are defined by the following formulae:

Fig. 4. Example of the properties of nouns in the Kazakh language.

$$ \exists addSuffixes.(NNWF) $$
(4)
$$ \exists addSuffixes.(VNWF) $$
(5)
$$ \exists addSuffixes.(WC) $$
(6)
$$ \exists added.(Number \cup POSS \cup Person \cup Cases) $$
(7)
$$ \exists hasSemanticType.(ABST \cup CNCR) $$
(8)
$$ \exists hasSemanticType.(ANIM \cup INAM) $$
(9)
$$ \exists hasSemanticType.(CMMN \cup PRPR) $$
(10)
$$ \exists hasMorphCategory.(CASES\_category \cup NUMBER\_category \cup PERSON\_category \cup POSS\_category) $$
(11)

The ontology also covers substantivization, the transition of other parts of speech (adjectives, verbs, participles, numerals) into the category of nouns owing to their ability to denote an object directly (that is, to answer the question “who?” or “what?”). For the noun, the necessary condition is the addition of an ending (Number, POSS, Person, Cases). To obtain the substantivization of other parts of speech, we convert these necessary conditions into necessary and sufficient conditions (Fig. 5).

Fig. 5. Necessary and sufficient conditions for the noun.

When the reasoner is started, we obtain the substantivization of the adjective (Fig. 6): in the process of word usage, some adjectives lose their qualitative semantics and acquire an object meaning, i.e., they substantivize and pass into the category of nouns, thereby replenishing the vocabulary.

Fig. 6. Substantivization of adjectives.
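The effect of turning a necessary condition into a necessary and sufficient one can be sketched in Python. The sketch assumes that the defining condition for the noun is the `added` restriction of formula (7); the exact definition shown in Fig. 5 is not reproduced in the text, so this is an assumption, and the function name `infer_classes` is hypothetical.

```python
from typing import Iterable, Set

# Ending types taken from formula (7); treating them as the defining
# (necessary and sufficient) condition for N is an assumption of this sketch.
NOUN_ENDING_TYPES: Set[str] = {"Number", "POSS", "Person", "Cases"}


def infer_classes(asserted: Iterable[str], added_ending_types: Iterable[str]) -> Set[str]:
    """Return the asserted classes plus N when the defining condition holds.

    With a necessary-only condition nothing could be inferred from the endings;
    with a necessary and sufficient one, any word carrying such an ending is
    classified as a noun, whatever part of speech was asserted for it.
    """
    inferred = set(asserted)
    if NOUN_ENDING_TYPES & set(added_ending_types):
        inferred.add("N")
    return inferred


# An adjective used with a case ending is additionally classified as a noun.
substantivized = infer_classes({"Adj"}, {"Cases"})
# An adjective without such an ending keeps only its asserted class.
plain_adjective = infer_classes({"Adj"}, set())
```

This mimics, in a very reduced form, what the reasoner does when it reclassifies an adjective individual under the noun class.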

For the morphological category NAbl (noun in the ablative case), the necessary and sufficient conditions are:

$$ NAbl \equiv \exists hasRoot(root \cap (\forall hasPOS(N)) \cap \exists hasEnding(AblEnd)) $$
(12)

This makes it possible to classify the word “даладан” (“from outside”), which is an individual of the concept “word”, into the category NAbl. Fig. 7 presents the description of the category NAbl and its realization after the reasoner is launched.

Fig. 7. Description of the NAbl category.
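Definition (12) can be read as a membership test, sketched below in Python. The `Word` record, its field names, and the function `is_nabl` are hypothetical, and the segmentation of “даладан” into the root “дала” and an ablative ending is supplied by hand for illustration rather than produced by the ontology.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Word:
    """Hypothetical record of the facts asserted about a word individual."""
    surface: str
    root: Optional[str]            # filler of hasRoot, if any
    root_pos: FrozenSet[str]       # all parts of speech asserted for the root (hasPOS)
    ending_types: FrozenSet[str]   # types of the endings attached (hasEnding)


def is_nabl(word: Word) -> bool:
    """Check the three conjuncts of definition (12) on the asserted facts."""
    has_root = word.root is not None
    # "for all hasPOS . N": every asserted part of speech of the root is N
    # (vacuously true when none is asserted, as with a universal restriction).
    root_is_only_noun = all(pos == "N" for pos in word.root_pos)
    has_ablative_ending = "AblEnd" in word.ending_types
    return has_root and root_is_only_noun and has_ablative_ending


# Manual segmentation assumed for illustration: root "дала" plus an ablative ending.
daladan = Word("даладан", "дала", frozenset({"N"}), frozenset({"AblEnd"}))
dala = Word("дала", "дала", frozenset({"N"}), frozenset())
```

Here `is_nabl(daladan)` holds, while the bare form `dala` fails the test because it carries no ablative ending.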

Thus, the created ontological model “Morphology of the Kazakh language” (Fig. 8) includes all the concepts and relations between them that are associated with the morphological features of the Kazakh language [30,31,32].

Fig. 8. Ontological model “Morphology of the Kazakh language”.

To formalize the morphological rules of natural languages, alternative models can also be used, such as formal grammars and functional, logic, and production-rule programming languages.

5 Conclusion

Based on a study of metalanguages and methods for ontological modeling of subject areas, the UniTurk metalanguage was developed. It provides notation for all the concepts of Turkic morphology, using the Kazakh language as an example. In addition, the ontology “Morphology of the Kazakh language” was built.

In the future, all syntactic concepts of the Turkic languages will be added to the UniTurk metalanguage, and a Grammar of the Turkic Languages will be created.

The linguistic resources created can, on the one hand, promote mutual understanding of terminology across the Turkic languages and, on the other hand, form a multilingual knowledge base for use in multilingual search systems, machine translation between the Turkic languages, automatic summarization of Turkic texts, and information and training systems.