
1 Introduction

The Kazakh language belongs to the Turkic group and to the agglutinative class of languages; it has a complex morphological structure and a rich semantic vocabulary. Unfortunately, Kazakh is currently a low-resource language, which hinders the development and conduct of scientific research. For the Kazakh language, the problems of semantic analysis and of identifying data or facts in text remain open: there are no universal approaches or methods that allow high-quality semantic analysis, extraction of data and facts from texts, and similar tasks.

Computer semantic analysis is closely related to the problem of text understanding by a machine. There are many interpretations of the concept of the “meaning of a text” and of the task of understanding it. For example, according to D. A. Pospelov [1], a system understands the text entered into it if, from the point of view of a person (or a group of experts), it correctly answers questions related to the information contained in the text.

2 Related Works

There are various scientific approaches and methods for solving the problem of semantic analysis for a particular language; some of them are presented below. Of course, no software can replace the analysis a human can perform, but the programs currently being developed can reduce the time spent studying large databases. In this regard, we consider the following programs for semantic text analysis. Software offered by various vendors, such as Semantic LLC, Tomita-parser (Yandex), Semantic Analyst JHON, SummarizeBot API, TextAnalyst 2.0, Galaktika-ZOOM, NLP ISA Natasha, etc., is used in different subject areas and for different languages [2,3,4,5,6,7,8,9].

For example, “Semantic LLC” is a program for processing unstructured text. Its output is a graph-oriented semantic network: each node is a semantic element, and the edges represent relations between elements. Each node carries a set of attributes, and this set depends on the type of the element.

Tomita Parser (Yandex) is a program that extracts facts from unstructured text. Fact extraction is based on context-free grammar rules, and the program requires a dictionary of keywords; users write their own grammars for the parser.

SummarizeBot API is a web service offering a RESTful API for text and image processing tasks. It supports over 100 languages, including Russian, English, Chinese, and Japanese, and uses machine learning technology. The current version provides the following functions: 1) automatic text summarization; 2) extraction of keywords and document concepts; 3) analysis of documents and extraction of named entities and their attributes; 4) automatic detection of the document language; 5) extraction of unstructured data: the main text of articles, blogs, forums, etc.; 6) image processing: detection and recognition of objects in images.

“TextAnalyst 2.0” is a program developed by the research and production innovation center MicroSystems as a tool for text analysis. It builds a semantic network of the concepts expressed in the processed text, supports semantic search of text fragments that takes into account the semantic links hidden in the text, and allows parsing a text by constructing a hierarchical tree of the topics/headings contained in it.

The scientific works [10,11,12,13,14] describe the basic ideas of using semantic analysis in information retrieval systems. Various options for computing text statistics are presented, including counting the number of occurrences of words in documents and the frequency of word co-occurrence, as well as new model architectures for computing continuous vector representations of words from very large datasets. The quality of the vector representations of words obtained by various models was studied using a set of syntactic and semantic language tasks. In [15], the application of neural network language models to the problem of computing semantic similarity for the Russian language is shown, and the tools and corpora used and the results achieved are described.

The software products presented above are designed for high-resource languages such as English, Spanish, Russian, etc. Unfortunately, for the Turkic languages (Kazakh, Kyrgyz, Turkish, Uzbek, etc.) there is currently no openly available software implementation. The disadvantage of the existing systems is that they cannot be applied to the Turkic languages, since these languages are agglutinative, with complex morphological and lexical forms and a sentence structure on which the semantics depends.

The analysis of a huge amount of data can be simplified if we have keywords or keyphrases that capture the basic characteristics, concepts, etc. of a document. Relevant keywords and keyphrases can serve as a summary of the document and help us organize documents and retrieve them based on their contents [16]. Two main approaches to automating the selection of keywords and keyphrases should be distinguished: the assignment of keywords and keyphrases, and their extraction [17, 18]. The main difference is that the first approach selects only keywords and keyphrases contained in a provided dictionary, while the second selects key information directly from the text.

Keywords can be assigned manually or automatically, but the manual approach is very time-consuming and expensive. Thus, there is a need for an automated process that extracts keywords from documents. Ready-made software solutions to this problem exist for common languages (English, Russian, Spanish, etc.), while for the Kazakh language there are only a few, and they are not openly available.

Below are some approaches and works for carrying out summarization for different languages:

The most common is the surface-level approach, which takes into account title words and cue words (i.e., “important”, “best”, etc.) to extract sentences for the summary [19].

The paper [20] presents automatic free-text processing with content extraction using agent verification. The K-means algorithm was used as the basis for data processing.

There is also a common summarization approach based on extracting structural parts from the text corpus, for example, the WordNet system [21].

The paper [22] presents cohesive approaches, which identify and consider the cohesion relations between concepts within the text, including synonyms, antonyms, lexical data of the language, etc.

It should be noted that graph-based approaches are currently among the most popular summarization methods. Two methods of this type are LexRank [23] and TextRank [24].

In [25], a graph-based approach to summarizing a text document is also presented. The difference of this approach is that it simultaneously takes into account local coherence, importance, and redundancy.

The next type of approach is based on machine learning, where summarization is cast as a supervised or semi-supervised learning task. This method requires large amounts of data for training.

The article [26] presents a new Seq2Seq model for abstractive and extractive summarization. A comparative analysis of existing approaches is carried out, showing that RNNs and other Seq2Seq models achieve good practical results. The main difference of this approach is that, at the first time step of encoding, contextual information is added to the sequence using an agent.

3 A Semantic-Analysis-Based Algorithm for Extracting Annotations and Keywords

In the era of digital technologies, given the constant growth of the volume of digital data, improving the quality of information retrieval by means of new semantic approaches and methods plays an important role.

To work with big data, various algorithms and methods are being developed for solving this problem by machine, since the amount of data does not allow manual analysis. Every natural language is complex, unique, and multifaceted in its own way; therefore, extracting data from documents and text resources is a large and time-consuming task that requires preliminary processing.

This section presents a hybrid approach to the semantic analysis of text resources and documents in the Kazakh language. The developed approach consists of two main parts: the first identifies keywords (phrases) in the text, and the second, based on the data obtained, builds an annotation summarizing the text.

The developed hybrid approach of semantic analysis of the text in the Kazakh language consists of two main stages:

  • identifying keywords and phrases in the text;

  • constructing a semantic annotation of the text based on the keywords.

For the first stage, the text must be prepared: lemmatization and morphological markup are performed on the texts. The main tasks of keyword detection algorithms are finding suitable candidates, identifying their attributes, and ranking them [29].

For ranking and determining frequency, the TF-IDF (Term Frequency – Inverse Document Frequency) measure was used [28]. With TF-IDF, one can determine a weight for each word relative to the entire document; the words with the highest scores are the main keywords of the text.

TF-IDF was calculated using the formula below

$$ TF\text{-}IDF(t,D) = TF(t,D) \cdot IDF(t) = \frac{n_{t,D}}{\sum_{k} n_{k,D}} \cdot \log\left( \frac{|TS|}{|\{ d : t \in d \}|} \right) $$
(1)

where \(n_{t,D}\) is the number of occurrences of the word \(t\) in the target collection \(D\), \(\sum_{k} n_{k,D}\) is the total number of occurrences of all words in the target collection \(D\), \(|TS|\) is the number of documents in all used collections, and \(|\{d : t \in d\}|\) is the number of documents that contain the word \(t\) at least once.

According to this formula, the weight of each word is calculated: the higher the weight of a word, the higher its relative frequency of use in the text collection. Based on this keyword-detection algorithm and the linguistic resources of the Kazakh language, a modified algorithm for extracting keywords and phrases was developed [13].
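To make the computation concrete, below is a minimal Python sketch of Eq. (1). It assumes the texts have already been lemmatized and tokenized into word lists and that the target document is part of the collection; all names are illustrative and do not reproduce the authors' implementation.

```python
import math
from collections import Counter

def tf_idf_scores(document, collection):
    """Score each word of `document` according to Eq. (1).

    `document` is a list of lemmatized tokens; `collection` is the list
    of all token lists (the collection TS). It is assumed to contain
    `document` itself, so the document frequency is never zero.
    """
    counts = Counter(document)                 # n_{t,D}
    total = sum(counts.values())               # sum_k n_{k,D}
    n_docs = len(collection)                   # |TS|
    scores = {}
    for term, n_td in counts.items():
        # |{d : t in d}|: documents containing the term at least once
        df = sum(1 for doc in collection if term in doc)
        scores[term] = (n_td / total) * math.log(n_docs / df)
    return scores

# Usage: the highest-scoring words are taken as keyword candidates, e.g.
# top10 = sorted(tf_idf_scores(doc, corpus).items(),
#                key=lambda kv: kv[1], reverse=True)[:10]
```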

To find the similarity of the elements (sentences) of the text and to evaluate them, cosine similarity was applied. To calculate the cosine similarity between sentences, the following steps are performed: first, all the distinct words are identified; then the frequency of occurrence of these words in each sentence is recorded and represented as a vector, so that every sentence is represented by a vector. Next, the cosine similarity function is applied to these vectors, and the cosine of the angle between the vectors is obtained [14, 15].

Let x and y be sentence vectors. Their scalar product and the cosine of the angle θ between them are related by the following relation:

$$ \left\langle x, y \right\rangle = \left\| x \right\| \left\| y \right\| \cos(\theta) $$
(2)

Accordingly, the cosine distance is defined as

$$ \rho_{\cos}(x,y) = \arccos\left( \frac{\left\langle x, y \right\rangle}{\left\| x \right\| \left\| y \right\|} \right) = \arccos\left( \frac{\sum_{i=1}^{d} x_i y_i}{\left( \sum_{i=1}^{d} x_i^2 \right)^{1/2} \left( \sum_{i=1}^{d} y_i^2 \right)^{1/2}} \right) $$
(3)

Based on the values obtained from formula (3), a matrix of pairwise similarity values of the sentences is constructed. Next, all the sentences are ranked according to the similarity matrix. The sentences with the highest weight, as determined by the keywords or phrases, form the annotation of the document.
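As an illustration, here is a short Python sketch of this ranking step built on formulas (2)–(3): sentences are represented as bag-of-words vectors, a pairwise similarity matrix is constructed, and the top-ranked sentences form the annotation. The extra weight given to keyword-bearing sentences is our reading of the description above, not the authors' exact weighting scheme, and all function names are illustrative.

```python
import math
from collections import Counter

def cosine_similarity(x, y):
    """Normalized scalar product from Eq. (2); the cosine distance of
    Eq. (3) is arccos of this value."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def summarize(sentences, keywords, top_n=3):
    """Rank tokenized sentences and return the `top_n` as the annotation.

    `sentences` is a list of token lists; `keywords` is a set of the
    TF-IDF keywords found earlier. The keyword bonus below is an assumed
    reading of the approach described in the text.
    """
    vocab = sorted({w for s in sentences for w in s})
    vecs = []
    for s in sentences:
        c = Counter(s)
        vecs.append([c[w] for w in vocab])        # bag-of-words vector
    n = len(vecs)
    # pairwise similarity matrix of all sentences
    sim = [[cosine_similarity(vecs[i], vecs[j]) for j in range(n)]
           for i in range(n)]
    scored = []
    for i, s in enumerate(sentences):
        weight = sum(sim[i]) - sim[i][i]              # similarity to the rest
        weight += sum(1 for w in s if w in keywords)  # keyword bonus (assumed)
        scored.append((weight, i))
    best = sorted(scored, reverse=True)[:top_n]
    keep = sorted(i for _, i in best)                 # restore text order
    return [sentences[i] for i in keep]
```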

This proposed approach takes into account the grammatical properties and rules of the Kazakh language. The next section presents the practical results of the developed hybrid approach to semantic analysis.

4 Application of Approaches and Experimental Results

At the first stage, two tasks are solved: preliminary text processing, and the division of the text into separate words and keyphrases.

The first task is language-dependent; therefore, the morphological features of the Kazakh language are taken into account here. To solve this problem, the system of complete endings of the Kazakh language is used (the document is marked up with the morphological analyzer of the Kazakh language developed on the Apertium platform [30]), together with a stemming and lemmatization algorithm for the Kazakh language [31] (implemented in Python 3). Then a simple tokenization procedure was applied, which divides the whole text into separate words.
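For illustration, the sketch below shows what such a preprocessing step can look like in Python with NLTK. The Apertium-based analyzer and the algorithm of [31] are not reproduced here: the `lemmatize` argument is a hypothetical stand-in for them, and NLTK's generic tokenizer only approximates Kazakh-specific tokenization.

```python
import nltk

nltk.download("punkt", quiet=True)  # sentence/word tokenizer models

def preprocess(text, lemmatize=lambda w: w):
    """Split a Kazakh text into sentences of lowercased, lemmatized tokens.

    `lemmatize` stands in for the Apertium-based analyzer / the stemming
    and lemmatization algorithm of [31]; by default it leaves words intact.
    """
    result = []
    for sent in nltk.sent_tokenize(text):
        tokens = [w.lower() for w in nltk.word_tokenize(sent) if w.isalpha()]
        result.append([lemmatize(w) for w in tokens])
    return result
```

The output of this step feeds directly into the keyword extraction and sentence ranking sketches shown in Sect. 3.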

The developed algorithms and the hybrid semantic analysis approach are implemented in the Python programming language using the NLTK libraries. To test the program, we prepared a marked-up corpus consisting of more than 120 text documents of various sizes and topics. First, keywords and phrases were determined for each text with the TF-IDF measure. Table 1 below shows an example of the keywords found for texts in the Kazakh language (Figs. 1 and 2).

Table 1. Experimental data of the obtained keywords from texts in the Kazakh language.
Fig. 1. An example of the operation of the algorithm for determining keywords and phrases (the TF and IDF measures are shown separately).

Fig. 2. An example of the operation of the algorithm for determining keywords and phrases (the TF-IDF measure is shown).

Table 2 presents the practical results of the developed algorithm for determining keywords and phrases in Kazakh texts.

Table 2. Experimental results of the developed algorithm for determining keywords for the Kazakh language

Taking into account the limiting coefficient, which scales the number of keywords to the volume of the text, the keywords and phrases are selected correctly with respect to meaning and show a fairly good accuracy.

To test the operation of the developed algorithm for extracting keywords in the Kazakh language, practical experiments were conducted. Two approaches were compared: the first is simple summarization, and the second is summarization based on keywords and phrases. In the experiment, more than 120 documents in the Kazakh language of various topics and volumes were processed. The time spent on producing the text annotation depended directly on the volume of the input text. The resulting annotations are shown in Table 3.

Table 3. Examples of the work of summarization approaches for texts in the Kazakh language.

Table 3 shows examples of text processing using the two summarization methods. From the results obtained, it can be seen that the produced annotations convey the semantic concept of the text. In experiments on texts of small volume, there were cases when the results of the two approaches were very similar.

Figure 3 below shows the interface of the software solution for producing text annotations. The upper yellow window shows the original text in Kazakh, along with the total number of words and sentences. Below it, the yellow window shows the keywords and phrases identified in the text. The left blue window shows the result of simple summarization, and the right blue window shows the result of keyword-based summarization.

Fig. 3. An example of the program for determining summarization (two approaches) for the Kazakh language.

Figure 4 shows the percentage results of the two summarization approaches. The horizontal axis shows the number of words in a document, and the vertical axis shows the percentage accuracy of the produced annotations. The analysis and accuracy assessment were carried out manually by three experts (specialist linguists of the Kazakh language), and the average value of the experts' assessments was then calculated.

Fig. 4. The percentage of the results of the two summarization approaches.

The best result for producing annotations of full-text documents is given by the keyword-based summarization approach. This is because the keywords select sentences that carry meaning for the text, rather than simple introductory sentences. The algorithms and methods developed above are interconnected and provide an integrated approach to the processing and analysis of big data in the Kazakh language.

5 Conclusion and Future Work

According to the results of the scientific research, the following results were obtained:

Methods and modern approaches to semantic analysis and summarization of texts were investigated. Taking into account the peculiarities of the grammar of the Kazakh language, a hybrid approach to the semantic analysis of full-text documents was developed. This approach is based on identifying keywords/phrases and constructing a text annotation. The practical results of the text analysis show that the approach reveals the contextual meaning of the text. It can also be applied to other low-resource Turkic languages, since it does not require large amounts of data for processing.

In the future, it is planned to use this approach in the implementation of machine translation and post-editing systems for the Kazakh language.