1 Introduction

Sentiment analysis, also called opinion mining, is one of the fastest-growing areas of natural language processing. It is a field of research that analyzes people’s opinions, suggestions, assessments, attitudes and emotions in relation to such subjects as products, services, organizations, individuals, problems, events, topics and their attributes [1].

Many resources and systems for sentiment analysis of English texts have been developed [1, 2]. A number of studies have been conducted on sentiment analysis for Russian [3, 4], Turkish [5,6,7], Spanish [8], Arabic [9, 10] and other languages [8]. For Spanish, one approach to subjectivity detection in Twitter microtexts exploits the structured information of the social network framework. For Arabic, there is a semantic approach that discovers user attitudes and business insights from Arabic social media by building an Arabic Sentiment Ontology, which contains groups of words expressing different sentiments in different dialects [13]. Semantic models that recognize emotions and the words expressing them in different languages are also emerging.

The work [11] presents SEMO, a semantic model for emotion recognition that allows users to identify and quantify the emotional load associated with basic emotions hidden in short, emotionally saturated sentences (for example, news headlines, tweets, captions). Its underlying idea, assessing the semantic similarity of concepts from the occurrences and co-occurrences of the terms describing them on pages indexed by a search engine, can be directly extended to emotions and the words expressing them in different languages.

There are some works on sentiment analysis for language pairs that include Kazakh and Russian [12, 13]. The work [12] describes modern approaches to analyzing the sentiment of news articles in Kazakh and Russian using deep recurrent neural networks. These studies show that good results can be achieved even without knowledge of the linguistic features of a particular language. The work also proposes a deep neural network model that uses bilingual word embeddings to solve the sentiment classification problem effectively for a given pair of languages. The authors apply this approach to two corpora covering two language pairs, English-Russian and Russian-Kazakh, and show how to train a classifier in one language and apply it to another. This approach achieves an accuracy of 73% for English and 74% for Russian. Methods proposed for analyzing the sentiment of Kazakh texts include a baseline method, which reaches an accuracy of 60%, and a method for learning bilingual embeddings from a large unlabeled corpus using bilingual word pairs [13]. Sentiment analysis of texts written in Kazakh has not been sufficiently studied; studies for the Kazakh language were conducted in [14,15,16].

Computers are beginning to acquire the ability to recognize emotions. In 1995, Picard [17] reported on the key issues of “affective computing”, that is, computing that relates to, arises from or influences emotions. Since then, many studies have been conducted, a large share of them on recognizing emotions from texts. The work [18] proposes an approach for recognizing emotions using network similarities. It also proposes a model for ranking emotions based on semantic proximity measures, for example, Confidence, PMI and PMING.

Many devices such as smartphones, tablets, cameras and PCs are in use today, and new applications for audio, video and chat are introduced every day. Accordingly, the amount of text, audio and video information is growing, and extracting emotions from textual, graphic, audio and video information is becoming an important task.

Emotions are extracted not only from texts, but also from audio and video content [19, 20] and from images [21]. Such techniques can be applied in social media marketing, brand positioning, election analysis and financial forecasting.

The term “threat” usually refers to a specific unlawful act in which one person expresses, orally, in writing or in another way, an intent to harm another. We consider only written threats, which we call prohibited content. In [22, 23], sentiment analysis is used to detect some prohibited content on the Internet. Detecting prohibited content by means of sentiment analysis is a new direction, and the works that the authors of this article consider useful for further research were published in the last ten years. The work [22] discusses the use of data mining techniques to detect prohibited terrorist activities on the network (Mahesh, Mahesh, Vinayababu 2010). The work [23] describes a system that detects forbidden statements with an accuracy of more than 80% using natural language processing and machine learning techniques.

This work can be considered an introduction and a first attempt to apply the linguistic approach to identifying texts that contain prohibited content of a terrorist nature, written in Kazakh with keywords in English and Russian. For this reason, the article describes the rule-based methods used in sentiment analysis and the approaches used to determine sentiment through the formalization of morphological and syntactic rules.

2 Method for Analyzing Text Polarity

There are three main methods for determining the text polarity:

1. Analysis of a text using vector analysis methods (often with n-gram models): the text is compared with a previously labeled reference corpus using a chosen proximity measure and is classified as negative or positive based on the result of the comparison.

2. Search for emotive vocabulary (lexical polarity) in the text according to pre-compiled polarity dictionaries (lists of patterns) using linguistic analysis. Based on the emotive vocabulary found, the text can be evaluated on a scale that reflects the number of negative and positive words. This method can use both lists of patterns that are inserted into regular expressions and rules for combining polarity vocabulary within a sentence.

3. Mixed method (combination of the first and the second approaches).
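The second method can be illustrated with a minimal Python sketch. The pattern list, the English sample words and the function name `lexicon_polarity` are assumptions made for illustration only; they are not the dictionaries or rules used in this work.

```python
import re

# Hypothetical polarity dictionary: regex pattern -> +1 (positive) or -1 (negative).
# The entries are illustrative and are not taken from the authors' database.
POLARITY_PATTERNS = {
    r"\bgood\b": 1,
    r"\bpeace\w*": 1,
    r"\bterribl\w*": -1,
    r"\bexplo\w*": -1,
}

def lexicon_polarity(text):
    """Count positive and negative pattern matches; return (pos, neg, label)."""
    pos = neg = 0
    for pattern, sign in POLARITY_PATTERNS.items():
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if sign > 0:
            pos += hits
        else:
            neg += hits
    if pos > neg:
        label = "positive"
    elif neg > pos:
        label = "negative"
    else:
        label = "neutral"
    return pos, neg, label

print(lexicon_polarity("The explosion is terrible."))
```

The scale here is simply the pair of counts; a weighted scale or sentence-level combination rules would replace the counting loop.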

The text polarity analysis that we are currently implementing consists of several stages (Fig. 1).

Fig. 1. Stages of the text polarity analysis

At the first stage, web pages are processed by a dedicated parser, which checks whether the pages contain keywords from the database (in our case, keywords of forbidden content).

At the second stage, the text of the selected web pages is processed by a morphological analyzer to determine the parts of speech and the characteristics of each part of speech.

At the third stage, a simplified syntax analyzer works: words and phrases are combined into polarity chains, and the subject, predicate and object are identified in the sentence.

At the fourth stage, a simplified sentiment analyzer determines the polarity of the text.
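The four stages can be chained as in the following toy Python sketch. Every function name, the tiny part-of-speech lexicon and the decision rules (first noun as subject, last verb as predicate, negative if a role filler is a known negative word) are simplifying assumptions for illustration; the actual analyzers are described in the following sections and in the cited works.

```python
def select_page(text, keywords):
    """Stage 1: return the database keywords that occur in the page text."""
    lowered = text.lower()
    return [k for k in keywords if k in lowered]

def tag_words(text, pos_lexicon):
    """Stage 2 (toy): attach a part-of-speech tag to every word, X = unknown."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [(w, pos_lexicon.get(w, "X")) for w in words if w]

def find_roles(tagged):
    """Stage 3 (toy): first noun is the subject, last verb is the predicate."""
    roles = {}
    for word, pos in tagged:
        if pos == "N" and "S" not in roles:
            roles["S"] = word
        if pos == "V":
            roles["P"] = word
    return roles

def polarity(roles, negative_words):
    """Stage 4 (toy): negative if any role filler is a known negative word."""
    hit = any(w in negative_words for w in roles.values())
    return "negative" if hit else "neutral"

def analyze(text, keywords, pos_lexicon, negative_words):
    """Run the four stages in order; pages without keywords are skipped."""
    hits = select_page(text, keywords)
    if not hits:
        return None
    tagged = tag_words(text, pos_lexicon)
    roles = find_roles(tagged)
    return {"keywords": hits, "tagged": tagged, "roles": roles,
            "polarity": polarity(roles, negative_words)}

result = analyze("The bomb exploded.", ["bomb"],
                 {"bomb": "N", "exploded": "V"}, {"exploded"})
print(result)
```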

3 Parser and Database

A parser is software for collecting textual data and converting it into a structured format. Web pages are processed by a dedicated parser, which checks whether they contain keywords from the knowledge base (in our case, keywords of forbidden content). The operation of the parser is simple, and its effectiveness depends entirely on the knowledge base. The interface of the parser is shown in Fig. 2.
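A keyword check of this kind might look like the sketch below, which uses only the Python standard library. The `KEYWORDS` table, its two sample concepts and the function names are hypothetical; the real knowledge base and the parser's matching logic are not specified in this paper. Note that this simple extractor also collects text inside script and style elements.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML page."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def page_text(html):
    extractor = _TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.chunks)

# Hypothetical knowledge base: concept -> keyword form per language
# (en = English, ru = Russian, kk = Kazakh). Sample entries only.
KEYWORDS = {
    "bomb": {"en": "bomb", "ru": "бомба", "kk": "бомба"},
    "explosion": {"en": "explosion", "ru": "взрыв", "kk": "жарылыс"},
}

def find_keywords(html):
    """Return the set of (concept, language) pairs whose keyword occurs in the page."""
    text = page_text(html).lower()
    return {(concept, lang)
            for concept, forms in KEYWORDS.items()
            for lang, form in forms.items()
            if form in text}

print(find_keywords("<p>Кеше жарылыс болды. A bomb was found.</p>"))
```

Plain substring matching is used for brevity; for an agglutinative language, matching on stems produced by the morphological analyzer would be more reliable.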

Fig. 2. The interface of the parser

WordNet [24] was used as a prototype of the database structure for the parser, and SENTIWORDNET [25] was used at the sentiment analysis stage. The database was filled symmetrically in three languages: Kazakh, English and Russian. This is because Kazakhstanis often use a mixed language; for example, they speak Kazakh while using terms or set expressions from Russian and/or English.

WordNet is a large lexical database of the English language. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each of which expresses a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing. WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important distinctions. First, WordNet interlinks not just word forms (strings of letters) but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated. Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity [24].

The work [25] presents SENTIWORDNET 3.0, an enhanced lexical resource explicitly devised for supporting sentiment classification and opinion mining applications (Table 1).

Table 1. Fragment of the table in the WordNet format
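As a rough illustration of a SENTIWORDNET-style record filled symmetrically in three languages, consider the Python sketch below. In SENTIWORDNET each synset carries a positivity and a negativity score, and the objectivity score is the remainder to one. The field names, the sample identifier, the sample terms and the scores are assumptions for illustration and do not reproduce the authors' table.

```python
from dataclasses import dataclass

LANGUAGES = ("kk", "en", "ru")  # Kazakh, English, Russian

@dataclass(frozen=True)
class SentiEntry:
    synset_id: str      # hypothetical identifier
    pos: str            # part of speech, e.g. "n" for noun
    terms: dict         # language code -> list of terms
    pos_score: float    # positivity, between 0 and 1
    neg_score: float    # negativity, between 0 and 1

    def __post_init__(self):
        if self.pos_score < 0 or self.neg_score < 0:
            raise ValueError("scores must be non-negative")
        if self.pos_score + self.neg_score > 1:
            raise ValueError("pos_score + neg_score must not exceed 1")

    @property
    def obj_score(self):
        """Objectivity is what remains after positivity and negativity."""
        return 1.0 - (self.pos_score + self.neg_score)

    def is_symmetric(self):
        """True if the entry has at least one term in every language."""
        return all(self.terms.get(lang) for lang in LANGUAGES)

entry = SentiEntry("demo-0001", "n",
                   {"kk": ["жарылыс"], "en": ["explosion"], "ru": ["взрыв"]},
                   pos_score=0.0, neg_score=0.625)
print(entry.obj_score, entry.is_symmetric())
```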

The social networks “Vkontakte” and “Facebook” and selected foreign Internet resources distributing terrorist content were chosen for the research and for replenishing the database. To begin with, we directly reviewed the individual profiles of social network users that mention prohibited words from the list provided by the US Department of Homeland Security in 2011, which is used to analyze networks for prohibited content [26].

4 Morphological Analysis

The Kazakh language is a typical Turkic language that retains most of the features common to this group and has a number of typical Kypchak features. Structurally and typologically, Kazakh is an agglutinative language. As a rule, a description of the agglutinative type relies on a set of features that covers not only phonetic, but also morphological and syntactic properties. The order of the endings in Kazakh is strictly defined. For example, for nouns, the plural endings are added to the stem first, then the possessive endings (meaning that the object belongs to a person), then the case endings and finally the personal (conjugation) endings, which are added only to animate nouns. In general, Kazakh can be regarded as a well-formalized language. We have developed a rule-based morphological analyzer, which is described in [27,28,29].
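The fixed order of noun endings lends itself to a simple check, sketched below in Python. The tag names (`PLURAL`, `POSSESSIVE`, `CASE`, `PERSONAL`) and the function are hypothetical and operate on already segmented affix tags; this is not the analyzer of [27,28,29].

```python
# Slot order for noun endings as described in the text:
# plural, then possessive, then case, then personal endings.
SLOT_ORDER = ["PLURAL", "POSSESSIVE", "CASE", "PERSONAL"]

def valid_affix_order(tags):
    """Return True if the affix tags follow the fixed slot order.

    Each slot may be used at most once and slots may be skipped.
    Unknown tags raise ValueError.
    """
    positions = [SLOT_ORDER.index(tag) for tag in tags]
    return all(a < b for a, b in zip(positions, positions[1:]))

# Illustrative segmentation: stem + plural + possessive + case.
print(valid_affix_order(["PLURAL", "POSSESSIVE", "CASE"]))
print(valid_affix_order(["CASE", "PLURAL"]))
```

Because the order is strict, a single left-to-right pass over the affix tags is enough to reject ill-formed sequences.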

5 Syntax Analyzer

Visual inspection of prohibited content shows that it consists only of simple sentences, that is, it contains no complex or compound sentences. It follows that, to build a syntax analyzer for prohibited content, it is sufficient to formalize and implement the syntax rules for simple sentences in the Kazakh language, that is, to build a simplified parser.

There are 20 types of simple sentences in the Kazakh language. The formalization of the syntax rules of simple sentences and the construction of the corresponding syntactic analyzer are given in [30]. Some examples of simple sentences containing forbidden fragments are provided below.

A simple sentence consisting of the subject will be represented as:

$$ \text{SS}\,(\text{Q}(\text{Q1}(\text{S}))\,\text{M}(\text{M1}(\text{N}\,\text{Adj}\,\text{Pron}\,\text{Adv}))) $$
(1)

where SS is a simple sentence; Q is the structure; Q1 is the structure with the first index; S is the subject; M is the semantics; M1 is the semantics with the first index; N is a noun; Adj is an adjective; Pron is a pronoun; Adv is an adverb.

Example: Bomb. Explosion.

A simple sentence consisting of the subject and the predicate will be presented as:

$$ \text{SS}\,(\text{Q}(\text{Q2}(\text{S}\,\text{P}))\,\text{M}(\text{M2}(\text{MS}\,(\text{N}\,\text{Adv}\,\text{Num})\,\text{MP}\,(\text{N}\,\text{V}\,\text{Adj}\,\text{Adv}\,\text{Num})))) $$
(2)

where SS is a simple sentence; Q is the structure; Q2 is the structure with the second index; S is the subject; P is the predicate; M is the semantics; M2 is the semantics with the second index; MS is the semantics of the subject (N, noun; Adv, adverb; Num, numeral); MP is the semantics of the predicate (N, noun; V, verb; Adj, adjective; Adv, adverb; Num, numeral).

Example: The bomb exploded. The explosion occurred. Set fire. The explosion is terrible. The building was crushed. Four blew up. Infidels died.

A simple sentence consisting of the subject, the object and the predicate will be presented as:

$$ {\text{SS}}({\text{Q}}({\text{Q3}}({\text{S}}\,{\text{O}}\,{\text{P}})){\text{M}}({\text{M3}}({\text{MS}}\,\left( {{\text{N}}\,{\text{Pron}}\,{\text{Num}}\,{\text{Adj}}} \right)\,{\text{MA}}\,\left( {{\text{N}}\,{\text{Pron}}\,{\text{Num}}\,{\text{Adj}}} \right)\,{\text{MP}}\,({\text{N}}\,{\text{V}})))) $$
(3)

where SS is a simple sentence; Q is the structure; Q3 is the structure with the third index; S is the subject; O is the object; P is the predicate; M is the semantics; M3 is the semantics with the third index; MS is the semantics of the subject (N, noun; Pron, pronoun; Num, numeral; Adj, adjective); MA is the semantics of the object (N, noun; Pron, pronoun; Num, numeral; Adj, adjective); MP is the semantics of the predicate (N, noun; V, verb).

Example: Terrorists crashed the city. Five soldiers got killed.

The complete list of formal syntactic rules for simple sentences is as follows:

SS (Q(Q1(S)) M(M1(N Adj Pron Adv)))

SS(Q(Q2(S P)) M(M2(MS (N Adv Num) MP (N V Adj Adv Num))))

SS(Q(Q3(S O P)) M(M3(MS (N Pron Num Adj) MA(N Pron Num Adj) MP (N V)))).

SS(Q(Q4(S C P)) M(M4(MS (N Pron Num Adj) MC(N Adv Num Adj) MP (V)))).

SS(Q(Q5(S A C P)) M(M5(MS (N Pron Num Adj) MA(N Adj Num Pron) MC (N Adv) MP(V)))).

SS(Q(Q6(S C A P)) M(M6(MS (N Pron Num Adj) MC(N Adv Pron) MA (N Pron Adj Num) MP(V)))).

SS(Q(Q7(S D A P)) M(M7(MS (N Pron Adj Num) MD(Adj Pron Num Adv) MA (N Adj Num) MP(V)))).

SS(Q(Q8(A D S P)) M(M8(MA (N Pron Adj Num) MD(Adj Adv Num) MS (N Num Adj) MP(V)))).

SS(Q(Q9(A C S P)) M(M9(MA (N Pron Adj Num) MC(Adv Num) MS (N Pron Adj Num) MP(V)))).

SS(Q(Q10(A S C P)) M(M10(MA (N Pron Adj Num) MS(N Pron Num Adv Adj) MC (Adv Num) MP(V)))).

SS(Q(Q11(C S A P)) M(M11(MC (N Num Adv) MS(N Pron Num Adv Adj) MA (Adv Num) MP(V)))).

SS(Q(Q12(D S A P)) M(M12(MD (Adj Num Adv Pron) MS(N Adj) MA (N Adj Adv Num) MP(V)))).

SS(Q(Q13(D S C P)) M(M13(MD (Adj Num Adv Pron) MS(N Adj) MC (N Adj Adv Num) MP(V)))).

SS(Q(Q14(S C D A P)) M(M14(MS (N Pron Adj Num) MC(N Adv) MD (N Adj Adv Num) MA (N Adj Num) MP(V)))).

SS(Q(Q15(S D A C P)) M(M15(MS (N Pron Adj Num) MD(Adj Adv Num) MA (N Adj Num) MC (N Adv Adj) MP(V)))).

SS(Q(Q16(C S D A P)) M(M16(MC (N Adv Num) MS(N Pron Num Adj) MD(Num Adj) MA(N Num Adj) MP(V)))).

SS(Q(Q17(C A D S P)) M(M17(MC (N Adv Num) MA(N Pron Num Adj) MD(Num Adj) MS(N Pron Num Adj) MP(V)))).

SS(Q(Q18(D S C A P)) M(M18(MD (Adj Adv Num) MS(N Pron Num Adj) MC(Adv N) MA(N Pron Num Adj) MP(V)))).

SS(Q(Q19(D S A C P)) M(M19(MD (Adj Adv Num) MS(N Pron Num Adj) MA(N Pron Num Adj) MC(Adv N) MP(V)))).

SS(Q(Q20(D A S C P)) M(M20(MD (Adj Adv Num) MA(N Pron Num Adj) MS(N Pron Num Adj) MC(Adv N) MP(V)))).
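To show how rules written in this notation could be used mechanically, the Python sketch below parses the first three rules and tests a sequence of part-of-speech tags against them. It assumes, as a simplification, that each sentence member is represented by one head word and that the semantic groups are listed in the same order as the members of the structure. The function names are hypothetical, and this is not the analyzer described in [30].

```python
import re

# The first three rules, copied from the list above.
RULES = [
    "SS(Q(Q1(S)) M(M1(N Adj Pron Adv)))",
    "SS(Q(Q2(S P)) M(M2(MS (N Adv Num) MP (N V Adj Adv Num))))",
    "SS(Q(Q3(S O P)) M(M3(MS (N Pron Num Adj) MA(N Pron Num Adj) MP (N V))))",
]

def parse_rule(rule):
    """Return (index, [(member, allowed part-of-speech set), ...]) for one rule."""
    index = int(re.search(r"Q(\d+)\(", rule).group(1))
    members = re.search(r"Q\d+\(([^()]*)\)", rule).group(1).split()
    # Semantic groups such as "MS (N Adv Num)"; rule 1 has a bare "M1(...)".
    groups = re.findall(r"M[A-Z]\s*\(([^()]*)\)", rule)
    if not groups:
        groups = [re.search(r"M\d+\(([^()]*)\)", rule).group(1)]
    allowed = [set(g.split()) for g in groups]
    if len(allowed) != len(members):
        raise ValueError("structure and semantics do not align: " + rule)
    return index, list(zip(members, allowed))

def matching_rules(pos_tags):
    """Return the indices of the rules that accept the given tag sequence."""
    result = []
    for rule in RULES:
        index, slots = parse_rule(rule)
        if len(slots) == len(pos_tags) and all(
                tag in allowed for tag, (_, allowed) in zip(pos_tags, slots)):
            result.append(index)
    return result

print(matching_rules(["N", "V"]))       # e.g. "The bomb exploded."
print(matching_rules(["N", "N", "V"]))  # subject, object, predicate
```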

The input of the syntactic analyzer is the text containing elements of forbidden content with morphological tags.

At this stage, the initial syntax analysis is performed: words and phrases are combined into polarity chains, and the subject, predicate and object are identified in the sentence. The output of the syntactic analyzer is the text containing forbidden content, marked with morphological and syntactic tags.

6 Sentiment Analysis

At this stage, the polarity object is identified. It is set by the user or determined automatically: in each sentence, a so-called named entity is searched for, for example, a proper name or an animate noun. For a given object, the polarity of the text is calculated [14,15,16, 31]. The sentiment analysis stage uses a database of emotionally colored words and phrases that refer to prohibited content (hereinafter, prohibited fragments). In our case, we use database tables that have the SENTIWORDNET database structure (Table 2).

Table 2. Fragment of the table in the SENTI WORDNET format
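One simple way to calculate polarity for a given object is sketched below in Python: only the sentences that mention the object are considered, and the difference between the positivity and negativity scores of the words found in them is summed. The scoring rule, the `SCORES` table and its values are illustrative assumptions; the paper does not specify the formula, and the values are not taken from SENTIWORDNET.

```python
# Hypothetical (positivity, negativity) scores for a few words.
SCORES = {
    "explosion": (0.0, 0.625),
    "killed": (0.0, 0.75),
    "peace": (0.5, 0.0),
}

def object_polarity(sentences, target):
    """Sum (positivity - negativity) over sentences that mention the target."""
    total = 0.0
    target = target.lower()
    for sentence in sentences:
        words = [w.strip(".,!?").lower() for w in sentence.split()]
        if target not in words:
            continue  # the sentence does not concern the polarity object
        for word in words:
            pos, neg = SCORES.get(word, (0.0, 0.0))
            total += pos - neg
    return total

sentences = [
    "Terrorists caused an explosion.",
    "Five soldiers got killed.",
    "The city wants peace.",
]
print(object_polarity(sentences, "terrorists"))
```

A negative total indicates negative polarity toward the object; a threshold on this value would be needed to flag text as extremely negative.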

7 Evaluation of Results

Methods for objective testing of text markup systems have not yet been developed. Therefore, the testing method we currently use is based on periodic subjective evaluations of small text collections by an expert [32].

8 Conclusion

In this paper, we have developed a method for analyzing the polarity of Kazakh texts related to terrorist threats. A database for the parser and for sentiment analysis was developed. The syntax rules of simple sentences in the Kazakh language, which are sufficient for representing texts related to terrorist threats, have been formalized. The stages of morphological, syntactic and sentiment analysis of texts in the Kazakh language are described.

Further development of this work and replenishment of the database and knowledge base of prohibited content will make it possible to detect sites that conduct terrorist propaganda. Currently, the database contains 1200 entries, which allowed us to detect more than 50 such sites. Content analysis of these sites allowed us to expand the database with new keywords. Previously obtained results of determining the polarity of texts made it possible to automatically compile a list of sites with extremely negative content. We hope that this work will be developed further because of its practical significance for society.