1 Introduction

Emotion recognition is the process of identifying human emotions, which vary considerably from person to person. Automatic human emotion recognition is a relatively new area of research.

Recent advances in artificial intelligence, deep learning, human-friendly robotics, and cognitive science are being used to develop the field of affective computing and to approach the creation of emotional machines [1,2,3].

Currently, there are works on automatic recognition of emotions from facial expressions in video [4,5,6,7,8], from the rhythm of the voice in audio [9,10,11,12], and from the writing style of texts [13,14,15]. Various studies show that emotions play a vital role in e-learning [5, 16, 17]. Likewise, over the past few decades, researchers in computer-based collaborative learning have focused on improving emotion recognition in learning environments [18]. Teachers can then adapt their teaching style to the needs of the students.

Emotions play an important role in analyzing a student's interest and learning outcomes in a course. Reading facial expressions is the fastest way to detect emotions [19, 20]. Results on sentiment analysis and emotion recognition for the Kazakh language have been published in [21,22,23,24,25].

Recently, distance learning has become a major format in education. Due to the COVID-19 pandemic, it has become a safe and viable option for lifelong learning.

In the new epidemic realities, the role of distance learning in education has greatly increased all over the world: a huge number of people have switched to remote work, and schoolchildren and students study remotely. According to UNESCO, more than 91% of the world's students have been affected by school closures due to the COVID-19 pandemic. Even before the pandemic, the global e-education market was already seeing massive annual growth. The mass transition of education to the distance format has become a serious challenge for universities, teachers, and students alike.

Accordingly, the education system must provide all students with equal access to quality education during this crisis.

This gave a powerful impetus to the development of distance learning. According to a UNESCO study, most of the 61 countries surveyed have implemented some form of distance learning. The digital format of education is likely to become even more popular in the post-pandemic period, because it is effective and affordable [26, 27].

Our university also switched to a distance learning format. During distance learning, the Microsoft Teams corporate platform is used, where online classes are recorded on video, and during the examination session a proctoring system is used for computer-based testing. We therefore have a database of video recordings of the computer testing process, with video captured from the webcam and audio from the microphone. During examinations, many students read questions aloud and talk to themselves; their speech was recorded and their emotionality was noted by the proctors, so we decided to use the resulting audio material to analyze emotions. This paper describes a theoretical method for determining sentiment/emotion based on speech recognition. The method can be used to measure student emotions during distance learning and online examinations without geographic or cultural limitations.

2 Speech Recognition

This experimental work was carried out to determine sentiment from specific words using a recognition method based on generalized transcriptions. This method, described in [28], was previously used for general Kazakh word recognition [29]; here, for the first time, we propose to apply it to emotion recognition.

2.1 Structural Classification of Kazakh Words and Use of Generalized Transcriptions

This section presents some statistics about the structure of Kazakh words. They are interesting in themselves and, moreover, serve as a basis for using generalized transcriptions. Let us divide all the symbols of the Kazakh alphabet into several natural classes:

  • W - аұыоеәүіөу

  • C - бвгғджзйлмнңр

  • F - сш

  • P - кқптфх

Here “W” denotes the vowels plus the consonant “у”, which leaves the vocal tract open when pronounced; “C” the voiced consonants; “F” the voiceless hush consonants; and “P” the voiceless consonants that, when pronounced, produce a pause within a word. Suppose there is a sufficiently large dictionary of Kazakh words; in our case it is a dictionary of initial forms containing 41,791 words. Let us mark it up, replacing each symbol by the label of its class.

Words with the same marks are deemed to have a similar structure. Thus, the structure is a model of the alternation of vowels, consonants, hush sounds, etc. It turns out that the number of Kazakh words sharing a structure is relatively small. For example, all the words with the structure WCCWFPWC are as follows:

  • алжасқан WCCWFPWC

  • алмастыр WCCWFPWC

  • ойластыр WCCWFPWC

  • үндескен WCCWFPWC

  • алдаспан WCCWFPWC

Here are the words with the structure WCWCWCPW:

  • ағарыңқы WCWCWCPW

  • амазонка WCWCWCPW

  • ұғыныңқы WCWCWCPW

The maximal number of words with a similar structure, CWCWC, is equal to 201, which amounts to about 0.5%. Moreover, this is practically an exclusive case; all other structures contain far fewer words. We have verified this with a program that automatically marks up words and selects those with a similar structure. The selection of classes could also be changed [30, 31].
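To make the marking-up concrete, here is a minimal Python sketch of the procedure; the class alphabets are taken from the W/C/F/P definitions above, and grouping words by their structure string reproduces counts like those quoted in this section.

```python
from collections import defaultdict

# Structural classes of Kazakh letters, as defined above.
CLASSES = {
    "W": "аұыоеәүіөу",     # vowels plus "у" (vocal tract stays open)
    "C": "бвгғджзйлмнңр",  # voiced consonants
    "F": "сш",             # voiceless hush consonants
    "P": "кқптфх",         # voiceless "pause-like" consonants
}
LETTER_TO_CLASS = {ch: cls for cls, letters in CLASSES.items() for ch in letters}

def structure(word: str) -> str:
    """Replace each letter with its class symbol, e.g. 'алмастыр' -> 'WCCWFPWC'."""
    return "".join(LETTER_TO_CLASS.get(ch, "?") for ch in word.lower())

def group_by_structure(words):
    """Group dictionary words by their structure string to count words per structure."""
    groups = defaultdict(list)
    for w in words:
        groups[structure(w)].append(w)
    return groups

print(structure("алмастыр"))  # WCCWFPWC
```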

The generalized transcription described is constructed in the following way. First, the voiceless consonants are identified in the recorded and processed sound signal, as described above. This is based on processing the signal with a band-pass filter whose transmission range is from 100 to 200 Hz; the coefficients of this filter are computed according to the following equation:

$$ a_{k} = a_{k,2} - a_{k,1} $$
(1)

where \(a_{k,2}\) are the coefficients of the low-pass filter with a cutoff frequency of 200 Hz, and \(a_{k,1}\) are the coefficients of the low-pass filter with a cutoff frequency of 100 Hz.

The coefficients of these low-pass filters are computed according to the following equation:

$$ a_{k} = a_{ - k} = \frac{{2f_{0} }}{f}\frac{{{\text{sin}}\left( {k2\pi \frac{{f_{0} }}{f}} \right)}}{{k2\pi \frac{{f_{0} }}{f}}} $$
(2)

where \(k\) is the filter order, \(f_{0}\) is the transmission frequency of the filter, and \(f\) is the sampling frequency of the signal.
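For illustration, here is a minimal NumPy sketch of Eqs. (1) and (2): two ideal low-pass (sinc) prototypes with transmission frequencies of 100 and 200 Hz are subtracted to obtain the 100-200 Hz band-pass coefficients. The filter order and sampling frequency below are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def lowpass_coeffs(f0: float, fs: float, order: int) -> np.ndarray:
    """Ideal low-pass FIR coefficients per Eq. (2):
    a_k = a_{-k} = (2 f0 / f) * sin(2*pi*k*f0/f) / (2*pi*k*f0/f)."""
    k = np.arange(-order, order + 1)
    # np.sinc(x) = sin(pi*x)/(pi*x), so np.sinc(2*k*f0/fs) matches Eq. (2).
    return (2.0 * f0 / fs) * np.sinc(2.0 * k * f0 / fs)

def bandpass_coeffs(fs: float = 8000.0, order: int = 64) -> np.ndarray:
    """Band-pass 100-200 Hz per Eq. (1): difference of the two low-pass filters."""
    return lowpass_coeffs(200.0, fs, order) - lowpass_coeffs(100.0, fs, order)

# Example usage (x is a 1-D signal array sampled at fs):
# y = np.convolve(x, bandpass_coeffs(fs=8000.0), mode="same")
```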

The sounds in question differ from all the others in that, after such filtering, their fragments become similar to a pause and contain a large number of constancy points. Thus, on these fragments the difference between the number of inconstancy points and the number of constancy points is negative, which makes it possible to distinguish them in the array of such differences computed over a sequence of windows of 256 samples each.

Next, the hush and pause-like sounds are identified within the resulting fragments. An analogue of the total variation with a variable upper limit is computed:

$$ V\left( 0 \right) = 0,{ }V\left( n \right) = \sum\nolimits_{i = 0}^{n - 1} {\left| {x_{i + 1} - x_{i} } \right|} { } $$
(3)

Let \(N_{1}\) be the maximal number such that \(V\left( {N_{1} } \right) \le 255\), \(N_{2}\) the maximal number such that \(V\left( {N_{2} } \right) - V\left( {N_{1} } \right) \le 255\), and so on. As a result, the following array of numbers is obtained:

$$ N_{1} ,N_{2} - N_{1} ,N_{3} - N_{2} \ldots $$
(4)

On a hush segment, the value (3) grows rapidly, i.e., the numbers (4) are relatively small. On a pause segment, the value (3) grows slowly, and consequently the numbers (4) are relatively large. To distinguish between a hush and a pause, a threshold is introduced into the system; it is set to 120.
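The following is a minimal sketch of Eqs. (3) and (4) together with the hush/pause decision, assuming 8-bit sample amplitudes (hence the cap of 255) and comparing the average piece length against the threshold of 120; the exact decision rule may differ from the authors' implementation.

```python
import numpy as np

def variation_lengths(x: np.ndarray, cap: float = 255.0) -> list:
    """Cut the signal into maximal pieces whose total variation (Eq. 3)
    stays <= cap; return the piece lengths N1, N2 - N1, N3 - N2, ... (Eq. 4)."""
    lengths, acc, count = [], 0.0, 0
    for d in np.abs(np.diff(x.astype(float))):
        if acc + d > cap:        # this piece would exceed the cap: close it here
            lengths.append(count)
            acc, count = 0.0, 0
        acc += d
        count += 1
    if count:
        lengths.append(count)
    return lengths

def classify_fragment(x: np.ndarray, threshold: float = 120.0) -> str:
    """Hushes make V(n) grow fast (short pieces); pauses make it grow
    slowly (long pieces). Compare the mean piece length to the threshold."""
    lengths = variation_lengths(x)
    mean_len = float(np.mean(lengths)) if lengths else float(len(x))
    return "pause" if mean_len > threshold else "hush"
```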

Once the hushes and pauses have been identified, the vowels and voiced consonants are selected. The remaining fragments are divided into windows of 256 samples, the total variation of each window is calculated by Eq. (3), and the average of these values is taken as the limit. All values above the average are marked “B”, and those below it “H”. The interval over which this procedure is carried out is then shifted one window to the right and the procedure is repeated, continuing until the end of the interval falls outside the boundaries of the fragment (Fig. 1).

Fig. 1. Signal segmentation

Segmentation marks are placed at the points where the symbol “H” changes to “B” or “B” to “H”. A B-fragment is deemed to correspond to a vowel (the symbol W is placed at its left mark), and an H-fragment to a voiced consonant (the symbol C is placed at its left mark) [30,31,32].

This algorithm also makes it possible to build an even more generalized transcription, namely to divide all sounds into two natural classes: vowels and consonants. Such a division gives good results on small dictionaries as well.
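Here is a minimal sketch of the windowed B/H marking described above. For simplicity it uses one global average as the limit rather than the sliding interval described in the text; the window size of 256 samples follows the paper.

```python
import numpy as np

def mark_vowels_consonants(x: np.ndarray, win: int = 256):
    """Label 256-sample windows 'B' (above-average variation) or 'H'
    (below-average variation) and place marks where the label flips."""
    n = len(x) // win
    variation = np.array([
        np.sum(np.abs(np.diff(x[i * win:(i + 1) * win].astype(float))))
        for i in range(n)
    ])
    limit = variation.mean()  # average window variation used as the limit
    labels = ["B" if v > limit else "H" for v in variation]
    # B-fragments correspond to vowels (W), H-fragments to voiced consonants (C).
    marks = [(i * win, "W" if labels[i] == "B" else "C")
             for i in range(1, n) if labels[i] != labels[i - 1]]
    return labels, marks
```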

2.2 Construction of Generalized Transcriptions of Sentiment Dictionary

The sentiment dictionary obtained in [34] allows us to construct generalized transcriptions of its entries and to apply them to search for emotionally colored words.

Examples of emotional words in the Kazakh language with their generalized transcriptions are given in Table 1.

Table 1. Examples of sentiment words with generalized transcription

3 Sentiment Analysis

3.1 Dataset

Table 2 below shows the number of videos with audio recordings of the examination process.

Table 2. Number of videos with audio recordings

3.2 Sentiment Detection Process

The recorded speech is recognized, and the resulting text is then analyzed to determine its sentiment. The process is shown in Fig. 2. Models and methods for determining the sentiment of Kazakh-language texts are described in [22, 24, 33, 35].

Fig. 2. Sentiment detection process

After speech recognition, the words that express emotions are extracted and tagged into three classes: Positive, Neutral, and Negative (Table 3). This tagged dataset will be used as the test set.

Table 3. Sentiment classes

Example:

Recognized words             Sentiment class
senimdimin (I’m sure)        Positive
daiyndaldym (prepared)       Positive
bilemin (I know)             Positive
aqymaq (stupid)              Negative
qobalzhimyn (I’m worried)    Negative
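To illustrate this tagging step, here is a minimal dictionary-lookup sketch. The mini-lexicon below is built only from the example words above and is hypothetical; the actual system uses the full sentiment dictionary with generalized transcriptions [34], and words absent from the lexicon are treated as Neutral.

```python
# Hypothetical mini-lexicon built from the example above; the real system
# uses the full sentiment dictionary with generalized transcriptions [34].
SENTIMENT_LEXICON = {
    "senimdimin": "Positive",   # "I'm sure"
    "daiyndaldym": "Positive",  # "prepared"
    "bilemin": "Positive",      # "I know"
    "aqymaq": "Negative",       # "stupid"
    "qobalzhimyn": "Negative",  # "I'm worried"
}

def tag_sentiment(recognized_words):
    """Assign Positive/Neutral/Negative to each recognized word by lexicon lookup."""
    return [(w, SENTIMENT_LEXICON.get(w.lower(), "Neutral")) for w in recognized_words]

# "kitap" ("book") is a hypothetical out-of-lexicon word, tagged Neutral.
print(tag_sentiment(["bilemin", "aqymaq", "kitap"]))
# [('bilemin', 'Positive'), ('aqymaq', 'Negative'), ('kitap', 'Neutral')]
```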

4 Conclusion

This work is devoted to the study and solution of the problem of sentiment analysis of students’ speech. As a result of studying models and methods of speech sentiment analysis for the Kazakh language, a semantic base of emotional words with generalized transcriptions was obtained. Models for detecting emotion in audio recorded during distance examinations were proposed and implemented. Experimental work is currently in progress. The implementation and application of this model can improve the interaction between teacher and student, improve the quality of distance learning, and help personalize education. We plan to complete the experimental work and compare the model with other state-of-the-art methods. In the future, we also plan to study the video files and determine emotion from images and video.