
1 Introduction

The identification, classification and analysis of emotions is a fertile, active and open research area within the field of pattern recognition. Historically, the richest source of information for emotion detection has been text, and the availability of text sources for sentiment analysis surged over the last two decades with the massive spread of the Internet [1]. Moreover, the rise of web-based social networks, designed specifically for social interaction, made it easy to share sentiments, generating massive amounts of information to be mined for insight into the human psyche [1, 2]. Since emotion detection and classification has traditionally been performed (mostly on text sources) with techniques of Artificial Intelligence (AI), sentiment analysis is commonly regarded as an area of that same field [3].

Furthermore, the rise of social networks also allowed people to find new ways of expressing their emotions, using content such as emoticons, memes, audio and video [1, 4], which calls for methods that extend sentiment analysis to these novel sources of information. Accordingly, much research has been carried out on emotion analysis of social network content, mostly based on the analysis of text/comments with AI techniques [5,6,7,8,9,10,11]. Applications in this field include healthcare [12, 13], social behavioural assessment [14, 15], touristic perception [16], identification of trends in conflicting versus non-conflicting regions [17], evaluation of influence propagation models on social networks [18], and emotion identification in text/emoticons/emojis [19], among many others. A review of textual emotion analysis can be found in [20].

With respect to images, the analysis has focused mainly on facial emotion recognition through the combination of AI and digital image processing techniques [21]. In [22] a 2D canonical correlation was implemented, [23] combines distances between facial landmarks with a genetic algorithm, [24] used a deep learning approach to identify the emotions of painters from their artwork, [25] used maximum likelihood distributions to detect neutral, fear, pain, pleasure and laugh expressions in video stills, [26] uses a multimodal Graph Convolutional Network to perform a joint analysis of aesthetic and emotional feelings within images, [27] uses improved local binary patterns and wavelet transforms to assess the learning states and emotions of students in online learning, [28] uses principal component analysis and deep learning methods to identify emotion in children with autism spectrum disorder, [29] used facial thermal images, deep reinforcement learning and IoT robotic devices to assess attention-deficit hyperactivity disorder in children, while [30] fuzzifies emotion categories in images to assess them through deep metric learning. A recent review of techniques for emotion detection in facial images is reported in [31].

However, a much less studied area within emotion recognition is emotion analysis of audio sources, specifically voice/speech. The first attempts were based on classifying emotions by parameters of the audio signal; for instance, for English and Malay voices and six emotions, the average pitch, the pitch range and jitter were assessed for both male and female voices, finding that language does not affect emotional speech [32], while [33] applied data mining algorithms to prosody parameters extracted from non-professional voice actors. Also, [34] extracted 65 acoustic parameters to assess anger, contempt, fear, happiness, interest, lust, neutral, pride, relief, sadness, and shame emotional states in over 100 professional actors from five English-speaking countries. Later, medical technology was applied, using functional magnetic resonance imaging to measure brain activity while the patient was giving a speech, which in turn was recorded and computer-processed [35]. More algorithmic approaches were developed later, such as fuzzy logic reasoners [36, 37], discriminant analysis focused on nursing experience [38], the use of statistical similarity measurements to categorise sentiments in acted, natural and induced speech [39, 40], and the use of subjective psychological criteria to improve voice database design, parametrisation and classification schemes [41], among others. Machine learning approaches have also been developed, such as the recognition of positive, neutral and negative emotions in the spontaneous speech of children with autism spectrum disorders through support vector machines [42], the application of the k-nearest neighbour method to signal parameters such as pitch, temporal features and duration in theatrical plays for the identification of happy, angry, fearful and neutral emotions [43], the simultaneous use of ant colony optimisation and k-nearest neighbour algorithms to improve the efficiency of speech emotion recognition, focusing only on the spectral roll-off, spectral centroid, spectral flux, log energy, and formants at a few chosen frequency sub-bands [44], as well as the real-time analysis of TV debate speech through a deep learning approach in the parameter space [45]. In the field of neural networks, a neurolinguistic processing model based on neural networks to jointly analyse voice, through acoustic parameters such as tone, pitch, rate, intensity and meaning, along with text analysis based on linguistic features, was developed by [46], while [47] proposes the use of a multi-layer perceptron neural network to classify emotions by the Mel frequency cepstral coefficients, the Mel-scaled spectrogram, and the chroma and tonnetz parameters. Moreover, some studies suggest that, when available, the joint analysis of voice and facial expressions could lead to better performance in emotion classification than either technique used separately [48].

As can be observed, there are two main approaches to the problem of emotion analysis in voice/speech recordings, which can also be used together: the direct analysis of parameters derived from the sound signal, and the use of AI techniques at many levels to build recognition and classification schemes. The main drawback of AI-based systems is that they require a training process that may be prone to bias and depends heavily on the training dataset, which may be inappropriately split [49]; moreover, the presence of hidden variables, as well as mistaking the real objective, are common pitfalls in the field [49, 50]. Additional drawbacks are the large amounts of time and computing resources required to train AI-based systems. In [51] and [52] the question of how to build representative AI models in general is explored.

In this work, we deviate from the traditional approaches to the problem of emotion analysis and explore a novel approach that regards the voice/speech recording as an information source in the framework of Shannon’s information theory [53]. In particular, we compute the information entropy of a voice/speech signal in order to classify emotion into two categories, positive and negative, by generating an alphabet consisting of the frequency content of a human-audible sub-band. Although Shannon entropy has previously been used for pattern recognition in sound, it has been applied mainly to the classification of heart sounds [54, 55]. The outcome shows that this approach is suitable for a very fast automatic classification of positive and negative emotions, which by its nature requires no training phase. This work is organised as follows: in Sect. 2 we present the required theoretical background as well as the dataset in use, while in Sect. 3 we describe the procedure followed along with the obtained results. Finally, in Sects. 4 and 5 we offer some final remarks as well as possible paths to extend the presented work.

2 Materials and Methods

2.1 Frequency Domain Analysis

Since the inception of frequency-domain analysis by Joseph Fourier in 1822 [56], Fourier series for periodic waveforms and the Fourier transform for non-periodic ones have been cornerstones of modern mathematical and numerical analysis. The Fourier transform takes a time series into the frequency domain, thereby providing its frequency content. Both continuous and discrete waveforms can be analysed through Fourier analysis. In this work, we focus on discrete time series because most audio sources available nowadays are binary files stored, processed and transmitted by digital computers. Let x(n) be a discrete time series of finite energy; its Fourier transform is given by

$$\begin{aligned} X(w) = \sum _{n=-\infty }^{\infty } x(n)e^{-jwn}, \end{aligned}$$
(1)

where X(w) represents the frequency spectrum of x(n) [57]. Such frequency content allows the signal to be classified according to its power/energy density spectra, which are quantitatively expressed as the bandwidth. The Fourier transform has been applied successfully for more than a century in virtually every field of knowledge where signal analysis can be imagined, such as medicine [58,59,60,61], spectroscopy/spectrometry [60,61,62,63,64,65,66,67,68,69], laser-material interaction [70], image processing [59, 71,72,73], big data [74], micro-electro-mechanical systems (MEMS) devices [75], food technology [73, 76, 77], aerosol characterisation and assessment [78], vibration analysis [79], chromatic dispersion in optical fibre communications [80], analysis of biological systems [81], characterisation in the geological sciences [82], data compression [83], catalyst surface analysis [84], and profilometry [85], among several others.

Frequency-domain analysis can be applied to any signal from which information is to be extracted. In the case of the voice signals studied here, the bandwidth is limited to the frequency range between 100 Hz and 4 kHz.
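In practice, for a finite record of N samples taken at a sampling rate \(f_s\), Eq. (1) is evaluated as the N-point discrete Fourier transform; we restate it here only to fix notation for what follows,

$$\begin{aligned} X(k) = \sum _{n=0}^{N-1} x(n)e^{-j2\pi kn/N}, \qquad k = 0, 1, \ldots , N-1, \end{aligned}$$

so that the k-th bin corresponds to the physical frequency \(f_k = k\,f_s/N\); only the bins with \(f_k\) between 100 Hz and 4 kHz are relevant to the voice signals studied here.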

2.2 Shannon’s Entropy Computation

The fundamental problem of communication, i.e. reproducing at one point a message selected at another point, was first posed mathematically in [53]. Within this theory, messages are considered discrete in the sense that they may be represented by a number of symbols, regardless of the continuous or discrete nature of the information source, because any continuous source must eventually be discretised in order to be transmitted. The set of symbols selected to represent a given message is called the alphabet, and an infinite number of messages can be coded with such an alphabet despite its finiteness.

In this sense, different messages coded in the same alphabet use different symbols, so the probability of appearance of each symbol may vary from one message to another. Therefore, the discrete source of information can be considered a stochastic process, and conversely, any stochastic process that produces a discrete sequence of symbols selected from a finite set is a discrete source [53]. An example of this is the digital voice signal.

For a discrete source of information in which the probabilities of occurrence of the events are known, there is a measure of how much choice is involved in selecting an event, or of how uncertain we are about its outcome. According to Theorem 2 of [53], there should be a function H that is continuous in the probabilities of the events (\(p_i\)), monotonically increasing in n when the events are equiprobable, and additive. The logarithmic function meets these requirements, and it is particularly convenient for capturing the statistics of very long messages, in which the number of symbol occurrences becomes very large. In particular, base-2 logarithms are especially adequate for measuring the information, choice and uncertainty content of digitally (binary) coded messages. Such a function then takes the form

$$\begin{aligned} H = -K\sum _{i=1}^{n}p_i\log (p_i), \end{aligned}$$
(2)

where the positive constant K sets a measurement unit and n is the number of symbols in the selected alphabet. H is the so-called information (or Shannon) entropy for a set of probabilities \(p_1,\ldots , p_n\). It must be noted that the validity of Eq. 2 relies on the symbols of the alphabet being statistically independent of one another, i.e. on the information source having no memory (hysteresis) processes. For sources in which the probability of a symbol depends on the preceding ones, Eq. 2 is modified, yielding the conditional entropy.
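As a brief worked example of Eq. (2) (ours, taking \(K=1\) and base-2 logarithms), consider a three-symbol alphabet with probabilities 1/2, 1/4 and 1/4:

$$\begin{aligned} H = -\left( \tfrac{1}{2}\log _2\tfrac{1}{2} + \tfrac{1}{4}\log _2\tfrac{1}{4} + \tfrac{1}{4}\log _2\tfrac{1}{4}\right) = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} = 1.5\ \text {bits}. \end{aligned}$$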

Beyond its direct applications in communication systems, information entropy has also been used for assessment and classification purposes. For instance, [86] evaluates the non-uniform distribution of assembly features in precision instruments, [87] applies it to multi-attribute utility analysis, [88] uses a generalised maximum entropy principle to identify graphical ARMA models, [89] studies critical characteristics of the self-organised behaviour of concrete under uniaxial compression, [90] explores interactive attribute reduction for unlabelled mixed data, [91] improves neighbourhood entropies for uncertainty analysis and intelligent processing, [92] proposes an inaccuracy fuzzy entropy measure for a pair of probability distributions and discusses its relationship with mean codeword length, [93] develops proofs for quantum error-correction codes, [94] performs attribute reduction for unlabelled data through an entropy-based misclassification cost function, [95] applies the cross entropy of mass functions to similarity measures, [96] detects stress-induced sleep alterations in electroencephalographic records, [97] uses entropy and symmetry arguments to assess the self-replication problem in robotics and artificial life, and [98] applies it to the quantisation of local observations for distributed detection, among others.

Table 1. Number of files in the voice dataset, distributed by emotion and separated into positive and negative emotion subsets.

Although information entropy has been widely applied to typical text sources, applying it to digital sound sources entails certain difficulties, such as the definition of an adequate alphabet over which to perform the information measurement. In this work, we first apply a fast Fourier transform algorithm to a frequency band of digital voice recordings in order to define a frequency alphabet. The symbols of this alphabet are finite in number because all sound files are sampled at the same rate.

2.3 Voice Signals Dataset

The dataset used in this work was obtained from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [99]. It consists of 1,440 audio-only WAV files performed by 24 professional voice actors (12 women, 12 men), who vocalise two lexically matched statements in English with a neutral North American accent.

Each actor performs calm, happy, sad, angry, fearful, surprised, disgusted, and neutral expressions. Each expression is performed at two levels of emotional intensity: normal and loud. Each actor vocalises two different statements: statement 1 is “Kids are talking by the door” and statement 2 is “Dogs are sitting by the door”. Finally, each vocalisation was recorded twice, denoted as the \(1^{st}\) or \(2^{nd}\) repetition. Each audio clip is approximately 3 s long. Table 1 shows the classification of the voice dataset, in which the emotions have been separated into two subsets of positive and negative emotions.
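For reproducibility, the following sketch (ours) decodes the numerical filename convention documented for RAVDESS, whose seven hyphen-separated fields encode modality, vocal channel, emotion, intensity, statement, repetition and actor; the emotion and intensity code tables below follow the dataset documentation, the positive/negative grouping follows Table 1, and the helper name itself is hypothetical.

```python
# Decode RAVDESS file names such as "03-01-03-02-01-02-12.wav".
# Field meanings and codes are taken from the RAVDESS documentation.
EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
POSITIVE = {"calm", "happy", "surprised"}           # grouping as in Table 1
NEGATIVE = {"sad", "angry", "fearful", "disgust"}   # (neutral is discarded)

def parse_ravdess_name(filename: str) -> dict:
    fields = filename.removesuffix(".wav").split("-")
    modality, channel, emotion, intensity, statement, repetition, actor = fields
    return {
        "emotion": EMOTIONS[emotion],
        "intensity": "normal" if intensity == "01" else "loud",
        "statement": int(statement),   # 1: "Kids are talking...", 2: "Dogs are sitting..."
        "repetition": int(repetition),
        "actor": int(actor),           # odd numbers are male, even numbers are female
    }
```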

3 Development and Results

The methods of the preceding section were implemented in the Python programming language. First, the librosa library, which focuses on the processing of audio and music signals [100], was used to obtain the sampled audio data at a sampling rate of 22,050 Hz. To obtain the frequency spectrum, the Fast Fourier Transform (FFT) algorithm was used to compute the discrete Fourier transform. The FFT was implemented through the SciPy library, a collection of mathematical algorithms and functions built on top of the NumPy extension to Python, adding significant enhancement by providing high-level commands and classes for manipulating and displaying data [101]. SciPy is a system prototyping and data processing environment that rivals systems like MATLAB, IDL, Octave, R-Lab, and SciLab [101].

As all the audio files were sampled at a rate of \(br=22,050\) Hz, a clip of duration t s contains \(br\cdot t\) time samples, which form a \(br\cdot t\)-sized vector. The scipy.fft.rfft function is then used to compute the one-dimensional discrete Fourier transform (DFT) of this real-valued vector [102].
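As a concrete illustration (our arithmetic, for a nominal 3 s clip), the number of samples and the number of one-sided frequency bins returned by scipy.fft.rfft are

$$\begin{aligned} N = br\cdot t = 22{,}050\ \text {Hz}\times 3\ \text {s} = 66{,}150\ \text {samples}, \qquad N_{bins} = \lfloor N/2\rfloor + 1 = 33{,}076, \end{aligned}$$

covering frequencies from 0 Hz up to the Nyquist limit of 11,025 Hz; the voice content itself lies within the 100 Hz–4 kHz band discussed in Sect. 2.1.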

In order to compute the entropy of each voice source with the described alphabet, the probability of each symbol (the frequency values available in the Fourier spectrum) is computed, and the entropy is then calculated through Eq. (2); since the occurrence of each frequency does not depend on the occurrence of another, the sound information source can be regarded as memoryless (non-hysterical).
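The following minimal sketch (ours) summarises the pipeline as described above; since the exact mapping from the Fourier spectrum to symbol probabilities is not spelled out, normalising the band-limited magnitude spectrum so that it sums to one is our assumption, and the function name is hypothetical.

```python
import numpy as np
import librosa
from scipy.fft import rfft, rfftfreq

def voice_entropy(path: str, sr: int = 22_050,
                  f_lo: float = 100.0, f_hi: float = 4_000.0) -> float:
    """Base-2 Shannon entropy of a voice recording, using the FFT bins of the
    100 Hz-4 kHz band as the frequency alphabet (Eq. (2) with K = 1)."""
    y, sr = librosa.load(path, sr=sr)             # resample to the common rate
    mags = np.abs(rfft(y))                        # one-sided magnitude spectrum
    freqs = rfftfreq(len(y), d=1.0 / sr)          # frequency (Hz) of each bin
    mags = mags[(freqs >= f_lo) & (freqs <= f_hi)]
    p = mags / mags.sum()                         # assumed: magnitudes -> probabilities
    p = p[p > 0]                                  # drop empty bins to avoid log(0)
    return float(-np.sum(p * np.log2(p)))         # memoryless Shannon entropy
```

A per-file value obtained in this way can then be averaged per actor, emotion, intensity and message to build aggregations such as those reported in Tables 2, 3, 4 and 5.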

3.1 Entropy Analysis

After processing the dataset (Table 1), the entropy of each audio clip was analysed for the following emotions: calm, happy, sad, angry, fearful, surprised and disgusted. As the neutral expression does not intentionally express any emotion, it was discarded from the analysis. Due to the size of the dataset, we only show some representative graphics of the obtained results here, together with the average results in what follows.

Fig. 1. Comparison of the average entropy for all actors on each emotion.

The first analysis considers the entropy averaged over all actors, comparing the loud against the normal emotional intensity. Results for both messages can be observed in Fig. 1, where they are presented in ascending order with respect to the entropy values of the normal emotional intensity.

Table 2 details the average entropy values obtained from all 24 actors, including the cases of Fig. 1, classified according to intensity (loud and normal). The values are in ascending order with respect to the normal intensity, and the values obtained in each repetition (1st and 2nd) of the messages are shown separately.

Table 2. Average entropy of all actors comparing intensity.

In order to explore the entropic content of the audio clips by gender, we compared the loud and normal intensities for both genders; the graphs, in ascending order with respect to the normal intensity, are shown in Figs. 2a and 2b for men and in Figs. 2c and 2d for women.

Table 3a details the average entropy values obtained from the 12 male interpreters, and Table 3b those from the 12 female interpreters, including the cases shown in Fig. 2. Both tables classify the entropy values according to intensity (normal and loud) in ascending order with respect to the normal intensity, listing separately the values of the 1st and 2nd repetitions of the messages.

Fig. 2. Normal vs. loud intensity for interpreters of both genders.

Table 3. Average entropy for both genders and messages, comparing intensity.

In what follows, we present the average entropy of message 1 against message 2 at the same emotional intensity. The general results for the 24 actors are shown in Fig. 3, where each plot is ordered in ascending order with respect to message 1.

Fig. 3. Message 1 vs 2 on the \(1^{st}\) repetition.

Likewise, Table 4 shows all the average entropy values obtained from the 24 actors, including the cases shown in Fig. 3, classified according to the type of message (1 or 2). The values are presented in ascending order with respect to the normal intensity, and the values obtained in each repetition (1st and 2nd) of the messages are clearly separated.

Table 4. Average entropy of all actors comparing type of message.
Fig. 4. Message 1 vs 2 with the same intensity for interpreters of both genders.

Moreover, the average entropy comparison between messages 1 and 2 (at the same emotional intensity) is shown for men in Figs. 4a and 4b and for women in Figs. 4c and 4d. Each graph is ordered in ascending order with respect to message 1.

Table 5. Average entropy for actors of both genders, comparing type of message.

Table 5a details the average entropy values obtained from the 12 male interpreters, classified according to message 1 or 2, in ascending order with respect to the normal intensity and with the values of the 1st and 2nd repetitions separated. The same layout for the average entropy values obtained from the 12 female interpreters can be observed in Table 5b. The cases shown in Fig. 4 are also included here.

4 Discussion and Conclusions

In this work, a different approach to the analysis of emotion was proposed: instead of applying the widely used AI methods, a classification of emotions in speech sources into positive and negative categories is performed with a tool of information theory, namely the frequency-based Shannon entropy. In order to compute the information entropy, an alphabet is generated from the frequency symbols produced by decomposing the original audio time series with the FFT algorithm. Then, the probability of appearance of each frequency symbol is obtained from the Fourier spectrum, and the memoryless Shannon entropy is finally computed (see Eq. (2)).

As mentioned in Sect. 1, the typical sources on which entropy calculation is performed are texts, for which the average entropy value ranges between 3 and 4. However, as observed in the average values provided here, our values range between 13 and 15. This is clearly due to the nature of the alphabet developed here. Since text alphabets are composed of a number of symbols of the order of tens, they yield small values of entropy, whereas in the frequency domain, sound signals generate much larger alphabets, yielding average entropies for a voice signal that are considerably higher than those of a text. It is also clear that if richer sound sources, such as music files rather than speech alone, were analysed through this method, they would yield even larger values of entropy. It must be kept in mind that, given the logarithmic function that characterises the computation of entropy, a value of about 14 is much greater than the typical text entropy of 3–4.
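This observation can be made quantitative (our remark): the entropy of an alphabet of n symbols is bounded by the value attained when all symbols are equally probable,

$$\begin{aligned} H \le \log _2 n, \qquad \text {e.g.}\quad \log _2 30 \approx 4.9\ \text {bits}, \qquad \log _2 2^{14} = 14\ \text {bits}, \end{aligned}$$

so average entropies of 13–15 bits are only attainable when many thousands of frequency symbols carry non-negligible probability, which is precisely what the large frequency alphabet used here provides.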

As can be observed throughout Sect. 3, there is a general tendency for positive emotions (happy, surprised and calm) to have lower values of entropy than negative emotions (sad, fearful, disgusted and angry). Thus, large values of entropy generally characterise negative emotions while lower values are typical of positive emotions, allowing a pre-classification of emotions into these two categories without the training phase required by machine learning algorithms in general.

Table 6. Average entropy classified by positive and negative emotion according to intensity.

In order to better grasp the main result, Table 6 presents the average entropy values according to (normal and loud) intensity, considering the positive and negative categories of emotions covered by this work. These values are presented for all actors, as well as separated by gender. It can be clearly observed that for the normal intensity, for all three averages (all actors, men, and women), positive emotions yield smaller values of entropy than negative emotions.

It is important to remember that, within the context of Shannon’s theory of information, symbols with a lower probability of occurrence are the ones that carry more information, since an optimal code assigns them longer codewords; the entropy of Eq. (2) is the average of this per-symbol information over the whole alphabet. The length of the message also plays an important role, since for short messages the entropy fluctuates, whereas for long messages it tends to stabilise [103]. In this sense, it can be clearly observed from Table 6 that, at the same intensity, the entropy values are larger for women than for men, in both positive and negative emotions (normal intensity). This is consistent with the previous observation, because women in general excite a narrower frequency bandwidth, so that a larger share of the frequency symbols occur only infrequently, yielding larger values of entropy.

The same pattern observed for the normal intensity also holds at the loud intensity for the average over all actors and for the average over men (see Table 6). It should be noted that for the loud intensity, the gap between the positive and negative emotions is smaller than for the normal intensity. This is likely because at a loud intensity the amplitude of the time series is larger, thus in general modifying the probabilities of occurrence of the symbols. The particular case of the female interpreters, in which the loud intensity yields a lower value for the negative emotions than for the positive ones, is likely because when women shout they narrow their voices’ frequency bandwidth, thus yielding fewer symbols with larger probabilities. This is not the case for the male interpreters, who tend to excite a larger portion of the spectrum when they shout, yielding lower values of entropy.

On the other hand, Table 7 shows the average entropy values for the positive and negative categories of emotions according to the type of message. The general results are coherent in that, overall, message 2 has larger values of entropy than message 1. A subjective explanation is that people may tend to be more expressive with their emotions when talking about animals (dogs, in the case of message 2) than when talking about kids (message 1). Moreover, the higher values of entropy for message 2 could be due to the fact that people are naturally more susceptible to negative emotions towards animals. These facts could have influenced the actors when vocalising message 2 with respect to message 1. In addition, Table 7 confirms that in all cases (all actors, men, and women) the entropy values of positive emotions are lower than those of negative emotions, regardless of the analysed message, confirming the results shown in Table 6.

Table 7. Average entropy classified by positive and negative emotion according to message.

Various sound classification applications using AI techniques are based on the implementation of neural network variants, such as Deep Neural Networks (DNN) [104], Convolutional Neural Networks (CNN) [105, 106], and Recurrent Neural Networks (RNN) [107], among others. Although AI techniques can predict the behaviour of the data after exhaustive training of the chosen algorithm, a fixed quantity such as entropy always allows an analysis without estimates or predictions. Thus, entropy values give a clear idea of the behaviour of the signal itself, yielding a more reliable and direct result [108].

Although not directly related to information classification, entropy calculation is also useful in the context of communication systems, as it represents a measure of the amount of information that can be transmitted. Parameters such as channel capacity, joint entropy, data transmission rate, and symbol error counts, among others, are determined using entropy [53]. These parameters become important when the classified or processed information needs to be transmitted. Applications such as those presented in [109] and [110] combine the classification of information with its use in communication systems, especially those that require direct interaction with humans. Although Shannon entropy has previously been used for pattern recognition in sound, it has mainly been applied to the classification of heart sounds [54, 55], and not in the context studied here.

As final remarks, in this work we find Shannon’s information entropy to be a reliable tool for performing a very quick classification of emotions into positive and negative categories. The computation of entropy based on the Fourier frequency spectrum also allows a message to be categorised by the amplitude of the original time series (whether it is vocalised in a normal or loud manner) as well as by male and female speaker. However, as mentioned earlier in this section, further experiments with larger speech datasets should be performed in order to find stabilised entropy values on which to base quantitative classification criteria. Given its simplicity and speed, this novel approach could also serve as a pre-classification system for emotions, preparing training datasets so that more complex machine learning algorithms can perform finer classifications of sentiments.

5 Future Work

This research can be extended in the following pathways:

  • To expand this analysis to longer voice records, such as complete speeches.

  • To extend this proposal to perform emotion analysis on analogue voice signals.

  • To extend this analysis to assess the entropic content of voice recordings in languages other than English.

  • To explore the entropic content of other sound sources, such as music.

  • To complement this approach with further tools of information theory [53] and signal analysis techniques, in order to perform a finer emotion classification.