1 Introduction

The design of a classifier is well defined by certain mathematical criteria, and each classifier is expected to perform better on a specific kind of data. For instance, data that follow a Gaussian distribution are better classified by models such as Gaussian mixture models (GMMs). At present, however, the classifier is often selected blindly by experimenting on the dataset with all available classifiers, which is time-consuming and inefficient. Identifying a relevant classifier for a particular dataset in advance offers several advantages, such as reduced computational complexity and improved performance. With this motivation, an effort has been made in this work to identify the properties of a dataset before choosing the classifier to work on it. Emotion recognition has been selected as a case study. Emotion recognition from speech is a challenging task because of its inherent ambiguity; even humans often fail to recognize emotions correctly. The task aims to identify the emotional state of a human being from the voice. Automatic emotion recognition from speech has many applications, such as making human–machine communication more natural and interactive, patient monitoring, telephone-based customer service systems, psychological health care initiatives and so on (Reddy et al. 2011). Some of the important basic emotional states are anger, fear, happiness, sadness, neutral, boredom, surprise and disgust. Several factors make it difficult to model emotions effectively, including the lack of a proper emotional database and the identification and extraction of features that discriminate emotions clearly. The highly varying modulation in the emotional speech of different persons makes speaker-independent emotion recognition even more difficult.

A speech signal mainly carries information about linguistic content, the speaker's identity, emotion, and so on (Nwe et al. 2001). Of these, emotion is an important attribute that lends naturalness to speech. Emotion can be estimated from speech, text, and facial expressions. However, it is hard to determine emotions from text because of ambiguities at the syntactic and semantic levels, and it has been stated that not all emotions can be identified from facial expressions. Hence, speech is the most reliable source for recognizing emotions when compared to text and images. In general, speech emotion recognition is performed by extracting features and applying classification models. Numerous features and classification models have already been proposed in the literature for emotion recognition. However, not all features are useful, and it is important to identify those suitable for the task at hand. Here, textual information is not taken into consideration so that the investigation remains independent of that factor. The features found in the emotion recognition literature are inherited from other speech tasks. An approach for identifying a suitable set of features for any classifier is therefore essential. Equally important is recommending a classifier based on the properties of the feature set, since the performance of emotion recognition systems (ERSs) also depends heavily on the classification model (Reddy et al. 2011). The performance of a classifier depends just as much on the quality and properties of the dataset used. Most research works have focused on selecting input features based on the task; however, while selecting features for any speech task, the classification model should also be taken into consideration. It is practically impossible to obtain good performance with every classification model on a given dataset.

In this work, a metric has been introduced to understand the properties of a given speech emotional dataset. The features relevant to speech emotion are extracted prior to selecting the classifier. Combinations of the computed features are tested with the classifiers to select emotion-specific features. Statistical techniques are then applied to the best combinations to determine the nature of the dataset, and a relevant classifier is suggested based on the results of this statistical analysis. The results are compared with two clustering algorithms and one additional classification model to support the proposed choice. The models explored in this work are vector quantization (VQ), K-means clustering, GMMs and artificial neural networks (ANNs). Spectral features such as MFCCs and formants, and prosodic features such as pitch, intensity, jitter, and shimmer are computed from the speech signals because of their ability to discriminate emotions. It has been observed that the emotional data follow a normal distribution; hence, GMM has been suggested, since it performs best on normally distributed datasets. In addition, the IIT-KGP benchmark emotional database is used to generalize the conclusions.

The rest of the paper is organized as follows. A detailed review of related work is given in Sect. 2. Section 3 describes the proposed approach in detail, including database collection, feature extraction, subset construction and classification models. Results and analysis are provided in Sect. 4. Section 5 concludes the paper with some future research directions.

2 Related work

In this section, the features and classifiers used for emotion classification are discussed briefly. Although different feature sets have specific statistical properties, no evidence is found in the literature of those properties being used when deciding on a classifier. The existing literature on emotion recognition is detailed below.

2.1 Feature selection

To automatically distinguish emotions from the speech signal, it is important to identify relevant features. The combination of features also plays an important role in improving the efficiency of a recognition system. From the literature, emotion recognition (ER) features are broadly categorized into (a) prosodic, (b) spectral, (c) combinations of (a) and (b), and (d) multi-modal features. The following subsections describe the importance of these features in recognizing emotions.

2.1.1 Prosodic features

Prosodic is the adjectival form of prosody (Nooteboom 1997). It is believed that the emotional state of a speaker is primarily indicated by prosody (Ortony 1990; Chen et al. 2006). There are mainly three categories of prosodic features: (i) pitch, (ii) intonation and (iii) intensity. It is not useful to derive these features at the frame level; hence they are extracted at the syllable, word, utterance and sentence levels (Koolagudi and Rao 2012). Pitch (also referred to as the fundamental frequency) is the rate of vibration of the vocal folds. It mainly depends on the subglottal air pressure and the tension of the vocal folds; hence, it is one of the important features carrying emotion-specific information (Ververidis and Kotropoulos 2006). There are many approaches to estimate the pitch value if the signal is quasi-stationary (Hess 2008). Studies in Iida et al. (2003) stated that high emotion recognition accuracy can be achieved using autocorrelation-based pitch values, although in some cases the first formant may interfere with the estimation of the fundamental frequency. Voiced segments are easy to identify from speech using an energy criterion, as the energy is higher in voiced regions than in unvoiced ones. The overall amplitude, the energy distribution in the spectrum and the duration of pauses are affected by the arousal state of a speaker (Scherer 1999, 2003; Cowie and Cornelius 2003). Hence, energy and duration features are useful in recognizing emotions. In general, the energy level of males is higher than that of females in the anger state (Heuft et al. 1996), whereas for the same state the speech rate of female speakers is higher than that of males (Iida et al. 2000). Pitch, energy and speech rate are low in the state of disgust, while high pitch and high intensity values are reported in the state of fear. Compared to the neutral state, the speech rate is lower in sadness (Ververidis and Kotropoulos 2006). In the case of sad emotions, the speech rate of male speakers is higher than in the anger or disgust states (Cowie and Cornelius 2003; Iida et al. 2000). Statistical values of prosodic features are also useful in characterizing emotions efficiently (Chung-Hsien and Liang 2011; Rao et al. 2013). Statistical values of pitch include the range, minimum, maximum, mean, standard deviation, median, slope maximum, slope minimum, relative pitch, skewness, kurtosis, 4th-order Legendre parameters, first-order difference (\(\varDelta\)), jitter and so on. Similarly, for energy and duration, statistical features such as shimmer, speech rate and the ratio of the durations of voiced to unvoiced sounds are considered along with the mean, minimum, maximum and standard deviation. In Rao et al. (2010), the dynamics of prosody features such as pitch, energy and duration contours are used as features to recognize seven emotions; the analysis is done on individual features with support vector machines (SVM) as the classifier. Statistical values of pitch and energy are used as features in Bhatti et al. (2004), Schuller et al. (2003) and Luengo et al. (2005) to recognize emotions from speech, with classifiers such as modular neural networks (MNN), GMMs and hidden Markov models (HMMs). In Rao et al. (2013), eight emotions of the IITKGP-SESC corpus are characterized through signal analysis at the syllable, word and utterance levels. Local and global features are extracted at these levels and analyzed critically, both individually and as groups, for male and female speakers; with the combination of local and global features, an improved recognition rate is reported for the last syllables of final words. In Jawarkar et al. (2007), statistical features of pitch and energy are extracted to recognize four emotions; a fuzzy min–max neural network (FNN) is used as the classifier and is reported to require less training time than a back-propagation neural network (BPN).

2.1.2 Spectral features

The shape of the vocal tract system is different for different emotions, and this shape can be estimated using spectral analysis. Hence, spectral features are also useful for categorizing emotions (Rao et al. 2013; Bitouk et al. 2010). In general, spectral features are computed by dividing the speech signal into small segments (called frames) of length 20–50 ms, over which the signal is assumed to be stationary. Studies in Banse and Scherer (1996), Kaiser (1962) and Nwe et al. (2003) reported that the energy in the high-frequency regions is high in the state of happiness and low in the case of sadness. There are different approaches to extracting spectral features; some popular techniques are traditional linear predictor coefficients (LPC), one-sided autocorrelation linear predictor coefficients (OSALPC), the short-time coherence method (SMC) and least-squares modified Yule–Walker equations (LSMYWE) (Rabiner and Schafer 1978; Hernando et al. 1997; Le Bouquin 1996; Bou-Ghazale and Hansen 2000). In Chauhan et al. (2010), the LP residual is used as a feature to design an emotion recognition system (ERS). Auto-associative neural networks (AANN) and GMMs are found to perform well and are used as classifiers for a majority of the tasks mentioned above. Further, the sequence of glottal pulses is considered as the excitation source of voiced speech (Ananthapadmanabha and Yegnanarayana 1979). The glottal closure instant (GCI), also known as the epoch, is highly helpful for estimating the pitch as well as the frequency response of the vocal tract. Epoch information has been extracted using approaches such as the LP residual and the zero-frequency filtered (ZFF) speech signal to recognize emotions from the IITKGP simulated emotion speech corpus (Koolagudi et al. 2010). GMMs and SVMs have been used as classifiers, and the GMM is found to be more suitable than SVMs. Moreover, human perception of pitch does not always follow a linear scale. Hence, approaches based on non-linear scales such as the Bark scale, the Mel-frequency scale, the modified Mel-frequency scale, and the ExpoLog scale have been introduced (Rabiner and Juang 1993; El Ayadi et al. 2011). The log-magnitude spectrum is computed for the same purpose, which is known as cepstral analysis. Cepstral analysis is used to extract features such as linear predictive cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), one-sided autocorrelation linear predictive cepstral coefficients (OSALPCC), and so on. Detecting stress from the speech signal using a non-linear scale is found to be better than linear-scale analysis (Bou-Ghazale and Hansen 2000). Mel-energy spectrum dynamic coefficients (MEDCs), based on spectral energy dynamics, are extracted to recognize the emotions of both male and female speakers (Lee et al. 2004). In other works, Mel-frequency-based short-time speech power coefficients (MFSPCs) are extracted, and a VQ-based HMM is used as a classifier to recognize emotions (Nwe et al. 2001).

2.1.3 Combination of prosodic and spectral

Temporal information is missing from short-time features such as MFCCs and perceptual linear predictive (PLP) coefficients, yet it is very useful when estimating emotions (Siqing et al. 2011). Based on this, modulation spectral features (MSF) were introduced to capture the temporal information in the speech signal that helps determine the emotion (Razak et al. 2005). The combination of prosodic and spectral features has also been used to improve the efficiency of recognizing emotions from speech (Nicholson et al. 2000). A critical analysis has been carried out with open and closed tests on a database of both male and female Japanese speakers; the results show that as the number of speakers increases, the classification performance of the above combination decreases (Li and Zhao 1998). Filter-bank coefficients from 300 to 3400 Hz are used to extract standard MFCCs, whereas for low-MFCCs the filter banks are applied in the 20–300 Hz frequency range to model fundamental frequency (F0) variations. Features such as MFCCs, low-MFCCs, pitch, and \(\varDelta\) pitch are found to be efficient in improving the performance of the ERS (Neiberg et al. 2006), with GMM used as the classifier; it is reported that low-MFCCs perform well in extracting a stable pitch. Further analysis has been carried out on the combination of prosodic and short-time features using rough sets (Zhou et al. 2006). Overall, it has been stated that temporal variations play a major role in discriminating emotions.

2.1.4 Multi-modal features

Human feelings can be expressed through tone, gestures, facial expressions, keyword spotting and so on. Initial efforts identified human emotions from facial analysis (Black and Yacoob 1995; Essa and Pentland 1997; Kenji 1991; Tian et al. 2000; Yacoob and Davis 1994) and from the auditory voice (Ververidis and Kotropoulos 2006; El Ayadi et al. 2011; Krothapalli and Koolagudi 2013) individually. Later, facial expressions and voice were combined to improve accuracy (Busso et al. 2004; Schuller et al. 2004). There is still considerable scope for work on multi-modal features.

2.2 Classifier selection

A number of classifiers, such as HMM, ANN, GMM, SVM, VQ and k-NN, have been used for the task of recognizing emotions from speech.

Studies in El Ayadi et al. (2011), Womack and Hansen (1999) and Lee and Hon (1989) state that the majority of previous works have used HMMs as the classifier for recognizing emotions from speech, owing to their popularity and efficiency in various speech applications. In automatic speech recognition (ASR), phonemes are extracted and modeled using HMMs (Deller et al. 2000), where the state transition matrix captures the temporal dynamics of the speech signal (Rabiner 1989). Since phonemes follow a left-to-right sequence in speech, the HMM usually adopts a left-to-right structure for speech recognition. The same phenomenon has been used to recognize emotions from speech (Schuller et al. 2003; Kwon et al. 2003; Nogueiras et al. 2001; Polzin and Waibel 1998; Bitouk et al. 2010; Sato and Obuchi 2007). However, a sequential flow of emotional cues within an utterance cannot be assumed. For instance, it is difficult to fix the position of the pause that appears in an utterance of sad emotion; it may appear at the beginning, in the middle or at the end of the utterance (Yamada et al. 1995). The ergodic HMM, in which any state can be reached from any other state in a single step, has been considered as a classifier to overcome this problem (Nwe et al. 2003). However, none of these works explains the reasons for choosing the HMM for a given dataset.

The GMM is a probabilistic model designed for density estimation. It is considered a state-of-the-art classifier and is widely used in speaker identification and verification (Reynolds et al. 2000). It provides flexible basis representations to model diverse, high-dimensional data (Li and Barron 1999; Vlassis and Likas 2002), and it can be treated as a special case of the continuous single-state hidden Markov model (Douglas and Richard 1995). In general, second-order parameters such as the mean and standard deviation are used in a GMM to capture the distribution of the data points (Koolagudi et al. 2010). In the case of multi-modal distributions, GMMs are found to be adequate, and smaller training and test sets suffice compared to standard continuous HMMs (Bishop 1995). Hence, GMMs are more appropriate for global features extracted from the speech signal for emotion recognition. One limitation of this model is the difficulty of modeling the temporal structure of the training data, since the feature vectors are assumed to be independent. Deciding the optimal number of components is also challenging. Model-order selection principles such as the minimum description length (MDL) (Rissanen 1978), the classification error on a cross-validation set, goodness of fit (GOF) based on kurtosis (Vlassis and Likas 1999) and the Akaike information criterion (AIC) (Akaike 1974) are the common approaches for deciding the optimal number of components (El-Yazeed et al. 2004). A greedy expectation–maximization (EM) algorithm has been designed to estimate both the model order and the components together (Vlassis and Likas 2002). GMMs are widely used in Neiberg et al. (2006), Ververidis and Kotropoulos (2005) and Yang and Lugger (2010) to recognize emotions in speech, and Tang et al. (2009) proposed a boosted GMM for the same task.
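As an illustration of the model-order selection mentioned above, the following sketch (written here for exposition, using scikit-learn and synthetic two-dimensional data, not features from any of the cited works) fits GMMs with an increasing number of components and keeps the order with the lowest AIC.

```python
# Illustrative sketch only: choosing the number of GMM components by AIC.
# The data below are synthetic placeholders, not features from a speech corpus.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([120.0, 0.4], [5.0, 0.05], size=(200, 2)),
               rng.normal([180.0, 0.7], [5.0, 0.05], size=(200, 2))])

best_n, best_aic = None, np.inf
for n in range(1, 9):                                   # candidate model orders
    gmm = GaussianMixture(n_components=n, covariance_type='full',
                          random_state=0).fit(X)
    aic = gmm.aic(X)                                    # lower AIC is better
    if aic < best_aic:
        best_n, best_aic = n, aic

print(f"selected {best_n} components (AIC = {best_aic:.1f})")
```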

Vapnik and Chervonenkis applied concepts from statistical learning theory to introduce a new classification and regression technique, the SVM (Burges 1998; Wang 2005). It mainly uses kernel functions to map the non-linearity in the feature set to a high-dimensional feature space in which a linear separation can be obtained. In various pattern recognition applications, SVMs give better results than many other classifiers (Shen et al. 2011); in particular, a classification rate of 75–80% is obtained for speaker-independent applications with an SVM classifier (Zhou et al. 2006). Basically, the SVM classifier is constructed for two classes, and the classification error on test samples can be reduced by finding the optimal hyperplane. Several methods have been introduced to use SVMs for multi-class classification. Among them, the one-vs.-all method is developed and used in Takahashi (2004) to recognize emotions. LIBSVM is used to classify five emotional states with the Mel energy spectrum dynamics coefficient (MEDC) feature vector (Lin and Wei 2005). Since the SVM is good at separating two classes, Zhou et al. combined it with a binary tree to discriminate n classes (Zhou et al. 2006). The SVM classifier has been used to recognize emotions from the speech signal in several works, but the accuracy is found to be unremarkable (Grimm et al. 2007; Seehapoch and Wongthanavasu 2013; Yu et al. 2011; Pan et al. 2012; Chavhan et al. 2010). The performance of an ERS with an SVM is around 80.09% for a gender- and situation-independent database (Giannoulis and Potamianos 2012; Muthusamy et al. 2015).

ANNs are another efficient classifier for capturing non-linear relations and are used in various pattern recognition applications. They have some significant advantages over other classifiers: when the number of training samples is small, their classification performance is usually better than that of HMMs and GMMs. Multi-layer perceptrons (MLP), radial basis function networks (RBF) and recurrent neural networks (RNN) are the main categories of ANNs (Bishop 1995). Among these, the MLP is the one most commonly used in emotion recognition applications, and the RBF network the least used (El Ayadi et al. 2011). The MLP is easy to implement and comes with a well-defined training algorithm. ANN performance depends heavily on parameters such as the number of hidden layers and the number of neurons in each hidden layer. More than one ANN is used in some speech emotion recognition applications to achieve better performance (Nicholson et al. 2000; Firoz et al. 2009; Dai et al. 2008). An ANN is used in Petrushin (2000) to distinguish the agitation and calm emotional states. Generalized discriminant analysis (GerDA) is introduced in Stuhlsatz et al. (2011) for acoustic emotion recognition using deep neural networks (DNNs), and better performance is reported with DNNs than with SVMs for this task (Han et al. 2014). In Khanchandani and Hussain (2009), MLPs and generalized feed-forward neural networks (GFNNs) are used to recognize emotions in the speech signal and their results are compared; GFNNs are reported to perform better than MLPs. Auto-associative neural networks (AANNs) are used in Koolagudi and Rao (2012) to recognize basic emotions in semi-natural speech, and a 2D neural classifier is used in Partila and Voznak (2013) to classify the emotional state of a man's voice.

Classifiers such as HMMs, SVMs and ANNs are found to give poorer performance when the number of samples is small; VQ was introduced to provide better recognition in such cases (Huang et al. 2012). In VQ, a fixed-size quantized vector \(V_i\) of dimension n is created for the vectors V of the same dimension. If all the components of V and the corresponding components of \(V_i\) are close enough, then \(V_i\) is treated as the quantized vector of V (Konar and Chakraborty 2014). The learning vector quantization (LVQ) technique is used in several facial emotion recognition applications (Konar and Chakraborty 2014). A variance-based Gaussian kernel fuzzy vector quantization (VGKFVQ) method is proposed in Huang et al. (2012) to recognize emotions in short speech, and in Khanna and Kumar (2011) an LBG-VQ method is introduced to recognize human emotions.

Some works suggest a classifier by testing the feature vector on all available candidate algorithms (Demšar 2006); these are also called meta-learning approaches (Soares and Brazdil 2000; Muslea et al. 2006). Computational complexity is an important issue when developing meta-learning systems, because every classification algorithm has to be tested with the given dataset. In contrast, in this work the dataset is analyzed and a suitable classifier is suggested. In the literature, there is no specific strategy for deciding the suitable classifier for speech emotion recognition (El Ayadi et al. 2011), and the respective advantages and limitations of the classifiers may confuse researchers trying to choose the best one.

Moreover, several other approaches have been proposed to categorize emotions. Since the feature vector contributes heavily to the categorization, approaches for estimating the relevant features are essential. A technique called incomplete sparse least square regression (ISLSR) has been proposed to select the features that contribute most to categorizing emotions of six classes (Zheng et al. 2014). A novel set of features based on a Fourier parameter model, together with their derivatives, has also been proposed to categorize emotions; however, the performance degrades as the number of emotional classes increases (Wang and An 2015). Time-lapse and linguistic information have been considered as knowledge sources for estimating emotions from spontaneous speech recorded from call centers (Chakraborty et al. 2016). A few models based on autoencoder-based unsupervised domain adaptation have also been proposed for speech emotion recognition (Deng et al. 2014, 2017); these models are constructed without using any label information, but the dimension of the feature vector is large and severe complexity issues may arise. A modified brain emotional learning model has also been proposed to categorize three emotional classes (Motamed et al. 2017). A majority of the works on speech emotion recognition have focused on estimating the differences between datasets rather than on differentiating across corpora (Song et al. 2014); hence, non-negative matrix factorization (NMF) and transfer NMF have been considered for emotion recognition (Song et al. 2016). Further, the task has been extended to extracting features by feeding the raw speech signal directly to deep neural networks (DNNs), and a few works extract features from DNNs and pass them to other classifiers, such as the extreme learning machine (ELM) (Han et al. 2014; Trigeorgis et al. 2016). A summary of the literature, including features, classifiers, databases, and remarks, is given in Table 1. It is found that the majority of the above works experiment with high-dimensional feature vectors, which generally lead to computational issues. Hence, an approach with a relevant, optimal feature vector and a suitable classifier is always desirable. In this paper, the properties of the dataset are estimated in order to suggest the classifier.

Table 1 Summary of source, features and classifiers used in existing work to recognize emotions from speech

Various pieces of statistical evidence are drawn from an emotional database collected from Telugu movies and Telugu-speaking people in order to determine a suitable classifier for emotion recognition. For the selected case study, two clustering and two classification techniques, namely VQ, K-means clustering, GMM, and ANN, have been chosen based on their relevance. Since the SVM is found to give relatively poor performance for speech emotion recognition, the SVM and random forest classifiers have been excluded from this work owing to their lower performance and complexity issues. A critical analysis is performed with different combinations of features and by varying the parameters of the classifiers. Normality tests, namely the Kolmogorov–Smirnov, Shapiro–Wilk and Mardia's tests, are performed to observe the distribution of the data.

3 Proposed methodology

The proposed flow diagram is shown in Fig. 1. The semi-natural emotional clips collected from Telugu movies are used as the database; in addition, the standard IIT-KGP emotional speech database has been considered. After preprocessing, spectral and prosodic features are extracted from the speech clips, as they are effective in modeling emotional speech. An exhaustive subset search is carried out to identify the task-specific features. Different statistical tests are conducted to study the properties of the dataset, and a suitable classifier is suggested. The results of three other classifiers are compared with those of the suggested classifier to evaluate it. The following subsections elaborate each block in detail.

Fig. 1
figure 1

The proposed framework to determine the classifier based on the properties of the dataset

3.1 Database collection

A proper and complete emotional database is essential for efficient modeling of emotions. There are several ways of collecting an emotional database; the ideal way is to collect speech data from natural conversations, since these contain real emotions. However, recording a sufficient amount of such natural conversations with good quality is extremely difficult. Therefore, movies are preferred for collecting emotional sound clips, on the assumption that actors express emotions in a reasonably natural way. Such a collection is also called a semi-natural database.

Five basic emotions, namely anger, fear, happiness, neutral, and sadness, are considered in this work. All the emotional clips are collected from Telugu movies. Care has been taken to include different genders and context-independent speech, and to exclude overlapping voices. Speech samples are collected at a high sampling frequency of 44.1 kHz and later decimated to 16 kHz, since the emotion-specific information is retained within 8 kHz; this also reduces complexity while respecting the Nyquist criterion (Rao and Koolagudi 2012). Each clip is labelled with an emotion based on the mean opinion score (MOS) collected from fifty users with a Telugu linguistic background; contextual information is also taken into account. In total, 1000 clips of length 2–5 s are considered. Of these, fifty carefully selected clips have been used to understand the properties of the data, and the same technique is later applied to all 1000 clips. To generalize the conclusions, a comparison is made with one of the standard speech emotion databases, from IIT-KGP (Koolagudi et al. 2009).

3.2 Feature extraction

From the literature, it is observed that spectral and prosodic features are best suited for emotion recognition (Zhou et al. 2009), and variations in prosody help in modeling emotional speech. Hence, an analysis has been carried out in this respect, and it is found that statistical variations of pitch such as the minimum, maximum, mean \((\mu)\) and standard deviation \((\sigma)\) are prominent in determining emotion from human speech. Therefore, the same features are used in this work, along with jitter and shimmer. Usually, the speech signal is assumed to be stationary over a short duration for the purpose of analysis. Hence, the spectral features are extracted from frames of around 20 ms, since the variation of the speech signal within 20 ms is negligible; an overlap of 50% is used to preserve continuity. Thirteen MFCC features are extracted at the frame level, and the average of all frame-wise MFCC vectors of an utterance is computed to represent the utterance-level spectral feature vector. The detailed feature extraction process is explained below.

3.2.1 Mel frequency cepstral coefficients (MFCCs)

MFCCs are the most widely used features for almost every speech task because of their ability to imitate the human auditory perception mechanism. They are derived from a mel-frequency cepstrum in which the frequency bands are non-linearly spaced according to the mel scale. The general block-processing approach is used for the extraction of MFCCs (Wenjing et al. 2009).
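A minimal sketch of this extraction is given below; it assumes the librosa library and a hypothetical file clip.wav, and follows the frame settings of Sect. 3.2 (20 ms frames, 50% overlap, 13 coefficients averaged over the utterance).

```python
# Sketch only: 13 MFCCs per 20 ms frame, averaged to one utterance-level vector.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=16000)     # hypothetical emotional clip
win = int(0.020 * sr)                          # 20 ms window (320 samples)
hop = win // 2                                 # 50% overlap
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=512, win_length=win, hop_length=hop)
utterance_mfcc = mfcc.mean(axis=1)             # 13-dimensional spectral feature
```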

3.2.2 Pitch

Pitch is the fundamental frequency (F0) of vocal fold vibration, which is why the speech signal is observed to be quasi-periodic. There are many methods to estimate the pitch of speech; in this work, the autocorrelation method is used. Initially, pitch analysis has been carried out for both genders in order to recognize the gender of a speaker, and the subsequent emotion recognition analysis is done per gender. The average pitch observed in the database is about 210 Hz for females and about 120 Hz for males (Traunmüller and Eriksson 1995).

Pitch is one of the important attributes that adds naturalness to speech. It carries information about emotion, gender, accent, speaking manner and so on (Wenjing et al. 2009). The pitch of speech expressing one emotion usually differs from that of another (Kostoulas and Fakotakis 2006); thus, it can be used as a discriminating feature in emotion recognition. Statistical parameters of the pitch contour, namely the minimum, maximum, mean and standard deviation, are used as utterance-level features; they are derived from the pitch values of all frames of an utterance.
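The following sketch (an illustrative assumption about the implementation, not the authors' code) shows the autocorrelation-based pitch estimate for one voiced frame and the utterance-level statistics used as features.

```python
# Sketch of autocorrelation pitch estimation for one voiced frame.
import numpy as np

def pitch_autocorr(frame, sr, fmin=60.0, fmax=400.0):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lag_lo, lag_hi = int(sr / fmax), int(sr / fmin)
    lag = lag_lo + np.argmax(ac[lag_lo:lag_hi])          # strongest periodicity
    return sr / lag                                      # F0 in Hz

# Utterance-level pitch features from the frame-wise contour:
# f0 = np.array([pitch_autocorr(f, 16000) for f in voiced_frames])
# pitch_features = [f0.min(), f0.max(), f0.mean(), f0.std()]
```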

3.2.3 Intensity

Intensity refers to the volume or energy of the speech signal. The energy depends on the loudness of the speaker's voice: a speaker who speaks loudly imparts higher energy to the signal than one who speaks softly (Fu et al. 2008). As the average energy of the complete signal does not convey any emotion-related information, the short-time energy is computed for frames of length 20 ms, as shown in Eq. (1) (Anagnostopoulos et al. 2012).

$$\begin{aligned} {E }_{ n }=\sum _{ m=n-N+1 }^{ n }{ { \left[ x(m)w(n-m) \right] }^{ 2 } } \end{aligned}$$
(1)

where \({E}_{n}\) is the energy value, N represents the length of the frame, w() is the analysis window (rectangular or Hamming) and n is the sample on which the analysis window is centered.

It has been reported that anger, fear and happiness exhibit higher intensity values than sadness (Scherer 2003), so intensity can also be considered an emotion-discriminating feature. Statistical measures of intensity, namely the minimum, maximum, mean (\(\mu\)) and standard deviation \((\sigma)\), are computed as utterance-level features.
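A short sketch of Eq. (1) and of the utterance-level intensity statistics is given below; the frame length and overlap follow Sect. 3.2, and the Hamming window is one of the two options mentioned above.

```python
# Sketch of Eq. (1): short-time energy of 20 ms Hamming-windowed frames.
import numpy as np

def short_time_energy(signal, sr, frame_ms=20, overlap=0.5):
    n = int(frame_ms * sr / 1000)
    hop = int(n * (1 - overlap))
    w = np.hamming(n)
    return np.array([np.sum((signal[i:i + n] * w) ** 2)
                     for i in range(0, len(signal) - n + 1, hop)])

# e = short_time_energy(speech, 16000)
# intensity_features = [e.min(), e.max(), e.mean(), e.std()]
```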

3.2.4 Jitter

The pitch period changes slightly over consecutive pitch cycles. This variation of the pitch period over time depends on many factors such as the text, intonation and emotional state of the speaker. Hence, the cycle-to-cycle pitch variation, known as jitter, is computed (Farrus and Hernando 2009). It is given by

$$\begin{aligned} J=\frac{ 1 }{ N-1 } \sum _{ i=1 }^{ N-1 }{ |{ T }_{ i }-{ T }_{ i+1 }| } \end{aligned}$$
(2)

where \({T}_{i}\) is the extracted pitch period and N is the number of cycles considered. The same is shown in Fig. 2.

Fig. 2
figure 2

The process of extracting jitter and shimmer

3.2.5 Shimmer

Energy variations can be observed when the emotion is intense (Farrus and Hernando 2009), and these variations are captured clearly by the shimmer feature. Shimmer represents the variation in the amplitude of samples between adjacent pitch periods. It is given by

$$Shimmer = \frac{1}{{N - 1}}\sum\limits_{{i = 1}}^{{N - 1}} {\left| {20\log \left( {\frac{{A_{{i + 1}} }}{{A_{i} }}} \right)} \right|}$$
(3)

where \({A}_{i}\) is the extracted peak-to-peak amplitude data and N is the number of extracted pitch periods. The process of extracting shimmer is given in Fig. 2.
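The two measures follow directly from Eqs. (2) and (3); the sketch below assumes that the pitch periods and the peak-to-peak amplitudes of consecutive cycles have already been extracted (e.g., as in Fig. 2).

```python
# Sketch of Eqs. (2) and (3) from extracted pitch periods and amplitudes.
import numpy as np

def jitter(periods):
    """Mean absolute difference of consecutive pitch periods, Eq. (2)."""
    T = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(T)))

def shimmer_db(amplitudes):
    """Mean absolute dB ratio of adjacent peak-to-peak amplitudes, Eq. (3)."""
    A = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(20 * np.log10(A[1:] / A[:-1])))
```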

3.2.6 Formants

Formants correspond to the resonances of the human vocal tract system; a high concentration of energy exists at each formant frequency. They are mostly observed in the portions of the signal corresponding to vowels (Anagnostopoulos et al. 2012). Formants depend on vocal tract characteristics, which change as the emotion changes; hence, formant features are explored for discriminating emotions (Koolagudi et al. 2009). In this work, the first three formants (F1, F2, F3) are considered.
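One common way to estimate the first three formants, sketched below under the assumption of an LP-based analysis (the exact estimator used in this work is not stated), is to take the angles of the complex roots of a linear-prediction polynomial fitted to a voiced frame.

```python
# Sketch: F1-F3 from the roots of a 12th-order LP polynomial (one possible method).
import numpy as np
import librosa

def formants(frame, sr, order=12):
    a = librosa.lpc(frame.astype(float), order=order)    # LP coefficients
    roots = [r for r in np.roots(a) if np.imag(r) > 0]   # keep upper half-plane
    freqs = sorted(f for f in np.angle(roots) * sr / (2 * np.pi) if f > 90)
    return freqs[:3]                                      # F1, F2, F3 in Hz
```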

3.3 Subset construction

All the features specified above are concatenated at the utterance level to form a complete feature vector of length 26, in the order MFCCs (13), prosodic: pitch (4) and intensity (4), and others: jitter (1), shimmer (1) and formants (3). Not all features may be useful for the specified task, so it is important to identify the relevant and suitable ones; this is known as feature selection. Several feature selection algorithms are available to reduce the dimensionality of the feature vector (Liu and Lei 2005; Tang et al. 2014). In contrast, in this work an exhaustive search has been carried out by testing the accuracy of all feature combinations with four classifiers, namely GMM, VQ, K-means and ANN. As the MFCCs are the baseline features for this work, they are treated as a single feature; with the MFCCs counted as one block, \(2^{14}\) feature subsets are generated and tested against each classifier. The best five combinations for each classifier are shown in Table 2.
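A sketch of this exhaustive search is shown below; the feature names and the `evaluate` function (which would train and test one of the four classifiers on the chosen columns) are placeholders introduced for illustration.

```python
# Sketch of the exhaustive subset search: MFCCs form one block, the remaining
# 13 scalar features are toggled individually, giving 2^14 candidate subsets.
from itertools import combinations

FEATURE_BLOCKS = ['mfcc',                      # 13 coefficients, one block
                  'pitch_min', 'pitch_max', 'pitch_mean', 'pitch_std',
                  'int_min', 'int_max', 'int_mean', 'int_std',
                  'jitter', 'shimmer', 'f1', 'f2', 'f3']

def best_subsets(evaluate, top=5):
    """Rank all non-empty subsets by the accuracy returned by `evaluate`."""
    scored = []
    for r in range(1, len(FEATURE_BLOCKS) + 1):
        for subset in combinations(FEATURE_BLOCKS, r):
            scored.append((evaluate(subset), subset))
    return sorted(scored, reverse=True)[:top]
```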

Table 2 Best five combinations of features for four classifiers (Acc. \(\rightarrow\) accuracy)

3.4 Analysis to understand the properties of the feature set

The feature set is analyzed critically by applying various statistical and normality tests to determine the properties of the dataset. The subset that gives the best results across all four classifiers is identified and considered for this analysis. As MFCCs are the baseline features for the selected case study, they are excluded from this step. From Table 2, it is observed that the GMM gives the best performance with the combination of MFCCs, pitch (min, max), intensity (max), jitter, shimmer, F1 and F3. Accordingly, the same features excluding the MFCCs are considered for the normality tests. Three standard tests, namely the K–S test, the S–W test and Mardia's test, are used in this paper; each is explained in the following subsections.

3.4.1 Kolmogorov–Smirnov test

The Kolmogorov–Smirnov test assesses the similarity between the empirical cumulative distribution function (ECDF) of the sample and the cumulative distribution function (c.d.f.) of a reference distribution. It is a non-parametric test in which one-dimensional probability distributions are used to compare a sample with a reference probability distribution (Lilliefors 1967).

The ECDF \(\hat{F}_k\) for k data samples \((Y_j)\) is defined in Eq. (4)

$$\begin{aligned} \hat{F}_k(y)=\frac{1}{k} \sum \limits _{j=1}^{k} I_{Y_j \le y} \end{aligned}$$
(4)

where \(I_{Y_j \le y}\) is the indicator function, equal to 1 if \(Y_j \le y\) and 0 otherwise; the K–S statistic measures the maximum deviation between \(\hat{F}_k(y)\) and the reference c.d.f. F(y).

3.4.2 Shapiro–Wilk test

To test whether the data samples \((y_1, y_2, \ldots ,y_n)\) are normally distributed, the S–W test calculates the statistic \(S_w\) given by Eq. (5).

$$S_{w} = \frac{{\left( {\sum\nolimits_{{i = 1}}^{n} {\alpha _{i} y_{i} } } \right)^{2} }}{{\sum\nolimits_{{i = 1}}^{n} {(y_{i} - \bar{y})^{2} } }}$$
(5)

where \(y_i\) is the ith data sample, \(\alpha _i\) is a constant value based on y and \(\bar{y}\) is the mean of n samples (Shapiro and Wilk 1965).

3.4.3 Mardia’s test

To check for multivariate normality, multivariate tests are generally conducted (Mardia 1970). In this paper, the multivariate skewness measure is considered to test the null hypothesis \((H_0)\); it is given in Eq. (6).

$$\begin{aligned} M_{1,m} = \frac{1}{N^2}\sum \limits _{a=1}^{N}\sum \limits _{b=1}^{N} \left[ (y_a - \bar{y})^{'} S^{-1} (y_b - \bar{y}) \right] ^3 \end{aligned}$$
(6)

where \(\bar{y}\) is the sample mean and S is the sample covariance matrix (Rencher and Christensen 2012).
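A sketch of the three checks is given below, assuming scipy/numpy and an \(N \times d\) matrix X of the selected utterance-level features; Mardia's skewness follows Eq. (6), while the K–S and S–W tests are applied to each feature individually (the K–S test on standardized values).

```python
# Sketch: per-feature K-S and Shapiro-Wilk tests, and Mardia's skewness (Eq. 6).
import numpy as np
from scipy import stats

def normality_report(X):
    for j in range(X.shape[1]):
        z = (X[:, j] - X[:, j].mean()) / X[:, j].std(ddof=1)
        ks_stat, ks_p = stats.kstest(z, 'norm')        # K-S vs. standard normal
        sw_stat, sw_p = stats.shapiro(X[:, j])         # Shapiro-Wilk
        print(f"feature {j}: KS p={ks_p:.3f}, SW p={sw_p:.3f}")

    N = X.shape[0]
    d = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    skewness = ((d @ S_inv @ d.T) ** 3).sum() / N ** 2  # Eq. (6)
    print("Mardia's multivariate skewness:", skewness)
```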

Table 3 shows the results of the normality tests. From the table, it is observed that the p values of Mardia's test are less than the significance level of 0.05, based on which it is concluded that the emotional speech data follow a normal distribution.

Table 3 Normality test results using three statistical methods

3.5 Classification models

The four classification methods, namely VQ, K-means, GMM and ANN, are explained in detail below.

3.5.1 VQ

In the VQ approach, a set of feature vectors is mapped to a finite number of vectors called code vectors; the collection of these code vectors forms a codebook. The feature vector generated from a test clip is compared with the code vectors to compute the deviation, also known as distortion: the smaller the deviation, the better the match. The performance of VQ depends on the creation of an effective codebook. In this work, the codebook is computed using the LBG (Linde–Buzo–Gray) algorithm (Linde et al. 1980). Given a test speech feature vector, the distance to each codebook is calculated to find the one with the minimum distance, and the emotion class represented by that codebook is taken as the emotion of the test clip (Soong et al. 1987). Five codebooks, each of size N, are obtained for the five emotions, so all training vectors of an emotion are mapped to the set of code vectors of that specific codebook. Experiments have been conducted by varying the value of N \((N=2, 4, 8)\).
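The sketch below illustrates this scheme; k-means (from scipy) stands in for LBG codebook training, and `train_sets` is a hypothetical mapping from each emotion to its matrix of training feature vectors.

```python
# Sketch: one codebook per emotion, classification by minimum distortion.
import numpy as np
from scipy.cluster.vq import kmeans, vq

def train_codebooks(train_sets, n_codes=8):
    """train_sets: dict mapping emotion -> (num_vectors x dim) feature matrix."""
    return {emo: kmeans(feats.astype(float), n_codes)[0]
            for emo, feats in train_sets.items()}

def classify_vq(test_vec, codebooks):
    distortion = {emo: vq(test_vec.reshape(1, -1), cb)[1][0]
                  for emo, cb in codebooks.items()}
    return min(distortion, key=distortion.get)   # emotion with least distortion
```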

3.5.2 K-means clustering

Although K-means clustering is a well-known clustering algorithm, it can also be used as a classification tool. The training and testing procedures are the same as those of the VQ approach, except that K centroids are formed instead of codebooks. Initially, K centroids for each emotion are selected randomly; because of this random initialization, the final positions of the centroids change every time the K-means algorithm is run. The centroid positions are therefore converged either by running the algorithm n times or by stopping when the centroids show the least average movement.

In this work, a mathematical approach has been used to estimate the distribution of the data. The dispersion of a data distribution is measured by the coefficient of variation (CV), given by \(s/\bar{x}\), where s is the standard deviation and \(\bar{x}\) is the mean of the cluster. In general, a higher CV value indicates greater variability in the data. K-means clustering tends to form clusters whose CV values lie between 0.3 and 1.0; if the CV value falls outside this range, K-means forms final clusters that differ from the true clusters so that the CV value attains the prescribed range, which affects the classification accuracy of the patterns. Table 4 shows the average CV values of the selected emotions for the best five feature combinations.
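One possible reading of the CV computation, sketched below, takes s and \(\bar{x}\) over the members of a cluster dimension-wise and then averages; the exact quantity over which the CV is taken is not spelled out in the text, so this is an illustrative assumption.

```python
# Sketch: coefficient of variation (s / mean) of one cluster's members.
import numpy as np

def cluster_cv(members):
    """members: (num_points x dim) array of vectors assigned to one cluster."""
    m = np.asarray(members, dtype=float)
    cv = m.std(axis=0, ddof=1) / np.abs(m.mean(axis=0))  # per-dimension CV
    return cv.mean()                                      # averaged over dimensions
```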

Table 4 Average coefficient of variation (CV) values of five emotions for best five feature combinations

3.5.3 Gaussian mixture models

GMMs perform better when the data are normally distributed. In this work, a Gaussian mixture model with \('N'\) Gaussian components is developed for each emotion, assuming that the data for each emotion follow a multivariate normal distribution, i.e., a joint distribution of two or more normal variables. Features that follow a multivariate normal distribution can be modeled effectively by GMMs. The classification accuracy of GMMs also depends on factors such as the number of Gaussians per class, the size of the dataset and the distribution of the data. From the statistical tests in Sect. 3.4, it is observed that the feature vector with the selected features follows a normal distribution; as GMMs model normally distributed data effectively, better accuracy is achieved compared to the other techniques.
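The classification step can be summarized by the sketch below (scikit-learn is assumed, and `train_sets` is a hypothetical per-emotion training dictionary): one GMM is trained per emotion, and a test utterance is assigned to the emotion whose model yields the highest log-likelihood.

```python
# Sketch: per-emotion GMMs, classification by maximum log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(train_sets, n_components=4):
    """train_sets: dict mapping emotion -> (num_utterances x dim) feature matrix."""
    return {emo: GaussianMixture(n_components=n_components,
                                 covariance_type='full',
                                 random_state=0).fit(feats)
            for emo, feats in train_sets.items()}

def classify_gmm(test_vec, gmms):
    ll = {emo: g.score_samples(test_vec.reshape(1, -1))[0]
          for emo, g in gmms.items()}
    return max(ll, key=ll.get)
```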

3.5.4 Artificial neural networks

ANNs capture the complex non-linear relations present in the data in a manner similar to the human brain. They contain many simple processing elements called neurons, which are interconnected to learn hidden patterns. In general, ANNs contain three types of layers, namely input, hidden and output layers, each with several neurons. The number of neurons in the input layer equals the length of the feature vector, and the number of neurons in the output layer equals the total number of classes. The structure of the neural network specific to this work is shown in Fig. 3. A simple feed-forward back-propagation neural network (BPNN) has been used for this task (Han and Kamber 2006; Rojas 2013).

Fig. 3
figure 3

Structure of artificial neural network for emotion classification

Table 2 (rows 16–20) shows the best five feature combinations that give better accuracy with ANNs. The ANN used in this work contains one hidden layer. Experiments have been conducted by varying the number of neurons in the hidden layer from \((n+1)/2\) to \(2n/3\), where n is the number of input-layer neurons. The best accuracy is observed with the number of hidden neurons at 1.7 times the number of input neurons.
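A sketch of such a network, using scikit-learn's MLPClassifier as a stand-in for the BPNN described above, is given below; the hidden-layer width of 1.7 times the input dimension follows the observation reported here, while the learning-rate and iteration settings are illustrative assumptions.

```python
# Sketch: one-hidden-layer feed-forward network trained by back-propagation (SGD).
from sklearn.neural_network import MLPClassifier

def build_ann(n_features):
    return MLPClassifier(hidden_layer_sizes=(int(1.7 * n_features),),
                         activation='logistic',    # sigmoid hidden units
                         solver='sgd',             # plain back-propagation
                         learning_rate_init=0.01,
                         max_iter=2000, random_state=0)

# clf = build_ann(n_features=14).fit(X_train, y_train)
# accuracy = clf.score(X_test, y_test)
```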

4 Results and analysis

In this section, important results on the performance of the classifiers and on identifying the feature sets for particular classifiers are discussed, together with the justification of why a classifier works better for a specific kind of data. Initially, all possible subsets of the feature vector are formed and experiments are conducted to identify the features best suited for emotion recognition, using the four classifiers. Table 2 shows the five feature combinations giving the best results for each classifier; the columns represent the features and the rows the chosen feature vectors, and a '\(\checkmark\)' in a cell indicates the inclusion of that feature in the feature vector.

Later, the results are validated using cross-validation. Since residual evaluation cannot indicate how the classifier behaves on an unseen test set, k-fold cross-validation is used in this work. The dataset is divided into k subsets; each subset is held out for testing in turn while the remaining samples are used for training (Kohavi 1995). The average accuracy over all folds is taken as the system accuracy. The experiments have been repeated with different values of k, and the best results are obtained with k = 10 (shown in Table 5).
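This validation step can be written compactly as below (a sketch assuming scikit-learn; any of the four classifiers could be passed as `clf`).

```python
# Sketch: average accuracy over k folds (k = 10 gave the best results here).
from sklearn.model_selection import cross_val_score

def cv_accuracy(clf, X, y, k=10):
    scores = cross_val_score(clf, X, y, cv=k, scoring='accuracy')
    return scores.mean()
```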

Table 5 Accuracy of classification models on different datasets

The performance measurement considered in this work is accuracy and computed using the formula given in Eq. (7).

$$\begin{aligned} \text {Performance accuracy} = \mathrm {round} \left( \frac{I_e}{T_e} \times 100 \right) \end{aligned}$$
(7)

where \(I_e\) is the total number of emotional clips correctly identified and \(T_e\) is the total number of emotional clips.

From the experiments, a few observations related to the classification models are noted here. The classification accuracy of the VQ method depends on various factors such as the number of clusters per class, the size of the dataset and the type of data. As the number of code vectors per codebook (N) increases, the accuracy increases up to a certain value of N (here \(N = 8\)), as can be observed from rows 1 to 5 of Table 2. Similar to VQ, the classification accuracy of K-means clustering also depends on several factors; the algorithm generally tends to form clusters with a relatively uniform distribution of cluster sizes (Xiong et al. 2009), and hence the CV value is computed while forming the clusters. However, K-means is unable to separate the emotional clips because of their ambiguity and non-linearity. The GMM gives the best performance, with 84% accuracy for detecting emotions in the movie database and 81% for the IIT-KGP emotional database. The ANN gives comparable performance when the training set grows, which is also noted in the literature (Yegnanarayana 1994): it gives 72% for the collected movie database and about 7% more for the IIT-KGP database. From the literature, it is known that an ANN can improve its learning ability when a large training set is available.

5 Conclusion and future work

In this work, the classification model that best suits a given feature set is suggested. In contrast to existing approaches such as meta-learning, statistical analyses are performed on the feature set and the classifier is recommended based on the results. This work concludes that classifier performance always depends on the dataset chosen. In this regard, the GMM gives better accuracy when the data fall in the normal distribution region, and the statistical analysis shows that the emotional data largely fall in this region. As emotional data are non-linear and ambiguous, VQ and K-means are inappropriate for mapping them efficiently. The ANN gives better performance as the training set increases, but it is unable to beat the performance of the suggested classifier. Finally, from several observations and analyses, the work concludes that the relevant emotional feature set follows a normal distribution and that the GMM classifies it more effectively than the other classifiers.

A state-of-the-art classifier has to be recommended for every subtask of speech processing in order to achieve good performance. This work may therefore be extended to other speech processing tasks such as speaker recognition, gender recognition and language identification by suggesting suitable classifiers for them. The present work can also be extended by extracting more relevant features and experimenting with suitable classifiers. Moreover, future work will compare feature-dependent systems with deep networks to determine efficient algorithms for further research.