
1 Introduction

One of the most natural ways for humans to express themselves is through speech. We rely on it so much that we try to recreate its expressiveness in other channels of communication, such as emails and text messages, where we commonly use emoticons to convey our emotions. Since emotions are central to communication, emotion detection and analysis are vital in the digital age of remote communication. Emotions are difficult to discern because they are subjective, and there is no universally accepted method for quantifying or categorizing them.

Humans are emotional beings, and they express themselves through speech. The same sentence can carry different meanings when spoken in different tones. Speech can also be sarcastic, carrying a meaning that contradicts its literal linguistic content. If our computers were trained only on plain natural language, they would often misinterpret what is being said [1]. It therefore becomes crucial to understand the emotional intent of the speaker along with the content. Emotion detection has wide-ranging applications in teaching, security, medicine, and entertainment. It could be integrated with conversational AI agents such as Alexa or Siri, so that the agent can identify actual human sentiments and emotions [2]. This would be a major step toward making our computers more ‘human-like.’ In this work, we have focused on a popular set of signal features known as mel-frequency cepstral coefficients (MFCC).

Mel-frequency cepstral coefficients (MFCC) are a well-known feature representation for speech signals, with applications in speech processing that include speech recognition, speaker recognition, speech synthesis, and speech coding. The work in [3] used this feature, which is described as a small set of coefficients (often 10–20) that concisely represents the overall shape of the spectral envelope [4].
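As a brief illustration, the sketch below shows how MFCCs summarize the spectral envelope: they are obtained by taking a discrete cosine transform of the log-mel spectrogram. The file name and the choice of 13 coefficients are assumptions for demonstration only, not values taken from this work.

```python
import librosa

# Load a (hypothetical) speech file at its native sampling rate.
y, sr = librosa.load("speech.wav", sr=None)

# Mel-power spectrogram, then log compression, then MFCCs (a DCT of the log-mel bands).
mel = librosa.feature.melspectrogram(y=y, sr=sr)        # shape: (128, n_frames)
log_mel = librosa.power_to_db(mel)
mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)        # shape: (13, n_frames)

print(mel.shape, mfcc.shape)
```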

2 Related Work

In the modern world, machines are expected to perform voice recognition as naturally as humans do. As a result, a significant portion of research has been conducted in which the goal of SER was re-defined as an image classification problem and then accomplished using a pre-trained model [5]. Prior work has also retrieved notable features from voice data using MFCC. In addition to spectral (roll-off, flux, centroid, bandwidth), energy (root-mean-square energy), raw signal (zero-crossing rate), pitch (fundamental frequency), and chroma features, papers [5, 6] discuss the usage of MFCC features.

Several freely available speech datasets were used alongside the URDU dataset, including SAVEE, EMODB, and EMONO [5]. The EMODB dataset, with seven emotion classes, was utilized in [7]. The RAVDESS dataset is used in [6, 8], and [6] also made use of the TESS dataset.

Researchers have used machine learning techniques and their ensembles, as well as deep learning models such as CNNs, semi-CNNs, and transfer learning models, in their classification methods.

In [5], a comparison of the machine learning techniques decision tree (J48), random forest (RF), and sequential minimal optimization (SMO) was presented, along with an ensemble of these algorithms using majority voting. Paper [7] used MATLAB 2019a and an HP Z440 workstation with an Intel Xeon CPU at 2.1 GHz and 128 GB of RAM to deploy a transfer learning model, AlexNet, for the SER task. The researchers demonstrated the use of autoencoders for dimensionality reduction, followed by Support Vector Machines (SVM), decision tree classifiers, convolutional neural networks (CNN), AlexNet, and ResNet50. The use of a deep transfer learning model to train and recognize emotions was demonstrated in [8].

3 Proposed Methodology

To accomplish the SER goal, this paper focuses on combining the original speech attributes with images generated from the speech signals.

Our model’s design is composed of two main modules:

  1. SER using audio features, evaluated with machine learning models.

  2. CNN models based on spectrogram and MFCC images.

We converted the MFCC signals into images and used CNN to extract relevant features from the MFCC and spectrogram images. In addition, to compare its performance, we used audio features with several machine learning classifiers, including logistic regression, Random Forest, Naïve Bayes, and Support Vector Machine, which were contrasted with one another.

Using CNN to analyze MFCC and spectrogram images produced accuracy levels of 82.5% and 86.25%, respectively.

Our main work has been focused on the following points:

  • Using the Librosa package [9], extract feature vectors from audio (.wav) files; that is, apply feature extraction approaches to the audio data by extracting MFCC features from the audio signals and extracting features from images. Then employ supervised machine learning algorithms and convolutional neural network architectures to perform emotion classification (a minimal feature-extraction sketch follows this list).

  • To extract spectrograms from the audio recordings and feed them into a convolutional neural network architecture to predict the correct output classes for the test dataset.

  • To extract MFCC features from the audio files, convert them to images, and use them with a CNN architecture to predict the labels for the test images.

  • Compare the performance of the proposed MFCC image and spectrogram feature extraction method with existing audio features in terms of AUC score and accuracy.
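As referenced in the first bullet, the sketch below outlines a minimal version of the feature-extraction stage, assuming a flat folder of renamed .wav files (the folder name is hypothetical and the labels follow the renaming described in Sect. 6). For simplicity it uses 13 coefficients and averages each coefficient over time to obtain a fixed-length vector per file; the actual feature vectors used in this work are 216-dimensional, as described in Sect. 4.

```python
import glob
import os
import librosa
import numpy as np

# First letter of the renamed file encodes the emotion (see Sect. 6).
EMOTIONS = {"A": "angry", "H": "happy", "N": "neutral", "S": "sad"}

X, y = [], []
for path in glob.glob("urdu_dataset/*.wav"):                  # hypothetical folder layout
    sig, sr = librosa.load(path, duration=2.5, offset=0.5, res_type="kaiser_fast")
    mfcc = librosa.feature.mfcc(y=sig, sr=sr, n_mfcc=13)      # (13, n_frames)
    X.append(np.mean(mfcc, axis=1))                           # mean per coefficient -> fixed-length vector
    y.append(EMOTIONS[os.path.basename(path)[0].upper()])

X, y = np.array(X), np.array(y)
print(X.shape, y.shape)
```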

4 Data Preprocessing

The initial step is to pre-process the audio files so that they can be used with machine learning methods. Librosa, a Python package for audio analysis, was used to read 2.5 s of each audio file with the resample type ‘kaiser_fast’, a sampling rate of 44,100 Hz, and an offset of 0.5 s. Following that, Librosa’s feature.mfcc method was used to transform the signals into 216-dimensional feature vectors using the sample rates and time series collected from the signals. Mel-frequency cepstral coefficients (MFCC) are among the most commonly used features for speech and emotion recognition.
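A minimal sketch of this preprocessing step is given below. The number of coefficients (13) and the averaging across the coefficient axis are assumptions about the exact implementation; with Librosa’s default hop length of 512, a full 2.5 s clip at 44,100 Hz yields 216 frames, which matches the 216-dimensional vectors described above. The file name is hypothetical.

```python
import librosa
import numpy as np

def extract_feature_vector(path):
    # Read 2.5 s of audio at 44,100 Hz with a 0.5 s offset, resampled with 'kaiser_fast'.
    sig, sr = librosa.load(path, sr=44100, duration=2.5,
                           offset=0.5, res_type="kaiser_fast")
    mfcc = librosa.feature.mfcc(y=sig, sr=sr, n_mfcc=13)   # (13, ~216) for a full clip
    return np.mean(mfcc, axis=0)                           # per-frame mean -> ~216-dim vector

vec = extract_feature_vector("A12.wav")                    # hypothetical renamed file
print(vec.shape)
```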

We experimented with several splits for the train and test datasets, and 80:20 proved to be the best split for this dataset. After performing the train–test split, the dataset is ready to be fed to the machine learning algorithms, which then provide predictions. The spectrogram images for the CNN architectures are extracted using the spectrogram module of the Open-Soundscape library. Open-Soundscape [10] is a utility library for analyzing bioacoustic data. This produced images of 224 by 224 pixels for the audio files. The MFCC images were extracted using the Librosa library from a 2 s speech signal for each audio file. Our CNN architectures can then use these images for image classification.
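The sketch below illustrates generating the two image types for a single file. The OpenSoundscape calls assume its Audio/Spectrogram interface, whose import paths and method names may differ between library versions; the file name and the number of MFCC coefficients are illustrative assumptions.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
from opensoundscape import Audio, Spectrogram

# 224 x 224 spectrogram image via OpenSoundscape (API assumed as above).
audio = Audio.from_file("A12.wav")                              # hypothetical file
Spectrogram.from_audio(audio).to_image(shape=(224, 224)).save("A12_spec.png")

# MFCC image from a 2 s excerpt via Librosa and Matplotlib.
sig, sr = librosa.load("A12.wav", duration=2.0)
mfcc = librosa.feature.mfcc(y=sig, sr=sr, n_mfcc=13)
fig, ax = plt.subplots()
librosa.display.specshow(mfcc, sr=sr, x_axis="time", ax=ax)     # coefficients over time
fig.savefig("A12_mfcc.png")
```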

5 Classifiers

To extract the most relevant features and classify them, we experimented with machine learning and deep learning methods, as described in the sections below.

5.1 Machine Learning Classifiers

Our dataset, the URDU dataset, comprises four output emotion classes, as explained in Sect. 6. Machine learning algorithms are frequently very useful in such supervised learning settings. Supervised learning is a type of machine learning in which machines are trained with well-labeled training data and then predict the output [5]; labeled data means that some input data has already been tagged with its output. Our problem is therefore a multi-class classification problem, which has been solved with classifiers such as logistic regression, Support Vector classifier, Random Forest classifier, and Naïve Bayes classifier [11].
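For illustration, the snippet below sketches how the four emotion labels can be encoded as integers for such multi-class classifiers; the example labels and the use of scikit-learn’s LabelEncoder are assumptions, not necessarily the exact preprocessing used here.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["angry", "happy", "neutral", "sad", "angry"]   # toy example labels
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(labels)                # e.g., angry->0, happy->1, neutral->2, sad->3

print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))
```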

5.2 Deep Learning Classifiers (CNN)

CNN is a neural network-based architecture popular in image classification, which makes it well suited to the task of SER using images. It is useful for feature extraction and classification since it passes values to the next layer while preserving spatial information and can cope with noisy images. The overall CNN architecture comprises several layers that function as input, hidden, and output layers. The hidden layers consist of convolutional layers that produce feature maps, pooling layers, and a fully connected layer. The convolutional and pooling layers collect essential properties from the input data, and the extracted values are mapped to feature maps [12, 13]. In this process, the characteristics of the MFCC and spectrogram images are extracted, and the fully connected layer then uses the extracted features to perform classification.
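A generic Keras sketch of such a convolution, pooling, and fully connected stack is shown below. The layer sizes, activation choices, and optimizer are illustrative assumptions and do not reproduce the exact architecture listed in Table 1.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),              # spectrogram or MFCC image
    layers.Conv2D(32, (3, 3), activation="relu"),   # convolution -> feature maps
    layers.MaxPooling2D((2, 2)),                    # spatial down-sampling
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),           # fully connected layer
    layers.Dense(4, activation="softmax"),          # four emotion classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```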

6 Dataset Description

For our SER task, we used the publicly available URDU dataset [14]. The URDU dataset consists of emotional utterances collected from Urdu talk shows on YouTube. It contains 400 utterances depicting four fundamental emotions: angry, happy, neutral, and sad. There are 38 speakers in total (27 male and 11 female), selected at random. The nomenclature used to label the files in the dataset includes information on the speaker, gender, file number for that speaker, and overall numbering of the file within a given emotion. The files have been renamed so that the first letter indicates the emotion: S for Sad, H for Happy, A for Angry, and N for Neutral, followed by a number indicating the file order. Each of the four classes contains 100 audio files with the .wav extension. We experimented with different splits for training the models, such as 75:25 and 80:20; the latter proved preferable, so we chose 80:20 for training and testing. As a result, the training dataset contained 320 speech files, and the remaining 80 speech files were used to test the performance of our models. The dataset is also available on GitHub (https://github.com/siddiquelatif/URDU-Dataset).
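As a quick sanity check, the sketch below verifies the class balance and the 80:20 split sizes described above; the folder layout (one sub-folder per emotion) and path handling are assumptions about how the downloaded dataset is organized.

```python
import glob
from collections import Counter
from sklearn.model_selection import train_test_split

files = glob.glob("URDU-Dataset/*/*.wav")              # hypothetical layout: one folder per emotion
labels = [path.split("/")[-2] for path in files]
print(Counter(labels))                                 # expect 100 files per emotion class

train_files, test_files = train_test_split(files, test_size=0.20,
                                            stratify=labels, random_state=0)
print(len(train_files), len(test_files))               # expect 320 and 80
```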

7 Experimental Setup and Hyperparameters

To facilitate comparability, the machine learning models were trained on identical training and testing datasets. Scikit-learn, a popular Python machine learning toolkit [15], was used to build these models. Logistic regression achieved a test accuracy of 48.75% with the hyperparameters solver = ‘saga’, penalty = ‘l2’, and max_iter = 80. The accuracy of SVM was 56.25%, while Naïve Bayes provided an accuracy of 58.75%. The Random Forest model obtained an accuracy of 60% on the test dataset after tuning the hyperparameters n_estimators = 120 and criterion = ‘entropy’.
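The sketch below maps the reported hyperparameters onto scikit-learn estimators. The SVM and Naïve Bayes settings are not reported in the text and are left at library defaults, which is an assumption, and synthetic stand-in data of the same shape as the real feature matrix is used so the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# Stand-in data with the same shape as the real features (400 files, 216 dims, 4 classes).
X, y = make_classification(n_samples=400, n_features=216, n_informative=20,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

classifiers = {
    "logistic_regression": LogisticRegression(solver="saga", penalty="l2", max_iter=80),
    "svm": SVC(probability=True),                    # settings not reported; defaults assumed
    "naive_bayes": GaussianNB(),                     # settings not reported; defaults assumed
    "random_forest": RandomForestClassifier(n_estimators=120, criterion="entropy"),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))           # test accuracy
```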

Apart from the machine learning algorithms, convolutional neural networks were also applied to the generated MFCC and spectrogram images. The CNN model using MFCC images achieved a test accuracy of 82.5%, and the CNN model using spectrogram images achieved a test accuracy of 86.25%. Table 1 shows the CNN architecture used.

Table 1 CNN architecture for the spectrogram model and the MFCC model

8 Experimental Results and Conclusion

The classification metrics used for comparing the models are accuracy and the AUC score, i.e., the area under the ROC curve. Table 2 shows the AUC scores, and it is observed that the CNN using spectrogram and MFCC images performed significantly better than the traditional audio features. We have also plotted the AUC-ROC curves for the ML models with audio features and the CNN with MFCC and spectrogram images, as shown in Fig. 1.
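A brief sketch of computing the multi-class AUC score is shown below; the one-vs-rest, macro-averaged variant is an assumption about the exact metric configuration, and any fitted classifier exposing predicted probabilities (such as those from the Sect. 7 sketch) can be scored this way.

```python
from sklearn.metrics import roc_auc_score

# `clf`, `X_test`, and `y_test` as in the Sect. 7 sketch.
y_proba = clf.predict_proba(X_test)                 # class-probability matrix, shape (n_samples, 4)
auc = roc_auc_score(y_test, y_proba, multi_class="ovr", average="macro")
print(f"macro one-vs-rest AUC: {auc:.3f}")
```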

Table 2 AUC scores in tabular form
Fig. 1 AUC-ROC curves for all models

Table 3 presents a comparative examination of the models in terms of accuracy. It shows that logistic regression with MFCC features performed the lowest, while the CNN with spectrogram images performed the best among all models.

Table 3 Comparative performance in terms of accuracy (%)

This research aims to compare machine learning and deep learning models in executing the SER task. As shown in Fig. 2, we made comparisons between various machine learning and deep learning models. Among the machine learning models, logistic regression achieved 48.75%, SVM 56.25%, Naïve Bayes 58.75%, and Random Forest the highest accuracy at 60%. Following that, we used a CNN on MFCC images to achieve an accuracy of 82.5% and on spectrogram images to achieve an accuracy of 86.25%. We observed that image-based features can play a crucial role in extracting emotion from the speech signal. In the future, spectrogram image features could be combined with text-based features [16] to enhance performance and improve the robustness of the model.

Fig. 2 Class-wise accuracies represented as bar plots