
Speech Emotion Recognition Using Machine Learning

  • Conference paper
ICT Systems and Sustainability (ICT4SD 2023)

Part of the book series: Lecture Notes in Networks and Systems ((LNNS,volume 765))


Abstract

This paper presents improved research on speech emotion recognition (SER) systems. The definition and classification of emotional states, and the ways emotions are expressed, are introduced theoretically. A SER system based on a CNN classifier with MFCC feature extraction is developed: mel-frequency cepstral coefficients (MFCCs) are extracted from the audio signals and used to train the classifiers, and a convolutional neural network (CNN) categorizes all seven emotions. The Surrey Audio-Visual Expressed Emotion (SAVEE), Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), Toronto Emotional Speech Set (TESS), and Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D) databases serve as the experimental datasets, and results are reported for all four using the CNN classifier. With a 1D-CNN, overall emotion recognition accuracy is 43%, gender recognition accuracy is 81%, and gender-independent emotion recognition accuracy is 48%. With a 2D-CNN, overall emotion recognition accuracy is 67.58%, gender recognition accuracy is 98%, and gender-independent emotion recognition accuracy is 65%.
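The paper does not include code, but the MFCC front end it describes follows the standard recipe: pre-emphasis, windowed framing, power spectrum, mel filterbank, log compression, and a DCT to decorrelate the bands. A minimal NumPy sketch of that pipeline is below; all parameters (16 kHz sampling, 512-point FFT, 160-sample hop, 26 mel bands, 13 coefficients) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_mfcc=13):
    """Sketch of MFCC extraction: pre-emphasis, Hann-windowed framing,
    power spectrum, triangular mel filterbank, log, DCT-II.
    Parameter defaults are illustrative, not from the paper."""
    # Pre-emphasis boosts high frequencies before analysis
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice the signal into overlapping frames and apply a Hann window
    n_frames = 1 + (len(sig) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hanning(n_fft)
    # Per-frame power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank between 0 Hz and the Nyquist frequency
    hz_to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_to_hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log-mel energies; keep the first n_mfcc terms
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T  # shape: (n_frames, n_mfcc)

# Example: one second of a synthetic 440 Hz tone
sr = 16000
t = np.arange(sr) / sr
feats = mfcc(np.sin(2 * np.pi * 440 * t), sr=sr)
print(feats.shape)  # prints (97, 13)
```

The resulting (frames × coefficients) matrix is the kind of feature map the paper feeds to its classifiers: flattened or pooled along time for a 1D-CNN, or treated as a 2D image-like input for a 2D-CNN.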



Author information

Corresponding author

Correspondence to Rohini R. Mergu.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Mergu, R.R., Shelke, R.J., Bagade, Y., Walchale, P., Yemul, H. (2023). Speech Emotion Recognition Using Machine Learning. In: Tuba, M., Akashe, S., Joshi, A. (eds) ICT Systems and Sustainability. ICT4SD 2023. Lecture Notes in Networks and Systems, vol 765. Springer, Singapore. https://doi.org/10.1007/978-981-99-5652-4_12
