Abstract
In this paper, we examine the “speaker independent” performance of speech emotion recognition systems. The notion of independence suggests that, irrespective of the personal attributes of the speaker (race, gender, age), and given reasonably clear speech with little or no noise or distraction, two different systems should produce the same output. The systems differ in the training data used for recognising emotions, in the algorithms for recognising human speech, and in the spectral, prosodic, and quality attributes of the speech in the training database. We describe the statistically significant differences between the outputs of two major speech emotion recognition systems (SERS), openSMILE and Vokaturi, both of which were trained on posed data. Our sample comprised spontaneous speech from politicians and their spokespersons: 71 speakers with an elapsed time of around 16.66 h. We focused on speeches delivered by the politicians and on statements made by spokespersons; in some cases these were answers to questions from journalists. We found differences attributable to the age and the race of the politicians and spokespersons. Even setting aside the vagaries of spontaneous speech, the differences between the outputs of the two major SERS indicate that (a) at the theoretical level, the speaker-independent claims of such systems do not hold, and (b) at the practical level, efforts should be made to widen the training data to include a variety of races and ages and, equally importantly, to evaluate which of the spectral, prosodic, and quality features should be used as proxies for emotions.
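The specific statistical tests are described in the body of the paper. As a hedged illustration of the kind of comparison involved, the following stdlib-only sketch computes a Pearson chi-square statistic of homogeneity over emotion-label counts from two systems; all label names and counts here are invented for illustration and are not the paper's data or code.

```python
from collections import Counter

def chi_square_homogeneity(labels_a, labels_b):
    """Pearson chi-square statistic for homogeneity of two samples of
    categorical labels, plus the degrees of freedom (categories - 1)."""
    ca, cb = Counter(labels_a), Counter(labels_b)
    cats = sorted(set(ca) | set(cb))
    na, nb = len(labels_a), len(labels_b)
    stat = 0.0
    for c in cats:
        total = ca[c] + cb[c]
        for observed, n in ((ca[c], na), (cb[c], nb)):
            expected = total * n / (na + nb)
            stat += (observed - expected) ** 2 / expected
    return stat, len(cats) - 1

# Hypothetical outputs of two SERS over the same 100 utterances.
labels_a = ["neutral"] * 50 + ["angry"] * 30 + ["happy"] * 20
labels_b = ["neutral"] * 70 + ["angry"] * 10 + ["happy"] * 20

stat, df = chi_square_homogeneity(labels_a, labels_b)
# statistic ~ 13.33 with df = 2, exceeding the 5% critical value of 5.99,
# so these two (invented) label distributions would differ significantly.
print(stat, df)
```

A significant statistic here would mean the two systems' label distributions differ, which is the sense in which the abstract speaks of statistically significant differences between SERS outputs.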
Notes
1. In the current work, age is approximated; ongoing work on a larger corpus, which includes the recordings sampled here, refines this by determining each speaker's age in years and months at the time of the recorded event.
2. We adopt distinctions as defined by the US Census Bureau [https://www.census.gov/topics/population/race/about.html, last verified 12/09/2022].
3. Extremely low p-values are treated as zeros by the Python packages used here.
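Note 3 reflects double-precision underflow: a true p-value below roughly 5e-324 (the smallest positive double) cannot be represented and is reported as exactly 0.0. A minimal stdlib sketch of a normal-tail p-value, which is not the paper's code, shows the effect:

```python
import math

def normal_sf(z: float) -> float:
    # Upper-tail p-value of the standard normal via the complementary
    # error function (the usual closed form, not any package's internals).
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# An extreme test statistic: the true p-value (~1e-350) lies below the
# smallest representable positive double, so the result underflows to 0.0.
print(normal_sf(40.0))  # 0.0

# A merely large statistic still yields a representable, nonzero p-value.
print(normal_sf(8.0))   # ~6.2e-16
```

When such underflow matters, statistics packages typically offer log-scale variants (e.g. a log survival function) rather than the raw p-value.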
Acknowledgments and Contributions
Khurshid Ahmad is the Principal Investigator on this project; Carl Vogel is a co-Principal Investigator and designed the statistical tests for the reported comparisons; Deepayan Datta and Wanying Jiang are postgraduate students working on the project. We thank Subishi Chemmarathil for her help in curating the database.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Datta, D., Jiang, W., Vogel, C., Ahmad, K. (2023). Speech Emotion Recognition Systems: A Cross-Language, Inter-racial, and Cross-Gender Comparison. In: Arai, K. (eds) Advances in Information and Communication. FICC 2023. Lecture Notes in Networks and Systems, vol 651. Springer, Cham. https://doi.org/10.1007/978-3-031-28076-4_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28075-7
Online ISBN: 978-3-031-28076-4
eBook Packages: Intelligent Technologies and Robotics (R0)