Automatic Identification of Vietnamese Singer Voices Using Deep Learning and Data Augmentation

  • Conference paper
Advances in Information and Communication Technology (ICTA 2023)

Abstract

Singer identification is a challenging problem in audio processing and voice recognition: the task is to determine a singer's identity from an audio clip or song. Applications range from building personalised playlists to categorising and sorting large music libraries. Traditional voice identification methods often rely on handcrafted features and statistical models. Deep learning models, by contrast, can learn relevant features automatically from raw audio data, which lets them capture subtle patterns and representations that were previously difficult to detect. In this paper, we present a study of the automatic identification of Vietnamese singer voices using deep learning and data augmentation. We built a dataset of 2200 excerpts from ten Vietnamese artists and identified these artists' voices using GRU, LSTM, and CNN models. In our experiments, the GRU model achieved higher identification accuracy than the LSTM and CNN models. We also propose a new data augmentation method for singer voice identification, which proved much more effective than conventional audio augmentation methods such as noise addition and pitch shifting.
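To make the two conventional baselines named in the abstract concrete, the following is a minimal sketch of noise addition and pitch shifting on a raw waveform. It does not reproduce the paper's proposed augmentation method, which the abstract does not describe. The function names `add_noise` and `naive_pitch_shift`, the SNR parameterisation, and the 440 Hz test tone are all assumptions made for illustration. The pitch shift here is a deliberately simple resampling approach that also changes the clip's duration; audio libraries such as librosa provide pitch shifting that preserves duration.

```python
import numpy as np


def add_noise(signal, snr_db, rng=None):
    """Add white Gaussian noise at approximately the requested SNR in dB.

    Hypothetical helper: the SNR-based parameterisation is an assumption,
    not a setting taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def naive_pitch_shift(signal, semitones):
    """Shift pitch by resampling with linear interpolation.

    Simplification: reading the waveform faster raises the pitch but also
    shortens the clip by the same factor.
    """
    rate = 2.0 ** (semitones / 12.0)           # frequency ratio per semitone step
    n_out = int(round(len(signal) / rate))     # output length after resampling
    positions = np.arange(n_out) * rate        # fractional read positions
    return np.interp(positions, np.arange(len(signal)), signal)


# Illustrative input: one second of a 440 Hz tone at an assumed 16 kHz rate.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2.0 * np.pi * 440.0 * t)

noisy = add_noise(tone, snr_db=20.0, rng=np.random.default_rng(0))
shifted = naive_pitch_shift(tone, semitones=2.0)
```

Each augmented copy would be added to the training set under the same singer label as the source excerpt. The SNR and semitone values above are placeholders and would need tuning on real data.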

Author information

Corresponding author

Correspondence to Chu Ba Thanh.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Le Thuy, D.T., Thanh, C.B., Van Loan, T., Thanh, L.X. (2024). Automatic Identification of Vietnamese Singer Voices Using Deep Learning and Data Augmentation. In: Nghia, P.T., Thai, V.D., Thuy, N.T., Son, L.H., Huynh, VN. (eds) Advances in Information and Communication Technology. ICTA 2023. Lecture Notes in Networks and Systems, vol 848. Springer, Cham. https://doi.org/10.1007/978-3-031-50818-9_27