Abstract
Singer identification is a complex problem in audio processing and voice recognition. It aims to determine the identity of a singer from an audio clip or song. This is useful in various applications, from constructing personalised playlists to categorising and sorting large music libraries. Traditional methods of voice identification often rely on handcrafted features and statistical models. However, deep learning models, particularly neural networks, have demonstrated a strong ability to learn relevant features automatically from raw audio data, which allows them to uncover subtle patterns and representations that were previously difficult to detect. In this paper, we present our study of the automatic identification of Vietnamese singers' voices using deep learning and data augmentation. We built a dataset of 2200 excerpts from ten Vietnamese artists, then identified these artists' voices using GRU, LSTM, and CNN models. Our experiments showed that the GRU model achieved higher identification accuracy than the LSTM and CNN models. We also propose a new data augmentation method for singer voice identification. This method proved to be much more effective than the usual augmentation methods for audio data, such as noise addition and pitch shifting.
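For readers unfamiliar with the two conventional augmentations named above, noise addition and pitch shifting, the following is a minimal sketch of each. It is not the authors' pipeline and does not reproduce their proposed method; the function names, the 20 dB signal-to-noise ratio, and the two-semitone shift are all illustrative assumptions. The pitch shift shown is a crude resampling variant that also changes the clip duration, whereas audio libraries usually preserve duration.

```python
import numpy as np


def add_noise(signal, snr_db, rng=None):
    """Add white Gaussian noise at a target signal-to-noise ratio (in dB).

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=np.float64)
    signal_power = np.mean(signal ** 2)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def pitch_shift_by_resampling(signal, n_steps):
    """Shift pitch by n_steps semitones via linear-interpolation resampling.

    Hypothetical helper for illustration only. Raising the pitch this way
    shortens the clip (and lowering it lengthens the clip), because the
    samples are simply read out faster or slower.
    """
    signal = np.asarray(signal, dtype=np.float64)
    rate = 2.0 ** (n_steps / 12.0)  # frequency ratio for n_steps semitones
    new_len = int(round(len(signal) / rate))
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, new_len)
    return np.interp(new_idx, old_idx, signal)


# Usage on a synthetic one-second 220 Hz tone (assumed sample rate: 22050 Hz).
sr = 22050
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)

noisy = add_noise(tone, snr_db=20.0, rng=np.random.default_rng(0))
shifted = pitch_shift_by_resampling(tone, n_steps=2)
```

Each augmented copy keeps the label of the original excerpt, so the training set grows without any new recordings being collected.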
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Le Thuy, D.T., Thanh, C.B., Van Loan, T., Thanh, L.X. (2024). Automatic Identification of Vietnamese Singer Voices Using Deep Learning and Data Augmentation. In: Nghia, P.T., Thai, V.D., Thuy, N.T., Son, L.H., Huynh, VN. (eds) Advances in Information and Communication Technology. ICTA 2023. Lecture Notes in Networks and Systems, vol 848. Springer, Cham. https://doi.org/10.1007/978-3-031-50818-9_27
DOI: https://doi.org/10.1007/978-3-031-50818-9_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-50817-2
Online ISBN: 978-3-031-50818-9
eBook Packages: Intelligent Technologies and Robotics (R0)