Abstract
In this paper, we propose a new architecture for environmental sound classification on the ESC-50 and urban sound (UrbanSound8K) datasets. The ESC-50 dataset is a collection of 2000 labeled environmental audio recordings, and the urban sound dataset is a collection of 8732 labeled sound excerpts. Mel-frequency cepstral coefficients (MFCCs) were used to represent the short-term power spectrum of each sound wave. The resulting coefficient matrix can be treated like an image, which makes it possible to apply a convolutional neural network (CNN) to the data. The proposed architecture repeatedly extracts increasingly complex features and, by using ResNet-style residual connections, can carry them through a much deeper network. After fine-tuning, the network achieved 89.5% validation accuracy on the ESC-50 dataset and 96.76% on the urban sound dataset.
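The abstract states only that MFCCs turn each clip into a matrix that a CNN can consume. A minimal NumPy sketch of that step follows: framing, Hann windowing, power spectrum, triangular mel filterbank, log compression, and a DCT-II. It is an illustration under stated assumptions, not the authors' pipeline: the frame size (`n_fft=2048`), hop (`512`), number of mel bands and coefficients (`40`), the HTK-style mel formula, and all function names here are hypothetical choices.

```python
import numpy as np


def hz_to_mel(hz):
    # HTK-style mel scale (assumed): m = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters evenly spaced on the mel scale, shape (n_mels, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sr / 2.0, n_bins)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def dct_ii_matrix(n_out, n_in):
    """Orthonormal DCT-II basis, shape (n_out, n_in)."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))
    basis[0] /= np.sqrt(2.0)  # scale the DC row so the basis is orthonormal
    return basis * np.sqrt(2.0 / n_in)


def mfcc(signal, sr, n_fft=2048, hop=512, n_mels=40, n_mfcc=40):
    """Return an (n_mfcc, n_frames) MFCC matrix for a mono signal."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < n_fft:  # pad very short clips to one full frame
        signal = np.pad(signal, (0, n_fft - len(signal)))
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)                              # (frames, n_fft)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft     # (frames, bins)
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sr).T              # (frames, n_mels)
    log_mel = np.log(mel_energy + 1e-10)                                  # avoid log(0)
    coeffs = log_mel @ dct_ii_matrix(n_mfcc, n_mels).T                    # (frames, n_mfcc)
    return coeffs.T
```

For a 1-second clip sampled at 22050 Hz, these settings yield a 40 × 40 matrix; adding a channel axis turns it into the single-channel "image" a CNN expects. In practice, a library routine such as `librosa.feature.mfcc` would normally be used instead; its defaults (Slaney-style mel filters, dB scaling) differ from this sketch, so the numbers will not match exactly.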
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Jangid, M., Nagpal, K. (2022). Sound Classification Using Residual Convolutional Network. In: Nanda, P., Verma, V.K., Srivastava, S., Gupta, R.K., Mazumdar, A.P. (eds) Data Engineering for Smart Systems. Lecture Notes in Networks and Systems, vol 238. Springer, Singapore. https://doi.org/10.1007/978-981-16-2641-8_23
DOI: https://doi.org/10.1007/978-981-16-2641-8_23
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-2640-1
Online ISBN: 978-981-16-2641-8
eBook Packages: Intelligent Technologies and Robotics (R0)