Abstract
Voice Activity Detection (VAD) is an essential front-end component in many speech processing systems. Because such systems are commonly deployed in environments with diverse noise types and low signal-to-noise ratios (SNRs), an effective VAD method should robustly detect speech regions within noisy background signals. In this paper, we propose applying an adversarial domain adaptation technique to VAD. The proposed method trains DNN models for the VAD task in a supervised manner while simultaneously mitigating the domain mismatch between noisy and clean audio streams in an unsupervised manner. The experimental results show that the proposed method improves detection robustness in noisy environments compared to other DNN-based models trained on hand-crafted acoustic features.
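The training scheme described above follows the domain-adversarial pattern (a supervised task loss on labeled clean data plus a domain discriminator whose gradient is reversed before reaching the shared feature extractor, as in DANN/ADDA-style methods). The minimal numpy sketch below illustrates that pattern on synthetic data; all dimensions, variable names, and hyperparameters (`lam`, `lr`) are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

# Toy stand-ins for the two domains: labeled "clean" (source) frames and
# unlabeled "noisy" (target) frames. Dimensions are illustrative only.
d = 8
Xs = rng.normal(0.0, 1.0, (64, d))        # source (clean) feature frames
ys = (Xs[:, 0] > 0).astype(float)         # speech / non-speech labels
Xt = Xs + rng.normal(0.5, 0.5, (64, d))   # target (noisy) frames, shifted

# One linear layer each: feature extractor F, VAD head C, domain head D.
Wf = rng.normal(0, 0.1, (d, d))
wc = rng.normal(0, 0.1, d)
wd = rng.normal(0, 0.1, d)

lam, lr = 0.1, 0.05                       # reversal weight, learning rate
history = []                              # supervised VAD loss per step
for step in range(300):
    Fs, Ft = Xs @ Wf, Xt @ Wf

    # Supervised VAD loss (binary cross-entropy) on labeled source frames.
    ps = sigmoid(Fs @ wc)
    history.append(-np.mean(ys * np.log(ps + 1e-9)
                            + (1 - ys) * np.log(1 - ps + 1e-9)))

    # Domain discriminator targets: source = 0, target = 1.
    qs, qt = sigmoid(Fs @ wd), sigmoid(Ft @ wd)

    # BCE gradients with respect to the logits.
    g_vad = (ps - ys) / len(ys)
    g_ds, g_dt = qs / len(qs), (qt - 1.0) / len(qt)

    # Feature extractor: descend the VAD loss but *ascend* the domain loss
    # (gradient reversal), pushing features toward domain invariance.
    grad_Wf = (Xs.T @ np.outer(g_vad, wc)
               - lam * (Xs.T @ np.outer(g_ds, wd) + Xt.T @ np.outer(g_dt, wd)))

    # Each head descends its own loss; the extractor uses the reversed grad.
    wc -= lr * Fs.T @ g_vad
    wd -= lr * (Fs.T @ g_ds + Ft.T @ g_dt)
    Wf -= lr * grad_Wf
```

The key design choice is the minus sign in `grad_Wf`: the discriminator is trained to separate domains, while the shared extractor receives the negated domain gradient, so the two play the adversarial game that aligns clean and noisy feature distributions.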
Acknowledgment
This research was partially supported by the National Research Foundation (NRF) Grant (No. 2019R1F1A1048115), the Institute of Information & communications Technology Planning & Evaluation (IITP) Grant (No. IITP-2021-0-00066), and the ICT Creative Consilience program (No. IITP-2020-0-01821) funded by the Korea government (MSIT).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kim, T., Ko, J.H. (2022). Application of Adversarial Domain Adaptation to Voice Activity Detection. In: Arai, K. (eds) Intelligent Systems and Applications. IntelliSys 2021. Lecture Notes in Networks and Systems, vol 296. Springer, Cham. https://doi.org/10.1007/978-3-030-82199-9_55
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82198-2
Online ISBN: 978-3-030-82199-9
eBook Packages: Intelligent Technologies and Robotics