Abstract
Multi-Task Learning (MTL) has proven its effectiveness for decades. By combining related tasks, neural networks tend to perform better thanks to the inductive biases transferred between those tasks. As a result, many AI systems (such as GPT) adopt MTL as a de facto solution. MTL was applied early in the field of automatic speech recognition (ASR) and has produced significant advances. Continuing this line of work, we propose an MTL-style method that jointly addresses automatic speech recognition and speech enhancement, with speech enhancement serving as an auxiliary task. We use the Conformer acoustic model as the default architecture in this study, modified to accommodate both tasks. With the proposed method, ASR performance improves by about 11.5% on the VIVOS dataset and by 10.2% on the LibriSpeech 100 h test-other set.
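The abstract describes joint training in which ASR is the main task and speech enhancement an auxiliary one. A common way to realize this is to sum the two losses with an auxiliary weight; the sketch below illustrates that pattern, a minimal sketch only. The loss choices (cross-entropy as a stand-in for a CTC/transducer loss, MSE for enhancement) and the weight `lambda_se` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def mse_loss(enhanced: np.ndarray, clean: np.ndarray) -> float:
    """Auxiliary speech-enhancement loss: MSE between enhanced and clean signals."""
    return float(np.mean((enhanced - clean) ** 2))

def token_nll_loss(log_probs: np.ndarray, targets: np.ndarray) -> float:
    """Main ASR loss: mean negative log-likelihood of the target tokens
    (a simplified stand-in for a CTC or RNN-T objective)."""
    return float(-np.mean(log_probs[np.arange(len(targets)), targets]))

def joint_mtl_loss(log_probs, targets, enhanced, clean, lambda_se: float = 0.3) -> float:
    """Joint objective: ASR loss plus a weighted auxiliary enhancement loss."""
    return token_nll_loss(log_probs, targets) + lambda_se * mse_loss(enhanced, clean)
```

In such a setup, both losses backpropagate through a shared encoder, so gradients from the enhancement head act as a regularizer that encourages noise-robust representations for the ASR head.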
Acknowledgements
We acknowledge Ho Chi Minh City University of Technology (HCMUT), VNU-HCM for supporting this study.
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Huynh, N.H.N. et al. (2023). Improving Automatic Speech Recognition via Joint Training with Speech Enhancement as Multi-task Learning. In: Dao, NN., Thinh, T.N., Nguyen, N.T. (eds) Intelligence of Things: Technologies and Applications. ICIT 2023. Lecture Notes on Data Engineering and Communications Technologies, vol 187. Springer, Cham. https://doi.org/10.1007/978-3-031-46573-4_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46572-7
Online ISBN: 978-3-031-46573-4
eBook Packages: Intelligent Technologies and Robotics (R0)