Abstract
This study introduces a method for improving the accuracy and noise robustness of End-to-End Automatic Speech Recognition (ASR) systems by combining Gammatone Frequency Cepstral Coefficient (GTCC) and Mel Frequency Cepstral Coefficient (MFCC) features with a hybrid CNN-BiGRU acoustic model. The combined MFCC and GTCC features capture complementary temporal and spectral aspects of speech, while the hybrid architecture models both local patterns (CNN) and long-range context (BiGRU). The approach is evaluated on a low-resource Gujarati multi-speaker speech dataset under both clean and noisy conditions, the latter created by adding white noise. Compared with a baseline using MFCC features and greedy decoding, the proposed method reduces Word Error Rate (WER) by 4.6% on clean speech and by a substantial 7.83% on noisy speech. These results suggest the method can make ASR systems more reliable and accurate for real-world applications that require precise speech-to-text conversion.
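As a rough illustration of the feature pipeline the abstract describes, the sketch below extracts MFCC-style and GTCC-style cepstra frame by frame and concatenates them. It is a minimal NumPy approximation, not the authors' implementation: the "GTCC" branch uses triangular filters spaced on the ERB scale as a stand-in for true gammatone filters, and all frame sizes, filter counts, and function names are illustrative assumptions.

```python
import numpy as np

# Frequency-scale warps: mel (for MFCC) and ERB (stand-in for gammatone/GTCC).
def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
def hz_to_erb(f): return 21.4 * np.log10(1 + 0.00437 * f)
def erb_to_hz(e): return (10 ** (e / 21.4) - 1) / 0.00437

def triangular_fb(n_filters, n_fft, sr, to_scale, from_scale):
    """Triangular filterbank with centers equally spaced on the warped scale."""
    pts = from_scale(np.linspace(to_scale(0), to_scale(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l: fb[i, l:c] = (np.arange(l, c) - l) / (c - l)  # rising edge
        if r > c: fb[i, c:r] = (r - np.arange(c, r)) / (r - c)  # falling edge
    return fb

def dct_ii(x, n_out):
    """Type-II DCT along the first axis, keeping the first n_out coefficients."""
    n = x.shape[0]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return basis @ x

def cepstra(signal, fb, n_ceps=13, frame=400, hop=160, n_fft=512):
    """Windowed power spectrum -> filterbank log-energies -> DCT cepstra."""
    frames = np.lib.stride_tricks.sliding_window_view(signal, frame)[::hop]
    frames = frames * np.hanning(frame)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2       # (n_frames, 257)
    energies = np.log(power @ fb.T + 1e-10)                 # (n_frames, 26)
    return dct_ii(energies.T, n_ceps).T                     # (n_frames, 13)

sr = 16000
x = np.random.randn(sr)  # stand-in for one second of Gujarati speech
mel_fb = triangular_fb(26, 512, sr, hz_to_mel, mel_to_hz)
erb_fb = triangular_fb(26, 512, sr, hz_to_erb, erb_to_hz)
mfcc = cepstra(x, mel_fb)
gtcc = cepstra(x, erb_fb)  # ERB-warped approximation, not true gammatone
features = np.concatenate([mfcc, gtcc], axis=1)  # per-frame MFCC + GTCC vector
```

The concatenated per-frame vectors would then feed the CNN-BiGRU acoustic model; that network and the CTC training loop are omitted here for brevity.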
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Bhagat, B., Dua, M. (2024). Enhancing Performance of Noise-Robust Gujarati Language ASR Utilizing the Hybrid Acoustic Model and Combined MFCC + GTCC Feature. In: Verma, O.P., Wang, L., Kumar, R., Yadav, A. (eds) Machine Intelligence for Research and Innovations. MAiTRI 2023. Lecture Notes in Networks and Systems, vol 832. Springer, Singapore. https://doi.org/10.1007/978-981-99-8129-8_19
DOI: https://doi.org/10.1007/978-981-99-8129-8_19
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8128-1
Online ISBN: 978-981-99-8129-8
eBook Packages: Intelligent Technologies and Robotics (R0)