Abstract
In this work, we explore a Connectionist Temporal Classification (CTC) based end-to-end Automatic Speech Recognition (ASR) model for the Myanmar language. A series of experiments on the model topology is presented: convolutional layers are added and removed, bidirectional long short-term memory (BLSTM) stacks of different depths are compared, and different label encoding methods are investigated. The experiments are carried out in low-resource scenarios using our recorded Myanmar speech corpus of nearly 26 hours. The best model achieves a character error rate (CER) of 4.72% and a syllable error rate (SER) of 12.38% on the test set.
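Both reported metrics are edit-distance based: CER treats each character as a unit, while SER treats each syllable (e.g. as produced by a syllable segmenter such as sylbreak) as a unit. As a minimal illustrative sketch, not the authors' evaluation code, either rate can be computed as the Levenshtein distance between reference and hypothesis divided by the reference length:

```python
def edit_distance(ref, hyp):
    # Dynamic-programming Levenshtein distance between two sequences.
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of ref[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of hyp[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

def error_rate(ref_units, hyp_units):
    # CER if units are characters, SER if units are syllables.
    return edit_distance(ref_units, hyp_units) / len(ref_units)
```

For CER the inputs would be character lists of the reference and hypothesis transcripts; for SER, lists of syllables from a segmenter.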
Acknowledgments
The authors are grateful to the advisors from the University of Information Technology who gave us helpful comments and suggestions throughout this project. The authors also thank Ye Yint Htoon and May Sabal Myo for helping us with the dataset preparation and for technical assistance.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chit, K.M.M., Lin, L.L. (2021). Exploring CTC Based End-To-End Techniques for Myanmar Speech Recognition. In: Vasant, P., Zelinka, I., Weber, GW. (eds) Intelligent Computing and Optimization. ICO 2020. Advances in Intelligent Systems and Computing, vol 1324. Springer, Cham. https://doi.org/10.1007/978-3-030-68154-8_87
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-68153-1
Online ISBN: 978-3-030-68154-8
eBook Packages: Intelligent Technologies and Robotics (R0)