Abstract
Speech is the most common form of human communication. By contrast, the keyboard and mouse remain the most common ways of entering data into a computer. It would be valuable if computers could understand and carry out spoken human commands. Automatic speech recognition (ASR) is the process of obtaining the transcription (word sequence) of an utterance from the speech waveform. Over the last few decades, speech technology and systems for human-computer interaction have progressed steadily and significantly. This chapter presents a comprehensive review of ASR systems and their most recent developments. It outlines and explains some of the popular approaches used at various stages of speech recognition systems and highlights the unique and innovative characteristics of selected systems.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Kumar, T., Mahrishi, M., Meena, G. (2022). A Comprehensive Review of Recent Automatic Speech Summarization and Keyword Identification Techniques. In: Fernandes, S.L., Sharma, T.K. (eds) Artificial Intelligence in Industrial Applications. Learning and Analytics in Intelligent Systems, vol 25. Springer, Cham. https://doi.org/10.1007/978-3-030-85383-9_8
DOI: https://doi.org/10.1007/978-3-030-85383-9_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85382-2
Online ISBN: 978-3-030-85383-9
eBook Packages: Intelligent Technologies and Robotics (R0)