Abstract
This chapter discusses in detail the process of preparing speech data for machine learning. Examples of speech analytics methods applied to phonemes and allophones are shown. An approach to automatic phoneme recognition is then presented, involving optimized parametrization and a machine learning classifier. Feature vectors are built from descriptors originating in the music information retrieval (MIR) domain. Phoneme classification is then extended beyond the typically used techniques towards Deep Neural Networks (DNNs): Convolutional Neural Networks (CNNs) are combined with audio data converted to the time-frequency domain (i.e. spectrograms) and exported as images, so that a two-dimensional representation of the speech feature space is employed. When preparing the phoneme dataset for the CNNs, zero-padding and interpolation techniques are used. The obtained results show an improvement in classification accuracy for allophones of the phoneme /l/ when CNNs coupled with the spectrogram representation are employed. Conversely, for vowel classification, the results are better for the approach based on pre-selected features and a conventional machine learning algorithm.
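The chapter itself contains no code, but the preprocessing step described in the abstract, turning variable-length phoneme segments into fixed-size spectrogram "images" for a CNN via zero padding and interpolation, can be sketched as follows. All names, sizes, and STFT parameters here are illustrative assumptions, not the authors' actual settings.

```python
import numpy as np
from scipy.signal import spectrogram
from scipy.ndimage import zoom

def phoneme_to_spectrogram_image(samples, sr=16000, target_shape=(64, 64)):
    """Convert a variable-length phoneme segment into a fixed-size
    spectrogram array suitable as CNN input.

    Short segments are zero-padded in time before the STFT; the
    resulting magnitude spectrogram is then interpolated to the
    target shape. All sizes are illustrative choices."""
    # Zero-pad very short phoneme segments so the STFT has enough frames.
    min_len = 1024
    if len(samples) < min_len:
        samples = np.pad(samples, (0, min_len - len(samples)))

    # Magnitude spectrogram: the two-dimensional time-frequency representation.
    f, t, sxx = spectrogram(samples, fs=sr, nperseg=256, noverlap=128)

    # Log scaling compresses the dynamic range, as is common for CNN inputs.
    sxx = np.log1p(sxx)

    # Interpolate the 2-D array to a fixed "image" size.
    factors = (target_shape[0] / sxx.shape[0], target_shape[1] / sxx.shape[1])
    return zoom(sxx, factors)

# Example: a synthetic 0.05 s segment becomes a 64x64 array.
img = phoneme_to_spectrogram_image(np.random.randn(800))
print(img.shape)  # (64, 64)
```

In a pipeline like the one the abstract describes, each such fixed-size array (or an image exported from it) would then be fed to a CNN classifier, e.g. one built with Keras on TensorFlow.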
Acknowledgements
Research partially sponsored by the Polish National Science Centre, Dec. No. 2015/17/B/ST6/01874. This work has also been partially supported by Statutory Funds of Electronics, Telecommunications and Informatics Faculty, Gdansk University of Technology.
Copyright information
© 2019 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Korvel, G., Kurowski, A., Kostek, B., Czyzewski, A. (2019). Speech Analytics Based on Machine Learning. In: Tsihrintzis, G., Sotiropoulos, D., Jain, L. (eds) Machine Learning Paradigms. Intelligent Systems Reference Library, vol. 149. Springer, Cham. https://doi.org/10.1007/978-3-319-94030-4_6
DOI: https://doi.org/10.1007/978-3-319-94030-4_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94029-8
Online ISBN: 978-3-319-94030-4
eBook Packages: Intelligent Technologies and Robotics (R0)