Abstract
Speaker Recognition (SR) rests on the assumption that every person has a unique voice, which can serve as an identity instead of, or in addition to, other identifiers such as fingerprint, face, or iris. Although neural networks were first applied to SR long ago, recent advances in computing hardware, new deep learning (DL) architectures and training methods, and access to large amounts of training data have encouraged the research community to adopt DL in SR, as in a wide variety of other signal processing applications. This chapter first briefly reviews the principal traditional techniques in SR and addresses the signal processing aspects of these techniques that DL can potentially improve. It then introduces the most successful recent DL architectures used in SR and includes some illustrative experiments from the authors.
Acknowledgements
This work is partially supported by the Spanish project DeepVoice under grant number TEC2015-69266-P.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Ghahabi, O., Safari, P., Hernando, J. (2020). Deep Learning in Speaker Recognition. In: Pedrycz, W., Chen, SM. (eds) Development and Analysis of Deep Learning Architectures. Studies in Computational Intelligence, vol 867. Springer, Cham. https://doi.org/10.1007/978-3-030-31764-5_6
DOI: https://doi.org/10.1007/978-3-030-31764-5_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-31763-8
Online ISBN: 978-3-030-31764-5
eBook Packages: Intelligent Technologies and Robotics (R0)