Abstract
Deep learning methods have largely replaced i-vectors in applications such as speaker recognition, diarization, segmentation, and verification. CNNs fed directly with raw speech samples have produced promising results. Instead of relying on conventional hand-crafted features, such CNNs learn low-level speech representations from waveforms, allowing the network to capture crucial narrow-band speaker characteristics such as pitch and formants; these narrow-band features are also less susceptible to surrounding noise. The design of the neural network is critical to achieving this. This study proposes the use of SincNet, a novel CNN architecture that encourages the first convolutional layer to discover meaningful filters. SincNet is built on parameterized sinc functions that implement band-pass filters. Whereas a standard CNN learns every element of each filter, SincNet learns only the low and high cutoff frequencies of each filter directly from the raw input data. This provides a highly compact and efficient way to derive a custom filter bank precisely tuned to the application at hand. Our experiments on the speaker verification task show that the proposed architecture achieves higher speed and better performance than a standard CNN.
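To make the idea concrete, the following is a minimal sketch of a SincNet-style first convolutional layer. Each band-pass filter is the difference of two low-pass sinc functions, g[n] = 2·f2·sinc(2π·f2·n/fs) − 2·f1·sinc(2π·f1·n/fs), multiplied by a window, so only the cutoffs f1 and f2 are trainable. The sketch assumes PyTorch; the channel count, kernel length, initialization, and normalization below are illustrative choices, not the exact configuration used in this paper.

```python
# Minimal sketch of a SincNet-style first convolutional layer (PyTorch).
# Only the low cutoff and bandwidth of each band-pass filter are learned.
# Sizes and initialization are illustrative assumptions; the full SincNet
# also constrains cutoffs (minimum bandwidth, Nyquist limit), omitted here.
import torch
import torch.nn as nn


class SincConv(nn.Module):
    def __init__(self, out_channels=80, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.sample_rate = sample_rate
        # Learnable low cutoff and bandwidth (in Hz) per filter,
        # initialized as a rough spread across the usable spectrum.
        low = torch.linspace(30.0, sample_rate / 2 - 200.0, out_channels)
        self.low_hz = nn.Parameter(low.unsqueeze(1))
        self.band_hz = nn.Parameter(torch.full((out_channels, 1), 100.0))
        # Fixed pieces: symmetric time axis (samples) and a Hamming window.
        n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1).float()
        self.register_buffer("n", n)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):  # x: (batch, 1, samples)
        f1 = torch.abs(self.low_hz)            # low cutoff >= 0
        f2 = f1 + torch.abs(self.band_hz)      # high cutoff >= low cutoff
        t = self.n / self.sample_rate          # time axis in seconds

        def lowpass(f):  # ideal low-pass impulse response at cutoff f
            return 2 * f * torch.sinc(2 * f * t)

        # Band-pass = difference of two low-pass filters, then windowed.
        filters = (lowpass(f2) - lowpass(f1)) * self.window
        filters = filters / filters.abs().sum(dim=1, keepdim=True)
        return nn.functional.conv1d(x, filters.unsqueeze(1))
```

For example, `SincConv()(torch.randn(4, 1, 16000))` yields a (4, 80, 15750) tensor of band-pass responses. Each filter costs two trainable parameters rather than one per tap, which is where the compactness claim in the abstract comes from.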
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Sanshi, P., Kuruvalli, L.R., Chikkamath, S., Nirmala, R.S. (2023). Deep Learning Framework for Speaker Verification Under Multi Sensor, Multi Lingual and Multi Session Conditions. In: Choudrie, J., Mahalle, P.N., Perumal, T., Joshi, A. (eds) ICT for Intelligent Systems. ICTIS 2023. Smart Innovation, Systems and Technologies, vol 361. Springer, Singapore. https://doi.org/10.1007/978-981-99-3982-4_17
Print ISBN: 978-981-99-4039-4
Online ISBN: 978-981-99-3982-4
eBook Packages: Intelligent Technologies and Robotics (R0)