
1 Introduction

Continuing the work we presented in [1], our focus is on the implementation of a speaker recognition system. We propose to use such a system as a gateway for security access control, or as an authorization service for phone, voice mail, or voice access services. We continue our work on speaker recognition, with more focus on speaker identification by analyzing voice samples.

Previously we faced a challenge with our dataset size, as it was too small to produce significant results. Recently, with the completion of the LIEPA project [2], a substantial set of speaker data became available for deep learning and analysis. This database contains approximately 100 h of samples from more than 370 native Lithuanian speakers.

This paper presents the results of a speaker recognition system for speaker identification, using acoustic modeling with Hidden Markov Models (HMM) as well as acoustic modeling with Deep Neural Network (DNN) techniques.

1.1 Previous Work

In our previous paper [1] we presented the results of our experiments: a proof of concept for a speaker recognition system that could be used for user authentication.

We also noted that the performance of the identification module should be tested on a larger dataset. During the experiments, we observed that the best identification accuracy was achieved on voice signals without noise.

In [1] we concluded that, in order to increase accuracy, we need to split the user's speech stream into smaller windows. We also considered experimenting with speech recognizers based on different speech features (LPC, MFCC, etc.) and different machine learning techniques.

Speaker Dataset.

For this identification task, the LIEPA project [2] speaker dataset was used. This dataset includes 376 unique speakers and provides around 100 h of spoken sentences and words. The original recordings are in .wav format, with a 22 kHz sampling rate, 16-bit quantization, and a single channel [2].
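The stated format can be verified for any sample with Python's standard wave module; a minimal sketch, with a placeholder file name:

import wave

# Placeholder path; point this at any recording from the LIEPA dataset
with wave.open("liepa_sample.wav", "rb") as wav:
    print(wav.getframerate())  # expected: 22050 (22 kHz sampling rate)
    print(wav.getsampwidth())  # expected: 2 bytes (16-bit quantization)
    print(wav.getnchannels())  # expected: 1 (single channel)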

Validation Data and Test Data.

The original data subset was split randomly into 70% of the samples for training and 30% for testing the created model.
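Such a random 70/30 split can be produced, for example, with scikit-learn's train_test_split; the sketch below uses placeholder data and is not the exact script from our experiments:

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples of 39-dimensional feature vectors
X = np.random.rand(100, 39)
y = np.random.randint(0, 9, size=100)  # speaker labels

# Random split: 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)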

2 Speaker Features

Feature Extraction.

Mel-frequency cepstral coefficients (MFCC) were used as the extracted features. This choice was made because of the robustness of MFCC features for speaker recognition [3].

All samples were split using a 20 ms window function, with the help of the HTK toolkit software [4]. For each windowed sample, 39 features in total were extracted: 13 MFCCs, 13 delta, and 13 delta-delta coefficients.

The following parameters were set in the HTK configuration file.

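A minimal configuration consistent with the settings above (20 ms window, 13 MFCCs plus delta and delta-delta coefficients) is sketched below; values other than the window size and target kind are common HTK defaults, not values confirmed in the text:

# HTK feature extraction configuration (sketch)
SOURCEFORMAT = WAV          # 22 kHz, 16-bit, mono input
TARGETKIND   = MFCC_0_D_A   # 13 cepstra (incl. C0) + deltas + delta-deltas = 39
WINDOWSIZE   = 200000.0     # 20 ms analysis window (units of 100 ns)
TARGETRATE   = 100000.0     # 10 ms frame shift (assumed, not stated in the text)
NUMCEPS      = 12           # 12 cepstral coefficients + C0
USEHAMMING   = T            # Hamming window
PREEMCOEF    = 0.97         # pre-emphasis coefficient
NUMCHANS     = 26           # mel filterbank channels
CEPLIFTER    = 22           # cepstral liftering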

To execute the feature extraction, we used the HCopy executable.

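A typical HCopy invocation for batch feature extraction is sketched below; the file names are illustrative:

# config.mfcc: the configuration file above
# codetr.scp:  a script file listing "input.wav output.mfc" pairs
HCopy -T 1 -C config.mfcc -S codetr.scp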

In this way we created a speaker feature set that can be processed further to create a speaker acoustic model.

3 Speaker Acoustic Model

An acoustic model for each speaker was created using two methods. Using Hidden Markov Models (HMM) [5], we experimented with various numbers of hidden states until we achieved the best recognition accuracy. For the second experiment, we created deep neural networks with differing architectures [6, 7].

3.1 HMM Model Creation

To train the HMM model, the following command was executed using the HTK toolkit.

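In HTK, HMM parameters are commonly re-estimated with the HERest tool; the sketch below shows an invocation of this kind, with illustrative file names rather than our exact setup:

# config.mfcc: feature configuration, train.scp: list of training feature files
# labels.mlf:  label file, hmmlist: list of model names (one model per speaker)
HERest -C config.mfcc -I labels.mlf -S train.scp \
       -H hmm0/macros -H hmm0/hmmdefs -M hmm1 hmmlist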

3.2 Neural Network

We experimented with a number of different configurations in order to create the best performing neural network architecture.

The network input layer was a vector of 999 × 39 dimensions, while the output layer had a number of nodes equal to the number of unique speakers. Different architectures were evaluated to choose the number of hidden layers; this was achieved by increasing the number of nodes as well as varying the depth of the network. The hidden layers consisted of a recurrent neural network implementation of Long Short-Term Memory (LSTM) cells [8, 9].

To create and train the network we used the Python Keras [10] module. One of the architectures that we used can be examined in the code example below.

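A minimal Keras sketch of such an architecture follows: a 999 × 39 input, stacked LSTM hidden layers, and a softmax output with one node per unique speaker. The layer width and optimizer are illustrative assumptions:

from keras.models import Sequential
from keras.layers import LSTM, Dense

NUM_SPEAKERS = 66  # output nodes: one per unique speaker in the subset

model = Sequential()
# Input: sequences of 999 frames, each a 39-dimensional MFCC feature vector
model.add(LSTM(256, return_sequences=True, input_shape=(999, 39)))
model.add(LSTM(256))                                  # second LSTM hidden layer
model.add(Dense(NUM_SPEAKERS, activation='softmax'))  # speaker posterior output

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()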

4 Test Results

We tested and compared the accuracy of the speaker models in two phases. The first phase was conducted on a pilot dataset, which contained examples from 9 unique speakers, with a total of 540 samples. In the second testing phase, we took a subset of the LIEPA dataset with 66 unique speakers and a total of 4691 samples.

For the DNN results shown in Table 1, the input was a 25 × 39 dimensional vector (Table 2).

Table 1. Pilot dataset accuracy results for 9 speaker voice examples.
Table 2. Accuracy results with HMM model for experiment running on 66 speaker subset from LIEPA database.

Training of the DNN model was stopped at 75 to 150 epochs, depending on the loss value, which had to be below 0.05 (Table 3, Figs. 1, 2, 3, and 4).
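Such a stopping rule can be implemented as a small custom Keras callback; in the sketch below, only the 0.05 threshold is taken from our setup:

from keras.callbacks import Callback

class LossThresholdStopping(Callback):
    """Stop training once the training loss drops below a threshold."""
    def __init__(self, threshold=0.05):
        super(LossThresholdStopping, self).__init__()
        self.threshold = threshold

    def on_epoch_end(self, epoch, logs=None):
        loss = (logs or {}).get('loss')
        if loss is not None and loss < self.threshold:
            self.model.stop_training = True

# Usage: model.fit(X_train, y_train, epochs=150,
#                  callbacks=[LossThresholdStopping(0.05)])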

Table 3. Accuracy results with DNN model for experiment running on 66 speaker subset from LIEPA database.
Fig. 1. Accuracy convergence through training epochs on 1st DNN Test sq.

Fig. 2. Loss convergence through training epochs on 1st DNN Test sq.

Fig. 3. Accuracy convergence through training epochs on 6th DNN Test sq.

Fig. 4. Loss convergence through training epochs on 6th DNN Test sq.

In the figures above, we can observe the network convergence for the DNN tests, where the blue line shows the training set metrics, while the orange line shows the validation set metrics.

5 Conclusions and Further Work

In this paper we have shown that, with the use of deep neural networks such as LSTM, it is possible to achieve high speaker identification accuracy, which in our tests reached above 96%. This is slightly higher than the speaker acoustic model created with hidden Markov models, which in our tests achieved 95% identification accuracy.

As these are positive results, we are encouraged to experiment further and improve the accuracy of this speaker identification system. For further work, we also plan to examine other LSTM network configurations, by adding additional depth and width to the network, as well as extending the training time to allow better network convergence.