1 Introduction

Speech synthesis has firmly entered our lives. A striking example is the 2022 presidential election in South Korea, where the campaign team created a digital avatar of Yoon Seok Yeol that competed in the presidential race alongside its living prototype (Footnote 1).

Speech synthesis (also called deepfake voice technology, voice cloning, or synthetic voice) is the artificial simulation of human speech. The counterpart of voice recognition, speech synthesis is mainly used to convert text into audio so that a person can interact naturally with digital devices. Another use case is creating a clone of a person's voice. Deepfake technology has advanced to the point where it can reproduce a human voice with great accuracy in tone and likeness. Modern systems can clone a voice from only about 5 s of audio, although shorter recordings naturally degrade the quality of the output voice. Cloning a voice convincingly in real time therefore remains difficult, but given recordings on the order of an hour, the resulting voice is hard to distinguish from the real one. For example, Microsoft researchers announced a text-to-speech AI model called VALL-E that can closely simulate a person's voice from a three-second audio sample. Once it learns a specific voice, VALL-E can synthesize audio of that person saying anything, and do so in a way that attempts to preserve the speaker's emotional tone (Footnote 2). Attackers can use such technology to fool voice authentication systems, create fake audio recordings to defame public figures, or combine a voice clone with social engineering techniques to deceive people.

Today there are many automatic speaker verification (ASV) systems, used in a variety of intelligent personal assistants and other devices (Sber Salut, Yandex Alice, Amazon Alexa, Google Home). Voice biometrics has thus become a common way to authenticate users. The main problem with ASV systems, however, is that they can be bypassed with voice-spoofing attacks.

Conceptually, there are two different classes of voice-spoofing attacks:

  • Physical access (PA)

  • Logical access (LA)

In PA attacks, bona fide utterances are made in a real, physical space; spoofing attacks are captured in that same space and then replayed within it using replay devices of varying quality. In LA attacks, bona fide and spoofed utterances generated using text-to-speech (TTS) synthesis and voice conversion (VC) algorithms are communicated across telephony and VoIP networks with various coding and transmission effects [24].

To illustrate, consider the two scenarios in Fig. 1, in which an attacker exploits the vulnerability of ASV to spoofing attacks and gains access to the user's bank account. The top path is a replay scenario (i.e., PA), where the attacker uses a hidden device to record the voice command of the authentic speaker and then plays the recording back to the ASV system to make a financial transaction. The bottom path is a voice cloning scenario (i.e., LA), in which the attacker artificially generates a synthetic voice imitating the bona fide speaker from text or voice samples using sophisticated cloning algorithms.

Fig. 1 Spoofing attack scenarios

The ASVspoof 2021 challenge includes three countermeasure development tasks to protect ASV systems from spoofing attacks [24]: PA, LA, and speech deepfake (DF), a fake audio detection task comprising bona fide and spoofed utterances generated using TTS and VC algorithms. The DF task is similar to the LA task but does not involve speaker verification. It is worth noting that the DF task was introduced only in 2021. Voice-faking algorithms are constantly being improved, so there is no universal method for detecting a synthesized voice. The accuracy of existing detection systems is also suboptimal and depends significantly on the training dataset. Given these facts, the detection of synthesized speech remains an urgent information security problem.

The main contributions of our paper are as follows:

  • We propose a new dataset containing bona fide and synthesized Russian speech.

  • We compare the effectiveness of cepstral coefficients widely used in voice spoofing countermeasures.

  • We use two types of neural network as classifiers for determining the authenticity of a voice: a convolutional neural network (CNN) and a long short-term memory (LSTM) network.

The rest of the paper is structured as follows. Section 2 provides an overview of existing approaches to counter voice spoofing. Section 3 presents the Russian speech dataset and neural network architecture. Section 4 presents the details of the experiments carried out, and Sect. 5 presents the conclusions.

2 Related work

This section provides an overview of existing approaches and cepstral coefficients used to counter voice spoofing.

The problem considered in this paper is the subject of many studies searching for a reliable method of detecting a fake voice of a genuine speaker that works equally well for male and female voices, for physical and logical attacks, etc. [2, 9, 18, 23]. Among the proposed solutions, two directions can be distinguished: experiments with the cepstral coefficients used and the creation of various neural network models. These directions refer to different phases of audio signal analysis, but in a complete synthesized speech detection system these phases strongly influence each other. Many studies use the Gaussian Mixture Model (GMM) to combine different coefficients and improve accuracy [7, 15]. In this paper, however, the emphasis is not on improving the accuracy of the final system but on identifying how the features of Russian speech affect the choice of coefficients and the type of neural network.

It is important to note that the vast majority of studies focus on English and do not consider other languages (Russian, etc.). Only a few works deal with the influence of the features of other languages on the quality of speaker and speech recognition [3, 5, 19, 21]; this is probably because most current research is tied to English-language systems and because linguists are rarely involved in the development of such systems. Obviously, the features of a particular language should also affect the detection of speech synthesized in that language. In addition, many studies do not provide the dataset used for training or the source code of the developed models, which makes it almost impossible to reproduce the reported results. In [12] it was suggested that a specialized dataset is needed for the detection of fake Russian speech. Such a dataset is proposed below.

2.1 Cepstral coefficients

The basis for detecting synthesized speech is the analysis of an audio recording for its further verification. The most straightforward representation is the mel-scale spectrogram, which has been widely used for many years and for various purposes, for example, to map character embeddings to mel-scale spectrograms for speech synthesis directly from text [20], or to detect respiratory and cough symptoms of various diseases during the COVID-19 pandemic [4]. Usually, however, acoustic features are presented in the form of cepstral coefficients, which carry a lot of information about the speaker and take into account the shape of the vocal tract. There are many methods for calculating such coefficients, of which the Mel-Frequency Cepstral Coefficients (MFCC) and the Linear-Frequency Cepstral Coefficients (LFCC) are the best known. These two feature types have been well studied, many models have been built on their basis, and the quality of these models has been compared on different input data [13, 26]. MFCC has mainly been used for speech and speaker recognition. However, since the length of the vocal tract has a greater effect on speech in the high-frequency range, LFCC provides some advantages in speaker recognition. Thus, while MFCC and LFCC are complementary, LFCC consistently outperforms MFCC, mainly due to its better performance in female trials. This can be explained by the relatively shorter vocal tract in females and the resulting higher formant frequencies in speech: LFCC better captures the spectral characteristics in the high-frequency region [26].
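As an illustration of how these three representations can be obtained in practice, the following sketch uses torchaudio; the parameter values (numbers of mel bands and coefficients) are illustrative assumptions, not the exact settings used in our experiments.

import torch
import torchaudio

SAMPLE_RATE = 16_000  # the dataset described in Sect. 3.1 is resampled to 16 kHz

# Three alternative front-ends compared in this paper
mel = torchaudio.transforms.MelSpectrogram(sample_rate=SAMPLE_RATE, n_mels=80)
mfcc = torchaudio.transforms.MFCC(sample_rate=SAMPLE_RATE, n_mfcc=40, melkwargs={"n_mels": 80})
lfcc = torchaudio.transforms.LFCC(sample_rate=SAMPLE_RATE, n_lfcc=40)

def extract_features(path):
    # Load a WAV file and return the three feature representations
    waveform, sr = torchaudio.load(path)
    if sr != SAMPLE_RATE:
        waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
    return {
        "mel_spectrogram": mel(waveform),  # (channels, n_mels, frames)
        "mfcc": mfcc(waveform),            # (channels, n_mfcc, frames)
        "lfcc": lfcc(waveform),            # (channels, n_lfcc, frames)
    }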

Another feature type is the Constant-Q Cepstral Coefficients (CQCC). It is also widely used in countermeasure models and has a number of advantages over MFCC and LFCC [12, 17, 22, 25]. However, since this study focuses on the analysis of Russian speech, it was decided to use the most traditional features and leave CQCC for future research. In [6] a novel audio feature descriptor named Extended Local Ternary Pattern (ELTP) is proposed. It captures the dynamically induced vocal tract attributes of bona fide speech and the algorithmic artifacts of synthetic and converted speech. ELTP features are combined with LFCC features to train a BiLSTM network for classifying bona fide and spoofed signals. In [10] a novel feature representation scheme, Center Lop-Sided Local Binary Patterns (CLS-LBP), is proposed to better capture the characteristics of genuine and spoofed audio samples. CLS-LBP features are used to train an LSTM network for classification; moreover, the proposed scheme is able to identify the cloning algorithm used to synthesize the spoofed samples. In [15] the authors use MFCC and CQCC to extract speech features, GMM to construct a user model, and a CNN as a classifier to determine voice authenticity, obtaining an EER of 4.79% and a minDCF of 0.623.

3 Research methods

This section discusses the proposed Russian speech dataset and neural network model architectures.

3.1 Dataset requirements and formation

To train the model, a dataset containing utterances in Russian is needed. It should contain bona fide and synthesized speech in approximately equal proportions. Various research institutions and individuals have contributed to the field of synthetic speech detection by creating open training and benchmarking datasets [9]. Unfortunately, there is no good publicly available Russian dataset containing synthesized voices. Since creating such a dataset from scratch is a voluminous and routine task, it seems appropriate to build it on top of existing live-speech datasets. Therefore, it is necessary to choose base datasets consisting of real speech, from which the target dataset will then be compiled. This requires Russian speech from various sources. Another important requirement is the presence of a variety of voices, rather than a single text split into parts and read by one speaker.

Let us formulate the basic requirements for the target dataset (a sketch of screening recordings against them is given after the list):

  • the dataset must contain audio recordings from 2 to 12 s long in Russian;

  • the dataset must contain a comparable number of male and female voices;

  • synthesized voices must make up at least 40% of the dataset;

  • the dataset must contain synthesized voices obtained using different synthesis algorithms;

  • sampling frequency must be at least 16 kHz.
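As a rough sketch of how candidate recordings could be screened against the per-file requirements above (duration and sampling frequency; the balance requirements are checked at the dataset level), assuming a hypothetical folder of WAV files:

from pathlib import Path
import torchaudio

MIN_SEC, MAX_SEC = 2.0, 12.0
MIN_SAMPLE_RATE = 16_000

def meets_requirements(path):
    # Keep only clips between 2 and 12 s sampled at 16 kHz or higher
    info = torchaudio.info(str(path))
    duration = info.num_frames / info.sample_rate
    return MIN_SEC <= duration <= MAX_SEC and info.sample_rate >= MIN_SAMPLE_RATE

candidates = sorted(Path("common_voice_ru").rglob("*.wav"))  # hypothetical folder
accepted = [p for p in candidates if meets_requirements(p)]
print(f"{len(accepted)} of {len(candidates)} clips satisfy the requirements")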

Several datasets are available for developing spoofing attack detection models: ASVspoof, Fake-or-Real, WaveFake, Audio Deep Synthesis Detection, and Half-Truth [9]. The most popular is ASVspoof [11]. However, for our study we analyzed datasets that contain specifically Russian speech. The most suitable open datasets are summarized in Table 1.

Table 1 Comparison of Russian open datasets

Based on the table, Common Voice from Mozilla, which contains live human speech, was taken as the basis. This dataset contains about 200 h of verified voice recordings in mp3 format, with a breakdown by the age and gender of the speakers. Other existing datasets mainly contain audio derived from audiobooks, public speaking, and YouTube videos. However, in order not to learn only from "read" text, we need the live speech contained in the selected dataset.

The selected dataset was analyzed, and the required verified audio recordings were extracted from it, downsampled to 16 kHz, and converted to WAV format. This format was chosen because many libraries work with it. Part of this set went into the resulting dataset as genuine speech; another part was passed through synthesis algorithms to produce fake speech.
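A minimal sketch of this preprocessing step, assuming the Common Voice clips are stored as mp3 files in a hypothetical directory layout and that an audio backend capable of decoding mp3 (e.g., ffmpeg) is available:

from pathlib import Path
import torchaudio

TARGET_SR = 16_000

def mp3_to_wav16k(src, dst):
    waveform, sr = torchaudio.load(str(src))  # decode the mp3 clip
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    dst.parent.mkdir(parents=True, exist_ok=True)
    torchaudio.save(str(dst), waveform, TARGET_SR)  # write a 16 kHz WAV file

for mp3 in Path("cv_corpus_ru/clips").glob("*.mp3"):  # hypothetical layout
    mp3_to_wav16k(mp3, Path("dataset/bona_fide") / (mp3.stem + ".wav"))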

The result is a dataset that satisfies the requirements formulated above. It contains 50,000 audio recordings with a total length of more than 70 h, along with additional meta-information about them. This meta-information includes:

  • path to the audio file;

  • the sentence spoken in the recording;

  • audio file length;

  • speaker's gender;

  • speaker's age;

  • whether this voice is synthesized or not (0 if real voice, 1 if fake);

  • synthesis algorithm used.

It was decided to keep this much additional information because the dataset can be used for purposes other than detecting fake speech. In addition, one can collect statistics and identify weaknesses of particular algorithms; for example, they may not work well with female voices or with certain audio lengths.
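A sketch of such an analysis with pandas; the column names below correspond to the fields listed above but are hypothetical, since the actual schema of the metadata file may differ:

import pandas as pd

meta = pd.read_csv("dataset/meta.csv")  # illustrative path

# Share of synthesized speech and gender balance within each class
print(meta["is_fake"].value_counts(normalize=True))
print(meta.groupby("is_fake")["gender"].value_counts(normalize=True))

# Clip-length statistics per synthesis algorithm, e.g. to spot algorithms
# that only produce short clips
print(meta[meta["is_fake"] == 1].groupby("algorithm")["length"].describe())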

To form the target dataset, it is necessary to select algorithms for synthesizing the human voice. The influence of particular algorithms on the accuracy of the final fake voice detection system is an interesting topic that requires separate study. Currently, the most popular algorithms include MelGAN, DEEP-VOICE-3, SV2TTS, and gTTS [12]. On their basis, three synthesis algorithms were used, each contributing an approximately equal share of the fake recordings. It is worth noting that the dataset contains speech obtained both by cloning methods and by conventional synthesis. The resulting dataset was divided into training and testing parts in a 3/2 proportion.
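Of the listed algorithms, gTTS is the simplest to illustrate. The following sketch shows how a synthesized Russian utterance could be produced from the transcript of a bona fide clip; MelGAN, SV2TTS, and DEEP-VOICE-3 require their own model pipelines and are not shown, and the output path is illustrative:

from gtts import gTTS

sentence = "Пример синтезированной русской речи"  # transcript taken from the metadata
gTTS(text=sentence, lang="ru").save("dataset/fake/gtts_000001.mp3")
# The mp3 output is then resampled to 16 kHz and converted to WAV,
# exactly as for the bona fide recordings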

The features of the resulting dataset are presented in Table 2.

Table 2 Dataset features

Another important requirement is that the synthesized voice must not be identifiable indirectly, for example, because such recordings are usually shorter or are usually spoken by a female voice. To prevent this, these subsets must be balanced. We require that the gender difference be within 20% of the total and that the difference in length not exceed 10%. These differences are presented in Table 3.

Table 3 Features of genuine and synthesized voice

As can be seen from the table, the main parameters of the genuine and synthesized voices are approximately the same, which supports more accurate detection of the synthesized voice.

This dataset is an initial version that can already be used to train neural networks; further development is required to improve results. This can proceed along the following lines:

  • adding more audio produced by other synthesis algorithms, which will increase the "coverage" of different algorithms and help improve the accuracy of synthesized speech detection;

  • collecting and adding genuine speech that is not read from a screen, as is the case in the underlying dataset used;

  • adding additional meta-information to collect various statistics; for example, you can add voices with different accents and see how this changes the accuracy of the model.

3.2 Network architecture

In this study two types of neural network were chosen as classifiers for determining the authenticity of a voice: a CNN and an LSTM.

The CNN model is widely used in speech recognition tasks and has been adopted by the authors of a number of studies [15, 16]. The architecture of the CNN model is shown in Fig. 2. The final sequence is passed to a Softmax function, which gives probability scores for the two classes, fake and bona fide. If the probability score of the genuine class is greater than that of the fake class, the Softmax layer predicts the sample as genuine; if the probability score of the fake class is greater, the sample is predicted as fake.

Fig. 2 CNN model architecture

At the output, the developed system gives a single result: whether the given audio recording is a synthesized voice. If so, 1 is output; if not, 0 is output.
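A minimal PyTorch sketch of a CNN classifier of this kind, operating on cepstral feature maps (e.g., LFCC of shape (1, n_coeffs, frames)); the layer sizes are assumptions for illustration and do not reproduce the exact architecture of Fig. 2:

import torch
import torch.nn as nn

class SpoofCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x):
        # x: (batch, 1, n_coeffs, frames) -> class probabilities via Softmax
        logits = self.classifier(self.features(x).flatten(1))
        return torch.softmax(logits, dim=1)  # [P(bona fide), P(fake)]

# Prediction rule from the text: output 1 (fake) if P(fake) > P(bona fide)
probs = SpoofCNN()(torch.randn(8, 1, 40, 300))
prediction = probs.argmax(dim=1)  # 0 = genuine voice, 1 = synthesized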

The second model is a recurrent neural network (RNN) with a multi-layer LSTM architecture, a type of network well suited to processing sequential data.

The architecture of the LSTM model is shown in Fig. 3. We use a five-layer LSTM with hidden_size = 128, the number of features in the hidden state. The input data has dimension (batch_size, seq_length, input_dim) and is passed to the LSTM layer, which returns a hidden state of size (batch_size, hidden_size) and a memory state of size (batch_size, hidden_size).

Fig. 3 LSTM model architecture

The hidden state then passes through a sequence of linear layers with a non-linear ReLU activation function and Dropout regularization with a factor of 0.5 to reduce overfitting. This helps the model generalize better to new, unseen data.

Finally, the output layer is a linear layer that maps the output to two classes via the Softmax function, making the model suitable for classification tasks in which the goal is to predict one of two possible outcomes.
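The following sketch assembles the described components (a five-layer LSTM with hidden_size = 128, linear layers with ReLU and Dropout of 0.5, and a two-class Softmax output); the widths of the intermediate linear layers are assumptions:

import torch
import torch.nn as nn

class SpoofLSTM(nn.Module):
    def __init__(self, input_dim, hidden_size=128, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_size, num_layers=5, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_size, 64), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        # x: (batch_size, seq_length, input_dim), e.g. a sequence of LFCC frames
        _, (h_n, _) = self.lstm(x)           # h_n: (num_layers, batch, hidden_size)
        logits = self.head(h_n[-1])          # hidden state of the last LSTM layer
        return torch.softmax(logits, dim=1)  # [P(bona fide), P(fake)]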

A model with five LSTM layers is considered to be the best option for a variety of reasons. Firstly, models with fewer layers tend to suffer from underfitting, where they are unable to capture the complexity of the data and produce inaccurate predictions. On the other hand, models with too many layers often overfit the data, meaning they memorize the training data too well and struggle to generalize to new, unseen data.

The use of five LSTM layers strikes a balance between underfitting and overfitting. It allows the model to capture the necessary complexity of the data while avoiding the over-reliance on the training data that comes with too many layers.

Thus, as a result of processing the input sequence through the five-layer LSTM model, the model produces an output that can be used to make predictions about the class of the input sequence. By using an LSTM architecture with multiple layers and incorporating dropout regularization, our model is able to effectively capture complex patterns in sequential data and avoid overfitting, resulting in improved performance on classification tasks.

4 Experimental results

This section provides details of the experiments carried out to evaluate the effectiveness of the proposed cepstral coefficients and neural network models.

Two models were developed, one based on a CNN and the other on an LSTM network. Each model was trained on the dataset proposed above [1] with the Adam optimizer. To extract audio features, three types of features were used: the mel spectrogram, MFCC, and LFCC. A total of six experiments were carried out.
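A minimal training-loop sketch matching this setup; the DataLoader, model, and hyperparameters (learning rate, number of epochs) are assumptions for illustration:

import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3, device="cpu"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.NLLLoss()  # the models above output probabilities, so feed log-probabilities
    for epoch in range(epochs):
        for features, labels in loader:  # labels: 0 = genuine, 1 = synthesized
            features, labels = features.to(device), labels.to(device)
            optimizer.zero_grad()
            probs = model(features)
            loss = criterion(torch.log(probs + 1e-8), labels)
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch + 1}: loss {loss.item():.4f}")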

An important issue in assessing the quality of the resulting models is the choice of metrics. We evaluated the effectiveness of the features using the Equal Error Rate (EER), the F1 score, and the Matthews Correlation Coefficient (MCC). The strength of spoofing countermeasures is usually measured using the EER. The plain EER reflects neither application requirements nor the impact of spoofing and countermeasures on ASV, and its use as a primary metric in traditional ASV research has long been abandoned in favor of risk-based evaluation approaches such as the tandem Detection Cost Function (t-DCF) [14]. For example, ASVspoof 2021 uses t-DCF as the primary metric for the LA and PA tasks. For the DF task, however, the primary metric is EER, since the DF task does not include an ASV system and its metric does not require specifying cost and prior parameters [24]. Since the proposed models perform binary classification, the F1 score is a suitable accuracy metric; it combines the precision and recall of a classifier into a single value by taking their harmonic mean. The MCC is a more reliable statistical rate that produces a high score only if the prediction performs well in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally to the sizes of both the positive and negative classes in the dataset [8].
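For reference, the three metrics can be computed from the model scores as follows; F1 and MCC come directly from scikit-learn, and EER is taken as the point on the ROC curve where the false positive and false negative rates coincide:

import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef, roc_curve

def equal_error_rate(labels, scores):
    # scores are the predicted probabilities of the "fake" class
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # point where FPR is closest to FNR
    return float((fpr[idx] + fnr[idx]) / 2)

def report(labels, scores, threshold=0.5):
    preds = (scores >= threshold).astype(int)
    return {
        "EER": equal_error_rate(labels, scores),
        "F1": f1_score(labels, preds),
        "MCC": matthews_corrcoef(labels, preds),
    }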

The results obtained are presented in Table 4. The EER for the mel spectrogram is much larger than for LFCC and MFCC, and the EER for LFCC is about half that for MFCC. The F1 score shows a result similar to MCC, since the dataset is class-balanced. In general, for the DF task the CNN model outperformed the LSTM model. This can be explained by the fact that detecting synthesized speech is in essence a binary classification problem, in which the LSTM model cannot exploit its strength in processing long sequential data.

Table 4 Experimental results

5 Conclusion

This paper compared the effectiveness of cepstral coefficients widely used in voice spoofing countermeasures for detecting deepfakes in Russian speech. We proposed a new dataset containing both bona fide and synthesized Russian speech and used it to train two types of neural networks. We did not find an influence of the Russian language on the behavior of the cepstral coefficients, and the results obtained are consistent with those of other researchers.

Our future work is aimed at expanding the Russian speech dataset, as well as exploring other cepstral coefficients and their mixtures.