1 Introduction

Speech synthesis has firmly entered our lives. A striking example is the 2022 presidential election in South Korea, where the campaign team created a digital avatar of Yoon Seok Yeol that competed in the presidential race alongside its living prototype (Footnote 1).

Speech synthesis (also called deepfake voice technology, voice cloning, or synthetic voice) is the artificial simulation of human speech. The counterpart of voice recognition, speech synthesis is mainly used to convert text into audio so that a person can interact naturally with digital devices. Another use case is creating a clone of a person's voice. Deepfake technology has advanced to the point where it can reproduce a human voice with great accuracy in tone and likeness. Modern systems can clone a voice from only about 5 s of audio, although shorter recordings naturally degrade the quality of the output voice. Cloning a voice convincingly in real time therefore remains difficult, but given recordings on the order of an hour, the resulting voice is hard to distinguish from the real one. For example, Microsoft researchers announced a text-to-speech AI model called VALL-E that can closely simulate a person's voice from a three-second audio sample. Once it learns a specific voice, VALL-E can synthesize audio of that person saying anything, and do so in a way that attempts to preserve the speaker's emotional tone (Footnote 2). Attackers can use such technology to fool voice authentication systems, create fake audio recordings to defame public figures, or combine a voice clone with social engineering techniques to deceive people.

Today there are many automatic speaker verification (ASV) systems, used in a variety of intelligent personal assistants and other devices (Sber Salut, Yandex Alice, Amazon Alexa, Google Home). Voice biometrics has thus become a common way to authenticate users. The main problem with ASV systems, however, is that they can be bypassed with voice-spoofing attacks.

Conceptually, there are two different classes of voice-spoofing attacks:

  • Physical access (PA)

  • Logical access (LA)

In PA attacks, bona fide utterances are made in a real, physical space; spoofing attacks are captured in that same space and then replayed within it using replay devices of varying quality. In LA attacks, bona fide and spoofed utterances generated using text-to-speech (TTS) synthesis and voice conversion (VC) algorithms are communicated across telephony and VoIP networks with various coding and transmission effects [24].

To illustrate, consider the two scenarios in Fig. 1, in which an attacker exploits the vulnerability of ASV to spoofing attacks and gains access to the user's bank account. The top path is a replay scenario (i.e., PA), where the attacker uses a hidden device to record the voice command of the authentic speaker and then plays the recording back to the ASV system to make a financial transaction. The bottom path is a voice cloning scenario (i.e., LA), in which the attacker artificially generates a synthetic voice imitating the bona fide speaker from text or voice samples using sophisticated cloning algorithms.

Fig. 1 Spoofing attack scenarios

The ASVspoof 2021 challenge includes three countermeasure development tasks to protect ASV systems from spoofing attacks [24]: PA, LA, and speech deepfake (DF), a fake audio detection task comprising bona fide and spoofed utterances generated using TTS and VC algorithms. The DF task is similar to the LA task but does not involve speaker verification. It is worth noting that the DF task was introduced only in 2021. Voice-faking algorithms are constantly being improved, so there is no universal method for detecting a synthesized voice. The accuracy of existing detection systems is also suboptimal and depends significantly on the training dataset. Given these facts, the detection of synthesized speech remains an urgent information security problem.

The main contributions of our paper are as follows:

  • We propose a new dataset containing bona fide and synthesized Russian speech.

  • We compare the effectiveness of cepstral coefficients widely used in voice spoofing countermeasures.

  • We use two types of neural network as classifiers for determining the authenticity of a voice: a convolutional neural network (CNN) and a long short-term memory (LSTM) network.

The rest of the paper is structured as follows. Section 2 provides an overview of existing approaches to counter voice spoofing. Section 3 presents the Russian speech dataset and neural network architecture. Section 4 presents the details of the experiments carried out, and Sect. 5 presents the conclusions.

2 Related work

This section provides an overview of existing approaches and cepstral coefficients used to counter voice spoofing.

The problem considered in this paper is the subject of many studies searching for a reliable method of detecting a fake voice of a genuine speaker that works equally well for male and female voices, for physical and logical attacks, etc. [2, 9, 18, 23]. Among the proposed solutions, two directions can be distinguished: experiments with the cepstral coefficients used and the creation of various neural network models. These directions refer to different phases of audio signal analysis, but in a complete synthesized speech detection system these phases strongly influence each other. Many studies use the Gaussian Mixture Model (GMM) to combine different coefficients and improve accuracy [7, 15]. In this paper, however, the emphasis is not on improving the accuracy of the final system but on identifying how the features of Russian speech affect the choice of coefficients and the type of neural network.

It is important to note that the vast majority of studies focus on English and do not consider other languages (Russian, etc.). Only a few works deal with the influence of the features of other languages on the quality of speaker and speech recognition [3, 5, 19, 21]; this is probably because most current research is tied to English-language systems and because linguists are rarely involved in the development of such systems. Obviously, the features of a particular language should also affect the detection of speech synthesized in that language. In addition, many studies do not provide the dataset used for training or the source code of the developed models, which makes it almost impossible to reproduce the reported results. In [12] it was suggested that a specialized dataset is needed for the detection of fake Russian speech. Such a dataset is proposed below.

2.1 Cepstral coefficients

The basis for detecting synthesized speech is the analysis of an audio recording for its further verification. The most straightforward representation is the mel-scale spectrogram, which has been widely used for many years and for various purposes, for example, to map character embeddings to mel-scale spectrograms for speech synthesis directly from text [20], or to detect respiratory and cough symptoms of various diseases during the COVID-19 pandemic [4]. Usually, however, acoustic features are presented in the form of cepstral coefficients, which carry a lot of information about the speaker and take into account the shape of the vocal tract. There are many methods for calculating such coefficients, of which the Mel-Frequency Cepstral Coefficients (MFCC) and the Linear-Frequency Cepstral Coefficients (LFCC) are the best known. These two feature types have been well studied, many models have been built on their basis, and the quality of these models has been compared on different input data [13, 26]. MFCC has mainly been used for speech and speaker recognition. However, since the length of the vocal tract has a greater effect on speech in the high-frequency range, LFCC provides some advantages in speaker recognition. Thus, while MFCC and LFCC are complementary, LFCC consistently outperforms MFCC, mainly due to its better performance in female trials. This can be explained by the relatively shorter vocal tract in females and the resulting higher formant frequencies in speech: LFCC better captures the spectral characteristics in the high-frequency region [26].
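As an illustration of how these three representations can be obtained in practice, the following sketch uses torchaudio; the parameter values (numbers of mel bands and coefficients) are illustrative assumptions, not the exact settings used in our experiments.

import torch
import torchaudio

SAMPLE_RATE = 16_000  # the dataset described in Sect. 3.1 is resampled to 16 kHz

# Three alternative front-ends compared in this paper
mel = torchaudio.transforms.MelSpectrogram(sample_rate=SAMPLE_RATE, n_mels=80)
mfcc = torchaudio.transforms.MFCC(sample_rate=SAMPLE_RATE, n_mfcc=40, melkwargs={"n_mels": 80})
lfcc = torchaudio.transforms.LFCC(sample_rate=SAMPLE_RATE, n_lfcc=40)

def extract_features(path):
    # Load a WAV file and return the three feature representations
    waveform, sr = torchaudio.load(path)
    if sr != SAMPLE_RATE:
        waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
    return {
        "mel_spectrogram": mel(waveform),  # (channels, n_mels, frames)
        "mfcc": mfcc(waveform),            # (channels, n_mfcc, frames)
        "lfcc": lfcc(waveform),            # (channels, n_lfcc, frames)
    }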

Another feature type is the Constant-Q Cepstral Coefficients (CQCC). It is also widely used in countermeasure models and has a number of advantages over MFCC and LFCC [12, 17, 22, 25]. However, since this study focuses on the analysis of Russian speech, it was decided to use the most traditional features and leave CQCC for future research. In [6] a novel audio feature descriptor named Extended Local Ternary Pattern (ELTP) is proposed. It captures the dynamically induced vocal tract attributes of bona fide speech and the algorithmic artifacts of synthetic and converted speech. ELTP features are combined with LFCC features to train a BiLSTM network for classifying bona fide and spoofed signals. In [10] a novel feature representation scheme, Center Lop-Sided Local Binary Patterns (CLS-LBP), is proposed to better capture the characteristics of genuine and spoofed audio samples. CLS-LBP features are used to train an LSTM network for classification; moreover, the proposed scheme is able to identify the cloning algorithm used to synthesize the spoofed samples. In [15] the authors use MFCC and CQCC to extract speech features, GMM to construct a user model, and a CNN as a classifier to determine voice authenticity, obtaining an EER of 4.79% and a minDCF of 0.623.

3 Research methods

This section discusses the proposed Russian speech dataset and neural network model architectures.

3.1 Dataset requirements and formation

To train the model, a dataset containing utterances in Russian is needed. It should contain bona fide and synthesized speech in approximately equal proportions. Various research institutions and individuals have contributed to the field of synthetic speech detection by creating open training and benchmarking datasets [9]. Unfortunately, there is no good publicly available Russian dataset containing synthesized voices. Since creating such a dataset from scratch is a voluminous and routine task, it seems appropriate to build it on top of existing live-speech datasets. Therefore, it is necessary to choose base datasets consisting of real speech, from which the target dataset will then be compiled. This requires Russian speech from various sources. Another important requirement is the presence of a variety of voices, rather than a single text split into parts and read by one speaker.

Let us formulate the basic requirements for the target dataset (a sketch of screening recordings against them is given after the list):

  • the dataset must contain audio recordings from 2 to 12 s long in Russian;

  • the dataset must contain a comparable number of male and female voices;

  • synthesized voices must make up at least 40% of the dataset;

  • the dataset must contain synthesized voices obtained using different synthesis algorithms;

  • sampling frequency must be at least 16 kHz.
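As a rough sketch of how candidate recordings could be screened against the per-file requirements above (duration and sampling frequency; the balance requirements are checked at the dataset level), assuming a hypothetical folder of WAV files:

from pathlib import Path
import torchaudio

MIN_SEC, MAX_SEC = 2.0, 12.0
MIN_SAMPLE_RATE = 16_000

def meets_requirements(path):
    # Keep only clips between 2 and 12 s sampled at 16 kHz or higher
    info = torchaudio.info(str(path))
    duration = info.num_frames / info.sample_rate
    return MIN_SEC <= duration <= MAX_SEC and info.sample_rate >= MIN_SAMPLE_RATE

candidates = sorted(Path("common_voice_ru").rglob("*.wav"))  # hypothetical folder
accepted = [p for p in candidates if meets_requirements(p)]
print(f"{len(accepted)} of {len(candidates)} clips satisfy the requirements")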

Several datasets are available for developing spoofing attack detection models: ASVspoof, Fake-or-Real, WaveFake, Audio Deep Synthesis Detection, and Half-Truth [9]. The most popular is ASVspoof [11]. However, for our study we analyzed datasets that contain specifically Russian speech. The most suitable open datasets are summarized in Table 1.

Table 1 Comparison of Russian open datasets

Based on the table, Common Voice from Mozilla, which contains live human speech, was taken as the basis. This dataset contains about 200 h of verified voice recordings in mp3 format, with a breakdown by the age and gender of the speakers. Other existing datasets mainly contain audio derived from audiobooks, public speaking, and YouTube videos. However, in order not to learn only from "read" text, we need the live speech contained in the selected dataset.

The selected dataset was analyzed, and the required verified audio recordings were extracted from it, downsampled to 16 kHz, and converted to WAV format. This format was chosen because many libraries work with it. Part of this set went into the resulting dataset as genuine speech; another part was passed through synthesis algorithms to produce fake speech.
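A minimal sketch of this preprocessing step, assuming the Common Voice clips are stored as mp3 files in a hypothetical directory layout and that an audio backend capable of decoding mp3 (e.g., ffmpeg) is available:

from pathlib import Path
import torchaudio

TARGET_SR = 16_000

def mp3_to_wav16k(src, dst):
    waveform, sr = torchaudio.load(str(src))  # decode the mp3 clip
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    dst.parent.mkdir(parents=True, exist_ok=True)
    torchaudio.save(str(dst), waveform, TARGET_SR)  # write a 16 kHz WAV file

for mp3 in Path("cv_corpus_ru/clips").glob("*.mp3"):  # hypothetical layout
    mp3_to_wav16k(mp3, Path("dataset/bona_fide") / (mp3.stem + ".wav"))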

The result is a dataset that satisfies the requirements formulated above. It contains 50,000 audio recordings with a total length of more than 70 h, along with additional meta-information about them. This meta-information includes:

  • path to the audio file;

  • the sentence spoken in the recording;

  • audio file length;

  • speaker's gender;

  • speaker's age;

  • whether this voice is synthesized or not (0 if real voice, 1 if fake);

  • synthesis algorithm used.

It was decided to keep this much additional information because the dataset can be used for purposes other than detecting fake speech. In addition, one can collect statistics and identify weaknesses of particular algorithms; for example, they may not work well with female voices or with certain audio lengths.
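A sketch of such an analysis with pandas; the column names below correspond to the fields listed above but are hypothetical, since the actual schema of the metadata file may differ:

import pandas as pd

meta = pd.read_csv("dataset/meta.csv")  # illustrative path

# Share of synthesized speech and gender balance within each class
print(meta["is_fake"].value_counts(normalize=True))
print(meta.groupby("is_fake")["gender"].value_counts(normalize=True))

# Clip-length statistics per synthesis algorithm, e.g. to spot algorithms
# that only produce short clips
print(meta[meta["is_fake"] == 1].groupby("algorithm")["length"].describe())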

To form the target dataset, it is necessary to select algorithms for synthesizing the human voice. The influence of particular algorithms on the accuracy of the final fake voice detection system is an interesting topic that requires separate study. Currently, the most popular algorithms include MelGAN, DEEP-VOICE-3, SV2TTS, and gTTS [12]. On their basis, three synthesis algorithms were used, each contributing an approximately equal share of the fake recordings. It is worth noting that the dataset contains speech obtained both by cloning methods and by conventional synthesis. The resulting dataset was divided into training and testing parts in a 3/2 proportion.
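Of the listed algorithms, gTTS is the simplest to illustrate. The following sketch shows how a synthesized Russian utterance could be produced from the transcript of a bona fide clip; MelGAN, SV2TTS, and DEEP-VOICE-3 require their own model pipelines and are not shown, and the output path is illustrative:

from gtts import gTTS

sentence = "Пример синтезированной русской речи"  # transcript taken from the metadata
gTTS(text=sentence, lang="ru").save("dataset/fake/gtts_000001.mp3")
# The mp3 output is then resampled to 16 kHz and converted to WAV,
# exactly as for the bona fide recordings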

The features of the resulting dataset are presented in Table 2.

Table 2 Dataset features

Another important requirement is that the synthesized voice must not be identifiable indirectly, for example, because such recordings are usually shorter or are usually spoken by a female voice. To prevent this, these subsets must be balanced. We require that the gender difference be within 20% of the total and that the difference in length not exceed 10%. These differences are presented in Table 3.

Table 3 Features of genuine and synthesized voice

As can be seen from the table, the main parameters of the genuine and synthesized voices are approximately the same, which supports more accurate detection of the synthesized voice.

This dataset is an initial version that can already be used to train neural networks; further development is required to improve results. This can proceed along the following lines:

  • adding more audio produced by other synthesis algorithms, which will increase the "coverage" of different algorithms and help improve the accuracy of synthesized speech detection;

  • collecting and adding genuine speech that is not read from a screen, as is the case in the underlying dataset used;

  • adding additional meta-information to collect various statistics; for example, you can add voices with different accents and see how this changes the accuracy of the model.

3.2 Network architecture

In this study two types of neural network were chosen as classifiers for determining the authenticity of a voice: a CNN and an LSTM.

The CNN model is widely used in speech recognition tasks and has been adopted by the authors of a number of studies [15, 16]. The architecture of the CNN model is shown in Fig. 2. The final sequence is passed to a Softmax function, which gives probability scores for the two classes, fake and bona fide. If the probability score of the genuine class is greater than that of the fake class, the Softmax layer predicts the sample as genuine; if the probability score of the fake class is greater, the sample is predicted as fake.

Fig. 2 CNN model architecture

At the output, the developed system gives a single result: whether the given audio recording is a synthesized voice. If so, 1 is output; if not, 0 is output.
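A minimal PyTorch sketch of a CNN classifier of this kind, operating on cepstral feature maps (e.g., LFCC of shape (1, n_coeffs, frames)); the layer sizes are assumptions for illustration and do not reproduce the exact architecture of Fig. 2:

import torch
import torch.nn as nn

class SpoofCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x):
        # x: (batch, 1, n_coeffs, frames) -> class probabilities via Softmax
        logits = self.classifier(self.features(x).flatten(1))
        return torch.softmax(logits, dim=1)  # [P(bona fide), P(fake)]

# Prediction rule from the text: output 1 (fake) if P(fake) > P(bona fide)
probs = SpoofCNN()(torch.randn(8, 1, 40, 300))
prediction = probs.argmax(dim=1)  # 0 = genuine voice, 1 = synthesized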

The second model is a recurrent neural network (RNN) with a multi-layer LSTM architecture, a type of network well suited to processing sequential data.

The architecture of the LSTM model is shown in Fig. 3. We use a five-layer LSTM with hidden_size = 128, the number of features in the hidden state. The input data has dimension (batch_size, seq_length, input_dim) and is passed to the LSTM layer, which returns a hidden state of size (batch_size, hidden_size) and a memory state of size (batch_size, hidden_size).

Fig. 3 LSTM model architecture

The hidden state then passes through a sequence of linear layers with a non-linear ReLU activation function and Dropout regularization with a factor of 0.5 to reduce overfitting. This helps the model generalize better to new, unseen data.

Finally, the output layer is a linear layer that maps the output to two classes via the Softmax function, making the model suitable for classification tasks in which the goal is to predict one of two possible outcomes.
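The following sketch assembles the described components (a five-layer LSTM with hidden_size = 128, linear layers with ReLU and Dropout of 0.5, and a two-class Softmax output); the widths of the intermediate linear layers are assumptions:

import torch
import torch.nn as nn

class SpoofLSTM(nn.Module):
    def __init__(self, input_dim, hidden_size=128, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_size, num_layers=5, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_size, 64), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        # x: (batch_size, seq_length, input_dim), e.g. a sequence of LFCC frames
        _, (h_n, _) = self.lstm(x)           # h_n: (num_layers, batch, hidden_size)
        logits = self.head(h_n[-1])          # hidden state of the last LSTM layer
        return torch.softmax(logits, dim=1)  # [P(bona fide), P(fake)]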

A model with five LSTM layers is considered to be the best option for a variety of reasons. Firstly, models with fewer layers tend to suffer from underfitting, where they are unable to capture the complexity of the data and produce inaccurate predictions. On the other hand, models with too many layers often overfit the data, meaning they memorize the training data too well and struggle to generalize to new, unseen data.

The use of five LSTM layers strikes a balance between underfitting and overfitting. It allows the model to capture the necessary complexity of the data while avoiding the over-reliance on the training data that comes with too many layers.

Thus, as a result of processing the input sequence through the five-layer LSTM model, the model produces an output that can be used to make predictions about the class of the input sequence. By using an LSTM architecture with multiple layers and incorporating dropout regularization, our model is able to effectively capture complex patterns in sequential data and avoid overfitting, resulting in improved performance on classification tasks.

4 Experimental results

This section provides details of the experiments carried out to evaluate the effectiveness of the proposed cepstral coefficients and neural network models.

Two models were developed, one based on a CNN and the other on an LSTM network. Each model was trained on the dataset proposed above [1] with the Adam optimizer. To extract audio features, three types of features were used: the mel spectrogram, MFCC, and LFCC. A total of six experiments were carried out.
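A minimal training-loop sketch matching this setup; the DataLoader, model, and hyperparameters (learning rate, number of epochs) are assumptions for illustration:

import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3, device="cpu"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.NLLLoss()  # the models above output probabilities, so feed log-probabilities
    for epoch in range(epochs):
        for features, labels in loader:  # labels: 0 = genuine, 1 = synthesized
            features, labels = features.to(device), labels.to(device)
            optimizer.zero_grad()
            probs = model(features)
            loss = criterion(torch.log(probs + 1e-8), labels)
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch + 1}: loss {loss.item():.4f}")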

An important issue in assessing the quality of the resulting models is the choice of metrics. We evaluated the effectiveness of the features using the Equal Error Rate (EER), the F1 score, and the Matthews Correlation Coefficient (MCC). The strength of spoofing countermeasures is usually measured using the EER. The plain EER reflects neither application requirements nor the impact of spoofing and countermeasures on ASV, and its use as a primary metric in traditional ASV research has long been abandoned in favor of risk-based evaluation approaches such as the tandem Detection Cost Function (t-DCF) [14]. For example, ASVspoof 2021 uses t-DCF as the primary metric for the LA and PA tasks. For the DF task, however, the primary metric is EER, since the DF task does not include an ASV system and its metric does not require specifying cost and prior parameters [24]. Since the proposed models perform binary classification, the F1 score is a suitable accuracy metric; it combines the precision and recall of a classifier into a single value by taking their harmonic mean. The MCC is a more reliable statistical rate that produces a high score only if the prediction performs well in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally to the sizes of both the positive and negative classes in the dataset [8].
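For reference, the three metrics can be computed from the model scores as follows; F1 and MCC come directly from scikit-learn, and EER is taken as the point on the ROC curve where the false positive and false negative rates coincide:

import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef, roc_curve

def equal_error_rate(labels, scores):
    # scores are the predicted probabilities of the "fake" class
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # point where FPR is closest to FNR
    return float((fpr[idx] + fnr[idx]) / 2)

def report(labels, scores, threshold=0.5):
    preds = (scores >= threshold).astype(int)
    return {
        "EER": equal_error_rate(labels, scores),
        "F1": f1_score(labels, preds),
        "MCC": matthews_corrcoef(labels, preds),
    }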

The results obtained are presented in Table 4. The EER for the mel spectrogram is much larger than for LFCC and MFCC, and the EER for LFCC is about half that for MFCC. The F1 score shows a result similar to MCC, since the dataset is class-balanced. In general, for the DF task the CNN model outperformed the LSTM model. This can be explained by the fact that detecting synthesized speech is in essence a binary classification problem, in which the LSTM model cannot exploit its strength in processing long sequential data.

Table 4 Experimental results

5 Conclusion

This paper compared the effectiveness of cepstral coefficients widely used in voice spoofing countermeasures for detecting deepfakes in Russian speech. We proposed a new dataset containing both bona fide and synthesized Russian speech and used it to train two types of neural networks. We did not find an influence of the Russian language on the behavior of the cepstral coefficients, and the results obtained are consistent with those of other researchers.

Our future work is aimed at expanding the Russian speech dataset, as well as exploring other cepstral coefficients and their mixtures.