
1 Introduction

Nowadays, speech recordings can be easily forged with off-the-shelf audio software. This poses a serious threat when we cannot determine whether a speech recording is natural or has been maliciously modified. In particular, forged speech can have an inestimable impact on society when it is used in news reporting, court evidence and other fields.

Over the past decades, digital speech forensics has played a crucial role in verifying the authenticity and integrity of speech recordings, and many methods have been proposed. To detect the compression history of AMR audio, Luo [1] proposed a stacked autoencoder (SAE) network that extracts deep representations and classifies double-compressed audio with a UBM-GMM classifier. Jing [2] presented a detection method based on adaptive least squares, using the periodicity in the second derivative of an audio signal as the classification feature. To protect text-dependent speaker verification systems from spoofing attacks, Jakub [3] proposed an algorithm for detecting replay-attack audio. In [4], Galina used high-level features with a GMM classifier to counter synthesized audio in the ASVspoof challenge. To detect electronically disguised speech, Huang [5] proposed a forensic algorithm that feeds Mel-frequency cepstral coefficient (MFCC) statistical vectors, consisting of the MFCC together with its mean values and correlation coefficients, into an SVM; their experiments show a detection accuracy of about 90%. In [6], Wang combined linear frequency cepstral coefficient (LFCC) statistical moments and formant statistical moments as input features to detect electronically disguised audio under different SNRs and different types of background noise.

Most of these forensic methods achieve good performance in detecting speech modified by one specific forgery operation, but they fail on unknown operations. For example, an electronic-disguise classifier can identify whether a test recording has undergone disguising, but if the recording was only processed with noise addition, the classifier will not give the correct result.

In recent years, researchers have started to focus on the forensics of multiple forgery operations. In [7], Jeong proposed a method to detect various image operations with a statistical feature. Luo [8] used statistical features derived from image residuals to identify various image operations. Most existing audio forensic methods adopt traditional features such as MFCC as the acoustic feature. However, with the rapid development of deep learning, classifiers with stronger discrimination ability can be built on techniques such as CNNs and RNNs [10,11,12,13]. In [9], Chen designed a convolutional neural network (CNN) with a fixed prior layer to classify different audio operations; the results show that the CNN-based method achieves better accuracy than traditional forensic methods.

In this paper, we present an RNN that detects various speech forgery operations from the traditional MFCC and LFCC features. We conduct extensive experiments to determine the most suitable feature and RNN architecture. The results show that the proposed method can detect all the considered forgery operations and outperforms other detection works.

The rest of the paper is organized as follows. Section 2 introduces the input of the network and feature extraction. Section 3 describes the proposed network architecture and some important hyper-parameters. Section 4 presents comparative results for the detection of various forgery operations. Finally, concluding remarks are given in Sect. 5.

2 Feature Extraction

Cepstrum coefficients, which represent the spectrum of a speech signal within an analysis window, have been widely applied as classification features to capture the difference between original and forged speech. Experiments show that forgery operations change the cepstrum coefficients of the processed speech relative to the original. In this section, we briefly introduce MFCC and LFCC, two of the most widely used cepstrum coefficients.

2.1 Mel-Frequency Cepstrum Coefficient

MFCC is a speech feature based on human auditory perception characteristics and is widely used in speech recognition [14]. Figure 1 shows the procedure for extracting the MFCC statistical moments.

Fig. 1. Extraction procedure of MFCC statistical moment.

MFCC focuses on the non-linear frequency characteristics of hearing: the Mel scale is approximately logarithmic in linear frequency, in accordance with the characteristics of the human ear. The relationship between Mel frequency and linear frequency is

$$ Mel(f) = 2595\lg\left( 1 + f/700 \right) $$
(1)

where \( f \) is linear frequency.
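As a quick sanity check, Eq. (1) is straightforward to implement; the following minimal Python sketch maps 1000 Hz to roughly 1000 Mel:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Eq. (1): map linear frequency (Hz) to the Mel scale."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000.0))  # ~1000.0 Mel
```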

First, the speech signal \( x\left( n \right) \) is divided into \( N \) frames, and a Hamming window \( H\left( n \right) \) is applied to each frame:

$$ H(n) = 0.54 - 0.46\cos\frac{2\pi n}{Z - 1},\quad n = 0,1,\cdots,Z - 1 $$
(2)

where \( Z \) is the number of samples in each frame.

Then the frequency spectrum \( F\left( \omega \right) \) of the \( i \)-th frame \( x_{i} \left( n \right) \) is computed with a Fast Fourier Transform (FFT). The power spectrum \( \left| {F\left( \omega \right)} \right|^{2} \) is processed by a Mel-filter bank \( B_{Mel} \) consisting of \( M \) triangular band-pass filters. The power \( P_{m} \) of the \( m \)-th Mel filter \( B_{m} \left( \omega \right) \) is

$$ P_{m} = \int_{f_{lm}}^{f_{um}} B_{m}(\omega)\left| F(\omega) \right|^{2} d\omega,\quad m = 1,2,\cdots,M $$
(3)

where \( f_{um} \) and \( f_{lm} \) denote the upper and lower cut-off frequencies of \( B_{m} \left( \omega \right) \).

The \( L \)-dimensional MFCC of \( x_{i} \left( n \right) \) is then obtained by applying the discrete cosine transform to the log filter-bank powers:

$$ C_{l} = \sum\limits_{m = 1}^{M} \left[ \log P_{m} \cdot \cos\frac{l(m - 0.5)\pi}{M} \right],\quad l = 1,2,\ldots,L $$
(4)

where \( C_{l} \) is the \( l \)-th MFCC component and \( L \) is less than the number of Mel filters \( M \).
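Eq. (4) is a log-then-DCT step; a minimal sketch, assuming the filter powers \( P_{m} \) from Eq. (3) are given:

```python
import numpy as np

def mfcc_from_filter_powers(P, L=12):
    """Eq. (4): DCT of the log filter powers P_1..P_M (array of length M)."""
    M = len(P)
    m = np.arange(1, M + 1)                     # filter index m = 1..M
    return np.array([np.sum(np.log(P) * np.cos(l * (m - 0.5) * np.pi / M))
                     for l in range(1, L + 1)]) # coefficients C_1..C_L
```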

We also calculate the dynamic (derivative) cepstrum coefficients \( \Delta MFCC \) and \( \Delta\Delta MFCC \). Assume that \( v_{ij} \) is the \( j \)-th component of the MFCC vector of the \( i \)-th frame, and \( V_{j} \) is the set of all \( j \)-th components. The average value \( E_{j} \) of each component set \( V_{j} \) and the correlation coefficient \( CR_{jj'} \) between different component sets \( V_{j} \) and \( V_{j'} \) are obtained by Eq. 5 and Eq. 6, respectively.

$$ E_{j} = E(V_{j}) = E\left( \{ v_{1j}, v_{2j}, \cdots, v_{Nj} \} \right),\quad j = 1,2,\cdots,L $$
(5)
$$ CR_{jj'} = \frac{cov(V_{j}, V_{j'})}{\sqrt{VAR(V_{j})}\sqrt{VAR(V_{j'})}},\quad 1 \le j \le j' \le L $$
(6)
$$ W_{MFCC} = \left[ E_{1}, E_{2}, \cdots, E_{L}, CR_{12}, CR_{13}, \cdots, CR_{L-1,L} \right] $$
(7)

The \( E_{j} \) and \( CR_{jj'} \) are concatenated via Eq. 7 to form the statistical moment \( W_{MFCC} \) of the \( L \)-dimensional MFCC vector. The statistical moments \( W_{\Delta MFCC} \) of the \( \Delta MFCC \) vector and \( W_{\Delta\Delta MFCC} \) of the \( \Delta\Delta MFCC \) vector are obtained in the same way.
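As an illustration, a minimal NumPy sketch of Eqs. (5)–(7) is given below, assuming the per-frame cepstral coefficients are already available as an \( N \times L \) matrix (e.g., from a standard MFCC extractor):

```python
import numpy as np

def statistical_moment(V):
    """Eqs. (5)-(7): V is an (N, L) matrix of per-frame cepstral vectors.
    Returns W = [E_1, ..., E_L, CR_12, CR_13, ..., CR_{L-1,L}]."""
    E = V.mean(axis=0)                     # Eq. (5): mean of each component set V_j
    R = np.corrcoef(V, rowvar=False)       # Eq. (6): correlation coefficient matrix
    iu = np.triu_indices(V.shape[1], k=1)  # keep the pairs with j < j'
    return np.concatenate([E, R[iu]])      # Eq. (7): statistical moment W

# Usage on random stand-in data (500 frames, 12 coefficients):
W = statistical_moment(np.random.randn(500, 12))
print(W.shape)  # (12 + 12*11/2,) = (78,)
```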

2.2 Linear Frequency Cepstral Coefficients

LFCC uses band-pass filters distributed uniformly from low to high frequency [14]. The extraction procedure of the LFCC statistical moment is shown in Fig. 2.

Fig. 2. Extraction procedure of LFCC statistical moment.

As shown in Fig. 2, the speech first goes through pre-processing, and then the spectral energy is obtained through the FFT:

$$ X_{i}(k) = \sum\nolimits_{n = 0}^{N - 1} x_{i}(n)\, e^{-j2\pi kn/N},\quad 0 \le k \le N - 1 $$
(8)
$$ E\left( {i,k} \right) = \left[ {X_{i} \left( k \right)} \right]^{2} $$
(9)

where \( x_{i} \left( n \right) \) is the speech signal of the \( i \)-th frame and \( N \) is the number of FFT points.

The spectral energy is then processed by a filter bank of \( L \) triangular band-pass filters with center frequencies \( f\left( l \right), l = 1,2,\ldots,L \). The frequency response of the \( l \)-th filter is

$$ H_{l}(k) = \left\{ \begin{array}{ll} 0, & k < f(l - 1) \\ \dfrac{k - f(l - 1)}{f(l) - f(l - 1)}, & f(l - 1) \le k \le f(l) \\ \dfrac{f(l + 1) - k}{f(l + 1) - f(l)}, & f(l) \le k \le f(l + 1) \\ 0, & k > f(l + 1) \end{array} \right. $$
(10)

The filtered spectral energy after the filter bank is

$$ S(i,l) = \sum\nolimits_{k = 0}^{N - 1} \left[ X_{i}(k) \right]^{2} H_{l}(k),\quad 0 \le l \le L $$
(11)

where \( H_{l} \left( k \right) \) is the frequency response of the \( l \)-th triangular band-pass filter.
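Together, Eqs. (10) and (11) weight each frequency bin by a triangular response and accumulate the energy per filter. A minimal sketch, assuming the center frequencies \( f(0), \ldots, f(L+1) \) are given in units of FFT bins:

```python
import numpy as np

def tri_response(k, f_prev, f_c, f_next):
    """Eq. (10): response of a triangular band-pass filter at frequency bin k."""
    if f_prev <= k <= f_c:
        return (k - f_prev) / (f_c - f_prev)
    if f_c <= k <= f_next:
        return (f_next - k) / (f_next - f_c)
    return 0.0

def filter_energies(power, centers):
    """Eq. (11): filtered spectral energy S(i, l) for one frame.
    power: [X_i(k)]^2 for k = 0..N-1; centers: f(0), ..., f(L+1) in bins."""
    return np.array([
        sum(power[k] * tri_response(k, centers[l - 1], centers[l], centers[l + 1])
            for k in range(len(power)))
        for l in range(1, len(centers) - 1)
    ])
```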

The DCT is then applied to the log of the filter-bank outputs to obtain the cepstrum coefficients:

$$ lfcc(i,n) = \sqrt{\frac{2}{L}} \sum\nolimits_{l = 0}^{L - 1} \ln\left[ S(i,l) \right] \cos\left( \frac{\pi n(2l - 1)}{2L} \right) $$
(12)

where \( n \) indexes the cepstrum coefficients of the \( i \)-th frame after the DCT.

As with MFCC, we also calculate the first-order difference \( \Delta {\text{LFCC}} \) and the second-order difference \( \Delta \Delta {\text{LFCC}} \) of LFCC:

$$ LFCC = \left[ \begin{array}{ccc} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{s,1} & \cdots & x_{s,n} \end{array} \right] $$
(13)
$$ \Delta x_{i,j} = \frac{1}{3}\sum\nolimits_{u = -2}^{2} u\, x_{i + u,j},\quad 3 \le i \le s - 2,\ 1 \le j \le n $$
(14)
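Below is a minimal sketch of Eq. (14), keeping the 1/3 normalization exactly as printed (many toolkits instead normalize by \( \sum_{u} u^{2} = 10 \)); applying the function twice yields the second-order difference:

```python
import numpy as np

def delta(C):
    """Eq. (14): C is an (s, n) cepstral matrix (s frames, n coefficients).
    Returns the first-order difference; edge frames are left as zero."""
    s, n = C.shape
    D = np.zeros_like(C)
    u = np.arange(-2, 3).reshape(-1, 1)   # u = -2, -1, 0, 1, 2
    for i in range(2, s - 2):             # 3 <= i <= s-2 in 1-based indexing
        D[i] = (u * C[i - 2:i + 3]).sum(axis=0) / 3.0
    return D

# delta(delta(lfcc_matrix)) gives the second-order difference.
```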

3 Detection Method Based on RNN

In this section, we will give a general description of the proposed framework for detecting four forgery operations based on RNN.

3.1 Framework

Recently, many deep learning approaches have been applied as classifiers, especially CNNs, which can capture highly complex features from raw samples [15]. The CNN structure effectively extracts deep high-level features and obtains good detection results in image forensics. However, it is less suitable for speech forensic tasks because it does not capture sequential dependencies well. RNNs, in contrast, have been widely used for temporal sequences such as speech, as they capture the correlation between frames [16]. Hence, we apply an RNN model to the task of classifying the various forgery operations.

The proposed framework is shown in Fig. 3. The traditional features are extracted from the raw waveform and then fed into the RNN. In this work, we choose the statistical moments of the MFCC and LFCC cepstrum coefficients described in Sect. 2 as the features.

Fig. 3. Proposed classification framework.

Due to gradient vanishing and exploding issues in RNN training, most existing RNN architectures consist of only a few layers (one, two or three), although a deeper network could capture more useful information. Hence, to find a better RNN architecture, we design three networks; their configurations are shown in Fig. 4. We use the \( tanh \) activation function, set the dropout rate to 0.5 to reduce overfitting during training, and append a Softmax layer to output the class probabilities.

Fig. 4. Three proposed recurrent neural networks.

In the experimental stage, an RNN with two LSTM layers is temporarily selected as the baseline network to find the best feature among the MFCC and LFCC variants for detecting forgery operations. The selected feature is then used to determine the final RNN architecture, as sketched below.
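For illustration, a minimal PyTorch sketch of such a two-LSTM-layer baseline is given below; the hidden size is an assumption, since the exact layer dimensions appear in Fig. 4 rather than in the text. The output layer covers five classes (original speech plus the four forgery operations):

```python
import torch
import torch.nn as nn

class BaselineRNN(nn.Module):
    """Sketch of the two-LSTM-layer baseline; hidden size 128 is an assumption."""
    def __init__(self, feat_dim, hidden=128, n_classes=5):
        super().__init__()
        # two stacked LSTM layers (tanh activations) with dropout 0.5 between them
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            batch_first=True, dropout=0.5)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):            # x: (batch, time, feat_dim)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])   # logits; softmax(dim=-1) gives probabilities
```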

3.2 Training Strategy

The proposed method involves two stages: training and testing. Before training, we process each original speech sample by randomly selecting a parameter for each forgery operation. Training follows the process shown in Fig. 3: the classification features are first extracted from the original speech and from the forged speech produced by disguising, noise-adding, high-pass filtering and low-pass filtering, and these features are then used to train the RNN. In testing, we freeze the parameters of the RNN model, select part of the original and forged speech as the test database, and obtain the final detection result from the output of the \( {\text{Softmax}} \) layer. Accuracy is taken as the evaluation metric, and we compute the confusion matrix by comparing the predicted labels of the test database with the true labels.
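A skeleton of this two-stage strategy might look as follows; `model`, `train_loader` and `test_loader` are placeholders for the network above and the feature/label batches:

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=50, lr=1e-3):
    """Training stage: fit the RNN on features of original and forged speech."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()          # applies softmax internally
    model.train()
    for _ in range(epochs):
        for feats, labels in train_loader:
            opt.zero_grad()
            loss_fn(model(feats), labels).backward()
            opt.step()

@torch.no_grad()
def test(model, test_loader):
    """Testing stage: parameters frozen, accuracy as the evaluation metric."""
    model.eval()
    correct = total = 0
    for feats, labels in test_loader:
        pred = model(feats).argmax(dim=-1)   # predicted labels
        correct += (pred == labels).sum().item()
        total += labels.numel()
    return correct / total
```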

4 Experimental Results and Analysis

In this section, we first present the experimental data and then compare the proposed method with other existing methods.

4.1 Experiment Setup

We create four forgery databases, covering disguising, low-pass filtering, high-pass filtering and noise-adding, based on the TIMIT speech database [17] and the UME speech database [18]. Specifically, we use Audition CS 6 to build the electronically disguised database, and MATLAB to build the other three forgery databases. As shown in Table 1, we choose four different operational parameters for each forgery operation, and we use Gaussian white noise as the added noise. The splits of the training and test sets for TIMIT and UME are shown in Table 2.

Table 1. Parameters processed by different forgery operations.
Table 2. Specific database for multiple operations (Natural/Operated).

The forged speech databases are built by selecting forged speech from these forgery databases. Four NVIDIA GTX 1080Ti GPUs with 11 GB of graphics memory each are used for RNN training.
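The MATLAB processing itself is not reproduced here, but a rough Python equivalent of the noise-adding and filtering operations might look like the sketch below; the SNR and cut-off values are placeholders, not the Table 1 parameters:

```python
import numpy as np
from scipy.signal import butter, lfilter

def add_white_noise(x, snr_db):
    """Add Gaussian white noise at a target SNR (dB)."""
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    return x + np.sqrt(p_noise) * np.random.randn(len(x))

def butter_filter(x, cutoff_hz, fs, kind="low", order=6):
    """Low- or high-pass filtering with a Butterworth design (kind: 'low'/'high')."""
    b, a = butter(order, cutoff_hz / (fs / 2.0), btype=kind)
    return lfilter(b, a, x)

# Each forgery operation is applied separately to the original speech, e.g.:
# noisy   = add_white_noise(x, snr_db=20)
# lowpass = butter_filter(x, 4000, fs=16000, kind="low")
```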

4.2 Experimental Results

First, we use the two-layer RNN architecture to select a suitable forensic feature from six acoustic features: MFCC, LFCC, and their first and second derivatives \( \Delta {\text{MFCC}} \), \( \Delta \Delta {\text{MFCC}} \), \( \Delta {\text{LFCC}} \) and \( \Delta \Delta {\text{LFCC}} \). Six well-trained two-layer RNN models are obtained, one per feature, and the test samples of TIMIT and UME are fed into the six models to compare the forensic capability of the features. Table 3 shows the detection accuracy of the six features. MFCC with its first and second derivatives \( \Delta {\text{MFCC}} \) and \( \Delta \Delta {\text{MFCC}} \) classifies the various forged samples better than the other features in the intra-database setting, with an average accuracy of about 99%. However, it yields a lower accuracy of approximately 80% in the cross-database setting (testing on UME samples while the model is trained on the TIMIT database), which suggests that the MFCC features are not sufficiently universal and robust.

Table 3. Average detection accuracy of six features in a two-layer RNN (%).

Unlike MFCC, the LFCC features perform better in the task of detecting the various operations. As shown in Table 3, LFCC with its first and second derivatives \( \Delta {\text{LFCC}} \) and \( \Delta \Delta {\text{LFCC}} \) achieves a detection accuracy of about 88% in the cross-database setting while maintaining good performance in the intra-database setting.

The results in Table 3 show that the MFCC features do not capture the difference between the original samples and the four kinds of forged samples reliably across databases, indicating that they are not robust enough. Although the accuracy of LFCC and its derivatives drops slightly in the intra-database setting, it remains in an acceptable range and is better than MFCC in the cross-database setting. Hence, the LFCC features are selected as the acoustic features.

The structure of the RNN plays an important role in the classification result. We design the three RNN structures in Fig. 4 to explore the impact of the network structure. The selected LFCC features are extracted from the original and forgery databases to train the three RNN models, and the classification probability is obtained from the Softmax layer. The detection accuracy of the three models on the TIMIT and UME databases during training is shown in Fig. 5 (a–f), and the comparison results are shown in Table 4.

Fig. 5. Detection accuracy of the three RNN networks during training on the TIMIT and UME databases. (a) and (b) show the detection performance of the RNN1 model, (c) and (d) of the RNN2 model, and (e) and (f) of the RNN3 model; (a), (c), (e) are trained on TIMIT and (b), (d), (f) on UME.

Table 4. Average detection accuracy of different structures of RNN (%).

As shown in the second and third rows, the intra-database results have excellent accuracy (above 97%), and the RNN2 model achieves a better detection accuracy of about 88% in the cross-database setting. The detection abilities of the RNN2 and RNN3 structures are similar; considering model complexity, we choose the RNN2 model as the final structure for detecting the various forgery operations.

4.3 Comparative Experiment

We compare the detection performance of this RNN-based work with our previous CNN-based work [19]. In that work, we proposed a forensic method for identifying the same four kinds of forgery operations: a fixed convolutional layer first obtains the residuals of the speech sample, and the residual signals are then classified by a group of convolutional layers. The comparative experiment shows that the method proposed in this paper substantially improves the classification accuracy.

As described in Sect. 4.2, RNN2 with the LFCC features is chosen as the final recurrent neural network in this work, and the average detection accuracy of its classification results is about 90%. To compare with the existing work, we repeated the experiments in [19] on the same original and forged databases; the results are shown in Table 5. As shown in the second and third rows, all test results have excellent accuracy (above 96%) in the intra-database setting, where the CNN results are even slightly better than the RNN. However, all test results decline to some extent in the cross-database setting, where the detection rate of the RNN remains above 87%. Some of the multi-classification results in this paper are comparable with the CNN model in [19], and some detection accuracies are significantly better than those of the CNN-based detection method.

Table 5. Classification capability of the proposed RNN compared with CNN model (%).

5 Conclusion

In this paper, we design a speech forensic method based on an RNN for detecting various forgery operations, and provide extensive results showing that the proposed method can effectively identify these operations. In the future, we will extend the proposed model and explore the deep features extracted by the neural network to identify unknown forgery operations.