1 Introduction

As one of the most popular media formats, JPEG (Joint Photographic Experts Group) images are easily accessible and are therefore liable to be altered or manipulated with various basic operations such as resizing, filtering, splicing, noise addition, contrast enhancement, rotation, and double compression [25, 26]. Because these operations often leave no visual traces, the processing history of such images is unclear, which can be detrimental in some situations [2]. It is therefore of great significance to verify the trustworthiness of a given digital image, and a variety of image forensics methods have been developed to detect the traces of these operations.

Among these operations, double compression has received particular attention because it arises in two different situations. The first is image forgery. Operations such as image splicing generate a forged image by copying a local block of a source image into a target image, and a further JPEG compression is needed if one of the two images is in JPEG format. Consequently, a trace of JPEG compression is exhibited. The second is JPEG steganography: in several steganographic algorithms [27], images are decompressed into the spatial domain and recompressed during embedding. Double compression takes place if the quantization matrix used in embedding differs from that used in the original compression. Steganalysis of such images usually relies on an idea similar to that of double compression detection [17].

For JPEG double compression detection, researchers have proposed several methods based on different theoretical foundations. In [19], a noisy mixing model studied in blind component analysis was formulated to interpret the shifted double JPEG compression problem. To cope with the noisy environment, in addition to independent component analysis (ICA), the asymmetry of the independent value map (IVM) was learned to obtain a normalized criterion of independence. The extracted features were then fed to a support vector machine (SVM) classifier. In [4], differences between the magnitudes of the JPEG coefficient 2-D array of a given image and its shifted versions along various directions are used to strengthen the artifacts of double JPEG compression. Transition probability matrices are then obtained with a thresholding technique applied to a Markov random process. In another study, neighboring joint density features and marginal density features of the discrete cosine transform (DCT) coefficients are extracted to help learning classifiers detect clues of manipulation [16]. Moreover, the variation trend of the DCT coefficients when a JPEG image is repeatedly recompressed with the same quantization matrix was studied in [8]. The differences in these trends among uncompressed, single-compressed, and double-compressed images were then used, together with a novel random perturbation strategy, to discriminate images that have undergone an unknown number of compressions. Similarly, a feature vector formed by histograms of low-frequency DCT coefficients is classified with an SVM in [18]. Other methods exploiting features from DCT coefficients are proposed in [15, 22, 24].

In this paper, a special filtering layer and DenseNet (densely connected convolutional network) blocks are combined to form a new neural network trained to check the compression history of a given image. In particular, a network based on features from both the spatial domain and the frequency domain is proposed to perform image forgery detection on RGB images. The network is trained to differentiate among uncompressed, single-compressed, and double-compressed images. If an image is classified as double-compressed, the most likely quality factor (QF) of the primary JPEG compression is also estimated.

The rest of the paper is organized as follows. A brief overview of CNNs and DenseNet is given in Section 2. In Section 3, we describe the proposed model in detail, as well as the selection of the filtering kernels for the first layer. Experimental results and performance comparisons are presented in Section 4. Conclusions are given in Section 5.

2 Related work

In recent years, neural networks, especially convolutional neural networks (CNNs), have shown the ability to extract and learn complex features and their representations from images. The CNN was first introduced in 1990 [14], but did not attract much attention until 2012, when it was used for image classification in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [5, 12]. Compared with standard feedforward neural networks, it benefits from fewer connections and parameters to train, and from the reduced computational cost provided by the pooling layers.

With the wide adoption of CNNs, they have become the major method for visual object recognition, and many improved versions have sprung up. Among them, the residual learning framework introduced in [7] was designed to address the degradation problem: as the number of layers increases, accuracy degrades rapidly and training error rises instead of steadily improving or remaining stable [6, 9, 13, 20, 21]. A deep residual learning framework explicitly lets the layers fit a residual mapping by adding an identity mapping through a shortcut connection that skips two convolutional layers. With the identity mapping, it is easier for the solver to find the perturbation when the optimal function is closer to an identity mapping than to a zero mapping. Thus, residual networks ease the optimization and converge faster. In fact, besides the residual network, several other studies have arrived at a similar improvement on CNNs: creating short connections from early layers to later ones [9, 13, 21].

In [10], the researchers distilled the idea above and applied it to a greater part of the network. They connect all layers to each other to ensure maximum use of the output information of every layer involved. As shown in Fig. 1, each layer takes additional inputs from the earlier layers and delivers its own feature maps to all later layers. Instead of summation, concatenation is used at the joint points, so that each layer’s input contains all the output information of the preceding layers, and thus far fewer feature maps are needed per layer. The vanishing-gradient problem is also alleviated by the implicit deep supervision, since each layer has direct access to the original input signal as well as to the gradients of the loss function. These properties also help to reduce overfitting.

Fig. 1 Layout of a dense block with all layers connected

More concretely, one dense block contains several convolutional layers and one transition layer. The outputs of the previous layers are directly concatenated to the input of every later layer, so the number of feature maps fed to each layer grows along the dense block. In the proposed network, the numbers of feature maps output by the layers of a dense block are 32, 32, 16, 16, 32, and 32. In addition, the input of the dense block has 32 feature maps, so that the 32 input maps and the 32 + 32 + 16 + 16 + 32 = 128 maps produced by the first five layers make a total of 160 feature maps concatenated at the input of the last (transition) layer. All convolutional kernels in the convolutional layers of the proposed network are of size 3 × 3. The detailed structure of a dense block and its parameter settings can be seen in Fig. 1.
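The channel bookkeeping of a dense block can be checked with a few lines of Python. The sketch below is illustrative only and is not the authors' implementation; the function name is hypothetical, and it assumes that every layer receives the block input concatenated with the outputs of all earlier layers.

```python
def dense_block_input_channels(block_input, layer_outputs):
    """Return the number of feature maps seen at the input of each layer."""
    inputs = []
    total = block_input
    for out_channels in layer_outputs:
        inputs.append(total)      # this layer sees everything produced so far
        total += out_channels     # its output is concatenated for later layers
    return inputs


# 32 input maps; layer outputs 32, 32, 16, 16, 32, 32 as stated in the text
channels = dense_block_input_channels(32, [32, 32, 16, 16, 32, 32])
print(channels)  # [32, 64, 96, 112, 128, 160]
```

The last entry reproduces the 160 feature maps at the input of the final layer of the block.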

In [1], the authors applied a simple CNN-based network to detect double JPEG compression. In addition, a multi-branch CNN with raw DCT coefficients as input was proposed in [15]. Moreover, a well-designed pre-processing step that concatenates the histograms of selected sub-bands with different numbers of bins was applied before a deep convolutional neural network in [24]. However, to the best of our knowledge, few works have applied DenseNet to double JPEG compression detection. Considering training time and accuracy, four dense blocks are used in the proposed model.

3 CNN-based network structure

3.1 Network architecture

To detect the existence of a primary JPEG compression and its quality factor, two networks are defined: a mainly spatial domain-based network without a frequency-based part, and the same network with one. The spatial domain-based network is composed of a cascade of convolutional layers, pooling layers, dense blocks, and fully-connected layers. The frequency-based part takes the form of a branch that joins the spatial domain-based part at the second fully-connected layer. Figure 2 shows the architecture of the proposed networks.

Fig. 2 Framework of the proposed neural network

The first layer of the spatial domain part is a special filtering layer containing 12 specifically defined filtering kernels selected using Fisher Linear Discriminant Analysis (F-LDA); the details are given in Section 3.2. A convolutional layer with 32 kernels of size 3 × 3 follows the first layer, and the resolution of its output feature maps is then reduced to 32 × 32 by a pooling layer with a kernel of size 2 × 2 and a stride of two. After that come four dense blocks, each containing five convolutional layers and one transition layer. As the last part of each dense block, an average pooling layer downsamples the block output with a 2 × 2 kernel and a stride of two. Thus, the resolution is reduced to 2 × 2 after the fifth average pooling layer. The output is then reshaped to a column vector of length 128 and followed by three fully-connected layers. The first two fully-connected layers contain 256 neurons each, while the last one contains nine neurons, since it outputs the classification result of the network. To prevent overfitting, dropout is used in the first two fully-connected layers so that half of the neurons are randomly dropped during training. The same network is applied to eight different datasets, with QF2 varying from 60 to 95 in steps of 5.
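The resolutions quoted above can be traced step by step. The following sketch is a plausibility check, not the authors' code: it assumes a 64 × 64 input, convolutions that preserve the spatial resolution, and 32 feature maps leaving the last dense block; the function name is hypothetical.

```python
def trace_resolution(input_size=64, num_dense_blocks=4):
    """Follow the spatial resolution through the 2x2, stride-2 pooling layers."""
    sizes = [input_size]
    size = input_size // 2             # pooling after the first 3x3 convolution
    sizes.append(size)
    for _ in range(num_dense_blocks):  # one average pooling per dense block
        size //= 2
        sizes.append(size)
    return sizes


sizes = trace_resolution()
flattened = sizes[-1] * sizes[-1] * 32  # assumes 32 maps leave the last block
print(sizes, flattened)  # [64, 32, 16, 8, 4, 2] 128
```

Under these assumptions, five pooling steps take 64 × 64 down to 2 × 2, and 2 × 2 × 32 gives the vector of length 128 that enters the fully-connected layers.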

The frequency-based part of the network follows that of [1]. Its input is a 909 × 1 vector extracted from every image, built from the histograms of the DCT coefficients at the first nine spatial frequencies in zigzag scan order (the DC component is omitted), following the idea described in [23] and [1]. The next two layers are convolutional layers with 100 kernels of size 3 × 1, each followed by a max-pooling layer with a stride of two. After them comes a fully-connected layer containing 256 neurons. Its output is concatenated with the output of the first fully-connected layer of the spatial domain-based part, forming a second fully-connected layer containing 512 neurons with dropout. In the validation experiments, both a purely spatial domain-based network and a multi-domain-based one are used for the classification task, and their results are listed in two separate tables. A brief introduction to each kind of layer in the network follows.

Convolutional layer: a conventional convolutional layer contains two main operations: convolution and non-linearity. The convolution operation extracts a large number of features through receptive fields and shared weights. It aggregates the responses of local and global areas of the input into feature representations called feature maps. In the proposed network, every convolutional layer in the spatial domain-based part follows a batch normalization (BN) layer and a Rectified Linear Unit (ReLU) activation, applied sequentially (Fig. 3). Batch normalization, introduced in [11], accelerates the convergence of training by standardizing the distribution of the inputs of each layer so as to reduce the ‘internal covariate shift’ (that is, the phenomenon that the distribution of each layer’s input changes during training). Normalization is applied to each training mini-batch as part of the model architecture. With batch normalization, a higher learning rate can be used to train the model, and parameter initialization requires less care. In addition, the ReLU non-linearity, i.e., f(x) = max(0, x), allows the model to train faster than other activation functions [3].
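To make the two operations concrete, a minimal NumPy sketch of training-time batch normalization followed by ReLU is given here. It is illustrative rather than the authors' implementation: `gamma`, `beta`, and `eps` are the usual learnable scale, learnable shift, and stability constant, set to assumed default values, and the running statistics used at inference time are omitted.

```python
import numpy as np


def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature over the mini-batch axis, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


def relu(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)


rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 8))  # 64 samples, 8 features
normalized = batch_norm(batch)   # roughly zero mean, unit variance per feature
activated = relu(normalized)     # negative responses are set to zero
```

Whatever the mean and spread of the incoming mini-batch, the normalized features have approximately zero mean and unit variance, which is what allows a comparatively large learning rate.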

Fig. 3 Extended configuration of a convolutional layer

Pooling layer: to reduce the computational complexity and the probability of over-fitting, a pooling layer is applied to the feature maps obtained from the convolutional layers to discard irrelevant information while preserving the important information over a local region. There are two main pooling methods: average pooling and max pooling. The outputs of neighboring neurons in a local region of the feature maps are summarized by their average or their maximum value, respectively, and then propagated to the next layer as higher-level feature representations.
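Both pooling variants with a 2 × 2 kernel and a stride of two can be sketched in NumPy as follows. The helper name `pool2x2` is hypothetical, and the sketch handles a single feature map with even height and width.

```python
import numpy as np


def pool2x2(feature_map, mode="avg"):
    """2x2 pooling with stride 2 on a single (H, W) feature map; H, W even."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)  # group into 2x2 regions
    if mode == "avg":
        return blocks.mean(axis=(1, 3))
    if mode == "max":
        return blocks.max(axis=(1, 3))
    raise ValueError("mode must be 'avg' or 'max'")


fmap = np.arange(16, dtype=float).reshape(4, 4)
avg_out = pool2x2(fmap, "avg")  # [[2.5, 4.5], [10.5, 12.5]]
max_out = pool2x2(fmap, "max")  # [[5., 7.], [13., 15.]]
```

Each output value summarizes one non-overlapping 2 × 2 region, so the resolution is halved in both directions.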

Fully-connected layer: several fully-connected layers of various sizes form the classification module. Dropout is applied to the first two fully-connected layers, randomly setting neurons to zero to alleviate over-fitting. For the last fully-connected layer, a softmax activation function is applied to generate a distribution over the nine class labels.
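The two ingredients of the classification module can be illustrated with the sketch below. It is not the authors' code: the rescaling of the surviving activations ("inverted dropout") is a common convention that is assumed here, and the random arrays merely stand in for real activations.

```python
import numpy as np


def dropout(x, rate, rng):
    """Zero each activation with probability `rate` (training time only)."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)  # assumed convention: rescale the survivors


def softmax(logits):
    """Turn raw scores into a probability distribution over the classes."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
hidden = dropout(rng.normal(size=(4, 256)), rate=0.5, rng=rng)
logits = rng.normal(size=(4, 9))   # stand-in for the output of the last layer
probs = softmax(logits)            # each row sums to one over the nine classes
predicted = probs.argmax(axis=-1)  # index of the most probable class
```

With a dropout rate of 0.5, about half of the 256 activations are zeroed in each training step, and the softmax output assigns one probability to each of the nine classes.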

3.2 Special filtering layer

To find the most appropriate filtering kernels for the first layer of the spatial domain-based network, the F-LDA method is used to rank the residual noises produced by candidate filters according to their ability to distinguish the image classes. F-LDA is a machine learning method that finds a linear combination of features able to separate several classes from each other. It projects high-dimensional pattern samples onto the optimal discriminant vector space so as to extract useful information for classification and reduce the dimension of the feature subspace. To ensure the best separability of samples from different classes, the projection gives the pattern samples the maximal between-class scatter and the minimal within-class scatter. Mathematically, the between-class scatter can be defined as:

$$ \mathbf{S}_{be}=\sum\limits_{i = 1}^{c}N_{i}(\mathbf{u}_{i}-\mathbf{u})(\mathbf{u}_{i}-\mathbf{u})^{T}, \quad \mathbf{u}=\frac{1}{N}\sum \mathbf{x} $$
(1)

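where N is the number of samples in the dataset, c is the number of classes, \(N_{i}\) is the number of samples in class i, x is an input training feature vector, \(\mathbf{u}_{i}\) is the mean vector of class i, and \(\mathbf{u}\) is the mean vector of all samples. The within-class scatter can be defined as: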

$$ \mathbf{S}_{in}=\sum\limits_{i = 1}^{c}\mathbf{S}_{i}, \quad \mathbf{S}_{i}=\sum\limits_{\mathbf{x}\in C_{i}}(\mathbf{x}-\mathbf{u}_{i})(\mathbf{x}-\mathbf{u}_{i})^{T} $$
(2)

where \(\mathbf{S}_{i}\) is the scatter matrix (the unnormalized covariance matrix) of class i and \(C_{i}\) is the set of samples in class i. Thus, the objective function can be described as:

$$ J(\mathbf{W})=\frac{\|\mathbf{W}^{T}\mathbf{S}_{be}\mathbf{W}\|}{\|\mathbf{W}^{T}\mathbf{S}_{in}\mathbf{W}\|} $$
(3)

where W is the projection matrix. The projection matrix can be written as \(\mathbf{W}=[\mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{k}]\), where each \(\mathbf{w}_{i}\) satisfies the following formula:

$$ \mathbf{S}^{-1}_{in}\mathbf{S}_{be}\mathbf{w}_{i}=\lambda_{i}\mathbf{w}_{i} $$
(4)

indicating that \(\mathbf{w}_{i}\) is an eigenvector of the matrix \(\mathbf {S}_{in}^{-1}\mathbf {S}_{be}\). Accordingly, the eigenvectors corresponding to the k largest eigenvalues obtained from the eigenvalue decomposition of \(\mathbf {S}_{in}^{-1}\mathbf {S}_{be}\) can be chosen to form the projection matrix, where k is no more than c − 1.
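Equations (1), (2), and (4) translate almost directly into NumPy. The sketch below is a generic Fisher LDA under stated assumptions, not the authors' implementation: the function name is hypothetical, a pseudo-inverse is used in place of the plain inverse to guard against a singular within-class scatter, and the demonstration data are synthetic Gaussian clusters.

```python
import numpy as np


def fisher_lda(X, y, k=None):
    """Return (eigenvalues, W) for Fisher LDA on samples X (n, d), labels y (n,)."""
    classes = np.unique(y)
    u = X.mean(axis=0)                       # global mean vector
    d = X.shape[1]
    S_be = np.zeros((d, d))
    S_in = np.zeros((d, d))
    for cls in classes:
        X_c = X[y == cls]
        u_c = X_c.mean(axis=0)
        diff = (u_c - u)[:, None]
        S_be += len(X_c) * (diff @ diff.T)   # Eq. (1): between-class scatter
        centered = X_c - u_c
        S_in += centered.T @ centered        # Eq. (2): within-class scatter
    # Eq. (4): eigenvectors of S_in^{-1} S_be (pinv is an assumed safeguard)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_in) @ S_be)
    order = np.argsort(eigvals.real)[::-1]   # largest eigenvalue first
    max_k = len(classes) - 1                 # at most c - 1 useful directions
    k = max_k if k is None else min(k, max_k)
    return eigvals.real[order][:k], eigvecs.real[:, order][:, :k]


rng = np.random.default_rng(0)
means = np.array([[0, 0, 0, 0], [4, 0, 0, 0], [0, 4, 0, 0]], dtype=float)
X = np.vstack([rng.normal(m, 1.0, size=(100, 4)) for m in means])
y = np.repeat(np.arange(3), 100)
eigvals, W = fisher_lda(X, y)
projected = X @ W                            # samples in the discriminant space
```

With three classes, at most two discriminant directions are returned, in line with the bound k ≤ c − 1.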

In the present paper, we use F-LDA to rank the ability of every residual noise to separate each class, and we look for the residuals that separate as many classes as possible. First, we define seven basic filters (BF), of which four are of size 3 × 3 and the other three are of size 5 × 5 (Fig. 4). To expand the diversity of \(R_{N}\), taking the 3 × 3 filters (named BF3) as an example, we generate more filters using the formula below:

$$ R_{N}=\alpha\times R_{A}\pm \beta\times R_{B} $$
(5)

where \(\alpha,\beta \in \{0,0.1,\ldots,1\}\) and \(R_{A},R_{B}\in \{BF3_{1},BF3_{2},BF3_{3},BF3_{4}\}\).
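The enumeration in Eq. (5) can be sketched as follows. This is an illustration under explicit assumptions: the two 3 × 3 kernels are placeholders and not the basic filters of Fig. 4, duplicates are detected by exact comparison after rounding, and the all-zero filter is discarded. Since the paper does not spell out its duplicate criterion, the number of filters produced here is not expected to match the count reported in the text.

```python
import itertools

import numpy as np


def generate_filters(basic_filters, coefficients):
    """Enumerate alpha*R_A + beta*R_B and alpha*R_A - beta*R_B, dropping repeats."""
    unique = {}
    pairs = itertools.product(basic_filters, repeat=2)
    for (r_a, r_b), alpha, beta, sign in itertools.product(
        pairs, coefficients, coefficients, (1, -1)
    ):
        candidate = np.round(alpha * r_a + sign * beta * r_b, 6)
        if not candidate.any():
            continue                       # skip the all-zero filter
        unique.setdefault(tuple(candidate.ravel()), candidate)
    return list(unique.values())


# Placeholder kernels (hypothetical): horizontal and vertical first differences
horizontal = np.array([[0, 0, 0], [0, -1, 1], [0, 0, 0]], dtype=float)
vertical = np.array([[0, 0, 0], [0, -1, 0], [0, 1, 0]], dtype=float)
coefficients = np.linspace(0.0, 1.0, 11)   # 0, 0.1, ..., 1
filters = generate_filters([horizontal, vertical], coefficients)
```

Each resulting kernel would then be convolved with the image to produce one residual noise per candidate filter.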

Fig. 4 Definitions of seven basic filters

Altogether, 144 unique BF3 filters are obtained after eliminating the repeated ones. Next, 1000 images are randomly chosen for each class after cropping the original images of resolution 512 × 384 into 48 sub-images of size 64 × 64. For each image, we extract 400 blocks of size 3 × 3 from the center. Since RGB images have three channels, each block is the average of the three blocks extracted from the three channels. In all, a training sample of 400 × 1000 × 144 = 57600000 residual noises is obtained for each class. After that, a feature matrix of size 9000 × 400 is generated with each BF. Thus, altogether 144 feature matrices are generated.

Using the 400 features of each image from the nine classes, we calculate the linear discriminant coefficients through F-LDA; these coefficients are used for classification. We calculate the linear scores with the coefficients and finally obtain a posterior probability for each of the 9000 images. To assess the ability to distinguish among the nine classes of images, we average the posterior probabilities of the 1000 images in each class, so that for each BF there is a 9 × 1 vector describing its classification capacity. For each class, we rank the 144 posterior probabilities and keep the BFs that appear in the top 50 of all nine classes. Nine BF3 filters result from this selection. The same operations are applied to the three basic BF5 filters: we obtain 440 unique BFs, of which three are finally chosen. Fig. 5 depicts the details of the 12 selected filters, and Fig. 6 provides a simple workflow of the BF selection process.
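The final selection step, keeping the filters that rank in the top 50 for every class, can be sketched as a set intersection. The function name and the score matrix are hypothetical; the random scores below only stand in for the class-averaged posterior probabilities and are not data from the paper.

```python
import numpy as np


def select_shared_filters(mean_posteriors, top_k=50):
    """Return indices of filters ranked in the top_k for every class.

    mean_posteriors: (n_filters, n_classes) class-averaged posterior probabilities.
    """
    n_filters, n_classes = mean_posteriors.shape
    shared = set(range(n_filters))
    for cls in range(n_classes):
        ranking = np.argsort(mean_posteriors[:, cls])[::-1]  # best filter first
        shared &= set(ranking[:top_k].tolist())
    return sorted(shared)


rng = np.random.default_rng(0)
scores = rng.random((144, 9))   # random stand-in: 144 filters, nine classes
scores[:5] += 1.0               # make filters 0..4 strong for every class
selected = select_shared_filters(scores, top_k=50)
```

A filter survives only if it is among the 50 best for all nine classes, which favors kernels that discriminate broadly rather than for a single class.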

Fig. 5 Details about 12 selected basic filters

Fig. 6 A simple workflow of the process of BF selection

4 Experiments

4.1 Experimental setup

We use the UCID database for the experimental tests. To obtain sufficient data, we crop each of the 1338 images (size: 512 × 384) into 48 blocks of size 64 × 64, yielding 64224 images. Among them, 90% of the images (57798) are used for training and the others (6426) for testing. For the first and second JPEG compressions, the quality factors (QF1 and QF2) are set from 60 to 95 in steps of five. When QF1 = QF2, the second compression is skipped, since the resulting images would fall into the single-compressed class. Therefore, each network is trained to classify seven double-compressed classes and two further classes, uncompressed and single-compressed. Eight networks are trained on eight groups of datasets, one for each value of QF2, to detect the primary JPEG compression. In addition, frequency-domain features are extracted from the DCT coefficients. In detail, DCT coefficients are extracted from each 8 × 8 block of a given N × N patch. We select the first nine spatial frequencies in zigzag scan order (excluding the DC component). For each of them, we form a 101 × 1 vector based on the occurrences of the absolute values of quantized DCT values varying from -50 to 50. Thus, a vector containing 909 elements (101 bins × 9 frequencies) is taken as the input of the frequency-based part of the network. After that, another eight networks with the extra frequency-domain features are trained as before on the same eight groups of datasets.
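The construction of the 909-element frequency feature can be sketched as below. Several points are assumptions made for illustration: the quantized coefficients are taken as already available in an array of 8 × 8 blocks, signed values are counted in 101 integer bins from −50 to 50, and values outside that range are discarded. The function and constant names are hypothetical.

```python
import numpy as np

# First nine AC positions (row, col) of an 8x8 block in JPEG zigzag order
ZIGZAG_AC9 = [(0, 1), (1, 0), (2, 0), (1, 1), (0, 2),
              (0, 3), (1, 2), (2, 1), (3, 0)]


def dct_histogram_feature(coeff_blocks, low=-50, high=50):
    """Concatenate per-frequency histograms into a vector of length 9 * 101.

    coeff_blocks: (n_blocks, 8, 8) integer array of quantized DCT coefficients.
    """
    n_bins = high - low + 1
    feature = []
    for row, col in ZIGZAG_AC9:
        values = coeff_blocks[:, row, col]
        values = values[(values >= low) & (values <= high)]  # assumed: drop outliers
        feature.append(np.bincount(values - low, minlength=n_bins))
    return np.concatenate(feature)


rng = np.random.default_rng(0)
blocks = rng.integers(-60, 61, size=(64, 8, 8))  # stand-in for one 64x64 patch
feature = dct_histogram_feature(blocks)
print(feature.shape)  # (909,)
```

A 64 × 64 patch contains 64 blocks of size 8 × 8, so each of the nine histograms counts at most 64 coefficients.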

We implement the proposed model using the TensorFlow 1.5 deep learning framework on an Nvidia 1080 Ti GPU with 16 GB of RAM. The network is optimized using the Momentum optimizer with momentum = 0.9. The learning rate is set to 0.1, a relatively high value made feasible by the BN layers. The batch size for training and testing is 64 images.
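For reference, the classical momentum update with the stated hyperparameters can be written in a few lines. This is a sketch of the update rule as we understand it for a momentum optimizer without the Nesterov variant, applied here to a toy one-dimensional objective; it is not the training code of the paper, and the function name is hypothetical.

```python
def momentum_step(weight, grad, velocity, learning_rate=0.1, momentum=0.9):
    """One classical momentum update; returns the new weight and velocity."""
    velocity = momentum * velocity + grad     # accumulate past gradients
    weight = weight - learning_rate * velocity
    return weight, velocity


# Toy objective f(w) = w^2 with gradient 2w, starting from w = 5
w, v = 5.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, 2.0 * w, v)
```

On this toy problem the iterate oscillates around the minimum at zero while its amplitude shrinks at every step.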

4.2 Spatial domain-based network results

The performance of the spatial domain-based network is shown in Table 1, which reports both the accuracy and the True Positive Rate (TPR). As the results on the 6426 test images show, the overall accuracy reaches up to 88.6%. The performance is clearly better in the top-right area of the table, because the existence of the primary JPEG compression is easier to detect when QF2 > QF1, as its traces are more likely to remain. Note that we use only features extracted from the spatial domain, yet the results are far better than those of the spatial domain-based CNN in [1] and close to those of its multi-domain-based CNN. The positive and negative differences between the results of the proposed network and those of the spatial domain-based network in [1] are also listed in Table 1.

Table 1 Classification results over eight groups of datasets using the spatial domain-based network

4.3 Multi-domain-based network results

The results of the multi-domain-based network are shown in Table 2. As the results on the 6426 test images show, the overall accuracy reaches up to 95.0%. In general, a significant improvement can be observed in all eight networks. As before, the positive and negative differences between the results of the proposed network and those of the multi-domain-based network in [1] are also listed in different colors.

Table 2 Classification results over eight groups of datasets using the multi-domain-based network

From Fig. 7, we can see that the performance of the eight proposed networks rises steadily as QF2 increases from 60 to 95. The proposed network performs considerably better than the networks in [1] when QF2 is no more than 80, and falls behind only the multi-domain-based network in [1] once QF2 exceeds 85.

Fig. 7 Comparison of the classification results of four networks

5 Conclusions

To meet the challenge of double JPEG compression detection, this paper has proposed a modified version of DenseNet with a special filtering layer placed at the front of the network and a frequency-based part exploiting DCT features of the images. The F-LDA method is used to select the most appropriate filtering kernels from a large number of predefined residual noises, based on the posterior probabilities obtained. DenseNet was applied to achieve incremental training accuracy, efficient reuse of the training features, and persistent feature propagation. The results show that the networks accomplish the task of double JPEG compression detection effectively and, compared with other works, make particular progress when QF2 is less than 80.