Introduction

A large share of the world's population uses social media platforms. This has given rise to a trend in which an image or a video is altered for amusement, now known as “Meme Culture”. Tampered images are now a common sight on the internet, but in many cases an image or a video is altered for nefarious purposes. The creation of such tampered images contributes to the rise of fake news, and digital image forensics is struggling to find new methods of detection.

The amount of data now available, together with advances in technology, has led to rapid growth in the field of deep learning. Deep learning models have proved their value in image processing and computer vision. Convolutional neural networks (CNNs) are the most favoured of these models because they can extract and learn features automatically, and adding more data can further improve their performance. Chen et al. [1] first applied a CNN to detect tampering in images.

CNNs are fast, can automatically learn relationships between the features in images, and improve as the amount of data grows. In their usual configuration, however, they are not well suited to tampering detection. When Chen et al. [1] fed tampered images directly into their CNN, it learned the features of the image content rather than the traces of tampering, because the evidence of tampering lies in the underlying statistics of the images. To address this issue, some researchers pre-processed the images using suitable methods (Huang et al. [2]); some added an additional layer to their CNN architecture (Chen et al. [1], Bayar et al. [3]); and others used a CNN variant or another deep learning model, such as a Region-based CNN (R-CNN) (Zhou et al. [4]), a Fully Convolutional Network (FCN) (Salloum et al. [5]), a Deep Neural Network (DNN) (Wu et al. [6]), or an autoencoder (Zhang et al. [7]).

To deal with this issue, we chose a procedure that helps a CNN detect image forgery. Error level analysis (ELA) [8] is a method for detecting tampering in images stored with lossy compression, that is, compression in which some of the image information is lost. In this process, a suspected image is first recompressed, and then the difference in pixel intensity between the image before and after recompression is calculated. Altered regions can be spotted because their error levels differ from those of the unaltered regions.

The main contributions of this paper are as follows:

  • The proposed CNN model is the smallest among the compared models, yet it achieves accuracy comparable to the models in the literature.

  • It needs fewer resources to run.

  • It has fewer parameters, which reduces the training time of the model.

The rest of the paper is arranged into four sections. “Related Works” briefly describes previously proposed methods. “Methodology” describes the procedure we used, and “Experimental Results and Discussions” reports the results of our experiment. Finally, “Conclusion” summarizes the work and outlines the direction of future research.

Related Works

Various machine learning and deep learning methods have been applied to the identification of fake images. Gunawan et al. [12] used a CNN to classify images as authentic or tampered. They used 80% of the data for training and 20% for validation, which led to a poor approximation and low accuracy. Sudiatmika et al. [13] used VGG-16 for tampering detection, but such a large network needs a considerable amount of time and computational resources. Deep networks also suffer from the vanishing gradient problem, which makes them more difficult to train. This explains the poor performance of their model compared to other models.

Kanwal et al. [14] first extract the chroma components of an image. In the first part, feature vectors are generated using DCT over different local feature descriptors. In the second part, the Fourier transform is applied and the final feature vectors are generated using an enhanced version of a local feature descriptor. The feature vectors are then fed into SVM classifiers. The accuracy remains low, however, because the features are generated using traditional handcrafted methods.

Doegar et al. [15] feed the real and tampered images directly into a pre-trained AlexNet model. The features extracted by the network are then used as input to an SVM classifier. They did not use any particular strategy to expose the hidden indications of tampering. Thakur et al. [16] proposed a method to detect copy-move and splicing. First, the images are resized and converted to greyscale. Then, traces of median filtering and image blurring, which are common post-processing operations used to hide tampering, are detected using suitable methods. Lastly, a CNN is used to classify the images.

Zhang et al. [17] applied a CNN in combination with ELA to classify whether an image is a DeepFake or not. The size of their model is 225 MB and it has 2.95 × 10^7 parameters in total. Doegar et al. [18] employed three deep residual networks whose combined features are then used to train a classifier. Using residual networks may help with the vanishing gradient problem. However, deeper networks tend to learn additional, unnecessary information from an image, which may lead to overfitting, as noted by Zhang et al. [19].

In this paper, we propose an algorithm that combines error level analysis (ELA) with a convolutional neural network (CNN) to classify an image as authentic or fake. This methodology yields a validation accuracy of 96.18% after 24 epochs.

Methodology

We pre-processed the authentic and tampered images before feeding them to the CNN. In the first step of our approach, we generate the error level analysis (ELA) [8] image of each original and fake image, resize the images, normalize the pixel values, and add the corresponding labels. We then split the data into a training set and a validation set. In the second step, we feed the images and their labels into the CNN to train our model. The complete process is illustrated in Fig. 1.

Fig. 1 Outline of the overall process
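
The normalization, labelling, and splitting steps above can be sketched as follows. This is only an illustration under stated assumptions, not the exact pipeline used in the paper: it assumes NumPy, assumes the ELA images have already been resized to a common shape (the target size is not specified here), and uses a hypothetical label encoding (0 = tampered, 1 = authentic). The function name `prepare_dataset` and the parameter `val_fraction` are our own; the default of 0.1 mirrors the 90%/10% split reported in the experiments.

```python
import numpy as np

def prepare_dataset(ela_images, labels, val_fraction=0.1, seed=0):
    """Normalize ELA images, one-hot encode labels, and split the data.

    ela_images: list of HxWx3 uint8 arrays, already resized to one shape.
    labels: list of ints, 0 = tampered, 1 = authentic (assumed encoding).
    """
    # Scale pixel values from [0, 255] to [0, 1].
    x = np.stack(ela_images).astype(np.float32) / 255.0
    # One-hot encode the labels to match a two-way softmax output.
    y = np.eye(2, dtype=np.float32)[np.asarray(labels)]

    # Shuffle once, then hold out a fraction of the samples for validation.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_val = int(round(len(x) * val_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx])
```

Shuffling before the split keeps both classes represented in the validation set when the input list is ordered by class.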

Error Level Analysis (ELA)

The idea of ELA was proposed by Krawetz et al. [8]. The technique is applied to images that use lossy compression, mostly JPEG images. If a JPEG image is tampered with and saved as a JPEG again, some of its information is lost in the recompression. ELA resaves the image at a known compression level (95% here) and evaluates the difference between the original image and the resaved image. Since JPEG compresses an image in 8 × 8 pixel blocks, all blocks of an unmodified image should have similar error levels after recompression. In the case of tampering, the modified areas can be identified because their 8 × 8 blocks have a different error level than the areas that have not been modified. The functioning of ELA is shown in Figs. 2, 3, and 4.

Fig. 2 An authentic image and its ELA

Fig. 3 Resaved image at 75% compression and its ELA

Fig. 4 Tampering of 75% resaved image and its ELA

From the three figures above, we can observe that the more times an image is resaved, the more of its information is lost. Figs. 2 and 3 show that the changes between the authentic and resaved images are imperceptible to the human eye, while the differences are clear in their corresponding ELA images. In Fig. 4 some parts of the image have been changed: the building is copied, and the paraglider and helicopter are added. The ELA of Fig. 4 clearly shows that the regions which have undergone tampering have a different error level than the other regions.
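
A minimal sketch of the ELA computation is given below. It is an illustration rather than the implementation used in this work, and it rests on a few assumptions: it uses the Pillow library, it interprets the 95% level as Pillow's JPEG `quality` setting, and it stretches the difference image to the full intensity range purely for visibility. The function name `error_level_analysis` is our own.

```python
import io

from PIL import Image, ImageChops

def error_level_analysis(image, quality=95):
    """Return the ELA map of a PIL image: |original - JPEG-resaved copy|."""
    original = image.convert("RGB")

    # Resave the image as JPEG in memory at the chosen quality level.
    buffer = io.BytesIO()
    original.save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    resaved = Image.open(buffer)
    resaved.load()

    # Per-pixel absolute difference between the original and the resaved copy.
    diff = ImageChops.difference(original, resaved)

    # The differences are usually faint, so rescale them to the 0-255 range.
    max_diff = max(high for _, high in diff.getextrema()) or 1
    scale = 255.0 / max_diff
    return diff.point(lambda value: min(255, int(value * scale)))
```

Calling `error_level_analysis(Image.open(path))` on a suspect image, where `path` is a placeholder for a file of your own, yields an image in which regions with unusual error levels appear brighter or darker than their surroundings.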

Convolutional Neural Network (CNN)

Convolutional neural networks were introduced by LeCun et al. [9], who used them to recognize handwritten digits in images. A CNN reduces the data to a form that is simpler to process without losing the attributes that are essential for obtaining satisfactory results. CNNs are mostly used with 2-dimensional data such as images or video frames. Like other neural networks, a CNN has three kinds of layers: an input layer, hidden layers, and an output layer.

CNN Architecture of Proposed Method

Convolutional Layer

As the name suggests, a CNN uses the convolution operation to convert the data into a map of features.

The convolutional layer is the central component of a CNN. First there is an input layer of shape (input image height) × (input image width) × (input image channels), followed by a convolutional layer, which is a set of filters (also called masks or kernels) that transform the input image into a feature map. A filter is a matrix that is smaller than the input image. Each filter slides across the width and height of the input image and performs convolution; in other words, at every spatial location it computes the dot product of the filter with an image patch of the same size as the filter. Each filter is capable of capturing an important feature. After all the filters have passed over the image, the resulting feature maps are passed to the next layer. The proposed model uses two convolutional layers.
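
The sliding dot product described above can be illustrated with a short NumPy sketch. It assumes a single-channel image, a stride of 1, and no padding; as in most deep learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation. The function name `conv2d_valid` is ours and the code is not taken from the proposed model.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over a 2-D `image` (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.empty((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            # Dot product of the filter with the patch it currently covers.
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map
```

For a 4 × 4 input and a 2 × 2 filter this produces a 3 × 3 feature map, one value per filter position.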

Pooling Layer

The convolutional layer summarizes the features present in an image by generating a feature map. A major problem is that the feature map focuses on the positions of the features more than on the relationships between them. This increases the number of parameters and the computation time needed by the network, and slight changes in the locations of features may also create problems. Pooling layers are used to handle these issues. Pooling layers use a downsampling method, which makes the information easier to process and reduces the size of the feature maps. This makes the model more resilient to such changes and also reduces the number of parameters and the computation time. There are three kinds of pooling operation: max pooling, average pooling, and global pooling. We have used max pooling.

In max pooling, a filter is moved over the feature map generated by the convolutional layer in a non-overlapping manner, and only the maximum element of the area covered by the filter is extracted. In this manner, only the most prominent element of each region of the feature map is retained. The proposed model uses two max-pooling layers.
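
Non-overlapping max pooling can be sketched in NumPy as shown below. The 2 × 2 window is an assumption made for illustration (the pooling size is not stated here), and the function name `max_pool2d` is our own.

```python
import numpy as np

def max_pool2d(feature_map, pool=2):
    """Non-overlapping max pooling over a 2-D feature map."""
    h, w = feature_map.shape
    h_out, w_out = h // pool, w // pool
    # Drop trailing rows/columns that do not fill a complete window.
    trimmed = feature_map[:h_out * pool, :w_out * pool]
    # Group pixels into (pool x pool) windows and keep the maximum of each.
    windows = trimmed.reshape(h_out, pool, w_out, pool)
    return windows.max(axis=(1, 3))
```

With a 2 × 2 window, each spatial dimension of the feature map is halved, which is where the reduction in computation comes from.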

Fully Connected Layer

Fully connected layers are the last layers in a CNN. Like an artificial neural network, they perform classification based on the information extracted by the preceding layers of the CNN. The 3-dimensional output of the last convolutional or pooling layer must be flattened into a vector before it is passed to the fully connected layer. The output layer is generally a layer with the softmax activation function, which converts the output of the fully connected layer into a probability vector giving the probability of each class label. Our model uses one fully connected layer and a two-way softmax layer.
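
The flattening and softmax steps can be sketched as follows. This is a deliberately simplified illustration, assuming NumPy: it collapses the classifier into a single linear layer followed by a two-way softmax, whereas the proposed model has a separate fully connected layer before the output layer. The names `softmax`, `classify`, `weights`, and `bias` are ours.

```python
import numpy as np

def softmax(logits):
    """Convert a vector of scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def classify(feature_maps, weights, bias):
    """Flatten 3-D feature maps and apply a linear layer plus softmax."""
    vector = feature_maps.reshape(-1)   # 3-D volume -> 1-D vector
    logits = weights @ vector + bias    # one score per class
    return softmax(logits)
```

With two output classes, the returned vector holds the predicted probabilities of the two labels, and the larger entry gives the predicted class.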

Tuning Parameters

During training, the parameters of the model are adjusted to bring the loss down as far as possible, which helps the model make more accurate predictions. The algorithms that modify these parameters are called optimizers. To tune our model during training, we used “Adadelta” [10] as the optimizer. The loss function measures the discrepancy between the actual labels and the predictions produced by the “softmax” function; we used “categorical_crossentropy” as the loss function. The architecture of our CNN is shown in Fig. 5.

Fig. 5 Architecture of proposed CNN
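
To make the loss and the optimizer concrete, the sketch below implements categorical cross-entropy for a single sample and the Adadelta update for a single scalar parameter, following the update rule in [10]. It is a plain-Python illustration, not the framework code used for training; the values `rho=0.95` and `eps=1e-6` are illustrative assumptions, and framework implementations may additionally scale the step by a learning rate.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

class Adadelta:
    """Adadelta update for one scalar parameter."""

    def __init__(self, rho=0.95, eps=1e-6):
        self.rho = rho
        self.eps = eps
        self.avg_sq_grad = 0.0  # running average of squared gradients
        self.avg_sq_step = 0.0  # running average of squared updates

    def step(self, param, grad):
        self.avg_sq_grad = self.rho * self.avg_sq_grad + (1 - self.rho) * grad ** 2
        # The step size is the ratio of the two running RMS values.
        delta = -(math.sqrt(self.avg_sq_step + self.eps)
                  / math.sqrt(self.avg_sq_grad + self.eps)) * grad
        self.avg_sq_step = self.rho * self.avg_sq_step + (1 - self.rho) * delta ** 2
        return param + delta
```

Because the step size is derived from running averages of past gradients and past updates, the rule adapts the effective learning rate for each parameter individually.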

Experimental Results and Discussions

Experimental Setup

All of our experiments were conducted in Jupyter Notebooks on Google Colab. The model was trained using a GPU runtime on Google Colab, which provided 12.72 GB of RAM and 68.40 GB of disk space.

Dataset

To train our model we chose the CASIA dataset [11] available on Kaggle. More specifically, we chose CASIA v2.0 because CASIA v1.0 contains fewer samples. CASIA v2.0 consists of 7492 authentic images and 5124 tampered images in various lossy and lossless formats. We also chose CASIA v2.0 because its images are tampered in two ways. The first is copy-move tampering, in which part of an image is copied and pasted onto another part of the same image, usually to hide some features or to add extra features. The second is splicing, in which objects from two or more images are combined to form a tampered image. Copy-move and splicing are both basic kinds of tampering, which makes this dataset well suited to tampering detection. Figure 6 shows samples of authentic images and Fig. 7 shows samples of tampered images from the dataset.

Fig. 6 Authentic samples from CASIA v2.0

Fig. 7 Tampered samples from CASIA v2.0

Training and Performance

In our experiment, we kept images in one lossy format (JPEG) and one lossless format (PNG) and discarded images in other formats. After this filtering, our dataset contained 9418 images, authentic and tampered combined. After generating the ELA of the images and adding their labels, the data was split into 90% for training and 10% for validation. The images were then fed into the neural network for training for up to 40 epochs. The accuracy and loss curves are shown in Fig. 8, where the x-axis represents the number of epochs and the y-axis shows the accuracy in Fig. 8a and the loss in Fig. 8b.

Fig. 8 Accuracy curve and loss curve

The training process stopped at epoch 24. As the figures above show, our model achieved a training accuracy of 98.34% and a validation accuracy of 96.18%. We then evaluated our model on the validation data; the resulting confusion matrix is presented in Fig. 9.

Fig. 9 Confusion matrix

From the confusion matrix shown above we can calculate the precision, recall, and F1 score of our model, whose formulae are shown below. The calculated results for our model are given in Table 1.

$$\text{Precision} = \frac{\text{True positive}}{\text{True positive} + \text{False positive}}$$
$$\text{Recall} = \frac{\text{True positive}}{\text{True positive} + \text{False negative}}$$
$$\text{F1 score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
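
The three formulae translate directly into code. The helper below is a small sketch; the function name `classification_metrics` is ours, and the counts in the usage note are made-up values for illustration, not entries of our confusion matrix.

```python
def classification_metrics(tp, fp, fn):
    """Precision, recall, and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, with hypothetical counts of 90 true positives, 10 false positives, and 30 false negatives, the precision is 0.90 and the recall is 0.75.
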
Table 1 Precision, recall and F1 score of our model

Comparison

We compare the performance of our model with other models that also classify whether an image is tampered or not. The results are compared in Table 2, along with other details of the experiments.

Table 2 Comparison with other models

Our model is the smallest of all the models compared above and is more accurate than all of them except that of Zhang et al. [17], who achieved an accuracy of 97.6% against 96.18% for our proposed model. Although their accuracy is slightly higher, the size of their model is 225 MB whereas ours is only 96 MB, and their model has 2.95 × 10^7 parameters whereas ours has 8.41 × 10^6. The proposed model therefore requires less time and fewer computational resources.

Conclusion

The number of tampered images we find these days makes us question the information we come across, and digital image forensics is having a difficult time dealing with this kind of fake information. Convolutional neural networks perform remarkably well at extracting features from images, but they are inclined to learn the features of the image content rather than the signs of tampering.

Hence, to improve effectiveness, we pre-processed the images using error level analysis and then fed them to a CNN. The model classifies authentic and tampered images reasonably well, obtaining a validation accuracy of 96.18%.

The main focus of the paper is to distinguish fake images from real images on social media. The regions of objects that have been modified or tampered with are visible in the ELA-enhanced images, and in our future work we will identify the tampered objects themselves.