1 Introduction

Modern technology has revolutionized photograph acquisition and publication. Image and video capture devices, such as digital cameras and camera phones, have become cheap and ubiquitous, making it easy for their owners to record images anywhere and keep a daily visual journal of their lives. Meanwhile, to deter crime and ensure public security, surveillance cameras have proliferated and can be found in both public and private spaces, watching our movements around the clock. With an Internet connection, captured photographs can be shared within seconds by posting them to social-networking sites (e.g., Facebook, Twitter, LinkedIn), photo/video-sharing sites (e.g., Instagram, Flickr, YouTube), and personal web sites and blogs. According to the KPCB (Kleiner Perkins Caufield & Byers) 2016 Internet Trends Report [1], about 2 billion photos are shared on Facebook-owned websites every day. These technologies provide great convenience and security to users. Despite these many advantages, however, challenges do exist. The pervasiveness of image and video capture devices raises a growing concern about invasions of the privacy of unsuspecting individuals. There is always a trade-off between availability and privacy, and it is therefore important to look for ways in which technology can be used to protect privacy.

In this paper, we concentrate on an emerging visual privacy protection problem: covert photo classification. Covert photography (also called secret or unauthorized photography) refers to the use of an image or video recording device to photograph or film people who are unaware that they are being intentionally photographed or filmed [2]. Photos taken in this manner are called covert photographs. Such photos may be captured with ordinary cameras, while the device or the observer is concealed from the subject. Covert photos usually contain information that is inherently sensitive and private to the unaware subject, and sharing them on the Internet may lead to serious negative consequences [3,4,5,6].

Covert photo classification was first studied in [7], where an algorithm was proposed to fuse heterogeneous image features and visual attributes in a multiple kernel learning framework; it relies heavily on manual feature design. Meanwhile, recent progress in machine learning, especially deep learning, has achieved remarkable success in tasks such as image recognition [8,9,10], speech recognition [11,12,13], and natural language understanding and translation [14, 15], with performance that matches or even surpasses that of humans [16]. These successes suggest a promising direction: using deep neural networks trained end-to-end (from raw image pixels to class scores) to improve the performance of covert photo classification.

This paper explores the capability of deep neural networks for covert photo classification. In particular, three DCNN-based architectures (AlexNet [17], GoogleNet [18], and VGGS [19]) are developed for covert photo classification, with the following contributions:

  • DCNN-based methods are introduced for the first time to solve the emerging covert photo classification problem. Experimental results demonstrate that all the explored DCNN-based architectures, after transferring parameters from ImageNet pre-trained models, significantly surpass hand-engineered feature methods.

  • Activation maps reveal the intrinsic characteristics the DCNNs have learned to discriminate covert photos from non-covert ones. Although different DCNN architectures highlight different discriminative regions, most highlight the dark occlusion regions caused by a hidden camera, which is consistent with human intuition.

  • With the aid of auxiliary attributes, a two-stage parameter transfer method is exploited to further enhance the performance of the algorithm.

  • The final fusion of the three DCNN-based architectures further boosts classification performance and significantly outperforms hand-crafted feature methods.

In the rest of the paper, Sect. 2 summarizes related work. Section 3 describes the proposed DCNN-based covert photo classification algorithm, with detailed experiments and analysis. Section 4 explores leveraging auxiliary attributes to improve the final covert photo classification performance. Finally, Sect. 5 concludes the paper.

Fig. 1

Framework of covert photo classification by DCNN. Three DCNN-based architectures (AlexNet, GoogleNet, and VGGS) are first trained on the source dataset, ImageNet, and the pre-trained parameters of the internal layers are then transferred to the target task, covert photo classification. Data augmentation by extracting 10 different subcrops is employed to compensate for the small training set and help reduce overfitting

2 Related work

2.1 Privacy protection in visual data

Covert photo classification is closely related to studies of privacy protection, an increasing concern in modern society. Almost all countries have laws to protect privacy, although the boundaries and content of what is considered private differ among cultures and individuals. Many researchers and groups have proposed algorithms and systems to protect privacy in visual media such as images and videos [20,21,22,23,24,25,26,27,28,29,30,31,32].

Previous studies on visual privacy protection mainly focus on detecting private information and hiding it. Martin et al. [33] developed a de-identification filter for video sequences of drivers from naturalistic driving data that protects the identity of the driver while preserving the driver's behavior. Nakashima et al. [34] proposed a method for detecting intended human objects and developed a system for obscuring privacy-sensitive human regions in videos taken for mobile video surveillance. Elhadad et al. [35] developed a high-capacity hiding technique that embeds the video captured by a surveillance camera into another, processed video from which the private information has been removed. Ross and Othman [36] explored visual cryptography to preserve the privacy of biometric data (such as face images, fingerprint images, and iris codes) by decomposing the original image into two images stored on two separate database servers; the original private image can be revealed only when both images are simultaneously available, and the individual component images reveal nothing about the identity in the original.

Our work is most closely related to [7], which addresses the problem of classifying covert photos and establishes a covert image dataset with 2500 covert photos and 10,000 non-covert photos. The photos were collected from varied sources, e.g., the web, surveillance systems, voyeurism publications, and real covert photography on the street. Each sample image in the dataset was verified rigorously by checking its source, and the final dataset was adjusted to reduce potential bias toward specific topics or content. Eight hand-crafted low-level image features and 13 mid-level image attributes were fused for image representation in a multiple kernel learning framework for covert photo classification. The experimental results showed that the approach achieved an average classification rate (1-EER) of 0.8940, significantly outperforming other contemporary algorithms as well as human performance.

In contrast to these prior efforts, which use hand-designed models based on user-defined features, we propose using a deep learning architecture to discover the intricate structure of the training set and automatically learn the representations for covert photo classification.

2.2 Deep convolutional neural networks for image classification

Conventional computer vision strategies were limited in their ability to process the raw pixel values of an image. Building a machine learning system required domain experts to carefully design feature extractors that transform the raw image data into feature vectors, followed by a classifier that maps those vectors to categories [37]. By contrast, deep convolutional neural networks [17,18,19] allow a machine to be fed with raw image pixels and learn representations automatically. DCNNs therefore depend less on prior knowledge and human effort in feature design.

A DCNN is composed of multiple convolutional and subsampling layers, optionally followed by fully connected layers, and is therefore said to be deep (in contrast, classical representations are referred to as shallow) [38]. A DCNN architecture can take full advantage of the 2D structure of an input image; its local connectivity and weight sharing dramatically reduce the number of free parameters to estimate while improving generalization. DCNNs can also be trained easily with the standard back-propagation algorithm, and they have achieved leading performance on a variety of visual recognition tasks [8,9,10, 39]. In recent ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [40] competitions, almost all highly ranked teams used DCNNs as their basic framework.

3 Covert photo classification with DCNN

3.1 Problem formulation

We develop three DCNN-based architectures (AlexNet, GoogleNet, and VGGS) for covert photo classification. All three networks are first trained on the source dataset, ImageNet, which contains 1.2 million images in 1000 categories, and the pre-trained parameters of the internal layers are then transferred to the target task, covert photo classification. The framework is illustrated in Fig. 1.

A DCNN architecture usually contains millions of parameters, and directly learning so many parameters from only a few thousand training images is problematic [41,42,43]. Unfortunately, only a small amount of training data is available for our task: as explained in [7], collecting a covert dataset requires rigorous verification and bias reduction, so the final dataset contains only 2500 covert photos.

A common technique for overcoming a limited dataset is transfer learning, which aims to transfer knowledge from a related source domain to the target domain [44]. In this paper, we obtain the source parameters directly from pre-trained models shared in the Caffe Model Zoo [45]; such models usually take 2–3 weeks to train on ImageNet. We initialize the network parameters by transferring them from ImageNet pre-trained models, keep the earlier layer parameters fixed (these parameters are not specific to a particular object category or dataset and usually resemble Gabor filters or color blobs), and then fine-tune the higher layer parameters (which are more specific to the classes in the training dataset) on our Covert-2500 dataset. This follows the work of Yosinski et al. [41], which demonstrates that transferring features even from distant tasks can be better than using random features. Recently, a similar strategy has been adopted by Banerjee et al. [46], who transfer ImageNet pre-trained AlexNet parameters to classify medical images, and by Ghazi et al. [47], who transfer ImageNet pre-trained GoogleNet, AlexNet, and VGGNet parameters to identify plant species.
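The experiments in this paper use Caffe, but the same recipe (load ImageNet weights, freeze the generic early layers, fine-tune the rest on the two-class task) can be illustrated with a minimal PyTorch sketch; the choice of AlexNet here and the exact freezing boundary are illustrative assumptions, not the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn
from torchvision import models

# Illustrative PyTorch equivalent of the Caffe transfer setup.
net = models.alexnet(pretrained=True)  # ImageNet pre-trained source model

# Freeze the early convolutional layers: their filters are generic
# (Gabor-like edges, color blobs) and transfer well across tasks.
# Where exactly to stop freezing is a tunable design choice.
for param in net.features.parameters():
    param.requires_grad = False

# Replace the 1000-way ImageNet classifier with a 2-way head
# (covert vs. non-covert); this layer is randomly initialized.
net.classifier[6] = nn.Linear(4096, 2)

# Fine-tune only the parameters that still require gradients.
optimizer = torch.optim.SGD(
    (p for p in net.parameters() if p.requires_grad),
    lr=1e-3, momentum=0.9, weight_decay=5e-4)
```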

For comparison, we also conduct experiments in which the models are trained from scratch, i.e., the parameters are initialized with random numbers rather than transferred from pre-trained models.

Our task is to distinguish covert from non-covert photos, a typical two-class classification problem. Covert photos are treated as the positive class and non-covert photos as the negative class.

3.2 Experimental protocol

Classification performance is evaluated using two measures: the area under the receiver operating characteristic (ROC) curve (AUC) and the equal error rate (EER). These measures are consistent with those used in [7] and are derived from the ROC curve, which plots the true positive rate against the false positive rate as the decision threshold varies over the score range [48, 49]. The larger the AUC, the better the ROC. The EER identifies the point where the false positive rate equals the false negative rate; the smaller the EER, the better the system. The EER point, marked with a '*' in the ROC figures, lies at the intersection of the ROC curve and the straight line through (0, 1) and (1, 0).
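Both measures follow from a single ROC computation. A minimal sketch using scikit-learn (our own illustration, not the evaluation code of [7]):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def auc_and_eer(labels, scores):
    """labels: 1 = covert (positive), 0 = non-covert;
    scores: classifier scores for the positive class."""
    fpr, tpr, _ = roc_curve(labels, scores)
    roc_auc = auc(fpr, tpr)
    # The EER is where FPR equals FNR (= 1 - TPR): find the point
    # on the curve minimizing |FPR - (1 - TPR)|.
    fnr = 1.0 - tpr
    idx = np.argmin(np.abs(fpr - fnr))
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return roc_auc, eer
```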

All models are trained and tested with Caffe [50] on an NVIDIA GeForce GT640 2GB GPU.

3.3 Dataset

The Covert-2500 dataset [7] includes 2500 covert photos and 10,000 non-covert photos. The training and testing splits are the same as in [7]: the training set contains 2000 covert photos and 8000 non-covert photos, and the testing set contains 500 covert photos and 2000 non-covert photos.

Each input image is preprocessed by resizing it to \(256 \times 256\) and subtracting the per-pixel mean computed over all training images. We employ data augmentation consisting of image translations and horizontal reflections: 10 different subcrops (the 4 corners, the center, and their horizontal flips) are extracted from each resized \(256 \times 256\) image. The subcrops are of size \(227 \times 227\) (AlexNet) or \(224 \times 224\) (GoogleNet and VGGS), and the networks are trained on these extracted subcrops. This increases the size of our training set by a factor of 10.
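A minimal sketch of the 10-crop extraction with PIL (illustrative; the crop coordinates follow the standard four-corners-plus-center scheme described above):

```python
from PIL import Image, ImageOps

def ten_crops(img, crop=227):
    """Resize to 256x256, then return the 4 corner crops, the center
    crop, and the horizontal flips of all five (10 subcrops total)."""
    img = img.resize((256, 256))
    w, h = img.size
    boxes = [
        (0, 0, crop, crop),                      # top-left
        (w - crop, 0, w, crop),                  # top-right
        (0, h - crop, crop, h),                  # bottom-left
        (w - crop, h - crop, w, h),              # bottom-right
        ((w - crop) // 2, (h - crop) // 2,
         (w + crop) // 2, (h + crop) // 2),      # center
    ]
    crops = [img.crop(b) for b in boxes]
    crops += [ImageOps.mirror(c) for c in crops]  # horizontal reflections
    return crops
```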

Fig. 2

An illustration of the three DCNN-based covert photo classification architectures. The source images are first resized to a fixed size of \(256 \times 256\) and then subcropped to the required input size (\(227 \times 227\) for AlexNet and \(224 \times 224\) for the GoogleNet and VGGS networks). After passing through a number of convolutional, subsampling, and optional fully connected layers, the networks output the class scores for covert and non-covert. The networks are trained end-to-end (from raw image pixels on one end to class scores at the other)

3.4 Covert photo classification

3.4.1 AlexNet-based covert photo classification architecture

AlexNet, proposed by Krizhevsky et al. [17], was the first widely popularized DCNN architecture in computer vision. It won the ImageNet ILSVRC 2012 competition, significantly outperforming the runner-up. In this paper, we use the Caffe [50] version of AlexNet, which differs slightly from the original in that pooling is done before normalization. Figure 2a describes the architecture of the AlexNet-based covert photo classifier. A \(227 \times 227\) crop of an image (with 3 RGB color channels) is taken as the input. In the first layer, it is convolved with 96 different filters, each of size \(11 \times 11\), using a large stride of 4 pixels, which enables fast processing. The resulting 96 feature maps of size \(55 \times 55\) are first passed through a rectified linear unit (ReLU [51]), then subsampled to \(27 \times 27\) by \(3 \times 3\) max-pooling (with stride 2), and finally normalized over local input regions. Similar operations are repeated in layers 2, 3, 4, and 5. The last three layers (fc6, fc7, and fc8) are fully connected, taking all neurons in the previous layer as input and connecting them to every neuron in the current layer. The fully connected fc6 and fc7 layers have 4096 neurons each, followed by dropout [52] with probability 0.5 to avoid overfitting. The number of neurons in the last fully connected layer (fc8) equals the number of classes, i.e., 1000 for ImageNet and two for covert photo classification. The last fully connected layer is followed by a softmax-with-loss layer, which produces the class scores.
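The feature-map sizes quoted above follow from the usual convolution/pooling arithmetic, output = floor((input + 2*pad - kernel) / stride) + 1; a quick sanity check:

```python
def out_size(n, kernel, stride, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * pad - kernel) // stride + 1

assert out_size(227, kernel=11, stride=4) == 55  # conv1: 227 -> 55
assert out_size(55, kernel=3, stride=2) == 27    # max-pool: 55 -> 27
```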

We train the AlexNet-based model with stochastic gradient descent with momentum. The batch size is set to 50, the momentum is fixed at 0.9, and the multiplicative weight decay is set to \(5 \times 10^{-4}\) per iteration. The learning rate starts at 0.001 and is annealed over the course of training by dropping it by a factor of 10 whenever the validation error rate stops decreasing at the current learning rate. In our experiments, the best performance of AUC 97.29% and EER 8.65% is reached after 40 epochs when transferring parameters from the ImageNet pre-trained model. By contrast, the best performance without parameter transfer is AUC 93.09% and EER 14.04%, observed after 68 epochs.
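A sketch of this training schedule (the experiments use Caffe; this PyTorch rendering, the plateau patience, and the helper functions are our own illustrative assumptions):

```python
import torch
from torchvision import models

net = models.alexnet(pretrained=True)  # stand-in for the fine-tuned model
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=5e-4)
# Drop the learning rate by a factor of 10 whenever the validation
# error stops decreasing; the patience value is an assumption.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=3)

for epoch in range(40):
    train_one_epoch(net, optimizer)  # hypothetical training helper
    val_error = validate(net)        # hypothetical validation helper
    scheduler.step(val_error)
```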

3.4.2 GoogleNet-based covert photo classification architecture

GoogleNet [18] was the winner of ILSVRC 2014. The network is 22 layers deep when counting only layers with parameters. Figure 2b describes the architecture of GoogleNet; we omit the details, which are available in [18]. The input image size is \(224 \times 224\); after passing through two convolutional layers, the resulting feature maps are fed to a series of inception modules. At the top of the network, average pooling is used instead of fully connected layers. An inception module combines multiple convolutional layers with a parallel pooling path, concatenating their output filter banks into a single output vector that forms the input to the next stage. Within these layers, the filter sizes are restricted to \(1 \times 1\), \(3 \times 3\), and \(5 \times 5\). This architecture dramatically reduces the number of parameters in the network: GoogleNet has 12 times fewer parameters than AlexNet.
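A minimal PyTorch rendering of one inception module may make the structure concrete (the channel counts passed to the constructor are free parameters; GoogleNet's exact configuration is given in [18]):

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    """Parallel 1x1, 3x3, and 5x5 convolutions plus a pooled branch,
    concatenated along the channel dimension."""
    def __init__(self, c_in, c1, c3r, c3, c5r, c5, cp):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, c1, kernel_size=1)
        # 1x1 "reduce" convolutions cut channel counts before the
        # expensive 3x3 / 5x5 filters -- the main parameter saving.
        self.b3 = nn.Sequential(nn.Conv2d(c_in, c3r, 1),
                                nn.ReLU(inplace=True),
                                nn.Conv2d(c3r, c3, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(c_in, c5r, 1),
                                nn.ReLU(inplace=True),
                                nn.Conv2d(c5r, c5, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(c_in, cp, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x),
                          self.b5(x), self.bp(x)], dim=1)

# Example: the inception(3a) configuration reported in [18].
block = Inception(192, 64, 96, 128, 16, 32, 32)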

We train the GoogleNet-based models using stochastic gradient descent with a batch size of 8 examples (due to memory limitations, this is the largest value our 2GB GPU allows), momentum of 0.9, and weight decay of \(2 \times 10^{-4}\). The base learning rate starts at 0.001 and is decreased by a factor of 10 until the test set accuracy stops improving. The best performance of AUC 97.71% and EER 7.85% is reached after 13.6 epochs when transferring parameters from the ImageNet pre-trained model. By contrast, the best performance when training from scratch is AUC 90.82% and EER 17.40%, observed after 15.5 epochs.

3.4.3 VGGS-based covert photo classification architecture

The VGG network [38] was proposed by Chatfield et al. from the Visual Geometry Group at the University of Oxford and comes in three versions (the fast VGG-F, the medium VGG-M, and the slow VGG-S) with different accuracy/speed trade-offs. Here we use the slower but more accurate VGG-S (denoted VGGS for simplicity). Like AlexNet [17], the VGGS network contains five convolutional layers and three fully connected layers; its architecture is described in Fig. 2c. The input is a \(224 \times 224\) RGB image, and the conv1 layer uses \(7 \times 7\) filters with a smaller stride of 2. The conv3, conv4, and conv5 layers use more filters than AlexNet (512 instead of 384 and 256). The configuration of the fully connected layers (fc6, fc7, and fc8) is the same as in AlexNet: the first two have 4096 neurons each, and the third performs two-way covert classification with only two outputs (covert and non-covert). All hidden layers are equipped with ReLU activation units, and the final layer is a softmax layer.

The VGGS training procedure follows [38], learning on the Covert-2500 dataset using stochastic gradient descent with momentum. A mini-batch size of 10 is used to update the parameters, starting with a learning rate of \(10^{-4}\) and a momentum term of 0.9. Training is regularized by weight decay, with the L2 penalty multiplier set to \(5 \times 10^{-4}\). The best performance of AUC 97.76% and EER 8.05% is reached after 7.6 epochs when transferring parameters from the ImageNet pre-trained model. By contrast, the best performance when training from scratch is AUC 79.32% and EER 29.20%, observed after 20 epochs.

The ROC curves of the three DCNN-based architectures, with parameters transferred from ImageNet and trained from scratch, are shown in Fig. 3. All architectures trained from scratch (with random initialization) on the Covert-2500 dataset show drastically reduced performance, which is understandable given the small size of the training set. When transferring parameters from ImageNet pre-trained models, the VGGS-based architecture reaches the best performance. Notably, all three fully trained DCNN-based architectures outperform the hand-crafted feature methods.

Fig. 3

ROC curves of the three DCNN-based covert photo classification algorithms trained from scratch and with parameters transferred from pre-trained ImageNet models, respectively. The ROC curve of the hand-crafted method in [7] is also shown for comparison

Fig. 4

Visualization of the learned filters of the first convolutional layers. All three networks are pre-trained on ImageNet and then fine-tuned on the Covert-2500 dataset. Each patch corresponds to one filter, and the filters resemble either Gabor filters or color blobs. a AlexNet (96 learned filters, size \(11 \times 11\)). b GoogleNet (64 learned filters, size \(7 \times 7\)). c VGGS (96 learned filters, size \(7 \times 7\))

Fig. 5

Activation maps. The discriminative regions that the DCNNs use for covert photo classification are highlighted. The first row shows the source covert example images; the second, third, and fourth rows show the activation maps of AlexNet, GoogleNet, and VGGS, respectively

Fig. 6

Framework of covert-related attribute classification and the two-stage parameter transfer method that leverages these auxiliary attributes to improve the final covert photo classification performance. First, the DCNN parameters pre-trained on ImageNet are transferred to the intermediate task of covert-related attribute classification and fine-tuned; the fine-tuned parameters from the intermediate task are then transferred to the final covert photo classification task. In the multi-label attribute classification task, a sigmoid cross-entropy loss layer is used

3.5 Detailed analysis of what DCNN learned

Although DCNNs have demonstrated excellent performance on a variety of challenging machine learning tasks, they have long been regarded as black boxes because their inner workings are difficult to understand exactly. In recent years, researchers have developed a series of algorithms [53,54,55,56,57] to peer inside these black boxes and visualize their internal structure, in an attempt to better understand what a DCNN has learned.

Figure 4 visualizes the learned filters of the first convolutional layer. All network parameters are initialized by transfer from models pre-trained on ImageNet and then fine-tuned on the Covert-2500 dataset. The first-layer filters are human interpretable, resembling the Gabor filters, edge detectors, and color blobs commonly used in computer vision. This phenomenon appears across many datasets and tasks [17, 41].

Figure 5 shows activation maps computed following the procedure of [55]. All nine input photos in Fig. 5 are covert, and the discriminative image regions used by the DCNNs to distinguish covert photos from non-covert ones are highlighted. The different networks use different discriminative regions, which indicates that they have learned different internal representations. Most of the dark occlusion regions caused by the hidden camera are highlighted, suggesting that the DCNNs actually learn intrinsic characteristics similar to human intuition. Because fully connected layers lose the ability to localize objects when the predicted score is mapped back to the convolutional layers, we remove the fully connected layers before the final output and replace them with a global average pooling layer for AlexNet and VGGS when computing these activation maps; GoogleNet is left unchanged.
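Under the procedure of [55], the map for a given class is the weighted sum of the last convolutional feature maps, with the weights taken from the linear classifier that follows global average pooling. A minimal numpy sketch of this computation (array shapes are illustrative):

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """feature_maps: (C, H, W) activations of the last conv layer;
    fc_weights: (num_classes, C) weights of the linear layer that
    follows global average pooling; returns an (H, W) heat map."""
    w = fc_weights[class_idx]                    # (C,) class weights
    cam = np.tensordot(w, feature_maps, axes=1)  # weighted sum -> (H, W)
    cam = np.maximum(cam, 0)                     # keep positive evidence
    cam = cam / (cam.max() + 1e-8)               # normalize to [0, 1]
    return cam  # upsample to input size and overlay on the photo
```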

Table 1 Covert-related attributes

4 Covert-related attributes classification

Inspired by the work of Zhang et al. [58], which exploits auxiliary attributes to improve landmark detection and face alignment, we investigate whether covert-related attributes can be leveraged to improve the final covert photo classification performance.

Apart from its final category, an image has many other attributes. In computer vision, an attribute is defined as a semantic or abstract quality that different categories share [59]. Automatically learning and recognizing attributes can complement category-level classification and thereby improve the degree to which machines perceive visual content [60,61,62,63,64,65].

In this section, a two-stage parameter transfer method is exploited. In the first stage, the DCNN parameters pre-trained on ImageNet are transferred to an intermediate task of covert-related attribute classification and fine-tuned (Fig. 6, transfer parameters \(\textcircled {1}\)). In the second stage, the fine-tuned parameters from the intermediate task are transferred to the final covert photo classification task (Fig. 6, transfer parameters \(\textcircled {2}\)), as sketched below.
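In outline, the two stages amount to two rounds of fine-tuning with a head swap in between. A hedged PyTorch sketch (the network choice, layer index, data loaders, and the finetune helper are all illustrative assumptions, not the Caffe configuration used in the experiments):

```python
import torch.nn as nn
from torchvision import models

# Stage 1: transfer ImageNet weights and fine-tune on the 13
# covert-related attributes (multi-label, sigmoid outputs).
net = models.alexnet(pretrained=True)
net.classifier[6] = nn.Linear(4096, 13)   # attribute head
finetune(net, attribute_loader,           # hypothetical helper/loader
         loss=nn.BCEWithLogitsLoss())     # sigmoid cross entropy

# Stage 2: keep the fine-tuned body, swap in a 2-way head, and
# fine-tune again on covert vs. non-covert classification.
net.classifier[6] = nn.Linear(4096, 2)
finetune(net, covert_loader, loss=nn.CrossEntropyLoss())
```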

Fig. 7

ROC curves of covert attribute classification by the DCNN-based architectures. a AlexNet. b GoogleNet. c VGGS

Fig. 8

Accuracies (1-EER) of the DCNN-based covert attribute classification algorithms and the hand-crafted method used in [7]

As stated in [7], certain visual attributes play an important role when humans judge the covertness of a photo, and 13 hand-engineered attributes are used there for covert photo classification. These 13 covert-related attributes, denoted A1, A2, ..., A13, are listed in Table 1. We first explore DCNN architectures for classifying these covert-related attributes, using the three architectures from the previous section (AlexNet, GoogleNet, and VGGS). Attribute classification is a multi-label task in which each input image has multiple binary labels. The networks are trained with a sigmoid cross-entropy loss layer in place of the softmax-with-loss layer, as illustrated below. The multi-label vector of each input image is fed to Caffe in HDF5 format: the label is a 13-dimensional binary vector over the attributes of Table 1, with 1 for a positive attribute and 0 otherwise. Figure 7 shows the ROC curves for the 13 covert-related attributes. Figure 8 shows the accuracies (1-EER) of attribute classification for the three DCNN architectures and for the hand-crafted method of [7], which exploits eight features jointly to estimate each visual attribute. Only 11 attributes are estimated in [7], as the last two, A12 ("blur") and A13 ("noise"), are defined directly on the input image by the blind image quality index (BIQI) detector [66]. The DCNN-based architectures achieve approximately equivalent results to the hand-crafted method for covert-related attribute classification.
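A small numpy illustration of how the 13-dimensional label vector and the sigmoid cross-entropy loss interact (the attribute indices in the example are arbitrary):

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    """logits: (13,) raw network outputs, one per attribute;
    labels: (13,) binary vector, 1 where the attribute is positive.
    Each attribute is treated as an independent binary decision."""
    p = 1.0 / (1.0 + np.exp(-logits))   # independent sigmoids
    eps = 1e-12                          # numerical safety
    return -np.mean(labels * np.log(p + eps)
                    + (1 - labels) * np.log(1 - p + eps))

# Example: an image positive for attributes A1 and A12 only.
labels = np.zeros(13)
labels[[0, 11]] = 1.0
loss = sigmoid_cross_entropy(np.random.randn(13), labels)
```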

In the second stage, the fine-tuned parameters from covert-related attribute classification are transferred to covert photo classification. Figure 9 shows the resulting ROC curves: all DCNN-based architectures with transferred parameters perform within close proximity of one another and surpass the hand-crafted method. To further enhance classification, we fuse the best models of AlexNet, GoogleNet, and VGGS (i.e., AlexNet-ImageNet-transfer, GoogleNet-ImageNet-transfer, and VGGS-attribute-transfer). Fusion is performed at the final softmax layers: the softmax scores of the individual networks are combined by a weighted-sum rule to produce the final fused score, with weights of 0.2, 0.3, and 0.5 for AlexNet, GoogleNet, and VGGS, respectively, as sketched below. The final fusion of the three DCNN-based architectures takes 41 ms per image (about 24 fps) with GPU acceleration.
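The fusion itself is a fixed weighted average of the per-network softmax scores; a minimal numpy sketch:

```python
import numpy as np

def fuse_scores(p_alex, p_google, p_vggs, weights=(0.2, 0.3, 0.5)):
    """Each p_* is an (N, 2) array of softmax scores for N test
    images; returns the weighted-sum fused scores."""
    w_a, w_g, w_v = weights
    return w_a * p_alex + w_g * p_google + w_v * p_vggs

# A photo is classified as covert when its fused covert score
# exceeds the operating threshold chosen on the ROC curve.
```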

Fig. 9

ROC curves of covert photo classification with parameters transferred from the ImageNet and attribute models. The ROC curve of the hand-crafted method in [7] is also shown for comparison

Table 2 Summary of experimental results for the different DCNN-based architectures

Finally, all experimental results are summarized in Table 2. Among the individual DCNN-based architectures explored, the VGGS-based architecture with parameters transferred from the auxiliary attribute model achieves the best result. The fusion of the three DCNN-based architectures further boosts classification performance, achieving an average classification rate (1-EER) of 0.925, which significantly outperforms the 0.8940 of the hand-crafted feature method.

5 Conclusion

Instead of relying on experience-dependent hand-crafted features, we have introduced DCNN-based architectures that automatically discover intricate structure and learn representations for covert photo classification. We have demonstrated that the performance of the DCNN-based architectures (AlexNet, GoogleNet, and VGGS) with parameters transferred from ImageNet pre-trained models significantly surpasses that of training from scratch. We have also investigated leveraging auxiliary attributes to improve the final covert photo classification performance through a two-stage parameter transfer method: the DCNN parameters pre-trained on ImageNet are first transferred to an intermediate attribute classification task, and the fine-tuned parameters from that task are then transferred to the final covert photo classification task. Experimental results show that all fully trained DCNN-based architectures perform within close proximity of one another and surpass the hand-crafted method in covert photo classification, and that the fusion of the three architectures further improves on the individual networks.