
1 Introduction

Baggage inspection with X-ray machines is a priority task that can reduce the risk of crime and terrorist attacks [1]. Security and safety screening with X-ray scanners has become an important process in the transportation industry and at border checkpoints [2]. However, inspection is a complex task, and the detection of prohibited items still relies mainly on human operators. When a security inspector has worked for a long time, missed detections become unavoidable, which causes security risks. Therefore, this type of task is better suited to computer processing, freeing humans from this heavy work.

With the advances of Convolutional Neural Networks (CNNs), the realization of intelligent security inspection is no longer out of reach [3]. However, most prohibited item detection models require large numbers of images, and manually collecting images usually takes much time and effort. There are currently almost no public datasets containing prohibited items on the web. Therefore, it is important to design approaches that automatically synthesize images for extending new datasets. Motivated by the recent promising success of GANs [4] in several applications [5, 6, 7], we propose to build a GAN-based model to synthesize realistic prohibited item images in real scenes and utilize them as augmented data to train a CNN-based prohibited item detector. We name it the X-ray image Synthesis GAN (XS-GAN). Compared with a regular GAN, XS-GAN produces synthetic images that are more realistic and retain more details.

Fig. 1. The XS-GAN model.

XS-GAN adopts the adversarial learning recipe and contains multiple discriminators: \(D_b\) for background context learning and \(D_p\) for prohibited item discrimination (taking the gun as an example), as shown in Fig. 1. We replace the prohibited items within the bounding boxes with random noise and train the generator G to synthesize new prohibited items within the noise region. The discriminator \(D_b\) learns to discriminate between real and synthesized pairs, while the discriminator \(D_p\) learns to judge whether the synthetic prohibited item cropped from the bounding box is real or fake. \(D_b\) forces G to learn the background information, which leads to a smooth connection between the background and the synthetic prohibited items. In order to make G generate prohibited items with more realistic shapes and details, we introduce guided filters into the proposed XS-GAN. After training, the generator G can generate photo-realistic prohibited items in the noise box regions.
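For illustration, a minimal sketch of this input construction is shown below; the function name, the uniform-noise choice, and the NumPy-based implementation are illustrative assumptions rather than details specified by the model description.

```python
import numpy as np

def make_noise_input(image, box, rng=None):
    """Replace the prohibited-item bounding box with random noise.

    image: H x W x C float array in [0, 1] (the ground truth image y).
    box:   (x1, y1, x2, y2) pixel coordinates of the prohibited item.
    Returns the noise image x that is fed to the generator G.
    """
    rng = np.random.default_rng() if rng is None else rng
    x1, y1, x2, y2 = box
    noisy = image.copy()
    # Fill the annotated region with noise; G must synthesize a realistic item here.
    noisy[y1:y2, x1:x2, :] = rng.uniform(0.0, 1.0, size=noisy[y1:y2, x1:x2, :].shape)
    return noisy
```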

2 Related Work

2.1 Generative Adversarial Network

GANs [4] have achieved great success in generating realistic new images from either existing images or random noise. The main idea is a continuing adversarial game between a generator and a discriminator, where the generator tries to generate more realistic images while the discriminator aims to distinguish the newly generated images from real images. Training eventually reaches an equilibrium at which the generated images are consistent with the distribution of the original images.

2.2 Image Synthesis with GAN

Work on image synthesis using GANs is generally based on image-to-image translation. Pix2pix-GAN [5] is the earliest image-to-image translation model based on the conditional GAN [8]. CycleGAN [6], DiscoGAN [9], and DualGAN [10] are similar in principle. CycleGAN replaces the traditional one-way GAN with a loop-generated ring network and removes the traditional requirement of paired input images; therefore, the model can take any two images as input. GAWWN [11] introduced a new synthesis method that can synthesize higher-resolution images given instructions describing what content to draw in which location. PS-GAN [7] proposed an algorithm that can smoothly synthesize pedestrians onto background images of varying levels of detail.

2.3 Guided Filter

The guided filter [12, 13] uses one image as a guide for filtering another image and exhibits superior performance in detail-preserving filtering. The filtered output is a linear transformation of the guidance image, where the guidance image can be the input image itself or another, different image. Guided filtering has been used for a variety of computer vision tasks. [14] uses the guided filter for weighted averaging and image fusion. [15] uses rolling guidance to fully control detail smoothing in an iterative manner. [16] uses guided filtering to suppress heavy noise and structural inconsistency. [17] formulates guided filtering as a non-convex optimization problem and proposes solutions via majorize-minimization.

Most GANs for image-to-image translation can synthesize high-resolution images, but the appearance transfer usually suppresses image details such as edges and textures. The proposed XS-GAN introduces a guided filter into the generator network, which enables both appearance transfer and detail retention.

3 The Proposed Method

Unlike the regular GAN, our method leverages an adversarial process between the generator G and two discriminators: \(D_b\) for background context learning and \(D_p\) for discriminating prohibited items. In this section, we will give a detailed formulation of the overall objective.

3.1 Model Architecture

U-Net for Generator \(\varvec{G}.\) The generator G learns a mapping function G: \(x \rightarrow y\), where x is the input noise image and y is the ground truth image. In this work, we adopt the enhanced encoder-decoder network (U-Net) [5] for G. It follows the main structure of the encoder-decoder architecture, where the input image x is passed through a series of convolutional layers as down-sampling layers until the bottleneck layer. The bottleneck layer then feeds the encoded information of the original input to the deconvolutional layers to be up-sampled. U-Net adds skip connections between down-sampling and up-sampling layers at symmetric locations relative to the bottleneck layer, which preserves richer local information (Fig. 2).

Fig. 2. The U-Net structure of the Generator G.
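A minimal PyTorch sketch of such a U-Net generator is given below; the number of levels, channel widths, and normalization layers are illustrative assumptions and not the exact configuration of our network.

```python
import torch
import torch.nn as nn

def down(cin, cout):  # encoder block: stride-2 convolution halves the spatial size
    return nn.Sequential(nn.Conv2d(cin, cout, 4, 2, 1), nn.BatchNorm2d(cout), nn.LeakyReLU(0.2))

def up(cin, cout):    # decoder block: stride-2 transposed convolution doubles it
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, 2, 1), nn.BatchNorm2d(cout), nn.ReLU())

class UNetGenerator(nn.Module):
    """Simplified 4-level U-Net: encoder-decoder with skip connections."""
    def __init__(self, in_ch=3, out_ch=3, nf=64):
        super().__init__()
        self.e1, self.e2, self.e3 = down(in_ch, nf), down(nf, nf * 2), down(nf * 2, nf * 4)
        self.e4 = down(nf * 4, nf * 8)                       # bottleneck
        self.d1, self.d2, self.d3 = up(nf * 8, nf * 4), up(nf * 8, nf * 2), up(nf * 4, nf)
        self.out = nn.Sequential(nn.ConvTranspose2d(nf * 2, out_ch, 4, 2, 1), nn.Tanh())

    def forward(self, x):
        s1 = self.e1(x); s2 = self.e2(s1); s3 = self.e3(s2)  # down-sampling path
        b = self.e4(s3)
        d1 = self.d1(b)
        d2 = self.d2(torch.cat([d1, s3], 1))                 # skip connection to symmetric layer
        d3 = self.d3(torch.cat([d2, s2], 1))
        return self.out(torch.cat([d3, s1], 1))
```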

\(\varvec{D_{p}}\) to Discriminate Fake/Real Prohibited Items. For the discriminator \(D_{p}\), we crop the synthetic prohibited item from the generated image as a negative sample, while the real prohibited item \({y_p}\) cropped from the original image y serves as a positive sample. Therefore, \({D_p}\) is used to classify whether the prohibited item generated in the noise region is real or fake. It forces G to learn the mapping from z to the real prohibited item \({y_p}\), where z is the noise region in the noise image x.

\(\varvec{{D_b}}\) to Learn Background Context. The goal of our model is not only to synthesize realistic prohibited items but also to blend the synthetic prohibited items smoothly into the background. Thus our model needs to learn context information. Following the pair-training recipe of Pix2pix-GAN [5], \({D_b}\) is used to classify between real and synthesized pairs. The real pair is the noise image x and the ground truth image y, while the synthesized pair is the noise image x and the generated image. The overall framework is shown in Fig. 3.

Fig. 3. The overall structure of the discriminator.
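A minimal sketch of the two discriminators is given below; the PatchGAN-style architecture, the channel-concatenated pair input to \(D_b\), and the helper names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Small PatchGAN-style convolutional discriminator, reused for D_b and D_p."""
    def __init__(self, in_ch, nf=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, nf, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(nf, nf * 2, 4, 2, 1), nn.BatchNorm2d(nf * 2), nn.LeakyReLU(0.2),
            nn.Conv2d(nf * 2, nf * 4, 4, 2, 1), nn.BatchNorm2d(nf * 4), nn.LeakyReLU(0.2),
            nn.Conv2d(nf * 4, 1, 4, 1, 1),   # per-patch real/fake score map
        )

    def forward(self, x):
        return self.net(x)

# D_b judges (noise image, output) pairs, concatenated along the channel axis;
# D_p judges the cropped prohibited-item patch alone.
D_b = PatchDiscriminator(in_ch=6)   # 3 channels of x plus 3 channels of y or G(x, z)
D_p = PatchDiscriminator(in_ch=3)

def crop_box(img, box):
    """Crop the prohibited-item region (x1, y1, x2, y2) from an NCHW tensor."""
    x1, y1, x2, y2 = box
    return img[:, :, y1:y2, x1:x2]
```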

Guided Filter. The guided filter is designed to perform edge-preserving image smoothing by using the structure of a guidance image. We introduce the guided filter into the proposed XS-GAN and formulate detail preserving as a joint up-sampling problem. In particular, the synthetic image output by G, which suffers from detail loss, serves as the input image I to be filtered, and the original input image acts as the guidance image R that provides edge and texture details. Therefore, the detail-preserving image T can be derived by minimizing the reconstruction error between T and R, subject to the linear model:

$$\begin{aligned} \ {T_i} = {a_k}{I_i} + {b_k},\forall i \in {\omega _k} \; \end{aligned}$$
(1)

where i is the index of the pixel and \(\omega _k\) is a local square window centered at pixel k.

In order to determine the coefficients of the linear model, \({a_k}\) and \({b_k}\), we seek a solution that minimizes the difference between T and the guidance image R, which can be derived by minimizing the following cost function in the local window:

$$\begin{aligned} \ E\left( {{a_k},{b_k}} \right) = \sum \limits _{i \in {\omega _k}} {\left( {{{\left( {{a_k}{I_i} + {b_k} - {R_i}} \right) }^2} + \epsilon a_k^2} \right) } \; \end{aligned}$$
(2)

here \({a_k}{I_i} + {b_k}\) represents the output of the filter. Since the output of the filter combines the characteristics of the guidance image and the input image, \({\left( {{a_k}{I_i} + {b_k} - {R_i}} \right) ^2}\) is used here to measure the similarity between the filter output and R, and \(\epsilon\) is a regularization parameter that prevents \({a_k}\) from becoming too large. The cost can be minimized by linear regression:

$$\begin{aligned} \ {a_k} = \frac{{\frac{1}{{\left| \omega \right| }}\sum \nolimits _{i \in {\omega _k}} {{I_i}{R_i}} - {\mu _k}{{\bar{R}}_k} }}{{\sigma _k^2 + \epsilon }} \; \end{aligned}$$
(3)
$$\begin{aligned} \ {b_k} = {\bar{R}_k} - {a_k}{\mu _k}\; \end{aligned}$$
(4)

where \({\mu _k}\) and \(\sigma _k^2\) are the mean and variance of I in \({\omega _k}\), \(\left| \omega \right| \) is the number of pixels in \({\omega _k}\), and \({\bar{R}_k} = \frac{1}{{\left| \omega \right| }}\sum \nolimits _{i \in {\omega _k}} {R_i} \) is the mean of R in \({\omega _k}\).

By applying a linear model to all \({\omega _k}\) windows on the image and calculating \(\left( {{a_k},{b_k}} \right) \), the filter output can be derived by averaging all possible values of \(T_i\):

$$\begin{aligned} \ {T_i} = \frac{1}{{\left| \omega \right| }}\sum \limits _{k:i \in {\omega _k}} {\left( {{a_k}{I_i} + {b_k}} \right) } = {\bar{a}_i}{I_i} + {\bar{b}_i} \; \end{aligned}$$
(5)

where \({\bar{a}_i} = \frac{1}{{\left| \omega \right| }}\sum \nolimits _{k \in {\omega _i}} {{a_k}} \) and \({\bar{b}_i} = \frac{1}{{\left| \omega \right| }}\sum \nolimits _{k \in {\omega _i}} {{b_k}}\). We integrate the guided filter into the generator network structure to implement an end-to-end trainable system.
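For reference, a minimal NumPy/SciPy sketch of the guided filter defined by Eqs. (1)–(5) is given below. It operates on 2-D grayscale arrays for clarity; the in-network version uses the same box-filter operations realized with differentiable layers.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(I, R, radius=8, eps=1e-3):
    """Grayscale guided filter following Eqs. (1)-(5).

    I:   2-D float array the linear model is applied to (generator output).
    R:   2-D float array providing the edge and texture details to match.
    Returns the detail-preserving output T.
    """
    size = 2 * radius + 1
    mean = lambda img: uniform_filter(img, size)   # box mean over each window w_k

    mu_I, mean_R = mean(I), mean(R)                # mu_k and R-bar_k
    cov_IR = mean(I * R) - mu_I * mean_R           # (1/|w|) sum I_i R_i - mu_k R-bar_k
    var_I = mean(I * I) - mu_I * mu_I              # sigma_k^2

    a = cov_IR / (var_I + eps)                     # Eq. (3)
    b = mean_R - a * mu_I                          # Eq. (4)
    return mean(a) * I + mean(b)                   # Eq. (5): a-bar_i * I_i + b-bar_i
```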

3.2 Loss Function

As shown in Fig. 1, this model includes two adversarial learning processes \(G \Leftrightarrow {D_b}\) and \(G \Leftrightarrow {D_p}\). The adversarial learning between G and \(D_b\) can be formulated as:

$$\begin{aligned} \ \begin{array}{c} {\mathcal{L}_{LSGAN}}\left( {G,{D_b}} \right) = {E_{y \sim {p_{gt \cdot image}}\left( y \right) }}\left[ {{{\left( {{D_b}\left( y \right) - 1} \right) }^2}} \right] \\ + {E_{x,z \sim {p_{noise \cdot image}}\left( {x,z} \right) }}\left[ {{{\left( {{D_b}\left( {G\left( {x,z} \right) } \right) } \right) }^2}} \right] \end{array} \; \end{aligned}$$
(6)

where x is the image with noise and y is the ground truth image. The original GAN loss is replaced here with the least-squares loss of LSGAN.

To encourage G to generate realistic prohibited items within the noise box z of the input image x, another adversarial loss is added between G and \({D_p}\):

$$\begin{aligned} \ \begin{array}{c} {\mathcal{L}_{GAN}}\left( {G,{D_p}} \right) = {E_{{y_p} \sim {p_{prohibited items}}\left( {{y_p}} \right) }}\left[ {\log {D_p}\left( {{y_p}} \right) } \right] \\ + {E_{z \sim {p_{noise}}\left( z \right) }}\left[ {\log \left( {1 - {D_p}\left( {G\left( z \right) } \right) } \right) } \right] \end{array}\; \end{aligned}$$
(7)

where z is the noise box in x and \({y_p}\) is the prohibited item cropped from the ground truth image y. The negative log-likelihood objectives are used to update the parameters of G and D.

GAN training can benefit from traditional losses [5]. In this paper, the \(\ell _1\) loss is used to control the difference between the generated image and the real image y:

$$\begin{aligned} \ {\mathcal{L}_{{\ell _1}}}\left( G \right) = {E_{x,z \sim {p_{noise \cdot image}}\left( {x,z} \right) ,y \sim {p_{gt \cdot image}}\left( y \right) }}\left[ {{{\left\| {y - G\left( {x,z} \right) } \right\| }_1}} \right] \; \end{aligned}$$
(8)

Finally, combining the previously defined losses results in a final loss function:

$$\begin{aligned} \ \mathcal{L}\left( {G,{D_b},{D_p}} \right) = {\mathcal{L}_{LSGAN}}\left( {G,{D_b}} \right) + {\mathcal{L}_{GAN}}\left( {G,{D_p}} \right) + \lambda {\mathcal{L}_{{\ell _1}}}\left( G \right) \; \end{aligned}$$
(9)
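A minimal sketch of the generator-side objective corresponding to Eq. (9) is shown below; the non-saturating form of the \(D_p\) term, the weight value, and the helper names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def generator_loss(D_b, D_p, x, y, fake, fake_patch, lam=100.0):
    """Generator side of Eq. (9): LSGAN term vs. D_b, GAN term vs. D_p, weighted L1 term."""
    pair_fake = torch.cat([x, fake], dim=1)        # synthesized pair (x, G(x, z)) for D_b
    loss_b = ((D_b(pair_fake) - 1) ** 2).mean()    # Eq. (6), generator side of the LSGAN loss
    logits_p = D_p(fake_patch)                     # score for the synthesized item crop
    loss_p = F.binary_cross_entropy_with_logits(   # Eq. (7) in its non-saturating form
        logits_p, torch.ones_like(logits_p))
    loss_l1 = F.l1_loss(fake, y)                   # Eq. (8)
    return loss_b + loss_p + lam * loss_l1         # Eq. (9) with weight lambda on the L1 term
```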

4 Experimental Results

4.1 Datasets

The datasets used in our experiments were collected in our laboratory. We experiment with several types of prohibited items, such as guns, fruit knives, forks, hammers, and scissors.

4.2 Comparison Experiments

In this section, we conduct several synthesis experiments on prohibited items and evaluate the synthesized images. The experimental results are shown in Fig. 4.

Fig. 4. Columns 1–2 are the input images; columns 3–5 show the images synthesized by pix2pix-GAN, PS-GAN, and XS-GAN.

As can be seen from Fig. 4, our XS-GAN model with guided filtering has a better effect on the synthesis of prohibited items in security images. The pix2pix-GAN model can hardly generate the prohibited items. PS-GAN can generate prohibited items, but the synthesized images are not clear enough. The images generated by our improved XS-GAN model are not only clearer but also retain more details.

In order to evaluate the quality of the synthetic images, we compute the Fréchet Inception Distance (FID) score. The smaller the FID value, the closer the synthetic images are to the real images. The test results are shown in Table 1.

Table 1. FID Score Test.

As shown in Table 1, the XS-GAN synthesized images have the lowest FID score, which shows that the images synthesized by our method are closer to the real images.
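For reference, given Inception activations of the real and synthesized image sets, the FID can be computed as in the sketch below; the extraction of the activations (e.g. from a pre-trained Inception-v3 pooling layer) is assumed and not shown.

```python
import numpy as np
from scipy import linalg

def fid_score(act_real, act_fake):
    """FID between two activation sets (N x D arrays):
    ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2))."""
    mu_r, mu_f = act_real.mean(axis=0), act_fake.mean(axis=0)
    C_r = np.cov(act_real, rowvar=False)
    C_f = np.cov(act_fake, rowvar=False)
    covmean = linalg.sqrtm(C_r @ C_f)
    if np.iscomplexobj(covmean):       # discard tiny imaginary parts from numerical noise
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(C_r + C_f - 2.0 * covmean))
```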

To analyze the effect of the data augmentation, we combine the real and synthesized data to train SSD [18] detectors and evaluate their performance. We experimented with images of three prohibited items: pistols, forks, and scissors. In the first experiment we use all the real images for training; in the second, all the synthesized images; in the third, half of the real images and half of the synthesized images. We train SSD with synthetic images from PS-GAN and from XS-GAN separately. The evaluation results are shown in Table 2.

Table 2. Results of the SSD algorithm evaluation. Experimental evaluations were performed using real data sets, synthetic data sets, and mixed data sets, respectively.

Table 2 shows that the detector trained with synthetic images from XS-GAN improves mAP by \(8\%\), and the detector trained with the mixed images improves mAP by \(10\%\). In contrast, the detector trained with synthetic images from PS-GAN improves mAP by \(3\%\), and with the mixed images by \(5\%\). Thus, adding synthetic images improves the detection accuracy, and the images synthesized by our method provide a stronger data augmentation effect.

5 Conclusion

This paper introduces the XS-GAN model to synthesize realistic X-ray images within given bounding boxes. The experimental results show that the network model with guided filtering retains more details when synthesizing images. Our model can generate high-quality prohibited item images, and the synthetic images can effectively improve the performance of CNN-based detectors. We use this model to synthesize images of different prohibited items, which demonstrates its ability to generalize and transfer knowledge. We will continue to study XS-GAN for prohibited item image synthesis to train better detection models.