Abstract
Using deep learning to assist people in recognizing prohibited items in X-ray images is crucial for improving the quality of security inspections. However, such methods require large amounts of data, and data collection usually takes considerable time and effort. In this paper, we propose a method to synthesize X-ray images to support the training of prohibited-item detectors. The proposed framework is built on Generative Adversarial Networks (GANs) with multiple discriminators, aiming to synthesize realistic X-ray prohibited items while simultaneously learning the background context. In addition, a guided filter is introduced for detail preservation. The experimental results show that our model can smoothly synthesize prohibited items onto background images. To quantitatively evaluate our approach, we add the generated samples to the training data of the Single Shot MultiBox Detector (SSD) and show that the synthetic images improve the detector's performance.
1 Introduction
Baggage inspection with X-ray machines is a priority task that can reduce the risk of crime and terrorist attacks [1]. Security and safety screening with X-ray scanners has become an important process in the transportation industry and at border checkpoints [2]. However, inspection is a complex task, and the detection of prohibited items still relies mainly on human inspectors. Missed detections become unavoidable when a security inspector has been working for a long time, which creates security risks. Such tasks are therefore well suited to computer processing, freeing humans from this heavy workload.
With the advances of Convolutional Neural Networks (CNNs), intelligent security inspection is no longer out of reach [3]. However, most prohibited-item detection models require large numbers of images, and collecting them manually usually takes considerable time and effort. There are currently almost no public datasets of prohibited items available on the web. It is therefore important to design approaches that automatically synthesize images for building new datasets. Motivated by the recent success of GANs [4] in several applications [5,6,7], we propose a GAN-based model that synthesizes realistic prohibited-item images in real scenes and uses them as augmented data to train a CNN-based prohibited-item detector. We name it X-ray image Synthesis GAN (XS-GAN). Compared with a regular GAN, XS-GAN synthesizes images that are more realistic and retain more detail.
XS-GAN adopts the adversarial learning recipe and contains multiple discriminators: \(D_b\) for background context learning and \(D_p\) for discriminating prohibited items (taking the gun as an example), as shown in Fig. 1. We replace the prohibited items inside their bounding boxes with random noise and train the generator G to synthesize new prohibited items within the noise region. The discriminator \(D_b\) learns to discriminate between real and synthesized pairs, while the discriminator \(D_p\) learns to judge whether the synthetic prohibited item cropped from the bounding box is real or fake. \(D_b\) forces G to learn the background information, which leads to a smooth connection between the background and the synthetic prohibited items. To make G generate prohibited items with more realistic shapes and details, we introduce guided filters into the proposed XS-GAN. After training, the generator G can generate photo-realistic prohibited items in the noise box regions.
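The noise-box preparation described above can be sketched as follows. This is a minimal illustration, assuming images are NumPy arrays with values in [0, 1] and boxes are pixel coordinates; the helper name `make_noise_input` is ours, not from the paper:

```python
import numpy as np

def make_noise_input(image, box, seed=None):
    """Return a copy of `image` whose bounding-box region is replaced
    by uniform random noise; G is trained to in-paint a realistic
    prohibited item inside this region.

    image: H x W x C float array with values in [0, 1]
    box:   (x1, y1, x2, y2) pixel coordinates of the item
    """
    rng = np.random.default_rng(seed)
    x1, y1, x2, y2 = box
    noisy = image.copy()
    # Only the box interior is randomized; the background is untouched,
    # so D_b can compare it against the ground-truth pair.
    noisy[y1:y2, x1:x2, :] = rng.uniform(
        0.0, 1.0, size=(y2 - y1, x2 - x1, image.shape[2]))
    return noisy
```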
2 Related Work
2.1 Generative Adversarial Network
GANs [4] have achieved great success in generating realistic new images from either existing images or random noise. The main idea is continuing adversarial learning between a generator and a discriminator: the generator tries to generate ever more realistic images, while the discriminator aims to distinguish the newly generated images from real ones. Like a game, the process eventually reaches a state of balance in which the generated images are consistent with the distribution of the original images.
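Formally, this game corresponds to the standard minimax objective of [4], where G maps a noise sample z to an image and D outputs the probability that its input is real:

$$\min_G \max_D \; \mathbb{E}_{y \sim p_{data}}\big[\log D(y)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big].$$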
2.2 Image Synthesis with GAN
Image synthesis using GANs is generally based on image-to-image translation. Pix2pix-GAN [5] is the earliest image-to-image translation model based on the conditional GAN [8]. CycleGAN [6], DiscoGAN [9], and DualGAN [10] are similar in principle. CycleGAN replaces the traditional one-way GAN with a cycle-generating ring network and drops the traditional requirement of paired input images, so the model can be trained on any two sets of images. GAWWN [11] introduced a new synthesis method that can synthesize higher-resolution images given instructions describing what content to draw at which location. PS-GAN [7] proposed an algorithm that can smoothly synthesize pedestrians onto background images with different levels of detail.
2.3 Guided Filter
Guided filters [12, 13] use one image as a guide for filtering another and exhibit superior performance in detail-preserving filtering. The filtered output is a linear transformation of the guidance image, where the guidance can be the input image itself or a different image. Guided filtering has been used for a variety of computer vision tasks: [14] uses it for weighted averaging and image fusion; [15] uses a rolling guidance to fully control detail smoothing in an iterative manner; [16] uses it to suppress heavy noise and structural inconsistency; [17] formulates it as a non-convex optimization problem and proposes solutions via majorize-minimization.
Most GANs for image-to-image translation can synthesize high-resolution images, but the appearance transfer usually suppresses image details such as edges and textures. The proposed XS-GAN introduces a guided filter into the generator network, enabling both appearance transfer and detail retention.
3 The Proposed Method
Unlike a regular GAN, our method leverages an adversarial process between the generator G and two discriminators: \(D_b\) for background context learning and \(D_p\) for discriminating prohibited items. In this section, we give a detailed formulation of the overall objective.
3.1 Model Architecture
U-Net for Generator \(\varvec{G}.\) The generator G learns a mapping function G: \(x \rightarrow y\), where x is the input noise image and y is the ground-truth image. In this work, we adopt the enhanced encoder-decoder network (U-Net) [5] for G. It follows the main encoder-decoder structure: the input image x passes through a series of convolutional down-sampling layers until the bottleneck layer, which feeds the encoded representation of the original input to deconvolutional layers for up-sampling. U-Net adds skip connections between down-sampling and up-sampling layers at symmetric positions relative to the bottleneck, which preserves richer local information (Fig. 2).
\(\varvec{D_{p}}\) to Discriminate Fake/Real Prohibited Items. For the discriminator \(D_{p}\), we crop the synthetic prohibited item from the generated image as a negative sample, and the real prohibited item \({y_p}\) from the original image y as a positive sample. \({D_p}\) thus classifies whether the generated prohibited item in the noise region is real or fake. It forces G to learn the mapping from z to the real prohibited item \({y_p}\), where z is the noise region in the noise image x.
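A minimal sketch of how the positive and negative samples for \(D_p\) could be assembled (NumPy; the helper name is ours, not from the paper):

```python
import numpy as np

def dp_samples(generated, ground_truth, box):
    """Crop the same bounding box from G's output (negative sample)
    and from the ground-truth image (positive sample y_p) for D_p."""
    x1, y1, x2, y2 = box
    fake_item = generated[y1:y2, x1:x2]   # synthetic item -> label "fake"
    real_item = ground_truth[y1:y2, x1:x2]  # real item y_p -> label "real"
    return fake_item, real_item
```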
\(\varvec{{D_b}}\) to Learn Background Context. The goal of our model is not only to synthesize realistic prohibited items but also to blend them smoothly into the background, so the model needs to learn context information. Following the pair-training recipe of Pix2pix-GAN [5], \({D_b}\) classifies between real and synthetic pairs: the real pair is the noise image x with the ground-truth image y, while the synthetic pair is the noise image x with the generated image. The overall framework is shown in Fig. 3.
Guided Filter. The guided filter performs edge-preserving image smoothing by using the structure of a guidance image. We introduce it into the proposed XS-GAN and formulate detail preservation as a joint up-sampling problem. Specifically, the synthetic image output by G, which has lost image detail, is the input image I to be filtered, and the original input image acts as the guidance image R that provides edge and texture details. The detail-preserving image T can then be derived by minimizing the reconstruction error between I and T, subject to the linear model:
$$T_i = a_k I_i + b_k, \quad \forall i \in \omega_k,$$
where i is the index of the pixel and \(\omega _k\) is a local square window centered at pixel k.
To determine the coefficients \({a_k}\) and \({b_k}\) of the linear model, we seek a solution that minimizes the difference between the filter output and the guidance image R, which can be derived by minimizing the following cost function in the local window:
$$E(a_k, b_k) = \sum_{i \in \omega_k} \Big( (a_k I_i + b_k - R_i)^2 + \epsilon\, a_k^2 \Big),$$
where \({a_k}{I_i} + {b_k}\) is the filter output. Since the output combines the characteristics of the guidance image and the input image, \({\left( {{a_k}{I_i} + {b_k} - {R_i}} \right) ^2}\) measures how closely the output matches the guidance, and \(\epsilon\) is a regularization parameter that prevents \({a_k}\) from becoming too large. The cost can be minimized by linear regression:
$$a_k = \frac{\frac{1}{|\omega|}\sum_{i \in \omega_k} I_i R_i - \mu_k \bar R_k}{\sigma_k^2 + \epsilon}, \qquad b_k = \bar R_k - a_k \mu_k,$$
where \({\mu _k}\) and \(\sigma _k^2\) are the mean and variance of I in \({\omega _k}\), \(\left| \omega \right| \) is the number of pixels in \({\omega _k}\), and \({\bar{R}_k} = \frac{1}{{\left| \omega \right| }}\sum \nolimits _{i \in {\omega _k}} {R_i} \) is the mean of R in \({\omega _k}\).
By applying the linear model to every window \({\omega _k}\) on the image and computing \(\left( {{a_k},{b_k}} \right) \), the filter output is obtained by averaging all possible values of \(T_i\):
$$T_i = \bar a_i I_i + \bar b_i,$$
where \({\bar{a}_i} = \frac{1}{{\left| \omega \right| }}\sum \nolimits _{k \in {\omega _i}} {{a_k}} \) and \({\bar{b}_i} = \frac{1}{{\left| \omega \right| }}\sum \nolimits _{k \in {\omega _i}} {{b_k}}\). We integrate the guided filter into the generator network to obtain an end-to-end trainable system.
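As a concrete illustration, the filtering steps above can be prototyped in NumPy. The sketch follows the paper's notation (I is G's output to be filtered, R the guidance); the window radius `r` and `eps` values are illustrative, and the integral-image box filter is a standard O(N) implementation, not part of the paper:

```python
import numpy as np

def box_mean(a, r):
    """Mean over a (2r+1) x (2r+1) window, edge-aware, via integral images."""
    H, W = a.shape
    s = np.cumsum(np.cumsum(np.pad(a, ((1, 0), (1, 0))), axis=0), axis=1)
    y1 = np.clip(np.arange(H) - r, 0, H); y2 = np.clip(np.arange(H) + r + 1, 0, H)
    x1 = np.clip(np.arange(W) - r, 0, W); x2 = np.clip(np.arange(W) + r + 1, 0, W)
    sums = s[y2][:, x2] - s[y1][:, x2] - s[y2][:, x1] + s[y1][:, x1]
    counts = (y2 - y1)[:, None] * (x2 - x1)[None, :]
    return sums / counts

def guided_filter(I, R, r=4, eps=1e-3):
    """Guided filter in the paper's notation: I is the (detail-poor) image
    to filter, R the guidance image providing edges and textures."""
    mean_I = box_mean(I, r)
    mean_R = box_mean(R, r)
    cov_IR = box_mean(I * R, r) - mean_I * mean_R   # 1/|w| sum I_i R_i - mu_k R_bar_k
    var_I = box_mean(I * I, r) - mean_I ** 2        # sigma_k^2
    a = cov_IR / (var_I + eps)                      # a_k
    b = mean_R - a * mean_I                         # b_k
    # Average a_k, b_k over all windows covering each pixel: T_i = a_bar I_i + b_bar
    return box_mean(a, r) * I + box_mean(b, r)
```

With a tiny `eps` and `R = I`, a sharp step edge passes through nearly unchanged, which is the detail-preserving behavior the paper relies on.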
3.2 Loss Function
As shown in Fig. 1, the model includes two adversarial learning processes, \(G \Leftrightarrow {D_b}\) and \(G \Leftrightarrow {D_p}\). The adversarial learning between G and \(D_b\) can be formulated as:
$$\mathcal{L}_{adv}(G, D_b) = \mathbb{E}_{x,y}\big[(D_b(x, y) - 1)^2\big] + \mathbb{E}_{x}\big[D_b(x, G(x))^2\big],$$
where x is the noise image and y is the ground-truth image. The original GAN loss is replaced here with the least-squares loss of LSGAN.
To encourage G to generate realistic prohibited items within the noise box z of the input image x, another adversarial loss is added between G and \({D_p}\):
$$\mathcal{L}_{adv}(G, D_p) = \mathbb{E}_{y_p}\big[\log D_p(y_p)\big] + \mathbb{E}_{z}\big[\log\big(1 - D_p(G(z))\big)\big],$$
where z is the noise box in x and \({y_p}\) is the prohibited item cropped from the ground-truth image y. Negative log-likelihood objectives are used to update the parameters of G and \(D_p\).
GAN training can benefit from traditional losses [5]. In this paper, an L1 loss is used to control the difference between the generated image and the real image y:
$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y}\big[\, \| y - G(x) \|_1 \big].$$
Finally, combining the previously defined losses results in a final loss function:
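A sketch of how the terms might be combined. The weight `lam` on the L1 term is hypothetical (Pix2pix uses λ = 100, but the paper does not state its value), and the adversarial terms are passed in as precomputed scalars:

```python
import numpy as np

def l1_loss(generated, target):
    """Mean absolute difference between G(x) and the ground truth y."""
    return np.mean(np.abs(generated - target))

def total_loss(adv_b, adv_p, generated, target, lam=100.0):
    """Weighted sum of the two adversarial terms and the L1 term.
    lam is an assumed hyperparameter, not taken from the paper."""
    return adv_b + adv_p + lam * l1_loss(generated, target)
```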
4 Experimental Results
4.1 Datasets
The datasets used in our experiments were collected in our laboratory. We experiment with several types of prohibited items, such as guns, fruit knives, forks, hammers, and scissors.
4.2 Contrast Experiment
In this section, we conduct several synthesis experiments on prohibited items and evaluate the synthesized images. The experimental results are shown in Fig. 4.
As can be seen from Fig. 4, our XS-GAN model with guided filtering performs better at synthesizing prohibited items in security images. The Pix2pix-GAN model can hardly generate the prohibited items at all. PS-GAN can generate prohibited items, but the synthesized images are not clear enough. The images generated by our improved XS-GAN are not only clearer but also retain more details.
To evaluate the quality of the synthetic images, we compute the Fréchet Inception Distance (FID). The smaller the FID, the closer the synthetic images are to the real images. The results are shown in Table 1.
As shown in Table 1, the images synthesized by XS-GAN have the lowest FID, which indicates that they are closest to the real images.
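For reference, FID models the real and synthetic Inception features as Gaussians and computes \(\|\mu_1-\mu_2\|^2 + \mathrm{Tr}(\Sigma_1+\Sigma_2-2(\Sigma_1\Sigma_2)^{1/2})\). A NumPy sketch, assuming the feature means and covariances have already been estimated (the symmetrized matrix square root avoids SciPy's `sqrtm`):

```python
import numpy as np

def sqrtm_psd(M):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    w = np.clip(w, 0.0, None)  # guard against tiny negative eigenvalues
    return (V * np.sqrt(w)) @ V.T

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet Inception Distance between two Gaussian feature distributions."""
    s1h = sqrtm_psd(sigma1)
    # Tr sqrt(S1 S2) = Tr sqrt(S1^{1/2} S2 S1^{1/2}), which is symmetric PSD.
    covmean = sqrtm_psd(s1h @ sigma2 @ s1h)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```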
To analyze the effect of data augmentation, we combine real and synthesized data to train SSD [18] detectors and evaluate their performance. We experiment with images of three prohibited items: pistols, forks, and scissors. In the first experiment we use only real images for training; in the second, only synthesized images; and in the third, half real and half synthesized images. We train SSD with synthetic images from PS-GAN and from XS-GAN separately. The evaluation results are shown in Table 2.
Table 2 shows that training the detector with synthetic images from XS-GAN improves mAP by \(8\%\), and training with the mixed images improves it by \(10\%\). In contrast, training with synthetic images from PS-GAN improves mAP by \(3\%\), and training with the corresponding mixed images by \(5\%\). Adding synthetic images therefore improves detection accuracy, and the images synthesized by our method provide the stronger data augmentation effect.
5 Conclusion
This paper introduces the XS-GAN model, which synthesizes realistic X-ray images within given bounding boxes. The experimental results show that the network with guided filtering retains more details when synthesizing images. Our model can generate high-quality prohibited-item images, and the synthetic images effectively improve the performance of CNN-based detectors. We use the model to synthesize images of different prohibited items, demonstrating its ability to generalize and transfer knowledge. We will continue to study XS-GAN for prohibited-item image synthesis to train better detection models.
References
Mery, D., Svec, E., Arias, M., Riffo, V., Saavedra, J.M., Banerjee, S.: Modern computer vision techniques for x-ray testing in baggage inspection. IEEE Trans. Syst. Man Cybern. Syst. 47(4), 682–692 (2016)
Mendes, M., Schwaninger, A., Michel, S.: Does the application of virtually merged images influence the effectiveness of computer-based training in x-ray screening? In: 2011 Carnahan Conference on Security Technology, pp. 1–8. IEEE (2011)
Rogers, T.W., Jaccard, N., Griffin, L.D.: A deep learning framework for the automated inspection of complex dual-energy x-ray cargo imagery. In: Anomaly Detection and Imaging with X-Rays (ADIX) II, vol. 10187. International Society for Optics and Photonics, 101870L (2017)
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134 (2017)
Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232 (2017)
Ouyang, X., Cheng, Y., Jiang, Y., Li, C.L., Zhou, P.: Pedestrian-synthesis-gan: Generating pedestrian data in real scene and beyond. arXiv preprint arXiv:1804.02047 (2018)
Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
Kim, T., Cha, M., Kim, H., Lee, J.K., Kim, J.: Learning to discover cross-domain relations with generative adversarial networks. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 1857–1865. JMLR.org (2017)
Yi, Z., Zhang, H., Tan, P., Gong, M.: Dualgan: Unsupervised dual learning for image-to-image translation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2849–2857 (2017)
Reed, S.E., Akata, Z., Mohan, S., Tenka, S., Schiele, B., Lee, H.: Learning what and where to draw. In: Advances in Neural Information Processing Systems, pp. 217–225 (2016)
He, K., Sun, J., Tang, X.: Guided image filtering. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6311, pp. 1–14. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15549-9_1
He, K., Sun, J.: Fast guided filter. arXiv preprint arXiv:1505.00996 (2015)
Li, S., Kang, X., Hu, J.: Image fusion with guided filtering. IEEE Trans. Image Process. 22(7), 2864–2875 (2013)
Zhang, Q., Shen, X., Xu, L., Jia, J.: Rolling guidance filter. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8691, pp. 815–830. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10578-9_53
Liu, W., Chen, X., Shen, C., Yu, J., Wu, Q., Yang, J.: Robust guided image filtering. arXiv preprint arXiv:1703.09379 (2017)
Ham, B., Cho, M., Ponce, J.: Robust guided image filtering using nonconvex potentials. IEEE Trans. Pattern Anal. Mach. Intell. 40(1), 192–207 (2018)
Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
Zhao, T., Zhang, H., Zhang, Y., Yang, J. (2019). X-Ray Image with Prohibited Items Synthesis Based on Generative Adversarial Network. In: Sun, Z., He, R., Feng, J., Shan, S., Guo, Z. (eds) Biometric Recognition. CCBR 2019. Lecture Notes in Computer Science(), vol 11818. Springer, Cham. https://doi.org/10.1007/978-3-030-31456-9_42
Print ISBN: 978-3-030-31455-2
Online ISBN: 978-3-030-31456-9