1 Introduction

Prohibited item detection from X-ray images automatically searches passenger packages for prohibited items, thereby helping to suppress terrorist and criminal incidents. Compared with other nondestructive detection methods (such as ultrasound, overfrequency imaging, and thermal imaging), this technique offers excellent recognition, clarity, and visualization. Therefore, intelligent prohibited item detection based on X-ray images has long been a popular research area in the multimedia field.

Recently, deep learning, especially deep convolutional neural networks [3, 7, 13, 29], has been successfully applied to prohibited item detection. However, occlusion in X-ray baggage inspection differs from that in conventional object detection: occluded parts are totally invisible in conventional object detection scenarios, whereas occluded items can still be observed in X-ray images. Examples are illustrated in Fig. 1. The appearance of an item in an X-ray image depends not only on the specific item but also on the interacting items. To solve the occlusion problem in prohibited item detection from X-ray images, several approaches [7, 29] use edge information to enhance the model's discrimination capacity. However, the gradient information introduces too much noise, which causes high uncertainty during inference. In addition, models for extracting semantic edge features require supervised learning, and the labels needed for such learning can be obtained only through a complex and tedious annotation process. Therefore, there is an urgent need for a detection method that can exploit semantic information to detect partially visible prohibited items without a cumbersome labeling process.

Fig. 1

Illustration of the occlusion difference between prohibited item detection from X-ray images and general object detection. Occluded pixels are unobserved in general object detection, whereas prohibited items in X-ray images remain partially visible even when they overlap with other items

We argue that three factors may help to uncover an occluded prohibited item in an X-ray image: (1) multiscale analysis, since the other visible parts of an item provide valuable knowledge for learning; (2) the characteristics that distinguish the object from other objects, since the object is discovered only when these characteristics are found; and (3) multitask learning, which utilizes the association between tasks, e.g., segmentation and detection, to assist the inference; however, since pixelwise labeling requires complicated annotation, multitask learning without this tedious labeling process is particularly attractive. Thus, we design an approach that combines these factors to detect partially observed prohibited items.

We propose a method for detecting prohibited objects based on X-ray images, as shown in Fig. 2. The scale interaction module (SIM) extracts features through the encoder by simultaneously exploring information at multiple scales. Then, the cross-image analysis module (CAM) uses the coattention mechanism to discriminate the semantics by using object images from the same or different classes, which provides weakly supervised information for localization. Finally, the multitask learning module (MLM) simultaneously learns the localization and segmentation branches, in which the segmentation branch is learned in a weakly supervised manner to alleviate the annotation effort.

Fig. 2

Illustration of the framework. SIM extracts features through the encoder by simultaneously exploring information at multiple scales. Then, the CAM uses the coattention mechanism to discriminate the semantics by using object images from the same or different classes, which provides weakly supervised information for localization. Finally, the MLM simultaneously obtains the localization and segmentation outputs

The novelty of this work is exemplified by the following:

  • The context information is explored by incorporating features from neighboring scales to improve the discrimination capabilities.

  • Cross-image semantics are introduced to further extract high-level knowledge by using two different coattentions.

  • A weakly supervised method is proposed to learn the segmentation branch in an MLM.

The experimental results on the security inspection X-ray (SIXray) dataset [13], the occluded prohibited items X-ray (OPIXray) dataset [29], and the HIXray dataset [16] show that our approach outperforms other state-of-the-art prohibited item detection approaches by margins of 1.47%, 1.45%, and 1.11% in mean average precision (mAP), respectively.

The rest of this paper is organized as follows. Section 2 provides an overview of recent work on prohibited item detection and general object detection. Section 3 introduces our approach. Section 4 discusses the experimental results. Conclusions are reported in Sect. 5.

2 Related work

In this section, we briefly review the related research on prohibited item detection and general object detection.

2.1 General object detection

Deep learning has made great achievements in the field of general object detection [19, 22]. Depending on whether a candidate anchor generation stage is included, these methods can be divided into two categories: (1) Single-stage approaches, such as the scaled you only look once version 4 (S-YOLOv4) [27], directly regress the object location and category from all candidate locations and therefore have an efficiency advantage. However, invalid candidates occupy a large proportion, which reduces the effectiveness of this kind of approach. (2) Two-stage approaches, such as the faster region-based convolutional neural network (Faster R-CNN) [15], first locate candidate anchors and distinguish foreground from background regions, and then the category and location of each candidate anchor are determined. These approaches achieve higher accuracy. Nevertheless, the initial positioning of objects requires extensive computation, and as a result, these approaches may be slow [24].

Fig. 3

Example of the relation between feature maps from neighboring scales

A large number of diverse samples is important for training. Copy-Paste [5] pastes objects from one image to another image, which is a useful mechanism for data augmentation. To solve the sample imbalance problem, RetinaNet [10] employs focal loss to reduce the contribution of easy samples.
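As a concrete illustration of how focal loss down-weights easy samples, a minimal PyTorch sketch is given below; the function name and the default values \(\gamma = 2\) and \(\alpha = 0.25\) follow the common settings of the original RetinaNet paper and are not taken from this work.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss for binary classification: the (1 - p_t)^gamma factor shrinks
    the contribution of well-classified (easy) samples so that hard samples
    dominate the gradient.  `targets` holds binary labels in {0, 1}."""
    targets = targets.float()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```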

Multiscale analysis perceives the input from different perception fields [21], among which class-balanced hierarchical refinement (CHR) [13] and recursive feature pyramid and switchable atrous convolution detection (DetectoRS) [14] incorporate extra feedback connections from high-level features to improve the semantic features.

Contextual exploration has gradually become a popular research topic [17, 20], especially mining nonlocal dependent information, as in nonlocal neural networks [28], dual attention networks (DANs) [4] and ternary attention networks [25]. Recently, the transformer [26] has become a prevalent model architecture. The shifted window transformer (swin transformer) [12] constructs a hierarchical representation to expand the applicability of transformers. The focal transformer [30] performs fine-grained self-attention only in local regions and coarse-grained attention globally.

However, research on learning the discrepancies among different semantic features and the interactions between different feature maps is very limited. In addition, owing to the loss of appearance and geometric information, coupled with a limited ability to extract semantic information, the above methods are very sensitive to overlap and occlusion. Moreover, the attention mechanisms used by these methods consider only the isolated information within a single image when assigning pixel weights. This makes the detection model vulnerable to the bottleneck of a single image and causes it to deviate from the overall distribution of the dataset, significantly reducing its effectiveness in scenarios where objects are highly occluded, such as package inspection.

2.2 Prohibited item detection

For prohibited item detection from X-ray images, transfer learning is introduced to learn the differences between general detection tasks and prohibited item detection [2]. To alleviate the adverse effects of outliers, joint learning of high-dimensional image generation and spatial reasoning based on a conditional generative adversarial network [1] is studied. To alleviate the negative impact of complex scenes, CHR [13] investigates the effectiveness of sample balance and multiscale analysis in prohibited item detection.

Because package capacity is inevitably limited, the items in packages are highly occluded and overlapping. In this regard, Wei et al. [29] proposed a deocclusion attention module (DOAM) based on appearance, material and color information to extract edge information for the detection process. The cascaded structure tensor (CST) approach [7] uses a similar idea, fusing gradients in different directions in an iterative manner. The work in [9] combines the CST with transfer learning to further address occlusion problems and achieves desirable results.

However, the edge cues contain too many irrelevant gradients. Therefore, they do not improve localization and classification capabilities, causing the detection model to have poor discrimination ability in cases of severe occlusion. In addition, the above methods are limited to using single images for training, which prevents the model from using other potential information in different images to form a more comprehensive understanding of prohibited and nonprohibited objects.

3 Our approach

To handle occlusion in prohibited item detection from X-ray images without introducing a cumbersome and complicated labeling process, we propose a weakly supervised learning method, as shown in Fig. 2.

3.1 Scale interaction module

Multiscale analysis provides the potential to handle ambiguity challenges because other parts of an occluded item provide valuable information for judgment [23]. However, the current approach [29] combines features of different scales only by a linear combination. This manner of combination does not take the large semantic gaps among different perception fields into consideration. Therefore, the fused features have a lower discrimination capacity because of the inconsistent semantic information.

In fact, there is a dependency between feature maps of adjacent scales (as shown in Fig. 3). Thus, we model this dependency and utilize it for baggage inspection.

We propose a method named SIM, as illustrated in Fig. 4, to improve the feature discernment of the model with interacting feature maps of adjacent scales to avoid training fluctuations caused by semantic gaps.

Fig. 4

Illustration of the SIM. For simplicity, only the third SIM module is marked with a blue dotted box

First, assume that the input of the module is image \(\mathbf {I}\) and that the multiscale initial feature maps {\(\mathbf {f}^1_0, \mathbf {f}^2_0,\ldots , \mathbf {f}^C_0\)} are obtained by the encoder, where C is the number of feature map scales (selected in the experiments). The encoder is a residual learning network composed of C residual blocks [8]. While interacting over additional scales would capture more global information, it would also increase the risk of overfitting. Each residual block is composed of a batch normalization layer, a rectified linear unit, and a convolutional layer with a kernel size of \(3\times 3\), as sketched below.
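A minimal sketch of one such residual block (batch normalization, ReLU, then a \(3\times 3\) convolution, plus a skip connection) is given below; the \(1\times 1\) projection used when the input and output channel counts differ is an illustrative assumption.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One encoder block: BN -> ReLU -> 3x3 conv, added to a skip connection."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_ch)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1)
        # 1x1 projection for the skip path when shapes differ (an assumption)
        self.skip = (nn.Identity() if in_ch == out_ch and stride == 1
                     else nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride))

    def forward(self, x):
        return self.skip(x) + self.conv(self.relu(self.bn(x)))
```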

To mitigate the semantic gap, we acquire interactions among feature maps of neighboring scales; that is, low-level feature maps \(\mathbf {f}^{i-1}_0\), middle-level feature maps \(\mathbf {f}^i_0\), and high-level feature maps \(\mathbf {f}^{i+1}_0\) are aggregated to obtain multiscale interaction maps. The scale interaction process can be added by projection as follows:

$$\begin{aligned} \mathbf {f}^{i}_{k+1} = \mathbf {W}^{i-1}_{down} \mathbf {f}^{i-1}_{k} + \mathbf {W}^{i}_{k} \mathbf {f}^{i}_{k} + \mathbf {W}^{i+1}_{up} \mathbf {f}^{i+1}_{k}, \end{aligned}$$
(1)

where \(\mathbf {W}^{i+1}_{up}\) and \(\mathbf {W}^{i-1}_{down}\) represent the upsampling and downsampling operations implemented by \(3\times 3\) deconvolutional and convolutional layers with a stride of 2, respectively. All inputs are projected to the same number of channels as the i-th scale feature maps.
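A minimal sketch of one interaction step of Eq. (1) at scale i is shown below; beyond the stated \(3\times 3\) convolution/deconvolution with a stride of 2 for resampling, the layer details are illustrative assumptions.

```python
import torch.nn as nn

class ScaleInteraction(nn.Module):
    """One round of Eq. (1) at scale i: the finer (i-1)-th map is downsampled,
    the coarser (i+1)-th map is upsampled, and both are projected to the
    channel width of scale i before summation with the same-scale projection."""
    def __init__(self, ch_prev, ch_i, ch_next):
        super().__init__()
        self.down = nn.Conv2d(ch_prev, ch_i, kernel_size=3, stride=2, padding=1)  # W_down
        self.same = nn.Conv2d(ch_i, ch_i, kernel_size=3, padding=1)               # W_k^i
        self.up = nn.ConvTranspose2d(ch_next, ch_i, kernel_size=3, stride=2,
                                     padding=1, output_padding=1)                 # W_up

    def forward(self, f_prev, f_i, f_next):
        return self.down(f_prev) + self.same(f_i) + self.up(f_next)
```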

To explore nonlocal dependencies, the scale interaction process can be performed one or more times to generate multiscale interaction feature maps. We take the feature interaction maps \(\mathbf {f}^i_1, \ldots , \mathbf {f}^i_{k-1}\) generated in round \(1:k-1\) as the input to the round k interaction. The total number of iterations is K.

In addition, a residual learning strategy is introduced to prevent vanishing gradient issues, and pixel summation is performed on all K round feature interaction maps generated at scale i to obtain the context feature maps \(\{\mathbf {f}^i\}, i = 1, 2, \ldots , C\) at each scale. Finally, the branches that pertain to different scales are fused together through a gated CNN [31] because the context feature maps in each branch contain information regarding a specific perception field.

Multiscale feature fusion via an SIM has the following advantages: (1) The feature interaction of neighboring scales enhances the feature representation, alleviates the semantic gaps between features in different perception domains, and allows the model to obtain a comprehensive understanding of the entirety and different parts of the same object. (2) It can effectively capture the appearance variations caused by severe occlusion, where the visual features are seriously insufficient. (3) The SIM can be directly used as a plug-and-play module in various applications; moreover, it is efficient and easy to train.

3.2 Cross-image analysis module

The unique attributes of an object are also an important basis for detecting it and can be used for distinguishing a particular kind of object from other objects. The key point is how to distinguish and locate the most unusual parts of prohibited items.

Inspired by the notion of identifying a new object by comparing it with a reference, such as a photo or an explanation text for a specific class, we attempt to understand the attributes and patterns of prohibited items by comparing a target image with reference images through a cross-image attention mechanism to explore the characteristics between them and weaken complex background interference. The common attention explores cross-image shared semantics, which helps the classifier to proficiently perceive the common semantic labels over the coattentive regions. Discrepancy attention focuses on unshared semantics, which enables the classifier to capably separate the semantic patterns of different objects.

Moreover, we focus on locating unique areas by using weakly supervised signals; that is, the classification task is employed to discover the unique part of prohibited items by optimizing the cross-entropy of the common class and discrepancy class in the image pair. Compared to the detection task, this weakly supervised learning method has relatively less labeling effort. Cross-image semantic relations are used as additional category-level information to guide the learning stage. Specifically, we design a CAM using common attention and discrepancy attention mechanisms to learn cross-image semantic representations for prohibited items. The details are shown in Fig. 5.

Fig. 5

Details of the cross-image semantic relation exploration. The feature maps are obtained via the SIM; then, the context features are generated by using the coattention mechanism

Assuming that image \(\mathbf {I}_m\) is a target image and \(\mathbf {I}_n\) is a reference image randomly selected from a reference set such that it shares at least one kind of prohibited item with the target image, we resize these two images to a fixed size. The symbol \(\mathbf {l}_n \in \{0,1\}^{K}\) represents the category label corresponding to \(\mathbf {I}_n\) (elements corresponding to prohibited items in an image are denoted as ‘1’, and the remaining elements in the label vector are denoted as ‘0’), and K equals the number of prohibited item categories. The feature maps \((\mathbf {F}_m, \mathbf {F}_n) \in R^{U \times H \times W}\) obtained by the SIM are used as the CAM input, where U, H, and W are the number of channels, height, and width of the feature map, respectively. Then, the feature maps \((\mathbf {F}_m, \mathbf {F}_n)\) are processed by class-aware full convolution (CFC) to obtain the activation maps \((\mathbf {S}_m, \mathbf {S}_n) \in R^{Q \times H \times W}\), where Q is the number of channels of the activation maps. After that, the category score vectors \((\mathbf {s}_m, \mathbf {s}_n) \in R^{Q}\) are obtained through global average pooling. Finally, the sigmoid function is applied for normalization, and the cross-entropy loss \(L_{ce}\) is computed. The single-image classification loss of the image pair is as follows:

$$\begin{aligned} L^{m,n}_{single} = L_{ce}(\mathbf {s}_{m}, \mathbf {l}_{m}) + L_{ce}(\mathbf {s}_{n}, \mathbf {l}_{n}). \end{aligned}$$
(2)
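A minimal sketch of this classification path (CFC, global average pooling, and the sigmoid cross-entropy of Eq. (2)) is shown below; implementing the CFC as a \(1\times 1\) convolution and using multilabel binary targets are assumptions consistent with, but not explicitly stated in, the description above.

```python
import torch.nn as nn
import torch.nn.functional as F

class ClassAwareHead(nn.Module):
    """Class-aware full convolution (CFC) followed by global average pooling."""
    def __init__(self, in_channels, num_activation_channels):
        super().__init__()
        self.cfc = nn.Conv2d(in_channels, num_activation_channels, kernel_size=1)

    def forward(self, feat):                     # feat: (B, U, H, W)
        s = self.cfc(feat)                       # class-aware activation maps, (B, Q, H, W)
        score = s.mean(dim=(2, 3))               # global average pooling, (B, Q)
        return s, score

def single_image_loss(score_m, label_m, score_n, label_n):
    """Eq. (2): sigmoid cross-entropy on the category scores of both images."""
    return (F.binary_cross_entropy_with_logits(score_m, label_m.float()) +
            F.binary_cross_entropy_with_logits(score_n, label_n.float()))
```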

To learn common attention in an image pair, the feature maps \((\mathbf {F}_m, \mathbf {F}_n)\) are reshaped to obtain flattened feature maps \((\overline{\mathbf {F}}_m, \overline{\mathbf {F}}_n) \in R^{U \times HW}\), where HW is the number of pixels in the input feature map. The cross-image common attention similarity matrix

$$\begin{aligned} \mathbf {P}_{mn}= & {} \overline{\mathbf {F}}^{T}_{m} \mathbf {W}_{p} \overline{\mathbf {F}}_{n}, \end{aligned}$$
(3)
$$\begin{aligned} \mathbf {P}_{nm}= & {} \overline{\mathbf {F}}^{T}_{n} \mathbf {W}_{p} \overline{\mathbf {F}}_{m}, \end{aligned}$$
(4)

are used to measure the similarity between any positions of two different feature maps, where \(\mathbf {W}_{p} \in R^{U \times U}\) is the weight matrix to be learned. \(\mathbf {P}_{mn}, \mathbf {P}_{nm} \in R^{HW \times HW}\) are normalized with the softmax function to obtain the cross-image common attention maps \((\mathbf {A}_m, \mathbf {A}_n) \in R^{HW \times HW}\), and then the flattened cross-image common context feature maps are obtained

$$\begin{aligned} \overline{\mathbf {F}}^{co}_{m2n}= & {} \overline{\mathbf {F}}_{m} \mathbf {A}_{m} \in {R}^{U \times HW}, \end{aligned}$$
(5)
$$\begin{aligned} \overline{\mathbf {F}}^{co}_{n2m}= & {} \overline{\mathbf {F}}_{n} \mathbf {A}_{n} \in {R}^{U \times HW}. \end{aligned}$$
(6)

We adjust the shape of \((\overline{\mathbf {F}}^{co}_{m2n}, \overline{\mathbf {F}}^{co}_{n2m})\) to obtain common context feature maps \((\mathbf {F}^{co}_{m2n}, \mathbf {F}^{co}_{n2m}) \in {R}^{U \times H \times W}\). The class-aware activation maps \((\mathbf {S}^{co}_{m2n}, \mathbf {S}^{co}_{n2m}) \in {R}^{Q \times H \times W}\) are obtained through the CFC, and the class score vectors \((\mathbf {s}^{co}_{m2n}, \mathbf {s}^{co}_{n2m}) \in {R}^{Q}\) are obtained through global average pooling. Finally, the cross-image common attention classification loss is calculated using the sigmoid cross-entropy

$$\begin{aligned} L^{m,n}_{co-att} = L_{ce}(\mathbf {s}^{co}_{m2n}, \mathbf {l}_{m} \cap \mathbf {l}_{n}) + L_{ce}(\mathbf {s}^{co}_{n2m}, \mathbf {l}_{n} \cap \mathbf {l}_{m}), \end{aligned}$$
(7)

where \(\mathbf {l}_m \cap \mathbf {l}_n\) is the common category label of image pair \((\mathbf {I}_m, \mathbf {I}_n)\).
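A minimal sketch of the common attention path (Eqs. (3)–(6)) is given below; the softmax normalization axis and the initialization of \(\mathbf{W}_p\) are not specified in the text and are therefore assumptions. The resulting common context features are then passed through the CFC head, as in the single-image branch, to obtain the scores used in Eq. (7).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonAttention(nn.Module):
    """Cross-image common attention of Eqs. (3)-(6)."""
    def __init__(self, channels):
        super().__init__()
        # W_p in R^{U x U}; identity initialization is an assumption
        self.w_p = nn.Parameter(torch.eye(channels))

    def forward(self, feat_m, feat_n):
        b, u, h, w = feat_m.shape
        fm = feat_m.flatten(2)                                     # (B, U, HW)
        fn = feat_n.flatten(2)
        p_mn = torch.einsum('bux,uv,bvy->bxy', fm, self.w_p, fn)   # Eq. (3)
        p_nm = torch.einsum('bux,uv,bvy->bxy', fn, self.w_p, fm)   # Eq. (4)
        # softmax over the source-image positions (axis choice is an assumption)
        a_m = F.softmax(p_mn, dim=1)
        a_n = F.softmax(p_nm, dim=1)
        f_co_m2n = torch.bmm(fm, a_m).view(b, u, h, w)             # Eq. (5)
        f_co_n2m = torch.bmm(fn, a_n).view(b, u, h, w)             # Eq. (6)
        return f_co_m2n, f_co_n2m
```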

To understand the objects well, we also learn the discrepancy attention by exploring the semantic difference between different objects in the image pair.

Let \(\mathbf {W}_{b} \in R^{1 \times U}\) be a learnable parameter matrix that collects common semantic knowledge (implemented by a \(1\times 1\) convolutional layer), and let \(\sigma (.)\) denote the sigmoid activation function; then, the class-independent attention maps are

$$\begin{aligned} \mathbf {B}^{co}_{m2n}= & {} \sigma (\mathbf {W}_b \mathbf {F}^{co}_{m2n}), \end{aligned}$$
(8)
$$\begin{aligned} \mathbf {B}^{co}_{n2m}= & {} \sigma (\mathbf {W}_b \mathbf {F}^{co}_{n2m}). \end{aligned}$$
(9)

The discrepancy attention maps of the unshared semantic region can be obtained by

$$\begin{aligned} \mathbf {A}^{dis}_{m2n}= & {} 1 - \mathbf {B}^{co}_{m2n}, \end{aligned}$$
(10)
$$\begin{aligned} \mathbf {A}^{dis}_{n2m}= & {} 1 - \mathbf {B}^{co}_{n2m}, \end{aligned}$$
(11)

then the discrepancy context feature can be obtained by

$$\begin{aligned} \mathbf {F}^{dis}_{n2m}= & {} \mathbf {F}_{m} \otimes \mathbf {A}^{dis}_{n2m}, \end{aligned}$$
(12)
$$\begin{aligned} \mathbf {F}^{dis}_{m2n}= & {} \mathbf {F}_{n} \otimes \mathbf {A}^{dis}_{m2n}, \end{aligned}$$
(13)

where \(\otimes \) is the elementwise product. Likewise, the activation maps \((\mathbf {S}^{dis}_{m2n}, \mathbf {S}^{dis}_{n2m}) \in {R}^{Q \times H \times W}\) can be obtained through CFC, the category score vectors \((\mathbf {s}^{dis}_{m2n}, \mathbf {s}^{dis}_{n2m}) \in {R}^{Q}\) can be obtained through global average pooling, and the cross-image discrepancy attention classification loss is as follows:

$$\begin{aligned} L^{m,n}_{dis-att} = L_{ce}(\mathbf {s}^{dis}_{n2m}, \mathbf {l}_{m} \backslash \mathbf {l}_{n}) + L_{ce}(\mathbf {s}^{dis}_{m2n}, \mathbf {l}_n \backslash \mathbf {l}_{m}), \end{aligned}$$
(14)

where \(\mathbf {l}_m \backslash \mathbf {l}_n\) represents the object classes that exist in image \(\mathbf {I}_m\) and that do not exist in \(\mathbf {I}_n\), and likewise for \(\mathbf {l}_n \backslash \mathbf {l}_m\). Finally, the total training loss function of the weakly supervised learning is as follows:

$$\begin{aligned} L_{total} = \sum _{m,n}[L^{m,n}_{single} + \alpha (L^{m,n}_{co-att}+L^{m,n}_{dis-att})], \end{aligned}$$
(15)

where \(\alpha \) is the weight of the cross-image attention classification loss.
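The discrepancy branch and the combined objective can be sketched as follows (Eqs. (8)–(15)); aside from the \(1\times 1\) convolution stated above for \(\mathbf{W}_b\), the function signatures are illustrative.

```python
import torch
import torch.nn as nn

class DiscrepancyAttention(nn.Module):
    """Eqs. (8)-(13): the complement of the class-independent common-attention
    map re-weights the original features to highlight unshared semantics."""
    def __init__(self, channels):
        super().__init__()
        self.w_b = nn.Conv2d(channels, 1, kernel_size=1)   # W_b as a 1x1 convolution

    def forward(self, feat_m, feat_n, f_co_m2n, f_co_n2m):
        b_m2n = torch.sigmoid(self.w_b(f_co_m2n))           # Eq. (8)
        b_n2m = torch.sigmoid(self.w_b(f_co_n2m))           # Eq. (9)
        a_dis_m2n = 1.0 - b_m2n                              # Eq. (10)
        a_dis_n2m = 1.0 - b_n2m                              # Eq. (11)
        f_dis_n2m = feat_m * a_dis_n2m                       # Eq. (12)
        f_dis_m2n = feat_n * a_dis_m2n                       # Eq. (13)
        return f_dis_m2n, f_dis_n2m

def cam_total_loss(l_single, l_co_att, l_dis_att, alpha=0.01):
    """Eq. (15) for one image pair; alpha = 0.01 as in the implementation details."""
    return l_single + alpha * (l_co_att + l_dis_att)
```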

The use of common and discrepancy attention in the CAM has the following advantages: (1) The rich contextual semantic information between images is explored by means of common/discrepancy attention to understand the unique parts of prohibited items. (2) The weakly supervised signals from the class-aware activation maps reduce the tedious labeling process. (3) Owing to the large number of possible image-pair configurations, this approach works similarly to data augmentation to improve semantic understanding. (4) The framework is unified, effective, versatile, and efficient, achieving stable results under different configurations.

3.3 Multitask learning module

In prohibited item detection from X-ray images, prohibited items are likely to be blocked by other objects. However, X-ray images offer good clarity and visualization, so blocked areas are still somewhat visible. As shown in Fig. 6a, exploiting these characteristics of X-ray images, current methods [7, 29] use pixel gradients, such as the Canny operator (Fig. 6b) and the Sobel operator (Fig. 6c), to address occlusion problems. However, edge features contain too much noisy information, such as the edge details of items other than prohibited objects, which is not conducive to detection.

Fig. 6

Illustration of the problem with edge cues. a Input X-ray image. b Edge map obtained by the Canny operator and c the Sobel operator. d Segmented mask generated by our approach. e Corresponding ground truth of the segmentation obtained by manual annotation

Compared with noisy edge maps, the segmented mask provides only the semantic shape of the objects of interest, which can greatly reduce the interference [18]. Therefore, we design an MLM that combines segmentation cues to improve the detection results. An example of the segmented mask is shown in Fig. 6d, and the corresponding ground-truth segmented mask is shown in Fig. 6e.

In the segmentation branch, we use the class-aware CAM activation maps to extract the segmentation information because the detection task does not provide pixel-level labeling. The activation map (see details in Fig. 5) is not as accurate as the segmented mask map since it covers more of the foreground region than just the prohibited items. Therefore, we also employ background pseudomasks obtained from saliency maps [11] to alleviate this overcoverage of the activation map. Then, the decoder serves as the segmentor; this capability is learned by using the pseudo-ground-truth masks.

Any image \(\mathbf {I}_j\) in the dataset is fed into the CAM to generate class-aware activation maps \(\mathbf {S}^{dis}_j\); then, the semantic segmentation mask is obtained through the decoder:

$$\begin{aligned} \mathbf {O}_{j} = f_{d} (\mathbf {S}^{dis}_{j}, \mathbf {\theta }), \end{aligned}$$
(16)

where \(\mathbf {\theta }\) is the weight of decoder \(f_d\), and the resolution of the output \(\mathbf {O}_j\) is the same as that of input \(\mathbf {I}_j\). \(\mathbf {S}^{dis}_j\) is upsampled and binarized to generate the foreground pseudomasks and then combined with the background pseudomasks generated by the saliency maps to constitute the segmented mask ground truth \(\mathbf {E}_j\). The loss function of the segmentation module is as follows:

$$\begin{aligned} L_{seg} = \sum _{j} L_{bce}(\mathbf {O}_{j}, \mathbf {E}_{j}), \end{aligned}$$
(17)

where \(L_{bce}\) is the binary cross-entropy loss function. The segmented mask \(\mathbf {O}_j\) is downsampled to the same resolution as that of \(\mathbf {S}^{dis}_j\), becoming \(\mathbf {O}'_j\), and is fed into the classification and localization branches. A sketch of this weakly supervised supervision is given below.
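The following minimal sketch illustrates Eqs. (16)–(17); the binarization threshold and the way the foreground and background pseudomasks are combined are assumptions, since the text does not specify them.

```python
import torch
import torch.nn.functional as F

def build_pseudo_mask(act_map, saliency, image_size, fg_thresh=0.5, bg_thresh=0.1):
    """Construct the pseudo ground truth E_j: the upsampled, binarized
    activation map provides foreground seeds, and low-saliency pixels provide
    the background region (thresholds are illustrative)."""
    fg = F.interpolate(act_map, size=image_size, mode='bilinear', align_corners=False)
    fg = (torch.sigmoid(fg) > fg_thresh).float()           # binarized foreground pseudomask
    bg = (saliency < bg_thresh).float()                     # background pseudomask from saliency
    return fg * (1.0 - bg)                                  # suppress foreground inside the background

def segmentation_loss(decoder, act_map, pseudo_gt):
    """Eqs. (16)-(17): decode S_j^dis into the mask O_j and supervise it with
    the binary cross-entropy against the pseudo ground truth E_j."""
    o_j = decoder(act_map)                                  # Eq. (16)
    return F.binary_cross_entropy_with_logits(o_j, pseudo_gt)   # Eq. (17)
```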

The detection branch comprises a region of interest (ROI) pooling layer, a convolutional (Conv) layer and a fully connected (FC) layer, followed by two sibling output layers. The ROI pooling layer performs dynamic max pooling over \(26\times 26\) output bins for each box. The Conv layer with a \(3\times 3\) kernel extracts abstract features. The FC layer reduces the channel number from 256 to 64. Two sibling output layers follow, that is, a scoring layer and a bounding box regression layer. Let N be the number of prohibited item classes; the scoring layer outputs an \((N+1)\)-D vector (one extra dimension for the background) representing the probability of existence for each kind of prohibited item, and the bounding box regression layer computes 4-D box offsets (center, width, and height). We employ the cross-entropy loss \(L_{ce}\) for classification and the \(L_1\) loss for localization.

The final loss of the multitask learning is constructed as follows:

$$\begin{aligned} L_{det} = \sum _{j} (L_{ce}(\mathbf {r}_j, \mathbf {t}_{j}) + \beta L_{1}(\mathbf {p}_{j}, \mathbf {d}_{j})), \end{aligned}$$
(18)

where \(\mathbf {r}_j\) and \(\mathbf {t}_j\) are the prediction and ground truth for the classification, respectively, and \(\mathbf {p}_j\) and \(\mathbf {d}_j\) are the corresponding counterparts for localization. \(\beta \) is the hybrid balance factor.
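A minimal sketch of the detection branch and the loss of Eq. (18) is shown below; since the text does not state how the \(26\times 26\) pooled bins are reduced before the 256-to-64 FC layer, the flattening used here, together with the `spatial_scale` value, is an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_pool

class DetectionHead(nn.Module):
    """ROI pooling (26x26 bins) -> 3x3 conv -> FC (to 64) -> sibling
    classification and box-regression layers."""
    def __init__(self, in_ch=256, num_classes=5, pool_size=26, spatial_scale=1.0 / 16):
        super().__init__()
        self.pool_size, self.spatial_scale = pool_size, spatial_scale
        self.conv = nn.Conv2d(in_ch, 256, kernel_size=3, padding=1)
        self.fc = nn.Linear(256 * pool_size * pool_size, 64)
        self.cls = nn.Linear(64, num_classes + 1)     # scores for N classes + background
        self.reg = nn.Linear(64, 4)                    # box offsets (center, width, height)

    def forward(self, feat, rois):
        # rois: Tensor of shape (K, 5) with a batch index in the first column
        x = roi_pool(feat, rois, (self.pool_size, self.pool_size), self.spatial_scale)
        x = F.relu(self.conv(x)).flatten(1)
        x = F.relu(self.fc(x))
        return self.cls(x), self.reg(x)

def detection_loss(cls_logits, cls_targets, box_pred, box_targets, beta=0.1):
    """Eq. (18): cross-entropy classification loss plus a weighted L1 box loss."""
    return F.cross_entropy(cls_logits, cls_targets) + beta * F.l1_loss(box_pred, box_targets)
```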

The use of MLM to construct segmentation pseudomask-assisted detection has the following advantages: (1) Dual branches are used to perform different tasks, which optimizes the solution from multiple aspects; (2) Segmentation masks that accurately reflect the object category and location assist in pixel-level understanding and improve the detection accuracy.

4 Results

In this section, we verify the performance of the proposed approach compared with that of popular approaches.

4.1 Hardware and software environment

A workstation with two Intel i7-4790 3.6 GHz central processing units (CPUs), 64 GB of memory, and 4 NVIDIA GTX Titan X graphics cards is used. Our implementation is based on PyTorch.

4.2 Datasets

We evaluate our approach on the SIXray dataset, the OPIXray dataset, and the HIXray dataset.

The SIXray dataset contains 1,059,231 X-ray images collected from multiple subway stations. Prohibited items include guns, knives, wrenches, pliers, scissors, and hammers in 6 categories. The hammer class is removed in this experiment since there are fewer than 60 images containing hammers. The average size of all images is 100 K pixels, and different material objects are displayed in different colors. The dataset is divided into three subdatasets, namely, SIXray10, SIXray100 and SIXray1000, and the corresponding numbers indicate the ratio of negative samples to positive samples. Because the ratio of positive and negative samples in the SIXray100 dataset is close to the true distribution, SIXray100 is used as the dataset for our experiments. The training set contains 7143 positive samples and 714,300 negative samples, the validation set contains 893 positive samples and 89,300 negative samples, and the test set contains 893 positive samples and 89,300 negative samples.

The OPIXray dataset contains 8885 X-ray images collected from security inspection machines at international airports, including 7109 images in the training set and 1776 images in the test set. All images vary in size. There are 5 types of prohibited items: folding knives, straight knives, scissors, utility knives, and multifunction knives. All images in this dataset contain prohibited objects, and 3 types of prohibited object occlusion levels are defined. All samples were manually marked by professional inspectors at an international airport.

The HIXray dataset contains 102,928 X-ray images collected from multiple international airports, including 82,452 images in the training set and 20,476 images in the test set. Prohibited items include two kinds of portable chargers, mobile water bottles, laptops, mobile phones, tablets, cosmetics, and metallic lighters (abbreviated as PO1, PO2, WA, LA, MP, TA, CO and ML) in 8 categories. It has high-quality images, multiple objects of interest per image, and object occlusion.

4.3 Evaluation criteria

To compare our approach with popular approaches on the same datasets, we use the evaluation criteria that the corresponding works have employed and released. Specifically, we use the mAP at an intersection over union (IoU) threshold of 50% as the evaluation criterion for the SIXray100 and OPIXray datasets. All detected images are sorted according to the confidence of the detected items, and the average precision is calculated, as sketched below.
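For clarity, a minimal sketch of the IoU test and the per-class average precision computation is given below; the trapezoidal integration of the precision–recall curve is one simple approximation of AP, not necessarily the exact protocol used by the compared works.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2); a detection counts as
    correct when its IoU with a ground-truth box is at least 0.5."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-12)

def average_precision(confidences, is_true_positive, num_gt):
    """AP for one class: sort detections by confidence, accumulate true/false
    positives, and integrate the precision-recall curve.  mAP is the mean of
    the per-class APs."""
    order = np.argsort(-np.asarray(confidences))
    tp = np.cumsum(np.asarray(is_true_positive, dtype=float)[order])
    fp = np.cumsum(1.0 - np.asarray(is_true_positive, dtype=float)[order])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-12)
    return float(np.trapz(precision, recall))
```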

4.4 Implementation details

For implementation, the size of all images is adjusted to \(1200\times 1000\) resolution to meet the input requirements of the FC layer. For the SIM, the number of residual blocks is set to 5. The channel numbers of the feature maps of each scale are 64, 256, 512, 1024, and 2048. The scale interaction is iterated for 2 rounds. Then, the feature maps are fed into the class-aware full convolution (CFC) and global average pooling to obtain the category score vectors. In the CAM, the number of object categories in the SIXray dataset is 7 (including the background), and that in the OPIXray dataset is 6. In addition, the weight \(\alpha \) of the cross-image attention classification loss function is set to 0.01. In the MLM, the numbers of feature map channels of the decoder are 1024, 512, 256, 64, and E (E is the number of item categories in the dataset). The numbers of channels of the two FC layers are 128 and D (D is 10 for both the SIXray dataset and the OPIXray dataset). The parameter \(\beta \) is 0.1 (determined by a grid search). The entire network is trained using the stochastic gradient descent algorithm with a momentum of 0.9 and a weight decay coefficient of 0.007. The learning rate is 0.005 for the first 45,000 iterations and then automatically decreases according to the feedback results on the validation set. The batch size is set to 6. The numbers of epochs are set to 150 for the SIXray dataset and 120 for the OPIXray dataset.
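A minimal sketch of the optimizer settings listed above is shown below; approximating the validation-driven decay with `ReduceLROnPlateau` is an assumption, since the exact decay policy is not specified.

```python
import torch

def build_optimizer(model):
    """SGD with momentum 0.9, weight decay 0.007, and an initial learning rate
    of 0.005; the rate is later reduced when the validation mAP stops
    improving (the scheduler choice below is an assumption)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                                momentum=0.9, weight_decay=0.007)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max')
    return optimizer, scheduler
```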

4.5 Ablation study

We conduct extensive ablation studies to evaluate the effects of several contributions in our approach. These comparisons are performed only on the OPIXray dataset.

The backbone in the SIM We evaluate the model size, Giga floating point operations (GFLOPs), and effectiveness of different backbones, and the results are reported in Table 1. The swin transformer [12] obtains an obvious improvement owing to its natural hierarchical representation. We choose to use the swin transformer as the backbone in the following experiments.

Table 1 Evaluations of different backbones on the OPIXray dataset

Parameters Grid searching is employed to choose appropriate parameters in different modules. The experimental results of different numbers of scale interactions in SIM are shown in Fig. 7a. The x-axis is the interaction number, and the y-axis is the mAP. Two interactions are conducive to model decision-making, and more interactions only lead to overfitting.

The experimental results of different CFC numbers in the CAM are shown in Fig. 7b, where the x-axis is the CFC number and the y-axis is the mAP. A single CFC layer is selected to maintain a balance between effectiveness and efficiency.

The experimental results of hyperparameters in the loss function are shown in Fig. 7c. The x-axis is the parameter value, and the y-axis is the mAP. We choose \(\alpha = 0.01\) and \(\beta = 1.0\) according to the experimental results.

Fig. 7

Parameter selection on the OPIXray dataset. Quantitative analysis of a the interaction number in SIM, b the CFC number in CAM, and c hyperparameters in the loss function

The robustness of the CAM The performance varies in mAP owing to different occlusion levels, poses, and backgrounds among the corresponding reference images. Therefore, Table 2 reports the performance variation of different rounds in the experiments. The experimental results in Table 2 show that the performance is stable when 7 trials are implemented in the experiment. We also visualize common attention and discrepancy attention in the CAM module to illustrate the discrimination capability of prohibited items in X-ray images in Fig. 8. Note that input images are augmented by bounding boxes to localize prohibited items. The common attention discovers the discriminant parts of prohibited items, while normal items with similar patterns are also highlighted. Then, the discrepancy attention removes high activation regions of normal items, which makes prohibited items easy to localize.

Table 2 The robustness evaluations of different rounds on the OPIXray dataset
Fig. 8

Visualization of the variation of feature maps in the CAM module. Images are from the OPIXray dataset. Input images, common attention, and discrepancy attention are illustrated. Note that input images are augmented by bounding boxes to localize prohibited items

Effectiveness Finally, we evaluate the model size, FLOPs, and effectiveness of the three proposed modules. In the benchmark model, only the residual learning encoder with the same number of layers as the proposed solution is used to extract features, and the FC layers estimate the category and location of prohibited objects. As a result, the detection performance is relatively poor. Then, the modules proposed in this paper are gradually added; the results are shown in Table 3.

Table 3 Effectiveness evaluations of each module on the OPIXray dataset

The SIM, CAM and MLM modules achieve improvements of 1.7, 1.6 and 1.4 in the mAP, respectively. This shows that our method can effectively aggregate context information, thereby improving the detection performance.

Fig. 9

Experimental results on the SIXray dataset, where the DOAM approach, our approach, and the corresponding ground truth are illustrated in blue, red, and green bounding boxes, respectively. a Accurate detection results. The 1st (top) row shows the cases where the prohibited items have no or slight occlusion, and the 2nd row shows the situations where the prohibited items exhibit partial occlusions. b Inaccurate detection results

4.6 Evaluation on the SIXray dataset

This section compares the proposed method with other prohibited item detection methods. The experimental results are shown in Table 4. RetinaNet and CHR partially alleviate the occlusion problem by introducing different weights for each sample, achieving limited effects in X-ray image detection. The CHR and DetectoRS fuse features with details or semantics (from different scales) to improve the localization accuracy. The nonlocal network, swin transformer, and focal transformer explore useful context information by using a self-attention mechanism, which is helpful to detect occluded items. The DOAM and CST methods use edge information to guide the localization and are easily affected by noise factors. In contrast, our method eliminates the negative effects of high-frequency noise by generating high-level segmentation masks, gaining a 1.47% improvement in mAP.

Table 4 Comparison of different aggregation methods on the SIXray dataset. The evaluation metric is the mAP. ‘*’ denotes the approach we reimplemented

The detection results of different methods based on the SIXray dataset are shown in Fig. 9, where the DOAM method, our method, and the ground truth are represented by blue, red and green rectangles, respectively. In Fig. 9a, our method obtains robust detection results for a variety of occlusion levels of prohibited objects. Compared with DOAM, which uses noisy edge-assisted detection, our method uses semantic segmentation information to assist in prohibited object detection.

However, our method still yields some inaccurate results, as shown in Fig. 9b. Part of the reason for this is the complex background in the X-ray images (multiple items overlap with each other), as the unique regions of the prohibited items are not correctly understood, and the information of prohibited items (such as knives) in the real scene is limited because prohibited items occupy only a small portion of the images.

Fig. 10

Experimental results on the OPIXray dataset, where the DOAM approach, our approach, and the corresponding ground truth are illustrated in blue, red, and green bounding boxes, respectively. a Accurate detection results. b Inaccurate detection results

4.7 Evaluation on the OPIXray dataset

In this section, we verify the effectiveness of various methods on the OPIXray dataset. The results are shown in Table 5. FO, ST, SC, UT and MU in the table represent the folding knife, straight knife, scissors, utility knife and multifunction knife, respectively. The high-frequency noise generated during edge extraction limits the CST and DOAM methods to mAP improvements of only 2.04% and 3.10%, respectively. The nonlocal network, swin transformer, and focal transformer explore nonlocal dependencies by using different structures of local regions to enhance contextual features, obtaining mAP improvements of 0.5%, 4.13%, and 5.34%, respectively. On the basis of the swin transformer and DOAM methods, our approach performs multiscale analysis with interactions among adjacent scales, discovers semantic regions via image comparison, and uses two branches for multitask learning, which leads to an additional mAP increase of approximately 1.45%.

Table 5 Experimental comparison of the mAP on the OPIXray dataset

The accurate detection results in Fig. 10a show that our method is robust to variance in the input. Fig. 10b shows some of the inaccurate detection results. Note that because the useful information and the interference information of the complex background are intertwined, serious occlusions caused by other objects greatly affect the performance. In addition, other factors, such as the camera view, inevitably increase the intraclass variance. For example, a straight knife appears thinner from a specific viewpoint, drifting away from the typical characteristics of the prohibited item. Multiview analysis, which collects observed cues from different views for learning, may be a potential solution.

4.8 Evaluation on the HIXray dataset

We also conduct experiments on the HIXray dataset. The experimental results are reported in Table 6, and some detection results are illustrated in Fig. 11. PO1, PO2, WA, LA, MP, TA, CO, and ML represent portable chargers 1 (lithium-ion prismatic cell), portable chargers 2 (lithium-ion cylindrical cell), mobile water bottles, laptops, mobile phones, tablets, cosmetics, and metallic lighters, respectively. Note that our results on the HIXray dataset yield conclusions similar to those on the OPIXray dataset. Our method obtains 83.21% mAP, which is 1.11% higher than that of the runner-up.

Table 6 Experimental comparison of the mAP on the HIXray dataset

In Fig. 11, the detected outputs of the DOAM and our approach and the corresponding ground truth are illustrated. The accurately detected bounding boxes in Fig. 11a certify that our approach is robust to variations in background clutter. Feature maps obtained from multiple perception fields and analyzed by common and discrepancy knowledge across different images help to locate positions and recognize categories of prohibited items.

Fig. 11

Experimental results on the HIXray dataset, where the DOAM approach, our approach, and the corresponding ground truth are illustrated in blue, red, and green bounding boxes, respectively. a Accurate detection results. b Inaccurate detection results

Fig. 11b also illustrates the inaccurate inference results. Note that items in the ‘cosmetic’ category are sometimes missed owing to their diversity in both appearance and shape. Curriculum learning [6] is a potential solution, because it learns the patterns of objects from general to specific in a cascaded manner, which partially solves the cosmetic diversity problem.

4.9 Discussion

If the prohibited item is partially occluded, the information from the unoccluded distal part can be employed for discrimination by using multiple scale perceptions. In multiscale analysis, CHR [13] and DetectoRS [14] deliver only high-level visual cues to assist midlevel features and achieve limited improvement in object localization. To handle this predicament, our SIM module also incorporates detailed information from low-level feature maps and bidirectional mining of contextual semantic information to improve localization accuracy. The feature interaction of neighboring scales enhances the feature representation, alleviates the semantic gaps between features in different perception domains, and allows the model to obtain a comprehensive understanding of the entirety and different parts of the same object. It can also effectively capture the appearance variations caused by severe occlusion, where the visual features are seriously insufficient.

The rich contextual semantic information between images can be explored to understand the unique parts of prohibited items. The common attention in the CAM module explores cross-image shared semantics, which helps the classifier to proficiently perceive the common semantic labels over the coattentive regions. Discrepancy attention in the CAM module focuses on unshared semantics, which enables the classifier to capably separate the semantic patterns of different objects. In fact, cross-attention [28] or transformer [12, 30] mechanisms can not only explore similarity in local and global regions but also discover and discriminate semantics by using images containing items in the same or different classes, at the cost of computational complexity.

Although the segmentation-based branch (our MLM) does not produce perfectly accurate pixel-level outputs, it is more robust than edge-based methods [7, 29]: an edge is affected by factors from both of its sides, whereas an object region contains rich information describing the specific class to which the object belongs, which reflects the object category and location, assists pixel-level understanding, and improves the detection accuracy. In addition, dual branches are used to perform different tasks (detection and segmentation), which optimizes the shared feature maps at both the pixel level and the object level.

The basis of our approach is that the occluded part is still partially visible in the X-ray image. Therefore, we learn the multiscale analysis, characteristics, and segmentation of the interacting part to improve the perception of prohibited items. In general object detection, the occluded part is totally invisible; as a result, our approach does not work in this situation owing to the absence of valuable partially observed appearance information.

In future work, we will explore the relationship between different viewpoints or depths through multiview- or computed tomography-based approaches to expand the method and address the challenges under a single viewpoint, which will further improve the effectiveness of prohibited item detection from X-ray images. Curriculum learning [6], learning the pattern of objects from general to specific in a cascade manner, is also a potential solution to handle the problem of diversity in prohibited item detection.

5 Conclusion

Here, we present a new method that employs multilayer feature interaction to improve the perception ability of the model. The proposed cross-image analysis can learn the pixel-level semantics of objects in a weakly supervised manner. This pixel-level information can further assist in prohibited item detection, especially in the case of missing information, such as from occlusion. Experimental results on the SIXray dataset, the OPIXray dataset, and the HIXray dataset show that our approach outperforms other popular approaches by margins of 1.47%, 1.45%, and 1.11% in mAP.