
1 Introduction

According to recent statistics, the diseases with the highest mortality rates worldwide are malignant tumors, heart disease, pneumonia, and cerebrovascular diseases, and the number of deaths caused by malignant tumors is increasing each year [1]. Accurate diagnosis is therefore of great significance for the discovery and treatment of tumor diseases. Diagnostic imaging is the most common method used to detect cancer, and the rapid development of medical imaging technology in recent years has made high-resolution CT and MRI datasets available, which in turn makes it possible to train detectors on high-resolution CT images to find lesions.

In diagnosis based on CT images, clinicians judge the presence of a tumor from their medical knowledge and the relevant laboratory reports, so the result depends largely on the experience of the doctor. With the rapid advancement of medical information systems in recent years, doctors can now easily access patients' medical image data to facilitate diagnosis; however, the massive task of analyzing lesions on computed tomography (CT) and magnetic resonance imaging (MRI) scans remains onerous for the physician. We therefore propose to automate the detection of tumors in CT images, aiming to alleviate the burden on doctors while achieving higher analysis accuracy.

Currently, many laboratories and scholars in computer vision and image recognition are attempting to introduce object detection techniques into the medical field; [2,3,4,5,6,7,8] showcase some of the explorations made in recent years. However, tumor sizes vary greatly: in the DeepLesion dataset, lesion diameters range from 0.21 mm to 342.5 mm. The above methods fail to handle such large scale variation well, and their precision on small lesions is low. Since small lesion regions are important evidence for detecting early disease in the body, it is necessary to improve detection accuracy for small lesions.

In this paper, we propose a fine-grained lesion detection method with a novel multi-scale global attention mechanism. To handle the small-lesion samples in the DeepLesion dataset, we improve the network structure of the two-stage Faster R-CNN detector and use ResNet101 as the backbone. First, we apply dilation and erosion operations from mathematical morphology to the input CT images, which makes the diseased areas more visible. To fuse more semantic information when constructing the feature pyramid, we deepen the network beyond ResNet101 until the resolution of the feature map is \({8} \times {8}\). After up-sampling, the output of each residual block is fused with the higher-resolution feature map to preserve as much spatial and semantic information as possible at each scale. We then use the multi-scale response (MSR) module to facilitate lesion detection across fine-grained scales. Given a feature map at a specific resolution, the aggregated dilation block (ADB) in the MSR follows the split-transform-merge principle and exploits the regional correlations within each pyramid feature generation block. As the dilation rate increases, new feature maps are generated from a wider context, so the aggregated dilation blocks further enlarge the receptive fields of the top-down path in the feature pyramid. The channel and spatial attention modules in the MSR then focus on the different lesion responses in the feature maps. Finally, we re-sample the high-response features and fuse them with the upper-level feature map, which is then fed into the RPN to obtain the category prediction scores and the bounding box coordinates. Experiments show that the network with the MSR module achieves a significant improvement in accuracy over the original two-stage network.

Fig. 1. The framework of the proposed lesion detection method

The main contributions of this work can be summarized as follows.

  1) An Aggregated Dilation Block (ADB) is proposed. The block alleviates the weakness of the low-resolution feature layers of the network, where the large receptive field leaves small-target features indistinct.

  2) A Global Attention Block (GAB) is designed to reduce the influence of background noise and highlight target features, which is effective for detecting obscure objects at different scales.

2 Proposed Method

The structure of the proposed framework is illustrated in Fig. 1. First, we apply different morphological operations (dilation and erosion) as pre-processing, according to the color of the tumors in each organ, and feed these feature-enhanced maps into the network. We extract features from the ResNet backbone (P2–P7) (Sect. 2.2). The extracted features are further processed by the ADB (Sect. 2.3) and GAB (Sect. 2.4) in the MSR module. The output of the MSR is fed into the RPN and is also up-sampled and summed with the higher-resolution feature map of the previous layer, so that a feature pyramid network is constructed from the MSR outputs. Finally, the network is trained with a Softmax loss for classification probability and a Smooth L1 loss for bounding box regression.

2.1 Mathematical Morphology Operations

Mathematical morphology refers to a family of image processing techniques that enhance shape features. The basic idea is to measure or extract the corresponding shape or feature of an input image with a special structuring element. Common morphological operations include erosion, dilation, opening, and closing. Morphological operations rely on the relative order of pixel values rather than their absolute values, so they are well suited to binary image processing. By constructing a structuring element suitable for the DeepLesion dataset, the maximum or minimum areas in the image can be found effectively, which reduces some of the noise in CT scans and makes the lesions more obvious. The DeepLesion images can be roughly divided into two categories: a dark background with a light lesion area, or the opposite. We selected representative CT images of the lung and liver, which respectively exhibit these two background types. Figure 2 shows the images obtained after different morphological operations on light and dark tumors. For CT images containing light tumors, dilation magnifies the lesion area and helps the network detect small lesions. For CT images containing dark tumors, the tumor region in the eroded images not only becomes larger but also retains much of its texture information. The dilation and erosion operations are defined as follows:

Fig. 2. Outputs of light and dark tumors after morphological operations. Organs with representative light and dark lesions in the dataset are shown; (a)–(e) are the labeled image, erosion, dilation, opening, and closing, respectively. The first row is a lung CT image and the second row is a liver CT image.

Fig. 3. The structure of the FPN built on the proposed MSR module

$$A \oplus B = \left\{ (x, y) \mid (B)_{xy} \cap A \ne \varnothing \right\}$$
(1)
$$A \ominus B = \left\{ (x, y) \mid (B)_{xy} \subseteq A \right\}$$
(2)

where \(\oplus\) and \(\ominus\) denote dilation and erosion, respectively, and \((B)_{xy}\) is the structuring element \(B\) translated to position \((x, y)\).
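As a concrete illustration, the sketch below applies Eqs. (1) and (2) to a CT slice with OpenCV; the 5 × 5 elliptical structuring element and the light/dark selection rule are illustrative assumptions, not settings reported here.

```python
import cv2
import numpy as np

def enhance_lesion(ct_slice: np.ndarray, lesion_is_light: bool) -> np.ndarray:
    """Morphological pre-processing: dilate light lesions (Eq. 1),
    erode around dark lesions (Eq. 2). Kernel shape/size is an assumption."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    if lesion_is_light:
        return cv2.dilate(ct_slice, kernel)  # grows bright regions
    return cv2.erode(ct_slice, kernel)       # grows dark regions
```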

Fig. 4. The structure of the MSR module. It consists of two parts: the Aggregated Dilation Block (ADB) in the blue dashed box on the left and the Global Attention Block (GAB) in the orange dashed box on the right. (Color figure online)

2.2 Feature Extraction Network

Faster R-CNN [9] is an object detection algorithm proposed by Ren et al. in 2015. Building on Fast R-CNN [10], it introduces the Region Proposal Network (RPN) for generating proposal boxes, which greatly improves detection speed. Detection proceeds in four steps: first, the whole image is fed into a CNN for feature extraction; second, the RPN generates anchor boxes; third, the RoI pooling layer converts each RoI into a feature map of fixed size; fourth, a Softmax loss and a Smooth L1 loss are used for classification and bounding box regression, respectively.

We first modified the backbone of the model: the original backbone of Faster R-CNN is VGG16, and we replaced it with ResNet101 [11]. The ResNet results on the ImageNet dataset [12] show that the residual structure significantly outperforms traditional convolutional architectures. However, because of down-sampling in traditional convolutional neural networks, small objects do not acquire distinctive features.

To address this issue, we first deepened the backbone to strengthen the network's ability to extract deeper semantic information. The conv2_x, conv3_x, conv4_x, conv5_x, conv6_x, and conv7_x blocks are used to build the feature pyramid, with P2–P7 corresponding to conv2_x–conv7_x. We convolve the corresponding bottom-up feature maps with 1 × 1 kernels to reduce the number of channels and fuse the deep features [13]. After processing by the MSR module, each feature map is up-sampled and added to the corresponding bottom-up feature map. The higher-resolution feature maps of P2–P4 help the network find and locate small lesions, since as the network deepens, small-object information vanishes under repeated convolution and pooling. Adding conv6_x and conv7_x serves two purposes: first, it improves the feature extraction capability of the backbone; second, their outputs are up-sampled and fused with the upper-layer feature maps when constructing the FPN (see Fig. 3), which brings more deep semantic information into the higher-resolution feature maps.
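A minimal sketch of one such top-down fusion step is shown below, assuming the common FPN choice of 256 output channels and nearest-neighbor up-sampling; the module name and channel count are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LateralFusion(nn.Module):
    """One top-down FPN step: 1x1 lateral conv on the bottom-up map,
    then up-sample the coarser map and add (channel count is an assumption)."""
    def __init__(self, c_bottom: int, c_out: int = 256):
        super().__init__()
        self.lateral = nn.Conv2d(c_bottom, c_out, kernel_size=1)  # channel reduction

    def forward(self, bottom_up: torch.Tensor, top_down: torch.Tensor) -> torch.Tensor:
        lat = self.lateral(bottom_up)
        up = F.interpolate(top_down, size=lat.shape[-2:], mode="nearest")
        return lat + up
```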

2.3 Aggregated Dilation Block

When generating feature pyramids from ResNet residual blocks, an imbalance between spatial and semantic information emerges. To address this, we build a feature pyramid network from the multi-scale outputs of the residual blocks along the top-down pathway and introduce dilated convolutions in the ADB module, using a multi-branch structure whose different dilation rates adapt the receptive field to feature maps of different scales. In each parallel branch, the feature map is enhanced by cascaded convolution kernels with different dilation rates. The output of each layer is passed through a nonlinear activation function, which stabilizes training and brings more diverse representations to the feature transformation. The weighted combination across the multi-branch dilated convolutions can, to some extent, suppress the noise remaining in low-resolution maps. We then concatenate the output features of the branches with the original map to form an aggregated feature map, so the output of the ADB module has a larger receptive field.

In the ADB module, \(f \in R^{W\times D}\) describes the architecture of the ADB, where W and D are the width and depth of the ADB, respectively. The dilation rate of a specific layer is denoted \({f}_{ij}\), where \(i=1,2,\dots ,W\) and \(j=1,2,\dots ,D\) index the width and depth. The aggregated dilation operation is:

$$\mathcal{F}\left(x\right)= \sum\nolimits_{i=1}^{W}{\mathcal{T}}_{i}\left(x \mid {f}_{i1}, {f}_{i2}, \dots, {f}_{iD}\right)$$
(3)

where \({\mathcal{T}}_{i}(x)\) denotes the cascaded transformation of the i-th branch.
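The sketch below is one way to realize Eq. (3) in PyTorch; the branch count, dilation rates, and the concat-plus-1 × 1 fusion with the input are assumptions made for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ADB(nn.Module):
    """Aggregated Dilation Block: W parallel branches, each a depth-D cascade
    of 3x3 dilated convolutions, aggregated as in Eq. (3)."""
    def __init__(self, channels: int = 256, rates=((1, 1), (2, 2), (3, 3))):
        super().__init__()
        self.branches = nn.ModuleList()
        for branch_rates in rates:                 # width W = len(rates)
            layers = []
            for r in branch_rates:                 # depth D = len(branch_rates)
                layers += [nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                           nn.ReLU(inplace=True)]
            self.branches.append(nn.Sequential(*layers))
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # restore channel count

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        agg = sum(b(x) for b in self.branches)            # Eq. (3): sum of cascades
        return self.fuse(torch.cat([agg, x], dim=1))      # concat with the input map
```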

As shown in Fig. 4, each parallel branch inside the ADB module is a series of convolution kernels with different dilation rates, and the output multi-scale feature map recovers more detailed spatial information about the instance while providing more long-range context for constructing the feature pyramid. The receptive field of each layer is computed as follows:

$${\mathcal{A}}_{i0}=1$$
(4)
$${k'}_{ij}={k}_{ij}+\left({k}_{ij}-1\right)\times \left({f}_{ij}-1\right)$$
(5)
$${\mathcal{A}}_{ij}={\mathcal{A}}_{i,j-1}+\left({k'}_{ij}-1\right)\times \prod_{k=1}^{j-1}{s}_{k}$$
(6)

where \({\mathcal{A}}_{ij}\) denotes the receptive field, \({k}_{ij}\) the kernel size, \({k'}_{ij}\) the effective kernel size under dilation rate \({f}_{ij}\), and \({s}_{k}\) the stride of layer k.

From these formulas we can see that convolution kernels with different dilation rates extract receptive fields of different sizes. Dilated convolution is usually placed in the feature extraction network, i.e., the backbone, where the enlarged receptive field helps identify large objects [14]. In contrast, we add dilated convolutions after the outputs of the feature pyramid, expecting them to provide more spatial context and thereby improve the detection accuracy of small lesions.
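As a worked example of Eqs. (4)–(6), the helper below computes the receptive field of one ADB branch, assuming 3 × 3 kernels and stride 1 inside the block; a cascade with dilation rates 1, 2, 3 yields a receptive field of 13.

```python
def receptive_field(dilations, kernel=3, stride=1):
    """Evaluate Eqs. (4)-(6) for one branch (3x3 kernels, stride 1 assumed)."""
    rf = 1                                         # Eq. (4): A_{i0} = 1
    jump = 1                                       # running product of strides
    for f in dilations:
        k_eff = kernel + (kernel - 1) * (f - 1)    # Eq. (5): effective kernel size
        rf += (k_eff - 1) * jump                   # Eq. (6): receptive field update
        jump *= stride
    return rf

print(receptive_field([1, 2, 3]))  # -> 13
```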

Fig. 5. Lesion detection results on sample CT images for various methods. The rows, from top to bottom, show the labeled image, our proposed method, and Faster R-CNN, respectively.

2.4 Global Attention Block

Inspired by the currently popular attention mechanisms, we propose a Global Attention Block (GAB). When the backbone extracts information from a large number of feature maps, the GAB allows the network to pay more attention to vital information, improving detection accuracy. Each attention block consists of two parts: a spatial attention block and a channel attention block.

Object detection is extremely sensitive to changes in spatial location, so our spatial attention block uses a self-attention mechanism to model long-range dependencies and enhance the network's global understanding of the visual scene. In addition, inspired by SENet [15], our channel attention block focuses on the feature information we need.

We transform the input feature map \(x\) along three paths \({F}_{1}\), \({F}_{2}\) and \({F}_{3}\), where \({F}_{1}(x) ={W}_{1}x\) and \({F}_{2}(x) ={W}_{2}x\). We first obtain an attention map of the long-range correlation between positions in the feature map through \({S}_{ij}= {{F}_{1}({x}_{i})}^{T}\otimes {F}_{2}({x}_{j})\). \({S}_{ij}\) is then normalized by a softmax, \({A}_{ij}=softmax({S}_{ij})\), which represents the relationship between positions \(i\) and \(j\) in the feature map; finally, \({A}_{ij}\) is multiplied with \({F}_{3}\) to query the response relationship between pixels of the feature map:

$${z}_{i}= \sum_{j=1}^{N}{A}_{i,j} \otimes {F}_{3}\left({x}_{j}\right),\quad {F}_{3}\left(x\right)= {W}_{3}x$$
(7)

where \(i\) is the index of the query position, \(N = H \times W\) is the number of feature locations, and \(\otimes \) denotes matrix multiplication.
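A PyTorch sketch of this spatial attention step is given below; implementing \(W_{1}\), \(W_{2}\), \(W_{3}\) as 1 × 1 convolutions and shrinking the query/key embedding to C/8 channels are assumptions borrowed from common self-attention practice, not settings stated in the paper.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Self-attention over all N = H*W positions, as in Eq. (7)."""
    def __init__(self, channels: int):
        super().__init__()
        self.f1 = nn.Conv2d(channels, channels // 8, 1)  # query projection W1
        self.f2 = nn.Conv2d(channels, channels // 8, 1)  # key projection W2
        self.f3 = nn.Conv2d(channels, channels, 1)       # value projection W3

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.f1(x).flatten(2).transpose(1, 2)        # B x N x C'
        k = self.f2(x).flatten(2)                        # B x C' x N
        attn = torch.softmax(q @ k, dim=-1)              # A_ij over positions j
        v = self.f3(x).flatten(2).transpose(1, 2)        # B x N x C
        z = attn @ v                                     # Eq. (7)
        return z.transpose(1, 2).reshape(b, c, h, w)
```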

After the spatial attention block, we compress the global information into the channels through global average pooling. The main difference between the SE block and our block is the fusion step, which reflects the different goals of the two blocks: the SE module rescales the channels to re-calibrate their importance but does not fully model long-range correlation, whereas we aggregate the global context to all positions by addition. The global attention block (GAB) is formulated as follows:

$${y}_{i}={x}_{i}+ {W}_{v2}\,ReLU\left(LN\left({W}_{v1}z\right)\right)$$
(8)

where \({W}_{v1}\in {R}^{\frac{C}{r} \times C}\) and \({W}_{v2}\in {R}^{C \times \frac{C}{r}}\). To keep the channel attention block lightweight, we reduce the channel dimension from C to C/r, where r is the bottleneck ratio. Setting r too large loses feature information, while setting it too small consumes a lot of computation, so a balance must be struck between the two costs; we found that the model performs best when r = 4.
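The channel attention path of Eq. (8) can be sketched as follows; treating \(W_{v1}\) and \(W_{v2}\) as 1 × 1 convolutions applied to the pooled context is an assumption about the implementation, while r = 4 follows the ablation above.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention with additive fusion, as in Eq. (8)."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)               # squeeze to B x C x 1 x 1
        self.wv1 = nn.Conv2d(channels, channels // r, 1)  # W_v1: C -> C/r
        self.ln = nn.LayerNorm([channels // r, 1, 1])     # LN in Eq. (8)
        self.wv2 = nn.Conv2d(channels // r, channels, 1)  # W_v2: C/r -> C

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.pool(x)                                  # global context z
        z = self.wv2(torch.relu(self.ln(self.wv1(z))))
        return x + z                                      # additive fusion
```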

3 Experiments and Results

3.1 Datasets

DeepLesion is the largest open dataset of multi-category, lesion-level labeled clinical CT images, released by the NIH Clinical Center. Training deep neural networks on this dataset makes it possible to build a large-scale universal lesion detector that can more accurately and automatically measure the size of all lesions in a patient's body, allowing an initial whole-body assessment of cancer. The dataset contains 32,735 labeled lesion instances from 4,427 independent, anonymized patients and covers a wide range of lesions involving the lung, liver, mediastinum (mainly lymph nodes), kidney, pelvis, bone, abdomen, and other soft tissues. We used 70% of the samples for training, 15% for validation, and 15% for testing.

3.2 Training Schedule

We set the initial learning rate to 0.008 and the momentum to 0.9; after 10,000 iterations, the learning rate decays to 0.001. The RPN batch size is 128, the mini-batch is 2 images, and training runs for 12 epochs. The initialization weights for P1–P5 are taken from the ImageNet pre-trained model, while the deepened part of the network is initialized randomly.

Table 1. mAP and AP of each lesion type on the official split set of DeepLesion.
Table 2. An ablation study with various configurations of the proposed modules. Lesion detection sensitivity is reported at different false positive (FP) rates on the DeepLesion test set.

We resized the input images to 512 × 512 and optimized with stochastic gradient descent (SGD). Training our detector takes about 60 h; compared with one-stage detectors we are at a disadvantage in training cost and test speed, but in medical image detection speed matters far less than accurately locating the lesion area.
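A minimal sketch of this optimization setup, assuming the 0.001 figure above is the decayed learning rate, with a placeholder module standing in for the full detector:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, 3)  # placeholder for the detector network
optimizer = torch.optim.SGD(model.parameters(), lr=0.008, momentum=0.9)
# Decay the learning rate from 0.008 to 0.001 after 10,000 iterations
# (scheduler.step() called once per training iteration).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10_000], gamma=0.001 / 0.008)
```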

3.3 Hardware and Software Setup

Experiments were conducted on a workstation with an Intel Core i7 2.7 GHz CPU and 8 GB RAM under Ubuntu 18.04, with an NVIDIA RTX 2080 Ti GPU with 11 GB of memory. Faster R-CNN was implemented in the PyTorch 1.5.1 framework with Python 3.7, CUDA 10.1, and cuDNN 7.6.3.

3.4 Evaluation

We adopted two evaluation metrics in the ablation experiments and comparisons. Object detection accuracy is measured by the mean Average Precision (mAP) at an IoU threshold of 0.5. The second metric is the free-response ROC (FROC): the average sensitivity at different false positive rates over the whole test set.
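For reference, a sketch of how such sensitivities could be computed, assuming every detection on the test set has already been matched to ground truth and flagged as a true or false positive; the FP rates listed are an assumption following common practice on DeepLesion.

```python
import numpy as np

def froc_sensitivities(scores, is_tp, num_lesions, num_images,
                       fp_rates=(0.5, 1, 2, 4, 8, 16)):
    """Sensitivity at fixed false positives per image over the whole test set."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp_sorted = np.asarray(is_tp, dtype=bool)[order]
    tp = np.cumsum(tp_sorted)        # true positives at each score threshold
    fp = np.cumsum(~tp_sorted)       # false positives at each score threshold
    sens = []
    for rate in fp_rates:
        idx = np.searchsorted(fp, rate * num_images, side="right") - 1
        sens.append(tp[idx] / num_lesions if idx >= 0 else 0.0)
    return sens
```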

3.5 Results

Ablation Study

Fig. 6. FROC curves of the ablation study on the official split test set of DeepLesion.

Table 3. mAP and AP of each lesion type of various methods on the official split test set of DeepLesion.
Table 4. Comparison of the proposed method with state-of-the-art methods on the DeepLesion test set. Lesion detection sensitivity values are reported at different false positive (FP) rates.
Fig. 7. FROC curves of various methods.

The proposed network consists of three main components: Faster R-CNN, the ADB, and the GAB. In Table 1 and Table 2, Baseline denotes the original Faster R-CNN model, and Baseline + (Res101) denotes the deepened backbone described in Sect. 2.2. To assess the validity of each module, we performed ablation studies on the DeepLesion dataset. Table 1 shows that our proposed method improves on the original model, raising the mAP by up to 10%. Based on the coarse lesion types that DeepLesion provides for each CT slice, we also calculated the AP for each lesion type; the table shows that the AP increases by different margins for different sites. The metric in Table 2 is the average sensitivity over the entire test set at different false positive rates, and the comparison between configurations shows that the proposed method achieves the highest sensitivity at every false positive rate. The FROC curves in Fig. 6 make these results more intuitive, and detection examples are shown in Fig. 5.

Comparisons with State-of-the-Art

We compared our model with state-of-the-art methods. As shown in Table 3 and Table 4, we chose YOLOv3 [16], RetinaNet [17], Mask R-CNN [18], and 3DCE [5]. For 3DCE, Yan et al. feed different numbers of slices to the detector, which generates 3D context information from them to help make the final lesion prediction. Table 3 shows that the mAP of our proposed detector is higher than that of the other advanced detectors; although it is slightly lower than 3DCE on the mediastinum and lung, it is higher than the other detectors on all remaining organs.

Table 4 shows the evaluation results, which indicate that our method outperforms the existing methods. For 3DCE, the more CT slices the detector takes as input, the higher its sensitivity, because additional slices provide more contextual information. In this paper, however, we obtain better results with only a single CT slice as input. To make the comparison more straightforward, the FROC curves of the compared methods are plotted in Fig. 7.

Overall, compared with state-of-the-art methods, our proposed method is superior in both detection accuracy and sensitivity.

4 Conclusion

We propose a fine-grained lesion detection method with a novel multi-scale global attention mechanism that enhances the detection of lesions on feature maps of different sizes. At the different convolution levels of the detection network, the ADB augments the detector's awareness of scale variation in the feature maps, providing finer size estimates that capture the responses under different receptive fields. To select meaningful responses effectively, we propose the GAB attention module. The results of ablation experiments on the DeepLesion dataset demonstrate the effectiveness of our method for detection at different scales.