
1 Introduction

Automated lesion detection is an important yet challenging task in medical image analysis, as explored by [8, 16, 19, 22, 23, 27, 29] on the public NIH DeepLesion dataset. Its aims include improving physicians' reading efficiency and increasing the sensitivity for localizing/reporting small but vital tumors, which are more prone to be missed; e.g., human-reader sensitivity is reported at 48–57% for small-sized hepatocellular carcinoma (HCC) liver lesions [1]. Automated lesion detection remains difficult due to the tremendously large appearance variability, unpredictable locations, and frequently small size of lesions of interest [12, 22]. In particular, two key aspects requiring further research are (1) how to effectively process 3D volumetric data, since small and critical tumors require 3D imaging context to be differentiated, and (2) how to more accurately regress the tumor's 3D bounding box. This work makes significant contributions towards both aims.

Computed tomography (CT) scans are volumetric, so incorporating 3D context is key to recognizing lesions. As a direct solution, 3D convolutional neural networks (CNNs) have achieved good performance for lung nodule detection [5, 6]. However, due to GPU memory constraints, shallower networks and smaller input dimensions are used [5, 6], which may limit performance on more complicated detection problems. For instance, universal lesion detection (ULD) [16, 17, 21, 29], which aims to detect many lesion types with diverse appearances from the whole body, demands wider and deeper networks to extract more comprehensive image features. To resolve this issue, 2.5D networks have been designed [2, 16, 17, 20, 21, 29] that use deep 2D CNNs with ImageNet pre-trained weights and fuse image features of multiple consecutive axial slices. Nevertheless, these methods do not fully exploit 3D information, since their 3D-related operations act sparsely at only selected network layers via convolutional-layer inner products. 2.5D models are also inefficient because they process CT volumes in a slice-by-slice manner. Partially inspired by [3, 14, 24], we propose applying pseudo 3D convolution (P3DC) backbones to efficiently process 3D images. This allows our volumetric lesion detector (VLD) framework to fully exploit 3D context while re-purposing off-the-shelf deep 2D network structures and inheriting their large capacities to cope with lesion variability.

Fig. 1. Overview of VLD. We show (a) the complete workflow; (b) the detailed pseudo 3D convolution (P3DC) backbone; (c) the 3D lesion center regression head; and (d) the surface point regression (SPR) head for bounding box generation.

Good lesion detection performance also relies on accurate bounding box regression. However, some lesions, e.g., liver lesions, frequently present vague boundaries that are hard to distinguish from the background. Most existing anchor-based [15] and anchor-free [18, 28] algorithms rely on features extracted from the proposal center to predict the lesion's extent. This is sub-optimal, since lesion boundary features should intuitively be crucial for this task. To this end, we adopt and enhance the RepPoints algorithm [25], which generates a point set to estimate bounding boxes, with each point fixating on a representative part. Such a point set can drive more finely-tuned bounding box regression than traditional strategies, which is crucial for accurately localizing small lesions. Different from RepPoints, we propose surface point regression (SPR), which uses a novel triplet-based appearance regularization to force regressed points to move towards lesion boundaries, allowing for even more accurate regression.

In this work, we advance both volumetric detection and bounding box regression using deep volumetric P3DCs and effective SPR, respectively. We demonstrate that our P3DC backbone can outperform state-of-the-art 2.5D and 3D detectors on the public large-scale NIH DeepLesion dataset [22], e.g., we increase the strongest baseline's sensitivity for detecting small lesions from \(22.4\%\) to \(30.3\%\) at 1 false positive (FP) per CT volume. When incorporating SPR, our VLD outperforms the best baseline [2] by >4% sensitivity at all operating points on the free-response receiver operating characteristic (FROC) curve. We also evaluate VLD on an extremely challenging dataset (574 patient studies) of HCC liver lesions collected from the archives of Chang Gung Memorial Hospital. Many patients suffer from cirrhosis, which makes HCC detection extremely difficult. P3DC alone achieves \(63.6\%\) sensitivity at 1 FP per CT volume; adding SPR boosts this sensitivity to \(69.2\%\). Importantly, for both the DeepLesion and in-house HCC datasets, our complete VLD framework provides the largest performance gains for small lesions, which are the easiest for human readers to miss and thus should be the focus of any detection system.

2 Method

VLD follows a one-stage anchor-free detection workflow  [2, 28], which is simple but has yielded state-of-the-art performance on DeepLesion  [2]. As shown in Fig. 1, VLD takes volumetric CT scans as inputs and extracts deep convolutional features with its P3DC backbone. The extracted features are then fed into VLD’s 3D center regression and SPR heads to generate center coordinates and surface points, respectively.

2.1 P3DC Backbone

VLD relies on a deep volumetric P3DC backbone, which we build off of DenseNet-121 [7]. Specifically, we first remove the fourth dense block, as we found this truncated version performs better on DeepLesion. The core strategy of VLD is to keep front-end processing in 2D and only convert the third dense block of the truncated DenseNet-121 to 3D using P3DCs. This strategy is consistent with [21], which found that introducing 3D information at higher layers is preferable to doing so at lower layers. Using N to denote convolutional kernel sizes throughout, for the first two dense blocks the weight parameters, \((c_{o},c_{i},N,N)\), are reshaped to \((c_{o},c_{i},1,N,N)\) to process volumetric data slice-by-slice. When processing dynamic CTs with multiple contrast phases, e.g., our in-house dataset, we stack the multi-phase input and inflate the weight of the first convolutional kernel along its second dimension [3].
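
To make the conversion concrete, the following is a minimal PyTorch sketch (not the authors' released code) of the slice-wise reshaping and of inflating the first layer for stacked multi-phase input; the function names and the rescaling by the number of phases are our own assumptions.

```python
import torch.nn as nn


def to_slicewise_3d(conv2d: nn.Conv2d) -> nn.Conv3d:
    """Re-use a pre-trained 2D conv as a slice-wise pseudo-3D conv:
    the (c_o, c_i, N, N) weight is viewed as (c_o, c_i, 1, N, N),
    so the kernel processes one axial slice at a time."""
    c_o, c_i, n, _ = conv2d.weight.shape
    conv3d = nn.Conv3d(
        c_i, c_o, kernel_size=(1, n, n),
        stride=(1, *conv2d.stride), padding=(0, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    conv3d.weight.data.copy_(conv2d.weight.data.unsqueeze(2))
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)
    return conv3d


def inflate_first_conv(conv3d: nn.Conv3d, num_phases: int) -> nn.Conv3d:
    """Inflate the first conv along its second (input-channel) dimension so
    stacked multi-phase CT volumes can be fed in; the tiled weights are
    rescaled by the number of phases (our assumption, in the spirit of [3])."""
    w = conv3d.weight.data.repeat(1, num_phases, 1, 1, 1) / num_phases
    new_conv = nn.Conv3d(
        w.shape[1], w.shape[0], kernel_size=conv3d.kernel_size,
        stride=conv3d.stride, padding=conv3d.padding,
        bias=conv3d.bias is not None,
    )
    new_conv.weight.data.copy_(w)
    if conv3d.bias is not None:
        new_conv.bias.data.copy_(conv3d.bias.data)
    return new_conv
```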

Fig. 2. Options to transfer the 2D convolutional layer (a) to volumetric 3D convolutions: (b) inflated 3D [3], (c) spatio-temporal 3D [14], and (d) axial-coronal-sagittal 3D [24].

To implement 3D processing, we convert the third dense block and the task-specific heads, and investigate several different options for P3DCs: inflated 3D (I3D) [3], spatio-temporal 3D (ST-3D) [14], and axial-coronal-sagittal 3D (ACS-3D) [24]. These options are depicted in Fig. 2. I3D [3] simply duplicates 2D kernels along the axial (3D) direction and downscales the weight values by the number of duplications; thus, I3D produces true 3D kernels. ST-3D [14] first reshapes \((c_o,c_i,N,N)\) kernels into \((c_o,c_i,1,N,N)\) to act as "spatial" kernels and introduces an extra \((c_o,c_i,N,1,1)\) kernel as the "temporal" kernel. The resulting features from both are fused using channel-wise concatenation. There are alternative ST-3D configurations; however, the parallel structure of Fig. 2(c) was shown to be best in a liver segmentation study [27]. ACS-3D [24] splits the kernel \((c_o,c_i,N,N)\) into axial \((c_{oa},c_i,N,N)\), coronal \((c_{oc},c_i,N,N)\), and sagittal \((c_{os},c_i,N,N)\) kernels, where \(c_o=c_{oa}+c_{oc}+c_{os}\). Thereafter, it reshapes the view-specific kernels correspondingly into \((c_{oa},c_i,1,N,N)\), \((c_{oc},c_i,N,1,N)\), and \((c_{os},c_i,N,N,1)\). Like ST-3D, ACS-3D fuses the resulting features using channel-wise concatenation. Compared to the extra temporal kernels introduced by ST-3D, ACS-3D requires no extra model parameters, keeping the converted model light-weight. In our implementation, we empirically set the ratio \(c_{oa}:c_{oc}:c_{os}\) to 8:1:1, as the axial plane usually has the highest resolution.
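
To make the ACS-3D option concrete, below is a minimal PyTorch sketch of the kernel splitting and channel-wise fusion described above; the class name and the omission of bias/stride handling are simplifying assumptions and not the implementation of [24]. Because the split only re-views the existing 2D weights, the parameter count is unchanged relative to the original 2D layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ACSConv3d(nn.Module):
    """Sketch of an axial-coronal-sagittal (ACS) 3D convolution: a single
    (c_o, c_i, N, N) 2D weight is split 8:1:1 along the output channels and
    viewed as axial (1, N, N), coronal (N, 1, N), and sagittal (N, N, 1)
    kernels, whose outputs are fused by channel-wise concatenation."""

    def __init__(self, conv2d: nn.Conv2d, ratio=(8, 1, 1)):
        super().__init__()
        c_o, _, n, _ = conv2d.weight.shape
        total = sum(ratio)
        self.c_oa = c_o * ratio[0] // total
        self.c_oc = c_o * ratio[1] // total
        self.c_os = c_o - self.c_oa - self.c_oc
        self.weight = nn.Parameter(conv2d.weight.data.clone())
        self.pad = n // 2

    def forward(self, x):  # x: (B, c_i, D, H, W)
        w_a, w_c, w_s = torch.split(
            self.weight, [self.c_oa, self.c_oc, self.c_os], dim=0)
        p = self.pad
        y_a = F.conv3d(x, w_a.unsqueeze(2), padding=(0, p, p))  # axial view
        y_c = F.conv3d(x, w_c.unsqueeze(3), padding=(p, 0, p))  # coronal view
        y_s = F.conv3d(x, w_s.unsqueeze(4), padding=(p, p, 0))  # sagittal view
        return torch.cat([y_a, y_c, y_s], dim=1)                # channel-wise fusion
```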

VLD has two task-specific network heads, one to locate lesion centers and one to regress surface points. Before feeding the deep volumetric features into the heads, we use a feature pyramid network (FPN) [10] with three \((c_o,c_i,1,1,1)\) convolutional layers to fuse the outputs of the dense blocks, which helps VLD be robust to lesions of different sizes. Focusing first on the center regression head, it takes the output of the FPN (i.e., "deep feature" in Fig. 1) and processes it with an ACS-3D convolutional layer followed by a \((1,c_i,1,1,1)\) convolutional layer. Both layers are randomly initialized. Like CenterNet [28], the output is a 3D heat map, \(\hat{Y}\), that predicts lesion centers. The ground-truth heat map, Y, is generated as a Gaussian heat map with the radius in each dimension set to half of the target lesion's width, height, and depth. We use the focal loss [2, 11, 28] to train the center regression head:

$$\begin{aligned} \mathcal {L}_{ctr} = \frac{-1}{m} \sum _{xyz} \left\{ \begin{array}{ll} (1 - \hat{Y}_{xyz})^{\alpha } \log (\hat{Y}_{xyz}) & \text {if}\ Y_{xyz}=1 \\ (1-Y_{xyz})^{\beta } (\hat{Y}_{xyz})^{\alpha } \log (1-\hat{Y}_{xyz}) & \text {otherwise,} \end{array}\right. \end{aligned}$$
(1)

where m is the number of lesions in the CT and \(\alpha =2\) and \(\beta =4\) are focal-loss hyper-parameters [28]. The ground-truth heat map takes values <1 everywhere except at the lesion center voxels. Like recent work [2], when possible we also exploit hard negatives by assigning negative values to the corresponding regions of Y, which magnifies their loss contributions relative to 0-valued regions. See Cai et al. [2] for more details.
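
For reference, a minimal PyTorch sketch of Eq. (1) is given below; it covers only the standard positive/negative terms and omits the hard-negative weighting of [2], and the function name is our own.

```python
import torch


def center_focal_loss(y_hat, y, alpha=2.0, beta=4.0, eps=1e-6):
    """Sketch of Eq. (1): penalty-reduced focal loss for the center heat map.

    y_hat, y: tensors of the same shape, e.g. (D, H, W); voxels with y == 1
    are lesion centers, all other voxels are negatives down-weighted by
    (1 - y)^beta."""
    pos = (y == 1).float()
    m = pos.sum().clamp(min=1.0)  # number of lesion centers in the CT
    pos_loss = pos * (1 - y_hat).pow(alpha) * torch.log(y_hat.clamp(min=eps))
    neg_loss = (1 - pos) * (1 - y).pow(beta) * y_hat.pow(alpha) \
        * torch.log((1 - y_hat).clamp(min=eps))
    return -(pos_loss + neg_loss).sum() / m
```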

2.2 Surface Point Regression

The P3DC backbone and center regression head are effective at locating lesions. However, once a lesion is located, its extent must also be determined. To do this, we directly regress a 3D point set (specifically, offsets from the center point) using backbone features located at the center point:

$$\begin{aligned} \mathcal {P} = \{(x_k, y_k, z_k)\}_{k=1}^{n}, \end{aligned}$$
(2)

where n is the total number of points. This requires a \(1\times 1 \times 1\) convolution with 3n outputs. Empirically, we find \(n=16\) delivers the best results. Because \(\mathcal {P}\) is computed from center-point features, it may suffer from inaccuracies. Thus, we also compute offsets to refine \(\mathcal {P}\):

$$\begin{aligned} \mathcal {P}_r = \{(x_k + \varDelta x_k, y_k + \varDelta y_k, z_k + \varDelta z_k)\}_{k=1}^{n}, \end{aligned}$$
(3)

where \(\{(\varDelta x_k, \varDelta y_k, \varDelta z_k)\}\) are the predicted offsets of the refined surface points. To do this, for each location in \(\mathcal {P}\), we bilinearly interpolate corresponding backbone features and regress location-specific offsets. This only requires a \(1\times 1 \times 1\) convolution with 3 outputs. To actually supervise the \(\mathcal {P}\) and \(\mathcal {P}_{r}\) regression, we compute their minimum and maximum coordinates and ensure they match with the ground-truth bounding box. More formally, if we denote the ground-truth box using its top-right-front and bottom-left-rear corners \(\{(x_{trf}, y_{trf}, z_{trf}),\) \( (x_{blr}, y_{blr}, z_{blr})\}\), the regression of \(\mathcal {P}\) and \(\mathcal {P}_{r}\) can be trained using the following loss:

$$\begin{aligned} \mathcal {L}_{pts} = \sum _{i \in (x, y, z)} |i_{blr} - \min _{1\le k \le n}(i_k)| + |i_{trf} - \max _{1\le k \le n}(i_k)| \\ +|i_{blr} - \min _{1\le k \le n}(i_k + \varDelta i_k)| + |i_{trf} - \max _{1\le k \le n}(i_k + \varDelta i_k)| \mathrm {.} \end{aligned}$$
(4)
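
The following is a small PyTorch sketch of Eq. (4), assuming the point sets have already been decoded from the head's \(3n\)-channel output into explicit coordinates; the tensor shapes and the function name are our own assumptions.

```python
import torch


def surface_point_loss(pts, pts_refined, box_min, box_max):
    """Sketch of Eq. (4): the extremes of the point sets should match the
    ground-truth box corners under an L1 penalty.

    pts, pts_refined: (n, 3) tensors of (x, y, z) coordinates for P and P_r.
    box_min, box_max: (3,) tensors holding the bottom-left-rear and
    top-right-front corners of the ground-truth box."""
    loss = pts.new_zeros(())
    for p in (pts, pts_refined):
        loss = loss + (box_min - p.min(dim=0).values).abs().sum() \
                    + (box_max - p.max(dim=0).values).abs().sum()
    return loss
```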

One important limitation of (4) is that ellipsoid lesions do not fit cuboid boxes perfectly. As a result, regressed points may still satisfy (4) even if they lie outside the lesion but inside the box. Such points may be more prone to produce inaccurate offsets, i.e., (3), during inference. To address this, we propose an appearance-based similarity constraint that encourages points to fixate only on lesion surfaces, so that the point set correctly represents fine-grained lesion geometry. The idea is to force surface-point appearance to be more similar to regions inside the lesion than to those outside it. This constraint is realized by adding a triplet loss with the lesion center as the positive anchor (inside) and the box corners as negative anchors (outside). Specifically, we compute point-wise features from the center and eight corners of the bounding box with bilinear sampling and denote them as \(a^p\) and \(\{a^n_{j}\}_{j=1}^8\), respectively. We also extract point-wise features from \(\mathcal{P}_r\): \(\{a_{k}\}_{k=1}^n\). The triplet loss is then formulated as

$$\begin{aligned} \mathcal {L}_{tri} = \frac{1}{m} \sum _{k=1}^{n}\sum _{j=1}^{8} \max (0, \Vert a^p - a_k\Vert _2 - \Vert a^p - a^n_j\Vert _2 + 1). \end{aligned}$$
(5)
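
A minimal PyTorch sketch of this per-lesion triplet term follows; the feature shapes and margin follow Eq. (5), while the function name and where the \(1/m\) normalization is applied are our own assumptions.

```python
import torch


def surface_triplet_loss(a_pos, a_neg, a_pts, margin=1.0):
    """Sketch of the per-lesion term of Eq. (5).

    a_pos: (c,) feature sampled at the lesion center (positive anchor).
    a_neg: (8, c) features sampled at the box corners (negative anchors).
    a_pts: (n, c) features sampled at the refined surface points."""
    d_pos = (a_pts - a_pos[None, :]).norm(dim=1)   # ||a^p - a_k||, shape (n,)
    d_neg = (a_neg - a_pos[None, :]).norm(dim=1)   # ||a^p - a^n_j||, shape (8,)
    # Hinge over all (k, j) pairs; dividing by the number of lesions m is
    # assumed to happen when aggregating this term over the lesions in the CT.
    return (d_pos[:, None] - d_neg[None, :] + margin).clamp(min=0.0).sum()
```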

With the supervision of \(\mathcal {L}_{pts}\) and \(\mathcal {L}_{tri}\), we expect surface points to either move toward lesion surfaces or toward the center. This constitutes our surface point regression (SPR). The extracted point-wise features are designed to be semantic in nature (healthy versus lesion tissue); thus, complex lesion appearances, e.g., cavitations, should be mapped to a similar semantic space. We optimize the SPR head together with the center regression head by minimizing a joint loss function:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{ctr} + 0.1(\mathcal {L}_{pts} + \mathcal {L}_{tri}). \end{aligned}$$
(6)

2.3 Implementation Details

We implement our system in PyTorch [13] on four NVIDIA Quadro RTX 6000 GPUs. The P3DC backbone weights were initialized with the pre-trained Lesion Harvester weights [2], which were trained using the official DeepLesion data split, so there is no data leakage. We also tried ImageNet pre-trained weights and random initialization, but performance was not as good. All other layers were randomly initialized. The FPN's output, i.e., "deep feature" in Fig. 1, has 512 channels. In the task-specific heads, each ACS-3D layer consists of an ACS-3D convolutional layer with a kernel size of 3 and \(c_{oa}+c_{oc}+c_{os}=256\). The output channels of the lesion center heat map, \(\mathcal {P}\), \(\mathcal {P}_{r}\), and the point-wise features are 1, 48 (16 points), 3, and 128, respectively. We adopt the Adam [9] optimizer with a base learning rate of 0.0001, which was reduced by a factor of 10 after the validation loss reached its minimum value.
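
As an illustration of this optimization schedule, here is a self-contained PyTorch sketch; the tiny stand-in model, fake input, and toy loop are placeholders of our own, not the actual VLD training code.

```python
import torch
import torch.nn as nn

model = nn.Conv3d(1, 1, kernel_size=3, padding=1)          # stands in for VLD
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # base lr 0.0001
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1)                       # x10 reduction on plateau

for step in range(2):                                        # toy training loop
    x = torch.randn(1, 1, 8, 32, 32)                         # fake CT sub-volume
    loss = model(x).pow(2).mean()                            # stands in for Eq. (6)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())                              # pass the validation loss here
```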

3 Experimental Results

Datasets. We evaluate our approach on two datasets. DeepLesion [23] is a large-scale benchmark for ULD that comprises 32,735 retrospectively clinically annotated lesions from 10,594 CT scans of 4,427 unique patients. Many works report performance on DeepLesion, but most are either 2D [16, 17, 29] or 2.5D [20, 21]. We use the 3D annotations and hard negatives from [2] to both train and evaluate on DeepLesion. The volumetric test set of DeepLesion [2] includes 272 fully-annotated sub-volumes and more accurately reflects 3D lesion detection performance. HCC Liver Dataset: We also evaluate on our in-house dataset of 574 dynamic CT studies of patients with HCC liver lesions. HCC is one of the most fatal cancers, and detection at early stages is crucial. However, HCC often co-occurs with liver fibrosis, which complicates lesion discovery. Human sensitivities have been reported to be 48–57% for small-sized lesions [1]. We randomly split the dataset patient-wise into 384, 92, and 98 studies for training, validation, and testing, respectively.

Evaluation and Comparison Methods. A detected bounding box is regarded as correct when the 3D intersection-over-union (IoU) between the detected box and a ground-truth box exceeds 0.3. The FROC is used for evaluation. We first evaluate the different P3DC backbones: ST-3D, I3D, and ACS-3D. We also test a shallow fully 3D UNet [4] backbone within the CenterNet [28] framework, as well as the 2.5D Lesion Harvester [2], which reports the highest performance to date on the DeepLesion dataset. These two competitors directly regress a lesion's size using features sampled from the predicted lesion center and can also naturally learn from hard negatives [2]. In addition, we report results for CenterNet (2D) [28], Faster R-CNN (2.5D) [15], and MULAN (2.5D) [21], drawn from Cai et al.'s experiments [2]. This represents a comprehensive comparison across many different detector variants. To measure the impact of our proposed SPR, we also implement VLD with deep representative points (DRP) [26], which foregoes the appearance-based triplet loss. Finally, we evaluate our complete VLD framework: P3DC + SPR.
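
For clarity, the 3D IoU criterion can be computed as in the following small Python sketch (our own illustrative code, not part of the evaluation toolkit).

```python
def iou_3d(box_a, box_b):
    """3D intersection-over-union of two axis-aligned boxes given as
    (x1, y1, z1, x2, y2, z2), min corner first; a detection counts as
    correct when its IoU with a ground-truth box exceeds 0.3."""
    def volume(b):
        return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])

    inter = 1.0
    for lo_a, lo_b, hi_a, hi_b in zip(box_a[:3], box_b[:3], box_a[3:], box_b[3:]):
        inter *= max(min(hi_a, hi_b) - max(lo_a, lo_b), 0.0)
    return inter / (volume(box_a) + volume(box_b) - inter)


# Example: iou_3d((0, 0, 0, 10, 10, 10), (5, 5, 5, 15, 15, 15)) ~= 0.067
```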

Table 1. Sensitivities (%) at various FPs per CT volume.

Results. In Table 1, we compare our proposed approach against the alternative approaches. Using FROC analysis, the average sensitivities on DeepLesion are: CenterNet-2D \(27.9\%\), CenterNet-3D \(18.7\%\), Faster R-CNN \(25.4\%\), MULAN \(27.9\%\), Lesion Harvester \(31.9\%\), and our strongest P3DC variant \(36.4\%\). As can be seen, P3DC significantly outperforms the previous SOTA Lesion Harvester and MULAN methods, by \(4.5\%\) and \(8.5\%\) respectively, which validates the effectiveness of P3DC over its 2.5D counterparts. We choose ACS-3D over I3D because it produces comparable performance to I3D while keeping VLD light-weight.

From Table 1, we also observe that adding the original DRP method actually underperforms the baseline P3DC. This, in fact, motivated our development of SPR. The DRP method lacks explicit constraints on point locations, making it challenging to automatically learn effective point-wise features from CT images. In contrast, SPR introduces surface constraints to force the regressed points to distribute onto lesion surfaces. Tests on our in-house dataset also confirm that our proposed SPR improves sensitivities for HCC liver lesion detection.

While these results demonstrate the value of our P3DC backbone and SPR bounding-box regression, even more convincing conclusions can be drawn when analyzing performance based on lesion size. In DeepLesion, we use 2 cm and 5 cm as cut-off sizes. However, our HCC liver dataset has hardly any lesions smaller than 2 cm, so we only stratify based on a 5 cm cut-off. As Table 2 indicates, compared to Lesion Harvester, our P3DC backbone yields improvements of \(\sim 7\%\) sensitivity for small-size lesions in DeepLesion. These are the most critical lesions to detect, since they are the easiest for human observers to miss. Adding SPR boosts small-size performance even further, indicating that SPR's aggregation of boundary features can produce improved fine-grained bounding boxes. Moving to the HCC dataset, our SPR produces boosts in sensitivity of over \(4\%\) compared to direct CenterNet-style regression, further validating our SPR regression strategy. These are clinically significant performance improvements. Visual examples in Fig. 3 and our supplementary material depict SPR's more refined regression of bounding-box extents.

Table 2. Size-stratified sensitivities (%) at FP \(=1\) per CT volume. \(^{\dagger }\): P3DC+DRP produces FPs with high confidence; thus, at FP \(=1\), it has lower sensitivity than P3DC+SPR on HCC Liver.
Fig. 3. Visualization of different methods. We show an instance of a liver tumor overlaid with its ground-truth box in the 1\(^{st}\) column. In the 2\(^{nd}\), 3\(^{rd}\), and 4\(^{th}\) columns, we show the detection results from P3DC with general box regression, P3DC\(+\)DRP, and P3DC\(+\)SPR, respectively. For each example, we display the result in 3D and show three representative axial slices. We render the ground-truth box in green, the detection results in blue, and the regressed surface points, when applicable, in red. Best viewed in color. (Color figure online)

4 Conclusion

In this work, we tackle the challenges of lesion detection in CT scans by proposing VLD, a very deep volumetric lesion detection model. It processes CT scans directly in 3D to fully incorporate 3D context for better performance. It has a very deep backbone with large capacity, allowing it to handle lesions with large appearance variability, and its surface point regression head effectively estimates 3D lesion spatial extents. It also generalizes well to small-scale medical datasets, as it is light-weight and can be initialized with pre-trained 2D networks. Compared with 2D, 2.5D, and fully 3D variants, our method is superior in accuracy, model size, and speed (see our supplementary material). The proposed VLD achieves new state-of-the-art performance on the large-scale NIH DeepLesion dataset, and its generalization capability is further validated on our in-house HCC liver dataset.