1 Introduction

Deep learning has driven considerable progress in object detection [13, 15, 16, 28, 37]. In the medical field, detection algorithms can reach performance comparable to that of clinical experts, e.g. in pulmonary nodule detection [18, 27, 33, 34]. Nonetheless, most approaches assume that the training/source and test/target data come from similar distributions. This assumption restricts their application in the real world, where there is often a nontrivial domain difference between the training data and the real-world test data; this domain shift causes significant performance degradation in the test/target domain. Hence, a great deal of effort has been directed towards cross-domain object detection [1, 2, 5, 7, 12, 23, 32, 38, 39] in recent years to enhance the performance of the source model on the target domain.

However, current approaches for cross-domain object detection still rest on an assumption that is improper for medical applications. They assume that the training samples from the source domain are freely accessible, whereas in reality medical data are usually not shareable due to privacy issues and only a pre-trained source model is accessible. Moreover, acquiring and annotating medical data are both time-consuming and costly, so training samples from the target domain are limited, which makes cross-domain object detection in the medical field very challenging. Considering these two aspects, we present a realistic but demanding setting: source data-free cross-domain detection of lung nodules. In this scenario, only a pre-trained source model and a few annotated samples from the target domain are available. As far as we know, this is the first work that tackles source data-absent cross-domain adaptation in the pulmonary nodule detection task.

The batch normalization (BN) [9] layers of a model normalize and modulate the features, and thus are closely tied to the model performance when there is a shift in data distribution. In cross-domain image classification and semantic segmentation tasks, some studies simply substitute the source batch statistics with the statistics of the current batch of the target domain [14]. Some studies combine the statistics of both source and target [36]. Some other studies [31, 35] pay attention to the target statistics, and minimize entropy loss to optimize the affine parameters as well. Nevertheless, these methods are either too weak or not applicable for cross-domain object detection.

In our cross-domain pulmonary nodule detection setting, which does not rely on source data, we propose adapting to the target domain by reducing the entropy of the model predictions. However, the original entropy [25] currently supports only image classification and segmentation. We address this problem by extending entropy to a detection variant, termed Generalized Entropy (GE). We choose entropy for its ability to quantify uncertainty and shifts: low-entropy predictions are generally more reliable, and high-entropy predictions indicate larger shifts. To better utilize the source information and adapt efficiently, we optimize only the affine parameters and estimate the target dataset-level statistics in the batch normalization layers via entropy minimization. This step enables us to learn a target-specific feature encoding module under the same detection head, without requiring access to the source data or the labels of the target data.

To enhance the detection performance further and alleviate the common problem of rater disagreement in the medical field, we also fine-tune the detection head of the model using annotated samples from the target domain.

Our primary contributions are summarized as follows:

  • We establish a source data-free setting for cross-domain lung nodule detection, utilizing merely a well-trained source model and a limited number of labeled target samples.

  • We propose a novel method that adapts the feature extraction module of the model to the target domain via Generalized Entropy (GE) minimization. We further fine-tune the detection head with labeled target samples to improve the adaptation performance.

  • For the purpose of evaluation, we curate a benchmark using four widely used pulmonary nodule datasets.

Experiments on the benchmark show that our method achieves state-of-the-art results, demonstrating its effectiveness.

Fig. 1. The pipeline of our proposed method. The source model is composed of a feature encoding module and a detection head module. (a) We keep the detection head frozen, and adapt the batch normalization (BN) layers in the feature extraction module by minimizing our Generalized Entropy (GE) to obtain target dataset-level statistics. (b) The detection head is fine-tuned using a small fraction of target data with labels.

2 Method

For a vanilla cross-domain adaptation (DA) task, we have \(N^{s}\) labeled samples \(\{x^{s}_{i},y^{s}_{i}\}^{N^{s}}_{i=1}\) from the source domain and also \(N^{t}\) labeled samples \(\{x^{t}_{i},y^{t}_{i}\}^{N^{t}}_{i=1}\) from the target domain. The main goal of DA is to address the domain shift between the source domain and the target domain, so as to accurately predict the labels \(\{y^{t}_{i}\}^{N^{t}}_{i=1}\) in the target domain. In this work, we assume that samples from the source domain cannot be obtained because of privacy concerns. Instead of the source dataset, we are given a well-trained source model \(f_{\theta }(x)\) with parameters \(\theta \). Based on this assumption, we present source data-free cross-domain pulmonary nodule detection, and aim to learn a target model from the given well-trained source model \(f_{\theta }(x)\) and the target samples \(\{x^{t}_{i},y^{t}_{i}\}^{N^{t}}_{i=1}\).

Our method comprises two steps, as shown in Fig. 1. First, the feature extraction module of the well-trained source model is adjusted to the target domain using unsupervised learning. Specifically, the batch normalization (BN) layers of the model are optimized by minimizing an entropy loss to obtain target dataset-level statistics, for which a general form of entropy termed Generalized Entropy (GE) is proposed. Then, using the annotated target samples, we employ supervised learning to fine-tune the detection head of the model, which mitigates rater differences and enhances performance. In the following, we first revisit two ways of modeling the uncertainty of the bounding box, the probability distribution representation and localization quality estimation, and then elaborate on our method in detail.

Preliminaries. There are two conventional representations for the bounding box \(\mathcal {B}\) in detection. In the pulmonary nodule detection task, a bounding box is denoted either by the central point coordinates, width, height, and depth, \(\{a,b,c,w,h,d\}\) [3, 17, 21], or by the distances from the sampling point to the up, down, top, bottom, left, and right planes, \(\{u,d,t,b,l,r\}\) [28]. According to [37], there is no performance difference between the two representations. In this work, relative offsets from the sampling point to the six planes of a bounding box \(\mathcal {B} = \{u,d,t,b,l,r\}\) are used as the regression targets, since the physical meaning of each variable in \(\{u,d,t,b,l,r\}\) is consistent. A box given in the \(\{a,b,c,w,h,d\}\) form is converted to the \(\{u,d,t,b,l,r\}\) form.
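
The conversion between the two representations can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `center_size_to_offsets` is hypothetical, and the pairing of the plane names with the x, y, and z axes is an assumption, since the text does not specify it.

```python
def center_size_to_offsets(box, point):
    """Convert a centre/size box into six plane offsets measured from `point`.

    box   = (a, b, c, w, h, d): centre coordinates and extents along x, y, z.
    point = (px, py, pz): sampling point in the same voxel coordinates.
    The mapping l/r -> x, t/b -> y, u/d -> z is an assumption for illustration.
    """
    a, b, c, w, h, d = box
    px, py, pz = point
    return {
        "l": px - (a - w / 2.0), "r": (a + w / 2.0) - px,
        "t": py - (b - h / 2.0), "b": (b + h / 2.0) - py,
        "u": pz - (c - d / 2.0), "d": (c + d / 2.0) - pz,
    }


offsets = center_size_to_offsets((10, 10, 10, 4, 6, 8), (11, 10, 9))
inside = all(v > 0 for v in offsets.values())  # all offsets positive: point is inside the box
```

Under this convention, a sampling point lies inside the box exactly when all six offsets are positive, which is the usual condition for treating it as a positive location in FCOS-style heads.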

Yet this form follows the Dirac delta distribution that only concentrates on the ground-truth locations, and is too rigid to reflect the ambiguity of bounding boxes [6, 13]. Recently, some works [13, 20] adopt the probability distribution representation of the bounding box to learn its localization uncertainty. Let \(y \in \mathcal {B}\) be the distance to a certain plane of a bounding box, whose estimated value \(\hat{y}\) can be represented as:

$$\begin{aligned} \hat{y} = \int _{y_{min}}^{y_{max}} s\Pr (s)ds, \end{aligned}$$
(1)

where s is the regression distance in the range \([y_{min}, y_{max}]\), and \(\Pr (s)\) is the corresponding probability. Then, to be compatible with convolutional neural networks, the continuous regression range \([y_{min}, y_{max}]\) is converted into a uniform discretized representation, \(\{y_{0},y_{1},...,y_{i},y_{i+1},...,y_{n-1},y_{n}\}\) with even intervals \(\varDelta \), where \(\varDelta = y_{i+1} - y_{i}, \forall i \in [0, n-1]\), \(y_{0} = y_{min}\), and \(y_{n} = y_{max}\). Thus, the estimated value \(\hat{y}\) becomes:

$$\begin{aligned} \hat{y} = \sum _{i=0}^{n}\Pr (y_{i})y_{i}, \end{aligned}$$
(2)

where \(\sum _{i=0}^{n}\Pr (y_{i}) = 1\), and \(\Pr (y_{i})\) can be easily implemented using a SoftMax function with \(n+1\) outputs. In this way, the uncertainty of the bounding box offsets is modeled.
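
As a small worked illustration of Eq. (2), the sketch below turns \(n+1\) raw network outputs into an expected offset. The helper names `softmax` and `expected_offset` are hypothetical, and plain Python lists stand in for network tensors.

```python
import math


def softmax(logits):
    """Numerically stable SoftMax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_offset(logits, y_min, y_max):
    """Eq. (2): expectation of the discretized offset distribution.

    `logits` holds n + 1 scores, one per bin y_0 .. y_n (requires n >= 1).
    """
    n = len(logits) - 1
    delta = (y_max - y_min) / n                       # even interval between bins
    bins = [y_min + i * delta for i in range(n + 1)]  # y_0 = y_min, y_n = y_max
    probs = softmax(logits)                           # Pr(y_i), sums to 1
    return sum(p * y for p, y in zip(probs, bins))
```

A flat distribution yields the midpoint of the range, while a sharply peaked one yields a value close to the peaked bin; the spread of the distribution is what carries the localization uncertainty.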

There is another simple way to model the localization uncertainty of the bounding box, namely localization quality estimation in the form of an IoU [30] or centerness [28] score. The centerness [28] measures how close a location is to the center of its corresponding object. Given the regression targets \(u^{*}, d^{*}, t^{*}, b^{*}, l^{*}\), and \(r^{*}\) for a sampling point, the centerness \(\hat{y}\) can be defined as:

$$\begin{aligned} \hat{y} = \sqrt{\frac{\min (u^{*}, d^{*})}{\max (u^{*}, d^{*})} \times \frac{\min (t^{*}, b^{*})}{\max (t^{*}, b^{*})} \times \frac{\min (l^{*}, r^{*})}{\max (l^{*}, r^{*})}}. \end{aligned}$$
(3)

In our method, we employ the centerness [28] score measurement for its simplicity and good performance in pulmonary nodule detection.
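
Eq. (3) can be evaluated directly, as in the following minimal sketch. The function name `centerness` is hypothetical, and the code assumes all six regression targets are strictly positive, i.e. the sampling point lies inside the box.

```python
import math


def centerness(u, d, t, b, l, r):
    """Eq. (3): centerness of a sampling point from its six plane offsets.

    Assumes every offset is strictly positive (point inside the box).
    """
    ratio = (
        (min(u, d) / max(u, d))
        * (min(t, b) / max(t, b))
        * (min(l, r) / max(l, r))
    )
    return math.sqrt(ratio)
```

The score equals 1 when the point is equidistant from each pair of opposite planes and decays towards 0 as the point approaches any face of the box.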

2.1 Feature Extractor Adaptation

Entropy Objective. Our training goal is to reduce the entropy \(H(\hat{y})\) of the model detection results \(\hat{y} = f_{\theta }(x^{t})\). Entropy is an unsupervised measure of uncertainty, yet it remains tied to the supervised task and the model. However, the Shannon entropy [25] in its current form only supports classification. Therefore, we propose Generalized Entropy (GE), which generalizes the Shannon entropy [25] to dense detectors. Assume that a model’s final prediction \(\hat{y}\) is the linear combination of two variables \(\hat{y} = y_{l}p_{y_{l}} + y_{r}p_{y_{r}}, (y_{l} \le \hat{y} \le y_{r})\), where \(p_{y_{l}}, p_{y_{r}} (p_{y_{l}} \ge 0, p_{y_{r}} \ge 0, p_{y_{l}} + p_{y_{r}} = 1)\) are the probabilities estimated by the model for these two variables, respectively. The proposed GE covers the three special cases of the Generalized Focal Loss (GFL) [13] for dense detectors:

When \(\beta = \gamma , y_{l} = 0, y_{r} = 1, p_{y_{r}} = p, p_{y_{l}} = 1 - p\) and \(y \in \{1, 0\}\) in GFL [13], GE for focal loss (FL) can be written as:

$$\begin{aligned} H(p) = - ((1 - \alpha ) p^{\gamma }(1-p)\log (1-p) + \alpha (1-p)^{\gamma }p\log (p)). \end{aligned}$$
(4)

When \(y_{l} = 0, y_{r} = 1, p_{y_{r}} = \sigma \) and \(p_{y_{l}} = 1 - \sigma \) in GFL [13], GE for quality focal loss (QFL) can be written as:

$$\begin{aligned} H(\sigma ) = - (\sigma ^{\beta }(1-\sigma )\log (1-\sigma ) + (1-\sigma )^{\beta }\sigma \log (\sigma )). \end{aligned}$$
(5)

When \(\beta = 0, y_{l} = y_{i}, y_{r} = y_{i+1}, p_{y_{l}} = \Pr (y_{l}) = \Pr (y_{i}) = \mathcal {S}_{i}\) and \(p_{y_{r}} = \Pr (y_{r}) = \Pr (y_{i+1}) = \mathcal {S}_{i+1}\) in GFL [13], GE for distribution focal loss (DFL) can be written as:

$$\begin{aligned} H(\mathcal {S}_{i}, \mathcal {S}_{i+1}) = - (\mathcal {S}_{i}\log (\mathcal {S}_{i}) + \mathcal {S}_{i+1}\log (\mathcal {S}_{i+1})). \end{aligned}$$
(6)
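
The three GE variants in Eqs. (4) to (6) can be written out as scalar functions, as in the sketch below. The function names are hypothetical, and the default values \(\alpha = 0.25\), \(\gamma = 2\), and \(\beta = 2\) are assumptions taken from common focal-loss practice rather than values stated in this section. The convention \(0\log 0 = 0\) is used at the boundaries.

```python
import math


def _xlogx(p):
    """p * log(p) with the convention 0 * log(0) = 0."""
    return p * math.log(p) if p > 0 else 0.0


def ge_fl(p, alpha=0.25, gamma=2.0):
    """Eq. (4): GE for focal loss; alpha and gamma defaults are assumptions."""
    return -((1 - alpha) * p ** gamma * _xlogx(1 - p)
             + alpha * (1 - p) ** gamma * _xlogx(p))


def ge_qfl(sigma, beta=2.0):
    """Eq. (5): GE for quality focal loss; the beta default is an assumption."""
    return -(sigma ** beta * _xlogx(1 - sigma)
             + (1 - sigma) ** beta * _xlogx(sigma))


def ge_dfl(s_i, s_next):
    """Eq. (6): GE for distribution focal loss over two adjacent bins."""
    return -(_xlogx(s_i) + _xlogx(s_next))
```

With \(\beta = 0\), Eq. (5) reduces to the binary Shannon entropy, so `ge_qfl(0.5, beta=0.0)` and `ge_dfl(0.5, 0.5)` both evaluate to \(\log 2\); confident predictions near 0 or 1 drive all three quantities towards zero, which is what the minimization exploits.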

Modulation Parameters. As shown in Fig. 1, the pulmonary nodule detection network \(f_{\theta }(x)\) is composed of two modules: the feature encoding module \(g_{\theta }: x \rightarrow \mathbb {R}^{d}\) and the detection head module \(h_{\theta }: \mathbb {R}^{d} \rightarrow \mathbb {R}^{K}\); \(f_{\theta }(x) = h_{\theta }(g_{\theta }(x))\), d and K are dimensions of the extracted feature and the model output respectively. To keep the same hypothesis \(h_{\theta }\), a natural choice of the modulation parameters is all the feature extractor parameters \(g_{\theta }\); however, altering \(g_{\theta }\) may cause the model to diverge from its training, since \(\theta \) is the only representation of the source data in our setting. Besides, the limited number of training samples from the target domain is not suitable for optimizing the high dimensional \(\theta \). Previous works [31, 35] find that adapting the batch statistics, especially dataset-level statistics, is effective for domain adaptation. Considering the feature modulation ability and low dimensional computation of the batch normalization (BN) layers, we choose to update the BN layers during training. Inside the BN layer, there are two sets of parameters: the statistics \((\mu , \sigma )\), which normalize the feature, and the affine parameters \((\beta , \gamma )\), which modulate the feature. Given a batch of target samples \(\{x_{i}^{t}\}_{i=1}^{B}\), where B is the batch size, the outputs of the BN layer \(\{{x_{i}^{t}}^{\prime }\}_{i=1}^{B}\) are calculated as:

$$\begin{aligned} {x_{i}^{t}}^{\prime } &= \gamma \overline{x_{i}^{t}} + \beta = \gamma \frac{x_{i}^{t} - \mu }{\sigma } + \beta , \\ \mu &= \mathbb {E}[x_{i}^{t}], \sigma ^{2} = \mathbb {E}[(x_{i}^{t} - \mu )^{2}]. \end{aligned}$$

In the meantime, a running mean vector \(\mu _{r}\) and a running variance vector \(\sigma _{r}^{2}\) are estimated using a moving average to derive dataset-level statistics for the target domain:

$$\begin{aligned} \mu _{r} = \lambda \mu + (1 - \lambda )\mu _{r}, \sigma _{r}^{2} = \lambda \sigma ^{2} + (1 - \lambda ) \sigma _{r}^{2}. \end{aligned}$$
(7)

The affine parameters \((\beta , \gamma )\) are optimized via minimizing the GE loss.
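
The normalization and the running-statistics update of Eq. (7) can be sketched for a single feature channel as follows. This is an illustrative sketch only: the function name `bn_forward` is hypothetical, the small constant `eps` is an assumption added for numerical stability (it does not appear in the equations above), and the value of \(\lambda \) is likewise assumed.

```python
import math


def bn_forward(batch, gamma, beta, running_mean, running_var, lam=0.1, eps=1e-5):
    """Normalize one channel of a target batch and update running statistics.

    batch: list of scalar activations for one channel.
    Returns (outputs, new_running_mean, new_running_var).
    `eps` and the default `lam` are assumptions, not values from the paper.
    """
    mu = sum(batch) / len(batch)
    var = sum((x - mu) ** 2 for x in batch) / len(batch)
    # Affine modulation of the normalized feature.
    out = [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in batch]
    # Eq. (7): moving average towards dataset-level target statistics.
    new_mean = lam * mu + (1 - lam) * running_mean
    new_var = lam * var + (1 - lam) * running_var
    return out, new_mean, new_var
```

In this sketch only `gamma` and `beta` would receive gradients from the GE loss, while the running statistics are refreshed by the forward passes over target batches without any gradient step.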

2.2 Detection Head Adaptation

Transfer learning by fine-tuning is a common way to adjust a well-trained network to a new domain. To enhance the performance of pulmonary nodule detection even further, we tune the detection head of the model \(h_{\theta }\) using the training samples from the target domain \(\{x^{t}_{i},y^{t}_{i}\}^{N^{t}}_{i=1}\). Meanwhile, this can also alleviate the issue of rater disagreement between different datasets, a common problem in the medical field.

3 Experiments

3.1 Benchmark and Evaluation

We establish a benchmark of domain shifts from PN9 [18] to LUNA16 [24], tianchi [29], and russia [19], as shown in Fig. 2. The specifics of these datasets are listed in Table 1. As can be seen, the CT scans in these datasets, which are gathered from various sites, have different image sizes and voxel sizes. Table 2 shows the lung nodule size and quantity distribution of the four datasets.

Recall that vanilla domain adaptation requires the use of the labeled source data, while our setting denies the use of source data PN9 [18] during adaptation. We take into account only those CT scans having publicly available nodule annotations. The annotation files of the four datasets are csv files. Each line of the files holds the information of one nodule, including the CT scan filename it belongs to, and its location. In the three target datasets, the nodule location is indicated by the center coordinates and diameter, whereas in PN9 [18], it is marked by the top-left and bottom-right coordinates.

Fig. 2. Samples from four lung nodule datasets are shown, with each column corresponding to a dataset as marked. CT images from different datasets exhibit domain discrepancy in, for instance, color contrast/saturation, voxel intensity, image spacing, and number of nodules.

Fig. 3. Samples of the pre-processed images from LUNA16, tianchi, and russia. The 1st row contains the raw images, the 2nd row shows the extracted lung regions, and the 3rd row displays the pre-processed images.

Table 1. Pulmonary nodule datasets. ‘Scans’ and ‘Class’ indicate the number of CT scans and the class, respectively. ‘Raw’ denotes whether the CT images in the dataset are pre-processed. ‘Image Size’ refers to the CT image matrix size in the direction of the x, y, and z axes. ‘Spacing’ denotes the voxel sizes (mm) in the direction of the x, y, and z axes.
Table 2. Distribution of the pulmonary nodule size. ‘d’ indicates the nodule diameter (mm).

LUNA16 [24], tianchi [29], and russia [19] are each split 7/1/2 into training, validation, and testing sets. In these three datasets, the raw CT data undergo three pre-processing steps: 1) We use lungmask [8] to extract lung regions from each CT image and mask other regions to minimize irrelevant calculations. In this process, the HU values of the raw CT data are clipped into the range \([-1200,600]\) and then linearly converted into the range [0, 255], resulting in uint8 values. Then we set a padding value of 170 for regions outside the lung masks. 2) To avoid an excess of unnecessary hyper-parameters, all the CT images are resampled to a spacing of (1.00, 1.00, 1.00) mm, ensuring consistency for the anchor design across all detectors. 3) To further improve the computational efficiency, we crop the CT images according to the extracted lung masks. Figure 3 shows CT image samples after pre-processing. For the PN9 [18] dataset, the data pre-processing procedure is kept the same as in [18]. In our experiments, voxel coordinates are used. The nodule locations in the annotation files are recalculated according to our pre-processing procedures and the voxel coordinates.
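
The intensity part of step 1 can be sketched per voxel as below. The function names are hypothetical, plain Python lists stand in for CT volumes, and rounding to the nearest integer is an assumption, since the text does not say whether values are rounded or truncated.

```python
def hu_to_uint8(hu, lo=-1200.0, hi=600.0):
    """Clip an HU value to [lo, hi] and map it linearly onto 0..255.

    Rounding to the nearest integer is an assumption for illustration.
    """
    clipped = min(max(hu, lo), hi)
    return int(round((clipped - lo) / (hi - lo) * 255.0))


def preprocess_voxels(hu_values, lung_mask, pad_value=170):
    """Rescale voxels inside the lung mask; set all others to the padding value."""
    return [
        hu_to_uint8(v) if inside else pad_value
        for v, inside in zip(hu_values, lung_mask)
    ]


print(preprocess_voxels([-1200, 600, 0], [True, True, False]))  # [0, 255, 170]
```

In practice the same arithmetic would be applied to whole arrays at once; the per-voxel form is used here only to keep the mapping explicit.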

As the evaluation metric, we select the Free-Response Receiver Operating Characteristic (FROC), a commonly used measure for pulmonary nodule detection. It is calculated by averaging the sensitivities at 0.125, 0.25, 0.5, 1, 2, 4, and 8 false positives per scan. We also use the detection sensitivity at 8 false positives per image for evaluation, since false positives in the medical field are preferable to false negatives. A detected nodule is counted as a true positive if there exists an annotated nodule such that the distance between the center points of the detected nodule and the annotated nodule is smaller than the radius R of the annotated nodule. Otherwise, the detected nodule is considered a false positive.
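
The hit criterion and the averaging step of the FROC score can be sketched as follows. The names `is_hit` and `froc_score` are hypothetical, and the sketch assumes the sensitivities at the seven operating points have already been read off the FROC curve.

```python
import math

# The seven false-positives-per-scan operating points used for the FROC score.
FP_RATES = (0.125, 0.25, 0.5, 1, 2, 4, 8)


def is_hit(det_center, gt_center, gt_diameter):
    """True if the detection centre lies within radius R of the annotated centre."""
    return math.dist(det_center, gt_center) < gt_diameter / 2.0


def froc_score(sensitivities):
    """Average the sensitivities at the seven operating points.

    `sensitivities` maps each FP rate in FP_RATES to a sensitivity in [0, 1].
    """
    return sum(sensitivities[r] for r in FP_RATES) / len(FP_RATES)
```

The comparison is strict, so a detection whose centre lies exactly at distance R from the annotated centre is not counted as a hit.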

3.2 Implementation Details

In our experiments, we employ the same backbone as SANet [18], which allows us to use the weights pre-trained on PN9 [18] for source model training. Concretely, the backbone is U-shaped [22], consisting of a 3D ResNet50 [4] equipped with Slice Grouped Non-local modules [18] and a decoder. Different from [18], the backbone is followed by FPN [15] as the neck, and an FCOS-style [28] anchor-free head for classification and localization. The network is optimized using Stochastic Gradient Descent (SGD). The training batch size of the 3D patches is 16. We adopt the patch-based input strategy for training and use the complete 3D volume for inference, as in [18]. The learning rate, the momentum, and the weight decay coefficient are fixed at 0.001, 0.9, and \(1\times 10^{-4}\), respectively. To obtain the source model, the network is trained for a maximum of 30 epochs. For learning in the target domain, we tune the pre-trained source model for 1 epoch. For other training and testing hyper-parameters, we follow [28], and adapt some hyper-parameters to the task of detecting pulmonary nodules. We use FPN [15] with two levels, a detection head with two classification/regression towers, and a radius of 3. All the experiments are carried out with PyTorch on four NVIDIA GeForce RTX 3090 GPUs, each having 24 GB of memory.

Fig. 4. FROC curves of our method and the baseline on target dataset russia using 60% of its training set.

Fig. 5. FROC curves of our method and the baseline on target dataset russia using 80% of its training set.

3.3 Results

We evaluate the proposed method against a baseline approach, which simply fine-tunes all the parameters of the source model using the labeled samples from the target domain. Experiments are conducted with 20%, 40%, 60%, 80%, and 100% of the labeled training samples from each target domain, and the results are reported on the whole target testing sets. Table 3 lists the experimental results of our method and the baseline on target datasets LUNA16 [24] and tianchi [29]. Our method clearly outperforms the baseline and adapts more efficiently. It is especially noteworthy that the feature extraction module adaptation alone, i.e. the first step of our method, which uses no labeled training samples from the target domain, already yields good performance. This shows the potential of our method in wilder and more challenging settings. Nonetheless, the performance of our method on target dataset russia [19] is unsatisfactory, probably due to its larger shift from the source. To allow for more adaptation, we tune all the parameters of the model in our second step on russia [19]. As listed in Table 4, our method obtains better FROC scores for lung nodule detection than the baseline, which verifies the effectiveness of the proposed adaptation via entropy minimization. The FROC curves illustrated in Fig. 4 and Fig. 5 further confirm the superiority of our method.

Table 3. Comparison of our method and the baseline on target datasets LUNA16 and tianchi w.r.t. the percentage of their training sets. The values are pulmonary nodule detection sensitivities (unit: %) at 8 false positives per CT image, with each column indicating the percentage of the training set.
Table 4. Comparison of our method and the baseline on target dataset russia w.r.t. the percentage of its training set. The values are FROCs (unit: %), with each column indicating the percentage of the training set.

4 Related Works

Recently, some works propose adapting the trained model at test time. This line of study originates from works that recalculate the batch statistics [14]. Test-time training (TTT) [26] relies on a proxy task, which requires altering the training of the entire model on the source, and then adapts to the target using self-supervised learning. Tent [31] optimizes the affine parameters of the batch normalization layers of the model via entropy minimization. This is demonstrated to be effective for robustness and source-free domain adaptation tasks. In [36], the authors replace the target statistics used in Tent with mixed source and target statistics. T3A [10] utilizes centroid-based modification to adapt the classifier at test time for domain generalization. In [35], the authors revisit batch normalization in the training process and develop a test-time batch normalization layer design named GpreBN, which is optimized during testing by minimizing an entropy loss. This newly designed batch normalization operation preserves the same gradient backpropagation form as in training and uses dataset-level statistics for robust optimization and inference. Unfortunately, all these works focus on image classification or semantic segmentation [11], and may not work well on object detection. In contrast, our method revisits the batch statistics for cross-domain pulmonary nodule detection, delving into a model optimization method specific to detection.

5 Conclusion

In this paper, we introduce a source data-free setting for cross-domain lung nodule detection and present a method to tackle it, requiring only a pre-trained source model and a limited number of annotated samples from the target domain. Specifically, our method adapts the feature extraction module of the model by minimizing the proposed Generalized Entropy (GE) loss, and tunes the detection head with labeled target samples to further enhance the detection performance. Experiments on our established benchmark verify that our method is an effective way to address cross-domain object detection where data privacy issues are involved. To the best of our knowledge, this is the first work on cross-domain pulmonary nodule detection without access to the source data. We also hope that this work in the medical field can bring insights into the general object detection field. In the future, we plan to pursue adaptation to more and harder types of shifts.