
1 Introduction

Segmentation of objects of interest is an important task in medical image analysis. Benefiting from the development of deep neural networks and the accumulation of annotated data, fully convolutional networks (FCNs) have demonstrated remarkable performance [8, 19] in this task. In general, these models assume that the ground truth is given precisely. However, for tasks with a large number of category labels, the peripheral areas are often difficult to annotate due to ambiguous boundaries, the partial volume effect (PVE) [2], etc., and are likely to be labeled with uncertainty. With a limited amount of data, FCNs may have difficulty coping with such uncertainty, which in turn affects performance. Taking brain MRI as an example, Fig. 1 shows a slice of a multi-sequence MRI in which the pink area has boundaries that are barely, if at all, discernible from its surroundings, causing great difficulty in manual annotation.

Fig. 1.

Illustration of ambiguous boundaries in medical images with a slice of a multi-sequence brain MRI. The first three images show the MRI sequences, and the last shows the ground truth annotation. As we can see, the boundaries of the tissue marked in pink are barely or even not discernible from its surroundings. Best viewed in color.

To reduce the impact of imprecise boundary annotation, a potential solution is label softening, of which we are aware of only a few instances [5, 10, 11]. Based on the anatomical knowledge that lesion-surrounding pixels may also carry some lesion-level information, Kats et al. [11] employed 3D morphological dilation to expand the binary masks of multiple sclerosis (MS) lesions and assigned a fixed pseudo-probability to all pixels within the expanded region, such that these pixels could also contribute to the learning of MS lesions. Despite the improved Dice similarity coefficient in their experiments, the inherent contextual information of the images was not utilized when determining the extent of dilation or the exact value of the fixed pseudo-probability. To account for uncertainties in the ground truth segmentation of atherosclerotic plaque in the carotid artery, Engelen et al. [5] proposed to blur the ground truth mask with a Gaussian filter for label softening. One limitation of this work was that, similar to [11], the soft labels were created solely from the ground truth, ignoring the descriptive contextual information in the image. From another perspective, soft labels can also be obtained by fusing multiple manual annotations; e.g., in [10], masks of MS lesions produced by different experts were fused using a soft version of the STAPLE algorithm [22]. However, obtaining multiple segmentation annotations for medical images can be practically difficult. An alternative to label softening is label smoothing [16, 20], which assumes a uniform prior distribution over labels; yet this technique does not take the image context into consideration, either.

Fig. 2.

Pipeline of our proposed method.

In this paper, we propose a new label softening method driven by image contextual information, for improving segmentation performance especially near the boundaries between categories. Specifically, we employ the concept of superpixels [1] to utilize local contextual information. Via unsupervised over-segmentation, superpixels group the original image pixels into locally homogeneous blocks, which can be considered meaningful atomic regions of the image. Conceptually, if the scale of the superpixels is appropriate, pixels within the same superpixel block can be assumed to belong to the same category. Based on this assumption, if a superpixel intersects with an annotation boundary of the ground truth, we consider there to be a high probability of uncertain labeling within the area prescribed by this superpixel. Driven by this intuition, we soften the labels in this area based on the signed distance to the annotation boundary, producing probability values spanning the full range of [0, 1], in contrast to the original "hard" binary labels of either 0 or 1. The softened labels can then be used to train segmentation models. We evaluate the proposed approach on two publicly available datasets: the Grand Challenge on MR Brain Segmentation at MICCAI 2018 (MRBrainS18) dataset [7] and an optical coherence tomography (OCT) image dataset [3]. The experimental results verify the effectiveness of our approach.

2 Method

The pipeline of our method is illustrated in Fig. 2. We employ the SLIC algorithm [1] to produce superpixels, meanwhile converting the ground truth annotation to multiple one-hot label maps (the “hard” labels). Soft labels are obtained by exploiting the relations between the superpixels and hard label maps (the cross symbol \(\bigotimes \) in Fig. 2). Then, the soft and hard labels are used jointly to supervise the training of the segmentation network.

Superpixel-Guided Region of Softening. Our purpose is to model the uncertainty near the boundaries of categories in the manual annotation for improving model performance and robustness. For this purpose, we propose to exploit the relations between superpixels and the ground truth annotation to produce soft labels. Specifically, we identify three types of relations between a superpixel and the foreground region in a one-hot ground truth label map (Fig. 3): (a) the superpixel is inside the region, (b) the superpixel is outside the region, and (c) the superpixel intersects with the region boundary. As the superpixel algorithms [1] group pixels into locally homogeneous pixel blocks, pixels within the same superpixel can be assumed to belong to the same category given that superpixels are set to a proper size. Based on this assumption, it is most likely for uncertain annotations to happen in the last case, where the ground truth annotation indicates different labels for pixels inside the same superpixel block. Therefore, our label softening works exclusively in this case.

Formally, let us denote an image by \(x \in \mathbb {R}^{W \times H}\), where W and H are the width and height, respectively. (Without loss of generality, x can also be a 3D image \(x \in \mathbb {R}^{W \times H \times T}\), where T is the number of slices, and our method still applies.) The corresponding ground truth annotation can then be denoted by a set of one-hot label maps \(Y=\{y^c \,|\, y^c\in \mathbb {R}^{W \times H}\}_{c=1}^C\), where C is the number of categories and \(y^c\) is the binary label map for category c, in which each pixel \(y^c_i\in \{0, 1\}\), with \(i\in \{1,\ldots ,N\}\) the pixel index and N the total number of pixels; in addition, we denote the foreground area in \(y^c\) by \(\phi ^c\). We generate superpixel blocks \(S(x)=\{s^{(j)}\}_{j=1}^M\) for x using an over-segmentation algorithm, where M is the total number of superpixels. In this paper, we adopt SLIC [1] as the superpixel-generating algorithm, which is known for its computational efficiency and the quality of the generated superpixels. We denote the set of soft label maps to be generated by \(Q=\{q^c \,|\, q^c \in \mathbb {R}^{W \times H}\}_{c=1}^C\); note that \(q^c_i\in [0,1]\) is a continuous value, in contrast to the binary values in \(y^c\). As shown in Fig. 3, the relations between any \(\phi ^c\) and \(s^{(j)}\) fall into three categories: (a) \(s^{(j)}\) is inside \(\phi ^c\); (b) \(s^{(j)}\) is outside \(\phi ^c\); and (c) \(s^{(j)}\) intersects with the boundaries of \(\phi ^c\). In the first two cases, we copy the original values of \(y_i^c\) to the corresponding locations in \(q^c\). In the third case, we employ a label softening strategy that assigns a soft label \(q^c_i\) to each pixel i based on its distance to the boundaries of \(\phi ^c\), as described below.
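The three-way relation test can be sketched in plain NumPy (a toy illustration with our own function name, not the paper's implementation; in practice the label map would come from SLIC): a block whose pixels are all foreground is inside, all background is outside, and anything in between intersects the boundary.

```python
import numpy as np

def classify_superpixels(sp_labels, fg_mask):
    """Toy sketch of the three-way relation test: for each superpixel id,
    report whether the block lies inside the foreground region, outside it,
    or intersects its boundary. Only intersecting blocks are softened."""
    relations = {}
    for j in np.unique(sp_labels):
        frac = fg_mask[sp_labels == j].mean()  # foreground fraction in block
        if frac == 1.0:
            relations[j] = "inside"
        elif frac == 0.0:
            relations[j] = "outside"
        else:
            relations[j] = "intersect"
    return relations
```

Here `sp_labels` is any integer superpixel label map and `fg_mask` a binary foreground map \(\phi^c\); only the blocks reported as "intersect" proceed to the softening step.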

Fig. 3.

Illustration of three types of relations between the foreground region in a binary ground truth label map (GT) and a superpixel block (SP). (a) SP is inside GT; (b) SP is outside GT; (c) SP intersects with the boundaries of GT. We identify the region enclosed by the SP in the third case for label softening, based on the signed distances to the GT boundaries.

Soft Labeling with Signed Distance Function. Assume a superpixel block s intersects with the boundaries of a foreground region \(\phi \) (for simplicity, the superscripts are omitted here without confusion). For a pixel \(s_i\) in s, the absolute value of its distance \(d_i\) to \(\phi \) is defined as the minimum of the distances from \(s_i\) to all pixels on the boundaries of \(\phi \). We define \(d_i > 0\) if \(s_i\) is inside \(\phi \), and \(d_i \le 0\) otherwise. As mentioned above, when a superpixel block intersects with the boundaries of \(\phi \), we assign each pixel in this block a pseudo-probability as its soft label according to its distance to \(\phi \). The pseudo-probability should be 0.5 for a pixel right on the boundary (i.e., \(d_i=0\)), gradually approach 1 as \(d_i\) increases, and gradually approach 0 as \(d_i\) decreases. Thus, we define the distance-to-probability conversion function as

$$\begin{aligned} q_i=f_\mathrm {dist}(d_i)=\frac{1}{2}\left( \frac{d_i}{1+|d_i|}+1\right) , \end{aligned}$$
(1)

where \(q_i\in [0,1]\) is the obtained soft label for pixel i.
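Eq. (1) can be illustrated with a small, self-contained sketch (a brute-force distance computation meant only for toy arrays; the function name and the 4-neighbor boundary definition are our own illustrative choices):

```python
import numpy as np

def soft_labels_from_distance(mask):
    """Sketch of Eq. (1): convert a binary mask into soft labels via the
    signed distance to the annotation boundary. Brute-force O(N*B)
    distances; intended only for small arrays with a non-empty boundary."""
    H, W = mask.shape
    # Boundary: foreground pixels with at least one background 4-neighbor.
    p = np.pad(mask, 1)
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    boundary = mask.astype(bool) & ~interior.astype(bool)
    by, bx = np.nonzero(boundary)
    ys, xs = np.mgrid[0:H, 0:W]
    # Minimum Euclidean distance from every pixel to the boundary set.
    d = np.sqrt((ys[..., None] - by) ** 2 + (xs[..., None] - bx) ** 2).min(-1)
    d = np.where(mask.astype(bool), d, -d)  # d > 0 inside, d <= 0 outside
    return 0.5 * (d / (1.0 + np.abs(d)) + 1.0)  # Eq. (1)
```

For a 3×3 foreground square, the boundary ring maps to exactly 0.5, the single interior pixel to 0.75, and pixels farther outside approach 0, matching the behavior described above.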

Model Training with Soft and Hard Labels. We adopt the Kullback-Leibler (KL) divergence loss  [13] to supervise model training with our soft labels:

$$\begin{aligned} \mathcal {L}_\mathrm {KL} = \frac{1}{N}{\sum }_{i=1}^N {\sum }_{c=1}^C q_{i}^c \log \left( q_{i}^c/p_{i}^{c}\right) , \end{aligned}$$
(2)

where \(p_{i}^c\) is the predicted probability of the i-th pixel belonging to class c, and \(q_{i}^c\) is the corresponding soft label defined by Eq. (1). Along with \(\mathcal {L}_\mathrm {KL}\), we also adopt the Dice loss \(\mathcal {L}_\mathrm {Dice}\) [15] and cross-entropy (CE) loss \(\mathcal {L}_\mathrm {CE}\) commonly used for medical image segmentation. Specifically, the CE loss is defined as

$$\begin{aligned} \mathcal {L}_\mathrm {CE} = -\frac{1}{N}{\sum }_{i=1}^N{\sum }_{c=1}^C w_{c}y_{i}^c\log (p_i^c), \end{aligned}$$
(3)

where \(w_c\) is the weight for class c. When \(w_c=1\) for all classes, Eq. (3) is the standard CE loss. Alternatively, \(w_c\) can be set to class-specific weights to counteract class imbalance [17]: \(w_c= 1 / \log (1.02 + {\sum }_{i=1}^N y_i^c/N)\); we refer to this version as the weighted CE (WCE) loss. The final loss is defined as a weighted sum of the three losses: \(\mathcal {L} = \mathcal {L}_\mathrm {CE} + \alpha \mathcal {L}_\mathrm {Dice} + \beta \mathcal {L}_\mathrm {KL}\), where \(\alpha \) and \(\beta \) are hyperparameters balancing the three terms. We follow the setting in nnU-Net [8] to set \(\alpha =1.0\), and explore a proper value of \(\beta \) in our experiments, since it controls the relative contribution of the newly proposed soft labels, which are our main interest.
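The combined objective can be sketched as follows (a NumPy illustration of Eqs. (2) and (3) plus a standard soft Dice term, operating on flattened per-pixel probability arrays; function and variable names are ours, not the paper's PyTorch code):

```python
import numpy as np

def total_loss(p, q, y, alpha=1.0, beta=1.0, eps=1e-7):
    """Sketch of L = L_CE + alpha * L_Dice + beta * L_KL for arrays of
    shape (N, C): p = predicted probabilities, q = soft labels (Eq. 1),
    y = one-hot hard labels. w_c follows the WCE formula in the text."""
    N, C = y.shape
    w = 1.0 / np.log(1.02 + y.sum(axis=0) / N)                # WCE weights
    l_ce = -np.mean(np.sum(w * y * np.log(p + eps), axis=1))  # Eq. (3)
    inter = np.sum(p * y, axis=0)
    l_dice = 1.0 - np.mean(2.0 * inter / (p.sum(axis=0) + y.sum(axis=0) + eps))
    l_kl = np.mean(np.sum(q * np.log((q + eps) / (p + eps)), axis=1))  # Eq. (2)
    return l_ce + alpha * l_dice + beta * l_kl
```

A near-perfect prediction with hard targets (q = y) drives all three terms toward zero, while a wrong prediction is penalized by each of them.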

3 Experiments

Datasets. To verify the effectiveness of our method on both 2D and 3D medical image segmentation, we use datasets of both types for the experiments. The MRBrainS18 dataset [7] provides seven 3T multi-sequence (T1-weighted, T1-weighted inversion recovery, and T2-FLAIR) brain MRI scans with the following 11 ground truth labels: 0-background, 1-cortical gray matter, 2-basal ganglia, 3-white matter, 4-white matter lesions, 5-cerebrospinal fluid in the extracerebral space, 6-ventricles, 7-cerebellum, 8-brain stem, 9-infarction, and 10-other, among which labels 9 and 10 were officially excluded from the evaluation; we follow this setting. We randomly choose five scans for training and use the rest for evaluation. For preprocessing, the scans undergo skull stripping, nonzero cropping, resampling, and data normalization. The other dataset [3] includes OCT images with diabetic macular edema (the OCT-DME dataset) for the segmentation of retinal layers and fluid regions. It contains 110 2D B-scan images from 10 patients, with eight retinal layers and fluid regions annotated. We use the first five subjects for training and the last five for evaluation (55 B-scans in each set). Since the image quality of this dataset is poor, we first employ a denoising convolutional neural network (DnCNN) [23] to reduce image noise and improve the visibility of anatomical structures. To reduce memory usage, we follow He et al. [6] in flattening each retinal B-scan image to the estimated Bruch's membrane (BM) using an intensity gradient method [14] and cropping out the retina part.

Experimental Setting and Implementation. For the experiments on each dataset, we first establish a baseline trained without the soft labels. We then re-implement the Gaussian-blur-based label softening method [5], in which the value of \(\sigma \) is empirically selected, for comparison with our proposed method. Considering the class imbalance in both datasets, we present results using both the standard CE and the WCE losses for all methods. We notice that the Dice loss adversely affects performance on the OCT-DME dataset; therefore, those results are not reported. For a comprehensive evaluation, we use overlap-based, volume-based, and distance-based metrics [21]: the Dice coefficient, volumetric similarity (VS), 95th percentile Hausdorff distance (HD95), average surface distance (ASD), and average symmetric surface distance (ASSD). We employ a 2D U-Net [19] segmentation model (with the Xception [4] encoder) for the OCT-DME dataset, and a 3D U-Net [8] model for the MRBrainS18 dataset (patch-based training and sliding-window inference [9] are employed in the implementation). All experiments are conducted with the PyTorch framework [18] on a standard PC with an NVIDIA GTX 1080Ti GPU. The Adam optimizer [12] is adopted with a learning rate of \(3\times 10^{-4}\) and a weight decay of \(10^{-5}\). The learning rate is halved if the validation performance does not improve for 20 consecutive epochs. The batch size is fixed at 2 for the MRBrainS18 dataset and 16 for the OCT-DME dataset.

Results. The quantitative evaluation results are summarized in Table 1 and Table 2 for the MRBrainS18 and OCT-DME datasets, respectively. (Example segmentation results on both datasets are provided in the supplementary material.) As expected, the WCE loss produces better results than the standard CE loss for most evaluation metrics on both datasets. We note that the Gaussian-blur-based label softening [5] does not improve upon the baselines with either the CE or the WCE loss, but only obtains results comparable to those of the baselines. The reason might be that this method indiscriminately softens all boundary-surrounding pixels with a fixed standard deviation, irrespective of the actual image context, which may harm the segmentation near originally precisely annotated boundaries. In contrast, our proposed method consistently improves all metrics when the generated soft labels are used with the WCE loss. In fact, with this combination of losses, our method achieves the best performance for all evaluation metrics. It is also worth mentioning that, although our method is motivated by improving segmentation near category boundaries, it also noticeably improves the overlap-based evaluation metric (Dice) on the OCT-DME dataset. These results verify the effectiveness of our method in improving segmentation performance by modeling the uncertainty in manual labeling through the interaction between superpixels and ground truth annotations.

Table 1. Evaluation results on the MRBrainS18 dataset [7]. The KL divergence loss is used by our method for model training with our soft labels.
Table 2. Evaluation results on the OCT-DME dataset [3]. The KL divergence loss is used by our method for model training with our soft labels.

Ablation Study on the Number of Superpixels. The proper scale of the superpixels is crucial for our method, as superpixels of different sizes may describe different levels of image characteristics and thus interact differently with the ground truth annotation. Since in the SLIC [1] algorithm the size of the superpixels is controlled by the total number of generated superpixel blocks, we conduct experiments on the MRBrainS18 dataset to study how the number of superpixels influences performance. In Fig. 4, we show the performance of our method with the number of superpixels ranging from 500 to 3500 at an interval of 500. As the number of superpixels increases, the performance first improves, as more image details are incorporated, and then drops after reaching a peak. This is in line with our intuition: the assumption that pixels within the same superpixel belong to the same category holds only if the scale of the superpixels is appropriate. Overly large superpixels produce flawed soft labels; conversely, as the number of superpixels grows and their sizes shrink, the soft labels degenerate into hard labels, which provide no additional information.
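The degeneracy at the fine end of the scale can be seen directly (a toy illustration with our own variable names): with one superpixel per pixel, every block is uniform, so no block intersects the boundary and nothing gets softened.

```python
import numpy as np

# Toy check: single-pixel superpixels can never straddle a category
# boundary, so the soft labels reduce to the original hard labels.
fg = np.array([[1, 1, 0, 0],
               [1, 1, 0, 0]])
sp = np.arange(fg.size).reshape(fg.shape)  # one superpixel per pixel
intersecting = [j for j in np.unique(sp)
                if 0.0 < fg[sp == j].mean() < 1.0]
```

The list of intersecting blocks is empty, so every pixel keeps its hard label.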

Fig. 4.

Performances of our method with different numbers of superpixels on the MRBrainS18 dataset [7]. The HD95, ASD and ASSD are in mm. Best viewed in color.

Fig. 5.

Performances of our method using different values of \(\beta \) on the MRBrainS18 dataset [7]. The HD95, ASD and ASSD are in mm. Best viewed in color.

Ablation Study on the Weight of the Soft Label Loss. The weight \(\beta \) controls the contribution of the soft labels in training. To explore the influence of the soft label loss, we compare the performance of our method on the MRBrainS18 dataset with \(\beta \) set to 1/4, 1/2, 1, 2, 4, and 8. The mean Dice, HD95, ASD, and ASSD for these values of \(\beta \) are shown in Fig. 5 (note that the x-axis uses a log scale since the values of \(\beta \) differ by orders of magnitude). Improvements in performance are observed as \(\beta \) increases from 1/4 to 1. As \(\beta \) continues to increase, however, the segmentation performance starts to drop. This indicates that the soft labels are helpful for segmentation, although overemphasizing them may decrease the generalization ability of the segmentation model.

4 Conclusion

In this paper, we presented a new label softening method that was simple yet effective in improving segmentation performance, especially near the boundaries of different categories. The proposed method first employed an over-segmentation algorithm to group image pixels into locally homogeneous blocks called superpixels. Then, the superpixel blocks intersecting with the category boundaries in the ground truth were identified for label softening, and a signed distance function was employed to convert the pixel-to-boundary distances to soft labels within [0, 1] for pixels inside these blocks. The soft labels were subsequently used to train a segmentation network. Experimental results on both 2D and 3D medical images demonstrated the effectiveness of this simple approach in improving segmentation performance.