1 Introduction

Automatic segmentation of structures is a fundamental task in medical image analysis. Segmentations either serve as an intermediate step in a more elaborate pipeline or as an end goal in themselves. The clinical interest often lies in the volume of a certain structure (e.g. the volume of a tumor, the volume of a stroke lesion), which can be derived from its segmentation [11]. The segmentation task can also carry inherent uncertainty (e.g. due to noise, lack of contrast, artifacts or incomplete information).

To evaluate and compare the quality of a segmentation, the similarity between the true segmentation (i.e. the segmentation derived from an expert’s delineation of the structure) and the predicted segmentation must be measured. For this purpose, multiple metrics exist. Among others, overlap measures (e.g. Dice score, Jaccard index) and surface distances (e.g. Hausdorff distance, average surface distance) are commonly used [13].

The focus on one particular metric, the Dice score, has led to the adoption of a differentiable surrogate loss, the so-called soft Dice [9, 15, 16], for training convolutional neural networks (CNNs). Many state-of-the-art methods use soft Dice as the loss function and clearly outperform their counterparts trained with the established cross-entropy losses [7, 12].

In this work, we investigate the effect on volume estimation when optimizing a CNN w.r.t. cross-entropy or soft Dice, and relate this to the inherent uncertainty in a task. First, we look into this volumetric bias theoretically, with some numerical examples. We find that the use of soft Dice leads to a systematic under- or overestimation of the predicted volume of a structure, which is dependent on the inherent uncertainty that is present in the task. Second, we empirically validate these results on four medical tasks: two tasks with relatively low inherent uncertainty (i.e. the segmentation of third molars from dental radiographs [8], BRATS 2018 [4,5,6, 14]) and two tasks with relatively high inherent uncertainty (i.e. ISLES 2017 [2, 18], ISLES 2018 [3]).

2 Theoretical Analysis

Let us formalize an image as I voxels, each voxel corresponding to a true class label \(c_{i}\) with \(i=0 \dots I-1\), forming the true class label map \(C=[c_{i}]^{I}\). Typical in medical image analysis is the uncertainty of the true class label map C (e.g. due to intra- and inter-rater variability; see Sect. 2.2). Under the assumption of binary image segmentation with \(c_{i} \in \{0,1\}\), a probabilistic label map can be constructed as \(Y=[y_{i}]^{I}\), where each \(y_{i}=P(c_{i}=1)\) is the probability that voxel i belongs to the structure of interest. Similarly, we have the maps of voxel-wise label predictions \(\hat{C}=[\hat{c}_{i}]^{I}\) and probabilities \(\hat{Y}=[\hat{y}_{i}]^{I}\), where the class label map \(\hat{C}\) is constructed from the map of predictions \(\hat{Y}\) by taking the most likely label for each voxel.

The Dice score \(\mathcal {D}\) is defined on the label maps as:

$$\begin{aligned} \mathcal {D}(C, \hat{C}) = \frac{2 |C \cap \hat{C}|}{|C| + |\hat{C}|} \end{aligned}$$
(1)

The volumes \(\mathcal {V}(C)\) of the true structure and \(\mathcal {V}(\hat{C})\) of the predicted structure are then, with v the volume of a single voxel:

$$\begin{aligned} \mathcal {V}(C) = v\sum _{i=0}^{I-1}c_{i},\ \mathcal {V}(\hat{C}) = v\sum _{i=0}^{I-1}\hat{c}_{i} \end{aligned}$$
(2)

When the label maps are probabilistic, the volumes become expectations:

$$\begin{aligned} \mathcal {V}(Y) = v\mathbf {E} [\sum _{i=0}^{I-1}y_{i}],\ \mathcal {V}(\hat{Y}) = v\mathbf {E} [\sum _{i=0}^{I-1}\hat{y}_{i}] \end{aligned}$$
(3)
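As a concrete illustration, the quantities of Eqs. 1-3 can be computed directly on label and probability maps. The following is a minimal sketch in Python/NumPy; the function and variable names mirror the notation above and are purely illustrative.

```python
import numpy as np

def hard_labels(y_hat: np.ndarray) -> np.ndarray:
    """Label map C-hat from the probability map Y-hat, taking the most likely class."""
    return (y_hat >= 0.5).astype(np.uint8)

def dice(c: np.ndarray, c_hat: np.ndarray) -> float:
    """Dice score of Eq. 1 for binary label maps."""
    intersection = np.logical_and(c, c_hat).sum()
    return 2.0 * intersection / (c.sum() + c_hat.sum())

def volume(y: np.ndarray, v: float = 1.0) -> float:
    """Volume of Eq. 2; applied to a probabilistic map it yields the expected
    volume of Eq. 3, since the expectation of a sum of Bernoulli variables
    is the sum of their probabilities."""
    return v * float(y.sum())
```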

2.1 Risk Minimization

In the setting of supervised, gradient-based training of CNNs [10], we are performing empirical risk minimization. Assume the CNN, with a certain topology, is parametrized by \(\varvec{\theta } \in \varTheta \) and represents the functions \(\mathcal {H}=\{\mathfrak {h}_{\varvec{\theta }}\}^{|\varTheta |}\). Further assume we have access to the entire joint probability distribution \(P(\mathbf {x},y)\) at both training and testing time, with \(\mathbf {x}\) the information used by the network to make a prediction \(\hat{y}=\mathfrak {h}_{\varvec{\theta }}(\mathbf {x})\) for y (for CNNs, \(\mathbf {x}\) is typically an image patch centered around the location of y). Under these conditions, the general risk minimization principle applies: to optimize the performance for a certain non-negative, real-valued loss \(\mathcal {L}\) (e.g. the metric itself or a surrogate loss) at test time, we can optimize the same loss during the learning phase [17]. The risk \(\mathcal {R}_{\mathcal {L}}(\mathfrak {h}_{\varvec{\theta }})\) associated with the loss \(\mathcal {L}\) and parametrization \(\varvec{\theta }\) of the CNN, without regularization, is defined as the expectation of the loss function:

$$\begin{aligned} \mathcal {R}_{\mathcal {L}}(\mathfrak {h}_{\varvec{\theta }}) = \mathbf {E} [\mathcal {L}(\mathfrak {h}_{\varvec{\theta }}(\mathbf {x}), y)] \end{aligned}$$
(4)

For years, minimizing the negative log-likelihood has been the gold standard for risk minimization. For this purpose, and due to its elegant mathematical properties, the voxel-wise binary cross-entropy loss (\(\mathcal {CE}\)) is used:

$$\begin{aligned} \mathcal {CE}(\hat{Y},Y) = \sum _{i=0}^{I-1}\mathcal {CE}(\hat{y}_{i},y_{i}) = -\sum _{i=0}^{I-1}[y_{i}\log \hat{y}_{i} + (1-y_{i})\log (1-\hat{y}_{i})] \end{aligned}$$
(5)
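A minimal NumPy sketch of this voxel-wise loss follows; the clipping constant is a standard numerical guard, not part of Eq. 5.

```python
import numpy as np

def cross_entropy(y_hat: np.ndarray, y: np.ndarray, eps: float = 1e-7) -> float:
    """Voxel-wise binary cross-entropy of Eq. 5."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat)))
```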

More recently, the soft Dice loss (\(\mathcal {SD}\)) has been used in the optimization of CNNs to directly optimize the Dice score at test time [9, 15, 16]. Rewriting Eq. 1 as a non-negative, real-valued surrogate loss function as in [9]:

$$\begin{aligned} \mathcal {SD}(\hat{Y},Y) = 1-\frac{2\sum _{i=0}^{I-1}\hat{y}_{i}y_{i}}{\sum _{i=0}^{I-1}\hat{y}_{i} + \sum _{i=0}^{I-1}y_{i}} \end{aligned}$$
(6)
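The surrogate can be sketched in the same way. Note that many public implementations add a small smoothing constant to numerator and denominator; it is omitted here to stay faithful to Eq. 6.

```python
import numpy as np

def soft_dice(y_hat: np.ndarray, y: np.ndarray) -> float:
    """Soft Dice loss of Eq. 6."""
    return float(1.0 - 2.0 * np.sum(y_hat * y) / (np.sum(y_hat) + np.sum(y)))
```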

2.2 Uncertainty

There is considerable uncertainty in the segmentation of medical images. Images might lack contrast, contain artifacts, be noisy, or lack the necessary information (e.g. in ISLES 2017 the infarction after treatment must be predicted from images acquired before treatment, which inherently introduces uncertainty). Even at the level of the true segmentation, uncertainty exists due to intra- and inter-rater variability. We will investigate what happens with the estimated volume \(\mathcal {V}\) of a certain structure in an image under the assumption of having perfect segmentation algorithms (i.e. the prediction is the one that minimizes the risk).

Assume the voxels are independent or, more generally, that Eq. 3 can be simplified into J independent regions with true uncertainty \(p_{j}\), predicted uncertainty \(\hat{p}_{j}\) and volume \(s_{j}=vn_{j}\), where \(n_{j}\) is the number of voxels in region \(j=0 \dots J-1\) (with \(n_{j}=1\), each voxel is its own independent region). We then get:

$$\begin{aligned} \mathcal {V}(Y) = \sum _{j=0}^{J-1} (s_{j}p_{j}),\ \mathcal {V}(\hat{Y}) = \sum _{j=0}^{J-1} (s_{j}\hat{p}_{j}) \end{aligned}$$
(7)

We analyze for \(\mathcal {CE}\) the predicted uncertainty that minimizes the risk \(\mathcal {R}_{\mathcal {CE}}(\mathfrak {h}_{\varvec{\theta }})\):

$$\begin{aligned} \arg \min _{\hat{Y}}[\mathcal {R}_{\mathcal {CE}}(\mathfrak {h}_{\varvec{\theta }})] = \arg \min _{\hat{Y}}[\mathbf {E} [\mathcal {CE}(\hat{Y}, Y)]] \end{aligned}$$
(8)

We need to find for each independent region j:

$$\begin{aligned} \arg \min _{\hat{p}_{j}}[s_{j}\mathcal {CE}(\hat{p}_j, p_{j})] = \arg \min _{\hat{p}_{j}}[-p_{j}\log \hat{p}_j-(1-p_{j})\log (1-\hat{p}_{j})] \end{aligned}$$
(9)

This function is continuous and its first derivative is monotonically increasing in the interval ]0, 1[. The first-order condition w.r.t. \(\hat{p}_{j}\) gives the optimal predicted uncertainty \(\hat{p}_{j}=p_{j}\). With the predicted uncertainty equal to the true uncertainty, \(\mathcal {CE}\) is an unbiased volume estimator.
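Writing out the first-order condition of Eq. 9 makes this explicit:

$$\begin{aligned} \frac{\partial }{\partial \hat{p}_{j}}\left[ -p_{j}\log \hat{p}_{j}-(1-p_{j})\log (1-\hat{p}_{j})\right] = -\frac{p_{j}}{\hat{p}_{j}} + \frac{1-p_{j}}{1-\hat{p}_{j}} = 0 \iff \hat{p}_{j} = p_{j} \end{aligned}$$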

We analyze for \(\mathcal {SD}\) the predicted uncertainty that minimizes the risk \(\mathcal {R}_{\mathcal {SD}}(\mathfrak {h}_{\varvec{\theta }})\):

$$\begin{aligned} \arg \min _{\hat{Y}}[\mathcal {R}_{\mathcal {SD}}(\mathfrak {h}_{\varvec{\theta }})] = \arg \min _{\hat{Y}}[\mathbf {E} [\mathcal {SD}(\hat{Y}, Y)]] \end{aligned}$$
(10)

Unlike \(\mathcal {CE}\), this minimization does not decouple per region; we need to find, jointly over all regions:

$$\begin{aligned} \arg \min _{\hat{Y}}[\mathbf {E} [\mathcal {SD}(\hat{Y}, Y)]] = \arg \min _{\hat{Y}}[\mathbf {E}[1-\frac{2\sum _{j=0}^{J-1}s_{j}\hat{p}_{j}p_{j}}{\sum _{j=0}^{J-1}s_{j}\hat{p}_{j} + \sum _{j=0}^{J-1}s_{j}p_{j}}]] \end{aligned}$$
(11)

This minimization is more complex and we analyze its behavior by inspecting the values of \(\mathcal {SD}\) numerically. We will consider the scenarios with only a single region or with multiple independent regions with inherent uncertainty in the image. For each scenario we will vary the inherent uncertainty and the total uncertain volume.

Fig. 1. The effects of optimizing w.r.t. \(\mathcal {SD}\) for volume ratios \(\mu =0.25\) (blue), \(\mu =1\) (black) and \(\mu =4\) (red). Rows A-C: situations with \(N = \{1, 4, 16\}\) independent regions with uncertainty \(p_{\beta }\), respectively. Column 0: schematic representation of the situation. Columns 1-3: \(\mathcal {SD} \in [0, 1]\) (y-axis) for \(p_{\beta }=\{0,0.25,0.5,0.75,1\}\) (with increasing opacity) and \(\hat{p} \in [0, 1]\) (x-axis). Column 4: influence of \(p_{\beta } \in [0, 1]\) (x-axis) on the volumetric bias (solid lines) and on the error in predicted uncertainty (dashed lines). The light red area highlights that overestimation of the predicted volume occurs more easily for a higher volume ratio \(\mu \) or an increasing number of independent regions N. (Color figure online)

Single Region of Uncertainty. Imagine the segmentation of an image with \(K=3\) independent regions, \(\alpha , \beta \) and \(\gamma \), as depicted in Fig. 1 (A0). Region \(\alpha \) is certainly not part of the structure (\(p_{\alpha }=0\), i.e. background), region \(\beta \) belongs to the structure with probability \(p_{\beta }\), and region \(\gamma \) is certainly part of the structure (\(p_{\gamma }=1\)). Let their volumes be \(s_{\alpha }=100\), \(s_{\beta }\) and \(s_{\gamma }=1\), respectively, with \(\mu =\frac{s_{\beta }}{s_{\gamma }}=s_{\beta }\) the volume ratio of the uncertain to the certain part of the structure. Assuming a perfect algorithm, the optimal predictions under the risk from Eq. 11 are:

$$\begin{aligned} \arg \max _{\hat{p}_{\alpha },\hat{p}_{\beta },\hat{p}_{\gamma }}[\mathbf {E}[\frac{2(s_{\beta }\hat{p}_{\beta }p_{\beta }+s_{\gamma }\hat{p}_{\gamma })}{s_{\alpha }\hat{p}_{\alpha } + s_{\beta }\hat{p}_{\beta } + s_{\gamma }\hat{p}_{\gamma } + s_{\beta }p_{\beta } + s_{\gamma }}]] \end{aligned}$$
(12)

It is trivial to show that \(\hat{p}_{\alpha }=0=p_{\alpha }\) and \(\hat{p}_{\gamma }=1=p_{\gamma }\) solve this equation. The behavior of \(\hat{p}_{\beta }\) w.r.t. \(p_{\beta }\) and \(\mu \) can be observed qualitatively in Fig. 1 (A1-A4). Indeed, only for \(p_{\beta } \in \{0, 1\}\) is the predicted uncertainty \(\hat{p}_{\beta }\) exact. The minimum over \(\hat{p}_{\beta } \in [0, 1]\) switches from 0 to 1 as \(p_{\beta }\) crosses 0.5. Therefore, when \(p_{\beta }\) decreases or increases from 0.5 (increasing opacity in A1-A3), respectively under- or overestimation occurs (A4). The resulting volumetric bias is highest at inherent uncertainty \(p_{\beta }=0.5\) and decreases towards the points of complete certainty, 0 and 1, where it vanishes. The effect of the volume ratio \(\mu \) (colors) is two-fold: with increasing \(\mu \), the optimal loss value increases (A1-A3) and the volumetric bias increases (A4; solid lines). The error on the estimated uncertainty, however, is not influenced by \(\mu \) (A4; dashed lines).

Multiple Regions of Uncertainty. In a similar way, we can imagine the segmentation of a structure with \(K=N+2\) independent regions, where region \(\beta \) is further divided into N equally large independent sub-regions \(\beta _{n}\) with \(n=0 \dots N-1\). Let us further assume they share the same inherent uncertainty \(p_{\beta _{n}}=p_{\beta }\) and have volume ratio \(\mu _{\beta _{n}}=\frac{\mu _{\beta }}{N}\) (keeping the total uncertain volume the same). Limiting the analysis to a qualitative inspection of Fig. 1 with \(N=4\) (B0-B4) and \(N=16\) (C0-C4), we notice three things. First, the uncertainty \(p_{\beta }\) at which the switch from under- to overestimation happens decreases (A4, B4, C4). Second, this effect is proportional to \(\mu \), and the maximal error on the predicted uncertainty becomes higher (B0-B4, C0-C4). Third, volumetric overestimation occurs more easily, and the maximal error is more pronounced, as the number of regions increases (A4, B4, C4).
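The behavior described above can be reproduced with a minimal numeric sketch of Eqs. 11-12. We treat the true label of each sub-region \(\beta _{n}\) as an independent Bernoulli(\(p_{\beta }\)) variable, evaluate the expectation of \(\mathcal {SD}\) exactly by summing over the binomially distributed number of positive sub-regions, and search for the risk-minimizing prediction on a grid. Restricting all sub-regions to a common prediction \(\hat{p}\) (as the symmetry of the problem suggests) and fixing \(\hat{p}_{\alpha }=0\), \(\hat{p}_{\gamma }=1\) and \(s_{\gamma }=1\) as in the text are assumptions of this sketch.

```python
from math import comb
import numpy as np

def expected_soft_dice(p_hat, p_beta, mu, n):
    """Expected soft Dice risk for n equally large uncertain sub-regions."""
    s_gamma, s_sub = 1.0, mu / n          # certain volume and sub-region volume
    risk = 0.0
    for k in range(n + 1):                # k sub-regions truly belong to the structure
        prob = comb(n, k) * p_beta**k * (1.0 - p_beta)**(n - k)
        num = 2.0 * (s_sub * p_hat * k + s_gamma)
        den = (s_sub * p_hat * n + s_gamma) + (s_sub * k + s_gamma)
        risk += prob * (1.0 - num / den)  # s_alpha drops out since p_hat_alpha = 0
    return risk

grid = np.linspace(0.0, 1.0, 1001)
for n in (1, 4, 16):
    for p_beta in (0.25, 0.5, 0.75):
        risks = [expected_soft_dice(p, p_beta, mu=1.0, n=n) for p in grid]
        p_opt = grid[int(np.argmin(risks))]
        bias = p_opt - p_beta             # volume bias in units of s_beta (mu = 1)
        print(f"N={n:2d}  p_beta={p_beta:.2f}  optimal p_hat={p_opt:.2f}  bias={bias:+.2f}")
```

For \(N=1\) the optimal prediction jumps from 0 to 1 at \(p_{\beta }=0.5\); for \(N=16\) it already jumps well below 0.5, reproducing the trend towards easier overestimation.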

3 Empirical Analysis

In this section, we investigate whether the aforementioned characteristics can be observed under real circumstances. In a practical scenario, the joint probability distribution \(P(\mathbf {x},y)\) is unknown and presents itself as a training set. The risk \(\mathcal {R}_{\mathcal {L}}\) (Eq. 4) becomes empirical: the expectation of the loss function is replaced by its mean across the training set. Furthermore, the loss \(\mathcal {L}\) absorbs the explicit (e.g. weight decay, L2) or implicit (e.g. early stopping, dropout) regularization that is often present in some aspect of the optimization of CNNs. Finally, the classifier is no longer perfect, and in addition to the inherent uncertainty of the task there is now uncertainty introduced by the classifier itself.

To investigate how these factors impact our theoretical findings, we train three models of increasing complexity: LR (logistic regression on the input features), ConvNet (a simpler version of the U-Net) and U-Net. We use five-fold cross-validation on the training images of two tasks with relatively low inherent uncertainty (i.e. lower-left third molar segmentation from panoramic dental radiographs (MOLARS) [8], BRATS 2018 [4]) and two tasks with relatively high inherent uncertainty (i.e. ISLES 2017 [2], ISLES 2018 [3]). Next, we describe the experimental setup, followed by a discussion of the volume errors \(\varDelta \mathcal {V}(\hat{Y}, Y)=\mathcal {V}(\hat{Y})-\mathcal {V}(Y)\) of the \(\mathcal {CE}\)- and \(\mathcal {SD}\)-trained models.

3.1 Task Description and Training

We (re-)formulate a binary segmentation task for each dataset, with one (multi-modal) image as input and one binary segmentation map as output (for BRATS 2018 we limit the task to whole-tumor segmentation). For the 3D public benchmarks we use all of the provided images, resampled to an isotropic voxel size of 2 mm, as input (for both ISLES challenges we omit the perfusion images). In MOLARS (a 2D dataset from [8]), we first extract a 448 \(\times \) 448 ROI around the geometrical center of the lower-left third molar from the panoramic dental radiograph and downsample it by a factor of two. The output is the segmentation of the third molar, as provided by the experts. All images are normalized with the dataset's mean and standard deviation.
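For the 2D MOLARS pipeline, the ROI extraction and normalization can be sketched as follows. This is an illustrative sketch only: `image`, `center` (the geometrical center of the third molar) and the dataset statistics are assumed given, and scipy's `zoom` merely stands in for whatever resampling was actually used.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_molars(image: np.ndarray, center: tuple, mean: float, std: float) -> np.ndarray:
    """Extract a 448 x 448 ROI around the molar center, downsample by 2, normalize."""
    cy, cx = center
    roi = image[cy - 224:cy + 224, cx - 224:cx + 224]  # 448 x 448 ROI
    roi = zoom(roi, 0.5, order=1)                      # downsample by a factor of two
    return (roi - mean) / std                          # dataset mean/std normalization
```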

For our U-Net model we start from the successful No New-Net implementation of last year's BRATS challenge [12]. We adapt it to use three 3 \(\times \) 3(\(\times \)3) average pooling layers with corresponding linear up-sampling layers, and we strip the instance normalization layers. Each level has two 3 \(\times \) 3(\(\times \)3) convolutional layers before the pooling and after the up-sampling layer, respectively, with [[10, 20], [20, 10]], [[20, 40], [40, 20]], [[40, 80], [80, 40]] and [40, 20] filters. For the ConvNet model, we remove the final two levels. The LR model uses the inputs directly for classification, thus performing logistic regression on the input features.
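One plausible PyTorch reading of this description is sketched below. This is not the authors' code; in particular the stride-3 average pooling (chosen so the crop sizes mentioned below stay divisible across three levels), the leaky ReLU activations and the skip-connection wiring are assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, mid_ch, out_ch):
    # Two 3x3x3 convolutions; leaky ReLU is an assumption (inherited from
    # No New-Net), instance normalization is stripped as described above.
    return nn.Sequential(
        nn.Conv3d(in_ch, mid_ch, 3, padding=1), nn.LeakyReLU(inplace=True),
        nn.Conv3d(mid_ch, out_ch, 3, padding=1), nn.LeakyReLU(inplace=True),
    )

class SmallUNet3D(nn.Module):
    def __init__(self, in_channels=1):
        super().__init__()
        self.enc1 = conv_block(in_channels, 10, 20)  # level 1: [10, 20]
        self.enc2 = conv_block(20, 20, 40)           # level 2: [20, 40]
        self.enc3 = conv_block(40, 40, 80)           # level 3: [40, 80]
        self.bottom = conv_block(80, 40, 20)         # deepest level: [40, 20]
        self.dec3 = conv_block(20 + 80, 80, 40)      # level 3: [80, 40]
        self.dec2 = conv_block(40 + 40, 40, 20)      # level 2: [40, 20]
        self.dec1 = conv_block(20 + 20, 20, 10)      # level 1: [20, 10]
        self.head = nn.Conv3d(10, 1, 1)
        # 3x3x3 average pooling; stride 3 keeps 162x162x108 crops divisible.
        self.pool = nn.AvgPool3d(3)
        self.up = nn.Upsample(scale_factor=3, mode='trilinear', align_corners=False)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        b = self.bottom(self.pool(e3))
        d3 = self.dec3(torch.cat([self.up(b), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))
        return torch.sigmoid(self.head(d1))          # binary segmentation map
```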

The images are augmented intensively during training, and the inputs are central image crops of 162 \(\times \) 162 \(\times \) 108 voxels (243 \(\times \) 243 pixels in MOLARS). We train the models w.r.t. \(\mathcal {CE}\) or \(\mathcal {SD}\) with ADAM, without any explicit regularization, and with the initial learning rate set to \(10^{-3}\) (1 for the LR model). We lower the learning rate by a factor of five when the validation loss has not improved over the last 75 epochs, and we stop training when it has not improved over the last 150 epochs.
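A minimal sketch of this schedule follows; `train_one_epoch` and `validate` are user-supplied placeholders, not part of the original code, and PyTorch's `ReduceLROnPlateau` with `factor=0.2` stands in for the division by five.

```python
import torch

def fit(model, train_one_epoch, validate, max_epochs=100000):
    """ADAM without explicit regularization; divide the learning rate by five
    after 75 epochs without validation improvement, stop after 150."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.2, patience=75)
    best_loss, epochs_without_improvement = float('inf'), 0
    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer)
        val_loss = validate(model)
        scheduler.step(val_loss)
        if val_loss < best_loss:
            best_loss, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= 150:
                break
    return best_loss
```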

3.2 Results and Discussion

Table 1 shows the results after five-fold cross-validation for each dataset (i.e. MOLARS, BRATS 2018, ISLES 2017, ISLES 2018), each model (i.e. LR, ConvNet, U-Net) and each loss (i.e. \(\mathcal {CE}\), \(\mathcal {SD}\)). We performed a pairwise non-parametric significance test (bootstrapping) at a p-value of 0.05 to assess inferiority or superiority between pairs of optimization methods.
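One way to implement such a test is a paired bootstrap on the per-case metric differences; the sketch below is an illustrative implementation, and the exact procedure used here may differ in its details.

```python
import numpy as np

def paired_bootstrap_p(metric_a, metric_b, n_resamples=10000, seed=0):
    """Two-sided bootstrap p-value for the mean paired difference being zero."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(metric_a, float) - np.asarray(metric_b, float)
    observed = diff.mean()
    centered = diff - observed                       # enforce the null hypothesis
    means = np.array([rng.choice(centered, size=diff.size).mean()
                      for _ in range(n_resamples)])
    return float(np.mean(np.abs(means) >= abs(observed)))
```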

Table 1. Empirical results for the cross-entropy (\(\mathcal {CE}\)), soft Dice score (\(1-\mathcal {SD}\)) and volume error (\(\varDelta \mathcal {V}\); in \(10^{2}\) pixels or ml) metrics for models optimized w.r.t. the \(\mathcal {CE}\) and \(\mathcal {SD}\) losses. Significant volumetric underestimations in italics, overestimations in bold.

Optimizing the \(\mathcal {CE}\) loss reaches significantly higher log-likelihoods under all circumstances, while the soft Dice scores (i.e. \(1-\mathcal {SD}\)) are significantly higher for \(\mathcal {SD}\)-optimized models. Looking at the volume errors \(\varDelta \mathcal {V}(\hat{Y}, Y)\), the expected outcomes are largely confirmed. For the LR and ConvNet models, \(\mathcal {CE}\)-optimized models are unbiased w.r.t. volume estimation, whereas \(\mathcal {SD}\) optimization leads to significant overestimation due to the remaining uncertainty, partly introduced by the models themselves.

The transition to the more complex U-Net model brings forward two interesting observations. First, for the two tasks with relatively low inherent uncertainty (i.e. MOLARS, BRATS 2018), the model reduces the uncertainty to such an extent that it avoids significant bias in the estimated volumes. The significant underestimation for \(\mathcal {CE}\) in BRATS 2018 may be due to optimization difficulties that arise under high class imbalance. Second, although the model can now extend its view wide enough and propagate the information in a complex manner, the inherent uncertainty present in both ISLES tasks again brings forward the discussed bias. In ISLES 2017, having to predict the infarction after treatment inherently introduces uncertainty. In ISLES 2018, the task is to detect the acute lesion, as observed on MR DWI, from CT perfusion-derived parameter maps; it is still unknown to what extent these parameter maps contain the information necessary to predict the lesion.

The \(\mathcal {CE}\) optimized U-Net models result in Dice scores (Eq. 1) of 0.924, 0.763, 0.177 and 0.454 for MOLARS, BRATS 2018, ISLES 2017 and ISLES 2018, respectively. The Dice scores obtained with their \(\mathcal {SD}\) optimized counterparts are significantly higher, respectively 0.932, 0.826, 0.343 and 0.527. This is in line with recent theory and practice from [7] and justifies \(\mathcal {SD}\) optimization when the segmentation quality is measured in terms of Dice score.

4 Conclusion

It is clear that, for tasks with high inherent uncertainty, the volumes estimated with soft Dice-optimized models are biased, while cross-entropy-optimized models yield unbiased volume estimates. For tasks with low inherent uncertainty, one can still favor soft Dice optimization due to its higher Dice scores.

We want to highlight the importance of choosing an appropriate loss function w.r.t. the goal. In a clinical setting where volume estimates are important, and for tasks with high or unknown inherent uncertainty, optimization with cross-entropy may be preferred.