1 Introduction

The Dice loss was introduced in [5] and [13] as a loss function for binary image segmentation, addressing the class imbalance between foreground and background that is often present in medical applications. The generalized Dice loss [16] extended this idea to multiclass segmentation tasks, thereby taking into account the class imbalance present across different classes. In parallel, the Jaccard loss was introduced in the wider computer vision field for the same purpose [14, 17]. More recently, it has been shown that one can use either the Dice or the Jaccard loss during training to effectively optimize both metrics at test time [6].

The use of the Dice loss in popular and state-of-the-art methods such as No New-Net [9] has only fueled its dominant usage across the entire field of medical image segmentation. Despite its fast and wide adoption, research that explores the underlying mechanisms is remarkably limited and mostly focuses on the loss value itself, building further on the concept of risk minimization [8]. Regarding model calibration and inherent uncertainty, for example, some intuitions behind the typically hard and poorly calibrated predictions were exposed in [4], focusing on the potential volume bias that results from using the Dice loss. Regarding semi-supervised learning, adaptations to the original formulations were proposed to deal with “missing” labels [7, 15], i.e. a label that is missing in the ground truth even though it is present in the image.

In this work, we further contribute to a deeper understanding of the specific implementation of the Dice loss, especially in the context of missing and empty labels. In contrast to missing labels, “empty” labels are labels that are not present in the image (and hence also not in the ground truth). We will first take a closer look at the derivative, i.e. the real motor of the underlying optimization when using gradient descent, in Sect. 2. Although [13] and [16] report the derivative, it is not discussed in detail, nor is any reasoning given behind the choice of the reduction dimensions \(\varPhi \) (Sect. 2.1). When the smoothing term \(\epsilon \) is mentioned, no details are given and its effect is underestimated by merely linking it with numerical stability [16] and convergence issues [9]. In fact, we find that \(\varPhi \) and \(\epsilon \) are intertwined, and that their choice is non-trivial and pivotal in the presence of missing or empty labels. To confirm and validate these findings, we set up two empirical settings with missing or empty labels in Sects. 3 and 4. Indeed, we can make or break the segmentation task depending on the exact implementation of the Dice loss.

2 Bells and Whistles of the Dice Loss: \(\varPhi \) and \(\epsilon \)

In a CNN-based setting, the weights \(\theta \in \varTheta \) are often updated using gradient descent. For this purpose, the loss function \(\ell \) computes a real valued cost \(\ell (Y,\tilde{Y})\) based on the comparison between the ground truth Y and its prediction \(\tilde{Y}\) in each iteration. Y and \(\tilde{Y}\) contain the values \(y_{b,c,i}\) and \(\tilde{y}_{b,c,i}\), respectively, pointing to the value for a semantic class \(c\in \mathcal {C}=[\text {C}]\) at an index \(i\in \mathcal {I}=[\text {I}]\) (e.g. a voxel) of a batch element \(b\in \mathcal {B}=[\text {B}]\) (Fig. 1). The exact update of each \(\theta \) depends on \(d\ell (Y,\tilde{Y})/d\theta \), which can be computed via the generalized chain rule. With \(\omega =(b,c,i)\in \varOmega =\mathcal {B} \times \mathcal {C} \times \mathcal {I}\), we can write:

$$\begin{aligned} \frac{d\ell (Y,\tilde{Y})}{d\theta }= \sum _{b\in \mathcal {B}}\sum _{c\in \mathcal {C}}\sum _{i\in \mathcal {I}}\frac{\partial \ell (Y,\tilde{Y})}{\partial \tilde{y}_{b,c,i}} \frac{\partial \tilde{y}_{b,c,i}}{\partial \theta }= \sum _{\omega \in \varOmega }\frac{\partial \ell (Y,\tilde{Y})}{\partial \tilde{y}_\omega } \frac{\partial \tilde{y}_\omega }{\partial \theta }. \end{aligned}$$
(1)

The Dice similarity coefficient (DSC) over a subset \(\phi \subset \varOmega \) is defined as:

$$\begin{aligned} \text {DSC}(Y_\phi ,\tilde{Y}_\phi )=\frac{2|Y_\phi \cap \tilde{Y}_\phi |}{|Y_\phi |+|\tilde{Y}_\phi |}. \end{aligned}$$
(2)

This formulation of \(\text {DSC}(Y_\phi ,\tilde{Y}_\phi )\) requires Y and \(\tilde{Y}\) to contain values in \(\{0, 1\}\). In order to be differentiable and handle values in [0, 1], relaxations such as the soft DSC (sDSC) are used [5, 13]. Furthermore, in order to allow both Y and \(\tilde{Y}\) to be empty, a smoothing term \(\epsilon \) is added to the numerator and denominator such that \(\text {DSC}(Y_\phi ,\tilde{Y}_\phi )=1\) in case both Y and \(\tilde{Y}\) are empty. This results in the more general formulation of the Dice loss (DL) computed over a number of subsets \(\varPhi = \{\phi \}\):

$$\begin{aligned} \text {DL}(Y,\tilde{Y})= 1-\frac{1}{|\varPhi |}\sum _{\phi \in \varPhi }\text {sDSC}(Y_\phi ,\tilde{Y}_\phi )= 1-\frac{1}{|\varPhi |}\sum _{\phi \in \varPhi }\frac{2\sum _{\varphi \in \phi } y_\varphi \tilde{y}_\varphi +\epsilon }{\sum _{\varphi \in \phi } (y_\varphi +\tilde{y}_\varphi )+\epsilon }. \end{aligned}$$
(3)
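
For illustration, Eq. 3 can be written in a few lines of PyTorch. The sketch below is for exposition only, assuming a (B, C, I) tensor layout; it does not necessarily match the released implementation, and the `dims` argument selects the reduction dimensions that define the subsets \(\phi \).

```python
# Minimal sketch of Eq. 3 (exposition only; assumes tensors of shape (B, C, I)).
import torch

def dice_loss(y: torch.Tensor, y_hat: torch.Tensor,
              dims=(2,), eps=1e-7) -> torch.Tensor:
    """Soft Dice loss with configurable reduction dimensions (Phi) and smoothing term (eps)."""
    intersection = torch.sum(y * y_hat, dim=dims)            # sum of y*y_hat over each subset phi
    cardinality = torch.sum(y + y_hat, dim=dims)             # sum of y+y_hat over each subset phi
    sdsc = (2.0 * intersection + eps) / (cardinality + eps)  # one sDSC value per subset phi
    return 1.0 - sdsc.mean()                                 # average over all subsets in Phi
```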

Note that typically all \(\phi \) are equal in size and define a partition over the domain \(\varOmega \), such that \(\bigcup _{\phi \in \varPhi }\phi =\varOmega \) and \(\bigcap _{\phi \in \varPhi }\phi =\emptyset \). In \(d\text {DL}(Y,\tilde{Y})/d\theta \) from Eq. 1, the derivative \(\partial \text {DL}(Y,\tilde{Y})/\partial \tilde{y}_\omega \) acts as a scaling factor. In order to understand the underlying optimization mechanisms, we can thus analyze \(\partial \text {DL}(Y,\tilde{Y})/\partial \tilde{y}_\omega \). Given that all \(\phi \) are disjoint, this can be written as:

$$\begin{aligned} \frac{\partial \text {DL}(Y,\tilde{Y})}{\partial \tilde{y}_\omega } =-\frac{1}{|\varPhi |}\left( \frac{2y_\omega }{\sum _{\varphi \in \phi ^\omega }(y_\varphi +\tilde{y}_\varphi )+\epsilon }- \frac{2\sum _{\varphi \in \phi ^\omega }y_\varphi \tilde{y}_\varphi +\epsilon }{\left( \sum _{\varphi \in \phi ^\omega }(y_\varphi +\tilde{y}_\varphi )+\epsilon \right) ^2}\right) , \end{aligned}$$
(4)

with \(\phi ^\omega \) the subset that contains \(\omega \). As such, it becomes clear that the specific action of DL depends on the exact configuration of the partition \(\varPhi \) of \(\varOmega \) and the choice of \(\epsilon \). Next, we describe the most common choices of \(\varPhi \) and \(\epsilon \) in practice. Then, we investigate their effects in the context of missing or empty labels. Finally, we present a simple heuristic to tune both.
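
As a sanity check (not part of the original analysis), automatic differentiation of the `dice_loss` sketch above can be compared against the analytic derivative of Eq. 4 for the image-wise partition:

```python
# Numerical check that autograd reproduces Eq. 4 (image-wise partition, |Phi| = B*C).
torch.manual_seed(0)
y = (torch.rand(2, 1, 100) > 0.5).float()
y_hat = torch.rand(2, 1, 100, requires_grad=True)
eps = 1e-7

dice_loss(y, y_hat, dims=(2,), eps=eps).backward()

with torch.no_grad():
    card = torch.sum(y + y_hat, dim=2, keepdim=True) + eps         # denominator of Eq. 4
    inter = 2.0 * torch.sum(y * y_hat, dim=2, keepdim=True) + eps  # numerator of the second term
    n_phi = y.shape[0] * y.shape[1]                                # |Phi| for DL_I
    analytic = -(2.0 * y / card - inter / card ** 2) / n_phi

print(torch.allclose(y_hat.grad, analytic, atol=1e-6))             # expected: True
```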

Fig. 1.

Schematic representation of Y, having a batch, class and image dimension, respectively with \(|\mathcal {B}|=\text {B}\), \(|\mathcal {C}|=\text {C}\) and \(|\mathcal {I}|=\text {I}\) (similarly for \(\tilde{Y}\)). The choice of \(\varPhi \), i.e. a family of subsets \(\phi \) over \(\varOmega \), defines the extent of the reductions in \(\text {sDSC}(Y_\phi , \tilde{Y}_\phi )\). From left to right, we see how the choice of \(\varPhi \), and thus an example subset \(\phi \) in blue, differs between the image-wise (\(\text {DL}_\mathbb {I}\)), class-wise (\(\text {DL}_\mathbb{C}\mathbb{I}\)), batch-wise (\(\text {DL}_\mathbb{B}\mathbb{I}\)) and all-wise (\(\text {DL}_\mathbb {BCI}\)) implementations of DL.

2.1 Configuration of \(\varPhi \) and \(\epsilon \) in Practice

In Fig. 1, we depict four straightforward choices for \(\varPhi \). We define these as the image-wise, class-wise, batch-wise or all-wise DL implementation, respectively \(\text {DL}_\mathbb {I}\), \(\text {DL}_\mathbb{C}\mathbb{I}\), \(\text {DL}_\mathbb{B}\mathbb{I}\) and \(\text {DL}_\mathbb {BCI}\), thus referring to the dimensions over which a complete reduction (i.e. the summations \(\sum _{\varphi \in \phi }\) in Eq. 3 and Eq. 4) is performed. We see that in all cases, a complete reduction is performed over the set of image indices \(\mathcal {I}\), which is in line with all relevant literature that we consulted. Furthermore, while in most implementations \(\text {B}>1\), only [11] describes the exact usage of the batch dimension. In fact, they experimented with both \(\text {DL}_\mathbb {I}\) and \(\text {DL}_\mathbb{B}\mathbb{I}\), and found the latter to be superior for the segmentation of head and neck organs at risk in radiotherapy. Based on the context, we assume that most other contributions [5, 6, 9, 10, 13, 18] used \(\text {DL}_\mathbb {I}\), although we cannot rule out the use of \(\text {DL}_\mathbb{B}\mathbb{I}\). Similarly, we assume that [16] used \(\text {DL}_\mathbb{C}\mathbb{I}\) (with the contribution of each class additionally weighted inversely proportional to the object size), although we cannot rule out the use of \(\text {DL}_\mathbb {BCI}\).
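
In terms of the `dice_loss` sketch of Sect. 2, and assuming the (B, C, I) layout, the four implementations would correspond to the following reduction dimensions (illustrative only):

```python
# The four choices of Phi from Fig. 1 as reduction dimensions (assumes the
# dice_loss sketch of Sect. 2; y, y_hat are placeholder tensors of shape (B, C, I)).
loss_I   = dice_loss(y, y_hat, dims=(2,))       # image-wise  DL_I:   reduce over I
loss_CI  = dice_loss(y, y_hat, dims=(1, 2))     # class-wise  DL_CI:  reduce over C and I
loss_BI  = dice_loss(y, y_hat, dims=(0, 2))     # batch-wise  DL_BI:  reduce over B and I
loss_BCI = dice_loss(y, y_hat, dims=(0, 1, 2))  # all-wise    DL_BCI: reduce over B, C and I
```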

Note that in Eq. 3 and Eq. 4 we have assumed the choice of \(\varPhi \) and \(\epsilon \) to be fixed. As such, the loss value and gradients only vary across iterations due to a different sampling of Y and \(\tilde{Y}\). Relaxing this assumption allows us to view the leaf Dice loss from [7] as a special case of choosing \(\varPhi \). Being developed in the context of missing labels, the partition \(\varPhi \) of \(\varOmega \) is altered each iteration by substituting each \(\phi \) with \(\emptyset \) if \(\sum _{\varphi \in \phi } y_\varphi =0\) (see the sketch below). Similarly, the marginal Dice loss from [15] adapts \(\varPhi \) every iteration by treating the missing labels as background and adding the predicted probabilities of the unlabeled classes to the background prediction before calculating the loss.
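
Following our reading of [7], the leaf Dice loss can thus be sketched as an image-wise reduction in which subsets with an all-zero ground truth are dropped; consult [7] for the exact formulation.

```python
# Sketch of the leaf Dice loss as an iteration-dependent choice of Phi (illustrative only).
def leaf_dice_loss(y, y_hat, eps=1e-7):
    intersection = torch.sum(y * y_hat, dim=2)        # per-image, per-class subsets phi
    cardinality = torch.sum(y + y_hat, dim=2)
    sdsc = (2.0 * intersection + eps) / (cardinality + eps)
    present = torch.sum(y, dim=2) > 0                 # substitute phi with the empty set if sum(y) = 0
    return 1.0 - sdsc[present].mean()                 # assumes at least one labeled phi per batch
```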

Based on our own experience, \(\epsilon \) is generally chosen to be small (e.g. \(10^{-7}\)). However, most works do not include \(\epsilon \) in their loss formulation, nor do they mention its exact value. We do find brief mentions related to convergence issues [9] (without further information) or numerical stability in the case of empty labels [10, 16] (to avoid division by zero in Eq. 3 and Eq. 4).

2.2 Effect of \(\varPhi \) and \(\epsilon \) on Missing or Empty Labels

When inspecting the derivative given in Eq. 4, we notice that \(\partial \text {DL}/\partial \tilde{y}_\omega \) does not, in a sense, depend on \(\tilde{y}_\omega \) itself. Instead, the contributions of \(\tilde{y}_\varphi \) are aggregated over the reduction dimensions, resulting in a global effect of the prediction \(\tilde{Y}_\phi \). Consequently, the derivative within a subset \(\phi \) takes only two distinct values, corresponding to \(y_\omega =0\) or \(y_\omega =1\). This is in contrast to the derivative shown in [13], where an \(L^2\) norm-based relaxation is used, causing the gradients to differ for every \(\omega \) with a different \(\tilde{y}_\omega \). If we work further with the \(L^1\) norm-based relaxation (following the vast majority of implementations) and assume that \(\sum _{\varphi \in \phi ^\omega }\tilde{y}_\varphi \gg \epsilon \), we see that \(\partial \text {DL}/\partial \tilde{y}_\omega \) will be negligible for missing or empty ground truth labels. Exploiting this property, we can either avoid having to implement specific losses for missing labels, or we can learn to predict empty maps with a good configuration of \(\varPhi \). Regarding the former, we simply need to make sure that \(\sum _{\varphi \in \phi ^\omega }y_{\varphi }=0\) for each map that contains missing labels, which can be achieved by using the image-wise implementation \(\text {DL}_{\mathbb {I}}\). Regarding the latter, non-zero gradients are required for empty maps. Hence, we want to choose \(\phi \) large enough to avoid \(\sum _{\varphi \in \phi ^\omega }y_{\varphi }=0\), for which the batch-wise implementation \(\text {DL}_{\mathbb{B}\mathbb{I}}\) is suitable.
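
The following toy example (ours) illustrates this behavior for an empty ground truth map under the image-wise implementation: the gradient magnitude is negligible for a small \(\epsilon \), but becomes substantial when \(\epsilon \) approaches the expected foreground volume (the concrete values used below are assumptions).

```python
# Toy illustration of Sect. 2.2: gradient received by an empty ground truth map
# under DL_I, for a negligible versus a non-negligible eps (values are assumptions).
y_empty = torch.zeros(1, 1, 1000)                          # empty label for this image
y_hat = torch.full((1, 1, 1000), 0.5, requires_grad=True)  # sum of predictions = 500

for eps in (1e-7, 500.0):                                  # 500 ~ assumed expected foreground volume
    y_hat.grad = None
    dice_loss(y_empty, y_hat, dims=(2,), eps=eps).backward()
    print(f"eps={eps:g}: mean |dDL/dy_hat| = {y_hat.grad.abs().mean().item():.1e}")
# Expected output: roughly 4e-13 for eps=1e-7 versus 5e-04 for eps=500.
```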

2.3 A Simple Heuristic for Tuning \(\epsilon \) to Learn from Empty Maps

We hypothesized that we can learn to predict empty maps by using the batch-wise implementation \(\text {DL}_\mathbb{B}\mathbb{I}\). However, due to memory constraints and the trade-off with the receptive field, it is often not possible to use large batch sizes. In the limit of \(\text {B}=1\) we find that \(\text {DL}_\mathbb {I}=\text {DL}_\mathbb{B}\mathbb{I}\), and thus the gradients of empty maps will be negligible. Hence, we want to mimic the behavior of \(\text {DL}_{\mathbb{B}\mathbb{I}}\) with \(\text {B}\gg 1\), but using \(\text {DL}_{\mathbb {I}}\). This can be achieved by tuning \(\epsilon \) to increase the derivative for empty labels \(y_\omega =0\). A very simple strategy is to require that \(\partial \text {DL}(Y,\tilde{Y})/\partial \tilde{y}_\omega \) for \(y_\omega =0\) be equal in case of (i) \(\text {DL}_\mathbb{B}\mathbb{I}\) with infinite batch size such that \(\sum _{\varphi \in \phi ^\omega }y_{\varphi } \ne 0\) and negligible \(\epsilon \), and (ii) \(\text {DL}_\mathbb {I}\) with non-negligible \(\epsilon \) and \(\sum _{\varphi \in \phi ^\omega }y_\varphi =0\). If we set \(\sum _{\varphi \in \phi ^\omega }\tilde{y}_\varphi =\hat{v}\) we get:

$$\begin{aligned} \frac{2\sum _{\varphi \in \phi ^\omega }y_\varphi \tilde{y}_\varphi }{\left( \sum _{\varphi \in \phi ^\omega }(y_\varphi +\tilde{y}_\varphi )\right) ^2}=\frac{\epsilon }{\left( \sum _{\varphi \in \phi ^\omega }\tilde{y}_\varphi +\epsilon \right) ^2} \Rightarrow \frac{2a\hat{v}}{(b\hat{v})^2}=\frac{\epsilon }{(\hat{v}+\epsilon )^2}, \end{aligned}$$
(5)

with a and b variables expressing the intersection and union as a function of \(\hat{v}\). When we assume the overlap to be around 50%, thus \(a\approx 1/2\), and \(\sum _{\varphi \in \phi ^\omega }y_\varphi \approx \sum _{\varphi \in \phi ^\omega } \tilde{y}_\varphi =\hat{v}\), thus \(b\approx 2\), we find \(\epsilon \approx \hat{v}\). It is further reasonable to assume that after some iterations \(\hat{v}\approx \mathbb {E}\sum _{\varphi \in \phi ^\omega } y_\varphi \), thus setting \(\epsilon = \hat{v}\) will allow DL to learn empty maps.
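
In practice, this heuristic amounts to estimating the expected per-subset foreground volume from the training ground truths. A possible sketch is given below; the function name and sampling assumptions are ours, and the exact estimate depends on how the training data is sampled.

```python
# Sketch of the heuristic eps of Sect. 2.3: estimate E[sum_phi y] per class from the
# training ground truths (names and sampling assumptions are ours).
import numpy as np

def heuristic_eps(train_ground_truths, num_classes):
    totals = np.zeros(num_classes)
    n = 0
    for y in train_ground_truths:                        # y: one-hot label array of shape (C, I)
        totals += np.reshape(y, (num_classes, -1)).sum(axis=1)
        n += 1
    return totals / n                                    # eps_c ~ expected foreground volume of class c
```

After converting the result to a tensor, such a per-class \(\epsilon \) broadcasts against the per-class sDSC terms of the earlier `dice_loss` sketch.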

3 Experimental Setup

To confirm empirically the observed effects of \(\varPhi \) and \(\epsilon \) on missing or empty labels (Sect. 2.2), and to test our simple heuristic choice of \(\epsilon \) (Sect. 2.3), we perform experiments using three implementations of DL on two different public datasets.

Setups \(\mathbb {I}\), \(\mathbb{B}\mathbb{I}\) and \(\mathbb {I}_\epsilon \): In \(\mathbb {I}\) and \(\mathbb{B}\mathbb{I}\), respectively \(\text {DL}_\mathbb {I}\) and \(\text {DL}_\mathbb{B}\mathbb{I}\) are used to calculate the Dice loss (Sect. 2.1). The difference between \(\mathbb {I}\) and \(\mathbb {I}_\epsilon \) is that we use a negligible value \(\epsilon =10^{-7}\) in \(\mathbb {I}\), while in \(\mathbb {I}_\epsilon \) we use the heuristic from Sect. 2.3 to set \(\epsilon =\mathbb {E}\sum _{\varphi \in \phi ^\omega } y_\varphi \). From Sect. 2.2, we expect \(\mathbb {I}\) (any B) and \(\mathbb{B}\mathbb{I}\) (\(\text {B}=1\)) to successfully ignore missing labels during training, still segmenting these at test time. Vice versa, we expect \(\mathbb{B}\mathbb{I}\) (\(\text {B}>1\)) and \(\mathbb {I}_\epsilon \) (any B) to successfully learn which maps should be empty and thus to output empty maps at test time.
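
Expressed with the `dice_loss` sketch of Sect. 2, the three setups correspond to the following configurations (the dictionary and variable names are ours and do not reflect the released code):

```python
# The three setups as (dims, eps) configurations (illustrative; y, y_hat,
# train_ground_truths and num_classes are placeholder names).
eps_heuristic = torch.as_tensor(heuristic_eps(train_ground_truths, num_classes))  # per-class eps

setups = {
    "I":     dict(dims=(2,),   eps=1e-7),           # image-wise reduction, negligible eps
    "BI":    dict(dims=(0, 2), eps=1e-7),           # batch-wise reduction, negligible eps
    "I_eps": dict(dims=(2,),   eps=eps_heuristic),  # image-wise reduction, heuristic per-class eps
}

loss = dice_loss(y, y_hat, **setups["I_eps"])       # per-class eps broadcasts over the batch dimension
```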

BRATS: For our purpose, we resort to the binary segmentation of whole brain tumors on pre-operative MRI in BRATS 2018 [1, 2, 12]. The BRATS 2018 training dataset consists of 75 subjects with a lower grade glioma (LGG) and 210 subjects with a glioblastoma (HGG). To construct a partially labeled dataset for the missing and empty label tasks, we substitute the ground truth segmentations of the LGGs with empty maps during training. In light of missing labels, we would like the CNN to successfully segment LGGs at test time. In light of empty maps, we would like the CNN to output empty maps for LGGs at test time. Based on the ground truths of the entire dataset, in \(\mathbb {I}_\epsilon \) we need to set \(\epsilon =8,789\) or \(\epsilon =12,412\) when we use the partially or fully labeled dataset for training, respectively.

ACDC: The ACDC dataset [3] consists of cardiac MRI of 100 subjects. Labels for left ventricular (LV) cavity, LV myocardium and right ventricle (RV) are available in end-diastole (ED) and end-systole (ES). To create a structured partially labeled dataset, we remove the myocardium labels in ES. This is a realistic scenario since segmenting the myocardium only in ED is common in clinical practice. More specifically, ED and ES were sampled in the ratio 3/1 for \(\mathbb {I}_\epsilon \), resulting in \(\epsilon \) being equal to 13,741 and 19,893 on average for the myocardium class during partially or fully labeled training, respectively. For LV and RV, \(\epsilon \) was 21,339 and 18,993, respectively. We ignored the background map when calculating DL. Since we hypothesize that \(\text {DL}_\mathbb {I}\) is able to ignore missing labels, we compare \(\mathbb {I}\) to the marginal Dice loss [15] and the leaf Dice loss [7], two loss functions designed in particular to deal with missing labels.

Implementation Details: We start from the exact same preprocessing, CNN architecture and training parameters as in No New-Net [9]. The images of the BRATS dataset were first resampled to an isotropic voxel size of 2 \(\times \) 2 \(\times \) 2 mm\(^3\), such that we could work with a smaller output segment size of 80 \(\times \) 80 \(\times \) 48 voxels so as to be able to vary B in \(\{1, 2, 4, 8\}\). Since we are working with a binary segmentation task, we have \(\text {C}=1\) and use a single sigmoid activation in the final layer. For ACDC, the images were first resampled to 192 \(\times \) 192 \(\times \) 48 voxels with a voxel size of 1.56 \(\times \) 1.56 \(\times \) 2.5 mm\(^3\). The aforementioned CNN architecture was modified to use batch normalization and pReLU activations. To compensate for the anisotropic voxel size, we used a combination of 3 \(\times \) 3 \(\times \) 3 and 3 \(\times \) 3 \(\times \) 1 convolutions and omitted the first max-pooling in the third dimension. These experiments were only performed for \(\text {B}=2\). In this multiclass segmentation task, we use a softmax activation in the final layer to obtain four output maps.

Statistical Performance: All experiments were performed under a five-fold cross-validation scheme, making sure each subject was only present in one of the five partitions. Significant differences were assessed with non-parametric bootstrapping, making no assumptions on the distribution of the results [2]. Results were considered statistically significant if the p-value was below 5%.
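
For reference, one common variant of such a paired, non-parametric bootstrap on per-subject DSC values could be sketched as follows (not necessarily the exact procedure of [2]):

```python
# Minimal sketch of a paired, non-parametric bootstrap test (one common variant).
import numpy as np

def bootstrap_p_value(dsc_a, dsc_b, n_boot=10000, seed=0):
    rng = np.random.default_rng(seed)
    diffs = np.asarray(dsc_a) - np.asarray(dsc_b)             # paired per-subject differences
    boot_means = np.array([rng.choice(diffs, size=diffs.size, replace=True).mean()
                           for _ in range(n_boot)])
    # two-sided p-value: how often the resampled mean difference crosses zero
    p = 2.0 * min((boot_means <= 0).mean(), (boot_means >= 0).mean())
    return diffs.mean(), min(p, 1.0)
```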

4 Results

Table 1 reports the mean DSC and mean volume difference (\(\mathrm {\Delta }\)V) between the fully labeled validation set and the predictions for tumor (BRATS) and myocardium (ACDC). For both the label that was always available (HGG or MYOED) and the label that was not present in the partially labeled training dataset (LGG or MYOES), we can make two observations. First, configurations \(\mathbb {I}\) and \(\mathbb{B}\mathbb{I}\) (\(\text {B}=1\)) delivered a segmentation performance (in terms of both DSC and \(\mathrm {\Delta }\)V) comparable to using a fully labeled training dataset. Second, with configurations \(\mathbb{B}\mathbb{I}\) (\(\text {B}>1\)) and \(\mathbb {I}_\epsilon \), the performance was consistently inferior. In this case, the CNN starts to learn when it needs to output empty maps. As a result, when calculating the DSC and \(\mathrm {\Delta }\)V with respect to a fully labeled validation dataset, we expect both metrics to remain similar for HGG and MYOED. On the other hand, we expect a mean DSC of 0 and a \(|\mathrm {\Delta }\)V| close to the mean volume of LGG or MYOES. Note that this is not exactly the case due to the incorrect classification of some LGG or MYOES as HGG or MYOED, respectively. Figure 2 shows the Receiver Operating Characteristic (ROC) curves when using a partially labeled training dataset, with the goal of detecting HGG or MYOED based on a threshold on the predicted volume at test time. For both tasks, we achieved an Area Under the Curve (AUC) of around 0.9. Figure 3 shows an example segmentation.
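
The ROC analysis itself only requires the per-case predicted volumes and a binary indicator of the always-present label; a sketch using scikit-learn is given below (variable names are assumptions).

```python
# Sketch of the ROC analysis in Fig. 2 (variable names are placeholders).
from sklearn.metrics import roc_auc_score, roc_curve

# pred_volumes: predicted foreground volume per validation case
# has_label:    1 if the case carries the always-present label (HGG or MYOED), else 0
fpr, tpr, thresholds = roc_curve(has_label, pred_volumes)
print(f"AUC = {roc_auc_score(has_label, pred_volumes):.2f}")   # around 0.9 per Sect. 4
```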

When comparing \(\mathbb {I}\) with the marginal Dice loss [15] and the leaf Dice loss [7], no significant differences between any of the methods were found for myocardium (MYOED = 0.88, MYOES = 0.88), LV (LVED = 0.96, LVES = 0.92) or RV (RVED = 0.93, RVES = 0.86–0.87) in either ED or ES.

Table 1. Mean DSC and mean \(\mathrm {\Delta }\)V. HGG and MYOED are always present during training while LGG and MYOES are replaced by empty maps under partial labeling. Configurations that we expect to learn to predict empty maps are highlighted (since we used a fully labeled validation set, we expect lower DSC and \(\mathrm {\Delta }\)V). Comparing partial with full labeling, inferior (\(\text {p}<0.05\)) results are indicated in italic.
Fig. 2.

ROC analysis for detecting the label that was always present during training, using different thresholds on the predicted volume. The legend also reports the AUC for each setting.

Fig. 3.

Segmentation examples for BRATS (top) and ACDC (bottom). The ground truths for LGG and MYOES were replaced with empty maps during training (GTtrain).

5 Discussion

The experiments confirmed the analysis from Sect. 2.2 that \(\text {DL}_{\mathbb {I}}\) (equal to \(\text {DL}_{\mathbb{B}\mathbb{I}}\) when \(\text {B}=1\)) ignores missing labels during training and that it can thus be used naively in the context of missing labels. On the other hand, we confirmed that \(\text {DL}_{\mathbb{B}\mathbb{I}}\) (with \(\text {B}>1\)) and \(\text {DL}_{\mathbb {I}}\) (with a heuristic choice of \(\epsilon \)) can effectively learn to predict empty labels, e.g. for classification purposes or for use with small patch sizes.

When heuristically determining \(\epsilon \) for configuring \(\mathbb {I}_{\epsilon }\) (Eq. 5), we only focused on the derivative for \(y_\omega =0\). Of course, by adapting \(\epsilon \), the derivative for \(y_\omega =1\) will also change. Nonetheless, our experiments showed that \(\mathbb {I}_\epsilon \) can achieve the expected behavior, indicating that the effect on the derivative for \(y_\omega =1\) is only minor compared to that for \(y_\omega =0\). We wish to derive a more exact formulation of the optimal value of \(\epsilon \) in future work. We expect this optimal \(\epsilon \) to depend on the class distribution, the object size and other labels that might be present. Furthermore, it would be interesting to study the transition between the near-perfect prediction of the missing class (\(\text {DL}_\mathbb {I}\) with small \(\epsilon \)) and the prediction of empty labels for the missing class (\(\text {DL}_\mathbb {I}\) with large \(\epsilon \)).

All the code necessary for exact replication of the results, including preprocessing, training scripts and statistical analysis, was released to encourage further analysis on this topic (https://github.com/JeroenBertels/dicegrad).

6 Conclusion

We showed that the choice of the reduction dimensions \(\varPhi \) and the smoothing term \(\epsilon \) for the Dice loss is non-trivial and greatly influences its behavior in the context of missing or empty labels. We believe that this work highlights some essential perspectives and hope that it encourages researchers to better describe their exact implementation of the Dice loss in the future.