1 Introduction

Deep convolutional neural networks have attracted enormous attention in medical image segmentation as they have shown superior performance compared to conventional methods in several applications, including automatic segmentation of brain lesions [2, 10], tumors [9, 15, 21], and neuroanatomy [3, 14, 22]. These problems have been addressed with voxelwise network architectures [9, 14, 17] and, more recently, with 3D voxelwise networks [3, 10] and fully convolutional networks (FCNs) [4, 13, 17]. Compared to voxelwise methods, FCNs are fast in training and testing and use all samples to learn image features. Voxelwise networks, on the other hand, may use a subset of samples to reduce data imbalance issues and increase efficiency [17].

Data imbalance is a common issue in medical image segmentation. In lesion detection, for example, the number of non-lesion voxels is typically \({>}500\) times larger than the number of diagnosed lesion voxels. Without balancing the labels, the learning process may converge to a local minimum of a sub-optimal loss function, and predictions may become strongly biased towards non-lesion tissue. The outcome will be high-precision, low-recall segmentations. This is especially undesirable in computer-aided diagnosis or clinical decision support systems, where high sensitivity (recall) is a key characteristic of an automatic detection system.

A common approach to account for data imbalance, especially in voxelwise methods, is to extract an equal number of training samples from each class [20]. The downsides of this approach are that it does not use all the information content of the images and that it may bias the model towards rare classes. Hierarchical training [5, 20, 21] and retraining [9] have been proposed as alternative strategies, but they can be prone to overfitting and sensitive to the state of the initial classifiers [10]. Recent training methods for FCNs have resorted to loss functions based on sample re-weighting [2, 10, 12, 16, 18], where lesion regions, for example, are given more importance than non-lesion regions during training. In the re-weighting approach, the total cost is calculated as a weighted mean over the classes to balance the training samples between classes. The weights are inversely proportional to the probability of each class's appearance, i.e. higher appearance probabilities lead to lower weights. Although this approach works well for some relatively unbalanced data, such as brain extraction [17] and tumor detection [15], it becomes difficult to calibrate and does not perform well for highly unbalanced data such as lesion detection. To eliminate sample re-weighting, Milletari et al. proposed a loss function based on the Dice similarity coefficient [13].
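To make the re-weighting idea concrete, the following sketch (our illustration, not taken from the cited works; it assumes a PyTorch setting, and the function name and tensor shapes are hypothetical) computes class weights inversely proportional to class frequency and plugs them into a weighted cross-entropy cost:

```python
import torch
import torch.nn.functional as F

def inverse_frequency_weights(labels, num_classes):
    """Class weights inversely proportional to class frequency (illustrative)."""
    counts = torch.bincount(labels.flatten(), minlength=num_classes).float()
    freqs = counts / counts.sum()
    weights = 1.0 / (freqs + 1e-8)      # rarer classes receive larger weights
    return weights / weights.sum()      # normalize for numerical stability

# Dummy ground truth and network output for a small 3D patch.
labels = torch.randint(0, 2, (1, 64, 64, 64))   # 0 = non-lesion, 1 = lesion
logits = torch.randn(1, 2, 64, 64, 64)          # unnormalized class scores
w = inverse_frequency_weights(labels, num_classes=2)
loss = F.cross_entropy(logits, labels, weight=w)  # lesion voxels cost more
```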

The Dice similarity coefficient is the harmonic mean of precision and recall, so a Dice loss layer weighs false positives (FPs) and false negatives (FNs) equally. To achieve a better trade-off between precision and recall (FPs vs. FNs), we propose a loss layer based on the Tversky similarity index [19]. The Tversky index is a generalization of the Dice similarity coefficient and the \(F_\beta \) scores. We show how adjusting the hyperparameters of this index allows placing emphasis on false negatives when training a network that generalizes and performs well on highly imbalanced data, as it leads to high sensitivity, Dice, \(F_2\) score, and area under the precision-recall (PR) curve [1] on the test set. To this end, we adopt a 3D FCN, based on the U-net architecture, with a Tversky loss layer, and test it on the challenging multiple sclerosis lesion detection problem with multi-channel MRI [6, 20]. The ability to train a network for higher sensitivity (recall) at the expense of an acceptable decrease in precision is crucial in many medical image segmentation tasks such as lesion detection.

2 Method

2.1 Network Architecture

We design and evaluate our 3D fully convolutional network [12, 18] based on the U-net architecture [16]. To this end, we develop a 3D U-net based on Auto-Net [17] and introduce a new loss layer based on the Tversky index. This U-net style architecture, which has been designed to work with a very small number of training images, is shown in Fig. 1. It consists of a contracting path (to the left) and an expanding path (to the right). To learn and use local information, high-resolution 3D features in the contracting path are concatenated with upsampled versions of global, low-resolution 3D features in the expanding path. Through this concatenation the network learns to use both high-resolution local features and low-resolution global features. The contracting path contains padded \(3\times 3\times 3\) convolutions followed by ReLU non-linearities. A \(2\times 2\times 2\) max pooling operation with stride 2 is applied after every two convolutional layers. After each downsampling by a max pooling layer, the number of feature channels is doubled. In the expanding path, a \(2\times 2\times 2\) transposed convolution is applied after every two convolutional layers, and the resulting feature map is concatenated with the corresponding feature map from the contracting path. At the final layer, a \(1\times 1\times 1\) convolution with a softmax output maps the features to a depth equal to the number of classes (lesion or non-lesion).

Fig. 1. The 3D U-net style architecture. The complete description of the input and output size of each level is presented in Table S1 in the supplementary material.
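For concreteness, the following is a condensed PyTorch-style sketch of the contracting/expanding structure described above. It uses only two resolution levels and illustrative channel sizes, so it should be read as a structural sketch of Fig. 1 rather than the exact configuration listed in Table S1:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two padded 3x3x3 convolutions, each followed by a ReLU.
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class UNet3D(nn.Module):
    """Two-level 3D U-net sketch: contracting path, expanding path, skip connection."""
    def __init__(self, in_ch=3, n_classes=2, base=32):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2)        # downsample
        self.enc2 = conv_block(base, base * 2)                   # feature channels doubled
        self.up = nn.ConvTranspose3d(base * 2, base, kernel_size=2, stride=2)
        self.dec1 = conv_block(base * 2, base)                   # after concatenation
        self.out = nn.Conv3d(base, n_classes, kernel_size=1)     # 1x1x1 convolution

    def forward(self, x):
        e1 = self.enc1(x)                                        # high-resolution local features
        e2 = self.enc2(self.pool(e1))                            # low-resolution global features
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))      # skip connection
        return torch.softmax(self.out(d1), dim=1)                # per-voxel class probabilities
```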

2.2 Tversky Loss Layer

The output layer of the network consists of c planes, one per class (\(c=2\) in lesion detection). A softmax is applied voxelwise across these planes to form the class probabilities used by the loss. Let P and G be the sets of predicted and ground truth binary labels, respectively. The Dice similarity coefficient D between two binary volumes is defined as:

$$\begin{aligned} D(P,G) = \frac{2|PG|}{|P|+|G|} \end{aligned}$$
(1)

If this is used in a loss layer in training [13], it weighs FPs and FNs (precision and recall) equally. To give FNs higher weights than FPs in training our network on highly imbalanced data, where detecting small lesions is crucial, we propose a loss layer based on the Tversky index [19]. The Tversky index is defined as:

$$\begin{aligned} S(P,G;\alpha ,\beta ) = \frac{|PG|}{|PG|+\alpha |P\setminus G|+\beta |G\setminus P|} \end{aligned}$$
(2)

where \(\alpha \) and \(\beta \) control the magnitude of penalties for FPs and FNs, respectively.

To define the Tversky loss function we use the following formulation:

$$\begin{aligned} T(\alpha ,\beta ) = \frac{\sum _{i=1}^Np_{0i}g_{0i}}{\sum _{i=1}^Np_{0i}g_{0i} + \alpha \sum _{i=1}^Np_{0i}g_{1i} + \beta \sum _{i=1}^Np_{1i}g_{0i}} \end{aligned}$$
(3)

where, in the output of the softmax layer, \(p_{0i}\) is the probability that voxel i is a lesion and \(p_{1i}\) is the probability that voxel i is a non-lesion. Likewise, \(g_{0i}\) is 1 for a lesion voxel and 0 for a non-lesion voxel, and vice versa for \(g_{1i}\). The gradient of the loss in Eq. 3 with respect to \(p_{0j}\) and \(p_{1j}\) can be calculated as:

$$\begin{aligned} \frac{\partial T}{\partial p_{0j}} = \frac{g_{0j}(\sum _{i=1}^Np_{0i}g_{0i} + \alpha \sum _{i=1}^Np_{0i}g_{1i} + \beta \sum _{i=1}^Np_{1i}g_{0i}) - (g_{0j} + \alpha g_{1j})\sum _{i=1}^Np_{0i}g_{0i}}{(\sum _{i=1}^Np_{0i}g_{0i} + \alpha \sum _{i=1}^Np_{0i}g_{1i} + \beta \sum _{i=1}^Np_{1i}g_{0i})^2} \end{aligned}$$
(4)
$$\begin{aligned} \frac{\partial T}{\partial p_{1j}} = -\frac{\beta g_{0j} \sum _{i=1}^Np_{0i}g_{0i}}{(\sum _{i=1}^Np_{0i}g_{0i} + \alpha \sum _{i=1}^Np_{0i}g_{1i} + \beta \sum _{i=1}^Np_{1i}g_{0i})^2} \end{aligned}$$
(5)

Using this formulation we do not need to balance class weights during training. Moreover, by adjusting the hyperparameters \(\alpha \) and \(\beta \) we can control the trade-off between false positives and false negatives. It is noteworthy that with \(\alpha = \beta = 0.5\) the Tversky index simplifies to the Dice coefficient, which is also the \(F_1\) score. With \(\alpha = \beta = 1\), Eq. 2 produces the Tanimoto coefficient, and setting \(\alpha +\beta =1\) produces the set of \(F_\beta \) scores. Larger \(\beta \)s weigh recall higher than precision (by placing more emphasis on false negatives). We hypothesize that using higher \(\beta \)s in our generalized loss function during training will effectively shift the emphasis towards lowering FNs and boosting recall.
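A minimal PyTorch sketch of the loss in Eq. 3 is given below. The function name, the small smoothing constant eps, and returning \(1-T\) as the quantity to minimize are our own choices; the default \(\alpha =0.3\), \(\beta =0.7\) corresponds to the best-performing setting reported in Sect. 3:

```python
import torch

def tversky_loss(probs, target_onehot, alpha=0.3, beta=0.7, eps=1e-7):
    """Tversky loss of Eq. 3.

    probs: softmax output of shape (N, C, D, H, W) with channel 0 = lesion.
    target_onehot: one-hot ground truth of the same shape.
    alpha penalizes false positives, beta penalizes false negatives.
    """
    p0, p1 = probs[:, 0], probs[:, 1]               # P(lesion), P(non-lesion)
    g0, g1 = target_onehot[:, 0], target_onehot[:, 1]
    tp = (p0 * g0).sum()
    fp = (p0 * g1).sum()                            # predicted lesion, truly non-lesion
    fn = (p1 * g0).sum()                            # predicted non-lesion, truly lesion
    tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    # alpha = beta = 0.5 recovers a Dice (F1) loss; beta > 0.5 emphasizes recall.
    return 1.0 - tversky
```

Since automatic differentiation computes the gradient of Eq. 3 directly, the closed-form expressions in Eqs. 4 and 5 are mainly needed when a backward pass must be implemented by hand.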

2.3 Experimental Design

We tested our FCN with the Tversky loss layer in segmenting multiple sclerosis (MS) lesions [6, 20]. T1-weighted, T2-weighted, and FLAIR MRI of 15 subjects were used as input, with two-fold cross-validation for training and testing. Images of different sizes were all rigidly registered to a reference image of size \(128 \times 224 \times 256\). Our 3D U-net was trained end-to-end. Cost minimization over 1000 epochs was performed using the ADAM optimizer [11] with an initial learning rate of 0.0001, multiplied by 0.9 every 1000 steps. The training time for this network was approximately 4 h on a workstation with an Nvidia GeForce GTX 1080 GPU.
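The following sketch shows one way such a training configuration could be wired up, reusing the UNet3D and tversky_loss sketches above. The data tensors are dummies, and using StepLR to realize the "multiply by 0.9 every 1000 steps" schedule is our assumption:

```python
import torch

model = UNet3D(in_ch=3, n_classes=2)                        # sketch from Sect. 2.1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # initial learning rate 0.0001
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.9)

# Dummy batch standing in for a registered T1/T2/FLAIR volume and its lesion labels.
mri = torch.randn(1, 3, 32, 32, 32)
gt = torch.randint(0, 2, (1, 32, 32, 32))
target = torch.stack([(gt == 1).float(), (gt == 0).float()], dim=1)  # channel 0 = lesion

for step in range(100):                      # illustrative; the paper trains for 1000 epochs
    probs = model(mri)
    loss = tversky_loss(probs, target, alpha=0.3, beta=0.7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                         # lr *= 0.9 every 1000 optimizer steps
```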

The test fold MRI volumes were segmented by a feedforward pass through the network. The output of the last convolutional layer with softmax non-linearity consisted of probability maps for lesion and non-lesion tissue. Voxels with computed probabilities of 0.5 or more were considered lesion tissue, and those with probabilities \({<}0.5\) were considered non-lesion tissue.
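In code, this decision rule is simply a threshold on the lesion probability channel, which for two classes is equivalent to a voxelwise argmax; a tiny illustration with a dummy probability map:

```python
import torch

probs = torch.rand(2, 128, 224, 256)               # dummy lesion / non-lesion maps
probs = probs / probs.sum(dim=0, keepdim=True)     # make them sum to 1 per voxel
lesion_mask = probs[0] >= 0.5                      # channel 0 = lesion, as in the loss sketch
# For two classes this is the same as probs.argmax(dim=0) == 0.
```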

2.4 Evaluation Metrics

To evaluate the performance of the networks and compare them against the state-of-the-art in MS lesion segmentation, we report the Dice similarity coefficient (DSC): \( DSC = \frac{2\left| P\cap R \right| }{\left| P \right| +\left| R \right| } = \frac{2TP}{2TP+FP+FN}\), where P and R are the predicted and ground truth labels, respectively, and TP, FP, and FN are the numbers of true positive, false positive, and false negative voxels, respectively. We also calculate and report specificity, \(\frac{TN}{TN+FP}\), sensitivity, \(\frac{TP}{TP+FN}\), and the \(F_2\) score, a measure commonly used in applications where recall is more important than precision (as compared to \(F_1\) or DSC): \(F_2 = \frac{5TP}{5TP+4FN+FP}\). To critically evaluate detection performance on this highly unbalanced (skewed) dataset, we use the precision-recall (PR) curve (as opposed to the receiver operating characteristic, or ROC, curve) as well as the area under the PR curve (the APR score) [1, 7, 8]. For such skewed datasets, the PR curves and APR scores (on test data) are the preferred figures of merit for algorithm performance.
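These quantities can be computed directly from the binary masks and the voxelwise lesion probabilities. The helper functions below (our own names, assuming NumPy and scikit-learn, and assuming at least one lesion voxel in the ground truth) mirror the definitions above:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def segmentation_scores(pred, gt):
    """DSC, sensitivity, specificity, and F2 from binary prediction/ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    dsc = 2 * tp / (2 * tp + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f2 = 5 * tp / (5 * tp + 4 * fn + fp)
    return dsc, sensitivity, specificity, f2

def apr_score(lesion_prob, gt):
    """Area under the precision-recall curve from voxelwise lesion probabilities."""
    precision, recall, _ = precision_recall_curve(gt.ravel(), lesion_prob.ravel())
    return auc(recall, precision)
```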

3 Results

To evaluate the effect of the Tversky loss function and compare it with the Dice loss in lesion segmentation, we trained our FCN with different \(\alpha \) and \(\beta \) values. The performance metrics (on the test set) are reported in Table 1. The results show that (1) the balance between sensitivity and specificity was controlled by the parameters of the loss function; and (2) according to all combined test measures, the best results were obtained from the FCN trained with \(\beta =0.7\), which performed better than the FCN trained with the Dice loss layer corresponding to \(\alpha =\beta =0.5\).

Table 1. Performance metrics (on the test set) for different values of the hyperparameters \(\alpha \) and \(\beta \) used in training the FCN. The best value for each metric is highlighted in bold. As expected, higher \(\beta \) led to higher sensitivity (recall) and lower specificity. The combined performance metrics, in particular APR, \(F_2\), and DSC, indicate that the best performance was achieved at \(\beta =0.7\).

Figure 2(a) shows the PR curve for the entire test dataset, and Fig. 2(b) and (c) show the PR curves for two cases with extremely high and extremely low density of lesions, respectively. The best results based on the precision-recall trade-off were always obtained at \(\beta =0.7\) and not with the Dice loss function.

Figures 3 and 4 show the effect of different penalty magnitudes (\(\beta \)s) on segmenting a subject with a high density of lesions and a subject with very few lesions, respectively. These cases, which correspond to the PR curves shown in Fig. 2(b) and (c), show that the best performance was achieved by using a loss function with \(\beta =0.7\) in training. We note that the network trained with the Dice loss layer (\(\beta =0.5\)) did not detect the lesions in the case shown in Fig. 4.

Fig. 2. PR curves with different \(\alpha \) and \(\beta \) for: (a) the whole test set; (b) a subject with a high density of lesions (Fig. 3); and (c) a subject with a very low density of lesions (Fig. 4).

Fig. 3. The effect of different FP and FN penalties in the Tversky loss function on a case with an extremely high density of lesions. The best results were obtained at \(\beta =0.7\).

Fig. 4. The effect of different FP and FN penalties in the Tversky loss function on a case with an extremely low density of lesions. The best results were obtained at \(\beta =0.7\).

4 Discussion and Conclusion

We introduced a new loss function based on the Tversky index, which generalizes the Dice coefficient and the \(F_\beta \) scores, to achieve an improved trade-off between precision and recall in segmenting highly unbalanced data via deep learning. To this end, we added our proposed loss layer to a state-of-the-art 3D fully convolutional deep neural network based on the U-net architecture [16, 17]. Experimental results in MS lesion segmentation show that all performance evaluation metrics (on the test data) improved when the Tversky loss function was used instead of the Dice similarity coefficient in the loss layer. While the loss function was deliberately designed to weigh recall higher than precision (at \(\beta =0.7\)), consistent improvements in all test performance metrics, including the DSC and \(F_2\) scores, indicate improved generalization through this type of training. Compared to the DSC, which weighs recall and precision equally, and to ROC analysis, we consider the area under the PR curve (APR, shown in Fig. 2) the most reliable performance metric for such highly skewed data [1, 8]. To put the work in context, we reported average DSC, \(F_2\), and APR scores (56.4, 57.3, and 56.0, respectively), which indicate that our approach performed very well compared to the latest results in MS lesion segmentation [6, 20]. We did not conduct a direct comparison in the application domain, however, as this paper was intended to provide a proof of concept of the effect and usefulness of the Tversky loss layer (and \(F_\beta \) scores) in deep learning. Future work involves training and testing on larger, standard datasets in multiple applications to compare against state-of-the-art segmentations using appropriate performance criteria.