DOMINO: Domain-Aware Model Calibration in Medical Image Segmentation

Stolte, Skylar E.; Volle, Kyle; Indahlastari, Aprinda; Albizu, Alejandro; Woods, Adam J.; Brink, Kevin; Hale, Matthew; Fang, Ruogu

doi:10.1007/978-3-031-16443-9_44


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13435)

Included in the following conference series:

International Conference on Medical Image Computing and Computer-Assisted Intervention


Abstract

Model calibration measures the agreement between the predicted probability estimates and the true correctness likelihood. Proper model calibration is vital for high-risk applications. Unfortunately, modern deep neural networks are poorly calibrated, which compromises their trustworthiness and reliability. Medical image segmentation particularly suffers from this due to the natural uncertainty of tissue boundaries. The problem is exacerbated by the loss functions used for segmentation, which favor overconfidence in the majority classes. We address these challenges with DOMINO, a domain-aware model calibration method that leverages the semantic confusability and hierarchical similarity between class labels. Our experiments demonstrate that our DOMINO-calibrated deep neural networks outperform non-calibrated models and state-of-the-art morphometric methods in head image segmentation. Our results show that our method can consistently achieve better calibration, higher accuracy, and faster inference times than these methods, especially on rarer classes. We attribute this performance to our domain-aware regularization, which informs semantic model calibration. These findings show the importance of semantic ties between class labels in building confidence in deep learning models. The framework has the potential to improve the trustworthiness and reliability of generic medical image segmentation models. The code for this article is available at: https://github.com/lab-smile/DOMINO.


1 Introduction

Machine learning calibration measures the agreement between the predicted probability estimates and the true correctness likelihood [8]. Proper calibration is vital for high-risk applications. Modern deep neural networks (DNNs) achieve impressive accuracy but are poorly calibrated [8]. Poorly calibrated DNNs are unreliable on out-of-distribution data and do not know when they are likely to be incorrect. This discrepancy leaves them vulnerable in critical decision-making settings such as self-driving cars, surgical robots, and disease subtyping. In contrast, well-calibrated models are less certain when incorrect and comparably certain when correct. Their reliable confidence establishes trustworthiness.

We hypothesize that domain-aware model calibration that leverages the semantic confusability and hierarchical similarity among class labels can yield well-calibrated and higher-performing models. To test this hypothesis, we have chosen medical image segmentation because it is fundamental in medical image analysis. Overly confident tissue boundaries can introduce significant errors in brain volume estimations [4]. Head image segmentation is prone to errors due to fine tissue boundaries, tissue imbalance, and low contrast. These challenges can make open-source software fall short on patient sub-populations [3, 12, 17]. Errors in head segmentation can propagate to downstream steps in clinical pipelines, such as estimating parameters for non-invasive brain stimulation [2, 11].

Hence, we address uncertainty in medical image segmentation by introducing DOMINO, a framework that leverages domain information among class labels to calibrate DNNs. Unlike prior works that push class means to be orthogonal [15], we assume some class labels are naturally similar. The choice of the loss function is important to calibration because loss drives how a model learns [16]. Medical image segmentation still largely relies on standard losses [1]. We extend these approaches with domain-aware loss regularization to improve model calibration. We study two regularization schemes that are based on confusion matrices (CM) and hierarchical classes (HC). The former imposes a penalty based on class confusability when using a standard network on a held-out data subset. The latter groups labels into hierarchical classes based on common tissue properties.

2 Domain-Aware Model Calibration

2.1 U-Net Transformers (UNETR) Model

We employ UNETR [9] as our base model due to its superior segmentation performance. UNETR utilizes a U-Net architecture with a transformer encoder. This approach combats the relative locality of convolutional layers in fully convolutional networks (FCNs). Transformers have revolutionized Natural Language Processing due to superior long-range learning [18]. Transformers encode images as sequences of one-dimensional patch embeddings. Self-attention modules learn weighted sums from hidden layers. Hence, UNETR reformulates 3D image segmentation as sequence-to-sequence predictions. Skip connections pass the transformer’s global context to a traditional FCN decoder. The decoder concatenates local information with the global multi-scale information from the encoder. This paper refers to un-regularized UNETR as UNETR-Base.

2.2 Domain-Aware Loss Regularization

Concept. Our penalty addresses a deficit of cross-entropy (CE) loss in representing uncertainty. CE loss maximizes the output of the ground truth label. As a result, the network increases the true label logit more than the incorrect label logits, and the resulting networks are overly confident in their predictions. Meanwhile, the softmax outputs of the non-selected classes do not represent their true likelihood. Our work introduces more meaningful uncertainty by penalizing incorrect classes. Specifically, we assume that some classes are more similar to each other than to the rest. Network training often pushes class means to all be orthogonal to one another [15]. Such networks assume that all classes are equally separable. This assumption fights the natural similarities between certain classes. Thus, we hypothesize that a network can learn better class representations by taking advantage of class similarities, rather than fighting them. Our methods apply to both classification and segmentation, since segmentation can be treated as pixel-wise classification [13].

Derivation. Our regularization term adds to any loss function as follows:

$$\begin{aligned} \mathcal {L}(y, \hat{y}) + \beta \, y^{\top } W \hat{y} \end{aligned}$$

(1)

where $\mathcal {L}$ is a suitable loss function (we use DiceCE, which combines Dice and cross-entropy), y is the one-hot encoded true label, $y^{\top }$ is its transpose, and ŷ is the softmax output. $\beta $ can take on any value between zero and one. W is a generic penalty matrix of size $N\times N$, where N is the number of classes. Its diagonal entries are zero, whereas its off-diagonal entries are the penalties for confusing one class with another. Because y is one-hot, $y^{\top } W$ selects the row of W that belongs to the true class, so the added term is a penalty-weighted sum of the softmax mass assigned to the incorrect classes. We propose two domain-aware approaches to design W below.
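As an illustration of the regularization term in Eq. (1), the following minimal sketch evaluates $\beta \, y^{\top } W \hat{y}$ for a single pixel in plain Python. The function name `domino_penalty`, the 3-class matrix `W`, and the example softmax vectors are hypothetical and chosen only for illustration; they are not taken from the released code, and a practical implementation would presumably vectorize this over all voxels with tensors.

```python
def domino_penalty(y_onehot, W, y_hat, beta):
    """Return beta * y^T W y_hat for one pixel, using plain Python lists."""
    n = len(W)
    # y^T W: with a one-hot y this picks out the penalty row of the true class.
    row = [sum(y_onehot[i] * W[i][j] for i in range(n)) for j in range(n)]
    # Weight the softmax mass of each class by its penalty and scale by beta.
    return beta * sum(r * p for r, p in zip(row, y_hat))


# Hypothetical penalty matrix: zero diagonal, classes 0 and 1 treated as similar.
W = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 3.0],
    [3.0, 3.0, 0.0],
]
y = [1, 0, 0]  # true class is 0

near_miss = [0.6, 0.4, 0.0]  # leftover mass on the similar class 1
far_miss = [0.6, 0.0, 0.4]   # leftover mass on the dissimilar class 2

penalty_near = domino_penalty(y, W, near_miss, beta=0.5)
penalty_far = domino_penalty(y, W, far_miss, beta=0.5)
print(penalty_near, penalty_far)
```

Both predictions assign the same probability to the true class, yet the one that spreads its remaining mass onto the dissimilar class receives the larger penalty, which is the behavior the term is meant to encourage.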

Confusion Matrix (UNETR-CM). Confusion matrix-based calibration exploits the natural confusability among class labels, as measured with a non-calibrated DNN. First, we train UNETR-Base without regularization on the training set. Then, we evaluate the trained model on a held-out validation set to generate a confusion matrix for all classes. The regularization matrix is computed as follows:

$$\begin{aligned} W_{ij} = S \cdot \frac{I_i - C_{ij}}{I_i} \end{aligned}$$

(2)

Here, i and j represent the row and column indices, respectively. C is the confusion matrix generated when UNETR-Base is applied to a held-out validation set, normalized by class prevalence. $W_{ij}$ represents any given matrix entry. $I_i$ is the $i^{th}$ row of the identity matrix. Thus $W_{ii}=0$, so there is no penalty for the correct class. Finally, S is a scaling factor that makes the regularization weights more significant. We set $S=3$ based on empirical experiments; however, jointly varying $\beta $ and S changes the balance of the loss function. When both values are low, the regularization has essentially no effect; when they are too high, the regularization begins to degrade model accuracy. The correct values for these hyperparameters will depend on the model and dataset.
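To make the construction concrete, the sketch below builds a penalty matrix from raw confusion counts under one possible reading of Eq. (2). It is an assumption, not necessarily the authors' exact formula: each row of the confusion matrix is normalized to sum to one, every off-diagonal entry becomes $S \cdot (1 - C_{ij})$, and the diagonal is set to zero. The function name `penalty_from_confusion` and the example counts are hypothetical.

```python
def penalty_from_confusion(counts, scale=3.0):
    """Build a penalty matrix W from raw confusion counts.

    Assumed reading (not confirmed by the paper): rows are normalized to sum
    to one, W_ij = scale * (1 - C_ij) for i != j, and W_ii = 0.
    """
    n = len(counts)
    W = []
    for i in range(n):
        total = sum(counts[i])
        row = []
        for j in range(n):
            if i == j:
                row.append(0.0)  # no penalty for the correct class
            else:
                # A class absent from the validation set gets a rate of 0,
                # so its confusions receive the full penalty.
                rate = counts[i][j] / total if total else 0.0
                row.append(scale * (1.0 - rate))
        W.append(row)
    return W


# Hypothetical counts: rows are true classes, columns are predicted classes.
counts = [
    [80, 15, 5],
    [10, 85, 5],
    [0, 0, 100],
]
W = penalty_from_confusion(counts, scale=3.0)
for row in W:
    print(row)
```

Under this reading, class pairs that the baseline confuses more often receive a smaller penalty, while pairs that are never confused receive the full penalty S.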

Table 1. Hierarchical class groupings. $^{*}$Eyes are considered to fall within both CSF and soft tissue because they have aqueous and fibrous components.

Hierarchical Class (UNETR-HC). Here, we regularize using hierarchical relationships between semantic labels. Classes within the same hierarchical group are more likely to share properties than classes from different groups. Hence, confusion within groups leads to more informed and safer mistakes when the model is wrong. Table 1 shows the hierarchy for our head segmentation. We define the matrix penalty in Fig. 1b by considering which classes are subsets of the same super-class. In Fig. 1b, each row represents the penalties for confusing the given class with any other class. The maximum penalty is 3, and penalties are manually lowered within the groups of Table 1. The eye class is considered close to two groupings. This matrix penalty is more subjective than that of UNETR-CM, but it incorporates domain knowledge.

3 Experiments and Results

3.1 Dataset

This study uses data from a Phase III clinical trial on cognitive training and non-invasive brain stimulation for cognitive improvements. The study recruited participants aged 65–89 years with age-related cognitive decline. The trial was approved by all relevant Institutional Review Boards. Structural T1-weighted magnetic resonance images (MRIs) were obtained using a 32-channel, receive-only head coil on a 3-T Siemens MAGNETOM Prisma MRI scanner. MPRAGE sequence parameters: repetition time $=$ 1800 ms; echo time $=$ 2.26 ms; flip angle $=$ 8$^\circ $; field of view $=$ $256 \times 256 \times 256$ mm; voxel size $=$ 1 mm$^3$.

Ground Truth. Trained staff segmented the T1 MRIs into 11 tissues using semi-automated segmentation. These 11 tissues included muscle, fat, skin, cortical bone, cancellous bone, major artery (blood), air, cerebrospinal fluid (CSF), eyes, grey matter (GM), and white matter (WM). Semi-automated segmentation consists of automated segmentation followed by manual correction. First, base segmentations for WM, GM, and bone were obtained using Headreco, while air was generated in the Statistical Parametric Mapping toolbox (SPM12). Next, these automatic outputs were manually corrected using ScanIP Simpleware™ (version 2018.12, Synopsys, Inc., Mountain View, USA). Bone was separated into cancellous and cortical tissue using thresholding and morphology. Blood, skin, fat, muscle, and eyes (sclera and lens) were manually segmented in Simpleware. CSF was generated by subtracting the other ten tissues from the entire head. The resulting 11 tissue masks served as the ground truths for learned segmentation.

Implementation Details. We implement UNETR using the Medical Open Network for Artificial Intelligence (MONAI-0.8) in PyTorch 1.10.0 [6]. We split our 113 MRIs into 93 training, 10 validation, and 10 testing images. Each DNN required 1 GPU, 4 CPUs, and 30 GB of memory. Each model was trained for 25,000 iterations, with evaluation every 500 iterations. The models were trained on $256 \times 256 \times 256$ images with a batch size of 2 images. We trained our models with the Adam optimizer, a stochastic gradient-based method. UNETR segmentation took 3 s per head, whereas Headreco takes roughly 20 min per head.

3.2 Evaluation Metrics

We employ the following metrics on the 11-class and 6-class segmentation tasks.

Dice. The Dice score represents the overlap of two binary masks [5]: $Dice = \frac{2|Y \cap \hat{Y}|}{|Y| + |\hat{Y}|}$, where Y and $\hat{Y}$ represent the ground truth mask and generated mask for a given tissue, respectively. A perfect overlap between the two masks gives a Dice score of 1, whereas a score of 0 indicates no overlap.
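The Dice formula above can be checked on a toy example. The following sketch computes it for two flattened binary masks in plain Python; the function name `dice_score` and the example masks are illustrative, and returning 1.0 when both masks are empty is an assumed convention that the paper does not specify.

```python
def dice_score(truth, pred):
    """Dice = 2|Y ∩ Ŷ| / (|Y| + |Ŷ|) for two flat binary masks."""
    if len(truth) != len(pred):
        raise ValueError("masks must have the same number of elements")
    # Voxels labeled as the tissue in both masks.
    intersection = sum(1 for t, p in zip(truth, pred) if t and p)
    size_sum = sum(1 for t in truth if t) + sum(1 for p in pred if p)
    if size_sum == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / size_sum


truth = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0]
score = dice_score(truth, pred)  # 2 shared voxels, 3 + 3 voxels in total
print(score)
```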

Hausdorff Distance (Hausdorff). The Hausdorff distance is computed from the distances between the closest points in two point sets [7, 10]. Hausdorff distances are generally more robust than Dice with respect to the precise boundaries.

$$\begin{aligned} H(Y,\hat{Y}) = \max (h(Y,\hat{Y}), h(\hat{Y},Y)) \end{aligned}$$

(3)

$$\begin{aligned} h(Y,\hat{Y}) = \max _{y \in Y}(\min _{\hat{y} \in \hat{Y}}(d(y,\hat{y}))),\quad h(\hat{Y},Y) = \max _{\hat{y} \in \hat{Y}}(\min _{y \in Y}(d(\hat{y},y))) \end{aligned}$$

(4)

where y represents a point in Y and $\hat{y}$ represents a point in $\hat{Y}$. $H(Y,\hat{Y})$ is the overall modified Hausdorff distance, whereas $h(Y,\hat{Y})$ and $h(\hat{Y},Y)$ are directed Hausdorff distances. $d(y,\hat{y})$ and $d(\hat{y},y)$ are Euclidean distances. A smaller Hausdorff distance indicates better segmentation.
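For small point sets, Eqs. (3) and (4) can be evaluated directly. The sketch below follows the equations as written (maximum over nearest-neighbor Euclidean distances, in both directions); it does not implement the averaging variant of [7]. The function names and example coordinates are hypothetical, and the brute-force loops are only practical for toy inputs.

```python
import math


def directed_hausdorff(A, B):
    """h(A, B): the largest distance from a point in A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def hausdorff(A, B):
    """H(A, B): the larger of the two directed distances, as in Eq. (3)."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))


# Hypothetical boundary points; the sets differ only in their last point.
Y = [(0, 0), (1, 0), (2, 0)]
Y_hat = [(0, 0), (1, 0), (2, 3)]

h_forward = directed_hausdorff(Y, Y_hat)   # driven by (2, 0), nearest is (1, 0)
h_backward = directed_hausdorff(Y_hat, Y)  # driven by (2, 3), nearest is (2, 0)
H = hausdorff(Y, Y_hat)
print(h_forward, h_backward, H)
```

The two directed distances differ here, which is why Eq. (3) takes their maximum: a single stray boundary point in either mask is enough to raise the overall distance.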

Top-N Accuracy. Top-N accuracy measures how often the true class falls within the N highest softmax outputs. This metric captures the information carried by the outputs of the classes that were not selected. For instance, higher Top-2 and Top-3 accuracy can show that a well-calibrated model makes reasonable mistakes that are supported by the data, rather than random misclassifications.
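A minimal sketch of Top-N accuracy on per-pixel softmax outputs follows. The function name `top_n_accuracy` and the four example pixels are hypothetical; ties between equal probabilities are broken by class index here, which is an arbitrary choice.

```python
def top_n_accuracy(probs, labels, n):
    """Fraction of pixels whose true label is among the n largest softmax outputs."""
    hits = 0
    for p, label in zip(probs, labels):
        # Class indices sorted from most to least probable.
        ranked = sorted(range(len(p)), key=lambda c: p[c], reverse=True)
        if label in ranked[:n]:
            hits += 1
    return hits / len(labels)


# Hypothetical softmax outputs for four pixels and three classes.
probs = [
    [0.70, 0.20, 0.10],  # true class ranked first
    [0.40, 0.50, 0.10],  # true class ranked second
    [0.25, 0.35, 0.40],  # true class ranked second
    [0.10, 0.30, 0.60],  # true class ranked third
]
labels = [0, 0, 1, 0]

top1 = top_n_accuracy(probs, labels, 1)
top2 = top_n_accuracy(probs, labels, 2)
top3 = top_n_accuracy(probs, labels, 3)
print(top1, top2, top3)
```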

Calibration Curves. Calibration curves show the relationship between the predicted probability estimates and the true correctness likelihood. These plots are designed for binary classification, so for segmentation each class in turn is treated as “positive” and compared against the remaining classes as “negative”. The observed prevalence of the positive class is plotted against the predicted certainty for that class. Perfect calibration is a straight line from the origin to (1,1).
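The one-versus-rest procedure described above can be sketched as follows. The paper does not state how predictions are binned, so the equal-width bins, the function name `calibration_curve`, and the example values are all assumptions made for illustration.

```python
def calibration_curve(confidences, is_positive, n_bins=5):
    """Return (mean predicted probability, observed positive fraction) per bin.

    confidences: predicted probability of the "positive" class for each pixel.
    is_positive: 1 if the pixel truly belongs to that class, else 0.
    Assumption: equal-width bins on [0, 1]; empty bins are skipped.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, pos in zip(confidences, is_positive):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, pos))
    curve = []
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            frac_pos = sum(1 for _, p in members if p) / len(members)
            curve.append((mean_conf, frac_pos))
    return curve


# Hypothetical one-vs-rest predictions for a single tissue class.
confidences = [0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
is_positive = [0, 0, 1, 1, 1, 0]
curve = calibration_curve(confidences, is_positive, n_bins=2)
print(curve)
```

In this toy case the high-confidence bin has a mean confidence near 0.9 but only three of its four pixels are truly positive, so its point falls below the diagonal, which is the signature of overconfidence.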

3.3 Calibrated Models Outperform UNETR-Base on 11-Classes

Qualitative Analysis. Figure 2 shows that UNETR-HC best captures the fine detail of the boundary between GM and CSF. This is most noticeable in the upper left and upper right “grooves” shown in light blue (CSF). UNETR-HC attempts to trace out these regions and label them as CSF, whereas UNETR-Base and UNETR-CM assign more of these pixels to GM. This boundary is a major challenge in automatic segmentation due to partial volume effects.

Table 2. Top-N accuracy on 11 classes


Quantitative Comparison. Figure 3 and Table 2 show the Dice, Hausdorff, and Top-N accuracy results. UNETR-CM performs best in Dice and Top-N accuracy, whereas both UNETR-CM and UNETR-HC outperform UNETR-Base in Hausdorff. Hence, UNETR-CM classifies the most pixels correctly, whereas both calibrated models capture tissue boundaries better than UNETR-Base.

Table 3. Top-N accuracy on 6 classes


3.4 Calibrated UNETR Outperforms or Performs Comparably to Headreco in 6-Class Segmentation

Qualitative Analysis. We compare on 6 classes because the current field standard in head segmentation (e.g., Headreco) provides different tissues than our method. For example, Headreco [14] uses 8 tissues and SPM uses 6 tissues. Thus, we had to combine tissues into groups for a fair comparison. We combine DOMINO classes that are subsets of Headreco classes; for example, cancellous and cortical bone are both labeled as bone. Figure 4 shows the results for our models and Headreco. Differences are highlighted with white rectangles. Our methods show comparable or superior performance to Headreco across all tissue types.

Quantitative Comparison. Figure 5 and Table 3 show the Dice, Hausdorff, and Top-1/2/3 accuracy on 6 classes. Calibrated UNETR is comparable to Headreco in WM, GM, and CSF; our models outperform Headreco in air, bone, and soft tissue. UNETR-HC’s Hausdorff shows that the regularization can improve 6-class segmentation without retraining. UNETR-CM performs the best in Top-1/2/3 accuracy. Figure 6 shows that DOMINO achieves better calibration than UNETR-Base. All algorithms are approximately evenly calibrated on GM and air. Our methods are better calibrated than Headreco on WM, CSF, bone, and soft tissue.

4 Conclusions

There is often a trade-off between performance and calibration. This work proposes a novel domain-aware calibration method that improves model calibration, Top-N accuracy, and segmentation metrics. The calibrated models perform well on full-class and reduced-class tasks without retraining. This highly flexible approach can be applied to a wide range of medical segmentation tasks. Further, model calibration can help improve cross-talk between automated algorithms and manual labelers. Finally, our calibration can be applied to classification tasks in medical image diagnosis. We will release DOMINO to the community to support open science research.

References

1. Abdar, M., et al.: A review of uncertainty quantification in deep learning: techniques, applications and challenges. Inf. Fusion 76, 243–297 (2021)
2. Albizu, A., et al.: Machine learning and individual variability in electric field characteristics predict tDCS treatment response. Brain Stimul. 13(6), 1753–1764 (2020)
3. Antonenko, D., Grittner, U., Saturnino, G., Nierhaus, T., Thielscher, A., Flöel, A.: Inter-individual and age-dependent variability in simulated electric fields induced by conventional transcranial electrical stimulation. NeuroImage 224, 117413 (2021)
4. Ballester, M.A.G., Zisserman, A.P., Brady, M.: Estimation of the partial volume effect in MRI. Med. Image Anal. 6(4), 389–405 (2002)
5. Bertels, J., et al.: Optimizing the Dice score and Jaccard index for medical image segmentation: theory and practice. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11765, pp. 92–100. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32245-8_11
6. Consortium, M.: MONAI: medical open network for AI, March 2020. https://doi.org/10.5281/zenodo.6114127
7. Dubuisson, M.P., Jain, A.K.: A modified Hausdorff distance for object matching. In: Proceedings of 12th International Conference on Pattern Recognition, vol. 1, pp. 566–568. IEEE (1994)
8. Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 1321–1330. PMLR, 06–11 August 2017. https://proceedings.mlr.press/v70/guo17a.html
9. Hatamizadeh, A., et al.: UNETR: transformers for 3D medical image segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 574–584 (2022)
10. Huttenlocher, D.P., Klanderman, G.A., Rucklidge, W.J.: Comparing images using the Hausdorff distance. IEEE Trans. Pattern Anal. Mach. Intell. 15(9), 850–863 (1993)
11. Indahlastari, A., et al.: Individualized tDCS modeling predicts functional connectivity changes within the working memory network in older adults. Brain Stimul. 14(5), 1205–1215 (2021)
12. Indahlastari, A., et al.: Modeling transcranial electrical stimulation in the aging brain. Brain Stimul. 13(3), 664–674 (2020)
13. Jadon, S.: A survey of loss functions for semantic segmentation. In: 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pp. 1–7. IEEE (2020)
14. Nielsen, J.D., et al.: Automatic skull segmentation from MR images for realistic volume conductor models of the head: assessment of the state-of-the-art. NeuroImage 174, 587–598 (2018)
15. Papyan, V., Han, X., Donoho, D.L.: Prevalence of neural collapse during the terminal phase of deep learning training. Proc. Natl. Acad. Sci. 117(40), 24652–24663 (2020)
16. Taghanaki, S.A., et al.: Combo loss: handling input and output imbalance in multi-organ segmentation. Comput. Med. Imaging Graph. 75, 24–33 (2019)
17. Wilke, M., Schmithorst, V., Holland, S.: Normative pediatric brain data for spatial normalization and segmentation differs from standard adult data. Magn. Reson. Med. 50(4), 749–757 (2003)
18. Wolf, T., et al.: Huggingface’s transformers: state-of-the-art natural language processing (2020)


Acknowledgements

This work was supported by the National Institutes of Health/National Institute on Aging (NIA RF1AG071469, NIA R01AG054077), the National Science Foundation (1908299), and the NSF-AFRL INTERN Supplement (2130885). We acknowledge NVIDIA AI Technology Center (NVAITC) for their suggestions. We also thank Jiaqing Zhang for formatting assistance.

Author information

Authors and Affiliations

J. Crayton Pruitt Family Department of Biomedical Engineering, Herbert Wertheim College of Engineering, University of Florida (UF), Gainesville, USA
Skylar E. Stolte & Ruogu Fang
Department of Mechanical and Aerospace Engineering, Herbert Wertheim College of Engineering, UF, Gainesville, USA
Kyle Volle & Matthew Hale
Center for Cognitive Aging and Memory, McKnight Brain Institute, UF, Gainesville, USA
Aprinda Indahlastari, Alejandro Albizu, Adam J. Woods & Ruogu Fang
Department of Clinical and Health Psychology, College of Public Health and Health Professions, UF, Gainesville, USA
Aprinda Indahlastari & Adam J. Woods
Department of Neuroscience, College of Medicine, UF, Gainesville, USA
Alejandro Albizu & Adam J. Woods
United States Air Force Research Laboratory, Eglin Air Force Base, FL, USA
Kevin Brink
Department of Electrical and Computer Engineering, Herbert Wertheim College of Engineering, UF, Gainesville, USA
Ruogu Fang


Corresponding author

Correspondence to Ruogu Fang.

Editor information

Editors and Affiliations

Rochester Institute of Technology, Rochester, NY, USA
Linwei Wang
Chinese University of Hong Kong, Hong Kong, Hong Kong
Qi Dou
University of Virginia, Charlottesville, VA, USA
P. Thomas Fletcher
National Center for Tumor Diseases (NCT/UCC), Dresden, Germany
Stefanie Speidel
Case Western Reserve University, Cleveland, OH, USA
Shuo Li

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 4153 KB)


About this paper

Cite this paper

Stolte, S.E. et al. (2022). DOMINO: Domain-Aware Model Calibration in Medical Image Segmentation. In: Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (eds) Medical Image Computing and Computer Assisted Intervention – MICCAI 2022. MICCAI 2022. Lecture Notes in Computer Science, vol 13435. Springer, Cham. https://doi.org/10.1007/978-3-031-16443-9_44


DOI: https://doi.org/10.1007/978-3-031-16443-9_44
Published: 16 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16442-2
Online ISBN: 978-3-031-16443-9
eBook Packages: Computer Science (R0)
