
1 Introduction

Image segmentation is an important task in medical ultrasound imaging. For example, peripheral nerves are often detected and screened by ultrasound, which has become a conventional modality for computer-aided diagnosis (CAD) [20]. As entrapment neuropathies can be accurately screened and diagnosed by ultrasound [2, 3, 25], the segmentation of peripheral nerves helps experts identify anatomical structures, measure nerve parameters, and provide real-time guidance for therapeutic purposes. In addition, breast ultrasound images (BUSI) can guide experts in localizing and characterizing breast tumors, which is also one of the key procedures in CAD [27].

The advancements in deep learning enable automatic segmentation of ultrasound images, though deep models still require large, high-quality datasets. The scarcity of labeled data has motivated several studies on learning from limited supervision, such as transfer learning [24], supervised domain adaptation [19, 22], and unsupervised domain adaptation [6, 12, 17]. In practice, separate datasets are needed to train models that segment different anatomical structures or lesions with different levels of malignancy. For example, peripheral nerves can be detected and identified across different anatomical sites, such as the peroneal nerve (located below the knee) and the ulnar nerve (located inside the elbow). Typically, annotated datasets for the peroneal and ulnar nerves are constructed separately, and models are trained separately. However, since these models perform a similar task, i.e., segmenting nerve structures from ultrasound images, a single model may instead be trained jointly on peroneal and ulnar nerve data to leverage the variability in the heterogeneous datasets and improve generalization. A similar argument applies to breast ultrasound: a breast tumor is categorized as either benign or malignant, and we examine the effectiveness of a single model that segments both types of lesions. While a simple approach would be to pool multiple datasets for training, imaging characteristics vary among datasets, and it is challenging to train models that cope with the resulting distribution shift and generalize well across the entire collection of heterogeneous datasets [4, 26, 28].

In this paper, we consider methods to jointly train a single model on heterogeneous datasets. We combine the heterogeneous datasets into one dataset and call each component dataset a sub-group. We seek a model that can adapt to domain shifts among sub-groups and improve segmentation performance. We leverage the recently proposed Segment Anything Model (SAM), which has shown great success in natural image segmentation [14]. However, several studies have shown that SAM can fail on medical image segmentation tasks [5, 9, 10, 16, 29]. We adapt SAM to distribution shifts across sub-groups using a novel condition embedding method, called SAM with Condition Embedding block (CEmb-SAM). In CEmb-SAM, we encode sub-group conditions and combine them with image embeddings. Our experiments show that sub-group conditioning guides SAM to adapt effectively to each sub-group: compared with SAM [14] and MedSAM [16], CEmb-SAM shows consistent improvements in the segmentation of both peripheral nerves and breast lesions. Our main contributions are as follows:

  • We propose CEmb-SAM, which jointly trains a single model over heterogeneous datasets, leveraging the Segment Anything Model for robust segmentation performance.

  • We propose a condition embedding module that combines sub-group representations with image embeddings, effectively adapting the Segment Anything Model to sub-group conditions.

  • Experiments on the peripheral nerve and the breast cancer datasets demonstrate that CEmb-SAM significantly outperforms the baseline models.

Fig. 1. (A) CEmb-SAM: Segment Anything Model with Condition Embedding block. Input images come from heterogeneous datasets, i.e., the datasets of peroneal and ulnar nerves, and the model is jointly trained to segment both types of nerves. The sub-group condition is fed into the Condition Embedding block and encoded into sub-group representations, which are then combined with the image embeddings. The image and prompt encoders are frozen while the Condition Embedding block and mask decoder are fine-tuned. (B) Detailed view of the Condition Embedding block. The sub-group condition is encoded into learnable parameters \(\gamma \) and \(\beta \), and the input feature \(F^{\tiny \text {in}}\) is scaled and shifted using those parameters.

2 Method

The training dataset is a mixture of \(m\) heterogeneous datasets, or sub-groups. The training dataset with \(m\) mutually exclusive sub-groups \(\mathcal {D} = \textbf{g}_{1}\cup \textbf{g}_{2}\cup \dots \cup \textbf{g}_{m}\) consists of \(N\) samples, \(\mathcal {D}=\{ (x_{i}, y_{i}, y_{i}^{a})\}_{i=1}^{N}\), where \(x_{i}\) is an input image and \(y_{i}\) is the corresponding ground-truth mask. The sub-group condition \(y_{i}^{a} \in \{0, \dots , m-1\}\) is the index of the sub-group to which the sample belongs. The peripheral nerve dataset consists of seven sub-groups: six measurement sites along the peroneal nerve (located below the knee) and one site at the ulnar nerve (located inside the elbow). The BUSI dataset consists of three sub-groups: benign, malignant, and normal. The sub-group indices and variables are detailed in Table 1; a minimal sketch of this data layout follows.
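For concreteness, the following is a minimal PyTorch-style sketch of how such a sub-grouped dataset could be organized. The class name and constructor arguments are illustrative, not taken from our implementation.

```python
import torch
from torch.utils.data import Dataset

class SubgroupUltrasoundDataset(Dataset):
    """Yields (image x_i, mask y_i, sub-group index y_i^a) triplets."""

    def __init__(self, images, masks, subgroup_ids, num_subgroups):
        # images: list of 3x256x256 float tensors
        # masks: list of 1x256x256 binary tensors
        # subgroup_ids: list of ints in {0, ..., m-1}
        assert len(images) == len(masks) == len(subgroup_ids)
        self.images, self.masks = images, masks
        self.subgroup_ids = subgroup_ids
        self.num_subgroups = num_subgroups  # m

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        y_a = torch.tensor(self.subgroup_ids[i], dtype=torch.long)
        return self.images[i], self.masks[i], y_a
```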

2.1 Fine-Tuning SAM with Sub-group Condition

The SAM architecture consists of three components: an image encoder, a prompt encoder, and a mask decoder. The image encoder uses a vision transformer-based architecture [7] to extract image embeddings. The prompt encoder encodes user interactions, and the mask decoder generates segmentation results from the image embeddings, prompt embeddings, and its output token [14]. We propose to combine sub-group representations with the image embeddings from the image encoder using the proposed Condition Embedding block (CEmb). The proposed method, SAM with Condition Embedding block (CEmb-SAM), uses the pre-trained SAM (ViT-B) model as the image encoder and the prompt encoder. For the peripheral nerve dataset, we fine-tune the mask decoder and CEmb with seven sub-groups; likewise, for the breast cancer dataset we fine-tune them with three sub-groups. The overall framework is illustrated in Fig. 1, and the parameter-freezing scheme is sketched below.
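As a sketch of this fine-tuning regime, the snippet below freezes the SAM image and prompt encoders and leaves the mask decoder and CEmb block trainable. The attribute names follow the public SAM implementation; `cemb_block` is a hypothetical handle to our module.

```python
def set_trainable(sam_model, cemb_block):
    """Freeze SAM's image/prompt encoders; train mask decoder and CEmb."""
    for p in sam_model.image_encoder.parameters():
        p.requires_grad = False
    for p in sam_model.prompt_encoder.parameters():
        p.requires_grad = False
    for p in sam_model.mask_decoder.parameters():
        p.requires_grad = True
    for p in cemb_block.parameters():
        p.requires_grad = True
    # Only the trainable parameters are handed to the optimizer.
    return list(sam_model.mask_decoder.parameters()) + list(cemb_block.parameters())
```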

2.2 Condition Embedding Block

We modify conditional instance normalization (CIN) [8] to combine sub-group representations and image embeddings. We introduce learnable parameters \({W}_{\gamma }, {W}_{\beta } \in \mathbb {R}^{ {C}\times {m}}\), where \(m\) is the number of sub-groups and \(C\) is the number of output feature maps. A sub-group condition \(y^{a}\) is converted to one-hot vectors \(x^{a}_{\gamma }\) and \(x^{a}_{\beta }\), which are fed into the Condition Embedding encoder and transformed into the sub-group representation parameters \(\gamma \) and \(\beta \) by two fully connected (FC) layers. Specifically,

$$\begin{aligned} \gamma = {W}_{2}\cdot \sigma ({W}_{1}\cdot {W}_{\gamma }\cdot x^{a}_{\gamma }),\quad \beta = {W}_{2}\cdot \sigma ({W}_{1}\cdot {W}_{\beta }\cdot x^{a}_{\beta }) \end{aligned}$$
(1)

where \({W}_{1},\,{W}_{2} \in \mathbb {R}^{{C}\times {C}}\) are the weights of the two FC layers, and \(\sigma (\cdot )\) is the ReLU activation function.

Table 1. Summary of the predefined sub-group conditions of the peripheral nerve and BUSI datasets. FH: fibular head; FN: fibular neuropathy. \(\text {FN}+\alpha \) indicates that the measured site is \(\alpha \) cm away from the fibular head. \(m\) is the total number of sub-groups.

The image embedding \(x\) is transformed into the final representation \(z\) using the condition embedding. Given a mini-batch \(\mathcal {B}=\{(x_{i}, y_{i}^{a})\}_{i = 1}^{N_{n}}\) of \(N_{n}\) examples, each image embedding is normalized as:

$$\begin{aligned} \text {CIN}(x_{i}\vert \gamma ,\beta ) = \gamma \frac{x_{i}-\text {E}[x_{i}]}{\sqrt{\text {Var}[x_{i}]+\epsilon }} + \beta \end{aligned}$$
(2)

where \(\text {E}[x_{i}]\) and \(\text {Var}[x_{i}]\) are the instance mean and variance, and \(\gamma \) and \(\beta \) are produced by the Condition Embedding encoder. The proposed CEmb consists of two consecutive CIN layers, each with its own condition parameters and preceded by a convolution:

$$\begin{aligned} {F}^{{\tiny \text {mid}}} = \sigma (\text {CIN}(W_{3\times 3}\cdot x_{i}\vert \gamma _{1}, \beta _{1} )) \end{aligned}$$
(3)
$$\begin{aligned} z = \sigma (\text {CIN}(W_{3\times 3}\cdot {F}^{{\tiny \text {mid}}}\vert \gamma _{2}, \beta _{2})) \end{aligned}$$
(4)

where \({F}^{{\tiny \text {mid}}} \in \mathbb {R}^{c\times h\times w}\) is an intermediate feature map and \(W_{3\times 3}\) denotes a convolution with a \(3\times 3\) kernel. Figure 1(B) illustrates the Condition Embedding block; a minimal sketch of the block is given below.
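The following PyTorch sketch directly follows Eqs. (1)-(4). The one-hot multiplications \(W_{\gamma }x^{a}_{\gamma }\) and \(W_{\beta }x^{a}_{\beta }\) are realized as embedding-table lookups, which are mathematically equivalent; module and variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionEncoder(nn.Module):
    """Maps a sub-group index y^a to (gamma, beta), Eq. (1)."""

    def __init__(self, num_subgroups, channels):
        super().__init__()
        # W_gamma, W_beta in R^{C x m}: one-hot multiplication realized
        # as embedding-table lookups.
        self.w_gamma = nn.Embedding(num_subgroups, channels)
        self.w_beta = nn.Embedding(num_subgroups, channels)
        # Shared W_1, W_2 in R^{C x C} (the two FC layers of Eq. (1)).
        self.fc1 = nn.Linear(channels, channels, bias=False)
        self.fc2 = nn.Linear(channels, channels, bias=False)

    def forward(self, y_a):
        gamma = self.fc2(F.relu(self.fc1(self.w_gamma(y_a))))
        beta = self.fc2(F.relu(self.fc1(self.w_beta(y_a))))
        return gamma, beta

def cin(x, gamma, beta, eps=1e-5):
    """Conditional instance normalization, Eq. (2)."""
    mean = x.mean(dim=(2, 3), keepdim=True)
    var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma[..., None, None] * x_hat + beta[..., None, None]

class CEmb(nn.Module):
    """Two conv + CIN + ReLU stages, Eqs. (3)-(4)."""

    def __init__(self, num_subgroups, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.cond1 = ConditionEncoder(num_subgroups, channels)  # (gamma_1, beta_1)
        self.cond2 = ConditionEncoder(num_subgroups, channels)  # (gamma_2, beta_2)

    def forward(self, x, y_a):
        # x: image embedding (B, C, H, W); y_a: sub-group indices (B,)
        g1, b1 = self.cond1(y_a)
        f_mid = F.relu(cin(self.conv1(x), g1, b1))     # Eq. (3)
        g2, b2 = self.cond2(y_a)
        return F.relu(cin(self.conv2(f_mid), g2, b2))  # Eq. (4)
```

Note that each CIN stage carries its own condition encoder, so \((\gamma _{1}, \beta _{1})\) and \((\gamma _{2}, \beta _{2})\) are learned independently.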

Table 2. Sample distribution of the peripheral nerve and BUSI datasets. FH: fibular head; FN: fibular neuropathy. \(\text {FN}+\alpha \) indicates that the measured site is \(\alpha \) cm away from the fibular head.
Table 3. Performance comparison of U-Net, SAM, MedSAM, and CEmb-SAM on the BUSI and peripheral nerve datasets.

3 Experiments

3.1 Dataset Description

We evaluate our method on two datasets: (i) the public Breast Ultrasound Images (BUSI) benchmark [1]; and (ii) peripheral nerve ultrasound images collected at our institution. Ultrasound images in the public BUSI dataset are all measured from the same site. The dataset is categorized into three sub-groups: benign, malignant, and normal. The shape of a breast lesion varies with its type: benign lesions have a relatively round and convex shape, whereas malignant lesions have a rough and uneven spherical shape. The BUSI dataset consists of 780 images with an average size of \(500 \times 500\) pixels.

The peripheral nerve dataset was created at the Department of Physical Medicine and Rehabilitation, Korea University Guro Hospital. It consists of ultrasound images of two different anatomical structures: the peroneal nerve and the ulnar nerve. The peroneal nerve subset, imaged on the outer side of the calf, contains 410 images with an average size of \(494 \times 441\) pixels. The peroneal nerve images are collected from six different sites where the nerve stems from the adjacent fibular head; FH denotes the fibular head, FN denotes fibular neuropathy, and FN+\(\alpha \) indicates that the measured site is \(\alpha \) cm away from the fibular head. The ulnar nerve runs along the inner side of the arm, passing close to the surface of the skin near the elbow; this subset contains 1234 images with an average size of \(477 \times 435\) pixels. Table 2 describes the sample distribution of the datasets. This study was approved by the Institutional Review Board at Korea University (IRB number: 2020AN0410).

3.2 Experimental Setup

Each dataset was randomly split at a ratio of 80:20 for training and testing, and each training set was further split 80:20 for training and validation. SAM offers three segmentation modes: fully automatic "segment everything", bounding box, and point. For medical image segmentation, however, the segment-everything mode is prone to erroneous region partitions, and the point mode empirically requires multiple rounds of prediction correction. The bounding box mode clearly specifies the region of interest and obtains good segmentation results without such trial and error [16]. We therefore use bounding box prompts as input to the prompt encoder for SAM, MedSAM, and CEmb-SAM. During training, the bounding box coordinates were generated from the ground-truth targets with a random perturbation of 0-10 pixels, as sketched below.
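The box-prompt generation described above might look as follows; the function name and NumPy realization are illustrative, with each box coordinate perturbed independently and clipped to the image bounds.

```python
import numpy as np

def bbox_prompt_from_mask(mask, max_shift=10, rng=None):
    """mask: (H, W) binary array -> perturbed [x_min, y_min, x_max, y_max]."""
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(mask)
    assert xs.size > 0, "mask must contain a foreground region"
    h, w = mask.shape
    # Tight box around the ground-truth target ...
    x_min, x_max = xs.min(), xs.max()
    y_min, y_max = ys.min(), ys.max()
    # ... expanded by an independent random 0-10 pixel shift per side.
    x_min = max(0, int(x_min) - int(rng.integers(0, max_shift + 1)))
    y_min = max(0, int(y_min) - int(rng.integers(0, max_shift + 1)))
    x_max = min(w - 1, int(x_max) + int(rng.integers(0, max_shift + 1)))
    y_max = min(h - 1, int(y_max) + int(rng.integers(0, max_shift + 1)))
    return np.array([x_min, y_min, x_max, y_max], dtype=np.float32)
```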

Fig. 2. Segmentation results on the BUSI (1st and 2nd rows) and peripheral nerve (3rd and 4th rows) datasets.

Input image intensities were normalized using Min-Max normalization [21], and images were resized to \(3\times 256\times 256\). We used the pre-trained SAM (ViT-B) model as the image encoder. An unweighted sum of Dice loss and cross-entropy loss was used as the loss function [11, 15]. The Adam optimizer [13] was used to train our method and the baseline models on NVIDIA RTX 3090 GPUs, with an initial learning rate of 3e-4. A sketch of the preprocessing and objective is given below.
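A minimal sketch of the normalization and the training objective, assuming a binary (foreground/background) formulation with logits from the mask decoder:

```python
import torch
import torch.nn.functional as F

def min_max_normalize(img, eps=1e-8):
    """Min-Max intensity normalization to [0, 1]."""
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + eps)

def seg_loss(logits, target):
    """Unweighted sum of cross-entropy (binary form) and Dice loss.

    logits, target: (B, 1, H, W); target is a {0, 1} float mask.
    """
    ce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + 1e-6) / (denom + 1e-6)
    return ce + dice.mean()
```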

3.3 Results

To evaluate the effectiveness of our method, we compare CEmb-SAM with U-Net [23], SAM [14], and MedSAM [16]. U-Net is trained from scratch on the BUSI and peripheral nerve datasets, respectively. SAM is used in bounding box mode with the pre-trained SAM (ViT-B) weights as the image and prompt encoders; during inference, bounding box coordinates are given to the prompt encoder. MedSAM likewise uses the pre-trained SAM (ViT-B) weights for the image and prompt encoders, and its mask decoder is fine-tuned on the BUSI and peripheral nerve datasets. CEmb-SAM also uses the pre-trained SAM (ViT-B) model as the image and prompt encoders and fine-tunes the mask decoder together with CEmb on the BUSI and peripheral nerve datasets, again with bounding box coordinates as the prompt during inference.

For performance metrics, we used the Dice Similarity Coefficient (DSC) and Pixel Accuracy (PA) [18]; minimal implementations are sketched below. Table 3 reports the quantitative comparison of CEmb-SAM with MedSAM, SAM (ViT-B), and U-Net on the BUSI and peripheral nerve datasets. CEmb-SAM achieves the best DSC and PA in all cases, outperforming the baselines in average DSC by 18.61% on breast, 14.85% on peroneal, and 14.68% on ulnar images, and in average PA by 3.26%, 2.24%, and 1.71%, respectively.
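Minimal implementations of the two metrics, assuming binary prediction and ground-truth masks:

```python
import numpy as np

def dice_coefficient(pred, gt, eps=1e-8):
    """Dice Similarity Coefficient between binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def pixel_accuracy(pred, gt):
    """Fraction of pixels on which prediction and ground truth agree."""
    return (pred.astype(bool) == gt.astype(bool)).mean()
```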

Figure 2 visualizes segmentation results on the peripheral nerve and BUSI datasets. The qualitative results show that CEmb-SAM achieves the best segmentation, with fewer missed and false detections for both breast lesions and peripheral nerves. These results demonstrate that CEmb-SAM is more effective and robust by learning from the domain shifts induced by heterogeneous datasets.

4 Conclusion

In this study, we propose CEmb-SAM, which adapts the Segment Anything Model to each dataset sub-group for joint learning over heterogeneous ultrasound datasets. The proposed conditional instance normalization module guides the model to effectively combine image embeddings with sub-group conditions on both the BUSI and peripheral nerve datasets, and helps the model cope with distribution shifts among sub-groups. Experiments showed that CEmb-SAM achieves the highest DSC and PA on both the public BUSI dataset and the peripheral nerve dataset. As future work, we plan to extend our method toward improved domain adaptation, making the model robust and effective under higher degrees of anatomical heterogeneity among datasets.