1 Introduction

Breast cancer is the most prevalent cancer among women worldwide [1] and a leading cause of cancer-related death. Since the molecular etiology of breast cancer remains unknown, identifying the early signs of the disease is the primary means of reducing the mortality rate. Because ultrasound imaging is non-invasive, non-radioactive, painless, cost-effective and widely available [2], it is the most widely accepted modality for screening and diagnosing breast cancer. However, even for an expert radiologist, the manual analysis of such scans is challenging and time consuming. In this context, deep learning-based computer-aided diagnosis (CAD) systems have been developed for the early detection of breast tumors, enabling faster diagnosis and treatment [3]. In most CAD systems, breast tumor segmentation (BTS) is the key phase for follow-up treatment planning and diagnosis, where the goal is to segregate the target tumor region from the rest of the image. However, most approaches proposed for BTS are presented and validated on private datasets, which limits their reusability and reproducibility.

The general schematic of deep learning-based segmentation models is presented in Fig. 1. In the data pre-processing phase, the aim is to transform the data into a trainable format by applying techniques such as normalization to reduce intensity variation, resizing to fit the model input layer, cropping of irrelevant regions or noise, and data augmentation. The processed data are used to train the deep learning model and generate the desired segmentation mask. Finally, the generated mask is post-processed to refine the segmentation results. Over the last decade, many deep learning-based segmentation models have been proposed [4], among which U-Net based approaches have achieved state-of-the-art performance on a wide variety of 2D and 3D data [5,6,7] while also addressing the challenge of limited availability of medical data.

Fig. 1

Generalized overview of biomedical image segmentation models

1.1 U-Net

The U-Net model, developed by Ronneberger et al. [8], forms the basis of state-of-the-art biomedical image segmentation networks. The model employs distinctive contraction and expansion paths joined by skip connections. The contraction phase extracts high- and low-level features, whereas the expansion phase uses the features learned in the corresponding contraction stages (via the skip connections) to reconstruct the image to the desired dimensions with transposed convolutions or upsampling operations. The network has no fully connected layers and uses only valid convolutions accompanied by rectified linear unit (ReLU) activations and max pooling operations. Following the state-of-the-art potential of the U-Net model, many variants have been proposed for biomedical image segmentation [4]. Given this utility, this article presents a U-Net based model for breast tumor segmentation.
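To make the topology concrete, the following is a minimal one-level sketch of a U-Net-style network in Keras (an assumed framework; the original model uses valid, unpadded convolutions, whereas same-padded convolutions are used here for brevity):

```python
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two same-padded 3x3 convolutions with ReLU (valid padding in the original)
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    return layers.Conv2D(filters, 3, padding='same', activation='relu')(x)

def tiny_unet(input_shape=(256, 256, 1)):
    inp = layers.Input(input_shape)
    c1 = conv_block(inp, 32)                 # contraction stage
    p1 = layers.MaxPooling2D(2)(c1)          # downsample by 2
    c2 = conv_block(p1, 64)                  # bottleneck
    u1 = layers.UpSampling2D(2)(c2)          # expansion stage
    u1 = layers.Concatenate()([u1, c1])      # skip connection
    c3 = conv_block(u1, 32)
    out = layers.Conv2D(1, 1, activation='sigmoid')(c3)  # binary mask
    return Model(inp, out)
```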

1.2 Our contribution

The major contributions of this article concerning breast tumor segmentation are as follows:

  • A novel architecture, residual cross-spatial attention-guided inception U-Net model (RCA-IUnet) is introduced with long and short skip connections to generate binary segmentation mask of tumor using ultrasound imaging.

  • Instead of the direct concatenation of encoder feature maps with upsampled decoded feature maps, a cross-spatial attention filter is introduced in the long skip connections that uses multi-level encoded feature maps to generate attention maps for concatenation with the decoded feature maps.

  • Hybrid pooling operation is introduced that combines spectral and max pooling for efficient downsampling of the feature maps. It is utilized in two modes: (a) same: used inside the inception block, and (b) valid: used to connect inception blocks (reducing the spatial resolution to half of the input feature map).

  • The model is also equipped with short skip connections (residual connections) along with the inception depth-wise separable convolution layers (concatenated feature maps from \(1 \times 1\), \(3 \times 3\), \(5 \times 5\) and hybrid pooling).

1.3 Article organization

The rest of the article is structured as follows: Sect. 2 presents the literature survey, and Sect. 3 describes the proposed approach. Sections 4 and 5 present the experimental setup and results, including the qualitative and quantitative comparative analysis and an ablation study. Finally, the concluding remarks and future scope are presented.

2 Related work

With the advancements in deep learning, the healthcare sector is improving every day [9]. In classical approaches, thresholding [10], region growing [11] and watershed [12] based frameworks were adopted to produce segmentation masks. In this section, various breast ultrasound image segmentation approaches are reviewed that achieved state-of-the-art performance, mostly on private datasets [3].

Shan et al. [13] proposed a fully automatic deep learning based segmentation framework to identify and localize breast lesions in ultrasound imaging. The framework considers textural and spatial features: initially, a region of interest (RoI) likely to contain the lesion is generated using automatic seed point selection and region growing. Following RoI generation, multi-domain features are extracted: phase in max orientation (PMO), radial distance (RD) and a frequently used joint texture-and-intensity probability (JP). An artificial neural network then generates the binary segmentation mask of the lesion region. In 2014, Torbati et al. [14] introduced a neural network-based framework that uses a merging moving average self-organizing map (MMA-SOM) to generate an initial segmentation mask, after which objects belonging to the same joint cluster are merged. A 2D discrete wavelet transform (DWT) is computed to generate the input feature space of the network. The approach was validated on multiple modalities, and for breast ultrasound image segmentation the authors established a strong correlation between the ground truth and predicted masks. In another approach, a stacked denoising auto-encoder (SDAE) was introduced by Cheng et al. [15] to diagnose lesions in breast ultrasound and pulmonary nodules in CT scans. The approach achieved robust results and outperformed traditional computer-aided diagnosis (CAD) approaches owing to automatic feature extraction and high noise tolerance.

With transfer learning [16] being a growing area of research, Huynh et al. [17] proposed a transfer learning-based approach to classify cystic, benign, and malignant lesions in breast ultrasound imaging. In a similar approach, Fujioka et al. [18] utilized the GoogLeNet inception model [19] to classify breast tumors of varying shapes and sizes. To generate segmentation masks, Yap et al. [20] utilized a pre-trained FCN-AlexNet model; the approach outperformed other segmentation models but failed to produce accurate masks for small lesion regions. Huang et al. [21] introduced a segmentation approach based on superpixel classification and clustering of patches to diagnose breast tumors in ultrasound imaging. Though the authors achieved promising segmentation results, the performance was fairly low on large tumors due to the simple linear iterative clustering [22]. In order to generate better segmentation results, several methods have been studied that dynamically adapt to target structures (tumors) of varying shapes and sizes using attention mechanisms [6, 23]. Following this context, Lee et al. [24] introduced a channel attention module with multi-scale grid average pooling to segment breast ultrasound images. Unlike channel attention, which captures depth correlation, spatial attention allows the model to prioritize an area within the receptive field to better extract the target feature maps [25]. Leveraging this potential of the spatial attention filter, we propose a novel residual inception U-Net architecture that uses a cross-spatial attention filter to extract relevant features from multi-scale encoded features and generate binary tumor segmentation masks. Furthermore, the model is equipped with residual inception depth-wise separable convolution and hybrid pooling (max pooling and spectral pooling) layers for better feature extraction and learning.

3 Proposed architecture

The schematic representation of the residual cross-spatial attention-guided inception U-Net model (RCA-IUnet) is presented in Fig. 2. The network follows the U-Net topology, where standard convolution and pooling operations are replaced by inception convolutions with short skip connections and hybrid pooling, along with a cross-spatial attention filter on the long skip connections to focus on the most relevant features. The network has four stages of encoding and decoding layers, where at each stage the spatial dimension (width and height) of the feature map is reduced by 50% and the channel depth is increased by 50%. Besides, in order to minimize the training parameters and the number of multiplications, the depth-wise separable convolution (DSC) operation [26] is adopted, resulting in 2.9M trainable parameters.

Fig. 2

Schematic representation of the RCA-IUnet model

Fig. 3

Convolution operations: a standard convolution, and b depthwise separable convolution

The network generates a binary segmentation mask highlighting the tumor region. In some of the predicted masks, minor holes (false negatives) and small spurious regions (false positives) are observed. Hence, the generated segmentation mask is further refined with morphological post-processing operations: a flood fill algorithm fills the minor holes based on the surrounding or connected pixels (reducing false negative predictions), mask extraction removes the small masked regions (reducing false positive predictions), and binary thresholding filters the masked regions.
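A minimal sketch of this refinement step, assuming SciPy/scikit-image as the toolset and a hypothetical minimum region size:

```python
import numpy as np
from scipy.ndimage import binary_fill_holes
from skimage.morphology import remove_small_objects

def postprocess(prob_mask, t=0.5, min_size=64):
    """Refine a predicted probability mask; min_size is an assumption."""
    mask = prob_mask > t                       # binary thresholding
    mask = binary_fill_holes(mask)             # fill minor holes (fewer false negatives)
    mask = remove_small_objects(mask, min_size=min_size)  # drop tiny regions (fewer false positives)
    return mask.astype(np.uint8)
```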

3.1 Depthwise separable convolution

Unlike the standard convolution (SC) operation, in DSC the convolution is performed in two stages, depthwise and pointwise convolution, as shown in Fig. 3b for an input feature map with width (w), height (h) and depth (d), \({\mathcal {F}}\in {\mathbb {R}}^{w\times h\times d}\). From Fig. 3, the reduction ratio in parameters and multiplications can be expressed using Eq. 3 in terms of the number of parameters (\(P_{\textit{SC}}, P_{\textit{DSC}}\)) or multiplications (\(M_{\textit{SC}}, M_{\textit{DSC}}\)), the number of kernels (r), the kernel size (f), and the spatial dimension (p) of the output feature map.

$$\begin{aligned}&M_{\textit{SC}}=r.p^2.f^2.d \;,\;\; P_{\textit{SC}}=r.f^2.d \end{aligned}$$
(1)
$$\begin{aligned}&M_{\textit{DSC}}=d.p^2. (f^2+r) \;,\;\; P_{\textit{DSC}}=d.(f^2+r) \end{aligned}$$
(2)
$$\begin{aligned}&\frac{M_{\textit{DSC}}}{M_{\textit{SC}}}=\frac{P_{\textit{DSC}}}{P_{\textit{SC}}}=\frac{1}{r}+\frac{1}{f^{2}} \end{aligned}$$
(3)
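Plugging representative values into Eqs. 1-3 illustrates the saving; the kernel count and depth below are assumptions for illustration:

```python
# Hypothetical layer: f=3 kernel, d=256 input channels, r=256 kernels
f, d, r = 3, 256, 256
P_SC = r * f**2 * d             # Eq. 1: 589,824 parameters
P_DSC = d * (f**2 + r)          # Eq. 2: 67,840 parameters
print(P_DSC / P_SC)             # ~0.115, matching 1/r + 1/f**2 (Eq. 3)
```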

3.2 Hybrid pooling

In deep learning, various pooling operations have been introduced [27], with max pooling being the most common choice for downsampling feature maps. Max pooling preserves only the sharpest features by applying a max operation within a given window, whereas spectral pooling [28] not only downsamples the feature maps but also preserves more information than max pooling. In spectral pooling, the discrete Fourier transform (DFT) of the input feature map is computed, the high-frequency components are truncated in the spectral domain, and the inverse DFT converts the result back to the spatial domain. Hence, to better downsample the feature maps, this article introduces hybrid pooling, in which the downsampled feature maps from max pooling and spectral pooling are merged using a \(1 \times 1\) convolution operation.
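A NumPy sketch of the spectral pooling step (single channel, magnitude rescaling omitted); in the hybrid pooling layer its output would be merged with a max-pooled map through a \(1 \times 1\) convolution:

```python
import numpy as np

def spectral_pool(fmap, out_h, out_w):
    # DFT -> shift low frequencies to the center -> crop -> inverse DFT
    F = np.fft.fftshift(np.fft.fft2(fmap))
    h0 = (fmap.shape[0] - out_h) // 2
    w0 = (fmap.shape[1] - out_w) // 2
    F_low = F[h0:h0 + out_h, w0:w0 + out_w]   # truncate high frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(F_low)))

pooled = spectral_pool(np.random.rand(64, 64), 32, 32)  # halves the resolution
```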

3.3 Inception convolution

In order to identify features of tumor regions with varying shape and size, the model needs an adaptive receptive field [29, 30]. The inception convolution is designed by concatenating the feature maps extracted by ReLU-activated parallel depthwise separable convolutions with kernels of sizes \(1 \times 1\), \(3 \times 3\) and \(5 \times 5\), and hybrid pooling, while also using batch normalization to avoid the covariate shift problem. Finally, the concatenated feature maps undergo a \(1 \times 1\) convolution to set up the channel correlation and optimize the spatial dimension. For an input feature map \({\mathcal {F}}_{i}\in {\mathbb {R}}^{w\times h\times d}\), the overview of the inception convolution is illustrated in Fig. 4a. Building on the inception convolution layer, the residual inception convolution block applies two inception convolution layers with a short skip connection that merges the extracted feature maps with the input using a \(1 \times 1\) DSC, as shown in Fig. 4b.
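A hedged Keras sketch of these two building blocks follows; the hybrid pooling branch is approximated here with stride-1 max pooling, and the filter counts are assumptions:

```python
from tensorflow.keras import layers

def inception_dsc(x, filters):
    # Parallel depthwise separable convolutions with different receptive fields
    b1 = layers.SeparableConv2D(filters, 1, padding='same', activation='relu')(x)
    b3 = layers.SeparableConv2D(filters, 3, padding='same', activation='relu')(x)
    b5 = layers.SeparableConv2D(filters, 5, padding='same', activation='relu')(x)
    bp = layers.MaxPooling2D(3, strides=1, padding='same')(x)  # stands in for hybrid pooling
    out = layers.BatchNormalization()(layers.Concatenate()([b1, b3, b5, bp]))
    # 1x1 convolution to restore channel correlation and depth
    return layers.Conv2D(filters, 1, padding='same', activation='relu')(out)

def residual_inception(x, filters):
    y = inception_dsc(inception_dsc(x, filters), filters)      # two inception layers
    skip = layers.SeparableConv2D(filters, 1, padding='same')(x)  # 1x1 DSC on the short skip
    return layers.Add()([y, skip])
```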

Fig. 4

Overview of the a inception convolution layer and b residual inception layer

3.4 Cross-spatial attention block

In order to draw the attention of the model toward tumor structures of varying shape and size, a cross-spatial attention block is introduced in the long skip connections. Unlike the standard attention network [6], the attention filter in this block utilizes the extracted feature maps from multiple encoded layers to develop a better correlation across the spatial dimensions of the feature maps. The schematic representation of the cross-spatial attention approach is illustrated in Fig. 5, where feature maps from three different layers are combined to form the attention feature maps (output feature maps), which are later concatenated with the corresponding decoded layer in the expansion or reconstruction phase.

Fig. 5

Schematic representation of the cross-spatial attention block
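Since Fig. 5 gives the full details, the following is only a loose sketch, modeled on the additive attention gate of [6] and extended to accept multiple encoder levels; the exact formulation in the paper may differ, and static input shapes are assumed:

```python
import tensorflow as tf
from tensorflow.keras import layers

def cross_spatial_attention(skip, coarse_feats, inter=32):
    h, w = skip.shape[1], skip.shape[2]          # target spatial resolution
    mixed = layers.Conv2D(inter, 1)(skip)
    for f in coarse_feats:                       # coarser encoder feature maps
        g = layers.Conv2D(inter, 1)(f)
        g = layers.Lambda(lambda t: tf.image.resize(t, (h, w)))(g)
        mixed = layers.Add()([mixed, g])         # fuse multi-level information
    att = layers.Conv2D(1, 1, activation='sigmoid')(layers.Activation('relu')(mixed))
    return layers.Multiply()([skip, att])        # spatially gated skip features
```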

4 Experiment setup

In this section, details concerning the experimental environment and datasets are presented along with the obtained results and comparative analysis. Due to the non-availability of implementations of the existing breast ultrasound image segmentation models and of a standard testing set, the proposed model is compared with other state-of-the-art segmentation models, namely SegNet [31], U-Net [8], U-Net++ [32], attention U-Net [6], dense U-Net and deep layer aggregation (DLA) [33], using vgg16 [34] and resnet50 [34] as backbone architectures.

Table 1 Tumor segmentation evaluation metrics in terms of the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), the predicted mask (\({\mathcal {P}}\)) and ground truth (\({\mathcal {G}}\)); \({\mathcal {H}}({\mathcal {P}},{\mathcal {G}})\) is the directed average Hausdorff distance (AHD) from \({\mathcal {P}}\) to \({\mathcal {G}}\) with d as the Euclidean distance, N is the total number of pixels and t is the prediction threshold

4.1 Dataset description and setup

The RCA-IUnet model is trained and evaluated using two publicly available datasets: (a) the breast ultrasound image segmentation (BUSIS) benchmark dataset [35] and (b) the breast ultrasound images (BUSI) dataset [36]. The BUSIS dataset comprises 562 breast ultrasound images collected from various hospitals: Harbin medical university, Qingdao university, and Hebei medical university. Each image is provided with a binary ground truth mask (label 1 for tumor pixels and label 0 for background pixels) highlighting the tumor region, generated by majority voting over the annotations of several radiologists. Unlike the BUSIS dataset, the BUSI dataset offers 780 ultrasound images divided into normal (133), benign (487) and malignant (210) classes, along with binary ground truth masks. Figure 6 shows sample ultrasound images along with the ground truth from the BUSIS and BUSI datasets. Due to the variation in image size in both datasets, the images are normalized and resized to \(256 \times 256\) for all the segmentation models. Both datasets are randomly split into a 70% training set and a 30% testing set, kept fixed throughout the experimentation. All the segmentation models are trained on the training set, which is further split into a 70% train set and a 30% validation set. The trained models are then evaluated on the testing set.
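A sketch of this split protocol, assuming the images and masks have already been loaded into NumPy arrays (array names and the random seed are placeholders):

```python
from sklearn.model_selection import train_test_split

# 70/30 train-test split, fixed across all experiments
X_train_full, X_test, y_train_full, y_test = train_test_split(
    images, masks, test_size=0.3, random_state=42)
# the training set is further split 70/30 into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.3, random_state=42)
```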

Fig. 6

Breast ultrasound images with ground truth from a BUSIS and b BUSI datasets

4.2 Training and testing

The models are trained and tested on the BUSIS and BUSI datasets. Training follows mini-batch stochastic gradient descent with Adam as the optimizer [37] on an NVIDIA GeForce RTX 2070 Max-Q GPU. During training, the learning rate, initialized at \(1e-3\), is reduced by a factor of 2 once learning stagnates to achieve better results. Moreover, an early stopping technique is adopted that halts the training process as soon as the validation error stops improving, to avoid overfitting. The RCA-IUnet is trained with a segmentation loss function (\({\mathcal {L}}\)) defined as the average of the binary cross entropy loss (\({\mathcal {L}}_{{\textit{BC}}}\)) and the dice coefficient loss (\({\mathcal {L}}_{{\textit{DC}}}\)), as shown in Eq. 4.

$$\begin{aligned}&{\mathcal {L}}=\frac{1}{2} {\mathcal {L}}_{{\textit{BC}}}+\frac{1}{2}{\mathcal {L}}_{{\textit{DC}}} \end{aligned}$$
(4)
$$\begin{aligned} {\mathcal {L}}_{{\textit{BC}}}\left( y,p\left( y\right) \right) =-\sum ^N_i\left( y_i\log \left( p\left( y_i\right) \right) +\left( 1-y_i\right) \log \left( 1-p\left( y_i\right) \right) \right) \end{aligned}$$
(5)
$$\begin{aligned}&{\mathcal {L}}_{DC}\left( y,p\left( y\right) \right) =1-\frac{2\sum ^N_i{y_i.p(y_i)}}{\sum ^N_i{{\left| y_i\right| }^{2}}\mathrm {+}\sum ^N_i{{|{p(y}_i)|}^{2}}} \end{aligned}$$
(6)

where y is the ground truth label, p(y) is the predicted label, and N is the total number of pixels. During the backpropagation, the gradient of the loss function with respect to the predicted value can be computed using Eq. 7.

$$\begin{aligned} \frac{\partial {\mathcal {L}}}{\partial p(y)}=\frac{1}{2}\left[ \frac{\partial {\mathcal {L}}_{BC}\left( y,p\left( y\right) \right) }{\partial p(y)}+\frac{\partial {\mathcal {L}}_{DC}\left( y,p\left( y\right) \right) }{\partial p(y)}\right] \end{aligned}$$
(7)
Fig. 7

Qualitative comparison of BUS tumor segmentation results of the models SegNet, U-Net, U-Net++, attention U-Net, dense U-Net, deep layer aggregation and RCA-IUnet, a without and b with post-processing. The quantities indicate the dice score for each predicted mask

Fig. 8

Summary of average training and validation scores of the RCA-IUnet model over the BUSIS and BUSI datasets: a dice coefficient, b mean absolute error, c average Hausdorff distance, d mean intersection over union, e precision and f recall

where

$$\begin{aligned}&\frac{\partial {\mathcal {L}}_{BC}\left( y,p\left( y\right) \right) }{\partial p(y)}=\frac{p\left( y\right) - y}{p\left( y\right) \left( 1-p\left( y\right) \right) } \end{aligned}$$
(8)
$$\begin{aligned}&\frac{\partial {\mathcal {L}}_{DC}\left( y,p\left( y\right) \right) }{\partial p(y)}=-2\left( \frac{y\left( {\left| y\right| }^{2}-{\left| p\left( y\right) \right| }^{2}\right) }{{\left( {\left| y\right| }^{2}+{\left| p\left( y\right) \right| }^{2}\right) }^{2}}\right) \end{aligned}$$
(9)
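Under these definitions, the loss of Eqs. 4-6 can be sketched in Keras as follows; the optimizer and learning-rate schedule mirror the description above, while `model` denotes the compiled network and the patience values are assumptions:

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, eps=1e-7):
    # Eq. 6: 1 - 2*sum(y*p) / (sum(y^2) + sum(p^2))
    num = 2.0 * tf.reduce_sum(y_true * y_pred)
    den = tf.reduce_sum(tf.square(y_true)) + tf.reduce_sum(tf.square(y_pred))
    return 1.0 - num / (den + eps)

def segmentation_loss(y_true, y_pred):
    # Eq. 4: average of binary cross entropy and dice coefficient loss
    bce = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, y_pred))
    return 0.5 * bce + 0.5 * dice_loss(y_true, y_pred)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss=segmentation_loss)
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5),        # halve LR on stagnation
    tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),
]
```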

The trained models are used to predict tumor segmentation masks for the test set. The performance of the models is compared using the evaluation metrics shown in Table 1. In addition, inference time (IT) [38] is considered to measure the speed of each model: the average time taken by the model to generate the mask for a sample in the test set, where a lower inference time indicates faster mask generation.
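Inference time can be estimated per sample as below (`X_test` and `model` are carried over from the earlier sketches):

```python
import time
import numpy as np

times = []
for x in X_test:
    start = time.perf_counter()
    _ = model.predict(x[np.newaxis, ...], verbose=0)   # generate one mask
    times.append(time.perf_counter() - start)
print(f"Inference time: {np.mean(times):.4f} s/sample")
```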

Table 2 Comparative analysis of the RCA-IUnet with other segmentation approaches on the BUS datasets

5 Results and discussion

The models produce a binary tumor segmentation mask for a given BUS image. The qualitative results of all the models, with and without post-processing, are shown in Fig. 7. The generated segmentation masks along with the dice scores confirm the better performance of the RCA-IUnet model over the other segmentation models. Figure 8 presents the mean segmentation performance of the RCA-IUnet model on the training and validation sets from both datasets, monitored during the training phase. From Fig. 8, it can be observed that the training and validation scores are promising and close to each other, indicating that the RCA-IUnet model neither overfits nor underfits the training data and hence generates better segmentation masks.

Table 3 Ablation study of RCA-IUnet model
Table 4 Cross-data validation of RCA-IUnet model with fine tuning

It is also observed that, among the tested models, post-processing has minimal impact on the performance of the RCA-IUnet model, indicating that it produces segmentation masks with very few false positive and false negative predictions of the tumor regions. However, there is a noticeable improvement in the performance of the other models with post-processing, indicating that these models generate more false predictions and hence rely on further refinement to improve the results. For instance, in Fig. 7, the segmentation mask generated for the second sample by U-Net has dice scores of 0.731 and 0.934 without and with post-processing, respectively, while the RCA-IUnet model produces the same mask in both cases with a better dice score of 0.984. The overall quantitative results are shown in Table 2, along with the comparative analysis against other state-of-the-art models in terms of the evaluation metrics described in Table 1. The proposed model achieved the best segmentation scores and minimal inference time while having considerably fewer training parameters.

The effectiveness of each proposed component of the RCA-IUnet model is analyzed in Table 3. This ablation study is conducted by adding the proposed components to a base U-Net model. Here, U-Net is a skeleton of the complete RCA-IUnet model consisting of default depth-wise separable convolutions, max pooling operations and skip connections with four stages of encoding and decoding. The study uses the same training, validation and testing sets of both datasets over various combinations of components added to the U-Net model, such as U-Net + CSA, U-Net + RIC + HP, etc. The performance of each model is compared using the segmentation metrics along with the inference time (IT). From Table 3, it can be inferred that RIC and CSA are the core components that drive the outperforming nature of the RCA-IUnet model, as shown for the models U-Net + RIC, U-Net + CSA and U-Net + RIC + CSA. The residual inception convolution enables the network to capture multi-scale feature representations, and cross-spatial attention enables the network to draw attention toward the most relevant features. Compared to max pooling, hybrid pooling plays a vital role with efficient downsampling and further improves the results, as shown for U-Net + RIC + HP vs U-Net + RIC and U-Net + CSA + HP vs U-Net + CSA. From the achieved quantitative results, it is evident that each component contributes to improving the segmentation performance of the RCA-IUnet model. Though this segmentation performance comes at the cost of increased inference time compared to the base U-Net model, the inference time remains lower than that of the existing models, as shown in Table 2.

To further establish the robustness of the proposed model, a cross-data validation is performed, as shown in Table 4. Testing covers two scenarios with fine-tuning: (1) the model pre-trained on the BUSIS dataset is tested on the BUSI dataset, and (2) the model pre-trained on the BUSI dataset is tested on the BUSIS dataset. The model achieved results similar to those highlighted in Tables 2 and 3, indicating that the proposed model can adapt to a new dataset by fine-tuning alone without compromising performance.

6 Conclusion

This article proposes a deep learning based model, the residual cross-spatial attention-guided inception U-Net (RCA-IUnet), for breast tumor segmentation in ultrasound imaging. The RCA-IUnet model builds on the state-of-the-art U-Net architecture, using residual inception depth-wise separable convolution and hybrid pooling (max pooling and spectral pooling) layers along with a cross-spatial attention filter in the long skip connections to better propagate and extract the feature maps concerning the tumor region. Through exhaustive trials on two publicly available datasets, the proposed model achieved significant improvement over the state-of-the-art models with minimal training parameters and inference time. Moreover, the ablation study describes the significance of each component of the model for tumor segmentation, where the residual inception convolution (RIC) and cross-spatial attention (CSA) components made the major contributions to the achieved results. As an extension, the attention component could be further improved by incorporating a channel attention filter to focus on the most relevant feature layers. Overall, the performance of the model could be further improved by incorporating deeper feature extraction layers and hybrid or ensemble learning, leading toward better feature representation of tumor regions. Besides, the scope of this model is not limited to tumor segmentation in breast ultrasound imaging; it can also provide potentially useful results with other modalities for biomedical image segmentation.