1 Introduction

Medical imaging is a critical element of modern medical practice and biotechnology, supporting numerous diagnostic procedures, from wellness and screening to early diagnosis, clinical analysis, treatment selection, image-guided surgery, and subsequent follow-ups for continuous assessment of the patient’s health condition [36]. It has become a crucial resource for physicians to understand and assess disease. Moreover, it is essential for determining the efficacy of treatment, allowing clinicians to better analyze a patient by creating pictorial and functional representations of hidden physiological structures of body parts such as bones, organs, tissue, and blood vessels for clinical examination [4, 52] and to evaluate various cellular and molecular events. Noninvasive medical imaging techniques, such as X-ray, computerized tomography (CT), ultrasound, colonoscopy, dermoscopy, microscopy, electrocardiogram (ECG), and magnetic resonance imaging (MRI), can reveal crucial anatomical and functional information on diseases and anomalies within the body [39].

In recent years, deep learning-based methods have achieved remarkable success in many challenging tasks across diverse research domains [1, 2, 31, 51]. Semantic medical image segmentation (MIS) is one of the significant areas of research in medical image analysis. In semantic segmentation, every pixel in the image is assigned a category, thus partitioning an image into a set of non-overlapping regions; the task can therefore also be regarded as a dense classification problem [29]. MIS refers to the process of delineating specific areas within a 2D or 3D medical image, which allows clinicians to study only the desired parts or regions of interest (ROIs) of multi-modal medical images [29]. It is an essential preliminary step for any computer-aided diagnosis (CAD) system and often plays an integral role in both quantitative and qualitative analysis of medical images [4], such as segmentation of polyps [30, 56], lung regions [42, 48], brain tumors [85], retinal blood vessels [83], cell nuclei [80], cell contours [70], and breast ultrasound images [40].

During the past decade, the vast majority of architectures created for semantic segmentation in computer vision (CV) and medical image analysis have been based on deep neural networks (DNNs), such as the fully convolutional networks (FCNs) introduced by Long et al. [38] or encoder-decoder-based convolutional neural networks (CNNs) such as SegNet [6]. Encoder-decoder-based CNNs achieved promising segmentation performance in CV and medical imaging. Nevertheless, U-Net, proposed by Ronneberger et al. [49], made a significant breakthrough in the MIS task by introducing skip connections between each symmetric layer of the encoder and decoder. The encoder performs multiple convolution and pooling operations to capture representations of images from low to high level, decreasing the spatial dimensions of each layer while increasing the number of channels. Higher-level features, such as objects and various shapes, are captured as the architecture goes deeper. Conversely, the decoder performs multiple up-sampling and concatenation operations, followed by convolution operations, to predict the segmentation mask; it increases the spatial dimensions while decreasing the number of channels.
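To make this encoder-decoder pattern concrete, the following is a minimal U-Net-style sketch in tf.keras. The depth, filter counts, and input size are illustrative assumptions, not the configuration of any specific published model:

```python
# Minimal U-Net-style encoder-decoder sketch (illustrative filter counts/depth).
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return x

def tiny_unet(input_shape=(256, 256, 3), num_classes=1):
    inputs = layers.Input(input_shape)
    # Encoder: spatial dims shrink, channels grow; keep skips for the decoder.
    skips, x = [], inputs
    for f in (32, 64, 128):
        x = conv_block(x, f)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = conv_block(x, 256)  # bottleneck
    # Decoder: upsample, concatenate the symmetric skip, then convolve.
    for f, skip in zip((128, 64, 32), reversed(skips)):
        x = layers.UpSampling2D(2, interpolation="bilinear")(x)
        x = layers.Concatenate()([x, skip])
        x = conv_block(x, f)
    outputs = layers.Conv2D(num_classes, 1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)
```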

Over the past few years, several variants of U-Net followed, such as U-Net++ [85], MultiResU-Net [29], LadderNet [87], Attention U-Net [43], R2U-Net [5], DoubleU-Net [30], CE-Net [22], and KiU-Net [65]. Even though these methods have improved feature representation to a satisfactory level, they are still constrained by a number of significant drawbacks. Feature maps of similar scale generated by convolution kernels with different receptive fields carry distinct semantic representations, and the size of a convolutional kernel’s receptive field can affect network performance [2, 58]. Most datasets contain images where the ROI is of diverse shape and size, for example, polyps in colonoscopy images. When the receptive field is too large, smaller targets can be disregarded; on the other hand, a smaller receptive field can capture redundant information. Hence, processing the image using convolution kernels with different receptive fields is vital for capturing the global contextual representation of features [22]. Because of the substantial loss of spatial information during encoding, it is usually challenging to reconstruct the details of low-level feature maps, such as edges, dots, corners, and lines, using standard de-convolution operations [35]. The resultant feature maps are sparse, which reduces segmentation performance. Moreover, U-Net and its variants also suffer from semantic gaps in feature representations because of the long skip connections between the corresponding encoder and decoder: combining two incompatible feature representations from the encoder and decoder blocks introduces inconsistency into the architecture’s learning process. In order to reduce the semantic gaps and the loss of spatial information during encoding, and to improve the fusion of high- and low-level semantic information throughout the network, multiple U-Net-based architectures can be stacked to achieve state-of-the-art (SOTA) segmentation results [30].

The attention mechanism concentrates solely on the most informative feature representations for a specific task without additional supervision, penalizing less informative regions of the image and avoiding redundant feature maps in the network; attention-based networks have therefore been widely employed in MIS tasks. The channel-based attention mechanism is one of the most investigated attention mechanisms in the literature. It exploits the inter-channel relations of features and focuses on desired object selection by adaptively re-calibrating each channel’s weight [23]. Hu et al. [28] initially presented the idea of channel attention and introduced the SE-Net architecture. SE-Net utilizes a global average pooling mechanism to capture global representations of contextual features; however, simple global average pooling can fail to extract complex high-level intra-channel feature representations [20]. The spatial attention mechanism, by contrast, focuses on relevant spatial regions of informative features. Integration of SE-Net’s attention alone has been found inadequate and sub-optimal in many MIS tasks [71]. Woo et al. [71] therefore proposed the convolutional block attention module (CBAM), a sequential combination of these two attention mechanisms that is effective in many CNN-based tasks. Oktay et al. [43] introduced a low-cost, lightweight attention gate mechanism that focuses on selected ROIs while suppressing feature activations in non-ROIs. Recently, various transfer learning techniques have also been applied to MIS tasks due to their robustness and quick convergence [22, 30, 68]; transfer learning allows pre-trained weights from one task to be utilized in different but related tasks.
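As an illustration of the channel attention idea, the following is a minimal squeeze and excitation block in the spirit of Hu et al. [28]: global average pooling squeezes spatial context into per-channel statistics, and two dense layers produce the re-calibration weights. The reduction ratio of 16 is the common default and is assumed here:

```python
# Minimal squeeze-and-excitation (SE) block sketch; ratio=16 is an assumption.
from tensorflow.keras import layers

def se_block(x, ratio=16):
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)               # squeeze: (B, C)
    s = layers.Dense(channels // ratio, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)  # per-channel weights
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                     # re-calibrate channels
```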

In this paper, we extend and significantly improve the SOTA DoubleU-Net [30] architecture and propose a robust novel architecture that can effectively perform MIS tasks across multi-modal domains by modeling global contextual information and high-level multi-scale semantic feature representations of pixels from varying receptive fields. The EfficientNetB7 [61] architecture is adopted through transfer learning as our backbone encoder module for extracting effective feature information. We incorporate a novel triple attention gate (TAG) mechanism in every skip connection to attend to selective inputs with high relevance to the target region. To reduce the semantic gap issues in the skip connections of U-Net [49], DoubleU-Net [30], and other similar variants, we incorporate attention-guided residual (AG-Residual) convolution operations instead of regular convolutions. We also design a multi-kernel residual convolution (MKRC) module to acquire high-level global contextual features. The MKRC block extracts fine-grained, higher-level contextual information from images with various receptive fields, namely \(1 \times 1\), \(3 \times 3\), \(5 \times 5\), and \(7 \times 7\); here, the receptive field refers to the kernel size of the convolution. The feature maps generated by the MKRC block are then passed through the newly designed squeeze and excitation-based atrous spatial pyramid pooling (SE-ASPP) module [14] to extract high-resolution, relevant feature maps for effective learning of the proposed model. In addition, inspired by the CBAM architecture [71], we integrate a hybrid triple attention module (TAM), which refines features through parallel execution of spatial attention, a modified channel-based attention mechanism, and squeeze and excitation-based attention to capture relevant spatial regions of the higher-level global contextual features and the inter-dependencies among different channels, respectively.

Overall, the main contributions of this work can be summarized as follows:

  • A robust EfficientNetB7 encoder backbone-based segmentation framework, referred to as DoubleU-NetPlus, is proposed to enhance the semantic segmentation performance for biomedical images.

  • A newly proposed multi-kernel residual convolution module, which expands the field of view representation of heterogeneous, semantic global contextual features at different scales.

  • A modified hybrid triple attention module, which performs an aggregation of spatial attention, channel-based attention, and squeeze and excitation-based attention, thus improving the channel inter-dependencies and inter-spatial relationships of the high-level feature maps.

  • A novel lightweight triple attention gate module is integrated at the decoder side of each network to highlight salient features from the skip connections.

  • Embedding of feature re-calibration through squeeze and excitation operations in the attention-based atrous spatial pyramid pooling mechanism.

  • We demonstrate the effectiveness of the proposed DoubleU-NetPlus architecture on six publicly available benchmark datasets of different modalities, and comparative analysis shows that the proposed method outperforms several SOTA medical image segmentation methods.

2 Related works

This section provides a brief summary of the research pertaining to MIS techniques, including context-aware segmentation, attention-guided segmentation, and stacked multi-U-Net techniques.

2.1 Context-aware segmentation

Contextual information from multiple levels of a network plays a significant role in the performance of any CNN-based MIS model. Xie et al. [76] proposed a context hierarchical integrated network (CHI-Net), which introduced a dense dilated convolution module for gathering features from four cascaded branches of hybrid dilated convolutions. The authors also introduced a stacked residual pooling module that uses multiple effective fields of view. Residual dilated convolution was utilized in the encoder part of the network to capture multi-level hierarchical features. Gu et al. [22] used a context encoder network (CE-Net) that utilizes a pre-trained ResNet34 as the encoder module. The authors integrate a context extractor module consisting of a dense atrous convolution block and a residual multi-kernel pooling block. Al-masni and Kim [4] applied a contextual multi-scale multi-level network (CMM-Net) by fusing the global contextual features of different spatial scales in the encoding part of the U-Net. The authors also used a dilated convolution module that expanded the receptive field with different rates depending on the sizes of the feature maps.

Xiao et al. [75] introduced a deep residual contextual and sub-pixel convolution network (RC-SPCNet) for the segmentation of neuronal structure. The encoding section of the U-Net included residual-convolution blocks along with summation-based skip connections, and the decoding section was deployed with sub-pixel convolutional layers. Lifted multi-cut was used for optimizing the output for reconstruction results. Lou et al. [40] introduced an inverted residual pyramid block and a context-aware fusion block in a new U-Net architecture. The authors deployed a multi-level context refinement network (MCRNet) by integrating these two context refinement blocks into a U-Net architecture in a multi-level manner. In another study, Wu et al. [72] proposed a new U-Net architecture comprising three new modules: a scale-aware feature aggregation module, an adaptive feature fusion module, and a multi-level semantic supervision module.

Recently, various transformer-based architectures have also been used effectively in MIS tasks. By effectively modeling global context-based features, architectures like Swin-UNet [10], Ds-TransUNet [34], and UNETR [25] achieved SOTA results on MIS tasks of diverse modalities.

In all of the studies discussed above, the authors tried to extract multi-scale representations to reduce the semantic gaps between encoder and decoder features. However, in many cases these readjustments introduced over-fitting problems [81], resulting in only marginal improvements in evaluation metrics.

2.2 Attention-guided segmentation

Over the years, following their success in many computer vision tasks, various attention mechanisms have been increasingly applied to the field of MIS. Wang et al. [68] proposed an iterative edge attention network (EANet) in which the authors integrated an edge-attention preservation (EAP) module along with a dynamic scale-aware context module. The authors employ the VGG-19 [54] pre-trained architecture as the feature encoder. The EAP module captures edge-related attention information, such as background noise and shape, by preserving the low-level local edge features. Gated convolutional blocks (GCBs) interleaved with residual blocks in the EAP module allow the edge stream to solely analyze boundary-related data.

Zhao et al. [84] proposed an MIS architecture in which spatial attention and squeeze and excitation networks (SE-Net) focus on the initial low-level feature maps and on channel inter-dependencies in the high-level feature maps in the bottleneck of the network, respectively. Wang et al. [67] incorporate the SE attention mechanism in the encoder part of the network to adaptively extract feature maps, and the ASPP module to capture context-based semantic information from the extracted feature maps at multiple scales. SE-Net is also incorporated by Li et al. [33], where the authors use Res2Net [19] as the encoder backbone. The extracted features are grouped by channels, and convolution operations are performed on each group separately; SE-Net is integrated to learn the relationships between groups and re-calibrate the channel weights to focus on the target object.

Gao et al. [18] proposed a multi-scale fused network that employs two attention mechanisms, additive channel attention and additive spatial attention, in the skip connections; these utilize high-level features to prune the responses of low-level features in both channel and spatial dimensions, improving the learning of spatial relationships between adjacent pixels and of inter-dependencies between channels. Yeung et al. [79] proposed an attention-gated U-Net architecture that employs a new attention module named the focus gate, which combines spatial and channel-based attention with a focal parameter to regulate the degree of background suppression. The focus gate utilizes the gating signal to refine incoming signals from the encoding network through long-range skip connections, indicating which image features and regions are included in the decoding network.

Tomar et al. [63] introduced a new attention-based architecture named FANet, which combines the feature maps from the current training epoch with the mask from the prior epoch. The prior epoch mask provides hard attention to the learned feature maps at different convolutional layers. Han et al. [24], in their proposed ConvUNeXt architecture, utilized ConvNeXt [37] as the encoder backbone along with an attention gate mechanism in every skip connection. Tong et al. [64] also utilized the lightweight attention gate mechanism in the decoder part of the network; the feature map generated by the attention gate module is processed by channel and spatial attention modules in parallel, whose outputs are combined to produce the final feature maps.

Though all the aforementioned attention-based methods achieved reasonable performance in MIS tasks, they still face challenges in achieving SOTA segmentation performance on images with diverse shapes, intricate textures, and subjects, especially in the breast ultrasound and retinal modalities.

2.3 Stacking/cascading of multiple U-Nets

Another popular approach explored by researchers to improve feature representation in segmentation tasks is to stack multiple U-Net architectures together in a k-cascaded U-Net format, where k refers to the number of sub-U-Nets [47, 82]; DoubleU-Net [30], for example, stacks two U-Net architectures on top of each other. Ghosh et al. [21] proposed incorporating dilated stacked U-Nets for semantic scene segmentation. In another work, Ding et al. [16] utilize a series of U-Nets stacked together for brain tumor segmentation. In addition, a multi-level nested U-Net structure whose encoders and decoders are themselves composed of U-Net-structured modules has been constructed [47] for salient object detection and segmentation. Furthermore, W-shaped networks have been established in recent years: W-Net [74] concatenates two U-Nets into an autoencoder format, one for encoding and one for decoding, and achieves satisfactory results in unsupervised image segmentation tasks.

All of the above-mentioned architectures connect two or more U-Nets together and can therefore extract a separate group of features using the same set of original features. However, the challenge is that the same features may be extracted repeatedly, which can degrade the network’s efficiency [82].

3 Proposed method

In this section, we describe the architecture of the proposed segmentation network and the details of the constituent modules. Firstly, the architecture of the DoubleU-Net [30] model is briefly described, and then we elaborately describe the proposed architecture and the incorporated modules in it. The proposed architecture is demonstrated in Fig. 1.

3.1 Overview of DoubleU-Net architecture

DoubleU-Net [30] is an encoder–decoder architecture comprising two U-Net-like networks stacked on top of each other. There are two encoders and two decoders in the DoubleU-Net architecture. In the first U-Net, VGG-19 [54], pre-trained on ImageNet [32], is incorporated as the backbone of the first encoder. The decoder of the first U-Net is built by up-sampling the feature maps, concatenating them with the corresponding skip connections, and lastly performing two regular \(3 \times 3\) convolution operations, each followed by batch normalization, ReLU, and a squeeze and excitation operation. In order to utilize more high-level semantic information efficiently, the authors placed a second U-Net below the first one. The encoder of the second U-Net is formed by consecutive convolution and max-pooling operations, and its decoder is similar to that of the first U-Net. The results generated by the DoubleU-Net architecture outperformed several MIS algorithms by a significant margin on four benchmark datasets. Despite its strong performance, DoubleU-Net lacks effectiveness in the skip connections of the network [50], limiting the precise flow of information throughout the network. Moreover, it does not fully exploit high-level feature maps from varying receptive fields, which could improve the results further. A further shortcoming of DoubleU-Net is its outdated VGG-19 encoder backbone, which can be replaced by a more recently proposed, deeper architecture such as EfficientNetB7 [61]. Hence, we select DoubleU-Net as our base architecture for further enhancement.

Fig. 1

Composition of the proposed DoubleU-NetPlus architecture

3.2 Overview of the proposed DoubleU-NetPlus architecture

We enhanced both networks of the DoubleU-Net architecture, deploying the EfficientNetB7 architecture as the backbone of the first encoder for extracting multi-scale information. In all the skip connections, we employ a novel triple attention gate (TAG) module to selectively attend to the most relevant features in the decoder while suppressing irrelevant feature representations. Compared to high-level feature information, low-level feature information tends to contribute less to network performance while consuming considerable computational resources, as pointed out by [55, 73]. As demonstrated in Fig. 1, to capture more effective multi-scale high-level contextual encoder information and pass it to the decoder through the bottleneck/bridge of each encoder-decoder network, we design and embed the multi-kernel residual convolution (MKRC) module, the modified squeeze and excitation-based atrous spatial pyramid pooling (SE-ASPP) module, and the triple attention module (TAM) sequentially. Deeper networks considerably enhance the performance of the model; however, increasing network depth can result in vanishing or exploding gradient problems [26, 67]. To address this issue and reduce the semantic gaps between the feature representations of the encoder and decoder, we utilize shortcut connections between layers in the residual learning paradigm. We perform attention-guided residual (AG-Residual) convolution operations (see Fig. 2) in the encoder of the second network and in the decoders of both networks. The motivation behind deploying two multi-contextual attention-guided residual U-Net architectures is that the output feature maps of network one are not fully exploited [82]. We capture the unexplored high-level multi-contextual information by multiplying the output feature maps of network one with the original input image and processing the product through the second network to capture more semantic information.

3.3 Encoder and decoder

The encoder portion of a U-Net condenses the spatial information at each level: the spatial dimensions halve while the number of channels doubles, leaving highly condensed feature information to be passed on and decoded by the following levels. In our proposed DoubleU-NetPlus architecture, we utilize the pre-trained EfficientNetB7 architecture as the backbone for the encoder of network one using transfer learning, whereas the encoder in network two is built by performing two \(3 \times 3\) residual convolutions followed by spatial and channel attention. For the first encoder, we chose the EfficientNetB7 architecture mainly because of its higher accuracy and increased network depth. Deploying EfficientNetB7 as the encoder of the first network provides effective feature extraction capability that the decoder of the first network can employ to generate precise segmentation maps [53]. EfficientNetB7 implements mobile inverted bottleneck convolutions with injected SE-Net [28] blocks, which can attend to relevant features. It utilizes shortcuts directly between bottlenecks, which connect significantly fewer channels than the expansion layers, and depth-wise separable convolutions, which effectively reduce computing cost compared to traditional layers; it also scales the network’s resolution, depth, and width uniformly, resulting in improved performance. Hence, deploying an EfficientNetB7 encoder gives us a contracting path that is significantly deeper and can perform effective contextual feature extraction of medical images. Each encoder block of the second encoder executes AG-residual convolution operations, as illustrated in Fig. 2. The AG-residual convolution module performs two \(3 \times 3\) convolution operations, each followed by batch normalization and ReLU. Batch normalization decreases the internal covariate shift and regularizes the model [30], while ReLU introduces nonlinearity to the architecture. A shortcut residual connection is added by applying a \(1 \times 1\) convolution to the input features to provide an identity mapping, followed by batch normalization and ReLU operations. Features from the \(3 \times 3\) convolution path and the \(1 \times 1\) shortcut connection are concatenated, followed by another ReLU operation. The generated feature maps are then passed to the TAM module, which performs spatial, channel-based, and squeeze and excitation-based attention on the features to focus on the most relevant feature maps. Finally, we perform a max-pooling operation with a \(2 \times 2\) window and a stride of 2 to reduce the spatial dimension of the feature maps.
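The following is a sketch of the AG-residual convolution block as described above and in Fig. 2: a two-layer \(3 \times 3\) convolution path and a \(1 \times 1\) shortcut, fused by concatenation and a final ReLU. Filter counts are illustrative, and the TAM call and max-pooling that follow in the encoder are omitted here:

```python
# Sketch of the AG-residual block per the text; exact widths are assumptions.
from tensorflow.keras import layers

def ag_residual_block(x, filters):
    # Main path: two 3x3 convolutions, each followed by batch norm and ReLU.
    y = x
    for _ in range(2):
        y = layers.Conv2D(filters, 3, padding="same")(y)
        y = layers.BatchNormalization()(y)
        y = layers.ReLU()(y)
    # Shortcut path: 1x1 convolution for an identity-style mapping.
    s = layers.Conv2D(filters, 1, padding="same")(x)
    s = layers.BatchNormalization()(s)
    s = layers.ReLU()(s)
    # Fuse by concatenation (as described in the text), then a final ReLU.
    return layers.ReLU()(layers.Concatenate()([y, s]))
```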

As shown in Fig. 1, the architecture has two decoders, one in each network. Each input feature is passed to the gating signal module, which captures high-level feature representations from the immediately lower part of the network. Each block in the decoder then applies \(2 \times 2\) up-sampling with bilinear interpolation to each input feature, doubling the dimensions of the input feature maps. The generated feature maps are then passed to the attention gate module, which takes the skip connection and the gating signal as inputs and performs additive soft attention on these two feature maps; as training proceeds, the network learns to attend to the desired ROI while suppressing feature activation in irrelevant areas. We then concatenate the up-sampled feature maps with the output feature maps of the attention gates. The concatenated feature maps are passed to the AG-residual module for attention-based convolution operations. Every skip connection in the proposed model passes through the attention gate. In the first decoder, we only employ attention-gated skip connections from the first encoder of network one; in the second decoder, we use attention-gated skip connections from the encoders of both networks one and two. This procedure maintains spatial resolution and improves the quality of the output feature maps without focusing on irrelevant regions. Similar to the DoubleU-Net architecture [30], the final step applies a convolution layer with a sigmoid activation function to construct the mask for each modified U-Net.

Fig. 2

Composition of the attention-guided (AG)-residual convolution module

3.4 Multi-kernel residual convolution module

One of the challenges in MIS is the large variation in the size and shape of objects in medical images. Hence, to achieve effective results in the MIS task, it is necessary to extract high-level multi-scale contextual features through different receptive fields. In our proposed architecture, we apply an Inception [60]-inspired multi-kernel residual convolution (MKRC) module in the bottlenecks of both networks one and two, which helps reduce saturation and degradation of the learning gradient. The proposed MKRC module is demonstrated in Fig. 3. The MKRC module expands the field-of-view representation of heterogeneous features for more effective and robust learning of the model. The module consists of multiple parallel convolution layers with kernel sizes of (\(1 \times 1\)), (\(3 \times 3\)), (\(5 \times 5\)), and (\(7 \times 7\)), respectively. Increasing the kernel size in the convolution layers enables the network to extract more robust feature representations from multi-scale receptive fields, causing each branch to modulate the learning of features differently. Each convolution layer is followed by a batch normalization layer and a ReLU activation function. After that, all four feature maps are concatenated, which leaves us with information from every relevant receptive field. Next, we feed the concatenated feature maps to a (\(1 \times 1\)) convolution followed by batch normalization and ReLU. We then integrate a residual shortcut connection, also known as an identity mapping [27], passed through a (\(1 \times 1\)) convolution and batch normalization, and concatenate it with the previously generated feature maps. An effective identity mapping through a (\(1 \times 1\)) convolution in such a residual setting can ensure smooth propagation of information in the network with reduced overfitting. A ReLU activation is performed next. The resulting feature maps are then processed through a modified SE-ASPP module that expands the field-of-view representation of features to encompass a broader context.
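A minimal sketch of the MKRC module reconstructed from this description follows; the per-branch filter count is an illustrative assumption:

```python
# Sketch of the MKRC module (Fig. 3): parallel 1/3/5/7 convolutions, fusion,
# and a 1x1 residual shortcut merged by concatenation, per the text.
from tensorflow.keras import layers

def mkrc_block(x, filters):
    branches = []
    for k in (1, 3, 5, 7):  # multi-scale receptive fields (kernel sizes)
        b = layers.Conv2D(filters, k, padding="same")(x)
        b = layers.BatchNormalization()(b)
        b = layers.ReLU()(b)
        branches.append(b)
    y = layers.Concatenate()(branches)
    y = layers.Conv2D(filters, 1, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    # Residual shortcut through a 1x1 convolution, merged by concatenation.
    s = layers.Conv2D(filters, 1, padding="same")(x)
    s = layers.BatchNormalization()(s)
    return layers.ReLU()(layers.Concatenate()([y, s]))
```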

Fig. 3

Composition of the multi-kernel residual convolution (MKRC) module

3.5 Squeeze and excitation-based atrous spatial pyramid pooling module

Atrous spatial pyramid pooling (ASPP), introduced by Chen et al. [14], allows us to effectively enlarge the filters’ field of view to include multi-scale contextual representations of semantic features via parallel atrous convolution layers with different dilation rates. It can efficiently mitigate the reduced spatial resolution resulting from repeated down-sampling in the encoder [67]. We modify the ASPP module and propose a new SE-ASPP module by embedding squeeze and excitation networks (SE-Net) after the enlarged-field-of-view convolution filters. The structure of the SE-ASPP module is demonstrated in Fig. 4. We utilize a deeper set of dilated convolutions in the SE-ASPP module in order to capture more robust and expanded representations of the features from the MKRC module. The dilation rates utilized in the seven parallel convolution layers of the SE-ASPP module are 1, 1, 2, 6, 10, 13, and 16, respectively. We apply the squeeze and excitation network to effectively re-calibrate and refine the features acquired through the different dilation rates. The feature maps from the SE-Net modules of each branch of the SE-ASPP network are concatenated, and a (\(1 \times 1\)) convolution operation is performed on the concatenated feature maps, followed by batch normalization and a ReLU activation function. The SE-ASPP module thus captures efficient and relevant semantic information at multiple scales. The generated feature maps are then passed to the hybrid triple attention module (TAM) for further processing.
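The following sketch assembles the SE-ASPP module under stated assumptions: a \(3 \times 3\) kernel for the dilated branches (the text specifies only the dilation rates), and `se_block` referring to the squeeze and excitation sketch given in Sect. 1:

```python
# Sketch of SE-ASPP (Fig. 4): seven parallel dilated convolutions with the
# rates from the text, each re-calibrated by an SE block, then fused by 1x1 conv.
from tensorflow.keras import layers

def se_aspp(x, filters):
    branches = []
    for rate in (1, 1, 2, 6, 10, 13, 16):  # dilation rates per the text
        b = layers.Conv2D(filters, 3, padding="same", dilation_rate=rate)(x)
        b = se_block(b)  # per-branch channel re-calibration (sketch above)
        branches.append(b)
    y = layers.Concatenate()(branches)
    y = layers.Conv2D(filters, 1, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(y)
```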

Fig. 4

Composition of the squeeze and excitation-based atrous spatial pyramid pooling (SE-ASPP) module

3.6 Hybrid triple attention module

The hybrid triple attention module (TAM) performs effective attention-based feature refinement and extends the concepts introduced by CBAM [71] and Focus U-Net [79]. As shown in Fig. 5, it fuses features through parallel processing of a squeeze and excitation network [28], a modified channel-based attention mechanism, and a spatial attention mechanism. We utilize these attention mechanisms to fully explore the high-level inter-spatial relationships of relevant features and effective inter-channel relationships. By adjusting the weight of each channel, SE-Net offers channel-based attention that can improve the channel inter-dependencies and can be seen as an object selection process that suppresses noise. However, SE-Net performs only global average pooling to compute channel-based attention; CBAM [71] later suggested that such features could be sub-optimal and proposed adding max pooling operations to model improved channel inter-dependencies. As illustrated in Fig. 5, to achieve effective channel-based attention, we extend the ideas of CBAM and employ initial global average pooling and global max pooling operations, followed by concatenation and sigmoid activation, to generate an efficient feature descriptor that helps determine which channels to highlight or suppress. Through the spatial attention mechanism, the architecture focuses on the locations of high-level feature maps of the target regions. In conjunction with channel-based attention, the spatial attention module aggregates features along the channel axis [28, 71, 79]. We utilize the CBAM implementation of spatial attention, establishing two distinct channel contexts using average and max pooling along the channel axis, followed by spatial re-calibration with a kernel of size 7. As with the modified channel-based attention, we experimented with incorporating initial global average pooling and global max pooling operations in the spatial attention module; however, the performance did not improve, and we therefore opted for the original CBAM implementation.
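A minimal sketch of the three TAM branches follows. The dense projection after the concatenated pooling statistics and the element-wise addition used to fuse the three attended outputs are assumptions; the paper describes a parallel fusion without specifying the exact merge operation. `se_block` is the sketch from Sect. 1:

```python
# Sketch of the TAM branches (Fig. 5); fusion by addition is an assumption.
import tensorflow as tf
from tensorflow.keras import layers

def channel_attention(x):
    c = x.shape[-1]
    avg = layers.GlobalAveragePooling2D()(x)   # (B, C)
    mx = layers.GlobalMaxPooling2D()(x)        # (B, C)
    w = layers.Concatenate()([avg, mx])        # concat then sigmoid, per the text
    w = layers.Dense(c, activation="sigmoid")(w)
    w = layers.Reshape((1, 1, c))(w)
    return layers.Multiply()([x, w])

def spatial_attention(x):
    # CBAM-style: channel-wise average and max maps, 7x7 conv, sigmoid.
    avg = tf.reduce_mean(x, axis=-1, keepdims=True)
    mx = tf.reduce_max(x, axis=-1, keepdims=True)
    w = layers.Concatenate()([avg, mx])
    w = layers.Conv2D(1, 7, padding="same", activation="sigmoid")(w)
    return layers.Multiply()([x, w])

def triple_attention_module(x):
    # Run the three branches in parallel and fuse (merge op assumed).
    return layers.Add()([se_block(x), channel_attention(x), spatial_attention(x)])
```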

Fig. 5

Composition of the triple attention module (TAM)

3.7 Triple attention gate module

Having introduced the SE-Net, channel-based attention, and spatial attention modules in the previous subsection, we now describe the structure of the triple attention gate (TAG) module. Due to its lightweight design, the attention gate significantly improves the model’s representation ability without significantly increasing the computing cost or the number of model parameters [43]. Here, similar to the attention gate [43] and focus gate [79] modules, we introduce a novel triple attention-gated module named TAG, which implements channel attention, spatial attention, and squeeze and excitation-based attention in parallel within a single gate to encourage selective learning of efficient, relevant features. The TAG module takes two inputs, as shown in Fig. 6: the gating signal from one level lower, which through training develops a better representation of features such as edges, texture, and dots, and the corresponding skip connection at that level, which carries a better representation of spatial information. First, the gating signal and skip connection are resized to matching dimensions and combined through element-wise addition, followed by a nonlinear activation (ReLU), to create attention coefficients. The attention coefficients are then passed through the channel, spatial, and squeeze and excitation-based attention modules and concatenated to produce an effective refinement of the relevant features. Next, a \(1 \times 1\) convolution followed by a sigmoid operation produces the final attention coefficients, which are up-sampled by \(2 \times 2\) to match the dimensions of the skip connection. Weights on aligned regions grow larger, while weights on unaligned regions become relatively smaller. The spatial contextual information of the ROIs is captured by scaling the original skip connection with the generated attention coefficients; hence, each feature vector is weighted according to its relevance.
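The following sketch assembles the TAG module under stated assumptions: it follows the additive attention-gate recipe of Oktay et al. [43], refines the coefficients with the three branches sketched in the previous subsection (`se_block`, `channel_attention`, `spatial_attention`), and assumes the standard multiplicative re-weighting where the text says the skip connection is scaled:

```python
# Sketch of the TAG module (Fig. 6); inter_channels and the multiplicative
# gating of the skip connection are assumptions.
from tensorflow.keras import layers

def triple_attention_gate(skip, gate, inter_channels):
    # Project both inputs to a common shape; the gate comes from one level
    # down, so it is spatially half the size of the skip connection.
    theta = layers.Conv2D(inter_channels, 1, strides=2, padding="same")(skip)
    phi = layers.Conv2D(inter_channels, 1, padding="same")(gate)
    a = layers.ReLU()(layers.Add()([theta, phi]))  # additive attention
    # Refine the coefficients with the three parallel attention branches.
    a = layers.Concatenate()(
        [se_block(a), channel_attention(a), spatial_attention(a)])
    a = layers.Conv2D(1, 1, activation="sigmoid")(a)  # per-pixel relevance
    a = layers.UpSampling2D(2, interpolation="bilinear")(a)
    return layers.Multiply()([skip, a])  # scale the skip features
```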

Fig. 6

Composition of the triple attention gate (TAG) module

4 Experimental analysis

4.1 Datasets

This section briefly describes all the datasets utilized in this study. For the evaluation of the proposed model, we have utilized six datasets of different modalities, namely BUSI, CVCclinicDB, DRIVE, ISBI 2012, 2018 DSB, and LUNA. A representative image and the corresponding mask from each dataset are shown in Fig. 7.

4.1.1 DRIVE

The Digital Retinal Images for Vessel Extraction (DRIVE) dataset, which facilitates retinal vessel segmentation, was created from a diabetic retinopathy screening program in the Netherlands [57]. A Canon CR5 non-mydriatic 3CCD camera with a 45-degree field of view (FOV) was used for image acquisition, and there are a total of 40 images with 8 bits per color channel and a resolution of \(768 \times 584\) pixels.

4.1.2 Lung segmentation

Lung segmentation data based on the computed tomography (CT) modality are available from the lung nodule analysis (LUNA) competition [41]. This dataset contains 267 2D CT images with full annotations of the lung regions provided by medical experts. The size of each image is \(512 \times 512\) pixels.

4.1.3 Breast ultrasound image

Utilizing LOGIQ E9 ultrasound system-guided scanning, the breast ultrasound image (BUSI) dataset was created from images collected from 600 female patients aged between 25 and 75 years [3]. The dataset contains 780 images with an average size of \(500 \times 500\) pixels in three distinct categories: benign, normal, and malignant. The ground truth for each image was generated using MATLAB.

4.1.4 CVCclinicDB

The CVCclinicDB dataset contains image frames extracted from colonoscopy videos using the Window Median Depth of Valleys (WM-DOVA) methodology, as described in [7]. From a collection of twenty-nine video sequences, 612 still frames were extracted for polyp detection. Each image is of size \(384 \times 288\), and the corresponding ground truth is the segmentation mask of the polyps.

4.1.5 2018 data science bowl

The 2018 data science bowl (DSB) dataset was created as a challenge for the generic segmentation of cell nuclei in a diverse set of stained two-dimensional (2D) microscopic images [9]. The training set contains 670 microscopic images, from both bright-field and fluorescence modalities, of size \(256 \times 256 \times 3\). In addition to the images captured under various lighting conditions, corresponding annotations (segmentation masks) for each image are provided as ground truth.

4.1.6 ISBI 2012

The ISBI 2012 dataset, introduced in [11], comprises transmission electron microscopy (TEM) images of the Drosophila larval brain for analyzing the structure of neural micro-circuitry. The training data consist of 30 serial-section TEM images of the first instar Drosophila larval brain at \(512 \times 512\) pixels, annotated using the TrakEM2 [12] software. The corresponding labels for each image were produced by an expert neuro-anatomist for segmentation purposes (Table 1).

Table 1 Overview of the datasets employed in our experiments
Fig. 7

Input images and their corresponding segmentation masks from each dataset. Sample images and masks from DRIVE, LUNA, BUSI, CVCclinicDB, 2018 DSB, and ISBI 2012 are shown in (a)-(f), respectively

4.1.7 Preprocessing and data augmentation

In our experiments, we used several augmentation techniques to ensure that over-fitting does not occur given the small number of samples in the datasets. To ensure efficient, robust learning of the proposed model, in four datasets, namely CVCclinicDB, 2018 DSB, BUSI, and LUNA, we employed a total of thirteen data augmentation techniques, including two variations of random rotation, grid distortion, horizontal and vertical flips, transpose, a composition of vertical flip and random rotation, random brightness, random contrast, random brightness-contrast, random gamma, hue-saturation contrast, and RGB shifting, to increase image variability during training. For the DRIVE and ISBI 2012 datasets, we employed a total of twenty-two data augmentation techniques, comprising the techniques mentioned above as well as CLAHE, FancyPCA, and Gaussian noise injection. It should be noted that the original DoubleU-Net architecture employed a total of twenty-five augmentations per image-mask pair. After the data augmentation process, the augmented RGB images were resized to \(256 \times 256\) to prepare them for the models. The original images were also resized and incorporated into the training dataset.
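As a hedged illustration, a pipeline covering the transform families named above can be built with the Albumentations library; the paper does not name the library, and the parameters and probabilities below are assumptions:

```python
# Illustrative augmentation pipeline (library choice and parameters assumed).
import albumentations as A

augment = A.Compose([
    A.Rotate(limit=90, p=0.5),
    A.GridDistortion(p=0.3),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.Transpose(p=0.3),
    A.RandomBrightnessContrast(p=0.3),
    A.RandomGamma(p=0.3),
    A.HueSaturationValue(p=0.3),
    A.RGBShift(p=0.3),
    # Extras used for DRIVE and ISBI 2012 per the text:
    A.CLAHE(p=0.3),
    A.FancyPCA(p=0.3),
    A.GaussNoise(p=0.3),
    A.Resize(256, 256),
])

# Albumentations applies matching spatial transforms to image and mask:
# out = augment(image=image, mask=mask)
# aug_img, aug_mask = out["image"], out["mask"]
```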

4.2 Training setup and experimental metrics

In order to train the models, the augmented dataset was divided using an 80:10:10 ratio, i.e., 80% of the images composed the training set, 10% the test set, and the remaining 10% the validation set. We initialize the EfficientNetB7 architecture with pre-trained weights, and the batch size was set to 4. The learning rate starts at 0.0001 and is reduced by a factor of 0.1 with a patience of 10. We fed 2D images of size \(256 \times 256\) as input to the proposed network. Our system was implemented on a Tesla P100-PCIE GPU with 16 GB of RAM using a TensorFlow backend. The total number of trainable parameters of the proposed model is 22.4 million. We incorporated a hybrid loss function by adding the binary cross-entropy loss (\({\rm Loss}_{\rm BCE}\)) and the Dice loss (\({\rm Loss}_{\rm Dice}\)) [59], which offers smooth gradient flow and handles class imbalance problems [8]. The hybrid loss function is defined as:

$$\mathrm{Loss}_{\mathrm{Hybrid}} = \mathrm{Loss}_{\mathrm{BCE}} + \mathrm{Loss}_{\mathrm{Dice}}$$
(1)
$$\mathrm{Loss}_{\mathrm{BCE}} = -\sum _{c=1}^{M} y_{o,c}\log (p_{o,c}) = -\big(y\log (p) + (1 - y)\log (1 - p)\big)$$
(2)

The \({\rm Loss}_{\rm BCE}\) specified in Eq. (2) is defined in terms of the number of classes M, the binary indicator y (0 or 1) denoting whether class label c is the correct classification for observation o, and the predicted probability p that observation o belongs to class c; for the binary case (M = 2), it reduces to the second form.

$$\mathrm{Loss}_{\mathrm{Dice}} = 1 - \frac{2\sum _{i=1}^{N}p_{i}g_{i}+\epsilon }{\sum _{i=1}^{N}p_{i}^{2}+\sum _{i=1}^{N}g_{i}^{2}+\epsilon }$$
(3)

The Dice loss between the prediction p and the ground-truth mask g is defined as given in Eq. (3), where N is the number of pixels and \(\epsilon\) is a small constant added to avoid division by zero.
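A direct implementation of Eqs. (1)-(3) for binary masks, sketched with TensorFlow, looks as follows; the value of \(\epsilon\) is an assumption:

```python
# Hybrid loss per Eqs. (1)-(3); epsilon guards the Dice denominator.
import tensorflow as tf

def hybrid_loss(y_true, y_pred, epsilon=1e-6):
    y_true = tf.cast(y_true, y_pred.dtype)
    bce = tf.reduce_mean(
        tf.keras.losses.binary_crossentropy(y_true, y_pred))     # Eq. (2)
    p = tf.reshape(y_pred, [-1])
    g = tf.reshape(y_true, [-1])
    dice = (2.0 * tf.reduce_sum(p * g) + epsilon) / (
        tf.reduce_sum(p * p) + tf.reduce_sum(g * g) + epsilon)
    return bce + (1.0 - dice)                                     # Eqs. (1), (3)
```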

4.2.1 Precision and recall

True positives (TP) are the number of samples correctly classified as part of the mask, and false positives (FP) are the number of samples falsely predicted as part of the mask region. On the other hand, true negatives (TN) are the number of samples correctly classified as lying outside the mask region, and false negatives (FN) are the pixels falsely classified as lying outside the mask region. We can then calculate precision and recall from the confusion matrix as follows:

$${\text{Precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}$$
(4)
$${\text{Recall}} = {\text{ }}\frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
(5)

4.2.2 Dice similarity coefficient

The Dice coefficient, introduced by [15], is widely used for image segmentation and has been applied to both 2D and 3D segmentation tasks. The Dice coefficient required for image segmentation can be constructed from a contingency table [88] of the four possible outcomes represented in the probabilities of segmentation results from an image. The Dice score can be generalized using the definitions of true positives (TP), false positives (FP), and false negatives (FN) as:

$${\text{DICE}} = \frac{{2{\text{TP}}}}{{2{\text{TP}} + {\text{FP}} + {\text{FN}}}}$$
(6)

The Dice coefficient measures how much the areas of interest of two images overlap. Dice scores have a range of [0,1]; the higher the Dice score, the better the predicted segmentation.

4.2.3 Intersection-Over-Union (IoU)

Along with the Dice score, the mean intersection-over-union (mIoU) can be used to measure the similarity between the prediction and the ground truth. IoU values have a range of [0,1]; a higher IoU indicates better agreement between prediction and ground truth. IoU can be defined in terms of the common confusion-matrix quantities as follows:

$${\text{IoU}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}} + {\text{FN}}}}$$
(7)
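All four metrics in Eqs. (4)-(7) follow directly from the confusion counts of a binary mask pair; a minimal NumPy sketch is shown below, where the 0.5 threshold on the prediction is an assumption:

```python
# Precision, recall, Dice, and IoU per Eqs. (4)-(7) from binary masks.
import numpy as np

def segmentation_metrics(pred, gt, thresh=0.5):
    p = (pred >= thresh).astype(bool)
    g = gt.astype(bool)
    tp = np.logical_and(p, g).sum()
    fp = np.logical_and(p, ~g).sum()
    fn = np.logical_and(~p, g).sum()
    precision = tp / (tp + fp + 1e-9)          # Eq. (4)
    recall = tp / (tp + fn + 1e-9)             # Eq. (5)
    dice = 2 * tp / (2 * tp + fp + fn + 1e-9)  # Eq. (6)
    iou = tp / (tp + fp + fn + 1e-9)           # Eq. (7)
    return precision, recall, dice, iou
```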

4.3 Evaluation of the segmentation results

This section provides quantitative and qualitative analyses of the proposed DoubleU-NetPlus method against other SOTA methods.

Fig. 8

Visual comparative analysis among different segmentation methods. First row (left to right): input image, ground truth, U-Net output, U-Net++ output. Second row (left to right): MultiResU-Net output, Attention U-Net output, and results of the DoubleU-NetPlus network (Masks 1 and 2) on the LUNA dataset. A similar pattern is followed in rows three to six for the CVCclinicDB and ISBI 2012 datasets. Blue, red, yellow, and green boxes denote the exemplary ROI and unsatisfactory, moderate, and good results, respectively

Fig. 9

Visual comparative analysis among different segmentation methods. First row (left to right): input image, ground truth, U-Net output, U-Net++ output. Second row (left to right): MultiResU-Net output, Attention U-Net output, and results of the DoubleU-NetPlus network (Masks 1 and 2) on the DRIVE dataset. A similar pattern is followed in rows three to six for the BUSI and 2018 DSB datasets. Blue, red, yellow, and green boxes denote the exemplary ROI and unsatisfactory, moderate, and good results, respectively

Table 2 Comparisons of the segmentation results of the proposed and conventional methods on all the employed datasets

4.3.1 Quantitative result analysis

Here, we report the quantitative results on six medical image datasets of various modalities and compare them with other SOTA approaches to verify that the proposed model surpasses, or performs on par with, other SOTA methods (under the same train-test split ratio and similar types of data augmentation). It is important to note that, in order to provide a fair comparison, the evaluation metrics are provided only for approaches that prioritize segmentation performance over computational efficiency. The performance of the model on all the utilized datasets is shown in Table 2.

Results on DRIVE A comparison with well-established segmentation architectures using different backbones demonstrates that our proposed method outperforms the SOTA architectures. With a Dice score of 85.17%, mIoU of 73.92%, precision of 98.05%, and recall of 96.48% (see Table 2), the DoubleU-NetPlus architecture surpasses the SOTA architectures on the DRIVE dataset. While outperforming U-Net and most of its variants, DoubleU-NetPlus also exceeds the recently proposed ConvNeXt [37] encoder backbone-based ConvUNeXt [24] architecture by 2.87% in Dice score, though ConvUNeXt reports the highest mIoU value of 82.60%. Compared to FANet [63], the model achieves an increase of 4.65% in the mIoU metric with fewer augmented images during training.

Results on LUNA On the LUNA dataset, the DoubleU-NetPlus network achieves SOTA segmentation results of 99.34% in Dice, 98.93% in mIoU, 99.57% in precision, and 98.82% in recall (see Table 2). These results outperform U-Net [49], U-Net++ [85], the VGG-19 encoder-based EANet [68], and the ResNet-50-based Sharp U-Net [89] in the Dice metric by margins of 4.23%, 5.27%, 0.69%, and 2.09%, respectively. DoubleU-NetPlus also has the best balance on both the precision-recall and Dice-mIoU pairs.

Results on BUSI On the BUSI dataset, DoubleU-NetPlus achieves significantly improved results compared to all the SOTA architectures, with a precision of 96.90% and a recall of 92.47%. The model achieves a significantly improved Dice value of 94.30%, which is 13.01% and 15.54% better than the U-Net++ [85] and MultiResU-Net [29] architectures, respectively (see Table 2), although the highest mIoU is achieved by RCA-IUnet [46] with 89.95%, compared to DoubleU-NetPlus’s 84.71%.

Results on CVCclinicDB Table 2 demonstrates that on the CVCclinicDB dataset, DoubleU-NetPlus produces a Dice score of 96.40%, mIoU of 95.12%, precision of 97.96%, and recall of 93.87%, an improvement of 4.01% in Dice over the SOTA DoubleU-Net architecture. Our model achieves the best trade-off between the Dice and mIoU metrics compared with the SOTA architectures, attaining the highest mIoU value of 95.12% and surpassing the dual Swin Transformer-based Ds-TransUNet [34] model by 6.02% in the mIoU metric.

Results on 2018 DSB DoubleU-NetPlus obtains a significantly improved precision of 98.82%, Dice of 95.76%, and mIoU of 90.29%, which are much improved results compared to U-Net [49], U-Net++ [85], and DoubleU-Net [30] (see Table 2). It also achieves the best trade-off between Dice and mIoU compared to other SOTA architectures. Though Sharp U-Net [89] reports a high Dice value of 95.40%, DoubleU-NetPlus generates better results in terms of mIoU. Poudel and Lee [45] report the highest mIoU of 90.97%; however, DoubleU-NetPlus outperforms their architecture by 5.69% in the Dice metric.

Results on ISBI 2012 On the ISBI 2012 dataset, DoubleU-NetPlus achieves 99.75% in precision, 88.62% in recall, 97.10% in Dice, and 94.38% in mIoU, which are significantly improved results compared to the U-Net [49], U-Net++ [85], and MultiResU-Net [29] architectures. In the mIoU metric especially, the proposed model obtains increases of 5.00%, 5.78%, 0.57%, and 1.48% compared to the U-Net, U-Net++, Attention U-Net [43], and MultiResU-Net architectures, respectively (see Table 2). The highest Dice value of 98.12% is reported by LCP-Net [44].

The results of the DoubleU-NetPlus model show that the proposed model greatly improves the performance of MIS tasks in diverse modalities of colonoscopy, fluorescence, electron microscopy, CT, retinal, and ultrasound.

4.3.2 Qualitative result analysis

The results obtained from the experiments on six datasets of diverse modalities were evaluated critically on visual qualitative criteria to ensure proper segmentation performance. Specifically, we illustrate the predictions of U-Net, U-Net++, Attention U-Net, MultiResU-Net, and our proposed DoubleU-NetPlus architecture, which were also used in the quantitative comparisons. The visual comparisons of the mentioned architectures with the proposed DoubleU-NetPlus, as demonstrated in Figs. 8a, b, c and 9a, b, c, show that the segmentation maps of the DoubleU-NetPlus network achieve better semantic segmentation performance on every dataset. On visual inspection, it is clear that there are several instances where the proposed network outperforms SOTA architectures such as U-Net, U-Net++, Attention U-Net, and MultiResU-Net (Table 3).

Table 3 Ablation experiments that analyze the contributions of the different modules on the utilized datasets

4.3.3 Statistical significance test

To statistically investigate the performance of the proposed DoubleU-NetPlus over other SOTA segmentation methods on different quantitative metrics, we conduct paired sample t tests between the Dice and mIoU values obtained by DoubleU-NetPlus and those obtained by other methods. The paired sample t test is often used for comparing two methods on the same evaluation metric in the MIS domain [56, 65, 69, 77]. We perform the test on the Dice and mIoU metrics mainly because these two are the most significant evaluation metrics in semantic image segmentation. We do not include the precision and recall metrics in the test because not every compared method reports them. Comparisons were made with methods that utilized all six datasets in their studies or whose results are reported in the literature. A p-value less than 0.05 is considered statistically significant, and the pair-wise p-values are reported in Table 4. From Table 4, it is clear that in all seven paired comparisons, the p-values are smaller than 0.05 for both the Dice and mIoU metrics, demonstrating that our proposed method achieves significantly improved results compared to the seven other SOTA models.

Table 4 P-values between proposed DoubleU-NetPlus and other SOTA methods on different evaluation metrics
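As a procedural illustration, a paired sample t test of this kind can be run with SciPy. The DoubleU-NetPlus Dice values below are the per-dataset scores reported in Sect. 4.3.1; the compared method's values are placeholders, not figures from any cited work:

```python
# Paired t test on per-dataset Dice scores (compared-method values are placeholders).
from scipy.stats import ttest_rel

dice_ours = [0.8517, 0.9934, 0.9430, 0.9640, 0.9576, 0.9710]  # DoubleU-NetPlus
dice_other = [0.82, 0.95, 0.81, 0.92, 0.91, 0.95]             # placeholder baseline

t_stat, p_value = ttest_rel(dice_ours, dice_other)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 => significant
```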

4.3.4 Ablation studies

We performed an extensive ablation study on each of the employed datasets to empirically verify our incorporated modules in the proposed DoubleU-NetPlus network. A baseline U-Net was used to benchmark performance on the various datasets used in our experiments. We investigate the baseline performance of the U-Net by training it with the same number of augmented images used to train the proposed DoubleU-NetPlus model, and we sequentially assess performance with the removal of the individual MKRC, TAM, and TAG modules. We also investigated removing the (TAM, MKRC) and (TAM, MKRC, TAG) module combinations from the proposed architecture. The results of module removal on the BUSI and DRIVE datasets are demonstrated in Table 3. It can be observed that the EfficientNetB7-based encoder backbone and the TAM, MKRC, and TAG modules contribute significantly to the improvements in Dice score, mIoU, precision, and recall.

5 Conclusion

Semantic segmentation of medical images is a key element of medical image analysis. This paper presents a robust deep learning-based MIS network named DoubleU-NetPlus, equipped with several architectural modifications: the integration of pre-trained EfficientNetB7 as a feature encoder backbone, a newly proposed multi-kernel residual convolution module, a multi-scale feature re-calibrating SE-ASPP module, and a hybrid triple attention module at the bottleneck of each network. We also integrated attention-guided residual convolutions throughout the encoder and decoder parts of the network. To capture salient regions with higher precision, we integrated a novel triple attention gate module that focuses on the relevant regions in the skip-connection features while suppressing irrelevant ones. Together, these modules capture high-level semantic and discriminative feature maps while preserving effective spatial information. Experimental results evaluated on six benchmark datasets of different modalities demonstrate the proposed model’s superiority over SOTA segmentation methods in MIS tasks. We believe that DoubleU-NetPlus is a generic segmentation model that can be applied to similar 2D MIS tasks. One challenge of this architecture is its high number of trainable parameters; we plan to reduce the parameter count and computational complexity in future work. We also plan to adapt the design of the network to the 3D image domain.