1 Introduction

Deep learning is a machine learning method that teaches computers to perform tasks that humans accomplish intuitively. Using deep learning, a computer model can learn to carry out classification tasks directly from images, text, or sound. Deep models can attain state-of-the-art accuracy, sometimes even outperforming human ability. Models are trained using large collections of labelled data and multi-layered neural network architectures. Cancer has historically been a fatal disease, and even in today's technologically advanced world it can be devastating if it is not caught in its earliest stages. Swiftly identifying malignant cells could save millions of lives. Nucleus segmentation is a method for identifying the nuclei in an image by partitioning the image into distinct regions. Deep learning is quickly gaining traction in the field of nucleus segmentation, attracting many researchers and a growing number of published articles demonstrating its usefulness.

Image segmentation is principally the process of partitioning a digital image into multiple segments or objects (Szeliski 2010). It is widely employed in applications ranging from image compression (Rabbani 2002) to medical image analysis (Ker et al. 2017) and robotic perception (Porzi et al. 2016). Image segmentation is categorized into semantic segmentation (Ahmed et al. 2020) and instance segmentation (Birodkar et al. 2021). Semantic segmentation groups together the parts of an image that belong to the same class, whereas instance segmentation, which combines object detection and semantic segmentation, locates individual objects within well-defined categories. Medical image segmentation, like natural image segmentation, refers to the procedure of extracting the object of interest (an organ or structure) from a medical image; it can be performed manually, semi-automatically, or automatically, with the aim of delineating anatomical or pathological structures independently of the underlying imaging. Typical medical image segmentation tasks take breast and breast histopathology images (Liu et al. 2018a), liver and liver-tumour images (Li 2015; Vivanti et al. 2015), cell images (Song et al. 2017), etc. as input. Medical image segmentation is a key part of Computer-Aided Diagnosis (CAD) and smart medicine, where features are extracted from segmented images. Owing to the rapid growth of deep learning techniques (Krizhevsky et al. 2017), medical image segmentation is no longer limited to hand-crafted features: Convolutional Neural Networks (CNNs) can efficiently learn hierarchical image features, yielding the most accurate image segmentation models on popular benchmarks. This success has inspired researchers to develop deep learning segmentation models for histopathology images. This article focuses on recent trends in deep learning for nucleus segmentation from histopathology images throughout 2017–2021, discussing U-Net (Ronneberger et al. 2015), SCPP-Net (Chanchal et al. 2021b), Sharp U-Net (Zunair and Hamza 2021), and LiverNet (Aatresh et al. 2021a), among others.

In recent years, innovative deep learning algorithms have shown state-of-the-art performance in medical image segmentation, processing, detection, and classification. The four segmentation models compared in this work were chosen on the basis of the literature review: they are the models that have demonstrated excellent nucleus segmentation performance in recent years. The references in this introduction were chosen because they accurately represent the current state of the field.

The remaining sections of the paper are organized as follows: Sect. 2 discusses the importance of nucleus segmentation for tasks such as cell counting, movement tracking, and morphological study, and stresses certain challenges involved. A review and discussion of recent trends in deep learning for nucleus segmentation since 2017 is offered in Sect. 3. Sect. 4 presents an analysis of the surveyed work in terms of year-wise publications, backbones, and loss functions, with graphical representations of the most frequently used backbone, loss function, optimizer, dataset, etc. over the last five years. The architecture and a brief description of four segmentation models (U-Net, SCPP-Net, Sharp U-Net, and LiverNet), along with their loss function and segmentation quality parameters, are conveyed in Sect. 5. Experimental datasets, training and implementation, and a comparison of the segmentation models, along with the experimental outcomes and graphical representations of segmentation results and training loss, are presented in Sect. 6. Lastly, Sect. 7 discusses the conclusion and future research directions.

2 Nucleus segmentation: need and challenges

This section briefly presents the need for and challenges of nucleus segmentation from histopathology images.

2.1 Need for nucleus segmentation

Segmenting cell nuclei in histopathology images is the preliminary step in analyzing current imaging data for biological and biomedical purposes. The fundamental tasks that build on nucleus segmentation, namely cell counting (Grishagin 2015), movement tracking (Dewan et al. 2011), computational pathology (Louis et al. 2015), cytometric analysis (Yu et al. 1989), computer-aided diagnosis (Kowal and Filipczuk 2014), and morphological study (Abdolhoseini et al. 2019), play a vital role in analysing, diagnosing, and grading cancerous cells. These fundamental tasks are described below:

a. Cell Counting: A subclass of cytometry, cell counting is one of the methods used for counting or quantifying similar cells and is widely employed in numerous research and clinical practices. High-quality microscopy images can be combined with statistical classification algorithms for off-line cell counting and recognition as part of image analysis (Han et al. 2012), keeping the error rate constant (Han et al. 2008).

b. Movement Tracking: Automated tracking and analysis (Meijering et al. 2009) is an important part of biomedical research, both for studying biological processes and for diagnosing diseases.

c. Computational Pathology: This deals with analysing digitized pathology images together with allied metadata, wherein nucleus segmentation in digital microscopic tissue images aids the extraction of high-quality features for nuclear morphometrics (Kumar et al. 2017).

d. Cytometric Analysis: Nucleus segmentation is a significant step in the pipeline of many cytometric analyses. It has been used, for example, to analyse nuclear DNA in paraffin-embedded tissue specimens to observe the association between the DNA ploidy pattern and the 5-year survival rate of advanced gastric cancer patients (Kimura and Yonemura 1991).

e. Computer-Aided Diagnosis (CAD): Computer-aided diagnosis, also called CAD, is a useful tool for precise diagnosis and prognosis (Su et al. 2015) that helps doctors interpret medical images.

f. Morphological Study: Nuclear morphology is governed by a complex biological mechanism that regulates cell proliferation, differentiation, development, and disease (Jevtic et al. 2014). Studying cell morphology, for example, requires nucleus segmentation as a fundamental step because it provides valuable information about nuclear shape, chromatin, DNA content, etc.

2.2 Challenges of nucleus segmentation

Depending on a variety of factors, such as the type of nucleus, the malignancy of the tumour, and the stage of the cell life cycle, nuclei appear in different shapes and sizes. Several types of nuclei exist; the types of interest are lymphocyte nuclei (LN), inflammatory nuclei with a regular shape that play a major role in the immune system, and epithelial nuclei (EN) (Irshad et al. 2013), which have a nearly uniform chromatin distribution and a smooth boundary. Although automated nuclei segmentation is a well-researched problem in the field of digital pathology, segmenting the nucleus remains difficult due to the presence of a variety of blood cells. Furthermore, owing to variability induced by slide preparation (dye concentration, damage to the given tissue sample, etc.) and image acquisition (presence of digital noise, specific characteristics of the slide scanner, etc.), existing methods cannot be applied to all types of histopathology images (Hayakawa et al. 2021). Additionally, some of the significant challenges that arise while segmenting nuclei are presented below:

1. There is a high level of heterogeneity in appearance between different types of organs or cells, so methods designed around prior knowledge of geometric features cannot be applied directly to different images.

2. Nuclei are often clustered, with many overlapping instances. Separating clustered nuclei frequently necessitates additional processing.

3. In out-of-focus images, nuclear boundaries appear blurry, which increases the difficulty of extracting dense representations from the images. The appearance of the nucleus and the noticeable variation in its shape further complicate the segmentation task.

An effective image processing approach must be able to overcome the aforesaid obstacles and challenges while maintaining the quality and accuracy of the underlying images in various situations.

3 Survey on deep learning based nucleus segmentation

In recent years, deep learning models have proven to be effective, robust, and accurate in biomedical image segmentation, specifically nucleus segmentation. This section presents a literature review of work done from 2017 to 2021 on Convolutional Neural Network (CNN) models for nucleus segmentation, as shown in Table 1. The reviewed papers were collected from the following sources:

a. Google Scholar—https://scholar.google.com

b. IEEE Xplore—https://ieeexplore.ieee.org

c. ScienceDirect—https://www.sciencedirect.com

d. SpringerLink—https://www.springerlink.com

e. ACM Digital Library—https://dl.acm.org

f. DBLP—https://dblp.uni-trier.de

Table 1 Literature survey on deep learning models for nucleus segmentation during the years 2017 to 2021

Each of the above sources was queried with the following combinations of keywords:

  • KW1: Deep Learning based histopathology Image Segmentation.

  • KW2: Deep Learning based hematology Image Segmentation.

  • KW3: Deep Learning based pathology Image Segmentation.

  • KW4: Deep learning based white blood cell segmentation.

  • KW5: Nucleus segmentation using deep learning.

  • KW6: Nucleus segmentation using machine learning.

  • KW7: Nucleus segmentation using Convolutional Neural Network.

  • KW8: White blood cell segmentation using Convolutional Neural Network.

  • KW9: Deep Neural Network based image segmentation.

4 Analysis and discussion

This section presents an analysis of the reports on nucleus segmentation using CNN models listed in Table 1. The analysis covers year-wise publications, datasets, CNN models, utilized segmentation metrics, etc.

4.1 Analysis based on publication year

This subsection presents an analysis based on the publication years of the various works on nucleus segmentation taken into consideration. The year-wise number of published papers on nucleus segmentation is shown in Fig. 1, which clearly demonstrates that nucleus segmentation is paving its way in the field of research.

Fig. 1 Year-wise published papers on Nucleus Segmentation

4.2 Analysis based on dataset

This sub-section briefly describes some of the most extensively used nucleus segmentation datasets encountered while performing the literature survey depicted in Table 1, namely the TCGA (Tomczak et al. 2015; The Cancer Genome Atlas (TCGA) 2016), TNBC (Naylor et al. 2018), Herlev (Jantzen et al. 2005), MS COCO (Lin et al. 2015), MoNuSeg (Kumar et al. 2017, 2020), DSB2018 (Caicedo et al. 2019; Data science bowl 2018), KMC Liver (Kasturba Medical College 2021), and PanNuke (Gamper et al. 2019, 2020) datasets.

(i) The cancer genome atlas (TCGA) dataset: The TCGA dataset is a sponsored project that aims to analyse and produce an atlas of cancer genomic profiles (openly available datasets (The Cancer Genome Atlas (TCGA) 2016)), with over 20,000 cases across 33 types of cancer acknowledged to date. For the nuclear segmentation task, Kumar et al. (2017) generated ground truths by picking around 44 WSIs of multiple organs, with images collected from seven different organs: bladder, breast, colon, kidney, liver, prostate, and stomach.

(ii) Triple negative breast cancer (TNBC) dataset: Naylor et al. presented this breast cancer histopathology dataset, which deals with the type of breast cancer in which the cancer cells do not have oestrogen or progesterone receptors and do not produce adequate amounts of the protein HER2, and presented a nuclear segmentation technique (Naylor et al. 2018) for it. TNBC encompasses 50 H&E-stained images with 512 × 512 resolution and 4022 annotated nuclei. All TNBC images were extracted from 11 triple-negative breast cancer patients and comprise several cell types, such as myoepithelial breast cells, endothelial cells, and inflammatory cells.

(iii) Herlev Pap smear dataset: Herlev University Hospital and the Technical University of Denmark announced the Herlev Pap smear dataset (Jantzen et al. 2005), comprising 917 Pap smear images, each of which contains one cervical cell segmented and classified with ground truth. The images are captured at a magnification of 0.201 µm/pixel with an average resolution of 156 × 140 pixels; the longest side is 768 pixels and the shortest is 32 pixels. Seven classes of cell images are available in this dataset: the first three classes, namely superficial squamous, intermediate squamous, and columnar, are normal cells, and the remaining four classes, namely mild dysplasia, moderate dysplasia, severe dysplasia, and carcinoma in situ, are abnormal cells.

(iv) Microsoft common objects in context (MS COCO) dataset: The MS COCO dataset (Lin et al. 2015) investigates the drawbacks of non-iconic views of object representation; looking at objects that are not the main emphasis of an image is generally referred to as a non-iconic view. The dataset was created with the help of Amazon Mechanical Turk for data annotation. MS COCO comprises 2,500,000 labelled instances in 328,000 images across 91 common object categories, 82 of which have over 5,000 labelled instances.

(v) Multi-organ nuclei segmentation (MoNuSeg) dataset: The Indian Institute of Technology Guwahati prepared the MoNuSeg dataset, published in the official satellite event of MICCAI 2018. It contains WSIs of 7 organs (breast, kidney, colon, stomach, prostate, liver, and bladder) from various medical centres (i.e., various stains): high-resolution H&E-stained slides from nine tissue types, digitised at 40× magnification in eighteen different hospitals and obtained from the National Cancer Institute's The Cancer Genome Atlas (TCGA) (Tomczak et al. 2015). The training set comprises colour-normalized (Vahadane et al. 2016) H&E images from all tissue types, excluding breast.

(vi) Data science bowl 2018 (DSB2018) dataset: The DSB2018 dataset (Caicedo et al. 2019) is freely available from the Broad Bioimage Benchmark Collection (Data science bowl 2018) and comprises 670 images of segmented nuclei acquired under diverse conditions, varying in cell type, magnification, and imaging modality (bright-field vs. fluorescence), resized from various resolutions to 256 × 256 (aspect ratio maintained).

(vii) Kasturba Medical College Liver (KMC Liver) dataset: The KMC Liver dataset (Kasturba Medical College 2021) contains 257 original slides (70 of sub-type 0, 80 of sub-type 1, 83 of sub-type 2, and 24 of sub-type 3), each measuring 1920 × 1440 pixels and belonging to one of 4 sub-types of liver HCC tumour taken from various patients. It includes 80 H&E-stained histopathology images collected by pathologists at Kasturba Medical College (KMC), Manipal.

(viii) PanNuke dataset: The PanNuke dataset (Gamper et al. 2019) comprises an H&E-stained image set of 7,904 patches of 256 × 256 pixels from 19 different tissue types, wherein the nuclei are classified into 5 different cell categories: neoplastic, inflammatory, connective/soft tissue, dead, and epithelial cells. Gamper et al. (2020) outline an evaluation process that separates the patches into three folds (later used to create three different dataset splits), wherein one fold is used for training and the remaining two for the validation and testing sets, containing 2657, 2524, and 2723 images, respectively.

Figure 2 shows a graphical representation of the datasets most frequently used by researchers over the last five years, according to Table 1.

Fig. 2 Graphical representation of the most utilised datasets

4.3 Analysis based on optimizer

An optimizer is a procedure for adjusting neural network properties such as the weights and learning rate; it helps to minimize the loss and enhance performance. Figure 3 shows a graphical representation of the optimizers most frequently used by researchers according to Table 1; Adam is the most widely used optimizer.

Fig. 3 Graphical representation of the most utilised optimizers

4.4 Analysis based on loss function

A loss function examines how well a CNN model predicts the intended results. Figure 4 shows the loss functions most frequently encountered while performing the literature survey, as depicted in Table 1; BCE is the most commonly used loss function.

Fig. 4 Graphical representation of the most utilised loss functions

4.5 Analysis based on evaluation metric

Evaluation metrics, or segmentation quality parameters, measure the performance of segmentation models. Figure 5 shows the parameters used most frequently across the works in the literature survey, as depicted in Table 1.

Fig. 5 Graphical representation of the most utilised evaluation metrics

4.6 Analysis based on backbone

The backbone is the feature-extracting network used within a CNN model architecture. Figure 6 covers the backbones used by the models in the Table 1 literature survey; U-Net is the most popular backbone for nucleus segmentation.

Fig. 6 Graphical representation of the most utilised backbones

5 Experimental CNN models

This section provides an overview of some of the most prominent CNN models proposed to date, including U-Net, SCPP-Net, Sharp U-Net, and LiverNet. These are the models we have utilized in our comparative analysis, and they are described as follows:

5.1 U-Net

FCNs and encoder-decoder models have influenced numerous models originally designed for medical and biomedical image segmentation. Ronneberger et al. (2015) proposed the U-Net model, in which the network and training approach rely on data augmentation to learn effectively from a limited number of annotated images. The U-Net design, depicted in Fig. 7, comprises two parts: a contracting path for context capture and a symmetrically expanding path for accurate localization. An FCN-like design extracts features with 3 × 3 convolutions in the down-sampling (contracting) section. In the expanding section, up-convolution (popularly known as deconvolution) is used to up-sample the feature maps, increasing their spatial dimensions while reducing their channel count, so that pattern information is not lost. Feature maps are passed from the network's down-sampling section to the up-sampling section via skip connections. Finally, a 1 × 1 convolution processes the feature maps to create a segmentation map that classifies each pixel of the input image. Several U-Net extensions have been developed for different types of images. In Fig. 7, each blue box represents a multi-channel feature map with the channel count on top, and the white boxes represent copies of feature maps. The sizes X and Y are indicated at the lower-left edge of each box, and the arrows represent the various operations being carried out.

Fig. 7 Architecture of the U-Net model
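To make the contracting/expanding structure concrete, the following is a minimal Keras sketch of a U-Net-style network, reduced to three resolution levels for brevity. This is our own illustration of the design in Fig. 7, not the authors' original code; the filter counts are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    """Two 3x3 convolutions, as in each U-Net stage."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def build_unet(input_shape=(256, 256, 3), base_filters=32):
    inputs = layers.Input(shape=input_shape)

    # Contracting path: capture context, halving resolution at each stage
    c1 = conv_block(inputs, base_filters)
    c2 = conv_block(layers.MaxPooling2D(2)(c1), base_filters * 2)
    c3 = conv_block(layers.MaxPooling2D(2)(c2), base_filters * 4)

    # Expanding path: up-convolutions plus skip connections for localization
    u2 = layers.Conv2DTranspose(base_filters * 2, 2, strides=2, padding="same")(c3)
    c4 = conv_block(layers.Concatenate()([u2, c2]), base_filters * 2)
    u1 = layers.Conv2DTranspose(base_filters, 2, strides=2, padding="same")(c4)
    c5 = conv_block(layers.Concatenate()([u1, c1]), base_filters)

    # 1x1 convolution maps the features to a per-pixel segmentation map
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c5)
    return Model(inputs, outputs)
```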

5.2 Separable convolutional pyramid pooling network (SCPP-Net)

The SCPP-Net model by Chanchal et al. (2021b) was built upon the idea of extracting supplementary information at a higher level, as depicted in Fig. 8. The receptive field of the SCPP layer is expanded by keeping the kernel size constant while regulating four distinct dilation rates. The generated feature maps have an extra parameter, the dilation rate, which can be varied to view larger areas. The separation of clumped and overlapping nuclei is a critical issue in histopathology image nuclei segmentation; by expanding the receptive field at a higher level, this CNN-based design helps to overcome the problem of proximate and overlapping nuclei.

Fig. 8 Architecture of the Separable Convolution Pyramid Pooling Network (SCPP-Net)

The convolution and max-pooling operations conducted on the input image during the down-sampling process give extreme importance to capturing the context of the image; along this path the spatial size of the feature maps shrinks while their depth grows. Progressively adding up-sampling along the decoder route then enables accurate localization. Figure 8 depicts the comprehensive design of SCPP-Net, whereas Fig. 9 depicts the SCPP block concept in detail.

Fig. 9 Separable Convolution Pyramid Pooling (SCPP) block
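As an illustration of this idea, a separable convolution pyramid over four dilation rates could be sketched in Keras as below. The dilation rates and filter count here are assumptions for illustration, not the exact configuration of Chanchal et al. (2021b).

```python
import tensorflow as tf
from tensorflow.keras import layers

def scpp_block(x: tf.Tensor, filters: int = 64) -> tf.Tensor:
    """Separable 3x3 convolutions with a constant kernel size and
    four distinct dilation rates, applied in parallel."""
    branches = []
    for rate in (1, 2, 4, 8):  # assumed dilation rates for illustration
        b = layers.SeparableConv2D(filters, kernel_size=3, padding="same",
                                   dilation_rate=rate, activation="relu")(x)
        branches.append(b)
    # Concatenating the branches aggregates context from growing receptive
    # fields, which helps separate clumped and overlapping nuclei
    return layers.Concatenate()(branches)
```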

5.3 Sharp U-Net

In encoder-decoder networks, predominantly U-Net (Ronneberger et al. 2015), skip connections play a vital role in recovering fine-grained features for prediction. However, skip connections tend to fuse semantically dissimilar low- and high-level convolutional features, thereby generating obscure feature maps. To overcome this flaw, Zunair and Hamza (2021) suggested the Sharp U-Net architecture, shown in Fig. 10, which is applicable to both binary and multi-class segmentation.

Fig. 10 Architecture of Sharp U-Net

The encoder section is divided into five blocks, each of which includes two 3 × 3 convolutional layers with ReLU activations followed by a 2 × 2 max-pooling layer. The convolutional layers use 32, 64, 128, 256, and 512 filters, which are convolved with the input to construct feature maps summarizing the presence of the features extracted from that input. A new connection mechanism, termed a "sharp block" and depicted in Fig. 11, fuses the encoder's low-level features with the decoder's up-sampled high-level features while avoiding semantic-gap issues. Before the simple skip connections between encoder and decoder are applied, the encoder features are subjected to a spatial convolution, performed independently on each channel, using a sharpening spatial kernel.

Fig. 11 Illustration of the Sharp Block

(a) Sharpening Spatial Kernel

Spatial filtering is a low-level, neighbourhood-based image processing method that enhances (sharpens) an image by performing operations on the neighbourhood of each pixel of the input image. Image convolution with kernels is used to perform high-pass filtering, i.e., image sharpening. The Laplacian kernel, a second-order derivative operator, responds to intensity transitions in any direction. A typical Laplacian high-pass filtering kernel for image sharpening is specified as a matrix K with negative values off-centre and a single positive value in the centre, taking into account all eight neighbours of the reference pixel of the input image.

$$\mathbf{K} = \begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix}$$

Convolving an image with the Laplacian kernel adjusts the brightness of the centre pixel relative to its neighbours. The input image is then added to its convolution with the kernel to produce a refined image: given an input image I, the sharpened image S is generated as S = I + K ∗ I, where ∗ denotes convolution, a kernel-weighted neighbourhood-based operator that processes each pixel together with its nearest neighbours.
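As a minimal numerical sketch of the sharpening operation S = I + K ∗ I (our own illustration using SciPy, not code from the Sharp U-Net paper):

```python
import numpy as np
from scipy.ndimage import convolve

# Laplacian high-pass kernel K considering all eight neighbours
K = np.array([[-1, -1, -1],
              [-1,  8, -1],
              [-1, -1, -1]], dtype=np.float32)

def sharpen(image: np.ndarray) -> np.ndarray:
    """Return S = I + K * I, where * is 2D convolution."""
    high_pass = convolve(image.astype(np.float32), K, mode="reflect")
    return image + high_pass

# Toy example: a 4x4 grayscale patch with a bright square
I = np.array([[10, 10, 10, 10],
              [10, 50, 50, 10],
              [10, 50, 50, 10],
              [10, 10, 10, 10]], dtype=np.float32)
S = sharpen(I)  # edges of the bright square are accentuated
```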

(b) Sharp Block

This block performs a depth-wise convolution on the encoder feature maps using the sharpening spatial kernel given by the Laplacian filter kernel K. The encoder output is of size W × H × M, where W, H, and M are the width, height, and number of the encoder's feature maps, respectively.

In this depth-wise convolution, M copies of the kernel act separately on each of the input channels rather than a single filter spanning all channels. Each input channel is convolved with the kernel K individually with a stride of 1, producing a feature map of dimension W × H × 1. To keep the output dimension the same as that of the input and to match the size of the decoder features across the connection, padding is applied during the feature fusion of the encoder and decoder sub-networks. The final output of the depth-wise convolution layer, of size W × H × M, is obtained by stacking these maps together. This feature connection is referred to as a "sharp block"; Fig. 11 displays a visual representation of its operation flow.
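In Keras terms, such a sharp block can be sketched by initialising a frozen depth-wise convolution with the Laplacian kernel. This is our own illustrative reconstruction, not the authors' code:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def laplacian_init(shape, dtype=None):
    """Replicate the 3x3 Laplacian kernel K over every input channel.
    For DepthwiseConv2D, `shape` is (3, 3, M, 1)."""
    K = np.array([[-1, -1, -1],
                  [-1,  8, -1],
                  [-1, -1, -1]], dtype=np.float32)
    return np.repeat(K[:, :, None, None], shape[2], axis=2)

def sharp_block(encoder_features: tf.Tensor) -> tf.Tensor:
    """Depth-wise convolve each of the M encoder channels with K
    (stride 1, 'same' padding), preserving the W x H x M size."""
    conv = layers.DepthwiseConv2D(
        kernel_size=3, strides=1, padding="same", use_bias=False,
        depthwise_initializer=laplacian_init,
        trainable=False)  # the sharpening kernel is fixed, not learned
    return conv(encoder_features)
```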

5.4 LiverNet

The convolution procedure resides at the heart of every CNN; 2D discrete linear convolution is articulated as (1), with f and h as two-dimensional signals. Aatresh et al. (2021a) suggested the LiverNet model for liver hepatocellular carcinoma histopathology images.

$$g(m, n) = \sum_{k=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} f[k, l]\, h[m-k,\ n-l]$$
(1)

Using the above definition, the bias is added to Eq. (1) to obtain the computation formula per node in a given layer. In addition, max-pooling is a critical operation in most CNN systems today (Krizhevsky et al. 2017). To understand this procedure, consider a sliding window over the input feature map of the max-pool layer: sliding the window with a stride S, the operation outputs the greatest pixel value inside the window, repeated across the entire image. By lowering the number of parameters, max-pool layers help minimise the network's computational complexity and provide an abstract representation of the input data.

Aatresh et al. (2021a) employ a base architecture similar to Toğaçar et al. (2020), extracting features from the input image using two convolution layers before the initial max-pool operation. To extract relevant information more effectively, they use CBAM blocks (Woo et al. 2018) and residual blocks deeper in the architecture. After each max-pool operation, the intermediate features in the encoder pipeline are fed into ASPP blocks before up-sampling. To merge the pixel data of layers at varied depths, the hyper-column approach employed in Toğaçar et al. (2020) is applied. The hyper-column technique, along with the ASPP blocks, ensures multi-scale feature extraction and information retrieval for further processing. These ideas were applied to the problem of multi-class cancer classification in liver tissue; a detailed depiction of the proposed model can be found in Fig. 12. The sub-modules of the LiverNet architecture are described in detail in the following subsections.

Fig. 12 Architecture of LiverNet

(a) CBAM Block and Residual Block

The Convolutional Block Attention Module (CBAM), introduced by Woo et al. (2018), is a block that can be efficiently embedded into any CNN architecture without incurring unnecessary computation or memory overhead. Channel-wise and spatial attention modules are applied in succession to produce attention maps that are multiplied with the input feature map. In CBAM, the channel-wise attention block focuses on what the network should attend to, whereas the spatial attention block concentrates on where the network should place emphasis.

The CBAM block's behaviour at an intermediate step, considering an input feature map $A \in \mathbb{R}^{H \times W \times C}$ in the encoder pipeline, can be mathematically expressed as in (2).

$$A_{c} = f_{c}(A) \cdot A \quad \text{and} \quad A_{s} = f_{s}(A_{c}) \cdot A_{c}$$
(2)

where "·" denotes element-wise multiplication, and $f_c: \mathbb{R}^{H \times W \times C} \to \mathbb{R}^{1 \times 1 \times C}$ and $f_s: \mathbb{R}^{H \times W \times C} \to \mathbb{R}^{H \times W \times 1}$ symbolize the functions of the channel-wise and spatial attention blocks, respectively. $A_c$ is the intermediate output following the element-wise multiplication between the channel-wise attention map $f_c(A)$ and the input feature map $A$; $A_s$, the final output of the CBAM attention block, is the product of the element-wise multiplication between the spatial attention map and $A_c$. The channel-wise attention block is composed of concurrent average- and max-pooling procedures that share a fully connected network, whose outputs are added, as described in Eq. (3).

$$f_{c} = \sigma\left( FC\left(\mathrm{AvgPool}(A)\right) + FC\left(\mathrm{MaxPool}(A)\right) \right)$$
(3)

wherein $\sigma$ is the popular sigmoid function and FC denotes the shared fully connected layers. The spatial attention block concatenates the results of the max-pool and average-pool operations before feeding them to a convolution layer. Given the input A, its action is defined by Eq. (4).

$$f_{s} = \sigma\left( w \otimes \left[ \mathrm{AvgPool}(A);\ \mathrm{MaxPool}(A) \right] \right)$$
(4)

wherein ⊗ represents the two-dimensional convolution operation with a kernel w. He et al. (2016) proposed the residual block used in the LiverNet architecture, which is comparable to the residual block used in Toğaçar et al. (2020). The main difference is that the number of filters in the residual block's initial convolution layer is lowered by a factor of 4 compared to the residual block presented in Toğaçar et al. (2020). This not only reduces the number of parameters needed in the model but also increases the quality of the features derived from the input.
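A minimal Keras sketch of the CBAM computation in Eqs. (2)–(4) might look as follows. The reduction ratio and the 7 × 7 spatial kernel are taken from the original CBAM paper; the rest is our illustrative reconstruction, not the LiverNet authors' code:

```python
import tensorflow as tf
from tensorflow.keras import layers

def cbam_block(A: tf.Tensor, reduction: int = 8) -> tf.Tensor:
    """Channel-wise then spatial attention, following Eqs. (2)-(4)."""
    channels = A.shape[-1]

    # Channel attention f_c: shared MLP applied to average- and max-pooled
    # descriptors, summed, then passed through a sigmoid (Eq. (3))
    shared = tf.keras.Sequential([
        layers.Dense(channels // reduction, activation="relu"),
        layers.Dense(channels)])
    avg = shared(layers.GlobalAveragePooling2D()(A))
    mx = shared(layers.GlobalMaxPooling2D()(A))
    f_c = layers.Activation("sigmoid")(layers.Add()([avg, mx]))
    A_c = layers.Multiply()([A, layers.Reshape((1, 1, channels))(f_c)])

    # Spatial attention f_s: channel-wise average and max maps, concatenated
    # and convolved with a 7x7 kernel w (Eq. (4))
    avg_map = tf.reduce_mean(A_c, axis=-1, keepdims=True)
    max_map = tf.reduce_max(A_c, axis=-1, keepdims=True)
    f_s = layers.Conv2D(1, kernel_size=7, padding="same",
                        activation="sigmoid")(
        layers.Concatenate()([avg_map, max_map]))
    return layers.Multiply()([A_c, f_s])  # A_s, the CBAM output (Eq. (2))
```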

(b) ASPP Block

An Atrous Spatial Pyramid Pooling (ASPP) block can successfully extract multi-scale features from a feature map, as demonstrated in Chen et al. (2018). A comparable ASPP block is used in the LiverNet architecture because of its effectiveness. To increase the size of the receptive field without increasing the number of parameters involved, atrous (dilated) convolution can be utilised. Consider a two-dimensional signal X convolved with a two-dimensional filter w via atrous convolution; the convolved product is represented by Eq. (5).

$$Y[i, j] = \sum_{m}\sum_{n} X[i + r \cdot m,\ j + r \cdot n]\, w[m, n]$$
(5)

where r corresponds to the dilation rate, i.e., the rate at which the input signal X is sampled. Atrous convolution has the effect of increasing the receptive field of the kernel by inserting r − 1 zeros between the kernel elements. As a result, if r = 2, a 3 × 3 kernel will have a receptive field equivalent to that of a 5 × 5 kernel but with just 9 parameters.

Figure 13 illustrates the ASPP block employed in the LiverNet architecture. The input feature map is processed by several operations conducted in parallel before concatenation, namely a 1 × 1 convolution; 3 × 3 convolutions with dilation rates of 2, 3, 6, and 8; and global average pooling.

Fig. 13 ASPP block in the LiverNet architecture

To keep the number of filters the same as the input, all the convolution and pooling outputs are concatenated and passed through a 1 × 1 convolution layer. Further, the convolution output is passed through batch normalization and a ReLU activation layer before being delivered to the bilinear up-sampling layer. The outputs of the max-pool layers deliver feature-rich information at many scales and extents; therefore, an ASPP block is placed after each max-pooling operation in the encoder pipeline of the LiverNet architecture.
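Under the assumption of statically known input shapes, the parallel branches of Fig. 13 can be sketched in Keras as follows (an illustrative reconstruction; the filter count is an assumption):

```python
import tensorflow as tf
from tensorflow.keras import layers

def aspp_block(x: tf.Tensor, filters: int = 64) -> tf.Tensor:
    """1x1 conv, four dilated 3x3 convs, and global average pooling in
    parallel, then concatenation and a 1x1 conv (as described for Fig. 13)."""
    h, w = x.shape[1], x.shape[2]  # assumes a statically known spatial size

    branches = [layers.Conv2D(filters, 1, padding="same")(x)]
    for rate in (2, 3, 6, 8):  # dilation rates listed in the text
        branches.append(
            layers.Conv2D(filters, 3, padding="same", dilation_rate=rate)(x))

    # Global-average-pooling branch, broadcast back to the spatial size
    gap = layers.GlobalAveragePooling2D(keepdims=True)(x)   # (B, 1, 1, C)
    gap = layers.Conv2D(filters, 1, padding="same")(gap)
    branches.append(
        layers.UpSampling2D(size=(h, w), interpolation="bilinear")(gap))

    y = layers.Concatenate()(branches)
    y = layers.Conv2D(filters, 1, padding="same")(y)  # restore filter count
    y = layers.BatchNormalization()(y)
    return layers.Activation("relu")(y)
```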

For all the models used in our work, we used Binary Cross-Entropy (BCE) as the loss function, together with the Intersection over Union (IoU) and Dice Coefficient (DC) parameters for the quantitative analysis of the nucleus segmentation results. We define this loss function and these parameters as follows:

(i) Loss Function

Reducing the loss is the goal of an error-driven learning algorithm, and this is accomplished through the use of a good loss function. When we predict a number and want to know how far off we were, the squared-error loss is appropriate, as in regression problems. For classification, where the output is a distribution, we instead need something that captures the difference between the true and predicted distributions. In our study, we use the Binary Cross-Entropy (BCE) loss (Ahamed et al. 2020).

$$\text{Cross Entropy Loss} = -\sum_{k=1}^{C} y_{k} \log\left(f(s)_{k}\right)$$
(6)

where $y_{k}$ and $s_{k}$ represent the ground truth and predicted scores for each class k in C, respectively. For loss computation, ReLU activation is used in the intermediate layers and sigmoid activation before the output. Two distinct symbols, C and C′, represent the numbers of classes used in the different equations: Eq. (6) gives the cross-entropy loss for C classes, and Eq. (7) gives the BCE loss for C′ classes. As shown in Eq. (8), the BCE loss can be written in terms of the activation unit, denoted either $f(s_{k})$ or $\hat{y}$.

$$\text{BCE Loss} = -\sum_{k=1}^{C'} y_{k} \log\left(f(s)_{k}\right)$$
(7)
$$\text{BCE Loss} = -\left[ y \ln(\hat{y}) + (1 - y) \ln(1 - \hat{y}) \right]$$
(8)
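As a quick numerical illustration of Eq. (8) (our own sketch), the mean per-pixel BCE can be computed directly with NumPy:

```python
import numpy as np

def bce_loss(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy of Eq. (8) over all pixels."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    per_pixel = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(per_pixel.mean())

# Ground-truth labels (1 = nucleus) and sigmoid outputs for four pixels
y = np.array([1.0, 0.0, 1.0, 0.0])
p = np.array([0.9, 0.2, 0.6, 0.1])
print(bce_loss(y, p))  # ~0.236
```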
(ii) Segmentation Quality Parameters

For the purposes of our study, two segmentation quality parameters have been used: Intersection over Union (IoU) (Kanadath et al. 2021) and Dice Coefficient (DC) (Gudhe et al. 2021).

(A) Intersection over Union (IoU): In semantic segmentation, IoU, popularly known as the Jaccard Index, is a frequently used metric, defined as the area of overlap between the predicted segmentation and the ground truth divided by the area of their union, as indicated in Eq. (9), wherein A is the ground truth mask image and B is the predicted segmentation result obtained from the model.

$$\mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} = \frac{|A \cap B|}{|A \cup B|}$$
(9)
(B) Dice Coefficient (DC): This segmentation quality parameter measures the similarity between the predicted mask and the corresponding ground truth mask. It is defined as twice the area of overlap divided by the total number of pixels in both images, as depicted in Eq. (10), wherein A is the ground truth mask image and B is the predicted segmentation result obtained from the model.

$$\mathrm{DC} = \frac{2 \times \text{Area of Overlap}}{\text{Total Number of Pixels}} = \frac{2 \times |A \cap B|}{|A| + |B|}$$
(10)
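Both parameters follow directly from Eqs. (9) and (10); a small NumPy sketch for binary masks (our own illustration):

```python
import numpy as np

def iou_and_dice(mask_true: np.ndarray, mask_pred: np.ndarray):
    """Compute Eqs. (9) and (10) for binary masks A (ground truth)
    and B (prediction)."""
    A, B = mask_true.astype(bool), mask_pred.astype(bool)
    overlap = np.logical_and(A, B).sum()   # |A ∩ B|
    union = np.logical_or(A, B).sum()      # |A ∪ B|
    iou = overlap / union
    dice = 2.0 * overlap / (A.sum() + B.sum())
    return iou, dice

A = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 0]])
B = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]])
print(iou_and_dice(A, B))  # IoU = 2/4 = 0.5, DC = 4/6 ≈ 0.667
```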

6 Experimental result and discussion

This section presents the experimental results of the four well-known deep learning CNN models, namely U-Net, Separable Convolutional Pyramid Pooling Network (SCPP-Net), Sharp U-Net, and LiverNet, on a merged dataset. The dataset we used is briefly described below.

6.1 Experimental dataset

For our purpose, we merged three publicly available datasets: the JPI 2016 dataset (Janowczyk and Madabhushi 2016), the IEEE TMI 2019 dataset (Naylor et al. 2018), and the PSB 2015 dataset (Irshad et al. 2015). These three datasets are described in detail as follows:

(A) JPI 2016 Dataset: Janowczyk and Madabhushi (2016) announced this dataset, which comprises 143 H&E-stained images of ER+ breast cancer from 137 patients, scanned at 40×. Each image is 2000 × 2000 pixels, with around 12,000 nuclei manually segmented across the images. Files follow the formats 12750_500_f00003_original.tif for the original H&E images and 12750_500_f00003_mask.png for the mask of the same size, with white pixels representing nuclei. Each image is prefixed by a code (e.g., 12750, to the left of the first underscore) that identifies a unique patient; a few patients have several images associated with them (137 patients vs. 143 images).

(B) IEEE TMI 2019 Dataset: Naylor et al. (2018) offered this dataset, generated by the Curie Institute, which comprises annotated H&E-stained histology images at 40× magnification. A total of 122 histopathology slides are annotated: 56 pCR, 10 RCB-I, 49 RCB-II, and 7 RCB-III.

(C) PSB 2015 Dataset: Irshad et al. (2015) presented this dataset, whose images come from WSIs of Kidney Renal Clear Cell Carcinoma (KIRC) in the TCGA data portal. The TCGA project, jointly supported by the National Cancer Institute and the National Human Genome Research Institute, has undertaken detailed molecular profiling of tens of thousands of tumours covering the 25 most frequent cancer types. Ten KIRC Whole Slide Images (WSIs) were selected from the TCGA data portal (https://tcgadata.nci.nih.gov/tcga/). Nucleus-rich ROIs were then identified in these WSIs, and 256 × 256-pixel images were extracted for each ROI at 40× magnification.

Therefore, the combined dataset employed contains a total of 653 images. We randomly selected 457 of these images for training, 98 for validation, and 98 for testing.
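A split of this kind can be reproduced along the following lines; the directory layout, file extension, and random seed are assumptions for illustration, not details from the study:

```python
import random
from pathlib import Path

random.seed(42)  # assumed seed; the paper does not state one

images = sorted(Path("merged_dataset/images").glob("*.tif"))  # 653 files assumed
random.shuffle(images)

train, val, test = images[:457], images[457:555], images[555:653]
print(len(train), len(val), len(test))  # 457 98 98
```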

6.2 Training and implementation

To speed up the development procedure and experiments, training and implementation were carried out in a Jupyter notebook with recent versions of the Keras and TensorFlow Python 3 frameworks on a machine with a Ryzen 5 3550 CPU, 16 GB RAM, and an Nvidia GTX 1650 GPU. The four deep learning models considered in this study were trained using sigmoid or softmax as the output activation function and the Adam optimizer, an adaptive learning-rate optimization algorithm, to speed up training. The loss function employed for the four models is binary cross-entropy (BCE) (Ahamed et al. 2020), as highlighted in (8). Further, batch sizes of 8, 4, 10, and 2 were used to train U-Net, SCPP-Net, Sharp U-Net, and LiverNet, respectively, on the 256 × 256 histopathology images.
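In Keras, this training configuration amounts to a compile-and-fit sketch like the one below; `build_unet` (sketched in Sect. 5.1 above) and the `train_ds`/`val_ds` datasets are assumed placeholders, not code from the surveyed papers:

```python
import tensorflow as tf

# build_unet() stands in for any of the four architectures
model = build_unet(input_shape=(256, 256, 3))

model.compile(
    optimizer=tf.keras.optimizers.Adam(),       # adaptive learning rate
    loss=tf.keras.losses.BinaryCrossentropy(),  # BCE, Eq. (8)
    metrics=[tf.keras.metrics.BinaryIoU(),      # IoU, Eq. (9)
             tf.keras.metrics.BinaryAccuracy()],
)

model.fit(train_ds.batch(8),            # batch size 8 reported for U-Net
          validation_data=val_ds.batch(8),
          epochs=1500)                  # epoch count reported for U-Net
```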

6.3 Discussion on segmentation results

In this study, a comparative examination of four CNN architectures, U-Net (Ronneberger et al. 2015), SCPP-Net (Chanchal et al. 2021b), Sharp U-Net (Zunair and Hamza 2021), and LiverNet (Aatresh et al. 2021a), is conducted. On the combined dataset, all four models are trained using 457 images for training, 98 for validation, and 98 for testing. During training, the network is fed the histopathology images from the training set together with their ground truth masks. Two assessment metrics are used in this study, namely Intersection over Union (IoU) (Kanadath et al. 2021) and Dice Coefficient (DC) (Gudhe et al. 2021), shown in (9) and (10), respectively. Each model is then used to predict the masks of the test images. The input size for all models is 256 × 256 pixels. The U-Net and SCPP-Net models have 7,725,249 and 2,985,659 trainable parameters, respectively. The Sharp U-Net and LiverNet models have 7,760,097 and 989,117 trainable parameters, along with 4,320 and 12,288 non-trainable parameters, respectively. The U-Net, SCPP-Net, Sharp U-Net, and LiverNet models are trained for 1500, 500, 550, and 700 epochs with training times of 590 ms, 330 ms, 595 ms, and 490 ms per step, respectively. Because U-Net is less sophisticated than Sharp U-Net, it takes slightly less time per step; Sharp U-Net, on the other hand, provides better segmentation quality and accuracy. The performance of the four deep learning models for nuclear segmentation is compared in Table 2.

Table 2 Performance comparison of four utilised architectures on merged dataset

The main complexity of the U-Net (Ronneberger et al. 2015) architecture is that the resulting segmentation map is negatively impacted by the feature mismatch between the encoder and decoder paths, which causes semantically incompatible data to be fused and produces hazy feature maps during learning. The segmentation of overlapping nuclei is the main difficulty for the SCPP-Net (Chanchal et al. 2021b) model. Sharp U-Net (Zunair and Hamza 2021) predicts outcomes that are slightly under-segmented and imperfect, but it generates far less noise and far fewer broken segmented outputs. The key difficulty of the LiverNet (Aatresh et al. 2021a) model is segmenting the tiniest and most densely packed nuclei.

Figure 14 depicts a graphical representation of the segmentation experiment (IoU score, Dice, and accuracy %), which clearly demonstrates that Sharp U-Net produces the best results on the two quality parameters, IoU and Dice, yielding smoother predictions than the other three segmentation models.

Fig. 14 Graphical representation of IoU and DC of the four utilised segmentation models

In terms of the Dice Coefficient (DC) and Intersection over Union (IoU) score, the segmentation results of the four nuclei segmentation models on the merged dataset are as follows. LiverNet obtains a DC of 0.5299 and an IoU of 0.3801, lower than the other three models. U-Net and SCPP-Net improve on these scores, obtaining DC values of 0.6599 and 0.6340 and IoU values of 0.4934 and 0.4711, respectively, as depicted in Table 2. Sharp U-Net obtains the best results, with a DC of 0.6899 and an IoU of 0.5276. This analysis further reveals that Sharp U-Net can be used to obtain suitable nuclear segmentation results. The four segmentation models (U-Net, SCPP-Net, Sharp U-Net, and LiverNet) produce accuracies of 83.13%, 81.67%, 82.04%, and 82.28%, respectively. Figure 15 depicts a graphical representation of the training and validation loss for the four CNN models.

Fig. 15 Graphical representation of training and validation loss of the four utilised CNN models

Figure 16 contains examples of original images alongside the masks predicted by the various models, highlighting the outcomes of our segmentation experiments on the combined dataset. As the figure shows, Sharp U-Net produces better segmented images than the other three models tested.

Fig. 16 Row-wise visual segmentation comparison of the four utilised models on the merged dataset

7 Conclusion and future directions

Recent advancements in the field of computer vision and machine learning have produced an assemblage of algorithms with a remarkable and noteworthy ability to interpret the content of imagery. Several such deep learning algorithms are being employed on biological images, massively transforming the analysis and interpretation of imaging data and generating satisfactory outcomes for segmentation and even classification of images across numerous domains. Although learning the parameters of deep architectures necessitates a large volume of labelled training data, transfer learning is promising in such scenarios because it focuses on reusing learned features and applying them appropriately to the situation's requirements and demands. As a survey paper, this study makes three major contributions, stated below:

a. An overview table of deep learning models used for nucleus segmentation from 2017 to 2021, covering the different optimizers used across a range of datasets and image types, showing how deep learning models are applied to nucleus segmentation.

b. A comparative study of four recently developed deep learning models for segmenting nuclei.

c. The deep learning models mentioned in (b) were trained on a merged version of three datasets, namely JPI 2016, IEEE TMI 2019, and PSB 2015, containing 653 images in total; the training results are reported in Table 2, grouped according to the accuracy results. The experimental results are very encouraging, highlighting that Sharp U-Net delivers high-quality results in all cases with minimal loss. The best scores obtained by Sharp U-Net are a DC of 0.6899 and an IoU of 0.5276. The DC and IoU values for U-Net, SCPP-Net, and LiverNet were 0.6599 and 0.4934, 0.6340 and 0.4711, and 0.5299 and 0.3801, respectively.

Therefore, it is easy to infer that deep learning-based nucleus segmentation for histopathology images is a fresh and exciting research topic to work on and concentrate on. The major challenges to be addressed in the future would be to develop:

a. innovative and hybrid CNN architectures enabling a wide range of medical image segmentation techniques,

b. loss functions designed for more specific medical image segmentation tasks,

c. a strong research emphasis on transfer learning as well as on the interpretability of CNN models,

d. optimized CNN models based on Nature-Inspired Optimization Algorithms (NIOA),

e. different techniques and architectures to further improve speed and decrease model size; in addition, larger and more diversified salient-object datasets are needed to train more accurate and robust models,

f. deep architectures that require fewer calculations and can work on embedded devices while producing better test results,

g. effective mitigation of the information recession problem that occurs in traditional U-shaped architectures, and

h. optimized CNN models built with nature-inspired optimization algorithms (Rai et al. 2022), such as the Aquila Optimizer (Abualigah et al. 2021a), the Reptile Search Algorithm (Abualigah et al. 2022), and the Arithmetic Optimization Algorithm (Abualigah et al. 2021b), in the field of medical image segmentation.