Introduction

Image segmentation plays an important role in computer vision applications, as it influences critical tasks such as image analysis, feature calculation, object detection and classification. Recently, with advances in hardware technologies and the development of neural network algorithms, emphasis has shifted from localized segmentation of an image to pixel-level segmentation. This is called semantic image segmentation, where the different regions of an image are clustered into different object classes. It associates each pixel of an image with a class label and gives multiple levels of representation of the image by means of object classes. Scene parsing has become a fundamental research area in computer vision as the number of applications is on the rise. Scene parsing analyzes and segments an image into different image regions associated with semantic categories, and it relies mostly on semantic segmentation [1,2,3]. For example, people may be interested in segmenting vehicles, persons, roads, and the sky in a traffic scene captured by a camera mounted on a vehicle to assist autonomous driving operations [2]. Other important applications include detecting road signs [4], human-machine interaction [5], virtual reality, and computational imaging [6]. These algorithms have found wide application in computer vision mainly due to their accuracy, which is achieved using emerging deep learning techniques such as convolutional neural networks (CNN), deep neural networks (DNN) and recurrent neural networks (RNN). The success of these techniques is attributed mainly to the increasing availability of datasets and the increase in parallel computing power with widely available general-purpose graphics processing units (GPGPU). Deep learning methods rose to popularity with their success in image classification, and extensive research has since extended them to pixel-wise image segmentation with more object classes in the scenes, which helps provide well-defined object boundaries. Both supervised and unsupervised machine learning methods have been successfully used for deep learning-based semantic segmentation tasks.

Fig. 1

Medical imaging applications in which deep learning has achieved state-of-the-art results. From left to right (top row): mammographic mass classification [7], segmentation of lesions in the brain [8], leak detection in airway tree segmentation [9]; from left to right (middle row): diabetic retinopathy classification [10], prostate segmentation [11], nodule classification [12]; from left to right (bottom row): breast cancer metastases detection in lymph nodes [13], human expert performance in skin lesion classification [14], and bone suppression in X-ray [15]. The images in this figure have been extracted from [16]

Recently, in the field of biomedical image processing, there has been a large increase in applications of image segmentation, recognition and registration techniques. The performance of image analysis by traditional methods, such as manual analysis of X-rays or CT scans, is somewhat restricted by the limited experience of the analyzer, image complexity, inconsistency of interpretation and irregular anatomy between patients. Many of these limitations can be removed by the use of computer-aided systems, and therefore, in the field of automated medical image analysis, applications of computer-aided systems are on the rise.

These automated inspection methods have surpassed traditional methods by a large margin in terms of diagnostic measures. Some of the emerging applications where these techniques are showing promising results include glaucoma detection and blood vessel segmentation from fundus images [17, 18], brain tumor segmentation from MRI [19], segmentation of the pectoral muscle from breast MRI [20], segmentation of the coronary arteries in cardiac CT angiography, 3D segmentation in microscopic images [21], lesion segmentation [22], microscopy image analysis [23], and colon crypts segmentation [24], to name a few. Figure 1 shows some inferences of deep learning methods for classification and segmentation of biomedical images.

Researchers have provided reviews of semantic segmentation of natural and biomedical images. Many deep learning approaches for medical image segmentation have been introduced for different medical imaging modalities. Litjens et al. [16] reviewed major deep learning concepts related to medical image analysis. They reviewed deep CNN architectures for general classification, detection and segmentation of biomedical images. Thoma [25] reviewed semantic segmentation using traditional approaches, focusing on feature-based approaches, unsupervised segmentation methods, random decision forests, conditional random fields and Markov random fields. Guo et al. [26] reviewed semantic segmentation of images using deep learning techniques by dividing the work into three categories: region-based, fully convolutional network (FCN) based, and weakly supervised segmentation methods. They discussed major challenges and weaknesses of the deep learning methods in terms of data size, computational resources and accuracy of inferences. Liu et al. [27] presented a review of progress in semantic segmentation considering both traditional methods and deep learning techniques, focusing mainly on FCNs, pyramid methods in segmentation and multistage networks using convolutional neural networks (CNN). Goceri et al. [28] reviewed different challenges related to training deep neural networks for segmentation of medical images. Taghanaki et al. [29] reviewed semantic segmentation of natural and medical images by categorizing the leading deep learning-based medical and non-medical image segmentation solutions into six main groups: deep architectural, data synthesis, loss function, sequenced models, weakly supervised, and multi-task methods.

This review paper offers a comprehensive overview of deep learning-based semantic segmentation techniques and their applications in the biomedical imaging field. It also includes an overview of the state-of-the-art work, the most recent datasets, details of the relevant deep learning techniques, potential research directions and open challenges in the field of biomedical imaging. The following contributions are highlighted in this review paper in comparison with the other existing surveys:

  • A brief coverage of research contributions in the field of semantic segmentation of biomedical images using DL techniques is provided with respect to modalities, types of organs and imaging applications. This paper discusses all important deep learning models used for the semantic segmentation task.

  • Popular deep architectures are discussed along with their applications and limitations.

  • Different medical databases with semantic segmentation ground truth used for semantic segmentation of biomedical images have been explained. In addition, different open-source software packages and libraries used for computation of deep learning algorithms have been presented for better understanding.

  • Deep learning architectures for semantic segmentation, including supervised and unsupervised learning models, are discussed along with their applications in medical imaging.

  • Finally, the paper highlights important research directions and limitations of different model architectures with respect to training, testing, hyper-parameter selection, modalities, and types of organs for semantic segmentation in biomedical images.

The rest of the paper is organized as follows: “Semantic Image Segmentation” summarizes the traditional segmentation algorithms and their characteristics. Semantic segmentation algorithms based on neural networks are discussed in “Neural Network-Based Methods for Semantic Image Segmentation”. Some popular deep network architectures are discussed in “Standard Deep Neural Network Architectures”. “Datasets” discusses the available datasets and software. In “Deep Learning for Semantic Segmentation of Medical Images” the applications of deep neural networks in medical imaging are described. Applications and challenges of semantic segmentation in the biomedical field are discussed in “Discussion”. We conclude this paper in “Conclusion” with our comments on this review and future research directions.

Semantic Image Segmentation

Traditional segmentation methods focus on segmenting the region of interest, while semantic segmentation assigns the different objects in an image to different classes. Based on the underlying technique of feature extraction, semantic segmentation algorithms can be divided into two categories: traditional feature-based classification methods and deep neural network-based methods. Traditional approaches use different feature-based classification methods, such as region-based segmentation [30], texton forests [31], random forest-based classifiers [32], conditional random fields [33], and clustering techniques, where the features are hand crafted. On the other hand, neural network-based methods incorporate the domain knowledge available in a dataset through repeated spatial convolution operations to learn enhanced features for accurate inference. Another way to categorize semantic segmentation algorithms is to divide them into supervised and unsupervised segmentation methods. Supervised learning methods rely on supervision that uses intense domain knowledge or labeled data for separating the region of interest, whereas unsupervised learning develops perceptions directly from the data itself, clusters the data and supports data-driven decisions without any external bias.

Segmentation Architecture

Semantic segmentation architectures consist of classifiers which can classify the image into different semantic regions or assign each pixel a class label. Classifier-based methods depend on fixed-size feature inputs and work on a predefined statistical or probabilistic model of the classifier. Statistical classifiers use supervised or unsupervised models for pixel classification, which directly depends on the distribution of data. Probabilistic models use spatial probability distribution maps to tackle the variability of pixels. Markov random fields (MRF) and conditional random fields (CRF) are used for semantic segmentation, where the classifier learns the conditional distribution of the feature vectors for class labeling. Utilization of a CRF permits the inclusion of shape, texture, color, location, and edge cues in a single combined model. Another approach is the sliding window-based approach, where the trained classifier is fed with rectangular regions of the image and classifies the center pixel or a subset of the complete window. Neural network-based methods support this type of approach by treating a trained network as a convolution and applying the convolution to the complete image. Using deep neural networks with an increasing number of convolutional layers provides more satisfactory results than feature-based classifiers.

Traditional Methods for Semantic Image Segmentation

Traditional methods for semantic image segmentation rely on efficient feature detection and classification. In these methods, the main emphasis is on feature detection, pixel-wise classification or matching methods. Various hand-crafted features are used for semantic segmentation, such as pixel color [34], histograms of oriented gradients (HOG) [35], the scale-invariant feature transform (SIFT) [36], local binary patterns (LBP) [37], sub-pixel corners [38] and features from accelerated segment test (FAST) [27]. Figures 2 and 3 represent semantic segmentation of biomedical images using traditional methods.

Chen et al. [39] proposed an intensity neighborhood-based supervised automated segmentation system for segmenting biomedical images. In the training stage, the system received scaled, normalized input data and extracted significant pixels in neighborhood windows, whereas in the testing stage, a voting procedure is used for predicting the unknown data with classifiers trained at different scales. Principal component analysis (PCA) is used to reduce the high dimensional complexity arising due to pixel windows. Brox et al. [40] proposed part-based poselet detectors, which use potential object contours and texture patches in the image for semantic object segmentation.

Fig. 2

Semantic segmentation on the JSRT dataset (red color represents the heart; green color represents the lungs). First column represents image, second column represents ground truth and third column represents prediction. This Figure has been extracted from [41]

Fig. 3

Gray scale mammogram image and its semantic segmentation representation. This Figure is extracted from [42]

Adam et al. [43] used color cues for detection and classification of road signs using histogram of oriented gradients (HOG) descriptors and a support vector machine (SVM) classifier, and applied the algorithm to Greek road sign detection and classification. Dalal et al. [35] also used the HOG descriptor as an efficient feature for human detection, with a linear SVM classifier.

Conditional random fields (CRF) have been used in some works to exploit spatial information for semantic image interpretation. Yang et al. [44] proposed a hierarchical conditional random field model for image classification by modeling spatial and hierarchical structures. They labeled the image dataset with a hierarchical CRF with the energy function given by

$$\begin{aligned} E(x|d)=\sum _{i\in V}{E_{1}x_{i}} +\alpha \sum _{(i,j)\in N}{E_{2}x_{i}x_{j}} +\beta \sum _{(i,k)\in H}{E_{3}x_{i}x_{k}}, \end{aligned}$$
(1)

where \(\alpha\) and \(\beta\) are weighting coefficients and \(x_{i}\) is the label for each region i based on the image data d. \(E_{1}\) represents the unary potential, \(E_{2}\) represents the local pairwise potential, which gives the relation between variables of neighboring regions within each scale, and \(E_{3}\) represents the hierarchical pairwise potential, which represents relationships between regions of neighboring scales. They used the 8-class eTRIMS dataset [45], which consists of 60 building facade images with 8 labelled classes. Shotton et al. [46] used appearance, shape and context information as a whole as textons (which jointly model shape and texture) for automatic visual recognition and semantic segmentation of photographs. They used a CRF model which integrates shape, texture, color, location and edge into a unified feature space. Dalal et al. [35] used histograms of oriented gradients as feature vectors with an SVM classifier for object/non-object classification. Raviteja et al. [47] used a Gaussian CRF model for semantic segmentation. Table 1 contains some important contributions towards semantic segmentation using traditional methods.
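As a rough illustration of equation (1), the sketch below evaluates the energy of a candidate labeling over a small region graph, assuming the unary and pairwise potentials are supplied as lookup tables; the function name, data structures and weights are hypothetical placeholders, not an implementation from [44].

```python
# Hypothetical sketch of evaluating the hierarchical CRF energy in Eq. (1).
# Potentials and neighborhood structure are illustrative placeholders.

def crf_energy(labels, unary, local_pairs, hier_pairs, E2, E3, alpha, beta):
    """labels: dict region -> class label
    unary: dict (region, label) -> unary cost E1
    local_pairs: list of (i, j) neighboring regions within each scale
    hier_pairs: list of (i, k) region pairs across neighboring scales
    E2, E3: dicts (label_i, label_j) -> pairwise costs
    """
    e = sum(unary[(i, labels[i])] for i in labels)                        # unary term
    e += alpha * sum(E2[(labels[i], labels[j])] for i, j in local_pairs)  # local pairwise term
    e += beta * sum(E3[(labels[i], labels[k])] for i, k in hier_pairs)    # hierarchical pairwise term
    return e
```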

Table 1 Traditional methods for segmentation of biomedical images with semantic segmentation ground truth

Neural Network-Based Methods for Semantic Image Segmentation

Recent developments in the field of neural networks have improved the state-of-the-art in semantic segmentation. Neural network-based classifiers use summations of weighted inputs in cascaded layers and apply activation functions to the weighted sums to obtain the output. The general architecture consists of an input layer, several hidden layers, a fully connected layer, and an output layer. The mapping between consecutive layers decides the architecture of the neural network. The network learns its parameters by updating the weights to minimize an error function (e.g., cross entropy or mean squared error) [52].

Fig. 4

Neural network architecture [52]

In Fig. 4, \(X_{i}\) represents the layer of input neurons. The first hidden layer \(H_{1}\) is represented by the function

$$\begin{aligned} y_{j}= f(Z_{j}), \end{aligned}$$
(2)

where \(Z_{j}=\sum {W_{ij}x_{i}}\). Similarly, the second hidden layer \(H_{2}\) is represented by the function

$$\begin{aligned} y_{k}= f(Z_{k}), \end{aligned}$$
(3)

where \(Z_{k}=\sum {W_{jk}y_{j}}\). The final layer is represented by

$$\begin{aligned} y_{l}= f(Z_{l}), \end{aligned}$$
(4)

where \(Z_{l}=\sum {W_{kl}y_{k}}\). The output of the final layer may pass through a threshold function to finally obtain the classified output.
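A minimal NumPy sketch of the forward pass described by equations (2)-(4) is shown below; the layer sizes, the sigmoid activation and the final threshold are illustrative assumptions rather than choices made in [52].

```python
import numpy as np

def f(z):
    # Sigmoid activation applied element-wise to the weighted sums
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)              # input layer X_i (4 neurons, illustrative)
W_ij = rng.normal(size=(4, 5))      # weights input -> hidden layer H_1
W_jk = rng.normal(size=(5, 3))      # weights H_1 -> hidden layer H_2
W_kl = rng.normal(size=(3, 2))      # weights H_2 -> output layer

y_j = f(x @ W_ij)                   # Eq. (2): y_j = f(sum_i W_ij x_i)
y_k = f(y_j @ W_jk)                 # Eq. (3): y_k = f(sum_j W_jk y_j)
y_l = f(y_k @ W_kl)                 # Eq. (4): y_l = f(sum_k W_kl y_k)
output = (y_l > 0.5).astype(int)    # threshold to obtain the classified output
```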

Deep neural network-based methods permit efficient learning of features directly from the image data. The network learns the features that optimally represent the data for classification. The increasing use of convolutional neural networks (CNN), recurrent neural networks (RNN), deep belief networks (DBN), and auto-encoders in image segmentation has enhanced the state-of-the-art in semantic segmentation of images [53, 54]. These deep learning algorithms use multiple layers of data abstraction and learn progressively higher-level features to transform the input data into a suitable output form.

CNN in Semantic Segmentation

Convolutional neural networks (CNN) are artificial neural network architectures consisting of several convolutional layers that perform spatial convolution of the input image with different filter kernels to compute various feature maps [55]. The architecture was first introduced by Kunihiko Fukushima [56]. In a CNN, filters or kernels acting as weights slide across the whole input image during convolution to create the next layer, and this new layer is called a feature map. The same filters can be used to repeat this operation to create further feature maps. The input and output feature maps have different dimensions depending on the number of image channels, i.e. 1 or 3 for grayscale and color images, respectively. Multiple filters can be applied across a set of input image slices, where each filter generates a distinctive output slice. These slices highlight the features detected by the filters. At each layer, the input image is convolved with a set of kernels or masks with added biases to generate a new feature map. If k kernels or masks are taken, each kernel can be represented by \(W = \{ {w_{1},}{w_{2}},{w_{3}},\dots ,{w_k}\}\) with added biases \(B = \{ {b_{1}},{b_{2}},{b_{3}},\dots ,{b_k}\}\). These feature maps are subjected to an element-wise non-linear transformation, such as tanh, sigmoid, or rectified linear units (ReLU), and the process is repeated for every convolutional layer l [57]. Mathematically, it can be represented as

$$\begin{aligned} X_k^l = f(W_k^{l - 1}*{X^{l - 1}} + b_k^{l - 1}), \end{aligned}$$
(5)

where \(W_k\) represents the \(k\text {th}\) kernel, X represents the input image or some predefined portion of the input image and \(b_k\) is the \(k\text {th}\) added bias. Figure 5 shows a basic convolution operation over a \(3\times 2\) matrix with a kernel size of \(2\times 1\).

Fig. 5

Convolution operation

After the convolution layer, a pooling operation is performed, which downsamples each input feature map. The pooling process progressively reduces the dimensions of each input feature map while conserving the most significant features, and helps avoid over-fitting [55]. The outputs of each CNN layer are passed through non-linear activation functions, such as rectified linear units (ReLU), which allow the network to represent composite non-linear mappings between the input image and the desired outputs. Figure 6 represents a basic classification model using a CNN architecture.
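The following PyTorch sketch mirrors equation (5) followed by the pooling step: a convolution with k kernels and biases, an element-wise ReLU non-linearity, and a max-pooling layer that halves the spatial dimensions. The channel counts and kernel size are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)            # one RGB input image (batch, channels, H, W)

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)  # k = 8 kernels W_k with biases b_k
relu = nn.ReLU()                          # element-wise non-linearity f(.)
pool = nn.MaxPool2d(kernel_size=2)        # downsamples each feature map by a factor of 2

feature_maps = pool(relu(conv(x)))        # Eq. (5) followed by pooling
print(feature_maps.shape)                 # torch.Size([1, 8, 16, 16])
```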

Fig. 6

CNN architecture [55]

Lo et al. [58] applied CNNs to medical image analysis, and CNNs were used for hand-written digit recognition in LeNet [59]. However, the popularity of CNNs among computer vision researchers took off when AlexNet, proposed by Krizhevsky et al. [55], won the ImageNet challenge in December 2012 by a huge margin. Later, more advanced CNN architectures were proposed using related but deeper designs. These architectures have shown outstanding performance on image segmentation, classification and detection tasks.

Fully Convolutional Networks (FCN) in Semantic Segmentation

The fully convolutional network (FCN) proposed by Long et al. [60] uses a convolutional layer in place of the fully connected layer at the end of the network. A fully connected layer can be seen as a special case of a convolutional layer with an input volume of depth 1, a filter size equal to the size of the input, and a total number of filters equal to the number of output neurons. An FCN naturally works on an input of any size and produces an output with correspondingly resampled spatial dimensions [60]. The different parts of an FCN (shown in Fig. 7) are convolution layers, pooling layers, activation functions, and softmax layers.

Fig. 7

FCNN architecture

With the help of pooling layers, convolutional layers can detect features at every level of abstraction. The cumulative receptive field of deeper layers is larger than that of the layers at the beginning of the network, which permits them to capture more complex features from larger input regions. Generally, the convolutional layers extract features from the input in the form of feature maps, and the features detected by the deepest layers are highly abstract. To map these features to class predictions, one or more fully convolutional layers are added after the last convolutional/pooling layer. The last fully convolutional (output) layer uses softmax to estimate the class probabilities of the input. Hence, this layer can be treated as a translator between the network's internal representation and the desired output [61].

Long et al. [60] used a dense FCN for semantic segmentation, which combines downsampling layers and deconvolution (upsampling) layers to enhance the spatial precision of the output. They used the ImageNet large scale visual recognition challenge (ILSVRC) database [62] for dense prediction with upsampling and a pixel-wise loss. They built a novel skip architecture which combines different spatial information to refine predictions. They evaluated the proposed algorithm on the PASCAL VOC 2011 segmentation challenge training dataset, the NYUDv2 dataset [63] and the SiftFlow dataset [64]. The same approach was adopted by Shelhamer et al. [65], who modified existing classification networks (AlexNet, VGG net, and GoogLeNet) into FCNs. They added fine-tuning to the segmentation task by defining a skip architecture that combines semantic information from deep layers with appearance information from shallow layers to produce detailed and high-quality segmentation.
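As a small illustration of the idea that a fully connected classifier can be expressed as a convolution, the sketch below attaches a \(1\times 1\) convolutional classifier to a toy backbone, so the network accepts inputs of any size and returns a coarse per-class score map that is then upsampled; the layer sizes and class count are assumptions, not the configuration used in [60].

```python
import torch
import torch.nn as nn

num_classes = 21  # e.g., a PASCAL VOC-style label set (illustrative)

# A toy "backbone": two conv/pool stages producing 64-channel feature maps
backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

# Classifier expressed as a 1x1 convolution instead of a fully connected layer,
# so it can be applied at every spatial location of the feature map.
classifier = nn.Conv2d(64, num_classes, kernel_size=1)

x = torch.randn(1, 3, 128, 160)           # arbitrary input size
scores = classifier(backbone(x))          # coarse score map: (1, 21, 32, 40)

# Bilinear upsampling back to the input resolution
# (a simple stand-in for learned deconvolution layers)
upsampled = nn.functional.interpolate(scores, size=x.shape[2:], mode="bilinear",
                                      align_corners=False)
print(upsampled.shape)                    # torch.Size([1, 21, 128, 160])
```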

Atrous Convolution with CNN in Semantic Segmentation

Chen et al. [66] proposed a network architecture called DeepLabv3, which uses atrous convolution to extract dense features for the semantic segmentation task. The network employs atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates, called atrous spatial pyramid pooling (ASPP). Here, the field of view of the filters is effectively enlarged to include larger context without increasing the number of parameters or the amount of computation. In atrous convolution, the kernels are chosen with different dilation rates, which define the spacing between the values in a kernel. Hence, a \(3 \times 3\) kernel with a dilation rate of two has the same field of view as a \(5 \times 5\) kernel, covering a wider field of view at the same computational cost as a \(3\times 3\) kernel. Figure 8 shows a schematic block representation of atrous convolution.

Their work showed improved segmentation results due to the addition of an encoder–decoder module [63], which is shown in Fig. 9. In many notable works on image segmentation, atrous (dilated) convolution is used to enhance the resolution of features, as the quality of these features is often reduced by repeated pooling operations or convolution striding in CNNs [53].
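A brief sketch of atrous (dilated) convolution and a simplified ASPP-style block is given below; the dilation rates and channel counts are illustrative and do not reproduce the exact DeepLabv3 configuration.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 33, 33)   # feature map from a backbone (illustrative size)

# A 3x3 kernel with dilation rate 2 covers the same field of view as a 5x5 kernel
atrous = nn.Conv2d(256, 256, kernel_size=3, padding=2, dilation=2)

# Simplified ASPP: parallel atrous convolutions with different rates, concatenated
branches = nn.ModuleList([
    nn.Conv2d(256, 64, 3, padding=r, dilation=r) for r in (1, 6, 12, 18)
])
aspp_out = torch.cat([b(x) for b in branches], dim=1)   # multi-scale context features

print(atrous(x).shape, aspp_out.shape)  # spatial size preserved; ASPP yields 4 * 64 channels
```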

Fig. 8

Deeper representation with atrous convolution. Figure has been extracted from [53]

Fig. 9

DeepLabv3+ encoder-decoder structure. Figure has been extracted from [66]

Recurrent Neural Networks (RNN) in Semantic Segmentation

Recurrent neural networks (RNN) are used for processing sequential data of variable length. They use a recurrence relation between the current step and the previous step through feedback loops. An RNN can be defined by the recurrence relation

$$\begin{aligned} {s_t} = f({s_{t - 1}},{x_t}), \end{aligned}$$
(6)

where \({x_t}\) is the network input at step t, \({s_t}\) is the internal state at step t and \({s_{t - 1}}\) is the internal state at step \(t - 1\). In an RNN, each state depends on all previous computations via this recurrence relation. Generally, an RNN has three sets of weights:

  1. U transforms the input \({x_t}\) to the state \({s_t}\),

  2. W transforms the previous state \({s_{t - 1}}\) to the current state \({s_t}\),

  3. V maps the newly computed internal state \({s_t}\) to the output \({y_t}\).

The relation between the internal state and the network output is given as

$$\begin{aligned} {s_t} = f({s_{t - 1}}*W + {x_t}*U), \end{aligned}$$
(7)

and

$$\begin{aligned} {y_t} = {s_t}*V, \end{aligned}$$
(8)

where f is a non-linear activation function [67].
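A minimal NumPy transcription of the recurrence in equations (6)-(8) is sketched below; the dimensions and the tanh activation are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, state_dim, output_dim = 5, 8, 3    # illustrative sizes

U = rng.normal(size=(input_dim, state_dim))   # input  -> state
W = rng.normal(size=(state_dim, state_dim))   # state  -> state
V = rng.normal(size=(state_dim, output_dim))  # state  -> output

s = np.zeros(state_dim)                       # initial internal state s_0
for x_t in rng.normal(size=(4, input_dim)):   # a sequence of 4 input vectors
    s = np.tanh(s @ W + x_t @ U)              # Eq. (7): s_t = f(s_{t-1} W + x_t U)
    y = s @ V                                 # Eq. (8): y_t = s_t V
```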

Pinheiro et al. [68] used a recurrent CNN for scene parsing. They used a recurrent convolutional neural network over a large input image context and trained the model in an end-to-end fashion over raw pixels, modeling complex spatial dependencies with low inference cost. As the context size increases with the built-in recurrence, the system identifies and corrects its own errors. They used the Stanford background dataset [69] and the SIFT flow dataset [64] for testing.

Recursive Context Propagation Networks (RCPN) for Semantic Segmentation

RCPN frames the problem of semantic segmentation as labeling of super-pixels [70] into desired semantic categories. It starts with the localization of semantically connected regions (super-pixels), followed by the extraction of visual features for each super-pixel. A multi-scale CNN [71] is used to extract per-pixel features, which are averaged over super-pixels. Random binary parse trees are created from the adjacency information between super-pixels, where leaf nodes correspond to the initial super-pixels. By merging the nodes, a hierarchical graph structure is created, which is then passed through pre-designed modules to obtain the output labels. The final labels are decided through a voting procedure because each parse tree can give rise to different labels for the same super-pixel. Sharma et al. [72] used RCPN, which employs contextual information from the whole image via random binary parse trees for improved feature representation of every super-pixel in the image. They identified bypass error paths in the computation graph of RCPN, which hamper contextual propagation. Hence, they used pure-node RCPN and the tree Markov random field-recursive context propagation network (MRF-RCPN) to minimize the bypass error.

Weakly-Supervised or Semi-Supervised Learning Models

In the case of deep learning-based methods, some weakly supervised models have been proposed [73,74,75]. Training a deep neural network (DNN) requires a huge number of annotated segmentation ground truths to achieve good performance. The availability of consistent pixel-wise segmentation annotations is limited to a few popular datasets, which makes it difficult to use supervised DNNs in semantic segmentation tasks. In semi-supervised learning, unlabeled samples are used along with the labeled samples during training to improve the accuracy of supervised learning with limited labeled samples [76]. Figure 10 shows the basic structure of semi-supervised learning.

Fig. 10

Semi-supervised learning architecture. Figure has been reproduced from [77]

Hong et al. [73] proposed a decoupled DNN architecture using heterogeneous annotations, which is composed of two separately trained networks, one for classification and the other for segmentation. The classification and segmentation networks are decoupled through bridging layers with class-specific activation maps, which deliver critical information from the classification network to the segmentation network. The object labels associated with an input image are recognized by the classification network, while the figure-ground segmentation of each identified label is obtained by the segmentation network. The advantage of their model is that it uses pre-trained models for the classification network, and only the segmentation network and bridging layers are trained using a small amount of strongly annotated data.

Kim et al. [54] proposed a framework for semantic segmentation using a tied deconvolutional neural network with scale-invariant feature learning. The proposed framework has both convolutional layers and deconvolutional layers, where each deconvolution layer consists of unpooling and deconvolution using filter masks tied with those of the corresponding convolution layer. The restored features from all the deconvolution layers comprise a rich feature set, and the feature maps with the highest abstraction level, taken from the topmost layers, are used to reinforce the final feature map. All the feature maps are concatenated across the channel dimension, which covers a wide variety of features, while the fine details of the features are restored by the deconvolution layers. Class-specific activation maps are generated using convolutional layers, passed through softmax layers across the channel dimension and aggregated into a single vector to be compared with the image label vector.

Unsupervised Learning Models

Many authors have used unsupervised models in deep learning to overcome different challenges with deep neural networks, such as the requirement for huge labeled datasets, overfitting in supervised algorithms, and poor scalability of the target functions at hand. Layer-wise unsupervised learning can be incorporated into deep neural architectures to improve accuracy when the data are not properly labeled, annotated data are lacking, label annotations are weak, or the amount of training data is small. Unsupervised learning algorithms can extract salient information about the input distribution, revealing a representation that captures the statistical regularities of the layers. It mainly helps to reduce the dependency on the changing gradient update direction given by a supervised criterion. Hence, unsupervised learning is an approach to naturally decompose the problem into sub-problems associated with different levels of abstraction [57]. It can be used to train deep neural networks for semantic image segmentation. These algorithms can be used as part of supervised algorithms and trained to store information about the semantic classes. Unsupervised learning models can be used for semantic segmentation by utilizing some established models, which are discussed below.

Autoencoders (AE) and Stacked Autoencoders (SAE)

Autoencoders (AE) [78] are unsupervised learning models consisting of a neural network trained to reproduce its inputs at the output layer through hidden layers, as shown in Fig. 11. An autoencoder mainly consists of two parts: an encoder, which maps the input layer to a hidden layer, and a decoder, which reconstructs the input from the hidden layer. Originally they were used for feature learning and dimensionality reduction, but they are now also used as latent variable models for generative modeling. The main idea behind copying the input layer to the output layer through hidden layers is to learn useful representative features. The input data can be projected onto a smaller-dimensional subspace, which represents a dominant latent structure of the input and can be modeled to learn prominent features of the data distribution. Autoencoders come with many variants, such as the sparse autoencoder, convolutional autoencoder, variational autoencoder, contractive autoencoder, and denoising autoencoder. Stacked autoencoders (SAE) are built by arranging autoencoders on top of each other in multiple layers, where the output of each layer is the input to the successive layer. These layers are trained individually or in a greedy layer-wise fashion, and the full network is then fine-tuned using supervised training to make predictions.

Fig. 11

Autoencoder maps an input layer x to an output layer y through a hidden layer h [77]

Contractive autoencoders use nonlinear loss functions and encourage the model to have properties such as sparsity of the representation and noise robustness [79]. Chen et al. [80] used unsupervised learning with autoencoders for the classification of pulmonary nodules from lung CT images. They proposed a convolutional autoencoder neural network (CAENN) architecture for feature learning, which consists of an input layer, three convolution layers, three pooling layers and one fully connected layer. Denoising autoencoders are trained to reconstruct the input from a copy corrupted by some form of noise, minimizing the reconstruction error. Gondara [81] used denoising autoencoders for efficient denoising of medical images.
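A compact PyTorch sketch of a plain fully connected autoencoder of the kind described above is given below; the layer sizes, sigmoid output, MSE reconstruction loss and optimizer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, hidden_dim=64):
        super().__init__()
        # Encoder: maps the input x to a lower-dimensional hidden code h
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        # Decoder: reconstructs the input from the hidden code
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(32, 784)                      # a batch of flattened images (illustrative)
recon = model(x)
loss = nn.functional.mse_loss(recon, x)      # reconstruction error
loss.backward()
optimizer.step()
```

A denoising variant would feed a noise-corrupted copy of x to the model while still computing the reconstruction loss against the clean x.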

Restricted Boltzmann Machines (RBM)

Fig. 12

Graphical model of RBM [77]

Boltzmann machines [82] are energy-based models, which are generally represented by the distribution function

$$\begin{aligned} p(x)=\frac{1}{Z}\exp (-E(x)), \end{aligned}$$
(9)

where E(x) is the energy function and Z is a normalizing constant (the partition function). The energy function of the Boltzmann machine is given by

$$\begin{aligned} E(x)=-{x}^\intercal {w}{x}. \end{aligned}$$
(10)

Restricted Boltzmann machines (RBM) [82] are energy-based undirected graphical models consisting of a single layer of latent variables used to learn a representation of the input. They can be used to build deep graphical models that learn the internal representation, or the latent variables, of the deep model through efficient interaction between the layers. A simple graphical representation is shown in Fig. 12.

RBMs can be stacked to design deeper graphical models containing layers of observable variables and latent variables. The restricted connectivity between the layers makes it feasible to construct deeper models for efficient learning. RBMs have been extensively used in various areas of medical image analysis, such as image segmentation [83], feature learning [84], disease classification [85], mass detection in breast cancer [86], and brain lesion segmentation [84].

Deep Belief Networks (DBN)

A DBN [87] is a generative hybrid graphical model with multiple hidden layers and no intra-layer connections within the hidden layers. The probability distributions of all neurons can be propagated to the next layer to learn a representation of the input. A DBN combines both directed and undirected graphical models and hence can be viewed as a mixture of unsupervised networks, such as RBMs and autoencoders. Without supervision in the training stage, a DBN can learn the best features from the probability distribution of the inputs propagated through the hidden layers. The connections between the top two layers are undirected, and the network can therefore be used for classification problems.

Transfer Learning

Transfer learning is generally used in the sense that learned parameters from a trained model can be transferred to train a new model. It is a machine learning method where prior knowledge of the parameters of a working model is reused as the starting point for training a new, related model. It can be effective when a model must be trained with a small dataset and training from scratch is impractical [88].

Fig. 13

Transfer learning model used for otitis media (OM) detection [89]

In the deep learning scenario, a network trained on a large dataset can be used either as an initialization or as a fixed feature extractor for a new model to be trained on a smaller dataset. It has been shown that starting with pre-trained weights is more helpful than random initialization of weights, even with large datasets [88, 90]. However, its effectiveness is constrained by the size of the new dataset and its similarity to the dataset used to train the source model.
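The fixed-feature-extractor recipe described above can be sketched with a torchvision ResNet pre-trained on ImageNet: the backbone weights are frozen and only a new classification head is trained. The two-class head and the choice of ResNet-18 are assumptions for illustration, not the setup of any work cited here.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a network pre-trained on ImageNet and freeze its weights
# (newer torchvision versions use the `weights=` argument instead of `pretrained=True`)
model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a new head for the target task
num_classes = 2                                   # e.g., disease vs. normal (illustrative)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head's parameters are optimized
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
```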

Shie et al. [89] used transfer learning to overcome data scarcity and feature representation problems. They proposed a novel method for classifying otitis media (OM) images, which is shown in Fig. 13. They learned a codebook in an unsupervised manner by utilizing a CNN trained on the ImageNet dataset [91], then encoded OM images with the codebook to obtain a weighting vector for each image. They applied these feature vectors to a supervised learning system using SVM to train an OM classifier with 1195 labeled OM instances, achieving an accuracy of \(88.5\%\) for OM detection. Singh et al. [92] proposed a transfer learning-based method for concept detection and modality classification.

Generative Adversarial Networks (GAN)

A generative adversarial network (GAN) is a deep model, first proposed by Goodfellow et al. [67], which is used to generate new samples of data by learning the data distribution. It consists of two neural network models called the generator and the discriminator. The generator captures the data distribution and generates synthetic data, while the discriminator tries to distinguish real data from synthetic data. As a result of this competition, both models are updated: the generator produces increasingly realistic synthetic data, while the discriminator becomes better at identifying it. The generative model thus produces outputs that look like samples from the input distribution. The adversarial model is trained to optimally discriminate between samples from the empirical data distribution and samples from the deep generative model. The basic block diagram of a GAN is shown in Fig. 14.
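A skeletal GAN training step is sketched below, assuming a simple fully connected generator and discriminator and the standard binary cross-entropy objective; it is a minimal illustration of the generator/discriminator competition rather than any specific published model.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 32, 784                       # illustrative sizes
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.rand(64, data_dim)                      # a batch of real samples (placeholder)
z = torch.randn(64, latent_dim)
fake = G(z)                                          # generator produces synthetic data

# Discriminator step: label real samples 1 and generated samples 0
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
opt_D.zero_grad(); d_loss.backward(); opt_D.step()

# Generator step: try to make the discriminator label generated samples as real
g_loss = bce(D(fake), torch.ones(64, 1))
opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```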

Fig. 14

Basic block diagram of GAN

Luc et al. [93] trained a convolutional semantic segmentation network along with an adversarial network that discriminates between segmentation maps coming from the ground truth and those produced by the segmentation network. This setup can detect and correct higher-order inconsistencies between ground truth segmentation maps and the ones produced by the segmentation network. They evaluated the network on the Stanford Background dataset and the Pascal VOC 2012 dataset.

Instance-Aware Semantic Segmentation

Instance-aware semantic segmentation performs both classification and segmentation of object instances. It operates at the region level, and the same pixel may have different semantics in different regions. Generally, segmentation is based on segment proposals, and classification is based on region-based methods.

Li et al. [94] proposed a fully convolutional network for the instance-aware semantic segmentation task (FCIS), which exploits position-sensitive score maps to perform both detection and segmentation. The authors generate region-of-interest (ROI) proposals from the FCN feature maps with a region proposal network (RPN), which helps to produce pixel-wise score maps by assembling operations inside each ROI. Detection and segmentation are the two tasks performed for each pixel in an ROI, and two classifiers are trained separately for mask prediction and classification.

Multi-task network cascades (MNC) were proposed by Dai et al. [95] for instance-aware semantic segmentation. The network consists of three stages: differentiating instances (represented by class-agnostic bounding boxes), estimating masks (predicting a pixel-level mask for each instance), and categorizing objects (predicting the category of each instance). The stages share their convolutional features.

Standard Deep Neural Network Architectures

Several deep neural network architectures have been proposed in the last decade. Some of them gained popularity due to their enhanced performance in fields such as image classification, speech processing and robotics. These networks are becoming a standard choice for researchers to solve novel challenges. Some of the important deep neural network architectures are reviewed below.

AlexNet

AlexNet was a revolutionary deep CNN architecture that won the ILSVRC-2012 challenge [55] with an accuracy of \(84.6\%\), a significant lead over the entries based on traditional techniques, which achieved \(73.8\%\) accuracy in the same challenge. The architecture, proposed by Krizhevsky et al. [55], contains eight learned layers (five convolutional layers with max-pooling layers and three fully connected layers). ReLU non-linearity was applied to the output of every convolutional and fully connected layer. A block diagram of the model is shown in Fig. 15. The model input image size was \(224\times 224\times 3\), and the images were classified into 1000 output classes.

Fig. 15

AlexNet convolutional neural network architecture. Figure has been extracted from [55]

VGG16

VGG16 is a CNN model proposed by the Visual Geometry Group (Simonyan and Zisserman) [96]. The architecture consists of a stack of 13 convolutional layers, with small receptive fields of size \(3\times 3\), followed by 3 fully connected layers. The model achieved \(92.7\%\) top-five test accuracy in the ImageNet large-scale visual recognition challenge (ILSVRC) 2014 with the ImageNet dataset (consisting of more than 10 million annotated images in 1000 classes). The block diagram of the model is shown in Fig. 16.

Fig. 16

VGG16 convolutional neural network architecture. Figure has been extracted from [96]

GoogLeNet

The GoogLeNet architecture (a 22-layer DNN) was introduced by Szegedy et al. [97] and won the ILSVRC-2014 challenge with a top-five error of \(6.7\%\). It consists of stacked inception modules, which are convolutional neural network modules with multiple receptive field sizes for the convolution and pooling operations. Parallel filtering operations are applied within each layer and all filter outputs are concatenated, with \(1\times 1\) convolution operations used to reduce dimensionality. The block diagram of an inception module is shown in Fig. 17.

Fig. 17

Inception module with dimensionality reduction from the GoogLeNet architecture. Figure reproduced from [97]

ResNet

The ResNet architecture [98] was introduced by Microsoft and won the ILSVRC-2015 challenge with an accuracy of \(96.4\%\). It uses a residual learning framework to ease the training of very deep representations. The use of residual blocks with identity mappings helps in reducing the training errors caused by a large number of stacked layers. Figure 18 represents the building block of the residual learning architecture.

Fig. 18

Building block of residual learning. Figure extracted from [98]

ReNet

The ReNet architecture was presented by Visin et al. [99] and uses unidirectional RNNs. It uses four RNNs instead of convolutions, which sweep over the image patches in the horizontal and vertical directions. Composite feature maps are extracted from the intermediate hidden states by sweeping the RNNs both vertically and horizontally. Each subsequent layer operates on the representation extracted by the previous layer, ensuring location-specific operation. The output feature maps are stacked to create a deeper architecture while simultaneously capturing complex features. Figure 19 represents the basic structure of a one-layer ReNet architecture.

Fig. 19

One layer ReNet architecture. Figure extracted from [99]

U-Net

U-Net is an encoder-decoder architecture first proposed by Ronneberger et al. [100], which has been used to segment biomedical images; a submission based on U-Net won the international symposium on biomedical imaging (ISBI) cell tracking challenge in 2015. The network has a U-shaped architecture, which consists of two paths: a contracting path and a symmetric expanding path. The contracting path has a general CNN structure consisting of repeated convolutions, each followed by a rectified linear unit (ReLU) and a max-pooling operation. The expanding path, on the other hand, facilitates accurate localization of high-resolution features. In the contracting path the multi-channel feature space is enlarged and spatial information is reduced, while the expanding path uses a sequence of upsampling operations that allows the network to transmit context information to higher-resolution layers [100]. Figure 20 represents the basic structure of the proposed U-Net architecture. It has gained popularity for semantic segmentation of biomedical images.
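The essential U-Net ingredients, a double-convolution block, max-pooling on the contracting path, and upsampling followed by concatenation with the corresponding encoder feature map on the expanding path, are sketched in miniature below; the depth and channel counts are reduced far below those of the original architecture [100].

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by ReLU (padding keeps sizes aligned for the skip)
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.enc1 = double_conv(1, 16)                       # contracting path
        self.enc2 = double_conv(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)    # expanding path (upsampling)
        self.dec1 = double_conv(32, 16)                      # 32 = 16 (skip) + 16 (upsampled)
        self.out = nn.Conv2d(16, num_classes, 1)             # per-pixel class scores

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))  # skip connection
        return self.out(d1)

print(TinyUNet()(torch.randn(1, 1, 64, 64)).shape)           # torch.Size([1, 2, 64, 64])
```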

Fig. 20

U-Net architecture. Figure is extracted from [100]

SegNet

The SegNet architecture, proposed by Badrinarayanan et al. [101], consists of an encoder-decoder neural network used for semantic pixel-wise segmentation. The encoder part consists of 13 convolutional layers, which down-sample the input to low-resolution feature maps while preserving the high-level features. The decoder network is designed to up-sample the low-resolution encoder feature maps to high-resolution feature maps for pixel-wise classification. The decoder upsamples the low-resolution feature maps using the pooling indices computed in the corresponding encoder max-pooling steps, which eliminates the need to learn the upsampling. The upsampled maps are convolved with trainable kernels to yield dense feature maps and can be trained for pixel-wise classification. Figure 21 represents the basic structure of the proposed SegNet architecture.
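The pooling-index trick that distinguishes SegNet can be illustrated with PyTorch's max-pooling/unpooling pair: the encoder records the indices of the maxima and the decoder uses them to place values back at the same locations. The shapes and channel counts below are arbitrary.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 16, 16)                              # encoder feature map (illustrative)

pool = nn.MaxPool2d(kernel_size=2, return_indices=True)    # encoder: keep pooling indices
unpool = nn.MaxUnpool2d(kernel_size=2)                     # decoder: reuse those indices

pooled, indices = pool(x)                                  # (1, 8, 8, 8) plus index map
upsampled = unpool(pooled, indices)                        # sparse (1, 8, 16, 16) map
# In SegNet, `upsampled` would then be convolved with trainable kernels
# to produce dense feature maps for pixel-wise classification.
```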

Fig. 21

SegNet architecture. Figure has been extracted from [101]

PSPNet

Fig. 22

PSPNet architecture. Figure has been extracted from [102]

The pyramid scene parsing network (PSPNet) architecture was proposed by Zhao et al. [102] for pixel-level scene parsing, and it won the ILSVRC 2016 scene parsing challenge. It utilizes a pyramid pooling module instead of global pooling to collect context information from the feature maps computed by CNNs. The pyramid pooling uses four different pyramid scales to separate the feature maps into regions with pooled representations. Bilinear interpolation is used to upsample the low-dimensional feature maps to the size of the original feature maps. The final prediction maps are generated by concatenating the different-scale feature maps, followed by a convolution layer. Figure 22 represents an overview of the proposed PSPNet architecture.

Datasets

In recent years, many large datasets have been created by the computer vision community with the emergence of deep learning models, as these models require a large number of data samples to train well. Here, some publicly available and widely used medical image datasets are mentioned. An overview of the datasets is given in Table 2.

Table 2 An overview of publicly available medical image datasets with semantic segmentation ground truth

Medical Image Databases for Semantic Segmentation

The DIARETDB1 [103] dataset contains 89 retinal fundus images, which can be used to detect diabetic retinopathy. The images are annotated with four classes: hard exudates, soft exudates, hemorrhages and red small dots.

The IDRiD dataset [104] is used for detecting diabetic retinopathy (DR) and diabetic macular edema. It provides information regarding the disease severity level of diabetic retinopathy and diabetic macular edema. The dataset consists of 81 color fundus images with pixel-level annotation of abnormalities associated with DR, such as microaneurysms (MA), soft exudates (SE), hard exudates (HE) and hemorrhages. The images are stored in JPEG format with a resolution of \(4288\times 2848\) pixels.

The open-CAS endoscopic dataset [107] consists of 60 images taken from laparoscopic adrenalectomies and another 60 images taken from laparoscopic pancreatic resections. This dataset can be used for instrument segmentation in laparoscopic images. Semantic segmentation of these images helps surgeons obtain sensory information about surgical procedures.

The Warwick-QU dataset consists of 165 images of colorectal cancer glands with pixel-level annotation of 5 classes [108]: healthy, adenomatous, moderately differentiated, moderately-to-poorly differentiated, and poorly differentiated. Semantic segmentation of these images helps medical practitioners to diagnose the cancer cells more accurately.

The Fluo-N2DL-HeLa dataset contains frame sequences of cultured fluorescent HeLa cells, used in the ISBI cell tracking challenge [110]. Across all frames, the ground truth contains markers for all 34060 cells and segmentation masks for 874 cells in four frames. This dataset is helpful for diagnosis of cell characteristics, and semantic segmentation can be used to tackle challenges such as many cell clusters, frequent cell divisions, low contrast and variation in cell sizes.

The PhC-HeLa dataset [111] contains 22 phase contrast microscopic images of cervical cancer colonies of HeLa cells, with ground truth markers for 2228 cells. This dataset is helpful for diagnosis of cell characteristics by semantically segmenting the cells with respect to the ground truth, so that high variation in cell shapes and sizes, missing cell boundaries, and high cell-density regions can be accurately classified.

The Hist-BM dataset [112] consists of 11 hematoxylin and eosin stained microscopic images of human bone marrow, with ground truth containing markers for all 4202 cell nuclei and unclear regions. This dataset can be used for semantic segmentation of cell nuclei and ambiguous regions and can be helpful for diagnosis of cell characteristics in microscopy image analysis.

The NIH Chest X-ray dataset [113] consists of over 100,000 de-identified chest X-ray images with fourteen text-mined disease labels from 30,805 unique patients. The images are in PNG format. The data is provided by the NIH Clinical Center and is available on the NIH site. The fourteen common thoracic pathologies are atelectasis, consolidation, infiltration, pneumothorax, edema, emphysema, fibrosis, effusion, pneumonia, pleural thickening, cardiomegaly, nodule, mass and hernia. To create these labels, the authors used natural language processing to text-mine disease classifications from the associated radiological reports. The dataset can be used for semantic segmentation of common thorax diseases.

The CQ-500 dataset [117] consists of 491 non-contrast head CT scans with 193,317 slices, provided by the Centre for Advanced Research in Imaging, Neurosciences and Genomics, New Delhi, India. The dataset was used to detect intracranial hemorrhage (ICH) and its types (intraparenchymal hemorrhage (IPH), intraventricular hemorrhage (IVH), subdural hemorrhage (SDH), extradural hemorrhage (EDH) and subarachnoid hemorrhage (SAH)), as well as calvarial fractures, midline shift and mass effect in head CT scans. The dataset can be used for semantic segmentation of intracranial hemorrhage and its types.

The MIMIC Chest X-ray (MIMIC-CXR) dataset [114] is a large publicly available dataset of chest radiographs with free-text radiology reports. The dataset consists of 377,110 chest X-rays associated with 227,827 imaging studies sourced from the Beth Israel Deaconess Medical Center between 2011 and 2016. Images are provided with 14 labels derived from two natural language processing tools applied to the corresponding free-text radiology reports. This dataset can be utilized for segmenting various lung-related pathologies, such as enlarged cardiomediastinum, cardiomegaly, lung lesion, lung opacity, pneumonia and other abnormalities.

The Gland Segmentation in Colon Histology Images challenge contest (GlaS) was held at MICCAI 2015 [109]. The dataset used in this challenge consists of 165 images derived from 16 H&E stained histological sections of stage T3 or T4 colorectal adenocarcinoma, and can be utilized for segmenting the extent of malignancy in histology images. The images in the dataset provide tissue architecture of two classes, benign and malignant histologic grades. The images can be semantically segmented into challenging features such as small glands, the sub-mucosa layer, areas with dense nuclei in the mucosa layer and the lumen of the gastrointestinal tract.

The CATARACTS dataset (CaDIS) [118] is used for semantic segmentation of cataract surgery. It consists of 25 videos of surgical procedures recorded at 30 frames per second, each with a duration of 10 min and 56 s. The dataset contains 29 surgical instrument classes, 4 anatomy classes and 3 miscellaneous classes and is used for identification and localization of surgical instruments and anatomical structures through semantic segmentation.

The Lung Image Database Consortium image collection (LIDC-IDRI) consists of diagnostic and lung cancer screening thoracic computed tomography (CT) scans with marked-up annotated lesions. The dataset contains 1018 low-dose lung CT scans from 1010 patients. It is used for evaluation of CAD methods for lung cancer detection and diagnosis, and semantic segmentation can be applied to the CT scans for segmenting lung cancer.

Hardware and Software

The rise of deep learning methods is principally due to the availability of large databases, the wide availability of open-source software packages, and the increasing availability of consumer hardware, such as graphics processing units (GPUs) and GPU-computing libraries (e.g., CUDA, OpenCL).

GPUs are designed for fast, parallel processing of images in a frame buffer intended for output to a display device. Their parallel computing structure makes them more efficient for training massively parallelizable deep learning models.

The widely available open-source software packages and libraries, developed based on recent research in deep learning, are boosting the efficient use of deep learning methods in the computer vision field. These libraries provide economical and efficient GPU implementations for the processing of large data volumes with acceptable processing times. Some popular packages are listed below.

  1. MATLAB: It offers specialised toolboxes for machine learning, neural networks and computer vision.

  2. Caffe [62]: A deep learning framework developed by Berkeley AI Research (BAIR), providing C++ and Python interfaces.

  3. Tensorflow [120]: An open-source math library, aimed mainly at the implementation of deep learning models, providing C++ and Python interfaces. It was developed by Google's AI research group.

  4. Theano [121]: It provides a Python interface designed to handle large neural network algorithms, developed by the MILA lab in Montreal.

  5. Torch [122]: A scientific computing framework for machine learning with a Lua interface; its Python-based successor, PyTorch, was developed by Facebook's AI research group.

  6. Keras [123]: An open-source neural network library written in Python, developed with a focus on enabling fast experimentation, supporting both convolutional and recurrent networks.

  7. MXNet [124]: An Apache Software Foundation framework used to train and deploy deep neural networks.

  8. Cognitive Toolkit (CNTK) [125]: A framework developed by Microsoft, which offers a Python API over C++ code and operates under the MIT license.

Evaluation Measures for Semantic Segmentation

Different performance measures are adopted by computer vision researchers for semantic segmentation and scene parsing evaluation. Long et al. [60] used pixel accuracy, mean accuracy, mean intersection over union, and frequency weighted intersection over union as performance measures for image segmentation, defined as follows:

$$\begin{aligned}&\text {Pixel accuracy}\nonumber \\&\quad =\frac{{\sum \nolimits _i {{n_{ii}}} }}{{\sum \nolimits _i {{t_i}} }}, \end{aligned}$$
(11)
$$\begin{aligned}&\text {Mean accuracy}\nonumber \\&\quad =\frac{1}{{n_{cl}}}\sum \nolimits _i\frac{n_{ii}}{t_i}, \end{aligned}$$
(12)
$$\begin{aligned}&\text {Mean intersection over union}\nonumber \\&\quad =\frac{1}{{n_{cl}}}\sum \limits _i\frac{n_{ii}}{t_i+\sum \nolimits _j{{n_{ji}}}-{n_{ii}}}, \end{aligned}$$
(13)
$$\begin{aligned}&\text {Frequency weighted intersection over union}\nonumber \\&\quad =\frac{1}{{\sum \nolimits _k {{t_k}}}}\sum \limits _i\frac{{t_i}{n_{ii}}}{t_i+\sum \nolimits _j{{n_{ji}}}-{n_{ii}}}, \end{aligned}$$
(14)

where \(n_{ij}\) is the number of pixels of class i predicted to belong to class j, \({n_{cl}}\) is the number of different classes, and \({t_i} = \sum \nolimits _j{{n_{ij}}}\) is the total number of pixels of class i.
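Assuming predictions and ground truth are given as integer label maps, the four measures above can be computed from a class confusion matrix as in the following sketch.

```python
import numpy as np

def segmentation_metrics(pred, gt, n_cl):
    """pred, gt: integer label maps of the same shape; n_cl: number of classes."""
    # n[i, j]: number of pixels of class i predicted to belong to class j
    n = np.bincount(gt.ravel() * n_cl + pred.ravel(), minlength=n_cl * n_cl)
    n = n.reshape(n_cl, n_cl).astype(float)

    t = n.sum(axis=1)                                  # t_i: total pixels of class i
    n_ii = np.diag(n)
    iou = n_ii / (t + n.sum(axis=0) - n_ii)            # per-class intersection over union

    pixel_acc = n_ii.sum() / t.sum()                   # Eq. (11)
    mean_acc = np.nanmean(n_ii / t)                    # Eq. (12); absent classes give NaN and are ignored
    mean_iou = np.nanmean(iou)                         # Eq. (13)
    fw_iou = (t * iou).sum() / t.sum()                 # Eq. (14): frequency weighted IoU
    return pixel_acc, mean_acc, mean_iou, fw_iou
```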

Sharma et al. [126] proposed the recursive context propagation network (RCPN) for semantic segmentation and used four standard evaluation metrics: per pixel accuracy (PPA), mean class accuracy (MCA), intersection over union (IoU) and time per image (TPI).

  1. PPA is the ratio of the correctly classified pixels to the total pixels in the test image.

  2. MCA is the mean of the category-wise pixel accuracy.

  3. IoU is the ratio of true positives to the sum of true positives, false positives and false negatives, averaged over all classes.

  4. TPI is the time required to label an image on GPU and CPU.

Roth et al. [127] compute the Dice similarity coefficient to measure the amount of agreement between two binary regions. To predict multiple classes for segmentation, they used a total loss function

$$\begin{aligned} K_{\text {total}}=\frac{1}{K}\sum \limits _k{{w_k}{L_k}}, \end{aligned}$$
(15)

where K is the number of classes (number of foreground classes and background) and \({w_k}\) is a weight factor that can influence the contribution of each label class k. \(L_k\) is the loss function for each class k, which is given by

$$\begin{aligned} {L_k} = \frac{{2\sum \nolimits _i^N {{p_i}{r_i}} }}{{\sum \nolimits _i^N {{p_i} + \sum \nolimits _i^N {{r_i}} } }} \end{aligned}$$
(16)

where \(p_i\) is the value of the probability map and \(r_i\) the corresponding ground truth at voxel i, out of the N voxels in the current image volume.
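
A minimal NumPy sketch of Eqs. (15) and (16) is given below; the function names and the small smoothing constant `eps` (added to avoid division by zero) are our own and are not part of [127].

```python
import numpy as np

def dice_per_class(p, r, eps=1e-8):
    """Eq. (16): Dice agreement between a probability map p and the binary
    ground truth r over all voxels of the current image volume."""
    p, r = p.ravel(), r.ravel()
    return 2.0 * np.sum(p * r) / (np.sum(p) + np.sum(r) + eps)

def total_dice_objective(prob_maps, gt_maps, weights):
    """Eq. (15): weighted average over the K classes of the per-class terms L_k."""
    K = len(prob_maps)
    return sum(w * dice_per_class(p, r)
               for p, r, w in zip(prob_maps, gt_maps, weights)) / K
```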

Deep Learning for Semantic Segmentation of Medical Images

Applications of deep learning to medical image analysis are now growing rapidly. Owing to its promising results, deep learning is already being used for different tasks, such as image segmentation [61], object detection [10, 128] and localization [22, 106]. The work below is categorized on the basis of imaging modalities, organs of interest and architecture. The most commonly used architectures are fully convolutional networks (FCN), U-Net, RNNs and GANs.

Fully Convolutional Networks in Semantic Segmentation of Medical Images

The fully convolutional network plays an important role in medical image segmentation due to its ability to make dense predictions with a pixel-wise loss, predicting a class for every pixel at once. This network has been used by researchers for both 2D and 3D images, as discussed in the following subsections.

2D Images

Long et al. [60] used FCNs for semantic segmentation of general scene images, converting ImageNet classification networks by replacing the final fully connected layers with convolutional layers; a skip architecture was later added by Shelhamer et al. [65] to fine-tune the network. Cui et al. [106] segmented the intra-tumor structure of brain tumors using a cascaded CNN on MRI data. They proposed two subnetworks: a tumor localization network (TLN) and an intra-tumor classification network (ITCN). The TLN is a fully convolutional network used to segment the tumor region from an MRI slice, whereas the ITCN labels the detected tumor region into multiple subregions. Akram et al. [23] applied FCNs in microscopic image analysis for cell detection, segmentation and tracking. They proposed cell bounding boxes using a fully convolutional neural network (FCNN) and used a second CNN to predict a segmentation mask for each proposed bounding box. They used eight convolutional layers for detecting bounding boxes and adaptive maxpooling inside each bounding box to extract fixed-size feature maps. With this network they were able to predict the regions belonging to the cell to be segmented and to accurately localize cell boundaries. Tajbakhsh et al. [129] used fine-tuned CNNs instead of deep CNNs trained from scratch for polyp detection and pulmonary embolism detection, evaluated with a free-response receiver operating characteristic (FROC) analysis. Zhou et al. [130] used a weighted FCN with the focal loss [131] to segment small objects as foreground. They emphasized training on wrongly segmented pixels to decrease the number of false positives arising from the uneven distribution of pixels in medical images.
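
For reference, a minimal NumPy sketch of the focal loss of [131] for a binary foreground/background map is shown below; the default values gamma = 2 and alpha = 0.25 follow [131], while the exact weighting used in [130] may differ.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-8):
    """Focal loss for foreground probabilities p and binary labels y:
    easy, well-classified pixels are down-weighted so that the rare
    foreground pixels dominate the gradient."""
    p_t = np.where(y == 1, p, 1.0 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class balancing weight
    return np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps))
```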

FCNNs face several challenges in the semantic segmentation of biomedical images. Because of the variable size of organs in biomedical images and the fixed receptive field of an FCNN, the same architecture is often unable to produce satisfactory output for multi-organ segmentation. Many authors use multiscale or cascaded FCNs to overcome this. Zhou et al. [132] proposed a voxel-wise multiple-class classification scheme for automatically assigning labels to each pixel/voxel in a 2D/3D CT image.

3D Images

After the successful application of FCNs to image segmentation, researchers turned to 3D FCNs, where both the input image and the kernels are three dimensional. Roth et al. [77] used the 3D U-Net architecture [133] for multi-organ semantic segmentation in CT images. The network has two paths: an analysis path and a synthesis path. The analysis path performs downsampling; each stage contains two convolutional layers, each followed by a ReLU activation, and a maxpooling layer. In the synthesis path, transposed convolutions are employed to convert the lower-resolution feature maps within the network to the higher-resolution space of the input images. The network also utilizes a skip architecture to provide higher-resolution features to the synthesis path. The final convolutional layer uses a softmax activation function to compute a 3D probability map for each of the target organs as the output of the network. Figure 23 shows an axial view of the multi-organ segmentation in CT, including the pancreas.
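
The analysis/synthesis structure can be illustrated with the following PyTorch sketch of one downsampling and one upsampling stage; the layer sizes and module names are illustrative assumptions, not the exact configuration of [77] or [133].

```python
import torch
import torch.nn as nn

class AnalysisBlock(nn.Module):
    """One downsampling stage: two 3x3x3 convolutions with ReLU, then maxpooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool3d(2)

    def forward(self, x):
        skip = self.convs(x)   # kept for the skip connection to the synthesis path
        return self.pool(skip), skip

class SynthesisBlock(nn.Module):
    """One upsampling stage: transposed convolution, concatenation with the
    skip features from the analysis path, then two convolutions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_ch, out_ch, 2, stride=2)
        self.convs = nn.Sequential(
            nn.Conv3d(2 * out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x, skip):
        x = torch.cat([self.up(x), skip], dim=1)
        return self.convs(x)
```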

Fig. 23
figure 23

Multi-organ segmentation in CT (axial). Figure reproduced from [77]

Ahn et al. [134] proposed a 3D convolutional neural network architecture called SqueezeNet3D for detecting lung cancer in CT scans. In the Kaggle Data Science Bowl 2017 competition, the second-place winner, Daniel Hammack [135], designed a 3D CNN for lung cancer classification. In this case a typical CT scan is about \(512 \times 512\times 400\) volumetric pixels, whereas the region of interest is generally on the order of \(1\;cm^{3}\). He used 3D CNNs to predict nodule attributes and forecast a cancer diagnosis, with 17 3D CNN models trained on the LUNA16 dataset, each consisting of five 3D convolutional blocks followed by global maxpooling and a non-negative regression layer with a softplus activation. Kamnitsas et al. [21] proposed a 3D FCN for semantic segmentation of brain lesions. They employed a dual-pathway architecture that processes the input images at multiple scales simultaneously. In post-processing they used a 3D fully connected conditional random field to remove false positives.

U-Net in Semantic Segmentation of Medical Images

For biomedical image segmentation, however, U-Net is given high priority, and different modified structures have been proposed on the U-Net backbone. U-Net uses skip connections between the stages of the network to regain the contextual information lost through deep convolutional layers and pooling operations, and it concatenates low-level features with high-level features for a better learned representation. Many authors have used modified U-Net architectures for multi-class segmentation of biomedical images [136,137,138]. Different U-Net-based methodologies are shown in Table 3.

Table 3 Overview of papers using U-Net backbone for semantic segmentation tasks in biomedical images

3D U-Net in Semantic Segmentation of Medical Images

As the majority of imaging modalities are volumetric, authors have developed 3D U-Net-based models to capture the richer spatial information of volumetric images. Çiçek et al. [133] proposed a 3D counterpart of the U-Net architecture for volumetric semantic segmentation of Xenopus kidney embryos at the Nieuwkoop–Faber stage. The network consists of analysis and synthesis paths, as in the 2D U-Net, but in 3D form. The input to the network is a \(132 \times 132 \times 116\) voxel tile of the image with three channels, and the output of the final layer is \(44 \times 44 \times 28\) voxels in the x, y and z directions. The network output and the ground-truth labels are compared using a softmax with weighted cross-entropy loss. They used intersection over union (IoU) as the performance metric and reported an average IoU of 0.863.
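
The weighted cross-entropy comparison can be sketched as follows; this NumPy version, with its per-class weights and normalization by the total weight, is an illustrative assumption rather than the exact implementation of [133].

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights, eps=1e-8):
    """Voxel-wise weighted cross-entropy between softmax outputs `probs`
    (shape: classes x voxels) and integer ground-truth `labels` (voxels,).
    Larger class weights emphasise under-represented or sparsely
    annotated classes."""
    picked = probs[labels, np.arange(labels.size)]  # probability assigned to the true class
    w = class_weights[labels]                       # per-voxel weight from the class weight table
    return -np.sum(w * np.log(picked + eps)) / np.sum(w)
```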

The main issue with these 3D models is memory limitation. As the number of voxels increases, so does the number of network parameters, and sophisticated, expensive hardware is required. The limited voxel budget also constrains the resolution of the output. This can be alleviated by dividing the input volume into multiple slices or sub-volumes and using them for training and testing.

CNN with Residual Networks in Semantic Segmentation of Medical Images

CNN architectures with residual modules have been shown to preserve richer and more discriminative feature information, which is otherwise lost as the depth of the network increases. Yu et al. [141] used a residual network with a CNN for automated melanoma recognition from dermoscopy images. They proposed a fully convolutional residual network (FCRN) for accurate skin lesion segmentation; the network uses 16 residual blocks in the downsampling path and prediction maps at three different strides for upsampling. Chen et al. [142] applied this residual learning concept to volumetric data and proposed the VoxResNet architecture for volumetric semantic segmentation of 3D brain MRI images. The VoxResNet architecture consists of stacked residual modules with a total of 25 volumetric convolutional layers and 4 deconvolutional layers. It can generate more representative features for dealing with the large variations of brain tissue than fully connected layers, and it combines low-level image appearance features, implicit shape information and high-level context for semantic segmentation.
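
A minimal volumetric residual module in the spirit of [142] is sketched below; the channel sizes and layer ordering are illustrative, not the exact VoxResNet configuration.

```python
import torch.nn as nn

class VoxResModule(nn.Module):
    """Pre-activation residual module: the block learns a residual F(x)
    that is added back to its input, so deeper stacks keep access to
    lower-level feature information."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)  # identity shortcut
```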

RNN in Semantic Segmentation of Medical Images

RNN architectures use a recurrence relation between the current and previous steps through feedback loops, which empowers them to handle arbitrary input/output lengths and to memorize patterns from previous steps. RNNs can be combined with convolutional layers to capture the variations in a pixel neighborhood. One successful variant of the RNN is the long short-term memory (LSTM) network, which is capable of learning long-term dependencies and addresses the vanishing gradient problem that generally occurs in simple RNNs. In the biomedical segmentation field, many authors have applied LSTM variants such as the bidirectional LSTM, the convolutional LSTM (CLSTM) and the gated recurrent unit (GRU) [143] to segment multi-modal biomedical images.

Chen et al. [144] proposed a deep network for 3D image segmentation based on the combination of an FCN and an RNN, the bidirectional convolutional LSTM (BDC-LSTM), which together exploit both intra-slice and inter-slice spatial dependencies. The FCN extracts and compresses the hierarchy of intra-slice contexts into feature maps, and the BDC-LSTM extracts the 3D context from the resulting sequence of 2D contexts. Stollenga et al. [145] proposed the pyramidal multi-dimensional LSTM (PyraMiD-LSTM) network for segmentation of biological volumetric images, which employs six generalized convolutional LSTM (CLSTM) networks to build up the 3D context. This pyramidal structure is easier to parallelize and needs fewer computations than a multi-dimensional LSTM. Poudel et al. [146] proposed a recurrent FCN (RFCN) for automatic left ventricle segmentation from short-axis MR images using the MICCAI 2009 challenge dataset [147].
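
The FCN-plus-RNN idea can be illustrated with the simplified PyTorch sketch below, in which each slice is encoded to a single feature vector and a bidirectional LSTM propagates context along the slice axis; BDC-LSTM [144] and PyraMiD-LSTM [145] instead use convolutional LSTM units that retain the full spatial feature maps, so this is only a toy approximation.

```python
import torch
import torch.nn as nn

class SliceRNN(nn.Module):
    """Encode each 2D slice independently, then let a bidirectional LSTM
    exchange inter-slice context along the z axis."""
    def __init__(self, feat=64, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1))            # one feature vector per slice
        self.rnn = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)

    def forward(self, volume):                   # volume: (batch, slices, H, W)
        b, s, h, w = volume.shape
        x = self.encoder(volume.reshape(b * s, 1, h, w)).reshape(b, s, -1)
        context, _ = self.rnn(x)                 # (batch, slices, 2 * hidden)
        return context
```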

Weakly-Supervised or Semi-supervised Learning Models in Semantic Segmentation of Medical Images

To overcome the constraints of supervised learning, mainly the shortage of medical imaging databases with pixel-level semantic segmentation ground truth, many authors have developed weakly-supervised or semi-supervised learning models for semantic segmentation of biomedical images, such as autoencoders, restricted Boltzmann machines and deep belief networks.

Chen et al. [80] used unsupervised learning with autoencoders for the classification of pulmonary nodules from lung CT images. They proposed a convolutional autoencoder neural network (CAENN) architecture for feature learning, which consists of an input layer, three convolutional layers, three pooling layers and one fully connected layer; 50,000 unlabeled \(64\times 64\) patches were used for unsupervised training and 3662 labeled \(64 \times 64\) patches were subsequently used for classification. Xinyang et al. [148] proposed a weakly-supervised framework for segmentation of pulmonary nodules on lung computed tomography (CT) scans. The network consists of two stages: a training stage and a segmentation stage. In the training stage, a CNN model is used to generate nodule activation maps (NAMs) for nodule localization; in the second stage, coarse segmentations of nodule candidates are generated using a multi-GAP CNN with residual connections. Several other works in the literature have used unsupervised models for semantic segmentation. Gu et al. [149] proposed a context encoder network (CE-Net) for 2D medical image segmentation. The network consists of three modules: a feature encoder module, a context extractor and a feature decoder module. They used a dense atrous convolution (DAC) block and a residual multi-kernel pooling (RMP) block; these two blocks are integrated with the encoder–decoder structure to capture high-level features and preserve more spatial information. The method was applied to different medical images, such as fundus images, for segmenting the optic disc, blood vessels and retinal OCT layers. Oktay et al. [150] proposed an anatomically constrained neural network model for pathological classification of cardiac MR and ultrasound images. It incorporates anatomical prior knowledge into CNNs and learns the representation of the underlying anatomy through a stacked convolutional autoencoder. Varghese et al. [151] used stacked denoising autoencoders (SDAEs) for brain lesion detection, segmentation and false-positive reduction. They trained the SDAEs on a large amount of unlabeled data and fine-tuned them with labeled patches drawn from a limited number of patients.
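
A small convolutional autoencoder for unsupervised feature learning on 64×64 patches is sketched below; the layer counts differ from the CAENN of [80] and are chosen only for illustration.

```python
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Encoder compresses a 64x64 patch, decoder reconstructs it; the
    reconstruction error is the unsupervised training signal and the
    encoder features can later be reused for classification."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),   # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True))  # 32 -> 16
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(inplace=True),    # 16 -> 32
            nn.ConvTranspose2d(16, 1, 2, stride=2), nn.Sigmoid())              # 32 -> 64

    def forward(self, x):
        return self.decoder(self.encoder(x))
```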

Transfer Learning in Semantic Segmentation of Medical Images

The idea behind transfer learning is to use the parameters of a trained model to initialize a new model, which is then fine-tuned and can perform better than a model trained from random initialization. For pixel-wise segmentation, the new network can be pretrained on a general scene database or a biomedical image database. To capture the multi-modal character of biomedical images, it is better to fine-tune the network on a related domain; transferring weights from a related domain has proven better than random initialization [152]. Tajbakhsh et al. [129] used transfer learning to fine-tune an AlexNet [55] network for polyp detection and pulmonary embolism detection. With weights transferred from another domain, they achieved 25% higher sensitivity than a CNN trained from scratch. Hariharan et al. [153] used the CaffeNet [62] architecture, built on the ImageNet database, for ultrasound kidney detection. They initialized the network with the CaffeNet weights and updated the entire network by training on kidney image samples, achieving a 4% increase in average Dice overlap over the baseline method.

To overcome the difficulty of data acquisition and annotation in 3D medical imaging, Chen et al. [154] used a heterogeneous 3D network called Med3D that contains a series of pretrained models. They merged eight medical databases to create the 3DSeg-8 dataset, which was used to pretrain the Med3D network with a ResNet [98] backbone. For the lung segmentation task they achieved an accuracy of 93.82%, much higher than baseline methods. These results favor the use of a transfer learning approach over full training of a CNN from scratch. Many factors have to be considered when adopting transfer learning for biomedical images, such as the modality domain and the appearance and size of the organs.
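
A minimal fine-tuning recipe, sketched with PyTorch/torchvision (recent versions), is shown below; the backbone, the two-class head and the freezing policy are illustrative assumptions and do not reproduce the Caffe-based setup of [153] or Med3D [154].

```python
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained weights, freeze the backbone, and attach a new
# task-specific head; only the head (and, optionally, later-unfrozen blocks)
# is trained on the medical data.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False                  # freeze the pretrained backbone
model.fc = nn.Linear(model.fc.in_features, 2)    # new head, e.g. organ vs. background
```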

GAN in Semantic Segmentation of Medical Images

GANs are used to generate new synthetic data by learning the underlying data distribution. Xue et al. [155] proposed an adversarial network called SegAN for the task of medical image segmentation. They used a fully convolutional network as the segmentor to generate segmentation label maps and proposed a novel adversarial critic network with a multiscale L1 loss function, forcing the critic and segmentor to learn both global and local features that capture long- and short-range spatial relationships between pixels. Rezaei et al. [156] proposed a recurrent generative adversarial network for medical image semantic segmentation. They used a U-Net with skip connections as the recurrent generator G to produce segmentation maps and a recurrent discriminator as the classifier; a bidirectional LSTM was the recurrent architecture selected for both the discriminator and the generator. They used a mixed categorical loss function for training the semantic segmentation task, which helps to counter the effect of imbalanced training data (Tables 4, 5, 6).
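
The multiscale L1 objective can be summarised by the small sketch below, which averages the absolute differences between the critic's feature maps for the predicted-mask-weighted input and for the ground-truth-masked input; this is a simplified reading of [155], not its exact loss.

```python
import torch

def multiscale_l1_loss(critic_feats_pred, critic_feats_gt):
    """Mean absolute difference between critic feature maps at every scale,
    averaged over scales; the segmentor minimises it, the critic maximises it."""
    losses = [torch.mean(torch.abs(fp - fg))
              for fp, fg in zip(critic_feats_pred, critic_feats_gt)]
    return sum(losses) / len(losses)
```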

Table 4 Overview of papers using deep learning for various image analysis and semantic segmentation tasks
Table 5 Overview of papers using deep learning for various image analysis and semantic segmentation tasks
Table 6 Overview of papers using deep learning for various image analysis and semantic segmentation tasks using unsupervised models

Discussion

As semantic segmentation is well suited to handling the variable distribution of pixels among semantic classes, it is increasingly being used for related image analysis applications. In medical image segmentation, deep learning-based techniques are overtaking the traditional feature-based methods in usage. Therefore, this paper focuses on reviewing image segmentation techniques based on deep neural networks. These techniques automatically learn the best features, unlike traditional methods, where features such as pixel color [34], histogram of oriented gradients (HOG) [35], scale-invariant feature transform (SIFT) [36], local binary patterns (LBP) [37] and sub-pixel corners [38] are handcrafted. Furthermore, accurate segmentation leads to better object detection and classification.

However, deep learning-based methods come with many challenges. Some of the related issues are discussed in the following subsections.

Challenges with the Existing Databases

Generally, all supervised deep learning algorithms require annotated datasets with ground truths. Datasets are tuned to a specific context, and how to achieve generality is still unclear. Collecting labeled datasets with dense annotations is also expensive and time consuming. Another important issue is the requirement for expensive, high-performance computing hardware; in particular, a large number of graphics processing units is required to reduce the training time of neural networks with dense convolutional layers. For medical images there is still a lack of databases with a large number of annotated ground truths, and more images would improve the performance of deep neural networks for automated multi-organ segmentation in medical imaging. This lack hampers the application of supervised deep learning algorithms to medical databases compared with general databases with semantic ground truth. Data augmentation techniques are generally incorporated to increase the size of the databases; these mainly consist of affine transformations (rotation, flipping, scaling, etc.). Experiments have shown that such augmentation can improve performance by several percentage points [179]. Recently, many authors have proposed GANs for increasing the number of images in the datasets [180].
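
As a simple illustration, affine augmentations of the kind mentioned above can be composed with torchvision as follows; the specific ranges are arbitrary, and for segmentation the identical transform must also be applied to the label mask.

```python
from torchvision import transforms

# Rotation, flipping and scaling to enlarge a small annotated training set.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomAffine(degrees=15, scale=(0.9, 1.1)),
])
```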

Challenges in the Deep Learning Architectures

Different deep learning structures have their own characteristics and are largely database dependent. For general scene parsing, starting with AlexNet, various backbone structures have been proposed (VGG16, GoogLeNet, ResNet, ReNet, SegNet, etc.). AlexNet can perform object detection and classification on large datasets such as ImageNet; however, any small change in its convolutional layers significantly degrades its performance. VGG16 was one of the best-performing architectures in the ILSVRC 2014 challenge, but it is very slow to train, and its ImageNet-trained weights occupy 528 MB, so it requires more disk space and bandwidth, which makes it unwieldy for large databases. SegNet is a deep encoder-decoder architecture used for semantic segmentation of general scene datasets and for applications such as autonomous driving and scene understanding. However, the maxpooling and upsampling in the encoder-decoder architecture significantly reduce the feature map resolution, which results in low output resolution. To overcome this issue, Zhao et al. [102] proposed a pyramid pooling module instead of global pooling to preserve context information within a CNN.

For biomedical image segmentation, as discussed above, U-Net and its modified variants [136,137,138] remain the preferred backbones, with skip connections that recover contextual information lost through deep convolutional and pooling layers and that concatenate low-level with high-level features (see Table 3). Despite their success, these models suffer from depth-optimization issues and unnecessary complexity due to the fusion of skip connections; these limitations may be overcome by modifying the structure and redesigning the skip connections.

Challenges with Training, Testing and Hyper-parameter Selection

Training a deep model with semantic segmentation ground truth is tricky, as it involves optimizing different parameters related to training, testing and hyper-parameter selection. Training a network on a large database with many cascaded stages requires a long training time; reducing the training time and achieving faster convergence are key issues for CNNs. These can be addressed by reducing the dimensionality of the parameters. Pooling layers are generally used to reduce the dimensions of the feature maps; common variants are maxpooling, average pooling and adaptive pooling. Many authors also use convolutions with larger strides to lighten the network and minimize the parameters, but this causes information loss. Batch normalization is used to reduce internal covariate shift and provides faster convergence: the mini-batch output is centered by subtracting the mini-batch mean and normalized by the mini-batch standard deviation, followed by a learned scale and shift. Batch normalization is the preferred approach to improve network convergence, as it is not reported to have negative effects on performance, while pooling and downsampling techniques can lose beneficial information.
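
For reference, the standard batch normalization transform of an activation \(x_i\) in a mini-batch with mean \(\mu _B\) and variance \(\sigma _B^2\) is

$$\begin{aligned} \hat{x}_i = \frac{x_i - \mu _B}{\sqrt{\sigma _B^2 + \epsilon }}, \qquad y_i = \gamma \hat{x}_i + \beta , \end{aligned}$$

where \(\epsilon \) is a small constant for numerical stability and \(\gamma \), \(\beta \) are learned scale and shift parameters.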

Two other important issues are over-fitting and the vanishing gradient problem. Over-fitting occurs when a model learns the patterns and regularities of the training set with high accuracy but performs inadequately on unseen data. Over-fitting can be reduced by increasing the size of the training database, which is an important concern for medical image databases; different augmentation techniques are used to increase the size of the dataset while also maintaining its variability. Another regularization technique, dropout [181], is used by different authors to reduce over-fitting during training. Here, randomly selected neurons are dropped from the activations, i.e., temporarily removed on the forward pass during training. This helps the network learn more independent internal representations, which reduces over-fitting.

Another problem faced by deep neural networks with gradient-based optimization is the vanishing gradient problem. As the number of hidden layers increases, the gradient can shrink towards zero before convergence, so the error gradient cannot be efficiently back-propagated to the lower layers, leading to unsatisfactory results. Many remedies have been proposed, such as alternative weight initialization schemes [182], unsupervised pre-training [183], guided layer-wise training [184] and variations on gradient descent. Authors also use the ReLU activation, which prevents the gradient from diminishing.

Challenges with Organ Appearance

In both 2D and 3D biomedical images, the diverse and overlapping appearance of organs poses a big challenge to researchers in the segmentation field. The varying size of the organs and the indistinct boundaries between a target organ and its neighboring tissues make segmentation an ill-posed problem. Deep architectures with multitask learning networks can address this issue [140]. Applying a weighted loss function, with a larger weight allocated to the separating background labels between touching organs, is another successful approach for touching objects of the same class.

Challenges with Image Dimension

In dealing with 3D biomedical images, training on volumetric data faces more challenges than on 2D images. The limited amount of training data, the large number of parameters and the high memory requirements make training much more expensive and time consuming before satisfactory inferences can be produced. It is not always possible to obtain fully annotated data for 3D images; hence sparsely annotated data are used, which lowers inference accuracy. Weighted loss functions can be used to handle such unbalanced, sparsely annotated volume data [133].

Conclusion

Nowadays, semantic image segmentation has become an essential task in many computer vision and machine learning applications where multiclass segmentation is of importance. In the field of medical imaging, the application of semantic segmentation is constantly growing. This paper covered the fundamentals of deep learning techniques, the databases used in both the medical and non-medical fields, and their structures. It reviewed recent progress in semantic segmentation in the biomedical field, especially deep neural network-based semantic segmentation techniques. Deep learning-based semantic image segmentation techniques are more accurate, and with the increasing availability of datasets and graphics processing units, inference times continue to decrease. In the field of medical imaging, the main challenges remain the variability of patient data and the lack of large datasets.