1 Introduction

Semantic segmentation is also referred to as pixel-level classification (Chen and Gupta 2015) because it classifies each pixel of an image into a corresponding class. It has long been one of the most important tasks in the field of computer vision. While earlier work concentrated on hand-crafted features and classifiers (Liu et al. 2018), it is now commonplace to solve semantic segmentation problems with deep learning methods. Today, semantic segmentation has a rich application background and has achieved remarkable results: segmenting daily life and natural images (Chen et al. 2019), real-time semantic segmentation of road-driving images (Orsic et al. 2019) and segmentation based on human pose (Zhang et al. 2019). At present, segmenting medical images has received particular attention (Goceri and Goceri 2017). Zhao et al. (2019) focused on brain magnetic resonance imaging segmentation. Another study introduced blockchain miner computation into the field of biomedical image segmentation (Li et al. 2019a). Shen et al. (2017) comprehensively summarized and analyzed the deep learning methods applied to medical image processing. This interest stems from the efficient formulations behind the success of deep learning (Goceri 2018) and the significant effect of image-based diagnosis systems on biomedical information technology (Goceri and Songul 2018) and future healthcare (Goceri 2017). In addition, the remote sensing image segmentation task has also made great progress. Mou et al. (2019) enhanced the representation capabilities of a model used to segment aerial scene images. Kampffmeyer et al. (2016a) used urban remote sensing images and focused on small objects to map land cover.

From the perspective of supervision, image semantic segmentation methods can be divided into three categories: fully supervised learning, semi- and weakly supervised learning, and unsupervised learning. Unlike the pixel-level annotation of fully supervised learning, weakly supervised semantic segmentation aims to use only image-level labels during training while ultimately predicting the class to which each pixel belongs (Vezhnevets and Buhmann 2010; Yu et al. 2018). An image-level label means that each training image I is represented by an n-dimensional vector N, each element of which is a Boolean variable whose value of 1 or 0 indicates whether the corresponding class is present in the image. To strike a balance, semi-supervised semantic segmentation uses both weak and strong labels, compromising between the challenge of weak supervision and the high annotation cost of full supervision (Yu et al. 2018). Unsupervised semantic segmentation, as the name suggests, does not use any labeled data when training deep models (Sultana et al. 2019).

Within the scope of fully supervised segmentation, many excellent models have emerged on the basis of CNNs. Long et al. (2015) first used the FCN for semantic segmentation. Fast R-CNN was proposed to accelerate the Region-based Convolutional Network (R-CNN) and improve its accuracy (Girshick 2015). Later, both Faster R-CNN (Ren et al. 2015) and Mask R-CNN (He et al. 2018) proved effective for semantic segmentation. DeepLabv3 (Chen et al. 2017) uses multi-proportion atrous convolution to capture multi-scale information in cascade or in parallel. Atrous convolution, also known as dilated convolution, is a generalization of the Kronecker-factored convolution filter (Zhou et al. 2015), or a traditional convolution with an upsampled filter that supports exponentially expanded receptive fields without loss of resolution (Garcia-Garcia et al. 2018). DeepLabv3+ (Chen et al. 2018) uses an encoder-decoder architecture to further improve the accuracy and speed of the segmentation algorithm. Despite the gratifying achievements of fully supervised learning algorithms, they rest on time-consuming and laborious manual annotations. Therefore, semi- and weakly supervised learning algorithms have received more attention. As detailed in Sect. 2, many semi- and weakly supervised semantic segmentation models have emerged and achieved excellent results in various fields. The unsupervised learning algorithm has the lowest level of supervision and requires no expert annotations at all. This is a promising direction, but the current results are still insufficient and not fully convincing. The relevant research currently available is elaborated in the last subsection of Sect. 2.

Many review papers on semantic segmentation have been published. Lateef and Ruichek (2019) gave a thorough review of existing models and data sets. Garcia-Garcia et al. (2018) focused on deep learning techniques against the background of both images and video, although too much space is devoted to introducing common network architectures and data sets. There are two short reviews (Guo et al. 2017; Siam et al. 2017), the second of which focuses on automated driving. However, Guo et al. (2018) only covered FCN-based segmentation models, the technical analysis is insufficient, and many weakly supervised segmentation methods from the past three years are missing. A similar lack of coverage of recent methods also appears in Yu et al. (2018). Taking a classification-based perspective, Bo et al. (2017) gave a brief introduction to semantic segmentation only in their Sect. 3. Geng et al. (2016) used most of their space to review CNN-based methods. Thoma (2016) analyzed many algorithmic principles and also touched on unsupervised learning methods. Moreover, both Garcia-Garcia et al. (2017) and Liu et al. (2018) further analyzed models and data sets. But these papers share a common shortcoming, namely the lack of an integrated analysis of semi- and weakly supervised methods. Zhang et al. (2008) reviewed unsupervised evaluation methods from a novel perspective; please refer to that paper if needed. Vezhnevets and Buhmann (2010) reviewed early weakly supervised semantic segmentation methods, but their paper does not cover the deep learning methods that now dominate the field. In fact, excellent semi- and weakly supervised learning algorithms continue to emerge in rapid succession, and they are the focus of this paper.

The contributions to this work are summarized as follows:

  1. Reviewing the semi- and weakly supervised semantic segmentation models of recent years, organized by their underlying base model.

  2. Focusing on the algorithms and mechanisms of the models and presenting the necessary equations.

  3. Comprehensively summarizing the commonly used evaluation metrics and data sets, and then analyzing the segmentation performance of different models on each data set.

  4. Summarizing the full text and giving inspirational suggestions for future research.

This paper is organized as follows: Sect. 2 reviews semi- and weakly supervised segmentation approaches and concludes with a brief summary of unsupervised methods. Section 3 summarizes the semantic segmentation data sets and various evaluation metrics. Experimental results and analysis are shown in Sect. 4. Section 5 briefly summarizes the paper and lists enlightening suggestions for future study.

2 Models

Whether the annotations are image-level labels, box-level annotations, scribbles or points, they are very cost effective to collect, and the main difficulty is how to precisely match these annotations to their corresponding pixels (Huang et al. 2018). This section reviews semi-supervised and weakly supervised segmentation methods. To the best of our knowledge, this is the first review paper on semi- and weakly supervised semantic segmentation in recent years. In the following, in addition to reviewing semi-supervised and weakly supervised semantic segmentation models, the loss functions and optimization methods they use are also discussed in depth. Among them, Adam is nearly the default optimizer for semantic segmentation methods. The chosen scalar product can affect the performance of gradient-descent-based optimizers (Goceri 2015). Moreover, Sobolev-gradient-type methods have recently been applied in some works (Goceri 2016, 2019; Goceri and Esther 2014). At the end of this section, recent unsupervised algorithms are briefly summarized.

2.1 Semi-supervised methods

Semi-supervised segmentation methods published in recent years are mainly based on three classical models, namely CNN, R-CNN (Girshick et al. 2014) and adversarial networks (Goodfellow et al. 2014).

2.1.1 CNN based models

Based on CNN, Hong et al. (2015) decoupled the classification network from the segmentation network, training the two parts with image-level and pixel-wise annotations, respectively. In addition, bridging layers are used to output class-specific activation maps from which class-specific segmentation maps are obtained. Self-supervision is one of the earliest proposed semi-supervised learning methods (Scudder 1965). Zhan et al. (2017) proposed self-supervised segmentation with a new approach called mix-and-match (M&M) to improve pre-training. Prior to M&M, a proxy task uses cross-entropy loss to learn representations from colorization. The tuning loss used to fine-tune the parameters is formulated by transforming a graph optimization problem into triplet ranking (Schroff et al. 2015), shown as Eq. 1.

$$\begin{aligned} L=\frac{1}{N} \sum _{i}^{N} \max \left\{ D\left( P_{a}^{i}, P_{p}^{i}\right) -D\left( P_{a}^{i}, P_{n}^{i}\right) +\alpha , 0\right\} \end{aligned}$$
(1)

where \(P_{a}, P_{p}, P_{n}\) denote the anchor, positive and negative nodes in a triplet, \(\alpha\) is a regularization factor controlling the distance margin and D is a distance metric measuring patch relationships. Lee et al. (2019) designed FickleNet around the core idea of randomly selecting hidden units, which can also be viewed as a dropout method. In this work, Gradient-weighted Class Activation Mapping (Grad-CAM), proposed by Selvaraju et al. (2016), is used instead of CAM to produce localization maps, expressed as Eq. 2.

$$\begin{aligned} {Grad-CAM}^{\mathrm {c}}={ReLU}\left( \sum _{k} x_{k} \times \frac{\partial S^{c}}{\partial x_{k}}\right) \end{aligned}$$
(2)

where \(x_{k}\) is the kth channel of the feature map x, \(S^{c}\) is the classification score of class c.
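To make Eq. 2 concrete, a minimal PyTorch sketch is given below. The function name and the final normalization are our own illustrative choices; note also that the standard Grad-CAM formulation first global-average-pools the gradients into channel weights, whereas this sketch follows Eq. 2 as printed.

```python
import torch
import torch.nn.functional as F

def grad_cam(feature_map: torch.Tensor, class_score: torch.Tensor) -> torch.Tensor:
    # feature_map x: (C, H, W), a non-leaf tensor on the graph of the
    # scalar classification score S^c for the target class.
    grads, = torch.autograd.grad(class_score, feature_map, retain_graph=True)
    cam = F.relu((feature_map * grads).sum(dim=0))  # Eq. 2: ReLU(sum_k x_k * dS^c/dx_k)
    return cam / (cam.max() + 1e-8)                 # normalize to [0, 1]
```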

2.1.2 R-CNN based models

The R-CNN method (Girshick et al. 2014) creatively combines region proposals with a CNN framework for object detection and semantic segmentation. The model is shown in Fig. 1. Its workflow is roughly divided into four steps: (a) getting the input image; (b) extracting candidate regions; (c) feeding each candidate region into the CNN separately; (d) passing the CNN output to an SVM for category determination. Since the CNN must be run once for every region proposal, the training process is very expensive in both time and space. Meanwhile, several improved semi-supervised semantic segmentation models based on R-CNN have emerged. Hu et al. (2017) proposed a transfer learning method built on Mask R-CNN (He et al. 2018). Specifically, it learns the category-specific mask parameters from the bounding box parameters through a generic weight transfer function, shown as Eq. 3.

$$\begin{aligned} w_{\mathrm {seg}}^{c}=\mathcal {T}\left( w_{\mathrm {det}}^{c} ; \theta \right) \end{aligned}$$
(3)

where c represents the category, \(w_{\mathrm {det}}^{c}\) are the detection weights in the last layer of the bounding box head, \(\theta\) denotes class-agnostic learned parameters and \(w_{\mathrm {seg}}^{c}\) are the mask weights.
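Hu et al. implement \(\mathcal {T}\) as a small neural network; the sketch below assumes a 2-layer MLP with a LeakyReLU activation (the hidden size and activation are illustrative choices, not values taken from the paper).

```python
import torch
import torch.nn as nn

class WeightTransfer(nn.Module):
    # T(w_det; theta) of Eq. 3: theta are the MLP parameters, shared
    # (class-agnostic) across all categories.
    def __init__(self, det_dim: int, seg_dim: int, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(det_dim, hidden),
            nn.LeakyReLU(),
            nn.Linear(hidden, seg_dim),
        )

    def forward(self, w_det: torch.Tensor) -> torch.Tensor:
        # w_det: (num_classes, det_dim) -> w_seg: (num_classes, seg_dim)
        return self.mlp(w_det)
```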

Mirakhorli and Amindavar (2017) used a hierarchical network structure to interconnect Mask R-CNN and a Conditional Random Field (CRF) to achieve instance segmentation. The whole structure contains two sub-networks. In sub-network 1, Mask R-CNN is first used to produce object masks, which are then combined with a superpixel layer generated from the original image to produce an initial segmentation result. This result enters the Mask R-CNN and superpixel layers of sub-network 2, and the final labeling is generated by the Maximum A Posteriori (MAP) estimate of the CRF.

Fig. 1 R-CNN model diagram

2.1.3 GAN based models

At present, adversarial techniques have received great attention. Although the recently published GAN-based semi-supervised semantic segmentation methods are not yet mainstream, they are of great learning value. The GAN architecture (Goodfellow et al. 2014) consists of two subnetworks, a generation network and a discrimination network, also called the generator (G) and the discriminator (D); details are shown in Fig. 2. During training, G tries to trick D by turning random noise into pictures that are as realistic as possible, while D strives to distinguish the pictures generated by G from real pictures. Thus, D and G constitute a dynamic game. Instead of using a CRF, Luc et al. (2016) applied an adversarial method to unify high-order potentials. Their approach can also be divided into two parts, a segmentation network and an adversarial network. The RGB image is taken as input by the segmentor, which produces pixel-wise class predictions. The adversarial net then takes both this output and the RGB image as input to produce the final class labels. Fischer et al. (2017) further used imperceptible adversarial perturbations to train their model for semantic image segmentation. Different from previous methods whose generators produce images from noise vectors, the network proposed by Hung et al. (2018) outputs the probability maps of the semantic labels. Under these circumstances, the outputs are enforced to be spatially close to the ground truth label maps. This is fulfilled with a cross-entropy loss, as shown in Eq. 4.

$$\begin{aligned} \mathcal {L}_{D}=-\sum _{h, w}\left( 1-y_{n}\right) \log \left( 1-D\left( S\left( \mathbf {X}_{n}\right) \right) ^{(h, w)}\right) +y_{n} \log \left( D\left( \mathbf {Y}_{n}\right) ^{(h, w)}\right) \end{aligned}$$
(4)

where \(y_{n}=0\) if the sample comes from the segmentation network and \(y_{n}=1\) if the sample comes from the ground truth label, \(D\left( S\left( \mathbf {X}_{n}\right) \right) ^{(h, w)}\) is the confidence map of X at location (h, w), and \(D\left( \mathbf {Y}_{n}\right) ^{(h, w)}\) is defined in the same way. As for the segmentation network, a combined multi-term loss is used for training, formed as Eq. 5.

$$\begin{aligned} \mathcal {L}_{s e g}=\mathcal {L}_{c e}+\lambda _{a d v} \mathcal {L}_{a d v}+\lambda _{s e m i} \mathcal {L}_{s e m i} \end{aligned}$$
(5)

where \(\mathcal {L}_{ce}, \mathcal {L}_{adv}, \mathcal {L}_{semi}\) represent the spatial multi-class cross-entropy loss, the adversarial loss, and the semi-supervised loss, respectively. All semi-supervised methods mentioned in the three subsections above are summarized in Table 1.
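Before moving on, the following minimal sketch illustrates how the terms of Eq. 5 combine for a labeled image, with the semi-supervised term omitted; the weight value and the discriminator interface are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(seg_logits, labels, d_confidence, lambda_adv=0.01):
    # L_ce: spatial multi-class cross-entropy on the segmentor output.
    l_ce = F.cross_entropy(seg_logits, labels, ignore_index=255)
    # L_adv: push the discriminator's per-pixel confidence map D(S(X))
    # toward 1, i.e. toward "indistinguishable from ground truth".
    l_adv = F.binary_cross_entropy(d_confidence, torch.ones_like(d_confidence))
    return l_ce + lambda_adv * l_adv  # Eq. 5 without the L_semi term
```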

Fig. 2 GAN model flow chart

Table 1 Semi-supervised segmentation methods

2.2 Weakly supervised methods

It is well known that weakly supervised methods employ different levels of supervision, such as bounding boxes (Dai et al. 2015), scribbles (Lin et al. 2016), points (Russakovsky et al. 2015) and image-level labels (Papandreou et al. 2015). Bounding boxes are a common representation of object position. Scribbles mark each semantic class with a free-hand stroke. Points indicate object locations. Among all these types of supervision, image-level supervision is the weakest, and image-level tags can be obtained very efficiently. Although there is a certain gap between models trained with weak supervision and models trained with full supervision, many researchers are devoted to narrowing it. The various weakly supervised methods are elaborated in the rest of this section.

2.2.1 CNN based models

CNN-based approaches still account for the majority. Pinheiro and Collobert (2014) used only image-level class information to train the segmentation model. During training, a CNN generates feature planes, and an aggregation layer then takes these planes as input to constrain the model to put more weight on the right pixels. However, one obvious shortcoming of this study is that it only segments single-object images and thus cannot meet current needs. Oquab et al. (2015) proposed early on the idea of transferring CNN parameters to other target tasks. Papandreou et al. (2015) designed an Expectation-Maximization (EM) method to train segmentation models under both semi- and weak supervision. The main idea of Hypotheses-CNN-Pooling (HCP) is that each hypothesis fed into the CNN produces a c-dimensional prediction, and max-pooling over all predictions yields the final multi-target detection (Wei et al. 2014).

Since 2016, a large number of CNN-based weakly supervised semantic segmentation methods have emerged. STC (Wei et al. 2016b) is a Simple-to-Complex framework. It is implemented by three networks, Initial DCNN, Enhanced DCNN and Powerful DCNN, which gradually improve the segmentation performance of the model. Also based on DCNNs, Wei et al. (2014) chose Hypotheses-CNN-Pooling (HCP) to predict the classification scores. Additionally, a novel multi-label cross-entropy loss is combined with a single-label loss to train the net, shown as Eq. 6.

$$\begin{aligned} J=-\eta \sum _{t=1}^{N} \sum _{i=1}^{h} \sum _{j=1}^{w} \sum _{m=1}^{c+1} \hat{p}_{t}^{m}(i, j) \log \left( p_{t}^{m}(i, j)\right) \end{aligned}$$
(6)

where \({p}_{t}^{m}(i, j)\), obtained from the generated localization map, represents the ground-truth probability of the mth class at position (i, j). Kolesnikov and Lampert (2016) focused on the loss function and proposed a new loss based on three principles, named SEC, which stands for seed, expand, and constrain. Here, seeds are localization cues, and the seeding loss is used to weakly localize an object. The form of the seeding loss is shown as Eq. 7.

$$\begin{aligned} L_{\mathrm {seed}}\left( f(X), T, S_{c}\right) =-\frac{1}{\sum _{c \in T}\left| S_{c}\right| } \sum _{c \in T} \sum _{u \in S_{c}} \log f_{u, c}(X) \end{aligned}$$
(7)

where \(S_{c}\) is the set of locations labeled with class c. Given that global max-pooling (GMP) often underestimates object size while global average-pooling (GAP) often overestimates it, global weighted rank-pooling (GWRP) is leveraged by the expansion loss to reasonably extend the regions of object seeds. The loss function is given as Eq. 8.

$$\begin{aligned} {\begin{matrix} L_{{ expand }}(f(X), T)=-\frac{1}{|T|} \sum\limits_{c \in T} \log G_{c}\left( f(X) ; d_{+}\right) \\ -\frac{1}{\left| \mathcal {C}^{\prime } \backslash T\right| } \sum\limits _{c \in \mathcal {C}^{\prime } \backslash T} \log \left( 1-G_{c}\left( f(X) ; d_{-}\right) \right) -\log G_{c^{\mathrm {bg}}}\left( f(X) ; d_{\mathrm {bg}}\right) \end{matrix}} \end{aligned}$$
(8)

where \(G_{c}\) represents the GWRP classification scores. In addition, a fully connected CRF is employed. The constrain-to-boundary loss is obtained by calculating the mean KL-divergence between the network outputs and the CRF outputs, as shown in Eq. 9.

$$\begin{aligned} L_{{ constrain }}(X, f(X))=\frac{1}{n} \sum _{u=1}^{n} \sum _{c \in \mathcal {C}} Q_{u, c}(X, f(X)) \log \frac{Q_{u, c}(X, f(X))}{f_{u, c}(X)} \end{aligned}$$
(9)

The goal is to make the mask adhere more closely to object boundaries. The method proposed in this paper has served as a reference for subsequent studies (Shen et al. 2018). Similar ideas are applied by Huang et al. (2018), where seeding loss and boundary loss are adapted to obtain better results. Redondo-Cabrera et al. (2018) designed a segmentation model that is trained fully end-to-end and does not require any external aid such as saliency or priors. The model architecture consists of two parts, a hide-and-seek module and a segmenter module. The hide-and-seek part combines two siamese CAM modules to obtain activation masks that cover full objects. Based on these activation maps, the segmenter network learns to segment images using a CNN. The idea of hide-and-seek is also applied by Singh and Lee (2017); the difference is that their method randomly hides blocks of the image during training instead of changing the algorithm or relying on external information.

Inspired by Weston et al. (2012) and Goodfellow et al. (2015), Tang et al. (2018) evaluated both the seeds whose labels are known and the consistency of all pixels using two criteria: the former uses a cross-entropy loss and the latter a normalized cut. It is worth mentioning that the normalized cut is a variant of a family of spectral clustering and embedding algorithms (Shi and Malik 2000; Ng et al. 2001). Moreover, they continued this line of work by using the normalized cut entirely as a loss and directly integrating the standard objectives of shallow segmentation (Tang et al. 2018). Commonly used proposal generation methods include dense CRF mean-field inference (Papandreou et al. 2015; Rajchl et al. 2016) or graph cut (Lin et al. 2016). Instead, the proposed method directly integrates shallow regularizers into the loss function, yielding two regularized losses, the Potts/CRF loss and the kernel cut loss. The joint loss is shown as Eq. 10.

$$\begin{aligned} \sum _{p \in \Omega _{\mathcal {L}}} H\left( Y_{p}, S_{p}\right) +\lambda \cdot R(S) \end{aligned}$$
(10)

where H is the cross entropy between the prediction \(S_{p}\) and the ground truth labeling \(Y_{p}\). A guided attention inference network (GAIN) was proposed by incorporating attention (Li et al. 2018a). It contains two streams, \(S_{cl}\) and \(S_{am}\): \(S_{cl}\) locates the significant area, and \(S_{am}\) tries to ensure the coverage accuracy of that area. The final self-guidance loss is given as Eq. 11.

$$\begin{aligned} L_{{self}}=L_{c l}+\alpha L_{a m} \end{aligned}$$
(11)

Grad-CAM is used here, and it can be concluded that replacing CAM with Grad-CAM has become a new trend. Khoreva et al. (2016) proposed using a GrabCut-like algorithm to obtain labels from given bounding boxes and achieve high quality in a single training round. The model employs DeepLabv1 (Chen et al. 2016). Besides, box-driven figure-ground segmentation (Rother et al. 2004) and object proposals (Pont-Tuset and Van Gool 2015) are used to feed the training, while MCG (Pont-Tuset et al. 2016) and GrabCut+ are used to mark foreground pixels. A model named CRF-RNN was presented by Roy and Todorovic (2017), where the CRF is formulated as a Recurrent Neural Network (RNN) and further used to refine the initial CNN prediction; this design follows Zhou et al. (2015). The architecture is fully end-to-end, unifying top-down attention and bottom-up segmentation and finally refining all the former cues. A similar bottom-up and top-down framework is also used by Wang et al. (2018). Their MCOF network uses the heat-map response of the classification network as the initial seed, and the predictions of RegionNet and PixelNet alternately serve as each other's supervision labels over multiple rounds of iteration. For the refinement step, a Bayesian estimate is used to refine the object area, as shown in Eq. 12.

$$\begin{aligned} p(o b j | {v})=\frac{p( {obj}) p({v} | o b j)}{p( {obj}) p({v} | o b j)+p(b g) p({v} | b g)} \end{aligned}$$
(12)

Here p(obj) is set to the saliency map, \(p(bg)=1-p(obj)\), and p(v|obj) and p(v|bg) are the feature distributions of object regions and background regions. Vernaza and Chandraker (2017) used inexpensive sparse labels as input to a CNN-based segmentation network, mimicking these labels through training and ultimately producing dense labels. This method is similar to Lin et al. (2016) but avoids the problem of a performance upper bound that never increases due to non-adaptive label smoothness. The label propagation process is defined by random-walk hitting probabilities (Grady 2006), which can be computed efficiently by solving linear systems. Kwak et al. (2017) designed a superpixel pooling network (SPN) and combined it with DecoupledNet to perform weakly supervised semantic segmentation: SPN generates segmentation annotations, while DecoupledNet performs the semantic segmentation. The loss function used to learn the SPN is defined as the sum of C binary classification losses, shown as Eq. 13.

$$\begin{aligned} \mathcal {L}(f(\mathbf {x}), \mathbf {y})=\frac{1}{C} \sum _{c=1}^{C}\left\{ y_{c} \log \frac{e^{f_{c}(\mathbf {x})}}{1+e^{f_{c}(\mathbf {x})}}+\left( 1-y_{c}\right) \log \frac{1}{1+e^{f_{c}(\mathbf {x})}}\right\} \end{aligned}$$
(13)

where \(f_{c}(\mathbf {x})\) and \(y_{c}\) are the network output and the ground-truth label for a single class c. Unlike most existing semantic segmentation methods that focus on countable objects, Li et al. (2018b) not only segment semantics and instances but also handle both countable and uncountable categories. In their paper, countable and uncountable objects are represented as things and stuff, respectively, a setting also referred to as panoptic segmentation (Kirillov et al. 2019). The authors assumed that the training data for pixel-level tasks is statistically correlated within an image, so that only small sets of pixels need to be randomly sampled during training. Specifically, the method includes many common mechanisms: GrabCut (Rother et al. 2004) and MCG (Arbeláez et al. 2014) are used to obtain foreground masks, Grad-CAM (Selvaraju et al. 2016) is responsible for localization, and the Maximum-a-Posteriori (MAP) estimate of a CRF gives the final output.
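Returning to Eq. 13: since it averages C binary logistic log-likelihoods, its negative coincides with binary cross-entropy on logits, so a sketch of the loss (under our own naming) reduces to a standard library call.

```python
import torch
import torch.nn.functional as F

def spn_classification_loss(f: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # f: (B, C) class scores f_c(x); y: (B, C) binary image-level labels y_c.
    # BCE-with-logits is exactly the negative of the Eq. 13 summand, averaged.
    return F.binary_cross_entropy_with_logits(f, y.float())
```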

All of the above CNN-based weak supervised segmentation methods are summarized in Table 2.

Table 2 CNN based weakly supervised segmentation methods

2.2.2 FCN based models

Since the FCN was first proposed for semantic segmentation (Long et al. 2015), a large number of weakly supervised methods have been developed on top of it. The classic FCN architecture is shown in Fig. 3; its main difference from a CNN is the use of convolutional layers instead of fully connected layers. In the same year, several FCN-based weakly supervised semantic segmentation models were released. Russakovsky et al. (2015) used point-level supervision, whose supervisory level is slightly higher than image-level supervision. Additionally, a modified training loss function was introduced to address the difficulty of learning the full object extent. Details are shown in Eq. 14.

Fig. 3 FCN model diagram

$$\begin{aligned} \mathcal {L}_{o b j}(S, P)=-\frac{1}{|\mathcal {I}|} \sum _{i \in \mathcal {I}}\left( P_{i} \log \left( \sum _{c \in \mathcal {O}} S_{i c}\right) +\left( 1-P_{i}\right) \log \left( 1-\sum _{c \in \mathcal {O}} S_{i c}\right) \right) \end{aligned}$$
(14)

Let \(P_{i}\) be the probability that pixel i belongs to an object, and \(\mathcal {O}\) be the classes corresponding to objects, with the other classes corresponding to background. Oquab et al. (2015) used stochastic gradient descent with global max-pooling and defined the loss function as the sum of K binary logistic regression losses. Since this is an early model, it suffers from a simple structure and a weakly persuasive experimental evaluation. Pathak et al. (2015) heuristically defined a multi-class MIL loss, shown as Eq. 15.

$$\begin{aligned} \left( x_{l}, y_{l}\right) =\arg \max _{\forall (x, y)} \hat{p}_{l}(x, y)\quad \forall l \in \mathcal {L}_{I}\;\Longrightarrow \;\mathrm {MIL\ loss}=\frac{-1}{\left| \mathcal {L}_{I}\right| } \sum _{l \in \mathcal {L}_{I}} \log \hat{p}_{l}\left( x_{l}, y_{l}\right) \end{aligned}$$
(15)

That is, the loss is computed only at the maximum-scoring pixel in the coarse heat map of each class present in the image (and the background) and is propagated back through the network. Let the input image be I, its label set be \(\mathcal {L}_{I}\), and \(\hat{p}_{l}\left( x, y\right)\) be the output heat map for the lth label at location (x, y).
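A minimal PyTorch sketch of this loss, assuming per-class probability maps as input (variable names are our own):

```python
import torch

def mil_loss(heatmaps: torch.Tensor, present: torch.Tensor) -> torch.Tensor:
    # heatmaps: (L, H, W) per-class probability maps p_hat_l(x, y);
    # present: 1-D tensor with the indices of the classes in L_I.
    probs = heatmaps[present]               # restrict to classes in the image
    peaks = probs.flatten(1).amax(dim=1)    # max-scoring pixel per class
    return -torch.log(peaks + 1e-8).mean()  # Eq. 15
```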

After 2015, FCN-based weakly supervised semantic segmentation methods gained more attention. Adding a separate branch to locate the target object is one of the most commonly used techniques (Qi et al. 2016); the proposed localization branch acts as an object detector and helps adjust the output of the segmentation branch. The model designed by Lin et al. (2016) has been mentioned many times for its good practicability and interactivity. It uses scribbles as annotations, utilizes the method proposed by Felzenszwalb and Huttenlocher (2004) to generate superpixels, and uses graph cut to train the network. The objective function of this method, shown in Eq. 16, contains two parts, a unary term and a pairwise term.

$$\begin{aligned} \sum _{i} \psi _{i}\left( y_{i} | X, S\right) +\sum _{i, j} \psi _{i j}\left( y_{i}, y_{j} | X\right) \end{aligned}$$
(16)

This form of objective has been used in many interactive segmentation methods for more than a decade (Rother et al. 2004; Grady 2006; Levin et al. 2006; Liu et al. 2009; Jian and Jung 2016). A Fully Convolutional Attention Network (FCAN) was designed by Chaudhry et al. (2017), where erasing is used to mine salient regions hierarchically. During training, the attention mechanism for locating the most discriminative region is combined with saliency maps. A similar method (Wei et al. 2017) also uses erasing to extend the attention map but needs to retrain the attention network after each erasing step; FCAN instead keeps the attention network intact and erases iteratively to discover new significant areas, which also enables it to process multi-object images.

A two-phase learning method was designed by Kim et al. (2017) using SEC as the baseline segmentation network; the structure consists of two identical sub-networks. The first network locates the most significant region and hides it; the second then finds the most significant remaining area, which is in fact the second most significant one. Finally, the target object is segmented. However, this design has two obvious shortcomings: the sub-networks do not share parameters, and training is not end-to-end. As mentioned earlier, Shen et al. (2018) combined the SEC method with an auxiliary web training set to train the segmentor and create precise pixel-level masks for the training images through a bootstrap process. Specifically, SEC acts as an initial filter for the target domain and the web domain, respectively, and the web images are used to learn better features. There are many other webly supervised methods (Hong et al. 2017; Jin et al. 2017; Shen et al. 2018). The idea of dilated convolution (Chen et al. 2014, 2016) is integrated by Wei et al. (2018) to improve discriminative capability by expanding the receptive field: CAM generates the class-specific localization map for each convolution block, and the dilation rate is gradually increased to search for more target-related areas. The model designed by Zhou et al. (2018) does not rely on popular modules or mechanisms; its main idea is to take the local maxima, named peaks, in a class response map and then back-propagate from them.
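The peak-finding step of this last method can be sketched with stride-1 max pooling: a location is a peak if it equals the maximum of its own neighborhood. The window size and the global-mean threshold below are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def find_peaks(crm: torch.Tensor, window: int = 3) -> torch.Tensor:
    # crm: (B, C, H, W) class response maps. Stride-1 max pooling with
    # padding preserves the spatial size, so equality marks local maxima.
    pooled = F.max_pool2d(crm, kernel_size=window, stride=1, padding=window // 2)
    peaks = (crm == pooled) & (crm > crm.mean())  # suppress flat, low responses
    return peaks.nonzero()  # (num_peaks, 4) indices: batch, class, row, col
```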

At the end of this section, we briefly review weakly supervised semantic segmentation methods that perform a counting task simultaneously. Crowd counting is the basis of many complicated tasks such as crowd localization (Shaban et al. 2019), abnormal behavior analysis and scene monitoring (Gao et al. 2019). Cholakkal et al. (2019) used a density map to perform object counting while carrying out image-level supervised semantic segmentation. The architecture has two main branches, classification and density: the classification branch determines the presence or absence of an object and generates a pseudo ground truth to train the density branch.

Table 3 summarizes all of the FCN-based semantic segmentation methods mentioned in this subsection.

Table 3 FCN based weakly supervised segmentation methods

2.2.3 GAN based models

Although CNN- and FCN-based weakly supervised semantic segmentation models account for the lion's share of the literature, GAN-based methods still have a place. Souly et al. (2017) extended the typical GAN and progressively realized semi-supervised and weakly supervised semantic segmentation. In their architecture, the generator uses both noise and a class label to generate an image, while the discriminator D predicts pixel-wise classification confidences over K classes; a softmax gives the probability that x belongs to a given category. It is worth mentioning that, in addition to the CNN-based erasing method, Wei et al. (2017) also trained their networks in an adversarial erasing (AE) manner. Inspired by AE, Zhang et al. (2018a) designed the adversarial complementary learning (ACoL) method to compensate for AE's shortcoming, namely the need to train several independent classification networks to obtain an object region. The main part of the architecture is two parallel classifiers, each consisting of several convolutions, GAP and a softmax layer, which are used to obtain complementary regions of interest. Finally, the results of the two classifiers are fused for output.
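The complementary erasing at the heart of ACoL can be sketched as follows; the threshold value and tensor layout are illustrative assumptions.

```python
import torch

def erase_discriminative(features: torch.Tensor, cam_a: torch.Tensor,
                         delta: float = 0.9) -> torch.Tensor:
    # features: (B, C, H, W) shared feature maps; cam_a: (B, H, W) object map
    # from the first classifier. Regions above delta * per-image maximum are
    # erased so the second classifier must respond to complementary parts.
    thresh = delta * cam_a.amax(dim=(1, 2), keepdim=True)  # (B, 1, 1)
    keep = (cam_a < thresh).unsqueeze(1).float()           # (B, 1, H, W)
    return features * keep
```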

Table 4 lists the adversarial-learning-based segmentation methods described in this subsection.

Table 4 GAN based weakly supervised segmentation methods

2.2.4 Other methods

In addition to the several mainstream approaches mentioned above, there are other ways to tackle weakly supervised semantic segmentation. Especially in the early days, weakly supervised semantic segmentation received a lot of attention before deep neural networks became widespread. Images used to be represented as sets of superpixels (Vezhnevets et al. 2011); with image-level labels, training was achieved by calculating the distance between the centroids of two superpixels. In 2012, Vezhnevets et al. (2012) designed a Gaussian-process-based algorithm to solve a Bayesian optimization problem, namely how to choose the optimal model. The concrete implementation uses an Extremely Randomised Hashing Forest (ERHF), which is capable of mapping almost any feature space into a sparse binary representation. The decision forest is the basis of the Semantic Texton Forest (STF) method (Shotton et al. 2008), and the STF was used as the underlying framework; the STF structure was further extended by using geometric context estimation tasks as regularizers. Two years later, a patch-alignment-based manifold embedding algorithm and a hierarchical BN were proposed by Zhang et al. (2014), with superpixel semantics finally computed by voting. Xu et al. (2015) designed a unified framework to handle different types of weak annotations: image-level labels, bounding boxes and scribbles. The method divides training images into n superpixels and clusters all superpixels using max-margin clustering (Zhao et al. 2008, 2009). The objective function of this process is shown in Eq. 17.

$$\begin{aligned} \frac{1}{2}{tr}\left( W^{T} W\right) +\lambda \sum _{p=1}^{n} \sum _{c=1}^{C} \xi \left( \mathbf {w}_{c} ; \mathbf {x}_{p}, h_{p}^{c}\right) \end{aligned}$$
(17)

where W is a feature matrix whose columns represent the clustering features of the categories, and \(\xi\) is the cost of assigning the pth superpixel to class c. Saleh et al. (2016) performed validation and evaluation of foreground and background masks. Unlike previous semantic segmentation models that use clean labels, Lu et al. (2017) allowed noisy annotations; accordingly, a label noise reduction method was proposed, realized by a sparse learning model based on L1 optimization. For a more detailed theoretical analysis, please refer to that paper. Considering that occlusion between targets makes it difficult to segment the complete object without additional information, a saliency model was designed to work in parallel with the segmentation network and provide information beyond image labels (Oh et al. 2017). In the same year, Meng et al. (2017) took the novel step of segmenting the components of the target object, introducing a concept different from object region segmentation, namely part-level segmentation. The defined energy function clearly shows the structure of the model, shown as Eq. 18.

$$\begin{aligned} E=E_{s}+E_{c}+E_{p}+E_{h} \end{aligned}$$
(18)

Let \(E_{s}\) be a segmentation evaluation for each image distinguishing foreground from background, \(E_{c}\) the cosegmentation evaluation measuring the similarity among foregrounds, \(E_{p}\) the part consistency evaluation among images, and \(E_{h}\) the assessment of part structure consistency. For more information on cosegmentation, please refer to Ma and Latecki (2013) and Meng et al. (2013).

The last method to be mentioned is essentially a multi-mechanism fusion with three stages: an image-level stage, an instance-level stage and a pixel-level stage (Ge et al. 2018). Throughout the process, the output of each stage serves as the input of the next. The first stage performs multi-evidence fusion; the second removes outliers by triplet-loss-based metric learning and density-based clustering (Rodriguez and Laio 2014) and trains a classifier for instance filtering; the last stage fuses the earlier maps and makes the final prediction.

2.3 Unsupervised methods

In recent years, unsupervised semantic segmentation methods have received some attention. Sultana et al. (2019) proposed the DCP method, which is capable of background estimation and foreground detection in a variety of challenging real environments. In addition, domain adaptation is a commonly used approach to unsupervised semantic segmentation, and the current main method for solving unsupervised domain adaptation is adversarial learning (Hoffman et al. 2016). Domain adaptation is a representative transfer learning technique that aims to improve the performance of a model on the target domain by using information-rich source domain samples: the source domain has rich supervision information, while the target domain, where the test samples lie, has no labels or only a small number of them. Murez et al. (2017) aimed to design an unsupervised domain adaptation framework that is widely applicable in the field of image processing. The training process relies on several additional networks and losses, as shown in Eq. 19.

$$\begin{aligned} Q=\lambda _{c} Q_{c}+\lambda _{z} Q_{z}+\lambda _{t r} Q_{t r}+\lambda _{i d} Q_{i d}+\lambda _{c y c} Q_{c y c}+\lambda _{t r c} Q_{t r c} \end{aligned}$$
(19)

For the individual loss functions, please refer to the paper as needed. A dual channel-wise alignment networks (DCAN) model was designed by Wu et al. (2018). The authors argue that channel-wise alignment matters when adapting a segmentation model because it preserves spatial structure and semantic concepts, thus effectively constraining the domain shift. Saito et al. (2018) introduced a new kind of adversarial learning: two classifiers are trained to maximize their discrepancy on target samples, detecting target-domain samples that lie far from the source domain, and the feature generator is then trained to minimize this discrepancy so that target features move close to the source domain, thereby optimizing the decision boundaries and aligning the source and target distributions. Fully Convolutional Adaptation Networks (FCAN) were presented by Zhang et al. (2018b), combining Appearance Adaptation Networks (AAN) and Representation Adaptation Networks (RAN). The purpose of the AAN is to capture the high-level content of the source image and the low-level pixel information of the target domain. In the RAN, the FCN is shared, and atrous spatial pyramid pooling (ASPP) is additionally used to enlarge the receptive field of the filters on the feature map. Li et al. (2019b) designed a bidirectional learning system that alternately learns a segmentation adaptation model and an image translation model. A self-supervised learning (SSL) algorithm trains the segmentation adaptation model with a new perceptual loss; then, through reverse learning, a better segmentation adaptation model helps obtain a better translation model.
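As an illustration of the classifier-discrepancy idea of Saito et al. (2018), the following sketch shows the discrepancy measure and, in comments, the alternating training schedule; the names are our own.

```python
import torch

def discrepancy(logits1: torch.Tensor, logits2: torch.Tensor) -> torch.Tensor:
    # Mean L1 distance between the two classifiers' softmax outputs,
    # evaluated on unlabeled target-domain samples.
    return (logits1.softmax(dim=1) - logits2.softmax(dim=1)).abs().mean()

# The training loop alternates three steps:
#  (1) train generator G and classifiers F1, F2 on labeled source data
#      with cross-entropy;
#  (2) fix G; update F1, F2 to MAXIMIZE the discrepancy on target data
#      (while keeping source cross-entropy low), exposing target samples
#      that lie far from the source support;
#  (3) fix F1, F2; update G to MINIMIZE the discrepancy, pulling target
#      features into regions where the two classifiers agree.
```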

3 Evaluation metrics and datasets

This section describes existing evaluation metrics and data sets, paving the way for the experimental analysis in the next section.

3.1 Evaluation metrics

In order to fairly measure the contribution of a segmentation method, the assessment requires standard, accepted criteria. Execution time, memory usage and accuracy are all common evaluation criteria. However, because models are designed with different goals, some metrics are more informative than others, so the specific situation must be analyzed. Commonly used evaluation metrics are intersection over union IoU (Eq. 20), mean intersection over union mIoU (Eq. 21), average precision \(\mathrm {AP}_{\mathrm {vol}}^{r}\) (Eq. 22), mean average precision mAP (Eq. 23), panoptic quality PQ (Eq. 24), average best overlap ABO (Eqs. 25-26) and mean pixel accuracy MPA (Eq. 27). Their definitions are listed as follows.

$$\begin{aligned} \mathrm {IoU}= & {} \frac{ {Area\ of\ overlap}}{{Area\ of\ union}} \end{aligned}$$
(20)
$$\begin{aligned} \mathrm {mIoU}= & {} \frac{1}{k+1} \sum _{i=0}^{k} \frac{p_{i i}}{\sum _{j=0}^{k} p_{i j}+\sum _{j=0}^{k} p_{j i}-p_{i i}} \end{aligned}$$
(21)
$$\begin{aligned} \mathrm {AP}_{\mathrm {vol}}^{r}= & {} \int _{0}^{1} p(r)\, dr \end{aligned}$$
(22)
$$\begin{aligned} \mathrm {mAP}= & {} \int _{0}^{1} P(R)\, dR \end{aligned}$$
(23)
$$\begin{aligned} \mathrm {PQ}= & {} \frac{\sum _{(p, g) \in TP} {IoU}(p, g)}{|TP|} \times \frac{|TP|}{|TP|+\frac{1}{2}|FP|+\frac{1}{2}|FN|} \end{aligned}$$
(24)
$$\begin{aligned} \mathrm {ABO}= & {} \frac{1}{\left| G^{c}\right| } \sum _{g_{i}^{c} \in G^{c}} \max _{l_{j} \in L} {Overlap}\left( g_{i}^{c}, l_{j}\right) \end{aligned}$$
(25)
$$\begin{aligned} {Overlap}\left( g_{i}^{c}, l_{j}\right)= & {} \frac{{area}\left( g_{i}^{c}\right) \cap {area}\left( l_{j}\right) }{{area}\left( g_{i}^{c}\right) \cup {area}\left( l_{j}\right) } \end{aligned}$$
(26)
$$\begin{aligned} \mathrm {MPA}= & {} \frac{1}{k+1} \sum _{i=0}^{k} \frac{p_{i i}}{\sum _{j=0}^{k} p_{i j}} \end{aligned}$$
(27)
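As a concrete reference, Eq. 21 can be computed from a confusion matrix as in the following NumPy sketch (the ignore label 255 follows the Pascal VOC convention):

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int,
             ignore_index: int = 255) -> float:
    # pred, gt: integer label maps of identical shape.
    valid = gt != ignore_index
    # (k+1) x (k+1) confusion matrix: rows = ground truth, cols = prediction.
    hist = np.bincount(num_classes * gt[valid] + pred[valid],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(hist)                                # p_ii
    union = hist.sum(axis=1) + hist.sum(axis=0) - inter  # p_ij + p_ji - p_ii
    iou = inter[union > 0] / union[union > 0]            # skip absent classes
    return float(iou.mean())
```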

3.2 Datasets

CityScapes dataset The Cityscapes dataset (Cordts et al. 2016), an urban street-scene dataset, is a large-scale dataset containing stereo video sequences recorded in the streets of 50 different cities. In addition to a large set of 20,000 weakly annotated frames, it contains 5000 frames with high-quality pixel-level annotations. The Cityscapes dataset has two evaluation settings, fine and coarse: the former uses the 5000 finely labeled images, while the latter uses the 5000 fine labels plus the 20,000 coarse labels.

Microsoft COCO dataset Microsoft COCO (Lin et al. 2014) is a data set collected by a Microsoft team for image processing. There are five annotation types: object detection, keypoint detection, object segmentation, polygon segmentation and image captioning, all stored in JSON format. The COCO dataset contains more than 300,000 images and more than 2 million instances across more than 70 categories, with multiple objects in each image.

Pascal VOC dataset The PASCAL VOC Challenge covers sub-tasks such as object classification, object detection, object segmentation, human layout and action classification. The data set comprises the directories JPEGImages, Annotations, ImageSets, SegmentationObject and SegmentationClass. JPEGImages contains all images provided by PASCAL VOC, including training and test images. Annotations stores the label files in XML format, each corresponding to one picture in JPEGImages. ImageSets includes Action, Layout, Main and Segmentation, where Segmentation stores the image lists for the segmentation task. SegmentationObject and SegmentationClass hold the segmentation ground truth. PASCAL VOC 2007 contains 9963 labeled images with a total of 24,640 objects. The trainval/test split of PASCAL VOC 2012 (Everingham et al. 2012) contains all corresponding pictures from PASCAL VOC 2008 to PASCAL VOC 2010; trainval holds 11,540 images with a total of 27,450 objects. For the segmentation task, the trainval of VOC 2012 contains all corresponding pictures from PASCAL VOC 2007 to PASCAL VOC 2011, and the test set only contains pictures from PASCAL VOC 2008 to PASCAL VOC 2011.
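For reference, the VOC 2012 segmentation split can be loaded directly with torchvision's built-in dataset class; the local root path below is an assumption, and download=True fetches the trainval archive.

```python
from torchvision import datasets

voc = datasets.VOCSegmentation(root="data/voc", year="2012",
                               image_set="train", download=True)
image, mask = voc[0]  # PIL images: the RGB input and the class-indexed mask
```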

4 Experimental comparison and analysis

This section summarizes and analyzes recent semi- and weakly supervised semantic segmentation algorithms by data set. The three most frequently used data sets are considered: the CityScapes dataset, the Microsoft COCO dataset, and the Pascal VOC 2012 dataset.

4.1 Pascal VOC 2012 dataset

As is well known, Pascal VOC 2012 is the most commonly used dataset in image processing and in semantic segmentation in particular. The experimental results of methods using this dataset are shown in Table 5. Because so many methods have been evaluated on the VOC data sets, only results based on the two most commonly used backbone networks, VGG16 and ResNet, are shown here. Similarly, the two most commonly used and currently most representative evaluation metrics, val-mIoU and test-mIoU, are reported. From the experimental results of Lee et al. (2019) and Chaudhry et al. (2017), we can conclude that the same mechanism can achieve different effects on different DeepLab variants; for example, a ResNet backbone performs better than VGGNet. The effect of the dropout rate and the degree of supervision on the experimental results is additionally given in that paper and will not be repeated here. The implementation of Shen et al. (2018) is based on MXNet (Chen et al. 2015). Tang et al. (2018) made the CRF loss a universal performance-improvement mechanism that works effectively on several networks.

Table 5 Results on Pascal VOC 2012 dataset

4.2 CityScapes dataset

This section compares methods using the CityScapes dataset; the experimental comparison results are shown in Table 6. It can be seen from the results of Hung et al. (2018) that adding loss terms improves the experimental results, which is also in line with the comparisons in the previous section. As can be seen from Zhan et al. (2017), the performance of the model improves significantly after fine-tuning. However, the test result of Li et al. (2019a) is not satisfactory.

Table 6 Results on CityScapes validation set

4.3 Microsoft COCO dataset

Finally, the experimental results on the MS COCO dataset are analyzed, as shown in Tables 7, 8 and 9. Comparing the results, it can be found that combining the COCO dataset with other datasets achieves better training results than using the COCO dataset alone, and these results are almost uniformly better than the results obtained on any of the above datasets used alone. In addition to comparing weakly supervised results, the tables also compare them with fully supervised ones, and the gap between weak and full supervision turns out to be very small.

From the above comparison, we can draw three conclusions very intuitively. First, the field of semi- and weakly supervised semantic segmentation has produced many methods and achieved satisfactory results. Second, the results of weakly supervised learning are now comparable to those obtained by fully supervised learning under the same method. Third, focusing on mIoU, most values lie between 50 and 60%. Although the current results are satisfactory, they are still a long way from the ideal, and methods that significantly improve weakly supervised segmentation remain to be explored.

Table 7 Results on VOC 2012 + COCO
Table 8 Results on Pascal VOC and COCO
Table 9 Results on MS COCO 2014 validation

4.4 Segmentation results

This section visualizes the image segmentation results of some classic semi- and weakly supervised methods and gives a brief analysis of the different methods' segmentation quality. Table 10 shows the stage-wise segmentation results of point-level segmentation (Russakovsky et al. 2015). From the results, it can be concluded that point-level supervision successfully segments the objects in a picture; however, the segmentation is rough and the edges are not detailed enough. The performance of the decoupled deep neural network on different examples is shown in Table 11. It can be clearly seen that, although the results are not fine enough, the segmentation of object edges is more accurate, and the model is capable of recognizing and segmenting small objects. Table 12 shows the segmentation results of semi-supervised semantic segmentation using an adversarial learning network (Hung et al. 2018) under different loss functions. The results reflect a significant performance increase of this semi-supervised adversarial model compared with earlier years. However, while the segmentation is satisfactory on simple images, it is still poor on complex ones, especially when the objects in the image overlap and interleave. In addition, the batch size is an important factor affecting the performance of all deep-learning-based image segmentation (Goceri and Gooya 2018) and should therefore be chosen carefully.

Table 10 Point-level segmentation (Russakovsky et al. 2015)
Table 11 Decoupled deep neural network (Hong et al. 2015)
Table 12 Adversarial learning model (Hung et al. 2018)

5 Inspiration and conclusion

This paper reviewed semi-supervised and weakly supervised segmentation methods, focusing on the core content of the model architectures, working mechanisms and main functions. In general, although research on semi-supervised and weakly supervised learning for image segmentation has a long history, the number of studies in this area has soared and the attention it receives has increased significantly in recent years. This is because time-consuming and labor-intensive pixel-by-pixel annotation can no longer satisfy today's development needs, and people need more economical and efficient research methods. As this paper shows, research on semi-supervised and weakly supervised segmentation methods has made great progress: many studies have pushed single-object semantic segmentation toward multi-object instance segmentation, and even panoptic segmentation and counting. However, the analysis of the experimental results also shows that current methods still have shortcomings, and many aspects deserve further study.

  1. Although methods such as adding auxiliary mechanisms or redesigning the loss function can improve the performance of a segmentation model, the results obtained by current methods are still far from ideal. Therefore, future studies should focus on two aspects: continuously reducing the degree of supervision, and continuously improving the segmentation quality while tackling more complex tasks.

  2. The use of data sets shows that current semi-supervised and weakly supervised semantic segmentation usually takes natural, everyday scenes as the application background, which means the application background is relatively narrow. Fully supervised semantic segmentation, by contrast, has a rich application background, such as medical images (Zhao et al. 2009; Li et al. 2019a) and remote sensing images (Zhou et al. 2019; Liu et al. 2019). Subsequent research should therefore focus on how to apply weakly supervised semantic segmentation to a wider variety of tasks. Although some studies already use weakly supervised methods to segment medical images (Jia et al. 2017; Rajchl et al. 2016), it still takes a lot of effort to accurately segment complex medical images with a small number of annotations.

  3. Semantic segmentation is one of the basic tasks of remote sensing image processing; in other words, semantic segmentation of remote sensing images has great research value and practical significance. Many studies have used deep neural networks to segment remote sensing images (Kampffmeyer et al. 2016b; Wang et al. 2017; Zhang et al. 2017; Hamaguchi et al. 2017), but current remote sensing segmentation methods rely on fully supervised learning. Replacing full supervision with weakly supervised learning could effectively alleviate problems such as the scarcity of remote sensing image datasets and the waste of resources in collecting pixel-by-pixel annotations.

  4. As mentioned at the end of Sect. 2.2.2, counting can be performed simultaneously with segmentation. We therefore reason that other related tasks can be carried out while performing semantic segmentation, such as target behavior recognition and text interpretation, or replacing specified segmented objects with other objects to generate new images. In general, subsequent research can consider adding more practical value on top of segmentation.

  5. Finally, from our perspective, the study of weakly supervised learning paves the way for the ultimate realization of unsupervised learning while improving the efficiency of fully supervised learning. Research on unsupervised learning has never stopped, whether in image segmentation, other image fields, or even natural language processing, because completing tasks without any labels is the ideal state of machine learning.