
1 Introduction

The recent success of deep learning approaches for image segmentation in natural image analysis is generally supported by large-scale, fully annotated datasets. Although several deep-learning-based nucleus segmentation methods have been proposed  [8, 9, 11, 12], segmenting nuclei from pathological images remains challenging due to the limited training data with full nucleus masks. Full mask annotation is generally labor-intensive and time-consuming; by contrast, annotating nuclei with points is much easier.

Currently, only a few studies focus on the problem of segmenting nuclei with point supervision. To train a nucleus segmentation model with only point annotations, extra supervised information, including geometric diagrams and clustering labels, has been employed [2,3,4]. For example, Qu et al.  [3] proposed a weakly-supervised method for nucleus segmentation based on point annotations in H&E histopathology images, which extracts pixel-level labels using the Voronoi diagram and the k-means clustering algorithm. Chamanzar et al.  [4] further modified this method to detect and segment nuclei in immunohistochemistry (IHC) images by using local pixel clustering in every Voronoi sub-region and repel encoding. However, these methods pay no attention to the nucleus boundary. Differently, Nishimura et al.  [2] proposed a post-processing method that segments each individual nucleus with graph-cut after obtaining the nucleus region map; however, it is generally difficult for independent post-processing to rectify a large bias. Therefore, Yoo et al.  [5] extended a blob generation method trained with point supervision  [1] to nucleus segmentation, in which an auxiliary network helps the segmentation network recognize nucleus boundaries. For the same purpose, Qu et al.  [3] employed a dense CRF loss to refine the segmentation model.

Accordingly, we would like to develop a method that integrates the benefits of pixel clustering and boundary attention. In this paper, we propose a coarse-to-fine framework that progressively improves the segmentation performance in a self-stimulated learning manner. Specifically, to generate coarse segmentation masks, we employ a self-supervision strategy that uses clustering to perform binary classification. To avoid trivial solutions, our model is sparsely supervised by annotated positive points and geometrically constrained negative boundaries, via point-to-region spatial expansion and Voronoi partition. Then, to generate fine segmentation masks, the prior knowledge of edges in the unadorned image is additionally exploited by our proposed contour-sensitive constraint to further tune the nucleus contours. In this way, both coarse information (i.e., the rough mask generated by stimulated learning from point annotations) and contour information (i.e., the contours obtained from the unadorned image) are progressively integrated into the learning model through our rectified supervisions. Experiments show that our model trained with weakly-supervised data achieves competitive performance compared with the model trained with fully supervised data on the MoNuSeg and TNBC datasets.

Fig. 1. Framework of our proposed method.

2 Method

As shown in Fig. 1, our method has two major stages for training a fully convolutional network (FCN). The first stage obtains initial coarse nucleus masks for all training data with self-supervised learning and estimated distance maps. The second stage further refines the FCN with an additional contour constraint. At inference time, the trained FCN directly produces the segmentation masks.

2.1 Coarse Segmentation Estimation

Our target is to generate coarse segmentation masks in the first training stage. Intuitively, we can perform binary classification with clustering via self-supervised learning (i.e., deep clustering  [6]). However, typical clustering suffers from trivial solutions: an optimal decision boundary simply assigns all pixels to a single class. Point annotations provide the necessary positive pixels, but they are too sparse and cover only one class. Therefore, we first transform the point annotations into more informative supervision maps. With the generated supervision maps, we can train an FCN end-to-end to obtain the coarse segmentation masks.

Maps for Supervision. We denote the image as \(\mathbf{I} \) and the positive point annotation map as \(\mathbf{P} \). We intend to generate two distance maps that focus on reliable positive and negative pixels, respectively.

1) We propose a point distance map (i.e., \(\mathbf{D} \)) focusing on positive pixels with high confidence. We assume that the annotated point for each nucleus is near the center of the nucleus. We then apply a distance filter to the point annotations to dilate each dot into a local region with a decreasing response, which is considered reliable nucleus supervision, as shown in Fig. 2(c). Mathematically, each element \(d_{i,j}\) (i and j are the coordinates in the image space) of \(\mathbf{D} \) is calculated as

$$\begin{aligned} d_{i,j} = \max \left( 0,\ 1 - \alpha \sqrt{(i-m)^2 + (j-n)^2} \right) \end{aligned}$$
(1)

where m and n are the coordinates of the nearest positive point in the positive point annotation map \(\mathbf{P} \), and \(\alpha \) is a scaling parameter that controls the scale of the distribution. Note that a Gaussian-like filter could also be employed in our application to obtain the point distance map.
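
To make Eq. 1 concrete, below is a minimal Python sketch of the point distance map, assuming the annotations arrive as a binary mask `points` (1 at annotated pixels, 0 elsewhere); the helper name `point_distance_map` is ours, not from the paper.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def point_distance_map(points: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    """Dilate each annotated point into a local region with a linearly
    decreasing response: d = max(0, 1 - alpha * dist_to_nearest_point)."""
    # distance_transform_edt assigns each non-zero pixel its Euclidean
    # distance to the nearest zero pixel, so we invert the point mask to
    # measure distances TO the annotated points.
    dist = distance_transform_edt(points == 0)
    return np.maximum(0.0, 1.0 - alpha * dist)
```

The default \(\alpha = 0.05\) matches the value reported in Sect. 2.3.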

2) We propose another distance map, the Voronoi edge distance map (i.e., \(\mathbf{V} \)), focusing on negative pixels with high confidence. Since most nuclei are convex and roughly elliptical, the Voronoi diagram of the given set of points is an ideal partition of the plane into cells. We therefore employ the Voronoi diagram to obtain the partition edges, which are further dilated with a rapidly decreasing response using the distance filter (Eq. 1). This Voronoi edge distance map describes reliable negative pixels, as shown in Fig. 2(d).
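
On a pixel grid, the Voronoi partition can be obtained without an explicit geometric construction: each pixel takes the label of its nearest annotated point, and edges are the pixels where that label changes. The sketch below follows this route with SciPy; the function name is ours, and since the paper only states that the edge response decreases rapidly, reusing the same \(\alpha\) as for \(\mathbf{D}\) is our assumption.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, label

def voronoi_edge_distance_map(points: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    seeds, _ = label(points)  # one integer id per annotated point
    # Index arrays of the nearest seed pixel for every image pixel.
    inds = distance_transform_edt(seeds == 0, return_distances=False,
                                  return_indices=True)
    nearest = seeds[inds[0], inds[1]]  # per-pixel Voronoi cell id
    # Edge pixels are those whose right or lower neighbour falls in a
    # different Voronoi cell.
    edges = np.zeros_like(points, dtype=bool)
    edges[:, :-1] |= nearest[:, :-1] != nearest[:, 1:]
    edges[:-1, :] |= nearest[:-1, :] != nearest[1:, :]
    # Dilate the edges with the same distance filter as in Eq. 1.
    return np.maximum(0.0, 1.0 - alpha * distance_transform_edt(~edges))
```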

Fig. 2. Sparse supervision maps for segmentation.

First Stage Sparsely Supervised Learning. To perform the self-supervised learning, we employ a polarization loss to guide the update of the weights \(\mathbf{W} \) of the FCN (denoted as f). Denoting the output segmentation map as \(\mathbf{S} \), with probability values from 0 to 1, the polarization loss is calculated as

$$\begin{aligned} \mathcal {L}_{polar} (\mathbf{W} ) = \parallel f(\mathbf{I} ) - H(\mathbf{S} - 0.5) \parallel _F^2, \end{aligned}$$
(2)

where the Heaviside step function H rectifies the output segmentation map into a binary mask to realize the self-supervised learning. Note that H is not required to be differentiable, since we employ this function only to generate the pseudo segmentation mask.
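
A minimal TensorFlow sketch of Eq. 2 is shown below. Wrapping the Heaviside step in `tf.stop_gradient` reflects the remark above that H need not be differentiable: it only produces the pseudo mask. The function name is ours.

```python
import tensorflow as tf

def polarization_loss(s_pred: tf.Tensor) -> tf.Tensor:
    """s_pred = f(I): predicted probability map with values in [0, 1]."""
    # H(S - 0.5): binarize the prediction, blocking gradients through H.
    pseudo_mask = tf.stop_gradient(tf.cast(s_pred > 0.5, s_pred.dtype))
    # Squared Frobenius norm of the residual.
    return tf.reduce_sum(tf.square(s_pred - pseudo_mask))
```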

Fig. 3. Point-to-region coarse segmentation method.

Besides, we calculate two sparse losses, \(\mathcal {L}_{point}\) and \(\mathcal {L}_{voronoi}\), to guide the update of \(\mathbf{W} \). Since the two maps only cover part of the positive and negative pixels, the pixels without responses are unknown and should not be involved in the loss calculation. Therefore, the losses are sparsely calculated according to the following equations:

$$\begin{aligned} \mathcal {L}_{point} (\mathbf{W} ) = \parallel ReLU(\mathbf{D} ) \cdot (f(\mathbf{I} ) - \mathbf{D} ) \parallel _F^2, \end{aligned}$$
(3)
$$\begin{aligned} \mathcal {L}_{voronoi} (\mathbf{W} ) = \parallel ReLU(\mathbf{V} ) \cdot (f(\mathbf{I} ) - \mathbf{0} ) \parallel _F^2, \end{aligned}$$
(4)

where \(\cdot \) is the pixel-wise product, and the ReLU operation extracts the reliable weight mask for the sparse loss calculation. In this way, \(\mathcal {L}_{point}\) only focuses on the assured positive pixels, and \(\mathcal {L}_{voronoi}\) only focuses on the assured negative pixels.
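
A hedged TensorFlow sketch of Eqs. 3 and 4 follows: ReLU of each map acts as a per-pixel reliability weight, so unknown pixels (zero response) contribute nothing to the loss. The function names are ours.

```python
import tensorflow as tf

def point_loss(s_pred: tf.Tensor, d_map: tf.Tensor) -> tf.Tensor:
    # Eq. 3: penalize deviation from D only on reliable positive pixels.
    w = tf.nn.relu(d_map)
    return tf.reduce_sum(tf.square(w * (s_pred - d_map)))

def voronoi_loss(s_pred: tf.Tensor, v_map: tf.Tensor) -> tf.Tensor:
    # Eq. 4: push predictions toward the zero map on reliable negatives,
    # i.e. the target (f(I) - 0) reduces to the prediction itself.
    w = tf.nn.relu(v_map)
    return tf.reduce_sum(tf.square(w * s_pred))
```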

Generally, it is difficult to directly obtain satisfactory segmentation masks by training with such sparse constraints alone; what we obtain instead are initial segmentation maps that expand our point annotations. Therefore, we iteratively train the segmentation model with expanded point distance maps, which are updated by the latest trained model. The point distance map (i.e., \(\mathbf{D} \)) is updated according to Eq. 1, where the point annotation map \(\mathbf{P} \) is replaced with the estimated segmentation mask (i.e., \(\mathbf{S} _c\)) from the previous training round. This operation is repeated two additional times to achieve reliable segmentation masks. As shown in Fig. 3, the silhouette of the nucleus gradually becomes clear over multiple training rounds. Note that we employ the same Voronoi edge distance map for all three iterations. Importantly, because nuclei differ significantly in size across images, it is unreasonable to use a fixed-size disk (set to a nominal nucleus scale) as the nucleus area: small nuclei would then provide wrong "reliable" positive pixels. Our iterative scheme instead gradually fits a coarse segmentation that suits nuclei of different sizes, as summarized in the schematic loop below.
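
The point-to-region iteration can be summarized schematically as follows, assuming two hypothetical helpers of ours: `train_fcn` (one training round with the three losses) and `distance_filter` (Eq. 1 applied to an arbitrary binary map).

```python
# Schematic only: train_fcn and distance_filter are hypothetical helpers.
d_map = distance_filter(point_annotations)            # initial D from points
v_map = voronoi_edge_distance_map(point_annotations)  # V is fixed throughout
for _ in range(3):                                    # three training rounds
    fcn = train_fcn(images, d_map, v_map)             # L_point, L_voronoi, L_polar
    s_coarse = fcn.predict(images) > 0.5              # coarse mask S_c
    d_map = distance_filter(s_coarse)                 # expand D from latest S_c
```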

2.2 Contour Refinement

The contours of nuclei in the coarse segmentation are not accurate. We propose an additional contour-sensitive constraint to refine them.

Contour Map for Supervision. Based on the observation that the colors of nucleus pixels often differ from those of the surrounding background pixels, we can extract the apparent contours (not necessarily the complete contours of the nuclei) from the input images as additional supervision. Specifically, we first employ a Sobel operator to detect edges in the original images. Not surprisingly, the resulting edge map (i.e., \(\mathbf{E} \)) contains many noisy edges, as shown in Fig. 4(b). We then refine the edge map with the coarse segmentation mask (i.e., \(\mathbf{S} _c\)) obtained in the first stage to eliminate the unnecessary Sobel edges. The refined edge map (i.e., \(\mathbf{E} _r\)) is obtained as

$$ \begin{aligned} \mathbf{E} _r = (dilation(\mathbf{S} _c, k) - erosion(\mathbf{S} _c, k)) \& \mathbf{E} , \end{aligned}$$
(5)

where & is the pixel-wise AND operator, and \(dilation(\cdot , k)\) and \(erosion(\cdot , k)\) are morphological dilation and erosion with a k-pixel structuring element, respectively. Sample images can be found in Fig. 4.
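
A minimal scikit-image sketch of Eq. 5 is given below, under two assumptions of ours: the Sobel magnitude is binarized with an arbitrary threshold, and `s_coarse` is the binary coarse mask \(\mathbf{S} _c\). For binary masks, dilation minus erosion is the ring "dilation AND NOT erosion".

```python
import numpy as np
from skimage.filters import sobel
from skimage.morphology import binary_dilation, binary_erosion, disk

def refined_edge_map(image_gray: np.ndarray, s_coarse: np.ndarray,
                     k: int = 5) -> np.ndarray:
    edges = sobel(image_gray) > 0.1   # noisy Sobel edge map E (threshold is ours)
    # Ring of roughly 2k pixels around the coarse contours of S_c.
    ring = binary_dilation(s_coarse, disk(k)) & ~binary_erosion(s_coarse, disk(k))
    return ring & edges               # keep only Sobel edges near the contours
```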

Fig. 4. Supervision maps for contour refinement.

Second Stage Sparsely-Supervised Learning. To implement the supplementary boundary supervision, we add a contour-sensitive loss (i.e., \(\mathcal {L}_{contour}\)) to the existing losses to fine-tune the nucleus contours. As before, the supervision is applied sparsely using our generated contour map. The contour-sensitive loss is defined as

$$\begin{aligned} \mathcal {L}_{contour} (\mathbf{W} ) = \parallel ReLU(\mathbf{E} _r) \cdot (sobel(f(\mathbf{I} )) - \mathbf{E} _r) \parallel _F^2. \end{aligned}$$
(6)

Note that the sobel operation is differentiable, so \(\mathbf{W} \) can be optimized by backpropagation.
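
A hedged TensorFlow sketch of Eq. 6 follows. `tf.image.sobel_edges` is differentiable, so gradients flow back into \(\mathbf{W} \); collapsing its two directional responses into a single magnitude map is our interpretation of the sobel operation in Eq. 6.

```python
import tensorflow as tf

def contour_loss(s_pred: tf.Tensor, e_refined: tf.Tensor) -> tf.Tensor:
    """s_pred, e_refined: tensors of shape [batch, H, W, 1]."""
    grads = tf.image.sobel_edges(s_pred)        # [batch, H, W, 1, 2] (dy, dx)
    magnitude = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=-1) + 1e-8)
    w = tf.nn.relu(e_refined)                   # sparse reliability weights
    return tf.reduce_sum(tf.square(w * (magnitude - e_refined)))
```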

2.3 Implementation Details

During the whole training process, the segmentation model is a unified FCN, LinkNet  [7], and the different synergistic tasks with their corresponding losses are applied to the same model output. Our model is implemented in Keras with the TensorFlow backend. The scaling parameter \(\alpha \) is set to 0.05, and the parameter k for the morphological operations is set to 5.

In our weakly-supervised framework, we initialize the network with parameters pretrained on a natural image segmentation dataset. Because of the lack of training samples, random cropping, scaling, rotation, flipping, brightness, and gamma transformations are utilized for data augmentation. We randomly crop the input images to the size of 512 \(\times \) 512 for training. In every coarse segmentation iteration, we train the network for 200 epochs with \(\mathcal {L}_{point}\), \(\mathcal {L}_{voronoi}\), and \(\mathcal {L}_{polar}\) weighted by 1.0, 0.1, and 0.1, respectively. In the contour refinement stage, we update the network by introducing the additional loss \(\mathcal {L}_{contour}\) and refine the model for 50 epochs, with the final loss weights set to 0.01, 0.01, 0.01, and 1.0, respectively. We employ the Adam optimizer with a learning rate of 0.001 for both stages.
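
For illustration, the stage-wise objectives could be wired together as below, using the stated weights and the loss sketches from Sect. 2.1 and 2.2; how the losses are actually combined in the authors' implementation is our assumption.

```python
def total_loss_stage1(s_pred, d_map, v_map):
    # Coarse stage: weights 1.0 / 0.1 / 0.1 as stated above.
    return (1.0 * point_loss(s_pred, d_map)
            + 0.1 * voronoi_loss(s_pred, v_map)
            + 0.1 * polarization_loss(s_pred))

def total_loss_stage2(s_pred, d_map, v_map, e_refined):
    # Refinement stage: weights 0.01 / 0.01 / 0.01 / 1.0 as stated above.
    return (0.01 * point_loss(s_pred, d_map)
            + 0.01 * voronoi_loss(s_pred, v_map)
            + 0.01 * polarization_loss(s_pred)
            + 1.0 * contour_loss(s_pred, e_refined))
```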

3 Experiments

3.1 Datasets

We evaluate our proposed weakly-supervised framework on two independent nucleus segmentation datasets: MoNuSeg  [8] and TNBC  [9]. MoNuSeg consists of 30 images of size \(1000 \times 1000\), selected from the TCGA website and covering different cancer types from multiple hospitals. TNBC comprises 50 images of size \(512 \times 512\), extracted from slides of a cohort of Triple Negative Breast Cancer (TNBC) patients scanned with a Philips Ultra Fast Scanner 1.6RA. Both MoNuSeg and TNBC have pixel-level mask annotations, so we can generate the point annotations for the training set by calculating the central point (with a random bias) of each nucleus mask. We adopt ten-fold cross-validation for evaluation.
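
For reference, point annotations could be derived from the full masks as sketched below: the centroid of each instance plus a small random offset. The offset magnitude is our assumption; the paper does not specify the bias range.

```python
import numpy as np
from skimage.measure import label as sk_label, regionprops

def masks_to_points(instance_mask: np.ndarray, max_bias: int = 3) -> np.ndarray:
    """Binary point map: one (randomly biased) central point per nucleus."""
    points = np.zeros(instance_mask.shape, dtype=np.uint8)
    for region in regionprops(sk_label(instance_mask > 0)):
        r, c = region.centroid
        r = int(np.clip(r + np.random.randint(-max_bias, max_bias + 1),
                        0, points.shape[0] - 1))
        c = int(np.clip(c + np.random.randint(-max_bias, max_bias + 1),
                        0, points.shape[1] - 1))
        points[r, c] = 1
    return points
```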

Table 1. Ten-fold validation results on MoNuSeg and TNBC datasets
Table 2. Comparison of different iterations

3.2 Evaluation Metrics

We use four metrics for evaluation, including two pixel-level criteria (i.e., pixel-level IoU and F1 score) and two object-level criteria (i.e., object-level Dice coefficient  [10] and Aggregated Jaccard Index (AJI)  [8]). The detailed definitions of these metrics are provided in  [3, 8]. Note that the pixel-level F1 score is also known as the pixel-level Dice coefficient.

3.3 Results and Comparison

We compare our method with three weakly-supervised methods  [1, 3, 5]. Note that the results of [3] are obtained by running the provided code, while the results of [1, 5] are taken from the related paper  [5]. Furthermore, we train a fully supervised model to illustrate the upper bound of our method. As shown in Table 1, compared with all weakly-supervised methods, our method achieves the best segmentation performance on both datasets in terms of all evaluation criteria, except AJI on the MoNuSeg set. Moreover, our method achieves competitive results compared with the fully supervised model.

To illustrate the effect of the point-to-region stage and the contour refinement stage, Table 2 lists the results of each iteration. In the point-to-region stage, the accuracy gradually increases over the first three iterations, while the fourth iteration decreases the performance. This is because, when the positive segmentation results gradually reach (or even exceed) the nucleus scale, certain negative pixels are introduced into the positive point distance map according to Eq. 1, leading to an unreliable positive map. The last row of Table 2 shows that, after the contour refinement stage, the segmentation model fits the nucleus edges better, which further improves its effectiveness.

4 Conclusion

In this paper, we propose a weakly-supervised segmentation framework based on point annotations. We first train a sparse segmentation model through multiple iterations and then apply an additional contour-sensitive loss for contour refinement. In the experiments, our method obtains superior segmentation performance compared with state-of-the-art weakly-supervised methods using point supervision, which suggests the effectiveness of our proposed coarse-to-fine learning framework.