1 Introduction

With the rapid development and widespread popularity of artificial intelligence, deep learning is widely regarded as one of the most representative and promising methods. It has attracted considerable attention, as impressive results have been achieved with this approach in many fields in recent years [46, 52].

Histopathology plays an important role in the clinical diagnosis of breast diseases. Early diagnosis and adjuvant therapy are of great help to patients. Hematoxylin-eosin (HE) and immunohistochemical (IHC) staining techniques are widely used for the histopathological diagnosis of breast cancer. The former is used to observe pathological changes in tissues, and the latter is used to evaluate severity and to choose therapy methods [28]. In terms of appearance, HE-stained images are normally purple and red, while IHC-stained images are brown or blue. Examples of IHC- and HE-stained images are shown in Fig. 1.

Fig. 1

Samples of IHC- and HE-stained images

Reliable and automatic segmentation of breast cancer regions on HE- and IHC-stained images would be of considerable value for histopathological analysis. However, the task is challenging due to the large diversity in nuclei size, appearance and staining procedure. Although fully convolutional networks (FCNs) have achieved breakthrough performance on several biomedical image segmentation tasks [13], they are data-driven methods and require a large number of labelled images for training. Manually annotating images is very costly in terms of time and labor, especially in medical imaging, which requires specific domain knowledge. In addition, when FCNs are applied to a new domain, the segmentation performance usually drops dramatically because of the domain gap. In recent years, domain adaptation has been widely used to alleviate this problem: 1) Adversarial learning [16] makes two networks compete with each other. One is the generator network, which continually learns the probability distribution of the real images in the training set and transforms input random noise into new samples. The other is the discriminator network, which observes both real and generated data and judges whether each sample is real. Through this repeated competition, the abilities of the generator and the discriminator keep improving until an equilibrium is reached, at which point the generator can produce high-quality images. 2) Pseudo-labelling [23] uses a model trained on labelled data to predict the unlabelled data, filters the samples according to the prediction results, and feeds the retained samples back into the model for further training.

In this paper, we propose a novel domain adaptation framework for breast cancer segmentation on histopathological images. The aim is to train the model with labelled data in the source domain and unlabelled data in the target domain, so that the trained model adapts to and performs well in the target domain. The proposed method consists of three steps: 1) apply adversarial learning to segment target domain data; 2) select the target domain data with the highest prediction confidence and the greatest representativeness; 3) add the selected data with their corresponding pseudo-labels to the training set and refine the model with both source and target domain data.

In our work, we consider HE-stained images as the source domain (with labels available) and IHC-stained images as the target domain (no labels available), for two reasons: first, as stated before, HE-stained images are normally used for checking pathological changes in tissues, so it is easier to recognize and annotate cancer regions on them than on IHC-stained images; second, HE staining, as a conventional protocol, is easier to access in the daily diagnostic routine. Thus, a domain adaptation method that transfers from the HE domain to the IHC domain has greater practical value. We built both HE- and IHC-stained datasets with cancer regions delineated by pathologists and evaluated our proposed method on them. The experimental results show that, without any target-domain labels, our method achieves a Dice score of 0.846 on target domain images, a 1.8% improvement over the state-of-the-art method.

We propose a data selection method based on entropy sorting and representativeness sorting, and we use domain adaptation to segment cross-domain histopathology samples. We also analyze the influence of color augmentation on segmentation performance. The main contributions of this paper are summarized as follows:

  1. We propose a novel domain adaptation framework for histopathological breast cancer segmentation, which combines the advantages of adversarial learning and pseudo-labelling;

  2. We introduce a new data selection method that considers both prediction confidence and representativeness, add the selected target domain data with their corresponding pseudo-labels to the training set, and then refine the model, which further improves the segmentation performance in the target domain;

  3. We evaluate our proposed method on private HE- and IHC-stained datasets. The experimental results demonstrate the effectiveness of each component in the framework, and our method also outperforms the state-of-the-art domain adaptation method.

The rest of this paper is organized as follows: related works are introduced in Section 2. In Section 3, we describe our proposed method in detail. The experimental results are presented in Section 4. Finally, we conclude and discuss future work in Section 5.

2 Related works

In clinical examinations, certain physical characteristics, such as the face, gait and iris, can reflect a patient's health condition [1, 2].

Diagnosis of breast cancer requires strong expertise: pathologists spend a lot of time evaluating tissue sections under the optical microscope for manual morphological assessment and tumor grading. A fast, accurate and robust diagnostic algorithm for breast tumor pathology slides is therefore urgently needed. Computer technology is widely used in medical diagnostics because of its high accuracy, low cost and robustness, and traditional algorithms have been applied to the segmentation of pathological tissue images. Pathology is an image-based discipline, generally studied with light microscopy. For example, Qu et al. [30] proposed a pixel-based SVM classifier that performs well in segmenting HE-stained histopathological images of invasive ductal carcinoma (IDC). Boykov [5] proposed using graph cuts to find globally optimal segmentations of N-dimensional images. The multiphase level set framework proposed by Vese [47] segments various structures in different types of histopathological images well. Taher et al. [40, 41] proposed a computer-aided diagnosis (CAD) system based on Bayesian classification for the early detection of lung cancer. Zheng [54] proposed the Gabor Cancer Detection method for breast cancer detection. Leng et al. [24] proposed a lightweight network framework. Tahmoush [42] proposed an image similarity-based method for breast cancer diagnosis. ROI segmentation in breast cancer pathology images has been studied extensively: Kong et al. [20, 21] proposed expectation-maximization ROI segmentation methods based on color texture features, Ruiz et al. [33] proposed an efficient GPU implementation for the segmentation of neuroblastoma images, and Foran [15] implemented a fast and accurate image segmentation algorithm. A delineation algorithm and a learning-based tumor region segmentation approach that uses multi-scale texton histograms have also been introduced. Lahoura proposed a cloud-computing-based breast cancer diagnostic framework built on an extreme learning machine [22]. Nguyen et al. [29] proposed a method for the automatic segmentation and classification of glandular tissue based on morphological features. However, these methods rely on the prior knowledge of engineers and generalize poorly.

The diagnosis of breast cancer requires precise localization of the pathological area, the lesion area and potential lesion areas. In recent years, deep learning has been widely used for various image tasks, such as tracking [50], detection [19], classification [49] and diagnosis [43]. It has achieved impressive results in segmentation and is widely used to localize various diseases. Beeravolu et al. [4] preprocessed breast cancer images and created datasets for deep CNNs. Malebary proposed an automatic breast mass classification system based on deep learning and ensemble learning [27]. Bayramoglu et al. [3] proposed a single-task CNN to predict malignancy and a multitask CNN to predict malignancy and image magnification simultaneously. Recent developments have shown that, in many cases, features learned by deep networks from large datasets outperform hand-designed features for specific tasks, and in some cases deep learning can even reach human-level performance. However, these existing methods rely heavily on manually labeled data, whose annotation is tedious and time-consuming.

Moreover, segmentation networks demand large amounts of data and segmentation masks are very laborious to annotate, so self-learning segmentation approaches are of great research importance. Saber applied transfer learning to the automatic detection and classification of breast cancer [34]. Shen et al. [35] combined deep active learning and the self-paced learning paradigm into a new framework for large-scale detection learning. The approach speeds up model training with fewer annotated samples: active learning greatly reduces the physicians' annotation work, and self-paced learning mitigates data ambiguity. However, the approach still requires partial labeling and multiple rounds of training. In terms of domain adaptation, fine-tuning an existing model on a small sample of the target domain is a widely used method, and recent studies [9, 18, 26, 38, 44, 45, 48] have used labeled source domain data and unlabeled target domain data to transfer a model from a supervised source domain to an unsupervised target domain. Vu et al. [48] proposed entropy minimization to reduce inter-domain differences, but this method relies on the initial entropy predictions, which can lead to limitations. In medical imaging, the use of adversarial generative learning is increasing. Courty et al. [9] performed domain adaptation based on a distance between data distributions. Tzeng et al. [45] used maximum mean discrepancy (MMD) to minimize the difference in data distributions between the source and target domains, and Sun et al. [38] minimized this difference using a correlation distance. With the emergence of generative adversarial networks [16], domain adaptation can be learned implicitly by using an adversarial loss to minimize the domain shift [26]. Tsai et al. [44] proposed multi-level adversarial learning with game road scene data as the source domain and real road scene data as the target domain to adapt between road scenes, and achieved good results on the segmentation of real road scenes. Kamnitsas [18] used adversarial networks to segment MR data from patients with traumatic brain injury, achieving good results. Dou et al. [12] optimized a domain adaptation module and a domain critic module through an adversarial loss to adapt models trained on MRI images to unpaired CT data for cardiac structure segmentation. Unfortunately, despite the importance of this direction, it has progressed slowly for breast tumor sections owing to the lack of datasets, and the accuracy is still unsatisfactory. Hence, we propose two data selection methods and combine them with pseudo-labelling.

3 Materials and methods

3.1 Overview of the framework

Cancer region segmentation plays an important role in the diagnosis of breast cancer, as it assists in assessing the severity of the cancer according to the proportion of the cancer area. Although deep neural networks such as U-net [32], FCN [25], Deeplab_v2 [8], Pspnet [53] and Linknet [7] have achieved promising results in many segmentation tasks, when a model trained with a deep convolutional neural network is applied to a new domain, its segmentation performance drops dramatically due to the domain gap. Manually annotating images in the new domain from scratch is very time- and labor-consuming, especially in medical imaging, which requires expert knowledge. In this paper, we propose a novel domain adaptation framework for segmenting cancer regions on IHC-stained images when only annotations on HE-stained images are available. The framework is divided into three stages: 1. adversarial learning for domain adaptation; 2. target domain data selection based on entropy and representativeness ranking; 3. model refinement with the selected data and pseudo labels.

3.2 Adversarial learning

The adversarial learning framework consists of one segmentor and two discriminators. The segmentor is a standard segmentation network, such as Linknet or Deeplab. The purpose of the segmentor is to generate segmentation outputs for both the source and target domains, while the discriminators play an adversarial role, distinguishing whether these outputs come from the source domain or the target domain. During training, the segmentor is also trained to fool the discriminators, so it gradually produces similar outputs for the source and target domains, ignoring domain-specific information. In our work, we attach the two discriminators to the outputs of the last two layers of the segmentation network to take multi-scale image information into account, which is usually helpful for the segmentation task. The network structure is shown in Fig. 2. The outputs of these last two layers are used to calculate the losses.

Fig. 2

Network structure of adversarial learning for domain adaptation

Specifically, we use the Deeplab_v2 structure as the segmentation network, with ResNet101 [17] (pre-trained on the ImageNet dataset [10]) as the backbone. The classification layer of ResNet101 is discarded, and the stride of the last two convolution layers is set to 1, so that the output feature maps are 1/8 of the input image size. In the conv4 and conv5 layers, dilated convolutions with dilation rates of 2 and 4, respectively, are used to enlarge the receptive field. After the last layer, we use an Atrous Spatial Pyramid Pooling (ASPP) module to encode multi-scale information in the feature maps, followed by an upsampling layer with softmax output, which upsamples the output to the input dimensions. The cross-entropy loss with respect to the source ground truth gives the segmentation loss Lseg.

$${L}_{seg}=-{G}_s\mathit{\log}\left({P}_s\right)$$
(1)

where Gs is the ground truth annotation of the source domain data S, and Ps is the output of the segmentation network for the source domain data S.
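As a minimal sketch of Eq. (1), the NumPy snippet below computes the pixel-wise cross-entropy between a softmax output Ps and one-hot ground truth Gs. Averaging over pixels is our assumption, since Eq. (1) leaves the reduction over positions implicit; in a PyTorch training loop the same quantity would normally come from the framework's built-in cross-entropy loss.

```python
import numpy as np


def segmentation_loss(probs, labels, eps=1e-8):
    """Pixel-wise cross-entropy of Eq. (1), averaged over all pixels.

    probs:  (H, W, C) softmax output P_s of the segmentor.
    labels: (H, W) integer class map, converted below to the one-hot G_s.
    """
    num_classes = probs.shape[-1]
    one_hot = np.eye(num_classes)[labels]  # G_s with shape (H, W, C)
    pixel_loss = -np.sum(one_hot * np.log(probs + eps), axis=-1)
    return float(pixel_loss.mean())  # mean over pixels is an assumed reduction


# Toy 2 x 2 image with two classes (background = 0, cancer = 1).
probs = np.full((2, 2, 2), 0.5)
labels = np.array([[0, 1], [1, 0]])
print(round(segmentation_loss(probs, labels), 4))  # 0.6931, i.e. log(2)
```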

For the discriminators, we use fully convolutional layers to retain spatial information. Each discriminator is composed of 5 convolutional layers with 4 × 4 kernels and stride 2, with 64, 128, 256, 512 and 1 channels, respectively. Each of the first four layers is followed by a leaky ReLU with a negative slope of 0.2. After the last convolutional layer, no upsampling layer is used, and the discriminator output is computed directly.
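The snippet below traces the feature-map sizes through the five discriminator layers for the 512 × 512 inputs used in Section 4. It is a sketch under one assumption not stated in the text: each 4 × 4, stride-2 convolution uses a padding of 1. The leaky ReLUs do not change the shapes and are therefore omitted.

```python
def conv_output_size(size, kernel=4, stride=2, padding=1):
    """Output size of a convolution: floor((size + 2 * padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


# Output channels of the five discriminator layers, as listed in the text.
CHANNELS = [64, 128, 256, 512, 1]


def discriminator_shapes(input_size=512, padding=1):
    """(channels, spatial size) after each 4 x 4, stride-2 layer; padding=1 is assumed."""
    shapes, size = [], input_size
    for channels in CHANNELS:
        size = conv_output_size(size, padding=padding)
        shapes.append((channels, size))
    return shapes


print(discriminator_shapes())
# [(64, 256), (128, 128), (256, 64), (512, 32), (1, 16)]
```

Under these assumptions each discriminator emits a 16 × 16 single-channel map, so the positions (h, w) in Eqs. (2) to (4) index cells of that coarse map rather than pixels of the input image.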

As stated before, the discriminator D is used to determine whether its input comes from the source domain or the target domain. Source-domain inputs are assigned label 1, and the cross-entropy between the discriminator prediction for the source domain and this label is used as the discriminator loss Ls, D.

$${L}_{s,D}=-\mathit{\log}\left(D{\left({P}_s\right)}^{\left(h,w,1\right)}\right)$$
(2)

where h and w index the spatial positions of the discriminator output. Target domain data T are passed through the same segmentation network to predict the softmax output Pt. We feed Pt through the discriminator D, and the cross-entropy between D(Pt) and label 0 is calculated as the adaptive loss Lda.

$${L}_{da}=-\mathit{\log}\left(D{\left({P}_t\right)}^{\left(h,w,0\right)}\right)$$
(3)

The adversarial loss Lt, D is obtained by calculating the cross-entropy between the discriminator output for the target domain and label 1.

$${L}_{t,D}=-\mathit{\log}\left(D{\left({P}_t\right)}^{\left(h,w,1\right)}\right)$$
(4)

The generator loss L(S, T) consists of two terms: the segmentation network is optimized with the segmentation loss Lseg and the adversarial loss Lt, D, where λda is the weight balancing the two losses.

$${L}_{\left(S,T\right)}={L}_{seg}+{\lambda}_{da}{L}_{t,D}$$
(5)

To optimize the discriminators, we feed the segmentation outputs into the fully convolutional discriminators, which distinguish the source and target domains using the cross-entropy loss LD:

$${L}_D={L}_{s,D}+{L}_{da}$$
(6)

The segmentor and the discriminators are trained simultaneously. Based on these two levels of adversarial learning, we constantly optimize the segmentation network and discriminators to improve the segmentation ability of the network to the target domain images.
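To make Eqs. (2) to (6) concrete, the NumPy sketch below evaluates the discriminator and generator losses on given discriminator probability maps. It assumes that D ends in a sigmoid, so that D(P)^(h,w,1) is the map value and D(P)^(h,w,0) is one minus it, and it averages over positions. The value of λda in the example is a hypothetical placeholder, since the text does not give it.

```python
import numpy as np

EPS = 1e-8  # keeps log() finite when D outputs exactly 0 or 1


def bce_map(d_out, label):
    """Mean cross-entropy between a discriminator probability map and a constant label.

    d_out is assumed to be the sigmoid output of D (probability of "source"),
    so D(P)^(h,w,1) = d_out and D(P)^(h,w,0) = 1 - d_out.
    """
    d_out = np.clip(d_out, EPS, 1.0 - EPS)
    prob_of_label = d_out if label == 1 else 1.0 - d_out
    return float(-np.log(prob_of_label).mean())


def discriminator_loss(d_source, d_target):
    """L_D = L_{s,D} + L_da (Eq. 6)."""
    l_sd = bce_map(d_source, 1)  # Eq. (2): source outputs labelled 1
    l_da = bce_map(d_target, 0)  # Eq. (3): target outputs labelled 0
    return l_sd + l_da


def generator_loss(l_seg, d_target, lambda_da):
    """L_(S,T) = L_seg + lambda_da * L_{t,D} (Eq. 5)."""
    l_td = bce_map(d_target, 1)  # Eq. (4): target outputs labelled 1 to fool D
    return l_seg + lambda_da * l_td


# A maximally uncertain discriminator (0.5 everywhere) on 16 x 16 output maps.
d_s = np.full((16, 16), 0.5)
d_t = np.full((16, 16), 0.5)
print(round(discriminator_loss(d_s, d_t), 4))     # 1.3863, i.e. 2 * log(2)
print(generator_loss(0.3, d_t, lambda_da=0.001))  # lambda_da is a placeholder value
```

A common schedule, assumed here rather than taken from the text, alternates two updates per iteration: the segmentor is updated with generator_loss while D is held fixed, and D is then updated with discriminator_loss on detached segmentation outputs.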

3.3 Data selection

After adversarial learning, the model can produce reasonable segmentation results for the target domain. Next, we automatically select some target domain data and refine the model with the selected data and their pseudo-labels to further strengthen the model's capability in the target domain. Specifically, among all the unannotated images Iu, our task is to select a subset of M images, Im ⊆ Iu, that has high prediction confidence and is also representative of Iu. First, we rank all the unlabelled images in the target domain by their image entropy and select the k images with the lowest entropy (k = 60 in this paper) to form Ik ⊆ Iu; we then refine Ik with representative selection to obtain the subset Im. Finally, the model predictions are taken as pseudo-labels, and the selected data and pseudo-labels are added to the training set.

3.3.1 Entropy rank

We feed all the target domain data through the segmentation network and calculate the entropy map Et of each image. We rank the data in a simple way, by directly calculating the mean value ST of the entropy map:

$${S}_T=\frac{1}{HW}\sum\nolimits_{h,w}{E_t}^{\left(h,w\right)}$$
(7)

Normally, images with a low ST tend to have higher confidence and less noise, and thus better predictions. Therefore, we select the 40% of images with the lowest ST to form the highest-confidence subset Ik, whose predictions can then be added to the segmentation network as pseudo labels to obtain a more favorable learning effect. The computational cost of the proposed entropy rank is determined only by h and w, so its time complexity is O(h × w) per image. Since the feature maps are no larger than 512 × 512, this cost is always acceptable and can be regarded as O(n), i.e., linear in the number of pixels.
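The entropy rank can be sketched in a few lines of NumPy. The text does not spell out how Et is computed, so we assume the pixel-wise Shannon entropy of the softmax output; normalizing it by log C, as some methods do, would rescale every ST by the same constant and leave the ranking unchanged. select_low_entropy is a hypothetical helper name.

```python
import numpy as np


def entropy_map(probs, eps=1e-8):
    """Pixel-wise entropy E_t^(h,w) = -sum_c P^(h,w,c) log P^(h,w,c) for probs of shape (H, W, C)."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)


def mean_entropy(probs):
    """S_T of Eq. (7): the entropy map averaged over all H * W positions."""
    return float(entropy_map(probs).mean())


def select_low_entropy(prob_maps, ratio=0.4):
    """Indices of the `ratio` fraction of images with the lowest S_T, most confident first."""
    scores = np.array([mean_entropy(p) for p in prob_maps])
    k = int(round(ratio * len(prob_maps)))
    return np.argsort(scores)[:k].tolist()


# Toy target set: five 4 x 4 two-class softmax maps with different confidence.
maps = []
for conf in [0.99, 0.6, 0.95, 0.55, 0.8]:
    p = np.empty((4, 4, 2))
    p[..., 0] = conf
    p[..., 1] = 1.0 - conf
    maps.append(p)
print(select_low_entropy(maps))  # [0, 2]
```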

3.3.2 Representative selection

The data selected by the entropy rank are believed to have a high confidence level. However, these data may be similar to each other, for two main reasons: 1. the selected patches could come from a single patient; 2. patches from different patients could also look similar. On the other hand, the selected images are expected to cover as many different characteristics as possible to maximize their usefulness. Thus, we propose a representativeness criterion to select the most representative images from Ik.

Similar to [14, 51], we use the Max-cover [14] method to calculate the representativeness of Im for Iu. First, the representation of an image Oj ∈ Iu by Im is defined as \(f\left({I}_m,{O}_j\right)={\mathit{\max}}_{O_i\in {I}_m}\left( sim\left({O}_i,{O}_j\right)\right)\), where sim(Oi, Oj) is the estimated similarity between Oi and Oj. In other words, Oj is represented by its most similar image in Im, measured by sim(Oi, Oj). We define the representativeness of Im for Iu as \(F\left({I}_m,{I}_u\right)={\sum}_{O_j\in {I}_u}f\left({I}_m,{O}_j\right)\); the larger F(Im, Iu) is, the better Im represents all the images in Iu. To find the subset Im ⊆ Iu that maximizes F(Im, Iu), we should ensure that: 1. the selected images are similar to many unannotated images in Iu; 2. the selected images cover different conditions (e.g., adding two slices of the same patient does not significantly increase F(Im, Iu)). We use an approximate greedy [25] method: in each iteration, the candidate Oi from Ik that maximizes F(Im ∪ {Oi}, Iu) is added to Im, until the number of images in Im reaches the preset value M.

To calculate the similarity between Oi and Ox, we take the output of the last block of the network as high-level features, average each channel over spatial positions (channel-wise mean) to obtain feature vectors yi and yx, and use their cosine similarity as the similarity measure sim(Oi, Ox). Each similarity therefore reduces to a dot product between feature vectors; benefiting from strong parallel computation, the time cost of the representative selection approach can be regarded as O(1).

$$sim\left({O}_i,{O}_x\right)={y}_i\bullet {y}_x^T/\left(\parallel {y}_i\parallel \bullet \parallel {y}_x\parallel \right)$$
(8)
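The greedy Max-cover selection and the cosine similarity of Eq. (8) can be sketched as follows. The sketch assumes that features have already been extracted and averaged channel-wise into one vector per image; the function names and the toy two-cluster data are illustrative only.

```python
import numpy as np


def channel_mean(feature_map):
    """Average a (C, H, W) feature map over spatial positions, giving the (C,) vector y."""
    return feature_map.mean(axis=(1, 2))


def cosine_similarity_matrix(feats_a, feats_b):
    """sim(O_i, O_x) of Eq. (8) for every pair of rows: y_i . y_x / (||y_i|| ||y_x||)."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    return a @ b.T


def greedy_max_cover(cand_feats, all_feats, m):
    """Greedily pick m candidates maximizing F(I_m, I_u) = sum_j max_{i in I_m} sim(O_i, O_j).

    cand_feats: (K, D) features of the candidate set I_k.
    all_feats:  (N, D) features of all unannotated images I_u.
    Returns candidate indices in the order they were selected.
    """
    sim = cosine_similarity_matrix(cand_feats, all_feats)  # (K, N)
    m = min(m, sim.shape[0])
    selected = []
    best_cover = np.full(all_feats.shape[0], -np.inf)  # f(I_m, O_j) for the current I_m
    for _ in range(m):
        best_score, best_idx = -np.inf, None
        for i in range(sim.shape[0]):
            if i in selected:
                continue
            score = np.maximum(best_cover, sim[i]).sum()  # F(I_m ∪ {O_i}, I_u)
            if score > best_score:
                best_score, best_idx = score, i
        selected.append(best_idx)
        best_cover = np.maximum(best_cover, sim[best_idx])
    return selected


# Toy features: indices 0-2 form one cluster, indices 3-5 another.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.05],
                  [0.0, 1.0], [0.1, 0.9], [0.05, 1.0]])
print(greedy_max_cover(feats, feats, 2))  # one index from each cluster
```

In the pipeline of Section 3.3, cand_feats would hold the features of the k = 60 low-entropy images in Ik, all_feats those of every image in Iu, and m the preset M (30 in Section 4.4). Tracking best_cover, the current value of f(Im, Oj) for every Oj, means each candidate's score needs only one element-wise maximum over the precomputed similarity matrix.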

3.4 Model refinement

After data selection, the selected images are supposed to have high confidence and representativeness among all the unannotated target domain images, which can make the model better for domain adaptation learning.

The selected images and their pseudo-labels were added to the original source training set, and a separate segmentation network was trained on this combined set. The network structure is shown in Fig. 3.

Fig. 3

The framework of adding pseudo labels in training the segmentation network

4 Experiments and discussion

The proposed framework was implemented in PyTorch. The learning rate of the segmentation network λseg was set to 0.0025 and that of the discriminators λD to 0.0001; the weighting coefficient between the losses of the last two layers was 0.1; the batch size was 1 for both the source and target domains; and the input image size was 512 × 512.

In order to improve the generalization ability of the model, we use the albumentations library [6] for data augmentation, including horizontal and vertical flips, random cropping, random rotation and random color distortion. The color distortion consists of saturation, brightness and contrast adjustments. We also explored the effectiveness of color augmentation (e.g., adjusting the hue of images) for cross-domain segmentation, which will be discussed later.

A key problem in this paper is how to learn sufficient information from imbalanced data and dimension reduction [31, 37, 39]. The proposed method is evaluated on an in-house histopathology breast cancer dataset containing 400 patch images extracted from hot-spot areas of whole slide tissue images (WSIs) from over 100 patients, including 200 IHC and 200 HE patches. The IHC staining protocols include estrogen receptor (ER) and progesterone receptor (PR). For both HE- and IHC-stained images, we use 75% of the data (150 images) as the training set and the rest (50 images) as the validation set, with no patient overlap between the training and validation sets. All images were resized to 0.848 μm/pixel, which is equivalent to the 10 × objective magnification of a standard microscope. All images used in this paper were manually annotated by junior pathologists and then reviewed and modified by senior pathologists, following standard clinical procedures.
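The patient-disjoint 75/25 split described above can be sketched as follows. Patient identifiers are not part of the published description, so the patch-to-patient mapping, the shuffling seed and the greedy fill rule are hypothetical choices for illustration; the toy data simply assume two patches per patient.

```python
import random
from collections import defaultdict


def patient_level_split(patch_ids, patient_of, train_fraction=0.75, seed=0):
    """Split patches into train/validation lists with no patient in both.

    patient_of maps each patch id to a (hypothetical) patient id. Whole patients
    are shuffled and assigned to training until it holds about train_fraction of
    the patches; all remaining patients go to validation.
    """
    by_patient = defaultdict(list)
    for pid in patch_ids:
        by_patient[patient_of[pid]].append(pid)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    target = round(train_fraction * len(patch_ids))
    train, val = [], []
    for patient in patients:
        bucket = train if len(train) < target else val
        bucket.extend(by_patient[patient])
    return train, val


# Toy data: 200 patches from 100 patients, two patches each.
ids = [f"patch{i}" for i in range(200)]
owner = {pid: f"patient{i // 2}" for i, pid in enumerate(ids)}
train, val = patient_level_split(ids, owner)
print(len(train), len(val))  # 150 50
```

Assigning whole patients rather than individual patches is what guarantees the absence of patient overlap; with unequal patch counts per patient, the training share is only approximately 75%.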

For quantitative evaluation, we calculate the Dice score between the predicted segmentation results and the ground truth:

$$Dice=\frac{2\left|P\cap G\right|}{\left|P\right|+\left|G\right|}$$
(9)

where ∩ denotes the intersection operation, and P and G represent the predicted region and the ground truth, respectively. Dice is a commonly used measure for evaluating segmentation accuracy [11, 36]. Its value lies in the range [0, 1], where high Dice values indicate good segmentations and low values may indicate segmentation failures.
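A minimal NumPy version of Eq. (9) for binary masks is shown below. Returning 1.0 when both masks are empty is a convention we assume here; the text does not say how such cases are scored.

```python
import numpy as np


def dice_score(pred, gt):
    """Dice = 2|P ∩ G| / (|P| + |G|) for binary masks of the same shape (Eq. 9)."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    if denom == 0:  # both masks empty: treated as a perfect match (assumed convention)
        return 1.0
    return float(2.0 * intersection / denom)


pred = np.array([[1, 1, 0, 0]])
gt = np.array([[1, 0, 0, 0]])
print(round(dice_score(pred, gt), 4))  # 0.6667
```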

4.1 Naive segmentation without adaptation

In order to investigate the influence of the domain gap on image segmentation, we conduct two experiments: 1) training on HE images only (with label masks); 2) training on IHC images only (with label masks), and we evaluate both on the HE and IHC validation sets. We experimented with three different networks, namely Deeplab_v2, Pspnet and Linknet. The results without color augmentation are shown in Fig. 4.

Fig. 4

The segmentation results based on different domains. (a) and (b): networks trained on HE- and IHC-stained images, respectively

The model trained on HE-stained images performs very poorly on IHC-stained images, whereas the model trained on IHC-stained images predicts HE-stained images better. The main reasons are as follows: 1. HE-stained images are almost entirely red or purple (see Fig. 1b), so the model easily overfits to HE images, while the color range of IHC-stained images is wider because it contains more kinds of colors (e.g., brown and blue), which implicitly strengthens the model's generalization to the other domain. 2. The spatial feature distribution of HE-stained images is simpler than that of IHC-stained images, so the model learns fewer features from HE-stained images than from IHC-stained images. 3. HE-stained images are naturally used for observing organ tissues, so cancer regions are easier to segment from visual appearance on HE-stained images than on IHC-stained images.

4.2 The effectiveness of color augmentation

IHC-stained images are mainly brown and blue, and HE-stained images are mainly purple. If the random hue range exceeds 40%, the color range of HE-stained images can be regarded as a subset of the color range of IHC-stained images after color augmentation, so models trained on IHC-stained images generally segment HE-stained images satisfactorily. Unfortunately, after color augmentation the color range of HE-stained images still cannot cover that of IHC-stained images, so the transfer segmentation performance from HE to IHC is poor. Hence, in order to improve the robustness of our method, we must ensure that the color range of IHC-stained (HE-stained) images after color augmentation does not contain any information of HE-stained (IHC-stained) images.

As explained above, the main reason for the poor segmentation performance in a different domain is the color discrepancy between HE- and IHC-stained images. Thus, we conducted another experiment to evaluate the effectiveness of color augmentation, using the same training set with color perturbations of different degrees and testing the model on the same validation set. Specifically, we applied color augmentation by adjusting the hue of the image. The hue value was randomly chosen from the range [0, H], where a larger H means heavier color augmentation. Figure 5 shows examples of HE- and IHC-stained images with color augmentation (hue = 0.2). Linknet was used as the segmentation network. Figure 6 shows the segmentation performance under different degrees of color augmentation (hue from 0 to 0.5).
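The hue perturbation can be sketched with Python's standard colorsys module. We assume that the hue value is a fraction of the color wheel (so 0.5 is a half turn) and that the shift is drawn uniformly from [0, H] as stated above; augmentation libraries often sample from a symmetric range instead, so this is an illustration rather than the exact albumentations transform used in the experiments.

```python
import colorsys
import random


def shift_hue(image, delta):
    """Rotate the hue of every pixel by delta (a fraction of the color wheel, wrapped mod 1).

    image: nested list [row][col] of (r, g, b) floats in [0, 1].
    """
    shifted = []
    for row in image:
        new_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            new_row.append(colorsys.hsv_to_rgb((h + delta) % 1.0, s, v))
        shifted.append(new_row)
    return shifted


def random_hue_augment(image, max_hue, rng=random):
    """Draw a hue shift uniformly from [0, max_hue] (the H above) and apply it."""
    return shift_hue(image, rng.uniform(0.0, max_hue))


# A pure red pixel rotated by a third of the wheel becomes (approximately) pure green.
print(shift_hue([[(1.0, 0.0, 0.0)]], 1.0 / 3.0))
```

Saturation and value are left untouched, which is consistent with the observation below that a hue shift alone can move purple toward blue but cannot reproduce every IHC color.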

Fig. 5

Examples of color changes applied in the data augmentation pipeline

Fig. 6

Segmentation performance under different degrees of color augmentation

The experimental results show that color augmentation has a positive effect on cross-domain segmentation. Increasing the hue range gradually improves the orange-bar scenario (training on HE and validating on IHC): for example, increasing the hue range from 0.1 to 0.5 improves the average Dice from 0.261 to 0.703. The effect is less obvious for the blue-bar scenario (training on IHC and validating on HE).

IHC-stained images are mainly brown and blue, and HE-stained images are mainly purple. Adjusting the hue of blue images can shift them close to purple, which helps the model to predict HE-stained images. However, adjusting the hue of purple HE-stained images can shift them close to blue only, not close to brown. Therefore, it is difficult for the model to predict the brown IHC-stained images (see Fig. 1), and the segmentation performance on IHC-stained images of the model trained on HE-stained images remains unsatisfactory (only 0.703) even with the heaviest color augmentation.

4.3 The effectiveness of adversarial learning

Firstly, we use the labeled IHC-stained images as the training set, and train a Deeplab_v2 network model. In this way, the Dice of the model trained by fully supervised learning can reach 89.2% on the IHC-stained validation set, which is the upper bound of domain adaptation from HE to IHC.

Then, we apply the adversarial learning described in Section 3.2 to the domain adaptation task from HE-stained to IHC-stained images (see Table 1). In other words, HE-stained images are used as the labeled source domain, and IHC-stained images as the unlabeled target domain. Our model achieves a Dice value of 88.1% on the source-domain validation set and 82.8% on the target-domain validation set. Compared with the upper bound of 89.2%, the remaining room for improvement in the HE-to-IHC domain adaptation task is 6.4 percentage points.

Table 1 The results of adversarial learning (HE as the source domain, IHC as the target domain)

Similarly, we use the labeled HE-stained images as the training set with the Deeplab_v2 network. The model trained by fully supervised learning reaches a Dice value of 88.7% on the HE-stained validation set, which is the upper bound of domain adaptation from IHC to HE.

We use the same network structure for the domain adaptation task from IHC-stained to HE-stained images (see Table 2), with IHC-stained images as the source domain and HE-stained images as the target domain. This model achieves a Dice value of 88.2% on the target-domain validation set and 86.4% on the source-domain validation set. Compared with the upper bound of 88.7%, the remaining room for improvement in the IHC-to-HE domain adaptation task is only 0.5 percentage points.

Table 2 The results of adversarial learning (IHC as the source domain, HE as the target domain)

We also conducted an experiment to investigate the effect of color augmentation on adversarial learning. Performance in the output space is similar with and without color augmentation: as Fig. 7 shows, applying color augmentation has little effect on segmentation performance but can speed up model convergence during training.

Fig. 7

The segmentation performance curves of adversarial learning (a) without color augmentation and (b) with color augmentation

4.4 The effectiveness of data selection and model refinement

After the target domain data is ranked by entropy and representativeness selection criteria, the selected images and their pseudo labels are added to the training set. In our experiment, we set the number of selected images to 30. Figure 8 shows the validation Dice curve along with training steps of model refinement.

Fig. 8

The segmentation performance curves of model refinement with selected data and pseudo labels

From Fig. 8, it can be observed that after refining the model, the average Dice increased from 82.8% to 84.6%, an improvement of 1.8%. The visual segmentation results after model refinement are shown in the 4th column of Fig. 10. After the selected target images and their pseudo labels are added, the training set contains target-domain data that can be used for fully supervised learning. Although the pseudo labels are not the real ground truth, our entropy selection criterion ensures that they come from high-confidence predictions, so they are expected to be close to the real ground truth. In addition, our representativeness criterion ensures that the selected data cover as many characteristics as possible rather than being limited to the specific features of a few patients. Therefore, the refined model generalizes better to the target domain than a model trained with adversarial learning alone. Since no real target-domain label is used during training, our method is indeed an unsupervised domain adaptation framework.
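Turning the selected predictions into training pairs can be sketched as follows, assuming (our assumption) that each pseudo label is the per-pixel argmax of the model's softmax output and that the refined training set is simply the union of the labeled source pairs and the selected target pairs. The helper names `make_pseudo_labels` and `build_refinement_set` are hypothetical.

```python
import numpy as np


def make_pseudo_labels(prob_maps):
    """Hard pseudo labels: per-pixel argmax of each (C, H, W) softmax map."""
    return [np.argmax(p, axis=0).astype(np.int64) for p in prob_maps]


def build_refinement_set(source_pairs, target_images, target_probs, selected):
    """Union of labeled source pairs and (selected target image, pseudo label) pairs."""
    pseudo = make_pseudo_labels([target_probs[i] for i in selected])
    target_pairs = [(target_images[i], y) for i, y in zip(selected, pseudo)]
    return list(source_pairs) + target_pairs


# Usage: a 2-class, 2x2 probability map
p = np.array([[[0.9, 0.2], [0.4, 0.7]],
              [[0.1, 0.8], [0.6, 0.3]]])
print(make_pseudo_labels([p])[0])  # label map [[0, 1], [1, 0]]
```

Because the target pairs are treated exactly like labeled source pairs, refinement reuses the ordinary supervised segmentation loss; no target ground truth enters the pipeline.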

At present, IHC-stained images are much more difficult to annotate than HE-stained images. Our method can alleviate this problem, so it has practical value for histopathology research.

4.5 Comparison with the state-of-the-art methods

4.5.1 Direct entropy minimization

Vu et al. [48] propose to directly minimize an entropy loss to maximize the prediction certainty in the target domain. This entropy loss is added to the training objective of the segmentation network as a constraint on the model, so that the model is expected to produce high-confidence predictions in the target domain as well as in the source domain.
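The entropy loss can be illustrated with a short NumPy sketch. Following the normalized form used in [48], the pixel-wise entropy is divided by log C so that it lies in [0, 1]; here it is averaged over pixels. The function names and the default weight `lambda_ent` are illustrative assumptions rather than the settings used in our experiments, and `seg_loss_source` stands for the supervised segmentation loss on source images.

```python
import numpy as np


def softmax(logits, axis=0):
    z = logits - logits.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def entropy_loss(target_logits, eps=1e-12):
    """Mean normalized entropy of target predictions; logits have shape (C, H, W)."""
    prob = softmax(target_logits, axis=0)
    num_classes = prob.shape[0]
    ent = -(prob * np.log(prob + eps)).sum(axis=0) / np.log(num_classes)  # each pixel in [0, 1]
    return float(ent.mean())


def minent_objective(seg_loss_source, target_logits, lambda_ent=0.001):
    """Supervised source loss plus the weighted target entropy term (lambda_ent is illustrative)."""
    return seg_loss_source + lambda_ent * entropy_loss(target_logits)
```

A uniform prediction gives the maximum loss of 1.0 and a one-hot prediction gives 0, so minimizing this term pushes every target pixel toward a confident decision, whichever class that happens to be.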

We re-implement this approach in our framework, and our experimental results suggest that it is not well suited to domain adaptation from HE-stained images to IHC-stained images (see MinEnt in Table 3). We extract the feature layer of the IHC image from the model and generate a probability map through softmax. Figure 9 shows some poor predictions on IHC-stained images obtained by direct entropy minimization. From the probability maps (2nd column of Fig. 9), it can be observed that if the initial prediction on an IHC-stained image is far from the ground truth, the entropy loss has a negative influence on segmentation: it can strengthen the confidence of wrong predictions, which are then difficult for the model to correct during training, resulting in poor performance in the target domain.
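Why entropy minimization can lock in a wrong prediction is visible on a single pixel. For a sigmoid output p = σ(z), the binary entropy H(p) = -p log p - (1 - p) log(1 - p) has gradient dH/dz = p(1 - p) log((1 - p)/p), whose sign depends only on which side of 0.5 the prediction lies, so gradient descent on H always pushes p further toward the class it already favors. The toy sketch below (hypothetical step size and step count, not a reproduction of our experiment) starts from p = 0.3 on a pixel whose ground truth is cancer.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def entropy_grad_wrt_logit(z):
    """d/dz of H(p) = -p log p - (1 - p) log(1 - p) with p = sigmoid(z)."""
    p = sigmoid(z)
    return p * (1.0 - p) * np.log((1.0 - p) / p)


def minimize_entropy(p0, lr=1.0, steps=50):
    """Toy gradient descent on a single pixel's entropy, starting from probability p0."""
    z = np.log(p0 / (1.0 - p0))  # logit of the initial prediction
    for _ in range(steps):
        z -= lr * entropy_grad_wrt_logit(z)
    return float(sigmoid(z))


# Ground truth is "cancer" (1), but the initial prediction p0 = 0.3 leans the wrong way.
p_final = minimize_entropy(0.3)
print(p_final < 0.3)  # True: the wrong prediction has become more confident
```

Nothing in the entropy term refers to the ground truth, so only a separate supervised or adversarial signal can pull such a pixel back across 0.5.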

Table 3 Results of all experiments. We first use three segmentation networks without color augmentation to compare the segmentation results (SegNaiveAug)
Fig. 9

Failure cases of the direct entropy minimization method

4.5.2 Entropy minimization with adversarial learning

Direct entropy minimization ignores the structural dependencies between local semantics. Adversarial learning reduces the discrepancy between the source domain and the target domain through the discriminator loss and the segmentation loss. The combined method (MinAdv) defines the training objective as the sum of the entropy loss and the adversarial loss. In our experiment, its segmentation performance (see MinAdv in Table 3) is similar to that of direct entropy minimization (MinEnt), which we also attribute to the negative impact of the entropy loss described above.
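The MinAdv objective can be sketched as a weighted sum. In this illustration the discriminator is any network producing logits on target-domain segmentation maps, label 1 means "source", and the weights `lambda_ent` and `lambda_adv` are hypothetical; all three are assumptions. The adversarial term is the binary cross-entropy that rewards the segmentation network when the discriminator labels target outputs as source, and `target_entropy` is the target entropy loss computed as for MinEnt.

```python
import numpy as np


def bce_with_logits(logits, label):
    """Numerically stable binary cross-entropy between logits and a constant 0/1 label."""
    logits = np.asarray(logits, dtype=np.float64)
    loss = np.maximum(logits, 0.0) - logits * label + np.log1p(np.exp(-np.abs(logits)))
    return float(loss.mean())


def minadv_objective(seg_loss_source, target_entropy, disc_logits_target,
                     lambda_ent=0.001, lambda_adv=0.001, source_label=1.0):
    """Sketch of the MinAdv objective for the segmentation network.

    seg_loss_source: supervised segmentation loss on labeled source images.
    target_entropy: entropy loss on target predictions (as in MinEnt).
    disc_logits_target: discriminator logits on target segmentation maps.
    The adversarial term is low when the discriminator mistakes target outputs for source.
    """
    adv_loss = bce_with_logits(disc_logits_target, source_label)
    return seg_loss_source + lambda_ent * target_entropy + lambda_adv * adv_loss
```

Because the entropy term is simply added to the adversarial one, any confident but wrong target prediction is still reinforced by it, which is consistent with MinAdv behaving like MinEnt here.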

We list all the experimental results, together with the state-of-the-art methods, in Table 3. They include naive segmentation with Linknet, Pspnet and Deeplab_v2, different degrees of color augmentation, direct entropy minimization (MinEnt), entropy minimization with adversarial learning (MinAdv), and our proposed method.

Table 3 compares the proposed algorithm with the state-of-the-art domain adaptation methods [9, 26, 38, 44, 45, 48]. The models trained by naive segmentation networks without color augmentation predict poorly in the target domain. Color perturbations of different strengths applied with Linknet help the training of the segmentation network because they increase the coverage between the two domains. Entropy minimization depends on the initial entropy of the predictions, which makes it less effective in the output space.

Figure 10 visually shows the segmentation results of naive segmentation, adversarial learning only, and our proposed method. The model trained by our method predicts IHC images better than the other methods. The experimental results also show that our model works better in the output space: because the selected data are representative samples with high prediction confidence whose pseudo labels are close to the real ground truth, the refined model has better generalization ability. Our method achieves the best result on the IHC validation set, 1.8% higher than the state-of-the-art, while also giving competitive performance on the HE validation set.

Fig. 10

Example results of adapted segmentation from HE-stained images to IHC-stained images. For each target image, we show the results of SegNaiveAug (Linknet), adversarial learning, and our method in the output space

To assess the efficiency of the proposed method, we also measure its computational time, as shown in Table 4. Our method runs at 11.6 frames per second (FPS), which is satisfactory. Although our method is not the fastest, it achieves higher accuracy than Linknet. Moreover, its speed is satisfactory compared with traditional manual operation in clinical practice.
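For context, 11.6 FPS corresponds to roughly 1/11.6 ≈ 0.086 s per image. Throughput of this kind can be measured with a simple timing loop; the sketch below is a generic example with a hypothetical `predict` callable and a dummy model, not the benchmarking script behind Table 4.

```python
import time

import numpy as np


def measure_fps(predict, images, warmup=2):
    """Average frames per second of `predict` over `images`, after a few warm-up calls."""
    if len(images) == 0:
        raise ValueError("need at least one image")
    for img in images[:warmup]:  # warm-up calls are excluded from timing
        predict(img)
    start = time.perf_counter()
    for img in images:
        predict(img)
    elapsed = time.perf_counter() - start
    return len(images) / elapsed


# Hypothetical stand-in for a segmentation model: at least 1 ms per image.
def dummy_predict(img):
    time.sleep(0.001)
    return img > 0.5


images = [np.random.rand(256, 256, 3) for _ in range(10)]
print(f"{measure_fps(dummy_predict, images):.1f} FPS")
```

For a GPU model, pending kernels should be synchronized (for example with `torch.cuda.synchronize()` in PyTorch) inside `predict` before the clock is read; otherwise asynchronous execution inflates the measured FPS.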

Table 4 Comparison results of running time between different methods

5 Conclusions

In this paper, we present a novel domain adaptation framework for cross-domain histopathological breast cancer segmentation. Through a well-designed segmentation scheme based on domain adaptation, the proposed method can accurately segment the cancer region in the target domain. The accuracy is improved by 1.8% compared with the latest method, and the computational time is also satisfactory. Benefiting from the proposed entropy selection criterion, the selected pseudo labels are among the high-confidence predictions. In addition, our representativeness criterion ensures that the selected data cover as many characteristics as possible rather than being limited to the specific features of a few patients.

In summary, our method alleviates the burden of manual annotation by using only unlabelled data in the target domain, which gives it considerable practical value for medical applications. It is worth emphasizing that our method is general and not restricted to HE and IHC images; it can be adapted to other similar tasks through simple fine-tuning. In future work, we will evaluate our method on a larger dataset and on other image domains.