
1 Introduction

For medical image analysis based on deep learning, a great challenge remains that deep learning models require large quantities of high-quality annotated images. This requirement results in expensive data collection and repeated annotation workload. Furthermore, annotating different image modalities of the same organ, such as CT and MR, makes the issue more pronounced. Consequently, an annotation-efficient deep learning paradigm, namely unsupervised domain adaptation (UDA), has been introduced to address cross-modality medical image analysis.

Fig. 1.

Comparison between our proposed method and previous methods. (a) Previous methods, in which the detailed, domain-informative low-level features are not utilized and segmentation is performed in a single path. (b) Our proposed method encourages low-level detailed features to be domain-uninformative for the subsequent segmentation, and employs self-supervision on the target domain by constructing an edge generation task as an auxiliary task.

Unsupervised domain adaptation generalizes a model trained on an annotated source domain to another unlabeled target domain without any target label supervision. For the semantic segmentation task, many existing UDA methods [1,2,3, 8] borrow the idea of image-to-image translation from CycleGAN [15] and the multi-modal image translation network [6], so that aligned images of the two domains can be learned together under source domain supervision. Another mainstream of UDA strategies employs adversarial learning to align the source and target domains, where a common way is to follow [11] and apply adversarial learning at the segmentation output [2, 13], at the segmentation entropy map [12, 14], or in a VAE-based latent space [9].

These UDA methods have two drawbacks. Firstly, simple adversarial learning between the source and target domains is not enough to completely align the two domains, especially when unpaired source and target modalities differ greatly, as medical images often do. Under unsupervised conditions, the edge region of the target domain segmentation mask may be very inaccurate and is likely to be over- or under-segmented. Therefore, we propose a novel self-supervision on the target domain to directly improve target domain performance. Specifically, we propose an auxiliary task that generates edges to assist the primary segmentation task and improve prediction accuracy around contours. The two tasks collaborate through a designed edge consistency function and their partially shared parameters: they share a common feature extractor and part of the decoder layers.

Secondly, existing methods perform segmentation on aligned semantic features without considering the rich detailed information in low-level features, because the domain information contained in low-level features can harm adaptation performance. However, detailed features can benefit medical image segmentation, as proved by the great success of U-net [10], and should also be considered. Therefore, to leverage the detailed information in low-level features while simultaneously reducing the adaptation degradation caused by the skip connections in U-net, we propose a hierarchical low-level adversarial learning mechanism that encourages low-level detailed features to be domain-uninformative in a hierarchical way according to their domain information content.

Fig. 2.

Our proposed framework. The feature extractor F generates domain invariant features, and the hierarchical discriminator \(D_f\) differentiates input features accordingly. Segmentor S and edge generator G take features from corresponding layers of F to generate segmentation masks and edges. Two discriminators \(D_m\) and \(D_e\) (omitted in this figure) are employed at the output of S and G for adversarial learning.

In general, the comparison between our Dual-task and Hierarchical learning Network (DualHierNet) and previous UDA methods is shown in Fig. 1.

2 Methodology

Given \(N_s\) pixel-level labeled source domain data \(\left\{ X^s,Y^s\right\} =\left\{ (x_i^s,y_i^s)\right\} _{i=1}^{N_s}\) and \(N_t\) unlabeled target domain data \(X^t=\left\{ x_i^t\right\} _{i=1}^{N_t}\), unsupervised domain adaptation aims to use these data to learn a source-to-target adaptation network that correctly segments target images without any target domain supervision.

The architecture of the proposed DualHierNet is shown in Fig. 2. Target domain self-supervision is achieved through the edge consistency between the partially shared primary segmentation task S and the auxiliary edge generation task G. Also, the low-level features extracted by feature extractor F are encouraged to be domain invariant through adversarial learning with the discriminator \(D_f\), in a hierarchical way according to their domain information content. Lastly, two discriminators \(D_m\) and \(D_e\) are employed on the output semantic space to align the generated segmentation masks and edges, respectively.

2.1 Dual-Task Collaboration for Target Domain Self-supervision

Under unsupervised conditions, the edge region of the target domain segmentation mask may be inaccurate and is likely to be over- or under-segmented. We therefore propose a novel target self-supervision that constructs an auxiliary task to generate edges, and makes it collaborate with the primary segmentation task to obtain a more accurate target segmentation mask at the edge region.

Specifically, feature extractor F generates domain invariant features from input images of the source and target domains, as detailed in the next subsection. The domain invariant features are input to both segmentor S and edge generator G. The edge generator G has a network structure similar to segmentor S, with low-level features of F copied and concatenated to the corresponding high-level features. Besides, the edge generator employs deep supervision: upsampled features in G are output as auxiliary edges (Fig. 2), providing auxiliary supervision that improves edge generation quality. These auxiliary edges are fused together to obtain the final generated edge \(p_e\).
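
As a hedged illustration of this deep-supervision design, the sketch below (PyTorch-style; the layer names, channel sizes and fusion operator are illustrative assumptions, not specifications from the paper) shows how upsampled decoder features could be projected to auxiliary edge maps and fused into the final edge \(p_e\):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSupervisedEdgeHead(nn.Module):
    """Hypothetical edge-generation head with deep supervision.

    Each decoder feature map is projected to a 1-channel auxiliary edge map,
    upsampled to the input resolution, and the auxiliary maps are fused
    (here by a 1x1 convolution) into the final edge prediction p_e.
    Channel sizes and the fusion choice are assumptions.
    """
    def __init__(self, decoder_channels=(256, 128, 64, 32), out_size=256):
        super().__init__()
        self.out_size = out_size
        self.aux_heads = nn.ModuleList(
            [nn.Conv2d(c, 1, kernel_size=1) for c in decoder_channels])
        self.fuse = nn.Conv2d(len(decoder_channels), 1, kernel_size=1)

    def forward(self, decoder_feats):
        aux_edges = []
        for feat, head in zip(decoder_feats, self.aux_heads):
            e = head(feat)                                    # 1-channel logits
            e = F.interpolate(e, size=(self.out_size, self.out_size),
                              mode='bilinear', align_corners=False)
            aux_edges.append(torch.sigmoid(e))                # auxiliary edge p_Ae
        p_e = torch.sigmoid(self.fuse(torch.cat(aux_edges, dim=1)))  # fused edge
        return p_e, aux_edges
```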

For source domain supervision, we use a combination of weighted cross-entropy loss and Dice loss: \(\mathcal {L}^s(p^s,y^s)=\mathcal {L}_{wCE}^s+\mathcal {L}_{Dice}^s\), where \(p^s\) and \(y^s\) are the prediction and ground truth, respectively. We employ multi-class segmentation for segmentor S and two-class (edge/non-edge) prediction for edge generator G:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{seg}^s&=\mathcal {L}^s(p^s_m,y^s_m),\\ \mathcal {L}_{edge}^s&=\mathcal {L}^s(p^s_{e},y_e^s)+\sum \limits _{Ae}\mathcal {L}^s(p^s_{Ae},y_e^s), \end{aligned} \end{aligned}$$
(1)

where \(\mathcal {L}_{seg}^s\) and \(\mathcal {L}_{edge}^s\) are the objective functions of S and G, respectively. \(p^s_m\) and \(y^s_m\) are the segmentation mask and ground truth of the source domain, and \(p^s_e\) and \(y^s_e\) are the generated edge and ground truth edge. Note that \(y^s_e\) is obtained by calculating the first derivative of \(y^s_m\). \(p^s_{Ae}\) are the auxiliary edges shown in Fig. 2.
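
For concreteness, the source-domain loss \(\mathcal {L}^s=\mathcal {L}_{wCE}^s+\mathcal {L}_{Dice}^s\) can be sketched as follows (a minimal PyTorch-style sketch; the class-weighting scheme and the smoothing constant are illustrative assumptions not specified in the paper):

```python
import torch
import torch.nn.functional as F

def source_supervision_loss(logits, target, class_weights=None, eps=1e-5):
    """Weighted cross-entropy + Dice loss, L^s = L_wCE + L_Dice.

    logits: (B, C, H, W) raw scores; target: (B, H, W) integer labels.
    class_weights and eps are illustrative assumptions.
    """
    wce = F.cross_entropy(logits, target, weight=class_weights)

    probs = torch.softmax(logits, dim=1)                       # soft prediction p^s
    one_hot = F.one_hot(target, num_classes=logits.shape[1])   # (B, H, W, C)
    one_hot = one_hot.permute(0, 3, 1, 2).float()              # (B, C, H, W)
    inter = (probs * one_hot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    dice = 1.0 - ((2 * inter + eps) / (union + eps)).mean()    # soft Dice loss

    return wce + dice
```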

For target domain self-supervision, we encourage the segmentation mask \(p^t_m\) and the generated edge \(p^t_e\) to remain consistent at the edges, and we propose a dual-task consistency loss \(\mathcal {L}_d^t\) on the target domain. An operation \(\varvec{\partial }\) calculates the first derivative of the soft segmentation mask \(p^t_m\) along the two spatial axes i, j to obtain a soft edge, which should possess structural consistency with the generated edge \(p^t_e\). The consistency loss and the soft edge calculation are:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{d}^t&= \left\| p^t_{e}- \varvec{\partial }(p^t_m) \right\| _2^2,\\ \varvec{\partial }(p^t_m)=\frac{1}{2}(\left| \frac{\partial p^t_m}{\partial i}\right| +\left| \frac{\partial p^t_m}{\partial j}\right| )&\approx \frac{1}{2}(\sum \limits _{c}\left| p^t_{m,i+1}-p^t_{m,i} \right| +\sum \limits _{c}\left| p^t_{m,j+1}-p^t_{m,j} \right| ), \end{aligned} \end{aligned}$$
(2)

where the summation is applied over the channel dimension c. The soft edge \(\varvec{\partial }(p^t_m)\) takes values in [0, 1].
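
A minimal PyTorch-style sketch of the soft-edge operator \(\varvec{\partial }\) and the consistency loss in Eq. (2), assuming \(p^t_m\) is a softmax probability map of shape (B, C, H, W) and \(p^t_e\) a single-channel edge map; the padding, clamping, and mean reduction are assumptions added for shape and numerical safety:

```python
import torch
import torch.nn.functional as F

def soft_edge(p_m):
    """Approximate first derivative of a soft segmentation mask p_m (B, C, H, W).

    Absolute finite differences along the two spatial axes are summed over the
    channel dimension and averaged, giving a soft edge map; clamping to [0, 1]
    is an assumption for numerical safety.
    """
    di = torch.abs(p_m[:, :, 1:, :] - p_m[:, :, :-1, :]).sum(dim=1)  # along axis i
    dj = torch.abs(p_m[:, :, :, 1:] - p_m[:, :, :, :-1]).sum(dim=1)  # along axis j
    di = F.pad(di, (0, 0, 0, 1))          # pad back to (B, H, W)
    dj = F.pad(dj, (0, 1, 0, 0))          # pad back to (B, H, W)
    return (0.5 * (di + dj)).clamp(0.0, 1.0)

def dual_task_consistency(p_e, p_m):
    """L_d^t: mean squared difference between the generated edge p_e and the
    soft edge of p_m (the paper's squared L2 norm up to a normalization)."""
    return ((p_e.squeeze(1) - soft_edge(p_m)) ** 2).mean()
```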

2.2 Hierarchical Adversarial Learning for Better Alignment

Hierarchical Adversarial Learning. We follow the success of U-net in medical image segmentation [10] and combine low-level detailed features with high-level semantic features. However, low-level features are domain informative, and a severe domain gap in the detailed features can harm adaptation performance when they are combined with domain-uninformative semantic features. We therefore develop a hierarchical adversarial skip-connection mechanism that makes low-level detailed features domain invariant while concatenating them to the semantic features.

Specifically, feature extractor F maps input images to the feature space, and we propose a hierarchical discriminator \(D_f\) to differentiate the input domains accordingly. The features of each layer in F, \(l_1,l_2,l_3,l_4\) and \(l_5\), gradually decrease in domain information and increase in semantic information. \(l_5\) is directly input to the following segmentor S and edge generator G, while \(l_1,l_2,l_3,l_4\) are input to different layers of discriminator \(D_f\) in a hierarchical way according to their distinct resolutions for domain alignment. The objective function of layer \(l_k, k=1,2,\ldots,K\) is formulated as follows, where F and \(D_f\) play a min-max game:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{f}&=\sum \limits _{k=1}^K \gamma _{k}\mathcal {L}_{f,k},\\ \mathcal {L}_{f,k}=\mathbb {E}_{l^s_k\in F(X^s)}[\log&(D_f(l^s_k))] +\mathbb {E}_{l^t_k\in F(X^t)}[\log (1-D_f(l^t_k))], \end{aligned} \end{aligned}$$
(3)

where \(\gamma _{k}\) increases as k decreases, indicating that lower-layer features, which contain more domain information, are assigned larger weights for stronger adversarial attention.
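
The hierarchical objective in Eq. (3) could be assembled as in the sketch below (illustrative; the concrete \(\gamma_k\) values and the interface of the hierarchical discriminator are assumptions):

```python
import torch

def hierarchical_adv_loss(disc_f, feats_s, feats_t, gammas=(1.0, 0.8, 0.6, 0.4)):
    """Value of L_f in Eq. (3): a gamma_k-weighted sum of per-layer adversarial
    terms, maximized by D_f and minimized by F in the min-max game.

    disc_f(l, k) is a hypothetical interface returning the probability that
    layer-k features l come from the source domain; gammas are illustrative
    and simply increase for shallower (smaller k) layers.
    """
    eps = 1e-8
    loss = 0.0
    for k, (l_s, l_t, gamma) in enumerate(zip(feats_s, feats_t, gammas)):
        d_s = disc_f(l_s, k)                                   # D_f(l_k^s)
        d_t = disc_f(l_t, k)                                   # D_f(l_k^t)
        term = torch.log(d_s + eps).mean() + torch.log(1.0 - d_t + eps).mean()
        loss = loss + gamma * term
    return loss
```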

Output Alignment. Finally, two discriminators \(D_m\) and \(D_e\) are employed in the output space to align the segmentation mask \(p_m\) and the generated edge \(p_e\) with adversarial learning. \(\mathcal {L}_{m}\) and \(\mathcal {L}_{e}\) are the corresponding adversarial objectives:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{m}&=\mathbb {E}_{x^s\sim X^s} [\log (D_m(p^s_m))] + \mathbb {E}_{x^t\sim X^t} [\log (1-D_m(p^t_m))],\\ \mathcal {L}_{e}&=\mathbb {E}_{x^s\sim X^s} [\log (D_e(p^s_e))] + \mathbb {E}_{x^t\sim X^t} [\log (1-D_e(p^t_e))]. \end{aligned} \end{aligned}$$
(4)

Therefore, with trade-off parameters \(\lambda _0,\lambda _1,\lambda _2,\lambda _3\), the total objective function of the model is formulated as:

$$\begin{aligned} \begin{aligned} \min \limits _{F,S,G}\max \limits _{D_f,D_m,D_e} \mathcal {L}_{seg}^s+\mathcal {L}_{edge}^s+\lambda _0\mathcal {L}_{d}^t\\ +\lambda _1\mathcal {L}_{f}+\lambda _2\mathcal {L}_{m}+\lambda _3\mathcal {L}_{e}. \end{aligned} \end{aligned}$$
(5)

3 Experiments and Results

Dataset and Implementation Details. The proposed framework is evaluated on the Multi-Modality Whole Heart Segmentation Challenge (MMWHS2017) dataset [17], which consists of unpaired 20 CT and 20 MR volumes with pixel-level annotation of seven cardiac structures: left ventricle blood cavity (LV), right ventricle blood cavity (RV), left atrium blood cavity (LA), right atrium blood cavity (RA), myocardium of the left ventricle (Myo), ascending aorta (AA) and pulmonary artery (PA). We follow Pnp-AdaNet and SIFA [2, 4] and use sixteen randomly selected MR volumes as the source and sixteen CT volumes as the target for training. The remaining four CT volumes are used for testing. Each volume is split into transverse-view slices as inputs, since doctors observe the transverse view to diagnose cardiac diseases; slices are augmented with flipping, rotation and scaling, normalized to zero mean and unit variance, and resized to \(256\times 256\). The volume-level metrics Dice score and Average Surface Distance (ASD) are employed for evaluation. For fair comparison, 5-fold cross validation is employed. All CT annotations are only used for evaluation and are never presented during training.
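
A minimal sketch of the slice preprocessing described above; the library choices and augmentation parameter ranges are assumptions not stated in the paper:

```python
import numpy as np
from scipy import ndimage

def preprocess_slice(slice2d, out_size=256):
    """Zero-mean / unit-variance normalization, then resize to out_size x out_size."""
    x = slice2d.astype(np.float32)
    x = (x - x.mean()) / (x.std() + 1e-8)
    zoom = (out_size / x.shape[0], out_size / x.shape[1])
    return ndimage.zoom(x, zoom, order=1)   # bilinear; labels would use order=0

def random_flip_rotate(x, rng=np.random):
    """Illustrative flip and rotation augmentation (the angle range is an assumption)."""
    if rng.rand() < 0.5:
        x = np.flip(x, axis=int(rng.randint(2))).copy()
    return ndimage.rotate(x, angle=float(rng.uniform(-15, 15)),
                          reshape=False, order=1, mode='nearest')
```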

We also validate our proposed method on another multi-modality cardiac dataset, MS-CMRSeg 2019 [16], which consists of 45 patients, each with cardiac images of three MR modalities: bSSFP, T2 and LGE. For fair comparison, we re-implement the methods [1, 13] under the same experimental setup as ours, and follow [13] to combine labeled bSSFP and T2 as the source and unlabeled LGE as the target, where the target LGE data is split according to the competition protocol; transverse-view slices are used with the same preprocessing and augmentation as above.

The detailed dual-task architecture is shown in Fig. 2. The discriminators follow [7] and have 6 convolutional layers, where the first 3 use instance normalization. Adam optimizers are used with a learning rate of \(1.0 \times 10^{-3}\) for both segmentation and edge generation, with a decay rate of 0.9 every 2 epochs for segmentation and no decay for edge generation, since we empirically found that the edge generation task converges more slowly than segmentation. The model is trained for 100 epochs with a batch size of 4. Hyper-parameter \(\lambda _0\) is 10 and \(\lambda _1\) is 1.0, while \(\lambda _2,\lambda _3\) grow linearly from 0.0 to 1.0 as the epoch increases to 40, and remain 1.0 thereafter.
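
The learning-rate decay and the \(\lambda_2,\lambda_3\) warm-up can be expressed as simple schedules, for example:

```python
def adv_weight(epoch, warmup_epochs=40):
    """lambda_2 and lambda_3: grow linearly from 0.0 to 1.0 over the first
    40 epochs, then stay at 1.0."""
    return min(1.0, epoch / warmup_epochs)

def seg_lr(epoch, base_lr=1e-3, decay=0.9, every=2):
    """Segmentation learning rate: 1e-3 decayed by 0.9 every 2 epochs;
    the edge-generation optimizer keeps a constant 1e-3 (no decay)."""
    return base_lr * (decay ** (epoch // every))
```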

Quantitative and Qualitative Analysis. For the MMWHS2017 dataset, we validate our method on all seven structures and report the results in Table 1, and also follow [2, 4] to validate on four left-side structures in Table 2. We compare with several state-of-the-art UDA methods, including CyCADA [5], Pnp-AdaNet [4], BEAL [14], Cascaded U-net [1] and SIFA [2]. We re-implement all the above methods under the same experimental setup with five-fold cross validation, report results as mean ± std, and employ no post-processing.

In Table 1, we first obtain the unadapted results by directly testing a source-domain-trained U-net on the target domain; the Dice score of \(30.43\%\) reflects the severe domain shift between modalities. A supervised target domain upper bound of \(84.95\%\) is also obtained with a supervised U-net. Our proposed method outperforms several UDA methods by a large margin and achieves superior performance of 73.68% in average Dice and 7.3 in average ASD. Note that our approach significantly improves the accuracy of LA, with a performance gain of up to 9.4% in Dice, and even the most difficult structure to segment, Myo, is improved to 64.03%. For the four-class segmentation of LV, LA, Myo and AA shown in Table 2, we achieve an average Dice of 76.98% and an average ASD of 4.6, a large margin over other methods. Results on MS-CMRSeg shown in Table 3 demonstrate the generalization ability of our method across MR modalities, with an average Dice of 84.85%.

Visual results are shown in Fig. 3. Our DualHierNet produces a smoother 3D heart with clearer contours and better segmentation masks inside the cardiac structures. For the generated edges in the lower part of Fig. 3, the figures inside the red box are good examples in which the generated edge \(p_e^t\) and \(\varvec{\partial }(p^t_m)\) are well constrained to be similar. The figures inside the blue box are poor examples, where the blue arrows point to boundary areas that differ between \(p_e^t\) and \(\varvec{\partial }(p^t_m)\). This usually results from incoherent annotation between two adjacent slices.

Table 1. Comparison results on MMWHS2017 for 7 cardiac structures.
Table 2. Comparison results on MMWHS2017 for 4 cardiac structures.
Table 3. Comparison results on MS-CMRSeg.
Fig. 3.

Visual results of comparison and generated edges.

Table 4. Effects of each component.
Table 5. Effect of hierarchical weights.
Table 6. Dual-task self-supervision extended on supervised setting.

Ablation Study. Firstly, we conduct an ablation experiment to evaluate the effectiveness of each component: (i) U-net with output adversarial learning (Base), (ii) Base equipped with dual-task collaboration (Base+Dual), (iii) Base with hierarchical adversarial learning (Base+Hier), and (iv) ours (Base+Dual+Hier). In Table 4, the performance is improved to \(68.50\%\) and \(70.89\%\) when equipped with our proposed dual-task self-supervision and hierarchical strategy, respectively. The further improvement to \(76.98\%\) in our DualHierNet confirms the effect of using the dual task as self-supervision and hierarchically aligning low-level features.

Secondly, we experiment with the choice of hierarchical weights \(\gamma _{k}\), as shown in Table 5. When we assign larger weights to higher layers, only an average Dice of 70.65% is achieved. A Dice of 72.61% is achieved if every layer shares the same weight. When we enlarge the weights of the shallow layers, which contain more domain information, we obtain a Dice of 76.98%. This further justifies that low-level, domain-informative features should receive stronger adversarial learning attention.

Thirdly, we extend to target-only supervised segmentation to validate our proposed self-supervision. We replace the Seg+Edge structure with two segmentors (Seg+Seg) so that the two settings have nearly the same number of parameters. In the supervised setting, Seg+Edge uses the segmentation loss and the dual-task consistency loss, while Seg+Seg uses only the segmentation loss. The results in Table 6 reveal that the auxiliary edge task assists segmentation even in the supervised setting, with a performance gain of \(1.14\%\). In the adapted setting, a larger performance gain is obtained through our proposed dual-task self-supervision.

4 Conclusion

We propose a dual-task collaboration framework for target self-supervision with low-level hierarchical adversarial learning for cross-modality image segmentation. We develop a novel self-supervision by constructing an auxiliary task that generates edges to assist the segmentation task, and we design a hierarchical adversarial mechanism according to the domain information content of the features. Our framework outperforms several adaptation methods on cross-modality datasets, and the proposed dual-task architecture even achieves promising performance in the supervised setting.