Abstract
Data annotation is an expensive and time-consuming issue for deep-learning-based medical image analysis. To ease the need for annotations, domain adaptation has recently been introduced to generalize neural networks from a labeled source domain to an unlabeled target domain without much performance degradation. In this paper, we propose a novel target-domain self-supervision for domain adaptation by constructing an edge generation auxiliary task that assists the primary segmentation task, so as to extract better target representations and improve target segmentation performance. Besides, to leverage the detailed information contained in low-level features, we propose a hierarchical low-level adversarial learning mechanism that encourages low-level features to be domain-uninformative in a hierarchical way, so that segmentation performance can benefit from low-level features without being affected by domain shift. Combining these two approaches, we develop a cross-modality domain adaptation framework that employs dual-task collaboration for target-domain self-supervision and encourages low-level detailed features to be domain-uninformative for better alignment. Our proposed framework achieves state-of-the-art results on public cross-modality segmentation datasets.
1 Introduction
For medical image analysis based on deep learning, a great challenge remains: deep learning models require large quantities of high-quality annotated images. This results in expensive data collection and a repetitive annotation workload. Furthermore, annotating different image modalities of the same organ, such as CT and MR, makes the issue more pronounced. Consequently, an annotation-efficient deep learning approach, namely unsupervised domain adaptation (UDA), has been introduced to address cross-modality medical image analysis.
Unsupervised domain adaptation generalizes a learning model trained on an annotated source domain to another, unlabeled target domain without any target label supervision. For the semantic segmentation task, many existing UDA methods [1,2,3, 8] borrow the idea of image-to-image translation from CycleGAN [15] and multi-modal image translation networks [6], so that the aligned images of the two domains can be learned together under source domain supervision. Another main line of UDA strategies employs adversarial learning to align the source and target domains; a common way is to follow [11] and apply adversarial learning at the segmentation output [2, 13], at the segmentation entropy map [12, 14], or in a VAE-based latent space [9].
These UDA methods have two drawbacks. First, simple adversarial learning between the source and target domains is not enough to align the two domains completely, especially when unpaired source and target modalities differ greatly, as in medical images. Under unsupervised conditions, the edge region of the target domain segmentation mask may be very inaccurate and is likely to be over- or under-segmented. Therefore, we propose a novel self-supervision on the target domain to directly improve target domain performance. Specifically, we propose an auxiliary task that generates edges to assist the primary segmentation task and improve prediction accuracy around contours. The two tasks collaborate through a designed edge consistency function and their partially shared parameters: they share a common feature extractor and some layers in the decoder.
Second, existing methods apply the segmentation task to aligned semantic features without considering the rich detailed information in low-level features, because the domain information contained in low-level features can harm adaptation performance. Yet detailed features benefit medical image segmentation, as proved by the great success of U-net [10], and should also be exploited. Therefore, to leverage the detailed information in low-level features while reducing the adaptation degradation caused by the skip connections in U-net, we propose a hierarchical low-level adversarial learning mechanism that encourages low-level detailed features to be domain-uninformative in a hierarchical way, according to how much domain information each level contains.
Overall, Fig. 1 compares our Dual-task and Hierarchical learning Network (DualHierNet) with previous UDA methods.
2 Methodology
Given \(N_s\) pixel-level labeled source domain data \(\left\{ X^s,Y^s\right\} =\left\{ (x_i^s,y_i^s)\right\} _{i=1}^{N_s}\) and \(N_t\) unlabeled target domain data \(X^t=\left\{ x_i^t\right\} _{i=1}^{N_t}\), unsupervised domain adaptation aims to use these data to learn a source-to-target adaptation network that correctly segments target images without any target domain supervision.
The architecture of the proposed DualHierNet is shown in Fig. 2. Target domain self-supervision is achieved through the edge consistency between the partially shared primary segmentation task S and the auxiliary edge generation task G. Also, the low-level features extracted by the feature extractor F are encouraged to be domain-invariant through adversarial learning with the discriminator \(D_f\), in a hierarchical way according to the domain content. Lastly, two discriminators \(D_m\) and \(D_e\) are employed in the output semantic space to align the generated segmentation masks and edges, respectively.
2.1 Dual-Task Collaboration for Target Domain Self-supervision
Under unsupervised conditions, the edge region of the target domain segmentation mask may be inaccurate and is likely to be over- or under-segmented. We therefore propose a novel target self-supervision by constructing an auxiliary task that generates edges and collaborates with the primary segmentation task to obtain a more accurate target segmentation mask in the edge region.
Specifically, the feature extractor F generates domain-invariant features from input images of the source and target domains, as detailed in the next subsection. These domain-invariant features are fed to both the segmentor S and the edge generator G. The edge generator G has a network structure similar to that of the segmentor S, with low-level features of F copied and concatenated to the corresponding high-level features. Besides, the edge generator employs deep supervision to improve edge generation quality: upsampled features in G are output as auxiliary edges, as shown in Fig. 2. These auxiliary edges are fused together to obtain the final generated edge \(p_e\).
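To make the design concrete, the following is a minimal PyTorch sketch of the shared-encoder, dual-decoder layout described above. The channel widths, block composition and edge-fusion layer are our assumptions for illustration; Fig. 2 specifies the actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as TF

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class Encoder(nn.Module):
    """Shared feature extractor F; returns features l1..l5 (low- to high-level)."""
    def __init__(self, in_ch=1, base=16):
        super().__init__()
        chs = [base * 2 ** i for i in range(5)]          # 16, 32, 64, 128, 256 (assumed)
        self.blocks = nn.ModuleList(
            conv_block(ci, co) for ci, co in zip([in_ch] + chs[:-1], chs))
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feats = []
        for i, blk in enumerate(self.blocks):
            x = blk(x) if i == 0 else blk(self.pool(x))
            feats.append(x)
        return feats                                      # [l1, l2, l3, l4, l5]

class Decoder(nn.Module):
    """U-net style decoder over l5 with skip connections from l4..l1. With
    deep_supervision=True it also emits upsampled side outputs (the
    auxiliary edges) and fuses them into the final prediction p_e."""
    def __init__(self, out_ch, base=16, deep_supervision=False):
        super().__init__()
        chs = [base * 2 ** i for i in range(5)]
        self.ups = nn.ModuleList(nn.ConvTranspose2d(chs[i], chs[i-1], 2, 2) for i in range(4, 0, -1))
        self.blocks = nn.ModuleList(conv_block(2 * chs[i-1], chs[i-1]) for i in range(4, 0, -1))
        self.heads = nn.ModuleList(nn.Conv2d(chs[i-1], out_ch, 1) for i in range(4, 0, -1))
        self.deep_supervision = deep_supervision
        self.fuse = nn.Conv2d(4 * out_ch, out_ch, 1)      # fuses the auxiliary edges

    def forward(self, feats):
        x, side = feats[-1], []
        for up, blk, head, skip in zip(self.ups, self.blocks, self.heads, feats[-2::-1]):
            x = blk(torch.cat([up(x), skip], dim=1))      # skip connection from F
            side.append(TF.interpolate(head(x), size=feats[0].shape[-2:],
                                       mode='bilinear', align_corners=False))
        if self.deep_supervision:
            return self.fuse(torch.cat(side, dim=1)), side  # fused edge + auxiliary edges
        return side[-1]                                   # full-resolution prediction

class DualHierNet(nn.Module):
    def __init__(self, n_classes=8):
        super().__init__()
        self.F = Encoder()                                 # shared feature extractor
        self.S = Decoder(out_ch=n_classes)                 # primary segmentor
        self.G = Decoder(out_ch=1, deep_supervision=True)  # auxiliary edge generator

    def forward(self, x):
        feats = self.F(x)
        p_m = self.S(feats)                                # segmentation logits
        p_e, p_ae = self.G(feats)                          # fused edge + auxiliary edges
        return p_m, p_e, p_ae, feats                       # feats later feed D_f
```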
For source domain supervision, we use a combination of weighted cross-entropy loss and Dice loss: \(\mathcal {L}^s(p^s,y^s)=\mathcal {L}_{wCE}^s+\mathcal {L}_{Dice}^s\), where \(p^s\) and \(y^s\) are the segmentation mask and ground truth. We employ multi-class and two-class segmentation losses for the segmentor S and the edge generator G, respectively:
where \(\mathcal {L}_{seg}^s\) and \(\mathcal {L}_{edge}^s\) are the objective functions of S and G, respectively. \(p^s_m\) and \(y^s_m\) are the segmentation mask and ground truth of the source domain, and \(p^s_e\) and \(y^s_e\) are the generated edge and the ground-truth edge. Note that \(y^s_e\) is obtained by calculating the first derivative of \(y^s_m\). \(p^s_{Ae}\) are the auxiliary edges shown in Fig. 2.
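Since the display equations did not survive extraction, the sketch below implements what the surrounding text defines: \(\mathcal {L}^s\) as weighted cross-entropy plus Dice, and the edge ground truth from the first derivative of the label mask. Summing the edge loss over the fused edge and each auxiliary edge is our assumption about the form of \(\mathcal {L}_{edge}^s\).

```python
import torch
import torch.nn.functional as TF

def dice_loss(prob, onehot, eps=1e-6):
    inter = (prob * onehot).sum(dim=(0, 2, 3))
    denom = prob.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    return 1.0 - (2.0 * inter / (denom + eps)).mean()

def supervised_loss(logits, y, class_weights=None):
    """L^s(p, y) = weighted cross-entropy + Dice, as defined above."""
    prob = logits.softmax(dim=1)
    onehot = TF.one_hot(y, logits.shape[1]).permute(0, 3, 1, 2).float()
    return TF.cross_entropy(logits, y, weight=class_weights) + dice_loss(prob, onehot)

def edge_ground_truth(y_m):
    """y_e from the first derivative of the integer label mask y_m (B, H, W)."""
    y = y_m.float()
    di = TF.pad(y[:, 1:, :] - y[:, :-1, :], (0, 0, 0, 1))   # derivative along i
    dj = TF.pad(y[:, :, 1:] - y[:, :, :-1], (0, 1))         # derivative along j
    return ((di.abs() + dj.abs()) > 0).long()               # two-class edge label

def edge_supervision(edge_logits, aux_edge_logits, y_e):
    """Assumed form of L_edge^s: fused edge plus every auxiliary edge."""
    loss = supervised_loss(edge_logits, y_e)
    for aux in aux_edge_logits:
        loss = loss + supervised_loss(aux, y_e)
    return loss
```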
For target domain self-supervision, we encourage the segmentation mask \(p^t_m\) and the generated edge \(p^t_e\) to remain consistent at the edges, and we propose a dual-task consistency loss \(\mathcal {L}_d^t\) on the target domain. An operation \(\varvec{\partial }\) calculates the first derivative of the soft segmentation mask \(p^t_m\) along the two spatial axes i, j to obtain a soft edge, which should possess structural consistency with the generated edge \(p^t_e\). The consistency loss and the soft-edge calculation are:
where the summation symbol is applied over the channel dimension c. The soft edge \(\varvec{\partial }(p^t_m)\) takes values in [0, 1].
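A sketch of this consistency term under our assumptions: finite differences stand in for the first derivative, the channel sum is clamped to [0, 1], and an L2 penalty matches the soft edge to \(p^t_e\) (the exact distance measure is not recoverable from this text).

```python
import torch

def soft_edge(p_m: torch.Tensor) -> torch.Tensor:
    """p_m: (B, C, H, W) soft mask. Returns a (B, 1, H-1, W-1) soft edge."""
    di = p_m[:, :, 1:, :] - p_m[:, :, :-1, :]            # derivative along axis i
    dj = p_m[:, :, :, 1:] - p_m[:, :, :, :-1]            # derivative along axis j
    mag = torch.sqrt(di[:, :, :, 1:] ** 2 + dj[:, :, 1:, :] ** 2 + 1e-8)
    return mag.sum(dim=1, keepdim=True).clamp(max=1.0)   # sum over c, keep in [0, 1]

def dual_consistency_loss(p_m: torch.Tensor, p_e: torch.Tensor) -> torch.Tensor:
    """L_d^t: match the soft edge of the mask to the generated edge p_e."""
    e = soft_edge(p_m)
    return ((e - p_e[:, :, 1:, 1:]) ** 2).mean()         # p_e: (B, 1, H, W) probabilities
```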
2.2 Hierarchical Adversarial Learning for Better Alignment
Hierarchical Adversarial Learning. We follow the success of U-net in medical image segmentation [10] in combining low-level detailed features with high-level semantic features. However, low-level features are domain-informative, and a severe domain gap in detailed features can harm adaptation performance when they are combined with domain-uninformative semantic features. We thus develop a hierarchical adversarial skip-connection mechanism that makes low-level detailed features domain-invariant while concatenating them to semantic features.
Specifically, the feature extractor F maps input images to a feature space, and we propose a hierarchical discriminator \(D_f\) to differentiate the input domains accordingly. The features of each layer in F, \(l_1,l_2,l_3,l_4\) and \(l_5\), gradually decrease in domain information and increase in semantic information. \(l_5\) is directly fed to the following segmentor S and edge generator G, while \(l_1,l_2,l_3,l_4\) are fed to different layers of the discriminator \(D_f\) in a hierarchical way, according to their distinct resolutions, for domain alignment. The objective function for layer \(l_k\), \(k=1,2,\dots ,K\), where F and \(D_f\) play a min-max game, is formulated as follows:
where \(\gamma _{k}\) increases as k decreases, indicating that lower-layer features, which contain more domain information, are assigned larger weights for attention.
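A sketch of this hierarchical adversarial term, assuming standard binary cross-entropy GAN losses; the concrete \(\gamma _k\) values here are illustrative assumptions (the weight choice is ablated in Sect. 3).

```python
import torch
import torch.nn.functional as TF

# gamma_k for layers l1..l4; larger for shallower layers (values are assumed)
GAMMAS = [1.0, 0.75, 0.5, 0.25]

def discriminator_loss(d_src, d_tgt):
    """D_f learns to classify per-layer features: source -> 1, target -> 0.
    d_src, d_tgt: lists of per-layer discriminator logits for l1..l4."""
    loss = 0.0
    for g, ds, dt in zip(GAMMAS, d_src, d_tgt):
        loss = loss + g * (TF.binary_cross_entropy_with_logits(ds, torch.ones_like(ds))
                           + TF.binary_cross_entropy_with_logits(dt, torch.zeros_like(dt)))
    return loss

def extractor_adv_loss(d_tgt):
    """F tries to fool D_f so target low-level features look source-like."""
    return sum(g * TF.binary_cross_entropy_with_logits(dt, torch.ones_like(dt))
               for g, dt in zip(GAMMAS, d_tgt))
```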
Output Alignment. Finally, two discriminators \(D_m\) and \(D_e\) are employed in the output space to align the segmentation mask \(p_m\) and the generated edge \(p_e\) with adversarial learning. \(\mathcal {L}_{m}\) and \(\mathcal {L}_{e}\) are the corresponding adversarial objectives:
Therefore, with trade-off parameters \(\lambda _0,\lambda _1,\lambda _2,\lambda _3\), the total objective function of the model is formulated as:
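The display equation for the total objective did not survive extraction. One plausible grouping, consistent with \(\lambda _2,\lambda _3\) ramping up over training as is typical for adversarial terms (the assignment of each \(\lambda \) to a term is our assumption), is:

\(\mathcal {L}_{total}=\mathcal {L}_{seg}^s+\lambda _0\mathcal {L}_{edge}^s+\lambda _1\mathcal {L}_d^t+\lambda _2\mathcal {L}_{f}+\lambda _3(\mathcal {L}_{m}+\mathcal {L}_{e})\)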
3 Experiments and Results
Dataset and Implementation Details. The proposed framework is evaluated on the Multi-Modality Whole Heart Segmentation Challenge (MMWHS 2017) dataset [17], which consists of unpaired 20 CT and 20 MR volumes with pixel-level annotations of seven heart structures: left ventricle blood cavity (LV), right ventricle blood cavity (RV), left atrium blood cavity (LA), right atrium blood cavity (RA), myocardium of the left ventricle (Myo), ascending aorta (AA) and pulmonary artery (PA). We follow Pnp-AdaNet and SIFA [2, 4] in using sixteen randomly selected MR volumes as the source and sixteen CT volumes as the target for training; the remaining four CT volumes are reserved for testing. Each volume is split into transverse-view slices as inputs, since doctors read the transverse view to diagnose cardiac diseases; slices are augmented with flipping, rotation and scaling, normalized to zero mean and unit variance, and resized to \(256\times 256\). The volume-level metrics Dice score and Average Surface Distance (ASD) are employed for evaluation. For fair comparison, five-fold cross validation is employed. All CT annotations are used only for evaluation and are never presented during training.
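As an illustration of the stated preprocessing (library choices and the scale range are ours; the text only specifies flip/rotation/scale augmentation, z-score normalization and a \(256\times 256\) input size):

```python
import numpy as np
import cv2

def preprocess_slice(slice_2d: np.ndarray) -> np.ndarray:
    x = slice_2d.astype(np.float32)
    x = (x - x.mean()) / (x.std() + 1e-8)                 # zero mean, unit variance
    return cv2.resize(x, (256, 256))                      # network input size

def augment(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    if rng.random() < 0.5:
        x = np.flip(x, axis=int(rng.integers(2))).copy()  # random flip
    x = np.rot90(x, k=int(rng.integers(4))).copy()        # random rotation
    s = float(rng.uniform(0.9, 1.1))                      # random scaling (range assumed)
    h, w = x.shape
    x = cv2.resize(x, None, fx=s, fy=s)
    # crop-or-pad back to (h, w)
    xh, xw = x.shape
    out = np.zeros((h, w), dtype=x.dtype)
    ch, cw = min(h, xh), min(w, xw)
    ot, ol = (h - ch) // 2, (w - cw) // 2
    it, il = (xh - ch) // 2, (xw - cw) // 2
    out[ot:ot + ch, ol:ol + cw] = x[it:it + ch, il:il + cw]
    return out
```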
We also validate our proposed method on another multi-modality cardiac dataset, MS-CMRSeg 2019 [16], which consists of 45 patients, each with cardiac images of three MR modalities: bSSFP, T2 and LGE. For fair comparison, we re-implement the methods of [1, 13] under the same experimental setup as ours, and follow [13] in combining labeled bSSFP and T2 as the source and unlabeled LGE as the target, where the target LGE data follow the competition's split; we use transverse-view slices with the same preprocessing and augmentation as above.
The detailed dual-task architecture is shown in Fig. 2. Following [7], discriminators have 6 convolutional layers, of which the first 3 use instance normalization. Adam optimizers are used with a learning rate of \(1.0 \times 10^{-3}\) for both segmentation and edge generation, with a decay rate of 0.9 every 2 epochs for segmentation and no decay for edge generation, since we empirically found that the edge generation task converges more slowly than segmentation. The model is trained for 100 epochs with a batch size of 4. Hyper-parameter \(\lambda _0\) is set to 10 and \(\lambda _1\) to 1.0, while \(\lambda _2,\lambda _3\) grow linearly from 0.0 to 1.0 as the epoch increases to 40 and remain at 1.0 thereafter.
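The stated schedule can be sketched as follows (PyTorch; the stand-in modules and the StepLR reading of "decay rate of 0.9 every 2 epochs" are assumptions):

```python
import torch
import torch.nn as nn

# stand-ins for the segmentation branch (F + S) and edge branch (G) parameters
seg_head, edge_head = nn.Conv2d(16, 8, 1), nn.Conv2d(16, 1, 1)

opt_seg = torch.optim.Adam(seg_head.parameters(), lr=1e-3)
opt_edge = torch.optim.Adam(edge_head.parameters(), lr=1e-3)   # no decay for G
# decay rate of 0.9 every 2 epochs for the segmentation optimizer
sched_seg = torch.optim.lr_scheduler.StepLR(opt_seg, step_size=2, gamma=0.9)

lam0, lam1 = 10.0, 1.0                    # fixed trade-off weights

def adv_weight(epoch: int, ramp: int = 40) -> float:
    """lambda_2, lambda_3: linear ramp from 0 to 1 over 40 epochs, then 1."""
    return min(epoch / ramp, 1.0)

for epoch in range(100):
    # ... training steps with batch size 4 ...
    lam2 = lam3 = adv_weight(epoch)
    sched_seg.step()                      # decays the segmentation LR
```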
Quantitative and Qualitative Analysis. For the MMWHS 2017 dataset, we validate our method on all seven structures and show the results in Table 1, and also follow [2, 4] in validating on four left-side structures in Table 2. We compare with several state-of-the-art UDA methods, including CyCADA [5], Pnp-AdaNet [4], BEAL [14], Cascaded U-net [1] and SIFA [2]. We re-implemented all of the above methods under the same experimental setup with five-fold cross validation, reported as mean ± std; no post-processing is employed.
In Table 1, we first obtain the unadapted results by directly testing a source-domain-trained U-net on the target domain; a Dice score of \(30.43\%\) reflects the severe domain shift between the modalities. A supervised target domain upper bound of \(84.95\%\) is also obtained with a supervised U-net. Our proposed method outperforms several UDA methods by a large margin and achieves superior performance of 73.68% in average Dice and 7.3 in average ASD. Note that our approach significantly improves the accuracy on LA, with a performance gain of up to 9.4% in Dice, and even the structure most difficult to segment, Myo, is improved to 64.03%. For the four-class segmentation of LV, LA, Myo and AA shown in Table 2, we achieve an average Dice of 76.98% and an average ASD of 4.6, a large margin over the other methods. The results on MS-CMRSeg shown in Table 3 demonstrate the generalization ability of our method across MR modalities, with an average Dice of 84.85%.
Visual results are shown in Fig. 3. Our DualHierNet produces a smoother 3D heart with clearer contours and better segmentation masks inside the cardiac structures. For the generated edges in the lower part of Fig. 3, the figures inside the red box are good examples in which the generated edge \(p^t_e\) and \(\varvec{\partial }(p^t_m)\) are well constrained to be similar. The figures inside the blue box are poor examples, where the blue arrows point to boundary areas that differ between \(p^t_e\) and \(\varvec{\partial }(p^t_m)\). This usually results from incoherent annotation between two adjacent slices.
Ablation Study. First, we conduct an ablation experiment to evaluate the effectiveness of each component: (i) U-net with output adversarial learning (Base), (ii) Base equipped with dual-task collaboration (Base+Dual), (iii) Base with hierarchical adversarial learning (Base+Hier), and (iv) ours (Base+Dual+Hier). In Table 4, performance improves to \(68.50\%\) and \(70.89\%\) when equipped with our proposed dual-task self-supervision and hierarchical strategy, respectively. The further improvement to \(76.98\%\) in our DualHierNet confirms the effect of using the dual task as self-supervision and hierarchically aligning low-level features.
Second, we experiment on the choice of hierarchical weights \(\gamma _{k}\), shown in Table 5. When we assign larger weights to higher layers, only an average Dice of 70.65% is achieved. A Dice of 72.61% is achieved if every layer shares the same weight. When we enlarge the weights of shallow layers, which contain more domain information, we obtain a Dice of 76.98%. This further justifies that low-level, domain-informative features should receive stronger adversarial attention.
Third, we extend to target-only supervised segmentation to validate the proposed self-supervision. We replace the Seg+Edge structure with two segmentors (Seg+Seg), so that the two variants have nearly the same number of parameters. In the supervised setting, Seg+Edge is trained with the segmentation loss and the dual-task consistency loss, while Seg+Seg uses only the segmentation loss. The results in Table 6 reveal that the auxiliary edge task assists segmentation even in the supervised setting, achieving a performance gain of \(1.14\%\); in the adapted setting, a larger performance gain is obtained through our proposed dual-task self-supervision.
4 Conclusion
We propose a dual-task collaboration framework for target self-supervision with low-level hierarchical adversarial learning for cross-modality image segmentation. We develop a novel self-supervision by constructing an auxiliary task that generates edges to assist the segmentation task, and we also design a hierarchical adversarial mechanism according to the domain-information content of each feature level. Our framework outperforms several adaptation methods on cross-modality datasets, and the proposed dual-task architecture even achieves promising performance in the supervised setting.
References
Chen, C., et al.: Unsupervised multi-modal style transfer for cardiac MR segmentation. In: Pop, M., et al. (eds.) STACOM 2019. LNCS, vol. 12009, pp. 209–219. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-39074-7_22
Chen, C., Dou, Q., Chen, H., Qin, J., Heng, P.A.: Synergistic image and feature adaptation: towards cross-modality domain adaptation for medical image segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 865–872 (2019)
Chen, Y.C., Lin, Y.Y., Yang, M.H., Huang, J.B.: CrDoCo: pixel-level domain transfer with cross-domain consistency. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1791–1800 (2019)
Dou, Q., et al.: PnP-AdaNet: plug-and-play adversarial domain adaptation network at unpaired cross-modality cardiac segmentation. IEEE Access (2019)
Hoffman, J., et al.: CyCADA: cycle-consistent adversarial domain adaptation. In: International Conference on Machine Learning, pp. 1989–1998 (2018)
Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189 (2018)
Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134 (2017)
Jiang, J., et al.: Cross-modality (CT-MRI) prior augmented deep learning for robust lung tumor segmentation from small MR datasets. Med. Phys. 46(10), 4392–4404 (2019)
Ouyang, C., Kamnitsas, K., Biffi, C., Duan, J., Rueckert, D.: Data efficient unsupervised domain adaptation for cross-modality image segmentation. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11765, pp. 669–677. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32245-8_74
Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Tsai, Y.H., Hung, W.C., Schulter, S., Sohn, K., Yang, M.H., Chandraker, M.: Learning to adapt structured output space for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7472–7481 (2018)
Vu, T.H., Jain, H., Bucher, M., Cord, M., Pérez, P.: ADVENT: adversarial entropy minimization for domain adaptation in semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2517–2526 (2019)
Wang, J., Huang, H., Chen, C., Ma, W., Huang, Y., Ding, X.: Multi-sequence cardiac MR segmentation with adversarial domain adaptation network. In: Pop, M., et al. (eds.) STACOM 2019. LNCS, vol. 12009, pp. 254–262. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-39074-7_27
Wang, S., Yu, L., Li, K., Yang, X., Fu, C.-W., Heng, P.-A.: Boundary and entropy-driven adversarial learning for fundus image segmentation. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11764, pp. 102–110. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32239-7_12
Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232 (2017)
Zhuang, X.: Multivariate mixture model for myocardial segmentation combining multi-source images. IEEE Trans. Pattern Anal. Mach. Intell. 41(12), 2933–2946 (2018)
Zhuang, X., Shen, J.: Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI. Med. Image Anal. 31, 77–87 (2016)
Acknowledgement
This work is supported by SHEITC (No. 2018-RGZN-02046), 111 plan (No. BP0719010), and STCSM (No. 18DZ2270700).