1 Introduction

Cardiovascular disease (CVD) is a major and ever-increasing global problem. According to the World Health Organization, CVD causes about 17.9 million deaths worldwide every year [1]. Morbidity due to CVD is an equally severe challenge: the total number of disability-adjusted life years due to ischemic heart disease and stroke had reached 182 million (95% UI: 170 to 194 million) by 2019 [2]. The prevention and diagnosis of CVD are therefore essential to reduce its social and economic burdens. In diagnosing CVD, medical image segmentation reveals the cardiac substructures, which is the prerequisite for quantifying human cardiac anatomy and locating lesions [3]; segmentation therefore occupies an essential position in clinical practice. In recent years, significant progress has been made thanks to the development of deep convolutional neural networks. Ideally, the training and testing images for a deep learning network share the same pixel intensity distribution, and a large number of accurately annotated training images ensures that the model learns enough to achieve promising segmentation results on images of the same modality. In actual clinical practice, however, the testing images are often of a different modality, or of the same modality but from different vendors, giving rise to large differences between the intensity distributions of the training and testing images. These differences, known as domain shift, often lead to a significant degradation in model performance.

In clinical practice, magnetic resonance imaging (MRI) and computed tomography (CT) images are often used to diagnose CVD. Common cardiac magnetic resonance (CMR) imaging modalities include late gadolinium enhancement (LGE), balanced steady-state free precession (bSSFP), T1 and T2 images. LGE images are commonly used for diagnosing myocardial disease, while bSSFP images show clear borders between the myocardium and the ventricles. T1 images are used to show anatomical structures, while T2 images are used to display pathological information [4]. Figure 1 demonstrates the considerable discrepancy in intensity distribution and appearance between MRI and CT cardiac images.

Fig. 1

Illustration of the domain shift existing between various image types. A Comparison of MRI and CT coronal plane cardiac images, and their pixel intensity distribution. The main cardiac substructures include the ascending aorta (AA), left ventricle (LV), and left ventricular myocardium (LV_myo). B Comparison of short-axis LGE and bSSFP CMR images, and their pixel intensity distribution. The main cardiac substructures include the right ventricle (RV), LV, and LV_myo

However, since the annotation of medical images is extremely tedious, time-consuming, and costly, it is difficult to build a multimodal medical image segmentation dataset with many pixel-level labels. To reduce the annotation burden and prevent degradation of model performance, new studies on the unsupervised domain adaptation (UDA) segmentation method are emerging [5]. The UDA method uses the richly labeled images of one modality (the source domain) to train deep convolutional neural networks for segmenting poorly labeled images of another modality (the target domain).

The critical issue in UDA for cross-modality cardiac image segmentation is to extract useful features from the source and target domains and to reduce the discrepancy between their intensity distributions, yet the domain-invariant features extracted by a deep network are implicit and difficult to control. To overcome this limitation, we propose a UDA framework for cross-modality cardiac image segmentation that aligns the two domains cooperatively at multiple levels. Aligning the feature distributions at the image level, feature level, or output level alone is likely to lose semantic information shared between the source and target domains and ultimately to limit segmentation performance on the target domain. The cooperative training process ensures that the network extracts more useful semantic information from the multi-level feature spaces. Given the significant variation between the source and target domains, we also introduce an intermediate domain to manage the domain shift gradually. Specifically, we employ a style transfer sub-network (CycleGAN) to capture pixel-level information from the source and target domains and to generate fake target domain images that retain the original content and structural semantics. These fake target domain images act as the intermediate domain for the subsequent domain adaptation-based segmentation process, which contains two sub-networks that transfer the source domain label information to the target domain: a segmentation sub-network (SSN) that transfers the source domain label information to the intermediate domain and generates the corresponding pseudo-labels, and a self-training sub-network (StSN) that transfers the intermediate domain pseudo-label information to the target domain using a self-training strategy.

In summary, the main contributions of this paper are as follows:

  1.

    We propose a two-stage progressive UDA network (TSP-UDANet) for cross-modality cardiac image segmentation based on generative adversarial learning. The TSP-UDANet comprises a style transfer sub-network, a segmentation sub-network, and a self-training sub-network, which align the source and target domains at the image level, feature level, and output level, respectively.

  2.

    We introduce an intermediate domain as a bridge between the source and target domains. The intermediate domain is trained in an adversarial manner in the segmentation sub-network and the self-training sub-network with the source and target domains, respectively, to progressively reduce the discrepancy in feature distribution between the source and target domains.

  3.

    We introduce a self-training strategy into the self-training sub-network (StSN). This strategy combines labeled and unlabeled data to expand the total amount of data available for network training, thereby improving the performance of the UDA segmentation.

To validate the generalization performance of the TSP-UDANet, we have conducted extensive experiments on three cross-modality multi-objective medical image segmentation tasks, based on the MMWHS, MS-CMRSeg, and M&Ms datasets. The results on the three datasets demonstrate the effectiveness of the TSP-UDANet and its potential for further application in various tasks, e.g., the detection and segmentation of tumors in medical images.

The remainder of this paper is organized as follows: Section 2 presents related works from the literature. Section 3 gives the details of the TSP-UDANet, including a method overview, the style transfer sub-network, segmentation sub-network, self-training sub-network, network configurations, and implementation details. Section 4 describes the design of the experiments. Section 5 presents the experimental results. Section 6 introduces the ablation analysis and Section 7 discusses the performance of the TSP-UDANet. Finally, our conclusions and suggestions for future work are offered in Section 8.

2 Related work

When analyzing images from one modality, exploiting labeled images from another modality is challenging because of the significant domain shift caused by the differences in image properties between them. Specifically, in an unsupervised cross-modality cardiac segmentation task, the main idea is to extract domain-invariant features from the source and target domains and to transfer label information from the source domain to the target domain. To better transfer knowledge learned from a source domain with rich labels to a target domain without labels, the UDA segmentation method has attracted recent attention [6, 7]. This approach uses labeled source domain images to train a model and applies it to segmenting target domain images, which are generally of another modality, or of the same modality but acquired by machines from different vendors. Current cross-modality UDA segmentation methods typically follow one of two strategies. The first is to train the segmentation network with the labeled source domain images and then either fine-tune it with some of the target domain images [8] or apply it directly to segmenting the target domain images. The second is to minimize the discrepancies in feature distribution between the source and target domains by aligning latent features at the image level, the feature level, and the output level.

2.1 Image-level alignment

The goal of image-level alignment is to minimize the differences in the distribution of pixel intensities between the source and target domains. This ensures that the knowledge gained from the source domain can be effectively transferred to the target domain, thus improving the segmentation performance of the target domain. Image-level alignment is usually achieved in one of two ways. One is to extract domain-invariant features at the input level of the segmentation network [9, 10]. Here, the source domain and target domain images can share the feature extraction part of the segmentation network to learn image-level features, such as grayscale distribution and texture information. The other is the style transfer method for cross-modality images. In this case, the mapping relationship between the source domain and target domain is learned, and the generated cross-modality images are sent to the segmentation network for segmentation [11, 12]. Traditional cross-domain image segmentation methods require a large number of paired training images. However, such paired images are usually difficult or even impossible to obtain. Zhu et al. [13] proposed a cycle generative adversarial network (CycleGAN) to generate fake target domain and fake source domain images without the need for paired training images. Following the success of CycleGAN in the image-to-image translation task, researchers have converted the appearance of images from different modalities by translating the image style from the source (target) domain to the target (source) domain. Jiang et al. [11] proposed an unsupervised cross-modality domain adaptation network for lung cancer region segmentation by transforming the CT image style into that of MRI images. Chen et al. [12] proposed a semantics-aware generative adversarial network (SeUDA) to align the image-level features of different X-ray datasets for left/right lung segmentation.

2.2 Feature-level alignment

Feature-level alignment aims to adjust the features of the source and target domain data so that they have a consistent representation in feature space, thereby reducing domain shift in higher-dimensional feature spaces by minimizing the distribution discrepancy of the feature maps extracted from the two domains. Typical approaches minimize the maximum mean discrepancy (MMD) [14, 15], the Wasserstein generative adversarial network (WGAN) loss [16], or the distribution distance in specific feature spaces [17]. Some studies have introduced the GAN into feature-level alignment, where adversarial training of generators and discriminators makes the generators focus on the features common to the target and source domains [18,19,20,21]. Kamnitsas et al. [22] proposed learning domain-invariant features for brain lesion segmentation with an adversarial network, and designed a multi-connected domain discriminator that predicts the input image domain. Jain et al. [23] employed an adversarial learning scheme to adapt knowledge from PV phase images to ART phase images for detecting liver tumors.

2.3 Output-level alignment

Output-level alignment is primarily used to extract the domain-invariant features in semantic prediction space. The output-level alignment can make the segmentation results of the source and target domains semantically consistent, so as to improve the segmentation performance in the target domain. For output-level alignment, most methods are based on the GAN [24,25,26], where the output-level features obtained from the generator are fed to the discriminator for generative adversarial training. Panfilov et al. [24] proposed a two-stage network for unsupervised domain adaptation by generative adversarial learning in multi-level feature spaces. Their methods achieved unsupervised segmentation of MRI images from different scanners. Yang et al. [25] proposed a self-attentive GAN that forces the feature maps generated by the generator between the source and target domains to be indistinguishable at the output level.

When facing the challenges of severe domain shift in cross-modality medical image segmentation, the approaches which use image-level, feature-level, or output-level alignment alone are often not sufficient. Thus, multi-level alignment methods should be beneficial for extracting domain-invariant features.

3 Methods

3.1 Overview

This work aims to train a UDA segmentation network for segmenting target domain images for which pixel-level annotations are unavailable. We achieve UDA segmentation of unlabeled target domain images by introducing an adversarial training strategy into the segmentation network. Because different imaging techniques or imaging parameters produce significant style discrepancies in the appearance of the source and target domain images, we introduce an intermediate domain as a bridge to transfer the label information from the source domain to the target domain. Figure 2 shows the framework of the TSP-UDANet, which consists of three sub-networks: the style transfer sub-network (CycleGAN), the segmentation sub-network (SSN), and the self-training sub-network (StSN). Table 1 summarizes the symbols used in the following sections.

Fig. 2

The framework of the TSP-UDANet for cross-modality cardiac segmentation, including the style transfer sub-network (CycleGAN), the segmentation sub-network (SSN), and the self-training sub-network (StSN). \({D}_{0}\) and \({D}_{1}\) are used for image-level adversarial training, \({D}_{2}\) and \({\widetilde{D}}_{2}\) are used for feature-level adversarial training, \({D}_{3}\) and \({\widetilde{D}}_{3}\) are used for output-level adversarial training. Upsample scales the predicted images to the raw image size using bilinear interpolation

Table 1 Summary of symbols

3.2 Style transfer sub-network

To reduce the visual difference and the effect of domain shift between the source and target domains, we use image-level alignment to transform the style of the source domain images to the style of the target domain images. The style transfer sub-network for image-level alignment in our framework borrows ideas from the CycleGAN [13], which consists of a source domain generator (\({G}_{\mathrm{t}\to \mathrm{s}}\)), target domain generator (\({G}_{\mathrm{s}\to \mathrm{t}}\)), source domain discriminator (\({D}_{0}\)), and target domain discriminator (\({D}_{1}\)). The generators are used for image reconstruction and generating fake images. The target domain generator (\({G}_{\mathrm{s}\to \mathrm{t}}\)) is used to transfer the source domain (\({X}^{\mathrm{s}}\)) style to that of the target domain (\({X}^{\mathrm{t}}\)), while the source domain generator (\({G}_{\mathrm{t}\to \mathrm{s}}\)) is used to transfer the target domain (\({X}^{\mathrm{t}}\)) style to that of the source domain (\({X}^{\mathrm{s}}\)).

During the training of the CycleGAN, the source domain images (\({x}^{\mathrm{s}}\)) are fed into \({G}_{\mathrm{s}\to \mathrm{t}}\) to generate the fake target domain images (\({x}^{\mathrm{s}\to \mathrm{t}}\)), then these fake target domain images \(({x}^{\mathrm{s}\to \mathrm{t}})\) are put into \({G}_{\mathrm{t}\to \mathrm{s}}\) to generate the reconstructed source domain images \({(x}^{\mathrm{s}\to \mathrm{t} \to \mathrm{s}}={G}_{\mathrm{t}\to \mathrm{s}}\left({G}_{\mathrm{s}\to \mathrm{t}}\left({x}^{\mathrm{s}}\right)\right)\). Similarly, the target domain images (\({x}^{\mathrm{t}}\)) pass through the generators \({G}_{\mathrm{t}\to \mathrm{s}}\) and \({G}_{\mathrm{s}\to \mathrm{t}}\) in turn, to generate the reconstructed target domain images (\({x}^{\mathrm{t}\to \mathrm{s}\to \mathrm{t}}={G}_{\mathrm{s}\to \mathrm{t}}\left({G}_{\mathrm{t}\to \mathrm{s}}\left({x}^{\mathrm{t}}\right)\right)\)). The source and target domain images share source and target domain generators. In the CycleGAN, the cyclic structure enables bidirectional style transfer between the source and target domain images. The loss function used for the cycle reconstruction is:

$$\begin{aligned} L_{{{\text{cyc}}}} \left( {G_{\mathrm{s}\to \mathrm{t}} ,G_{\mathrm{t}\to \mathrm{s}} } \right) & = E_{{x^{\mathrm{t}} \sim X^{\mathrm{t}} }} \left[ {\left| {G_{\mathrm{s}\to \mathrm{t}} \left( {G_{\mathrm{t}\to \mathrm{s}} \left( {x^{\mathrm{t}} } \right)} \right) - x^{\mathrm{t}} } \right|} \right] \\ & \quad + E_{{x^{\mathrm{s}} \sim X^{\mathrm{s}} }} \left[ {\left| {G_{\mathrm{t}\to \mathrm{s}} \left( {G_{\mathrm{s}\to \mathrm{t}} \left( {x^{\mathrm{s}} } \right)} \right) - x^{\mathrm{s}} } \right|} \right] \\ \end{aligned}$$
(1)

where the cycle consistency loss \({L}_{\mathrm{cyc}}\) ensures that the reconstructed images preserve the contents of the real images.
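
For concreteness, the following is a minimal PyTorch sketch of Eq. (1). It is an illustration only: `G_s2t` and `G_t2s` stand for the generator modules described in Section 3.5, and the L1 distance is taken as the mean absolute error over each batch.

```python
def cycle_consistency_loss(G_s2t, G_t2s, x_s, x_t):
    """L_cyc of Eq. (1): L1 distance between the real and reconstructed images."""
    x_t_rec = G_s2t(G_t2s(x_t))  # target -> source -> target
    x_s_rec = G_t2s(G_s2t(x_s))  # source -> target -> source
    return (x_t_rec - x_t).abs().mean() + (x_s_rec - x_s).abs().mean()
```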

In contrast to the GAN, the CycleGAN performs bidirectional generation for the source and target domains. The target domain generator (\({G}_{\mathrm{s}\to \mathrm{t}}\)) generates fake target domain images (\({x}^{\mathrm{s}\to \mathrm{t}})\) and the source domain generator (\({G}_{\mathrm{t}\to \mathrm{s}}\)) generates fake source images (\({x}^{\mathrm{t}\to \mathrm{s}}\)). Optimization of \({G}_{\mathrm{s}\to \mathrm{t}}\) and \({G}_{\mathrm{t}\to \mathrm{s}}\) relies on the generator loss:

$$\begin{aligned} L_{{{\text{adv}}}}^{G} \left( {G_{\mathrm{s}\to \mathrm{t}} ,G_{\mathrm{t}\to \mathrm{s}} } \right) & = E_{{x^{\mathrm{s}} \sim X^{\mathrm{s}} }} \left[ {(D_{1} (G_{\mathrm{s}\to \mathrm{t}} \left( {x^{\mathrm{s}} } \right)) - 1)^{2} } \right] \\ & \quad + E_{{x^{\mathrm{t}} \sim X^{\mathrm{t}} }} \left[ {(D_{0} (G_{\mathrm{t}\to \mathrm{s}} \left( {x^{\mathrm{t}} } \right)) - 1)^{2} } \right] \\ & \quad + \lambda \cdot L_{{{\text{cyc}}}} \left( {G_{\mathrm{s}\to \mathrm{t}} ,G_{\mathrm{t}\to \mathrm{s}} } \right) \\ \end{aligned}$$
(2)

where \(\uplambda\) is the weighting of the cycle consistency loss (\({L}_{\mathrm{cyc}}\)) in the generator loss (\({L}_{\mathrm{adv}}^{G}\)).

The discriminator \({D}_{0}\) is used to determine whether the image input to \({D}_{0}\) is a fake source image (\({x}^{\mathrm{t}\to \mathrm{s}}\)) or a real source image (\({x}^{\mathrm{s}}\)). \({D}_{1}\) is used to determine whether the image fed to \({D}_{1}\) is the fake target domain image (\({x}^{\mathrm{s}\to \mathrm{t}}\)) or the real target image (\({x}^{\mathrm{t}}\)). Optimization of \({D}_{0}\) and \({D}_{1}\) relies on the discriminator loss:

$$\begin{aligned} &L_{{{\text{adv}}}}^{D} \left( {D_{0} ,D_{1} } \right) \\ & = E_{{x^{\mathrm{t}} \sim X^{\mathrm{t}} }} \left[ {(D_{1} \left( {x^{\mathrm{t}} } \right) - 1)^{2} + (D_{0} \left( {G_{\mathrm{t}\to \mathrm{s}} (x^{\mathrm{t}} } \right)) - 0)^{2} } \right] \\ & \quad + E_{{x^{\mathrm{s}} \sim X^{\mathrm{s}} }} \left[ {(D_{1} \left( {G_{\mathrm{s}\to \mathrm{t}} (x^{\mathrm{s}} } \right)) - 0)^{2} + (D_{0} \left( {x^{\mathrm{s}} } \right) - 1)^{2} } \right] \\ \end{aligned}$$
(3)

where \(1\in {\mathcal{R}}^{H/8\times W/8\times 1}\) represents the real images, while \(0\in {\mathcal{R}}^{H/8\times W/8\times 1}\) represents the fake images. The generators and discriminators are alternately optimized to generate fake images that can confuse the discriminators. The fake target domain images are used in the intermediate domain for the SSN and StSN. The training process of CycleGAN is summarized in Algorithm 1.

Algorithm 1 Training process of the style transfer sub-network (CycleGAN)
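
The original listing is not reproduced here. As an illustration only, the sketch below shows one training iteration consistent with Eqs. (1)–(3) under the least-squares objectives; the module names, the optimizers, and the default value of `lam` are placeholders (Section 3.7 gives the settings actually used).

```python
def cyclegan_step(G_s2t, G_t2s, D0, D1, opt_G, opt_D, x_s, x_t, lam=1.0):
    """One iteration of Algorithm 1: alternate generator and discriminator updates."""
    # Generator step: Eq. (2), i.e. least-squares GAN terms plus the cycle term of Eq. (1).
    opt_G.zero_grad()
    x_s2t, x_t2s = G_s2t(x_s), G_t2s(x_t)
    loss_gen = ((D1(x_s2t) - 1) ** 2).mean() + ((D0(x_t2s) - 1) ** 2).mean()
    loss_cyc = (G_t2s(x_s2t) - x_s).abs().mean() + (G_s2t(x_t2s) - x_t).abs().mean()
    (loss_gen + lam * loss_cyc).backward()
    opt_G.step()

    # Discriminator step: Eq. (3), with fake images detached from the generator graph.
    opt_D.zero_grad()
    loss_dis = ((D1(x_t) - 1) ** 2).mean() + (D0(x_t2s.detach()) ** 2).mean() \
             + (D1(x_s2t.detach()) ** 2).mean() + ((D0(x_s) - 1) ** 2).mean()
    loss_dis.backward()
    opt_D.step()

    # The fake target-domain image later serves as an intermediate-domain sample.
    return x_s2t.detach()
```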

3.3 Segmentation sub-network (SSN)

Due to the differing principles and acquisition parameters of the imaging modalities, there are disparities in the feature distributions of the source and target domains. To better extract domain-invariant features, we use the SSN to transfer the label information of the source domain to the intermediate domain. The SSN is a two-level generative adversarial network that aligns the source and intermediate domains at both the feature level and the output level. Because of the large discrepancy in feature distribution between the source and target domain images, we introduce an intermediate domain consisting of the generated fake target domain images, which have the appearance of the target domain while retaining the content of the source domain, so that the SSN can better exploit the label information of the source domain. The SSN transfers the source domain label information to the intermediate domain and produces segmentation results for the intermediate domain images that serve as pseudo-labels during StSN training. The SSN is a generative adversarial network consisting of a generator (\({G}_{\mathrm{seg}}\)), a feature-level discriminator (\({D}_{2}\)), and an output-level discriminator (\({D}_{3}\)), where \({G}_{\mathrm{seg}}\) is composed of a feature extractor (\(F\)) and a class predictor (\(P\)). \(F\) uses a modified ResNet101 [27], and \(P\) is the Atrous Spatial Pyramid Pooling (ASPP) module [28], which uses multi-scale convolution to extract multi-level semantic features for pixel classification. Generative adversarial learning aligns the feature distributions at both the feature level and the output level to reduce the domain shift between the source and intermediate domains.

During the training of the SSN, \(F\) extracts the source domain feature maps (\({f}^{\mathrm{s}}\)) from the source domain images (\({x}^{\mathrm{s}}\)), where \({f}^{\mathrm{s}}=F({x}^{\mathrm{s}})\). \(P\) takes \({f}^{\mathrm{s}}\) as input and upsamples to produce the source domain pixel-level prediction output (\({p}_{i,c}^{\mathrm{s}}\)), where \({p}_{i,c}^{\mathrm{s}}=Up(P({f}^{\mathrm{s}}))\). The operator \(Up\) is a bilinear interpolation algorithm that upsamples the output feature maps to the size of the raw image. We use \({p}_{i,c}^{\mathrm{s}}\) and one-hot source domain ground truths (\({y}_{i,c}^{\mathrm{s}}\)) to compute \({L}_{\mathrm{seg}}^{\mathrm{s}}\) and optimize \({G}_{\mathrm{seg}}\). During the training of the SSN, the source domain image segmentation is supervised, and \({G}_{\mathrm{seg}}\) applies the source domain pixel-level label information to the intermediate domain. Source domain supervised image segmentation loss \({L}_{\mathrm{seg}}^{\mathrm{s}}({G}_{\mathrm{seg}})\) is:

$${L}_{\mathrm{seg}}^{\mathrm{s}}\left({G}_{\mathrm{seg}}\right)=-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}{y}_{i,c}^{\mathrm{s}}\cdot \mathrm{log}\left({p}_{i,c}^{\mathrm{s}}\right)$$
(4)

where \(c\) is the class index, \(C\) is the total number of classes (determined by the number of segmented objects in the different datasets), and \(N\) is the number of samples in a batch.

We feed the fake target domain images (\({x}^{\mathrm{s}\to \mathrm{t}}\)) as an intermediate domain into \(F\) and output the intermediate domain feature maps (\({f}^{\mathrm{s}\to \mathrm{t}}\)), where \({f}^{\mathrm{s}\to \mathrm{t}}=F({x}^{\mathrm{s}\to \mathrm{t}})\). \(P\) takes \({f}^{\mathrm{s}\to \mathrm{t}}\) as an input feature map which is then upsampled to form the intermediate domain pixel-level prediction results (\({p}_{i,c}^{\mathrm{s}\to \mathrm{t}}\)), where \({p}_{i,c}^{\mathrm{s}\to \mathrm{t}}=Up(P({f}^{\mathrm{s}\to \mathrm{t}}))\). \({p}_{i,c}^{\mathrm{s}\to \mathrm{t}}\) are used as the pseudo-labels for the intermediate domain images in StSN. The unsupervised domain adaptation is conducted by alternately optimizing the generator (\({G}_{\mathrm{seg}}\)) and discriminators \({(D}_{2}\) and \({D}_{3})\). The adversarial loss \({L}_{\mathrm{adv}}^{{G}_{\mathrm{seg}}}\) is used to confuse \({D}_{2}\) and \({D}_{3}\) to align the feature distribution of \({f}^{\mathrm{s}\to \mathrm{t}}\) and \({f}^{\mathrm{s}}\) at feature level, and the output distribution of \({p}_{i,c}^{\mathrm{s}\to \mathrm{t}}\) and \({p}_{i,c}^{\mathrm{s}}\) at output level. The adversarial loss \({L}_{\mathrm{adv}}^{{G}_{\mathrm{seg}}}({G}_{\mathrm{seg}})\) is:

$$\begin{aligned}{L}_{\mathrm{adv}}^{{G}_{\mathrm{seg}}}\left({G}_{\mathrm{seg}}\right)&={E}_{{x}^{\mathrm{s}\to \mathrm{t}}\sim {X}^{\mathrm{s}\to \mathrm{t}}}\left[{{(D}_{2}(F\left({x}^{\mathrm{s}\to \mathrm{t}}\right))-0)}^{2}\right.\\ &\left.+{{(D}_{3}({G}_{\mathrm{seg}}\left({x}^{\mathrm{s}\to \mathrm{t}}\right))-0)}^{2}\right]\end{aligned}$$
(5)

where \({G}_{\mathrm{seg}}\) extracts the domain-invariant features of the source and intermediate domains. Finally, \({D}_{2}\) and \({D}_{3}\) are used to distinguish the features of the source and intermediate domains. The adversarial loss \({L}_{\mathrm{adv}}^{D}\left({D}_{2},{D}_{3}\right)\) is:

$$\begin{aligned}& L_{{{\text{adv}}}}^{D} \left( {D_{2} ,D_{3} } \right) \\ & = E_{{x^{\mathrm{s}} \sim X^{\mathrm{s}} }} \left[ {(D_{2} \left( {F\left( {x^{\mathrm{s}} } \right)} \right) - 0)^{2} + (D_{3} \left( {G_{{{\text{seg}}}} \left( {x^{\mathrm{s}} } \right)} \right) - 0)^{2} } \right] \\ & \quad + E_{{x^{\mathrm{s}\to \mathrm{t}} \sim X^{\mathrm{s}\to \mathrm{t}} }} \left[ {\left( {D_{2} \left( {F\left( {x^{\mathrm{s}\to \mathrm{t}} } \right)} \right) - 1} \right)^{2} + \left( {D_{3} \left( {G_{{{\text{seg}}}} \left( {x^{\mathrm{s}\to \mathrm{t}} } \right)} \right) - 1} \right)^{2} } \right] \\ \end{aligned}$$
(6)

where \(1\in {\mathcal{R}}^{H/8\times W/8\times 1}\) and \(0\in {\mathcal{R}}^{H/8\times W/8\times 1}\) represent the intermediate and source domains, respectively. The training process of the SSN is summarized in Algorithm 2.

Algorithm 2 Training process of the segmentation sub-network (SSN)
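
Again, the original listing is not reproduced; the following sketch illustrates one SSN iteration consistent with Eqs. (4)–(6). `F_ext`, `P`, `D2`, `D3`, and the optimizers are placeholders, `y_s` holds per-pixel class indices (the argmax of the one-hot labels), the adversarial weight `lam_adv` is an assumed hyperparameter, and feeding softmax probabilities (rather than logits) to the output-level discriminator is an assumption.

```python
import torch
import torch.nn.functional as Fn

def ssn_step(F_ext, P, D2, D3, opt_G, opt_D, x_s, y_s, x_s2t, lam_adv=1e-3):
    """One iteration of Algorithm 2: supervised source-domain segmentation (Eq. 4)
    plus feature- and output-level adversarial alignment (Eqs. 5 and 6)."""
    up = lambda z: Fn.interpolate(z, size=x_s.shape[-2:], mode="bilinear",
                                  align_corners=True)
    # Generator step.
    opt_G.zero_grad()
    f_s, f_i = F_ext(x_s), F_ext(x_s2t)          # source / intermediate features
    q_s, q_i = P(f_s), P(f_i)                    # H/8 x W/8 x C class predictions
    loss_seg = Fn.cross_entropy(up(q_s), y_s)    # Eq. (4)
    loss_adv = (D2(f_i) ** 2).mean() + (D3(torch.softmax(q_i, 1)) ** 2).mean()  # Eq. (5)
    (loss_seg + lam_adv * loss_adv).backward()
    opt_G.step()

    # Discriminator step: source -> 0, intermediate -> 1, as in Eq. (6).
    opt_D.zero_grad()
    loss_dis = (D2(f_s.detach()) ** 2).mean() \
             + (D3(torch.softmax(q_s, 1).detach()) ** 2).mean() \
             + ((D2(f_i.detach()) - 1) ** 2).mean() \
             + ((D3(torch.softmax(q_i, 1).detach()) - 1) ** 2).mean()
    loss_dis.backward()
    opt_D.step()
```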

3.4 Self-training sub-network (StSN)

With the abovementioned adversarial training of the source and intermediate domains in SSN, we have obtained good segmentation performance using the fake target domain images (\({x}^{\mathrm{s}\to \mathrm{t}}\)) as the intermediate domain. Unfortunately, this is still insufficient to achieve the desired performance when domain shift is severe. Therefore, we introduce a self-training strategy and use the intermediate domain and target domain to train the StSN to transfer the label information of the intermediate domain to the target domain for further improving the image segmentation results of the target domain. It is worth noting that the network structure of the StSN is identical to that of the SSN, the only difference being the images fed into the network during the training process. In the SSN, we use the source and intermediate domains for generative adversarial training, while in the StSN, we use the intermediate and target domains. The prediction results of the intermediate domain images (\({x}^{\mathrm{s}\to \mathrm{t}}\)) in the SSN act as pseudo-labels (\({p}_{i,c}^{\mathrm{s}\to \mathrm{t}}\)) in the StSN.
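
A minimal sketch of how such pseudo-labels can be produced once the SSN has been trained is given below; the module names are placeholders, and taking the per-pixel argmax at the class-predictor resolution (rather than after upsampling to the raw image size) is an assumption.

```python
import torch
import torch.nn.functional as Fn

@torch.no_grad()
def make_pseudo_labels(F_ext, P, x_s2t, num_classes):
    """Predict the intermediate-domain images with the trained SSN and keep the
    per-pixel argmax as one-hot pseudo-labels for StSN training (Eq. 7)."""
    logits = P(F_ext(x_s2t))                    # N x C x H/8 x W/8
    labels = logits.argmax(dim=1)               # per-pixel class index
    one_hot = Fn.one_hot(labels, num_classes)   # N x H/8 x W/8 x C
    return one_hot.permute(0, 3, 1, 2).float()
```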

Firstly, the outputs (\({\widetilde{p}}_{i,c}^{\mathrm{s}\to \mathrm{t}}\)) of the generator (\({\widetilde{G}}_{\mathrm{seg}}\)) are used to compute the segmentation loss (\({L}_{\mathrm{seg}}^{\mathrm{s}\to \mathrm{t}}\)) under the supervision of the one-hot pseudo-labels (\({p}_{i,c}^{\mathrm{s}\to \mathrm{t}}\)). The intermediate domain segmentation loss (\({L}_{\mathrm{seg}}^{\mathrm{s}\to \mathrm{t}}({\widetilde{G}}_{\mathrm{seg}})\)) is:

$${L}_{\mathrm{seg}}^{\mathrm{s}\to \mathrm{t}}({\widetilde{G}}_{\mathrm{seg}})=-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}{p}_{i,c}^{\mathrm{s}\to \mathrm{t}}\cdot \mathrm{log}({\widetilde{p}}_{i,c}^{\mathrm{s}\to \mathrm{t}})$$
(7)

where \(c\) is the class index, \(C\) is the total number of classes (determined by the number of segmented objects in different datasets), and \(N\) is the number of samples in a batch in the training process.

Secondly, adversarial training between the intermediate and target domains progressively transfers the pseudo-label information of the intermediate domain to the target domain. The adversarial loss (\({L}_{\mathrm{adv}}^{{\widetilde{G}}_{\mathrm{seg}}}\left({\widetilde{G}}_{\mathrm{seg}}\right)\)) is:

$$\begin{aligned}{L}_{\mathrm{adv}}^{{\widetilde{G}}_{\mathrm{seg}}}\left({\widetilde{G}}_{\mathrm{seg}}\right)&={E}_{{x}^{\mathrm{t}}\sim {X}^{\mathrm{t}}}\left[{({\widetilde{D}}_{2}\left(\widetilde{F}\left({x}^{\mathrm{t}}\right)\right)-0)}^{2}\right.\\ &\left.+{\left({\widetilde{D}}_{3}\left({\widetilde{G}}_{\mathrm{seg}}\left({x}^{\mathrm{t}}\right)\right)-0\right)}^{2}\right]\end{aligned}$$
(8)

Finally, the feature-level discriminator (\({\widetilde{D}}_{2}\)) and the output-level discriminator (\({\widetilde{D}}_{3}\)) are optimized by \({L}_{\mathrm{adv}}^{\widetilde{D}}\). \({\widetilde{D}}_{2}\) and \({\widetilde{D}}_{3}\) are used to distinguish features from different domains and train them adversarially with the generator (\({\widetilde{G}}_{\mathrm{seg}}\)). The adversarial loss \({L}_{\mathrm{adv}}^{\widetilde{D}}\left({\widetilde{D}}_{2}, {\widetilde{D}}_{3}\right)\) is:

$$\begin{aligned}& L_{{{\text{adv}}}}^{{\tilde{D}}} \left( {\tilde{D}_{2} ,\tilde{D}_{3} } \right)\\ & = E_{{x^{\mathrm{t}} \sim X^{\mathrm{t}} }} \left[ {\left( {\tilde{D}_{2} \left( {\tilde{F}\left( {x^{\mathrm{t}} } \right)} \right) - 1} \right)^{2} + \left( {\tilde{D}_{3} \left( {\tilde{G}_{{{\text{seg}}}} \left( {x^{\mathrm{t}} } \right)} \right) - 1} \right)^{2} } \right] \\ & \quad + E_{{x^{\mathrm{s}\to \mathrm{t}} \sim X^{\mathrm{s}\to \mathrm{t}} }} \left[ {\left( {\tilde{D}_{2} \left( {\tilde{F}\left( {x^{\mathrm{s}\to \mathrm{t}} } \right)} \right) - 0} \right)^{2} + \left( {\tilde{D}_{3} \left( {\tilde{G}_{{{\text{seg}}}} \left( {x^{\mathrm{s}\to \mathrm{t}} } \right)} \right) - 0} \right)^{2} } \right] \\ \end{aligned}$$
(9)

where \(1\in {\mathcal{R}}^{H/8\times W/8\times 1}\) and \(0\in {\mathcal{R}}^{H/8\times W/8\times 1}\) represent the target and intermediate domains, respectively. The potential feature distributions of the intermediate and target domains are aligned by optimizing the adversarial loss (\({L}_{\mathrm{adv}}^{{\widetilde{G}}_{\mathrm{seg}}}\)) and the discriminator loss (\({L}_{\mathrm{adv}}^{\widetilde{D}}\)). The training process of the StSN is summarized in Algorithm 3.

Algorithm 3 Training process of the self-training sub-network (StSN)

3.5 Network configurations

In the style transfer sub-network, the generators (\({G}_{\mathrm{s}\to \mathrm{t}}\) and \({G}_{\mathrm{t}\to \mathrm{s}}\)) have the same structure (Fig. 3A), and the discriminators \({(D}_{0}\) and \({D}_{1})\) also have the same structure (Fig. 3B).

Fig. 3

Details of the generators (\({G}_{\mathrm{s}\to \mathrm{t}}\) and \({G}_{\mathrm{t}\to \mathrm{s}}\)) and the discriminators (\({D}_{0}\) and \({D}_{1}\)) in the style transfer sub-network. A Structure of generators (\({G}_{\mathrm{s}\to \mathrm{t}}\) and \({G}_{\mathrm{t}\to \mathrm{s}}\)), B Structure of discriminators (\({D}_{0}\) and \({D}_{1}\)), and C Residual block used in the generators (\({G}_{\mathrm{s}\to \mathrm{t}}\) and \({G}_{\mathrm{t}\to \mathrm{s}}\)). \(W\) and \(H\) are the width and height of the raw image

The SSN and StSN have the same structure, which comprises a feature extractor, a class predictor, a feature-level discriminator, and an output-level discriminator. The feature extractors (\(F\) and \(\widetilde{F}\)) are based on the basic ResNet101 [27], with the final fully connected layer removed. The residual structure of ResNet101 alleviates the problem of vanishing gradients in the feature extractors (\(F\)  and  \(\widetilde{F}\)). The feature extractors \((F\mathrm{\,and\,}\widetilde{F})\) generate feature maps of size \(H/8\times W/8\times 2048\), which are fed into the class predictors (\(P\) and \(\widetilde{P}\)), respectively. The class predictors (\(P\) and \(\widetilde{P}\)) adopt the classical ASPP [28]. The ASPP uses convolutional kernels with different dilation rates sampled in parallel to extract multi-scale cardiac context information, as shown in Fig. 4. The feature maps generated by \(P\) and \(\widetilde{P}\) have size \(H/8\times W/8\times C\), where \(C\) is the number of classes (foreground classes plus background). They are scaled to the raw image size of \(H\times W\times C\) by bilinear interpolation.

Fig. 4

The class predictor is the ASPP. The dilation rates of the convolution kernels are 6, 12, 18, and 24 in order. \(W\) and \(H\) are the width and height of the raw image. \(C\) is the number of classes
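
For illustration, a minimal PyTorch sketch of such an ASPP class predictor is given below, using the dilation rates 6, 12, 18, and 24 stated in the caption; the 3×3 kernels and the fusion of the four branches by summation follow the original DeepLab-v2 design and are assumptions where Fig. 4 is not explicit.

```python
import torch.nn as nn

class ASPP(nn.Module):
    """Class predictor: parallel 3x3 dilated convolutions (rates 6/12/18/24)
    applied to the 2048-channel feature maps and fused by summation."""
    def __init__(self, in_channels=2048, num_classes=5, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, num_classes, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )

    def forward(self, x):
        # Input: N x 2048 x H/8 x W/8; output: N x C x H/8 x W/8.
        return sum(branch(x) for branch in self.branches)
```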

Figure 5 shows the structure of the feature-level discriminators (\({D}_{2}\) and \({\widetilde{D}}_{2}\)) and output-level discriminators (\({D}_{3}\) and \({\widetilde{D}}_{3}\)).

Fig. 5

Details of \({D}_{2}\), \({\widetilde{D}}_{2}\), \({D}_{3}\), and \({\widetilde{D}}_{3}\) in the SSN and StSN. A Structure of the feature-level discriminators (\({D}_{2}\) and \({\widetilde{D}}_{2}\)), B Structure of output-level discriminators (\({D}_{3}\) and \({\widetilde{D}}_{3}\)). \(W\) and \(H\) are the width and height of the raw image. \(C\) is the number of classes
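
Fig. 5 specifies the layer configurations actually used; as an illustration only, the sketch below shows a fully convolutional, PatchGAN-style discriminator that outputs a single-channel spatial decision map, which is how the labels 0 and 1 in Eqs. (3), (6), and (9) are applied. The channel widths, kernel sizes, strides, and leaky ReLU activations here are assumptions.

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Fully convolutional discriminator producing a one-channel decision map
    (illustrative layer settings; see Fig. 5 for the configuration used)."""
    def __init__(self, in_channels, base_channels=64):
        super().__init__()
        c = base_channels
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, c, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(c, 2 * c, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(2 * c, 4 * c, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(4 * c, 1, 3, stride=1, padding=1),
        )

    def forward(self, x):
        return self.net(x)
```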

3.6 Training strategy

The training of the TSP-UDANet is divided into two stages: Stage 1 (CycleGAN + SSN) and Stage 2 (CycleGAN + SSN + StSN). In Stage 1, we implement the image-level alignment by using the CycleGAN, which generates fake target domain images to serve as intermediate domain images. Then we train the SSN using the intermediate and source domain images, and the source domain label information is transferred to the intermediate domain. The trained SSN can segment the intermediate domain images, and the segmentation results can be used as pseudo-labels. In Stage 2, we train the StSN using the intermediate domain and the target domain images to transfer the intermediate domain pseudo-label information to the target domain. The trained StSN can then segment the target domain testing images. In the two stages, the source domain label information is progressively transferred to the target domain by using the multi-level adversarial training of the SSN and StSN. Thus, the intermediate domain acts as a bridge between the source and target domains to transfer domain-invariant features between them. In the testing process, the StSN trained in Stage 2 is used as the final segmentation network.
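
The schedule can be summarized by the following high-level sketch; every function and attribute name is hypothetical and merely stands in for Algorithms 1–3.

```python
def train_tsp_udanet(train_cyclegan, train_ssn, train_stsn,
                     source_loader, target_loader):
    # Stage 1: image-level alignment, then SSN adversarial training.
    cyclegan = train_cyclegan(source_loader, target_loader)        # Algorithm 1
    intermediate = [cyclegan.G_s2t(x) for x, _ in source_loader]   # fake target-domain images
    ssn = train_ssn(source_loader, intermediate)                   # Algorithm 2
    pseudo = [(x, ssn.predict(x)) for x in intermediate]           # pseudo-labels (Sect. 3.3)

    # Stage 2: StSN adversarial training; the trained StSN segments the target test images.
    stsn = train_stsn(pseudo, target_loader)                       # Algorithm 3
    return stsn
```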

3.7 Implementation details

We implemented our framework in PyTorch (version 1.7.0). Each sub-network was trained on a computer fitted with an NVIDIA Quadro RTX 5000 GPU and an Intel® Xeon® W-2133 CPU. For the CycleGAN, the batch size was 4, and the generators (\({G}_{\mathrm{s}\to \mathrm{t}}\) and \({G}_{\mathrm{t}\to \mathrm{s}}\)) and discriminators (\({D}_{0}\) and \({D}_{1}\)) all used the Adam optimizer [29] with a learning rate of \(2.0\times {10}^{-4}\). The weight \(\uplambda\) of the cycle consistency loss (\({L}_{\mathrm{cyc}}\)) in the generator loss (\({L}_{\mathrm{adv}}^{G}\)) was set to 0.8 for the MMWHS dataset and 1.0 for the MS-CMRSeg and M&Ms datasets. The CycleGAN was trained to generate fake target domain images to serve as the intermediate domain in the TSP-UDANet. Algorithm 1 outlines the CycleGAN training process.

The SSN uses the labeled source domain and the intermediate domain for adversarial training. After training, the SSN produces an initial segmentation of the intermediate domain images, which serves as their pseudo-labels. The StSN then uses the pseudo-labeled intermediate domain and the target domain for adversarial training, and the trained StSN achieves an accurate segmentation of the target domain images. The SSN and the StSN undergo the same training process. The generators (\({G}_{\mathrm{seg}}\) and \({\widetilde{G}}_{\mathrm{seg}}\)) use the stochastic gradient descent (SGD) optimizer [30] with a learning rate of \(2.0\times {10}^{-4}\), a momentum of 0.9, and a decay rate of \(5.0\times {10}^{-4}\). The discriminators (\({D}_{2}\) and \({D}_{3}\)) use the Adam optimizer with a learning rate of \(1.0\times {10}^{-4}\), as do the discriminators (\({\widetilde{D}}_{2}\) and \({\widetilde{D}}_{3}\)). Algorithms 2 and 3 outline the training processes of the SSN and the StSN, respectively.
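
The optimizer settings above can be written, for instance, as follows; the module names are placeholders, and interpreting the stated decay rate of \(5.0\times {10}^{-4}\) as SGD weight decay is an assumption.

```python
import itertools
import torch

def build_optimizers(G_s2t, G_t2s, D0, D1, G_seg, D2, D3):
    """Optimizers with the learning rates stated in Section 3.7."""
    opt_cycle_G = torch.optim.Adam(
        itertools.chain(G_s2t.parameters(), G_t2s.parameters()), lr=2e-4)
    opt_cycle_D = torch.optim.Adam(
        itertools.chain(D0.parameters(), D1.parameters()), lr=2e-4)
    opt_seg_G = torch.optim.SGD(
        G_seg.parameters(), lr=2e-4, momentum=0.9, weight_decay=5e-4)
    opt_seg_D = torch.optim.Adam(
        itertools.chain(D2.parameters(), D3.parameters()), lr=1e-4)
    return opt_cycle_G, opt_cycle_D, opt_seg_G, opt_seg_D
```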

4 Experiments

In this section, we describe the assessment of the effectiveness of our method under various conditions. These include MRI and CT cardiac images, bSSFP and LGE MRI images, and multi-disease MRI images from different centers and device manufacturers.

4.1 Datasets

To validate the segmentation performance of the TSP-UDANet for the segmentation of cardiac substructures from multimodal medical images, we performed experiments on three datasets: the cross-modality Multi-Modality Whole Heart Segmentation Challenge (MMWHS) dataset [31], the Multi-sequence Cardiac MR Segmentation Challenge (MS-CMRSeg) dataset [32], and the Multi-Center, Multi-Vendor & Multi-Disease Cardiac Image Segmentation Challenge (M&Ms) dataset [33]. We normalized the image slices of the three cardiac datasets and performed data augmentation by rotation, mirroring, and affine transformations to reduce overfitting.

The MMWHS dataset contains unpaired MRI images of 20 subjects and CT images of 20 subjects. The labels include four cardiac substructures: left ventricular myocardium (LV_myo), left atrium (LA), left ventricle (LV), and ascending aorta (AA). In the MRI \(\to\) CT adaptation, the source domain is MRI, and the target domain is CT; whereas in the CT \(\to\) MRI adaptation, the source and target domains are reversed. For the MRI and CT images, we randomly selected 80% of the subjects as the training set and the remaining 20% as the testing set. We resampled the raw images to the same in-plane resolution of \(1.0\times 1.0\) mm. We used 2D slices to train our framework and cropped all images at an ROI of \(256\times 256\) pixels, centered on the cardiac area. The size of the ROI was sufficient to contain the entirety of the cardiac substructures to be segmented. There were 70 to 100 slices per subject in the MRI image stacks, with 200 to 250 slices per subject for the CT images.
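
A minimal sketch of the MMWHS slice preprocessing described above (resampling to \(1.0\times 1.0\) mm and cropping a \(256\times 256\) ROI) is given below. How the cardiac centre is located is not detailed in the text, so it is assumed to be supplied (e.g., derived from the labels), and zero-mean/unit-variance normalization is likewise an assumption.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_slice(img, spacing, center, target_spacing=(1.0, 1.0), roi=256):
    """Resample a 2D slice to the target in-plane resolution, crop an ROI around
    a given cardiac centre (row, col), and normalize the intensities."""
    factors = (spacing[0] / target_spacing[0], spacing[1] / target_spacing[1])
    img = zoom(img, factors, order=1)                  # linear resampling
    cy = int(round(center[0] * factors[0]))
    cx = int(round(center[1] * factors[1]))
    half = roi // 2
    img = np.pad(img, ((half, half), (half, half)), mode="constant")
    crop = img[cy:cy + roi, cx:cx + roi]               # centre shifts by +half after padding
    return (crop - crop.mean()) / (crop.std() + 1e-8)
```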

The MS-CMRSeg dataset consists of CMR images in three modalities: LGE, bSSFP, and T2. In the cross-modality UDA experiments, since the number of T2 images was relatively small, we used only the bSSFP images as the source domain and the LGE images as the target domain. There were bSSFP images of 35 subjects and LGE images of 40 subjects. Segmentation objectives included the LV, LV_myo, and right ventricle (RV). There were 8 to 12 slices per subject in the bSSFP images, and 10 to 18 slices per subject in the LGE images. All images were resampled to the same in-plane resolution of \(1.25\times 1.25\) mm and cropped at an ROI of \(224\times 224\) pixels, centered on the cardiac area. We used the labeled bSSFP images as the source domain to segment the LGE images.

The M&Ms dataset consists of patients with hypertrophic cardiomyopathy, patients with dilated cardiomyopathy, and healthy subjects. All subjects were scanned at clinical centers in three countries (Spain, Germany, and Canada) using an MR scanner from one of four vendors (Siemens, General Electric, Philips, and Canon). The training set contains labeled images of 150 subjects from two of these MRI vendors (Siemens and Philips). The labeled structures are the LV, LV_myo, and RV. The testing set contains images from all four vendors (Siemens, General Electric, Philips, and Canon), comprising 160 subjects with 10 to 20 slices per subject. Since a significant cross-scanner performance drop was observed on the M&Ms dataset [33], the M&Ms challenge was treated in this study as a cross-modality cardiac segmentation task. We chose the training set of the M&Ms dataset as the source domain and its testing set as the target domain to further validate the generalizability of the TSP-UDANet. All images were aligned and resampled to \(1.25\times 1.25\) mm and cropped at an ROI of \(224\times 224\) pixels, centered on the cardiac area.

4.2 Evaluation metrics

We used three evaluation metrics: the dice similarity coefficient (Dice) [34], the average surface distance (ASD) [34], and the Hausdorff distance (HD) [35]. The Dice is used mainly to calculate the similarity between a 3D prediction and the ground truth. The higher the Dice score, the better the segmentation performance. The ASD is used to calculate the average distance between the surface of the 3D prediction and the ground truth, and the HD is the maximum distance from one group to the nearest point in another group, the groups being the 3D prediction and ground truth. In image segmentation, lower ASDs and HDs indicate better segmentation performance. To allow comparison with other studies using the same datasets, we selected Dice and ASD as the evaluation metrics for the MMWHS dataset and the Dice and HD for the MS-CMRSeg and M&Ms datasets.
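
For reference, the sketch below computes the Dice score and the symmetric surface distances for a pair of binary masks. Definitions of the HD and ASD vary slightly across papers (e.g., some report the 95th-percentile HD), so the maximum and mean of the symmetric surface distances are used here as one common choice.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice_score(pred, gt):
    """Dice similarity coefficient between two binary 3D masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + 1e-8)

def surface_distances(pred, gt, spacing):
    """Symmetric distances (in mm) between the surfaces of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    pred_surf = pred ^ binary_erosion(pred)
    gt_surf = gt ^ binary_erosion(gt)
    dist_to_gt = distance_transform_edt(~gt_surf, sampling=spacing)
    dist_to_pred = distance_transform_edt(~pred_surf, sampling=spacing)
    return np.concatenate([dist_to_gt[pred_surf], dist_to_pred[gt_surf]])

def asd(pred, gt, spacing):
    return surface_distances(pred, gt, spacing).mean()

def hd(pred, gt, spacing):
    return surface_distances(pred, gt, spacing).max()
```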

5 Results

This section shows the results of applying the TSP-UDANet for cardiac segmentation to the three cardiac datasets. We compared our approach with several recently developed methods to explore the segmentation performance of the two-stage multi-level generative adversarial network.

5.1 MMWHS dataset

We evaluated the MRI and CT image cross-modality unsupervised cardiac segmentation in two directions using the MMWHS dataset, namely from MRI to CT images (MRI \(\to\) CT) and CT to MRI images (CT \(\to\) MRI). In the MRI \(\to\) CT adaptation, we used labeled MRI and unlabeled CT images to train our TSP-UDANet, and in the CT \(\to\) MRI adaptation, we used labeled CT and unlabeled MRI images.

Table 2 shows the performance of our TSP-UDANet on the MMWHS dataset. In the MRI \(\to\) CT adaptation, we achieved a mean Dice score of 77.1% and a mean ASD of 7.9 mm for the four cardiac substructures. In the CT \(\to\) MRI adaptation, we achieved a mean Dice score of 69.0% and a mean ASD of 7.2 mm. The segmentation performance of CT \(\to\) MRI was worse than that of MRI \(\to\) CT because of the small number of MRI training images or the inherent characteristics of MRI images (i.e., their limited contrast) [34]. Figure 6 is a visualization of the segmentation results from the MMWHS dataset. It shows that our method can accurately segment the four cardiac substructures when compared to the ground truth.

Table 2 Results of MMWHS (MRI \(\to\) CT) adaptation segmentation
Fig. 6

Visualization of the results of our method from a representative subject in the MMWHS testing set. The uppermost row is the raw images, the middle row is the ground truth (GT), and the bottom row shows the predicted result (Pred). The images in the left panel were taken from MRI → CT, and those on the right from CT → MRI. The cardiac substructures AA, LA, LV, and LV_myo are shaded in pink, pale grey, light purple, and pale blue, respectively

5.2 MS-CMRSeg dataset

We used the MS-CMRSeg dataset to validate the generalizability of our TSP-UDANet and found that it achieved precise segmentation of the cardiac substructures, including the LV, LV_myo, and RV. The task of the MS-CMRSeg challenge was to train the segmentation network using labeled bSSFP images for the segmentation of LGE images. Thus, we validated the TSP-UDANet using bSSFP images with labels and LGE images without labels, as required by the MS-CMRSeg segmentation challenge.

As shown in Table 3, we achieved a mean Dice score of 87.5% and a mean HD of 8.2 mm on unsupervised segmentation of LGE images. Figure 7 is a visualization of the segmentation results on the MS-CMRSeg dataset. We can clearly see the changes in the cardiac slices and the segmentation results of the TSP-UDANet from the base to the apex.

Table 3 MS-CMRSeg segmentation results (Cardiac bSSFP \(\to\) Cardiac LGE)
Fig. 7

Visualization results of our method from a representative subject (Pat_40) with median Dice score in the MS-CMRSeg testing dataset. The leftmost images in each row are from the base of the heart, moving to the right are slices progressing towards the apex. The uppermost row is the raw LGE images, the middle row is the ground truth (GT) images, and the bottom row shows the predicted result (Pred). The LV, RV, and LV_myo are shown in pink, cyan and gray, respectively. Note that the sub-figures of the second and third rows are zoomed and cropped for improved clarity

5.3 M&Ms dataset

On the M&Ms dataset, the cardiac images were acquired from 4 different MR scanners, where the training images in the source domain were from Siemens and Philips, and the testing images in the target domain were from Siemens, Philips, General Electric, and Canon. In this experiment, the target domain labels were used for evaluation only, without being used in the training process.

As shown in Table 4, the TSP-UDANet achieved a mean Dice score of 85.2% and a mean HD of 13.2 mm. The Dice scores were 90.1% (LV), 79.5% (LV_myo), and 85.2% (RV), and the HD values were 11.8 mm (LV), 8.7 mm (LV_myo), and 19.1 mm (RV). Figure 8 shows the segmentation results on the M&Ms dataset. The sizes of the three target structures (LV, LV_myo, and RV) vary greatly from the base to the apex of the heart, but the TSP-UDANet can locate and segment them well.

Table 4 M&Ms segmentation results
Fig. 8

Visualization of the results of our method from a representative subject (Pat_E5J6L2) with median Dice in the M&Ms testing set. The uppermost row is the raw images, the middle row is the ground truth (GT), and the bottom row shows the predicted result (Pred). The images in the left panel were taken at end diastole (ED) and those on the right, at end systole (ES). In both panels, the images in each column are, from the left to right, the base, middle, and apex slice samples, respectively. The LV, RV, and LV_myo are shown in pink, cyan and gray, respectively. Note that the sub-figures of the second and third rows are zoomed and cropped for improved clarity

5.4 Comparison with other methods

To demonstrate the effectiveness of our proposed UDA method on multi-modality data, we compared our TSP-UDANet with other state-of-the-art (SOTA) unsupervised learning methods. For a fair comparison, we selected the methods developed on each of the three datasets (including the MMWHS dataset, MS-CMRSeg dataset, and M&Ms dataset) for comparison and have cited the results from the original papers.

5.4.1 MMWHS dataset

In Table 2, we compare the performance of our method and other SOTA methods on the MMWHS dataset, including AdaOutput [26], CycleGAN [13], PnP-AdaNet [16], CyCADA [36], and SIFA [34]. In both the MRI \(\to\) CT and the CT \(\to\) MRI adaptations, our method performed well as measured by the Dice and ASD. In more detail, the mean Dice score of our method was higher than that of CyCADA [36] by 12.7% (MRI \(\to\) CT) and 11.5% (CT \(\to\) MRI), and the mean ASD of our method was lower than that of CyCADA [36] by 1.5 mm (MRI \(\to\) CT) and 0.7 mm (CT \(\to\) MRI). When compared with SIFA [34], our method improved the mean Dice score by 3.0% (MRI \(\to\) CT) and 5.6% (CT \(\to\) MRI). Our method also showed the best performance in the ASD on LV and LV_myo (MRI \(\to\) CT), and on LA (CT \(\to\) MRI). These results demonstrate the effectiveness of the TSP-UDANet for cross-modality cardiac image segmentation. Table 2 also shows that the mean Dice score was not consistent with the mean ASD in the MRI \(\to\) CT and CT \(\to\) MRI adaptation tasks, because of the unsuccessful segmentation results in the slices at the base and apex of the heart [40]. Furthermore, the Dice score was sensitive to internal filling of the mask, while the ASD was sensitive to segmented edges [41].

In this study, the TSP-UDANet combines image-level, feature-level, and output-level alignments to segment cross-modality cardiac images, and it achieved the best mean Dice score in both the MRI \(\to\) CT and CT \(\to\) MRI adaptations. Among the other approaches we tested, PnP-AdaNet [16] aligns the feature distribution between the source and target domains only in the output-level feature space. AdaOutput [26] conducts adversarial training between the source and target domains only at the output level, so its segmentation performance was poor. SIFA [34] introduced feature-level and image-level alignments and achieved the second-best segmentation result for the LA in both directions of the bidirectional domain adaptation. ARL_GAN [37] employs image-level alignment and then uses the generated images to train a single-level generative adversarial segmentation network. In the MRI \(\to\) CT adaptation, the Dice scores of the AA and LV obtained by ARL_GAN were 11.1% and 17.9% lower than those of the TSP-UDANet. Furthermore, ARL_GAN [37] only operated in one direction (MRI \(\to\) CT).

5.4.2 MS-CMRSeg dataset

In Table 3, we compare the performance of our method and other SOTA methods on the MS-CMRSeg dataset, including Tao et al. [38], Vesal et al. [35], Wang et al. [21], Vesal et al. [20], and Chen et al. [39]. We achieved the best mean Dice score among these methods. Our mean Dice score was 0.2% higher than that of Chen et al. [39], who achieved 87.3%. Furthermore, we obtained the best Dice score for the RV, with a value of 90.6%, which was 3.1% higher than that of Chen et al. [39] and 2.8% higher than that of Vesal et al. [20]. The HD values of our method for the LV_myo and RV were 8.4 mm and 8.1 mm, both better than those of the other methods.

In our model, we use the CycleGAN to generate pseudo-LGE images as intermediate domain images and use a self-training strategy to bridge the SSN and StSN. Furthermore, our backbone, which combines a basic ResNet101 with the ASPP, works well on these image segmentation tasks. Vesal et al. [20] achieved the second-best segmentation results for the RV; they used entropy minimization and point-cloud shape adaptation to extract domain-invariant features from cross-modality cardiac images. Vesal et al. [35] and Wang et al. [21] achieved poorer segmentation results for the LV, LV_myo, and RV. Vesal et al. [35] trained a U-net [42] using labeled bSSFP images and fine-tuned the trained network using LGE images. Wang et al. [21] used a two-channel U-net [42], which only applied feature-level alignment to extract image features separately from the source and target domains. Compared with our method, Chen et al. [39] achieved a lower mean Dice score.

5.4.3 M&Ms dataset

In Table 4, we compare the performance of our method and other SOTA methods on the M&Ms dataset, including Li et al. [43], Carscadden et al. [44], Scannell et al. [45], and Full et al. [46]. We achieved the best HD for the LV_myo among all methods, namely 8.7 mm. Full et al. [46] used a supervised learning method based on nnU-Net [47] and achieved a mean Dice score of 88.3% and a mean HD of 11.0 mm. In the TSP-UDANet, the ASPP acts as a class predictor after the ResNet101 to fuse multi-scale cardiac features. Our mean Dice score was 4.0% higher than that obtained by Carscadden et al. [44], who only used ResNet101 as the segmentation network, whereas our segmentation sub-network (SSN) can be used as a general backbone for image segmentation. Li et al. [43] proposed a cascaded encoding–decoding network as the backbone and achieved a mean Dice score of 70.6%, showing that using a single network for both segmentation and style transfer is not a good strategy; our mean Dice score was 14.6% higher than that of Li et al. [43]. Scannell et al. [45] used a traditional GAN as the backbone, with a U-net as the generator. Compared with the traditional GAN of [45], we added a feature-level discriminator to learn more useful features, and our method improved the Dice scores for the LV and RV by 2.1% and 2.0%, respectively. As a supervised training approach, Full et al. [46] used an ensemble of five 2D and five 3D nnU-Nets and achieved better segmentation performance on the M&Ms cardiac dataset, using a variety of intensity-based data augmentation methods (i.e., noise addition, brightness modification, and contrast modification). These data augmentation techniques are specifically designed for the M&Ms dataset owing to its variety of imaging protocols and MRI vendors [33], whereas our method employs a domain adaptation strategy to achieve good cardiac segmentation that is less dependent on specific vendors.

6 Ablation analysis

We performed an ablation analysis to demonstrate the effect of introducing the intermediate domain and the multi-level generative adversarial approach for UDA cross-modality cardiac segmentation. In the MMWHS dataset, the discrepancy in appearance between the CT and MRI cardiac images is evident, which further illustrates the superiority of TSP-UDANet.

In the ablation analysis, we compared the segmentation results of SSN (w/o CycleGAN), Stage 1 (CycleGAN + SSN), and Stage 2 (CycleGAN + SSN + StSN) in the MMWHS dataset to verify the impact of the key components, as shown in Table 5. In SSN (w/o CycleGAN), we used the source and target domains to train the SSN for feature-level and output-level alignment. The trained SSN segmented the testing target domain images in the testing process. In Stage 1, we introduced image-level alignment using CycleGAN, where the generated fake target domain images acted as intermediate domain images.

Table 5 Results of the MMWHS (MRI \(\to\) CT) segmentation in the ablation experiment

We used the source and intermediate domains to train the SSN. During the testing process, the testing target domain images were segmented by the trained SSN. In Stage 2, the fake target domain images were used as the intermediate domain to connect the source and target domains. We used the intermediate and target domains to train the StSN, where the intermediate domain images were matched with pseudo-labels generated by the SSN. During the testing process, the testing images of the target domain were segmented with the trained StSN.

In Stage 1, we introduced the CycleGAN for image-level alignment and the SSN for aligning the feature distributions of the source and intermediate domains at the feature level and the output level. In the MRI \(\to\) CT adaptation, Stage 1 outperformed the SSN (w/o CycleGAN): its mean Dice score was 2.3% higher and its mean ASD 1.8 mm lower. In the CT \(\to\) MRI adaptation, the segmentation results of Stage 1 outperformed the SSN (w/o CycleGAN) for all cardiac substructures assessed; the mean Dice score of Stage 1 was 19.3% higher than that of the SSN (w/o CycleGAN), and the mean ASD was 11.8 mm lower. The increase in segmentation accuracy demonstrates that image-level feature alignment is effective for target domain image segmentation.

Unlike the SSN (w/o CycleGAN), Stage 2 used the fake target domain images as an intermediate domain to bridge the source and target domains. We took the output of the StSN as the final segmentation result: in the MRI \(\to\) CT adaptation, its mean Dice score was 3.3% higher and its mean ASD 4.3 mm lower than those of the SSN (w/o CycleGAN). In the CT \(\to\) MRI adaptation, the mean Dice score of the StSN was 20.3% higher, and the mean ASD was 12.0 mm lower, than those of the SSN (w/o CycleGAN). In Stage 2, the source domain label information was progressively transferred to the target domain through the intermediate domain, which reduced the domain shift and improved the segmentation performance on the target domain images. As shown in Fig. 9, the segmentation results produced by Stage 2 were closer to the ground truth than those of Stage 1.

Fig. 9

Visual comparison of the 2D slice results of the proposed method from a representative case with the median Dice for the MMWHS testing set in the ablation experiments. From left to right are the target domain image (column 1), the intermediate domain image generated by CycleGAN (column 2), the target domain ground truth (column 3), the segmentation results of SSN (w/o CycleGAN) and Stage 1 (columns 4–5), and the final segmentation results of Stage 2 (column 6). The cardiac substructures AA, LA, LV, and LV_myo are shaded in pink, pale grey, light purple, and pale blue, respectively. The first and second rows are MRI \(\to\) CT adaptation examples, with only AA, LV, and LV_myo in the first row, and LA, LV, and LV_myo in the second row. The third and fourth rows are CT \(\to\) MRI adaptation examples, with only AA, LV, and LV_myo in the third row, and LA, LV, and LV_myo in the fourth row

7 Discussion

In this paper, we focus on the UDA problem for cross-modality cardiac segmentation. We present a novel framework, TSP-UDANet, which effectively extracts and aligns domain-invariant cardiac features at multiple levels in cross-modality cardiac segmentation tasks. We conduct generative adversarial training at three levels, namely the image, feature, and output levels, to achieve cross-modality cardiac segmentation on the MMWHS dataset. Table 2 compares TSP-UDANet with other methods and shows that cooperative adversarial learning at the three levels outperforms alignment at the image level or feature level alone. With cooperative adversarial learning, the network extracts more semantic features and better aligns image features from different modalities. For example, PnP-AdaNet [16], CycleGAN [13], and ARL_GAN [37] do not combine feature-level and image-level alignment in a cooperative adversarial learning scheme. AdaOutput [26] uses output-level semantic space alignment, which is less effective than CyCADA [36], while CyCADA [36] and SIFA [34] use only image-level and feature-level alignment. Our method outperforms CyCADA [36] and SIFA [34], demonstrating the effectiveness of generative adversarial training with combined image-level, feature-level, and output-level alignment.
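As a rough illustration of cooperative adversarial alignment at the feature and output levels (image-level alignment being handled by CycleGAN), the following PyTorch-style sketch computes the generator and discriminator adversarial terms using two hypothetical patch discriminators, `feat_disc` and `out_disc`; the segmenter's `return_features=True` flag is likewise an assumption made for illustration and not the paper's actual interface.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # a least-squares GAN loss could be substituted

def _g_adv(disc, x):
    # Generator term: make the discriminator label target-domain inputs as "source" (1).
    pred = disc(x)
    return bce(pred, torch.ones_like(pred))

def _d_adv(disc, src_x, tgt_x):
    # Discriminator term: source inputs -> 1, target inputs -> 0.
    p_src, p_tgt = disc(src_x.detach()), disc(tgt_x.detach())
    return bce(p_src, torch.ones_like(p_src)) + bce(p_tgt, torch.zeros_like(p_tgt))

def alignment_losses(segmenter, feat_disc, out_disc, src_images, tgt_images):
    # The shared segmenter produces intermediate features and output predictions
    # for both domains.
    src_feat, src_out = segmenter(src_images, return_features=True)
    tgt_feat, tgt_out = segmenter(tgt_images, return_features=True)

    # Feature-level and output-level alignment terms for the generator (segmenter).
    g_loss = _g_adv(feat_disc, tgt_feat) + _g_adv(out_disc, tgt_out)

    # Corresponding discriminator losses, updated with a separate optimizer.
    d_loss = _d_adv(feat_disc, src_feat, tgt_feat) + _d_adv(out_disc, src_out, tgt_out)
    return g_loss, d_loss
```

In practice, the segmenter and the discriminators are updated alternately, and the adversarial terms are weighted against the supervised segmentation loss on the labeled domain.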

We introduce the intermediate domain to bridge the source and target domains and reduce the domain shift between them for two reasons: the limited performance of adversarial training in the SSN (w/o CycleGAN), and the discrepancy in image appearance between the source and target domains. As the ablation experiment shows, the appearance discrepancies between images of different modalities substantially affect cardiac segmentation results. We attempted to train the SSN (w/o CycleGAN) adversarially on the source and target domains, but the segmentation results were unsatisfactory, as shown in Table 5. The main reason is the large discrepancy in intensity distribution between the source and target domain images. Images with a similar style are more amenable to UDA segmentation, so we used CycleGAN to generate fake target domain images as the intermediate domain. The intermediate domain divides the label information transfer between the source and target domains into two steps, handled by the SSN and the StSN. The SSN segments the intermediate domain images, and its segmentation results serve as pseudo-labels for training the StSN; the trained StSN then segments the target domain images. This achieves two-stage progressive UDA cross-modality cardiac segmentation. Figure 10 shows box plots of the mean Dice scores on the MMWHS dataset, where Fig. 10A is the MRI \(\to\) CT adaptation and Fig. 10B is the CT \(\to\) MRI adaptation. The mean Dice score increases after the intermediate domain is introduced.
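Once the CycleGAN is trained, constructing the intermediate domain is straightforward; a minimal sketch is given below, assuming a hypothetical source-to-target generator `g_src2tgt` and a labeled `source_loader`. The translated images are then segmented by the SSN to obtain the pseudo-labels used for StSN training, as described above.

```python
import torch

@torch.no_grad()
def build_intermediate_domain(g_src2tgt, source_loader, device="cuda"):
    """Translate labeled source images into fake target-style (intermediate) images."""
    g_src2tgt.eval()
    fake_target_images = []
    for images, _labels in source_loader:      # source labels are not used here;
        fake = g_src2tgt(images.to(device))    # the SSN later produces pseudo-labels
        fake_target_images.append(fake.cpu())  # for these translated images
    return torch.cat(fake_target_images, dim=0)
```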

Fig. 10
figure 10

Segmentation results of each ablation experiment using the proposed method, showing the change in Dice score at each stage of the progressive unsupervised cross-modality adaptation segmentation process. The vertical coordinate represents the mean Dice score of all segmented objects from the MMWHS testing set. A MRI \(\to\) CT adaptation segmentation, B CT \(\to\) MRI adaptation segmentation. The green triangles represent the mean values for each stage; the upper and lower box boundaries indicate the interquartile range, the middle horizontal lines are the median values, and the whiskers indicate the full range of the data

To verify the generalizability of the TSP-UDANet, we conducted experiments on the MMWHS, MS-CMRSeg, and M&Ms datasets. On the MMWHS dataset, the mean Dice scores obtained by the TSP-UDANet were 3% and 5.6% higher than those of SIFA [34] in the MRI \(\to\) CT and CT \(\to\) MRI adaptations, respectively. On the MS-CMRSeg dataset, TSP-UDANet achieved the best mean Dice score, 0.2% higher than the result of Chen et al. [39], as well as the best HDs for the LV_myo and RV. On the M&Ms dataset, TSP-UDANet achieved the best HD for the LV_myo. Taken together, these results show that the TSP-UDANet is a generalizable method for UDA cross-modality segmentation, mainly because it effectively reduces the domain shift by using style-transferred images as the intermediate domain. Moreover, the segmentation networks employ multiple discriminators for adversarial training to extract domain-invariant features, which enables the generator to better align the feature distributions of different domains at multiple levels. The TSP-UDANet uses the classical ResNet-101 and ASPP as the segmentation backbone, which can serve as a general network configuration for image segmentation tasks: ResNet-101 provides sufficient depth for feature extraction, its residual connections effectively mitigate vanishing gradients, and the ASPP module captures multi-scale contextual features through its parallel receptive fields with different dilation rates.
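As an illustration only, a ResNet-101 encoder combined with an ASPP head corresponds to the DeepLabv3 architecture available in torchvision; the snippet below is a minimal sketch of such a backbone and is not the paper's exact configuration (the number of classes, input channels, and slice size are assumptions).

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet101

# Assumed settings for illustration: 5 output classes
# (background + 4 cardiac substructures) and 3-channel 256x256 slices.
model = deeplabv3_resnet101(num_classes=5)

dummy_batch = torch.randn(2, 3, 256, 256)
with torch.no_grad():
    logits = model(dummy_batch)["out"]   # shape: (2, 5, 256, 256)
print(logits.shape)
```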

Our TSP-UDANet achieves good performance on three cross-modality cardiac datasets, but some limitations remain. Figure 11 shows the visualization results of the TSP-UDANet on the MS-CMRSeg and M&Ms datasets. The segmentation results for the LV_myo and LV in the apical region are weaker than those in the basal region. Some substructures near the cardiac apex are very small and occupy few pixels in each image slice, which makes the extraction of meaningful 2D anatomical features very challenging. In the future, we will explore introducing 3D anatomical information to tackle the difficulty of segmenting small objects.

Fig. 11
figure 11

Visualization of two successful subjects (Pat_6 and Pat_A2H5K9, upper and lower left-hand panels) and two unsuccessful subjects (Pat_8 and Pat_A8C5E9, right-hand panels); here, successful and unsuccessful refer to the subjects with the highest and lowest Dice scores in the MS-CMRSeg and M&Ms testing sets. The upper three rows show the segmentation results from the MS-CMRSeg dataset, and the lower three rows show the segmentation results from the M&Ms dataset. The left three columns show successful segmentation results, and the right three columns show unsuccessful segmentation results. The LV, RV, and LV_myo are shown in pink, cyan, and grey, respectively. Note that the sub-figures in the second, third, fifth, and sixth rows are zoomed and cropped for improved clarity

8 Conclusion

In this paper, we have proposed a two-stage progressive UDA network for segmenting multi-modality cardiac images. The network is trained in multi-level feature spaces at the image, feature, and output levels, and an intermediate domain is introduced to link the source and target domains. An improved self-training process is used in Stage 2 to progressively reduce the domain shift between domains and to extract domain-invariant features. We have validated the method on unpaired cardiac MRI and CT images, on LGE and bSSFP images, and on CMR images acquired with devices from multiple vendors. Compared with existing methods, our approach achieves good segmentation performance across a variety of source images and shows good generalizability, making it possible to apply the UDA network to the segmentation of other medical images. In the future, to further demonstrate the generalizability and robustness of our method, we will explore its application beyond cardiac segmentation in multimodal images, for instance to the segmentation of solid tumors.