1 Introduction

Despite the success of deep neural networks (DNNs) for medical imaging applications [4, 11, 21, 26, 27], learning a task-specific model that generalizes to various medical datasets remains a challenge. This is due to differences in feature distributions between datasets, known as domain shift [29]. In medical imaging, domain shift can result from different imaging modalities (e.g., magnetic resonance imaging and ultrasound) or different image acquisition devices. In this paper, we focus on model generalization between different image acquisition devices, transferring knowledge from a source device domain to a target device domain.

Fine-tuning DNNs on labeled data from the target domain is a possible solution but is often infeasible because it requires sufficient manual annotations. More importantly, fine-tuned models remain domain-specific because performance gains do not propagate back to the source domain. Deep domain adaptation has been widely studied for tackling the problem of domain shift by extracting domain-invariant features [13, 15, 22]. Such approaches allow porting DNNs to the target domain without extensive annotation while preserving performance in both source and target domains. Unsupervised domain adaptation aims at transferring knowledge from a labeled source domain to an unlabeled target domain where both domains share a common label space [13, 20, 25]. This setting is important for real-world medical imaging scenarios, where data annotation is laborious, time consuming and requires rare expertise.

In this work, we propose distance metric guided feature alignment (MetFA) to learn domain-invariant latent representations for model generalization in an unsupervised domain adaptation setting. We evaluate the proposed method on a challenging medical application, the classification of standardized diagnostic fetal ultrasound (US) view planes during prenatal screening. In many countries, fetal US is clinical routine for early detection of pathological development and informs subsequent decisions about treatment and delivery options [31]. However, domain shift caused by different acquisition devices and prohibitively expensive data annotation restricts the generalization of vanilla DNN classifiers. We show that MetFA enables unsupervised cross-device classification in fetal US.

Contribution. The main contributions of this paper are: (1) We propose distance metric guided feature alignment (MetFA), which learns a shared latent representation space between a labeled source domain and an unlabeled target domain; (2) we develop a framework that jointly learns class distribution alignment and MetFA, which further transfers semantic knowledge from a source domain to a target domain for model generalization; (3) we apply the proposed method to cross-device anatomical classification in fetal US, an important medical imaging application that inherently requires knowledge transfer between device domains to facilitate the use of DNNs for large-scale population screening (code available at https://github.com/qingjie99/MetFA).

Related Work. Unsupervised domain adaptation (UDA) mainly focuses on feature distribution alignment. Most UDA approaches explore an appropriate metric to measure the distance of feature distributions between two domains and subsequently train DNNs to minimize this distance  [33]. Previous work such as Maximum Mean Discrepancy  [22, 35] utilizes kernels to measure the discrepancy between representations. Recent research explores domain adversarial training, where a domain discriminator is used to estimate this discrepancy while a feature extractor tries to deceive the discriminator by learning domain-invariant representations  [2, 23, 24]. UDA has been applied to various medical imaging applications such as anatomical segmentation  [3, 6, 9, 17, 28] and diagnostic classification  [1, 16]. Most of these works utilize domain adversarial training for feature alignment. In contrast to these works, we explicitly manipulate the latent space to learn discriminative features. Our work is inspired by MiniMax Entropy (MME) proposed in  [30], which estimates domain-invariant prototypes and clusters target domain features around these prototypes in a semi-supervised domain adaptation setting. In contrast to  [30], our method (1) embeds extracted features into a shared latent space with a fixed prior distribution before prototypes are estimated, and (2) simultaneously reduces intra-class variance while increasing inter-class variance across domains via cross-domain metric learning.

Metric learning aims at learning embedded representations that cluster similar samples while separating dissimilar samples in latent space [37]. Previous metric learning methods measure feature similarity by learning a linear Mahalanobis distance [19, 36]. More recent works focus on deep metric learning, which learns non-linear relationships of data using DNNs with different losses, such as contrastive loss [8, 14], triplet loss [5, 36] and N-pair loss [32]. Deep metric learning has shown great benefits for domain adaptation. For example, Sohn et al. [33] proposed a deep metric learning method for unsupervised domain adaptation with disjoint label spaces. Dou et al. [10] introduced deep metric learning for domain generalization. Most existing metric-learning-based domain adaptation methods only apply metric learning to the labeled source domain and neglect the relationship between intra-class samples. In contrast to these methods, we introduce cross-domain metric learning to (1) jointly measure the similarity between samples in a labeled source domain and an unlabeled target domain and (2) learn metrics between different groups of intra-class samples.

2 Method

We are given images and the corresponding labels from a source domain \(\mathcal {D}_S=\{\mathcal {X}_S, \mathcal {Y}_S\}\) as well as unlabeled images from a target domain \(\mathcal {D}_T=\{\mathcal {X}_T\}\). Both domains share a common label space and contain M classes. Our goal is to classify unlabeled target domain data by aligning latent features of both domains. The proposed method contains three main parts (see Fig. 1): (1) supervised classification on the labeled source domain, (2) distance metric guided feature alignment (MetFA) to transfer knowledge from the source domain to the target domain, and (3) class distribution alignment to preserve source domain class relationships in the target domain.

Classification. Classification in the unlabeled target domain is guided by the labeled source domain by sharing the entire network across domains, including an encoder E, a Gaussian embedding G and a classifier C. The cross-entropy loss is

$$\begin{aligned} \mathcal {L}_{ce}=-\mathbb {E}_{\{\mathbf {x},y\}\thicksim \{\mathcal {X}_S, \mathcal {Y}_S\}}\sum _{t=1}^{M}\mathbbm {1}[y=t]\log \big (C(G(E(\mathbf {x})))_t\big ). \end{aligned}$$
(1)

Classifier C simultaneously predicts class distributions for the target domain as \(P_T(\hat{y}|\mathbf {x})|_{\mathbf {x} \in \mathcal {X}_T}\) (abbreviated as \(P_T\)). This prediction will be utilized in MetFA.
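As a concrete illustration of this shared forward path, the following PyTorch-style sketch computes the source-domain cross-entropy of Eq. 1 and the target-domain predictions \(P_T\). It is a minimal sketch, not the released implementation; all module and function names (`encoder`, `gauss_embed`, `classifier`, `classification_step`) are illustrative assumptions.

```python
import torch.nn.functional as F

def classification_step(encoder, gauss_embed, classifier, x_src, y_src, x_tgt):
    # Source domain: supervised cross-entropy on C(G(E(x))), Eq. (1).
    z_src, mu_s, logvar_s = gauss_embed(encoder(x_src))      # sampled latent Z_S
    loss_ce = F.cross_entropy(classifier(z_src), y_src)

    # Target domain: the same (shared) networks predict P_T(y_hat | x),
    # which MetFA later reuses as pseudo-labels.
    z_tgt, mu_t, logvar_t = gauss_embed(encoder(x_tgt))
    p_tgt = F.softmax(classifier(z_tgt), dim=1)
    return loss_ce, p_tgt, (z_src, mu_s, logvar_s), (z_tgt, mu_t, logvar_t)
```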

Fig. 1.

Left: An overview of the proposed method. Our method consists of (1) supervised classification on the labeled source domain (optimize \(\mathcal {L}_{ce}\)), (2) distance metric guided feature alignment (MetFA), which aligns features between both domains (optimize \(\mathcal {L}_{prior}\), \(\mathcal {L}_{H}\), \(\mathcal {L}_{M}\), \(\mathcal {L}_{rec}\)), and (3) class distribution alignment, which preserves class relationships in both domains (optimize \(\mathcal {L}_{KL}\)). Right: Schematic illustration of \(\mathcal {L}_H\) and \(\mathcal {L}_{M}\) optimization.

MetFA: Distance Metric Guided Feature Alignment. Feature embedding is used to constrain features from both domains to lie in a shared latent space. In this latent space, class representations (prototypes) are estimated to extract domain-invariant features in each class, while cross-domain metric learning is introduced to further separate clusters of different classes in both domains.

Feature embedding encourages features (\(F_S\), \(F_T\)) extracted by an encoder E to share the same fixed prior distribution in a latent space \(\mathcal {Z}\), which is similar to distribution matching in a variational autoencoder  [18]. In our method, a Gaussian embedding G is built to model \(F_S\) and \(F_T\) by a standard Gaussian distribution \(\mathcal {N}(0,I)\). Specifically, \(Z_i\sim q(\mathcal {Z}|\mathcal {X}_i)|i\in \{S,T\}\) is sampled from \(\mathcal {N}(\mu _i, \varSigma _i)|i\in \{S,T\}\) with the reparameterization trick  [18], where \(\{\mu _i, \varSigma _i\}=G(F_i)|i\in \{S,T\}\). The prior alignment loss is the Kullback-Leibler (KL) divergence between \(\mathcal {N}(0,I)\) and \(\mathcal {N}(\mu _i, \varSigma _i)|i\in \{S,T\}\), which is

$$\begin{aligned} \mathcal {L}_{prior}=D_{KL}(\mathcal {N}(\mu _S, \varSigma _S)\parallel \mathcal {N}(0,I))+D_{KL}(\mathcal {N}(\mu _T, \varSigma _T)\parallel \mathcal {N}(0,I)). \end{aligned}$$
(2)

In order to guarantee that embedded features are representative of the extracted features, we add a feature reconstruction loss \(\mathcal {L}_{rec}\) as a regularizer:

$$\begin{aligned} \mathcal {L}_{rec}=\Vert F_S-Z_S \Vert _2^2 + \Vert F_T-Z_T \Vert _2^2. \end{aligned}$$
(3)
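A minimal sketch of this feature embedding is given below, assuming a diagonal Gaussian \(q(\mathcal{Z}|\mathcal{X})=\mathcal{N}(\mu ,\mathrm{diag}(\sigma ^2))\) so that the KL term of Eq. 2 has a closed form; the class names and the choice of two linear layers for G are illustrative, not the authors' architecture. Because Eq. 3 compares \(F\) and \(Z\) directly, the latent dimension matches the feature dimension here.

```python
import torch
import torch.nn as nn

class GaussianEmbedding(nn.Module):
    """Illustrative Gaussian embedding G: maps features F to (mu, log-variance) and samples Z."""
    def __init__(self, feat_dim):
        super().__init__()
        self.mu = nn.Linear(feat_dim, feat_dim)       # latent dim == feature dim, so Eq. (3) applies
        self.logvar = nn.Linear(feat_dim, feat_dim)

    def forward(self, f):
        mu, logvar = self.mu(f), self.logvar(f)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick [18]
        return z, mu, logvar

def kl_to_standard_normal(mu, logvar):
    # Closed-form D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), averaged over the batch.
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=1).mean()

def embedding_losses(f_src, z_src, mu_s, lv_s, f_tgt, z_tgt, mu_t, lv_t):
    loss_prior = kl_to_standard_normal(mu_s, lv_s) + kl_to_standard_normal(mu_t, lv_t)       # Eq. (2)
    loss_rec = ((f_src - z_src) ** 2).sum(1).mean() + ((f_tgt - z_tgt) ** 2).sum(1).mean()   # Eq. (3)
    return loss_prior, loss_rec
```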

Feature embedding enforces distribution matching between the two domains. In the absence of target domain labels, this matching is essential for subsequent feature alignment. However, feature embedding alone is unlikely to ensure that features are domain-invariant and discriminative between different classes. The remaining components of MetFA tackle this problem.

Domain-invariant feature extraction is motivated by Minimax Entropy (MME), proposed by Saito et al.  [30]. Using unlabeled data in the target domain, MME learns a single domain-invariant prototype (a representation point) for each class in both domains and clusters target domain samples around these prototypes (see Fig. 1 upper right). We implement prototypes as the weights \(\mathbf {W}\) of the last dense layer in the classifier C.

Training MME consists of two alternating steps. The first step moves the prototypes from the source domain towards the target domain by maximizing the similarity between \(\mathbf {W}\) and its input features \(H_T\). This similarity maximization is equivalent to maximizing the entropy of \(\mathcal {X}_T\) with respect to \(\mathbf {W}\), using

$$\begin{aligned} \mathcal {L}_{H}=-\mathbb {E}_{\mathbf {x}\sim \mathcal {X}_T}\sum _{i=1}^{M}p_T(\hat{y}=i|\mathbf {x})\log p_T(\hat{y}=i|\mathbf {x}), \,\, p_T\in P_T=\sigma (\frac{1}{\tau _0}\frac{\mathbf {W}^T H_T}{\Vert H_T\Vert }), \end{aligned}$$
(4)

where \(\sigma \) is a softmax function and \(\tau _0\) is a temperature parameter. The second step is to assign target domain features to the domain-invariant prototypes. To achieve this, \(\mathcal {L}_H\) is minimized with respect to E, G and \(C\setminus \mathbf {W}\) (C without \(\mathbf {W}\)).
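The entropy term can be computed as in the following sketch, which treats the prototype matrix \(\mathbf{W}\) as the weights of a bias-free last dense layer; the function name and the note on how to realize the minimax game are illustrative assumptions rather than the released code.

```python
import torch
import torch.nn.functional as F

def entropy_loss(h_tgt, W, tau0=0.05):
    # p_T = softmax( (1/tau0) * W^T h / ||h|| ): similarity logits with temperature, Eq. (4).
    h_norm = F.normalize(h_tgt, dim=1)              # h / ||h||
    logits = h_norm @ W.t() / tau0
    p = F.softmax(logits, dim=1)
    return -(p * torch.log(p + 1e-8)).sum(dim=1).mean()

# Minimax game: L_H is maximized w.r.t. W (prototypes move towards the target domain) and
# minimized w.r.t. E, G and C\W (target features cluster around the prototypes). In practice
# this can be realized with two optimizers (see the training-step sketch in the Optimization
# paragraph) or with a gradient-reversal layer.
```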

Cross-domain metric learning is proposed to maximize the margin between different classes across domains. We define latent features of \(\mathcal {X}_S\) and \(\mathcal {X}_T\) (which are \(Z_S\) and \(Z_T\)) respectively as support samples and query samples. In cross-domain metric learning the distance between query and support samples is minimized when they are from the same class and simultaneously maximized when they are from different classes (see Fig. 1 lower right). The metric loss is

$$\begin{aligned} \mathcal {L}_{M}=\frac{1}{N}\sum _{i=1}^{M}\sum _{j=1}^{c_i^T}\log (1+\sum _{k\ne i}^{{k\in [1,M]}}e^{d_j^i-d_j^k})=-\frac{1}{N}\sum _{i=1}^{M}\sum _{j=1}^{c_i^T}\log \frac{e^{d_j^i}}{e^{d_j^i}+\sum _{k\ne i}^{k\in [1,M]}e^{d_j^k}}, \end{aligned}$$
(5)

where N is the total number of query samples and \(c_i^T\) is the number of query samples from class i. Note that the labels of the query samples are given by the predictions \(P_T\) in Eq. 4. \(d_j^i\) is the distance between a query sample \(q_j^i\) and a support sample \(s_t^i\) of the same class, and \(d_j^k\) is the distance between \(q_j^i\) and a support sample \(s_t^k\) of a different class. Considering the relationship between intra-class samples and using a hard mining strategy [7], we define \(d_j^i\) and \(d_j^k\) as

$$\begin{aligned} \begin{aligned}&d_j^i=\max _t d(q_j^i,s_t^i), \; t \in [1, c_i^S], \, q_j^i \sim Z_T, \, s_t^i \sim Z_S, \\&d_j^k=\min _t d(q_j^i,s_t^k), \; t \in [1, c_k^S], \, q_j^i \sim Z_T, \, s_t^k \sim Z_S, \end{aligned} \end{aligned}$$
(6)

where \(c_i^S\) and \(c_k^S\) are the numbers of support samples from class i and class k, respectively. We use the squared Euclidean distance for \(d(\cdot ,\cdot )\) in Eq. 6.
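A possible implementation of Eqs. 5-6 is sketched below. It uses hard pseudo-labels (the argmax of \(P_T\)) for the query samples and hard mining over the source-domain support samples; all names are illustrative and the loop over queries is written for clarity rather than speed.

```python
import torch

def cross_domain_metric_loss(z_tgt, pseudo_y_tgt, z_src, y_src, num_classes):
    # Pairwise squared Euclidean distances between queries (Z_T, rows) and supports (Z_S, columns).
    dist = torch.cdist(z_tgt, z_src, p=2) ** 2
    losses = []
    for q in range(z_tgt.size(0)):
        i = pseudo_y_tgt[q]
        same = dist[q][y_src == i]
        if same.numel() == 0:
            continue
        d_pos = same.max()                      # hardest same-class support, d_j^i in Eq. (6)
        d_negs = []
        for k in range(num_classes):
            if k == i:
                continue
            other = dist[q][y_src == k]
            if other.numel() > 0:
                d_negs.append(other.min())      # closest other-class support, d_j^k in Eq. (6)
        if d_negs:
            d_negs = torch.stack(d_negs)
            losses.append(torch.log1p(torch.exp(d_pos - d_negs).sum()))   # one query term of Eq. (5)
    return torch.stack(losses).mean() if losses else z_tgt.new_zeros(())
```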

Class Distribution Alignment. Apart from structuring the feature space for better class predictions, we want to further transfer semantic knowledge, i.e., preserve class relationships between domains. Class distribution alignment has been used to preserve class relationships between multiple labeled source domains in domain generalization [10]. In our method, we align class distributions between a labeled source domain and an unlabeled target domain. We utilize the symmetrized KL-divergence to define the class distribution alignment loss

$$\begin{aligned} \begin{aligned}&\qquad \qquad \; \mathcal {L}_{KL} = \frac{1}{M} \sum _{i=1}^M \varLambda [D_{KL}(\bar{p}_i^S \parallel \bar{p}_i^T) + D_{KL}(\bar{p}_i^T \parallel \bar{p}_i^S)],\\ \bar{p}_i^S&=\sigma (\frac{1}{\tau _1}\frac{1}{c_i^S}\sum _{y=i} g_{\mathbf {x}}^S)|_{(\mathbf {x},y) \sim \{\mathcal {X}_S, \mathcal {Y}_S\}}, \; \bar{p}_i^T=\sigma (\frac{1}{\tau _1}\frac{1}{c_i^T}\sum _{\hat{y}=i} g_{\mathbf {x}}^T)|_{(\mathbf {x},\hat{y}) \sim \{\mathcal {X}_T, P_T(\mathbf {x})\}}. \end{aligned} \end{aligned}$$
(7)

Here, \(\varLambda =[c_1^T, c_2^T,...,c_M^T]\) contains the number of target domain samples predicted for each class. \(\bar{p}_i^S\) and \(\bar{p}_i^T\) are the \(i^{th}\) class distributions in source and target domain. \(g_{\mathbf {x}}^S\) and \(g_{\mathbf {x}}^T\) are the pre-softmax activations from classifier C and \(\tau _1\) is a temperature parameter.
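A small sketch of Eq. 7 follows: per-class mean pre-softmax activations are softened with temperature \(\tau _1\) and compared across domains with a symmetrized KL divergence, weighted by the per-class target counts \(\varLambda \). Hard pseudo-labels are again assumed for the target domain, and all names are illustrative.

```python
import torch.nn.functional as F

def class_distribution_alignment(g_src, y_src, g_tgt, pseudo_y_tgt, num_classes, tau1=2.0):
    loss, eps = 0.0, 1e-8
    for i in range(num_classes):
        src_i, tgt_i = g_src[y_src == i], g_tgt[pseudo_y_tgt == i]
        if src_i.numel() == 0 or tgt_i.numel() == 0:
            continue                                        # skip classes absent from the batch
        p_s = F.softmax(src_i.mean(0) / tau1, dim=0)        # softened mean class-i logits (source)
        p_t = F.softmax(tgt_i.mean(0) / tau1, dim=0)        # softened mean class-i logits (target)
        sym_kl = ((p_s * ((p_s + eps) / (p_t + eps)).log()).sum()
                  + (p_t * ((p_t + eps) / (p_s + eps)).log()).sum())
        loss = loss + tgt_i.size(0) * sym_kl                # weighted by c_i^T (Lambda in Eq. 7)
    return loss / num_classes
```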

Optimization. The overall objective function of the proposed method is:

$$\begin{aligned} \begin{aligned}&\;\;\min _{E,G,C\setminus \mathbf {W}}\{\mathcal {L}+\lambda _6\mathcal {L}_H\}, \quad \min _{\mathbf {W}}\{\mathcal {L}-\lambda _6\mathcal {L}_H\}, \\ \text {with} \quad&\mathcal {L}=\lambda _1\mathcal {L}_{ce}+\lambda _2\mathcal {L}_{prior}+\lambda _3\mathcal {L}_{M}+\lambda _4\mathcal {L}_{rec}+\lambda _5\mathcal {L}_{KL}. \end{aligned} \end{aligned}$$
(8)

Here \(\lambda _1\) to \(\lambda _6\) are hyper-parameters chosen experimentally depending on the application. Our model is end-to-end trainable, with \(\mathbf {W}\) and the rest of the network trained in an alternating fashion according to Eq. 8. We apply L2 regularization (\(\text {scale}=10^{-5}\)) to all weights during training to prevent over-fitting and apply random image flipping as data augmentation. Our model is trained on an Nvidia Titan X GPU.
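One possible realization of this alternating optimization is sketched below with two SGD-with-momentum optimizers, one for the prototype weights \(\mathbf{W}\) and one for the remaining parameters; the weight decay of \(10^{-5}\) mirrors the L2 regularization above, while the learning rate and all names are assumptions.

```python
import torch

def build_optimizers(encoder, gauss_embed, classifier, prototype_W, lr=1e-3):
    # Split parameters: everything except W goes to opt_main, W alone to opt_proto.
    main_params = [p for p in (list(encoder.parameters()) + list(gauss_embed.parameters())
                               + list(classifier.parameters())) if p is not prototype_W]
    opt_main = torch.optim.SGD(main_params, lr=lr, momentum=0.9, weight_decay=1e-5)
    opt_proto = torch.optim.SGD([prototype_W], lr=lr, momentum=0.9, weight_decay=1e-5)
    return opt_main, opt_proto

def train_step(losses, lambdas, opt_main, opt_proto):
    # losses: dict of scalar tensors 'ce', 'prior', 'metric', 'rec', 'kl', 'entropy'.
    l1, l2, l3, l4, l5, l6 = lambdas
    loss_common = (l1 * losses['ce'] + l2 * losses['prior'] + l3 * losses['metric']
                   + l4 * losses['rec'] + l5 * losses['kl'])
    # Step 1: E, G and C\W minimize  L + lambda_6 * L_H  (Eq. 8, left).
    opt_main.zero_grad(); opt_proto.zero_grad()
    (loss_common + l6 * losses['entropy']).backward(retain_graph=True)
    opt_main.step()
    # Step 2: W minimizes  L - lambda_6 * L_H, i.e. maximizes the entropy (Eq. 8, right).
    opt_proto.zero_grad()
    (loss_common - l6 * losses['entropy']).backward()
    opt_proto.step()
```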

3 Evaluation and Results

We evaluate the proposed method on 2D fetal US images acquired during routine prenatal screening. The US data is obtained with different imaging devices: device A (GE Voluson E8) provides \({\sim }12k\) images and device B (Philips EPIQ V7 G) provides \({\sim }5.5k\) unpaired images. In both datasets, six anatomical standard planes have been selected by expert sonographers: Four Chamber View (4CH), Abdominal, Left Ventricular Outflow Tract (LVOT), Right Ventricular Outflow Tract (RVOT), Femur and Lips. We evaluate our method in two scenarios where device A is the source domain while device B is the target domain, and vice versa. During training, the source domain is fully labeled and the target domain is unlabeled. In both scenarios, hyper-parameters \(\lambda _1\) to \(\lambda _6\) in Eq. 8 are \(\lambda _1=10, \lambda _2=10^{-2}, \lambda _3=10^{-1}, \lambda _4=1, \lambda _5=10, \lambda _6=5\). \(\tau _0\) in Eq. 4 is 0.05 (as in [30]) and \(\tau _1\) in Eq. 7 is 2 (as in [10]). We use Stochastic Gradient Descent (SGD) with momentum to update our model.

Comparison Methods. As a baseline, we evaluate a VGG network consisting of the encoder E and the classifier C from the proposed method. This baseline is trained on data only from the source domain (Source only) to demonstrate the existence of domain shift. We compare the proposed method with state-of-the-art domain adaptation algorithms, including domain-adversarial training of neural networks (DANN) [13], adversarial discriminative domain adaptation (ADDA) [34] and semi-supervised domain adaptation via minimax entropy (MME) [30]. Note that for a fair comparison, we use the MME model in an unsupervised learning paradigm. Additionally, given target domain labels, we show fine-tuned and fully-supervised classification on the target domain as references. Fine-tuned classification is pre-trained on the labeled source domain and fine-tuned on the labeled target domain; this model is evaluated on both source and target domains. Fully-supervised classification is trained from scratch on the labeled target domain and evaluated on the target domain.

Ablation Study. We further explore the effectiveness of different components in the proposed method by removing different loss components: UDA-MetFA-I: only contains \(\mathcal {L}_{ce}\), \(\mathcal {L}_{prior}\) and \(\mathcal {L}_{H}\); UDA-MetFA-II: UDA-MetFA-I plus \(\mathcal {L}_{M}\); UDA-MetFA-III: UDA-MetFA-II plus \(\mathcal {L}_{KL}\); UDA-MetFA-IV: UDA-MetFA-II plus \(\mathcal {L}_{rec}\); UDA-MetFA-V: contains all components.

Table 1. Comparison of Source only, the state-of-the-art and the ablation study (UDA-MetFA-I to V) for fetal US anatomical classification with device A as source domain and device B as target domain. Fine-tuned and Fully-supervised are reference results given target domain labels. Best results in bold.

Results. Table 1 shows the experimental results of the baselines and the ablation study where device A is the source domain and device B is the target domain. We observe that the UDA-MetFA-V model outperforms all other baselines. In the target domain, UDA-MetFA-V achieves an average F1-score of 0.7713, while the highest average F1-score among the other baselines is 0.4398 (MME [30]). UDA-MetFA-I greatly outperforms MME [30] in the target domain, demonstrating the importance of feature embedding in the proposed method. UDA-MetFA-V performs better than the other ablation models in the target domain, illustrating the effectiveness of all components of the proposed method. Furthermore, the source-domain results of Fine-tuned and Source only indicate that the fine-tuned model remains domain-specific, whereas the proposed method (UDA-MetFA-V) enables model generalization with improved classification performance in both source and target domains.

We further compare MME (best baseline in Table 1) with the proposed method (UDA-MetFA-V) in confusion matrices and t-SNE plots. Figure 2(a) demonstrates that our method extracts more discriminative features for better classification, especially on easily confused anatomies (e.g., LVOT vs. RVOT). Figure 2(b) shows that for UDA-MetFA-V, target features are closer to source features while features of different classes are more separated. This indicates that the proposed MetFA benefits the extraction of discriminative and domain-invariant features.

Table 2 shows the results of comparison methods and the proposed method (UDA-MetFA-V) on switched domains, where device B is the source domain and device A is the target domain. We observe that UDA-MetFA-V outperforms the state-of-the-art in both source and target domains, demonstrating that our method is capable of successfully transferring knowledge from source domain to target domain as well as improving model generalization.

Fig. 2.

Comparison of MME  [30] and UDA-MetFA-V on (a) confusion matrix of target domain (device B) and (b) t-SNE plot of extracted test data features.

Table 2. Comparison of baselines and UDA-MetFA-V with device B as source domain and device A as target domain. Best results in bold.

Discussion. Domain adaptation is commonly used to transfer a performant, task-specific model from a source domain to a target domain. However, the learning ability of a DNN in the source domain can limit its achievable performance in the target domain. This may explain the lower classification performance of the proposed method compared with the fully-supervised method in the target domain in Table 2. Current UDA methods rarely discuss the performance of DNNs in the source domain. From Table 2, we observe that tracking the source domain performance can potentially be used for data selection during model improvement in the source domain. A limitation of our method is the empirical hyper-parameter selection. For a specific application, we adjust hyper-parameters according to their importance and select the best combination with grid search. Meta-learning [12] will be explored in future work to allow automatic hyper-parameter selection.

4 Conclusion

In this paper, we discuss the problem of model generalization for unsupervised domain adaptation. We propose distance metric guided feature alignment (MetFA) to extract discriminative and domain-invariant features across domains. MetFA explicitly structures latent representations without using domain adversarial training. Our model integrates class distribution alignment for transferring semantic knowledge from a source domain to a target domain. Experiments on cross-device fetal US screening images demonstrate the effectiveness and practical applicability of our method compared with the state-of-the-art.