
1 Introduction

Great advancements have been achieved in machine learning, particularly with supervised learning algorithms, which now reach human-level performance on applications that a few years ago would have been considered extremely challenging. However, a common assumption in machine learning is that training and test data are drawn from the same probability distribution [19]. Methods are trained on data from a source domain \(D_S=\left\{ \mathcal {X}_S,P(X_S)\right\} \), where \(\mathcal {X}_S\) is a feature space, \(X_S=\left\{ x_{S1}, \ldots ,x_{Sn}\right\} , x_{Si} \in \mathcal {X}_S\) are the data and \(P(X_S)\) is the marginal distribution that their features follow. In an image segmentation problem, for example, \(X_S\) could be samples (voxels or patches) from multi-spectral MR scans, \(\mathcal {X}_S\) is the feature space defined by the available MR sequences and \(P(X_S)\) is the distribution of intensities in the sequences. During the development of a supervised algorithm, given corresponding ground truth labels \(Y_S=\left\{ y_{S1}, \ldots ,y_{Sn}\right\} , y_{Si} \in \mathcal {Y}_S\), such as segmentation masks, where \(\mathcal {Y}_S\) is the label space, a predictive function \(f_S(x) = P_S(y|x)\) is learnt via training and configuration of hyper-parameters on the data (\(X_S,Y_S\)). \(f_S(\cdot )\) aims to approximate the optimal function \(f^{\prime }_{S}(x), x \in \mathcal {X}_S\) that generated \(Y_S\). At the time of deployment, however, these methods often under-perform or fail if the test data come from a different target domain \(D_T=\left\{ \mathcal {X}_T,P(X_T)\right\} \), with \(\mathcal {X}_T \ne \mathcal {X}_S\) and/or \(P(X_T) \ne P(X_S)\). This is because the optimal predictive function \(f^{\prime }_{T}(x), x \in \mathcal {X}_T\) for \(D_T\) may differ from \(f^{\prime }_{S}(\cdot )\), and so the learnt \(f_S(\cdot )\) will not perform well on \(D_T\).

The above scenario is common in biomedical applications due to variations in image acquisition, in particular in multi-center studies. Training and test data may differ in contrast, resolution and noise levels (\(P(X_T) \ne P(X_S)\)), or even in the type of sequences acquired (\(\mathcal {X}_T \ne \mathcal {X}_S\)). Despite the rapid advancements in representation learning, this issue has been shown to affect even the latest models [18]. Generating labelled databases is time consuming and often expensive, and assuming that annotations for training are available for each new domain is neither realistic nor scalable. Instead, it is desirable to develop methods that can learn from existing databases and generalize well or adapt to the target domain without the need for additional training data.

Transfer learning (TL) [14] investigates the development of predictive models by leveraging knowledge from potentially different but related domains and tasks. Even between tasks whose label spaces \(\mathcal {Y}_S\) and \(\mathcal {Y}_T\) differ, TL can take advantage of similarities in the underlying structure of the mappings \(f_S:\mathcal {X}_S \mapsto \mathcal {Y}_S\) and \(f_T:\mathcal {X}_T \mapsto \mathcal {Y}_T\). A subclass of TL is multi-task learning, where a model is trained on multiple related tasks simultaneously. Most related to our work, domain adaptation (DA) is the subclass of TL that assumes \(\mathcal {Y}_S = \mathcal {Y}_T\) and that only the domains differ. It explores learning a function \(f_a(\cdot )\) that performs well on both domains, under the basic assumption that such a function exists [1].

In this work we investigate unsupervised domain adaptation (UDA) [7]. In this setting we assume the availability of a labelled database \(S=(X_S,Y_S)\) from the source domain \(D_S\), along with an unlabelled database \(T=(X_T)\) from a different but related target domain \(D_T\). We wish to model the unknown optimal function \(f^{\prime }_{T}(\cdot )\) for labelling \(X_T\). However, since no labels are available for \(D_T\), \(f^{\prime }_{T}(\cdot )\) cannot be learnt directly. This is in contrast to supervised DA, which requires at least some labelled data for \(D_T\). Instead, we try to learn a representation \(h_a(x)\) that maps \(X_S\) and \(X_T\) to a feature space that is invariant to differences between the two domains, as well as a function \(f_{ah}(\cdot )\) learnt using the data \(\left\{ X_S,Y_S,X_T \right\} \), such that \(f_a(x) = f_{ah}(h_a(x))\) approximates \(f^{\prime }_{S}(\cdot )\) and is closer to \(f^{\prime }_T(\cdot )\) than any function \(f_S(\cdot )\) that can be learnt using only the source data \((X_S,Y_S)\).

Contributions: In this work we develop a domain adaptation method based on adversarial neural networks [4, 5]. We propose the adversarial training of a segmenter and a domain-classifier, which aims to make the representation learnt by the segmenter invariant to domain-specific factors. We describe and analyse the development of domain-adversarial networks for the purpose of segmentation, which to the best of our knowledge has not been previously performed. We investigate the adaptation of layers at various depths and propose multi-connected adversarial networks, which we show improve domain adaptation. We employ our system for the segmentation of traumatic brain injuries (TBI), investigating adaptation between databases acquired with two different scanners and with differences in the available MR sequences. We show that, without using any labels from the target domain, our method closes to a large extent the performance gap with respect to supervised learning with target labels.

Related Work: TL and DA have attracted significant interest over the years. Comprehensive reviews of early works can be found in [1, 7, 14]. The popularity of TL increased with the wide adoption of neural networks, whose features were found to transfer effectively across tasks. For example, features learnt from natural images were used off-the-shelf for detecting peri-fissural nodules [3]. More commonly, TL is performed via pre-training on a source task, followed by fine-tuning for the target task with supervised training [16]. A representative example of TL via multi-task learning was presented in [12], where a network was trained simultaneously for segmentation of brain tissue, pectoral muscle and coronary arteries. These experiments show that much of a network’s capacity can be shared between a variety of tasks. Note that all of the above require labels in \(D_T\).

In contrast, DA explores the case where the label spaces \(\mathcal {Y}_S\) and \(\mathcal {Y}_T\) are the same and little or no labelled data is available in \(D_T\). In [13] the authors explored supervised DA with SVM-based adaptive classifiers in the scenario where source and target data are acquired with different protocols. This method, however, requires labelled target data. Unsupervised DA was tackled in [6] via instance weighting, but this relies on strong assumptions about the data distributions. In [2], UDA was performed with boosted decision stumps and a search for visual correspondences between source and target samples. That approach is neither as flexible as ours nor does it scale well to large databases. The authors in [2] question the feasibility of DA with neural networks on 3D data due to memory requirements. Here, we show that using adversarial 3D networks is indeed a viable approach.

2 Unsupervised Domain Adaptation with Adversarial Nets

The accuracy of a binary classifier that distinguishes between samples from two domains can serve as a proxy for the divergence of the distributions \(P(X_S)\) and \(P(X_T)\), which is otherwise not straightforward to compute. This idea was first introduced in [1]. Inspired by this, the authors of [4] presented a method for simultaneously learning a domain-invariant representation and a task-related classifier with a single neural network. This is done by minimizing the accuracy of an auxiliary network, a domain-discriminator, that processes a hidden representation of the main network and tries to classify the domain of the input sample. This approach forms the basis of our work. Below we describe its extension to segmentation and our proposed multi-connected system.
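
To make the proxy concrete, below is a minimal sketch (not part of the original work) that estimates domain divergence from the error of a domain classifier, following the proxy \(\mathcal {A}\)-distance idea of [1]. The features, the logistic-regression classifier and all names are illustrative placeholders.

```python
# Sketch of the idea in [1]: the error of a classifier trained to distinguish
# source from target samples acts as a proxy for how far apart P(X_S) and
# P(X_T) are. Feature vectors here are placeholders, not the paper's data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_a_distance(x_source, x_target):
    """Estimate the proxy A-distance 2*(1 - 2*err) from domain-classification error."""
    x = np.vstack([x_source, x_target])
    y = np.concatenate([np.zeros(len(x_source)), np.ones(len(x_target))])
    x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    err = 1.0 - clf.score(x_te, y_te)   # domain-classification error
    return 2.0 * (1.0 - 2.0 * err)      # near 2: separable domains; near 0: indistinguishable

# Toy usage with random features drawn from slightly shifted distributions.
rng = np.random.RandomState(0)
print(proxy_a_distance(rng.normal(0, 1, (500, 16)), rng.normal(0.5, 1, (500, 16))))
```

An error near 50% (proxy distance near 0) indicates that, in the chosen representation, the two domains are hard to tell apart.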

Fig. 1. Proposed multi-connected adversarial networks. Segmenter: we use the 3D CNN architecture presented in [8]. Dashed lines denote low resolution features. Input samples are multi-modal, although not depicted. Discriminator: We use a second 3D CNN for classifying the domain of input x, by processing activations at multiple layers of the segmenter. Red lines show the path of the adversarial gradients, from \(L_{adv}\) back to the segmenter. See text for details on architecture. (Color figure online)

2.1 Segmentation System with Domain Discriminator

Segmenter: At the core of our system is a fully convolutional neural network (CNN) for image segmentation [10]. Given an input x of arbitrary size, which can be a whole image or a sub-segment, this type of network predicts labels for multiple voxels in x, one for each stride of the network’s receptive field over the input. The parameters of the network \(\mathbf {\theta }_{seg}\) are learnt by iteratively minimizing a segmentation loss \(\mathcal {L}_{seg}\) using stochastic gradient descent (SGD). The loss is commonly the cross-entropy of the predictions on a training batch \(B_{seg}=\left\{ (x_{1},y_{1}), \ldots ,(x_{N_{seg}},y_{N_{seg}})\right\} \) of \(N_{seg}\) samples. In our setting, \((x_i,y_i)\) are sampled from the source database \(S = (X_S,Y_S)\), for which labels \(Y_S\) are available. We borrow the 3D multi-scale CNN architecture from [8] as the segmenter depicted in Fig. 1 and adopt the same configuration for all meta-parameters.
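
As a point of reference, the following is a minimal PyTorch sketch of one such SGD step, assuming a generic 3D CNN `segmenter` and optimizer in place of the multi-scale architecture and configuration of [8]; it is not the authors' implementation.

```python
# Sketch of one segmentation training step (minimizing the voxel-wise
# cross-entropy L_seg on labelled source samples).
import torch
import torch.nn.functional as F

def segmentation_step(segmenter, optimizer, x_batch, y_batch):
    logits = segmenter(x_batch)                  # (N, classes, D, H, W)
    loss_seg = F.cross_entropy(logits, y_batch)  # y_batch: (N, D, H, W) integer labels
    optimizer.zero_grad()
    loss_seg.backward()
    optimizer.step()
    return loss_seg.item()
```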

Domain Discriminator: When processing an input x, the activations of any feature map (FM) in the segmenter encode a hidden representation h(x). If samples come from different distributions, \(P(X_S) \ne P(X_T)\), e.g. due to different domains, and the filters of the segmenter are not invariant to the domain-specific variations, the distributions of the corresponding activations will differ as well, \(P(h(X_S)) \ne P(h(X_T))\). This is expected when the segmenter is trained only on samples from S, where the learnt features will be specific to the source domain. Similar to [4], we choose a certain representation \(h_a(x)\) from the segmenter and use a second network as a domain-classifier that takes \(h_a(x)\) as input and tries to classify whether it comes from \(P(h_a(X_S))\) or \(P(h_a(X_T))\), which is equivalent to classifying the domain of x. The classification accuracy serves as an indication of how source-specific the representation \(h_a(\cdot )\) is. The architecture we use for the domain classifier is a 3D CNN with five layers. The first four have 100 kernels of size \(3^3\). The last, classification layer uses \(1^3\) kernels. This architecture has a receptive field of \(9^3\) with respect to its input \(h_a(\cdot )\) and was chosen for compatibility with the size of the feature maps in the last three layers of the segmenter.
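
A minimal PyTorch sketch of a discriminator with this structure is given below; the activation function (PReLU) and the number of input feature maps are assumptions, as they are not specified here.

```python
# Sketch of the domain discriminator described above: four 3x3x3 conv layers
# with 100 feature maps each, followed by a 1x1x1 classification layer
# (receptive field 9^3 with respect to its input).
import torch.nn as nn

def make_discriminator(in_fms, n_domains=2):
    layers = []
    fms_in = in_fms
    for _ in range(4):
        layers += [nn.Conv3d(fms_in, 100, kernel_size=3), nn.PReLU(100)]
        fms_in = 100
    layers += [nn.Conv3d(100, n_domains, kernel_size=1)]  # voxel-wise domain logits
    return nn.Sequential(*layers)
```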

We train this domain-discriminator simultaneously with the segmenter. For this, we form a second training batch \(B_{adv} = \left\{ (x_{1},y_{1}^d), \ldots ,(x_{N_{adv}},y_{N_{adv}}^d)\right\} \). Equal numbers of samples \(x_i\) are extracted from \(X_S\) and \(X_T\), so there is no bias towards either domain. \(y_{i}^d\) is a label that encodes the domain of \(x_i\) and is used as the training target. \(B_{adv}\) is processed by the segmenter, either at the same time as \(B_{seg}\) or interleaved with it to lower memory requirements, computing the activations \(h_{a}(x)\) for all \(x \in B_{adv}\). These activations are then processed by the discriminator, which classifies the domain of each sample in \(B_{adv}\). The discriminator’s classification loss \(\mathcal {L}_{adv}\) is minimized through optimization of its parameters \(\theta _{adv}\).

A complication arises for the joint training. In the algorithm of [4], the samples from S are shared between the two losses within an SGD iteration. However, many segmentation methods use weighted sampling to mitigate class imbalance, for example by oversampling rare classes [8, 12]. Such sampling requires segmentation masks, which are not available for T, whose samples are therefore extracted randomly. In this case, the discriminator should not compare them against non-randomly extracted samples from S, as it could easily associate activations of the over-weighted classes with domain S and fail to learn useful domain-discriminative features. Hence, we resort to forming entirely separate batches. \(B_{adv}\) is formed of 20 image segments, randomly extracted from images in S and T. As done in [8], weighted sampling is used to extract 10 segments from S to form \(B_{seg}\). This counters class imbalance for the segmenter, while keeping the samples used by the discriminator unbiased.
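
The batch construction described above can be sketched as follows; the sampler functions are hypothetical placeholders for class-weighted sampling from the annotated source database and random sampling from either database.

```python
# Sketch of the separate batch construction: B_seg uses class-weighted sampling
# from S (needs labels), B_adv uses random segments from S and T in equal
# numbers, with domain labels y^d as the training target.
import torch

def build_batches(source_db, target_db, weighted_sample, random_sample):
    # B_seg: 10 segments from S, sampled with class-weighting.
    x_seg, y_seg = weighted_sample(source_db, n=10)

    # B_adv: 20 randomly extracted segments, half from S and half from T.
    x_s = random_sample(source_db, n=10)
    x_t = random_sample(target_db, n=10)
    x_adv = torch.cat([x_s, x_t], dim=0)
    y_dom = torch.cat([torch.zeros(10, dtype=torch.long),   # 0 = source domain
                       torch.ones(10, dtype=torch.long)])   # 1 = target domain
    return (x_seg, y_seg), (x_adv, y_dom)
```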

Domain Adaptation via Adversarial Training: We aim to adapt the representation \(h_a(\cdot )\) so that it becomes invariant to variations between S and T. To this end, we expose the accuracy of the domain-discriminator to the segmenter and let the segmenter alter its parameters such that the FMs that comprise \(h_{a}(\cdot )\) do not contain cues about the input domain. This is done by incorporating the domain-discriminator’s loss \(\mathcal {L}_{adv}\) into the training objective of the segmenter, which now aims to simultaneously maximize the domain-classification loss and minimize the segmentation loss \(\mathcal {L}_{seg}\):

$$\begin{aligned} \mathcal {L}_{segAdv}(\theta _{seg}) = \mathcal {L}_{seg}(\theta _{seg}) - \alpha \mathcal {L}_{adv}(\theta _{seg}) \end{aligned}$$
(1)

\(\alpha \) is a positive weight that defines the relative importance of the domain adaptation task for the segmenter. This optimization is possible with regular SGD, as the adversarial networks are interconnected and gradients of \(\mathcal {L}_{adv}\) can propagate back through the discriminator and into the segmenter. This process was implemented in [4] via a custom gradient-reversal layer, which is not needed if the optimization is formulated as in Eq. (1), as also noted by the authors.
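
One possible arrangement of a joint update implementing Eq. (1) is sketched below in PyTorch. It alternates a discriminator step (minimizing \(\mathcal {L}_{adv}\)) with a segmenter step (minimizing \(\mathcal {L}_{seg} - \alpha \mathcal {L}_{adv}\)), so no gradient-reversal layer is required. The `segmenter.features` accessor and the pooling of voxel-wise domain logits into one prediction per sample are simplifying assumptions, not the authors' exact implementation.

```python
import torch.nn.functional as F

def joint_adversarial_step(segmenter, discriminator, opt_seg, opt_adv,
                           batch_seg, batch_adv, alpha):
    (x_s, y_s), (x_adv, y_dom) = batch_seg, batch_adv

    # Segmentation loss L_seg on the labelled source batch.
    loss_seg = F.cross_entropy(segmenter(x_s), y_s)

    # Hidden representation h_a(x) of the adversarial batch. `features` is an
    # assumed accessor exposing the adapted activations of the segmenter.
    h_a = segmenter.features(x_adv)

    # 1) Discriminator update: minimize L_adv w.r.t. theta_adv only (hence the detach).
    dom_logits = discriminator(h_a.detach()).mean(dim=(2, 3, 4))  # pool voxel-wise logits
    loss_disc = F.cross_entropy(dom_logits, y_dom)
    opt_adv.zero_grad()
    loss_disc.backward()
    opt_adv.step()

    # 2) Segmenter update: minimize L_seg - alpha * L_adv w.r.t. theta_seg (Eq. 1),
    #    i.e. maximize the domain-classification loss. No gradient reversal needed.
    dom_logits = discriminator(h_a).mean(dim=(2, 3, 4))
    loss_adv = F.cross_entropy(dom_logits, y_dom)
    opt_seg.zero_grad()
    (loss_seg - alpha * loss_adv).backward()
    opt_seg.step()
    return loss_seg.item(), loss_adv.item()
```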

2.2 Multi-connected Adversarial Networks

A natural question concerns which layer(s) of the segmenter should be adapted. In [17], the authors investigated which of the last three fully connected layers of an AlexNet leads to better accuracy when adapted, concluding that the last hidden layer is optimal in their setting. Earlier layers are commonly not adapted, as their features are considered rather generic and transferable across related tasks [4, 11].

We argue that adapting only the last layers might not be ideal, especially for segmentation. The accuracy of classification networks depends mostly on high-level patterns. For precise segmentation, however, fine patterns such as detailed texture and small contrast variations are likely to be important. These fine patterns are extracted in early layers and are more susceptible to image-quality variations between domains. Adapting the top layers makes them invariant to such variations, but capacity is still lost if such features have already been extracted by early layers, which may not be well adapted by the weakened adversarial gradients that reach them. On the other hand, if only early layers are adapted, assuming the adaptation is not ideal and the features are not entirely free of factors of variation between the two domains, the network could recover source-specific patterns at greater depth. For these reasons we propose an architecture in which the domain discriminator is connected at multiple layers of the segmenter. First, this removes source-specific patterns early on and also prevents their recovery at deeper layers. Furthermore, the discriminator can process a larger variety of features for discriminating between the domains, increasing its performance and thus the quality of the gradients for domain adaptation. Finally, viewing the whole adversarial network as an auxiliary cost function for the segmenter, this type of connection can be compared with deep supervision [9]: it allows better flow of the gradients from \(\mathcal {L}_{adv}\) throughout the segmenter and can thus improve the learning of quality features. Our main results are based on feeding the discriminator with input \(h_{a}(\cdot )\) formed from the FMs of layers 4, 6 and 8 of both the high- and low-resolution pathways, as well as the 10th hidden layer of the segmenter (cf. Fig. 1). After the FMs of the low-resolution pathway are upsampled, all FMs are cropped to match the size of the deepest layer and concatenated. A detailed analysis of the effect of adapting different layers is presented in Sect. 3.4.
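
A sketch of how such a multi-connected input could be assembled is given below; the tensor layout, the nearest-neighbour upsampling and the default upsampling factor are assumptions.

```python
# Sketch of assembling the multi-connected input to the discriminator:
# low-resolution feature maps are upsampled, everything is center-cropped to
# the spatial size of the deepest selected layer, then concatenated channel-wise.
import torch
import torch.nn.functional as F

def center_crop(fm, target_shape):
    """Center-crop the spatial dims of fm (N, C, D, H, W) to target_shape (D, H, W)."""
    _, _, d, h, w = fm.shape
    td, th, tw = target_shape
    sd, sh, sw = (d - td) // 2, (h - th) // 2, (w - tw) // 2
    return fm[:, :, sd:sd + td, sh:sh + th, sw:sw + tw]

def build_discriminator_input(normal_res_fms, low_res_fms, upsample_factor=3):
    """Lists of activations from the selected layers, ordered shallow to deep."""
    up = [F.interpolate(fm, scale_factor=upsample_factor, mode='nearest')
          for fm in low_res_fms]
    target = normal_res_fms[-1].shape[2:]      # spatial size of the deepest selected layer
    cropped = [center_crop(fm, target) for fm in normal_res_fms + up]
    return torch.cat(cropped, dim=1)           # h_a(x) fed to the discriminator
```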

3 Experiments

3.1 Material

We make use of two databases of multi-spectral MR brain scans of patients with moderate to severe TBI, acquired within the first week of injury. The first database consists of 61 subjects, imaged on a 3-T Siemens Magnetom TIM Trio. The MR sequences are isotropic MPRAGE (1 mm\(^3\)), axial FLAIR, T2 and Proton Density (PD) (0.7 \(\times \) 0.7 \(\times \) 5 mm), and Gradient-Echo (GE) (0.86 \(\times \) 0.86 \(\times \) 5 mm). The second database consists of 41 subjects, imaged on a 3-T Siemens Magnetom Verio. This database includes MPRAGE, FLAIR, T2 and PD sequences, acquired at the same resolution as in the first database. The important difference is that instead of GE, a Susceptibility Weighted Image (SWI) is available (0.7 \(\times \) 0.7 \(\times \) 5 mm). In both databases, all visible lesions were manually annotated on the FLAIR and GE/SWI by clinical experts. We merge these annotations into a single lesion mask, as we focus here on binary segmentation of abnormalities within the brain tissue. Extra-cerebral pathologies are treated as background. All images are skull-stripped, resampled to isotropic 1 mm\(^3\) and affinely registered to MNI space. Image intensities under the brain masks are normalized to zero mean and unit variance, after windowing out the lowest and highest 2% of the intensity histogram.
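
The intensity normalization step can be sketched as follows; how background voxels are treated after normalization is an assumption, and skull-stripping, resampling and registration are assumed to be done with external tools.

```python
# Sketch of intensity normalization: clip the lowest and highest 2% of
# intensities within the brain mask, then standardize to zero mean and unit
# variance over the brain voxels.
import numpy as np

def normalize_intensities(image, brain_mask, low_pct=2.0, high_pct=98.0):
    voxels = image[brain_mask > 0]
    lo, hi = np.percentile(voxels, [low_pct, high_pct])
    windowed = np.clip(image, lo, hi)
    brain_voxels = windowed[brain_mask > 0]
    out = (windowed - brain_voxels.mean()) / brain_voxels.std()
    out[brain_mask == 0] = 0.0   # background set to zero (assumption)
    return out
```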

Source (S) and Target (T) Databases: GE and SWI are commonly used in TBI studies due to their great sensitivity to haemorrhages. They enable the detection of lesions that are invisible in other sequences, such as micro-bleeds. SWI is in fact a type of GE that offers greater sensitivity and image quality [15]. See Fig. 2 for visual examples. For the purpose of this study, the first database, with GE available, is considered the source database S and is used to train the segmenter in a supervised manner. The second database, with SWI available, is considered the target database T on which we aim to successfully apply the trained segmenter. This corresponds to a typical scenario where a training database is generated with data from one clinical site, and new test data come from another site with a different protocol. Motivated by their common property of being sensitive to blood, and thus providing similar information for TBI segmentation, we consider GE and SWI as interchangeable in the same input channel of our system, unless stated otherwise. However, the difference in appearance between GE and SWI images (cf. Fig. 2) constitutes the largest source of variation between the distributions \(P(X_S)\) and \(P(X_T)\). Further variations may be present due to the different scanners used for acquiring S and T. Using our method, we aim to learn features that are invariant to these domain differences without the need for any annotations on the target domain.

3.2 Configuration of the Training Schedule

A complication of adversarial training concerns the training schedule of the two connected networks, which influences the way they interact. The strength with which the segmenter adapts its features to counter the domain-discriminator is controlled by the parameter \(\alpha \) (cf. Eq. (1)). We set \(\alpha =0\) for the first \(e_{1}=10\) epochs and let both networks learn independently. This allows the segmenter to initially learn features for the segmentation of S without being influenced by noisy adversarial gradients from an initially poorly performing domain-discriminator. After epoch \(e_{1}\), when the discriminator’s performance has increased, we start countering it so that the segmenter learns domain-invariant features. For this, we increase \(\alpha \) according to the linear schedule \(\alpha = \alpha _{max} \frac{e_{curr} - e_{1}}{e_{2} - e_{1}}\), where \(e_{2}=35\) and \(\alpha _{max}\) is the maximum weighting, so that \(\alpha \) equals \(\alpha _{max}\) after epoch \(e_{2}\). Finally, at epoch 43 we start refining the segmenter’s features by gradually lowering its learning rate. The discriminator is optimized with a constant learning rate of 0.001. In the following, \(\alpha _{max}=0.05\) is used. In Sect. 3.4 we present a sensitivity analysis showing robust behaviour across a range of values for \(\alpha _{max}\). \(e_{1}\), \(e_{2}\) and the total duration of this piecewise linear schedule were determined empirically for satisfactory convergence without prolonging training time. We have not fully explored optimal settings, which may vary between applications and with the relative difficulty of each network’s task.
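
The schedule for \(\alpha \) described above corresponds to the following simple function, using the values from the text.

```python
# Piecewise-linear schedule for alpha: zero up to e1, linear ramp until e2,
# constant alpha_max afterwards (e1 = 10, e2 = 35, alpha_max = 0.05 in the text).
def alpha_schedule(epoch, e1=10, e2=35, alpha_max=0.05):
    if epoch <= e1:
        return 0.0                                # both networks learn independently
    if epoch >= e2:
        return alpha_max                          # full adversarial weighting
    return alpha_max * (epoch - e1) / (e2 - e1)   # linear ramp between e1 and e2
```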

3.3 Evaluation

We performed multiple experiments to obtain upper and lower bounds of baseline accuracy on the challenging task of TBI segmentation. We discuss experiments below, summarize results in Table 1 and give examples of segmentations in Fig. 2.

Table 1. Comparison of our method’s performance on T with several baselines. Our system significantly closes the gap between the lower bound, when the segmenter is trained on S only, and the upper bound, when the segmenter is also trained with labelled data from T. Values are given as mean (std).

Train on S, Test on T: We perform standard supervised training of the segmenter on S without adaptation. To segment T, motivated by the similarity between the GE and SWI sequences, at test time we use SWI in the channel used for GE during training. Even though these sequences can serve similar purposes in the analysis of TBI by radiologists, this approach fails completely, showing that they are not directly interchangeable as input to a CNN.

Train on S (No GE/SWI), Test on T: We repeat the previous experiment but use only the sequences common to S and T in both training and testing, omitting GE and SWI. The experiment was repeated twice to reduce random variation between training sessions. This corresponds to a practical scenario in which we need to segment T using only annotated training data from S, and it serves as the lower bound of accuracy for our system.

Train on T, Test on T: We perform a 2-fold validation with supervised training on half of T and testing on the other half, using all sequences of T. The obtained performance is similar to that reported in [8], although on a different database. This experiment provides another indication of the expected accuracy on this challenging segmentation task.

Train on S and T, Test on T: To obtain an upper bound of accuracy, we train the segmenter on all data of S and half the data of T, using their manual annotations. The same input channel is used for the GE of S and the SWI of T. We then test on the other half of the data from T, and the experiment is repeated for the other split of T. We balance the samples from the two domains in each training batch to avoid biasing the segmenter towards S, which has more subjects. With supervised training on T, the system learns to interchange GE and SWI successfully. This setting uses all available data from both domains, both images and manual annotations, and serves as an estimate of optimal, supervised transfer learning.

Train on S and T, Test on T (GE/SWI in Different Channels): As a sanity check that using GE and SWI in the same input channel is reasonable, we repeat the previous experiment using a CNN with six channels, with separate ones for GE and SWI. A channel is filled with the constant \(-4\) when the corresponding sequence is not available, which is a very low value after our intensity normalization. From this the CNN learns when a sequence is missing; we found this to behave better than common zero-filling. The segmenter performs better than with supervised training on T only, indicating that information from both domains is used. However, knowledge transfer is not as strong as when GE and SWI, which share much information, are used in the same channel.
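
A sketch of how such a six-channel input with a missing-sequence fill value could be assembled is shown below; the channel ordering and helper names are illustrative.

```python
# Sketch of the six-channel input used in this sanity check: GE and SWI get
# separate channels, and a channel whose sequence was not acquired is filled
# with the constant -4 (a very low value after intensity normalization).
import numpy as np

CHANNELS = ['MPRAGE', 'FLAIR', 'T2', 'PD', 'GE', 'SWI']
FILL_VALUE = -4.0

def stack_channels(sequences, shape):
    """sequences: dict mapping sequence name to a normalized 3D array."""
    return np.stack([sequences.get(name, np.full(shape, FILL_VALUE, dtype=np.float32))
                     for name in CHANNELS])
```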

Proposed Unsupervised Domain Adaptation: We train the segmenter on all data of S and adapt the domains using half the subjects of T, without any labels. GE and SWI share the same input channel. We test accuracy on the other half of T, and the experiment is repeated for the other fold. Our method learns filters that are invariant to the two imaging protocols and transfers knowledge from S to T, allowing the system to segment haemorrhages only visible on SWI without ever seeing a manual annotation from T (Fig. 2). This improves by 3% DSC over the non-adapted segmenter that uses only information from S and the common sequences, covering 44% of the difference between this practical lower bound and the upper bound achieved by supervised training with labels from both domains.
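
For clarity, the fraction of the gap covered is computed as \((\mathrm{DSC}_{UDA} - \mathrm{DSC}_{lower}) / (\mathrm{DSC}_{upper} - \mathrm{DSC}_{lower})\); the sketch below uses placeholder values, not the actual entries of Table 1.

```python
# How a "fraction of the gap covered" figure is computed from mean DSC values.
# The numbers below are placeholders for illustration only.
def gap_coverage(dsc_lower, dsc_upper, dsc_uda):
    """Fraction of the lower-to-upper-bound gap closed by unsupervised adaptation."""
    return (dsc_uda - dsc_lower) / (dsc_upper - dsc_lower)

print(gap_coverage(dsc_lower=0.60, dsc_upper=0.67, dsc_uda=0.63))  # ~0.43 with these placeholders
```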

Fig. 2. (top) Example case from S. (middle/bottom) SWI and FLAIR of two subjects from T (T2, MPRAGE, PD also used but not shown). Notice that only GE and SWI show certain lesions, such as micro-bleeds. However, brain tissue appears differently in GE and SWI. Consequently, a model trained on S fails on T when SWI is naively used in place of GE (3rd col.). A model trained using only the four common sequences misses lesions visible only on SWI (4th col.). Our method mitigates these problems by learning features invariant to the imaging protocol (5th col.).

Fig. 3. Behaviour when the domain-discriminator is connected at different layers of the segmenter. Adaptation is performed after epoch 10 by linearly increasing \(\alpha \). Connections at earlier layers lead to higher performance of the discriminator but slower adaptation. Multiple connections increase performance. Note that features learnt at early layers during the refinement in the last stages of training seem more domain-discriminative.

3.4 Analysis of the System

Effect of Adapting Layers at Different Depths: To investigate how the depth of the adapted layers affects our system, we repeat the experiment with domain adaptation from S to T, changing the layers at which the adversarial networks are connected. Results are shown in Fig. 3 and Table 2. Note that connections are added to both pathways of the segmenter at the same depth (for example, L4 means connections to the 4th layers of both pathways). Adapting shallow layers tends towards over-segmentation (increased recall but lower precision). Note that severe over-segmentation also occurs without adaptation (Fig. 2). These observations indicate that source-specific features are possibly recovered between the adapted layer and the classification layer. Comparing L2 and L(2, 4, 6, 8, 10) shows that this is alleviated by multiple connections, which enforce domain invariance throughout the segmenter. However, since the behaviour of multi-connected adversarial networks is strongly determined by the shallowest connection, we avoid adapting the earliest layers, which offer little benefit but slow down convergence.

Table 2. Final accuracy on T when the discriminator is connected at different depths of the segmenter. Shallow connections increase recall but significantly decrease precision. Multiple connections better remove the source-specific nuisances throughout the segmenter, closing the gap to the practical upper bound of 66.5% for UDA (Sect. 3.3) by approximately 1.5% DSC. Best configuration in bold.

Fig. 4. The segmenter counters the domain-discriminator after epoch 10, when we linearly increase \(\alpha \) from zero to \(\alpha _{max}\) until epoch 35. Final accuracy on T was found to be rather stable for a wide range of values. A decrease greater than 1% DSC from the highest was found for values 0.02 and 2.0.

Effect of Adaptation Strength via \(\varvec{\alpha _{max}}\): Here we investigate how sensitive our method is to \(\alpha _{max}\), which defines how strongly the segmenter counters the discriminator. Figure 4 shows that higher values lead to quicker adaptation, but the accuracy is rather stable for a significant range of values \(\alpha _{max} \in [0.05,1.0]\). We note that this range may differ for other applications and that smooth convergence is generally preferable for learning high-quality features, compared to steep schedules that alter the loss surface aggressively. Finally, we observe that strongly countering the discriminator does not guarantee better performance on T. A theoretical reason is that a more domain-invariant representation \(h_a(x)\) likely encodes less information about x. This information loss increases the Bayes error rate and the entropy of the predictions of the learnt \(f_a(x)=f_{ah}(h_a(x))\). Beyond a certain level of invariance, this can outweigh the benefits of domain adaptation [1, 7].

4 Conclusion

We present an unsupervised domain adaptation method for image segmentation based on adversarial training of two 3D neural networks. To the best of our knowledge this is the first work showing the plausibility and capabilities of such an approach on a biomedical imaging problem. Additionally, we propose multi-connected adversarial networks, which perform better by enabling flow of higher quality adversarial gradients throughout the adapted network. We investigate aspects of adversarial training such as the depth of the adapted layer and the strength of adaptation, providing valuable insights for development of future approaches. While unsupervised in the target domain, our method performs close to the accuracy of supervised baselines. We believe our work makes an important contribution in the context of multi-center studies where domain differences are a major limitation in current image analysis methods. Future work will investigate the capabilities of our approach to normalize different types of variations. An implementation of the proposed system will be made publicly available on https://biomedia.doc.ic.ac.uk/software/deepmedic/.