
1 Introduction

Deploying machine learning models in the real world often requires the ability to generalize to unseen samples, i.e. samples significantly different from those seen during learning. Despite impressive performance on a variety of tasks, deep learning models do not always meet these requirements [3, 14]. For this reason, out-of-distribution generalization is recognized as a major challenge for the reliability of machine learning systems [1, 2]. Domain Adaptation (DA) [28, 30] is a well-studied approach to bridge the gap between train and test distributions. In DA, we refer to the train and test distributions as source and target, noted \(p_S(x,y)\) and \(p_T(x,y)\) respectively, where x denotes inputs and y labels. The objective of DA can be defined as learning a good classifier on a poorly sampled target domain by leveraging samples from a source domain. Unsupervised Domain Adaptation (UDA) assumes that only unlabelled data from the target domain is available during training. In this context, a natural assumption, named covariate shift [19, 33], consists in assuming that the mapping from inputs to labels is conserved across domains, i.e. \(p_T(y|x) = p_S(y|x)\). Under this assumption, Importance Sampling (IS) performs adaptation by weighting the contribution of a sample x in the loss by \(w(x) = p_T(x) / p_S(x)\) [30]. Although IS seems natural when unlabelled data from the target domain is available, the covariate shift assumption is not sufficient to guarantee successful adaptation [5]. Moreover, for high dimensional data [12] such as texts or images, the shift between \(p_S(x)\) and \(p_T(x)\) often results from non-overlapping supports, leading to unbounded weights [20].
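For concreteness, a minimal sketch of IS-based training under covariate shift is given below (in PyTorch, the framework used later in the paper). The `density_ratio` callable is a hypothetical estimator of \(w(x) = p_T(x)/p_S(x)\); obtaining such an estimate reliably is precisely what becomes problematic for high dimensional inputs.

```python
import torch
import torch.nn.functional as F

def importance_weighted_loss(model, x_src, y_src, density_ratio):
    """Covariate-shift correction: weight each source sample's loss by
    w(x) = p_T(x) / p_S(x). `density_ratio` is a hypothetical callable
    returning a (batch,) tensor of weights for the inputs."""
    logits = model(x_src)                                          # (batch, C)
    per_sample = F.cross_entropy(logits, y_src, reduction="none")  # (batch,)
    w = density_ratio(x_src)                                       # (batch,)
    return (w * per_sample).mean()
```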

In this particular context, representations can help to reconcile non-overlapping supports [5]. This seminal idea, and the corresponding theoretical bound of the target risk from [5], has led to a wide variety of deep learning approaches [13, 23, 24] which aim to learn a so-called domain invariant representation:

$$\begin{aligned} p_S(z) \approx p_T(z) \end{aligned}$$
(1)

where \(z:=\varphi (x)\) for a given non-linear representation \(\varphi \). These approaches assume that the transferability of representations, defined as the combined error of an ideal classifier, remains low during learning. Unfortunately, this quantity involves target labels and is thus intractable. More importantly, enforcing strictly invariant representations, \(p_S(z) = p_T(z)\), hurts the transferability of representations [20, 22, 36, 40]. In particular, there is a fundamental trade-off between learning invariant representations and preserving transferability in the presence of label shift (\(p_T(y) \ne p_S(y)\)) [40]. To mitigate this trade-off, some recent works suggest relaxing domain invariance by weighting samples [8, 9, 36, 37]. This strategy differs from (1) by aligning a weighted source distribution with the target distribution:

$$\begin{aligned} w(z) p_S(z) \approx p_T(z) \end{aligned}$$
(2)

for some weights w(z). We now have two tools, w and \(\varphi \), which need to be calibrated to obtain distribution alignment. Which one should be promoted? How do weights preserve good transferability of representations?

While most prior works focus on the invariance error for achieving adaptation [13, 23, 24], this paper focuses on the transferability of representations. We show that weights allow us to design an interpretable generalization bound in which the transferability and invariance errors are uncoupled. In addition, we discuss the role of inductive design for both the classifier and the weights in addressing the lack of labelled data in the target domain. Our contributions are the following:

  1. We introduce a new bound of the target risk which incorporates both weights and domain invariant representations. Two new terms are introduced. The first is an invariance error which promotes alignment between a weighted source distribution of representations and the target distribution of representations. The second, named transferability error, involves labelling functions from both source and target domains.

  2. We highlight the role of inductive bias for approximating the transferability error. First, we establish connections between our bound and popular approaches for UDA which use target predicted labels during adaptation, in particular Conditional Domain Adaptation [24] and Minimal Entropy [15]. Second, we show that the inductive design of weights has an impact on representation invariance.

  3. We derive a new learning procedure for UDA. The particularity of this procedure is to only minimize the transferability error while controlling representation invariance with weights. Since the transferability error involves target labels, we use the predicted labels during learning.

  4. We provide an empirical illustration of our framework on two DA benchmarks (Digits and Office31 datasets). We stress-test our learning scheme by strongly modifying the label distribution in the source domain. While methods based on invariant representations deteriorate considerably in this context, our procedure remains robust.

2 Preliminaries

We introduce the source distribution, i.e. the data on which the model is trained with supervision, and the target distribution, i.e. the data on which the model is tested or applied. Formally, for two random variables (X, Y) on a given space \(\mathcal {X} \times \mathcal {Y}\), we introduce two distributions: the source distribution \(p_S(x,y)\) and the target distribution \(p_T(x,y)\). Here, labels are one-hot encoded, i.e. \(y \in [0,1]^{C}\) such that \(\sum _c y_c = 1\) where C is the number of classes. The distributional shift situation is then characterized by \(p_S(x,y) \ne p_T(x,y)\) [30]. In the rest of the paper, we use the index notation S and T to differentiate source and target terms. We define the hypothesis class \(\mathcal {H}\) as a subset of functions from \(\mathcal {X}\) to \(\mathcal {Y}\) which is the composition of a representation class \(\varPhi \) and a classifier class \(\mathcal {G}\), i.e. \(\mathcal {H}= \mathcal {G} \circ \varPhi \). For ease of reading, given a classifier \(g \in \mathcal {G}\) and a representation \(\varphi \in \varPhi \), we note \(g\varphi := g\circ \varphi \). Furthermore, in the definition \(z:=\varphi (x)\), we refer indifferently to z, \(\varphi \), or \(Z:=\varphi (X)\) as the representation. For two given h and \(h'\in \mathcal {H}\) and \(\ell \) the \(L^2\) loss \(\ell (y,y') = ||y-y'||^2\), the risk in domain \(D \in \{S,T\}\) is noted:

$$\begin{aligned} \varepsilon _D(h) := \mathbb {E}_D[\ell (h(X),Y)] \end{aligned}$$
(3)

and \(\varepsilon _D(h,h') := \mathbb {E}_D[\ell (h(X), h'(X))]\). In the seminal works [5, 27], a theoretical bound on the target risk when using a representation \(\varphi \) was derived:

Bound 1

(Ben David et al.). Let \(d_{\mathcal {G}}(\varphi ) = \sup _{g,g' \in \mathcal {G}}|\varepsilon _S(g \varphi , g'\varphi ) - \varepsilon _T(g \varphi , g'\varphi )|\) and \(\lambda _{\mathcal {G}}(\varphi ) = \inf _{g\in \mathcal {G}}\{ \varepsilon _S(g \varphi ) + \varepsilon _T(g\varphi )\}\), \(\forall g \in \mathcal {G}, \forall \varphi \in \varPhi \):

$$\begin{aligned} \varepsilon _T(g\varphi )\le \varepsilon _S(g\varphi ) + d_{\mathcal {G}}(\varphi ) + \lambda _{\mathcal {G}}(\varphi ) \end{aligned}$$
(4)

This generalization bound ensures that the target risk \(\varepsilon _T(g\varphi )\) is bounded by the sum of the source risk \(\varepsilon _S(g\varphi )\), the disagreement risk between two classifiers from representations \(d_{\mathcal {G}}(\varphi )\), and a third term, \(\lambda _{\mathcal {G}}(\varphi )\), which quantifies the ability to perform well in both domains from representations. The latter is referred to as the adaptability error of representations. It is intractable in practice since it involves labels from the target distribution. Promoting distribution invariance of representations, i.e. \(p_S(z)\) close to \(p_T(z)\), results in a low \(d_{\mathcal {G}}(\varphi )\). More precisely:

$$\begin{aligned} d_\mathcal {G}(\varphi ) \le 2 \sup _{d \in \mathcal {D}} |p_S(d(z)=1) - p_T(d(z)=1)| \end{aligned}$$
(5)

where \(\mathcal {D}\) is the so-called set of discriminators or critics which verifies \(\mathcal {D} \supset \{g \oplus g': (g,g') \in \mathcal {G}^2 \}\) where \(\oplus \) is the \(\mathrm {XOR}\) function [13]. Since the domain invariance term \(d_{\mathcal {G}}(\varphi )\) is expressed as a supremal value over classifiers, it is suitable for domain adversarial learning with critic functions. Conversely, the adaptability error \(\lambda _{\mathcal {G}}(\varphi )\) is expressed as an infimal value. This ‘\(\sup / \inf \)’ duality induces an unexpected trade-off when learning domain invariant representations:

Proposition 1

(Invariance hurts adaptability [20, 40]). Let \(\psi \) be a representation which is a richer feature extractor than \(\varphi \): \(\mathcal {G}\circ \varphi \subset \mathcal {G} \circ \psi \). Then,

$$\begin{aligned} d_{\mathcal {G}}(\varphi ) \le d_{\mathcal {G}}(\psi ) \text{ while } \lambda _{\mathcal {G}}(\psi ) \le \lambda _{\mathcal {G}}(\varphi ) \end{aligned}$$
(6)

As a result of Proposition 1, the benefit of representation invariance must be higher than the loss of adaptability, which is impossible to guarantee in practice.

3 Theory

To overcome the limitation raised in Proposition 1, we expose a new bound of the target risk which embeds a new trade-off between invariance and transferability (3.1). We show that this new bound remains inconsistent in the presence of label shift (3.2) and we expose the role of weights in addressing this problem (3.3).

3.1 A New Trade-Off Between Invariance and Transferability

Core Assumptions. Our strategy is to express both the transferability and the invariance as suprema using an Integral Probability Measure (IPM) computed on a critic class. We thus introduce a class of critics suitable for our analysis. Let \(\mathcal {F}\) be a class of functions from \(\mathcal {Z}\) to \([-1,1]\) and \(\mathcal {F}_C\) a class of functions from \(\mathcal {Z}\) to \([-1,1]^{C}\) with the following properties:

  • (A1) \(\mathcal {F}\) and \(\mathcal {F}_C\) are symmetric (i.e. \(\forall f \in \mathcal {F}, -f \in \mathcal {F}\)) and convex.

  • (A2) \(\mathcal {G} \subset \mathcal {F}_C\) and \(\left\{ \mathbf{f} \cdot \mathbf{f}'~ ; ~\mathbf{f}, \mathbf{f}' \in \mathcal {F}_C \right\} \subset \mathcal {F}\).

  • (A3) \(\forall \varphi \in \varPhi \), the map \(\mathbf{f}_D: z \mapsto \mathbb {E}_D[Y|\varphi (X)=z]\) belongs to \(\mathcal {F}_C\).

  • (A4) For two distributions p and q on \(\mathcal {Z}\), \(p=q\) if and only if:

    $$\begin{aligned} \mathrm {IPM}(p,q ; \mathcal {F}) := \sup _{f \in \mathcal {F}} \left\{ \mathbb {E}_p[f(Z)] - \mathbb {E}_q[f(Z)] \right\} = 0 \end{aligned}$$
    (7)

The assumption (A1) ensures that, rather than comparing two given \(\mathbf{f}\) and \(\mathbf{f}'\), it is enough to study the error of some \(\mathbf{f}'' = \frac{1}{2} (\mathbf{f} - \mathbf{f}')\) from \(\mathcal {F}_C\). This reduces a supremum over \(\mathcal {F}_C^2\) to a supremum over \(\mathcal {F}_C\). The assumption (A2), combined with (A1), ensures that an error \(\ell (\mathbf{f}, \mathbf{f}') \) can be expressed as a critic function \(f \in \mathcal {F}\) such that \(f = \ell (\mathbf{f}, \mathbf{f}')\). The assumption (A3) ensures that \(\mathcal {F}_C\) is rich enough to contain the labelling functions from representations. Here, \(\mathbf{f}_D(z) = \mathbb {E}_D[Y|Z=z]\) is a vector of probabilities over classes: \(f_D(z)_c = p_D(Y=c|Z=z)\). The last assumption (A4) ensures that the introduced IPM is a distance. Classical tools verify these assumptions, e.g. continuous functions; in that case \(\mathrm {IPM}(p,q ; \mathcal {F})\) is the Maximum Mean Discrepancy [16] and one can reasonably believe that \(\mathbf{f}_S\) and \(\mathbf{f}_T\) are continuous.
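To make (A4) concrete: if \(\mathcal {F}\) is taken as the unit ball of an RKHS with a characteristic kernel (an illustrative choice, not a requirement of the analysis), the IPM is the MMD and can be estimated from two batches of representations. A minimal sketch, assuming a Gaussian kernel:

```python
import torch

def rbf_kernel(a, b, sigma=1.0):
    # Gaussian kernel matrix k(a_i, b_j) between two batches of representations
    return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))

def mmd2(z_s, z_t, sigma=1.0):
    """Biased empirical estimate of MMD^2(p_S(z), p_T(z)), i.e. the squared
    IPM of (A4) when F is the unit ball of the RKHS induced by the kernel."""
    k_ss = rbf_kernel(z_s, z_s, sigma).mean()
    k_tt = rbf_kernel(z_t, z_t, sigma).mean()
    k_st = rbf_kernel(z_s, z_t, sigma).mean()
    return k_ss + k_tt - 2 * k_st
```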

Invariance and Transferability as IPMs. We introduce here two important tools that will guide our analysis:

  • \(\mathrm {INV}(\varphi )\), named invariance error, that aims at capturing the difference between the source and target distributions of representations, corresponding to:

    $$\begin{aligned} \mathrm {INV}(\varphi ) := \sup _{f\in \mathcal {F}} \left\{ \mathbb {E}_T[f(Z)] - \mathbb {E}_{S}[f(Z)] \right\} \end{aligned}$$
    (8)
  • \(\mathrm {TSF}(\varphi )\), named transferability error, that captures whether the coupling between Z and Y shifts across domains. For that, we use our class of functions \(\mathcal {F}_C\) and compute the IPM, between the source and the target domains, of the scalar product \(Y\cdot \mathbf{f}(Z)\) where \(\mathbf{f} \in \mathcal {F}_C\) (an illustrative kernel-based estimate is sketched after this list):

    $$\begin{aligned} \mathrm {TSF}(\varphi ) := \sup _{\mathbf{f}\in \mathcal {F}_C} \{ \mathbb {E}_T[Y \cdot \mathbf{f} (Z)] - \mathbb {E}_{S}[Y \cdot \mathbf{f} (Z)] \} \end{aligned}$$
    (9)
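As an illustration of what \(\mathrm {TSF}(\varphi )\) measures, assume (purely for the sake of this sketch; the paper later uses an adversarial critic instead) that each coordinate of \(\mathbf{f}\) lives in the unit ball of an RKHS: the supremum then compares label-weighted kernel mean embeddings of the two domains and can be estimated from labelled batches.

```python
import torch

def rbf_kernel(a, b, sigma=1.0):
    return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))

def tsf2_rkhs(z_s, y_s, z_t, y_t, sigma=1.0):
    """Squared IPM of Y . f(Z) between domains when f lives in an RKHS ball:
    || E_T[Y k(Z, .)] - E_S[Y k(Z, .)] ||^2, with y_s, y_t one-hot labels."""
    k_ss = rbf_kernel(z_s, z_s, sigma)
    k_tt = rbf_kernel(z_t, z_t, sigma)
    k_st = rbf_kernel(z_s, z_t, sigma)
    # <Y_i, Y_j> is 1 when the two samples share the same class, 0 otherwise
    yy_ss, yy_tt, yy_st = y_s @ y_s.T, y_t @ y_t.T, y_s @ y_t.T
    return (k_tt * yy_tt).mean() + (k_ss * yy_ss).mean() - 2 * (k_st * yy_st).mean()
```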

A New Bound of the Target Risk. Using \(\mathrm {INV}(\varphi )\) and \(\mathrm {TSF}(\varphi )\), we can provide a new bound of the target risk:

Bound 2

\(\forall g \in \mathcal {G}\) and \(\forall \varphi \in \varPhi \):

$$\begin{aligned} \varepsilon _T(g\varphi ) \le \varepsilon _{S}(g\varphi ) + 6\cdot \mathrm {INV}(\varphi ) + 2\cdot \mathrm {TSF}(\varphi ) + \varepsilon _T(\mathbf{f}_T\varphi ) \end{aligned}$$
(10)

The proof is in Appendix A.1. In contrast with Bound 1 (Eq. 4), two IPMs are involved here to compare representations (\(\mathrm {INV}(\varphi )\) and \(\mathrm {TSF}(\varphi )\)). A new term, \(\varepsilon _T(\mathbf{f}_T\varphi )\), reflects the level of noise when fitting labels from representations. The whole trade-off between invariance and transferability is embodied in this term:

Proposition 2

Let \(\psi \) be a representation which is a richer feature extractor than \(\varphi \): \(\mathcal {F}\circ \varphi \subset \mathcal {F} \circ \psi \) and \(\mathcal {F}_C\circ \varphi \subset \mathcal {F}_C \circ \psi \). Then \(\varphi \) is more domain invariant than \(\psi \):

$$\begin{aligned} \mathrm {INV}(\varphi ) \le \mathrm {INV}(\psi ) \text{ while } \varepsilon _T (\mathbf{f}_T^\psi \psi ) \le \varepsilon _T (\mathbf{f}_T^\varphi \varphi ) \end{aligned}$$
(11)

where \(\mathbf{f}_T^\varphi (z) = \mathbb {E}_T[Y|\varphi (X) = z]\) and \(\mathbf{f}_T^\psi (z) = \mathbb {E}_T[Y|\psi (X)=z]\). Proof in Appendix A.2.

Bounding the target risk using IPMs has two advantages. First, it allows better control of the invariance/transferability trade-off since \(\varepsilon _T(\mathbf{f}_T\varphi ) \le \lambda _{\mathcal {G}}(\varphi )\). This is paid at the cost of \(4 \cdot \mathrm {INV}(\varphi ) \ge d_{\mathcal {G}}(\varphi )\) (see Proposition 7 in Appendix A.1). Second, \(\varepsilon _T(\mathbf{f}_T\varphi )\) is source free and indicates whether there is enough information in the representations to learn the task in the target domain in the first place. This means that \(\mathrm {TSF}(\varphi )\) is dedicated solely to controlling whether aligned representations have the same labels across domains. To illustrate the interest of our new transferability error, we provide a visualisation of representations (Fig. 1) trained to minimize the adaptability error \(\lambda _{\mathcal {G}}(\varphi ) \) from Bound 1 and the transferability error \(\mathrm {TSF}(\varphi )\) from Bound 2.

Fig. 1.

t-SNE [26] visualisation of representations trained to minimize (a) the adaptability error \(\lambda _{\mathcal {G}}(\varphi )\) from [5], (b) the transferability error \(\mathrm {TSF}(\varphi )\) introduced in the present work. The task used is A\(\rightarrow \)W of the Office31 dataset. Labels in the target domain are used during learning in this specific experiment. For both visualisations of representations, we observe well-separated clusters associated with the label classification task. Inside those clusters, we observe a separation between source and target representations for \(\lambda _{\mathcal {G}}(\varphi )\). This means that representations embed domain information and thus are not invariant. On the contrary, source and target representations overlap much more inside each cluster with \(\mathrm {TSF}(\varphi )\), illustrating that this new term does not conflict with invariance.

3.2 A Detailed View on the Property of Tightness

An interesting property of the bound, named tightness, is the case where \(\mathrm {INV}(\varphi )=0\) and \(\mathrm {TSF}(\varphi ) =0\) simultaneously. The tightness condition of the bound provides rich information on the properties of representations.

Proposition 3

\(\mathrm {INV}(\varphi ) = \mathrm {TSF}(\varphi ) = 0\) if and only if \( p_S(y,z) = p_T(y,z)\).

The proof is given in Appendix A.3. Two important points should be noted:

  1. \(\mathrm {INV}(\varphi ) = 0\) ensures that \(p_S(z) = p_T(z)\), using (A4). Similarly, \(\mathrm {TSF}(\varphi ) =0 \) leads to \(p_S(y,z)= p_T(y,z)\). Since \(p_S(y,z)= p_T(y,z)\) implies \(p_S(z)= p_T(z)\), \(\mathrm {INV}(\varphi )\) does not bring more substantial information about the distribution of representations than \(\mathrm {TSF}(\varphi )\). More precisely, one can show that \(\mathrm {TSF}(\varphi ) \ge \mathrm {INV}(\varphi )\) by noting that \(Y\cdot \mathbf{f}(Z) = f(Z)\) when \(\mathbf{f}(z) = (f(z), ..., f(z))\) for \(f\in \mathcal {F}\).

  2. The equality \(p_S(y,z)= p_T(y,z)\) also implies that \(p_S(y)=p_T(y)\). Therefore, in the context of label shift (when \(p_S(y) \ne p_T(y)\)), the transferability error cannot be null. This is a major hurdle since it is clearly established that most real-world UDA tasks exhibit some label shift. This bound highlights the fact that representation invariance alone cannot address UDA in complex settings such as label shift.

3.3 Reconciling Weights and Invariant Representations

Based on the observations from [20, 40] and following the line of work that proposes to relax invariance using weights [9, 36, 37, 38], we propose to adapt the bound by incorporating weights. More precisely, we study the effect of modifying the source distribution \(p_S(z)\) to a weighted source distribution \(w(z) p_S(z)\) where w is a positive function which verifies \(\mathbb {E}_S[w(Z)] = 1\). By replacing \(p_S(z)\) with \(w(z)p_S(z)\) (distribution referred to as \(w\cdot S\)) in Bound 2, we obtain a new bound of the target risk incorporating both weights and representations:

Bound 3

\(\forall g \in \mathcal {G}, \forall w: \mathcal {Z}\rightarrow \mathbb {R}^+\) such that \(\mathbb {E}_S[w(Z)]=1\):

$$\begin{aligned} \varepsilon _T(g\varphi ) \le \varepsilon _{w\cdot S}(g\varphi ) +6 \cdot \mathrm {INV}(w, \varphi ) + 2 \cdot \mathrm {TSF}(w, \varphi ) + \varepsilon _T(\mathbf{f}_T \varphi ) \end{aligned}$$

where \( \mathrm {INV}(w, \varphi ) := \sup _{f\in \mathcal {F}} \left\{ \mathbb {E}_T[f(Z)] - \mathbb {E}_{S}[w(Z)f(Z)] \right\} \) and \(\mathrm {TSF}(w, \varphi ) := \sup _{\mathbf{f}\in \mathcal {F}_{C}} \left\{ \mathbb {E}_T[Y \cdot \mathbf{f}(Z)] - \mathbb {E}_{S}[w(Z)\, Y \cdot \mathbf{f}(Z)] \right\} \).

As for the previous Bound 2, the property of tightness, i.e. when invariance and transferability are null simultaneously, leads to interesting observations:

Proposition 4

\(\mathrm {INV}(w,\varphi ) = \mathrm {TSF}(w,\varphi ) = 0\) if and only if \( w(z) = \frac{p_T(z)}{p_S(z)}\) and \(\mathbb {E}_T[Y|Z=z] = \mathbb {E}_S[Y|Z=z]\). The proof is given in Appendix A.4.

This proposition means that the nullity of the invariance error, i.e. \(\mathrm {INV}(w,\varphi ) = 0\), implies distribution alignment, i.e. \(w(z) p_S(z) = p_T(z)\). This is of strong interest since both representations and weights are involved in achieving domain invariance. The nullity of the transferability error, i.e. \(\mathrm {TSF}(w,\varphi ) =0\), implies that the labelling functions, \(\mathbf{f}:z \mapsto \mathbb {E}[Y|Z=z]\), are conserved across domains. Furthermore, the equality \(\mathbb {E}_T[Y|Z] = \mathbb {E}_S[Y|Z]\) interestingly resonates with a recent line of work called Invariant Risk Minimization (IRM) [2]. Incorporating weights in the bound thus brings two benefits:

  1. First, it lifts the inconsistency of invariant representations in the presence of label shift mentioned in Sect. 3.2. Indeed, tightness is no longer conflicting with label shift.

  2. \(\mathrm {INV}(w,\varphi )\) and \(\mathrm {TSF}(w, \varphi )\) have two distinct roles: the former promotes domain invariance of representations while the latter controls whether aligned representations share the same labels across domains.

4 The Role of Inductive Bias

Inductive bias refers to the set of assumptions which improves the generalization of a model trained on an empirical distribution. For instance, a specific neural network architecture or a well-suited regularization are prototypes of inductive biases. First, we provide a theoretical analysis of the role of inductive bias in addressing the lack of labelled data in the target domain (4.1), which is the most challenging part of Unsupervised Domain Adaptation. Second, we describe the effect of weights in inducing an invariance property on representations (4.2).

4.1 Inductive Design of a Classifier

General Formulation. Our strategy consists in approximating the target labels through a classifier \(\tilde{g} \in \mathcal {G}\). We refer to the latter as the inductive design of the classifier. Our proposition follows the intuitive idea that the best source classifier, \(g_S:= \arg \min _{g\in \mathcal {G}} \varepsilon _S(g\varphi )\), is not necessarily the best target classifier, i.e. \(g_S \ne \arg \min _{g\in \mathcal {G}} \varepsilon _T(g\varphi )\). For instance, a well-suited regularization in the target domain, noted \(\varOmega _T(g)\), may improve performance, i.e. setting \(\tilde{g} := \arg \min _{g\in \mathcal {G}} \varepsilon _S(g\varphi ) + \lambda \cdot \varOmega _T(g)\) may lead to \(\varepsilon _T(\tilde{g}\varphi ) \le \varepsilon _T(g_S\varphi )\). We formalize this idea through the following definition:

Definition 5

(Inductive design of a classifier). We say that there is an inductive design of a classifier at level \(0 < \beta \le 1\) if, for any representation \(\varphi \), noting \(g_S = \arg \min _{g \in \mathcal {G}} \varepsilon _S (g \varphi )\), we can determine \(\tilde{g}\) such that:

$$\begin{aligned} \varepsilon _T(\tilde{g}\varphi ) \le \beta \varepsilon _T(g_S\varphi ) \end{aligned}$$
(12)

We say the inductive design is \(\beta -\)strong when \(\beta <1\) and weak when \(\beta =1\).

In this definition, \(\beta \) does not depend on \(\varphi \), which is a strong assumption, and embodies the strength of the inductive design. The closer \(\beta \) is to 1, the less improvement we can expect from the inductive classifier \(\tilde{g}\). We now study the impact of the inductive design of a classifier on our previous Bound 3. Thus, we introduce the approximated transferability error:

$$\begin{aligned} \widehat{\mathrm {TSF}}(w, \varphi , \tilde{g}) = \sup _{\mathbf{f}\in \mathcal {F}_C} \{ \mathbb {E}_T[\tilde{g}(Z) \cdot \mathbf{f}(Z)] - \mathbb {E}_{S}[w(Z) Y \cdot \mathbf{f}(Z)] \} \end{aligned}$$
(13)

leading to a bound of the target risk where the transferability term is free of target labels:

Bound 4

(Inductive Bias and Guarantee). Let \(\varphi \in \varPhi \), \(w: \mathcal {Z}\rightarrow \mathbb {R}^+\) such that \(\mathbb {E}_S[w(Z)]=1\), \(\tilde{g}\) a \(\beta -\)strong inductive classifier, and \(\rho := \frac{\beta }{1- \beta }\). Then:

$$\begin{aligned} \varepsilon _T(\tilde{g}\varphi ) \le \rho \left( \varepsilon _{w\cdot S}(g_{w\cdot S}\varphi ) + 6\cdot \mathrm {INV}(w, \varphi ) + 2\cdot \widehat{\mathrm {TSF}}(w, \varphi , \tilde{g}) + \varepsilon _T(\mathbf{f}_T\varphi )\right) \end{aligned}$$
(14)

The proof is given in Appendix A.5. Here, the target labels are only involved in \(\varepsilon _T(\mathbf{f}_T\varphi )\) which reflects the level of noise when fitting labels from representations. Therefore, transferability is now free of target labels. This is an important result since the difficulty of UDA lies in the lack of labelled data in the target domain. It is also interesting to note that the weaker the inductive bias (\(\beta \rightarrow 1\)), the higher the bound and vice versa.

The Role of Predicted Labels. Predicted labels play an important role in UDA. In light of the inductive classifier, using them means that \(\tilde{g}\) is simply set to \(g_{w\cdot S}\). This is a weak inductive design (\(\beta =1\)); thus, the theoretical guarantee from Bound 4 is not applicable. However, there is empirical evidence that predicted labels help in UDA [15, 24]. It suggests that this inductive design may find some strength in the finite sample regime. A better understanding of this phenomenon is left for future work (see Appendix B). In the rest of the paper, we study this weak inductive bias by establishing connections between \(\widehat{\mathrm {TSF}}(w, \varphi , g_S)\) and popular approaches from the literature.

Connections with Conditional Domain Adaptation Network. CDAN [24] aims to align the joint distribution \((\hat{Y}, Z)\) across domains, where \(\hat{Y} = g_S \varphi (X)\) are estimated labels. It is performed by exposing the tensor product between \(\hat{Y}\) and Z to a discriminator. It leads to substantial empirical improvements compared to Domain Adversarial Neural Networks (DANN) [13]. This objective is similar to \(\widehat{\mathrm {TSF}}(w,\varphi , g_S)\) in the particular case where \(w(z)=1\).

Connections with Minimal Entropy. MinEnt [15] states that an adapted classifier is confident in its predictions on target samples. It suggests the regularization \(\varOmega _T(g) := H(\hat{Y}|Z) = \mathbb {E}_{Z \sim p_T}[-g(Z) \cdot \log g(Z)]\) where H is the entropy. If labels are smooth enough (i.e. there exists \(\alpha \) such that \(\frac{\alpha }{C-1} \le \mathbb {E}_S[\hat{Y} |Z] \le 1 -\alpha \)), MinEnt is a lower bound of the transferability: \(\widehat{\mathrm {TSF}}(w,\varphi , g_S) \ge \eta \left( H_T(g_S\varphi ) - \mathrm {CE}_{w\cdot S}(Y, g_S\varphi )\right) \) for some \(\eta >0\), where \(\mathrm {CE}_{w\cdot S}(Y, g_S\varphi )\) is the cross-entropy between \(g_S\varphi \) and Y on \(w(z) p_S(z)\) (see Appendix A.6).
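A minimal sketch of this regularizer, assuming `probs_t` holds the softmax outputs \(g(\varphi (x))\) of the network on a target batch:

```python
import torch

def minent_regularizer(probs_t, eps=1e-8):
    """MinEnt regularizer Omega_T(g) = H(Y_hat | Z) = E_T[-g(Z) . log g(Z)],
    estimated as the average prediction entropy over a target batch."""
    return (-probs_t * torch.log(probs_t + eps)).sum(dim=1).mean()
```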

4.2 Inductive Design of Weights

While the bounds introduced in the present work involve weights in the representation space, there is an abundant literature that builds weights in order to relax the domain invariance of representations [8, 9, 36, 37]. We study the effect of the inductive design of w on representations. To conduct the analysis, we consider a non-linear transformation \(\psi \) from \(\mathcal {Z}\) to \(\mathcal {Z}'\) and we assume that weights are computed in \(\mathcal {Z}'\), i.e. w is a function of \(z':= \psi (z) \in \mathcal {Z}'\). We refer to this as the inductive design of weights. For instance, in the particular case where \(\psi = g_S\), weights are designed as \(w(\hat{y}) = p_T(\hat{Y} = \hat{y}) / p_S(\hat{Y} = \hat{y})\) [9] where \(\hat{Y} = g_S\varphi (X)\). In [24], entropy conditioning is introduced by designing weights \( w(z') \propto 1+ e^{-z'}\) where \(z' = - \frac{1}{C}\sum _{1\le c \le C} g_{S,c} \log (g_{S,c})\) is the prediction entropy. The inductive design of weights imposes an invariance property on representations:

Proposition 6

(Inductive design of w and invariance). Let \(\psi : \mathcal {Z} \rightarrow \mathcal {Z}'\) such that \(\mathcal {F} \circ \psi \subset \mathcal {F}\) and \(\mathcal {F}_C\circ \psi \subset \mathcal {F}_C\). Let \(w:\mathcal {Z}' \rightarrow \mathbb {R}^+\) such that \(\mathbb {E}_S[w(Z')]=1\) and we note \(Z':= \psi (Z)\). Then, \(\mathrm {INV}(w, \varphi ) = \mathrm {TSF}(w,\varphi ) = 0\) if and only if:

$$\begin{aligned} w(z') = \frac{p_T(z')}{p_S(z')} ~~ \text{ and } ~~ p_S(z|z') = p_T(z|z') \end{aligned}$$
(15)

while both \(\mathbf{f}_S^\varphi = \mathbf{f}_T^\varphi \) and \(\mathbf{f}_S^\psi = \mathbf{f}_T^\psi \). The proof is given in Appendix A.7.

This proposition shows that the design of w has a significant impact on the property of domain invariance of representations. Furthermore, both labelling functions are conserved. In the rest of the paper we focus on weighting in the representation space which consists in:

$$\begin{aligned} w(z) = \frac{p_T(z)}{p_S(z)} \end{aligned}$$
(16)

Since it does not leverage any transformation \(\psi \) of the representations, we refer to this approach as a weak inductive design of weights. It is worth noting that this inductive design naturally controls the invariance error, i.e. \(\mathrm {INV}(w, \varphi ) =0 \).
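For concreteness, the two inductive designs of weights recalled at the beginning of this subsection, class-ratio weights in the spirit of [9] and entropy conditioning in the spirit of [24], can be sketched from softmax predictions on source and target batches. This is an illustrative reading of those formulas, not the authors' exact implementations.

```python
import torch

def class_ratio_weights(probs_s, probs_t, eps=1e-8):
    """w(y_hat) = p_T(Y_hat = y_hat) / p_S(Y_hat = y_hat), with psi = g_S:
    class frequencies are estimated from predicted probabilities on each batch
    and each source sample receives the ratio of its (soft) predicted class."""
    p_s = probs_s.mean(dim=0)            # predicted class frequencies, source
    p_t = probs_t.mean(dim=0)            # predicted class frequencies, target
    ratio = p_t / (p_s + eps)            # per-class ratio p_T / p_S
    w = probs_s @ ratio                  # soft per-sample weight
    return w / w.mean()                  # normalize so that E_S[w] = 1

def entropy_conditioning_weights(probs_s, eps=1e-8):
    """w(z') proportional to 1 + exp(-z'), with z' the (class-averaged)
    prediction entropy, following the formula recalled above."""
    c = probs_s.size(1)
    z_prime = (-probs_s * torch.log(probs_s + eps)).sum(dim=1) / c
    w = 1.0 + torch.exp(-z_prime)
    return w / w.mean()                  # normalize so that E_S[w] = 1
```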

5 Towards Robust Domain Adaptation

In this section, we expose a new learning procedure which relies on weak inductive design of both weights and the classifier. This procedure focuses on the transferability error since the inductive design of weights naturally controls the invariance error. Our learning procedure is then a bi-level optimization problem, named RUDA (Robust UDA):

$$\begin{aligned} \qquad \qquad \displaystyle { \left\{ \begin{array}{l} \varphi ^\star = \displaystyle {\arg \min _{\varphi \in \varPhi } } ~~ \varepsilon _{w(\varphi ) \cdot S} (g_{w\cdot S} \varphi ) + \lambda \cdot \widehat{\mathrm {TSF}}(w,\varphi , g_{w \cdot S} ) \\ \quad \,\,\text{ such } \text{ that } w(\varphi ) = \displaystyle {\arg \min _w} ~ \mathrm {INV}(w, \varphi ) \\ \end{array}\right. } \qquad \quad \mathrm{(RUDA)} \end{aligned}$$

where \(\lambda >0\) is a trade-off parameter. Two discriminators are involved here. The former is a domain discriminator d trained to output 1 on source representations and 0 on target representations by minimizing a domain adversarial loss:

$$\begin{aligned} \mathcal {L}_{\mathrm {INV}}(\theta _d |\theta _\varphi ) = \frac{1}{n_S} \sum _{i=1}^{n_S}- \log (d(z_{S,i})) + \frac{1}{n_T}\sum _{i=1}^{n_T} - \log (1-d(z_{T,i})) \end{aligned}$$
(17)

where \(\theta _d\) and \(\theta _\varphi \) are respectively the parameters of d and \(\varphi \), and \(n_S\) and \(n_T\) are respectively the number of samples in the source and target domains. Setting weights \(w_d(z) := (1-d(z)) / d(z)\) ensures that \(\mathrm {INV}(w,\varphi )\) is minimal (See Appendix C.2). The latter, noted \(\mathbf{d}\), maps representations to the label space \([0,1]^C\) in order to obtain a proxy of the transferability error expressed as a domain adversarial objective (See Appendix C.1):

$$\begin{aligned} \mathcal {L}_{\mathrm {TSF}}(\theta _{\varphi },\theta _{\mathbf{d}} |\theta _d, \theta _g) = \inf _{\mathbf{d}}&\left\{ \frac{1}{n_S} \sum _{i=1}^{n_S} - w_d(z_{S,i}) g(z_{S,i}) \cdot \log (\mathbf{d}(z_{S,i})) \right. \nonumber \\&~~~~ + \left. \frac{1}{n_T}\sum _{i=1}^{n_T} - g(z_{T,i}) \cdot \log (1-\mathbf{d} (z_{T,i})) \right\} \end{aligned}$$
(18)

where \(\theta _{\mathbf{d}}\) and \(\theta _g\) are respectively the parameters of \(\mathbf{d}\) and g. Furthermore, we use the cross-entropy loss in the weighted source domain for learning \(\theta _g\):

$$\begin{aligned} \mathcal {L}_{c}(\theta _g, \theta _\varphi |\theta _d) = \frac{1}{n_S}\sum _{i=1}^{n_S} - w_d (z_{S,i}) y_{S,i} \cdot \log (g(z_{S,i})) \end{aligned}$$
(19)

Finally, the optimization is then expressed as follows:

$$\begin{aligned} \left\{ \begin{array}{l} \theta _\varphi ^\star = \arg \min \nolimits _{\theta _\varphi } \mathcal {L}_{c}(\theta _g, \theta _\varphi |\theta _d) + \lambda \cdot \mathcal {L}_{\mathrm {TSF}}(\theta _{\varphi },\theta _{\mathbf{d}} |\theta _d, \theta _g) \\ \theta _g = \arg \min \nolimits _{\theta _g} \mathcal {L}_{c}(\theta _g, \theta _\varphi | \theta _d) \\ \theta _d = \arg \min \nolimits _{\theta _d} \mathcal {L}_{\mathrm {INV}}(\theta _d |\theta _\varphi ) \end{array} \right. \end{aligned}$$
(20)

Losses are minimized by stochastic gradient descent (SGD) where, in practice, \(\inf _{d}\) and \(\inf _{\mathbf{d}}\) are implemented with gradient reversal layers [13]. The trade-off parameter \(\lambda \) is pushed from 0 to 1 during training. We provide an implementation in PyTorch [29] based on [24]. The algorithmic procedure is described in Appendix C.5.
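The sketch below condenses one RUDA training step, assuming \(\varphi \), g, d and \(\mathbf{d}\) are standard PyTorch modules (d outputting a probability, \(\mathbf{d}\) a C-dimensional vector of probabilities), a single optimizer over all parameters, and the gradient reversal implemented as an autograd function as in [13]. Names and simplifications (detaching the classifier outputs used as conditioning, training d on detached features following Eq. 20) are ours, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer [13]: identity in the forward pass,
    gradient multiplied by -lam in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

def ruda_step(phi, g, d, d_lab, x_s, y_s, x_t, lam, opt, eps=1e-8):
    """One RUDA update. y_s is one-hot; lam is the trade-off parameter."""
    z_s, z_t = phi(x_s), phi(x_t)

    # L_INV (Eq. 17): trains the domain discriminator d only (features detached)
    p_s, p_t = d(z_s.detach()), d(z_t.detach())
    loss_inv = -(torch.log(p_s + eps).mean() + torch.log(1 - p_t + eps).mean())

    # Weights w_d(z) = (1 - d(z)) / d(z), treated as constants
    with torch.no_grad():
        d_s = d(z_s)
        w = ((1 - d_s) / (d_s + eps)).squeeze(-1)
        w = w / w.mean()                      # enforce E_S[w] = 1 on the batch

    g_s = F.softmax(g(z_s), dim=1)
    g_t = F.softmax(g(z_t), dim=1)

    # L_c (Eq. 19): weighted source cross-entropy
    loss_cls = -(w * (y_s * torch.log(g_s + eps)).sum(dim=1)).mean()

    # L_TSF (Eq. 18): the critic d_lab minimizes it; phi maximizes it through
    # the gradient reversal layer (lam is applied on the total loss below)
    q_s = d_lab(GradReverse.apply(z_s, 1.0))
    q_t = d_lab(GradReverse.apply(z_t, 1.0))
    loss_tsf = -((w.unsqueeze(1) * g_s.detach() * torch.log(q_s + eps)).sum(dim=1).mean()
                 + (g_t.detach() * torch.log(1 - q_t + eps)).sum(dim=1).mean())

    loss = loss_cls + lam * loss_tsf + loss_inv
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss.detach())
```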

6 Experiments

6.1 Setup

Datasets. We investigate two digits datasets, MNIST and USPS, with the transfer tasks MNIST to USPS (M\(\rightarrow \)U) and USPS to MNIST (U\(\rightarrow \)M). We used the standard train/test splits for training and evaluation. Office-31 is a dataset of images containing objects spread among 31 classes, captured in different domains: Amazon, DSLR camera and Webcam camera. DSLR and Webcam are very similar domains, but their images differ in exposure and quality.

Label Shifted Datasets. We stress-test our approach by investigating more challenging settings where the label distribution shifts strongly across domains. For the Digits datasets, we explore a wide variety of shifts by keeping only \(5\%\), \(10\%\), \(15\%\) and \(20\%\) of the digits between 0 and 5 of the original dataset (referred to as \(\% \times [0\sim 5]\)). We investigate the tasks U\(\rightarrow \)M and M\(\rightarrow \)U. For the Office-31 dataset, we explore the shift where the objects in classes 16 to 31 are duplicated 5 times (referred to as \(5\times [16 \sim 31]\)). Shifting the distribution in the source domain rather than in the target domain allows us to better appreciate the drop in performance in the target domain compared to the case where the source domain is not shifted.

Comparison with the State-of-the-Art. For all tasks, we report results from DANN [13] and CDAN [24]. To study the effect of weights, we name our method RUDA when weights are set to 1, and RUDA\(_w\) when weights are used. For the non-shifted datasets, we report a weighted version of CDAN (entropy conditioning, CDAN+E [24]). For the label shifted datasets, we report IWAN [38], a weighted DANN where weights are learned from a second discriminator, and CDAN\(_w\), a weighted CDAN where weights are added in the same way as for RUDA\(_w\).

Training Details. Models are trained for 20,000 iterations of SGD. We report end-of-training accuracy in the target domain averaged over five random seeds. The model for the Office-31 dataset uses a pretrained ResNet-50 [18]. We used the same hyper-parameters as [24], which were selected by importance weighted cross-validation [35]. The trade-off parameter \(\lambda \) is smoothly pushed from 0 to 1 as detailed in [24]. To prevent noisy weighting early in training, we used weight relaxation: based on the sigmoid output of the discriminator \(d(z) = \sigma (\tilde{d}(z))\), we used \(d_\tau (z) = \sigma (\tilde{d}(z)/\tau )\) and weights \(w(z) = (1 - d_\tau (z))/ d_{\tau }(z)\). \(\tau \) is decreased to 1 during training: \(\tau = \tau _{\min } + 2(\tau _{\max } - \tau _{\min }) /(1+ \exp (\alpha p))\) where \(\tau _{\max }= 5\), \(\tau _{\min } = 1\), and \(p \in [0,1]\) is the training progress. In all experiments, \(\alpha \) is set to 5 (except for \(5\%\times [0\sim 5]\) where \(\alpha =15\), see Appendix C.3 for more details).
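A minimal sketch of the weight relaxation described above, following these formulas (logit temperature \(\tau \) decreasing from \(\tau _{\max }\) to \(\tau _{\min }\) with the training progress p); variable names are illustrative.

```python
import math
import torch

def tau_schedule(p, tau_min=1.0, tau_max=5.0, alpha=5.0):
    """Temperature decreasing from tau_max (p = 0) to tau_min (p = 1),
    where p in [0, 1] is the training progress."""
    return tau_min + 2.0 * (tau_max - tau_min) / (1.0 + math.exp(alpha * p))

def relaxed_weights(d_logits, p, eps=1e-8, **schedule_kwargs):
    """Relaxed weights w(z) = (1 - d_tau(z)) / d_tau(z) with
    d_tau(z) = sigmoid(d_logits / tau); a large early tau keeps d_tau close
    to 1/2 and hence the weights close to 1."""
    tau = tau_schedule(p, **schedule_kwargs)
    d_tau = torch.sigmoid(d_logits / tau)
    return (1 - d_tau) / (d_tau + eps)
```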

Table 1. Accuracy (\(\%\)) on the Office-31 dataset.

6.2 Results

Unshifted Datasets. On both Office-31 (Table 1) and Digits (Table 2), RUDA performs similarly to CDAN. Simply performing the scalar product allows us to reach the results obtained by multi-linear conditioning [24]. This presents a second advantage: when domains exhibit a large number of classes, e.g. in Office-Home (see Appendix), our approach does not need to leverage a random layer. It is interesting to observe that we achieve performances close to CDAN+E on Office-31 while we do not use entropy conditioning. However, we observe a substantial drop in performance when adding weights, while still getting results comparable with CDAN on Office-31. This is a disappointing result since those datasets naturally exhibit label shift; one could expect weights to improve over the baselines. We did not observe this phenomenon on the standard benchmarks.

Table 2. Accuracy (\(\%\)) on the Digits dataset.

Label Shifted Datasets. We stress-tested our approach by applying strong label shifts to the datasets. First, we observe a drop in performance for all methods based on invariant representations compared with the situation without label shift. This is consistent with works that warn against the pitfalls of domain invariant representations in the presence of label shift [20, 40]. RUDA and CDAN perform similarly even in this setting. It is interesting to note that weights improve RUDA results significantly (+1.7% on Office-31 and +16.0% on Digits on average) while CDAN benefits less from them (+0.2% on Office-31 and +10.0% on Digits on average).

Should we Use Weights? To observe a significant benefit of weights, we had to explore situations with strong label shift, e.g. \(5\%\) and \(10\% \times [0 \sim 5]\) for the Digits dataset. Apart from these cases, weights bring a small gain (e.g. +1.7% on Office-31 for RUDA) or even marginally degrade adaptation. Understanding why RUDA and CDAN are able to address small label shift without weights is of great interest for the development of more robust UDA.

7 Related Work

This paper makes several contributions, both in terms of theory and algorithm. Concerning theory, our bound provides a risk suitable for domain adversarial learning with weighting strategies. Existing theories for non-overlapping supports [4, 27] and importance sampling [11, 30] explore neither the role of representations nor the adversarial learning aspect. In [5], an analysis of representations is conducted; its connections with our work are discussed throughout the paper. The work [20] is close to ours and introduces a distance which measures support overlap between source and target distributions under covariate shift. Our analysis does not rely on such an assumption, so its range of application is broader.

Concerning algorithms, covariate shift adaptation has been well studied in the literature [17, 19, 35]. Importance sampling to address label shift has also been investigated [34], notably with kernel mean matching [39] and Optimal Transport [31]. Recently, a scheme for estimating the label distribution ratio with consistency guarantees has been proposed [21]. Learning domain invariant representations has also been investigated following [13, 23]; approaches mainly differ in the metric chosen for comparing distributions of representations. For instance, metrics are domain adversarial (Jensen-Shannon divergence) [13, 24] or IPM-based, such as MMD [23, 25] or the Wasserstein distance [6, 32]. Our work provides a new theoretical support for these methods since our analysis is valid for any IPM.

Using both weights and representations is also an active topic, namely for Partial Domain Adaptation (PADA) [9], when target classes are a strict subset of the source classes, or Universal Domain Adaptation [37], when new classes may appear in the target domain. [9] uses a heuristic based on predicted labels for re-weighting representations. However, it assumes that a good classifier is available at first in order to obtain cycle consistent weights. [38] uses a second discriminator for learning weights, which is similar to [8]. Applying our framework to Partial DA and Universal DA is an interesting future direction. Our work shares strong connections with [10] (the authors were not aware of this work during the elaboration of the present paper), which uses the consistent estimation of the true label distribution from [21]. We suggest a very similar empirical evaluation and we also investigate the effect of weights on the CDAN loss [24], with a different weighting scheme since our approach computes weights in the representation space. All these works rely on an assumption at some level, e.g. Generalized Label Shift in [10], when designing weighting strategies. Our discussion on the role of inductive design of weights may provide a new theoretical support for these approaches.

8 Conclusion

The present work introduces a new bound of the target risk which unifies weights and representations in UDA. We conduct a theoretical analysis of the role of inductive bias when designing both the weights and the classifier. In light of this analysis, we propose a new learning procedure which leverages two weak inductive biases, respectively on the weights and the classifier. To the best of our knowledge, this procedure is original while being close to a straightforward hybridization of existing methods. We illustrate its effectiveness on two benchmarks. The empirical analysis shows that a weak inductive bias can make adaptation more robust even when stressed by strong label shift between the source and target domains. This work leaves room for an in-depth study of stronger inductive biases by providing both theoretical and empirical foundations.