1 Introduction

In the standard supervised machine learning setting the learner receives a set of labeled examples, known as the training set. However, very often we have additional information at hand that could be beneficial to the learning process. One such example is the use of unlabeled data drawn from the marginal distribution, which gives rise to the semi-supervised learning setting (Chapelle et al. 2006). Another example is when the training data comes from a related problem, as in multi-task learning (Caruana 1997), domain adaptation (Ben-David et al. 2010; Mansour et al. 2009), and transfer learning (Pan and Yang 2010; Taylor and Stone 2009). Other examples include the use of structural information, such as a taxonomy, different views on the same data (Blum and Mitchell 1998), or even privileged information (Vapnik and Vashist 2009; Sharmanska et al. 2013). In recent years all these directions have received considerable empirical and theoretical attention.

In this work we focus on a less theoretically studied use of supplementary information: learning with auxiliary hypotheses, that is, classifiers or regressors originating from other tasks. In particular, in addition to the training set we assume that the learner is supplied with a collection of hypotheses and their predictions on the training set itself. The goal of the learner is to figure out which hypotheses are helpful and use them to improve the prediction performance of the trained classifier. We will call these auxiliary hypotheses the source hypotheses and we will say that helpful ones accelerate the learning on the target task. We focus on the linear setting, that is, we train a linear classifier and the source hypotheses are used additively in the prediction process, weighted by arbitrary weights. This generalizes the setting in which the outputs of the source hypotheses are concatenated with the feature vector, a widely used heuristic (Bergamo and Torresani 2014; Li et al. 2010; Tommasi et al. 2014).

The scenario described above is related to Transfer Learning (TL) and Domain Adaptation (DA), that is, learning effectively from a possibly small amount of data by reusing prior knowledge (Thrun and Pratt 1998; Pan and Yang 2010; Taylor and Stone 2009; Ben-David et al. 2010). However, transferring from hypotheses offers an advantage over the TL and DA frameworks, which require access to the data of the source domain. For example, in DA (Ben-David et al. 2010), one employs large unlabeled samples to estimate the relatedness of source and target domains and perform the adaptation. Even if unlabeled data are abundant, the estimation of adaptation parameters can be computationally prohibitive. This is the case, for example, when a large number of domains is involved or when one acquires new domains incrementally.

A recently proposed setting, closer to the one we consider, is Hypothesis Transfer Learning (HTL) (Kuzborskij and Orabona 2013; Ben-David and Urner 2013), where the practical limitations of TL and DA are alleviated through indirect access to the source domain by means of a source hypothesis. Also, in the HTL setting there are no restrictions on how the source hypotheses can be used to boost the performance on the target task.

Although the setting considered in this paper has already been extensively exploited empirically (Yang et al. 2007; Orabona et al. 2009; Tommasi et al. 2010; Luo et al. 2011; Kuzborskij et al. 2013), its first theoretical treatment was given by Kuzborskij and Orabona (2013), where we analyzed a linear HTL algorithm that solves a regularized least-squares problem with a single fixed, unweighted, source hypothesis. We proved a polynomial generalization bound that depends on the performance of the fixed source hypothesis on the target task.

1.1 Our contributions

We extend the formulation of Kuzborskij and Orabona (2013) to a general regularized Empirical Risk Minimization (ERM) problem with any non-negative smooth loss function, not necessarily convex, and any strongly convex regularizer. We prove high-probability generalization bounds that exhibit a fast rate of convergence, that is \(\mathcal {O}(1/m)\), whenever any weighted combination of multiple source hypotheses performs well on the target task. In addition, we show that, if the combination is perfect, the error on the training set becomes deterministically equal to the generalization error. Furthermore, we analyze the excess risk of our formulation, and conclude that a good source hypothesis also speeds up the convergence to the performance of the best in the class. As a byproduct of our study, we prove an upper bound on the Rademacher complexity of a smooth loss class that provides extra information compared to that of Lipschitz loss classes. Our analysis, which might be of independent interest, is an alternative to the analysis of Srebro et al. (2010) and holds under much weaker assumptions.

The rest of the paper is organized as follows. In the next section we briefly review previous work. After introducing the necessary definitions in Sect. 3, we formally state our formulation in Sect. 4 and present the main results in Sect. 5. In Sect. 5.1 we discuss the implications and compare them to the literature on learning with fast rates and transfer learning. Next, in Sect. 6, we present the proofs of our main results. Section 7 concludes the paper.

2 Related work

Kuzborskij and Orabona (2013) showed that the generalization ability of the regularized least-squares HTL algorithm improves if the supplied source hypothesis performs well on the target task. More specifically, we proposed a key criterion, the risk of the source hypothesis on the target domain, that captures the relatedness of the source and target domains. Later, Ben-David and Urner (2013) showed a similar bound, but with a different quantity capturing the relatedness between source and target. Instead of considering a general source hypothesis, they confined their analysis to the linear hypothesis class. This allowed them to show that the target hypothesis generalizes better when it is close to a good source hypothesis. From this perspective it is easy to interpret the source hypothesis as an initialization point in the hypothesis class. Naturally, given a starting position that is close to the best in the class, one generalizes well.

Prior to these works, a few studies tried to understand learning with auxiliary hypotheses under different conditions. Li and Bilmes (2007) analyzed a Bayesian approach to HTL. Employing a PAC-Bayes analysis, they showed that, given a prior on the hypothesis class, the generalization ability of logistic regression improves if the prior is informative on the target task. Mansour et al. (2008) analyzed a setting where multiple source hypotheses are combined. There, in addition to the source hypotheses, the learner receives unlabeled samples drawn from the source distributions, which are used to weight and combine these source hypotheses. They studied the possibility of learning in such a scenario; however, they did not address the generalization properties of any particular algorithm.

Unlike these works, we focus on the generalization ability of a large family of HTL algorithms that generate the target predictor given a set of multiple source hypotheses. In particular, we analyze Regularized Empirical Risk Minimization with any non-negative smooth loss and any strongly convex regularizer. Thus our analysis covers a wide range of algorithms, explaining their empirical success. One category of those, prevalent in computer vision (Kienzle and Chellapilla 2006; Yang et al. 2007; Tommasi et al. 2010; Aytar and Zisserman 2011; Kuzborskij et al. 2013; Tommasi et al. 2014), employs the principle of biased regularization (Schölkopf et al. 2001). For example, instead of penalizing large weights by introducing the term \(\Vert \mathbf {w}\Vert ^2\) into the objective function, one enforces the weights to be close to some “prior” model through the term \(\Vert \mathbf {w}- \mathbf {w}^{\text {prior}}\Vert ^2\). This principle has also found application in other fields, such as NLP (Daumé III 2007; Daumé III et al. 2010) and electromyography classification (Orabona et al. 2009; Tommasi et al. 2013). Many empirical works have also investigated the use of the source hypotheses in a “black box” sense, sometimes not even posing the problem as transfer learning (Duan et al. 2009; Li et al. 2010; Luo et al. 2011; Bergamo and Torresani 2014), and recently in conjunction with deep neural networks (Oquab et al. 2014).

In the literature there are several other machine learning directions conceptually similar to the one we consider in this work. Arguably, the most well known is the Domain Adaptation (DA) problem. The standard machine learning assumption is that the training and the testing sets are sampled from the same probability distribution. In such a case, we expect that a hypothesis generated by the learner from the training set will lead to sensible predictions on the testing set. The difficulty arises when training and testing distributions differ, that is, we have a training set sampled from the source domain and a testing set from the target domain. Clearly, the hypothesis generated from the source domain can perform arbitrarily badly on the target one. The paradigm of DA, which addresses this issue, has received a lot of attention in recent years (Ben-David et al. 2010; Mansour et al. 2009). Although this framework is different from the one we study in this work, we identify similarities and compare our findings with the theory of learning from different domains in Sect. 5.2.

3 Definitions

In this section we introduce the definitions used in the rest of the paper.

We denote random variables by capital letters. The expected value of a random variable distributed according to a probability distribution \(\mathcal {D}\) is denoted by \({{\mathrm{\mathbb {E}}}}_{X \sim \mathcal {D}}[X]\) and the variance is denoted by \(\mathrm {Var}_{X \sim \mathcal {D}}[X]\). Small and capital bold letters stand for vectors and matrices, respectively, e.g. \(\mathbf {x}= [x_1, \ldots , x_d]^{\top }\) and \(\mathbf {A}\in \mathbb {R}^{d_1 \times d_2}\).

Denoting by \(\mathcal {X}\) and \(\mathcal {Y}\) respectively the input and output space of the learning problem, the training set is \(S=\{(\mathbf {x}_i,y_i)\}_{i=1}^m\), drawn i.i.d. from the probability distribution \(\mathcal {D}\) defined over \(\mathcal {X}\times \mathcal {Y}\). Without loss of generality we take \(\mathcal {X}= \{\mathbf {x}: \Vert \mathbf {x}\Vert \le 1\}\) and we focus on problems where \(\mathcal {Y}= [-C, C]\).

To measure the accuracy of a learning algorithm, we introduce a non-negative loss function \(\ell (h(\mathbf {x}), y)\), which measures the cost incurred by predicting \(h(\mathbf {x})\) instead of \(y\). The risk of a hypothesis h, with respect to a probability distribution \(\mathcal {D}\), and the empirical risk measured on the sample S are then defined as

$$\begin{aligned} R(h) := \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{(\mathbf {x},y) \sim \mathcal {D}}[\ell (h(\mathbf {x}), y)],\quad \text { and } \quad \hat{R}_S(h) := \frac{1}{m} \sum _{i=1}^m \ell (h(\mathbf {x}_i), y_i). \end{aligned}$$

In the following, the risk is measured with respect to the probability distribution of the target domain, unless stated otherwise. We capture the smoothness of the loss function via the following definition.

3.1 H-smooth loss function

We say that a non-negative loss function \(\ell : \mathcal {Y}\times \mathcal {Y}\mapsto \mathbb {R}_+\) is H-smooth iff

$$\begin{aligned} \forall t,r \in \mathbb {R}, \forall y \in \mathcal {Y}, ~ |\nabla _t \ell (t, y) - \nabla _r \ell (r, y)| \le H |t - r|. \end{aligned}$$
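For instance, the square loss \(\ell (t, y) = (t - y)^2\) is 2-smooth, since

$$\begin{aligned} |\nabla _t \ell (t, y) - \nabla _r \ell (r, y)| = |2(t - y) - 2(r - y)| = 2|t - r|. \end{aligned}$$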

In this work we will make use of strongly convex regularizers, functions that are defined as follows.

3.2 Strongly convex function

A function \(\varOmega \) is \(\sigma \)-strongly convex w.r.t. a norm \(\Vert \cdot \Vert \) iff for all \(\mathbf {w}, \mathbf {v}\), and \(\alpha \in (0, 1)\) we have

$$\begin{aligned} \varOmega (\alpha \mathbf {w}+ (1 - \alpha ) \mathbf {v}) \le \alpha \varOmega (\mathbf {w}) + (1-\alpha ) \varOmega (\mathbf {v}) - \frac{\sigma }{2} \alpha (1-\alpha ) \Vert \mathbf {w}- \mathbf {v}\Vert ^2. \end{aligned}$$
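For example, \(\varOmega (\mathbf {w}) = \frac{1}{2}\Vert \mathbf {w}\Vert _2^2\) is 1-strongly convex w.r.t. \(\Vert \cdot \Vert _2\), since expanding the squared norms gives

$$\begin{aligned} \alpha \varOmega (\mathbf {w}) + (1-\alpha ) \varOmega (\mathbf {v}) - \varOmega (\alpha \mathbf {w}+ (1 - \alpha ) \mathbf {v}) = \frac{1}{2} \alpha (1-\alpha ) \Vert \mathbf {w}- \mathbf {v}\Vert _2^2. \end{aligned}$$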

We will quantify the complexity of a hypothesis class by means of the Rademacher complexity (Bartlett and Mendelson 2003). In particular, the empirical Rademacher complexity of the hypothesis class \(\mathcal {H}\) measured on the sample S and its expectation are defined as

$$\begin{aligned} \hat{\mathfrak {R}}_S(\mathcal {H}) := \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{\varvec{\varepsilon }} \left[ \sup _{h \in \mathcal {H}} \frac{1}{m} \sum _{i=1}^m \varepsilon _i h(\mathbf {x}_i) \right] \quad \text { and } \quad \mathfrak {R}(\mathcal {H}) := \mathop {{{\mathrm{\mathbb {E}}}}}\limits _S\left[ \hat{\mathfrak {R}}_S(\mathcal {H}) \right] . \end{aligned}$$

Here, \(\varepsilon _i\) is a random variable such that \(\mathbb {P}(\varepsilon _i=1) = \mathbb {P}(\varepsilon _i=-1) = \frac{1}{2}\). As in the case of the risk, the Rademacher complexity is measured with respect to the probability distribution of the target domain, unless stated otherwise.
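To make the definition concrete, the following minimal Python sketch (ours, purely illustrative) estimates the empirical Rademacher complexity of the bounded linear class \(\{\mathbf {x}\mapsto \left\langle \mathbf {w},\mathbf {x} \right\rangle : \Vert \mathbf {w}\Vert _2 \le B\}\) by Monte Carlo, exploiting the closed form \(\sup _{\Vert \mathbf {w}\Vert _2 \le B} \frac{1}{m}\sum _i \varepsilon _i \left\langle \mathbf {w},\mathbf {x}_i \right\rangle = \frac{B}{m} \Vert \sum _i \varepsilon _i \mathbf {x}_i\Vert _2\).

```python
import numpy as np

def empirical_rademacher_linear(X, B=1.0, n_draws=1000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of the
    class {x -> <w, x> : ||w||_2 <= B} on the rows of X; the inner supremum
    has the closed form (B/m) * ||sum_i eps_i x_i||_2."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    vals = [B / m * np.linalg.norm(rng.choice([-1.0, 1.0], size=m) @ X)
            for _ in range(n_draws)]
    return float(np.mean(vals))

# Toy usage: the estimate decays roughly as O(B / sqrt(m)).
X = np.random.default_rng(1).normal(size=(200, 5))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # enforce ||x|| <= 1
print(empirical_rademacher_linear(X))
```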

4 Transferring from auxiliary hypotheses

In the following we will capture and generalize many transfer learning formulations that employ a collection of given source hypotheses \(\{h^{\text {src}}_i : \mathcal {X}\mapsto \mathcal {Y}\}_{i=1}^n\) within the framework of Regularized Empirical Risk Minimization (ERM). These problems typically involve a criterion for selecting and combining source hypotheses with the goal of increasing performance on the target task (Yang et al. 2007; Tommasi et al. 2014; Kuzborskij et al. 2015). Indeed, some source hypotheses might come from tasks similar to the target task, and the goal of an algorithm is to select only the relevant ones. In this work we will consider the source combination

$$\begin{aligned} h^{\text {src}}_{\varvec{\beta }}(\mathbf {x}) := \sum _{i=1}^n \beta _i h^{\text {src}}_i(\mathbf {x}), \end{aligned}$$

and target hypothesis

$$\begin{aligned} h_{\mathbf {w}, \varvec{\beta }}(\mathbf {x}) := \left\langle \mathbf {w},\mathbf {x} \right\rangle + h^{\text {src}}_{\varvec{\beta }}(\mathbf {x}), \end{aligned}$$
(1)

with the relevance of the sources characterized by the parameter \(\varvec{\beta }\in \mathbb {R}^n\). We will focus on Regularized ERM formulations with any non-negative smooth loss function and any strongly convex regularizer. This places our problem within a class of problems that can be solved efficiently, yet is endowed with interesting properties.
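As a concrete illustration, here is a minimal sketch (ours) of the target hypothesis (1), treating the source hypotheses as black-box callables:

```python
import numpy as np

def target_predict(X, w, beta, source_hyps):
    """Evaluate h_{w,beta}(x) = <w, x> + sum_i beta_i * h_i^src(x) on the rows
    of X. source_hyps is a list of black-box callables, each mapping an
    (m, d) array to an (m,) array of predictions."""
    H = np.column_stack([h(X) for h in source_hyps])  # (m, n) source predictions
    return X @ w + H @ beta
```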

4.1 Regularized ERM for transferring from auxiliary hypotheses

Let \(\ell : \mathcal {Y}\times \mathcal {Y}\mapsto \mathbb {R}_+\) be an H-smooth loss function and let \(\varOmega : \mathcal {H}\mapsto \mathbb {R}_+\) be a \(\sigma \)-strongly convex function w.r.t. a norm \(\Vert \cdot \Vert \). Given the target training set \(S = \{(\mathbf {x}_i, y_i)\}_{i=1}^m\), \(\lambda \in \mathbb {R}_+\), source hypotheses \(\{h^{\text {src}}_i\}_{i=1}^n\), and parameters \(\varvec{\beta }\) obeying \(\varOmega (\varvec{\beta }) \le \rho \), the algorithm generates the target hypothesis \(h_{\hat{\mathbf {w}}, \varvec{\beta }}\), such that

$$\begin{aligned} \hat{\mathbf {w}}= \mathop {{{\mathrm{argmin}}}}\limits _{\mathbf {w}\in \mathcal {H}}\left\{ \frac{1}{m} \sum _{i=1}^m \ell \left( \left\langle \mathbf {w}, \mathbf {x}_i \right\rangle + h^{\text {src}}_{\varvec{\beta }}(\mathbf {x}_i), y_i\right) + \lambda \varOmega (\mathbf {w})\right\} . \end{aligned}$$
(2)

Note that (2) is minimized only w.r.t. \(\mathbf {w}\), that is, we do not analyze any particular algorithm that searches for the optimal weights of the source hypotheses. However, we assume that \(\varOmega (\varvec{\beta }) \le \rho \), that is, we constrain \(\varvec{\beta }\) through a strongly convex function. Thus, we cover regularized algorithms generating \(\varvec{\beta }\), which include most of the empirical work in this field, as well as potential new algorithms.
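For concreteness, the following is a minimal sketch (ours) of problem (2) for the square loss and the regularizer \(\varOmega (\mathbf {w}) = \Vert \mathbf {w}\Vert _2^2\), solved by plain gradient descent; the step size and iteration count are illustrative choices.

```python
import numpy as np

def regularized_erm(X, y, src_pred, lam=0.1, lr=0.1, n_iter=500):
    """Solve (2) with the square loss and Omega(w) = ||w||_2^2 by gradient
    descent. src_pred holds the fixed source-combination predictions
    h_beta^src(x_i) on the training points."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        residual = X @ w + src_pred - y            # h_{w,beta}(x_i) - y_i
        grad = 2.0 / m * X.T @ residual + 2.0 * lam * w
        w -= lr * grad
    return w
```

For this particular loss and regularizer a closed-form solution also exists; gradient descent is shown only because it applies to any smooth loss covered by the analysis.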

In the following we will pay special attention to a quantity that captures the performance of the source hypothesis combination \(h^{\text {src}}_{\varvec{\beta }}\) on the target domain,

$$\begin{aligned} R^{\text {src}}:= R(h^{\text {src}}_{\varvec{\beta }}). \end{aligned}$$

Our analysis will focus on the generalization properties of \(h_{\hat{\mathbf {w}}, \varvec{\beta }}\). In particular, our main goal will be to understand the impact of the source hypothesis combination on the performance of the target hypothesis. In our analysis we will discuss various regimes of interest, for example, perfect and arbitrarily bad source hypotheses. Our discussion will cover scenarios where the auxiliary hypotheses accelerate the learning and the conditions under which we can provably expect perfect generalization. Finally, we will consider the consistency of the algorithm (2) and pinpoint conditions under which we achieve faster convergence to the performance of the best in the class.

One special case covered by our analysis, commonly applied in transfer learning, is biased regularization (Schölkopf et al. 2001). Consider the following least-squares-based algorithm.

4.2 Least-squares with biased regularization

Given the target training set \(S = \{(\mathbf {x}_i, y_i)\}_{i=1}^m\), source hypotheses \(\{\mathbf {w}^{\text {src}}_i\}_{i=1}^n \subset \mathcal {H}\) collected as the columns of the matrix \(\mathbf {W}^{\text {src}}\), parameters \(\varvec{\beta }\in \mathbb {R}^n\) and \(\lambda \in \mathbb {R}_+\), the algorithm generates the target hypothesis \(h(\mathbf {x}) = \left\langle \hat{\mathbf {w}}, \mathbf {x} \right\rangle \), where

$$\begin{aligned} \hat{\mathbf {w}}= \mathop {{{\mathrm{argmin}}}}\limits _{\mathbf {w}\in \mathcal {H}}\left\{ \frac{1}{m} \sum _{i=1}^m \left( \left\langle \mathbf {w}, \mathbf {x}_i \right\rangle - y_i\right) ^2 + \lambda \left\| \mathbf {w}- \mathbf {W}^{\text {src}}\varvec{\beta }\right\| _2^2\right\} . \end{aligned}$$
(3)

This problem has a simple intuitive interpretation: minimize the training error on the target training set while keeping the solution close to the linear combination of the source hypotheses. One can naturally arrive at (3) from a probabilistic perspective: the solution \(\hat{\mathbf {w}}\) is a maximum a posteriori estimate when the conditional distribution is Gaussian and the prior is a \(\mathbf {W}^{\text {src}}\varvec{\beta }\)-mean, \(\frac{1}{\lambda } \mathbf {I}\)-covariance Gaussian distribution. Even though biased regularization is a simple idea, it has found success in a plethora of transfer learning applications, ranging from computer vision (Kienzle and Chellapilla 2006; Yang et al. 2007; Tommasi et al. 2010; Aytar and Zisserman 2011; Kuzborskij et al. 2013; Tommasi et al. 2014) to NLP (Daumé III 2007) and electromyography classification (Orabona et al. 2009; Tommasi et al. 2013).
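To make this correspondence concrete (a sketch, assuming Gaussian noise of variance \(\sigma _n^2\)), the negative log-posterior of \(\mathbf {w}\) is

$$\begin{aligned} -\log p(\mathbf {w}\mid S) = \frac{1}{2 \sigma _n^2} \sum _{i=1}^m \left( \left\langle \mathbf {w}, \mathbf {x}_i \right\rangle - y_i\right) ^2 + \frac{\lambda }{2} \left\| \mathbf {w}- \mathbf {W}^{\text {src}}\varvec{\beta }\right\| _2^2 + \text {const}, \end{aligned}$$

whose minimizer coincides with the solution of (3) once the regularization parameter is rescaled to \(\lambda \sigma _n^2 / m\).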

Claim

Least-Squares with Biased Regularization is a special case of the Regularized ERM in (2).

Proof

Introduce \(\mathbf {w}'\), such that \(\mathbf {w}' = \mathbf {w}- \mathbf {W}^{\text {src}}\varvec{\beta }\). Then problem (3) is equivalent to

$$\begin{aligned} \min _{\mathbf {w}' \in \mathcal {H}}\left\{ \frac{1}{m} \sum _{i=1}^m \left( \left\langle \mathbf {w}' + \mathbf {W}^{\text {src}}\varvec{\beta }, \mathbf {x}_i \right\rangle - y_i\right) ^2 + \lambda \left\| \mathbf {w}' \right\| _2^2\right\} , \end{aligned}$$

which in turn is a special version of (2) when \(h^{\text {src}}_i(\mathbf {x}) = \left\langle \mathbf {w}^{\text {src}}_i, \mathbf {x} \right\rangle \), we use the square loss, and \(\Vert \cdot \Vert _2^2\) as regularizer. \(\square \)

Albeit practically appealing, the formulation (3) is limited by the fact that the source hypotheses must be linear predictors living in the same space as the target predictor. Instead, the formulation in (2) naturally generalizes biased regularization, allowing us to treat the source hypotheses as “black box” predictors.
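The equivalence stated in the Claim can also be checked numerically. The following sketch (with synthetic data of our own choosing) solves (3) both directly and through the change of variables \(\mathbf {w}' = \mathbf {w}- \mathbf {W}^{\text {src}}\varvec{\beta }\):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, lam = 50, 4, 0.5
X = rng.normal(size=(m, d))
y = rng.normal(size=m)
b = rng.normal(size=d)  # stands in for W_src @ beta, the combined source predictor

# Normal equations of (3): (X^T X / m + lam * I) w = X^T y / m + lam * b
A = X.T @ X / m + lam * np.eye(d)
w_direct = np.linalg.solve(A, X.T @ y / m + lam * b)

# Shifted problem: solve for w' with targets y - X b, then undo the shift w = w' + b
w_shift = np.linalg.solve(A, X.T @ (y - X @ b) / m) + b

print(np.allclose(w_direct, w_shift))  # True: the two formulations coincide
```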

5 Main results

In this section, we present the main results of this work: generalization and excess risk bounds for the Regularized ERM. In the next section we discuss in detail the implications of these results, while we defer the proofs to the subsequent sections.

The first bound demonstrates the utility of a perfect combination of source hypotheses, while the second lets us observe the dependency on an arbitrary combination. In particular, the first bound makes explicit the intuition that, given a perfect source hypothesis, learning is not required. In other words, when \(R^{\text {src}}=0\) we have that the empirical risk becomes equal to the risk with probability one.

Theorem 1

Let \(h_{\hat{\mathbf {w}}, \varvec{\beta }}\) be generated by Regularized ERM, given an m-sized training set S sampled i.i.d. from the target domain, source hypotheses \(\{h^{\text {src}}_i : \Vert h^{\text {src}}_i\Vert _\infty \le 1 \}_{i=1}^n\), any source weights \(\varvec{\beta }\) obeying \(\varOmega (\varvec{\beta }) \le \rho \), and \(\lambda \in \mathbb {R}_+\). Assume that \(\ell (h_{\hat{\mathbf {w}}, \varvec{\beta }}(\mathbf {x}), y) \le M\) for any \((\mathbf {x}, y)\) and any training set. Then, denoting \(\kappa = \frac{H}{\sigma }\) and assuming that \(\lambda \le \kappa \), we have with probability at least \(1 - e^{-\eta }, \ \forall \eta \ge 0\)

$$\begin{aligned} R(h_{\hat{\mathbf {w}}, \varvec{\beta }})&\le \hat{R}_S(h_{\hat{\mathbf {w}}, \varvec{\beta }}) + \mathcal {O}\left( \frac{R^{\text {src}}\kappa }{\sqrt{m} \lambda } + \sqrt{\frac{R^{\text {src}}\rho \kappa ^2}{m \lambda }} + \frac{M \eta }{m \log \left( 1 + \sqrt{\frac{M \eta }{u^{\text {src}}}}\right) } \right) \end{aligned}$$
(4)
$$\begin{aligned}&\le \hat{R}_S(h_{\hat{\mathbf {w}}, \varvec{\beta }}) + \mathcal {O}\left( \frac{\kappa }{\sqrt{m}} \left( \frac{R^{\text {src}}}{\lambda } + \sqrt{\frac{R^{\text {src}}\rho }{\lambda }} \right) + \frac{\kappa }{m} \left( \frac{\sqrt{R^{\text {src}}M \eta }}{\lambda } + \sqrt{\frac{\rho }{\lambda }} \right) \right) , \end{aligned}$$
(5)

where \(u^{\text {src}}= R^{\text {src}}\left( m + \frac{\kappa \sqrt{m}}{\lambda } \right) + \kappa \sqrt{\frac{R^{\text {src}}m \rho }{\lambda }}\).

Now we focus on the consistency of the HTL algorithm. Specifically, we show an upper bound on the excess risk of the Regularized ERM, which depends on \(R^{\text {src}}\), that is, the risk of the combined source hypothesis \(h^{\text {src}}_{\varvec{\beta }}\) on the target domain. We observe that for a small \(R^{\text {src}}\), the excess risk shrinks at the fast rate of \(\mathcal {O}(1/m)\). In other words, good prior knowledge guarantees not only good generalization, but also fast recovery of the performance of the best hypothesis in the class.

This bound is similar in spirit to results on localized complexities, as in the works of Bartlett et al. (2005) and Srebro et al. (2010); however, we focus on the linear HTL scenario rather than a generic learning setting. Later, in Sect. 5.1, we compare our bounds to these works and show that our analysis achieves superior results.

Theorem 2

Let \(h_{\hat{\mathbf {w}}, \varvec{\beta }}\) be generated by Regularized ERM, given the m-sized training set S sampled i.i.d. from the target domain, source hypotheses \(\{h^{\text {src}}_i : \Vert h^{\text {src}}_i\Vert _\infty \le 1\}_{i=1}^n\), any source weights \(\varvec{\beta }\) obeying \(\varOmega (\varvec{\beta }) \le \rho \), and \(\lambda \in \mathbb {R}_+\). Then, denoting \(\kappa = \frac{H}{\sigma }\), assuming that \(\lambda \le \kappa \le 1\), and setting the regularization parameter

$$\begin{aligned} \lambda = \mathcal {O}\left( \sqrt{\frac{\kappa }{\tau } \frac{R^{\text {src}}+ \sqrt{R^{\text {src}}\rho }}{\sqrt{m}}+ \frac{\sqrt{\kappa }}{\tau } \sqrt{\frac{R^{\text {src}}+ \sqrt{R^{\text {src}}\rho }}{m^{1.5}}} } \right) , \end{aligned}$$

for any choice of \(\tau \ge 0\), we have with high probability that

$$\begin{aligned}&R(h_{\hat{\mathbf {w}}, \varvec{\beta }}) - \min _{\varOmega (\mathbf {w}) \le \tau }R(h_{\mathbf {w}, \varvec{\beta }})\\&\quad = \mathcal {O}\left( \frac{\sqrt{R^{\text {src}}} + \root 4 \of {R^{\text {src}}\rho }}{\root 4 \of {m}} \sqrt{\kappa \tau } + \frac{\root 4 \of {R^{\text {src}}} + \root 8 \of {R^{\text {src}}\rho }}{\root 4 \of {m^{1.5}}} \root 4 \of {\kappa \tau ^2} + \sqrt{\frac{R^{\text {src}}}{m}} + \frac{1}{m} \right) . \end{aligned}$$

5.1 Implications

We start by discussing the effect of the source hypothesis combination on the generalization ability. Intuitively, a good source hypothesis combination should facilitate transfer learning, while a reasonable algorithm must not fail if we provide it with a bad one. That said, a natural question to ask here is: what makes a good or bad source hypothesis? As in previous works in transfer learning and domain adaptation, we capture this notion via a quantity that has a twofold interpretation: (1) the performance of the source hypothesis combination on the target domain; (2) the relatedness of the source and target domains. In the theorems presented above we denoted it by \(R^{\text {src}}\), that is, the risk of the source hypothesis combination on the target domain. In this section we will consider various regimes of interest with respect to \(R^{\text {src}}\).

5.1.1 When the source is a bad fit

First consider the case when the source hypothesis combination \(h^{\text {src}}_{\varvec{\beta }}\) is useless for the purpose of transfer learning, for example, \(h^{\text {src}}_{\varvec{\beta }}(\mathbf {x}) = 0\) for all \(\mathbf {x}\). This corresponds to learning with no auxiliary information. Then we can assume that \(R^{\text {src}}\le M\), and from Theorem 1 we obtain \( R(h_{\hat{\mathbf {w}}}) - \hat{R}_S(h_{\hat{\mathbf {w}}}) \le \mathcal {O}\left( 1/ (\sqrt{m} \lambda ) \right) \). This rate matches the one in the analysis of regularized least-squares (Vito et al. 2005; Bousquet and Elisseeff 2002), whose loss is a special case of the smooth losses that the Regularized ERM may employ. On the other hand, Srebro et al. (2010) showed a better worst-case rate of \(\mathcal {O}(1/\sqrt{m \lambda })\). However, their framework builds upon a worst-case Rademacher complexity which does not involve the expectation over the sample and does not lead to the dependency on \(R^{\text {src}}\) we have obtained in Theorem 1. We will discuss this problem in detail later.

5.1.2 When the source is a good fit

Here we consider the behavior of the algorithm in the finite-sample and asymptotic scenarios. We first look at the regime of small m, in particular \(m = \mathcal {O}(1/R^{\text {src}})\). In this case, the fast-rate term will dominate the bound, and we obtain the convergence rate of \(\mathcal {O}( \sqrt{\rho } / (m \sqrt{\lambda }) )\). In other words, we can expect faster convergence when m is small, where “small” depends on \(R^{\text {src}}\), the quality of the combined source hypotheses. Now consider the asymptotic behavior of the algorithm, when m goes to infinity. In that case, the algorithm exhibits a rate of \(\mathcal {O}\left( R^{\text {src}}/ \sqrt{m} \lambda + \sqrt{(R^{\text {src}}\rho ) / m \lambda }\right) \), so \(R^{\text {src}}\) controls the constant factor of the rate. Hence, the quantity \(R^{\text {src}}\) governs both the transient regime for small m and the asymptotic behavior of the algorithm, predicting faster convergence in both regimes when it is small.

5.1.3 When the source is a perfect fit

It is conceivable that the exploited source hypothesis is a perfect one, that is, \(R^{\text {src}}= 0\). In other words, the source hypothesis combination is a perfect predictor for the target domain. Theorem 1 then implies that \(R(h_{\hat{\mathbf {w}}, \varvec{\beta }}) = \hat{R}_S(h_{\hat{\mathbf {w}}, \varvec{\beta }})\) with probability one. We note that for many practically used smooth losses, such as the square loss, this setting is realistic only if the source and target domains match and the problem is noise-free. However, we can observe \(R^{\text {src}}= 0\), for example, when the squared hinge loss, \(\ell (z,y) = \max \{0, 1 - zy\}^2\), is used and all target domain examples are classified correctly by the source hypothesis combination, a case that is not unthinkable for related domains.

5.1.4 Fast rates

There are a number of works in the literature investigating rates of convergence faster than \(1/\sqrt{m}\) subject to different conditions. In particular, the localized Rademacher complexity bounds of Bartlett et al. (2005) and Bousquet (2002) can be used to obtain results similar to the second inequality of Theorem 1. Indeed, Theorem 4 shows a bound which is very similar to the localized ones, albeit with two differences. First, the r.h.s. of the first inequality in Theorem 4 vanishes when the loss class has zero variance. Though intuitively trivial, this allows us to prove a considerable result in the theory of transfer learning, as it quantifies the intuition that no learning is necessary if the source has perfect performance on the target task. Second, by applying the standard localized Rademacher complexity bounds of Bousquet (2002), and assuming the use of a Lipschitz loss function, we do not achieve a fast rate of convergence, as can be seen from Theorem 8, shown in the ‘Appendix’. We suspect that the smoothness of the loss function is crucial to prove fast rates in our formulation.

Fast rates for ERM with smooth losses have been thoroughly analyzed by Srebro et al. (2010). Yet, the analysis of our HTL algorithm within their framework would yield a bound that is inferior to ours in two respects. The first concerns the scenario when the combined source hypothesis is perfect, that is \(R^{\text {src}}= 0\). The generalization bound of Srebro et al. (2010) does not offer a way to show that the empirical risk converges to the risk with probability one; instead, one can only hope to get a fast rate of convergence. The second problem lies in the fact that such a bound would depend on the empirical performance of the combined source hypothesis. As we have noted before, the quantity \(R^{\text {src}}\) is essential because it captures the degree of relatedness between the two domains. In their bounds, one cannot obtain this relationship through the Rademacher complexity term as we did in our analysis. The reason for this is the stronger notion of Rademacher complexity employed by that framework, involving a supremum over the sample instead of an expectation. The expectation over the sample of the target distribution is crucial here, because it allows us to quantify how well the source domain is aligned with the target domain, through the source hypothesis acting as a link. However, one can attempt to obtain a bound on the empirical risk in terms of \(R^{\text {src}}\). We prove such a bound in the ‘Appendix’, Theorem 6, and conclude that if one has a good source hypothesis, or even a perfect one, the rate is \(\mathcal {O}(1/\root 4 \of {m^3})\), which is worse than ours.

5.2 Comparison to theories of domain adaptation and transfer learning

The setting in DA is different from the one we study; however, we will briefly discuss the theoretical relationship between the two. Typically in DA, one trains a hypothesis on an altered source training set, striving to achieve good performance on the target domain. The key question here is how to alter, or adapt, the source training set. To answer this question, the DA literature introduces the notion of domain relatedness, which quantifies the dissimilarities between the marginal distributions of the corresponding domains. Practically, in some cases the domain relatedness can be estimated from a large set of unlabeled samples drawn from both source and target domains. Theories of DA (Ben-David et al. 2010; Mansour et al. 2009; Ben-David and Urner 2012; Mansour et al. 2008; Cortes and Mohri 2014) have proposed a number of such domain relatedness criteria. Perhaps the most well known are the \(d_{\mathcal {H}\varDelta \mathcal {H}}\)-divergence (Ben-David et al. 2010) and its more general counterpart, the Discrepancy Distance (Mansour et al. 2009). Typically, this divergence appears explicitly in the generalization bound along with other terms controlling the generalization on the target domain. Let \(R_{\mathcal {D}^{\text {trg}}}(h)\) and \(R_{\mathcal {D}^{\text {src}}}(h)\) denote the risks of the hypothesis h, measured w.r.t. the target and source distributions. Then a well-known result of Ben-David et al. (2010) states that for all \(h \in \mathcal {H}\)

$$\begin{aligned} R_{\mathcal {D}^{\text {trg}}}(h) \le R_{\mathcal {D}^{\text {src}}}(h) + \frac{1}{2} d_{\mathcal {H}\varDelta \mathcal {H}}(\mathcal {D}^{\text {src}},\mathcal {D}^{\text {trg}}) + \epsilon _{\mathcal {H}}^{\star }, \end{aligned}$$
(6)

where \(\epsilon _{\mathcal {H}}^{\star } = \min _{h \in \mathcal {H}}\left\{ R_{\mathcal {D}^{\text {trg}}}(h) + R_{\mathcal {D}^{\text {src}}}(h) \right\} \). This result implies that adaptation is possible given that \(d_{\mathcal {H}\varDelta \mathcal {H}}(\mathcal {D}^{\text {src}},\mathcal {D}^{\text {trg}})\) and \(\epsilon _{\mathcal {H}}^{\star }\) are small. One can try to reduce those by controlling the complexity of the class \(\mathcal {H}\) and by minimizing the divergence \(d_{\mathcal {H}\varDelta \mathcal {H}}(\mathcal {D}^{\text {src}},\mathcal {D}^{\text {trg}})\). In practice, the latter can be manipulated through an empirical counterpart on the basis of unlabeled samples. Increasing the complexity of \(\mathcal {H}\) indeed reduces \(\epsilon _{\mathcal {H}}^{\star }\), but inflates \(d_{\mathcal {H}\varDelta \mathcal {H}}(\mathcal {D}^{\text {src}},\mathcal {D}^{\text {trg}})\). On the other hand, minimizing \(d_{\mathcal {H}\varDelta \mathcal {H}}(\mathcal {D}^{\text {src}},\mathcal {D}^{\text {trg}})\) alone puts us at risk of increasing \(\epsilon _{\mathcal {H}}^{\star }\), since the empirical divergence is reduced without taking the labelling into account.

Clearly, this bound cannot be directly compared to our result in Theorem 1. However, we note the term \(R^{\text {src}}\) appearing in our results, which plays a role very similar to \(d_{\mathcal {H}\varDelta \mathcal {H}}\) in (6). In fact, by defining \(\mathcal {H}= \{\mathbf {x}\mapsto \left\langle \varvec{\beta }, \mathbf {h^{\text {src}}}(\mathbf {x}) \right\rangle \ : \ \varOmega (\varvec{\beta }) \le \tau \}\), where \(\mathbf {h^{\text {src}}}(\mathbf {x}) = [h^{\text {src}}_1(\mathbf {x}), \ldots , h^{\text {src}}_n(\mathbf {x})]^{\top }\), and fixing \(h = h^{\text {src}}_{\varvec{\beta }} \in \mathcal {H}\) in (6), we can write

$$\begin{aligned} R^{\text {src}}= R_{\mathcal {D}^{\text {trg}}}(h^{\text {src}}_{\varvec{\beta }})&\le R_{\mathcal {D}^{\text {src}}}(h^{\text {src}}_{\varvec{\beta }}) + d_{\mathcal {H}\varDelta \mathcal {H}}(\mathcal {D}^{\text {src}},\mathcal {D}^{\text {trg}}) + \epsilon _{\mathcal {H}}^{\star }. \end{aligned}$$

Plugging this into the generalization bound (5) and assuming that \(\lambda \le 1\) and \(\rho \le 1/\lambda \) we have for the target hypothesis \(h\) that

$$\begin{aligned} R_{\mathcal {D}^{\text {trg}}}(h) \le \hat{R}_S(h) + \mathcal {O}\left( \frac{R_{\mathcal {D}^{\text {src}}}(h^{\text {src}}_{\varvec{\beta }}) + d_{\mathcal {H}\varDelta \mathcal {H}}(\mathcal {D}^{\text {src}},\mathcal {D}^{\text {trg}}) + \epsilon _{\mathcal {H}}^{\star }}{\sqrt{m} \lambda } + \frac{1}{m \lambda } \right) . \end{aligned}$$
(7)

Although this inequality shows the generalization ability of the transfer learning algorithm, comparing it to (6) we observe that DA and our result agree on the fact that the divergence between the domains has to be small in order to generalize well. In the formulation we consider, the divergence is controlled in two ways: implicitly, by the choice of \(\mathbf {h^{\text {src}}}\), and through the complexity of the class \(\mathcal {H}\), that is, by choosing \(\tau \). In addition, in DA one expects a hypothesis to perform well on the target only if it performs well on the source; in our results, this requirement is relaxed. As a side note, we observe that (7) captures the intuitive notion that a good source hypothesis has to perform well on its own domain. Finally, in the theory of DA, \(\epsilon _\mathcal {H}^\star \) is assumed to be small. Indeed, if \(\epsilon _\mathcal {H}^\star \) is large, there is no hypothesis that is able to perform well on both domains simultaneously, and therefore adaptation is hopeless. In our case, the algorithm can still generalize even with large \(\epsilon _\mathcal {H}^\star \); however, this is due to the supervised nature of the framework.

We now turn our attention to previous theoretical works studying HTL-related settings. Few papers have addressed the theory of transfer learning where the only information passed from the source domain is a classifier or regressor. Mansour et al. (2008) addressed the problem of combining multiple source hypotheses, although in a different setting. Specifically, in addition to the source hypotheses, the learner receives unlabeled samples drawn from the source distributions, which are used to weight and combine these source hypotheses. The authors presented a general theory of such a scenario but did not study the generalization properties of any particular algorithm. The first analysis of the generalization ability of HTL in a context similar to the one considered here was done by Kuzborskij and Orabona (2013). That work focused on L2-regularized least squares and a generalization bound involving the leave-one-out risk instead of the empirical one. The following result, obtained through an algorithmic stability argument (Bousquet and Elisseeff 2002), holds with probability at least \(1 - \delta \)

$$\begin{aligned} R(h) \le \hat{R}_S^{\text {loo}}(h) + \mathcal {O}\left( \frac{\root 4 \of {R^{\text {src}}}}{\sqrt{m \delta } \lambda ^{0.75}} \right) , \end{aligned}$$
(8)

where \(R^{\text {src}}\) is the risk of a single fixed source hypothesis and \(h\) is the solution of a Regularized Least Squares problem. We first observe that the shape of the bound is similar to the one obtained in this work, although with a number of differences. First, contrary to the bounds presented here, the bound (8) assumes the use of a fixed source hypothesis, which is not even weighted by any coefficient. In practice, this is a very strong assumption, as one can receive an arbitrarily bad source and have no way to exclude it. Second, the confidence term of the bound (8) vanishes whenever the risk of the source \(R^{\text {src}}\) is equal to zero. This comes at the cost of a weaker concentration inequality; in Theorem 1 we manage to obtain the same behavior with high probability. Finally, we get a better dependency on \(R^{\text {src}}\).

5.3 Combining source hypotheses in practice

So far we have assumed that problem (2) is supplied with a pre-made combination of source hypotheses, that is, we did not study a particular algorithm for tuning the \(\varvec{\beta }\) weights. However, by analyzing our generalization bound (4), it is easy to come up with algorithms that could be used for this purpose. In particular, by minimizing the bound w.r.t. \(\varvec{\beta }\), and assuming that the empirical risk \(\hat{R}_S(h^{\text {src}}_{\varvec{\beta }})\) converges uniformly to \(R^{\text {src}}\), we have with high probability that

$$\begin{aligned} \min _{ \varOmega (\varvec{\beta }) \le \rho } R(h_{\hat{\mathbf {w}}, \varvec{\beta }})&\le \min _{\varOmega (\varvec{\beta }) \le \rho } \left\{ \hat{R}_S(h_{\hat{\mathbf {w}}, \varvec{\beta }}) + \mathcal {O}\left( \frac{\kappa \hat{R}_S(h^{\text {src}}_{\varvec{\beta }})}{\sqrt{m} \lambda } + \sqrt{ \frac{\kappa ^2 \rho \hat{R}_S(h^{\text {src}}_{\varvec{\beta }})}{m \lambda } }\right) \right\} ~. \end{aligned}$$

Thus, at least theoretically, given a fixed solution \(\hat{\mathbf {w}}\), it is enough to jointly minimize the error of the target hypothesis \(h_{\hat{\mathbf {w}}, \varvec{\beta }}\) and the error of the source combination on the target training set. This is particularly efficient when the square loss is used, since \(\hat{\mathbf {w}}\) can be expressed in terms of the inverse of a covariance matrix that has to be computed only once (Orabona et al. 2009; Tommasi et al. 2014; Kuzborskij et al. 2015).
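A minimal sketch of this recipe for the square loss follows; the candidate grid, the score combining the two empirical errors, and all names are our own illustrative choices rather than an algorithm analyzed in the paper.

```python
import numpy as np

def tune_beta(X, y, H_src, lam, betas):
    """For each candidate beta, solve the ridge problem (2) for w in closed
    form and score the empirical target error plus a scaled empirical source
    error, following the shape of the bound. H_src is the (m, n) matrix of
    source predictions on the training set; the covariance-like matrix is
    inverted only once and reused across all candidates."""
    m, d = X.shape
    A_inv = np.linalg.inv(X.T @ X / m + lam * np.eye(d))  # computed once
    best_beta, best_score = None, np.inf
    for beta in betas:
        s = H_src @ beta                        # source-combination predictions
        w = A_inv @ (X.T @ (y - s)) / m         # ridge solution of (2) for this beta
        target_err = np.mean((X @ w + s - y) ** 2)
        source_err = np.mean((s - y) ** 2)      # empirical proxy for R^src
        score = target_err + source_err / np.sqrt(m)
        if score < best_score:
            best_beta, best_score = beta, score
    return best_beta
```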

Many HTL-like algorithms can be captured by the above by choosing among different loss functions and regularizers \(\varOmega \). The simplest case is just a concatenation of the source hypotheses' predictions with the original feature vector. However, by choosing different regularizers and their parameters, we can treat the source hypotheses differently from the original features. For example, one might enforce sparsity over the source hypotheses, while using the usual L2 regularizer on the target solution \(\hat{\mathbf {w}}\).

6 Technical results and proofs

In this section we present general technical results that are used to prove our theorems.

First, we present the Rademacher complexity generalization bound in Theorem 4, which slightly differs from the usual ones. The difference lies in the assumption that the variance of the loss is uniformly bounded over the hypothesis class. This allows us to state a generalization bound that enjoys a fast empirical risk convergence rate when the class complexity is small. Second, we also show a generalization bound with a confidence term that vanishes if the complexity of the class is exactly zero.

Next, we focus on the Rademacher complexity of the smooth loss function class. We prove a bound on the empirical Rademacher complexity of a hypothesis class, Lemma 3, that depends on the point-wise bounds on the loss function. This novel bound might be of independent interest.

Finally, we employ this result to analyze the effect of the source hypotheses on the complexity of the target hypothesis class in Theorem 5.

6.1 Fast rate generalization bound

The proof of the fast-rate and vanishing-confidence-term bounds of Theorem 4 stems from the functional generalization of Bennett's inequality due to Bousquet (2002, Theorem 2.11), which we report here for completeness.

Theorem 3

(Bousquet 2002) Let \(X_1, X_2, \ldots , X_m\) be independent random variables identically distributed according to \(\mathcal {D}\). For all \(\mathcal {D}\)-measurable, square-integrable \(g \in \mathcal {G}\), with \({{\mathrm{\mathbb {E}}}}_{X}[g(X)]=0\), and \(\sup _{g \in \mathcal {G}} {{\mathrm{ess\,sup}}}g \le 1\), we denote

$$\begin{aligned}&Z = \sup _{g \in \mathcal {G}} \sum _{i=1}^m g(X_i) . \end{aligned}$$
(9)

Let \(\sigma \) be a positive real number such that \(\sup _{g \in \mathcal {G}} \mathrm {Var}_{X \sim \mathcal {D}}[g(X)] \le \sigma ^2\) almost surely. Then for all \(t \ge 0\), we have that

$$\begin{aligned} {{\mathrm{\mathbb {P}}}}\left( Z \ge {{\mathrm{\mathbb {E}}}}[Z] + t\right) \le \exp \left( -v u\left( \frac{t}{v}\right) \right) , \end{aligned}$$
(10)

where

$$\begin{aligned} \begin{aligned}&v = m \sigma ^2 + 2{{\mathrm{\mathbb {E}}}}[Z],\\&u(y) = (1+y) \log (1+y) - y. \end{aligned} \end{aligned}$$

The following technical lemma will be used to invert the right hand side of (10).

Lemma 1

Let \(a,b>0\) such that \(b = (1+a) \log (1+a) - a\). Then \(a\le \frac{3 b}{2 \log (\sqrt{b}+1)}\).

Proof

It is easy to verify that the inverse function \(f^{-1}(b)\) of \(f(a):=(1+a) \log (1+a) - a\) is

$$\begin{aligned} f^{-1}(b) = \exp \left[ W\left( \frac{b-1}{e}\right) +1 \right] -1, \end{aligned}$$

where \(W:[-\frac{1}{e}, \infty ) \rightarrow \mathbb {R}\) is the (principal branch of the) Lambert function that satisfies

$$\begin{aligned} x=W(x) \exp \left( W(x)\right) . \end{aligned}$$

Hence, to obtain an upper bound on a, we need an upper bound on the Lambert function. We use Theorem 2.3 in Hoorfar and Hassani (2008), which says that

$$\begin{aligned} W(x) \le \log \frac{x+C}{1+\log (C)}, \quad \forall x> -\frac{1}{e}, \ C>\frac{1}{e}. \end{aligned}$$

Setting \(C=\frac{\sqrt{b}+1}{e}\), we obtain

$$\begin{aligned} a=f^{-1}(b) \le e \frac{\frac{b-1}{e}+\frac{\sqrt{b}+1}{e}}{1+\log (\frac{\sqrt{b}+1}{e})} -1 = \frac{b+\sqrt{b}}{\log (\sqrt{b}+1)}-1 \le \frac{3b}{2\log (\sqrt{b}+1)}, \end{aligned}$$

where in the last inequality we used the fact that \(x+\sqrt{x} - \log (\sqrt{x}+1) \le \frac{3}{2} x, \forall x\ge 0\), as can be easily verified by comparing the derivatives of both terms. \(\square \)
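As a quick numerical sanity check of Lemma 1 (a sketch of ours, assuming SciPy's `lambertw` for the exact inverse):

```python
import numpy as np
from scipy.special import lambertw

# Check that for b = f(a) = (1 + a) * log(1 + a) - a, the exact inverse
# recovered via the Lambert W function satisfies a <= 3b / (2 log(sqrt(b) + 1)).
for a in [0.1, 1.0, 10.0, 100.0]:
    b = (1 + a) * np.log(1 + a) - a
    a_rec = np.exp(lambertw((b - 1) / np.e).real + 1) - 1  # f^{-1}(b)
    bound = 3 * b / (2 * np.log(np.sqrt(b) + 1))
    assert np.isclose(a_rec, a) and a <= bound
```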

The following lemma is a standard tool (Mohri et al. 2012, (3.8)–(3.13); Bartlett and Mendelson 2003).

Lemma 2

(Symmetrization) Given a function class \(\mathcal {F}\) and random variables \(S=\{X_i\}_{i=1}^m\), we have

$$\begin{aligned}&\mathop {{{\mathrm{\mathbb {E}}}}}\limits _{S}~ \sup _{f \in \mathcal {F}} \left\{ \underset{X}{\mathbb {E}}\left[ f(X) \right] - \frac{1}{m} \sum _{i=1}^m f(X_i) \right\} \le 2 \mathfrak {R}(\mathcal {F}),\\&\mathop {{{\mathrm{\mathbb {E}}}}}\limits _{S}~ \sup _{f \in \mathcal {F}} \left\{ \frac{1}{m} \sum _{i=1}^m f(X_i) - \underset{X}{\mathbb {E}}\left[ f(X) \right] \right\} \le 2 \mathfrak {R}(\mathcal {F}). \end{aligned}$$

Now we are ready to present the proof of Theorem 4.

Theorem 4

Consider a non-negative loss function \(\ell : \mathcal {Y}\times \mathcal {Y}\mapsto \mathbb {R}_+\), such that \(0 \le \ell (h(\mathbf {x}), y) \le M\) for any \(h \in \mathcal {H}\) and any \((\mathbf {x}, y) \in \mathcal {X}\times \mathcal {Y}\). In addition, let the training set S of size m be sampled i.i.d. from the probability distribution over \(\mathcal {X}\times \mathcal {Y}\). Also, for any \(r \ge 0\), define the loss class with respect to the hypothesis class \(\mathcal {H}\) as

$$\begin{aligned} \mathcal {L}:= \left\{ (\mathbf {x}, y) \mapsto \ell (h(\mathbf {x}), y) : h \in \mathcal {H}\ \wedge \ R(h) \le r \right\} . \end{aligned}$$

Then we have for all \(h \in \mathcal {H}\), and any training set S of size m, with probability at least \(1 - e^{-\eta }, \ \forall \eta \ge 0\)

$$\begin{aligned} R(h) - \hat{R}_S(h) \le 2 \mathfrak {R}(\mathcal {L}) + \frac{3 M \eta }{m \log \left( 1 + \sqrt{\frac{2 M \eta }{v m}}\right) } \le 2 \mathfrak {R}(\mathcal {L}) + 3 \sqrt{\frac{v M \eta }{2m}} + \frac{3 M \eta }{2m}, \end{aligned}$$

where \(v = 4 \mathfrak {R}(\mathcal {L}) + r\).

Proof

To prove the statement, we will consider the uniform deviations of the empirical risk. Namely, we will show an upper bound on the random variable \(\sup _{h \in \mathcal {H}}\left\{ R(h) - \hat{R}_S(h)\right\} \). For this purpose, we will use the functional generalization of Bennett’s inequality given by Theorem 3. Consider the random variable

$$\begin{aligned} Z := \frac{m}{2 M} \sup _{h \in \mathcal {H}}\left\{ R(h) - \hat{R}_S(h)\right\} . \end{aligned}$$

Using Theorem 3, we have

$$\begin{aligned}&{{\mathrm{\mathbb {P}}}}\left( \frac{m}{2 M} \sup _{h \in \mathcal {H}}\left\{ R(h) - \hat{R}_S(h)\right\} \ge \frac{m}{2 M} {{\mathrm{\mathbb {E}}}}\left[ \sup _{h \in \mathcal {H}}\left\{ R(h) - \hat{R}_S(h)\right\} \right] + t \right) \nonumber \\&\quad \le \exp \left( -v u\left( \frac{t}{v}\right) \right) , \end{aligned}$$
(11)

where,

$$\begin{aligned} v&= m \sigma ^2 + \frac{m}{M} {{\mathrm{\mathbb {E}}}}\left[ \sup _{h \in \mathcal {H}}\left\{ R(h) - \hat{R}_S(h)\right\} \right] , \nonumber \\ \sigma ^2&\ge \sup _{h \in \mathcal {H}} \mathrm {Var}_{(\mathbf {x}, y)}\left[ \frac{1}{2 M} \left( \ell (h(\mathbf {x}), y) - \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{(\mathbf {x}', y')}[\ell (h(\mathbf {x}'), y')] \right) \right] . \end{aligned}$$
(12)

We now need two things: to invert the r.h.s. of (11), treating it as a function of t, and to provide an upper bound on v. For the first part, recall that \(u(y) = (1+y) \log (1+y) - y\). To give an upper bound on t, we apply Lemma 1 with \(a=\frac{t}{v}\) and \(b=\frac{\eta }{v}\). This leads to the inequalities

$$\begin{aligned} \frac{t}{v} \le \frac{3 \eta }{2 v \log \left( 1+\sqrt{\frac{\eta }{v}} \right) } \le \frac{3\eta }{4v} + \frac{3}{2}\sqrt{\frac{\eta }{v}}. \end{aligned}$$

Using this fact, we have with probability at least \(1-e^{-\eta }\), for any \(\eta \ge 0\),

$$\begin{aligned} \frac{m}{2 M} \sup _{h \in \mathcal {H}}\left\{ R(h) - \hat{R}_S(h)\right\}&\le \frac{m}{2 M} {{\mathrm{\mathbb {E}}}}\left[ \sup _{h \in \mathcal {H}}\left\{ R(h) - \hat{R}_S(h)\right\} \right] + \frac{3\eta }{2\log \left( 1+\sqrt{\frac{\eta }{v}} \right) } \end{aligned}$$
(13)
$$\begin{aligned}&\le \frac{m}{2 M} {{\mathrm{\mathbb {E}}}}\left[ \sup _{h \in \mathcal {H}}\left\{ R(h) - \hat{R}_S(h)\right\} \right] + \frac{3}{4}\eta + \frac{3}{2} \sqrt{v \eta }. \end{aligned}$$
(14)

Next we prove the bound on v. We first show that the variance of the centered loss function, \(\sigma ^2\), is uniformly bounded over the class. From the definition of variance we have

$$\begin{aligned}&\sup _{h \in \mathcal {H}}~ \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{(\mathbf {x}, y)}\left[ \frac{1}{4 M^2} \left( \ell (h(\mathbf {x}), y) - \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{(\mathbf {x}',y')}[ \ell (h(\mathbf {x}'), y') ] \right) ^2 \right] \le \sup _{h \in \mathcal {H}}~ \frac{1}{4 M^2} \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{(\mathbf {x}, y)}[\ell (h(\mathbf {x}), y)^2] \nonumber \\&\qquad \le \sup _{h \in \mathcal {H}}~ \frac{1}{2 M} \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{(\mathbf {x}, y)}[\ell (h(\mathbf {x}), y)] = \sup _{h \in \mathcal {H}} \frac{1}{2 M} R(h) \le \frac{r}{2 M} =: \sigma ^2. \end{aligned}$$
(15)

The second inequality is due to the fact that \(\ell (h(\mathbf {x}), y) \le M\), and the last one follows from the constraint \(R(h) \le r\) in the definition of the loss class. Now we upper-bound the second term of v by applying Lemma 2,

$$\begin{aligned} \begin{aligned}&\frac{1}{2 m M} \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{S}\left[ \sup _{h \in \mathcal {H}} \sum _{i=1}^m \left( \ell (h(\mathbf {x}_i), y_i) - \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{(\mathbf {x}', y')}{[\ell (h(\mathbf {x}'), y')]} \right) \right] \\&\qquad = \frac{1}{2 M} \mathop {{{\mathrm{\mathbb {E}}}}}\limits _S\left[ \sup _{h \in \mathcal {H}} \left\{ \left( \frac{1}{m} \sum _{i=1}^m \ell (h(\mathbf {x}_i), y_i) \right) - \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{(\mathbf {x}', y')}{[\ell (h(\mathbf {x}'), y')]} \right\} \right] \le \frac{1}{M} \mathfrak {R}(\mathcal {L}). \end{aligned} \end{aligned}$$

We conclude the proof by upper-bounding the expectation terms in (13) and (14) using Lemma 2, and plugging the upper bound on v,

$$\begin{aligned} v \le \frac{2 m}{M} \mathfrak {R}(\mathcal {L}) + m \sigma ^2 \le \frac{2 m \mathfrak {R}(\mathcal {L})}{M} + \frac{m r}{2 M}. \end{aligned}$$

\(\square \)

6.2 Rademacher complexity of smooth loss class

In this section we study the Rademacher complexity of the hypothesis class populated by functions of the form (1), where the parameters \(\mathbf {w}\) and \(\varvec{\beta }\) are chosen by an algorithm with a strongly convex regularizer. For this purpose we employ the results of Kakade et al. (2008, 2012), who studied strongly convex regularizers in a more general setting. Furthermore, we will focus on the use of smooth loss functions as done by Srebro et al. (2010).

The proof of the main result of this section, Theorem 5, depends essentially on the following lemma, which bounds the empirical Rademacher complexity of an H-smooth loss class.

Lemma 3

Let \(\ell : \mathcal {Y}\times \mathcal {Y}\mapsto \mathbb {R}_+\) be an H-smooth loss function, and for some function class \(\mathcal {F}\) let the loss class be

$$\begin{aligned} \mathcal {L}= \left\{ (\mathbf {x}, y) \mapsto \ell (f(\mathbf {x}), y) : f \in \mathcal {F}\right\} . \end{aligned}$$

Then, given the sample S of size m and a set

$$\begin{aligned} \left\{ \tau _i ~:~ \tau _i \ge \ell (f(\mathbf {x}_i), y_i),~ \forall (\mathbf {x}_i, y_i) \in S ~\wedge ~ \forall f \in \mathcal {F}\right\} , \end{aligned}$$

we have that

$$\begin{aligned} \hat{\mathfrak {R}}_S(\mathcal {L}) \le \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{\varvec{\varepsilon }}\left[ \sup _{f \in \mathcal {F}}\left\{ \frac{2 \sqrt{3 H}}{m} \sum _{i=1}^m \varepsilon _i \sqrt{\tau _i} f(\mathbf {x}_i) \right\} \right] , \end{aligned}$$

where \(\varepsilon _i\) is a random variable such that \(\mathbb {P}(\varepsilon _i=1) = \mathbb {P}(\varepsilon _i=-1) = \frac{1}{2}\).

Proof

This proof follows a line of reasoning similar to the proof of Talagrand’s lemma for Lipschitz functions, see for instance Mohri et al. (2012, p. 79). We will also use Lemma B.1 by Srebro et al. (2010) (arXiv extended version), stating that for any H-smooth non-negative function \(\phi : \mathbb {R}\mapsto \mathbb {R}_+\) and any \(x,z \in \mathbb {R}\),

$$\begin{aligned} |\phi (x) - \phi (z)| \le \sqrt{6 H (\phi (x) + \phi (z))} |x - z|. \end{aligned}$$
(16)

Fix the sample S, then, by definition,

$$\begin{aligned} \hat{\mathfrak {R}}_S(\mathcal {L})&= \frac{1}{m} \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{\varvec{\varepsilon }}\left[ \sup _{f \in \mathcal {F}}\left\{ \sum _{i=1}^m \varepsilon _i \ell (f(\mathbf {x}_i), y_i) \right\} \right] \\&= \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{\varepsilon _1, \ldots , \varepsilon _{m-1}} \left[ \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{\varepsilon _m}\left[ \sup _{f \in \mathcal {F}}\left\{ u_{m-1}(f) + \varepsilon _m \ell (f(\mathbf {x}_m), y_m) \right\} \right] \right] , \end{aligned}$$

where \(u_{m-1}(f) = \sum _{i=1}^{m-1} \varepsilon _i \ell (f(\mathbf {x}_i), y_i)\). By definition of supremum, for any \(\delta > 0\), there exist \(f_1, f_2 \in \mathcal {F}\) such that

$$\begin{aligned}&u_{m-1}(f_1) + \ell (f_1(\mathbf {x}_m), y_m) \ge (1 - \delta )\left( \sup _{f \in \mathcal {F}}\left\{ u_{m-1}(f) + \ell (f(\mathbf {x}_m), y_m) \right\} \right) \\ \quad \text {and }&u_{m-1}(f_2) - \ell (f_2(\mathbf {x}_m), y_m) \ge (1 - \delta )\left( \sup _{f \in \mathcal {F}}\left\{ u_{m-1}(f) - \ell (f(\mathbf {x}_m), y_m) \right\} \right) . \end{aligned}$$

Thus for any \(\delta > 0\), by definition of \({{\mathrm{\mathbb {E}}}}_{\varepsilon _m}\),

$$\begin{aligned}&(1 - \delta ) \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{\varepsilon _m}\left[ \sup _{f \in \mathcal {F}}\left\{ u_{m-1}(f) + \varepsilon _m \ell (f(\mathbf {x}_m), y_m) \right\} \right] \\&\qquad = \frac{1 - \delta }{2} \left( \sup _{f \in \mathcal {F}}\left\{ u_{m-1}(f) + \ell (f(\mathbf {x}_m), y_m) \right\} + \sup _{f \in \mathcal {F}}\left\{ u_{m-1}(f) - \ell (f(\mathbf {x}_m), y_m) \right\} \right) \\&\qquad \le \frac{1}{2} \bigg ( u_{m-1}(f_1) + \ell (f_1(\mathbf {x}_m), y_m) + u_{m-1}(f_2) - \ell (f_2(\mathbf {x}_m), y_m) \bigg ) \\&\qquad \le \frac{1}{2} \bigg ( u_{m-1}(f_1) + u_{m-1}(f_2) \\&\qquad \qquad + s_m \sqrt{6 H \left( \ell (f_1(\mathbf {x}_m), y_m) + \ell (f_2(\mathbf {x}_m), y_m) \right) } \, (f_1(\mathbf {x}_m) - f_2(\mathbf {x}_m)) \bigg ) \\&\qquad \le \frac{1}{2} \bigg ( u_{m-1}(f_1) + u_{m-1}(f_2) + s_m \sqrt{12 H \tau _m } (f_1(\mathbf {x}_m) - f_2(\mathbf {x}_m)) \bigg ) \\&\qquad \le \frac{1}{2} \sup _{f \in \mathcal {F}} \bigg \{ u_{m-1}(f) + s_m \sqrt{12 H \tau _m } f(\mathbf {x}_m) \bigg \}\\&\qquad + \frac{1}{2} \sup _{f \in \mathcal {F}} \bigg \{ u_{m-1}(f) - s_m \sqrt{12 H \tau _m } f(\mathbf {x}_m) \bigg \} \\&\qquad = \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{\varepsilon _m}\left[ \sup _{f \in \mathcal {F}} \left\{ u_{m-1}(f) + \varepsilon _m \sqrt{12 H \tau _m } f(\mathbf {x}_m)\right\} \right] . \end{aligned}$$

To obtain the second inequality, we applied (16), where \(s_m = \text{ sgn }(f_1(\mathbf {x}_m) - f_2(\mathbf {x}_m))\). Since the inequality holds for all \(\delta > 0\), we have

$$\begin{aligned} \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{\varepsilon _m}\left[ \sup _{f \in \mathcal {F}}\left\{ u_{m-1}(f) + \varepsilon _m \ell (f(\mathbf {x}_m), y_m) \right\} \right] \le \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{\varepsilon _m}\left[ \sup _{f \in \mathcal {F}} \left\{ u_{m-1}(f) + \varepsilon _m \sqrt{12 H \tau _m } f(\mathbf {x}_m) \right\} \right] . \end{aligned}$$

Proceeding in the same way for all the other \(\varepsilon _i\), with \(i \ne m\), proves the lemma. \(\square \)

To prove Theorem 5 we will also use the following lemma from Kakade et al. (2012, Corollary 4).

Lemma 4

(Kakade et al. 2012) If \(\varOmega \) is \(\sigma \)-strongly convex w.r.t. \(\Vert \cdot \Vert \) and \(\varOmega ^\star (\mathbf {0}) = 0\), then, denoting the partial sum \(\sum _{j \le i} \mathbf {v}_j\) by \(\mathbf {v}_{1:i}\), we have for any sequence \(\mathbf {v}_1, \ldots , \mathbf {v}_m\) and for any \(\mathbf {u}\),

$$\begin{aligned} \sum _{i=1}^m \left\langle \mathbf {v}_i, \mathbf {u} \right\rangle - \varOmega (\mathbf {u}) \le \varOmega ^\star (\mathbf {v}_{1:m}) \le \sum _{i=1}^m \left\langle \nabla \varOmega ^\star (\mathbf {v}_{1:i-1}), \mathbf {v}_i \right\rangle + \frac{1}{2 \sigma } \sum _{i=1}^m \Vert \mathbf {v}_i\Vert _\star ^2~. \end{aligned}$$
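As a sanity check, not needed for the proofs, the two inequalities are easy to verify numerically for the canonical choice \(\varOmega (\mathbf {u}) = \frac{\sigma }{2}\Vert \mathbf {u}\Vert _2^2\), for which \(\varOmega ^\star (\mathbf {v}) = \frac{1}{2\sigma }\Vert \mathbf {v}\Vert _2^2\) (so \(\varOmega ^\star (\mathbf {0}) = 0\)) and \(\nabla \varOmega ^\star (\mathbf {v}) = \mathbf {v}/\sigma \); in this quadratic case the second inequality in fact holds with equality. The sketch below is purely illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    sigma, d, m = 2.0, 4, 10

    # Omega(u) = (sigma/2) * ||u||^2 is sigma-strongly convex w.r.t. ||.||_2;
    # its conjugate is Omega*(v) = ||v||^2 / (2 * sigma), with gradient v / sigma.
    def omega(u):
        return 0.5 * sigma * u @ u

    def omega_star(v):
        return v @ v / (2 * sigma)

    V = rng.normal(size=(m, d))                   # the sequence v_1, ..., v_m
    u = rng.normal(size=d)
    partial = np.cumsum(V, axis=0)                # partial[i] = v_{1:i+1}

    left = (V @ u).sum() - omega(u)
    mid = omega_star(partial[-1])
    grads = np.vstack([np.zeros(d), partial[:-1]]) / sigma   # grad Omega*(v_{1:i-1})
    right = (grads * V).sum() + (V * V).sum() / (2 * sigma)

    assert left <= mid + 1e-9 and mid <= right + 1e-9
    print(f"{left:.3f} <= {mid:.3f} <= {right:.3f}")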

Now we are ready to give the proofs of the Rademacher complexity results.

Theorem 5

Let \(\varOmega \) be a non-negative \(\sigma \)-strongly convex function w.r.t. a norm \(\Vert \cdot \Vert \), and let \(\mathbf {0}\) be its minimizer. Let the risk and the empirical risk be defined w.r.t. an H-smooth loss function \(\ell : \mathcal {Y}\times \mathcal {Y}\mapsto \mathbb {R}_+\). Finally, given a set of functions \(\{f_i : \mathcal {X}\mapsto \mathcal {Y}\}_{i=1}^n\) with \(\mathbf {f}(\mathbf {x}) := [f_1(\mathbf {x}), \ldots , f_n(\mathbf {x})]^{\top }\), a combination \(f_{\varvec{\beta }}(\mathbf {x}) = \left\langle \varvec{\beta }, \mathbf {f}(\mathbf {x}) \right\rangle \), scalars \(\alpha , \rho > 0\), and any sample S drawn i.i.d. from a distribution over \(\mathcal {X}\times \mathcal {Y}\), define the classes

$$\begin{aligned} \mathcal {W}= \left\{ \mathbf {w}~:~ \varOmega (\mathbf {w}) \le \alpha \hat{R}_S(f_{\varvec{\beta }}) \right\} , \quad \mathcal {V}= \left\{ {\varvec{\beta }} ~:~ \varOmega ({\varvec{\beta }}) \le \rho \right\} , \end{aligned}$$

and the loss class

$$\begin{aligned} \mathcal {L}= \left\{ (\mathbf {x}, y) \mapsto \ell (\left\langle \mathbf {w}, \mathbf {x} \right\rangle + f_{\varvec{\beta }}(\mathbf {x}), y) ~:~ \mathbf {w}\in \mathcal {W}\ \wedge \ {\varvec{\beta }} \in \mathcal {V}\right\} . \end{aligned}$$

Then for the loss class \(\mathcal {L}\), with constants B and C such that \(\sup _{\mathbf {x}\in \mathcal {X}} \Vert \mathbf {x}\Vert _\star \le B\) and \(\sup _{\mathbf {x}\in \mathcal {X}} \Vert \mathbf {f}(\mathbf {x})\Vert _\star \le C\), we have that

$$\begin{aligned} \mathfrak {R}(\mathcal {L}) \le 4 \sqrt{3 H} (B + C) \left( 1 + \sqrt{\frac{2 H B^2 \alpha }{\sigma }} \right) \frac{R(f_{\varvec{\beta }}) \sqrt{\alpha } + \sqrt{R(f_{\varvec{\beta }}) \rho }}{\sqrt{m \sigma }}. \end{aligned}$$

Proof

The core of the proof is an application of Lemma 3, which allows us to introduce additional information about the loss class through bounds on the loss at each example. We bound the loss at each example using the definition of smoothness, which extracts the empirical risk \(\hat{R}_S(f_{\varvec{\beta }})\) of the source combination. The last step is an upper bound on the empirical Rademacher complexity of a class regularized by a strongly convex function, for which we follow the proof of Kakade et al. (2012, Theorem 7). First define the classes

$$\begin{aligned} \mathcal {H}_\mathcal {W}:= \left\{ \mathbf {x}\mapsto \left\langle \mathbf {w}, \mathbf {x} \right\rangle \ : \ \mathbf {w}\in \mathcal {W}\right\} , \quad \mathcal {H}_\mathcal {V}:= \left\{ f_{\varvec{\beta }} \ : \ {\varvec{\beta }} \in \mathcal {V}\right\} , \end{aligned}$$

and also define the altered samples \(S' := \{ \sqrt{\tau _i} \mathbf {x}_i \}_{i=1}^m\) and \(S'' := \{ \sqrt{\tau _i} \mathbf {f}(\mathbf {x}_i) \}_{i=1}^m\), where \(\tau _i\) is a quantity independent of \(\mathcal {W}\) and \(\mathcal {V}\), to be specified below. Then, applying Lemma 3, we have that

$$\begin{aligned} \hat{\mathfrak {R}}_S(\mathcal {L})&\le \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{\varvec{\varepsilon }}\left[ \sup _{\begin{array}{c} \mathbf {w}\in \mathcal {W}\\ \varvec{\beta }\in \mathcal {V} \end{array}} \left\{ \frac{2 \sqrt{3 H}}{m} \sum _{i=1}^m \varepsilon _i \sqrt{\tau _i} \left\langle \mathbf {w}, \mathbf {x}_i \right\rangle + \frac{2 \sqrt{3 H}}{m} \sum _{i=1}^m \varepsilon _i \sqrt{\tau _i} \left\langle \varvec{\beta }, \mathbf {f}(\mathbf {x}_i) \right\rangle \right\} \right] \\&\le \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{\varvec{\varepsilon }}\left[ \sup _{\mathbf {w}\in \mathcal {W}} \left\{ \frac{2 \sqrt{3 H}}{m} \sum _{i=1}^m \varepsilon _i \sqrt{\tau _i} \left\langle \mathbf {w}, \mathbf {x}_i \right\rangle \right\} \right] + \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{\varvec{\varepsilon }}\left[ \sup _{\varvec{\beta }\in \mathcal {V}} \left\{ \frac{2 \sqrt{3 H}}{m} \sum _{i=1}^m \varepsilon _i \sqrt{\tau _i} \left\langle \varvec{\beta }, \mathbf {f}(\mathbf {x}_i) \right\rangle \right\} \right] \\&= 2 \sqrt{3 H} \left( \hat{\mathfrak {R}}_{S'}(\mathcal {H}_{\mathcal {W}}) + \hat{\mathfrak {R}}_{S''}(\mathcal {H}_{\mathcal {V}}) \right) . \end{aligned}$$

With this in hand, we follow the proof of Kakade et al. (2012, Theorem 7) to bound the empirical Rademacher complexities \(\hat{\mathfrak {R}}_{S'}(\mathcal {H}_{\mathcal {W}})\) and \(\hat{\mathfrak {R}}_{S''}(\mathcal {H}_{\mathcal {V}})\) in terms of the quantities of interest. Let \(t > 0\) and apply Lemma 4 with \(\mathbf {u}= \mathbf {w}\) and \(\mathbf {v}_i = t \varepsilon _i \sqrt{\tau _i} \mathbf {x}_i\) to get

$$\begin{aligned}&\sup _{\mathbf {w}\in \mathcal {W}}\left\{ \sum _{i=1}^m \left\langle \mathbf {w}, t \varepsilon _i \sqrt{\tau _i} \mathbf {x}_i \right\rangle \right\} \\&\qquad \le \frac{t^2}{2 \sigma } \sum _{i=1}^m \Vert \varepsilon _i \sqrt{\tau _i} \mathbf {x}_i\Vert _\star ^2 + \sup _{\mathbf {w}\in \mathcal {W}} \varOmega (\mathbf {w}) + \sum _{i=1}^m \left\langle \nabla \varOmega ^\star (\mathbf {v}_{1:i-1}), t \varepsilon _i \sqrt{\tau _i} \mathbf {x}_i \right\rangle \\&\qquad \le \frac{t^2 B^2}{2 \sigma } \sum _{i=1}^m |\tau _i| + \alpha \hat{R}_S(f_{\varvec{\beta }}) + t \sum _{i=1}^m \left\langle \nabla \varOmega ^\star (\mathbf {v}_{1:i-1}), \varepsilon _i \sqrt{\tau _i} \mathbf {x}_i \right\rangle . \end{aligned}$$

Now take the expectation w.r.t. all the \(\varepsilon _i\) on both sides. The left-hand side becomes \(m t \hat{\mathfrak {R}}_{S'}(\mathcal {H}_{\mathcal {W}})\), while the last term on the right-hand side vanishes, since \(\nabla \varOmega ^\star (\mathbf {v}_{1:i-1})\) depends only on \(\varepsilon _1, \ldots , \varepsilon _{i-1}\) and \({{\mathrm{\mathbb {E}}}}[\varepsilon _i]=0\). Denoting \(r = \frac{1}{m} \sum _{i=1}^m |\tau _i|\) and dividing through by \(m t\), we get

$$\begin{aligned} \hat{\mathfrak {R}}_{S'}(\mathcal {H}_{\mathcal {W}}) \le \frac{B^2 r t}{2 \sigma } + \frac{\alpha }{m t} \hat{R}_S(f_{\varvec{\beta }}). \end{aligned}$$

Arguing analogously for \(\hat{\mathfrak {R}}_{S''}(\mathcal {H}_{\mathcal {V}})\), with \(\varvec{\beta }\) in place of \(\mathbf {w}\), C in place of B, and \(\rho \) in place of \(\alpha \hat{R}_S(f_{\varvec{\beta }})\), summing the two bounds, and loosely upper bounding \(\frac{1}{2\sigma }\) by \(\frac{1}{\sigma }\), we get that

$$\begin{aligned} \hat{\mathfrak {R}}_S(\mathcal {L}) \le 2 \sqrt{3 H}\left( \frac{(B^2 + C^2) r t}{\sigma } + \frac{\alpha \hat{R}_S(f_{\varvec{\beta }}) + \rho }{m t} \right) . \end{aligned}$$
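For completeness, the optimization over t below is the routine step \(\min _{t > 0} \left( a t + b/t \right) = 2 \sqrt{a b}\), attained at \(t^\star = \sqrt{b/a}\), here with \(a = 2\sqrt{3 H} (B^2 + C^2) r / \sigma \) and \(b = 2\sqrt{3 H} (\alpha \hat{R}_S(f_{\varvec{\beta }}) + \rho ) / m\), followed by \(\sqrt{B^2 + C^2} \le B + C\); explicitly,

$$\begin{aligned} t^\star = \sqrt{\frac{\sigma (\alpha \hat{R}_S(f_{\varvec{\beta }}) + \rho )}{(B^2 + C^2)\, r\, m}}~. \end{aligned}$$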

Optimizing over t gives us

$$\begin{aligned} \hat{\mathfrak {R}}_S(\mathcal {L}) \le 4 \sqrt{3 H} (B + C) \sqrt{\frac{ r (\alpha \hat{R}_S(f_{\varvec{\beta }}) + \rho ) }{m \sigma }}. \end{aligned}$$

We now focus on the upper bound on r. First we obtain bounds on each \(\tau _i\), starting from a bound on the loss function that exploits smoothness. Write \(\ell (\left\langle \mathbf {w}, \mathbf {x} \right\rangle + f_{\varvec{\beta }}(\mathbf {x}), y) = \phi (\left\langle \mathbf {w}, \mathbf {x} \right\rangle + f_{\varvec{\beta }}(\mathbf {x}))\), where \(\phi : \mathbb {R}\mapsto \mathbb {R}\) is an H-smooth function. From the definition of smoothness (Shalev-Shwartz and Ben-David 2014, (12.5)), we have for all \(\mathbf {w}\) and \(\mathbf {v}\)

$$\begin{aligned}&\phi (\left\langle \mathbf {w}, \mathbf {x} \right\rangle + f_{\varvec{\beta }}(\mathbf {x}))\nonumber \\&\quad \le \phi (\left\langle \mathbf {v}, \mathbf {x} \right\rangle + f_{\varvec{\beta }}(\mathbf {x})) + \phi '(\left\langle \mathbf {v}, \mathbf {x} \right\rangle + f_{\varvec{\beta }}(\mathbf {x})) \left\langle \mathbf {w}- \mathbf {v}, \mathbf {x} \right\rangle + \frac{H}{2} \left\langle \mathbf {w}- \mathbf {v}, \mathbf {x} \right\rangle ^2 \nonumber \\&\quad \le \phi (\left\langle \mathbf {v}, \mathbf {x} \right\rangle + f_{\varvec{\beta }}(\mathbf {x})) + 2 \sqrt{H \phi (\left\langle \mathbf {v}, \mathbf {x} \right\rangle + f_{\varvec{\beta }}(\mathbf {x}))} \Vert \mathbf {w}- \mathbf {v}\Vert \Vert \mathbf {x}\Vert _\star + \frac{H}{2} \Vert \mathbf {w}- \mathbf {v}\Vert ^2 \Vert \mathbf {x}\Vert _\star ^2. \end{aligned}$$
(17)

To obtain the last inequality we used the generalized Cauchy-Schwarz inequality and the fact that for an H-smooth non-negative function \(\phi \) we have \(|\phi '(t)| \le \sqrt{4 H \phi (t)}\) (Srebro et al. 2010, Lemma 2.1). Now recall a property of a \(\sigma \)-strongly convex function F that holds for its minimizer \(\mathbf {v}\) and any \(\mathbf {w}\) (Shalev-Shwartz and Ben-David 2014, Lemma 13.5):

$$\begin{aligned} \Vert \mathbf {w}- \mathbf {v}\Vert ^2 \le \frac{2}{\sigma }(F(\mathbf {w}) - F(\mathbf {v})). \end{aligned}$$

Since inequality (17) holds for any \(\mathbf {v}\), we set \(\mathbf {v}= \mathbf {0}\), which is also the minimizer of \(\varOmega (\cdot )\), and apply the aforementioned property to get

$$\begin{aligned}&\phi (\left\langle \mathbf {w}, \mathbf {x} \right\rangle + f_{\varvec{\beta }}(\mathbf {x})) \le \phi (f_{\varvec{\beta }}(\mathbf {x})) + 2 \sqrt{\frac{2 H}{\sigma } \phi (f_{\varvec{\beta }}(\mathbf {x})) \varOmega (\mathbf {w})} \Vert \mathbf {x}\Vert _\star + \frac{H}{\sigma } \varOmega (\mathbf {w}) \Vert \mathbf {x}\Vert _\star ^2~\nonumber \\&\quad \Rightarrow ~ \ell (\left\langle \mathbf {w}, \mathbf {x}_i \right\rangle + f_{\varvec{\beta }}(\mathbf {x}_i), y_i) \le \tau _i\end{aligned}$$
(18)
$$\begin{aligned}&\quad := \ell (f_{\varvec{\beta }}(\mathbf {x}_i), y_i) + \sqrt{\frac{8 H B^2 \alpha }{\sigma } \hat{R}_S(f_{\varvec{\beta }}) \ell (f_{\varvec{\beta }}(\mathbf {x}_i), y_i)} + \frac{H B^2 \alpha }{\sigma } \hat{R}_S(f_{\varvec{\beta }}). \end{aligned}$$
(19)

The last inequality comes from the definition of the class \(\mathcal {W}\), namely \(\varOmega (\mathbf {w}) \le \alpha \hat{R}_S(f_{\varvec{\beta }})\), together with \(\Vert \mathbf {x}_i\Vert _\star \le B\). Now we consider the average and, by Jensen’s inequality applied to the square root,

$$\begin{aligned} r&= \frac{1}{m} \sum _{i=1}^m |\tau _i| = \hat{R}_S(f_{\varvec{\beta }}) + \frac{1}{m} \sum _{i=1}^m \sqrt{\frac{8 H B^2 \alpha }{\sigma } \hat{R}_S(f_{\varvec{\beta }}) \ell (f_{\varvec{\beta }}(\mathbf {x}_i), y_i)} + \frac{H B^2 \alpha }{\sigma } \hat{R}_S(f_{\varvec{\beta }})\\&\le \hat{R}_S(f_{\varvec{\beta }}) + \sqrt{\frac{8 H B^2 \alpha }{\sigma }} \hat{R}_S(f_{\varvec{\beta }}) + \frac{H B^2 \alpha }{\sigma } \hat{R}_S(f_{\varvec{\beta }}) \le \left( 1 + \sqrt{\frac{2 H B^2 \alpha }{\sigma }} \right) ^2 \hat{R}_S(f_{\varvec{\beta }}). \end{aligned}$$

This gives us

$$\begin{aligned} \hat{\mathfrak {R}}_S(\mathcal {L})&\le 4 \sqrt{3 H} (B + C) \left( 1 + \sqrt{\frac{2 H B^2 \alpha }{\sigma }} \right) \sqrt{\frac{\hat{R}_S(f_{\varvec{\beta }}) (\alpha \hat{R}_S(f_{\varvec{\beta }}) + \rho )}{m \sigma }}~\\&\le 4 \sqrt{3 H} (B + C) \left( 1 + \sqrt{\frac{2 H B^2 \alpha }{\sigma }} \right) \frac{\hat{R}_S(f_{\varvec{\beta }}) \sqrt{\alpha } + \sqrt{\hat{R}_S(f_{\varvec{\beta }}) \rho }}{\sqrt{m \sigma }}. \end{aligned}$$

Taking expectation w.r.t. the sample on both sides and applying Jensen’s inequality gives the statement. \(\square \)
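To make the behavior of the bound concrete, the following sketch evaluates the right-hand side of Theorem 5 for a few values of the empirical risk of the source combination; all constants are arbitrary placeholders rather than values taken from the paper.

    import math

    def rademacher_bound(R_hat, H=1.0, B=1.0, C=1.0, alpha=1.0, rho=1.0, sigma=1.0, m=1000):
        # Right-hand side of the bound of Theorem 5 (illustrative constants only).
        lead = 4 * math.sqrt(3 * H) * (B + C) * (1 + math.sqrt(2 * H * B**2 * alpha / sigma))
        return lead * (R_hat * math.sqrt(alpha) + math.sqrt(R_hat * rho)) / math.sqrt(m * sigma)

    for R_hat in (1.0, 0.1, 0.01, 0.0):
        # The complexity term shrinks with the source risk and vanishes when
        # the source combination is perfect on the training set.
        print(f"R_hat = {R_hat:5.2f}  ->  bound = {rademacher_bound(R_hat):.4f}")

As expected, a smaller \(\hat{R}_S(f_{\varvec{\beta }})\) yields a smaller complexity term, and a perfect source combination drives it to zero.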

6.3 Proofs of main results

Proof of Theorem 1

To show the statement we apply Theorem 4. In particular, we consider any choice of \(\mathbf {w}\) and \(\varvec{\beta }\) within the sets induced by a strongly convex function \(\varOmega \). To apply Theorem 4 we need to upper bound the Rademacher complexity of the loss class \(\mathcal {L}\), as well as the quantity \(r = \sup _{f \in \mathcal {L}} {{\mathrm{\mathbb {E}}}}_{(\mathbf {x},y)} [f(\mathbf {x}, y)]\).

We obtain the bound on the Rademacher complexity by applying Theorem 5. First define the loss class \( \mathcal {L}:= \left\{ (\mathbf {x}, y) \mapsto \ell (h(\mathbf {x}), y) \ : \ h \in \mathcal {H}\right\} , \) and the hypothesis class

$$\begin{aligned} \mathcal {H}:= \Big \{&\mathbf {x}\mapsto \left\langle \mathbf {w}, \mathbf {x} \right\rangle + h^{\text {src}}_{\varvec{\beta }}(\mathbf {x}) \ : \ \\&\quad \varOmega (\mathbf {w}) \le \frac{1}{\lambda } \hat{R}_S(h^{\text {src}}_{\varvec{\beta }}) \ \wedge \ \varOmega (\varvec{\beta }) \le \rho \ \wedge \ \hat{R}_S(h_{\mathbf {w}, \varvec{\beta }}) \le \hat{R}_S(h^{\text {src}}_{\varvec{\beta }}) \Big \}. \end{aligned}$$

To motivate the choice of the constraints, observe that for

$$\begin{aligned} \hat{\mathbf {w}}= \mathop {{{\mathrm{argmin}}}}\limits _{\mathbf {w}}\left\{ \hat{R}_S(h_{\mathbf {w}, \varvec{\beta }}) + \lambda \varOmega (\mathbf {w}) \right\} , \end{aligned}$$

we have \(\varOmega (\hat{\mathbf {w}}) \le \lambda ^{-1} \hat{R}_S(h_{\mathbf {0}, \varvec{\beta }}) = \lambda ^{-1} \hat{R}_S(h^{\text {src}}_{\varvec{\beta }})\) and \(\hat{R}_S(h_{\hat{\mathbf {w}}, \varvec{\beta }}) \le \hat{R}_S(h^{\text {src}}_{\varvec{\beta }})\), both obtained by comparing the value of the regularized objective at \(\hat{\mathbf {w}}\) with its value at \(\mathbf {w}= \mathbf {0}\). Hence, by applying Theorem 5 with \(\alpha = \frac{1}{\lambda }\) and \(f_{\varvec{\beta }}=h^{\text {src}}_{\varvec{\beta }}\), and assuming that \(\lambda \le \kappa \), we obtain

$$\begin{aligned} \mathfrak {R}(\mathcal {L}) \le \mathcal {O}\left( \frac{R^{\text {src}}\kappa }{\sqrt{m} \lambda } + \sqrt{\frac{R^{\text {src}}\rho \kappa ^2}{m \lambda }} \right) . \end{aligned}$$

Next we obtain the bound on r:

$$\begin{aligned}&r = \sup _{h \in \mathcal {H}}\mathop {{{\mathrm{\mathbb {E}}}}}\limits _{(\mathbf {x},y)} [\ell (h(\mathbf {x}), y)] = \sup _{h \in \mathcal {H}} \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{S} \left[ \hat{R}_S(h) \right] \le \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{S} \left[ \sup _{h \in \mathcal {H}} \hat{R}_S(h) \right] \le \mathop {{{\mathrm{\mathbb {E}}}}}\limits _{S} [\hat{R}_S(h^{\text {src}}_{\varvec{\beta }})] = R^{\text {src}}. \end{aligned}$$

The first inequality follows by exchanging the supremum and the expectation, and the second from the definition of the class \(\mathcal {H}\), which enforces \(\hat{R}_S(h) \le \hat{R}_S(h^{\text {src}}_{\varvec{\beta }})\). Plugging the bounds on the Rademacher complexity and r into the statement of Theorem 4, and applying the inequality \(\sqrt{a+b} \le \sqrt{a}+ \frac{b}{2\sqrt{a}}\) to the \(\sqrt{v}\) term, gives the statement. \(\square \)
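For concreteness, here is a minimal sketch of the algorithm family analyzed in Theorem 1, instantiated with the squared loss and \(\varOmega (\mathbf {w}) = \frac{1}{2}\Vert \mathbf {w}\Vert _2^2\), in which case \(\hat{\mathbf {w}}\) has a closed form: ridge regression on the residuals left by the source combination. The data, the weights \(\varvec{\beta }\), and all names are illustrative assumptions, not part of the formal development.

    import numpy as np

    def htl_erm(X, y, src_preds, beta, lam):
        # Minimizes (1/m) * sum_i (<w, x_i> + <beta, f(x_i)> - y_i)^2 + (lam/2) * ||w||^2,
        # i.e. ridge regression on the residuals y - F @ beta left by the sources.
        m, d = X.shape
        residual = y - src_preds @ beta
        A = X.T @ X / m + 0.5 * lam * np.eye(d)
        return np.linalg.solve(A, X.T @ residual / m)

    # Toy target task in which the sources already explain most of the signal.
    rng = np.random.default_rng(2)
    m, d, n = 50, 10, 3
    X = rng.normal(size=(m, d))
    src_preds = rng.normal(size=(m, n))      # columns: f_1(x_i), ..., f_n(x_i)
    beta = np.array([0.5, -0.2, 0.1])
    y = src_preds @ beta + 0.1 * X @ rng.normal(size=d)

    w_hat = htl_erm(X, y, src_preds, beta, lam=0.1)
    print("empirical risk:", np.mean((X @ w_hat + src_preds @ beta - y) ** 2))

With good sources the residuals are small, so the learned \(\hat{\mathbf {w}}\) stays in a small ball around \(\mathbf {0}\), which is exactly the regime in which the constraint \(\varOmega (\hat{\mathbf {w}}) \le \lambda ^{-1} \hat{R}_S(h^{\text {src}}_{\varvec{\beta }})\) is informative.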

Proof of Theorem 2

For any choice of \(\varvec{\beta }\) with \(\varOmega (\varvec{\beta }) \le \rho \), denote the best hypothesis in the class by

$$\begin{aligned} \mathbf {w}^\star = \mathop {{{\mathrm{argmin}}}}\limits _{\mathbf {w}~:~ \varOmega (\mathbf {w}) \le \tau } R(h_{\mathbf {w}, \varvec{\beta }}). \end{aligned}$$

By the definition of \(\hat{\mathbf {w}}\), we have

$$\begin{aligned} \hat{R}_S(h_{\hat{\mathbf {w}}, \varvec{\beta }}) + \lambda \varOmega (\hat{\mathbf {w}}) \le \hat{R}_S(h_{\mathbf {w}^\star , \varvec{\beta }}) + \lambda \varOmega (\mathbf {w}^\star ). \end{aligned}$$
(20)

Now denote

$$\begin{aligned} Z = \kappa \sqrt{\frac{R^{\text {src}}}{m}} (\sqrt{R^{\text {src}}} + \sqrt{\rho }). \end{aligned}$$

Then, by following the proof of Theorem 1 up to the application of the inequality \(\sqrt{a+b} \le \sqrt{a} + \frac{b}{2 \sqrt{a}}\), ignoring constants, using inequality (20), and assuming that \(\lambda \le \kappa \le 1\), we have that

$$\begin{aligned} R(h_{\hat{\mathbf {w}}, \varvec{\beta }})&\le \hat{R}_S(h_{\mathbf {w}^\star , \varvec{\beta }}) + \lambda \tau + \frac{Z}{\lambda } + \sqrt{\frac{M \eta }{m}} \sqrt{R^{\text {src}}+ \frac{Z}{\lambda }} + \frac{M \eta }{m} \nonumber \\&\le \hat{R}_S(h_{\mathbf {w}^\star , \varvec{\beta }}) + \lambda \tau + \frac{Z}{\lambda } + \sqrt{\frac{R^{\text {src}}M \eta }{m}} + \frac{\sqrt{Z M \eta }}{\sqrt{m} \lambda } + \frac{M \eta }{m}. \end{aligned}$$
(21)

Optimizing the r.h.s. over \(\lambda \) gives

$$\begin{aligned} \lambda ^\star = \sqrt{\frac{Z}{\tau } + \frac{1}{\tau }\sqrt{\frac{Z M \eta }{m}}}. \end{aligned}$$

We plug it back into (21) to obtain that

$$\begin{aligned} R(h_{\hat{\mathbf {w}}, \varvec{\beta }})&\le \hat{R}_S(h_{\mathbf {w}^\star , \varvec{\beta }}) + \sqrt{\tau } \sqrt{Z + \sqrt{\frac{Z M \eta }{m}}} + \sqrt{\frac{R^{\text {src}}M \eta }{m}} + \frac{M \eta }{m} \nonumber \\&\le \hat{R}_S(h_{\mathbf {w}^\star , \varvec{\beta }}) + \frac{\sqrt{R^{\text {src}}} + \root 4 \of {R^{\text {src}}\rho }}{\root 4 \of {m}} \sqrt{\kappa \tau } + \frac{\root 4 \of {R^{\text {src}}} + \root 8 \of {R^{\text {src}}\rho }}{\root 4 \of {m^{1.5}}} \root 4 \of {\kappa \tau ^2 M \eta } \nonumber \\&\quad + \sqrt{\frac{R^{\text {src}}M \eta }{m}} + \frac{M \eta }{m}. \end{aligned}$$
(22)
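Before the final concentration step, we note that the optimization over \(\lambda \) above can be sanity-checked numerically; the sketch below compares the closed-form \(\lambda ^\star \) against a grid search over the \(\lambda \)-dependent part of (21), with arbitrary placeholder constants of our choosing.

    import numpy as np

    # Arbitrary placeholder values for Z, tau, M, eta, m (illustrative only).
    Z, tau, M, eta, m = 0.05, 1.0, 1.0, 2.0, 100

    def rhs_lambda_part(lam):
        # Terms of the right-hand side of (21) that depend on lambda.
        return lam * tau + Z / lam + np.sqrt(Z * M * eta) / (np.sqrt(m) * lam)

    lam_star = np.sqrt(Z / tau + np.sqrt(Z * M * eta / m) / tau)  # closed form from the proof
    grid = np.linspace(1e-4, 2.0, 200001)
    lam_grid = grid[np.argmin(rhs_lambda_part(grid))]
    print(f"closed-form lambda* = {lam_star:.5f}, grid minimizer = {lam_grid:.5f}")

The two values agree up to the grid resolution.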

All that is left is to concentrate \(\hat{R}_S(h_{\mathbf {w}^\star , \varvec{\beta }})\) around its mean. Denoting the sum of the variances by

$$\begin{aligned} V = {{\mathrm{\mathbb {E}}}}\left[ \sum _{i=1}^m (\ell (h_{\mathbf {w}^\star , \varvec{\beta }}(\mathbf {x}_i), y_i) - R(h_{\mathbf {w}^\star , \varvec{\beta }}))^2\right] , \end{aligned}$$

we apply Bernstein’s inequality

$$\begin{aligned} {{\mathrm{\mathbb {P}}}}\left( \sum _{i=1}^m (\ell (h_{\mathbf {w}^\star , \varvec{\beta }}(\mathbf {x}_i), y_i) - R(h_{\mathbf {w}^\star , \varvec{\beta }})) > t \right) \le \exp \left( - \frac{t^2 / 2}{V + M t/3} \right) . \end{aligned}$$

Setting

$$\begin{aligned} e^{-\eta } = \exp \left( - \frac{t^2 / 2}{V + M t/3} \right) , \end{aligned}$$

and solving the resulting quadratic for t, using \(\sqrt{a+b} \le \sqrt{a} + \sqrt{b}\), yields \(t \le \sqrt{2 \eta V} + \frac{2 M \eta }{3}\). Hence, for any \(\eta \ge 0\), with probability at least \(1-e^{-\eta }\),

$$\begin{aligned} \hat{R}_S(h_{\mathbf {w}^\star , \varvec{\beta }})&\le R(h_{\mathbf {w}^\star , \varvec{\beta }}) + \sqrt{\frac{2 \eta {{\mathrm{\mathbb {E}}}}\left[ (\ell (h_{\mathbf {w}^\star , \varvec{\beta }}(\mathbf {x}_i), y_i) - R(h_{\mathbf {w}^\star , \varvec{\beta }}))^2\right] }{m}} + \frac{2 M \eta }{3 m}\\&\le R(h_{\mathbf {w}^\star , \varvec{\beta }}) + 2 \sqrt{\frac{R(h_{\mathbf {w}^\star , \varvec{\beta }}) M \eta }{m}} + \frac{2 M \eta }{3 m} \\&\le R(h_{\mathbf {w}^\star , \varvec{\beta }}) + 2 \sqrt{\frac{R^{\text {src}}M \eta }{m}} + \frac{2 M \eta }{3 m}. \end{aligned}$$

The second inequality uses \({{\mathrm{\mathbb {E}}}}[(\ell (h_{\mathbf {w}^\star , \varvec{\beta }}(\mathbf {x}_i), y_i) - R(h_{\mathbf {w}^\star , \varvec{\beta }}))^2] \le M R(h_{\mathbf {w}^\star , \varvec{\beta }})\), and the last one the observation that \(R(h_{\mathbf {w}^\star , \varvec{\beta }}) \le R(h_{\mathbf {0}, \varvec{\beta }}) = R^{\text {src}}\), since \(\mathbf {0}\) is feasible in the minimization defining \(\mathbf {w}^\star \). Plugging this result into (22) completes the proof. \(\square \)

7 Conclusions

In this paper we have formally captured and theoretically analyzed a general family of learning algorithms that transfer information from multiple supplied source hypotheses. In particular, our formulation stems from the regularized Empirical Risk Minimization principle with any non-negative smooth loss function and any strongly convex regularizer. We have analyzed the generalization ability and the excess risk of this family of HTL algorithms. Our analysis shows that a good source hypothesis combination facilitates faster generalization, at rate \(\mathcal {O}(1/m)\) instead of the usual \(\mathcal {O}(1/\sqrt{m})\). Furthermore, given a perfect source hypothesis combination, our analysis is consistent with the intuition that learning is not required at all. As a byproduct of our investigation, we have obtained new Rademacher complexity results for smooth loss classes, which could be of independent interest.

Our conclusions point to the key importance of a source hypothesis selection procedure: when an algorithm is provided with an enormous pool of source hypotheses, how should it select the relevant ones on the basis of only a few labeled examples? This may sound similar to the feature selection problem in the regime \(n \gg m\); however, earlier empirical studies by Tommasi et al. (2014) with hundreds of sources did not find much corroboration for this hypothesis when applying L1 regularization. Thus, it remains unclear whether having a few good sources among hundreds is a reasonable assumption.