Abstract
This chapter discusses the method of Kernel Ridge Regression, which is a very simple special case of Support Vector Regression. The main formula of the method is identical to a formula in Bayesian statistics, but Kernel Ridge Regression has performance guarantees that have nothing to do with Bayesian assumptions. I will discuss two kinds of such performance guarantees: those not requiring any assumptions whatsoever, and those depending on the assumption of randomness.
1 Introduction
This chapter is based on my talk at the Empirical Inference Symposium (see p.x). It describes some developments influenced by Vladimir Vapnik, which are related to, but much less well known than, the Support Vector Machine. The Support Vector Machine is a powerful combination of the idea of generalized portrait (1962; see Chap. 3) and kernel methods, and from the very beginning the performance guarantees for it were non-Bayesian, depending only on the assumption of randomness: the data are generated independently from the same probability distribution. An example of such a performance guarantee is (3.2); numerous other examples are given in Vapnik [15]. Kernel Ridge Regression (KRR) is a special case of Support Vector Regression whose main formula has been known in Bayesian statistics for a long time. However, the advent of the Support Vector Machine encouraged non-Bayesian analyses of KRR, and this chapter presents two examples of such analyses. The first example is in the tradition of prediction with expert advice [3] and involves no statistical assumptions whatsoever. The second example belongs to the area of conformal prediction [17] and only depends on the assumption of randomness.
2 Kernel Ridge Regression
It appears that the term “Kernel Ridge Regression” was coined in 2000 by Cristianini and Shawe-Taylor [5] to refer to a simplified version of Support Vector Regression; this was an adaptation of the earlier “ridge regression in dual variables” [12]. Take the usual Support Vector Regression in primal variables:
$$\displaystyle{\min _{w,b,\xi,\xi \prime }\ \|w\|^{2} + C\sum _{t=1}^{T}\left(\xi _{t}^{k} + {\xi \prime }_{t}^{k}\right)\quad \text{subject to}\quad \left\{\begin{array}{l} y_{t} - (w \cdot x_{t} + b) \leq \epsilon +\xi _{t},\\ (w \cdot x_{t} + b) - y_{t} \leq \epsilon +\xi \prime _{t},\\ \xi _{t} \geq 0,\ \xi \prime _{t} \geq 0,\quad t = 1,\ldots,T,\end{array}\right.}$$
where \((x_{t},y_{t}) \in {\mathbb{R}}^{n} \times \mathbb{R}\) are the training examples, w is the weight vector, b is the bias term, \(\xi _{t},\xi \prime _{t}\) are the slack variables, and T is the size of the training set; ε, C > 0 and \(k \in \{ 1,2\}\) are the parameters. Simplify the problem by ignoring the bias term b (it can be partially recovered by adding a dummy attribute 1 to all \(x_{t}\)), setting ε := 0, and setting k := 2. The optimization problem becomes
$$\displaystyle{\min _{w}\ \left(a\|w\|^{2} +\sum _{t=1}^{T}{(y_{t} - w \cdot x_{t})}^{2}\right)}$$
(where \(a:= 1/C\)), the usual Ridge Regression problem. And Vapnik’s usual method ([15], Sect. 11.3.2) then gives the prediction
$$\displaystyle{\hat{y}:= Y\prime {(K + aI)}^{-1}k\qquad (11.1)}$$
for the label of a new object x, where Y is the vector of labels (with components \({Y }^{t}:= y_{t}\)), K is the Gram matrix \(K_{s,t}:= x_{s} \cdot x_{t}\), and k is the vector with components \({k}^{t}:= x_{t} \cdot x\). The kernel trick replaces \(x_{t}\) by \(F(x_{t})\), and so K by the kernel matrix \(K_{s,t}:= \mathcal{K}(x_{s},x_{t})\) and k by the vector \({k}^{t}:= \mathcal{K}(x_{t},x)\), where \(\mathcal{K}\) is the kernel \(\mathcal{K}(x,x\prime ):= F(x) \cdot F(x\prime )\).
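As a concrete illustration, the kernelized prediction (11.1) takes only a few lines of code. This is a sketch rather than the chapter's implementation: the function and variable names are mine, and the RBF kernel is just one standard choice from kernel libraries.

```python
import numpy as np

def krr_predict(xs, ys, x_new, kernel, a):
    """KRR prediction (11.1): y_hat = Y' (K + aI)^{-1} k."""
    T = len(xs)
    # Kernel matrix K and vector k of kernel values against the new object
    K = np.array([[kernel(xs[s], xs[t]) for t in range(T)] for s in range(T)])
    k = np.array([kernel(xs[t], x_new) for t in range(T)])
    Y = np.array(ys, dtype=float)
    return Y @ np.linalg.solve(K + a * np.eye(T), k)

# Example with an RBF kernel (a standard off-the-shelf choice)
rbf = lambda u, v: np.exp(-np.dot(np.subtract(u, v), np.subtract(u, v)))
xs = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
ys = [0.0, 1.0, 4.0]
y_hat = krr_predict(xs, ys, np.array([1.5]), rbf, a=0.1)
```

As a sanity check, as a → 0 with a strictly positive definite kernel, KRR interpolates the training labels.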
This simple observation was made in [12], where this simplified SVR method was called “ridge regression in dual variables”. The calculation had no doubt been done even earlier, but the result did not appear particularly useful. First, compared to the “full” SVM, there is no sparsity in examples (nor in attributes, as in the case of the Lasso). Second, having an explicit formula is an advantage, but the formula is not new: mathematically, the formula for KRR coincides with one of the formulas in kriging [4], an old method in geostatistics for predicting values of a Gaussian random field; this formula had also been widely used in Bayesian statistics.
However, there is a philosophical and practical difference:
- In kriging, the kernel is estimated from the results of observations, and in Bayesian statistics it is supposed to reflect the statistician’s beliefs;
- In KRR, as in Support Vector Regression in general, the kernel is not supposed to reflect any knowledge or beliefs about reality, and the usual approach is pragmatic: one consults standard libraries of kernels and uses whatever works.
In the remaining sections of this chapter we will explore KRR in the SVM style, without making Bayesian assumptions. The practical side of this non-Bayesian aspect of KRR is that it often gives good results on real-world data, despite the Bayesian assumptions being manifestly wrong. We will, however, concentrate on its theoretical side: non-Bayesian performance guarantees for KRR.
An important special case of KRR is (ordinary) Ridge Regression (RR): as far as the output is concerned, it is the special case of KRR with \(\mathcal{K}\) the dot product. However, in the case of RR the usual representation of the prediction is
$$\displaystyle{\hat{y}:= x\prime {(X\prime X + aI)}^{-1}X\prime Y\qquad (11.2)}$$
rather than (11.1), where X is the matrix whose rows are \(x\prime _{1},\ldots,x\prime _{T}\); there are many ways to show that (11.2) and (11.1) indeed coincide when \(\mathcal{K}\) is the dot product.
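One such way follows from the push-through identity \({(XX\prime + aI)}^{-1}X = X{(X\prime X + aI)}^{-1}\); the coincidence is also easy to verify numerically. A sketch with made-up data (all names mine):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, a = 8, 3, 0.7
X = rng.standard_normal((T, n))   # rows are x'_1, ..., x'_T
Y = rng.standard_normal(T)
x = rng.standard_normal(n)        # new object

# (11.1) with the dot-product kernel: K = XX', k = Xx
pred_dual = Y @ np.linalg.solve(X @ X.T + a * np.eye(T), X @ x)
# (11.2): the usual Ridge Regression representation
pred_primal = x @ np.linalg.solve(X.T @ X + a * np.eye(n), X.T @ Y)
```

The dual form inverts a T × T matrix and the primal form an n × n matrix, so the cheaper one depends on whether T or n is larger.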
Under a standard Bayesian assumption (which we do not state explicitly in general; see, e.g., [17], Sect. 10.3), the conditional distribution of the label y of a new example (x, y) given \(x_{1},\ldots,x_{T},x\) and \(y_{1},\ldots,y_{T}\) is
$$\displaystyle{N\left(k\prime {(K + aI)}^{-1}Y,\ \frac{{\sigma }^{2}} {a}\mathcal{K}(x,x) -\frac{{\sigma }^{2}} {a}k\prime {(K + aI)}^{-1}k +{\sigma }^{2}\right),\qquad (11.3)}$$
where K and k are as before (the postulated probability distribution generating the examples depends on \(\mathcal{K}\) and a, and we parameterize a normal probability distribution N(μ, σ²) by its mean μ and variance σ²). The mean of the distribution (11.3) is the KRR prediction, but now we have not only a point prediction but also an estimate of its accuracy.
When \(\mathcal{K}\) is the dot product, (11.3) can be rewritten as
$$\displaystyle{N\left(x\prime {(X\prime X + aI)}^{-1}X\prime Y,\ {\sigma }^{2}x\prime {(X\prime X + aI)}^{-1}x +{\sigma }^{2}\right).\qquad (11.4)}$$
In this case the Bayesian assumption can be stated as follows: \(x_{1},x_{2},\ldots\) are fixed vectors in \({\mathbb{R}}^{n}\) (alternatively, we can make our analysis conditional on their values) and
$$\displaystyle{y_{t} =\theta \prime x_{t} +\xi _{t},\quad t = 1,2,\ldots,\qquad (11.5)}$$
where \(\theta \sim N(0,{(\sigma }^{2}/a)I)\) and \(\xi _{t} \sim N(0{,\sigma }^{2})\) are all independent.
Equations (11.3) and (11.4) give exhaustive information about the next observation; the Bayesian assumption, however, is rarely satisfied.
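In the dot-product case, the predictive distribution (11.4) follows from the standard Bayesian linear-regression formulas. The sketch below (all names mine) draws data from the model (11.5) and computes the predictive mean, which is exactly the RR prediction (11.2), and the predictive variance:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, a, sigma = 20, 2, 1.0, 0.5
X = rng.standard_normal((T, n))
theta = rng.normal(0.0, sigma / np.sqrt(a), n)   # theta ~ N(0, (sigma^2/a) I)
Y = X @ theta + rng.normal(0.0, sigma, T)        # the model (11.5)
x = rng.standard_normal(n)                       # new object

# Posterior over theta is N((X'X + aI)^{-1} X'Y, sigma^2 (X'X + aI)^{-1})
A = X.T @ X + a * np.eye(n)
post_mean = np.linalg.solve(A, X.T @ Y)
pred_mean = x @ post_mean                                   # mean of (11.4): the RR prediction
pred_var = sigma**2 * (x @ np.linalg.solve(A, x)) + sigma**2  # variance of (11.4)
```

The predictive variance always exceeds the noise variance σ², the excess reflecting the remaining uncertainty about θ.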
3 Kernel Ridge Regression Without Probability
It turns out that KRR has interesting performance guarantees even if we do not make any stochastic assumptions whatsoever. Due to lack of space no proofs will be given; they can be found in the technical report [21].
In this section we consider the following perfect-information protocol of on-line regression (Protocol 1):

Protocol 1 (on-line regression)
FOR t = 1, 2, …:
  Reality announces \(x_{t} \in \mathbf{X}\);
  Learner announces \(\hat{y}_{t} \in \mathbb{R}\);
  Reality announces \(y_{t} \in \mathbb{R}\).
First we consider the case where the space X from which the objects x t are drawn is a Euclidean space, \(\mathbf{X}:= {\mathbb{R}}^{n}\), and our goal is to compete with linear functions; in this case ordinary Ridge Regression is a suitable strategy for Learner. Then we move on to the case of an arbitrary X and replace RR by KRR.
3.1 Ordinary Ridge Regression
In this section, \(\mathbf{X} = {\mathbb{R}}^{n}\). The RR strategy for Learner is given by the formula \(\hat{y}_{t}:= b_{t-1}\prime A_{t-1}^{-1}x_{t}\), where \(b_{0},b_{1},\ldots\) is the sequence of vectors and \(A_{0},A_{1},\ldots\) is the sequence of matrices defined by
$$\displaystyle{b_{t}:=\sum _{s=1}^{t}y_{s}x_{s},\qquad A_{t}:= aI +\sum _{s=1}^{t}x_{s}x_{s}\prime }$$
(cf. (11.2)), where a > 0 is a parameter. The incremental update of the matrix \(A_{t}^{-1}\) can be done efficiently by the Sherman–Morrison formula. The following performance guarantee is proved in [21], Sect. 2.
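A minimal on-line RR implementation with the Sherman–Morrison update (a sketch; all names are mine): maintaining \(A_{t}^{-1}\) incrementally costs O(n²) per step instead of O(n³) for a fresh inversion.

```python
import numpy as np

def sherman_morrison_update(A_inv, x):
    """Given A^{-1}, return (A + x x')^{-1} in O(n^2)."""
    Ax = A_inv @ x
    return A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)

def rr_online(examples, a, n):
    """On-line Ridge Regression: predict y_t as b'_{t-1} A_{t-1}^{-1} x_t."""
    A_inv = np.eye(n) / a          # A_0 = aI
    b = np.zeros(n)                # b_0 = 0
    preds = []
    for x, y in examples:
        preds.append(b @ (A_inv @ x))
        A_inv = sherman_morrison_update(A_inv, x)  # A_t = A_{t-1} + x x'
        b = b + y * x                              # b_t = b_{t-1} + y x
    return preds
```

For example, on the sequence x₁ = x₂ = 1, y₁ = y₂ = 1 with a = 1 the predictions are 0 and 1/2.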
Theorem 11.1.
The Ridge Regression strategy for Learner with parameter a > 0 satisfies, at any step T,
$$\displaystyle{\sum _{t=1}^{T}\frac{{(y_{t} -\hat{ y}_{t})}^{2}} {1 + x_{t}\prime A_{t-1}^{-1}x_{t}} =\min _{\theta \in {\mathbb{R}}^{n}}\left(\sum _{t=1}^{T}{(y_{t} -\theta \prime x_{t})}^{2} + a\|\theta \|_{2}^{2}\right).\qquad (11.6)}$$
The part \(x_{t}\prime A_{t-1}^{-1}x_{t}\) in the denominator of (11.6) is usually close to 0 for large t.
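Theorem 11.1 is an exact identity (see also Zhdanov and Kalnishkan [20]) and can be checked numerically: the cumulative loss of RR, with each term discounted by \(1 + x_{t}\prime A_{t-1}^{-1}x_{t}\), equals the minimum regularized loss. A sketch with made-up data (all names mine):

```python
import numpy as np

rng = np.random.default_rng(2)
T, n, a = 15, 3, 0.5
xs = rng.standard_normal((T, n))
ys = rng.standard_normal(T)

# Left-hand side: discounted cumulative loss of on-line RR
lhs = 0.0
A, b = a * np.eye(n), np.zeros(n)
for x, y in zip(xs, ys):
    y_hat = b @ np.linalg.solve(A, x)                  # b'_{t-1} A_{t-1}^{-1} x_t
    lhs += (y - y_hat) ** 2 / (1.0 + x @ np.linalg.solve(A, x))
    A += np.outer(x, x)
    b += y * x

# Right-hand side: minimum of the regularized square loss over theta
theta = np.linalg.solve(xs.T @ xs + a * np.eye(n), xs.T @ ys)
rhs = np.sum((ys - xs @ theta) ** 2) + a * (theta @ theta)
```

The two quantities agree to machine precision, whatever data sequence is fed in.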
Theorem 11.1 has been adapted to the Bayesian setting by Zhdanov and Kalnishkan [20], who also notice that it can be extracted from [1] (by summing their (4.21) in an exact rather than an estimated form).
Theorem 11.1 and its kernel version (Theorem 11.2 below) imply surprisingly many well-known inequalities.
Corollary 11.1.
Assume \(\vert y_{t}\vert \leq \mathbf{y}\) for all t, clip the predictions of the Ridge Regression strategy to \([-\mathbf{y},\mathbf{y}]\), and denote them by \(\hat{y}_{t}^{\mathbf{y}}\). Then
$$\displaystyle{\sum _{t=1}^{T}{(y_{t} -\hat{ y}_{t}^{\mathbf{y}})}^{2} \leq \min _{\theta \in {\mathbb{R}}^{n}}\left(\sum _{t=1}^{T}{(y_{t} -\theta \prime x_{t})}^{2} + a\|\theta \|_{2}^{2}\right) + 4{\mathbf{y}}^{2}\ln \det \left(I + \frac{1} {a}\sum _{t=1}^{T}x_{t}x_{t}\prime \right).\qquad (11.7)}$$
The bound (11.7) is exactly the bound obtained in [16] (Theorem 4) for the algorithm merging linear experts with predictions clipped to \([-\mathbf{y},\mathbf{y}]\); that algorithm does not have a closed-form description and so is less interesting than clipped RR. The bound for the strategy called the AAR in [16] has \({\mathbf{y}}^{2}\) in place of \(4{\mathbf{y}}^{2}\) ([16], Theorem 1). (The AAR is very similar to RR: its predictions are \(b\prime _{t-1}A_{t}^{-1}x_{t}\) rather than \(b\prime _{t-1}A_{t-1}^{-1}x_{t}\); it is called the Vovk–Azoury–Warmuth algorithm in [3].) The regret term in (11.7) is of logarithmic order in T if \(\|x_{t}\|_{\infty }\leq X\) for all t, because
$$\displaystyle{\ln \det \left(I + \frac{1} {a}\sum _{t=1}^{T}x_{t}x_{t}\prime \right) \leq n\ln \left(1 + \frac{T{X}^{2}} {a}\right)\qquad (11.8)}$$
(the determinant of a positive definite matrix is bounded by the product of its diagonal elements; see [2], Chap. 2, Theorem 7). From Theorem 11.1 we can also deduce Theorem 11.7 in [3], which is somewhat similar to Corollary 11.1. That theorem implies (11.7) when RR’s predictions happen to be in \([-\mathbf{y},\mathbf{y}]\) without clipping (but this is not what Corollary 11.1 asserts).
RR is not as good as the AAR in the setting where \(\sup _{t}\vert y_{t}\vert \leq \mathbf{y}\) and the goal is to obtain bounds of the form (11.7) (since the AAR is to some degree optimized for this setting), but is still very good; and we can achieve an interesting equality (rather than inequality) for it.
The upper bound (11.7) does not hold for the RR strategy if the coefficient 4 is replaced by any number less than \(\frac{3} {2\ln 2} \approx 2.164\), as can be seen from an example given in [16] (Theorem 3), where the left-hand side of (11.7) is 4T + o(T), the minimum on the right-hand side is at most T, \(\mathbf{y} = 1\), and the logarithm is 2Tln2 + O(1). It is also known that there is no strategy achieving (11.7) with a coefficient less than 1 instead of 4, even in the case where \(\left \|x_{t}\right \|_{\infty }\leq X\) for all t: see Theorem 2 in [16].
There is also an upper bound on the cumulative square loss of the RR strategy without a logarithmic part and without assuming that the labels are bounded.
Corollary 11.2.
If \(\|x_{t}\|_{2} \leq Z\) for all t, then the Ridge Regression strategy for Learner with parameter a > 0 satisfies, at any step T,
$$\displaystyle{\sum _{t=1}^{T}{(y_{t} -\hat{ y}_{t})}^{2} \leq \left(1 + \frac{{Z}^{2}} {a}\right)\min _{\theta \in {\mathbb{R}}^{n}}\left(\sum _{t=1}^{T}{(y_{t} -\theta \prime x_{t})}^{2} + a\|\theta \|_{2}^{2}\right).}$$
This bound is better than the bound in Corollary 3.1 of [8], which has an additional regret term of the logarithmic order in time.
Asymptotic properties of the RR strategy can be further studied using Corollary A.1 of Kumon et al. [9], which states that if \(\|x_{t}\|_{2} \leq 1\) for all t, then \(x_{t}\prime A_{t-1}^{-1}x_{t} \rightarrow 0\) as t → ∞. It is clear that the condition \(\|x_{t}\|_{2} \leq 1\) for all t can be replaced by \(\sup _{t}\|x_{t}\|_{2} < \infty \). This gives the following corollary, which can be summarized as follows: if there exists a very good expert (asymptotically), then RR also predicts very well; if there is no such very good expert, RR performs asymptotically as well as the best regularized expert.
Corollary 11.3.
Let a > 0 and \(\hat{y}_{t}\) be the predictions output by the Ridge Regression strategy with parameter a. Suppose \(\sup _{t}\left \|x_{t}\right \|_{2} < \infty \). Then
$$\displaystyle{\inf _{\theta \in {\mathbb{R}}^{n}}\sum _{t=1}^{\infty }{(y_{t} -\theta \prime x_{t})}^{2} < \infty \Longrightarrow \sum _{t=1}^{\infty }{(y_{t} -\hat{ y}_{t})}^{2} < \infty }$$
and
$$\displaystyle{\inf _{\theta \in {\mathbb{R}}^{n}}\sum _{t=1}^{\infty }{(y_{t} -\theta \prime x_{t})}^{2} = \infty \Longrightarrow \frac{\sum _{t=1}^{T}{(y_{t} -\hat{ y}_{t})}^{2}} {\min _{\theta \in {\mathbb{R}}^{n}}\left(\sum _{t=1}^{T}{(y_{t} -\theta \prime x_{t})}^{2} + a\|\theta \|_{2}^{2}\right)} \rightarrow 1\quad (T \rightarrow \infty ).}$$
3.2 Kernel Ridge Regression
In this section, X is an arbitrary set. Let \(\mathcal{F}\) be the RKHS of functions on X with kernel \(\mathcal{K}\). The KRR strategy for Learner with parameter a > 0 is defined by the formula (11.1) applied to the past examples.
The following version of Theorem 11.1 for KRR can be derived from Theorem 11.1 itself (see [21], Sect. 6, for details).
Theorem 11.2.
The KRR strategy with parameter a > 0 for Learner satisfies, at any step T,
$$\displaystyle{\sum _{t=1}^{T} \frac{{(y_{t} -\hat{ y}_{t})}^{2}} {1 + \frac{1} {a}\left(\mathcal{K}(x_{t},x_{t}) - k_{t-1}\prime {(K_{t-1} + aI)}^{-1}k_{t-1}\right)} =\min _{f\in \mathcal{F}}\left(\sum _{t=1}^{T}{(y_{t} - f(x_{t}))}^{2} + a\|f\|_{\mathcal{F}}^{2}\right),}$$
where \(K_{t-1}\) is the kernel matrix of \(x_{1},\ldots,x_{t-1}\) and \(k_{t-1}\) is the vector with components \(\mathcal{K}(x_{s},x_{t})\), \(s = 1,\ldots,t - 1\).
The denominator on the left-hand side tends to 1 under some regularity conditions:
Lemma 11.1 ([20], Lemma 2).
Let \(\mathcal{K}\) be a continuous kernel on a compact metric space X. Then, for any sequence \(x_{1},x_{2},\ldots\) of objects,
$$\displaystyle{\mathcal{K}(x_{t},x_{t}) - k_{t-1}\prime {(K_{t-1} + aI)}^{-1}k_{t-1} \rightarrow 0\quad (t \rightarrow \infty ).}$$
Again, we can derive several interesting corollaries from Theorem 11.2.
Corollary 11.4.
Assume \(\left \vert y_{t}\right \vert \leq \mathbf{y}\) for all t and let \(\hat{y}_{t}^{\mathbf{y}}\) be the predictions of the KRR strategy clipped to \([-\mathbf{y},\mathbf{y}]\). Then
$$\displaystyle{\sum _{t=1}^{T}{(y_{t} -\hat{ y}_{t}^{\mathbf{y}})}^{2} \leq \min _{f\in \mathcal{F}}\left(\sum _{t=1}^{T}{(y_{t} - f(x_{t}))}^{2} + a\|f\|_{\mathcal{F}}^{2}\right) + 4{\mathbf{y}}^{2}\ln \det \left(I + \frac{1} {a}K_{T}\right),\qquad (11.9)}$$
where \(K_{T}\) is the kernel matrix of \(x_{1},\ldots,x_{T}\).
But now we have a problem: in general, the \(\ln \det \) term in (11.9) is not small compared to T. However, we still have the analogue of Corollary 11.3 (for a detailed derivation, see [20]).
Corollary 11.5 ([20], Corollary 4).
Let X be a compact metric space and \(\mathcal{K}\) be a continuous kernel on X. Then
$$\displaystyle{\inf _{f\in \mathcal{F}}\sum _{t=1}^{\infty }{(y_{t} - f(x_{t}))}^{2} < \infty \Longrightarrow \sum _{t=1}^{\infty }{(y_{t} -\hat{ y}_{t})}^{2} < \infty }$$
and
$$\displaystyle{\inf _{f\in \mathcal{F}}\sum _{t=1}^{\infty }{(y_{t} - f(x_{t}))}^{2} = \infty \Longrightarrow \frac{\sum _{t=1}^{T}{(y_{t} -\hat{ y}_{t})}^{2}} {\min _{f\in \mathcal{F}}\left(\sum _{t=1}^{T}{(y_{t} - f(x_{t}))}^{2} + a\|f\|_{\mathcal{F}}^{2}\right)} \rightarrow 1\quad (T \rightarrow \infty ).}$$
To obtain a non-asymptotic result of this kind under the assumption \(\sup _{t}\vert y_{t}\vert \leq \mathbf{y}\), let us first assume that the number of steps T is known in advance. We will need the notation \(c_{\mathcal{F}}:= \sqrt{\sup _{x\in \mathbf{X} } \mathcal{K}(x, x)}\). Bounding the logarithm of the determinant in (11.9) we have
$$\displaystyle{\ln \det \left(I + \frac{1} {a}K_{T}\right) \leq T\ln \left(1 + \frac{c_{\mathcal{F}}^{2}} {a}\right)}$$
(cf. (11.8)), where \(K_{T}\) is the kernel matrix of \(x_{1},\ldots,x_{T}\). Since we know the number T of steps in advance, we can choose a specific value for a; let \(a:= c_{\mathcal{F}}\sqrt{T}\). This gives us an upper bound with a regret term of the order \(O(\sqrt{T})\) for any \(f \in \mathcal{F}\):
$$\displaystyle{\sum _{t=1}^{T}{(y_{t} -\hat{ y}_{t}^{\mathbf{y}})}^{2} \leq \sum _{t=1}^{T}{(y_{t} - f(x_{t}))}^{2} + \left(\|f\|_{\mathcal{F}}^{2} + 4{\mathbf{y}}^{2}\right)c_{\mathcal{F}}\sqrt{T}.}$$
If we do not know the number of steps in advance, it is possible to achieve a similar bound using a mixture of KRR over the parameter a with a suitable prior over a:
$$\displaystyle{\sum _{t=1}^{T}{(y_{t} -\hat{ y}_{t}^{\mathbf{y}})}^{2} \leq \sum _{t=1}^{T}{(y_{t} - f(x_{t}))}^{2} + O({\mathbf{y}}^{2})\left(\|f\|_{\mathcal{F}}^{2} + 1\right)c_{\mathcal{F}}{T}^{1/2+\delta }\qquad (11.10)}$$
for any arbitrarily small δ > 0, where the constant implicit in \(O({\mathbf{y}}^{2})\) depends only on δ. (No proof of this result has been published.) The inequality (11.10) still looks asymptotic in that it contains an O term; however, it is easy to obtain an explicit (but slightly messier) non-asymptotic inequality.
In particular, (11.10) shows that if \(\mathcal{K}\) is a universal kernel [14] on a topological space X, KRR is competitive with all continuous functions on X: for any continuous \(f: \mathbf{X} \rightarrow \mathbb{R}\),
$$\displaystyle{\limsup _{T\rightarrow \infty }\frac{1} {T}\left(\sum _{t=1}^{T}{(y_{t} -\hat{ y}_{t}^{\mathbf{y}})}^{2} -\sum _{t=1}^{T}{(y_{t} - f(x_{t}))}^{2}\right) \leq 0\qquad (11.11)}$$
(assuming \(\vert y_{t}\vert \leq \mathbf{y}\) for all t). For example, (11.11) holds for X a compact set in \({\mathbb{R}}^{n}\), \(\mathcal{K}\) an RBF kernel, and \(f: \mathbf{X} \rightarrow \mathbb{R}\) any continuous function (see Example 1 in [14]). For continuous universal kernels on compact spaces, (11.11) also follows from Corollary 11.5.
4 Kernel Ridge Regression in Conformal Prediction
Suppose we would like to have prediction intervals rather than point predictions, and we would like them to have guaranteed coverage probabilities. It is clear that to achieve this we need a stochastic assumption; it turns out that the randomness assumption is often sufficient to obtain informative prediction intervals. In general, our algorithms will output prediction sets (usually intervals, but not always); to obtain prediction intervals we will apply convex closure (which can only improve coverage probability).
The special case of conformal prediction discussed in this section works as follows. Suppose we have an “underlying algorithm” (such as KRR) producing point predictions in \(\mathbb{R}\). Let \((x_{1},y_{1}),\ldots,(x_{T},y_{T})\) be a training set and \(x_{T+1}\) be a new object. To find the prediction set for \(y_{T+1}\) at a significance level ε ∈ (0, 1):
- For each possible label \(z \in \mathbb{R}\):
  - Set \(y_{T+1}:= z\);
  - For each \(t \in \{ 1,\ldots,T + 1\}\) compute the nonconformity score \(\alpha _{t}^{z}:= \left \vert y_{t} -\hat{ y}_{t}^{z}\right \vert \), where \(\hat{y}_{t}^{z}\) is the point prediction for the label of \(x_{t}\) computed by the underlying algorithm from the extended training set \((x_{1},y_{1}),\ldots,(x_{T+1},y_{T+1})\);
  - Compute the p-value
    $$\displaystyle{p(z):= \frac{1} {T + 1}\sum _{t=1}^{T+1}\boldsymbol{1}_{\{\alpha _{ t}^{z}\geq \alpha _{T+1}^{z}\}},}$$
    where \(\boldsymbol{1}_{\{\cdot \}}\) is the indicator function;
- Output the prediction set \(\{z \in \mathbb{R}\ \vert \ p(z) >\epsilon \}\), where ε is the given significance level.
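The steps above can be sketched with RR as the underlying algorithm. The exact RRCM computes the prediction set without discretization; the grid over candidate labels below is a simplification of mine, as are all the names.

```python
import numpy as np

def conformal_interval(xs, ys, x_new, a, eps, grid):
    """Conformal prediction set for y_{T+1}, approximated on a grid of labels z,
    with Ridge Regression as the underlying algorithm."""
    accepted = []
    for z in grid:
        X = np.vstack([xs, x_new])
        Y = np.append(ys, z)                       # extended training set, y_{T+1} := z
        n = X.shape[1]
        theta = np.linalg.solve(X.T @ X + a * np.eye(n), X.T @ Y)
        alpha = np.abs(Y - X @ theta)              # nonconformity scores |y_t - y_hat_t^z|
        p = np.mean(alpha >= alpha[-1])            # p-value of the candidate label z
        if p > eps:
            accepted.append(z)
    # Convex closure of the accepted labels gives a prediction interval
    return (min(accepted), max(accepted)) if accepted else None
```

By Theorem 11.3, the resulting interval fails to cover \(y_{T+1}\) with probability at most ε under the randomness assumption, whatever the data distribution.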
This set predictor is the conformal predictor based on the given underlying algorithm. Conformal predictors have a guaranteed coverage probability:
Theorem 11.3.
The probability that the prediction set output by a conformal predictor is an error (i.e., fails to include y T+1 ) does not exceed the significance level ε.
Moreover, in the on-line prediction protocol (Protocol 1, in which Reality outputs (x t , y t ) independently from the same probability distribution), the long-run frequency of errors also does not exceed ε almost surely. For a proof, see [17] (Theorem 8.1).
The property of conformal predictors asserted by Theorem 11.3 is their validity. Validity being achieved automatically, the remaining desiderata for conformal predictors are their “efficiency” (we want the prediction sets to be small, in a suitable sense) and “conditional validity” (we might want to have prespecified coverage probabilities conditional on the training set or some property of the new example).
The idea of conformal prediction is inspired by the Support Vector Machine (and the notation α for nonconformity scores is adapted from Vapnik’s Lagrange multipliers). The immediate precursor of conformal predictors was described in the paper [7] co-authored by Vapnik, which is based on the idea that a small number of support vectors warrants a high level of confidence in the SVM’s prediction. This idea was present in Vapnik and Chervonenkis’s thinking in the 1960s: see, e.g., (3.2) and [15], Theorem 10.5. The method was further developed in [17]; see [13] for a tutorial.
In the case where the conformal predictor is built on top of RR or KRR, there is no need to go over all potential labels \(z \in \mathbb{R}\). The set prediction for the example \((x_{T+1},y_{T+1})\) can be computed in time O(T log T) (in the case of RR) or O(T²) (in the case of KRR). This involves only solving linear equations and sorting; the simple resulting algorithm is called the Ridge Regression Confidence Machine (RRCM) in [11] and [17]. There is an R package implementing the RRCM (in the case of RR), PredictiveRegression, available from CRAN.
The Bayes predictions (11.3) and (11.4) can be easily converted into prediction intervals. But they are valid only under the postulated probability model, whereas the prediction intervals output by the RRCM are valid under the randomness assumption (as is common in machine learning). This is illustrated by Fig. 11.1, which is a version of Wasserman’s Fig. 1 in [19]. We consider the simplest case, where \(x_{t} = 1\) for all t; therefore, the examples \((x_{t},y_{t})\) can be identified with their labels \(y_{t} \in \mathbb{R}\), which we will call observations. The chosen significance level is 20 % and the kernel \(\mathcal{K}\) is the dot product. In the top plot, the four observations are generated from N(1, 1); in the middle plot, from N(10, 1); and in the bottom plot, from N(100, 1). The blue lines are the prediction intervals computed by the RRCM with a = 1 and the red lines are the Bayes prediction intervals computed as the shortest intervals containing 80 % of the mass (11.4) with a = 1 and σ = 1.
All observations are generated from N(θ, 1) for various constants θ. When θ = 1 (and so the Bayesian assumption (11.5) can be regarded as satisfied), the Bayes prediction intervals are on average only slightly shorter than the RRCM’s (the Bayes prediction interval happens to be wider in Fig. 11.1; over random seeds of the random number generator, the Bayes prediction intervals are shorter in about 54 % of cases). But as θ grows, the RRCM’s prediction intervals also grow (in order to cover the observations), whereas the width of the Bayes prediction intervals remains constant. When θ = 100 (and so (11.5) is clearly violated), the Bayes prediction intervals give very misleading results.
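The contrast can be reproduced in a few lines. The sketch below (a simplification; the grid-based conformal computation and all names are mine) computes conformal interval widths for four observations drawn from N(1, 1) and from N(100, 1), and the Bayes interval width from (11.4), which does not depend on θ.

```python
import numpy as np

def conformal_width(ys, a=1.0, eps=0.2):
    """Width of the conformal interval for the next observation (all objects x_t = 1)."""
    grid = np.linspace(min(ys) - 50, max(ys) + 50, 2001)
    accepted = []
    for z in grid:
        Y = np.append(ys, z)
        theta = Y.sum() / (len(Y) + a)   # ridge fit for x_t = 1
        alpha = np.abs(Y - theta)        # nonconformity scores
        if np.mean(alpha >= alpha[-1]) > eps:
            accepted.append(z)
    return max(accepted) - min(accepted)

rng = np.random.default_rng(3)
w_near = conformal_width(rng.normal(1.0, 1.0, 4))    # observations from N(1, 1)
w_far = conformal_width(rng.normal(100.0, 1.0, 4))   # observations from N(100, 1)

# Bayes interval width from (11.4) with a = sigma = 1: independent of theta
sigma, a, T = 1.0, 1.0, 4
bayes_width = 2 * 1.2816 * np.sqrt(sigma**2 / (T + a) + sigma**2)  # 1.2816 ~ 90% normal quantile
```

The conformal width grows with θ because the ridge fit shrinks towards 0, inflating the residuals; the Bayes width stays fixed even when the prior is badly wrong.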
In parametric statistics, it is widely believed that the choice of the prior does not matter much: the data will eventually swamp the prior. However, even in parametric statistics the model (such as N(θ, 1)) itself may be wrong.
In nonparametric statistics, the situation is much worse:
the prior can swamp the data, no matter how much data you have
(Diaconis and Freedman [6], Sect. 4). In this case, using Bayes prediction intervals becomes particularly problematic. The RRCM can be interpreted as an example of renormalized Bayes, as discussed in [18] and later papers.
As mentioned earlier, the RRCM is valid under the assumption of randomness; no further assumptions are required. However, conditional validity and, especially, efficiency do require extra assumptions. For example, [10] uses standard statistical assumptions used in density estimation to demonstrate the conditional validity and efficiency of a purpose-built conformal predictor. It remains an open problem to establish whether similar results hold for the RRCM.
References
1. Azoury, K.S., Warmuth, M.K.: Relative loss bounds for on-line density estimation with the exponential family of distributions. Mach. Learn. 43, 211–246 (2001)
2. Beckenbach, E.F., Bellman, R.: Inequalities. Springer, Berlin (1965)
3. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press, Cambridge (2006)
4. Cressie, N.: The origins of kriging. Math. Geol. 22, 239–252 (1990)
5. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-Based Methods. Cambridge University Press, Cambridge (2000)
6. Diaconis, P., Freedman, D.: On the consistency of Bayes estimates (with discussion). Ann. Stat. 14, 1–67 (1986)
7. Gammerman, A., Vovk, V., Vapnik, V.: Learning by transduction. In: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, pp. 148–155. Morgan Kaufmann, San Francisco (1998)
8. Kakade, S.M., Ng, A.Y.: Online bounds for Bayesian algorithms. In: Proceedings of the Eighteenth Annual Conference on Neural Information Processing Systems, Vancouver (2004)
9. Kumon, M., Takemura, A., Takeuchi, K.: Sequential optimizing strategy in multi-dimensional bounded forecasting games. Stoch. Process. Appl. 121, 155–183 (2011)
10. Lei, J., Wasserman, L.: Distribution free prediction bands. Tech. Rep. arXiv:1203.5422 [stat.ME], arXiv.org e-Print archive (2012). To appear in the Journal of the Royal Statistical Society B
11. Nouretdinov, I., Melluish, T., Vovk, V.: Ridge regression confidence machine. In: Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, pp. 385–392. Morgan Kaufmann, San Francisco (2001)
12. Saunders, C., Gammerman, A., Vovk, V.: Ridge regression learning algorithm in dual variables. In: Shavlik, J.W. (ed.) Proceedings of the Fifteenth International Conference on Machine Learning, Madison, pp. 515–521. Morgan Kaufmann, San Francisco (1998)
13. Shafer, G., Vovk, V.: A tutorial on conformal prediction. J. Mach. Learn. Res. 9, 371–421 (2008)
14. Steinwart, I.: On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res. 2, 67–93 (2001)
15. Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)
16. Vovk, V.: Competitive on-line statistics. Int. Stat. Rev. 69, 213–248 (2001)
17. Vovk, V., Gammerman, A., Shafer, G.: Algorithmic Learning in a Random World. Springer, New York (2005)
18. Wasserman, L.: Frequentist Bayes is objective (comment on articles by Berger and by Goldstein). Bayesian Anal. 1, 451–456 (2006)
19. Wasserman, L.: Frasian inference. Stat. Sci. 26, 322–325 (2011)
20. Zhdanov, F., Kalnishkan, Y.: An identity for kernel ridge regression. Theor. Comput. Sci. 473, 157–178 (2013)
21. Zhdanov, F., Vovk, V.: Competing with Gaussian linear experts. Tech. Rep. arXiv:0910.4683 [cs.LG], arXiv.org e-Print archive (2009). Revised in May 2010
Acknowledgements
I am deeply grateful to Vladimir Vapnik for numerous discussions and support over the years, starting from our first meetings in the summer of 1996. Many thanks to Alexey Chervonenkis, Alex Gammerman, Valya Fedorova, and Ilia Nouretdinov for their advice and help. This work has been supported in part by the Cyprus Research Promotion Foundation (TPE/ORIZO/0609(BIE)/24) and EPSRC (EP/K033344/1).
© 2013 Springer-Verlag Berlin Heidelberg
Vovk, V. (2013). Kernel Ridge Regression. In: Schölkopf, B., Luo, Z., Vovk, V. (eds) Empirical Inference. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41136-6_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41135-9
Online ISBN: 978-3-642-41136-6