
1 Introduction

This chapter is based on my talk at the Empirical Inference Symposium (see p.x). It describes some developments influenced by Vladimir Vapnik, which are related to, but much less well known than, the Support Vector Machine. The Support Vector Machine is a powerful combination of the idea of generalized portrait (1962; see Chap. 3) and kernel methods, and from the very beginning the performance guarantees for it were non-Bayesian, depending only on the assumption of randomness: the data are generated independently from the same probability distribution. An example of such a performance guarantee is (3.2); numerous other examples are given in Vapnik [15]. Kernel Ridge Regression (KRR) is a special case of Support Vector Regression and has been known in Bayesian statistics for a long time. However, the advent of the Support Vector Machine encouraged non-Bayesian analyses of KRR, and this chapter presents two examples of such analyses. The first example is in the tradition of prediction with expert advice [3] and involves no statistical assumptions whatsoever. The second example belongs to the area of conformal prediction [17] and depends only on the assumption of randomness.

2 Kernel Ridge Regression

It appears that the term “Kernel Ridge Regression” was coined in 2000 by Cristianini and Shawe-Taylor [5] to refer to a simplified version of Support Vector Regression; this was an adaptation of the earlier “ridge regression in dual variables” [12]. Take the usual Support Vector Regression in primal variables

$$\begin{array}{ll}
\text{minimize} & \|w\|^{2} + C\sum_{t=1}^{T}\bigl(\xi_{t}^{k} + (\xi'_{t})^{k}\bigr) \\
\text{subject to} & (w \cdot x_{t} + b) - y_{t} \leq \epsilon + \xi_{t}, \quad t = 1,\ldots,T, \\
 & y_{t} - (w \cdot x_{t} + b) \leq \epsilon + \xi'_{t}, \quad t = 1,\ldots,T, \\
 & \xi_{t},\ \xi'_{t} \geq 0, \quad t = 1,\ldots,T,
\end{array}$$

where \((x_{t},y_{t}) \in {\mathbb{R}}^{n} \times \mathbb{R}\) are the training examples, w is the weight vector, b is the bias term, \(\xi_{t}, \xi'_{t}\) are the slack variables, and T is the size of the training set; \(\epsilon, C > 0\) and \(k \in \{1,2\}\) are the parameters. Simplify the problem by ignoring the bias term b (it can be partially recovered by adding a dummy attribute 1 to all \(x_{t}\)), setting \(\epsilon := 0\), and setting \(k := 2\). The optimization problem becomes

$$\text{minimize}\qquad a\|w\|^{2} + \sum_{t=1}^{T}\left(y_{t} - w \cdot x_{t}\right)^{2}$$

(where \(a:= 1/C\)), the usual Ridge Regression problem. And Vapnik’s usual method ([15], Sect. 11.3.2) then gives the prediction

$$\hat{y} = \hat{w} \cdot x = Y'(K + aI)^{-1}k$$
(11.1)

for the label of a new object x, where Y is the vector of labels (with components \(Y^{t} := y_{t}\)), K is the Gram matrix \(K_{s,t} := x_{s} \cdot x_{t}\), and k is the vector with components \(k^{t} := x_{t} \cdot x\). The kernel trick replaces \(x_{t}\) by \(F(x_{t})\), and so K by the kernel matrix \(K_{s,t} := \mathcal{K}(x_{s},x_{t})\) and k by the vector \(k^{t} := \mathcal{K}(x_{t},x)\), where \(\mathcal{K}\) is the kernel \(\mathcal{K}(x,x') := F(x) \cdot F(x')\).
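
The prediction formula (11.1) is straightforward to implement. The following is a minimal sketch, assuming an RBF kernel and synthetic data; the function names (krr_predict, rbf_kernel) and all parameter values are illustrative, not taken from the original text.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """An illustrative kernel: K(x, x') = exp(-gamma ||x - x'||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def krr_predict(X_train, y_train, x_new, kernel, a=1.0):
    """KRR prediction (11.1): y_hat = Y'(K + aI)^{-1} k."""
    K = kernel(X_train, X_train)           # kernel matrix K_{s,t} = K(x_s, x_t)
    k = kernel(X_train, x_new[None, :])    # column vector k^t = K(x_t, x)
    T = len(y_train)
    # Solve (K + aI) alpha = k rather than inverting the matrix explicitly.
    alpha = np.linalg.solve(K + a * np.eye(T), k).ravel()
    return float(y_train @ alpha)

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
print(krr_predict(X, y, X[0], rbf_kernel, a=1.0))
```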

This simple observation (that the kernel trick turns Ridge Regression into KRR) was made in [12], where this simplified SVR method was called “ridge regression in dual variables”. There is no doubt that this calculation had been done earlier as well, but at first sight the result does not appear particularly useful. First, compared to the “full” SVM, there is no sparsity in examples (and no sparsity in attributes either, as in the case of the Lasso). Second, having an explicit formula is an advantage, but the formula is not new: mathematically, the formula for KRR coincides with one of the formulas in kriging [4], an old method in geostatistics for predicting values of a Gaussian random field; this formula had also been widely used in Bayesian statistics.

However, there is a philosophical and practical difference:

  • In kriging, the kernel is estimated from the results of observations and in Bayesian statistics it is supposed to reflect the statistician’s beliefs;

  • In KRR, as in Support Vector Regression in general, the kernel is not supposed to reflect any knowledge or beliefs about reality, and the usual approach is pragmatic: one consults standard libraries of kernels and uses whatever works.

In the remaining sections of this chapter we will explore KRR in the SVM style, without making Bayesian assumptions. The practical side of this non-Bayesian aspect of KRR is that it often gives good results on real-world data, despite the Bayesian assumptions being manifestly wrong. We will, however, concentrate on its theoretical side: non-Bayesian performance guarantees for KRR.

An important special case of KRR is (ordinary) Ridge Regression (RR): as far as the output is concerned, it is KRR with \(\mathcal{K}\) taken to be the dot product. In the case of RR, however, the usual representation of the prediction is

$$\hat{y} = \hat{w} \cdot x = x'(X'X + aI)^{-1}X'Y$$
(11.2)

rather than (11.1), where X is the matrix whose rows are the vectors \(x_{1},\ldots,x_{T}\); there are many ways to show that (11.2) and (11.1) indeed coincide when \(\mathcal{K}\) is the dot product.
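
A quick numerical check of this coincidence, on synthetic data; all names, dimensions, and constants below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, a = 30, 5, 2.0
X = rng.normal(size=(T, n))   # rows are the training objects x_1, ..., x_T
Y = rng.normal(size=T)        # training labels
x = rng.normal(size=n)        # new object

# Dual form (11.1) with K the dot product: K = XX', k^t = x_t . x.
K = X @ X.T
k = X @ x
pred_dual = Y @ np.linalg.solve(K + a * np.eye(T), k)

# Primal form (11.2): x'(X'X + aI)^{-1} X'Y.
pred_primal = x @ np.linalg.solve(X.T @ X + a * np.eye(n), X.T @ Y)

assert np.isclose(pred_dual, pred_primal)
```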

Under a standard Bayesian assumption (which we do not state explicitly in general; see, e.g., [17], Sect. 10.3), the conditional distribution of the label y of a new example (x, y) given \(x_{1},\ldots,x_{T},x\) and \(y_{1},\ldots,y_{T}\) is

$$N\left(Y'(K + aI)^{-1}k,\;\sigma^{2} + \frac{\sigma^{2}}{a}\mathcal{K}(x,x) - \frac{\sigma^{2}}{a}k'(K + aI)^{-1}k\right),$$
(11.3)

where K and k are as before (the postulated probability distribution generating the examples depends on \(\mathcal{K}\) and a, and we parameterize a normal probability distribution \(N(\mu,\sigma^{2})\) by its mean \(\mu\) and variance \(\sigma^{2}\)). The mean of the distribution (11.3) is the KRR prediction, but now we have not only a point prediction but also an estimate of its accuracy.

When \(\mathcal{K}\) is the dot product, (11.3) can be rewritten as

$$N\left(x'(X'X + aI)^{-1}X'Y,\;\sigma^{2}x'(X'X + aI)^{-1}x + \sigma^{2}\right).$$
(11.4)

In this case the Bayesian assumption can be stated as follows: \(x_{1},x_{2},\ldots\) are fixed vectors in \({\mathbb{R}}^{n}\) (alternatively, we can make our analysis conditional on their values) and

$$y_{t} = \theta \cdot x_{t} + \xi_{t},$$
(11.5)

where \(\theta \sim N(0,(\sigma^{2}/a)I)\) and \(\xi_{t} \sim N(0,\sigma^{2})\) are all independent.
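
Under this model, (11.4) immediately yields Bayes prediction intervals. The following is a minimal sketch for the ordinary-RR case, generating data according to (11.5); the 80% level, the dimensions, and all names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, T, a, sigma = 3, 50, 1.0, 1.0

# Generate data from the Bayesian model (11.5).
theta = rng.normal(scale=sigma / np.sqrt(a), size=n)   # theta ~ N(0, (sigma^2/a) I)
X = rng.normal(size=(T, n))
y = X @ theta + rng.normal(scale=sigma, size=T)        # xi_t ~ N(0, sigma^2)
x = rng.normal(size=n)                                 # new object

# Predictive mean and variance from (11.4).
A = X.T @ X + a * np.eye(n)
mean = x @ np.linalg.solve(A, X.T @ y)                 # the RR point prediction
var = sigma**2 * (x @ np.linalg.solve(A, x)) + sigma**2

lo, hi = norm.interval(0.8, loc=mean, scale=np.sqrt(var))
print(f"80% Bayes prediction interval: [{lo:.3f}, {hi:.3f}]")
```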

Equations (11.3) and (11.4) give exhaustive information about the next observation; the Bayesian assumption, however, is rarely satisfied.

3 Kernel Ridge Regression Without Probability

It turns out that KRR has interesting performance guarantees even if we do not make any stochastic assumptions whatsoever. Due to lack of space no proofs will be given; they can be found in the technical report [21].

In this section we consider the following perfect-information protocol of on-line regression (Protocol 1): at each step \(t = 1,2,\ldots\), Reality announces an object \(x_{t} \in \mathbf{X}\), Learner announces a prediction \(\hat{y}_{t} \in \mathbb{R}\), and then Reality announces the label \(y_{t} \in \mathbb{R}\).

First we consider the case where the space X from which the objects \(x_{t}\) are drawn is a Euclidean space, \(\mathbf{X} := {\mathbb{R}}^{n}\), and our goal is to compete with linear functions; in this case ordinary Ridge Regression is a suitable strategy for Learner. Then we move on to the case of an arbitrary X and replace RR by KRR.

3.1 Ordinary Ridge Regression

In this section, \(\mathbf{X} = {\mathbb{R}}^{n}\). The RR strategy for Learner is given by the formula \(\hat{y}_{t} := b'_{t-1}A_{t-1}^{-1}x_{t}\), where \(b_{0},b_{1},\ldots\) is the sequence of vectors and \(A_{0},A_{1},\ldots\) is the sequence of matrices defined by

$$b_{T} := \sum_{t=1}^{T}y_{t}x_{t},\qquad A_{T} := aI + \sum_{t=1}^{T}x_{t}x_{t}'$$

(cf. (11.2)), where a > 0 is a parameter. The matrix \(A_{t}^{-1}\) can be updated incrementally and efficiently using the Sherman–Morrison formula, as in the sketch below.
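
A minimal sketch of the on-line RR strategy with the Sherman–Morrison update; the class name and the toy data stream are illustrative, not from the original text.

```python
import numpy as np

class OnlineRidgeRegression:
    """On-line RR strategy: predicts y_hat_t = b_{t-1}' A_{t-1}^{-1} x_t."""

    def __init__(self, n, a=1.0):
        self.b = np.zeros(n)          # b_T = sum_{t<=T} y_t x_t
        self.A_inv = np.eye(n) / a    # inverse of A_T = aI + sum_{t<=T} x_t x_t'

    def predict(self, x):
        return float(self.b @ (self.A_inv @ x))

    def update(self, x, y):
        # Sherman-Morrison: (A + x x')^{-1} = A^{-1} - A^{-1} x x' A^{-1} / (1 + x' A^{-1} x).
        Ax = self.A_inv @ x
        self.A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
        self.b += y * x

# Toy on-line run.
rng = np.random.default_rng(3)
learner = OnlineRidgeRegression(n=4, a=1.0)
theta = rng.normal(size=4)
for t in range(100):
    x_t = rng.normal(size=4)
    y_hat = learner.predict(x_t)
    y_t = theta @ x_t + 0.1 * rng.normal()
    learner.update(x_t, y_t)
```

The following performance guarantee is proved in [21], Sect. 2.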

Theorem 11.1.

The Ridge Regression strategy for Learner with parameter a > 0 satisfies, at any step T,

$$\sum_{t=1}^{T} \frac{(y_{t} - \hat{y}_{t})^{2}}{1 + x_{t}'A_{t-1}^{-1}x_{t}} = \min_{\theta \in {\mathbb{R}}^{n}}\left(\sum_{t=1}^{T}(y_{t} - \theta'x_{t})^{2} + a\|\theta\|^{2}\right).$$
(11.6)

The part \(x_{t}\prime A_{t-1}^{-1}x_{t}\) in the denominator of (11.6) is usually close to 0 for large t.
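Because (11.6) is an exact equality, it is easy to check numerically. The sketch below does this on random data, recomputing \(A_{t-1}^{-1}\) directly for clarity; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
T, n, a = 40, 3, 1.5
X = rng.normal(size=(T, n))
y = rng.normal(size=T)

# Left-hand side of (11.6): weighted cumulative squared loss of the RR strategy.
lhs = 0.0
for t in range(T):
    A_prev = a * np.eye(n) + X[:t].T @ X[:t]          # A_{t-1}
    b_prev = X[:t].T @ y[:t]                          # b_{t-1}
    y_hat = b_prev @ np.linalg.solve(A_prev, X[t])    # RR prediction at step t
    lhs += (y[t] - y_hat) ** 2 / (1.0 + X[t] @ np.linalg.solve(A_prev, X[t]))

# Right-hand side: the minimum is attained at the RR solution theta = (aI + X'X)^{-1} X'y.
theta = np.linalg.solve(a * np.eye(n) + X.T @ X, X.T @ y)
rhs = ((y - X @ theta) ** 2).sum() + a * theta @ theta

assert np.isclose(lhs, rhs)
```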

Theorem 11.1 has been adapted to the Bayesian setting by Zhdanov and Kalnishkan [20], who also notice that it can be extracted from [1] (by summing their (4.21) in an exact rather than an estimated form).

Theorem 11.1 and its kernel version (Theorem 11.2 below) imply surprisingly many well-known inequalities.

Corollary 11.1.

Assume \(\vert y_{t}\vert \leq \mathbf{y}\) for all t, clip the predictions of the Ridge Regression strategy to \([-\mathbf{y},\mathbf{y}]\), and denote them by \(\hat{y}_{t}^{\mathbf{y}}\). Then

$$\sum_{t=1}^{T}(y_{t} - \hat{y}_{t}^{\mathbf{y}})^{2} \leq \min_{\theta}\left(\sum_{t=1}^{T}(y_{t} - \theta'x_{t})^{2} + a\|\theta\|^{2}\right) + 4\mathbf{y}^{2}\ln\det\left(I + \frac{1}{a}\sum_{t=1}^{T}x_{t}x_{t}'\right).$$
(11.7)

The bound (11.7) is exactly the bound obtained in [16] (Theorem 4) for the algorithm merging linear experts with predictions clipped to \([-\mathbf{y},\mathbf{y}]\), which does not have a closed-form description and so is less interesting than clipped RR. The bound for the strategy called the AAR in [16] has \(\mathbf{y}^{2}\) in place of \(4\mathbf{y}^{2}\) ([16], Theorem 1). (The AAR is very similar to RR: its predictions are \(b'_{t-1}A_{t}^{-1}x_{t}\) rather than \(b'_{t-1}A_{t-1}^{-1}x_{t}\); it is called the Vovk–Azoury–Warmuth algorithm in [3].) The regret term in (11.7) is of logarithmic order in T if \(\|x_{t}\|_{\infty} \leq X\) for all t, because

$$\ln\det\left(I + \frac{1}{a}\sum_{t=1}^{T}x_{t}x_{t}'\right) \leq n\ln\left(1 + \frac{TX^{2}}{a}\right)$$
(11.8)

(the determinant of a positive definite matrix is bounded by the product of its diagonal elements; see [2], Chap. 2, Theorem 7). From Theorem 11.1 we can also deduce Theorem 11.7 in [3], which is somewhat similar to Corollary 11.1. That theorem implies (11.7) when RR’s predictions happen to be in [−y, y] without clipping (but this is not what Corollary 11.1 asserts).
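The bound (11.8) is also simple to check numerically; the following sketch does so on uniformly distributed objects (all names and numbers are illustrative).

```python
import numpy as np

rng = np.random.default_rng(5)
T, n, a = 200, 4, 1.0
X = rng.uniform(-1, 1, size=(T, n))
X_bound = np.abs(X).max()              # X such that ||x_t||_inf <= X for all t

S = X.T @ X                            # sum_t x_t x_t'
lhs = np.linalg.slogdet(np.eye(n) + S / a)[1]
rhs = n * np.log(1 + T * X_bound**2 / a)
assert lhs <= rhs
```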

RR is not as good as the AAR in the setting where \(\sup_{t}\vert y_{t}\vert \leq \mathbf{y}\) and the goal is to obtain bounds of the form (11.7) (since the AAR is to some degree optimized for this setting), but it is still very good; moreover, for RR we have the interesting equality (11.6) rather than a mere inequality.

The upper bound (11.7) does not hold for the RR strategy if the coefficient 4 is replaced by any number less than \(\frac{3}{2\ln 2} \approx 2.164\), as can be seen from an example given in Theorem 3 of [16], where the left-hand side of (11.7) is \(4T + o(T)\), the minimum on the right-hand side is at most T, \(\mathbf{y} = 1\), and the logarithm is \(2T\ln 2 + O(1)\). It is also known that there is no strategy achieving (11.7) with a coefficient less than 1 in place of 4, even in the case where \(\left\|x_{t}\right\|_{\infty} \leq X\) for all t: see Theorem 2 in [16].

There is also an upper bound on the cumulative square loss of the RR strategy without a logarithmic part and without assuming that the labels are bounded.

Corollary 11.2.

If \(\|x_{t}\|_{2} \leq Z\) for all t then the Ridge Regression strategy for Learner with parameter a > 0 satisfies, at any step T,

$$\sum_{t=1}^{T}(y_{t} - \hat{y}_{t})^{2} \leq \left(1 + \frac{Z^{2}}{a}\right)\min_{\theta \in {\mathbb{R}}^{n}}\left(\sum_{t=1}^{T}(y_{t} - \theta'x_{t})^{2} + a\|\theta\|^{2}\right).$$

This bound is better than the bound in Corollary 3.1 of [8], which has an additional regret term of logarithmic order in time.

Asymptotic properties of the RR strategy can be further studied using Corollary A.1 of Kumon et al. [9]. Their result states that if \(\|x_{t}\|_{2} \leq 1\) for all t, then \(x_{t}'A_{t-1}^{-1}x_{t} \rightarrow 0\) as \(t \rightarrow \infty\). It is clear that the condition \(\|x_{t}\|_{2} \leq 1\) for all t can be replaced by \(\sup_{t}\|x_{t}\|_{2} < \infty\). This gives the following corollary, which can be summarized as follows: if there exists a very good expert (asymptotically), then RR also predicts very well; if there is no such very good expert, RR performs asymptotically as well as the best regularized expert.

Corollary 11.3.

Let a > 0 and \(\hat{y}_{t}\) be the predictions output by the Ridge Regression strategy with parameter a. Suppose \(\sup _{t}\left \|x_{t}\right \|_{2} < \infty \) . Then

$$\displaystyle{\left (\exists \theta \in {\mathbb{R}}^{n}:\sum _{ t=1}^{\infty }{(y_{ t} -\theta \prime x_{t})}^{2} < \infty \right )\Longrightarrow\sum _{ t=1}^{\infty }{(y_{ t} -\hat{ y}_{t})}^{2} < \infty }$$

and

$$\displaystyle\begin{array}{rcl} & & \left (\forall \theta \in {\mathbb{R}}^{n}:\sum _{ t=1}^{\infty }{(y_{ t} -\theta \prime x_{t})}^{2} = \infty \right ) {}\\ & & \qquad \qquad \qquad \qquad \Longrightarrow\lim _{T\rightarrow \infty } \frac{\sum _{t=1}^{T}{(y_{t} -\hat{ y}_{t})}^{2}} {\min _{\theta \in {\mathbb{R}}^{n}}\left (\sum _{t=1}^{T}{(y_{t} -\theta \prime x_{t})}^{2} + {a\|\theta \|}^{2}\right )} = 1. {}\\ \end{array}$$

3.2 Kernel Ridge Regression

In this section, X is an arbitrary set. Let \(\mathcal{F}\) be the RKHS of functions on X with kernel \(\mathcal{K}\). The KRR strategy for Learner with parameter a > 0 is defined by the formula (11.1) applied to the past examples.
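
A minimal sketch of this strategy, re-applying (11.1) to all examples seen so far at every step. It is a naive implementation (the kernel matrix is rebuilt and re-factorized at each step), and the kernel, the toy data stream, and the default prediction 0 before any data arrive are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    """An illustrative kernel; any kernel function of two objects would do."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

def online_krr(stream, kernel, a=1.0):
    """On-line KRR strategy: at each step, predict via (11.1) from all past examples."""
    xs, ys, preds = [], [], []
    for x_t, y_t in stream:
        if xs:
            K = np.array([[kernel(u, v) for v in xs] for u in xs])
            k = np.array([kernel(u, x_t) for u in xs])
            preds.append(float(np.array(ys) @ np.linalg.solve(K + a * np.eye(len(xs)), k)))
        else:
            preds.append(0.0)   # default prediction before any examples have been seen
        xs.append(x_t)
        ys.append(y_t)
    return preds

# Toy data stream.
rng = np.random.default_rng(6)
stream = []
for _ in range(30):
    x = rng.normal(size=2)
    stream.append((x, float(np.sin(x[0]) + 0.1 * rng.normal())))
print(online_krr(stream, rbf_kernel, a=1.0)[-1])
```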

The following version of Theorem 11.1 for KRR can be derived from Theorem 11.1 itself (see [21], Sect. 6, for details).

Theorem 11.2.

The KRR strategy with parameter a > 0 for Learner satisfies, at any step T,

$$\sum_{t=1}^{T} \frac{(y_{t} - \hat{y}_{t})^{2}}{1 + \frac{1}{a}\mathcal{K}(x_{t},x_{t}) - \frac{1}{a}k_{t}'(K_{t-1} + aI)^{-1}k_{t}} = \min_{f \in \mathcal{F}}\left(\sum_{t=1}^{T}(y_{t} - f(x_{t}))^{2} + a\|f\|_{\mathcal{F}}^{2}\right),$$

where \(K_{t-1}\) is the kernel matrix of the first t − 1 objects and \(k_{t}\) is the vector with components \(\mathcal{K}(x_{s},x_{t})\), \(s = 1,\ldots,t-1\).

The denominator on the left-hand side tends to 1 under some regularity conditions:

Lemma 11.1 ([20], Lemma 2).

Let \(\mathcal{K}\) be a continuous kernel on a compact metric space. Then

$$\mathcal{K}(x_{t},x_{t}) - k_{t}'(K_{t-1} + aI)^{-1}k_{t} \rightarrow 0\quad\text{as } t \rightarrow \infty.$$

Again, we can derive several interesting corollaries from Theorem 11.2.

Corollary 11.4.

Assume \(\left\vert y_{t}\right\vert \leq \mathbf{y}\) for all t and let \(\hat{y}_{t}^{\mathbf{y}}\) be the predictions of the KRR strategy clipped to \([-\mathbf{y},\mathbf{y}]\). Then

$$\sum_{t=1}^{T}(y_{t} - \hat{y}_{t}^{\mathbf{y}})^{2} \leq \min_{f \in \mathcal{F}}\left(\sum_{t=1}^{T}(y_{t} - f(x_{t}))^{2} + a\|f\|_{\mathcal{F}}^{2}\right) + 4\mathbf{y}^{2}\ln\det\left(I + \frac{1}{a}K_{T}\right).$$
(11.9)

But now we have a problem: in general, the \(\ln\det\) term in (11.9) is not small compared to T. However, we still have the analogue of Corollary 11.3 (for a detailed derivation, see [20]).

Corollary 11.5 ([20], Corollary 4).

Let X be a compact metric space and \(\mathcal{K}\) be a continuous kernel on X . Then

$$\displaystyle{\left (\exists f \in \mathcal{F}:\sum _{ t=1}^{\infty }{(y_{ t} - f(x_{t}))}^{2} < \infty \right )\Longrightarrow\sum _{ t=1}^{\infty }{(y_{ t} -\hat{ y}_{t})}^{2} < \infty }$$

and

$$\displaystyle\begin{array}{rcl} & & \left (\forall f \in \mathcal{F}:\sum _{ t=1}^{\infty }{(y_{ t} - f(x_{t}))}^{2} = \infty \right ) {}\\ & & \qquad \qquad \qquad \qquad \Longrightarrow\lim _{T\rightarrow \infty } \frac{\sum _{t=1}^{T}{(y_{t} -\hat{ y}_{t})}^{2}} {\min _{f\in \mathcal{F}}\left (\sum _{t=1}^{T}{(y_{t} - f(x_{t}))}^{2} + a\|f\|_{\mathcal{F}}^{2}\right )} = 1. {}\\ \end{array}$$

To obtain a non-asymptotic result of this kind under the assumption \(\sup _{t}\vert y_{t}\vert \leq \mathbf{y}\), let us first assume that the number of steps T is known in advance. We will need the notation \(c_{\mathcal{F}}:= \sqrt{\sup _{x\in \mathbf{X} } \mathcal{K}(x, x)}\). Bounding the logarithm of the determinant in (11.9) we have

$$\ln\det\left(I + \frac{1}{a}K_{T}\right) \leq T\ln\left(1 + \frac{c_{\mathcal{F}}^{2}}{a}\right)$$

(cf. (11.8)). Since we know the number T of steps in advance, we can choose a specific value for a; let \(a := c_{\mathcal{F}}\sqrt{T}\). Since \(\ln(1+u) \leq u\), the determinant term in (11.9) is then at most \(4\mathbf{y}^{2}c_{\mathcal{F}}\sqrt{T}\), and the regularization term \(a\|f\|_{\mathcal{F}}^{2}\) becomes \(c_{\mathcal{F}}\|f\|_{\mathcal{F}}^{2}\sqrt{T}\). This gives us an upper bound with a regret term of order \(O(\sqrt{T})\) for any \(f \in \mathcal{F}\):

$$\displaystyle{\sum _{t=1}^{T}{(y_{ t} -\hat{ y}_{t}^{\mathbf{y}})}^{2} \leq \sum _{ t=1}^{T}{(y_{ t} - f(x_{t}))}^{2} + c_{ \mathcal{F}}(\|f\|_{\mathcal{F}}^{2} + 4{\mathbf{y}}^{2})\sqrt{T}.}$$

If we do not know the number of steps in advance, it is possible to achieve a similar bound using a mixture of KRR over the parameter a with a suitable prior over a:

$$\sum_{t=1}^{T}(y_{t} - \hat{y}_{t}^{\mathbf{y}})^{2} \leq \sum_{t=1}^{T}(y_{t} - f(x_{t}))^{2} + 8\mathbf{y}\max\left(c_{\mathcal{F}}\|f\|_{\mathcal{F}},\,\mathbf{y}\delta T^{-1/2+\delta}\right)\sqrt{T + 2} + 6\mathbf{y}^{2}\ln T + c_{\mathcal{F}}^{2}\|f\|_{\mathcal{F}}^{2} + O(\mathbf{y}^{2})$$
(11.10)

for any arbitrarily small δ > 0, where the constant implicit in \(O(\mathbf{y}^{2})\) depends only on δ. (No proof of this result has been published.) The inequality (11.10) still looks asymptotic in that it contains an O term; however, it is easy to obtain an explicit (but slightly messier) non-asymptotic inequality.

In particular, (11.10) shows that if \(\mathcal{K}\) is a universal kernel [14] on a topological space X, KRR is competitive with all continuous functions on X: for any continuous \(f: \mathbf{X} \rightarrow \mathbb{R}\),

$$\displaystyle{ \limsup _{T\rightarrow \infty }\frac{1} {T}\left (\sum _{t=1}^{T}{(y_{ t} -\hat{ y}_{t}^{\mathbf{y}})}^{2} -\sum _{ t=1}^{T}{(y_{ t} - f(x_{t}))}^{2}\right ) \leq 0 }$$
(11.11)

(assuming \(\vert y_{t}\vert \leq \mathbf{y}\) for all t). For example, (11.11) holds for X a compact set in \({\mathbb{R}}^{n}\), \(\mathcal{K}\) an RBF kernel, and \(f: \mathbf{X} \rightarrow \mathbb{R}\) any continuous function (see Example 1 in [14]). For continuous universal kernels on compact spaces, (11.11) also follows from Corollary 11.5.

4 Kernel Ridge Regression in Conformal Prediction

Suppose we would like to have prediction intervals rather than point predictions, and we would like them to have guaranteed coverage probabilities. It is clear that to achieve this we need a stochastic assumption; it turns out that the randomness assumption is often sufficient to obtain informative prediction intervals. In general, our algorithms will output prediction sets (usually intervals, but not always); to obtain prediction intervals we will apply convex closure (which can only improve coverage probability).

The special case of conformal prediction discussed in this section works as follows. Suppose we have an “underlying algorithm” (such as KRR) producing point predictions in \(\mathbb{R}\). Let \((x_{1},y_{1}),\ldots,(x_{T},y_{T})\) be a training set and \(x_{T+1}\) be a new object. To find the prediction set for \(y_{T+1}\) at a significance level ε ∈ (0, 1), proceed as follows (a code sketch is given after the list):

  • For each possible label \(z \in \mathbb{R}\):

    • Set \(y_{T+1}:= z\);

    • For each \(t \in \{ 1,\ldots,T + 1\}\) compute the nonconformity score \(\alpha _{t}^{z}:= \left \vert y_{t} -\hat{ y}_{t}^{z}\right \vert \), where \(\hat{y}_{t}^{z}\) is the point prediction for the label of x t computed by the underlying algorithm from the extended training set \((x_{1},y_{1}),\ldots,(x_{T+1},y_{T+1})\);

    • Compute the p-value

      $$\displaystyle{p(z):= \frac{1} {T + 1}\sum _{t=1}^{T+1}\boldsymbol{1}_{\{\alpha _{ t}^{z}\geq \alpha _{T+1}^{z}\}},}$$

      where \(\boldsymbol{1}_{\{\cdot \}}\) is the indicator function;

  • Output the prediction set \(\{z \in \mathbb{R} \vert p(z) >\epsilon \}\), where ε is the given significance level.
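
The following is a minimal sketch of this procedure with KRR (here with the dot-product kernel, i.e., RR) as the underlying algorithm. It approximates the prediction set by scanning a finite grid of candidate labels z and then takes the convex closure; the exact and faster RRCM computation mentioned below is not attempted, and all names, the grid, and the toy data are illustrative.

```python
import numpy as np

def krr_fitted(X, y, a=1.0):
    """In-sample KRR predictions y_hat_t for a training set, dot-product kernel (cf. (11.1))."""
    K = X @ X.T
    return K @ np.linalg.solve(K + a * np.eye(len(y)), y)

def conformal_interval(X_train, y_train, x_new, eps=0.2, a=1.0, grid=None):
    """Convex closure of the conformal prediction set {z : p(z) > eps}, scanned over a grid."""
    if grid is None:   # the grid must be wide enough to contain the whole prediction set
        grid = np.linspace(y_train.min() - 5.0, y_train.max() + 5.0, 2001)
    X_ext = np.vstack([X_train, x_new])
    accepted = []
    for z in grid:
        y_ext = np.append(y_train, z)                          # try y_{T+1} := z
        alpha = np.abs(y_ext - krr_fitted(X_ext, y_ext, a))    # nonconformity scores
        p = np.mean(alpha >= alpha[-1])                        # p-value of the candidate z
        if p > eps:
            accepted.append(z)
    return (min(accepted), max(accepted)) if accepted else None

# Toy usage: x_t = 1 for all t, as in the example discussed later in this section.
rng = np.random.default_rng(7)
X = np.ones((4, 1))
y = rng.normal(loc=1.0, scale=1.0, size=4)
print(conformal_interval(X, y, np.ones(1), eps=0.2, a=1.0))
```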

This set predictor is the conformal predictor based on the given underlying algorithm. Conformal predictors have a guaranteed coverage probability:

Theorem 11.3.

The probability that the prediction set output by a conformal predictor makes an error (i.e., fails to include \(y_{T+1}\)) does not exceed the significance level ε.

Moreover, in the on-line prediction protocol (Protocol 1, in which Reality outputs \((x_{t},y_{t})\) independently from the same probability distribution), the long-run frequency of errors also does not exceed ε almost surely. For a proof, see [17] (Theorem 8.1).

The property of conformal predictors asserted by Theorem 11.3 is their validity. Validity being achieved automatically, the remaining desiderata for conformal predictors are their “efficiency” (we want the prediction sets to be small, in a suitable sense) and “conditional validity” (we might want to have prespecified coverage probabilities conditional on the training set or some property of the new example).

The idea of conformal prediction is inspired by the Support Vector Machine (and the notation α for nonconformity scores is adapted from Vapnik’s Lagrange multipliers). The immediate precursor of conformal predictors was described in the paper [7] co-authored by Vapnik, which is based on the idea that a small number of support vectors warrants a high level of confidence in the SVM’s prediction. This idea was present in Vapnik and Chervonenkis’s thinking in the 1960s: see, e.g., (3.2) and [15], Theorem 10.5. The method was further developed in [17]; see [13] for a tutorial.

In the case where the conformal predictor is built on top of RR or KRR, there is no need to go over all potential labels \(z \in \mathbb{R}\). The set prediction for the example \((x_{T+1},y_{T+1})\) can be computed in time \(O(T\log T)\) (in the case of RR) or \(O(T^{2})\) (in the case of KRR). This involves only solving linear equations and sorting; the simple resulting algorithm is called the Ridge Regression Confidence Machine (RRCM) in [11] and [17]. There is an R package implementing the RRCM (in the case of RR), PredictiveRegression, available from CRAN.

The Bayes predictions (11.3) and (11.4) can be easily converted into prediction intervals. But they are valid only under the postulated probability model, whereas the prediction intervals output by the RRCM are valid under the randomness assumption (as is common in machine learning). This is illustrated by Fig. 11.1, which is a version of Wasserman’s Fig. 1 in [19]. We consider the simplest case, where \(x_{t} = 1\) for all t; therefore, the examples \((x_{t},y_{t})\) can be identified with their labels \(y_{t} \in \mathbb{R}\), which we will call observations. The chosen significance level is 20% and the kernel \(\mathcal{K}\) is the dot product. In the top plot, the four observations are generated from N(1, 1); in the middle plot, from N(10, 1); and in the bottom plot, from N(100, 1). The blue lines are the prediction intervals computed by the RRCM with a = 1 and the red lines are the Bayes prediction intervals computed as the shortest intervals containing 80% of the mass of (11.4) with a = 1 and σ = 1.

All observations are generated from N(θ, 1) for various constants θ. When θ = 1 (and so the Bayesian assumption (11.5) can be regarded as satisfied), the Bayes prediction intervals are on average only slightly shorter than the RRCM’s (the Bayes prediction interval happens to be wider in Fig. 11.1; over random seeds of the random number generator, the Bayes prediction intervals are shorter in about 54% of cases). But as θ grows, the RRCM’s prediction intervals also grow (in order to cover the observations), whereas the width of the Bayes prediction intervals remains constant. When θ = 100 (and so (11.5) is clearly violated), the Bayes prediction intervals give very misleading results.

Fig. 11.1 In the top plot, the four observations (shown as short vertical lines) are generated from N(1, 1); in the middle plot, from N(10, 1); and in the bottom plot, from N(100, 1). The blue lines are prediction intervals computed by a conformal predictor, and the red lines are Bayes prediction intervals

In parametric statistics, it is widely believed that the choice of the prior does not matter much: the data will eventually swamp the prior. However, even in parametric statistics the model (such as N(θ, 1)) itself may be wrong.

In nonparametric statistics, the situation is much worse:

the prior can swamp the data, no matter how much data you have

(Diaconis and Freedman [6], Sect. 4). In this case, using Bayes prediction intervals becomes particularly problematic. The RRCM can be interpreted as an example of renormalized Bayes, as discussed in [18] and later papers.

As mentioned earlier, the RRCM is valid under the assumption of randomness; no further assumptions are required. However, conditional validity and, especially, efficiency do require extra assumptions. For example, [10] uses standard assumptions from density estimation to demonstrate the conditional validity and efficiency of a purpose-built conformal predictor. It remains an open problem to establish whether similar results hold for the RRCM.