1 Introduction

In recent years, visual image categorization has been dominated by a few classification concepts. Support Vector Machines represented the state of the art for the last decade, e.g., with the popular solvers LibLinear by Fan et al. (2008) and LibSVM by Chang and Lin (2011). More recently, they were replaced by Convolutional Neural Networks trained with deep learning techniques (Krizhevsky et al. 2012). Today, deep networks are established as excellent black-box tools which lead to impressive results in image categorization challenges (Russakovsky et al. 2015). However, these models only provide what they were designed and trained for: estimated class labels for previously unseen data. Are class labels and plain predictions the only thing we are interested in?

Nonparametric Bayesian methods based on Gaussian process models have the advantage of providing a complete probabilistic framework for inference. In consequence, they allow for estimating the variance of a prediction or finding suitable hyperparameters with marginal likelihood optimization as shown by Kapoor et al. (2010). Unfortunately, their application to large-scale learning scenarios is limited since the required computation time and memory consumption scale cubically and quadratically with the number of collected examples, respectively. On the other hand, current research is shifting more and more towards large-scale learning scenarios due to the huge amount of available image data. Thus, there is an increasing gap between the benefits of the GP framework and its applicability to current visual recognition scenarios.

Contributions of this Article In this paper, we show how to overcome the scaling issues of the GP framework even in the presence of a large number of learning examples. Our insights are based on inherent properties of histogram intersection kernels (HIK), which serve as similarity measures between histogram features. Histograms arise from a large number of popular image representations, e.g., SIFT, HOG, or visual Bag-of-words (BOW). More recent representations based on CNN activations are also non-negative after being passed through rectified linear units. Thus, kernel functions particularly tailored to histogram characteristics are applicable to a wide range of computer vision scenarios.

Exploiting the HIK for efficient GP models is inspired by several previous works that use HIK properties to speed up kernel SVM learning and prediction (Maji et al. 2008; Wu 2010; Wang et al. 2012). Our first contribution is to transfer their insights to Bayesian methods, which allows for fast multiplications of the kernel matrix \(\mathbf {\varvec{K}}\) with an arbitrary vector \(\mathbf {\varvec{v}}\). Thereby, the GP inference equations can be solved efficiently using iterative solvers.

We then go beyond pure classification aspects and provide efficient techniques for marginal likelihood optimization, variance prediction, and online learning. In particular, we demonstrate that hyperparameter optimization with the complete GP marginal likelihood can be performed with reduced computational costs by exploiting an upper bound for the log-determinant of the kernel matrix. We prove that our bound, which is a modification of the one provided by Bai and Golub (1997), indeed specifies an upper limit and show how to calculate it efficiently. For predictive variance estimation, we present several approximation methods with different asymptotic runtimes and accuracies. The memory and runtime requirements of our methods are sub-quadratic, allowing for scalability.

The main contributions of this article can thus be summarized as follows:

  1. We show how to perform learning and inference in a Bayesian manner with Gaussian processes and HIK for a large number of training examples.

  2. We introduce a technique for optimizing hyperparameters based on efficient evaluation of the GP marginal likelihood.

  3. We demonstrate how to estimate and approximate the GP predictive variance and show the implications for active learning.

  4. We derive efficient update routines for online learning, which are a pre-requisite for active learning.

In addition to theoretical derivations, we also empirically validate our techniques for image categorization, incremental learning, and active learning tasks in several experiments. For categorization, we use GP regression with binary labels in a one-vs-all manner similar to Kernel-SVM. Since our experiments are centered on visual object recognition, the application of HIK as a similarity measure is well suited due to the commonly used histogram representations of images. We further use parameterized versions of the kernel that are more flexible, e.g., by appropriate non-linear scaling or individual weighting of histogram elements. Our methods for hyperparameter optimization are useful to handle this increased flexibility and circumvent the necessity of techniques like cross-validation.

The article is based on previous conference publications. In detail, we presented the efficient GP multi-class classification with regression and hyperparameter optimization at ECCV (Rodner et al. 2012). Furthermore, we published the predictive variance approximation techniques and the incremental learning aspects at ACCV (Freytag et al. 2012b). In this article, we go beyond these individual publications and present all aspects in a coherent view. In addition, we extended them with detailed mathematical proofs, an error analysis of the quantization method, adaptive quantization as well as multiple additional experimental results and comparisons. An application of our approach towards semantic segmentation was demonstrated in an ICPR paper (Freytag et al. 2012a).

Structure of this Article We start by reviewing relevant literature for this article in Sect. 2. The Gaussian process framework for regression, classification, and hyperparameter optimization is reviewed in Sect. 3. In Sect. 4, we define the histogram intersection kernel as well as its parameterized versions and review how efficient kernel multiplications can be done. Furthermore, we present how a quantization approach leads to further improvements regarding computational speed.

Based on these foundations, we describe how to efficiently learn and test a GP model for multi-class classification with Gaussian noise models using HIK in Sect. 5. Fast hyperparameter optimization and speeding up predictive variance estimations are discussed in Sects. 6 and 7, respectively. How our methods can be extended to incremental learning as well as applied for active learning is presented in Sect. 8.

Our findings are complemented by extensive experiments on publicly available computer vision datasets. In Sect. 9, we experimentally prove the suitability of our approach for the tasks of large-scale classification and incremental learning. Efficient hyperparameter optimization with Gaussian processes is evaluated in Sect. 10. We further analyze our variance approximation techniques as well as their application for active learning in Sect. 11. A summary of our findings concludes the article.

2 Related Work

In the following, we review related work for different aspects of this article: learning with histogram intersection kernels (Sect. 2.1), parameterized kernels (Sect. 2.2), current approaches for fast GP classification and regression (Sect. 2.3), and the application of the GP framework for visual recognition tasks (Sect. 2.4).

2.1 Fast Learning and Classification with HIK

For almost two decades, kernel methods have been popular tools for handling non-linear relations present in real world problems. The idea of kernels is to directly compute scalar products between transformed feature vectors without the necessity of actually computing the transformation. While model complexity and resulting accuracy can be dramatically improved via kernelization of linear algorithms, resulting benefits come at the cost of potentially increased memory consumption and computation time. To overcome these drawbacks, Vedaldi and Zisserman (2010) presented how to approximate the histogram intersection kernel with explicit feature transformations. These transformations can then be directly used in combination with linear classifiers. In contrast, Maji et al. exploited the properties of HIK directly for efficiently calculating SVM decisions (Maji et al. 2008, 2013). When m denotes the number of support vectors and \(D\) the number of feature dimensions, their technique scales only with \(\mathcal {O}\left( \log (m) D\right) \) time compared to \(\mathcal {O}\left( m D\right) \) for standard SVM inference. Going one step further, Wu (2010, 2012) presented fast SVM training by using the HIK properties to reformulate the SVM dual problem. Similar techniques have been applied later by Wang et al. (2012) for large-scale image similarity calculation. In the present article, we go beyond these works by showing that the special properties of the HIK can be exploited for GP classification with regression, hyperparameter optimization, and variance estimation.

2.2 Generalized HIK and Hyperparameter Optimization

Barla et al. (2003) applied the HIK for image classification and proved it to be a Mercer kernel for images having the same size. Since that time, a lot of improvements on this kernel have been proposed, e.g., HIK with polynomial transformations (Boughorbel et al. 2005) or the weighted multi-level extension known as pyramid match kernel (PMK) by Grauman and Darrell (2007). We show how to further generalize the HIK with order-preserving and positive-valued feature transformations and weights for each dimension. Therefore, our work is similar to the one of Ablavsky and Sclaroff (2011), where a cross-validation procedure is proposed to estimate multiple weights of histogram kernels. In contrast, our hyperparameter optimization is theoretically sound and directly optimizes the data likelihood. We compare our results to the ones achieved by Ablavsky and Sclaroff (2011) and show the resulting benefits of optimizing generalized histogram intersection kernels in Sect. 10.

Our approach for hyperparameter optimization should not be confused with the recent work on Bayesian optimization presented by Snoek et al. (2012). In fact, both approaches are orthogonal. Here, we present how to efficiently compute the marginal log-likelihood of a GP model. Thereby, it can be combined with the optimization method of Snoek et al. (2012) using GP regression for predicting suitable sample points of hyperparameters.

2.3 Fast GP Classification and Regression

Similar to Kernel-SVM, GP classifiers require computation time and memory cubic and quadratic in the number of training examples, respectively. Thus, their direct application to large-scale problems is limited. A growing number of publications tackle this problem by introducing sparse approximations. At the core of these approximations is the assumption of conditional independence between sets of certain examples. These examples could be a selection from the training set or can be learned during training (Quiñonero Candela and Rasmussen 2005). Although these techniques lead to impressive results, the necessary independence assumptions neglect information provided in training and test data. The only work we are aware of tackling full GP inference is the greedy block technique of Bo and Sminchisescu (2012), which circumvents storing the full kernel matrix in memory. However, kernel values have to be calculated explicitly, which is not necessary in our case. In Sect. 9, we empirically show that their method can be improved by orders of magnitude in computation time.

A complementary approach has been presented by Urtasun and Darrell (2008) who apply local learning to tackle the long training time and high memory demands of standard GP regression. In particular, they learn models from the k nearest neighbors selected by evaluating kernel distances. In contrast, our approach is sublinear during testing and does not involve an additional training step for each test example.

2.4 Applications of the GP Framework: Active Learning and Beyond

In active learning scenarios, we are given a set of unlabeled examples, an expert who is willing to annotate a few of these examples, and a classifier which shall be trained. Thus, we search for the instances within the given set which are most informative once labeled and added to the model.

Active learning with the Gaussian process framework has been introduced by Kapoor et al. (2010). The authors present three different selection strategies based on the predictive mean, variance, and a combination of both. Thus, our techniques presented in this article allow for using their ideas in large-scale scenarios with a large number of unlabeled examples and an increasing size of the training set. Note that this also holds for our recently introduced active learning methods that involve the computation of expected changes in model parameters for GP regression to develop a new active learning criterion (Freytag et al. 2013) as well as a generalization based on measuring the differences in expected model outputs (Freytag et al. 2014a; Käding et al. 2015).

Besides active learning, GPs serve as easy-to-use probabilistic models in a range of other applications. Three exemplary scenarios are their application for detector adaptation (Vázquez et al. 2014), for super-resolution tasks (He and Siu 2011), and for human pose inference (Urtasun and Darrell 2008).

3 GP Regression and Hyperparameter Optimization

In this section, we briefly introduce the Gaussian process framework for regression and classification to establish the theoretical basics needed for understanding this article. An experienced reader may directly jump to Sect. 4 for a review on exploiting HIKs for efficient inference in general, or directly to Sect. 5 for a description of efficient GP inference using HIKs.

3.1 Gaussian Process (Label) Regression

Let \({\varOmega }\) be the space of all possible input data, e.g., \(D\)-dimensional feature vectors, which are often \(L_1\)-normalized in case of histogram representations. Furthermore, let \(\mathcal {Y}\) be the space of possible targets. For now, let us focus on binary classification with \(\mathcal {Y}= \{-1,1\}\). Multi-class scenarios are discussed in Sect. 3.3. Given \(N\) training examples \(\mathbf {\varvec{x}}_{i} \in \mathbf {\varvec{X}}\subset {\varOmega }\) and corresponding binary labels \(y_i \in \{-1, 1\}\), we would like to predict the label \(y_*\) of an unseen example \(\mathbf {\varvec{x}}_*\in {\varOmega }\).

The relationship between inputs and outputs can be modeled by a latent function \(f\) and an additional noise process \(y= f(\mathbf {\varvec{x}}) + \varepsilon \). Well known frequentist approaches such as SVM would now seek for a single function \(f\) which optimizes some criterion such as the regularized risk (Vapnik 1998). In contrast, a Bayesian approach assumes that \(f\) is a sample of a stochastic process \(\mathtt {F}\) and inference requires marginalization over all possible realizations. A specific choice of a stochastic process is a Gaussian process \(\mathcal {GP}\). We can thus express our assumptions as a GP prior with zero-mean and covariance (kernel) function \(\kappa \left( \cdot , \cdot \right) \), i.e., \(f\sim \mathcal {GP}(0, \kappa )\). Furthermore, we assume that labels \(y_i\) are conditionally independent from \(\mathbf {\varvec{x}}_{i}\) given \(f(\mathbf {\varvec{x}}_{i})\).

We follow Kapoor et al. (2010) and solve a given binary classification problem as a regression problem where labels \(y_i\) are treated as real-valued function values instead of discrete labels. This is very much related to Kernel-SVM but with an \(L_2\)-loss instead of a Hinge loss. Since it can be interpreted as kernelized least-squares regression, this technique is also known as label regression. In practical applications, it has proved to be useful for classification and it outperforms approximate inference techniques like Laplace approximation with more sophisticated noise models in most cases (Kapoor et al. 2010; Rodner et al. 2010; Kemmler et al. 2010).

For the noise process \(\varepsilon \) in label regression, a simple additive Gaussian noise model with variance \(\sigma _{\mathsf {n}}^2\) is used:

$$\begin{aligned} p(y_i \;|\; f_i) = {\mathcal {N}}(y_i \;|\; f_i, \sigma _{\mathsf {n}}^2). \end{aligned}$$
(1)

An advantage of the Gaussian noise model is that the GP assumptions lead to analytic solutions of the involved marginalizations. In consequence, they allow for directly predicting the expectation \(\mu _*\) as well as the variance \(\sigma ^2_*\) of the posterior distribution for the label \(y_*\) of a new example \(\mathbf {\varvec{x}}_*\) (Rasmussen and Williams 2006):

$$\begin{aligned} \mu _*&= \mathbf {\varvec{k}}_{*}^{\mathrm {T}} ( \mathbf {\varvec{K}}+ \sigma _{\mathsf {n}}^2\cdot \mathbf {\varvec{I}} )^{-1} \mathbf {\varvec{y}}= \mathbf {\varvec{k}}_{*}^{\mathrm {T}} \mathbf {\varvec{\alpha }}, \end{aligned}$$
(2)
$$\begin{aligned} \sigma ^2_*&= k_{**} - \mathbf {\varvec{k}}_{*}^{\mathrm {T}} ( \mathbf {\varvec{K}}+ \sigma _{\mathsf {n}}^2\cdot \mathbf {\varvec{I}} )^{-1} \mathbf {\varvec{k}}_{*}+ \sigma _{\mathsf {n}}^2. \end{aligned}$$
(3)

Here, the vector \(\mathbf {\varvec{k}}_{*}\) contains the kernel values \((\mathbf {\varvec{k}}_{*})_i = \kappa (\mathbf {\varvec{x}}_{i},\mathbf {\varvec{x}}_*)\) corresponding to the test example \(\mathbf {\varvec{x}}_*\), \(\mathbf {\varvec{K}}\) is the kernel matrix of the training data with \((\mathbf {\varvec{K}})_{ij}=\kappa (\mathbf {\varvec{x}}_{i},\mathbf {\varvec{x}}_{j})\), \(k_{**} = \kappa (\mathbf {\varvec{x}}_*, \mathbf {\varvec{x}}_*)\) is the prior variance of \(\mathbf {\varvec{x}}_*\), and \(\mathbf {\varvec{y}}\) is the vector containing all training labels. Although noted before, let us again emphasize that when we write “GP classification” in this article, we always refer to the label regression technique in Eq. (2) as presented by Kapoor et al. (2010). Thereby, we are able to estimate the predictive variance for classification settings which is one important benefit of the Bayesian framework. Another benefit arises when dealing with parameterized kernel functions, which is reviewed in the next section.
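As a point of reference, the following NumPy sketch implements Eqs. (2) and (3) for binary label regression with an explicitly stored kernel matrix; this is the \(\mathcal {O}\left( N^3\right) \) baseline that the remainder of the article accelerates, and all function names are our own.

```python
import numpy as np

def gp_label_regression_fit(K, y, noise_var=0.1):
    """Solve (K + sigma_n^2 I) alpha = y once per binary task, cf. Eq. (2)."""
    K_reg = K + noise_var * np.eye(K.shape[0])
    L = np.linalg.cholesky(K_reg)                 # O(N^3) baseline, avoided later on
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return L, alpha

def gp_label_regression_predict(L, alpha, k_star, k_star_star, noise_var=0.1):
    """Predictive mean (Eq. 2) and variance (Eq. 3) for a single test example."""
    mu = k_star @ alpha
    v = np.linalg.solve(L, k_star)                # L^{-1} k_*
    var = k_star_star - v @ v + noise_var
    return mu, var
```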

3.2 Hyperparameter Optimization

Kernel functions often depend on hyperparameters \(\mathbf {\varvec{\eta }}\), which have an important impact on the resulting classification model. In contrast to SVM techniques, the GP framework allows for finding their optimal values by marginal likelihood maximization instead of expensive cross-validation. As shown by Rasmussen and Williams (2006), the negative marginal log-likelihood for GP regression models can be expressed as:

$$\begin{aligned} - \log p(\mathbf {\varvec{y}}\;|\; \mathbf {\varvec{X}}, \mathbf {\varvec{\eta }}) =&\frac{1}{2} \mathbf {\varvec{y}}^{\mathrm {T}} \left( \mathbf {\varvec{\tilde{K}}}_\mathbf {\varvec{\eta }}\right) ^{-1} \mathbf {\varvec{y}}\nonumber \\&+\frac{1}{2} \log \det \left( \mathbf {\varvec{\tilde{K}}}_\mathbf {\varvec{\eta }}\right) + \frac{N}{2} \log 2 \pi . \end{aligned}$$
(4)

We introduced the short-hand \(\mathbf {\varvec{\tilde{K}}}_\mathbf {\varvec{\eta }}= \left( \mathbf {\varvec{K}}_\mathbf {\varvec{\eta }}+ \sigma _{\mathsf {n}}^2\cdot \mathbf {\varvec{I}}\right) \) as the parameterized kernel matrix having the noise variance \(\sigma _{\mathsf {n}}^2\) added to the main diagonal. It should be mentioned here that for continuous hyperparameters and differentiable kernel functions, it is even possible to use the derivative of the negative marginal log-likelihood for the optimization (Rasmussen and Williams 2006).

3.3 Multi-class Classification

As common for SVM classifiers, GP multi-class classification can be done by utilizing the one-vs-all technique as shown by Kapoor et al. (2010). For each class \(m\), a binary classifier with label regression (Nickisch and Rasmussen 2008) is trained which uses all images of \(m\) as positive examples and all remaining examples as negatives. Classification is then done by returning the class with largest predictive mean estimate of the corresponding binary classifier.

We can also perform model selection for the one-vs-all approach by jointly optimizing the hyperparameters of all involved binary problems (Kapoor et al. 2010). Thereby, the objective function turns out to be the sum of all binary negative marginal log-likelihoods as given in Eq. (4). In addition, the predictive variance as shown in Eq. (3) is independent of the actual class labels and therefore has to be calculated only once for an arbitrary number of classes.
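The following sketch illustrates this one-vs-all scheme on top of the label regression above; the shared factorization highlights that only the label vectors differ between the \(M\) binary problems (the small-scale Cholesky solver and all names are our own choices).

```python
import numpy as np

def gp_one_vs_all_fit(K, labels, noise_var=0.1):
    """Train M binary GP label regressors that share the same kernel matrix."""
    classes = np.unique(labels)
    L = np.linalg.cholesky(K + noise_var * np.eye(K.shape[0]))
    Y = np.where(labels[:, None] == classes[None, :], 1.0, -1.0)   # N x M binary targets
    Alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))            # one weight vector per class
    return classes, Alpha

def gp_one_vs_all_predict(classes, Alpha, k_star):
    """Return the class whose binary regressor yields the largest predictive mean."""
    return classes[np.argmax(k_star @ Alpha)]
```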

We can thus summarize that GP models can be seen as useful probabilistic counterparts to SVM classifiers with a range of interesting properties. Thus, efficient inference tools for GP models should be available to computer vision researchers for their “classification tool-boxes”, which is one aim of the current article. To tackle this goal, we present in the next section how to exploit fast kernel multiplication for speeding up computation times by orders of magnitude and significantly reduce memory demands.

4 Efficient Kernel Multiplications with Histogram Intersection Kernels

Kernel methods are one of the fundamental tools used to handle the complexity of visual recognition and allow for expressing non-linear relations with otherwise linear models. A possible kernel function often used to compare histogram feature vectors \(\mathbf {\varvec{x}}, \mathbf {\varvec{x}}' \in {\mathbb {R}}^D\) is the histogram intersection kernel:

$$\begin{aligned} \kappa _{\text {HIK}}(\mathbf {\varvec{x}}, \mathbf {\varvec{x}}') = \sum _{d=1}^{D} \min ( \mathbf {\varvec{x}}\left( d\right) , \mathbf {\varvec{x}}'\left( d\right) ). \end{aligned}$$
(5)

As shown by Maji et al. (2008) and Wu (2010), this kernel offers two important properties:

  1. for any test input \(\mathbf {\varvec{x}}_*\), the computation of the inner product \(\mathbf {\varvec{k}}_{*}^{\mathrm {T}} \mathbf {\varvec{\alpha }}\) between kernel vector \(\mathbf {\varvec{k}}_{*}\) and weight vector \(\mathbf {\varvec{\alpha }}\) of a trained representer model [e.g., GP regression in Eq. (2)] scales only sub-linearly with the number of known examples, and

  2. for an arbitrary vector \(\varvec{v}\), the matrix-vector product \(\mathbf {\varvec{K}}\varvec{v}\) between the kernel matrix \(\mathbf {\varvec{K}}\) and \(\varvec{v}\) scales only sub-quadratically.

While the first property allows for efficient inference with representer models, the second property enables reduced computation times for learning [as also noted in (Bottou et al. 2007, Sect. 9.4)]. We refer to these properties as fast kernel multiplications. In consequence, we refer to kernels fulfilling these properties as fast multiplication kernels. Note that these kernels should not be confused with the multiplicative kernels introduced by Yuan et al. (2008).

In the following, we review the work of Maji et al. (2008) and Wu (2010). While the authors presented how fast HIK multiplications can be exploited for fast SVM learning and classification, we put their work into a Gaussian process perspective. In addition, we derive a worst-case bound for quantization errors of the techniques presented by Maji et al. (2008). The section closes by presenting generalizations of currently known histogram intersection kernels that are important for adaptations such as automated feature scaling or feature relevance determination.

4.1 Fast Kernel Multiplications

As we have seen in Eq. (2), the predictive mean is a weighted sum of kernel values. This property is shared by all representer models such as SVM and GP. The HIK allows for decomposing this sum into two parts (Maji et al. 2008):

$$\begin{aligned} \mathbf {\varvec{k}}_{*}^{\mathrm {T}} \varvec{\alpha }&= \sum _{i=1}^{N} \mathbf {\varvec{\alpha }}\left( i\right) \sum _{d=1}^{D} \min \left( \mathbf {\varvec{x}}_i\left( d\right) ,\mathbf {\varvec{x}}_*\left( d\right) \right) \nonumber \\&= \sum _{d=1}^{D} \Bigl ( \sum _{\{i \ : \ \mathbf {\varvec{x}}_i\left( d\right) < \mathbf {\varvec{x}}_*\left( d\right) \}} \mathbf {\varvec{\alpha }}\left( i\right) \mathbf {\varvec{x}}_i\left( d\right) + \ldots \nonumber \\&\quad \ldots \mathbf {\varvec{x}}_*\left( d\right) \sum _{\{j \ : \ \mathbf {\varvec{x}}_j\left( d\right) \ge \mathbf {\varvec{x}}_*\left( d\right) \}} \mathbf {\varvec{\alpha }}\left( j\right) \Bigr ). \end{aligned}$$
(6)

From Eq. (6) we make the important observation that for each dimension the predictive mean of Gaussian process regression with HIK is piecewise linear. This property is also visualized in Fig. 1 for a simple toy example.

Fig. 1 Piecewise linearity of the regression function when using Gaussian process regression for classification (discrete \(\mathbf {\varvec{y}}\)) together with the histogram intersection kernel: 2-dimensional input vectors \(\mathbf {\varvec{x}}\) are used, but due to the normalization \(\Vert \mathbf {\varvec{x}}\Vert _1 = 1\), we only display the predictive mean (blue graph) and confidence areas (shaded area) derived from the predictive variance with respect to the first dimension of the input vectors. Training points are shown as blue crosses and the noise variance is set to 0.1

We can now significantly reduce the computational costs using the following trick. Let us assume we are given permutations \(\mathbf {\varvec{\pi }}_d\) which rearrange the training examples such that they are sorted in ascending order in each dimension d. Then, we can rewrite Eq. (6) as

$$\begin{aligned} \mathbf {\varvec{k}}_{*}^{\mathrm {T}} \mathbf {\varvec{\alpha }}&= \sum _{d=1}^{D} \Biggl ( \underbrace{\sum _{i = 1}^{r_{d}} \mathbf {\varvec{\alpha }}\left( \mathbf {\varvec{\pi }}_d^{-1}(i)\right) \mathbf {\varvec{x}}_{\mathbf {\varvec{\pi }}_d^{-1}(i)}\left( d\right) }_{\overset{\cdot }{=}\ \mathbf {\varvec{A}}\left( d,{r_{d}}\right) } + \ldots \nonumber \\&\quad \ldots \mathbf {\varvec{x}}_*\left( d\right) \underbrace{ \sum _{i = {r_{d}}+1}^{N} \mathbf {\varvec{\alpha }}\left( \mathbf {\varvec{\pi }}_d^{-1}(i)\right) }_{\overset{\cdot }{=}\ \mathbf {\varvec{B}}\left( d,{r_{d}}\right) } \Biggr ) \end{aligned}$$
(7)
$$\begin{aligned}&= \sum _{d=1}^{D} \left( \mathbf {\varvec{A}}\left( d,{r_{d}}\right) + \mathbf {\varvec{x}}_*\left( d\right) \mathbf {\varvec{B}}\left( d,{r_{d}}\right) \right) , \end{aligned}$$
(8)

with \(r_{d}\) being the number of examples that are smaller than \(\mathbf {\varvec{x}}_*\) in dimension d. We can precompute the two terms of the linear function during learning and store the values in look-up tables \(\mathbf {\varvec{A}}\) and \(\mathbf {\varvec{B}}\) as displayed in Eq. (8). Calculating the scores for test examples can then be done by accessing only a few elements of \(\mathbf {\varvec{A}}\) and \(\mathbf {\varvec{B}}\), in particular one element of both \(\mathbf {\varvec{A}}\) and \(\mathbf {\varvec{B}}\) for each dimension.

Given the vector \(\mathbf {\varvec{\alpha }}\), the resulting computation time for building \(\mathbf {\varvec{A}}\) and \(\mathbf {\varvec{B}}\) is dominated by sorting, which requires \(\mathcal {O}\left( ND\log N\right) \) operations. In terms of memory usage, we only have to store \(\mathcal {O}\left( ND\right) \) elements in contrast to the kernel matrix of size \(\mathcal {O}\left( N^2\right) \). This is especially important for large dataset sizes, for which the kernel matrix would not fit into memory. For calculating the score of a new example, we need \(\mathcal {O}\left( D\log N\right) \) operations to find the correct position \(r_{d}\) in each dimension and compute the linear function in Eq. (8) using look-up tables \(\mathbf {\varvec{A}}\) and \(\mathbf {\varvec{B}}\).
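The following sketch illustrates the construction of the look-up tables \(\mathbf {\varvec{A}}\) and \(\mathbf {\varvec{B}}\) and the scoring of Eq. (8). It assumes dense non-negative features stored row-wise in an \(N \times D\) array and is a minimal implementation of our own, not the authors' released code.

```python
import numpy as np

def build_hik_tables(X, alpha):
    """Precompute A(d, r) and B(d, r) of Eq. (7) for all dimensions and positions."""
    perm = np.argsort(X, axis=0)                 # permutations pi_d, ascending per dimension
    Xs = np.take_along_axis(X, perm, axis=0)     # sorted feature values, N x D
    a_sorted = alpha[perm]                       # alpha rearranged per dimension, N x D
    A = np.cumsum(a_sorted * Xs, axis=0)         # A[r-1, d]: sum over the r smallest entries
    B = np.cumsum(a_sorted[::-1], axis=0)[::-1]  # B[r, d]: sum of alpha over the N - r largest entries
    return Xs, A, B

def hik_score(Xs, A, B, x_star):
    """Evaluate k_*^T alpha via Eq. (8): one binary search plus two look-ups per dimension."""
    N, D = Xs.shape
    score = 0.0
    for d in range(D):
        r = np.searchsorted(Xs[:, d], x_star[d])        # number of training values below x_*(d)
        score += A[r - 1, d] if r > 0 else 0.0
        score += x_star[d] * (B[r, d] if r < N else 0.0)
    return score
```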

Similar considerations hold for multiplications of an arbitrary vector \(\mathbf {\varvec{v}} \in {\mathbb {R}}^{N}\) with the kernel matrix \(\mathbf {\varvec{K}}\), which can be done in \(\mathcal {O}\left( ND\right) \) based on already sorted features in each dimension:

$$\begin{aligned} \left( \mathbf {\varvec{K}}\mathbf {\varvec{v}}\right) _i&=\sum \limits _{j=1}^{N} \mathbf {\varvec{v}}\left( j\right) \cdot \kappa (\mathbf {\varvec{x}}_{i},\mathbf {\varvec{x}}_{j}) \nonumber \\&=\sum \limits _{j=1}^{N} \mathbf {\varvec{v}}\left( j\right) \sum \limits _{d=1}^{D} \min (\mathbf {\varvec{x}}_i\left( d\right) ,\mathbf {\varvec{x}}_j\left( d\right) ). \end{aligned}$$
(9)

From this equation, we observe that we can further exploit sparsity of feature vectors since corresponding terms of zero-valued dimensions vanish.
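The same sorting and cumulative sums yield the product \(\mathbf {\varvec{K}}\mathbf {\varvec{v}}\) of Eq. (9) without ever forming the kernel matrix, as the following sketch shows (same assumptions as above; in practice the per-dimension sorting would be cached instead of recomputed).

```python
import numpy as np

def hik_matvec(X, v):
    """Multiply the implicit HIK kernel matrix K with an arbitrary vector v (Eq. 9)."""
    N, D = X.shape
    result = np.zeros(N)
    perm = np.argsort(X, axis=0)                     # reuse pre-sorted features in practice
    for d in range(D):
        p = perm[:, d]
        xs, vs = X[p, d], v[p]
        lower = np.concatenate(([0.0], np.cumsum(vs * xs)[:-1]))  # sum of v_j * x_j(d) over smaller entries
        upper = np.cumsum(vs[::-1])[::-1]                         # sum of v_j over entries at or above
        result[p] += lower + xs * upper
    return result
```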

As we will show in the next sections, the efficient computation of products \(\mathbf {\varvec{K}}\mathbf {\varvec{v}}\) will be an essential part in our overall framework. In detail, we will require it for learning in Sect. 5.1 and for optimizing hyperparameters as shown in Sect. 6. Furthermore, computations of products \(\mathbf {\varvec{k}}_{*}^{\mathrm {T}} \varvec{\alpha }\) will allow for efficient classification as shown in Sect. 5.2.

4.2 Quantization of the Feature Space

The previously shown techniques for computing scores \(\mathbf {\varvec{k}}_{*}^{\mathrm {T}} \varvec{\alpha }\) already result in valuable savings of the required computation time. Nonetheless, they still depend on the number of available training examples. If feature values in dimension d are bounded by \(\mathbf {\varvec{x}}_*\left( d\right) \in [l_d,u_d]\), the evaluation can be further sped up by quantizing the feature space (Maji et al. 2008), leading to an approximate inference method.

For \(L_1\)-normalized histogram features, all elements are obviously bounded by the interval \([0,1]\). Using a quantization for each dimension with q bins, only q different prototypical outputs \(\mathbf {\varvec{p}}(k) \quad (1 \le k \le q)\) are possible in each summand of Eq. (7). With already computed tables \(\mathbf {\varvec{A}}\) and \(\mathbf {\varvec{B}}\), we can proceed with building a final look-up table \(\mathbf {\varvec{T}}\) of dimension \(D\times q\):

$$\begin{aligned} \mathbf {\varvec{T}}\left( d,k\right)&= \mathbf {\varvec{A}}\left( d,r_{d}\right) + \mathbf {\varvec{p}}(k) \cdot \mathbf {\varvec{B}}\left( d,r_{d}\right) , \end{aligned}$$
(10)

with \(r_d\) being the number of examples that are smaller than \(\mathbf {\varvec{p}}(k)\) in dimension d. Since permutations \(\mathbf {\varvec{\pi }}_d\) are already computed, this step requires only \({\mathcal {O}}\left( D\max \left( q,N\right) \right) \) operations.

As a result, the time spent for evaluating the score of a new test example decreases to \(\mathcal {O}\left( D\right) \) if the quantizer works in \(\mathcal {O}\left( 1\right) \). Consequently, for a given number of dimensions, the score of a new test example can be computed in constant time independent of \(N\). It should be noted that in the case of quantization, GP regression can be restated as piecewise Bayesian linear regression.

Adaptive Quantization In contrast to previous works, we use an adaptive quantization for every single dimension of the input feature vectors. The necessity of our adaptive quantization becomes apparent when we replace BOW features by CNN activations. For the latter, the typical activation strength varies heavily between dimensions. Furthermore, \(L_1\)-normalized activations come without theoretical motivation. Thus, neither are all values guaranteed to be smaller than one as for histogram entries, nor is a single quantization shared across all dimensions suitable. We therefore compute the maximum value \(u_d\) in each feature dimension for the training set. For each dimension, we then use \([0, u_d]\) to define the bounds of a uniform quantization.
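A possible implementation of the quantized look-up table \(\mathbf {\varvec{T}}\) of Eq. (10) with adaptive per-dimension bounds \([0, u_d]\) is sketched below; the uniform binning and the truncating quantizer are our own simplifications, and the tables \(\mathbf {\varvec{A}}\), \(\mathbf {\varvec{B}}\) are those from the previous sketch.

```python
import numpy as np

def build_quantized_table(Xs, A, B, q=100):
    """Collapse A and B into a D x q table T (Eq. 10) using adaptive bounds u_d."""
    N, D = Xs.shape
    u = Xs[-1, :]                                        # per-dimension maximum on the training set
    prototypes = np.linspace(0.0, 1.0, q)[None, :] * u[:, None]   # D x q prototype values p(k)
    T = np.zeros((D, q))
    for d in range(D):
        r = np.searchsorted(Xs[:, d], prototypes[d])     # positions of the prototypes, length q
        T[d] = np.where(r > 0, A[np.maximum(r - 1, 0), d], 0.0) \
             + prototypes[d] * np.where(r < N, B[np.minimum(r, N - 1), d], 0.0)
    return T, u

def quantized_score(T, u, x_star):
    """Constant-time scoring per dimension: map x_*(d) to its bin and read T(d, k)."""
    D, q = T.shape
    k = np.clip((x_star / np.maximum(u, 1e-12) * (q - 1)).astype(int), 0, q - 1)
    return T[np.arange(D), k].sum()
```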

Quantization Error Analysis The quantization trick reduces the runtime during testing significantly. However, the question remains how much the approximation affects the classification scores. Therefore, we study the error induced by this trick when computing the predictive mean, an analysis that has not been performed in previous works. The proofs of the results are given in Appendix. The results are not restricted to Gaussian process regression and also hold for support vector machines, because their decisions are likewise computed as a weighted sum of kernel values.

Theorem 1

(Worst-case error analysis) If a quantization with a maximum quantization error of \(\varepsilon _{q}\) is used as well as \(L_1\)-normalized features, the error of the quantized predictive mean \(\tilde{\mu }_*\) can be bounded as follows:

$$\begin{aligned} | \mu _*- \tilde{\mu }_*| \le \frac{D\cdot \varepsilon _{q}}{2} \cdot \Vert \mathbf {\varvec{\alpha }} \Vert _1. \end{aligned}$$
(11)

Proof

See Appendix. \(\square \)

The \(L_1\)-norm of the weights \(\mathbf {\varvec{\alpha }}\) does not depend on the quantization, because the quantization trick is only applied when a new test input is given and the position in the sorted list has to be determined in constant time. However, the term can be further bounded for SVM models by using \(\Vert \mathbf {\varvec{\alpha }}\Vert _1 \le C \cdot N\):

$$\begin{aligned} | \mu _*- \tilde{\mu }_*| \le \frac{D\cdot \varepsilon _{q}}{2} \cdot C \cdot N, \end{aligned}$$
(12)

where C is the trade-off parameter used in the soft-margin version (Schölkopf and Smola 2001). For GP regression, we can use the bounds derived by Rodner (2011, p. 64), which lead to:

$$\begin{aligned} | \mu _*- \tilde{\mu }_*| \le \frac{D\cdot \varepsilon _{q}}{2} \cdot \frac{N}{\sigma _{\mathsf {n}}^2}. \end{aligned}$$
(13)

Conclusions of Theorem 1 What do we learn from the previous analysis? First, we observe that the upper bound of the error depends linearly on the dimension and the number of training examples. Furthermore, the strength of regularization (for SVM determined by C and for GP regression by the noise variance \(\sigma _{\mathsf {n}}^2\)) also influences the error induced by the quantization. For a strong regularization (low C or high \(\sigma _{\mathsf {n}}^2\)), the error decreases.

4.3 Very General Histogram Intersection Kernels

In the previous sections, we restricted our analysis to the standard HIK as introduced in Eq. (5). To increase the kernel function's flexibility, Boughorbel et al. (2005) have shown that the HIK equipped with any positive-valued function \(g\left( \cdot \right) \):

$$\begin{aligned} \kappa _{\text {GHIK}}\left( \mathbf {\varvec{x}}, \mathbf {\varvec{x}}' \right) = \sum _{d=1}^{D} \min \left( g\left( \mathbf {\varvec{x}}\left( d\right) \right) , g\left( \mathbf {\varvec{x}}'\left( d\right) \right) \right) \end{aligned}$$
(14)

still remains a positive definite kernel. The obvious question is whether the efficient calculations shown previously also apply to the generalized versions of the HIK.

In fact, if \(g\left( \cdot \right) \) is an automorphism, the relative order of the training elements is preserved after applying \(g\left( \cdot \right) \). Therefore, the proposed techniques can also be applied to these generalized variants of the HIK, which we denote by GHIK in the remainder of the article. Note that we can even use the same quantization by storing the original feature values. Two common examples are the polynomial feature transform:

$$\begin{aligned} g_{\left| \cdot \right| ,\eta } \left( \mathbf {\varvec{x}}\left( d\right) \right) = \left| \mathbf {\varvec{x}}\left( d\right) \right| ^{\eta } , \quad \eta \in \mathbb {R}_{+}, \end{aligned}$$
(15)

and the exponential transformation:

$$\begin{aligned} g_{\text {exp},\eta } \left( \mathbf {\varvec{x}}\left( d\right) \right) = \frac{\exp ( \eta \left| \mathbf {\varvec{x}}\left( d\right) \right| )-1}{\exp ( \eta )-1} , \quad \eta \in \mathbb {R}_{+}. \end{aligned}$$
(16)

In the remaining sections, we refer to them as HIK-POLY and HIK-EXP. Interestingly, Eq. (14) even holds if we consider functions \(g_d\) specifically parameterized for each dimension. Thereby, we can individually weight each feature dimension:

$$\begin{aligned} g_{\text {ard}, \mathbf {\varvec{\eta }}} \left( \mathbf {\varvec{x}}\left( d\right) \right) = \mathbf {\varvec{\eta }}\left( d\right) \cdot \mathbf {\varvec{x}}\left( d\right) , \quad \mathbf {\varvec{\eta }}\in \mathbb {R}_{*}^{D}. \end{aligned}$$
(17)
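Since the generalized kernels differ from the plain HIK only by the transformation \(g\), they reduce in code to an element-wise feature transform followed by the machinery of Sect. 4.1; the sketch below mirrors Eqs. (15)–(17) and uses our own function names.

```python
import numpy as np

def ghik_poly(X, eta):
    """HIK-POLY, Eq. (15): element-wise power of the absolute feature values."""
    return np.abs(X) ** eta

def ghik_exp(X, eta):
    """HIK-EXP, Eq. (16): normalized exponential scaling."""
    return (np.exp(eta * np.abs(X)) - 1.0) / (np.exp(eta) - 1.0)

def ghik_ard(X, weights):
    """Eq. (17): individual positive weight for every feature dimension."""
    return X * weights[None, :]

# Usage: transform the features once, then reuse the fast HIK routines, e.g.
# Kv = hik_matvec(ghik_poly(X, eta=0.5), v)
```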

In Sect. 6, we present how to optimize hyperparameters \(\mathbf {\varvec{\eta }}\) of generalized HIKs even for large-scale training data.

Fig. 2 Main outline of our approach for GP classification and hyperparameter optimization using fast multiplications with the kernel matrix. Details are given in Sects. 5 and 6

5 Efficient GP Multi-Class Classification with GHIK

In this section, we demonstrate that learning and testing a GP model can be performed efficiently when using GHIKs. Most of our algorithms are also applicable to general fast multiplication kernels, but since the GHIK is the only practical family of kernels known to fit this class, we focus on the GHIK in our presentation of the algorithms. This theoretical investigation is complemented by the experimental results in Sect. 9 and the following sections, where we tackle several applications such as classification of real-world large-scale datasets, model regularization, and feature relevance determination.

As shown in Eq. (2), inference with a GP model requires two steps: (1) solving the linear equation system \(\mathbf {\varvec{\tilde{K}}}_\mathbf {\varvec{\eta }}\cdot \mathbf {\varvec{\alpha }} = \mathbf {\varvec{y}}\) and (2) calculating the scalar product \(\mathbf {\varvec{k}}_{*}^{\mathrm {T}} \mathbf {\varvec{\alpha }}\). Step (1) has to be done only once for each class of a given training set \(\mathbf {\varvec{X}}\) and is known as learning. In contrast to that, the second step is used to test the learned model in order to infer labels for new test data. For transferring the techniques of Maji et al. (2008) to GP inference, we need to handle both steps. Let us start with step 1.

5.1 Step 1: Efficient Learning

For the training phase in step 1, we notice that storing the full kernel matrix is impossible for large-scale datasets. Furthermore, applying a Cholesky decomposition with a runtime of \(\mathcal {O}\left( N^3\right) \) is far from being practical. However, the HIK explicitly allows for multiplications with the kernel matrix in linear time \(\mathcal {O}\left( ND\right) \). Therefore, an iterative linear solver can be used to tackle the learning step and only needs to perform several cheap multiplications with the kernel matrix.

Wu (2010) used a coordinate descent method to solve the quadratic program related to SVM learning. In contrast, our experiments show that a linear conjugate gradients (CG) method converges faster for GP problems. Note that in the absence of round-off errors, we obtain the exact solution after \(N\) iterations [see also (Hestenes and Stiefel 1952)]. In practice, we can even stop the iteration significantly earlier, e.g., when the maximum norm of the residual drops below a specified threshold.

Let us analyze the resulting asymptotic cost to compare against the GP baseline. To this end, let \(M\) denote the number of classes and let \(T_1\) be the number of iterations the CG method performs, which depends on the condition number of the kernel matrix (Nocedal and Wright 2006). Among other factors, the condition number of the kernel matrix depends on \(N\) and can be improved by adapting the regularization parameter \(\sigma _{\mathsf {n}}^2\). Since the binary label vectors differ for each class, we need to compute \(M\) weight vectors \(\mathbf {\varvec{\alpha }}^{(1)},\ldots ,\mathbf {\varvec{\alpha }}^{(M)}\). As previously noted, solving the resulting linear equation system with an iterative linear solver is possible in linear time, which leads to the first term \(\mathcal {O}\left( DNT_1M\right) \). A second term \(\mathcal {O}\left( DN\log N\right) \) arises from the effort for sorting all \(N\) examples in every dimension.

In summary, we require \(\mathcal {O}\left( ND( T_1M+ \log N)\right) \) operations for learning. Note that in practice, we often observe sparse features which leads to a significant reduction of the necessary computation times for this step. We also see that the runtime performance of our method is linear in the number of classes \(M\) allowing for scalability towards large-scale scenarios with hundreds or even thousands of classes.
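A sketch of this learning step is given below: the fast multiplication of Eq. (9) is wrapped as a matrix-free linear operator and handed to SciPy's conjugate gradient solver. The solver choice and its default stopping criterion are our own; the regularized matrix \(\mathbf {\varvec{\tilde{K}}}= \mathbf {\varvec{K}}+ \sigma _{\mathsf {n}}^2\mathbf {\varvec{I}}\) follows the text.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def gp_hik_learn(X, Y, noise_var=0.1):
    """Solve (K + sigma_n^2 I) alpha_m = y_m for all M classes without forming K."""
    N, M = X.shape[0], Y.shape[1]
    K_reg = LinearOperator((N, N), dtype=np.float64,
                           matvec=lambda v: hik_matvec(X, v) + noise_var * v)
    Alpha = np.zeros((N, M))
    for m in range(M):                              # one CG run per binary one-vs-all problem
        Alpha[:, m], info = cg(K_reg, Y[:, m])      # stop earlier via tolerance/maxiter if desired
        if info != 0:
            raise RuntimeError("CG did not converge for class %d" % m)
    return Alpha
```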

5.2 Step 2: Efficient Testing

After estimating the coefficients \(\mathbf {\varvec{\alpha }}\), the test step only involves evaluating inner products \(\mathbf {\varvec{k}}_{*}^{\mathrm {T}} \mathbf {\varvec{\alpha }}\), which can be done in logarithmic time. Note that we can further reduce the asymptotic cost to constant time by applying the quantization idea of Maji et al. (2008) as reviewed in Sect. 4.

Table 1 Overview of asymptotic runtimes for learning, testing, and optimization of hyperparameters for baseline GP compared to our approach

To summarize this section, Fig. 2 visualizes the interplay of all steps of our approach, including the hyperparameter optimization explained in the next section. An overview of the asymptotic computation times and memory demands of our approach is given in Table 1.

6 Large-Scale Hyperparameter Optimization

While the previous section dealt with efficient large-scale classification, we are now interested in optimally adapting our system to a specific task. Here, we realize adaptations by optimizing involved hyperparameters. Due to our probabilistic model, this optimization can be done by minimizing the negative GP log-likelihood as given in Eq. (4).

Exact optimization of the negative GP log-likelihood is very time consuming and intractable for large-scale datasets. We show how to overcome this drawback by instead optimizing a proxy function which can be evaluated efficiently. A theoretical analysis reveals that our proxy function is indeed an upper bound of the exact negative GP log-likelihood and thus leads to similar optima.

6.1 An Upper Bound of the Log-Determinant

As we reviewed in Eq. (4), computing the negative log-likelihood mainly requires the evaluation of two terms: a quadratic data term \(\mathbf {\varvec{y}}^{\mathrm {T}} \mathbf {\varvec{K}}_{\mathbf {\varvec{\eta }}}^{-1} \mathbf {\varvec{y}}\) and a complexity term \(\log \det \left( \mathbf {\varvec{K}}_{\mathbf {\varvec{\eta }}}\right) \). Since the data term involves solving the same linear system as required for learning, we can compute it efficiently using the techniques presented previously. In contrast, the complexity term requires the determinant of the kernel matrix, which is a costly algebraic operation even with fast HIK multiplications (Yuster 2008). For this reason, we develop an upper bound of the log-determinant which ultimately leads to the upper bound of the negative log-likelihood.

The derived bound is based on the results of Bai and Golub (1997) and turns out to be efficiently computable with HIKs. Let us therefore assume that for a given positive definite matrix \(\mathbf {\varvec{M}} \in {\mathbb {R}}^{N\times N}\), all eigenvalues \(\lambda _i\) can be bounded by \(0 < \lambda _i \le \beta \). Then, an upper bound of the log-determinant is given by:

$$\begin{aligned} \log \det (\mathbf {\varvec{M}})&\le \begin{bmatrix} \log \beta ,&\log \overline{t} \end{bmatrix} \begin{bmatrix} \beta&\overline{t} \\ \beta ^2&\overline{t}^2 \end{bmatrix}^{-1} \begin{bmatrix} \mu _1 \\ \mu _2\end{bmatrix} \end{aligned}$$
(18)
$$\begin{aligned}&\overset{\cdot }{=} {{\mathrm{ub}}}(\beta , \mu _1, \mu _2) \end{aligned}$$
(19)
$$\begin{aligned} \text {with}\quad \bar{t}&= \frac{\beta \mu _1 - \mu _2}{\beta N- \mu _1}, \end{aligned}$$
(20)

where \(\mu _1 = {{\mathrm{{\text {tr}}}}}\left( \mathbf {\varvec{M}}\right) \) is the trace of the matrix and \(\mu _2 = \Vert \mathbf {\varvec{M}} \Vert _F^2 \) is its squared Frobenius norm (Bai and Golub 1997).

While Bai and Golub (1997) proved the correctness of that bound, its tightness is what finally matters in practical applications. From this point of view, it is interesting to note that the bound is tight for regularized rank-1 matrices \(\mathbf {\varvec{M}} = \mathbf {\varvec{u}} \mathbf {\varvec{u}}^{\mathrm {T}} + \tau \mathbf {\varvec{I}}\). This fact arises as a direct generalization of the result from Bai and Golub (1997) on Pei matrices. Fortunately, kernel matrices computed on common datasets are often of low rank as observed by Williams and Seeger (2000). Thus, we can expect that the bound can indeed offer sufficient accuracy in many scenarios. We give an empirical proof for this statement in Sect. 10.1.

How to Efficiently Evaluate the Upper Bound Function To calculate the bound given in Eq. (19) for the regularized kernel matrix \(\mathbf {\varvec{\tilde{K}}}_\mathbf {\varvec{\eta }}\), we need its largest eigenvalue \(\lambda _{\max }\), its trace \(\mu _1\), and its squared Frobenius norm \(\mu _2\). We first compute the largest eigenvalue \(\lambda _{\max }\) with the Arnoldi iteration, which only requires matrix vector products. In our experiments, the algorithm needed approximately \(T_3\sim 10\) steps until convergence with high accuracy for various settings.

The trace of the regularized kernel matrix can be decomposed into two terms:

$$\begin{aligned} {{\mathrm{{\text {tr}}}}}( \mathbf {\varvec{\tilde{K}}}_\mathbf {\varvec{\eta }})&= \sum \limits _{i=1}^N\left( \kappa (\mathbf {\varvec{x}}_{i}, \mathbf {\varvec{x}}_{i}) + \sigma _{\mathsf {n}}^2\right) \nonumber \\&= \sigma _{\mathsf {n}}^2\cdot N+ \sum \limits _{i=1}^N\sum \limits _{d=1}^D\min ( \mathbf {\varvec{x}}_i\left( d\right) , \mathbf {\varvec{x}}_i\left( d\right) ). \end{aligned}$$
(21)

The first part arises from the noise model, whereas the sum of self-similarities constitutes the second part. For the specific choice of HIK, the latter one is equal to the sum of all feature values:

$$\begin{aligned} {{\mathrm{{\text {tr}}}}}( \mathbf {\varvec{\tilde{K}}}_\mathbf {\varvec{\eta }})&= \sigma _{\mathsf {n}}^2\cdot N+ \sum \limits _{i=1}^N\sum \limits _{d=1}^D\mathbf {\varvec{x}}_i\left( d\right) . \end{aligned}$$
(22)

If we use \(L_1\)-normalized histograms with \(\Vert \mathbf {\varvec{x}}_{i}\Vert _1=1\), this further simplifies to \({{\mathrm{{\text {tr}}}}}( \mathbf {\varvec{\tilde{K}}}_\mathbf {\varvec{\eta }})=(\sigma _{\mathsf {n}}^2+ 1)\cdot N\). Note that similar derivations hold for GHIKs. As we can see, incorporating prior knowledge about kernels and features helps to speed up the computations.

However, the squared Frobenius norm of \(\mathbf {\varvec{\tilde{K}}}_\mathbf {\varvec{\eta }}\) is not directly available. Nonetheless, we can approximate it as shown next. To this end, let \(M\) again denote the number of classes of the classification task and let \(\lambda _i\) be the ith largest eigenvalue of the regularized kernel matrix such that \(\lambda _1 \ge \cdots \ge \lambda _N\) are sorted in decreasing order. Then, an intuitive approximation is \(\mu _2 = \sum _{i=1}^{N} \lambda _i^2 \ge \sum _{i=1}^{M} \lambda _i^2 = \tilde{\mu }_2\). The motivation for this approximation is as follows: if we have \(M\) classes with very compact clusters and large distances between each other, the kernel matrix should obey a simple block structure of rank \(M\) leading to \(M\) non-zero eigenvalues and hence \(\mu _2 \approx \tilde{\mu }_2\).

Due to the fact that our approximation of \(\mu _2\) is a lower bound of \(\Vert \mathbf {\varvec{\tilde{K}}}_\mathbf {\varvec{\eta }}\Vert _F^2\), the necessary computations in Eq. (19) are still well-defined. We verify in the next section that we still have a proper upper bound of the log-determinant. The final upper bound is:

$$\begin{aligned} \log \det ( \mathbf {\varvec{\tilde{K}}}_\mathbf {\varvec{\eta }}) \le {{\mathrm{ub}}}\left( \lambda _{\max },\, {{\mathrm{{\text {tr}}}}}\left( \mathbf {\varvec{\tilde{K}}}_\mathbf {\varvec{\eta }}\right) ,\, \sum _{i=1}^{M} \lambda _i^2\right) \ . \end{aligned}$$
(23)

With our experimental results in Sects. 9.3, 10.1, and 10.4, we show how to successfully utilize the resulting upper bound of the negative GP log-likelihood for hyperparameter optimization.
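The following sketch evaluates the bound, combining Eqs. (18)–(20) with the trace of Eq. (22) and the \(M\)-eigenvalue Frobenius approximation of Eq. (23); using ARPACK as the matrix-free eigensolver is our implementation choice.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def logdet_upper_bound(beta, mu1, mu2, N):
    """Evaluate ub(beta, mu1, mu2) of Eqs. (18)-(20)."""
    t_bar = (beta * mu1 - mu2) / (beta * N - mu1)
    lhs = np.array([np.log(beta), np.log(t_bar)])
    mat = np.array([[beta, t_bar], [beta ** 2, t_bar ** 2]])
    return float(lhs @ np.linalg.solve(mat, np.array([mu1, mu2])))

def hik_logdet_bound(X, noise_var, num_classes):
    """Bound of Eq. (23) for the regularized HIK kernel matrix, computed matrix-free."""
    N = X.shape[0]
    K_reg = LinearOperator((N, N), dtype=np.float64,
                           matvec=lambda v: hik_matvec(X, v) + noise_var * v)
    mu1 = noise_var * N + X.sum()                            # trace, Eq. (22)
    lams = eigsh(K_reg, k=num_classes, which='LM',           # M largest eigenvalues
                 return_eigenvectors=False)
    beta = lams.max()                                        # upper bound on all eigenvalues
    mu2_tilde = float((lams ** 2).sum())                     # lower bound of the squared Frobenius norm
    return logdet_upper_bound(beta, mu1, mu2_tilde, N)
```

Together with the CG solver of Sect. 5.1 for the quadratic data term, the whole proxy objective of Eq. (4) then only requires matrix-vector products.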

6.2 Proof of the Upper Bound in Case of Frobenius Norm Approximation

So far, we have proposed to use a lower bound for the Frobenius norm of \(\mathbf {\varvec{\tilde{K}}}_\mathbf {\varvec{\eta }}\) based on the sum of the \(M\) largest eigenvalues to avoid expensive computations of the original negative log-likelihood. In the following, we prove that we are able to obtain a valid upper bound of the log-determinant with the bound of Bai and Golub (1997) even when using a lower bound of the Frobenius norm. Our proofs are completely algebraic and do not require knowledge of the Gaussian quadrature techniques used in Bai and Golub (1997). First of all, we show the validity of the modified upper bound for \(\beta =1\). The proofs of the results are given in Appendix.

Lemma 1

(Monotonicity for \(\beta =1\)) Let \(\tilde{\mu }_2\) with \(0 < \tilde{\mu }_2 \le \mu _2\) be a lower bound of the squared Frobenius norm of a regular positive definite matrix \(\mathbf {\varvec{M}}\), e.g., \(\tilde{\mu }_2 = \sum _{i=1}^M \lambda _i^2\) with \(M< N\). Then the following holds for every positive definite matrix \(\mathbf {\varvec{M}}\) with \(\mu _1 = {{\mathrm{{\text {tr}}}}}(\mathbf {\varvec{M}})\) and \(\beta =1\) being the largest eigenvalue of \(\mathbf {\varvec{M}}\):

$$\begin{aligned} {{\mathrm{ub}}}(1, \mu _1, \tilde{\mu }_2) \ge {{\mathrm{ub}}}(1, \mu _1, \mu _2). \end{aligned}$$
(24)

The next lemma shows that scaling the matrix \(\mathbf {\varvec{M}}\) with \(\gamma > 0\) leads to an additive constant in the bound, which is independent of \(\mu _1\) and \(\mu _2\). This constant is equivalent to the one occurring in \(\log \det \left( \gamma \mathbf {\varvec{M}} \right) = \log \det \left( \mathbf {\varvec{M}} \right) + N \log \gamma \); therefore, the quality of the bound is invariant with respect to \(\gamma \). Note that the squared Frobenius norm scales with \(\gamma ^2\) and \(\bar{t}\) with \(\gamma \) (see Eq. 20).

Lemma 2

(Multiplicative scaling) For all suitable parameters \(\beta , \mu _1,\) and \(\mu _2\) of a positive definite matrix and every positive factor \(\gamma >0\), the following holds:

$$\begin{aligned} {{\mathrm{ub}}}( \gamma \beta , \gamma \mu _1, \gamma ^2 \mu _2 )&= {{\mathrm{ub}}}( \beta , \mu _1, \mu _2 ) + N \cdot \log \gamma . \end{aligned}$$
(25)

The last step is to combine both lemmas, which leads directly to the validity of the bound of Bai and Golub for our Frobenius norm approximation:

Theorem 2

(Upper bound with \(\tilde{\mu }_2\) ) Let \(\mathbf {\varvec{M}}\in {\mathbb {R}}^{N\times N}\) be a positive definite matrix with trace \(\mu _1\), squared Frobenius norm \(\mu _2\), and upper bound \(\beta \) for the eigenvalues (e.g., the largest eigenvalue). If \(\tilde{\mu }_2\) is a lower bound of \(\mu _2\), the following holds:

$$\begin{aligned} \log \det (\mathbf {\varvec{M}} ) \le {{\mathrm{ub}}}(\beta , \mu _1, \mu _2 ) \le {{\mathrm{ub}}}( \beta , \mu _1, \tilde{\mu }_2). \end{aligned}$$
(26)

With this theorem on hand, we are able to efficiently optimize hyperparameters even in large-scale scenarios as validated in the experimental sections.

6.3 Optimization Technique

The actual optimization is carried out with a method that does not require any gradient information, because calculating the gradient of the log-likelihood or the gradient of our upper bound is still a costly operation. A popular technique for this task is the downhill-simplex method, which is also known as Nelder–Mead method (Nelder and Mead 1965). Note that any other black-box optimization method could be applied as well. In experimental evaluations, however, we stick to the downhill-simplex technique.
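A sketch of the resulting outer loop is given below: Nelder–Mead over the proxy objective of Sect. 6, here parameterized by a single HIK-POLY exponent \(\eta \) and the noise variance (our own choice of parameterization). The helper neg_log_likelihood_bound is assumed to combine the CG data term with the log-determinant bound sketched above.

```python
import numpy as np
from scipy.optimize import minimize

def optimize_hyperparameters(X, Y, eta0=1.0, noise0=0.1):
    """Gradient-free minimization of the upper bound of the negative log-likelihood."""
    def proxy_objective(log_params):
        eta, noise_var = np.exp(log_params)          # log-parameterization keeps both positive
        Xt = ghik_poly(X, eta)                       # HIK-POLY feature transform, Eq. (15)
        return neg_log_likelihood_bound(Xt, Y, noise_var)   # assumed helper: data term + Eq. (23)
    res = minimize(proxy_objective, x0=np.log([eta0, noise0]), method='Nelder-Mead')
    return np.exp(res.x)                             # optimized (eta, sigma_n^2)
```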

7 Estimating the GP Predictive Variance

Up to now, we only considered fast computations of the predictive mean derived from GP regression. However, in many scenarios such as active learning or novelty detection, it is important to also obtain an estimate for the uncertainty of the prediction. The uncertainty is mostly measured in terms of class entropy, which for Gaussian distributions is directly related to the variance. For this reason, we develop methods to efficiently compute the GP predictive variance also in large-scale scenarios.

As presented in Eq. (3), the predictive variance \(\sigma ^2_*\) depends on three terms. The first and the third term, i.e., the self-similarity \(k_{**}=\kappa (\mathbf {\varvec{x}}_*, \mathbf {\varvec{x}}_*)\) and the noise variance \(\sigma _{\mathsf {n}}^2\), form the a-priori variance and do not take any known training examples into account. In contrast, the second term reduces this a-priori variance based on the similarities between the test example \(\mathbf {\varvec{x}}_*\) and the training examples \(\mathbf {\varvec{X}}\).

To efficiently compute \(\sigma ^2_*\), we start in Sect. 7.1 by applying the techniques for fast classification introduced in Sect. 5.2. However, since the second term is a quadratic form instead of a linear form in \(\mathbf {\varvec{k}}_{*}\), these computations are not highly efficient. Therefore, we further show how to approximate the second term in an efficient manner using fast kernel evaluations (Sects. 7.2 and 7.3).

7.1 PUP: Precise Uncertainty Prediction

Naively computing the predictive variance for a single test example \(\mathbf {\varvec{x}}_*\) involves three steps:

  1. explicitly computing the kernel vector \(\mathbf {\varvec{k}}_{*}\),

  2. solving \(\varvec{\alpha }_* = \mathbf {\varvec{\tilde{K}}}^{-1} \mathbf {\varvec{k}}_{*}\) specifically for \(\mathbf {\varvec{x}}_*\), and

  3. computing the inner product of \(\mathbf {\varvec{k}}_{*}\) and \(\varvec{\alpha }_*\).

For step 1, we require \(\mathcal {O}\left( ND\right) \) operations to explicitly evaluate the kernel function for all examples and dimensions. After that, we can apply an iterative linear solver in step 2 to compute \(\varvec{\alpha }_*\). As noted previously, this requires \(\mathcal {O}\left( NDT_1\right) \) operations. Finally, we can compute the product of \(\mathbf {\varvec{k}}_{*}\) and \(\varvec{\alpha }_*\) in \(\mathcal {O}\left( N\right) \) operations to obtain the desired data term. Since no approximation is involved, we refer to this method as Precise Uncertainty Prediction (PUP).
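A sketch of the three PUP steps with a matrix-free CG solve is given below; the explicit kernel vector and the HIK self-similarity \(k_{**} = \Vert \mathbf {\varvec{x}}_*\Vert _1\) (for non-negative features) are the only kernel-specific parts, and all names are our own.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def hik_kernel_vector(X, x_star):
    """Step 1: explicit O(ND) evaluation of k_* for the HIK."""
    return np.minimum(X, x_star[None, :]).sum(axis=1)

def predictive_variance_pup(X, x_star, noise_var=0.1):
    """Exact predictive variance of Eq. (3) without storing the kernel matrix."""
    N = X.shape[0]
    k_star = hik_kernel_vector(X, x_star)
    K_reg = LinearOperator((N, N), dtype=np.float64,
                           matvec=lambda v: hik_matvec(X, v) + noise_var * v)
    alpha_star, _ = cg(K_reg, k_star)               # step 2: solve K~ alpha_* = k_*
    k_ss = x_star.sum()                             # HIK self-similarity of non-negative features
    return k_ss - k_star @ alpha_star + noise_var   # step 3: Eq. (3)
```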

In total, we need \(\mathcal {O}\left( NDT_1\right) \) operations to compute the exact predictive variance for an unseen example during testing. However, since \(T_1\) is implicitly related to \(N\), the resulting runtime might be too slow for large-scale applications. In addition, the exact values of classification uncertainties are not even required in certain scenarios. Active learning is one example, where only the relative order of uncertainty values is important. For these scenarios, we develop efficient approximations as shown in the following.

7.2 FAPU: Fine Approximation of the Predictive Uncertainty

To obtain efficient approximations of the predictive variance, we start by considering fundamental properties of the involved computations. Since the regularized kernel matrix \(\mathbf {\varvec{\tilde{K}}}\) is symmetric and positive definite, the same holds for its inverse. Therefore, we can use upper and lower bounds for quadratic forms to obtain suitable approximations for \(\sigma ^2_*\) (see Appendix for details).

To this end, let \(\mathbf {\varvec{M}} \in \mathbb {R}^{N\times N}\) be a positive definite matrix. Then, linear algebra provides us with the following lower bound on the quadratic form of \(\mathbf {\varvec{M}}\) for any vector \(\mathbf {\varvec{x}}\in \mathbb {R}^{N}\):

$$\begin{aligned} \begin{aligned} \mathbf {\varvec{x}}^{\mathrm {T}} \mathbf {\varvec{M}} \mathbf {\varvec{x}}\ge&\sum _{i=N-k+1}^{N} \lambda _i \tilde{\mathbf {\varvec{x}}} \left( i\right) ^2 \\&+ \lambda _{N-k} \left( \left| \left| \mathbf {\varvec{x}}\right| \right| ^2 - \sum _{j=N-k+1}^{N} \tilde{\mathbf {\varvec{x}}} \left( j\right) ^2 \right) , \end{aligned} \end{aligned}$$
(27)

where \(\tilde{\mathbf {\varvec{x}}} \left( j\right) \) is the projection of \(\mathbf {\varvec{x}}\) onto the jth eigenvector of \(\mathbf {\varvec{M}}\). With the parameter \(k\), we can influence the tightness of the bound which is exact for the extreme case of \(k= N\). As before, each variable \(\lambda _j\) denotes the jth largest eigenvalue of \(\mathbf {\varvec{M}}\).

We can now apply the lower bound of Eq. (27) to the data term of the predictive variance given in Eq. (3) by instantiating \(\mathbf {\varvec{M}} = \mathbf {\varvec{\tilde{K}}}^{-1}\):

$$\begin{aligned} \begin{aligned} \sigma ^2_*\ \le&\ k_{**} - \Bigl ( \sum _{j = N-k+1}^{N} \lambda _j \tilde{\mathbf {\varvec{k}}}_*\left( j\right) ^2 \Bigr ) \\&- \lambda _{N-k} \Bigl ( \left| \left| \mathbf {\varvec{k}}_{*}\right| \right| ^2 - \sum _{j = N-k+1}^{N} \tilde{\mathbf {\varvec{k}}}_*\left( j\right) ^2 \Bigr ) + \sigma _{\mathsf {n}}^2. \end{aligned} \end{aligned}$$
(28)

While this result looks promising, we spent major efforts in previous sections to circumvent explicit computations of the inverse kernel matrix. In consequence, we cannot access the corresponding eigenvalues or eigenvectors directly. Fortunately, linear algebra provides us with relations between eigenvalues and eigenvectors of a matrix \(\mathbf {\varvec{M}}\) and its inverse. In fact, for any symmetric and positive definite matrix \(\mathbf {\varvec{M}}\), the eigenvalues \(\lambda \) of \(\mathbf {\varvec{M}}\) are related to the eigenvalues \(\xi \) of \(\mathbf {\varvec{M}}^{-1}\) via \(\lambda _j = \frac{1}{\xi _{N-j+1}}\). Furthermore, the eigenvector of \(\mathbf {\varvec{M}}\) corresponding to \(\lambda _i\) is the same as the eigenvector of \(\mathbf {\varvec{M}}^{-1}\) belonging to \(\xi _{N-i+1}\). Consequently, we obtain \(\mathbf {\varvec{\nu }}\left( i\right) = \tilde{\mathbf {\varvec{k}}}_*\left( N-i+1\right) \) for the projection of \(\mathbf {\varvec{k}}_{*}\) onto the ith eigenvector of \(\mathbf {\varvec{\tilde{K}}}\).

Using both relations, we can reformulate the previous bound in terms of eigenvectors and eigenvalues of the implicitly accessible kernel matrix \(\mathbf {\varvec{\tilde{K}}}\):

$$\begin{aligned} \begin{aligned} \sigma ^2_*\ \le&\ k_{**} - \Bigl ( \sum _{i = 1}^{k} \frac{1}{\xi _{i}} \ \mathbf {\varvec{\nu }}\left( i\right) ^2 \Bigr ) \\&- \frac{1}{\xi _{k+1}} \Bigl ( \left| \left| \mathbf {\varvec{k}}_{*}\right| \right| ^2 - \sum _{i = 1}^{k} \mathbf {\varvec{\nu }}\left( i\right) ^2 \Bigr ) + \sigma _{\mathsf {n}}^2. \end{aligned} \end{aligned}$$
(29)

Since we can adjust \(k\) for the desired precision, we call this technique a fine approximation of the predictive uncertainty (FAPU).

As noted previously, eigenvalues and eigenvectors can be computed using the Arnoldi iteration. For \(k+1\) eigenvalues and \(k\) eigenvectors, this requires \(\mathcal {O}\left( NkT_3\right) \) operations. Again, the number of iterations \(T_3\) until convergence was almost constant (about 10) in our experiments. For the explicit computation of the kernel vector \(\mathbf {\varvec{k}}_{*}\), we still have to spend \(\mathcal {O}\left( ND\right) \) operations. Squared projections \(\tilde{\mathbf {\varvec{k}}}_*\left( i\right) ^2\) of \(\mathbf {\varvec{k}}_{*}\) onto eigenvectors can then be computed in \(\mathcal {O}\left( N\right) \). Similar considerations hold for the norm \(\left| \left| \mathbf {\varvec{k}}_{*}\right| \right| ^2\).

In summary, we need \(\mathcal {O}\left( ND+ NkT_3\right) \) operations to compute the fine approximation of \(\sigma ^2_*\) as given in Eq. (29). Although we thereby reduce the quadratic scaling, we still depend linearly on \(N\). Nonetheless, using the histogram intersection kernel, we can even develop a coarser approximation leading to a further speed-up as shown next.
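As an illustration, the following sketch computes the FAPU bound of Eq. (29), using scipy's ARPACK interface in place of a hand-written Arnoldi iteration. For brevity, the kernel matrix is formed explicitly here, whereas our implementation only accesses it through fast HIK matrix-vector products; names and defaults are illustrative.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def fapu_variance(X, x_star, k=2, sigma_n2=0.1):
    """Fine approximation (FAPU) of the predictive variance, Eq. (29).
    Sketch only: K~ is built explicitly, which costs O(N^2 D) memory/time."""
    N = X.shape[0]
    k_star = np.minimum(X, x_star).sum(axis=1)
    K = np.minimum(X[:, None, :], X[None, :, :]).sum(axis=2) + sigma_n2 * np.eye(N)

    # (k+1) largest eigenvalues xi_1 >= ... >= xi_{k+1} of K~ and the first k eigenvectors
    op = LinearOperator((N, N), matvec=lambda v: K @ v)
    xi, U = eigsh(op, k=k + 1, which='LM')
    order = np.argsort(xi)[::-1]
    xi, U = xi[order], U[:, order]

    nu = U[:, :k].T @ k_star                  # projections of k_* onto the top eigenvectors
    k_ss = x_star.sum()                       # HIK: k(x_*, x_*) = ||x_*||_1
    return k_ss - (nu ** 2 / xi[:k]).sum() \
           - (k_star @ k_star - (nu ** 2).sum()) / xi[k] + sigma_n2
```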

7.3 CAPU: Coarse Approximation of the Predictive Uncertainty

The extreme case of the previous approximation is obtained with \(k= 0\):

$$\begin{aligned} \sigma ^2_*\le k_{**} - \frac{1}{\xi _1} \left| \left| \mathbf {\varvec{k}}_{*}\right| \right| ^2 + \sigma _{\mathsf {n}}^2. \end{aligned}$$
(30)

Thus, even for the most extreme approximation using FAPU, we still require a computation time linear in \(N\) to compute \(\mathbf {\varvec{k}}_{*}\) and its squared norm. However, for the specific choice of HIK as the kernel function, we note that \(\left| \left| \mathbf {\varvec{k}}_{*}\right| \right| ^2\) can be expressed as follows:

$$\begin{aligned} \left| \left| \mathbf {\varvec{k}}_{*}\right| \right| ^2 = \mathbf {\varvec{k}}_{*}^{\mathrm {T}} \mathbf {\varvec{k}}_{*}= \sum _{i = 1}^{N} \Bigl ( \sum _{d = 1}^{D} \min \left( \mathbf {\varvec{x}}_*\left( d\right) ,\mathbf {\varvec{x}}_i\left( d\right) \right) \Bigr )^2. \end{aligned}$$
(31)

We will now exploit the properties of the HIK to approximate \(\left| \left| \mathbf {\varvec{k}}_{*}\right| \right| ^2\) by a lower bound. Thereby, we will still obtain a valid upper bound approximation for the predictive variance as given in Eq. (30).

One important aspect for the approximation arises from properties of sparse features. When features have only a few non-zero entries, the majority of mixed terms

$$\begin{aligned} \min \left( \mathbf {\varvec{x}}_*\left( d_1\right) , \mathbf {\varvec{x}}_i\left( d_1\right) \right) \cdot \min \left( \mathbf {\varvec{x}}_*\left( d_2\right) , \mathbf {\varvec{x}}_i\left( d_2\right) \right) \end{aligned}$$

between different dimensions in Eq. (31) will vanish. For a sparsity ratio of 0.1, these are already \(99\,\%\) of all terms! In these scenarios, neglecting the mixed terms is well justifiable and we obtain an expression that looks like a Parzen density estimation with squared kernel values:

$$\begin{aligned} \left| \left| \mathbf {\varvec{k}}_{*}\right| \right| ^2&\ge \sum _{i = 1}^{N} \sum _{d = 1}^{D} \left( \min \left( \mathbf {\varvec{x}}_*\left( d\right) , \mathbf {\varvec{x}}_i\left( d\right) \right) \right) ^2 \nonumber \\&= \sum _{i = 1}^{N} \sum _{d = 1}^{D} \min \left( \mathbf {\varvec{x}}_*\left( d\right) ^2, \mathbf {\varvec{x}}_i\left( d\right) ^2 \right) . \end{aligned}$$
(32)

On a closer look, we notice that Eq. (32) is similar to Eq. (6) but with squared features and \(\mathbf {\varvec{\alpha }} \equiv \mathbf {\varvec{1}}\). Therefore, we can apply the same techniques as described in Sect. 4.1 with squared feature values. Furthermore, we can even use the same permutations of the learning data. The only additional overhead comes from computing a new matrix \(\mathbf {\varvec{A}}_{\sigma ^2_*} \in {\mathbb {R}}^{D\times N}\) storing the cumulative sums of squared feature values similar to \(\mathbf {\varvec{A}}\) of Sect. 4.1.

In consequence, we can compute the squared kernel vector within \(\mathcal {O}\left( D\log N\right) \) operations for an unseen example \(\mathbf {\varvec{x}}_*\). Note that we can now even apply the quantization idea described in Sect. 4.2. Thereby, the computation time is ultimately reduced to \(\mathcal {O}\left( D\right) \) operations and only requires the additional computation of a look-up table \(\mathbf {\varvec{T}}_{\sigma ^2_*}\) similar to \(\mathbf {\varvec{T}}\) of Sect. 4.2. We refer to this fastest uncertainty approximation as q-CAPU.
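A minimal sketch of the coarse approximation is given below. The container names (A_sq, CapuBound) are chosen for illustration only, and the largest eigenvalue \(\xi _1\) of \(\mathbf {\varvec{\tilde{K}}}\) is assumed to be precomputed, e.g., with the Arnoldi iteration mentioned above.

```python
import numpy as np

class CapuBound:
    """Coarse approximation (CAPU) of the predictive variance, Eqs. (30) and (32)."""

    def __init__(self, X, xi_1, sigma_n2=0.1):
        # per-dimension sorted squared feature values and their cumulative sums
        self.X_sq_sorted = np.sort(X ** 2, axis=0)          # N x D
        self.A_sq = np.cumsum(self.X_sq_sorted, axis=0)     # analogue of the matrix A for squared values
        self.xi_1 = xi_1                                    # largest eigenvalue of K~
        self.sigma_n2 = sigma_n2

    def lower_bound_norm_sq(self, x_star):
        """Lower bound on ||k_*||^2 from Eq. (32), computed in O(D log N)."""
        N, D = self.X_sq_sorted.shape
        total = 0.0
        for d in range(D):
            s = x_star[d] ** 2
            # number of training values with x_i(d)^2 <= x_*(d)^2
            pos = np.searchsorted(self.X_sq_sorted[:, d], s, side='right')
            smaller = self.A_sq[pos - 1, d] if pos > 0 else 0.0
            total += smaller + (N - pos) * s
        return total

    def variance_upper_bound(self, x_star):
        """Eq. (30): sigma^2_* <= k(x_*, x_*) - ||k_*||^2 / xi_1 + sigma_n^2."""
        return x_star.sum() - self.lower_bound_norm_sq(x_star) / self.xi_1 + self.sigma_n2
```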

Fig. 3 Approximating the predictive variance \(\sigma ^2_*\) using our techniques which exploit properties of the HIK. Approximations are obtained with FAPU (left) for different numbers of \(k\) [see Eq. (29)] or using the coarse approximation (right) without and with quantization of test inputs [see Eq. (30)]. The setup is identical to Fig. 1 where no approximation was applied. a FAPU (\(k=4\)). b FAPU (\(k=1\)). c CAPU. d q-CAPU (\(q=10\))

We visualized all approximations in Fig. 3 for the \(2D\) scenario already used in Fig. 1. As we can nicely see, the approximation error of FAPU is inversely related to the number of eigenvectors used. Furthermore, the approximation converges from a piecewise quadratic function (FAPU) to a piecewise linear function (CAPU). While the precision is thereby reduced, we simultaneously reduce the required evaluation time as well.

Table 2 Overview of the presented approaches to compute the predictive variance \(\sigma ^2_*\)

A final overview of all presented approaches for predictive variance computation as well as their resulting runtimes and decision functions is given in Table 2. In summary, we are able to efficiently compute the predictive variance with adjustable precision and, correspondingly, adjustable computation time. Note that the predictive variance is the same for all known classes (Rasmussen and Williams 2006); thus, our computation times are efficient even for an extremely large number of different classes. In Sect. 11, we compare our techniques in terms of runtimes needed in large-scale experiments and study their usability for the task of active learning.

8 Incremental and Active Learning

Large-scale learning is not only important for training based on a large chunk of data in batch mode, but also when the dataset is growing incrementally. We therefore show that incremental learning can be realized within our framework with just a few minor modifications. Furthermore, we also show how standard active learning methods can be directly used with our efficient estimates of the GP predictive mean and variance.

8.1 Fast Incremental Learning

The usual blueprint for object recognition systems is to train a classifier on a given set of labeled data and to apply the resulting model to unseen examples. Although current research led to impressive results even on highly challenging datasets with this strategy (Lazebnik et al. 2006; Vedaldi et al. 2009; Kapoor et al. 2010), it suffers from two main drawbacks. First of all, there is no possibility to exploit labeled examples that become available after the training process. In consequence, we often neglect potentially useful information. In addition, this strategy fails in situations where existing categories vary over time (e.g., cell phone designs) or new categories become available (e.g., Apple’s iPod in 2001).

We could naively resolve these drawbacks by retraining from scratch as soon as new data becomes accessible. However, we would thereby suffer from huge computational costs. For representer models such as Kernel-SVM or GP, every retraining would require \(\mathcal {O}\left( N^3\right) \) operations since no information about the previously trained model is used. In contrast to that, incremental or online learning methods explicitly rely on previously trained models to efficiently adapt them over time. In the following, we show how to extend our GP/HIK for incremental learning.

As presented in the previous sections, training of GP/HIK models mainly consists of three stages: (1) sort training examples in every dimension, (2) compute the weight vector \(\mathbf {\varvec{\alpha }}\) using an iterative linear solver, and (3) compute the matrices \(\mathbf {\varvec{A}}\) and \(\mathbf {\varvec{B}}\) as well as the look-up table \(\mathbf {\varvec{T}}\) if required. For new training examples, we can exploit the previous calculations in every stage to significantly speed up retraining (a code sketch for the first two steps follows the list below):

  1. We can build on the given sorting of each dimension and find the correct position for a novel example in each dimension, which takes \(\mathcal {O}\left( \log N\right) \) time for a single dimension and \(\mathcal {O}\left( D\log N\right) \) in total.

  2. Using the previously calculated \(\mathbf {\varvec{\alpha }}\) as an initialization for the iterative linear solver, we can significantly speed up the process until convergence since the variations of \(\mathbf {\varvec{\alpha }}\) are smooth and small, especially for large training sizes.

  3. For updating the arrays \(\mathbf {\varvec{A}}\) and \(\mathbf {\varvec{B}}\) as well as the look-up table \(\mathbf {\varvec{T}}\), we only need to correct entries that are affected by new examples.
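The following sketch illustrates steps 1 and 2 under simplifying assumptions: the per-dimension sorted values are stored as plain arrays, and the updated kernel matrix of step 2 is passed explicitly for readability, whereas our implementation never forms it and only updates \(\mathbf {\varvec{A}}\), \(\mathbf {\varvec{B}}\), and \(\mathbf {\varvec{T}}\).

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def insert_sorted(sorted_vals, x_new):
    """Step 1: insert the new example into the per-dimension sorted value arrays.
    sorted_vals is a D-element list of ascending arrays; the search costs O(D log N)."""
    positions = [np.searchsorted(col, v) for col, v in zip(sorted_vals, x_new)]
    updated = [np.insert(col, p, v) for col, p, v in zip(sorted_vals, positions, x_new)]
    return updated, positions

def warm_start_alpha(K_new, y_new, alpha_old, sigma_n2=0.1):
    """Step 2: re-solve K~ alpha = y with the previous weights as initialization.
    K_new is the (N+1) x (N+1) kernel matrix, passed explicitly here for brevity."""
    N1 = K_new.shape[0]
    A = LinearOperator((N1, N1), matvec=lambda v: K_new @ v + sigma_n2 * v)
    x0 = np.append(alpha_old, 0.0)               # extend the old alpha by a zero entry
    alpha_new, _ = cg(A, y_new, x0=x0)           # warm start: few iterations in practice
    return alpha_new
```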

In addition to updates of the classification model, we could also adapt hyperparameters and optimize them on the fly. This would allow for adapting our model to new situations, e.g., when other feature dimensions become important to distinguish between categories. To speed up the optimization, we can easily use previous values of the hyperparameters as initial values for the optimization method, which is also known as warm-start.

In summary, we significantly benefit from previous calculations in every training step. We further validate this aspect in our experiments in Sect. 9.9.

8.2 Active Learning with Gaussian Processes

A main advantage of Gaussian processes is the possibility of providing feedback about how certain a classification result is. Beyond this apparent use, the predictive variance was successfully applied as a query strategy in active learning by Kapoor et al. (2010). The goal of active learning is to improve a classification model by selecting a few but highly informative examples for manual annotation. In this section, we briefly review the main ideas of Gaussian process based active learning.

For active learning, one typically has a small set \(\mathbf {\varvec{X}}= \{\mathbf {\varvec{x}}_{1},\ldots , \mathbf {\varvec{x}}_{N}\}\) consisting of labeled data and a large set \(\mathbf {\varvec{X}}_{\mathfrak {U}}= \{\mathbf {\varvec{x}}'_{1},\ldots , \mathbf {\varvec{x}}'_{N'}\}\) of unlabeled examples. To obtain a classifier \({\mathcal {A}}\) trained with most informative examples, one exploits a query function \(\mathcal {Q}\) that scores each unlabeled example. An oracle (e.g., a human expert) is then asked for the ground-truth label of the example \(\mathbf {\varvec{x}}_*\) with best score. Consequently, an active learning scenario can be seen as a quadruple \(({\mathcal {A}},\mathcal {Q},\mathbf {\varvec{X}}, \mathbf {\varvec{X}}_{\mathfrak {U}})\).

One further distinguishes query strategies into two groups (Ebert et al. 2012). Exploitative methods utilize examples of \(\mathbf {\varvec{X}}\) including their labels and rely on scores derived from outputs of the involved classifier. In contrast to that, explorative methods neglect the label information and query new examples only based on the distribution of the current examples. For the choice of Gaussian processes, Kapoor et al. (2010) introduced three possible query strategies which directly build on the trained model. Selecting examples based on the smallest absolute predictive mean:

$$\begin{aligned} {\mathcal {Q}}_{\mu _*}\left( \mathbf {\varvec{x}}'\right) = - \left| \mu _*\left( \mathbf {\varvec{x}}'\right) \right| \end{aligned}$$
(33)

is an exploitative method and selects examples close to the current decision boundary. As a complementary strategy, selecting examples with large predictive variance:

$$\begin{aligned} {\mathcal {Q}}_{\sigma ^2_*}\left( \mathbf {\varvec{x}}'\right) = \sigma ^2_*\left( \mathbf {\varvec{x}}'\right) \end{aligned}$$
(34)

is explorative and prefers examples with highest classification uncertainty regarding the known training examples. Finally, Kapoor et al. (2010) propose to select examples with small uncertainty:

$$\begin{aligned} {\mathcal {Q}}_{\text {unc}}\left( \mathbf {\varvec{x}}'\right) = - \frac{\left| \mu _*\left( \mathbf {\varvec{x}}'\right) \right| }{\sqrt{\sigma _{\mathsf {n}}^2 + \sigma ^2_*\left( \mathbf {\varvec{x}}'\right) }} \end{aligned}$$
(35)

as a combination of \({\mathcal {Q}}_{\mu _*}\) and \({\mathcal {Q}}_{\sigma ^2_*}\). The motivation here is to obtain a query function similar to the minimum margin approach suitable for SVMs (Tong and Koller 2001) but with the additional consideration of the classification uncertainty. We apply our techniques for efficient GP inference to these query strategies in Sect. 11.3 enabling active learning with a large pool of unlabeled examples.
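A minimal sketch of the three query strategies written as score functions is given below (a larger score means the example is queried first). Function and variable names are illustrative; the normalization of the last strategy follows the uncertainty criterion of Kapoor et al. (2010).

```python
import numpy as np

def query_scores(mu, var, sigma_n2=0.1, strategy='unc'):
    """Active learning scores for a pool of unlabeled examples.
    mu and var hold GP predictive means and variances of the pool."""
    if strategy == 'mean':        # smallest absolute predictive mean (exploitative)
        return -np.abs(mu)
    if strategy == 'var':         # largest predictive variance (explorative)
        return var
    if strategy == 'unc':         # combination of both terms
        return -np.abs(mu) / np.sqrt(sigma_n2 + var)
    raise ValueError(strategy)

# usage: pick the unlabeled example with the best score and ask the oracle for its label
# best_index = np.argmax(query_scores(mu_pool, var_pool))
```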

9 Experimental Analysis: Large-Scale Classification

Our experimental evaluation is divided into three parts, thereby following the structure of the previous theoretical sections. In the current section, we analyze our techniques for efficient classification in large-scale scenarios. The suitability of our optimization approach is evaluated in Sect. 10, and experiments with our variance computation techniques are finally presented in Sect. 11.

Table 3 Large-scale learning and classification for 200 binary ImageNet tasks: computation times are given as median values of measurements for each task (learning) and each test example (classification)

9.1 Main Experimental Datasets

The majority of evaluations in the following three sections is done on two datasets which we briefly introduce here.

ImageNet for Multi-Class Classification To evaluate the computational scalability of our introduced techniques, the ImageNet dataset as used for the ILSVRC’10 competition (Berg et al. 2010) serves as a perfect benchmark. We use in total 150,000 images from 1000 different categories from this dataset. Learned models are evaluated on 50,000 examples from the ILSVRC’10 validation dataset. As commonly done for ImageNet experiments, the flat-1 error is used as the performance measure, indicating the ratio of misclassified examples among all test data.

ImageNet for Binary Classification Training of multi-class GP models with 1000 categories and hundreds of thousands of training examples is extremely time consuming even with our efficient GP/HIK techniques. Therefore, we also derive binary classification tasks from this dataset to allow for further analyses at lower computational cost. Binary tasks are derived in a one-vs-all manner, i.e., we use all images of a single class as positive examples and \(\ell \) examples from each of the other 999 categories as negative examples. Thereby, we obtain 200 tasks from the first 200 categories. Models are evaluated on the validation set using the average AUC as a measure of accuracy.

15Scenes for Detailed Analyses We also use the 15Scenes dataset as a small-scale benchmark (Lazebnik et al. 2006), where all 15 classes are used for multi-class classification. Following the suggestion of Quattoni and Torralba (2009), all images are scaled to a size of \(256 \times 256\) pixels to obtain results which are not biased by different characteristic image sizes of specific categories. Accuracy of learned models is measured with the averaged class-wise recognition rate (ARR).

9.2 Experimental Setup

Generalized histogram intersection kernels are designed to compare histogram-like image representations. Therefore, we represent images using either BOW features or non-negative activations of convolutional neural networks (CNNs). BOW features are computed using the toolkit provided with the ILSVRC’10 challenge (Berg et al. 2010). We use the provided visual codebook with 1,000 elements to allow for easy reproducibility. CNN activations are obtained from the AlexNet CNN learned on ImageNet (Donahue et al. 2013) and extracted using the Caffe toolbox.

For optimization of hyperparameters, we use the Nelder–Mead technique (Nelder and Mead 1965) as mentioned in Sect. 6.3.
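As an illustration of how a derivative-free optimizer plugs in, the sketch below minimizes the exact negative marginal log-likelihood for a small training set with scipy's Nelder–Mead implementation; kernel_fn and my_hik_poly are placeholders for a generalized HIK routine, and for large \(N\) the log-determinant would be replaced by the upper bound of Sect. 6.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_eta, X, y, kernel_fn, sigma_n2=0.1):
    """Exact negative GP marginal log-likelihood for a small training set.
    kernel_fn(X, eta) is assumed to return the N x N generalized HIK matrix."""
    eta = np.exp(log_eta[0])                      # optimize on a log scale to keep eta positive
    K = kernel_fn(X, eta) + sigma_n2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(X) * np.log(2 * np.pi)

# Derivative-free optimization with the Nelder-Mead simplex method:
# res = minimize(neg_log_marginal_likelihood, x0=[0.0],
#                args=(X, y, my_hik_poly, 0.1), method='Nelder-Mead')
# eta_opt = np.exp(res.x[0])
```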

9.3 Large-Scale Experiments with ImageNet 

The first question we are particularly interested in is whether our provided techniques allow for applying GP models to large-scale data. Therefore, we investigate two scenarios on the ImageNet dataset: binary classification and multi-class classification.

Binary Classification Scenarios Let us start with evaluating the computational benefits arising from GP/HIK in comparison with the GP baseline implementation using a Cholesky decomposition and explicit kernel evaluations. Here, we only create binary classification scenarios to obtain at least some results for the baseline approach in affordable time. Images are represented using the provided BOW features with 1,000 dimensions. Results are shown in Table 3 for \(\ell =10\) and \(\ell =50\), resulting in 10,090 and 50,050 training examples, respectively. Computation times are measured on a single-core Intel 2.6 GHz machine and our method makes use of a quantization with 100 bins to speed up classification.

As can be seen in Table 3, we are able to learn GP classifiers within a few minutes without loss in accuracy. In contrast to that, the baseline GP implementation is not applicable to more than a few thousand examples due to its computation time and memory demand. In particular, standard GP regression for \(\ell =50\) exceeded our memory capacity since the kernel matrix alone would require 19 GB.

In summary, we observe that our approach for training and classification is significantly faster than the baseline GP (speed-up factor: 196) and has only a linear memory requirement. Due to both aspects, it allows for learning on large-scale datasets that are otherwise intractable for exact GP inference.

Multi-class Classification Scenarios For a multi-class classification experiment on ImageNet, we compare against publicly available SVM solvers. Specifically, we choose the popular LibSVM package as Kernel-SVM solver with default parameter settings. Since \(\kappa _{\text {HIK}}\) is not directly supported by LibSVM, we follow the provided suggestions and apply an RBF kernel instead. Furthermore, we compare against LibLinear as the standard solver for linear SVM models in large-scale image classification scenarios (Deng et al. 2010). We apply an adaptive quantization with \(q={100}\) bins for each dimension as introduced in Sect. 4.2. As before, we use the provided BOW features with 1,000 dimensions. For different numbers of training examples, we average over ten random splits. Results in terms of Flat-1 error rates are shown in Fig. 4.

Fig. 4 Flat-1 error for multi-class classification using all 1000 categories of ImageNet and provided BOW features

We observe that GP/HIK can successfully be used in large multi-class classification scenarios with 1000 categories. For a small number of training examples \(N\), GP/HIK even outperforms LibLinear but is slightly inferior for increasing training sets. In addition, LibSVM leads to significantly larger errors. We believe that this results from fixed default settings for regularization parameters used in our experiment.

In summary, we observe that GP/HIK is a useful alternative to established SVM solvers in large multi-class scenarios.

9.4 Detailed Analysis Using the 15Scenes Dataset

The previous experiments confirmed the applicability of our GP/HIK to large-scale data. Nonetheless, a simple BOW image representation cannot compete with the state of the art today. Thus, the previous results only serve as a proof of concept. We are now interested in investigating whether GP/HIK is also useful on top of recent image representations extracted from CNN activations. We chose the 15Scenes dataset to guarantee that all training data fit into our available memory even with the largest applied image representation of \(D={64,896}\) feature dimensions (relu3).

GP/HIK on CNN Activations For different numbers of training examples, we trained GP/HIK models on top of different CNN activations. We used layers relu3 to relu7 due to their non-negativity. Results are shown in Fig. 5.

First of all, we note that GP/HIK is indeed applicable as a model on top of CNN activations. Besides, we find that high-level activations of layers relu6 and relu7 give the best performance for this task. Interestingly, earlier experiments by Freytag et al. (2014b) showed that humans only obtain an accuracy level of \(85.67\,\%\) on this dataset. Thus, the accuracy of \(87.98\,\%\) with relu7 and the largest training size is indeed remarkable.

Fig. 5 Performance of GP/HIK on 15Scenes with CNN features extracted from different relu layers and a varying number of training examples per class

GP/HIK versus SVM Baselines As in the previous section, we also compare against linear and kernelized SVM as baselines. Based on the previous results, we use relu7 features as representations. For consistency, we also add results for BOW representations and an overview is given in Table 4. As can be seen, we again outperform the SVM baselines when the same features are used. The recognition rate of a linear SVM is even lower when an \(L_1\)-normalization of the features is applied, a pre-processing technique which we use for our GP methods. Another not so surprising fact is the superiority of CNN compared to BOW or SPMK (BOW with spatial pyramid matching) features. The only technique currently outperforming our method is an AlexNet carefully fine-tuned on 15Scenes . However, this method often requires a grad-student tweaking hyperparameters for a day.

Table 4 Evaluation of GP/HIK on the 15Scenes dataset with standard BOW features (upper part) and CNN features (lower part)

In summary, we find that our approach can serve as a powerful, probabilistic classification technique on top of recent image representations.

9.5 Evaluation of Linear Solvers with Fast HIK Multiplications

In the following, we compare the performance of conjugate gradients with fast HIK matrix multiplications as presented in Sect. 4 against two coordinate descent approaches: (1) the coordinate descent method of Wu (2010) applied to GP and (2) the greedy block coordinate descent (GBCD) approach of Bo and Sminchisescu (2012). The first one was originally presented for fast SVM learning with HIK and directly operates on the look-up table \(\mathbf {\varvec{T}}\) (Sect. 4). GBCD calculates parts of the kernel matrix on the fly to solve sub-problems. For our experiments, the size of the sub-problems is set to 10 and the number of components for greedy selection is 20. We also tested other values, such as a sub-problem size of 60 with 500 components as suggested by Bo and Sminchisescu (2012), but did not achieve a significant speed-up. Note that our approach and the method of Wu (2010) exploit fast HIK matrix multiplications, while the method of Bo and Sminchisescu (2012) can be applied to every kernel function.

Fig. 6 Evaluation of the runtime and convergence of linear solvers: (1) our conjugate gradients method, (2) the coordinate descent method of Wu (2010), and (3) greedy block coordinate descent of Bo and Sminchisescu (2012)

We use again the ImageNet dataset with binary tasks and solve the linear system \(\mathbf {\varvec{\tilde{K}}}_\mathbf {\varvec{\eta }}\cdot \mathbf {\varvec{\alpha }} = \mathbf {\varvec{y}}\) with all three methods. Since we only care about the speed of convergence, we use a rather small-scale setup with \(\ell =1\). Figure 6 shows the residual of the linear system with respect to the computation time needed. Termination is done when the maximum norm of the residual drops below \(10^{-6}\). Computation times are measured on a single-core Intel 2.6 GHz machine.

As can be seen in Fig. 6, the runtimes of the three methods differ by orders of magnitude. Conjugate gradients reaches a solution in 3.7 seconds, which is superior to the coordinate descent method of Wu (2010) applied to GP (\(32 \hbox {s}\) until convergence). GBCD is slow (convergence after 16 minutes) due to the long time needed for the explicit calculation of kernel values for features of dimension \(D={1,000}\). It should be noted that solving the linear system of GP regression needs more time than solving an SVM optimization problem as presented by Wu (2010). This is due to the additional sparsity constraints of the SVM. Furthermore, the runtime results presented by Bo and Sminchisescu (2012) looked more promising, which is likely due to the low dimensionality of the features chosen in their paper (\(D\le 37\)). In summary, we find that the CG method nicely fits to our techniques for fast matrix-vector multiplications using HIKs.

9.6 Early Stopping of the Linear Solver

Early stopping refers to performing optimization not until convergence but only up to the point when the residual falls below a predefined threshold. For large-scale SVMs, Perronnin et al. (2012) found that regularization by early stopping leads to suitable generalization abilities with the additional benefit of saved computation time. Since the iterative linear solver in our proposed methods allows for early stopping as well, we evaluate in the following whether their findings also hold here.
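A minimal conjugate gradient loop with such a residual-based stopping criterion could look as follows; matvec stands for the fast HIK multiplication \(\mathbf {\varvec{\tilde{K}}} \mathbf {\varvec{v}}\) of Sect. 4 and the threshold value is illustrative.

```python
import numpy as np

def cg_early_stopping(matvec, y, res_threshold=1e-1, max_iter=500):
    """Conjugate gradients for K~ alpha = y, stopped as soon as the maximum norm
    of the residual drops below res_threshold (early stopping as regularization)."""
    alpha = np.zeros_like(y, dtype=float)
    r = y - matvec(alpha)
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        if np.max(np.abs(r)) < res_threshold:
            break
        Ap = matvec(p)
        step = rs_old / (p @ Ap)
        alpha += step * p
        r -= step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return alpha
```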

For an experimental evaluation, we conduct experiments on the 15Scenes database (Lazebnik et al. 2006) using our GP/HIK and BOW features. We stopped the process of training at different levels of accuracy reached by the iterative linear solver, i.e., if the residual dropped below predefined values. Experimental results are shown in Fig. 7.

Fig. 7 Evaluation of model regularization by early stopping. Desired accuracies for the iterative linear solver to stop the computation are displayed on the x axis, whereas the y axis shows the resulting recognition rate as a measure of generalization abilities. The standard HIK serves as kernel function

First of all, we notice a rapid convergence of the resulting classification accuracy even if we stop the linear solver at an extremely high residual. In fact, early stopping might even lead to better generalization performance. We therefore conclude that early stopping the calculation of the weights \(\varvec{\alpha }\) is advisable. In our experiments, the number of iterations needed until reaching the stopping criterion grew exponentially with the desired accuracy. This fact additionally highlights the benefit of early stopping.

In summary, we find that early stopping of the iterative linear solver leads to well regularized models and significantly saves computation times.

9.7 Comparison with GP Sparsity Methods

An important family of methods for large-scale Gaussian process inference covers sparse approximation techniques. Among them, the Fully Independent Training Conditional Approximation (FITC) is presumably the most powerful representative (Quiñonero Candela and Rasmussen 2005). It is therefore interesting to compare our efficient techniques for exact inference against FITC as a representative for sparse GP approximations. Similar to previous experiments, we use the 15Scenes dataset and relu7 features. Results for other CNN layers lead to comparable conclusions. Again, we average over ten random splits. For FITC, we evaluate different sizes of the inducing set. The accuracies depending on the number of training examples per category are visualized in Fig. 8. Furthermore, we compare required computation times for training and inference in Fig. 9.

Fig. 8 Comparison of our approach with the FITC method of Quiñonero Candela and Rasmussen (2005) on the 15Scenes dataset with relu7 features. For FITC, different relative sizes of the inducing set are used

Fig. 9 Comparison of our approach with FITC with respect to (top) time needed during inference for each test example and (bottom) total time needed for training. See Fig. 8 for an analysis of the resulting accuracies

With respect to classification accuracy (Fig. 8), we outperform FITC by a large margin, especially for small inducing sets. We can thus conclude that our techniques allow for exact GP inference and circumvent the necessity of sparse approximations. Regarding the time for inference, we observe in Fig. 9 that our quantization approach leads to constant computation time. In contrast, FITC’s computation time increases with the number of known examples. For training, however, we notice that FITC results in faster learning times. Although this does not hold asymptotically (see Table 1), the investigated setting has too few examples and too large feature dimensions to fully unveil the gain in computation time. In direct comparison, we thus conclude that our techniques are especially beneficial for learning from large datasets without requiring sparse approximations.

9.8 Evaluation of the Quantization

We already applied the quantization idea in Sect. 9.3 to evaluate GP models trained with hundreds of thousands of examples within milliseconds. For simplicity, we set the number of bins per dimension to \(q=100\) and obtained accuracies identical to the baseline GP. Let us now evaluate the resulting classification accuracy as well as the required computation times for training and inference with varying numbers of quantization bins per dimension.

Our experiments in this section are performed on 15Scenes with \(L_1\)-normalized relu7 features. Each setting is evaluated on ten random splits to obtain statistically meaningful and comparable results. Computation times are measured on an Intel Core i7-3930K CPU desktop computer with \(3.20 \hbox { GHz}\) and without any parallelization. All computation times include the overhead arising from converting features from Matlab data structures to C++ pendants in Mex-interfaces. Results are shown in Fig. 10.

Fig. 10 Effect of different quantization levels on classification accuracy (top), required computation times during training (middle), and for inference (bottom). Results are obtained on the 15Scenes dataset and averaged over ten random splits. a Accuracy (note the small range of the y axis). b Training time. c Classification time

As we can nicely observe in Fig. 10a, quantizing feature values does not negatively affect the classification accuracy. In fact, for all evaluated settings, classification results are comparable to those achieved with exact inference. It should be noted that this effect is also due to the adaptive quantization. For a uniform quantization equal in all dimensions, accuracy drops significantly for small numbers of bins (not shown here).
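The following sketch contrasts the two quantization variants under a simple assumption about the adaptive scheme, namely that each dimension receives its own bins spanning the observed value range of that dimension; the exact construction used in our implementation is the one of Sect. 4.2.

```python
import numpy as np

def quantize_uniform(X, q=100):
    """Uniform quantization: the same q bins on [0, max(X)] for every dimension."""
    upper = X.max()
    return np.minimum((X / upper * q).astype(int), q - 1)

def quantize_adaptive(X, q=100):
    """Dimension-adaptive quantization (simplified, assumed variant):
    each dimension gets its own q bins spanning [0, max of that dimension]."""
    upper = X.max(axis=0, keepdims=True)
    upper[upper == 0] = 1.0                      # guard against all-zero dimensions
    return np.minimum((X / upper * q).astype(int), q - 1)
```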

From Fig. 10b, we further observe that the training time grows beyond that of the exact baseline as more bins are used, since larger LUTs need to be computed and stored. Finally, classification times are shown in Fig. 10c. We observe clear speed-ups compared to the exact baseline.

In summary, we find that quantization of features saves valuable time during classification at the cost of an affordable overhead. Already with 100 bins per dimension, classification results are on-par with the exact baseline.

9.9 Comparing Incremental and Batch Learning

In Sect. 8.1, we analyzed how to efficiently handle new data without the necessity of retraining the classifier from scratch. To evaluate the resulting benefit, we show results of experiments conducted on the 15Scenes dataset with BOW features. In 100 runs, we randomly pick 10 examples per class as an initialization. During each run, we incrementally add 1 example per class over 50 iterations, resulting in at most 900 examples used for training the model. Every iteration consists of training the classifier as well as optimizing kernel hyperparameters to perform parameter optimization on the fly. Performances are evaluated on a disjoint test set consisting of 50 examples per class and the results are visualized in Fig. 11.

Fig. 11 Comparison of incremental learning and learning from scratch: a classification results achieved by both methods (the graphs are identical), b corresponding times for retraining the classifier when new data is accessible

From the plot in Fig. 11a, we make the well-known observation that using more examples is beneficial for training better models. In addition, we notice that the models learned in an incremental manner lead to almost identical results as those from models trained from scratch. However, when taking the computation times given in the plot of Fig. 11b into account, we obtain a clear advantage of our incremental learning approach compared to simple retraining. This speed-up increases with the number of examples and is therefore especially useful for large-scale scenarios.

Note that the majority of the update time is spent on finding optimal hyperparameters during updates. Thus, we could obtain further speed-ups by running optimization steps only after a batch of new data with several examples has been recorded. Since hyperparameters only vary slowly, this would be well justifiable in practice.

Summarizing, we are able to efficiently update our model when new data is available even with an involved parameter optimization, which allows for using Gaussian processes for large-scale scenarios in lifelong or active learning.

10 Experimental Analysis: Hyperparameter Optimization

In this section, we are interested in evaluating our approach for efficient hyperparameter optimization as presented in Sect. 6. The results of this section can be summarized as follows:

  1. Optimizing the exact log-likelihood and our upper bound approximation lead to similar optima in practice (Sect. 10.1).

  2. Generalized histogram intersection kernels improve the classification performance significantly compared to standard HIK (Sects. 10.2 and 10.3).

  3. Feature relevance determination can be done by optimizing the marginal likelihood of a weighted HIK (Sect. 10.4).

  4. Early stopping is also applicable when hyperparameters are optimized (Sect. 10.5).

10.1 Verifying the Bound of the Negative Marginal Log-Likelihood

Before we evaluate potential benefits which arise from optimizing hyperparameters, we are first of all interested in the tightness of our introduced bounds for the negative GP marginal log-likelihood presented in Sect. 6. We therefore train GP models on the 15Scenes dataset with BOW features as image representations and different values of \(\eta \) for HIK-POLY. Then, we evaluate our upper bound approximation as well as the exact value for the negative log-likelihood \(- \log p\left( \mathbf {\varvec{y}}\,|\,\mathbf {\varvec{X}},\eta \right) \). Furthermore, we evaluate trained models on hold-out test data and report average recognition rates to investigate the relation between likelihood and accuracy. Results are shown in Fig. 12.

Fig. 12 Comparison between our upper bound of the negative marginal log-likelihood, the real negative marginal log-likelihood (left y axis), and resulting ARR on the test set (right y axis). Different hyperparameter values for HIK-POLY are shown on the x axis

First of all, we notice that the exact log-likelihood is closely connected to the resulting accuracy. Thus, the log-likelihood is a useful criterion for adapting hyperparameters of models to training data. Furthermore, it can be seen that our bound closely matches the true negative marginal log-likelihood in this setup. In consequence, the minima of both curves only differ slightly and our optimization technique can be successfully applied. For higher values of \(\eta \), our bound converges to the exact value because the influence of the log-determinant term compared to the data term of the marginal log-likelihood decreases. Consequently, possible approximation errors become less important and the data term can be computed without any approximation even for large-scale datasets. We observed a similar behavior for other datasets and settings.

Table 5 Benefit of hyperparameter estimation for 200 binary ImageNet tasks: Computation times are given as median values of measurements for each task (learning) and each test example (classification)

In summary, we find that our upper bound approximation is well suited for optimization of hyperparameters.

10.2 Different Generalized HIK for Binary Classification

Since we found that our upper bound tightly matches the exact log-likelihood, we are now interested in the resulting benefits of optimizing generalized variants of the HIK. We again start with evaluations on binary classification tasks similar to Sect. 9.3. The experimental setup is kept identical, but with hyperparameter optimization activated for \(\kappa _{\text {HIK-POLY}}\) and \(\kappa _{\text {HIK-EXP}}\). Experimental results are shown in Table 5.

As we can see, our optimization technique based on the upper bound approximation is able to handle datasets with tens of thousands of training examples. Thereby, we obtain statistically significant accuracy gains (\(p < 10^{-7}\), paired t test).

In summary, we find that our optimization technique leads to valuable accuracy gains in binary classification settings.

10.3 Different Generalized HIK for Multi-class Classification

For an analysis of hyperparameters in multi-class classification scenarios, we build on the previous evaluations of Sect. 9.4 and use the 15Scenes dataset with BOW and CNN features. In contrast to the previous evaluation, we now optimize hyperparameters of HIK-POLY and HIK-EXP with our GP marginal likelihood optimization technique. Thus, this analysis complements the previous results shown in Table 4. Results are given in Table 6.

Table 6 Evaluation of hyperparameter optimization for \(\kappa _{\text {HIK-POLY}}\) and \(\kappa _{\text {HIK-EXP}}\): classification accuracy is obtained on the 15Scenes dataset with standard BOW features (upper part) and CNN features (lower part)

For the BOW features, we observe that optimizing hyperparameters of GP/HIK-EXP results in the highest accuracy. In fact, it is even comparable to the result of the spatial pyramid matching kernel (SPMK) by Quattoni and Torralba (2009) as shown in Table 4. This highlights the power of generalized HIK and our hyperparameter optimization.

When applying CNN features (lower part of Table 6), the results show a huge increase in performance in general, leading to state-of-the-art results. It should be noted that in contrast to the results by Sun and Ponce (2013), we do not perform patch discovery or fine-tuning to obtain features especially suited for the dataset, but still improve on their results listed in Table 4. Interestingly, the generalized HIK does not further increase classification accuracy here; it even reduces the performance slightly. We do not yet have a convincing explanation for this phenomenon, which requires further research.

In summary, we find that our GP marginal likelihood optimization method is technically suited to optimize hyperparameters in multi-class classification scenarios. Useful adaptations and other kernel parameterizations for recent CNN features remain an open question.

10.4 Feature Relevance Estimation

We have already seen that Gaussian processes allow for hyperparameter optimization by marginal likelihood estimation. In this experiment, we show the suitability of GP equipped with optimized weighted HIK for efficient feature relevance determination leading to superior results compared to those of SVM-based estimations.

Since no exact gradient information is available during the optimization, the Nelder–Mead method converges poorly when a huge number of parameters has to be optimized. Consequently, computing feature relevances for features with thousands of dimensions, as in our previous experiments, is currently infeasible.

Fig. 13 Relevance determination with very generalized histogram intersection kernels and GP hyperparameter optimization. The first two features contain most of the discriminative information: a feature weights estimated with 5 examples per class, b performance compared to non-weighted histogram intersection kernels. Results are averaged over 500 runs

Nevertheless, as a proof of concept we follow the same synthetic experimental setup as Ablavsky and Sclaroff (2011): for different numbers of training examples, we randomly sample eight-dimensional feature vectors with relevant information only available in the first two dimensions. The performance is estimated with 500 tests. For the specific random distributions, we refer the reader to the work of Ablavsky and Sclaroff (2011) and references therein. The results of our experiments can be seen in Fig. 13.

The information included in each dimension is well reflected by the estimated relative weights \(\eta _i\), which can be observed from the plot in Fig. 13a. Furthermore, the plot in Fig. 13b shows the recognition accuracy for standard and weighted HIK with respect to the training size. The improvement is highly significant with \(p < 10^{-7}\) using the paired t test. In comparison with Ablavsky and Sclaroff (2011), our approach additionally leads to more consistent weights and higher accuracies. The experimental results emphasize the benefits of a probabilistic framework for hyperparameter optimization.

In addition, regularization terms could be added to the objective, such as terms based on the minimum description length principle. However, this is beyond the scope of this paper.

10.5 Early Stopping of the Linear Solver

In Sect. 9.6, we investigated the effect of early stopping for GP/HIK . However, we did not optimize hyperparameters and applied the plain \(\kappa _{\text {HIK}}\) instead. Let us therefore investigate whether the same findings hold if optimization is additionally activated. We therefore repeat the same experiments as in Sect. 9.6 but optimize parameters of HIK-POLY using our marginal likelihood optimization technique. Results are shown in Fig. 14.

Fig. 14 Evaluation of model regularization by early stopping. Desired accuracies for the iterative linear solver to stop the computation are displayed on the x axis. The y axis shows the resulting recognition rate as a measure of generalization abilities. In contrast to the previous evaluation in Sect. 9.6, we now optimize the hyperparameter \(\eta \) of the HIK-POLY

As for the plain \(\kappa _{\text {HIK}}\), we again notice a rapid convergence of the resulting classification accuracy. In contrast to the results without hyperparameter optimization, there is no benefit of early stopping in terms of accuracy. In addition, it should be noted that the optimal hyperparameter value remained unchanged for all settings of the stopping criterion throughout our experiments.

We conclude that early stopping is well applicable for hyperparameter optimization using our upper bound for the negative log-likelihood, since it leads to a significant speed-up due to a decrease in iterations of the linear solver. However, an increase in accuracy with early stopping cannot be expected.

11 Experimental Analysis: Uncertainty Prediction

The third part of our experimental analysis deals with the uncertainty approximation introduced in Sect. 7. We can summarize the results of this section as follows:

  1. The predictive variance of Gaussian processes can be approximated efficiently even in large-scale scenarios with a speed-up of up to \({45,000}\times \) (Sect. 11.1).

  2. Our upper bound approximations closely follow the exact variance scores (Sect. 11.2).

  3. Approximation of the predictive variance leads to active learning results comparable to the ones achieved with the exact predictive variance (Sect. 11.3).

11.1 Fast Computation of the Predictive Variance

We start this section by evaluating the efficiency of our proposed techniques for computing the predictive variance in terms of computation times needed. As in the previous sections, we conduct experiments on the 15Scenes dataset (Lazebnik et al. 2006) as well as on the large-scale ImageNet dataset. For the 15Scenes dataset, we randomly pick 100 examples of each class for training resulting in 1500 training examples in total. Training on the large-scale ImageNet dataset with binary tasks is carried out using \(\ell =10\) or \(\ell =50\) randomly chosen examples per negative class and 100 examples for the positive class, which results in 10,090 and 50,050 training examples in total. Computation times are averaged over all remaining examples. Experiments in this section are conducted on a 3.4 GHz CPU without any parallelization.

Table 7 Runtimes needed for the computation of the predictive variance using the presented techniques from Sect. 7 in comparison to the baseline GP on two image categorization datasets (see Sect. 11.1)
Fig. 15 Tightness of GP/HIK variance approximations on the 15Scenes dataset with relu7 features. Scores of the exact method are increasingly ordered. Predictions of remaining techniques are re-ordered accordingly. GP/HIK is trained on \(L_1\)-normalized relu7 activations. Left: variance predictions of each method. Right: we normalized variance predictions of each method individually into \([0,1]\) to better visualize the overall trend. Best viewed in color

Experimental results are shown in Table 7. Especially for large-scale datasets, we obtain a significant speed-up compared to the direct computation of the variance (GP-standard). For rapid uncertainty prediction, the CAPU method turns out to be highly suitable with computation times in the order of microseconds. It should be noted that the PUP method is relatively slow due to the involved computations of the iterative linear solver. Although we have already seen the efficiency of a linear conjugate gradient method for this problem, it still needed several hundred iterations until convergence, especially for large training sets. Therefore, we recommend using the precise method only in cases where time is not the limiting factor but the GP baseline cannot be computed explicitly due to its huge memory demand.

11.2 Investigating the Tightness of Variance Approximations

In a second experiment, we investigate the deviation between exact variance scores and our introduced approximations. As argued before, depending on the application scenario, two outcomes can be acceptable: (1) the approximations should be close to the exact scores (e.g., for security applications with fixed thresholds), or (2) only the correct relative order needs to be preserved (e.g., for active learning). We analyze our techniques regarding both aspects in the following.

As previously, we use the 15Scenes dataset for evaluation. Images are represented using \(L_1\)-normalized relu7 features. For training, we select three categories with 50 randomly chosen examples of each category. All remaining data serves for model evaluation. Thereby, we obtain a small set of examples from known categories and a significantly larger pool of unknown categories. The noise level for model regularization is fixed to \(\sigma _{\mathsf {n}}^2=0.1\). For FAPU, we spend \(T_3=100\) iterations for the Arnoldi technique to reliably estimate even 16 eigenvectors. The quantization for q-CAPU is done with \(q=100\) dimension-adaptive bins per dimension.

To analyze the deviation of scores, we sort the variance scores of the exact PUP technique in increasing order. Predictions of the remaining techniques are re-arranged accordingly. Results are shown in the left part of Fig. 15. For visualization, only every 5th sample is plotted.

The relative order of scores is additionally analyzed by normalizing the scores of each method individually into \([0,1]\). Again, scores of PUP are sorted as before and scores of the remaining methods are plotted accordingly. Results are shown in the right part of Fig. 15.

First of all, we clearly observe the upper bound relationship among our introduced techniques. Regarding exact scores, it can be seen that PUP is closely matched by the FAPU approximation. Hence, FAPU should be used when approximation errors are undesirable. Besides, we observe that q-CAPU and CAPU lead to identical results although their scores differ strongly from the exact counterparts. Nonetheless, the relative order is comparable (right figure). Hence, CAPU is a reasonable choice if only the relative order of variance estimates is required.

11.3 Application for Active Learning

In a final experiment, we are interested in applying our GP/HIK to active learning scenarios. Therefore, we apply the three query strategies by Kapoor et al. (2010) as reviewed in Sect. 8.2. We evaluate our methods on the difficult real-world ImageNet dataset. For each experiment, we randomly pick a single positive class as well as four, nine, or nineteen classes providing negative examples. Starting with two randomly chosen examples per class, we query new examples using the proposed methods. Each task is repeated with 100 random initializations. Final results are achieved by averaging over 100 different tasks. For variance estimation, we apply our FAPU technique with \(k=2\) eigenvectors for the approximation. Experimental results are given in Fig. 16.

Fig. 16 Active learning results on 100 binary classification tasks derived from the ImageNet dataset. Best viewed in color. a 4 negative classes. b 9 negative classes. c 19 negative classes

We first notice that \(\mathcal {Q}_{\sigma ^2_*}\) again yields interesting results: it is inferior to random sampling for a small number of negative classes but slightly superior for larger numbers of classes. The query strategies \(\mathcal {Q}_{\mu _*}\) and \(\mathcal {Q}_{\text {unc}}\) tend to pick similar examples even on this challenging dataset, resulting in almost identical performances, which is due to the strong influence of the mean term in \(\mathcal {Q}_{\text {unc}}\).

As a concluding remark we state that for active learning, a suitable combination of mean and variance is beneficial to obtain satisfying learning rates, and our techniques are appropriate for computing query scores even for thousands of possible queries in large-scale learning scenarios.

12 Conclusions

In this article, we presented solutions for efficient Gaussian process inference in large-scale scenarios. Our techniques cover all aspects of inference, i.e., exact multi-class classification with label regression, hyperparameter optimization, and uncertainty prediction. A key aspect of all techniques is to exploit generalized histogram intersection kernels, which have proved to be highly suitable for measuring similarities between histogram-like representations. Thereby, our derived methods yield significant asymptotic as well as practical speed-ups (several orders of magnitude) while requiring only a linear amount of memory. We empirically validated our techniques for a wide range of application scenarios and provided detailed analyses of all involved aspects. The experimental results disprove the common belief that full non-parametric Bayesian methods are too expensive for large-scale data.

We believe that the presented techniques can be applied to a wide range of other uses of a GP model (Bonilla et al. 2008; Pillonetto et al. 2010; Bo and Sminchisescu 2010), where the inversion of the kernel matrix or computing its log-determinant is the main bottleneck. Finally, the joint possibility for efficient and exact classification, uncertainty prediction, hyperparameter adaptation, and online learning ultimately allows for life-long learning applications with never-ending data streams.

Our developed source code is publicly available at https://github.com/cvjena/gp-hik-core and licensed under LGPL. Since we provide a C++ implementation as well as Matlab interfaces, we hope that other computer vision researchers can use our techniques as an additional classification tool besides LibSVM or LibLinear.