
12.1 Introduction

In this chapter, we describe the proposed approach to k-NN boosting. First, we introduce the scope of our work, which aims at automatic visual categorization of scenes (Sect. 12.1.1) and relies on prototype-based classification (Sect. 12.1.2). Then, in Sects. 12.2.1–12.2.3 we present the key definitions for surrogate risk minimization. Our UNN algorithm is detailed in Sect. 12.2.4 for the case of the exponential risk. Section 12.2.5 presents the generic convergence theorem of UNN and the upper bound performance for exponential risk minimization. Then, in Sect. 12.3, we report our experiments on simulated and real data, comparing UNN with k-NN, support vector machines (SVM) and AdaBoost, using Gist and/or Bag-of-Feature descriptors. Real datasets include those proposed in [10, 30, 43], with a number of categories ranging from 8 to 60. Then, in Sects. 12.4 and 12.5 we discuss results, mention future work, and conclude. Finally, we defer the general form and analysis of UNN for other surrogate risks to the Appendix.

12.1.1 Visual Categorization

In this work, we address the problem of generic visual categorization. This is a relevant task in computer vision, which aims at automatically classifying images into a discrete set of categories, such as indoor vs outdoor [15, 32], beaches vs mountains, churches vs towers. Generic categorization is distinct from object and scene recognition, which are classification tasks concerning particular instances of objects or scenes (e.g., Notre Dame Cathedral vs St. Peter’s Basilica). It is also distinct from other related computer vision tasks, such as content-based image retrieval (which aims at finding images in a database that are semantically related or visually similar to a given query image) and object detection (which requires finding both the presence and the position of a target object in an image, e.g., person detection).

Automatic categorization of generic scenes is still a challenging task, due to the huge number of natural categories that should be considered in general. In addition, natural image categories may exhibit high intra-class variability (i.e., visually different images may belong to the same category) and low inter-class variability (i.e., distinct categories may contain visually similar images). Classifying images requires an effective and reliable description of the image content, for example, the location and shape of specific objects or the overall scene appearance. Although several approaches have been proposed in the recent literature to extract semantic information from images [36, 42], most state-of-the-art techniques for image categorization still rely on low-level visual information extracted by means of image analysis operators and coded into vector descriptors.

Examples of suitable low-level image descriptors for categorization purposes are Gist, that is, global image features representing the overall scene [30], and SIFT descriptors, that is, descriptors of local features extracted either at salient patches [24] or at dense grid points [23]. A Gist descriptor is based on the so-called “spatial envelope” [30], which is a very effective low dimensional representation of the overall scene based on spectral information. Such a representation bypasses segmentation, extraction of key-points and processing of individual objects and regions, thus enabling a compact global description of images. Gist descriptors have been successfully used for categorizing locations and environments, showing their ability to provide relevant priors for more specific tasks, like object recognition and detection [40]. Another successful tool for describing the global content of a scene is the Bag-of-Features scheme [38], which represents an image by the histogram of occurrences of vector quantized local descriptors like SIFT.

12.1.2 k-NN Classification

Apart from the descriptors used to compactly represent images, most image categorization methods rely on supervised learning techniques for exploiting information about known samples when classifying an unlabeled sample. Among these techniques, k-NN classification has proven successful, thanks to its easy implementation and its good generalization properties [37]. A generalization of the k-NN rule to the multi-label classification framework has also been proposed recently by [46], whose technique is based on the maximum-a-posteriori principle applied to multi-labeled k-NN. A major advantage of the k-NN rule is that it does not require an explicit construction of the feature space and is naturally adapted to multi-class problems. Moreover, from the theoretical point of view, straightforward bounds are known for the true risk (i.e., error) of k-NN classification with respect to the Bayes optimum, even for finite samples [29].

Although such advantages make k-NN classification very attractive to practitioners, it is an algorithmic challenge to speed up k-NN queries. It is also a statistical challenge to further improve the risk bounds of k-NN. In part due to the simplicity of the classification rule, many methods have been proposed to address either of these challenges. For example, nearest neighbor retrieval can be sped up with locality sensitive hashing (LSH, [13]), product quantization for nearest neighbor search [21], or vector space embedding with boosting algorithms [2, 25].

It is yet another challenge to reduce the true risk of the k-NN rule, usually tackled by data reduction techniques [17]. In prior work, the classification problem has been reduced to tracking ill-defined categories of neighbors, interpreted as “noisy” [6]. Most of these recent techniques are in fact partial solutions to a larger problem related to the nearest neighbors’ true risk, whose goal is not the discrete prediction of labels, but rather a continuous estimation of class membership probabilities [19]. This problem has been reformulated by [7] as a strong advocacy for the formal transposition of boosting to nearest neighbors classification. Such a formalization is challenging, as nearest neighbors rules are not induced, whereas all formal boosting algorithms induce so-called strong classifiers by combining weak classifiers that are themselves induced, such as decision trees [35].

A survey of the literature shows that at least four different categories of approaches have been proposed in order to improve k-NN classification:

  • learning a local or global adaptive distance metric;

  • embedding data in the feature space (kernel nearest neighbors);

  • distance-weighted and difference-weighted nearest neighbors;

  • boosting nearest neighbors.

The earliest approaches to generalizing the k-NN classification rule relied on learning an adaptive distance metric from training data (see the seminal works of [11]). An analogous approach was later adopted by [18], who carried out linear discriminant analysis to adaptively deform the distance metric. Recently, [31] has proposed a method for learning a weighted distance, where weights can be either global (i.e., only depending on classes and features) or local (i.e., depending on each individual prototype as well).

Other more recent techniques apply the k-NN rule to data embedded in a high-dimensional feature space, following the kernel trick approach of support vector machines. For example, [44] have proposed a straightforward adaptation of the kernel mapping to the nearest neighbors rule, which yields significant improvement in terms of classification accuracy. In the context of vision, a successful technique has been proposed by [47], which involves a “refinement” step at classification time, without relying on explicitly learning the distance metric. This method trains a local support vector machine on nearest neighbors of a given query, thus limiting the most expensive computations to a reduced subset of prototypes.

Another class of k-NN methods relies on weighting nearest neighbors votes based on their distances to the query sample [8]. Recently, [49] have proposed a similar weighting approach, where the nearest neighbors are weighted based on their vector difference to the query. Such a difference-weight assignment is defined as a constrained optimization problem of sample reconstruction from its neighborhood. The same authors have proposed a kernel-based non-linear version of this algorithm as well.

Finally, comparatively few works have proposed the use of boosting techniques for k-NN classification [1, 2, 12, 25, 33]. [1] use AdaBoost for learning a distance function to be used for k-NN search. [12] adopt the boosting approach in a non-conventional way: at each iteration a different k-NN classifier is trained over a modified input space. Namely, the authors propose two variants of the method, depending on the way the input space is modified. Their first algorithm is based on optimal subspace selection, that is, at each boosting iteration the most relevant subset of input data is computed. The second algorithm relies on modifying the input space by means of non-linear projections. But neither method is strictly an algorithm for inducing weak classifiers from the k-NN rule, and thus neither directly addresses the problem of boosting k-NN classifiers. Moreover, such approaches are computationally expensive, as they rely on a genetic algorithm and a neural network, respectively. [2, 25] map examples into a vector space by using the outputs of (Ada)boosted weak classifiers. It is not known whether these algorithms formally keep (or improve) the boosting properties known for AdaBoost [35]. More recently, [33] have built upon the works of [27, 28] (see also the survey of the approach in [9]) to provide a provable boosting algorithm for k-NN classifiers. Guaranteed convergence speed is obtained for AdaBoost’s famed exponential loss, under a weak index assumption which parallels the weak learning assumption of boosting algorithms, making the approach of [33] among the first to provide a provable boosting algorithm for k-NN [7].

We propose in this work a full-fledged solution to the problem of boosting k-NN classifiers in the general multi-class setting and for general classes of losses. Namely, we propose the first provable boosting algorithm, called UNN, which induces a leveraged nearest neighbor rule that generalizes the uniform k-NN rule, and whose convergence rate is guaranteed for a wide (i.e., infinite) set of losses, encompassing popular choices such as the logistic loss or the squared loss. The voting rule is redefined as a strong classifier that linearly combines weak classifiers of the k-NN rule (i.e., the examples). Therefore, our approach does not need to learn a distance function, as it operates directly on top of k-NN search. At the same time, it does not require an explicit computation of the feature space, thus preserving one of the main advantages of prototype-based methods. Our boosting algorithm is an iterative procedure which learns the weights for examples, called leveraging coefficients. Then, our class encoding allows us to generalize the guarantees on convergence rates to an infinite number of surrogate risks. The generalization is highly desirable, not only for experimental purposes related, for example, to no-free-lunch theorems [28]: our generalization encompasses many classification calibrated surrogates, functions exhibiting particularly convenient guarantees in the context of classification [4]. Finally, an important characteristic of UNN is that it is naturally able, through the leveraging mechanism, to discriminate the most relevant prototypes for a given class.

12.2 Method

12.2.1 Preliminary Definitions

In this work, we address the task of multi-class, single-label image categorization. Although the multi-label framework is quite well established in the literature [5], we only consider the case where each image is constrained to belong to one single category among a set of predefined categories. The number of categories (or classes) may range from a few to hundreds, depending on the application. For example, categorization with 67 indoor categories has recently been studied by [34]. We treat the multi-class problem as multiple binary classification problems, as is customary in machine learning. Hence, for each class c, a query image is classified either to c or to \(\bar{c}\) (the complement class of c, which contains all classes but c) with a certain confidence (classification score). Then the label with the maximum score is assigned to the query. Images are represented by descriptors related to given local or global features. We refer to an image descriptor as an observation \(\boldsymbol{x} \in \mathcal{X}\), which is a vector of n features and belongs to a domain \(\mathcal{X}\) (e.g., \({\mathbb{R}}^{n}\) or [0,1]^n). A label is associated with each image descriptor according to a predefined set of C classes. Hence, an observation with the corresponding label leads to an example, which is the ordered pair \((\boldsymbol{x},\boldsymbol{y}) \in {\mathcal{X}} \times \mathbb{R}^{C}\), where y is termed the class vector that specifies the class memberships of x. In particular, the sign of \(y_c\) gives the membership of example (x,y) to class c: \(y_c\) is negative if the observation does not belong to class c, and positive otherwise. At the same time, the absolute value of \(y_c\) may be interpreted as a relative confidence in the membership. Inspired by the multi-class boosting analysis of [48], we constrain class vectors to be symmetric, that is:

$$ \sum_{c=1}^C y_c=0. $$
(12.1)

Hence, in the single-label framework, the class vector of an observation x belonging to class \(\tilde{c}\) is defined as:

$$ y_{\tilde{c}}=1,\qquad y_{c\neq\tilde{c}}=-\frac{1}{C-1}. $$
(12.2)

This setting turns out to be necessary when treating multi-class classification as multiple binary classifications, as it balances negative and positive labels of a given example over all classes. In the following, we deal with an input set of m examples (or prototypes) \({\mathcal{S}} = \{(\boldsymbol{x}_{i}, \boldsymbol{y}_{i}), i = 1, 2,\ldots, m\}\), arising from annotated images, which form the training set.
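For concreteness, the symmetric encoding of (12.1)–(12.2) is easy to implement; below is a minimal sketch in Python, with illustrative names not taken from the chapter:

import numpy as np

def class_vector(true_class: int, C: int) -> np.ndarray:
    """Symmetric class vector of (12.2): +1 for the true class and
    -1/(C-1) elsewhere, so that the entries sum to zero as required by (12.1)."""
    y = np.full(C, -1.0 / (C - 1))
    y[true_class] = 1.0
    return y

# Example: 8 scene categories, an image labeled with category 3.
y = class_vector(true_class=3, C=8)
assert abs(y.sum()) < 1e-12  # symmetry constraint (12.1)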

12.2.2 Surrogate Risks Minimization

We aim at defining a one-versus-all classifier for each category, which is to be trained over the set of examples. This classifier is expected to correctly classify as many new observations as possible, that is, to predict their true labels. Therefore, we aim at determining a classification rule h from the training set which minimizes the classification error over all possible new observations. Since the underlying class probability densities are generally unknown and difficult to estimate, defining a classifier in the framework of supervised learning can be viewed as fitting a classification rule onto a training set \(\mathcal{S}\), with the hope of minimizing overfitting as well. In the most basic framework of supervised classification, one wishes to train a classifier on \({\mathcal{S}}\), that is, build a function \(\boldsymbol{h} : {\mathcal{X}} \rightarrow {\mathbb{R}}^{C}\) with the objective to minimize its empirical risk on \({\mathcal{S}}\), defined as:

$$ \varepsilon ^{0/1}(\boldsymbol{h}, {\mathcal{S}}) \stackrel {\mathrm {.}}{=}\frac{1}{mC} \sum _{c=1}^{C} \sum_{i=1}^{m}{ \bigl[\varrho (\boldsymbol{h},i,c) < 0 \bigr]} , $$
(12.3)

with [.] the indicator function (1 iff true, 0 otherwise), called here the 0/1 loss, and:

$$\begin{aligned} \varrho (\boldsymbol{h},i,c) \stackrel {\mathrm {.}}{=}& y_{ic} h_c( \boldsymbol{x}_i) \end{aligned}$$
(12.4)

the edge of classifier h on example \((\boldsymbol{x}_i,\boldsymbol{y}_i)\) for class c. Taking the sign of \(h_c\) in {−1,+1} as its membership prediction for class c, one sees that when the edge is positive (resp. negative), the membership predicted by the classifier and the actual example’s membership agree (resp. disagree). Therefore, (12.3) averages over all classes the number of mismatches for the membership predictions, thus measuring the goodness-of-fit of the classification rule on the training dataset. Provided the example dataset has good generalization properties with respect to the unknown distribution of possible observations, minimizing this empirical risk is expected to yield good accuracy when classifying unlabeled observations.
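As an illustration (a sketch in Python with illustrative names; the classifier h is assumed to return a length-C score vector), the edge (12.4) and the empirical risk (12.3) can be computed as follows:

import numpy as np

def empirical_risk(h, X, Y):
    """0/1 empirical risk of (12.3): average, over classes and examples,
    of the membership mismatches detected by a negative edge (12.4).
    X: (m, n) observations; Y: (m, C) class vectors; h(x): length-C scores."""
    m, C = Y.shape
    mismatches = 0
    for i in range(m):
        edges = Y[i] * h(X[i])           # edges rho(h, i, c), c = 1..C
        mismatches += np.sum(edges < 0)  # 0/1 loss per class
    return mismatches / (m * C)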

However, minimizing the empirical risk is computationally not tractable, as it involves non-convex optimization. In order to bypass this cumbersome optimization challenge, the current trend in supervised learning (including boosting and support vector machines) replaces the minimization of the empirical risk (12.3) by that of a so-called surrogate risk [4], which makes the optimization problem amenable to efficient methods. In boosting, it amounts to summing (or averaging), over classes and examples, a real-valued function called the surrogate loss, thus ending up with the following counterpart of (12.3):

$$ \varepsilon ^{\psi }(\boldsymbol{h}, {\mathcal{S}}) \stackrel {\mathrm {.}}{=}\frac{1}{mC} \sum _{c=1}^{C} \sum_{i=1}^{m}{ \psi\bigl(\varrho(\boldsymbol{h},i,c)\bigr)} . $$
(12.5)

Relevant choices available for ψ include:

$$\begin{aligned} \psi^{\mathrm{sqr}}(x) \stackrel {\mathrm {.}}{=}& (1-x)^2, \end{aligned}$$
(12.6)
$$\begin{aligned} \psi^{\mathrm{exp}}(x) \stackrel {\mathrm {.}}{=}& \exp(-x), \end{aligned}$$
(12.7)
$$\begin{aligned} \psi^{\mathrm{log}}(x) \stackrel {\mathrm {.}}{=}& \log\bigl(1 + \exp(-x)\bigr); \end{aligned}$$
(12.8)

(12.6) is the squared loss [4], (12.7) is the exponential loss [35], and (12.8) is the logistic loss [4]. Such surrogates play a fundamental role in supervised learning. They are upper bounds of the empirical risk with desirable convexity properties. Their minimization remarkably impacts that of the empirical risk, thus enabling minimization algorithms with good generalization properties [28].
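Written as plain functions of the edge, the three surrogates (12.6)–(12.8) and the surrogate risk (12.5) take the following form (a sketch; NumPy assumed, names illustrative):

import numpy as np

def psi_sqr(x):   # squared loss (12.6)
    return (1.0 - x) ** 2

def psi_exp(x):   # exponential loss (12.7)
    return np.exp(-x)

def psi_log(x):   # logistic loss (12.8)
    return np.log1p(np.exp(-x))

def surrogate_risk(psi, h, X, Y):
    """Surrogate risk (12.5): average of psi over all edges rho(h, i, c)."""
    edges = np.array([Y[i] * h(X[i]) for i in range(Y.shape[0])])  # (m, C)
    return psi(edges).mean()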

In the following, we move from recent advances in boosting with surrogate risks to redefine the k-NN classification rule. Our algorithm, UNN (Universal Nearest Neighbors), is first proposed for the exponential surrogate. We describe in the appendix the general formulation of the algorithm, not restricted to this surrogate. We show that UNN converges to the optimum of many surrogates with guaranteed convergence rates under mild assumptions, and more generally converges to the global optimum of the surrogate risk for an even wider set of surrogates.

12.2.3 Leveraging the k-NN Rule

We denote by \(\mathrm{NN}_k(\boldsymbol{x})\) the set of the k nearest neighbors (with integer constant k>0) of an example (x,y) in set \(\mathcal{S}\) with respect to a non-negative real-valued “distance” function. This function is defined on domain \(\mathcal{X}\) and measures how much two observations differ from each other. This dissimilarity function thus need not satisfy the triangle inequality of metrics. For the sake of readability, we write \(j \sim_k \boldsymbol{x}\) to denote that example \((\boldsymbol{x}_j,\boldsymbol{y}_j)\) belongs to \(\mathrm{NN}_k(\boldsymbol{x})\). This neighborhood relationship is intrinsically asymmetric, that is, \(\boldsymbol{x}_i \in \mathrm{NN}_k(\boldsymbol{x})\) does not necessarily imply that \(\boldsymbol{x} \in \mathrm{NN}_k(\boldsymbol{x}_i)\). Indeed, a nearest neighbor of x does not necessarily contain x among its own nearest neighbors.

The k-nearest neighbors rule (k-NN) is the following multi-class classifier h={h c : c=1,2,…,C} (k appears in the summation indices):

$$ h_c(\boldsymbol{\boldsymbol{x}}) = \sum_{j\sim_k \boldsymbol{x}} [ y_{jc}>0 ], $$
(12.9)

where h c is the one-versus-all classifier for class c and square brackets denote the indicator function. Hence, the classic nearest neighbors classification is based on majority vote among the k closest prototypes.

We propose to weight the votes of nearest neighbors by means of real coefficients, thus generalizing (12.9) to the following leveraged k-NN rule \(\boldsymbol{h}^{\ell}=\{h^{\ell}_{c} :\, c=1,2,\ldots,C \}\):

$$ h_c^{\ell}(\boldsymbol{\boldsymbol{x}}) = \sum _{j\sim_k \boldsymbol{x}} \alpha_{jc}y_{jc}, $$
(12.10)

where \(\alpha_{jc} \in {\mathbb{R}}\) is the leveraging coefficient for example j in class c, with j=1,2,…,m and c=1,2,…,C. Hence, (12.10) linearly combines class labels of the k nearest neighbors (defined in Sect. 12.2.1) with their leveraging coefficients.
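A minimal sketch of the leveraged rule (12.10), assuming a plain Euclidean k-NN search and illustrative array names (alpha and Y of shape (m, C)):

import numpy as np

def leveraged_knn_score(x, X_train, Y, alpha, k):
    """Leveraged k-NN rule (12.10): sum of alpha_jc * y_jc over the k
    nearest prototypes of x (Euclidean distance used here for simplicity)."""
    d = np.linalg.norm(X_train - x, axis=1)  # distances to all prototypes
    nn = np.argsort(d)[:k]                   # indices j such that j ~_k x
    return (alpha[nn] * Y[nn]).sum(axis=0)   # one score per class c

# Single-label prediction: the class with the maximum score, e.g.
# label = np.argmax(leveraged_knn_score(x, X_train, Y, alpha, k=10))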

Our work is focused on formal boosting algorithms working on top of the k-NN methods. These algorithms do not affect the nearest neighbor search when inducing weak classifiers of (12.10). They are thus independent of the way nearest neighbors are computed, unlike most of the approaches mentioned in Sect. 12.1.2, which rely on modifying the neighborhood relationship via metric distance deformations or kernel transformations. This makes our approach fully compatible with any underlying (metric) distance and data structure for k-NN search, as well as possible kernel transformations of the input space.

For a given training set \(\mathcal{S}\) of m labeled examples, we define the k-NN edge matrix for each class c=1,2,…,C:

$$\begin{aligned} \mathrm {r}^{(c)}_{ij} \stackrel {\mathrm {.}}{=}& \left \{ \begin{array}{l@{\quad }l} y_{ic} y_{jc} & \mbox{if}\ j\sim_k i\\ 0 & \mbox{otherwise}. \end{array} \right . \end{aligned}$$
(12.11)

The name of r^(c) is justified by an immediate parallel with (12.4). Indeed, each example j serves as a classifier for each example i, predicting 0 if \(j \notin \mathrm{NN}_k(\boldsymbol{x}_i)\), and \(y_{jc}\) otherwise, for the membership to class c. Hence, the jth column of matrix r^(c), \(\boldsymbol{r}^{(c)}_{j}\), which is different from the null vector when choosing k>0, collects all edges of “classifier” j for class c. Note that nonzero entries of this column correspond to the so-called reciprocal nearest neighbors of j, that is, those examples for which j is a neighbor (Fig. 12.1). Finally, the edge of the leveraged k-NN rule on example i for class c reads:

$$ \varrho \bigl(\boldsymbol{h}^{\ell},i,c\bigr) = y_{ic}\, h^{\ell}_c(\boldsymbol{x}_i) = \sum_{j=1}^{m} \mathrm {r}^{(c)}_{ij}\, \alpha^{(c)}_{j} , $$
(12.12)

where α^(c) collects all leveraging coefficients in vector form for class c: \(\alpha^{(c)}_{i} \stackrel {\mathrm {.}}{=}\alpha_{ic}\), i=1,2,…,m. Thus, the induction of the leveraged k-NN classifier \(\boldsymbol{h}^{\ell}\) amounts to fitting all α^(c)’s so as to minimize (12.5), after replacing the argument of ψ(⋅) in (12.5) by (12.12).
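The per-class edge matrix (12.11) can be built directly from the neighbor lists; a sketch follows (illustrative names; a brute-force Euclidean search stands in for whatever k-NN structure is actually used):

import numpy as np

def edge_matrix(X, Y, c, k):
    """k-NN edge matrix r^(c) of (12.11): entry (i, j) equals y_ic * y_jc
    if j is among the k nearest neighbors of i, and 0 otherwise."""
    m = X.shape[0]
    r = np.zeros((m, m))
    for i in range(m):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                 # an example is not its own neighbor
        nn = np.argsort(d)[:k]        # indices j such that j ~_k i
        r[i, nn] = Y[i, c] * Y[nn, c]
    return r

# Nonzero entries of column j mark the reciprocal nearest neighbors of j,
# i.e., the examples for which j votes (Fig. 12.1).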

Fig. 12.1

Schematic illustration of the direct (left) and reciprocal (right) k-nearest neighbors (k=1) of an example x i (green diamond). Red squares and blue circles represent examples of positive and negative classes. Each arrow connects an example to its k-nearest neighbors

12.2.4 UNN Boosting Algorithm

We explain our classification algorithm specialized for the exponential loss minimization in the multi-class one-versus-all framework, with pseudo-code shown in Algorithm 1. Like common boosting algorithms, UNN operates on a set of weights w i (i=1,2,…,m) defined over training data. Such weights are repeatedly updated to fit all leveraging coefficients α (c) for class c (c=1,2,…,C). At each iteration, the index to leverage, j∈{1,2,…,m}, is obtained by a call to a weak index chooser oracle Wic(.,.,.), whose implementation is detailed later in this section.

Algorithm 1

Universal Nearest Neighbors UNN(\(\mathcal{S}\)) for ψ=ψ exp

Figure 12.2 presents a block diagram of the UNN algorithm. In particular, notice how the initialization step, relying on k-NN and edge matrix computation, is clearly distinguished from the iterative procedure, where a new prototype is added at each iteration t, thus updating both the strong classifier h(x) and the weights \(w_i\).

Fig. 12.2

Block diagram of the UNN learning scheme

The training phase is implemented in a one-versus-all fashion, that is, C learning problems are solved independently, and for each class c the training examples are considered as belonging to either class c or the complement class \(\bar{c}\), that is, any other class. Eventually, one leveraging coefficient (α jc ) per class is learned for each weak classifier (indexed by j).

The key observation when training weak classifiers with UNN is that, at each iteration, one single example (indexed by j) is considered as a prototype to be leveraged. Indeed, all the other training data are to be viewed as observations for which j may possibly vote. In particular, due to k-NN voting, j can be a classifier only for its reciprocal nearest neighbors (i.e., those data for which j itself is a neighbor, corresponding to nonzero entries of matrix (12.11) on column j). This leads to a remarkable simplification when computing δ j in step [I.2] and updating the weights \(w_i\) in step [I.3] (Eqs. (12.14), (12.15)). Indeed, only the weights of the reciprocal nearest neighbors of j are involved in these computations, thus allowing us not to store the entire matrix r^(c), c=1,2,…,C. Note that the set of reciprocal neighbors is split into two subsets, containing the examples that agree (resp. disagree) with the class membership of j, thus yielding the partial sums \(w_{j}^{+}\) and \(w_{j}^{-}\) of (12.13).

Note that when either \(w^{+}_{j}\) or \(w^{-}_{j}\) is zero, δ j in (12.14) is not finite. There is however a simple alternative, inspired by [35], which consists in smoothing out δ j when necessary, thus guaranteeing its finiteness without impairing convergence. More precisely, we suggest the replacements:

$$\begin{aligned} w^{+}_j \leftarrow & w^{+}_j + \frac{1}{m}, \end{aligned}$$
(12.16)
$$\begin{aligned} w^{-}_j \leftarrow & w^{-}_j + \frac{1}{m}. \end{aligned}$$
(12.17)

Also note that step [I.1] relies on oracle Wic(.,.,.) for selecting index j of the next weak classifier. We propose two alternative implementations of this oracle, as follows:

  (a)

    a lazy approach: we pick T=m and let Wic({1,2,…,m},t,c) return j=t, that is, each example is leveraged exactly once, in a fixed order;

  (b)

    the boosting approach: we pick T≤m, and let j be chosen by Wic({1,2,…,m},t,c) such that δ j is large enough. Each j can be chosen more than once.

There are also schemes mixing (a) and (b): for example, we may pick T=m, choose j as in (b), but exactly once as in (a).
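Since Algorithm 1 is only given as a figure here, the following sketch reconstructs one plausible training pass of UNN for ψ=ψ exp on a single class c. It assumes an AdaBoost-like closed form δ j = (1/2) log(w_j^+ / w_j^-) in place of (12.14) and a multiplicative update w_i ← w_i exp(−δ j r_ij^(c)) in place of (12.15); the exact constants of the chapter’s Eqs. (12.13)–(12.15) may differ.

import numpy as np

def unn_exp_train(r, T, smooth=True):
    """Sketch of one-vs-all UNN training for the exponential loss.
    r: (m, m) edge matrix r^(c) of (12.11). Returns leveraging coefficients."""
    m = r.shape[0]
    w = np.ones(m) / m                    # weights over training examples
    alpha = np.zeros(m)                   # leveraging coefficients alpha_jc
    for t in range(T):
        # [I.1] weak index chooser: here, a "boosting" oracle picking the
        # column with the largest absolute weighted edge.
        j = np.argmax(np.abs(r.T @ w))
        w_plus = w[r[:, j] > 0].sum()     # reciprocal neighbors agreeing with j
        w_minus = w[r[:, j] < 0].sum()    # reciprocal neighbors disagreeing with j
        if smooth:                        # smoothing of (12.16)-(12.17)
            w_plus += 1.0 / m
            w_minus += 1.0 / m
        delta = 0.5 * np.log(w_plus / w_minus)   # [I.2] assumed closed form
        alpha[j] += delta
        rec = r[:, j] != 0                # only reciprocal neighbors of j
        w[rec] *= np.exp(-delta * r[rec, j])     # [I.3] weight update
        w /= w.sum()                      # normalization to unity (Sect. 12.3)
    return alpha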

12.2.5 UNN Convergence

The main properties of UNN are summarized by the following three fundamental theorems. The first theorem ensures general monotonic convergence to the global optimum of the surrogate risk, for any surrogate function meeting mild conditions. The second theorem refines this general convergence result by providing an effective convergence bound for the exponential loss. The third theorem, stated at the end of this section, extends the guaranteed convergence rates to a broad family of surrogates.

Suppose that ψ meets the following conditions:

  (i)

    \(\mathrm{im}(\psi) = {\mathbb{R}}_{+}\);

  (ii)

    \(\nabla_{\psi}(0) < 0\) (\(\nabla_{\psi}\) is the conventional derivative);

  (iii)

    ψ is strictly convex and differentiable.

Theorem 12.1

As the number of iteration steps T increases, UNN converges to h realizing the global minimum of the surrogate risk at hand (12.5), for any ψ meeting conditions (i), (ii) and (iii) above.

Proof

A proof sketch is given in the Appendix. □

Then, in order to obtain the specific convergence rate for ψ exp, suppose the following weak index assumption (WIA) holds. (See Eq. (12.13) in Algorithm 1 for the definition of \(w^{(c)+}_{j}\) and \(w^{(c)-}_{j}\).)

(WIA):

There exist some γ>0 and η>0 such that the following two inequalities hold for index j returned by Wic(.,.,.):

$$\begin{aligned} \biggl \vert \frac{w^{(c)+}_j}{w^{(c)+}_j + w^{(c)-}_j} - \frac{1}{2} \biggr \vert \geq & \gamma , \end{aligned}$$
(12.18)
$$\begin{aligned} \frac{w^{(c)+}_j + w^{(c)-}_j}{\|\boldsymbol{w}\|_1} \geq & \eta . \end{aligned}$$
(12.19)

Theorem 12.2

If the (WIA) holds for νT steps in UNN (for each c), then \(\varepsilon ^{0/1}(\boldsymbol{h}^{\ell}, {\mathcal{S}}) \leq \exp(-\varOmega(\eta \gamma^{2} \nu))\).

Proof

A proof sketch is given in the Appendix. □

Theorems 12.1 and 12.2 show that UNN converges (exponentially fast) to the global optimum of the surrogate risk on the training set. Most of the recent works that can be associated to boosting algorithms, or more generally to the minimization of some surrogate risk using whichever kind of procedure, have explored the universal consistency of the surrogate minimization problems (see [4, 26, 45], and references therein). The problem can be roughly stated as whether minimizing the surrogate risk guarantees, in probability, that the classifier built converges to the Bayes rule as m→∞. This question obviously becomes relevant to UNN given our results. Among the results contained in this rich literature, the one whose consequences directly impact the universal consistency of UNN is Theorem 3 of [4]. We can indeed easily show that all our choices of surrogate loss are classification calibrated, so that minimizing the surrogate risk in the limit (m→∞) implies minimizing the true risk, and implies uniform consistency as well. Moreover, this result, proven for C=2, holds as well for arbitrary C≥2 in the single-label prediction problem. [3] proved an additional result for AdaBoost [35]: if the algorithm is run for a number T ≤ m^η of boosting rounds, for η∈(0,1), then there is indeed minimization in the limit of the exponential risk, and so AdaBoost is universally consistent. From our theorems above, this implies the consistency of UNN, and it even proves that the filtering procedure described in the experiments is consistent as well, since [3]’s bound implies that we leverage a proportion of 1/m^{1−η} of the examples, “filtering out” the remaining ones.

Moreover, the results of [26] are also interesting in our setting, even though they are typically aimed at boosting algorithms with weak learners like decision-tree learning algorithms, which define quantizations of the observations (each decision tree defines a new description variable for the examples). They show that there exist conditions on the quantizers that yield conditions on the surrogate loss function for universal consistency. It is interesting to notice that the universal consistency of UNN does not need such assumptions, as its weak learners are examples that do not quantize the observations’ domain. Finally, the work of [45] explores the consistency of surrogate risk minimization in the case where rejects are allowed by classifiers, that is, a classifier may refuse to classify an observation at a cost smaller than that of misclassifying it. While this setting is not relevant to UNN in the general case, it becomes relevant as we filter out examples (see the experiments), which boils down to stating that they systematically reject on observations.

On the one hand, [45] show that filtering out examples does not impair the universal consistency of UNN, as long as filter thresholds are locally based. On the other hand, they also provide a way to quantify the actual loss \(\ell_{r,j}\) caused by filtering out example j, which we recall lies between 0 (the loss of a good classification) and 1 (the loss of a bad classification). For example, choosing the exponential loss and using Theorem 1 in [45] reveals that the reject loss is:

$$\begin{aligned} \ell_{r,j} = & \frac{\min\{w^+_j, w^-_j\}}{w^+_j + w^-_j} . \end{aligned}$$

Let us now further complete the picture of boosting algorithms for k-NN by showing that, under a mild additional assumption on ψ, we obtain a guaranteed convergence rate for UNN. Of particular interest is the assumption under which we are able to prove this result. Following [27, 28], we make a “Weak Edge Assumption”:

(WEA):

There exists some ϑ>0 such that the following inequality holds for index j returned by Wic(.,.,.):

$$\begin{aligned} \biggl \vert \sum_{i: j \sim_k i} {\mathrm{r}^{(c)}_{ij}w_{i}} \biggr \vert \geq & \vartheta . \end{aligned}$$
(12.20)

This assumption states that the average value (in absolute value) of \(y_{ic} y_{jc}\) over the reciprocal neighborhood of example j cannot be smaller than some constant ϑ. It is weak for the following reason. If the classes in the reciprocal neighborhood were picked at random, the quantity inside the absolute value in (12.20) would be zero on average, because of the way we model classes in (12.1). So, we are assuming that, regardless of the weights, we can always pick an example \((\boldsymbol{x}_j,\boldsymbol{y}_j)\) “beating” random by a potentially small advantage ϑ. Note that (WEA) is weaker than (WIA) in the sense that we do not make any coverage assumption like (12.19).

Let us now turn to the assumption on ψ:

  (iv)

    ψ is locally ω-strongly smooth, for some ω>0:

    $$\begin{aligned} D_\psi\bigl(x'\|x\bigr) \leq & \frac{\omega}{2} \bigl(x'-x\bigr)^2 , \end{aligned}$$
    (12.21)

where x,x′ range through the values ϱ(h,i,c) over which UNN is run, and

$$\begin{aligned} D_\psi\bigl(x' \| x\bigr) \stackrel {\mathrm {.}}{=}& \psi \bigl(x'\bigr) - \psi(x) - \bigl(x'-x\bigr) \nabla_\psi(x) \end{aligned}$$
(12.22)

is the Bregman divergence with generator ψ. There is an important duality between strong smoothness and strong convexity, with applications in machine learning and optimization [22]. The proof of the following theorem, in the Appendix, is another example of its applicability in these fields.

Theorem 12.3

If the (WEA) holds and ψ meets assumptions (i)–(iv), then for any user-fixed τ∈[0,1], UNN has fit a leveraged k-NN classifier with empirical risk no greater than τ provided the number of boosting iterations T satisfies:

$$\begin{aligned} T \geq & \frac{2 (1-\tau) \psi(0) \omega k m}{\vartheta^2 (C-1)} = \varOmega \biggl(\frac{\omega km}{\vartheta^2} \biggr). \end{aligned}$$
(12.23)

Theorem 12.3 does not obliterate the (better) convergence results for the exponential loss of Theorem 12.2, yet it extends the guarantees of convergence under weak assumptions to some of the most interesting surrogates in classification. These include permissible convex surrogates (PCS, [27]), a set containing as special cases the squared and logistic surrogates in (12.6), (12.8). Informally, any loss which meets regularity conditions and common requirements about losses, such as lower-boundedness, symmetry and the proper scoring property, can be represented by a PCS [27]. The exponential surrogate in (12.7) is not a PCS, yet it is a first-order approximation to the logistic surrogate. Up to translating and scaling by constants, any PCS meets \(\mathrm{im}(\nabla_{\psi}) \subseteq [-1,0]\) [27]. Reasoning on the second derivative of ψ, we see that there is not much room to violate (12.21), thus making many PCS ω-strongly smooth for small values of ω. Simple calculations yield that we can take, for example, ω=1/4 for the logistic loss (12.8) and ω=2 for the squared loss (12.6), making the bound in (12.23) more favorable to the former. As a last example, consider the following parameterized choice for ψ, with μ∈(0,1):

$$\begin{aligned} \psi^{\mathrm{mat}}_\mu(x) \stackrel {\mathrm {.}}{=}& \frac{1}{1-\mu} \bigl(-x + \sqrt{(1-\mu)^2+x^2} \,\bigr); \end{aligned}$$

this choice, which gives rise to Matsushita’s loss for μ=0, has important convexity properties [27]. In this case, we easily obtain that we can pick ω=1/(1−μ).
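As a quick check of the constants quoted above for the logistic and squared losses (a sketch: for a twice-differentiable ψ, the strong-smoothness condition (12.21) holds with any ω bounding the second derivative over the relevant range):

$$\begin{aligned} \nabla^2_{\psi^{\mathrm{log}}}(x) =& \frac{e^{-x}}{(1+e^{-x})^{2}} \leq \frac{1}{4} \quad\Rightarrow\quad \omega = 1/4 , \\ \nabla^2_{\psi^{\mathrm{sqr}}}(x) =& 2 \quad\Rightarrow\quad \omega = 2 . \end{aligned}$$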

12.3 Experiments

In this section, we present experimental results of UNN for image categorization. In order to reduce numerical problems on the large databases on which we test UNN, we normalize the weights to unity after the update in (12.15). Our experiments aim at carefully quantifying and explaining the gains brought by boosting on k-NN voting on real image databases. In particular, we propose in Sects. 12.3.1 and 12.3.2 an analysis and comparison of UNN vs k-NN for Gist and Bag-of-Features descriptors on two broadly used datasets of natural images. In Sect. 12.3.3, we drill down into precision and execution-time comparisons between UNN and k-NN, SVM and AdaBoost. We also introduce in this section a soft version of UNN which, when classifying new observations, combines the leveraged weighting with a simple density estimation suggested by boosting.

12.3.1 Image Categorization Using Global Gist Descriptors

We tested UNN on global descriptors for the categorization of natural images. In particular, we used the database of natural scenes collected by [30], which has been successfully used to validate several classification techniques relying on Gist image descriptors. A Gist descriptor provides a global representation of a scene directly, requiring neither an explicit segmentation of image regions and objects nor an intermediate representation by means of local features. In the standard setting, an image is first resized to a square, then represented by a single vector of d components (typically d=512 or d=320), which collects features related to the spatial organization of dominant scales and orientations in the image. The one-to-one mapping between images and Gist descriptors is one of the main advantages of using such a global representation instead of local descriptors. In particular, the ability to map any instance to a single point in the feature space is crucial for the effectiveness of k-NN methods, where computing the one-to-one similarity between testing and training instances is explicitly required at classification time. Conversely, representing an image with a set of multiple local descriptors is not directly adapted to such discriminative classification techniques, thus generally requiring an intermediate (usually unsupervised) learning step in order to extract a compact single-vector descriptor from the set of local descriptors [14]. For example, this is the case for Bag-of-Features methods, which we discuss in Sect. 12.3.2 along with an experimental comparison to our method. Finally, although Gist is not an alternative image representation method with respect to local descriptors, it has proven very successful in representing relevant contextual information of natural scenes, thus allowing, for instance, to compute meaningful priors for exploration tasks, like object detection and localization [40].

In the following, we denote as 8-cat the database of [30], which contains 2,688 color images of outdoor scenes of size 256×256 pixels, divided into 8 categories: coast, mountain, forest, open country, street, inside city, tall buildings and highways. One example image of each category is shown in Fig. 12.3. In addition, we carried out categorization experiments on a larger database of 13 categories as well, denoted as 13-cat. This dataset was first proposed by [10] and contains five more categories, as shown in Fig. 12.4. We extracted Gist descriptors from these images with the most common settings: 4 resolution levels of the Gabor pyramid, 8 orientations per scale and 4×4 blocks.

Fig. 12.3

Examples of annotated images from the 8 categories database of [30]

Fig. 12.4

Examples of the five additional categories included in the 13 categories database of [10]

We evaluated classification performances when filtering the prototype dataset, that is, retaining a proportion θ of the most relevant examples as prototypes for classification.

In Figs. 12.5 and 12.6, we show classification performances in terms of the mean Average Precision (MAP) as a function of θ. We randomly chose half of the images to form a training set, while testing on the remaining ones. In each UNN experiment, we fixed the value of θ=T/m, thus constraining the number of training iterations T such that at most T examples could be retained as prototypes.

Fig. 12.5

Gist image classification performances of UNN compared to k-NN on the 8-cat database (see text for details)

Fig. 12.6

Gist image classification performances of UNN compared to k-NN on the 13-cat database (see text for details)

We compared UNN with the classic k-NN classification. Namely, in order for the classification cost of k-NN to be roughly the same as that of UNN, we carried out random sampling of the prototype dataset for selecting proportion θ (between 10 % and the whole set of examples). UNN significantly outperforms classic k-NN. Take for example θ=0.5 in Fig. 12.5: UNN not only outperforms k-NN with θ=0.5, its MAP also exceeds that of k-NN with all data (θ=1) by almost 2 %. Moreover, on the 13-cat database, UNN outperforms the technique proposed by [10] by 3 % (the asterisk in Fig. 12.6, which corresponds to the best result reported in their paper).

12.3.2 Image Categorization Using Bags-of-Features

We now describe experiments with UNN on the Bag-of-Features (BoF) image classification approach. This technique is based on extracting a “bag” of local descriptors (e.g., SIFT descriptors) from an image and vector quantizing them on a precomputed vocabulary of so-called “visual words” [38]. An image is then represented by the histogram of visual word frequencies. This approach provides an effective tool for image categorization, as it relies on one single compact descriptor per image, while keeping the informative power of local features. We compare UNN and k-NN on the 8-cat database (see Sect. 12.3.1).

We used the VLFeat toolbox [41] for extracting gray-scale dense SIFT descriptors at four resolution levels. In particular, a regular grid with a spacing of 10 pixels was defined over the image, and at each grid point SIFT descriptors were computed over circular support patches with radii of 4, 8, 12 and 16 pixels. As a result, each point was represented by four different SIFT descriptors. Therefore, given the image size of 256×256, we obtained about 2,500 SIFT descriptors per image. We then split the database into two distinct subsets of images, half for training and half for testing (i.e., 1,344 images in each dataset). In order to build the dictionary of visual words, we applied k-means clustering to 600,000 SIFT descriptors extracted from training images. For this purpose, we first selected a random subset of training images (about 30 images per class), then collected all SIFT descriptors of these images and ran k-means. In all the experiments, we computed dictionaries of 500 visual words.
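A schematic sketch of the Bag-of-Features encoding described above follows; it is not the VLFeat-based pipeline actually used in the experiments (scikit-learn’s KMeans is assumed here, and local descriptor extraction is left abstract):

import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(pooled_descriptors, n_words=500, seed=0):
    """Cluster a pool of local (e.g., SIFT) descriptors into visual words."""
    km = KMeans(n_clusters=n_words, random_state=seed, n_init=10)
    km.fit(pooled_descriptors)           # e.g., (600000, 128) stacked SIFTs
    return km

def bof_histogram(image_descriptors, km, norm="l1"):
    """Represent one image as a histogram of visual-word occurrences."""
    words = km.predict(image_descriptors)            # quantize each descriptor
    h = np.bincount(words, minlength=km.n_clusters).astype(float)
    if norm == "l1":
        h /= h.sum()                     # L1 normalization
    elif norm == "l2":
        h /= np.linalg.norm(h)           # L2 normalization
    return h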

The results obtained with the three different settings are depicted in Fig. 12.7. Notice that UNN using the Histogram Intersection matching outperforms all the compared curves. We also note an improvement (up to a 5 % gap for k-NN and 7 % for UNN) when using L1-normalized Bag-of-Features descriptors with Histogram Intersection matching compared to the Euclidean distance. This similarity measure was first proposed by [39] for image indexing based on color histograms and, more recently, has been successfully used by [23] in the context of Bag-of-Features image categorization.
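For reference, the Histogram Intersection similarity between two L1-normalized BoF histograms is simply the sum of bin-wise minima (a sketch); for k-NN search, neighbors are then ranked by decreasing similarity rather than increasing distance:

import numpy as np

def histogram_intersection(h1, h2):
    """Similarity of two L1-normalized histograms: 1 for identical histograms,
    decreasing toward 0 as they diverge."""
    return np.minimum(h1, h2).sum()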

Fig. 12.7

Overall results of BoF classification with UNN compared to k-NN for different settings of histogram normalization (either L1- or L2-norm) and nearest neighbor matching (either Euclidean distance or Histogram Intersection)

12.3.3 Comparison with SVM and AdaBoost on Image Categorization

Two major issues arise when implementing our UNN algorithm in practice. The first one concerns the distance (or, more generally, the dissimilarity) measure used for the k-NN search. The second one consists in setting the value of k for both training and testing our prototype-based classifiers.

On the one hand, defining the most appropriate dissimilarity measure for k-NN search is particularly challenging when dealing with very high-dimensional feature vectors like the image descriptors commonly used for categorization. Indeed, classic metric distances may be inadequate when such vectors are generated by sophisticated pre-processing stages (e.g., vector quantization or unsupervised dictionary learning), thus lying on complex high-dimensional manifolds. In general, this would require an additional distance learning stage in order to define the optimal dissimilarity measure for the particular type of data at hand. In this respect, our UNN method has the advantage of being fully complementary with any metric learning algorithm, acting on top of the k-NN search. In Sect. 12.3.2, we have described some examples of using different distances for k-NN search, particularly focusing on the most suitable dissimilarity measure for histogram-based descriptors.

On the other hand, selecting a good value for k amounts to learning parameter-dependent weak classifiers, where the parameter k specifies the size of the voting neighborhood in classification rule (12.10). From the theoretical standpoint, a brute-force approach is possible with boosting: one can define multiple candidate weak classifiers per example, one for each value of k, that is, for each neighborhood size, and then learn prototypes by optimizing the surrogate risk function over k as well. This strategy has the advantage of enabling direct learning of k at training time. However, training several weak classifiers per example without computation tricks would potentially severely impair the applicability of the algorithm on huge datasets. The solution we propose is subtler, as it relies on weighting the neighbors, exploiting the trick that boosting locally fits particular maximum likelihood estimators of class memberships [27]. Using (12.14), we can indeed rewrite (12.10) as:

$$ h_c^{\ell}(\boldsymbol{\boldsymbol{x}}) \approx \log \prod _{j\sim_k \boldsymbol{x}, y_{jc > 0}} \frac{\hat{p}(c|j)}{\hat{p}(\bar{c}|j)} - \log \prod _{j\sim_k \boldsymbol{x}, y_{jc < 0}} \frac{\hat{p}(c|j)}{\hat{p}(\bar{c}|j)}, $$
(12.24)

where \(\hat{p}(c|j)\) (resp. \(\hat{p}(\bar{c}|j)\)) models the conditional probability of belonging (resp. not belonging) to class c. To make the right-hand side of (12.24) closer to a full-fledged maximum likelihood, we have to integrate the density estimators for the nearest neighbors, \(\hat{p}(j)\). We could obviously assume that they are all equal: this would multiply the right-hand side of (12.24) by a positive constant factor, and would not change the outcome of (12.10). Instead, we have modified the classification phase of UNN, and tried a soft solution which considers a logistic estimator for a Bernoulli prior that vanishes with the rank of the example among the neighbors, thus decreasing the importance of the farthest neighbors:

$$\begin{aligned} \hat{p}(j) = \beta_{j} = \frac{1}{1+\exp(\lambda(j-1))}, \end{aligned}$$
(12.25)

with λ>0. The prior is given this shape because it was shown that boosting, as carried out in a number of algorithms, not restricted to the induction of linear separators [27], locally fits logistic estimators for Bernoulli priors. The soft version of UNN we obtain, called UNN s (for “Soft UNN”), replaces (12.10) by:

$$ h_c^{\ell}(\boldsymbol{\boldsymbol{x}}) = \sum _{j\sim_k \boldsymbol{x}} \beta_{j}\alpha_{jc}y_{jc} . $$
(12.26)

Notice that normalizing the coefficients β j in (12.25) is unnecessary, as it would not change the classification of UNN s . Notice also that the β j ’s in (12.26) are used only when classifying new observations: the training steps of UNN s are the same as those of UNN, and so UNN s enjoys the same theoretical properties as UNN, as stated in Theorems 12.1, 12.2 and 12.3.
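A sketch of the UNN s classification rule (12.25)–(12.26), reusing the leveraged scores above (lam and the array names are illustrative):

import numpy as np

def soft_unn_score(x, X_train, Y, alpha, k, lam=0.5):
    """Soft UNN rule (12.26): leveraged votes weighted by the rank-based
    logistic prior beta_j of (12.25), which downweights farther neighbors."""
    d = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(d)[:k]                          # neighbors sorted by rank
    ranks = np.arange(1, k + 1)                     # rank 1 is the closest
    beta = 1.0 / (1.0 + np.exp(lam * (ranks - 1)))  # prior (12.25)
    return (beta[:, None] * alpha[nn] * Y[nn]).sum(axis=0)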

We selected 100 categories from the SUN database [43]. We kept all the images of each category and the inherent class imbalance of the original database. We randomly chose half of the images to form a training set, while testing on the remaining ones. The MAP was computed by averaging classification rates over categories (diagonal of the confusion matrix) and then averaging those values after repeating each experiment 10 times on different folds. To speed up processing time, we used the fast implementation of k-NN proposed by [21]. Furthermore, we also developed an optimized version of our program, which exploits multi-thread functionalities. We denote this version as UNN s (MT). All the experiments were run on an Intel Xeon X5690 processor (12 cores) at 3.46 GHz.

We compared UNN s , SVM with a Gaussian RBF kernel, and AdaBoost with decision stumps (i.e., decision trees with a single internal node), using BoF descriptors. In particular, we followed the guidelines of [20] for the SVM experiments, carrying out cross-validation to select the best parameter values for SVM. For the sake of completeness, we also provide results for Gist descriptors with UNN s and k-NN.

In Table 12.1, we report the MAP for each classification method. Results are provided as a function of the number of image categories. The most relevant results are also displayed in Fig. 12.8 (MAP as a function of the number of categories) and in Figs. 12.9 and 12.10 for the training and classification times, respectively.

Fig. 12.8

Classification performances of the tested methods as a function of the number of image categories

Fig. 12.9

Training time as a function of the number of image categories

Fig. 12.10

Classification time for UNN s vs SVM as a function of the number of image categories with BoF

Table 12.1 Classification performances of the different methods we tested in terms of the Mean Average Precision (MAP) as a function of the number of categories

We can first notice that BoF descriptors generally outperform Gist, even though this phenomenon is dampened as the number of categories increases (above 30). This, overall, follows the trend generally reported in the literature.

The MAP results show that UNN s dramatically outperforms AdaBoost (and k-NN as well); this result, which experimentally confirms that UNN successfully exploits the boosting theory, was quite predictable, as UNN builds a piecewise linear decision function in the initial domain \({\mathcal{O}}\), while AdaBoost builds a linear separator in this domain. SVMs, on the other hand, have access to non-linear fitting of the data, by lifting the data to a domain whose dimension far exceeds that of \({\mathcal{O}}\). Yet, SVM’s testing results are not as good as one might expect from this clearcut theoretical advantage over UNN, and also from the fact that we ran the SVMs with significant parameter optimization [20]. Indeed, UNN s even beats SVMs over 10 to 30 categories, being only slightly outperformed by them on more categories.

In Tables 12.2 and 12.3, we report the corresponding computation times (in seconds) for the training and classification phases, respectively. Obviously, the computation times for training and testing are also key to interpreting the experimental results. Table 12.2 shows that, while the training time of AdaBoost is linear, UNN s is a clearcut winner over SVM for training: it achieves speedups ranging between two and more than seventeen over SVM. To assess the validity of these comparisons, we have computed least-square fittings of the training and testing times of UNN s vs AdaBoost vs SVM (all with BoF), with both linear (s=aC+b, s being the time in seconds and C the number of categories) and polynomial (s=bC^a) fittings, with the objective of extrapolating, from the best models, what might happen on domains with classes ranging from hundreds to (tens of) thousands. The best models are displayed in Table 12.4. The coefficients of determination show that only a slim portion of the data is not explained by the models shown.

Table 12.2 Computation time [s] for the training phase
Table 12.3 Computation time [s] for the testing phase
Table 12.4 Best fits for training/testing times [s] as a function of the number of classes C, or the number of images m in the training sample/to be tested. The model indicated is the best fit among models of the type y=ax+b and y=bx^a, according to the coefficient of determination r^2. For all but two models, r^2>0.999≈1.0 (the exceptions are (*), for which r^2≈0.97, and (**), for which r^2≈0.99). m_{1m} is the number, estimated by the model, of images that can be processed in 1 minute (see text for details)
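The linear and power-law fits of Table 12.4 can be reproduced by ordinary least squares, fitting s=bC^a in log-log space (a sketch with illustrative names; the chapter’s exact fitting procedure is not specified):

import numpy as np

def fit_linear(C, s):
    """Fit s = a*C + b and return (a, b, r2)."""
    a, b = np.polyfit(C, s, deg=1)
    r2 = 1 - np.sum((s - (a * C + b)) ** 2) / np.sum((s - s.mean()) ** 2)
    return a, b, r2

def fit_power(C, s):
    """Fit s = b*C**a via least squares on log s = log b + a*log C."""
    a, logb = np.polyfit(np.log(C), np.log(s), deg=1)
    pred = np.exp(logb) * C ** a
    r2 = 1 - np.sum((s - pred) ** 2) / np.sum((s - s.mean()) ** 2)
    return a, np.exp(logb), r2

# Keep whichever model has the larger coefficient of determination r^2.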

The models confirm that the training time of AdaBoost is linear. This is not a surprise, as it is run with stumps as weak classifiers. Allowing decision trees with more than one internal node would certainly have broken the linear time barrier. While the testing times of UNN s and AdaBoost are roughly equivalent (Table 12.3), they reveal a much bigger gap between UNN s and SVM, as displayed in Fig. 12.10. Exploiting the models of Table 12.4, we obtain the ratio:

$$ s_{\mathrm{SVM}}/ s_{\mathrm{{{{UNN}}}}} \approx \varOmega(m), $$
(12.27)

while, for the multi-thread implementation, we obtain:

$$ s_{\mathrm{SVM}}/ s_{\mathrm{{{{UNN}}}}_s\mathrm{MT}} \approx \varOmega\bigl(m^{1.3} \bigr). $$
(12.28)

The ratio is always in favor of UNN, and of the order of the number of examples. Hence, the execution time of UNN s should allow many images to be classified in a reduced time compared to SVM: from Table 12.4, UNN should already classify almost twenty times as many images as SVM in a single minute. In such a case, UNN s should also classify almost twice as many images as AdaBoost. Thus, UNN provides the best MAP/time trade-off among the tested methods, which suggests that UNN might well be more than a mere contender to classification methods dealing with huge domains, or domains where the testing set is huge compared to the training set, which is the case, for instance, for cell classification in biological images [16]. Finally, we have only scratched the surface of experimental optimizations for UNN, and have not optimized UNN from the complexity-theoretic standpoint, so we expect room for further significant improvement of its training/testing times.

12.4 Discussion and Perspectives

UNN provides us with a sound blend of two powerful yet simple classification algorithms: nearest neighbors and boosting. While the analysis of the blend is not straightforward—such as for the convergence and boosting properties in Theorems 12.1–12.3—UNN remains simple to state and implement, even in the multiclass case. It also appears to be an interesting contender to SVM: without using the kernel trick to map examples to high-dimensional feature spaces, UNN manages to fit nonlinear classifiers in the initial feature space whose accuracy clearly competes with SVM’s.

We think that this simplicity opens avenues for future research on the way separate extensions and improvements of nearest neighbors and boosting might be transferred to UNN. One example is the inclusion of powerful density estimation techniques that would fit better than our simple logistic convolution of priors in (12.25).

Another example involves improved sophistication from the classifier’s standpoint, in particular with metric distance learning and the kernelization of the input space [47]. This, we expect, would enable significant improvements of categorization performances.

A third example involves improvements from the nearest neighbor search standpoint. Novel techniques exist that embed nearest neighbor queries in a real-valued vector space, thus transforming the data space with the hope of reducing the processing complexity of nearest neighbor queries without reducing the accuracy of (vanilla) nearest neighbors in the learnt space [2, 25]. Clearly, such approaches do not tackle the same problem as ours, as UNN directly processes nearest neighbors in the data’s ambient space. Nevertheless, they are very interesting from a prospective standpoint, because this new data space is learnt with (Ada)boosting. A neat combination with UNN might thus offer the possibility to kill two birds with one boosting shot for nearest neighbors: learn an improved data space, and learn in this data space an improved nearest neighbor classifier with UNN. The questions raised by such a perspective are not only experimental, as basically only the contractiveness of the approach of [2] is formally known to date. Transferring, or even improving, the boosting properties of UNN in such sophisticated blends would thus be more than interesting.

12.5 Conclusion

In this work, we contribute to filling an important void in NN methods, showing how boosting can be transferred to k-NN classification, with convergence rate guarantees for a large number of surrogates. Our UNN algorithm generalizes classic k-NN to weighted voting, where the weights, the so-called leveraging coefficients, are iteratively learned by UNN. We prove that this algorithm converges to the global optimum of many surrogate risks in competitive times under very mild assumptions.

Our work is also the first extensive assessment of UNN on computer vision tasks. Comparisons with k-NN, support vector machines and AdaBoost, using Gist or Bag-of-Feature descriptors, on simulated and real domains, show that UNN is competitive with its contenders, achieving high MAP in comparatively reduced training and testing times.

Avenues for future research include blending UNN with other approaches that bias the domain towards the improvement of nearest neighbors rules, or that learn more sophisticated metrics over data.