Abstract
In large-scale machine learning, the problem of approximate nearest neighbor (ANN) search is of central interest: the goal is to retrieve points that are close to a given query under a certain metric. In this paper, we develop a novel data-driven ANN search algorithm in which the data structure is learned by a fast spectral technique based on s landmarks selected via approximate ridge leverage scores. We show that with overwhelming probability, our algorithm returns a \((1+\epsilon /4)\)-ANN for any approximation parameter \(\epsilon \in (0,1)\). A remarkable feature of our algorithm is its computational efficiency: learning k-length hash codes requires \(O((s^3+ns^2)\log n)\) running time and \(O(d^2)\) extra space, and returning the \((1+\epsilon /4)\)-ANN of a query needs \(O(k\log n)\) running time. Experimental results on computer vision and natural language understanding tasks demonstrate the significant advantage of our algorithm over state-of-the-art methods.
1 Introduction
Nearest neighbor search is one of the most fundamental problems in computational geometry and machine learning. It has been broadly investigated in a large body of real-world scenarios such as data compression (Gersho and Gray, 2012), speech recognition (Makhoul et al., 1985), and information retrieval (Jegou et al., 2011). As a concrete example, for a customer without any shopping history, it is often plausible to look up customers in the database with similar profiles to help make recommendations.
There are many early works on (exact) nearest neighbor search, such as the k-d tree and the R-tree (Bentley, 1975; Samet, 1990, 2006). These methods perform very well when the data lie in a low-dimensional space, say three dimensions, but suffer computational intractability in high-dimensional spaces (Arya et al., 1995). In fact, an early attempt by Dobkin and Lipton (1976) provided the first algorithm for nearest neighbor search in d-dimensional space, which takes doubly exponential time \(O(n^{2^{d+1}})\) for preprocessing and \(O(2^d\log n)\) time for retrieval. This phenomenon is known as the curse of dimensionality, and to tackle the problem in high dimensions, the notion of approximate nearest neighbor was proposed as a practical alternative (Arya and Mount, 1993). To be a little more formal, given any approximation factor \(\epsilon >0\), we say that a point \(\varvec{p}\) is an \(\epsilon \)-nearest neighbor of a given query \(\varvec{q}\) if the ratio of the distance from \(\varvec{p}\) to \(\varvec{q}\) over the distance from \(\varvec{q}\) to its exact nearest neighbor is at most \(1+\epsilon \).
We consider the data in real-world applications which are usually perturbed with noise (Abdullah et al., 2014). Formally, the observed data set \(\varvec{P}=\{\varvec{p}_1,\cdots , \varvec{p}_n\}\) is generated by a clean data set \(\varvec{X}=\{\varvec{x}_1,\cdots , \varvec{x}_n\}\) with random noise corruption, that is
$$\begin{aligned} \varvec{p}_i = \varvec{x}_i + \varvec{t}_i, \quad i=1,\cdots ,n, \end{aligned}$$
where \(\varvec{t}_i\) denotes the noise.
The query \(\varvec{q}\) is a superposition of the clean query point \(\varvec{y}\) corrupted by the same type of noise, i.e., \(\varvec{q}= \varvec{y}+ \varvec{t}\). Suppose \(\varvec{x}^*\) is the (exact) nearest neighbor of \(\varvec{y}\), that is
$$\begin{aligned} \varvec{x}^* = \mathop {\mathrm {arg\,min}}\limits _{\varvec{x}\in \varvec{X}} \left\Vert \varvec{x}- \varvec{y} \right\Vert _{2}, \end{aligned}$$
where \(\left\Vert \cdot \right\Vert _{2}\) denotes the \(\ell _2\)-norm. We consider bounded noise, in the sense that \(\max \{\left\Vert \varvec{t}_i \right\Vert _{2}, \left\Vert \varvec{t} \right\Vert _{2}\} \le \epsilon /16\). Though the most natural assumption on the noise may be Gaussian, we note that both Gaussian and bounded random variables are sub-Gaussian, so they admit the same tail bound. Under this smoothed problem setting, Indyk and Motwani (1998) proposed the celebrated locality-sensitive hashing (LSH) algorithm, which achieves sub-linear query time. Within the locality-sensitive hashing framework, a large body of work has shown that efficient computation is possible (Andoni and Indyk, 2008; Andoni et al., 2014, 2018). Notably, the construction of the hash functions in LSH is independent of the data.
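To make the smoothed setting concrete, the following minimal numpy sketch (with illustrative sizes and seed, not taken from the paper) generates an observed database and query whose noise is \(\ell _2\)-bounded by \(\epsilon /16\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 100, 16, 0.5

# Clean database X and clean query y.
X = rng.normal(size=(n, d))
y = rng.normal(size=d)

def bounded_noise(shape, bound, rng):
    """Draw Gaussian noise and rescale so its l2-norm is at most `bound`."""
    t = rng.normal(size=shape)
    return t * min(1.0, bound / np.linalg.norm(t))

# Observed data: each point (and the query) is corrupted by noise
# whose l2-norm is at most eps/16, matching the bounded-noise setting.
P = X + np.stack([bounded_noise(d, eps / 16, rng) for _ in range(n)])
q = y + bounded_noise(d, eps / 16, rng)
```

Note that rescaling the Gaussian draw keeps the noise sub-Gaussian while enforcing the hard norm bound used in the analysis.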
On the other spectrum, algorithms that incorporate machine learning techniques to learn the hash functions from the data have attracted a lot of interest in recent years (Kulis and Darrell, 2009; Liu et al., 2011; Kong and Li, 2012). For example, spectral graph has been widely studied to learn the binary codes that preserve the similarity structure of the database (Weiss et al., 2009; Abdullah et al., 2014). Supervised hashing methods learn the binary code representations of samples that are correlated with their labels (Shen et al., 2015). Recent works on representation learning using deep neural networks have shown practical values in various tasks, which motivates a surge of works to utilize convolutional neural networks as hash functions; see, for example, Çakir et al. (2018).
Though the learning based approaches outperform the locality-sensitive hashing based methods in many applications (Jegou et al., 2011; Xia et al., 2015), there seems to be a lack of theoretical understanding of the success of many existing algorithms. In this paper, we propose a data-dependent learning algorithm for approximate nearest neighbor search, and we aim to resolve two important technical barriers: (1) approximate the low-dimensional space efficiently; and (2) provide a theoretical guarantee that mutual distances are preserved in the low-dimensional space, that is, if data points are neighbors in the original space, they should be close to each other in the low-dimensional space. Abdullah et al. (2014) provided the first justification of this kind, directly utilizing principal component analysis with preprocessing time \(O(nd^2+d^3)\). In our algorithm, we learn the projection matrix by leverage score based sampling, which is more computationally efficient (Alaoui and Mahoney, 2015; Musco and Musco, 2015; Cohen et al., 2016; Musco and Musco, 2017). In addition, it has been demonstrated that leverage score based sampling approaches often give strong provable guarantees for subspace approximation and statistical performance in downstream applications (Alaoui and Mahoney, 2015; Rudi et al., 2015; Gittens and Mahoney, 2016).
1.1 Summary of our contributions
In this work, we present a learning-to-hash algorithm based on ridge leverage scores: it produces a hash function provably matching the accuracy of principal component analysis methods, and the obtained low-dimensional subspace preserves the geometric structure of the database. The advantage of our method is twofold. First, approximating the low-dimensional space is significantly more efficient than in many existing spectral methods (Weiss et al., 2009; Abdullah et al., 2014), as the sampling technique we use for subspace learning operates on s landmark points. The preprocessing, in particular, takes \(O(n\cdot s^2)\) time, where \(s \ll \min (n,d)\) is the number of landmarks. Second, we show that a \((1+\epsilon /4)\)-approximate nearest neighbor of the query is obtained with high probability.
In terms of empirical results, we evaluate the performance of our algorithm on real-world applications in computer vision and natural language understanding. The experiments are conducted on real-world data sets, including MNIST, the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013), the Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019), the Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005), the Stanford Question Answering Natural Language Inference Corpus (QNLI) (Rajpurkar et al., 2016), and GloVe (Pennington et al., 2014). Our algorithm achieves the best performance across various hash code lengths on all the data sets compared with the state-of-the-art algorithms.
1.2 Roadmap
In Sect. 2, we present a more concrete literature review and state the connection to this work. Section 3 gives the main algorithm with performance guarantee in Sect. 4. A comprehensive empirical study is carried out in Sect. 5. We conclude the paper in Sect. 6. The proof details can be found in the “Appendix”.
Notation. We use lowercase letters to denote vectors and capital letters for matrices. For a vector \(\varvec{q}\), we denote its \(\ell _2\)-norm by \(\left\Vert \varvec{q} \right\Vert _{2}\). We reserve \(\varvec{P}\in \mathbb {R}^{n\times d}\) for the database with n data points in d-dimensional feature space. We use \(\varvec{p}_i^{\top }\in \mathbb {R}^{d}\) to denote the i-th row of \(\varvec{P}\), that is, the i-th sample in \(\varvec{P}\). We use two matrix norms: the Frobenius norm and the spectral norm, defined as \(\left\Vert \varvec{P} \right\Vert _{F}=\sqrt{\sum _{i=1}^{d}\sigma _i(\varvec{P})^2}\) and \(\left\Vert \varvec{P} \right\Vert _{2}=\sigma _{1}(\varvec{P})\) respectively, where \(\sigma _i(\varvec{P})\) denotes the i-th singular value of \(\varvec{P}\) in descending order (\(\sigma _1(\varvec{P})\ge \sigma _2(\varvec{P}) \ge \dots \ge \sigma _d(\varvec{P}) \ge 0\)). The distance between a data point \(\varvec{q}\) and the subspace \(\varvec{U}\) is defined as \(d(\varvec{q},\varvec{U}) := \inf _{\varvec{y}\in \varvec{U}}\left\Vert \varvec{q}-\varvec{y} \right\Vert _{2} = \left\Vert \varvec{q}-\varvec{q}_{{\varvec{U}}} \right\Vert _{2}\), where \(\varvec{q}_{{\varvec{U}}}\) is the orthogonal projection of \(\varvec{q}\) onto the subspace \({\varvec{U}}\). When we say a subspace is k-dimensional, we mean its intrinsic dimension is k.
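The distance-to-subspace notation above can be sketched numerically; the following minimal example (with an arbitrary orthonormal basis standing in for \(\varvec{U}\)) computes \(d(\varvec{q},\varvec{U})\) via the orthogonal projection:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 3

# Orthonormal basis U of a k-dimensional subspace (via QR of a random matrix).
U, _ = np.linalg.qr(rng.normal(size=(d, k)))

q = rng.normal(size=d)
q_U = U @ (U.T @ q)              # orthogonal projection of q onto the subspace
dist = np.linalg.norm(q - q_U)   # d(q, U) = ||q - q_U||_2

# The residual q - q_U is orthogonal to the subspace, and the projection
# is the closest subspace point, so any other subspace point is farther.
assert np.allclose(U.T @ (q - q_U), 0)
assert dist <= np.linalg.norm(q - U @ rng.normal(size=k))
```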
2 Related works
The core of nearest neighbor search is to find the data point closest to the query in the database, while approximate nearest neighbor search returns data points within \((1+\epsilon )\cdot dist\) of the query, where dist is the distance between the query and its nearest neighbor. In either case, the search is performed on a collection of data points; the process of organizing the database into a certain data structure is called preprocessing, and it is assumed to be independent of the number of queries. Since the straightforward search is brute force, which takes O(n) time even in one-dimensional space, more efficient algorithms construct a data structure that makes queries efficient in terms of the space and time cost of preprocessing and retrieval. For example, the binary search method builds a balanced binary tree in \(O(n\log n)\) time and answers a query with at most \(\lfloor \log n \rfloor +1\) comparisons (Knuth, 1973). A plethora of related algorithms have been proposed in the literature, such as k-d trees and R-trees (Bentley, 1975; Samet, 1990; Sellis et al., 1997; Samet, 2006). These approaches are usually based on computational geometry. However, if the number of dimensions exceeds about 20, searching in k-d trees and related structures requires inspecting a large fraction of the database, thereby doing no better than brute-force linear search (Gionis et al., 1999). Therefore, approximate nearest neighbor search has attracted attention for practical problems with high-dimensional data.
Existing algorithms for approximate nearest neighbor search can be categorized into locality-sensitive hashing families and learning based hashing, depending on how the data structure is constructed. Indyk and Motwani (1998) introduced the idea of locality-sensitive hashing. Many related works discuss how to choose the parameters L (the number of buckets), \(r_1\) (the radius of the ball centered at \(\varvec{q}\)), and k (the length of the hash code) to achieve a low failure probability. For example, Andoni and Indyk (2008) proposed an algorithm that utilizes a linear random projection to reduce the feature dimension to \(k=O(\log n)\); the approximate nearest neighbors can then be returned in sublinear query time using nearly-linear space. Andoni et al. (2014) proposed a data-dependent hashing function with a Johnson-Lindenstrauss dimension reduction procedure and obtained improved bounds. Andoni et al. (2018) presented a data structure for general symmetric norms. Very recently, Andoni et al. (2021) showed improved data structures for high-dimensional approximate nearest neighbor search under \(\ell _p\) distances for large values of p and under generalized Hamming distances. The related space and time bounds for the Euclidean distance are summarized in Table 1.
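As a minimal illustration of the data-independent flavor of LSH (a random-hyperplane sketch in the spirit of the random-projection schemes above, not the exact construction of any cited paper), hash codes can be built from a random projection and queried by code collision:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 1000, 32, 10   # k = O(log n) hash bits

P = rng.normal(size=(n, d))
R = rng.normal(size=(d, k))           # data-independent random projection

codes = (P @ R > 0).astype(np.uint8)  # one k-bit code per data point

def query(q, codes, R):
    """Return indices of points whose code collides with the query's code."""
    qc = (q @ R > 0).astype(np.uint8)
    return np.flatnonzero((codes == qc).all(axis=1))

# A database point always collides with itself.
assert 0 in query(P[0], codes, R)
```

In practice LSH repeats this with L independent hash tables to drive the failure probability down; the sketch shows only a single table.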
Learning based hashing has seen a recent surge of interest (Gong and Lazebnik, 2011; Weiss et al., 2012; Erin Liong et al., 2015; Han et al., 2015; Liu et al., 2016). Much of this excitement centers on the discovery that these approaches achieve outstanding performance in real-world applications such as computer vision (Xia et al., 2015) and information retrieval (Jegou et al., 2011). Some works focus on supervised binary code projection methods (Liu et al., 2014; Shen et al., 2015). For example, sparse projection (SP) introduced sparse projections for binary encoding, minimizing the distortion and adopting variable-splitting techniques in the optimization (Xia et al., 2015). Spectral analysis based unsupervised methods have attracted much attention since labeled data are scarce. For example, spectral hashing utilized a subset of thresholded eigenvectors of the graph Laplacian matrix (Weiss et al., 2009). Iterative quantization (ITQ) proposed an efficient way to find the hash code by minimizing the quantization error of mapping the data to the vertices of a zero-centered binary hypercube (Gong and Lazebnik, 2011). Jegou et al. (2011) decomposed the space into a Cartesian product of low-dimensional subspaces, composing the hash code from the subspace quantization indices. Liu et al. (2011) assumed that the data reside on a low-dimensional manifold and proposed a graph-based hashing method. Isotropic hashing (ISO) found a hash projection function with equal variances across dimensions (Kong and Li, 2012). Multidimensional spectral hashing (MDSH) learned binary codes by reconstructing the affinity between data points rather than computing their distances (Weiss et al., 2012). Bilinear projection based binary codes (BPBC) learned similarity-preserving binary codes via compact bilinear projections instead of a single large projection matrix (Gong et al., 2013).
These spectral approaches typically rely on a relaxation in which the bits are obtained from thresholded eigenvectors of an affinity matrix. Circulant binary embedding (CBE) learned data-dependent circulant projections by minimizing the objective in the original and Fourier domains (Yu et al., 2014). Scalable graph hashing (SGH) was proposed to approximate the whole graph without explicitly computing the similarity graph matrix, optimizing a sequential learning function to learn compact hash codes in a bit-wise manner (Jiang and Li, 2015). We follow this line of research and propose an inexact spectral analysis for approximate nearest neighbor search. The experimental results demonstrate the superiority of our algorithm compared with the state-of-the-art learning based hashing approaches mentioned in this section.
3 Main algorithm
In this section, we elaborate on our approach, which consists of two steps: Algorithm 1 samples the landmark points for the construction of data structure that can be used for efficient retrieval, and Algorithm 2 performs approximate nearest neighbor search.
3.1 Overview
Our pipeline consists of learning the hash codes and retrieval, where the primary idea is to find a good embedding of the original data points under which mutual distances are well controlled with overwhelming probability. A straightforward approach would be to utilize principal component analysis (PCA); however, finding the exact principal components is computationally slow for large-scale problems. Therefore, we propose to first select a manageable number of landmark points and then apply PCA. The selection process is based on the ridge leverage score, which is a good measure of the importance of data points (Alaoui and Mahoney, 2015).
Definition 1
(Ridge leverage score) For any \(\lambda >0\), the \(\lambda \)-ridge leverage score of the i-th row of \(\varvec{P}\in \mathbb {R}^{n\times d}\) is defined as:
$$\begin{aligned} l_{i} = \left( \varvec{P}\varvec{P}^{\top }\left( \varvec{P}\varvec{P}^{\top }+\lambda \mathbf {I}\right) ^{-1}\right) _{ii}, \end{aligned}$$
where \(\mathbf {I}\) is the \(n\times n\) identity matrix.
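For concreteness, the ridge leverage scores of Definition 1 can be computed exactly with a direct \(n\times n\) inverse, which is precisely the expensive step that the sampling scheme below avoids; a minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam = 50, 8, 0.1

P = rng.normal(size=(n, d))
K = P @ P.T                                    # n x n Gram matrix P P^T

# lambda-ridge leverage score of row i: the i-th diagonal entry
# of K (K + lambda I)^{-1}, as in Definition 1.
scores = np.diag(K @ np.linalg.inv(K + lam * np.eye(n)))

# Each score lies in (0, 1); their sum is the effective dimension
# tr(K (K + lambda I)^{-1}), which is at most rank(P) = min(n, d).
assert np.all(scores > 0) and np.all(scores < 1)
```

This direct computation costs \(O(n^3)\) time, which motivates the approximate, landmark-based estimation used in Phase I.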
To be more concrete, when constructing the principal components of the training set, our algorithm runs in multiple iterations, where in each iteration a fraction of the training data are sampled and some of them will be selected as landmarks. The low-dimensional subspace is learned based on selected landmarks. The algorithm terminates when all the training data have been evaluated. When a new query comes in, it will be projected onto the learned subspace, through which retrieval is efficient.
3.2 Learning to hash
Algorithm 1 learns a low-dimensional projection matrix \(\varvec{Z}\in \mathbb {R}^{d\times k}\) that can be applied to embed the data. It consists of two major steps: Phase I selects the landmark points, as indicated by the matrix \(\tilde{\varvec{S}}\in \mathbb {R}^{n\times s}\), and Phase II runs PCA on the selected landmark points to return the low-dimensional projection matrix. The algorithm starts by checking whether the problem is large-scale, that is, whether the number of samples in \(\varvec{P}\) is greater than \(192\log (1/\delta )\). If not, we can use PCA to obtain \(\varvec{Z}\) directly; otherwise, the algorithm enters Phase I in the while-loop to sample important data points.
A key observation behind our sampling approach is that uniform sampling is practical but enjoys theoretical guarantees only under strong regularity or incoherence assumptions on the data (Gittens, 2011). Ridge leverage scores, on the other hand, evaluate the importance of data points and have shown practical impact in downstream applications; however, computing exact ridge leverage scores is often slow. We therefore propose to combine these two widely used schemes.
First, note that we aim to approximately estimate the ridge leverage score of all data points in an iterative manner, and each data point is evaluated only once. To this end, the number of iterations T is initialized as \(O(\log n)\). In each iteration, we randomly draw half of the data points that have not been accessed. The iteration terminates when the size of the remaining data is less than \(192\log (1/\delta )\).
In particular, in each iteration, we construct a uniform sampling matrix \(\bar{\varvec{S}}\) by selecting data points uniformly at random with probability 1/2, while \(\tilde{\varvec{S}}\) is the sampling matrix learned from approximated ridge leverage scores. Each column of \(\tilde{\varvec{S}}\) has one nonzero element that indicates the index of the selected sample. In each iteration, we uniformly sample a subset \(\mathcal {J}_i\) and approximate the ridge leverage score of the j-th sample as
$$\begin{aligned} \tilde{l}_{j} = \varvec{p}_j^{\top }\left( \varvec{P}^{\top }\tilde{\varvec{S}}\tilde{\varvec{S}}^{\top }\varvec{P}+\lambda \mathbf {I}\right) ^{-1}\varvec{p}_j. \end{aligned}$$
(4)
Equation (4) is a good approximation of the original ridge leverage score computed as in Definition 1. Since \(\varvec{P}^{\top }\tilde{\varvec{S}}\tilde{\varvec{S}}^{\top }\varvec{P}\preceq \varvec{P}^{\top }\varvec{P}\), \(\tilde{l}_{i}\) is an upper bound of the ridge leverage score \(l_{i}\), i.e., \(\tilde{l}_{i}\ge l_{i}\).
Then we compute the sampling probability of each data point from the approximated leverage scores as follows:
$$\begin{aligned} \eta _i = \min \left\{ 1,\; 16\,\tilde{l}_{i}\log \frac{k}{\delta \epsilon }\right\} . \end{aligned}$$
The data points are selected as landmarks with probability \( \eta _i \). The corresponding column of each landmark point in the sampling matrix \(\varvec{S}\) is weighted by \(1/\sqrt{\eta _i}\), and \(\varvec{S}\) is assigned to \(\tilde{\varvec{S}}\) as the final sampling matrix of the current iteration. We then obtain the next-round data partition \(\mathcal {J}_t\) by uniform sampling. At the end of the algorithm, we output a partition \(\{\mathcal {J}_1, \cdots , \mathcal {J}_T\}\) of the data set.
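To make the sampling loop concrete, here is a minimal Python sketch of Phase I. The deterministic halving split, the oversampling constant `c`, and the fallback to the current batch before any landmark exists are illustrative simplifications, not the exact constants or weighting of Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, lam, c = 256, 16, 0.1, 4.0   # c: illustrative oversampling constant

P = rng.normal(size=(n, d))
remaining = np.arange(n)
landmarks, partition = [], []

while len(remaining) > 8:                      # stop once the pool is small
    # Split the unprocessed points in half uniformly at random.
    half = len(remaining) // 2
    perm = rng.permutation(len(remaining))
    batch, remaining = remaining[perm[:half]], remaining[perm[half:]]
    partition.append(batch)

    # Approximate ridge leverage scores of the batch using the current
    # landmarks (falling back to the batch itself before any exist).
    S = P[landmarks] if landmarks else P[batch]
    G = S.T @ S + lam * np.eye(d)
    scores = np.einsum('ij,jk,ik->i', P[batch], np.linalg.inv(G), P[batch])

    # Keep each batch point as a landmark with probability ~ its score.
    eta = np.minimum(1.0, c * scores)
    keep = batch[rng.random(len(batch)) < eta]
    landmarks.extend(keep.tolist())

partition.append(remaining)
# Every point lands in exactly one partition cell.
assert sum(len(b) for b in partition) == n
```

The halving split guarantees \(O(\log n)\) iterations, mirroring the while-loop analysis in Sect. 3.4.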
Phase II seeks a low-dimensional projection matrix \(\varvec{Z}\) based on the selected landmarks. A straightforward approach to learning \(\varvec{Z}\) is to optimize the following objective function:
$$\begin{aligned} \min _{\varvec{Z}\in \mathbb {R}^{d\times k},\,\varvec{Z}^{\top }\varvec{Z}=\mathbf {I}}\; \left\Vert \varvec{P}-\varvec{P}\varvec{Z}\varvec{Z}^{\top } \right\Vert _{F}^{2}. \end{aligned}$$
As \(\varvec{Z}\) always lies in the column span of \(\varvec{P}^{\top }\), it can be represented by constructing a matrix \(\varvec{Y}\in \mathbb {R}^{n\times k}\) such that \(\varvec{Z}= \varvec{P}^{\top }\varvec{Y}\). We re-parameterize by writing \(\varvec{Y}=\varvec{K}^{-1/2}\varvec{W}\) where \(\varvec{K}=\varvec{P}\varvec{P}^{\top }\), thus \(\varvec{Z}= \varvec{P}^{\top }\varvec{K}^{-1/2}\varvec{W}\). Recall that Phase I selects s landmarks denoted by \(\varvec{S}\). Let \(\Phi \) be the orthogonal projection onto the row span of \(\varvec{S}^{\top }\varvec{P}\). We can approximate the database matrix as \(\tilde{\varvec{P}}{\mathop {=}\limits ^{\text {def}}}\varvec{P}\Phi \), where \(\Phi =\varvec{P}^{\top }\varvec{S}(\varvec{S}^{\top }\varvec{P}\varvec{P}^{\top }\varvec{S})^+\varvec{S}^{\top }\varvec{P}\). Since \(\Phi \) is an orthogonal projection, \(\Phi \Phi ^{\top }=\Phi ^2=\Phi \), and we can approximate \(\varvec{K}\) as \(\tilde{\varvec{K}}= \tilde{\varvec{P}}\tilde{\varvec{P}}^{\top }=\varvec{K}\varvec{S}(\varvec{S}^{\top }\varvec{K}\varvec{S})^+\varvec{S}^{\top }\varvec{K}\). So the projection matrix takes the form \(\varvec{Z}=\Phi \varvec{P}^{\top }\tilde{\varvec{K}}^{-1/2}\tilde{\varvec{W}}=\varvec{P}^{\top }\varvec{S}(\varvec{S}^{\top }\varvec{P}\varvec{P}^{\top }\varvec{S})^+\varvec{S}^{\top }\varvec{P}\tilde{\varvec{W}}\), where \(\tilde{\varvec{W}}\) minimizes the following function:
The optimization of (7) is equivalent to minimizing the above objective function, which is standard in the literature (Woodruff et al., 2014). Since \(\varvec{W}\) can be taken as the top k eigenvectors of \(\varvec{K}\), we approximate it by performing a singular value decomposition of \(\varvec{P}\varvec{P}^{\top }\varvec{S}\) and assigning the result to \(\Sigma _k\) in Algorithm 1.
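A simplified numerical sketch of Phase II follows; it uses uniformly chosen landmarks in place of the weighted leverage-score sample and a QR step to orthonormalize the map, so it illustrates the structure rather than reproduces Algorithm 1 exactly:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, s, k = 200, 20, 12, 5

P = rng.normal(size=(n, d))
idx = rng.choice(n, size=s, replace=False)   # landmark indices (stand-in
                                             # for leverage-score sampling)
C = P @ P[idx].T                             # n x s matrix  P P^T S

# Top-k left singular vectors of C give an approximate k-dimensional
# principal subspace; map it back through P^T to get Z in R^{d x k}.
U, _, _ = np.linalg.svd(C, full_matrices=False)
Z, _ = np.linalg.qr(P.T @ U[:, :k])          # orthonormal d x k projection

# Compare projection cost with exact PCA on P: PCA is optimal, so the
# landmark-based subspace can only match or exceed its residual.
V = np.linalg.svd(P, full_matrices=False)[2][:k].T
err_apx = np.linalg.norm(P - P @ Z @ Z.T, 'fro')
err_pca = np.linalg.norm(P - P @ V @ V.T, 'fro')
assert err_apx >= err_pca - 1e-9
```

The SVD here acts on an \(n\times s\) matrix rather than the full \(n\times d\) data, which is the source of the \(O(ns^2)\) cost discussed in Sect. 3.4.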
3.3 Retrieval
In the retrieval phase, it is easy to compute the hash code of a data point via the projection matrix \(\varvec{Z}\): the hash code of data point \(\varvec{p}^{\top }\in \mathbb {R}^{d}\) is \(h(\varvec{p}) = \text {sign}(\varvec{p}^{\top }\varvec{Z})\). We obtain the hash code of the query in the same way. The near neighbors of the query are the data points that collide with the query in terms of the hash code; neighbors can also be retrieved within a certain Hamming-distance radius. The search procedure is performed on each data subset of \(\{\mathcal {J}_1,\cdots ,\mathcal {J}_T\}\) in parallel.
As shown in Algorithm 2, we set m as the desired number of approximate nearest neighbors to return. First, we compute the hash codes of the data points in \(\varvec{P}\) and of the query point via the projection matrix \(\varvec{Z}\). Then the data points that collide with the query are considered its near neighbors. Since the data set \(\varvec{P}\) is partitioned into subsets \(\{\mathcal {J}_i\}_{i=1}^{T}\) with \(T =O(\log n)\), the search in each subset can run simultaneously. The search procedure terminates when the desired number of neighbors has been returned. As shown in Theorem 3, the approximate nearest neighbor is returned in the low-dimensional space with high probability.
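The retrieval step above can be sketched as follows; a random orthonormal matrix stands in for the learned \(\varvec{Z}\), and a full Hamming-distance ranking replaces the per-partition parallel search for brevity:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, k, m = 500, 16, 12, 5

P = rng.normal(size=(n, d))
Z, _ = np.linalg.qr(rng.normal(size=(d, k)))  # stand-in for the learned Z

codes = np.sign(P @ Z)                        # k-length hash codes

def retrieve(q, m):
    """Return indices of the m data points with the closest hash codes."""
    qc = np.sign(q @ Z)
    ham = (codes != qc).sum(axis=1)           # Hamming distance to the query
    return np.argsort(ham)[:m]                # m nearest codes

# Querying with a database point: the top candidate's code must match
# the query's code exactly (Hamming distance 0).
cand = retrieve(P[7], m)
assert (codes[cand[0]] == np.sign(P[7] @ Z)).all()
```

Ranking by Hamming distance subsumes both collision lookup (distance 0) and retrieval within a radius.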
3.4 Time and memory cost
Phase I of Algorithm 1 performs at most \(T=O(\log n)\) iterations in total; after these iterations, every data point has been assigned to some group. The time cost of the iterative procedure is dominated by the ridge leverage score computation, which takes \(O(ns^2)\) time. Since n is cut in half at each iteration, the total running time is \(O(ns^2+\frac{ns^2}{2}+\frac{ns^2}{4}+\cdots ) =O(ns^2)\). Computing the top k eigenvectors of \(\varvec{P}\varvec{P}^{\top }\varvec{S}\) takes \(O(ns^2)\) time. Since \(\varvec{S}\) has \(O(\frac{k}{\epsilon }\log \frac{k}{\delta \epsilon })\) columns, the eigenvector computation for the low-dimensional projection matrix \(\varvec{Z}\) can be performed very efficiently; the construction of \(\varvec{Z}\) takes \(O(s^3+s^2)\) time. Hence, the total time complexity of Algorithm 1 is \(O((s^3+ns^2)\cdot \log n)\). Recall that the time cost of spectral analysis is usually polynomial in n or d, e.g., \(O(nd^2+d^3)\) (Abdullah et al., 2014); clearly, our algorithm is more efficient. In terms of memory, storing \(\varvec{P}^{\top }\varvec{P}\) requires \(O(d^2)\) extra space, which is used in the ridge leverage score estimation. Searching for the neighbors of a query as in Algorithm 2 requires storing the binary codes of the training data in O(nk) space, with query time \(O(k\cdot \log n)\).
3.5 Hyper-parameter setting
Algorithm 1 learns a low-dimensional projection matrix \(\varvec{Z}\in \mathbb {R}^{d\times k}\), where d is the feature dimension of the data \(\varvec{P}\) and k is the dimension of the projected space, \(k< d\). The while-loop in Phase I terminates in \(T=O(\log n)\) iterations, as the uniform sampling selects half of the samples from \(\Omega \) at each iteration. We assume that the data \(\varvec{P}\) live in a low-dimensional space and that k is the rank of the data matrix. After projecting the data, we apply the sign function to obtain the hash code, hence k equals the length of the hash code. The parameter k is tuned in the range (0, d]. The input parameters of Algorithm 1 are \(\lambda =\frac{\epsilon }{k}\sum _{i=k+1}^n\sigma _{i}(\varvec{K})\), \(\epsilon \), and \(\delta \), which are used to obtain the sampling matrix \(\varvec{S}\in \mathbb {R}^{n\times s}\); here s is the number of sampled data points, which is of order \(\frac{k}{\epsilon } \log \frac{k}{\delta \epsilon }\). The reason is that \(s\le 2\sum _i \eta _i\) with probability \(1-\delta \) by Lemma 6. If the ridge leverage score is computed exactly, we can bound \(\sum _i l_i\le \frac{2k}{\epsilon }\) as shown in Lemma 9 of “Appendix A”. Accordingly, \(\sum _i \eta _i\le 32\frac{k}{\epsilon }\log \frac{k}{\delta \epsilon }\) as designed.
If the number of data points \(n < 192\log (1/\delta )\), the while-loop is skipped. The threshold of \(192\log (1/\delta )\) samples follows the simplified Chernoff bounds in Mitzenmacher and Upfal (2017). That is, when \(n\ge 192\log (1/\delta )\) and \({\mathbb E}|\bar{\varvec{S}}|\ge 96\log (1/\delta )\), we have:
$$\begin{aligned} \Pr \left[ 0.44\,n \le |\bar{\varvec{S}}| \le 0.56\,n \right] \ge 1-\delta , \end{aligned}$$
as long as \(\delta \le 1/32\). Then the while-loop continues on the index set \(\Omega \) of size \(\ge 1\) and \(\le 0.56n\). Accordingly, Theorem 1 holds for all data set \(\mathcal {J}\) with size between 1 and \(n-1\) with probability \(1-\delta \).
\(\lambda \) is the parameter used to approximate the ridge leverage score, initialized as \( \frac{\epsilon }{k}\sum _{i=k+1}^n \sigma _i(\varvec{P}\varvec{P}^{\top })\). With this choice, we obtain the \((1+2\epsilon )\) relative Frobenius error guarantee between the approximated low-rank space and the ground truth. The quantity \(192\log ( 1/\delta )\) is the minimum size of the sampled set needed to compute the leverage scores. We assume that the number of samples in \(\varvec{P}\) is larger than \(192\log (1/\delta )\); otherwise, the low-rank matrix can be computed by singular value decomposition directly.
4 Performance guarantee
In this section, we use the following notation. Let \(\varvec{S}^{\top }\varvec{P}\) denote the data matrix with the s samples selected by the weighted sampling matrix \(\varvec{S}\) from the database \(\varvec{P}\). We write \(\varvec{K}=\varvec{P}\varvec{P}^\top \in \mathbb {R}^{n\times n}\). Note that the Nyström approximation of \(\varvec{K}\) based on \(\varvec{S}\) is \(\tilde{\varvec{K}}= \varvec{K}\varvec{S}(\varvec{S}^{\top }\varvec{K}\varvec{S})^+\varvec{S}^{\top }\varvec{K}\).
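As a quick numerical illustration of this notation, the following sketch builds the Nyström approximation \(\tilde{\varvec{K}}\) from uniformly chosen, unweighted landmarks (a stand-in for the weighted matrix \(\varvec{S}\)) and checks the structural fact that \(\varvec{K}-\tilde{\varvec{K}}\) is positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, s = 60, 10, 8

P = rng.normal(size=(n, d))
K = P @ P.T

# Unweighted landmark selection as a stand-in for the weighted S.
idx = rng.choice(n, size=s, replace=False)
KS = K[:, idx]                                  # K S
W = K[np.ix_(idx, idx)]                         # S^T K S
K_tilde = KS @ np.linalg.pinv(W) @ KS.T         # Nystrom approximation

# Since K_tilde = P Phi P^T for an orthogonal projection Phi,
# K - K_tilde = P (I - Phi) P^T is PSD: all eigenvalues >= 0 (up to fp error).
eigs = np.linalg.eigvalsh(K - K_tilde)
assert eigs.min() > -1e-6
```

The PSD ordering \(\tilde{\varvec{K}}\preceq \varvec{K}\) is exactly the one-sided part of the approximation guarantees studied next.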
Lemma 1
For any \(\delta \in (0,1/32)\), with probability \((1-3\delta )\), Algorithm 1 returns \(\varvec{S}\) with s columns that satisfies:
$$\begin{aligned} \tfrac{1}{2}\left( \varvec{P}^{\top }\varvec{P}+\lambda \mathbf {I}\right) \preceq \varvec{P}^{\top }\varvec{S}\varvec{S}^{\top }\varvec{P}+\lambda \mathbf {I}\preceq \tfrac{3}{2}\left( \varvec{P}^{\top }\varvec{P}+\lambda \mathbf {I}\right) . \end{aligned}$$
We remark that Lemma 1 is a direct corollary of Lemma 6 and matrix Bernstein inequality.
Lemma 2
For any \(\delta \in (0,1/32)\), let \(\varvec{S}\in \mathbb {R}^{n\times s}\) be returned by Algorithm 1 with \(s\le 384\cdot \mu \log (\mu /\delta )\), where \(\mu ={{\,\mathrm{tr}\,}}(\varvec{K}(\varvec{K}+\lambda \mathbf {I})^{-1})\) is the effective dimension of \(\varvec{K}=\varvec{P}\varvec{P}^{\top }\) with parameter \(\lambda \). Denote the Nyström approximation of \(\varvec{K}\) by \(\tilde{\varvec{K}}= \varvec{K}\varvec{S}(\varvec{S}^{\top }\varvec{K}\varvec{S})^+\varvec{S}^{\top }\varvec{K}\). With probability \(1-3\delta \), the following holds:
$$\begin{aligned} \tilde{\varvec{K}}\preceq \varvec{K}\preceq \tilde{\varvec{K}}+\lambda \mathbf {I}. \end{aligned}$$
Proof
By Lemma 1, we get
$$\begin{aligned} \tfrac{1}{2}\left( \varvec{P}^{\top }\varvec{P}+\lambda \mathbf {I}\right) \preceq \varvec{P}^{\top }\varvec{S}\varvec{S}^{\top }\varvec{P}+\lambda \mathbf {I}\preceq \tfrac{3}{2}\left( \varvec{P}^{\top }\varvec{P}+\lambda \mathbf {I}\right) \end{aligned}$$
for a weighted sampling matrix \(\varvec{S}\). If we remove the weights from \(\varvec{S}\) so that it has all unit entries, by Lemma 5 and the Nyström approximation, \(\tilde{\varvec{K}}\) satisfies:
$$\begin{aligned} \tilde{\varvec{K}}\preceq \varvec{K}\preceq \tilde{\varvec{K}}+\lambda \mathbf {I}, \end{aligned}$$
as claimed. \(\square \)
Now, we are ready to use Lemma 1 and Lemma 2 to give an efficient method to approximate the principal components of the data matrix \(\varvec{P}\).
Theorem 1
Let \(\varvec{S}\in \mathbb {R}^{n\times s}\) be returned by Algorithm 1 with \(\lambda =\frac{\epsilon }{k}\sum _{i=k+1}^{n}\sigma _i(\varvec{P}\varvec{P}^{\top })\) and \(\delta \in (0,1/8)\), and let \(\varvec{V}\in \mathbb {R}^{d\times k}\) contain the optimal top k row principal components of the data matrix \(\varvec{P}\). From \(\varvec{P}^{\top }\varvec{S}\), we can compute a matrix \(\varvec{X}\in \mathbb {R}^{s\times k}\) such that if we set \(\varvec{Z}= \varvec{P}^{\top }\varvec{S}\varvec{X}\), then with probability \(1-\delta \):
$$\begin{aligned} \left\Vert \varvec{P}-\varvec{P}\varvec{Z}\varvec{Z}^{\top } \right\Vert _{F}^{2}\le (1+2\epsilon )\left\Vert \varvec{P}-\varvec{P}\varvec{V}\varvec{V}^{\top } \right\Vert _{F}^{2}, \end{aligned}$$
with \(s=O(\frac{k}{\epsilon }\log \frac{k}{\delta \epsilon })\).
The proof is presented in “Appendix B”. In the following, we demonstrate that the nearest neighbor can be retrieved from the learned data structure. To this end, we first show that the nearest neighbor of the query remains the same even when the data are corrupted with noise.
Lemma 3
If the query \(\varvec{y}\) and its nearest neighbor \(\varvec{x}^*\) are corrupted with noise \(\varvec{t}\), that is, \(\varvec{q}=\varvec{y}+\varvec{t}\) and \(\varvec{p}^*=\varvec{x}^*+\varvec{t}\), then the nearest neighbor of \(\varvec{q}\) is \(\varvec{p}^*\).
Proof
Recall that the noise is bounded, that is, \(\left\Vert \varvec{t} \right\Vert _{2} \le \alpha \). Hence, for all \(i\in [n]\), we have \(\max \{\left\Vert \varvec{t}_i \right\Vert _{2}, \left\Vert \varvec{t} \right\Vert _{2}\}\le \alpha \).
By the triangle inequality,
$$\begin{aligned} \left\Vert \varvec{q}-\varvec{p}_i \right\Vert _{2} \ge \left\Vert \varvec{y}-\varvec{x}_i \right\Vert _{2} - \left\Vert \varvec{t} \right\Vert _{2} - \left\Vert \varvec{t}_i \right\Vert _{2} \ge \left\Vert \varvec{y}-\varvec{x}_i \right\Vert _{2} - 2\alpha . \end{aligned}$$
Then, for any other data point in the data set \(\varvec{P}\), that is, \(\varvec{p}\in \varvec{P}\) with \(\varvec{p}\ne \varvec{p}^*\), we get
$$\begin{aligned} \left\Vert \varvec{q}-\varvec{p} \right\Vert _{2} \ge \left\Vert \varvec{y}-\varvec{x} \right\Vert _{2} - 2\alpha \ge \left\Vert \varvec{y}-\varvec{x}^* \right\Vert _{2} - 2\alpha = \left\Vert \varvec{q}-\varvec{p}^* \right\Vert _{2} - 2\alpha . \end{aligned}$$
Hence \(\varvec{p}^*\) remains the nearest neighbor of \(\varvec{q}\) under the bounded noise. \(\square \)
Theorem 2
Let \(\varvec{Z}\in \mathbb {R}^{d\times k}\) be the projection matrix learned by Algorithm 1 and \(\tilde{\varvec{U}}\) be the corresponding subspace; then we have:
$$\begin{aligned} d(\varvec{p},\tilde{\varvec{U}}) \le (1+2\epsilon )\sum _{i=k+1}^{n}\sigma _i \quad \text {for all } \varvec{p}\in \varvec{P}, \end{aligned}$$
where \(\sigma _i\) is the i-th singular value of \(\varvec{P}\).
Proof
Let \(\varvec{V}\in \mathbb {R}^{d\times k}\) contain the projection matrix obtained by singular value decomposition of \(\varvec{P}\) and \(\varvec{U}\) be the corresponding k-dimensional subspace. The distance between a data point and the subspace can be computed as:
$$\begin{aligned} d(\varvec{p},\varvec{U}) = \left\Vert \varvec{p}-\varvec{V}\varvec{V}^{\top }\varvec{p} \right\Vert _{2}. \end{aligned}$$
Combining with Theorem 1, we show that
$$\begin{aligned} d(\varvec{p},\tilde{\varvec{U}}) \le (1+2\epsilon )\sum _{i=k+1}^{n}\sigma _i, \end{aligned}$$
where \(\sigma _i\) is the i-th singular value of \(\varvec{P}\). For the case that k is close to the rank of \(\varvec{P}\), \(\sum _{i=k+1}^{n}\sigma _i\) can be very small.
With Theorem 2, we can easily show that the similarity among data points is preserved in the projected low-dimensional space (Lemma 4, whose proof we defer to “Appendix C”). We then obtain our main result, Theorem 3: the nearest neighbor is returned in the low-dimensional space.
Lemma 4
Suppose the nearest neighbor of \(\varvec{q}\) is \(\varvec{p}^*\) in d-dimensional feature space. In the k-dimensional subspace projected by \(\varvec{Z}\in \mathbb {R}^{d\times k}\) which is learned by Algorithm 1, the nearest neighbor of \(\varvec{q}\) is \(\varvec{p}^*\).
Theorem 3
Algorithm 2 returns data point \(\varvec{p}^*\) from database \(\varvec{P}\) as a \((1+\epsilon /4)\)-approximate nearest neighbor of query point \(\varvec{q}\).
Proof
Recall that noisy data \(\varvec{p}\in \varvec{P}\) is perturbed from clean data \(\varvec{x}\in \varvec{X}\) by noise \(\varvec{t}\) (\(\varvec{p}=\varvec{x}+\varvec{t}\)), and so is the query \(\varvec{q}\) (\(\varvec{q}=\varvec{y}+\varvec{t}\) with \(\varvec{y}\) the clean data). Let the nearest neighbor of \(\varvec{y}\) be \(\varvec{x}^*\in \varvec{X}\), which corresponds to \(\varvec{p}^*\) in \(\varvec{P}\). We show that \(\varvec{p}^*\) is the returned nearest neighbor of \(\varvec{q}\). Fix \(\varvec{x}\ne \varvec{x}^*\) and use the triangle inequality to write
The third inequality is derived from Theorem 2. Following the proof of Theorem 2, \(\sum _{i=k+1}^{n}\sigma _i\) can be made arbitrarily small and \(\epsilon \in (0,1)\). Here we let \((1+2\epsilon ) \sum _{i=k+1}^{n}\sigma _i \le 2\alpha \) to get the last inequality. Similarly, for \(\varvec{x}^*\) we have
Using the triangle inequality, we get
and
With \(\alpha \) set to \(16/\epsilon \), we can bound \(\left\Vert \varvec{q}- \varvec{p}^*_{\tilde{\varvec{U}}} \right\Vert _{2}\), which implies
By using Pythagoras’ Theorem (recall both \(\varvec{p}_{\tilde{\varvec{U}}},\varvec{p}^*_{\tilde{\varvec{U}}}\in \tilde{\varvec{U}}\)),
Hence, \(\varvec{p}^*\) is reported as the nearest neighbor of \(\varvec{q}\) in the low-dimensional subspace. \(\square \)
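Concretely, the retrieval stage analyzed above can be sketched as follows. This is an illustrative reading rather than the authors' implementation: we assume the projection matrix \(\varvec{Z}\) has already been learned (here a random orthonormal stand-in), and that the k-length hash code is taken as the signs of the projected coordinates, a common binarization choice.

```python
import numpy as np

def build_codes(P, Z):
    """Hash every row of the database P (n x d) through the projection Z (d x k)."""
    return np.sign(P @ Z).astype(np.int8)  # n x k matrix of +/-1 codes

def query_nn(q, Z, codes):
    """Return the index of the database point whose code is closest in Hamming distance."""
    q_code = np.sign(q @ Z).astype(np.int8)
    hamming = np.sum(codes != q_code, axis=1)  # mismatched bits per database point
    return int(np.argmin(hamming)), hamming

rng = np.random.default_rng(0)
P = rng.standard_normal((100, 32))
Z, _ = np.linalg.qr(rng.standard_normal((32, 16)))  # stand-in orthonormal projection
codes = build_codes(P, Z)
idx, hamming = query_nn(P[7] + 1e-3, Z, codes)      # query: a slightly perturbed database point
```

Ranking by Hamming distance as above is linear in n; a hash table lookup instead probes the buckets within radius 2 directly.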
5 Experiments
In this section, we perform experiments on benchmark data sets to demonstrate the effectiveness of our algorithm. First, we describe our experimental settings.
5.1 Experimental setting
5.1.1 Baseline algorithms
We illustrate the effectiveness of our algorithm by comparing it with the celebrated data-independent hashing algorithm of locality-sensitive hashing (LSH) (Andoni and Indyk, 2008), and state-of-the-art data-dependent algorithms, including anchor graph hashing (AGH) (Liu et al., 2011), circulant binary embedding (CBE) (Yu et al., 2014), iterative quantization (ITQ) (Gong and Lazebnik, 2011), isotropic hashing (ISO) (Kong and Li, 2012), multidimensional spectral hashing (MDSH) (Weiss et al., 2012), supervised discrete hashing (SDH) (Shen et al., 2015), scalable graph hashing (SGH) (Jiang and Li, 2015), spectral hashing (SH) (Weiss et al., 2009), sparse projection (SP) (Xia et al., 2015) and bilinear projection based binary codes (BPBC) (Gong et al., 2013). The parameters are set as suggested in the original works. We refer to our algorithm as Inexact Subspace Analysis for approximate Nearest Neighbor Search (ISANNS).
5.1.2 Data sets
We consider data sets from both computer vision and natural language processing. In particular, for the computer vision application, we apply all the compared algorithms to the handwritten digit recognition data set MNISTFootnote 1, which consists of 70,000 digit images. We randomly sample 69,000 images for training and keep the remaining 1,000 images for testing, where each image is represented as a 784-dimensional vector (i.e. the raw pixels).
For the natural language processing task, we use four data sets from the GLUE (General Language Understanding Evaluation) benchmark (Wang et al., 2019) and GloVe (Pennington et al., 2014), a word representation data set for Wikipedia entries. The GLUE data sets include the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013), the Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019), the Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005) and the Stanford Question Answering Natural Language Inference Corpus (QNLI) (Rajpurkar et al., 2016). More specifically, SST-2 consists of movie reviews whose sentiment is either positive or negative. CoLA consists of English sentences from books and journal articles, each labeled as grammatically acceptable or not. MRPC is formed by sentence pairs from online news sources, where the label indicates whether the two sentences are semantically equivalent. QNLI consists of pairs of a question and a context sentence, where the label indicates whether the context sentence contains the answer to the question. We compute the representations for sentences and paragraphs with sentence transformers (Reimers and Gurevych, 2019) based on the pretrained STS (Semantic Textual Similarity) model “stsb-roberta-base”Footnote 2. Each example is represented by a 768-dimensional dense vector. The statistics of the data sets are shown in Table 2.
5.1.3 Evaluation metrics
The performance of the methods is evaluated by two common metrics: Hamming distance ranking and hash table lookup. We retrieve the items within Hamming distance 2 and report the corresponding precision, recall and mean average precision (MAP). We also return the top 500 retrieved items and report the mean average precision as well as the time cost.
To compute the precision and recall, let k denote the number of retrieved items within Hamming radius 2, and n denote the total number of relevant items in the database; then
\(\text {precision}=\frac{\#\,\text {relevant retrieved items}}{k}, \qquad \text {recall}=\frac{\#\,\text {relevant retrieved items}}{n}.\)
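A minimal routine for evaluating one query under these definitions might look as follows; the names `hamming` (distances from the query to all database codes) and `labels` are ours, not the paper's.

```python
import numpy as np

def precision_recall(hamming, labels, query_label, radius=2):
    """Precision/recall of the items retrieved within the given Hamming radius."""
    retrieved = hamming <= radius
    relevant = labels == query_label
    k = int(retrieved.sum())             # number of retrieved items
    n = int(relevant.sum())              # total number of relevant items in the database
    hits = int((retrieved & relevant).sum())
    precision = hits / k if k > 0 else 0.0
    recall = hits / n if n > 0 else 0.0
    return precision, recall

# toy example: 5 database points, 3 retrieved (distance <= 2), 3 relevant, 2 overlap
p, r = precision_recall(np.array([0, 1, 3, 2, 5]),
                        np.array([1, 0, 1, 1, 0]), query_label=1)
```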
We report performance for various hash code lengths.
5.2 Empirical results
Figure 1 shows the precision and recall on the MNIST data set as we tune the hash code length. In terms of precision (left panel), our algorithm always outperforms the baselines, especially when the data are encoded with more bits. Perhaps more surprisingly, increasing the code length degrades the performance of the baseline algorithms while improving ours. This demonstrates the effectiveness of our algorithm in the low-dimensional subspace.
Our algorithm stands out against the baseline algorithms in terms of both precision and recall in almost all cases. Table 3 lists the hash table lookup results for 10-bit, 16-bit and 20-bit hash codes on the MNIST data set. We observe that our algorithm dramatically outperforms the compared algorithms. Specifically, with 16-bit hash codes on MNIST, the MAP of our algorithm reaches 0.8843 while the others stay below 0.5 within Hamming radius 2. In terms of MAP over the top 500 retrieved data points, our algorithm again achieves significant superiority over the baseline approaches. Our algorithm also enjoys the best time efficiency. As the hash code length increases, learning the hash codes takes more time, but the additional bits also improve the performance of the models. The experimental results in Figure 1 and Table 3 show the advantage of our algorithm in all cases.
Figure 2 shows the precision and recall of the compared algorithms on the GLUE benchmark as the hash code length increases. Tables 4 and 5 show the MAP within Hamming radius 2 and the MAP of the top 500 retrieved samples for 10-bit and 16-bit hash codes. Our algorithm achieves the best performance on all listed GLUE data sets in almost all cases.
Table 3 also lists the time cost of learning the hash projection matrix for the different methods on the MNIST data set, referred to as “training time”. We report the training time on the GLUE benchmark in Table 6 for hash code lengths of 10 and 16 bits. We observe that our algorithm is efficient because the low-rank projection is computed on the sampled matrix instead of the global data matrix. As for query time, the nearest neighbors in the experiments are computed based on Hamming distance with radius 2, so the dominant cost is the computation of the Hamming distance matrix between the training and testing data points. Hence, the query time of the various methods is the same on each data set: 0.71 s for MNIST, 0.73 s for SST-2, 0.11 s for CoLA, 0.13 s for MRPC, and 10 s for QNLI.
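For \(\pm 1\) codes of length k, this Hamming distance matrix can be formed with a single matrix product, since \(\langle \varvec{u},\varvec{v}\rangle =k-2d_H(\varvec{u},\varvec{v})\); the following sketch is ours, not the paper's code:

```python
import numpy as np

def hamming_matrix(db_codes, query_codes):
    """All-pairs Hamming distances between two +/-1 code matrices (n x k and m x k)."""
    k = db_codes.shape[1]
    inner = db_codes.astype(np.int32) @ query_codes.astype(np.int32).T
    return (k - inner) // 2  # n x m matrix of Hamming distances

db = np.array([[1, -1, 1, 1], [-1, -1, 1, -1]], dtype=np.int8)
qs = np.array([[1, -1, -1, 1]], dtype=np.int8)
H = hamming_matrix(db, qs)  # distances from each database code to the query code
```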
Table 7 presents the recall and training time of the compared algorithms on the GloVe data set. AGH, SDH and SH ran into memory issues on GloVe, so their results are not included. The experimental results show the advantage of our algorithm in terms of both recall and training cost. Though SP achieves comparable recall, our algorithm enjoys higher training efficiency.
In a nutshell, the experimental results on the computer vision and natural language understanding tasks show the practical values of our algorithm.
6 Conclusion
For the approximate nearest neighbor search problem, high-dimensional and large-scale data raise various challenges. In this paper, we have proposed a spectral method for nearest neighbor search that is based on inexact subspace estimation. Given the data set \(\varvec{P}\in \mathbb {R}^{n\times d}\) and a query \(\varvec{q}\), we reduce the feature space of the data from d to k with \(k< \log n\). Comparing the time complexity of our method with that of spectral analysis based on principal component analysis, the computational cost of ours is proportional to \(ns^2\) while that of PCA scales with \(nd^2\). We have further provided a theoretical analysis showing that the \((1+\epsilon /4)\)-approximate neighbors retrieved in the low-dimensional space are data points close to the query in the original space. The experimental results have shown the significant improvement of our algorithm over state-of-the-art approaches.
References
Abdullah, A., Andoni, A., Kannan, R. & Krauthgamer, R. (2014). Spectral approaches to nearest neighbor search. In Annual symposium on foundations of computer science (pp. 581–590).
Alaoui, A., & Mahoney, M. W. (2015). Fast randomized kernel ridge regression with statistical guarantees. In Advances in neural information processing systems (pp. 775–783).
Andoni, A., & Indyk, P. (2008). Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, 51(1), 117–122.
Andoni, A., Indyk, P., Nguyen, H.L. & Razenshteyn, I. (2014). Beyond locality-sensitive hashing. In Proceedings of the 2014 ACM-SIAM symposium on discrete algorithms, SIAM (pp. 1018–1028).
Andoni, A., Naor, A., Nikolov, A., Razenshteyn, I., & Waingarten, E. (2018). Data-dependent hashing via nonlinear spectral gaps. In Annual ACM SIGACT symposium on theory of computing. ACM (pp. 787–800).
Andoni, A., Nikolov, A., Razenshteyn, I., & Waingarten, E. (2021). Approximate nearest neighbors beyond space partitions. In Proceedings of the 2021 ACM-SIAM symposium on discrete algorithms, SIAM (pp. 1171–1190).
Arya, S., & Mount, D.M. (1993). Approximate nearest neighbor queries in fixed dimensions. In Proceedings of the fourth annual ACM/SIGACT-SIAM symposium on discrete algorithms (pp. 271–280).
Arya, S., Mount, D. M., & Narayan, O. (1995). Accounting for boundary effects in nearest neighbor searching. In Proceedings of the eleventh annual symposium on computational geometry, Vancouver (pp. 336–344).
Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9), 509–517.
Çakir, F., He, K., & Sclaroff, S. (2018). Hashing with binary matrix pursuit. In European conference on computer vision (pp. 344–361).
Cohen, M.B., Musco, C., & Pachocki, J.W. (2016). Online row sampling. In Approximation, randomization, and combinatorial optimization. algorithms and techniques, APPROX/RANDOM (pp 7:1–7:18).
Dobkin, D. P., & Lipton, R. J. (1976). Multidimensional searching problems. SIAM Journal on Computing 5(2), 181–186.
Dolan, W.B., & Brockett, C. (2005). Automatically constructing a corpus of sentential paraphrases. In Proceedings of the third international workshop on paraphrasing.
Erin Liong, V., Lu, J., Wang, G., Moulin, P., & Zhou, J. (2015). Deep hashing for compact binary codes learning. In IEEE conference on computer vision and pattern recognition (pp. 2475–2483).
Gersho, A., & Gray, R. M. (2012). Vector quantization and signal compression, vol 159. Springer.
Gionis, A., Indyk, P., & Motwani, R. (1999). Similarity search in high dimensions via hashing. International Conference on Very Large Data Bases, 99, 518–529.
Gittens, A. (2011). The spectral norm error of the naive Nyström extension. arXiv preprint arXiv:1110.5305.
Gittens, A., & Mahoney, M. W. (2016). Revisiting the nyström method for improved large-scale machine learning. The Journal of Machine Learning Research, 17(1), 3977–4041.
Gong, Y., & Lazebnik, S. (2011). Iterative quantization: A procrustean approach to learning binary codes. In IEEE conference on computer vision and pattern recognition (pp. 817–824).
Gong, Y., Kumar, S., Rowley, H. A., & Lazebnik, S. (2013). Learning binary codes for high-dimensional data using bilinear projections. In IEEE conference on computer vision and pattern recognition (pp. 484–491).
Han, X., Leung, T., Jia, Y., Sukthankar, R., & Berg, A. C. (2015). Matchnet: Unifying feature and metric learning for patch-based matching. In The 28th IEEE conference on computer vision and pattern recognition (pp. 3279–3286).
Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: Towards removing the curse of dimensionality. In Annual ACM symposium on theory of computing, ACM (pp. 604–613).
Jegou, H., Douze, M., & Schmid, C. (2011). Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 117–128.
Jiang, Q., & Li, W. (2015). Scalable graph hashing with feature transformation. In International joint conference on artificial intelligence (pp. 2248–2254).
Knuth, D. E. (1973). Sorting and searching. The Art of Computer Programming, Vol. 3. Addison-Wesley.
Kong, W., & Li, W. J. (2012). Isotropic hashing. In Annual conference on neural information processing systems (pp. 1646–1654).
Kulis, B., & Darrell, T. (2009). Learning to hash with binary reconstructive embeddings. In Annual conference on neural information processing systems (pp. 1042–1050).
Liu, H., Wang, R., Shan, S., & Chen, X. (2016). Deep supervised hashing for fast image retrieval. In IEEE conference on computer vision and pattern recognition (pp. 2064–2072).
Liu, W., Wang, J., Kumar, S., & Chang, S. F. (2011). Hashing with graphs. In International conference on machine learning (pp. 1–8).
Liu, W., Mu, C., Kumar, S., & Chang, S. (2014). Discrete graph hashing. In Annual conference on neural information processing systems (pp. 3419–3427).
Makhoul, J., Roucos, S., & Gish, H. (1985). Vector quantization in speech coding. Proceedings of the IEEE, 73(11), 1551–1588.
Mitzenmacher, M., & Upfal, E. (2017). Probability and computing: Randomization and probabilistic techniques in algorithms and data analysis. Cambridge University Press.
Musco, C., & Musco, C. (2015). Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In Annual conference on neural information processing systems (pp. 1396–1404).
Musco, C., & Musco, C. (2017). Recursive sampling for the Nyström method. In Advances in neural information processing systems (pp. 3836–3848).
Pennington J., Socher R., & Manning C. D. (2014). GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (pp. 1532–1543).
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. In J. Su, X. Carreras, K. Duh (Eds.) Proceedings of the 2016 conference on empirical methods in natural language processing (pp. 2383–2392).
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using siamese BERT-networks. In K. Inui, J. Jiang, V. Ng, X. Wan (Eds.) Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (pp. 3980–3990).
Rudi, A., Camoriano, R., & Rosasco, L. (2015). Less is more: Nyström computational regularization. In Annual conference on neural information processing systems (pp. 1648–1656).
Samet, H. (1990). The design and analysis of spatial data structures (Vol. 85). Addison-Wesley.
Samet, H. (2006). Foundations of multidimensional and metric data structures. Morgan Kaufmann.
Sellis, T. K., Roussopoulos, N., & Faloutsos, C. (1997). Multidimensional access methods: Trees have grown everywhere. In Proceedings of 23rd international conference on very large data bases (pp. 13–14).
Shen, F., Shen, C., Liu, W., & Shen, H. T. (2015). Supervised discrete hashing. In IEEE conference on computer vision and pattern recognition (pp. 37–45).
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A.Y., & Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631–1642).
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2019). GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 7th international conference on learning representations.
Warstadt, A., Singh, A., & Bowman, S. R. (2019). Neural network acceptability judgments. Trans Assoc Comput Linguistics, 7, 625–641.
Weiss, Y., Torralba, A., & Fergus, R. (2009). Spectral hashing. In Annual conference on neural information processing systems (pp. 1753–1760).
Weiss, Y., Fergus, R., & Torralba, A. (2012). Multidimensional spectral hashing. In European conference on computer vision (pp. 340–353).
Woodruff, D. P. (2014). Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2), 1–157.
Xia, Y., He, K., Kohli, P., & Sun, J. (2015). Sparse projections for high-dimensional binary codes. In IEEE conference on computer vision and pattern recognition (pp. 3332–3339).
Yu, F. X., Kumar, S., Gong, Y., & Chang, S. (2014). Circulant binary embedding. In Proceedings of the 31th international conference on machine learning (pp. 946–954).
Ethics declarations
Funding
Jie Shen is supported by NSF-IIS-1948133 and the startup funding of Stevens Institute of Technology.
Conflicts of interest/competing interests
The authors have no relevant financial or non-financial interests to disclose.
Availability of data and material
All the data sets used in the paper are publicly available, and have been cited in the main text.
Code availability
All the code will be released on the authors’ personal websites after acceptance.
Authors’ contributions
All authors contributed to the work. The paper was written together by Jie Shen and Jing Wang. Jing Wang designed and ran the experiments and Jie Shen commented on ways for improvements.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication.
Not applicable.
Editor: Gustavo Batista.
Appendices
Appendix A: Useful lemmas
Note that by Definition 1, the ridge leverage score equals the diagonal entry of \(\varvec{P}(\varvec{P}^{\top }\varvec{P}+\lambda \mathbf {I})^{-1}\varvec{P}^{\top }\), that is, \(l_i = \varvec{p}_i(\varvec{P}^{\top }\varvec{P}+\lambda \mathbf {I})^{-1}\varvec{p}_i^{\top }\), where \(\varvec{p}_i\) denotes the i-th row of \(\varvec{P}\).
The computation is expensive when \(\varvec{P}\) is a large data matrix. We thus define a sampling matrix \(\varvec{S}\) to select some rows from \(\varvec{P}\) and approximate \(\varvec{P}^{\top }\varvec{P}\). The approximate ridge leverage score is computed as \(\tilde{l}_i = \varvec{p}_i(\varvec{P}^{\top }\varvec{S}\varvec{S}^{\top }\varvec{P}+\lambda \mathbf {I})^{-1}\varvec{p}_i^{\top }\).
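As an illustration under our own naming (with uniform row subsampling standing in for the sampling of Algorithm 1), the exact score takes the diagonal of \(\varvec{P}(\varvec{P}^{\top }\varvec{P}+\lambda \mathbf {I})^{-1}\varvec{P}^{\top }\) and the approximate score replaces \(\varvec{P}^{\top }\varvec{P}\) by the Gram matrix of the sampled rows:

```python
import numpy as np

def ridge_leverage_scores(P, lam):
    """Exact ridge leverage scores: diag of P (P^T P + lam I)^{-1} P^T."""
    G = P.T @ P + lam * np.eye(P.shape[1])
    return np.einsum('ij,ij->i', P @ np.linalg.inv(G), P)  # row-wise quadratic forms

def approx_ridge_leverage_scores(P, sample_idx, lam):
    """Approximate scores: P^T P replaced by the Gram matrix of sampled rows."""
    Ps = P[sample_idx]
    G = Ps.T @ Ps + lam * np.eye(P.shape[1])
    return np.einsum('ij,ij->i', P @ np.linalg.inv(G), P)

rng = np.random.default_rng(0)
P = rng.standard_normal((100, 10))
exact = ridge_leverage_scores(P, 1.0)
approx = approx_ridge_leverage_scores(P, np.arange(0, 100, 2), 1.0)
```

The sum of the exact scores equals \({{\,\mathrm{tr}\,}}(\varvec{K}(\varvec{K}+\lambda \mathbf {I})^{-1})=\sum _i \sigma _i/(\sigma _i+\lambda )\), the effective dimension that controls the number of landmarks s; dropping rows can only increase each score, so the approximations are overestimates.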
Leverage score based sampling approaches have long been known to give strong theoretical guarantees for the Nyström approximation; the following are the well-studied spectral norm guarantees: Lemma 5 is from Gittens and Mahoney (2016) and Lemma 6 is from Musco and Musco (2017).
Lemma 5
Suppose \(\lambda >0\), \(\delta \in (0,1/8)\), and the sampling matrix \(\tilde{\varvec{S}}\in \mathbb {R}^{n\times s}\) is obtained by Algorithm 1 with ridge leverage score approximations \(\tilde{l}\) and data sampling probability \(\mu \). Then with probability \(1-\delta \), the kernel \(\varvec{K}=\varvec{P}\varvec{P}^{\top }\) and its approximation \(\tilde{\varvec{K}}= \varvec{K}\tilde{\varvec{S}}(\tilde{\varvec{S}}^{\top }\varvec{K}\tilde{\varvec{S}})^+\tilde{\varvec{S}}^{\top }\varvec{K}\) satisfy \(\tilde{\varvec{K}}\preceq \varvec{K}\preceq \tilde{\varvec{K}}+\lambda \mathbf {I}\),
with \(\sum _i \eta _i=O(\sum _i\tilde{l}_i \log (\sum _i\tilde{l}_i/\delta ))\) and the number of sampled data points \(s\le 2\sum _i \eta _i\).
Lemma 6
Suppose \(\lambda >0\), \(\delta \in (0,1/8)\), and the sampling matrix \(\varvec{S}\in \mathbb {R}^{n\times s}\) is obtained by sampling the standard basis vectors independently with probability \(\eta _i\) and rescaling by \(1/\sqrt{\eta _i}\). Then with probability \(1-\delta \), \(1/2\cdot \sum _i \eta _i\le s\le 2\sum _i \eta _i\) and:
Corollary 1
With probability \(1-\delta \), Algorithm 1 run with ridge parameter \(\epsilon \lambda \) returns \(\varvec{S}\in \mathbb {R}^{n\times s}\) such that \(s=O(\frac{\mu }{\epsilon }\log \frac{\mu }{\delta \epsilon })\) with \(\mu = {{\,\mathrm{tr}\,}}(\varvec{K}(\varvec{K}+\epsilon \lambda \mathbf {I})^{-1})\) and \(\varvec{K}=\varvec{P}\varvec{P}^{\top }\), and \(\tilde{\varvec{K}}= \varvec{K}\varvec{S}(\varvec{S}^{\top }\varvec{K}\varvec{S})^+\varvec{S}^{\top }\varvec{K}\) satisfies \(\tilde{\varvec{K}}\preceq \varvec{K}\preceq \tilde{\varvec{K}}+\epsilon \lambda \mathbf {I}\).
Proof
This follows from Lemma 5 by noting \({{\,\mathrm{tr}\,}}(\varvec{K}(\varvec{K}+\epsilon \lambda \mathbf {I})^{-1}) \le \frac{1}{\epsilon }{{\,\mathrm{tr}\,}}(\varvec{K}(\varvec{K}+\lambda \mathbf {I})^{-1})\) since \((\varvec{K}+ \epsilon \lambda \mathbf {I})^{-1} \preceq \frac{1}{\epsilon } (\varvec{K}+\lambda \mathbf {I})^{-1}\). \(\square \)
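The trace inequality in this proof, \({{\,\mathrm{tr}\,}}(\varvec{K}(\varvec{K}+\epsilon \lambda \mathbf {I})^{-1}) \le \frac{1}{\epsilon }{{\,\mathrm{tr}\,}}(\varvec{K}(\varvec{K}+\lambda \mathbf {I})^{-1})\), is easy to check numerically; the snippet below is our sanity check, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.standard_normal((50, 8))
K = P @ P.T                      # low-rank PSD kernel
lam, eps = 2.0, 0.3
I = np.eye(K.shape[0])

lhs = np.trace(K @ np.linalg.inv(K + eps * lam * I))   # tr(K (K + eps*lam I)^{-1})
rhs = np.trace(K @ np.linalg.inv(K + lam * I)) / eps   # (1/eps) tr(K (K + lam I)^{-1})
```

The inequality holds eigenvalue by eigenvalue: \(\frac{\sigma }{\sigma +\epsilon \lambda }\le \frac{1}{\epsilon }\cdot \frac{\sigma }{\sigma +\lambda }\) for every \(\sigma \ge 0\) and \(\epsilon \in (0,1]\).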
Lemma 7
For any \(\epsilon \in (0,1)\) and \(\delta \in (0, 1/8)\), Algorithm 1 run with ridge parameter \(\lambda = \frac{\epsilon }{k}\sum _{i=k+1}^{n}\sigma _i(\varvec{K})\) returns a matrix \(\varvec{S}\in \mathbb {R}^{n\times s}\), where \(\varvec{K}=\varvec{P}\varvec{P}^{\top }\) and \(\sigma _i(\varvec{K})\) is the i-th singular value of \(\varvec{K}\). Then with probability \(1-\delta \), \(1/2\sum _i\mu _i \le s \le 2\sum _i \mu _i\), and \(\tilde{\varvec{K}}= \varvec{K}\varvec{S}(\varvec{S}^{\top }\varvec{K}\varvec{S})^+\varvec{S}^{\top }\varvec{K}\) satisfies, for any rank-k orthogonal projection \(\varvec{W}\) and a positive constant c independent of \(\varvec{W}\):
When ridge leverage scores are computed exactly, \(\sum _i \mu _i =O(\frac{k}{\epsilon }\log \frac{k}{\delta \epsilon })\).
Proof
Set \(c={{\,\mathrm{tr}\,}}(\varvec{K})-{{\,\mathrm{tr}\,}}(\tilde{\varvec{K}})\), which is \(\ge 0\) since \(\tilde{\varvec{K}}\preceq \varvec{K}\) by Lemma 5. By linearity of trace:
So to obtain (14) it suffices to show:
Since \(\varvec{W}\) is a rank-k orthogonal projection, we can write \(\varvec{W}=\varvec{V}\varvec{V}^{\top }\), where \(\varvec{V}\in \mathbb {R}^{n\times k}\) has orthonormal columns. Applying the cyclic property of the trace and the spectral bound of Lemma 5:
This gives us the upper bound of (16). For the lower bound we apply Corollary 1:
Finally, since \({{\,\mathrm{tr}\,}}(\varvec{K}) = \sum _{i=1}^{n}\sigma _i(\varvec{K})\) and, by the Eckart–Young theorem, \({{\,\mathrm{tr}\,}}(\varvec{W}\varvec{K}\varvec{W})\le \sum _{i=1}^{k}\sigma _i(\varvec{K})\), we get \(k\epsilon \lambda = \epsilon \sum _{i=k+1}^n \sigma _i(\varvec{K})\le \epsilon {{\,\mathrm{tr}\,}}(\varvec{K}-\varvec{W}\varvec{K}\varvec{W})\). Plugging into (18) gives (16). The proof is complete. \(\square \)
Lemma 8
Suppose Algorithm 1, run with ridge parameter \(\lambda =\frac{\epsilon }{k}\sum _{i=k+1}^{n}\sigma _i(\varvec{K})\), returns \(\varvec{S}\), and let \(\varvec{K}= \varvec{P}\varvec{P}^{\top }\) and \(\tilde{\varvec{K}}=\varvec{K}\varvec{S}(\varvec{S}^{\top }\varvec{K}\varvec{S})^+\varvec{S}^{\top }\varvec{K}\). Suppose \(\varvec{W}_*\) is the optimal solution of \({{\,\mathrm{arg\,min}\,}}{{\,\mathrm{tr}\,}}(\tilde{\varvec{K}}-\varvec{W}\varvec{W}^{\top }\tilde{\varvec{K}}\varvec{W}\varvec{W}^{\top })\) for \(\tilde{\varvec{K}}\), and let \(\varvec{W}\) be an approximately optimal solution satisfying:
Then, if \(\varvec{W}_*\) is the optimal cluster indicator matrix for \(\varvec{K}\):
Proof
\(\square \)
Lemma 9
If the ridge leverage scores in Algorithm 1 are computed exactly, their sum is bounded as \(\sum _i l_i \le \frac{2k}{\epsilon }\).
Proof
\(\square \)
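As a numerical sanity check of Lemma 9 (ours, not the paper's): with \(\lambda = \frac{\epsilon }{k}\sum _{i=k+1}^{n}\sigma _i(\varvec{K})\), the sum of exact scores is \(\sum _i \frac{\sigma _i}{\sigma _i+\lambda }\le k+\frac{1}{\lambda }\sum _{i>k}\sigma _i=k+\frac{k}{\epsilon }\le \frac{2k}{\epsilon }\):

```python
import numpy as np

rng = np.random.default_rng(2)
P = rng.standard_normal((200, 40))
sigma = np.linalg.svd(P, compute_uv=False) ** 2    # nonzero eigenvalues of K = P P^T
k, eps = 5, 0.25
lam = (eps / k) * sigma[k:].sum()                  # ridge parameter of Lemma 9
score_sum = float(np.sum(sigma / (sigma + lam)))   # sum of exact ridge leverage scores
```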
Appendix B: Proof for Theorem 1
Proof
We seek \(\varvec{Z}\) with orthonormal columns such that the value of the following objective function:
is as small as possible. The above function equals:
Noting that \(\varvec{Z}\) lies in the row span of \(\varvec{P}\), we represent it by introducing a matrix \(\varvec{Y}\) such that \(\varvec{Z}= \varvec{P}^{\top }\varvec{Y}\). Let \(\varvec{K}\) denote the linear kernel of the data set, that is, \(\varvec{K}=\varvec{P}\varvec{P}^{\top }\); then minimizing (24) is equivalent to minimizing
This uses the fact that \(\varvec{Z}\) has orthonormal columns, as \((\varvec{P}^{\top }\varvec{Y})^{\top }\varvec{P}^{\top }\varvec{Y}=\varvec{Y}^{\top }\varvec{K}\varvec{Y}=\mathbf {I}\). We then design a matrix \(\varvec{W}\in \mathbb {R}^{n\times k}\) with orthonormal columns, and introduce
From the cyclic and linearity property of the trace, we can rewrite (25) as:
Let \( \tilde{\varvec{W}}={{\,\mathrm{arg\,min}\,}}_{\varvec{W}} {{\,\mathrm{tr}\,}}(\tilde{\varvec{K}})-{{\,\mathrm{tr}\,}}(\varvec{W}\varvec{W}^{\top }\tilde{\varvec{K}}\varvec{W}\varvec{W}^{\top })\), and \(\varvec{W}_*={{\,\mathrm{arg\,min}\,}}_{\varvec{W}} {{\,\mathrm{tr}\,}}(\varvec{K})-{{\,\mathrm{tr}\,}}(\varvec{W}\varvec{W}^{\top }\varvec{K}\varvec{W}\varvec{W}^{\top })\), where \(\tilde{\varvec{K}}=\varvec{K}\varvec{S}(\varvec{S}^{\top }\varvec{K}\varvec{S})^\dagger \varvec{S}^{\top }\varvec{K}\). Following the argument in Lemma 8, we have
as \(\varvec{W}\) is an \(n\times k\) matrix with orthonormal columns. Our problem is thus reduced to a low-rank approximation problem that resembles the k-means problem, with \(\varvec{W}\) playing the role of the cluster indicator matrix. Hence \(\tilde{\varvec{W}}\) can be taken to be the top k eigenvectors of \(\tilde{\varvec{K}}\), which can be computed by singular value decomposition of \(\tilde{\varvec{K}}\) at a time cost of \(O(ns^2)\).
According to (26), we can get \(\varvec{Z}=\varvec{P}^{\top }\varvec{K}^{-1/2}\tilde{\varvec{W}}\). However, \(\varvec{Z}\) cannot be represented efficiently, and projecting new vectors onto \(\varvec{Z}\) requires n kernel evaluations to multiply by \(\varvec{P}^{\top }\). Recall that \(\varvec{K}=\varvec{P}\varvec{P}^{\top }\) and \(\varvec{S}\) selects s data points \(\varvec{S}^{\top }\varvec{P}\); we approximate \(\varvec{P}\) using its projection onto these points. Informally, let \(\Phi \in \mathbb {R}^{d\times d}\) be the orthogonal projection onto the row span of \(\varvec{S}^{\top }\varvec{P}\). We approximate \(\varvec{P}\) by \(\tilde{\varvec{P}}{\mathop {=}\limits ^{\text {def}}}\varvec{P}\Phi \). We can write \(\Phi =\varvec{P}^{\top }\varvec{S}(\varvec{S}^{\top }\varvec{P}\varvec{P}^{\top }\varvec{S})^{+}\varvec{S}^{\top }\varvec{P}\). Since it is an orthogonal projection, \(\Phi \Phi ^{\top }=\Phi ^2=\Phi \), and we can write:
\(\tilde{\varvec{Z}}\) has orthonormal columns since \(\tilde{\varvec{Z}}^{\top }\tilde{\varvec{Z}}=\tilde{\varvec{W}}^{\top }\tilde{\varvec{K}}^{-1/2}\varvec{P}\Phi \varvec{P}^{\top }\tilde{\varvec{K}}^{-1/2}\tilde{\varvec{W}}=\mathbf {I}\). We argue that the approximate principal components offer a solution to (24) nearly as good as \(\varvec{Z}=\varvec{P}^{\top }\varvec{K}^{-1/2}\tilde{\varvec{W}}\). To this end, we substitute \(\varvec{Z}\) with \(\tilde{\varvec{Z}}\) in (24) and get:
We compare the objective function values of (23) obtained from \(\tilde{\varvec{Z}}=\Phi \varvec{P}^{\top }\tilde{\varvec{K}}^{-1/2}\tilde{\varvec{W}}\) and \(\varvec{Z}=\varvec{P}^{\top }\varvec{K}^{-1/2}\tilde{\varvec{W}}\):
The last step follows from Lemma 2 that \(\varvec{K}-\tilde{\varvec{K}}\preceq \epsilon \lambda \mathbf {I}\). We rewrite (30) as:
This gives the result. Noting that \(\Phi \varvec{P}^{\top }\tilde{\varvec{K}}^{-1/2}\tilde{\varvec{W}}= \varvec{P}^{\top }\varvec{S}(\varvec{S}^{\top }\varvec{K}^{\top }\varvec{S})^+ \varvec{S}^{\top }\tilde{\varvec{K}}^{1/2}\tilde{\varvec{W}}\), we set:
The solution of (24) can be represented as \(\tilde{\varvec{Z}}= \varvec{P}^{\top }\varvec{S}\varvec{X}\). \(\square \)
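The construction in this proof can be traced end to end in a few lines. In the sketch below (ours), uniform landmark sampling stands in for the ridge leverage score sampling of Algorithm 1; the final check confirms that \(\tilde{\varvec{Z}}=\varvec{P}^{\top }\varvec{S}\varvec{X}\) indeed has orthonormal columns.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, s, k = 200, 30, 50, 10
P = rng.standard_normal((n, d))
idx = rng.choice(n, size=s, replace=False)     # landmarks (uniform here; Algorithm 1
                                               # samples by approximate ridge leverage scores)
K = P @ P.T
M = K[np.ix_(idx, idx)]                        # S^T K S
B = K[idx, :]                                  # S^T K
K_tilde = B.T @ np.linalg.pinv(M) @ B          # Nystrom approximation of K

# top-k eigenpairs of K_tilde; since the columns of W_tilde are eigenvectors,
# K_tilde^{-1/2} W_tilde = W_tilde diag(lam^{-1/2})
evals, evecs = np.linalg.eigh(K_tilde)
top = np.argsort(evals)[::-1][:k]
lam, W = evals[top], evecs[:, top]

X = np.linalg.pinv(M) @ B @ W / np.sqrt(lam)   # X = (S^T K S)^+ S^T K K_tilde^{-1/2} W_tilde
Z = P[idx].T @ X                               # Z_tilde = P^T S X, a d x k projection
```

Only the s landmark rows `P[idx]` are touched when projecting new points, which is what makes the representation \(\tilde{\varvec{Z}}= \varvec{P}^{\top }\varvec{S}\varvec{X}\) efficient.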
Appendix C: Proof for Lemma 4
Proof
Let \(\varvec{p}_{\tilde{\varvec{U}}}\), \(\varvec{p}^*_{\tilde{\varvec{U}}}\) and \(\varvec{q}_{\tilde{\varvec{U}}}\) denote the data points \(\varvec{p}\), \(\varvec{p}^*\) and \(\varvec{q}\) projected onto the k-dimensional subspace spanned by \(\varvec{Z}\). For the nearest neighbor of the query, the projected distance is bounded as
Then, for any other data point in the data set \(\varvec{P}\), we get
According to Theorem 2 that the distance between data and the projected space is bounded, we get
When k is close to the rank of \(\varvec{P}\), \(\sum _{i=k+1}^{n}\sigma _i\) can be made arbitrarily small. Hence, in the low-dimensional subspace, the nearest neighbor of \(\varvec{q}\) is \(\varvec{p}^*\). \(\square \)
Wang, J., Shen, J. Fast spectral analysis for approximate nearest neighbor search. Mach Learn 111, 2297–2322 (2022). https://doi.org/10.1007/s10994-021-06124-1