1 Introduction

With the explosive growth of data such as images, documents, and videos, nearest neighbor (NN) search (a.k.a. similarity search) has attracted extensive attention in various domains, including machine learning, information retrieval, natural language processing, and outlier detection [1,2,3]. Given a query, the objective of NN search is to identify its nearest neighbors in a data collection. A variety of NN algorithms have been developed. However, finding exact nearest neighbors in large-scale data is computationally prohibitive [4].

In recent years, approximate NN (ANN) search has become popular because it finds approximate nearest neighbors within sub-linear or even constant query time [5]. Existing ANN search algorithms can be roughly divided into two categories, i.e., tree-based and hashing-based algorithms. Tree-based algorithms, such as KD-trees [6], M-trees [7], and ball-trees [8], exploit tree structures to index the data and then find approximate nearest neighbors in sub-linear time [9]. However, their performance degrades dramatically as the data dimension grows, and they usually require large memory to store the tree structures.

Hashing-based search methods have flourished recently because of their high query efficiency and low storage cost [10]. They encode high-dimensional data as binary codes via a series of hash functions. With these binary-code representations, retrieving the nearest neighbors of a query reduces to matching the corresponding binary codes with bit operations. Thanks to these bit operations, ANN retrieval can be carried out entirely in main memory and is extremely fast, usually within O(1) time [11]. This makes hashing-based search algorithms particularly attractive for large-scale scenarios.
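As a quick illustration of such bit operations, the Hamming distance between two packed binary codes can be computed with a single XOR followed by a population count (a generic sketch we add here, not part of the proposed method; the code values are hypothetical):

```python
# Minimal illustration: Hamming distance between two packed binary codes
# via XOR followed by a population count.
code_a = 0b1011_0011_1000_1111_0101_0101_0101_0101  # 32-bit example codes
code_b = 0b1011_0010_1000_1111_0101_0101_1101_0101

hamming = bin(code_a ^ code_b).count("1")  # number of differing bit positions
print(hamming)                              # -> 2
```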

Locality sensitive hashing (LSH), one of the most famous hashing algorithms, maps similar or proximate data samples into the same “bucket” with high probability [11]. Specifically, a series of randomly generated hash functions represents similar data samples as binary codes with small Hamming distances, so that the neighborhood relations among the original data samples are preserved. Since the hash functions of LSH are generated randomly, LSH is highly efficient. Owing to this, various extensions of LSH, including kernelized LSH (KLSH) [12], shift-invariant KLSH (SKLSH) [13] and I-LSH [14], have been proposed. Notably, LSH can achieve desirable performance if the number of hash functions is large enough. This, however, leads to high computation cost and storage overhead, limiting its applicability to large-scale scenarios. For instance, the binary codes of one billion data samples with 1024 hash bits require around 1TB of storage, whereas codes with 64 hash bits need only 64GB.

On the other hand, data-dependent hashing techniques exploit the inherent properties of data to learn elaborate hash functions, so that the generated binary codes perform well. Typical examples include principal component analysis hashing (PCAH) [15], iterative quantization (ITQ) [16], and spectral hashing (SH) [17]. Anchor graph hashing (AGH) [18], sparse embedding and least variance encoding (SELVE) [19], and locally linear hashing (LLH) [20] utilize the manifold structure of data to derive binary codes. Although data-dependent hashing algorithms can generate powerful hash functions, their time complexity is relatively high and their scalability is not as good as one would expect.

In this paper, we propose a novel bit selection framework to pick important bits out of the hash bits generated by hashing techniques. Because the hash functions are generated randomly, a large amount of redundant information exists within the binary codes assembled from the generated hash bits. It is necessary to remove this redundant information so that the assembled binary codes become compact. Taking LSH as an example, we exploit eleven evaluation criteria to measure the importance and similarity of each hash bit generated by LSH, so that bits with high importance and low similarity are selected to assemble new binary codes. The evaluation metrics not only consider each bit's individual importance, e.g., Laplacian score (LS) [21] and information entropy (Ent) [22], but also the correlations between pairs of hash bits, e.g., Pearson correlation coefficient (PCC) [23] and mutual information (MI) [24]. Experimental results on public benchmark data sets show that the proposed bit selection framework works effectively and can substantially reduce the number of hash bits without significantly degrading the performance of LSH.

The main contributions of this article are briefly highlighted as follows.

  • A novel hash bit selection framework for hashing techniques is proposed. It aims to derive compact binary codes consisting of fewer hash bits while preserving the information of the original codes as faithfully as possible.

  • Under the bit selection framework, we derive eleven bit selection methods that use different evaluation criteria to pick important and representative hash bits from the candidate bits generated by LSH, without degrading the performance significantly.

  • Extensive experiments were carried out on public benchmark data sets. The experimental results show that the framework works effectively: the performance of LSH was not degraded significantly even after 20%-60% of the redundant hash bits were removed.

The rest of this paper is organized as follows. Section 2 briefly reviews the state of the art of hash learning methods. Section 3 introduces the bit selection framework for hashing techniques to derive compact binary codes. In Section 4, we propose eleven hash bit selection methods for LSH to show how the framework works. The experimental results and discussions on public benchmark data sets are reported in Section 5, and the conclusions are presented in Section 6.

2 Related work

Due to their low storage cost and high computational efficiency, hashing techniques have been widely applied in a large variety of scenarios. As mentioned above, hash learning algorithms can be roughly divided into two main groups: data-independent and data-dependent hashing algorithms [14]. This section presents a brief overview of both groups. More details about hashing techniques can be found in comprehensive surveys (e.g., [14]).

Data-independent hashing techniques have been extensively investigated to handle nearest neighbor search over large-scale data. A representative example is LSH, which maps similar data samples to proximate binary codes with high probability. Theoretically, the probability that two data samples receive the same binary code is proportional to their similarity, measured by Euclidean distance or semantic consistency [10, 11]. Since the hash functions of LSH are generated randomly, LSH is extremely efficient. With such favorable properties, a variety of LSH-like methods have been developed to learn informative and discriminative binary codes. For instance, kernelized LSH (KLSH) [12] adopts arbitrary kernel functions to capture the intrinsic relationships among data samples and can be plugged into many existing image search methods. Shift-invariant kernelized LSH (SKLSH) [13] maps random projections through a shift-invariant kernel to maintain the similarity structure of the original data samples. Unfortunately, most LSH-like methods require considerably long binary codes to ensure competitive performance, leading to long computation time and high storage cost.

In contrast to the data-independent methods, data-dependent hashing methods make full use of the latent properties of data to construct more effective hash functions. Generally speaking, they can be divided into two classes, namely principal component analysis (PCA) based hashing and manifold-based hashing. The former attempts to preserve the maximal variance of the hash bits during their generation, as in dimensionality reduction. Unfortunately, maximizing the variance of hash bits is intractable because of the discrete constraint in learning binary codes. To deal with this issue, PCA hashing (PCAH) [15] instead constructs a set of orthogonal hash functions that are uncorrelated with each other, while iterative quantization (ITQ) [16], the most representative PCA-like hashing method, searches for an orthogonal rotation matrix that maximizes the variance and minimizes the quantization error simultaneously.

The manifold-based hashing methods, in turn, aim to make the generated binary codes preserve the local similarity structures, i.e., the nearest-neighbor relationships, among data pairs, so that data samples with similar properties receive the same binary values [19, 20]. Many manifold-based methods exploit various optimization formulations to learn informative binary codes. For example, self-taught hashing (STH) [25] minimizes a weighted average Hamming distance to meet the similarity-preserving criterion. Anchor graph hashing (AGH) [18] maintains the approximate neighborhood structure of the original data by introducing an anchor graph. Sparse hashing (SH) [26] converts high-dimensional data samples into low-dimensional binary codes with a sparse coding technique, whereby the similarity structures are preserved with less storage.

Since the data-dependent hashing methods take the inherent information of the data fully into consideration, they can learn more compact codes and achieve better performance than the LSH-like hashing methods. Nonetheless, their training efficiency is relatively low, as they need at least quadratic time during the training stage. Moreover, all of the data-dependent hashing methods focus on designing a set of effective hash functions, either by optimizing variance or correlation criteria or by preserving similarity structures, without taking into account the redundancy within the generated binary codes.

Cai [11] argued that LSH is still competitive when the number of generated hash bits is large enough. However, the more hash bits there are, the more highly correlated they become, so it is natural to reduce their number. Liu et al. [27] considered the bit selection problem by means of weighted graphs and dynamic programming. Specifically, the quality of each hash bit and its independence from the others are represented as a weighted vertex and weighted edges of a graph, respectively. In this setting, bit selection is formulated as quadratic programming on the graph, which can be solved by replicator dynamics. However, this approach is computationally expensive and less robust. Motivated by these observations, we adopt eleven evaluation criteria to measure the importance and pairwise similarities of the generated hash bits. The important and less similar hash bits are then picked out and taken as representative ones to assemble new binary codes. With this strategy, the new binary codes preserve the properties of the original data as faithfully as possible.

3 Bit selection framework

Assume that X \(\in \mathbb{R}^{n\times d}\) is a data set consisting of n data samples with d dimensions; each row vector xi of X is the i-th d-dimensional data sample. Hash learning aims to implicitly or explicitly design a set of hash functions H = [h1, h2, ..., hm] \(\in \mathbb{R}^{d\times m}\) to encode the data samples X into binary representations B = [b1, b2, ..., bm] \(\in \{-1,1\}^{n\times m}\), where m is the number of hash bits and m < d. b\(_{i}\in \{-1, 1\}^{n}\) is the i-th hash bit, whose values are derived by applying the i-th hash function hi to all data samples in X.

Bit selection is the process of extracting l principal and representative hash bits from the m candidate hash bits generated by hashing techniques, so that they take the place of the original ones when assembling binary codes. Formally, let B = [b1, b2, ..., bm] \(\in \{-1,1\}^{n\times m}\) be the binary codes assembled from the m hash bits. Since bit selection picks l hash bits out of the m bits without altering their hash values, the new binary codes B\(^{\prime }\) = [b1, b2, ..., bl] \(\in \{-1,1\}^{n\times l}\) assembled from the l selected hash bits are actually a (column) subset of B, i.e., B\(^{\prime } \subseteq \)B.
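Concretely, selecting bits amounts to keeping a subset of the columns of B. Below is a minimal sketch with hypothetical sizes, where randomly chosen indices stand in for the output of a selection criterion:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, l = 1000, 128, 96                                   # samples, candidate bits, selected bits (hypothetical)
B = np.where(rng.standard_normal((n, m)) >= 0, 1, -1)     # stand-in binary codes in {-1, 1}^{n x m}

selected = np.sort(rng.choice(m, size=l, replace=False))  # indices a selection criterion would return
B_prime = B[:, selected]                                  # B' is simply a column subset of B
print(B_prime.shape)                                      # (1000, 96)
```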

For this purpose, we propose a bit selection framework for hash learning, shown in Fig. 1. It comprises three major stages, namely bit generation, bit selection and code assembly. Code assembly refers to the process of concatenating the l selected hash bits in a sequential way. Since it is relatively intuitive and simple, we focus on the first two stages.

Fig. 1 The framework of hash bit selection for hash learning

The first step of our framework is to generate m candidate hash bits using off-the-shelf hashing techniques. The candidate bits may come from a single hashing algorithm or from multiple hashing algorithms simultaneously; even different types of hashing techniques, e.g., data-independent and data-dependent ones, can be employed at the same time. It should be mentioned that several hashing algorithms, such as PCAH, SH, and ITQ, generate relatively few hash bits. In this case, multiple hashing algorithms are often combined to generate a given number of candidate hash bits. For instance, if 600 candidate hash bits are required, we can adopt LSH, PCAH, and ITQ to generate 200 hash bits each, and then concatenate them to obtain 600 hash bits.
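As an illustration of this pooling step, the sketch below concatenates bits from two simple encoders, an LSH-style random projection and a PCAH-style projection (ITQ is omitted here for brevity); the data, sizes and function names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 512))          # hypothetical data set (n=1000, d=512)

def lsh_bits(X, m, rng):
    """LSH-style bits: signs of random projections."""
    H = rng.standard_normal((X.shape[1], m))
    return np.where(X @ H >= 0, 1, -1)

def pcah_bits(X, m):
    """PCAH-style bits: signs of projections onto the top-m principal directions."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return np.where(Xc @ Vt[:m].T >= 0, 1, -1)

# A candidate pool of 600 bits assembled from two different hashing techniques.
pool = np.hstack([lsh_bits(X, 300, rng), pcah_bits(X, 300)])
print(pool.shape)                              # (1000, 600)
```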

Note that constructing the pool of candidate hash bits with multiple hashing algorithms is inherently consistent with multi-view learning, where data are collected from multiple sources [28]. Inspired by multi-view learning, we can also exploit multi-view hashing algorithms to construct the pool of candidate hash bits [29, 30]. Kong et al. [31] observed that a single hash bit per projection may separate similar samples into different binary values; hence, multi-bit hashing algorithms encode each projection dimension with multiple binary bits to map close samples into the same binary codes.

The second stage is bit selection, which picks representative and important hash bits out of the candidate bits according to evaluation criteria. The selected hash bits should faithfully preserve the information of the original bits and have strong representation capability, so that the new binary codes assembled from them perform well.

Within this stage, three major components are involved and tightly associated with each other: selection strategies, bit evaluation and bit ranking. The selection strategy determines how bits are picked from the candidates, i.e., individually or as a subset. If bits are treated individually, only one hash bit is evaluated and chosen at a time, and the most important bits are eventually selected according to the evaluation criteria. Otherwise, multiple bits are evaluated as a whole, and the subset of bits with the highest evaluation score is chosen to assemble the binary codes.

Bit evaluation is the core component of bit selection, for it determines the importance or information content of each hash bit or bit subset. To evaluate hash bits, evaluation criteria or metrics are required, and different criteria should be used for different selection strategies. For instance, if a single hash bit is under consideration, entropy or the Laplacian score can be used to measure its importance; for a subset of hash bits, the Pearson correlation coefficient and mutual information are good criteria to measure the correlation or similarity between pairs of bits.

Based on the importance scores induced in the bit evaluation step, bit ranking sorts the candidate hash bits or subsets in descending order. The hash bits or subsets with high importance scores are chosen with high priority to assemble the new binary codes.

4 Bit selection for LSH

LSH, one of the most popular hashing algorithms, maps similar or proximate data samples to similar binary codes by using random hash functions. Let h be a randomly generated hash function and x \(\in \mathbb{R}^{d}\) a data sample; the hash value of x is

$$ b=sign(\textbf{x}^{T}\textbf{h})= \begin{cases} 1,& if \quad \textbf{x}^{T}\textbf{h} \ge 0;\\ -1,& otherwise. \end{cases} $$
(1)

Given m random hash functions H = {h1, h2, ..., hm} and data samples X \(\in \mathbb{R}^{n\times d}\), the i-th hash bit of X is

$$ \textbf{b}_{i}=sign(\textbf{Xh}_{i}) $$
(2)

With H, the k-th sample xk is represented as the following binary code

$$ \textbf{c}_{k}=sign(\textbf{x}_{k}^{T}\textbf{H})=[b_{k1},b_{k2},...,b_{km}] $$
(3)
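A minimal NumPy sketch of (1)-(3), with hypothetical sizes; all m bits are obtained at once as the sign of a single matrix product:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, m = 1000, 128, 64
X = rng.standard_normal((n, d))       # hypothetical data samples
H = rng.standard_normal((d, m))       # m random hash functions, one per column

# Eq. (2): the i-th hash bit is sign(X h_i); here all bits are computed together.
B = np.where(X @ H >= 0, 1, -1)       # B in {-1, 1}^{n x m}

# Eq. (3): the binary code of the k-th sample is the k-th row of B.
c_k = B[0]
print(B.shape, c_k[:8])
```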

Theoretically, two similar or proximate data samples are represented by the same binary code with high probability. Unfortunately, a great number of hash functions are required to retain this property; that is, the binary codes must be long enough for LSH to deliver remarkable performance. Generating long binary codes demands high computational cost and large storage space, and the generated hash bits contain an increasing amount of redundancy as the code length grows.

A possible remedy for this bit redundancy is to perform bit selection for LSH. Under the bit selection framework above, we derive eleven bit selection methods that pick important hash bits to assemble new binary codes. Specifically, a pool of candidate hash bits is first constructed from the bits generated by LSH. Each candidate bit is then evaluated according to the selection criteria. Afterwards, the candidate bits are sorted by their evaluation scores, and only those with high scores are used to assemble new binary codes.

Two types of evaluation criteria are adopted in our selection methods to measure the importance or similarity of the candidate hash bits: one evaluates each candidate bit individually, and the other evaluates multiple candidate bits jointly.

4.1 Bit importance

The bit importance criteria assess an individual bit by calculating several metrics, including variance, balance, entropy, similarity preservation, and Laplacian score.

  1. Variance (Var). In probability theory, the variance of a random variable measures how far its values deviate from the mathematical expectation; a large variance indicates that the values differ greatly from each other. The variance of bi is defined as follows.

    $$ Var(\textbf{b}_{i}) = \frac{{\sum}_{k=1}^{n}(b_{ik}-\overline{b_{i}})^{2}}{n-1} $$
    (4)

    where bik is the k-th binary value of bi, and \(\overline {b_{i}}\) is the mean value of bi. From the definition, the larger the variance, the more information the hash bit carries.

  2. Balance (Bal) [32]. Given a hash bit bi, its balance refers to the equilibrium between 1 and -1 within bi. From the perspective of the data, the balance reflects the representation capability of the corresponding hash function hi, which encodes each data sample to one of the binary values -1 and 1. Formally, the balance of the i-th hash bit bi can be measured by \(\textbf {b}_{i}^{T}\textbf {1}\). A hash bit with rich information and strong discernibility partitions the data samples evenly [32], so a smaller \(|\textbf {b}_{i}^{T}\textbf {1}|\) indicates a better balance. For the sake of consistency, here the balance of bi is defined as

    $$ Bal(\textbf{b}_{i}) = 1-\frac{|{\sum}_{k=1}^{n}b_{ik}|}{n} $$
    (5)
  3. Entropy (Ent) [22]. In information theory, entropy is an effective criterion to measure the degree of uncertainty of a random variable. A highly uncertain variable carries more information and is more important for prediction. For the hash bit bi, the entropy is

    $$ Ent(\textbf{b}_{i}) = - \sum\limits_{v\in\{1,-1\}} \sum\limits_{k}P(b_{ik} = v)\log P(b_{ik} = v) $$
    (6)

    where bik is the k-th value of bi and P(bik = v) is the probability that bik = v (v = 1 or -1).

  4. Similarity preservation (SP) [27]. Similarity preservation requires that the similarities of data in the original space be preserved after projection into the Hamming space; that is, the structural properties of the data should be retained, and similar data samples should be represented by similar binary codes. Thus, a high-quality hash bit preserves the local structural information of the data. Specifically, the similarity preservation of the i-th bit is

    $$ SP(\textbf{b}_{i}) = \left\|\textbf{b}_{i}\textbf{b}_{i}^{T} - \textbf{S}\right\|^{2} $$
    (7)

    where S is an affinity matrix whose entries are \(S_{i,j}=\exp (-\left \|\textbf {x}_{i} - \textbf {x}_{j} \right \|^{2} / \epsilon ^{2} )\), and 𝜖 is a constant parameter. Note that the smaller the value of SP(bi), the higher the quality of the i-th hash bit.

  5. Laplacian score (LS) [21]. The Laplacian score is widely used to evaluate the quality of a variable. It considers the local similarity structure and the variance of the data simultaneously. Given a hash bit bi, its Laplacian score is

    $$ LS(\textbf{b}_{i}) = -\frac{{\sum}_{k,j} (b_{ik}-b_{ij})^{2}S_{kj}}{Var(\textbf{b}_{i})} $$
    (8)

    where bik is the k-th value of bi, and Skj is the similarity between xk and xj. A high Laplacian score indicates that the hash bit is important, for it retains the local similarities of the data while having a large variance.

According to the definitions above, these metrics evaluate the importance of each hash bit individually. With these evaluation metrics, the principle of bit selection is to retain the hash bits with high scores; that is, a bit with a high score is selected to assemble the new binary codes with high priority.
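To make these definitions concrete, the following sketch computes (4)-(8) for a single hash bit, assuming an RBF affinity matrix S with ε = 1 (a toy setup for illustration, not the experimental configuration):

```python
import numpy as np

def variance(b):
    """Eq. (4): sample variance of a hash bit (larger is better)."""
    return np.var(b, ddof=1)

def balance(b):
    """Eq. (5): 1 - |sum_k b_k| / n; closer to 1 means a more even split of -1/+1."""
    return 1.0 - abs(b.sum()) / b.size

def entropy(b):
    """Eq. (6), up to the empirical-probability convention: entropy of the {-1, +1} values."""
    p = np.array([(b == 1).mean(), (b == -1).mean()])
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def similarity_preservation(b, S):
    """Eq. (7): ||b b^T - S||^2 (smaller is better)."""
    return np.linalg.norm(np.outer(b, b) - S) ** 2

def laplacian_score(b, S):
    """Eq. (8): -(sum_{k,j} (b_k - b_j)^2 S_{kj}) / Var(b) (larger is better)."""
    diff = (b[:, None] - b[None, :]) ** 2
    return -(diff * S).sum() / variance(b)

# Toy usage on a single LSH bit and a hypothetical RBF affinity matrix S (epsilon = 1).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 32))
S = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
b = np.where(X @ rng.standard_normal(32) >= 0, 1, -1)
print(variance(b), balance(b), entropy(b), similarity_preservation(b, S), laplacian_score(b, S))
```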

Algorithm 1 summarizes the implementation details of our bit selection methods based on evaluating hash bits individually. It works in a straightforward way and is easy to understand. First, LSH is performed to generate m candidate hash bits, forming a pool of bits. For each candidate bit, its importance is evaluated by one of the corresponding metrics ((4) to (8)). Afterwards, the candidate hash bits are sorted in descending order of their importance. Finally, the top l hash bits are kept to assemble the new binary codes.

Algorithm 1 Bit selection by evaluating hash bits individually
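As a companion to Algorithm 1, here is a minimal NumPy sketch of the flow described above (generate, score, rank, keep the top l), using the sample variance as a stand-in for any of the criteria (4)-(8); all sizes are hypothetical:

```python
import numpy as np

def select_bits_individually(X, m, l, score_fn, seed=0):
    """Sketch of Algorithm 1: generate m LSH bits, score each bit, keep the top l."""
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((X.shape[1], m))                    # Step 1: random hash functions
    B = np.where(X @ H >= 0, 1, -1)                             # candidate bits (columns of B)
    scores = np.array([score_fn(B[:, i]) for i in range(m)])    # Steps 2-4: per-bit importance
    top = np.argsort(scores)[::-1][:l]                          # Step 5: rank descending, keep top l
    return B[:, top], top

# Example: use the sample variance as the importance criterion.
X = np.random.default_rng(1).standard_normal((500, 64))         # hypothetical data
B_new, kept = select_bits_individually(X, m=128, l=96, score_fn=lambda b: np.var(b, ddof=1))
print(B_new.shape)
```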

The computational cost of Algorithm 1 mainly consists of three steps: generating the candidate bits (Step 1), estimating the importance scores (Steps 2 to 4) and ranking the scores (Step 5). The first step conducts LSH to generate the candidate bits B; its computational time is \(\mathcal {O}(mnd)\), where m is the number of hash bits. The time complexity of the second step depends on which evaluation criterion is adopted; generally, it is linear, i.e., \(\mathcal {O}(km)\), where k is a constant. The computational time of bit ranking is \(\mathcal {O}(m\log m)\). Hence, the total time cost of Algorithm 1 is \(\mathcal {O}(mnd+m\log m)\).

4.2 Bit similarity

It is worth mentioning that the metrics above only consider the importance of each bit individually, ignoring the interactions between bits. This may unfortunately lead to selected hash bits that contain redundant information. Notably, highly correlated bits often exhibit similar behaviors. For example, two highly correlated hash bits bi and bj assign the same value, i.e., 1 or -1, to almost every sample; that is, the two bits contribute equally to hash learning. In such a case, redundancy exists between bi and bj, and there is apparently no need to choose both of them for compact code assembly, even though both may be highly important. To address this bit redundancy, we adopt six more metrics to evaluate the pairwise similarities among hash bits. A brief description of these metrics follows.

  1. Hamming distance (HD). The Hamming distance is frequently used to count the positions at which two strings differ. Here we exploit it to check how many hash values differ between two given hash bits. Given two hash bits bi and bj, the criterion derived from their Hamming distance is

    $$ HD(\textbf{b}_{i},\textbf{b}_{j})=\exp\left( - \sum\limits_{k}{b_{ik} \oplus b_{jk}} \right) $$
    (9)

    where ⊕ denotes the logical operation of exclusive disjunction (XOR), which returns 1 only when bik differs from bjk and 0 otherwise. According to this definition, if bi has a small Hamming distance to bj, the two bits are proximate to each other; that is, they are highly correlated.

  2. Euclidean distance (ED). Traditionally, the Euclidean distance measures how far one sample is from another in Cartesian coordinates. Here we calculate it between two hash bits with binary values. Specifically, the criterion derived from the Euclidean distance of bi to bj is

    $$ ED(\textbf{b}_{i},\textbf{b}_{j}) = \exp\left( - \sum\limits_{k=1}^{n} (b_{ik}-b_{jk})^{2}\right) $$
    (10)

    where bik is the k-th value of bi. Obviously, the smaller the Euclidean distance between bi and bj, the larger ED(bi,bj) and the more similar the two bits.

  3. Cosine distance (CD). This measure is the cosine of the angle between two variables. If the two variables have similar orientations, the angle is small and the cosine value is large; in particular, the cosine similarity equals 1 when the angle is zero, and it is zero when the two variables are perpendicular to each other. Hence, we use the cosine measure to approximate the similarity of bi to bj as follows

    $$ CD(\textbf{b}_{i},\textbf{b}_{j}) = \cos(\textbf{b}_{i},\textbf{b}_{j}) = \frac{\textbf{b}_{i}^{T}\textbf{b}_{j}}{\left\| \textbf{b}_{i}\right\| \left\|\textbf{b}_{j}\right\|} $$
    (11)
  4. Jaccard distance (JD). The Jaccard distance (a.k.a. the Jaccard similarity coefficient) is often used to represent the similarity between two sets. It is the proportion of elements shared by the two sets to the number of elements in their union. Since the values of a hash bit are binary, set union and intersection are not directly applicable. For the purpose of bit evaluation, the Jaccard distance of bi to bj is therefore given by

    $$ \begin{array}{@{}rcl@{}} &&JD(\textbf{b}_{i},\textbf{b}_{j}) \\&&=\! \frac{{\sum}_{k}I(b_{ik}=b_{jk})} {{\sum}_{k}I(b_{ik} = 0) + {\sum}_{k}I(b_{jk} = 0) - {\sum}_{k}I(b_{ik} = b_{jk})} \end{array} $$
    (12)

    where I(·) is an indicator function: I(v) = 1 if the condition v holds, and I(v) = 0 otherwise.

  5. Pearson correlation coefficient (PCC) [23]. The Pearson correlation coefficient is a popular statistical criterion to measure the correlation between two variables. For hash bits bi and bj, it is defined as

    $$ PCC(\textbf{b}_{i},\textbf{b}_{j}) = \frac{Cov(\textbf{b}_{i},\textbf{b}_{j})}{\sqrt{Var(\textbf{b}_{i})Var(\textbf{b}_{j})}} $$
    (13)

    where Cov(bi,bj) is the covariance between bi and bj, and Var(bi) is the variance of bi. A large value of PCC(bi,bj) implies that bi and bj are highly correlated, and hence more similar.

  6. Mutual information (MI) [24]. Mutual information effectively measures how much information is shared by two variables. If observing one variable provides a large amount of information about the other, the two variables are likely to exhibit similar properties or behaviors. Based on the notion of entropy, the mutual information between bi and bj is defined as

    $$ MI(\textbf{b}_{i},\textbf{b}_{j})= Ent(\textbf{b}_{i})+Ent(\textbf{b}_{j})-Ent(\textbf{b}_{i}, \textbf{b}_{j}) $$
    (14)

    where Ent(bi) is the entropy of bi defined in (6), and Ent(bi, bj) is the joint entropy of bi and bj.
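To make the pairwise criteria concrete, the sketch below computes three of them, the Hamming-based similarity (9), the Pearson correlation coefficient (13) and the mutual information (14), for two toy bit vectors (the data are hypothetical):

```python
import numpy as np

def hamming_similarity(bi, bj):
    """Eq. (9): exp(-number of differing positions); larger means more similar bits."""
    return np.exp(-np.count_nonzero(bi != bj))

def pearson(bi, bj):
    """Eq. (13): Pearson correlation coefficient between two hash bits."""
    return np.corrcoef(bi, bj)[0, 1]

def mutual_information(bi, bj):
    """Eq. (14): MI = Ent(bi) + Ent(bj) - Ent(bi, bj), with empirical entropies."""
    def ent(values):
        _, counts = np.unique(values, return_counts=True, axis=0)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()
    return ent(bi) + ent(bj) - ent(np.stack([bi, bj], axis=1))

rng = np.random.default_rng(0)
bi = rng.choice([-1, 1], size=1000)
bj = np.where(rng.random(1000) < 0.9, bi, -bi)     # a bit that mostly agrees with bi
print(hamming_similarity(bi, bj), pearson(bi, bj), mutual_information(bi, bj))
```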

Based on the similarity metrics above, we present the second kind of bit selection methods for LSH in Algorithm 2. In contrast to the selection methods in Algorithm 1, which only involve the importance of individual bits, Algorithm 2 mainly uses the similarities of pairwise bits to choose important bits with low correlation. Specifically, we first employ LSH to yield a pool of candidate hash bits. The similarity matrix A of the hash bits is derived by computing the similarities between bits pairwise with the metrics given above ((9) to (14)). To reduce the redundancy among the selected hash bits, the bits are then chosen greedily: once the maximum entry aij of A is located, only one of the i-th and j-th bits is kept, and the i-th and j-th rows and columns are removed from A. This selection process is repeated until l bits are obtained, which are subsequently used to assemble more compact binary codes.

The computational cost of Algorithm 2 mainly comprises three steps: generating the candidate bits (Step 1), estimating the similarity scores between pairs of bits (Steps 2 to 6) and selecting the first l hash bits (Steps 8 to 12). The time cost of the first step is \(\mathcal {O}(mnd)\). Since the bit similarity is estimated in a pairwise manner, its computational time is \(\mathcal {O}(m^{2})\). In the last step, a column and a row are removed at each iteration, so its time complexity is \(\mathcal {O}(ml)\). Hence, the total time cost of Algorithm 2 is \(\mathcal {O}(mnd+m^{2}+ml)\).

Algorithm 2 Bit selection by evaluating pairwise bit similarities
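As a companion to Algorithm 2, the sketch below gives one plausible reading of the greedy procedure (a simplification under our own assumptions, not the exact listing): the absolute Pearson correlation serves as the pairwise similarity, and one bit of the most similar remaining pair is discarded until l bits remain.

```python
import numpy as np

def select_bits_pairwise(B, l):
    """One plausible reading of Algorithm 2: repeatedly locate the most similar pair of
    remaining bits and discard one of them, until only l bits remain."""
    A = np.abs(np.corrcoef(B.T))                  # pairwise bit similarity (here: |PCC|)
    np.fill_diagonal(A, -np.inf)                  # ignore self-similarity
    remaining = list(range(B.shape[1]))
    while len(remaining) > l:
        sub = A[np.ix_(remaining, remaining)]
        _, j = np.unravel_index(np.argmax(sub), sub.shape)
        remaining.pop(j)                          # drop one bit of the most redundant pair
    return B[:, remaining], remaining

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))
B = np.where(X @ rng.standard_normal((64, 128)) >= 0, 1, -1)   # 128 candidate LSH bits
B_new, kept = select_bits_pairwise(B, l=96)
print(B_new.shape)                                              # (500, 96)
```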

5 Evaluation experiments

To verify the effectiveness of the proposed bit selection framework, we conducted a series of experiments on public data sets. Specifically, we compared LSH with its counterparts equipped with the selection criteria. In addition, NDomSet [27], a recently developed bit selection algorithm, was used as a baseline.

The objectives of the evaluation experiments are three-fold: whether the performance of LSH degrades significantly after bit selection is performed, how the performance of LSH changes with different numbers of selected hash bits, and how few selected bits are required without degrading the performance of LSH significantly. For these purposes, we carried out three groups of experiments on a PC with a 3.8GHz Intel Core i7-9700 CPU and 8GB RAM.

5.1 Experimental data sets

Five publicly available image data sets were employed in our experiments. They are CALTECH101, CIFAR-10, USPS, IAPR-TC12 and NUS-WIDE. Their brief descriptions are given in the following.

  • CALTECH101 consists of 8,677 color images with 101 categories, about 50 images per category [33]. The image resolution is roughly 300×200 pixels, and each image is represented as a 512-dimensional GIST feature vector.

  • CIFAR-10 contains 60,000 color images drawn from the well-known 80M Tiny Images collection [34]. These images are categorized into 10 classes, each consisting of 6,000 images of size 32×32. The images are also represented as 512-dimensional GIST feature vectors.

  • USPS consists of 9,298 images (16×16), each associated with a digit from ‘0’ to ‘9’ [20]. Each image is represented as a 256-dimensional feature vector.

  • IAPR-TC12 is a multi-label data set that contains 20,000 image-text pairs annotated by 255 labels [35], covering natural images taken from locations around the world. Each image is represented by a 512-dimensional GIST feature vector.

  • NUS-WIDE comprises 269,648 images, where each image is annotated with one or more of 81 labels. In our experiments, the first 10 categories are considered, amounting to 186,577 images [17]. The ground-truth images are those that share at least one common label.

5.2 Evaluation metrics

Following common practice, we adopted three widely used evaluation metrics in our experiments, i.e., top-k precision (Pre@k), top-k recall (Rec@k) and mean average precision (mAP), to evaluate the performance of LSH. For each query sample, the retrieved nearest neighbors were ranked according to the Hamming distance so that the evaluation criteria could be computed. The evaluation criteria are defined as follows.

$$ Pre@k = \frac{\text{the relevant images in the top-\textit{k} images}}{\text{the retrieved top-\textit{k} images}} $$
(15)
$$ Rec@k = \frac{\text{the relevant images in the top-\textit{k} images}}{\text{all the relevant images }} $$
(16)
$$ mAP = \frac{1}{Q} \sum\limits_{i=1}^{Q} \frac{1}{M} \sum\limits_{j=1}^{R}P_{i}(j)I_{i}(j) $$
(17)

where Q is the number of query samples, M is the number of relevant images, and R is the number of retrieved images. Pi(j) is the top-j precision for the i-th query, and Ii(j) = 1 indicates that the j-th retrieved image is a true neighbor of the i-th query; otherwise Ii(j) = 0. Without loss of generality, we retrieved 1,000 nearest samples for each query, i.e., k = 1000, in our experiments.
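A minimal sketch of (15)-(17) for a single ranked query, using a toy 0/1 relevance list; averaging the per-query values over Q queries yields mAP:

```python
import numpy as np

def precision_at_k(relevant, k):
    """Eq. (15): fraction of the top-k retrieved items that are relevant."""
    return relevant[:k].mean()

def recall_at_k(relevant, k, n_relevant_total):
    """Eq. (16): fraction of all relevant items that appear in the top-k retrieved items."""
    return relevant[:k].sum() / n_relevant_total

def average_precision(relevant, n_relevant_total):
    """Per-query term of Eq. (17): (1/M) * sum_j P(j) I(j) over the ranked list."""
    hits = np.flatnonzero(relevant)
    precisions = [relevant[: j + 1].mean() for j in hits]
    return np.sum(precisions) / n_relevant_total

# Toy ranked relevance list for one hypothetical query (1 = true neighbor).
relevant = np.array([1, 0, 1, 1, 0, 0, 1, 0])
M = relevant.sum()
print(precision_at_k(relevant, 4), recall_at_k(relevant, 4, M), average_precision(relevant, M))
```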

5.3 Results and discussion

5.3.1 Comparison of LSH to the bit selection methods

The first group of comparison experiments aims to determine how much the bit selection methods affect the performance of LSH. For this purpose, we first performed LSH on the data sets to generate 128, 256 and 512 candidate hash bits, respectively. The bit selection methods were then adopted to pick 80% of the candidate bits; that is, 103, 205 and 410 hash bits were chosen, respectively. Based on the selected hash bits, the evaluation criteria of LSH and of its counterparts with the selection methods were recorded.

The recall comparisons of LSH with the bit selection methods using 80% of the hash bits are presented in Figs. 2 to 6, where bars above (or below) zero indicate that the bit selection methods achieve higher (or lower) Rec@k values than LSH. For example, the bit selection method using Euclidean distance (ED) achieved higher Rec@k than LSH when the number of hash bits was 103, while the one using entropy (Ent) performed worse than LSH on CALTECH101 (see Fig. 2).

Fig. 2 The recall comparison of LSH to the bit selection methods with 80% hash bits on CALTECH101, where the bars above (below) zero indicate that they are better (worse) than LSH. (a) 103 over 128 bits (b) 205 over 256 bits (c) 410 over 512 bits

Fig. 3 The recall comparison of LSH to the bit selection methods with 80% hash bits on CIFAR-10, where the bars above (below) zero indicate that they are better (worse) than LSH. (a) 103 over 128 bits (b) 205 over 256 bits (c) 410 over 512 bits

Fig. 4 The recall comparison of LSH to the bit selection methods with 80% hash bits on USPS, where the bars above (below) zero indicate that they are better (worse) than LSH. (a) 103 over 128 bits (b) 205 over 256 bits (c) 410 over 512 bits

Fig. 5 The recall comparison of LSH to the bit selection methods with 80% hash bits on IAPR-TC12, where the bars above (below) zero indicate that they are better (worse) than LSH. (a) 103 over 128 bits (b) 205 over 256 bits (c) 410 over 512 bits

Fig. 6 The recall comparison of LSH to the bit selection methods with 80% hash bits on NUS-WIDE, where the bars above (below) zero indicate that they are better (worse) than LSH. (a) 103 over 128 bits (b) 205 over 256 bits (c) 410 over 512 bits

From the experimental results in Figs. 2, 3, 4, 5 and 6, one can observe that the bit selection methods work well and do not deteriorate the performance of LSH significantly after removing 20% of the redundant hash bits generated by LSH. In some cases, several selection methods even outperformed LSH. For instance, on the CALTECH101 data set, the selection methods using Euclidean distance (ED), Jaccard distance (JD) and mutual information (MI) achieved better performance with 80% of the hash bits than LSH. Another interesting observation is that, on all data sets, the difference in Rec@k between the selection methods with 410 bits and LSH with 512 bits is smaller than that with fewer bits. This is intuitively reasonable because the more hash bits there are, the more redundancy exists among them.

The precision comparisons of LSH with the bit selection methods using 80% of the hash bits are presented in Figs. 7, 8, 9, 10 and 11. According to the experimental results, similar conclusions can be drawn for the precision criterion. As an example, the bit selection methods with fewer hash bits on the CALTECH101 data set (see Fig. 7) are consistent with LSH in terms of precision; that is, the selection methods do not degrade the performance of LSH after removing 20% of the hash bits. On the CIFAR-10 data set (see Fig. 8(a)-(c)), the differences of the selection methods with 410 hash bits are much smaller than those with 103 bits. This indicates that longer hash codes embody more information, and the selection methods with more bits have performance comparable to LSH.

Fig. 7 The precision comparison of LSH to the bit selection methods with 80% hash bits on CALTECH101, where the bars above (below) zero indicate that they are better (worse) than LSH. (a) 103 over 128 bits (b) 205 over 256 bits (c) 410 over 512 bits

Fig. 8 The precision comparison of LSH to the bit selection methods with 80% hash bits on CIFAR-10, where the bars above (below) zero indicate that they are better (worse) than LSH. (a) 103 over 128 bits (b) 205 over 256 bits (c) 410 over 512 bits

Fig. 9 The precision comparison of LSH to the bit selection methods with 80% hash bits on USPS, where the bars above (below) zero indicate that they are better (worse) than LSH. (a) 103 over 128 bits (b) 205 over 256 bits (c) 410 over 512 bits

Fig. 10 The precision comparison of LSH to the bit selection methods with 80% hash bits on IAPR-TC12, where the bars above (below) zero indicate that they are better (worse) than LSH. (a) 103 over 128 bits (b) 205 over 256 bits (c) 410 over 512 bits

Fig. 11 The precision comparison of LSH to the bit selection methods with 80% hash bits on NUS-WIDE, where the bars above (below) zero indicate that they are better (worse) than LSH. (a) 103 over 128 bits (b) 205 over 256 bits (c) 410 over 512 bits

5.3.2 Sensitivity of the bit selection methods

The second group of comparison experiments was conducted to test the sensitivity of the bit selection methods to the number of selected hash bits. Given 512 hash bits generated by LSH on the data sets, we applied the bit selection methods to assemble new binary codes at various selection ratios, ranging from 50% to 100%. Afterwards, we estimated the mAP scores of these bit selection methods.

Figure 12 gives the mAP scores of the bit selection methods, where the x-axis denotes the bit selection ratio, ranging from 50% to 100% of the 512 hash bits generated by LSH. For clarity, the mAP score of LSH is also provided and serves as the horizontal reference line. According to Fig. 12, the bit selection methods with Euclidean distance (ED) and mutual information (MI) perform better than LSH on CALTECH101, although they use fewer hash bits. Meanwhile, the selection methods with similarity preservation (SP) and Laplacian score (LS) achieve higher scores than the others, including NDomSet and LSH, on the USPS data set. Besides, the bit selection method based on cosine distance (CD) achieves stable performance even with fewer hash bits, around 60%, on the CALTECH101, CIFAR-10 and USPS data sets. Although the bit selection methods performed relatively poorly on the IAPR-TC12 and NUS-WIDE data sets, they remain competitive when the proportion of selected hash bits is around 90%.

Fig. 12 The mAP scores of the bit selection methods with different quantities of bits, ranging from 50% to 100% of 512 bits

5.3.3 Reduction ratio of the bit selection methods

The objective of the third group of experiments is to test the maximum reduction ratios of the bit selection methods without degrading the performance significantly; that is, how few hash bits are needed to parallel the performance of LSH. First, LSH was performed on the data sets to generate 128, 256 and 512 hash bits. Then the bit selection methods were applied to pick the fewest hash bits such that the mAP scores of the selection methods were within 1% of those of LSH.

Table 1 records the maximum reduction ratios of the bit selection methods with less than 1% loss of mAP, where the bold values are the highest reduction ratios among the bit selection methods. From the experimental results in Table 1, it is obvious that the bit selection methods are capable of selecting fewer bits by removing redundant ones, without degrading the performance of LSH significantly. For example, the reduction ratios of the selection method involving Euclidean distance (ED) are the best ones, reaching around 40%, 55% and 68% on the CALTECH101 data set when the numbers of hash bits are 128, 256 and 512, respectively. Moreover, the maximum reduction ratios grow as the number of candidate hash bits increases. This implies that a longer hash code often embodies more redundant information than a shorter one.

Table 1 The maximum reduction ratios (%) of the bit selection methods with less than 1% loss of mAP

Table 2 lists the running time of selecting 205 out of 256 hash bits, i.e., a reduction ratio of 20%, by the bit selection methods. As shown in Table 2, estimating balance (Bal), variance (Var), entropy (Ent) and the Pearson correlation coefficient (PCC) is more efficient than using the other evaluation criteria. NDomSet is the most time-consuming of the selection methods, for it requires calculating both the mutual information and the similarity preservation of the hash bits. Even so, its cost is still acceptable for large-scale data. This means that the proposed bit selection framework is highly efficient and can be used as a post-processing stage for hashing techniques.

Table 2 The running time (seconds) of selecting 205 out of 256 hash bits by the bit selection methods

6 Conclusion

In this paper, a novel bit selection framework is introduced to choose important and representative bits out of the hash bits generated by hashing techniques. It mainly consists of three stages, i.e., bit generation, bit selection, and code assembly, wherein bit selection plays the central role. The bit selection stage in turn involves three components: selection strategies, bit evaluation, and bit ranking. To show the effectiveness of the proposed framework, we further exploit eleven evaluation criteria to measure the importance and similarity of each bit generated by LSH, so that bits with high importance and low similarity are selected to assemble new binary codes. We evaluated the proposed framework with these evaluation criteria on five commonly used data sets. Experimental results show that the proposed bit selection framework works effectively in different cases, and the performance of LSH does not degrade significantly after redundant hash bits are removed according to the evaluation criteria.

In future work, we will take other existing hashing techniques into account to verify that the proposed framework works for general hashing techniques. In addition, advanced techniques for measuring correlation will be considered to further improve the reduction ratio of bit selection.