
1 Introduction

The amount of web data, images in particular, is growing rapidly. How to retrieve images that meet users' requirements from such enormous collections with efficient storage and computation has attracted extensive attention from both academia and industry [3].

Exhaustive nearest neighbor search is intractable at this scale. Approximate nearest neighbor (ANN) search can return satisfactory results within logarithmic (O(log n)) or even constant (O(1)) time by organizing the data with structures that preserve the distance metric. In particular, hashing-based methods [5, 14–16, 19, 24, 26] with lookup tables consume only constant time per query. The compact codes produced by hashing also reduce storage requirements, and the bitwise operations needed for a query keep hashing competitive even when exhaustive ranking is required.

Fig. 1. The architecture of the proposed model in the training stage for learning hash functions. The input size of the model is fixed, and the hash layer is fully connected to the preceding layer. When training is done, the two softmax layers can simply be dropped and the outputs of the hash layer are binarized as hash codes.

Conventional hashing methods usually take low-level features as input and use shallow models to generate the hash codes. However, the hand-crafted features are fixed and cannot be learned for further improvement. Recently, CNNH [24] gained a large performance boost by using a deep model to learn hash codes. However, this method breaks the learning process into two separate stages: first, pseudo hash codes are learned from the images' labels; then the codes are fixed and used to train a convolutional neural network (CNN) for later prediction. Some information is inevitably lost in the first stage. Although Lai et al. [16] and Zhao et al. [26] propose one-stage ranking-based hashing methods, both take only ranking as the supervisory information and do not use class information.

In this paper, we propose a novel model that learns deep features and hash functions at the same time. As shown in Fig. 1, the model consists of three parts: a stack of convolutional layers, a softmax loss layer for classification, and a newly proposed hash layer together with a hinge softmax loss layer for hash code learning. The first part learns semantic-preserving features, the second encourages the model to learn discriminative features from class labels, and the third learns codes that are closer to binary. When training is done, the loss layers are dropped and the outputs of the hash layer are binarized at threshold 0 to generate the final hash codes. The proposed model is an end-to-end system in which feature extraction and hashing are combined.

The specific contributions of our work are as follows:

  1. We learn hash functions via a CNN in the form of multiple loss layers.

  2. We introduce a hinge softmax loss layer and a hash layer into hash learning.

  3. To the best of our knowledge, our results on the experimental datasets outperform the state of the art.

The remainder of this paper is organized as follows: related work is briefly reviewed in Sect. 2, and the methodology of our work is described in Sect. 3. Experiments and discussion are presented in Sects. 4 and 5. Finally, we conclude the paper in Sect. 6.

2 Related Work

To generate an n-bit code, hashing methods need n hash functions, the k-th of which generally takes the following form:

$$h_k(x) = \begin{cases} 1 & f_k(x) \ge b_k \\ 0 & f_k(x) < b_k \end{cases}$$
(1)

where x is the representation of a data sample, \(f_k\) is the hashing function, and \(b_k\) is the corresponding threshold. Based on how \(f_k\) is obtained, hashing can be divided into data-independent methods and data-dependent (or learning-based) methods; the latter attempt to capture the inherent distribution of the task domain by learning. Learning-based hashing can be further classified into unsupervised and supervised methods, depending on whether annotation information is used.
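To make the general form of Eq. (1) concrete, the following Python sketch computes a single hash bit from an arbitrary real-valued projection; the linear projection used in the usage lines is only a hypothetical stand-in, not a method from this paper.

```python
import numpy as np

def hash_bit(x, f_k, b_k=0.0):
    # One hash function h_k of Eq. (1): output 1 if f_k(x) reaches the
    # threshold b_k, otherwise 0.
    return 1 if f_k(x) >= b_k else 0

# Hypothetical usage: a random linear projection standing in for a learned f_k.
w = np.random.randn(512)
bit = hash_bit(np.random.rand(512), lambda x: w @ x, b_k=0.0)
```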

Locality Sensitive Hashing (LSH) [5] is a typical data-independent family that uses random projections to construct hash functions. The property of LSH that samples within a small Hamming distance in hash space are likely to be close in feature space makes it very attractive. However, the metric is only preserved asymptotically as the code length increases; to achieve high precision, LSH-related methods therefore require long codes and large hash tables.

Unsupervised methods use only unlabeled data for training; representative examples include Kernelized LSH [15], Semantic Hashing [12, 20] and Spectral Hashing [23].

Supervised hashing utilizes human-annotated similarities or labels to obtain better codes. Supervised Hashing with Kernels (KSH) [19] uses a kernel-based model to minimize the Hamming distances between the hash codes of similar data samples while maximizing the distances between dissimilar ones. Binary Reconstructive Embedding (BRE) [14] learns hash functions by minimizing the differences between the original distances of sample pairs and the corresponding Hamming distances in hash space. Although initially proposed as an unsupervised method, BRE can easily be extended to a supervised one by assigning similar pairs a distance of 0 and dissimilar pairs a distance of 1.

These methods are shallow and usually rely on separate feature extraction algorithms to obtain image representations. However, the relationships between samples in semantic space are not maintained in low-level feature space. Even when combined with high-level features, conventional hashing methods are likely to perform no better than an end-to-end system that learns the feature extractor and the hash functions together [26].

On the other hand, CNNs [13] have attracted explosive interest in computer vision since 2012. Their remarkable successes in a variety of tasks such as object recognition [13, 17, 22], detection [13, 22], image parsing [4] and video classification [10] have narrowed the gap between machine and human vision by a large step.

It has been suggested that the features in deep layers learned from ImageNet possess a great capability to represent the visual content of images and can be used for different tasks, such as scene parsing [2], detection [6] and image retrieval [1]. Neural Codes [1] uses the activations of a fully-connected layer of an ImageNet-pretrained CNN as descriptors of the input image, and Euclidean distances between descriptors are then computed to measure similarity. When the network is retrained on datasets related to the query domain, its retrieval performance becomes comparable to the state of the art.
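As a rough illustration of this neural-codes-style retrieval, the NumPy sketch below ranks database images by the Euclidean distance between fully-connected-layer activations; the function name and array shapes are our own assumptions rather than details from [1].

```python
import numpy as np

def retrieve_by_euclidean(query_feat, db_feats, top_k=10):
    # query_feat: (d,) activation vector of the query image;
    # db_feats: (n_images, d) matrix of database activations.
    dists = np.linalg.norm(db_feats - query_feat, axis=1)  # Euclidean distances
    return np.argsort(dists)[:top_k]                       # indices of the nearest images
```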

CNNH and CNNH+ [24] take raw images as input, but divide the learning process into two stages. In the first stage, a similarity/dissimilarity matrix is decomposed to obtain pseudo binary codes for the training images. In the second stage, the raw image pixels and the corresponding binary codes (for CNNH+, together with the one-hot binary labels) are fed to a CNN whose objective is to minimize the error between its outputs and the target binary codes. However, the decomposition stage introduces extra errors, and the pseudo codes are fixed once the first stage is done, so they cannot be tuned for further improvement.

Lai et al. [16] propose a deep neural network model to learn the hash functions. The model's input is a triplet (I, \(I^+\), \(I^-\)), meaning that image I is more similar to \(I^+\) than to \(I^-\). The sub-networks for the three images of the triplet share parameters with each other, and the triplet ranking loss function is:

$$\mathrm{loss}(x, x^+, x^-) = \max\bigl(0,\; \|x - x^+\|_2^2 - \|x - x^-\|_2^2 + 1\bigr), \qquad \text{s.t.}\ x, x^+, x^- \in [0,1]^k$$
(2)

where x, \(x^+\) and \(x^-\) are the sub-networks' outputs for I, \(I^+\) and \(I^-\) respectively, and k is the length of x, \(x^+\) and \(x^-\). Zhao et al. [26] adopt a similar method, but the loss of each triplet is assigned a weight defined by the number of labels shared between the query image and the two result images.
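For reference, a minimal NumPy sketch of the triplet ranking loss in Eq. (2) is given below; it assumes the three inputs are k-dimensional vectors already squashed into [0, 1]^k, as the constraint requires.

```python
import numpy as np

def triplet_ranking_loss(x, x_pos, x_neg, margin=1.0):
    # x, x_pos, x_neg: k-dimensional sub-network outputs for I, I+ and I-.
    d_pos = np.sum((x - x_pos) ** 2)  # squared distance to the similar image
    d_neg = np.sum((x - x_neg) ** 2)  # squared distance to the dissimilar image
    return max(0.0, d_pos - d_neg + margin)
```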

Apart from the above methods, [18, 21, 25] assume that good hash codes should also be easily classified by a linear classifier, and incorporate this objective into hash learning. In addition to the classification task, [21, 25] penalize the outputs of the embedding functions so that they lie as close to \(-1\) and 1 as possible, while [25] further requires the mean output of each function to be 0.

3 Methodology

Given a set of class labels \(\mathcal {Y}=\{1,\ldots ,C\}\) and an image dataset \(\mathcal {I}=\{I_1,I_2,\ldots I_N\}\) in which each image is associated with one label \(y_n\), our goal is to learn k hash functions that encode images into k-bit hash codes. When the codes are used for retrieval, images sharing the same label as the query image should be ranked at the top of the result list.

In this paper, we propose a CNN architecture to learn semantic-preserving hash functions, as shown in Fig. 1. The input image first goes through a stack of convolutional and pooling layers and arrives at the concatenation layer, from which the model branches into two separate paths: one is the original softmax loss layer and the other is our hash layer followed by a hinge softmax layer.

Suppose that \(\text {x}^l\) is the output of the l-th layer of a CNN. If the l-th layer is a softmax layer used to predict a vector p whose c-th element is the probability of class c, its formulation is given by

$$p_c = \frac{\exp(\mathbf{w}_c^T \mathbf{x}^{l-1})}{\sum_{c'} \exp(\mathbf{w}_{c'}^T \mathbf{x}^{l-1})} \qquad (c = 1, 2, \ldots, C)$$
(3)

where \(\mathbf{w}_c\) denotes the weights associated with class c, and \(\mathbf{x}^{l-1}\) is the output of the preceding layer.
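A small NumPy sketch of Eq. (3) follows; the weight-matrix layout (one row per class) and the max-subtraction for numerical stability are our own conventions.

```python
import numpy as np

def softmax_probs(x_prev, W):
    # x_prev: (d,) output of the previous layer; W: (C, d) class weights.
    logits = W @ x_prev
    logits -= logits.max()        # improves numerical stability, same result
    e = np.exp(logits)
    return e / e.sum()            # p[c] = exp(w_c^T x) / sum_c' exp(w_c'^T x)
```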

3.1 Hash Layer

The hash layer is a fully connected layer with no nonlinearity attached. Being fully connected, the hash layer can learn global semantic representations of the input image. The hash layer's output is penalized by the following formulation:

$$\mathrm{hash\_loss}(\mathbf{x}) = \sum_{i=1}^{l_n} \bigl\| 1 - \mathrm{abs}(g(\mathbf{x}_i)) \bigr\|^2$$
(4)

where \(l_n\) is the number of neurons in the hash layer, and \(g(x_i)\) takes the following form:

$$g(x_i) = \begin{cases} -1 & x_i \le -1 \\ 1 & x_i \ge 1 \\ x_i & \text{otherwise} \end{cases}$$
(5)

This loss encourages the neurons to produce outputs that are distributed away from 0, which is later used as the threshold to binarize the outputs and obtain the hash codes.
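A minimal NumPy sketch of the penalty in Eqs. (4) and (5), written for a single activation vector (the batch handling is our own assumption):

```python
import numpy as np

def hash_loss(x):
    # x: 1-D array of hash-layer activations (length l_n).
    g = np.clip(x, -1.0, 1.0)              # Eq. (5): saturate activations at +/- 1
    return np.sum((1.0 - np.abs(g)) ** 2)  # Eq. (4): push |outputs| toward 1, away from 0
```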

3.2 Hinge Softmax Loss

In hash learning, the target is more a ranking problem than a classification one. It is sufficient to make the prediction for the ground-truth label, \(p_y\), larger than those for the remaining labels, whereas the traditional softmax loss \(loss(y, p) = -\log(p_y)\) can be too harsh. We therefore define a loss modified from softmax, which takes the following form:

$$\mathrm{hinge\_softmax\_loss}(y, p) = \begin{cases} 0 & p_y \ge \max(p_{\widetilde{y}}) + m \\ -\log(p_y) & p_y < \max(p_{\widetilde{y}}) + m \end{cases}$$
(6)

where y is the ground-truth label, \(\widetilde{y}\) ranges over the remaining labels, p is the vector of predicted probabilities for all classes, and m is the slack (margin) that controls when the model should be penalized.

With this formulation, samples that are classified correctly with a margin larger than m incur zero loss and therefore back-propagate no changes to the learnable parameters. Otherwise, the semantic representations learned by the CNN are deemed insufficient and the penalty term takes effect. This setting keeps the features feeding our hash layer semantically meaningful.

When m is set to 1, since all predicted probabilities lie between 0 and 1, only the lower branch of the hinge softmax loss is ever taken, and it reduces to the conventional softmax loss. When m is no greater than 0, the loss can easily get stuck in a local minimum, for example when the probabilities of all classes are equal.
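The following NumPy sketch implements Eq. (6) for a single sample; m = 0.1 is the value used in our experiments (Sect. 4.1), and the function name is ours.

```python
import numpy as np

def hinge_softmax_loss(p, y, m=0.1):
    # p: 1-D array of class probabilities (softmax outputs); y: ground-truth index.
    p_y = p[y]
    p_rest = np.max(np.delete(p, y))  # highest probability among the other classes
    if p_y >= p_rest + m:
        return 0.0                    # classified correctly with enough margin
    return -np.log(p_y)               # otherwise fall back to the softmax loss
```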

3.3 The Model

The traditional softmax loss helps train the network to be more discriminative. To combine the classification task with hash learning, we branch out from the layer preceding the softmax layer and add a hash layer together with a hinge softmax layer. In this way, the softmax loss in our model acts as an auxiliary classifier, similar to [17].

The overall loss function is given as:

$$\min_{W}\; L_1(y, p) + \alpha L_2(\mathbf{x}_h) + \beta L_3(y, p) + \lambda \|W\|^2$$
(7)

where W denotes all the parameters to be learned in the network, \(L_1\) is the hinge softmax loss, \(L_2\) is the loss of the proposed hash layer, \(L_3\) is the softmax loss for classification, the last term is the weight decay, and \(\alpha \) and \(\beta \) are hyper-parameters.

With this loss function, the representations learned by the proposed hash layer preserve semantic information and are closer to binary codes.
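Building on the hinge_softmax_loss and hash_loss sketches above, the objective of Eq. (7) can be written roughly as follows; alpha, beta and m follow the values reported in Sect. 4.1, while the single weight-decay coefficient lam is a simplification of the per-layer coefficients we actually use.

```python
import numpy as np

def total_loss(p_hash_branch, p_cls_branch, y, hash_out, params,
               alpha=0.1, beta=0.1, lam=1e-3, m=0.1):
    l1 = hinge_softmax_loss(p_hash_branch, y, m)       # hash branch, Eq. (6)
    l2 = hash_loss(hash_out)                           # binarization penalty, Eq. (4)
    l3 = -np.log(p_cls_branch[y])                      # auxiliary classification softmax loss
    decay = lam * sum(np.sum(w ** 2) for w in params)  # ||W||^2 weight decay
    return l1 + alpha * l2 + beta * l3 + decay
```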

Compactness is an important property of hash codes: the bits generated by different hash functions should be independent of each other, and the information carried by each bit should be maximized. Compared with [16], we take advantage of dropout's [8] capability to prevent co-adaptation, in which a feature detector is only helpful in the presence of several other specific feature detectors. With a dropout layer inserted between the hash layer and the hinge softmax loss layer, the neurons of the hash layer can become independent of each other.

3.4 Hash Codes

The network is trained by stochastic gradient descent. When training is done, the two softmax layers are simply dropped and the remaining architecture is used to generate hash codes. When a new image arrives, it is first passed through the model and encoded into a k-dimensional vector, which is then binarized into the final hash code according to Eq. (1) with all \(b_k\) set to 0.
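This encoding step amounts to a simple thresholding of the hash-layer outputs, sketched below for a batch of images (the array shapes are assumptions):

```python
import numpy as np

def binarize(hash_outputs, thresholds=None):
    # hash_outputs: (n_images, k) hash-layer activations.
    # thresholds: per-bit thresholds b_k; all set to 0 as in Sect. 3.4.
    if thresholds is None:
        thresholds = np.zeros(hash_outputs.shape[1])
    return (hash_outputs >= thresholds).astype(np.uint8)  # Eq. (1) applied per bit
```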

4 Experiments

4.1 Experimental Settings

We compare the proposed model with one data-independent method, LSH [5], and four supervised methods, BRE [14], KSH [19], CNNH [24] and the model of [16], on two widely used benchmark datasets, namely the CIFAR-10 dataset [11, 16] and the Street View House Number (SVHN) dataset. We refer to the model of [16] as TRCNNH for short.

For a fair comparison, we sample 1000 images from each dataset as the query set and another 5000 from the rest for training, following [16]. For LSH, all data except the query set serve as the training set, whereas the 5000 labeled samples serve as the training set for the other methods. Following [16], images are represented by 512-dimensional GIST features for the non-CNN methods.

For our proposed method, the 5000 labeled samples in each dataset are divided into a training set and a validation set, which are used to find the most suitable architecture and tune the hyper-parameters. All 5000 samples are then used to retrain the model from scratch.

The description of an architecture is given in the following way: 3\(\,\times \,\)32\(\,\times \,\)32-32C5P2-MP3S2-32C5P0-D0.5-SL10 represents a CNN with inputs of 3 channels of \(32\,\times \,32\) pixels, a convolutional layer with 32 filters of size \(5\,\times \,5\) and padding 2 around the input maps, a max pooling layer (AP for average pooling) of size \(3\,\times \,3\) and stride 2, a convolutional layer with 32 filters of kernel size \(5\,\times \,5\) and no padding, a dropout layer with dropout ratio 0.5, and finally a softmax loss layer with 10 classes. Our hash layer is always connected to the last hidden layer. The Rectified Linear Unit (ReLU) is used as the nonlinearity for all convolutional layers.

Table 1. MAP of Hamming ranking w.r.t. different numbers of bits on the two datasets.

Hash lookup and Hamming ranking are two widely used methods to conduct search with hashing [16, 24]. Hash lookup constructs a lookup table with radius r in advance, and all the samples within that radius are returned as results, which reduces the query time to a constant. However, the number of returned results decreases dramatically as the code length increases. Hamming ranking, on the other hand, traverses the whole dataset for every new query.
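To make the two search modes concrete, the Python sketch below computes Hamming distances with XOR and bit counting and then performs either an exhaustive ranking or a radius-2 filter; a real hash-lookup implementation would use a precomputed table rather than the linear scan shown here.

```python
def hamming_distance(a, b):
    # Codes are stored as Python integers; XOR then count the differing bits.
    return bin(a ^ b).count("1")

def hamming_ranking(query_code, db_codes):
    # Exhaustive ranking: sort the entire database by Hamming distance.
    return sorted(range(len(db_codes)),
                  key=lambda i: hamming_distance(query_code, db_codes[i]))

def hash_lookup(query_code, db_codes, radius=2):
    # Return every sample within the given radius (here via a linear scan).
    return [i for i, c in enumerate(db_codes)
            if hamming_distance(query_code, c) <= radius]
```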

We evaluate the performance with three metrics: precision within Hamming radius 2, precision-recall curves, and precision with respect to different numbers of returned samples under Hamming ranking.

Our models are implemented with Caffe [9], an open-source CNN framework. On both datasets, our networks are trained with stochastic gradient descent. The momentum is set to 0.9. The weight decay coefficients for convolutional layers and fully connected layers are set to 0.001 and 0.25, respectively. For the hyper-parameters in Eq. (7), \(\alpha \) and \(\beta \) are chosen by validation and fixed at 0.1, while \(\gamma \) decreases from 0.3 to 0 over the whole training stage. The margin in the hinge softmax loss layer is set to 0.1 in our experiments, which is small enough not to hurt classification while still allowing gradients to be back-propagated.
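For illustration, one SGD-with-momentum update with L2 weight decay looks roughly as follows; momentum 0.9 and the convolutional-layer decay value match the settings above, but the update rule is a generic sketch rather than Caffe's exact solver.

```python
def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9, weight_decay=0.001):
    # Classical momentum update with L2 regularization folded into the gradient.
    velocity = momentum * velocity - lr * (grad + weight_decay * w)
    return w + velocity, velocity
```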

The implementations of BRE [14] and KSH [19] are provided by their authors. For LSH, the projections are randomly sampled from a Gaussian distribution with zero mean and identity covariance to construct the hash tables. The results of CNNH and TRCNNH are taken from [16].

Fig. 2. The results on CIFAR-10. (a) Precision curves of hash lookup within Hamming radius 2. (b) Precision-recall curves of Hamming ranking with a code length of 48 bits. (c) Precision curves with respect to the number of returned images, for Hamming ranking with a code length of 48 bits.

Fig. 3. The results on SVHN. (a) Precision curves of hash lookup within Hamming radius 2. (b) Precision-recall curves of Hamming ranking with a code length of 48 bits. (c) Precision curves with respect to the number of returned images, for Hamming ranking with a code length of 48 bits.

4.2 CIFAR-10

The CIFAR-10 dataset is a collection of 60,000 color images of \(32\,\times \,32\) pixels. The samples are evenly labeled with 10 mutually exclusive classes, ranging from airplane to bird.

For this dataset, the architecture we choose is a six-conv-layer model. The main branch of the model is 3\(\,\times \,\)32\(\,\times \,\)32-32C3P1-32C3P1-MP3S2-D0.5-32C3P1-32C3P1-MP3S2-D0.5-64C3P1-64C3P1-AP3S2-D0.5-SL10. For convenience of later reference, we name the third pooling layer "Pool3".
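For readers who prefer code, a rough PyTorch rendition of this architecture string is sketched below (our implementation is in Caffe, so this is only an approximation): layer sizes follow the string, ReLU follows each convolution as stated in Sect. 4.1, the hash branch is attached after "Pool3", and the flattened feature dimension assumes PyTorch's floor-mode pooling, which may differ from Caffe's rounding.

```python
import torch
import torch.nn as nn

class DeepHashNet(nn.Module):
    def __init__(self, code_length=48, num_classes=10):
        super().__init__()
        # 3x32x32-32C3P1-32C3P1-MP3S2-D0.5-32C3P1-32C3P1-MP3S2-D0.5-
        # 64C3P1-64C3P1-AP3S2-D0.5 ("Pool3" is the final average pooling).
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2), nn.Dropout(0.5),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2), nn.Dropout(0.5),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.AvgPool2d(3, stride=2), nn.Dropout(0.5),
        )
        feat_dim = 64 * 3 * 3                               # 32 -> 15 -> 7 -> 3 with floor pooling
        self.classifier = nn.Linear(feat_dim, num_classes)  # auxiliary softmax branch (SL10)
        self.hash_layer = nn.Linear(feat_dim, code_length)  # linear hash layer, no nonlinearity
        self.hash_dropout = nn.Dropout(0.5)                 # dropout before the hinge softmax branch
        self.hash_classifier = nn.Linear(code_length, num_classes)  # assumed form of the hinge softmax branch

    def forward(self, x):
        f = self.features(x).flatten(1)
        cls_logits = self.classifier(f)                     # fed to the classification softmax loss
        h = self.hash_layer(f)                              # penalized by the hash loss, Eq. (4)
        hinge_logits = self.hash_classifier(self.hash_dropout(h))  # fed to the hinge softmax loss
        return cls_logits, h, hinge_logits
```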

The MAP of Hamming ranking is reported in Table 1 and the performance curves are shown in Fig. 2. For all four tested code lengths, i.e. 12, 24, 32 and 48 bits, the MAP of our model surpasses the state of the art by more than 5.9 percent. Figure 2(a) shows that the precision of images returned by our model within Hamming radius 2 increases with the length of the hash code. The LSH variant reported in this paper uses random projections, which rely on the input feature to maintain similarity; because of the semantic gap between low-level features and image content, LSH's performance is not stable. BRE and KSH perform better than LSH, but are still worse than the CNN-based methods. In Fig. 2(b), the precision-recall curve of our model encloses a larger area than the others, which is consistent with Table 1. Figure 2(c) demonstrates our model's best performance on Hamming ranking, ranging from \(64\,\%\) to \(69\,\%\), nearly \(8\,\%\) beyond TRCNNH.

Table 2. Ranking precisions of different methods at 500 and 1500 on CIFAR-10

Comparison Between Different Features. On CIFAR-10, we conduct several additional experiments for comparison. One is image retrieval using the Euclidean distance between the corresponding features of the query and database images as the similarity measure, where the features are exactly the outputs of the hash layer (the “L2_fc” method). Two further experiments use KSH and LSH with the outputs of “Pool3” as descriptors of the input image (“Pool3_feat”). LSH is also evaluated with 512-dimensional GIST features as image descriptors. The network used in this part is the 32-bit network. The results are shown in Table 2.

Surprisingly, we find that L2_fc is not the best, which may result from the denoising effect of binarization. KSH learns hash functions on top of the features extracted from “Pool3”, which is exactly the layer preceding our hash layer; the retrieval precision of “Pool3_feat + KSH” is inferior to our method by a small margin. The last two LSH-based experiments show that the features learned by the CNN are indeed better than GIST.

Table 3. Comparison of the proposed model with baseline on CIFAR-10.

Comparison with Baseline. We evaluate the contribution of the proposed hash layer and hinge softmax loss layer by comparing against three baseline models on CIFAR-10. The first baseline is our model without the hash loss. The second baseline is similar to [18], i.e. without the auxiliary path and the hash loss. The third baseline additionally replaces the hinge softmax loss of the second baseline with the traditional softmax loss. The parameters of the preceding convolutional layers are kept the same. The results are presented in Table 3. Our proposed method performs best on three of the four tested code lengths. The third baseline, which has none of our modifications, performs worst except at 24 bits and fluctuates with the code length.

4.3 SVHN

The SVHN dataset consists of 630,420 color house-number images collected from Google Street View. The data is provided in two formats, and the second is used in our experiments. Each sample is of size \(32\,\times \,32\) pixels and annotated with one label from 1 to 10. The dataset is preprocessed with local contrast normalization, following Goodfellow et al. [7].

On this dataset, the main structure is similar to the CIFAR-10 model: 3\(\,\times \,\)32\(\,\times \,\)32-128C3P1-128C3P1-MP2S2-D0.5-128C3P1-128C3P1-MP2S2-D0.5-128C3P1-128C3P1-AP2S2-D0.5-SL10.

Similar to the results on CIFAR-10, the MAP values of our model for the four code lengths on SVHN outperform the state of the art. Figure 3(a) shows the precision of returns with hash lookup of radius 2. When using 12-bit codes for retrieval, the precision of our model is merely comparable with TRCNNH [16], but as the code length increases, the gap between our model and TRCNNH grows. On all three metrics, our model is the best.

5 Discussion

The experiments have shown that the precision of our proposed model within Hamming radius 2 on both SVHN and CIFAR-10 improves with more hash bits, which is consistent with our intuition.

Although KSH uses kernel functions for hashing while our hash layer uses linear functions, KSH with our CNN features still performs slightly worse than our method. The hash layer can be regarded as a set of learned linear projections whose inputs are themselves learnable. During training, the weights and biases in the CNN are adjusted to make the filters expressive enough for the simple classifier on top to perform well, and at the same time the raw pixels are transformed into a description space where similarity relationships correlate with Euclidean distance.

Unlike CNNH and TRCNNH, both of which need an amount of annotations that is quadratic or even cubic in the number of training images, our method trains the CNN with single images whose labels indicate their classes, so more training data will not burden it much.

6 Conclusion

As a method for ANN search, hashing has attracted the interest of many researchers and companies. We propose a new method that obtains the binary hash code of a given image simply by binarizing the outputs of our hash layer, and it achieves the best results on both SVHN and CIFAR-10.

Considering that all the models are trained with only 5000 samples, we can expect performance to improve with a larger training set. Besides, a large amount of unlabeled data remains untouched, which could be another key source of improvement.