1 Introduction

Due to the popularity of capturing devices and the high speed of network transmission, we are witnessing an explosive growth of images, which has attracted great attention in computer vision and facilitated the development of multimedia search [35, 42], object segmentation [22, 34], object detection [26], image understanding [4, 33], etc. Without a doubt, this ever-growing abundance of images creates an urgent need for more advanced large-scale image retrieval technologies. To date, high-dimensional real-valued feature descriptors (e.g., deep Convolutional Neural Network (CNN) features [30, 37, 39] and SIFT descriptors) demonstrate superior discriminability and bridge the gap between low-level pixels and high-level semantic information, but they are inefficient for large-scale retrieval due to their high dimensionality.

Therefore, it is necessary to transform these high-dimensional features into compact binary codes, which enable machines to run retrieval in real time and with low memory cost. Existing hashing methods can be classified into two categories: data-independent and data-dependent. In the first category, hash codes are generated by randomly projecting samples into a feature space and then binarizing them, independently of any training samples. In the second category, hash functions are learned by exploring the distribution of the training data; these methods are therefore also called learning-to-hash (L2H) methods [43]. Many L2H methods have been proposed, such as Spectral Hashing (SpeH) [45], Iterative Quantization (ITQ) [9], Multiple Feature Hashing (MFH) [36], Quantization-based Hashing (QBH) [32], K-means Hashing (KMH) [12], DH [24], DPSH [20], and DeepBit [23]. These methods can be further divided into supervised and unsupervised methods, the difference being whether they use supervision information, e.g., classification labels. Representative unsupervised methods include ITQ, Isotropic Hashing [16], and DeepBit; they achieve promising results but are usually outperformed by supervised methods. By contrast, supervised methods take full advantage of the supervision information. One representative is DPSH [20], the first method that can perform simultaneous feature learning and hash-code learning with pairwise labels. However, the information that can be used for supervision is typically scarce.

To date, hand-crafted floating-point descriptors such as SIFT, Speeded-Up Robust Features (SURF) [2], DAISY [41], the Multisupport Region Order-Based Gradient Histogram (MROGH) [8], and the Multisupport Region Rotation and Intensity Monotonic Invariant Descriptor (MRRID) [8] are widely used to support image retrieval, since they are distinctive and invariant to a range of common image transformations. The authors of [29] propose a content-similarity-based fast reference frame selection algorithm to reduce the computational complexity of inter-frame prediction based on multiple reference frames. The authors of [40] develop correlation component manifold space learning (CCMSL) to learn a common feature space by capturing the correlations between heterogeneous databases. Many attempts [21, 25] have focused on compacting such high-quality floating-point descriptors to reduce computation time and memory usage and to improve matching efficiency. In those methods, however, the floating-point descriptor construction procedure is independent of the hash-code learning and still requires a massive amount of time-consuming computation. Moreover, such hand-crafted features may not be optimally compatible with hash-code learning, so these existing approaches might not achieve satisfactory performance in practice.

To overcome the limitations of existing hand-crafted-feature-based methods, deep hashing methods [7, 10, 11, 20, 46, 47] have recently been proposed to perform simultaneous feature learning and hash-code learning with deep neural networks, and they have shown better performance than traditional hashing methods with hand-crafted features. Most of these deep hashing methods are supervised, with the supervision given as triplet or pairwise labels. An example is the deep supervised hashing method by Li et al. [20], which simultaneously learns features and hash codes. Another example is Supervised Recurrent Hashing (SRH) [10], which generates hash codes of videos. Cao et al. [3] proposed a continuation method to learn binary codes, which avoids relaxing the binary constraints [10] by first learning continuous representations and then thresholding them to obtain the hash codes; they also weight the data to balance similar and dissimilar pairs.

In the real world, however, the vast majority of training data has no labels, especially in large-scale datasets. To the best of our knowledge, DeepBit [23] is the first method to learn binary descriptors with a deep neural network in an unsupervised manner, by enforcing three criteria on the binary codes. It achieves state-of-the-art performance for image retrieval, but it does not consider the data distribution in the original image space and therefore misses a lot of useful unsupervised information.

So can we obtain pairwise information by exploring the data distribution, and then use this information to guide the learning of hash codes? Motivated by this question, we propose Deep Discrete Hashing (DDH) with pseudo pairwise labels, which makes use of self-generated labels of the training data as supervision information to improve the effectiveness of the hash codes. It is worth highlighting the following contributions:

  1.

    We propose a general end-to-end framework to learn discrete hash codes in an unsupervised way, improving the effectiveness of hashing methods. The discrete binary codes are optimized directly from the training data. We handle the discrete optimization of the hash codes, which is hard, by introducing an intermediate variable.

  2.

    To explore the data distribution of the training images, we construct pairwise similarity labels on the training dataset and then train our model using this pairwise information in a self-supervised way.

  3.

    Experiments on real-world datasets show that DDH achieves significantly better performance than the state-of-the-art unsupervised deep hashing methods in image retrieval and object recognition.

2 Our Method

Given N images \(\mathbf{I } = \{\mathbf{I }_i\}^N_{i=1}\) without labels, our goal is to learn compact binary codes \(\mathbf B \) such that: (a) the binary codes preserve the data distribution of the original space, and (b) the discrete binary codes can be computed directly.

As shown in Fig. 1, DDH consists of two key components: construction of pairwise labels and the hashing loss. For training, we first construct the neighborhood structure of the images and then train the network. For testing, we obtain the binary codes of an image by feeding it to the trained network. In the remainder of this section, we first describe the process of constructing the neighborhood structure, and then introduce our loss function and the process of learning the parameters.

2.1 Construction of Pairwise Labels

In our unsupervised approach, we propose to exploit the neighborhood structure of the images in a feature space as an information source steering the process of hash learning. Specifically, we propose a method based on the K-Nearest Neighbor (KNN) concept to create a neighborhood matrix \(\mathbf S \). Following [13], we extract 2,048-dimensional features from the pool5 layer, which is the last pooling layer of ResNet [13]. This results in the set \(\mathbf X = \{\mathbf{x _{i}}\}^{N}_{i=1}\), where \(\mathbf x _{i}\) is the feature vector of image \(\mathbf I _i\).
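For concreteness, this feature-extraction step can be sketched as follows, assuming a recent torchvision and its ResNet-50, whose global-average-pooling output is 2,048-dimensional like the pool5 features used here (the paper does not specify the exact ResNet variant):

```python
import torch
import torchvision.models as models

# Drop the final fc layer so the output is the 2048-d pooled feature.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
extractor.eval()

@torch.no_grad()
def extract_features(images: torch.Tensor) -> torch.Tensor:
    # images: (batch, 3, 224, 224), ImageNet-normalized.
    return extractor(images).flatten(1)  # (batch, 2048)
```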

Fig. 1.

The structure of our end-to-end framework. It has two components: construction of pairwise labels and the hashing loss. We first construct the neighborhood structure of the images and then train the network based on the defined loss function. We use a deep neural network to extract the features of the images.

For the representation of the neighborhood structure, our task is to construct a matrix \(\mathbf S =(s_{ij})^N_{i,j=1}\) whose elements indicate the similarity (\(s_{ij}=1\)) or dissimilarity (\(s_{ij} = -1\)) of any two images i and j in terms of their features \(\mathbf x _{i}\) and \(\mathbf x _{j}\).

We compare images using the cosine similarity of their feature vectors. For each image, we select the \(K_1\) images with the highest cosine similarity as its neighbors. Then we construct an initial similarity matrix \(\mathbf S _1\):

$$\begin{aligned} {{\left( \mathbf S _1\right) }_{ij}} = \left\{ {\begin{array}{*{20}{l}} {1,~\text {if}~\mathbf x _{j}~\text {is among the }K_1\text {-NN of}~\mathbf x _i}\\ {0,~\text {otherwise}} \end{array}} \right. \end{aligned}$$
(1)

Here we use \(\mathbf L _1 , \mathbf L _2 , \ldots , \mathbf L _N \) to denote the \({K}_1\)-NN ranking lists of images \(\mathbf I _1, \mathbf I _2, \ldots , \mathbf I _N\). The precision-recall curve in Fig. 2 indicates the quality of the constructed neighborhood for different values of \({K}_1\). Because precision decreases rapidly as \({K}_1\) increases, creating a large-enough neighborhood by simply increasing \({K}_1\) is not the best option. To find a better approach, we borrow ideas from graph modeling. In an undirected graph, if a node v is connected to a node u and u is connected to a node w, we can infer that v is likely also related to w. Inspired by this, if we treat every training image as a node in an undirected graph, we can expand the neighborhood of an image i by exploring the neighbors of its neighbors: if \(\mathbf x _{i}\) connects to \(\mathbf x _{j}\) and \(\mathbf x _{j}\) connects to \(\mathbf x _{k}\), we infer that \(\mathbf x _{i}\) has the potential to be connected to \(\mathbf x _{k}\) as well.

Fig. 2.

Precision of the constructed labels on the CIFAR-10 dataset for different values of K and different methods.

Based on the above observations, we construct \(\mathbf S _1\) using the deep CNN features. If we only use the labels constructed via \(\mathbf S _1\), each image has too few positive labels, even though they have high precision. We therefore increase the number of neighbors based on \(\mathbf S _1\) to obtain more positive labels. Specifically, we calculate the similarity of two images by comparing their \({K}_1\)-NN ranking lists using the score \(\frac{1}{||\mathbf L _i- \mathbf L _j||^2}\); intuitively, if two images have the same label, their \(K_1\)-NN ranking lists should intersect heavily, i.e., the lists should be similar. We then construct a ranking list of \(K_2\) neighbors based on this score, from which we generate the second similarity matrix \(\mathbf S _2\):

$$\begin{aligned} {{\left( \mathbf S _2\right) }_{ij}} = \left\{ {\begin{array}{*{20}{l}} {1,~\text {if}~\mathbf L _{j}~\text {is among the }K_2\text {-NN of}~\mathbf L _i}\\ {0,~\text {otherwise}} \end{array}} \right. \end{aligned}$$
(2)

Finally, we construct the neighborhood structure by combining the direct and indirect similarities from the two matrices. This results in the final similarity matrix \(\mathbf S \):

$$\begin{aligned} {S_{ij}} = \left\{ {\begin{array}{*{20}{l}} {1, ~\text {if}~{\left( \mathbf S _2\right) }_{ik}=1~\text {and}~j \in \mathbf L _k}\\ {0,~\text {otherwise}} \end{array}} \right. \end{aligned}$$
(3)

where \(\mathbf{L }_k\) is the \({K}_1\)-NN ranking list of image k. The whole procedure is shown in Algorithm 1; a sketch of the construction in code is given below. After the two KNN steps, the precision of the constructed labels is as shown in Fig. 2. We note that we could also have omitted this preprocessing step and constructed the neighborhood structure directly during the learning of our neural network. We found, however, that the construction of the neighborhood structure is time-consuming and that updating it along with the image features in each epoch does not have a significant impact on performance. Therefore, we obtain the neighborhood structure as described above and fix it for the rest of the process.

Algorithm 1. Construction of the pairwise similarity matrix \(\mathbf S \)
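A minimal NumPy sketch of this two-step construction, under our reading of Eqs. 1-3 (function and variable names are illustrative; the dense O(N²) intermediates are fine for a sketch but would need batching at scale):

```python
import numpy as np

def build_similarity_matrix(features: np.ndarray, K1: int = 15, K2: int = 6) -> np.ndarray:
    """Two-step KNN construction of the pairwise matrix S (Eqs. 1-3)."""
    N = features.shape[0]
    # Cosine similarity: L2-normalize rows, then take inner products.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    cos_sim = normed @ normed.T

    # Step 1 (Eq. 1): K1-NN ranking lists L_i; rank 0 is the image itself.
    L = np.argsort(-cos_sim, axis=1)[:, 1:K1 + 1]

    # Step 2 (Eq. 2): score pairs by 1 / ||L_i - L_j||^2 over the ranking
    # lists, then keep the K2 highest-scoring list-neighbors of each image.
    list_dist = ((L[:, None, :] - L[None, :, :]) ** 2).sum(axis=2)
    list_sim = 1.0 / (list_dist + 1e-12)
    S2 = np.argsort(-list_sim, axis=1)[:, 1:K2 + 1]  # drop self at rank 0

    # Step 3 (Eq. 3): i and j are similar if j appears in the K1-NN list of
    # one of i's K2 list-neighbors k. Non-neighbors get -1 to match the
    # {-1, +1} convention used in the loss (Eq. 5).
    S = -np.ones((N, N), dtype=np.int8)
    for i in range(N):
        for k in S2[i]:
            S[i, L[k]] = 1
    return S
```

The defaults \(K_1 = 15\) and \(K_2 = 6\) follow the settings used in our experiments (Sect. 3.2).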

2.2 Architecture of Our Network

We introduce an unsupervised deep framework, dubbed Deep Discrete Hashing (DDH), to learn compact yet discriminative binary descriptors. The framework includes two main modules, feature learning and hash-code learning, as shown in Fig. 1. For feature learning, we use a network similar to the one in [48], which has seven layers; the details are shown in Table 1. In the experiments, we can easily replace this CNN-F network with other deep networks such as [13, 18, 38]. Our framework has two branches that share the same weights and network structure.

We discard the last softmax layer and replace it with a hashing layer, which consists of a fully connected layer followed by a sgn activation to generate compact codes. Specifically, the output of \(full_7\) is first mapped to an L-dimensional real-valued code, which is then converted to a binary hash code \(\mathbf b \) taking values of either \(+1\) or \(-1\). This binarization is performed by applying the sign function \(\mathbf b = sgn(\mathbf u )\) as the activation function on top of the hash layer.

$$\begin{aligned} \mathbf b = \mathrm{sgn} (\mathbf u ) = \left\{ {\begin{array}{*{20}{l}} { + 1,~\text {if}~\mathbf u \ge 0}\\ { - 1,~\text {otherwise}} \end{array}} \right. \end{aligned}$$
(4)
Table 1. The configuration of our framework
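A minimal PyTorch sketch of this hashing branch, assuming a backbone whose \(full_7\)-style output is 4096-dimensional as in Table 1 (the backbone itself is a placeholder; any network from [13, 18, 38] with a matching output dimension could be substituted):

```python
import torch
import torch.nn as nn

class DDHNet(nn.Module):
    """CNN backbone + hashing layer (fully connected + sgn), as in Fig. 1."""

    def __init__(self, backbone: nn.Module, code_length: int):
        super().__init__()
        self.backbone = backbone                     # image -> 4096-d full_7 feature
        self.hash_fc = nn.Linear(4096, code_length)  # z = W^T phi(x) + c, Eq. (11)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Real-valued L-dimensional code z used during training (Eq. 10).
        return self.hash_fc(self.backbone(x))

    @torch.no_grad()
    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # b = sgn(z), Eq. (4); zeros map to +1 as in the paper.
        z = self.forward(x)
        return torch.where(z >= 0, torch.ones_like(z), -torch.ones_like(z))
```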

2.3 Objective Function

Let \(\mathbf B = \left\{ {\mathbf{b _i}} \right\} _{i = 1}^N\) denote the binary codes of all the images. The neighborhood structure loss measures how well the hash codes preserve the similarity structure of the data, as revealed by the constructed neighborhood. We define the loss function as below:

$$\begin{aligned} \min {J_1} = \frac{1}{2}{\sum \limits _{{s_{ij}} \in S} {\left( {\frac{1}{L}\mathbf{b _i}^T\mathbf{b _j} - {s_{ij}}} \right) } ^2} \end{aligned}$$
(5)

where L is the length of the hash code and \({s_{ij}} \in \{ - 1,1\}\) indicates the similarity of images i and j. Since \(\mathbf{b _i}^T\mathbf{b _j} = L - 2d_H(\mathbf{b _i},\mathbf{b _j})\), where \(d_H\) denotes the Hamming distance, minimizing this loss brings the binary codes of similar images as close to each other as possible.

We also want to minimize the quantization loss between the learned binary vectors \(\mathbf B \) and the original real-valued vectors \(\mathbf Z \). It is defined as:

$$\begin{aligned} \min {J_2} = \frac{1}{2}\sum \limits _{i = 1}^N {{{\left\| {\mathbf{z _i} - \mathbf{b _i}} \right\| }^2}} \end{aligned}$$
(6)

where \(\mathbf z _i\) and \(\mathbf b _i\) are the real-valued representation and the binary code of the i-th image in the hashing layer, respectively. We then obtain our final objective function:

$$\begin{aligned} \min {J} = J_1+\lambda _1J_2, \qquad {\mathbf{b }_{i}}\in {{\{-1,1\}}^{L}},~~\forall i=1,2,3,{\ldots }N \end{aligned}$$
(7)

where \(\lambda _1\) is the parameter to balance these two terms.

Obviously, the problem in (7) is a discrete optimization problem, which is hard to solve. LFH [48] solves it by directly relaxing \(\mathbf b _{i}\) from discrete to continuous values, which might not achieve satisfactory performance [15]. In this paper, we design a novel strategy that solves problem (7) by introducing an intermediate variable. First, we reformulate problem (7) as the following equivalent one:

$$\begin{aligned} \begin{aligned}&\min {{J}}=\frac{1}{2}\sum \limits _{{{s}_{ij}}\in S}{{{\left( \frac{1}{L}{\mathbf{b }_{i}}^{T}{\mathbf{b }_{j}}-{{s}_{ij}} \right) }^{2}}} + \frac{\lambda _1}{2}\sum \limits _{i = 1}^N {{{\left\| {\mathbf{z _i} - \mathbf{b _i}} \right\| }^2}} \\&s.t \quad {\mathbf{u }_{i}}={\mathbf{b }_{i}},\forall i=1,2,3,{\ldots }N \\&\qquad {\mathbf{u }_{i}}\in {{\mathbb {R}}^{L\times 1}},\forall i=1,2,3,\ldots N \\&\qquad {\mathbf{b }_{i}}\in {{\{-1,1\}}^{L}},\forall i=1,2,3,\ldots N \\ \end{aligned} \end{aligned}$$
(8)

where \(\mathbf u _i\) is an intermediate variable and \(\mathbf{b _i} = sgn(\mathbf{u _i})\). To optimize the problem in (8), we move its equality constraints into regularization terms and optimize the following regularized problem:

$$\begin{aligned} \begin{aligned}&\min {{J}}=\frac{1}{2}\sum \limits _{{{s}_{ij}}\in \mathbf S }{{{\left( \frac{1}{L}\mathbf{u _{i}}^{T}\mathbf{u _{j}}-{{s}_{ij}} \right) }^{2}}} + \frac{\lambda _1}{2}\sum \limits _{i = 1}^N {{{\left\| {\mathbf{z _i} - \mathbf{b _i}} \right\| }^2}} +\frac{{{\lambda _2}}}{2}\sum \limits _{i = 1}^N {{{\left\| {\mathbf{b _i} - \mathbf{u _i}} \right\| }^2}} \\&s.t \quad \mathbf{u _{i}}\in {{\mathbb {R}}^{L\times 1}},\forall i=1,2,3,\ldots N \\&\qquad \mathbf{b _{i}}\in {{\{-1,1\}}^{L}},\forall i=1,2,3,\ldots N \end{aligned} \end{aligned}$$
(9)

where \(\lambda _{2}\) is the hyper-parameter for the regularization term. Introducing the intermediate variable \(\mathbf u \) is equivalent to adding another fully connected layer between \(\mathbf z \) and \(\mathbf b \) in the hashing layer. To reduce the complexity of our model, we let \(\mathbf z =\mathbf u \), which yields the simplified objective function:

$$\begin{aligned} \begin{aligned}&\min {{J}}=\frac{1}{2}\sum \limits _{{{s}_{ij}}\in \mathbf S }{{{\left( \frac{1}{L}\mathbf{z _{i}}^{T}\mathbf{z _{j}}-{{s}_{ij}} \right) }^{2}}} + \frac{\lambda _1}{2}\sum \limits _{i = 1}^N {{{\left\| {\mathbf{z _i} - \mathbf{b _i}} \right\| }^2}} \\&s.t \quad \mathbf{b _{i}}\in {{\{-1,1\}}^{L}},\forall i=1,2,3,\ldots N \end{aligned} \end{aligned}$$
(10)

The objective in Eq. 10 is no longer discrete and is differentiable with respect to \(\mathbf z _i\) (with \(\mathbf b _i = sgn(\mathbf z _i)\) held fixed), so we can use back-propagation (BP) to optimize it; a sketch of this loss in code is given below.
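A minimal PyTorch sketch of the loss in Eq. 10 on a mini-batch (the names are ours; the paper sums over all pairs in \(\mathbf S \), whereas here only the sampled block is used):

```python
import torch

def ddh_loss(z: torch.Tensor, s: torch.Tensor, lambda1: float = 15.0) -> torch.Tensor:
    """Eq. (10) on a mini-batch: neighborhood term + quantization term.

    z: (B, L) real-valued codes; s: (B, B) similarity block with entries in {-1, +1}.
    """
    L = z.shape[1]
    # b = sgn(z), detached: no gradient flows through the sign function.
    b = torch.where(z >= 0, torch.ones_like(z), -torch.ones_like(z)).detach()
    inner = z @ z.t() / L                  # (1/L) z_i^T z_j for all pairs
    neighborhood = 0.5 * ((inner - s) ** 2).sum()
    quantization = 0.5 * ((z - b) ** 2).sum()
    return neighborhood + lambda1 * quantization
```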

2.4 Learning

To learn the DDH model, we need to obtain the parameters of the neural networks. We set

$$\begin{aligned} \begin{aligned}&{\mathbf{z }_{i}}={{\mathbf {W}}^{T}}\phi \left( {{x}_{i}};\theta \right) +\mathbf {c}\\&\mathbf{b }_{i} = sgn(\mathbf{z _i}) = sgn({\mathbf{{W}}^T}\phi \left( {{x_i};\theta } \right) + \mathbf{{c}}) \end{aligned} \end{aligned}$$
(11)

where \(\theta \) denotes all the parameters of the CNN-F network for learning the features, and \(\phi \left( {{x}_{i}};\theta \right) \) denotes the output of the \(full_7\) layer for image \(x_{i}\). \(\mathbf {W}\in {{\mathbb {R}}^{4096\times L}}\) denotes the hash-layer weight matrix, and \(\mathbf {c}\in {{\mathbb {R}}^{L\times 1}}\) is a bias vector. We add regularization terms on the parameters and rewrite the loss function of Eq. 10 under these definitions as:

$$\begin{aligned} \begin{aligned} \min {J} =&\,\frac{1}{2}\sum \limits _{{s_{ij}} \in S} {{{\left( {\frac{1}{L}{\varTheta _{ij}} - {s_{ij}}} \right) }^2}} + \frac{{{\lambda _1}}}{2}\sum \limits _{i = 1}^N {{{\left\| {\mathbf{b _i} - ({\mathbf{{W}}^T}\phi \left( {{x_i};\theta } \right) + \mathbf{{c}})} \right\| }^2}} \\&+ \frac{{{\lambda _2}}}{2}\left( \left\| \mathbf{{W}} \right\| _F^2 + \left\| \mathbf{{c}} \right\| _F^2\right) \end{aligned} \end{aligned}$$
(12)

where \({{\varTheta }_{ij}}={{z}_{i}}^{T}{{z}_{j}}\), and \(\lambda _{1}\) and \(\lambda _{2}\) are parameters that balance the effect of the different terms. Stochastic gradient descent (SGD) is used to learn the parameters. We initialize our network with the CNN-F network pre-trained on ImageNet. In each iteration, we sample a mini-batch of points from the whole training set and use back-propagation (BP) to optimize the whole network. The derivative of the loss function with respect to \(\mathbf z _i\) is:

$$\begin{aligned} \frac{{\partial {J}}}{{\partial \mathbf{z _i}}} = \sum \limits _{j:{s_{ij}} \in S} {\left( \frac{1}{{{L^2}}}(\mathbf{z _i}^T\mathbf{z _j})\mathbf{z _j} - {\frac{1}{{{L}}}s_{ij}}\mathbf{z _j} \right)} + {\lambda _1}(\mathbf{z _i} - \mathbf{b _i}) \end{aligned}$$
(13)
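In practice, automatic differentiation reproduces this gradient, so a training step can be sketched as follows (building on the sketches above; data loading and mini-batch pair sampling are application-specific):

```python
import torch

def train_step(model, images, s_block, optimizer, lambda1=15.0):
    """One SGD/BP iteration on a sampled mini-batch."""
    optimizer.zero_grad()
    z = model(images)                     # CNN + hash layer forward pass
    loss = ddh_loss(z, s_block, lambda1)  # Eq. (10)
    loss.backward()                       # autograd reproduces Eq. (13)
    optimizer.step()
    return loss.item()
```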

2.5 Out-of-Sample Extension

After the network has been trained, we still need to obtain the hash codes of images that are not in the training data. For a novel image, we obtain its binary code by feeding it into the DDH network and performing a forward propagation:

$$\begin{aligned} \mathbf{b }_{i} = sgn(\mathbf{z _i}) = sgn({\mathbf{{W}}^T}\phi \left( {{x_i};\theta } \right) + \mathbf{{c}}) \end{aligned}$$
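This step reduces to a forward pass plus the sign function; a sketch, reusing the encode method of the DDHNet sketch above (the batch loader is an assumption):

```python
import torch

@torch.no_grad()
def encode_database(model, loader):
    """Binary codes for unseen images: one forward pass plus sgn (Eq. 4)."""
    model.eval()
    codes = [model.encode(batch) for batch in loader]
    return torch.cat(codes, dim=0)  # (N, L) matrix with entries in {-1, +1}
```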

3 Experiment

Our experiments are run on a PC with an Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30 GHz with 40 cores, 128 GB of RAM, and a GeForce GTX TITAN X GPU with 12 GB of memory.

3.1 Datasets

We conduct experiments on three challenging datasets: the Oxford 17 Category Flower Dataset, the CIFAR-10 color images, and NUS-WIDE. We test our binary descriptor on various tasks, including image retrieval and image classification.

  1.

    CIFAR-10 Dataset [17] contains 10 object categories and each class consists of 6,000 images, resulting in a total of 60,000 images. The dataset is split into training and test sets, with 50,000 and 10,000 images respectively.

  2.

    NUS-WIDE dataset [5] contains nearly 270,000 images collected from the web. It is a multi-label dataset in which each image is annotated with one or more class labels from a set of 81 classes. Following [19], we only use the images associated with the 21 most frequent classes, each of which contains at least 5,000 images. We use 4,000 images for training and 1,000 for testing.

  3.

    The Oxford 17 Category Flower Dataset [27] contains 17 categories and each class consists of 80 images, resulting in a total of 1,360 images. The dataset is split into training (40 images per class), validation (20 images per class), and test (20 images per class) sets.

3.2 Results on Image Retrieval

To evaluate the performance of the proposed DDH, we test our method on the task of image retrieval. We compare DDH with other hashing methods, including LSH [1], ITQ [9], HS [31], Spectral Hashing (SpeH) [45], Spherical Hashing (SphH) [14], KMH [12], Deep Hashing (DH) [24], DeepBit [23], and semi-supervised PCAH [44], on the CIFAR-10 and NUS-WIDE datasets. We set \(K_1 = 15\) and \(K_2 = 6\) to construct the labels, the learning rate to 0.001, \(\lambda _{1} = 15\), \(\lambda _{2} = 0.00001\), and the batch size to 128. Table 2 shows the CIFAR-10 retrieval results in terms of the mean Average Precision (mAP) of the top 1,000 returned images with respect to different bit lengths, while Table 3 shows the mAP on the NUS-WIDE dataset calculated based on the top 5,000 returned neighbors; a sketch of this evaluation protocol is given below the tables. The precision/recall curves on the CIFAR-10 dataset are shown in Fig. 3.

Table 2. Performance comparison (mAP) of different unsupervised hashing algorithms on the CIFAR-10 dataset. The mAP is calculated based on the top 1,000 returned images with respect to different numbers of hash bits.
Table 3. Performance comparison (mAP) of different unsupervised hashing algorithms on the NUS-WIDE dataset. The mAP is calculated based on the top 5,000 returned neighbors.
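For reference, a sketch of this evaluation protocol (one common mAP@k variant; the exact tie-breaking and normalization in the original evaluation code may differ):

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, relevance, k=1000):
    """mAP over the top-k Hamming-ranked database items per query.

    Codes have entries in {-1, +1}; relevance[i, j] = 1 iff database item j
    is a true neighbor of query i according to the ground-truth labels.
    """
    L = query_codes.shape[1]
    # Hamming distance from inner products: d_H = (L - b_q^T b_d) / 2.
    dist = 0.5 * (L - query_codes @ db_codes.T)
    aps = []
    for i in range(query_codes.shape[0]):
        top = np.argsort(dist[i])[:k]
        rel = relevance[i, top]
        if rel.sum() == 0:
            aps.append(0.0)
            continue
        prec_at = np.cumsum(rel) / (np.arange(len(top)) + 1)
        aps.append(float((prec_at * rel).sum() / rel.sum()))
    return float(np.mean(aps))
```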

From these results, we have the following observations:

  (1)

    Our method significantly outperforms the other deep and non-deep hashing methods on all datasets. On CIFAR-10, the improvement of DDH over the other methods is more significant than on NUS-WIDE. Specifically, it outperforms the best counterpart (DeepBit) by 25.3%, 23.7%, and 25.8% for 16-, 32-, and 64-bit hash codes. One possible reason is that CIFAR-10 contains simple images, so the constructed neighborhood structure is more accurate than for the other datasets. DDH improves the state of the art by 5.4%, 3.0%, 3.6%, and 2.7% on the NUS-WIDE dataset.

  (2)

    Table 2 shows that DeepBit and FastH are strong competitors in terms of mAP on the CIFAR-10 and NUS-WIDE datasets. But the performance gap between DeepBit and our DDH is still very large, probably because DeepBit uses only 3 fully connected layers to extract features. Figure 3 shows that most hashing methods achieve high precision when the number of retrieved samples (i.e., the recall) is small. But obviously, our DDH significantly outperforms the others.

  (3)

    With the increase of code length, the performance of most hashing methods improves accordingly. An exception is PCAH, which shows no improvement as the code length increases.

Fig. 3.

Precision/recall curves of different unsupervised hashing methods on the CIFAR-10 dataset with respect to 16, 32, and 64 bits, respectively.

To make a fair comparison with the non-deep hashing methods, and to validate that our improvement is not only caused by the deep features, we also run the non-deep hashing methods on deep features extracted by the CNN-F network pre-trained on ImageNet. The results are reported in Table 4, where "ITQ+CNN" denotes the ITQ method with deep features, and the other methods are denoted similarly. When run on deep features, the non-deep hashing methods usually improve compared with using hand-crafted features.

Table 4. Performance comparison (mAP) of different hashing algorithms with deep features on the CIFAR-10 dataset.

By constructing the neighborhood structure from the true labels, our method can easily be turned into a supervised hashing method. We therefore also compare with supervised hashing methods; the mAP results on the NUS-WIDE dataset are shown in Table 5. It is obvious that our DDH outperforms the state-of-the-art deep and non-deep supervised hashing algorithms by a large margin: 5.7%, 5.8%, 7.8%, and 8.1% for 12-, 24-, 32-, and 48-bit hash codes. This indicates that the performance improvement of DDH is due not only to the constructed neighborhood structure but also to the other components.

Table 5. Results on NUS-WIDE, compared with deep networks trained with supervised pairwise labels. The mAP values are calculated based on the top 5,000 returned neighbors.

3.3 Results on Object Recognition

In the task of object recognition, the algorithm needs to distinguish very similar objects (e.g., daisy, iris, and pansy), which requires highly discriminative binary codes for images that look very similar. We use the Oxford 17 Category Flower Dataset to evaluate our method on object recognition and compare with several real-valued descriptors such as HOG [6] and SIFT.

Flower recognition is challenging due to variations in color distribution, pose deformation, and shape. Moreover, recognizing flowers in the wild on mobile devices requires low computation cost, which motivates generating very short and efficient binary codes to discriminate flowers. Following the setting in [27], we train a multi-class SVM classifier with our proposed binary descriptor; a sketch of this protocol is given below. Table 6 compares the classification accuracy on the 17 flower categories using the different descriptors proposed in [27, 28], including low-level features (color, shape, texture) and high-level features (SIFT and HOG). Our proposed binary descriptor with 256 bits improves the previous best recognition accuracy by around 5.01% (80.11% vs. 75.1%). Our method with 64 bits still outperforms the state-of-the-art result (76.35% vs. 75.1%). We also measure the computational cost of training the SVM classifier: it takes only 0.3 s for 256 bits and 0.17 s for 64 bits. Compared with other descriptors, such as color, shape, texture, HOG, HSV, and SIFT, DDH demonstrates both efficiency and effectiveness.
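A minimal sketch of this classification protocol, assuming a linear SVM via scikit-learn (the paper does not specify the SVM variant or its hyper-parameters, so these are assumptions):

```python
from sklearn.svm import LinearSVC

def flower_classification(train_codes, train_labels, test_codes, test_labels):
    """Multi-class SVM on binary codes, following the protocol of [27].

    Codes are (n, L) arrays with entries in {-1, +1}, e.g., from
    encode_database above; labels are integer class ids.
    """
    clf = LinearSVC(C=1.0)  # one-vs-rest linear SVM; C is an assumption
    clf.fit(train_codes, train_labels)
    return clf.score(test_codes, test_labels)  # mean accuracy
```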

Table 6. The categorization accuracy (mean%) and training time for different features on the Oxford 17 Category Flower Dataset

4 Conclusion and Future Work

In this work, we address two central problems that remain largely unsolved in image hashing: (1) how to generate binary codes directly, without relaxation, and (2) how to equip the binary representation with the ability to support accurate image retrieval. We resolve these problems by introducing an intermediate variable and a loss function steering the learning process, based on the neighborhood structure in the original space. Experiments on real-world datasets show that our method outperforms other unsupervised and supervised methods, achieving state-of-the-art performance in image retrieval and object recognition. In the future, we plan to improve the classification accuracy by incorporating a classification layer at the end of this architecture.