
1 Introduction

With the explosive growth of the Internet, massive numbers of images have flooded our daily lives. Image retrieval, i.e., finding images that contain the same object or scene as a query image, has therefore attracted increasing attention from researchers.

Recently, many studies have reported that deep Convolutional Neural Networks (CNNs) achieve state-of-the-art performance in many computer vision tasks [1,2,3]. Notably, several works [4, 5] have demonstrated the suitability of features from fully-connected layers for image retrieval, while others [6,7,8] focused on features from deep convolutional layers and showed that these features can be naturally interpreted as descriptors of local image regions. However, most CNN features used for image retrieval are extracted directly from classification models and therefore suffer from low precision; moreover, such features carry rich but unfocused semantic information that can distract from the target sense of the query. Early work by Zhou et al. [9] revealed that the convolutional units of CNNs actually behave as object detectors and proposed a method to generate Class Activation Maps (CAMs) [10] for localizing discriminative image regions, which makes deep localizable representations available for visual tasks.

Besides, traditional nearest neighbor search methods face the high computational cost of similarity calculation on high-dimensional features and are thus ill-suited to rapid retrieval, especially in the age of big data. A practical alternative is hashing-based methods [11,12,13]. A hashing method designs a group of functions that project images into binary codes such that similar images are mapped to similar codes; retrieval can then be performed efficiently by computing Hamming distances. Benefiting from deep learning, several researchers [13,14,15,16,17] combined image representation learning and hash learning in one CNN architecture to learn semantic-preserving binary codes. Although these methods achieve outstanding performance, they shed no light on the relation between each bit and a semantic concept.
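To make the speed advantage concrete, here is a minimal numpy sketch of Hamming-distance ranking over packed binary codes; the function name, code length, and database size are illustrative, not part of the paper:

```python
import numpy as np

def hamming_distances(query_code, db_codes):
    """Hamming distances between one packed binary code and a database.

    query_code: uint8 array of packed bits, shape (n_bytes,)
    db_codes:   uint8 array, shape (n_images, n_bytes)
    """
    diff = np.bitwise_xor(db_codes, query_code)      # differing bits
    return np.unpackbits(diff, axis=1).sum(axis=1)   # popcount per image

# Usage: rank 10,000 database images by 48-bit (6-byte) codes.
rng = np.random.default_rng(0)
db = rng.integers(0, 256, size=(10000, 6), dtype=np.uint8)
q = rng.integers(0, 256, size=6, dtype=np.uint8)
ranking = np.argsort(hamming_distances(q, db))
```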

Fig. 1. The proposed DSCNN framework. First, the semantic-preserving global-level hash code \(H_g\) and local-level hash code \(H_l\) are learned. Second, we compute the CAM of each bit of \(H_g\) and average these CAMs to obtain the 'hash attention area', and we obtain the 'local hash attention areas' from the activation maps corresponding to each bit of \(H_l\). Then the visually highlighted bits (colored red) are selected as the compact hash code. Finally, we retrieve similar images with the presented multi-level search strategies. (Color figure online)

In this paper, we propose a deep siamese CNN (DSCNN) framework to learn semantic-preserving hash codes, and we design the last convolutional layer of DSCNN to produce local-level hash codes, which is essentially different from other methods [13,14,15]. Moreover, we propose a novel method to obtain compact bits with salient local semantics. Finally, we present a multi-level hash search method for retrieval.

2 Our Method

Learning Semantic-Preserved Hash Code. It is feasible to embed a latent layer in the upper part of a network to output a global binary code [13,14,15]. We follow this practice and use both label and pairwise information to guide hash learning. In addition, inspired by the findings of [4], we propose to hash convolutional activations. As Fig. 1 shows, the hash layer and conv7 both use the tanh activation function, and we impose constraints on these layers to embed semantic information. Assume that the feature maps of conv7 are \(\left\{ I_i\right\} _{i=1}^C\in (-1,1)^{W\times H}\), where W and H are the width and height and C is the number of filters, and that the output of the hash layer is \(a\in (-1,1)^H\), where H is the length of the hash code. Let \(\hat{y}\) be the output of the softmax layer and y the expected output. We minimize the loss functions defined below to learn the parameters of our network. For the local-level hash:

$$\begin{aligned} L_1=-\sum _{j=1}^{N} y_j\log (\hat{y}_j) \end{aligned}$$
(1)

For global-level hash:

$$\begin{aligned} \begin{aligned} L_2&=L_1+\alpha J_{11}+\alpha J_{12}+\beta J_2+\gamma J_3 \\&=-\sum _{j=1}^{N} y_j\log (\hat{y}_j)+\alpha \sum _{j=1}^{N} \sum _{i=1}^{N}\delta (y_j=y_i)\Vert a_j-a_i\Vert _2^{2} \\&+\alpha \sum _{j=1}^{N} \sum _{i=1}^{N}\delta (y_j\ne y_i)\max (0,c-\Vert a_j-a_i\Vert _2^{2}) \\&+\beta \sum _{j=1}^{N} (\Vert |a_j|-1\Vert ^2)+\gamma \sum _{j=1}^{N} (\Vert avg (a_j)-0\Vert ^2) \\ \end{aligned} \end{aligned}$$
(2)

where \(\delta \) is the indicator function, \( avg \) is the mean function, c is a constant, and N is the number of images. The terms \(L_1\) and \(J_{1*}\) embed semantic consistency and pairwise similarity into the hash code, respectively. The term \(J_2\) minimizes the quantization loss between the learned binary code and the original feature. The last term \(J_3\) encourages an even distribution of \(-1\) and 1 in the hash code. \(\alpha ,\beta ,\gamma \) are parameters that balance the effect of the different terms.
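For concreteness, below is a minimal PyTorch-style sketch of Eq. (2) over a mini-batch; the paper's implementation uses Caffe, so the function name, batch-wise pair construction, and default hyperparameter values here are our own illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def global_hash_loss(logits, labels, a, alpha=1.0, beta=1.0, gamma=1.0, c=4.0):
    """Sketch of Eq. (2) over a mini-batch.

    logits: (N, n_classes) inputs to the softmax layer
    labels: (N,) ground-truth class indices
    a:      (N, H) tanh outputs of the hash layer, in (-1, 1)
    """
    l1 = F.cross_entropy(logits, labels, reduction='sum')      # L1
    d = torch.cdist(a, a).pow(2)                               # ||a_j - a_i||_2^2
    same = (labels[:, None] == labels[None, :]).float()
    j11 = (same * d).sum()                                     # pull similar pairs together
    j12 = ((1.0 - same) * F.relu(c - d)).sum()                 # push dissimilar pairs apart
    j2 = (a.abs() - 1.0).pow(2).sum()                          # quantization loss
    j3 = a.mean(dim=1).pow(2).sum()                            # bit-balance term
    return l1 + alpha * (j11 + j12) + beta * j2 + gamma * j3
```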

Finally, the global-level hash code \(H_g\) and the local-level hash code \(H_l\) are defined as:

$$\begin{aligned} \begin{aligned}&H_g=\delta (a>0),\quad H_l=\delta (f>0) \\&\text {where } f\in (-1,1)^{C},\quad f_k=\frac{1}{ W\times H }\sum _{i=1}^W \sum _{j=1}^H I_k(i,j)\\ \end{aligned} \end{aligned}$$
(3)
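A minimal numpy sketch of Eq. (3), assuming the network's tanh activations are already available as arrays (the function name is illustrative):

```python
import numpy as np

def hash_codes(a, conv7_maps):
    """Sketch of Eq. (3): binarize the hash layer and the pooled conv7 maps.

    a:          (H,) tanh output of the hash layer
    conv7_maps: (C, W, H) tanh feature maps I_k of conv7
    """
    h_g = (a > 0).astype(np.uint8)            # global-level code H_g
    f = conv7_maps.mean(axis=(1, 2))          # f_k: spatial average of I_k
    h_l = (f > 0).astype(np.uint8)            # local-level code H_l
    return h_g, h_l
```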

Selecting Compact Bits. Deep convolutional feature maps are activated by different regions [18, 19]. Through careful observation we found that some feature maps are not related to the salient area, so it may be possible to boost feature discrimination by discarding unrelated feature maps. We therefore propose to select compact bits to strengthen retrieval performance.

The first stage is to capture the attention region of \(H_g\). We compute the CAMs of \(H_g\), average them into \(M_{avg}\), and binarize the result as \(B_{avg}=\delta (M_{avg}>\theta )\), where \(\theta \) is a threshold. We then obtain the attention region by finding the largest connected component of \(B_{avg}\), as Fig. 1 shows.
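A sketch of this stage, assuming the CAMs are already computed and normalized to [0, 1]; the threshold value and the use of scipy for connected components are our assumptions:

```python
import numpy as np
from scipy import ndimage

def hash_attention_area(cams, theta=0.5):
    """Average the CAMs of H_g's bits and keep the largest connected region.

    cams: (n_bits, W, H) class activation maps, normalized to [0, 1]
    """
    m_avg = cams.mean(axis=0)                         # M_avg
    b_avg = m_avg > theta                             # B_avg
    labeled, n = ndimage.label(b_avg)                 # connected components
    if n == 0:
        return b_avg
    sizes = ndimage.sum(b_avg, labeled, range(1, n + 1))
    return labeled == (np.argmax(sizes) + 1)          # largest component only
```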

The second stage is selecting local feature maps. We convert all feature maps of conv7 into activation maps \(\left\{ AM_i\right\} _{i=1}^C\) by up-sampling and obtain the corresponding binary maps \(\left\{ B_i\right\} _{i=1}^C\) as in the first stage. We define the score measuring how relevant a feature map is to the salient area as follows:

$$\begin{aligned} S(B_i,B_{avg})=sum(B_i\wedge B_{avg}) \end{aligned}$$
(4)

where \(\wedge \) is the element-wise AND operation and sum adds up all elements of a matrix.

In the last stage, we rank \(I_1,I_2,\ldots ,I_C\) by their scores S and select the top L filters as informative local features. We then choose the associated L bits of \(H_l\) as \(H_l^{'}\) for efficient retrieval. In our experiments, we only compare the L selected positions of the query's local-level hash code with the bits at the corresponding positions of the other codes:

$$\begin{aligned} H_q^{'}=\varPsi _q(H_q), d_H(H_q^{'},H_i)=d_H(H_q^{'},\varPsi _q(H_i)) \end{aligned}$$
(5)

where \(\varPsi _q(*)\) extracts from \(*\) the bits at the same positions as those selected for \(H_q^{'}\).
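The second and third stages, plus the masked comparison of Eq. (5), might look as follows in numpy (a sketch; the function names are illustrative):

```python
import numpy as np

def select_compact_bits(binary_maps, b_avg, L):
    """Score each binarized activation map B_i by Eq. (4) and keep the top L.

    binary_maps: (C, W, H) boolean maps B_i;  b_avg: (W, H) attention mask
    """
    scores = np.logical_and(binary_maps, b_avg).sum(axis=(1, 2))  # S(B_i, B_avg)
    return np.argsort(scores)[::-1][:L]                           # top-L filter indices

def masked_hamming(h_q, h_i, positions):
    """Eq. (5): compare only the bit positions selected for the query."""
    return int(np.sum(h_q[positions] != h_i[positions]))
```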

Searching via Multi-level Hashing. The original data space can be mapped to the Hamming space by several groups of hash functions, each preserving the similarity structure separately. We propose a multi-level hash search method that uses several sets of functions with different properties to reinforce the retrieval of positive neighbors, and we develop two strategies.

Rerank-Based Strategy #1. First, we retrieve with the global-level hash code and select the top K results as candidates. Then we rerank these candidates with the local-level hash code.

Hamming Distance Weighted Strategy #2. Given a query image \(x_q\) and N images \(\left\{ x_i\right\} _{i=1}^N\) with corresponding global-level hash codes \(H_{gq},\left\{ H_{gi}\right\} _{i=1}^N\) and local-level hash codes \(H_{lq}^{'},\left\{ H_{li}\right\} _{i=1}^N\), we fuse the distances as:

$$\begin{aligned} Sim (x_q,x_i)=\lambda d_H(H_{gq},H_{gi})+(1-\lambda )d_H(H_{lq}^{'},H_{li}) \end{aligned}$$
(6)

In experiments, we first retrieve with the global-level code and then rerank with the proposed weighted strategy.
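Putting the two strategies together, a minimal numpy sketch of the full multi-level search follows; K = 50 and \(\lambda = 0.5\) match the values used in Sect. 3, while everything else (names, code layout) is our assumption:

```python
import numpy as np

def multilevel_search(hg_q, hg_db, hl_q, hl_db, positions, K=50, lam=0.5):
    """Shortlist with the global-level code, then rerank with the local code.

    hg_q, hg_db: global-level codes, shapes (Hg,) and (N, Hg)
    hl_q, hl_db: local-level codes, shapes (Hl,) and (N, Hl)
    positions:   the L bit positions selected for this query
    """
    d_g = (hg_db != hg_q).sum(axis=1)                 # global Hamming distances
    candidates = np.argsort(d_g)[:K]                  # top-K shortlist
    d_l = (hl_db[candidates][:, positions] != hl_q[positions]).sum(axis=1)
    rerank1 = candidates[np.argsort(d_l)]             # strategy #1: local-only rerank
    fused = lam * d_g[candidates] + (1 - lam) * d_l   # strategy #2: Eq. (6)
    rerank2 = candidates[np.argsort(fused)]
    return rerank1, rerank2
```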

3 Experiments

Datasets. We evaluate performance on three standard datasets using mean average precision (mAP). Oxford Buildings [20] (Oxford5k) contains 5063 images, including 55 queries corresponding to 11 landmark buildings. Oxford Buildings+100K [20] (Oxford105k) adds 100K distractor images to Oxford5k. Paris Buildings [21] (Paris6k) contains 6412 images, with 55 queries corresponding to 11 Paris landmarks.

Experimental Details. We implement the proposed DSCNN with the Caffe [22] package. We design DSCNN based on the AlexNet architecture; see Fig. 1 for details. All images are resized to \(256\times 256\) before passing through the network. For training, we randomly select positive and negative pairs from each dataset (excluding the queries) and initialize the weights of Conv1-Conv4 from a pre-trained AlexNet.

Fig. 2. Examples of the impact of using the local-level code for reranking. For each query image, the first row shows the ranking obtained with the global-level hash code, and the second row shows the retrieval result of the proposed multi-level hash search method.

Table 1. mAP comparison with local descriptors. The local-level hash performs better.

Results of Local Features. We compare the local-level code from DSCNN with other state-of-the-art local descriptors. First, we compare with the sophisticated local descriptor aggregation methods Fisher vectors [6], VLAD [23], and triangulation embedding [24]. Table 1 summarizes the results: we attain the best performance on all three datasets. Compared with deep features, our average-pooling strategy (local-level hash) outperforms max-pooling [25] and SPoC [6] on the Oxford dataset. The result on Paris further demonstrates that the local-level code is superior to the global-level hash code, and the multi-level method improves on the global-level code by 12 and 14 mAP points on Oxford and Paris, respectively. Some qualitative retrieval examples using the multi-level hash are shown in Fig. 2; as expected, the local-level hash raises the ranking of relevant results and pushes down irrelevant images. Finally, our method is different from PCA and performs better.

Comparison with State-of-the-Art. We compare with approaches in the literature based on deep models, setting the length of \(H_l\) to 256 for a fair comparison. As Table 2 reveals, our method produces better or competitive results. For strategy #1, we retrieve 50 candidates with the global-level hash code and rerank them with the local-level hash code, achieving 67.1% mAP on Oxford5k and 83.7% on Paris6k. We then adopt strategy #2, setting \(\lambda \) to 0.5 empirically, and obtain slightly different performance from strategy #1. We conjecture that the fusion weakens some of the discriminative power of the local-level code, causing the gap in performance.

For deep convolutional features, CNN+fine-tuning [26] gains 55.7% mAP on Oxford by retraining deep models on an additional landmarks dataset collected by the authors, while we obtain 67.2% with only the limited training samples provided by the datasets. Although we do not boost performance with spatial reranking or query expansion strategies as Tolias et al. [7] do, our method achieves competitive results. Compared with R-CNN+CS-SR+QE [26], our method is simpler and more effective (83.7 vs. 78.4), exploring the internal properties of deep convolutional descriptors to select compact local features for retrieval, whereas R-CNN+CS-SR+QE locates objects with an RPN. Note that our method can carry out fast image retrieval via the Hamming distance, which is clearly superior to methods based on Euclidean or cosine distance.

Table 2. mAP comparison with state-of-the-art CNN-based methods.

4 Conclusion

This paper has presented a deep siamese CNN that produces global-level and local-level hash codes for image retrieval with the proposed multi-level search method. We also propose, for the first time, selecting region-related bits via activation maps. Finally, we demonstrate the efficacy and applicability of the proposed approach on retrieval benchmarks: experimental results show that our method improves on previous performance on the Oxford and Paris datasets.