
1 Introduction

With the huge accumulation of digital images and videos in society, the demand for searching such data is also increasing [8]. Traditional multimedia search engines usually use surrounding metadata, such as titles, tags, or manually annotated keywords, as the index to retrieve multimedia data; this is known as keyword-based retrieval [18]. The main drawback of this kind of retrieval technology is the inconsistency between the textual information and the visual content of the multimedia data. Content-based multimedia retrieval was therefore proposed and has made great progress over the past decades [13]. In content-based multimedia retrieval, the semantic gap is a challenging problem; it refers to the difference between the low-level representations of images and the higher-level concepts used by human beings to describe them. To narrow this gap, extensive efforts have been made by both the academic and industrial communities [9, 10, 17].

Over the past few years, deep learning has emerged as one of the most promising technologies in computer vision, owing to its outstanding performance in a series of vision-related tasks such as image classification [19], face recognition [29], and image segmentation [27]. Since the successful application of AlexNet [12] in computer vision, the intuitive way to obtain better feature representations has been to use deeper convolutional neural networks (DCNNs). For example, AlexNet contains 8 learned layers, VGGNet [21] has 19 learned layers, and GoogLeNet [22] consists of 22 learned layers. However, going deeper means that training such networks becomes more difficult. Thus a deeper but easier-to-train DCNN, ResNet [6], was proposed. Inspired by these successful DCNNs, deep learning has already been introduced into content-based image retrieval (CBIR) [2]. However, it is natural to ask: does using a very deep DCNN improve the performance of image retrieval, especially on large-scale image databases?

In order to answer the above question, we use a very deep DCNN, ResNet, for image retrieval. However, the features extracted by ResNet are high-dimensional and therefore not suitable for large-scale image databases if used directly for retrieval. The proposed method is thus a framework for converting raw images into binary codes for effective large-scale image retrieval. To do so, the raw images are first fed into ResNet to obtain deep features, and these deep features are then converted into binary codes by a DCNN-based hashing network. In summary, the contributions of this work are twofold:

(1) We investigate a new framework that converts raw images into binary codes for large-scale image retrieval. This framework integrates ResNet and a DCNN-based hashing network.

(2) Extensive experiments are conducted for a comprehensive evaluation of the proposed framework, especially with different depths of the DCNN.

The remainder of the manuscript is structured as follows: Sect. 2 describes the proposed framework, followed by the experimental results in Sect. 3. Finally, conclusions and perspectives are given in Sect. 4.

2 Proposal

The framework of our method is shown in Fig. 1. The inputs are the pixels of the raw image and the corresponding label information of the image (for training only), and the outputs are the binary codes of the image. Such codes can be used for image retrieval by comparing the Hamming distances between the query's codes and the codes of the gallery images. The proposed framework includes two kinds of DCNN: a Deep Residual Network (ResNet) [6] is used to convert the raw images into deep features, and a hashing neural network (HNN) is used to convert the deep features into binary codes. We name our proposal ResHNN. Details are explained in this section.
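To make the retrieval stage concrete, the following is a minimal sketch, assuming the two networks are already trained; `resnet_features` and `hashing_net` are hypothetical stand-ins for the ResNet feature extractor and the hashing network described in Sects. 2.1 and 2.2.

```python
# Sketch only: `resnet_features` and `hashing_net` are hypothetical callables that
# return deep features and real-valued hash outputs, respectively.
import numpy as np

def encode(images, resnet_features, hashing_net):
    """Raw images -> ResNet deep features -> binary codes in {-1, +1}."""
    features = resnet_features(images)            # deep features from ResNet
    outputs = hashing_net(features)               # real-valued outputs of the HNN
    return np.where(outputs >= 0, 1, -1)          # binarize, treating 0 as +1

def retrieve(query_code, gallery_codes, top_k=10):
    """Rank gallery images by Hamming distance to the query's binary code."""
    r = gallery_codes.shape[1]
    # For codes in {-1, +1}: Hamming distance = (r - inner product) / 2
    hamming = (r - gallery_codes @ query_code) // 2
    return np.argsort(hamming)[:top_k]            # indices of the closest gallery images
```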

2.1 ResNet

ResNet won first place in the ILSVRC 2015 classification task, has proved to be easy to optimize, and gains accuracy from increased depth [6]. ResNet is composed of many stacked "Residual Units", each of which can be expressed in the following form:

$$\begin{aligned} y_i&=h(x_i)+F(x_i,W_i) \\ x_{i+1}&=f(y_i) \end{aligned}$$
(1)

where \(x_i\) and \(x_{i+1}\) are the input and output of the i-th unit, F is the residual function, \(h(x_i)\) is an identity mapping, and f is a ReLU function [7]. The core idea of ResNet is to learn the additive residual function F with respect to \(h(x_i)\), with the key choice of using an identity mapping \(h(x_i)=x_i\). This is realized by attaching a shortcut connection that performs identity mapping; its output is added to the output of the stacked layers. The residual unit we used is shown in Fig. 1. More details can be found in [6, 7].
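For illustration, the following is a minimal sketch of a bottleneck residual unit of the kind used in the deeper ResNet variants (ResNet-50/101/152), written with PyTorch; the layer widths are placeholders rather than the exact configuration of [6].

```python
# A minimal sketch of a bottleneck residual unit, assuming PyTorch.
import torch.nn as nn

class BottleneckUnit(nn.Module):
    def __init__(self, channels, bottleneck):
        super().__init__()
        # F(x, W): 1x1 reduce -> 3x3 -> 1x1 restore, each followed by batch norm
        self.residual = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # y_i = h(x_i) + F(x_i, W_i) with h the identity shortcut; x_{i+1} = ReLU(y_i)
        return self.relu(x + self.residual(x))
```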

Fig. 1. The framework of transferring images into a compact feature representation

2.2 HNN

The hashing layer is based on Nonlinear Discrete Hashing [3], which uses a multi-layer neural network to obtain compact binary codes through nonlinear transformations. Let \(X=[x_1,x_2,\dots ,x_n] \in R^{n \times d}\) denote the training set with n samples, where each sample \(x_i \in R^d (1 \le i \le n)\) is a d-dimensional data point. Assuming the m-th layer consists of \(u^{(m)}\) units, the output of each layer is computed as:

$$\begin{aligned} h^{(1)}(x_i) = s(x_iW^{(1)}+c^{(1)}), i=1,\cdots ,n \end{aligned}$$
(2)
$$\begin{aligned} h^{(m)}(x_i) = s(h^{(m-1)}(x_i)W^{(m)}+c^{(m)}), i=1,\cdots ,n \end{aligned}$$
(3)

where \(s(\cdot )\) is a nonlinear activation function such as the tanh function, and the projection matrix \(W^{(m)}\) and the bias vector \(c^{(m)}\) are the parameters to be learned for the m-th layer of the network.

For an I-layer network, the output of the top layer can be written as:

$$\begin{aligned} F(x)=h^{(I)}(x) \in R^{u^{(I)}} \end{aligned}$$
(4)

where the mapping \(F:R^d \rightarrow R^{u^{(I)}}\) is a parametric nonlinear function determined by \(\{W^{(m)},c^{(m)}\}^I_{m=1}\). We treat the sign of the network output as the binary code of each sample and stack the binary codes of all n samples together as:

$$\begin{aligned} B=sgn(F(X)) \in \{-1,+1\}^{(n \times r)} \end{aligned}$$
(5)

where \(r=u^{(I)}\) is the length of the binary code. Specifically, 0 is treated as \(+1\):

$$\begin{aligned} sgn(b)=\left\{ \begin{array}{rl} +1, &amp; b \ge 0\\ -1, &amp; b < 0 \end{array} \right. \end{aligned}$$
(6)
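As a concrete illustration of Eqs. (2)–(6), the following NumPy sketch computes the multi-layer nonlinear mapping and its binarization; `W` and `c` stand for the learned per-layer parameters \(\{W^{(m)}, c^{(m)}\}\).

```python
# A minimal NumPy sketch of the HNN forward pass and binarization.
import numpy as np

def hnn_forward(X, W, c):
    """Multi-layer nonlinear mapping F(X): h^(m) = tanh(h^(m-1) W^(m) + c^(m))."""
    h = X
    for W_m, c_m in zip(W, c):
        h = np.tanh(h @ W_m + c_m)
    return h                                   # F(X), shape (n, r) for an r-unit top layer

def binarize(F_X):
    """Eqs. (5)-(6): B = sgn(F(X)), with 0 mapped to +1."""
    return np.where(F_X >= 0, 1, -1)
```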

The goal is to find a binary code matrix that minimizes the loss function:

$$\begin{aligned} arg\min \limits _{B} Q=Q(L,B)+Q(B,X) \end{aligned}$$
(7)

where Q(L,B) measures the difference between the labels predicted from the hash code matrix B and the ground-truth labels of all samples, and Q(B,X) measures the information loss caused by transforming the data into binary codes. Denoting the classifier weight matrix as C, the first term Q(L,B) can be formulated as:

$$\begin{aligned} Q_C(L,B)=||L-CB^T||_F^2 \end{aligned}$$
(8)

where L is the ground-truth label matrix of the samples and \(||\cdot ||_F\) denotes the Frobenius norm.

Q(B,X) measures the discrepancy between the binary codes and the data samples, and consists of a quantization loss term and a similarity-preserving term:

$$\begin{aligned} Q^{(I)}_{F}(B,X)=\Vert B-F^{(I)}\Vert ^2_F + \alpha \sum ^n_{i=1}\sum ^n_{j=1}S_{ij}\Vert F^{(I)}(i,:)-F^{(I)}(j,:)\Vert ^2_F \\ \text {s.t.}\ B\in \{-1,1\}^{n\times r},\ B^{T}B=nI_{r} \end{aligned}$$
(9)

where S is the similarity matrix. The constraint \(B^{T}B=nI_{r}\) is added to reduce the redundancy of the codes. However, this constraint makes the optimization difficult, so a real-valued matrix \(Y\in \varOmega =\{Y\in \mathbb {R}^{n\times r} \mid Y^TY=nI_r\}\) is introduced to approximate B, and Eq. 10 is used to substitute for this constraint.

$$\begin{aligned} Q_I(B)=\Vert B-Y\Vert ^2_F \end{aligned}$$
(10)

We notice that the loss function in Eq. 7 only considers the output of the top layer of the network, while the hidden layers are not included. So a companion loss function is introduced as follows:

$$\begin{aligned} Q_F(B,X)=Q_F^{(I)}(B,X)+\sum ^{I-1}_{m=1}\alpha ^{(m)}h(Q_F^{(m)}-\tau ^{(m)}) \end{aligned}$$
(11)

where \(h(x)=\max (x,0)\) and \(Q_F^{(m)}=\sum _{i=1}^n\sum _{j=1}^{n}S_{ij}\Vert F^{(m)}(i,:)-F^{(m)}(j,:)\Vert ^2_F\), \(m=1,2,\ldots , I-1\).

Taking all of the above into consideration, the overall objective function is defined as follows:

$$\begin{aligned} \arg \min _{B,\,C,\,\{W^{(m)},c^{(m)}\}^{I}_{m=1},\,Y} Q=Q_C+\lambda _1Q_I+\lambda _2Q_F+\lambda _3Q_R \\ \text {s.t.}\ B\in \{-1,1\}^{n\times r} \end{aligned}$$
(12)

where \(Q_R=\Vert C\Vert ^2_F+\sum \limits _{m=1}^I\Vert W^{(m)}\Vert ^2_F+\sum \limits ^I_{m=1}\Vert c^{(m)}\Vert ^2_F\) is the regularizer that controls the scales of the parameters.
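For clarity, the sketch below evaluates the overall objective of Eq. (12) with NumPy, keeping only the top-layer term of \(Q_F\) from Eq. (9); the assumed shapes (label matrix L of size \(c\times n\), codes B of size \(n\times r\)) follow Eq. (8), and the all-pairs distance computation is only meant for small n.

```python
# A minimal sketch of the overall objective; shapes are assumptions stated above.
import numpy as np

def overall_loss(B, Y, F_top, L, C, W, c, S, alpha, lam1, lam2, lam3):
    fro2 = lambda M: np.sum(M ** 2)                 # squared Frobenius norm
    # Q_C: classification loss of Eq. (8)
    Q_C = fro2(L - C @ B.T)
    # Q_I: gap between the binary codes and the orthogonal relaxation Y, Eq. (10)
    Q_I = fro2(B - Y)
    # Q_F (top layer only): quantization loss + similarity-preserving term, Eq. (9)
    pair_dists = np.sum((F_top[:, None, :] - F_top[None, :, :]) ** 2, axis=-1)
    Q_F = fro2(B - F_top) + alpha * np.sum(S * pair_dists)
    # Q_R: regularizer on all parameters
    Q_R = fro2(C) + sum(fro2(W_m) for W_m in W) + sum(fro2(c_m) for c_m in c)
    return Q_C + lam1 * Q_I + lam2 * Q_F + lam3 * Q_R
```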

The above joint optimization problem is non-convex and difficult to solve directly. It is therefore split into sub-problems, each with respect to one variable while keeping the others fixed, and the variables are updated one by one. These sub-problems can be solved with the help of singular value decomposition (SVD) and the Gram-Schmidt process. More details can be found in [3].
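As one illustrative sub-step (a sketch of a standard solution, not necessarily the exact procedure of [3]): with the other variables fixed, updating the auxiliary matrix Y amounts to \(\min _Y \Vert B-Y\Vert ^2_F\) s.t. \(Y^TY=nI_r\), which has a closed-form solution from the thin SVD of B when B has full column rank; the Gram-Schmidt step used in [3] to handle rank deficiency is omitted here.

```python
# Hedged sketch of the Y sub-problem: Y = sqrt(n) * U @ Vt maximizes tr(Y^T B)
# under Y^T Y = n I_r, assuming B has full column rank.
import numpy as np

def update_Y(B):
    n = B.shape[0]
    U, _, Vt = np.linalg.svd(B, full_matrices=False)   # thin SVD of B
    return np.sqrt(n) * U @ Vt
```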

3 Experiments

In this section, we conduct experiments on three datasets, MNIST [4], CIFAR10 [11], and SUN397 [26], to evaluate the performance of the proposal and to answer the question posed in the introduction.

Table 1. Retrieval performance on MNIST with binary codes of 16, 32, and 64 bits

3.1 Databases

MNIST: A handwritten digit dataset consisting of 70000 images of size \(28\times 28\). Each image is associated with a digit from 0 to 9 and is represented as a 784-dimensional gray-scale feature vector by concatenating all pixels [3]. As it is a simple dataset, we extract a 256-dimensional feature vector with ResNet for each image. Following the same setting as in [24], 1000 images, with 100 images per class, are randomly selected from the original test set to form the query set, and the remaining 69000 images are used as the gallery database.

CIFAR10: A set of 60000 manually labeled color images from 10 classes, with 6000 images per class. Each image is of size \(32\times 32\). ResNet is used to extract a 1024-dimensional feature vector for each image. Similar to MNIST, we use 1000 images, consisting of 100 images per class from the original test set, as the query set and construct the gallery database from the remaining images.

SUN397: This dataset contains 108754 images classified into 397 categories. It is larger and more complex than the two databases mentioned above, so retrieving semantic neighbors is challenging. Each image is represented by a 2048-dimensional feature vector extracted by ResNet. Following the same protocol as the referred methods, 8000 images are randomly sampled as query images and the remaining images form the gallery database.

3.2 Evaluation Metric

All experiments are repeated 10 times and the averaged values are taken as the final result. Two metrics are used to measure the performance of the different methods: precision at N samples and mean Average Precision (mAP). Given the top N returned samples, precision at N is calculated as the percentage of relevant retrieved images:

$$\begin{aligned} Precision@N=\frac{\sum _{k=1}^{N}rel(k)}{N} \end{aligned}$$
(13)

where \(rel(k)=1\) if k-th image is a relevant retrieved image, otherwise, \(rel(k)=0\). The mean Average Precision (mAP) presents an overall measurement of the retrieval performance by computing the area under the precision-recall curve, which delivers good discrimination and stability. It is calculated as follows:

$$\begin{aligned} AveP = \frac{\sum _{k=1}^N (P(k) \times rel(k))}{\text {number of relevant images}}, \nonumber \\ mAP=\frac{\sum _{q=1}^Q AveP(q)}{Q} \end{aligned}$$
(14)

where k is the rank in the sequence of retrieved documents, N is the number of retrieved images, P(k) is the precision at cut-off k in the list, and rel(k) is equal to 1 if the item at rank k is a relevant image, otherwise, it is equal to 0 [23]. Q is the number of the queries.
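The two metrics can be computed as in the short NumPy sketch below, where `rel` is assumed to be a 0/1 relevance vector over the ranked list returned for a single query.

```python
# A minimal sketch of Eqs. (13)-(14).
import numpy as np

def precision_at_n(rel, N):
    """Fraction of relevant images among the top N returned samples."""
    return np.sum(rel[:N]) / N

def average_precision(rel, num_relevant):
    """AveP over a ranked relevance vector for one query."""
    precisions = np.cumsum(rel) / np.arange(1, len(rel) + 1)   # P(k) at each rank k
    return np.sum(precisions * rel) / num_relevant

def mean_average_precision(rel_lists, num_relevant_list):
    """mAP: mean of AveP over all Q queries."""
    return np.mean([average_precision(r, m)
                    for r, m in zip(rel_lists, num_relevant_list)])
```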

Fig. 2. Top 14 retrieved images from the SUN397 dataset with different numbers of ResNet layers and 128-bit binary codes. The results of ResHNN-50 are shown in the first three rows, the results of ResHNN-101 in the middle three rows, and the results of ResHNN-152 in the last three rows. The irrelevant images are marked by red circles. (Color figure online)

3.3 Results and Analysis

Result on MNIST: The training set used for the hashing network contains 5000 images, obtained by selecting 500 images from each class. The ResNet and the HNN are trained separately, and the depth of the ResNet used on this database is 50. For the HNN, we take the tanh function as the nonlinear activation function and initialize the biases \(c^{(m)}\) to 0. Each element of \(W^{(m)}\) is uniformly sampled from the range \(\left[ -\sqrt{\frac{6}{row(m)+col(m)}}, \sqrt{\frac{6}{row(m)+col(m)}}\ \right] \), where row(m) and col(m) are the numbers of rows and columns of \(W^{(m)}\), respectively. The numbers R and L are set to 5 and 3. We set \(\alpha ^{(1)}\) and \(\alpha ^{(2)}\) to 20, \(\alpha ^{(3)}\) to 100, \(\tau ^{(1)}\) and \(\tau ^{(2)}\) to 1000, \(\lambda _1\) to 1e−3, \(\lambda _2\) and \(\lambda _3\) to 1e−5, and the learning rate \(\eta \) to 1e−3. The same settings are adopted for all the other datasets. The experimental results are shown in Table 1, comparing hash codes with lengths of 16, 32, and 64 bits. As MNIST is a simple handwritten digit dataset, many methods achieve good performance on it, and so does ResHNN. Since the performance achieved on this database is already quite high, the impact of the depth of the DCNN is difficult to estimate, so we did not conduct further experiments on it.
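As a small illustration of the initialization described above, the sketch below draws each \(W^{(m)}\) uniformly from \(\left[ -\sqrt{6/(row(m)+col(m))}, \sqrt{6/(row(m)+col(m))}\right] \) (Xavier/Glorot-style) and sets the biases to zero; `layer_sizes` is a hypothetical list of layer widths.

```python
# A minimal sketch of the HNN parameter initialization; `layer_sizes` is hypothetical.
import numpy as np

def init_hnn(layer_sizes, rng=np.random.default_rng(0)):
    W, c = [], []
    for rows, cols in zip(layer_sizes[:-1], layer_sizes[1:]):
        bound = np.sqrt(6.0 / (rows + cols))                    # sqrt(6/(row(m)+col(m)))
        W.append(rng.uniform(-bound, bound, size=(rows, cols)))
        c.append(np.zeros(cols))                                # biases initialized to 0
    return W, c
```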

Result on CIFAR10: Similar to the setting of the experiments on the MNIST dataset, the training set is constructed from 5000 images, with 500 images per category. We compare the results with ResNet depths of 50, 101, and 152 to evaluate their impact on feature extraction and hashing. Results are shown in Table 2. Our method ResHNN clearly outperforms the referred methods at all bit lengths, in terms of both mAP and Precision@500, and ResHNN-152 is the best. As the ResNet goes deeper, the retrieval performance improves slightly. We believe the reason is that deeper networks extract features from images more effectively, and this advantage is preserved by our hashing layers.

Table 2. Retrieval performance on CIFAR-10 with binary codes of 16, 32, and 64 bits
Table 3. Retrieval performance on SUN397 with binary codes of 48, 64, and 128 bits

Result on SUN397: In order to verify whether the proposed ResHNN works well under large and complex conditions, further experiments were conducted on the SUN397 database. As this database is a larger collection, we again evaluate the impact of the number of ResNet layers, namely 50, 101, and 152. Results are shown in Table 3. The proposed ResHNN achieves performance comparable to the referred methods and outperforms them at longer bit lengths. Furthermore, increasing the depth of the ResNet brings clear improvements when longer binary codes are used. This can be explained by the fact that deeper DCNN layers extract more information about the visual content of the image, and longer binary codes preserve this information better. Examples of retrieval results with different numbers of ResNet layers are shown in Fig. 2, where the wrongly returned images are marked by red circles.

4 Conclusion and Perspective

In this paper, we propose a ResNet-based compact feature representation for image retrieval, namely ResHNN, which integrates a residual network and a hashing neural network to generate binary codes for CBIR. Extensive experimental results on three widely used public databases demonstrate the superiority of the proposed ResHNN. Furthermore, we explore the impact of the ResNet's depth on the retrieval performance. The impacts of different deep features and different hashing methods will be explored in future work.