Users Personalized Sketch-Based Image Retrieval Using Deep Transfer Learning

Huo, Qiming; Wang, Jingyu; Qi, Qi; Sun, Haifeng; Ge, Ce; Zhao, Yu

doi:10.1007/978-3-319-99365-2_14

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11061)

Included in the following conference series:

International Conference on Knowledge Science, Engineering and Management

Abstract

Traditionally, sketch-based image retrieval (SBIR) has relied on human-defined features for similarity calculation and matching. The retrieved results are generally similar to the query in contour but lack the complete semantic information of the image. Moreover, because hand-drawn images are inherently ambiguous, there is a “one-to-many” category mapping between hand-drawn and natural images. To improve the accuracy of fine-grained retrieval, we first train a general SBIR model. Building on a two-branch architecture with fully shared parameters, we propose a deep fully convolutional neural network that obtains a mean average precision (MAP) of 0.64 on the Flickr15K dataset. On the basis of the general model, we combine the user’s historical feedback images with the input hand-drawn image and use transfer learning to fine-tune the distribution of features in vector space, so that the neural network learns fine-grained image features. To our knowledge, this is the first work to address personalization in sketch retrieval through transfer learning. After model migration, the learned fine-grained features meet the personalized needs expressed in the user’s sketches.

1 Introduction

The difference between the statistical distributions of sketches and natural images places them in completely different image domains. To process hand-drawn image features and retrieve related natural pictures, some transformation [8] must be applied to the hand-drawn and natural images to bring them into the same image domain. In addition, determining user intent from visual search queries is still an open challenge, especially in sketch-based image retrieval (SBIR). Hand-drawn images are ambiguous: the same drawing can express the semantics of different things. Conversely, drawings of the same object by different users also differ, so the search results computed from them differ as well. There is usually a “one-to-many” relationship between a hand drawing and the categories of natural images. If the computer is to recognize the object represented by a user’s drawing as accurately as humans do, user relevance feedback must be added. Feedback lets the user indicate to the system which returned instances are desired or related and which are not. Based on this feedback, the system modifies its search mechanism and tries to return a better picture set to the user [9]. Feedback thus serves as an effective tool for extracting deep semantic information from images and for fine-grained image analysis.

The main contributions of this paper are as follows. (1) Using a cross-domain method for natural images, we extract the low-level, pixel-level edge information of each natural image and feed it into an improved deep fully convolutional neural network together with the hand-drawn image information. After training, the mean average precision (MAP) of the model is greatly improved compared with traditional image algorithms [5, 6, 7, 14] and recent deep learning algorithms [1, 2, 12, 13]. (2) To address the “one-to-many” relationship between hand drawings and the categories of natural images, we propose a data modeling method based on user feedback that uses the idea of transfer learning [10]. Starting from the general model, the method uses the user’s historical feedback to adjust the spatial distribution of the feature vectors of the subclass images and the input hand-drawn images. It determines the subspace [17] in which the fine-grained natural images and the input hand-drawn image lie within the overall feature space of each category. The migrated model completes the fine-grained image retrieval task and satisfies the user’s personalized requirements as far as possible.

2 Sketch-Based Image Retrieval General Model

For the general sketch retrieval model, our goal is to extract image feature information that is as complete as possible. The more complete the sketch features are, the better they express the real content of the sketch and the more accurate the sketch matching becomes. This step is also the basis for constructing the training data of the personalized model, so the quality of the general model directly affects training in the feedback stage.

2.1 Image Pre-processing and Feature Extraction

To extract contours from natural images, we use the global probability of boundary (gPb) edge detection algorithm with dual thresholding to obtain a binary edge map, retaining the strongest 25% of the edge information and removing the weakest 25%. Hysteresis thresholding, as in the Canny edge detector, is then applied: pixels connected to strong edges are kept and isolated edge pixels are removed [2]. The resulting image is padded with blank pixels to a size of $256\times 256$ and then binarized. The natural images are padded and binarized in the same way.
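The thresholding and padding steps can be illustrated with a minimal sketch in plain Python. It assumes an edge-strength map is already available (for example from gPb, which is not reproduced here); the function names, the explicit `low` and `high` thresholds, the 8-connectivity, and the centred padding are assumptions made for illustration and are not specified in the paper.

```python
from collections import deque


def hysteresis_binarize(strength, low, high):
    """Keep pixels >= high, plus pixels >= low that are 8-connected to them."""
    rows, cols = len(strength), len(strength[0])
    keep = [[False] * cols for _ in range(rows)]
    queue = deque()
    # Seed the search with the strong edge pixels.
    for r in range(rows):
        for c in range(cols):
            if strength[r][c] >= high:
                keep[r][c] = True
                queue.append((r, c))
    # Grow from strong pixels into neighbouring weak pixels.
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and not keep[nr][nc] and strength[nr][nc] >= low:
                    keep[nr][nc] = True
                    queue.append((nr, nc))
    return [[1 if k else 0 for k in row] for row in keep]


def pad_to_square(binary, size=256):
    """Place a binary map at the centre of a blank size x size canvas."""
    rows, cols = len(binary), len(binary[0])
    if rows > size or cols > size:
        raise ValueError("image is larger than the target canvas")
    top, left = (size - rows) // 2, (size - cols) // 2
    canvas = [[0] * size for _ in range(size)]
    for r in range(rows):
        for c in range(cols):
            canvas[top + r][left + c] = binary[r][c]
    return canvas
```

In this sketch, weak pixels that are not connected to any strong pixel are dropped, which mirrors the removal of isolated edge pixels described above.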

2.2 Establish a Joint Loss Function

Label information is constructed from the relationship between the hand-drawn image $X^S$ and the contour image $X^C$ provided in the data set. We first define the input label Y, whose value is 0 or 1. When the i-th hand-drawn image $X^S_i$ and the contour image $X^C_i$ are in the same category, the pair is a positive sample and the triplet ${<}X^S_i,X^C_i,Y_i=0{>}$ is constructed. Otherwise, the pair is a negative sample and the triplet ${<}X^S_i,X^C_i,Y_i=1{>}$ is constructed. When a triplet is input into the neural network, $X^S_i$ and $X^C_i$ are mapped to $V^S_i=f_N(X^S_i)$ and $V^C_i=f_N(X^C_i)$, where $f_N$ is the forward propagation function of the neural network. The loss function is [4]:

$$\begin{aligned} \sum _{i=0}^{batch\_size} (1-Y_i)\frac{2}{Q}{Ew}_i^2+Y_i\times 2Q\times e^{-\frac{2.77}{Q}{Ew}_i} \; . \end{aligned}$$

(1)

Here Q is a constant, the maximum value of $Ew=||V^S-V^C||_2$ when the final category is discriminated (Fig. 1).
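As an illustration, Eq. (1) can be evaluated directly on a small batch of feature vectors. The following is a minimal sketch in plain Python, not the training code used in the paper; the function names are hypothetical, and the default Q = 100 is the value reported for the general model in Sect. 4.1.

```python
import math


def euclidean(v1, v2):
    """Ew: Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


def pair_loss(ew, y, q):
    """One term of Eq. (1): y = 0 for a positive pair, y = 1 for a negative pair."""
    positive_term = (1 - y) * (2.0 / q) * ew ** 2
    negative_term = y * 2.0 * q * math.exp(-2.77 / q * ew)
    return positive_term + negative_term


def batch_loss(sketch_vecs, contour_vecs, labels, q=100.0):
    """Sum of the pair losses over a batch of (V^S, V^C, Y) triplets."""
    return sum(
        pair_loss(euclidean(s, c), y, q)
        for s, c, y in zip(sketch_vecs, contour_vecs, labels)
    )
```

For a positive pair the loss grows quadratically with the distance, while for a negative pair it decays exponentially, so minimizing it pulls same-category pairs together and pushes different-category pairs apart.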

2.3 Image Similarity Matching and Retrieval

At retrieval time, we traverse the entire image library and calculate the Euclidean distance between the feature vector of the hand-drawn image and the feature vector of every picture in the library.

$$\begin{aligned} Sim_{common}=[euc_1,euc_2,euc_3,\ldots ,euc_n]. \end{aligned}$$

(2)

The indices of the K smallest distances identify the K most similar pictures, which are returned to the user interface as candidate results. Here R is the set of natural images.

$$\begin{aligned} index=\mathop {\arg \min }^{K}_{i\in R} Euc(i). \end{aligned}$$

(3)
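Eqs. (2) and (3) amount to a brute-force nearest-neighbour search. The sketch below shows one way to write it in plain Python; the function names are hypothetical and the feature vectors are assumed to be precomputed lists of floats.

```python
import math


def euclidean(v1, v2):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


def retrieve_top_k(query_vec, library_vecs, k):
    """Return the indices of the k closest library images and the full distance list."""
    # Eq. (2): one distance per image in the library.
    sim_common = [euclidean(query_vec, v) for v in library_vecs]
    # Eq. (3): indices of the k smallest distances.
    order = sorted(range(len(sim_common)), key=lambda i: sim_common[i])
    return order[:k], sim_common
```

A full sort is used here for clarity; for a large library a partial selection such as `heapq.nsmallest` would avoid sorting every distance.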

3 User Feedback Based on Transfer Learning

As discussed in the previous section, we obtain a general model once the training process has converged. When the general model returns candidate results based on low-level similarity matching, the user can choose which pictures are of interest. From the user’s history records, the system identifies positive correlation samples, negative correlation samples, and general correlation samples [16], and uses them as training samples for a personalized model. By learning from user feedback, the model changes the distribution of the input hand-drawn images and the variously related samples in the feature space. Based on the rearranged distribution, the system then determines the similarity measure and retrieves the images that are most similar to what the user wants (Fig. 2).

3.1 Data Construction

To construct the training data set, we define R as a correlation set over the user set U, the input hand-drawing set S, the set FI of natural image contours selected through user feedback, and the set SI of natural image contours sampled to build the training data, expressed as $R_{u,s,Fi,Si}$. The set FC represents the parent category of Fi in the feedback data pair ${<}s,Fi{>}$, and the set SC represents the sub-category in which Fi is located. The input label $Y\in \{0,0.5,1\}$ is used to train the neural network model; it quantifies the degree of correlation. For samples $Si\in SI$, the label Y is defined as follows:

$$\begin{aligned} Y= {\left\{ \begin{array}{ll} 0, &{} \text {if} \quad Si \in FC \quad and \quad Si \in SC \\ 0.5, &{} \text {if} \quad Si \in FC \quad and \quad Si \notin SC \\ 1, &{} \text {if} \quad Si \notin FC \quad and \quad Si \notin SC \end{array}\right. } \; . \end{aligned}$$

(4)

From the user feedback data, training samples are selected so that positive correlation, negative correlation and general correlation samples appear in a 1:1:1 ratio, and each input record is a quadruple $<u,s,Si,Y>$.
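The labelling rule of Eq. (4) and the 1:1:1 sampling can be sketched as follows. This is an illustrative sketch only: the function names, the `(image_id, parent, sub)` record layout, and sampling with replacement are assumptions, and sub-categories are assumed to be nested inside parent categories so that the three cases of Eq. (4) are exhaustive.

```python
import random


def relevance_label(sample_parent, sample_sub, feedback_parent, feedback_sub):
    """Eq. (4): 0 = same sub-category, 0.5 = same parent only, 1 = different parent."""
    if sample_parent == feedback_parent and sample_sub == feedback_sub:
        return 0.0
    if sample_parent == feedback_parent:
        return 0.5
    return 1.0


def build_quadruples(user, sketch, feedback, candidates, per_class, seed=0):
    """Build <u, s, Si, Y> records with the three label values in a 1:1:1 ratio.

    feedback   -- (parent, sub) category of the image the user selected
    candidates -- iterable of (image_id, parent, sub) records (hypothetical layout)
    """
    rng = random.Random(seed)
    buckets = {0.0: [], 0.5: [], 1.0: []}
    for image_id, parent, sub in candidates:
        label = relevance_label(parent, sub, feedback[0], feedback[1])
        buckets[label].append(image_id)
    quadruples = []
    for label, ids in buckets.items():
        if not ids:
            raise ValueError(f"no candidate images with label {label}")
        # Sampling with replacement keeps the ratio even for small buckets.
        for image_id in rng.choices(ids, k=per_class):
            quadruples.append((user, sketch, image_id, label))
    return quadruples
```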

3.2 Network Architecture

Because a general correlation is added in this section, the two-branch independent joint loss function (1) of the previous section, which is based on strong and weak relations, needs to be rewritten as a three-branch independent function.

$$\begin{aligned} L_p(Ew,Y)=\delta _1 F_S(Ew)+\delta _2 F_M(Ew)+\delta _3 F_W(Ew). \end{aligned}$$

(5)

where $Ew=||V^S-V^C||_2$ is still the Euclidean distance between the two output vectors, and $F_S(Ew)$, $F_M(Ew)$ and $F_W(Ew)$ are the loss functions for the positive correlation, general correlation, and negative correlation samples respectively. The prefix term $\delta $ is an independence factor determined by the value of Y. To ensure that the branches are independent, we set $\delta _1=2\times |Y-1|\times (0.5-Y)$, $\delta _2=4\times |Y-1|\times Y$, and $\delta _3=2\times |Y-0.5|\times Y$. For $Y\in \{0,0.5,1\}$, exactly one of the three factors equals 1 and the other two equal 0.

For the loss function of each independent branch, we continue to use $F_W(Ew)=2Q\times e^{-\frac{2.77}{Q}Ew}$ for negatively correlated samples so that the overall feature space does not change. For positively correlated and generally correlated samples, both loss terms reduce the distance between the output vectors, but the magnitude of the reduction should differ, so we introduce a double-threshold method to control it. We define $F_S(Ew)=\frac{2}{Q}(Ew-Th1)^2$ and $F_M(Ew)=\frac{2}{Q}(Ew-Th2)^2$, where Th1 and Th2 are thresholds with $Th1<Th2<Q$.
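The three-branch loss of Eq. (5) can be checked numerically with a short sketch. The function names are hypothetical; the defaults Th1 = Q/10 and Th2 = Q/2 follow the experimental settings in Sect. 4.1, and Q = 100 is the value reported for the general model, assumed here to carry over to the personalized model.

```python
import math


def branch_weights(y):
    """Independence factors (delta_1, delta_2, delta_3) for a label y in {0, 0.5, 1}."""
    d1 = 2 * abs(y - 1) * (0.5 - y)
    d2 = 4 * abs(y - 1) * y
    d3 = 2 * abs(y - 0.5) * y
    return d1, d2, d3


def personalized_loss(ew, y, q=100.0, th1=None, th2=None):
    """Eq. (5): exactly one branch is active for each admissible label value."""
    th1 = q / 10 if th1 is None else th1
    th2 = q / 2 if th2 is None else th2
    d1, d2, d3 = branch_weights(y)
    f_s = (2.0 / q) * (ew - th1) ** 2        # positive correlation branch
    f_m = (2.0 / q) * (ew - th2) ** 2        # general correlation branch
    f_w = 2.0 * q * math.exp(-2.77 / q * ew)  # negative correlation branch
    return d1 * f_s + d2 * f_m + d3 * f_w
```

With these definitions the positive branch is minimized at a distance of Th1 and the general branch at Th2, so generally related samples settle farther from the sketch than positively related ones.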

3.3 Similarity Measure and Image Retrieval

At test time, the similarity measure is still the Euclidean distance between the output feature vectors, now computed with the personalized model. This gives a list of similarity results for the personalized model:

$$\begin{aligned} Sim_{personal}=[euc_1,euc_2,euc_3,\ldots ,euc_n] . \end{aligned}$$

(6)

Let $w\in [0,1]$ be the weighting factor. The final similarity list is:

$$\begin{aligned} Similarity=w\times Sim_{personal}+(1-w)\times Sim_{common} . \end{aligned}$$

(7)

The list is sorted in ascending order, and the indices of the top K values identify the pictures that best match the user’s preference. These pictures are returned to the user interface as the final fine-grained search results.
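The weighted fusion of Eq. (7) and the final ranking can be sketched in a few lines of plain Python. The function name is hypothetical, and both distance lists are assumed to be indexed by the same image order.

```python
def fuse_and_rank(sim_personal, sim_common, w, k):
    """Combine the two distance lists with weight w and return the k best indices."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    if len(sim_personal) != len(sim_common):
        raise ValueError("both lists must cover the same images")
    # Eq. (7): element-wise convex combination of the two distance lists.
    fused = [w * p + (1 - w) * c for p, c in zip(sim_personal, sim_common)]
    # Smaller distance means more similar, so sort in ascending order.
    order = sorted(range(len(fused)), key=lambda i: fused[i])
    return order[:k], fused
```

Setting w = 0 reproduces the ranking of the general model and w = 1 uses the personalized model alone; intermediate values trade contour similarity against user preference.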

4 Experiments and Results

4.1 Dataset and Experiment Settings

The general sketch-based image retrieval model is trained on the public data set Flickr15K. To extend the training data effectively and reduce over-fitting, we apply random cropping and flipping to the hand-drawn and contour images in each input batch as data augmentation. We use the RMSProp algorithm to train the network for a total of 20 epochs on the TensorFlow platform. To help the model converge, we use learning rate decay: the initial learning rate is set to 0.0001 and is decayed every 5 epochs with a decay factor of 0.5. The boundary Euclidean distance Q between positive and negative sample pairs is set to 100 in the general model (Tables 1 and 2).
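The learning rate schedule described above can be written as a one-line step function. This sketch assumes that the decay is multiplicative (the rate is multiplied by 0.5 at each step) and that epochs are counted from 0; neither detail is stated explicitly in the text, and the function name is hypothetical.

```python
def learning_rate(epoch, initial=1e-4, decay=0.5, step=5):
    """Step decay: multiply the rate by `decay` once every `step` epochs."""
    return initial * decay ** (epoch // step)


# Rates used over the 20 training epochs under these assumptions.
schedule = [learning_rate(e) for e in range(20)]
```

Under these assumptions the rate is halved three times during training, at epochs 5, 10 and 15.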

Table 1. Neural network structure

Table 2. MAP in general model

For the personalized model, feedback images are selected at random from the natural images in the TOP-20 results of the general model to simulate the selection operation of a single user. Sub-categories are determined from folder names, giving 60 in total. Each experimental set of hand-drawn sketches is randomly matched with 100 positive correlation samples (positive correlation samples can be repeated). Learning rate decay is also used in this experiment. The network is trained with the Adam optimization algorithm for a total of 20 epochs. The thresholds in the personalized model are set to $Th1=\frac{Q}{10}$ and $Th2=\frac{Q}{2}$.

4.2 Model Evaluation

Both the general model and the personalized model can be evaluated intuitively and quantitatively. Intuitively, we can inspect the TOP-N search results to see how well the returned images match the hand-drawn image. Quantitatively, the general image retrieval model is measured by mean average precision (MAP), with precision determined by category. This indicator shows the retrieval quality for each specific category and also reflects how well the whole search model generalizes (Fig. 3).
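For reference, average precision and MAP can be computed from ranked relevance flags as in the sketch below. The function names are hypothetical, and the sketch assumes each ranked list covers the whole image library, so that dividing by the number of relevant images found equals dividing by the total number of relevant images; the exact evaluation protocol of the paper may differ.

```python
def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance[i] is truthy if the i-th result is relevant."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant result
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_query_relevance):
    """MAP: mean of the per-query average precisions."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps) if aps else 0.0
```

Grouping the queries by category before averaging gives the per-category precision mentioned above.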

To evaluate the personalized model, which should capture both the contour information of the sketch and the user’s feedback preferences, the effect of w is included in the evaluation. We select the representative values 0, 0.25, 0.5, 0.75 and 1 from [0, 1] for w and plot the MAP as a function of w, as shown in Fig. 4.

Table 3. Comparison of subcategory AP and MAP for fine-grained search with the personalized models and the general model on Flickr15K

Table 3 shows that although the general model retrieves the parent category with high accuracy, it does not achieve good results when retrieving natural images in the fine-grained subcategories. The reason is that the sub-category sample features are randomly distributed within the parent category’s feature space, so a retrieved sample is likely to come from the correct parent category while its sub-category is effectively random. The improved personalized model successfully subdivides the feature space by sub-category. When features are computed for an input hand-drawn image, they lie as close as possible, in terms of distance similarity, to the sub-category samples required by the user, so that sub-category natural images can be retrieved efficiently.

5 Conclusions

In this paper, we show how to use an improved deep fully convolutional neural network to extract sketch features. We use popular public datasets to verify the feature extraction capability of the network we designed. The high-level feature information learned by the network is used to improve the accuracy of sketch retrieval. Based on the idea of transfer learning, the distribution of the migrated feature space is adjusted, which further increases the similarity between the results for a user’s input and that user’s historical feedback. The improved personalized model can retrieve pictures based not only on image content but also on user selection feedback. However, in our experiments user feedback was used to supply supervised sub-category tag information for fine-grained sketch retrieval, so the approach depends heavily on the accuracy of the tags and of the user’s selections. Incorrect or incomplete tag information and random user feedback selections are very likely to affect the final retrieval rate. Future work on fine-grained sketch retrieval based on weakly supervised information will therefore need more powerful neural networks and better-founded loss functions.

References

1. Bhattacharjee, S.D., Yuan, J., Hong, W., Ruan, X.: Query adaptive instance search using object sketches. In: Proceedings of the 2016 ACM Conference on Multimedia Conference, MM 2016, pp. 1306–1315 (2016)
2. Bui, T., Ribeiro, L., Ponti, M., Collomosse, J.: Compact descriptors for sketch-based image retrieval using a triplet loss convolutional neural network. Comput. Vis. Image Underst. 164, 27–37 (2017)
3. Bui, T., Ribeiro, L.S.F., Ponti, M., Collomosse, J.P.: Generalisation and sharing in triplet convnets for sketch based visual search. CoRR abs/1611.05301 (2016)
4. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: CVPR 2005, pp. 539–546 (2005)
5. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR 2005, pp. 886–893 (2005)
6. Hu, R., Collomosse, J.P.: A performance evaluation of gradient field hog descriptor for sketch based image retrieval. Comput. Vis. Image Underst. 117(7), 790–806 (2013)
7. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
8. Ma, Z., Tan, Z., Guo, J.: Feature selection for neutral vector in EEG signal classification. Neurocomputing 174(174), 937–945 (2016)
9. Macarthur, S.D., Brodley, C.E., Kak, A.C., Broderick, L.S.: Interactive content-based image retrieval using relevance feedback. Comput. Vis. Image Underst. 88(2), 55–75 (2002)
10. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2010)
11. Qi, Y., et al.: Making better use of edges via perceptual grouping. In: CVPR 2015, pp. 1856–1865 (2015)
12. Qi, Y., Song, Y., Zhang, H., Liu, J.: Sketch-based image retrieval via siamese convolutional neural network. In: ICIP 2016, pp. 2460–2464 (2016)
13. Sangkloy, P., Burnell, N., Ham, C., Hays, J.: The sketchy database: learning to retrieve badly drawn bunnies. ACM Trans. Graph. 35(4), 119 (2016)
14. Shechtman, E., Irani, M.: Matching local self-similarities across images and videos. In: CVPR 2007 (2007)
15. Tolias, G., Chum, O.: Asymmetric feature maps with application to sketch based retrieval. In: CVPR 2017, pp. 6185–6193 (2017)
16. Xie, L., Wang, J., Zhang, B., Tian, Q.: Fine-grained image search. IEEE Trans. Multimed. 17(5), 636–647 (2015)
17. Xu, P., et al.: Cross-modal subspace learning for fine-grained sketch-based image retrieval. Neurocomputing 278, 75–86 (2018)

Acknowledgment

This work was jointly supported by: (1) National Natural Science Foundation of China (No. 61771068, 61671079, 61471063, 61372120, 61421061); (2) Beijing Municipal Natural Science Foundation (No. 4182041, 4152039); (3) the National Basic Research Program of China (No. 2013CB329102).

Author information

Authors and Affiliations

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, 100876, China
Qiming Huo, Jingyu Wang, Qi Qi, Haifeng Sun, Ce Ge & Yu Zhao

Corresponding author

Correspondence to Qiming Huo.

Editor information

Editors and Affiliations

University of Bristol, Bristol, United Kingdom
Weiru Liu
Università di Trento, Povo, Italy
Fausto Giunchiglia
Jilin University, Changchun, China
Bo Yang

About this paper

Cite this paper

Huo, Q., Wang, J., Qi, Q., Sun, H., Ge, C., Zhao, Y. (2018). Users Personalized Sketch-Based Image Retrieval Using Deep Transfer Learning. In: Liu, W., Giunchiglia, F., Yang, B. (eds) Knowledge Science, Engineering and Management. KSEM 2018. Lecture Notes in Computer Science, vol 11061. Springer, Cham. https://doi.org/10.1007/978-3-319-99365-2_14

DOI: https://doi.org/10.1007/978-3-319-99365-2_14
Published: 12 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99364-5
Online ISBN: 978-3-319-99365-2
eBook Packages: Computer Science, Computer Science (R0)
