1 Introduction

The emergence of Web 2.0 technology, along with the prevalence of mobile devices, has led to an explosion of images being uploaded and shared online, which has made image retrieval an important research topic over the past two decades [25]. In a typical image retrieval system, a search task is launched by either keywords or examples provided by the user, termed Query-by-Keyword (QBK) and Query-by-Example (QBE) respectively, and the system then ranks the images in the database according to their similarity to the user’s query. However, QBK is limited by the so-called ‘intent gap’, i.e. the user may be unable to describe the visual content of the target with proper keywords, while QBE often suffers from the well-known ‘semantic gap’ between the low-level image pixels captured by machines and the high-level semantic concepts perceived by humans.

An effective way to bridge these gaps is to exploit user feedback, which can be obtained either explicitly [28] or implicitly [8], so that the initial image ranking list can be refined. Some recent studies along this direction [5, 7, 20, 22, 23, 26] are designed for QBK and are called image re-ranking, while others serve QBE, including both short-term learning [3, 12, 14, 16, 29] and long-term learning [4, 9, 15, 17, 21]. In the meantime, considerable effort has been devoted to graph-based similarity ranking, especially manifold ranking (MR) [27]. By taking the intrinsic geometrical structure into consideration, MR assigns each data point a relative ranking score instead of an absolute pairwise similarity as in traditional approaches. The score is treated as a distance metric defined on the data manifold, which better captures the semantic relations among data points. In addition, MR can easily exploit user feedback in both explicit and implicit ways [18], and previous studies have shown that MR is one of the most successful ranking approaches for image retrieval with relevance feedback [3, 16, 19].

Despite its success, regular MR has two major shortcomings when deployed for image retrieval. First, in many search applications, image data have multiple views, where each view is a feature set. For example, images can be represented by their visual information, surrounding text, and users’ query logs. However, the regular MR method cannot effectively integrate multiple views to encode the similarity ranking. Second, user-contributed views such as textual tags and query logs are often produced by ‘crowdsourcing’, so their correctness cannot always be guaranteed. Directly exploiting highly noisy views in MR may degrade the retrieval performance. In this paper, we propose a novel method named Robust Multi-view Manifold Ranking (RM2R) to address these problems. The main contributions can be summarized as follows. First, we extend regular MR from single-view to multi-view, aiming to exploit the complementary properties of different feature sets. For convenience of discussion, we focus on the two-view scenario, assuming that only visual features and query logs are available. Second, we develop a data cleaning solution to make the proposed method more robust to noisy query logs. An empirical study shows encouraging results in comparison to several existing approaches. In particular, we observe that our RM2R method is quite robust to noisy query logs, even when the noise level reaches 50 %.

In the following, we start with a brief review of related work. We then present our RM2R method and report the experimental results. Finally, we conclude the paper.

2 Related Work

Graph-based similarity ranking has been extensively studied in the multimedia retrieval area. Its main idea is to describe the dataset as a graph and then decide the importance of each vertex based on the local or global structure of the graph. One canonical graph-based ranking technique is the Manifold Ranking (MR) algorithm [27], which He et al. [3] first applied to image retrieval. Its limitations have been addressed by later research efforts. For example, Wang et al. [10] improved the MR accuracy using a k-regular nearest neighbor graph that minimizes the sum of edge weights and also balances the edges in the graph. Wu et al. [16] proposed a self-immunizing MR algorithm that uses an elastic kNN graph to exploit unlabeled images safely. Xu et al. [19] proposed an efficient MR solution based on a scalable graph structure to handle large-scale image datasets.

Multi-view learning concentrates on learning from data with multiple feature sets. Co-training [1] is probably the best-known representative. It constructs two learners, one from each view, and then lets each of them provide pseudo-labels for the other. Zhou et al. [30] regarded the visual content and surrounding text of images as two views, and applied co-training to image retrieval. In fact, co-training does not strictly require multiple views; the diversity among the learners is what matters [13]. A variant of co-training used in image retrieval [29] trains two different rankers on the same visual feature set, where each ranker identifies the unlabeled images with the highest confidence to enlarge the training set of the other ranker. Besides co-training, there are many other solutions for fusing multi-view data in the literature, such as multiple kernel learning [26] and multi-graph learning [2, 11, 23].

Our research is also closely related to collaborative image retrieval (CIR), which aims to combine short-term learning and long-term learning within a unified framework. For example, Yin et al. [21] exploited the long-term experience accumulated by different users to select the optimal online ranker from a set of candidates based on reinforcement learning. Hoi et al. [4] regarded the query log as ‘side information’ and, taking it as constraints, learned a distance metric from a mixture of labeled and unlabeled images. Su et al. [9] proposed discovering navigation patterns from query logs and using the patterns to facilitate new search tasks. Wu et al. [15, 17] proposed a hybrid similarity measure that preserves both visual and semantic resemblance by combining short-term and long-term feedback experience.

3 The Proposed RM2R Method

Our RM2R method is based on two intuitions. First, a ‘good’ ranker should be able to exploit the complementary properties of different views. Second, the ranker should be robust to noisy views. We start by introducing the notation.

3.1 Preliminaries

For simplicity, assume that we are handling a two-view image dataset \(\mathcal {D}=\{X_i=(\mathbf {x}_i^{(1)},\mathbf {x}_i^{(2)}),i=1,\cdots ,n\}\), where each image instance \(X_i=(\mathbf {x}_i^{(1)},\mathbf {x}_i^{(2)})\) has two views. Concretely, in the first view \(\mathbf {x}_i^{(1)}\in \mathbb {R}^{d_1}\) is a \({d_1}\)-dimensional visual feature vector of image \(X_i\), and in the second view \(\mathbf {x}_i^{(2)}\in \{0,1\}^{d_2}\) is a \({d_2}\)-dimensional log feature vector recording the clicks made by different users on image \(X_i\), where ‘1’ means that \(X_i\) is clicked in a certain query session, and ‘0’ otherwise.

To discover the geometrical structure, we build a pair of graphs \(G^{(z)}=(V^{(z)}, E^{(z)}, \mathbf {W}^{(z)})\) on \(\mathcal {D}\), where \(z\in \{1,2\}\) is the graph identifier. In detail, \(V^{(z)}\) is the node set, in which each node corresponds to an image; \(E^{(z)}\) and \(\mathbf {W}^{(z)}\in \mathbb {R}_+^{n\times n}\) are the edge set and the edge weight matrix, respectively; each \(W_{ij}^{(z)}\) represents the weight of edge \(E_{ij}^{(z)}\). Typically, the weight is defined by a certain similarity measure, and we apply different similarity measures to \(G^{(1)}\) and \(G^{(2)}\) because their input spaces differ.

For graph \(G^{(1)}\), its nodes are real-valued vectors, so the similarity between \(\mathbf {x}_i^{(1)}\) and \(\mathbf {x}_j^{(1)}\) is defined by the Gaussian kernel

$$\begin{aligned} W_{ij}^{(1)}=\exp \left( -\frac{d^2(\mathbf {x}_i^{(1)},\mathbf {x}_j^{(1)})}{\sigma ^2} \right) \end{aligned}$$
(1)

where \(d(\mathbf {a},\mathbf {b})\) is a distance metric between two vectors \(\mathbf {a}\) and \(\mathbf {b}\) (following [3], we use the L1 distance), and \(\sigma \) is the bandwidth parameter, which can be tuned by the local scaling technique whose effectiveness has been verified in clustering [24] and ranking [16] tasks.

For \(G^{(2)}\), its nodes are binary vectors, so the similarity between \(\mathbf {x}_i^{(2)}\) and \(\mathbf {x}_j^{(2)}\) is defined by the Jaccard coefficient

$$\begin{aligned} W_{ij}^{(2)}= \frac{|\mathbf {x}_i^{(2)}\cap \mathbf {x}_j^{(2)}|}{|\mathbf {x}_i^{(2)}\cup \mathbf {x}_j^{(2)}|} \end{aligned}$$
(2)

where \(\left| \bullet \right| \) denotes the size of a set, and each binary log vector is treated as the set of query sessions in which the image was clicked.
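The two affinity matrices in Eqs. (1) and (2) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, \(\sigma \) is passed as a fixed value rather than tuned by local scaling, and the similarity of two images with no clicks at all is set to 0 to avoid a division by zero.

```python
import numpy as np

def gaussian_affinity(X, sigma):
    """Eq. (1) with L1 distance: W_ij = exp(-d(x_i, x_j)^2 / sigma^2)."""
    # Pairwise L1 distances via broadcasting, shape (n, n).
    d = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    return np.exp(-(d ** 2) / sigma ** 2)

def jaccard_affinity(B):
    """Eq. (2): |intersection| / |union| of the clicked-session sets."""
    B = B.astype(float)
    inter = B @ B.T                       # sessions in which both images were clicked
    sizes = B.sum(axis=1)                 # number of clicks per image
    union = sizes[:, None] + sizes[None, :] - inter
    # Assumption: an empty union (neither image was ever clicked) gives similarity 0.
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

# Toy data: three images, 2-D visual features, four query sessions.
X1 = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
X2 = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1]])
W1 = gaussian_affinity(X1, sigma=2.0)     # sigma chosen arbitrarily for the example
W2 = jaccard_affinity(X2)
```

The broadcasting in `gaussian_affinity` materializes an n × n × d array, which is convenient for small examples but would need chunking on larger datasets.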

To exploit the complementary property between \(G^{(1)}\) and \(G^{(2)}\), we will extend the regular MR method [27] from single-view to multi-view in the next subsection.

3.2 M2R: Multi-view Manifold Ranking

Let \(\mathbf {r}:\mathcal {D} \rightarrow \mathbb {R}\) be a ranking function that assigns to each image instance \(X_i\) a ranking score \(r_i\). We also define a label vector \(\mathbf {y}=[y_1,\cdots ,y_n]^T\) to collect the user’s online feedbacks, where \(y_i=1\) if \(X_i\) is labeled as positive, \(y_i=-1\) if \(X_i\) is labeled as negative, and \(y_i=0\) otherwise. The cost function associated with \(\mathbf {r}\) is defined to be

$$\begin{aligned} \mathcal {Q}(\mathbf {r})= & {} \frac{\lambda }{2} \sum _{i,j=1}^n W_{ij}^{(1)} \left( \frac{r_i}{\sqrt{D_{ii}^{(1)}}}-\frac{r_j}{\sqrt{D_{jj}^{(1)}}} \right) ^2 \nonumber \\+ & {} \frac{1-\lambda }{2} \sum _{i,j=1}^n W_{ij}^{(2)} \left( \frac{r_i}{\sqrt{D_{ii}^{(2)}}}-\frac{r_j}{\sqrt{D_{jj}^{(2)}}}\right) ^2 + \frac{\mu }{2} \Vert \mathbf {r}-\mathbf {y} \Vert ^2 \end{aligned}$$

where \(0<\lambda <1\) is a parameter that adjusts the weight between the visual view and the log view, \(\mu >0\) is a regularization parameter, and \(\mathbf {D}^{(z)}\) (\(z\in \{1,2\}\)) is a diagonal matrix with \(D_{ii}^{(z)}=\sum _{j=1}^n W_{ij}^{(z)}\). The first two terms in the cost function are smoothness constraints, which make nearby examples in both the visual space and the log space have close ranking scores. The third term is a fitting constraint, which makes the ranking result fit the label assignment.

By differentiating \(\mathcal {Q}\) with respect to \(\mathbf {r}\), we have

$$\begin{aligned} \frac{\partial \mathcal {Q}}{\partial \mathbf {r}} \Big |_{\mathbf {r}=\mathbf {r^*}}= \lambda (\mathbf {r^*}-\mathbf {S^{(1)}}\mathbf {r^*}) + (1-\lambda )(\mathbf {r^*}-\mathbf {S^{(2)}}\mathbf {r^*})+\mu (\mathbf {r^*}-\mathbf {y})=0 \end{aligned}$$

where \(\mathbf {S}^{(z)}\) is the symmetric normalization of \(\mathbf {W}^{(z)}\), i.e.

$$\begin{aligned} \mathbf {S}^{(z)}=(\mathbf {D}^{(z)})^{-1/2} \mathbf {W}^{(z)} (\mathbf {D}^{(z)})^{-1/2}. \end{aligned}$$
(3)

By regrouping, the equation can be transformed into

$$\begin{aligned} \mathbf {r^*}- \frac{\lambda }{1+\mu }\mathbf {S}^{(1)}\mathbf {r^*}- \frac{1-\lambda }{1+\mu }\mathbf {S}^{(2)}\mathbf {r^*}- \frac{\mu }{1+\mu }\mathbf {y}=0. \end{aligned}$$

Let \(\alpha =\lambda /(1+\mu )\), \(\beta =(1-\lambda )/(1+\mu )\) and \(\gamma =\mu /(1+\mu )\), and then we have

$$\begin{aligned} (\mathbf {I}- \alpha \mathbf {S}^{(1)}- \beta \mathbf {S}^{(2)})\mathbf {r^*}=\gamma \mathbf {y}. \end{aligned}$$

Note that \(\alpha +\beta +\gamma =1\). Since \((\mathbf {I}- \alpha \mathbf {S}^{(1)}- \beta \mathbf {S}^{(2)})\) is invertible (the eigenvalues of each \(\mathbf {S}^{(z)}\) lie in \([-1,1]\) and \(\alpha +\beta <1\)), we have

$$\begin{aligned} \mathbf {r^*}=\gamma (\mathbf {I}- \alpha \mathbf {S}^{(1)}- \beta \mathbf {S}^{(2)})^{-1} \mathbf {y}. \end{aligned}$$
(4)
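As an illustration of the closed-form solution, the sketch below normalizes each affinity matrix as \((\mathbf {D}^{(z)})^{-1/2} \mathbf {W}^{(z)} (\mathbf {D}^{(z)})^{-1/2}\), matching the \(r_i/\sqrt{D_{ii}^{(z)}}\) terms of the cost function, and then solves the linear system instead of forming the inverse. The function names, the toy matrices and the values of \(\lambda \) and \(\mu \) are assumptions made for the example; every row sum of \(\mathbf {W}^{(z)}\) is assumed to be positive.

```python
import numpy as np

def sym_normalize(W):
    """S = D^{-1/2} W D^{-1/2}, with D_ii the row sums of W."""
    inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))   # assumes every row sum is positive
    return inv_sqrt[:, None] * W * inv_sqrt[None, :]

def m2r_closed_form(W1, W2, y, lam=0.5, mu=0.01):
    """r* = gamma (I - alpha S1 - beta S2)^{-1} y, as in Eq. (4)."""
    alpha = lam / (1.0 + mu)
    beta = (1.0 - lam) / (1.0 + mu)
    gamma = mu / (1.0 + mu)
    A = np.eye(len(y)) - alpha * sym_normalize(W1) - beta * sym_normalize(W2)
    # Solving the linear system avoids computing the explicit inverse.
    return gamma * np.linalg.solve(A, y)

# Toy data: images 0-1 and images 2-3 form two groups in both views.
W1 = np.array([[1.0, 0.9, 0.1, 0.1],
               [0.9, 1.0, 0.1, 0.1],
               [0.1, 0.1, 1.0, 0.9],
               [0.1, 0.1, 0.9, 1.0]])
W2 = W1.copy()
y = np.array([1.0, 0.0, 0.0, 0.0])            # image 0 is the only positive feedback
r = m2r_closed_form(W1, W2, y)
print(np.argsort(-r))                         # image indices by decreasing score
```

In this toy example the unlabeled image in the same group as the positive example receives a higher score than the images in the other group, which is the behavior the smoothness terms are meant to produce.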

We can directly use the above closed-form solution to compute the ranking scores of the examples. However, for large-scale problems, we prefer the iterative solution

$$\begin{aligned} \mathbf {r}(t+1)=(\alpha \mathbf {S}^{(1)}+\beta \mathbf {S}^{(2)})\mathbf {r}(t)+ \gamma \mathbf {y}. \end{aligned}$$
(5)

It is easy to prove that Eq. (5) converges to Eq. (4).

Proof

Suppose the sequence \(\{\mathbf {r}(t)\}\) converges to \(\mathbf {r^*}\). Substituting \(\mathbf {r^*}\) for \(\mathbf {r}(t+1)\) and \(\mathbf {r}(t)\) in Eq. (5), we have \(\mathbf {r^*}=(\alpha \mathbf {S}^{(1)}+\beta \mathbf {S}^{(2)})\mathbf {r^*}+ \gamma \mathbf {y}\), which can be transformed into \((\mathbf {I}- \alpha \mathbf {S}^{(1)}- \beta \mathbf {S}^{(2)})\mathbf {r^*}=\gamma \mathbf {y}\). Since \((\mathbf {I}- \alpha \mathbf {S}^{(1)}- \beta \mathbf {S}^{(2)})\) is invertible, we have \(\mathbf {r^*}=\gamma (\mathbf {I}- \alpha \mathbf {S}^{(1)}- \beta \mathbf {S}^{(2)})^{-1} \mathbf {y}\). The sequence does converge, because the eigenvalues of each \(\mathbf {S}^{(z)}\) lie in \([-1,1]\), so the spectral radius of \(\alpha \mathbf {S}^{(1)}+\beta \mathbf {S}^{(2)}\) is at most \(\alpha +\beta =1-\gamma <1\).
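The agreement between the iteration in Eq. (5) and the closed form in Eq. (4) can also be checked numerically. The following sketch is illustrative only: the stopping rule (`tol`, `max_iter`), the zero initialization and the toy matrices are assumptions, and the parameter values \(\alpha =0.5\), \(\beta =0.49\), \(\gamma =0.01\) are used merely as an example setting.

```python
import numpy as np

def m2r_iterate(S1, S2, y, alpha, beta, gamma, tol=1e-10, max_iter=10000):
    """Eq. (5): r(t+1) = (alpha S1 + beta S2) r(t) + gamma y."""
    M = alpha * S1 + beta * S2
    r = np.zeros_like(y, dtype=float)         # assumption: start from r(0) = 0
    for t in range(max_iter):
        r_next = M @ r + gamma * y
        if np.max(np.abs(r_next - r)) < tol:  # stop when the update is negligible
            return r_next, t + 1
        r = r_next
    return r, max_iter

def sym_normalize(W):
    """S = D^{-1/2} W D^{-1/2}, assuming positive row sums."""
    d = W.sum(axis=1)
    return W / np.sqrt(np.outer(d, d))

W1 = np.array([[1.0, 0.9, 0.1, 0.1],
               [0.9, 1.0, 0.1, 0.1],
               [0.1, 0.1, 1.0, 0.9],
               [0.1, 0.1, 0.9, 1.0]])
W2 = np.array([[1.0, 0.5, 0.0, 0.0],
               [0.5, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.5],
               [0.0, 0.0, 0.5, 1.0]])
S1, S2 = sym_normalize(W1), sym_normalize(W2)
y = np.array([1.0, 0.0, 0.0, -1.0])           # one positive and one negative label
alpha, beta, gamma = 0.5, 0.49, 0.01

r_iter, n_iter = m2r_iterate(S1, S2, y, alpha, beta, gamma)
r_closed = gamma * np.linalg.solve(np.eye(4) - alpha * S1 - beta * S2, y)
print(n_iter, np.max(np.abs(r_iter - r_closed)))
```

Because the contraction factor can be as large as \(1-\gamma \), a small \(\gamma \) may require many iterations; each iteration, however, only needs matrix-vector products.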

Given the fact that different users may have different opinions on judging the same image, the inherent noise issue of query logs is inevitable. Hence we will study the Query Log Cleaning (QLC) solution in the next subsection.

3.3 Query Log Cleaning via Neighbor Voting

Inspired by neighbor voting [6], our QLC solution is based on the intuition that if a user clicks a group of visually similar images, his or her clicking information is likely to reflect the objective aspects of visual content. This intuition suggests that, given an image, the confidence of a click made on it can be estimated from how its visual neighbors are judged (clicked or not) in the same session, i.e. accumulating the votes from its visual neighbors.

To facilitate voting, we encode each query session to a unique code (e.g. using the session id), and then all sessions can be viewed as a code book. Then, given an image \(X_i=(\mathbf {x}_i^{(1)},\mathbf {x}_i^{(2)})\), all clicks it received can be represented by a bag-of-code \(\mathcal {C}_i=\{c_j,j=1,\cdots ,l_i\}\), where each code \(c_j\in \mathcal {C}_i\) is actually the index of each non-zero element in the log vector \(\mathbf {x}_i^{(2)}\) and there are \(l_i\) codes in total. Given an image \(X_i=(\mathbf {x}_i^{(1)},\mathcal {C}_i)\), we define a voting function \(f(\mathbf {x}_i^{(1)},c_j)\) to measure the relevance of each code \(c_j\in \mathcal {C}_i\) to the visual object \(\mathbf {x}_i^{(1)}\)

$$\begin{aligned} f(\mathbf {x}_i^{(1)},c_j)=\sum _{m=1}^K \delta (\mathbf {x}_m,c_j) \end{aligned}$$
(6)

where \(\mathbf {x}_m\) (\(m=1,\cdots , K\)) denote the K visually nearest neighbors of \(\mathbf {x}_i^{(1)}\). The binary function \(\delta (\mathbf {x}_m,c_j)=1\) if the visual neighbor \(\mathbf {x}_m\) also has the code \(c_j\), otherwise \(\delta (\mathbf {x}_m,c_j)=0\).

After neighbor voting, each click recorded in the query logs is associated with a vote count. The more votes a click receives, the more confident we are in it. Therefore, the clicks with low votes (fewer than a threshold T) can be removed. We apply the ‘Three Standard Deviations’ principle to set the threshold, i.e. \(T=\mu _v-1.5\sigma _v\), where \(\mu _v\) and \(\sigma _v\) are the mean and standard deviation of the voted results, respectively.
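The voting and thresholding steps can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name is hypothetical, the visual neighbors are found with the L1 distance, an image is not counted as its own neighbor, and \(\mu _v\) and \(\sigma _v\) are computed over the votes of the recorded clicks only. These choices are assumptions that the text does not fix.

```python
import numpy as np

def clean_query_log(X_vis, B, K=20, n_std=1.5):
    """Neighbor-voting cleaning of a binary click matrix B (images x sessions)."""
    # Pairwise L1 distances in the visual view; exclude each image from its own neighbors.
    dist = np.abs(X_vis[:, None, :] - X_vis[None, :, :]).sum(axis=2)
    np.fill_diagonal(dist, np.inf)
    knn = np.argsort(dist, axis=1)[:, :K]      # (n, K) indices of visual neighbors
    # votes[i, c] = number of the K neighbors of image i clicked in session c, as in Eq. (6).
    votes = B[knn].sum(axis=1)
    click_votes = votes[B == 1]                # assumption: statistics over recorded clicks
    T = click_votes.mean() - n_std * click_votes.std()
    cleaned = B.copy()
    cleaned[(B == 1) & (votes < T)] = 0        # remove low-confidence clicks
    return cleaned, votes, T

# Toy data: two visual clusters; session 0 targets the first, session 1 the second.
X_vis = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
B = np.array([[1, 0],
              [1, 0],
              [1, 0],
              [1, 1],    # the session-0 click on image 3 is a noisy click
              [0, 1],
              [0, 1]])
cleaned, votes, T = clean_query_log(X_vis, B, K=2)
print(cleaned[3])        # log vector of image 3 after cleaning
```

In the toy example, the noisy click receives no support from its visual neighbors and falls below the threshold, while the clicks consistent with the visual clusters are kept.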

3.4 RM2R-Based Image Retrieval

In this section, we briefly summarize how RM2R is applied to image retrieval, as described in Algorithm 1. Note that the affinity matrices can be calculated offline, and therefore RM2R can be quite efficient in processing online queries.

Algorithm 1. RM2R-based image retrieval.

4 Experiments

4.1 Experimental Setup

We employ the ‘10K Images’ dataset, which is publicly available on the web, to make our experiments reproducible. The images are from 100 semantic categories, with 100 images per category. Three kinds of visual features are extracted to represent the images: a 64-dimensional color histogram, an 18-dimensional wavelet-based texture feature and a 5-dimensional edge direction histogram [15].

The log dataset consists of 1000 query sessions, which are simulated based on the ground truth of the image dataset. The average number of clicks in each query session is 20. Further, to evaluate the robustness of our method, three noisy log datasets are used in the experiments, with noise levels of \(10\,\%\), \(30\,\%\) and \(50\,\%\), respectively.

Our RM2R method is compared with three existing approaches which are introduced as follows.

  • Self-Tuning Manifold Ranking (STMR): the regular MR method [3] with self-tuned bandwidth parameter [24].

  • Co-training based Ranking (CoR): a multi-view learning based image ranking method [29].

  • Hybrid Similarity learning (HySim): a CIR method combining short-term learning with long-term learning [15].

In addition, M2R is also included in comparisons to study whether the QLC solution used by RM2R is effective or not.

A query set of 200 image examples is randomly selected to evaluate the average performance of all compared methods. For each query, only one round of feedback is performed (the top ten returned images are labeled), since users are typically unwilling to provide more. We use the precision-recall graph (PR graph), Mean Average Precision (MAP) and the precision at the top N retrieval results (P@N) for performance evaluation.
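For reference, P@N and MAP can be computed from a ranked list of binary relevance labels as sketched below. This is one common definition and is assumed here, since the text does not spell out the exact formulas: average precision is normalized by the number of relevant images in the ranked list, which presumes that the list covers the whole database.

```python
import numpy as np

def precision_at_n(ranked_relevance, n):
    """Fraction of relevant images among the top-n results."""
    return float(np.mean(ranked_relevance[:n]))

def average_precision(ranked_relevance):
    """Mean of P@k over the ranks k at which a relevant image appears."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    hits = np.cumsum(rel)                      # relevant images seen up to each rank
    ranks = np.arange(1, len(rel) + 1)
    return float(np.sum(rel * hits / ranks) / rel.sum())

def mean_average_precision(runs):
    """MAP: average precision averaged over all queries."""
    return float(np.mean([average_precision(r) for r in runs]))

# Two toy queries; 1 marks a relevant image at that rank.
runs = [[1, 0, 1, 0], [0, 1]]
print(precision_at_n(runs[0], 2), mean_average_precision(runs))
```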

4.2 Performance Evaluation

First, we conduct model selection for our M2R and RM2R methods. There are four parameters in total: \(\alpha \), \(\beta \), \(\gamma \) (\(\alpha +\beta +\gamma =1\)) used in Eq. (5) and K used in Eq. (6). For convenience, \(\gamma \) is fixed at 0.01, consistent with previous experience [3, 19, 27]. Then we have \(\beta =0.99-\alpha \) and thus only need to tune two parameters, \(\alpha \) and K. We first investigate the impact of \(\alpha \) by evaluating the performance of M2R alone. Figure 1(a) shows the performance of M2R over different \(\alpha \) values, evaluated on the log datasets with different noise levels. We observe that smaller \(\alpha \) values are preferable when the noise level of the log data is low, and larger ones when the noise level is high. To avoid bias, we set this parameter to the value at which M2R performs best on the middle noise level (30 %), i.e. \(\alpha =0.5\). With this value, the performance remains reasonable regardless of whether the noise level is high or low. Further, we evaluate K for the final results of RM2R on the log dataset with 30 % noise, as depicted in Fig. 1(b). Fortunately, our RM2R method is not sensitive to K, and we fix it at 20.
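As a worked check of how these settings map back to the parameters of the cost function (an illustration derived from the definitions \(\alpha =\lambda /(1+\mu )\), \(\beta =(1-\lambda )/(1+\mu )\) and \(\gamma =\mu /(1+\mu )\), not an additional experiment), \(\gamma =0.01\) and \(\alpha =0.5\) give

$$\begin{aligned} \mu =\frac{\gamma }{1-\gamma }=\frac{0.01}{0.99}\approx 0.0101, \qquad \lambda =\alpha (1+\mu )=\frac{0.5}{0.99}\approx 0.505, \qquad \beta =\frac{1-\lambda }{1+\mu }=0.99-0.5=0.49. \end{aligned}$$

In other words, this setting weights the visual view and the log view almost equally.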

Fig. 1. Tuning parameters for our RM2R method.

Next, with the best parameter setting, we compare our RM2R method with STMR, CoR and HySim on the ideal log dataset (without any noise). Figures 2(a) and 2(b) show the PR graphs and the P@N curves, respectively. Examining all methods, we observe that the methods using both visual features and query logs (RM2R and HySim) outperform the methods using only visual features (STMR and CoR), which indicates that exploiting multi-view information is beneficial to similarity ranking. Further, comparing STMR with CoR, we find that STMR performs much better than CoR, which indirectly verifies the superiority of the graph-based similarity ranking used by our method over pairwise similarity ranking.

Fig. 2. The performance of our RM2R method compared with several existing approaches.

Fig. 3. The robustness of the proposed methods compared with the HySim approach.

Furthermore, we compare RM2R with M2R and HySim on the noisy log datasets. To analyze the tolerance of an algorithm to noise, we first define a performance lower bound for a method, which refers to the performance of the method evaluated in the single-view case, i.e. without any query logs. The MAPs of the proposed methods (RM2R and M2R) and HySim evaluated under different noise levels are plotted in Fig. 3, where LB-1 and LB-2 are the performance lower bounds of (R)M2R and HySim, respectively. As expected, the performance of all methods degrades as the noise level grows, but RM2R is more robust to the noise than M2R and HySim. As illustrated, the MAP curve of RM2R is always above its lower bound and degrades gracefully, while the curves of M2R and HySim degrade sharply with growing noise levels and drop below their lower bounds when the noise level hits 50 % and 30 %, respectively. This observation confirms that our QLC solution is effective and essential to achieving robust similarity ranking. Note that the performance of RM2R is not always the best. When the noise level is low (10 %), the MAP curves of RM2R and M2R are very similar to each other, and RM2R is even worse than M2R when there is no noise in the log dataset. We conjecture that this is because QLC is a lossy data cleaning technique, i.e. a few examples are mistakenly removed, and thus its benefit is marginal when the noise level is low. As the noise level grows, QLC becomes increasingly helpful to RM2R.

Table 1. Comprehensive comparison.

Finally, a comprehensive MAP comparison of all compared methods is summarized in Table 1, where the best performance is shown in boldface. We observe that M2R and HySim perform no better, and often worse, than STMR when the noise level is high, which again validates the benefit of our QLC solution. In summary, when there is no noise in the log dataset, the performance of our RM2R method is only marginally better than that of HySim, but it outperforms STMR and CoR significantly. Considering that our RM2R method is much more robust than M2R and HySim when the log dataset is (highly) noisy, the experiments presented above indicate that RM2R performs best overall among the compared methods.

5 Conclusions

This paper presents a novel RM2R method that aims at the noise-resistant exploitation of the complementary information hidden in multiple views. We first propose an M2R framework to integrate visual features with query log features, and then, based on neighbor voting, develop a data cleaning solution for noisy query logs. To the best of our knowledge, little has been reported in the literature on multi-view learning with noisy data. The experimental study has validated the superiority of the proposed method in comparison to several existing similarity ranking approaches. In this paper, we focus on two-view data. Extending the method to more views would require tuning a larger number of parameters, which is inconvenient. In the future, we will study a self-tuning solution for multi-view manifold ranking.