Image Annotation Incorporating Low-Rankness, Tag and Visual Correlation and Inhomogeneous Errors

Hou, Yuqing

doi:10.1007/978-3-319-27857-5_7

Yuqing Hou²⁵

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 9474))

Included in the following conference series:

International Symposium on Visual Computing

2914 Accesses
3 Citations

Abstract

Tag-based image retrieval (TBIR) has drawn much attention in recent years due to the explosive amount of digital images and crowdsourcing tags. However, TBIR is still suffering from the incomplete and inaccurate tags provided by users, posing a great challenge for tag-based image management applications. In this work, we propose a novel method for image annotation, incorporating several priors: Low-Rankness, Tag and Visual Correlation and Inhomogeneous Errors. Highly representative CNN feature vectors are adopted to model the tag-visual correlation and narrow the semantic gap. And we extract word vectors for tags to measure similarity between tags in the semantic level, which is more accurate than traditional frequency-based or graph-based methods. We utilize the Accelerated Proximal Gradient (APG) method to solve our model efficiently. Extensive experiments conducted on multiple benchmark datasets demonstrate the effectiveness and robustness of the proposed method.

Access provided by Autonomous University of Puebla. Download conference paper PDF

Boosted Inductive Matrix Completion for Image Tagging

Annotating Web Images by Combining Label Set Relevance with Correlation

Dual-Graph Regularized Sparse Low-Rank Matrix Recovery for Tag Refinement

Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

The prevalence of social network and digital photography in recent years makes image retrieval an urgent need. Image retrieval methods can be classified into two categories: content-based image retrieval (CBIR) and tag-based image retrieval (TBIR). The performance of CBIR algorithms are limited due to the semantic gap between the low-level visual features used to represent images and the high-level semantic meaning behind images. Tags can represent the semantics of images more precisely than low-level visual features, giving rise to research on TBIR.

However, tags are usually noisy and incomplete due to the arbitrariness of user tagging behaviors, leading to performance degradations of TBIR systems [1]. What’s more, manual annotation is laborious, error prone, and subjective, making automatic image annotation an attractive research task.

Many machine learning methods have been developed for image annotation. They can be roughly grouped into three categories: supervised methods, unsupervised methods and semi-supervised methods.

Supervised methods use the tagged images to train a dictionary of concept models and formulate image annotation as a supervised learning problem. They annotate images using the likelihood between images and tags. [2] formulates the annotation problem in a probabilistic framework and images are represented as bags of localized feature vectors. [3] learns a two-dimensional Multi-resolution Hidden Markov Model (2D-MHMM) on a fixed-grid segmentation of all category examples. [4] models image annotation procedure as a translation problem between image blobs and tags.

Unsupervised methods, e.g. search based-methods, learn the distribution of images and tags and annotate tags among clusters. Search-based methods always search in the feature space to find the most relevant images to the query image, and transfer tags to it using various tag transfer algorithms [5–7]. JEC [6] demonstrates that simple baseline algorithm can achieve high performance. TagProp [5] applies metric learning in the neighborhood of the feature space to annotate query images.

In recent years, semi-supervised approaches have been proposed in this field [8–10]. Semi-supervised algorithms can exploit the unlabelled information to improve the learning procedure and achieve satisfactory performance. [9] models the annotation task as a matrix completion problem, assuming the low-rankness property of the underlying matrix. [11] combined language model with matrix completion by assuming the independency of tags. Kernel trick and metric learning are exploited in [12] to capture the nonlinear relationships between visual features and semantics of the images. Semi-supervised relational topic model (ssRTM) is exploited to explicitly model the image content and their relations [13].

To utilize the large amount of unlabeled dataset for removing noisy tags and completing the missing ones, we propose a semi-supervised method. We formulate the annotation task as a transduction matrix completion problem, taking the following four priors into consideration:

1. Low-rankness. Many methods formulated the image annotation problem in a matrix completion framework by constructing and refining the image-tag matrix [8, 10–12]. Existing works have demonstrated that semantic space spanned by tags can be approximated by a much smaller subset of words derived from the original space [14]. As text information, tags are consequently subjected to such subset property [8]. According to the subset property, we assume that the image-tag matrix is a low rank matrix. Thus we can exploit the low rank matrix completion techniques to complete the matrix, thereby annotating the images.

2. Tag Correlation. Tags have high level semantic meanings and often appear correlatively at the semantic level. However, traditional methods treat tags merely as labels, reducing the annotation task as a multi-label classification problem. In recent years, researchers have explored the relation among tags. Graph-based methods calculated the semantic correlation among tags using the WordNet distance [15]. Frequency-based methods [10, 16] estimated tag correlations using Jaccard coefficient or co-occurrence in text search results. Jensen-Shannon divergence is introduced in the Flickr Distance to make the algorithm more precise and reasonable [17]. However, these methods are still imprecise and inefficient. In this work, we utilize the vector representations for tags instead of labels. Word vectors [18], which are seldom used in this field, can present a much higher level semantic meanings than labels, thus we can measure the tag similarity much more easily and precisely.

3. Tag-Visual Correlation. Tag-Visual Correlation describes the correlation between the content level and the semantic level. Visually similar images often belong to similar themes and thus are annotated with similar tags. This prior has been widely explored in the image classification field [19, 20]. However, there still exists a semantic gap between the content level and the semantic level. Traditional methods usually adopt low level image features, such as color, texture or shape descriptors, to represent the images, which are not so correlated with the semantic level. To narrow the semantic gap and make full use of the correlation property, we utilize high level visual features in our model, such as DeCAF$_{6}$, which demonstrate much stronger tag-visual correlation than low level visual features.

4. Inhomogeneous Errors. Tagging errors come from two aspects: missing tags and noisy tags. Since human-beings are relatively reasonable, we should assume that the tagging results are reasonably accurate. We can observe from the datasets that one image usually has relation with only a few tags, but we have to calculate its association with hundreds or even thousands of tags. For example, images from the MIRFlickr-25K have about 12.7 tags on average [21], but the dataset has 1, 386 unique tags, which means that each image should only be annotated with less than $1\,\%$ of all the tags. Hence users are more likely to adding noisy tags than missing noisy tags since there are too many unrelated tags. And the errors are mainly composed of noisy tags rather than missing tags. Thus we should treat these two errors with different strategies. We should put more emphasis on denoising rather than completing, paying more attention to the annotated tags rather than the unannotated ones. In other words, if an image is not originally annotated with a tag, it is more likely that they really have no relation at all.

Existing methods never model these two kinds of errors separately. They simply model the errors as Laplacian noise [8] or Gaussian noise [10]. To our knowledge, our model is the first to model the missing errors and noisy errors separately. The model can further adapt to different datasets according to their noise levels.

The novelties and main contributions of this paper are summarized as follows:

We propose a new image annotation model that incorporates four priors: Low-rankness, Tag Correlation, Tag-Visual Correlation, and Inhomogeneous Errors.
We utilize the word vectors and CNN features for the tag and the visual features, respectively. These high level features can narrow the semantic gap effectively. It is the first time to utilize both the features for image annotation.
We model tag correlation and tag-visual correlation in different ways according to their semantic levels.
We model two kinds of errors separately, the model can adapt to different datasets according to the noise level.
We utilize the APG to solve our model efficiently.

The most related work to our model is LRES [8]. In their work, the authors formulated the image annotation task as a Robust PCA [22] framework, decomposing the original tag matrix into a refined tag matrix and a sparse error matrix. LRES also takes the tag correlation and tag-visual correlation into consideration and achieves good performance. However, our model is different from LRES in several aspects. First, our model measures tag correlation and tag-visual correlation using different models according to their different semantic levels, rather than using the same Graph Laplacian model in LRES. Second, we adopt more representative features such as CNN features and word vectors to narrow the semantic gap. Third, we do not model the error matrix simply as a sparse matrix, since thee errors are inhomogeneous and the distribution varies across different datasets.

2 Our Image Annotation Model

2.1 Low-Rankness

Denote the image collection $I = \{i_1, i_2,\ldots , i_m\}$, where m is the size of the image set. All original tags appearing in the set form a tag set $W = \{w_1, w_2, \ldots , w_n\}$, where n denotes the total number of unique tags. We can construct a binary matrix $\hat{T} \in \{0,1\}^{m \times n}$ whose element $\hat{T}_{i,j}$ indicates the relation between image $i_i$ and tag $w_j$ , i.e. if $i_i$ is annotated with tag $w_j$, $\hat{T}_{i,j} = 1$, otherwise $\hat{T}_{i,j} = 0$. We use T to represent the refined tag matrix, where $T_{i,j} \in (0,1)$ the confidence score of assigning $w_j$ to $i_i$. As mentioned above, we want the refined matrix T to be low rank. Since the low-rankness constraint on T is NP-hard to solve, we replace it with the standard relaxation, the trace norm, i.e. sum of singular values : $\Vert T\Vert _{*}$.

2.2 Tag Correlation Using Word Vectors

To narrow the semantic gap, we extract 300-dimensional word vectors [18] for each tag rather than treating tags merely as labels. Word vectors contain rich semantic information, e.g. semantic similarity. We denote the word vectors as $WV = \{wv_1, wv_2,\ldots , wv_n\}$. Given the completed tag matrix, $T^i$ and $T^j$ are the ith and jth columns of the tag matrix T. Thus we can measure the correlation between tag i and tag j in two ways: (1) similarity between word vectors $wv_1$ and $wv_2$, (2) similarity between tag vectors $T^i$ and $T^j$.

The tag correlation prior can be enforced by solving the following optimization

$$\begin{aligned} \mathop {min}\limits _{T}\sum _{i=1}^{n}\sum _{j=1}^{n} \Vert T^i - T^j \Vert ^2S_{i,j} , \end{aligned}$$

(1)

where $\Vert T^i - T^j \Vert ^2$ measures the similarity between tag vectors $T^i$ and $T^j$ and $S_{i,j}$ measures the similarity between word vectors $wv_i$ and $wv_j$. The formulation forces tag vectors with large similarities also have large similarity in their corresponding word vectors and vice versa, which essentially embodies the tag correlation prior.

The formulation can be rewritten as $Tr(T^{T}LT)$, where $L = diag(S^{T}1) - S$ is the Graph Laplacian [23]. In our formulation, we define $S_{i,j} = \cos (wv_i, wv_j)$.

2.3 Tag-Visual Correlation Using CNN Features

The tag-visual correlation is not as strong as the correlation between tags owing to the semantic gap. Thus we just formulate the problem in a widely used model [10], which is much more simple and intuitive compared with the Graph Laplacian framework. Denote the image visual features as matrix V, where each visual image is represented as a row vector in $V \in R^{m \times f_{v}}$. Given the visual feature matrix, we can compute the visual similarity between $i_i$ and $i_j$ as $V_i^{T}V_j$, where $V_i^{T}$ and $V_j^{T}$ are the ith and jth rows of matrix V. Given the completed tag matrix, we can compute the similarity between $i_i$ and $i_j$ basing on the overlap between their corresponding tags, i.e., $T_i^{T}T_j$, where $T_i^{T}$ and $T_j^{T}$ are the ith and jth rows of the tag matrix T [10]. To model the aforementioned tag-visual correlation, we expect $|T_i^{T}T_j - V_i^{T}V_j|^2$ to be as small as possible. Thus we can model the tag-visual correlation using the Frobenius norm as $ \sum _{i,j}^{n}|T_i^{T}T_j - V_i^{T}V_j|^2 = \Vert TT^{T} - VV^{T}\Vert _{F}^2$.

2.4 Inhomogeneous Errors

To model the inhomogeneous errors, we set different weight to the annotated positions and unannotated positions separately: $\lambda _{0}\Vert P_{\varOmega }(\hat{T} - T)\Vert _{F}^2 + \lambda _{1}\Vert P_{\varOmega ^{\bot }}(\hat{T} - T)\Vert _{F}^2$. $\varOmega $ represents the positions where the images are annotated with tags, $P_{\varOmega }$ and $P_{\varOmega ^{\bot }}$ are projection operators, $\lambda _{0}$ and $\lambda _{1}$ are positive weighting parameters. $\lambda _{0}$ and $\lambda _{1}$ will change adaptively in different datasets according to their noisy levels. Different from the assumption of sparse errors [8], we model the errors using the Frobenius norm since we observe that large scale noisy datasets tend to be contaminated with dense Gaussian noises rather than Laplacian noises. Experiments on noisy datasets have confirmed our assumption.

2.5 Object Function Formulation: The Four Priors Model

Based on the terms regarding low-rankness, tag correlation, tag-visual correlation and inhomogeneous errors, we formulate the objective function as follows:

$$\begin{aligned} \begin{aligned}&\min \limits _{T}F(T) = \Vert T\Vert _{*} + \lambda _0\Vert P_{\varOmega }(\hat{T} - T)\Vert _{F}^2 + \lambda _{1}\Vert P_{\varOmega ^{\bot }}(\hat{T} - T)\Vert _{F}^2 + \\&~~~~~~~~~~~~~~~~~~~~\lambda _{2}Tr(T^{T}LT) + \lambda _{3}\Vert TT^{T} - VV^{T}\Vert _{F}^2 . \end{aligned} \end{aligned}$$

(2)

$\lambda _2$ and $\lambda _3$ are also weighting parameters.

The proposed Four Priors method belongs to transductive learning category, which means it reasons from both labeled and unlabeled data. We can further turn it into a inductive model using traditional machine learning approaches [24].

3 Solving the Four Priors Model

We set $\lambda _0 = 1$ for computational efficiency, and denote the nuclear norm as g(T) and the other terms together as f(T). And $F(T) = g(T) + f(T)$, where $g(\cdot )$ is nonsmooth and $f(\cdot )$ is smooth. We pursuit an effective iterative procedure to solve this optimization based on Accelerated Proximal Gradient method (APG) [25].

Given the following unconstrained problem

$$\begin{aligned} \min \limits _{X}F(X) = \mu g(X) + f(X) . \end{aligned}$$

(3)

where $g(\cdot )$ is nonsmooth, $f(\cdot )$ is smooth and its gradient is Lipschitz continuous. To avoid the computation of subgradient, proximal gradient algorithms minimize a sequence of separable quadratic approximations to F(X), denoted as Q(X, Y), formed at specially chosen points Y

$$\begin{aligned} Q(X,Y) \triangleq \mu g(X) + f(Y) + \langle \nabla f(Y), X - Y \rangle + \frac{L_f}{2}\Vert X - Y\Vert ^2 . \end{aligned}$$

(4)

Let $M = Y - \frac{1}{L_f}\nabla f(Y)$, we get

$$\begin{aligned} X = \mathop {argmin}\limits _{X} Q(X,Y) = \mathop {argmin}\limits _{X}\{ \mu g(X) + \frac{L_f}{2}\Vert X - M\Vert ^2 \} . \end{aligned}$$

(5)

APG set $Y_{k} = X_k + \frac{b_{k-1} - 1}{b_k}(X_k - X_{k-1})$ for a sequence $\{b_k\}$ satisfying $b_{k+1}^{2} - b_{k+1} \le b_k^2$ to get an $O(k^{-2})$ convergence rate. The APG method is described in Algorithm 1.

The main advantage of the APG method is that the minimizer $X_{k+1}$ has a simple or even closed-form solution when the $g(\cdot )$ is $\ell _1$ norm or nuclear norm [8].

It is obvious that the APG method naturally fits for the Four Priors model.

We estimate the $L_f$ using backtracking method and calculate the $\nabla f(T)$:

$$\begin{aligned} \nabla f(T) = 2[P_{\varOmega }^{*}P_{\varOmega }(\hat{T} - T) + \lambda _1 P_{\varOmega ^{\bot }}^{*}P_{\varOmega ^{\bot }}(\hat{T} - T) + \lambda _2LT + \lambda _3(TT^T - VV^T)T] \end{aligned}$$

(6)

where $P_{\varOmega }^{*}$ and $P_{\varOmega ^{\bot }}^{*}$ are the adjoint operators of $P_{\varOmega }$ and $P_{\varOmega ^{\bot }}$, respectively.

Basing on Eqs. (5) and (6) we can obtain the subproblem (Step 4 in Algorithm 1) for our model.

$$\begin{aligned} \begin{aligned}&T_{k+1} = \mathop {argmin}\limits _{T} \Biggl \{ \Vert T\Vert _{*} + \frac{L_f}{2}\Vert T - M_k \Vert ^2 \Biggr \} , \end{aligned} \end{aligned}$$

(7)

where $ M_k = T_k + \frac{b_{k-1} - 1}{b_k} (T_k - T_{k-1}) - \frac{1}{L_f}\nabla f[T_k + \frac{b_{k-1} - 1}{b_k} (T_k - T_{k-1})]$. The solution to (7) is:

$$\begin{aligned} T_{k+1} = US_{\frac{1}{L_f}}(\varSigma )V^T \end{aligned}$$

(8)

where $U\varSigma V^T$ is the singular value decomposition (SVD) of $M_k$ and $S_{\tau }(\cdot )$ is the singular value thresholding operator [26].

4 Experimental Evaluation

4.1 Datasets and Experimental Setup

The proposed algorithm is denoted as Four Priors and is evaluated on two well known benchmark datasets: MIRFlickr-25K, Corel5K and Labelme. MIRFlickr-25K is collected from Flickr. Compared to the Corel5K, tags in Labelme and MIRFlickr-25K are rather noisy and many of them are misspelled or meaningless words. Hence, a pre-processing is performed. We match each tag with entries in a Wikipedia thesaurus and only retain the tags in accordance with Wikipedia. We use the pre-trained word and phrase vectors [18] to extract tag vectors from the tags in these two datasets. To narrow the semantic gap, we utilized DeCAF [27] to extract the DeCAF$_{6}$ features, which have high level semantic meanings (Table 1).

Table 1. Statistics of 3 datasets

Full size table

We compare the proposed Four Priors model with the state-of-the-art methods, including matrix completion-based model LRES [8], TCMR [11], RKML [12], search-based algorithms (i.e. JEC [6], TagProp [5], and TagRelevance [7]), mixture models (i.e. CMRM [28] and MBRM [29]), tag recommendation approaches (i.e. Vote+ [30] and Folk [31]), co-regularized learning model FastTag [32] and Bayesian network model InfNet [33]. Note that the parameters of adopted baselines are also carefully tuned on the validation set of Corel5K with corresponding proposed tuning strategy.

We measure all the algorithms in terms of average precision@N (i.e. AP@N), average recall@N (i.e. AR@N) and coverage@N (i.e. C@N). In the top N completed tags, precision@N is to measure the ratio of correct tags in the top N competed tags and recall@N is to measure the ratio of missing ground-truth tags, both averaged over all test images. Coverage@N is to measure the ratio of test images with at least one correctly completed tag.

4.2 Evaluation of Tag Completion on Corel5K

We adopt the tuning strategy used in [10] to set $\lambda _1 = 0.6,\lambda _2 = 1,$ and $\lambda _3 = 0.8$. Table 2 demonstrates the performance comparisons. Due to the space limit, we only report results when $N = 2, 3, 5, 10$.

Table 2. Performance comparison on Corel5K dataset

Full size table

4.3 Evaluation of Tag Completion on MIRFlickr-25K and Labelme

We tuned $\lambda _1 = 0.2, \lambda _2 = 1.0,$ and $\lambda _3 = 0.5$ using cross validation on MIRFlickr-25K. The two datasets use the same parameters since they are both noisy. Note that as the datasets become large or noisy, the semantic gap expands, leading to the decrease of $\lambda _3$. And $\lambda _1$ varies according to different noisy level.

Tables 3 and 4 demonstrate the performance comparisons. Note that Folk and InfNet is unable to run on the large dataset MIRFlickr-25K. Besides, search-based baselines (JEC, TagProp, and TagRel) cost a lot of time to run on the dataset.

Table 3. Performance comparison on Labelme dataset

Full size table

Table 4. Performance comparison on MIRFlickr-25K dataset

Full size table

4.4 Observations on Experimental Results

We observe that: (1) Generally algorithms achieve better performance on Corel5K, since tags in MIRFlickr-25K are more noisy. (2) Matrix completion-based methods, such as Four Priors, LRES and TCMR, usually achieve the best performances. (3) Four Priors shows increasing advantage to LRES as the data become more and more noisy, justifying our assumption and model of the noises. (4) Four Priors nearly outperforms all the other algorithms in all cases. (5) Performance on MIRFlickr-25K in some sense provides an evidence for the robustness of Four Priors.

5 Conclusions and Future Work

We have proposed an effective method for image annotation. The model takes four priors into consideration: Low-Rankness, Tag Correlation, Tag-Visual Correlation and Inhomogeneous Errors. This is the first work to model inhomogeneous errors in the image annotation field. We utilize word vectors to calculate tag correlation and CNN features to measure tag-visual correlation. It achieves the state-of-the-art performance in extensive experiments conducted on benchmark datasets for image annotation.

References

Ntalianis, K., Tsapatsoulis, N., Doulamis, A., Matsatsinis, N.: Automatic annotation of image databases based on implicit crowdsourcing, visual concept modeling and evolution. Multimedia Tools Appl. 69, 397–421 (2014)
Article Google Scholar
Carneiro, G., Chan, A.B., Moreno, P.J., Vasconcelos, N.: Supervised learning of semantic classes for image annotation and retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 29, 394–410 (2007)
Article Google Scholar
Li, J., Wang, J.Z.: Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Trans. Pattern Anal. Mach. Intell. 25, 1075–1088 (2003)
Article Google Scholar
Duygulu, P., Barnard, K., de Freitas, J.F.G., Forsyth, D.: Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part IV. LNCS, vol. 2353, pp. 97–112. Springer, Heidelberg (2002)
Chapter Google Scholar
Guillaumin, M., Mensink, T., Verbeek, J., Schmid, C.: Tagprop: discriminative metric learning in nearest neighbor models for image auto-annotation. In: ICCV (2009)
Google Scholar
Makadia, A., Pavlovic, V., Kumar, S.: A new baseline for image annotation. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 316–329. Springer, Heidelberg (2008)
Chapter Google Scholar
Li, X., Snoek, C.G., Worring, M.: Learning social tag relevance by neighbor voting. IEEE Trans. Multimedia 11, 1310–1322 (2009)
Article Google Scholar
Zhu, G., Yan, S., Ma, Y.: Image tag refinement towards low-rank, content-tag prior and error sparsity. In: ACM MM (2010)
Google Scholar
Goldberg, A., Recht, B., Xu, J., Nowak, R., Zhu, X.: Transduction with matrix completion: three birds with one stone. In: NIPS (2010)
Google Scholar
Wu, L., Jin, R., Jain, A.K.: Tag completion for image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 35, 716–727 (2013)
Article Google Scholar
Feng, Z., Feng, S., Jin, R., Jain, A.K.: Image tag completion by noisy matrix recovery. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part VII. LNCS, vol. 8695, pp. 424–438. Springer, Heidelberg (2014)
Google Scholar
Feng, Z., Jin, R., Jain, A.: Large-scale image annotation by efficient and robust kernel metric learning. In: ICCV (2013)
Google Scholar
Niu, Z., Hua, G., Gao, X., Tian, Q.: Semi-supervised relational topic model for weakly annotated image recognition in social media. In: CVPR (2014)
Google Scholar
Zhao, R., Grosky, W.I.: Narrowing the semantic gap-improved text-based web document retrieval using visual features. IEEE Trans. Multimedia 4, 189–200 (2002)
Article Google Scholar
Jin, Y., Khan, L., Wang, L., Awad, M.: Image annotations by combining multiple evidence & wordnet. In: ACM MM (2005)
Google Scholar
Cilibrasi, R.L., Vitanyi, P.: The google similarity distance. IEEE Trans. Knowl. Data Eng. 19, 370–383 (2007)
Article Google Scholar
Wu, L., Hua, X.S., Yu, N., Ma, W.Y., Li, S.: Flickr distance. In: ACM MM (2008)
Google Scholar
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Torralba, A., Fergus, R., Freeman, W.T.: 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 30, 1958–1970 (2008)
Article Google Scholar
Zhang, H., Berg, A.C., Maire, M., Malik, J.: SVM-KNN: discriminative nearest neighbor classification for visual category recognition. In: CVPR (2006)
Google Scholar
Huiskes, M.J., Lew, M.S.: The MIR Flickr retrieval evaluation. In: MIR 2008: Proceedings of the 2008 ACM ICMI (2008)
Google Scholar
Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58, 11 (2011)
Article MathSciNet Google Scholar
Chung, F.R.: Spectral Graph Theory. American Mathematical Society, Providence (1997)
MATH Google Scholar
Gammerman, A., Vovk, V., Vapnik, V.: Learning by transduction. In: UAI (1998)
Google Scholar
Toh, K.C., Yun, S.: An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pac. J. Optimiz. 6, 615–640 (2010)
MATH MathSciNet Google Scholar
Cai, J.F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)
Article MATH MathSciNet Google Scholar
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: a deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531 (2013)
Jeon, J., Lavrenko, V., Manmatha, R.: Automatic image annotation and retrieval using cross-media relevance models. In: ACM SIGIR (2003)
Google Scholar
Feng, S., Manmatha, R., Lavrenko, V.: Multiple bernoulli relevance models for image and video annotation. In: CVPR (2004)
Google Scholar
Sigurbjörnsson, B., Van Zwol, R.: Flickr tag recommendation based on collective knowledge. In: ACM WWW (2008)
Google Scholar
Lee, S., De Neve, W., Plataniotis, K.N., Ro, Y.M.: Map-based image tag recommendation using a visual folksonomy. Pattern Recogn. Lett. 31, 976–982 (2010)
Article Google Scholar
Chen, M., Zheng, A., Weinberger, K.: Fast image tagging. In: ICML (2013)
Google Scholar
Metzler, D., Manmatha, R.: An inference network approach to image retrieval. In: Enser, P.G.B., Kompatsiaris, Y., O’Connor, N.E., Smeaton, A.F., Smeulders, A.W.M. (eds.) CIVR 2004. LNCS, vol. 3115, pp. 42–50. Springer, Heidelberg (2004)
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

Key Laboratory of Machine Perception (MOE), School of EECS, Peking University, Beijing, 100871, China
Yuqing Hou

Authors

Yuqing Hou
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yuqing Hou .

Editor information

Editors and Affiliations

University of Nevada, Reno, Nevada, USA
George Bebis
NASA Ames Research Center, Moffett Field, California, USA
Richard Boyle
Lawrence Berkeley National Laboratory, Berkeley, California, USA
Bahram Parvin
Desert Research Institute, Reno, Nevada, USA
Darko Koracin
University of Houston, Houston, Texas, USA
Ioannis Pavlidis
IBM T.J. Watson Research Center, Yorktown Heights, New York, USA
Rogerio Feris
Purdue University, West Lafayette, Indiana, USA
Tim McGraw
Side Effects Software, Santa Monica, California, USA
Mark Elendt
The DiVE, Durham, North Carolina, USA
Regis Kopper
Texas A&M University, College Station, Texas, USA
Eric Ragan
Kent State University, Kent, Ohio, USA
Zhao Ye
Lawrence Berkeley National Laboratory, Berkeley, California, USA
Gunther Weber

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Hou, Y. (2015). Image Annotation Incorporating Low-Rankness, Tag and Visual Correlation and Inhomogeneous Errors. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2015. Lecture Notes in Computer Science(), vol 9474. Springer, Cham. https://doi.org/10.1007/978-3-319-27857-5_7

Download citation

DOI: https://doi.org/10.1007/978-3-319-27857-5_7
Published: 18 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27856-8
Online ISBN: 978-3-319-27857-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Image Annotation Incorporating Low-Rankness, Tag and Visual Correlation and Inhomogeneous Errors

Abstract

Similar content being viewed by others

Boosted Inductive Matrix Completion for Image Tagging

Annotating Web Images by Combining Label Set Relevance with Correlation

Dual-Graph Regularized Sparse Low-Rank Matrix Recovery for Tag Refinement

Keywords

1 Introduction

2 Our Image Annotation Model

2.1 Low-Rankness

2.2 Tag Correlation Using Word Vectors

2.3 Tag-Visual Correlation Using CNN Features

2.4 Inhomogeneous Errors

2.5 Object Function Formulation: The Four Priors Model

3 Solving the Four Priors Model

4 Experimental Evaluation

4.1 Datasets and Experimental Setup

4.2 Evaluation of Tag Completion on Corel5K

4.3 Evaluation of Tag Completion on MIRFlickr-25K and Labelme

4.4 Observations on Experimental Results

5 Conclusions and Future Work

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Navigation

Image Annotation Incorporating Low-Rankness, Tag and Visual Correlation and Inhomogeneous Errors

Abstract

Similar content being viewed by others

Boosted Inductive Matrix Completion for Image Tagging

Annotating Web Images by Combining Label Set Relevance with Correlation

Dual-Graph Regularized Sparse Low-Rank Matrix Recovery for Tag Refinement

Keywords

1 Introduction

2 Our Image Annotation Model

2.1 Low-Rankness

2.2 Tag Correlation Using Word Vectors

2.3 Tag-Visual Correlation Using CNN Features

2.4 Inhomogeneous Errors

2.5 Object Function Formulation: The Four Priors Model

3 Solving the Four Priors Model

4 Experimental Evaluation

4.1 Datasets and Experimental Setup

4.2 Evaluation of Tag Completion on Corel5K

4.3 Evaluation of Tag Completion on MIRFlickr-25K and Labelme

4.4 Observations on Experimental Results

5 Conclusions and Future Work

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation