1 Introduction

With the prevalence of social networks and digital cameras in daily life, more and more people share their photos online, and the number of images on the Internet is growing explosively. To efficiently find a desired image among such masses of images, many tag-based image search engines have emerged. To enhance the performance of these search engines, a growing body of research studies how to assign accurate labels to images. The aim of image annotation is to assign to an image several labels that accurately describe its semantic content. Owing to the well-known semantic gap [1], image annotation is a challenging and difficult task.

Given the huge number of images, annotating them manually is time-consuming and labor-intensive. Therefore, over the past decades, many image annotation methods have been proposed to predict the labels of an image, with steadily improving results. Owing to their simplicity and effectiveness, nearest neighbor based image annotation methods have been widely applied in practical image annotation systems. Li et al. proposed an algorithm that accumulates votes from visually similar neighbors to learn tag relevance scalably and reliably [11]. Some methods use the support vector machine (SVM) to annotate images. Tao et al. proposed asymmetric bagging and random subspace SVM (ABRS-SVM), combining the random subspace method with SVMs to improve relevance feedback performance [22]. There are also many other approaches to image annotation. Tang et al. proposed generalized deep transfer networks (DTNs), which transfer label information across heterogeneous domains, from the textual domain to the visual domain, thereby alleviating the problem of insufficient training images [20]. To improve social image tag refinement, Tang et al. proposed a tri-clustered tensor completion framework that collaboratively explores three kinds of information: users, images and tags [21].

However, the labels obtained by traditional methods are usually too incomplete, insufficient and noisy to accurately describe the semantic content of an image. Many automatic data cleaning methods have been proposed to address such problems [26]. In this paper, we propose a novel 2PKNN-GSR method for image annotation and label refinement, which refines the relevance between labels and images to improve annotation performance. The framework of the proposed method is illustrated in Fig. 1. The first step of this framework is to obtain the relevance between the labels and the testing images using the traditional 2PKNN method [23]. 2PKNN is a two-step variant of the classical K-nearest neighbor algorithm: the first step addresses the class-imbalance issue using image-to-label similarities, while the second step uses image-to-image similarities to address the weak-labeling issue. Since the labels produced in this step still suffer from the shortcomings above, we adopt group sparsity [27] and perform a group sparse reconstruction to refine the result of the first step. To demonstrate the effectiveness of the proposed method, we conduct experiments on three well-known datasets, i.e., the Corel 5K [3], IAPR TC12 [14] and ESP Game [25] datasets. Experimental results show that the proposed method outperforms state-of-the-art methods.

Fig. 1 The framework of our method

Overall, the contributions in this work are summarized as follows:

  • We propose a novel 2PKNN-GSR (2PKNN with Group Sparse Reconstruction) method to refine the annotation labels of images obtained by the traditional 2PKNN method.

  • In the proposed method, the group sparse reconstruction with a combination of the L1 and L2 norms effectively mines the relevance for image annotation, where the L1 norm emphasizes the sparsity among the groups and the L2 norm smooths the weights within each group.

The rest of this paper is organized as follows. Section 2 reviews related works. Section 3 details the proposed method. Experiments are conducted in Section 4, followed by the conclusions in Section 5.

2 Related work

Image annotation has attracted increasing attention in the fields of computer vision and multimedia. Based on their underlying solutions, existing image annotation methods can be divided into three main categories: generative models, discriminative models and nearest neighbor based methods. In the past few years, sparsity-based methods have also emerged and become increasingly popular in image annotation. We introduce these four categories below.

2.1 Generative models

Generative models assign labels to images on the basis of learnt joint distributions between image features and tags. They can be broadly grouped into mixture models and topic models. As an important branch of generative models, mixture models define a joint distribution over image features and labels; the annotation task is then cast as learning non-parametric density estimators between images and labels. Lavrenko et al. proposed the Continuous-space Relevance Model (CRM) [10], whose basic idea is to divide an image into several real-valued feature vectors, compute the joint probability between image features and labels, and predict the probability of each label. Topic models regard images as samples from a specific mixture of topics. Blei et al. described a three-level hierarchical Bayesian model, Latent Dirichlet Allocation (LDA) [2]. Putthividhya et al. extended supervised Latent Dirichlet Allocation to a new probabilistic model called sLDA-bin [17], which can handle a multi-variate binary response variable in the annotation data.

2.2 Discriminative models

Discriminative models aim to learn a separate classifier for each class, which predicts the label classes of an image. Early work by Szummer et al. divided images into two categories [18], i.e., binary image classification. Since an image usually carries more than two labels, multi-class classification becomes necessary. To remove confusing labels, Lavrenko et al. presented an SVM-based model that modifies the SVM hinge loss function [24]. Recently, Hong et al. proposed a Multiple-Instance Learning (MIL) method that uses discriminative feature mapping and feature selection to address the noise of the generated features [8]. In the Cross-Media Relevance Model (CMRM) [9], image annotation is treated as a cross-lingual retrieval problem.

2.3 Nearest neighbor based methods

Nearest neighbor based methods have become increasingly popular in the domain of image annotation due to their effectiveness. They build on the assumption that labels can be shared among similar images, and various kinds of features are usually combined to compute the distances between images and hence their similarities. Makadia et al. proposed a baseline technique, Joint Equal Contribution (JEC) [14], which treats image annotation as a retrieval problem. Guillaumin et al. presented TagProp [6], which learns a weight for each feature group and uses label relevance prediction to annotate images. Verma et al. recently proposed 2PKNN [23], which exploits image-to-label and image-to-image similarities to address the class-imbalance and weak-labeling issues, respectively.

3 The proposed method

In this section, we propose a novel 2PKNN-GSR method for image annotation and label refinement. A summary of the notations in this paper is shown in Table 1.

Table 1 Description of symbols

3.1 Relevance between labels and images

The traditional 2PKNN method can address both the class-imbalance and weak-labeling issues. In this section, we use 2PKNN [23] to obtain the relevance between the labels and the testing images. Its first step addresses the class-imbalance issue using image-to-label similarities, while its second step uses image-to-image similarities to address the weak-labeling issue.

Let \(X=\{x_{1},x_{2},...,x_{n}\}\in R^{n\times d}\) be a collection of n training images, where \(x_{i}\in R^{d}\) (i ∈ [1,n]) is the ith image. Define \(L=\{l_{1},l_{2},...,l_{c}\}\) as a dictionary consisting of c labels. The set \(T=\{(x_{1},t_{1}),(x_{2},t_{2}),...,(x_{n},t_{n})\}\) contains the pairs of each image \(x_{i}\) and its corresponding label vector \(t_{i}\in \{0,1\}^{c}\), where \(t_{i}(j)=1\) if \(x_{i}\) is annotated with the label \(l_{j}\) and \(t_{i}(j)=0\) otherwise.

According to [23], the 2PKNN method can address both the weak-labeling and class-imbalance problems. Define \(T_{i}\subseteq T\), ∀i ∈ {1,...,c}, as the subset of T that contains all the images annotated with the label \(l_{i}\). Given a testing image I, we compute the visual distance between I and the images in each subset \(T_{i}\) and keep the nearest ones as \(T_{I,i}\). Since the images in each \(T_{I,i}\) are both informative and diverse, all the subsets are merged to form \(T_{I}=T_{I,1}\cup T_{I,2}\cup ...\cup T_{I,c}=\cup _{i\in [1,...,c]}T_{I,i}\) as the neighbors of image I. This is the first pass of 2PKNN.
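To make the first pass concrete, the following is a minimal numpy sketch that, for every label, keeps the training images carrying that label that are visually closest to the test image and merges the per-label sets. The function name first_pass_neighbors, the plain Euclidean distance and the per-label neighbor count K1 are our own illustrative choices; the actual method combines several feature-specific distances (see Section 4.2).

```python
import numpy as np

def first_pass_neighbors(X_train, T_train, x_test, K1):
    """First pass of 2PKNN (a sketch): for each label, keep the K1 training
    images carrying that label that are closest to the test image, then
    merge all per-label neighbor sets into T_I."""
    n, c = T_train.shape
    # Distance from the test image to every training image (Euclidean here
    # for simplicity; the paper combines several feature-specific distances).
    dists = np.linalg.norm(X_train - x_test, axis=1)
    neighbor_idx = set()
    for j in range(c):
        members = np.where(T_train[:, j] == 1)[0]    # images annotated with label j
        order = members[np.argsort(dists[members])]  # closest members first
        neighbor_idx.update(order[:K1].tolist())     # at most K1 neighbors per label
    idx = np.array(sorted(neighbor_idx))
    return idx, dists[idx]
```

Because every label contributes at most K1 images to \(T_{I}\), even rare labels are represented in the neighborhood, which is how the first pass counters class imbalance.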

The second pass of 2PKNN assigns different importance to different labels by giving each label a weight. Based on the set \(T_{I}\) and given a label \(l_{k}\), the posterior probability for the testing image I can be written as

$$P(I|l_{k})=\sum\limits_{(x_{i},t_{i})\in T_{I}} \alpha_{I,x_{i}}\cdot \beta(l_{k}\in t_{i}), $$
(1)

where \(\alpha _{I,x_{i}}=\exp (-D(I,x_{i}))\) is the contribution of the image \(x_{i}\) when we predict a label \(l_{k}\) for I, and \(D(I,x_{i})\) denotes the visual distance between I and \(x_{i}\), so closer images contribute more. β(lk ∈ ti) indicates whether or not the label lk appears in the set ti of xi, namely β(lk ∈ ti) = 1 if lk appears in ti and β(lk ∈ ti) = 0 otherwise.

Given a testing image I, we can obtain the posterior probability for the label lk, i.e.,

$$ P(l_{k}|I)=\frac{P(l_{k})P(I|l_{k})}{P(I)}. $$
(2)

According to the 2PKNN method, we can obtain a relation matrix that shows the relation between the testing images and all the labels. For the subsequent refinement, we choose the M most relevant labels for each testing image and set the corresponding entries of the relation matrix to 1, while the other entries are set to 0. We denote the resulting original relation matrix by \(P_{m\times c}\), where m is the number of testing images.
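The second pass and the top-M binarization described above can be sketched as follows (the helper relation_row and its arguments are hypothetical names; the normalization by P(I) in (2) is omitted since it does not change the label ranking):

```python
import numpy as np

def relation_row(T_train, neighbor_idx, neighbor_dists, M):
    """Second pass of 2PKNN (a sketch): weight each merged neighbor by
    exp(-D(I, x_i)) and accumulate the weights over the labels it carries
    (Eq. 1), then keep only the M highest-scoring labels as the binary
    relation-matrix row of this test image."""
    alpha = np.exp(-neighbor_dists)         # contribution alpha_{I,x_i} of each neighbor
    scores = alpha @ T_train[neighbor_idx]  # unnormalized P(I | l_k) for every label
    row = np.zeros(scores.shape)
    row[np.argsort(scores)[::-1][:M]] = 1   # top-M labels -> 1, the rest -> 0
    return row
```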

3.2 Group sparse reconstruction

Using 2PKNN, we obtain the relation matrix \(P_{m\times c}\), which accomplishes the image annotation task and shows the relation between images and labels. However, the obtained labels are usually too insufficient and noisy to describe the whole semantic content of the testing images. To solve this problem, we propose to utilize group sparse reconstruction [12] to further refine the relation matrix \(P_{m\times c}\).

The process of group sparse reconstruction can be formulated as follows:

$$ {\Omega}=\min_{\tau}\|W(p-\hat{P}\tau)\|_{2}^{2}+\xi\sum\limits_{i = 1}^{c}\|g_{i}\|_{2}, $$
(3)

where \(p\in R^{c\times 1}\) is the label vector of a testing image I taken from \(P_{m\times c}\), \(\hat{P}\in R^{c\times K_{2}}\) is the corresponding label matrix of the \(K_{2}\) nearest neighbors of the image I in the training image set, \(\tau \in R^{K_{2}\times 1}\) denotes the weights of the neighbors, W is a diagonal matrix defined as \(W(i,i)=\exp (p_{i})\), and ξ is a tuning parameter that balances the group sparsity. The groups are constructed as follows: first, for each label \(l_{i}\), all the training images annotated with \(l_{i}\) form a group \(group_{i}\); then, \(g_{i}=\{\tau _{\sigma (i,1)},\tau _{\sigma (i,2)},...,\tau _{\sigma (i,|g_{i}|)}\}\) collects the entries of τ whose corresponding \(K_{2}\) nearest neighbors fall in \(group_{i}\), with σ(i,·) indexing those neighbors.

According to [27], group sparsity can improve on previous sparse methods and yields more accurate and robust weights. From the second term of (3), we can see that \(\xi \sum \limits _{i = 1}^{c}\|g_{i}\|_{2}\) is a combination of the L1 and L2 norms: the L1 norm emphasizes the sparsity among groups, whereas the L2 norm smooths the weights within each group.
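As one way to solve (3), the sketch below runs proximal gradient descent with group soft-thresholding. It assumes the groups partition the K2 coefficients; in the paper a neighbor carrying several labels appears in several groups, and such overlapping groups would require an overlapping group lasso solver. All names here are ours.

```python
import numpy as np

def group_sparse_tau(p, P_hat, groups, xi, n_iter=500):
    """Proximal-gradient sketch of Eq. (3): minimize
        ||W (p - P_hat @ tau)||_2^2 + xi * sum_i ||tau[groups[i]]||_2
    assuming the groups are disjoint index arrays over the K2 coefficients.
    p: (c,) label vector of the test image; P_hat: (c, K2) neighbor labels."""
    c, K2 = P_hat.shape
    w = np.exp(p)                                # diagonal of W, W(i,i) = exp(p_i)
    A = w[:, None] * P_hat                       # A = W @ P_hat
    b = w * p                                    # b = W @ p
    eta = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)  # step size from the Lipschitz constant
    tau = np.zeros(K2)
    for _ in range(n_iter):
        z = tau - eta * 2 * A.T @ (A @ tau - b)  # gradient step on the smooth term
        for g in groups:                         # group soft-thresholding (prox of the
            norm_g = np.linalg.norm(z[g])        # sum of group L2 norms)
            z[g] *= max(0.0, 1.0 - eta * xi / norm_g) if norm_g > 0 else 0.0
        tau = z
    return tau
```

Intuitively, the L2 norm inside each group keeps the weights of a surviving group smooth, while summing the group norms acts like an L1 penalty across groups and drives entire groups to zero.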

After obtaining the optimal weight vector τ from (3), the reconstructed label vector \(\hat {p}\) for a testing image I can be computed as follows.

$$ \hat{p}=\hat{P}\tau. $$
(4)

Finally, we apply the above procedure to each testing image to further refine the relation matrix P. The algorithm of the proposed 2PKNN-GSR method is detailed in Table 2.
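A hedged end-to-end sketch of this refinement loop, reusing group_sparse_tau from the snippet above (the plain nearest-neighbor search and the disjoint-group simplification are our own):

```python
import numpy as np

def refine_relation_matrix(P, X_train, T_train, X_test, K2, xi):
    """Sketch of the 2PKNN-GSR refinement: reconstruct each row of the
    relation matrix P from the labels of the K2 nearest training
    neighbors (Eq. 4)."""
    P_refined = np.zeros(P.shape)
    for m, x in enumerate(X_test):
        dists = np.linalg.norm(X_train - x, axis=1)
        nn = np.argsort(dists)[:K2]          # the K2 nearest training images
        P_hat = T_train[nn].T                # (c, K2) label matrix of the neighbors
        # One group per label: the tau-entries of neighbors carrying that label.
        # Overlaps are ignored here; the sketch treats the groups as disjoint.
        groups = [np.where(P_hat[j] == 1)[0] for j in range(P_hat.shape[0])]
        groups = [g for g in groups if g.size > 0]
        tau = group_sparse_tau(P[m].astype(float), P_hat, groups, xi)
        P_refined[m] = P_hat @ tau           # Eq. (4): p_hat = P_hat @ tau
    return P_refined
```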

Table 2 The algorithm of the 2PKNN-GSR method

4 Experimental results and analysis

4.1 Datasets

We evaluate the proposed 2PKNN-GSR method on three standard image annotation datasets and compare its performance with previous state-of-the-art methods [4, 5, 6, 10, 13, 14, 15, 16, 23, 27]. Table 3 describes the three datasets in detail.

Table 3 Details for the three datasets used in this work

4.2 Features

In the experiments, we use features similar to those in [6] and directly merge them into one set to represent each image, so that the proposed method can be compared fairly with the state-of-the-art methods. This set consists of 15 global and local features. The global features comprise Gist and color histograms in the HSV, LAB and RGB spaces, while the local features are SIFT and hue descriptors obtained from a multi-scale grid and from Harris-Laplacian interest points. For each image, all the features except Gist are also computed over three equal horizontal partitions to encode some spatial information. We use different measures to compute the distances between different features: the L1 measure for the color histograms in HSV, LAB and RGB, L2 for the Gist features, and χ2 for the SIFT and hue descriptors.
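The per-feature distance choices can be sketched as below; the feature keys, the dictionary representation and the equal-weight average are our own simplification (TagProp [6], for instance, learns per-feature weights instead).

```python
import numpy as np

def chi2_dist(h1, h2, eps=1e-10):
    """Chi-square distance between two histograms (used for SIFT and hue)."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def combined_distance(feats_a, feats_b):
    """Combine feature-specific distances: L1 for color histograms,
    L2 for Gist, chi-square for the local descriptors.
    feats_a, feats_b: dicts mapping a feature name to a 1-D numpy vector."""
    metric = {"hsv": "l1", "lab": "l1", "rgb": "l1",   # color histograms
              "gist": "l2",                            # global Gist descriptor
              "sift": "chi2", "hue": "chi2"}           # local descriptors
    dists = []
    for name, kind in metric.items():
        a, b = feats_a[name], feats_b[name]
        if kind == "l1":
            dists.append(np.abs(a - b).sum())
        elif kind == "l2":
            dists.append(np.linalg.norm(a - b))
        else:
            dists.append(chi2_dist(a, b))
    return np.mean(dists)   # equal weights: a simplification of [6]
```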

4.3 Evaluation measures

In the experiments, we use the annotation precision and recall of each label over the testing image set to evaluate the performance and compare it with other state-of-the-art methods. Suppose the number of images annotated with the label li in the ground truth is num1, and the label li is assigned to num2 images in the testing set, of which num3 are correct. Then, for the label li, the precision is defined as \(Precision=\frac {num_{3}}{num_{2}}\), and the recall as \(Recall=\frac {num_{3}}{num_{1}}\). From the precision and recall of each label, we obtain the average precision P and average recall R, and further compute the F1-score as \(F1=\frac {2\cdot P\cdot R}{P+R}\). Furthermore, we also report N+, i.e., the number of labels assigned to at least one testing image. Together, these measures evaluate the performance of the proposed method effectively.
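These definitions translate directly into code; below is a small numpy sketch over binary prediction and ground-truth matrices (annotation_metrics is a name of our own):

```python
import numpy as np

def annotation_metrics(pred, truth):
    """Average per-label precision/recall, F1 and N+ for binary (m, c)
    prediction and ground-truth matrices over m test images and c labels."""
    num1 = truth.sum(axis=0)            # images carrying each label in the ground truth
    num2 = pred.sum(axis=0)             # images the label was assigned to
    num3 = (pred * truth).sum(axis=0)   # correct assignments per label
    prec = np.where(num2 > 0, num3 / np.maximum(num2, 1), 0.0)
    rec = np.where(num1 > 0, num3 / np.maximum(num1, 1), 0.0)
    P, R = prec.mean(), rec.mean()
    F1 = 2 * P * R / (P + R) if (P + R) > 0 else 0.0
    n_plus = int((num2 > 0).sum())      # N+: labels assigned to at least one image
    return P, R, F1, n_plus
```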

4.4 Results

We first evaluate the performance of the proposed method by comparing it with several previous state-of-the-art methods. Table 4 shows the results of the proposed method and the compared methods on the three datasets (Corel 5K, IAPR TC12 and ESP Game). According to the results, the proposed 2PKNN-GSR method outperforms the previous state-of-the-art methods: its recall is the highest on both IAPR TC12 and ESP Game, its precision is the highest on Corel 5K, and its F1-score and N+ on the three datasets are better than those of the majority of the compared methods.

Table 4 Comparison with state-of-the-art methods on the three datasets (Corel 5K, IAPR TC12 and ESP Game)

First, we compare the proposed 2PKNN-GSR method with the nearest neighbor based methods [6, 14, 23]. According to the results in Table 4, the F1-score of 2PKNN-GSR is higher than those of the three methods. Analyzing the results, we find that the labels obtained by the traditional nearest neighbor based methods alone are usually incomplete, inconsistent and error-prone. Since the proposed 2PKNN-GSR method uses the group sparse reconstruction to refine the annotation results obtained by the traditional method, it significantly outperforms the nearest neighbor based methods.

Then, we also compare our method with the sparsity-based methods [15, 27]. According to the results in Table 4, the proposed 2PKNN-GSR method performs better than these two methods. A disadvantage of the sparsity-based methods is that they use all the data to train the model without discarding useless samples, which may cause data redundancy. In contrast, the proposed 2PKNN-GSR method first uses 2PKNN to obtain the neighbors of each testing image, which avoids data redundancy as much as possible and thus improves the annotation results.

Moreover, to intuitively show the effectiveness of the proposed 2PKNN-GSR method, we also present some example results on the Corel 5K dataset, as shown in Fig. 2. For example, the predicted labels of the fifth image include the word “tree”, which describes the image well although the ground truth omits it. The ground truth of the last image includes the word “house”, which does not describe the image; the proposed method refines the result by deleting “house” and adding two words, “sky” and “water”, which describe this image more completely.

Fig. 2 Example annotation results of the proposed method on the Corel 5K dataset

We can analyze the proposed 2PKNN-GSR method from the following two aspects:

The first step of our method obtains the predicted labels of the testing images using the traditional 2PKNN method. 2PKNN consists of two passes, which respectively use “image-to-label” and “image-to-image” similarities, and can effectively address the class-imbalance and weak-labeling issues.

However, the predicted labels obtained by the traditional methods are usually too incomplete, insufficient and noisy to describe the semantic content accurately. To solve this problem, in the second step, the proposed 2PKNN-GSR method takes advantage of the group sparse reconstruction to refine the results of the first step. In the group structure, the L1 and L2 norms are combined: the L1 norm emphasizes the sparsity among the groups, while the L2 norm smooths the weights within each group.

In summary, the novelty and technical contribution of the proposed 2PKNN-GSR method lie in effectively combining the traditional 2PKNN method with the group sparse reconstruction. First, we use the traditional 2PKNN to obtain the relation between the testing images and the labels. Since these labels are usually incomplete, insufficient and noisy, the proposed method then refines the results by the group sparse reconstruction, which uses the L1 and L2 norms simultaneously. In this way, the image annotation results are improved.

5 Conclusion and future work

In this paper, we propose a novel method, 2PKNN-GSR, which refines image annotation results by group sparse reconstruction to improve annotation performance. First, we obtain the predicted labels of the images using the traditional 2PKNN method, which makes use of “image-to-label” and “image-to-image” similarities to address the class-imbalance and weak-labeling issues. However, the predicted labels are usually too incomplete, insufficient and noisy to describe the whole semantic content of the images accurately, leading to unsatisfactory results. We therefore take advantage of the group sparse reconstruction to refine the results obtained by 2PKNN. We conduct experiments and analysis on three standard datasets, i.e., the Corel 5K, IAPR TC12 and ESP Game datasets. The experimental results show that the proposed 2PKNN-GSR method outperforms several previous state-of-the-art methods in annotation quality. In the future, we plan to select “clean” samples for learning the recognizer to further verify the effectiveness of the proposed 2PKNN-GSR method, and to improve the image annotation performance by adopting a more efficient refinement procedure.