1 Introduction

There has been explosive growth in the number of images available on social image sharing networks such as Flickr. Social media websites allow users to provide several descriptive words, called tags, to describe the content of each shared photo [29]. These user-annotated tags make social image sharing websites more accessible to the public [1]. A social image carries multimodal information [65, 70, 74], such as tags and geo-tags, which is closely correlated with the captured image. For example, geo-tags are sometimes highly relevant to the geographical information captured by remote sensing satellites [4, 17, 28, 34, 35, 37, 39, 40, 72]. Many applications can benefit from the tags of social images [22, 35, 36, 38, 46, 52, 68, 74], for example pattern recognition and computer vision applications [2, 5,6,7,8, 50, 76], personalized recommendation [15, 16, 19, 47, 48, 77,78,79], and big-data analysis. However, in tag based image retrieval, user-labeled tags are noisy, biased, and incomplete, so the performance of tag based applications is inevitably affected by tag quality [27, 38].

In particular, in text based image search, if a tag is absent from an image's tag list, that image cannot appear in the top-ranked results [38]. Thus, image tagging or tag enrichment based on both the existing tags and the visual information is important. Recently, many efforts have been devoted to image tagging, recommending more relevant tags for an image via textual co-occurrence or learning based approaches.

In the proposed tag enrichment approach, we first determine the relevance of each tag to the image. We then select seed tags for the image by taking both their relevance and their semantic coverage into account. Finally, tags are selected iteratively by maximizing the compatibility values among tags.

The rest of this paper is organized as follows: Section 2 reviews related work on image tagging; Section 3 details the proposed approach; Section 4 presents experimental results and discussion; Section 5 concludes.

2 Related Work

Social image tag enrichment aims to enrich an image with content-relevant textual descriptions, starting from the existing tags provided by social network users and based on visual or textual descriptions. In a social network, much multimodal information is available for an image, such as geo-location, capture time, user-annotated tags, view counts, and textual comments by other users, but the user-labeled tags themselves are noisy, biased, and incomplete. From the feature point of view [3, 4, 71, 79], social image tagging approaches can be classified into those using only visual features, only textual features, or their combination.

Some tags are associated with color, texture patterns, or visual content, while others derive from textual information or local descriptors. How to select effective features for image tagging is therefore an interesting problem, recently studied by exploring the sparsity of semantics and low-level features [20, 24, 34, 38, 67, 70, 75]. In particular, with the success of deep feature extraction approaches such as convolutional neural networks, the traditional semantic gap can be further narrowed. For example, Ma et al. propose a collaborative feature selection based subspace sparsity representation approach for image annotation [12].

Both model-based and model-free approaches can be adopted to fuse multimodal features in tagging. Model-based approaches build a model for each tag [9, 10, 20, 21, 24, 34, 39, 52, 65, 70, 80], while model-free approaches predict relevant tags for an image by exploiting statistical properties of tags, low-level visual features, and other types of information [11, 46, 55, 71, 74].

In model-based approaches, both discriminative and generative models can be adopted to relate tags to low-level visual features [10, 12, 20, 24, 34, 39, 65, 70, 77]; SVM, GMM, LDA, and multiple-kernel learning based approaches are often utilized [72]. For example, Chen et al. use support vector machines to carry out image tagging [45]. Wang et al. utilize an adaptive Gaussian mixture model to build a visual tag dictionary [21]. Xu et al. refine tagging quality by utilizing regularized latent Dirichlet allocation to model tag similarity and tag relevance [64]. Differently, Wu et al. model only the hardest examples to improve tag enrichment performance, rather than building a model for each tag [63].

However, this type of approach faces a scalability problem: because the number of tags is very large, it is computationally intensive and complex to train a model for every tag that appears in a social network. Gu et al. therefore propose to tag communities of tags before assigning tags to each image [13].

Graph based approaches are also often adopted [11, 16, 19, 21, 23, 27, 31, 33, 43, 54, 62,63,64, 67, 72, 80]. Each tag corresponds to a node, and the weight of the edge between two nodes (tags) is their similarity, which can be measured by taking co-occurrence and visual similarity into account. Random walk models are often utilized to find more relevant tags [1, 27, 29, 37, 53]: a random walk promotes tags with many visually similar neighbors and weakens tags with fewer neighbors. Jia et al. fuse the textual similarities of tags and the visual similarities of images in a multi-graph reinforcement framework to find relevant tags [1]. Liu et al. model the image retagging process as a multiple graph-based multi-label learning problem [33]; they propagate the information of each tag along a tag-specific graph and determine a tag-specific visual sub-vocabulary from a collection of social images [33]. Yang et al. propose a graph model based image annotation approach that exploits unlabeled images via visual features [67]: a multi-label classifier is trained by integrating structure learning and graph based transductive classification to provide content-relevant labels during annotation. Some tagging approaches convert graph based image tagging into a graph based optimization problem [42, 43, 66, 73], where graph cuts and graph reinforcement can be utilized to find appropriate tags for an image or video.

Apart from model based approaches, some tagging approaches make full use of collaborative filtering to generate more content-relevant tags. First, visual near-duplicates of the input image are retrieved from a large-scale image corpus, and then the tags appearing frequently among the near-duplicates are recommended for the input image [19, 24, 54, 56, 71, 74]. These approaches rely on the fact that different users label visually similar images with the same tags. However, they require a large-scale tagged dataset: on a small dataset with nearly no duplicates, performance is inferior, especially for novel images shared on the internet for the first time. Moreover, they use only visual features and ignore the user-labeled tags.

Image tagging by mining knowledge from the web has also been studied [31]: Liu et al. first use a knowledge based method to find visual-content-related tags, then constrain the tagging vocabulary to those tags [31]. Image tagging using supplementary information shared by social users, such as capture time and location, has also been proposed recently [3, 15, 17, 18, 49, 79]. These approaches usually consist of image geo-location estimation [3] followed by geo-location based tagging.

3 Our Approach

In the proposed image tagging approach, we classify each tag into one of two sets: highly relevant to the image and less relevant to the image. Our goal is to recommend tags iteratively. First, seed tags that are highly relevant to the image content are selected, based on which we extract more content-relevant tags. The top recommended tags of the previous iteration then serve as seeds for finding new relevant tags in the next iteration.

Two major problems need to be solved in the proposed tag enrichment approach:

  1)

    How to select seed tags, i.e., how to determine the relevance of a tag to the image.

  2)

    How to recommend tags covering diverse semantics based on the top relevant tags, so as to widen the semantic coverage of the image.

3.1 Tag Enrichment

The sketch map of the proposed tag enrichment approach is illustrated in Fig. 1. S_r denotes the determined most-relevant tag set, containing the tags with ranks smaller than or equal to r, and T_r denotes the set of tags with ranks larger than r. The dark shapes denote the seed tags, the light gray shapes denote the tags yet to be adopted, and the green shapes denote the adopted tags. In the proposed tag enrichment approach, a new tag x extracted from T_{r-1} is assigned rank r based on S_{r-1}.

Fig. 1

Sketch map of tag enrichment. S_r denotes the tag set with tag ranks smaller than or equal to r, and T_r denotes the tag set with ranks larger than r. Based on S_{r-1}, a new tag x selected from T_{r-1} is assigned rank r

3.1.1 Similarity of Tag to Image

The tag and the image belong to two different modalities, so it is difficult to measure their similarity directly. However, taking advantage of the tags surrounding a social image, in this paper the similarity of a tag τ to the image I (denoted R(τ)) is measured by the similarity of image I to all the images containing the tag τ, as follows:

$$ R\left(\tau \right)=\frac{1}{\mid {\varTheta}_{\tau}\mid}\sum \limits_{x\in {\varTheta}_{\tau }}\exp \left(-{\left\Vert {F}_I-{F}_x\right\Vert}^2/{\sigma}^2\right) $$
(1)

where Θ_τ denotes the set of images containing tag τ and ∣Θ_τ∣ is the number of images in this set. σ² is set to the median of all pairwise Euclidean distances between images [29]. F_I and F_x are the visual features of images I and x; they can be either low-level visual features or deep features extracted by a CNN [76].
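As a concrete illustration, Eq. (1) can be sketched in Python as follows; the helper names and array layout are our own, not from the paper:

```python
import numpy as np

def median_pairwise_sigma2(feats):
    """sigma^2: the median of all pairwise Euclidean distances between images [29]."""
    n = len(feats)
    dists = [np.linalg.norm(feats[i] - feats[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.median(dists))

def tag_image_similarity(f_img, tag_image_feats, sigma2):
    """Eq. (1): R(tau) as the mean Gaussian-kernel similarity between image I
    and every image whose tag list contains tau."""
    sq_dists = np.sum((np.asarray(tag_image_feats) - f_img) ** 2, axis=1)
    return float(np.mean(np.exp(-sq_dists / sigma2)))
```

Note that R(τ) lies in (0, 1], is largest when the tag's images are visually close to I, and decays with squared feature distance.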

3.1.2 Similarity of Tag to Tag

Tag-to-tag distance is measured by the commonly utilized normalized Google distance [38, 66], expressed as follows:

$$ d\left(p,q\right)=\frac{\max \left(\log f(p),\log f(q)\right)-\log f\left(p,q\right)}{\log W-\min \left(\log f(p),\log f(q)\right)} $$
(2)

where d(p,q) is the normalized Google distance between tags p and q, f(p) and f(q) are the numbers of images containing tag p and tag q respectively, and f(p, q) is the number of images containing both tags. In this paper, these numbers are obtained by searching Flickr with the tags as query terms [29]. W is the total number of images on Flickr. Correspondingly, the tag-to-tag score is expressed as follows:

$$ s\left(p,q\right)=\exp \left(-d\left(p,q\right)\right) $$
(3)
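Eqs. (2) and (3) translate directly into code; this is a minimal sketch in which the occurrence counts are assumed to have been retrieved already:

```python
import math

def google_distance(f_p, f_q, f_pq, W):
    """Eq. (2): normalized Google distance from occurrence counts.
    f_p, f_q: images containing p, q; f_pq: images containing both; W: corpus size."""
    num = max(math.log(f_p), math.log(f_q)) - math.log(f_pq)
    den = math.log(W) - min(math.log(f_p), math.log(f_q))
    return num / den

def tag_score(f_p, f_q, f_pq, W):
    """Eq. (3): s(p, q) = exp(-d(p, q))."""
    return math.exp(-google_distance(f_p, f_q, f_pq, W))
```

Two tags that always co-occur get distance 0 and score 1; the score decays toward 0 as co-occurrence becomes rarer.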

3.1.3 Problem Descriptions

Let Γ = {t_1, ⋯, t_{|Γ|}} denote a tag set with |Γ| tags. Our goal is to rank tags with high relevance ahead of tags with low relevance: a smaller rank means a higher relevance of the tag to the image.

Assuming that the (r-1) most relevant tags S_{r-1} have already been determined (the detailed approach can be found in Algorithm 1), our tag enrichment algorithm aims at finding a candidate tag x (x ∈ T_{r-1}) to assign rank r (r < |Γ|). From the sketch map in Fig. 1, the relationship among the sets S_r, T_r, S_{r-1}, and T_{r-1} is as follows

$$ \left\{\begin{array}{cc}{S}_{r-1}+{T}_{r-1}=\varGamma, & {S}_r+{T}_r=\varGamma \\ {}{S}_{r-1}+x={S}_r,& {T}_{r-1}-x={T}_r\end{array}\right. $$
(4)

The corresponding tag numbers in S r and T r have following relationship

$$ \left\{\begin{array}{cc}\left|{S}_r\right|=r,& \left|{T}_r\right|+\left|{S}_r\right|=\left|\varGamma \right|\\ {}\left|{T}_r\right|-\left|{T}_{r-1}\right|=-1,& \left|{S}_r\right|-\left|{S}_{r-1}\right|=1\end{array}\kern0.5em ,r\in \left\{1,\cdots, \left|\varGamma \right|\right\}\right. $$
(5)

Determining which tag in T_{r-1} should be selected as the r-th most relevant tag to the image is an optimization problem. Assigning tag x rank r should account for both the cost of transferring it from T_{r-1} to S_r and the cost of designating x the r-th most relevant tag to the image. Let E^{(r)}(L) denote the compatibility value of assigning tag x rank r given the already determined (r-1) relevant tags in S_{r-1} and the remaining tags in T_{r-1}. Thus E^{(r)}(L) consists of two terms:

$$ {E}^{(r)}(L)={E}_n^{(r)}(L)+{E}_I^{(r)}(L) $$
(6)

where L = {L_p | p ∈ Γ} is the iteration-dependent labeling of the tag set Γ; L_p = s (or t) means tag p is labeled relevant (or irrelevant) to the image at step r. In Eq.(6), \( {E}_n^{(r)}(L) \) is the overall compatibility value of partitioning the tags into the two sets.

$$ {E}_n^{(r)}(L)=\sum \limits_{\left\{p,q\right\}\in \varGamma }{V}_{p,q}^{(r)}\left({L}_p,{L}_q\right) $$
(7)

where \( {V}_{p,q}^{(r)}\left({L}_p,{L}_q\right) \) is the compatibility value of assigning tags p and q the labels L_p and L_q at the r-th iteration. It can be measured by textual similarity in terms of the commonly used normalized Google distance.

The larger the Google distance between tags p and q, the smaller their compatibility value. Accordingly, the compatibility value of assigning tags p and q the labels L_p and L_q is measured as follows

$$ {V}_{p,q}\left({L}_p,{L}_q\right)=\left\{\begin{array}{cc}\exp \left(-d\left(p,q\right)\right)& \mathrm{if}\kern0.5em {L}_p={L}_q\\ {}1-\exp \left(-d\left(p,q\right)\right)& \mathrm{otherwise}\end{array}\right. $$
(8)

In particular, if p = q, then the compatibility value V_{p,p}(L_p, L_p) = 1. Moreover, V_{p,q}(L_p, L_q) = V_{q,p}(L_q, L_p).
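Eq. (8) can be sketched as a small helper; the function name is ours:

```python
import math

def pair_compat(d_pq, same_label):
    """Eq. (8): compatibility of assigning tags p, q the labels L_p, L_q,
    given their normalized Google distance d(p, q)."""
    s = math.exp(-d_pq)          # the tag-to-tag score of Eq. (3)
    return s if same_label else 1.0 - s
```

The two branches sum to 1, so similar tags are rewarded for sharing a label and penalized for being split across the two sets.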

In Eq.(6), \( {E}_I^{(r)}(L) \) is the overall compatibility value of assigning the tags the labels L at step r for the input image. It is obtained by summing the compatibility value of each tag to the image:

$$ {E}_I^{(r)}(L)=\sum \limits_{p\in \varGamma }D\left({L}_p\right) $$
(9)

where D(L_p) is the compatibility value of assigning tag p the label L_p:

$$ D\left({L}_p\right)=\left\{\begin{array}{cc}R(p),& if\ {L}_p=s\\ {}1-R(p),& if\ {L}_p=t\end{array}\right. $$
(10)

where R(p) measures the similarity of tag p to the image according to Eq.(1), with s = 1 and t = 0.
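Putting Eqs. (6)-(10) together, the full energy of a labeling can be sketched as follows; the dictionary-based representation of labels, similarities, and distances is our own assumption:

```python
import math

def unary_compat(r_p, label):
    """Eq. (10): D(L_p). A 'relevant' label (s) is rewarded by a high
    tag-to-image similarity R(p); an 'irrelevant' label (t) by a low one."""
    return r_p if label == 's' else 1.0 - r_p

def energy(labels, R, pair_d):
    """Eq. (6): E = E_n + E_I for one labeling of the tag set.
    labels maps tag -> 's'/'t'; R maps tag -> Eq. (1) similarity;
    pair_d[(p, q)] is the Google distance for p before q in sorted order."""
    tags = sorted(labels)
    e_n = sum(math.exp(-pair_d[(p, q)]) if labels[p] == labels[q]
              else 1.0 - math.exp(-pair_d[(p, q)])        # Eq. (8)
              for i, p in enumerate(tags) for q in tags[i + 1:])
    e_i = sum(unary_compat(R[p], labels[p]) for p in tags)  # Eq. (9)
    return e_n + e_i
```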

3.1.4 Iterative Optimal Tag Selection

The optimal tag is determined by maximizing the variation of the compatibility value between two adjacent steps r-1 and r, i.e., by maximizing the gain obtained from changing the label of a tag x from irrelevant to relevant at step r. The compatibility value variation ΔE(x) of changing the label of tag x from irrelevant to relevant between neighboring steps r-1 and r is calculated as follows

$$ {\displaystyle \begin{array}{c}\varDelta E(x)={E}^{(r)}(L)-{E}^{\left(r-1\right)}(L)\\ {}=\left({E}_n^{(r)}(L)+{E}_I^{(r)}(L)\right)-\left({E}_n^{\left(r-1\right)}(L)+{E}_I^{\left(r-1\right)}(L)\right)\\ {}=\left({E}_n^{(r)}(L)-{E}_n^{\left(r-1\right)}(L)\right)+\left({E}_I^{(r)}(L)-{E}_I^{\left(r-1\right)}(L)\right)\\ {}=\varDelta {E}_n^{(r)}(x)+\varDelta {E}_I^{(r)}(x)\end{array}} $$
(11)

Eq.(11) can be rewritten as follows:

$$ \varDelta E(x)=\varDelta D(x)+2\varDelta {V}_1(x)+2\varDelta {V}_2(x) $$
(12)

where ΔD(x) is the unary compatibility value variation when the label of tag x is changed from t to s (i.e., from irrelevant to relevant, or from 0 to 1) at the r-th step:

$$ \varDelta D(x)=D\left({L}_x=s\right)-D\left({L}_x=t\right)=2R(x)-1 $$

ΔV_1(x) is the variation of the compatibility value between the tags in S_{r-1} and tag x when the label of x is changed from t to s at step r:

$$ {\displaystyle \begin{array}{c}\varDelta {V}_1(x)=\sum \limits_{p\in {S}_{r-1},q\in \varGamma }{V}_{p,q}^{(r)}\left({L}_p=s,{L}_q\right)-\sum \limits_{p\in {S}_{r-1},q\in \varGamma }{V}_{p,q}^{\left(r-1\right)}\left({L}_p=s,{L}_q\right)\\ {}\varDelta {V}_1(x)=\sum \limits_{p\in {S}_{r-1}}{V}_{p,x}\left({L}_p=s,{L}_x=s\right)-\sum \limits_{p\in {S}_{r-1}}{V}_{p,x}\left({L}_p=s,{L}_x=t\right)\end{array}} $$

ΔV_2(x) is the variation of the compatibility value between the tags in T_r and tag x when the label of x is changed from t to s:

$$ {\displaystyle \begin{array}{c}\varDelta {V}_2(x)=\sum \limits_{p\in {T}_r,q\in \varGamma }{V}_{p,q}^{(r)}\left({L}_p=t,{L}_q\right)-\sum \limits_{p\in {T}_r,q\in \varGamma }{V}_{p,q}^{\left(r-1\right)}\left({L}_p=t,{L}_q\right)\\ {}=\sum \limits_{p\in {T}_r}{V}_{p,x}\left({L}_p=t,{L}_x=s\right)-\sum \limits_{p\in {T}_r}{V}_{p,x}\left({L}_p=t,{L}_x=t\right)\end{array}} $$

Assigning rank r to the tag x_0 that brings the maximum compatibility value improvement, we have:

$$ {\displaystyle \begin{array}{c}{x}_0=\arg \underset{x\in {T}_{r-1}}{\max \limits}\varDelta E(x)=\arg \underset{x\in {T}_{r-1}}{\max \limits}\left(\varDelta {E}_n^{(r)}(x)+\varDelta {E}_I^{(r)}(x)\right)\\ {}=\arg \underset{x\in {T}_{r-1}}{\max \limits}\left(\varDelta D(x)+2\times \varDelta {V}_1(x)+2\times \varDelta {V}_2(x)\right)\end{array}},\kern0.5em {x}_0\in {T}_{r-1},\kern0.5em \left\{{x}_0\right\}={S}_r\cap {T}_{r-1} $$
(13)

That is, changing x_0 from irrelevant to relevant maximizes the compatibility value: the tags in S_{r-1} prefer to adopt x_0 into their set, and at the same time the tags in T_{r-1} prefer to expel x_0 from theirs at step r.
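The iterative selection of Eqs. (11)-(13) can be sketched as follows. Expanding the sums in ΔV_1 and ΔV_2 with Eq. (8) gives the per-pair terms 2s(p,x)−1 and 1−2s(p,x); the dictionary-based lookups are our own assumption:

```python
def delta_E(x, S, T, R, s):
    """Eq. (12): gain of promoting tag x from T to S at step r.
    s[(p, q)] is the tag-tag score exp(-d(p, q)) (keys in both orders);
    R[x] is the tag-to-image similarity of Eq. (1)."""
    dD = 2.0 * R[x] - 1.0                                  # Delta D(x)
    dV1 = sum(2.0 * s[(p, x)] - 1.0 for p in S)            # Delta V1(x)
    dV2 = sum(1.0 - 2.0 * s[(p, x)] for p in T if p != x)  # Delta V2(x)
    return dD + 2.0 * dV1 + 2.0 * dV2

def rank_tags(seeds, candidates, R, s):
    """Iteratively assign the next rank to the candidate maximizing Eq. (13)."""
    S, T = list(seeds), list(candidates)
    while T:
        x0 = max(T, key=lambda x: delta_E(x, S, T, R, s))
        S.append(x0)
        T.remove(x0)
    return S
```

A tag is promoted early when it is relevant to the image (high R), similar to the already-adopted tags (high s to S), and dissimilar to the tags left behind.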

3.1.5 Seed Tags Selection

Multiple seed tags are selected by taking into account both the relevance of each tag to the image and the diversity of the top-ranked tags. A greedy search is utilized to select seed tags by iteratively maximizing their semantic coverage. Assuming the m-1 most relevant seed tags have been determined and pushed into the list S_{m-1} (the detailed procedure can be found in Algorithm 1), the belief value of selecting tag τ as the m-th seed tag for the input image I should reflect both the relevance of the tag to the image and the semantic compensation of τ with respect to the already determined m-1 seed tags. The belief value of tag τ is represented as follows:

$$ B\left(\tau \right)=R\left(\tau \right)\times C\left(\tau \right) $$
(14)

where C(τ) is the semantic compensation of tag τ with respect to the tag set Γ. In this paper, we use the minimum score between τ and the tags in Γ to measure the semantic compensation C(τ), as follows.

$$ C\left(\tau \right)=\underset{\tau_i\in \varGamma }{\min}\left(1-s\left(\tau, {\tau}_i\right)\right) $$
(15)

where s(τ,τ i ) is the textual similarity score of the tag τ and the tag τ i which is given as follows.

$$ s\left(\tau, {\tau}_i\right)=\exp \left(-d\left(\tau, {\tau}_i\right)\right) $$
(16)

where d(τ, τ_i) is the normalized Google distance between tags τ and τ_i, as expressed in Eq.(2). The corresponding seed tag selection procedure is shown in Algorithm 1.
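Since Algorithm 1 itself is not reproduced in this text, the following Python sketch illustrates the greedy selection of Eqs. (14)-(16). Following the prose, the compensation term is computed against the already-selected seeds; the helper names and the score dictionary are our own assumptions:

```python
def select_seeds(tags, R, s, m):
    """Greedy seed selection sketch (Eqs. (14)-(16)).
    R maps tag -> Eq. (1) relevance; s[(p, q)] is the Eq. (16) score."""
    seeds = [max(tags, key=lambda t: R[t])]     # most relevant tag first
    while len(seeds) < m:
        rest = [t for t in tags if t not in seeds]
        def belief(t):                           # Eq. (14): B = R * C
            c = min(1.0 - s[(t, si)] for si in seeds)   # Eq. (15)
            return R[t] * c
        seeds.append(max(rest, key=belief))
    return seeds
```

The compensation term pushes each new seed away from the semantics already covered, so a moderately relevant but semantically distinct tag can beat a highly relevant near-synonym.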

3.2 Flowchart of Image Tagging

Given an image I with initial tag set φ = {τ_1, ⋯, τ_{|φ|}}, let Ω = {v_1, ⋯, v_{|Ω|}} denote the whole tag vocabulary, obtained from the crawled Flickr images; the total number of tags in Ω is ∣Ω∣. Our image tagging approach consists of the following steps: 1) extract features and determine the relevance of each tag to the image; 2) select seed tags for the image as shown in Algorithm 1; 3) use textual similarity to generate new tags for the image. The flowchart of the proposed image tagging approach is given in Algorithm 2.

4 Experiments and Discussion

To evaluate the performance of the proposed tagging approach, we conduct experiments on a dataset crawled with 25 selected queries and on NUS-WIDE [9]. We compare our approach (denoted TS) with the raw tags provided by Flickr users (denoted INIT), a tag co-occurrence based approach (denoted COCR), a random walk based tag ranking approach [29] (denoted RANK), and visual neighbor voting [24] (denoted NBVT). Our experiments cover raw tag ranking, tag enrichment, and tag based image retrieval.

4.1 Dataset

In our crawled dataset, we select 25 queries, including alcedoatthis, apple, beach, bear, butterfly, cherry, deer, eagle, forest, highway, jeep, lavender, lotus, orange, peacock, rose, sailship, sea, sky, strawberry, sun, sunflower, tiger, tower, and zebra, then perform tag-based image search with “ranking by interestingness” option on Flickr. The top 5000 returned images for each query are collected together with their associated information, including tags, uploading time, user identifier, etc. In this way, we obtain a social image collection consisting of 52,418 images. Totally, there are 887,353 raw tags.

4.1.1 Preprocessing

We match each tag against the entries of a Wikipedia thesaurus and keep only the tags that have a corresponding Wikipedia entry [38, 21, 57, 66]. We then adopt a simple and effective filter, removing tags that appear fewer than 20 times. Some high-frequency stop words are also removed as irrelevant tags [15, 16].
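The frequency and stop-word filtering steps can be sketched as below (the Wikipedia matching step is omitted; function and parameter names are our own):

```python
from collections import Counter

def clean_tags(tag_lists, stop_words, min_freq=20):
    """Preprocessing sketch: drop tags occurring fewer than min_freq times
    across the corpus, and drop high-frequency stop words."""
    freq = Counter(t for tags in tag_lists for t in tags)
    return [[t for t in tags if freq[t] >= min_freq and t not in stop_words]
            for tags in tag_lists]
```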

4.1.2 Feature Representation

For each image, we extract 470-dimensional features, including 225-dimensional block-wise color moment features generated from 5-by-5 fixed partition of the image, 170-dimensional hierarchical wavelet packet features [1], and 75-dimensional edge distribution histogram features.

In this paper, we utilize only basic low-level features in our social image tagging approach; high-level features extracted by a CNN could also be utilized.

4.2 Evaluation of Tag enrichment

We use the average relevance score of the test images for performance evaluation, as in [44]. 200 images are randomly selected from our Flickr dataset and labeled by five persons. For each image, each of its user-labeled tags is assigned one of five levels: Most Relevant (score 4), Relevant (score 3), Partially Relevant (score 2), Weakly Relevant (score 1), and Irrelevant (score 0).

  1)

    A tag is Most Relevant to the image if it discloses the most important content of the image.

  2)

    A tag is Relevant if it discloses important, but not the most important, content.

  3)

    A tag is Partially Relevant if it discloses some part of the image content.

  4)

    A tag is Weakly Relevant if it discloses only a small part of the image content.

  5)

    A tag is Irrelevant if it discloses no content of the image.

Given an image with ranked tag list {t_1, ⋯, t_n}, the average relevance score is computed as

$$ {AVR}_n=\frac{1}{K}\sum \limits_{k=1}^K{R}_n^k $$
(16)

where \( {R}_n^k \) is the relevance level of the n-th tag of the k-th test image and K is the total number of test images.
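The metric can be sketched in one line; we assume each test image's ranked tags have already been mapped to their 0-4 relevance levels:

```python
def avr_at(relevance_lists, n):
    """AVR_n: average relevance of the n-th ranked tag over K test images.
    Each inner list holds one image's relevance levels in rank order."""
    return sum(rels[n - 1] for rels in relevance_lists) / len(relevance_lists)
```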

4.2.1 Tag ranking Performance

Fig. 2 (a) shows the average relevance scores of the valid user-annotated raw tags on Flickr (INIT) and of the tag lists ranked by random walk [29] (RANK), visual neighbor voting [24] (NBVT), and the proposed approach (TS), at depths from 1 to 10. For the first ranked tag, the average relevance scores of INIT, COCR, NBVT, RANK, RLVT and TS are 2.56, 2.76, 2.90, 2.89, 3.01 and 3.01 respectively: relevance ranking greatly improves the quality of the first tag. As RLVT and TS both select the most relevant tag first, their performances at this depth are identical. At depth 5, the average relevance scores are 2.17, 2.22, 2.43, 2.43, 2.49, and 2.51 respectively. From Fig. 2, we find that the tag ranking performance of TS is better than those of NBVT, RANK and INIT. Moreover, by improving the compatibility based approach with diverse semantics, better performance is achieved, since the preselected tags cover a wide range of semantics.

Fig. 2

Top-10 image tagging performance comparisons for INIT, COCR, RANK, NBVT, RLVT and TS. a Initial tag ranking performance. b Tag enrichment performance

4.2.2 Tag Enrichment Performance

Fig. 2 (b) shows the average relevance scores of the tags enriched by COCR, RANK, NBVT, RLVT and TS at depths from 1 to 10. For the first ranked tag, the average relevance scores of COCR, NBVT, RANK, RLVT and TS are 2.72, 2.98, 2.87, 3.23, and 3.23 respectively: relevance ranking greatly improves the quality of the first tag. Since the first tag of RLVT and TS is selected by the same most-relevant criterion, their performances are identical. At depth 5, the average relevance scores are 2.49, 2.64, 2.74, 2.91, and 2.99 respectively. From Fig. 2, we find that the tag enrichment performance of TS is better than those of NBVT, RANK and RLVT. Moreover, by improving the compatibility based approach with diverse semantics, better performance is achieved, since the preselected tags cover a wide range of semantics. Subjective tag enrichment results for some social images and their initial tags are given in Table 3, from which we find that comparatively better performance is achieved by our approach.

4.3 Relative Performances

In Section 4.2, we provided the tag ranking performance for the raw tags and three tag ranking approaches: RANK, NBVT and TS. Our approach can also improve tag enrichment performance when an enriched tag list of an image is provided. In this section, we evaluate the relative tag enrichment performance of TS over RANK and NBVT.

We invite five volunteers to evaluate the tag ranking performance of each compared pair of approaches using three degrees: 1) better than, 2) almost the same, and 3) worse than. We then use the weighted average precision (WAP) to evaluate the relative performance of two approaches as follows.

$$ WAP=\left(b-p\right)/\left(b+s+p\right) $$
(17)

where b, s, p denote the ratios of “better than”, “almost the same”, and “worse than” respectively.
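Eq. (17) is a simple ratio and can be sketched directly:

```python
def wap(b, s, p):
    """Eq. (17): weighted average precision from the ratios of 'better than'
    (b), 'almost the same' (s), and 'worse than' (p) judgments."""
    return (b - p) / (b + s + p)
```

WAP is positive when the first approach wins more often than it loses, zero when wins and losses balance, and negative otherwise.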

The relative performances of TS over RANK and NBVT at depths @1, @5 and @10 are shown in Table 1, together with the corresponding ratios of “better than”, “almost the same”, and “worse than”. As TS and RANK both select the most relevant tag first, their performances @1 are identical. As the depth increases, TS outperforms RANK by 37% and 43% at @5 and @10 respectively. Comparing TS with NBVT, the WAP values of TS over NBVT are 24.2%, 44.2% and 49.2% respectively.

Table 1 Statistical results of the relative performances of TS over RANK and NBVT for the 500 test images

4.4 Image Search

We conduct image search on our crawled dataset to verify the effectiveness of the proposed approach. Using the 25 queries of our crawled dataset, we compare tag based image search results on the tags produced by the following methods: (1) search with the raw tags labeled by users (INIT); (2) search with tags enriched by TS (TS); (3) search with tags enriched by RANK (RANK); (4) search with tags enriched by NBVT (NBVT).

To compare the image search results, we obtain the ranked image list of each approach for each query. We invited five subjects to label the relevance of the top 50 results for each query and each method: for each ranking list, each image is judged relevant or irrelevant to the query terms. We use average precision (AP) as the image search evaluation metric. Given a ranked image list, the AP at depth n is defined as follows

$$ {AP}_n=\frac{1}{n}\sum \limits_{j=1}^n{R}_j $$
(18)

where R_j measures the relevance of the j-th result to the query: R_j = 1 if the j-th result is relevant and 0 otherwise. To evaluate the overall performance, we use the mean average precision (MAP) over the 25 queries (Fig. 3).
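Note that Eq. (18) as defined here is the precision of the top-n results; a minimal sketch with our own function names:

```python
def ap_at(relevant_flags, n):
    """Eq. (18): precision of the top-n results (R_j = 1 if relevant, else 0)."""
    return sum(relevant_flags[:n]) / n

def mean_ap(flags_per_query, n):
    """MAP: mean of AP_n over all queries."""
    return sum(ap_at(f, n) for f in flags_per_query) / len(flags_per_query)
```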

Fig. 3

Tag based image search results of different approaches at different depths

Figure 3 illustrates the MAP at different return depths on the crawled dataset. The search results on the tags enriched by RANK, NBVT, RLVT and TS are all better than those on INIT. Moreover, our approach TS outperforms RANK, NBVT and RLVT [44]. This shows that our approach enriches each image with content-relevant tags.

Fig. 4

Image Tagging performance of TS using one, two, three, four seed tags and adaptive seed tag selection algorithm

4.5 Discussion

In [44], the authors use only the most relevant tag as the seed tag for tagging a social image. In this section, we discuss whether selecting multiple seed tags can improve tagging performance. Fig. 4 shows the tagging performance when using one, two, three, or four seed tags, and when the seed number is determined by the adaptive seed tag selection (denoted ASTS) of Algorithm 1. In Fig. 4, the performance with one seed tag corresponds to [44]. The figure shows that manually setting the seed number to 2 achieves better performance, while the adaptive seed tag selection approach achieves the best performance.

Fig. 5

Image Tagging performance of TS by selecting tags from user labeled tags, tags enriched by NBVT, COCR and ASTS

Fig. 5 shows the performance of determining seed tags via ASTS from the user-annotated initial tags, from the top 20 tags recommended by NBVT, and from those recommended by COCR. Interestingly, the user-annotated tags are not as good a seed source as the tags produced by existing tag enrichment approaches; however, the choice of enrichment approach does not have a significant influence. Table 3 shows some Flickr images and the tags enriched from the different sources INIT, RANK, and NBVT. The comparisons show the effectiveness of the proposed approach.

For example, for the second image, the NBVT approach misses the key tag “women” and produces the false tag “butterfly”, while our approach generates more content-relevant tags such as “fashion” and “art”.

4.6 Experiments on NUS-Wide

The NUS-WIDE dataset is composed of two parts: a training part containing 27,807 images and a testing part containing 27,808 images [9]. All images are manually annotated with concepts from an 81-concept ground truth, so in this paper Recall, Precision and F1 are used to measure the performance of the different image tagging approaches. Besides the images and their ground-truth labels, the dataset also provides low-level features extracted from each image, including a color histogram (64D), color correlogram (73D), edge direction histogram (73D), block-wise color moments (256D) and wavelet texture (128D). We use the features provided by NUS-WIDE rather than extracting our own.

There are no initial tags for the test and training images, so in this section we use only visual features for tag recommendation and for determining the relevant tags. We can therefore only report the performances of RANK, RLVT, NBVT, and TS. For a fair comparison, RANK, RLVT, NBVT and TS all use the same features: all five low-level features are utilized, giving a 594-dimensional feature per image. The means of the average Recall, Precision and F1 values are shown in Table 2. From Table 2, we find that TS still outperforms RLVT slightly. This is because the concepts in NUS-WIDE, being manually labeled, are relatively independent.

Table 2 The mean of the average Recall, Precision and F1 values of the top 5 tags by RANK, NBVT, RLVT and TS on NUS-Wide dataset

5 Conclusion

In this paper, a tag filtering approach based on textual similarity modeling is proposed. The approach classifies raw tags into two sets, and the relevant tags are determined one by one by changing their labels from irrelevant to relevant. The tag ranking problem is thereby converted into an optimization problem whose energy function accounts for both the compatibility of a tag with the tags in the two sets and the cost of assigning the tag as relevant to the image content. Experimental results show the effectiveness of the proposed approach (Table 3).

Table 3 Raw tags (2nd column) annotated by Flickr users for the input images (1st column), the initial tag lists ranked by the random walk model (denoted RANK, 3rd column) and by visual neighbor voting (denoted NBVT, 4th column), the first 10 tags enriched by RANK (6th column), and the tag lists of TS over RANK (7th column)