1 Introduction

The current era of the Web, known as Web 2.0, has brought a variety of social applications such as wikis, blogs, social networks, social bookmarking services, and photo, music and video sharing sites. Many of these collaborative applications are widely accepted among users and very successful. They allow all web users to add and share huge amounts of multimedia content and to label these resources with free-form keywords, commonly called tags.

Tagging applications such as Flickr and YouTube let users tag user-generated photos and videos. In comparison, Amazon and Del.icio.us encourage users to tag products or existing web pages. These tagging processes led to the emergence of folksonomies, in which tags are created freely by users according to the context in which they tag a resource. According to some authors, the widespread use of tagging is attributed to two main factors. Firstly, tags are simple and easy to create: users need no particular skills or experience, and tags are easy to use for everyone regardless of level of understanding, age, cultural background or language. Tagging requires no setup and is easy to adopt. Secondly, tagging is instantaneous [38]. Furthermore, folksonomies do not require a hierarchy or any other classification scheme, which is why they are open-ended and truly reflect the users’ perspective on different resources. Tagging also involves low cognitive cost [77]. Users are free to assign whatever tags they consider suitable for a resource, and this freedom to follow their own vocabulary is the basic reason behind the success of tagging systems. Users employ tags to retrieve or explore information, to add or share resources, to catch the attention of other people, to introduce themselves in a community, or to convey their opinion [38].

Information retrieval is central to searching tag-based databases, because they hold a large number of different kinds of resources with a variable number of free-form tags assigned to them. Folksonomies are a useful attempt to improve precision in searching and retrieving information. When different users assign metadata to a web resource in the form of tags, a consensus in the form of a user-generated classification emerges automatically [74]. Because of this consensus, users can find unexpected information that they did not know about but that is relevant to them [77].

This freedom, however, leads to the problem of highly unstructured tags. Tag meanings become ambiguous because of spelling mistakes, different lexical forms of the same word (morphological variation), polysemy, homography, synonymy, differing levels of detail or granularity, multilingualism, inaccurate tag-to-resource associations, and different levels of tag precision and abstraction [8, 44, 77, 80]. As a result, the tag space is inconsistent, inefficient and noisy, which reduces precision and recall in search results. Because a folksonomy has a flat organization with no explicit semantic relations among tags [44, 112], it is difficult to find relevant tags and to navigate through them. In this unstructured form, tagging poses a serious challenge to information retrieval. Current systems ignore resources tagged with morphological variations or synonyms of the query tag, resources tagged with more generic or more specific tags, and resources tagged with the same tag written in another language. In addition, a search with a polysemous tag retrieves all the resources carrying that tag, regardless of the sense the user was looking for [38].

Making different semantic relations (such as equivalence and subsumption) explicit [63], at different abstraction levels, makes tags easier to locate and also shows how general or specific each tag is. Furthermore, the keywords a user enters in a search may not come from the domains that a folksonomy covers, and some tags differ in scope yet are very relevant, so tags must be disambiguated independently of their domains. A system-level solution to folksonomy problems may be developed by disambiguating tags and arranging them in a hierarchical structure (at different granularity levels, in the form of tag bundles or in some other representation). The tag space can be further enriched with novel features such as bursty events and tags, with additional metadata (secondary tags in addition to the primary tags) to increase precision and recall, and with the removal of spam posts to ensure correct tag-to-resource associations.

This paper presents an extensive survey of approaches based on the aspects mentioned above and on other semantic emergent features. The major contributions of the paper are as follows:

  • We carry out a comparative analysis of semantic incorporating sources to estimate the significance of each source and to highlight their strengths and limitations.

  • We categorize the recent and state-of-the-art semantic emergent methods that lead to precision in search and navigation.

  • We summarize these methods, highlighting their accuracy, pros and cons, and supported features.

The paper is organized in seven sections according to the framework shown in Fig. 1. Section 2 discusses the significance of semantics and the semantic incorporating sources (external sources, mathematical/statistical formulas, and folks) used by different researchers to introduce semantics into folksonomies. Taking these sources and an additional hybrid category (a combination of statistical, knowledge-based and folk sources) as the basis, Sections 3, 4 and 5 categorize recent and state-of-the-art semantic emergent methods for bringing structure, protecting the folksonomy structure, and enriching the query along with the search results, respectively. We focus on these three aspects because they are interrelated and share the objective of achieving precision in navigation and searching. Some of the techniques are not folksonomic but, in our opinion, can be used effectively in the tagging model; these are discussed under the category ‘Other Aspects’. Lastly, Section 6 presents the summary, and Section 7 concludes with recommendations.

Fig. 1 Framework

2 Semantics in folksonomy

Considerable work has been done to introduce semantics into folksonomies. Braun et al. [18] compared several novel web applications that provide semantic tagging and thus increase precision. The work of Yeung et al. [124] is based on the mutual contextualization of users, tags and resources; they analysed the semantics emerging from the bipartite graphs of all three elements of a folksonomy (users, tags and resources).

A survey of social tagging techniques by M. Gupta et al. [44] discussed models of tag generation, user motivations for tagging, tag space visualization, aspects of ambiguity removal, hierarchy generation and spamming. J. Trant et al. [112] present another review, which discusses the research literature on folksonomies up to 2007. This section looks at the sources used for semantic induction and their significance.

2.1 Significance and role of semantic sources

Semantics can be incorporated into a folksonomy using different sources. The prominent ones are knowledge-based sources such as Wikipedia/DBpedia and ontologies (collectively called external sources), statistical/mathematical formulas, and the folk perspective. Let’s have a look at each source and its significance.

2.1.1 External/Knowledge sources

Wikipedia

Wikipedia is one of the finest examples of collaboratively created, crowdsourced content on the web. According to Alexa.com, Wikipedia is among the ten most visited sites on the web. Other wiki-based online encyclopaedias such as Scholarpedia and Citizendium are also available; however, they are open to registered users only.

The quality of Wikipedia has always been questioned because anyone can edit it, so many studies have tried to establish its reliability as a data source. Jim Giles [41] analysed Wikipedia and stated that it can be considered a source of information as reliable as Britannica. To assess the quality of Wikipedia articles, Kittur et al. [56] used Wikipedia’s article assessment project, in which articles are graded according to how factual, accurate, verifiable, unbiased, stable and comprehensive they are. They also validated the grades externally with a non-Wikipedian community, and the results from this external community were highly significant as well. Javanmardi et al. [52] compared registered and non-registered users of Wikipedia to assess statistically the quality of their contributions to the wiki text. Their results showed that most of the changes in this online encyclopaedia are made by registered users and only a small number by non-registered users. The data drawn from Wikipedia articles is thus regarded as unbiased and validated by the collective intelligence of editors worldwide.

Currently, extensive research has been done on utilizing Wikipedia or its RDF (Resource Description Framework) form called DBpedia as a data source, and taking advantage of this collaborative effort.

DBpedia

DBpedia is the Semantic Web version of Wikipedia [34] and can be used to ask sophisticated queries against Wikipedia. The DBpedia project, described by Bizer et al. [16] and Auer et al. [6], extracts structured information from Wikipedia so that Semantic Web techniques can be applied to it. As Wikipedia evolves, its changes are reflected in DBpedia, which is therefore continuously updated. Thus, DBpedia can address problems such as lack of machine understandability, lack of freshness, and limited topic coverage.

WordNet

WordNet is a well-organized taxonomic knowledge base that many studies have used for finding semantic relatedness. It consists of lexical units and the relations among them, structured into a relational semantic network. It was developed to combine the advantages of electronic dictionaries and on-line thesauri, which makes it a suitable tool for meaning disambiguation, semantic tagging and information retrieval. In WordNet, each distinct meaning of a word is represented by a synset. Synsets are linked to each other through explicit semantic relations (synonymy, antonymy, is-a, part-of, etc.). This creates a network in which related concepts can be recognized by their relative distance from each other. Wu and Zhou [116] point out that, for nouns, the most common, important and useful relation is the ‘is-a’ relation, which accounts for over 70 % of the relations that exist for nouns and is covered in WordNet. One of the most significant attempts at a multilingual WordNet is EuroWordNet, whose ultimate aim is to develop multilingual databases with WordNets for several European languages.
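
To make the idea of relative distance in a semantic network concrete, the following minimal Python sketch computes a path-based similarity over a small is-a hierarchy. The hierarchy `IS_A`, the function names and the score 1/(1 + path length) are illustrative assumptions of ours; they are not WordNet data or the WordNet API, and measures actually built on WordNet differ in detail.

```python
from collections import deque

# Hypothetical toy is-a hierarchy (child -> parent); NOT taken from WordNet.
IS_A = {
    "piano": "instrument",
    "guitar": "instrument",
    "instrument": "artifact",
    "camera": "artifact",
    "artifact": "entity",
}


def build_graph(is_a):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in is_a.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def path_length(graph, source, target):
    """Breadth-first search: number of is-a edges between two concepts."""
    if source not in graph or target not in graph:
        return None  # unknown concept, e.g. a tag missing from the hierarchy
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # the two concepts are not connected


def path_similarity(graph, a, b):
    """Assumed scoring: 1 / (1 + path length), so identical concepts score 1."""
    dist = path_length(graph, a, b)
    return None if dist is None else 1.0 / (1.0 + dist)


GRAPH = build_graph(IS_A)
print(path_similarity(GRAPH, "piano", "guitar"))
print(path_similarity(GRAPH, "piano", "camera"))
```

In this toy hierarchy ‘piano’ and ‘guitar’ are two edges apart (through ‘instrument’), so they score higher than ‘piano’ and ‘camera’, which are three edges apart. A tag that is absent from the hierarchy simply returns `None`, which mirrors the coverage limitation of lexical resources discussed later in this section.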

Wikipedia and WordNet were created with different objectives, and both have been used as powerful semantic incorporating sources. Some studies compare the two to highlight their strengths and limitations. Haridas et al. [47] state that discrete knowledge bases such as IMDB (Internet Movie Database) and WordNet do not cover very diverse topics; exploring topics of interest that are very new and diverse (and cannot be classified correctly in existing categories) requires other knowledge sources. Strube et al. [105] reported that Wikipedia computes semantic relatedness better than WordNet and a Google counts baseline. They compared WordNet and Wikipedia on different benchmarks by applying WordNet-based measures of semantic relatedness to Wikipedia.

Ontology

With respect to search and browsing limitations, ontologies solve two major problems recognized in folksonomies: (1) tag variety (similar verb tenses, plurals, spellings, synonyms, etc.) and (2) the different aims or types of tags used by users, taking into consideration the separation between personal and common tags. Introducing an ontology into a folksonomy enriches information retrieval because it addresses the problems of ambiguity and tag explosion [32]. According to Braun et al. [17], ontologies face the challenge of evolving data and work processes. To achieve sustainable ontology-based systems, ontologies should be built by people who have domain knowledge and not just by knowledge experts.

2.1.2 Statistical and mathematical techniques

Many researchers bring semantics to folksonomies with simple and effective approaches based on mathematical and statistical formulas, whose main advantage is that they are clear and unambiguous. One important reason for their use, observed by Cattuto et al. [20], is that the vocabulary of folksonomies contains many community-specific terms that are not present in any lexical resource. Distributional measures are therefore valued in folksonomies over mapping tags to a thesaurus.

Aschke et al. [5] observed that several factors prevent WordNet from covering Del.icio.us tags extensively. WordNet covers only the English language and consists of a static body of words, whereas Del.icio.us has tags in different languages. In addition, Del.icio.us does not treat tags as words at all but as strings of characters. Another restrictive factor is the structure of WordNet: at most 61 % of the 10000 most frequent tags in Del.icio.us can be found in WordNet. These facts encourage the use of statistical and mathematical techniques.

2.1.3 Folks

Research shows that users’ tagging motivation is the key factor in the success of tagging systems. The web demo [3] stresses the need to know more about the user’s intent, besides content, in order to improve search. Körner et al. [57] also state that collective intelligence will be more precise if tagging pragmatics are analysed as well. They argue that, because users are the basic factor behind the evolution of semantics in a folksonomy, some specific composition of the crowd will contribute maximally to the emergence of semantics. They distinguish folks (folksonomy users) as categorizers or describers; this distinction captures the pragmatic behaviour of folks in terms of how much they contribute to emerging semantics. They identified, and showed experimentally, a specific group of taggers that adds semantic precision to a folksonomy. User tagging behaviour (tagging actions) reflects the interests and perceptions of users with respect to different tags [115].

In this section, we have briefly discussed the significance of semantic incorporating sources. In the next section, we will focus on the techniques using these sources for bringing structure, maintaining the structure by protecting it from spam and enriching query.

3 Utilizing sources of semantics to bring structure in folksonomies

Organization brings structure. In this section, we categorize semantic discovery techniques, using the different types of semantic relationships as organizational criteria. Choudhury et al. [25] used a statistical and/or external-knowledge based classification; we extend it with folk-based and hybrid categories.

3.1 Similarity/Equivalence

Researchers have viewed and evaluated similarity in various ways: similarity among tags, tag-to-resource similarity or association, resource-to-resource similarity and user-to-user similarity. Let’s have a look at the approaches.

  • Statistical / Mathematical Approaches - Classical metrics to find similarity between any two tags \( tag1 \) and \( tag2 \) include cosine, Jaccard and Dice, as given in Eq 1, Eq 2 and Eq 3 respectively. However, cosine seems to yield more synonyms and siblings [105].

    $$ cosine\left( tag1,\; tag2\right)=\frac{tag1 \cdot tag2}{\left\Vert tag1\right\Vert \cdot \left\Vert tag2\right\Vert } $$
    (1)
    $$ jaccard\left( tag1,\; tag2\right)=\frac{\left| tag1\cap tag2\right|}{\left| tag1\cup tag2\right|} $$
    (2)
    $$ dice\left( tag1,\; tag2\right)=\frac{2\left| tag1\cap tag2\right|}{\left| tag1\right|+\left| tag2\right|} $$
    (3)

    Markines et al. [76] focused on tag-to-tag and resource-to-resource similarity. They used different methods of aggregation (projection, distribution, incremental, collaborative filtering) and evaluated them with similarity measures such as cosine, overlap, Jaccard and mutual information. In non-incremental methods, distributional and mutual information performed best; the same was the case for the incremental method. Furthermore, the approach was verified using WordNet for tag similarity and the Open Directory Project (ODP) for resource similarity.

    Mousselly et al. [87] proposed an approach called Adaptive Jensen-Shannon Divergence (AJSD) for finding related tags, based on calculating the distance between tag distributions using the Jensen-Shannon divergence. The probability distribution of each tag is calculated using co-occurrence and Laplacian. The authors evaluated their scheme using WordNet and compared it with cosine similarity.

    Geir and Atle [39] proposed a combination of morpho-syntactic and semantic similarity measures. Levenshtein distance is used for morpho-syntactic similarity, while tag signatures and cosine similarity are used to find the semantic similarity among tags. No external linguistic resources (WordNet or semantic resources such as ontologies) are used to mine tag pairs, which makes the approach more robust in that it can handle a larger portion of the tags found in the folksonomy. In addition, the approach does not depend on tags co-occurring in order to find relations among them; rather, it uses topical/semantic similarity in addition to the Levenshtein distance to find similar tags.

    Quattrone et al. [93, 94] argued that real-world folksonomies are characterized by power-law distributions of tags, on which commonly used similarity metrics, including the Jaccard coefficient and cosine similarity, fail. To compute tag and resource similarity in large-scale folksonomies, they proposed the mutual reinforcement principle, which states that “two tags are deemed similar if they have been associated to similar resources, and vice-versa that is, two resources are deemed similar if they have been labelled by similar tags”.

    SHIATSU is a system developed by Bartolini et al. [10] for automatically suggesting labels for videos at the shot level. It is based on the assumption that objects sharing similar visual content also share the same semantic content, which implies that objects with similar content should be tagged with the same set of labels. One important aspect that can influence the selection of candidate tags drawn from content-wise similar resources is tagging behaviour. Golbeck et al. [43] examined tagging behaviour with respect to image content. One of their important conclusions is that users give more tags to images that are visually more complex; however, the number of tags decreases when the number of areas of interest (AOIs) exceeds a certain threshold.

    For recommendation purposes, Lops et al. [71] computed a set of candidate tags using content-based and collaborative components. The collaborative part analyses the tags assigned to the most similar resources (those about the same topic), while the content-based part exploits the information emerging from the content of the resource itself (content-based tagging). Based on the same idea, Zhou et al. [133] presented a hybrid probabilistic model (HPM) that combines low-level image features and user-provided tags (content-based and collaborative tagging) to provide appropriate tags for labelling images.

    Distributional measures discover the similarity among tags by taking the resource, tag and folk contexts into consideration [21, 51]. In the Resource Context Based approach, the context of a tag \( tag_i \) consists of all the resources that are annotated with \( tag_i \). Abbasi [1] formally represents the resource context of a tag \( tag_i \) as a row of the matrix \( R \) shown in Eq 4.

    $$ R=\left[{f}_{ij}\right] $$
    (4)

    Here \( f_{ij} \) is the number of times tag \( tag_i \) appeared with resource \( j \). Each row of matrix \( R \) is a tag vector and each column is a resource vector. Non-zero elements give the number of times the resource has been annotated with a particular tag, and a zero value means the tag was not used. To decide whether two tags \( tag_i \) and \( tag_j \) are semantically similar based on their resource context, first compute the resource context of each of the two tags. Then cosine, Dice, Jaccard, probabilistic (mutual information) and heuristic measures can be used to compute the similarity between the two vectors.

    In the Folk Context Based approach, the user context of a tag consists of all the users that share that tag. For example, if many users annotate different resources with the tags coin and cent but never use the two tags together on any resource, it is still possible to discover the relationship between these tags by considering all the users that have both tags in common. The user context of a tag \( tag_i \) is computed as given in Eq 5 [1].

    $$ U=\left[{u}_{ij}\right] $$
    (5)

    If user \( j \) has used the tag \( tag_i \), the value of \( u_{ij} \) is 1; otherwise it is 0. Each row of matrix \( U \) is a tag vector, while each column is a user vector. Non-zero values stand for the users that have used a particular tag. Similarity between two tag vectors based on the user context can be computed using cosine, Dice, Jaccard, probabilistic (mutual information) and heuristic measures.

    In the Tag Context Based approach, two tags are considered similar if they occur in the same context. Tag context similarity is a scalable and accurate measure of tag similarity, as pointed out in [21, 76]. Benz et al. [13] used it on the Flickr and Del.icio.us folksonomies to measure tag similarity at a global scale, since many of the frequently occurring Del.icio.us tags also appear in Flickr. The assessment of tags across Flickr and Del.icio.us shows little semantic overlap: tags in Flickr relate more to a visual point of view, whereas tags in Del.icio.us lean towards their technical meaning. Yeung et al. [125] argue that tags can be contextualised better by taking into account the social contexts in which they appear. While a tag itself offers little information on this, its associations with other tags, users and documents in a folksonomy provide valuable clues for understanding its semantics.

    Harvey et al. [48] applied the Tripartite Topic Model (TTM) to folksonomies in order to suggest new tags to users, given a small number of tags they have already supplied as well as their previous annotations. This model suggests more appropriate tags than current systems. TTM provides a complete representation of the data acquired from a folksonomy and so could readily be applied to other useful tasks, such as finding similar user groups by clustering. The tag recommendation algorithm could also be tailored to propose new resources instead of tags. Xu et al. [119] make the important point that, in reality, most tags are irrelevant to the image content. Many of the proposed solutions rely on tag similarity to mine tag relevance. However, the computation of tag similarity is strongly affected by noisy tags in the corpus, so tag relevance cannot be estimated precisely. Xu et al. therefore tackle the tag refinement problem from the angle of topic modelling, since a topic model does not need explicit co-occurrence among terms (tags) to conclude that they are semantically similar.

  • Knowledge/External Source Based Approaches - Lee et al. [62, 63] derived subsumption, similarity and equivalence relations among folksonomy tags using the collective intelligence of Wikipedia, showing precision and recall of up to 88.03 % and 91.87 % respectively. Min et al. [80] identify semantically related tags using WordNet (for disambiguation) and the Lin similarity measure. They tested their method on Flickr tags, and the experiment showed a similarity improvement of 80.28 % over some other methods.

    Wu and Zhou [116] viewed semantic relatedness among tags in the context of semantic relatedness among words, for which they suggested using Roget’s thesaurus, WordNet and Wikipedia. They also concluded that the Wikipedia semantic network has larger coverage than WordNet for computing semantic relatedness.

  • Hybrid Approaches - The method of Uddin et al. [114] (Mlin) for finding relationships among tags uses WordNet and a co-occurrence metric. In addition to the pairwise relationships between tag, resource and user, the relationship among all three is also considered. The technique proved experimentally to be more effective than LCH [94], JCN [11] and LIN [20] in discovering semantic relationships among tags in the Flickr and Del.icio.us datasets, with an F-measure of 80.28 %.

  • Other Aspects - K.G.V.R. et al. [55] attempted to detect topics in a document by building a topic space composed of frequent document combinations that share a common set of keywords, taking these keywords to represent the same topic.

It is important to note that semantic similarity is different from semantic relatedness, as the latter also covers relations such as antonymy and meronymy. However, the two terms are often used interchangeably. In essence, both ask, “How much does term X have to do with term Y?” [116]. There are many ways of estimating semantic similarity, such as finding the distance between words as proposed in [73, 96]. The outcome is most often represented as a number between 0 and 1, where 1 stands for extremely high similarity/relatedness and 0 means little to none [115]. Moreover, the results of each approach differ. For example, the strength or weight of the similarity between two tags can differ between two measures: the similarity of two tags based on WordNet may differ from their similarity based on the cosine measure [1].
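
The dependence of the result on the chosen measure can be illustrated with a small Python sketch of the measures in Eq 1, Eq 2 and Eq 3, using the standard definitions of cosine, Jaccard and Dice. The counts in `TAG_RESOURCES` are hypothetical toy data, and representing a tag by the set (or count vector) of resources it annotates is an assumption made for illustration, in the spirit of the resource context of Eq 4.

```python
import math

# Hypothetical toy data: tag -> {resource: times the tag was applied to it}.
TAG_RESOURCES = {
    "coin": {"r1": 3, "r2": 1, "r3": 2},
    "cent": {"r2": 2, "r3": 1, "r4": 1},
    "piano": {"r5": 4},
}


def cosine(vec_a, vec_b):
    """Eq 1: dot product of the count vectors divided by their norms."""
    shared = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[k] * vec_b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard(set_a, set_b):
    """Eq 2: shared resources divided by all resources of either tag."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def dice(set_a, set_b):
    """Eq 3: twice the shared resources divided by the sum of set sizes."""
    total = len(set_a) + len(set_b)
    return 2 * len(set_a & set_b) / total if total else 0.0


coin, cent = TAG_RESOURCES["coin"], TAG_RESOURCES["cent"]
print("cosine ", round(cosine(coin, cent), 4))
print("jaccard", round(jaccard(set(coin), set(cent)), 4))
print("dice   ", round(dice(set(coin), set(cent)), 4))
```

For the pair ‘coin’/‘cent’ the three measures give different scores on the same data (Jaccard 0.5, Dice about 0.67, cosine about 0.44), which is why similarity weights obtained from different measures are not directly comparable.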

3.2 Co-occurring tags

Tags co-occur in a variety of ways, as identified by Halpin and Shepard [45]. In a Super-Class Relationship, tags that co-occur often represent a general-to-specific relationship. For example, ‘music’ co-occurs with both ‘piano’ and ‘guitar’ and can be taken as the super-class of both. In comparison, ‘piano’ is unlikely to co-occur with many tags other than ‘music’ and generally co-occurs with ‘music’, so it can be a subclass of ‘music’. In Facet Relationships, tags that co-occur often might have a structured or facet relationship. These may be dyads or triads. For example, ‘book’, ‘author’ and ‘Mark Twain’ form a triadic (‘triple’ in Semantic Web terms) relationship, and if they co-occur quite often they are most likely a facet. In fact, one would expect most co-occurrences to be dyads, like ‘author’ and ‘Zadie Smith’, or ‘book’ and ‘Mark Twain’.

According to Simpson [101], co-occurrence between tags takes place when both tags are used with the same resource. Co-occurrence can be inner or outer: in inner co-occurrence a single user applies both tags to a resource, whereas in outer co-occurrence the two tags are assigned to a resource by different users [67]. Let’s have a look at the approaches.

  • Statistical / Mathematical Approaches - Simple co-occurrence of two tags \( tag1 \) and \( tag2 \) is calculated by counting the number of resources (URLs, photos, etc.) that are labelled with both \( tag1 \) and \( tag2 \), as illustrated in Eq 6.

    $$ co\text{-}occurrence\left( tag1,\; tag2\right)=\left| tag1\cap tag2\right| $$
    (6)

    However, the drawback of simple co-occurrence is that it gives more weight to pairs of tags that occur very frequently. As a result, frequent tags co-occur more often than infrequent tags even if they are not related. This problem can be solved by normalization, of which there are two types: symmetric and asymmetric [100].

    For symmetric normalization, Abbasi et al. [2] compared cosine similarity, the Jaccard coefficient and the Dice coefficient, as defined in Eq 1, 2 and 3 in Section 3.1. They concluded that the Dice coefficient gives a higher value to co-occurring tags than the Jaccard coefficient and that the Jaccard coefficient penalizes tags that do not co-occur very often. Zhang et al. [131] used mutual information (Minfo) for symmetric co-occurrence, as shown in Eq 7. A low value of Minfo indicates that the two tags never co-occur, whereas a high value means high correlation.

    $$ Minfo\left( tag1,\; tag2\right)=\frac{\log \left(co\text{-}occurrence\left( tag1,\; tag2\right)\right)}{occurrence\left(tag1\right)\cdot occurrence\left(tag2\right)} $$
    (7)

    Asymmetric normalization uses the frequency of one of the tags [65, 100], as shown in Eq 8.

    $$ P\left(\left. tag2\right| tag1\right)=\frac{co\text{-}occurrence\left( tag1,\; tag2\right)}{occurrence\left(tag1\right)} $$
    (8)

    Sigurbjörnsson et al. [100] concluded from their experiments that the symmetric Jaccard coefficient is good at discovering equivalent tags, whereas asymmetric tag co-occurrence provides more diverse candidate tags for annotating a resource.

    Wu et al. [115] studied the tag vocabulary of users. They linked folksonomy tags through collaborative tagging, using co-occurring tags, users and resources to form a semantically connected folksonomy network. Fujimura et al. [36] proposed a dimensional placement of tags in the tag cloud according to their co-occurrence, which facilitates tag search in large-scale tag clouds. Their approach does not overlap tags in the cloud. Using k-dense, they computed the centrality of tags and assigned heights accordingly. In this way, relevant resources can be found even if they are not among the immediate neighbours of a tag. The Freq/FolkRank algorithm shows a bias towards high-frequency tags, i.e. towards hyperonyms [20].

    Tibely et al. [109] focused on the statistical properties of tag occurrence in tagged networks with the help of a 2D tag-distance distribution over the relative positions of tags in the DAG (directed acyclic graph). Fig. 2 gives a diagrammatic representation of the scheme. The DAG of the hierarchy between the tags is defined in advance. The first column of cells and the bottom row contain co-occurring pairs of tags that are in a direct ancestor–descendant relation, whereas the diagonal cells correspond to pairs in which the two tags are equally deep in different branches below their lowest common ancestor.

    Fig. 2 (a) A small piece of a DAG in which two pairs of tags are chosen: solid filled circles represent an ancestor–descendant relation, whereas dashed circles represent an ‘uncle–nephew’ pair. (b) The corresponding cells of the tag-distance distribution are displayed in solid black and with dashed lines, respectively [109]

  • Knowledge/External Source Based Approaches - Garcia et al. [37] disambiguate tags using DBpedia and the TSR (TAGora sense repository). They built an index and stored, in a triple, the title, term frequencies, disambiguation, number of incoming links, and redirection links of wiki articles. Titles are stored in different forms, such as lowercase and concatenated. When the TSR is queried for a tag, it returns all the DBpedia resources representing different senses of the tag, along with the weight given to each tag and the term frequencies of each wiki resource. They take the co-occurring tags of a resource as the context of any of those tags in the folksonomy and represent both senses and contexts as vectors. For any tag, the sense and context vectors are compared using cosine similarity. However, the approach has not been evaluated experimentally; the authors only tested the algorithm on tags from real data.

    On Flickr, Flickr clusters are used for tag disambiguation. These clusters group tags based on their co-occurrence. The drawback of this approach is that synonyms are not clustered, and if a resource is assigned a tag that does not co-occur with other related tags, that image will not appear in the user’s search results even if he/she searches for that tag.

    Lee et al. [66] proposed a system, TagPlus, that uses homonyms and synonyms from WordNet to retrieve more relevant images from Flickr. It makes use of the synset ids present in WordNet. Because there is no homonym control in this approach, Flickr may return images that are not relevant to the keyword or sense the user entered, even if he or she uses a highly relevant tag. It does, however, reduce the synonymy problem by also searching for synonyms of the user-entered keywords.

    Tag sense disambiguation (TSD) has been applied to the vocabulary of social tags, enabling users to learn the sense of each tag with the help of Wikipedia. To discover accurate mappings from Del.icio.us tags to Wikipedia articles, the Local Neighbor tags, the Global Neighbor tags, and finally the Neighbor tags are used. These keywords help disambiguate the sense of each tag based on tag co-occurrences. The main idea of TSD is that the sense of a tag can be disambiguated with the help of its neighbour tags, which act as its context. Neighbour tags are the tags that co-occur very frequently with the tag. The underlying principle of this co-occurrence-based approach is that frequent co-occurrence of two tags indicates high semantic relatedness between them. The approach builds on the collective intelligence hidden in folksonomies [64].

    The drawback identified by [10] is that tag co-occurrence, when used alone, does not solve the homonymy/polysemy problem.
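
As an illustration of the context-based disambiguation described above, the following minimal sketch compares a tag’s context vector (built from its co-occurring tags) with candidate sense vectors using cosine similarity. The sense names, term weights and the function names `cosine` and `disambiguate` are hypothetical and chosen only for this example; they are not taken from TSR, DBpedia or any of the cited systems.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(tag, co_tags, senses):
    """Return the sense whose vector is closest to the tag's context."""
    # The context is the bag of tags assigned together with `tag`.
    context = Counter(t for t in co_tags if t != tag)
    scores = {name: cosine(context, vec) for name, vec in senses.items()}
    return max(scores, key=scores.get), scores


# Hypothetical sense vectors (term -> weight) for the ambiguous tag "apple".
senses = {
    "Apple_Inc.": {"mac": 3, "iphone": 2, "computer": 2},
    "Apple_(fruit)": {"fruit": 3, "tree": 2, "recipe": 1},
}
best, scores = disambiguate(
    "apple", ["apple", "mac", "computer", "design"], senses
)
print(best, scores)
```

In this toy input the context shares terms only with the first sense, so that sense is selected. As noted in the text, co-occurrence alone may still fail when the context itself is ambiguous or empty.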

3.3 Clustering

Folksonomies have nested groups of tags associated to common topics [101]. Clustering in folksonomy can be viewed as clusters of tags, context dependent clusters of tags, clusters of resources, clusters of users or combination of them. Let’s have a look at the approaches.

  • Statistical/ Mathematical Approaches– Clustering techniques that rely only on tagging information and tag co-occurrence to find semantically related sets of tags and resources in a folksonomy are used in [12] and in Flickr clusters (footnote 5). Such techniques require only a statistical analysis of tags and lack semantic information. As a result, they quite frequently yield clusters of co-occurring tags that can neither be mapped to an actual topic nor understood by a user. Moreover, these clusters usually cannot solve the problem of tag synonymy, because synonymous tags are typically assigned by users from diverse backgrounds and rarely co-occur [40].

    Agglomerative clustering, asymmetric hierarchical clustering, hierarchical divisive clustering, Probabilistic Latent Semantic Indexing (PLSI) and User-Categorized Tag (UCTag) have been proposed and tested in [4, 33, 46, 101] to build tag clusters. However, some of the resulting clusters may be too large to be useful for navigation; removing unpopular tags before clustering can help. Hierarchical agglomerative clustering of tags has also proved effective for personalized navigational recommendations, and a cluster selection step that discards clusters not directly linked to the user’s query can further improve the recommendations [1, 86]. Clusters of tags can be used successfully to identify both the user’s interests and the topic of a resource [29].

    A co-clustering approach is proposed in [67] to yield clusters containing both resources and user annotations (tags). The technique makes use of groups of correlated tags and social data sources. It also considers the semantics, in addition to the social aspect, of resources and their accompanying tags. Simultaneous clustering of tags, resources and users using a centroid-based approach with cosine similarity is proposed in [72].

    Among these approaches, the agglomerative clustering algorithm has been used in most recent research because it is quite flexible and can produce the required number of levels and cluster sizes. However, Xu et al. [120] argued that K-means or hierarchical agglomerative clustering works well for tag clustering only if tags are scattered spherically and evenly in the data space. These techniques are not effective if the distribution is arbitrary, for example “S”-shaped. Since the freedom of tagging gives no guarantee that tags are distributed evenly or spherically, they proposed tag clustering based on kernel information propagation via a random walk on a graph. They ran experiments on six datasets and compared the results of this clustering technique with others.

  • Knowledge/External Source Based Approaches –Mirizzi et al. [84] state that the Wikipedia categories that help cluster wiki articles are reflected in DBpedia (cluster resource sets). All DBpedia categories are of type skos:Concept. However, documents are associated only with the categories they specifically belong to, not with every category they belong to in some manner. Haridas et al. [47] say that if clustering is done using a discrete knowledge base, the clusters do not convey information or semantics about the concept. In DMOZ-like (Directory Mozilla) hierarchies, it is easy to represent semantic relations other than just subsumption.

  • Hybrid Approaches- Lu et al. [72] clustered users, tags and resources simultaneously, as these three are interrelated in the tripartite structure of a folksonomy. They initialized random cluster centroids based on user, tag and resource vectors and then assigned each of these three kinds of nodes to the cluster whose centroid is closest in terms of cosine similarity. Unlike the k-means clustering algorithm that uses word vectors, this approach does not consider the contents of a web resource, so it can be applied to different types of web resources such as videos and images. However, because only the tripartite link structure is compared, false associations between tags and resources cannot be identified. They used DMOZ to validate the resource clusters extracted from the tripartite structure of the folksonomy. The SEMSOC (SEMantic, SOcial, Content similarity) framework [40] suggests a clustering process for multimedia resources. It jointly uses semantic, social and content-based information, although it relies mostly on tag co-occurrences.

  • Other Aspects – K.G.V.R. et al. [55] proposed document clustering by means of a hierarchical algorithm and using Wikipedia as an external knowledge source. They first mine frequent itemsets (sets of words that occur frequently and can be used for making clusters) for topic detection within a document and clustering of that document with other documents. First, tf-idf scores are assigned to each document in a cluster. Then Wikipedia categories and outlinks are used. Each cluster is labelled belonging to relevant Wikipedia categories (whose occurrence frequency is top k for all documents in a cluster). Their evaluation was based on five standard datasets and they claimed that their results outperformed the current state of the art methods.
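
The co-occurrence-based agglomerative tag clustering discussed in this section can be sketched as follows. This is a minimal illustration, not the procedure of any cited work: the average-linkage criterion, the stopping threshold of 0.1, the toy posts and the function names are all assumptions made for the example.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cooccurrence_vectors(posts):
    """Map each tag to the counts of tags that share a post with it."""
    vectors = {}
    for tags in posts:
        unique = set(tags)
        for t in unique:
            vec = vectors.setdefault(t, {})
            for u in unique:
                if u != t:
                    vec[u] = vec.get(u, 0) + 1
    return vectors


def agglomerate(vectors, threshold):
    """Average-linkage agglomerative clustering of tags.

    Repeatedly merges the two most similar clusters and stops when the
    best available similarity falls below `threshold`.
    """
    clusters = [[t] for t in sorted(vectors)]

    def link(c1, c2):
        sims = [cosine(vectors[a], vectors[b]) for a in c1 for b in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: link(clusters[p[0]], clusters[p[1]]),
        )
        if link(clusters[i], clusters[j]) < threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Toy posts: each inner list is the tag set of one annotated resource.
posts = [
    ["python", "code", "programming"],
    ["python", "programming", "tutorial"],
    ["code", "programming", "tutorial"],
    ["paris", "travel", "france"],
    ["paris", "france", "eiffel"],
    ["travel", "france", "eiffel"],
]
clusters = agglomerate(cooccurrence_vectors(posts), threshold=0.1)
print(clusters)
```

Because the sketch relies on co-occurrence only, it inherits the limitation noted above: synonymous tags that never appear together with the same neighbours would not end up in the same cluster.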

3.4 Hierarchical /Tag taxonomy

Hierarchy is considered a classical semantic relationship. This section is all about the approaches that bring hierarchical structure to folksonomy.

  • Statistical/Mathematical Approaches – Aras et al. [4] presented a tag cloud in which tags can be explored at different hierarchy levels, which gave increased semantic density and focused result. They have used cosine similarity for normalized tag co-occurrence and also considered term context. Agglomerative clustering algorithm has been utilized. Evaluation showed that the users were more satisfied with Semantic Cloud user interface than the standard user interface of folksonomy (in this case Del.icio.us).

    Search result classification based on hierarchical clusters (c-clustering) and zoom based navigation is proposed by Rástočný et al. [95] to improve web search results. Hierarchical clusters make use of semantic properties of search results to produce clusters and hence do not require to be predefined by domain specialists. It also solves the navigational pitfalls of faceted browsing.

    Eda et al. [33] used folksonomy triples to organize tags in generalized and specialized relationships. Using Probabilistic Latent Semantic Indexing (PLSI), they distinguished between subjective and objective tags and then arranged the objective ones into a hierarchy in a Directed Acyclic Graph. They measured the subjectivity of a tag by computing its entropy.

    Considering the different levels or degrees of tag generality (or tag abstractness) in order to highlight the hierarchical relationships that exist among concepts, the results of [14] suggest that centrality and entropy measures can distinguish well between abstract and concrete terms. Moreover, the tag co-occurrence graph is an important input to centrality measures for computing abstractness, in contrast to tag similarity graphs. The tag generality vs. popularity problem is also taken into account, and it is concluded that popularity is in fact a fairly good indication of the true generality of a particular tag.

    The approach used in [102] is based on the conclusion that co-tags are appropriate for developing ontological structures from folksonomies, and that cosine similarity among tag vectors is an appropriate tool for identifying similar tags. An unsupervised method that combines association rule mining with the underlying tagged material is used to generate a semantic representation of each tag. This semantic representation of the tags is an essential component of the generated structure.

    Daud et al. [27] presented ontology of folksonomy taking into account users, tags and resources all at the same time. They named their proposed approach as Actor-Concept-Instance-Topic (ACIT). Their approach outperforms User-Word-Topic (UWT) and Tag-Topic (TT) approaches in accuracy by 8.4 % and 7.4 % respectively.

    Tang et al. [108] formalized the novel problem of ontology learning from folksonomies. Using a probabilistic topic model to represent the tags and their annotated documents, they proposed four divergence measures (Tag, Hypernym, Merging, and Keep), which the algorithm uses to construct a hierarchical structure from tags. Experiments on two different types of real-world datasets show the approach to be effective in learning an ontological hierarchy from social tags.

    Kawakubo et al. [54] introduced hierarchical relation by computing visual, text-based and combined concept vectors. First, they calculated entropy and JS divergence for these three vectors. Degree of relatedness among the concept vectors has been analysed and hierarchical relations among tags have been extracted. They constructed three different ontologies, each of them based on one of the concept vectors, among which they found that the one based on combined features is better than the other two. The noise removal accuracy rate on the average for selected images was 92 % and for randomly selected images was 70 %.

  • Knowledge/External Source Based Approaches- YAGO project [121] worked on structured information extracted from Wikipedia. It makes use of Wikipedia category system and redirects and considers fourteen types of relations. But it does not completely make use of the hierarchy provided by Wikipedia category system. It just maps end points of categories to WordNet hierarchy. FreeBase project5 also attempts to make an online accessible data base that can be edited as a wiki.

    Kobilarov et al. [58] mention that DBpedia entities are arranged in four different hierarchies: the SKOS representation of Wikipedia categories, the YAGO ontology, the UMBEL ontology and the manually developed DBpedia hierarchy.

    Tomuro et al. [110] built ontology from folksonomic tags. Using Domain Similarity Clustering by Committee (DSCBC) algorithm, they made clusters of related tags using Wikipedia knowledge source. In these committees, ambiguous tags are included in each related cluster based on relevance to show their different senses and then ontology from these disambiguated tags using agglomerative clustering algorithm is generated.

  • Folk based approaches- Structured folksonomies with a predefined structure (e.g. hierarchical) have some pitfalls: (1) tagging is restricted by the limited predefined vocabulary, and (2) selecting tags is a time-consuming manual effort. Yoo et al. [126] proposed a technique based on the idea that when a user enters a tag, he/she must also define its category. Such a tag is called a categorized tag (CT). CTs are added to a collaborative structured folksonomy (CSF) that shows the tag-category relations supported by most of the users. A CT-based organizational layer is built on top of the CSF for organizational knowledge classification and enables users to find appropriate knowledge. The authors compared their technique with a flat folksonomy and claimed that it is effective in retrieval.

    Yoo and Suh [127] proposed a prototype, User-Categorized Tag (UCTag), in the form of a document management system. In this system users can assign tags and specify their category as well. Thus, a structured folksonomy based on users’ consensus emerges, in which tags are included in different categories. However, the relationships in this hierarchy correspond only to the ‘has-a’ relationship type.

    Ding et al. [30] proposed upper tag ontology based on tagging behaviour. Mika et al. [78] added to the folksonomy the user’s aspect by introducing Actor-Concept-Instance model.

  • Heuristics Based Approaches -In [113], an approach based on heuristic rules and deep syntactic analysis is used for taxonomy construction. In the first step, tags are obtained from the tag clouds of domain folksonomy websites. The folksonomy tags play the role of the target domain taxonomy. The taxonomy is constructed without human intervention, based on heuristic rules and deep syntactic analysis. Heuristic rules traditionally yield relatively low recall but high precision, whereas deep syntactic analysis yields higher recall but lower precision. The two are combined by applying heuristic rule analysis first and then a concept–relationship acquisition algorithm to avoid the low recall. The challenge, however, is that heuristic patterns are rarely found in tags.

  • Other Aspects - Pirrone et al. [91] took text of wiki articles into analysis to derive relationships and concepts. They extracted relationships from both Wikipedia link structure and text. After information extraction, contents are structured in the form of ontology. They have used table of contents in the wiki pages for extracting semantic relations. In their proposed methodology, semantic sense extraction is done using table of content tree and text of the section. Sense of a section is extracted by comparing domain ontology with the table of contents. The link analysis is a good source of relating terms to each other.

    Recently, a considerable amount of research has addressed the integration of folksonomic and ontological approaches. Developing a hierarchical ontology on top of existing hierarchies like DMOZ gives better results than building hierarchies from scratch. However, when knowledge sources like WordNet, AWS, IMDB etc. are used, the resulting hierarchy is a binary tree and the clusters do not convey information or semantics about the concept of the child nodes [47]. Chen et al. [23] say that, since building an ontology from scratch is a cumbersome job for domain experts, a folksonomy is a very good knowledge source for building an ontology that also reflects collective intelligence.
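
Several of the statistical approaches in this section use entropy to separate subjective or abstract tags from objective or concrete ones. The sketch below shows one simple way such a score could be computed: the Shannon entropy of the distribution of tags that co-occur with a given tag. Which distribution the cited works actually use is not specified here, so the choice of co-occurring tags, the toy posts and the function name `tag_entropy` are assumptions made for illustration.

```python
import math
from collections import Counter


def tag_entropy(tag, posts):
    """Shannon entropy (in bits) of the tags co-occurring with `tag`.

    A tag that appears with many different tags, each about equally
    often, gets a high value; a tag that always appears with the same
    companion gets a value of zero.
    """
    counts = Counter(
        other
        for tags in posts
        if tag in tags
        for other in set(tags)
        if other != tag
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy posts: "art" is used in varied contexts, "sunset" in a single one.
posts = [
    ["art", "painting"],
    ["art", "music"],
    ["art", "sculpture"],
    ["art", "photo"],
    ["sunset", "sky"],
    ["sunset", "sky"],
]
print(tag_entropy("art", posts), tag_entropy("sunset", posts))
```

Under this reading, tags with higher entropy would be candidates for upper levels of a hierarchy, and tags with lower entropy for lower levels.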

3.5 Tag-pairs subsumption

Following [99], the subsumption relation between any two tags \( {tag}_x \) and \( {tag}_y \) can be defined as follows: \( {tag}_x \) subsumes \( {tag}_y \) means that wherever \( {tag}_y \) is used, \( {tag}_x \) can also be used without ambiguity. The subsumption relation between \( {tag}_x \) and \( {tag}_y \) is represented as given in Eq. 9.

$$ {tag}_x\to {tag}_y $$
(9)

Subsumption relation is directional, that is, \( {tag}_x\to {tag}_y \) does not mean \( {tag}_y\to {tag}_x \). But the subsumption has transitivity property, that is \( {tag}_x\to {tag}_y \) and \( {tag}_y\to {tag}_z \) means \( {tag}_x\to {tag}_z \). Subsumption relation is stricter than similarity metric. Now let’s have a look at the approaches.

  • Statistical/Mathematical Approaches – Han et al. [46] make use of an asymmetric hierarchical clustering algorithm to find tag subsumptions. They used tag co-occurrence to measure similarity among the tags within a cluster and dissimilarity among different clusters. The resulting hierarchy reflects the knowledge of the users. Mo et al. [85] used entropy to measure tag-pair subsumption relationships in Diigo and Del.icio.us.

    Si et al. [99] proposed TAG-TAG, TAG-WORD and TAG-REASON. The last two give weight to the content of the document to help the estimation. The results showed that the proposed methods performed better than similarity-based hierarchical clustering at extracting subsumption relations.

  • Hybrid Approaches- Lee et al. [63] proposed FolksoViz, a statistical model for deriving subsumption relationships based on the number of occurrences of each tag in Wikipedia texts, together with the TSD (Tag Sense Disambiguation) technique for mapping each tag to an equivalent Wikipedia text. The derived subsumption pairs are then visualized on screen. The experiment shows that FolksoViz manages to find the right subsumption pairs precisely.
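
The two properties of subsumption stated at the start of this section, directionality and transitivity, can be made concrete with a short sketch. Given a set of directly derived pairs (x, y), read as "x subsumes y", the function below computes every pair implied by transitivity. The example pairs and the function name `transitive_closure` are illustrative assumptions and do not come from any cited dataset.

```python
def transitive_closure(pairs):
    """Return all (x, y) such that x subsumes y directly or via a chain."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # x -> y and y -> z imply x -> z.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Hypothetical direct subsumption pairs.
direct = {("animal", "dog"), ("dog", "puppy")}
closure = transitive_closure(direct)
print(sorted(closure))
```

The pair ("animal", "puppy") is added by transitivity, while the reverse pair ("puppy", "animal") is never produced, which reflects the directional nature of the relation.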

3.6 Some other semantic relationships

  • Non Taxonomic relation discovery - Non-taxonomic refers to absence of hierarchy among the classes. Taxonomic relations such as subclass, superclass, is-a or has-a are lacking in non-taxonomic relations. For example ‘Polio affects children’. Classes will be ‘Polio’ and ‘children’ and the relation between them is ‘affects’. In general, two tasks have to be performed for non-taxonomic relationships. First is to find out which concepts are correlated. Secondly, it is required to dig out how these concepts are linked, so that the name can be given to the relationship [111]. Trabelsi et al. [111] worked on discovery of non-taxonomic relation in folksonomy. In their work triadic concepts have been used in order to find out and select related tags. External sources (Wikipedia, WordNet and Google) are utilized for tags filtering and non-taxonomic relationships discovery.

  • Bursty Tags- Yao et al. [123] identified bursty tags and events from the folksonomy tags. They make use of temporal information for burst detection. They extracted temporal tag graphs from the tag space by dividing tag space into time intervals based on tags time stamps. These temporal tag graphs are much smaller in size than the whole tag space and maintain only those tags and their correlations that have some bursty information. From these local tag graphs, they identified bursty tags and edges using a generative Gaussian distribution and Probabilistic model.

  • Time and Location Tags- Baba et al. [9] not only worked on finding the time and/or location related tags on Flickr, but also extracted the concepts related to a tag in a machine-understandable way. Another work in this direction, by Zhang et al. [132], computes the relationship among tags by analysing their distributions over time and space. In other words, their work finds tags with similar geographic and temporal patterns of use. Using a dataset obtained from Flickr, photo tags are clustered based on their geographic and temporal patterns.

  • Tagging Motivation/Self intention based- Strohmaier et al. [104] highlighted different tagging motivations and concluded that the motivation behind tagging affects the tagging behaviour of users (their selection of tags) in a folksonomy. On the basis of these motivations, users and tags can be categorized. Cantador et al. [19] proposed a classification of tags into purpose-oriented categories, namely context-based or content-based tags. By purpose-oriented they mean categorizing tags according to the intention behind them. The semantics of these categories of tags are retrieved from Wikipedia and WordNet, and the results show significant accuracy. Körner et al. [59] identified various quantitative measures (Tag/Resource Ratio, Orphaned Tag Ratio, Conditional Tag Entropy, Overlap Factor and Tag/Title Intersection Ratio) to distinguish Categorizer and Describer users based on their tagging behaviour. Categorizers use tags to categorize resources, while Describers use tags to describe resources. All of the measures work well but are not equally useful; among them, the Tag/Resource Ratio proves to be the best.
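
To make the Tag/Resource Ratio mentioned above more tangible, the sketch below computes it under the assumption that it is the size of a user’s tag vocabulary divided by the number of resources that user has annotated. This reading, the toy data and the function name `tag_resource_ratio` are assumptions for illustration; the exact definition should be taken from [59].

```python
def tag_resource_ratio(posts):
    """Distinct tags divided by distinct resources for ONE user.

    `posts` is an iterable of (resource_id, tags) pairs. A low value
    suggests a small, reused vocabulary (categorizer-like behaviour);
    a high value suggests many descriptive tags per resource
    (describer-like behaviour).
    """
    tags, resources = set(), set()
    for resource, post_tags in posts:
        resources.add(resource)
        tags.update(post_tags)
    return len(tags) / len(resources) if resources else 0.0


# Hypothetical users: one reuses two category tags, one describes freely.
categorizer = [("r1", ["work"]), ("r2", ["work"]), ("r3", ["fun"]), ("r4", ["fun"])]
describer = [
    ("r1", ["sunset", "beach", "orange"]),
    ("r2", ["cat", "sofa", "sleepy"]),
]
print(tag_resource_ratio(categorizer), tag_resource_ratio(describer))
```

A threshold separating the two user types is deliberately left out, since any cut-off value would have to be determined empirically on real tagging data.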

4 Protection of folksonomy structure

Instead of focusing just on the tag, resource and user association discovery, we also need to consider protection of valid relationships. By this consideration we mean to handle issue of spam tags and spam users. In this section, we are focussing on techniques covering this aspect, so that folksonomy maintains its correct structure with time.

Tagging systems are a quite easy and cheap target for spammers compared with spamming through online advertising, email systems and search engines. A user can add any content and generate spam annotations anonymously at no cost. A related issue is tag collision [70], where people either purposely or unintentionally use the same tags for equally valid yet unrelated content. One intention behind creating false associations among tags and resources is, for example, to assign popular tags to resources so that they rank higher in search results. Apparently, no one is harmed by spam tags on the web, but good web information resources become difficult to find among all the content.

Spam can be introduced at resource level, in the form of spam posts (incorrect Tag-To-Resource association) or through spam user accounts. Hayati et al. [49] presented a survey and evaluation of anti-spam methods in Web 2.0. They evaluated the methods based on whether they used a preventive strategy or a detective one. Authors of [31, 92] classified anti-spam techniques as Prevention, Detection and Demotion based. Spam detection/prevention approaches can also be classified on level basis that is user level or post level. Post level means that individual posts are marked as spam or otherwise, whereas user level means all or none of the posts of a user is marked as spam.

4.1 Spam posts/Tag spam

A spam post is an incorrect tag-to-resource association, i.e. misleading tags that are generated in order to boost the visibility of some resources or simply to confuse and mislead users. Let’s have a look at the approaches.

  • Statistical / Mathematical Approaches- Pan et al. [89] proposed combining the KNN algorithm with tag clustering to filter noisy tags, and thereby improved the accuracy of recommendations. The precision of this technique on the M-Eco and MovieLens datasets is 73.9 % and 87.1 % respectively, in comparison with the TagNeighbor with Clustering, TagNeighbor, Collaborative Filtering and Pure Tag techniques.

  • Folk Based Approaches- Many studies have presented algorithms based on static user data analysis to combat tag spam, but either they do not give a precise evaluation or the algorithms do not perform appreciably well. Liu et al. [69] make use of dynamic user behaviour data, on the notion that users’ behaviour in a social tagging system mirrors the quality of tags more precisely. By categorizing participants’ behaviours, tag-associated actions are extracted to estimate whether a tag is spam or not, and the proposed algorithm then filters tag spam out of the social search results. The reported results demonstrate that the method outperforms existing methods based on static data and successfully defends against tag spam in a variety of spam attacks.

    Zhai et al. [130] proposed a technique in which personalized experience is assigned by a user to other annotators using correlation. This results in a ranked list, according to his personalized experience with other annotators. For those annotators who don’t have common tags with other users, socially enhanced mechanism is used to link users by some references. For evaluation they compared efficiency of SpamClean model to the occurrence, coincidence and boolean model, on different threats like collusive, normal and tricky attacks. SpamClean effectively defends against spam tag.

    Koutrika et al. [60] assigned relevance numbers to web resources based on the number of common tags they share. This is a language-independent method. Krause et al. [61] identified spam at the post level, so that only malicious posts are blocked rather than all posts of a user. They outlined four categories of feature sets to tackle spam and evaluated them with machine-learning techniques.

  • Other Aspects- Yhang et al. [122] proposed a method based on a text mining approach that finds the relationships between web pages and also among tags. In the first step, web pages and their tags are clustered using the self-organizing map algorithm. A labelling process is then applied to the trained map to find the relationships between web pages and among tags. Spam tags can then be detected by looking at the semantic relatedness between a tag and its tagged web page.

    Zhai [129] proposed a spam-proof tagging system that leads to good-quality tag search. The proposed technique is based on four key factors: a demotion-based strategy, reputation, altruistic users and social networking. It upgrades or degrades the ranks of correct or incorrect content items in the search results by taking into account personalized user reliability degrees and responsible users, thus preventing clients from picking unwanted content.

4.2 Spam users /Social spam

In [57], the authors identified users who create semantic noise in the folksonomy. They showed experimentally that hyperactive taggers account for 40 % of all tagging actions. Hence, removing these users reduces semantic noise in the folksonomy. The techniques adopted for spam users are mostly tested at both user and post level. Let’s have a look at the approaches.

  • Hybrid Approaches- Markines et al. [75] examined different properties of spam in social tagging systems to differentiate spam users from legitimate ones. According to the authors, removing spam at the post level is most appropriate. Of the six features they used to identify spam, three are at the resource level, two at the post level and one identifies spam users.

    In the work of [106], scoring and semantic analysis of tags are used: the tag score alone achieves 95.0 % performance, which improves to 96.8 % when selective evaluation using the white tag and black tag concepts is applied. Tag scoring seems to be a powerful method for discriminating spammers, but when a spammer uses popular tags to pose as a legitimate user, detection becomes difficult. To deal with this drawback of the tag score, features based on semantic similarity are added. When the semantic attributes are combined with the tag features, precision increases from 96.8 % to 98.0 %. In experiments comparing feature performance at the post level and the user level, performance at the user level was slightly better.

    Poorgholami et al. [92] considered tags, resources, users and the relations among them and highlighted a set of features (tag spamicity, legal and illegal domain, coincidence and network features). They claimed that these features are effective in detecting spammers: the reliability of each feature is over 95 %, and that of their combination is 99 %. The features are fed to various machine learning algorithms to identify spammers, achieving 99 % accuracy.
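
Tag-based spam features of the kind mentioned above can be illustrated with a small sketch. Here the spamicity of a tag is assumed to be the fraction of labelled spam posts among all posts carrying that tag, and a user is scored by the mean spamicity of the tags they apply. These definitions, the default value of 0.5 for unseen tags, the toy training data and the function names are assumptions for this example and are not the definitions used in [92] or [106].

```python
from collections import defaultdict


def tag_spamicity(labelled_posts):
    """Map each tag to the fraction of its posts that are labelled spam.

    `labelled_posts` is an iterable of (user, tags, is_spam) triples.
    """
    spam, total = defaultdict(int), defaultdict(int)
    for _user, tags, is_spam in labelled_posts:
        for t in set(tags):
            total[t] += 1
            spam[t] += int(is_spam)
    return {t: spam[t] / total[t] for t in total}


def user_score(tags, spamicity, default=0.5):
    """Mean spamicity of a user's tags; unseen tags count as `default`."""
    tags = list(tags)
    if not tags:
        return default
    return sum(spamicity.get(t, default) for t in tags) / len(tags)


# Hypothetical labelled training posts.
training = [
    ("u1", ["python", "tutorial"], False),
    ("u2", ["cheap", "pills"], True),
    ("u3", ["cheap", "flights"], False),
    ("u4", ["pills", "casino"], True),
]
spamicity = tag_spamicity(training)
print(user_score(["pills", "casino"], spamicity))
print(user_score(["python", "cheap"], spamicity))
```

As the text points out, a score of this kind is easy to evade when a spammer mixes in popular legitimate tags, which is why the cited works combine it with semantic and network features.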

5 Enrich query and search results

Structured folksonomy enables elicitation of precise search results. In addition, mapping query keywords for disambiguation and semantic clarity, ranking search results, secondary tags and multilingualism also significantly improve the precision of search results. This section focuses on these aspects.

5.1 Mapping and ranking

Search engine results use only lexical information and web page importance on web to rank results. Folksonomies are difficult to navigate if tags are presented as long lists [101]. Now let’s have a look at the approaches.

  • Statistical/Mathematical Approaches- Chen et al. [24] argued that WordNet is too fine (many tags in folksonomy match to one sense not to all the senses available in the WordNet of a particular word) as well as too coarse (does not cover senses of a word in all domains) in defining granularity of word senses. Therefore, it is not fit for social tagging system. A technique based on non-negative matrix factorization (NMF) is proposed for automatic discovery of topic sense from tags and then used for tag disambiguation. The aim of the technique is to achieve precision in searching of resources.

    Xia et al. [117] proposed a technique that provides users with more specific keywords to replace or enhance the meaning of abstract tags in a query and thus make the search more precise. In the first step, an ontology in which concepts are categorized into three semantic levels (general, basic and specific) is built to detect abstract tags. In the second step, to confirm whether the tags selected in the first step are abstract and to identify specific tags, they use co-occurrence for the tag context and K-NN with Gaussian weighting for the image context of a tag. For the image context, the similarities of visual and textual features are combined, because the authors report that this gives an 8 % improvement in detecting abstract tags compared with using visual or textual similarity individually. In addition to the specific tags identified in the second step, all intermediate nodes between abstract and specific tags in the ontology developed in the first step are added, to provide a set of concrete tags for each abstract tag.

  • Knowledge/External Source Based Approaches- Mirizzi et al. [81, 84] presented a tool, Not Only Tag (footnote 6), that maps keywords to DBpedia resources and uses DBpedia’s ontological structure to enrich their meaning, showing the results in the form of a tag cloud. It ranks resources using a hybrid ranking algorithm. Resources are ranked based on their relevance to the query and to other connected nodes in the DBpedia graph, rather than by calculating the importance of each resource separately as the PageRank algorithm does.

    The DBpediaRanker algorithm computes relevance among DBpedia nodes by exploiting the link structure, title and abstract comparisons, by querying social bookmarking systems, and by considering web search engine results, as shown in Fig. 3 [82]. The algorithm achieves statistically significant improvements over the algorithm with which it was compared. The same authors presented LED [82] (Lookup Explore Discover) to provide exploratory search using their RDF ranker in DBpedia. They state that if users are helped by semantic tags, each user can save 10 min per month, which in aggregate would save 4.1 million working hours per year. Relevant resources are found by discovering them in the neighbourhood of a resource node.

    Fig. 3 DBpedia Ranker [82]

    Lin et al. [68] attempted to combine ontologies and folksonomies to improve search and navigation. Bindelli et al. [15] presented the TagOnto system, which maps a folksonomy to an ontology, giving users access to the folksonomy through search and navigation features that are characteristic of ontologies. Passant [90] attempted to combine weblogs and ontologies for better information retrieval by mapping folksonomy tags to a domain ontology; he used the SIOC ontology.

    Ronzano et al. [97] argued that characterizing web resources by the concepts they represent, rather than by keywords, may increase precision. They proposed Tagpedia, a general-domain encyclopedia of tags that provides web content descriptions through Wikipedia. It covers 84 % of the considered tags. They integrated this semantic resource into SemKey [74]. When a user selects a tag, he or she can further select the sense of the tag. A weak point of this approach is that Tagpedia does not cover non-conventional tags and defines no semantic relations among the Syntag sets. Furthermore, the same Syntag sets are not available in multiple languages, so search in Tagpedia has no support for multilingualism.

    Iijima et al. [50] proposed linked Flickr search, which integrates DBpedia, user preference data and folksonomy tags. When the user enters a query, the tag is looked up in DBpedia and all classes to which the tag belongs are returned. These classes or class instances are then ranked according to their weights from the user's search logs. Flickr images are then searched using the initial tag entered by the user and the DBpedia instance the user selects. The results were evaluated against Flickr Wrapper [35]; precision and recall were lower than those of Flickr Wrapper, but unexpectedness increased.

    Dellschaft et al. [28] presented sensible search, which queries the TAGora Sense Repository to obtain a list of senses for a normalized tag and assigns weights to the senses according to their importance. It retrieves the different senses using the DBpedia:hasDBpediaSenseInfo property. Mirizzi et al. [83] further presented Semantic Wonder Cloud, which supports exploratory search using the same hybrid approach described above and gave statistically significant results. They motivated exploratory search with O’Brien’s observation in an article [88]: “The Web, they say, is leaving the era of search and entering one of discovery. What’s the difference? Search is what you do when you’re looking for something. Discovery is when something wonderful that you didn’t know existed, or didn’t know how to ask for, finds you.”

    Choudhury et al. [25] semantically enriched the YouTube tag cloud by linking it with the Linked Data Cloud and by expanding and ranking the tag space. For semantic enrichment of the tag space, they used their own dataset to generate related videos based on temporal, textual, geospatial and social context, and then expanded the tag space further through tag co-occurrence analysis. However, they have not fully implemented the Tag-To-Concept mapping module. The quality of their tag enhancement and of their ranking was up to 80 % accurate. Similarly, evaluation of the tag enrichment process showed improved content understanding.

    Stampouli et al. [103] dealt with tag disambiguation and improved content retrieval quality in Flickr using a mashup. They showed through a case study that this system provides high retrieval quality. Figure 4 shows a graphical representation of the system.

    Fig. 4 Proposed frameworks for tag disambiguation by Stampouli et al. [103]

    Chandramouli et al. [22] presented Semantic Concept Mapping leading to Hypernym Discovery (SCMTHD) algorithms, which improved accuracy from 49 % (single-user environment) to 58 % (collaborative environment). They mapped tags to WordNet synsets to obtain semantic concepts, but did not consider the problem of ambiguity in this mapping. THD uses an online resource for hypernym discovery; they used Wikipedia to increase entity coverage.

  • Other Aspects- Cucerzan et al. [26] identified named entities and disambiguated them using Wikipedia data. The accuracy of identifying named entities within the text was 91.4 %. Xu et al. [118] proposed a technique for finding the temporal semantic context of a concept (associated words, context graph, associated concepts, context communities and example sentences) that can be used effectively for query suggestion, faceted search and trend analysis. They claimed that the proposed technique discovers semantic context automatically, in contrast to a manually generated context repository. The technique was tested for effectiveness and accuracy.
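The K-NN step with Gaussian weighting that this subsection attributes to Xia et al. [117] can be illustrated with a minimal sketch. It is not the authors' implementation: the feature vectors, the labels, the Euclidean distance and the `sigma` bandwidth are assumptions chosen for illustration, and the function name `gaussian_knn_vote` is hypothetical.

```python
import math
from collections import defaultdict


def gaussian_knn_vote(query, labelled_points, k=3, sigma=1.0):
    """Return the label with the largest Gaussian-weighted vote among the k nearest points.

    `labelled_points` is a list of (feature_tuple, label) pairs (illustrative data only).
    """
    # Rank labelled points by Euclidean distance to the query vector.
    by_distance = sorted(
        (math.dist(query, features), label) for features, label in labelled_points
    )
    votes = defaultdict(float)
    for distance, label in by_distance[:k]:
        # Closer neighbours count more: weight = exp(-d^2 / (2 * sigma^2)).
        votes[label] += math.exp(-(distance ** 2) / (2 * sigma ** 2))
    return max(votes, key=votes.get)


# Toy one-dimensional example: a plain majority vote over the three
# neighbours would pick "specific", but the single very close
# "abstract" neighbour dominates once distances are weighted.
points = [((0.0,), "abstract"), ((1.0,), "specific"), ((1.1,), "specific")]
label = gaussian_knn_vote((0.1,), points, k=3, sigma=0.2)
```

With a small `sigma`, distant neighbours contribute almost nothing, so the choice of bandwidth controls how far the method departs from an unweighted K-NN vote.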

5.2 Secondary tags

There is an information overload on the web. However, meaningful metadata can increase precision substantially. Searching based solely on user-generated tags is not efficient for a variety of reasons, of which three are the main ones. First, the number of tags assigned to a document usually ranges from 0 to 19, with a mode of just 2. Second, tags suffer from problems such as polysemy and synonymy. Third, in some cases tags may not represent true metadata: for example, if a user tags a resource with his or her opinion about it, such as 'interesting', the tag may be of no use to other users searching for that resource.

All this signifies the need for content-related metadata to improve the retrieval of resources. If such metadata can be generated as automatically extracted keywords and these keywords are used along with the user-created tags, search quality and precision improve. Let’s have a look at the approaches.

  • Knowledge/External Source Based Approaches- Awawdeh et al. [7] added metadata to user-generated tags in a folksonomy using the Yahoo Term Extraction API, generating keywords from the text of the original document. In their previous work [75], they compared different techniques for extracting terms from web documents: extracting the meta tags that describe the document, using the Yahoo term extraction service, and selecting the terms with the highest term frequency. Their experimental results showed that the Yahoo terms contributed the most to the search process. They presented the Enhanced Tag Set Engine, which combines Yahoo terms from the document with the user’s tag set.

    Faviki [79] combines tagging and Wikipedia by suggesting tags from Wikipedia concepts; the suggested tags must be names of Wikipedia articles. The semantic tags it provides are machine-interpretable. It uses the Zemanta [128] API for semantic tag suggestion. Zemanta is basically a blogging plug-in for Firefox that can suggest tags from Wikipedia and user content.

  • Other Aspects- Tan et al. [107] referred to papers showing that precision and recall improve when semantic data are added to XML documents. They marked up Wikipedia articles in XML form. Their approach uses semantically tagged documents to detect concepts in wiki articles using Wikipedia categories, infoboxes and link structure. Precision and recall measures for the three sources show that the infobox parameter name is a good source for describing the information in terms of both precision and recall. A negative point is that the tag names are not implicit. Among the 18 types of relations in WordNet, the hypernym/hyponym relation can be used to find words that are more general or more specific than a given word, and thus to explore secondary tags that will increase precision.
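The idea of exploring secondary tags through the hypernym/hyponym relation can be sketched as follows. To stay self-contained, the sketch does not query WordNet; the small `HYPERNYM_OF` table and the function names are hypothetical stand-ins for a real lexical resource.

```python
# Hypothetical toy hierarchy: each word maps to its direct hypernym
# (the more general term). A real system would read this from WordNet.
HYPERNYM_OF = {
    "poodle": "dog",
    "beagle": "dog",
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
}


def hypernyms(word):
    """Return the chain of increasingly general terms above `word`."""
    chain = []
    while word in HYPERNYM_OF:
        word = HYPERNYM_OF[word]
        chain.append(word)
    return chain


def hyponyms(word):
    """Return the direct, more specific terms below `word`, sorted alphabetically."""
    return sorted(child for child, parent in HYPERNYM_OF.items() if parent == word)


def secondary_tags(tag):
    """Suggest candidate secondary tags: direct hyponyms, then the direct hypernym if any."""
    candidates = hyponyms(tag)
    if tag in HYPERNYM_OF:
        candidates.append(HYPERNYM_OF[tag])
    return candidates
```

For the tag "dog", this toy table yields the more specific candidates "beagle" and "poodle" and the more general candidate "mammal"; which of them to attach to a resource remains a design decision of the tagging system.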

5.3 Multilingualism support

Translating tags into different languages and using them for search makes it possible to retrieve unexpected information that cannot be found using only one language. Let’s have a look at the approaches.

  • Knowledge/External Source Based Approaches- Wikipedia gives extensive linguistic coverage [98]. Based on this fact, Gobbo [42] presented Flickrpedia, which uses Wikipedia's support for multilingualism. The emphasis is on improving serendipity regardless of the natural language. As a result, highly unexpected and relevant photos were retrieved. However, there was no support for sense disambiguation. Among the applications reviewed in [18], Faviki [79] supports multilingualism by translating tags into different languages.

  • Folk based Approaches- Jung et al. [53] support information retrieval based on multilingual tags coming from users by relating the linguistic practices of different folks. They translate tags into other languages using the Google AJAX Language API to support search for multilingual resources.
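The translation-based approaches above share one mechanism: a query tag is expanded with its translations before matching resources. The following minimal sketch illustrates this under stated assumptions: the `TRANSLATIONS` table is a hypothetical stand-in for an external translation service (no real API is called), and the resource identifiers and function names are invented for illustration.

```python
# Hypothetical translation table standing in for an external translation service.
TRANSLATIONS = {
    "dog": {"fr": "chien", "de": "hund", "es": "perro"},
    "cat": {"fr": "chat", "de": "katze", "es": "gato"},
}


def expand_tag(tag, languages):
    """Return the tag followed by its known translations in the requested languages."""
    expanded = [tag]
    for lang in languages:
        translated = TRANSLATIONS.get(tag, {}).get(lang)
        if translated and translated not in expanded:
            expanded.append(translated)
    return expanded


def search(resources, tag, languages):
    """Return ids of resources tagged with the tag or any of its translations."""
    wanted = set(expand_tag(tag, languages))
    return [rid for rid, tags in resources.items() if wanted & set(tags)]


# Illustrative resources: "p2" is tagged only in French and would be
# missed by a search restricted to the English tag.
photos = {"p1": ["dog", "park"], "p2": ["chien"], "p3": ["cat"]}
hits = search(photos, "dog", ["fr", "de"])
```

Tags with no entry in the table are simply searched as given, so the expansion never removes results that a monolingual search would have returned.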

6 Summary

The research efforts presented in this paper are summarized in Table 1 and Table 2. Table 1 summarizes the feature set. Table 2 views the techniques discussed in the paper from the perspective of the features they support. These features include Folksonomic/Non-folksonomic, Non-conventional tag coverage, Multilingualism, Disambiguation, Temporal, and Hierarchical clustering. Tagpedia, for example, does not cover non-conventional tags, defines no relationships among syntag sets, and does not make the same syntag sets available in multiple languages. DMOZ supports hierarchical clustering. Some of the methods discussed in the paper are not folksonomy based (Non-folksonomic), but in our opinion they can be used efficiently in a social tagging model.

Table 1 Feature Set Summary
Table 2 Feature Support Comparison

7 Conclusion

Folksonomy provides a low cognitive cost system to support classification, but because of its flat structure it suffers from low search precision. This paper reviews the different approaches for incorporating semantics into folksonomies with the objective of improving precision in search and navigation. We have categorized these approaches and summarized their feature set support. Our concluding remarks follow.

Statistical approaches help to cover vocabulary that is not present in lexical resources. However, if we compare the precision ratios of knowledge source based and statistical approaches, knowledge source based approaches perform better in disambiguation. Also, hybrid approaches that use features from both methods have relatively higher precision than purely statistical approaches.

Formal classification systems like ontologies are very good in precision but can be built only for limited domains and by a limited number of people, namely experts. Moreover, the objects to be classified in these domains are limited in number as well. Building one huge ontology from scratch that covers all domains of web resources, and updating it regularly, is a challenge. As far as domain ontologies are concerned, consensus is difficult to reach because they are made by knowledge experts and do not reflect the consensus of common users.

On the current web, with the continuous exponential increase in the amount of content, such a classification system will not be a viable solution; it would need to keep evolving to cover emerging trends and vocabularies. Folksonomies are a user-driven, non-formal way to categorize data and generate metadata, while ontologies are the formal way to provide metadata for annotations. Their integration can give very high precision. Hence, a fresh investigation into integrating the folksonomic and ontological approaches can give better precision but may suffer from complexity. Typical rigid taxonomies cannot tackle the challenge posed by a fast-evolving information space with the continuous emergence of new vocabularies and trends. There may be many terms that do not necessarily fit into a fixed set of categories. For the hierarchical arrangement of tags, external knowledge source based approaches are again better with respect to both precision and vocabulary coverage.

Lastly, bringing semantically enriched structure into folksonomies, using semantics to clean folksonomies by removing spam posts and spam users, and addressing other aspects such as multilingualism, secondary tags and search query enhancement further improve the precision of search results.