
1 Motivation and Challenges

With the advances in computer technologies and the evolution of social networks, there has been an explosion in the amount and complexity of digital media being generated, stored, transmitted and accessed through the Internet. Much of this information is multimedia in nature, including digital images, video, audio, graphics and textual data. Large-scale social media repositories enable users to creatively share thoughts with a much wider audience; as a consequence, every online user has taken on the role of a broadcaster. In the effort to be heard, there is increasing interest in associating these media items with free-text annotations. The disadvantages of manual textual annotation, and in particular of tagging, have been studied over the years; the three main problems are (1) manual labour, (2) differences in the interpretation of the media items and (3) inconsistency of the keyword assignments among tags. Consequently, there is currently great interest in developing techniques that can take advantage of the characteristics that set Internet multimedia apart from multimedia in more conventional environments in order to generate effective and useful annotations.

To tackle these problems, a large body of recent research has focused on automatically generating reliable and useful tags for multimedia content on the Internet. Such systems usually rely on textual or low-level features, as well as predefined knowledge for particular domains. One aim of this chapter is therefore to survey the state of the art in this emerging field and to address the growing interest in automatic tagging of Internet multimedia. In particular, the survey concentrates on mechanisms capable of exploiting the full range of information available online to predict user tags automatically, with a specific focus on technologies related to query expansion, the exploitation of complementary resources and visual-based approaches.

Despite the large amount of research on multimedia tagging in social network repositories, the tagging of online multimedia resources is particularly challenging because these resources are not bound to any particular domain. This makes users’ requirements for tagging and indexing both too general and too specific. On the one hand, it would be ideal to have a system that ‘works for everything’, but the universal context is very broad while the usable resources are limited, so tagging in a general context is very difficult and often intractable. On the other hand, systems designed for a specific area can exploit rich domain knowledge, but they are restricted to that domain and may not be useful outside it. The challenge, therefore, is how to derive rich and correct tags in a general context using limited metadata, in a way that can also be easily adapted to more specific applications.

Addressing this challenge, in this chapter we also present a framework that aims at predicting user tags of online videos from the associated textual metadata. Despite significant research developments in the area of semantic tagging, most of these techniques are bound to a priori knowledge of their domains. Since Internet videos are by nature not bound to any particular domain, we consider textual metadata a more reliable source of information that does not require training based on a priori knowledge. To extend the limited information available in the textual metadata, the framework exploits complementary resources such as Wikipedia and WordNet in order to extract more semantically meaningful tags from a largely textual resource. The proposed framework has been tested in a social network tagging scenario using Flickr videos and images. A very important feature of the framework is that it relies only on existing features associated with the multimedia content and on general complementary resources available to anyone through the Internet. Without relying on domain-specific knowledge, it can be used for general purposes; if a specific application is required, the framework is flexible enough to be adapted to the domain of concern using the complementary context available in that domain.

Based on the survey of related research and on our experiments with the proposed framework, at the end of this chapter we also identify some potential research directions towards future user tag-prediction systems, focusing on their capability of handling large-scale social network media repositories.

2 Related Research in Social Multimedia Tagging

Nowadays, large-scale online multimedia repositories have become available through various Web 2.0 applications, such as Flickr, Wikipedia, YouTube, Facebook, Second Life and Twitter, providing access to a tremendous amount of multimedia data that is mostly created by users. For example, Flickr provided access to over five billion images by September 2010, with over 3,000 uploads to the website every minute. YouTube stored 400 million videos by 2010, with around 20 hours of video being uploaded every minute. The number of images on Facebook exceeded 60 billion by the end of 2010, with around 138 MB of new content uploaded every minute. This user-uploaded and user-generated audio-visual content belongs to the established concept of user-generated content (UGC). UGC includes all kinds of data that regular people voluntarily contribute, whether data, information or media, that then appears before others in a useful or entertaining way. All digital media technologies can be related to UGC, such as question-answer databases, digital video, blogging, podcasting, forums, review sites, social networking, mobile phone photography and wikis.

Among all kinds of user-generated data, digital audio-visual content is certainly the one receiving the most public interest, and the one generating the most technological challenges. For example, automatic tagging and search for multimedia content has been a tremendous challenge, particularly in uncontrolled environments such as UGC applications. Collaborative tagging has been a typical and promising approach for tagging user-generated multimedia content [37]. This kind of approach enables a process in which users add and share tags on shared items. Collaborative tagging is an organisational method whose most important contribution is the concept of folksonomy, which will be further elaborated in Sect. 2.2. Still, it faces some serious limitations that restrict its usability, such as unstructured tags, tag validation, spam detection and removal, and redundancy and subjectivity in tags.

In this section, we present a survey of technologies related to multimedia content tagging in large-scale online repositories. First, an overview of related work on multimedia tagging in general is presented. Then, the survey focuses on specific topics in social media tagging, including approaches using query expansion, folksonomies, complementary resources, visual analysis techniques and other related work.

2.1 Multimedia Tagging

Indexing and retrieval of multimedia content in large-scale online repositories has become an increasingly active field. Annotation and tagging have been recognised as essential mechanisms for the effective organisation and sharing of large-scale multimedia information. However, manual annotation of large multimedia datasets is extremely labour-intensive and time-consuming, so efficient automatic tagging methods are highly desirable. This interdisciplinary research direction has attracted wide attention and resulted in many algorithmic and methodological developments. There has been a significant amount of research on automatic video indexing based on textual and visual analysis [5, 10, 12, 16, 23].

In general, such approaches for automatic labelling or tagging can be classified into two types, ‘open-set tagging’ and ‘closed-set tagging’ [21]. Approaches of the first type ‘extract’ appropriate labels for items from the words or phrases already associated with the item content or metadata; in this case, the tags to be assigned are not known in advance. In comparison, approaches of the second type ‘assign’ tags from a known set of labels to multimedia content. The tagging problem can then be posed as a classification problem to be solved either using a series of binary classifiers, one for each tag, or a multi-class classifier [8]. Another approach to closed-set tagging relies on multimedia search and retrieval systems for assigning tags to items, where each tag is treated as a query [16]. In this approach, conventional query expansion methods from information retrieval can be used to expand the tags into appropriately enriched queries. Such approaches often apply a threshold to the list of retrieved multimedia items and assign the queried tag to all items above the threshold.
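
To make the retrieval-based variant concrete, the following minimal sketch treats each tag in a closed vocabulary as a query and assigns it to every item scoring above a threshold. The `score` function is a placeholder for whatever retrieval model is used; all names here are illustrative rather than taken from the cited systems.

```python
from typing import Callable, Dict, List

def closed_set_tag(items: List[str],
                   vocabulary: List[str],
                   score: Callable[[str, str], float],
                   threshold: float) -> Dict[str, List[str]]:
    """Treat each tag in a closed vocabulary as a query: every item whose
    retrieval score for that tag exceeds the threshold receives the tag."""
    assignments: Dict[str, List[str]] = {item: [] for item in items}
    for tag in vocabulary:
        # Any retrieval model (BM25 over metadata, a visual concept
        # detector, ...) can stand in for `score`.
        ranked = sorted(((score(tag, item), item) for item in items),
                        reverse=True)
        for s, item in ranked:
            if s < threshold:
                break  # scores are sorted; the rest fall below the cut-off
            assignments[item].append(tag)
    return assignments
```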

In [77], the authors tested three different techniques, namely, language modelling, query expansion and maximum entropy, for tagging videos based solely on the video abstracts. Another approach to video tagging based only on the associated metadata is discussed in [28]. In [29], tags are predicted for bookmarked URLs using page text, anchor text, linked websites and tags of other URLs. In [56], different sources of information have been successfully integrated in factorisation models to predict the tags that a user will assign to an item. A very important group of research employs query expansion, which is reviewed in the following two subsections. Our proposed framework shows that using other metadata resources and complementary information improves the quality of the assigned tags.

2.2 Query Expansion and Folksonomy

The associated textual information in social networks is identified as a rich source of information for extracting high-level semantics for collaborative tagging systems. However, in order to effectively index these media items, the free text description needs to be analysed, and corresponding tags with semantic meaning should be extracted.

Most research in this field has so far focused on nonstatistical approaches, particularly on the lexico-syntactic patterns (Hearst patterns) first introduced in [27]. While purely statistical approaches such as latent semantic indexing (LSI) are prevalent in other fields of natural language processing, until recently they were only suitable for discovering symmetrical relations between words. The closest task to hypernym discovery mentioned in the seminal textbook on statistical natural language processing [46] is unsupervised disambiguation, in which k meanings of a term are determined automatically. This approach, however, has the limitation that a meaning is represented not by a single word (term) but by a context. Recent research [6] introduced one of the first statistical methods for hypernym discovery, utilising principal component analysis (PCA) to discover term taxonomies (hierarchies of hypernyms). The algorithm presented here is closest to the research of Cimiano et al. [13], who use lexico-syntactic patterns, also codified in a JAPE transducer grammar. The focus is however different, as their Text2Onto framework tries to learn the whole ontology, while the work presented here tries to discover only hypernyms for the given query.

Query expansion is probably the most typical application of hypernym (taxonomy) discovery. It is a method for improving the recall, and possibly the precision, of information retrieval by expanding the query with other terms related to the original query; these terms are usually weighted. Query expansion has not been found to provide significant objective improvement, although it is perceived positively by users [52, 60]. Generally, query expansion comprises two basic steps: expanding the initial query with new words and re-weighting the terms of the expanded query. Currently, five query expansion techniques have been extensively applied, namely, query expansion based on global document analysis [17, 78], on local analysis [42, 76], on query log analysis [36, 79], on association rules [18, 83] and on complementary semantic resources [25, 54]. Xu et al. [42] proposed a local context analysis method, which selects expansion terms based on co-occurrence with the query terms in the top-ranked documents (a simplified version is sketched below). The method produces more effective and robust query expansion than traditional global and local techniques; however, its main drawback is that it may lead to the addition of irrelevant terms. In global analysis methods, new terms are added to the original query before searching; such methods need external resources such as thesauri and WordNet [78]. Cui [15] proposed a query expansion model based on user logs, in which a probabilistic method applied to mined user logs is used to optimise the query. Some researchers have also worked on ontology-based expansion, but their approaches have been static [84]. To improve on this, the authors in [49] propose an approach called dynamic document analysis, which combines thesaurus analysis with the analysis of dynamic documents.
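
As an illustration of the local-analysis idea, the sketch below performs a simplified pseudo-relevance-feedback expansion: terms that co-occur with the query terms in the top-ranked documents are appended to the query. This is a toy version of the technique in [42], not its actual implementation.

```python
from collections import Counter
from typing import List

def expand_query(query_terms: List[str],
                 top_docs: List[List[str]],
                 n_expansion: int = 5) -> List[str]:
    """Add the terms that co-occur most often with the query terms in the
    top-ranked documents (each document is given as a list of tokens)."""
    query = set(query_terms)
    cooccurrence: Counter = Counter()
    for doc in top_docs:
        if query & set(doc):  # the document mentions at least one query term
            cooccurrence.update(t for t in doc if t not in query)
    expansion = [t for t, _ in cooccurrence.most_common(n_expansion)]
    return query_terms + expansion
```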

Social networks and social resource sharing systems use a lightweight knowledge representation called folksonomy. The term ‘folksonomy’, first proposed by Thomas Vander Wal in a mailing list [3], is a combination of ‘folk’ and ‘taxonomy’ coined to describe the social classification phenomenon. Folksonomies provide user-created metadata rather than professionally or author-created metadata. As discussed in [47], the tags that constitute the core of a folksonomy can be seen as good keywords conveying the topics of the respective web pages from various aspects. Al-Khalifa and Davis [2] analysed the semantic value of social tags and concluded that folksonomy tags are semantically richer than keywords extracted using a major search engine’s extraction service. Wu et al. [80] explored machine-understandable semantics from social annotations in a statistical way and applied the derived emergent semantics to discover and search shared web bookmarks. In [31], the authors proposed Adapted PageRank and FolkRank to find communities within a folksonomy. Bao et al. [4] proposed to measure the similarity and popularity of web pages from the web users’ perspective by calculating SocialSimRank and SocialPageRank. In [82], a personalised search framework utilising folksonomies has been proposed.

2.3 Query Expansion Using Complementary Resources

A gold standard dataset for training and testing hypernym discovery algorithms is WordNet (e.g. [24, 64]). WordNet is a lexical database developed at Princeton University to model the lexical knowledge of a native speaker of English [20]. Sets of synonymous terms called synsets constitute its basic organisation. Several types of relations between synsets are recorded in WordNet, including hypernymy/hyponymy (the is-a relation) and meronymy/holonymy (the part-of relation). In addition, each synset has a gloss that defines it. WordNet is one of the most important lexical semantic resources in information retrieval. To address the shortcomings of traditional query expansion methods, which choose terms similar to the query terms based on some criterion, a query expansion method based on concepts was proposed in [55], in which terms sharing a common sense are chosen as candidate terms for expansion. To improve this approach, WordNet has been used to expand queries using well-defined synonyms [73]; in that work, however, query terms were deemed independent of each other and only synonyms were selected as candidate expansion terms. In other work, Smeaton [57] tried to perform query expansion using various strategies for weighting expansion terms, along with manual and automatic word sense disambiguation techniques, but this did not improve retrieval performance. Hoeber manually constructed a concept network from which terms are selected to perform conceptual query expansion [43]; the performance of this method depends highly on the quality of the concept network. In contrast, Liu et al. [30] proposed automatically generating expanded query terms from WordNet: once the concepts of the original query terms are determined, their synonyms, hyponyms and the like are taken as expansion terms (a minimal sketch of this idea is given below). In their work, however, the queries to be expanded are confined to noun phrases, and the main drawback of the technique is that it does not take term relationships into consideration. In [84], word sense disambiguation is used to recover the sense of a word in the given query context. Based on the extracted concepts, similar terms from the corresponding synset are extracted from WordNet; by combining the newly chosen terms, a candidate set of expanded queries is generated, from which the final expanded queries are selected.
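
The following sketch illustrates WordNet-based expansion in the spirit of [30], using NLTK’s WordNet interface: synonyms and direct hyponyms of a noun query term are collected as candidate expansion terms. The word sense disambiguation step discussed in the cited works is deliberately omitted.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def wordnet_expand(term: str, max_terms: int = 10) -> list:
    """Collect synonyms and direct hyponyms of a noun term as
    candidate query expansion terms."""
    candidates = []
    for synset in wn.synsets(term, pos=wn.NOUN):
        candidates.extend(l.replace('_', ' ') for l in synset.lemma_names())
        for hyponym in synset.hyponyms():  # more specific concepts
            candidates.extend(l.replace('_', ' ')
                              for l in hyponym.lemma_names())
    seen, result = {term}, []
    for t in candidates:  # de-duplicate, keep order, drop the original term
        if t not in seen:
            seen.add(t)
            result.append(t)
    return result[:max_terms]

# e.g. wordnet_expand('dog') -> ['domestic dog', 'Canis familiaris', ...]
```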

Although WordNet contains general knowledge from a wide range of fields, it is difficult to instantly add new knowledge, particularly proper nouns, to such general ontologies. Wikipedia has therefore been used as a useful corpus for knowledge extraction, being a free, large-scale online encyclopedia that continues to be actively developed. Wikipedia presents a much larger data resource for named entity extraction, covering people, places, organisations and events, to name a few. There have been many attempts to combine web search with Wikipedia article titles and hyperlinks for the extraction of instances of arbitrary relations [7]. In [66], the authors used the Wikipedia category system for ontology learning. Kliegr et al. [34] found the first section of Wikipedia articles particularly suitable for hypernym discovery and used it as the sole source of information. However, making judgements about the semantic relatedness of different terms in Wikipedia articles is still a deceptively complex task. Any attempt to compute semantic relatedness automatically must also consult external sources of knowledge. Some techniques use statistical analysis of large corpora, while others use hand-crafted lexical structures such as taxonomies and thesauri. In either case, the background knowledge is the limiting factor, being limited in scope and scalability. These limitations are the motivation behind several new techniques that infer semantic relatedness from the structure and content of Wikipedia. Strube and Ponzetto [65] were the first to compute measures of semantic relatedness using Wikipedia; their approach, ‘WikiRelate’, took familiar techniques that had previously been applied to WordNet and modified them to suit Wikipedia. In other work, the authors achieved extremely accurate results with ESA, a technique somewhat reminiscent of the vector space model widely used in information retrieval [22]: instead of comparing vectors of term weights to evaluate the similarity between queries and documents, they compare weighted vectors of the Wikipedia articles related to each term. A different approach uses Wikipedia’s hyperlink structure to define relatedness [48]. It offers a measure that is both cheaper and more accurate than ESA: cheaper, because Wikipedia’s extensive textual content can largely be ignored, and more accurate, because it is more closely tied to the manually defined semantics of the resource.

2.4 Tagging Using Visual Analysis Approaches

Content-based tagging and search for multimedia content has been an important approach in parallel to approaches based on textual features; in this subsection, we give an overview of the important works in this direction. In the state-of-the-art research, many automatic tagging methods use visual content analysis together with text features in order to predict tag assignments. These visual-based approaches borrow many concepts and techniques from the content-based image retrieval field, a comprehensive survey of which can be found in [62].

One of the first approaches to tagging using visual analysis was based on machine translation [19]. The rationale was to annotate image regions with words: the regions into which an image was segmented were categorised using a taxonomy of region types, and an EM-based learning approach was then used to map region types to keywords, thus captioning the image.

Latent space models (namely, latent semantic analysis and probabilistic latent semantic analysis) were applied to image annotation for discovering the links between visual features and words in an unsupervised fashion, propagating tags from the most similar images in the latent space [51].

The work by Li and Wang [38] introduced a fully automatic and high-speed system for annotating online pictures called ALIPR (Automatic Linguistic Indexing of Pictures – Real Time). It was based on the use of generative models for learning the joint distributions of visual features and vocabulary subsets, thus characterising each image by a statistical distribution. By exploiting statistical relationships between images and words, tagging could be conducted in real time without the need to recognise individual objects in the images.

According to [44], the need for training data in most tagging approaches limits their performance and scalability. This is one of the motivations for the dual cross-media relevance model for automatic image tagging proposed by Liu et al., which estimates the joint probability by the expectation over words in a predefined lexicon. To do so, the model considers two types of relations in image annotation: word-to-image relations and word-to-word relations, which are estimated using search techniques on Web data as well as the available training data.

In [1], visual features were mapped to semantic categories by designing a dedicated feature space for each image category. To that end, a two-layer ensemble learning system called Supervised Annotation by Descriptor Ensemble (SADE) was proposed. In a nutshell, multiple low-level visual descriptors are first extracted from the image, each of which is separately fed into a learning machine in the first layer; a meta-layer classifier is then trained on the output of the first-layer classifiers, and images are annotated using the decision of the meta-layer classifier.

The analysis of visual contents is coupled with the exploitation of collaboratively annotated image databases in [41]. The proposed approach applied two techniques based on image analysis: an SVM classifier annotated images with a controlled vocabulary, while a tag propagation module exploited user-generated, folksonomic annotations from Flickr, thus being able to deal with an unlimited vocabulary.

It is a commonplace that the tags associated with images in social media repositories are a valuable source of information for superior multimedia retrieval experiences [67]. For this reason, it is necessary to evaluate the descriptive power (or relevance) of user-generated tags; however, users tag images with uncontrolled, often personalised and ambiguous terms. This is the motivation behind the work of Sun and Bhowmick [67], who proposed a measure called Normalized Image Tag Clarity (NITC) – a version of the clarity score proposed for query performance prediction in classic information retrieval – for evaluating the descriptiveness of a tag with respect to the visual contents of the image it is attached to. To that end, images are represented using a bag-of-visual-words scheme, which allows a collection language model to be built, upon which the NITC measure is computed.
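
A minimal sketch of the underlying clarity idea follows, assuming the classic formulation from query performance prediction: the KL divergence between the visual-word distribution of the images carrying a tag and that of the whole collection. The actual NITC measure additionally normalises this score, which is omitted here.

```python
import math
from collections import Counter
from typing import List

def tag_clarity(tagged_images: List[Counter],
                collection: List[Counter],
                mu: float = 1.0) -> float:
    """KL divergence between the visual-word language model of the images
    carrying a tag and the collection model. Each image is a bag of
    visual words (a Counter mapping visual word -> count)."""
    tag_counts, coll_counts = Counter(), Counter()
    for img in tagged_images:
        tag_counts.update(img)
    for img in collection:
        coll_counts.update(img)
    tag_total = sum(tag_counts.values())
    coll_total = sum(coll_counts.values())
    kl = 0.0
    for w, c in coll_counts.items():
        p_coll = c / coll_total
        # Dirichlet-smooth the tag model with the collection model
        p_tag = (tag_counts[w] + mu * p_coll) / (tag_total + mu)
        kl += p_tag * math.log(p_tag / p_coll, 2)
    return kl  # higher values indicate a more visually descriptive tag
```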

Focusing also on the tag relevance evaluation problem, Li et al. proposed a scalable algorithm for computing tag relevance values from visually similar neighbours [39] (the neighbour-voting idea is sketched below). In a subsequent work, Li et al. [40] used an extended version of their previous work for automatic image tagging. Broadly speaking, the proposal consisted in annotating an untagged image with the most relevant tags attached to its visual neighbours, retrieved from a large user-tagged image database. However, the validity of this approach suffered from the unreliability and sparsity of user tagging, so a joint-modality tag relevance estimation method based on textual and visual clues was introduced to mitigate these effects.
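
The following is a toy version of the neighbour-voting idea of [39]: each of an image’s k nearest visual neighbours votes for its tags, and the votes expected from a tag’s prior frequency in the collection are subtracted. The data layout and names are illustrative.

```python
from collections import Counter
from typing import Dict, List, Set

def tag_relevance(neighbour_tags: List[Set[str]],
                  prior: Dict[str, float]) -> Counter:
    """Score each tag by its vote count among the k visual neighbours,
    minus the count expected from the tag's collection-wide frequency."""
    k = len(neighbour_tags)
    votes: Counter = Counter()
    for tags in neighbour_tags:
        votes.update(tags)
    relevance: Counter = Counter()
    for tag, count in votes.items():
        relevance[tag] = count - k * prior.get(tag, 0.0)
    return relevance

# The highest-scoring tags are kept as annotations:
# tag_relevance(knn_tags, prior).most_common(5)
```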

This idea of exploiting the nearest neighbours for annotating an untagged image was also explored in [26]. The proposed model (called TagProp), though, was based on a discriminatively trained nearest neighbour model in which neighbours were weighted according to their rank. TagProp included a word-specific sigmoidal modulation of the weighted neighbour tag predictions to boost the recall of rare words. Moreover, it allowed several visual similarity metrics to be combined in order to consider local and global aspects of image contents simultaneously.

The power of groups of images uploaded to online repositories like Flickr was exploited by Ulges et al. in [72]. Their approach was based on the realistic assumption that Flickr users group their pictures into batches (e.g. all snapshots taken over the same holiday trip) and that the images within a batch are likely to have a common tagging style. Therefore, these batches are matched with categories learned from Flickr groups, and leveraged for accurate context-specific image annotation.

A problem related to image tagging is tag recommendation, which tries to avoid both the noise inherent to user tags and semantic noise. In [81], a multimodal tag recommendation algorithm was introduced. There, tag recommendation was posed as a learning problem that was tackled using tag and visual correlations. Each modality was used to generate a ranking feature, and the optimal combination of ranking features from the different modalities was learnt by means of the RankBoost algorithm.

Another related problem is the creation of visual tag dictionaries, which was the goal of Wang et al. [75]. The main idea is to describe textual tags by means of visual words drawn from a bag-of-visual-words representation of images. With the proposed method, the visual tag dictionary is built in a fully automatic manner by harnessing tagged images available online. Once the dictionary is created, a connection between textual tags and visual words is established, which can be exploited for image annotation.

The tagging of online video resources has also attracted attention from researchers in recent years, where at least two main trends coexist. The first is based on annotating the video using concept detectors that describe objects, locations or activities appearing in it [63]. In order to alleviate the problem of the limited availability of large-scale collections of annotated videos for training tagging systems, the work by Ulges et al. [71] proposes training concept detectors on videos available in online repositories such as YouTube. This allows existing user tags to be exploited, besides scaling concept detection up to thousands of concepts with no need for manual labour at all.

An alternative strategy for video tagging is based on exploiting the redundancy of video content [58, 61]. The underlying rationale is that a large number of videos on YouTube have overlapping or duplicated content; this can be harnessed to obtain useful information about connections between videos, which are revealed by means of robust content-based video analysis techniques, allowing new tag assignments to be generated using tag propagation methods.

2.5 Other Related Research

Another interesting field of multimedia tagging is music annotation. Indeed, songs can be tagged with highly semantic concepts related to their mood, usage or instrumental contents, among others, which are of interest for building music recommendation systems and large-scale music discovery engines.

In [70], Turnbull et al. presented a computer audition system capable of annotating novel audio tracks with semantically meaningful words. They posed the problem as a supervised multiclass, multilabel problem in which the joint probability of acoustic features and words was modelled. Using a dataset of human-generated annotations describing popular music tracks, a Gaussian mixture model was trained over an acoustic feature space for each word in the vocabulary (see the sketch below), obtaining music annotations comparable with the performance of humans on the same task.
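
The sketch below illustrates this per-word mixture modelling with scikit-learn, under the simplifying assumption that each vocabulary word is fitted independently on the acoustic feature frames of the tracks annotated with it; the cited system’s actual training procedure is more involved.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_word_models(word_to_frames, n_components=4):
    """Fit one Gaussian mixture over acoustic feature frames per word.
    word_to_frames maps a word to an (n_frames, n_dims) feature array."""
    models = {}
    for word, frames in word_to_frames.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag', random_state=0)
        gmm.fit(frames)
        models[word] = gmm
    return models

def annotate(track_frames: np.ndarray, models, top_k: int = 5):
    """Annotate a track with the words whose mixtures assign the highest
    average per-frame log-likelihood to its features."""
    scores = {w: m.score(track_frames) for w, m in models.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```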

More recently, a larger dataset comprising 10,870 annotated songs was collected in order to develop a novel music tagging system [68]. The novelty of this approach was that it considered both genre tags and ‘acoustically objective’ tags, the main feature of the latter being that they can be consistently applied to songs by expert musicologists. Another interesting aspect of this work was the analysis of the tagging performance of two novel content-based audio features related to timbre and mid-level acoustic parameters.

However, obtaining accurate and reliable tags for annotating multimedia resources is a great challenge: harnessing user tags of publicly available videos and images may lead to unreliable results, whereas manual annotation is generally more accurate but expensive. For this reason, some researchers have devised collaborative strategies for motivating users to manually annotate multimedia resources, particularly by means of gaming.

One of the earliest attempts to do so in the image field was the work by von Ahn and Dabbish [74]. Their motivation was to take advantage of people’s desire to be entertained in order to make them do work that computers cannot yet do well enough, owing to the shortcomings of computer vision techniques. The proposed game, called ESP, encouraged players to tag a given image with the same strings (i.e. a ‘think like each other’ type of game), as the strings two players agree on turn out to be good labels for the image. The authors estimated that if the game were played as much as popular online games, most images on the Web could be labelled in a few months.

More recently, a new approach to gaming-based image annotation was proposed in [59]. Its main feature is that it takes into account the social aspects of human-based computation, targeting what millions of individual gamers are enthusiastic about: enjoying themselves within a socially competitive environment. This goal was achieved by focusing the system on the social aspects of the gaming environment, which involved a widely distributed network of human players. Furthermore, the proposed framework integrated a number of algorithms commonly found in image processing and game-theoretic approaches to obtain accurate labels. As a result, the framework was able to assign accurate tags to images, besides being able to detect and eliminate annotations made by cheating players.

A less gaming-oriented approach is the one presented by Moehrmann et al. [50], which introduces an image labelling interface based on self-organising maps (SOM) to optimise usability.

As for the manual tagging of music based on gaming, a parallel road has been followed. For instance, Mandel and Ellis [45] designed a web-based game to collect descriptions of musical excerpts. Their goal was to make this task fun and easy for users, besides obtaining useful and objective tags. They apply the same idea as in [74], as the goal of players is to describe song clips using the same tags as other participants.

Another example of game-based music tagging is an online multiplayer game called Listen Game, aimed at measuring the semantic relationship between music and words [69]. The game has two playing modes: in the normal mode, the player is prompted to select the best and worst words (describing semantic music concepts such as instruments, emotions, song usages and genres) to describe a song. In the freestyle mode, the player is asked to suggest a new word that describes the music, receiving feedback on other players’ answers.

3 Predicting Tags Using Semantic Expansion and Visual Analysis

In this section, we present a framework for predicting user tags by jointly exploiting the associated textual metadata, the expanded query terms and their complementary resources, as well as the visual features embedded in the content. The visual features employed in the proposed system are the MPEG-7 colour layout and edge histogram descriptors [32].

The proposed framework consists of two stages. The first stage is tag preprocessing, in which each tag from the list of all tags is processed and further expanded if needed; the algorithmic workflow is presented in Fig. 1. As tags may contain any keyword the author considers relevant, it was important to contextualise them. To this end, the preprocessing framework categorises tags into two general categories: (1) common tags and (2) named entity tags. Common tags are those which correspond to an action or a country or, as depicted in the figure, have a synset associated with them in WordNet. Named entity tags, on the other hand, are those which do not have a WordNet synset and depend on external resources for contextualisation. The objective of this preprocessing is to ensure that named entity tags are disambiguated enough to enable semantic similarity matching; a minimal sketch of this categorisation follows the figure.

Fig. 1 Overview of the tag preprocessing phase
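
The following sketch of the categorisation step uses NLTK’s WordNet interface to test for the existence of a synset; the function name is illustrative.

```python
from nltk.corpus import wordnet as wn

def categorise_tag(tag: str) -> str:
    """Tags with a WordNet synset are treated as common tags; the rest
    are treated as named entities to be resolved via external resources
    such as Wikipedia."""
    if wn.synsets(tag.replace(' ', '_')):
        return 'common'        # e.g. 'football', 'travel'
    return 'named_entity'      # e.g. 'Old Trafford' -> look up Wikipedia
```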

An overview of the second stage of processing is presented in Fig. 2. As we considered the metadata (i.e. video title, video description and automatic speech recognition (ASR) transcripts) to be of value in determining the nature of tags, we first processed the metadata with the GATE NLP framework. The pipeline includes a tokeniser, a sentence splitter and a part-of-speech (POS) tagger. In addition to these basic text components, we also included a gazetteer in order to identify entity names in the text based on lists of predefined words. For the extraction of additional semantic information, we included the Java Annotation Pattern Engine (JAPE) to extract hypernyms from Wikipedia. Finally, we also included the OpenCalais plugin for the extraction of named entities from the textual metadata.

Fig. 2 Overview of the proposed system

One of the significant contributions of this framework is the integration of the Bag-of-Articles (BOA) algorithm as an extension to the GATE NLP tools. Briefly, the module locates a Wikipedia article for the unlabelled entity through the MediaWiki API. The measure used to determine an article’s relevance to the tag combines text relevance with article popularity [34]. From the selected article, a JAPE implementation of Hearst patterns was used to extract a hypernym (a simplified sketch is given below). This hypernym was then looked up in WordNet, thus establishing a link between the entity and a WordNet synset.
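
A simplified, regex-based sketch of this step is shown below; the JAPE grammar used in the actual framework is considerably richer. The hypernym is taken as the head noun of the defining noun phrase in the article’s first sentence.

```python
import re

# A single Hearst-style definition pattern: 'X is a/an/the Y ...'
DEFINITION = re.compile(
    r'\bis\s+(?:a|an|the)\s+([\w\- ]+?)(?:\s+(?:in|of|at|for|that|which)|[.,;]|$)',
    re.IGNORECASE)

def extract_hypernym(first_sentence: str):
    """Return a candidate hypernym: the head noun of the defining noun
    phrase in a Wikipedia article's first sentence, or None."""
    match = DEFINITION.search(first_sentence)
    if not match:
        return None
    return match.group(1).strip().split()[-1]

# extract_hypernym('Old Trafford is a football stadium in Greater '
#                  'Manchester, England.')  ->  'stadium'
```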

3.1 Wikipedia as the Source of Knowledge

WordNet has a structured nature, and its general coverage makes it a good choice for general disambiguation tasks. The work presented here, however, focuses on specialised domains, which makes the use of WordNet less appealing: most existing lexical resources, including WordNet, will have difficulty finding hypernyms for specialised search queries such as the name of a footballer or a football arena. In experiments with automatically learned rather than hand-crafted lexico-syntactic patterns [64], using the TREC dataset and Wikipedia as the training corpus gave a significant improvement over the best WordNet classifier (F-measure from 0.2339 to 0.3592).

Our previous work relied on the WordNet thesaurus [53], but it turned out not to be exhaustive enough, and we decided to search for another source of information. Wikipedia proved convenient, as we needed a closed corpus of texts in which the duplication of articles describing a distinct semantic category of a given word is minimal. In this regard, the general web cannot serve as a good source, whereas Wikipedia tries to cover most semantic meanings using only a limited number of pages (usually only one page). We therefore found the first section of Wikipedia articles particularly suitable for hypernym discovery and use it as the sole source of information.

3.2 Bag-of-Articles Classifier

As previously mentioned, Wikipedia presents a much larger data resource than WordNet for named entity extraction, covering people, places, organisations and events, to name a few. In order to exploit Wikipedia resources, the BOA classifier has been developed. BOA is an extension of the well-known bag-of-words (BOW) approach [33]. The input to the BOA classifier is the classified entity, represented as a noun chunk, and a set of class entities, each represented by a Wikipedia page title. For an unlabelled entity, the BOA classifier locates articles in Wikipedia that might define the entity and selects one of them using a disambiguation function. Subsequently, it uses link analysis to identify related articles falling into the same semantic category, and then creates a BOA term-weight vector by aggregating their BOW vectors. The class is assigned by choosing the closest class entity, also a BOA term-weight vector, using cosine similarity or another suitable metric.

Formally, the input of a BOA classifier is a set of t labelled instances (titles of Wikipedia articles) C and a set of u unlabelled instances (noun phrases) E. Wikipedia article titles provide an unambiguous mapping between a labelled instance and a Wikipedia article. We use the symbol W to denote the collection of all pages in Wikipedia at a given time. Each article is described by its title, term-weight vector, outbound links, the list of categories it belongs to and its type (article page, disambiguation page, category page, etc.). The BOA representation, as proposed here, does not process Wikipedia infoboxes.

For an unlabelled instance \(e_x \in E\), it is first necessary to determine the articles that may define its various senses. The ranking function \(\rho\) maps it onto the vector of its \(n\) possible senses \(s_x = \rho(e_x, W) = \langle s_{x,1}, \ldots, s_{x,l}, \ldots, s_{x,n}\rangle\). The senses – titles of Wikipedia article pages – are sorted in the vector in decreasing order of relevance, with sense \(l\) of an unlabelled instance represented by the article title \(s_{x,l}\). Since there are multiple senses for the unlabelled instance, a disambiguation function \(\delta\) is needed. In the base scenario, we use the disambiguation function \(\delta_{\mathit{mfs}}\), which assigns the most frequent sense:

$$\delta_{\mathit{mfs}}(s_x) = s_{x,1}.$$
(1)

Now, both a disambiguated unlabelled instance and a labelled instance are Wikipedia article titles and can be mapped to Wikipedia articles. In the following, we will use the variable \(a\) to refer to the Wikipedia article to which an instance (labelled or unlabelled) is mapped. The bag of articles \(\beta(a)\) is constructed by aggregating related articles across the set of modalities \(M\) with the help of the modality membership function \(\mu\), the article term-weighting function \(\tau\) and the recursive term-weight aggregation function \(\theta\).

3.2.1 Modality Membership μ

Modality membership function \(\mu(a, a_r) \mapsto \{0, 1\}\) expresses whether article \(a_r\) is considered related to \(a\) (\(\mu = 1\)) or not (\(\mu = 0\)). Several modality membership functions are suggested below; article \(a_r\) is evaluated as related to \(a\) if

  • \(\mu_{\mathit{outlink}}(a, a_r) = 1\) iff \(a\) links to \(a_r\).

  • \(\mu_{\mathit{backlink}}(a, a_r) = 1\) iff \(a_r\) links to \(a\).

  • \(\mu_{\mathit{related\ outlink}}(a, a_r) = 1\) iff \(a\) links to \(a_r\) and there is an article \(a_c\) linking to both \(a\) and \(a_r\), with \(a_r \neq a \neq a_c\).

  • \(\mu_{\mathit{backlinking\ outlink\text{-}firstpara}}(a, a_r) = 1\) iff \(a\) links to \(a_r\), \(a_r\) links to \(a\) and the link from \(a\) to \(a_r\) is contained in the first paragraph of \(a\).

  • \(\mu_{\mathit{shared\ category\ outlink}}(a, a_r) = 1\) iff \(a\) links to \(a_r\) and \(a\) and \(a_r\) share the same category.

Other modality membership function definitions are also possible, and several have in fact been suggested in the literature, albeit under different names; this applies, for example, to \(\mu_{\mathit{backlinking\ outlink\text{-}firstpara}}\) [14] and \(\mu_{\mathit{related\ outlink}}\), which is used in the Lucene-search Mediawiki extension (refer to Sect. 3.3). We use the symbol \({A}_{\mu_m}^{a}\) to denote the set of all articles \(a_r\) that are related to \(a\) with respect to modality membership function \(\mu_m\):

$$A_{\mu_m}^{a} = \{a_r \mid a_r \in W,\ \mu_m(a, a_r) = 1\}.$$
(2)

The bag of articles might contain articles related according to multiple modalities.

3.2.2 Article Term-Weighting τ

The weight function \(\tau(a) \mapsto \mathbb{R}^n\) represents the article \(a\) as a vector of term weights. The parameter \(w_{m,d}\) is the weight assigned to term vectors \(\tau(a)\) in modality \(m\) at depth \(d\). The term-weight functions considered are

  • Term frequency (TF)

  • Term frequency – inverse document frequency (TF-IDF) computed over entire Wikipedia

  • Term frequency – inverse document frequency computed over articles included in bag of articles of labelled instances C

  • Term frequency with first-paragraph boost

Other term-weight function definitions can be also considered.

3.2.3 Recursive Term-Weight Aggregation θ

The function \(\theta_m(a, d, \mathit{maxd}_m) \mapsto \mathbb{R}^n\) recursively aggregates the term-weight vectors of articles related to \(a\) according to the modality membership function \(\mu_m\):

$$\theta_m(a, d, \mathit{maxd}_m) = \begin{cases} \sum_{a_r \in A_{\mu_m}^{a}} \left[ w_{m,d}\, \tau(a_r) + \theta_m(a_r, d+1, \mathit{maxd}_m) \right] & \text{if } d < \mathit{maxd}_m \\ 0 & \text{if } d = \mathit{maxd}_m. \end{cases}$$
(3)

3.2.4 Bag of Articles β

Function \(\beta(a) \mapsto \mathbb{R}^n\) creates the bag of articles for article \(a\):

$$\beta(a) = \tau(a) + \sum_{m \in M} \theta_m(a, 1, \mathit{maxd}_m).$$
(4)

The formula aggregates the term-weight vector of article \(a\) with the term-weight vectors of the articles recursively related to it up to level \(\mathit{maxd}_m\), \(\mathit{maxd}_m \in \mathbb{N}\). The articles directly related to it have level 1.

The classification is done by comparing the BOA vector of the unlabelled instance \(\beta(a_x)\) with the BOA term vectors of the labelled instances \(\beta(a_c)\) using the similarity metric \(\mathit{sim}\) and selecting the class with the highest similarity:

$$\mathit{BOAclass}(a_x) = \mathop{\arg\max}\limits_{c}\ \mathit{sim}(\beta(a_x), \beta(a_c)).$$
(5)

A BOA classifier implementation needs to make decisions regarding the selection of the ranking function \(\rho\), the modality membership functions \(\mu_m\), the term-weighting function \(\tau\) and the BOA similarity function \(\mathit{sim}\). The weights \(w_{m,d}\) and the maximum depth \(\mathit{maxd}_m\) for gathering related pages in modality \(m\) are set externally. Except for the function \(\mathit{sim}\), all these settings are made separately for labelled and unlabelled instances.
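
The sketch below implements the aggregation and classification steps of (2)–(5) for a single modality, storing term-weight vectors as Python Counters; the `related` and `tau` callbacks stand in for the modality membership function and the article term-weighting function.

```python
import math
from collections import Counter
from typing import Callable, Dict, Set

TermVector = Counter  # term -> weight

def aggregate(article: str, depth: int, max_depth: int,
              weight: Dict[int, float],
              related: Callable[[str], Set[str]],
              tau: Callable[[str], TermVector]) -> TermVector:
    """Recursive term-weight aggregation theta (Eq. 3): sum the weighted
    term vectors of related articles, recursing up to max_depth."""
    result: TermVector = Counter()
    if depth >= max_depth:
        return result
    for rel in related(article):  # the set A^a_mu of Eq. 2
        for term, w in tau(rel).items():
            result[term] += weight[depth] * w
        result.update(aggregate(rel, depth + 1, max_depth,
                                weight, related, tau))
    return result

def bag_of_articles(article: str, weight, related, tau,
                    max_depth: int = 2) -> TermVector:
    """beta(a) (Eq. 4): the article's own term vector plus the recursively
    aggregated vectors of its related articles (one modality only)."""
    boa = Counter(tau(article))
    boa.update(aggregate(article, 1, max_depth, weight, related, tau))
    return boa

def cosine(u: TermVector, v: TermVector) -> float:
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def boa_classify(unlabelled: TermVector,
                 labelled: Dict[str, TermVector]) -> str:
    """BOAclass (Eq. 5): the class whose BOA vector is most similar."""
    return max(labelled, key=lambda c: cosine(unlabelled, labelled[c]))
```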

3.3 Implementation of BOA Classifier

This section describes an experimental implementation of the BOA-based classification system. As the ranking function \(\rho\), the implementation uses a composite metric combining text-based similarity between the noun chunk and the article text with article popularity, measured by the number of backlinks. As the modality membership function \(\mu_m\), there is currently one option, outlinks; an implementation of backlinks is in progress. For the term-weighting function \(\tau\), there is TF and TF-IDF support. As the BOA similarity metric \(\mathit{sim}\), the implementation uses cosine similarity.

A BOA classifier requires a Wikipedia index containing the following pieces of information about each article:

  • Term vectors with term frequencies

  • Outlinks

  • Popularity ranking (for most frequent sense relevance ranking)

Given the current size of the English Wikipedia and the fact that it is constantly updated, meeting these data acquisition requirements requires a considerable engineering effort and, in fact, a reimplementation of existing software, as these functions are for the most part performed by the existing Lucene-search Mediawiki extension. This Lucene-based Mediawiki search engine indexes the Mediawiki article database and creates five Lucene indexes: the main index, the links index, the related index, the headlines index and the spellcheck index. For the BOA classifier, the main index, containing term vectors, and the links index, containing links leading out of each article, are the most important. The extension provides two additional vital functions for the BOA classifier: parsing of wikitext and, prospectively, the ability to perform incremental updates.

The main wiki index contains the following important fields: title; key, with a numeric article identifier; contents, in which the term vectors are saved; category, which stores the article’s categories; and related, which stores the titles of articles determined to be related during indexing. The wiki.links index contains the following fields: article key, containing the concatenated article title; article PageID, with a unique numeric identifier that binds the entry to the main index key field; links, with the list of article titles to which the article links (the index differentiates between different types of links, such as article and image links, using a namespace prefix); redirect, containing the title of the article to which the current article redirects; and rank, containing the number of backlinking articles. In the BOA classifier implementation, these indexes are exploited as follows.

Indexed Wikipedia articles are stored in the wiki.main index; however, the Lucene-search extension does not store term vectors, so for the purposes of the BOA classifier it was necessary to modify the extension with code for storing them.

The outlinks of an article can be obtained from the links field of the article’s entry in the wiki.links index.

The Lucene-search extension contains a search engine, which uses sophisticated relevance ranking involving the number of backlinks. The BOA implementation uses the first-ranked article as the MFS baseline.

The Lucene Mediawiki indexer as used in the BOA classifier system contains several code changes, the most notable of which is the extension of the index with stored term vectors. The term-vector computations are done with a sparse matrix toolkit Java library.

3.4 WordNet-Based Classification

To expand known entities using WordNet, we perform similarity matching by constructing a TF-IDF matrix and computing the Lin similarity between the WordNet synsets representing an entity and each of the target tags. The Lin similarity measure has a sound theoretical foundation stated in the similarity theorem [9] and is defined as

$$\mathit{sim}_L(c_1, c_2) = \frac{2 \log p(\mathit{lso}(c_1, c_2))}{\log p(c_1) + \log p(c_2)}$$
(6)

The function \(\mathit{lso}\) returns the lowest common subsumer from the hierarchy, and the value \(-\log p(c)\) is called the information content (IC). The value \(p(c)\) denotes the probability of encountering an instance of concept \(c\), estimated from frequencies in a large corpus. More details of the method can be found in [11].
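
A minimal sketch of this computation using NLTK, whose WordNet interface implements Lin similarity with a corpus-based information content file (here the Brown corpus IC shipped with NLTK; `nltk.download('wordnet_ic')` is required):

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # information content, Brown corpus

def lin_similarity(word1: str, word2: str) -> float:
    """Maximum Lin similarity (Eq. 6) over the noun synsets of two words."""
    best = 0.0
    for s1 in wn.synsets(word1, pos=wn.NOUN):
        for s2 in wn.synsets(word2, pos=wn.NOUN):
            best = max(best, s1.lin_similarity(s2, brown_ic))
    return best

# lin_similarity('car', 'automobile') -> 1.0 (identical synset)
```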

3.5 Filename-Based Classification

The filename-based approach exploits the human reasoning behind naming video files, turning user naming behaviour into a predictor of user tags. A video file name carries intrinsic semantic information, in particular when multiple file names start with, or share a major portion of, a common string. The approach is based on a filename-based classifier for which the development set from the MediaEval 2010 dataset was used as the training set; the classifier was built with the Weka machine learning library.

3.6 Experiments and Evaluation

In this section, we present the methodology adopted for evaluating the proposed framework on a user tagging task.

The evaluation consists of two parts, namely, ‘closed-set annotation’ and ‘open-set annotation’. The objective of closed-set annotation is to predict user tags only from a provided list of tags, although it should be noted that there are no restrictions on the data domain. In open-set annotation, by contrast, there are no restrictions on the list of tags that can be associated with the media items.

3.6.1 Closed-Set Annotation

For the closed-set annotation, the evaluation was treated as a retrieval problem, and using the TRECVID evaluation tool we obtained the MAP measure for the predicted tags. Although the dataset contained 1,727 videos, we extracted tags for only 1,671 of them, owing either to the absence of a title and/or description or to the absence of named entities in these textual resources. In summary, the proposed framework achieved 30 % MAP over all 1,727 videos and 43 % MAP over the 1,671 videos for which we found any tags; the filename-based approach alone accounted for 17 % MAP of the correctly detected tags. Overall, our framework performed best among all participants who submitted results to the MediaEval 2010 Tagging Task, with the DCU team achieving a MAP of 0.16 and the TUD team a MAP of 0.27. More details about the approaches proposed by the other teams can be found in [35]. These results are summarised in Table 1.

Table 1 Closed-set annotation results in MAP

3.6.2 Open-Set Annotation

We were the only team participating in the MediaEval 2010 open-set annotation task. In order to provide a fair evaluation of the open-set annotation, we randomly selected 40 videos and had seven annotators manually label the tags associated with each video as ‘relevant’ or ‘irrelevant’. As a measure of relevance, we considered the inter-annotator agreement [28] among three or more annotators. A total of 296 tags were generated for the 40 videos considered in the evaluation; 35.8 % of the generated tags were considered irrelevant and 20 % relevant by all annotators. With agreement among at least three annotators, 47.3 % of the generated tags were considered relevant; with agreement among at least four, the percentage drops to 37.5 %. For the total dataset of 1,727 videos, we obtained 6,095 unique tags. These results are presented in Fig. 3.

Fig. 3 Open-set annotation results

In summary, the performance analysis of the closed-set annotation results shows the benefit of exploiting complementary textual resources such as Wikipedia and WordNet, and of considering filenames as another strong tag predictor. The proposed framework also proved successful on the open-set annotation, with almost 40 % of the generated tags being considered relevant by four out of seven manual annotators.

4 Future Research Directions

One of the most relevant future research directions in the use of visual analysis for tagging is the exploitation of online multimedia repositories as substitutes for hard-to-collect training datasets. Although this is already a reality in image and video tagging applications, a boost in performance could be achieved if the group and hypergroup structures of sites like Flickr or YouTube were explored [72]. In the area of music annotation, however, this issue remains a challenge.

Another promising direction is the integration of multiple annotation techniques under a single framework. An interesting idea is the combination of tagging models with different scalabilities, so that good performance can be obtained regardless of dataset size [72]. In a similar vein, tagging approaches could be extended by taking into account the relationships between different resources, such as videos, pictures or text found on different sites, which may help extract additional information for improving tagging accuracy [61].

Moreover, a very interesting direction for future research, especially in the music annotation field, is the construction of user-specific models that help reduce the influence of subjectivity, thus making it possible to model each user’s concept of audio semantics [70].

Another relevant issue is the analysis and generation of so-called deep tags, i.e. tags linked to a small part of a larger media resource, such as a segment of a video [61], a region of an image or a passage of a song.