1 Introduction

Due to the advances in digital technology and the success of the World Wide Web as a platform for sharing media (e.g., using Flickr or YouTube), people are recording and sharing every moment of their lives. As a consequence, more experiences are preserved, but it also becomes more difficult to later access this information to remember a relevant past moment. These difficulties arise mainly because personal pictures are, most of the time, collected in a disorganized way without any type of annotation [9].

Personal memories are recalled by humans through the episodic memory [34], which is related to the memory of events, times, places and other knowledge about a past experience. Due to the diversity of the clues provided by our memory, tools to search for personal media should provide ways to express these different types of information (query system), and the collection should be annotated [18] with these clues to support efficient retrieval.

The manual annotation of images with keywords describing their content is one way to efficiently provide media annotation. However, humans tend to avoid this operation [9, 36] because it is a time-consuming task for large collections. The alternative is to develop automatic techniques to perform image annotation. The majority of the automatic methods are based on semantic models that are estimated using visual content [6]. This is one way to avoid the semantic gap problem [20], but as stated in the TRECVID 2006 overview [27] some difficulties remain.

Currently, most digital cameras have a built-in microphone. They save the camera parameters in the EXIF (Exchangeable Image File Format) header of the JPEG (Joint Photographic Experts Group) file at capture time, and some of them have integrated GPS (Global Positioning System) receivers to register the image location. It is expected that in the future more sensors will be included in capture devices to record more information at capture time, in a similar way to the SenseCam. Therefore, image annotation methods should explore this contextual information in order to better understand the real world and consequently reduce the sensory gap [33].

This paper presents a semi-automatic image annotation method based on annotations provided automatically using semantic concepts and on user interventions to correct them through a computer game. Our strategy addresses the semantic gap with the feedback given by the users and with a semantic model that maps low-level information into high-level concepts. The proposed method also addresses the sensory gap because images are analyzed by exploring their visual content, the contextual metadata (time and location) and the audio information obtained at capture time.

The next section presents the related work and the following one gives an overview of the system. Section 4 describes the multimodal image retrieval method. The semi-automatic application for image tagging is described in Section 5. The paper ends by describing the tests conducted to evaluate our proposal and with the conclusions and directions for future work.

2 Related work

The image retrieval method proposed is based on semantic concepts that can be used to directly create a ranked list, given a query, or to annotate images with semantic meaning. We start this section by presenting and discussing several ways for image annotation. The term “automatic annotation” is used to describe the methods based on semantic concepts. The section ends with a description of related work about automatic and semi-automatic methods which are the categories where our proposals fit.

Different approaches have been proposed in order to annotate pictures with keywords describing their content. We classify them using the following categories: (1) manual annotation, (2) collaborative annotation, (3) annotation with recognized words using ASR (Automatic Speech Recognition) tools, (4) annotation using an entertainment application, (5) semi-automatic and (6) automatic annotation. Table 1 summarizes the main characteristics of these categories.

Table 1 Main features for several annotation techniques: human effort, efficiency, input provided by the user and information used by the system

Manual annotation can be provided by choosing labels from a pre-defined set [31] or by typing words and associating them with an image (e.g., Picasa or Adobe Photoshop Album). As mentioned before, one of the weaknesses of this method is the human effort needed to annotate large collections of pictures [9, 36]. Manual annotation provided in a collaborative way (as happens with Flickr) is more efficient. Although these annotations may contain noise due to errors introduced by some users, several users annotating the same image contribute to a richer annotation set (requiring less human effort from each participant). Annotations obtained by recognizing words from audio files [30], recorded when the user speaks about their photos using a microphone (Table 1), require less user intervention. The problem lies in the recognition errors, which can be frustrating to the user.

The previous methods demand some human effort, but they are the most efficient. To provide fully automatic image annotation, systems need to extract information from the visual content or to use the camera metadata obtained at capture time and recorded in the EXIF header of the JPEG image file. This metadata provides useful information to retrieve pictures, but to retrieve images with more complex information (e.g., people or buildings), the visual content must also be considered. The information used is either the visual features automatically extracted or the semantic models estimated from these features [6, 20]. Nevertheless, automatic image annotation is not as accurate as the manual process [27].

Semi-automatic methods attempt to solve the problem by including the user in the process [36]. They increase the annotation efficiency, but they also increase the human effort when compared with automatic annotation. The human effort needed by an application to perform one annotation plays an important role when the user is included in the task. In [37], time models are proposed for two manual annotation approaches to quantify the human effort.

Another option is to turn the annotation process into an enjoyable task. In [35], this problem was addressed by replacing the manual annotation process with a computer game for online content. The human effort is the same but is spent in a fun way. Since the annotation is manual, high performance is guaranteed.

2.1 Automatic methods

During the last years, several approaches have been proposed to automatically annotate images based on semantic models. Generative models [2, 21], machine translation models [7, 19], Bayesian networks [25], latent space models [23], hierarchical boosting algorithms [8] and agent-based frameworks [3] are some of the techniques used for automatic image annotation. An early approach for automatic image annotation was proposed in [24], where Mori et al. applied a co-occurrence model to words and low-level features extracted from rectangular image regions obtained using a regular grid. More recently, new approaches [4, 5, 16] in the domain of personal photos that integrate content and context metadata have been proposed. In [4], an approach is proposed based on probabilistic graphical models to recognize landmarks and people with location information and visual features. Temporal data combined with visual information is used in [5] to detect faces. In [16], the metadata obtained at capture time (time, exposure time, subject distance and flash fired) is used together with visual content for scene classification. This work has similarities with our proposal, but the authors use SVMs for image classification and a different method, based on Hidden Markov models, to include the metadata. Additionally, GPS data and audio information are also included in our approach.

2.2 Semi-automatic methods

Several approaches [17, 22, 36] were proposed that use the relevance feedback mechanism for image annotation with keywords. Lu et al. created a semantic network with a set of words having weighted links to a set of images. These connections are based on the user feedback. They adapt Rocchio's formula to incorporate the low-level features and the high-level semantic feedback. In [17], a system was proposed that is closer to our proposal, but it is based on SVMs and uses a search application. Our proposal uses an entertainment application, which was inspired by an idea proposed in [35]. The computer game in that proposal is played by two unrelated players over the web. Whenever both players type the same keyword for the same image, they win points, given that words that come from different people are more robust and descriptive than words typed individually. Our proposal is different because it uses machine learning techniques to improve the image annotation task.

3 System overview

This section presents an overall description of the proposed system. The system is composed of (see Fig. 1) a multimedia retrieval system, image search applications and a semi-automatic image annotation application.

Fig. 1 System architecture

The image retrieval method is based on image semantic analysis that uses multimodal information automatically extracted from images. Visual features, audio information and contextual metadata are explored to train semantic concepts that can be used directly for retrieval purposes by the image search applications (see Fig. 1) or to annotate pictures with keywords in the semi-automatic application.

The image search applications are used to annotate images with audio information and contextual metadata (location and time) and to search pictures based on the proposed image retrieval system. Users can retrieve images using a mobile device (phone or PDA) when visiting a point of interest (see Fig. 2), or using the Web before or after the visit.

Fig. 2 Image search applications: Memoria mobile

Memoria Mobile (see Fig. 2) allows the retrieval of images of previous experiences during the visit, and Memoria Web enables the retrieval of those experiences before or after the visit. Both interfaces include a map of the place, which can be used to define geographic queries and can help to guide the visit.

Memoria Mobile [13, 14] is an application to capture, share and access personal memories composed of pictures when visiting sites of interest. This application provides automatic image annotation at capture time using audio information and GPS data. Memoria Web [38] is an application developed to virtually visit the place and to provide more annotations manually. Both applications use the image retrieval system to access the memories.

The semantic models (image retrieval) are also included in a framework for semi-automatic image annotation. This platform combines the automatic method based on semantic concepts with manual annotation through an application. The framework can include any type of application that follows a set of requirements. Our proposal instantiates the application block of the framework with a gesture-based image annotation game.

The remainder of the paper describes the two darker blocks of Fig. 1, the multimedia information retrieval system based on semantic concepts and the application for semi-automatic image annotation.

4 Multimedia information retrieval system

This section describes the methods to retrieve and annotate images based on semantic models trained with multimodal information. These methods correspond to three components (see Fig. 3):

  • Image annotation and retrieval—application of the semantic analysis;

  • Multimedia semantic analysis—estimation of the semantic models based on features extracted from images;

  • Feature extraction—extraction of content and context information from images.

Fig. 3 Block diagram of the proposed methods for annotation and retrieval

The first block applies the semantic models, represented by p(w|I), to retrieve and to annotate images. Image retrieval is performed through semantic queries and picture annotation is obtained by associating semantic concepts to images as a way to describe their content.

In the multimedia semantic analysis block, the models are trained using visual and audio information and contextual metadata obtained at capture time. The method uses binary classification and a sigmoid function. The output of this block is the set of probabilities p(w|I).

The techniques used to automatically extract this information from images are applied in the third block. It is assumed that each database document is composed of an image, an audio file and contextual metadata obtained at capture time.

The following subsections provide more details about these components. First, we describe our proposal for semantic image analysis. Then, we explain how this method is used for image retrieval. The section ends with the methods used for feature extraction.

4.1 Semantic analysis

The objective of the semantic analysis is to train a set of semantic models in order to obtain the probabilities p(w|I) (see Fig. 3). To achieve this goal, the method described in this section uses visual and temporal information. Figure 3 shows that audio and spatial information is also extracted from images, but it is only used in the retrieval task as explained in Section 4.2. Concerning the audio information, the semantic analysis is performed using ASR (Automatic Speech Recognition) tools.

The proposed semantic description method is based on a combination of individual detectors. It uses a binary classifier to detect the presence or absence of a concept in an image and a sigmoid function to normalize the classifier output. After this, the temporal correlation between sequential images is analyzed to correct errors of the classification process.

4.1.1 Visual information

We use the Regularized Least Squares Classifier (RLSC) [29] as a binary classifier to detect a concept in an image. The output of the classifier can be used for image annotation; however, this measure is not normalized and is therefore not suitable for combining different features or several concepts. The output of the classifier must be converted to a probability. We adapt the method proposed in [28] to the RLSC.

Assuming w is a Bernoulli random variable where the outcome can be one of two concepts (e.g., “Indoor”/“Outdoor”, “Beach”/“No Beach” or “People”/“No People”), the probability p(w|x) can be obtained using the output of the classifier f(x) and a sigmoid function [28],

$$ p(w|x)=\frac{1}{1+e^{-Af(x)+B}}, \label{equation11} $$
(1)

In [28] several methods to estimate the A and B parameters are discussed. Currently, we set them manually but in the future they will be estimated.

Given the training set \(S_m=\{(x_i,y_i)\}_{i=1}^m\), where the labels \(y_i \in \{-1,1\}\) and \(x_i\) is a vector of image features, the decision boundary between the two classes (e.g., “Indoor” and “Outdoor”) is obtained by the discriminant function,

$$ f(x)= \sum\limits_{i=1}^m c_i K(x_i,x), \label{equation12} $$
(2)

where \(K(x_i,x)\) is the Gaussian kernel \(K(x_i,x)=e^{-\frac{\|x_i-x\|^2}{2\sigma^2}}\), m is the number of training points and \(c=[c_1,...,c_m]^T\) is a vector of coefficients estimated by least squares [29],

$$ (m \gamma I+K)c=y , \label{equation13} $$
(3)

where I is the identity matrix, K is a square positive definite matrix with elements \(K_{i,j}=K(x_i,x_j)\), y is the vector with coordinates \(y_i\) and γ is a regularization parameter. To choose the optimal values for σ and γ, the cross-validation method is used.

A point x with f(x) ≤ 0 is classified in the negative class (y = −1), and a point with f(x) > 0 is classified in the positive class (y = 1). If multiple features are used, different classifiers are obtained, one for each feature.
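
As a minimal illustration, the classifier and its normalization (Eqs. (1)–(3)) could be sketched in Python as follows; the kernel width σ, the regularization parameter γ and the sigmoid parameters A and B are placeholders, to be chosen by cross-validation or set manually as discussed above.

```python
# A minimal sketch of the RLSC concept detector with sigmoid normalization.
# sigma, gamma, A and B are placeholders (sigma and gamma chosen by
# cross-validation, A and B set manually, as in the text).
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """K(x_i, x) = exp(-||x_i - x||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train_rlsc(X, y, gamma, sigma):
    """Solve (m gamma I + K) c = y for the coefficient vector c (Eq. (3))."""
    m = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(m * gamma * np.eye(m) + K, y)

def decision_function(X_train, c, x, sigma):
    """f(x) = sum_i c_i K(x_i, x) (Eq. (2))."""
    k = gaussian_kernel(X_train, np.atleast_2d(x), sigma)[:, 0]
    return float(k @ c)

def concept_probability(f_x, A=1.0, B=0.0):
    """p(w|x) = 1 / (1 + exp(-A f(x) + B)) (Eq. (1))."""
    return 1.0 / (1.0 + np.exp(-A * f_x + B))
```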

4.1.2 Temporal information

Temporal proximity can improve the semantic analysis of images captured by the same user by including the temporal correlation in the model. For instance, Fig. 4 shows three sequential images captured with an interval of 10 s. It is not difficult to identify the concept “Beach” in Fig. 4a and c using visual information, but the same does not happen in Fig. 4b. In this situation, the temporal proximity can help to infer the “Beach” concept in Fig. 4b, using the probabilities obtained for the other two.

Fig. 4 Sequential images captured with a time interval of 10 s

Let \(T=[t_1, t_2, ..., t_N]\) be the ranked vector with the capture time of each picture of a collection \(C_{\rm img}=\{I_1, I_2, ..., I_N\}\). The probability of a concept w given an image \(I_{t_i}\) taken at instant \(t_i\) is,

$$ p_{t}(w|I_{t_i})= \frac{\alpha_{t_{i-1}} p(w|I_{t_{i-1}}) + p(w|I_{t_i}) + \alpha_{t_i} p(w|I_{t_{i+1}}) }{1+\alpha_{t_{i-1}}+\alpha_{t_i}} , \label{equation8} $$
(4)

where \(\alpha_{t_i}\) and \(\alpha_{t_{i-1}}\) are weights that measure the relevance of p(w|I) in the temporally adjacent images of the picture \(I_{t_i}\). These weights are inversely proportional to the temporal distance between images,

$$ \alpha_{t_i} = 1 - \frac{d(t_i)} {d_{\max}} \label{equation9} $$
(5)

where \(d_{\max}\) is a constant, obtained empirically, that represents the maximum time distance allowed to influence the adjacent images, and \(d(t_i)\) is the time distance between the image captured at instant \(t_i\) and the next one,

$$ d(t_i)=\left\{ \begin{array}{l@{\quad}l} |t_{i+1}-t_i|, & |t_{i+1}-t_i| <d_{\max} \\ d_{\max}, & \hbox{other cases.} \end{array} \right. \label{equation10} $$
(6)

This technique can be adapted to explore spatial correlations using the GPS coordinates of each image instead of the time information. Probabilities p(w|I) are obtained using visual information.
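
A minimal sketch of this temporal smoothing step (Eqs. (4)–(6)) is given below, assuming the collection is sorted by capture time; images without two neighbours are left unchanged as a simplification, and the default \(d_{\max}\) is the 240 s value used later in the experiments.

```python
# A minimal sketch of the temporal smoothing of Eqs. (4)-(6). `probs` holds
# p(w|I) for one concept over a collection sorted by capture time and
# `times` the capture timestamps in seconds; the first and last pictures
# are left unchanged (a simplification).
import numpy as np

def smooth(probs, times, d_max=240.0):
    probs = np.asarray(probs, dtype=float)
    times = np.asarray(times, dtype=float)
    # alpha[i]: weight between image i and image i+1 (Eqs. (5)-(6))
    gaps = np.minimum(np.abs(np.diff(times)), d_max)
    alpha = 1.0 - gaps / d_max
    smoothed = probs.copy()
    for i in range(1, len(probs) - 1):
        a_prev, a_next = alpha[i - 1], alpha[i]
        smoothed[i] = (a_prev * probs[i - 1] + probs[i] + a_next * probs[i + 1]) \
                      / (1.0 + a_prev + a_next)   # Eq. (4)
    return smoothed
```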

4.2 Image retrieval

The goal of this component is to create a list of images ranked according to their relevance to the query. Considering a query defined by k concepts \(Q=\{w_1, w_2, ..., w_k\}\) describing the background and some objects present in the desired pictures (e.g., indoor, people and computers), the position in the ranked list of a picture I, using audio, visual, spatial and temporal information, is obtained by the similarity measure,

$$ Sim(Q,I)= f_{\rm gps}\left[Sim_{{\rm visual}+ {\rm time}}(Q,I)+ Sim_{\rm audio}(Q,I)\right], \label{equation1} $$
(7)

where \(f_{\rm gps}\) is a filter applied when the query includes geographic elements (e.g., region or direction) to select images from the list obtained using the other components. The similarity using the audio information is defined by \(Sim_{\rm audio}\) and \(Sim_{{\rm visual}+{\rm time}}\) represents the similarity obtained using the visual and temporal information.

Considering multiple visual features, the visual and temporal similarity \(Sim_{{\rm visual}+{\rm time}}\) of an image I to a given query Q is the sum of the similarities obtained for each feature,

$$ Sim_{{\rm visual}+{\rm time}}(Q,I)= \sum\limits_{j=1}^r a_j Sim_{{\rm visual}+{\rm time}}(Q,x^j), \label{equation2} $$
(8)

where r is the number of features, \(x^j\) represents the j-th feature extracted from image I and \(a_j\) is the weight of each feature, assuming \(\sum_{j=1}^r a_j=1\). The ranked list obtained for each feature is given by the joint probability,

$$ Sim_{{\rm visual}+{\rm time}}(Q,x^j)= p_t(w_{1}, w_{2}, ...,w_{k}|x^j), \label{equation3} $$
(9)

Assuming independence between concepts, the joint probability of a set of concepts given an image is,

$$ p_t(w_{1}, w_{2}, ...,w_{k}|x^j)= \prod\limits_{i=1}^k p_t(w_{i}|x^j), \label{equation4} $$
(10)

The probabilities \(p_t(w_{i}|x^j)\) are computed as described in the semantic analysis section (Section 4.1).
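
The combination of features and concepts (Eqs. (8)–(10)) can be sketched as follows; the per-feature concept probabilities and the weights \(a_j\) are assumed to be given.

```python
# A minimal sketch of Eqs. (8)-(10). `feature_probs` is a list with one dict
# per visual feature, mapping each concept to its smoothed probability
# p_t(w|x^j) for a given image; `weights` holds the a_j, summing to one.
import numpy as np

def sim_visual_time(query_concepts, feature_probs, weights):
    sim = 0.0
    for a_j, probs_j in zip(weights, feature_probs):
        joint = np.prod([probs_j[w] for w in query_concepts])   # Eq. (10)
        sim += a_j * joint                                      # Eq. (8)
    return sim

# e.g., ranking a collection for the query {"indoor", "people"}:
# ranked = sorted(collection, reverse=True,
#                 key=lambda img: sim_visual_time(("indoor", "people"),
#                                                 collection[img], weights))
```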

The function \(f_{\rm gps}\) is used when the query includes geographic information. Two types of query can be used: region queries and direction queries. In the first case, images inside the defined region are selected. For each photo Ig, the distance between its GPS location and the location of the center of the selected region Qg is calculated using the Great Circle distance [32],

$$ dist(Qg,Ig)=r_{\rm earth} \Delta\phi \label{equation5} $$
(11)

where \(r_{\rm earth}=6{,}378.7\) km is the earth radius and Δϕ is,

$$ \Delta\phi=\arccos\left[\sin({lat}_{Qg}) \sin({lat}_{Ig})+\cos({lat}_{Qg}) \cos({lat}_{Ig}) \cos({lon}_{Ig}-{lon}_{Qg})\right] \label{equation14} $$
(12)

In this equation lat represents the latitude and lon the longitude. All the images satisfying the condition,

$$ dist(Qg,Ig)<r_{\rm query}, \label{equation6} $$
(13)

are selected to be ranked, where \(r_{\rm query}\) is the radius of the circle defined by the selected region.

The direction query is defined when the query includes one of the directions North, South, East or West. Given a position in GPS coordinates and a direction, the system retrieves all the pictures located in the selected direction.

In both cases, if the query only contains geographic information, images are ranked according to the Great Circle distance (see (11)).
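
A minimal sketch of the region filter (Eqs. (11)–(13)) is shown below; the coordinates are assumed to be in degrees and the query radius in kilometres.

```python
# A minimal sketch of the region filter f_gps (Eqs. (11)-(13)). `images`
# maps image identifiers to (latitude, longitude) pairs in degrees and
# `r_query_km` is the radius of the query circle in kilometres.
import math

R_EARTH_KM = 6378.7  # earth radius used in Eq. (11)

def great_circle_km(lat_q, lon_q, lat_i, lon_i):
    """dist(Qg, Ig) = r_earth * Delta_phi (Eqs. (11)-(12))."""
    lat_q, lon_q, lat_i, lon_i = map(math.radians, (lat_q, lon_q, lat_i, lon_i))
    cos_dphi = (math.sin(lat_q) * math.sin(lat_i)
                + math.cos(lat_q) * math.cos(lat_i) * math.cos(lon_i - lon_q))
    return R_EARTH_KM * math.acos(min(1.0, max(-1.0, cos_dphi)))

def region_filter(center, images, r_query_km):
    """Keep the images whose GPS position falls inside the query circle (Eq. (13))."""
    lat_q, lon_q = center
    return [img for img, (lat_i, lon_i) in images.items()
            if great_circle_km(lat_q, lon_q, lat_i, lon_i) < r_query_km]
```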

Recognized words from the audio captured when the picture is taken are used to calculate \(Sim_{\rm audio}(Q,I)\). This similarity is obtained using standard techniques of text retrieval [1]. Images are ranked according to this similarity measure.
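
Since only standard text retrieval techniques [1] are mentioned, the following sketch assumes a common TF-IDF weighting with cosine similarity to illustrate \(Sim_{\rm audio}\); the exact weighting scheme is an assumption.

```python
# A sketch of Sim_audio using TF-IDF weighting and cosine similarity; the
# paper only states that standard text retrieval techniques are used, so
# this particular weighting scheme is an assumption.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_by_audio(query_text, transcripts):
    """transcripts: dict mapping image id -> ASR transcript of its audio clip."""
    ids = list(transcripts)
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform([transcripts[i] for i in ids])
    query_vector = vectorizer.transform([query_text])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    return sorted(zip(ids, scores), key=lambda pair: pair[1], reverse=True)
```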

4.3 Feature extraction

Audio information is converted to text using ASR tools. We used a speech recognition API developed by the Microsoft Language Development Center in Portugal. When the picture is captured, the camera parameters are saved in the EXIF header (a part of the JPEG header). These parameters include the capture time and the GPS coordinates. Concerning the visual features, four are used: (1) marginal HSV color moments, (2) Gabor filter features, (3) bags of SIFT descriptors and (4) bags of color regions (see [12, 13] for more details).
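
As an illustration of one of these features, a sketch of the marginal HSV colour moments is given below; using the first three moments per channel is an assumption, since the full feature definitions are given in [12, 13].

```python
# A sketch of the marginal HSV colour moments; using the mean, standard
# deviation and (signed cube root of the) third moment of each channel is
# an assumption, as the exact definition is given in [12, 13].
import numpy as np
from PIL import Image

def hsv_color_moments(path):
    hsv = np.asarray(Image.open(path).convert("RGB").convert("HSV"), dtype=float)
    feats = []
    for c in range(3):                       # H, S and V channels
        channel = hsv[..., c].ravel()
        mu = channel.mean()
        sigma = channel.std()
        third = ((channel - mu) ** 3).mean()
        skew = np.sign(third) * abs(third) ** (1.0 / 3.0)
        feats.extend([mu, sigma, skew])
    return np.array(feats)                   # 9-dimensional feature vector
```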

5 Semi-automatic image annotation

This section describes a framework for semi-automatic image annotation that uses the manual user intervention to correct errors of the automatic methods, in a way that is similar to a search with relevance feedback. The framework is composed of four parts (see Fig. 5): an application, the human interaction module, a block to update the automatic models and a method for automatic image annotation.

Fig. 5 Block diagram of the semi-automatic annotation method

The application block can be a search application with relevance feedback or any other application where the user makes connections between images and words. To motivate the user, and inspired by the ESP Game [35], a computer game called Tag Around was designed as the application module of the proposed framework. Section 5.1 describes the game, the human interaction module and the method to compute the score.

The automatic block uses the output of the semantic concepts (see Section 4.1) to automatically annotate images. The feedback given by the users is used to update the semantic concepts (automatic models update block), that is, images labeled by users playing the game are included in the training set. Then, the semantic models are estimated again (see Section 4.1).

With the four blocks of Fig. 5, an algorithm for semi-automatic image annotation is defined. Section 5.2 describes this algorithm.

5.1 Tag Around

Our proposal for the application block of the framework is the Tag Around game [10, 15]. The goal of the application is to annotate a set of pictures. Fig. 6 presents the interface of this game. The game is played with gesture input: a video camera is pointed at the user to capture the movements. A set of concepts is displayed at the top of the screen, grouped in a rotational platform controlled by the user input. At the bottom of the screen the pictures are displayed, also controlled by the user. Whenever the user decides to match the centered tag and picture, a short animation sequence is presented and a comment appears on the screen, which helps to know whether a correct annotation was performed. To assist the interaction while playing the game, the user's picture is displayed with a set of hotspots.

Fig. 6 Tag Around—a game for image tagging

While playing the game, the users have to correctly annotate the maximum number of images (in a given time period) using the words on the screen. Based on the user behavior (how well users tag images), a confidence level is assigned to them. The game scoring is based on the automatic annotation algorithm, the trust in the user and the feedback given by other users that have already played the game and tagged that set of pictures.

5.1.1 Human interaction

As mentioned, a camera is pointed at the user to capture the movements. There are five hotspots in the image, which the user touches to move the objects on the screen. Two hotspots, situated at the bottom of the screen (see Fig. 6), are used for picture rotation: the left one rotates the pictures to the left and the right one rotates them to the right. Two other hotspots, situated at shoulder height, handle the tag rotation in the same way. The fifth hotspot is situated above the user and is used for matching: whenever users are sure of a picture-tag pair, they move their hand into that hotspot area to perform the tag matching.
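
A sketch of how the tracked hand position could be mapped to these five hotspots is given below; the coordinates and radius are placeholders, not the values used in Tag Around.

```python
# A sketch of the hotspot hit-test; positions are in normalized image
# coordinates (origin at the top-left) and are placeholders, not the
# values used in Tag Around.
HOTSPOTS = {
    "rotate_pictures_left":  (0.05, 0.90),   # bottom-left corner
    "rotate_pictures_right": (0.95, 0.90),   # bottom-right corner
    "rotate_tags_left":      (0.10, 0.50),   # left, at shoulder height
    "rotate_tags_right":     (0.90, 0.50),   # right, at shoulder height
    "match":                 (0.50, 0.08),   # above the user's head
}
RADIUS = 0.08  # normalized hotspot radius

def triggered_action(hand_x, hand_y):
    """Return the hotspot hit by the tracked hand position, if any."""
    for action, (hx, hy) in HOTSPOTS.items():
        if (hand_x - hx) ** 2 + (hand_y - hy) ** 2 < RADIUS ** 2:
            return action
    return None
```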

5.1.2 Score computation

When users tag an image, the game module calculates the score of the player's move using a formula that includes the trust level in the player, the probability of a tag given an image (obtained from the output of the automatic algorithm) and the feedback provided by all previous users. If this score is high, the annotation is considered correct, and the user score and trust level are increased, indirectly increasing the feedback provided by all users.

Given a set of pictures \(L=\{I_1,...,I_{N_l}\}\) (\(L \subset C_{\rm img}\)) and a set of concepts \(V_{sc} = \{w_1,...,w_{N_{\rm con}}\}\) (\(V_{sc} \subset V_{\rm con}\)), the score obtained by matching the concept w in the image I is computed by,

$$ S_{\rm total}(I,w,n,m)=C_{\rm group}(m)+[1-C_{\rm group}(m)]S_{\rm new}(I,w,n), \label{equation7} $$
(14)

where n represents the number of correct annotations provided by the user, m is the number of times the concept w was annotated in image I, \(S_{\rm new}\) is a function that evaluates the annotation using the semantic concepts and the trust in the player (see (16)) and \(C_{\rm group}(m)\) is the group trust obtained from the correctness of the same annotation provided by other users,

$$ C_{\rm group}(m)=1-e^{-(\frac{m}{k_g})}, \label{equation8} $$
(15)

The exponential parameter \(k_g\) is estimated in order to obtain a group trust near the maximum value after m annotations. We consider that three players providing the same annotation (m = 3) means a high group trust, and for this reason \(k_g\) is obtained assuming this condition. The ESP Game [35] validates an annotation with two players. With this equation, when m = 2, the score is not the maximum but is a value that allows the system to classify the annotation as correct.

When a concept w is annotated for the first time in an image I, the score is computed by,

$$ S_{\rm new}(I,w,n)=C_{\rm player}(n)+[1-C_{\rm player}(n)]p(w|I), \label{equation9} $$
(16)

where p(w|I) is the probability obtained by the automatic method (semantic concepts) and \(C_{\rm player}\) is the trust of the system in the player, which reflects the quality of the previous annotations provided by the player,

$$ C_{\rm player}(n)=\left\{ \begin{array}{ll} k_pn, & n < K_{\rm moves} \\ k_{\rm conf}, & n \geq K_{\rm moves} \end{array} \right. \label{equation10} $$
(17)

where \(K_{\rm moves}\) is a constant with the number of good moves needed to reach the maximum player trust value \(k_{\rm conf}\), and \(k_p\) is a constant used to increment the player trust.

The number of correct moves n increases when the group trust is different from zero and the score is greater than a defined threshold; it decreases when the score is below another threshold. These thresholds were obtained empirically. When the group trust is zero, the score is obtained using only the semantic concepts and the player trust; in these cases, it is difficult to know the correctness of the annotation.
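
A minimal sketch of the score computation (Eqs. (14)–(17)) follows; the constants \(k_g\), \(k_p\), \(k_{\rm conf}\) and \(K_{\rm moves}\) are placeholder values, with \(k_g\) chosen so that three identical annotations already yield a group trust close to one, as stated above.

```python
# A minimal sketch of the score computation (Eqs. (14)-(17)). The constants
# are placeholders; K_G is chosen so that three identical annotations
# (m = 3) already give a group trust close to one, as stated in the text.
import math

K_G, K_P, K_CONF, K_MOVES = 1.0, 0.1, 0.8, 8   # assumed values

def group_trust(m):
    """C_group(m) = 1 - exp(-m / k_g) (Eq. (15))."""
    return 1.0 - math.exp(-m / K_G)

def player_trust(n):
    """C_player(n): linear in the number of good moves up to k_conf (Eq. (17))."""
    return K_P * n if n < K_MOVES else K_CONF

def score_new(p_w_given_i, n):
    """S_new = C_player + (1 - C_player) p(w|I) (Eq. (16))."""
    c = player_trust(n)
    return c + (1.0 - c) * p_w_given_i

def score_total(p_w_given_i, n, m):
    """S_total = C_group + (1 - C_group) S_new (Eq. (14))."""
    c = group_trust(m)
    return c + (1.0 - c) * score_new(p_w_given_i, n)
```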

5.2 Image annotation algorithm

An annotation of a concept \(w_i\), belonging to a vocabulary \(V_{\rm con}=\{w_1, w_2, ..., w_k\}\), on an image \(I \in C_{\rm img}\) is defined as \(A(I, w_i)\). Given a set \(L \subset C_{\rm img}\) with \(N_l\) images and a set \(V_{sc} \subset V_{\rm con}\) with \(N_{\rm con}\) concepts, the semi-automatic method is defined by the following steps:

  1. The subsets L and \(V_{sc}\) are presented in the interface (application block);

  2. The user selects one image \(I_l \in L\) and a concept \(w_k \in V_{sc}\);

  3. The user makes an annotation, \(A_i(I_l,w_k)\);

  4. The score is computed using the automatic models \(p(w_k|I_l)\), the trust of the game in the player and the feedback provided by all previous users;

  5. For all concepts \(w_k \in V_{sc}\), if \(|\{A_1,A_2,...,A_{N_A}\}| > N_{\rm upd}\) for a concept \(w_k\), then the training set is updated and the model for the concept \(w_k\) is computed again;

  6. Go to step 2.

A semantic model is trained again when the number of different correct annotations with that concept is above \(N_{\rm upd}\). An annotation is considered correct when it is performed by at least two users. As a result of this algorithm, a set of annotations \(A=\{A_1,A_2,...,A_{N_{\rm total}}\}\) is obtained and the semantic concepts of the set \(V_{\rm con}\) are estimated with a larger training set. If two different players provide the same wrong annotation, the algorithm fails; this can increase the number of failures of the related concept, but it is not a usual situation.

Both subsets, L and \(V_{sc}\), used in each level of the game are selected in the automatic annotation block. Therefore, the learning process is driven by the automatic model.
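
The overall loop could be sketched as follows; the callables and the player object are hypothetical stand-ins for the blocks of Fig. 5, and the correctness rule (two agreeing users) follows the description above.

```python
# A rough sketch of the loop of Section 5.2. `present`, `get_user_move` and
# `retrain_concept` are hypothetical stand-ins for the application, human
# interaction and model-update blocks of Fig. 5; `score_total` is the
# function sketched in Section 5.1.2.
from collections import defaultdict

def annotation_session(images, concepts, p, n_upd,
                       present, get_user_move, score_total, retrain_concept):
    correct = defaultdict(set)        # concept -> images correctly annotated with it
    present(images, concepts)         # step 1: show the subsets L and V_sc
    while True:
        # steps 2-3: the player picks an image and a concept and makes the
        # annotation; m is how many players have made this same annotation
        image, concept, player, m = get_user_move()
        # step 4: score the move using p(w|I), the player trust and the group feedback
        player.score += score_total(p[concept][image], player.n, m)
        if m >= 2:                    # an annotation is correct when >= 2 users agree
            correct[concept].add(image)
        # step 5: retrain the concept model once enough new correct annotations exist
        if len(correct[concept]) > n_upd:
            retrain_concept(concept, correct[concept])
            correct[concept].clear()
        # step 6: loop back for the next move
```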

6 Experiments and evaluation

To evaluate our proposal for image annotation, we start by testing the automatic method (based on semantic concepts) on images using different combinations of features, and with users to assess the quality of the results. Then, we evaluate the concept detection when including the feedback provided by the users playing the game. Finally, we evaluate the Tag Around game with usability tests, since users are an important part of the process.

We train a set of concepts, suitable for personal collections, selected from the set of 449 LSCOM [26] concepts. These concepts are estimated using a training set obtained from the Corel Stock Photo CDs, from the TRECVID 2005 database and from Flickr, in order to build a generic data set. Nine semantic concepts were selected for evaluation: “People”, “Face”, “Outdoor”, “Indoor”, “Nature”, “Manmade”, “Snow”, “Beach” and “Party”.

6.1 Datasets

Our proposals were tested with two different picture collections. A personal collection with about 5,000 pictures was used to test the visual features and the time information. It was also used in the Tag Around game. This collection is essentially composed of pictures of people, nature or urban scenes, holidays and parties. We also tested the automatic method with a database of pictures taken by several visitors of a cultural heritage site in Sintra, Portugal. This place is composed of beautiful gardens, caves and romantic buildings. Given the nature of the place, only five concepts out of the nine were used in this evaluation.

6.2 Automatic method

This section starts by presenting the results obtained with the automatic detection of the semantic concepts in images with several visual features and time information. Table 2 compares the performance of the system using two different combinations of visual features and including time information in the second combination. Feature set 1 represents the combination of color moments with features obtained with the Gabor filters and Feature set 2 denotes the combination of a bag of color regions with a bag of SIFT descriptors. In this test we used the personal collection with 5,000 images.

Table 2 Mean average precision (MAP) obtained for a set of concepts combining several visual features and time information. Maximum time distance considered between images was d max = 240 s

Generally, the combinations that use bags are better than the other combinations of visual features. If time information is included in this better combination, the system results improve further. The concepts “Snow”, “Beach” and “Party” exhibit the largest increments when using time information. These concepts represent events where people stay for a longer period of time and, for this reason, the probability of capturing correlated images increases. The concepts “Outdoor” and “Snow” present the best results with Feature set 1.

To evaluate the inclusion of the audio information and the GPS data, we used the database of the cultural heritage site in Sintra, Portugal. Table 3 presents the results obtained with visual features alone, with audio information alone, and with the combination of visual information with audio and GPS data. We tested the image retrieval method in several locations of the place; Table 3 presents the performance obtained in one of them. As shown, combining two types of data yields better results than using only visual features. The results obtained with the geographic metadata depend on the local features of the selected region and, for this reason, some concepts decrease in performance. For instance, the selected location is an outdoor region; consequently, the concept “Indoor” obtained a low performance (next to last column in Table 3). For the same reason, using the three types of information (including the geographic metadata) does not yield better results than using only visual and audio data.

Table 3 MAP obtained for a set of concepts using visual features, audio information and location data to select a region of 60 m

To compare the results evaluated with the MAP measure against the users' opinion of the same results, we asked 58 voluntary participants to classify the results obtained by several searches using semantic concepts. The participants were graduate students; ten of them were female. The users ranged in age from 21 to 31 years old, with a mean age of 23.5.

After each search, users had to classify the results using a 5-point Likert-type scale, where 1 (one) means bad and 5 (five) means excellent. The users’ classification is summarized in Table 4.

Table 4 Evaluation of the results obtained by several searches provided by 58 users

In general, the results obtained were considered reasonable by the users. This means that the values obtained with the MAP measure for the “People” and “Nature” concepts represent acceptable results for these users. In this experiment, we used Feature set 1 (“People”, MAP = 69% and “Nature”, MAP = 45%).

6.3 Semi-automatic method

This section presents the results obtained by including users to correct the errors of the automatic method using the computer game. Table 5 presents the mean average precision (MAP) obtained using the initial training set and with 20 and 40 more images annotated using the game. Generally, the largest improvement occurs when the first 20 new images are included in the training set. This happens because images of the test collection are added to the training set, and images of the same collection are more correlated. After that, increasing the training set of each concept by 20 more images increases the mean average precision obtained with the semantic models by 4%.

Table 5 MAP obtained using different training sets to estimate the semantic concepts: the initial training set and the same set with 20 and 40 more images for each concept

The Tag Around game was subject to usability testing [11], with the goal of evaluating the interface complexity, the usefulness and the aesthetic aspects, and of understanding how easy it is to learn and use. These tests were performed by 15 voluntary participants ranging in age from 18 to 31 years old, with a mean age of 24. After testing the application, users were asked to fill in a questionnaire and express their opinions regarding the application they had just tested. Table 6 shows the results obtained for the usefulness-related questions, which are important to evaluate the proposal as an annotation tool. The questions were answered on a 5-point Likert-type scale, where 1 means totally disagree and 5 means totally agree. In general, the results are positive. The lack of consensus among the users about using the application to annotate their own images is a concern. Nevertheless, the feedback related to the use of the application as a game to spend time in a public space is positive, which was one of the intended design goals.

Table 6 Mean (μ) and standard deviation (σ) obtained for a set of usefulness related questions

7 Conclusions and future work

This paper proposes methods to annotate images using semantic concepts in two ways: automatically and within a semi-automatic framework. The semantic concepts are estimated using multimodal information. The semi-automatic image annotation framework is based on a computer game played with gesture input. In general, the results presented show that the semantic concepts increase their performance when more types of data are included in the process. Time information improves the results of all concepts, especially the concepts related to events with a certain duration. GPS data also improves the results, but conditioned by the features of the selected place. Manual annotations are a good choice in terms of efficiency; however, users must be motivated to provide them. Our solution to motivate the users is to convert the image annotation task into an enjoyable game. While people enjoy playing the game, they contribute to solving the complex annotation problem. For future work, we plan to include additional types of data in the semantic model and to develop an active learning method for the computer game.