
1 Introduction

The current era has seen rapid growth in Multimedia Information Retrieval (MIR). Despite continuous work on new MIR techniques and datasets, the semantic gap between images and high-level concepts remains wide. To narrow this gap, we need models that explicitly target high-level semantic concepts, whether through image annotation or object recognition. Numerous real-world methods [1, 2] have been introduced for such concept-based multimedia search systems. In most of these methodologies, the first step is selecting a dataset of high-level concepts with small semantic gaps, which are relatively easy for machines to understand and learn from.

This paper presents a novel resource evaluation dataset for cross-media search, called FB5K, along with a benchmark learning system. Existing cross-media or multi-modal retrieval datasets have several limitations. First, some datasets lack context information, i.e. link relations. Such context information is quite accurate and can provide significant evidence for improving the accuracy of cross-media retrieval systems. Similarly, the Pascal VOC 2012 dataset [3] consists of only 20 categories, whereas cross-media retrieval spans numerous domains under real-world internet conditions; systems trained on datasets covering few domains have difficulty handling queries from unknown domains. Second, popular cross-media datasets are small, for example Xmedia [4], IAPR TC-12 [5], and Wikipedia [6]. This lack of suitable data makes it difficult to train retrieval systems and to evaluate their robustness on real-world galleries. Third, some datasets, such as ALIPR [7] and SML [8], simply used all the annotation keywords associated with the training images, while others, such as ESP [9], LabelMe [10], and AnnoSearch [11], enforced no constraint on the annotation vocabulary. These datasets therefore essentially neglect the differences among keywords with respect to semantic gaps.

Fig. 1. Some examples from the FB5K dataset used for cross-media search.

Considering the aforementioned problems, this paper makes three major contributions. The first is the collection of a new resource evaluation cross-media retrieval dataset, named FB5K. It contains 5130 image-feeling pairs collected from Facebook and is introduced for the first time to the cross-media retrieval research community. The dataset differs from current datasets in three respects: varied domains, incorporation of high-level semantic information, and rich context information. It should therefore provide a more accurate benchmark for cross-media research. We constructed the dataset with these research issues in mind, so that researchers and developers can focus on cross-media retrieval algorithm development instead of laboriously comparing methods and results. The second is that, to the best of our knowledge, this is the first effort to collect a dataset of high-level concepts with small semantic gaps based on users’ semantic descriptions, i.e. image-feeling relationships. The third is an approach for learning the cross-media embeddings of users’ feelings, images, and tags/texts. We propose a novel method combining Optical Character Recognition (OCR), explicit incorporation of high-level semantic information, and a new similarity measurement in the embedded space, which significantly outperforms conventional distance measurements and improves retrieval performance.

2 Proposed Dataset

This section describes a new dataset called FB5K, which comprises 5130 images collected from Facebook. The complete FB5K dataset will be made available via ABC.

2.1 Dataset Collection

Each step in the dataset collection is briefly explained below.

Seed User Gathering. To capture the genuine emotions of users associated with an image, rather than the image contents, we obtained seed users by sending queries to Facebook with numerous keywords, for example happy, hungry, love, etc.

User Candidate Generation. To generate user candidates, we implemented a web spider to crawl the accounts of users who were following the seed users. This step was repeated a number of times until we obtained a long list of user candidates.

Feelings Collection. Another web spider collected feelings, as text associated with the matching images, by visiting the pages of the users on the candidate list. We found that about 80\(\%\) of the users’ feelings were accompanied by images.

Data Pruning. We refer to an image, tag, or feeling-text pair as a tweet. Data were pruned based on the following criteria (the pruned-out data were regarded as garbage data); a sketch of this filtering step follows the list:

  • Feelings without images;

  • Tweets not associated with images or feelings;

  • Repeated images with the same ID;

  • Error images.
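The sketch below illustrates how such pruning might be implemented; the `Tweet` record and its field names are hypothetical stand-ins for the crawled data, not the authors’ actual pipeline:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Tweet:
    post_id: str                # hypothetical field names for a crawled post
    image_path: Optional[str]   # None if no image accompanied the post
    feeling: Optional[str]      # None if no feeling text was attached
    image_ok: bool              # False if the image failed to download/decode

def prune(tweets: list[Tweet]) -> list[Tweet]:
    """Apply the four pruning criteria; everything dropped is garbage data."""
    seen_ids: set[str] = set()
    kept = []
    for t in tweets:
        if t.image_path is None or t.feeling is None:
            continue                  # feelings without images, or vice versa
        if not t.image_ok:
            continue                  # error images
        if t.post_id in seen_ids:
            continue                  # repeated images with the same ID
        seen_ids.add(t.post_id)
        kept.append(t)
    return kept
```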

As a result, a total of 5130 image-tag pairs were obtained. Figure 1 presents some examples from this benchmark dataset.

2.2 Dataset Characteristics

The performance of cross-media retrieval methods is highly dependent on the nature of the dataset used for their evaluation. The FB5K dataset includes a set of images that are closely associated with user feelings. These images were crawled from Facebook along with the user-associated feelings. The FB5K dataset has the following attributes:

  • First, since this dataset was collected from a social media website, it covers a broad variety of domains under individual feeling labels such as hungry, love, sad, and thankful.

  • Second, the relationship between images and users’ feelings is often very strong. In the examples given in Fig. 1, the images have strong ties with the associated feelings, as is the case in realistic scenarios.

  • Third, FB5K is a large-scale dataset containing 5130 image-text pairs, which helps to avoid overfitting during system training. In other words, the wealth of data allows the robustness of cross-media retrieval methods to be tested.

  • Fourth, this dataset helps to reduce the semantic gap by providing more accessible visual content descriptors through high-level semantic concepts.

To our knowledge, this is the first cross-media dataset with the above-mentioned characteristics. We also believe that FB5K is the first dataset collected from Facebook that comprises high-level concepts with small semantic gaps to users’ semantic descriptions, together with a ground truth of 70 concepts for the whole dataset.

3 Proposed Retrieval Method for the FB5K Dataset

This section briefly explains the proposed cross-media retrieval algorithm for FB5K. Numerous features are used for image representation, for example SIFT [12], color features [13], GIST [14], and HOG [15, 16]. These features are useful for capturing the colors and shapes of images, but not the words depicted within them. In this regard, we first apply OCR, then adopt explicit incorporation of high-level semantic information, and finally develop a novel similarity measurement in the embedded space to improve retrieval performance. A detailed explanation follows.

Fig. 2. Graphical representation of high-level semantic information incorporation: (a) without semantic class and (b) with semantic class.

Text Extraction. The first step is the extraction of the words on each image using the Tesseract OCR engine.
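As a minimal sketch, this step could be implemented with the open-source Tesseract engine through the pytesseract wrapper (the binding and helper below are our assumption; the paper names only the engine):

```python
from PIL import Image
import pytesseract  # Python wrapper around the Tesseract OCR engine

def extract_words(image_path: str) -> list[str]:
    """Run OCR on one image and return the recognized words."""
    text = pytesseract.image_to_string(Image.open(image_path))
    return text.split()
```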

Incorporation of High-Level Semantic Information. To complement the OCR text extraction, we incorporate high-level semantic information when learning a common space for images, text/tags, and semantic information (user feelings). Assume we have n training images with \(i_f\)-dimensional visual feature vectors \(I\in R^{n\times i_f}\) and \(t_f\)-dimensional tag feature vectors \(T\in R^{n\times t_f}\). Furthermore, each training image is associated with a high-level semantic class, \(C\in R^{n\times c}\), where c is the number of categories. Each image is labeled with exactly one of the c classes (only one entry in each row of C is 1 and the rest are 0).
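A one-hot class matrix of this form can be built in a few lines; the helper below is illustrative, with n and c taken from the definitions above:

```python
import numpy as np

def one_hot_classes(labels: list[int], c: int) -> np.ndarray:
    """Build the n x c matrix C: exactly one 1 per row, zeros elsewhere."""
    n = len(labels)
    C = np.zeros((n, c))
    C[np.arange(n), labels] = 1.0   # labels[i] is the class index of image i
    return C
```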

Let i, j denote two points. We define similarity as:

$$\begin{aligned} K_x(i,j) = \psi _x(i)\psi _x(j)^T \end{aligned}$$
(1)

where \(K_x\) is a kernel function and \(\psi _x(.)\) represents a function embedding the original feature vector into a nonlinear kernel space. The goal is to find projection matrices \(W_x\) that map the embedded vectors \(\psi _x(i)\) so as to minimize the distance between corresponding data items.

The objective function can be mathematically expressed as:

$$\begin{aligned} \min _{W_1,W_2,W_3} \left\| \psi _1(I)W_1 - \psi _2(T)W_2 \right\| _2 + \left\| \psi _1(I)W_1 - \psi _3(C)W_3 \right\| _2 + \left\| \psi _2(T)W_2 - \psi _3(C)W_3 \right\| _2 \end{aligned}$$
(2)

The first term of this objective aligns corresponding images and tags [17], whereas the remaining two terms align the image and tag embeddings with their semantic class. Figure 2 graphically illustrates the benefit of incorporating high-level semantic information.
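A minimal sketch of minimizing objective (2) by gradient descent is given below, assuming the kernel embeddings \(\psi _1(I)\), \(\psi _2(T)\), and \(\psi _3(C)\) are precomputed as PyTorch tensors; the optimizer, dimensionality, and step count are our choices, and in practice a scale constraint (e.g. orthogonality) would be added to rule out the trivial all-zero solution:

```python
import torch

def learn_projections(psi_i, psi_t, psi_c, d=128, steps=500, lr=1e-2):
    """Minimize Eq. (2): pairwise L2 distances between projected image,
    tag, and semantic-class embeddings (unconstrained sketch)."""
    W1 = torch.randn(psi_i.shape[1], d, requires_grad=True)
    W2 = torch.randn(psi_t.shape[1], d, requires_grad=True)
    W3 = torch.randn(psi_c.shape[1], d, requires_grad=True)
    opt = torch.optim.Adam([W1, W2, W3], lr=lr)
    for _ in range(steps):
        p_i, p_t, p_c = psi_i @ W1, psi_t @ W2, psi_c @ W3
        loss = ((p_i - p_t).norm(dim=1).sum()    # align images and tags
                + (p_i - p_c).norm(dim=1).sum()  # align images and classes
                + (p_t - p_c).norm(dim=1).sum()) # align tags and classes
        opt.zero_grad()
        loss.backward()
        opt.step()
    return W1.detach(), W2.detach(), W3.detach()
```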

Similarity Measure. We developed a novel similarity measurement that yields more realistic results. Mathematically, it can be expressed as:

$$\begin{aligned} sim(x_i,y_j) = \frac{(\psi _x(i)W_x)\,(\psi _y(j)W_y)^T}{\left\| \psi _x(i)W_x \right\| _2 \left\| \psi _y(j)W_y \right\| _2} \end{aligned}$$
(3)

where \(x_i\) denotes a training image and \(y_j\) its corresponding tweet. \(W_x\) projects the embedded vector \(\psi _x(i)\) and \(W_y\) projects the embedded vector \(\psi _y(j)\) so as to minimize the distance between image and text.
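In code, Eq. (3) is simply the cosine similarity of the two projected vectors; a sketch follows, with the embeddings and projection matrices assumed to be NumPy arrays:

```python
import numpy as np

def embedded_similarity(psi_x, W_x, psi_y, W_y):
    """Eq. (3): cosine similarity between the projected embeddings."""
    u = psi_x @ W_x   # projected image embedding
    v = psi_y @ W_y   # projected text embedding
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```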

Distance in Common Subspace. We denote the cosine distance between two modalities in the common subspace as Cos(Twt, Img), where Twt and Img represent the tweet and the image. The common subspace is learned by retrieval methods such as the Correspondence Autoencoder (Corr-AE) and subspace learning methods.

Ranking. Each candidate in the gallery was ranked based on its similarity distance to the query; a sketch of this step follows.
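The sketch below reuses the cosine similarity above on a precomputed gallery matrix (one projected candidate per row); the function name and shapes are our assumptions:

```python
import numpy as np

def rank_gallery(query_vec: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    """Return gallery indices sorted by cosine similarity, best match first."""
    q = query_vec / np.linalg.norm(query_vec)
    G = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(G @ q))
```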

4 Experimental Results and Discussion

4.1 Experimental Setup

All experiments were performed with four subspace learning methods, namely Canonical Correlation Analysis (CCA) [6], the Bilinear Model (BLM) [18], Partial Least Squares (PLS) [19], and Generalized Multi-view Marginal Fisher Analysis (GMMFA) [20], and three Corr-AE methods [21]: Corr-AE, cross Corr-AE, and full Corr-AE.

For the subspace learning methods, we used the implementation from [20] to compute the linear projection matrices. For the Corr-AE methods, we used the implementation of [21] to calculate the hidden vectors of the two modalities, employing a 1024-dimensional hidden layer. For Corr-AE, cross Corr-AE, and full Corr-AE, the weight factor balancing the reconstruction errors and the correlation distances was set to 0.8, 0.2, and 0.8, respectively.

Dataset Splitting. We used three datasets in the experiments: Wikipedia, Flickr30k, and FB5K. Each dataset was split into a training set, a testing set, and a validation set, as detailed below:

  1. Wikipedia dataset. For subspace learning, we used 2173 and 500 image-text pairs for training and testing, respectively; for the Corr-AE methods, a further 193 pairs served as a validation set. All of the test data were used as queries.

  2. Flickr30k dataset. For subspace learning, we used 15000 image-text pairs for training and testing, while for the Corr-AE methods an additional 1783 image-text pairs were added for validation. We randomly selected 2000 images and texts from the test set to serve as queries.

  3. FB5K dataset. We split the dataset into \(80\%\) and \(20\%\) image-text pairs for training and testing, respectively. The same split was used for subspace learning, while for the Corr-AE methods 250 additional image-text pairs served as a validation set (a sketch of this split follows the list).
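The FB5K split could be realized as in the sketch below; drawing the 250 validation pairs from the training portion is our assumption, since the text does not state their origin:

```python
import random

def split_fb5k(pairs, val_size=250, seed=0):
    """80/20 train/test split of image-text pairs, with a small validation
    set held out for the Corr-AE methods (a sketch)."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = int(0.8 * len(pairs))
    train, test = pairs[:n_train], pairs[n_train:]
    # Assumption: validation pairs are carved out of the training portion.
    return train[:-val_size], train[-val_size:], test
```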

Representation. All images were first resized to \(224\times 224\). We then extracted the last fully connected (fc7) Convolutional Neural Network (CNN) features using VGG16 [22] with the Caffe [23] implementation. Text representation for the subspace methods was based on Latent Dirichlet Allocation (LDA) [24]: an LDA model was learned from all texts and used to compute the probability of each text under 50 hidden topics, and this probability vector served as the text representation. For the Corr-AE methods, a bag-of-words (BoW) model was used instead. Texts were first converted to lower case and all stop words were removed; a unigram model was then adopted to form a dictionary of the 5000 most frequent words, from which a 5000-dimensional BoW vector was generated for each text. A sketch of the text side of this pipeline follows.
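The sketch below reproduces the two text encodings with scikit-learn; the topic count (50) and vocabulary size (5000) come from the text, while the library choice is ours (the CNN image features are omitted for brevity):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def text_representations(texts: list[str]):
    """Return 50-dim LDA topic vectors (subspace methods) and
    5000-dim BoW counts (Corr-AE methods)."""
    bow = CountVectorizer(lowercase=True, stop_words="english",
                          max_features=5000)   # 5000 most frequent unigrams
    X = bow.fit_transform(texts)               # n x 5000 BoW matrix
    lda = LatentDirichletAllocation(n_components=50, random_state=0)
    topics = lda.fit_transform(X)              # n x 50 topic probabilities
    return topics, X
```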

Evaluation Parameters. We assessed retrieval performance using Cumulative Match Characteristic (CMC) curves and the mean rank. The CMC curve is a widely used evaluation metric in applications such as face recognition [25,26,27] and biometric systems [28, 29]. For cross-media retrieval, the CMC curve plots the average retrieval accuracy against the rank of the correct match over a set of queries K, with the mean rank defined as:

$$\begin{aligned} Rank = \frac{1}{\left| K \right| }\sum \nolimits _{x = 1}^{\left| K \right| } {rank_x}, \end{aligned}$$
(4)

where \(rank_x\) is the rank position of the correct match for the \(x^{th}\) query.
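Both metrics are straightforward to compute from the per-query ranks of the correct matches; a minimal sketch:

```python
import numpy as np

def cmc_and_mean_rank(correct_ranks, max_rank: int):
    """CMC curve (fraction of queries whose correct match appears at or
    below each rank) and the mean rank of Eq. (4); ranks are 1-indexed."""
    ranks = np.asarray(correct_ranks)
    cmc = np.array([(ranks <= r).mean() for r in range(1, max_rank + 1)])
    return cmc, ranks.mean()
```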

4.2 Retrieval Methods Comparison Using Different Datasets

We tested the different cross-media retrieval methods on the Flickr30k, Wikipedia, and FB5K datasets. Figure 3 shows the effectiveness of the different retrieval methods, from which we drew several conclusions.

Fig. 3. CMC curves compared for different Corr-AE and subspace learning methods using different cross-media datasets: (a) Wikipedia, (b) Flickr30k, and (c) FB5K.

The Corr-AE methods performed better than the subspace learning methods on all three datasets, although CCA showed a significant improvement in performance as the number of training samples increased on FB5K. The reason is that subspace learning ignores the correlation between modalities when representation learning is performed, whereas the Corr-AE methods merge representation learning and correlation learning into a single process. Furthermore, Corr-AE trains a model by minimizing a linear combination of the representation learning errors for the individual modalities and the correlation learning error between the hidden representations of the two modalities [16]. Minimizing the correlation learning error helps the model learn correlated hidden representations, while minimizing the representation learning error makes the hidden representations better at reconstructing the input of each modality.

Retrieval performance was highest on FB5K and lowest on Wikipedia, showing that tweets in FB5K are more highly correlated with their images than those in Flickr30k and Wikipedia. The reason FB5K obtained the highest retrieval accuracy is twofold: first, it contains high-level concepts with small semantic gaps; second, its texts and images are highly correlated. We conclude that users’ descriptions in tweets are highly correlated with the depicted scenarios.

Fig. 4. CMC curves on FB5K with the proposed method and baselines: (a) Im2txt retrieval, (b) Txt2im retrieval.

4.3 Proposed Method Performance

In this section, we evaluate the proposed method on FB5K and compare its performance with the baseline methods.

Figure 4 clearly shows that applying the proposed method to the baseline learning systems significantly improved their performance. In particular, using OCR, explicit incorporation of high-level semantic information, and the specially developed similarity measurement in the embedded space improved cross-media retrieval accuracy when the same retrieval methods were used. For example, in the case of Txt2im retrieval, CCA achieved \(45\%\) accuracy at rank = 110, whereas the BLM, PLS, and GMMFA methods achieved the same accuracy at ranks 20, 25, and 18, respectively. Incorporating the proposed method boosted the accuracy of CCA, BLM, PLS, and GMMFA by \(6.5\%\), \(4\%\), \(5\%\), and \(7\%\), respectively, at the same ranks.

Fig. 5. Retrieval examples for FB5K using CCA and the proposed method. The first two rows represent the query tag and its corresponding top five retrieved images, whereas the last two rows show query images and their corresponding top five retrieved tags: (a) tag/txt2img retrieval, (b) img2tag/txt retrieval.

4.4 FB5K Retrieved Examples

This section presents retrieval examples for FB5K using CCA and the proposed method.

Figure 5(a) shows the image retrieval results for different query tags. The proposed method successfully learned color, background, and class information; e.g. in Fig. 5(a) we used the keyword cold to retrieve the images on the right, which strongly suggests that the keyword information is present in the retrieved images. Moreover, incorporating the semantic class not only improved retrieval accuracy, but also assigned higher weights to minor concepts during the formation of the query tag vectors.

Figure 5(b) shows the tags retrieved by the proposed method for some test images. Despite the diverse features of FB5K, the proposed method significantly outperformed the baseline methods.

Furthermore, FB5K provides information that is more realistic for the user, incorporating high-level semantic information by providing a class probability for each image. For example, in Fig. 5(b), for a query image of a baby, the proposed method retrieved happy and love as high-frequency words in the retrieved text. This shows that, even though the sentiment of an image is hidden under high-level concepts, opinion characteristics can still influence multi-modal retrieval.

5 Conclusion

This paper introduced a novel cross-media dataset called FB5K. We also presented a more realistic embedding approach for images, tags/texts, and their semantics. Specifically, to learn the cross-modal embeddings of user feelings, images, and tags/texts, we developed a novel method utilizing OCR, explicit incorporation of high-level semantic information, and a new similarity measurement in the embedded space to improve retrieval performance.

We believe that FB5K and the proposed cross-media retrieval method can serve as a reference for researchers and developers, facilitating the design and implementation of better evaluation protocols.