1 Introduction

Cross-modal multimedia retrieval is especially needed in the era of Web 2.0, due to the explosive growth of multimedia contributions on social networks and media-sharing websites. It is central to many real-world applications, e.g., finding a set of images that visually best illustrate a given text description, or finding a set of sentences that textually best describe a given image. A large volume of multimedia is generated by users with informal content structures and various media types. To retrieve among heterogeneous instances, the key problem is how to measure distances or similarities between them in a cross-modal manner. Many previous works [1, 2, 7, 8, 13, 25, 27, 32, 43] aim to align two feature spaces by learning a common latent space in which they become reasonably comparable. Among these approaches, Canonical Correlation Analysis (CCA) [9] stands out for its simplicity and efficiency in learning the common subspace by maximizing the correlation between the linear projections of the two modalities.

While CCA has been popular for these advantages, it also suffers from several drawbacks. CCA relies on explicit pairings between two modalities to establish correspondences, and in this procedure the multi-label information remains unutilized. However, semantic concepts usually co-occur in the real world instead of occurring in isolation. For example, Fig. 1 shows the automatic tagging of an image of a horse with multiple labels through the service provided by www.imagga.com. As the highlighted tags in boxes show, which are the more representative ones for this exemplar image, multiple labels help to interpret the content of an image, such as a horse in the foreground and grass or a farm in the background. Besides the concepts used to label the content of an image, the correlations between concept pairs form another part of visual semantics. Similar to the hierarchy of concepts organized in the WordNet [22] lexicon, images are also pre-organized into class hierarchies in ImageNet [4], where an image labeled with a child-node class can also be categorized into its parent class. This “is-a” relationship is illustrated in Fig. 1, where both “horse” and “mammal” are children of the concept “animal”. The “is-part-of” relationship is also shown in Fig. 1 to reflect the inherent correlation between the concepts “grass” and “farm”.

Fig. 1

Automatic tagging of an image of a horse with multiple labels through the service provided by www.imagga.com. Representative labels are highlighted in red and blue boxes, which are linked with the concept relationships of “is-a” and “is-part-of”, respectively

While it is widely accepted that incorporating the multi-label relationships discussed above can help computers understand the semantics of multimedia, it remains challenging to quantify these correlations for multimedia retrieval tasks. To this end, concept correlations have been exploited either statistically [15,16,17, 20, 28, 34] from annotation sets or semantically [18, 36, 41] from knowledge bases, aiming at improving multi-label tagging performance for image retrieval. High-level semantics is also utilized in [29] for location visualization. Instead of relying heavily on annotation sets or pre-constructed knowledge bases, [37, 39] proposed a training-free method which utilizes the concept correlations reflected by underlying co-occurrence and re-occurrence patterns to improve image multi-labeling performance. This method avoids the difficulty of explicitly quantifying concept correlations through a concept graph, which are usually highly non-linear, by resorting to global and local pattern analysis [37, 39]. Moreover, to retrieve images accurately, [47,48,49] proposed novel ranking models in which visual features and click features are utilized simultaneously; for example, Yu et al. [49] proposed the Deep-MDML method, which adopts a structured ranking model to exploit both visual and click features in distance metric learning. A similar problem arises when applying CCA to cross-modal retrieval: the incapacity of CCA to measure complex non-linear correlations between modalities limits its performance. Since many semantic correlations cannot be represented in linear form, the constructed subspaces may be less discriminative and thus less suited for cross-modal retrieval tasks involving multiple labels. Though some extensions of CCA [7, 27] have been proposed to utilize label information, most of them tackle the single-label problem, i.e., they assume each data sample is annotated with only one label. In practice an image commonly contains multiple concepts and hence should carry multi-label annotations. It is therefore imperative to take multiple labels into account in order to precisely model the correlations between modalities. Based on this rationale, Ranjan et al. proposed ml-CCA [25], an extension of CCA which outperforms most other CCA extensions by incorporating high-level semantics in the form of multi-label annotations. Because ml-CCA still establishes modality correspondences linearly, it does not perform consistently well on more complex correlations that are difficult to model linearly.

To tackle the challenge of learning the complex correlations embedded in multi-label semantics, we introduce a novel multi-label Kernel Canonical Correlation Analysis (ml-KCCA) method for effective cross-modal retrieval with multi-label properties. By introducing a semantic similarity matrix and embedding it into KCCA, the proposed method exploits semantic information to learn a more discriminative common subspace for the different modalities. Moreover, the algorithm structure is compatible with different kinds of multi-label semantics, as long as they can be quantified and represented as matrices. The contributions of this paper can be summarized as follows:

  • A novel cross-modal retrieval method named ml-KCCA, which accounts for multi-label information and utilizes kernel functions to mine non-linear correlations between data from different modalities.

  • A kernelized CCA with multi-label embedding is formalized in order to provide a non-linear solution for multi-label semantic correspondence estimation.

  • Extensive empirical evaluation on public datasets validates our approach and shows improvement over other extensions of CCA and other state-of-the-art cross-modal retrieval approaches.

The paper is organized as follows: Section 2 discusses the most related work in the areas of cross-modal information retrieval and multi-label approaches. Section 3 presents an overview of the proposed method for multi-label settings in cross-modal retrieval tasks. Section 4 presents the mathematical formulation and solution of the proposed ml-KCCA framework. Section 5 reports an extensive experimental evaluation on benchmark multi-label datasets to verify the efficacy of the proposed approach for cross-modal retrieval tasks. Finally, Section 6 concludes the paper and outlines future work.

2 Related work

Cross-modal information retrieval is a challenging research topic due to the so-called semantic gap: queries and their corresponding results may involve different media modalities, in which case the two counterparts cannot be compared directly. To tackle this challenge, a great number of approaches have been proposed for cross-modal retrieval in the past few years, among which an effective strategy is to learn an optimal common representation of the different modalities. Methods of this kind project different modalities into a common space in which the distance between similar semantics is minimized and the distance between dissimilar semantics is maximized. In building semantic correlations among multimodal instances, Canonical Correlation Analysis (CCA) is one of the workhorses for cross-modal retrieval tasks due to its simplicity and efficiency. CCA models the linear relationships between two multidimensional variables: it makes use of two views of the same semantic object to extract a representation of the semantics, and has become one of the most popular unsupervised cross-modal subspace learning methods due to its generalization capability.

Various extensions of CCA have been proposed in recent years to emphasize different challenging aspects of cross-modal retrieval [1, 2, 7, 8, 13, 25, 27]. First proposed by Hotelling [9], CCA is a data analysis method used to discover a shared subspace of multiple data spaces; it seeks optimal basis vectors for two sets of variables to model their multi-modal correlation. More than one canonical correlation can be found, each corresponding to a different pair of basis vectors. PLS [10] aims to find a linear regression model by projecting the predicted variables and the observable variables into a new space, and is equivalent to CCA in many situations [8]. CCA can also be used as a complementary preprocessing step for other learning tasks. For example, based on the subspaces learned by CCA, Rasiwasia et al. [26] proposed learning cross-modal topic classifiers to measure the semantic divergence of Web data. Wu et al. [42] constructed a semantic distance measurement model, and Gong et al. [6] developed a binary-code learning approach which leverages label information with CCA. More recently, Yao et al. [45] explored relative relationships by first finding a latent space via CCA and then re-adjusting the space to incorporate ranking preferences from click-through data. The heterogeneous discriminative analysis of canonical correlation (HDCC) [40] utilizes discriminative information from the source domain as well as topology information from the target domain to learn two different projection matrices, discovering a common feature subspace in which heterogeneous features can be compared.

However, classic CCA ignores additional high-level semantic information, which significantly limits its performance in real-world multimodal retrieval tasks. To alleviate this, Rasiwasia et al. proposed cluster-CCA [27] to incorporate high-level features represented by single labels. Though demonstrated to be effective on single-label datasets, where instances can be separated into distinct clusters, the disadvantage of cluster-CCA is obvious in multi-label scenarios because multi-label datasets admit no natural separation into distinct clusters. To adapt CCA to multi-label settings, 3-view CCA was introduced in [7]; in this CCA variant, multi-label vectors are used as representations of high-level semantics. However, 3-view CCA depends heavily on a priori correspondence information across modalities, and hence cannot be applied directly to datasets where such correspondence is unavailable. Another typical extension of CCA to multi-label information is ml-CCA, proposed by Ranjan et al. in [25]. Ml-CCA utilizes multi-label information while learning a common semantic space for the two modalities, and can learn a discriminative semantic space which is more suitable for cross-modal tasks. Unlike CCA, ml-CCA does not rely on explicit pairings between modalities; instead it uses the multi-label information to establish correspondences, which results in a more discriminative subspace better suited for cross-modal retrieval tasks. By taking multi-label information into account, ml-CCA has shown its merit and outperforms most other extensions of CCA. However, ml-CCA fails to exploit non-linear inter-modal relationships, which limits its performance in multi-label cross-modal tasks where the modality correspondence is usually complex and cannot be precisely modeled by linear projections.

One stream of research related to the multi-label semantics investigated in this paper is multi-label multimedia indexing, where multi-label training and indexing refinement are the two main approaches to utilizing multi-label information. A typical multi-label training method is presented in [24], in which concept correlations are modeled in the classification model using Gibbs random fields; similar multi-label training methods can be found in [44]. Since all concepts are learned from one integrated model, one shortcoming is the lack of flexibility: the learning stage must be repeated whenever the concept lexicon changes. As an alternative, index refinement methods post-process the detection scores obtained from individual detectors, allowing independent and specialized classification techniques to be leveraged for each concept. Context-Based Concept Fusion (CBCF) is an approach that refines the detection results of independent concepts by modeling the relationships between them [15]. Concept correlations are either learned from annotation sets [15,16,17, 20, 34] or inferred from pre-constructed knowledge bases [18, 36, 41] such as WordNet. However, annotation sets are almost always inadequate for learning correlations due to their limited sizes and because the annotation was done with independent concepts rather than correlations in mind. The use of external knowledge networks also limits the flexibility of CBCF, since they rely on a static lexicon which is costly to create. In [39], a training-free method was proposed to utilize concept correlations through global and local refinement; when a pre-constructed ontology is incorporated into the optimization procedure, the method can better adapt to this knowledge constraint. Similarly, [38] dealt with the multi-label indexing problem through a tensor-factorization method that takes temporal semantics into account.

In bi-directional image and sentence retrieval, Hodosh et al. [13] employed Kernel CCA (KCCA) to discover a shared feature space for the two modalities of images and sentences; KCCA is a powerful approach for extracting non-linear features in machine learning. KCCA increases the flexibility of feature selection by mapping the hypotheses to a higher-dimensional feature space, and has been applied with improved results in earlier work by Lai and Fyfe [21] and Vinokourov et al. [33]. [8] also uses KCCA to model the correlation between web images and their corresponding text captions. More recently, Yoshida et al. [46] proposed a two-stage kernel CCA method to select appropriate kernels in the framework of multiple kernel learning. Though KCCA can exploit highly non-linear inter-modal relations, it does not utilize multi-label semantics, and how kernel methods can be employed for multi-label cross-modal retrieval using CCA remains unaddressed. Hwang et al. [11] introduced a KCCA-based image retrieval method that leverages the implicit information about object importance conveyed by the list of keyword labels. However, this type of label is difficult to obtain. A novel method is therefore needed to utilize label information more naturally and conveniently.

3 Method overview and notations

In this section, we present an overview of the proposed multi-label kernel Canonical Correlation Analysis (ml-KCCA) for multi-label settings in cross-modal retrieval tasks. In developing ml-KCCA, we rely on CCA as the fundamental approach, given its efficiency in learning a common subspace for different modalities. To simplify the notation and the model description, we restrict the discussion to multi-label entities containing image and text; our method applies just as easily to any other combination of content modalities. Before the detailed description of the proposed ml-KCCA framework, a brief review of CCA is first presented for completeness.

3.1 Brief on CCA

Provided with different views of data, represented by two multidimensional variables, Canonical Correlation Analysis (CCA) constructs a common representation by analyzing the linear relationships between them. CCA uses data consisting of paired views to simultaneously find projections from each feature space such that the correlation between projected features originating from the same instance is maximized [9]. Formally, given a set of N paired data samples {(t1, p1),...,(tN, pN)}, where ti ∈ Rdt and pi ∈ Rdp denote (normalized) textual and visual modal data respectively, the key is to seek two projection vectors u and v that maximize the canonical correlation:

$$\rho = \underset{u,v}{\max} \frac{u^{T} C_{tp} v}{\sqrt{u^{T} C_{tt} u}\, \sqrt{v^{T} C_{pp} v}} $$
(1)

where \(C_{tp} = \frac{1}{N}\sum_{i = 1}^{N} t_{i} p_{i}^{T}\) denotes the between-sets covariance matrix, and \(C_{tt} = \frac{1}{N}\sum_{i = 1}^{N} t_{i} t_{i}^{T}\) and \(C_{pp} = \frac{1}{N}\sum_{i = 1}^{N} p_{i} p_{i}^{T}\) denote the auto-covariance matrices of the textual and visual data, respectively. The solution of (1) can be found via a generalized eigenvalue problem. As this formulation shows, CCA cannot mine the non-linear correlations between different modalities because it is a linear method. Moreover, CCA cannot utilize high-level semantic information, which further limits its performance. These disadvantages often make the common subspace learned by CCA insufficiently discriminative for cross-modal retrieval tasks.
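To make this concrete, the following is a minimal sketch (not the paper's implementation) of solving (1) as a generalized eigenvalue problem with NumPy/SciPy; the function name, the small ridge term reg added for numerical stability, and the assumption of centered inputs are ours:

```python
import numpy as np
from scipy.linalg import eigh

def linear_cca(T, P, d=20, reg=1e-4):
    """Sketch of CCA per Eq. (1). T (N x dt) and P (N x dp) hold paired,
    centered views row-wise; returns top-d projection matrices U, V."""
    N = T.shape[0]
    Ctt = T.T @ T / N + reg * np.eye(T.shape[1])  # auto-covariance (regularized)
    Cpp = P.T @ P / N + reg * np.eye(P.shape[1])
    Ctp = T.T @ P / N                             # between-sets covariance
    # Eliminating v reduces Eq. (1) to: Ctp Cpp^{-1} Ctp^T u = rho^2 Ctt u.
    M = Ctp @ np.linalg.solve(Cpp, Ctp.T)
    rho2, U = eigh(M, Ctt)                        # generalized eigenproblem, ascending
    U = U[:, ::-1][:, :d]                         # top-d directions for the text view
    V = np.linalg.solve(Cpp, Ctp.T @ U)           # paired directions for the image view
    V /= np.linalg.norm(V, axis=0, keepdims=True)
    return U, V
```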

3.2 Kernelizing CCA with multiple labels

In this section, we introduce multi-label kernel Canonical Correlation Analysis (ml-KCCA) to handle cross-modal retrieval tasks in multi-label scenarios. By optimizing kernel matrices with this approach, the similarity between the multi-label vectors of paired data can be exploited to learn a common subspace for the different modalities that is more discriminative and hence better suited for cross-modal retrieval.

Figure 2 illustrates the schematic diagram of ml-KCCA, in which triangles and squares denote images and texts. Different labels are represented by +, −, × and ÷ in Fig. 2. As shown in Fig. 2a, the semantic similarity matrix obtained from the multi-label representation is employed to obtain the new kernel matrices \(K_{t}^{*}\) and \(K_{p}^{*}\) for texts and images respectively. After solving the kernelized version of (1), a new feature space is constructed in Fig. 2b using the optimized projection vectors α and β, which have the same interpretation as u and v in (1). As shown in Fig. 2b, paired multi-modal instances with similar labels are semantically closer and thus have smaller coordinate distances in this new projected common space. After both texts and images are mapped to this common space using ml-KCCA, bi-directional cross-modal retrieval can be performed effectively, e.g., retrieving images in response to a text query and vice versa.

Fig. 2

Schematic diagram of multi-label kernel Canonical Correlation Analysis (ml-KCCA). Triangles and squares denote datapoints in the visual and textual modalities respectively, and ‘+’, ‘−’, ‘×’, ‘÷’ denote different class labels. a Mapping of text and image instances from their respective feature spaces to a common subspace learned using ml-KCCA. b Paired instances with similar labels are closer in the common subspace learned by ml-KCCA. c An example of bi-directional cross-modal retrieval: after both texts and images are mapped to the learned subspace, images can be retrieved more accurately in response to a text query, and vice versa

4 Multi-label cross-modal retrieval with ml-KCCA

4.1 Embedding multi-label semantics with ml-KCCA

To formalize ml-KCCA, we denote N samples of paired images and texts with multi-label information as {(t1, p1, z1),...,(ti, pi, zi),...,(tN, pN, zN)}, where zi is the label vector of the i-th sample pair. T = [t1, t2,...,tN] ∈ Rdt×N is the matrix representation of the textual samples, where dt is the dimension of the textual feature space. Similarly, P = [p1, p2,...,pN] ∈ Rdp×N is the matrix representation of the visual images, where dp denotes the dimension of the visual feature space. Z = [z1, z2,...,zN] ∈ RC×N is the label matrix whose columns are the label vectors, and C equals the number of labels of interest. While only a single element of zi is nonzero in the single-label problem, elements of zi corresponding to multiple labels can be nonzero simultaneously in ml-KCCA. This introduces more complex semantic correlations, which cannot be handled well by linear CCA.

Let f(⋅) be a function that gives the similarity between any two label vectors. The semantic similarity matrix S can be calculated as S = (f(zi, zj))N×N, 1 ≤ i, j ≤ N. Given kernel functions for both feature spaces (we use polynomial, linear and Gaussian kernels in this work), kt(ti, tj) = ϕt(ti)Tϕt(tj) and kp(pi, pj) = ϕp(pi)Tϕp(pj), the original kernel matrices are Kt = (kt(ti, tj))N×N and Kp = (kp(pi, pj))N×N, 1 ≤ i, j ≤ N. We further define \(K_{t}^{*} = \eta S \cdot K_{t}\) and \(K_{p}^{*} = \eta S \cdot K_{p}\) as the N × N multi-label kernel matrices with multi-label embedding over the N sample pairs, where ‘⋅’ denotes the dot product and η controls the influence of the semantic similarity matrix. By denoting \(K_{tp}^{*}=K_{t}^{*}K_{p}^{*}\), the objective in the form of KCCA [13] can be extended using the above multi-label kernels in order to identify α, β ∈ RN that maximize the canonical correlation:

$$\rho^{*} = \underset{\alpha,\beta}{\max} \frac{\alpha^{T} K_{tp}^{*} \beta}{\sqrt{\alpha^{T} K_{t}^{*2} \alpha \; \beta^{T} K_{p}^{*2} \beta}} $$
(2)
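As an illustration of this construction, the sketch below forms S from the dot-product label similarity and builds the multi-label kernels. We read the product ‘⋅’ as an element-wise reweighting of the base kernels (one plausible interpretation, which also keeps the kernels symmetric); the function name and defaults are our own assumptions:

```python
import numpy as np

def multilabel_kernels(Kt, Kp, Z, eta=2.0):
    """Sketch of the multi-label kernel embedding of Section 4.1.
    Kt, Kp: N x N base kernel matrices; Z: C x N multi-label matrix."""
    Zn = Z / (np.linalg.norm(Z, axis=0, keepdims=True) + 1e-12)
    S = Zn.T @ Zn               # N x N dot-product similarity, Eq. (7)
    Kt_star = eta * S * Kt      # element-wise reweighting by label similarity
    Kp_star = eta * S * Kp
    return Kt_star, Kp_star
```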

4.2 Constructing common subspace

Similar to the CCA problem defined by (1), problem (2) can also be reduced to an eigenvalue problem (see Hardoon et al. [8] for details of the solution). The computational cost of our method is therefore similar to that of KCCA, because the main cost lies in the eigenvalue problem. α and β can be obtained in a similar manner as in kernel CCA:

$$B^{-1} A w = \lambda w, $$
(3)

where,

$$A = \begin{bmatrix} 0 & K_{t}^{*} K_{p}^{*} \\ K_{p}^{*} K_{t}^{*} & 0 \end{bmatrix}, \quad B = \begin{bmatrix} K_{t}^{*} K_{t}^{*} & 0 \\ 0 & K_{p}^{*} K_{p}^{*} \end{bmatrix}, \quad w = \begin{bmatrix} \alpha & \beta \end{bmatrix}^{T}.$$

As indicated by (3), once \(K_{t}^{*}\) and \(K_{p}^{*}\) are computed, the top D eigenvectors yield a series of bases (α1, β1),...,(αD, βD) with which to compute the D-dimensional projection of an arbitrary modal input t or p. For example, an unseen textual input tx can be projected onto a single coordinate of the common space specified by α by evaluating the weighted kernel function between tx and the N training points:

$$\sum\limits_{i = 1}^{N} \alpha_{i}\, \phi_{t}(t_{i})^{T} \phi_{t}(t_{x}) = \sum\limits_{i = 1}^{N} \alpha_{i}\, k_{t}(t_{i}, t_{x}) $$
(4)

Then, the final projection of tx onto the D-dimensional common subspace is formed as:

$$\left[ \; \sum\limits_{i = 1}^{N} \alpha_{i}^{1} k_{t}(t_{i}, t_{x}) \quad \cdots \quad \sum\limits_{i = 1}^{N} \alpha_{i}^{D} k_{t}(t_{i}, t_{x}) \; \right] $$
(5)

Similarly, an unseen image input px can be represented in the common subspace as:

$$\left[ \; \sum\limits_{i = 1}^{N} \beta_{i}^{1} k_{p}(p_{i}, p_{x}) \quad \cdots \quad \sum\limits_{i = 1}^{N} \beta_{i}^{D} k_{p}(p_{i}, p_{x}) \; \right] $$
(6)

After projecting all image and text instances into this learned common subspace, various tasks such as image annotation and image search can be performed on this semantic representation. Because the data points in this D-dimensional common subspace are more semantically correlated, vector distance can be used to precisely measure the distance between instances of different modalities.
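The following sketch pulls Section 4.2 together: it solves (3) as a generalized eigenproblem and projects unseen inputs via (5) and (6). The small ridge term added to B for numerical stability and all names are our assumptions, not the paper's code:

```python
import numpy as np
from scipy.linalg import eigh

def solve_ml_kcca(Kt_star, Kp_star, D=20, reg=1e-3):
    """Sketch of Eq. (3): stack both views and solve A w = lambda B w.
    Returns alphas and betas of shape (N, D)."""
    N = Kt_star.shape[0]
    O = np.zeros((N, N))
    A = np.block([[O, Kt_star @ Kp_star], [Kp_star @ Kt_star, O]])
    B = np.block([[Kt_star @ Kt_star, O], [O, Kp_star @ Kp_star]])
    B += reg * np.eye(2 * N)              # ridge keeps B positive definite
    vals, W = eigh(A, B)                  # generalized eigenproblem, ascending
    W = W[:, ::-1][:, :D]                 # keep the top-D eigenvectors
    return W[:N], W[N:]                   # alphas (N x D), betas (N x D)

def project(weights, k_train_query):
    """Eqs. (5)-(6): k_train_query[i] = k(x_i, x_query) against the N
    training points; returns the D-dimensional coordinates of the query."""
    return k_train_query @ weights

# After projection, cross-modal retrieval reduces to nearest-neighbor
# search between text coordinates and image coordinates.
```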

4.3 Similarity function

The role of f(⋅), as introduced in Section 4.1, is to measure the semantic relationship between two multi-label vectors. While different forms of similarity can be employed as f(⋅), we adopt the following similarity functions investigated in [25], since they have been demonstrated to assign higher values to label pairs (zi, zj) whose labels are more similar:

Dot-product based similarity: :
$$f(z_{i}, z_{j}) = \frac{\left\langle z_{i}, z_{j} \right\rangle}{\|z_{i}\|\,\|z_{j}\|} $$
(7)
Squared exponential distance based similarity: :
$$f(z_{i}, z_{j}) = e^{-\|z_{i} - z_{j}\|_{2}^{2} / \sigma}, $$
(8)

where σ is a constant factor for scaling the sample-wise distance.
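A direct transcription of (7) and (8) follows; the small epsilon guard against zero-norm label vectors is our addition:

```python
import numpy as np

def sim_dot(zi, zj):
    """Eq. (7): normalized dot-product (cosine) similarity of label vectors."""
    return (zi @ zj) / (np.linalg.norm(zi) * np.linalg.norm(zj) + 1e-12)

def sim_sqexp(zi, zj, sigma=2.0):
    """Eq. (8): squared exponential distance based similarity."""
    return np.exp(-np.sum((zi - zj) ** 2) / sigma)
```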

5 Experiments

In this section, the proposed ml-KCCA method is applied to three public datasets, and experimental results are reported on the two tasks of image annotation (image queries text) and image retrieval (text queries image). While performing image annotation and image retrieval, the query and the test points are projected to the common subspace using (5) or (6), and the retrieval performance is measured by comparing the label vector of the query with the label vectors of the retrieved test points. Gaussian kernels are used for all component features. We empirically fix the number of selected top eigenvectors returned by (3) as D = 20 for all experiments reported in the following sections, since the overall performance proved insensitive to the dimensionality of the common subspace constructed by ml-KCCA. To tackle the computational cost that arises on large datasets, we apply incomplete Cholesky decomposition to accelerate solving the eigenvalue problem of ml-KCCA.
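For reference, one common pivoted variant of incomplete Cholesky decomposition is sketched below; it replaces each N × N kernel matrix with a low-rank factor so the eigenproblem can be solved in the reduced rank. This is a generic routine under our own naming and tolerance choices, not necessarily the exact implementation used in the experiments:

```python
import numpy as np

def incomplete_cholesky(K, tol=1e-6, max_rank=None):
    """Pivoted incomplete Cholesky: approximates a PSD kernel matrix K
    (N x N) as G @ G.T with G of shape (N, m), m << N."""
    n = K.shape[0]
    m = max_rank or n
    G = np.zeros((n, m))
    d = np.diag(K).astype(float).copy()     # residual diagonal
    for j in range(m):
        i = int(np.argmax(d))               # pivot on the largest residual
        if d[i] < tol:                      # residual negligible: stop early
            return G[:, :j]
        G[:, j] = (K[:, i] - G[:, :j] @ G[i, :j]) / np.sqrt(d[i])
        d -= G[:, j] ** 2
    return G
```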

5.1 Experiment setup

5.1.1 Datasets

Three datasets, NUS-WIDE [3], PASCAL VOC 2007 [5] and LabelMe [12], are employed to evaluate the proposed method; all three contain the two modalities of images and texts annotated with multi-label information. The details of the three datasets can be summarized as follows:

NUS-WIDE consists of 269,648 Flickr documents, from which we randomly select 20K for training and 20K for testing. Each document consists of an image and its corresponding textual tags, which are drawn from a vocabulary of 81 semantic concepts. In this experiment, we employ the widely-used bag of visual words (BoVW) as the visual feature and 1,000-dimensional bag-of-words (BoW) tag features as the textual feature.

Pascal VOC 2007 consists of 5,011 training and 4,952 testing images, and this split is used directly in our experiments. As with NUS-WIDE, we use the publicly available BoVW, together with gist [23] and color histogram features, as visual features. Convolutional features extracted by the VGG 16-layer model [31] are also used in the experiments of Section 5.2.3. For the textual feature, we use the 399-dimensional absolute tag rank features provided by [11]. The groundtruth annotations of the images in this dataset are used as the multi-label information.

LabelMe consists of a total of 3,825 images; we randomly select 3,000 samples for training and the remaining 825 samples for testing. We use the publicly available bag of visual words, gist and color histogram features for image representation. For text representation, we use the 209-dimensional absolute tag rank features provided by [11]. For the multi-label information, we use the groundtruth annotations of the images.

5.1.2 Evaluation metrics

To evaluate the proposed retrieval method, the following evaluation metrics are adopted:

  • Precision@K: Precision@K (P@K) measures the precision over the top-K results of the retrieved list.

  • NDCG@K: Performances are also evaluated using normalized discounted cumulative gain at top-K (NDCG@K) [14], a measure commonly used in information retrieval. It gives graded relevance to retrieved results instead of binary relevance and more strongly emphasizes the accuracy of the higher ranked items. The score ranges from 0 to 1 and 1 indicates perfect agreement. NDCG@K can be calculated as:

    $$\text{NDCG@K} = \frac{\sum\limits_{i = 1}^{K} \left(2^{rel_{i}} - 1\right) / \log_{2}(i + 1)}{\sum\limits_{j = 1}^{K} \left(2^{rel_{j}} - 1\right) / \log_{2}(j + 1)} $$
    (9)

    where reli denotes the graded relevance of the i-th document in the returned ranking, while the denominator is computed over the ideal ranking given by the groundtruth, so that a perfect ranking scores 1.

  • MAP: As a widely adopted metric for evaluating retrieval performance, the mean average precision (MAP) criterion is also used in our experiments, formalized as

    $$\text{MAP} = \frac{1}{Q} \sum\limits_{q = 1}^{Q} \left( \frac{1}{R} \sum\limits_{r = 1}^{R} \frac{r}{position(r)} \right) $$
    (10)

    where Q denotes the number of queries, R indicates the number of relevant documents in the result list returned for query q, and position(r) indicates the rank of the r-th relevant document in the result list. Reference implementations of NDCG@K and MAP are sketched after this list.
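The following is a minimal sketch of the two rank-based metrics, (9) and (10), under our own naming; the ideal ranking for NDCG is formed by sorting the groundtruth relevances in descending order:

```python
import numpy as np

def dcg_at_k(rel, k):
    """Discounted cumulative gain over the first k graded relevances."""
    rel = np.asarray(rel, dtype=float)[:k]
    return np.sum((2.0 ** rel - 1.0) / np.log2(np.arange(2, rel.size + 2)))

def ndcg_at_k(ranked_rel, ideal_rel, k):
    """Eq. (9): ranked_rel lists relevances in retrieved order; ideal_rel
    holds the groundtruth relevances used to form the ideal ranking."""
    idcg = dcg_at_k(np.sort(ideal_rel)[::-1], k)
    return dcg_at_k(ranked_rel, k) / idcg if idcg > 0 else 0.0

def mean_average_precision(relevance_lists):
    """Eq. (10): each entry is a binary relevance list for one query,
    in ranked order; position(r) is the rank of the r-th relevant item."""
    aps = []
    for rel in relevance_lists:
        positions = np.flatnonzero(np.asarray(rel, dtype=bool)) + 1
        if positions.size == 0:
            aps.append(0.0)
            continue
        r = np.arange(1, positions.size + 1)   # index of the r-th relevant doc
        aps.append(np.mean(r / positions))
    return float(np.mean(aps))
```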

5.2 Results and discussion

5.2.1 Effects of multi-label semantics

As formalized in Section 4, the parameters η and σ control, respectively, how strongly multiple labels affect the kernel matrices and the scaling of the squared-exponential-distance-based semantic similarity. To evaluate the effects of multi-label semantics in ml-KCCA, we tune the parameters η and σ of ml-KCCA on the NUS-WIDE training set described in Section 5.1.1. Dot-product based similarity and Gaussian kernels are adopted when studying the influence of η, in order to eliminate the effects of σ. After stable performance is achieved, we fix η and evaluate the influence of σ on ml-KCCA using the squared exponential distance based similarity function introduced in Section 4.3.

We use KCCA as the baseline, configured identically to our method except that ml-KCCA additionally utilizes the semantic similarity matrix derived from the multi-label information; experimental results are shown in Fig. 3. As Fig. 3a shows, the proposed ml-KCCA method performs much better than KCCA [13] for most settings of η in bi-directional information retrieval. This robustness over η values indicates that utilizing multi-label information with ml-KCCA is valuable for finding a more discriminative common subspace for cross-modal tasks. Only at the two extremes of very small and very large η values does ml-KCCA perform less satisfactorily on the image annotation task. This makes sense because a small η value imposes little effect of the multi-label semantics, while a large value forces the multi-label semantics to dominate the correlation learning.

Fig. 3

Experimental results. a The influence of η with the dot-product based similarity function (horizontal axis: η). b The influence of σ with the squared exponential distance based similarity function and η = 2.0 (horizontal axis: σ). Precision@10 is used as the performance measure

The influence of σ is shown in Fig. 3b, which is generated by fixing η = 2.0. While the image retrieval curve of ml-KCCA is stable in Fig. 3b, the fluctuation of the image annotation curve shows that σ has more influence on image annotation than on image retrieval. By comparing Fig. 3a and b, we find that: 1) the squared exponential distance based similarity function performs better than the dot-product based similarity function, because the former takes a Gaussian form smoothed by the hyperparameter σ and is hence more effective in measuring the similarity of a multi-label pair (zi, zj); 2) ml-KCCA is more sensitive in image annotation than in image retrieval, while it performs better in image annotation than in image retrieval in most cases.

5.2.2 Effects of label quality

As introduced earlier, utilizing multi-label semantics can enhance the performance of cross-modal retrieval, which has been validated in the experiment discussed in Section 5.2.1. Because concept correlations are an important part of the semantics bridging the two modalities, the performance of ml-KCCA might degrade if the inherent label correlations are destroyed. In contrast to the experiment in Section 5.2.1, where the number of labels is fixed as originally annotated, in this section we evaluate the effects of multi-label correlations by controlling the quantity of labels. We first categorize the samples into different sets according to the quantity of labels, and then evaluate our method on each set in order to compare the performance of the proposed method under different label quantities.

To avoid the influence of the textual and visual features of different samples, we select 3,500 image-text pairs with 4 labels from the NUS-WIDE dataset and use this subset (3,000 pairs for training and 500 pairs for testing) to repeat the experiment of Section 5.2.1 four times, using randomly selected subsets of 1–3 labels and the full 4 labels of each sample, respectively. Experimental results are shown in Fig. 4. As in Section 5.2.1, dot-product based similarity and Gaussian kernels are adopted when studying the influence of η; we then fix η = 2.0 and evaluate the influence of σ using the squared exponential distance based similarity. From Fig. 4, we find that the quantity of sample labels has an obvious influence on the performance of ml-KCCA, indicating the importance of multi-label semantics in cross-modal retrieval. When the label quality is improved, i.e., more labels are correctly annotated, both image annotation and image retrieval are further enhanced. This also validates that the proposed ml-KCCA makes full use of such semantics and effectively embeds semantic correlations to enhance the final retrieval tasks.

Fig. 4

Effects of label quality. a The effects of label quality using the dot-product based similarity function (horizontal axis: η). b The effects of label quality using the squared exponential distance based similarity function with η = 2.0 (horizontal axis: σ). Precision@10 is used as the performance measure

5.2.3 Image annotation and image retrieval

In this section, a more comprehensive comparison is given to evaluate how well textual or visual information can be retrieved to describe the content of a given image or sentence. In the implementation, we fix η = 1.8 and use the squared exponential distance based similarity function with σ = 2 on the Pascal dataset. Tables 1 and 2 list the performance of ml-KCCA and other typical CCA extensions on the Pascal and LabelMe datasets, using BoVW, color histogram, and gist features, their combination, and Convolutional Neural Network (CNN) features as visual features. According to Tables 1 and 2, ml-KCCA clearly outperforms all the other approaches on both datasets across most features. This demonstrates the advantage of ml-KCCA in utilizing multi-label information and effectively exploiting non-linear inter-modal relations. By taking multi-label semantics into account, both ml-CCA and the proposed ml-KCCA outperform the other methods significantly. Moreover, the proposed ml-KCCA models the underlying non-linear correspondence in multiple labels more flexibly, and outperforms ml-CCA on most features (BoVW, color histogram, combination, and CNN) on both datasets. When using gist for image annotation and BoVW for image retrieval, the proposed ml-KCCA still obtains NDCG@30 performance comparable to ml-CCA.

Table 1 Performance of ml-KCCA and other CCA extensions on the Pascal dataset, measured by NDCG@30 for the cross-modal retrieval task
Table 2 Performance of ml-KCCA and other CCA extensions on the LabelMe dataset, measured by NDCG@30 for the cross-modal retrieval task

In Table 3, the proposed ml-KCCA is also compared to other state-of-the-art cross-modal retrieval baselines used in [25] on the Pascal dataset, using the publicly available image and text features provided by [11]. As in Table 1, by using multi-label information, ml-KCCA outperforms the other methods in MAP, including kernelized methods such as KGMMFA and KGMLDA [30]. Our method outperforms ml-CCA, previously the best-performing CCA extension, because as a kernelized method it can exploit more complex non-linear relations between the modalities.

Table 3 Comparison of ml-KCCA with various state-of-the-art methods using MAP

6 Conclusions and future work

In this work, we propose Multi-Label Kernel Canonical Correlation Analysis (ml-KCCA), a novel kernelized method for cross-modal retrieval which effectively utilizes multi-label information while learning the common subspace of multiple modalities. Experimental results on public datasets show that ml-KCCA achieves state-of-the-art performance in bi-directional retrieval. Though multi-label cross-modal retrieval is semantically enhanced by the proposed ml-KCCA method, there is still room to improve this work, which we regard as future work. For example, the proposed model could be improved by designing better similarity functions between multi-label vectors, taking pairwise label semantic correlations further into account. More extensive experiments on large-scale datasets such as ImageNet, taking multi-label annotations into consideration, form another direction for future work.