1 Introduction

With the rapid growth of multimedia data such as text, image, audio and video, how to perform cross-modal retrieval [32,33,34, 55] on large-scale multimedia data remains a challenging yet interesting topic, largely because it depends on feature representations across modalities [40, 45, 48, 49]. For efficiency, hashing methods have been broadly studied for large-scale cross-modal retrieval due to their low storage cost and fast search speed. They seek to map data points from the original feature space to a Hamming space of binary codes while preserving the neighborhood structure. Accordingly, hashing methods can be applied to many vision tasks such as classification [43, 54], person re-identification [13, 53, 54] and information retrieval [41, 42, 44, 47, 51, 56, 58]. Thus far, many hashing methods, including Locality Sensitive Hashing (LSH) [10], Spectral Hashing (SH) [50] and Discrete Hashing via Affine Transformation (ADGH) [12], have been developed and are mainly applied to large-scale image retrieval. However, they are designed only for single-modal retrieval and neglect the media gap across modalities, so they cannot be directly applied to cross-modal retrieval.

To this end, much progress has been made on hashing for cross-modal retrieval to measure the similarities of heterogeneous modalities. This is because multimedia data lie in different feature spaces and exhibit an inconsistent semantic gap, and such inconsistency across modalities should be relieved before cross-modal retrieval. In this regard, current hashing methods for cross-modal retrieval can be roughly grouped into supervised and unsupervised ones. Unsupervised cross-modal hashing methods, such as Cross-View Hashing (CVH) [19], Inter-Media Hashing (IMH) [36], Composite Correlation Quantization (CCQ) [20] and Collaborative Subspace Graph Hashing (CSGH) [58], work without supervised information such as semantic labels or pairwise information, because they encode cross-modal data into hashing codes by looking into the data itself. Compared with their unsupervised counterparts, cross-modal hashing methods with supervised information, such as Cross-Modal Similarity Sensitive Hashing (CMSSH) [6], Semantic Correlation Maximization (SCM) [57] and Quantized Correlation Hashing (QCH) [52], usually achieve better retrieval results. Our work is an alternative in this scope.

Most of the aforementioned cross-modal hashing methods map data across modalities into a common Hamming subspace where the relationship across modalities can be unveiled. One leading type of such methods captures the correlation between multiple modalities. These methods are mostly inspired by canonical correlation analysis (CCA) [16] and its variants, such as kernel canonical correlation analysis (KCCA) [1], cluster canonical correlation analysis (cluster-CCA) [31], generalized CCA (GCCA) [15], Deep Canonical Correlation Analysis (DCCA) [2] and Deep Generalized Canonical Correlation Analysis (DGCCA) [5]. Analogous to them, our work also uncovers the correlation across modalities within the framework of generalized CCA (GCCA) [15]. Compared with CCA, GCCA can be easily extended to more than two modalities in terms of model complexity. Moreover, it explicitly learns a common latent representation of all the modalities. Inspired by these two properties of GCCA, this paper proposes a label guided correlation hashing method for cross-modal retrieval based on GCCA, dubbed LGCH. Like GCCA, LGCH easily handles multi-media data, since it retains the two merits mentioned above while imposing extra constraints on the common latent representation. Different from GCCA, LGCH is a supervised cross-modal hashing method and takes GCCA as a special case. As illustrated in Fig. 1, without loss of generality, LGCH learns two individual projection matrices to respectively project text and image into a common latent subspace, and generates the binary codes and hashing function on the basis of the learned common latent representation. To make the common latent representation more discriminative, we introduce a label guided linear classifier. More importantly, we devise a novel strategy that leverages the semantic label to guide learning of both the binary codes and the hashing function simultaneously. As far as we know, the semantic label is usually utilized either to generate the binary codes or to learn the hashing function, while supervised joint learning of both aspects has rarely been explored. Thus, LGCH can generate discriminative binary codes and, meanwhile, effectively fuse the common latent representation and the label guided representation by introducing an adaptive parameter. The main contributions of this work are summarized as follows:

  1) LGCH can be easily extended to cross-modal retrieval involving more than two types of multimedia data, because the number of formulation terms grows linearly with the number of media types.

  2) To learn discriminative hashing codes, we use label information to guide the training process. On one hand, a classifier is learned over the common latent representation to make the learned common latent representation more discriminative. On the other hand, the hashing function is learned and the hashing codes are encoded with the label guided representation. More importantly, we introduce an adaptive parameter to fuse the discriminative common latent representation and the label guided representation, so that more discriminative hashing codes and the hashing function are learned simultaneously.

  3) As a byproduct, each sub-problem of our model has an elegant analytical solution.

  4) Experiments of cross-modal retrieval on the Wiki [30], NUS-WIDE [7] and Flickr25K [17] datasets show the effectiveness of LGCH compared to many well-established baselines.

Fig. 1 The framework of LGCH, which mainly contains two procedures: the common latent representation and label guided representation learning procedure, and the hashing function and Hamming subspace learning procedure

2 Related work

Cross-modal retrieval deals with the problem of searching multimedia data and has been a popular research topic in machine learning and computer vision. The key point of the cross-modal retrieval task is how to bridge the “media gap” among multimedia data, and much progress has been made to address this issue. Depending on whether deep structures are used, existing cross-modal retrieval methods can be roughly divided into traditional shallow structure based methods and deep neural network (DNN) based methods. For more details, we refer the readers to [28] for a comprehensive survey.

Traditional shallow structure based methods mostly adopt hand-crafted features, mainly including the Scale Invariant Feature Transform (SIFT) [21, 22], Histogram of Oriented Gradients (HOG) [9], Speeded-Up Robust Features (SURF) [3, 4], Local Binary Pattern (LBP) [26] and bag-of-words (BoW) based features [35]. One category of shallow structure based methods directly computes the similarities among different media types by analyzing the known data relationships without training an explicit common space. Since there is no common space in these methods, the cross-modal similarities cannot be measured directly by distance metrics or ordinary classifiers. One effective solution [39, 59] is to construct one or more graphs for multimedia data and use the edges of the graphs to represent the neighborhood structure among modalities; retrieval can then be performed based on similarity propagation [59], constraint fusion [39] and so on. Another solution is neighbor analysis [8, 23], which is usually also built on a graph construction framework. The difference between the two solutions is that graph-based methods focus on the process of graph construction, while neighbor analysis methods concentrate on measuring the cross-modal similarities using the neighbor relationships. Shallow structure based methods are easy to understand and implement.

Another category projects different media features into a common space and then computes the similarities among different media types in this common space. The common space can be a continuous feature embedding space [25, 29, 38] or a binary Hamming space [6, 19, 20, 36, 52]; methods of the latter kind are also dubbed hashing based methods. CCA based methods are among the most representative common space methods, and many CCA based studies have been conducted for cross-modal retrieval, such as three-view canonical correlation analysis (CCA (V+T+K)) [14], multi-label canonical correlation analysis (ml-CCA) [29] and semantic correlation matching (SCM) [57]. CCA itself is an unsupervised method, while some of its variants make use of supervised information. For instance, ml-CCA can learn a shared subspace by taking high-level semantic information into account. In its semantic matching (SM) procedure, SCM maps the image and text spaces into a pair of semantic spaces represented by posterior probability distributions computed with multi-class logistic regression. CCA (V+T+K) introduces a three-view kernel CCA formulation for learning a joint space for visual, textual and semantic information. Our proposed method belongs to this scope and provides an alternative way to leverage label information. Different from previous CCA based retrieval methods, we exploit generalized CCA (GCCA) [15] to capture the correlation among more than two types of media data as well as to learn the hashing codes and hashing function.

In recent years, since Krizhevsky et al. [18] utilized a convolutional neural network (CNN) to achieve the state-of-the-art classification result on ILSVRC 2012, many studies based on deep learning have emerged. When processing image data, some DNN based cross-modal retrieval methods directly pass the raw RGB input through the network [11, 46], while others use traditional hand-crafted features as the input [27]. Existing DNN based cross-modal retrieval methods use the networks in two ways: the first [25, 37] uses a shared network, that is, inputs of different modalities pass through the same network layers; the second [11, 46] uses different subnetworks for different modalities and introduces correlation constraints on the code layers to capture the neighborhood structure among modalities. DNN based cross-modal retrieval methods usually yield better performance than most shallow structure based methods, but their performance heavily relies on large-scale training data and their training is often expensive in time. In addition, most existing DNN based retrieval methods only consider two media types and hence might not be directly applicable to more than two media types.

3 Label guided correlation cross-modal hashing (LGCH)

In this section, we propose the label guided correlation cross-modal hashing method (LGCH) for cross-modal retrieval. We then provide a toy example to illustrate its effectiveness and extend it to the multi-modal situation. Finally, we present the optimization procedure of LGCH.

3.1 Model

Why GCCA?

As the correlation across modalities is always the focus of cross-modal retrieval, our model considers this point as well. Concretely, we prefer generalized CCA (GCCA) [15] to CCA for capturing the correlation across two modalities, for two reasons. Firstly, GCCA is closely akin to CCA and thus shares analogous properties. Secondly, GCCA explicitly learns a common latent representation that simultaneously best reconstructs all of the view-specific representations. This has the advantage that, by using GCCA, we do not need to worry about the number of media types, because GCCA has the flexibility to couple with sophisticated regularization techniques by enforcing constraints on the common latent representation.

Without loss of generality, let \({X_{1}} \in {R^{{m_{1}} \times n}}\) and \({X_{2}} \in {R^{{m_{2}} \times n}}\) denote the image and text descriptors, where n is the number of samples and m1 and m2 are the dimensions of the image and text features, respectively. The objective of GCCA is:

$$ \begin{array}{l} \min\limits_{G,{W_{1}},{W_{2}}} {O_{1}} = \left\| {G - {W_{1}^{T}}{X_{1}}} \right\|_{F}^{2} + \left\| {G - {W_{2}^{T}}{X_{2}}} \right\|_{F}^{2},\\ s.t. G{G^{T}} = I \end{array} $$
(1)

where \({W_{1}} \in {R^{{m_{1}} \times r}}\) and \({W_{2}} \in {R^{{m_{2}} \times r}}\) are the transformation matrices of image and text, respectively, and \(G \in {R^{r \times n}}\) is the common latent representation (latent rep. for short), wherein r is the dimension of the common latent subspace.

To avoid a trivial solution, we impose Frobenius-norm regularization on the two transformation matrices as below:

$$ \begin{array}{l} \mathop {\min }\limits_{G,{W_{1}},{W_{2}}} {O_{2}} = \left\| {G - {W_{1}^{T}}{X_{1}}} \right\|_{F}^{2} + \left\| {G - {W_{2}^{T}}{X_{2}}} \right\|_{F}^{2} \\ + \beta (\left\| {{W_{1}}} \right\|_{F}^{2} + \left\| {{W_{2}}} \right\|_{F}^{2}).\\ s.t. G{G^{T}} = I \end{array} $$
(2)

Learning discriminative common latent representation with linear classifier

To make use of the label information and make the learned common subspace more discriminative, we intend to learn a linear classifier based on the common latent representation:

$$ \min\limits_{P,G} {O_{3}} = \left\| {Y - {P^{T}}G} \right\|_{F}^{2}, $$
(3)

where \(Y \in {R^{k \times n}}\) is the label matrix shared by image and text, k is the number of classes, and \(P \in {R^{r \times k}}\) is the transformation matrix mapping the common latent representation from the common latent subspace to the label subspace. Different from similar strategies in cross-modal counterparts, we build the classifier on the common latent representation.

Learning label guided binary codes and hashing function

Most existing cross-modal hashing methods focus on either directly generating the hashing codes or learning the hashing function. Different from them, we jointly consider both aspects based on the common latent representation in a unified way. Besides, we indirectly incorporate the label information into this unified process. Concretely, we minimize the quantization error loss:

$$ \begin{array}{l} \mathop {\min }\limits_{P,G,Q,B} {O_{4}} = \left\|B - \sigma \underbrace{G}_{latent\ rep.} - (1 - \sigma )\underbrace{Q\overbrace{P^{T}G}^{learned\ label\ info.}}_{new\ rep.} \right\|_{F}^{2},\\ s.t. G{G^{T}} = I,B \in {\left\{ { - 1,1} \right\}^{r \times n}}, B{\textbf{1}} = 0,0 \le \sigma \le 1 \end{array} $$
(4)

where \(B \in {\{ - 1,1\}^{r \times n}}\) denotes the binary codes shared by the two modalities, and \({\textbf{1}} = {[1, \cdots ,1]^{T}} \in {R^{n}}\). \(Q \in {R^{r \times k}}\) transforms the learned label information (learned label info. for short) into the label guided representation (new rep. for short), and σ is an adaptive parameter that trades off the label guided representation against the common latent representation. By comparing Fig. 2c and d with Fig. 2e and f, we find that this term greatly affects the discriminative property of the final common latent representation and the binary codes. The details of Fig. 2 are given in the paragraph following the final model.

Fig. 2 The original cross-modal dataset composed of 400 samples: (a) the 3D view and (b) the 2D view; the common latent representation G learned by (c) \({O_{2}} + \lambda \left \| {B - G} \right \|_{F}^{2}\), (d) \({O_{2}} + {O_{3}} + \lambda \left \| {B - G} \right \|_{F}^{2}\), and (e) LGCH, respectively; (f) the fused features \(\sigma G + (1 - \sigma )Q{P^{T}}G\) learned by LGCH

The final model

Based on the above formulations, we derive our final model, dubbed label guided correlation cross-modal hashing (LGCH), as follows:

$$ \begin{array}{l} \mathop {\min }\limits_{P,G,{W_{1}},{W_{2}},B,Q,\sigma } O = {O_{2}} + {O_{3}} + {O_{4}}\\ = \left\| {Y - {P^{T}}G} \right\|_{F}^{2} + \alpha (\left\| {G - {W_{1}^{T}}{X_{1}}} \right\|_{F}^{2} + \left\| {G - {W_{2}^{T}}{X_{2}}} \right\|_{F}^{2})\\ + \beta (\left\| {{W_{1}}} \right\|_{F}^{2} + \left\| {{W_{2}}} \right\|_{F}^{2}) + \lambda \left\| {B - \sigma G - (1 - \sigma )Q{P^{T}}G} \right\|_{F}^{2},\\ s.t. G{G^{T}} = I,B \in {\left\{ { - 1,1} \right\}^{r \times n}},B{\textbf{1}} = 0,0 \le \sigma \le 1 \end{array} $$
(5)

The first term O3 in (5) aims to learn a label subspace to make full use of the discriminative label information; the second group of terms, O2, learns a common latent subspace through the GCCA procedure; the last term O4 connects the learned label subspace and the common latent subspace with the Hamming space, so as to learn more discriminative binary hashing codes. It is easy to see that our model involves a cross-modal term O2 that learns a common latent representation, and the remaining constraints, O3 and O4, are imposed on this common latent representation. Thus, we can easily extend (5) to the multi-modal case as follows:

$$ \begin{array}{l} \mathop {\min }\limits_{P,G,\{W_{i}\}_{i = 1}^{m},B,Q,\sigma } \left\| {Y - {P^{T}}G} \right\|_{F}^{2} + \sum\limits_{i = 1}^{m} {(\alpha \left\| {G - {W_{i}^{T}}{X_{i}}} \right\|_{F}^{2}} + \beta \left\| {{W_{i}}} \right\|_{F}^{2})\\ + \lambda \left\| {B - \sigma G - (1 - \sigma )Q{P^{T}}G} \right\|_{F}^{2},\\ s.t. G{G^{T}} = I,B \in {\left\{ { - 1,1} \right\}^{r \times n}},B{\textbf{1}} = 0,0 \le \sigma \le 1 \end{array} $$
(6)

where m is the number of multimedia types. From (6), we can see that the multi-media formulation of LGCH involves a number of cross-modal terms that grows linearly with the number of media types, and thus it can be easily extended to cross-modal retrieval involving more than two types of multimedia data. More importantly, LGCH provides a novel alternative that leverages the label information as the guidance for simultaneously learning the binary codes and hashing function.
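For illustration, the following NumPy sketch evaluates the objective value of (6) (which reduces to (5) when m = 2); the function and variable names simply mirror the paper's notation and are not part of the original formulation.

```python
import numpy as np

def lgch_objective(Y, Xs, G, Ws, P, Q, B, sigma, alpha, beta, lam):
    """Value of the LGCH objective in (6); for two modalities it reduces to (5)."""
    label_term = np.linalg.norm(Y - P.T @ G, 'fro') ** 2
    gcca_term = sum(alpha * np.linalg.norm(G - W.T @ X, 'fro') ** 2
                    + beta * np.linalg.norm(W, 'fro') ** 2
                    for W, X in zip(Ws, Xs))
    fused = sigma * G + (1.0 - sigma) * Q @ (P.T @ G)   # sigma*G + (1-sigma)*Q*P^T*G
    quant_term = lam * np.linalg.norm(B - fused, 'fro') ** 2
    return label_term + gcca_term + quant_term
```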

To illustrate the role of label information in LGCH, we present a toy example that shows the common latent representation with and without label guidance. We first generate a three-dimensional dataset with 400 samples drawn from four different Gaussian distributions, each standing for one category. Likewise, we generate a corresponding two-dimensional dataset. We then combine these two synthetic datasets into a cross-modal dataset and show the common latent representation trained under different conditions. The learned features are displayed in Fig. 2. As shown in Fig. 2c, d and e, the common latent representation generated by LGCH shows the best discriminative ability for the four classes. By comparing Fig. 2e with f, we can see that LGCH makes the features of O4 more discriminative by fusing the common latent representation and the label guided representation.
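A minimal sketch of how such a synthetic cross-modal dataset could be generated is given below; the class means, covariances and random seed are illustrative assumptions, since the paper does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)                       # seed chosen only for reproducibility
n_per_class, n_classes = 100, 4                      # 400 samples in total, 4 Gaussians

# hypothetical class means for the 3-D view (X1) and the 2-D view (X2)
means_3d = rng.uniform(-5, 5, size=(n_classes, 3))
means_2d = rng.uniform(-5, 5, size=(n_classes, 2))

X1 = np.vstack([rng.normal(m, 0.5, size=(n_per_class, 3)) for m in means_3d]).T  # 3 x 400
X2 = np.vstack([rng.normal(m, 0.5, size=(n_per_class, 2)) for m in means_2d]).T  # 2 x 400
labels = np.repeat(np.arange(n_classes), n_per_class)
Y = np.eye(n_classes)[labels].T                      # one-hot label matrix, k x n
```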

3.2 Optimization

For convenience, we consider the two-modality cross-modal case to show the optimization of our method. Although (5) is jointly non-convex with respect to all the variables, we can solve it with an iterative optimization method in which G, W1, W2, P, Q, B and σ are updated alternately. Interestingly, each subproblem yields a closed-form solution. The details of the optimization procedure are as follows:

G-subproblem

When W1, W2, P, Q, B and σ are fixed, we rewrite (5) as

$$ \begin{array}{l} \underset{G}{\max } tr(GH),\\ s.t. G{G^{T}} = I \end{array} $$
(7)

where \(H = {Y^{T}}{P^{T}} + \alpha {X_{1}^{T}}{W_{1}} + \alpha {X_{2}^{T}}{W_{2}} + \lambda \sigma {B^{T}} + \lambda (1 - \sigma ){B^{T}}Q{P^{T}}\). Then we can get an analytical solution of G via Theorem 1.

Theorem 1

\(G = U{V^{T}}\) is the optimal solution of the problem in (7), where U and V are the left and right singular matrices of the Singular Value Decomposition (SVD) of \({H^{T}}\), respectively.

Proof

Suppose that the singular value decomposition (SVD) of \({H^{T}}\) is \({H^{T}} = U{\Lambda}{V^{T}}\). Substituting this into (7), we obtain:

$$ \begin{array}{l} tr(GH) = tr(GV{\Lambda} {U^{T}})= tr({\Phi} {\Lambda} ), \end{array} $$
(8)

where Φ = UTGV. According to the von Neumann’s trace inequality [24],

$$ tr({\Phi} {\Lambda} ) \le \sum\limits_{i = 1}^{r} {{\phi_{i}}{\eta_{i}}}, $$
(9)

where \(\phi_{1} \ge \cdots \ge \phi_{r}\) and \(\eta_{1} \ge \cdots \ge \eta_{r}\) are the singular values of \({\Phi}\) and \({\Lambda}\), respectively. Since \({\Phi}{\Phi^{T}} = I\), the singular values satisfy \(\phi_{i} = 1\), and (9) becomes \(tr({\Phi } {\Lambda } ) \le \sum \limits _{i = 1}^{r} {{\eta _{i}}}\). The equality holds when \({\Phi} = I\); the solution of G is therefore

$$ G = U{V^{T}} $$
(10)

This completes our proof.□
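For illustration, a direct NumPy realization of this update is sketched below, assuming H is assembled as defined after (7); the helper names build_H and solve_G are illustrative only.

```python
import numpy as np

def build_H(Y, X1, X2, W1, W2, B, P, Q, alpha, lam, sigma):
    """H as defined after (7): an n x r matrix."""
    return (Y.T @ P.T + alpha * X1.T @ W1 + alpha * X2.T @ W2
            + lam * sigma * B.T + lam * (1.0 - sigma) * B.T @ Q @ P.T)

def solve_G(H):
    """Theorem 1: G = U V^T, where H^T = U Sigma V^T is the (thin) SVD of H^T."""
    U, _, Vt = np.linalg.svd(H.T, full_matrices=False)
    return U @ Vt                                     # r x n, satisfies G G^T = I
```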

P-subproblem

When G, W1, W2, Q, B and σ are fixed, setting the gradient of (5) over P to zero, we can obtain

$$ G{Y^{T}} + \lambda (1 - \sigma )G{B^{T}}Q - \lambda \sigma (1 - \sigma )Q = P + \lambda {(1 - \sigma )^{2}}P{Q^{T}}Q. $$
(11)

Then the solution of P is

$$ P = ( G{Y^{T}} + \lambda (1 - \sigma )G{B^{T}}Q - \lambda \sigma (1 - \sigma )Q){({I_{k}} + \lambda {(1 - \sigma )^{2}}{Q^{T}}Q)^{- 1}}. $$
(12)
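In NumPy, the update (12) could be sketched as follows; a linear solve replaces the explicit inverse, and all names follow the paper's notation.

```python
import numpy as np

def solve_P(G, Y, B, Q, lam, sigma):
    """Closed-form P update of (12)."""
    k = Y.shape[0]
    rhs = G @ Y.T + lam * (1.0 - sigma) * G @ B.T @ Q - lam * sigma * (1.0 - sigma) * Q
    lhs = np.eye(k) + lam * (1.0 - sigma) ** 2 * (Q.T @ Q)   # symmetric k x k matrix
    return np.linalg.solve(lhs, rhs.T).T                     # r x k
```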

W1(W2)-subproblem

When G, W2, P, Q, B and σ are fixed, setting the gradient of (5) over W1 to zero, we can obtain

$$ (\alpha {X_{1}}{X_{1}^{T}} + \beta I){W_{1}} = \alpha {X_{1}}{G^{T}}. $$
(13)

The solution of W1 is

$$ {W_{1}} = \alpha {(\alpha {X_{1}}{X_{1}^{T}} + \beta I)^{- 1}}{X_{1}}{G^{T}}. $$
(14)

Likewise, when G, W1, P, Q, B and σ are fixed, setting the gradient of (5) over W2 to zero, we can obtain the solution of W2:

$$ {W_{2}} = \alpha {(\alpha {X_{2}}{X_{2}^{T}} + \beta I)^{- 1}}{X_{2}}{G^{T}}. $$
(15)
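Since (14) and (15) share the same form, a single helper suffices; the sketch below assumes X is the feature matrix of one modality, and again uses a linear solve in place of the inverse.

```python
import numpy as np

def solve_W(X, G, alpha, beta):
    """Closed-form update of a modality projection, as in (14)/(15)."""
    m = X.shape[0]
    A = alpha * (X @ X.T) + beta * np.eye(m)    # m x m, positive definite for beta > 0
    return alpha * np.linalg.solve(A, X @ G.T)  # m x r
```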

Q-subproblem

Note that this subproblem needs to be solved only when σ≠ 1. When G, W1, W2, P, B and σ are fixed and σ≠ 1, setting the gradient of (5) over Q to zero, we can obtain

$$ {(1 - \sigma )^{2}}Q{P^{T}}P = (1 - \sigma )B{G^{T}}P - \sigma (1 - \sigma )P. $$
(16)

The solution of Q is

$$ Q = \frac{{(B{G^{T}}P - \sigma P){{({P^{T}}P)}^{- 1}}}}{{{{(1 - \sigma )}}}}. $$
(17)
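A sketch of the Q update (17) is given below; it assumes σ ≠ 1 and that \(P^{T}P\) is invertible (a pseudo-inverse could be substituted otherwise).

```python
import numpy as np

def solve_Q(B, G, P, sigma):
    """Closed-form Q update of (17); valid for sigma != 1 and invertible P^T P."""
    rhs = B @ G.T @ P - sigma * P                        # r x k
    return np.linalg.solve(P.T @ P, rhs.T).T / (1.0 - sigma)
```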

B-subproblem

When G, W1, W2, P, Q and σ are fixed, (5) can be written as

$$ \begin{array}{l} \underset{B}{\max } \ tr\left( B(\sigma {G^{T}} + (1 - \sigma ){G^{T}}P{Q^{T}}) \right),\\ s.t. \ B \in {\left\{ { - 1,1} \right\}^{r \times n}},B{\textbf{1}} = 0 \end{array} $$
(18)

According to [12, 58], the solution of B is

$$ B({\mathbf{rank}}(i,j),j) = \left\{ {\begin{array}{*{20}{c}} {1, i \le \frac{n}{2}}\\ { - 1, otherwise,} \end{array}} \right. $$
(19)

where rank is the index matrix obtained by sorting \(\sigma G + (1 - \sigma )Q{P^{T}}G\) in descending order, and i and j are the indices of the matrix rank.
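The following sketch implements one common reading of (19), consistent with the constraint B1 = 0 and the balanced-code solutions of [12, 58]: for each bit (row), the samples with the n/2 largest fused values receive +1 and the remaining ones −1 (n is assumed to be even).

```python
import numpy as np

def solve_B(G, P, Q, sigma):
    """Balanced binary codes as in (19), per bit (row), so that B @ 1 = 0."""
    target = sigma * G + (1.0 - sigma) * Q @ (P.T @ G)   # r x n fused representation
    r, n = target.shape
    B = -np.ones((r, n))
    order = np.argsort(-target, axis=1)                  # descending sort per row
    rows = np.repeat(np.arange(r), n // 2)
    cols = order[:, : n // 2].ravel()
    B[rows, cols] = 1.0
    return B
```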

σ-subproblem

When G, W1, W2, P, Q and B are fixed, we can rewrite (5) as

$$ \begin{array}{l} \mathop {\min }\limits_{\sigma} - 2\lambda \sigma tr(B{G^{T}}) - 2\lambda (1 - \sigma )tr(B{G^{T}}P{Q^{T}}) + {\sigma^{2}}\lambda r + 2\lambda \sigma (1 - \sigma )tr(P{Q^{T}})\\ + {(1 - \sigma )^{2}}\lambda tr(Q{P^{T}}P{Q^{T}})\\ = \mathop {\min }\limits_{\sigma} - 2\sigma tr(B{G^{T}}) + 2\sigma tr(B{G^{T}}P{Q^{T}}) + {\sigma^{2}}r + 2\sigma tr(P{Q^{T}}) - 2{\sigma^{2}}tr(P{Q^{T}})\\ - 2\sigma tr(Q{P^{T}}P{Q^{T}}) + {\sigma^{2}}tr(Q{P^{T}}P{Q^{T}})\\ = \mathop {\min }\limits_{\sigma} \frac{1}{2}{\sigma^{2}}(2r - 4tr(P{Q^{T}}) + 2tr(Q{P^{T}}P{Q^{T}})) + \sigma (- 2tr(B{G^{T}}) \\ + 2tr(B{G^{T}}P{Q^{T}})+ 2tr(P{Q^{T}}) - 2tr(Q{P^{T}}P{Q^{T}})).\\ s.t. 0 \le \sigma \le 1 \end{array} $$
(20)

Let h = 2r − 4tr(PQT) + 2tr(QPTPQT) and f = − 2tr(BGT) + 2tr(BGTPQT) + 2tr(PQT) − 2tr(QPTPQT), then (20) becomes

$$ \begin{array}{l} \underset{\sigma}{\min } \frac{1}{2}{\sigma^{2}}h + \sigma f.\\ s.t. 0 \le \sigma \le 1 \end{array} $$
(21)

Since \(h = 2r - 4tr(P{Q^{T}}) + 2tr(Q{P^{T}}P{Q^{T}})= 2\left \| {I-P{Q^{T}}} \right \|_{F}^{2}\geq 0\), (21) is convex and has a closed-form solution. By simple algebra, we can obtain the solution to (21) as follows:

$$ \sigma = \left\{ {\begin{array}{*{20}{c}} {0, if - \frac{f}{h} \notin [0,1] and \frac{1}{2}{\sigma^{2}}h + \sigma f\left| {{~}_{\sigma = 1}} \right. > \frac{1}{2}{\sigma^{2}}h + \sigma f\left| {{~}_{\sigma = 0}} \right.,}\\ {1, if - \frac{f}{h} \notin [0,1] and \frac{1}{2}{\sigma^{2}}h + \sigma f\left| {{~}_{\sigma = 1}} \right. < \frac{1}{2}{\sigma^{2}}h + \sigma f\left| {{~}_{\sigma = 0}} \right.,}\\ { - \frac{f}{h}, if - \frac{f}{h} \in [0,1] . } \end{array} } \right. $$
(22)
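A sketch of this σ update is given below: the quadratic coefficients h and f of (21) are computed as above, and the minimizer over [0, 1] is taken among the interior stationary point −f/h (when it lies in [0, 1]) and the two endpoints.

```python
import numpy as np

def solve_sigma(B, G, P, Q, r):
    """Sigma update of (20)-(22): minimize 0.5*h*sigma^2 + f*sigma over [0, 1]."""
    PQt = P @ Q.T                                         # r x r
    h = 2 * r - 4 * np.trace(PQt) + 2 * np.trace(Q @ P.T @ PQt)
    f = (-2 * np.trace(B @ G.T) + 2 * np.trace(B @ G.T @ PQt)
         + 2 * np.trace(PQt) - 2 * np.trace(Q @ P.T @ PQt))
    obj = lambda s: 0.5 * h * s ** 2 + f * s
    candidates = [0.0, 1.0]
    if h > 0 and 0.0 <= -f / h <= 1.0:                    # interior stationary point
        candidates.append(-f / h)
    return min(candidates, key=obj)
```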
Algorithm 1 The overall optimization procedure of LGCH

The overall optimization procedure for LGCH is summarized in Algorithm 1. Since each sub-problem of our model is minimized via the update rules mentioned above, the objective value of (5) is non-increasing. Thus, the iterates of our optimization algorithm satisfy:

$$ \begin{array}{l} O({G^{t + 1}},{P^{t + 1}},{W_{1}}^{t + 1},{W_{2}}^{t + 1},{B^{t + 1}},{Q^{t + 1}},{\sigma^{t + 1}}) \le O({G^{t}},{P^{t}},{W_{1}}^{t},{W_{2}}^{t},{B^{t}},{Q^{t}},{\sigma^{t}}), \end{array} $$
(23)

where the superscript t denotes the value of the corresponding variable at the t-th iteration. To terminate Algorithm 1 once the objective stabilizes, we use (24) as the stopping condition:

$$ \frac{|O_{t}-O_{t + 1}|}{|O_{t}|}\le \zeta, $$
(24)

where the tolerance \(\zeta = 10^{-4}\) is used in all the experiments. The objective curves are shown in Figs. 3, 4 and 5. As Figs. 3-5 show, the objective value of LGCH decreases monotonically and converges within around 10 iterations on the Wiki dataset and around 20 iterations on both the NUS-WIDE and Flickr25K datasets.
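Putting the subproblem solvers together, one possible form of the alternating procedure of Algorithm 1 with the stopping rule (24) is sketched below; the initialization and the update order are illustrative assumptions, and the solver functions are the ones sketched in the previous subsections.

```python
import numpy as np

def lgch_train(X1, X2, Y, r, alpha, beta, lam, max_iter=50, zeta=1e-4, seed=0):
    """Alternating optimization of (5), reusing build_H, solve_G, solve_P, solve_W,
       solve_Q, solve_B, solve_sigma and lgch_objective sketched above."""
    rng = np.random.default_rng(seed)
    n, k = X1.shape[1], Y.shape[0]
    G = np.linalg.qr(rng.standard_normal((n, r)))[0].T        # r x n with G G^T = I
    P, Q = rng.standard_normal((r, k)), rng.standard_normal((r, k))
    W1 = rng.standard_normal((X1.shape[0], r))
    W2 = rng.standard_normal((X2.shape[0], r))
    B, sigma, prev = np.sign(rng.standard_normal((r, n))), 0.5, None
    for _ in range(max_iter):
        G = solve_G(build_H(Y, X1, X2, W1, W2, B, P, Q, alpha, lam, sigma))
        P = solve_P(G, Y, B, Q, lam, sigma)
        W1, W2 = solve_W(X1, G, alpha, beta), solve_W(X2, G, alpha, beta)
        if sigma != 1.0:
            Q = solve_Q(B, G, P, sigma)
        B = solve_B(G, P, Q, sigma)
        sigma = solve_sigma(B, G, P, Q, r)
        obj = lgch_objective(Y, [X1, X2], G, [W1, W2], P, Q, B, sigma, alpha, beta, lam)
        if prev is not None and abs(prev - obj) / abs(prev) <= zeta:   # condition (24)
            break
        prev = obj
    return G, P, W1, W2, Q, B, sigma
```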

Fig. 3 The objective value of LGCH for (a) 32 bits, (b) 64 bits, (c) 96 bits and (d) 128 bits cross-modal retrieval versus iterations on the Wiki dataset

Fig. 4 The objective value of LGCH for (a) 32 bits, (b) 64 bits, (c) 96 bits and (d) 128 bits cross-modal retrieval versus iterations on the NUS-WIDE dataset

Fig. 5 The objective value of LGCH for (a) 32 bits, (b) 64 bits, (c) 96 bits and (d) 128 bits cross-modal retrieval versus iterations on the Flickr25K dataset

Computational complexity

The time complexity of the iterative algorithm for LGCH has two aspects: the total number of iterations and the time cost per iteration. As Figs. 3-5 show, LGCH converges within a relatively small number of iterations, so we only discuss the computational cost per iteration. For LGCH, the total time cost lies in seven steps (3-7 and 11-12) of Algorithm 1. In particular, solving B involves sorting and matrix multiplication, which take O(rnlogn) and O(r2(k + n)) time, respectively. For P, the time cost of step 3 mainly depends on matrix multiplication and inversion, so the total time cost of this step is O(nkr + (n + k)r2 + k2r + k3). Likewise, updating W1, W2, Q and G costs at most \(O(n{m_{1}}^{2} + {m_{1}}^{3} + n{m_{1}}r)\), \(O(n{m_{2}}^{2} + {m_{2}}^{3} + n{m_{2}}r)\), O(nr2 + kr2 + k2r + k3) and \(O(nkr + n{m_{1}}r + n{m_{2}}r + n{r^{2}} + rn\min (r,n))\), respectively. For solving σ, since the involved matrix product QPT can be reused from step 7, the time cost is negligible and is omitted here. In summary, the total time cost of Algorithm 1 over T iterations is \(O(T(n(k + {m_{1}} + {m_{2}})r + rnlogn + (n + k){r^{2}} + {k^{2}}r + {k^{3}} + n{m_{1}}^{2} + n{m_{2}}^{2}+ {m_{1}}^{3} + {m_{2}}^{3} + {rn\min(r,n)}))\). In practice, m1, m2, r and k are far smaller than n, so the main time cost of LGCH is dominated by the O(rnlogn) term. For comparison, two typical counterparts, IMH [36] and CVH [19], are analyzed here: the time cost per iteration is O(n3) for IMH and at least O(n2) for CVH. In a nutshell, both IMH and CVH incur a higher time cost per iteration than LGCH.

Encoding a new sample

When a new sample \(x \in {R^{{m_{1}}}}\) arrives, if it is an image, we first map its image descriptor to the common subspace to obtain the common latent representation \(g = {W_{1}}^{T}x\), and then solve the following problem:

$$ \begin{array}{l} \mathop {\min }\limits_{b} \left\| {b - \sigma g - (1 - \sigma )Q{P^{T}}g} \right\|_{F}^{2}.\\ s.t. b \in {\{ - 1,1\}^{r \times 1}} \end{array} $$
(25)

Then, the solution of b is:

$$ b = sign(\sigma g + (1 - \sigma )Q{P^{T}}g). $$
(26)

Likewise, if the new sample is a text, we let \(g = {W_{2}}^{T}x\) and compute its binary codes using (26).
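The out-of-sample encoding of (26) thus amounts to a projection followed by a sign operation, as sketched below; encode_query is an illustrative helper name, and the tie-break at exactly zero is an assumption.

```python
import numpy as np

def encode_query(x, W, P, Q, sigma):
    """Out-of-sample encoding of (26): W is W1 for an image query or W2 for a text query."""
    g = W.T @ x                                            # common latent representation
    b = np.sign(sigma * g + (1.0 - sigma) * Q @ (P.T @ g))
    b[b == 0] = 1                                          # map sign(0) to +1
    return b
```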

4 Experiments

In this section, we verify the effectiveness of LGCH by conducting experiments of cross-modal retrieval on three popular cross-modal datasets including Wiki, NUS-WIDE and Flickr25K.

4.1 Datasets

  • The Wiki dataset contains 2,866 images and their corresponding texts collected from Wikipedia's featured articles. Each image-text pair is labelled by a 10-dimensional label vector over 10 categories. Each image is represented by a 128-dimensional SIFT-based bag-of-words feature vector, while each text is represented by a 10-dimensional feature vector. Following [20, 30, 58], we use the released 2,173 image-text pairs as the training set and the remaining 693 pairs as the testing set. The training set is also treated as the gallery database.

  • The NUS-WIDE dataset comprises 269,648 images and their associated tags from Flickr, with 81 ground-truth concepts used for evaluation. Following [20, 58], we select the 21 most frequently used concepts to construct a new dataset of 195,834 image-text pairs. Each image is represented by a 500-dimensional bag-of-words feature vector based on SIFT descriptors, and each text is encoded as a 1000-dimensional tag occurrence feature vector. From the new dataset, we randomly sample 2,000 image-text pairs as the testing set, and the remaining pairs are used as the gallery database. From the gallery database, we randomly select 10,000 image-text pairs as the training set.

  • The Flickr25K dataset is the labelled subset of the Flickr1M dataset, which contains 1 million Flickr image-text pairs. There are 38 categories in the Flickr25K dataset. Each image is represented by a 3,857-dimensional feature vector and each text by a 2,000-dimensional feature vector generated from tag occurrences [38]. We use 1,000 randomly selected image-text pairs as the testing set and the remaining 24,000 pairs as the training set. As with the Wiki dataset, the training set is also treated as the gallery database.

4.2 Evaluation protocols

As LGCH is a CCA based method, we compare it with CCA based methods; among them, SCM is a representative state-of-the-art one, so we first compare our method with the supervised method SCM [57]. We then compare it with six other state-of-the-art cross-modal hashing methods, including the supervised CMSSH [6] and QCH [52], and the unsupervised CVH [19], IMH [36], CCQ [20] and CSGH [58]. Two cross-modal retrieval tasks are employed: 1) the image-to-text task (I → T), where images are used as queries and texts are the returned results; and 2) the text-to-image task (T → I), where texts are used as queries and images are the returned results. We employ the mean average precision (mAP) to evaluate the retrieval performance of LGCH. Following [58], we report the MAP@50 results, i.e., the mean average precision over the top 50 retrieved samples. In addition, we also display the F-measure curves, the precision-recall curves and the precision@top-R curves for the two retrieval tasks, where R denotes the number of retrieved samples.
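For reference, MAP@50 under Hamming ranking can be computed as sketched below, under one common convention (two samples are treated as relevant if they share at least one label, and average precision is normalized by the number of relevant items among the top 50 returns); the exact convention of [58] may differ slightly.

```python
import numpy as np

def map_at_k(query_codes, gallery_codes, query_labels, gallery_labels, k=50):
    """MAP@k under Hamming ranking; codes are r x n matrices in {-1,+1}, labels are c x n."""
    r = gallery_codes.shape[0]
    aps = []
    for q_code, q_label in zip(query_codes.T, query_labels.T):
        dist = 0.5 * (r - q_code @ gallery_codes)          # Hamming distance to the gallery
        top = np.argsort(dist)[:k]
        relevant = (gallery_labels[:, top].T @ q_label) > 0 # shares at least one label?
        if relevant.sum() == 0:
            aps.append(0.0)
            continue
        precision_at = np.cumsum(relevant) / (np.arange(len(top)) + 1)
        aps.append((precision_at * relevant).sum() / relevant.sum())
    return float(np.mean(aps))
```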

LGCH has three parameters: α, β and λ. We explored their impact on LGCH for the above two cross-modal retrieval tasks. Finally, for both the Wiki and NUS-WIDE datasets, we set α = 100, β = 0.001 and λ = 0.01; for the Flickr25K dataset, we set α = 0.001, β = 1 and λ = 0.01.

4.3 Experiment results

For all datasets, we report the F-measure results, the mAP results, the precision-recall curves and the precision@topR curves at 32, 64, 96 and 128 bits for the two cross-modal retrieval tasks, I → T and T → I. In particular, the best mAP results are highlighted in bold in Table 1.

Table 1 The mAP results on the Wiki, NUS-WIDE and Flickr25K datasets

Results on Wiki dataset

The F-measure curves of LGCH and the seven compared state-of-the-art methods on the Wiki dataset are shown in Figs. 6 and 7. As the figures show, LGCH is superior to the seven compared models on both cross-modal retrieval tasks on the Wiki dataset. In addition, the T → I results in Fig. 7 show that the advantage of LGCH over the other methods becomes more evident as the code length grows.

Fig. 6 The F-measure results of different bits for I → T task on Wiki dataset

Fig. 7 The F-measure results of different bits for T → I task on Wiki dataset

As shown in Table 1, the mAP results on the Wiki dataset demonstrate the effectiveness of LGCH in most cases, except for the T → I task at 64 and 96 bits; even there, the gap between LGCH and the best model in the table is small, which shows that LGCH remains competitive.

The precision-recall curves on the Wiki dataset are shown in Figs. 8 and 9. Figure 8 shows the I → T performance and Fig. 9 the T → I performance of LGCH and the seven compared state-of-the-art methods; the results confirm the effectiveness of LGCH compared with the other methods.

Fig. 8 The precision-recall curves of different bits for I → T task on Wiki dataset

Fig. 9 The precision-recall curves of different bits for T → I task on Wiki dataset

The precision@topR results on the Wiki dataset are shown in Figs. 10 and 11. These results are similar to the precision-recall results above and show that LGCH is superior to the other methods in most cases. The main reason for this similarity may be that Wiki is a small dataset, so the top-1000 returned samples occupy a large proportion of the database and the precision@topR curves follow the trend of the precision-recall curves.

Fig. 10 The precision@topR curves of different bits for I → T task on Wiki dataset

Fig. 11 The precision@topR curves of different bits for T → I task on Wiki dataset

Results on NUS-WIDE dataset

The F-measure results on the NUS-WIDE dataset are shown in Figs. 12 and 13. From Figs. 12 and 13, we can conclude that LGCH is superior to all the other methods for all compared code lengths on the two retrieval tasks.

Fig. 12 The F-measure results of different bits for I → T task on NUS-WIDE dataset

Fig. 13 The F-measure results of different bits for T → I task on NUS-WIDE dataset

The mAP results on the NUS-WIDE dataset in Table 1 also show that LGCH achieves better performance than all the other compared methods, especially for the T → I task. The precision-recall curves on the NUS-WIDE dataset are shown in Figs. 14 and 15, and the precision@topR curves are shown in Figs. 16 and 17. These results also confirm the effectiveness of LGCH: it is superior to all the other compared methods.

Fig. 14 The precision-recall curves of different bits for I → T task on NUS-WIDE dataset

Fig. 15 The precision-recall curves of different bits for T → I task on NUS-WIDE dataset

Fig. 16 The precision@topR curves of different bits for I → T task on NUS-WIDE dataset

Fig. 17 The precision@topR curves of different bits for T → I task on NUS-WIDE dataset

Results on Flickr25K dataset

The F-measure results on the Flickr25K dataset are displayed in Figs. 18 and 19. All methods show little difference in performance for both the I → T and T → I tasks, which indicates that LGCH is competitive with the other compared methods.

Fig. 18 The F-measure results of different bits for I → T task on Flickr25K dataset

Fig. 19 The F-measure results of different bits for T → I task on Flickr25K dataset

The mAP results on the Flickr25K dataset in Table 1 show that LGCH achieves better performance than the other compared methods on both the I → T and T → I tasks. The precision-recall curves of different bits for the two tasks on the Flickr25K dataset are shown in Figs. 20 and 21, respectively. For 96 and 128 bits in Figs. 20 and 21, LGCH is superior to the other compared methods. For 32 and 64 bits, LGCH outperforms the compared methods except CSGH and QCH: it is competitive with these two methods in Fig. 20, and it is competitive with CSGH and slightly inferior to QCH in Fig. 21.

Fig. 20 The precision-recall curves of different bits for I → T task on Flickr25K dataset

Fig. 21 The precision-recall curves of different bits for T → I task on Flickr25K dataset

Although the precision-recall results on Flickr25K are not uniformly better than those of the other methods, the precision@topR results in Figs. 22 and 23 show the superiority of LGCH over all the other methods, including CSGH and QCH. In practice, real applications rarely need to retrieve a large number of results; for such situations, LGCH may be the better choice because it works better than the other compared methods.

Fig. 22 The precision@topR curves of different bits for I → T task on Flickr25K dataset

Fig. 23 The precision@topR curves of different bits for T → I task on Flickr25K dataset

Impact of parameters

LGCH has three free parameters, α, β and λ, so we analyze their impact. We study \(\alpha , \beta , \lambda \in \left \{{{{10}^{i}}}\right \}_{i= -5}^{5}\) for both 32-bit I → T and T → I retrieval tasks on the Wiki, NUS-WIDE and Flickr25K datasets, respectively. To visualize the impact, we show the mAP results versus various values of two parameters with the third parameter fixed.

The impact of the three parameters on the Wiki dataset is shown in Figs. 24 and 25. Figures 24a and 25a show that, when α is fixed, small values of β (lgβ ≤ 0) yield better mAP results for both 32-bit I → T and T → I retrieval tasks on the Wiki dataset. Likewise, when β is fixed, Figs. 24b and 25b show that small values of α (lgα ≤ 0) yield better mAP results for both tasks. When λ is fixed, Fig. 24c shows that the mAP results change only slightly for the 32-bit I → T task on the Wiki dataset, while for the T → I task in Fig. 25c, LGCH achieves better mAP results when lgβ ≥ lgα − 4.

Fig. 24 The impact of α, β and λ for the 32-bit I → T task on the Wiki dataset: (a) the mAP results of β and λ when fixing α = 100, (b) the mAP results of α and λ when fixing β = 0.001 and (c) the mAP results of α and β when fixing λ = 0.01

Fig. 25 The impact of α, β and λ for the 32-bit T → I task on the Wiki dataset: (a) the mAP results of β and λ when fixing α = 100, (b) the mAP results of α and λ when fixing β = 0.001 and (c) the mAP results of α and β when fixing λ = 0.01

The impact of the three parameters on the NUS-WIDE dataset is shown in Figs. 26 and 27. Similar to the results on the Wiki dataset, Figs. 26a and 27a show that, when α is fixed, small values of β (lgβ ≤ 0) yield better mAP results for both 32-bit I → T and T → I retrieval tasks on NUS-WIDE. Likewise, when β is fixed, Figs. 26b and 27b show that small values of α (lgα ≤ 0) yield better mAP results for both tasks. When λ is fixed, Fig. 26c shows that the region lgβ ≥ lgα − 4 with lgβ ≤ 3 yields better mAP results for the 32-bit I → T task, while for the T → I task in Fig. 27c, better mAP results are obtained when lgβ ≤ 4.

Fig. 26 The impact of α, β and λ for the 32-bit I → T task on the NUS-WIDE dataset: (a) the mAP results of β and λ when fixing α = 100, (b) the mAP results of α and λ when fixing β = 0.001 and (c) the mAP results of α and β when fixing λ = 0.01

Fig. 27 The impact of α, β and λ for the 32-bit T → I task on the NUS-WIDE dataset: (a) the mAP results of β and λ when fixing α = 100, (b) the mAP results of α and λ when fixing β = 0.001 and (c) the mAP results of α and β when fixing λ = 0.01

The impact of the three parameters on the Flickr25K dataset is shown in Figs. 28 and 29. When α is fixed, Fig. 28a shows that small values of β (lgβ ≤ 1) yield better mAP results for the 32-bit I → T retrieval task, and Fig. 29a shows that lgβ ≤ 2 together with lgλ ≤ 2 yields better mAP results for the 32-bit T → I task. When β is fixed, Figs. 28b and 29b show that lgα ≤ 2 together with lgλ ≤ 4 yields better mAP results for both 32-bit I → T and T → I tasks. When λ is fixed, Figs. 28c and 29c show that lgα ≥ lgβ − 4 yields better mAP results for both tasks.

Fig. 28 The impact of α, β and λ for the 32-bit I → T task on the Flickr25K dataset: (a) the mAP results of β and λ when fixing α = 0.001, (b) the mAP results of α and λ when fixing β = 1 and (c) the mAP results of α and β when fixing λ = 0.01

Fig. 29 The impact of α, β and λ for the 32-bit T → I task on the Flickr25K dataset: (a) the mAP results of β and λ when fixing α = 0.001, (b) the mAP results of α and λ when fixing β = 1 and (c) the mAP results of α and β when fixing λ = 0.01

5 Conclusion

This paper proposes the label guided correlation cross-modal hashing method (LGCH), which learns a more discriminative common latent representation and binary codes by leveraging the label information to guide the simultaneous learning of the binary codes and the hashing function. The model is based on generalized CCA and can be easily extended to multi-media retrieval involving more than two media types. Moreover, each subproblem of LGCH admits an elegant analytical solution. The toy example illustrates the efficacy of the label guided strategy, and experiments of cross-modal retrieval on three popular cross-modal datasets show that LGCH performs favorably compared to several state-of-the-art cross-modal retrieval methods.