1 Introduction

Developments in information technology have led to an explosive growth of multimedia data. At the same time, people increasingly expect information search to return diverse results. As a consequence, research on multimedia data analysis and cross-modal retrieval has grown rapidly [18, 23, 33, 19, 31, 20, 32, 15]. Cross-modal retrieval refers to accurately and quickly retrieving all relevant data of other modalities given a query from one modality.

Hash learning is widely used in cross-modal retrieval models [27, 21, 29, 1] because of its low storage cost and efficient retrieval. Over the past few decades, many hashing methods have been proposed for single-modal retrieval [25, 22, 16, 14, 8, 13, 35]. However, these methods are not suitable for cross-modal hashing retrieval because of the semantic gap between data in different modalities. Most existing cross-modal hashing methods [34] bridge this gap by mining the correlations between different modalities. The main cross-modal hashing methods can be divided into two categories: deep cross-modal hashing methods [5, 30, 7, 17, 24] and shallow cross-modal hashing methods [28, 12, 3, 11, 4]. Shallow cross-modal hashing methods map each sample into a binary code based on hand-crafted features and learn the hash function on top of them. However, such hash functions cannot express the underlying characteristics of the samples, and the retrieval performance is not ideal. In contrast, deep cross-modal hashing methods use the feature extraction capability of deep learning to learn effective representations of different modalities, which overcomes the limited expressive power of hand-crafted features. In addition, these methods can integrate feature learning into the hash code learning process, ensuring the quality of the hash codes and yielding better retrieval performance.

To date, there has been a great deal of research on deep cross-modal hashing retrieval, but existing methods ignore two properties of cross-modal data that contribute to retrieval accuracy; exploiting them is the motivation of this paper. First, different scales of single-modal data contain different semantic information. Second, when judging the similarity between cross-modal data, most deep cross-modal hashing methods treat two cases in the same way: the case where different modal data share only one label and the case where they share more than one. They ignore the fact that the degree of similarity between different modal data is related to the number of labels they share.

Motivated by these observations, this paper proposes Semantic-Preserving Hashing based on Multi-scale Fusion (SPHMF). The framework of SPHMF is shown in Fig. 1. First, an Image Pooling Model for Multi-scale Fusion (IPMSF) and a Text Pooling Model for Multi-scale Fusion (TPMSF) are used to extract multi-scale feature information from the data of each modality. Second, the label information of image-text pairs is used to train a self-supervised network [17] in order to better mine the relevance between images and texts. Finally, when constructing the loss function, we use multi-level similarity information of image-text pairs to build the intra-modal loss; the loss function also includes a pairwise loss and an inter-modal loss.

Fig. 1 The framework of SPHMF

The remainder of this paper is structured as follows. We summarize work related to cross-modal retrieval (Section 2), present the proposed deep learning architecture (Section 3), describe the construction of the loss function (Section 4), discuss the experiments and results (Section 5), and draw conclusions (Section 6).

2 Related works

As mentioned above, the proposed SPHMF is a cross-modal hashing retrieval method based on deep learning. We therefore review both shallow and deep cross-modal hashing methods.

Most existing shallow cross-modal hashing methods separate feature learning from hash code learning, which leads to unsatisfactory retrieval results. Typical methods of this kind include CVH [28], CCQ [12], CMSSH [3], SCM [11], and SePH [4]. CVH considers both intra-view and inter-view similarity. CCQ jointly learns correlation-maximal mappings and composite quantizers, converting multimedia data into binary codes through an isomorphic latent space. CMSSH is a supervised cross-modal hashing method that models hash learning through a boosting-based classification paradigm. SCM uses label information to build a semantic similarity matrix for learning the hash function. SePH converts the semantic matrix into a probability distribution and learns the Hamming space by minimizing its divergence from the distribution defined over Hamming distances.

Deep cross-modal hashing methods use deep learning frameworks to learn the hash functions and can effectively capture the non-linear correlations between cross-modal instances. Typical methods of this kind include CMNNH [5], DCMH [30], PRDH [29], ACMR [7], SSAH [17], and MCSCH [24]. CMNNH learns the hash function within a deep learning framework by maintaining intra-modal relationships and inter-modal pairwise correspondence. DCMH performs feature learning and hash function learning simultaneously. PRDH guides hash code learning by constructing different pairwise losses. ACMR discriminates between the data of different modalities and learns binary hash codes through adversarial learning and classification. SSAH uses a self-supervised network to generate semantic information from labels and uses this information to guide the feature learning of each modality. MCSCH uses multi-scale features to guide sequential hash learning, enhancing the diversity of hash codes and promoting the learning of hash functions.

The related works above have achieved good results, and supervised methods generally outperform unsupervised ones. For supervised methods, the key is to use limited training data and supervision to learn semantic information that preserves the neighborhood relations of the original data. The self-supervised network in SSAH can build semantic associations between multimedia data and mine richer information from labels, so our method also adopts such a label self-supervised network. SPHMF differs from the above methods in two ways: it extracts semantic information at different scales from single-modal data and integrates it into the feature learning process, and it constructs the intra-modal loss from a multi-level semantic affinity matrix in addition to the inter-modal loss and the pairwise loss.

3 Deep learning framework based on multi-scale fusion

As shown in Fig. 1, the overall network structure contains three parts: the Image Feature Training Network (IMFN), the Text Feature Training Network (TEFN), and the Semantic Tag Generation Network (STGN). The output of the STGN is used to guide the training of IMFN and TEFN. Once IMFN and TEFN are trained, the hash functions of the image and text modalities are obtained. These hash functions produce a hash code for each data item, and cross-modal retrieval is completed by computing and sorting Hamming distances, as illustrated below.
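To make the retrieval step concrete, the following minimal NumPy sketch (not part of the original paper) ranks database items of one modality by Hamming distance to a query code from the other modality; the function name and the ±1 code convention are illustrative assumptions.

```python
import numpy as np

def hamming_retrieval(query_code, database_codes):
    """Rank database items by Hamming distance to a query hash code.

    query_code: (m,) array of +/-1 bits from one modality (e.g., an image).
    database_codes: (N, m) array of +/-1 bits from the other modality (e.g., texts).
    Returns database indices sorted from nearest to farthest.
    """
    m = query_code.shape[0]
    # For +/-1 codes, Hamming distance = (m - inner product) / 2.
    dists = 0.5 * (m - database_codes @ query_code)
    return np.argsort(dists)

# Toy usage: retrieve texts for one image query with 64-bit codes.
rng = np.random.default_rng(0)
img_code = np.sign(rng.standard_normal(64))
txt_codes = np.sign(rng.standard_normal((1000, 64)))
ranking = hamming_retrieval(img_code, txt_codes)
```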

3.1 Image pooling model for multi-scale fusion

The overall structure of IMFN is shown in the upper part of Fig. 1. Considering that features at different scales of an image carry different semantics, we propose an Image Pooling Model for Multi-scale Fusion (IPMSF) for IMFN. IPMSF is based on the idea of Spatial Pyramid Pooling (SPP) [6]. The output of conv5 is taken as the input of each pooling layer in IPMSF. The pooling layers perform max pooling over regions of different scales, and the output vectors of the pooling layers are concatenated as the input of fc1 to complete the training of IMFN. This removes the fixed input-size restriction of traditional CNNs and avoids the unreliable feature learning caused by the resulting information loss. The settings of IPMSF are shown in Table 1. Unlike SPP, which directly concatenates the features of all scales, IPMSF first merges the features of the same scale and then concatenates the features of different scales, thereby reducing the number of network parameters. As shown in Section 5.4.5, the IPMSF model reduces the computational overhead of training while maintaining retrieval accuracy.

Table 1 Parameter settings for IPMSF model
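Since the exact layer settings of Table 1 are not reproduced here, the following PyTorch sketch shows one plausible reading of the IPMSF pooling described above (merge the bins within each scale, then concatenate across scales); the scale set, the averaging merge, and the function name are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn.functional as F

def ipmsf_pool(conv5_feat, scales=(1, 2, 4)):
    """Illustrative multi-scale fusion pooling (assumed interpretation of IPMSF).

    conv5_feat: (B, C, H, W) feature map from conv5.
    For each scale s, max-pool the map into an s x s grid, then merge the s*s
    bins of that scale into a single C-dim vector (here by averaging) before
    concatenating across scales. Plain SPP would instead concatenate every bin,
    giving a (1+4+16)*C-dim vector; merging first keeps the fc1 input at
    len(scales)*C and thus reduces the number of fc1 parameters.
    """
    merged = []
    for s in scales:
        pooled = F.adaptive_max_pool2d(conv5_feat, output_size=s)  # (B, C, s, s)
        merged.append(pooled.flatten(2).mean(dim=2))               # (B, C)
    return torch.cat(merged, dim=1)                                # (B, len(scales)*C)

# e.g. a 512-channel conv5 map: output is (B, 3*512) instead of SPP's (B, 21*512)
feat = torch.randn(8, 512, 13, 13)
fused = ipmsf_pool(feat)  # shape (8, 1536)
```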

3.2 Text pooling model for multi-scale fusion

Text data are usually represented by bag-of-words vectors, which are highly sparse. To address this, we design a Text Pooling Model for Multi-scale Fusion (TPMSF). First, multi-scale features of the text samples are extracted by pooling layers, and then the multiple features are fused through a convolutional layer. This process captures the relevance among the words of the text modality, which is very useful for modeling semantic relevance. The overall settings of TPMSF are shown in Table 2, where c is the number of class labels.

Table 2 Parameter settings for TPMSF model

The output of the TPMSF model is used as the input of TEFN. The architecture of TEFN is shown in the bottom part of Fig. 1.
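As Table 2 is not reproduced here, the following PyTorch sketch illustrates one possible form of the multi-scale pooling and convolutional fusion described above; the window sizes, the interpolation step, the output dimension, and the class name are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TPMSFSketch(nn.Module):
    """Illustrative text pooling model (assumed structure; Table 2 not reproduced).

    The bag-of-words vector is treated as a 1-D signal. Several average-pooling
    branches with different window sizes produce multi-scale views of the sparse
    BoW input; the branches are stacked as channels and fused by a 1x1 convolution.
    """
    def __init__(self, bow_dim=1000, windows=(1, 2, 5, 10), out_dim=4096):
        super().__init__()
        self.windows = windows
        # 1x1 convolution fuses the multi-scale channels back into one channel.
        self.fuse = nn.Conv1d(len(windows), 1, kernel_size=1)
        self.fc = nn.Linear(bow_dim, out_dim)

    def forward(self, bow):                      # bow: (B, bow_dim)
        x = bow.unsqueeze(1)                     # (B, 1, bow_dim)
        branches = []
        for w in self.windows:
            pooled = F.avg_pool1d(x, kernel_size=w, stride=1)
            # Interpolate back to the original length so branches can be stacked.
            branches.append(F.interpolate(pooled, size=bow.shape[1], mode="nearest"))
        multi = torch.cat(branches, dim=1)       # (B, len(windows), bow_dim)
        fused = self.fuse(multi).squeeze(1)      # (B, bow_dim)
        return torch.relu(self.fc(fused))        # input feature for TEFN
```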

3.3 Semantic tag generation network

In this paper, the STGN is used to extract the label semantic information of image-text pairs and to guide the training of IMFN and TEFN. The overall structure of the STGN is shown in the middle part of Fig. 1.

The STGN is trained with the class label information and the neighborhood relation matrix S. After the STGN is trained, the label semantic hash code H(s) and the label semantic features F(s) obtained from this network are used to guide the training of IMFN and TEFN. During training, the inner product between vectors represents the correlation between any two output features or any two hash codes, and a likelihood function models the inner product values under the supervision of S, as shown in formula (1):

$$ p\left({S}_{ij}\mid H\right)=\begin{cases}\operatorname{sig}\left({\theta}_{ij}\right), & {S}_{ij}=1\\ 1-\operatorname{sig}\left({\theta}_{ij}\right), & {S}_{ij}=0\end{cases} $$
(1)

where sig() denotes the sigmoid function, θij = 1/2⟨Hi, Hj⟩, Hi and Hj are the hash codes of a pair of samples output by the hash layer, Sij = 1 indicates that the two samples are similar, and Sij = 0 indicates that they are dissimilar.

Maximizing the likelihood function is equivalent to minimizing the negative log-likelihood, which yields:

$$ \min R=-\log p\left(S\mid H\right)=-\sum_{i,j}\left({S}_{ij}\left\langle {H}_i,{H}_j\right\rangle -\log \left(1+{e}^{\left\langle {H}_i,{H}_j\right\rangle}\right)\right) $$
(2)

Let θ denote all the parameters of the STGN. Applying formula (2) to all samples of F(s) and H(s) gives:

$$ {\displaystyle \begin{array}{c}\underset{\theta }{\mathit{\min}} Js=-\sum \limits_{i,j=1}^n\left({S}_{ij}\left\langle {F}_i,{F}_j\right\rangle -\mathit{\log}\left(1+{e}^{\left\langle {F}_i,{F}_j\right\rangle}\right)\right)\\ {}-\sum \limits_{i,j=1}^n\left({S}_{ij}\left\langle {H}_i,{H}_j\right\rangle -\mathit{\log}\left(1+{e}^{\left\langle {H}_i,{H}_j\right\rangle}\right)\right)\end{array}} $$
(3)

where Fi, Fj, Hi, Hj represent the features of the ith and jth groups and the hash codes of the ith and jth groups, respectively.

Since information is lost when the real-valued hash outputs are quantized into binary hash codes, a quantization error term is added to the objective, as follows:

$$ \alpha {\left\Vert {H}_i-\mathit{\operatorname{sign}}\left({H}_i\right)\right\Vert}_F^2 $$
(4)

So, the final objective function is:

$$ {\displaystyle \begin{array}{c}\underset{\theta }{\mathit{\min}} Js=-\sum \limits_{i,j=1}^n\left({S}_{ij}\left\langle {F}_i,{F}_j\right\rangle -\mathit{\log}\left(1+{e}^{\left\langle {F}_i,{F}_j\right\rangle}\right)\right)\\ {}-\sum \limits_{i,j=1}^n\left({S}_{ij}\left\langle {H}_i,{H}_j\right\rangle -\mathit{\log}\left(1+{e}^{\left\langle {H}_i,{H}_j\right\rangle}\right)\right)\\ {}+\alpha {\left\Vert {H}_i-\mathit{\operatorname{sign}}\left({H}_i\right)\right\Vert}_F^2\end{array}} $$
(5)

In this paper, the parameters θ of the STGN are learned by stochastic gradient descent and back-propagation.
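For illustration, the objective of formula (5) can be written compactly as in the following PyTorch sketch (not the paper's implementation; the function name, the softplus formulation of log(1 + e^x), and the batch-level matrix form are assumptions); it can then be minimized with a standard optimizer such as torch.optim.SGD, as described above.

```python
import torch
import torch.nn.functional as F

def stgn_objective(F_s, H_s, S, alpha=1.0):
    """Negative log-likelihood of formula (5) plus the quantization term of (4).

    F_s: (n, d) label semantic features, H_s: (n, m) real-valued hash outputs,
    S:   (n, n) binary similarity matrix; all names are illustrative.
    """
    theta_f = F_s @ F_s.t()                       # pairwise inner products of features
    theta_h = H_s @ H_s.t()                       # pairwise inner products of hash outputs
    nll_f = -(S * theta_f - F.softplus(theta_f)).sum()    # softplus(x) = log(1 + e^x)
    nll_h = -(S * theta_h - F.softplus(theta_h)).sum()
    quant = alpha * (H_s - torch.sign(H_s)).pow(2).sum()  # quantization error of (4)
    return nll_f + nll_h + quant
```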

4 Learning cross-modal hash functions

Suppose there are n training data points, each consisting of an image-text pair. Let X = {x_i} (i = 1, …, n) denote the high-dimensional original image data and Y = {y_j} (j = 1, …, n) denote the corresponding text data describing the images. Each pair has a class label vector l_i = (l_i1, l_i2, …, l_ic), where c is the number of categories. In the label semantic learning part, the class label matrix L of size n × c, whose rows l_1, l_2, …, l_n are c-dimensional binary vectors, is used to construct a similarity matrix S, where S_ij = 1 means x_i is similar to y_j and S_ij = 0 means they are dissimilar. In the hash function part, the length of the output hash code is m.
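As a small illustration (assuming, as is common in this setting although not stated explicitly in the text, that a pair is marked similar when it shares at least one label), S and the affinity matrix A used later in Section 4.3 can be built from L as follows:

```python
import numpy as np

# Toy label matrix L (n x c): each row is a c-dimensional multi-label vector l_i.
L = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 1, 0]], dtype=np.float32)

A = L @ L.T                      # A_ij = l_i^T l_j, the multi-level affinity of Section 4.3
S = (A > 0).astype(np.float32)   # assumed rule: S_ij = 1 iff the pair shares a label
```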

Given the image data, the text data, and the similarity matrix S, the goal of SPHMF is to learn a hash code B that preserves similarity. When constructing the overall objective, the image and text features together with the semantic feature F(s) form the pairwise loss, which conveys the neighborhood relations encoded in the labels, while the hash codes of the modal data form the inter-modal and intra-modal losses, which preserve the global and local semantic structure. The overall objective function is therefore:

$$ J={J}_{se}+\eta {J}_{inter}+\beta {J}_{intra} $$
(6)

where Jse is the pairwise loss, Jinter is the inter-modal loss, and Jintra is the intra-modal loss. η and β balance the impact of each term.

4.1 Pairwise loss

For the features of the image and text modalities, the inner product is used to represent the similarity relationship, and the pairwise loss transfers the nearest-neighbor relations of F(s), as shown in formulas (7) and (8):

$$ \underset{\theta x}{\mathit{\min}}{J}_{se}^x=-\sum \limits_{i,j=1}^n\left({S}_{ij}<{F}_i^{(s)},{Z}_j^x>-\mathit{\log}\left(1+{e}^{<{F}_i^{(s)},{Z}_j^x>}\right)\right) $$
(7)
$$ \underset{\theta_y}{\mathit{\min}}{J}_{se}^y=-\sum \limits_{i,j=1}^n\left({S}_{ij}<{F}_i^{(s)},{Z}_j^y>-\mathit{\log}\left(1+{e}^{<{F}_i^{(s)},{Z}_j^y>}\right)\right) $$
(8)

where Jsex and Jsey are the pairwise losses for the image and text samples, respectively, and ⟨Fi(s), Zxj⟩ and ⟨Fi(s), Zyj⟩ are inner products that measure the similarity between the image or text features and the semantics-preserving features. Zxj and Zyj are the feature representations of the jth image sample and the jth text sample, respectively, and θx and θy are the parameters of the image network and the text network.
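This loss has the same negative log-likelihood form as the STGN objective; a minimal PyTorch sketch (names are illustrative) is:

```python
import torch
import torch.nn.functional as F

def pairwise_loss(F_s, Z, S):
    """Formula (7)/(8): semantic features F_s (n, d) vs. image or text features Z (n, d)."""
    theta = F_s @ Z.t()                           # <F_i^(s), Z_j^x> or <F_i^(s), Z_j^y>
    return -(S * theta - F.softplus(theta)).sum()
```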

4.2 Inter-modal loss

The hash codes of the image and text modalities are each combined with the label hash code to construct a cross-entropy loss, the inter-modal loss, for learning the hash functions. In this way, the label information is embedded in the modal data and the learned hash codes are drawn closer to the ideal hash codes.

$$ {\displaystyle \begin{array}{c}\underset{\theta_x,{\theta}_y}{\min }{J}_{inter}=-\frac{1}{n}\sum \limits_{i,j=1}^n\left({H}^{(s)}\log \left(\sigma \left({H}^{(x)}\right)\right)+\left(1-{H}^{(s)}\right)\log \left(1-\sigma \left({H}^{(x)}\right)\right)\right)\\ {}-\frac{1}{n}\sum \limits_{i,j=1}^n\left({H}^{(s)}\log \left(\sigma \left({H}^{(y)}\right)\right)+\left(1-{H}^{(s)}\right)\log \left(1-\sigma \left({H}^{(y)}\right)\right)\right)\end{array}} $$
(9)

where σ() denotes the sigmoid function, n is the number of training samples, H(s) is the label semantic hash code, and H(x) and H(y) are the hash codes output by IMFN and TEFN. Because IMFN and TEFN are trained separately, a cross-modal adaptive constraint needs to be added, as shown in the following formula:

$$ \underset{\theta_x,{\theta}_y,B}{\min}\left({\left\Vert {B}^{(x)}-{H}^{(x)}\right\Vert}_F^2+{\left\Vert {B}^{(y)}-{H}^{(y)}\right\Vert}_F^2\right) $$
(10)

Here B(x) and B(y) are the binary hash codes of the images and texts; this constraint reduces the information loss caused by quantizing the hash codes. In addition, to obtain better performance, we set B = B(x) = B(y), so:

$$ \underset{\theta_x,{\theta}_y,B}{\mathit{\min}}\left({\left\Vert B-{H}^{(x)}\right\Vert}_F^2+{\left\Vert B-{H}^{(y)}\right\Vert}_F^2\right) $$
(11)

Therefore, the inter-modal loss function is formula (12), where γ balances the impact of the cross-modal adaptive constraint term.

$$ {\displaystyle \begin{array}{c}\underset{\theta_x,{\theta}_y,B}{\min }{J}_{inter}=-\frac{1}{n}\sum \limits_{i,j=1}^n\left({H}^{(s)}\log \left(\sigma \left({H}^{(x)}\right)\right)+\left(1-{H}^{(s)}\right)\log \left(1-\sigma \left({H}^{(x)}\right)\right)\right)\\ {}-\frac{1}{n}\sum \limits_{i,j=1}^n\left({H}^{(s)}\log \left(\sigma \left({H}^{(y)}\right)\right)+\left(1-{H}^{(s)}\right)\log \left(1-\sigma \left({H}^{(y)}\right)\right)\right)\\ {}+\gamma \left({\left\Vert B-{H}^{(x)}\right\Vert}_F^2+{\left\Vert B-{H}^{(y)}\right\Vert}_F^2\right)\end{array}} $$
(12)
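A minimal PyTorch sketch of formula (12) is given below (illustrative only; it assumes H(s) has been rescaled to {0, 1} so that it can serve as a cross-entropy target, and the function name and reduction are assumptions):

```python
import torch
import torch.nn.functional as F

def inter_modal_loss(H_s01, H_x, H_y, B, gamma=1.0):
    """Formula (12): cross-entropy against the label code plus quantization to B.

    H_s01: (n, m) label hash codes rescaled to {0, 1}; H_x, H_y: real-valued hash
    outputs of IMFN and TEFN; B: the shared binary code (+/-1). Names illustrative.
    """
    n = H_s01.shape[0]
    ce_x = F.binary_cross_entropy_with_logits(H_x, H_s01, reduction="sum") / n
    ce_y = F.binary_cross_entropy_with_logits(H_y, H_s01, reduction="sum") / n
    quant = gamma * ((B - H_x).pow(2).sum() + (B - H_y).pow(2).sum())
    return ce_x + ce_y + quant
```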

4.3 Intra-modal loss

The global semantic similarity matrix A of the training instances is used as supervision for learning a globally semantics-preserving hash code in each modality; this yields the intra-modal loss. The element Aij is defined as:

$$ {A}_{\mathrm{i}\mathrm{j}}={l}_{\mathrm{i}}^T{l}_j $$
(13)

From formula (13) we obtain the complete matrix A, from which the joint probability distribution P is derived. Its element Pij is:

$$ {p}_{ij}=\frac{A_{ij}}{\sum_{i=1}^n{\sum}_{j=1,j\ne i}^n{A}_{ij}} $$
(14)

The Hamming distances between the hash codes B(x) (B(y)) output by the image (text) network are then used to compute the probability distribution Qx (Qy) within the image (text) modality. Let qijx denote the similarity between two image instances and qijy the similarity between two text instances:

$$ {q}_{ij}^x=\frac{e^{-{d}_H\left({b}_i^x,{b}_j^x\right)}}{\sum \limits_{k=1}^n\sum \limits_{m=1,m\ne k}^n{e}^{-{d}_H\left({b}_k^x,{b}_m^x\right)}} $$
(15)
$$ {q}_{ij}^y=\frac{e^{-{d}_H\left({b}_i^y,{b}_j^y\right)}}{\sum \limits_{k=1}^n\sum \limits_{m=1,m\ne k}^n{e}^{-{d}_H\left({b}_k^y,{b}_m^y\right)}} $$
(16)

where dH() is the Hamming distance between two instances, bix and bjx are the binary codes of two image samples, and biy and bjy are the binary codes of two text samples.

We use the KL divergence between the two probability distributions, P and Qx (Qy), as the intra-modal loss for the image (text) modality, as shown in the following formulas:

$$ \underset{\theta_x,B}{\min }{J}_{\mathrm{intra}}^x= KL\left(P\Big\Vert {Q}^x\right)=\sum \limits_{i=1}^n\sum \limits_{j=1,j\ne i}^n{p}_{ij}\log \frac{p_{ij}}{q_{ij}^x} $$
(17)
$$ \underset{\theta_y,B}{\min }{J}_{\mathrm{intra}}^y= KL\left(P\Big\Vert {Q}^y\right)=\sum \limits_{i=1}^n\sum \limits_{j=1,j\ne i}^n{p}_{ij}\log \frac{p_{ij}}{q_{ij}^y} $$
(18)
$$ \underset{\theta_x,{\theta}_y,B}{\min }{J}_{\mathrm{intra}}=\underset{\theta_x,{\theta}_y,B}{\min}\left({J}_{\mathrm{intra}}^x+{J}_{\mathrm{intra}}^y\right) $$
(19)
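The following NumPy sketch evaluates formulas (14)–(18) for one modality (illustrative only; the ±1 code convention, the off-diagonal normalization, and the function name are assumptions):

```python
import numpy as np

def intra_modal_kl(A, codes):
    """KL(P || Q) of formulas (14)-(18) for one modality.

    A:     (n, n) label affinity matrix, A_ij = l_i^T l_j.
    codes: (n, m) +/-1 binary hash codes of that modality.
    """
    n, m = codes.shape
    off = ~np.eye(n, dtype=bool)                 # exclude i == j terms
    P = A / A[off].sum()                         # target distribution, formula (14)
    D = 0.5 * (m - codes @ codes.T)              # Hamming distances from +/-1 codes
    expD = np.exp(-D)
    Q = expD / expD[off].sum()                   # modality distribution, formulas (15)/(16)
    mask = off & (P > 0)                         # KL over off-diagonal pairs, (17)/(18)
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
```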

We optimize this objective in the same way as in the STGN, i.e., by stochastic gradient descent and back-propagation, to train the image and text networks. The training procedure is summarized in Algorithm 1.


5 Experiments

5.1 Datasets

Two standard cross-modal datasets are used to evaluate cross-modal retrieval between text and image data: NUS-WIDE [2] and MIRFlickr-25K [10].

NUS-WIDE contains 269,648 web images associated with textual tags. We choose 10 frequent concepts, which yields 186,577 image-text pairs, and randomly select 105,000 pairs for the training set and 81,577 pairs for the test set.

MIRFlickr-25K contains 25,000 image-text pairs. In our experiments, we keep only the samples associated with at least 20 textual tags, leaving 20,015 image-text pairs. We randomly take 15,000 pairs for the training set and 5015 pairs for the test set.

5.2 Implementation details

We compare SPHMF against five shallow cross-modal hashing methods, CCQ [12], CVH [28], SCM_seq [11], CMSSH [3], and SePH [4], and one deep cross-modal hashing method, DCMH [30]. For the text network, each text sample is converted into a 1000-dimensional bag-of-words vector as input. For the image network, we use deep features extracted by a VGG-Net [9] pretrained on ImageNet [26] as the image input of all shallow cross-modal methods, and images of the same size as the input of all deep cross-modal methods.

For the proposed SPHMF, the pretrained VGG-Net model is used to initialize the first five convolutional blocks of the image feature network. The hyper-parameters η, β, and γ are empirically set to 1, 1, and 10−4, respectively; they are discussed in Section 5.4.1. The learning rate is chosen from the range 10−4 to 10−8, and the batch size is set to 128.

5.3 Evaluation protocol

We use mean average precision (MAP) and precision-recall (PR) curves to evaluate the performance of all algorithms.

MAP: MAP is the mean of the average precision (AP) over all queries; a higher MAP indicates better performance. AP is calculated as follows:

$$ AP=\frac{1}{R}\sum_{k=1}^{n}\frac{R_k}{k}\,{rel}_k $$
(20)

where n is the total number of returned results, R is the number of relevant items in the retrieved set, Rk is the number of relevant items among the top k returned results, and relk indicates whether the k-th result is relevant: relk = 1 if it is relevant and relk = 0 otherwise.
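A short NumPy sketch of this computation (the function name and the relevance-list input are illustrative) is:

```python
import numpy as np

def average_precision(relevance):
    """AP of formula (20) for one query.

    relevance: binary array rel_k over the ranked returned list, length n.
    """
    relevance = np.asarray(relevance, dtype=float)
    R = relevance.sum()
    if R == 0:
        return 0.0
    k = np.arange(1, len(relevance) + 1)
    precision_at_k = np.cumsum(relevance) / k   # R_k / k
    return (precision_at_k * relevance).sum() / R

# MAP is the mean of AP over all queries, e.g.:
# mAP = np.mean([average_precision(rel) for rel in all_query_relevance_lists])
```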

Precision-Recall: precision reflects retrieval accuracy, while recall reflects the completeness of the retrieval. PR curves are commonly used in information retrieval to evaluate retrieval performance.

5.4 Cross-modal retrieval results

5.4.1 Parameter sensitivity evaluation results

We study the effect of the hyper-parameters η, β, and γ on the retrieval results on MIRFlickr-25K with a hash code length of 64 bits. Figure 2(a) shows the effect of η for values between 0.0001 and 2, Fig. 2(b) the effect of β, and Fig. 2(c) the effect of γ over the same range. SPHMF is sensitive to the hyper-parameter η and achieves its best retrieval results when η = 1, β = 1, and γ = 10−4.

Fig. 2 MAP values with different hyper-parameters

5.4.2 Hash retrieval task

In the experiments, the STGN is trained first, which yields the semantics-preserving features F(s) and semantics-preserving hash codes H(s) of the training data. We evaluate H(s) on the NUS-WIDE and MIRFlickr-25K datasets and report the MAP values for different code lengths in Table 3. The comparison shows that the 64-bit hash code can be regarded as a near-ideal hash code.

Table 3 The MAP values of the label semantic hash code in different bits

We use F(s) and H(s) to guide the training of IMFN and TEFN. After training IMFN and TEFN, we compute MAP values and PR curves for two retrieval tasks, image-to-text (I → T) and text-to-image (T → I), on the two datasets. As shown in Tables 4 and 5, with 64-bit hash codes SPHMF improves the MAP of the I → T task from 0.7364 to 0.7501 and from 0.6297 to 0.6391 on the two datasets. For the T → I task, the MAP increases from 0.7787 to 0.7845 and from 0.6089 to 0.6152 with 64-bit hash codes.

Table 4 Comparison of MAP values of I → T and T → I on MIRFlickr-25 K dataset
Table 5 Comparison of MAP values of I → T and T → I on NUS-WIDE dataset

The corresponding precision-recall curves on the two datasets are plotted in Figs. 3 and 4. As the figures show, the proposed algorithm achieves higher precision than the comparison methods at most recall levels.

Fig. 3 Precision-Recall curves (MIRFlickr-25K dataset, 64-bit hash codes)

Fig. 4 Precision-Recall curves (NUS-WIDE dataset, 64-bit hash codes)

The above analysis shows that SPHMF has clear advantages. Compared with unsupervised methods, we use label information for supervised training, and this label information provides the original data relationships for hash code learning. Compared with supervised methods, we exploit the complementary information and correlation among multi-scale features, making full use of the multi-scale information of images and alleviating the sparsity of text bag-of-words vectors. In addition, the inter-modal loss provides a more accurate similarity judgment for data from different modalities, and the intra-modal loss makes the output hash codes preserve the global latent semantic correlations. The retrieval accuracy is thereby improved.

5.4.3 Comparison of training time

We compare DCMH and SPHMF on the MIRFlickr-25K dataset to assess the training efficiency of SPHMF. As shown in Fig. 5, SPHMF trains faster than DCMH and achieves higher MAP values on the retrieval tasks.

Fig. 5 Training efficiency of SPHMF and DCMH

5.4.4 Impact analysis of each loss function

The objective function consists of three parts: the intra-modal loss, the inter-modal loss, and the pairwise loss. To analyze the influence of each loss term on the final retrieval results, we run ablation experiments with three variants: the objective includes only the intra-modal and inter-modal losses (SPHMF-1); only the intra-modal and pairwise losses (SPHMF-2); only the inter-modal and pairwise losses (SPHMF-3).

The experimental results of the different variants on the two datasets are shown in Table 6. The MAP values, from largest to smallest, are SPHMF, SPHMF-3, SPHMF-1, and SPHMF-2. This indicates that the inter-modal loss has the greatest influence on the final retrieval results, followed by the pairwise loss and then the intra-modal loss, while combining all three losses gives the best retrieval performance.

Table 6 The impact of different loss on MAP values (MIRFlickr-25 K dataset and NUS-WIDE dataset 64 bits)

5.4.5 Compare IPMSF pooling and SPP pooling

To verify the effectiveness of IPMSF, we replace the IPMSF model in the network of Fig. 1 with the SPP model, keep all other settings identical, and run comparative experiments. Table 7 reports the results.

Table 7 Comparison of MAP values of different pooling methods (MIRFlickr-25 K dataset and NUS-WIDE dataset 64 bits)

Table 7 shows that the IPMSF model slightly improves retrieval performance compared with the SPP model, although the difference is small. However, Table 8 shows that the IPMSF-based network occupies less space than the SPP-based network, because IPMSF reduces the number of model parameters compared with SPP.

Table 8 Comparison of the space between the trained IPMSF and SPP models

6 Conclusions

This paper proposes a semantics-preserving hashing method based on multi-scale fusion for cross-modal retrieval, called SPHMF. SPHMF supervises both the image feature training network and the text feature training network with cross-modal label information. For these networks, multi-scale fusion pooling models are proposed to extract multi-scale information from the data, and the loss function combines an intra-modal loss built from a multi-level semantic affinity matrix with an inter-modal loss and a pairwise loss. As a result, the hash codes learned by SPHMF better preserve the original information of the modal data. Experiments on the NUS-WIDE and MIRFlickr-25K datasets verify the effectiveness of SPHMF. However, this paper only explores retrieval between images and text. In future work, we will further improve the retrieval algorithm and apply it to multimedia data of more modalities, including image, text, audio, and video.