1 Introduction

Developments in information technology have led to an explosive growth of multimedia data. At the same time, people increasingly expect information search to return diverse results. As a consequence, research on multimedia data analysis and cross-modal retrieval has grown rapidly [18, 23, 33, 19, 31, 20, 32, 15]. Cross-modal retrieval refers to accurately and quickly retrieving all relevant data of other modalities given a query from one modality.

Hash learning is widely used in cross-modal retrieval models [27, 21, 29, 1] because of its low storage cost and efficient retrieval. Over the past few decades, many hashing methods have been proposed for single-modal retrieval [25, 22, 16, 14, 8, 13, 35]. However, these methods are not suitable for cross-modal hashing retrieval because of the semantic gap between data in different modalities. Most existing cross-modal hashing methods [34] bridge this gap by mining the correlations between different modalities. The main cross-modal hashing methods can be divided into two categories: deep cross-modal hashing methods [5, 30, 7, 17, 24] and shallow cross-modal hashing methods [28, 12, 3, 11, 4]. Shallow cross-modal hashing methods map each sample into a binary code based on hand-crafted features and learn the hash function on top of them. However, such hash functions cannot express the underlying characteristics of the samples, and the retrieval performance is not ideal. In contrast, deep cross-modal hashing methods use the feature extraction capability of deep learning to learn effective representations of different modalities, which overcomes the limited expressive power of hand-crafted features. In addition, these methods can integrate feature learning into the hash code learning process, ensuring the quality of the hash codes and yielding better retrieval performance.

To date, there has been a great deal of research on deep cross-modal hashing retrieval, but existing methods ignore two properties of cross-modal data that contribute to retrieval accuracy; exploiting them is the motivation of this paper. First, different scales of single-modal data contain different semantic information. Second, when judging the similarity between cross-modal data, most deep cross-modal hashing methods treat two cases in the same way: the case where different modal data share only one label and the case where they share more than one. They ignore the fact that the degree of similarity between different modal data is related to the number of labels they share.

Motivated by these observations, this paper proposes Semantic-Preserving Hashing based on Multi-scale Fusion (SPHMF). The framework of SPHMF is shown in Fig. 1. First, an Image Pooling Model for Multi-scale Fusion (IPMSF) and a Text Pooling Model for Multi-scale Fusion (TPMSF) are used to extract multi-scale feature information from the data of each modality. Second, the label information of image-text pairs is used to train a self-supervised network [17] in order to better mine the relevance between images and texts. Finally, when constructing the loss function, we use multi-level similarity information of image-text pairs to build the intra-modal loss; the loss function also includes a pairwise loss and an inter-modal loss.

Fig. 1 The framework of SPHMF

The remainder of this paper is structured as follows. We summarize work related to cross-modal retrieval (Section 2), present the proposed deep learning architecture (Section 3), describe the construction of the loss function (Section 4), discuss the experiments and results (Section 5), and draw conclusions (Section 6).

2 Related works

As mentioned above, the proposed SPHMF is a cross-modal hashing retrieval method based on deep learning. We therefore review both shallow and deep cross-modal hashing methods.

Most existing shallow cross-modal hashing methods separate feature learning from hash code learning, which leads to unsatisfactory retrieval results. Typical methods of this kind include CVH [28], CCQ [12], CMSSH [3], SCM [11], and SePH [4]. CVH considers both intra-view and inter-view similarity. CCQ jointly learns correlation-maximal mappings and composite quantizers, converting multimedia data into binary codes through an isomorphic latent space. CMSSH is a supervised cross-modal hashing method that models hash learning through a boosting-based classification paradigm. SCM uses label information to build a semantic similarity matrix for learning the hash function. SePH converts the semantic matrix into a probability distribution and learns the Hamming space by minimizing its divergence from the distribution defined over Hamming distances.

Deep cross-modal hashing methods use deep learning frameworks to learn the hash functions and can effectively capture the non-linear correlations between cross-modal instances. Typical methods of this kind include CMNNH [5], DCMH [30], PRDH [29], ACMR [7], SSAH [17], and MCSCH [24]. CMNNH learns the hash function within a deep learning framework by maintaining intra-modal relationships and inter-modal pairwise correspondence. DCMH performs feature learning and hash function learning simultaneously. PRDH guides hash code learning by constructing different pairwise losses. ACMR discriminates between the data of different modalities and learns binary hash codes through adversarial learning and classification. SSAH uses a self-supervised network to generate semantic information from labels and uses this information to guide the feature learning of each modality. MCSCH uses multi-scale features to guide sequential hash learning, enhancing the diversity of hash codes and promoting the learning of hash functions.

The related works above have achieved good results, and supervised methods generally outperform unsupervised ones. For supervised methods, the key is to use limited training data and supervision to learn semantic information that preserves the neighborhood relations of the original data. The self-supervised network in SSAH can build semantic associations between multimedia data and mine richer information from labels, so our method also adopts such a label self-supervised network. SPHMF differs from the above methods in two ways: it extracts semantic information at different scales from single-modal data and integrates it into the feature learning process, and it constructs the intra-modal loss from a multi-level semantic affinity matrix in addition to the inter-modal loss and the pairwise loss.

3 Deep learning framework based on multi-scale fusion

As shown in Fig. 1, the overall network structure contains three parts: the Image Feature Training Network (IMFN), the Text Feature Training Network (TEFN), and the Semantic Tag Generation Network (STGN). The output of the STGN is used to guide the training of IMFN and TEFN. Once IMFN and TEFN are trained, the hash functions of the image and text modalities are obtained. These hash functions produce a hash code for each data item, and cross-modal retrieval is completed by computing and sorting Hamming distances, as illustrated below.
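To make the retrieval step concrete, the following minimal NumPy sketch (not part of the original paper) ranks database items of one modality by Hamming distance to a query code from the other modality; the function name and the ±1 code convention are illustrative assumptions.

```python
import numpy as np

def hamming_retrieval(query_code, database_codes):
    """Rank database items by Hamming distance to a query hash code.

    query_code: (m,) array of +/-1 bits from one modality (e.g., an image).
    database_codes: (N, m) array of +/-1 bits from the other modality (e.g., texts).
    Returns database indices sorted from nearest to farthest.
    """
    m = query_code.shape[0]
    # For +/-1 codes, Hamming distance = (m - inner product) / 2.
    dists = 0.5 * (m - database_codes @ query_code)
    return np.argsort(dists)

# Toy usage: retrieve texts for one image query with 64-bit codes.
rng = np.random.default_rng(0)
img_code = np.sign(rng.standard_normal(64))
txt_codes = np.sign(rng.standard_normal((1000, 64)))
ranking = hamming_retrieval(img_code, txt_codes)
```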

3.1 Image pooling model for multi-scale fusion

The overall structure of IMFN is shown in the upper part of Fig. 1. Considering that features at different scales of an image carry different semantics, we propose an Image Pooling Model for Multi-scale Fusion (IPMSF) for IMFN. IPMSF is based on the idea of Spatial Pyramid Pooling (SPP) [6]. The output of conv5 is taken as the input of each pooling layer in IPMSF. The pooling layers perform max pooling over regions of different scales, and the output vectors of the pooling layers are concatenated as the input of fc1 to complete the training of IMFN. This removes the fixed input-size restriction of traditional CNNs and avoids the unreliable feature learning caused by the resulting information loss. The settings of IPMSF are shown in Table 1. Unlike SPP, which directly concatenates the features of all scales, IPMSF first merges the features of the same scale and then concatenates the features of different scales, thereby reducing the number of network parameters. As shown in Section 5.4.5, the IPMSF model reduces the computational overhead of training while maintaining retrieval accuracy.

Table 1 Parameter settings for IPMSF model
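Since the exact layer settings of Table 1 are not reproduced here, the following PyTorch sketch shows one plausible reading of the IPMSF pooling described above (merge the bins within each scale, then concatenate across scales); the scale set, the averaging merge, and the function name are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn.functional as F

def ipmsf_pool(conv5_feat, scales=(1, 2, 4)):
    """Illustrative multi-scale fusion pooling (assumed interpretation of IPMSF).

    conv5_feat: (B, C, H, W) feature map from conv5.
    For each scale s, max-pool the map into an s x s grid, then merge the s*s
    bins of that scale into a single C-dim vector (here by averaging) before
    concatenating across scales. Plain SPP would instead concatenate every bin,
    giving a (1+4+16)*C-dim vector; merging first keeps the fc1 input at
    len(scales)*C and thus reduces the number of fc1 parameters.
    """
    merged = []
    for s in scales:
        pooled = F.adaptive_max_pool2d(conv5_feat, output_size=s)  # (B, C, s, s)
        merged.append(pooled.flatten(2).mean(dim=2))               # (B, C)
    return torch.cat(merged, dim=1)                                # (B, len(scales)*C)

# e.g. a 512-channel conv5 map: output is (B, 3*512) instead of SPP's (B, 21*512)
feat = torch.randn(8, 512, 13, 13)
fused = ipmsf_pool(feat)  # shape (8, 1536)
```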

3.2 Text pooling model for multi-scale fusion

Text data are usually represented by bag-of-words vectors, which are highly sparse. To address this, we design a Text Pooling Model for Multi-scale Fusion (TPMSF). First, multi-scale features of the text samples are extracted by pooling layers, and then the multiple features are fused through a convolutional layer. This process captures the relevance among the words of the text modality, which is very useful for modeling semantic relevance. The overall settings of TPMSF are shown in Table 2, where c is the number of class labels.

Table 2 Parameter settings for TPMSF model

The output of the TPMSF model is used as the input of TEFN. The architecture of TEFN is shown in the bottom part of Fig. 1.
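As Table 2 is not reproduced here, the following PyTorch sketch illustrates one possible form of the multi-scale pooling and convolutional fusion described above; the window sizes, the interpolation step, the output dimension, and the class name are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TPMSFSketch(nn.Module):
    """Illustrative text pooling model (assumed structure; Table 2 not reproduced).

    The bag-of-words vector is treated as a 1-D signal. Several average-pooling
    branches with different window sizes produce multi-scale views of the sparse
    BoW input; the branches are stacked as channels and fused by a 1x1 convolution.
    """
    def __init__(self, bow_dim=1000, windows=(1, 2, 5, 10), out_dim=4096):
        super().__init__()
        self.windows = windows
        # 1x1 convolution fuses the multi-scale channels back into one channel.
        self.fuse = nn.Conv1d(len(windows), 1, kernel_size=1)
        self.fc = nn.Linear(bow_dim, out_dim)

    def forward(self, bow):                      # bow: (B, bow_dim)
        x = bow.unsqueeze(1)                     # (B, 1, bow_dim)
        branches = []
        for w in self.windows:
            pooled = F.avg_pool1d(x, kernel_size=w, stride=1)
            # Interpolate back to the original length so branches can be stacked.
            branches.append(F.interpolate(pooled, size=bow.shape[1], mode="nearest"))
        multi = torch.cat(branches, dim=1)       # (B, len(windows), bow_dim)
        fused = self.fuse(multi).squeeze(1)      # (B, bow_dim)
        return torch.relu(self.fc(fused))        # input feature for TEFN
```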

3.3 Semantic tag generation network

In this paper, the STGN is used to extract the label semantic information of image-text pairs and to guide the training of IMFN and TEFN. The overall structure of the STGN is shown in the middle part of Fig. 1.

The STGN is trained with the class label information and the neighborhood relation matrix S. After the STGN is trained, the label semantic hash code H(s) and the label semantic features F(s) obtained from this network are used to guide the training of IMFN and TEFN. During training, the inner product between vectors represents the correlation between any two output features or any two hash codes, and a likelihood function models the inner product values under the supervision of S, as shown in formula (1):

$$ p\left({S}_{ij}\mid H\right)=\begin{cases}\operatorname{sig}\left({\theta}_{ij}\right), & {S}_{ij}=1\\ 1-\operatorname{sig}\left({\theta}_{ij}\right), & {S}_{ij}=0\end{cases} $$
(1)

where sig() denotes the sigmoid function, θij = 1/2⟨Hi, Hj⟩, Hi and Hj are the hash codes of a pair of samples output by the hash layer, Sij = 1 indicates that the two samples are similar, and Sij = 0 indicates that they are dissimilar.

Maximizing the likelihood function is equivalent to minimizing the negative log-likelihood, which yields:

$$ \min R=-\log p\left(S\mid H\right)=-\sum_{i,j}\left({S}_{ij}\left\langle {H}_i,{H}_j\right\rangle -\log \left(1+{e}^{\left\langle {H}_i,{H}_j\right\rangle}\right)\right) $$
(2)

Let θ denote all the parameters of the STGN. Applying formula (2) to all samples of F(s) and H(s) gives:

$$ {\displaystyle \begin{array}{c}\underset{\theta }{\mathit{\min}} Js=-\sum \limits_{i,j=1}^n\left({S}_{ij}\left\langle {F}_i,{F}_j\right\rangle -\mathit{\log}\left(1+{e}^{\left\langle {F}_i,{F}_j\right\rangle}\right)\right)\\ {}-\sum \limits_{i,j=1}^n\left({S}_{ij}\left\langle {H}_i,{H}_j\right\rangle -\mathit{\log}\left(1+{e}^{\left\langle {H}_i,{H}_j\right\rangle}\right)\right)\end{array}} $$
(3)

where Fi, Fj, Hi, Hj represent the features of the ith and jth groups and the hash codes of the ith and jth groups, respectively.

Since information is lost when the real-valued hash outputs are quantized into binary hash codes, a quantization error term is added to the objective, as follows:

$$ \alpha {\left\Vert {H}_i-\mathit{\operatorname{sign}}\left({H}_i\right)\right\Vert}_F^2 $$
(4)

So, the final objective function is:

$$ {\displaystyle \begin{array}{c}\underset{\theta }{\mathit{\min}} Js=-\sum \limits_{i,j=1}^n\left({S}_{ij}\left\langle {F}_i,{F}_j\right\rangle -\mathit{\log}\left(1+{e}^{\left\langle {F}_i,{F}_j\right\rangle}\right)\right)\\ {}-\sum \limits_{i,j=1}^n\left({S}_{ij}\left\langle {H}_i,{H}_j\right\rangle -\mathit{\log}\left(1+{e}^{\left\langle {H}_i,{H}_j\right\rangle}\right)\right)\\ {}+\alpha {\left\Vert {H}_i-\mathit{\operatorname{sign}}\left({H}_i\right)\right\Vert}_F^2\end{array}} $$
(5)

In this paper, the parameters θ of the STGN are learned by stochastic gradient descent and back-propagation.
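For illustration, the objective of formula (5) can be written compactly as in the following PyTorch sketch (not the paper's implementation; the function name, the softplus formulation of log(1 + e^x), and the batch-level matrix form are assumptions); it can then be minimized with a standard optimizer such as torch.optim.SGD, as described above.

```python
import torch
import torch.nn.functional as F

def stgn_objective(F_s, H_s, S, alpha=1.0):
    """Negative log-likelihood of formula (5) plus the quantization term of (4).

    F_s: (n, d) label semantic features, H_s: (n, m) real-valued hash outputs,
    S:   (n, n) binary similarity matrix; all names are illustrative.
    """
    theta_f = F_s @ F_s.t()                       # pairwise inner products of features
    theta_h = H_s @ H_s.t()                       # pairwise inner products of hash outputs
    nll_f = -(S * theta_f - F.softplus(theta_f)).sum()    # softplus(x) = log(1 + e^x)
    nll_h = -(S * theta_h - F.softplus(theta_h)).sum()
    quant = alpha * (H_s - torch.sign(H_s)).pow(2).sum()  # quantization error of (4)
    return nll_f + nll_h + quant
```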

4 Learning cross-modal hash functions

Suppose there are n training data points, each consisting of an image-text pair. Let X = {x_i} (i = 1, …, n) denote the high-dimensional original image data and Y = {y_j} (j = 1, …, n) denote the corresponding text data describing the images. Each pair has a class label vector l_i = (l_i1, l_i2, …, l_ic), where c is the number of categories. In the label semantic learning part, the class label matrix L of size n × c, whose rows l_1, l_2, …, l_n are c-dimensional binary vectors, is used to construct a similarity matrix S, where S_ij = 1 means x_i is similar to y_j and S_ij = 0 means they are dissimilar. In the hash function part, the length of the output hash code is m.
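As a small illustration (assuming, as is common in this setting although not stated explicitly in the text, that a pair is marked similar when it shares at least one label), S and the affinity matrix A used later in Section 4.3 can be built from L as follows:

```python
import numpy as np

# Toy label matrix L (n x c): each row is a c-dimensional multi-label vector l_i.
L = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 1, 0]], dtype=np.float32)

A = L @ L.T                      # A_ij = l_i^T l_j, the multi-level affinity of Section 4.3
S = (A > 0).astype(np.float32)   # assumed rule: S_ij = 1 iff the pair shares a label
```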

Given the image data, the text data, and the similarity matrix S, the goal of SPHMF is to learn a hash code B that preserves similarity. When constructing the overall objective, the image and text features together with the semantic feature F(s) form the pairwise loss, which conveys the neighborhood relations encoded in the labels, while the hash codes of the modal data form the inter-modal and intra-modal losses, which preserve the global and local semantic structure. The overall objective function is therefore:

$$ J={J}_{se}+\eta {J}_{inter}+\beta {J}_{intra} $$
(6)

where Jse is the pairwise loss, Jinter is the inter-modal loss, and Jintra is the intra-modal loss. η and β balance the impact of each term.

4.1 Pairwise loss

For the features of the image and text modalities, the inner product is used to represent the similarity relationship, and the pairwise loss transfers the nearest-neighbor relations of F(s), as shown in formulas (7) and (8):

$$ \underset{\theta x}{\mathit{\min}}{J}_{se}^x=-\sum \limits_{i,j=1}^n\left({S}_{ij}<{F}_i^{(s)},{Z}_j^x>-\mathit{\log}\left(1+{e}^{<{F}_i^{(s)},{Z}_j^x>}\right)\right) $$
(7)
$$ \underset{\theta_y}{\mathit{\min}}{J}_{se}^y=-\sum \limits_{i,j=1}^n\left({S}_{ij}<{F}_i^{(s)},{Z}_j^y>-\mathit{\log}\left(1+{e}^{<{F}_i^{(s)},{Z}_j^y>}\right)\right) $$
(8)

where Jsex and Jsey are the pairwise losses for the image and text samples, respectively, and ⟨Fi(s), Zxj⟩ and ⟨Fi(s), Zyj⟩ are inner products that measure the similarity between the image or text features and the semantics-preserving features. Zxj and Zyj are the feature representations of the jth image sample and the jth text sample, respectively, and θx and θy are the parameters of the image network and the text network.
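This loss has the same negative log-likelihood form as the STGN objective; a minimal PyTorch sketch (names are illustrative) is:

```python
import torch
import torch.nn.functional as F

def pairwise_loss(F_s, Z, S):
    """Formula (7)/(8): semantic features F_s (n, d) vs. image or text features Z (n, d)."""
    theta = F_s @ Z.t()                           # <F_i^(s), Z_j^x> or <F_i^(s), Z_j^y>
    return -(S * theta - F.softplus(theta)).sum()
```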

4.2 Inter-modal loss

The hash codes of the image and text modalities are each combined with the label hash code to construct a cross-entropy loss, the inter-modal loss, for learning the hash functions. In this way, the label information is embedded in the modal data and the learned hash codes are drawn closer to the ideal hash codes.

$$ {\displaystyle \begin{array}{c}\underset{\theta_x,{\theta}_y}{\min }{J}_{inter}=-\frac{1}{n}\sum \limits_{i,j=1}^n\left({H}^{(s)}\log \left(\sigma \left({H}^{(x)}\right)\right)+\left(1-{H}^{(s)}\right)\log \left(1-\sigma \left({H}^{(x)}\right)\right)\right)\\ {}-\frac{1}{n}\sum \limits_{i,j=1}^n\left({H}^{(s)}\log \left(\sigma \left({H}^{(y)}\right)\right)+\left(1-{H}^{(s)}\right)\log \left(1-\sigma \left({H}^{(y)}\right)\right)\right)\end{array}} $$
(9)

where σ() denotes the sigmoid function, n is the number of training samples, H(s) is the label semantic hash code, and H(x) and H(y) are the hash codes output by IMFN and TEFN. Because IMFN and TEFN are trained separately, a cross-modal adaptive constraint needs to be added, as shown in the following formula:

$$ \underset{\theta_x,{\theta}_y,B}{\min}\left({\left\Vert {B}^{(x)}-{H}^{(x)}\right\Vert}_F^2+{\left\Vert {B}^{(y)}-{H}^{(y)}\right\Vert}_F^2\right) $$
(10)

Here B(x) and B(y) are the binary hash codes of the images and texts; this constraint reduces the information loss caused by quantizing the hash codes. In addition, to obtain better performance, we set B = B(x) = B(y), so:

$$ \underset{\theta_x,{\theta}_y,B}{\mathit{\min}}\left({\left\Vert B-{H}^{(x)}\right\Vert}_F^2+{\left\Vert B-{H}^{(y)}\right\Vert}_F^2\right) $$
(11)

Therefore, the inter-modal loss function is formula (12), where γ balances the impact of the cross-modal adaptive constraint term.

$$ {\displaystyle \begin{array}{c}\underset{\theta_x,{\theta}_y,B}{\min }{J}_{inter}=-\frac{1}{n}\sum \limits_{i,j=1}^n\left({H}^{(s)}\log \left(\sigma \left({H}^{(x)}\right)\right)+\left(1-{H}^{(s)}\right)\log \left(1-\sigma \left({H}^{(x)}\right)\right)\right)\\ {}-\frac{1}{n}\sum \limits_{i,j=1}^n\left({H}^{(s)}\log \left(\sigma \left({H}^{(y)}\right)\right)+\left(1-{H}^{(s)}\right)\log \left(1-\sigma \left({H}^{(y)}\right)\right)\right)\\ {}+\gamma \left({\left\Vert B-{H}^{(x)}\right\Vert}_F^2+{\left\Vert B-{H}^{(y)}\right\Vert}_F^2\right)\end{array}} $$
(12)
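A minimal PyTorch sketch of formula (12) is given below (illustrative only; it assumes H(s) has been rescaled to {0, 1} so that it can serve as a cross-entropy target, and the function name and reduction are assumptions):

```python
import torch
import torch.nn.functional as F

def inter_modal_loss(H_s01, H_x, H_y, B, gamma=1.0):
    """Formula (12): cross-entropy against the label code plus quantization to B.

    H_s01: (n, m) label hash codes rescaled to {0, 1}; H_x, H_y: real-valued hash
    outputs of IMFN and TEFN; B: the shared binary code (+/-1). Names illustrative.
    """
    n = H_s01.shape[0]
    ce_x = F.binary_cross_entropy_with_logits(H_x, H_s01, reduction="sum") / n
    ce_y = F.binary_cross_entropy_with_logits(H_y, H_s01, reduction="sum") / n
    quant = gamma * ((B - H_x).pow(2).sum() + (B - H_y).pow(2).sum())
    return ce_x + ce_y + quant
```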

4.3 Intra-modal loss

The global semantic similarity matrix A of the training instances is used as supervision for learning a globally semantics-preserving hash code in each modality; this yields the intra-modal loss. The element Aij is defined as:

$$ {A}_{\mathrm{i}\mathrm{j}}={l}_{\mathrm{i}}^T{l}_j $$
(13)

From formula (13) we obtain the complete matrix A, from which the joint probability distribution P is derived. Its element Pij is:

$$ {p}_{ij}=\frac{A_{ij}}{\sum_{i=1}^n{\sum}_{j=1,j\ne i}^n{A}_{ij}} $$
(14)

The Hamming distances between the hash codes B(x) (B(y)) output by the image (text) network are then used to compute the probability distribution Qx (Qy) within the image (text) modality. Let qijx denote the similarity between two image instances and qijy the similarity between two text instances:

$$ {q}_{ij}^x=\frac{e^{-{d}_H\left({b}_i^x,{b}_j^x\right)}}{\sum \limits_{k=1}^n\sum \limits_{m=1,m\ne k}^n{e}^{-{d}_H\left({b}_k^x,{b}_m^x\right)}} $$
(15)
$$ {q}_{ij}^y=\frac{e^{-{d}_H\left({b}_i^y,{b}_j^y\right)}}{\sum \limits_{k=1}^n\sum \limits_{m=1,m\ne k}^n{e}^{-{d}_H\left({b}_k^y,{b}_m^y\right)}} $$
(16)

where dH() is the Hamming distance between two instances, bix and bjx are the binary codes of two image samples, and biy and bjy are the binary codes of two text samples.

We use the KL divergence between the two probability distributions, P and Qx (Qy), as the intra-modal loss for the image (text) modality, as shown in the following formulas:

$$ \underset{\theta_x,B}{\min }{J}_{\mathrm{intra}}^x= KL\left(P\Big\Vert {Q}^x\right)=\sum \limits_{i=1}^n\sum \limits_{j=1,j\ne i}^n{p}_{ij}\log \frac{p_{ij}}{q_{ij}^x} $$
(17)
$$ \underset{\theta_y,B}{\min }{J}_{\mathrm{intra}}^y= KL\left(P\Big\Vert {Q}^y\right)=\sum \limits_{i=1}^n\sum \limits_{j=1,j\ne i}^n{p}_{ij}\log \frac{p_{ij}}{q_{ij}^y} $$
(18)
$$ \underset{\theta_x,{\theta}_y,B}{\min }{J}_{\mathrm{intra}}=\underset{\theta_x,{\theta}_y,B}{\min}\left({J}_{\mathrm{intra}}^x+{J}_{\mathrm{intra}}^y\right) $$
(19)
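The following NumPy sketch evaluates formulas (14)–(18) for one modality (illustrative only; the ±1 code convention, the off-diagonal normalization, and the function name are assumptions):

```python
import numpy as np

def intra_modal_kl(A, codes):
    """KL(P || Q) of formulas (14)-(18) for one modality.

    A:     (n, n) label affinity matrix, A_ij = l_i^T l_j.
    codes: (n, m) +/-1 binary hash codes of that modality.
    """
    n, m = codes.shape
    off = ~np.eye(n, dtype=bool)                 # exclude i == j terms
    P = A / A[off].sum()                         # target distribution, formula (14)
    D = 0.5 * (m - codes @ codes.T)              # Hamming distances from +/-1 codes
    expD = np.exp(-D)
    Q = expD / expD[off].sum()                   # modality distribution, formulas (15)/(16)
    mask = off & (P > 0)                         # KL over off-diagonal pairs, (17)/(18)
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
```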

We optimize this objective in the same way as in the STGN, i.e., by stochastic gradient descent and back-propagation, to train the image and text networks. The training procedure is summarized in Algorithm 1.


5 Experiments

5.1 Datasets

Two standard cross-modal datasets are used to evaluate cross-modal retrieval between text and image data: NUS-WIDE [2] and MIRFlickr-25K [10].

NUS-WIDE contains 269,648 web images associated with textual tags. We choose 10 frequent concepts, which yields 186,577 image-text pairs, and randomly select 105,000 pairs for the training set and 81,577 pairs for the test set.

MIRFlickr-25K contains 25,000 image-text pairs. In our experiments, we keep only the samples associated with at least 20 textual tags, leaving 20,015 image-text pairs. We randomly take 15,000 pairs for the training set and 5015 pairs for the test set.

5.2 Implementation details

We compare SPHMF against five shallow cross-modal hashing methods, CCQ [12], CVH [28], SCM_seq [11], CMSSH [3], and SePH [4], and one deep cross-modal hashing method, DCMH [30]. For the text network, each text sample is converted into a 1000-dimensional bag-of-words vector as input. For the image network, we use deep features extracted by a VGG-Net [9] pretrained on ImageNet [26] as the image input of all shallow cross-modal methods, and images of the same size as the input of all deep cross-modal methods.

For the proposed SPHMF, the pretrained VGG-Net model is used to initialize the first five convolutional blocks of the image feature network. The hyper-parameters η, β, and γ are empirically set to 1, 1, and 10−4, respectively; they are discussed in Section 5.4.1. The learning rate is chosen from the range 10−4 to 10−8, and the batch size is set to 128.

5.3 Evaluation protocol

We use mean average precision (MAP) and precision-recall (PR) curves to evaluate the performance of all algorithms.

MAP: MAP is the mean of the average precision (AP) over all queries; a higher MAP indicates better performance. AP is calculated as follows:

$$ AP=\frac{1}{R}\sum_{k=1}^{n}\frac{R_k}{k}\,{rel}_k $$
(20)

where n is the total number of returned results, R is the number of relevant items in the retrieved set, Rk is the number of relevant items among the top k returned results, and relk indicates whether the k-th result is relevant: relk = 1 if it is relevant and relk = 0 otherwise.
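A short NumPy sketch of this computation (the function name and the relevance-list input are illustrative) is:

```python
import numpy as np

def average_precision(relevance):
    """AP of formula (20) for one query.

    relevance: binary array rel_k over the ranked returned list, length n.
    """
    relevance = np.asarray(relevance, dtype=float)
    R = relevance.sum()
    if R == 0:
        return 0.0
    k = np.arange(1, len(relevance) + 1)
    precision_at_k = np.cumsum(relevance) / k   # R_k / k
    return (precision_at_k * relevance).sum() / R

# MAP is the mean of AP over all queries, e.g.:
# mAP = np.mean([average_precision(rel) for rel in all_query_relevance_lists])
```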

Precision-Recall: precision reflects retrieval accuracy, while recall reflects the completeness of the retrieval. PR curves are commonly used in information retrieval to evaluate retrieval performance.

5.4 Cross-modal retrieval results

5.4.1 Parameter sensitivity evaluation results

We study the effect of the hyper-parameters η, β, and γ on the retrieval results on MIRFlickr-25K with a hash code length of 64 bits. Figure 2(a) shows the effect of η for values between 0.0001 and 2, Fig. 2(b) the effect of β, and Fig. 2(c) the effect of γ over the same range. SPHMF is sensitive to the hyper-parameter η and achieves its best retrieval results when η = 1, β = 1, and γ = 10−4.

Fig. 2 MAP values with different hyper-parameters

5.4.2 Hash retrieval task

In the experiments, the STGN is trained first, which yields the semantics-preserving features F(s) and semantics-preserving hash codes H(s) of the training data. We evaluate H(s) on the NUS-WIDE and MIRFlickr-25K datasets and report the MAP values for different code lengths in Table 3. The comparison shows that the 64-bit hash code can be regarded as a near-ideal hash code.

Table 3 The MAP values of the label semantic hash code in different bits

We use F(s) and H(s) to guide the training of IMFN and TEFN. After training IMFN and TEFN, we compute MAP values and PR curves for two retrieval tasks, image-to-text (I → T) and text-to-image (T → I), on the two datasets. As shown in Tables 4 and 5, with 64-bit hash codes SPHMF improves the MAP of the I → T task from 0.7364 to 0.7501 and from 0.6297 to 0.6391 on the two datasets. For the T → I task, the MAP increases from 0.7787 to 0.7845 and from 0.6089 to 0.6152 with 64-bit hash codes.

Table 4 Comparison of MAP values of I → T and T → I on MIRFlickr-25 K dataset
Table 5 Comparison of MAP values of I → T and T → I on NUS-WIDE dataset

The corresponding precision-recall curves on the two datasets are plotted in Figs. 3 and 4. As the figures show, the proposed algorithm achieves higher precision than the comparison methods at most recall levels.

Fig. 3 Precision-Recall curves (MIRFlickr-25K dataset, 64-bit hash codes)

Fig. 4 Precision-Recall curves (NUS-WIDE dataset, 64-bit hash codes)

The above analysis shows that SPHMF has clear advantages. Compared with unsupervised methods, we use label information for supervised training, and this label information provides the original data relationships for hash code learning. Compared with supervised methods, we exploit the complementary information and correlation among multi-scale features, making full use of the multi-scale information of images and alleviating the sparsity of text bag-of-words vectors. In addition, the inter-modal loss provides a more accurate similarity judgment for data from different modalities, and the intra-modal loss makes the output hash codes preserve the global latent semantic correlations. The retrieval accuracy is thereby improved.

5.4.3 Comparison of training time

We compare DCMH and SPHMF on the MIRFlickr-25K dataset to assess the training efficiency of SPHMF. As shown in Fig. 5, SPHMF trains faster than DCMH and achieves higher MAP values on the retrieval tasks.

Fig. 5 Training efficiency of SPHMF and DCMH

5.4.4 Impact analysis of each loss function

The objective function consists of three parts: the intra-modal loss, the inter-modal loss, and the pairwise loss. To analyze the influence of each loss term on the final retrieval results, we run ablation experiments with three variants: the objective includes only the intra-modal and inter-modal losses (SPHMF-1); only the intra-modal and pairwise losses (SPHMF-2); only the inter-modal and pairwise losses (SPHMF-3).

The experimental results of the different variants on the two datasets are shown in Table 6. The MAP values, from largest to smallest, are SPHMF, SPHMF-3, SPHMF-1, and SPHMF-2. This indicates that the inter-modal loss has the greatest influence on the final retrieval results, followed by the pairwise loss and then the intra-modal loss, while combining all three losses gives the best retrieval performance.

Table 6 The impact of different loss on MAP values (MIRFlickr-25 K dataset and NUS-WIDE dataset 64 bits)

5.4.5 Compare IPMSF pooling and SPP pooling

To verify the effectiveness of IPMSF, we replace the IPMSF model in the network of Fig. 1 with the SPP model, keep all other settings identical, and run comparative experiments. Table 7 reports the results.

Table 7 Comparison of MAP values of different pooling methods (MIRFlickr-25 K dataset and NUS-WIDE dataset 64 bits)

Table 7 shows that the IPMSF model slightly improves retrieval performance compared with the SPP model, although the difference is small. However, Table 8 shows that the IPMSF-based network occupies less space than the SPP-based network, because IPMSF reduces the number of model parameters compared with SPP.

Table 8 Comparison of the space between the trained IPMSF and SPP models

6 Conclusions

This paper proposes a semantics-preserving hashing method based on multi-scale fusion for cross-modal retrieval, called SPHMF. SPHMF supervises both the image feature training network and the text feature training network with cross-modal label information. For these networks, multi-scale fusion pooling models are proposed to extract multi-scale information from the data, and the loss function combines an intra-modal loss built from a multi-level semantic affinity matrix with an inter-modal loss and a pairwise loss. As a result, the hash codes learned by SPHMF better preserve the original information of the modal data. Experiments on the NUS-WIDE and MIRFlickr-25K datasets verify the effectiveness of SPHMF. However, this paper only explores retrieval between images and text. In future work, we will further improve the retrieval algorithm and apply it to multimedia data of more modalities, including image, text, audio, and video.