1 Introduction

With the explosive growth in data volumes, approximate nearest neighbor (ANN) search in large databases has become increasingly important and has attracted considerable research attention in many fields, including information retrieval, computer vision, and data mining. ANN search attempts to find an approximate nearest neighbor for a query point in a large database; hashing is one of the primary technologies for ANN search, as it can achieve better performance than other methods in ANN search applications [3, 17, 22, 31, 33].

Hashing attempts to convert media items, such as images and videos, into sets of short binary codes, called hash codes, that preserve the semantic similarity among the data in the original space. With these binary hash codes, users can readily perform ANN search on a large-scale dataset because pairwise or triplet-wise comparison of binary codes in the Hamming space is highly efficient [36]. Existing hashing techniques fall into two primary categories: data-independent methods [4, 12, 13] and data-dependent methods [32, 34, 55, 59]. Methods in the first category use no data for training; instead, they rely on random projections to extract the feature representation. Locality sensitive hashing (LSH) [12] and its variants [4, 13] are the best-known data-independent algorithms. The main idea behind the LSH family is that a hash function returns the same bit for points that are neighbors in the original space with high probability. LSH enjoys interesting theoretical properties and a performance guarantee. However, LSH-based methods have a major limitation: they typically require longer hash codes to achieve better retrieval precision, which reduces retrieval recall. Methods based on multiple hash tables can partially alleviate this problem, but they inevitably increase storage cost and retrieval time.
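For intuition, the random-projection idea behind LSH can be illustrated with a minimal sign-random-projection sketch (our own illustration, not the exact scheme of [12]; all names are ours):

```python
import numpy as np

def lsh_hash(X, n_bits=32, seed=0):
    """Hash the rows of X into n_bits binary codes via random hyperplanes:
    nearby points fall on the same side of each hyperplane with high probability."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_bits))  # random projection directions
    return (X @ W > 0).astype(np.uint8)

# toy check: close points share most bits, distant points do not
X = np.array([[1.0, 0.0], [0.99, 0.05], [-1.0, 0.3]])
codes = lsh_hash(X, n_bits=16)
print(np.sum(codes[0] != codes[1]), np.sum(codes[0] != codes[2]))  # Hamming distances
```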

The second category, called data-dependent hashing, uses data to train the model. Data-dependent hashing, also known as learning-to-hash or learning-based hashing, attempts to learn hash functions such that the nearest neighbors in the Hamming space approximate those in the original space [48, 58]. Analyzing the underlying characteristics of the data can lead to better retrieval performance. As a result, data-dependent hashing has become increasingly popular, as the learned compact hash codes enable effective and efficient search over massive amounts of data. Generally, data-dependent hashing methods are of two main types: unsupervised and supervised.

Unsupervised hashing methods employ unlabeled data in the training process: the hash functions are learned mainly from the structure of the original data and make no use of label information. This type of hashing includes spectral hashing [47], principal component hashing [30], principal component analysis (PCA) hashing [11], iterative quantization (PCA-ITQ) [6], scalable graph hashing with feature transformation [14], and inductive manifold hashing (IMH) [38]. K-means hashing (KMH) [9] generates hash codes through simultaneous K-means clustering and quantization, and anchor graph hashing (AGH) [26] estimates data similarity through anchors. Recently, with the rapid advances in deep learning, CNNs have achieved significant breakthroughs in the visual domain. Deep Hashing (DH) [28] is an unsupervised deep framework that learns hierarchical non-linear transformations and minimizes the loss between a compact real-valued vector and the learned binary code. Similarity-Adaptive Discrete Hashing (SADH) [39] is an unsupervised architecture that alternates deep model training, similarity updating, and code optimization. However, because unsupervised hashing methods do not consider the label information of the input data, certain information critical to pattern classification may be lost. Therefore, various supervised hashing methods with enhanced discriminative ability have been proposed.

In contrast with unsupervised methods, supervised hashing methods fully utilize class labels [27]. For example, Wang et al. [44] proposed a semi-supervised hashing (SSH) framework that minimizes empirical errors over the labeled set. Liu et al. [25] proposed kernel-based supervised hashing (KSH), and Lin et al. [19] proposed fast supervised hashing using graph cuts and decision trees. In [6], ITQ was extended to CCA-ITQ, which first uses label information to perform CCA and maximize the correlation of samples from the same class, and then applies ITQ to minimize the quantization loss. LDA hashing [41] minimizes the variation of data within the same class while maximizing the variation across different classes. Supervised discrete hashing (SDH) [36] learns hash codes using a regularization algorithm and yields an analytic solution for the regularization sub-problem. Among deep hashing models, a supervised deep hashing model (DSH) [23] performs simultaneous feature representation learning and hash code learning using pairwise labels, while asymmetric deep supervised hashing (ADSH) [16] and deep asymmetric pairwise hashing (DAPH) [35] learn hash mappings in an asymmetric deep architecture during training. Certain ranking-based methods, such as those described in [46] and [45], are also considered supervised methods [7].

Hash codes are binary, and the discrete constraints lead to a mixed-integer optimization problem that is NP-hard [24]. To address this issue, many hashing methods first relax the problem by discarding the discrete constraints and then perform a quantization step that thresholds the real-valued solution into an approximate hash code; we call this process relaxation, and it simplifies the original discrete optimization considerably. However, this type of approximation is suboptimal and often produces less effective hash codes because of accumulated quantization errors. To overcome this issue, Lin et al. [20] tried to solve the discrete optimization problem directly; however, their method is typically time-consuming and unscalable. Shen et al. [37] proposed a discrete hashing method using cyclic coordinate descent. Kang et al. [17] proposed a column sampling-based discrete supervised hashing method (COSDISH) that first learns the binary codes and then trains binary classifiers, with each bit corresponding to one classifier. Generally, in these approaches, hash learning is formulated as a least squares classification that regresses each hash code to its corresponding label; this strategy does not fully leverage the mutual relations between labels. Moreover, even methods that do consider this relation, such as COSDISH, still divide hash learning into two steps. To address the issues mentioned above, we propose a supervised discrete hashing method with similarity learning (SDHSL). SDHSL not only fully leverages the mutual relations across semantic labels, but also directly learns the discrete hash codes and the projection for out-of-sample extension simultaneously. The main contributions of this study are summarized as follows:

  • Learning hash codes from the relative similarity between semantic labels. The primary purpose of hashing is the search for close neighbors, so the relative similarity between semantic labels is clearly crucial for learning. In contrast with existing supervised methods that typically treat hash codes as sample features and regress them to labels, the proposed method learns hash codes from label similarity. The objective function can then be solved by a 2-approximation algorithm, which existing works have shown to be more stable when category labels are provided [17, 29, 52].

  • Directly learning the discrete hash codes and the projection for out-of-sample extension simultaneously. In the proposed SDHSL, we jointly learn the discrete hash codes of the training samples and the projection that transforms an original sample into its hash code, which makes the proposed method more efficient.

  • Experimental results on large-scale datasets demonstrate that SDHSL outperforms existing methods on the image retrieval task.

The remainder of this paper is organized as follows. In Section 2, we describe the proposed SDHSL method in detail, and in Section 3, we evaluate it through experiments. Section 4 consists of the conclusion of our study.

2 Proposed Method

This section describes the proposed SDHSL method in detail. We first present the SDHSL formulation and then describe the process of optimization.

2.1 Problem Formulation

Assume that we have a training set X consisting of n instances (i.e., \({\textbf{X}} = \left({\textbf{x}}_{i} \right)_{i = 1}^{n}\)), where \({\textbf{x}}_{i} \in \mathbb{R}^{f}\) is the feature vector. The semantic label matrix \({\textbf{Y}} = \{ {\textbf{y}}_{i}\}_{i = 1}^{n}\) is also available, with \({\textbf{y}}_{i} = \{y_{ij}\} \in \mathbb{R}^{c}\) being the label vector of the ith instance and c the number of categories in the training set. If the ith instance belongs to the jth category, then yij = 1; otherwise, yij = 0. In addition, the hash code matrix is defined as \({\textbf{H}}=\{{\textbf{h}}_{i}\}_{i = 1}^{n}\). Some notations and definitions are shown in Table 1.
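For illustration, the label matrix Y described above can be assembled from integer class labels as follows (a minimal NumPy sketch; the helper name is ours):

```python
import numpy as np

def one_hot_labels(class_ids, c):
    """Build the label matrix Y with y_ij = 1 iff the i-th instance is in class j."""
    n = len(class_ids)
    Y = np.zeros((n, c))
    Y[np.arange(n), class_ids] = 1
    return Y

Y = one_hot_labels(np.array([0, 2, 2, 1]), c=3)  # 4 instances, 3 categories
```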

Table 1 Notations

The purpose of supervised hashing is to learn a binary code matrix \({\textbf{H}} \in \{-1,1\}^{L \times n}\) using semantic labels, with L being the length of the hash code. Most supervised hashing methods utilize the label information by minimizing the term \({\left \|{\textbf {Y}} - {{\textbf {W}}^{T}}{\textbf {H}}\right \|^{2}}\), as is done in [36, 42, 50]. However, this strategy incurs additional time for updating the auxiliary variable W during training, and the linear projection W is a weak metric for bridging the gap between hash codes and semantic labels. In addition, the linear regression used in \({\left \|{\textbf {Y}} - {{\textbf {W}}^{T}}{\textbf {H}}\right \|^{2}}\) is less stable for generating discrete hash codes [8, 10].

By contrast, semantic similarity based on labels is also used for learning hash codes in some supervised hashing methods. We assume that \({\textbf{S}} = \{S_{ij}\} \in \{-1,1\}^{n \times n}\) is a fully observed semantic similarity matrix with no missing entries, where Sij = 1 indicates that the ith and jth samples are semantically similar (i.e., they share the same label), and Sij = − 1 means that they are semantically dissimilar (i.e., they have different labels). The hash codes in the Hamming space should preserve the similarity between the samples in the original space. Therefore, we can formulate this requirement as the following minimization problem:

$$ \underset{{\textbf{H}}}{\min}\left \|{\textbf{H}}^{T}{\textbf{H}}-L\cdot {\textbf{S}}\right \|^{2}\quad\quad {\textup{s.t.}}\quad {\textbf{H}}\in \left \{ -1,1 \right \}^{L\times n}, $$
(1)

where L is the length of the hash codes and \(\left \| \cdot \right \|\) is the \(\ell_{2}\) norm.
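For illustration, the similarity matrix S used in (1) can be built directly from the label matrix Y (a minimal sketch under the labeling described above; the helper name is ours):

```python
import numpy as np

def semantic_similarity(Y):
    """S_ij = 1 if instances i and j share at least one label, -1 otherwise ({-1,1}^{n x n})."""
    same = (Y @ Y.T) > 0   # inner product of label vectors is positive iff labels overlap
    return np.where(same, 1, -1)

# toy usage: three instances, the last two sharing class 2 (one-hot label rows)
Y = np.array([[1, 0, 0], [0, 0, 1], [0, 0, 1]])
print(semantic_similarity(Y))
```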

In general, Sij is an element of the similarity matrix; in traditional methods it is usually 0 or 1, indicating that the labels of samples xi and xj are dissimilar or similar, respectively. In the proposed method, however, we adopt {− 1,1} rather than {0,1} for Sij during hash learning, for the following reason.

We are given two samples xi and xj, with hash codes hi and hj, respectively. The equation for learning hash codes can be formulated as follows:

$$ \underset{{\textbf{h}}_{i},{\textbf{h}}_{j}}{\min}\left \|{\textbf{h}}_{i}^{T}{\textbf{h}}_{j}-L\cdot S_{i,j}\right \|^{2} \quad\quad {\textup{s.t.}}\quad {\textbf{h}}_{i}, {\textbf{h}}_{j}\in \left \{ -1,1 \right \}^{L\times 1}. $$
(2)

If xi and xj are dissimilar and Si,j = 0, the optimal solution for hi and hj is that \(\left [\frac {L}{2} \right ]\) bits of the two hash codes are identical (i.e., \(L-\left [\frac {L}{2} \right ]\) bits are different). The Hamming distance between hi and hj is therefore \(L-\left [\frac {L}{2} \right ]\), which wastes part of the hash length. Even worse, when the optimal solution for hi and hj is difficult to reach, there are two equivalent suboptimal solutions: the value of \(\left \|{\textbf {h}}_{i}^{T}{\textbf {h}}_{j}-L\cdot S_{i,j}\right \|^{2}\) is the same whether hi agrees with hj on \(\left [\frac {L}{2} \right ]+k\) or \(\left [\frac {L}{2} \right ]-k\) bits (for any integer k in \((0,L-\left [\frac {L}{2} \right ])\)). These two equivalent suboptimal solutions destabilize the process of generating hash codes.

If xi and xj are dissimilar and Si,j = − 1, the optimal solution for hi and hj is that all L bits of the two hash codes are different. The Hamming distance between hi and hj is then L, which makes full use of the hash length.
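This contrast can be verified numerically by enumerating all code pairs for a small code length, e.g., L = 4 (an illustrative check only, not part of the training algorithm; names are ours):

```python
import numpy as np
from itertools import product

L = 4
codes = [np.array(c) for c in product([-1, 1], repeat=L)]

def optimal_agreements(target):
    """Numbers of identical bits among all (h_i, h_j) pairs minimizing (h_i^T h_j - target)^2."""
    best = min((hi @ hj - target) ** 2 for hi in codes for hj in codes)
    return sorted({int(np.sum(hi == hj)) for hi in codes for hj in codes
                   if (hi @ hj - target) ** 2 == best})

print(optimal_agreements(0))    # [2]: with S_ij = 0, dissimilar codes still agree on L/2 bits
print(optimal_agreements(-L))   # [0]: with S_ij = -1, all L bits differ, using the full length
```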

The problem defined in (1) means that the semantic similarity should be preserved by the binary codes in the Hamming space. This term has been used in certain supervised hashing methods, namely, [18, 20, 21, 49, 56], and [17]. However, most of these methods use a two-step learning strategy, which first learns the hash code and then trains a classifier for the out-of-sample generalization. In this study, we learn the hash codes and the projection between the samples and their hash codes together.

Learning-based hashing typically involves learning a projection from the original samples to generate hash codes. In this study, we adopt a linear projection P to bridge the gap between the Hamming space and the original feature space. To learn robust hash codes, the \(\ell_{2,p}\)-norm (0 < p ≤ 2) is adopted, as its ability to alleviate sample noise has been proven [51, 53, 54]. Given a matrix \({\textbf{X}} = \{{\textbf{x}}_{i}\} \in \mathbb{R}^{f \times n}\), the \(\ell_{2,p}\)-norm is defined as \(||{\textbf {X}}|{|_{2,p}} = \sum \limits _{i = 1}^{n} {||} {{\textbf {x}}_{i}}||_{2}^{p}\). Compared with the \(\ell_{2}\)-norm, the \(\ell_{2,p}\)-norm can suppress the influence of potential noise and has a wider applicable range. Furthermore, it can flexibly adapt to different levels of hash code noise.
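A direct implementation of this definition (a small sketch; the function name is ours):

```python
import numpy as np

def l2p_norm(X, p):
    """||X||_{2,p} = sum_i ||x_i||_2^p, where x_i are the columns of the f x n matrix X."""
    return float(np.sum(np.linalg.norm(X, axis=0) ** p))

X = np.random.default_rng(0).standard_normal((5, 3))
print(l2p_norm(X, p=1.2))
```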

To avoid a trivial solution for P, we impose an additional constraint that the diagonal entries of \({\textbf{P}}^{T}{\textbf{P}}\) equal 1. The projection learning term can be formulated as:

$$ \begin{array}{@{}rcl@{}} &&\min\limits_{{\textbf{H}},{\textbf{P}}} \left\|{\textbf{H}}-{\textbf{P}}^{T}{\textbf{X}}\right\|_{2,p}\\ &&{\textup{s.t.}}\quad {\textbf{H}}\in \left \{ -1,1 \right \}^{L\times n}, diag({\textbf{P}}^{T}{\textbf{P}})={\textbf{1}}, 0< p\leq 2, \end{array} $$
(3)

where 1 is a constant vector whose elements are 1.

In addition, a desirable property of hash codes is balance: each bit should have a 50% chance of being − 1 or 1 in the hash vector [15, 35]. In this study, the balance property can be formulated as:

$$ \min\limits_{{\textbf{H}}}\left\|{\textbf{H1}}\right\|^{2}\quad\quad {\textup{s.t.}}\quad {\textbf{H}}\in \left \{ -1,1 \right \}^{L\times n}, $$
(4)

Finally, the objective function for SDHSL can be defined as:

$$ \begin{array}{@{}rcl@{}} &&\min\limits_{{\textbf{H}},{\textbf{P}}}\left \|{\textbf{H}}^{T}{\textbf{H}}-L\cdot {\textbf{S}}\right \|^{2}+ \alpha \left\|{\textbf{H}}-{\textbf{P}}^{T}{\textbf{X}}\right\|_{2,p}+\upbeta \left\|{\textbf{H1}}\right\|^{2}\\ &&{\textup{s.t.}}\quad {\textbf{H}}\in \left \{ -1,1 \right \}^{L\times n}, diag({\textbf{P}}^{T}{\textbf{P}})={\textbf{1}}, 0< p\leq 2, \end{array} $$
(5)

where α and β are penalty parameters.
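For reference, the value of objective (5) for a candidate pair (H, P) can be computed as follows (an illustrative sketch only; it evaluates the objective and is not the optimizer described in Section 2.2):

```python
import numpy as np

def sdhsl_objective(H, P, X, S, L, alpha, beta, p):
    """Value of objective (5). Shapes: H is L x n in {-1,1}, P is f x L, X is f x n,
    S is n x n in {-1,1}."""
    sim_term = np.linalg.norm(H.T @ H - L * S) ** 2               # similarity-preserving term
    R = H - P.T @ X
    proj_term = np.sum(np.linalg.norm(R, axis=0) ** p)            # l_{2,p} projection loss
    balance_term = np.linalg.norm(H @ np.ones(H.shape[1])) ** 2   # bit-balance penalty ||H1||^2
    return sim_term + alpha * proj_term + beta * balance_term
```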

2.2 Optimization

Directly optimizing (5) is challenging because the problem is both non-convex and non-smooth. However, a solution can be obtained more easily by optimizing one variable at a time while keeping the other fixed.

First, we rewrite (5) as follows:

$$ \begin{array}{@{}rcl@{}} &&\min\limits_{{\textbf{H}},{\textbf{P}}}\left \|{\textbf{H}}^{T}{\textbf{H}}-L\cdot {\textbf{S}}\right \|^{2}+ \alpha \cdot Tr(({\textbf{H}}-{\textbf{P}}^{T}{\textbf{X}}){\textbf{D}}({\textbf{H}}-{\textbf{P}}^{T}{\textbf{X}})^{T})+\upbeta \left\|{\textbf{H1}}\right\|^{2}\\ &&{\textup{s.t.}}\quad {\textbf{H}}\in \left \{ -1,1 \right \}^{L\times n}, diag({\textbf{P}}^{T}{\textbf{P}})={\textbf{1}}, 0< p\leq 2, \end{array} $$
(6)

where Tr(⋅) denotes the matrix trace and D is a diagonal matrix whose ith diagonal element is defined as:

$$ {{D}}_{i,i}=\frac{1}{\frac{2}{p}\left \| {\textbf{r}}_{i} \right \|_{2}^{2-p}}, $$
(7)

where ri is the ith column of the matrix \({\textbf{H}}-{\textbf{P}}^{T}{\textbf{X}}\).
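Since D depends on the current residual, it is recomputed at every iteration; a minimal sketch of (7) follows (the small eps added for numerical safety is our own choice):

```python
import numpy as np

def reweight_D(H, P, X, p, eps=1e-8):
    """Diagonal matrix of (7): D_ii = 1 / ((2/p) * ||r_i||_2^(2-p)),
    with r_i the i-th column of H - P^T X (eps avoids division by zero)."""
    col_norms = np.linalg.norm(H - P.T @ X, axis=0) + eps
    return np.diag(1.0 / ((2.0 / p) * col_norms ** (2.0 - p)))
```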

In other words, (6) can be solved using an iterative framework comprising the following two steps until convergence. We can directly obtain a discrete solution for the hash codes without relaxation:

  1. Step 1:

    Learn H with P fixed. The problem in (6) is then reduced to:

    $$ \begin{array}{@{}rcl@{}} &&\min\limits_{{\textbf{H}}}\left \|{\textbf{H}}^{T}{\textbf{H}}-L\cdot {\textbf{S}}\right \|^{2}+ \alpha \cdot Tr(({\textbf{H}}-{\textbf{P}}^{T}{\textbf{X}}){\textbf{D}}({\textbf{H}}-{\textbf{P}}^{T}{\textbf{X}})^{T})+\upbeta \left\|{\textbf{H1}}\right\|^{2}\\ &&{\textup{s.t.}}\quad {\textbf{H}}\in \left \{ -1,1 \right \}^{L\times n}, 0< p\leq 2. \end{array} $$
    (8)

Inspired by recent advances in non-convex and non-smooth optimization [1, 2] and by discrete proximal linearized minimization (DPLM) [40], the discrete solution for H can be obtained through the following iterative process. The convergence analysis of DPLM can also be found in [40].

We first define the objective function in (8) as L(H). Thus, (8) can be transformed into an unconstrained problem as follows:

$$ \mathop {\min }\limits_{\textbf{H}} \delta ({\textbf{H}}) + L({\textbf{H}}) $$
(9)

where δ(H) is defined as:

$$ \delta \left (\textbf{H} \right )= \left\{ \begin{array}{ccr} 0, & {\textbf{H}}\in \left \{ -1,1 \right \}. & \\ \infty , & {\textbf{H}}\notin \left \{ -1,1 \right \}. & \end{array} \right. $$
(10)

Inspired by proximal forward-splitting algorithms [1], (9) can be formulated as:

$$ {\textbf{H}}^{j+1}=\arg \min\limits_{{\textbf{H}}} \delta \left (\textbf{H} \right )+\frac{\lambda}{2}\left \|{\textbf{H}}-{\textbf{H}}^{j}+\frac{1}{\lambda} \bigtriangledown L({\textbf{H}}^{j})\right \| $$
(11)

Equation (11) can then be transformed into the following constrained problem:

$$ \begin{array}{@{}rcl@{}} &&\min\limits_{{\textbf{H}}}\left \|{\textbf{H}}-{\textbf{H}}^{j}+\frac{1}{\lambda} \bigtriangledown L({\textbf{H}}^{j})\right \|\\ &&{\textup{s.t.}}\quad {\textbf{H}}\in \left \{ -1,1 \right \}^{L\times n}, 0< p\leq 2. \end{array} $$
(12)

It is then straightforward to see that H has an optimal discrete solution:

$$ {\textbf{H}}^{j+1}=sgn({\textbf{H}}^{j}-\frac{1}{\lambda}\bigtriangledown L({\textbf{H}}^{j})), $$
(13)

where Hj is the solution of H obtained after the jth iteration, λ is a parameter, and ▽ L(H) is defined as:

$$ \bigtriangledown L({\textbf{H}})=4{\textbf{H}}{\textbf{H}}^{T}{\textbf{H}}-4L{\textbf{HS}}+2\alpha({\textbf{H}}-{\textbf{P}}^{T}{\textbf{X}}){\textbf{D}}+2\upbeta {\textbf{H}}{\textbf{11}}^{T}. $$
(14)
  2. Step 2:

    Learn P with H fixed. The problem in (6) is reduced to:

    $$ \begin{array}{@{}rcl@{}} &&\min\limits_{{\textbf{P}}}Tr(({\textbf{H}}-{\textbf{P}}^{T}{\textbf{X}}){\textbf{D}}({\textbf{H}}-{\textbf{P}}^{T}{\textbf{X}})^{T})\\ &&{\textup{s.t.}} \quad diag({\textbf{P}}^{T}{\textbf{P}})={\textbf{1}}, 0< p\leq 2. \end{array} $$
    (15)

Setting the derivative of the objective function in (15) with respect to P to zero gives:

$$ {\textbf{X}}{\textbf{D}}{\textbf{X}}^{T}{\textbf{P}}-{\textbf{X}}{\textbf{D}}{\textbf{H}}^{T}=0, $$
(16)

the closed-form solution for P can be calculated as:

$$ {\textbf{P}}=({\textbf{X}}{\textbf{D}}{\textbf{X}}^{T})^{-1}{\textbf{X}}{\textbf{D}}{\textbf{H}}^{T}. $$
(17)

To satisfy the constraint diag(PTP) = 1, we normalize each column of P to unit norm after each update:

$$ {\textbf{P}}_{j}\leftarrow \frac{{\textbf{P}}_{j}}{\left \| {\textbf{P}}_{j} \right \|}, $$
(18)

where Pj is the jth column of P.

In summary, the proposed discrete problem is solved based on DPLM [40]: it is reformulated as minimizing the sum of a smooth loss term and a non-smooth indicator function, and it can then be solved efficiently by an iterative procedure in which each iteration admits an analytical discrete solution. The procedure converges quickly, typically within a few iterations.
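For concreteness, the overall alternating procedure (the reweighting in (7), the DPLM sign updates (13)-(14) for H, and the closed-form update (17) with the normalization (18) for P) can be sketched as follows. This is our own illustrative implementation of the steps described above, not the authors' released code; the initialization, iteration counts, and the small eps terms are assumptions.

```python
import numpy as np

def sdhsl_train(X, S, L, alpha=0.01, beta=10.0, p=1.2, lam=0.1,
                outer_iters=20, inner_iters=5, eps=1e-8, seed=0):
    """Alternately update H (discrete, via DPLM sign steps) and P (closed form).
    X: f x n features, S: n x n similarity in {-1, 1}, L: code length."""
    f, n = X.shape
    rng = np.random.default_rng(seed)
    H = np.sign(rng.standard_normal((L, n)))           # random {-1, 1} initialization
    P = rng.standard_normal((f, L))
    ones = np.ones((n, 1))

    for _ in range(outer_iters):
        # reweighting matrix D from the current residual (eq. (7))
        R = H - P.T @ X
        d = 1.0 / ((2.0 / p) * (np.linalg.norm(R, axis=0) + eps) ** (2.0 - p))
        D = np.diag(d)

        # Step 1: DPLM sign updates for H (eqs. (13)-(14))
        for _ in range(inner_iters):
            grad = (4 * H @ H.T @ H - 4 * L * H @ S
                    + 2 * alpha * (H - P.T @ X) @ D
                    + 2 * beta * H @ ones @ ones.T)
            H = np.sign(H - grad / lam)
            H[H == 0] = 1                               # break ties away from zero

        # Step 2: closed-form update for P, then column normalization (eq. (18))
        P = np.linalg.solve(X @ D @ X.T + eps * np.eye(f), X @ D @ H.T)
        P = P / (np.linalg.norm(P, axis=0, keepdims=True) + eps)

    return H, P

def encode(P, X_query):
    """Out-of-sample extension: h = sgn(P^T x)."""
    codes = np.sign(P.T @ X_query)
    codes[codes == 0] = 1
    return codes
```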

2.3 Time complexity

Let the symbols n, f and L represent the total number of training samples, the dimension of the sample feature and the length of the hash code, respectively. The time complexity for learning the hash code H is \(O(nL^{2} + n^{2}L + nfL)\), and the time complexity for learning the projection P is \(O(nf^{2} + nfL + nL)\). As L is much smaller than n and f, the total training time complexity of the proposed method reduces to \(O(n^{2}L + nf^{2} + nfL)\).

3 Experiments

In this section, we describe the datasets and evaluation metrics used in our experiments and provide additional implementation details. Two image datasets, CIFAR-10 and NUS-WIDE, were used to evaluate the performance of the proposed SDHSL, and comparisons with several state-of-the-art methods are also presented. Our experiments were conducted on an Intel(R) Core(TM) i7-4790 CPU with 16 GB of RAM. The hyperparameter settings are listed in the implementation details.

3.1 Experimental Settings

3.1.1 Datasets

We used two datasets, CIFAR-10 and NUS-WIDE, which are widely used for image retrieval.

CIFAR-10:

This dataset contains 60,000 images from ten classes, with 6,000 images per class; 50,000 images are used for training and 10,000 for testing.

NUS-WIDE:

This dataset contains 269,648 images from 81 classes. We chose the 10 classes containing the most images, giving a total of 67,994 images for the experiments. A 500-dimensional bag-of-visual-words SIFT feature is used as the input feature vector for each image. In addition, 20,000 images were chosen as the training set and 500 images as the test set.

3.1.2 Evaluation Metric

To evaluate the proposed hashing method, we used an evaluation metric commonly used in image retrieval, namely the mean average precision (MAP). MAP is the mean of the average precision (AP) values over all query images:

$$ \textup{MAP}=\frac{1}{Q}\sum\limits_{i=1}^{Q}\textup{AP}(i), $$
(19)

where Q is the number of query images, and AP(i) is the AP of the ith query. AP is defined as:

$$ \textup{AP}=\frac{1}{R}\sum\limits_{r=1}^{G}precision(r) \sigma (r), $$
(20)

where R is the number of relevant samples among the G retrieved samples and precision(r) is the precision over the top r retrieved samples. Here, σ(r) = 1 if the rth retrieved sample is relevant to the query; otherwise, σ(r) = 0.
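These two metrics can be computed directly from ranked relevance indicators; a small sketch (the helper names are ours):

```python
import numpy as np

def average_precision(relevant, G=None):
    """AP over the top-G retrieved items; `relevant` is a 0/1 sequence in ranked order."""
    rel = np.asarray(relevant[:G], dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_r = np.cumsum(rel) / (np.arange(len(rel)) + 1)  # precision of top-r lists
    return float(np.sum(precision_at_r * rel) / rel.sum())

def mean_average_precision(relevance_lists, G=None):
    """MAP: mean of the per-query AP values."""
    return float(np.mean([average_precision(r, G) for r in relevance_lists]))

# toy usage: two queries with ranked relevance indicators
print(mean_average_precision([[1, 0, 1, 1], [0, 1, 0, 0]]))
```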

3.1.3 Implementation Details.

We compared our method with the following: SPLH [43], KSH [25], LFH [57], FastH [20], ITQ [5], SDH [36], and COSDISH [17]. The hyperparameters of the compared methods were initialized following the authors' suggestions. For each method, we conducted five runs and report the average result. The proposed SDHSL method has four hyperparameters: α, β, p, and λ. We empirically set α to 0.01, β to 10, p to 1.2, and λ to 0.1.

3.2 Experimental Results and Analysis

3.2.1 Performance Evaluation

Table 2 shows the MAP of the proposed SDHSL method and the baselines. Our method outperforms the baselines on both the CIFAR-10 and NUS-WIDE datasets in most cases.

Table 2 Performance in terms of MAP

From Table 2, SDHSL achieves the best results in most cases. In particular, the proposed SDHSL method improves over the other methods by more than 20% in some cases. As the length of the hash code increases, our method performs even better. On the CIFAR-10 dataset, the MAP performance is better than that of most baseline methods, although it trails COSDISH when the hash code length is 8 or 64. COSDISH also considers label similarity during hash learning; however, it divides the generation of the hash codes of the training samples and the learning of the hash function into two steps, whereas these are learned simultaneously in the proposed method, making the strategy used in SDHSL more efficient.

3.2.2 Parameter analysis

In this study, we simultaneously learn the hash codes of the training samples and the projection between the original features and their corresponding hash codes for out-of-sample extension. Therefore, the parameter α, which balances these two terms in the objective function, is important. In addition, the \(\ell_{2,p}\)-norm adopted in the proposed method potentially leads to more stable hash codes. Hence, α and p are the two most important parameters in the proposed method. To gauge their sensitivity, we conducted experiments evaluating the influence of α and p on performance. We plot the MAP curves for different values of α and p in Figs. 1 and 2, respectively, with α ranging from \(10^{-4}\) to \(10^{4}\) and p ranging from 0.2 to 2.0. During these experiments, we set the code length to 64. We repeat each experiment five times for every value of α and p and report the average MAP performance.

Fig. 1: Performance influence of parameter α

Fig. 2: Performance influence of parameter p

Figure 1 shows that the MAP performance has low sensitivity to the parameter α. In addition, the performance is lowest when α is 0, which indicates that simultaneously learning the hash codes of the training samples and the projection for out-of-sample extension benefits the overall MAP performance.

Figure 2 shows the influence of the parameter p. The MAP performance is satisfactory when p is lower than 2. In particular, when p equals 2 (i.e., the \(\ell_{2,p}\)-norm reduces to the \(\ell_{2}\)-norm), the MAP performance deteriorates. This observation confirms that the \(\ell_{2,p}\)-norm (0 < p ≤ 2) adopted in the proposed method leads to more stable hash codes.

4 Conclusion

In this study, a novel discrete hashing method called SDHSL is proposed, which learns hash codes based on semantic label similarity. The proposed SDHSL jointly learns the hash codes of the training samples and the hash functions to obtain hash codes for samples outside the training set. The experiments performed on two benchmark datasets confirmed the superiority of our method under various retrieval scenarios.