1 Introduction

In the era of big data, content-based multimedia retrieval has become increasingly difficult, and efficient indexing and search algorithms are receiving increasing attention [6, 8, 29, 37, 55, 58]. However, when the amount of data is very large, the speed of exact nearest neighbor search drops dramatically. To balance retrieval performance and computational cost, the approximate nearest neighbor (ANN) [1] approach has gained considerable attention. Representative ANN solutions fall into two categories: tree-based methods [18] and hashing-based methods [16, 34, 55]. From an application perspective, images and texts are often represented by high-dimensional features, such as SIFT-based bag-of-words features [35] and deep features. It is well known that traditional tree-based methods degrade in high-dimensional feature spaces; in many cases, their performance has been theoretically shown to be comparable to that of a linear scan [45]. Therefore, hashing-based large-scale ANN search methods have become a focus of research.

Hashing, which transforms high-dimensional feature vectors into compact and informative binary codes through mapping functions, has achieved remarkable success in fast data retrieval [15, 19]. Recently, the rapid development of convolutional neural networks (CNNs) has contributed to significant progress in ANN research [30,31,32, 50]. In particular, unsupervised hashing methods have received increasing attention because, unlike supervised hashing methods, they do not require labeled training data, which broadens their application prospects. Stacked restricted Boltzmann machines (RBMs) were first used to encode hash codes for unsupervised hashing [31, 42]. However, their high complexity and need for pretraining make RBMs difficult to apply in practice. More recently, many studies have achieved remarkable results in hash learning with deep learning frameworks, especially generative adversarial networks [2, 36, 44, 55]. However, most of these methods perform hash learning only for the retrieval task, which is relatively simple. Grouping similar data can greatly improve the efficiency of data retrieval, especially in big data systems. Hence, our goal is to unify unsupervised clustering and retrieval in a single learning model without degrading performance.

In this paper, we propose a novel unsupervised deep hashing method, named Autoencoder-based Unsupervised Clustering and Hashing (AUCH). Specifically, we construct the hashing function as a latent layer with K units between the encoder and decoder of an autoencoder. We first pretrain a stacked denoising autoencoder and then perform feature representation extraction, hash learning and cluster assignment simultaneously by optimizing an overall loss function. Finally, we remove the decoder and keep the encoder. By feeding visual data into the network, we obtain compact hash codes that preserve the structural similarity between the input data. The main contributions are outlined as follows:

  • AUCH is an unsupervised hashing approach that makes full use of the characteristics of autoencoders, unifies clustering and retrieval tasks in a single learning model, and jointly learns feature representations, hashing functions and clustering assignments from input data.

  • For the retrieval and clustering tasks, we design an overall loss function. By optimizing this loss function, we can obtain efficient feature representations, compact hash codes and outstanding clustering performance.

  • Extensive experiments on several widely used image datasets verify that AUCH achieves better performance than other unsupervised hashing and clustering methods.

2 Related work

2.1 Hash learning

Depending on whether hash learning exploits the data themselves, hashing methods can be divided into data-independent and data-dependent methods. Likewise, depending on whether deep learning is employed, hashing methods can be divided into traditional unsupervised hashing methods and deep unsupervised hashing methods.

2.1.1 Data-Independent hashing

The seminal LSH method presented in [1] is groundbreaking work on locality-sensitive hashing; it maps similar data to similar hash codes in Hamming space. Because it is data-independent, LSH usually requires longer hash codes to achieve good retrieval performance. By contrast, data-dependent hashing methods [6, 15] learn the distribution of the data themselves, so shorter hash codes usually suffice for better performance. In practice, data-dependent hashing is therefore more valuable.

2.1.2 Data-Dependent hashing

For data-dependent hashing, several studies [16, 54] have adopted a supervised approach that relies on label information. Hash learning is then guided by the semantic information extracted from the labels, which usually leads to good retrieval performance. However, manually annotating data is time-consuming and labor-intensive, which prevents these methods from being widely used. To better exploit unlabeled data, the study of unsupervised hashing methods is therefore important.

2.1.3 Traditional unsupervised hashing methods

Most traditional unsupervised hashing methods consist of two independent processes: feature learning and hash learning. Representative traditional unsupervised hashing methods include spectral hashing (SH) [49], anchor graph hashing (AGH) [33], iterative quantization (ITQ) [15], and stochastic generative hashing (SGH) [6]. In SH, spectral analysis is first performed on the original high-dimensional dataset, and the constraints are then relaxed to convert the problem into a dimension reduction problem for the Laplacian feature graph. SH introduces eigenfunctions to handle data outside the training set. Inspired by SH, AGH addresses the same optimization problem but offers a novel way of solving the objective function, dispensing with the prior assumption that the data are sampled from a multidimensional uniform distribution. Using an approximate adjacency matrix instead of the exact adjacency matrix reduces the time complexity and gives the method broader applicability. In ITQ, PCA-based dimension reduction is first performed on the dataset, and the rotation matrix with the smallest quantization error is then found to obtain the binary codes of the feature vectors under the optimal rotation. SGH is essentially a generative method. It uses the minimum description length principle to map data to compact hash codes while retaining the structural similarity of the original data; the learned hash codes can, in turn, be used to regenerate the input data. However, these methods have an obvious weakness: they cannot learn hash codes and features at the same time, which prevents their further development in practice.

2.1.4 Deep unsupervised hashing methods

Recently, the rapid development of deep neural networks has driven significant progress in various areas, such as computer vision and pattern recognition [59]. Deep unsupervised hashing methods have also been proposed; they adopt deep architectures to extract features and perform hash mapping. Representative deep unsupervised hashing methods include deep binary descriptors (DeepBit) [32], semantic structure-based unsupervised deep hashing (SSDH) [55], binary generative adversarial networks (BGANs) [44], deep hashing (DH) [36] and DeepQuan [2]. DeepBit jointly performs feature learning and hash learning, and the learned hash codes preserve the similarity between the visual data. DH utilizes a deep neural network to seek multiple hierarchical nonlinear transformations for learning binary codes so that the nonlinear relationships among the samples can be well exploited. SSDH is based on an empirical study of deep feature statistics and the construction of a semantic structure to guide hash learning. In a BGAN, the input noise variable of a generative adversarial network (GAN) is restricted to be binary and conditioned on the features of each image; a BGAN thus learns a hash code for each image and can generate an image similar to the original. Finally, the DeepQuan model is essentially a deep autoencoder network, with an encoder that generates efficient hash codes and a decoder that reconstructs the visual data.

2.2 Clustering algorithms

In recent years, the advent of the era of big data has spurred the rapid development of machine learning technology. Cluster analysis, a commonly used method in traditional machine learning, is widely favored for its practicality, simplicity and efficiency. It has been successfully applied in many fields [3, 13, 14, 24, 39, 43, 46, 52, 56, 57, 60], such as document clustering, market segmentation, image segmentation, and feature learning. Clustering is also an important concept in data mining, where its core purpose is to find valuable information hidden in data objects. Recently, multiview clustering and subspace clustering methods have also undergone rapid development. Representative methods include GBS-KO [47], L3E-M2VC [61], and LRLER [9]. GBS-KO extends the graph-based approach to the clustering of multiview data. L3E-M2VC is a multitask multiview clustering algorithm proposed to overcome certain limitations in real-world applications. LRLER is a locally linear embedding low-rank representation model for subspace clustering that achieves superior performance.

3 Proposed AUCH method

Specifically, our goal is to design an unsupervised hashing algorithm that exploits an autoencoder, which can efficiently capture the features of high-dimensional data, to obtain compact hash codes. First, let \(\boldsymbol {X}=\left \{ {\boldsymbol {{x}_{i}}} \right \}_{i=1}^{N}\) denote N items of visual data. For the hash codes to preserve the structural similarity of the input data, we need to learn a mapping function \(F:\boldsymbol {X}\to {{\left \{ -1,+1 \right \}}^{K\times N}}\). This function maps visual data to their K-bit hash codes \(\boldsymbol {B}=\{{\boldsymbol {{b}_{i}}}\}\in {{\left \{ -1,+1 \right \}}^{K\times N}}\), which reflects the structural similarity between the input data. Similar data are mapped to the same or similar hash codes.

3.1 Autoencoder with a hashing layer

The architecture of AUCH is illustrated in Fig. 1. The fundamental structure of AUCH is essentially a stacked denoising autoencoder.

Fig. 1

An overview of our proposed Autoencoder-Based Unsupervised Clustering and Hashing (AUCH) method. AUCH performs hash learning by means of an additional hashing layer with K units, which is equivalent to hash mapping, between the encoder and decoder of an autoencoder. By optimizing an objective function defined over the clustering loss, reconstruction loss and quantization loss for the hash codes, AUCH jointly learns feature representations, hashing functions and clustering assignments from input data. The learned hash codes can preserve the structural relationship of visual data and enable efficient image retrieval. The discriminative features extracted can also enable superior clustering performance

The denoising autoencoder is based on an autoencoder. To prevent overfitting problems, noise is added to the input data (in the input layer of the network), which makes the learned autoencoder more robust. As a result, the generalization ability of the model is enhanced. The network consists of two parts: an encoder function \(\boldsymbol {h_{i}^{(H)}}=E(\boldsymbol {{{x}_{i}}})\), and a decoder function \(\boldsymbol {{{x}_{i}}^{\prime }}=D(\boldsymbol {h_{i}^{(H)}})\) that produces a reconstruction. The denoising autoencoder minimizes the following objective function:

$$ L=\sum\limits_{i=1}^{N}{\left\| {\boldsymbol{{x}_{i}}}-D(E(\boldsymbol{{{{\tilde{x}}}_{i}}})) \right\|_{2}^{2}} $$
(1)

where \({\boldsymbol {{\tilde {x}}_{i}}}\) represents the data xi after noise has been added in a specific way.

The original autoencoder has eight fully connected layers. The first four layers (F1-F4) form the encoder, and the last four layers (F5-F8) form the decoder. The ReLU function is used as the activation function for the network because it largely alleviates the vanishing gradient problem encountered when deep neural networks are optimized with backpropagation. To incorporate hash learning into the autoencoder, we add a latent layer H (i.e., the hashing layer) with K units after layer F4 (i.e., the last layer of the encoder), as illustrated in Fig. 1. This latent layer is fully connected to F4 and uses tanh units, so its activation values lie between -1 and 1.
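
To make the architecture concrete, the following Keras sketch builds the encoder with the hashing layer H attached after F4, plus a mirrored decoder; the layer widths follow Section 6.2 (d-500-500-2000-10-h), while the function name and the exact mirroring of the decoder are illustrative assumptions rather than the authors' exact implementation.

```python
# Minimal sketch of the AUCH network: four encoder layers (F1-F4), a tanh
# hashing layer H with K units, and an assumed mirrored decoder (F5-F8).
from tensorflow import keras
from tensorflow.keras import layers

def build_auch(input_dim, n_bits):
    x_in = keras.Input(shape=(input_dim,))
    h = layers.Dense(500, activation="relu", name="F1")(x_in)
    h = layers.Dense(500, activation="relu", name="F2")(h)
    h = layers.Dense(2000, activation="relu", name="F3")(h)
    h = layers.Dense(10, activation="relu", name="F4")(h)
    # Hashing layer H: K tanh units, so activations lie in (-1, +1)
    code = layers.Dense(n_bits, activation="tanh", name="H")(h)
    # Decoder: assumed mirror of the encoder with a linear output layer
    d = layers.Dense(10, activation="relu", name="F5")(code)
    d = layers.Dense(2000, activation="relu", name="F6")(d)
    d = layers.Dense(500, activation="relu", name="F7")(d)
    x_out = layers.Dense(input_dim, name="F8")(d)
    encoder = keras.Model(x_in, code, name="encoder")
    autoencoder = keras.Model(x_in, x_out, name="autoencoder")
    return encoder, autoencoder
```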

3.2 Pretraining a stacked denoising autoencoder

Given a sample xi, by feeding it into the network, we can obtain the output of F4, denoted by \(\boldsymbol {a_{i}^{(4)}}\in {{R}^{d}}\), where d is the number of neurons in F4. Let \(\boldsymbol {{W}^{(H)}}\in {{R}^{d\times K}}\) denote the weights between F4 and the hashing layer, and let \(\boldsymbol {{b}^{(H)}}\) denote the bias of the hashing layer. Then, the activation values of the units in H can be computed as \(\boldsymbol {h_{i}^{(H)}}=\tanh (\boldsymbol {a_{i}^{(4)}}\boldsymbol {{{W}^{(H)}}}+\boldsymbol {{{b}^{(H)}}})\), where \(\boldsymbol {h_{i}^{(H)}}\) is a K-dimensional vector and \(\tanh (\cdot )\) is the hyperbolic tangent function, defined as \(\tanh (z)=({{e}^{z}}-{{e}^{-z}})/({{e}^{z}}+{{e}^{-z}})\) for a real value z. The hash code generation function is given by

$$ \boldsymbol{{b}_{i}}=\operatorname{sgn}(\tanh (\boldsymbol{a_{i}^{(4)}}\boldsymbol{{W}^{(H)}}+\boldsymbol{{b}^{(H)}})) $$
(2)

where sgn(v) = − 1 if v < 0 and + 1 otherwise; for a matrix or vector, sgn(.) is applied in an elementwise manner.
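
As a minimal illustration of Eq. (2), the NumPy snippet below quantizes the tanh activations of the hashing layer into binary codes; the function name is illustrative.

```python
import numpy as np

def quantize(h):
    """Quantize tanh activations h (values in (-1, 1)) to {-1, +1} hash codes
    as in Eq. (2); sgn(0) maps to +1, matching the convention above."""
    return np.where(h >= 0, 1, -1).astype(np.int8)
```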

Before clustering, we first pretrain a stacked denoising autoencoder for efficient feature extraction. After pretraining, all input data are embedded into the hashing space by the encoder mapping \(\boldsymbol {h_{i}^{(H)}}=E(\boldsymbol {{{x}_{i}}})\). We then apply the k-means algorithm to \(\{\boldsymbol {h_{i}^{(H)}}\}_{i=1}^{N}\) to obtain C initial cluster centers \(\{{\boldsymbol {{\mu }_{j}}}\}_{j=1}^{C}\).
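
A possible sketch of this initialization step with scikit-learn's k-means is shown below; `encoder`, `X` and `C` are assumed to come from the pretraining setup, and the number of k-means restarts is an arbitrary choice.

```python
from sklearn.cluster import KMeans

# Embed all data with the pretrained encoder, then initialize the C cluster
# centers mu_j with k-means on the hashing-space embeddings.
H = encoder.predict(X)                          # N x K embeddings
kmeans = KMeans(n_clusters=C, n_init=20).fit(H)
mu = kmeans.cluster_centers_                    # C x K initial centers
```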

4 Overall loss

The dataset X consists of N samples, and each sample \(\boldsymbol {{{x}_{i}}}\in {{R}^{e}}\) can be represented by an e-dimensional vector. There are a total of C cluster centers, where the value of C is known a priori. After the features of each xi are clustered, the assigned cluster indices are represented by si ∈ {1, 2, ... , C}. Let us define the encoder mapping \(E:\boldsymbol {{x}_{i}}\to \boldsymbol {h_{i}^{(H)}}\) and the decoder mapping \(D:\boldsymbol {h_{i}^{(H)}}\to \boldsymbol {{{x}_{i}}^{\prime }}\), where \(\boldsymbol {{{x}_{i}}^{\prime }}\) represents the reconstruction of xi.

For the clustering and hashing tasks, we need to find a good encoder mapping E that maps the data \(\boldsymbol {X}=\left \{ \boldsymbol {{x}_{i}} \right \}_{i=1}^{N}\) to the hashing space \(\{\boldsymbol {h_{i}^{(H)}}\}_{i=1}^{N}\). To this end, the overall loss function consists of three parts: a clustering loss, a reconstruction loss and a quantization loss. The reconstruction loss guarantees that the autoencoder can efficiently learn representations of the data in an unsupervised manner. The clustering loss operates mainly on the hashing space to disperse the embedded points. The quantization loss causes the outputs of the hashing layer neurons to be close to either +1 or -1, so that no unnecessary error is introduced when these outputs are quantized into binary codes. The loss function is defined as

$$ L=\gamma {{L}_{c}}+\alpha {{L}_{r}}+\beta {{L}_{h}} $$
(3)

where Lc, Lr and Lh are the clustering loss, reconstruction loss and hashing quantization loss, respectively, and γ > 0, α > 0 and β > 0 determine the relative importance of the clustering loss, reconstruction loss and quantization loss.

4.1 Clustering loss

The clustering loss is defined as

$$ {{L}_{c}}=KL(P\|Q)=\sum\limits_{i}{\sum\limits_{j}{{{p}_{ij}}\log \frac{{{p}_{ij}}}{{{q}_{ij}}}}} $$
(4)

where KL denotes the Kullback-Leibler divergence, which measures the degree of matching between the two probability distributions P and Q: the greater the difference between the two distributions, the larger the KL divergence. Following [38], we use Student's t-distribution to define the soft assignment distribution Q, which measures the similarity between an embedded hashing point \(\boldsymbol {h_{i}^{(H)}}\) and a cluster center μj. This similarity is expressed as

$$ {{q}_{ij}}=\frac{{{\left( 1+{{\left\| \boldsymbol{h_{i}^{(H)}}-\boldsymbol{{\mu }_{j}} \right\|}^{2}}\right)}^{-1}}}{{\sum}_{j^{\prime}}{{{\left( 1+{{\left\| \boldsymbol{h_{i}^{(H)}}-\boldsymbol{{\mu }_{j^{\prime}}} \right\|}^{2}}\right)}^{-1}}}} $$
(5)

P is the target distribution generated by Q. pij in (4) is defined as

$$ {{p}_{ij}}=\frac{q_{ij}^{2}/{\sum}_{i}{{{q}_{ij}}}}{{\sum}_{j^{\prime}}{(q_{ij^{\prime}}^{2}/{\sum}_{i}{{{q}_{ij^{\prime}}}})}} $$
(6)

Because the target distribution P is derived from Q itself, the network is trained in a self-training manner [40]. This clustering loss was originally proposed in [52].
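
For concreteness, the soft assignments of Eq. (5), the target distribution of Eq. (6) and the KL clustering loss of Eq. (4) can be computed as in the following NumPy sketch (variable names are illustrative).

```python
import numpy as np

def soft_assignments(H, mu):
    """Student's t similarity q_ij (Eq. (5)) between embeddings H (N x K)
    and cluster centers mu (C x K)."""
    d2 = ((H[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # N x C distances
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Target distribution p_ij derived from q, as in Eq. (6)."""
    weight = q ** 2 / q.sum(axis=0)              # q_ij^2 / sum_i q_ij
    return weight / weight.sum(axis=1, keepdims=True)

def clustering_loss(p, q):
    """KL(P || Q), the clustering loss of Eq. (4)."""
    return float(np.sum(p * np.log(p / q)))
```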

4.2 Reconstruction loss

Following [17], after pretraining we no longer corrupt the inputs and use an undercomplete autoencoder, because clustering must be performed on the hashing-space features of clean data; adding noise to the input is therefore unnecessary. At the same time, we keep the decoder intact and attach the clustering loss to the hashing space. In this way, the autoencoder can learn the most significant features of the data. The reconstruction loss is the squared error between each input and its reconstruction:

$$ {{L}_{r}}=\sum\limits_{i=1}^{N}{\left\| \boldsymbol{{x}_{i}}-D(\boldsymbol{h_{i}^{(H)}}) \right\|_{2}^{2}} $$
(7)

with \(\boldsymbol {h_{i}^{(H)}}=E(\boldsymbol {{x}_{i}})\), where E and D are the encoder and decoder mappings, respectively.

4.3 Quantization loss

The hash codes need to preserve the similarity of the original data. In addition, it is necessary to ensure that the outputs of the hashing layer are close to +1 or -1 to reduce the loss incurred when the hashing features are quantized into hash codes. Let \(h_{i,k}^{\boldsymbol {(H)}}\;(k=1, ... , K)\) be the k-th element of the output vector \(\boldsymbol {h_{i}^{(H)}}\) of the hashing layer. Because \(h_{i,k}^{\boldsymbol {(H)}}\) has already been activated by a tanh function, it is a real value in the range [-1, +1]. To push the output values as close as possible to the quantized values +1 or -1 and thereby reduce the error caused by quantization, the quantization loss function is expressed as follows:

$$ {{L}_{h}}=-\frac{1}{K}\sum\limits_{i=1}^{N}{\left\| \boldsymbol{h_{i}^{(H)}} \right\|_{2}^{2}} $$
(8)

with \(\boldsymbol {h_{i}^{(H)}}=E(\boldsymbol {{x}_{i}})\), where E is the encoder mapping.
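
The three partial losses and their combination in Eq. (3) can be summarized in a short NumPy sketch; this illustrates the formulas rather than reproducing the training code.

```python
import numpy as np

def reconstruction_loss(X, X_rec):
    """Eq. (7): squared error between inputs and their reconstructions."""
    return float(np.sum((X - X_rec) ** 2))

def quantization_loss(H):
    """Eq. (8): rewards large activation magnitudes, pushing the tanh
    outputs in H (N x K) toward -1 or +1."""
    return float(-np.sum(H ** 2) / H.shape[1])

def total_loss(L_c, L_r, L_h, gamma, alpha, beta):
    """Eq. (3): weighted combination of the three partial losses."""
    return gamma * L_c + alpha * L_r + beta * L_h
```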

5 Optimization

During the training process, three kinds of parameters need to be updated: the autoencoder's weights, the cluster centers and the target distribution. Minibatch stochastic gradient descent (SGD) and backpropagation are used to optimize (3).

5.1 Updating the autoencoder’s weights and cluster centers

First, the target distribution P is kept fixed, and the learning rate λ and a minibatch of m samples are given. The cluster center μj is updated as follows:

$$ \boldsymbol{{\mu }_{j}}=\boldsymbol{{\mu }_{j}}-\frac{\lambda }{m}\frac{\partial {{L}_{c}}}{\partial \boldsymbol{{\mu }_{j}}} $$
(9)

where \({\frac {\partial {{L}_{c}}}{\partial \boldsymbol {{\mu }_{j}}}}\) is the gradient of Lc with respect to the cluster center μj, which can be computed as

$$ {\frac{\partial {{L}_{c}}}{\partial \boldsymbol{{\mu }_{j}}}}=2\sum\limits_{i=1}^{m}{{{\left( 1+{{\left\| \boldsymbol{h_{i}^{(H)}}-\boldsymbol{{\mu }_{j}} \right\|}^{2}}\right)}^{-1}}({{q}_{ij}}-{{p}_{ij}})(\boldsymbol{h_{i}^{(H)}}-\boldsymbol{{\mu }_{j}})} $$
(10)

The gradient of Lc with respect to the embedded hashing point \(\boldsymbol {h_{i}^{(H)}}\), denoted by \({\frac {\partial {{L}_{c}}}{\partial \boldsymbol {h_{i}^{(H)}}}}\), is computed as follows:

$$ {\frac{\partial {{L}_{c}}}{\partial \boldsymbol{h_{i}^{(H)}}}}=2\sum\limits_{j=1}^{C}{{{\left( 1+{{\left\| \boldsymbol{h_{i}^{(H)}}-{\boldsymbol{{\mu }_{j}}} \right\|}^{2}}\right)}^{-1}}({{p}_{ij}}-{{q}_{ij}})(\boldsymbol{h_{i}^{(H)}}-{\boldsymbol{{\mu }_{j}}})} $$
(11)

The decoder’s weights are updated as follows:

$$ \boldsymbol{W^{\prime}}=\boldsymbol{W^{\prime}}-\frac{\lambda }{m}\sum\limits_{i=1}^{m}{{\frac{\partial {{L}_{r}}}{\partial \boldsymbol {W^{\prime}}}}} $$
(12)

The encoder’s weights are updated as follows:

$$ \boldsymbol{W}=\boldsymbol{W}-\frac{\lambda }{m}\sum\limits_{i=1}^{m}{\left( \gamma {\frac{\partial {{L}_{c}}}{\partial \boldsymbol{W}}}+\alpha {\frac{\partial {{L}_{r}}}{\partial \boldsymbol{W}}}+\beta {\frac{\partial {{L}_{h}}}{\partial \boldsymbol{W}}}\right)} $$
(13)
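
The gradients in Eqs. (10) and (11) can be written compactly in NumPy, as in the hedged sketch below; in practice, the weight updates of Eqs. (12) and (13) are handled by the framework's automatic differentiation, so only the cluster-center update of Eq. (9) is shown explicitly.

```python
import numpy as np

def grad_centers(H, mu, p, q):
    """Gradient of L_c w.r.t. the cluster centers, Eq. (10).
    H: m x K minibatch embeddings; mu: C x K centers; p, q: m x C."""
    diff = H[:, None, :] - mu[None, :, :]                         # m x C x K
    w = 1.0 / (1.0 + (diff ** 2).sum(-1))                         # m x C
    return 2.0 * ((w * (q - p))[:, :, None] * diff).sum(axis=0)   # C x K

def grad_embeddings(H, mu, p, q):
    """Gradient of L_c w.r.t. the embedded points, Eq. (11)."""
    diff = H[:, None, :] - mu[None, :, :]
    w = 1.0 / (1.0 + (diff ** 2).sum(-1))
    return 2.0 * ((w * (p - q))[:, :, None] * diff).sum(axis=1)   # m x K

# Cluster-center update of Eq. (9) for a minibatch of size m:
# mu = mu - (lr / m) * grad_centers(H, mu, p, q)
```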

5.2 Updating the target distribution

If we update the target distribution P in every iteration, this may lead to instability during the training process. In practice, we use all embedded hashing points to update P every T iterations. See (5) and (6) for the update rules. Once the target distribution P has been updated, the labels assigned to xi will be updated accordingly as follows:

$$ {s}_{i}=\arg \mathop{\max}\limits_{j} {q}_{ij} $$
(14)

where qij is computed using (5). After two consecutive updates of the target distribution, if the percentage of samples whose label assignments change is less than a threshold ϕ, we terminate training.
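
A sketch of this periodic update and the stopping test, reusing the `soft_assignments` and `target_distribution` helpers sketched in Section 4.1, is given below; `encoder`, `X`, `mu`, `previous_labels` and the threshold `phi` are assumed to be defined by the surrounding training loop.

```python
# Every T iterations: recompute Q and P on all data and check convergence.
q = soft_assignments(encoder.predict(X), mu)    # Eq. (5) on all data
p = target_distribution(q)                      # Eq. (6)
labels = q.argmax(axis=1)                       # cluster assignments, Eq. (14)
delta = (labels != previous_labels).mean()      # fraction of changed labels
if delta < phi:
    stop_training = True                        # terminate training
previous_labels = labels
```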

5.3 Generating binary codes

Figure 2 shows how the hash code of an image is obtained and used to retrieve similar images from a database. After training the autoencoder, we remove the decoder and keep the encoder. We feed an image into the network, obtain the output of the hashing layer, and quantize that output into a hash code via (2). We then calculate the Hamming distance between the hash code of the query image and that of each image in the database and return the images with the smallest Hamming distances as the retrieval results. Our method is summarized in Algorithm 1.
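
A minimal NumPy sketch of the Hamming-distance ranking used for retrieval is given below, assuming the codes are {-1, +1} vectors produced by Eq. (2); the function name is illustrative.

```python
import numpy as np

def retrieve(query_code, db_codes, top_k=100):
    """Rank database items by Hamming distance to a query hash code.
    query_code: (K,) array; db_codes: (N, K) array of {-1, +1} codes."""
    dists = (query_code[None, :] != db_codes).sum(axis=1)   # Hamming distances
    return np.argsort(dists, kind="stable")[:top_k]         # indices of hits
```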

Algorithm 1
Fig. 2

Generation of hash codes for retrieval. After training, the query image and the images in the database are fed to the network, and the corresponding hash codes are obtained by quantizing the output of the hashing layer. For image retrieval, we calculate the Hamming distance between the hash codes of the query image and each image in the database and then return the images with the smallest Hamming distances as retrieval results

6 Experiments

To evaluate the performance of AUCH, we performed several experiments involving image retrieval and clustering tasks on multiple datasets.

6.1 Datasets

We compare our method with other unsupervised hashing methods for the image retrieval task on MNIST [28] and CIFAR-10 [26]. Furthermore, we compare the performance of AUCH and other clustering methods for the clustering task on the MNIST, CIFAR-10, USPS [23], STL-10 [5] and Fashion-MNIST [51] datasets. Table 1 describes all the datasets used.

Table 1 Dataset descriptions

6.2 Implementation details

Specifically, for the experiments on the CIFAR-10 dataset, we extracted 512-D GIST [41] features from the raw images and used them as the training inputs for AUCH and the compared methods, which include ITQ [15], KMH [20], Spherical [22], SH [49], PCAH [48], LSH [7], AGH [33], DH [12, 36], UH-BNN [10], DeepQuan [2], and KNNH [21]. Let CIFAR-10-GIST denote the CIFAR-10 dataset after this feature extraction. Then, following [2], we randomly selected 1,000 images (100 images per class) as the query set and used the remaining 59,000 images as the training set.

The autoencoder is essentially composed of two fully connected multilayer perceptrons (MLPs), i.e., an encoder and a decoder. The dimensions of the encoder were d-500-500-2000-10-h for all datasets, where d and h are the dimensions of the input layer and the hashing layer, respectively; the decoder network is a mirror of the encoder. No activation function is used for the input and output layers. The hashing layer uses the tanh activation function, and the other layers use the nonlinear ReLU function. The autoencoder pretraining settings were the same as those in [52]. After pretraining, the batch size was set to 256 for all datasets. The coefficients of the clustering loss, reconstruction loss and quantization loss were set differently for different datasets: we first set α = 1 and then determined γ and β through a grid search over \(\{10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}\) by comparing the retrieval performance. Table 2 shows the values of γ, α and β corresponding to the optimal retrieval performance, as determined by comparing the mean average precision (mAP), on the different datasets. We divided each dataset into three parts: a training set used to train the model; a second part used to determine the optimal hyperparameters by calculating and comparing the mAPs of models with different hyperparameters; and a test set used to evaluate the model for comparison with other algorithms. For the MNIST, CIFAR-10-GIST and FASHION-MNIST datasets, we used the Adam optimizer [25] with an initial learning rate of λ = 0.001, θ1 = 0.9, and θ2 = 0.999. For the other datasets, we used SGD with a learning rate of λ = 0.1 and a momentum of 0.99. The update interval T was 140, 30, 100, 30, and 140 iterations for MNIST, USPS, CIFAR-10-GIST, STL-10 and FASHION-MNIST, respectively. We implemented our code in Python with Keras [4] and ran the algorithm on a machine with one GeForce RTX 2080 Ti GPU.
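
The grid search over γ and β described above could be organized as in the following hedged sketch, where `train_and_evaluate` is a hypothetical helper that trains AUCH with the given loss weights and returns the validation mAP.

```python
import itertools

# Search gamma and beta over {1e-6, ..., 1e-1, 1} with alpha fixed to 1.
grid = [10.0 ** e for e in range(-6, 1)]
best_gamma, best_beta, best_map = max(
    ((g, b, train_and_evaluate(gamma=g, alpha=1.0, beta=b))
     for g, b in itertools.product(grid, grid)),
    key=lambda t: t[2],                     # pick the highest validation mAP
)
```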

Table 2 Optimal hyperparameters on different datasets

6.3 Image retrieval

6.3.1 Alternative models

To evaluate the effectiveness of AUCH for the image retrieval task, we performed experiments on the MNIST and CIFAR-10-GIST datasets and compared the retrieval performance with that of several state-of-the-art unsupervised hashing methods, including ITQ, KMH, Spherical, SH, PCAH, LSH, AGH, DH, UH-BNN, DeepQuan, KNNH, ClusterGAN, and UDHSCA [11]. Note that because 512-D GIST features were not used for training in the ClusterGAN paper, our experimental results on CIFAR-10-GIST are not compared with those of ClusterGAN.

6.3.2 Evaluation metrics

To evaluate the retrieval performance of AUCH against the other methods, we rely on three popular retrieval metrics: the mean average precision (mAP), the precision for the top N returned samples, and the Hamming look-up result within a Hamming radius r. The mAP corresponds to the area under the precision-recall (PR) curve; the larger the mAP, the better the overall performance of the hashing algorithm. The second metric is the percentage of true neighbors among the top N retrieved samples. The last metric measures the precision over all points in the buckets that fall within a Hamming radius of r = 2, with a failed search (no returned points) counted as zero precision.
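
For reference, the mAP and precision@N metrics can be computed from Hamming-ranked retrieval lists as in the NumPy sketch below; the function names and the handling of queries with no relevant items are illustrative choices.

```python
import numpy as np

def average_precision(query_code, db_codes, relevant, top_k=None):
    """AP for one query: rank the database by Hamming distance and average
    the precision at each relevant rank (mAP is the mean over all queries)."""
    dists = (query_code != db_codes).sum(axis=1)
    rel = relevant[np.argsort(dists, kind="stable")]
    rel = rel[:top_k] if top_k else rel            # e.g. top_k=1000 for mAP@1000
    if rel.sum() == 0:
        return 0.0
    ranks = np.flatnonzero(rel) + 1                # 1-indexed ranks of true hits
    return float(np.mean(np.cumsum(rel)[rel.astype(bool)] / ranks))

def precision_at_n(query_code, db_codes, relevant, n=1000):
    """Fraction of true neighbors among the top-N retrieved samples."""
    dists = (query_code != db_codes).sum(axis=1)
    top = np.argsort(dists, kind="stable")[:n]
    return float(relevant[top].mean())
```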

6.3.3 Performance comparison

Tables 3 and 4 show that our AUCH method generally outperforms the other state-of-the-art hashing methods on the MNIST and CIFAR-10-GIST datasets; the only exception is that at 64 hash bits, the precision for the top 1000 samples achieved by AUCH is slightly lower than that of PCA-ITQ on CIFAR-10-GIST. In all other cases, whether the number of hash bits is 16, 32 or 64, AUCH is superior to the other methods in terms of all three metrics. Figure 3 shows the PR curves on the MNIST dataset for the unsupervised hashing methods at 16, 32 and 64 bits; AUCH is clearly better than the other methods. Note that for the MNIST dataset, we computed the mAP over the whole database, whereas for the CIFAR-10-GIST dataset, the mAP was computed only over the top 1000 returned samples (as done in [12]). These two calculation methods are denoted by mAP@All and mAP@1000, respectively.

Table 3 Comparison of the average retrieval results (mAP score, Precision@N with N = 1000, and Hamming look-up results with a Hamming radius of r = 2) on MNIST
Table 4 Comparison of the average retrieval results (mAP score, Precision@N with N = 1000 and Hamming look-up results with a Hamming radius of r = 2) on CIFAR-10-GIST
Fig. 3

Recall vs. precision curves on the MNIST dataset for unsupervised hashing methods at 16, 32 and 64 bits: a 16 bits. b 32 bits. c 64 bits

In addition, to ensure the completeness and breadth of the experiments, we further evaluated the retrieval performance of AUCH on MNIST for different numbers of bits. Table 5 shows the retrieval mAP@1000 obtained with AUCH on the MNIST dataset and compares it with those of other state-of-the-art methods, including UDHSCA [11], ITQ, SH, and LSH; the results of the compared methods were taken from [11]. AUCH also significantly outperforms the other unsupervised hashing methods at these code lengths.

Table 5 Retrieval mAP@1000 results on MNIST. The notation XXX-VGGF is used to denote hashing methods built on top of the features produced by the pretrained VGG-F network

In short, we believe that because AUCH captures efficient feature representations of the data, the results are greatly improved.

6.4 Ablation study and parameter sensitivity analysis

We also performed an ablation study to investigate the effect of each component of the overall loss on the retrieval task. We evaluated the retrieval performance for γLc, γLc + αLr, γLc + βLh and γLc + αLr + βLh, changing the loss components one at a time and measuring the difference in mAP@1000 on the CIFAR-10 dataset and in mAP@All on the other datasets, with the number of hash bits set to 32 (see Table 6). Note that γLc is required during the training process, since it guarantees that AUCH can perform unsupervised clustering to guide hash learning. Keeping only γLc or γLc + βLh means that, after pretraining, we remove the decoder and train only the encoder to minimize the loss. If αLr participates in the training process, we retain the decoder after pretraining and train the entire autoencoder.

Table 6 Ablation study results in terms of the mean average precision (mAP, %) on different datasets

From Table 6, the first observation is that all of the terms contribute to improving the results. Moreover, when we keep only γLc for training, AUCH can already achieve reasonably satisfactory retrieval performance on all five datasets. When we add βLh or αLr, the retrieval performance is improved. αLr has a greater impact on the mAP than βLh does, especially on the MNIST dataset. This shows that the clustering loss is the dominant part, and the quantization loss or reconstruction loss can only provide some improvement. In addition, Lr can guide AUCH to extract better features than Lh. AUCH achieves the best retrieval performance on all datasets with the application of all three partial losses.

Figure 4 shows the effects of varying the values of the hyperparameters γ and β. The experiments were conducted by varying one parameter at a time while keeping the others fixed, with α set to 1. Note that the fixed hyperparameters were set to their optimal values and the number of hash bits was 32. To achieve competitive results, γ and β must be set differently for different datasets. For example, with α and β fixed, AUCH achieves the best retrieval performance on MNIST when γ is approximately \(10^{-2}\), whereas the optimal γ on USPS is approximately \(10^{-1}\).

Fig. 4

Parameter sensitivity analysis

6.5 Image clustering

6.5.1 Alternative models

To evaluate the effectiveness of AUCH for the image clustering task, we performed experiments on the MNIST, USPS, CIFAR-10, STL-10 and FASHION-MNIST datasets, which are designed only for clustering, to compare the clustering performance of AUCH with that of several state-of-the-art unsupervised clustering methods, including K-means [24], SC-Ncut [43], SC-LS [3], AC-PIC [60], SEC [39], LDMGI [57], NMF-D [46], DEC [52], JULE [56], DEPICT [14] and ClusterGAN.

6.5.2 Evaluation metrics

To evaluate the clustering performance of AUCH in comparison to the other methods, we use two popular clustering metrics: the accuracy (ACC) and the normalized mutual information (NMI). Computing the ACC requires finding the best one-to-one mapping between the predicted clusters and the true labels [27]. The NMI measures the agreement between the predicted cluster assignments and the ground-truth labels, normalized so that 0 indicates no mutual information and 1 indicates perfect agreement [53].
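
Both metrics can be computed with standard tools, as in the sketch below: the ACC mapping is found with the Hungarian algorithm from SciPy, and the NMI comes directly from scikit-learn (the wrapper functions are illustrative).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: build the count matrix between predicted clusters and true labels,
    then find the best one-to-one mapping with the Hungarian algorithm."""
    n = int(max(y_true.max(), y_pred.max())) + 1
    count = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1
    rows, cols = linear_sum_assignment(-count)     # maximize matched samples
    return count[rows, cols].sum() / y_true.size

def clustering_nmi(y_true, y_pred):
    """NMI between ground-truth labels and predicted cluster assignments."""
    return normalized_mutual_info_score(y_true, y_pred)
```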

6.5.3 Performance comparison

Table 7 compares the clustering performance with the other clustering methods, all of which except ClusterGAN are designed only for clustering, on the five real datasets. From Table 7, we can see that AUCH is superior to most other state-of-the-art clustering methods on the MNIST, USPS and CIFAR-10 datasets in terms of both the ACC and NMI metrics. The confusion matrix provides a more intuitive view: Figure 5 presents the confusion matrix of AUCH after clustering on MNIST with 32 bits, showing that AUCH largely separates MNIST into its 10 classes. On the STL-10 and FASHION-MNIST datasets, AUCH is clearly superior to the other methods; on STL-10 in particular, AUCH shows greatly improved clustering performance compared with the competitors. In summary, this experiment demonstrates that AUCH can extract discriminative features from images and achieve superior clustering performance.

Table 7 Clustering performance comparison with other clustering methods, all of which except ClusterGAN are designed only for clustering, on five real datasets
Fig. 5

Confusion matrix of the true labels and clustering labels on MNIST

7 Conclusion

We have presented an unsupervised deep hashing method, AUCH, that preserves the structural relationship between visual data. AUCH performs hash learning by means of an additional hashing layer, which is equivalent to hash mapping, between the encoder and decoder of an autoencoder. By optimizing an objective function defined over the clustering loss, reconstruction loss and quantization loss for hash codes, AUCH jointly learns feature representations, hashing functions and clustering assignments from input data. Experimental results obtained on general datasets show that AUCH achieves superior retrieval performance and produces promising clustering results.