1 Introduction

Nowadays, data in many research areas, such as object detection [13], text categorization [23, 26] and face recognition [2, 40], is typically high-dimensional. The resulting difficulties in information processing have led to extensive discussion on how to compress such big data. It is well known that a learning model cannot see or hear input data directly; instead, it works on a representation of the original data, so the raw information must first be compressed into a form the model can use. Feature representation [42] aims to deal with such big data efficiently by representing it in a low-dimensional form. The methods can be broadly divided into supervised, semi-supervised and unsupervised ones. Although supervised methods generally perform better in tests than unsupervised ones, they require label information, which demands a huge amount of effort on big data, and some data are difficult to label at all. Hence the potential of unsupervised feature representation methods for dealing with the ‘curse of dimensionality’ is worth exploring.

Classic unsupervised feature representation methods come in two forms, namely linear and non-linear methods. Principal component analysis (PCA) and locality preserving projections (LPP) are two algorithms based on linear transformation [20, 43]. PCA is the most common linear feature representation method, partly due to its simplicity. Different from PCA, LPP emphasizes retaining the neighborhood structure between input data points rather than the global structure. Furthermore, isometric feature mapping (Isomap) and neighborhood preserving embedding (NPE) are two algorithms based on non-linear transformation [1, 19]. The central idea of Isomap is to preserve the geodesic distances among all pairs of input data as much as possible when defining a low-dimensional embedding, while NPE seeks to preserve the local neighborhood structure between data points and can be considered a variant of locally linear embedding (LLE) [37].

The auto-encoder neural network [21] is another feature representation method; its architecture consists of an input layer, a hidden layer and an output layer. The network uses a linear objective function for linear projection and a sigmoid function to realize non-linear mapping. Variants of the auto-encoder network have achieved notable results in unsupervised learning over the past decade. For instance, Xie et al. [45] proposed the deep embedded clustering method (DEC), which builds on the auto-encoder network to learn a clustering-oriented feature representation by jointly optimizing the clustering and the low-dimensional representation, and obtained inspiring clustering performance. Furthermore, Guo et al. [17] noticed a shortcoming of DEC: it does not preserve local structure and may distort the feature space. To this end, they proposed the improved deep embedded clustering method (IDEC), which applies the clustering loss and the reconstruction loss simultaneously during training to avoid distorting the feature space.

Following the idea of combining the variational auto-encoder with generative adversarial networks (GAN) [9, 15, 49], Makhzani et al. proposed the adversarial auto-encoder (AAE) [25, 28]. The structure of a GAN consists of a generative network and a discriminative network. During training, the generative network is trained to fool the discriminator into believing that the generated samples are true samples, while the discriminative network aims to improve its ability to distinguish the authenticity of samples. Generally, AAE training can be partitioned into two stages: a reconstruction stage and a regularization stage. The reconstruction stage minimizes the reconstruction loss of the input data, while the regularization stage updates the parameters of the generator and the discriminator. In recent years, the application of AAE to feature representation has been studied widely. Barone et al. [3] proposed a method that uses AAE to learn a cross-lingual distributed representation without training on parallel text. This differs from approaches that need a large amount of parallel text to learn a representation compatible across languages, and it opens a potentially promising direction in natural language processing research. Moreover, Zhang et al. [47] used a conditional adversarial auto-encoder to learn a latent representation of human faces through adversarial training imposed on the encoder and generator. The learned latent representation retains the personalized features of human faces well, so it can achieve smooth age progression and obtain state-of-the-art performance.

However, the application of AAE to image clustering is rarely studied. The main purpose of clustering methods [18, 31, 41] is to partition input data into multiple clusters without the guidance of label information, bringing points of the same category as close to each other as possible while separating points of different categories. Some clustering methods have been successfully applied to text clustering, but image clustering remains a difficult computer vision task because of its pixel-level characteristics. Since image data today is often high-dimensional, most images need to be processed before clustering. Taking feature representation as an instance, some methods cannot preserve the pixel features well after the original data is reduced to a relatively low dimension, which leads to unsatisfactory image clustering performance. Thus, how to learn an encouraging representation from high-dimensional image data is a topic worthy of further study.

In this paper, we attempt to learn a discriminative feature representation via the adversarial auto-encoder. First, the architecture and training strategy of the adversarial auto-encoder are presented; for the fairness of the comparative experiments, a publicly available auto-encoder network and GAN structure are utilized. The adversarial auto-encoder is then applied to learning a discriminative feature representation for image data and compared with seven feature representation methods by learning low-dimensional features on eight publicly available image data sets. In particular, to prove the usefulness of the learned discriminative feature representation, we also compare the performance of AAE with two recently proposed deep clustering methods. To assess performance, we perform image clustering with the K-means method [18] on the features learned by each algorithm, and select three evaluation metrics, clustering accuracy (ACC) [22, 34], normalized mutual information (NMI) [10] and adjusted rand index (ARI) [39], for comparison. Finally, comprehensive experiments illustrate the effectiveness of the discriminative feature learned via the unsupervised adversarial auto-encoder on the tested data sets.

The main contributions of this work are summarized as follows:

  • This paper attempts to apply the adversarial auto-encoder (AAE) network to learn discriminative features for the image clustering task, which, to the best of our knowledge, has not been studied before. This work could provide a new perspective for studying unsupervised feature representation methods.

  • To give readers a clearer understanding of unsupervised feature representation, we divide unsupervised feature representation methods into three categories (linear transformation based, non-linear transformation based, and auto-encoder based) and comprehensively review them in Section 2.

  • Within the framework of the adversarial auto-encoder, we propose a new method that conducts unsupervised feature learning and image clustering simultaneously, and we adjust the network and several parameters to make the structure of the adversarial auto-encoder more suitable for the image clustering task.

  • We conduct a fairly comprehensive experiment covering eight different image data sets (handwritten fonts, human faces, object images, etc.) and nine different comparative algorithms (classic feature representation methods, recently proposed feature representation methods and novel deep clustering methods).

  • The designed network outperforms the other comparative algorithms on the image clustering task for the tested data sets.

The remainder of this paper is organized as follows. Section 2 reviews related work on feature representation learning. Section 3 presents the network structure and training strategy of the adversarial auto-encoder. Section 4 reports several comparative experiments that demonstrate the effectiveness of the discriminative features learned by the adversarial auto-encoder. Finally, Section 5 concludes the paper.

2 Related work

For convenience of expression, we denote the collection of input data samples and their labels as \(\mathbb {X}\) and \(\mathbb {L}\), respectively. Hence the input data matrix is represented by \(\mathbf {X}=\left [\mathbf {x}_{1},{\cdots } ,\mathbf {x}_{n}\right ]\in \mathbb {R}^{d\times n}\), where \(\mathbf {x}_{i}\in \mathbb {R}^{d}\) is a d-dimensional feature vector.

In this section, we present some classic feature representation methods, which can be broadly partitioned into three classes: linear transformation based feature representation, non-linear transformation based feature representation and variations of auto-encoder neural network.

2.1 Linear transformation based feature representation

Linear feature representation remains a focal point in many fields such as image processing and pattern recognition, owing to its intuitiveness, simplicity, and effectiveness in handling numerous real-world problems.

Generally speaking, for an input data matrix \(\mathbf {X}=\left [\mathbf {x}_{1},{\cdots } ,\mathbf {x}_{n}\right ]\in \mathbb {R}^{d\times n}\), performing a linear transformation on the original high-dimensional space is the simplest way to obtain a low-dimensional subspace. It can be formalized as

$$ \begin{array}{@{}rcl@{}} \mathbf{{X}^{\prime}}=\mathbf{W}^{\mathbf{T}}\mathbf{X} \end{array} $$
(1)

where \(\mathbf {W}^{\mathbf {T}}\in \mathbb {R}^{{d}^{\prime }\times d}\) is the transformation matrix and \({d}^{\prime }\) is the dimension of the subspace \(\left ({d}^{\prime }< d \right )\). \(\mathbf {{X}^{\prime }}=\left [\mathbf {{x}^{\prime }}_{1},\cdots ,\mathbf {{x}^{\prime }}_{n}\right ]\in \mathbb {R}^{{d}^{\prime }\times n}\) is the representation in the low-dimensional subspace, and \(\mathbf {{x}^{\prime }}_{i}=\mathbf {W^{T}}\mathbf {x}_{i}\) is a \({d}^{\prime }\)-dimensional feature vector. Evidently, the features in the new low-dimensional subspace are linear combinations of the features in the original high-dimensional space.

Among the feature representation methods based on linear transformation, PCA and LPP are two mainstream methods. Although linear discriminant analysis (LDA) [30] is another mainstream approach, it is supervised. Since this paper focuses mainly on unsupervised feature representation, LDA will not be discussed.

2.1.1 Principal component analysis

The central idea of PCA [7, 32, 36] is to use a linear transformation to project the input data onto the directions of maximum variance, which aims to retain the features of the input data when reducing them to a low-dimensional subspace and to minimize the reconstruction loss. For an input data matrix \(\mathbf {X}=\left [\mathbf {x}_{1},{\cdots } ,\mathbf {x}_{n}\right ]\), denote the transformation matrix as W and let \(\mathbf {y}_{i}=\mathbf {W}^{T}\mathbf {x}_{i}\); the objective function of PCA can then be written as

$$ \begin{array}{@{}rcl@{}} \underset{\left \| \mathbf{W} \right \|=1}{\mathbf{argmax}}\sum\limits_{i=1}^{n}\left (\mathbf{y}_{i} -\bar{\mathbf{y}}\right )^{2} \end{array} $$
(2)

where \(\bar {\mathbf {y}}=\frac {1}{n}{\sum }_{i=1}^{n}\mathbf {y}_{i}\) denotes the mean of \(\left \{ \mathbf {y}_{i} \right \}_{i=1}^{n}\); then (2) can be rewritten as

$$ \begin{array}{@{}rcl@{}} \underset{\left \| \mathbf{W} \right \|=1}{\mathbf{argmax}}\mathbf{W}^{T}\mathbf{C}\mathbf{W} \end{array} $$
(3)

where C is the covariance matrix of the input samples; the eigenvectors of C associated with the largest eigenvalues span the optimal subspace found by PCA.
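To make the procedure concrete, the following minimal NumPy sketch implements (1)-(3): center the data, form the covariance matrix, and project onto the eigenvectors with the largest eigenvalues. The shapes and function name are illustrative and not taken from the paper.

```python
import numpy as np

def pca_projection(X, d_prime):
    """Minimal PCA sketch for Eqs. (1)-(3): X has shape (d, n), columns are samples."""
    X_centered = X - X.mean(axis=1, keepdims=True)   # remove the mean of each feature
    C = X_centered @ X_centered.T / X.shape[1]       # covariance matrix C (d x d)
    eigvals, eigvecs = np.linalg.eigh(C)             # eigh returns ascending eigenvalues
    W = eigvecs[:, ::-1][:, :d_prime]                # top-d' eigenvectors as columns of W
    return W.T @ X_centered                          # low-dimensional representation X' = W^T X

# Example: project 1024-dimensional vectors of 500 samples down to 60 dimensions
X = np.random.rand(1024, 500)
X_low = pca_projection(X, 60)   # shape (60, 500)
```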

2.1.2 Locality preserving projections

Different from PCA, which mainly revolves around retaining the global structure of the input data, LPP puts emphasis on retaining the local neighborhood structure between input data points.

LPP combines the advantages of manifold learning and linear feature representation [36, 46] and is based on a linear approximation of the Laplacian Eigenmaps (LE) method [4]. In simple terms, it is a linear version of LE that uses a linear approximation for non-linear feature representation.

Given an input data matrix \(\mathbf {X}=\left [\mathbf {x}_{1},{\cdots } ,\mathbf {x}_{n}\right ]\) and a linear transformation matrix W that projects X into a low-dimensional subspace \(\mathbf {Y}=\left [\mathbf {y}_{1},{\cdots } ,\mathbf {y}_{n}\right ]\), the objective function can be written as

$$ \begin{array}{@{}rcl@{}} \underset{\mathbf{W}}{\mathbf{min}}\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n}\left \| \mathbf{y}_{i}-\mathbf{y}_{j}\right \|^{2}\mathbf{P}\left (i,j \right ) \end{array} $$
(4)

where \(\mathbf {y}_{i}=\mathbf {W}^{T}\mathbf {x}_{i}\), and \(\mathbf {P}\left (i,j \right )\) is a similarity matrix built from the nearest-neighbor graph. When \(\mathbf {x}_{i}\) is among the k nearest neighbors of \(\mathbf {x}_{j}\) or \(\mathbf {x}_{j}\) is among the k nearest neighbors of \(\mathbf {x}_{i}\), \(\mathbf {P}\left (i,j \right )\) is defined as (and set to 0 otherwise)

$$ \begin{array}{@{}rcl@{}} \mathbf{P}\left (i,j \right )=\mathbf{e}^{-\frac{ \left \| \mathbf{x}_{i}-\mathbf{x}_{j} \right \|^{2}}{t}} \end{array} $$
(5)
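A possible implementation of (4)-(5) is sketched below. It builds the heat-kernel similarity matrix P on a symmetrized kNN graph and solves the usual LPP generalized eigenproblem; the hyper-parameters k and t, the small ridge term, and the use of scikit-learn/SciPy are assumptions of this sketch rather than details from the paper.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.linalg import eigh

def lpp(X, d_prime, k=20, t=1.0):
    """Sketch of LPP (Eqs. (4)-(5)). X has shape (d, n); k and t are assumed hyper-parameters."""
    # Symmetrized kNN graph with heat-kernel weights P(i, j) = exp(-||x_i - x_j||^2 / t)
    dist2 = kneighbors_graph(X.T, k, mode='distance').toarray() ** 2
    P = np.where(dist2 > 0, np.exp(-dist2 / t), 0.0)
    P = np.maximum(P, P.T)              # x_i in kNN(x_j) or x_j in kNN(x_i)
    D = np.diag(P.sum(axis=1))
    L = D - P                           # graph Laplacian
    # Generalized eigenproblem X L X^T w = lambda X D X^T w; smallest eigenvalues give W
    A, B = X @ L @ X.T, X @ D @ X.T
    eigvals, eigvecs = eigh(A, B + 1e-6 * np.eye(B.shape[0]))  # ridge for numerical stability
    return eigvecs[:, :d_prime]         # columns form the transformation matrix W
```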

2.2 Non-linear transformation based feature representation

In the real world, not all data has a linear structure. Data structures such as trees and graphs are very common, but they are non-linear and therefore difficult to handle with linear methods. Hence, non-linear methods are also widely studied. Isometric feature mapping (Isomap) and neighborhood preserving embedding (NPE) are two typical unsupervised feature representation methods based on non-linear transformation.

2.2.1 Isometric feature mapping

Isomap [1, 38] retains the geodesic distances among all pairs of data points as much as possible when defining a low-dimensional embedding.

Isomap first creates a neighborhood graph G to approximate the geodesic distances among the input data. The shortest path between two points in G can then be calculated with the Dijkstra or Floyd algorithm [11, 24]; this path length serves as an estimate of the geodesic distance between the two data points. Finally, the geodesic distances among all data points are calculated, which yields a pairwise geodesic distance matrix.

However, Isomap still has some disadvantages, such as topological instability: in the neighborhood graph G it may establish wrong connections between data points. Despite these shortcomings, Isomap has been successfully applied to many real-world problems.
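The geodesic-distance step described above can be sketched as follows, using a kNN graph and Dijkstra shortest paths from SciPy; classical MDS on the resulting matrix would then give the Isomap embedding. The neighborhood size k is an assumed parameter.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, k=20):
    """Sketch of the Isomap distance step: X has shape (n, d), rows are samples."""
    # 1. Build the k-nearest-neighbor graph G weighted by Euclidean distances
    G = kneighbors_graph(X, k, mode='distance')
    # 2. Approximate geodesic distances by shortest paths in G (Dijkstra)
    D_geo = shortest_path(G, method='D', directed=False)
    return D_geo   # pairwise geodesic distance matrix; entries are inf if G is disconnected
```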

2.2.2 Neighborhood preserving embedding

The purpose of NPE [19, 36] is to retain the local neighborhood structure [48] of the input data, which is similar to locally linear embedding (LLE) [37]. NPE uses a local least squares approximation to estimate the affinity weight matrix W. For an input data matrix \(\mathbf {X}=\left [\mathbf {x}_{1},{\cdots } ,\mathbf {x}_{n}\right ]\), the local geometry of the data points can be characterized through the reconstruction of each point from its neighbors. Thus the reconstruction error can be formalized by the following cost function

$$ \begin{array}{@{}rcl@{}} \phi\left (\mathbf{W} \right )={{\sum}_{i=1}^{n}} \big \| \mathbf{x}_{i}-{{\sum}_{j=1}^{n}}\mathbf{W}_{ij}\mathbf{x}_{j} \big \|^{2} \end{array} $$
(6)

where \(\mathbf {x}_{j}\) is one of the neighbors of \(\mathbf {x}_{i}\), \(\mathbf {W}_{ij}\in \mathbf {W}\) represents the edge weight from \(\mathbf {x}_{i}\) to \(\mathbf {x}_{j}\), and the weights satisfy the constraint

$$ \begin{array}{@{}rcl@{}} {\sum\limits_{j=1}^{n}}\mathbf{W}_{ij}=1 \end{array} $$
(7)
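The reconstruction weights of (6) under the constraint (7) can be obtained by solving, for every point, a small local least-squares problem, as in the following sketch; the regularization term `reg` is an assumption added so that the local Gram matrix stays invertible.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def reconstruction_weights(X, k=20, reg=1e-3):
    """Sketch of Eqs. (6)-(7): X has shape (n, d), rows are samples."""
    n = X.shape[0]
    W = np.zeros((n, n))
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)                 # idx[i, 0] is x_i itself
    for i in range(n):
        neighbors = idx[i, 1:]
        Z = X[neighbors] - X[i]                 # shift the neighbors to the origin
        G = Z @ Z.T                             # local Gram matrix (k x k)
        G += reg * np.trace(G) * np.eye(k)      # regularize in case G is singular
        w = np.linalg.solve(G, np.ones(k))
        W[i, neighbors] = w / w.sum()           # enforce the sum-to-one constraint of Eq. (7)
    return W
```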

2.3 Feature representation based on auto-encoder neural network

The auto-encoder (AE) neural network is another method for feature representation. Over the past decade, variants of the auto-encoder neural network have achieved notable results. In this subsection, an unsupervised sparse auto-encoder and a semi-supervised label consistent auto-encoder are introduced.

2.3.1 Sparse auto-encoder

As a variant of the auto-encoder, the sparse auto-encoder (SAE) assumes that its network is sparse. The difference between SAE and the auto-encoder lies in the definition of the loss function: SAE constrains the output of the hidden layer by adding a sparsity penalty on the hidden neurons, so as to drive the mean activation of the hidden layer as close to zero as possible. This means that most hidden neurons remain inactive. The optimization function of SAE is therefore defined as

$$ \begin{array}{@{}rcl@{}} \underset{\mathbf{W},\mathbf{b}}{\min}J(\mathbf{W},\mathbf{b})+\beta \sum\limits_{j=1}^{{d}^{\prime}}\text{KL}(\rho ||\hat{\rho_{j}}) \end{array} $$
(8)

where \({d}^{\prime }\) is the dimension of the hidden layer, J(W, b) denotes the optimization function of the typical AE network, ρ is a sparsity parameter whose value is generally tiny, and \(\text {KL}(\rho ||\hat {\rho }_{j})=\rho \log (\frac {\rho }{\hat {\rho }_{j}})+(1-\rho )\log (\frac {1-\rho }{1-\hat {\rho }_{j}})\) is the KL-divergence that drives the average activation of each hidden unit toward the predefined sparsity parameter ρ. Here, \(\hat {\rho }_{j}\) is formulated as

$$ \begin{array}{@{}rcl@{}} \hat{\rho}_{j}=\frac{1}{n}\sum\limits_{i=1}^{n}\sigma (\mathbf{w}_{j}^{T}\mathbf{x}_{i}+b_{j}^{(1)}) \end{array} $$
(9)
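The sparsity term of (8) can be written compactly as in the following PyTorch sketch, which assumes sigmoid hidden activations and estimates \(\hat{\rho}_{j}\) from the current mini-batch.

```python
import torch

def sparsity_penalty(hidden, rho=0.05, eps=1e-8):
    """Sketch of the KL sparsity term in Eq. (8); `hidden` holds sigmoid activations, shape (batch, d')."""
    rho_hat = hidden.mean(dim=0).clamp(eps, 1 - eps)   # average activation of each hidden unit
    kl = rho * torch.log(rho / rho_hat) + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))
    return kl.sum()

# Training step sketch: total loss = reconstruction loss + beta * sparsity penalty, e.g.
# loss = torch.nn.functional.mse_loss(x_rec, x) + beta * sparsity_penalty(hidden, rho)
```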

2.3.2 Label consistent auto-encoder

The label consistent auto-encoder (LCAE), proposed by Gogna et al. [14], is an approach for learning feature representations in a semi-supervised manner. The structure of LCAE is a two-layer stacked auto-encoder network, and it is based on the idea of obtaining an approximate inverse for a linear problem, which can be formalized as

$$ \begin{array}{@{}rcl@{}} {\mathbf{x}}^{\prime}=\mathbf{A}^{T}\mathbf{y}=\mathbf{A}^{T}\mathbf{A}\mathbf{x} \end{array} $$
(10)

where \({\mathbf {x}}^{\prime }\) is a noisy version of x, and x represents the actual solution. LCAE seeks a linear mapping from the deepest layer of the network to the category labels, which forms the class-label consistency penalty. Thus the objective function of LCAE is defined as follows

$$ \begin{array}{@{}rcl@{}} &&\underset{\mathbf{W}_{1},{\mathbf{W}_{1}}^{\prime},\mathbf{W}_{2},{\mathbf{W}_{2}}^{\prime},D} {\min}\left \| \mathbf{X}-{\mathbf{W}_{1}}^{\prime}\phi ({\mathbf{W}_{2}}^{\prime}\phi (\mathbf{W}_{2}\phi(\mathbf{W}_{1}\mathbf{X}))) \right \|_{F}^{2}\\ &&+\lambda \left \| L-D\phi(\mathbf{W}_{2}\phi(\mathbf{W}_{1}\mathbf{X}_{S})) \right \|_{F}^{2} \end{array} $$
(11)

where L denotes the labels and D is the linear map. The weight matrices of the encoder and decoder in the first layer are W1 and \({\mathbf {W}_{1}}^{\prime }\), while W2 and \({\mathbf {W}_{2}}^{\prime }\) are the encoder and decoder matrices in the second layer, respectively. X denotes the input data, part of which is labeled while the rest is not; therefore \(\mathbf {X}=\left [ \mathbf {X}_{U}|\mathbf {X}_{S} \right ]\), where XS is the data with supervised information.

2.3.3 Deep embedded clustering

Deep clustering is an emerging research field of recent years. Different from the classic two-stage framework, deep clustering attempts to learn the low-dimensional representation and the cluster assignments simultaneously.

The deep embedded clustering method proposed by Xie et al. [45] applies an auto-encoder network and defines a clustering loss to learn the cluster assignments and low-dimensional features simultaneously. After the pre-training process, DEC retains only the encoder and discards the decoder; the encoder is then fine-tuned according to the following loss function

$$ \begin{array}{@{}rcl@{}} L= KL(P\parallel Q)=\sum\limits_{i}\sum\limits_{j}p_{ij}\log\frac{p_{ij}}{q_{ij}} \end{array} $$
(12)

where KL denotes the Kullback-Leibler divergence, Q is the soft label distribution and P is the target distribution derived from Q; their detailed definitions can be found in [45]. In addition, because P is produced from Q, the training process of DEC can be considered a form of self-training.
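For reference, the sketch below computes a soft assignment Q with a Student's t kernel and the sharpened target distribution P, following the definitions given in [45]; the cluster centers `mu`, the degree of freedom `alpha` = 1 and the tensor shapes are assumptions of this sketch.

```python
import torch

def soft_assignment(z, mu, alpha=1.0):
    """Student's t soft assignment q_ij between embedded points z (n, d') and centers mu (K, d')."""
    dist2 = torch.cdist(z, mu) ** 2
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Target p_ij = (q_ij^2 / f_j) / sum_j'(q_ij'^2 / f_j'), with cluster frequency f_j = sum_i q_ij."""
    weight = q ** 2 / q.sum(dim=0)
    return weight / weight.sum(dim=1, keepdim=True)

# Clustering loss of Eq. (12): KL(P || Q)
# loss = (p * torch.log(p / q)).sum()
```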

2.3.4 Improved Deep Embedded Clustering

Although deep clustering methods that learn low-dimensional feature representations of the inputs for the clustering problem had obtained inspiring performance, Guo et al. [17] noticed a shortcoming of DEC: it does not consider the retention of local structure. To this end, they proposed the improved deep embedded clustering (IDEC) method to address this issue.

The core of IDEC is to keep the decoder of the auto-encoder network after the pre-training process instead of discarding it as DEC does. The retained decoder provides a reconstruction loss that guides the clustering task jointly with the clustering loss, thereby avoiding the distortion of the feature space that occurs when only the clustering loss is applied. Hence the objective function of IDEC can be formalized as follows

$$ \begin{array}{@{}rcl@{}} L=L_{r}+\gamma L_{c} \end{array} $$
(13)

where γ is a coefficient that controls the degree of distortion in the feature space. If γ = 1 and the reconstruction loss is dropped (Lr = 0), IDEC reduces to the DEC method. Here Lr and Lc are the reconstruction loss and the clustering loss, respectively. The reconstruction loss Lr is defined as

$$ \begin{array}{@{}rcl@{}} L_{r}=\left \| \mathbf{X} -\mathbf{X}^{\prime}\right \|_{2}^{2} \end{array} $$
(14)

where \(\mathbf {X}^{\prime }\) is the reconstruction of the input data matrix X, and the clustering loss Lc is the same as in (12).
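A compact sketch of (12)-(14) combined as in (13) is given below; averaging the clustering loss over the batch and the value γ = 0.1 (used later in Section 4.2) are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def idec_loss(x, x_rec, p, q, gamma=0.1):
    """Sketch of Eq. (13): reconstruction loss plus gamma-weighted clustering loss."""
    L_r = F.mse_loss(x_rec, x)                      # Eq. (14)
    L_c = (p * torch.log(p / q)).sum() / x.size(0)  # Eq. (12), averaged over the batch (an assumption)
    return L_r + gamma * L_c
```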

3 Adversarial auto-encoder

The adversarial auto-encoder (AAE) is a probabilistic auto-encoder. It combines the recently popular generative adversarial networks (GAN) with the variational auto-encoder (VAE), and focuses on finding a discriminative feature representation for high-dimensional data.

3.1 Architecture

To comprehend the architecture of AAE, we must first understand the structures of generative adversarial networks (GAN) and the auto-encoder (AE).

A typical auto-encoder network consists of two parts: an encoder and a decoder. The former focuses on learning a subspace feature representation of the inputs, while the latter aims to reconstruct the inputs and minimize the reconstruction error. The three-layer network structure of the auto-encoder is shown in Fig. 1.

Fig. 1 The network structure of the auto-encoder. A typical auto-encoder network usually comprises an input layer, a hidden layer, and an output layer

Regarding GAN, its structure contains a generative network G and a discriminative network D. The function D(x) used in the discriminative network computes the probability that a sample x is a positive sample drawn from the original data distribution, rather than a negative sample produced by the generator. Meanwhile, a sample z is drawn from the prior distribution, and the function G(z) maps z into the data space. A visualization of the structure of GAN is presented in Fig. 2.

Fig. 2 The structure of generative adversarial networks. The generative network aims to generate fake samples that can deceive the discriminative network, while the discriminative network puts emphasis on enhancing its ability to distinguish between true and false samples through training

During the training procedure, the gradient of D(x) is used to train the generator G and update its parameters, while the aim of G(z) during training is to fool the discriminative network D into believing, as far as possible, that the samples it generates are positive samples. This process is formalized as follows

$$ \begin{array}{@{}rcl@{}} \underset{G}{\min} \underset{D}{\max} \mathrm{E}_{\mathbf{x}\sim p_{data}}[\log D(\mathbf{x})] + \mathrm{E}_{\mathbf{z}\sim p(\mathbf{z}) } [\log(1-D(G(\mathbf{z})))]\\ \end{array} $$
(15)

Thus the training process can be roughly divided into two phases: first, the discriminator D is trained to distinguish the real samples from the fake ones produced by the generator G; second, the generator G is trained so that the samples it produces confuse the discriminator D.
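These two phases are usually implemented with binary cross-entropy losses, as in the following sketch, which assumes the discriminator outputs probabilities through a sigmoid; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def gan_losses(D, G, x_real, z):
    """Sketch of the two phases of Eq. (15) in the usual binary cross-entropy formulation."""
    # Phase 1: train the discriminator to tell real samples from generated ones
    d_real = D(x_real)
    d_fake = D(G(z).detach())                     # detach so no generator gradients flow here
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    # Phase 2: train the generator so that its samples are classified as real
    g_fake = D(G(z))
    g_loss = F.binary_cross_entropy(g_fake, torch.ones_like(g_fake))
    return d_loss, g_loss
```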

The adversarial auto-encoder (AAE) network is a combination of generative adversarial networks and the variational auto-encoder. The two training targets of AAE are to minimize the reconstruction error and to match the aggregated posterior distribution of the hidden-layer representation to an arbitrary prior distribution through adversarial training. For an input sample x and a hidden code z of the auto-encoder, the structure of AAE is presented in Fig. 3.

Fig. 3 A visualization of the architecture of the adversarial auto-encoder. The upper half is an auto-encoder, whose encoder can be considered a generator, while the bottom half is a discriminative network

The upper part of this figure is an auto-encoder network that first transforms the input data x into the hidden vector z and then reconstructs x from z; the encoder of the auto-encoder network can be considered a generator. The bottom part of the figure is a discriminative network trained to distinguish whether a sample comes from the hidden layer of the auto-encoder network (fake sample) or from the prior distribution (true sample).

3.2 Training strategy

Define the original data distribution as pdata(x), the prior distribution as p(z), and the model distribution as p(x). We also define q(z|x) and q(x|z) as the encoder and decoder distributions, respectively. Through the encoding distribution q(z|x), q(z) on the hidden layer is defined by the following formula

$$ \begin{array}{@{}rcl@{}} q(\mathbf{z})={\int}_{\mathbf{x}}q(\mathbf{z}|\mathbf{x})p_{data}(\mathbf{x})d\mathbf{x} \end{array} $$
(16)

where q(z) is the aggregated posterior distribution on the hidden layer. During training, one goal of AAE is to match q(z) to p(z) to realize regularization. To this end, AAE attaches an adversarial network and treats the encoder q(z|x) as the generator of that adversarial network. As Fig. 3 shows, it is the adversarial network that guides this matching process. Meanwhile, the other goal of AAE is to minimize the reconstruction error of the auto-encoder network. The encoder tries to make the aggregated posterior q(z) confuse the discriminative network into believing that its samples come from p(z).

Several choices are possible for the encoder q(z|x) in AAE. The first is the deterministic choice, in which q(z|x) is a deterministic function of x, so the randomness of q(z) comes only from the data distribution pdata(x); the encoder is then the same as that of a typical auto-encoder. Another is the universal approximator posterior, in which the randomness of q(z) originates from random noise fed to the encoder inputs as well as from the data distribution pdata(x). The Gaussian posterior is also a possible choice: q(z|x) is assumed to be a Gaussian distribution whose mean and standard deviation are computed by the encoder, and the randomness of q(z) comes from the stochasticity of the Gaussian output of the encoder and from the data distribution pdata(x).
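As an illustration of the Gaussian posterior choice, the sketch below implements an encoder that outputs a mean and a standard deviation and samples z by re-parameterization; the layer sizes are arbitrary assumptions, and the deterministic choice would simply return the mean.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Sketch of the Gaussian posterior q(z|x): the encoder outputs a mean and a standard
    deviation, and z is sampled by re-parameterization. Layer sizes are illustrative."""
    def __init__(self, d_in, d_z):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, 512), nn.ReLU())
        self.mu = nn.Linear(512, d_z)
        self.log_std = nn.Linear(512, d_z)

    def forward(self, x):
        h = self.body(x)
        mu, std = self.mu(h), self.log_std(h).exp()
        return mu + std * torch.randn_like(std)   # stochasticity beyond p_data(x)

# The deterministic choice would return mu(h) only; q(z)'s randomness then comes solely from p_data(x).
```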

The main difference in training between AAE and VAE is that AAE only requires sampling from the prior distribution to drive q(z) to match p(z), whereas VAE needs the exact functional form of the prior distribution to back-propagate the KL divergence. Thus AAE can impose a complex prior distribution without knowing its exact functional form. There are also some differences between AAE and GAN: for instance, AAE learns to capture the data distribution through the training of the auto-encoder network, whereas GAN imposes the data distribution at the pixel level on the output layer of the network.

On the whole, each mini-batch training step of AAE, in which the auto-encoder and the adversarial network are trained jointly with SGD, can be separated into a reconstruction stage and a regularization stage. In the reconstruction stage, the auto-encoder minimizes the reconstruction error of the inputs by updating the parameters ϕ and 𝜃 of the encoder and decoder networks. The objective function is defined as follows

$$ \begin{array}{@{}rcl@{}} \underset{\theta ,\phi }{\text{argmin}}\left \| \mathbf{x}-{\mathbf{x}}^{\prime}\right \|^{2} \end{array} $$
(17)

where \({\mathbf {x}}^{\prime }\) is the reconstructed data. The regularization stage has two aims. First, the adversarial network updates its discriminator to differentiate the true samples, drawn from the prior distribution, from the fake samples generated in the hidden layer. Then the generator of the adversarial network is updated to fool the discriminative network; its objective function is similar to (15). The discriminator D(x) computes the probability that a sample x is a true sample we attempt to model, rather than a fake sample, while G(z) maps z from the prior distribution into the data space. The goal of G(z) is to fool the discriminative network as much as possible through training, forcing the discriminator to believe that the samples it generates are true samples; this training of G exploits the gradient of D(x) with respect to x, which G also uses to update its parameters.

The training procedure of the adversarial auto-encoder is presented in Algorithm 1.

Algorithm 1 The training procedure of the adversarial auto-encoder
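The sketch below illustrates one possible mini-batch update in the spirit of the two stages described above; it is not the paper's exact Algorithm 1, and the optimizers, the Gaussian prior p(z) and the binary cross-entropy adversarial losses are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def aae_step(encoder, decoder, discriminator, x, opt_ae, opt_d, opt_g, d_z):
    """One mini-batch of AAE training: a reconstruction stage (Eq. (17)) followed by a
    regularization stage with a discriminator update and a generator (encoder) update."""
    # --- Reconstruction stage: update encoder (phi) and decoder (theta) ---
    z = encoder(x)
    rec_loss = F.mse_loss(decoder(z), x)
    opt_ae.zero_grad(); rec_loss.backward(); opt_ae.step()

    # --- Regularization stage, part 1: update the discriminator ---
    z_fake = encoder(x).detach()                  # samples from the aggregated posterior q(z)
    z_real = torch.randn(x.size(0), d_z)          # samples from the (assumed Gaussian) prior p(z)
    d_loss = F.binary_cross_entropy(discriminator(z_real), torch.ones(x.size(0), 1)) + \
             F.binary_cross_entropy(discriminator(z_fake), torch.zeros(x.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Regularization stage, part 2: update the encoder/generator to fool the discriminator ---
    g_loss = F.binary_cross_entropy(discriminator(encoder(x)), torch.ones(x.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return rec_loss.item(), d_loss.item(), g_loss.item()
```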

4 Experimental analysis

In this section, we first present the eight tested image data sets, then describe the parameter settings of each feature representation algorithm and the evaluation metrics used in the experiments. After that, a number of comparative experiments are performed to compare the low-dimensional discriminative features learned via the adversarial auto-encoder with those of other feature representation methods. Extensive experiments demonstrate the effectiveness of the adversarial auto-encoder.

4.1 Data sets

USPS is a popular subset of the USPS handwritten digit database [35]. It includes 9,298 handwritten digit images in 10 categories, split into 7,291 images for training and 2,007 images for testing. Each image is denoted as a feature vector with 256 dimensions.

The Chars74K database consists of two different versions (English and Kannada) [8]. The English version contains images of 62 different categories and can be divided into three parts: 7,705 characters acquired from natural scenes, 3,410 hand-drawn characters generated on a tablet PC, and 62,992 characters synthesized from computer fonts. In this paper, we select the English version for our experiments; moreover, we employ only the third part of this database and eliminate all numeral data. Thus the database applied in the experiment consists of 44,044 training samples and 8,788 testing samples in 52 categories, where each image can be stacked as a 1,024-dimensional feature vector.

Yale-B [6, 12] is a subset of the Extended Yale Face Database B. For this data set, we use the cropped images and resize all of them to 32×32 pixels, so each image is denoted as a feature vector with 1,024 dimensions. The data set is composed of 2,414 gray-scale face images in 38 categories, captured under 64 different illumination conditions per class.

The ORL [5, 16] face database is a collection of face images of 40 distinct subjects, with 10 images per subject. These images were taken at different times, varying the lighting, facial expressions and facial details. Every image is reshaped as a 1,024-dimensional feature vector.

The COIL-20 data set comprises 20 objects in total [33], with 72 images per object. These images were generated by rotating each object on a turntable and capturing an image every 5 degrees. All images are gray-scale with size 32×32 and are stacked as 1,024-dimensional feature vectors.

The CIFAR-10 database is composed of 60,000 colour images in 10 categories such as airplane, dog and truck, with 6,000 images per category. It is a subset of the 80 million tiny images data set [27]. We split the data into 50,000 images for training and 10,000 images for testing, and each image is reshaped as a feature vector with 3,072 dimensions.

Fashion-MNIST is a data set containing 60,000 samples for training and 10,000 samples for testing, covering 10 categories such as Trouser, Dress and Sneaker [44]. Each example is a 28×28 gray-scale image with 256 grey levels per pixel, and can be denoted as a 784-dimensional feature vector.

The SMSHP (Sebastien Marcel Static Hand Posture) data set [29] consists of 6 different hand posture styles (v, a, b, c, point, five), which form the 6 categories of this database. For convenience of experiment, the size of all images is unified to 32×32 pixels, and each image can be reshaped as a feature vector with 1,024 dimensions.

The primary information of each data set mentioned above, including the number of samples, features and classes, is summarized in Table 1. In addition, we randomly select 30 samples from every data set to provide a visualization in Fig. 4.

Fig. 4 A visualization of the data sets used in the experiments; 30 images are randomly selected from each database for display

Table 1 A concise summary of the tested data sets used in the experiments

4.2 Parameter settings

To prove the effectiveness of the adversarial auto-encoder (AAE) for unsupervised feature representation, several unsupervised feature representation methods are introduced for comparison, including principal component analysis (PCA), isometric feature mapping (Isomap), locality preserving projections (LPP), neighborhood preserving embedding (NPE), auto-encoder (AE) and sparse auto-encoder (SAE). In addition, we include a semi-supervised label consistent auto-encoder (LCAE) and two recently proposed deep clustering methods, deep embedded clustering (DEC) and improved deep embedded clustering (IDEC), in the comparison to show the effectiveness of AAE.

Regarding the parameter settings, the architecture and training strategy of AAE are described in Section 3. In the following experiments, a three-layer AAE is employed, the number of hidden layer units is chosen from \(\left \{ 20, 40, 60, {\cdots } ,200 \right \}\), the number of training epochs is fixed at 1000, and the learning rate is set to α = 0.001. For PCA, the only parameter is the subspace dimension. Isomap, LPP and NPE search for neighbors with the kNN method; the number of nearest neighbors k is fixed at 20, and if a singular k × k matrix appears, k is reset to a larger value to solve this problem. For AE, SAE and LCAE, the parameters, including the learning rate α and the number of training epochs, are identical to the settings of AAE. Besides, the range of subspace dimensions for all compared feature representation methods is set to \(\left \{ 20, 40, 60, {\cdots } ,200 \right \}\), the same as for AAE. For the two deep clustering methods (DEC and IDEC), the performance on the USPS data set is taken from the original paper [17]; for the other data sets, we train them with the same parameter settings. The network structure of DEC and IDEC is fixed as d-500-500-2000-10 and the learning rate is set to α = 0.001. The pre-training iterations are set to 200, and the coefficient γ is set to γ = 0.1 after pre-training.

After all the feature representation methods have been performed, the low-dimensional features learned by them are assessed by the K-means clustering method. The process of dimensionality reduction and K-means clustering for AAE is summarized in Fig. 5. The main difficulty in assessing the effectiveness of unsupervised and semi-supervised approaches is the lack of supervised information: without complete supervision, it is hard to find the correspondence between the predicted labels and the ground truth after clustering. We therefore select three popular evaluation metrics, clustering accuracy (ACC), normalized mutual information (NMI) and adjusted rand index (ARI), to evaluate the performance of the K-means method. In addition, because K-means is highly sensitive to its initial values, its performance fluctuates greatly; thus we repeat it 10 times and record the means and standard deviations. Higher ACC, ARI and NMI scores indicate better performance.

Fig. 5 The process of dimensionality reduction and K-means clustering with AAE. Through encoding, a d-dimensional input is represented as a \({d}^{\prime }\)-dimensional feature vector in the hidden layer, which is then used as the input of the K-means method. Finally, the clustering performance is assessed with three popular evaluation metrics: ACC, ARI and NMI
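A hedged sketch of this evaluation pipeline is given below, assuming a trained PyTorch encoder and scikit-learn's K-means and metric implementations; ACC additionally needs the label-matching step defined in Section 4.3.1.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def evaluate_embedding(encoder, X, y_true, n_clusters, runs=10):
    """Encode the data, run K-means `runs` times, and report NMI/ARI means and standard deviations."""
    with torch.no_grad():
        Z = encoder(torch.as_tensor(X, dtype=torch.float32)).numpy()   # d'-dimensional features
    nmi, ari = [], []
    for seed in range(runs):
        y_pred = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(Z)
        nmi.append(normalized_mutual_info_score(y_true, y_pred))
        ari.append(adjusted_rand_score(y_true, y_pred))
    return (np.mean(nmi), np.std(nmi)), (np.mean(ari), np.std(ari))
```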

4.3 Evaluation metrics

After learning the low-dimensional subspace features with those feature representation methods, we assess their quality through K-means clustering. For AAE, AE, SAE and LCAE, the subspace representation is obtained from the hidden layer, as presented in Fig. 5.

As mentioned above, it is difficult to find a correspondence between the labels predicted by K-means and the ground truth without the guidance of supervised information. In the performance evaluation, we therefore assume that the true class labels of the data sets are available, even though the setting is unsupervised. The definitions of ACC, ARI and NMI are introduced below.

4.3.1 Clustering accuracy

Denote by \(\left \{ \mathbf {x}_{i} \right \}_{i=1}^{n}\) the data points, and by \(\left \{ \mathbf {y}_{i} \right \}_{i=1}^{n}\) and \(\left \{ \hat {\mathbf {y}}_{i} \right \}_{i=1}^{n}\) the ground-truth labels and the labels predicted by K-means clustering, respectively. ACC can then be formalized as follows

$$ \begin{array}{@{}rcl@{}} ACC = \frac{{\sum}_{i=1}^{n}\delta (\mathbf{y}_{i},map(\hat{\mathbf{y}_{i}}))}{n} \end{array} $$
(18)

where map(⋅) is a permutation mapping that best matches the labels produced by K-means clustering with the ground-truth labels. Moreover, \(\delta (\mathbf {y}_{i},map(\hat {\mathbf {y}}_{i}))=1\) when \(\mathbf {y}_{i}=map(\hat {\mathbf {y}}_{i})\), and \(\delta (\mathbf {y}_{i},map(\hat {\mathbf {y}}_{i}))=0\) otherwise.
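The permutation map(⋅) is commonly found with the Hungarian algorithm; the sketch below uses SciPy's `linear_sum_assignment` and assumes integer labels in the range 0..K-1 stored in NumPy arrays.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Sketch of Eq. (18): find the best permutation map(.) with the Hungarian algorithm."""
    n_classes = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                            # count co-occurrences of predicted/true labels
    row, col = linear_sum_assignment(-cost)        # maximize the total number of matches
    mapping = dict(zip(row, col))
    return np.mean([mapping[p] == t for t, p in zip(y_true, y_pred)])
```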

4.3.2 Adjusted rand index

Assume \(\mathbb {T}=\left \{ \mathbf {T}_{j}\right \}_{j=1}^{t}\) stands for the ground-truth partition and \(\mathbb {C}=\left \{ \mathbf {C}_{i}\right \}_{i=1}^{c}\) for the clustering result. ARI computes the resemblance between the two partitions; the contingency table of \(\mathbb {C}\) and \(\mathbb {T}\) is represented as follows.

The contingency table has rows \(\mathbf {C}_{1},\cdots ,\mathbf {C}_{c}\) and columns \(\mathbf {T}_{1},\cdots ,\mathbf {T}_{t}\), with entries \(n_{ij}=\left | \mathbf {C}_{i}\cap \mathbf {T}_{j} \right |\), row sums \(a_{i}={\sum }_{j}n_{ij}\), column sums \(b_{j}={\sum }_{i}n_{ij}\), and n the total number of samples. Hence \(ARI(\mathbb {C},{\mathbb {T}})\) can be defined according to the contingency table as

$$ \begin{array}{@{}rcl@{}} ARI=\frac{{\sum}_{ij}\binom{n_{ij}}{2}-\left[{\sum}_{i}\binom{a_{i}}{2}{\sum}_{j}\binom{b_{j}}{2}\right]/\binom{n}{2}}{\frac{1}{2}\left[{\sum}_{i}\binom{a_{i}}{2}+{\sum}_{j}\binom{b_{j}}{2}\right]-\left[{\sum}_{i}\binom{a_{i}}{2}{\sum}_{j}\binom{b_{j}}{2}\right]/\binom{n}{2}} \end{array} $$
(19)

4.3.3 Normalized mutual information

With regard to NMI, define di as the number of samples in cluster i, cj as the number of samples in class j, and di,j as the number of samples in both cluster i and class j; with n denoting the total number of samples, NMI is calculated as follows

$$ \begin{array}{@{}rcl@{}} NMI=\frac{{\sum}_{i,j} d_{i,j}\log(\frac{n\cdot d_{i,j}}{d_{i}c_{j}})}{\sqrt{({\sum}_{i}d_{i}\log(\frac{d_{i}}{n}))({\sum}_{j}c_{j}\log(\frac{c_{j}}{n}))}} \end{array} $$
(20)

NMI is based on the mutual information (MI) metric: the numerator corresponds to the MI between the two partitions and the denominator to their information entropies. The value of NMI ranges from 0 to 1.

Due to the high sensitivity of K-means clustering to initialization, it is performed 10 times and the means and standard deviations are reported as the final evaluation results of our experiments. Higher ACC, ARI and NMI scores denote better K-means clustering performance.

4.4 Experimental visualization

The main target of unsupervised feature representation methods is to find a subspace representation that represents the features of the original data well, without the guidance of supervised information. In this subsection, we provide a concise visual result for each feature representation algorithm used in the experiments.

Taking the COIL-20 data set as an instance, some of the aforementioned feature representation algorithms are used to reduce its dimensionality to 2. K-means clustering is then executed on each learned low-dimensional feature, and the results are visualized in Fig. 6.

Fig. 6 A concise visualization of feature representation on the COIL-20 data set. Some of the aforementioned algorithms are used to reduce the dimensionality of the original input data to 2; K-means clustering is then executed on each learned low-dimensional feature representation, and the results are visualized in this figure

This figure offers a rough but intuitive comparison of the image clustering performance of the adversarial auto-encoder and the other feature representation methods. The goal of clustering is to bring points of the same category as close to each other as possible while separating points of different categories. The clustering result of the adversarial auto-encoder clearly outperforms the other compared algorithms: it brings points with the same label closer together while separating points of different categories relatively better. This favorable performance may be due to the adversarial network added during training. It takes the encoder as its generator and induces the encoder to generate samples that confuse the discriminative network into believing they are true samples, and this discriminative learning process motivates the encoder to generate better samples in every training iteration. Hence, the low-dimensional feature representation generated by the encoder still represents the original data well.

In addition, we provide a brief visualization of the discriminative feature learned by AAE in Fig. 7; the reconstructed images are produced from the discriminative feature obtained in the hidden layer. The figure shows that the reconstruction is not encouraging after 10 iterations, but once the number of iterations exceeds 100, a relatively ideal reconstruction is obtained, which demonstrates the quality of the discriminative feature learned by AAE and the fast convergence of the network.

Fig. 7 A brief visualization of the discriminative feature learned by AAE. We first reduce the dimensionality of the Fashion-MNIST data set to 200 to learn a discriminative feature representation, then reconstruct the images from it to make this visualization. To show the fast convergence of AAE, we provide an intuitive comparison of the reconstructed images after 10 and 100 iterations

4.5 Experimental result

In this subsection, comprehensive experiments are carried out. We use the K-means method to compare the image clustering performance of AAE with that of the other feature representation methods, including PCA, LPP, Isomap, NPE, AE, SAE, and LCAE. In particular, to demonstrate the high quality of the learned discriminative feature representation, we also compare with DEC and IDEC, two recently proposed deep clustering methods. Three popular evaluation metrics, ACC, ARI and NMI, are selected to assess their performance. Besides, image clustering is performed directly on the raw data of each data set to obtain the baseline of our experiments.

The K-means performance of all compared feature representation approaches on the eight image data sets is displayed in Tables 2, 3 and 4. The performance of each method can be appraised by comparing the scores of the three metrics between the learned low-dimensional feature and the baseline. From the three tables, it is apparent that the feature representation methods are mostly superior to K-means clustering on the raw data for all tested data sets, and on most image data sets AAE surpasses the other compared feature representation methods in clustering performance. In Table 2, AAE ranks first in clustering accuracy on seven image data sets and second on two image data sets. Especially on the COIL-20 database, the discriminative feature learned by AAE reaches a favorable performance: the clustering accuracy of AAE on this data set is 76.3%, compared with the baseline of 57.8%, an improvement of 18.5 percentage points. The second-best method is LPP, with a clustering accuracy of 68.8%, which AAE exceeds by more than 7 percentage points.

Table 2 Clustering accuracy obtained by applying different feature representation approaches. The results are reported as mean% ± std%, and the best two performances on each data set are highlighted. The symbol "—" denotes that the scores and source code of the algorithm on the corresponding data set were not available
Table 3 Adjusted rand index obtained by applying different feature representation approaches. The results are reported as mean% ± std%, and the best two performances on each data set are highlighted. The symbol "—" denotes that the scores and source code of the algorithm on the corresponding data set were not available

Moreover, AAE also obtains encouraging ARI and NMI scores; in Tables 3 and 4 its scores are very competitive. Furthermore, compared with the semi-supervised LCAE method and the deep clustering methods DEC and IDEC, AAE also achieves better performance. Nevertheless, other methods perform well on some data sets: for instance, NPE ranks first on all three metrics for the ORL data set, LPP and Isomap show favorable performance on Yale-B, and DEC and IDEC are very competitive on USPS and Chars74K.

Table 4 Normalized mutual information obtained by applying different feature representation approaches. The results are reported as mean% ± std%, and the best two performances on each data set are highlighted. The symbol "—" denotes that the scores and source code of the algorithm on the corresponding data set were not available

In particular, to demonstrate the influence of different hidden layer dimensions on the clustering performance of AAE, the fluctuations of the three metric scores with the number of hidden layer units are shown in Figs. 8, 9 and 10. Preserving the original features after dimensionality reduction is an arduous task, and a low-dimensional feature that represents the original data well obtains higher scores in the experiments. Overall, the clustering results indicate that AAE obtains competitive clustering performance in unsupervised learning, and the comprehensive experiments demonstrate the effectiveness of the feature representation learned via AAE on the tested data sets.

Fig. 8 The influence of different numbers of hidden layer units on the clustering accuracy scores of AAE; the number of hidden layer units is set to \(\left \{ 20, 40, 60, {\cdots } ,200 \right \}\)

Fig. 9 The influence of different numbers of hidden layer units on the adjusted rand index scores of AAE; the number of hidden layer units is set to \(\left \{ 20, 40, 60, {\cdots } ,200 \right \}\)

Fig. 10 The influence of different numbers of hidden layer units on the normalized mutual information scores of AAE; the number of hidden layer units is set to \(\left \{ 20, 40, 60, {\cdots } ,200 \right \}\)

5 Conclusion and further discussion

In this paper, an unsupervised adversarial auto-encoder was applied to learn a discriminative feature representation for image data, and the learned features were evaluated by K-means clustering. Extensive experiments showed that the adversarial auto-encoder obtains competitive performance in image clustering, and that the discriminative features it learns represent the original features well after an appropriate number of iterations. Compared with other feature representation methods, the added discriminative process makes the adversarial auto-encoder more robust. However, when the number of categories increases, the clustering performance may hit a bottleneck without any guidance from label information. In future work, we will therefore focus on how to add weak label information during feature learning to break through this bottleneck. Besides, we will put more emphasis on improving the network structure of the adversarial auto-encoder, which may further enhance its performance.