1 Introduction

The rapid development of the Generative Adversarial Network (GAN) family has shed light on the task of natural image generation. Following the basic idea of GAN [7], the generator tries to produce images that look as real as possible to confuse the discriminator. Various GAN-based models [1, 2, 8, 24, 28] have been proposed to alleviate the instability problems of image generation from different aspects. They have made solid progress in synthesizing natural images on standard datasets with legible backgrounds/foregrounds [33], e.g. MNIST [20], CIFAR-10 [18], CUB-200 [36], and so on. In many real-world applications, generating images with good visual aesthetics is highly desirable. Most existing GAN models are limited in achieving this goal, as they do not consider image aesthetics in the learning process.

To address the above issue, in this paper we propose a novel adversarial network, namely AestheticGAN, to synthesize images with better visual aesthetics and plausible visual contents. Our consideration is two-fold. First, people prefer images with pleasant appearances, such as vivid color and appropriate composition. Therefore, the image generator is expected to be trained with aesthetics awareness. Second, apart from being visually appealing, the generated images should also have reasonable visual contents. For example, with our method, the image scene is quickly recognizable and the content details look real. The image generator is therefore also expected to be aware of image semantics. To this end, we design and add two types of loss functions to the DCGAN architecture [29]. The first one is the aesthetics loss, which uses a quantitative score to evaluate the visual aesthetics of an image. The second one is the semantic loss, which measures the high-level semantic similarity between generated and real images [13, 21].

The main contributions of this work are listed as follows:

  • We attempt to create visually appealing images based on adversarial learning. Two types of loss functions are designed and added to a state-of-the-art GAN architecture.

  • Extensive experiments are conducted on the AVA and CIFAR-10 datasets. Comparisons in terms of visual appearance, quantitative scores, and user studies all demonstrate the effectiveness of our method.

The remainder of this paper is organized as follows. We briefly review the related work in Sect. 2 and describe the proposed method in Sect. 3. In Sect. 4, we evaluate our method with qualitative and quantitative experiments. Section 5 finally concludes the paper.

2 Related Works

Since our research is closely related to the fields of GANs and image aesthetics, we briefly introduce their related research in this section.

GAN is a generative model inspired by the two-person zero-sum game in game theory. Based on the seminal work by Goodfellow et al. [7], many GAN-based variants [1, 2, 8, 24, 28] have been proposed, focusing on model structure extensions, in-depth theoretical analysis, efficient optimization techniques, and a wide range of applications. For example, to address the problem of vanishing training gradients, Arjovsky et al. [1] propose Wasserstein GAN (W-GAN), which is later improved by adding a gradient penalty [8]. To limit the modeling ability of the model, Qi [28] proposes Loss-Sensitive GAN (LS-GAN), which restricts the loss function obtained by minimizing the objective to the class of Lipschitz-continuous functions, and also provides a quantitative analysis of gradient vanishing. Further, Conditional GAN (CGAN) [24] feeds additional information y to both G and D, where y can be labels or other auxiliary information. InfoGAN [2] is another important extension of GAN, which captures the mutual information between the hidden variables of the input and specific semantics. Odena et al. [27] propose Auxiliary Classifier GAN (AC-GAN), which handles multi-class problems and whose discriminator outputs the corresponding label probabilities. Despite the rapid development of GANs, there are few works specifically designed for the task of aesthetic image generation.

Computational aesthetics has attracted attention in recent years [6, 14]. The purpose of research on computational aesthetics is to endow machines with the ability to perceive the attractiveness of an image qualitatively or quantitatively. Before the deep learning era, the extraction of aesthetics-aware features played a key role in this direction. Previous research efforts [4, 15, 23, 26, 35] have shown some success in extracting aesthetic features. For 3D objects, [10] proposes to employ multi-scale topic models to fit the relationship of features from multiple views of objects. However, most of these features are handcrafted and task-specific. With the continuous development of deep learning, extracting deep features of aesthetic images has become the preferred way to address these problems. Many CNN-based models such as [22, 25] have been proposed to improve the results, mainly targeting the task of image aesthetic evaluation [17, 22, 34]. Moreover, Hong et al. [12] propose a multi-view regularized topic model to discover Flickr users' aesthetic tendencies and then construct a graph to group users into different aesthetic circles. Based on it, a probabilistic model is used to enhance the aesthetic attractiveness of photos from the corresponding circles [11]. Although existing GAN models have achieved great success, they are still limited in producing "beautiful and real" images. Based on adversarial learning, Deng et al. [37] enhance image aesthetics in terms of scene composition and color distribution. This work differs from the theme of our research, as the enhancement model of [37] optimizes the parameters of cropping and re-coloring for an existing natural image, whereas our method directly synthesizes an image from nothing but a meaningless noise input.

3 Proposed Method

We formulate the problem of automatic aesthetic image generation as an adversarial learning model. We first introduce the overall architecture of our proposed framework shown in Fig. 1. Then we present the details of the newly-added loss functions.

Fig. 1. The overall architecture of the proposed system.

3.1 Overall Framework

Basically, a GAN is a pair of neural networks (G, D): the generator G and the discriminator D. G maps a vector z from a noise space \(N^{z}\) with a known distribution \(p_{z}\) into an image space \(N^{x}\). The goal of G is to make \(p_{g}\) (the distribution of the samples \(G\left( z\right) \)) deceive the network D, while the goal of D is to distinguish \(p_{g}\) (the distribution of generated images) from \(p_{data}\) (the distribution of real images). These two networks are iteratively optimized against each other in a minimax game (hence "adversarial") until convergence. In this context, the GAN model is typically formulated as the minimax optimization

$$\begin{aligned} \min \limits _{G}\max \limits _{D}V(D,G)=E_{x\sim {p_{data}}}[\log D(x)]+E_{z\sim {p_{z}(z)}}[\log (1-D(G(z)))] \end{aligned}$$
(1)
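For readers who prefer code, a minimal PyTorch-style sketch of one alternating optimization step of Eq. (1) is given below. The module and optimizer names are our own placeholders, and we use the common non-saturating form of the generator update; this is an illustration under these assumptions rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, x_real, z_dim=100):
    """One alternating minimax step of Eq. (1); a sketch, not the exact implementation."""
    b = x_real.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    z = torch.randn(b, z_dim, 1, 1)
    x_fake = G(z).detach()                       # freeze G while updating D
    loss_D = F.binary_cross_entropy(D(x_real), ones) + \
             F.binary_cross_entropy(D(x_fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: fool D (non-saturating variant of minimizing log(1 - D(G(z))))
    z = torch.randn(b, z_dim, 1, 1)
    loss_G = F.binary_cross_entropy(D(G(z)), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```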

Specifically, for the structure of G and D, we choose fully convolutional networks as in DCGAN [29]. As shown in Fig. 2, G consists of a series of fractionally-strided convolutions and D consists of a series of convolution layers.
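For completeness, a compact sketch of such a DCGAN-style pair for \(64\times 64\) outputs is shown below; the channel widths are illustrative assumptions and not necessarily the exact configuration of [29] or of our implementation.

```python
import torch.nn as nn

# Generator: fractionally-strided (transposed) convolutions mapping a 100-d noise vector to a 64x64x3 image
G = nn.Sequential(
    nn.ConvTranspose2d(100, 512, 4, 1, 0), nn.BatchNorm2d(512), nn.ReLU(True),  # 1x1 -> 4x4
    nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(True),  # 8x8
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),  # 16x16
    nn.ConvTranspose2d(128, 64, 4, 2, 1),  nn.BatchNorm2d(64),  nn.ReLU(True),  # 32x32
    nn.ConvTranspose2d(64, 3, 4, 2, 1),    nn.Tanh(),                           # 64x64
)

# Discriminator: strided convolutions mapping a 64x64x3 image to a real/fake probability
D = nn.Sequential(
    nn.Conv2d(3, 64, 4, 2, 1),    nn.LeakyReLU(0.2, True),                       # 32x32
    nn.Conv2d(64, 128, 4, 2, 1),  nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),  # 16x16
    nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, True),  # 8x8
    nn.Conv2d(256, 512, 4, 2, 1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2, True),  # 4x4
    nn.Conv2d(512, 1, 4, 1, 0),   nn.Sigmoid(), nn.Flatten(),                    # (B, 1) score
)
```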

Fig. 2. The network structures of G and D.

We can see that the above objective only seeks consistency between \(p_{data}\) and \(p_{g}\) in a broadly statistical sense. It has no explicit control over visual appeal or content realism. We therefore extend the total loss function with two additional losses:

$$\begin{aligned} L_{total} = \alpha _1L_{GAN} + \alpha _2L_{aesthetics} +\alpha _3L_{content} \end{aligned}$$
(2)

In the formulation, \(L_{GAN}\) is the original GAN loss, \(L_{aesthetics}\) is the aesthetics loss, and \(L_{content}\) is the content loss. \(\alpha _1,\alpha _2,\) and \(\alpha _3\) denote their weights. In the following, we introduce the details of the two added losses.
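A possible way to combine the three terms of Eq. (2) in a single generator update is sketched below. Here S is the pre-trained aesthetic scorer of [17] and psi is the U-net feature extractor of Sect. 3.2, both kept fixed during training; aesthetics_loss and content_loss are the sketches given after Eqs. (3) and (4), and all names and defaults are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def generator_update(G, D, S, psi, opt_G, x_real, z,
                     a1=1.0, a2=0.15, a3=0.1):
    """One generator step with the total loss of Eq. (2); weights follow Sect. 3.3."""
    x_fake = G(z)

    # L_GAN: adversarial term (non-saturating form), i.e. try to fool D
    d_out = D(x_fake)
    l_gan = F.binary_cross_entropy(d_out, torch.ones_like(d_out))

    # L_aesthetics and L_content are detailed in Eqs. (3) and (4) below
    l_aes = aesthetics_loss(S, x_fake)
    l_con = content_loss(psi, x_fake, x_real)

    loss = a1 * l_gan + a2 * l_aes + a3 * l_con
    opt_G.zero_grad()
    loss.backward()
    opt_G.step()
    return loss.item()
```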

3.2 Loss Function

Aesthetics-Aware Loss. In order to generate visually appealing images, we propose to apply the aesthetics scoring model of [17] to boost image aesthetics, i.e. maximizing the obtained score, or equivalently minimizing \((1-score)\). The key point is to learn a deep convolutional neural network that is able to accurately rank and rate visual aesthetics. In that network, the scoring ability is subtly encoded in the architecture in the following aspects. First, AlexNet [19] is fine-tuned with a regression loss that predicts continuous numerical values as aesthetic ratings. Second, a Siamese network [3] taking image pairs as inputs ensures that images with different aesthetic levels receive different ranks. The whole network is trained with a joint Euclidean and ranking loss. Moreover, attribute and content category classification layers are added to make the model aware of fine-grained visual attributes. As demonstrated in [17], the overall aesthetic evaluation model provides aesthetic scores that are well consistent with human ratings. Therefore, we use the obtained scores to form the aesthetics-aware loss:

$$\begin{aligned} L_{aesthetics}\,=\,\parallel (1-S(\tilde{x}))\parallel \end{aligned}$$
(3)

where \(S(\tilde{x})\) is the aesthetic score of the generated image \(\tilde{x}\).
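A direct implementation of Eq. (3) is tiny; the sketch below assumes the pre-trained scorer S of [17] is wrapped so that it returns scores rescaled to \([0, 1]\), which is our assumption rather than a detail of the original model.

```python
import torch

def aesthetics_loss(S, x_fake):
    """Eq. (3): penalize the gap between the predicted aesthetic score and 1.
    S is a fixed, pre-trained scorer assumed to output scores in [0, 1]."""
    return torch.norm(1.0 - S(x_fake))
```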

Content-Aware Loss. Our synthesized images are also expected to have meaningful visual semantics, so we design the content-aware loss. A content loss is considered in many image processing tasks [13, 21]; it is usually based on the activation maps produced by the ReLU layers of a pre-trained VGG network. Different from measuring pixel-wise distances between images, this loss emphasizes similar feature representations in terms of high-level content and perceptual quality. Since we aim to generate images with both good aesthetics and reasonable details, we need a network that is more suitable to our task. We therefore replace VGG with a more advanced U-net [30], as its structure preserves more image details by combining concept features ("what it is") with locality features ("where it is"). We denote by \(\psi _{i}()\) the feature map extracted after the \(i\)-th convolutional layer of the U-net. Our content loss is then defined as:

$$\begin{aligned} L_{content} = \frac{1}{C_{i}H_{i}W_{i}}\parallel \psi _i(\tilde{x})-\psi _i(x)\parallel \end{aligned}$$
(4)

where \(C_i\), \(H_i\), and \(W_i\) are the number, height, and width of the feature maps, x denotes real images, and \(\tilde{x}\) denotes generated ones.
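The corresponding sketch for Eq. (4) is given below; psi(x, i) is assumed to return the feature map after the \(i\)-th convolutional layer of the U-net, and the default layer index is an illustrative choice, not the setting used in our experiments.

```python
import torch

def content_loss(psi, x_fake, x_real, i=3):
    """Eq. (4): feature-space distance at the i-th convolutional layer of the U-net,
    normalized by the feature-map size C_i * H_i * W_i."""
    f_fake, f_real = psi(x_fake, i), psi(x_real, i)
    c, h, w = f_fake.shape[1:]
    return torch.norm(f_fake - f_real) / (c * h * w)
```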

3.3 Training Details

In training the proposed GAN model, input images are resized to \(96\times 96\) and then randomly cropped to \(64\times 64\), which reduces the potential over-fitting problem. Random horizontal flipping of the cropped images is also applied for data augmentation. We use the ADAM optimizer [16], with the learning rates \(lr_G\) and \(lr_D\) both set to 0.002 for the generator and discriminator networks, and \(\beta _1\) and \(\beta _2\) set to 0.5 and 0.999. We train the proposed model for 10000 epochs with a minibatch size of 256. In implementation, we found that reducing the learning rate during training helps to improve image quality; therefore, the learning rates are reduced by a factor of 2 every 1000 epochs. We empirically set \(\alpha _1=1,\alpha _2=0.15,\alpha _3=0.1\) in our experiments.
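In code, the data augmentation and optimizer settings above might be configured as follows (G and D are the networks from Sect. 3.1; the pixel normalization to \([-1, 1]\) is our assumption to match a tanh generator output, and the snippet is a sketch rather than the exact training script):

```python
import torch
from torchvision import transforms

# Augmentation of Sect. 3.3: resize to 96x96, random 64x64 crop, random horizontal flip
augment = transforms.Compose([
    transforms.Resize((96, 96)),
    transforms.RandomCrop(64),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # map pixels to [-1, 1] (assumed)
])

# ADAM optimizers with lr = 0.002, beta1 = 0.5, beta2 = 0.999
opt_G = torch.optim.Adam(G.parameters(), lr=0.002, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=0.002, betas=(0.5, 0.999))

# Halve both learning rates every 1000 epochs
sched_G = torch.optim.lr_scheduler.StepLR(opt_G, step_size=1000, gamma=0.5)
sched_D = torch.optim.lr_scheduler.StepLR(opt_D, step_size=1000, gamma=0.5)
# sched_G.step() and sched_D.step() are called once per epoch in the training loop
```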

4 Experiments

In this section, we evaluate the proposed AestheticGAN on public benchmark datasets. Apart from the direct visual comparison, we also use quantitative measures and user study to validate its effectiveness.

4.1 Datasets for Training

The Aesthetic Visual Analysis (AVA) dataset is by far the largest benchmark for image aesthetic assessment. Each of its 255,530 images is labeled with aesthetic scores ranging from 1 to 10. In this study we select a subset of 25,000 images based on the semantic tags provided in the AVA data. Moreover, to further illustrate the applicability of our method, we also compare our model and its competitors on the CIFAR-10 dataset.

4.2 Visual and Quantitative Comparison

We conduct a visual comparison between the results of DCGAN and our model in Figs. 3 and 4. First, from Fig. 3, both DCGAN and our model generate images with good appearances at first glance. However, the DCGAN results have less appropriate image composition and less realistic image contents. In contrast, the scene category and image contents of our results are easily recognizable. Second, similar trends can be observed in Fig. 4, although they are not as pronounced as in Fig. 3. Furthermore, comparing all the resulting images of Figs. 3 and 4, we can see that the model trained on AVA generally outperforms the one trained on CIFAR-10 in terms of visual aesthetics, which indicates the data-driven property of GAN-based models.

We adopt four different metrics for quantitative assessment. The first two are the inception score [31] and the Fréchet inception distance (FID) [9], which are commonly used to evaluate GAN-based image synthesis. Since our goal is to make the model aesthetics-aware, we also use two state-of-the-art evaluation models, namely NIMA [32] and ACQUINE [5]. NIMA estimates aesthetic quality in terms of photographing skill and visual appeal, while ACQUINE achieves more than \(80\%\) consistency with human ratings. Of note, larger inception score/NIMA/ACQUINE values and smaller FID values denote better quality. From Tables 1 and 2, we can see that our DCGAN+aesthetic+content achieves much better performance than the baseline DCGAN model. Additionally, from Table 2, the aesthetic performance on the AVA dataset is consistently better than that on the CIFAR-10 dataset, which echoes the above visual results.
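For reference, the FID of [9] compares the Gaussian statistics of Inception activations of real and generated images. A minimal NumPy/SciPy computation from pre-extracted activation matrices is sketched below (the Inception feature extraction itself is omitted, and the function name is ours):

```python
import numpy as np
from scipy import linalg

def fid(feat_real, feat_fake):
    """Frechet inception distance between two activation sets of shape (N, d):
    ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2))."""
    mu_r, mu_f = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_f = np.cov(feat_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):       # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * covmean))
```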

We also perform an ablation study. Apart from the baseline DCGAN and our DCGAN+aesthetic+content, we build an intermediate version, DCGAN+aesthetic. Figure 5(b) has better lightness, more vivid color, and better composition than Fig. 5(a). Furthermore, the contents in Fig. 5(c) are more realistic than those in Fig. 5(b). The results in Tables 1 and 2 are also consistent with these observations. This experiment empirically validates the two losses, respectively.

Fig. 3. Comparison of results on the AVA dataset between DCGAN (left) and our method (right)

Fig. 4. Comparison of results on the CIFAR-10 dataset between DCGAN (left) and our method (right)

Fig. 5. Result images for the three different loss functions (Color figure online)

Table 1. Inception scores and FIDs for different methods on the CIFAR-10 and AVA datasets
Table 2. The aesthetic scores of NIMA and ACQUINE for different methods on the CIFAR-10 and AVA datasets

4.3 User Study

We also conduct a user study. We built a rating system and distributed it to a total of 30 participants. All participants were shown three sets of 330 images, where each image set was generated by one of three different loss configurations. We asked all participants to rate the images on a scale of 1–5, where 1 means the lowest aesthetic quality and 5 the highest. To reduce random and systematic errors, the images generated by the different loss configurations are listed in random order. Also, some images are randomly repeated, and we ignore the scores of a participant who rated the repeated images inconsistently. The statistics are shown in Fig. 6, which again demonstrates the effectiveness of the added losses.

Fig. 6. User study on the AVA dataset

5 Conclusion

In this paper, we propose a novel AestheticGAN to synthesize more challenging and complex aesthetic images. We enrich the loss function by designing two types of losses to train G. The aesthetics-aware loss helps to enhance the aesthetic quality of the generated images, while the content-aware loss enforces them to be semantically meaningful. Various experimental results validate the effectiveness of our model. Of note, from Tables 1 and 2, we can see that the overall quality of GAN-generated images is still far from that of real-world natural images. We plan to narrow this gap by considering fine-grained aesthetic attributes in future research.