1 Introduction

In recent years, deep convolutional neural networks have made great progress in computer vision, especially in object recognition and semantic recognition. However, using computers to identify or evaluate the aesthetic quality of images is still far from practical. Image Aesthetic Quality Assessment (IAQA) remains a challenging task [1] for several reasons: large-scale aesthetic data sets are scarce, aesthetic features are difficult to learn and generalize, and human evaluation is subjective. The aesthetic quality evaluation of images is a hot topic in computer vision, computational aesthetics and computational photography.

Fig. 1. Aesthetic radar map and other assessment methods.

For training, we use the PCCD aesthetic data set proposed by Chang et al. [22], which provides seven kinds of aesthetic attributes for each image, and we use these attributes to compute multiple scores. As shown in Fig. 1, the Aesthetic Radar Map gives a more complete, multi-angle aesthetic evaluation. A single score or a classification label may tell us that a photo is very good even though it has shortcomings in focus and exposure. Such details are very important for people’s aesthetic understanding, and a general single-score regression or classification cannot express them.

This paper presents a new hierarchical multi-task dense network architecture. Compared with traditional learning methods, the network is supervised by both the global score and the attribute scores, and it finally outputs the total score of the image and the score of each attribute. In the feature extraction part of the convolutional neural network, we use the dense block structure [20] to learn the different aesthetic attributes, which alleviates the vanishing-gradient problem, strengthens the propagation and reuse of feature information, and reduces the number of parameters to a certain extent. In the later part of the network, we combine the features learned for the global score and the attribute scores through a fusion connection operation, so that the global score is used effectively and the attribute predictions are strengthened. Finally, a combined loss function further improves the performance of the network. In the experimental part, we compare our method with a simple regression model and a non-hierarchical multi-task method, and show that the proposed network performs better. The main contributions of this paper are as follows:

  • We put forward the concept of the Aesthetic Radar Map for the first time and use it to show the aesthetic attributes of an image comprehensively;

  • We use the dense block structure in the aesthetic task to regress the aesthetic scores;

  • For the first time, multi-task regression learning is applied to the aesthetic task, and a new feature fusion strategy is proposed to make the network selectively extract aesthetic features.

We expect that multi-attribute scoring of image aesthetic quality can be used for aesthetic image retrieval, photography guidance, automatic video cover generation and other applications. Image aesthetic quality assessment can also guide applications such as UAV photography and robot intelligence. Only by giving machines an eye for beauty can they serve human beings better.

2 Related Work

As mentioned in [2], early work on image aesthetic quality evaluation mainly focused on manually designing various aesthetic features and using pattern recognition algorithms to predict aesthetic quality. Another line of research tried to fit image aesthetic quality directly with hand-designed generic image features. Recently, deep image features learned from large-scale data have shown good performance [3,4,5,6,7,8,9,10,11,12,13,14,15], surpassing traditional hand-designed features. The training data for image aesthetic quality assessment usually come from online professional photography communities, such as photo.net and dpchallenge.com. People can rate photos on these sites (1–7 or 1–10). A higher score means a higher aesthetic quality of the image [17].

Although aesthetic quality can be assessed in a certain sense, it is still an inherently subjective visual task. The evaluation of image aesthetic quality is ambiguous [18], and there are different ways to express it.

In aesthetic classification, binary labels such as good image and bad image are usually used to represent image aesthetic quality. In aesthetic scoring, regression networks predict an aesthetic score for the image. These models use convolutional neural networks to output a binary classification result or a one-dimensional numerical evaluation of image aesthetic quality [16, 23, 24]. Before the rise of deep neural networks and the release of the large-scale aesthetic quality evaluation dataset AVA [19], Wu et al. [17] trained on small data sets and proposed a method based on support vector machines (SVM) to predict the distribution of aesthetic quality scores. Jin et al. [14] proposed an aesthetic histogram to better represent aesthetic quality, and Chang et al. [22] introduced aesthetic image captioning.

Regarding aesthetic data sets, Murray et al. [19] put forward AVA, the largest data set in the aesthetics field, and found that a Gaussian distribution fits the score distributions of most AVA samples, while the remaining samples are better fit by a Gamma distribution [19]. Then, in view of the imbalance of the AVA samples, Kong et al. [12] proposed the AADB data set, which is more balanced and closer to a normal distribution. Chang et al. [22] proposed the PCCD data set, which is a relatively comprehensive small-scale data set.

3 Hierarchical Multi-task Network

3.1 Aesthetics Radar

For aesthetic image evaluation, a single score is often incomplete. Evaluating a picture with several aesthetic indicators gives a more comprehensive, richer and usually more detailed assessment.

The data set we use is PCCD. In addition to a basic overall score, it considers the influence of six attributes on the evaluation of a picture: Subject of Photo; Composition & Perspective; Use of Camera, Exposure & Speed; Depth of Field; Color & Lighting; and Focus. The resulting scores are plotted in the form of a radar chart.

In this way the evaluation of a picture is extended from a low-dimensional to a high-dimensional description, and attributes with clear characteristics can also be well represented by radar charts (Fig. 2).
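The mapping from attribute scores to a radar chart can be sketched as follows. This is only an illustration: the function name `radar_vertices`, the score range (`max_score=10.0`) and the example scores are assumptions made here, not values taken from the PCCD data set.

```python
import math

# Attribute names follow the six PCCD attributes listed in the text.
ATTRIBUTES = [
    "Subject of Photo",
    "Composition & Perspective",
    "Use of Camera, Exposure & Speed",
    "Depth of Field",
    "Color & Lighting",
    "Focus",
]


def radar_vertices(scores, max_score=10.0):
    """Map one score per attribute to (x, y) vertices of a radar polygon.

    Axes are spread evenly around the circle, the first axis points up,
    and each radius is the score normalised by max_score (an assumed range).
    """
    n = len(scores)
    vertices = []
    for i, score in enumerate(scores):
        angle = math.pi / 2 + 2.0 * math.pi * i / n
        radius = score / max_score
        vertices.append((radius * math.cos(angle), radius * math.sin(angle)))
    return vertices


# Hypothetical scores for one photo: strong color, weak focus.
example_scores = [7.0, 8.0, 6.0, 6.5, 9.0, 4.0]
for name, (x, y) in zip(ATTRIBUTES, radar_vertices(example_scores)):
    print(f"{name:32s} x={x:+.3f} y={y:+.3f}")
```

Connecting the vertices in order gives the polygon drawn on the radar map; a weak attribute shows up as a dent toward the centre, which a single overall score would hide.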

Fig. 2. Samples in the Photo Critique Captioning Dataset (PCCD)

The PCCD (Photo Critique Captioning Dataset) data set, provided by Chang et al. [22], was built to validate approaches to the aesthetic image evaluation problem they proposed. The dataset is based on a professional photo review website (footnote 1) and provides experienced photographers’ comments on the photos. On the website, photos are displayed together with professional reviews in the following seven aspects: general impression; composition and perspective; color and lighting; subject of photo; depth of field; focus; and use of camera, exposure and speed.

3.2 Dense Module

The dense module was proposed at CVPR 2017 [20]. It builds on the idea of ResNet [21], but its network structure is completely new. The dense module can effectively reduce the number of parameters in a neural network while achieving better results. In each dense module, the input of each layer comes from the outputs of all previous layers. At the same time, each layer has a short path to the input data and to the loss, which alleviates over-fitting and the vanishing-gradient problem when the network is very deep (Fig. 3).

Fig. 3. Dense module

In ResNet, the relationship between two adjacent layers can be expressed by the following formula:

$$\begin{aligned} X_l = H_l(X_{l-1})+X_{l-1} \end{aligned}$$
(1)

where l denotes the layer, \(X_l\) denotes the output of layer l, and \(H_l\) denotes a nonlinear transform. So for ResNet, the output of layer l is the output of layer \(l-1\) plus the nonlinear transformation of the output of layer \(l-1\).

By changing the way information is transmitted between layers, the dense module introduces a new connection pattern: each layer is directly connected to all of its subsequent layers. Its mathematical expression is as follows:

$$\begin{aligned} X_l = H_l([X_0,X_1,\ldots ,X_{l-1}]) \end{aligned}$$
(2)

where \([X_0,X_1,\ldots ,X_{l-1}]\) refers to the concatenation of the feature-maps produced in layers 0,.., \(l-1\) (Fig. 4).

Fig. 4. The structure of the feature extraction network

Here \(H_l\) is defined as a composite function of three consecutive operations: batch normalization (BN), a rectified linear unit (ReLU) and a convolution (Conv). Because of its dense connectivity, this network architecture is called a dense convolutional network (DenseNet) [20].
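The difference between Eq. (1) and Eq. (2) can be illustrated with a small toy sketch. Feature maps are represented here as plain Python lists, and `toy_h` is a hypothetical stand-in for \(H_l\); it is not a real BN-ReLU-Conv layer and is used only to show how the inputs of each layer are formed.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def toy_h(inputs, k):
    """Hypothetical stand-in for H_l: maps an input of any width to k values."""
    total = sum(inputs)
    return relu([total / (j + 1) for j in range(k)])


def residual_forward(x0, num_layers):
    """Eq. (1): X_l = H_l(X_{l-1}) + X_{l-1}. The width never changes."""
    x = list(x0)
    for _ in range(num_layers):
        h = toy_h(x, len(x))
        x = [a + b for a, b in zip(h, x)]
    return x


def dense_forward(x0, num_layers, k):
    """Eq. (2): X_l = H_l([X_0, ..., X_{l-1}]). Inputs are concatenated."""
    outputs = [list(x0)]
    input_widths = []
    for _ in range(num_layers):
        concatenated = [v for out in outputs for v in out]
        input_widths.append(len(concatenated))
        outputs.append(toy_h(concatenated, k))
    return outputs, input_widths


x0 = [1.0, 2.0, 3.0, 4.0]
_, widths = dense_forward(x0, num_layers=3, k=2)
print(widths)  # input width of each dense layer: [4, 6, 8]
print(len(residual_forward(x0, num_layers=3)))  # residual width stays 4
```

With k new feature maps per layer, the input width of layer l grows linearly as the width of \(X_0\) plus \(k(l-1)\), whereas the residual form adds the previous output element-wise and keeps the width fixed.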

The dense module produces k output feature maps for each layer, but each layer receives many more inputs. In practice, a \(1\times 1\) convolution is added as a bottleneck before each \(3\times 3\) convolution to reduce the number of input feature maps and thereby improve computational efficiency. This design is particularly effective for the dense module, and we use it as the bottleneck layer in our network.
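A rough weight count shows why the bottleneck helps once many feature maps have accumulated. The sketch below ignores biases and batch-normalization parameters, and it assumes that the \(1\times 1\) convolution outputs 4k feature maps, as in the bottleneck variant described in the DenseNet paper [20]; the width used in this paper's network is not stated, so `width_factor=4` is an assumption.

```python
def conv_weights(c_in, c_out, kernel):
    """Number of weights in a kernel x kernel convolution (no bias)."""
    return c_in * c_out * kernel * kernel


def dense_layer_weights(c_in, k, bottleneck=True, width_factor=4):
    """Weights of one dense layer that outputs k feature maps.

    With bottleneck=True, a 1x1 convolution first reduces c_in inputs to
    width_factor * k maps (assumed width) before the 3x3 convolution.
    """
    if not bottleneck:
        return conv_weights(c_in, k, 3)
    mid = width_factor * k
    return conv_weights(c_in, mid, 1) + conv_weights(mid, k, 3)


k = 32
for c_in in (64, 256, 512, 1024):
    plain = dense_layer_weights(c_in, k, bottleneck=False)
    reduced = dense_layer_weights(c_in, k, bottleneck=True)
    print(f"c_in={c_in:4d}  3x3 only: {plain:7d}  with 1x1 bottleneck: {reduced:7d}")
```

Under these assumptions the bottleneck only pays off when the layer input is wide (more than 7.2k input maps); deep inside a dense module, where the concatenated input is large, the saving becomes substantial.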

3.3 Hierarchical Multi-task

Multi-task learning (MTL) is widely used in machine learning and deep learning. Because it produces several outputs from shared parameters, MTL can evaluate picture aesthetics from multiple angles. The evaluation results under different angles are relatively independent, but they are obtained from the same training process. The hierarchical MTL structure used in the experiment is shown in Fig. 5.

Fig. 5. The multi-task part of HMDnet (hierarchical multi-task dense network)

The output of the dense module at the last fully connected layer is divided into seven branches: the general impression and six aesthetic attributes. Each of the six attribute branches is passed through its own fully connected layers, and the same operation is applied to produce the general impression. The mean squared error (MSE) of the final outputs is computed and back-propagated through the network as the training loss.
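One way to combine the per-branch errors into a single training loss is sketched below. The text states only that MSE is used and that the losses are combined; the equal-weight sum and the `attr_weight` parameter are assumptions made for illustration, and the scores are toy values.

```python
def mse(predicted, target):
    """Mean squared error between two equally long lists of scores."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def combined_loss(pred_general, true_general, pred_attrs, true_attrs, attr_weight=1.0):
    """General-impression MSE plus the weighted sum of the attribute MSEs.

    pred_attrs and true_attrs map an attribute name to a list of scores,
    one per image in the batch. attr_weight is a hypothetical knob.
    """
    general_term = mse(pred_general, true_general)
    attribute_term = sum(mse(pred_attrs[name], true_attrs[name]) for name in true_attrs)
    return general_term + attr_weight * attribute_term


# Toy batch of two images and two of the six attributes.
loss = combined_loss(
    pred_general=[6.0, 8.0],
    true_general=[5.0, 8.0],
    pred_attrs={"CP": [7.0, 6.0], "CL": [5.0, 9.0]},
    true_attrs={"CP": [7.0, 8.0], "CL": [6.0, 9.0]},
)
print(loss)  # 0.5 + (2.0 + 0.5) = 3.0
```

Because every branch contributes to one scalar, the gradient that reaches the shared dense features reflects both the overall score and each attribute.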

Hierarchical multi-task learning is a joint learning method. It learns multiple attributes of a picture, solves multiple problems at the same time, and performs regression prediction for each of them. A typical multi-task example comes from business personalization, where several hobbies of a person are analysed together to obtain a more comprehensive evaluation.

The hierarchical multi-task method has two advantages over traditional statistical methods:

  • The radar map can display multi-angle, multi-level image information. In this experiment, pictures have attributes at different levels, which multi-task learning can represent vividly;

  • Multi-task evaluation of pictures is more specific and detailed. It shows the strengths and weaknesses of a picture in every aspect.

4 Experiment

4.1 Implementation Details

We take a DenseNet model pre-trained on ImageNet [2], fix the parameters of the layers before the first fully connected layer, and fine-tune all fully connected layers on the training set of the PCCD dataset. We use the Keras framework (footnote 2) to train and test our models. The learning rate policy is set to step. Stochastic gradient descent is used to train our model with a mini-batch size of 16 images, a momentum of 0.9, a learning rate of 0.001 and a weight decay of 1e−6. The maximum number of iterations is 160. Training takes about 40 min on a Titan X Pascal GPU.
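For readers unfamiliar with these hyperparameters, the following sketch shows a textbook form of the update they describe: SGD with momentum, L2 weight decay and a step learning-rate schedule. It is written in plain Python and is not the Keras implementation used in the experiments; the schedule parameters `step_size` and `gamma` are not given in the text and are hypothetical.

```python
def step_lr(base_lr, iteration, step_size, gamma=0.1):
    """Step policy: multiply the learning rate by gamma every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)


def sgd_momentum_step(weights, grads, velocity, lr, momentum=0.9, weight_decay=1e-6):
    """One SGD update with momentum and L2 weight decay (textbook form)."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w      # weight decay adds a pull toward zero
        v = momentum * v - lr * g     # velocity accumulates past gradients
        new_velocity.append(v)
        new_weights.append(w + v)
    return new_weights, new_velocity


# Values from the text: lr 0.001, momentum 0.9, weight decay 1e-6, 160 iterations.
weights, velocity = [1.0, -2.0], [0.0, 0.0]
for iteration in range(160):
    lr = step_lr(0.001, iteration, step_size=80)  # step_size is an assumption
    grads = [2.0 * w for w in weights]            # toy gradient of sum(w^2)
    weights, velocity = sgd_momentum_step(weights, grads, velocity, lr)
print(weights)
```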

4.2 Predict Result

The output of our model is reduced in dimension by the fully connected layers, and regression against the known scores yields the predicted values of the six aesthetic attributes of a picture and an estimate of its total score. The test set contains 500 pictures.

The predictions fit the ground truth of the test set well. Among the attributes, Color & Lighting and Composition & Perspective are predicted best, while the other four attributes show larger deviations. The overall score is predicted accurately. Some example predictions are shown in Fig. 6.

Fig. 6. Predicted results of test data set photos and ground truth.

4.3 Compare with Other Methods

To verify the effectiveness of our method, we compared our algorithm (HMDNet) with two other methods. The regression method uses DenseNet to regress the total score directly, without the multi-attribute and multi-layer fully connected structure. The multi-task method combines the attribute scores but does not use the total score. On the same data set, the predictions of our model fit the real data better. This comparison shows that our method has an advantage in multi-attribute aesthetic assessment of pictures.

Table 1. MSE of the predictions of HMDNet and other methods.

In Table 1, GI denotes General Impression, an overall evaluation of a picture; SP denotes Subject of Photo; CP denotes Composition & Perspective; UES denotes Use of Camera, Exposure & Speed; DF denotes Depth of Field; CL denotes Color & Lighting; and FO denotes Focus. Our method achieves the best performance on the overall score and on all attribute scores.
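The per-column MSE reported in such a comparison can be computed as in the sketch below. The column order follows the abbreviations above; the function name and the two rows of scores are invented for illustration and are not results from the paper.

```python
COLUMNS = ["GI", "SP", "CP", "UES", "DF", "CL", "FO"]


def per_attribute_mse(predictions, ground_truth):
    """Return {column: MSE} for rows of seven scores given in COLUMNS order."""
    n = len(ground_truth)
    result = {}
    for j, name in enumerate(COLUMNS):
        squared_errors = ((p[j] - t[j]) ** 2 for p, t in zip(predictions, ground_truth))
        result[name] = sum(squared_errors) / n
    return result


# Two made-up test images, each with a general score and six attribute scores.
predicted = [[6, 7, 5, 6, 7, 8, 6], [8, 6, 7, 5, 6, 7, 9]]
truth = [[5, 7, 6, 6, 7, 8, 6], [8, 6, 7, 5, 6, 9, 9]]
for name, value in per_attribute_mse(predicted, truth).items():
    print(f"{name:3s} {value:.2f}")
```

Reporting the error per column, rather than one pooled number, makes it visible which attributes a method predicts well and which it does not.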

5 Conclusions

This paper puts forward a new hierarchical multi-task convolutional neural network architecture. We present the Aesthetic Radar Map as a new aesthetic task and goal, and predict it with a multi-task regression network. Compared with a traditional regression network, our method makes full use of the global aesthetic rating so that the overall score and the attribute ratings interact with each other, which leads to accurate prediction of the multi-attribute tasks. Experiments show that this method brings the predictions closer to the ground-truth labels. As an interdisciplinary subject of computer vision, photography and iconography, aesthetic evaluation still has many interesting discoveries and unexplored areas awaiting in-depth study.