1 Introduction

Chinese painting, which is executed on special rice paper or silk with brushes, ink, and Chinese pigments, has a long history and a distinct national form and style in world art. The inheritance and innovation of Chinese painting have therefore become essential parts of cultural heritage and innovation. Recently, artificial intelligence (AI) applications that transfer scenery photos into Chinese paintings have become popular among art lovers and in cultural relic protection. AI techniques that extract the typical features of an artistic style and imitate paintings can help train painting skills, support the creation of new works, and aid the inheritance and protection of traditional art.

The translation of scenery photos into Chinese landscape paintings is essentially an image style transfer task. Traditional computational painting simulation methods can be divided into physical modeling, artistic effect decomposition, and non-photorealistic rendering.

  1. The physical modeling methods [1] are based on the mechanisms of landscape painting: they model brushes, paper, and ink, and achieve simulation by reproducing their inherent physical characteristics and dynamic interaction behavior. These methods are not ideal due to their intrinsic complexity [2].

  2. In the artistic effect decomposition methods [3, 4], the real, continuous painting process is described as a sequence of logical, virtual discrete steps. These methods can simulate various landscape painting effects, but it is difficult to achieve blending and gradual transitions between multiple effects, so the overall results are often unsatisfactory.

  3. The non-photorealistic rendering methods [5] are based on image analogy and mainly pursue creativity to generate non-real image objects, such as synthetic rock textures. However, because the artistic effects of landscape paintings are ever-changing and each effect requires a different rendering algorithm, they are difficult to simulate.

The generative adversarial network (GAN), proposed by Goodfellow et al. [6], is an effective way to realize image style transfer and has gradually become the mainstream method. The basic structure of a GAN consists of two networks: a generator and a discriminator. These two networks can be convolutional neural networks, recurrent neural networks, or autoencoders. Training a GAN is an unsupervised learning process. The generator converts an input noise vector into an image, and the generated image is then passed to the discriminator for classification. The goal of the generator is to produce ever more realistic images, while the goal of the discriminator is to judge ever more accurately whether an image given to it is fake or real. This process is carried out over many iterations.
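As a concrete illustration of this adversarial training process, the following is a minimal PyTorch sketch; the generator `G`, discriminator `D`, and their architectures are placeholders for illustration, not the networks used later in this paper.

```python
import torch
import torch.nn as nn

# Placeholder networks; any generator/discriminator architecture can be used.
G = nn.Sequential(nn.Linear(100, 784), nn.Tanh())    # noise vector -> fake image
D = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())   # image -> probability of being real

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_images):          # real_images: (batch, 784)
    batch = real_images.size(0)
    noise = torch.randn(batch, 100)

    # 1) Update the discriminator: real images -> label 1, generated images -> label 0.
    fake_images = G(noise).detach()
    loss_d = bce(D(real_images), torch.ones(batch, 1)) + \
             bce(D(fake_images), torch.zeros(batch, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Update the generator: try to make the discriminator classify fakes as real.
    loss_g = bce(D(G(noise)), torch.ones(batch, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```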

GANs are powerful and have many practical applications, such as generating high-quality images [7], generating images from text [8], and converting images from one style to another [9]. Many widely used GAN-based architectures have emerged. Radford et al. proposed DCGAN [10], the first application of convolutional neural networks in GANs and an important research milestone; a large number of GAN architectures have since been built on DCGAN. Brock et al. proposed BigGAN [11], whose most important improvement is the orthogonal regularization of the generator to generate high-fidelity images. Karras et al. proposed StyleGAN [12], whose core is style transfer or style mixing, allowing it to generate images of different styles. Zhang et al. proposed StackGAN [13] to generate realistic images from text descriptions. Antipov et al. proposed Age-cGAN [14] for facial aging. Choi et al. proposed StarGAN [15], which realizes image conversion among multiple styles with an unsupervised learning method.

Kim et al. proposed U-GAT-IT [16], which incorporates a new attention module and a new learnable normalization function in an end-to-end manner. Huang et al. proposed MUNIT [17], a multimodal unsupervised image-to-image translation framework: to translate an image into another domain, its content code is recombined with a random style code drawn from the style space of the target domain. Chen et al. proposed NICE-GAN [18], which assigns a novel role to the discriminator by reusing it to encode the images of the target domain.

Although these methods can be applied to the translation between scenery photos and Chinese paintings, some unique problems remain. Due to the particularity of the painting materials and the extensive use of various painting techniques, strokes, and ink, the creative process of landscape painting is complicated. The main contents of Chinese landscape paintings are mountains, rocks, and trees, which painters often express through the "Cun method" [19]. This has made the "Cun method" a vital expressive language in landscape painting and an art form for the painter's aesthetic experience and expression. Chinese paintings using the "Cun method" are quite different from other kinds of images and have a unique charm and artistic conception. Figures 1 and 2 show two typical "Cun methods". These "Cun methods" are mainly rendered with lines that "imitate" the rough outlines of objects and scenes. Therefore, when automatically generating landscape paintings from scenery photos, strengthening the contours of objects to imitate the "Cun method" effect is a challenging topic worthy of in-depth study.

Fig. 1
figure 1

Hemp-fiber strokes (Pima Cun)

Fig. 2
figure 2

Axe-cut strokes (Fupi Cun)

This paper develops an unsupervised, Contour-Enhanced (CE) image style transfer framework based on CycleGAN, called CE-CycleGAN. In line with the characteristics of Chinese landscape painting, it pays attention to lines, sets constraints, and highlights edge features to realize a style transfer from scenery photos to landscape paintings with more Chinese painting characteristics.

The main innovations of this paper are summarized as follows:

  1. We propose an image style transfer framework from scenery photos to Chinese landscape paintings. The framework adds a contour-enhancing translation branch built on an edge detection operator to strengthen edge information. The edge operator better highlights the contours of objects such as rocks and trees and imitates the specific texture effects of the "Cun method".

  2. We propose the Inception-based Structural Similarity (ISSIM) metrics to describe the similarity between two image sets. The Fréchet Inception Distance (FID) and Kernel Inception Distance (KID) fold similarity and diversity into a single score: their values are low only when the generated image set is both similar to the original set and sufficiently diverse. The ISSIM metrics are defined directly from similarity and can provide richer information than FID for judging the quality of a generated image set. Unlike FID and KID, the ISSIM set reports similarity and diversity separately through multiple metrics, and a single ISSIM value reflects similarity very intuitively. The ISSIM metrics are therefore better suited to tasks where similarity matters more than diversity. As shown in the later analysis, the ISSIM metrics can also indicate whether the model output collapses toward a certain direction, provided a data set exists in that direction.

  3. To evaluate the artistry of the generated landscape paintings, we introduce an aesthetic evaluation metric, NIMA [20], which assesses the method from an artistic point of view and effectively measures the improvement in aesthetic quality brought by the introduction of the edge operator.

The rest of this paper is structured as follows. Section 2 introduces the CycleGAN model and related work, Sect. 3 details the proposed CE-CycleGAN method, Sect. 4 presents the proposed ISSIM metrics, Sect. 5 gives the experimental results and discussion, and Sect. 6 concludes the paper and outlines future work.

2 The CycleGAN model

Two models, Pix2pix [21] and CycleGAN [9], deserve attention for image style transfer from scenery photos to Chinese landscape paintings. In 2016, the Pix2pix model proposed by Isola et al. [21] provided a general framework for transforming images from one style to another: mapping labels to photos, mapping edges to objects, converting night images to day images, colorizing black-and-white images, converting sketches to images, and so on. However, the model requires paired data, and training pairs of photos and landscape paintings are usually difficult to obtain for photo-to-painting conversion. In 2017, the CycleGAN model proposed by Zhu et al. [9] showed good performance in geometry, color, and style transfer and achieved some exciting image conversions, such as converting photos into paintings. Its outstanding advantage is that it can be trained on unpaired data. Based on CycleGAN, Zhang et al. [22] proposed a CycleGAN-AdaIN framework to convert real photos into Chinese ink paintings. In that model, one cycle-consistency loss replaces the two cycle-consistency losses, and a multi-scale structural similarity loss is added to the reconstruction loss to generate more detailed images.

As shown in Fig. 3, CycleGAN is essentially two mirror-symmetric GANs that form a ring network. The two GANs share two generators, and each has its own discriminator, i.e., there are two discriminators and two generators. One one-way GAN has two losses, so the two one-way GANs have a total of four losses. In short, the model takes an input image from domain A. This image is passed to the first generator, GAB, which transforms it from domain A to the target domain B. The newly generated image is then passed to the other generator, GBA, whose task is to convert it back into an image recA in the original domain A; here a comparison can be made with the autoencoder. This output image must be similar to the original input image, which defines a meaningful mapping that does not otherwise exist in the unpaired dataset (Table 1).

Fig. 3
figure 3

Diagram of the CycleGAN network structure

Table 1 A list of terms for all equations and their corresponding meanings

3 The proposed CE-CycleGAN method

CycleGAN performs unpaired image-to-image translation and is therefore suitable for style transfer from scenery photos to landscape paintings. Landscape paintings emphasize freehand brushwork rather than fine detail, but some contrasting edge information is still essential to the whole work. The human eye is generally more sensitive to strong edges than to weak ones. According to this visual characteristic, the edge information in the image should be enhanced selectively, i.e., strong edges with large contrast should be retained rather than weak edges with small contrast. Thus, by adding a contour-enhancing translation branch to the CycleGAN network, we designed a gradient guidance method that effectively guides the style transfer from photos to paintings with gradient information.

Furthermore, to enhance the landscape painting style, an edge detection operator is used to extract the strong edges of the grayscale image. In this paper, we use the simple Sobel edge detection operator [23] to test whether introducing an edge detection operator is useful. The Sobel operator is a typical first-derivative edge detection operator. Since it incorporates a local averaging operation, it smooths noise and reduces its influence, and it detects coarse edges well. These properties match the requirements of our method for simulating landscape paintings.
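For reference, a gradient map of the kind used by the contour-enhancing branch can be obtained with the Sobel operator as follows. This is a minimal PyTorch sketch of the standard operator, not the exact preprocessing pipeline of our implementation; the threshold `tau` for keeping only strong edges is an illustrative parameter.

```python
import torch
import torch.nn.functional as F

def sobel_gradient(gray, tau=None):
    """gray: (N, 1, H, W) grayscale image in [0, 1]; returns the Sobel gradient magnitude.
    If tau is given, only strong edges above the threshold are kept."""
    kx = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3).contiguous()        # vertical kernel is the transpose
    gx = F.conv2d(gray, kx, padding=1)          # horizontal derivative
    gy = F.conv2d(gray, ky, padding=1)          # vertical derivative
    mag = torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)
    return mag if tau is None else (mag > tau).float() * mag
```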

The whole network with a contour-enhancing translation branch is called CE-CycleGAN, and its structure is shown in Fig. 4. We define scenery photos as domain A and landscape paintings as domain B. Using cyclic consistency, the network is designed with a forward translation and a reverse translation. The forward translation first transfers scenery photos into landscape paintings, namely \(G_{AB} :A \to B\), and then translates the landscape paintings back to the original scenery photos, namely \(G_{BA} :B \to A\). The reverse translation transfers landscape paintings into scenery photos through \(G_{BA}\) and then back to the original landscape paintings through \(G_{AB}\). Both the forward and reverse translations contain two branches: the painted translation branch and the contour-enhancing translation branch.

Fig. 4
figure 4

Diagram of the proposed CE-CycleGAN network structure

For the forward translation, the mapping \(G_{AB}\) is performed first. The painted translation branch takes the real scenery photo \({\text{real}}_{A}\) as the input of the generator \(G_{AB}\) and feeds the resulting image features into the fusion module. The contour-enhancing translation branch takes the real image gradient map \({\text{real}}_{{A\_{\text{edge}}}}\) as the input of the generator \(G_{{AB\_{\text{edge}}}}\); the generated edge-enhanced gradient map \({\text{fake}}_{{B\_{\text{edge}}}}\) and the real edge-enhanced gradient map \({\text{real}}_{{B\_{\text{edge}}}}\) used in the reverse translation are then fed to the discriminator \(D_{{B\_{\text{edge}}}}\) for authenticity judgment. The \({\text{fake}}_{{B\_{\text{edge}}}}\) is also spliced with the output features of the painted translation branch and sent into the fusion module, which uses this gradient information as guidance to generate the final edge-enhanced map \({\text{fake}}_{B}\). The generated \({\text{fake}}_{B}\) is input to the discriminator \(D_{B}\) together with the real edge-enhanced map \({\text{real}}_{B}\) from the reverse translation for authenticity discrimination. The process is described by formulas (1) and (2) as:

$${\text{fake}}_{{B\_{\text{edge}}}} = G_{{AB\_{\text{edge}}}} \left( {{\text{real}}_{{A\_{\text{edge}}}} } \right)$$
(1)
$${\text{fake}}_{B} = {\text{Fuse}}\left( {G_{AB} \left( {{\text{real}}_{A} } \right), {\text{fake}}_{{B\_{\text{edge}}}} } \right)$$
(2)
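The forward pass through the two branches can be summarized by the following sketch. The module names (`G_AB`, `G_AB_edge`, `fuse`) and the exact form of the fusion module are assumptions for illustration, since the text specifies only that the branch outputs are spliced (concatenated) and fused.

```python
import torch

def forward_translation(real_A, real_A_edge, G_AB, G_AB_edge, fuse):
    # Contour-enhancing branch: gradient map -> edge-enhanced gradient map, Eq. (1).
    fake_B_edge = G_AB_edge(real_A_edge)
    # Painted branch: photo -> intermediate image features.
    feat_B = G_AB(real_A)
    # Fusion: concatenate the features with the gradient guidance and fuse, Eq. (2).
    fake_B = fuse(torch.cat([feat_B, fake_B_edge], dim=1))
    return fake_B, fake_B_edge
```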

The corresponding generative adversarial losses come from the contour-enhancing translation branch and the painted translation branch and are expressed as (3) and (4):

$$L_{{{\text{GAN}}\_B\_{\text{edge}}}} \left( {G_{{AB\_{\text{edge}}}} ,D_{{B\_{\text{edge}}}} } \right) = E_{{{\text{real}}_{{B\_{\text{edge}}}} }} \left[ {\log D_{{B\_{\text{edge}}}} \left( {{\text{real}}_{{B\_{\text{edge}}}} } \right)} \right] + E_{{{\text{fake}}_{{B\_{\text{edge}}}} }} \left[ {\log \left( {1 - D_{{B\_{\text{edge}}}} \left( {{\text{fake}}_{{B\_{\text{edge}}}} } \right)} \right)} \right]$$
(3)
$$L_{{{\text{GAN}}\_B}} \left( {G_{AB} , D_{B} } \right) = E_{{{\text{real}}_{B} }} \left[ {\log D_{B} \left( {{\text{real}}_{B} } \right)} \right] + E_{{{\text{fake}}_{B} }} \left[ {\log \left( {1 - D_{B} \left( {{\text{fake}}_{B} } \right)} \right)} \right]$$
(4)
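A sketch of losses (3) and (4) in the log form given above follows; in practice, CycleGAN-style implementations often substitute a least-squares adversarial loss, and that substitution is not shown here.

```python
import torch

def gan_loss(D, real, fake, eps=1e-8):
    # E[log D(real)] + E[log(1 - D(fake))], maximized by D and minimized by G.
    return (torch.log(D(real) + eps).mean()
            + torch.log(1.0 - D(fake) + eps).mean())
```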

So far we have realized the translation from scenery photos to landscape paintings. To achieve cyclic consistency, we then need to transfer the landscape paintings back to the original scenery photos, i.e., the mapping \(G_{BA}\). The generated edge-enhanced map \({\text{fake}}_{B}\) is input into the generator \(G_{BA}\) of the painted translation branch, and the resulting set of restored image features is fed to the fusion module for restoration. The generated edge-enhanced gradient map \({\text{fake}}_{{B\_{\text{edge}}}}\) is input into the generator \(G_{{BA\_{\text{edge}}}}\) of the contour-enhancing translation branch to generate the restored image gradient map \({\text{rec}}_{{A\_{\text{edge}}}}\). Finally, \({\text{rec}}_{{A\_{\text{edge}}}}\) is spliced with the restored image features of the painted translation branch and sent to the fusion module, where the gradient information guides the restoration of the original scenery photo \({\text{rec}}_{A}\). The process can be described as formula (5) and formula (6):

$${\text{rec}}_{{A\_{\text{edge}}}} = G_{{BA\_{\text{edge}}}} \left( {{\text{fake}}_{{B\_{\text{edge}}}} } \right)$$
(5)
$${\text{rec}}_{A} = {\text{Fuse}}\left( {G_{BA} \left( {{\text{fake}}_{B} } \right),\;{\text{rec}}_{{A\_{\text{edge}}}} } \right)$$
(6)

We impose a cyclic consistency loss between the restored image gradient map and the real image gradient map, which helps the contour-enhancing translation branch provide more accurate gradient information for the painted translation branch to restore the image. At the same time, we also need to ensure the consistency of the restored scenery photos and the real scenery photos. The cyclic consistency losses of these two parts are defined with the L1 distance as formula (7) and formula (8):

$$L_{{{\text{cyc}}\_A\_{\text{edge}}}} \left( {G_{{AB\_{\text{edge}}}} ,\; G_{{BA\_{\text{edge}}}} } \right) = E\left[ {|\left| {{\text{rec}}_{{A\_{\text{edge}}}} - {\text{real}}_{{A\_{\text{edge}}}} } \right||_{1} } \right]$$
(7)
$$L_{{{\text{cyc}}\_A}} \left( {G_{AB} , G_{BA} } \right) = E\left[ {\left| {\left| {{\text{rec}}_{A} - {\text{real}}_{A} } \right|} \right|_{1} } \right]$$
(8)
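Both cycle-consistency terms reduce to an L1 distance between the reconstruction and the original, as in the following sketch for Eqs. (7) and (8):

```python
import torch.nn.functional as F

def cycle_loss(rec, real):
    # E[||rec - real||_1], Eqs. (7)-(8) and, symmetrically, Eqs. (15)-(16).
    return F.l1_loss(rec, real)
```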

So far, the forward translation process of our method is completed. The reverse translation is the opposite process to the forward translation. For the \(G_{BA} : B \to A\) mapping, the process can be described as formula (9) and formula (10):

$${\text{fake}}_{{A\_{\text{edge}}}} = G_{{BA\_{\text{edge}}}} \left( {{\text{real}}_{{B\_{\text{edge}}}} } \right)$$
(9)
$${\text{fake}}_{A} = {\text{Fuse}}\left( {G_{BA} \left( {{\text{real}}_{B} } \right), \;{\text{fake}}_{{A\_{\text{edge}}}} } \right)$$
(10)

The corresponding generative adversarial loss function is defined as formula (11) and formula (12):

$$L_{{{\text{GAN}}\_A\_{\text{edge}}}} \left( {G_{{BA\_{\text{edge}}}} ,D_{{A\_{\text{edge}}}} } \right) = E_{{{\text{real}}_{{A\_{\text{edge}}}} }} \left[ {\log D_{{A\_{\text{edge}}}} \left( {{\text{real}}_{{A\_{\text{edge}}}} } \right)} \right] + E_{{{\text{fake}}_{{A\_{\text{edge}}}} }} \left[ {\log \left( {1 - D_{{A\_{\text{edge}}}} \left( {{\text{fake}}_{{A\_{\text{edge}}}} } \right)} \right)} \right]$$
(11)
$$L_{{{\text{GAN}}\_A}} \left( {G_{BA} , D_{A} } \right) = E_{{{\text{real}}_{A} }} \left[ {\log D_{A} \left( {{\text{real}}_{A} } \right)} \right] + E_{{{\text{fake}}_{A} }} \left[ {\log \left( {1 - D_{A} \left( {{\text{fake}}_{A} } \right)} \right)} \right]$$
(12)

For the \(G_{AB} : A \to B\) mapping, the process can be described as expressions (13) and (14):

$${\text{rec}}_{{B\_{\text{edge}}}} = G_{{AB\_{\text{edge}}}} \left( {{\text{fake}}_{{A\_{\text{edge}}}} } \right)$$
(13)
$${\text{rec}}_{B} = {\text{Fuse}}\left( {G_{AB} \left( {{\text{fake}}_{A} } \right), {\text{rec}}_{{B\_{\text{edge}}}} } \right)$$
(14)

The corresponding cyclic consistency loss is defined as expressions (15) and (16):

$$L_{{{\text{cyc}}\_B\_{\text{edge}}}} \left( {G_{{BA\_{\text{edge}}}} , G_{{AB\_{\text{edge}}}} } \right) = E\left[ {\left| {\left| {{\text{rec}}_{{B\_{\text{edge}}}} - {\text{real}}_{{B\_{\text{edge}}}} } \right|} \right|_{1} } \right]$$
(15)
$$L_{{{\text{cyc}}\_B}} \left( {G_{BA} , G_{AB} } \right) = E\left[ {\left| {\left| {{\text{rec}}_{B} - {\text{real}}_{B} } \right|} \right|_{1} } \right]$$
(16)

The overall objective loss function of our model is:

$$L\left( {G, D} \right) = L_{{{\text{GAN}}\_B\_{\text{edge}}}} + L_{{{\text{GAN}}\_B}} + L_{{{\text{GAN}}\_A\_{\text{edge}}}} + L_{{{\text{GAN}}\_A}} + \lambda \left( {L_{{{\text{cyc}}\_A}} + L_{{{\text{cyc}}\_A\_{\text{edge}}}} + L_{{{\text{cyc}}\_B}} + L_{{{\text{cyc}}\_B\_{\text{edge}}}} } \right)$$
(17)

where \(\lambda\) denotes the relative weight of the generative adversarial loss and the cyclic consistency loss.
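Combining the pieces, the overall objective (17) can be assembled as in the sketch below, which assumes the loss helpers defined earlier; the default weight `lam` = 10.0 is only an illustrative value for λ, not a setting stated in the text.

```python
def total_loss(gan_losses, cyc_losses, lam=10.0):
    """gan_losses / cyc_losses: lists holding the four adversarial and four
    cycle-consistency terms of Eq. (17); lam is the relative weight lambda."""
    return sum(gan_losses) + lam * sum(cyc_losses)
```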

4 The proposed ISSIM metrics

One challenge in comparing models is that quantitative comparison is difficult. Besides generating realistic (clear) images, a good model must produce an image set with both sufficient similarity and sufficient diversity: the generated images should be similar to the training set and, at the same time, representative of its overall distribution. While ensuring similarity, however, overfitting should be avoided. Networks tend to overfit as they become deeper and their parameter counts grow [24], and the smaller the data set, the more severe the overfitting; in the extreme case, all images in the generated set are copies from the training set. Conversely, insufficient sample diversity in the generated set, known as mode collapse [25], is a common problem in GAN training in which the generator outputs a single data point for different modes of the domain. In the extreme case, the generator collapses to a single output mode, making the model practically useless, since it generates extremely similar patterns regardless of the input. Note that if the data in the training set is sufficiently realistic (clear), a generated set with high similarity must also be sufficiently realistic (clear). So, in general, a model is good enough if it can generate image sets that satisfy both diversity and similarity. To evaluate the quality of the Chinese landscape paintings generated by our models, the IS [26], FID [27], and KID [28] metrics are introduced in this paper.

The IS [26] metric is the KL-divergence between the conditional and marginal label distributions over the generated data, and the higher it is, the better. This score correlates somewhat with human judgment of sample quality on natural images, i.e., whether an image is real (clear) or not. In Inception Net-V3's "world view", any data that does not look like ImageNet is not real (clear). The IS knows nothing about the desired target distribution and cannot evaluate the similarity between the generated set and the real set [28].

The FID [27] is the Wasserstein-2 distance between multivariate Gaussians fitted to data embedded in a feature space. As an unbiased alternative to FID, the KID [28] measures the dissimilarity between two probability distributions using samples drawn independently from each distribution. Smaller FID and KID values indicate that the feature distributions of the generated images are closer to those of real images [29]. Both metrics mix the effects of similarity and diversity, and their values become large under mode collapse [30]. However, neither IS nor FID can detect overfitting or underfitting [31], and the same holds for KID. Moreover, since FID and KID combine similarity and diversity, one cannot tell which of the two is responsible for a particular value.

To overcome these shortcomings, the ISSIM set is proposed in this paper, which provides richer information. Based on the 2048-dimensional feature vectors output by Inception Net-V3, an ISSIM for evaluating the similarity of individual images is defined by analogy with the traditional structural similarity (SSIM) [32]. Unlike SSIM, the samples in ISSIM are 2048-dimensional feature vectors {x, y} instead of matrices of image pixel values. The formula for ISSIM is:

$${\text{ISSIM}} = \frac{{\left( {2\mu_{x} \mu_{y} + C_{1} } \right)\left( {2\sigma_{xy} + C_{2} } \right)}}{{\left( {\mu_{x}^{2} + \mu_{y}^{2} + C_{1} } \right)\left( {\sigma_{x}^{2} + \sigma_{y}^{2} + C_{2} } \right)}}$$
(18)

where \(\mu_{x}\) is the mean of x, \(\mu_{y}\) is the mean of y, \(\sigma_{x}^{2}\) is the variance of x, \(\sigma_{y}^{2}\) is the variance of y, and \(\sigma_{xy}\) is the covariance of x and y. The constants \(C_{1}\) and \(C_{2}\) avoid instability when the denominator is close to 0 and are set to \(C_{1} = 0.0013\) and \(C_{2} = 0.0117\). This ISSIM metric directly measures the similarity of two images.
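Given two 2048-dimensional Inception Net-V3 feature vectors, Eq. (18) can be computed directly. A minimal NumPy sketch follows, using the covariance form of the numerator as defined above; it is an illustration, not the exact implementation used for the reported scores.

```python
import numpy as np

def issim(x, y, c1=0.0013, c2=0.0117):
    """x, y: 2048-dimensional Inception Net-V3 feature vectors (NumPy arrays)."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```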

Based on this, further ISSIM metrics for two image sets are proposed to reflect, from different aspects, the similarity between the generated images and the real images. For a sample set Y of Inception Net-V3 feature vectors y of m generated images and a sample set X of feature vectors x of n real images used for comparison, the mean feature vectors are calculated first, and the ISSIM of the two sets is then obtained as:

$${\text{ISSIM}}_{0} = {\text{ISSIM}}\left( {\overline{X}, \overline{Y}} \right)$$
(19)

\({\text{ISSIM}}_{0}\) corresponds to the similarity between the mean feature vectors of the generated image set and the real image set, which indicates whether the centers of the two distributions are close. The maximum ISSIM value between each generated image feature vector \(Y_{i}\) and the feature vectors of the sample set X can also be accumulated and averaged to obtain:

$${\text{ISSIM}}_{1} = \frac{1}{m}\mathop \sum \limits_{i} \mathop {\max }\limits_{j} \left( {{\text{ISSIM}}\left( {X_{j} , Y_{i} } \right)} \right)$$
(20)

\({\text{ISSIM}}_{1}\) corresponds to the average, over the generated set, of the similarity between each generated image and the most similar sample in the real image set; it indicates whether each image in the generated set has a similar counterpart in the real set. Similarly, \({\text{ISSIM}}_{2}\) is defined as:

$${\text{ISSIM}}_{2} = \frac{1}{n}\mathop \sum \limits_{i} \mathop {\max }\limits_{j} \left( {{\text{ISSIM}}\left( {X_{i} , Y_{j} } \right)} \right)$$
(21)

which corresponds to the average, over the real image set, of the similarity between each real image and the most similar sample in the generated set; it indicates whether each image in the real set has a similar counterpart in the generated set. Another way to compute the ISSIM of two sets is:

$${\text{ISSIM}}_{3} = \frac{1}{nm}\mathop \sum \limits_{i} \mathop \sum \limits_{j} {\text{ISSIM}}\left( {X_{j} , Y_{i} } \right)$$
(22)

\({\text{ISSIM}}_{3}\) corresponds to the average similarity between every pair of samples from the generated image set and the real image set, which we call the cross-similarity. The larger the values of these four metrics, the higher the similarity between the generated set and the real set. However, it follows directly from the definition that the wider the distributions of the two image sets X and Y, the smaller the value of \({\text{ISSIM}}_{3}\); that is, the better the diversity of the two sets, the smaller \({\text{ISSIM}}_{3}\). Considering diversity and similarity together, a good model should have an \({\text{ISSIM}}_{3}\) that is close to 1 (ensuring similarity) but not too close to 1 (ensuring diversity). The self-similarity \({\text{ISSIM}}_{s}\) of the real image set can be defined as:

$${\text{ISSIM}}_{s} = \frac{1}{{n^{2} }}\mathop \sum \limits_{i} \mathop \sum \limits_{j} {\text{ISSIM}}\left( {X_{i} , X_{j} } \right)$$
(23)

\({\text{ISSIM}}_{s}\) reflects the diversity of the real image set: the lower the self-similarity, the higher the diversity. Replacing X in the formula with Y gives the self-similarity of the generated set.
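The set-level metrics (19)–(23) then follow directly from pairwise ISSIM values. A sketch, assuming the `issim` helper above and feature matrices `X` (n real images) and `Y` (m generated images):

```python
import numpy as np

def issim_set_metrics(X, Y):
    """X: (n, 2048) real features, Y: (m, 2048) generated features."""
    S = np.array([[issim(x, y) for y in Y] for x in X])  # pairwise ISSIM, shape (n, m)
    issim0 = issim(X.mean(axis=0), Y.mean(axis=0))       # Eq. (19): similarity of the means
    issim1 = S.max(axis=0).mean()                        # Eq. (20): best real match per generated image
    issim2 = S.max(axis=1).mean()                        # Eq. (21): best generated match per real image
    issim3 = S.mean()                                    # Eq. (22): cross-similarity
    return issim0, issim1, issim2, issim3

def self_similarity(X):
    """Eq. (23): average pairwise ISSIM within one set."""
    n = len(X)
    return float(np.mean([[issim(X[i], X[j]) for j in range(n)] for i in range(n)]))
```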

Consider the extreme case in which the distribution of the generated samples is exactly the same as that of the real image set; then \({\text{ISSIM}}_{3}\) = \({\text{ISSIM}}_{s}\), which is taken as the suitable value. A scoring function is therefore proposed:

$$F\left( {x,s,p,a} \right) = \left\{ {\begin{array}{*{20}c} {\left( {x/s} \right)^{a} } & {x/s \le 1} \\ {\left( {x/s} \right)^{ - p} } & {x/s > 1} \\ \end{array} } \right.$$
(24)

where s is the suitable value, p is the penalty factor, a is the adjustment factor, and p and a are both positive. With s = \({\text{ISSIM}}_{s}\) and appropriate p and a, the score \(F\left( {{\text{ISSIM}}_{3} ,s,p,a} \right)\) associated with \({\text{ISSIM}}_{3}\) becomes a value in [0,1]. Similarly, if one does not want the generated set to be too close to the real set, one can use \(F\left( {{\text{ISSIM}}_{1} ,s,p,a} \right)\) to adjust the score associated with \({\text{ISSIM}}_{1}\), with the parameter p penalizing \({\text{ISSIM}}_{1} > s\) as overfitting. After these adjustments, every score is the bigger the better. Considering the above four metrics together, a composite metric can be defined as:

$${\text{ISSIM}}_{c} = \left( {{\text{ISSIM}}_{0} F\left( {x_{1} ,s_{1} ,p_{1} ,a_{1} } \right)F\left( {x_{2} ,s_{2} ,p_{2} ,a_{2} } \right)F\left( {x_{3} ,s_{3} ,p_{3} ,a_{3} } \right)} \right)^{0.25}$$
(25)

where \(x_{i} = {\text{ISSIM}}_{i}\), \(s_{3} = {\text{ISSIM}}_{s}\), and the remaining parameters should be adjusted to the real image set and the actual situation. For example, if the self-similarity \({\text{ISSIM}}_{s}\) of the real set is large, the overfitting thresholds \(s_{1}\) and \(s_{2}\) should be raised accordingly; if the distribution ranges of the generated set and the real set should be as close as possible, \(p_{3}\) and \(a_{3}\) should be increased; if one cares whether some modes of the real set are missing from the generated set, \(a_{2}\) can be increased; and if one cares whether the generated set incorrectly produces modes that do not exist in the real set, \(a_{1}\) can be increased. The specific parameter values are discussed for the concrete cases in the analysis of the results below.
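A sketch of the scoring function (24) and the composite metric (25) follows; the default values of p and a are only the settings used later in this paper, supplied here as illustrative defaults.

```python
def score(x, s, p, a):
    # Eq. (24): reward x/s <= 1 with exponent a, penalize x/s > 1 with exponent p.
    r = x / s
    return r ** a if r <= 1 else r ** (-p)

def issim_composite(issim0, issim1, issim2, issim3, issim_s,
                    s1, s2, p=(12, 9, 6), a=(1.5, 0.9, 0.5)):
    # Eq. (25): geometric mean of ISSIM_0 and the three adjusted scores,
    # with s3 = ISSIM_s, the self-similarity of the real set.
    f1 = score(issim1, s1, p[0], a[0])
    f2 = score(issim2, s2, p[1], a[1])
    f3 = score(issim3, issim_s, p[2], a[2])
    return (issim0 * f1 * f2 * f3) ** 0.25
```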

5 Simulation results

5.1 Preparation of the simulation

5.1.1 Access to data

The Image-Downloader crawler program used English and Chinese keywords to grab relevant pictures from image search engines (Baidu, Bing, and Google) and download them in batches. A total of 2363 images were collected in the target domain (Chinese landscape paintings) and 2562 images in the source domain (scenery photos).

5.1.2 Data filtering

We used manual screening to eliminate irrelevant images, retaining 1956 scenery photos and 1884 Chinese landscape paintings. In the source domain we randomly assigned 1761 scenery photos as training samples and 195 as test samples; in the target domain we randomly assigned 1696 Chinese landscape paintings as training samples and 188 as test samples. Some example images are shown in Fig. 5. This paper aims to simulate the "Cun method" with neural networks and to achieve good results with small samples and small image sizes. However, relatively few of the collected Chinese landscape paintings retained distinctive "Cun method" features after being resized to 256 × 256 pixels. Of the 1696 training samples, only 695 had significant "Cun method" features, as shown in the first row of Fig. 5; the rest had general or weak "Cun method" features, as shown in the second and third rows of Fig. 5, respectively. Whether the "Cun method" characteristics were significant was determined manually by professionals from the art school of Northwest University. Including samples with insignificant "Cun method" features improves the generalization ability of the neural network and prevents overfitting on small samples.

Fig. 5
figure 5

Examples of source (a) and target (b) landscape images

5.1.3 Simulation environment

The simulation used a workstation with an NVIDIA RTX 3090 graphics card. The operating system is Ubuntu 20.04 with PyTorch 1.7.1. The CUDA version is 11.2, and the CuDNN version is 8.1.

5.1.4 Training process

To facilitate training and evaluation, all images were resized to 256 × 256 resolution. We trained for 400,000 iterations with a learning rate of 0.0002 over the first 200 epochs, after which the learning rate decays linearly to zero over the following 400 epochs (800,000 iterations). We used the Adam optimizer [33] with a batch size of 1.
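A sketch of the optimizer and learning-rate schedule described above (constant for the first 200 epochs, then linear decay to zero over the next 400 epochs), assuming a generic `model`; the Adam momentum values are the common CycleGAN defaults and are an assumption, not a setting stated in the text.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=0.0002,
                             betas=(0.5, 0.999))  # betas assumed, not stated in the text

def lr_lambda(epoch, n_const=200, n_decay=400):
    # Keep the learning rate constant for n_const epochs, then decay linearly to zero.
    return 1.0 if epoch < n_const else 1.0 - (epoch - n_const) / float(n_decay)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
# scheduler.step() is called once per epoch after the training loop.
```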

5.2 Simulation results and analysis

This section shows the effectiveness of our designed CE-CycleGAN on realistic painting style transfer tasks and introduces the ablation study. All qualitative results shown in this section are evaluated on the test set of the prepared dataset. Figure 6 shows some of the simulation results of our method.

Fig. 6
figure 6

Generalization simulations

5.2.1 The IS, FID and KID metrics

The IS metric of the generated image sets before and after graying is given in Table 2. The IS of the three other models decreases after graying, while the scores of CycleGAN and of our model increase compared with the color images. It can be inferred that the three other models are more sensitive to color information and pay relatively little attention to silhouette information. Before and after graying, the \({\text{IS}}^{{{\text{Diff}}}}\) of our model is greater than that of the CycleGAN model, which indicates that contour enhancement makes the generated set more sensitive to image edges. Our model may therefore be more advantageous in simulating image textures, which is exactly what is required to simulate the "Cun method".

Table 2 The IS (the higher, the better) and FID scores (the lower, the better) of different models

The FID scores of the different models are shown in Table 2; they are computed from the means and covariance matrices of the 2048-dimensional feature vectors produced by Inception Net-V3 for the two image collections. \({\text{FID}}_{{{\text{All}}}}\) is the FID between the generated image set and the full training set, and \({\text{FID}}_{{{\text{Cun}}}}\) is the FID between the generated image set and the "Cun method" set. The NiceGAN model generates the image set closest to both the training set and the "Cun method" set. Considering that the "Cun method" is only a stroke technique that is not sensitive to color information, the FID scores between the real and generated image distributions after graying were also calculated. The FID scores improve significantly, and our model shows the largest drop, as indicated by \({\text{FID}}_{{{\text{All}}}}^{{{\text{Diff}}}}\) and \({\text{FID}}_{{{\text{Cun}}}}^{{{\text{Diff}}}}\), which again illustrates that the set generated by our model is more sensitive to edge information.
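For reference, the FID between two feature sets is computed from their means and covariances as in [27]. The following is a standard NumPy/SciPy sketch, not the exact implementation used for the scores in Table 2:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    """feats_*: (N, 2048) Inception Net-V3 feature matrices."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2 * covmean))
```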

Although NiceGAN performs best on the FID score, human visual perception does not agree. As can be seen in Fig. 6, some of the images generated by NiceGAN look noticeably worse to human observers. On the contrary, the image set generated by our model, which scores second on \({\text{FID}}_{{{\text{Cun}}}}^{{{\text{Gray}}}}\), gives a subjective impression closer to the effect of the "Cun method". FID may therefore not be the most appropriate metric for this paper, so KID was also calculated.

The KID scores of the different models are shown in Table 3, with the variance also reported, since the high variance of KID can make reporting the value alone unreliable [34]. When color information is considered, the image set generated by U-GAT-IT is closest to the full training set and the set generated by NiceGAN is closest to the "Cun method" set. When color information is not considered, the set generated by NiceGAN is closest to the full training set and the set generated by our model is closest to the "Cun method" set. Overfitting tends to occur when the sample size is small and the model has too many parameters. Combining the parameter counts reflected by the checkpoint sizes in Table 3 with the fact that some of the outputs in the fourth and sixth columns of Fig. 6 show visible contour distortion relative to the inputs, it can be concluded that U-GAT-IT and NiceGAN show signs of overfitting. Excluding these two models, the KID scores indicate that the set generated by the CycleGAN model is clearly closer to the full set. After contour enhancement, the set generated by our model is closer to the "Cun method" set when color information is ignored, although its distance to the full set becomes larger. This analysis shows that introducing the contour-enhancing translation branch brings the images generated by the neural network closer to the "Cun method" effect.

Table 3 The KID scores (the lower, the better), the checkpoint size, and the graphics memory occupancy with batch size 1 for the different models
Table 4 The metric scores for artificially created abnormal image sets with seven typical abnormalities, simulated by selecting images from the grayed test set and generated set

5.2.2 The ISSIM metrics

To explore the validity of the various metrics, a series of artificially created abnormal sets was tested; the results of seven typical anomalies are shown in Table 4. Abn1 represents the case of low diversity in the training set; it is constructed by selecting similar images from the training set and the generated set to form new sets. The self-similarity of Abn1's training set is 0.751, and the generated set is assumed to fit the training set well. Abn2-Abn7 use the grayed "Cun method" set as the "training set". In Abn1-Abn2 and Abn6-Abn7, some images from the "Cun method" set were copied into the "generated set" to simulate overfitting; Abn6 and Abn7 are extreme cases in which all images in the "generated set" come from the "Cun method" set. In Abn1 and Abn4-Abn7, the same images are copied several times to simulate mode collapse, but in mode collapse case 1 the distribution of the copied images is concentrated, whereas in mode collapse case 2 it is broader. In Abn3, randomly selected scenery photos were used as the "generated set" to simulate underfitting.

From the test results, the IS score is more sensitive to mode collapse: once mode collapse occurs, the IS score worsens. For the FID and KID metrics, overfitting leads to better scores and mode collapse leads to worse scores, so when these two metrics look good one needs to check whether overfitting has occurred. In the case of severe underfitting (Abn3), FID still achieves a relatively good score, whereas KID is more sensitive; from this point of view, the KID metric is more reasonable than FID. In contrast, the ISSIM metrics proposed in this paper are directly related to similarity, so their values reflect overfitting and underfitting intuitively: overfitting directly leads to a higher \({\text{ISSIM}}_{1}\) score and underfitting to a lower one. Deciding whether overfitting or underfitting has occurred requires a threshold, which should be related to the self-similarity of the training set. Note that for different models the threshold needs to be determined in view of the actual situation, ideally after some testing combined with human visual perception. Through testing, this paper sets the overfitting threshold to \({\text{min}}\left( {0.9,\left( {{\text{ISSIM}}_{s} } \right)^{1/3} } \right)\) and the underfitting threshold to \(\left( {{\text{ISSIM}}_{s} } \right)^{0.8}\). The resulting right fitting range is [0.795, 0.9] for Abn1 and [0.724, 0.874] for Abn2-Abn7. The cases in which \({\text{ISSIM}}_{1}\) falls outside these ranges are underlined in Table 4. Table 4 also shows that mode collapse increases the self-similarity of the generated set, especially in mode collapse case 1.
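As a worked example of these thresholds, the training-set self-similarity of Abn1 is 0.751, which yields the fitting range quoted above:

$$0.751^{0.8} \approx 0.795, \qquad \min \left( {0.9,\;0.751^{1/3} } \right) = \min \left( {0.9,\;0.909} \right) = 0.9,$$

so the right fitting range for \({\text{ISSIM}}_{1}\) is [0.795, 0.9]; the same calculation with the "Cun method" self-similarity of 0.668 gives [0.724, 0.874].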

The above discussion shows that the ISSIM metrics do provide intuitive similarity values and richer information than the other metrics. To combine this information into a single reference value, the composite metric \({\text{ISSIM}}_{c}\) can be calculated by setting its parameters. \(s_{1}\) and \(s_{2}\) are set to the threshold mentioned above, \({\text{min}}\left( {0.9,\left( {{\text{ISSIM}}_{s} } \right)^{1/3} } \right)\). The penalty intensity for values of \({\text{ISSIM}}_{1}\)-\({\text{ISSIM}}_{3}\) exceeding their thresholds gradually weakens: \(p_{1}\)-\(p_{3}\) are set to 12, 9, and 6, respectively. These values are relatively large, so the overfitting cases cannot obtain high scores. In this paper we care most about whether the images in a generated set are similar to the images in the training set, so the importance of \({\text{ISSIM}}_{1}\)-\({\text{ISSIM}}_{3}\) decreases in that order, and \(a_{1}\)-\(a_{3}\) are set to 1.5, 0.9, and 0.5, respectively. The composite metric \({\text{ISSIM}}_{c}\) calculated with these parameter settings is listed in Table 4; the scores for both overfitting and underfitting are relatively low. For cases where the fit is good but mode collapse occurs (similar to Abn4-Abn5), the case with a broad sample distribution (similar to Abn5) scores higher.

The ISSIM results for the grayed image sets are shown in Table 5. The self-similarity is 0.527 for the full training set and 0.668 for the "Cun method" set: compared with the full training set, the diversity of the "Cun method" set is reduced, so its self-similarity increases from 0.527 to 0.668. Following the earlier discussion, the right fitting range of \({\text{ISSIM}}_{1}\) is [0.599, 0.808] for the full training set and [0.724, 0.874] for the "Cun method" set. \({\text{ISSIM}}_{1}\) for U-GAT-IT and NiceGAN exceeds 0.808, so these two models are suspected of overfitting. Combined with the results shown in Fig. 6, it can be concluded that overfitting occurred in both models, probably because their parameter counts are excessive relative to the number of training samples. These two models are therefore excluded (underlined in Table 5) when judging which model works best on the tasks in this paper. The self-similarity of the generated sets of all models is larger than 0.527, which suggests that all models are suspected of mode collapse, especially MUNIT, whose generated set has the largest self-similarity.

Table 5 The ISSIM scores of different models and the self-similarity of the generator set for the grayed image sets

According to the composite metric \({\text{ISSIM}}_{c}\), the set generated by CycleGAN has the best similarity to the full training set, while the set generated by our model has the best similarity to the "Cun method" set. Table 5 shows that contour enhancement improves the ISSIM metrics, raising the cross-similarity from 0.606 to 0.639; that is, contour enhancement makes the modes collapse toward the "Cun method" set. For the full training set, the models ranked from best to worst are CycleGAN > Ours > MUNIT > U-GAT-IT > NiceGAN; for the "Cun method" set, excluding the two overfitted models, the ranking is Ours > CycleGAN > MUNIT. The \({\text{ISSIM}}_{c}\) score is consistent with human visual perception, which indicates that the chosen parameters are reasonable.

Thus our model achieves the best similarity to the "Cun method" set, with 97.89% similarity in the feature means, 63.94% similarity over all compared features, 79.62% similarity when matching each image to its most similar counterpart, and a composite similarity of 0.92. The ISSIM proposed in this paper is as simple to compute as the FID score, also accounts for diversity, and effectively describes the similarity between image sets.

5.2.3 The NIMA metric

In addition to the closeness of the generated images to the "Cun method", this paper also considers their aesthetic quality. Evaluating the technical quality and aesthetic effect of images has long been a problem in image processing and computer vision [35]. Technical quality assessment measures pixel-level degradation such as noise, blur, and compression artifacts, whereas aesthetic assessment captures the semantic features of emotion and beauty in an image. In general, image quality assessment can be divided into two types [36]: full-reference (FR) methods, such as the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) [32], and no-reference (NR) methods. This paper focuses on the artistic effect of the transformed images, so such technical quality evaluation is not applicable. Fortunately, NIMA [20] predicts human opinions on images in terms of direct perception and attractiveness, which gives it advantages similar to human subjective scoring, so we choose it as the image quality evaluation metric. NIMA generates a score histogram for any image: each image is scored from 1 to 10, and images of the same subject can be compared directly. This design matches the histograms produced by human scoring, so the evaluation is closer to human judgment.
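NIMA outputs a probability distribution over the ten score bins, and the single quality score reported below is the mean of that distribution. A minimal sketch, with `nima_model` a placeholder for a trained NIMA network whose exact interface is an assumption:

```python
import torch

def nima_mean_score(nima_model, image):
    """image: (1, 3, H, W) preprocessed tensor; returns the mean NIMA score in [1, 10]."""
    with torch.no_grad():
        probs = nima_model(image).squeeze(0)      # 10 probabilities for scores 1..10
    scores = torch.arange(1, 11, dtype=probs.dtype)
    return float((probs * scores).sum())
```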

Figure 7 compares the NIMA score histograms of our method, CycleGAN, NiceGAN, MUNIT, and U-GAT-IT [16] on the 195 test images. Our method is superior to the CycleGAN, NiceGAN, MUNIT, and U-GAT-IT methods. The average score of the images generated by our method is 4.74 points, compared with 4.05 for CycleGAN, 3.78 for NiceGAN, 4.03 for MUNIT, and 3.96 for U-GAT-IT. In terms of the NIMA metric, the performance of our method is therefore about 17% higher on average than CycleGAN, about 25% higher than NiceGAN, about 18% higher than MUNIT, and about 20% higher than U-GAT-IT.

Fig. 7
figure 7

NIMA score histograms of 195 test images. The insets show the average of the NIMA scores or the difference of the NIMA scores

To further analyze the effectiveness of our method, we selected the five images on which it most outperforms the CycleGAN, NiceGAN, MUNIT, and U-GAT-IT methods, shown in Figs. 8, 9, 10 and 11, respectively; the five worst-performing images compared with the same methods are shown in Figs. 12, 13, 14 and 15. The original photos in Figs. 8, 9, 10 and 11 are relatively rich in content, and their NIMA scores are higher, as shown in Table 6. In contrast, the original photos in Figs. 12, 13, 14 and 15 have relatively monotonous content, and their NIMA scores are generally lower, as shown in Table 7. These results are consistent with our expectations: our approach highlights the boundaries of the rocks and trees in the photos, and when an image is informative these boundaries stand out, whereas monotonous images carry little boundary information. Therefore, the method in this paper is mainly applicable to photos with rich details.

Fig. 8
figure 8

The top five best performing images where our proposed method outperforms the Cycle-GAN method

Fig. 9
figure 9

The top five best performing images where our proposed method outperforms the NiceGAN method

Fig. 10
figure 10

The top five best performing images where our proposed method outperforms the MUNIT method

Fig. 11
figure 11

The top five best performing images where our proposed method outperforms the U-GAT-IT method

Fig. 12
figure 12

The top five worst performing images of our proposed method compared to the CycleGAN method

Fig. 13
figure 13

The top five worst performing images of our proposed method compared to the NiceGAN method

Fig. 14
figure 14

The top five worst performing images of our proposed method compared to the MUNIT method

Fig. 15
figure 15

The top five worst performing images of our proposed method compared to the U-GAT-IT method

Table 6 NIMA scores of the top five best-performing images when our method is compared with the Cycle-GAN and U-GAT-IT methods, respectively
Table 7 NIMA scores of the top five worst-performing images when our method is compared with the Cycle-GAN and U-GAT-IT methods, respectively

6 Conclusion

This paper proposed the CE-CycleGAN framework to transfer scenery photos to a realistic landscape painting style with the "Cun method". An edge detection operator is introduced for the distinctive edge features of landscape paintings, and a gradient-guided path is designed after obtaining the image gradient, which enhances the edge transformation from photograph to painting. The KID and ISSIM metrics indicate that the generated results are fairly close to the "Cun method" set. For our experiments, the ISSIM proposed in this paper is a more powerful metric than FID and KID, since it intuitively gives the similarity between the generated set and the real set. The simulation results also show that the subjective evaluation of this method is satisfactory and that it is superior to the other four methods under the NIMA metric. Our results suggest that introducing the NIMA and ISSIM metrics into the loss function would be a feasible direction for improvement.

Although the CE-CycleGAN method has made progress in edge prominence, the generated results still await further optimization compared with the realistic techniques used by artists. Unlike Western painting, Chinese landscape painting does not pursue strict realism, so the occasional sky or clouds turning into mountains in the generated images is not critical. However, Chinese landscape painting does emphasize white space, which is not reflected in the generated images and could be improved in the future by adding selective attention mechanisms. The use of the Sobel operator to highlight contours does bring the generated images closer to the "Cun method"; we infer that this effectiveness comes from the fact that the "Cun method" itself has significant contours, so the CE-CycleGAN method should also be effective for large samples and large image sizes. It also suggests that a better way to simulate the "Cun method" is to enhance contours by other means, for example by introducing other edge and contour extraction operators, further adjusting the thickness and direction of the extracted contour lines to be closer to the texture strokes of the "Cun method", or even using different contour enhancement effects to simulate different "Cun method" effects.