
1 Introduction

Painting has attracted people for many years and is one of the most popular art forms for the creative expression of the practitioner's conceptual intention. Since the 1990s, computer scientists have studied artistic work, both to understand art from a computational perspective and to turn camera photos into artistic images automatically. One early attempt is non-photorealistic rendering (NPR) [17], an area of computer graphics that focuses on reproducing artistic styles such as oil painting and drawing for digital images. However, NPR is limited to images with simple profiles and is hard to generalize to arbitrary artistic styles.

A significant advance was neural style transfer [8], which separates the representations of image content and style learned by deep Convolutional Neural Networks (CNN) and then recombines the content of one image with the style of another to obtain a styled image. This process produces striking stylized images whose appearance resembles a given artistic work, such as Vincent van Gogh's "The Starry Night". The success of style transfer indicates that artistic styles are computable and can be migrated from one image to another; in effect, one can learn to draw like certain artists without years of training.

Following [8], many efforts have been made to improve or extend the neural style transfer algorithm. A content-aware style transfer configuration was considered in [23]. In [14], a discriminatively trained CNN was combined with classical Markov Random Field based texture synthesis for better mesostructure preservation in synthesized images. Semantic annotations were introduced in [1] to achieve semantic transfer. To improve efficiency, fast neural style transfer [13, 21] trains a feed-forward network over a large set of images, so that stylization requires only a single forward pass. Results were further improved by adversarial training [15]. For a systematic review of neural style transfer, please refer to [12].

Recent progress on style transfer relies on the separable representations learned by deep CNNs, whose layers of convolutional filters automatically learn low-level and abstract representations in a feature space far more expressive than raw pixels. However, it is still challenging to use CNN representations for style transfer because they behave as an uncontrollable black box; selecting an appropriate composition of styles (e.g. textures, colors, strokes) from images remains difficult due to the risk of incorporating unpredictable or incorrect patterns. In this paper, we propose a computational analysis of artistic styles and decompose them into basis elements that are easy to select and combine, enabling enhanced and controllable style transfer. Specifically, we propose two types of decomposition methods: spectrum-based methods, using the Fast Fourier Transform (FFT) and the Discrete Cosine Transform (DCT), and latent variable models, such as Principal Component Analysis (PCA) and Independent Component Analysis (ICA). We then suggest methods for combining styles by intervention and mixing. The computational decomposition of styles can be embedded as a module in state-of-the-art neural style transfer algorithms. Experiments demonstrate the effectiveness of style decomposition in style transfer. We also demonstrate that controlling the style bases enables us to transfer Chinese landscape painting styles well and to transfer sketch styles similar to picture-to-sketch [2, 19].

2 Related Work

Style transfer generates a styled image with semantic content similar to the content image and style similar to the style image. Conventional style transfer is realized by patch-based texture synthesis methods [5, 22], where style is approximated as texture. Given a texture image, patch-based texture synthesis can automatically generate a new image with the same texture. However, texture images differ from arbitrary style images [22] in their duplicated patterns, which limits the ability of patch-based texture synthesis for style transfer. Moreover, control of the texture by varying the patch size (shown in Fig. 2 of [5]) is limited for the same reason.

The neural style transfer algorithm proposed in [8] develops the work in [7], which pioneered the use of a CNN pre-trained on ImageNet [3]. Rather than texture synthesis operating directly on raw pixels, feature maps of images are used, which better preserve semantic information. The algorithm starts from a noise image and converges to the styled image by iterative optimization. The loss function \(\mathcal {L}\) is composed of the content loss \(\mathcal {L}_{content}\) and the style loss \(\mathcal {L}_{style}\):

$$\begin{aligned} \mathcal {L} = \alpha \mathcal {L}_{content} + \beta \mathcal {L}_{style}, \end{aligned}$$
(1)
$$\begin{aligned} \mathcal {L}_{content} = \frac{1}{2}\left\Vert \mathcal {F}_l^{pred} - \mathcal {F}_l^{content}\right\Vert _F^2, \end{aligned}$$
(2)
$$\begin{aligned} \mathcal {L}_{style} = \sum _{l} \frac{e_l}{4h_l^2w_l^2c_l^2}\left\Vert G_l^{pred} - G_l^{style}\right\Vert _F^2, \end{aligned}$$
(3)

where \(\mathcal {F}_l^{pred}\), \(\mathcal {F}_l^{content}\) and \(\mathcal {F}_l^{style}\) denote the feature maps of the synthesized styled image, the content image and the style image respectively; \(\mathcal {F}_l\) is treated as a 2-dimensional matrix (\(\mathcal {F}_l \in \mathcal {R}^{(h_l w_l) \times c_l}\)); \(G_l\) is the Gram matrix of layer l, \(G_l = \mathcal {F}_l^T \mathcal {F}_l \in \mathcal {R}^{c_l \times c_l}\); and \(h_l, w_l, c_l\) denote the height, width and number of channels of the feature map. Notice that the content loss is measured at the level of feature maps, while the style loss is measured at the level of Gram matrices.
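For concreteness, the losses above can be written down in a few lines. The following is a minimal NumPy sketch, assuming each feature map has already been extracted and reshaped to \(\mathcal {R}^{(h_l w_l) \times c_l}\) as described above; the variable names are ours, not from the original implementation.

```python
import numpy as np

def gram(F):
    """Gram matrix G_l = F^T F for a feature map F of shape (h*w, c)."""
    return F.T @ F

def content_loss(F_pred, F_content):
    """Eq. (2): squared difference of feature maps at one layer."""
    return 0.5 * np.sum((F_pred - F_content) ** 2)

def style_loss(F_pred_layers, F_style_layers, layer_weights):
    """Eq. (3): weighted squared difference of Gram matrices over layers."""
    loss = 0.0
    for F_pred, F_style, e in zip(F_pred_layers, F_style_layers, layer_weights):
        hw, c = F_pred.shape                      # hw = h_l * w_l, c = c_l
        diff = gram(F_pred) - gram(F_style)
        loss += e * np.sum(diff ** 2) / (4.0 * hw**2 * c**2)
    return loss

def total_loss(alpha, beta, l_content, l_style):
    """Eq. (1): weighted sum of content and style losses."""
    return alpha * l_content + beta * l_style
```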

Methods were proposed in [9] for spatial, color and scale control in neural style transfer. Spatial control transfers the style of specific regions of the style image via guided feature maps. Given binary spatial guidance channels for both the content and style image, the guided feature map is generated either by multiplying the guidance channel with the feature map element-wise or by concatenating the guidance channel to the feature map. Color control is realized by the YUV color space and histogram matching [10] as post-processing. Although both ways are feasible, the color cannot be controlled to a specified degree: the control is binary, either transferring all colors of the style image or preserving all colors of the content image. Moreover, scale control [9] depends on different CNN layers, which represent different levels of abstraction. Since the number of layers in a CNN is finite (19 weight layers for VGG19), the scale of style can only be controlled in finitely many degrees.

The limited control over neural style transfer offered by the pre- and post-processing methods in [9] stems from the lack of a computational analysis of artistic style, which is the foundation of continuous control for neural style transfer. Inspired by the spatial control in [9], where operations on the feature map affect the transferred style, we analyze the feature map and decompose the style by projecting the feature map into a latent space spanned by style bases, such as color and stroke. Since every point in the latent space can be regarded as the representation of a style and can be decoded back to a feature map, style control becomes continuous and precise. Meanwhile, our work facilitates mixing or intervention of the style bases from more than one style, so that compound or new styles can be generated, enhancing the diversity of styled images.

3 Methods

We propose to decompose the feature map of the style image into style bases in a latent space, in which it is easy to mix or intervene on the style bases of different styles to generate compound or new styles, which are then projected back to the feature map space. This decomposition enables us to continuously control the composition of the style bases and enhances the diversity of the synthesized styled image. An overview of our method is shown in Fig. 1.

Fig. 1. An overview of our method, indicated by the red part. The red rectangle represents the latent space spanned by the style bases; f denotes the computational decomposition of the style feature map \(\mathcal {F}_s\); g denotes mixing or intervention within the latent space. The red part works as a computational module embedded in Gatys' or other neural style transfer algorithms. (Color figure online)

Given the content image \(I_{content}\) and style image \(I_{style}\), we decompose the style by a function f that maps the feature map \(\mathcal {F}_s\) of the style image to \(\mathcal {H}_s\) in the latent space spanned by the style bases \(\{S_i\}\). We can mix or intervene on the style bases via a function g operating on the style bases to generate the desired style, coded by \(\hat{\mathcal {H}}_s\). Using the inverse function \(f^{-1}\), \(\hat{\mathcal {H}}_s\) is projected back to the feature map space to obtain \(\hat{\mathcal {F}}_s\), which replaces the original \(\mathcal {F}_s\) for style transfer. Our method can serve as an embedded module for state-of-the-art neural style transfer algorithms, as shown in red in Fig. 1.

Note that the module can be regarded as a general transformation from the original style feature map \(\mathcal {F}_s\) to a new style feature map \(\hat{\mathcal {F}}_s\). If we let \(\hat{\mathcal {F}}_s = \mathcal {F}_s\), our method degenerates to traditional neural style transfer [8].

Since the transformation is applied only to the feature map of the style image, we simply write \(\mathcal {F}\) for the feature map \(\mathcal {F}_s\) of the style image and \(\mathcal {H}\) for \(\mathcal {H}_s\) in the rest of the paper. We write h and w for the height and width of each channel of the feature map. Next, we introduce two types of decomposition functions f and suggest some control functions g.

3.1 Decomposed by Spectrum Transforms

We adopt the 2-dimensional Fast Fourier Transform (FFT) and the 2-dimensional Discrete Cosine Transform (DCT) as decomposition functions, with details given in Table 1. Both methods are applied at the channel level of \(\mathcal {F}\), where each channel is treated as a 2-dimensional gray image.

Through the 2-d FFT or 2-d DCT, the style feature map is decomposed into frequencies in the spectrum space, where the style is coded by frequencies that form the style bases. We will see that some style bases, such as stroke and color, correspond to different levels of frequency. With the help of this decomposition, similar styles are quantified to be close to each other, forming clusters in the spectrum space, and it is easy to combine existing styles into compound or new styles \(\hat{\mathcal {H}}\) by appropriately varying the style codes. \(\hat{\mathcal {H}}\) is then projected back to the feature map space via the inverse 2-d FFT or 2-d DCT shown in Table 1.
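As an illustration, the channel-wise transforms can be realized with standard NumPy/SciPy routines. This is a sketch of one plausible implementation of f and \(f^{-1}\), assuming the feature map \(\mathcal {F}\) is an array of shape (h, w, c):

```python
import numpy as np
from scipy.fft import dctn, idctn

def fft_decompose(F):
    """f: channel-wise 2-d FFT; returns the complex spectrum H."""
    return np.fft.fft2(F, axes=(0, 1))

def fft_reconstruct(H):
    """f^{-1}: channel-wise inverse 2-d FFT; the small imaginary
    residue left after editing the spectrum is discarded."""
    return np.fft.ifft2(H, axes=(0, 1)).real

def dct_decompose(F):
    """f: channel-wise 2-d DCT (type II, orthonormal)."""
    return dctn(F, axes=(0, 1), norm="ortho")

def dct_reconstruct(H):
    """f^{-1}: channel-wise inverse 2-d DCT."""
    return idctn(H, axes=(0, 1), norm="ortho")
```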

3.2 Decomposed by Latent Variable Models

We consider another type of decomposition based on latent variable models, such as Principal Component Analysis (PCA) and Independent Component Analysis (ICA), which decompose the input signal into uncorrelated or independent components. Details are given in Table 1, where each channel of the feature map \(\mathcal {F}\) is vectorized as one input vector.

  • Principal Component Analysis (PCA)

    We implement PCA from the perspective of matrix factorization. The eigenvectors are computed via Singular Value Decomposition (SVD). The style is then coded as a linear combination of orthogonal eigenvectors, which can be regarded as style bases. By varying the combination of eigenvectors, compound or new styles are generated and projected back to the feature map space via the inverse of the eigenvector matrix (its transpose, by orthogonality).

  • Independent Component Analysis (ICA)

    We implement ICA via the FastICA algorithm [11] to decompose the style feature map into statistically independent components, which can be regarded as style bases. As with PCA, we can control the combination of independent components to obtain compound or new styles, and then project them back to the feature map space.
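Both models are available off the shelf. Below is a minimal scikit-learn sketch, assuming \(\mathcal {F}\) has been reshaped to a 2-d array so that the learned components live in channel space (our reading of Table 1; the exact data layout is an assumption):

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

def pca_decompose(F2d, n_components=None):
    """f via PCA: F2d is the feature map reshaped to (h*w, c).
    Returns the codes H and the fitted model; pca.inverse_transform(H)
    plays the role of f^{-1}."""
    pca = PCA(n_components=n_components)
    H = pca.fit_transform(F2d)
    return H, pca

def ica_decompose(F2d, n_components=None):
    """f via FastICA [11]: returns the independent sources S and the
    mixing matrix A, with F2d ~= S @ A.T + ica.mean_;
    ica.inverse_transform(S) plays the role of f^{-1}."""
    ica = FastICA(n_components=n_components, random_state=0)
    S = ica.fit_transform(F2d)
    return S, ica.mixing_, ica
```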

3.3 Control Function g

The control function g in Fig. 1 defines style operations in the latent space spanned by the decomposed style bases. Compared with operating directly on the feature map space, operations within the latent space have several advantages. First, after decomposition the style bases have minimal redundancy or are independent of each other, so operations on them are easier to control. Second, the latent space can be made a low-dimensional manifold robust to noise, by focusing on a few key frequencies of the spectrum or on the principal components with maximum data variation. Third, continuous operations such as linear mixing, intervention, and interpolation become possible, which enhances the diversity of output styles and even allows new styles to be sampled from the latent space. Fourth, multiple styles can be better mixed and transferred simultaneously.

Table 1. The mathematical details of f and \(f^{-1}\)

Let \(S_{i}^{(n)}, i \in \mathbb {Z}\) denote the i-th style basis of the n-th style image, and write \(S_{I}^{(n)}\) for \(\{S_{i}^{(n)} \mid i \in I\}, I \subset \mathbb {Z}\). We consider the following operations; a code sketch follows the list.

  • Single style basis: Project the latent code onto one style basis \(S_j\), i.e., set \(S_i = 0\) for all \(i \ne j\).

  • Intervention: Reduce or amplify the effect of one style basis \(S_j\) by multiplying it by a factor I while keeping the other style bases unchanged, i.e., \(S_j \leftarrow I \cdot S_j\).

  • Mixing: Combine the style bases of n styles, i.e., \(S = \{S_I^{(1)}, S_J^{(2)}, \dots , S_K^{(n)}\}\).
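The following sketch implements the three operations for codes whose bases are indexed by columns (e.g. the PCA/ICA codes above); for spectrum codes the indexing runs over frequencies instead. The names are illustrative, not from a released implementation.

```python
import numpy as np

def single_basis(H, j):
    """Keep only style basis S_j: set S_i = 0 for all i != j."""
    out = np.zeros_like(H)
    out[:, j] = H[:, j]
    return out

def intervene(H, j, I):
    """Scale style basis S_j by the intervention factor I,
    leaving all other bases unchanged."""
    out = H.copy()
    out[:, j] = I * out[:, j]
    return out

def mix(codes, index_sets):
    """Combine basis subsets from several styles, e.g. the color basis
    of one style with the stroke basis of another."""
    out = np.zeros_like(codes[0])
    for H, idx in zip(codes, index_sets):
        out[:, idx] = H[:, idx]
    return out
```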

4 Experiments

We demonstrate the performance of our method using the fast neural style transfer algorithm [6, 13, 21]. We take the feature map at 'relu4_1' of the pre-trained VGG-19 model [18] as input to our style decomposition method: we tried every single activation layer of VGG-19 and found 'relu4_1' the most suitable for style transfer (see Fig. 9).
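For reference, the feature map can be obtained as follows. This is a sketch using a recent torchvision; the layer index for 'relu4_1' (20 in the .features sequence of torchvision's VGG-19) is our assumption and should be verified against the checkpoint actually used.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# 'relu4_1' is the activation right after conv4_1, i.e. index 20 of
# torchvision's vgg19().features (an assumption to verify per checkpoint).
vgg_to_relu4_1 = models.vgg19(weights="IMAGENET1K_V1").features[:21].eval()

preprocess = transforms.Compose([
    transforms.Resize(512),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    img = preprocess(Image.open("style.jpg").convert("RGB")).unsqueeze(0)
    F_s = vgg_to_relu4_1(img)                  # shape (1, 512, h, w)
F_s = F_s.squeeze(0).permute(1, 2, 0).numpy()  # to (h, w, c) for decomposition
```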

4.1 Inferiority of Feature Map

Here, we demonstrate that applying the style control function g directly on the feature map space is not suitable, because the feature map space is possibly formed by a complicated mixture of style bases. To check whether a basis of the feature map \(\mathcal {F}\) can serve as style bases, we experimented on the channels of \(\mathcal {F}\) and the pixels of \(\mathcal {F}\).

Fig. 2. (a) ① content image (Stata Center); ② style image ("The Great Wave off Kanagawa" by Katsushika Hokusai); ③ styled image by traditional neural style transfer; ④⑤⑥ results of implementing the control function directly on the feature map \(\mathcal {F}\). Specifically, amplifying some pixels of \(\mathcal {F}\) generates ④⑤, and preserving a subset of channels of \(\mathcal {F}\) generates ⑥. (Color figure online)

Channels of \(\mathcal {F}\). Assume styles are encoded in a space \(\mathcal {H}\) spanned by style bases \(\{S_{1},S_{2},\dots ,S_{n}\}\). A good formulation of \(\mathcal {H}\) implies that if two styles are intuitively similar in some respect, there should be at least one style basis \(S_{i}\) such that the projections of the two styles onto \(S_{i}\) are close in Euclidean distance. Based on this assumption, we generate the subset C of channels of \(\mathcal {F}\) that could possibly represent a color basis with a semi-supervised method, using the style images in Fig. 4(a–c). Note that the Chinese paintings and pen sketches (Fig. 4(a, c)) share the same color style, while the oil paintings (Fig. 4(b)) have an exclusive one. We iteratively find the largest channel set \(C_{max}\) (384 channels) whose K-means [16] clustering result conforms to the following clustering standard for a color basis:

  • No cluster contains both an oil painting and a Chinese painting or pen sketch.

  • One cluster contains only one or two points, since K-means cannot adapt the number of clusters and the cluster number is fixed at 3.

However, if we transfer style using only \(C_{max}\), the styled image (Fig. 2(a)⑥) is not well stylized and does not reflect any color style of the style image (Fig. 2(a)②), which suggests that the channels of \(\mathcal {F}\) are not suitable to form independent style bases. In contrast, with the proper decomposition functions f introduced in Table 1, a good formulation of the latent space \(\mathcal {H}\) can meet the above clustering standard and generate reasonably styled images.

Pixels of \(\mathcal {F}\). We apply an intervention \(I=2.0\) to certain regions of each channel of \(\mathcal {F}\) to see whether any intuitive style basis is amplified. The styled images are shown in Fig. 2(a)④⑤; the rectangles in the style image (Fig. 2(a)②) mark the corresponding intervened regions. Compared with the styled image using [8] (Fig. 2(a)③), when the small waves in the style image are intervened on, the effect of small blue circles in the styled image is amplified (green rectangle), and when the large waves are intervened on, the effect of long blue curves is amplified (red rectangle). Implementing the control function g on the pixels of the channels of \(\mathcal {F}\) is quite similar to the spatial control of neural style transfer [9], which guides the transfer via a binary or real-valued mask on a region of the feature map; yet it fails to computationally decompose the style bases.

Fig. 3. (a) the original content and style images; (b) styled image by traditional neural style transfer; (c–h) results of preserving one style basis by different methods: (c–d) FFT; (e–f) PCA; (g–h) ICA, where (c, e, g) aim to transfer the color of the style image and (d, f, h) aim to transfer the stroke.

4.2 Transfer by Single Style Basis

To check whether \(\mathcal {H}\) is composed of style bases, we transfer style with a single style basis preserved. We conduct experiments on \(\mathcal {H}\) formulated by the different decomposition functions, including FFT, DCT, PCA and ICA, with details given in Sect. 3. The results are shown in Fig. 3. As shown in Fig. 3(c), (d), the DC component represents only the color of the style while the remaining frequency components represent the wave-like stroke, which indicates that FFT is feasible for style decomposition. The result of DCT is quite similar to that of FFT, with the DC component representing color and the rest representing stroke.
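The DC/non-DC split of Fig. 3(c, d) is easy to reproduce. A sketch, assuming \(\mathcal {F}\) has shape (h, w, c) and NumPy's FFT convention, where the DC term sits at index (0, 0):

```python
import numpy as np

def split_dc(F):
    """Separate the DC component (color) from all remaining
    frequencies (stroke), channel-wise."""
    H = np.fft.fft2(F, axes=(0, 1))
    dc = np.zeros_like(H)
    dc[0, 0, :] = H[0, 0, :]                 # zero-frequency term per channel
    color = np.fft.ifft2(dc, axes=(0, 1)).real
    stroke = np.fft.ifft2(H - dc, axes=(0, 1)).real
    return color, stroke
```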

Besides, we analyze the spectrum space via Isomap [20], which analytically demonstrates the effectiveness and robustness of the spectrum-based methods. Color (low frequency) and stroke (high frequency) form the X-axis and Y-axis of the 2-dimensional plane respectively, where every style is encoded as a point. We experiment on 3 artistic styles, Chinese painting, oil painting and pen sketch, each containing 10 pictures, shown in Fig. 4(a–c). The Chinese paintings and pen sketches share a similar color style that is sharply distinguished from the oil paintings', while the strokes of the three artistic styles are quite different from each other, which conforms to the result shown in Fig. 4(d).
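The embedding itself is standard. A sketch with scikit-learn, assuming each style image is represented by its flattened (e.g. magnitude) spectrum; how exactly the spectra are vectorized is our assumption:

```python
import numpy as np
from sklearn.manifold import Isomap

def embed_spectra(spectra, n_neighbors=5):
    """spectra: array of shape (n_images, n_features), one flattened
    real-valued spectrum per style image. Returns a 2-d point per image."""
    iso = Isomap(n_neighbors=n_neighbors, n_components=2)
    return iso.fit_transform(spectra)
```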

Via PCA, the first principal component (Fig. 3(e)) fails to separate color and stroke, while the remaining components (Fig. 3(f)) fail to represent any style basis, which indicates that PCA is not suitable for style decomposition.

Fig. 4. (a) Chinese paintings; (b) oil paintings (by Leonid Afremov); (c) pen sketches; (d) low-dimensional projections via Isomap of the spectra of styles (a–c); (e) low-dimensional projections via Isomap of the spectra of a larger set of style images. The displayed size of each image carries no information; it is chosen only to prevent overlap.

The results of ICA (Fig. 3(g), (h)) are as good as those of FFT but show notable differences: the color basis (Fig. 3(g)) is murkier than Fig. 3(c), while the stroke basis (Fig. 3(h)) retains the profile of the curves with less stroke color than Fig. 3(d). The stroke basis consists of \(S_{arg_{i}}, i \in [0,n-1] \cup [c-n,c-1]\), and the remaining components form the color basis, where arg is the ascending ordering of \(A^{sum} \in \mathcal {R}^{c}\) and \(A^{sum}_i\) is the sum of the i-th column of the mixing matrix A.
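In code, this selection rule reads as follows (a sketch; A is the mixing matrix returned by FastICA above, and n is a tuning parameter):

```python
import numpy as np

def ica_stroke_color_split(A, n):
    """Order components by the ascending column sums of the mixing
    matrix A; the first n and last n form the stroke basis, the rest
    the color basis."""
    arg = np.argsort(A.sum(axis=0))          # ascending order of A^sum
    stroke_idx = np.concatenate([arg[:n], arg[-n:]])
    color_idx = arg[n:-n]
    return stroke_idx, color_idx
```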

4.3 Transfer by Intervention

We intervene on the stroke basis via the control function g to demonstrate controllable, diversified styles and to distinguish the stroke bases obtained by the spectrum-based methods from those obtained by ICA.

As shown in Fig. 5, the strokes of 'wave' are curves in light and dark blue, while the strokes of 'aMuse' are black bold lines and coarse powder-like dots. Intervention using the spectrum-based methods affects both the color and the outline of the strokes, while intervention using ICA influences only the stroke outline, not its color.

4.4 Transfer by Mixing

The current style mixing method, interpolation, cannot mix the style bases of different styles: the styles are mixed as a whole no matter how the interpolation weights are set (Fig. 6(g–i)), which limits the diversity of style mixing. Based on the success of the spectrum-based methods and ICA in style decomposition, we experiment with mixing the stroke of 'wave' with the color of 'aMuse' to check whether such a newly compounded artistic style can be transferred to the styled image.

Specifically, for ICA, we not only replace the color basis of 'wave' with that of 'aMuse' but also replace the rows of the mixing matrix A corresponding to the exchanged signals. Both the spectrum-based methods (Fig. 6(d–f)) and ICA (Fig. 6(j–l)) mix style bases of different styles well. Moreover, we can intervene on the style bases while mixing, which further enhances the diversity of style mixing.
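A sketch of this exchange under scikit-learn's FastICA convention (\(\mathcal {F} \approx S A^T\), so the exchanged part of A is indexed by columns here; whether rows or columns are swapped depends on the chosen orientation of A). Both styles are assumed to share the same feature map shape, and the ICA means are omitted for brevity.

```python
import numpy as np

def mix_ica(S_wave, A_wave, S_amuse, A_amuse, color_idx):
    """Swap the color components of 'wave' for those of 'aMuse',
    exchanging both the source signals and the matching part of the
    mixing matrix, then map back to the feature map space."""
    S_mix, A_mix = S_wave.copy(), A_wave.copy()
    S_mix[:, color_idx] = S_amuse[:, color_idx]
    A_mix[:, color_idx] = A_amuse[:, color_idx]
    return S_mix @ A_mix.T
```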

Fig. 5. (a, d) the strokes of the style images 'wave' and 'aMuse'; (b, c, e, f) results with different interventions I on the stroke basis: (b, e) use the spectrum-based method and (c, f) use ICA. (Color figure online)

4.5 Sketch Style Transfer

The picture-to-sketch problem tests whether a computer can understand and represent the concept of objects both abstractly and semantically. The proposed controllable neural style transfer tackles an obstacle of current state-of-the-art methods [2, 19] caused by inconsistency in sketch style, because diversified styles can in turn increase the style diversity of the output images. Moreover, as shown in Fig. 7, our method can control the level of abstraction, automatically preserving the major semantic details and, at lower abstraction levels, the minor ones as well. Our method does not require a vector sketch dataset; as a result, we cannot generate sketches in a stroke-by-stroke drawing manner [2, 19].

4.6 Chinese Painting Style Transfer

Chinese painting is a distinctive artistic style that uses much less color than Western painting and conveys the artistic conception through strokes. As shown in Fig. 8, with effective control over strokes via our method, the Chinese-painting-styled image can be either misty, known as the freehand-brush style, or meticulous, known as the fine-brush style.

Fig. 6. (b), (c) styled images of a single style; (g–i) interpolation mixing, where \(I_1\) and \(I_2\) are the weights of 'wave' and 'aMuse'; (d–f, j–l) results of mixing the color of 'aMuse' with the stroke of 'wave', where I is the intervention on the stroke of 'wave': (d–f) use FFT; (j–l) use ICA.

Fig. 7. Picture-to-sketch using style transfer and binarization. (a) content image and style image; (b–d) styled images. From (b) to (d), the number of strokes increases as more details of the content image are restored.

Fig. 8. Neural style transfer of a Chinese painting with stroke control. (a) content image and style image (by Zaixin Miao); (b–d) styled images. From (b) to (d), the strokes become more detailed, gradually turning the freehand style into the fine-brush style.

Fig. 9. Styled images produced by [8] with the same number of epochs, using each single activation layer of the pre-trained VGG19.

5 Conclusion

Artistic styles are made of basic elements, each with distinct characteristics and functionality. Developing a style decomposition method facilitates quantitative control over which styles from one or more images are transferred to a natural image, while keeping the basic content of the natural image. In this paper, we proposed a novel computational decomposition method and demonstrated its strengths via extensive experiments. The method can serve as a computational module embedded in neural style transfer algorithms. We implemented the decomposition function by spectrum transforms or latent variable models, which enabled us to computationally and continuously control styles by linear mixing or intervention. Experiments showed that our method enhances the flexibility of style mixing and the diversity of stylization. Moreover, our method can be applied to picture-to-sketch problems by transferring the sketch style, and it captures the key features of, and facilitates stylization in, the Chinese painting style.