1 Introduction

Creating artwork from scratch is a herculean task for people without years of artistic experience. For novices to the world of art, the task would be far more approachable if there were clear creation steps to follow. Take a watercolor painting, for example. One may be guided to first sketch the outline with pencils, then fill in areas with large brushes, and finally refine details such as color gradients and shadows with small brushes. At each stage, certain aspects (i.e., variations) of the overall design are determined and carried forward to the final piece of art.

Inspired by these observations, we aim to model workflows for creating art, targeting two related artistic applications: multi-stage artwork creation and multi-stage artwork editing. As shown in Fig. 1, multi-stage artwork generation guides the user through the creation process, starting from the first stage and then selecting a variation at each subsequent creation stage. In multi-stage artwork editing, we are given a final piece of artwork and infer all the intermediate creation stages, enabling the user to perform different types of edits at various stages and propagate them forward to modify the final artwork.

Fig. 1.

We model the sequential creation stages for a given artistic workflow by learning from examples. At test time, our framework can guide the user to create new artwork by sampling different variations at each stage (left), and infer the creation stages of existing artwork to enable the user to perform natural edits by exploring variations at different stages (middle and right).

Existing artwork creation approaches use conditional generative adversarial networks (conditional GANs)  [20, 28, 49] to produce artwork according to user-provided input signals. These methods take user inputs such as a sketch image  [7] or segmentation mask  [34, 43] and perform single-step generation to synthesize the final artwork. To make the creation process more tractable, recent frameworks adopt a multi-step generation strategy for tasks such as fashion simulation  [38] and sketch-to-image synthesis  [12]. However, these approaches typically do not support editing existing artwork. To manipulate an existing artwork image without degrading its quality, numerous editing schemes  [4, 31, 36, 46, 47] have been proposed in the past decade. Nevertheless, these methods are either designed for specific applications  [31, 36, 46] or lack flexible control over the editing procedure because of their single-stage generation strategy  [4, 47].

In this paper, we develop a conditional GAN-based framework that 1) synthesizes novel artwork via multiple creation stages, and 2) edits existing artwork at various creation stages. Our approach consists of an artwork generation module and a workflow inference module. The artwork generation module learns to emulate each artistic stage with a series of multi-modal (i.e., one-to-many) conditional GAN networks  [49]. Each network in the artwork generation module uses a stage-specific latent representation to encode the variation introduced at the corresponding creation stage. At test time, the user determines the latent representation at each stage sequentially, and the artwork generation module synthesizes the desired artwork image.

To enable editing existing artwork, we also design an inference module that learns to sequentially infer the corresponding images at all intermediate stages. We assume a one-to-one mapping from the final to intermediate stages, and use a series of uni-modal conditional GANs  [20] to perform this inference. At test time, we predict the stage-specific latent representations from the inferred images at all intermediate stages. Depending on the desired type of edit, the user can edit any stage to manipulate the stage-specific image or latent representation and regenerate the final artwork from the manipulated representations.

We observe that directly applying our workflow inference module can cause the reconstructed image to differ slightly from the initially provided artwork. Such a reconstruction problem is undesirable since the user expects the generated image to be unchanged when no edits are performed. To address this problem, we design an optimization procedure along with learning-based regularization to refine the reconstructed image. This optimization aims to minimize the appearance difference between the reconstructed and the original artwork image, while the learning-based regularization seeks to guide the optimization process and alleviate overfitting.

We collect three datasets with different creation stages to demonstrate the use cases of our approach: face drawing, anime drawing, and chair design. We demonstrate the creation process guided by the proposed framework and present editing results made by artists. For quantitative evaluations, we measure the reconstruction error and Fréchet inception distance (FID)  [14] to validate the effectiveness of the proposed optimization and learning-based regularization scheme. We make the code and datasets publicly available to facilitate future research (see footnote 1).

In this work, we make the following three contributions:

  • We propose an image generation and editing framework which models the creation workflow for a particular type of artwork.

  • We design an optimization process and a learning-based regularization function for the reconstruction problem in the editing scenario.

  • We collect three different datasets containing various design stages and use them to evaluate the proposed approach.

2 Related Work

Generative Adversarial Networks (GANs). GANs  [2, 5, 13, 22, 23] model the real image distribution via adversarial learning schemes. Typically, these methods encode the distribution of real images into a latent space by learning the mapping from latent representations to generated images. To make the latent representation more interpretable, InfoGAN  [8] learns to disentangle the latent representations by maximizing mutual information. Similar to the FineGAN  [37] and VON  [50] methods, our approach learns to synthesize an image via multiple generation stages and encodes different types of variation into separate latent spaces at various stages. Our framework extends these approaches to also enable image editing of different types of artwork.

Conditional GANs. Conditional GANs learn to synthesize the output image by referencing input context such as text descriptions  [44], scene graphs  [42], segmentation masks  [15, 34], and images  [20]. According to the type of mapping from the input context to the output image, conditional GANs can be categorized as uni-modal (one-to-one)  [20, 48] or multi-modal (one-to-many)  [17, 29, 32, 49]. Since we assume there are many possible variations involved in the generation at each stage of the artwork creation workflow, we use multi-modal conditional GANs to synthesize the next-stage image, and uni-modal conditional GANs to infer the prior-stage image.

Image Editing. Image editing frameworks enable user-guided manipulation without degrading the realism of the edited images. Recently, deep-learning-based approaches have made significant progress on various image editing tasks such as colorization  [19, 26, 45, 46], image stylization  [16, 31], image blending  [18], image inpainting  [33, 35], layout editing  [30], and face editing  [6, 9, 36]. Unlike these task-specific methods, the task-agnostic iGAN  [47] and GANPaint  [4] models map the variation in the training data onto a low-dimensional latent space using GAN models. Editing can be conducted by manipulating the representation in the learned latent space. Different from iGAN and GANPaint, we develop a multi-stage generation method to model different types of variation at various stages.

Optimization for Reconstruction. In order to embed an existing image to the latent space learned by a GAN model, numerous approaches  [10, 25, 49] propose to train an encoder to learn the mapping from images to latent representations. However, the generator sometimes fails to reconstruct the original image from the embedded representations. To address this problem, optimization-based methods are proposed in recent studies. Abdal et al.  [1] and Bau et al.  [4] adopt the gradient descent scheme to optimize the latent representations and modulations for the feature activations, respectively. The goal is to minimize the appearance distance between the generated and original images. We also utilize the optimization strategy to reconstruct existing artwork images. In addition, we introduce a learning-based regularization function to guide the optimization process.

Regularizations for Deep Learning. These approaches  [11, 24, 27, 39, 40, 41] aim to prevent the learned function from overfitting to a specific solution. In particular, the weight decay scheme  [24] regularizes training by constraining the magnitude of the learned parameters. Nevertheless, regularization methods typically involve hyper-parameters that require meticulous hand-tuning to be effective. The MetaReg  [3] method designs a learning-to-learn algorithm to automatically find the hyper-parameters of the weight decay regularization for the domain generalization problem. Our proposed learning-based regularization is trained with a similar strategy but different objectives to alleviate the overfitting problem described in Sect. 3.2.

3 Method

Our approach is motivated by the sequential creation stages of artistic workflows. We build a model that enables a user to 1) follow the creation stages to generate novel artwork and 2) conduct edits at different stages. Our framework is composed of an artwork generation module and a workflow inference module. As shown in Fig. 2(a), the artwork generation module learns to model the creation stages of the artistic workflow. To enable editing an existing piece of art, the workflow inference module is trained to sequentially infer the corresponding images at all creation stages. When editing existing artwork, it is important that the result remains as close as possible to the original artwork and that only the desired design decisions are altered. To this end, we design an optimization process together with a learning-based regularization that allows faithful reconstruction of the input image. We provide the implementation and training details for each component of the proposed framework in the supplemental material.

Fig. 2.

Overview of the proposed framework. (a) Given N creation stages (\(N=3\) in this example), our approach consists of \(N-1\) workflow inference networks and \(N-1\) artwork generation networks. The workflow inference module produces the intermediate results of the input artwork at all creation stages. The artwork generation module computes the latent representation z and transformation parameter \(z^\mathrm {Ada}\) for each stage, then reconstructs the input artwork images from these transformation parameters. (b) The latent encoder \(E_i^G\) extracts the stage-specific latent representation z from the example, and computes the transformation parameters \(z^\mathrm {Ada}\) for the AdaIN normalization layers (c channels). (c) We introduce a cycle consistency loss for each stage to prevent the artwork generation model (which accounts for detail coloring in this example) from memorizing the variation determined at the previous stages (sketching and flat coloring).

3.1 Artwork Generation and Workflow Inference

Preliminaries. The proposed approach is driven by the number of stages in the training dataset and operates in a supervised setting with aligned training data. Denoting N as the number of stages, the training dataset comprises a set of image groups \(\{(x_1,x_2,\cdots ,x_{N})\}\), where \(x_N\) denotes the artwork image at the final stage. We construct the proposed framework with \(N-1\) workflow inference models \(\{G^I_i\}^{N-1}_{i=1}\) as well as \(N-1\) artwork generation models \(\{(E^G_i,G^G_i)\}^{N-1}_{i=1}\). We show an example with 3 stages in Fig. 2(a). Since the proposed method is based on the observation that artists sequentially determine a design factor (i.e., variation) at each stage, we assume that the generation from the image at the prior stage to the later one is multi-modal (i.e., a one-to-many mapping), while the inference from the final stage to the previous stages is uni-modal (i.e., a one-to-one mapping).
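To make this structure concrete, the following is a minimal PyTorch-style sketch of how the \(N-1\) inference and generation networks could be organized; the class and factory names are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch (assumed names, not the released code): the framework holds
# one inference network G^I_i and one (E^G_i, G^G_i) pair per pair of consecutive stages.
import torch.nn as nn

class ArtworkWorkflowModel(nn.Module):
    def __init__(self, num_stages, make_inference_net, make_latent_encoder, make_generator):
        super().__init__()
        assert num_stages >= 2
        self.num_stages = num_stages
        # G^I_i: uni-modal mapping from the (i+1)-th stage image back to the i-th stage.
        self.inference_nets = nn.ModuleList(
            [make_inference_net() for _ in range(num_stages - 1)])
        # E^G_i / G^G_i: multi-modal mapping from the i-th stage to the (i+1)-th stage,
        # conditioned on a stage-specific latent code.
        self.latent_encoders = nn.ModuleList(
            [make_latent_encoder() for _ in range(num_stages - 1)])
        self.generators = nn.ModuleList(
            [make_generator() for _ in range(num_stages - 1)])
```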

Artwork Generation. The artwork generation module aims to mimic the sequential creation stages of the artistic workflow. Since we assume the generation from the prior stages to the following ones is multi-modal, we construct a series of artwork generation networks by adopting the multi-modal conditional GAN approach in BicycleGAN  [49] and the network architecture of MUNIT  [17]. As shown in Fig. 2(a) and (b), each artwork generation model contains two components: latent encoder \(E^G_i\) and generator \(G^G_i\). The latent encoder \(E^G_i\) encodes the variation presented at the i-th stage in a stage-specific latent space. Given an input image \(x_i\) and the corresponding next-stage image \(x_{i+1}\), the latent encoder \(E^G_i\) extracts the stage-specific latent representation \(z_i\) from the image \(x_{i+1}\), and computes the transformation parameter \(z^\mathrm {Ada}_i\). The generator \(G^G_i\) then takes the current-stage image \(x_i\) as input and modulates the activations through the AdaIN normalization layers  [49] with the transformation parameter \(z^\mathrm {Ada}_i\) to synthesize the next-stage image \(\hat{x}^G_{i+1}\), namely

$$\begin{aligned} \hat{x}^G_{i+1}=G^G_i(x_i,E^G_i(x_{i+1}))~~~~i\in \{1,2,\cdots ,N-1\}. \end{aligned}$$
(1)

We utilize the objective introduced in the BicycleGAN  [49], denoted as \(L^\mathrm {bicycle}_i\), for training the generation model. The objective \(L^\mathrm {bicycle}_i\) is detailed in the supplementary material.
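The sketch below illustrates the assumed interface of the latent encoder \(E^G_i\): it maps the next-stage image to a low-dimensional latent code z and then to the AdaIN scale and bias parameters \(z^\mathrm {Ada}\). The 8c-dimensional output follows the description in Sect. 3.2; the layer sizes and everything else are simplifications, not the exact architecture.

```python
# Simplified sketch of E^G_i (assumed architecture): an image encoder producing the
# stage-specific latent code z, followed by an MLP producing the AdaIN parameters.
import torch.nn as nn

class LatentEncoder(nn.Module):
    def __init__(self, z_dim=8, adain_channels=64):
        super().__init__()
        # Four AdaIN layers with `adain_channels` channels each need a scale and a
        # bias per channel, i.e., an 8c-dimensional transformation parameter.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 4, 2, 1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.to_z = nn.Linear(64, z_dim)
        self.mlp = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 8 * adain_channels))

    def forward(self, x_next):
        z = self.to_z(self.backbone(x_next))
        z_adain = self.mlp(z)   # scales and biases for the generator's AdaIN layers
        return z, z_adain

# Usage corresponding to Eq. (1), with `generator` modulating its AdaIN layers:
#   z_i, z_adain_i = encoder(x_next)
#   x_hat_next = generator(x_i, z_adain_i)
```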

Ideally, the artwork generation network corresponding to a given stage would encode only new information (i.e., the incremental variation), preserving prior design decisions from earlier stages. To encourage this property, we impose a cycle consistency loss that enforces the generation network to encode only the variation introduced at the current stage, as shown in Fig. 2(c). Specifically, we use the inference model \(G^I_i\) to map the generated next-stage image back to the current stage. The mapped image should be identical to the original image \(x_i\) at the current stage, namely

$$\begin{aligned} L^\mathrm {c}_i=\Vert G^I_i(G^G_i(x_i,z_i))-x_i\Vert _1~~~~z_i\sim N(0,1). \end{aligned}$$
(2)

Therefore, the overall training objective for the artwork generation model at the i-th stage is

$$\begin{aligned} L^G_i=L^\mathrm {bicycle}_i + \lambda ^cL^c_i, \end{aligned}$$
(3)

where \(\lambda ^c\) controls the importance of the cycle consistency loss.
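A sketch of how the stage objective in Eq. (3) could be assembled is shown below; `bicycle_loss` stands in for the BicycleGAN objective (detailed in the supplementary material), and the encoder's MLP is assumed to map the sampled latent code to AdaIN parameters, matching the earlier encoder sketch.

```python
# Hedged sketch of the per-stage generation objective (Eqs. 2-3). The helper
# networks and `bicycle_loss` are assumed to be defined elsewhere.
import torch
import torch.nn.functional as F

def generation_objective(gen_i, inf_i, enc_i, bicycle_loss,
                         x_i, x_next, z_dim=8, lambda_c=10.0):
    # Sample a stage-specific latent code and map it to AdaIN parameters.
    z_i = torch.randn(x_i.size(0), z_dim, device=x_i.device)
    x_hat_next = gen_i(x_i, enc_i.mlp(z_i))
    # Cycle consistency (Eq. 2): mapping the synthesized next-stage image back to
    # the current stage should recover x_i, so the generator encodes only the
    # variation introduced at this stage.
    loss_cycle = F.l1_loss(inf_i(x_hat_next), x_i)
    # Total objective (Eq. 3).
    return bicycle_loss(x_i, x_next) + lambda_c * loss_cycle
```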

Workflow Inference. To enable the user to edit the input artwork \(x_N\) at different creation stages, our inference module aims to hallucinate the corresponding images at all previous stages. For the i-th stage, we use a uni-modal conditional GAN network  [20] to generate the image at the i-th stage from the image at the (\(i+1\))-th stage, namely

$$\begin{aligned} \hat{x}^I_{i}=G^I_i(x_{i+1})~~~~i\in \{1,2,\cdots ,N-1\}. \end{aligned}$$
(4)

During the training phase, we apply the hinge version of GAN loss  [5] to ensure the realism of the generated image \(\hat{x}^I_{i}\). We also impose an \(\ell _1\) loss between the synthesized image \(\hat{x}^I_{i}\) and the ground-truth image \(x_i\) to stabilize and accelerate the training. Hence the training objective for the inference network at the i-th stage is

$$\begin{aligned} L^I_i=L^\mathrm {GAN}_i(\hat{x}^I_i) + \lambda ^\mathrm {1}\Vert \hat{x}^I_{i}-x_i\Vert _1, \end{aligned}$$
(5)

where \(\lambda ^\mathrm {1}\) controls the importance of the \(\ell _1\) loss.
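A minimal sketch of the inference-network objective in Eq. (5) follows; the discriminator, its own hinge loss, and the alternating adversarial updates are omitted and assumed to exist elsewhere.

```python
# Sketch of the inference objective (Eq. 5): generator-side hinge GAN loss on the
# hallucinated previous-stage image plus an L1 term against the ground truth.
import torch.nn.functional as F

def inference_objective(inf_i, discriminator, x_next, x_i, lambda_1=10.0):
    x_hat_i = inf_i(x_next)                      # \hat{x}^I_i = G^I_i(x_{i+1})
    loss_gan = -discriminator(x_hat_i).mean()    # hinge loss, generator side
    loss_l1 = F.l1_loss(x_hat_i, x_i)
    return loss_gan + lambda_1 * loss_l1
```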

Test-Time Inference. As shown in Fig. 2(a), given an input artwork image \(x_N\), we sequentially obtain the images at all previous stages \(\{\hat{x}^I_i\}^{N-1}_{i=1}\) using the workflow inference module (blue block). We then use the artwork generation module (green block) to extract the latent representations \(\{z_i\}^{N-1}_{i=1}\) from these inferred images and the input artwork, and compute the transformation parameters \(\{z^\mathrm {Ada}_i\}^{N-1}_{i=1}\). Combining the first-stage image \(x^G_1=\hat{x}^I_1\) and the transformation parameters \(\{z^\mathrm {Ada}_i\}^{N-1}_{i=1}\), the generation module consecutively generates the images \(\{\hat{x}^G_i\}^N_{i=2}\) at the following stages. The user can choose the stage to manipulate based on the type of edit desired. Edits at the i-th stage can be performed by either manipulating the latent representation \(z_i\) or directly modifying the image \(x^G_i\). For example, in Fig. 2(a), the user can choose to augment the representation \(z_1\) to adjust the flat coloring. After editing, the generation module generates the new artwork image at the final stage.
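This procedure can be summarized in code roughly as follows, using the container assumed in the earlier sketch; editing hooks are indicated by comments rather than implemented.

```python
# Sketch of test-time inference and regeneration. Editing would replace z_adain[i]
# or an intermediate image before the forward (regeneration) loop.
import torch

@torch.no_grad()
def infer_and_regenerate(model, x_final):
    n = model.num_stages
    # Backward pass: hallucinate x_{N-1}, ..., x_1 from the input artwork x_N.
    images = {n: x_final}
    for i in range(n - 1, 0, -1):
        images[i] = model.inference_nets[i - 1](images[i + 1])
    # Forward pass: extract z^Ada_i from each next-stage image and regenerate.
    z_adain = []
    x = images[1]                                  # x^G_1 = \hat{x}^I_1
    for i in range(1, n):
        _, z_ada_i = model.latent_encoders[i - 1](images[i + 1])
        z_adain.append(z_ada_i)
        x = model.generators[i - 1](x, z_ada_i)
    return images, z_adain, x   # x should approximate x_final before any edit
```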

Fig. 3.

Motivation of the AdaIN optimization and learning-based regularization. The proposed AdaIN optimization and learning-based regularization are motivated by two observations: 1) using the computed transformation parameters \(z^\mathrm {Ada}\) in Fig. 2 does not reconstruct the original input image well (red outline in the first row), and 2) the AdaIN optimization may degrade the quality of the editing results (yellow outline in the second row). (Color figure online)

3.2 Optimization for Reconstruction

As illustrated in Sect. 3.1, the artwork generation module would ideally reconstruct the input artwork image (i.e., \(\hat{x}^G_N=x_N\)) from the transformation parameters \(\{z^\mathrm {Ada}_i\}^{N-1}_{i=1}\) before the user performs an edit. However, the reconstructed image \(\hat{x}^G_N\) may be slightly different from the input image \(x_N\), as shown in the first row of Fig. 3. Therefore, we adopt an AdaIN optimization algorithm to optimize the transformation parameters \(\{z^\mathrm {Ada}_i\}^{N-1}_{i=1}\) of the AdaIN normalization layers in the artwork generation models. The goal of the AdaIN optimization is to minimize the appearance distance between the reconstructed and input images.

While this does improve the reconstruction of the input image, we observe that the optimization procedure causes the generation module to memorize input image details, which degrades the quality of some edited results, as shown in the second row of Fig. 3. To mitigate this memorization, we propose a learning-based regularization to improve the AdaIN optimization.

AdaIN Optimization. The AdaIN optimization approach aims to minimize the appearance distance between the reconstructed image \(\hat{x}^G_N\) and the input artwork image \(x_N\). There are several choices of what to optimize to improve reconstruction: we could optimize the parameters of the generation models or the extracted representations \(\{z_i\}^{N-1}_{i=1}\). Optimizing the model parameters is inefficient because of the large number of parameters to be updated. On the other hand, we find that optimizing the extracted representations is ineffective, as validated in Sect. 4.3. As a result, we choose to optimize the transformation parameters \(\{z^\mathrm {Ada}_i\}^{N-1}_{i=1}\) of the AdaIN normalization layers in the generation models, namely the AdaIN optimization. Note that a recent study  [1] adopts a similar strategy.

We conduct the AdaIN optimization for each stage sequentially. The transformation parameters at the early stages are optimized and then fixed for the optimization at the later stages. Except for the last stage (i.e., \(i=N-1\)), which uses the input artwork image \(x_N\), the image \(\hat{x}^I_{i+1}\) inferred by the inference model serves as the reference image \(x^\mathrm {ref}\) for the optimization. For each stage, we first use the latent encoder \(E^G_i\) to compute the transformation parameter \(z^\mathrm {Ada}_i\) from the reference image for generating the image. Since there are four AdaIN normalization layers with c channels in each artwork generation model, the dimension of the transformation parameter is \(1\times {8c}\) (a scale and a bias term for each channel). Then we follow the standard gradient descent procedure to optimize the transformation parameters with the goal of minimizing the loss function \(L^\mathrm {Ada}\), which measures the appearance distance between the image \(\hat{x}^G_i\) synthesized by the generator \(G^G_i\) and the reference image \(x^\mathrm {ref}\). The loss function \(L^\mathrm {Ada}\) is a combination of the pixel-wise \(\ell _1\) loss and the VGG-16 perceptual loss  [21], namely

$$\begin{aligned} L^\mathrm {Ada}(\hat{x}^G_i,x^\mathrm {ref}) = \Vert \hat{x}^G_i - x^\mathrm {ref} \Vert _1 + \lambda ^pL^p(\hat{x}^G_i,x^\mathrm {ref}), \end{aligned}$$
(6)

where \(\lambda ^p\) is the importance term. We summarize the AdaIN optimization in Algorithm 1. Note that in practice, we optimize an incremental term \(\delta ^\mathrm {Ada}_i\) added to the transformation parameter \(z^\mathrm {Ada}_i\), instead of updating the parameter itself.
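A sketch of the per-stage AdaIN optimization is given below; `perceptual_loss` denotes a VGG-16 feature loss and, as noted above, the optimization variable is the incremental term \(\delta ^\mathrm {Ada}_i\) rather than \(z^\mathrm {Ada}_i\) itself. The step count, learning rate, and the exact form of the optional regularization term are illustrative assumptions.

```python
# Hedged sketch of the AdaIN optimization for one stage (Eq. 6 / Algorithm 1).
import torch
import torch.nn.functional as F

def adain_optimize_stage(gen_i, enc_i, x_i, x_ref, perceptual_loss,
                         steps=100, lr=0.1, lambda_p=1.0, w_i=None):
    with torch.no_grad():
        _, z_adain = enc_i(x_ref)                 # initial z^Ada_i from the reference
    delta = torch.zeros_like(z_adain, requires_grad=True)
    optimizer = torch.optim.SGD([delta], lr=lr)
    for _ in range(steps):
        x_hat = gen_i(x_i, z_adain + delta)
        loss = F.l1_loss(x_hat, x_ref) + lambda_p * perceptual_loss(x_hat, x_ref)
        if w_i is not None:                       # weight decay on delta (Sect. 3.2)
            loss = loss + (w_i * delta.pow(2)).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (z_adain + delta).detach()
```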

Fig. 4.

Training process for learning-based regularization. For the i-th stage (\(i=2\) in this example), we optimize the hyper-parameter \(w_i\) for the weight decay regularization (orange text) by involving the AdaIN optimization in the training process: after the incremental term \(\delta ^\mathrm {Ada}_i\) is updated via one step of AdaIN optimization and the weight decay regularization (blue arrow), the generation model should achieve improved reconstruction as well as maintain the quality of the editing result (green block). Therefore, we use the losses \(L^\mathrm {Ada},L^\mathrm {GAN}\) computed from the updated parameter to optimize the hyper-parameter \(w_i\) (red arrow). (Color figure online)

Learning-Based Regularization. Although the AdaIN optimization scheme addresses the reconstruction problem, it often degrades the quality of editing operations, as shown in the second row of Fig. 3. This is because the AdaIN optimization causes overfitting (memorization of the reference image \(x^\mathrm {ref}\)). The incremental term \(\delta ^\mathrm {Ada}_i\) for the transformation parameter \(z^\mathrm {Ada}_i\) is driven to extreme values to achieve better reconstruction, so the generator becomes sensitive to changes (i.e., edits) to the input image and produces unrealistic results.

To address the overfitting problem, we use weight decay regularization  [24] to constrain the magnitude of the incremental term \(\delta ^\mathrm {Ada}_i\), as shown in Line 6 of Algorithm 1. However, it is difficult to find a general hyper-parameter setting \(w_i\in R^{1\times {8c}}\) for the different generation stages of various artistic workflows. Therefore, we propose a learning algorithm to optimize the hyper-parameter \(w_i\). The core idea is that updating the incremental term \(\delta ^\mathrm {Ada}_i\) with the regularization \(w_i\delta ^\mathrm {Ada}_i\) should 1) improve the reconstruction and 2) maintain the realism of edits on an input image. We illustrate the proposed algorithm in Fig. 4. In each iteration of training at the i-th stage, we sample an image pair \((x_i, x_{i+1})\) and an additional input image \(x'_i\) from the training dataset. The image \(x'_i\) serves as the edited version of \(x_i\). We first use the latent encoder \(E^G_i\) to extract the transformation parameter \(z^\mathrm {Ada}_i\) from the next-stage image \(x_{i+1}\). As shown in the grey block of Fig. 4, we then update the incremental term from \(\delta ^\mathrm {Ada}_i\) to \(\tilde{\delta }^\mathrm {Ada}_i\) via one step of the AdaIN optimization and the weight decay regularization. With the updated incremental term \(\tilde{\delta }^\mathrm {Ada}_i\), we use the loss function \(L^\mathrm {Ada}\) to measure the reconstruction quality, and use the GAN loss to evaluate the realism of the editing result, namely

$$\begin{aligned} L^\mathrm {L2R}=L^\mathrm {Ada}\big (G^G_i(x_i,z^\mathrm {Ada}_i+\tilde{\delta }^\mathrm {Ada}_i),\,x_{i+1}\big ) + L^\mathrm {GAN}\big (G^G_i(x'_i,z^\mathrm {Ada}_i+\tilde{\delta }^\mathrm {Ada}_i)\big ). \end{aligned}$$
(7)

Finally, since the loss \(L^\mathrm {L2R}\) indicates the efficacy of the weight decay regularization, we optimize the hyper-parameter \(w_i\) by

$$\begin{aligned} w_i=w_i - \eta \nabla _{w_i}L^\mathrm {L2R}, \end{aligned}$$
(8)

where \(\eta \) is the learning rate of the training algorithm for the proposed learning-based regularization.
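A sketch of the training loop for the learned regularization weights is given below. As a hedged simplification, the inner update uses a couple of explicit gradient steps (so that \(w_i\) influences the resulting incremental term starting from zero), the edited input \(x'_i\) is simply another training image, and all names are illustrative.

```python
# Hedged sketch of learning the regularization hyper-parameters w_i (Eqs. 7-8).
# w_i: tensor of shape (1, 8c) with requires_grad=True; disc_i: stage discriminator.
import torch

def update_regularization_weights(w_i, gen_i, enc_i, disc_i, adain_loss,
                                  x_i, x_next, x_i_edit,
                                  inner_steps=2, inner_lr=0.1, meta_lr=1e-3):
    with torch.no_grad():
        _, z_adain = enc_i(x_next)
    delta = torch.zeros_like(z_adain, requires_grad=True)

    # Inner loop: AdaIN-optimization steps with weight decay, keeping the graph so
    # that gradients can flow back to w_i through the updates.
    cur = delta
    for _ in range(inner_steps):
        loss_inner = adain_loss(gen_i(x_i, z_adain + cur), x_next)
        grad = torch.autograd.grad(loss_inner, cur, create_graph=True)[0]
        cur = cur - inner_lr * (grad + w_i * cur)   # gradient step + weight decay

    # Outer loss L^L2R: reconstruction with the updated term plus realism of the
    # "edited" input under the same updated term (hinge loss, generator side).
    loss_rec = adain_loss(gen_i(x_i, z_adain + cur), x_next)
    loss_gan = -disc_i(gen_i(x_i_edit, z_adain + cur)).mean()
    loss_l2r = loss_rec + loss_gan

    # Gradient step on w_i (Eq. 8).
    grad_w = torch.autograd.grad(loss_l2r, w_i)[0]
    with torch.no_grad():
        w_i -= meta_lr * grad_w
    return w_i
```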

Algorithm 1: AdaIN optimization with weight decay regularization (pseudocode figure).

4 Experimental Results

4.1 Datasets

To evaluate our framework, we manually process face drawing, anime drawing, and chair design datasets. We describe details in the supplementary material.

4.2 Qualitative Evaluation

Generation. We present the generation results at all stages in Fig. 5. In this experiment, we use the testing images at the first stage as inputs and randomly sample latent representations \(z\in \{z_i\}^{N-1}_{i=1}\) at each stage of the proposed artwork generation module. The generation module sequentially synthesizes the final result via multiple stages. It successfully produces variations by sampling different random latent codes at different stages. For example, when generating anime drawings, manipulating the latent code at the final stage produces detailed color variations, such as modifying the saturation or adding highlights to the hair regions.

Editing. Figure 6 shows the results of editing the artwork images at different stages. Specifically, after the AdaIN optimization reconstructs the testing image at the final stage (first row), we re-sample the representations \(z\in \{z_i\}^{N-1}_{i=1}\) at various stages. Our framework synthesizes the final artwork such that its appearance changes only with respect to the stage whose latent code is re-sampled. For example, when editing face drawings, re-sampling representations at the flat coloring stage only affects the hair color, while maintaining the haircut style and details.

Fig. 5.

Results of image generation from the first stage. We use the first-stage testing images as input and randomly sample the latent representations to generate the image at the final stage.

Fig. 6.

Re-sampling latent representation at each stage. After we use the AdaIN optimization process to reconstruct the input image (1st row), we edit the reconstructed image by re-sampling the latent representations at various stages.

To evaluate the interactivity of our system, we also asked professional artists to edit some example sketches (Fig. 7). First, we use the proposed framework to infer the initial sketch from the input artwork image. Given the artwork image and the corresponding sketch, we then ask an artist to modify the sketch manually. For the edited sketch (second row), we highlight the edits with red outlines. This experiment confirms that the proposed framework enables artists to adjust only some stages of the workflow, controlling only the desired aspects of the final synthesized image. Additional artistic edits are shown in Fig. 1.

Fig. 7.

Results of artistic editing. Given an input artwork image, we ask the artist to edit the inferred sketch image. The synthesis model then produces the corresponding edited artwork. The first row shows the input artwork and inferred images, and the red outlines indicate the edited regions. (Color figure online)

AdaIN Optimization and Learning-Based Regularization. Figure 8 presents the results of the AdaIN optimization and the proposed learning-based regularization. As shown in the first row, optimizing the representations z fails to refine the reconstructed images due to the limited capacity of the low-dimensional latent representation. In contrast, the AdaIN optimization scheme minimizes the perceptual difference between the input and reconstructed images. We also demonstrate how the optimization process influences the editing results in the second row. Although the AdaIN optimization resolves the reconstruction problem, it leads to overfitting, and the generation model synthesizes unrealistic editing results. Utilizing the proposed learning-based regularization addresses the overfitting problem and improves the quality of the edited images.

Fig. 8.

Results of different optimization approaches. We show both the reconstruction and editing results of various optimization approaches at the final stage for the face drawing dataset.

4.3 Quantitative Evaluation

Evaluation Metrics. We use the following evaluation metrics:

  • Reconstruction error: Given the input artwork \(x_N\) and the reconstructed image \(\hat{x}^G_N\), we use the \(\ell _1\) distance \(\Vert \hat{x}^G_N-x_N\Vert _1\) to evaluate the reconstruction quality (see the sketch after this list).

  • FID: We use the FID  [14] score to measure the realism of generated images \(\hat{x}^G_N\). A smaller FID score indicates better visual quality.
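For completeness, a minimal sketch of the reconstruction-error computation is shown below; FID is computed with a standard off-the-shelf implementation, and the `compute_fid` call is a placeholder rather than a specific library API.

```python
# Minimal sketch of the reconstruction metric; FID is delegated to any standard
# implementation (placeholder call below).
def reconstruction_error(x_rec, x_in):
    # Mean per-pixel L1 distance between the reconstructed and input artwork.
    return (x_rec - x_in).abs().mean().item()

# fid = compute_fid(generated_image_dir, real_image_dir)   # placeholder
```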

Reconstruction. As shown in Sect. 3.2, we conduct the AdaIN optimization for each stage sequentially to reconstruct the testing image at the final stage. We use both the reconstruction error and FID score to evaluate several baseline methods and the AdaIN optimization, and show the results in Table 1. The results in the second and third rows demonstrate that the AdaIN optimization is more effective than optimizing the latent representations \(\{z_i\}^{N-1}_{i=1}\). On the other hand, applying a stronger weight decay regularization (i.e., \(w_i=10^{-2}\)) diminishes the reconstruction ability of the AdaIN optimization. By applying the weight decay regularization with the learned hyper-parameter w (i.e., LR), we achieve reconstruction performance comparable to the optimization without regularization.

Table 1. Quantitative results of reconstruction. We use the \(\ell _1\) pixel-wise distance (\(\downarrow \)) and the FID (\(\downarrow \)) score to evaluate the reconstruction ability. w and LR indicate the hyper-parameter for the weight decay regularization and applying the learned regularization, respectively.

Editing. In this experiment, we investigate how various optimization methods influence the quality of the edited images. For each testing final-stage image, we first use different optimization approaches to refine the reconstructed images. We then conduct the editing by re-sampling the latent representation \(z_i\) at a randomly chosen stage. We adopt the FID score to measure the quality of the edited images and show the results in Table 2. As described in Sect. 3.2, applying the AdaIN optimization causes overfitting that degrades the quality of the edited images. For instance, applying the AdaIN optimization increases the FID score from 38.68 to 44.28 on the face drawing dataset. One straightforward solution to alleviate this issue is to apply a strong weight decay regularization (i.e., \(w=10^{-2}\)). However, according to the results in the fifth row of Table 1, such strong regularization reduces the reconstruction effectiveness of the AdaIN optimization. Combining the results in Table 1 and Table 2, we conclude that applying the regularization with the learned hyper-parameter w not only mitigates overfitting but also maintains the efficacy of the AdaIN optimization. We provide further analysis of the proposed learning-based regularization in the supplementary material.

4.4 Limitations

The proposed framework has several limitations (see the supplemental material for visual examples). First, since the model learns the multi-stage generation from a training dataset, it fails to produce appealing results if the style of the input image differs significantly from the images in the training set. Second, the uni-modal inference assumption may not always hold. In practice, the mapping from later stages to previous ones can also be multi-modal. For instance, the styles of pencil sketches by various artists may differ. Finally, artists may not follow a well-staged workflow to create artwork in practice. However, our main goal is to provide an example workflow that makes artwork creation and editing more feasible, especially for users who are not experts in that type of artwork.

Table 2. Quantitative results of editing. We use the FID (\(\downarrow \)) score to evaluate the quality of the edited images \(\hat{x}^G_N\) synthesized by the proposed framework. w and LR indicate the hyper-parameter for the weight decay regularization and applying the learned regularization, respectively.

5 Conclusions

In this work, we introduce an image generation and editing framework that models the creation stages of an artistic workflow. We also propose a learning-based regularization for the AdaIN optimization to address the reconstruction problem and enable non-destructive artwork editing. Qualitative results on three different datasets show that the proposed framework 1) generates appealing artwork images via multiple creation stages and 2) faithfully synthesizes edits made by artists. Furthermore, the quantitative results validate the effectiveness of the AdaIN optimization and the learning-based regularization.

We believe there are many exciting areas for future research in this direction that could make creating high-quality artwork both more accessible and faster. We would like to study video sequences of artists as they create artwork to automatically learn meaningful workflow stages that better align with the artistic process. This could further enable the design of editing tools that more closely align with the operations artists currently perform to iterate on their designs.