1 Introduction

Creating artwork from scratch is a herculean task for people without years of artistic experience. For novices to the world of art, the task would be far more approachable if there were clear creation steps to follow. Take a watercolor painting, for example. One may be guided to first sketch the outline with pencils, then fill in areas with large brushes, and finally refine details such as color gradients and shadows with small brushes. At each stage, certain aspects (i.e., variations) of the overall design are determined and carried forward to the final piece of art.

Inspired by these observations, we aim to model workflows for creating art, targeting two related artistic applications: multi-stage artwork creation and multi-stage artwork editing. As shown in Fig. 1, multi-stage artwork generation guides the user through the creation process, starting from the first stage and then selecting a variation at each subsequent creation stage. In multi-stage artwork editing, we are given a final piece of artwork and infer all the intermediate creation stages, enabling the user to perform different types of edits at various stages and propagate them forward to modify the final artwork.

Fig. 1.

We model the sequential creation stages for a given artistic workflow by learning from examples. At test time, our framework can guide the user to create new artwork by sampling different variations at each stage (left), and infer the creation stages of existing artwork to enable the user to perform natural edits by exploring variations at different stages (middle and right).

Existing artwork creation approaches use conditional generative adversarial networks (conditional GANs)  [20, 28, 49] to produce artwork according to user-provided input signals. These methods take user inputs such as a sketch image  [7] or segmentation mask  [34, 43] and perform single-step generation to synthesize the final artwork. To make the creation process more tractable, recent frameworks adopt a multi-step generation strategy for tasks such as fashion simulation  [38] and sketch-to-image synthesis  [12]. However, these approaches typically do not support editing existing artwork. To manipulate an existing artwork image without degrading its quality, numerous editing schemes  [4, 31, 36, 46, 47] have been proposed in the past decade. Nevertheless, these methods are either designed for specific applications  [31, 36, 46] or lack flexible control over the editing procedure because of their single-stage generation strategy  [4, 47].

In this paper, we develop a conditional GAN-based framework that 1) synthesizes novel artwork via multiple creation stages, and 2) edits existing artwork at various creation stages. Our approach consists of an artwork generation module and a workflow inference module. The artwork generation module learns to emulate each artistic stage with a series of multi-modal (i.e., one-to-many) conditional GAN networks  [49]. Each network in the artwork generation module uses a stage-specific latent representation to encode the variation introduced at the corresponding creation stage. At test time, the user determines the latent representation at each stage sequentially, and the artwork generation module synthesizes the desired artwork image.

To enable editing existing artwork, we also design an inference module that learns to sequentially infer the corresponding images at all intermediate stages. We assume a one-to-one mapping from the final to intermediate stages, and use a series of uni-modal conditional GANs  [20] to perform this inference. At test time, we predict the stage-specific latent representations from the inferred images at all intermediate stages. Depending on the desired type of edit, the user can edit any stage to manipulate the stage-specific image or latent representation and regenerate the final artwork from the manipulated representations.

We observe that directly applying our workflow inference module can cause the reconstructed image to differ slightly from the initially provided artwork. Such a reconstruction problem is undesirable since the user expects the generated image to be unchanged when no edits are performed. To address this problem, we design an optimization procedure along with learning-based regularization to refine the reconstructed image. This optimization aims to minimize the appearance difference between the reconstructed and the original artwork image, while the learning-based regularization seeks to guide the optimization process and alleviate overfitting.

We collect three datasets with different creation stages to demonstrate the use cases of our approach: face drawing, anime drawing, and chair design. We demonstrate the creation process guided by the proposed framework and present editing results made by artists. For quantitative evaluations, we measure the reconstruction error and Fréchet inception distance (FID)  [14] to validate the effectiveness of the proposed optimization and learning-based regularization scheme. We make the code and datasets publicly available to facilitate future research (see footnote 1).

In this work, we make the following three contributions:

  • We propose an image generation and editing framework which models the creation workflow for a particular type of artwork.

  • We design an optimization process and a learning-based regularization function for the reconstruction problem in the editing scenario.

  • We collect three different datasets containing various design stages and use them to evaluate the proposed approach.

2 Related Work

Generative Adversarial Networks (GANs). GANs  [2, 5, 13, 22, 23] model the real image distribution via adversarial learning schemes. Typically, these methods encode the distribution of real images into a latent space by learning the mapping from latent representations to generated images. To make the latent representation more interpretable, InfoGAN  [8] learns to disentangle the latent representations by maximizing mutual information. Similar to the FineGAN  [37] and VON  [50] methods, our approach learns to synthesize an image via multiple generation stages and encodes different types of variation into separate latent spaces at various stages. Our framework extends these approaches to also enable image editing of different types of artwork.

Conditional GANs. Conditional GANs learn to synthesize the output image by referencing input context such as text descriptions  [44], scene graphs  [42], segmentation masks  [15, 34], and images  [20]. According to the type of mapping from the input context to the output image, conditional GANs can be categorized as uni-modal (one-to-one)  [20, 48] or multi-modal (one-to-many)  [17, 29, 32, 49]. Since we assume there are many possible variations involved in the generation at each stage of the artwork creation workflow, we use multi-modal conditional GANs to synthesize the next-stage image, and uni-modal conditional GANs to infer the prior-stage image.

Image Editing. Image editing frameworks enable user-guided manipulation without degrading the realism of the edited images. Recently, deep-learning-based approaches have made significant progress on various image editing tasks such as colorization  [19, 26, 45, 46], image stylization  [16, 31], image blending  [18], image inpainting  [33, 35], layout editing  [30], and face editing  [6, 9, 36]. Unlike these task-specific methods, the task-agnostic iGAN  [47] and GANPaint  [4] models map the variation in the training data onto a low-dimensional latent space using GAN models. Editing can be conducted by manipulating the representation in the learned latent space. Different from iGAN and GANPaint, we develop a multi-stage generation method to model different types of variation at various stages.

Optimization for Reconstruction. In order to embed an existing image to the latent space learned by a GAN model, numerous approaches  [10, 25, 49] propose to train an encoder to learn the mapping from images to latent representations. However, the generator sometimes fails to reconstruct the original image from the embedded representations. To address this problem, optimization-based methods are proposed in recent studies. Abdal et al.  [1] and Bau et al.  [4] adopt the gradient descent scheme to optimize the latent representations and modulations for the feature activations, respectively. The goal is to minimize the appearance distance between the generated and original images. We also utilize the optimization strategy to reconstruct existing artwork images. In addition, we introduce a learning-based regularization function to guide the optimization process.

Regularizations for Deep Learning. These approaches  [11, 24, 27, 39, 40, 41] aim to prevent the learned function from overfitting to a specific solution. In particular, the weight decay scheme  [24] regularizes training by constraining the magnitude of the learned parameters. Nevertheless, regularization methods typically involve hyper-parameters that require meticulous hand-tuning to be effective. The MetaReg  [3] method designs a learning-to-learn algorithm to automatically find the hyper-parameters of the weight decay regularization for the domain generalization problem. Our proposed learning-based regularization is trained with a similar strategy but different objectives to alleviate the overfitting problem described in Sect. 3.2.

3 Method

Our approach is motivated by the sequential creation stages of artistic workflows. We build a model that enables a user to 1) follow the creation stages to generate novel artwork and 2) conduct edits at different stages. Our framework is composed of an artwork generation module and a workflow inference module. As shown in Fig. 2(a), the artwork generation module learns to model the creation stages of the artistic workflow. To enable editing an existing piece of art, the workflow inference module is trained to sequentially infer the corresponding images at all creation stages. When editing existing artwork, it is important that the result remains as close as possible to the original artwork and that only the desired design decisions are altered. To this end, we design an optimization process together with a learning-based regularization that allows faithful reconstruction of the input image. We provide the implementation and training details for each component of the proposed framework in the supplemental material.

Fig. 2.

Overview of the proposed framework. (a) Given N creation stages (\(N=3\) in this example), our approach consists of \(N-1\) workflow inference networks and \(N-1\) artwork generation networks. The workflow inference module produces the intermediate results of the input artwork at all creation stages. The artwork generation module computes the latent representation z and transformation parameter \(z^\mathrm {Ada}\) for each stage, then reconstructs the input artwork images from these transformation parameters. (b) The latent encoder \(E_i^G\) extracts the stage-specific latent representation z from the example, and computes the transformation parameters \(z^\mathrm {Ada}\) for the AdaIN normalization layers (c channels). (c) We introduce a cycle consistency loss for each stage to prevent the artwork generation model (which accounts for detail coloring in this example) from memorizing the variation determined at the previous stages (sketching and flat coloring).

3.1 Artwork Generation and Workflow Inference

Preliminaries. The proposed approach is driven by the number of stages in the training dataset and operates in a supervised setting with aligned training data. Denoting N as the number of stages, the training dataset comprises a set of image groups \(\{(x_1,x_2,\cdots ,x_{N})\}\), where \(x_N\) denotes the artwork image at the final stage. We construct the proposed framework with \(N-1\) workflow inference models \(\{G^I_i\}^{N-1}_{i=1}\) as well as \(N-1\) artwork generation models \(\{(E^G_i,G^G_i)\}^{N-1}_{i=1}\). We show an example with 3 stages in Fig. 2(a). Since the proposed method is based on the observation that artists sequentially determine a design factor (i.e., variation) at each stage, we assume that the generation from the image at the prior stage to the later one is multi-modal (i.e., a one-to-many mapping), while the inference from the final stage to the previous stages is uni-modal (i.e., a one-to-one mapping).
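To make this structure concrete, the following is a minimal PyTorch-style sketch of how the \(N-1\) inference and generation networks could be organized; the class and factory names are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch (assumed names, not the released code): the framework holds
# one inference network G^I_i and one (E^G_i, G^G_i) pair per pair of consecutive stages.
import torch.nn as nn

class ArtworkWorkflowModel(nn.Module):
    def __init__(self, num_stages, make_inference_net, make_latent_encoder, make_generator):
        super().__init__()
        assert num_stages >= 2
        self.num_stages = num_stages
        # G^I_i: uni-modal mapping from the (i+1)-th stage image back to the i-th stage.
        self.inference_nets = nn.ModuleList(
            [make_inference_net() for _ in range(num_stages - 1)])
        # E^G_i / G^G_i: multi-modal mapping from the i-th stage to the (i+1)-th stage,
        # conditioned on a stage-specific latent code.
        self.latent_encoders = nn.ModuleList(
            [make_latent_encoder() for _ in range(num_stages - 1)])
        self.generators = nn.ModuleList(
            [make_generator() for _ in range(num_stages - 1)])
```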

Artwork Generation. The artwork generation module aims to mimic the sequential creation stages of the artistic workflow. Since we assume the generation from the prior stages to the following ones is multi-modal, we construct a series of artwork generation networks by adopting the multi-modal conditional GAN approach in BicycleGAN  [49] and the network architecture of MUNIT  [17]. As shown in Fig. 2(a) and (b), each artwork generation model contains two components: latent encoder \(E^G_i\) and generator \(G^G_i\). The latent encoder \(E^G_i\) encodes the variation presented at the i-th stage in a stage-specific latent space. Given an input image \(x_i\) and the corresponding next-stage image \(x_{i+1}\), the latent encoder \(E^G_i\) extracts the stage-specific latent representation \(z_i\) from the image \(x_{i+1}\), and computes the transformation parameter \(z^\mathrm {Ada}_i\). The generator \(G^G_i\) then takes the current-stage image \(x_i\) as input and modulates the activations through the AdaIN normalization layers  [49] with the transformation parameter \(z^\mathrm {Ada}_i\) to synthesize the next-stage image \(\hat{x}^G_{i+1}\), namely

$$\begin{aligned} \hat{x}^G_{i+1}=G^G_i(x_i,E^G_i(x_{i+1}))~~~~i\in \{1,2,\cdots ,N-1\}. \end{aligned}$$
(1)

We utilize the objective introduced in the BicycleGAN  [49], denoted as \(L^\mathrm {bicycle}_i\), for training the generation model. The objective \(L^\mathrm {bicycle}_i\) is detailed in the supplementary material.
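The sketch below illustrates the assumed interface of the latent encoder \(E^G_i\): it maps the next-stage image to a low-dimensional latent code z and then to the AdaIN scale and bias parameters \(z^\mathrm {Ada}\). The 8c-dimensional output follows the description in Sect. 3.2; the layer sizes and everything else are simplifications, not the exact architecture.

```python
# Simplified sketch of E^G_i (assumed architecture): an image encoder producing the
# stage-specific latent code z, followed by an MLP producing the AdaIN parameters.
import torch.nn as nn

class LatentEncoder(nn.Module):
    def __init__(self, z_dim=8, adain_channels=64):
        super().__init__()
        # Four AdaIN layers with `adain_channels` channels each need a scale and a
        # bias per channel, i.e., an 8c-dimensional transformation parameter.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 4, 2, 1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.to_z = nn.Linear(64, z_dim)
        self.mlp = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 8 * adain_channels))

    def forward(self, x_next):
        z = self.to_z(self.backbone(x_next))
        z_adain = self.mlp(z)   # scales and biases for the generator's AdaIN layers
        return z, z_adain

# Usage corresponding to Eq. (1), with `generator` modulating its AdaIN layers:
#   z_i, z_adain_i = encoder(x_next)
#   x_hat_next = generator(x_i, z_adain_i)
```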

Ideally, the artwork generation network corresponding to a given stage would encode only new information (i.e., the incremental variation), preserving prior design decisions from earlier stages. To encourage this property, we impose a cycle consistency loss that enforces the generation network to encode only the variation introduced at the current stage, as shown in Fig. 2(c). Specifically, we use the inference model \(G^I_i\) to map the generated next-stage image back to the current stage. The mapped image should be identical to the original image \(x_i\) at the current stage, namely

$$\begin{aligned} L^\mathrm {c}_i=\Vert G^I_i(G^G_i(x_i,z_i))-x_i\Vert _1~~~~z_i\sim N(0,1). \end{aligned}$$
(2)

Therefore, the overall training objective for the artwork generation model at the i-th stage is

$$\begin{aligned} L^G_i=L^\mathrm {bicycle}_i + \lambda ^cL^c_i, \end{aligned}$$
(3)

where \(\lambda ^c\) controls the importance of the cycle consistency loss.
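A sketch of how the stage objective in Eq. (3) could be assembled is shown below; `bicycle_loss` stands in for the BicycleGAN objective (detailed in the supplementary material), and the encoder's MLP is assumed to map the sampled latent code to AdaIN parameters, matching the earlier encoder sketch.

```python
# Hedged sketch of the per-stage generation objective (Eqs. 2-3). The helper
# networks and `bicycle_loss` are assumed to be defined elsewhere.
import torch
import torch.nn.functional as F

def generation_objective(gen_i, inf_i, enc_i, bicycle_loss,
                         x_i, x_next, z_dim=8, lambda_c=10.0):
    # Sample a stage-specific latent code and map it to AdaIN parameters.
    z_i = torch.randn(x_i.size(0), z_dim, device=x_i.device)
    x_hat_next = gen_i(x_i, enc_i.mlp(z_i))
    # Cycle consistency (Eq. 2): mapping the synthesized next-stage image back to
    # the current stage should recover x_i, so the generator encodes only the
    # variation introduced at this stage.
    loss_cycle = F.l1_loss(inf_i(x_hat_next), x_i)
    # Total objective (Eq. 3).
    return bicycle_loss(x_i, x_next) + lambda_c * loss_cycle
```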

Workflow Inference. To enable the user to edit the input artwork \(x_N\) at different creation stages, our inference module aims to hallucinate the corresponding images at all previous stages. For the i-th stage, we use a uni-modal conditional GAN network  [20] to generate the image at the i-th stage from the image at the (\(i+1\))-th stage, namely

$$\begin{aligned} \hat{x}^I_{i}=G^I_i(x_{i+1})~~~~i\in \{1,2,\cdots ,N-1\}. \end{aligned}$$
(4)

During the training phase, we apply the hinge version of GAN loss  [5] to ensure the realism of the generated image \(\hat{x}^I_{i}\). We also impose an \(\ell _1\) loss between the synthesized image \(\hat{x}^I_{i}\) and the ground-truth image \(x_i\) to stabilize and accelerate the training. Hence the training objective for the inference network at the i-th stage is

$$\begin{aligned} L^I_i=L^\mathrm {GAN}_i(\hat{x}^I_i) + \lambda ^\mathrm {1}\Vert \hat{x}^I_{i}-x_i\Vert _1, \end{aligned}$$
(5)

where \(\lambda ^\mathrm {1}\) controls the importance of the \(\ell _1\) loss.
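A minimal sketch of the inference-network objective in Eq. (5) follows; the discriminator, its own hinge loss, and the alternating adversarial updates are omitted and assumed to exist elsewhere.

```python
# Sketch of the inference objective (Eq. 5): generator-side hinge GAN loss on the
# hallucinated previous-stage image plus an L1 term against the ground truth.
import torch.nn.functional as F

def inference_objective(inf_i, discriminator, x_next, x_i, lambda_1=10.0):
    x_hat_i = inf_i(x_next)                      # \hat{x}^I_i = G^I_i(x_{i+1})
    loss_gan = -discriminator(x_hat_i).mean()    # hinge loss, generator side
    loss_l1 = F.l1_loss(x_hat_i, x_i)
    return loss_gan + lambda_1 * loss_l1
```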

Test-Time Inference. As shown in Fig. 2(a), given an input artwork image \(x_N\), we sequentially obtain the images at all previous stages \(\{\hat{x}^I_i\}^{N-1}_{i=1}\) using the workflow inference module (blue block). We then use the artwork generation module (green block) to extract the latent representations \(\{z_i\}^{N-1}_{i=1}\) from these inferred images and the input artwork, and compute the transformation parameters \(\{z^\mathrm {Ada}_i\}^{N-1}_{i=1}\). Combining the first-stage image \(x^G_1=\hat{x}^I_1\) and the transformation parameters \(\{z^\mathrm {Ada}_i\}^{N-1}_{i=1}\), the generation module consecutively generates the images \(\{\hat{x}^G_i\}^N_{i=2}\) at the following stages. The user can choose the stage to manipulate based on the type of edit desired. Edits at the i-th stage can be performed by either manipulating the latent representation \(z_i\) or directly modifying the image \(x^G_i\). For example, in Fig. 2(a), the user can choose to augment the representation \(z_1\) to adjust the flat coloring. After editing, the generation module generates the new artwork image at the final stage.
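This procedure can be summarized in code roughly as follows, using the container assumed in the earlier sketch; editing hooks are indicated by comments rather than implemented.

```python
# Sketch of test-time inference and regeneration. Editing would replace z_adain[i]
# or an intermediate image before the forward (regeneration) loop.
import torch

@torch.no_grad()
def infer_and_regenerate(model, x_final):
    n = model.num_stages
    # Backward pass: hallucinate x_{N-1}, ..., x_1 from the input artwork x_N.
    images = {n: x_final}
    for i in range(n - 1, 0, -1):
        images[i] = model.inference_nets[i - 1](images[i + 1])
    # Forward pass: extract z^Ada_i from each next-stage image and regenerate.
    z_adain = []
    x = images[1]                                  # x^G_1 = \hat{x}^I_1
    for i in range(1, n):
        _, z_ada_i = model.latent_encoders[i - 1](images[i + 1])
        z_adain.append(z_ada_i)
        x = model.generators[i - 1](x, z_ada_i)
    return images, z_adain, x   # x should approximate x_final before any edit
```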

Fig. 3.

Motivation of the AdaIN optimization and learning-based regularization. The proposed AdaIN optimization and learning-based regularization are motivated by two observations: 1) using the computed transformation parameters \(z^\mathrm {Ada}\) in Fig. 2 does not reconstruct the original input image well (red outline in the first row), and 2) the AdaIN optimization may degrade the quality of the editing results (yellow outline in the second row). (Color figure online)

3.2 Optimization for Reconstruction

As illustrated in Sect. 3.1, the artwork generation module would ideally reconstruct the input artwork image (i.e., \(\hat{x}^G_N=x_N\)) from the transformation parameters \(\{z^\mathrm {Ada}_i\}^{N-1}_{i=1}\) before the user performs an edit. However, the reconstructed image \(\hat{x}^G_N\) may be slightly different from the input image \(x_N\), as shown in the first row of Fig. 3. Therefore, we adopt an AdaIN optimization algorithm to optimize the transformation parameters \(\{z^\mathrm {Ada}_i\}^{N-1}_{i=1}\) of the AdaIN normalization layers in the artwork generation models. The goal of the AdaIN optimization is to minimize the appearance distance between the reconstructed and input images.

While this does improve the reconstruction of the input image, we observe that the optimization procedure causes the generation module to memorize input image details, which degrades the quality of some edited results, as shown in the second row of Fig. 3. To mitigate this memorization, we propose a learning-based regularization to improve the AdaIN optimization.

AdaIN Optimization. The AdaIN optimization approach aims to minimize the appearance distance between the reconstructed image \(\hat{x}^G_N\) and the input artwork image \(x_N\). There are several choices of what to optimize to improve reconstruction: we could optimize the parameters of the generation models or the extracted representations \(\{z_i\}^{N-1}_{i=1}\). Optimizing the model parameters is inefficient because of the large number of parameters to be updated. On the other hand, we find that optimizing the extracted representations is ineffective, as validated in Sect. 4.3. As a result, we choose to optimize the transformation parameters \(\{z^\mathrm {Ada}_i\}^{N-1}_{i=1}\) of the AdaIN normalization layers in the generation models, namely the AdaIN optimization. Note that a recent study  [1] adopts a similar strategy.

We conduct the AdaIN optimization for each stage sequentially. The transformation parameters at the early stages are optimized and then fixed for the optimization at the later stages. Except for the last stage (i.e., \(i=N-1\)), which uses the input artwork image \(x_N\), the image \(\hat{x}^I_{i+1}\) inferred by the inference model serves as the reference image \(x^\mathrm {ref}\) for the optimization. For each stage, we first use the latent encoder \(E^G_i\) to compute the transformation parameter \(z^\mathrm {Ada}_i\) from the reference image for generating the image. Since there are four AdaIN normalization layers with c channels in each artwork generation model, the dimension of the transformation parameter is \(1\times {8c}\) (a scale and a bias term for each channel). Then we follow the standard gradient descent procedure to optimize the transformation parameters with the goal of minimizing the loss function \(L^\mathrm {Ada}\), which measures the appearance distance between the image \(\hat{x}^G_i\) synthesized by the generator \(G^G_i\) and the reference image \(x^\mathrm {ref}\). The loss function \(L^\mathrm {Ada}\) is a combination of the pixel-wise \(\ell _1\) loss and the VGG-16 perceptual loss  [21], namely

$$\begin{aligned} L^\mathrm {Ada}(\hat{x}^G_i,x^\mathrm {ref}) = \Vert \hat{x}^G_i - x^\mathrm {ref} \Vert _1 + \lambda ^pL^p(\hat{x}^G_i,x^\mathrm {ref}), \end{aligned}$$
(6)

where \(\lambda ^p\) is the importance term. We summarize the AdaIN optimization in Algorithm 1. Note that in practice, we optimize an incremental term \(\delta ^\mathrm {Ada}_i\) added to the transformation parameter \(z^\mathrm {Ada}_i\), instead of updating the parameter itself.
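A sketch of the per-stage AdaIN optimization is given below; `perceptual_loss` denotes a VGG-16 feature loss and, as noted above, the optimization variable is the incremental term \(\delta ^\mathrm {Ada}_i\) rather than \(z^\mathrm {Ada}_i\) itself. The step count, learning rate, and the exact form of the optional regularization term are illustrative assumptions.

```python
# Hedged sketch of the AdaIN optimization for one stage (Eq. 6 / Algorithm 1).
import torch
import torch.nn.functional as F

def adain_optimize_stage(gen_i, enc_i, x_i, x_ref, perceptual_loss,
                         steps=100, lr=0.1, lambda_p=1.0, w_i=None):
    with torch.no_grad():
        _, z_adain = enc_i(x_ref)                 # initial z^Ada_i from the reference
    delta = torch.zeros_like(z_adain, requires_grad=True)
    optimizer = torch.optim.SGD([delta], lr=lr)
    for _ in range(steps):
        x_hat = gen_i(x_i, z_adain + delta)
        loss = F.l1_loss(x_hat, x_ref) + lambda_p * perceptual_loss(x_hat, x_ref)
        if w_i is not None:                       # weight decay on delta (Sect. 3.2)
            loss = loss + (w_i * delta.pow(2)).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (z_adain + delta).detach()
```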

Fig. 4.

Training process for learning-based regularization. For the i-th stage (\(i=2\) in this example), we optimize the hyper-parameter \(w_i\) for the weight decay regularization (orange text) by involving the AdaIN optimization in the training process: after the incremental term \(\delta ^\mathrm {Ada}_i\) is updated via one step of AdaIN optimization and the weight decay regularization (blue arrow), the generation model should achieve improved reconstruction as well as maintain the quality of the editing result (green block). Therefore, we use the losses \(L^\mathrm {Ada},L^\mathrm {GAN}\) computed from the updated parameter to optimize the hyper-parameter \(w_i\) (red arrow). (Color figure online)

Learning-Based Regularization. Although the AdaIN optimization scheme addresses the reconstruction problem, it often degrades the quality of editing operations, as shown in the second row of Fig. 3. This is because the AdaIN optimization causes overfitting (memorization of the reference image \(x^\mathrm {ref}\)). The incremental term \(\delta ^\mathrm {Ada}_i\) for the transformation parameter \(z^\mathrm {Ada}_i\) is driven to extreme values to achieve better reconstruction, so the generator becomes sensitive to changes (i.e., edits) to the input image and produces unrealistic results.

To address the overfitting problem, we use weight decay regularization  [24] to constrain the magnitude of the incremental term \(\delta ^\mathrm {Ada}_i\), as shown in Line 6 of Algorithm 1. However, it is difficult to find a general hyper-parameter setting \(w_i\in R^{1\times {8c}}\) for the different generation stages of various artistic workflows. Therefore, we propose a learning algorithm to optimize the hyper-parameter \(w_i\). The core idea is that updating the incremental term \(\delta ^\mathrm {Ada}_i\) with the regularization \(w_i\delta ^\mathrm {Ada}_i\) should 1) improve the reconstruction and 2) maintain the realism of edits on an input image. We illustrate the proposed algorithm in Fig. 4. In each iteration of training at the i-th stage, we sample an image pair \((x_i, x_{i+1})\) and an additional input image \(x'_i\) from the training dataset. The image \(x'_i\) serves as the edited version of \(x_i\). We first use the latent encoder \(E^G_i\) to extract the transformation parameter \(z^\mathrm {Ada}_i\) from the next-stage image \(x_{i+1}\). As shown in the grey block of Fig. 4, we then update the incremental term from \(\delta ^\mathrm {Ada}_i\) to \(\tilde{\delta }^\mathrm {Ada}_i\) via one step of the AdaIN optimization and the weight decay regularization. With the updated incremental term \(\tilde{\delta }^\mathrm {Ada}_i\), we use the loss function \(L^\mathrm {Ada}\) to measure the reconstruction quality, and use the GAN loss to evaluate the realism of the editing result, namely

$$\begin{aligned} L^\mathrm {L2R}=L^\mathrm {Ada}\big (G^G_i(x_i,z^\mathrm {Ada}_i+\tilde{\delta }^\mathrm {Ada}_i),\,x_{i+1}\big ) + L^\mathrm {GAN}\big (G^G_i(x'_i,z^\mathrm {Ada}_i+\tilde{\delta }^\mathrm {Ada}_i)\big ). \end{aligned}$$
(7)

Finally, since the loss \(L^\mathrm {L2R}\) indicates the efficacy of the weight decay regularization, we optimize the hyper-parameter \(w_i\) by

$$\begin{aligned} w_i=w_i - \eta \nabla _{w_i}L^\mathrm {L2R}, \end{aligned}$$
(8)

where \(\eta \) is the learning rate of the training algorithm for the proposed learning-based regularization.
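A sketch of the training loop for the learned regularization weights is given below. As a hedged simplification, the inner update uses a couple of explicit gradient steps (so that \(w_i\) influences the resulting incremental term starting from zero), the edited input \(x'_i\) is simply another training image, and all names are illustrative.

```python
# Hedged sketch of learning the regularization hyper-parameters w_i (Eqs. 7-8).
# w_i: tensor of shape (1, 8c) with requires_grad=True; disc_i: stage discriminator.
import torch

def update_regularization_weights(w_i, gen_i, enc_i, disc_i, adain_loss,
                                  x_i, x_next, x_i_edit,
                                  inner_steps=2, inner_lr=0.1, meta_lr=1e-3):
    with torch.no_grad():
        _, z_adain = enc_i(x_next)
    delta = torch.zeros_like(z_adain, requires_grad=True)

    # Inner loop: AdaIN-optimization steps with weight decay, keeping the graph so
    # that gradients can flow back to w_i through the updates.
    cur = delta
    for _ in range(inner_steps):
        loss_inner = adain_loss(gen_i(x_i, z_adain + cur), x_next)
        grad = torch.autograd.grad(loss_inner, cur, create_graph=True)[0]
        cur = cur - inner_lr * (grad + w_i * cur)   # gradient step + weight decay

    # Outer loss L^L2R: reconstruction with the updated term plus realism of the
    # "edited" input under the same updated term (hinge loss, generator side).
    loss_rec = adain_loss(gen_i(x_i, z_adain + cur), x_next)
    loss_gan = -disc_i(gen_i(x_i_edit, z_adain + cur)).mean()
    loss_l2r = loss_rec + loss_gan

    # Gradient step on w_i (Eq. 8).
    grad_w = torch.autograd.grad(loss_l2r, w_i)[0]
    with torch.no_grad():
        w_i -= meta_lr * grad_w
    return w_i
```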

Algorithm 1: AdaIN optimization with weight decay regularization (pseudocode figure).

4 Experimental Results

4.1 Datasets

To evaluate our framework, we manually process face drawing, anime drawing, and chair design datasets. We describe details in the supplementary material.

4.2 Qualitative Evaluation

Generation. We present the generation results at all stages in Fig. 5. In this experiment, we use the testing images at the first stage as inputs and randomly sample latent representations \(z\in \{z_i\}^{N-1}_{i=1}\) at each stage of the proposed artwork generation module. The generation module sequentially synthesizes the final result via multiple stages. It successfully produces variations by sampling different random latent codes at different stages. For example, when generating anime drawings, manipulating the latent code at the final stage produces detailed color variations, such as modifying the saturation or adding highlights to the hair regions.

Editing. Figure 6 shows the results of editing the artwork images at different stages. Specifically, after the AdaIN optimization reconstructs the testing image at the final stage (first row), we re-sample the representations \(z\in \{z_i\}^{N-1}_{i=1}\) at various stages. Our framework synthesizes the final artwork such that its appearance changes only with respect to the stage whose latent code is re-sampled. For example, when editing face drawings, re-sampling representations at the flat coloring stage only affects the hair color, while maintaining the haircut style and details.

Fig. 5.

Results of image generation from the first stage. We use the first-stage testing images as input and randomly sample the latent representations to generate the image at the final stage.

Fig. 6.

Re-sampling latent representation at each stage. After we use the AdaIN optimization process to reconstruct the input image (1st row), we edit the reconstructed image by re-sampling the latent representations at various stages.

To evaluate the interactivity of our system, we also asked professional artists to edit some example sketches (Fig. 7). First, we use the proposed framework to infer the initial sketch from the input artwork image. Given the artwork image and the corresponding sketch, we then ask an artist to modify the sketch manually. For the edited sketch (second row), we highlight the edits with red outlines. This experiment confirms that the proposed framework enables artists to adjust only some stages of the workflow, controlling only the desired aspects of the final synthesized image. Additional artistic edits are shown in Fig. 1.

Fig. 7.

Results of artistic editing. Given an input artwork image, we ask the artist to edit the inferred sketch image. The synthesis model then produces the corresponding edited artwork. The first row shows the input artwork and inferred images, and the red outlines indicate the edited regions. (Color figure online)

AdaIN Optimization and Learning-Based Regularization. Figure 8 presents the results of the AdaIN optimization and the proposed learning-based regularization. As shown in the first row, optimizing the representations z fails to refine the reconstructed images due to the limited capacity of the low-dimensional latent representation. In contrast, the AdaIN optimization scheme minimizes the perceptual difference between the input and reconstructed images. We also demonstrate how the optimization process influences the editing results in the second row. Although the AdaIN optimization resolves the reconstruction problem, it leads to overfitting, and the generation model synthesizes unrealistic editing results. Utilizing the proposed learning-based regularization addresses the overfitting problem and improves the quality of the edited images.

Fig. 8.

Results of different optimization approaches. We show both the reconstruction and editing results of various optimization approaches at the final stage for the face drawing dataset.

4.3 Quantitative Evaluation

Evaluation Metrics. We use the following evaluation metrics:

  • Reconstruction error: Given the input artwork \(x_N\) and the reconstructed image \(\hat{x}^G_N\), we use the \(\ell _1\) distance \(\Vert \hat{x}^G_N-x_N\Vert _1\) to evaluate the reconstruction quality (see the sketch after this list).

  • FID: We use the FID  [14] score to measure the realism of generated images \(\hat{x}^G_N\). A smaller FID score indicates better visual quality.
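For completeness, a minimal sketch of the reconstruction-error computation is shown below; FID is computed with a standard off-the-shelf implementation, and the `compute_fid` call is a placeholder rather than a specific library API.

```python
# Minimal sketch of the reconstruction metric; FID is delegated to any standard
# implementation (placeholder call below).
def reconstruction_error(x_rec, x_in):
    # Mean per-pixel L1 distance between the reconstructed and input artwork.
    return (x_rec - x_in).abs().mean().item()

# fid = compute_fid(generated_image_dir, real_image_dir)   # placeholder
```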

Reconstruction. As shown in Sect. 3.2, we conduct the AdaIN optimization for each stage sequentially to reconstruct the testing image at the final stage. We use both the reconstruction error and FID score to evaluate several baseline methods and the AdaIN optimization, and show the results in Table 1. The results in the second and third rows demonstrate that the AdaIN optimization is more effective than optimizing the latent representations \(\{z_i\}^{N-1}_{i=1}\). On the other hand, applying a stronger weight decay regularization (i.e., \(w_i=10^{-2}\)) diminishes the reconstruction ability of the AdaIN optimization. By applying the weight decay regularization with the learned hyper-parameter w (i.e., LR), we achieve reconstruction performance comparable to the optimization without regularization.

Table 1. Quantitative results of reconstruction. We use the \(\ell _1\) pixel-wise distance (\(\downarrow \)) and the FID (\(\downarrow \)) score to evaluate the reconstruction ability. w and LR indicate the hyper-parameter for the weight decay regularization and applying the learned regularization, respectively.

Editing. In this experiment, we investigate how various optimization methods influence the quality of the edited images. For each testing final-stage image, we first use different optimization approaches to refine the reconstructed images. We then conduct the editing by re-sampling the latent representation \(z_i\) at a randomly chosen stage. We adopt the FID score to measure the quality of the edited images and show the results in Table 2. As described in Sect. 3.2, applying the AdaIN optimization causes overfitting that degrades the quality of the edited images. For instance, applying the AdaIN optimization increases the FID score from 38.68 to 44.28 on the face drawing dataset. One straightforward solution to alleviate this issue is to apply a strong weight decay regularization (i.e., \(w=10^{-2}\)). However, according to the results in the fifth row of Table 1, such strong regularization reduces the reconstruction effectiveness of the AdaIN optimization. Combining the results in Table 1 and Table 2, we conclude that applying the regularization with the learned hyper-parameter w not only mitigates overfitting but also maintains the efficacy of the AdaIN optimization. We provide further analysis of the proposed learning-based regularization in the supplementary material.

4.4 Limitations

The proposed framework has several limitations (see the supplemental material for visual examples). First, since the model learns the multi-stage generation from a training dataset, it fails to produce appealing results if the style of the input image differs significantly from the images in the training set. Second, the uni-modal inference assumption may not always hold. In practice, the mapping from later stages to previous ones can also be multi-modal. For instance, the styles of pencil sketches by various artists may differ. Finally, artists may not follow a well-staged workflow to create artwork in practice. However, our main goal is to provide an example workflow that makes artwork creation and editing more feasible, especially for users who are not experts in that type of artwork.

Table 2. Quantitative results of editing. We use the FID (\(\downarrow \)) score to evaluate the quality of the edited images \(\hat{x}^G_N\) synthesized by the proposed framework. w and LR indicate the hyper-parameter for the weight decay regularization and applying the learned regularization, respectively.

5 Conclusions

In this work, we introduce an image generation and editing framework that models the creation stages of an artistic workflow. We also propose a learning-based regularization for the AdaIN optimization to address the reconstruction problem and enable non-destructive artwork editing. Qualitative results on three different datasets show that the proposed framework 1) generates appealing artwork images via multiple creation stages and 2) faithfully synthesizes edits made by artists. Furthermore, the quantitative results validate the effectiveness of the AdaIN optimization and the learning-based regularization.

We believe there are many exciting areas for future research in this direction that could make creating high-quality artwork both more accessible and faster. We would like to study video sequences of artists as they create artwork to automatically learn meaningful workflow stages that better align with the artistic process. This could further enable the design of editing tools that more closely align with the operations artists currently perform to iterate on their designs.