1 Introduction

Colorization aims to predict the missing chroma information for a given gray image. It is an interesting and practical task in computer vision, widely used in legacy footage processing [27], color transfer [1, 39], and other visual editing applications [3, 52]. It is also exploited as a proxy task for self-supervised learning [25], since predicting perceptually natural colors from a grayscale image relies heavily on scene understanding. However, even when the ground-truth color is available for supervision, predicting pixel colors from gray images remains very challenging, due to the ill-posed nature of the problem: one input grayscale image can correspond to multiple plausible color variants.

Most current methods [3, 12, 17, 23, 26, 38, 49, 54, 56] formulate colorization as a pixel-level regression task and suffer, to varying degrees, from the multimodal nature of the problem. With large-scale training data and end-to-end learning models, they can conveniently learn color distribution priors, e.g. greenish tones for vegetation or human skin colors. However, for objects with inherent color ambiguity (e.g. human clothes, cars, and other man-made objects), these approaches tend to predict brownish average colors. To tackle such multi-modality, several studies [24, 54, 56] formulate color prediction as pixel-level color classification, which allows multiple colors to be assigned to each pixel based on posterior probability. Unfortunately, these methods suffer from regional color inconsistency due to their independent pixel-wise sampling mechanism. Sequential modeling [12, 23] only partially alleviates this sampling issue, because the unidirectional sequential dependence of 2D-flattened pixel primitives causes error accumulation and hinders learning efficiency.

Apart from the multimodal issue, color bleeding is another common problem in colorization, caused by inaccurate identification of semantic boundaries. To suppress such visual artifacts, most works [3, 17, 26, 38, 49, 54, 56] resort to generative adversarial networks (GANs) to encourage the generated chroma distribution to be indistinguishable from that of real-life color images. So far, no dedicated algorithms or modules have been proposed to improve this aspect, which considerably affects visual pleasantness.

To avoid modeling the color multimodality pixel-wisely, we propose a new colorization framework, PalGAN, that predicts pixel colors in a coarse-to-fine paradigm. The key idea is to first predict a global palette probability (e.g. a palette histogram) from the grayscale input. It does not collapse into a single specific colorization solution but represents a color distribution over the potential color variants. Then, the uncertainty about the per-pixel color assignment is modeled with a generative model in the GAN framework, conditioned on the grayscale input and the palette histogram. Therefore, multiple colorization results can be obtained by changing the palette histogram input.

To guarantee color assignment with semantic correctness and regional consistency, we study color affinities with a proposed chromatic attention module. It explicitly aligns color affinity with both semantics and low-level characteristics. Structurally, chromatic attention includes global interaction and local delineation. The former enables global context utilization for color inference by using semantic features in the attention mechanism. The latter preserves regional details by mapping the gray input to color through a local affine transformation, explicitly parameterized by the correlation between the gray input and the color feature. Experiments illustrate the effectiveness of our method. It achieves impressive visual results (Fig. 1) and quantitative superiority over state-of-the-art approaches on ImageNet [9] and COCO-Stuff [5]. Our method also works well with a user-specified palette histogram from a reference image, which may even have no content correlation with the input grayscale. By nature, our method thus supports diverse coloring results with a degree of controllability. Our code and pretrained models are available at https://github.com/shepnerd/PalGAN.

Generally, our contributions are three-fold: i) We propose a new colorization framework, PalGAN, that decomposes colorization into palette estimation and pixel-wise assignment. It effectively circumvents the challenges of color ambiguity and regional homogeneity, and naturally supports diverse and controllable colorization. ii) We explore the less-studied color affinities and propose an effective module named chromatic attention. It considers both semantic and local-detail correspondence, applying such correlations to color generation. It alleviates notable color bleeding effects. iii) Our method notably surpasses the state of the art in perceptual quality (FID [16] and LPIPS [55]). It is known that a trade-off exists between perceptual and fidelity results in multiple low-level tasks. We argue that perceptual effects matter more than fidelity, as colorization aims to produce realistic colorized results rather than restore pixel colors identical to the ground truth. Regardless, our method can achieve the best of both fidelity (PSNR and SSIM) and perceptual performance with proper tuning.

Fig. 1. Our colorization results. \(1_{st}\) row: inputs, and \(2_{nd}\) row: our predictions.

2 Related Work

2.1 Colorization

User-Guided Colorization. Some early works [6,7,8, 18, 21, 29, 34, 39, 47] in colorization turn to a reference image and transfer its color statistics to the given gray one. With the prevalence of deep learning, such color transfer is characterized in neural feature space to introduce semantic consistency [15]. These works perform decently when the reference and the input share similar semantics. Their applicability is limited by reference retrieval quality, especially when handling complicated scenes.

Besides reference images, several systems require users to give sufficient local color hints (usually in scribble form) before colorizing inputs [21, 27, 37, 52]. These approaches then propagate the given colors based on local affinities. In addition, some attempts [3] explore other modalities such as language to instruct which colors are used and how they are distributed.

Learning-Based Colorization. This line of work [10, 17, 19, 24, 54, 56] produces colorful images from gray inputs alone, learning a pixel-to-pixel mapping. Large-scale datasets are exploited in a self-supervised fashion by converting colorful pictures to gray ones for pair-wise training. Iizuka et al. [17] utilize image-level labels to associate predicted colors with global semantics, using a global-and-local convolutional neural network. Larsson et al. [24] and Zhang et al. [54] introduce pixel-level color distribution matching by classification, alleviating color imbalance and handling multi-modal outputs. Besides, extra input hints are integrated into the learning system by simulation in [56], providing automatic and semi-automatic ways to colorize images. Recently, transformer architectures have been explored for this task considering their expressiveness in non-local modeling [23].

Some works explicitly exploit additional priors from pretrained models for colorization. Su et al. [38] leverage instance-level annotations (e.g., instance bounding boxes and classes) by using an off-the-shelf detector. This lets the colorization model focus on color rendering without having to recognize high-level semantics. In addition to the mentioned pretrained discriminative models, pretrained generative models are also exploitable for improving colorization diversity. Wu et al. [49] incorporate a generative color prior from a pretrained BigGAN [4] to help a deep model produce colored results with diversity. They design an extra encoder to project the given gray image into a latent code, then estimate colorful images from BigGAN. With such primary predictions, they further refine the color results using the intermediate features of BigGAN. Afifi et al. [1] employ a pretrained StyleGAN [20] for image recoloring, where color is controlled by histogram features.

2.2 GAN-Based Image-to-Image Translation

Image-to-image translation aims to learn the transformation between input and output images. Colorization can be formulated as such a task and handled by Generative Adversarial Network (GAN) [11] based approaches [19, 30, 35, 41, 44]. They employ an adversarial loss that learns to discriminate between real and generated images, and update the generator to minimize this loss so that the produced results look realistic [28, 31, 36, 42, 45, 46, 50, 51, 57].

Fig. 2. Our colorization system framework.

3 Method

PalGAN aims to colorize grayscale images. It formulates colorization as a palette prediction and assignment problem. Compared with directly learning the pixel-to-pixel mapping from gray to color, as adopted by most learning-based methods, this disentangled formulation not only brings empirical colorization improvements (Sect. 4), but also enables us to manipulate global color distributions by adjusting or regularizing palettes.

Fig. 3. Visualizations of palettes (1\(_\text {st}\) row, shown in jet colormap) and how they work on colorization (2\(_\text {nd}\) row). (a) Input, (b) the ground truth, (c)–(e) reference-based colorization, (f) automatic colorization.

For PalGAN, its input is a grayscale image (i.e. the luminance channel of color images) \(\textbf{L} \in \mathcal {R}^{h \times w \times 1}\), and the output is the estimated chromatic map \(\mathbf {\hat{C}} \in \mathcal {R}^{h \times w \times 2}\) that will be used as the complementary ab channels together with \(\textbf{L}\) in CIE Lab color space. PalGAN consists of palette generator \(\mathcal {T}_{\textbf{E}}\), palette assignment generator \(\mathcal {T}_{\textbf{G}}\), and a color discriminator \(\textbf{D}\). In inference, only \(\mathcal {T}_{\textbf{E}}\) and \(\mathcal {T}_{\textbf{G}}\) are employed. The whole framework is given in Fig. 2.

3.1 Palette Generator

\(\mathcal {T}_{\textbf{E}}\) estimates the global palette probabilities from the given gray image as \(\mathbf {\hat{h}} = \mathcal {T}_{\textbf{E}}(\textbf{L})\). We employ a 2D chromatic histogram \(\mathbf {\hat{h}} \in \mathcal {R}^{N_a \times N_b \times 1}\) to represent the palette probabilities (\(N_a\) and \(N_b\) denote the bin numbers of the a and b axes, respectively), modeling the chromatic statistics instead of learning a deterministic mapping. \(\mathcal {T}_{\textbf{E}}\) is an encoder network with several convolutional layers and a few multi-layer perceptrons (MLPs), ending with a sigmoid function. The former extract features and the latter transform the spatial features into a histogram (in vector form). With the explicit representation of the color palette in histogram form, we find the global color distribution becomes not only more predictable but also manipulable by introducing proper regularizations.
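To make this concrete, the following is a minimal PyTorch-style sketch of such a palette generator; the layer counts, channel widths, and the 16 \(\times \) 16 bin layout are illustrative assumptions rather than our exact configuration.

```python
# A hedged sketch of the palette generator T_E: a convolutional trunk
# followed by an MLP head ending with a sigmoid, producing an N_a x N_b
# palette histogram. Architecture details here are assumptions.
import torch
import torch.nn as nn

class PaletteGenerator(nn.Module):
    def __init__(self, n_bins_a=16, n_bins_b=16, feat=64):
        super().__init__()
        self.n_bins_a, self.n_bins_b = n_bins_a, n_bins_b
        # Convolutional trunk: extract spatial features from the gray input L.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, feat, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat * 2, feat * 4, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),          # collapse spatial dimensions
        )
        # MLP head: map pooled features to an N_a * N_b palette histogram.
        self.mlp = nn.Sequential(
            nn.Linear(feat * 4, 512), nn.ReLU(inplace=True),
            nn.Linear(512, n_bins_a * n_bins_b), nn.Sigmoid(),
        )

    def forward(self, gray):                  # gray: (B, 1, H, W)
        z = self.encoder(gray).flatten(1)     # (B, 4 * feat)
        h = self.mlp(z)                       # (B, N_a * N_b), values in (0, 1)
        return h.view(-1, self.n_bins_a, self.n_bins_b)
```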

The user-guided colorization literature [6, 34, 56] has demonstrated the effectiveness of utilizing the color histogram of a reference image for colorization. Compared to existing practice [6, 34, 56], we go one step further, i.e. synthesizing a palette histogram conditioned on the input grayscale instead of taking it from a user-specified reference image. This design brings two non-trivial advantages. First, it makes our method a self-contained, fully automatic colorization system that does not depend on any outside guidance (i.e. a reference image) to work. Second, in general cases, the palette histogram estimated from each specific grayscale may offer more accurate and instructive information for colorization than that of a reference image selected in the wild. We empirically demonstrate this in Sect. 4.4. In Fig. 3, we visualize the predicted palette histogram (f), in comparison with the ground truth (b) and those of reference images (c)–(e).

3.2 Palette Assignment Generator

\(\mathcal {T}_{\textbf{G}}\) conducts the color assignment task via conditional image generation. It produces the corresponding ab channels from the gray image conditioned on the palette histogram \(\mathbf {\hat{h}}\) and an extra latent code z (sampled from a normal distribution), as \(\mathbf {\hat{C}} = \mathcal {T}_{\textbf{G}}(\textbf{L}|\mathbf {\hat{h}},z)\). It is a convolutional generator composed of the common residual blocks used in image translation [14, 19], together with our customized Palette Normalization (PN) layer and Chromatic Attention (CA) module. Palette normalization is designed to promote the conformity of the generated chromatic channels to the palette guidance \(\mathbf {\hat{h}}\), and it is used along with every batch normalization layer. Specifically, the PN layer first normalizes its input feature and then performs an affine transformation parameterized by \(g(\mathbf {\hat{h}})\) (where \(g(\cdot )\) is a fully-connected layer). Besides, we propose a chromatic attention module (Fig. 4) to explicitly align color affinity to the corresponding semantic and low-level characteristics, which effectively mitigates potential color bleeding and semantic misunderstanding. We discuss these designs in detail below, along with a visualization of the effects of their components in Fig. 5.
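A hedged sketch of a PN layer is given below; the choice of the underlying normalization and the exact modulation form are assumptions made for illustration.

```python
# A minimal sketch of Palette Normalization (PN): normalize the input
# feature, then apply an affine transform whose scale and shift come from
# the palette histogram via a fully-connected layer g(.).
import torch
import torch.nn as nn

class PaletteNorm(nn.Module):
    def __init__(self, num_features, palette_dim=256):
        super().__init__()
        self.norm = nn.BatchNorm2d(num_features, affine=False)  # plain normalization
        # g(.): a fully-connected layer producing per-channel scale and shift.
        self.g = nn.Linear(palette_dim, num_features * 2)

    def forward(self, x, palette):            # x: (B, C, H, W), palette: (B, palette_dim)
        gamma, beta = self.g(palette).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        # The (1 + gamma) form is an implementation assumption.
        return self.norm(x) * (1 + gamma) + beta
```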

Fig. 4. The illustration of chromatic attention.

Fig. 5. Ablation studies of chromatic attention (CA). (a) Input, (b) w/o CA, (c) w/ Global, (d) w/ Local, (e) full CA. Please zoom in.

Chromatic Attention. The proposed Chromatic Attention (CA) module incorporates both semantic and low-level affinities into constructing color relations. These two are realized by the global interaction and local delineation submodules (Fig. 4). Specifically, the inputs to CA are a high-resolution feature map \(\textbf{F} \in \mathcal {R}^{128 \times 128 \times 64}\) from \(\mathcal {T}_{\textbf{G}}\), a high-level feature map \(\textbf{S}\), and the resized gray input \(\textbf{L}\). It outputs two feature maps \(\textbf{F}^g\) and \(\textbf{F}^l\) from global interaction and local delineation respectively, and fuses them into a feature residual that is added back to the input feature map, as:

$$\begin{aligned} \text {CA}(\textbf{F}, \textbf{S}, \textbf{L}) = \textbf{F} + \textbf{F}' = \textbf{F} + f(\textbf{F}^g \oplus \textbf{F}^l) = \textbf{F} + f(\text {CA}_g(\textbf{F}|\textbf{S}) \oplus \text {CA}_l(\textbf{L}|\textbf{F})), \end{aligned}$$
(1)

where \(f(\cdot )\) is a nonlinear fusion operation formed by two consecutive convolutional layers, and \(\oplus \) is channel-wise concatenation operation. \(\text {CA}_g(\cdot )\) and \(\text {CA}_l(\cdot )\) denote global interaction and local delineation, respectively. In this paper, we use \(\textbf{F}, \textbf{F}' \in \mathcal {R}^{128 \times 128 \times 64}\).

Global Interaction. We reconstruct every regional feature point of the input feature map as a weighted sum of the others, where the weights are computed according to semantic similarity. Formally, it is written as \(\textbf{F}_p^g = \sum _{q \in \textbf{F}}w_{pq}\textbf{F}^\text {V}_q\), where p and q denote patches centered at pixel locations p and q within \(\textbf{F}\), respectively, and \(w_{pq}\) is calculated from the region-wise interaction in the learned high-level feature maps of the input gray image. The region-wise feature interaction is measured by the cosine similarity between normalized regional features, as:

$$\begin{aligned} w_{pq} = \frac{\exp (w'_{pq})}{\sum _{k \in \textbf{S}}\exp (w'_{pk})} \quad \text {where} \quad w'_{pq} = \frac{\textbf{S}^\text {K}_p \cdot \textbf{S}^\text {Q}_q}{|\textbf{S}^\text {K}_p| |\textbf{S}^\text {Q}_q|}, \end{aligned}$$
(2)

where \(\textbf{S}\) denotes the high-level feature map, extracted from an intermediate representation of the encoder \(\mathcal {T}_{\textbf{E}}\). \(\textbf{S}^\text {K}\) and \(\textbf{S}^\text {Q}\) denote two feature maps translated from \(\textbf{S}\) by convolution.
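The global interaction can be viewed as an attention operation with cosine-similarity weights; below is a minimal sketch under the assumption that \(\textbf{S}^\text {K}\), \(\textbf{S}^\text {Q}\), and \(\textbf{F}^\text {V}\) are produced by 1 \(\times \) 1 convolutions.

```python
# A hedged sketch of global interaction CA_g: weights w_pq are softmaxed
# cosine similarities between semantic projections of S; the output is a
# weighted sum of value projections of F. Projection sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalInteraction(nn.Module):
    def __init__(self, feat_ch=64, sem_ch=256):
        super().__init__()
        self.to_k = nn.Conv2d(sem_ch, sem_ch // 4, 1)   # S^K
        self.to_q = nn.Conv2d(sem_ch, sem_ch // 4, 1)   # S^Q
        self.to_v = nn.Conv2d(feat_ch, feat_ch, 1)      # F^V

    def forward(self, feat, sem):
        # feat: (B, C_f, H, W) color feature F; sem: (B, C_s, h, w) semantic feature S
        B, C, H, W = feat.shape
        if sem.shape[-2:] != (H, W):
            # Resize S to match F spatially (an assumption).
            sem = F.interpolate(sem, size=(H, W), mode='bilinear', align_corners=False)
        k = F.normalize(self.to_k(sem).flatten(2), dim=1)   # unit-norm along channels
        q = F.normalize(self.to_q(sem).flatten(2), dim=1)
        v = self.to_v(feat).flatten(2)                       # (B, C_f, HW)
        attn = torch.softmax(k.transpose(1, 2) @ q, dim=-1)  # (B, HW, HW), row p holds w_pq
        out = v @ attn.transpose(1, 2)                       # F^g_p = sum_q w_pq F^V_q
        return out.view(B, C, H, W)
```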

Local Delineation. Though color changes at textures and edges are delicate, overlooking these subtle variations leads to notable visual degradation. To preserve such details, we design a local delineation module to complement global interaction. We adopt the assumption that local color affinity is linearly correlated with the corresponding intensity [40, 58]. We propose to learn this local relationship in the guided-filter manner [13, 48], which preserves edges from the guidance well. The local delineation module computes a learnable local affine transformation \(\{\textbf{A} \in \mathcal {R}^{128 \times 128 \times 64}, \textbf{B} \in \mathcal {R}^{128 \times 128 \times 64}\}\) that maps the gray image \(\textbf{L} \in \mathcal {R}^{256 \times 256 \times 1}\) to its corresponding ab feature map, as:

$$\begin{aligned} \textbf{F}^l = \textbf{A} \odot \textbf{L}\downarrow +\, \textbf{B}, \end{aligned}$$
(3)

where \(\odot \) is the element-wise multiplication operator and \(\downarrow \) is a downsampling operator ensuring the spatial size of \(\textbf{L}\) matches that of \(\textbf{F}^{l}\). \(\{\textbf{A}, \textbf{B}\}\) are parameterized by a learnable local correlation between \(\textbf{L}\) and \(\textbf{F}\), as:

$$\begin{aligned} \textbf{A} = \varPsi (\frac{\text {cov}(\textbf{F}, \textbf{L})}{\text {var}(\textbf{L})+\epsilon }), \; \textbf{B} = \overline{\textbf{F}} - \textbf{A} \odot \overline{\textbf{L}} \end{aligned}$$
(4)

where \(\varPsi \) is a learnable transformation parameterized by a small convolutional net, \(\text {cov}(\cdot ,\cdot )\) computes the local covariance between two feature maps (within a fixed window size), and \(\text {var}(\cdot )\) computes the local variance of the given feature map. \(\overline{\textbf{F}}\) and \(\overline{\textbf{L}}\) denote the smoothed versions of \(\textbf{F}\) and \(\textbf{L}\) obtained by a mean filter, respectively. \(\epsilon \) is a small positive number for computational stability.
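A possible implementation of this guided-filter-style computation is sketched below; the window size and the structure of \(\varPsi \) are assumptions.

```python
# A hedged sketch of local delineation CA_l: local statistics via a box
# (mean) filter, the affine map {A, B} of Eq. (4), and the mapping of Eq. (3).
import torch
import torch.nn as nn
import torch.nn.functional as F

def box_filter(x, ksize=5):
    # Local mean via average pooling with 'same' padding.
    return F.avg_pool2d(x, ksize, stride=1, padding=ksize // 2, count_include_pad=False)

class LocalDelineation(nn.Module):
    def __init__(self, feat_ch=64, eps=1e-4, ksize=5):
        super().__init__()
        self.eps, self.ksize = eps, ksize
        # Psi: a small conv net refining the raw coefficient A (assumed structure).
        self.psi = nn.Sequential(nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(feat_ch, feat_ch, 3, padding=1))

    def forward(self, gray, feat):
        # gray: (B, 1, 256, 256) input L; feat: (B, C, 128, 128) color feature F
        gray = F.interpolate(gray, size=feat.shape[-2:], mode='bilinear',
                             align_corners=False)            # L↓ matches F spatially
        mean_l, mean_f = box_filter(gray, self.ksize), box_filter(feat, self.ksize)
        cov_fl = box_filter(feat * gray, self.ksize) - mean_f * mean_l   # cov(F, L)
        var_l = box_filter(gray * gray, self.ksize) - mean_l * mean_l    # var(L)
        A = self.psi(cov_fl / (var_l + self.eps))             # Eq. (4) with learnable Psi
        B = mean_f - A * mean_l
        return A * gray + B                                   # Eq. (3): F^l = A ⊙ L↓ + B
```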

Palette Optimization. To further ensure the proposed palette assignment generator is responsive to the given palette, we minimize the discrepancy between the palette extracted from the predicted chromatic channels and that from the corresponding ground truth. However, common image histograms are non-differentiable due to hard thresholds. Following the practice of [1], we regard the palette histogram as a joint distribution over a and b, represented by a weighted sum of kernels. Formally, the color histogram is written as:

$$\begin{aligned} \textbf{h}(a, b) = \frac{1}{Z}\sum _{x}k(\textbf{C}_a(x), \textbf{C}_b(x), a, b), \end{aligned}$$
(5)

where \(\textbf{C}_a(x)\) and \(\textbf{C}_b(x)\) denote the values of pixel x in the a and b channels, respectively, k is the kernel function measuring the difference between \((\textbf{C}_a(x), \textbf{C}_b(x))\) and a given (a, b), and Z is a normalization factor. In this paper, we adopt the inverse-quadratic kernel [1]:

$$\begin{aligned} k(\textbf{C}_a(x), \textbf{C}_b(x), a, b)= \prod _{i \in \{a, b\}} (1+(\frac{|\textbf{C}_i(x)-i|}{\sigma })^2)^{-1}, \end{aligned}$$
(6)

where \(\sigma \) controls the smoothness of adjacent bins. We find \(\sigma =0.1\) works best.
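A minimal sketch of this differentiable histogram is given below, assuming ab values scaled to \([-1, 1]\) and uniformly spaced bin centers.

```python
# A hedged sketch of the differentiable palette histogram of Eqs. (5)-(6)
# with the inverse-quadratic kernel. The bin layout is an assumption.
import torch

def soft_palette_histogram(ab, n_bins=16, sigma=0.1):
    # ab: (B, 2, H, W) chromatic channels scaled to [-1, 1]
    B = ab.shape[0]
    centers = torch.linspace(-1, 1, n_bins, device=ab.device)     # bin centers
    a = ab[:, 0].reshape(B, -1, 1)                                 # (B, HW, 1)
    b = ab[:, 1].reshape(B, -1, 1)
    # Inverse-quadratic kernel per axis: (1 + (|C_i(x) - i| / sigma)^2)^(-1)
    k_a = 1.0 / (1.0 + ((a - centers) / sigma) ** 2)               # (B, HW, n_bins)
    k_b = 1.0 / (1.0 + ((b - centers) / sigma) ** 2)
    hist = torch.einsum('bxi,bxj->bij', k_a, k_b)                  # sum over pixels x
    return hist / hist.sum(dim=(1, 2), keepdim=True)               # normalize by Z
```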

Regularization. To diversify the predicted colors, we introduce a palette regularization that counteracts the dull colors caused by imbalanced color distributions. On one hand, we employ the ab histogram in probabilistic palette form to measure the color distributions of the predicted color map and the ground truth. Minimizing their discrepancy explicitly accounts for different color ratios, avoiding convergence to a few dominant ones. On the other hand, we diversify the produced colors by increasing the probability of rare colors (statistically, in the training samples). We exploit the entropy of the probabilistic palette to control such diversity. Formally, the entropy of \(\mathbf {\hat{h}}\) is \(E(\mathbf {\hat{h}})=-\sum _{i=1}^{|\mathbf {\hat{h}}|}\mathbf {\hat{h}}_{i}\log \mathbf {\hat{h}}_{i}\). To improve the color diversity in \(\mathbf {\hat{h}}\), we maximize \(E(\mathbf {\hat{h}})\).
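For reference, this entropy term can be computed as follows; the small epsilon for numerical stability is an implementation assumption.

```python
# Entropy of the probabilistic palette, E(h) = -sum_i h_i log h_i.
import torch

def palette_entropy(h, eps=1e-8):
    # h: (B, N_a, N_b) probabilistic palette; returns E(h) per sample.
    h = h.flatten(1)
    return -(h * torch.log(h + eps)).sum(dim=1)   # maximize this to diversify colors
```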

Table 1. Quantitative results on the validation sets from different methods.

3.3 Color Discriminator

We introduce a color discriminator that utilizes the palette to improve adversarial training. We incorporate the palette into the discriminator in a conditional projection manner [33]. We employ a convolutional discriminator \(\textbf{D}\) that converts the input (the concatenation of the ab image and its converted RGB counterpart) into a 1D feature embedding \(\textbf{g} \in \mathbb {R}^{256 \times 1}\). This feature is then fused with the palette by an inner product. The likelihood of the realness of the input is given as:

$$\begin{aligned} p(\textbf{C} \oplus \textbf{I}) = (\textbf{W}\textbf{g}) ^{\text {T}} \textbf{h}, \end{aligned}$$
(7)

where \(\textbf{W} \in \mathbb {R}^{n^2 \times 256}\) is a learnable linear projection, and \(\textbf{I} \in \mathbb {R}^{h \times w \times 3}\) is the RGB image converted from \(\textbf{C}\) and \(\textbf{L}\).
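A hedged sketch of this projection head is shown below; the convolutional trunk producing \(\textbf{g}\) is omitted and assumed to exist.

```python
# A minimal sketch of the palette-conditioned projection of Eq. (7), in the
# spirit of the projection discriminator [33]: realness = (W g)^T h.
import torch
import torch.nn as nn

class PaletteProjectionHead(nn.Module):
    def __init__(self, embed_dim=256, n_bins=256):
        super().__init__()
        self.proj = nn.Linear(embed_dim, n_bins, bias=False)   # learnable W

    def forward(self, g, palette):
        # g: (B, 256) embedding from the conv discriminator trunk
        # palette: (B, n_bins) flattened palette histogram h
        return (self.proj(g) * palette).sum(dim=1)             # (W g)^T h per sample
```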

3.4 Learning Objective

Palette estimation and assignment are trained with different optimization targets. Palette estimation is learned with a palette reconstruction term and a regularization term:

$$\begin{aligned} \mathcal {L}_\textbf{E} = \underbrace{\lambda _\text {rec1}|\textbf{h}-\mathbf {\hat{h}}|_1}_{\text {reconstruction}} - \underbrace{\lambda _\text {rg}E(\mathbf {\hat{h}})}_{\text {regularization}}, \end{aligned}$$
(8)

where \(\lambda _\text {rec1}\) and \(\lambda _\text {rg}\) balance the influences of different terms, set to 5.0 and 1.0, respectively.

The optimization target for palette assignment is formed by pixel-level regression, palette reconstruction, and adversarial training, as:

$$\begin{aligned} \mathcal {L}_\textbf{G} = \underbrace{\lambda _\text {reg}|\textbf{C}-\mathbf {\hat{C}}|_1}_{\text {regression}} + \underbrace{\lambda _\text {rec2} |\textbf{h}-\mathbf {\tilde{h}}|_1}_{\text {reconstruction}} + \underbrace{\lambda _\text {adv} \mathcal {L}_\text {adv}}_{\text {adversarial}}, \end{aligned}$$
(9)

where \(\mathbf {\tilde{h}}\) is extracted from \(\mathbf {\hat{C}}\) using Eq. 5. \(\lambda _\text {reg}\), \(\lambda _\text {rec2}\), and \(\lambda _\text {adv}\) are set to 5.0, 1.0, and 1.0, respectively.

For the adversarial loss, we employ the hinge loss version. The training target of the generator is

$$\begin{aligned} \mathcal {L}_{adv} = -\text {E}_{\textbf{L} \sim \mathbb {P}_\textbf{L}}\textbf{D}(\mathbf {\hat{C}} \oplus \mathbf {\hat{I}}), \end{aligned}$$
(10)

where \(\mathbf {\hat{C}}=\mathcal {T}_{\textbf{G}}(\textbf{L}|\mathcal {T}_{\textbf{E}}(\textbf{L}))\), \(\mathbf {\hat{I}}\) is the RGB image converted from \(\mathbf {\hat{C}}\) and \(\textbf{L}\), and \(\mathbb {P}_\textbf{L}\) denotes the grayscale image distribution. The optimization goal for the discriminator is

$$\begin{aligned} \mathcal {L}_{adv}^{\textbf{D}} = \text {E}_{\textbf{I} \sim \mathbb {P}_\textbf{I}}[\text {max}(0, 1-\textbf{D}(\textbf{C} \oplus \textbf{I}))]+ \text {E}_{\textbf{L} \sim \mathbb {P}_\textbf{L}}[\text {max}(0, 1+\textbf{D}(\mathbf {\hat{C}} \oplus \mathbf {\hat{I}}))], \end{aligned}$$
(11)

where \(\mathbb {P}_\textbf{I}\) denotes the RGB image distribution, and \(\textbf{C}\) is converted from \(\textbf{I}\).
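For clarity, the hinge losses of Eqs. 10 and 11 can be written as follows (a sketch; the logits are the raw discriminator outputs on the concatenated inputs).

```python
# Hinge adversarial losses corresponding to Eqs. (10)-(11).
import torch.nn.functional as F

def g_hinge_loss(fake_logits):
    # Generator term: -E[D(Ĉ ⊕ Î)]
    return -fake_logits.mean()

def d_hinge_loss(real_logits, fake_logits):
    # Discriminator term: E[max(0, 1 - D(C ⊕ I))] + E[max(0, 1 + D(Ĉ ⊕ Î))]
    return F.relu(1.0 - real_logits).mean() + F.relu(1.0 + fake_logits).mean()
```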

Training. We jointly train the palette generator \(\mathcal {T}_{\textbf{E}}\) and the palette assignment generator \(\mathcal {T}_{\textbf{G}}\) in a progressive fashion. Specifically, for the inputs \(\{\textbf{L}_i\}\) to \(\mathcal {T}_{\textbf{E}}\), the corresponding inputs to \(\mathcal {T}_{\textbf{G}}\) are \(\{\mathbbm {1}(p_{\textbf{h}}> 0.8) \textbf{h}_i+(1-\mathbbm {1}(p_{\textbf{h}} > 0.8))\hat{\textbf{h}}_i\}\), where \(\mathbbm {1}\) is an indicator function with value 1 if its condition holds and 0 otherwise, and \(p_{\textbf{h}}\) is sampled from a uniform distribution \(\mathcal {U}[\tau , 1]\). We start training with \(\tau =1\), then linearly decrease it to 0 towards the end of learning.
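The following sketch illustrates this progressive mixing schedule; sampling one \(p_{\textbf{h}}\) per sample in the batch is an assumption.

```python
# A hedged sketch of the progressive palette mixing: with the indicator
# 1(p_h > 0.8), p_h ~ U[tau, 1], feed the ground-truth palette h, otherwise
# the predicted palette ĥ; tau is annealed from 1 to 0 during training.
import torch

def mix_palettes(h_gt, h_pred, tau):
    # h_gt, h_pred: (B, N_a, N_b); tau in [0, 1]
    p = torch.empty(h_gt.shape[0], device=h_gt.device).uniform_(tau, 1.0)
    use_gt = (p > 0.8).float().view(-1, 1, 1)        # indicator 1(p_h > 0.8)
    return use_gt * h_gt + (1.0 - use_gt) * h_pred

def anneal_tau(step, total_steps):
    # Linearly decrease tau from 1 to 0 over training.
    return max(0.0, 1.0 - step / float(total_steps))
```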

4 Experiments

We evaluate our method along with existing representative works on ImageNet [9] and COCO-Stuff [5]. On ImageNet, we adopt two evaluation protocols. One evaluates all methods on a selected subset, ctest10k (10K pictures), of its validation data (50K pictures), following the protocol in [24]. The other runs on the full validation set, as in [49]. For COCO-Stuff, we test all methods on its 5K validation images.

4.1 Implementation

We employ spectral normalization [32] on the whole model and a two time-scale update rule in training (the learning rates of the generator and discriminator are \(1e{-}4\) and \(4e{-}4\), respectively) to stabilize learning. The Adam [22] optimizer with \(\beta _1=0\) and \(\beta _2=0.9\) is used. For batch normalization, we use the synchronized version. We train our method on the ImageNet training set for 40 epochs on 8 NVIDIA 2080 Ti GPUs with batch size 64. During training, images are randomly cropped to a fixed size (256 \(\times \) 256) from versions resized with the aspect ratio unchanged. In testing, we resize images to 256 \(\times \) 256 for evaluation.
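A sketch of this training setup is given below; the recursive application of spectral normalization and the model interfaces are assumptions.

```python
# Optimizer setup with the two time-scale update rule and spectral
# normalization, under the hyperparameters stated above; the generator and
# discriminator modules are assumed to be defined elsewhere.
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def add_spectral_norm(module):
    # Apply spectral normalization to all conv and linear layers of a model.
    for name, child in module.named_children():
        if isinstance(child, (nn.Conv2d, nn.Linear)):
            setattr(module, name, spectral_norm(child))
        else:
            add_spectral_norm(child)
    return module

def build_optimizers(generator, discriminator):
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.0, 0.9))
    return opt_g, opt_d
```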

Baselines. We focus on recent learning-based colorization methods for comparison. Deoldify [2], CIColor [54], UGColor [56], Video Colorization [26], InstColor [38], ColTrans [23], and GPColor [49] are employed. Note that InstColor is learned with a pretrained object detection model (requiring both labels and bounding boxes), and GPColor exploits a BigGAN pretrained on ImageNet with labels. The other approaches, including ours, are trained only with paired gray-color images. For UGColor, we use its fully automatic version without color hints. We use the released models for testing.

Metrics. We employ the pixel-wise similarity measures PSNR and SSIM, the image-level perceptual metric LPIPS [55], and the Fréchet Inception Distance (FID) [16] to quantitatively evaluate colorization results. LPIPS and FID are more consistent with human evaluation than PSNR and SSIM.

4.2 Quantitative Evaluations

Compared with other methods, our proposed PalGAN (ours\(^1\) in Table 1) gives the best perceptual scores (FID & LPIPS) on both ImageNet (FID: 4.60 and 2.78, LPIPS: 0.161 and 0.161 on ctest10K and val50K, respectively) and COCO-Stuff (FID: 7.70, LPIPS: 0.148) without exploiting any annotations or hints. This validates the realness and diversity of our results. We also achieve competitive fidelity scores (PSNR & SSIM) among all methods, showing the well-behaved color restoration ability of PalGAN. Given the ground-truth palette, our method (ours* in Table 1) delivers impressive fidelity as well as generative performance, which indicates the upper-bound performance of our method for reference. Methods denoted with * in Table 1 employ external priors, e.g. annotations.

Considering the trade-off between fidelity and perceptual results, we get the best of both worlds on all benchmarks compared with others (ours\(^2\) in Table 1) with a proper training setting (\(\lambda _{adv}=0.1\) while the other regularization coefficients remain unchanged).

Fig. 6. Visual comparison on ImageNet and COCO-Stuff.

Table 2. User study. Each entry gives the percentage of cases where colorization results are favored compared with GT.

4.3 Qualitative Evaluations

As shown in Fig. 6, our colorization results give natural, diverse, and fine chroma predictions considering both semantic correspondence and local gradient changes. They suffer less from the common color bleeding compared with other methods, owing to chromatic attention. More results are given in the Supp.

User Studies. Table 2 gives human evaluations of our method and the compared ones. Following the protocol in [23, 54], we conduct a colorization Turing test. Specifically, the ground-truth color image and its corresponding colorization result (from ours or another method) are shown to 20 participants in random order. The participants must determine which one is more realistic within 2 s. There are 40 colorization predictions from each method, randomly chosen from ImageNet ctest10k. Table 2 shows that our method beats the competitors by a large margin.

Fig. 7. Our method on legacy images. (1) Inputs, (2) our automatic results, (3) our reference-based results.

Colorization of Legacy Photos. Though our model is trained in a self-supervised manner using synthetic data, it generalizes well to real-world black-and-white legacy pictures (from [15]), as shown in Fig. 7. Color boundaries and consistency are well handled in these cases, for both objects and portraits.

Fig. 8. Our reference-based colorization.

Reference-Based Colorization. Through the intermediate palette, our approach can conduct reference-based (or example-based) colorization by feeding it the palette of a reference color image, as shown in Figs. 7 and 8. Note that even when using palettes from an image with no semantic correlation to the input (Fig. 8), PalGAN still tunes the given color distribution well according to the semantics of the input image, keeping colors regionally consistent. The car appearances in Fig. 8 show impressive diversity and realness.

Fig. 9. Ablation studies of model structures. (a) Input, (b) AE, (c) VAE, (d) PalGAN w/ PatchD, (e) PalGAN w/o \(E(\mathbf {\hat{h}})\), (f) full PalGAN.

4.4 Ablation Studies

Our key designs are ablated on COCO-Stuff as follows.

Palette Prediction and Assignment. We validate the effectiveness of our colorization formulation and the proposed model structure by comparing with a naive autoencoder (AE) and a variational one (VAE). Specifically, AE shares the same computational units as PalGAN, except that it generates feature maps instead of a palette from its encoder and uses common BN instead of PalNorm in its decoder. VAE is almost the same as PalGAN, but changes the intermediate palette into a latent vector constrained to a normal distribution. The optimization of AE and VAE is nearly the same as ours except that they do not have the palette reconstruction and regularization terms, and VAE employs one more term regularizing the intermediate latent code.

In Table 3, PalGAN achieves significant FID improvements over AE and VAE, while its fidelity performance (PSNR and SSIM) is inferior to AE. This suggests that the intermediate latent code (in PalGAN and VAE) performs better for color generation than feature maps (in AE), whereas feature maps excel at fidelity restoration. It validates the effectiveness of our formulation and the use of the palette in terms of generative performance. Moreover, Fig. 9 illustrates the visual differences between the structures on one example: the high fidelity score of AE does not guarantee the realism of its result.

We also study the effectiveness of the predicted palette. We use palettes from random reference images to simulate failed palette estimation in our method, and give the corresponding evaluation in Table 3 (PalGAN w rand ref). It shows a dramatic drop in both fidelity and generative performance, meaning our palette generator learns an effective chroma distribution for colorization. This is also supported by the visualizations of palettes and their corresponding images in Fig. 3.

Table 3. Quantitative results on COCO-Stuff using different structures.

Chromatic Attention. We explore how the proposed chromatic attention affects colorization in Tables 3 and 4. Compared with naive self-attention (PalGAN w SA in Table 3, where SA is applied to the high-level feature maps \(\textbf{S}\)), our chromatic attention enhances both generative and fidelity performance notably. In Table 4, global interaction in chromatic attention improves the generative performance non-trivially on FID (\(9.90 \rightarrow 8.34\)). This is consistent with observations in prior image generation works [4, 43, 44, 53] that employing attention boosts generation results. Local delineation focuses on pixel-level restoration, giving a notable fidelity increase on PSNR (\(21.93 \rightarrow 24.52\)) and SSIM (\(0.902 \rightarrow 0.924\)). Overall, CA achieves the best of both worlds, enhancing both pixel- and perceptual-level performance. Moreover, we give a visual comparison for the ablation study of chromatic attention in Fig. 5.

Note that CA is a generic parametric module. It can be applied to previous methods, e.g. UGC [56], and further improves the corresponding quantitative results (\(24.34 \rightarrow 24.51\), \(0.924 \rightarrow 0.925\), \(0.165 \rightarrow 0.162\), and \(14.74 \rightarrow 11.38\) on PSNR, SSIM, LPIPS, and FID, respectively).

Table 4. Quantitative results on COCO-Stuff by ablating chromatic attention.
Table 5. Quantitative results on COCO-Stuff about palette with different bins.

PalNorm and Color Discriminator. In Table 3, PalNorm yields better quantitative results than BN and SPADE [35] (using the gray input as the semantic layout to generate a pixel-wise affine transformation). Besides, PalGAN (by default with the Color Discriminator) beats PalGAN with a Patch Discriminator [41]. These results show the effectiveness of the designed PalNorm and color discriminator.

Palette Configuration. We systematically explore different factors of the employed palette. Table 5 shows how the number of palette bins affects the colorization results. Generally, when #Bins is relatively small, increasing it (16 \(\rightarrow \) 256) leads to a performance increase on almost all metrics, while when #Bins is relatively large, increasing it further (256 \(\rightarrow \) 1024) leads to a performance drop. We conjecture this is caused by the trade-off between the fineness and sparsity of the palette: raising #Bins enhances both, where the former reduces palette ambiguity and the latter makes the optimization of palette reconstruction harder. Empirically, #Bins = 256 is an acceptable setting, used in all experiments.

Also, as given in Table 3 (PalGAN wo \(E(\mathbf {\hat{h}})\)), applying diversity regularization on the estimated palette can improve our generative performance.

Limitation. In user-guided colorization, the current PalGAN lacks fine-grained control since we utilize a global palette. In addition, PalGAN cannot handle well small regions with independent semantics (e.g. small objects), since the global interaction in CA cannot adequately represent these areas using semantic embeddings from small-scale feature maps. Failure cases are given in the Supp.

5 Concluding Remarks

In this paper, we study the multimodal challenge and color bleeding in colorization from a new perspective. We give a new formulation of colorization for multimodal representation, introducing the palette as an intermediate variable. This leads to a new and workable colorization method based on palette estimation and assignment, yielding diverse and controllable colorful outputs. Additionally, we address the color bleeding issue by explicitly studying color affinities with chromatic attention. It not only leverages semantic affinities to coordinate colors, but also exploits the correlation between intensity and the corresponding chroma to delineate color details. Our method is experimentally proven effective and surpasses existing state-of-the-art methods non-trivially.