1 Introduction

Image color rendering is an important branch of image processing: color images convey information that grayscale images cannot. In recent years, with the rapid development of deep learning and computer vision, color rendering methods based on neural network models have attracted extensive attention [2, 3, 10, 14, 19]. How to render existing images effectively from high-quality image data, and how to improve the detail preserved by existing methods, have become active research topics [18, 23, 29, 31, 32]. As a computer-aided technology, grayscale image color rendering is widely used in image and video processing, for example to repair old black-and-white photographs and to process black-and-white film and television footage [21, 22].

Traditional color rendering methods require manual intervention and high-quality reference images, and their results are rarely satisfactory when the image structure and colors are complex [1, 5, 9, 13, 27]. With the rapid development of neural networks, deep learning has been widely applied to image color rendering: different network models can be trained on corresponding data, and images can then be rendered automatically by the model, unaffected by human or other subjective factors [6, 8, 17, 24, 30].

Goodfellow et al. [4] proposed generative adversarial networks (GANs), and Mirza et al. [20] built on this work with the conditional generative adversarial network (CGAN), in which additional information, such as class labels or data from other modalities, is fed as an extra input layer to both the discriminator and the generator. Isola et al. [7] improved the CGAN model to realize transformations between images, such as grayscale to color, day to night, and line drawing to photograph. Their pix2pix model is a powerful image-to-image translation framework that can learn the mapping between grayscale and color images to achieve color rendering. Zhu et al. [33] proposed CycleGAN, which trains on unpaired datasets and achieves better style transfer.

Image rendering based on GANs still suffers from blurred boundaries and unclear details, and the instability of GAN training leads to low rendering quality. First, by analyzing global average pooling (GAP) and the channel attention mechanism, we observe that both compute a scalar for each channel through a learned weighting function [15]. However, GAP cannot capture the rich information in the input patterns, and its averaging operation suppresses feature diversity. We then prove that GAP is equivalent to the lowest frequency component of the discrete cosine transform (DCT). Finally, we generalize GAP to the frequency domain and, combining it with a generative adversarial network, propose the frequency channel attention GAN (FCAGAN).

2 Related work

2.1 Global average pooling

Global average pooling (GAP) [15] sums and averages all pixel values of a feature map to obtain a single value that represents that feature map. It also acts as a structural regularizer for the whole network to prevent overfitting, removes the black-box behavior of fully connected layers, and gives each channel a direct, interpretable category meaning. GAP has a global receptive field and effectively reduces the spatial dimensionality of the feature map while summarizing its global content [11]. Second, GAP reduces the number of parameters, which mitigates overfitting and yields a feature representation with positional invariance. Furthermore, because GAP averages over the entire feature map, it is insensitive to spatial location within the image and can therefore handle input images of different sizes.
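As a minimal illustration (not taken from the paper), the following PyTorch sketch shows GAP reducing a feature map of shape \(C \times H \times W\) to one scalar per channel; the tensor sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Toy feature map: batch of 2, C = 64 channels, 32x32 spatial resolution (illustrative sizes).
x = torch.randn(2, 64, 32, 32)

# GAP collapses each HxW feature map to a single scalar per channel.
gap = nn.AdaptiveAvgPool2d(1)      # output spatial size 1x1
y = gap(x).flatten(1)              # shape (2, 64): one value per channel

# Equivalently, average directly over the spatial dimensions.
assert torch.allclose(y, x.mean(dim=(2, 3)))
```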

2.2 Channel attention mechanism

The channel attention mechanism explicitly models the correlations between different channels or feature maps. The importance of each feature channel is learned automatically by the network, and each channel is then assigned a weight coefficient so that important features are strengthened and unimportant ones suppressed [12, 28]. From the perspective of the channels themselves, different channels carry different information.

Suppose \(X \in R^{C \times H \times W}\) is a feature tensor in the network, where C is the number of channels, H is the feature height, and W is the feature width. The channel attention mechanism is then [26]:

$$\begin{aligned} \hbox {att} = \hbox {sigmoid}(\hbox {fc}(\hbox {gap}(X))) \end{aligned}$$
(1)

where \(\hbox {att} \in R^{C}\) is the attention vector, \(\hbox {sigmoid}\) is the Sigmoid function, \(\hbox {fc}\) is a mapping function such as a fully connected layer or a one-dimensional convolution, and \(\hbox {gap}\) is global average pooling.

After the attention vector over all C channels is obtained, each channel of the input X is scaled by its corresponding attention value:

$$\begin{aligned} \widetilde{X}_{:,i,:,:} = att_{i}X_{:,i,:,:} \end{aligned}$$
(2)

where \(i\in \{0,1,\ldots ,C-1\}\). \(\widetilde{X}\) is the output of the attention mechanism, \(att_{i}\) is the ith element of the attention vector, and \(X_{:,i,:,:}\) is the ith channel of the input.
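The following PyTorch sketch implements Eqs. (1) and (2) in the usual squeeze-and-excitation style; the reduction ratio and the two-layer form of \(\hbox {fc}\) are common choices rather than details specified in this paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """att = sigmoid(fc(gap(X))); X_tilde = att * X, as in Eqs. (1)-(2)."""
    def __init__(self, channels: int, reduction: int = 16):   # reduction ratio is an assumption
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        att = torch.sigmoid(self.fc(x.mean(dim=(2, 3))))  # gap -> fc -> sigmoid, shape (B, C)
        return att.view(b, c, 1, 1) * x                   # scale each channel by its attention value

x = torch.randn(1, 64, 32, 32)
print(ChannelAttention(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```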

3 FCAGAN

3.1 GAP is a special case of two-dimensional DCT

DCT is defined as follows:

$$\begin{aligned} f_{k} = \sum _{i=0}^{L-1}x_{i}\cos \left( \dfrac{\pi k}{L}\left( i+\dfrac{1}{2}\right) \right) \end{aligned}$$
(3)

where \(k\in \{0,1,\ldots ,L-1\}\), \(f\in R^{L}\) is the frequency spectrum of the DCT, \(x\in R^{L}\) is the input, and L is the length of x. The two-dimensional DCT is defined as follows [25]:

$$\begin{aligned} f_{h,w}^{2d} = \sum \limits _{i=0}^{H-1} \sum \limits _{j=0}^{W-1} x_{i,j}^{2d} \underbrace{\cos \left( \dfrac{\pi h}{H}\left( i+\dfrac{1}{2}\right) \right) \cos \left( \dfrac{\pi w}{W}\left( j+\dfrac{1}{2}\right) \right) }_{\text {DCT weights}} \end{aligned}$$
(4)

where \(h\in \{0,1,\ldots ,H-1\}\), \(w\in \{0,1,\ldots ,W-1\}\), \(f^{2d}\in R^{H\times W}\) is the frequency spectrum of the two-dimensional DCT, \(x^{2d}\in R^{H\times W}\) is the input, H is the height of \(x^{2d}\), and W is its width. Correspondingly, the inverse two-dimensional DCT is:

$$\begin{aligned} x_{i,j}^{2d} = \sum \limits _{h=0}^{H-1} \sum \limits _{w=0}^{W-1} f_{h,w}^{2d} \underbrace{\cos \left( \dfrac{\pi h}{H}\left( i+\dfrac{1}{2}\right) \right) \cos \left( \dfrac{\pi w}{W}\left( j+\dfrac{1}{2}\right) \right) }_{\text {DCT weights}} \end{aligned}$$
(5)

where \(i\in \{0,1,\ldots ,H-1\}\), \(j\in \{0,1,\ldots ,W-1\}\).

In Formulas (4) and (5), the normalization constants are omitted to simplify the derivation. As these formulas show, GAP is the preprocessing step of existing channel attention methods, while the DCT can be regarded as a weighted sum of the input in which the cosine terms are the weights. The GAP mean can therefore be regarded as the simplest spectrum of the input, but a single GAP value is not sufficient to represent all of the feature information.
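For concreteness, the following sketch builds the unnormalized DCT weights of Formula (4) for a single frequency pair and computes the corresponding frequency component as a weighted sum of the input; the function name and sizes are illustrative.

```python
import math
import torch

def dct_basis_2d(h: int, w: int, H: int, W: int) -> torch.Tensor:
    """Unnormalized 2D DCT weights of Formula (4) for one frequency pair (h, w)."""
    i = torch.arange(H, dtype=torch.float32).view(H, 1)
    j = torch.arange(W, dtype=torch.float32).view(1, W)
    return torch.cos(math.pi * h * (i + 0.5) / H) * torch.cos(math.pi * w * (j + 0.5) / W)

# A toy 8x8 feature map: each frequency component is a weighted sum of the input,
# with the cosine basis acting as the weights.
x = torch.randn(8, 8)
f_21 = (x * dct_basis_2d(2, 1, 8, 8)).sum()   # component f^{2d}_{2,1}
print(f_21)
```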

Fig. 1 Frequency channel attention GAN structure

Existing channel attention adopts GAP mainly to limit computational overhead; however, relying on this single, simplest spectral component alone is not sufficient.

Setting h and w in Formula (4) to 0 gives:

$$\begin{aligned} f_{0,0}^{2d} &= \sum \limits _{i=0}^{H-1} \sum \limits _{j=0}^{W-1} x_{i,j}^{2d}\cos \left( \dfrac{0}{H}\left( i+\dfrac{1}{2}\right) \right) \cos \left( \dfrac{0}{W}\left( j+\dfrac{1}{2}\right) \right) \\ &= \sum \limits _{i=0}^{H-1} \sum \limits _{j=0}^{W-1} x_{i,j}^{2d} \\ &= \hbox {gap}(x^{2d})HW \end{aligned}$$
(6)

Since \(\cos (0) = 1\), \(f_{0,0}^{2d}\) is the lowest frequency component of the two-dimensional DCT and is proportional to GAP; that is, GAP is a special case of the two-dimensional DCT.
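A quick numerical check of Formula (6) on a random feature map confirms that the (0, 0) DCT component equals \(HW\) times the GAP value:

```python
import torch

H, W = 16, 16
x = torch.randn(H, W)        # a toy single-channel feature map

f00 = x.sum()                # Formula (6): cos(0) = 1, so f_{0,0} reduces to a plain sum
gap = x.mean()               # global average pooling over the same map

# The lowest-frequency DCT component is exactly H*W times the GAP value.
assert torch.allclose(f00, gap * H * W)
```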

3.2 Frequency channel attention mechanism

Next, other frequency domain components are integrated into the channel attention mechanism. According to Formula (6), the inverse transformation of two-dimensional DCT is rewritten as follows:

$$\begin{aligned} x_{i,j}^{2d} &= \sum \limits _{h=0}^{H-1} \sum \limits _{w=0}^{W-1} f_{h,w}^{2d}\cos \left( \dfrac{\pi h}{H}\left( i+\dfrac{1}{2}\right) \right) \cos \left( \dfrac{\pi w}{W}\left( j+\dfrac{1}{2}\right) \right) \\ &= f_{0,0}^{2d}B_{0,0}^{i,j}+f_{0,1}^{2d}B_{0,1}^{i,j}+\cdots +f_{H-1,W-1}^{2d}B_{H-1,W-1}^{i,j} \\ &= \underbrace{\hbox {gap}(x^{2d})HWB_{0,0}^{i,j}}_{\text {utilized}} + \underbrace{f_{0,1}^{2d}B_{0,1}^{i,j}+\cdots +f_{H-1,W-1}^{2d}B_{H-1,W-1}^{i,j}}_{\text {discarded}} \end{aligned}$$
(7)

where \(i\in \{0,1,\ldots ,H-1\}\), \(j\in \{0,1,\ldots ,W-1\}\), and \(B_{h,w}^{i,j}\) denotes the DCT basis function, i.e., the weight of the corresponding frequency component.

Obviously, image features can be decomposed into combinations of different frequency domain components, and GAP is only one of the frequency domain components. Previous channel attention mechanisms only use GAP and discard the rest. To further introduce more information, we use multiple frequency components of two-dimensional DCT, including the lowest frequency component GAP.

First, the input X is split into n parts along the channel dimension, where \(X^{i} \in R^{C' \times H \times W}\), \(i \in \{0,1,\ldots ,n-1\}\), and \(C' = \dfrac{C}{n}\). Each part is assigned a corresponding two-dimensional DCT frequency component, and the result serves as the preprocessing output for channel attention:

$$\begin{aligned} \hbox {Freq}^{i} = 2\hbox {DDCT}^{u,v}(X^{i}) = \sum \limits _{h=0}^{H-1}\sum \limits _{w=0}^{W-1}X_{:,h,w}^{i}B_{h,w}^{u,v} \end{aligned}$$
(8)

where \(i \in \{0,1,\ldots ,n-1\}\), \([u,v]\) is the 2D index of the frequency component assigned to \(X^{i}\), and \(\hbox {Freq}^{i} \in R^{C'}\) is the \(C'\)-dimensional vector obtained after preprocessing.

The complete preprocessed vector is obtained by concatenating these parts:

$$\begin{aligned} \hbox {Freq} = \hbox {cat}\left( \left[ \hbox {Freq}^{0},\hbox {Freq}^{1},\ldots ,\hbox {Freq}^{n-1}\right] \right) \end{aligned}$$
(9)

Then, the frequency channel attention mechanism is:

$$\begin{aligned} \hbox {fca}\_\hbox {att} = \hbox {sigmoid}(\hbox {fc}(\hbox {Freq})) \end{aligned}$$
(10)

As Formulas (9) and (10) show, our method extends the single lowest-frequency component to a framework with multiple frequency sources, thus addressing the deficiency of the original method.
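The following PyTorch sketch condenses Formulas (8)-(10): the channels are split into n groups, each group is reduced with one fixed DCT weight, the parts are concatenated, and an fc layer with a Sigmoid produces the attention vector. The particular frequency indices, the reduction ratio, and the module name are illustrative assumptions, not the paper's exact configuration.

```python
import math
import torch
import torch.nn as nn

def dct_weight(u: int, v: int, H: int, W: int) -> torch.Tensor:
    """Unnormalized 2D DCT basis B_{h,w}^{u,v} (see Formula (4))."""
    i = torch.arange(H, dtype=torch.float32).view(H, 1)
    j = torch.arange(W, dtype=torch.float32).view(1, W)
    return torch.cos(math.pi * u * (i + 0.5) / H) * torch.cos(math.pi * v * (j + 0.5) / W)

class FrequencyChannelAttention(nn.Module):
    def __init__(self, channels, H, W, freq_idx=((0, 0), (0, 1), (1, 0), (1, 1)), reduction=16):
        super().__init__()
        n = len(freq_idx)
        assert channels % n == 0, "channels must split evenly into n groups (C' = C/n)"
        self.n, self.c_part = n, channels // n
        # Fixed (non-learned) DCT weights, one per channel group; the frequency pairs are an assumption.
        weights = torch.stack([dct_weight(u, v, H, W) for u, v in freq_idx])   # (n, H, W)
        self.register_buffer("weights", weights)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, H, W = x.shape
        xg = x.view(b, self.n, self.c_part, H, W)            # split channels into n parts
        freq = (xg * self.weights.view(1, self.n, 1, H, W)).sum(dim=(3, 4))  # Formula (8), (B, n, C')
        freq = freq.reshape(b, c)                            # Formula (9): concatenate the parts
        att = torch.sigmoid(self.fc(freq))                   # Formula (10)
        return att.view(b, c, 1, 1) * x

print(FrequencyChannelAttention(64, 32, 32)(torch.randn(2, 64, 32, 32)).shape)
```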

3.3 Network structure

In this paper, a GAN model based on the frequency channel attention mechanism is built on the generative adversarial network architecture. The generator is based on the U-Net structure, with skip connections and the frequency channel attention mechanism added to enhance the rendering capability of the model. Considering computational and structural efficiency, the generator and discriminator structures are shown in Fig. 1. As shown in Sect. 3.1, GAP is equivalent to the lowest frequency component of the DCT; in other words, GAP is only a special case of the DCT. Therefore, GAP is extended to the frequency domain, and the resulting frequency channel attention mechanism is introduced into the generator, allowing the features to carry more image information. The discriminator is a 70×70 PatchGAN for image translation, and each of its four convolutional layers uses a convolution-normalization-LeakyReLU unit.
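As a sketch of the discriminator described above, the following 70×70 PatchGAN stacks convolution-normalization-LeakyReLU units and ends with a convolution that outputs a per-patch score map; the channel widths, instance normalization, and input channel count follow common pix2pix practice and are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=2, norm=True):
    """Convolution -> (instance) normalization -> LeakyReLU unit."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1)]
    if norm:
        layers.append(nn.InstanceNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return layers

class PatchDiscriminator(nn.Module):
    """70x70 PatchGAN: classifies overlapping image patches as real or fake."""
    def __init__(self, in_channels=6):   # conditional input: source and target concatenated (assumed channel count)
        super().__init__()
        self.model = nn.Sequential(
            *conv_block(in_channels, 64, norm=False),
            *conv_block(64, 128),
            *conv_block(128, 256),
            *conv_block(256, 512, stride=1),
            nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),  # per-patch real/fake score map
        )

    def forward(self, x, y):
        return self.model(torch.cat([x, y], dim=1))

d = PatchDiscriminator()
print(d(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)).shape)  # (1, 1, 30, 30)
```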

Table 1 Effectiveness of the frequency channel attention mechanism
Table 2 Comparison of evaluation indices of the variant models

To address unstable GAN training and mode collapse, and to ensure the diversity of generated samples, the loss function of our model is the adversarial loss of the generative adversarial network plus a weighted L1 loss; the weight \(\lambda\) is 100 in our experiments. Suppose the real image is x, the generated image is G(z), the expected output is y, the generator is G, and the discriminator is D; then the loss functions of the generator and the discriminator are, respectively:

$$\begin{aligned} \left\{ \begin{array}{l} G^{*}=\arg \min \limits _{G} \max \limits _{D} L_{\textrm{GAN}}(G,D)+\lambda L_{L1}(G)\\ L_{\textrm{GAN}}(G,D)=E_{x,y}[\log D(x,y)]+E_{x}[\log (1-D(x,G(x)))]\\ L_{L1}(G)=E_{x,y}[\parallel y-G(x)\parallel _{1}] \end{array}\right. \end{aligned}$$
(11)
$$\begin{aligned} \left\{ \begin{array}{l} L_D = -E_{(x,y) \sim P_{\textrm{data}}}[\min (0,-1 + D(x,y))]-E_{z \sim P_z,y \sim P_{\textrm{data}}}[\min (0,-1 - D(G(z),y))]\\ L_G = -E_{z \sim P_z,y \sim P_{\textrm{data}}}[D(G(z),y)] \end{array}\right. \end{aligned}$$
(12)
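The following sketch shows one way to implement the generator objective of Formula (11), i.e., an adversarial term plus \(\lambda\) times the L1 loss with \(\lambda = 100\); the binary cross-entropy form of the adversarial term and all function names are our assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # adversarial term in the vanilla (cross-entropy) GAN form, an assumption
l1 = nn.L1Loss()
lam = 100.0                    # weight on the L1 term, as stated in the text

def generator_loss(discriminator, x, y, fake):
    """Adversarial loss on (x, G(x)) plus lambda * ||y - G(x)||_1 (cf. Formula (11))."""
    pred_fake = discriminator(x, fake)
    adv = bce(pred_fake, torch.ones_like(pred_fake))   # try to fool the discriminator
    return adv + lam * l1(fake, y)

def discriminator_loss(discriminator, x, y, fake):
    pred_real = discriminator(x, y)
    pred_fake = discriminator(x, fake.detach())        # do not backpropagate into the generator
    return 0.5 * (bce(pred_real, torch.ones_like(pred_real)) +
                  bce(pred_fake, torch.zeros_like(pred_fake)))
```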

4 Experiments

In this section, we first use the DIV2K dataset to verify that image features can be decomposed into combinations of different frequency components and to compare the effects of GAP and the DCT within the GAN model. We then test the effectiveness of the frequency channel attention GAN on DIV2K, verifying that the model improves performance while reducing model complexity. Finally, we compare our model with the pix2pix model [14], the CycleGAN model [33], and the HCEGAN model [16] on different categories of images from the COCO dataset to verify its robustness.

4.1 Experiment settings

Adam with \({\beta }_{1}=0.5\) and \({\beta }_{2}=0.999\) was used to optimize the network parameters. The batch size was 1, the learning rate was 0.0002, and the number of data-loading processes was 4. The hardware environment was a desktop computer running 64-bit Windows 10 with an NVIDIA GeForce RTX 2080Ti graphics card and an Intel(R) Core(TM) i9-10900X CPU @ 3.70 GHz. All models were implemented with the PyTorch toolkit and the CUDA computing platform and trained on the GPU.
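In code, the optimizer and data-loading settings above look roughly as follows; the placeholder networks and dataset merely stand in for the generator, discriminator, and training data of this paper.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder networks and dataset; in practice these are the generator/discriminator of Sect. 3.3
# and the DIV2K/COCO training pairs.
generator = nn.Conv2d(1, 3, kernel_size=3, padding=1)
discriminator = nn.Conv2d(4, 1, kernel_size=3, padding=1)
train_dataset = TensorDataset(torch.randn(8, 1, 256, 256), torch.randn(8, 3, 256, 256))

# Adam with beta1 = 0.5, beta2 = 0.999 and learning rate 2e-4; batch size 1; 4 loader workers.
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
loader = DataLoader(train_dataset, batch_size=1, shuffle=True, num_workers=4)
```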

The DIV2K dataset we used consists of 800 training images and 100 test images. The COCO dataset contains six categories, with 3000 training images and 600 test images. All images are uniformly resized to \(256\times 256\). To evaluate the effectiveness of each module and the color rendering quality of FCAGAN, the images generated by the model are compared with the ground truth and evaluated using the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM). A higher PSNR indicates less image distortion, and a higher SSIM indicates that the two images are more similar.
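A sketch of this evaluation using scikit-image; the image arrays below are placeholders for a rendered image and its ground truth.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Placeholder 256x256 RGB images in [0, 1]; in practice these are a rendered image and its ground truth.
rendered = np.random.rand(256, 256, 3)
ground_truth = np.random.rand(256, 256, 3)

psnr = peak_signal_noise_ratio(ground_truth, rendered, data_range=1.0)     # higher = less distortion
ssim = structural_similarity(ground_truth, rendered, data_range=1.0,
                             channel_axis=-1)                              # higher = more similar (skimage >= 0.19)
print(f"PSNR: {psnr:.3f} dB, SSIM: {ssim:.4f}")
```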

4.2 GAP and DCT

According to Sect. 3.1, GAP is only one frequency component. If the channel attention mechanism uses only GAP and discards the rest, the feature information cannot be fully represented. To introduce more information, we use multiple frequency components of the two-dimensional DCT, including the lowest frequency component, GAP. Table 1 shows the PSNR and SSIM values on the DIV2K dataset for GAP and for different numbers of two-dimensional DCT frequency components; in DCTk (\(k= 1,4,8,16\)), k denotes the number of frequency components.

Fig. 2 Comparison of methods on the COCO dataset

Although the DCTk models use different numbers of frequency components, their results are consistently higher than those of the GAP model. Among them, the images generated by the DCT1 model have higher quality, while those generated by the DCT4 model are closer to the ground truth. This is because generalizing GAP to the frequency domain is more effective than the original GAP: GAP is only one frequency component and cannot represent all the information in the image, whereas converting the image to the frequency domain captures its global characteristics and introduces more information, so PSNR and SSIM improve to varying degrees.

4.3 Model validity

We use the DIV2K dataset to test the validity of the proposed method. Stacking DCTk (\(k= 1,4,8,16\)) modules to represent the image information yields the Ours(i) variant models, where i denotes the number of stacked layers. Because we extract features in the frequency domain, the DCTk models discard only a small fraction of the frequency components. The PSNR and SSIM values of the different models on the DIV2K dataset are shown in Table 2. The results of the Ours(i) models after stacking are almost identical, because the feature extraction scheme established in Sect. 4.2 is already effective. In theory, the more layers are stacked, the more parameters the network has, and the model complexity and computation grow rapidly without necessarily producing better results. Since the low-frequency components contain more information, the rendering result of the Ours(2) model is closer to the ground truth, and its rendering quality is almost the same as that of the other variants. We therefore balanced model quality against parameter count and chose Ours(2), the more lightweight configuration, for the subsequent experiments.

4.4 Comparisons

For a fair comparison, we trained all methods on the same training data (the COCO dataset) and compared the pix2pix, CycleGAN, and HCEGAN models. Pix2pix is a supervised image translation method that learns image-level translation from paired images. CycleGAN is a well-known unsupervised image translation method that learns to translate images using unpaired images from different domains. HCEGAN is a recent GAN-based deep learning method that renders images automatically. Figure 2 shows some examples of the rendering results. The results of our method are clearer at the boundaries between different colors and in the background and other details, with well-defined structure and no ambiguity; they restore colors faithfully and are closer to the ground truth. In contrast, the pix2pix and CycleGAN models show notable rendering errors and blurred boundaries when rendering images of complex scenes. Although the rendering results of the CycleGAN model are not close to the ground truth, its background colors are brighter and it performs better on images with few colors. The HCEGAN model recovers the image well, but its colors remain dark.

The quantitative rendering results are shown in Table 3. Compared with the pix2pix, CycleGAN, and HCEGAN models, the proposed method improves the average PSNR by 2.660 dB, 2.595 dB, and 1.430 dB, respectively. For SSIM, the proposed method improves by 7.943% over pix2pix, 6.790% over CycleGAN, and 2.436% over HCEGAN. Overall, compared with the other methods, our method restores colors more faithfully and is closer to the ground truth; it is also more robust and achieves satisfactory results when rendering images of complex scenes.

Table 3 Data comparison of rendering results

5 Conclusion

To address the color bleeding across boundaries and the blurring that deep-learning-based color rendering models suffer from in complex scenes, we propose a frequency channel attention GAN for image color rendering. The proposed frequency-domain channel attention mechanism extends global average pooling to the frequency domain to capture richer image information. The DIV2K and COCO datasets were used to verify the method experimentally. The results show that, compared with other image rendering models, the proposed method improves performance while reducing model complexity, and it also achieves satisfactory results when rendering images of complex scenes.