Abstract
In recent years, channel attention mechanisms have greatly improved the performance of computer vision network models, but simply stacking attention modules inevitably increases model complexity. To improve performance while reducing model complexity, a novel frequency channel attention GAN is proposed and applied to image color rendering. First, we show that global average pooling is a special case of the discrete cosine transform; to better capture the rich information in input patterns, we extend global average pooling to the frequency domain to obtain a frequency channel attention mechanism. Second, the frequency channel attention mechanism is combined with a U-Net network to represent the full feature information of the image. The effectiveness of the frequency-domain channel attention GAN is verified on the DIV2K and COCO datasets. Finally, compared with the pix2pix, CycleGAN, and HCEGAN models, PSNR increases by 2.660 dB, 2.595 dB, and 1.430 dB, and SSIM increases by 7.943%, 6.790%, and 2.436%, respectively. Experimental results show that our method not only improves image rendering effect and quality but also enhances model stability.
1 Introduction
Image color rendering is an important branch of image processing, and a color image can highlight deeper information in the image. In recent years, with the rapid development of deep learning and computer vision, color rendering methods based on neural network models have attracted extensive attention from scholars worldwide [2, 3, 10, 14, 19]. How to effectively render existing images with high-quality image data and improve on the details of existing methods has become a research hotspot [18, 23, 29, 31, 32]. As a computer-aided technology, grayscale image color rendering is widely used in image and video restoration, such as repairing old black-and-white photos and processing black-and-white film and television works [21, 22].
At present, traditional color rendering methods require manual intervention and high-quality reference images, and the rendering effect is difficult to make ideal when the image structure and colors are complex [1, 5, 9, 13, 27]. With the rapid development of neural networks, deep learning algorithms have been widely applied to image color rendering: different neural network models can be trained on corresponding data, and images can then be rendered automatically by the model without being affected by human or other factors [6, 8, 17, 24, 30].
Goodfellow et al. [4] proposed generative adversarial networks (GAN), and Mirza et al. [20] proposed the conditional generative adversarial network (CGAN) on this basis, in which additional information, such as class labels or data from other modalities, is fed as an extra input layer into both the discriminator and the generator. Isola et al. [7] improved the CGAN model to realize transformations between images, such as grayscale to color, day to night, and line drawing to photograph. Their pix2pix model has a powerful image conversion capability and can learn the mapping between grayscale and color images to achieve color rendering. Zhu et al. [33] proposed CycleGAN, based on unaligned datasets, which achieves better style transfer.
Image rendering based on GANs suffers from problems such as blurred boundaries and unclear details, and the instability of the GAN model leads to low rendering quality. First, by analyzing global average pooling (GAP) and channel attention mechanisms, we find that both compute a scalar for each channel by learning a weight function [15]. However, GAP cannot capture the rich information in input patterns, and its averaging operation suppresses feature diversity. We then prove that GAP is equivalent to the lowest frequency component of the discrete cosine transform (DCT). Finally, we generalize GAP to the frequency domain and, combining it with a generative adversarial network, propose the frequency domain channel attention GAN (FCAGAN).
2 Related work
2.1 Global average pooling
Global average pooling (GAP) [15] sums and averages all pixel values of a feature map to obtain a single value representing that feature map. It also acts as a structural regularizer for the whole network to prevent overfitting, removes the black-box character of fully connected layers, and directly gives each channel an actual category meaning. GAP has a global receptive field and effectively reduces the spatial dimensionality of the feature map [11]. It also reduces the number of parameters, which mitigates overfitting and provides a feature representation with positional invariance. Further, because of the averaging operation over the entire feature map, GAP is insensitive to spatial location in the image and can thus handle input images of different sizes.
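As a concrete illustration, the pooling step described above can be sketched in a few lines of dependency-free Python; the function name and toy input are ours, not the paper's:

```python
# Minimal sketch of global average pooling (GAP): each of the C feature maps
# (H x W) is collapsed to one scalar by averaging all H*W values.
def global_average_pooling(x):
    """x: list of C channels, each an H x W list of lists -> list of C scalars."""
    pooled = []
    for channel in x:
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        pooled.append(total / count)
    return pooled

# Two 2x2 channels: GAP reduces each to its mean value.
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [10.0, 10.0]]]
print(global_average_pooling(x))  # [2.5, 5.0]
```

Note how all spatial detail inside each channel is discarded; this is exactly the loss of "rich input pattern information" the paper sets out to address.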
2.2 Channel attention mechanism
The channel attention mechanism is designed to explicitly model correlations between different channels or feature maps. The importance of each feature channel is learned automatically by the network, and different weight coefficients are then assigned to each channel so as to strengthen important features and suppress unimportant ones [12, 28]. From the perspective of the feature channels themselves, different feature strengths represent different information.
Suppose \(X \in R^{C \times H \times W}\) is the image feature tensor in the network, where C is the number of channels, H is the feature height, and W is the feature width. The feature channel attention mechanism is then [26]:

\(\hbox {att} = \hbox {sigmoid}(\hbox {fc}(\hbox {gap}(X)))\)
where \(\hbox {att} \in R^{C}\) is attention vector, \(\hbox {sigmoid}\) is Sigmoid function, \(\hbox {fc}\) is a mapping function similar to full connection layer or one-dimensional convolution, and \(\hbox {gap}\) is global average pooling.
After obtaining the attention vectors of all C channels, each channel of the input X is scaled by its corresponding attention value:

\(\widetilde{X}_{:,i,:,:} = att_{i}\, X_{:,i,:,:}\)
where \(i\in \{0,1,\ldots ,C-1\}\). \(\widetilde{X}\) is the output of the attention mechanism, \(att_{i}\) is the ith element of the attention vector, and \(X_{:,i,:,:}\) is the ith channel of the input.
3 FCAGAN
3.1 GAP is a special case of two-dimensional DCT
DCT is defined as follows:

\(f_{k} = \sum \limits _{i=0}^{L-1} x_{i} \cos \left( \dfrac{\pi k}{L}\left( i+\dfrac{1}{2}\right) \right) \)
where \(k\in \{0,1,\ldots ,L-1\}\), \(f\in R^{L}\) is the frequency domain spectrum of the DCT, \(x\in R^{L}\) is the input, and L is the length of the input x. The two-dimensional DCT is defined as follows [25]:

\(f_{h,w}^{2d} = \sum \limits _{i=0}^{H-1}\sum \limits _{j=0}^{W-1} x_{i,j}^{2d} \cos \left( \dfrac{\pi h}{H}\left( i+\dfrac{1}{2}\right) \right) \cos \left( \dfrac{\pi w}{W}\left( j+\dfrac{1}{2}\right) \right) \)
where \(h\in \{0,1,\ldots ,H-1\}\), \(w\in \{0,1,\ldots ,W-1\}\), \(f^{2d}\in R^{H\times W}\) is the frequency domain spectrum of the two-dimensional DCT, \(x^{2d}\in R^{H\times W}\) is the input, and H and W are the height and width of \(x^{2d}\). The inverse two-dimensional DCT is therefore:

\(x_{i,j}^{2d} = \sum \limits _{h=0}^{H-1}\sum \limits _{w=0}^{W-1} f_{h,w}^{2d} \cos \left( \dfrac{\pi h}{H}\left( i+\dfrac{1}{2}\right) \right) \cos \left( \dfrac{\pi w}{W}\left( j+\dfrac{1}{2}\right) \right) \)
where \(i\in \{0,1,\ldots ,H-1\}\), \(j\in \{0,1,\ldots ,W-1\}\).
In Formulas (4) and (5), some normalization constants are removed in this paper to simplify the operations and the narration. As these formulas show, GAP is the preprocessing step of existing channel attention methods, and the DCT can be regarded as a weighted sum of the input, where the cosine terms are the corresponding weights. Due to the limited computational budget, GAP performs a mean operation and can thus be regarded as the simplest spectrum of the input, but a single GAP value is not enough to represent all the feature information in channel attention.
Setting \(h = 0\) and \(w = 0\) in Formula (4) gives:

\(f_{0,0}^{2d} = \sum \limits _{i=0}^{H-1}\sum \limits _{j=0}^{W-1} x_{i,j}^{2d} \cos (0) \cos (0) = \sum \limits _{i=0}^{H-1}\sum \limits _{j=0}^{W-1} x_{i,j}^{2d} = HW \cdot \hbox {gap}(x^{2d})\)

Since \(\cos (0) = 1\), \(f_{0,0}^{2d}\) is the lowest frequency component of the two-dimensional DCT and is proportional to GAP; that is, GAP is a special case of the two-dimensional DCT.
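This equivalence can be checked numerically: with h = w = 0 every cosine factor equals 1, so the unnormalized 2D DCT coefficient reduces to the plain sum of the input, i.e., HW times the GAP value. A small self-contained sketch (toy input chosen by us):

```python
import math

def dct2_component(x, h, w):
    """Unnormalized 2D DCT coefficient f^{2d}_{h,w} of an H x W input x
    (normalization constants dropped, as in the paper)."""
    H, W = len(x), len(x[0])
    return sum(
        x[i][j]
        * math.cos(math.pi * h / H * (i + 0.5))
        * math.cos(math.pi * w / W * (j + 0.5))
        for i in range(H) for j in range(W)
    )

x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
H, W = 3, 3
gap = sum(map(sum, x)) / (H * W)   # 5.0
f00 = dct2_component(x, 0, 0)      # cos(0) = 1 everywhere -> plain sum = 45.0
print(f00 == H * W * gap)          # True: f_{0,0} is proportional to GAP
```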
3.2 Frequency channel attention mechanism
Next, the other frequency domain components are integrated into the channel attention mechanism. Following Formula (6), the inverse two-dimensional DCT can be rewritten as:

\(x_{i,j}^{2d} = \sum \limits _{h=0}^{H-1}\sum \limits _{w=0}^{W-1} f_{h,w}^{2d}\, B_{h,w}^{i,j}\)

where \(i\in \{0,1,\ldots ,H-1\}\), \(j\in \{0,1,\ldots ,W-1\}\), and \(B_{h,w}^{i,j} = \cos \left( \dfrac{\pi h}{H}\left( i+\dfrac{1}{2}\right) \right) \cos \left( \dfrac{\pi w}{W}\left( j+\dfrac{1}{2}\right) \right) \) represents the frequency domain component, namely the weight (basis) component of the DCT.
Obviously, image features can be decomposed into combinations of different frequency domain components, and GAP is only one of the frequency domain components. Previous channel attention mechanisms only use GAP and discard the rest. To further introduce more information, we use multiple frequency components of two-dimensional DCT, including the lowest frequency component GAP.
Firstly, the input X is divided into n parts along the channel dimension, where \(X^{i} \in R^{C' \times H \times W}\), \(i \in \{0,1,\ldots ,n-1\}\), and \(C' = \dfrac{C}{n}\). A corresponding two-dimensional DCT frequency component is assigned to each part, and the results can be used as the preprocessing result of channel attention:

\(\hbox {Freq}^{i} = \sum \limits _{h=0}^{H-1}\sum \limits _{w=0}^{W-1} X_{:,h,w}^{i}\, B_{h,w}^{u_{i},v_{i}}\)

where \(i \in \{0,1,\ldots ,n-1\}\), \([u_{i}, v_{i}]\) is the 2D index of the frequency component assigned to \(X^{i}\), and \(\hbox {Freq}^{i} \in R^{C'}\) is the \(C'\)-dimensional vector after preprocessing.
The whole preprocessing vector is obtained by concatenation:

\(\hbox {Freq} = \hbox {cat}([\hbox {Freq}^{0}, \hbox {Freq}^{1}, \ldots , \hbox {Freq}^{n-1}])\)
The frequency channel attention mechanism is then:

\(\hbox {att} = \hbox {sigmoid}(\hbox {fc}(\hbox {Freq}))\)
It can be seen from Formulas (9) and (10) that our method extends the single lowest-frequency component to a framework with multiple frequency sources, thus addressing the deficiency of the original method.
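The channel splitting and per-group DCT projection described above can be sketched as follows. This is a plain-Python illustration of the preprocessing step only; the grouping scheme and frequency indices are illustrative choices of ours, not the paper's selected values:

```python
import math

def dct_basis(u, v, H, W):
    """B^{u,v}: H x W cosine basis for frequency index (u, v)."""
    return [[math.cos(math.pi * u / H * (i + 0.5)) * math.cos(math.pi * v / W * (j + 0.5))
             for j in range(W)] for i in range(H)]

def multi_spectral_preprocess(x, freq_indices):
    """Split the C channels of x into n = len(freq_indices) equal groups and
    project each group's channels onto its assigned 2D DCT frequency component,
    producing the concatenated preprocessing vector Freq in R^C."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    n = len(freq_indices)
    group = C // n  # assumes C divisible by n, as C' = C/n in the text
    freq = []
    for g, (u, v) in enumerate(freq_indices):
        basis = dct_basis(u, v, H, W)
        for c in range(g * group, (g + 1) * group):
            freq.append(sum(x[c][i][j] * basis[i][j]
                            for i in range(H) for j in range(W)))
    return freq
```

With `freq_indices=[(0, 0)]` every channel is projected onto the lowest frequency, and the result is just HW times GAP per channel, recovering the conventional channel attention preprocessing as a special case.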
3.3 Network structure
In this paper, a GAN model based on the frequency channel attention mechanism is built on a generative adversarial network architecture. The generator is based on the U-Net structure, with skip connections and the frequency channel attention mechanism added to enhance the rendering capability of the model. Considering computational and structural efficiency, the generator and discriminator structures are shown in Fig. 1. As shown in Sect. 3.1, GAP is equivalent to the lowest frequency component of the DCT from the frequency domain perspective; that is, GAP is only a special case of the DCT. Therefore, GAP is extended to the frequency domain in this paper, and the resulting frequency channel attention mechanism is introduced into the generator, allowing images to better represent more information. The discriminator uses a 70×70 PatchGAN for image conversion. Each of its four convolution layers uses a convolution-normalization-LeakyReLU unit.
To mitigate unstable GAN training and model collapse and to ensure the diversity of generated samples, the loss function of our model is the adversarial loss of the generative adversarial network plus an L1 loss weighted by a parameter, set to 100 in this experiment. Suppose the real input image is x, the generated image is G(z), the expected output is y, the generator is G, and the discriminator is D; the loss functions of the generator and discriminator are then, respectively:
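The description matches the standard conditional-GAN-plus-L1 objective popularized by pix2pix; under that assumption, and in the notation above, the two losses can be written as:

```latex
\mathcal{L}_{G} = \mathbb{E}_{x,z}\left[\log\left(1 - D(x, G(z))\right)\right]
  + \lambda \, \mathbb{E}_{x,y,z}\left[\left\lVert y - G(z) \right\rVert _{1}\right],
  \qquad \lambda = 100

\mathcal{L}_{D} = \mathbb{E}_{x,y}\left[\log D(x, y)\right]
  + \mathbb{E}_{x,z}\left[\log\left(1 - D(x, G(z))\right)\right]
```

The L1 term penalizes per-pixel deviation from the ground truth and is what stabilizes training and reduces color drift relative to a purely adversarial objective.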
4 Experiments
In this section, we first use the DIV2K dataset to verify that image features can be decomposed into combinations of different frequency domain components and to compare the effects of GAP and the DCT in the GAN model. Then, the effectiveness of the frequency channel attention GAN is tested on the DIV2K dataset to verify that the model improves performance while reducing model complexity. Finally, we compare our model with the pix2pix [14], CycleGAN [33], and HCEGAN [16] models and use different image categories from the COCO dataset to verify the robustness of our model.
4.1 Experiment settings
Adam with \({\beta }_{1}=0.5\) and \({\beta }_{2}=0.999\) was used to optimize the network parameters. The batch size was 1, the learning rate was 0.0002, and the number of processes was 4. The experimental hardware environment was a desktop computer with Windows 10 (64-bit), an NVIDIA GeForce RTX 2080Ti graphics card, and an Intel(R) Core(TM) i9-10900X CPU @ 3.70 GHz. All models were implemented with the PyTorch toolkit and the CUDA computing platform, using the GPU for training.
The DIV2K dataset we used consists of 800 training images and 100 test images. The COCO dataset contains six categories, with 3000 training images and 600 test images. The images of all datasets are uniformly resized to \(256\times 256\). To evaluate the effectiveness of each module and the color rendering effect of FCAGAN, the differences between the images generated by the model and the ground truth are compared, and the rendered images are evaluated using the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM). The higher the PSNR, the smaller the image distortion; the higher the SSIM, the more similar the two images.
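For reference, the first metric can be computed with a few lines of Python; this is our own minimal sketch for grayscale images (SSIM involves local windows and luminance/contrast/structure terms and is omitted here):

```python
import math

def psnr(img1, img2, max_val=255.0):
    """Peak Signal-to-Noise Ratio between two equal-size grayscale images
    given as H x W lists of pixel values; higher PSNR means less distortion."""
    H, W = len(img1), len(img1[0])
    mse = sum((img1[i][j] - img2[i][j]) ** 2
              for i in range(H) for j in range(W)) / (H * W)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

a = [[100.0, 100.0], [100.0, 100.0]]
b = [[116.0, 116.0], [116.0, 116.0]]  # uniform error of 16 -> MSE = 256
print(round(psnr(a, b), 3))           # 24.048
```

For color images the MSE is averaged over all three channels; the dB scale means the reported gains of 1-3 dB correspond to a substantial reduction in mean squared error.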
4.2 GAP and DCT
According to Sect. 3.1, GAP is only one of the frequency domain components. If the channel attention mechanism uses only GAP and discards the rest, the feature information cannot be fully represented. To introduce more information, we need to use multiple frequency components of the two-dimensional DCT, including the lowest frequency component, GAP. Table 1 shows the PSNR and SSIM values of GAP and of different frequency components of the two-dimensional DCT on the DIV2K dataset. In DCTk (\(k= 1,4,8,16\)), k represents the number of frequency components.
Although the number of frequency components differs across the DCTk models, their results are all higher overall than those of the GAP model. Among them, images generated by the DCT1 model have higher quality, while images generated by the DCT4 model are closer to the ground truth. This is because generalizing GAP to the frequency domain is more effective than the original GAP: GAP is only one frequency component and is not enough to represent all the information in the image. When the image is converted to the frequency domain to obtain its global characteristics and more information is introduced, PSNR and SSIM improve to varying degrees.
4.3 Model validity
We use the DIV2K dataset to test the validity of the proposed method. Stacking DCTk (\(k= 1,4,8,16\)) modules to represent image information forms the Ours(i) variant models, where i is the number of stacked layers. Because we use frequency-domain processing for feature extraction, the DCTk model discards only a small portion of the frequency components. The PSNR and SSIM values of the different models on the DIV2K dataset are shown in Table 2. The results of the stacked Ours(i) models are almost identical, because, building on Sect. 4.2, the features are already extracted effectively. In theory, the more layers are stacked, the more parameters the network model has, and the model complexity and computation increase sharply without necessarily achieving better results. Since the low-frequency components contain more information, the rendering result of the Ours(2) model is closer to the ground truth, and its rendering quality is almost the same as that of the other variants. We therefore balanced the model effect against the number of parameters and ultimately chose Ours(2), the more lightweight option, for the experiments.
4.4 Comparisons
For a fair comparison, we trained all methods on the same training dataset (COCO) and compared the pix2pix, CycleGAN, and HCEGAN models. Pix2pix is a supervised image translation method that uses paired images to learn image-level translation. CycleGAN is a well-known unsupervised image translation method that learns to translate images using unpaired images from different domains. HCEGAN is a recent image rendering method that uses a GAN to render images automatically through deep learning. Figure 2 shows some examples of the rendering results. The results of our method are clearly sharper at the junctions of different colors, in the background, and in other details, with a clear structure and no ambiguity, achieving a faithful restoration of colors that is closer to the ground truth. In contrast, the pix2pix and CycleGAN models make major errors, such as incorrect rendering and blurred boundaries, when rendering images of complex scenes. Although the rendering result of the CycleGAN model is not close to the ground truth, its background colors are brighter, and its effect is better when rendering images with fewer colors. The HCEGAN model has a good restoration effect, but its image colors are still dark.
The specific rendering results are shown in Table 3. Compared with the pix2pix, CycleGAN, and HCEGAN models, the proposed method improves PSNR by an average of 2.660 dB, 2.595 dB, and 1.430 dB, respectively. For SSIM, the proposed method improves by 7.943% over pix2pix, 6.790% over CycleGAN, and 2.436% over HCEGAN. Overall, compared with the other methods, our method restores colors faithfully, is closer to the ground truth, has better robustness, and achieves ideal results when rendering images of complex scenes.
5 Conclusion
To address the color boundary bleeding and blurring problems that deep-learning-based color rendering models face in complex scenes, we propose a frequency channel attention GAN for image color rendering. The proposed frequency-domain channel attention mechanism extends global average pooling to the frequency domain to capture better image information. The DIV2K and COCO datasets were used to verify the experimental results. The results show that, compared with other image rendering models, the proposed method improves model performance while reducing model complexity, and it also achieves ideal effects when rendering images of complex scenes.
References
Afifi, M., Brubaker, M.A., Brown, M.S.: Histogan: Controlling colors of gan-generated and real images via color histograms. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7941–7950 (2021)
Allegra, D., Furnari, G., Gargano, S., et al.: A method to improve the color rendering accuracy in cultural heritage: preliminary results. In: Journal of Physics: Conference Series, p. 012057. IOP Publishing (2022)
Bahng, H., Yoo, S., Cho, W., et al.: Coloring with words: guiding image colorization through text-based palette generation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 431–447 (2018)
Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al.: Generative adversarial nets. Adv. Neural Inf. Process. Syst. 27, 2672–2680 (2014)
Hong’an, L., Min, Z., Zhuoming, D., et al.: Interactive image color editing method based on block feature. Infrared Laser Eng. 48(12), 293–298 (2019)
Hong’an, L., Qiaoxue, Z., Wenjing, Y., et al.: Image super-resolution reconstruction for secure data transmission in Internet of Things environment. Math. Biosci. Eng. 18(5), 6652–6671 (2021)
Isola, P., Zhu, J.Y., Zhou, T., et al.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision And Pattern Recognition, pp. 1125–1134 (2017)
Kim, A.S., Cheng, W.C., Beams, R., et al.: Color rendering in medical extended-reality applications. J. Digit. Imaging 34, 16–26 (2021)
Kumar, M., Weissenborn, D., Kalchbrenner, N.: Colorization transformer. arXiv:2102.04432 (2021)
Lee, J., Kim, E., Lee, Y., et al.: Reference-based sketch image colorization using augmented-self reference and dense semantic correspondence. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5801–5810 (2020)
Li, J., Liu, K., Hu, Y., et al.: Eres-UNet++: Liver CT image segmentation based on high-efficiency channel attention and Res-UNet++. Comput. Biol. Med. 106501 (2023)
Li, B., Lai, Y.K., John, M., et al.: Automatic example-based image colorization using location-aware cross-scale matching. IEEE Trans. Image Process. 28(9), 4606–4619 (2019)
Li, H., Zhang, M., Yu, Z., et al.: An Improved pix2pix Model Based on Gabor Filter for Robust Color Image Rendering, pp. 86–101. AIMS Press, Springfield (2022)
Li, J., Han, Y., Zhang, M., et al.: Multi-scale residual network model combined with global average pooling for action recognition. Multimed. Tools Appl. 81(1), 1375–1393 (2022)
Li, H., Zhang, M., Chen, D., et al.: Image color rendering based on hinge-cross-entropy GAN in internet of medical things. CMES-Comput. Model. Eng. Sci. 135(1), 779–794 (2023)
Liang, W., Ding, D., Wei, G.: An improved DualGAN for near-infrared image colorization. Infrared Phys. Technol. 116, 103764 (2021)
Liang, Y., Lee, D., Li, Y., et al.: Unpaired medical image colorization using generative adversarial network. Multimed. Tools Appl. 81(19), 26669–26683 (2022)
Liu, Y., Peng, S., Liu, L., et al.: Neural rays for occlusion-aware image-based rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7824–7833 (2022)
Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv:1411.1784 (2014)
Oza, U., Pipara, A., Mandal, S., et al.: Automatic image colorization using ensemble of deep convolutional neural networks. In: 2022 IEEE Region 10 Symposium (TENSYMP), pp. 1–6. IEEE (2022)
Ren, W., Pan, J., Zhang, H., et al.: Single image dehazing via multi-scale convolutional neural networks with holistic edges. Int. J. Comput. Vis. 128(1), 240–259 (2020)
Sagar, A.: Dmsanet: dual multi scale attention network. In: International Conference on Image Analysis and Processing, pp. 633–645. Springer (2022)
Wan, Z., Zhang, B., Chen, D., et al.: Bringing old photos back to life. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2747–2757 (2020)
Wan-bo, Y., Xiang-xiang, W., Da-qing, W.: Face image recognition based on basis function iteration of discrete cosine transform. J. Graph. 41(1), 91–95 (2020)
Woo, S., Park, J., Lee, J.Y., et al.: Cbam: Convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)
Wu, Y., Wang, X., Li, Y., et al.: Towards vivid and diverse image colorization with generative color prior. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14377–14386 (2021)
Wu, Y., Wang, G., Wang, Z., et al.: Triplet attention fusion module: a concise and efficient channel attention module for medical image segmentation. Biomed. Signal Process. Control 82, 104515 (2023)
Xuan, D.: Design of 3D animation color rendering system based on image enhancement algorithm and machine learning. Soft Comput. 1–10 (2023)
Yuan, M., Simo-Serra, E.: Line art colorization with concatenated spatial attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3946–3950 (2021)
Žeger, I., Grgic, S., Vuković, J., et al.: Grayscale image colorization methods: overview and evaluation. IEEE Access (2021)
Zhang, X., Wang, T., Wang, J., et al.: Pyramid channel-based feature attention network for image dehazing. Comput. Vis. Image Underst. 197, 103003 (2020)
Zhu, J.Y., Park, T., Isola, P., et al.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232 (2017)
Acknowledgements
This work was partly supported by the Natural Science Basis Research Plan in Shaanxi Province of China under Grant 2023-JC-YB-517 and the Open Project Program of State Key Laboratory of Virtual Reality Technology and Systems, Beihang University under Grant VRLAB2023B08, and the high-level talent introduction project of Shaanxi Technical College of Finance & Economics under Grant 2022KY01. All of the authors declare that there is no conflict of interest regarding the publication of this article and would like to thank the anonymous referees for their valuable comments and suggestions.
Contributions
These authors contributed equally to this work.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Cite this article
Li, Ha., Wang, D., Zhang, M. et al. Image color rendering based on frequency channel attention GAN. SIViP 18, 3179–3186 (2024). https://doi.org/10.1007/s11760-023-02980-7