Introduction

Vision-guided underwater robots have become increasingly common in critical applications in recent years, including underwater exploration [1], monitoring of marine species [2], and underwater rescue missions [3]. Approximately 70% of the Earth’s surface is covered by the sea, which is closely tied to human life, yet less than 10% of it has been explored. Unlike in-air images, underwater images suffer from various degradations because of the complex and diverse underwater environments. According to the principle of underwater imaging, the main degradation factors are the absorption of light by water during propagation and the forward/back scattering caused by suspended particles. Absorption is mostly responsible for color distortion; light attenuation is nonlinear and depends on the wavelength of light. Because red light has the longest wavelength, it is absorbed fastest as depth increases, so most underwater images appear greenish or bluish. In addition, forward scattering blurs image details, while backward scattering lowers contrast and produces a haze effect.

Many traditional methods have been developed for underwater image enhancement (UIE) [4,5,6,7,8,9,10,11,12,13,14,15]. Although these traditional methods achieve good results in certain respects, they still show shortcomings when dealing with the wide variety of underwater environments. As shown in Fig. 1, the degradation of image quality is diverse and usually includes color cast, haze, and blur.

Fig. 1

Underwater images with various degradations

Recently, the generative adversarial network (GAN) [16] and the Transformer [17] have been effectively applied to image translation tasks. GAN is inspired by the two-player zero-sum game; it is mainly applied to image generation and data augmentation and has been further developed for other tasks. It consists of two models: a generator G that captures the data distribution and a discriminator D that distinguishes generated samples from real ones. The two networks are trained simultaneously until the samples produced by the generator become indistinguishable from real samples. GAN is also used in unsupervised learning, e.g., CycleGAN [18], which is used to generate paired datasets to guide the training of deep learning networks. However, the standard GAN suffers from mode collapse and vanishing gradients, which make training unstable. In addition, because the discriminator contains only one branch and mainly focuses on parts of the image’s content and details, the color features of the image are difficult to handle. The Transformer emerged from the field of natural language processing (NLP). It abandons the traditional convolutional neural network (CNN) and recurrent neural network (RNN); the whole network is composed of self-attention and feedforward layers. Owing to its ability to capture long-range dependencies, the Transformer has also been successfully applied to computer vision [19]. However, it suffers from high computational cost and a weak ability to extract local features. In short, the Transformer contributes to the network’s learning capability, and the GAN contributes to the network’s learning objective.

To fully exploit the respective advantages of the Transformer and the GAN, we fuse the two effectively. First, we propose a window-based dual local enhancement Transformer block (DleWin) that is better suited to UIE tasks. The DleWin block implements a self-attention mechanism that extracts long-range information well. Since local features are also crucial to UIE, we adopt CNNs in both serial and parallel modes within the DleWin block for local enhancement, and the generator is built on this block. Second, we propose a fusion scheme that combines the convolutional neural network and the Transformer at the unit level. Since the Transformer is good at capturing long-range dependencies and extracting raw information, while the CNN is good at extracting local features, the two can be fused effectively to correct color deviation and improve image clarity. To make it easier for the DleWin block to obtain global information, the generator is designed as a UNet-like network [20], in which the DleWin Transformer block extracts the raw and global information. Finally, we propose a GAN with a two-branch discriminator containing a feature branch and a color branch. The feature branch preserves image features and enhances contrast, while the color branch performs color correction to produce more realistic colors. The discriminator is implemented as stacked convolutional layers. The feature branch is trained with the Wasserstein GAN with gradient penalty (WGAN-GP) [21] loss, and the underwater index loss (Uloss) [22] guides the training of the color branch. Based on these three designs, we propose a Transformer embedded generative adversarial network for underwater image enhancement (TEGAN). Comparative experiments demonstrate that, on both paired and unpaired datasets, our method outperforms state-of-the-art approaches: it achieves the best subjective perception and the overall best performance in terms of image quality evaluation metrics. Ablation analyses show the contribution of each core component. In addition, we test the effect on downstream tasks and find that TEGAN can greatly boost the performance of visual tasks such as edge detection, underwater object detection, and keypoint matching.

Unlike the comparison methods, TEGAN, with its new Transformer block, fusion scheme, and two-branch discriminator, is well suited to solving the degradation problems of underwater images. The contributions are summarized as follows:

  • We propose a window-based dual local enhancement Transformer block (DleWin) that is more suitable for the UIE task. This novel block can be used to fully extract the original features and global information of the image, alleviate blur, and improve image clarity.

  • A fusion scheme that combines the convolutional neural network and the Transformer at the unit level is designed. Since the CNN is good at extracting local features and the Transformer captures long-range dependencies well, the two can be effectively fused to correct the color deviation and enhance image clarity.

  • A Transformer embedded generative adversarial network with a two-branch discriminator is proposed. The feature branch preserves image features and realizes contrast enhancement, while the color branch rectifies the color cast to generate more realistic colors.

  • Extensive experiments demonstrate that TEGAN can achieve superior results compared to the state-of-the-art approaches on public underwater image datasets such as EUVP [24], RUIE [25], and UIEB [26]. In addition, outstanding results reveal that it can significantly facilitate the performance of other downstream visual tasks.

Related Work

For UIE tasks, existing methods can be divided into three types: enhancement methods, which directly improve visual effects; recovery methods, which build a physical model of the underwater image degradation process; and data-driven deep learning methods.

Enhancement Methods

Enhancement methods directly adjust the pixel values of a given underwater image to achieve contrast enhancement and color correction without considering the degradation process. In recent years, fusion-based enhancement methods have shown promising results. EUF [4] is based on the principle of fusion and does not require special hardware or knowledge of the underwater scene structure and conditions; its inputs and weight measures are derived only from the degraded image. CBFU [5] builds on the coordination of color-compensated and white-balanced versions of the raw degraded image; it transfers edge and color contrast to the enhanced image through a multiscale fusion strategy. ICM [6] is based on shift stretching: stretches of color contrast, saturation, and intensity are applied to improve image quality. Among other enhancement methods, gamma correction (GC) [7] adjusts images that are too bright or too dark to enhance contrast. A model that utilizes the characteristics of light scattering is proposed in [8]. First, the RGB channel average ratio is used to categorize the color cast into five groups. Then, a multiscene color recovery method uses the optical attenuation characteristics to determine the color loss rate of the RGB channels in various scenarios and restore the color of underwater images. These enhancement methods apply image processing directly by subjectively adjusting pixel values to suppress noise, reduce edge blur, enhance the features of the target object, and weaken the influence of irrelevant environmental factors. However, because the underwater optical imaging model is not considered, additional noise may be introduced, which can cause severe oversaturation in different image regions.
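To make the pixel-level nature of such enhancement methods concrete, the following is a minimal gamma-correction sketch in the spirit of GC [7], not the implementation used in that work; the gamma value is illustrative.

```python
import numpy as np

def gamma_correct(image_rgb: np.ndarray, gamma: float = 0.7) -> np.ndarray:
    """Apply global gamma correction to an RGB image with values in [0, 255].

    gamma < 1 brightens dark images; gamma > 1 darkens overly bright ones.
    """
    normalized = image_rgb.astype(np.float32) / 255.0   # map to [0, 1]
    corrected = np.power(normalized, gamma)              # per-pixel power law
    return (corrected * 255.0).clip(0, 255).astype(np.uint8)

# Example (hypothetical input array):
# enhanced = gamma_correct(raw_image, gamma=0.6)
```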

Recovery Methods

The recovery methods take into account the degradation process of underwater images, the imaging principle, and the construction of a physical model. DCP [10] was originally a solution to image dehazing. Many researchers have created DCP-based underwater recovery methods after observing the resemblance, in terms of descattering, between hazy photos and blurred underwater images. Because of the particular features of the underwater environment, UDCP [11] applies the DCP only to the green and blue channels. IBLA [12] is an underwater scene depth estimation approach based on image blurriness and light absorption. It gives more precise background-light and depth estimates that can be used in the image formation model (IFM): the method first selects the background light from the blurred region and obtains a depth map, and then the transmission map derived from the background light is used to recover the scene radiance. The coefficients of ULAP [14] are trained using supervised linear regression. To recover the true scene radiance, the approach first performs depth estimation and then estimates the background light and the RGB transmission maps from the depth map. A new color compensation method is proposed in [15]. The underwater image region with the most severe color distortion is compensated by combining the polarized image with the intensity image, which improves the exposure of low-luminance areas; the dark channel prior is then used to deblur and improve the image. Recovery methods restore degraded images using prior knowledge, but when the prior knowledge is inaccurate, serious estimation errors often result. The absence of reliable prior information about underwater images remains a major obstacle in this area.
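For illustration, a minimal sketch of the dark channel computation that DCP-style recovery methods build on is given below; background-light and transmission estimation are omitted, and the patch size is an assumed value.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(image_rgb: np.ndarray, patch_size: int = 15) -> np.ndarray:
    """Dark channel of an RGB image in [0, 1]: per-pixel minimum over the
    color channels, followed by a local minimum filter over a patch."""
    min_over_channels = image_rgb.min(axis=2)
    return minimum_filter(min_over_channels, size=patch_size)

def dark_channel_gb(image_rgb: np.ndarray, patch_size: int = 15) -> np.ndarray:
    """UDCP-style variant restricted to the green and blue channels,
    since red light is heavily attenuated underwater."""
    min_gb = image_rgb[..., 1:3].min(axis=2)
    return minimum_filter(min_gb, size=patch_size)
```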

Data-Driven Deep Learning Methods

Data-driven deep learning approaches are mainly classified as CNN-based, GAN-based, and Transformer-based. WaterNet [26] takes three images processed by white balance, histogram equalization, and GC as inputs and uses a gated fusion network to learn the corresponding confidence maps that determine the most important features of the inputs in the final result. WaterNet can greatly improve image contrast and correct the color cast to some extent; however, for images with a strong color cast, the correction is incomplete, and overenhancement is another issue. UWCNN [27] is built on an underwater scene prior, which can be utilized to synthesize training data. UWCNN reconstructs a clear underwater image without estimating the parameters of the underwater imaging model. MLFcGAN [31] extracts multiscale features and then uses global features to improve the local features at each scale. MLFcGAN performs better in color correction, but it struggles with hazy images and may even introduce false enhancement effects. FUnIEGAN [24] is based on a conditional generative adversarial network; it formulates an objective with a content-aware loss that assesses perceptual image quality using global content, color, local texture, and style. The model uses only a simple 4-layer UNet to achieve real-time performance. Since many paired underwater image datasets are generated using CycleGAN [18], CycleGAN can also be used for underwater image enhancement. Uformer [23] is proposed for image restoration; it is built on a locally enhanced window (LeWin) Transformer block that computes self-attention over nonoverlapping windows, together with a multiscale restoration modulator that adjusts the features in the Uformer decoder layers. Applied to underwater images, Uformer can reduce the color cast and improve contrast to some extent, but it cannot remove the haze effect. STSC [28] develops an efficient and compact enhancement network in collaboration with a high-level semantic-aware pretrained model, exploiting hierarchical feature representations as an auxiliary for low-level underwater image enhancement. SCNet [29] focuses on the spatial and channel dimensions, with the key idea of learning water-type desensitized features; its purpose is to improve image quality and cope with the diversity of water degradation. TACL [30] achieves both visually friendly and task-oriented enhancement. It can noticeably improve sharpness, but it is prone to residual water color, and some image regions become too bright.

Since the underwater environment is complicated, many methods cannot fully learn the distribution of the target images, so a large deviation remains between the enhanced image and the target image. Moreover, there are still large differences between synthesized and real underwater images: the distribution learned on synthesized images by data-driven deep learning methods transfers poorly to real underwater images, and the processed images still show defects such as color cast, missing detail, and overenhancement. How to better solve these issues is the focus of this paper.

Proposed Method

Underwater image enhancement learns a mapping from underwater images, degraded for various reasons, to clear target images. Due to its outstanding performance in image generation, GAN has drawn increasing attention. As the framework of this paper, we adopt the conditional generative adversarial network (cGAN) [32] and design a suitable generator (G) to learn the mapping mentioned above. Recently, the Transformer has been increasingly used for visual tasks because it extracts long-range dependencies well; this technique is also incorporated into the construction of TEGAN.

Here, we introduce a new architecture that contains a well-designed novel generator and a two-branch discriminator. Then, by referring to the LeWin block in Uformer [23] and RPE [33], we propose a new window-based dual local enhancement (DleWin) block that is more suitable for the UIE task. Finally, the WGAN-GP loss, Uloss, and L1 loss are adopted to guide the network training.

Network Architecture

Figure 2 depicts the TEGAN architecture in detail. Elaborately constructed Inception, Bottleneck, and Fusion units are added to the original Encoder-Decoder generator, which is designed as a UNet-like network. The effectiveness of each component is demonstrated in the “Experiments and Analysis” section. The discriminator includes two branches, namely, a feature branch and a color branch.

Fig. 2

The architecture of the TEGAN proposed in this paper. From left to right is the framework of the generator (a), discriminator (b), and DleWin blocks (c). The generator is composed of Inception, Encoder, Bottleneck, Decoder, and Fusion units. The discriminator is composed of a feature branch and a color branch. The DleWin block consists of Attn, LeFF, and PLE modules

Generator

Inspired by Uformer, we propose a Transformer embedded generator for underwater image enhancement, but unlike Uformer, we do not embed Transformer blocks at every scale. Specifically, a partial fusion scheme is designed to combine the Transformer and the convolutional neural network effectively. We believe that, compared with the Transformer, the convolutional neural network performs better in multiscale feature extraction, so we use CNNs in the Encoder and Decoder units for multiscale feature extraction and reconstruction, which effectively reduces edge blurring and retains more details. The Transformer block, owing to its strength in extracting raw and global information from images, is used in the Inception and Bottleneck units. The benefit of incorporating global information at each scale is demonstrated in MLFcGAN [31], and we adopt this idea: the global information fully extracted by the Transformer block is integrated into each feature scale, which is particularly effective for solving the color cast problem of underwater image degradation.

As shown in Fig. 2a, one DleWin block is embedded in the Inception unit to extract the long-range dependencies of the features directly from the original image; these features are then used for subsequent feature extraction. We also explore the effect of the number of DleWin blocks in the Inception unit on model performance, as shown in Fig. 3a. When there is more than one DleWin block, the time consumption increases dramatically and the performance degrades.

Fig. 3

Effect of the number of DleWin blocks on the overall performance in terms of the image quality evaluation metric MSE. a Effect of the number of DleWin blocks in the Inception unit and b effect of the number of DleWin blocks in the Bottleneck unit

The Encoder unit consists of five encoding layers; details are shown in the Encoder part of Fig. 4. It performs multiscale feature extraction on the features preliminarily obtained through the Inception unit and finally feeds 512 × 8 × 8 feature maps into the Bottleneck unit. In addition, the features extracted at each layer are transferred to the corresponding layer of the Decoder unit through skip connections, as shown in Fig. 2a. Encoder1 contains a single convolutional layer, while encoder2-encoder5 each contain a Conv + BatchNorm + ReLU (CBR) module. All convolutional layers use size = 4 × 4, stride = 2, and padding = 1, which downsamples the features while extracting them.
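A minimal PyTorch sketch of such an Encoder stack is given below (five layers, 4 × 4 kernels, stride 2, padding 1, CBR modules from encoder2 on); the input channel count and the intermediate widths other than the final 512 are assumptions, since only the final 512 × 8 × 8 shape is stated.

```python
import torch
import torch.nn as nn

def cbr(in_ch: int, out_ch: int) -> nn.Sequential:
    """Conv + BatchNorm + ReLU (CBR) with a 4x4 / stride-2 / pad-1 convolution,
    which halves the spatial resolution while extracting features."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Encoder(nn.Module):
    """Five encoding layers: encoder1 is a bare convolution, encoder2-5 are CBR
    modules. A 256 x 256 input is reduced to an 8 x 8 feature map."""
    def __init__(self, in_ch: int = 64):          # in_ch: assumed Inception output width
        super().__init__()
        self.enc1 = nn.Conv2d(in_ch, 64, kernel_size=4, stride=2, padding=1)
        self.enc2 = cbr(64, 128)
        self.enc3 = cbr(128, 256)
        self.enc4 = cbr(256, 512)
        self.enc5 = cbr(512, 512)

    def forward(self, x: torch.Tensor):
        skips = []
        for layer in (self.enc1, self.enc2, self.enc3, self.enc4, self.enc5):
            x = layer(x)
            skips.append(x)                        # skip connections to the Decoder
        return x, skips                            # x: (N, 512, 8, 8) for a 256x256 input
```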

Fig. 4

Network structure of the Encoder and Decoder units in the generator. The blue part in Encoder is the extracted feature map. The corresponding blue part in Decoder represents the feature map from Encoder by skip connection, while the green part represents the feature map reconstructed by Decoder. The numbers on each layer annotate the shape of the features

The Bottleneck unit embeds two DleWin blocks. When the features extracted by the Inception unit are downsampled by the Encoder to a size of 8 × 8 (the same size as the window of the DleWin block), the Transformer block can extract global information, such as the overall lighting and image layout. Since the Transformer’s self-attention mechanism is good at extracting long-range information, using the DleWin block in this unit yields a significant performance improvement. We also investigate how the number of DleWin blocks in this unit affects model effectiveness; the number is set to 2 for the following reasons. As shown in Fig. 3b, the optimal performance in the Bottleneck is achieved with two DleWin blocks. As the number increases, network performance deteriorates: an excessive number of DleWin blocks makes the model extract too much global information, leading to overfitting, which harms the generalization ability and performance of the model, and the time consumption also increases dramatically as the network deepens. Moreover, the global information extracted by an earlier DleWin block becomes blurred after subsequent DleWin processing, weakening the positive role of global information in image enhancement. On the other hand, if the number of DleWin blocks is too small (fewer than two), the extracted global information is insufficient and poorly utilized, and the network performance is correspondingly poor.

The Decoder unit has five decoding layers, as shown in Fig. 4. The Decoder receives the global information extracted by the Bottleneck and, after five decoding layers, outputs the enhanced image with a shape of 3 × 256 × 256. Mirroring the Encoder, decoder1 contains a transposed convolution and a tanh activation function, and decoder2-decoder5 each contain a CBR module. All transposed convolutions use size = 4 × 4, stride = 2, and padding = 1, which upsamples the features while reconstructing them.
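A matching Decoder sketch is given below, pairing with the Encoder sketch above; it uses transposed 4 × 4 / stride-2 / padding-1 convolutions and ends decoder1 with tanh as stated. The channel widths and the concatenation of encoder skip features are assumptions, and the Fusion-unit inputs described next are omitted here.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Five decoding layers: decoder2-5 are ConvTranspose + BatchNorm + ReLU
    modules, decoder1 is a transposed convolution followed by tanh that
    outputs the 3 x 256 x 256 enhanced image."""
    def __init__(self):
        super().__init__()
        def up(in_ch, out_ch):
            return nn.Sequential(
                nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
        # Inputs of the inner layers concatenate decoder features with encoder skips.
        self.dec5 = up(512, 512)
        self.dec4 = up(512 + 512, 256)
        self.dec3 = up(256 + 256, 128)
        self.dec2 = up(128 + 128, 64)
        self.dec1 = nn.Sequential(
            nn.ConvTranspose2d(64 + 64, 3, 4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, x: torch.Tensor, skips: list) -> torch.Tensor:
        x = self.dec5(x)                                     # (N, 512, 16, 16)
        x = self.dec4(torch.cat([x, skips[3]], dim=1))       # (N, 256, 32, 32)
        x = self.dec3(torch.cat([x, skips[2]], dim=1))       # (N, 128, 64, 64)
        x = self.dec2(torch.cat([x, skips[1]], dim=1))       # (N, 64, 128, 128)
        return self.dec1(torch.cat([x, skips[0]], dim=1))    # (N, 3, 256, 256)
```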

The Fusion unit integrates the global information extracted by the Bottleneck unit, such as the overall lighting and layout, into each scale. As shown in Fig. 5, the global information first passes through an F_adjust operation, a convolution with size = 1 × 1 and stride = 1, which adjusts its channels to match the corresponding Decoder layer. The global information is then copied and reshaped by the F_copy and F_reshape operations so that the output fusion information has the same shape as the feature map of the corresponding Decoder layer. Fusing global information into each scale helps produce more realistic colors and finer details.
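A minimal sketch of how one Fusion unit could be realized follows; F_adjust is the stated 1 × 1 convolution, while the global pooling of the Bottleneck output and the concatenation with the decoder feature map are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class FusionUnit(nn.Module):
    """Broadcasts Bottleneck global information to one Decoder scale."""
    def __init__(self, global_ch: int, decoder_ch: int):
        super().__init__()
        self.f_adjust = nn.Conv2d(global_ch, decoder_ch, kernel_size=1, stride=1)

    def forward(self, global_feat: torch.Tensor, decoder_feat: torch.Tensor):
        g = global_feat.mean(dim=(2, 3), keepdim=True)   # assumed global pooling
        g = self.f_adjust(g)                             # match decoder channels
        # F_copy / F_reshape: replicate to the decoder feature's spatial size.
        g = g.expand(-1, -1, decoder_feat.shape[2], decoder_feat.shape[3])
        return torch.cat([decoder_feat, g], dim=1)       # assumed fusion by concatenation
```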

Fig. 5

Schematic diagram of Fusion unit

Discriminator

We use a two-branch discriminator containing a feature branch and a color branch, as shown in Fig. 2b; the feature branch preserves image features and enhances contrast, while the color branch performs color correction to generate more realistic colors. Both branches adopt PatchGAN [34], as shown in Fig. 6a. The discriminator of the original GAN outputs only a single value (real or fake) for the whole generated image, as shown in Fig. 6b. This evaluates the overall image quality but lacks localized evaluation, so the local details of the generated image tend to be blurred. PatchGAN is fully convolutional: the discriminator evaluates the generated image as an N × N matrix, where each element corresponds to the evaluation of a small patch region, and the average of the matrix forms the final evaluation of the whole image. PatchGAN focuses on local information, which gives the generated image more detail and less local blur. Moreover, compared with a full-image discriminator, PatchGAN has fewer convolutional layers. In this paper, for a 256 × 256 generated image, the discriminator forms a 30 × 30 evaluation matrix, and the receptive field (patch size) of each value in the matrix is 70 × 70.

Fig. 6

Schematic diagram for different types of discriminators

In detail, the feature branch preserves the image content with one convolutional layer, then stacks three Conv + BatchNorm + Leaky-ReLU (CBL) modules and one convolutional layer to judge the authenticity of the image; it outputs an adversarial map that drives the generator to produce realistic images. The color branch directly stacks five CBL modules and one convolutional layer to judge whether the image belongs to an underwater scene; it outputs an underwater index map that evaluates the strength of underwater attributes and drives the generator to produce colors consistent with in-air images. The original image is concatenated with either the enhanced image or the real image and fed into the discriminator, which finally outputs an adversarial map and an underwater index map.
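A hedged PyTorch sketch of such a two-branch PatchGAN-style discriminator is shown below; the layer counts follow the description above, while channel widths and strides are assumptions chosen so that the feature branch yields a 30 × 30 map for a 256 × 256 input.

```python
import torch
import torch.nn as nn

def cbl(in_ch: int, out_ch: int, stride: int) -> nn.Sequential:
    """Conv + BatchNorm + LeakyReLU (CBL) module."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

class TwoBranchDiscriminator(nn.Module):
    """Feature branch (conv + 3 CBL + conv -> adversarial map) and color branch
    (5 CBL + conv -> underwater index map), both operating on the 6-channel
    concatenation of the original image and a candidate image."""
    def __init__(self):
        super().__init__()
        self.feature = nn.Sequential(
            nn.Conv2d(6, 64, 4, stride=2, padding=1),      # content-preserving conv
            cbl(64, 128, 2), cbl(128, 256, 2), cbl(256, 512, 1),
            nn.Conv2d(512, 1, 4, stride=1, padding=1),     # ~30 x 30 adversarial map
        )
        self.color = nn.Sequential(
            cbl(6, 64, 2), cbl(64, 128, 2), cbl(128, 256, 2),
            cbl(256, 512, 1), cbl(512, 512, 1),
            nn.Conv2d(512, 1, 4, stride=1, padding=1),     # underwater index map
        )

    def forward(self, original: torch.Tensor, candidate: torch.Tensor):
        pair = torch.cat([original, candidate], dim=1)
        return self.feature(pair), self.color(pair)
```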

DleWin Block

In contrast to the convolutional neural network, the Transformer can compute the correlation between any pair of pixels directly without passing through hidden layers. The CNN models the relationships between neighboring pixels, while the Transformer attends to the relationships among all pixels. Therefore, we can design strategies that make the two complement each other well.

For underwater image enhancement with the Transformer, two problems must be solved. First, the standard Transformer [17] computes global self-attention among all tokens, which incurs a computational cost quadratic in the number of tokens and is prohibitively expensive for images. Second, local information is particularly important for vision tasks, especially underwater image enhancement, yet the Transformer is not good at extracting local information.

To address these issues, we propose the window-based DleWin block, in which self-attention is computed efficiently within windows and CNNs are introduced for local enhancement in both serial and parallel forms. The DleWin block includes three modules: a self-attention module (Attn) for capturing features, a serial local enhancement feedforward network (LeFF), and a parallel local enhancement module (PLE). As shown in Fig. 2c, the input feature maps pass through Attn for feature extraction, and LeFF then performs local enhancement on the resulting features; meanwhile, PLE performs local enhancement on the features before they pass through Attn. A skip connection is added between Attn and LeFF to avoid degradation of the input features.

As shown in Fig. 2c, Attn contains a layer normalization (LN) layer and window-based multihead self-attention (W-MSA). LeFF contains an LN layer and three convolutional layers: the input tokens are first transformed by a linear projection implemented as a 1 × 1 convolution, reshaped into a 2D feature map, processed by a 3 × 3 convolutional layer, stretched back into tokens, and finally projected to the same dimension as the input features. PLE contains two Conv + BatchNorm + GELU (CBG) modules. Unlike LeFF, the input of PLE is the features that have not been processed by Attn: LeFF locally enhances the long-range dependencies captured by Attn, while PLE directly enhances the local content of the input features, so the two have different local enhancement effects. LeFF compensates for the Transformer’s weakness in extracting local features, while PLE further strengthens the whole block’s local feature extraction. Their combination meets the demand of underwater image enhancement for local features and further alleviates the adverse effects of relying only on the long-range dependencies captured by Attn. The differences between PLE and LeFF are shown in Fig. 7. The effectiveness of the embedded DleWin block with PLE and LeFF is demonstrated in the “Experiments and Analysis” section.
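A simplified sketch of the DleWin data flow is given below. Here nn.MultiheadAttention stands in for the window-based W-MSA (sketched after Eq. (4)); the LeFF hidden width, the depthwise 3 × 3 convolution, the 3 × 3 CBG kernels, and the additive merging of the PLE branch are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class DleWinBlock(nn.Module):
    """Dual local enhancement block: Attn (LN + attention with a skip
    connection), a serial LeFF branch, and a parallel PLE branch on the
    un-attended input features."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Stand-in for W-MSA; the real block restricts attention to 8x8 windows.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = dim * 4                               # assumed expansion ratio
        self.leff = nn.Sequential(                     # 1x1 conv, 3x3 conv, 1x1 conv
            nn.Conv2d(dim, hidden, 1), nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden), nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )
        self.ple = nn.Sequential(                      # two CBG modules
            nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim), nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (N, C, H, W)
        n, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)              # (N, H*W, C)
        normed = self.norm1(tokens)
        attn_out, _ = self.attn(normed, normed, normed)
        tokens = tokens + attn_out                          # skip connection around Attn
        feat = self.norm2(tokens).transpose(1, 2).reshape(n, c, h, w)
        serial = feat + self.leff(feat)                     # serial enhancement (LeFF)
        return serial + self.ple(x)                         # parallel enhancement (PLE)
```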

Fig. 7

Schematic diagram for different enhancement modes between PLE and LeFF

In addition, instead of a global self-attention mechanism, we deploy window-based multihead self-attention (W-MSA). The input feature matrix \(X\in {R}^{C\times H\times W}\) is partitioned into N feature windows of size M × M, where C, H, and W are the number of channels, height, and width, respectively. Then, the transposed and stretched features \({X}^{i}\in {R}^{{M}^{2}\times C}\) of each window are obtained. In short, W-MSA encodes every pixel within a window as a token and performs self-attention within nonoverlapping local windows, which significantly reduces the computational cost. The motivation for using multihead self-attention is that splitting the channels into multiple heads, each forming its own subspace, allows the model to focus on different aspects of the information, which are then combined. Suppose there are k heads; then each head has dimension \({d}_{k}=C/k\), and the kth head processes a feature map \({\widehat{X}}_{k}\in {R}^{{M}^{2}\times {d}_{k}}\). The self-attention of the kth head is calculated as follows:

$$X=\left\{{X}^{1},{X}^{2},\dots {X}^{N}\right\}, N=HW/{M}^{2}$$
(1)
$${Y}_{k}^{i}=\mathrm{Attention}\left({X}^{i}{W}_{k}^{Q},{X}^{i}{W}_{k}^{K},{X}^{i}{W}_{k}^{V}\right), i=1,\dots ,N$$
(2)
$${\widehat{X}}_{k}=\left\{{Y}_{k}^{1},{Y}_{k}^{2},\dots ,{Y}_{k}^{N}\right\}$$
(3)

where \({W}_{k}^{Q},{W}_{k}^{K},{W}_{k}^{V}\in {R}^{C\times {d}_{k}}\) represent the projection matrices of query (Q), key (K), and value (V) of the kth head, respectively. The outputs of all heads are then concatenated and linearly mapped to obtain the final results. W-MSA also applies relative position encoding. The attention can be expressed as

$$\mathrm{Attention}\left(Q,K,V\right)=\mathrm{SoftMax}\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}+B\right)V$$
(4)

B is the relative position bias, with the value derived from the learnable parameter \(\widehat{B}\in {R}^{(2M-1)\times (2M-1)}\). Compared to global attention, W-MSA can decrease the time complexity of the input feature map \(X\in {R}^{C\times H\times W}\) from \(O({H}^{2}{W}^{2}C)\) to \(O({M}^{2}HWC)\).

The self-attention mechanism is depicted in Fig. 8. For the input feature matrix, the query (Q), key (K), and value (V) are generated by the learnable parameter matrices \({W}_{q}\), \({W}_{k}\), and \({W}_{v}\), respectively. Q and K are then multiplied, the relative position encoding (B) is added, and the result is normalized (Z-norm) to obtain the attention matrix Attn. Finally, Attn is activated by Softmax and multiplied by V to produce the output.
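A minimal sketch of W-MSA with relative position bias, following Eqs. (1)–(4), is given below; the QKV projection layout and the window-partition bookkeeping are our own choices, and H and W are assumed to be divisible by the window size M.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowAttention(nn.Module):
    """Window-based multihead self-attention (W-MSA): tokens attend only
    within non-overlapping M x M windows, with a learnable relative position
    bias B of size (2M-1) x (2M-1) per head."""
    def __init__(self, dim: int, window_size: int = 8, num_heads: int = 8):
        super().__init__()
        self.m, self.heads, self.dk = window_size, num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.bias_table = nn.Parameter(torch.zeros((2 * window_size - 1) ** 2, num_heads))
        coords = torch.stack(torch.meshgrid(
            torch.arange(window_size), torch.arange(window_size), indexing="ij"))
        rel = coords.flatten(1)[:, :, None] - coords.flatten(1)[:, None, :]
        rel = rel.permute(1, 2, 0) + window_size - 1
        self.register_buffer("bias_index", rel[..., 0] * (2 * window_size - 1) + rel[..., 1])

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (N, C, H, W)
        n, c, h, w = x.shape
        m = self.m
        # Partition into windows of M*M tokens each: (N*(H/M)*(W/M), M^2, C).
        xw = x.reshape(n, c, h // m, m, w // m, m).permute(0, 2, 4, 3, 5, 1)
        xw = xw.reshape(-1, m * m, c)
        qkv = self.qkv(xw).reshape(-1, m * m, 3, self.heads, self.dk).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]                        # (B, heads, M^2, dk)
        attn = (q @ k.transpose(-2, -1)) / self.dk ** 0.5       # QK^T / sqrt(dk)
        attn = attn + self.bias_table[self.bias_index].permute(2, 0, 1)   # + B
        out = F.softmax(attn, dim=-1) @ v
        out = out.transpose(1, 2).reshape(-1, m * m, c)
        out = self.proj(out)
        # Reverse the window partition back to (N, C, H, W).
        out = out.reshape(n, h // m, w // m, m, m, c).permute(0, 5, 1, 3, 2, 4)
        return out.reshape(n, c, h, w)
```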

Fig. 8

Illustration of W-MSA’s self-attention mechanism

Objective Function

The standard GAN suffers from mode collapse and vanishing gradients. Mode collapse is not a major concern in underwater image enhancement, since the generator is conditioned on an input image, but the vanishing gradient problem remains. To address it, Arjovsky et al. proposed the Wasserstein GAN (WGAN) [35]. WGAN computes the Wasserstein distance, which requires the discriminator to satisfy a Lipschitz constraint. The original solution clips the weights to a fixed interval [− c, c], but this crude operation does not yield good results. Therefore, the WGAN with gradient penalty (WGAN-GP) [21] is introduced, and the objective becomes

$$\begin {aligned}{L}_{{\mathrm{WGAN}-\mathrm{GP}}_{\mathrm{D}}}=& \,{E}_{x,y}\left[{D}_{W}\left(x,y\right)\right]-{E}_{x}\left[{D}_{W}\left(x,G\left(x\right)\right)\right]\\&+\lambda {E}_{\widehat{x}}[{({\Vert {\nabla }_{\widehat{x}}{D}_{W}\left(\widehat{x}\right)\Vert }_{2}-1)}^{2}]\end {aligned}$$
(5)
$${L}_{{\mathrm{WGAN}-\mathrm{GP}}_{\mathrm{G}}}={E}_{x}\left[{D}_{W}(x,G(x))\right]$$
(6)

where x and y are the degraded image and the ground truth (the in-air, clear, color-balanced target image), respectively, and \(\widehat{x}\) is a sample drawn uniformly along the line between \(G\left(x\right)\) and y. In this paper, WGAN-GP guides the feature (adversarial) branch.
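A sketch of the gradient-penalty term and the branch losses of Eqs. (5)–(6) follows; d_feature(x, candidate) is assumed to return the feature-branch map of the discriminator, and G(x) should be detached when updating the critic.

```python
import torch

def gradient_penalty(d_feature, x, real, fake, lam=10.0):
    """Gradient penalty of Eq. (5): penalize the critic's gradient norm at
    points x_hat sampled on the line between the ground truth y and G(x)."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    score = d_feature(x, x_hat)
    grads, = torch.autograd.grad(outputs=score, inputs=x_hat,
                                 grad_outputs=torch.ones_like(score),
                                 create_graph=True)
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()

def critic_loss(d_feature, x, y, g_x, lam=10.0):
    """Eq. (5) under the paper's sign convention; detach g_x before calling."""
    return (d_feature(x, y).mean() - d_feature(x, g_x).mean()
            + gradient_penalty(d_feature, x, y, g_x, lam))

def generator_adv_loss(d_feature, x, g_x):
    """Eq. (6): adversarial term of the generator objective."""
    return d_feature(x, g_x).mean()
```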

To guide the training of the color branch, we introduce Uloss from GAN-RA [22].

As shown in Fig. 9, in the Lab color space, we take the in-air image as the target. The distance between the image and the in-air image is evaluated using the underwater index (U). Its formula is as follows:

$$U=\frac{\sqrt{{d}_{0}}}{10{a}_{l}{d}_{a}{d}_{b}}$$
(7)

where \({d}_{0}\) is the distance from the image mean to the center of the a and b color channels, \({a}_{\mathrm{l}}\) denotes the mean of the L channel, and \({d}_{a}{d}_{b}\) denotes the area of the image’s pixel value distribution in the a-b plane. The higher the value of \({d}_{0}\), the more severe the color distortion.
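The following is a rough sketch of how Eq. (7) could be evaluated on an image. The exact definitions of \(d_0\) (taken here as the distance of the mean (a, b) point from the channel center) and of \(d_a\), \(d_b\) (taken here as the ranges of the a and b channels) are assumptions; see [22] for the reference formulation.

```python
import numpy as np
from skimage.color import rgb2lab

def underwater_index(image_rgb: np.ndarray) -> float:
    """Approximate underwater index U of Eq. (7) for an RGB image."""
    lab = rgb2lab(image_rgb)                       # L in [0, 100], a/b roughly [-128, 127]
    l_mean = lab[..., 0].mean()
    a, b = lab[..., 1], lab[..., 2]
    d0 = np.hypot(a.mean(), b.mean())              # distance from the a-b center (0, 0)
    da, db = a.max() - a.min(), b.max() - b.min()  # assumed spread of the distribution
    return float(np.sqrt(d0) / (10.0 * l_mean * da * db + 1e-8))
```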

Fig. 9

Underwater index illustration. The elliptical region represents the a-b distribution of the image in the Lab color space. The smaller \({d}_{0}\), larger \({d}_{a}\) and \({d}_{b}\) indicate that the image is closer to the in-air image [22]

The underwater index loss is designed using the L2 loss:

$${L}_{{U}_{D}}={E}_{x,y}[{({D}_{U}\left(x,y\right)-U\left(x,y\right))}^{2}]+{E}_{x}[{({D}_{U}(x,G\left(x\right))-U(x,G\left(x\right)))}^{2}]$$
(8)
$${L}_{{U}_{G}}={E}_{x}\left[{\left({D}_{U}(x,G(x)\right))}^{2}\right]$$
(9)

x is the original image, y is the ground truth, and U(·) computes the underwater index of an image. In the initial stage of training, the color branch is not yet trained well enough to distinguish underwater from in-air images, so it cannot adequately guide the generator to learn the distribution of in-air images. Therefore, we adopt a two-phase training strategy: the generator does not include the underwater index loss at the beginning of training but adds it after the color branch has been sufficiently trained.

Existing methods show that adding the L1 (or L2) loss to the objective function allows the generator to directly learn the distance from the original image to the ground truth [36]. It focuses on the low-frequency information of the image, thus reducing blur. Compared with the L2 loss, the L1 loss is more robust, so it is deployed in this paper for blur reduction. Its formula is as follows:

$${L}_{{l1}_{G}}={E}_{x,y}\left[{\Vert y-G\left(x\right)\Vert }_{1}\right]$$
(10)

The global objective function can be expressed as follows:

$${L}_{D}={L}_{{\mathrm{WGAN}-\mathrm{GP}}_{\mathrm{D}}}+{L}_{{U}_{D}}$$
(11)
$${L}_{G}={\lambda }_{W}{L}_{{\mathrm{WGAN}-\mathrm{GP}}_{\mathrm{G}}}+{\lambda }_{U}{L}_{{U}_{G}}+{\lambda }_{l1}{L}_{{l1}_{G}}$$
(12)

where \({\lambda }_{W}\), \({\lambda }_{U}\), and \({\lambda }_{l1}\) are the weight factors. The optimal model is \({D}^{*}={\mathrm{arg}}_{D}\mathrm{min}{L}_{D}\), \({G}^{*}={\mathrm{arg}}_{G}\mathrm{min}{L}_{G}\).
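A sketch of how the weighted objectives of Eqs. (11)–(12) could be assembled is given below, including the two-phase switch for the color-branch term; the d_feature/d_color interfaces and the 30-epoch switch point (taken from the training-curve discussion in the “Training Details” section) are assumptions of this sketch, and the weights follow the stated settings.

```python
def discriminator_loss(l_wgan_gp_d, l_u_d):
    """Eq. (11): the two discriminator branch losses are summed."""
    return l_wgan_gp_d + l_u_d

def generator_loss(d_feature, d_color, x, y, g_x, epoch,
                   lam_w=0.1, lam_u=5.0, lam_l1=10.0, switch_epoch=30):
    """Eq. (12) with the two-phase strategy: the color-branch term is only
    enabled once the color branch has been trained sufficiently."""
    l_adv = d_feature(x, g_x).mean()                 # Eq. (6)
    l_u = (d_color(x, g_x) ** 2).mean()              # Eq. (9)
    l_l1 = (y - g_x).abs().mean()                    # Eq. (10)
    lam_u_eff = lam_u if epoch >= switch_epoch else 0.0
    return lam_w * l_adv + lam_u_eff * l_u + lam_l1 * l_l1
```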

Experiments and Analysis

Datasets

Data acquisition for UIE tasks is extremely difficult, especially for paired images with ground truth. We use a paired underwater image dataset from UGAN [37] as the training and validation sets, and several publicly available paired and unpaired underwater image datasets as test sets. All images are resized to 256 × 256 using bicubic interpolation.

Training Set

To better learn features from the ground truth and preserve image content, our model is trained on paired datasets (degraded underwater images with ground truth). A total of 6000 pairs are randomly selected from the 6128 pairs generated with CycleGAN [18] in [37] as our training set; the remaining pairs serve as the validation set.

Test Set

We adopt EUVP [24], RUIE [25], and UIEB [26] as our test datasets.

The EUVP (Enhancing Underwater Visual Perception) dataset contains separate sets of paired and unpaired images with varying degrees of perceptual quality. It mainly contains three subsets: paired, unpaired, and test samples. The paired subset contains dark images, images collected from ImageNet, and bluish and greenish images from real underwater scenes.

The RUIE dataset is a real underwater image dataset without ground truth. It includes three subsets: UCCS, UIQS, and UHTS. Among them, the UCCS subset contains blue, green, and blue-green subsets, corresponding to the common color cast problem in underwater image degradation.

The UIEB dataset meets the goals of underwater image data collection [26], i.e., diversity of underwater scenes, different characteristics of quality degradation, and a wide range of image contents. It consists of raw and challenge subsets. The raw subset contains 890 underwater images and corresponding reference images, which are the subjectively optimal enhancement results selected from various underwater image enhancement methods. The challenge subset contains 60 heavily degraded underwater images for which many previous enhancement methods have not achieved satisfactory results. It should be noted that although UIEB provides reference images, they are only images generated by other enhancement methods and cannot be regarded as ground truth.

We construct four separate groups of paired tests containing ground truth and four groups of unpaired tests without ground truth, as shown in Table 1. The paired image test set comprises four groups, i.e., the Validation set (Val) and the Underwater-dark, Underwater-imagenet, and Underwater-scenes subsets from EUVP; the total number of paired test images is 1028. The unpaired image test set also comprises four groups, i.e., all 2574 unpaired images in EUVP, all 4229 real-world underwater images in the three subsets of RUIE, all 890 images in the raw subset of UIEB, and all 60 images in the challenge subset of UIEB. As Table 1 shows, the unpaired real-world underwater test images are more abundant, which better reflects the generalization ability of our method.

Table 1 Description of the test datasets

Training Details

We set a batch size of 32, \({\lambda }_{W}=0.1\), \({\lambda }_{U}=5\), and \({\lambda }_{l1}=10\) and use the Adam optimizer with learning rate = 0.0002, β1 = 0.5, and β2 = 0.999. The training images are first resized to 286 × 286 using bicubic interpolation and then randomly cropped to 256 × 256 for data augmentation. We use PyTorch as the deep learning framework and train for 200 epochs on a platform with an Intel(R) Xeon Silver 4214R CPU, 4 GB of RAM, and a GeForce RTX 3090 GPU.
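A sketch of the stated pre-processing and optimizer configuration is given below; the composition of transforms and the ToTensor step are assumptions beyond the stated resize-then-crop augmentation.

```python
import torch
from torchvision import transforms

# Bicubic resize to 286 x 286 followed by a random 256 x 256 crop, as stated.
train_transform = transforms.Compose([
    transforms.Resize((286, 286), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.RandomCrop(256),
    transforms.ToTensor(),
])

def make_optimizers(generator: torch.nn.Module, discriminator: torch.nn.Module):
    """Adam optimizers with the stated hyperparameters (lr = 2e-4, betas = (0.5, 0.999))."""
    opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
    return opt_g, opt_d
```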

Loss curves are shown in Fig. 10. The feature branch loss \({L}_{{\mathrm{WGAN}-\mathrm{GP}}_{\mathrm{G}}}\) gradually reaches a dynamic equilibrium at the beginning. As mentioned in the “Objective Function” section, we adopt a two-phase training strategy: the generator does not include the underwater index loss at the beginning of training but adds it once the color branch is well trained. Specifically, after the 30th epoch, the color branch starts to work because \({L}_{{U}_{G}}\) is added to the generator training, and each loss begins to change dramatically; \({L}_{{U}_{G}}\) and \({L}_{{U}_{D}}\) then tend toward a dynamic equilibrium. Moreover, \({L}_{{l1}_{G}}\) decreases steadily, except for a rapid increase when the color branch starts to take effect.

Fig. 10

Illustration of the training losses. a G: WGAN-GP, G: U, G: l1, and G: T are labeled as the feature branch loss \({L}_{{\mathrm{WGAN}-\mathrm{GP}}_{\mathrm{G}}}\), color branch loss \({L}_{{U}_{G}}\), L1 loss \({L}_{{l1}_{G}}\), and global loss \({L}_{G}\) of the generator, respectively. b D: WGAN-GP, D: U, and D: T are labeled as the feature branch loss \({L}_{{\mathrm{WGAN}-\mathrm{GP}}_{\mathrm{D}}}\), the color branch loss \({L}_{{U}_{D}}\), and the global loss \({L}_{D}\). When \({L}_{{U}_{G}}\) starts to take effect, it can be seen that each loss has a significant change

Models for Comparison

Traditional and deep learning-based (data-driven) methods are compared to demonstrate the superiority of TEGAN, as shown in Table 2.

Table 2 Models for comparison

The enhancement methods include EUF, CBFU, ICM, and GC (EUF and CBFU are fusion-based). The recovery methods include MIP, DCP, UDCP, and ULAP. The learning-based methods are CycleGAN, FUnIEGAN, MLFcGAN, UWCNN, WaterNet, Uformer-B (the best-performing parameter setting of Uformer), STSC, SCNet, and TACL. For an objective comparison, all learning-based methods except TACL are trained on the same training set. The network parameters of the compared methods follow the recommended settings in the original papers to obtain the best enhancement results. It is worth mentioning that the training code of TACL is not publicly available, so we use the trained network parameters provided by its authors for comparison.

Results and Analysis

Paired Test Images

The benefits of our method are illustrated by the visual comparisons in Fig. 11. Compared with other methods, the images enhanced by our method are color balanced, with higher contrast and better visual effects. Using the ground truth (GT) as a reference, some methods provide only limited quality improvement, while others improve quality noticeably but still cause overenhancement or incorrect color correction. Most traditional methods have difficulty correcting the color deviation and remain far from the GT.

Fig. 11

Visual comparison of various methods in terms of color, sharpness, and contrast on paired test image sets. Each row is the processing result of the corresponding method. a, b Selected from Val. c, d From Underwater-dark. e, f From Underwater-imagenet. g, h From Underwater-scenes. Raw denotes the original raw image, and GT denotes the corresponding ground truth

For the enhancement methods, although ICM and GC can reduce blur, in Fig. 11a, c, h, ICM does not reduce the image’s color deviation, while GC reduces the image’s brightness. The fusion-based methods are more effective in brightness enhancement, but obvious problems remain, such as overenhancement compared with the GT. EUF introduces a large amount of red-blue noise in Fig. 11a and is severely overexposed in Fig. 11d, e. In Fig. 11b, d, CBFU shows serious color distortion compared with the GT.

For the recovery methods, MIP improves brightness to some extent, but the yellow compensation is excessive, so the restored images look yellowish, as shown in Fig. 11b, e, g. DCP can increase image contrast, but it cannot resolve the color deviation, as in Fig. 11a, d, f, g, h. UDCP increases the haze effect while reducing the color cast, as in Fig. 11b, d, e, g, and residual water color still remains in the recovered images compared with the GT. ULAP improves the quality of some images, but for others, such as Fig. 11a, d, h, it does not remove the effects of the water body or correct the color cast well.

The deep learning-based methods generally correct the color deviation better than the traditional methods, but they may not perform well in other aspects of image quality. CycleGAN and Uformer-B retain details well, but an obvious haze effect remains, as shown in Fig. 11g; Uformer-B also leaves a haze effect in Fig. 11f. CycleGAN’s removal of water-body noise is unsatisfactory, as shown in Fig. 11c, f. FUnIEGAN improves color saturation, but color distortion remains, such as the background distortion in Fig. 11e, g. MLFcGAN is closer to the GT but still shows a slight color cast: Fig. 11a has a bluish tint, and the flower in Fig. 11f is somewhat orange. UWCNN and WaterNet help remove blur, but for images with a serious color cast, color deviation remains. STSC and SCNet largely eliminate the color cast, but STSC still shows color deviation in Fig. 11a and whitening effects in Fig. 11c, d, while SCNet turns the objects in Fig. 11c, d green and cannot remove the haze completely, as shown in Fig. 11g. TACL improves clarity and brightness to a certain extent, but its color correction is poor, as shown in Fig. 11a; in addition, Fig. 11b, f, g still show serious residual water color and differ greatly from the GT. In contrast, TEGAN exhibits competitive performance: as shown in Fig. 11a, e, h, our results are almost identical to the GT, and Fig. 11c, d, f are closer to the GT than the other methods. It is worth mentioning that, compared with the GT, our result in Fig. 11b improves the color cast, and Fig. 11g reduces the haze effect.

Consistent quantitative conclusions can be drawn from Table 3 for the paired image test sets. Since GT images exist, we select the full-reference evaluation metrics MSE, PSNR, and SSIM [38]. MSE is the expected squared difference between the enhanced image and the GT; PSNR measures the pixel-level difference between the enhanced image and the GT; SSIM is an image quality criterion that matches human intuition and indicates how close the enhanced image is to the GT in structure and texture.
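For reference, the three full-reference metrics can be computed as follows (a sketch using scikit-image; the channel_axis argument requires version 0.19 or later).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_reference_metrics(enhanced: np.ndarray, gt: np.ndarray):
    """MSE, PSNR, and SSIM between an enhanced image and its ground truth,
    both uint8 RGB arrays of identical shape."""
    mse = float(np.mean((enhanced.astype(np.float64) - gt.astype(np.float64)) ** 2))
    psnr = peak_signal_noise_ratio(gt, enhanced, data_range=255)
    ssim = structural_similarity(gt, enhanced, channel_axis=-1, data_range=255)
    return mse, psnr, ssim
```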

Table 3 Full-reference image quality evaluation for paired image test sets.

Table 3 displays the full-reference evaluation metrics. Traditional methods generally perform unsatisfactorily, and none of them rank in the top two. Some methods perform well on specific test sets; e.g., MLFcGAN achieves the best PSNR and SSIM and the second-best MSE on the Val test set, and UWCNN achieves the best MSE, PSNR, and SSIM on the Underwater-scenes test set. SCNet also performs well on Underwater-dark and achieves average results overall. Our TEGAN achieves outstanding results on all datasets; its good generalization and competitive performance benefit from the strong learning ability of the DleWin block.

Unpaired Test Images

Experimental results for the unpaired images are depicted in Fig. 12, where every two images are selected from one test set. Among the original images, the two from EUVP show a blue-green cast. Of the two from RUIE, Fig. 12c is heavily greenish, and Fig. 12d shows greenish haze and blurred details. The two images from UIEB-Raw, shown in Fig. 12e, f, exhibit a slight blue cast. In the UIEB-Challenge dataset, Fig. 12g shows a slight haze effect, and Fig. 12h shows haze and a green cast.

Fig. 12

Visual comparison of various methods in terms of color, sharpness, and contrast on unpaired test image sets. Each row is the processing result of the corresponding method. a, b Selected from EUVP. c, d From RUIE. e, f From UIEB-Raw. g, h From UIEB-Challenge. Raw indicates the original image

The traditional methods, except for EUF and CBFU, fail to solve the color cast problem well. However, both EUF and CBFU show color distortion caused by oversaturation and excessive color compensation, as shown in Fig. 12c–e. In Fig. 12a, b, d, f, the other two enhancement methods, GC and ICM, reduce blur, but the color distortion and haze effect are not significantly improved.

Among the recovery methods, MIP improves color saturation to some extent, but it causes overenhancement of slightly degraded images and underenhancement of severely degraded ones, such as the overenhancement in Fig. 12a and the negligible improvement in Fig. 12d. DCP can correct color distortion, but it lowers image brightness, as shown in Fig. 12f, and it does not improve the color deviation of seriously degraded images, as shown in Fig. 12d. UDCP can improve brightness but causes a serious haze effect. ULAP improves color cast correction, but blue is overcompensated, and the recovered images are bluish, as in Fig. 12g, h.

For the deep learning-based methods, the CycleGAN-enhanced images still show color deviation. FUnIEGAN and MLFcGAN correct images with slight color deviation relatively well, as shown in Fig. 12a, b, but they produce erroneous halos on hazy images, as shown in Fig. 12d, h. UWCNN and Uformer-B also handle slight color deviation well but cannot reduce the haze effect. WaterNet handles details well, as in Fig. 12a, b, but it also introduces incorrect enhancement, such as the wrong green lines at the top of Fig. 12f. STSC and SCNet provide satisfactory improvement on images with less serious color distortion, such as Fig. 12a–c, but the color saturation and contrast of Fig. 12e are low, and the haze removal in Fig. 12d is unsatisfactory. TACL sharpens object edges noticeably, such as the echinus in Fig. 12c, d, but its color deviation is nonnegligible overall, as shown in Fig. 12a, b, d–f. In contrast, TEGAN achieves the best results in both color correction and blur elimination, producing enhanced images with richer colors, higher contrast, and more distinct details. It is worth mentioning that, in contrast with CycleGAN, the supervised learning used in this paper proves more suitable for underwater image enhancement than unsupervised learning. Compared with MLFcGAN, using Transformer blocks is more effective than using CNNs for extracting long-range or even global information. More importantly, the comparison with Uformer shows that our unit-level fusion scheme is superior to embedding Transformer blocks in every layer of the network.

For the unpaired image test set, since there is no GT as a reference, we choose nonreference evaluation metrics: UIQM [39], UCIQE [40], NIQE [41], BRISQUE [42], FRIQUEE [43], information entropy (Entropy), and underwater index (U) [22].

UIQM consists of three underwater image attribute measures, UICM, UISM, and UIConM, each evaluating one aspect of underwater image degradation. UCIQE linearly combines color density, saturation, and contrast in the Lab color space to quantitatively evaluate nonuniform color cast, blurring, and low contrast. NIQE extracts features using a multivariate Gaussian model and then combines them with quality distributions in an unsupervised manner. BRISQUE and FRIQUEE have been shown to be highly consistent with human subjective perception and allow objective evaluation of images [44]. Due to the high time complexity of FRIQUEE, at most the first 100 images of each test group are evaluated. Information entropy reflects the richness of the image, and the underwater index can be regarded as the characteristic image intensity.

The numerical comparison results shown in Table 4 demonstrate the excellent performance of TEGAN. According to the average results over the unpaired test groups, our method is superior to the others in UIQM, UCIQE, NIQE, BRISQUE, and FRIQUEE: UIQM is at least 0.3175 higher, UCIQE at least 0.0054 higher, NIQE at least 0.4465 lower, BRISQUE at least 8.1168 lower, and FRIQUEE at least 2.4478 higher. The performance of TEGAN in Entropy and the underwater index leaves room for improvement, but it still ranks near the top among the compared methods.

Table 4 Nonreference image quality evaluation for unpaired image test sets

Ablation Study

To confirm the efficacy of our design, several ablation studies are carried out. The color branch of the discriminator aims to generate colors that are more realistic than those of the GT. Therefore, to evaluate and demonstrate the learning ability of the DleWin block and the other generator units on the validation set with GT, the discriminator in this part of the experiment uses only the feature branch to guide the generator’s training (see Table 5).

Table 5 Model description with different generator structures
  1. The generator deploys only the Encoder + Decoder units, denoted as ED.

  2. The generator adds a Bottleneck unit between the Encoder and Decoder based on ED, denoted as ED-B.

  3. The generator adds the Fusion unit to ED-B and fuses the features extracted by the Bottleneck unit into each layer of the Decoder, denoted as ED-BF.

  4. On the generator framework proposed in this paper, the DleWin block is replaced with W-MSA, denoted as Ours-W.

  5. On the generator framework proposed in this paper, the DleWin block is replaced with the LeWin block, denoted as Ours-L.

  6. The full generator framework of TEGAN proposed in this paper, denoted as Ours.

Here, each generator structure is trained on the training set, and the model with the largest PSNR on the validation set is taken for comparison. Detail is crucial for improving the quality of underwater images. We compare the detail enhancement effects of different generator frameworks and Transformer blocks in Fig. 13. From a global perspective, TEGAN’s generator framework improves the input image significantly in brightness, color, and contrast and is closest to the GT. Locally, our generator enhances the structural details well, as shown by the enlarged areas in the red and blue boxes in Fig. 13.

Fig. 13

Ablation study of the contributions of each unit/block in terms of color, sharpness, and contrast on the validation set. Red and blue areas in each image are enlarged and displayed above to indicate the details. The number on the bottom of each image refers to its PSNR (dB)

Table 6 exhibits the quantitative evaluations of different generators on the validation set. Due to the robustness and excellent learning ability of our proposed generator framework, it achieves the best results on MSE, PSNR, and SSIM. In addition, from Table 6, we can derive the following conclusions:

  1. The Transformer can extract global information, which is very important for UIE. Its role can be clearly seen from the comparison between ED and ED-B.

  2. The Fusion unit has an excellent ability to fuse features across different scales. The comparison between ED-B and ED-BF shows that it has a positive effect: it integrates global information, such as the overall lighting and image layout, into each scale, and the fusion of global and local information at different scales helps generate images with more natural colors and better details.

  3. The Transformer can effectively extract the original features. The comparison between ED-BF and Ours shows that extracting the dependencies among the original features helps improve the enhancement results.

  4. The DleWin block is more effective for underwater image enhancement tasks. The numerical comparisons among Ours-W, Ours-L, and Ours show that local enhancement is particularly important: comparing Ours-W with Ours-L shows that LeFF effectively improves the metric performance, and comparing Ours-L with Ours shows the positive effect of PLE. Dual local enhancement effectively improves image clarity and significantly corrects the overall color of the image.
Table 6 Quantitative evaluations of the ablation study on the validation set

Running Time Comparison

The average running times of the different methods on a platform with an Intel(R) Core i5 (9th generation) CPU and a GeForce RTX 3090 GPU are listed in Table 7. The image resolution is 256 × 256. Only GC, FUnIEGAN, and UWCNN are faster than ours, because these three methods mainly pursue speed. We conclude that TEGAN not only achieves superior image quality but also runs fast enough to meet real-time processing requirements.

Table 7 Average running times of different methods (in seconds)

Enhancement Effect for High-Resolution Images

High-resolution 512 × 512 images from the SUIM dataset [45] are tested to verify the enhancement effect of TEGAN, using several representative comparison methods and evaluation metrics. The enhancement effects are shown in Fig. 14. CBFU easily causes color bias. The color corrections of ULAP, WaterNet, Uformer-B, and SCNet are incomplete, and the enhanced images still show a thin veil effect. TACL produces incorrect enhancement, such as white patches in the lower part of the turtle, and residual water color remains in the TACL-enhanced image of the second row. The TEGAN-enhanced images have high clarity and realistic color. The advantages of TEGAN in terms of evaluation metrics are shown in Table 8, indicating that TEGAN can handle high-resolution images well. In addition, the average running times of the different methods on high-resolution images are shown in Table 9; TEGAN has the fastest processing speed, which can facilitate many practical applications.

Fig. 14

Enhancement effect for high-resolution images on SUIM

Table 8 Nonreference image quality evaluation for high-resolution images on SUIM
Table 9 Average running time of different methods on high-resolution images (in seconds)

Downstream Application Test

In this section, several typical downstream visual tasks are selected to demonstrate the effectiveness of our model. We test SIFT keypoint matching [46], Canny edge detection [47], and underwater object detection, and compare the results with those of representative underwater image processing methods. As shown in Fig. 15, the SIFT algorithm finds only a few matched keypoints on the original image, while the enhancement methods increase the number of matches; in the images enhanced by our model, salient features are extracted and a large number of accurate matches are obtained. The same conclusion can be drawn from Fig. 16: Canny detection finds relatively few edges on the original image, while the images enhanced by our method yield significantly more edges than the others. In Fig. 17, we use the YOLOv5 [48] model, trained on a dataset of 300 labeled images, for underwater object detection; compared with the original images and the other enhancement methods, TEGAN achieves a significant and competitive improvement in detection accuracy. The numerical experiments in Fig. 18 show the percentage of performance improvement on the three downstream applications. Although the results vary by task, we observe improvements of approximately 41–117%, 4–32%, and 13–46%, respectively. These outstanding results reveal that our model can facilitate the performance of other visual tasks.
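A simplified OpenCV sketch of how the edge and keypoint signals can be compared before and after enhancement is shown below; it counts edge pixels and SIFT keypoints rather than reproducing the exact matching protocol of Fig. 15, and the Canny thresholds are illustrative.

```python
import cv2

def downstream_scores(raw_bgr, enhanced_bgr):
    """Count Canny edge pixels and SIFT keypoints for a raw image and its
    enhanced counterpart (both BGR uint8 arrays)."""
    sift = cv2.SIFT_create()   # available in recent opencv-python builds
    def stats(img):
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 100, 200)
        keypoints = sift.detect(gray, None)
        return int((edges > 0).sum()), len(keypoints)
    return {"raw": stats(raw_bgr), "enhanced": stats(enhanced_bgr)}
```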

Fig. 15

SIFT keypoint matching results with different methods. The original image and its flipped mirror image are used to exhibit the feature point matching performance

Fig. 16

Canny edge detection results with different methods. The upper row represents the images to be detected, and the lower row represents the edge

Fig. 17

Underwater object detection results by YOLOv5 with different methods. The labeled bounding boxes 1, 2, 3, and 4 represent the object categories of holothurian, echinus, scallop, and starfish, respectively

Fig. 18

Percentage of performance improvement tested on SIFT keypoint matching, Canny edge detection, and underwater object detection with different methods

Failure Case Analysis

Our method still has some shortcomings. For images with severe haze, enhancement introduces some noise and blur, as shown in Fig. 19a, b, respectively. For images with very low brightness and serious color loss, the brightness is not greatly improved, while the colors are excessively enhanced, as shown in Fig. 19c, d.

Fig. 19

Failure enhancement results. The upper row represents the raw images with serious haze, low brightness, and serious loss of color. The lower represents the less satisfactory enhancement results by our method

These failure cases are mainly caused by the fact that our training set does not contain such severely degraded and distorted images; deep learning methods find it difficult to handle images that differ greatly from the training set. In addition, since all our images are tested at a size of 256 × 256, some edge features of very large images are lost after downscaling, which also leads to blurring.

Conclusions and Future Work

In this paper, we propose a Transformer embedded generative adversarial network for underwater image enhancement. A DleWin Transformer block is designed that adapts well to the high demand of underwater image enhancement tasks for local feature extraction. We also fuse the Transformer with the CNN at the unit level, which allows our model to focus on local information while capturing long-range and even global dependencies. The proposed TEGAN with a two-branch discriminator preserves the image content through the feature branch and restores the image color through the color branch. Compared with other methods, TEGAN achieves the best comprehensive performance on both paired and unpaired datasets; moreover, it significantly facilitates the performance of downstream visual tasks. Future work can proceed along the following lines. Other attention mechanisms can be integrated into the Transformer to further improve downstream application tasks. Combining unsupervised and supervised training to address the shortage of paired datasets is another focus. In addition, there are still problems with the mainstream evaluation metrics: in some cases, image quality metrics deviate from subjective perception, and this inconsistency is an urgent issue to be studied and improved.