1 Introduction

Recently, underwater image enhancement has become a research hotspot in underwater vision [1], with a wide variety of applications in marine archaeology [2], marine biology and marine ecology [3]. Autonomous underwater vehicles have also been widely employed to explore and develop marine resources. However, the visual quality of underwater images hardly meets expectations because the images are degraded by many adverse effects, such as light scattering and wavelength-dependent light absorption [4,5,6], which limits the ability of autonomous underwater vehicles to understand the underwater scene, as shown in Fig. 1. Therefore, it is necessary to develop effective methods to obtain higher quality underwater images for pleasant visual perception.

Fig. 1

Samples of raw underwater images and their corresponding ground truth. Top row: raw underwater images; Bottom row: the corresponding ground truth underwater images

To address the above-mentioned problem, many underwater image enhancement methods have been proposed and have made notable progress. Traditional underwater image enhancement methods can be mainly classified into two groups: non-physical model based methods and physical model based methods. The former improve the quality of underwater images by modifying pixel values in the image. The latter build a degradation model for underwater images and obtain high quality images by estimating the parameters of the model. Recently, a variety of learning based underwater enhancement methods have also been proposed; they can be organized into two main categories: CNN-based methods and GAN-based methods. These learning-based approaches possess powerful non-linear representation and generalization abilities and have achieved leading results in underwater image enhancement tasks.

Although learning based underwater image enhancement methods have developed rapidly, there is still much room for improvement. Firstly, most existing CNN-based methods fuse features directly by concatenation or residual operations, such as [7, 8], which cannot reflect the interdependencies of the features at different scales. Secondly, CNN-based underwater image enhancement methods usually apply SSIM loss, L1 loss, or perceptual loss to train the network, aiming to impose texture, structure, content and semantic similarity on the predicted images. However, degraded underwater images often exhibit color distortion and low contrast because of light absorption and scattering, and these methods do not introduce specific color and contrast losses to correct the color casts and improve the contrast, which limits the enhancement quality of degraded underwater images. Furthermore, existing CNN-based underwater image enhancement methods have not simultaneously paid attention to the multiple factors that affect the visual perception of underwater images.

Given the above-mentioned problems, in this paper we propose a Multi-Task Cascaded Network (MTNet) for underwater image enhancement, which contains three cascaded sub-tasks, namely the color reconstruction task, the contrast reconstruction task and the content reconstruction task, as shown in Fig. 2. To correct the color casts, we introduce a specific color loss to the color reconstruction task, which focuses on the difference in colors between the images while eliminating texture and content comparison. To improve the contrast, we transform the RGB color space to the HSV color space, because the RGB color space cannot directly reflect the contrast and brightness of the underwater image, and use an HSV loss to learn the mapping function of saturation and brightness. To learn texture and structure similarity from the ground truth image and sharpen the predicted underwater image, SSIM loss and image gradient loss are used for content reconstruction. Furthermore, we introduce an Adaptive Fusion Module (AFM) to fuse the feature maps from the different reconstruction tasks. Comparative experiments are conducted on both synthetic underwater images and real world underwater images. Experimental results show that our proposed method achieves better performance in both qualitative and quantitative evaluations.

Fig. 2

The architecture of MTNet

In summary, the main contributions of this paper can be listed as follows:

  • We propose MTNet for underwater image enhancement, which contains three cascaded sub-tasks, namely the color reconstruction task, the contrast reconstruction task and the content reconstruction task, aiming to reconstruct the color, the contrast and the content of the underwater image.

  • In MTNet, AFM is designed to fuse the feature maps from the different reconstruction tasks.

  • To correct the color casts, we introduce a specific color loss to the color reconstruction task, which focuses on the difference in colors between the images while eliminating texture and content comparison. For contrast reconstruction, an HSV loss is used to learn the mapping function of saturation and brightness. SSIM loss and image gradient loss are used to learn the content similarity.

  • Comparative experiments are conducted on both real world and synthetic underwater images with both full-reference and non-reference metrics. Both the qualitative and quantitative experimental results show that our proposed method achieves better performance for underwater image enhancement.

The rest of the paper is organized as follows: Sect. 2 discusses the related works. Section 3 introduces the design of MTNet and the loss functions in detail. Section 4 presents comparative experiments on both synthetic and real world underwater images and analyzes the experimental results with quantitative and qualitative evaluation. Section 5 concludes the paper.

2 Related works

Due to the importance of underwater image quality, many underwater image enhancement methods have been proposed in recent years. Existing approaches can be classified into the following categories.

2.1 Non-physical model based methods

Non-physical model based methods aim to produce high quality underwater images by modifying image pixel values, without constructing any physical model. The classical methods include the White Balance (WB) color correction algorithm [9], the gray world algorithm [10], the Histogram Equalization (HE) algorithm [11] and fusion-based underwater image enhancement algorithms [12, 13], which improve the contrast and saturation of underwater images in both the HSV and RGB color spaces. Based on [13], [14, 15] reduce the number of over-enhanced and under-enhanced regions by a Rayleigh-stretching process. [16] proposed a two-step underwater image enhancement method for image contrast enhancement and color correction. [17] applied the Retinex algorithm to underwater image enhancement, which consists of color correction, reflectance and illumination decomposition, and the enhancement of the reflectance and the illumination. [18] introduced an underwater image enhancement method via extended multi-scale Retinex. When non-physical model based methods are directly applied to real underwater scenes, problems such as color deviation and contrast deviation may arise.

Image dehazing is a research area closely related to underwater enhancement, and image dehazing methods [19,20,21,22,23] are often used to enhance the quality of underwater images. However, compared with foggy images, underwater images suffer from more severe distortion, such as reduced contrast and excessive blue-green color casts. Therefore, image dehazing methods need further improvement to achieve good results in underwater image enhancement tasks.

2.2 Physical model based methods

Physical model-based methods consider image enhancement as an inverse problem: they construct an underwater image degradation model and achieve enhancement by estimating the parameters of the model. In 2006, [24] designed an adaptive filter to improve underwater image quality based on the simplified Jaffe-McGlamery underwater model. [25] proposed to use the Dark Channel Prior (DCP) and a wavelength-dependent compensation method to improve the visual perception of underwater images. [26] proposed an Underwater Dark Channel Prior (UDCP) which can estimate the medium transmission. Recently, [27] incorporated adaptive color correction into the model and proposed a Generalized Dark Channel Prior (GDCP). A Red Channel method is introduced in [28], which restores the colors associated with short wavelengths to recover the lost contrast of underwater images. According to the relationships between the inherent optical properties of water and the background color of underwater images, [29] achieved better results for underwater image enhancement. Based on the minimum information loss principle and the optical properties of underwater images, [30] effectively improved the brightness and contrast of underwater images. Recently, [31] designed a physically accurate underwater image formation model, improved from [6], to correct the color of underwater images. These physical model-based methods follow simplified image formation models and achieve good performance for simple scenes, but for complex real underwater scenes they still produce visually unpleasing and unstable results.

2.3 Learning based methods

Recently, deep learning has been widely employed in the field of computer vision. A variety of learning based underwater enhancement methods have been proposed because these learning-based approaches possess powerful non-linear representation and generalization abilities. Learning based underwater enhancement methods can be organized into two main categories: GAN-based methods and CNN-based methods. [32] introduced an underwater image enhancement model called WaterGAN. WaterGAN first generates synthetic training data from in-air image and depth pairings, then uses a two-stage network to estimate the depth map and conduct color restoration. UWGAN [33] improved WaterGAN and used Unet [34] to enhance the degraded underwater images. [35] proposed UWCNN, trained on ten types of synthetic underwater images, to reconstruct clear underwater images with MSE and SSIM losses. [36] introduced a new real world underwater dataset and designed a novel network called WaterNet, which takes the images generated by WB, HE and Gamma Correction as input. In [37], both the RGB and HSV color spaces are used to design the underwater image enhancement network UIEC^2-Net. More recently, an underwater enhancement network called Water CycleGAN [38] was proposed to improve the visual perception of underwater images in a weakly supervised way. [39] introduced UGAN, a simple generative adversarial network aiming to enhance visual perception for autonomous underwater robots. In [40], a large scale underwater dataset was presented, and the authors proposed a conditional generative adversarial network suitable for real-time visually-guided underwater robots. The above-mentioned learning based underwater image enhancement methods do not take into account the reconstruction of the color, the content and the contrast simultaneously.

3 Our approach

In this section, we will first introduce the structure of MTNet. Then, the details of each reconstruction task and the design of AFM will be described. Finally, the design of loss function for each task will be described in detail.

3.1 Network Architecture

As shown in Fig. 2, we divide the underwater enhancement into three cascaded sub-tasks, namely the color reconstruction task, the contrast reconstruction task and the content reconstruction task. For each sub-task, an encoder-decoder network like Unet [34] is designed for feature extraction and feature map reconstruction. Residual blocks are taken as the basic units of the encoder-decoder network because they facilitate the reuse of features from different layers. In the encoder, 4 × 4 convolutions with stride 2 are used to down-sample the input. In the decoder, transpose convolutions up-sample the feature maps so that the output has the same size as the input underwater image. Each convolution is followed by a Leaky-ReLU activation and Batch Normalization. To achieve feature fusion, skip connections concatenate the feature maps in the encoder with the corresponding ones in the decoder.
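To make this design concrete, the following is a minimal PyTorch sketch of one encoder-decoder sub-network with residual blocks, stride-2 4 × 4 convolutions for down-sampling, transpose convolutions for up-sampling, Leaky-ReLU plus Batch Normalization after each convolution and a skip connection. The channel widths, depth and class names are illustrative assumptions, not the exact configuration of MTNet.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic unit of each sub-network: two 3x3 convolutions with an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))

    def forward(self, x):
        return x + self.body(x)   # reuse of features from earlier layers

class SubNet(nn.Module):
    """One encoder-decoder sub-network (color, contrast or content branch)."""
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        # Encoder: 4x4 convolutions with stride 2 down-sample the input.
        self.enc1 = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, 2, 1), nn.BatchNorm2d(base),
            nn.LeakyReLU(0.2, inplace=True), ResidualBlock(base))
        self.enc2 = nn.Sequential(
            nn.Conv2d(base, base * 2, 4, 2, 1), nn.BatchNorm2d(base * 2),
            nn.LeakyReLU(0.2, inplace=True), ResidualBlock(base * 2))
        # Decoder: transpose convolutions up-sample back to the input size.
        self.dec2 = nn.Sequential(
            nn.ConvTranspose2d(base * 2, base, 4, 2, 1), nn.BatchNorm2d(base),
            nn.LeakyReLU(0.2, inplace=True), ResidualBlock(base))
        self.dec1 = nn.Sequential(
            nn.ConvTranspose2d(base * 2, base, 4, 2, 1), nn.BatchNorm2d(base),
            nn.LeakyReLU(0.2, inplace=True))
        self.out = nn.Conv2d(base, 3, 3, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)                        # H/2 x W/2
        e2 = self.enc2(e1)                       # H/4 x W/4
        d2 = self.dec2(e2)                       # H/2 x W/2
        d1 = self.dec1(torch.cat([d2, e1], 1))   # skip connection, back to H x W
        return self.out(d1)                      # predicted map (e.g. color map)
```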

For the color reconstruction task, the color sub-network takes the raw underwater image as input. To correct the color casts, we introduce a specific color loss to the color reconstruction task, which focuses on the difference in colors between the images while eliminating texture and content comparison. The output of the color reconstruction sub-network is the color map.

For the contrast reconstruction task, the contrast sub-network is cascaded to the color sub-network and takes the color map as input. To improve the contrast, we transform the RGB color space to the HSV color space, because the RGB color space cannot directly present the brightness and contrast of the underwater image, and use an HSV loss to learn the mapping function of saturation and brightness. We follow [37] to transform the RGB color space to the HSV color space. The output of the contrast reconstruction sub-network is the contrast map.

For the content reconstruction task, the content sub-network is cascaded to the contrast sub-network and takes the contrast map as input. To impose texture and structure similarity on the predicted underwater image, an SSIM loss is used for content reconstruction. Moreover, to prevent producing blurry underwater images, an image gradient loss is also introduced to the content reconstruction sub-task. The output of the content reconstruction sub-network is the content map.

As shown in Fig. 3, we design AFM to fuse the feature maps (color map, contrast map, content map) from the different reconstruction tasks adaptively. To learn the importance of the feature maps from the different sub-tasks, we first concatenate the color map, contrast map and content map channel-wise. Suppose \({x}_{i,j}^{n}\) and \(weigh{t}_{i,j}^{n}\) are the feature and the weight at position (i, j) of channel n. Three 3 × 3 convolutions are used to learn the mapping from \({x}_{i,j}^{n}\) to \(weigh{t}_{i,j}^{n}\); their output channels are 64, 64 and 9, respectively. Then we utilize a softmax function to compute the learnable weight for each reconstruction task. The learnable weight of each task satisfies formulas (1) and (2).

$$\sum_{n=1}^{N}weigh{t}_{i,j}^{n}=1$$
(1)

where N represents the number of the feature maps in the network.

$$weigh{t}_{i,j}^{n}\in [\mathrm{0,1}]$$
(2)

\(weigh{t}_{i,j}^{n}\) reflects the importance of the features for each reconstruction task. Therefore, the output enhanced underwater image can be represented by (3).

Fig. 3
figure 3

The structure of AFM

$$\begin{array}{c}Output=weight[0:3]\times colormap\\ +weight[3:6]\times contrastmap\\ +weight[6:9]\times contentmap\end{array}$$
(3)
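The sketch below illustrates one possible implementation of AFM, under the assumption that the color, contrast and content maps each have three channels so that the nine learned weight channels pair up with them as in Eq. (3), and that the softmax is applied across the three tasks at every position to satisfy Eqs. (1) and (2). The module and variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AFM(nn.Module):
    """Adaptive Fusion Module: learns per-pixel weights for the three maps."""
    def __init__(self):
        super().__init__()
        # Three 3x3 convolutions with 64, 64 and 9 output channels.
        self.weight_net = nn.Sequential(
            nn.Conv2d(9, 64, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 9, 3, padding=1))

    def forward(self, color_map, contrast_map, content_map):
        maps = torch.cat([color_map, contrast_map, content_map], dim=1)  # B x 9 x H x W
        w = self.weight_net(maps)
        # Softmax over the three tasks so that the weights at each position
        # lie in [0, 1] and sum to 1, as required by Eqs. (1) and (2).
        w = F.softmax(w.view(w.size(0), 3, 3, w.size(2), w.size(3)), dim=1)
        stacked = torch.stack([color_map, contrast_map, content_map], dim=1)  # B x 3 x 3 x H x W
        return (w * stacked).sum(dim=1)  # Eq. (3): weighted sum of the three maps
```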

3.2 Design of multi-task loss function

The loss function of MTNet mainly consists of the losses for the three sub-tasks, namely the color reconstruction task, the contrast reconstruction task and the content reconstruction task.

For the color reconstruction task, to impose color similarity on the predicted underwater image, we apply a Gaussian blur operator to both the predicted and the ground truth underwater images to eliminate texture and content comparison, and then compute the L1 loss between the blurred results. The color loss can be computed by:

$${\mathcal{L}}_{color}={\Vert X({\stackrel{\wedge }{I}}_{colormap})-X({I}_{colormap})\Vert }_{1}$$
(4)

where \(X(\cdot )\) represents the image blurred by a Gaussian blur operator, which can be written as:

$$X(I)=\sum_{k,l}I(i+k,j+l)\cdot G(k,l)$$
(5)

where the Gaussian blur operator G(k, l) is written as:

$$G(k,l)=A\times \mathrm{exp}(-\frac{{(k-{\mu }_{x})}^{2}}{2{\sigma }_{x}}-\frac{{(l-{\mu }_{y})}^{2}}{2{\sigma }_{y}})$$
(6)

where A = 0.053, \({\mu }_{x,y}=0\), \({\sigma }_{x,y}=3\).
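A hedged sketch of this color loss follows: both the predicted color map and the ground truth are blurred with the fixed Gaussian kernel of Eq. (6) and compared with an L1 loss as in Eq. (4). The kernel window size (21 × 21) is an assumption, as the paper does not report it.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=21, sigma=3.0, amplitude=0.053):
    """Fixed 2-D Gaussian blur kernel of Eq. (6)."""
    ax = torch.arange(size, dtype=torch.float32) - size // 2
    xx, yy = torch.meshgrid(ax, ax, indexing="ij")
    return amplitude * torch.exp(-(xx ** 2) / (2 * sigma) - (yy ** 2) / (2 * sigma))

def color_loss(pred, gt):
    """Eq. (4): L1 loss between the Gaussian-blurred prediction and ground truth."""
    k = gaussian_kernel().to(pred.device)
    k = k.expand(pred.size(1), 1, *k.shape).contiguous()         # one kernel per channel
    blur = lambda img: F.conv2d(img, k, padding=k.size(-1) // 2, groups=img.size(1))
    return F.l1_loss(blur(pred), blur(gt))
```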

For the contrast reconstruction task, to further improve the contrast and saturation of the predicted underwater images, we transform RGB color space to HSV color space and compute the HSV loss as follows:

$${\mathcal{L}}_{HSV}={\Vert \stackrel{\wedge }{S}\stackrel{\wedge }{V}\mathrm{cos}(\stackrel{\wedge }{H})-SV\mathrm{cos}(H)\Vert }_{1}$$
(7)

where H, S and V are the hue, saturation and value in the HSV color space, \(H\in [\mathrm{0,2}\pi )\), \(S\in [\mathrm{0,1}]\), \(V\in [\mathrm{0,1}]\). With HSV loss, the luminance, saturation and color of the underwater images can be refined through value-channel, saturation-channel and hue-channel, respectively.
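The following sketch shows one way to compute the HSV loss of Eq. (7) from RGB predictions; the differentiable RGB-to-HSV conversion used here is a standard formulation in the spirit of [37], not necessarily the exact implementation used in this work.

```python
import math
import torch

def rgb_to_hsv(img, eps=1e-8):
    """Convert a B x 3 x H x W RGB tensor (values in [0, 1]) to H, S, V channels."""
    r, g, b = img[:, 0], img[:, 1], img[:, 2]
    v, _ = img.max(dim=1)                           # value
    c_min, _ = img.min(dim=1)
    delta = v - c_min
    s = delta / (v + eps)                           # saturation
    # Hue, computed piecewise from the dominant channel, then scaled to [0, 2*pi).
    h = torch.zeros_like(v)
    h = torch.where(v == r, ((g - b) / (delta + eps)) % 6.0, h)
    h = torch.where(v == g, (b - r) / (delta + eps) + 2.0, h)
    h = torch.where(v == b, (r - g) / (delta + eps) + 4.0, h)
    h = h * (math.pi / 3.0)
    return h, s, v

def hsv_loss(pred, gt):
    """Eq. (7): mean absolute difference between S*V*cos(H) of prediction and ground truth."""
    hp, sp, vp = rgb_to_hsv(pred)
    hg, sg, vg = rgb_to_hsv(gt)
    return torch.mean(torch.abs(sp * vp * torch.cos(hp) - sg * vg * torch.cos(hg)))
```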

For the content reconstruction task, we first apply an SSIM loss to impose texture and structure similarity on the predicted underwater image. The SSIM value is computed within an 11 × 11 patch around each pixel of the image according to the following formula.

$$SSIM(x)=\frac{2{\mu }_{I}(x){\mu }_{\stackrel{\wedge }{I}}(x)+{c}_{1}}{{\mu }_{{}_{I}}^{2}(x)+{\mu }_{{}_{\stackrel{\wedge }{I}}}^{2}(x)+{c}_{1}}\cdot \frac{2{\sigma }_{I\stackrel{\wedge }{I}}(x)+{c}_{2}}{{\sigma }_{{}_{I}}^{2}(x)+{\sigma }_{{}_{\stackrel{\wedge }{I}}}^{2}(x)+{c}_{2}}$$
(8)

where \({\mu }_{I}(x)\) and \({\mu }_{\stackrel{\wedge }{I}}(x)\) are the means of the predicted content map and the ground truth underwater image; \({\sigma }_{I}(x)\) and \({\sigma }_{{}_{\stackrel{\wedge }{I}}}(x)\) are the standard deviations of the predicted content map and the ground truth underwater image; \({\sigma }_{I\stackrel{\wedge }{I}}(x)\) represents the cross-covariance; \({c}_{1}\) and \({c}_{2}\) are set to 0.02 and 0.03, respectively.

Then, the SSIM loss can be computed by

$${\mathcal{L}}_{SSIM}\text{=1-}\frac{1}{N}\sum_{i=1}^{N}SSIM({x}_{i})$$
(9)

where N indicates the number of the underwater images of each batch.
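A compact sketch of the SSIM loss in Eqs. (8) and (9) is given below. The local means, variances and cross-covariance are estimated with an 11 × 11 averaging window, which is an assumption since the paper does not state whether a uniform or Gaussian window is used.

```python
import torch
import torch.nn.functional as F

def ssim_loss(pred, gt, c1=0.02, c2=0.03, win=11):
    """Eq. (9): 1 minus the mean of the per-pixel SSIM map of Eq. (8)."""
    mu = lambda img: F.avg_pool2d(img, win, stride=1, padding=win // 2)
    mu_p, mu_g = mu(pred), mu(gt)
    var_p = mu(pred * pred) - mu_p ** 2          # local variance of the prediction
    var_g = mu(gt * gt) - mu_g ** 2              # local variance of the ground truth
    cov = mu(pred * gt) - mu_p * mu_g            # local cross-covariance
    ssim_map = ((2 * mu_p * mu_g + c1) / (mu_p ** 2 + mu_g ** 2 + c1)) * \
               ((2 * cov + c2) / (var_p + var_g + c2))
    return 1.0 - ssim_map.mean()
```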

To prevent producing blurry underwater images, we also introduce image gradient loss to the content reconstruction sub-task.

$$\begin{array}{c}{\mathcal{L}}_{GL}=\text{\hspace{0.05em}}\sum_{i,j}\left|\left|{I}_{G}(i,j)-{I}_{G}(i-1,j)\right|-\left|{I}_{P}(i,j)-{I}_{P}(i-1,j)\right|\right|\\ +\left|\left|{I}_{G}(i,j-1)-{I}_{G}(i,j)\right|-\left|{I}_{P}(i,j-1)-{I}_{P}(i,j)\right|\right|\end{array}$$
(10)

where IP and IG are the output content map and the ground truth underwater image.
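The image gradient loss of Eq. (10) can be computed as in the short sketch below, by matching the absolute horizontal and vertical first differences of the predicted content map to those of the ground truth.

```python
import torch

def gradient_loss(pred, gt):
    """Eq. (10): sum of absolute differences between the gradient magnitudes of prediction and ground truth."""
    dy_p = torch.abs(pred[:, :, 1:, :] - pred[:, :, :-1, :])   # vertical differences
    dy_g = torch.abs(gt[:, :, 1:, :] - gt[:, :, :-1, :])
    dx_p = torch.abs(pred[:, :, :, 1:] - pred[:, :, :, :-1])   # horizontal differences
    dx_g = torch.abs(gt[:, :, :, 1:] - gt[:, :, :, :-1])
    return torch.sum(torch.abs(dy_g - dy_p)) + torch.sum(torch.abs(dx_g - dx_p))
```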

According to formula (3), we obtain the predicted enhanced underwater image. To ensure that the predicted enhanced underwater image is sufficiently close to the ground truth underwater image, we use an L1 loss to preserve overall similarity, which can be represented as:

$${\mathcal{L}}_{l1}={\Vert \hat{I}-I\Vert }_{1}$$
(11)

where \(\hat{I}\) and \(I\) are the predicted enhanced underwater image and the ground truth underwater image.

To preserve the semantic information, we also introduce a perceptual loss. The perceptual loss is defined on the features of a VGG network and can be computed by

$${\mathcal{L}}_{per}=\frac{1}{{C}_{j}{H}_{j}{W}_{j}}\sum_{i=1}^{N}\Vert {\phi }_{j}({\hat{I}}_{i})-{\phi }_{j}({I}_{i})\Vert$$
(12)

where N represents the number of images in each batch; \({C}_{j}\), \({H}_{j}\) and \({W}_{j}\) are the channel, height and width of the feature map in the jth layer; \({\phi }_{j}\) represents the jth layer of VGG-19.
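A hedged sketch of this perceptual loss is shown below, using features from a pretrained VGG-19. The choice of the relu3_3 layer and of an L1 distance are assumptions, since the paper does not specify the layer j or the norm used; ImageNet normalization of the inputs is omitted for brevity.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Features up to relu3_3 of a frozen, ImageNet-pretrained VGG-19 (assumed layer).
_vgg = vgg19(pretrained=True).features[:16].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(pred, gt):
    """Eq. (12): feature-space distance normalised by C_j * H_j * W_j."""
    fp, fg = _vgg(pred), _vgg(gt)
    return F.l1_loss(fp, fg, reduction="sum") / (fp.size(1) * fp.size(2) * fp.size(3))
```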

Therefore, the total loss is calculated by summing the losses of all the sub-tasks.

$${\mathcal{L}}_{total}=\text{\hspace{0.05em}}{\mathcal{L}}_{color}+{\mathcal{L}}_{HSV}+{\mathcal{L}}_{SSIM}+{\mathcal{L}}_{GL}+{\mathcal{L}}_{l1}+{\mathcal{L}}_{per}$$
(13)

where \({\mathcal{L}}_{color}\) is for the color reconstruction task, \({\mathcal{L}}_{HSV}\) is for the contrast reconstruction task, \({\mathcal{L}}_{SSIM}\) and \({\mathcal{L}}_{GL}\) are for the content reconstruction task, and \({\mathcal{L}}_{l1}\) and \({\mathcal{L}}_{per}\) are for the predicted enhanced underwater image.
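Assuming the loss functions sketched above, the total objective of Eq. (13) can be assembled as follows, with each term applied to the output of its own sub-task and the last two terms to the fused output; no extra trade-off weights are used, since none are reported in the paper.

```python
import torch.nn.functional as F

def total_loss(color_map, contrast_map, content_map, output, gt):
    """Eq. (13): unweighted sum of all task losses (functions from the sketches above)."""
    return (color_loss(color_map, gt) + hsv_loss(contrast_map, gt)
            + ssim_loss(content_map, gt) + gradient_loss(content_map, gt)
            + F.l1_loss(output, gt) + perceptual_loss(output, gt))
```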

4 Experiments

4.1 Experimental setup

To demonstrate the performance of MTNet, we conduct quantitative and qualitative comparisons with traditional and learning based underwater image enhancement methods on both synthetic and real world underwater images. The comparative methods include Contrast Limited Adaptive Histogram Equalization (CLAHE), White Balance, Gamma Correction, Dark Channel Prior, UGAN, FUnIE-GAN, UWCNN, WaterNet and UIEC^2-Net. For a fair comparison, we ran the source codes to generate the best results. In this section, we will introduce the comparative experiments and analyze the experimental results in detail.

Dataset

To evaluate the enhancement capacity of MTNet, we conduct comparative experiments on both synthetic and real world underwater images. We first evaluate the performance of MTNet on the synthetic dataset generated from the RGB-D NYU-v2 indoor dataset. We also conduct comparative experiments on the real world underwater images from the UIEB dataset [36], which covers a diversity of scenes and underwater content.

Implementation details

The experiments are run on an Intel i7-5930K processor with 32 GB RAM and one NVIDIA GeForce RTX 3090. For training, both the synthetic underwater images based on NYU-v2 and the real world underwater images from UIEB are used as input. There are 2000 images in the training set, and the input images are resized to 320 × 320. The models are implemented in the PyTorch deep learning framework and trained with stochastic gradient descent (SGD) without any augmentation. For testing, there are 90 real world underwater images and 90 synthetic underwater images in the testing set. The initial learning rate of our model is set to 0.0001 and decreases to 0.000001 during training. We set the batch size to 24 and the total number of epochs to 300.
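For illustration, a sketch of this training configuration is given below. The decay schedule, the SGD momentum, and the names MTNet and train_loader are assumptions (the paper only states the initial and final learning rates), and the assumed forward pass returns the three intermediate maps together with the fused output.

```python
import torch

model = MTNet()                                       # full cascaded network (hypothetical class)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)   # momentum assumed
# Exponential decay chosen so that 1e-4 * gamma ** 300 is roughly 1e-6 at the final epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=(1e-6 / 1e-4) ** (1 / 300))

for epoch in range(300):
    for raw, gt in train_loader:                      # 320 x 320 pairs, batch size 24, no augmentation
        color_map, contrast_map, content_map, output = model(raw)
        loss = total_loss(color_map, contrast_map, content_map, output, gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```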

Evaluation metrics

For full-reference evaluation, we use the Peak Signal-to-Noise Ratio (PSNR), Mean Square Error (MSE) and Structural Similarity (SSIM) to objectively evaluate the enhancement capacity of MTNet. A higher PSNR or a lower MSE indicates that the recovered underwater image is closer to the ground truth underwater image. For SSIM, a higher value denotes that the texture and structure are closer to the ground truth. Meanwhile, we also employ the Underwater Image Quality Measure (UIQM) and the Underwater Color Image Quality Evaluation (UCIQE) for non-reference underwater image quality evaluation. For UIQM and UCIQE, a higher value means better underwater enhancement performance.
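For reference, the full-reference metrics can be computed with standard scikit-image implementations as in the sketch below; the use of scikit-image and the uint8 data range are assumptions, since the paper does not state which implementation it evaluates with.

```python
from skimage.metrics import mean_squared_error, peak_signal_noise_ratio, structural_similarity

def full_reference_scores(enhanced, gt):
    """MSE, PSNR and SSIM between an enhanced image and its ground truth (uint8 arrays assumed)."""
    return {
        "MSE": mean_squared_error(gt, enhanced),
        "PSNR": peak_signal_noise_ratio(gt, enhanced, data_range=255),
        "SSIM": structural_similarity(gt, enhanced, channel_axis=-1, data_range=255),
    }
```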

4.2 Performance comparison on synthetic underwater images

To evaluate the performance of the proposed MTNet, we compare it with several state-of-the-art underwater image enhancement methods on the synthetic underwater testing set, which includes 90 underwater images.

Table 1 shows the quantitative comparison of different underwater enhancement methods in terms of MSE, PSNR and SSIM on the synthetic underwater testing set. The best enhancement results are in bold. It is obvious that our proposed MTNet obtains the best performance compared with both the traditional underwater enhancement methods and the deep learning based methods across all the full-reference metrics. In terms of SSIM, our proposed MTNet is 0.8943 higher than the second best enhancement method.

Table 1 Full reference underwater image quality evaluation on synthetic underwater images

To further evaluate the enhancement performance of MTNet, we also employ UIQM and UCIQE for non-reference underwater image quality evaluation. Table 2 reports the average values on the 90 testing underwater images. It is easy to see that our proposed method obtains a higher UIQM than the other underwater enhancement methods. Furthermore, MTNet achieves the second best UCIQE value, which is higher than that of most methods. Both the full-reference and non-reference metrics prove that our proposed network has better capacity for underwater enhancement.

Table 2 Non-reference underwater image quality evaluation on synthetic underwater images

To qualitatively evaluate the enhancement performance of MTNet, Fig. 4 shows the visualization of the comparative results on the synthetic underwater testing set. It is obvious that the underwater images often exhibit color shift, low brightness and low contrast because of light scattering and absorption. Most of the traditional underwater enhancement methods are not sensitive to brightness and saturation and may introduce color casts, especially for complex underwater environments.

Fig. 4

Visualization of the comparative results on synthetic underwater testing set

The deep learning based underwater enhancement methods achieve relatively good enhancement performance. Our proposed MTNet can effectively suppress the color casts and improve the brightness and saturation of the underwater images even in complex underwater environments, producing a good and pleasant perception. The visual results in Fig. 4 agree with the non-reference metrics in Table 2.

4.3 Performance comparison on real world underwater images

To further validate the performance of MTNet, we also conduct comparative experiments on the real world underwater testing set, which includes 90 images. The results of MTNet and the state-of-the-art enhancement methods are reported in Table 3. Similarly to Sect. 4.2, MSE, PSNR and SSIM are employed to evaluate the enhanced underwater images. Our proposed MTNet also achieves the best enhancement performance across all the full-reference metrics. Compared with the second best enhancement method, MTNet improves the PSNR and SSIM by 0.0073 and 0.8494, respectively.

Table 3 Full reference underwater image quality evaluation on real world underwater images

Meanwhile, we also use the UCIQE and UIQM non-reference metrics to verify the performance of MTNet. As shown in Table 4, our proposed method also performs best in terms of UIQM on the real world underwater dataset. Although the UCIQE value of MTNet is not the highest, it still achieves the second best.

Table 4 Non-reference underwater image quality evaluation on real world underwater images

Similarly, to qualitatively evaluate the performance of MTNet, Fig. 5 shows the visualization of the comparative results among different underwater enhancement methods on the real world underwater dataset. The deep learning based enhancement methods outperform most of the traditional underwater enhancement methods. The enhanced images produced by MTNet are natural, without introducing artificial colors, and MTNet can effectively enhance the brightness and contrast, so that the results are similar to the ground truth underwater images.

Fig. 5

Visualization of the comparative results on real world underwater testing set

To sum up, the comparative experiments on both the synthetic and real world underwater testing sets demonstrate that MTNet outperforms other state-of-the-art underwater enhancement methods.

5 Conclusion

In this paper, a multi-task cascaded network is introduced to improve the visual perception of underwater images, which contains three cascaded sub-tasks, namely the color reconstruction task, the contrast reconstruction task and the content reconstruction task. For each task, the color loss, HSV loss, SSIM loss and image gradient loss are employed to train MTNet in an end-to-end way. Furthermore, we introduce an AFM to fuse the feature maps from the different reconstruction tasks adaptively. To verify the performance of MTNet, we conducted comparative experiments on synthetic and real world underwater images with both full-reference and non-reference metrics. Experimental results demonstrate that our proposed method can efficiently improve underwater image quality and outperforms other underwater image enhancement methods in both qualitative and quantitative evaluations.