
1 Introduction

Underwater images often suffer from severe color casts and reduced contrast caused by light absorption and scattering. This degradation not only harms the visual quality of images but also negatively affects downstream vision tasks, e.g., salient object detection and instance segmentation. Moreover, it makes it difficult for underwater robots equipped with vision systems to explore underwater environments autonomously. Therefore, underwater image enhancement has drawn much attention in recent years.

Underwater image enhancement aims to generate clear and natural images from inputs affected by several degradations (e.g., color cast, low contrast and even detail loss). Existing underwater image enhancement methods can be roughly divided into three categories: non-physical model-based methods, physical model-based methods and data-driven methods. Early methods enhance underwater images directly, without considering the underwater imaging model. [12, 28] enhanced image contrast by expanding the dynamic range of the image histogram, while [6, 24] corrected color casts based on color assumptions about natural images. To enhance contrast and correct color casts simultaneously, Ancuti et al. [1] proposed a fusion-based method that fuses several kinds of enhanced images. From the perspective of the underwater physical imaging model, underwater image enhancement is regarded as an inverse problem. After He et al. [10] proposed the Dark Channel Prior (DCP), which achieved outstanding performance in single image dehazing, several DCP variants [2, 9] were proposed by exploring different underwater priors. However, these methods are restricted by their assumptions and by the simplified physical imaging model. When the assumptions and priors do not fit an unseen scene, these methods may generate severe artifacts, as shown in Fig. 1(b).

With the success of deep learning in various vision tasks [4, 18, 20, 21], several learning-based underwater image enhancement methods have been proposed. Wang et al. [25] proposed UIE-Net, which is composed of two branches that estimate the attenuation coefficients and the transmission map, respectively. Li et al. [15] directly reconstructed the clear latent natural image from the input instead of estimating the parameters of the underwater imaging model. Recently, some methods [14, 17] have adopted a generative-discriminative mechanism to improve the capability of the network.

Fig. 1.

Underwater image enhancement results and the corresponding saliency maps predicted by the salient object detection method F\(^3\)Net [26] on the USOD dataset [13]. Our method clearly performs better than other state-of-the-art methods, in particular without introducing noise or artifacts. The superior salient object detection result further reveals the semantic-sensitive property of the proposed method for high-level vision tasks.

However, existing underwater image enhancement methods still suffer from color distortion and unclear background details in unknown scenarios, and their outputs may even hinder high-level vision tasks. To address these issues, we propose a semantic-driven context aggregation network that utilizes multi-scale semantic features to guide detail restoration and color correction. As shown in Fig. 1, our method adapts better to unknown scenarios and exhibits a stronger semantic-sensitive property than other state-of-the-art methods. Specifically, we first build an encoder-aggregation-decoder enhancement network to learn the mapping between low-quality underwater observations and high-quality images. We then introduce an encoder-type classification network pre-trained on ImageNet [3] to provide semantic cues for better enhancement, and further construct a multi-scale feature transformation module to convert the semantic features into the features desired by the enhancement network. Our main contributions are summarized as follows:

  • We incorporate semantic information into a context aggregation enhancement network for underwater image enhancement, achieving robustness towards unknown scenarios and benefiting semantic-level vision tasks.

  • We construct a multi-scale feature transformation module to extract and convert effective semantic cues from the pre-trained classification network to assist in enhancing the low-quality underwater images.

  • Extensive experiments demonstrate that our method is superior to other advanced algorithms. Moreover, the application to salient object detection further reveals its semantic-sensitive property.

Fig. 2.

Overview of the proposed semantic-driven context aggregation network for underwater image enhancement. Our network is composed of three basic modules. a) Semantic feature extractor, which consists of a pre-trained VGG16 classification network to extract semantic features. b) Multi-scale feature transformation module, which is used to concentrate on beneficial semantic features to guide underwater image enhancement. c) Context aggregation enhancement network, which integrates semantic features and enhancement features to generate clear and natural underwater images.

2 Method

The overall architecture of our proposed method is shown in Fig. 2. In this section, we begin by describing the overall architecture in Sect. 2.1, then introduce the semantic feature extractor in Sect. 2.2, the proposed multi-scale feature transformation module in Sect. 2.3, and finally the context aggregation enhancement network and the loss function in Sect. 2.4.

2.1 The Overall Architecture

Semantic information extracted from a high-level network has the potential to facilitate underwater image enhancement with more accurate and robust predictions. We therefore propose a semantic-driven context aggregation network for underwater image enhancement, as illustrated in Fig. 2. The whole architecture includes a semantic feature extractor, a multi-scale feature transformation module, and a context aggregation enhancement network. Specifically, we adopt a standard VGG16 classification network to extract semantic features; the extracted multi-scale semantic features, rich in information, are then fed into the enhancement network through the multi-scale feature transformation module. The feature transformation blocks process these rich features in an attentive way, which benefits the enhancement network in restoring details and correcting color casts for underwater images.

2.2 Semantic Feature Extractor

A common perception is that the shallower features from the backbone network of a high-level task capture texture and local information, while the deeper features focus more on semantic and global information. This motivates us to explore the ability of multi-scale semantic features. Specifically, we extract features at the first four scales (denoted as \(F_n\), \(n\in [1,4]\)) from a VGG16 network pre-trained on ImageNet [3]. To avoid information loss caused by the pooling operation, we select the features before the pooling layer of each stage. The abundant guidance information from the multi-scale features allows us to better handle the challenges in low-quality underwater images. Besides, to avoid additional training costs, the semantic feature extractor is kept fixed during the training phase.
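For concreteness, the following is a minimal PyTorch sketch of such a frozen VGG16 feature extractor; the exact layer indices used to split the stages are our assumption based on the torchvision VGG16 layout.

```python
import torch.nn as nn
from torchvision import models

class SemanticFeatureExtractor(nn.Module):
    """Frozen VGG16 backbone returning the features of the first four stages,
    taken just before each max-pooling layer (a sketch; layer indices are
    our assumption based on torchvision's VGG16 definition)."""
    def __init__(self):
        super().__init__()
        # torchvision >= 0.13; older versions use vgg16(pretrained=True)
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features
        self.stage1 = vgg[:4]     # conv1_1 .. relu1_2  ->  64 channels
        self.stage2 = vgg[4:9]    # pool1  .. relu2_2   -> 128 channels
        self.stage3 = vgg[9:16]   # pool2  .. relu3_3   -> 256 channels
        self.stage4 = vgg[16:23]  # pool3  .. relu4_3   -> 512 channels
        # The extractor is fixed during training (Sect. 2.2).
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        return [f1, f2, f3, f4]
```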

2.3 Multi-scale Feature Transformation Module

With the extracted multi-scale semantic features, we aim to incorporate the abundant semantic information into the enhancement network. A straightforward approach is to directly combine the semantic features with the features in the enhancement network, e.g., by addition or concatenation. However, this ignores the distinguishability of the semantic features and introduces redundant information into the enhancement network. Inspired by the attention mechanism in [11], we propose a feature transformation block (FTB) to attentively select and incorporate the key priors for the enhancement network. We first adopt a 1\(\times \)1 convolution block to match the channel dimensions of the features, then exploit the inter-channel dependencies and reweight the importance of each channel to highlight the vital information and suppress the unnecessary parts. The process can be formulated as:

$$\begin{aligned} F_o^n = S(MLP(Avgpool(Conv_{1 \times 1}(F^n)))) \odot F^n, n=1,2,3,4 \end{aligned}$$
(1)

where \(Conv_{1 \times 1}(\cdot )\) denotes a convolution block consisting of a 1\(\times \)1 convolution, BN and PReLU, \(Avgpool(\cdot )\) and \(MLP(\cdot )\) denote global average pooling and a multilayer perceptron respectively, and \(S(\cdot )\) denotes the sigmoid function. \(\odot \) is element-wise multiplication, and \(F_o^n\) is the output of the feature transformation block.
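A minimal PyTorch sketch of one plausible FTB implementation is given below; the channel counts, the MLP reduction ratio, and the choice to apply the attention weights to the channel-matched feature (our reading of \(F^n\) in Eq. (1)) are assumptions rather than the paper's exact configuration.

```python
import torch.nn as nn

class FeatureTransformationBlock(nn.Module):
    """Sketch of the FTB in Eq. (1): 1x1 conv block, global average pooling,
    an MLP over channels, sigmoid gating, and channel-wise reweighting."""
    def __init__(self, in_channels, out_channels, reduction=4):
        super().__init__()
        # 1x1 Conv + BN + PReLU to match channel dimensions.
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.PReLU(),
        )
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        # MLP modeling inter-channel dependencies (reduction ratio assumed).
        self.mlp = nn.Sequential(
            nn.Linear(out_channels, out_channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(out_channels // reduction, out_channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):
        f = self.conv(f)                           # channel-matched semantic feature
        b, c, _, _ = f.shape
        w = self.avgpool(f).view(b, c)             # global descriptor per channel
        w = self.sigmoid(self.mlp(w)).view(b, c, 1, 1)
        return f * w                               # reweight each channel
```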

Through the multi-scale feature transformation module, we can suppress and balance the comprehensive information to guide the enhancement network to achieve finer predictions. The extensive experiments in Sect. 3.3 also demonstrate the effectiveness of our proposed FTB.

2.4 Context Aggregation Enhancement Network and Loss Function

For the enhancement network, we employ an encoder-aggregation-decoder architecture. On the one hand, in the encoder, the multi-scale semantic features are attentively incorporated at the corresponding levels through FTBs. On the other hand, to extract global contextual information from the combined features, we adopt Context Aggregation Residual Blocks (CARB) to further enlarge the receptive field, following [4]. Finally, the comprehensive and attentive features are fed into the decoder to generate clearer and more natural predictions for underwater images.
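As a rough illustration of the aggregation stage, the sketch below shows a generic dilated residual block; the actual CARB design follows [4] and may differ in dilation rates, normalization and activation, so this is only a stand-in under those assumptions.

```python
import torch.nn as nn

class ContextAggregationResidualBlock(nn.Module):
    """Generic dilated residual block used as a stand-in for the CARB of [4];
    the dilation rate and normalization layers are our assumptions."""
    def __init__(self, channels, dilation=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(channels),
            nn.PReLU(),
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.PReLU()

    def forward(self, x):
        # Dilated convolutions enlarge the receptive field without reducing
        # spatial resolution; the skip connection preserves local detail.
        return self.act(self.body(x) + x)
```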

Two key factors need to be considered when choosing the loss function. First, the widely-used \(\mathcal {L}_2\) loss usually leads to over-smoothed results, so we adopt the \(\mathcal {L}_1\) loss as the pixel-wise objective. Second, since a pixel-wise loss is not sensitive to image structure characteristics (e.g., luminance, contrast), we additionally adopt MS-SSIM [27] to guide the network to focus on structural information. The overall loss function is formulated as

$$\begin{aligned} \mathcal {L}_{total} = \lambda \mathcal {L}_1 + (1-\lambda )\mathcal {L}_{MS-SSIM}, \end{aligned}$$
(2)

where \(\lambda \) is a balance parameter, empirically set to 0.2 in our experiments.
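The sketch below shows one way to implement Eq. (2) in PyTorch; the third-party `pytorch_msssim` package and the use of \(1-\)MS-SSIM as the minimized term are our assumptions, not details given in the paper.

```python
import torch.nn as nn
from pytorch_msssim import MS_SSIM  # assumed third-party package

class EnhancementLoss(nn.Module):
    """Sketch of Eq. (2): L_total = lambda * L1 + (1 - lambda) * (1 - MS-SSIM).
    Turning the MS-SSIM similarity into a loss via (1 - MS-SSIM) is our assumption."""
    def __init__(self, lam=0.2):
        super().__init__()
        self.lam = lam
        self.l1 = nn.L1Loss()
        # Inputs are assumed to be RGB tensors scaled to [0, 1].
        self.ms_ssim = MS_SSIM(data_range=1.0, channel=3)

    def forward(self, pred, target):
        return self.lam * self.l1(pred, target) + \
               (1.0 - self.lam) * (1.0 - self.ms_ssim(pred, target))
```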

3 Experiments

In this section, we first introduce the adopted datasets and implementation details. Next, we comprehensively compare the performance of our approach with other state-of-the-art methods. We also perform ablation experiments to analyze the effects of the main components of our proposed network. Finally, we evaluate our method on the application of underwater salient object detection.

Table 1. Quantitative evaluation on two underwater benchmark datasets. The best and second-best results are highlighted.

3.1 Experimental Setup

Datasets. To evaluate the performance and generalization ability of our model, we conduct experiments on two underwater benchmark datasets: the Underwater Image Enhancement Benchmark (UIEB) [16] and the Underwater Color Cast Set (UCCS) [19]. The UIEB dataset includes 890 raw underwater images with corresponding high-quality reference images. The UCCS dataset contains 300 real, reference-free underwater images in blue, green and blue-green tones. For a fair comparison, we randomly selected 712 of the 890 paired images in UIEB as the training set. The remaining 178 paired images of UIEB and the UCCS dataset are used for testing.

Fig. 3.

Visual results of our method and top-ranking methods on UIEB dataset.

Implementation Details. We implemented our network with the PyTorch toolbox on a PC with an NVIDIA GTX 1070 GPU. The training images were uniformly resized to 640\(\times \)480 and then randomly cropped into patches of size 256\(\times \)256. During training, we used the ADAM optimizer with parameters \(\beta _1\) and \(\beta _2\) set to 0.9 and 0.999, respectively. The initial learning rate was set to 5e−4 and decreased by 20% every 10k iterations.
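A minimal sketch of this optimization setup is shown below; stepping the scheduler once per iteration is an assumption about how the 10k-iteration decay is realized.

```python
import torch

def build_optimizer_and_scheduler(model):
    """ADAM with beta1=0.9, beta2=0.999, initial lr 5e-4, decayed by 20%
    every 10k iterations (scheduler is stepped per iteration)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999))
    # Multiply the learning rate by 0.8 every 10,000 iterations.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10_000, gamma=0.8)
    return optimizer, scheduler
```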

Evaluation Metrics. To comprehensively evaluate the performance of various underwater image enhancement methods, we adopt five evaluation metrics: two widely-used full-reference metrics, i.e., Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM), and three reference-free metrics, i.e., the Naturalness Image Quality Evaluator (NIQE) [22], the Underwater Image Quality Measure (UIQM) [23], and Twice-Mixing (TM) [8].
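For reference, the two full-reference metrics can be computed with scikit-image as sketched below (recent scikit-image versions); the reference-free metrics (NIQE, UIQM, TM) require their respective reference implementations and are not shown.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_reference_metrics(pred, gt):
    """PSNR and SSIM between an enhanced image and its reference; both are
    assumed to be uint8 RGB arrays of the same size."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=2, data_range=255)
    return psnr, ssim
```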

Fig. 4.

Visual results of our method and top-ranking methods on UCCS dataset.

3.2 Comparison with the State-of-the-Arts

To fully evaluate the performance of our method, we compare it with six other state-of-the-art underwater image enhancement methods: three conventional methods, i.e., EUIVF [1], TSA [7] and UDCP [5], and three learning-based methods, i.e., UWCNN [15], Water-Net [16] and F-GAN [14].

Quantitative Evaluation. Table 1 reports the results of all competing methods on the UIEB and UCCS datasets. Our method outperforms the other advanced methods on all evaluation metrics across the two datasets, except for UIQM on the UCCS dataset, where it still ranks among the top three. In particular, our method boosts PSNR and SSIM by 10.87% and 5.17%, respectively, compared to the second-best method Water-Net on the UIEB dataset.

Qualitative Evaluation. From a more intuitive view, we visualize representative results of our method and other top-ranking approaches on the UIEB and UCCS datasets in Figs. 3 and 4, respectively. As seen in Fig. 3, our method simultaneously restores clearer background details and more natural colors. Moreover, the visual results in Fig. 4 show that our method achieves more visually pleasing results on the UCCS dataset. For instance, other methods struggle to restore natural images (e.g., the reddish color in row 1, column 4, and the yellowish color in row 2, column 1 of Fig. 4), while our results are closer to the real scenes with lighter color casts.

Table 2. Ablation study of our method on UIEB dataset. “w/” means with the corresponding experimental setting. The best result is shown in bold font.
Fig. 5.

Visual results of ablation study on UIEB dataset.

3.3 Ablation Study

In this section, we conduct ablation studies to validate the effectiveness of the key components proposed by our method.

Effectiveness of Semantic Features and FTB. First, we use the encoder-aggregation-decoder network as our baseline (denoted as “M1” in Table 2). To investigate the effectiveness of introducing semantic features into underwater image enhancement, we directly concatenate the semantic features from the pre-trained VGG16 network with the enhancement features (denoted as “M2”). The comparison of “M1” and “M2” in Table 2 shows that directly introducing semantic features already brings a 0.33 dB PSNR gain on the UIEB dataset. We then further employ the proposed FTB to attentively extract and transmit the semantic features (denoted as “Ours”). The comparison of “M2” and “Ours” shows that applying the proposed FTB yields consistent additional gains (e.g., 0.76 dB in PSNR). In addition, the corresponding visual results in Fig. 5 demonstrate that the FTB is beneficial for color correction and detail restoration.

Effectiveness of Multi-scale Features. To further study the effectiveness of the multi-scale semantic features, we divide them into two groups and specify a series of experimental settings: a shallower group (denoted as “M3”) with the \(1^{st}\)- and \(2^{nd}\)-scale features, and a deeper group (denoted as “M4”) with the \(3^{rd}\)- and \(4^{th}\)-scale features. The comparison of “M3”, “M4” and “Ours” in Table 2 indicates that incorporating either the shallower-scale or the deeper-scale features brings considerable gains, and the best results are achieved when the multi-scale features are fully incorporated. The visual comparisons in Fig. 5 show consistent behavior.

Table 3. Application of the top-ranking image enhancement methods and ours to the saliency detection task evaluated on USOD dataset. The best result is shown in bold font.
Fig. 6.

Visualization of the application of the top-ranking image enhancement methods and ours to the saliency detection task evaluated on the USOD dataset. The corresponding enhanced images are shown in the upper right corner.

3.4 Application on Salient Object Detection

To further verify the effectiveness and applicability of our proposed network, we apply it to the underwater salient object detection task. Specifically, we first adopt the pre-trained salient object detection network F\(^3\)Net [26] and evaluate it on the underwater salient object detection dataset (USOD) [13] (denoted as “Original input” in Table 3). We then employ several top-ranking image enhancement networks and our proposed network to enhance the inputs and make saliency predictions through F\(^3\)Net. The quantitative results of the predicted saliency maps are tabulated in Table 3; our method shows clear gains over the other image enhancement methods. Meanwhile, as shown in Fig. 6, after applying our method the images exhibit finer details and more natural colors, which further helps F\(^3\)Net to predict saliency maps with superior consistency and improved robustness.

4 Conclusion

In this paper, we presented a semantic-driven context aggregation network that cooperatively guides detail restoration and color correction. Multi-scale semantic features extracted from a pre-trained VGG16 network are fused into the encoder-aggregation-decoder architecture to exploit the capability of semantic features. We further proposed a multi-scale feature transformation module that attentively concentrates on the key priors and suppresses the unhelpful ones. Experimental results on two real-world datasets demonstrate that our method outperforms state-of-the-art methods. Additionally, our method helps salient object detection achieve better performance.