1 Introduction

With the current exponential increase in the availability of underwater imagery, deep learning-based underwater object detection (UOD) offers unprecedented research opportunities for studying marine organisms [1, 2]. However, UOD suffers from low detection accuracy because of various environmental degradations. First are haze-like effects: the water medium scatters light, causing low contrast and haze-like phenomena in underwater photography [3]. Second are color distortions: wavelength-dependent absorption reduces color in the captured image, which leads to bluish or greenish underwater images [3, 4]. Third is imaging noise: electronics and suspended sediments disturb the imaging process, introducing noise into the underwater image. These environmental degradations greatly interfere with imaging, which makes UOD difficult.

The main difficulty of UOD is that the structural and statistical properties of objects in underwater images are obscured by various environmental degradations. It is therefore necessary to design appropriate detection structures for better feature representation. In a typical deep learning-based object detector, a backbone network plays an important role in extracting basic features for detection [5,6,7]. Not surprisingly, if a backbone can extract more useful features, its corresponding detector performs better. Hence, starting from AlexNet [8], increasingly powerful backbones have been developed, such as ResNet [9], ResNeXt [10], MobileNetV2 [6], CBNet [5], and YOLOX [11]. While promising, their case-by-case design incurs expensive computational costs. In addition, since most of these existing backbones were originally designed for classification or general detection tasks, directly using them to extract features for UOD may lead to suboptimal performance. Indeed, some researchers have attempted to design specific backbones for underwater scenes [12,13,14,15]. However, these backbones rely heavily on extensive architecture engineering and subtle tuning experience. Moreover, underwater images carry rich environmental degradation information that such heuristic approaches can hardly exploit across large image collections.

Recently, neural architecture search (NAS)-based methods [16,17,18] have been introduced and applied successfully to computer vision tasks (e.g., classification and general object detection). Representative gradient-based architecture search methods [7, 19, 20] relax the non-differentiable architecture search into a continuously weighted network, enabling differentiable search. Unfortunately, with a primitive search space (e.g., separable convolutions), it remains challenging to find an optimal architecture for extracting deep features in underwater scenes afflicted by various degradation factors.

Fig. 1
Accuracy-speed-size trade-off on the URPC2020 dataset for our method and other state-of-the-art detectors

Fig. 2
Workflow of our method. In the searching stage, we define a macro-detector consisting of the basic backbone, FPN, and class+box heads. The basic backbone contains 20 layers to be searched, and each layer chooses a block from the MAaB-based search space. After the searching stage, we derive the final structure by selecting the optimal block at each layer. We then train this searched network, aiming to extract scene-oriented features from underwater images

To alleviate the aforementioned issues, this paper develops a deep learning-based method that searches scene-oriented backbones (SSoB) over a mixed anti-aliasing block (MAaB)-based search space for the UOD task. First, we employ NAS technology to discover an underwater scene-oriented backbone, so that our network can extract representative features under the interference of various environmental degradations. Then, we formulate a novel search space that is more robust and stable against environmental degradations such as haze-like effects and imaging noise. Finally, with the MAaB-based search space, we employ a differentiable search strategy to guide the search process, producing a scene-friendly result. Our contributions are threefold:

  • Different from existing heuristic backbones for UOD that heavily depend on engineering experience, we construct a novel scene-oriented backbone learning model built around environmental degradations from the differentiable NAS perspective.

  • Targeting complex underwater scenes, we propose new blocks, i.e., MAaB, as the candidate operations of the search space. MAaB places multiple kernels in a single block to boost contextual representation capacity and introduces anti-aliased convolutions to enhance robustness to degradation factors.

  • Extensive experiments are conducted on the popular underwater dataset URPC2020. As shown in Fig. 1, our searched scene-oriented architecture significantly outperforms other state-of-the-art methods (including CNN- and transformer-based detectors) by a large margin.

2 Related works

2.1 Underwater object detection

UOD aims at determining what and where an object is in an underwater image. Deep learning-based detectors generally consist of four parts: a backbone that extracts features from an image, a neck following the backbone that fuses multi-level features, an optional region proposal network that generates prediction candidates from the extracted features, and a head for classification and localization prediction. In recent years, various methods have been proposed in the literature to tackle UOD tasks. The common solution is to re-train existing detectors, including CNN- and transformer-based ones; some works also redesign structures based on these existing detectors for UOD. Here we briefly review some recent detectors:

The state-of-the-art detectors can be briefly categorized into two major branches. The first branch contains CNN-based methods such as YOLO [21], SSD [22], RetinaNet [23], FSAF [24], YOLOX [11], FreeAnchor [25], FoveaBox [26], Faster RCNN [27], FPN [28], Mask RCNN [29], Grid RCNN [30], Cascade RCNN [31], and Guided Anchoring [32]. The other branch contains transformer-based methods such as DETR [33], Swin Transformer [34], and PVTv1 [35]. Besides, some researchers have attempted to improve the feature extraction and representation capacity of these popular detectors for UOD [12, 14, 15].

2.2 Backbone for underwater object detection

Backbones play a vital role in detectors, extracting the basic object features used for detection. UOD detectors generally adopt existing backbones, most of which were designed for classification or general detection. Meanwhile, some researchers have attempted to design UOD-specific backbones based on existing ones. Here we briefly introduce these backbones:

The original works RCNN [36] and OverFeat [37] are pioneers of deep learning-based detection. Since then, almost all detectors have used the pretraining and fine-tuning paradigm, i.e., directly adopting networks pretrained on the ImageNet [38] classification task as their detection backbone. For instance, VGG [39], ResNet [9], and ResNeXt [10] are classification backbones, yet they are widely used by state-of-the-art detectors. More recently, CBNet [5], Darknet53 [40], and MobileNetV2 [6] were designed for general detection. Directly adopting these backbones for UOD may lead to suboptimal performance. In addition, some works design specific backbones for UOD [12, 15]. However, these handcrafted methods require considerable manpower and computation. More importantly, underwater datasets carry rich environmental degradation information, which handcrafted methods can hardly exploit across large image collections.

2.3 Neural architecture search

Neural architecture search is an automatic way of learning architectures from the data distribution that can outperform human expertise [41,42,43]. NAS for classification has attracted great attention recently. Some works [44,45,46] adopt reinforcement learning, using an RNN controller to generate a cell-based structure. Others [47,48,49] use evolutionary algorithms to form architectures by mutating current ones. To speed up the search process, some works [7, 50] adopt a gradient-based paradigm that forms a continuously relaxed search space, allowing differentiable optimization throughout the search phase.

Some recent works develop NAS for object detection. Early work [16] adopts an evolutionary strategy to search a better backbone for detection tasks. Another group of approaches [17, 18, 51] uses reinforcement learning to train a controller that generates potential components of detectors. Unfortunately, both kinds of methods are too resource-demanding, resulting in inefficient search. Recently, [20, 52] formulated a detection supernet in differentiable form with a set of architecture and weight parameters, so that the search can be performed by gradient descent, reducing the search cost to several hours. However, existing NAS methods have not yet been explored for detection in underwater environments. Besides, their search spaces are built from previously designed blocks and may be too naive for complex underwater scenarios.

3 The proposed approach

3.1 The scene-oriented architecture learning model

Existing manually designed detection backbones mainly depend on engineering skills and are too resource-demanding for case-by-case redesign. More importantly, underwater images contain rich information about environmental degradations, which heuristic approaches can hardly exploit across large image collections. To overcome these problems, we adopt a differentiable search optimization strategy, from the NAS perspective, to design our UOD backbone SSoB, which can be formulated as:

$$\begin{aligned} \begin{aligned}&\min _{\alpha }\;{\mathcal {L}}(\varvec{\omega }_{\alpha }^*, \alpha ; {\mathcal {D}}_\texttt {val})\\&\mathrm {s.t.}\;\;\varvec{\omega }_{\alpha }^*=\arg \min _{\varvec{\omega }_{\alpha }}{\mathcal {L}}(\varvec{\omega }_{\alpha },\alpha ;{\mathcal {D}}_{\texttt {tr}}), \end{aligned} \end{aligned}$$
(1)

where \({\mathcal {L}}(\cdot )\) is the loss function of the detector, and \({\mathcal {D}}_{\texttt {val}}\) and \({\mathcal {D}}_{\texttt {tr}}\) are the validation and training datasets, respectively. As shown in Fig. 2a, both \({\mathcal {D}}_{\texttt {val}}\) and \({\mathcal {D}}_{\texttt {tr}}\) contain various underwater degradations. The search seeks a backbone \(\alpha \) that minimizes the validation loss \({\mathcal {L}}(\varvec{\omega }_{\alpha }^*, \alpha ; {\mathcal {D}}_\texttt {val})\) with the trained weights \(\varvec{\omega }_{\alpha }^*\), which are in turn obtained by minimizing the training loss \({\mathcal {L}}(\varvec{\omega }_{\alpha },\alpha ;{\mathcal {D}}_{\texttt {tr}})\).

As shown in Fig. 2b, we propose a macro-detector framework to solve the problem in Eq. (1). The macro-detector is decoupled into three principled parts, i.e., the basic backbone, FPN, and class+box heads. The basic backbone, which extracts image features, contains a \(3\times 3\) convolution with stride 2, four stages comprising 20 blocks to be searched, and a final \(1\times 1\) convolution with stride 1. Based on practical experience, the channel widths of the four stages are set to \(\{48,192,384,768\}\), respectively. The features from different stages are then fused by the FPN, after which the class+box heads predict object classification and bounding boxes.
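
To make the layout concrete, below is a minimal PyTorch sketch of this macro-backbone skeleton. The per-stage split of the 20 searchable layers (here 2/6/6/6) and the `SearchableBlock` placeholder are illustrative assumptions, not the paper's exact configuration; at search time each placeholder would hold all candidate MAaB blocks.

```python
import torch.nn as nn

class SearchableBlock(nn.Module):
    """Placeholder for one searchable layer; at search time it holds all
    candidate MAaB blocks, after search it is replaced by the chosen one."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Stand-in op so the skeleton runs; the real block is searched.
        self.op = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)

    def forward(self, x):
        return self.op(x)

class MacroBackbone(nn.Module):
    """Stem conv (3x3, stride 2) -> four stages of searchable blocks
    -> 1x1 conv, with stage widths {48, 192, 384, 768} as in the paper.
    The 2/6/6/6 split of the 20 layers is an assumption."""
    def __init__(self, blocks_per_stage=(2, 6, 6, 6),
                 widths=(48, 192, 384, 768)):
        super().__init__()
        self.stem = nn.Conv2d(3, widths[0], 3, stride=2, padding=1)
        stages, in_ch = [], widths[0]
        for n, w in zip(blocks_per_stage, widths):
            layers = []
            for i in range(n):
                # First block of each stage downsamples and changes width.
                layers.append(SearchableBlock(in_ch, w, stride=2 if i == 0 else 1))
                in_ch = w
            stages.append(nn.Sequential(*layers))
        self.stages = nn.ModuleList(stages)
        self.head = nn.Conv2d(in_ch, in_ch, 1, stride=1)

    def forward(self, x):
        x = self.stem(x)
        feats = []  # multi-level features to be fused by the FPN
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        feats[-1] = self.head(feats[-1])
        return feats
```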

3.2 MAaB-based search space

To begin with, following the latest NAS methods [7, 20], we define a block as the smallest searchable module. The macro-detector search space thus takes a layer-level form, which allows us to explore the whole network from a block perspective. We adopt the standard routine for designing a layer-wise search space: the space includes a number of candidate blocks (operations), and each layer to be searched chooses one block from the candidates.

How to construct the layer-level search space plays a vital role in NAS. Existing NAS-based approaches for classification or general detection [7, 20] mainly rely on primitive operators (e.g., separable convolutions), and such unsophisticated operators pose a tough problem for optimizing backbone architectures. We therefore account for the requirements of contextual representation capacity and degradation robustness in underwater scenes when constructing our search space, which consists of the novel MAaB blocks specifically designed for underwater scenes.

There are two main aspects to consider for extracting more representative features from complex underwater scenes. First, a backbone needs to extract multi-scale features as far as possible. Many approaches [53, 54] fuse features after the backbone, but this requires additional layers. Second, detecting objects in turbid images demands a highly robust detector; however, common downsampling operations (such as stride-2 convolution and MaxPooling) have no anti-aliasing capability, which can harm robustness [55]. To overcome these issues, we design the MAaB block inspired by [55, 56]. MAaB has kernels of multiple sizes in one block, which fuses multi-scale features without extra layers, and it introduces anti-aliased operations (i.e., ConvBlurPool) to enhance robustness to degradation factors. Figure 2c shows the structure of the MAaB block, which is composed of several ConvBlurPools with stride 2 and one \(1\times 1\) convolution with stride 1. The input is split into N groups along the channel axis, and each group is processed by an independent ConvBlurPool. The outputs of these parallel branches are concatenated and then fused by the final \(1\times 1\) convolution to reduce the output channels. If the input and output have the same dimensions, a skip connection adds them together.
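
A minimal PyTorch sketch of one MAaB block is given below, assuming each branch is a ConvBlurPool in the spirit of [55] (a 3×3 convolution followed by a fixed, normalized blur filter applied depthwise with stride 2) and omitting normalization layers; the exact internal layout of Fig. 2c may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlurPool(nn.Module):
    """Conv (stride 1) followed by a fixed, normalized blur filter applied
    depthwise with stride 2, following anti-aliased networks [55]."""
    def __init__(self, in_ch, out_ch, blur_vec=(1., 2., 1.)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1)
        v = torch.tensor(blur_vec)
        k = torch.outer(v, v)                       # 2D filter from the 1D vector
        k = (k / k.sum()).expand(out_ch, 1, *k.shape).clone()
        self.register_buffer("blur", k)             # fixed, not learned
        self.pad = len(blur_vec) // 2

    def forward(self, x):
        x = F.relu(self.conv(x))
        return F.conv2d(x, self.blur, stride=2, padding=self.pad,
                        groups=x.shape[1])           # depthwise blur + downsample

class MAaB(nn.Module):
    """Split the input into N channel groups, run each through an independent
    ConvBlurPool, concatenate, then fuse with a 1x1 conv. The skip connection
    applies only when input and output dimensions match (this sketch always
    downsamples by 2, so it is omitted here)."""
    def __init__(self, in_ch, out_ch,
                 blur_vecs=((1., 2., 1.), (1., 4., 6., 4., 1.))):
        super().__init__()
        n = len(blur_vecs)
        splits = [in_ch // n] * n
        splits[-1] += in_ch - sum(splits)            # handle non-divisible widths
        self.splits = splits
        self.branches = nn.ModuleList(
            ConvBlurPool(c, out_ch, v) for c, v in zip(splits, blur_vecs))
        self.fuse = nn.Conv2d(out_ch * n, out_ch, 1, stride=1)

    def forward(self, x):
        groups = torch.split(x, self.splits, dim=1)
        y = torch.cat([b(g) for b, g in zip(self.branches, groups)], dim=1)
        return self.fuse(y)
```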

ConvBlurPool [55] is a convolution followed by a normalized blur filter with stride 2 that downsamples the input. In this paper, we set \(N \in \{1, 2, 3\}\) to construct our search space. For \(N=1\), there is one ConvBlurPool with blur kernel Triangle-3, Binomial-5, or Binomial-7. For \(N=2\), there are two ConvBlurPool operations with kernels [Triangle-3, Binomial-5], [Triangle-3, Binomial-7], or [Binomial-5, Binomial-7]. For \(N=3\), there are three ConvBlurPool operations with kernels [Triangle-3, Binomial-5, Binomial-7]. In detail, the values of Triangle-3, Binomial-5, and Binomial-7 are [1, 2, 1], [1, 4, 6, 4, 1], and [1, 6, 15, 20, 15, 6, 1], respectively; each 2D filter is the outer product of the corresponding vector with itself, with weights normalized to sum to one (see the sketch after the list below). Specifically, the eight candidate blocks are given in the following. Note that we also include a "skip" block, which allows the search to reduce the depth of the backbone network.

  • Triangle-3, group=1, MAaB (T3)

  • Binomial-5, group=1, MAaB (B5)

  • Binomial-7, group=1, MAaB (B7)

  • Triangle-3, Binomial-5, group=2, MAaB (T3-B5)

  • Triangle-3, Binomial-7, group=2, MAaB (T3-B7)

  • Binomial-5, Binomial-7, group=2, MAaB (B5-B7)

  • Triangle-3, Binomial-5, Binomial-7, group=3, MAaB (T3-B5-B7)

  • Skip
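
As a concrete illustration of the filter construction described above, the short sketch below builds the three normalized 2D blur filters as outer products of the listed vectors:

```python
import torch

blur_vectors = {
    "Triangle-3": torch.tensor([1., 2., 1.]),
    "Binomial-5": torch.tensor([1., 4., 6., 4., 1.]),
    "Binomial-7": torch.tensor([1., 6., 15., 20., 15., 6., 1.]),
}

blur_filters = {}
for name, v in blur_vectors.items():
    k = torch.outer(v, v)               # 2D filter = outer product with itself
    blur_filters[name] = k / k.sum()    # normalize weights to sum to one

# e.g., blur_filters["Triangle-3"] is [[1,2,1],[2,4,2],[1,2,1]] / 16
```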

3.3 The differentiable search algorithm

We adopt the differentiable manner proposed in [19] to solve Eq. (1). In the searching phase, the output of each intermediate layer is computed as a weighted sum over all candidate blocks. For the backbone, the output of the i-th layer is formulated as

$$\begin{aligned} x_i = \sum \limits _{b \in {\mathcal {B}}}\frac{\exp (\alpha _i^b)}{\sum \nolimits _{b^{'} \in {\mathcal {B}}} \exp (\alpha _i^{b^{'}})}b (x_{i-1}), \end{aligned}$$
(2)

where \(x_i\) is the output of the i-th layer and \(\alpha _i^b\) is the architecture parameter for block \(b(\cdot )\), which can be interpreted as the score of block b at layer i. \({\mathcal {B}}\) denotes the search space described in the previous subsection. The continuous relaxation in Eq. (2) makes the whole problem in Eq. (1) differentiable with respect to both the weights and the architecture parameters, so we can search the backbone in an end-to-end manner. In the training phase, we choose the block with the highest score at each layer to build our backbone.
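
A minimal PyTorch sketch of this continuous relaxation follows: each searchable layer keeps one architecture parameter per candidate block and outputs the softmax-weighted sum of Eq. (2). The class and method names are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedLayer(nn.Module):
    """One searchable layer: its output is the softmax(alpha)-weighted sum
    of all candidate blocks' outputs, as in Eq. (2)."""
    def __init__(self, candidate_blocks):
        super().__init__()
        self.blocks = nn.ModuleList(candidate_blocks)
        # One architecture parameter (score) per candidate block.
        self.alpha = nn.Parameter(torch.zeros(len(candidate_blocks)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * b(x) for w, b in zip(weights, self.blocks))

    def derive(self):
        # After search: keep only the block with the highest score.
        return self.blocks[int(self.alpha.argmax())]
```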

At last, the loss function used in Eq. (1) is defined as follows:

$$\begin{aligned} {\mathcal {L}}(\alpha , \varvec{\omega }_{\alpha })= {\mathcal {L}}_{det}(\alpha , \varvec{\omega }_{\alpha }) + \gamma {\mathcal {L}}_{flo}(\alpha ) \end{aligned}$$
(3)

The first term \({\mathcal {L}}_{det}(\cdot )\) denotes the detector loss, i.e., the classification and localization loss. As underwater detectors are often deployed on mobile CPUs, we introduce the second term to guarantee detection efficiency: \({\mathcal {L}}_{flo}(\cdot )\) measures the FLOPs of the backbone and decomposes as a linear sum over the candidate operations. The two terms are weighted by a balancing parameter \(\gamma \). The loss function (3) is differentiable thanks to the continuous relaxation in Eq. (2), so \(\{\alpha , \varvec{\omega }_{\alpha }\}\) can be optimized jointly using SGD.
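
A sketch of this FLOPs term under the same relaxation: the expected cost of each layer is the softmax(\(\alpha \))-weighted sum of the per-candidate FLOPs, which keeps the penalty differentiable in \(\alpha \). The precomputed `flops_table` lookup is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def expected_flops(mixed_layers, flops_table):
    """Differentiable FLOPs term: for each searchable layer, the expected
    cost is the softmax(alpha)-weighted sum of each candidate's FLOPs.
    flops_table[i][b] holds the (precomputed) FLOPs of block b at layer i."""
    total = 0.
    for i, layer in enumerate(mixed_layers):
        w = F.softmax(layer.alpha, dim=0)
        total = total + (w * torch.tensor(flops_table[i])).sum()
    return total

# L = L_det + gamma * expected_flops(...), with gamma = 0.01 in our setup
```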

4 Experiments

4.1 Experimental configurations

We conduct experiments on the URPC2020 dataset, which consists of 6575 underwater images, split into a trainval set of 5260 images and a test set of 1315 images. The dataset has 4 object categories: echinus, holothurian, scallop, and starfish. We analyze our method through extensive comparison experiments. For all experiments, input images are resized to the default size of the respective methods, and the implementation is based on mmdetection and the PyTorch framework.

SSoB searching. We first initialize the basic backbone with kaiming_init and then search the backbone on the URPC2020 trainval set. We use the SGD optimizer with a batch size of 2 images and search for 12 epochs, updating \(\varvec{\omega }_{\alpha }\) and \(\alpha \) alternately in each iteration. The learning rate, momentum, and balancing parameter \(\gamma \) are set to 0.04, 0.9, and 0.01, respectively.
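
The alternating update can be sketched as follows, assuming a simplified first-order scheme (no second-order unrolling) and illustrative names (`loss_fn`, `mixed_layers`, `flops_table`, with `expected_flops` from the sketch above); the actual training loop may differ.

```python
import torch

def search_epoch(detector, w_opt, a_opt, train_loader, val_loader,
                 loss_fn, gamma=0.01):
    """One search epoch: weights are updated on the training split,
    architecture parameters on the validation split (first-order)."""
    for train_batch, val_batch in zip(train_loader, val_loader):
        # 1) update network weights omega on a training batch
        w_opt.zero_grad()
        loss_fn(detector, train_batch).backward()
        w_opt.step()
        # 2) update architecture parameters alpha on a validation batch,
        #    adding the FLOPs penalty of Eq. (3)
        a_opt.zero_grad()
        arch_loss = (loss_fn(detector, val_batch)
                     + gamma * expected_flops(detector.mixed_layers,
                                              detector.flops_table))
        arch_loss.backward()
        a_opt.step()

# w_opt / a_opt would be SGD optimizers over the weight and architecture
# parameters, e.g. torch.optim.SGD(params, lr=0.04, momentum=0.9).
```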

Detection training. We choose the block with the highest score at each layer to build SSoB. We first pretrain SSoB on ImageNet for 150 epochs, then fine-tune the whole detector on the URPC2020 trainval set for 24 epochs with the SGD optimizer and a \(1\times \) schedule. The initial learning rate is 0.04, divided by 10 at the 7th and 10th epochs. The weight decay, momentum, and batch size are 0.0001, 0.9, and 2, respectively.

Table 1 Comparisons with handcrafted models on the URPC2020 dataset
Table 2 Comparisons with NAS-based methods on the URPC2020 dataset

4.2 Main results

Comparisons with handcrafted methods. We replace the backbone in FPN [28] with other strong backbones, i.e., Darknet53, ResNeXt101, and MobileNetV2, forming three competitors accordingly. As shown in Table 1, SSoB surpasses these competitors by a large margin with fewer parameters. Specifically, SSoB achieves 6.8% higher AP than the Darknet53-based detector with less than half the parameters, and a similar improvement holds against ResNeXt101. In addition, we outperform MobileNetV2 by 9.1% AP with fewer parameters. These results demonstrate that our method can design a better backbone than handcrafted approaches.

Table 3 Comparison with state-of-the-art methods on URPC2020

Comparisons with NAS-based methods. As shown in Table 2, we compare SSoB with detectors that adopt NAS-based models. FBNet [7] was searched on the ImageNet dataset for classification, and we directly apply it as a detector backbone; unfortunately, its detection performance is disappointing. NAS-FPN [17] and Hit-Detector [20] are designed for detection tasks, so we re-search their architectures on URPC2020. NAS-FPN discovers a new feature pyramid architecture while leaving the backbone unchanged; our method outperforms it by 4.3% with fewer parameters and FLOPs. Hit-Detector discovers architectures for all components (i.e., backbone, neck, and head) of the detector, but its search space consists of common blocks (such as separable convolutions); our method also surpasses it. These results indicate the importance of designing specific backbones with an effective search space for underwater scenes.

Comparisons with state-of-the-art methods. We compare our method with other state-of-the-art methods on URPC2020; the results are summarized in Table 3. SSoB applies only simple data augmentation and the \(1\times \) training scheme, yet achieves 47.5% AP without bells and whistles. Our method has fewer parameters and performs better than CNN-based detectors. Specifically, CSAM and FERNet are designed for UOD tasks; both develop sophisticated deep architectures to improve feature representation capacity. SSoB outperforms them by a large margin: 1.1% higher AP than CSAM and 3.3% higher than FERNet. In addition, SSoB also outperforms transformer-based methods. Although DETR has fewer FLOPs than SSoB, SSoB outperforms DETR by 24.7% in AP, 27.3% in AP\(_{50}\), and 37.0% in AP\(_{75}\); DETR struggles with small objects, so it performs poorly on underwater datasets containing numerous small objects. These results further demonstrate that our method designs a better architecture than existing popular detectors.

We also compare the proposed method in terms of speed, as shown in Table 3. Our method achieves 6.7 FPS, which is similar to YOLOX. Compared with other methods, our method is well qualified for real-time detection tasks.

Fig. 3
The heatmaps of the final candidate operators (i.e., \(\alpha \)); the 20 searched layers of the backbone are plotted in order. Yellow boxes indicate the final choices

Table 4 Performance of SSoB with different detectors on URPC2020
Table 5 Comparison results on the UODD dataset
Fig. 4
Qualitative examples on URPC2020. The rows from top to bottom show haze-like effects, color distortions, and imaging noise, respectively. Both erroneous and missed detections are marked with red dotted boxes

4.3 Performance analysis

Searching space analysis. Figure 3 plots the heatmaps of the final searched backbone. Visual inspection shows that multi-kernel operations (such as T3-B5-B7 and B5-B7) dominate the first 12 layers, indicating that the first half of the backbone focuses on extracting more image information. In the last 8 layers, single-kernel operations (such as T3, B5, and B7) receive relatively high scores, suggesting that the rear half of the backbone relaxes the requirement for feature intensity. In addition, almost all blocks are selected at some layer, demonstrating that both single- and multi-kernel blocks are necessary when constructing underwater backbones.

Various detectors. To validate the generalization ability of SSoB, we combine it with different detectors; we select the popular Cascade RCNN, YOLOX, and FoveaBox for this analysis. As shown in Table 4, the performance of these detectors improves markedly (in AP, by 1.9% for Cascade RCNN, 0.5% for YOLOX, and 2.3% for FoveaBox), showing that SSoB generalizes well across detectors. However, the best performance is achieved by our original method: the search is based on our macro-detector, so combining the searched network with other detectors may yield suboptimal performance.

The robustness of SSoB on other datasets. To verify that SSoB also brings effective performance improvements on other datasets, we carry out comparative experiments against recently proposed methods on the UODD dataset [14] introduced with CSAM. UODD contains 3 types of underwater objects, i.e., holothurian, echinus, and scallop; we take 2560 images for training and 502 for testing. As shown in Table 5, the detection accuracy of SSoB comprehensively outperforms these detectors. For instance, SSoB surpasses FoveaBox by 5.1%, YOLOX by 1.9%, Grid RCNN by 2.8%, Cascade RCNN by 1.6%, and CSAM by 1.6% in AP.

Finally, we also compare the proposed method in terms of speed on this dataset, as shown in Table 5. Our method achieves 12.0 FPS, which is similar to YOLOX, and is thus well qualified for real-time detection tasks.

Fig. 5
Visualization of feature maps on URPC2020. The rows from top to bottom show haze-like effects, color distortions, and imaging noise, respectively

Study on various environmental degradations. Figure 4 exhibits qualitative examples under various environmental degradations. For color distortions, most popular backbones fail to complete detection, showing erroneous and missed detections, whereas our SSoB completes the detection task very well. For haze-like effects and imaging noise, our method and several others perform well, but some methods still produce errors; for example, MobileNetV2 and Darknet53 miss detections. These qualitative results demonstrate that SSoB can overcome the obstacles that degradation poses to feature extraction. Figure 5 shows feature visualizations under the same degradations. For color distortions, the feature responses of MobileNetV2 and Darknet53 are relatively weak; for haze-like effects and imaging noise, the response amplitudes of Darknet53, ResNeXt101, and MobileNetV2 are attenuated inconsistently. Across all degradations, our SSoB significantly boosts the feature response in discriminative regions while suppressing interference. These results further demonstrate that SSoB performs well under various environmental degradations.

5 Conclusion

In this paper, we propose an automatically searched, scene-oriented feature extraction module for the UOD task. Based on NAS technology, we fully exploit the potential and inherent information of diverse underwater images, so that our backbone can comprehensively extract deep features. Meanwhile, we formulate a MAaB-based search space that further improves the performance of our method. Both qualitative and quantitative experimental results demonstrate that our SSoB has clear superiority over state-of-the-art methods.