1 Introduction

Underwater object detection is an important and difficult subject in computer vision, and it has gradually attracted increasing attention. In recent years, many popular deep learning networks have achieved good results on common datasets. However, images captured by underwater cameras are of poor quality [1, 2] because of the underwater lighting conditions and environment, so these methods are not ideal when applied directly to underwater object detection. Underwater images suffer from low contrast, color bias and uneven illumination caused by the scattering and attenuation of light as it travels through water [3]. As a result, underwater image features are difficult to extract. Moreover, objects in underwater images are usually small because of the long distance from the camera and the small actual size of the objects [4]. It is therefore necessary to design an accurate object detection network that addresses these problems.

At present, convolutional neural networks (CNNs) are the backbone of most deep learning models. Different convolutional layers of a CNN extract features at different scales [5, 6]. In general, high-level feature maps provide rich semantic information, which benefits the detection of large objects, while low-level features carry rich texture information, which is more conducive to small object detection [7]. Detail information is crucial for identifying small objects. For complex underwater object detection tasks, it is therefore important to construct multi-scale features that contain not only abundant texture features but also strong semantic features [8].

A CNN can learn high-level semantic features and use single-scale input features for recognition. SSD [9] exploits the hierarchical, pyramid-like features of a CNN, taking multi-scale feature maps from multiple layers computed in the forward pass. The feature pyramid network (FPN) [10] naturally uses the pyramid form of CNN hierarchical features to generate feature pyramids with strong semantic information at all scales.

A multi-scale feature pyramid architecture based on FPN is proposed to detect underwater objects. First, an improved VoVNet [11] is taken as the backbone network. The one-shot aggregation (OSA) module in VoVNet aggregates all preceding layers only once, at the last layer; it is highly efficient on the GPU and has fewer layers than a residual network at the same level, so it preserves more of the details that are crucial for feature map extraction. Second, a multi-scale aggregation feature pyramid is built. The basic features extracted from the backbone network are downsampled to the scales of the enhanced features. With only top-down and lateral connections, the main information still comes from the top, so a new bottom-up pathway is added to obtain a new feature pyramid. Then, the distance between corner points and the bounding box is introduced to improve the recall of small objects, and these distances are divided by the corresponding stride so that the feature map scales better match the FPN. Batch normalization (BN) is replaced by group normalization (GN) after the convolutional layers of the head. GN divides the feature channels into groups and computes the mean and variance within each group for normalization, so the normalization does not depend on the batch size. Finally, GIoU [12] is added to measure the distance between the ground-truth box and the predicted box when they do not overlap. GIoU considers non-overlapping areas as well as overlapping ones, so it better reflects the degree of overlap. The GIoU loss is added to the regression loss to ensure the accuracy of the predicted box. To summarize, the major contributions of this work are three-fold:

  1.

    A convolution block is designed for the backbone to enhance detail information. Stage 1 of the original VoVNet-39 is replaced with a two-channel convolution block. The added block extracts the rich detail features of the image, which are more conducive to detecting small-scale objects.

  2.

    The feature scales are extended by downsampling based on FPN, and an aggregation path from low-level to high-level features is added to propagate details. A multi-scale aggregation feature pyramid is thereby constructed. This structure uses context information to strengthen features and enhance the resolution of the feature maps.

  3.

    A cornerness strategy is designed to add recall points. The distances between corner points and the bounding box are introduced as additional regression points and divided by the corresponding stride to improve the recall rate of small objects. A cornerness loss is designed based on this method. Besides, IoU is replaced by GIoU as a measure between the ground-truth box and the predicted box when they do not overlap.

The rest of the paper is organized as follows. Section 2 presents related work on the technologies involved in our method. Section 3 describes the proposed method in detail. Section 4 gives the experiments and analysis of the proposed method. The last section presents conclusions.

2 Related works

2.1 Object detection

Object detection is a heavily researched topic in computer vision, and there is a large body of research on object detection with deep learning. According to whether region proposals are needed, popular CNN-based object detection methods mainly include two-stage networks [15, 16] and one-stage networks [9, 30].

A two-stage object detection network first extracts candidate boxes from the image and then refines the candidate regions to obtain the detection result. One-stage object detection networks remove region proposals and directly regress the location and category of the object, which brings faster detection. However, all of these methods require preset dense anchors, which introduce many hyperparameters and are time consuming; the detection result is greatly affected by these hyperparameters. Therefore, an anchor-free object detection model is constructed based on FCOS [22].

In an object detection model, the backbone is responsible for extracting basic features from images, which is very important for object localization and classification. ResNet [28] is the most commonly used backbone in object detection models. DenseNet [29] actually has stronger feature extraction ability than ResNet, but its good detection performance comes at slow speed: the dense connections in DenseNet cause high memory access costs and power consumption. VoVNet [11] is designed to solve this problem, and object detection models based on VoVNet outperform those based on DenseNet in both speed and accuracy. VoVNet is therefore selected as the backbone to extract the basic features used as input to the FPN.

Fig. 1 The whole architecture of the proposed network

2.2 Multi-scale features

Recently, extracting features from different layers has become popular in image recognition, and these features are used together to detect objects. SPP-Net [13], R-CNN [14], Fast R-CNN [15] and Faster R-CNN [16] take only the final feature maps to detect objects. Shrivastava et al. [17] and SNIP [18] adopt image pyramids, feeding input images of different scales to generate features of different scales for prediction; the high accuracy is achieved at a high cost in time and memory. SSD, DSSD [19] and YOLOv3 [20] detect objects on a feature pyramid extracted from the layers inherent in the network while taking merely a single-scale image, but this strategy ignores the context information of features. FPN uses lateral connections and a top-down pathway to produce a feature pyramid and achieve more powerful representations. FCOS [22] and Xu et al. [23] adjust the feature maps on FPN and take higher-level feature maps to predict objects. With a top-down path alone, information comes only from the top, and information from the bottom is not well utilized. M2Det [21] uses a U-shaped pyramid to extract deep features and concatenates features of different levels at the same scale. NAS-FPN [24] uses neural architecture search based on RetinaNet [25] to design an FPN, but such searched FPN models are complicated and the search cost is high. Therefore, a multi-scale aggregation feature pyramid is constructed based on FPN.

Table 1 Two structures of VoVNet stage 1

2.3 FCOS detection head

The detection head predicts the location and category of objects from the features described above. Detection heads can be divided into anchor-based and anchor-free models. Anchor-based models such as YOLO and SSD need predefined anchor points to generate candidate boxes; the anchors involve many hyperparameters that greatly influence the final result. Anchor-free models are based on corners or centers, such as CenterNet [31] and FCOS [22]. The FCOS network adopts an anchor-free regression strategy. Although this improves the recall rate, it produces many low-quality predicted bounding boxes offset from the object center; FCOS therefore proposes a simple and effective centerness strategy to suppress these low-quality boxes without introducing any hyperparameters. The FCOS detection head is selected as our basic detection head. Because FCOS is a general object detection network, this paper adapts its head to the underwater environment.

The loss function estimates the difference between the predicted and real values of the model. Intersection-over-union (IoU) loss is used as the regression loss in FCOS; IoU is the ratio of the intersection to the union of the predicted box and the ground-truth box. However, the IoU loss provides no gradient and cannot be trained when the predicted box and the ground-truth box do not overlap. Chen [27] proposes a Pixel-IoU loss to improve the accuracy of both the rotation angle and IoU. Zheng [26] takes distance, overlap rate and scale into account to design distance-IoU (DIoU), making box regression more stable, and further proposes CIoU on the basis of DIoU, adding the aspect ratio among the three elements of box regression. However, the centers of dense objects in underwater images are close to each other, so such boxes are removed by non-maximum suppression (NMS). Rezatofighi et al. [12] put forward generalized intersection over union (GIoU), introducing the smallest enclosing rectangle of the predicted box and the ground-truth box on the basis of IoU. With the introduced penalty term, the predicted boxes move toward the object box, which overcomes the above shortcomings of IoU.

3 Proposed method

This paper proposes a multi-scale aggregation feature pyramid network (MA-FPN) to address underwater object detection. The model architecture is shown in Fig. 1. Our network is introduced from three aspects: feature extraction, multi-scale feature fusion and object detection.

3.1 Feature extraction

This paper proposes a feature extraction architecture based on VoVNet to obtain abundant and robust feature maps. A refined convolutional block is proposed to replace the first convolutional layer of the backbone. In VoVNet-39, the first stage uses multi-layer convolution to extract features, which loses more of the detail information of small objects than single-layer convolution. A dual convolution block is therefore used to extract image details: its two channels extract more abundant basic features, which are more conducive to small object detection. The improved VoVNet-39 used as the backbone is called VoVNet-39-A. After the refined convolutional layer, three OSA modules each aggregate all their preceding layers once, and the output of each OSA module is used as a basic feature layer at a different scale. A branch is added to VoVNet stage 1 to extract richer details; the added branch is also evaluated alone to verify its feature extraction effect. The structure of stage 1 is shown in Table 1, and a sketch of the block follows.
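As a concrete illustration, the following PyTorch sketch shows one way such a two-branch stem could be written. The module name `DualConvStem`, the channel widths and the strides are our own illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class DualConvStem(nn.Module):
    """Hypothetical two-branch stem replacing VoVNet-39 stage 1.

    One branch keeps a single 3x3 convolution to preserve fine detail,
    the other stacks two 3x3 convolutions for a larger receptive field;
    their outputs are concatenated channel-wise.
    """
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        half = out_ch // 2
        # Detail branch: one convolution retains local texture.
        self.detail = nn.Sequential(
            nn.Conv2d(in_ch, half, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))
        # Context branch: two stacked convolutions enlarge the receptive field.
        self.context = nn.Sequential(
            nn.Conv2d(in_ch, half, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))

    def forward(self, x):
        # Both branches share the same spatial size, so they concatenate cleanly.
        return torch.cat([self.detail(x), self.context(x)], dim=1)
```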

Finally, a multi-scale feature pyramid is constructed. The basic features are the outputs of the OSA modules of the backbone network, while the higher-level features are downsampled from the basic feature maps. A new feature pyramid is obtained by adding a new bottom-up pathway. This pyramid integrates feature maps of different sizes from low level to high level and contains the rich texture and semantic information that benefits underwater object detection.

3.2 Multi-scale feature pyramid

This paper builds a multi-scale feature pyramid on top of the feature extraction network to acquire robust feature maps. Inspired by FPN, the third to fifth convolutional blocks are taken to extract feature maps and build a deeper feature pyramid. Upsampling is conducted from higher-level feature maps to enhance lower-level feature maps with context information. The pyramid consists of five feature maps, each at a different scale. Our multi-scale feature maps are defined as \(P_{3}\), \(P_{4}\), \(P_{5}\), \(P_{6}\), \(P_{7}\) and \(N_{3}\), \(N_{4}\), \(N_{5}\), \(N_{6}\), \(N_{7}\), whose strides are 8, 16, 32, 64 and 128, respectively. \(C_{3}\), \(C_{4}\) and \(C_{5}\) are the initial feature layers, and the scaling process can be described as:

$$\begin{aligned} \begin{array}{l} P_{i}=f_{1}*C_{i}+\mu (P_{i+1})\quad i=3,4\\ P_{5}=f_{1}*C_{5}\\ P_{6}=f_{2}*C_{5}\\ P_{7}=f_{2}*P_{6} \end{array} \end{aligned}$$
(1)

\(P_{i}\) denotes the i-th layer of the P feature pyramid, \(f_{1}\) is a 3\(\times \)3 convolution filter with stride 1 and a variable number of output channels, \(f_{2}\) is a 3\(\times \)3 downsampling filter with stride 2, \(\mu \) is the upsampling operation, and \(*\) denotes convolution.
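To make the construction concrete, here is a minimal PyTorch sketch of Eq. (1). The class name, the assumed input channel widths (taken from the standard VoVNet-39 stage outputs) and the choice of nearest-neighbor upsampling for \(\mu \) are our assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownPyramid(nn.Module):
    """Builds P3-P7 from backbone features C3-C5, following Eq. (1)."""
    def __init__(self, in_chs=(512, 768, 1024), out_ch=256):
        super().__init__()
        # f1: stride-1 3x3 convolutions that unify the channel counts.
        self.f1 = nn.ModuleList(
            nn.Conv2d(c, out_ch, 3, stride=1, padding=1) for c in in_chs)
        # f2: stride-2 3x3 convolutions that downsample to get P6 and P7.
        self.f2_p6 = nn.Conv2d(in_chs[-1], out_ch, 3, stride=2, padding=1)
        self.f2_p7 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, c3, c4, c5):
        p5 = self.f1[2](c5)
        p4 = self.f1[1](c4)
        p4 = p4 + F.interpolate(p5, size=p4.shape[-2:], mode="nearest")
        p3 = self.f1[0](c3)
        p3 = p3 + F.interpolate(p4, size=p3.shape[-2:], mode="nearest")
        p6 = self.f2_p6(c5)
        p7 = self.f2_p7(p6)
        return p3, p4, p5, p6, p7
```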

Each building block takes a higher-resolution feature map \(N_{i}\) and a coarser map \(P_{i+1}\) through a lateral connection and generates a new feature map \(N_{i+1}\). Each feature map \(N_{i}\) first goes through a 3\(\times \)3 convolutional layer with stride 2 to reduce its spatial size; the downsampled map is then added element-wise to \(P_{i+1}\) through the lateral connection. The fused feature map is processed by another 3\(\times \)3 convolutional layer to generate \(N_{i+1}\) for the following sub-networks. This iterative process terminates after reaching \(P_{7}\). Each of these feature maps has 256 channels for detection.

The feature fusion process can be formulated as follows,

$$\begin{aligned} N_{i+1}=f_{2}*N_{i}+P_{i+1} \quad i=3,\ldots ,6 \end{aligned}$$
(2)

\(N_{i}\) is the i-th layer of the new feature pyramid. A sketch of this bottom-up path is given below.
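A minimal PyTorch sketch of Eq. (2), assuming \(N_{3}\) is initialized as \(P_{3}\) (the paper does not state the initialization explicitly):

```python
import torch.nn as nn

class BottomUpPath(nn.Module):
    """Bottom-up aggregation of Eq. (2): N_{i+1} = f2 * N_i + P_{i+1}.

    Each step downsamples N_i with a stride-2 3x3 convolution, adds
    P_{i+1}, and smooths the sum with another 3x3 convolution.
    All maps keep 256 channels, matching the detection sub-networks.
    """
    def __init__(self, ch=256, levels=5):
        super().__init__()
        self.down = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, stride=2, padding=1) for _ in range(levels - 1))
        self.smooth = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, stride=1, padding=1) for _ in range(levels - 1))

    def forward(self, pyramid):          # pyramid = (P3, P4, P5, P6, P7)
        outs = [pyramid[0]]              # assumed: N3 = P3
        for i in range(len(pyramid) - 1):
            fused = self.down[i](outs[-1]) + pyramid[i + 1]
            outs.append(self.smooth[i](fused))
        return outs                      # (N3, ..., N7)
```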

3.3 Detection head

The FCOS head is selected as the base head, and GN is added after the convolutional layers of the head so that normalization does not depend on the batch size. With BN, the training and validation errors are higher when the batch size is small, because the accuracy of the BN statistics depends on the current batch. GN divides the channels into groups and computes the mean and variance within each group for normalization, so it is naturally unconstrained by the batch size.
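In PyTorch this swap is a one-line change; the group count of 32 below is an assumption, since the paper does not state it:

```python
import torch.nn as nn

channels = 256  # channel width of the head's feature maps

# Batch-dependent normalization originally used after the head convolutions:
bn = nn.BatchNorm2d(channels)

# Batch-independent replacement: mean and variance are computed within
# 32 groups of channels, so small batch sizes do not degrade the statistics.
gn = nn.GroupNorm(num_groups=32, num_channels=channels)
```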

Fig. 2 The cornerness strategy

Centerness is a unique branch of FCOS: the image is divided into grids according to scale, and the training target is the distance between the center point of a grid cell and the ground-truth box. (\(x_{0}\), \(y_{0}\)) and (\(x_{1}\), \(y_{1}\)) are the corner points of the ground-truth box and (x, y) is the center point of the grid cell. The distances from the center point to the ground-truth box are \(l^{*},t^{*},r^{*},b^{*}\):

$$\begin{aligned} \begin{array}{c} l^{*}=x-x_{0}\quad t^{*}=y-y_{0}\\ r^{*}=x_{1}-x\quad b^{*}=y_{1}-y \end{array} \end{aligned}$$
(3)

Centerness can be expressed as:

$$\begin{aligned} \textrm{Centerness}^{*}=\sqrt{\frac{\min (l^{*},r^{*})}{\max (l^{*},r^{*})}\times \frac{\min (t^{*},b^{*})}{\max (t^{*},b^{*})}} \end{aligned}$$
(4)

However, when the object is small, there may be no center point, or only one, falling into the ground-truth box. Since underwater objects are small in scale, the method needs to be improved for underwater object detection. We add corner points to the regression strategy to solve this problem; the corner point regression strategy is shown in Fig. 2.

In practice, the distances between a corner point and the ground-truth box are divided by the corresponding stride to match the actual size of underwater objects. The distances between the corner point and the ground-truth box are \(l_{c}^{*},t_{c}^{*},r_{c}^{*},b_{c}^{*}\):

$$\begin{aligned} \begin{array}{c} l_{c}^{*}=(x-x_{0})/s\quad t_{c}^{*}=(y-y_{0})/s\\ r_{c}^{*}=(x_{1}-x)/s\quad b_{c}^{*}=(y_{1}-y)/s \end{array} \end{aligned}$$
(5)

The Cornerness is:

$$\begin{aligned} \textrm{Cornerness}^{*}=\sqrt{\frac{\min (l_{c}^{*},r_{c}^{*})}{\max (l_{c}^{*},r_{c}^{*})}\times \frac{\min (t_{c}^{*},b_{c}^{*})}{\max (t_{c}^{*},b_{c}^{*})}} \end{aligned}$$
(6)
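The following sketch computes the centerness and cornerness targets of Eqs. (3)-(6) for a single location. The function names are hypothetical, and Eq. (6) is read by analogy with Eq. (4):

```python
import math

def box_distances(x, y, x0, y0, x1, y1):
    """Eq. (3): distances from a location (x, y) to the box corners
    (x0, y0) and (x1, y1); assumes the location lies inside the box."""
    return x - x0, y - y0, x1 - x, y1 - y

def centerness(l, t, r, b):
    """Eq. (4): geometric mean of the two axis-wise distance ratios."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

def cornerness(x, y, x0, y0, x1, y1, stride):
    """Eqs. (5)-(6): the same target computed at a grid corner point,
    with all distances divided by the feature stride s (our reading
    of Eq. (6), by analogy with Eq. (4))."""
    lc, tc, rc, bc = (d / stride for d in box_distances(x, y, x0, y0, x1, y1))
    return centerness(lc, tc, rc, bc)
```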

The loss function of FCOS network is

$$\begin{aligned} L_{1}&=\frac{1}{N_{\textrm{pos}}}\sum _{x,y}L_{\textrm{cls}}(c_{x,y},c_{x,y}^{*})+\frac{\lambda }{N_{\textrm{pos}}}\sum _{x,y}\mathbb {1}_{\{c_{x,y}^{*}>0\}}\,L_{\textrm{reg}}(t_{x,y},t_{x,y}^{*})\\ &\quad +\frac{1}{N_{\textrm{pos}}}\sum _{x,y}L_{\textrm{cen}}(e_{x,y},e_{x,y}^{*}) \end{aligned}$$
(7)

where \(L_{\textrm{cls}}\) is the focal loss, \(L_{\textrm{reg}}\) is the IoU loss and \(L_{\textrm{cen}}\) is the centerness loss. \(N_{\textrm{pos}}\) denotes the number of positive samples, and \(\lambda \), set to 1 in this paper, is the balance weight for \(L_{\textrm{reg}}\). \(c_{x,y}^{*}\), \(t_{x,y}^{*}\) and \(e_{x,y}^{*}\) are the ground-truth category, position and centerness, and \(c_{x,y}\), \(t_{x,y}\) and \(e_{x,y}\) are the corresponding predictions. \(\mathbb {1}_{\{c_{x,y}^{*}>0\}}\) is the indicator function, equal to 1 if \(c_{x,y}^{*}>0\) and 0 otherwise.

The regression process is improved together with the corresponding loss function: corner regression is added to form a new loss function:

$$\begin{aligned} L_{2}&=\frac{1}{N_{\textrm{pos}}}\sum _{x,y}L_{\textrm{cls}}(c_{x,y},c_{x,y}^{*})+\frac{\lambda }{N_{\textrm{pos}}}\sum _{x,y}\mathbb {1}_{\{c_{x,y}^{*}>0\}}\,L_{\textrm{reg}}(t_{x,y},t_{x,y}^{*})\\ &\quad +\frac{1}{N_{\textrm{pos}}}\sum _{x,y}L_{\textrm{cor}}(e_{x,y},e_{x,y}^{*}) \end{aligned}$$
(8)

where \(L_{\textrm{cor}}\) is the cornerness loss. IoU is the ratio of the intersection to the union of the predicted box and the ground-truth box; however, IoU measures neither the distance between two boxes nor the manner of their intersection. GIoU overcomes these shortcomings while keeping the advantages of IoU. For intersecting boxes, IoU can be back-propagated and used directly as an optimization objective, but its gradient is zero and optimization stops when the boxes do not intersect; GIoU completely avoids this problem. The regression loss is therefore replaced by the GIoU loss to form the objective function. The training loss is defined as the sum of \(L_{1}\) and \(L_{2}\). A sketch of the GIoU loss is given below.
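A minimal implementation of the GIoU loss for axis-aligned boxes, written here as a generic PyTorch function rather than the authors' exact code:

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """GIoU loss for boxes in (x0, y0, x1, y1) format, shape (..., 4).

    Unlike IoU, the penalty based on the smallest enclosing box keeps
    a useful gradient even when the two boxes do not overlap.
    """
    # Intersection area.
    ix0 = torch.max(pred[..., 0], target[..., 0])
    iy0 = torch.max(pred[..., 1], target[..., 1])
    ix1 = torch.min(pred[..., 2], target[..., 2])
    iy1 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix1 - ix0).clamp(min=0) * (iy1 - iy0).clamp(min=0)

    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)

    # Smallest enclosing box of the two boxes.
    ex0 = torch.min(pred[..., 0], target[..., 0])
    ey0 = torch.min(pred[..., 1], target[..., 1])
    ex1 = torch.max(pred[..., 2], target[..., 2])
    ey1 = torch.max(pred[..., 3], target[..., 3])
    enclose = (ex1 - ex0) * (ey1 - ey0)

    giou = iou - (enclose - union) / (enclose + eps)
    return 1.0 - giou
```

For disjoint boxes the enclosing-box term still varies with the prediction, which is exactly the property that lets \(L_{2}\) be optimized when IoU alone would give a zero gradient.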

4 Experiments and analysis

In this section, we design several groups of experiments and analyze the results to verify our work. Our experiments are mainly conducted on an underwater image dataset with 4 categories. This section includes five parts: (1) implementation details of the experiments; (2) experiments on the underwater image dataset; (3) analysis of the loss function; (4) experiments on the PASCAL VOC datasets; (5) robustness testing experiments.

Fig. 3 Comparison of detection results between the baseline and our proposed method

4.1 Implementation details

We implement MA-FPN and the other networks in PyTorch. VoVNet-39-A is taken as our backbone network. Specifically, our network is trained with stochastic gradient descent (SGD) for 100K iterations with an initial learning rate of 0.01 and a minibatch of 4 images. The learning rate is reduced by a factor of 10 at iterations 60K and 80K, respectively. Weight decay and momentum are set to 0.0001 and 0.9, respectively. In addition, the input images are resized to 1024\(\times \)800. All experiments run on an NVIDIA Titan Xp GPU with cuDNN v5.1 and an Intel(R) Xeon(R) W-2135 CPU @ 3.70 GHz. The dataset is publicly available at https://aistudio.baidu.com/aistudio/datasetdetail/25886.
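The stated schedule could be realized in PyTorch roughly as follows; `build_ma_fpn` and `train_loader` are hypothetical placeholders for the detector constructor and data pipeline:

```python
import torch

# Sketch of the stated schedule: SGD, lr 0.01, momentum 0.9, weight decay
# 1e-4, decayed 10x at iterations 60K and 80K, for 100K iterations total.
model = build_ma_fpn()  # hypothetical constructor
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60_000, 80_000], gamma=0.1)

for it, (images, targets) in enumerate(train_loader):  # train_loader assumed
    loss = model(images, targets)  # assumed to return the summed loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # stepped per iteration, not per epoch
    if it >= 100_000:
        break
```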

4.2 Experiments on underwater image datasets

The underwater image dataset is built in the same format as the PASCAL VOC datasets and mainly includes 5546 pictures with four categories: starfish, echinus, holothurian and scallop. The underwater images are blurry and color-cast, and the objects in them are small in scale. What is more, some underwater objects, such as holothurians and scallops, have protective coloration that hides them in their surroundings.

The captured images usually contain a high density of objects because of the living habits of underwater creatures. These characteristics aggravate the challenges of the underwater object detection task. A series of experiments is performed on the underwater image dataset with the proposed network. The detection results of FCOS and the improved network are compared to verify the performance of the method. As the comparison in Fig. 3 shows, our method outperforms FCOS in underwater object detection; specifically, the improved network detects more small objects and objects with protective coloration.

Table 2 Comparison with popular detectors on the underwater image dataset
Table 3 Ablation experiments on underwater image dataset

4.2.1 Comparison with popular detectors

Experiments were carried out with different detectors on the underwater image dataset. Specifically, each popular detector was reimplemented with its default settings. The comparison results are shown in Table 2. Clearly, detection performance on underwater images does not reach the level achieved on common datasets.

Table 4 Influence of backbone network structure on detection performance

SSD, YOLO and the R-CNN series are popular object detection methods, and this article implements these networks on the same underwater dataset. As shown in Table 2, the two-stage detector Faster R-CNN reaches an mAP of 71.18% on the underwater dataset; it is more accurate than the one-stage detectors SSD and YOLOv3, but its complex network structure makes detection slow. The YOLOv3 detector achieves a faster detection speed, processing 16.8 frames per second. YOLOv4 [30] improves the detection accuracy on the underwater dataset to 76.46%. The SSD detector obtains a detection accuracy of 72.51% through its feature pyramid. Our proposed method performs best on the underwater image dataset with an mAP of 78.90%. Below we analyze the effectiveness of our network in detail.

4.2.2 Ablation study

We conduct a series of ablation experiments to show the contribution of each component and verify the performance of the proposed network. In Table 3, FCOS on the underwater image dataset is taken as the baseline, and our designs are introduced on top of it to improve performance.

The comparison between the second and third lines shows that underwater object detection performance improves after introducing MA-FPN, which is attributed to its rich texture and semantic information. What is more, the results in the third and fourth lines show that the redesigned loss function also contributes to detection performance, because GIoU better reflects the degree of coincidence between the predicted box and the ground truth. The last two lines show that our corner point regression strategy is effective: it takes more regression points into account and is friendlier to the small objects found underwater. Overall, the proposed network outperforms FCOS on underwater object detection; MA-FPN is 4.37% better than FCOS under the same experimental settings.

Fig. 4 Comparison of detection results of the two networks. The blue line is the mAP of FCOS and the red line is the mAP of MA-FPN

Figure 4 shows the detection results of the FCOS network and the proposed network on the underwater dataset. At the beginning of training, our network reaches a higher mAP than FCOS at the same epoch and converges faster. Moreover, the mAP of MA-FPN is higher than that of FCOS at every epoch, which proves the effectiveness of the proposed network.

4.2.3 Research on backbone

Experiments are carried out with different backbone networks to study their influence on detection performance. The experimental results are shown in Table 4. The network using ResNet as the backbone achieves an mAP of only 74.53% on the underwater image dataset. The overall detection performance of VoVNet is better than that of ResNet, and VoVNet-39-A suits our network best among the backbones compared.

4.3 Analysis of the loss function

Fig. 5 The comparison of precision-recall curves between FCOS and ours

After several simulation experiments, our network achieves good performance in underwater object detection. The comparison of the precision-recall curves of FCOS and our network is shown in Fig. 5.

Neural network training continuously reduces the loss function, so the fitting quality of a model can be judged by comparing loss curves on the same dataset. This is shown by the experimental data in Fig. 6, where loss1 and loss2 are the training loss curves of FCOS and the proposed network. Once the training loss stabilizes, our loss function has a smaller value and fits the data better. Moreover, our network is more stable, since its loss fluctuates less during training. The cornerness strategy plays a crucial role here: as shown in Fig. 7, when underwater images contain numerous small objects, the added corner point regression strategy detects more of them by setting more recall points.

More detection results of our network are shown in Fig. 8. Our network performs well in blurry scenes with uneven lighting; even when there are a large number of small objects, they are accurately detected.

However, the network still has shortcomings in underwater object detection: it is extremely difficult to identify occluded objects, and the detection of covered objects also needs to be improved. Some failure cases are shown in Fig. 9. For example, it is difficult to detect a starfish hidden behind a stone, and a sea urchin very close to a mesh with similar characteristics is mistaken for part of the mesh.

Fig. 6 The comparison of training loss

To verify the practicability of the network, we simulated a real underwater environment in a laboratory pool. As shown in Fig. 10, an underwater robot collects images of the objects to be detected through a camera and uses our network to detect them. Our network accurately detects underwater objects under these conditions, demonstrating its practicability.

4.4 Experiments on PASCAL VOC datasets

We conducted experiments on the PASCAL VOC datasets to verify the effect of the proposed method. Specifically, we train the network on the VOC 2007 and VOC 2012 training sets and test the model on the VOC 2007 test set. Table 5 compares our network with recent object detection networks on the PASCAL VOC dataset.

Fig. 7 Some detection results of FCOS and MA-FPN on the underwater image dataset. Blue bounding boxes are FCOS detections and red ones are MA-FPN detections

Fig. 8 Part of the detection results of our method on the underwater image dataset

Fig. 9 Failure cases on the underwater object detection task

Fig. 10 Detection results in the simulated underwater environment

As shown in Table 5, YOLOv3 can detect objects in real time at 34 frames per second, and SSD300 reaches 46 FPS. Upgraded versions of these detectors, such as SSD512 and DSSD321, achieve higher detection accuracy at the cost of greater computation; DSSD321 even reaches an mAP of 78.6%. Our network obtains the highest mAP of 84.3% on the PASCAL VOC dataset, exceeding FCOS by 3.8%.

4.5 Robust testing experiments

In this section, we analyze the detection accuracy of the proposed network in noisy environments to verify its robustness [32]. As shown in Fig. 11, we add Gaussian noise obeying a normal distribution \(N\left( \mu , \sigma ^{2}\right) \) to validate the proposed network. The abscissa of Fig. 11 is the noise parameter \(\sigma \) and the ordinate is the mean average precision (mAP). A sketch of the noise injection follows.
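A minimal sketch of the noise injection, assuming images normalized to [0, 1]:

```python
import torch

def add_gaussian_noise(images, sigma, mu=0.0):
    """Perturb a batch of images with N(mu, sigma^2) noise, as in the
    robustness test; clamping keeps pixel values in the valid range."""
    noise = torch.randn_like(images) * sigma + mu
    return (images + noise).clamp(0.0, 1.0)

# Example: evaluate mAP at the sigma values reported in the paper.
# for sigma in (0.1, 0.3, 0.5, 0.8):
#     noisy = add_gaussian_noise(images, sigma)
```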

Table 5 Detection results on the PASCAL VOC datasets

The mAP values are 0.769, 0.715, 0.554 and 0.340 when \(\sigma \) is 0.1, 0.3, 0.5 and 0.8, respectively. The proposed method is thus robust to a certain extent, and our network is highly robust when the added noise is small. However, underwater images already contain strong noise, so detection accuracy degrades when a large amount of noise is added.

Fig. 11 Object detection accuracy under different degrees of Gaussian noise

5 Conclusion

This paper proposes a simple and effective multi-scale feature pyramid network, which constructs a feature pyramid to detect multi-scale objects. First, the efficient VoVNet-39-A is selected as the backbone to extract basic features. Then, a multi-scale feature pyramid is built to enhance texture and semantic features. In addition, a corner point regression strategy is introduced, with the regression distances divided by the feature stride to adapt to the actual scale of objects. Finally, GIoU replaces IoU in the loss function to measure the distance between the predicted box and the ground-truth box. Experimental results show that this method is effective for underwater object detection: its mAP reaches 78.90%, which is 4.37% better than FCOS.