1 Introduction

Water bodies cover almost two-thirds of the earth's surface, produce almost half of its oxygen, and absorb a large amount of carbon dioxide from the atmosphere. Maintaining these water bodies and the ecosystem services they provide requires monitoring of critical underwater habitats. Underwater optical imaging technology offers immense potential to monitor subaquatic scenes more efficiently in terms of both time and cost. Several management strategies for the underwater environment employ remote sensing and tracking of subaquatic species and their habitats. In the last few years, the availability of subaquatic imagery has increased exponentially due to the use of autonomous underwater vehicles (AUV), digital cameras, and unmanned underwater vehicles (UUV) [79]. Millions of subaquatic pictures of coral reefs have been captured by the Integrated Marine Observing System (IMOS) around Australia; however, no more than 5% of them undergo expert analysis. For the National Oceanic and Atmospheric Administration, the figure is even lower, just 1 to 2% [8]. For these reasons, automatic monitoring and analysis of underwater digital data is now a high research priority. To perform such research and solve underwater issues, the state-of-the-art area of machine learning called deep learning offers potentially unparalleled opportunities for several subaquatic objects [89].

So far, manually designed low-level features have been exploited in conventional classification. Moreover, Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), Principal Component Analysis (PCA), and other conventional machine learning tools saturate quickly as the volume of training data increases. Hinton et al. [48] proposed learning features with deep neural networks (DNNs) to overcome the shortcomings of conventional machine learning methods. To make sense of text, images, sound, etc., deep learning algorithms transform input data through more layers than shallow learning algorithms [23]. Every layer transforms the signal using a processing unit, such as an artificial neuron, whose parameters are learned via training [102]. Handcrafted features are being replaced by efficient deep learning algorithms for feature learning and hierarchical feature extraction [107]. Deep learning methods represent an observation (e.g., an image) more effectively and build models that learn these representations from large amounts of data. Trained with ample data, deep and large networks have shown excellent performance. As an example, a convolutional neural network (CNN) trained on ImageNet has achieved unequaled precision in picture classification [61]. CNNs have been used for image classification [61], object detection [64], digit and traffic sign recognition [18], face verification [72], etc., and have shown excellent performance. However, algorithms based on deep learning have not been widely used in underwater object detection and classification. A study of the present deep learning techniques for underwater object detection and classification will help researchers understand the challenges and survey the most promising possibilities. Roughly two-thirds of the earth's surface is covered by water [132]; however, technologies related to underwater research remain comparatively underexplored [73].
Besides, and more importantly, underwater security, including naval operations, shipwreck search, etc., is a momentous issue; therefore, marine object detection technologies are of utmost importance for underwater surveillance.

Humans can quickly spot scenes (sea, mountains), objects (sailboat, cruise ship), and visual details when viewing an image. To detect marine objects in an image or video without manual intervention, a computer vision technique known as object detection is used. Object detection aims to detect objects in a picture as humans do and to give a computer an understanding of what a picture contains. An object in an image can be detected from the back, side, and front views [105]. Conventional methods, such as manually classifying and recognizing objects in underwater images or conventional statistical analysis and ocean-model simulation, rely heavily on the availability of optical characteristics and are imprecise and inefficient for processing huge volumes of underwater data. On the contrary, methods based on deep learning can process huge amounts of data and meet the requirements of accurate and quick analysis of immense underwater data. Therefore, deep learning enables researchers to solve various underwater problems such as protection of the underwater ecological environment, emergency rescue, underwater disaster mitigation and prevention, and underwater target detection, tracking, and recognition. Deep learning algorithms have been extensively used in marine systems, for example marine data classification and recognition (Convolutional Neural Network, CNN) [30], marine data reconstruction (CNN) [29], and marine data prediction (Recurrent Neural Network, RNN; CNN) [128]. The deep learning algorithm presented by Duo et al. [30] involves three stages: pre-processing, the network, and finally eddy extraction. In the pre-processing step, remote sensing satellite altimeter data are pre-processed and the small, precise sample data are augmented to obtain the training set; subsequently, in the network stage, a deep learning-based integration model together with an object detection network forms the main part. To facilitate further analysis, the trained model is employed in the final stage to detect the eddy extent and produce the coordinates of the effective contour and eddy center. This algorithm produces a complete eddy detection network that can efficiently employ remote sensing data. Its shortcoming is that it is only suitable for Archiving, Validation, and Interpretation of Satellite Oceanographic data (AVISO) sea level products. Ducournau et al. [29] address the downscaling of ocean remote sensing data by employing image super-resolution models built on deep learning, specifically Convolutional Neural Networks (CNNs). The main aim of that work is to assess the relevance and efficiency of deep learning networks applied to ocean remote sensing data, with a focus on satellite-derived Sea Surface Temperature (SST) data.

Recent advances in RNN algorithms offer significant solutions for sequence prediction problems. The Long Short-Term Memory (LSTM) architecture improves the RNN hidden layer and has been successfully employed for tasks such as supervised sequence learning. A prediction network that can fully utilize the spatio-temporal information contained in image sequences has long been desired. Thus, a prediction framework that combines spatial and temporal information, called the Combined Fully Connected Long Short-Term Memory and Convolutional Neural Network model (CFCC-LSTM), has been proposed. Yang et al. [128] proposed this prediction network, which brings together spatial and temporal information by combining a Fully Connected LSTM (FC-LSTM) with a Convolutional Neural Network. The model consists of a single FC-LSTM and a single convolutional layer. Firstly, the CFCC-LSTM network is proposed to solve sequence prediction problems, particularly for complicated SST images, to fulfill the network requirement. Secondly, two data sets are introduced, the Bohai Sea data set and the China Ocean data set [127], to fulfill the data requirement.

These aforementioned algorithms mainly employ CNNs. A CNN-based detector consumes a lot of computational time to make accurate predictions because it uses multiple regions of the image as input to perform object detection, so such algorithms are not suitable for real-time underwater object detection. You Only Look Once (YOLO) based algorithms, unlike RCNN, do not process multiple regions of the same image; they look at the complete image at once, making the object detection task less complex and less time-consuming. However, not much work has been carried out on underwater object detection using YOLO.

The existing deep learning algorithms employed for underwater images require a huge number of high-quality images and videos, since high-quality videos and images often have better discriminant features. Nevertheless, widely varying underwater environmental conditions, such as turbid water, the amount of light, complex backgrounds, range, and viewing angle, make acquiring high-quality pictures and videos a major challenge. Therefore, research to develop a unified model or framework is strongly required, combining three steps, picture pre-processing, feature extraction, and classification, for the underwater object recognition task, so that all underwater images acquired by a camera or other image capturing equipment can be given directly to the models. Such a model is expected to substantially decrease the number of pictures needed. Besides, many present approaches that employ deep learning for underwater object detection are built on transfer learning, which means training the algorithm on a classic picture database such as COCO or ImageNet, then giving the new underwater dataset to the pre-trained model and fine-tuning it for practical application. Nevertheless, this task consumes a lot of time and is hard to perform, particularly the pre-training process. ImageNet pre-training speeds up convergence in training; however, it does not necessarily provide regularization or improve accuracy on the final target task. Thus, the points highlighted above encourage researchers to reconsider the need for fine-tuning and pre-training.
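As an illustration of the transfer-learning workflow described above, the following is a minimal sketch, assuming PyTorch/torchvision, an ImageNet pre-trained ResNet-50 backbone, and a hypothetical underwater dataset with five classes; it is not the specific pipeline of any of the surveyed works.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical number of underwater classes (e.g., fish, coral, turtle, ...).
NUM_UNDERWATER_CLASSES = 5

# Load an ImageNet pre-trained backbone and replace its classification head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_UNDERWATER_CLASSES)

# Freeze the early layers so only the last block and the head are fine-tuned
# on the comparatively small underwater dataset.
for name, param in model.named_parameters():
    if not name.startswith(("layer4", "fc")):
        param.requires_grad = False

optimizer = torch.optim.SGD(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=1e-3, momentum=0.9
)
criterion = nn.CrossEntropyLoss()

def fine_tune_step(images, labels):
    """One fine-tuning step on a batch of underwater images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```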

The remaining portion of the paper is organized as follows: Section II discusses the motivation and contribution; Section III presents the basics of computer vision and deep learning; Section IV discusses the concept of underwater object detection and presents the various existing deep learning methods used for it; Section V discusses the experimental results and compares some popular deep learning algorithms used for underwater object detection; Section VI presents the challenges; and Section VII offers the conclusion and future directions.

2 Motivation and contribution

With the help of deep networks having multiple levels of non-linearities, discriminative features and high-level abstractions can be learned from various complex datasets autonomously and effectively [131]. Also, the computational power of modern GPUs has greatly boosted the efficiency of deep learning algorithms. Over the years, several deep learning methods and algorithms for underwater object detection have been presented. However, the advantages and shortcomings of these methods have not been summarized in a single contribution, nor, to the best of our knowledge, have the future challenges and trends of the majority of such works been summarized in a single paper. Keeping these issues in view, an in-depth review of underwater object detection using deep learning methods is timely and has both practical and theoretical importance for the ocean engineering community. The aim of this review is to highlight some renowned deep learning underwater object detection algorithms that can aid and pave the way for future research in this field.

The major contributions of this review are as follows:

  1. An understandable and profound review of highly renowned deep learning algorithms for underwater object detection is presented.

  2. The architectures and advantages of each algorithm are thoroughly discussed and analyzed.

  3. Experimental results of different deep learning algorithms for subaquatic object detection are comprehensively reviewed and compared.

  4. Future trends and possible challenges in marine object detection using deep learning techniques are discussed in depth.

3 Brief idea of computer vision and deep learning

Computer vision

In computer vision, computers equipped with imaging sensors are used to mimic human visual functions: they extract features from the acquired data, then examine and classify them to support decision making. Several fields of knowledge, such as image processing, high-level computer programming, and artificial intelligence (AI), are usually involved in computer vision. For example, manufacturing industries use it to point out defects or enhance quality [17, 57]. There are also effective applications of emotion observation and face detection at places like airports and various security checkpoints [10, 19, 20]. In medicine, doctors use diagnostic software to identify abnormal tissues and tumors in medical images [1]. The agricultural industry uses computer vision in decision-making systems to predict the total yield from a field [121]. Furthermore, Google is designing a self-driving car with a visual range of almost 328 ft that can recognize traffic signals and avoid pedestrians [83]. Several state-of-the-art algorithms show that computer vision is changing our day-to-day lives.

Deep learning

The word deep in deep learning means a neural network employs many layers to imitate the human brain. Some algorithms based on neural networks include Novel Nonlinear Hypothesis for the Delta Parallel Robot Modelling, 2020 [6]; SOFMLS: Online Self-Organizing Fuzzy Modified Least-Squares Network, 2009 [97]; Wavelet-Based EEG Processing for Epilepsy Detection Using Fuzzy Entropy and Associative Petri Net, 2019 [16]; Stability Analysis of the Modified Levenberg-Marquardt Algorithm for the Artificial Neural Network Training, 2020 [98]; On the Estimation and Control of Nonlinear Systems With Parametric Uncertainties and Noisy Outputs, 2018 [84]; and CNN based detectors on planetary environments: a performance evaluation, 2020 [33]. The idea of deep learning with neural networks came to light more than two decades ago. In 1998, deep learning was originally developed by Lecun et al. [67], who built a five-layered classifier known as LeNet5 based on a Convolutional Neural Network (CNN). Initially, LeNet was employed to recognize hand-written bank cheques using the Modified National Institute of Standards and Technology (MNIST) dataset. Activation functions and fully connected architectures were already known in neural networks; LeNet5 introduced convolutional and pooling layers and is considered the foundation for all convolutional networks. The LeNet5 network has two sets of convolutional and average pooling layers, followed by a flattening convolutional layer, then two fully connected layers, and finally a SoftMax classifier. The input to the LeNet5 model is a grayscale image of size 32 × 32. This image passes through the first convolutional layer with six filters (feature maps) of dimension 5 × 5 and a stride of one, changing the size from 32 × 32 × 1 to 28 × 28 × 6. Then average pooling is applied with a filter of size 2 × 2 and stride 2, reducing the dimensions to 14 × 14 × 6. Next, another convolutional layer with sixteen feature maps of dimension 5 × 5 and stride 1 is employed; in this stage, only ten of the sixteen feature maps are connected to the six feature maps of the previous layer. The fourth layer is again an average pooling layer with filter dimension 2 × 2 and stride 2. The fifth layer is a fully connected convolutional layer with 120 feature maps of size 1 × 1. This layer is followed by a fully connected layer and a fully connected SoftMax layer ŷ with ten possible values corresponding to the digits 0 to 9. The model yields an error rate of 0.95% on test data. Figure 1 below shows the process flow of LeNet5.

Fig. 1 Process flow of LeNet5
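The following is a minimal PyTorch sketch of the LeNet5 architecture described above; the choice of tanh activations and the omission of the original partial connectivity between the second pooling layer and the third convolutional layer are simplifying assumptions.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """LeNet5-style network for 32x32 grayscale digit images (10 classes)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 32x32x1 -> 28x28x6
            nn.Tanh(),
            nn.AvgPool2d(2, stride=2),         # 28x28x6 -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),   # 14x14x6 -> 10x10x16
            nn.Tanh(),
            nn.AvgPool2d(2, stride=2),         # 10x10x16 -> 5x5x16
            nn.Conv2d(16, 120, kernel_size=5), # 5x5x16 -> 1x1x120
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, 10),                 # SoftMax is applied inside the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Quick shape check with a dummy batch of 32x32 grayscale images.
logits = LeNet5()(torch.randn(4, 1, 32, 32))
print(logits.shape)  # torch.Size([4, 10])
```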

The experimental results of LeNet5 are highlighted in Fig. 2. The graphs show the training and validation loss versus the number of epochs and the accuracy versus the number of epochs.

Fig. 2 Performance evaluation of LeNet5

MNIST dataset

This database contains handwritten digits, with 60,000 examples as a training set and 10,000 examples as a test set. MNIST is a subset of a larger set from the National Institute of Standards and Technology (NIST). The digits are size-normalized and centered in a fixed-size picture. The MNIST dataset was developed from NIST's Special Database-3 and Special Database-1, which consist of binary pictures of hand-written digits. Originally, NIST designated SD-1 as the test set and SD-3 as the training set, but SD-3 is cleaner and easier to recognize than SD-1. The reason is that Special Database-3 was collected among Census Bureau employees, whereas Special Database-1 was collected among high school students. Drawing significant conclusions from experiments requires that the results not depend on the choice of training and test sets from the entire collection of samples. Thus, it was necessary to develop a new dataset by blending the NIST datasets.
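For reference, a short sketch of how the MNIST dataset can be loaded in practice, assuming the torchvision package; the 2-pixel padding brings the 28 × 28 digits to the 32 × 32 input size that LeNet5 expects.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Pad the 28x28 MNIST digits to 32x32 and convert them to tensors.
transform = transforms.Compose([
    transforms.Pad(2),
    transforms.ToTensor(),
])

train_set = datasets.MNIST(root="data", train=True, download=True, transform=transform)
test_set = datasets.MNIST(root="data", train=False, download=True, transform=transform)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=256)

images, labels = next(iter(train_loader))
print(images.shape, labels.shape)  # torch.Size([64, 1, 32, 32]) torch.Size([64])
```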

Deep learning has made immense achievements in recent years due to improvements in computing power and the explosion of big data. Deep learning algorithms rely on massive data collected in a particular field, and the resources needed to learn from such data are substantial. With the rise of efficient GPUs, cloud storage, ASIC accelerators, and highly powerful computing facilities, it is now practicable to gather, manage, and examine massive datasets. Massive datasets are significant because they reduce the risk of overfitting in deep learning, and improved computing power speeds up the time-consuming training process. Deep learning algorithms are increasingly used in several fields and have considerable advantages over conventional object detection approaches in computer vision; the performance of several robotic systems has thus been improved by deep learning. For example, Google's AlphaGo studied the learning behavior of humans and then competed with top Go players [7]. To use deep learning in computer vision, an adequate number of examples from previously gathered pictures is critical; a good example is ImageNet [2]. Conventional computer vision techniques suffered from limited precision in feature extraction, while deep learning approaches can improve the method via a neural network.

4 Underwater object detection

A computer vision approach employed for identifying and locating underwater objects in a subaquatic picture or video is known as underwater object detection. Through such identification and localization, underwater object detection can be used to count objects in underwater scenes and to determine and track their locations by precisely labeling the objects in underwater images. To detect an underwater object, a bounding box is defined to locate the object in the image; a deep learning algorithm for underwater object detection is employed to obtain the bounding box coordinates (X, Y) for an object in a subaquatic image. Object detection not only tells us what a subaquatic image contains but also where the object is located. Figure 3 below shows the results of underwater fish detection using deep learning algorithms [58, 119].

Fig. 3 a Pollack detection, b Coral reef fish detection [58, 119]

Ocean engineers frequently employ Autonomous Underwater Vehicles (AUVs) for subaquatic robotic capture, and the detection of underwater organisms is becoming increasingly important in autonomous subaquatic capture. Human divers usually harvest seafood; however, diver fishing not only causes severe injury to divers' bodies but is also inefficient, particularly when the water depth exceeds 20 m. Robotic capture, in which a subaquatic robot catches seafood, has been proposed to solve this problem of subaquatic diver fishing. Figure 4 below shows some unmanned underwater vehicles used for robotic capture [86, 87].

Fig. 4 Underwater autonomous vehicles [86, 87]

These vehicles not only decrease the chances of bodily injury to divers but also decrease the price of seafood. The method proposed by Han et al. [42] has been employed with such a model, namely an underwater remotely operated vehicle (ROV), and the entire network has been used for harvesting underwater products. This underwater robot has a length of about 1 m, a width of about 0.8 m, and a weight of about 90 kg. The technique for gathering underwater products is of the adsorption type; the real structure of this underwater robot is shown in Fig. 5. It is operated and controlled remotely, and its main tasks are the detection and localization of underwater objects.

Fig. 5 Underwater ROV for fishing subaquatic organism [42]

Fishery robotics has drawn great attention and effort from researchers due to the several merits of robotic capture [59, 110, 113]. Norway has built a submarine Remotely Operated Vehicle (ROV) which enables ocean engineers to harvest sea urchins via remotely controlled aspiration. However, manipulating a subaquatic robot is a tedious task and a tough problem that requires a highly focused operator with rich experience. To reduce expenses and operational challenges, autonomous capture is necessary. In underwater object detection, algorithms based on conventional machine learning were popular in past decades. For example, Garcia et al. [35] performed object segmentation and identification via a process called generic segmentation, and Sun et al. [108] presented an automatic recognition algorithm based on shape-based and color-based identification. Conventional approaches mostly consider color and shape features for detecting and recognizing an object. However, subaquatic organisms exhibit different shapes in different subaquatic environments, and for ecological reasons their colors resemble the seabed. The well-known and efficient algorithms based on deep learning can enhance the ability to perceive underwater organisms. For example, Sermanet et al. [103] proffered the OverFeat framework, utilizing Convolutional Neural Networks (CNNs) and multi-scale sliding windows to detect, recognize, and classify objects in an image. Ren et al. [95] proffered the Faster Region-based Convolutional Neural Network (Faster RCNN), which uses a Region Proposal Network (RPN) to generate region proposals and then a CNN network for classification and bounding box regression. Figure 6 below highlights the image processing and object detection structure, giving an overview of how images are processed and objects detected.

Fig. 6 Image processing and object detection structure

4.1 Deep learning algorithms for underwater object detection

Deep learning is a subset of the broader field of machine learning and mainly builds on artificial neural networks. Figure 7 below highlights the general deep learning framework, and the mathematics behind it is discussed in the following portion.

Fig. 7 Basic process flow of deep learning algorithm

Every neuron is partitioned into two blocks:

  i) Calculating z from the inputs xi, where xi = x1, x2, x3, x4 as shown in the figure above:

$$ z = \sum_{i} w_i x_i + b $$
(1)

  ii) Calculating a from z:

$$ a = \Psi(z) $$
(2)

Where wi denotes the weights, b represents the bias, and Ψ is the activation function.
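The two blocks of Eqs. (1) and (2) can be sketched in a few lines of NumPy; the use of tanh as the activation function Ψ is just an assumption for illustration.

```python
import numpy as np

def neuron_forward(x, w, b, activation=np.tanh):
    """Forward pass of a single neuron, following Eqs. (1) and (2)."""
    z = np.dot(w, x) + b   # Eq. (1): weighted sum of inputs plus bias
    a = activation(z)      # Eq. (2): apply the activation function psi
    return a

# Example with four inputs, as in the figure above.
x = np.array([0.5, -1.0, 2.0, 0.1])
w = np.array([0.2, 0.4, -0.3, 0.8])
b = 0.1
print(neuron_forward(x, w, b))
```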

Learning process in deep learning:

The learning process in a deep neural network is the step of computing the weight parameters associated with the different regressions throughout the framework. The main goal is to find the best parameters that give the best approximation/prediction, starting from the real input values. For this purpose, an objective function known as the Loss Function, denoted J, is defined; it quantifies the distance between the predicted values and the real values over the entire training dataset.

Two main steps are followed to minimize the value of J:

  i) Forward-propagation: The data is propagated through the framework either as a whole or in batches, and the loss function is computed on the batch. This loss function is the sum of the errors committed at the predicted output for the various rows.

  ii) Back-propagation: The gradients of the cost function with respect to the various parameters are calculated, after which a descent algorithm is applied to update them.

This process is repeated many times; the number of repetitions is the number of epochs. The learning process is indicated below (a minimal code sketch of this loop is given after Eq. (4)):

  • Initializing the network parameters.

  • For i = 1, 2, …, N: (N represents the number of epochs)

  • Performing forward-propagation:

  • ∀i, calculate the predicted value of input xi via the deep neural network: \( \hat{y}_i^{\theta} \)

  • Evaluating the function:

$$ J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L\left(\hat{y}_i^{\theta}, y_i\right) $$
(3)

Where m represents the training set size, θ indicates the network parameters, L represents the cost function (the cost function L measures the distance between the predicted value and the real value at a single point), and yi is the actual output.

  • Performing back-propagation:

  • Applying descent approach for updating parameters:

$$ \theta := G(\theta) $$
(4)
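A minimal PyTorch sketch of the learning loop outlined above, assuming a small fully connected model, a mean squared error loss standing in for L, and plain SGD as the descent algorithm G; the data tensors are placeholders.

```python
import torch
import torch.nn as nn

# A tiny fully connected network standing in for the deep model.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
criterion = nn.MSELoss()                                   # loss function L / J
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # descent algorithm G

X = torch.randn(128, 4)   # placeholder training inputs
Y = torch.randn(128, 1)   # placeholder training targets

N_EPOCHS = 100
for epoch in range(N_EPOCHS):
    y_hat = model(X)                 # forward-propagation
    loss = criterion(y_hat, Y)       # evaluate J(theta), Eq. (3)
    optimizer.zero_grad()
    loss.backward()                  # back-propagation: gradients of J
    optimizer.step()                 # descent update, Eq. (4)
```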

To improve the real-time efficiency of underwater object detection, various deep learning algorithms have been proposed to enable computers to handle subaquatic picture information in less time.

Usually, an advanced object detector is made of two components. Firstly, it consists of a backbone that is pre-trained on ImageNet. Secondly, it consists of a head that is employed to predict object classes and their bounding boxes. Object detectors that run on GPU can have a backbone network such as ResNet [46], VGG [106], DenseNet [54], or ResNeXt [125]. Object detectors that run on CPU can have a backbone network such as SqueezeNet [63], ShuffleNet [81, 136], or MobileNet [51, 52, 101, 116]. The head portion is often classified into two types, namely one-stage and two-stage object detectors. The most common two-stage object detectors are the RCNN [39] series, including Fast RCNN [38], Faster RCNN [95], RFCN [22], and Libra RCNN [88]. It is also possible to build an anchor-free two-stage object detector, e.g., RepPoints [129]. The most common one-stage object detectors are SSD [74], YOLO [91, 93, 94], and RetinaNet [70]. In the past few years, one-stage anchor-free object detectors have been developed, such as CornerNet [65, 66], CenterNet [28], and FCOS [118]. Detectors developed in recent years usually insert a few layers between the backbone and the head; these layers are often employed to collect feature maps from various stages and can be called the neck of the detector. A neck usually consists of several bottom-up and top-down paths. Networks equipped with such a mechanism include the Path Aggregation Network (PAN) [75], Feature Pyramid Network (FPN) [71], Bidirectional Feature Pyramid Network (BiFPN) [117], and NAS-FPN [37]. Besides, some researchers focus on directly designing a novel backbone (DetNet [68], DetNAS [14]) or a novel whole model (HitDetector [41], SpineNet [27]) for object detection. Summing up, an ordinary object detector consists of the following parts:

  • Input: Picture, Patches, Picture Pyramid

  • Backbones: ResNet-50 [46], VGG16 [106], SpineNet [27], CSPResNeXt50 [122], EfficientNet-B0/B7 [114], CSPDarknet53 [122]

  • Neck:

  • Added blocks: Atrous Spatial Pyramid Pooling (ASPP) [13], Spatial Pyramid Pooling SPP [44], Receptive Field Block (RFB) [76], Spatial Attention Module (SAM) [123]

  • Blocks for Path-aggregation: PAN [75], FPN [71], Fully-connected FPN, NAS-FPN [37], Adaptively Spatial Feature Fusion (ASFF) [78], Bidirectional Feature Pyramid Network (BiFPN) [117], SFAM [137]

  • Heads:

  • (one-stage) Dense Prediction:

  • RPN [95], SSD [74], RetinaNet [70] (anchor based), YOLO [94]

  • CornerNet [65], CenterNet [28], FCOS [118] (anchor free), MatrixNet [90]

  • (Two-stage) Sparse Prediction:

  • Faster R-CNN [95], RFCN [22], Mask RCNN [47] (anchor based)

  • RepPoints [129] (anchor free).

Some well-known deep learning techniques particularly those employing deep neural networks for digital underwater image detection and classification have been presented in this section. Each of the algorithms is highlighted in Fig. 8 and also discussed in the following portion of the paper.

Fig. 8 Some deep learning algorithms used for underwater object detection

The convolution operation performed between the filter and the input image gives a 2-dimensional matrix in which every element is the sum of the elementwise product of the filter elements and the image elements spanned by the filter. Mathematically, for a given filter K and picture I we have:

$$ \mathrm{conv}(I, K)_{x,y} = \sum_{i=1}^{n_H} \sum_{j=1}^{n_W} \sum_{k=1}^{n_C} K_{i,j,k}\, I_{x+i-1,\, y+j-1,\, k} $$
(5)
$$ \dim\left(\mathrm{conv}(I, K)\right) = \left( \left\lfloor \frac{n_H + 2p - f}{s} + 1 \right\rfloor,\ \left\lfloor \frac{n_W + 2p - f}{s} + 1 \right\rfloor \right) \text{ for } s > 0; \quad \left( n_H + 2p - f,\ n_W + 2p - f \right) \text{ for } s = 0 $$
(6)

Where nH and nW represent the height and width of the input image, respectively, and nC represents the number of channels. The filter/kernel K must have the same number of channels as the input image. The filter has odd dimension f, s represents the stride, p represents the padding, ⌊x⌋ represents the floor function of x, and I represents the input image.
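A naive NumPy sketch of Eqs. (5) and (6) is given below; as is conventional in CNN implementations, the operation is implemented as a cross-correlation (the kernel is not flipped), and the image, filter, stride, and padding values are arbitrary examples.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Naive 2-D convolution (cross-correlation) following Eqs. (5) and (6).

    image  : array of shape (H, W, C)
    kernel : array of shape (f, f, C) with odd f
    """
    if padding > 0:
        image = np.pad(image, ((padding, padding), (padding, padding), (0, 0)))
    H, W, _ = image.shape
    f = kernel.shape[0]
    out_h = (H - f) // stride + 1   # H already includes the 2p padding
    out_w = (W - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for x in range(out_h):
        for y in range(out_w):
            patch = image[x * stride : x * stride + f, y * stride : y * stride + f, :]
            out[x, y] = np.sum(patch * kernel)   # elementwise product and sum
    return out

# A 6x6 RGB image convolved with a 3x3 filter, stride 1, no padding -> 4x4 map.
feature_map = conv2d(np.random.rand(6, 6, 3), np.random.rand(3, 3, 3))
print(feature_map.shape)  # (4, 4)
```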

RCNN first extracts several region proposals from the input picture. A CNN model is then employed to carry out forward-propagation over every region proposal for feature extraction. Subsequently, the features of every region proposal are used to predict its bounding box and class.

The You Only Look Once (YOLO) algorithm uses CNNs for real-time object detection. As the name suggests, this algorithm needs only one forward-propagation through a neural network to detect objects: prediction over the whole image is performed in a single run of the algorithm, and bounding boxes and class probabilities are predicted simultaneously by the CNN network.

4.1.1 Convolutional neural network

A convolutional neural network is the simplest and most extensively used deep learning approach for object detection. Three basic components are involved in a convolutional neural network:

  1. Convolutional Layer

  2. Pooling Layer

  3. Output Layer

The detailed description of each of these layers is as follows:

  1. Convolutional Layer

In this layer, an image of size 6 × 6 (say) is convolved with a weight matrix that extracts certain features from the image. The weight matrix is a 3 × 3 square matrix and is run over the input image so that all pixels are spanned, generating the convolved output. In Fig. 9 below, each output value is obtained by summing the values from the element-wise product of the weight matrix and the highlighted portion of the input image. Following the same convolution steps over the whole image with a stride of 1, the 6 × 6 image is converted to a 4 × 4 image. Pixels are reused as the weight matrix slides over the entire image, which results in parameter sharing in CNNs. Figure 9 shows how a convolution operation with a stride of 1 works.

Fig. 9 Convolution operation with a stride of 1

  2. Pooling Layer

For large images, the number of trainable parameters must be reduced, and for that purpose pooling layers are periodically introduced between successive convolution layers. Pooling mainly reduces the spatial size of the image and is applied independently on every depth dimension; therefore, the image depth remains unaltered. Max pooling is the most commonly used form of pooling layer; a small sketch of max pooling is given below.
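A small NumPy sketch of max pooling, applied independently per depth slice as described above; the 2 × 2 window with stride 2 matches the pooling configuration mentioned for LeNet5.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Max pooling applied independently to each depth slice of an (H, W, C) map."""
    H, W, C = feature_map.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w, C))
    for x in range(out_h):
        for y in range(out_w):
            window = feature_map[x * stride : x * stride + size,
                                 y * stride : y * stride + size, :]
            out[x, y, :] = window.max(axis=(0, 1))   # max over the spatial window
    return out

# A 28x28x6 feature map pooled with a 2x2 filter and stride 2 -> 14x14x6.
print(max_pool2d(np.random.rand(28, 28, 6)).shape)  # (14, 14, 6)
```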

  3. The Output Layer

After several layers of convolution and pooling, we need the output in the form of a class. These two layer types can only extract features and reduce the number of parameters from the input images. To produce the output, we must include a fully connected layer whose size equals the number of required classes; it is difficult to obtain that number using convolution layers alone. Convolution layers produce 3-dimensional activation maps, but we just need the final output indicating whether a picture belongs to a particular class or not. To calculate the prediction error, the final output layer has a loss function such as categorical cross-entropy. After forward propagation is finished, backpropagation starts to update the weights and biases and reduce the loss. Figure 10 below shows what the entire network looks like.

Fig. 10 Whole process flow of CNN

An input image is given to the first convolutional layer, and the convolved output is received as an activation map. The filters in this layer extract relevant features from the input picture and pass them on; each filter generates a different feature to help accurate class prediction. If we want to retain the image size, we should use same padding (zero padding); otherwise, we can use valid padding, which reduces the number of features. Pooling layers are then used to further decrease the number of parameters. Several convolution and pooling layers are used before the prediction is made. Convolutional layers extract features: the deeper one goes in this architecture, the more specific the extracted features, compared to a shallow network in which the extracted features are more generic. The output layer of a convolutional neural network is a fully connected layer; here, the input from the previous layers is flattened and transformed into the classes desired by the algorithm, and the final output is produced. In the fully connected (output) layer, a loss function is defined to calculate the loss, for example the mean squared error. Finally, the error gradient is computed and backpropagated to update the filters (weights) and bias values. Thus, one forward and one backward pass complete one cycle of training.

For underwater object detection using CNN,

  • Firstly, we take an underwater image as an input image.

  • Then we split this image into multiple regions.

  • We then consider each region as an individual underwater image.

  • All these regions (pictures) are passed to the CNN network and classified into different classes.

Once every region has been assigned to a particular class, all these regions or images are combined to obtain the original input picture with the detected underwater objects; a minimal sketch of this region-wise classification is given below.
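The following is a minimal sketch of this region-wise classification, assuming PyTorch and a generic CNN classifier; the grid size and region resizing are illustrative assumptions rather than part of any surveyed method.

```python
import torch
import torch.nn.functional as F

def detect_by_regions(image, classifier, grid=4, region_size=(32, 32)):
    """Split an image tensor (C, H, W) into a grid of regions, resize each
    region to the classifier's expected input size, and classify it.

    `classifier` is any CNN mapping a (1, C, h, w) tensor to class logits;
    the 4x4 grid and 32x32 region size are arbitrary illustrative choices.
    """
    C, H, W = image.shape
    rh, rw = H // grid, W // grid
    predictions = []
    for i in range(grid):
        row = []
        for j in range(grid):
            region = image[:, i * rh:(i + 1) * rh, j * rw:(j + 1) * rw]
            region = F.interpolate(region.unsqueeze(0), size=region_size)
            logits = classifier(region)          # treat the region as an image
            row.append(int(logits.argmax(dim=1)))
        predictions.append(row)
    return predictions  # per-region class labels, recombined with the original image
```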

Many researchers have exploited CNN architectures for underwater object detection. Krizhevsky et al. [62] used a CNN model for the classification task and won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), reducing the top-5 error rate to 15.3%; since then, deep CNNs have been used extensively. Elawady et al. [31] utilized supervised CNNs for coral classification, worked on Heriot-Watt University's Atlantic Deep-Sea Dataset and the Moorea Labeled Corals dataset, and computed Phase Congruency (PC), the Weber Local Descriptor (WLD), and Zero Component Analysis (ZCA); they also considered texture and shape for input pictures with spatial color channels [31]. Mahmood et al. [82] presented a feature extraction scheme built on Spatial Pyramid Pooling (SPP) to make traditional point-annotated underwater data compatible with the input constraints of CNNs. Sermanet et al. [103] proffered a technique utilizing multi-scale sliding windows and Convolutional Neural Networks (CNNs) for object recognition, detection, and classification. Suxia et al. [21] in 2019 presented an approach for underwater object detection using a CNN; this network has been employed to detect fish in a blurry underwater environment. During the convolution process, the feature map size must be considered; three major factors affect it: padding, depth, and stride. Figure 11 below shows the extraction of a feature map with a depth of 3, a stride of 1, and zero padding [25, 26].

Fig. 11 Feature extractor using convolutional operation [25, 26]

Two types of connections are usually seen between two adjacent layers in a complex neural network: the locally connected layer and the fully connected layer, as shown in Fig. 12.

Fig. 12 a Fully connected neural network, b Locally connected neural network

In a fully connected neural network, all picture elements in the input image are connected to every neuron of the hidden layer, as indicated in Fig. 12a. In a CNN network, the last two layers, the SoftMax layer and the output layer, are commonly fully connected. A massive number of parameters leads to an increased amount of computation and delays processing. In a locally connected neural network, only some of the picture elements in the input image are connected to a hidden layer neuron, as indicated in Fig. 12b. This kind of connection increases the speed of the system and decreases the number of connections. For the local or full connectivity in the CNN model used by Suxia et al. [21] in 2019, the parameters for each layer are given in Table 1.

Table 1 Parameters in CNN model

System validation using the ImageNet dataset

The authors of [21] downloaded pictures from the renowned ImageNet ILSVRC [3] to perform system validation via object classification before applying the system to the ocean fish dataset developed in their research. There are about 500 pictures in about 20 classes, ranging over frog, coral, fish, sea turtle, ship, etc. Every RGB picture is rescaled to 448 × 448. Ground truth pictures are produced manually using the LabelImg software. Each image is split into a grid of 7 × 7 cells. Every cell predicts the location of two bounding boxes and class information composed in a 1 × 1 × 30 vector; this vector consists of the coordinates of the object center (x, y), the height h and width w, the predicted class probabilities of the underwater object, and the confidence scores of the bounding boxes. To predict the location of the target, the target is exhibited in a bounding box. Errors are always present between the predictions and the ground truth; they are measured using the loss function, which consists of three parts: the Intersection over Union (IoU) error, the coordinate error, and the class error. Equation (7) below shows the mathematical formula for the loss function.

$$ \mathrm{Loss} = \sum_{i=0}^{s^2} \mathrm{CoordError} + \mathrm{IoUError} + \mathrm{ClassError} $$
(7)

Where CoordError denotes the error in predicting the coordinates, ClassError denotes the inaccuracy of the predicted class, and IoUError measures the accuracy of the position, as indicated in Fig. 13. The underwater image has been taken from the Turbid dataset [5].

Fig. 13 Intersection over union
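Intersection over Union itself is straightforward to compute; the following sketch assumes boxes given as corner coordinates (x1, y1, x2, y2) and can be combined with Eq. (8) below to obtain the confidence score.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Ground truth box vs. predicted box.
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # 0.142857...
```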

Every grid cell in a picture predicts K bounding boxes. These bounding boxes enclose an underwater object to predict the object class and localization. Besides, a confidence score is associated with every bounding box; this score is not related to the class of the undersea object but only shows how accurately the predicted bounding box encloses the real undersea object.

$$ \mathrm{Confidence} = \Pr(\mathrm{Object}) \times \mathrm{IoU} $$
(8)

where Pr denotes probability and Pr(Object) denotes the probability of the underwater object of interest: its value is 1 if an object is in the grid cell and 0 if it is not. Most of the time, the loss function is calculated as the sum of squared errors [53]; it has three parts, namely confidence errors, probability errors, and localization errors.

$$ \mathrm{Loss} = \sum_{i=0}^{s^2} \sum_{j=0}^{B} \left[ \left(x_i - x_i^{\prime}\right)^2 + \left(y_i - y_i^{\prime}\right)^2 + \left(w_i - w_i^{\prime}\right)^2 + \left(h_i - h_i^{\prime}\right)^2 \right] + \sum_{i=0}^{s^2} \sum_{j=0}^{B} \left(c_i - c_i^{\prime}\right)^2 + \sum_{i=0}^{s^2} \sum_{c \in \mathrm{class}} \left(P_i(c_i) - P_i(c_i^{\prime})\right)^2 $$
(9)

where hi, wi denote the height and width of the ground truth bounding box; xi, yi denote the ground truth coordinates of the center of the underwater object; \( h_i^{\prime}, w_i^{\prime} \) represent the height and width of the predicted bounding box; and \( x_i^{\prime}, y_i^{\prime} \) represent the predicted coordinates of the center of the underwater object.
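A simplified NumPy sketch of the sum-of-squared-errors loss in the spirit of Eq. (9); the per-term weighting factors used in practice and the handling of cells without objects are omitted, and the box layout is an assumption for illustration.

```python
import numpy as np

def detection_loss(gt, pred):
    """Sum-of-squared-errors detection loss in the spirit of Eq. (9).

    `gt` and `pred` are arrays of shape (num_boxes, 5 + num_classes):
    columns are x, y, w, h, confidence, followed by class probabilities.
    """
    coord_error = np.sum((gt[:, 0:4] - pred[:, 0:4]) ** 2)   # x, y, w, h terms
    conf_error = np.sum((gt[:, 4] - pred[:, 4]) ** 2)        # confidence term
    class_error = np.sum((gt[:, 5:] - pred[:, 5:]) ** 2)     # class probability term
    return coord_error + conf_error + class_error

gt = np.array([[0.5, 0.5, 0.2, 0.3, 1.0, 0.0, 1.0]])
pred = np.array([[0.45, 0.55, 0.25, 0.25, 0.8, 0.1, 0.9]])
print(detection_loss(gt, pred))
```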

Ground truth preparation for real ocean environment

After testing the CNN model on non-degraded, noise-free pictures, Suxia et al. [21] created their own underwater fish dataset. It was difficult for them to obtain pictures of other kinds of objects such as coral or sea turtles, so in this portion of their research the only underwater object to be detected is fish. They collected about 410 underwater pictures, many containing several fish in a single picture, which makes the detection quite challenging. A similar approach was used to create the collection of ground truth pictures.

4.1.2 Networks based on region-based convolutional neural network (RCNN)

The deep learning algorithms based on the CNN framework are employed in several fields. Presently, RCNN, Fast RCNN, Faster RCNN, and their more advanced versions are extensively used in different engineering research areas. Overall, CNN-based networks have advanced rapidly in the area of image recognition.

  1. Region-Based CNN (RCNN)

In 2014, Girshick presented the Region-based Convolutional Neural Network (RCNN), which combines a CNN model with region proposals [40]. Its object detection results on the PASCAL VOC2007 dataset reached about 66% mean Average Precision (mAP). Building on RCNN, He et al. in 2015 proffered the SPP-Net framework [44], which greatly improved the efficiency of object detection.

RCNN proposes a bunch of boxes in the underwater picture and examines whether any of them contains an underwater object. Rather than working on a huge number of regions, RCNN uses selective search to extract these boxes from an underwater picture, and the extracted boxes are known as regions. Four cues contribute to the formation of an object: textures, colors, enclosure, and varying scales. Selective search identifies these patterns in the underwater picture and proposes different regions based on them. A brief overview of how selective search is carried out for underwater images is given below:

  • Initially, it takes an underwater image. The underwater image given below in Fig. 14 has been taken from Turbid Dataset [5].

    Fig. 14 Underwater image [5]

  • In the second step, it produces initial sub-segmentations as shown in Fig. 15 such that several regions are obtained from this picture.

    Fig. 15 Initial sub-segments [104]

  • This method then joins the regions that are similar based on texture resemblance, color resemblance, shape resemblance, and size resemblance to form a bigger region as indicated in Fig. 16.

    Fig. 16 Combined similar regions [104]

  • Lastly, these regions produce the final object locations, called Regions of Interest.

The steps carried out in RCNN for object detection are given below:

  a. A pre-trained convolutional neural network is first taken.

  b. In the second step, the CNN framework is retrained; the last layer is trained according to the number of classes to be detected.

  c. In the third step, the Region of Interest (ROI) for each picture is obtained. Every region is then reshaped so that it matches the input size of the CNN.

  d. After all the regions are obtained, a Support Vector Machine (SVM) is trained to classify underwater objects and background; a single binary SVM is trained for each class.

  e. Finally, a linear regression model is trained to produce tighter bounding boxes for every identified underwater object in the picture.

A better understanding of the working of an RCNN network can be achieved by following a visual example.

Firstly, an underwater image is taken as an input image. ROIs are then obtained by employing some proposal approach like selective search. This is indicated in Fig. 17.

Fig. 17 An input underwater image and ROIs from the input underwater image, respectively [5]

All of these regions are then reshaped to match the CNN input, and every region is passed through the ConvNet. The CNN model then extracts features for every region, and these regions are classified into various classes using SVMs. Lastly, bounding boxes are predicted for every identified region using Bounding Box Regression (Bbox reg). This process is indicated in Fig. 18.

Fig. 18 Process flow of RCNN

Disadvantages of RCNN framework

Although RCNN can be used for underwater object detection, the framework has its shortcomings. Training an RCNN framework is a slow and costly process due to the steps mentioned below:

  • Extracting about 2000 picture regions for each picture based on selective search.

  • Extracting features with a CNN framework for each picture region. For N pictures, there will be N × 2000 CNN feature extractions.

  • The whole object detection process employing RCNN involves three models:

  1. A CNN to extract features.

  2. A classifier, namely a linear SVM, to identify objects in the image.

  3. A regression framework to tighten the bounding boxes.

All the aforementioned operations together result in the slow processing of RCNN. This framework consumes around 40 to 50 s to do predictions for every new picture, which makes the framework burdensome and not practical to build if given a massive dataset.

  2. Fast RCNN

Girshick in 2015 presented an approach known as Fast RCNN. This framework converts the problem of identifying an object into a regression problem [38], and the mean Average Precision (mAP) was improved by about 30% compared to the earlier best result of about 53.3% from the 2012 ImageNet Large Scale Visual Recognition Challenge. It involved massive calculation due to the extraction of different sized features from thousands of proposals in every picture.

To reduce the computational time of RCNN, a CNN model is made to run just once per picture instead of making it run 2000 times per picture. All the ROIs (portions in an image that contain some underwater objects) are then obtained.

Ross Girshick, the author of the RCNN framework, presented the idea of running the CNN model just once per picture and then finding a way to share that computation across the 2000 regions. In the Fast RCNN framework, the input picture is fed to the CNN model, which in turn produces the convolutional feature maps used to extract region proposals. An ROI pooling layer is then used to reshape each proposed region into a fixed size so it can be fed into a fully connected network (FCN).

As described above, a picture is taken as input and given to a CNN framework, which outputs the Regions of Interest; a layer of ROI pooling is applied on each of these regions to reshape them according to the CNN input. After these steps, every region is given to a fully connected network (FCN). A SoftMax layer is employed on top of the FCN to generate classes, and a linear regression layer is employed alongside the SoftMax layer to generate bounding box coordinates for the predicted classes.

Fast RCNN employs only one model instead of employing three different frameworks as in RCNN. The single model used in Fast RCNN performs feature extraction from the regions, splits them into various classes, and simultaneously generates the bounding boxes for the classes that are identified.

A Fast RCNN model takes the whole image and a set of object proposals as input. Similar to the Region Proposal Network (RPN), the model initially processes the entire image using the VGG16 network, which is the base convolutional network, and generates 512-dimensional feature maps. After that, every object proposal is mapped to an ROI on the feature maps. Then an ROI pooling layer uses max-pooling to convert the ROI features into a fixed 7 × 7 spatial extent. The resulting 7 × 7 feature vector is given to a series of fully connected layers, which finally branches into a pair of sibling output layers, namely a SoftMax layer and a bounding box regression layer. The SoftMax layer gives the estimated SoftMax probabilities over a background class and K object classes, and the bounding box regression layer gives the bounding box coordinates for the K object classes. For bounding box regression and classification training, the multi-task loss L can be written as follows:

$$ L(p, u, t^u, v) = L_{\mathrm{cls}}(p, u) + \lambda \left[ u \ge 1 \right] L_{\mathrm{loc}}(t^u, v) $$
(10)

where p = (p0, p1, …, pK) represents a discrete probability distribution over K + 1 outputs, tu represents the predicted bounding box coordinates, u denotes the true object class number, v denotes the ground truth bounding box, Lcls represents the log loss for the true class u, Lloc represents the bounding box regression loss, and λ denotes the hyperparameter that governs the balance between the two losses. This is the idea of joint classification and bounding box regression.
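A hedged PyTorch sketch of the multi-task loss of Eq. (10), assuming the predicted box offsets have already been selected for the true class of each proposal; cross-entropy stands in for the log loss and smooth L1 for the localization loss.

```python
import torch
import torch.nn.functional as F

def multi_task_loss(class_logits, bbox_pred, u, v, lam=1.0):
    """Multi-task loss in the spirit of Eq. (10).

    class_logits : (N, K + 1) scores over background + K classes
    bbox_pred    : (N, 4) predicted offsets t^u for the true class
    u            : (N,) true class indices (0 = background)
    v            : (N, 4) ground truth box regression targets
    """
    l_cls = F.cross_entropy(class_logits, u)     # log loss over K + 1 classes
    fg = (u >= 1).float().unsqueeze(1)           # Iverson bracket [u >= 1]
    l_loc = (F.smooth_l1_loss(bbox_pred, v, reduction="none") * fg).sum()
    l_loc = l_loc / fg.sum().clamp(min=1)        # average over foreground proposals
    return l_cls + lam * l_loc
```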

A detailed understanding of the working of Fast RCNN can be obtained from the visual example which is explained with the help of Fig. 19.

Fig. 19 Process flow of fast RCNN

Firstly, an underwater image is taken as input by the Fast RCNN model. This image is given to the CNN model, which in turn generates Regions of Interest. These ROIs are then given to the ROI pooling layer to ensure that all regions are of equal size. Lastly, these regions are given to a fully connected network for classification and for returning bounding boxes, employing SoftMax and linear regression layers at the same time. In this way, Fast RCNN solves two big issues: the whole picture is passed to the CNN only once instead of about 2000 regions per picture, and a single model is employed instead of three different models to extract features, perform classification, and generate bounding boxes.

Disadvantages of fast RCNN

Fast RCNN also employs selective search to find the ROIs, which consumes a lot of time and hence makes the network slow. Fast RCNN takes almost 2 s per picture to detect objects. This is better than the RCNN model; however, for massive real-life datasets, even Fast RCNN is not fast enough.

There is yet another algorithm for object detection that is more efficient than Fast RCNN and it is popularly known as Faster RCNN.

  3. Faster RCNN

Faster RCNN has recently achieved great success in object detection. Ren et al. [95] presented the Faster Region-Based Convolutional Neural Network, which utilizes a Region Proposal Network (RPN) to generate region proposals and then uses a CNN framework for bounding box regression and classification. Compared to conventional approaches, algorithms based on deep learning are more robust to changing environments such as variations in illumination, perspective distortion, and motion blur. Among the latest deep learning algorithms, Faster RCNN has performed excellently in many areas. Sa et al. [100] employed the Faster RCNN model with transfer learning to perform object detection with better performance. Hoang et al. [49] presented a Faster RCNN technique to automatically detect whether a driver is using a mobile phone or holding the steering wheel. Zhang et al. [135] proffered a pedestrian detection method employing an RPN followed by boosted forests on high-resolution, shared convolutional feature maps. Hai Huang et al. [55] in 2019 proposed an approach that uses the Faster RCNN framework to assess various data augmentation approaches for subaquatic organisms.

The Faster RCNN framework has two subnetworks, namely Fast RCNN and the Region Proposal Network (RPN) [38]. Fast RCNN and RPN share the same input feature maps, which are extracted using the base convolutional network. Hai Huang et al. [55] employed a pre-trained VGG16 framework [106] as the base network. VGG16 is pre-trained on the ImageNet dataset, RPN is employed to produce proposals, and Fast RCNN is employed for the classification of these proposals.

Region proposal network (RPN)

The RPN is utilized to produce proposals from the input feature maps. In [55], 512-dimensional feature maps are extracted by VGG16, the base convolutional network, which takes the whole image as input. The input feature maps are processed by a 3 × 3 spatial sliding window, and each sliding window is mapped to a 512-dimensional feature vector. These feature vectors are given as input to two sibling fully connected layers, a box regression layer and a box classification layer. The concept of anchors was proposed in [95]. An anchor is centered at the 3 × 3 sliding window, and every sliding window has nine anchors, combining three aspect ratios [1:1, 1:2, 2:1] and three scales [128², 256², 512²]. The box classification layer examines whether an anchor is positive or not, and the box regression layer generates the bounding box coordinates. To train the RPN framework, the loss function is given as follows:

$$ \mathrm{L}=\frac{1}{{\mathrm{N}}_{\mathrm{cls}}}\sum {\mathrm{L}}_{\mathrm{cls}}\left(\mathrm{p},{\mathrm{p}}^{\ast}\right)+\uplambda \frac{1}{{\mathrm{N}}_{\mathrm{reg}}}\sum {\mathrm{p}}^{\ast }{\mathrm{L}}_{\mathrm{reg}}\left(\mathrm{t},{\mathrm{t}}^{\ast}\right) $$
(11)

where p denotes the probability that a proposal is an object, p∗ represents the proposal's true label (if a proposal is an object, then p∗ = 1; otherwise, p∗ = 0), and t and t∗ denote the predicted and ground-truth bounding box coordinates, respectively. Nreg and Ncls denote the two normalization parameters, and Lcls represents the classification loss, i.e., the log loss over two classes (object versus not object). Lreg denotes the regression loss; the product p∗Lreg implies that the regression loss is turned on only for the anchors that are positive.
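A small NumPy sketch of the multi-task loss in Eq. 11 is given below, assuming binary log loss for classification and a smooth-L1 regression term gated by p∗; the variable names follow the equation, and the default λ, Ncls, and Nreg values are only illustrative.

```python
import numpy as np

def smooth_l1(t, t_star):
    # Elementwise smooth-L1, summed over the four box coordinates
    d = np.abs(t - t_star)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum(axis=1)

def rpn_loss(p, p_star, t, t_star, lam=10.0, n_cls=256, n_reg=2400):
    # p: predicted objectness probabilities; p_star: anchor labels (1 positive, 0 negative)
    # t, t_star: predicted and ground-truth box offsets, shape (N, 4)
    l_cls = -(p_star * np.log(p + 1e-9) + (1 - p_star) * np.log(1 - p + 1e-9))
    l_reg = p_star * smooth_l1(t, t_star)   # regression counts only for positive anchors
    return l_cls.sum() / n_cls + lam * l_reg.sum() / n_reg

p = np.array([0.9, 0.2]); p_star = np.array([1.0, 0.0])
t = np.zeros((2, 4)); t_star = np.array([[0.1, -0.2, 0.05, 0.0], [0.0, 0.0, 0.0, 0.0]])
print(rpn_loss(p, p_star, t, t_star))
```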

$$ {\mathrm{t}}_{\mathrm{x}}=\frac{\left(\mathrm{x}-{\mathrm{x}}_{\mathrm{a}}\right)}{{\mathrm{w}}_{\mathrm{a}}}\kern1.50em {\mathrm{t}}_{\mathrm{y}}=\frac{\left(\mathrm{y}-{\mathrm{y}}_{\mathrm{a}}\right)}{{\mathrm{h}}_{\mathrm{a}}} $$
(12)
$$ {\mathrm{t}}_{\mathrm{w}}=\log \left(\frac{\mathrm{w}}{{\mathrm{w}}_{\mathrm{a}}}\right)\kern1.25em {\mathrm{t}}_{\mathrm{h}}=\log \left(\frac{\mathrm{h}}{{\mathrm{h}}_{\mathrm{a}}}\right) $$
(13)
$$ {\mathrm{t}}_{\mathrm{x}}^{\ast }=\left(\frac{{\mathrm{x}}^{\ast }-{\mathrm{x}}_{\mathrm{a}}}{{\mathrm{w}}_{\mathrm{a}}}\right)\kern0.75em {\mathrm{t}}_{\mathrm{y}}^{\ast }=\left(\frac{{\mathrm{y}}^{\ast }-{\mathrm{y}}_{\mathrm{a}}}{{\mathrm{h}}_{\mathrm{a}}}\right) $$
(14)
$$ {\mathrm{t}}_{\mathrm{w}}^{\ast }=\log \left(\frac{{\mathrm{w}}^{\ast }}{{\mathrm{w}}_{\mathrm{a}}}\right)\kern3.25em {\mathrm{t}}_{\mathrm{h}}^{\ast }=\log \left(\frac{{\mathrm{h}}^{\ast }}{{\mathrm{h}}_{\mathrm{a}}}\right) $$
(15)

The parameterization of the four coordinates for bounding box regression is given in Eqs. 12–15. Here, x and y represent the box center coordinates, and h and w denote the height and width of the bounding box, respectively; x, xa, and x∗ are for the predicted bounding box, anchor box, and ground-truth bounding box, respectively (and similarly for y, w, and h). A Fast RCNN model [38] is employed to classify the object proposals generated by the RPN.
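Below is a NumPy sketch of the coordinate parameterization in Eqs. 12–15; boxes are written as (center x, center y, width, height), and the same function can be applied to the predicted box or the ground-truth box against its anchor. The box values are hypothetical.

```python
import numpy as np

def encode(box, anchor):
    # Transform an absolute box into offsets relative to its anchor (Eqs. 12-15)
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    tx = (x - xa) / wa
    ty = (y - ya) / ha
    tw = np.log(w / wa)
    th = np.log(h / ha)
    return np.array([tx, ty, tw, th])

anchor = (50.0, 60.0, 128.0, 128.0)        # hypothetical anchor box
predicted = (55.0, 58.0, 140.0, 120.0)     # hypothetical predicted box
print(encode(predicted, anchor))           # t = (tx, ty, tw, th)
```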

Detailed working of Faster RCNN

The Faster RCNN model is the advanced version of the Fast RCNN model. The main difference between the two networks is that the Fast RCNN network employs selective search for producing ROIs, whereas the Faster RCNN network employs a Region Proposal Network (RPN). The RPN takes image feature maps as input and produces a set of object proposals, each with an objectness score, as output.

The steps given below are followed in the Faster RCNN model:

  1. The network first takes a picture as an input and passes it to the CNN model. The CNN returns the picture's feature map.

  2. These feature maps are given to the RPN, which returns the object proposals along with their objectness scores.

  3. These proposals are given to an ROI pooling layer so that all the object proposals have the same size.

  4. Lastly, a fully connected layer with a SoftMax layer and a linear regression layer on top of it is applied to the proposals to classify them and generate the objects' bounding boxes.

The flow of the whole process of Faster RCNN is indicated in Fig. 20 given below.

Fig. 20 Process flow of Faster RCNN

Figure 21 given below shows how an RPN works. The Faster RCNN model takes the feature maps from the CNN and gives them to the RPN, i.e., the Region Proposal Network. The RPN then slides a window over these feature maps and, at every window position, produces k anchor boxes of varied sizes and shapes.

Fig. 21 Working of RPN

Anchor boxes are bounding boxes of fixed, predefined sizes and shapes that are positioned in the image. Two things are predicted by the RPN for every anchor box (a small sketch of anchor generation is given after this list):

  • Firstly, it predicts the probability of an anchor being an object; it does not predict the class of the object.

  • Secondly, a bounding box regressor adjusts the anchor box so that it better fits the predicted object.
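The NumPy sketch below illustrates the anchor generation described above: at every sliding-window position, nine anchors are formed from three scales and three aspect ratios. The scales and ratios follow the text; the center position is illustrative.

```python
import numpy as np

scales = [128, 256, 512]
ratios = [(1, 1), (1, 2), (2, 1)]

def anchors_at(cx, cy):
    # Return the nine (x1, y1, x2, y2) anchors centered at one sliding-window position
    boxes = []
    for s in scales:
        area = s * s
        for rw, rh in ratios:
            w = np.sqrt(area * rw / rh)   # keep the anchor area close to scale^2
            h = area / w
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

print(anchors_at(300, 200).shape)   # (9, 4): nine anchors per location
```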

After the bounding boxes of various sizes and shapes are defined, the ROI pooling layer is applied to them. At this stage, the object proposals have no class assigned to them. The basic function of the ROI pooling layer is to crop every proposal so that each of them contains an object; it extracts a fixed-size feature map for every proposal. These feature maps are then given to a fully connected layer with a SoftMax layer and a linear regression layer. Lastly, these layers classify the objects and predict the bounding boxes for the identified objects.

Disadvantages of faster RCNN

The aforementioned object detection techniques used for underwater object detection utilize regions for the identification of objects. These algorithms do not see the whole picture in one go; rather, they look at portions of the picture sequentially. This gives rise to two problems:

  • The technique needs several passes through one image for all objects to be extracted.

  • Since various systems operate back-to-back in these algorithms, the performance of the later stages depends on the performance of the previous ones.

4.1.3 Networks based on YOLO (You Only Look Once) framework

The algorithms based on RCNN primarily employ regions for object localization within the picture. These frameworks do not look at the whole picture, rather look only at the portions of the pictures that have a greater possibility of having an object. On the other hand, the YOLO model uses the complete image in one instance and performs the prediction of the boundary box coordinates and for these bounding boxes, it also predicts the class probabilities. The major benefit of employing the YOLO framework is its extremely high speed.

Sung et al. [109] in 2017 presented an underwater fish detection and classification technique employing the YOLO model. They trained the YOLO framework on some 829 pictures and acquired a classification precision of about 93% on 100 pictures of fish species. Xu and Matzner [126] in 2018 applied the YOLO model for fish detection on three varied datasets and obtained a mean average precision of 53.92%. Liu et al. [77] in 2018 presented YOLO along with a parallel correlation filter for detection and tracking of subaquatic fish; they reported simulation results on two datasets of undersea fish and demonstrated the effectiveness of the fish tracker. The YOLO framework has four popular versions, i.e., YOLO-v1, YOLO-v2, YOLO-v3, and YOLO-v4, and each version is a gradual improvement over the earlier one.

  1. YOLO-V1

This algorithm is extremely fast and can operate at 45 frames/s [94]. The YOLO framework can also learn a generalized representation of objects in an image. Therefore, this algorithm is quite popular nowadays and shows comparatively strong performance. The steps followed by the YOLO model to detect an object in an image are briefly discussed below:

  1. The algorithm first takes an input image.

  2. It then divides the input image into grid cells, say 3 × 3, as shown in Fig. 22 below.

  3. Then it applies image classification and localization to each grid cell. YOLO predicts the bounding boxes and the associated class probabilities for the objects.

Fig. 22 YOLO object detection

Labeled data is passed to the YOLO model to perform its training. For example, if the underwater image is divided into a 3 × 3 grid and only three classes are defined into which the underwater objects need to be categorized, then for each grid cell the y label is an 8-dimensional vector, as shown in Fig. 22.

Here, pc indicates the probability that an object is present in the grid cell; bx, by, bh, bw specify the bounding box (only if an object is present in the grid cell); and c1, c2, c3 define the object classes.
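A NumPy sketch of this 8-dimensional label vector for one grid cell is given below, assuming three object classes; the box values are illustrative and are expressed relative to the grid cell.

```python
import numpy as np

def cell_label(has_object, box=None, class_id=None, num_classes=3):
    # Build the per-cell target vector [pc, bx, by, bh, bw, c1, c2, c3]
    y = np.zeros(1 + 4 + num_classes)
    if has_object:
        y[0] = 1.0                   # pc: an object is present in this cell
        y[1:5] = box                 # bx, by, bh, bw
        y[5 + class_id] = 1.0        # one-hot class indicator
    return y

print(cell_label(True, box=[0.4, 0.6, 0.3, 0.2], class_id=1))   # e.g. class c2
print(cell_label(False))                                        # empty cell: all zeros
```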

In total, YOLO-v1 consists of 26 layers: 24 convolutional layers followed by two fully connected layers. The major disadvantage of YOLO-v1 is that it is not able to detect tiny objects. This version reframes object detection as a single regression problem, mapping image pixels directly to bounding box coordinates and class probabilities. The framework is quite simple and trains on entire images. The model has many advantages over conventional approaches for object detection: the separate object detection components are unified into one neural network, features from the whole image are used to predict each bounding box, and all bounding boxes across all classes are estimated simultaneously for an image. The YOLO model offers end-to-end training at real-time speed while ensuring high mean precision [94]. The equation below indicates the total number of detections per image.

$$ \mathrm{Total}\ \mathrm{detections}\ \mathrm{to}\ \mathrm{be}\ \mathrm{done}\ \mathrm{per}\ \mathrm{image}=\mathrm{s}\times \mathrm{s}\left(\left(\mathrm{B}\ast 5\right)+\mathrm{C}\right) $$
(16)

Here, S × S represents the total number of grid divisions YOLO makes in an image, B denotes the number of bounding boxes predicted per grid cell, and for every box 5 elements are calculated, namely, the coordinates of the detected object's center (x, y), height, width, and confidence score. C represents the conditional class probabilities.
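A one-line check of Eq. 16 is given below; the values S = 3, B = 2, and C = 3 are illustrative (a 3 × 3 grid, two boxes per cell, three underwater classes).

```python
S, B, C = 3, 2, 3
total_values_per_image = S * S * (B * 5 + C)
print(total_values_per_image)   # 3 * 3 * (2*5 + 3) = 117
```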

  2. YOLO-V2 / YOLO9000

The first version of YOLO has a number of disadvantages compared to state-of-the-art object detection models [92]. Compared to Fast RCNN, error analysis of YOLO-v1 indicates that it makes a substantial number of localization errors. Besides, this model has comparatively low recall compared to methods based on region proposals. Therefore, this version of YOLO [92] focuses mainly on improving localization and recall while also maintaining classification precision. Generally, computer vision advances towards deeper and bigger networks [46, 106, 112], and better results usually come from training bigger networks or combining various models. YOLO-v2, however, offers more precise detection that is also fast: instead of scaling up the network, the network is simplified and the representations are made easier to learn. Various ideas from earlier work have been implemented to improve the performance of the YOLO model.

Batch normalization

Batch normalization results in substantial convergence improvements while reducing the need for other forms of regularization [56]. Batch normalization is used on all convolutional layers in YOLO, which provides an improvement of more than 2% in mAP. Batch normalization also regularizes the framework, so dropout can be removed from the framework without overfitting.

High-resolution classifier

Many state-of-the-art object detection approaches employ classifiers that are pre-trained on ImageNet [99]. Since AlexNet, most classifiers process input pictures smaller than 256 × 256 [61]. The first version of YOLO trains the classifier network at an image size of 224 × 224 and then increases the resolution to 448 for object detection. This implies the framework has to switch to learning detection and adjust to the new input resolution at the same time. In YOLO-v2, the classification framework is first fine-tuned at a resolution of 448 × 448 for about 10 epochs on the ImageNet dataset. This gives the framework time to adjust its filters to perform better on higher-resolution input. This high-resolution classification network gives an increase of about 4% in mAP.

Dimension clusters

Two problems are experienced when anchor boxes are employed with YOLO-v1. The first issue is that the dimensions of the anchor boxes are hand-picked. The framework learns to adjust the anchor boxes properly, but when better priors are picked to begin with, it becomes easier for the framework to learn to produce good detections. Instead of selecting priors manually, k-means clustering is run on the bounding boxes of the training set to automatically find better priors. If standard k-means with Euclidean distance is employed, bigger boxes produce more error than smaller boxes; however, priors are wanted that give good Intersection over Union (IOU) scores independent of box size, so an IOU-based distance is used instead (a small sketch of this is given after the list below). A brief idea of the architecture of YOLO-v2 is given below in steps:

  • After every convolutional layer, there is the addition of Batch Normalization layers.

  • It consists of 30 layers in contrast to the YOLO-v1 model which has 26 layers.

  • Anchor Boxes are introduced.

  • There is no fully connected layer in the model.

  • It is still weak at detecting small objects.
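The NumPy sketch below illustrates the dimension-cluster idea mentioned above: k-means on box widths and heights using d(box, centroid) = 1 − IOU(box, centroid) rather than Euclidean distance. The boxes here are random stand-ins for a real training set, and k = 5 is only an example.

```python
import numpy as np

def iou_wh(boxes, centroids):
    # IOU between (w, h) pairs, assuming boxes share the same top-left corner
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def dimension_clusters(boxes, k=5, iters=50):
    centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = boxes[assign == j].mean(axis=0)
    return centroids

boxes = np.random.uniform(10, 300, size=(500, 2))   # (width, height) of training boxes
print(dimension_clusters(boxes, k=5))               # learned box priors
```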

Some updates to previous versions of YOLO are introduced through design modifications. The modified architecture is a little larger than earlier YOLO versions but is more accurate [93].

A new classification model, Darknet-19, is used as the base model of YOLO-v2. Like the VGG network, it mostly employs 3 × 3 filters and doubles the number of channels after each pooling step [106]. The final framework, Darknet-19, consists of 19 convolutional and 5 max-pooling layers; a complete description is given in Table 2. Darknet-19 needs only 5.58 billion operations to process a picture, yet achieves 72.9% top-1 accuracy and 91.2% top-5 accuracy on ImageNet.

Table 2 Darknet-19
  3. YOLO-V3

YOLO-v3 is faster than the previous versions. Using 320 × 320 images, this version takes about 22 ms per image at 28.2 mAP, which is as accurate as the Single Shot Detector (SSD) but three times faster [93]. Similar to YOLO-v2 (YOLO9000), this version also predicts bounding boxes using dimension clusters as anchor boxes [112]. YOLO-v3 predicts an objectness score for every bounding box using logistic regression. The objectness score should be 1 when the bounding box prior best overlaps a ground-truth object; when a prior is not the best but overlaps a ground-truth object by more than some threshold, the prediction is ignored [95]. Every box predicts the classes contained by the bounding box using multi-label classification. A SoftMax layer is not used, as it is not necessary for good performance; instead, independent logistic classifiers are used. During training, binary cross-entropy loss is employed for the class predictions. This formulation helps in more complex domains such as the Open Images Dataset [60], which has many overlapping labels, e.g., person and woman. Employing a SoftMax layer assumes every box contains exactly one class, which is often not the case; hence, a multi-label method models the data better.
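The NumPy sketch below contrasts the SoftMax assumption (exactly one class per box) with the independent logistic classifiers used for multi-label prediction, together with the binary cross-entropy training loss; the logits and labels are illustrative.

```python
import numpy as np

logits = np.array([2.0, 1.5, -1.0])   # raw class scores for one predicted box
labels = np.array([1.0, 1.0, 0.0])    # overlapping labels, e.g. "person" and "woman"

softmax = np.exp(logits) / np.exp(logits).sum()   # forces the classes to compete
sigmoid = 1.0 / (1.0 + np.exp(-logits))           # each class decided independently

# Binary cross-entropy summed over the independent classifiers (training loss)
bce = -(labels * np.log(sigmoid) + (1 - labels) * np.log(1 - sigmoid)).sum()

print(softmax.round(3))   # sums to 1, cannot mark two classes as present
print(sigmoid.round(3))   # both overlapping labels can exceed 0.5
print(round(bce, 3))
```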

YOLO-v3 predicts boxes at three different scales. The system extracts features from those scales using a concept similar to feature pyramid networks [71]. Several convolutional layers are added, and the last of these layers predicts a 3-dimensional tensor encoding the bounding box, objectness, and class predictions. K-means clustering is still employed to determine the bounding box priors. The YOLO-v3 model is a hybrid of the backbone employed in YOLO-v2 (Darknet-19) and residual network ideas. It uses successive 3 × 3 and 1 × 1 convolutional layers but also includes shortcut connections and is substantially bigger. It consists of 53 convolutional layers; therefore, it is called Darknet-53. Figure 23 shows the Darknet-53 architecture.

Fig. 23 Darknet-53

  4. YOLO-V4

The main aim of YOLO-v4 is fast operation of the neural network and optimization for parallel computation. YOLO-v4 offers two real-time neural network options [11]:

  • For GPU (Graphics Processing Unit), a small number of groups (1 to 8) in Conv. layers are employed: CSPResNeXt50 / CSPDarknet53

  • For VPU (Vision Processing Unit), grouped convolution is used, but YOLO-v4 refrains from employing Squeeze-and-Excitation (SE) blocks; in particular this includes the following frameworks: EfficientNet-lite / MixNet [115] / GhostNet [43] / MobileNetV3

The objective of this version of YOLO is to find the optimal balance among the input network resolution, the number of convolutional layers, the number of parameters, and the number of layer outputs (filters). For example, an extensive study reveals that CSPResNeXt50 is considerably better than CSPDarknet53 for object classification on the ILSVRC2012 (ImageNet) dataset [24]. Conversely, the CSPDarknet53 framework is better than CSPResNeXt50 for object detection on the MS COCO dataset [69]. The other objective is to select additional blocks for increasing the receptive field and the best technique of parameter aggregation from different backbone levels for different detector levels: for example, FPN (Feature Pyramid Network), ASFF (Adaptively Spatial Feature Fusion), or BiFPN (Bi-directional Feature Pyramid Network).

In YOLO-v4, a Spatial Pyramid Pooling (SPP) layer is added on top of the CSPDarknet53 network, since it considerably enlarges the receptive field and extracts the most significant context features with almost no reduction in network operating speed [11]. PANet is used as the technique of parameter aggregation from different backbone levels for different detector levels, instead of the FPN used in YOLO-v3. Finally, the YOLO-v4 architecture comprises CSPDarknet53 as the backbone, SPP as an additional module, PANet as the path-aggregation neck, and the YOLO-v3 anchor-based head. Cross-GPU Batch Normalization (CGBN) and costly specialized devices are not used, which allows anyone to reproduce the state-of-the-art results on a conventional graphics processor such as an RTX 2080Ti or GTX 1080Ti. Figure 24 below gives an overview of the YOLO-v4 architecture.

Fig. 24 YOLO-v4 architecture
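A PyTorch sketch of an SPP block of the kind described above is given below: parallel max-pooling with different kernel sizes (commonly 5, 9, and 13), padded so the spatial size is preserved, then concatenated along the channel axis. The channel counts and input size are illustrative, not the exact YOLO-v4 configuration.

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        # Original features plus pooled features with growing receptive fields
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

features = torch.randn(1, 512, 19, 19)   # hypothetical backbone output
print(SPPBlock()(features).shape)        # torch.Size([1, 2048, 19, 19])
```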

Limitations of YOLO

The YOLO model imposes strong spatial constraints on bounding box predictions, as every grid cell predicts only two boxes and can have only a single class. This spatial constraint limits the number of nearby objects the YOLO framework can predict, and the model struggles with small objects that appear in groups, such as flocks of birds. Since the model learns to predict bounding boxes from data, it struggles to generalize to objects in novel or unusual configurations or aspect ratios. The YOLO framework also uses relatively coarse features to predict bounding boxes, since the architecture has many layers of down-sampling from the input picture. Lastly, because the YOLO model is trained on a loss function that approximates detection performance, the loss treats errors equally in small and large bounding boxes. A small error in a big box is generally not serious, but a small error in a smaller box has a much greater influence on the Intersection over Union (IOU). The major source of error is inaccurate localization.

4.2 Some other underwater object recognition algorithms

Krizhevsky et al. in 2012 [61] proposed the AlexNet model, which consists of 5 convolutional layers, stacked with nonlinear and/or pooling layers, followed by 3 fully connected layers. Employing creative methods, including data augmentation (DA) [46, 106], dropout [9], local response normalization (LRN) and overlapping pooling [61], and rectified linear units (ReLUs) [85, 124], the AlexNet algorithm won the competition with a test error of 15.4%, outstripping the runner-up, which had a test error of 26.2%. ZFNet, a variant of AlexNet, was presented by Zeiler and Fergus in 2014 [133]. The major differences between AlexNet and ZFNet are in the first- and second-layer visualizations: (a) the filter size of the first layer was reduced from 11 × 11 to 7 × 7, and (b) the convolutional stride was reduced from 4 to 2, thereby retaining more information in these two layers. The visualization process was facilitated by a deconvolution operation using a deconvolution layer [134]. Alberto et al. in 2017 [4] discussed the Visual Geometry Group Network (VGGNet), a type of CNN model built by the Visual Geometry Group (VGG) at Oxford University. Among the different deep CNN variants [106], the typical frameworks are VGG-16 and VGG-19. GoogLeNet is another framework, proffered by Szegedy et al. in 2015 [111], based on the idea of Network in Network (NiN), in which about 22 layers with learned parameters are employed in the inception framework. The GoogLeNet framework aims to decrease the number of feature filters required at every layer, thereby diminishing computational complexity. He et al. in 2015 [45] presented the Residual Neural Network (ResNet) with a depth of 152 layers, which won the ILSVRC-2015 competition with a test error of about 3.57%. In this framework, residual blocks take advantage of residual learning, and the block output comprises the original input X added to the output F(X) of the stacked layers.
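A PyTorch sketch of the residual-learning idea is given below: the block output is the input X added to the learned residual F(X). The layer widths are illustrative, not the exact ResNet-152 configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + x)   # output = F(X) + X

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock()(x).shape)           # torch.Size([1, 64, 56, 56])
```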

Unsupervised models based on deep learning include Deep Belief Networks (DBNs) and different stacked variants of the Auto Encoder (AE), which include the stacked sparse AE (SAE) [80], denoising AE (DAE) [120], and contractive AE (CAE) [96]. Generally, in unsupervised learning algorithms, parameters are optimized via a greedy, layer-wise training approach divided into two phases, namely, pre-training and fine-tuning.

Besides, Chen et al. [15] in 2020 proposed the Sample-Weighted Hyper Network (SWIPENet). The architecture includes several semantically rich, high-resolution hyper feature maps inspired by the Deconvolutional Single Shot Detector (DSSD) [34]. DSSD augments SSD, a fast down-sampling detection network, with several up-sampling deconvolution layers to enhance the feature map resolution. The DSSD network first constructs several down-sampling convolutional layers to extract semantically rich feature maps that are beneficial for classifying underwater objects. After many down-sampling operations, the extracted feature maps are too coarse to provide adequate information for accurately locating small objects; thus, several up-sampling deconvolution layers with skip connections are employed to restore high-resolution feature maps. Nevertheless, the details lost during down-sampling are not completely restored even though the resolution is recovered. To improve on the DSSD algorithm, Chen et al. [15] employ dilated convolutional layers [12, 130] to acquire strong semantics without losing the fine information that is important for object localization.

Figure 25 below presents the outline of the SWIPENet algorithm. It has several convolutional, dilated convolutional, and deconvolutional blocks, and a new sample-weighted loss. The initial layers of this network are based on the structure of the VGG16 network. Unlike DSSD, four dilated convolutional layers with ReLU activations are added to the model. This modification can acquire large receptive fields without sacrificing feature map resolution (large receptive fields result in stronger semantics). Furthermore, feature maps are up-sampled using deconvolution layers, and skip connections pass the fine details present in the low layers to the high layers. Lastly, several hyper feature maps are constructed by the deconvolution layers. The prediction of the proposed SWIPENet algorithm uses three different deconvolution layers, Deconv1 2, Deconv2 2, and Deconv3 2 (represented as Deconvx 2 in Fig. 25), whose sizes progressively increase to predict objects at several scales. At every location of these three deconvolution layers, six default boxes are defined, and a 3 × 3 convolutional kernel is employed to generate C + 1 class scores, where C denotes the total number of object classes and 1 denotes the background class, along with four coordinate offsets relative to the original shape of the default box.

Fig. 25 Overview of SWIPENet
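The PyTorch sketch below illustrates the dilated-convolution idea used in SWIPENet: with dilation, a 3 × 3 kernel covers a larger receptive field without reducing the feature-map resolution. Channel counts, input size, and the dilation rate are illustrative.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 38, 38)   # hypothetical feature map from the VGG16-style layers

standard = nn.Conv2d(512, 512, kernel_size=3, padding=1)              # receptive field 3
dilated = nn.Conv2d(512, 512, kernel_size=3, padding=2, dilation=2)   # receptive field 5

print(standard(x).shape)   # torch.Size([1, 512, 38, 38])
print(dilated(x).shape)    # torch.Size([1, 512, 38, 38]) - same resolution, larger context
```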

5 Experimental results

YOLO-v1 is compared with some other real-time object detection algorithms on PASCAL VOC 2007. To understand the differences between the RCNN variants and YOLO-v1, the errors on VOC 2007 made by YOLO-v1 and Fast RCNN (one of the highest performing versions of RCNN [63]) are explored. Given their different error profiles, it is inferred that YOLO-v1 can be employed to rescore the detections made by Fast RCNN and reduce the errors from background false positives, providing a substantial boost to performance. YOLO-v1 is compared to Fast RCNN, as indicated in Fig. 26, since Fast RCNN is among the best performing object detectors on PASCAL and its detections are publicly available. The methodology and tools of Hoiem et al. [50] have been used: for every category, the top N predictions at test time are examined, and each prediction is either correct or is categorized based on the type of error:

  • Correct which implies correct class with Intersection Over Union (IOU) > 0.5

  • Localization implying correct class with 0.1 < IOU < 0.5

  • Similar implies class is similar with IOU > 0.1

  • Other implies class is wrong having IOU > 0.1

  • Background implies IOU < 0.1 for any object

Fig. 26 Error analysis: a Fast R-CNN vs. b YOLO. These pie charts show the percentage of localization and background errors in the top N detections for various categories (N = # objects in that category)

Comparing the RCNN-based algorithms and the YOLO model on PASCAL VOC 2007, it can be clearly seen that the RCNN model does not offer real-time processing and also takes a substantial precision hit from not having good proposals. Fast RCNN speeds up the classification stage of RCNN; however, it still depends on selective search, which can take about 2 s per picture to produce bounding box proposals. Therefore, it achieves high Mean Average Precision (mAP) but at about 0.5 Frames Per Second (FPS), which is still quite far from real-time operational speed. The newer Faster RCNN replaces selective search with a neural network that proposes bounding boxes, similar to Erhan et al. [32]. The VGG-16 version of Faster RCNN has higher mAP but is 6 times slower than the YOLO model. YOLO-v2 has also been trained for object detection on VOC 2007.

Figure 27 shown below highlights the comparison of the performance of some detection frameworks on PASCAL VOC 2007. It shows results for mAP and FPS of YOLO-V2 and some other state-of-the-art object detection algorithms. Version 2 of the YOLO framework achieves about 73.4 mAP with a speed far greater than other competing approaches.

Fig. 27 Comparing speed and mean average precision of the RCNN, Fast RCNN, Faster RCNN, YOLO-v1, and YOLO-v2 algorithms on PASCAL VOC 2007

Also, to compare the performance of YOLO-v3 and YOLO-v4, the average precision and Frames Per Second (FPS) of the two algorithms are compared on the COCO dataset, as shown in Fig. 28 [11].

Fig. 28 Comparison of YOLO-v3 and YOLO-v4

For YOLO-v4, in the experiments carried out on the COCO dataset for object detection, the default hyperparameters were set as follows: a step decay learning rate schedule is employed with an initial learning rate of 0.01, multiplied by 0.1 at the 400,000th and 450,000th steps, respectively; the total number of training steps is 500,500. The graph in Fig. 28 indicates that YOLO-v4 improves the Average Precision (AP) and Frames Per Second (FPS) of YOLO-v3 by about 10% and 12%, respectively.
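A small sketch of the step-decay schedule described above is given below: the learning rate starts at 0.01 and is multiplied by 0.1 at each of the listed decay steps.

```python
def learning_rate(step, base_lr=0.01, decay_steps=(400_000, 450_000), factor=0.1):
    # Step-decay: multiply the learning rate by `factor` at each decay step reached
    lr = base_lr
    for s in decay_steps:
        if step >= s:
            lr *= factor
    return lr

print(learning_rate(100_000), learning_rate(420_000), learning_rate(480_000))
# 0.01 0.001 0.0001
```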

In object detection, Mean Average Precision, Precision, and Recall are usually used to assess accuracy; their definitions are given below.

  i) Mean Average Precision (mAP): It compares the detected bounding boxes with the ground-truth bounding boxes and outputs a score. The higher the score, the more accurate the model's detections.

  ii) Precision: It is defined as the ratio of the number of correctly classified positive samples to the total number of samples classified as positive (correctly or incorrectly). This parameter gauges the model's accuracy in classifying a sample as positive.

$$ \mathrm{Precision}=\frac{\mathrm{True}\ \mathrm{Positive}}{\mathrm{True}\ \mathrm{Positive}+\mathrm{False}\ \mathrm{Positive}} $$
(17)
  iii) Recall: It is defined as the ratio of the number of positive samples correctly classified as positive to the total number of positive samples. This parameter gauges the model's ability to detect positive samples. The higher the recall, the more positive samples are detected.

$$ \mathrm{Recall}=\frac{\mathrm{True}\ \mathrm{Positive}}{\mathrm{True}\ \mathrm{Positive}+\mathrm{False}\ \mathrm{Negative}} $$
(18)
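A small sketch of Eqs. 17 and 18 with hypothetical detection counts is given below.

```python
def precision(tp, fp):
    # Eq. 17: correctly classified positives over all predicted positives
    return tp / (tp + fp)

def recall(tp, fn):
    # Eq. 18: correctly classified positives over all actual positives
    return tp / (tp + fn)

tp, fp, fn = 80, 20, 40          # hypothetical counts for one object class
print(precision(tp, fp))          # 0.8
print(recall(tp, fn))             # ~0.667
```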

To evaluate the accuracy of Faster RCNN and YOLO-v3, mean Average Precision (mAP), recall (Recall), and speed (Frames Per Second, FPS) have been used. The results are indicated in Table 3. The dataset comes from the official website of the UNDERWATER ROBOT PICKING CONTEST (URPC) and has been converted into the VOC2007 format. To make it usable, its size was set to 18,982 pictures, of which 4746 pictures were employed as a testing set and 14,236 pictures were employed as the training and validation sets.

Table 3 Test results for faster RCNN and Yolo-v3

Table 4 below shows the Mean Average Precision (mAP) and precision at different iteration counts for Fast RCNN, Faster RCNN, and YOLO-v3 with IoU = 0.7. The GPU used to compute these parameters is an NVIDIA GTX 1080Ti, and the total number of images is 30,000, labeled manually one by one; of these, 8520 images are used for training, 8530 as a validation set, and 12,950 as a test set. The images used to test the algorithms are taken from the "Underwater Robot Picking Contest".

Table 4 Comparison of underwater object detection algorithms based on precision and mean average precision

Besides, the basic Convolutional Neural Network is compared to the well-known RCNN family, consisting of the RCNN, Fast RCNN, and Faster RCNN algorithms. The comparison is performed on the basis of time consumption. Table 5 highlights the limitations of these algorithms along with the amount of time consumed to perform object detection.

Table 5 Comparison of various underwater object detection algorithms

5.1 Comparison of various object detection algorithms used for underwater images

For underwater object detection, the most frequently employed approaches often cannot be applied because the visual quality is low and the objects to be detected are small. Thus, it has been quite challenging for researchers to find an efficient underwater object detection algorithm suited to subaquatic conditions. The selection of an efficient underwater object detection algorithm relies heavily on the performance of the networks on the particular task. The performance of underwater object detection algorithms can be measured using well-known parameters such as Mean Average Precision (mAP), which is the mean of the average precision over all detection classes and is extensively employed to evaluate object detection systems. To compare various underwater object detection algorithms, the dataset has been prepared in Pascal VOC format. The comparison of subaquatic detection algorithms such as CNN [42], Fast RCNN [36], Faster RCNN [95], and YOLO-v3 is indicated in Fig. 29 [42].

Fig. 29 mAP vs iteration times of a CNN, b Fast RCNN, c Faster RCNN, and d YOLO-v3, respectively

From the aforementioned comparisons and results, it is inferred that the detection accuracy of the Faster RCNN model is better than that of the other techniques, although the difference is not very great. Compared to the YOLO-v3 model [93], the CNN technique can yield more precise detection. The convergence of the techniques differs: the YOLO-v3 model converges after 28,000 iterations, which is earlier than the Fast RCNN and Faster RCNN models. None of the techniques improves detection accuracy after 40,000 iterations; the main cause is the scarcity of underwater image data, and the pictures in the dataset are alike, particularly the backgrounds of the underwater pictures. The major reason behind this is that subaquatic data from deep water bodies is very hard to acquire.

The loss function curves are indicated in Fig. 30. It can be seen that the loss values of all the techniques converge, and the amplitude of the loss values of the YOLO-v3 model is smaller than that of the Fast RCNN [36] and Faster RCNN [95] models.

Fig. 30 The loss curves of a CNN, b Fast RCNN, c Faster RCNN, and d YOLO-v3, respectively

For underwater object detection, the precision of all the aforementioned techniques is sufficient for applications; however, real-time detection is also required. The object detection speeds are indicated in Table 6 below. It can be clearly seen that the YOLO-v3 model [93] has a very high detection speed, about 4 times faster than the Faster RCNN framework [95].

Table 6 Detection speed of various methods (IoU = 0.7, Learning Rate = 0.001)

Table 7 given below highlights some existing deep learning-based underwater object detection algorithms.

Table 7 Some famous and existing deep learning-based underwater object detection methods

6 Challenges

The most pressing and highly challenging task in the analysis of underwater pictures is the recognition of optical content. Intra-class variability results in variation of optical content across scales, views, non-rigid deformation, and illumination. In particular, for the classification and detection of seagrasses, the boundaries between classes are far more ambiguous than they are for corals or fish. Besides, in digital pictures, the ambiguity of the optical content increases with water depth. The scarcity of annotated datasets presently available for object detection is another significant hurdle. Object detection techniques are required not only to precisely localize and classify objects but also to make fast, real-time predictions.

7 Conclusion and future directions

Underwater object detection based on deep learning has become a research hotspot because of its powerful learning ability and data processing efficiency. A detailed review of underwater object detection algorithms employing deep learning has been presented, mainly categorized into the CNN, RCNN family, and YOLO family. It is also worth mentioning that the detection speed of the YOLO family is quite impressive and suitable for real-time applications. Multiple subproblems in underwater object detection algorithms have also been comprehensively compared and reviewed, and deep learning-based underwater object detection algorithms, their architectures, and their issues have been summarized. It was important to highlight all the techniques used for the analysis of underwater data to make it easier to focus possible future work on deep neural network approaches. A lot of work has been done on coral classification and detection employing deep learning; however, meager research has been carried out for seagrass, which is equally important for the underwater ecosystem. The precision, robustness, and effectiveness of any detection and classification algorithm can be significantly increased when texture-based and color-based features are combined. For improved seagrass classification and detection, a combination of hand-crafted features and neural networks can be employed. Thus, there exists an opportunity to develop an effective and efficient deep learning method for marine seagrass imagery, and that can be a focus for future work.

Various problems still prevail despite the development of various deep learning-based algorithms for underwater object detection. Many such challenges and issues have been identified from the literature review, including the requirement of pre-training, public underwater datasets, recognition of subclasses, the requirement of a unified framework, and the requirement of a general network structure. These are the issues that the research community needs to target and solve, as it will help to further explore sea life in an effective way.