Introduction

Perception is a fundamental task for robots. We consider perception as the problem of processing observations of the surrounding environment to achieve situational awareness. Perceptual tasks range from high-level mapping down to object-level understanding and robot state estimation. In our view, perception is the first of several critical steps toward autonomy: downstream processes such as planning and control require reliable, robust, and consistent perceptual systems to function.

In contrast to perception in other domains, underwater perception is inhibited by the environment itself: it is plagued by low light, varying water conditions, the attenuation of light and other signals in water, and environmental disturbances far beyond the control of human operators. These factors limit the tools available for underwater perception. Low-cost structured light sensors and infrared LiDAR cannot be used over long distances underwater because water absorbs most or all of the infrared laser energy directed through it, resulting in a very weak or absent return. Radio signals are similarly attenuated, making inter-platform communication a challenge. The Global Positioning System (GPS) provides key pose information for above-surface localization and navigation tasks; however, the electromagnetic signals from orbiting satellites are heavily attenuated by water, so GPS is not adequate for underwater applications. This leads to a more challenging localization procedure, with subsea systems relying on inertial and acoustic sensors for dead reckoning, whose numerical integration errors worsen over time. A common strategy to rectify this drift is self-correction using environmental features, which can be achieved by employing Simultaneous Localization and Mapping (SLAM).

Optical underwater navigation is challenging due to its susceptibility to poor visibility caused by suspended particles, inadequate lighting, and unwanted distortions. Additionally, feature extraction for visual odometry relies on texture-rich imagery, which can be scarce in underwater settings. Nevertheless, in contrast to acoustic sensing, visual navigation is well suited to precise mapping of feature-rich scenarios such as ship hull and shipwreck inspections. Furthermore, cameras provide rich color and detail that is helpful in object detection and classification tasks. The use of cameras in this domain has also spawned a research area addressing the issues caused by the water medium.

The use of sonar sensors is motivated by the challenges the underwater environment imposes on cameras. Sonar is immune to lighting conditions, has long range due to the favorable propagation of sound in water, and is robust to water turbidity. Nonetheless, sonar is not a plug-and-play solution and has specific drawbacks. Several varieties of sonar are available; here we consider imaging, side-scan, and profiling sonar. Profiling sonar uses an array of narrow “pencil” beams to recover accurate distance measurements at known bearing angles. However, to build an understanding of the environment, a dead reckoning system is required to stitch line scans together. Imaging sonar, on the other hand, is lower in cost and returns a panoramic view of the environment across a wide horizontal aperture. Imaging sonar, however, does not measure the location of observations within the vertical aperture, resulting in high ambiguity about the location of observed points in 3D space. Lastly, side-scan sonar has two transducer arrays that send and receive acoustic pulses; it is an excellent tool for seafloor searches, especially when large areas must be covered, albeit at lower image resolution.

These underwater-specific obstacles have resulted in an impressive body of work spanning all of the sensors employed in this domain. In this survey, we consider recent developments in perception for underwater robots, organized by sensor and the associated subfields.

Imaging Sonar

In this section, we discuss imaging sonar and recent perceptual algorithms that use it. The sensor is known by several names that are often used interchangeably: imaging sonar, wide-aperture multi-beam sonar, and forward-looking sonar. When we discuss imaging sonar, we mean a multi-beam sonar with a non-trivial vertical aperture. This sensor provides an expansive field of view and is effective for imaging large volumes of water, making it an excellent tool for situational awareness and allowing a robot to perceive much of the environment around it. Further, sonar is robust to lighting conditions and water quality, making it the sensor of choice when operating in turbid environments. However, imaging sonar does not report information in 3D: while the sensor measures range and bearing, the elevation angle of an observed point is lost in the image formation process. Moreover, these sensors suffer from a low signal-to-noise ratio, multi-path returns, and other characteristics that make automated interpretation challenging. Examples of an imaging sonar’s field of view and imagery are given in Fig. 1.
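
To make the elevation ambiguity concrete, a commonly used idealized model of imaging sonar geometry (exact beam patterns and intensity models vary by sensor) expresses a 3D point in the sensor frame in spherical coordinates and projects it onto the image:

\[ (x, y, z) = \left( r\cos\phi\cos\theta,\; r\cos\phi\sin\theta,\; r\sin\phi \right), \qquad \text{image bin } (r, \theta), \quad \phi \in \left[-\tfrac{\alpha}{2}, \tfrac{\alpha}{2}\right], \]

where \(r\) is range, \(\theta\) is the bearing (azimuth) angle, \(\phi\) is the elevation angle, and \(\alpha\) is the vertical aperture. The image stores echo intensity at each \((r, \theta)\) bin, so every point along the elevation arc swept by \(\phi\) collapses to the same pixel, which is the ambiguity the reconstruction methods below must resolve.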

Fig. 1

(a) shows an example of a Bluefin HAUV conducting a ship hull inspection mission; note the large volume of the imaging sonar footprint shown in blue (From: Hover F.S., et al. The International Journal of Robotics Research (Volume 31 Issue 12). Pages 1445-1464. ©2012 by SAGE Publications, Reprinted by Permission of SAGE Publications) [1]. (b) shows an example sonar image with a floating dock in view and a range of 30 m

Object Identification

Identifying, classifying, and understanding objects in the environment is a fundamental task for underwater robots. When considering object identification, or automated target recognition, many classification methods have been applied. Recently, in settings such as vision and LiDAR, deep learning methods have far eclipsed classical approaches [2]. Critically, though, for imaging sonar there are few datasets, and even fewer with labels, to support object classification or similar tasks.

The focus of recent work has been learning to classify while utilizing only a few labeled samples. Unlike in other settings, this is motivated by necessity rather than the desire to reduce data procurement overhead during training. Transfer learning from other image domains to initialize the weights of a Convolutional Neural Network (CNN) is sometimes used, reducing the number of training samples required in the sonar data [3, 4]. A Generative Adversarial Network (GAN), in this case CycleGAN, is used by Liu et al. [5] to generate synthetic training data. Chen et al. [6] use few-shot learning, which requires only a few training examples of each class; rather than requiring explicit labels, this work trains using image similarity. Interestingly, Wang et al. [7] avoid the need for unknown object classification by introducing an engineered target, which can be classified using more standard methods, similar to AprilTags [8]. Although promising, relying on engineered targets [7] requires human effort to build and place markers on existing structures.
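
As a concrete illustration of the transfer-learning strategy in [3, 4], the sketch below fine-tunes an ImageNet-pretrained CNN on a small set of labeled sonar chips. It is a minimal PyTorch example under assumed names (num_classes, sonar_loader), not a reproduction of any specific published pipeline.

import torch
import torch.nn as nn
from torchvision import models

num_classes = 5                      # assumed number of sonar target classes
net = models.resnet18(weights="IMAGENET1K_V1")   # start from ImageNet weights
for p in net.parameters():
    p.requires_grad = False          # freeze the pretrained backbone
net.fc = nn.Linear(net.fc.in_features, num_classes)  # new classifier head

optimizer = torch.optim.Adam(net.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# sonar_loader: assumed DataLoader of labeled sonar chips; single-channel sonar
# images are typically replicated to three channels to match the pretrained input.
for images, labels in sonar_loader:
    optimizer.zero_grad()
    loss = criterion(net(images), labels)
    loss.backward()
    optimizer.step()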

3D Reconstruction

Recall that imaging sonar does not contain 3D information, even though the sensor observes a 3D volume of water. This raises open questions about how to recover the 3D data and build a 3D understanding of the environment. These questions are critical when robots using this sensing modality must operate outside of a fixed plane.

Firstly, we consider approaches that utilize multiple views to resolve the ambiguity in the vertical aperture of an imaging sonar. Aykin and Negahdaripour [9] use a space carving approach, with Westman et al. [10] proposing enhancements to space carving. Wang et al. [11] extend Acoustic Structure-From-Motion (ASFM) [12] by tracking salient point features through space to minimize their ambiguity. DeBortoli et al. [13] use a CNN to detect salient imagery, removing imagery that is unproductive for 3D reconstruction with ASFM as the backend. Westman et al. [14] introduce a volumetric framework for 3D reconstruction, testing sonars with both narrow and wide apertures.

While using multiple views to resolve sonar ambiguity is principled and draws on many methods from outside underwater robotics, it may be desirable to recover 3D information without multiple frames. Guerneve et al. [15] use blind deconvolution, and Westman and Kaess [16•] use an acoustic generative model, to recover the missing data in a sonar image. Both of these methods are effective and draw on classical techniques to recover the elevation angles of the observed points. DeBortoli et al. [17] employ a CNN and synthetic training data to predict the missing data in a sonar image, with Wang et al. [18] learning per beam rather than image-to-image and comparing against [17]. All of these methods recover the missing elevation angle to accompany the 2D range-bearing observations of a single sonar image at a single viewpoint.

A new set of methods introducing a second sonar in a stereo configuration has also been proposed [19,20,21]. These methods recover point clouds from a pair of concurrent sonar images captured at the same robot viewpoint using sonars with different perspectives, with McConnell et al. [19] focusing on an orthogonal array of sonars. McConnell and Englot [22] extend this prior work [19] with object-level inference about simple objects, greatly increasing the coverage rate for a given trajectory.
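
The geometric intuition behind an orthogonal stereo pair can be sketched as follows: if two co-located sonars share a forward axis, one mounted horizontally (measuring azimuth) and one vertically (measuring an elevation-plane bearing), then a matched feature's two bearings and common range determine a unique 3D point. The snippet below is an idealized, co-located-sensor sketch that ignores the physical offsets and calibration handled in [19,20,21].

import numpy as np

def stereo_sonar_point(r, theta, psi):
    """Idealized intersection for an orthogonal sonar pair.
    r     : range to the feature (shared under the co-located assumption)
    theta : bearing from the horizontally mounted sonar, atan2(y, x)
    psi   : bearing from the vertically mounted sonar,   atan2(z, x)
    """
    x = r / np.sqrt(1.0 + np.tan(theta) ** 2 + np.tan(psi) ** 2)
    y = x * np.tan(theta)
    z = x * np.tan(psi)
    return np.array([x, y, z])

# Example: a feature 10 m away, 5 degrees to starboard, 3 degrees below the axis
print(stereo_sonar_point(10.0, np.deg2rad(5.0), np.deg2rad(-3.0)))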

SLAM

SLAM is a critical perceptual task, and in the context of imaging sonar a non-trivial one. In this section, we consider three subtopics of SLAM: sonar odometry, loop closure, and SLAM systems.

Generally speaking, SLAM solutions depend on two categories of measurement constraints: sequential constraints, typically derived from odometry or scan-matching, and non-sequential constraints, often referred to as loop closures. Loop closures are typically achieved in two steps: identifying candidate loop closures and computing the relative transformation between frames to be inserted into the SLAM backend. Franchi et al. [23] consider the problem of deriving linear speed measurements from sonar imagery, akin to sonar odometry. Henson and Zakharov [24] build sonar mosaics using optical flow as a basis for transform estimation. Almanza-Medina et al. [25] use a neural network to learn 3-DOF transformations between sonar images, and Song et al. [26] consider sonar image registration; both of these works could be utilized for either odometry or loop closure. Santos et al. [27] use a graph structure to describe a scene for both loop closure identification and registration. Ribeiro et al. [28] use a CNN with a triplet loss to train a network for sonar image-based place recognition, framed as an image retrieval problem.
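
For the image-retrieval formulation of place recognition in [28], a network is trained so that embeddings of images from the same place lie closer together than embeddings from different places. The following is a minimal, hedged PyTorch sketch of the triplet objective; the embedding network and the triplet mining strategy here are placeholders, not the published architecture.

import torch
import torch.nn as nn

embed = nn.Sequential(                       # placeholder embedding CNN
    nn.Conv2d(1, 16, 5), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 64),
)
triplet = nn.TripletMarginLoss(margin=0.2)
optimizer = torch.optim.Adam(embed.parameters(), lr=1e-4)

# triplet_loader: assumed miner yielding (same place, same place, different place) images
for anchor, positive, negative in triplet_loader:
    loss = triplet(embed(anchor), embed(positive), embed(negative))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# At query time, places are retrieved by nearest-neighbor search in the embedding space.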

While using objects as landmarks for SLAM is appealing, most recent approaches have focused on pose SLAM or on landmark-based SLAM using point features as landmarks. Westman et al. [29] use point features in sonar imagery as a basis for SLAM, taking special care to handle their vertical ambiguity. Ribeiro et al. [30] use local ASFM as well as loop closures to build a 6-DOF state estimate. Wang et al. [31] estimate the robot state in 3-DOF while enforcing in-plane motion; this work uses Iterative Closest Point (ICP) based scan-matching to derive both sequential and non-sequential measurement constraints. Teixeira et al. [32] use a factor graph with odometry as well as a surfel-based map. Hinduja et al. [33] propose a degeneracy-aware SLAM system to manage possibly degenerate factors. Xu et al. [34] use a sliding window to stay robust to front-end outliers. Recently, a new direction has been considered for underwater robots in littoral settings: using satellite images as an aid in the SLAM problem. Dos Santos et al. [35] use a neural network to compute the similarity between a possible location in the satellite image and a given sonar scan from the robot’s current state; this similarity is used as the basis for particle filter localization in the survey area. McConnell et al. [36] utilize a similar concept, except that instead of using a neural network to score similarity, their method generates an image predicting the above-surface appearance of the observed underwater structures, enabling registration of sonar images to the provided satellite image. This registration is then used as another factor in a SLAM solution that includes odometry and standard loop closures.
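
Most of the systems above share a factor-graph backend in which sequential (odometry or scan-matching) and non-sequential (loop-closure) constraints are jointly optimized. The snippet below is a generic 3-DOF pose-graph sketch using GTSAM, in the spirit of planar formulations such as [31]; the noise models, keys, and values are illustrative assumptions, and the front ends of the cited works differ.

import numpy as np
import gtsam

graph = gtsam.NonlinearFactorGraph()
prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.1, 0.1, 0.05]))
odom_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.3, 0.3, 0.1]))

# Anchor the first pose, then chain sequential odometry/scan-matching factors.
graph.add(gtsam.PriorFactorPose2(0, gtsam.Pose2(0, 0, 0), prior_noise))
graph.add(gtsam.BetweenFactorPose2(0, 1, gtsam.Pose2(2.0, 0.0, 0.0), odom_noise))
graph.add(gtsam.BetweenFactorPose2(1, 2, gtsam.Pose2(2.0, 0.0, 0.0), odom_noise))
# A non-sequential loop closure, e.g. obtained from sonar image registration.
graph.add(gtsam.BetweenFactorPose2(2, 0, gtsam.Pose2(-4.0, 0.0, 0.0), odom_noise))

initial = gtsam.Values()
for i, guess in enumerate([(0, 0, 0), (2.1, 0.1, 0.0), (4.2, -0.1, 0.0)]):
    initial.insert(i, gtsam.Pose2(*guess))

result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
print(result.atPose2(2))   # optimized estimate of the last pose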

Underwater Vision

In this section, we will consider camera-based systems in the underwater domain. Cameras provide dense, expansive information about the environment. Unlike sonars, they provide rich color and texture information and are easy to interpret by untrained human operators. Cameras can support a wide range of tasks from the object level to SLAM. Moreover, the broader vision field is large and quickly growing, providing a significant body of research to draw upon.

Cameras, like all sensors, have their drawbacks. These sensors are passive and therefore require sufficient ambient lighting, or lighting onboard the vehicle, to be useful. Further, camera data may be degraded by poor water quality: if turbidity is high and visibility low, cameras may simply not offer a useful path forward for underwater perception. Lastly, the properties of water cause a loss of the red channel as well as hazing in images, degrading the texture required by many automated perception methods. Examples of underwater imagery in different conditions are given in Fig. 2.

Fig. 2

(a)–(f) show examples of camera images from a variety of environments, lighting conditions, and water qualities. (©[2019] IEEE. Reprinted, with permission, from [37].)

Underwater Image Enhancement

The properties of water cause a loss of red color and texture in images that can be problematic for image processing systems. This has spurred an impressive body of research on color correction and image dehazing. In general, approaches either use a physics-based model or employ machine learning to recover the lost information. Non-learning approaches are used to recover some of the missing information [38,39,40,41,42,43,44]; these approaches are effective because they require little prior information about the environment. Machine learning in this setting is challenging because labeled data is usually required, including examples of scenes with and without the effects of water immersion. For tank scenes this is feasible, but for scenes in the field, it may not be. When learning image correction for underwater vision, a recent focus has been the use of GANs to create synthetic data for supervised learning [45,46,47,48,49,50]. GANs are of great importance here because it may not be possible to collect enough training data for image enhancement, even when using data-efficient methods. They may, however, introduce a performance gap, since models are trained on synthetic data but tested on real-world underwater scenes.
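
Many physics-based corrections build on a simplified, commonly used image formation model (the details and parameterizations vary across [38,39,40,41,42,43,44]), which both model-inversion and learning-based methods effectively try to undo:

\[ I_c(x) = J_c(x)\, e^{-\beta_c d(x)} + B_c \left(1 - e^{-\beta_c d(x)}\right), \qquad c \in \{R, G, B\}, \]

where \(I_c\) is the observed image, \(J_c\) the unattenuated scene radiance, \(d(x)\) the range to the scene at pixel \(x\), \(\beta_c\) the wavelength-dependent attenuation coefficient (largest for red, hence the red-channel loss), and \(B_c\) the backscattered veiling light responsible for haze. Physics-based methods estimate \(d(x)\), \(\beta_c\), and \(B_c\) and invert the model, while learned methods approximate the inverse mapping directly.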

Object Identification

Object identification and classification are widely explored tasks in vision systems. Accurate and detailed analysis of subsea images is relevant for evaluating and monitoring ecosystem states [51], assessing environmental impact, predicting robot orientation [52], tracking life forms [53], and carrying out many other tasks. Underwater, however, there are unique problems: mainly lower image quality and a lack of training data. Obtaining large datasets can be expensive and impractical in underwater environments because of high operational costs, time constraints, and the scarcity of specific objects. Nonetheless, there are several examples of supervised deep learning for seafloor image classification [54], wildlife monitoring [55], and coral detection [51]. To deal with the lack of sufficient training data, a common technique is transfer learning: pre-training on a large dataset and then fine-tuning on the domain-specific data [56,57,58]. Another option is to train using synthetic samples, as shown by O’Byrne et al. [59]. To avoid the need for a large dataset entirely, algorithms capable of learning from only a handful of samples can be leveraged; Ochal et al. [60] analyze several few-shot learning techniques on underwater imagery. Furthermore, Yamada et al. [61] use environmental metadata, such as horizontal location and depth of the observed seafloor, to regularize learning and enhance the accuracy of self-supervised classification. Detection has also been used to enable “follow me” autonomy, where a vehicle follows a human diver [62, 63], and for multi-robot operations [64].
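
As a concrete example of the few-shot setting surveyed in [60], a prototypical-network-style classifier builds one prototype per class from a handful of labeled support embeddings and assigns each query to the nearest prototype. The sketch below assumes a pre-trained embedding function embed() and is illustrative rather than a reproduction of any evaluated method.

import torch

def prototypical_predict(support_x, support_y, query_x, embed, num_classes):
    """support_x/support_y: a few labeled examples per class; query_x: unlabeled images."""
    with torch.no_grad():
        s_emb = embed(support_x)                       # (N_support, D)
        q_emb = embed(query_x)                         # (N_query, D)
    # One prototype per class: the mean embedding of its support examples.
    prototypes = torch.stack(
        [s_emb[support_y == c].mean(dim=0) for c in range(num_classes)]
    )
    # Classify each query by its nearest prototype (Euclidean distance).
    dists = torch.cdist(q_emb, prototypes)             # (N_query, num_classes)
    return dists.argmin(dim=1)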

Extracting meaningful information over a broad range of environments and image quality conditions is challenging. As previously mentioned, producing large underwater datasets is costly, and the intensive effort of manual labeling often makes supervised learning techniques impractical. Topic models are a family of Bayesian probabilistic models suitable for unsupervised semantic clustering. Girdhar et al. [65] proposed the Realtime Online Spatiotemporal Topic (ROST) modeling framework, which models the semantics of streaming visual observations. Using ROST, a “surprise” score is computed for incoming observations; this score is used to detect high-level patterns in the scene and to differentiate between previously observed and new data. Moreover, the topic labels computed using ROST are suitable for use by autonomous agents operating under real-time constraints. Based on ROST, Kalmbach et al. [66] developed a full and robust feature extraction pipeline that can accommodate video recorded under less-than-ideal conditions.
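
To give a rough sense of the surprise idea (this is a simplified sketch, not ROST's exact formulation, which scores words against spatiotemporal neighborhoods): an incoming observation is summarized as visual words, and its surprise can be taken as the average negative log-likelihood of those words under the current topic model.

import numpy as np

def surprise(observation_words, topic_word, context_topic_dist):
    """observation_words : list of visual-word indices in the new observation
    topic_word          : (K, V) per-topic word distributions learned so far
    context_topic_dist  : (K,) topic mixture of the current spatiotemporal context
    """
    word_probs = context_topic_dist @ topic_word       # (V,) marginal p(word | context)
    return -np.mean(np.log(word_probs[observation_words] + 1e-12))

# High surprise => the words are poorly explained by existing topics,
# flagging a potentially novel pattern worth closer inspection.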

SLAM

Vision-based SLAM is a mature field with a large body of work to draw from. There are even prominent open-source systems, such as ORB-SLAM [67, 68], upon which underwater SLAM systems can be based. Underwater SLAM, however, is challenged by domain-specific conditions: poor lighting, water turbidity, and the reduction in image quality due to the properties of water.

The performance of open-source frameworks in underwater conditions has previously been evaluated [37, 69,70,71]. Zhang et al. [72] and Ferrera et al. [73] propose robust visual odometry systems for use in underwater environments. While the underwater domain imposes new challenges on cameras, it also offers some additional tools, ranging from acoustic devices such as sonars and Doppler Velocity Logs (DVLs) to simple pressure sensors. Xu et al. [74] and Vargas et al. [75] enhance visual SLAM with DVL and inertial/DVL fusion, respectively. Sonar has been employed to augment a visual SLAM system with acoustic range measurements [76, 77]. Rahman [78] shows stereo visual SLAM assisted by a pressure sensor and an inertial measurement unit (IMU), and Hu et al. [79] use a pressure sensor for scale initialization with a monocular camera. Rahman et al. [80, 81] consider the artifacts of artificial light in an underwater cave, such as harsh shadows and contours. Most recently, a GoPro-9 camera with an integrated IMU has been used for SLAM, potentially enabling complex missions while minimizing hardware overhead [82•]. While Bosch et al. [83] do not consider the SLAM problem, they open the door for the use of omni-directional cameras in an underwater setting. Xanthidis et al. [84] show an example of multi-robot visual SLAM around shipwrecks. Suresh et al. [85] avoid some of the issues in underwater vision by looking at the ceiling above the water and performing state estimation using those features; however, this approach is limited to indoor applications.
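
One of these extra cues is essentially free to exploit: a pressure sensor gives an absolute measurement of depth through hydrostatics,

\[ z \approx \frac{P - P_{\mathrm{atm}}}{\rho g}, \]

where \(P\) is the measured absolute pressure, \(P_{\mathrm{atm}}\) the surface pressure, \(\rho\) the water density, and \(g\) the gravitational acceleration. This scalar can constrain the vertical coordinate of the vehicle (for example, as a unary factor in a factor graph) or, as in the monocular case above, anchor the otherwise unobservable metric scale.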

Bathymetry and Profiling

Of all the underwater perception systems, profiling and bathymetric mapping are perhaps the most relevant in commercial and industrial settings. They support many offshore construction projects, including oil and gas, offshore wind, telecommunications cabling, and more. Bathymetry is typically the process of using a narrow-beam scanner to recover precise distances between a robot and the seafloor or riverbed, while profiling can refer to scanning in any direction, including a forward-looking orientation. Most commonly, individual scans are registered into submaps using a highly accurate inertial navigation system, often including a DVL. Once submaps are formed, tools such as point cloud registration can be applied to improve inter-submap alignment and search for loop closures. In this section, we consider recent developments in this area, including bathymetry and profiling performed with both narrow-beam sonars and laser scanners. An example of submap formation during a bathymetry survey is shown in Fig. 3.

Fig. 3

An example of profiling line scans being stitched into submaps shown in green and purple. Note the submap poses shown as \(x_{i}\). (©[2019] IEEE. Reprinted, with permission, from [86])

In bathymetric mapping and profiling, objects of interest in the survey area can greatly enhance SLAM performance. Guerneve et al. [87] perform semantic mapping using prior CAD models of objects expected in the survey area. When available, such priors can be of great value; however, they are not always an option, making the case for techniques such as submap registration. Torroba et al. [88] compare modern methodologies for submap registration. Hitchcox and Forbes [89] use Gaussian process regression for point cloud registration, enabling bathymetric SLAM with a laser scanner. Jung et al. [90] enhance ICP performance by regularizing terrain height. ICP, however, can be degenerate, and it is critical to indicate to SLAM backends which measurements can be relied upon, often via a covariance matrix; Sprague et al. [91] learn the ICP covariance, accounting for variable initial guesses, rather than deriving it.
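
For reference, the basic submap-to-submap alignment step that these works refine or robustify can be sketched with an off-the-shelf point-to-point ICP call. The snippet below uses Open3D with an identity initial guess and an assumed correspondence distance; the cited works differ mainly in how they seed, constrain, and weight this step.

import numpy as np
import open3d as o3d

def register_submaps(source_xyz, target_xyz, max_corr_dist=1.0):
    """source_xyz, target_xyz: (N, 3) arrays of submap points in a rough common frame."""
    source = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(source_xyz))
    target = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(target_xyz))
    result = o3d.pipelines.registration.registration_icp(
        source, target, max_corr_dist, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    # result.transformation is a 4x4 SE(3) estimate; result.fitness and
    # result.inlier_rmse are crude quality indicators, but not the calibrated
    # covariance a SLAM backend really needs (the motivation for works such as [91]).
    return result.transformation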

Improving submap registration alone may not be enough, and further processing may be required. Campos and Garcia [92] filter point clouds acquired from noisy sensors by fitting a surface mesh. Bore et al. [93] apply sparse Gaussian processes to reduce the storage overhead of bathymetric maps.
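
The Gaussian-process view of bathymetry treats the seafloor as a continuous function z = f(x, y) learned from sounding points, yielding both a compact representation and per-point uncertainty. The sketch below fits an exact GP with scikit-learn on synthetic soundings; it is only illustrative, since works such as [93] rely on sparse approximations to scale to full surveys.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic stand-ins for real soundings: (N, 2) horizontal positions and (N,) depths.
xy = np.random.uniform(0, 50, size=(500, 2))
depth = -20.0 + 0.5 * np.sin(0.2 * xy[:, 0]) + 0.05 * np.random.randn(500)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=5.0) + WhiteKernel(1e-3),
                              normalize_y=True)
gp.fit(xy, depth)

# Query the model on a grid: the posterior std flags poorly surveyed regions.
grid = np.stack(np.meshgrid(np.linspace(0, 50, 25),
                            np.linspace(0, 50, 25)), axis=-1).reshape(-1, 2)
mean, std = gp.predict(grid, return_std=True)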

While many backends have been applied to bathymetric SLAM, Zhang et al. [94] use a particle filter backend for SLAM, with Teng et al. [95] employing one specifically for detecting invalid loop closures. Torroba et al. [86] fuse dead reckoning information and maximize geometric consistency to produce accurate maps.

Lastly, profiling sensors can be applied to more than downward-looking bathymetric sweeps. Teixeira et al. [96] use profiling sonar submaps to perform 3D mapping of ship hulls. Palomer et al. [97] propose a calibration routine for a camera and laser scanner, and Palomer et al. [98] use a laser scanner to perform SLAM about an object of interest. Norgren and Skjetne [99] propose SLAM around an iceberg, estimating the robot state as well as the iceberg drift rate, which is critical for ice monitoring.

Side-Scan Sonar

In this section, we discuss the use of side-scan sonar (SSS). SSS provides an expansive 2D field of view, useful for large-scale searches of the seafloor, oftentimes with human operators monitoring incoming images. The field of view of a side-scan sonar is illustrated in Fig. 4.

Recently, deep learning methods have been applied to the Automated Target Recognition (ATR) problem. However, much like in the other areas discussed in this paper, the lack of training data is a limiting factor; thus, the use of GANs has been proposed [100, 101]. Furthermore, Yu et al. [102] apply the YOLO architecture to SSS images.

Place recognition has also been studied using SSS images. Larsson et al. [103] train a network to perform place recognition using triplet loss. Lastly, Xie et al. [104•] use a CNN to infer the missing dimension in SSS images.

Fig. 4

An example of side-scan sonar imaging a survey area. Note the image mosaic in gray-scale, with terrain elevation shown in color. (©[2021] IEEE. Reprinted, with permission, from [101])

Conclusions

Underwater perception has seen much work over the past five years. However, some areas stand out as requiring more work and new solutions. Firstly, when considering identifying and classifying objects in sensor data, there are few if any off-the-shelf solutions due to the lack of training data in this setting, especially for imaging sonar. While GANs and simulators have provided an outlet, evaluation is still challenging due to the lack of public benchmark datasets in a common setting (pool, reef, wreck, etc.). Critically, the lack of a common and robust object dataset has hampered efforts to incorporate semantic information into downstream processes.

When considering SLAM, while there are many innovative underwater SLAM solutions, many of them are tested on specially gathered data that is not subsequently released to the public domain. Compared to other fields such as ground robotics, where the KITTI dataset has proliferated, our community lacks an easy-to-use benchmark dataset. Looking forward, especially in the sonar setting, more open-source code and public data are required to enable comparison, benchmarking, and replication of results.

While underwater state estimation has continued to evolve over the past several years, many of these methods use sensor packages that drastically increase cost. A research avenue of large potential impact is low-cost perceptual and inertial sensor payloads running robust 6-DOF SLAM systems. Such systems present the potential for larger-scale multi-robot deployments and an alternative to expensive inertial navigation systems that may be prone to failure in adverse conditions.