1 Introduction

Accurate computation of relative pose is essential in multi-robot estimation problems such as cooperative tracking, localization (Kim & Eustice, 2013; Rekleitis et al., 2002), mapping (Johnson-Roberson et al., 2017; Se et al., 2005), path planning (Landa-Torres et al., 2017), and more. Unless global positioning information (e.g., GPS, USBL) is available, the robots need to estimate their positions and orientations relative to each other based on their exteroceptive sensory measurements and noisy odometry (Zhou & Roumeliotis, 2008). This process is necessary for registering their measurements to a common frame of reference to maintain coordination during task execution.

In a cooperative setting, robots with visual sensing capabilities solve the relative pose estimation problem by triangulating mutually visible local features and landmarks. A lack of salient features significantly affects the accuracy of this estimation (Valgren & Lilienthal, 2010), which eventually hampers the overall success of the operation. Such difficulties often arise in poor visibility conditions underwater due to a lower number of point-based salient features and landmarks (Damron et al., 2018; Sattar et al., 2008). Nevertheless, being low-power passive sensors, cameras have been the choice for exteroceptive perception in many important applications such as inspection of ship hulls and coral reefs (Kim & Eustice, 2013; Dunbabin et al., 2019), 3D reconstructions of archaeological sites (Johnson-Roberson et al., 2017), and human–robot collaborative missions in general (Islam et al., 2018). An important observation is that the proximity of human divers to robots is a fairly common occurrence in these applications and in other monitoring and surveying tasks in shallow waterbodies (Kalaitzakis et al., 2020; Manderson et al., 2018). Besides, humans are frequently present and visible in many social scenarios (Islam et al., 2019; Kümmerle et al., 2013) where natural landmarks are not reliably identifiable due to repeated textures, noisy visual conditions, etc. Hence, the problem of having limited natural landmarks can be alleviated by using mutually visible humans as markers (i.e., feature correspondences), particularly in human–robot collaborative applications. Despite the potential, the feasibility of using human presence or body-pose for robot-to-robot relative pose estimation has not been explored in the literature.

In this paper, we propose a method for computing the six degrees-of-freedom (6-DOF) robot-to-robot transformation between pairs of communicating robots by using mutually detected humans’ pose-based key-points as correspondences. As illustrated in Fig. 1, we adopt a leader-follower framework where one of the robots (equipped with a stereo camera) is assigned as the leader. First, the leader robot detects and triangulates the 3D positions of the key-points in its own frame of reference. Then the follower robot matches the corresponding 2D projections on its intrinsically calibrated camera and localizes itself by solving the perspective-n-point (PnP) problem (Zheng et al., 2013). It is to be noted that this entire process of extrinsic calibration is automatic and does not require prior knowledge about the robots’ initial positions. Additionally, it is straightforward to extend the leader-follower framework to multi-robot teams from the pairwise solutions. Furthermore, if the leader robot has global positioning information, i.e., a GPS or a USBL receiver, the follower robots can use that information to localize themselves in the global frame as well.

Fig. 1 A simplified illustration of 3D relative pose estimation between robot 1 and robot 2 (3). The robots know the transformations between their intrinsically-calibrated cameras and respective global frames, i.e., {1}, {2}, and {3}. Robot 1 is considered as the leader (equipped with a stereo camera) and its pose in global coordinates (\(^1R_G\), \(^1t_G\)) is known. Robot 2 (3) finds its unknown global pose by cooperatively localizing itself relative to robot 1 using the human pose-based key-points as common landmarks

In addition to the conceptual design, we present an end-to-end system with efficient solutions to the practicalities involved in the proposed robot-to-robot pose estimation method (see Sect. 3). As mentioned, we use OpenPose (Cao et al., 2017) for detecting human body-poses in the image space. Although it provides reliable detection performance, the extracted 2D key-points across different views do not necessarily associate as correspondences. We propose a twofold solution to this:

  • First, we design an efficient person re-identification module by evaluating the hierarchical similarities of the key-point regions in the image space (see Sect. 3.2). It takes advantage of the consistent human pose structures across viewpoints and evaluates their pair-wise similarities for fast body-pose association. We also demonstrate that the state-of-the-art (SOTA) appearance-based person re-identification models fail to provide acceptable performance under single-board real-time constraints.

  • Subsequently, we formulate an iterative optimization algorithm to refine the noisy key-point correspondences by further exploiting their local structural properties in respective images (see Sect. 3.3). We demonstrate that the pair-wise key-point refinement is crucial to ensure their validity in a perspective geometric sense.

This two-stage process facilitates efficient and robust key-point associations across viewpoints for accurate robot-to-robot relative pose estimation (see Sect. 4). In this paper, we primarily focus on these two novel modules because the rest of the computational aspects are generic to all multi-robot cooperative pose estimation systems. Nevertheless, we present a fast implementation of the proposed system and evaluate its end-to-end performance over several terrestrial and underwater field experiments. Lastly, we analyze its practical feasibility and discuss various operational considerations (in Sect. 4.5).

2 Related work

2.1 Robot-to-robot relative pose estimation

The problem of robot-to-robot relative pose estimation has been thoroughly studied for 2D planar robots, particularly for range and bearing sensors. Analytic solutions for determining the 3-DOF robot-to-robot transformation using mutual distance and/or bearing measurements involve solving an over-determined system of nonlinear equations (Zhou & Roumeliotis, 2008; Trawny & Roumeliotis, 2010). Similar solutions for the 3D case, i.e., for determining the 6-DOF transformation using inter-robot distance and/or bearing measurements, have been proposed as well (Zhou & Roumeliotis, 2011; Trawny et al., 2010). In practice, these analytic solutions are used as an initial estimate of the relative pose, which is then iteratively refined by optimization techniques (e.g., nonlinear weighted least-squares) to account for noisy observations and uncertainty in robot motion.

Robots that rely on visual perception (i.e., use cameras as exteroceptive sensors) solve the relative pose estimation problem by triangulating mutually visible features and landmarks (Wang & Wilson, 1992). The problem thus reduces to solving the PnP problem using sets of 2D–3D correspondences between geometric features and their projections on the respective image planes (Zheng et al., 2013). Although high-level geometric features (e.g., lines, conics) have been proposed, point-based features are typically used in practice for relative pose estimation (Janabi-Sharifi & Marey, 2010). Moreover, the PnP problem is solved either using iterative approaches that formulate the over-constrained system (\(n>3\)) as a nonlinear least-squares problem, or by using sets of three non-collinear points (\(n=3\)) in combination with Random Sample Consensus (RANSAC) to remove outliers (Fischler & Bolles, 1981). Besides, vision-based approaches often use temporal-filtering methods, the extended Kalman filter (EKF) in particular, to reduce the effect of noisy measurements and provide near-optimal pose estimates (Wang & Wilson, 1992; Janabi-Sharifi & Marey, 2010). On the other hand, it is also common to simplify the relative pose estimation by attaching specially designed calibration patterns on each robot (Rekleitis et al., 2006). However, this requires that the robots operate at a sufficiently close range and remain mutually visible.

2.2 Human body-pose detection

Visual detection of 2D human pose has made significant progress over the last decade. The SOTA methodologies can be categorized into top-down and bottom-up approaches. Top-down approaches (Gkioxari et al., 2014; Pishchulin et al., 2012) first detect the humans in the image space and then perform localization and association of their body-parts. One major limitation of these approaches is that their run-times are proportional to the number of persons in the image. Additionally, the robustness of the pose estimation largely depends on the accuracy of their person detectors. In contrast, bottom-up approaches (Cao et al., 2017; Pishchulin et al., 2016) do not suffer from these two issues. However, they require solving a more computationally challenging inference problem of learning global contextual cues for simultaneous body-part detection and association.

The classical approaches typically use pictorial structures (Ferrari et al., 2008; Andriluka et al., 2009) to model the appearance of human body-parts. A set of densely sampled shape descriptors is used for localizing the body-parts, and then classifiers such as AdaBoost, SVMs, etc., are used for detection. Associating the detected body-parts is rather challenging; mixtures of tree-based models are typically used to learn separate pairwise relationships for different body-part configurations (Johnson & Everingham, 2011). Graph-based connectivity models are then used to formulate the inference (association) as a graph-cut problem. These pairwise connectivity models can be further generalized (Pishchulin et al., 2013) to capture the anatomical relationships among multiple body-parts. Recently proposed approaches use Deep Neural Networks (DNNs) to learn human pose detection from large training datasets and perform fast and accurate global inference. DeepPose (Toshev & Szegedy, 2014), for instance, formulates pose detection as a regression problem and uses a cascade of DNNs to learn the inference in a holistic fashion. On the other hand, OpenPose (Cao et al., 2017) jointly learns to detect and associate body-parts using pose machines (Ramakrishna et al., 2014). In contrast to DNNs, each module of a pose machine is trained locally; the sequential predictions of these modules are then refined to perform a hierarchical joint inference. Such hierarchical structures facilitate fast inference for multi-person pose estimation in addition to achieving SOTA performance. Due to these compelling reasons, we use OpenPose in this work.

2.3 Human-aware robot control

Human-awareness is important for autonomous mobile robots operating in social settings and human–robot collaborative applications. A large body of literature and systems exists (Islam et al., 2018; Mead & Matarić, 2017) focusing on understanding human motion, instructions, behaviors, etc. Additionally, tracking human pose relative to a robot is particularly common in applications such as person tracking or following (Islam et al., 2019; Montemerlo et al., 2002), collaborative manipulation (Mainprice & Berenson, 2013), behavior imitation (Lei et al., 2015), etc. However, the feasibility of using humans’ presence or their body-poses as markers for robot-to-robot relative pose estimation has not been explored in the literature.

3 System design and methodology

Our proposed robot-to-robot relative pose estimation system incorporates several computational components: detection of human body-poses in images captured from different views (by the leader and follower robots), pair-wise association of the detected humans across viewpoints, geometric refinement of the key-point correspondences, and 3D pose estimation of the follower robot relative to the leader. We present a snapshot of the end-to-end computational pipeline in Fig. 2. As in standard multi-robot cooperation, the proposed system requires synchronized communication between the leader and follower robots. From a follower robot’s perspective, the primary challenge is to identify the mutually visible humans and then correctly associate their body-poses. Subsequently, geometric refinement of those associated pose-based key-points is essential for accurate relative pose estimation in the wild. We design robust and efficient modules to meet these operational requirements within a reasonable computational overhead. In the following sections, we present methodological details of these modules and discuss the relevant design choices.

Fig. 2 The end-to-end computational pipeline is outlined from the perspective of a follower robot which shares a clock with the communicating leader robot by using a timestamp-based buffer scheduler for synchronized data registration. The mutually visible human body-pose based key-points are then associated and refined for relative pose estimation. We design these two novel components (marked in purple boxes) to establish robust and accurate key-point correspondences at a fast rate (195 milliseconds per estimation on NVIDIA™ Jetson TX2)

3.1 Human body-pose detection

OpenPose (Cao et al., 2017) is an open-source library for real-time multi-human 2D pose detection in images, originally developed using the Caffe and OpenCV libraries. We use a TensorFlow implementation based on the MobileNet model, which provides faster inference compared to the original model (also known as the CMU model). Specifically, it processes a \(368\times 368\) image in 180 milliseconds on the Jetson TX2 embedded computing board (NVIDIA™, 2014), whereas the original model takes multiple seconds.

OpenPose generates 18 key-points pertaining to the nose, neck, shoulders, elbows, wrists, hips, knees, ankles, eyes, and ears of a human body. As shown in Fig. 3, a subset of these 2D key-points and their pair-wise anatomical relationships are generated for each human. We represent the key-points \(\mathbf {KP}(I)\) by an \(N_I \times 18\) array, where \(N_I\) is the number of detected humans in an image I. If a particular key-point is occluded or not detected, its values are left as (\(-1\), \(-1\)). We configure \(\mathbf {KP}(I)\) so that the first row belongs to the left-most person, the second row belongs to the next left-most person, and so on, with the last row belonging to the right-most person in the image. Sorting the key-points this way helps to speed up the process of associating the rows of \(\mathbf {KP}(I_{leader})\) and \(\mathbf {KP}(I_{follower})\). That is, the follower robot needs to make sure that it is pairing the key-points of the same individuals. This is important because in practice the robots might be looking at different individuals, or at the same individuals in a different spatial order. Associating multiple persons across different images is a well-studied problem known as person re-identification (ReId).
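As a concrete illustration, the following sketch packs per-person key-points into the \(N_I \times 18\) arrangement described above and sorts the rows left to right; the detector output format and all helper names are assumptions for illustration only.

```python
import numpy as np

def build_kp_array(detections):
    """Pack per-person 2D key-points into an (N_I x 18 x 2) array whose rows
    are sorted so that row 0 is the left-most person in the image.

    `detections` is assumed to be a list (one entry per person) of dicts
    mapping the 18 OpenPose part indices (0..17) to (x, y) pixel coordinates;
    occluded or undetected parts are simply absent from the dict.
    """
    people = []
    for det in detections:
        kp = np.full((18, 2), -1.0)            # undetected parts stay (-1, -1)
        for part_id, (x, y) in det.items():
            kp[part_id] = (x, y)
        people.append(kp)

    # Sort by the mean x-coordinate of the visible key-points so that the
    # rows of KP(I_leader) and KP(I_follower) line up left to right.
    def mean_x(kp):
        visible = kp[kp[:, 0] >= 0]
        return visible[:, 0].mean() if len(visible) else np.inf

    people.sort(key=mean_x)
    return np.stack(people) if people else np.empty((0, 18, 2))
```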

Fig. 3 Multi-human 2D body-pose detection using OpenPose in various human–robot collaborative settings

Fig. 4 An illustration of how the hierarchical body-parts are extracted for person ReId based on their structural similarities; once the persons are associated, the pair-wise key-points are refined and used as correspondences

3.2 Person re-identification using hierarchical similarities

Although several existing deep visual models provide very good solutions for person ReId (Ahmed et al., 2015; Li et al., 2014), we design a simple and efficient model to meet the real-time single-board computational constraints. The idea is to avoid using a computationally demanding feature extractor by making use of the hierarchical anatomical structures that are already embedded in the key-points. First, we bundle the subsets of key-points in several spatial bounding boxes (BBox) as follows:

  • Face BBox: nose, eyes, and ears;

  • Upper-body BBox: neck, shoulders, and hips;

  • Lower-body BBox: hips, knees, and ankles;

  • Left-arm BBox: left shoulder, elbow, and wrist;

  • Right-arm BBox: right shoulder, elbow, and wrist;

  • Full-body BBox: encloses all the key-points.

Figure 4 illustrates the spatial hierarchy of these BBoxes and their corresponding key-points. They are extracted by spanning the corresponding key-points’ coordinate values in both the x and y dimensions. We use an offset (of additional \(10\%\) length) in each dimension to capture more spatial information around the key-points. A BBox is discarded if its area falls below an empirically chosen threshold of 600 square pixels. We found that BBox areas below this resolution are not always informative and are prone to erroneous results. This happens when the corresponding body-part is either not detected or very far from the camera.
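For illustration, a minimal sketch of this BBox extraction step is given below; it assumes key-points are stored as \(18\times 2\) rows following Sect. 3.1, the part-index grouping follows the OpenPose COCO layout (an assumption), and all helper names are ours.

```python
import numpy as np

# Illustrative grouping of key-point indices into the hierarchical BBoxes;
# the index assignment assumes the OpenPose COCO part ordering.
BBOX_PARTS = {
    "face":       [0, 14, 15, 16, 17],        # nose, eyes, ears
    "upper_body": [1, 2, 5, 8, 11],           # neck, shoulders, hips
    "lower_body": [8, 11, 9, 12, 10, 13],     # hips, knees, ankles
    "left_arm":   [5, 6, 7],                  # left shoulder, elbow, wrist
    "right_arm":  [2, 3, 4],                  # right shoulder, elbow, wrist
    "full_body":  list(range(18)),            # encloses all key-points
}

def extract_bboxes(kp, offset=0.10, min_area=600):
    """Return {name: (x1, y1, x2, y2)} BBoxes spanning the visible key-points
    of one person (an 18x2 row of KP(I)), padded by `offset` per dimension;
    boxes smaller than `min_area` square pixels are discarded."""
    boxes = {}
    for name, parts in BBOX_PARTS.items():
        pts = kp[parts]
        pts = pts[pts[:, 0] >= 0]              # ignore undetected key-points
        if len(pts) < 2:
            continue
        (x1, y1), (x2, y2) = pts.min(axis=0), pts.max(axis=0)
        dx, dy = offset * (x2 - x1), offset * (y2 - y1)
        x1, y1, x2, y2 = x1 - dx, y1 - dy, x2 + dx, y2 + dy
        if (x2 - x1) * (y2 - y1) >= min_area:
            boxes[name] = (x1, y1, x2, y2)
    return boxes
```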

Once the BBox areas are selected, we exploit their pairwise structural properties as features for person ReId; specifically, we compare the structural similarities between image patches pertaining to the face, upper-body, lower-body, left-arm, right-arm, and the full body of a person. Based on their aggregated similarities, we evaluate the pair-wise association between each person as seen by the leader (in \(I_{leader}\)) and by the follower (in \(I_{follower}\)). The structural similarity (Wang et al., 2004) for a particular pair of single-channel rectangular image-patches (\({\mathbf {x}}\), \({\mathbf {y}}\)) is evaluated based on three properties: luminance \(l({\mathbf {x}},{\mathbf {y}}) = {2 {\varvec{\mu }}_{\mathbf {x}} {\varvec{\mu }}_{\mathbf {y}}}/{({\varvec{\mu }}_{\mathbf {x}}^2+{\varvec{\mu }}_{\mathbf {y}}^2)}\), contrast \(c({\mathbf {x}},{\mathbf {y}}) = {2 {\varvec{\sigma }}_{\mathbf {x}} {\varvec{\sigma }}_{\mathbf {y}}}/{({\varvec{\sigma }}_{\mathbf {x}}^2+{\varvec{\sigma }}_{\mathbf {y}}^2)}\), and structure \(s({\mathbf {x}},{\mathbf {y}}) = {{\varvec{\sigma }}_{\mathbf {xy}}}/{{\varvec{\sigma }}_{\mathbf {x}}{\varvec{\sigma }}_{\mathbf {y}}}\); here, \({\varvec{\mu }}_{\mathbf {x}}\) (\({\varvec{\mu }}_{\mathbf {y}}\)) denotes the mean of image patch \({\mathbf {x}}\) (\({\mathbf {y}}\)), \({\varvec{\sigma }}_{\mathbf {x}}^2\) (\({\varvec{\sigma }}_{\mathbf {y}}^2\)) denotes the variance of \({\mathbf {x}}\) (\({\mathbf {y}}\)), and \({\varvec{\sigma }}_{\mathbf {xy}}\) denotes the cross-correlation between \({\mathbf {x}}\) and \({\mathbf {y}}\). The structural similarity metric (SSIM) is then defined as:

$$\begin{aligned} SSIM({\mathbf {x}},{\mathbf {y}}) = l({\mathbf {x}},{\mathbf {y}}) c({\mathbf {x}},{\mathbf {y}}) s({\mathbf {x}},{\mathbf {y}}) = \frac{2 {\varvec{\mu }}_{\mathbf {x}} {\varvec{\mu }}_{\mathbf {y}} }{{\varvec{\mu }}_{\mathbf {x}}^2+{\varvec{\mu }}_{\mathbf {y}}^2} \times \frac{2 {\varvec{\sigma }}_{\mathbf {xy}}}{{\varvec{\sigma }}_{\mathbf {x}}^2+{\varvec{\sigma }}_{\mathbf {y}}^2}. \end{aligned}$$

In order to ensure numeric stability, two standard constants \(c_1 = (255k_1)^2\) and \(c_2 = (255k_2)^2\) are added as:

$$\begin{aligned} SSIM({\mathbf {x}},{\mathbf {y}}) = \frac{2 {\varvec{\mu }}_{\mathbf {x}} {\varvec{\mu }}_{\mathbf {y}} + c_1}{{\varvec{\mu }}_{\mathbf {x}}^2+{\varvec{\mu }}_{\mathbf {y}}^2 + c_1} \times \frac{2 {\varvec{\sigma }}_{\mathbf {xy}} + c_2}{{\varvec{\sigma }}_{\mathbf {x}}^2+{\varvec{\sigma }}_{\mathbf {y}}^2 + c_2}. \end{aligned}$$
(1)

We use \(k_1=0.01\), \(k_2=0.03\), and an \(8\times 8\) sliding window in our implementation. Additionally, we resize the patches extracted from \(I_{leader}\) to match the dimensions of their corresponding pairs in \(I_{follower}\). Then, we apply Eq. 1 on every channel (RGB) and use their average value as the similarity metric on a scale of [0, 1]. Specifically, we use this metric for person ReId as follows:

  • We only consider the mutually visible body-parts for evaluating the pair-wise SSIM values. This choice is important to enforce meaningful comparisons; otherwise, it is equivalent to using only the full-body BBox, which we found to be highly inaccurate.

  • Each person in \(I_{follower}\) is associated with the most similar person in \(I_{leader}\), i.e., the one with the maximum SSIM value. However, the association is discarded if that value is less than a threshold \(\delta _{min}=0.4\), which is chosen by an AUC (area under the curve)-based analysis (see Sect. 4.2). This reduces the risk of inaccurate associations, particularly when some people are visible to only one of the robots; a minimal sketch of this association procedure is given after this list.
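The rule above can be summarized as follows. This is a simplified sketch in which SSIM is computed per channel with scikit-image's structural_similarity (a toolchain assumption; the paper's Eq. 1 with an \(8\times 8\) window is equivalent up to the window choice, since scikit-image requires an odd window size), and all helper names are ours.

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity  # assumed toolchain

def crop(img, box):
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    return img[max(y1, 0):y2, max(x1, 0):x2]

def patch_similarity(patch_l, patch_f):
    """Mean per-channel SSIM (Eq. 1) between a leader and a follower patch;
    the leader patch is resized to the follower patch's dimensions first."""
    patch_l = cv2.resize(patch_l, (patch_f.shape[1], patch_f.shape[0]))
    return float(np.mean([structural_similarity(patch_l[..., c], patch_f[..., c],
                                                win_size=7, K1=0.01, K2=0.03,
                                                data_range=255)
                          for c in range(3)]))

def associate_people(bboxes_f, bboxes_l, img_f, img_l, delta_min=0.4):
    """Associate each person seen by the follower with the most similar person
    seen by the leader, using only their mutually visible body-part BBoxes."""
    matches = {}
    for i, bf in enumerate(bboxes_f):
        best_j, best_score = None, -1.0
        for j, bl in enumerate(bboxes_l):
            common = set(bf) & set(bl)          # mutually visible parts only
            if not common:
                continue
            score = np.mean([patch_similarity(crop(img_l, bl[k]),
                                              crop(img_f, bf[k]))
                             for k in common])
            if score > best_score:
                best_j, best_score = j, score
        if best_score >= delta_min:             # discard weak associations
            matches[i] = best_j
    return matches
```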

Fig. 5 Results of estimating structure from motion using only human pose-based key-points as features

Fig. 6 Structure from motion for a two-view case using only human pose-based key-points as features (Color figure online)

3.3 Key-point refinement

Once the specific persons are identified, i.e., the rows of \(\mathbf {KP}(I_{leader})\) and \(\mathbf {KP}(I_{follower})\) are associated, the mutually visible key-points are paired together to form correspondences. Although the key-points are ordered and OpenPose localizes them reasonably well, they cannot be readily used as geometric correspondences due to perspective distortions and noise. We attempt to solve this problem by designing an iterative optimization algorithm that refines the noisy correspondences based on their structural properties in a \(32\times 32\) neighborhood. By denoting \({\varvec{\phi }}_I({\mathbf {p}})\) as the \(32\times 32\) image-patch centered at \({\mathbf {p}}=[p_x, p_y]^T\) in image I, we define a loss for each correspondence \(({\mathbf {p}}_l \in I_{leader}, {\mathbf {p}}_f \in I_{follower})\) as:

$$\begin{aligned} L({\mathbf {p}}_l, {\mathbf {p}}_f) = 1 - SSIM({\varvec{\phi }}_{I_{leader}}({\mathbf {p}}_{l}), {\varvec{\phi }}_{I_{follower}}({\mathbf {p}}_{f})). \end{aligned}$$
(2)

Then, we refine each initial key-point correspondence \(({\mathbf {p}}_l^{0}, {\mathbf {p}}_f^{0})\) by minimizing the following function:

$$\begin{aligned} {\mathbf {p}}_f^* = \mathop {\mathrm {argmin}}_{{\mathbf {p}}} \quad L({\mathbf {p}}_{l}^{0}, {\mathbf {p}}) \quad \text {s.t.} \quad ||{\mathbf {p}}-{\mathbf {p}}_{f}^{0}||_\infty <32. \end{aligned}$$
(3)

As Eq. 3 suggests, we fix \({\mathbf {p}}_l={\mathbf {p}}_l^{0}\) and refine \({\mathbf {p}}_f\) (initialized at \({\mathbf {p}}_f^{0}\)) to maximize \(SSIM({\varvec{\phi }}_{I_{leader}}({\mathbf {p}}_{l}), {\varvec{\phi }}_{I_{follower}}({\mathbf {p}}_{f}))\). In our implementation, we adopt a gradient-based refinement algorithm that performs the following iterative update:

$$\begin{aligned} {\mathbf {p}}_f^{t+1} = {\mathbf {p}}_f^t - \eta \cdot \nabla L({\mathbf {p}}_l^0, {\mathbf {p}}_f^t). \end{aligned}$$
(4)

We follow the procedures suggested in (Avanaki, 2009; Otero & Vrscay, 2014) for computing the gradient of SSIM. For fast processing, we vertically stack all the key-points and their gradients to perform the optimization simultaneously with a fixed learning rate of \(\eta =0.003\) for a maximum of 100 iterations. We present empirical validations for the choices of the refinement resolution and other hyper-parameters in Sect. 4.2.
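A simplified, per-key-point sketch of this refinement loop (Eqs. 2–4) is shown below; it assumes grayscale images, uses scikit-image's structural_similarity, and approximates the SSIM gradient with central finite differences instead of the analytic form of Avanaki (2009), so it illustrates the update rule rather than reproducing the exact implementation.

```python
import numpy as np
from skimage.metrics import structural_similarity  # assumed toolchain

def patch(img, p, half=16):
    """32x32 patch of a grayscale image centered at p (border handling omitted)."""
    x, y = int(round(p[0])), int(round(p[1]))
    return img[y - half:y + half, x - half:x + half]

def loss(img_l, img_f, p_l, p_f):
    """Eq. 2: one minus the SSIM of the two 32x32 patches."""
    return 1.0 - structural_similarity(patch(img_l, p_l), patch(img_f, p_f),
                                       win_size=7, data_range=255)

def refine_keypoint(img_l, img_f, p_l0, p_f0, eta=0.003, max_iter=100):
    """Refine one follower key-point by the iterative update of Eq. 4,
    constrained to the refinement region of Eq. 3."""
    p_l0, p_f = np.asarray(p_l0, float), np.asarray(p_f0, float)
    for _ in range(max_iter):
        grad = np.zeros(2)
        for d in range(2):                      # central finite differences
            e = np.zeros(2)
            e[d] = 1.0
            grad[d] = (loss(img_l, img_f, p_l0, p_f + e)
                       - loss(img_l, img_f, p_l0, p_f - e)) / 2.0
        p_f = p_f - eta * grad                                            # Eq. 4
        p_f = np.clip(p_f, np.asarray(p_f0) - 31, np.asarray(p_f0) + 31)  # Eq. 3
    return p_f
```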

3.4 Robot-to-robot pose estimation

Once the mutually visible key-points are associated and refined, the follower robot uses the corresponding 3D positions (provided by the leader) to estimate its relative pose by solving a PnP problem. Thus, we require that the leader robot is equipped with a stereo camera (or an RGBD camera) so that it can triangulate the refined key-points using epipolar constraints (or use the depth sensor) to represent the key-points in 3D. Let \({\mathbf {x}}_l\) denote the 3D locations of the key-points in the leader’s coordinate frame, and \({\mathbf {p}}_f\) denote their corresponding 2D projections on the follower’s camera. Then, assuming the cameras are synchronized, the PnP problem is formulated as follows:

$$\begin{aligned} {\mathbf {T}}_f^l = \mathop {\mathrm {argmin}}_{{\mathbf {T}}_f^l} ||{\mathbf {p}}_f - {\mathbf {K}}_f {\mathbf {T}}_f^l {\mathbf {x}}_l||^2. \end{aligned}$$
(5)

Here, \({\mathbf {K}}_f\) is the intrinsic matrix of the follower’s camera and \({\mathbf {T}}_f^l\) is its 6-DOF transformation relative to the leader. In our implementation, we follow the standard iterative solution for PnP using RANSAC (Zheng et al., 2013).
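In practice, this step maps directly onto OpenCV's RANSAC-based PnP solver; a minimal sketch under that assumption (the variable names are ours) is:

```python
import cv2
import numpy as np

def estimate_relative_pose(x_l, p_f, K_f, dist_f=None):
    """Solve Eq. 5 for the follower's pose relative to the leader.

    x_l : (N, 3) key-point positions in the leader's coordinate frame
    p_f : (N, 2) corresponding pixel projections in the follower's image
    K_f : (3, 3) intrinsic matrix of the follower's camera
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(x_l, dtype=np.float64),
        np.asarray(p_f, dtype=np.float64),
        K_f, dist_f, flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("PnP failed: not enough valid correspondences")
    R, _ = cv2.Rodrigues(rvec)     # rotation vector -> 3x3 rotation matrix
    return R, tvec, inliers
```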

4 Experimental analysis

We conduct several experiments with 2-DOF and 3-DOF robots to evaluate the performance and feasibility of the proposed relative pose estimation method. We present these experimental details, analyze the results, and discuss various operational considerations in the following sections.

Table 1 A quantitative performance comparison for various person ReId models on standard datasets; a set of 150 test images from each dataset is used for comparison
Table 2 Effectiveness of the proposed person ReId method on real-world data; each set contains 100 images of multiple humans in ground and underwater scenarios
Fig. 7 Empirical selection of hyper-parameters: (a) the SSIM threshold for pose association in the proposed person ReId module and (b) the resolution of the key-point refinement region. The evaluation is performed on the combined set of 250 images containing a total of 687 person associations with 8256 key-point correspondences

4.1 Proof of concept: structure from motion

First, we perform experiments to validate that human pose-based key-points can be used as geometric correspondences for relative pose estimation. As illustrated in Fig. 5a, we emulate a structure-from-motion setup with humans: we use an intrinsically calibrated monocular camera to capture a group of nine (static) people from multiple views. Here, the goal is to estimate the camera poses and reconstruct the 3D structures of the humans using only their body-poses as features.

In the evaluation, we first use OpenPose to detect the human pose-based 2D key-points in the images (Fig. 5a). Then, we utilize the proposed person ReId and key-point refinement modules to obtain the feature correspondences across multiple views (Fig. 5b). Subsequently, we follow the standard procedures for structure from motion (Hartley & Zisserman, 2003): fundamental matrix computation using the eight-point algorithm with RANSAC, essential matrix computation, camera pose estimation by enforcing the cheirality constraint, and linear triangulation. Finally, the triangulated 3D points and camera poses are refined using bundle adjustment. As demonstrated in Fig. 5c, the spatial structure of the reconstructed points on the human bodies and the camera poses are consistent with our setup. Results of another experiment for a two-view case are shown in Fig. 6, which further validate that the estimated camera poses are comparable to the ground truth, i.e., the analogous SIFT feature-based estimates. Next, we demonstrate the effectiveness of our proposed refinement modules in ensuring this robust pose estimation performance.
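For reference, the two-view portion of this pipeline can be reproduced with standard OpenCV calls; the sketch below assumes already-associated pixel correspondences pts1, pts2 and a shared intrinsic matrix K, and omits the bundle-adjustment step.

```python
import cv2
import numpy as np

def two_view_reconstruction(pts1, pts2, K):
    """Recover the relative camera pose and triangulate the refined key-point
    correspondences between two views (pts1, pts2: (N, 2) pixel arrays)."""
    pts1 = np.asarray(pts1, dtype=np.float64)
    pts2 = np.asarray(pts2, dtype=np.float64)

    # Essential matrix with RANSAC (the fundamental-matrix step is folded in
    # here since the camera is intrinsically calibrated).
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)

    # recoverPose applies the cheirality check to pick the valid (R, t)
    # among the four decompositions of E.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

    # Linear triangulation of the correspondences.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    pts3d = (pts4d[:3] / pts4d[3]).T
    return R, t, pts3d
```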

Fig. 8 Necessity of the proposed key-point refinement process; results correspond to the experiment illustrated in Fig. 5

Fig. 9 3-DOF ground experiment: one leader and one follower robot; the follower robot’s trajectory is shown by red arrows (Color figure online)

Fig. 10 An experiment to evaluate the accuracy of 2D relative pose estimation with two planar robots and two mutually visible humans (Color figure online)

4.2 Effectiveness of the body-pose association and key-point refinement modules

Fig. 11 Illustrations where a lack of natural landmarks limits the utility of standard feature detectors. As seen, the presence of a single human in the scene facilitates considerably more anatomical key-point correspondences than the point-based features

Person ReId is clearly essential for associating mutually visible persons across different views. As mentioned in Sect. 3.2, we focus on achieving fast association by making use of the local structural properties around the anatomical key-points in the image space. In contrast, the SOTA person ReId approaches adopt deep visual feature extractors that are computationally demanding. In Table 1, we quantitatively evaluate the SOTA models Aligned ReId (Zhao et al., 2017), Deep person ReId (Li et al., 2014), and Triplet-loss ReId (Zheng et al., 2011) based on accuracy and mean average precision (mAP) on two standard datasets. Specifically, a test-set containing 150 instances from each of the Market-1501 and CUHK-03 datasets is used for the evaluation; their run-times on an NVIDIA™ Jetson TX2 are also shown for comparison. The results indicate that although these models (once trained on similar data) perform well on standard datasets, they are computationally too expensive for single-board embedded platforms.

Moreover, as demonstrated in Table 2, these off-the-shelf models do not perform that well on high-resolution real-world images. Although their performance can be improved by training on more comprehensive real-world data, the computational complexity remains a barrier. To address this, the proposed person ReId module provides a significantly faster run-time and better portability, as it does not require rigorous large-scale training. Its only hyper-parameter is the SSIM threshold \(\delta _{min}\) (see Sect. 3.2), which we select by a standard AUC-based analysis of the ROC (receiver operating characteristic) curve. As shown in Fig. 7a, we choose \(\delta _{min}=0.4\), which corresponds to \(83.5\%\) true-positive and \(6.5\%\) false-positive rates for person ReId on the combined test set of 250 images containing 687 person associations. Additionally, we select the key-point refinement resolution through an ablation experiment with 8256 key-point correspondences. We observe that the optimal key-point location is found within \(25\times 25\) pixels of the initial estimate by OpenPose for over \(96\%\) of the cases. As shown in Fig. 7b, we make a more conservative choice of a \(32\times 32\) refinement region in our implementation.
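The threshold selection itself is a standard ROC analysis; a sketch is given below, assuming binary same-person labels for the candidate pairs and their aggregated SSIM scores, with scikit-learn as the assumed toolchain and Youden's J shown as one possible selection criterion.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def select_ssim_threshold(scores, labels):
    """Pick delta_min from the ROC curve of pair-wise association scores.

    scores : aggregated SSIM score for each candidate person pair
    labels : 1 if the pair is a true association, 0 otherwise
    """
    fpr, tpr, thresholds = roc_curve(labels, scores)
    print("AUC =", auc(fpr, tpr))
    # One common criterion is Youden's J = TPR - FPR; the operating point
    # reported in the paper (delta_min = 0.4: 83.5% TPR, 6.5% FPR) reflects
    # a similar trade-off between the two rates.
    best = int(np.argmax(tpr - fpr))
    return thresholds[best], tpr[best], fpr[best]
```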

Finally, we evaluate the utility and effectiveness of the proposed key-point refinement algorithm based on re-projection errors and compare the results with traditional SIFT feature-based reconstruction. As Fig. 8a demonstrates, the 3D reconstruction and camera pose estimation with raw key-points are inaccurate because the unrefined correspondences are invalid in a perspective geometric sense. As Fig. 8b shows, the average re-projection error for the refined key-points reduces to \(6.85\times 10^{-5}\) pixels, which is acceptable considering the fact that there are ten times fewer anatomical key-points than SIFT feature-based key-points. This evaluation corresponds to the experiment presented in Fig. 5, which shows that the refined key-points yield accurate scene reconstruction and camera pose estimation. Another qualitative validation of the iterative key-point refinement algorithm and its convergence behavior can be found in Fig. 6.

Fig. 12 6-DOF underwater experiment: one leader and two follower robots (aerial view)

4.3 Robot-to-robot 3-DOF pose estimation

We also perform experiments for 3-DOF robot-to-robot relative pose estimation with 2D robots. In the particular scenario shown in Fig. 9, we use two planar robots (one leader and one follower) and two mutually visible humans in the scene. The robot with an AR-tag on its back is used as the follower robot while the other robot is used as the leader. The AR-tag is used to obtain the follower’s ground truth relative pose for comparison. On the other hand, the leader robot is equipped with an RGBD camera; it communicates with the follower and shares the 3D locations of the mutually visible key-points. Specifically, it detects the human pose-based 2D key-points and associates the corresponding depth information to represent them in 3D. Subsequently, the follower robot uses this information to localize itself relative to the leader by following the proposed estimation method.

As demonstrated in Fig. 9, we move the follower robot in a rectangular pattern and evaluate the 3-DOF pose estimates relative to the static leader robot. We present the results in Fig. 10; it shows that the follower robot’s pose estimates are very close to their respective ground truth. Overall, we observe an average error of \(0.0475\%\) in translation (cm) and an average error of \(0.8625^{\circ }\) in rotation, which is reasonably accurate. We obtain similar qualitative and quantitative performance with a dynamic leader as well. Next, we present field experimental validations of the relative pose estimation performance in feature-deprived underwater scenarios.

Fig. 13 An underwater experiment for 3D relative pose estimation using one leader and two follower robots

Fig. 14 Three test cases for the proposed person ReId module are shown: each query is matched with a gallery of candidate images (inside the blue box); the top three matches and respective scores are shown alongside the query image. The scores represent averaged SSIM scores for the mutually visible body-part BBoxes (see Sect. 3.2)

4.4 Robot-to-robot 6-DOF pose estimation in adverse underwater visual conditions

As seen in Fig. 11, standard point-based feature detectors fail to generate a large pool of reliable correspondences when there are very few salient features and landmarks in the scene. Consequently, the sampling-based parameter estimation techniques (e.g., RANSAC) often generate inaccurate results in feature-deprived underwater scenarios. However, we demonstrate that human pose-based key-points can still be refined to establish reliable geometric correspondences for robot-to-robot relative pose estimation. Moreover, we get a reasonably large pool of correspondences with only one or two humans in the scene, which is fairly common in cooperative underwater missions.

We perform several field experiments in human–robot collaborative setups; Fig. 12 shows the setup of a particular underwater experiment where we capture human body-poses from different perspectives to estimate the 6-DOF transformations of two follower robots relative to a leader robot. The leader robot is equipped with a stereo camera; hence, the 3D information of the human pose-based key-points is obtained via stereo triangulation. Subsequently, we find the corresponding 2D projections on the follower robots’ cameras using the proposed person ReId and key-point refinement processes. Finally, we estimate the follower-to-leader relative poses from their respective PnP solutions.
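On the leader's side, this triangulation step is standard; a minimal sketch, assuming a calibrated and rectified stereo pair with known projection matrices P_left and P_right (e.g., obtained from stereo calibration), is:

```python
import cv2
import numpy as np

def triangulate_keypoints(kp_left, kp_right, P_left, P_right):
    """Triangulate matched key-points from the leader's stereo pair.

    kp_left, kp_right : (N, 2) pixel coordinates of the same key-points
    P_left, P_right   : (3, 4) projection matrices of the rectified pair
    Returns (N, 3) positions in the leader's (left-camera) frame.
    """
    pts4d = cv2.triangulatePoints(P_left, P_right,
                                  np.asarray(kp_left, np.float64).T,
                                  np.asarray(kp_right, np.float64).T)
    return (pts4d[:3] / pts4d[3]).T
```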

We present a particular snapshot in Fig. 13a; it illustrates the leader and follower robots’ perspectives and the associated human pose-based key-points. Subsequently, Fig. 13b, c demonstrate the geometric validity of those key-point correspondences and the reconstructed 3D points are shown in Fig. 13d. As seen, the estimated 3D structure is consistent with the mutually visible humans’ body-poses. Finally, the estimated 6-DOF poses of the follower robots relative to the leader robot are shown in Fig. 13e.

Such leader-to-follower pose estimates are useful in cooperative diver following (Islam et al., 2019), convoying (Shkurti et al., 2017), and other interactive tasks while operating in close proximity. The robust performance and efficient implementation of the proposed modules make it suitable for use by visually-guided underwater robots in human–robot collaborative applications. However, there are a few practicalities involved which can affect the performance; next, we discuss these aspects and their possible solutions based on our experimental findings.

4.5 Discussion: operational challenges and practicalities

Synchronized cooperation: A major operational requirement of multi-robot cooperative systems is the ability to register synchronized measurements in a common frame of reference, which can be quite challenging in practice. For problems such as ours, an effective solution is to maintain a buffer of time-stamped measurements and register them as a batch using a temporal sliding window. We used such timestamp-based buffer schedulers (ROS.org, 2018) in our implementation with reasonable robustness. However, the challenge remains in finding instantaneous relative poses, especially when both robots are in motion. Nevertheless, these aspects are independent of the choice of features/key-points for relative pose estimation and are generic requirements of multi-robot cooperative systems.
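In a ROS 1 implementation, this buffer scheduling maps naturally onto message_filters; a minimal sketch (topic names and message types are placeholders) is:

```python
import rospy
import message_filters
from sensor_msgs.msg import Image

def synced_callback(img_leader, img_follower):
    # Both messages fall within the same temporal sliding window; register
    # them as one measurement pair for relative pose estimation.
    pass

rospy.init_node("relative_pose_estimator")
sub_leader = message_filters.Subscriber("/leader/camera/image_raw", Image)
sub_follower = message_filters.Subscriber("/follower/camera/image_raw", Image)

# Timestamp-based buffer: queue up to 10 messages per topic and pair those
# whose stamps differ by at most 0.1 s.
sync = message_filters.ApproximateTimeSynchronizer(
    [sub_leader, sub_follower], queue_size=10, slop=0.1)
sync.registerCallback(synced_callback)
rospy.spin()
```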

Trade-off between robustness and efficiency: It is quite challenging to ensure a fast yet robust performance for visual feature-based body-pose estimation and person ReId on limited computational resources of embedded platforms. This trade-off between robustness and efficiency led us to design fast body-pose association and refinement modules. These efficient modules enable us to achieve an average end-to-end run-time of 375-420 milliseconds for relative pose estimation on Jetson TX2. Note that the proposed person ReId and key-point refinement account for only 195-240 milliseconds (up to nine humans in the scene). Hence, faster human body-pose detectors (than OpenPose) can significantly boost the end-to-end run-time of the system.

In Sect. 4.2, we demonstrated that the proposed person ReId model performs reasonably well in practice despite its simplistic design. One operational benefit in our application is that the humans are seen simultaneously from every perspective; hence, both their appearances and body-poses remain consistent. We provide a demonstration of this benefit in Fig. 14; it shows three queries for ReId where humans wearing similar suits/wearables are seen at various distances and orientations from the camera. Although the gallery images contain several humans with similar appearances, we observe that the top three results correspond to the best matches both in terms of human appearance and body-pose. We postulate that computing aggregated similarity scores on local pose-based BBoxes contributes to these results. Since the general-purpose person ReId problem is significantly harder and requires more sophisticated computational pipelines (Zhao et al., 2017; Li et al., 2014), our proposed module takes advantage of the body-pose consistency across viewpoints, which enables a faster run-time.

Number of humans and relative viewing angle: We observed a couple of other practical issues during the field experiments. First, the presence of multiple humans in the scene helps to ensure reliable pose estimation performance. We found that two or more mutually visible humans are ideal for establishing a large pool of reliable correspondences. Additionally, the pose estimation performance is affected by the relative viewing angle; specifically, it often fails to find correct associations when the \(\angle \)leader-human-follower is larger than (approximately) \(135^{\circ }\). This results in a situation where the robots are exclusively looking at opposite sides of the person without enough common key-points. Moreover, other than temporal lags, we did not observe significant deviations in pose estimation performance with an increasing number of robots within this viewing angle; note that we used up to three follower robots in our experiments.

Practical use cases: Other than the operational challenges discussed above, the proposed system does not incur additional constraints or require any prior knowledge such as global positioning information, the robots’ initial positions, or 3D sensing capabilities of the follower robots. As mentioned, it is useful for robot-to-robot pose estimation in human–robot collaborative applications where multiple robots need to maintain spatial coordination during task execution. The prominent use cases are multi-robot convoying, diver following, cooperative mapping, inspection, and surveying. Moreover, it is best suited to applications where the human body-poses are already evaluated for other purposes (e.g., detection/tracking, human–robot communication). It is important to note that the proposed system is not intended to be used as a replacement for full-form visual SLAM or cooperative global localization. Nevertheless, if the leader robot has a GPS or a USBL receiver, the follower robots can essentially localize themselves globally by using visual sensing alone.

5 Conclusions and future work

In this paper, we explore the feasibility of using human body-poses as markers to establish reliable multi-view geometric correspondences and eventually solve the robot-to-robot relative pose estimation problem. First, we use OpenPose for extracting the pose-based 2D key-points pertaining to the humans in the scene. Then we associate the humans seen from multiple views using an efficient person re-identification model. Subsequently, we refine the key-point correspondences using an iterative optimization algorithm based on their local structural similarities in the image space. Finally, we use the 3D locations of the key-points (triangulated by the leader robot) and their corresponding 2D projections (seen by the follower robot) to formulate a PnP problem and solve for the unknown pose of the follower robot relative to the leader. We perform extensive experiments in terrestrial and underwater environments to investigate the applicability of the proposed relative pose estimation method; the experimental results validate its effectiveness both for 2D and 3D robots. We also discuss the relevant operational challenges and propose efficient solutions to deal with them. In the future, we seek to improve the end-to-end run-time of the proposed system and plan to use it in practical applications such as multi-robot convoying, cooperative source-to-destination planning, etc. Additionally, we aim to investigate the applicability of DensePose (Alp Güler et al., 2018) in our work, which can potentially provide significantly more key-point correspondences per person compared to OpenPose.