
1 Introduction

While learning-based techniques have recently demonstrated great performance in estimating the 6D pose (i.e. the 3D translation and rotation), they require a huge amount of training data [30, 44, 53]. Furthermore, contrary to most 2D computer vision tasks such as classification, object detection and segmentation, acquiring real world 6D object pose annotations is much more labor-intensive, time-consuming, and error-prone [13, 21].

Fig. 1.

Abstract illustration of our proposed method. We visualize the 6D pose by overlaying the image with the corresponding transformed 3D bounding box. To circumvent the use of real 6D pose annotations, we first train our model purely on synthetic RGB data (a). Second, employing a large amount of unlabeled real RGB-D images (b), we significantly improve its performance (right). Blue denotes the ground truth pose, while Red and Green show the results before and after applying our self-supervision, respectively. (Color figure online)

In order to deal with the lack of real annotations, one common approach is to simulate a large amount of synthetic images [49, 51]. This is especially appealing for object pose estimation, as one usually aims at estimating the 6D pose from an image w.r.t. the corresponding CAD model. Knowing the CAD model enables easy generation of an enormous number of RGB images by randomly sampling 6D poses. Many approaches typically rely on rendering the models using OpenGL and placing them on random background images (drawn from large-scale 2D object datasets such as COCO [33]) in order to impose invariance to changing scenes [23, 40]. Recent works propose to instead employ physically-based rendering to produce high-quality renderings and to additionally enforce real physical constraints, as these can provide additional cues for the 6D pose [16, 56].

Despite compelling results, these methods usually still exhibit inferior performance when inferring from real world data, due to the persisting domain gap between real and synthetic data. Although techniques for domain adaptation [2], domain randomization [52] and photorealistic rendering [16] can mitigate the problem to some extent, the performance is still far from satisfactory.

This motivated us to investigate the problem from an entirely different angle. Humans have the amazing ability to learn about the 3D world, whilst only perceiving it through 2D images. Moreover, they can even learn 3D world properties without supervision from another human or labels, in a self-supervised fashion, by making observations and validating whether these observations are in accordance with the expected outcome [50]. In our context, while labeling the 6D pose is a severe bottleneck, recording unannotated data can be easily achieved at scale. Therefore, similar to how humans learn, we aim at teaching a neural network to reason about the 6D pose of an object by leveraging these unsupervised examples. As shown in Fig. 1, we first train our method fully supervised with synthetic data. Afterwards, employing unannotated RGB-D data, we make use of self-supervised learning to enhance the model’s performance on real data.

To accomplish this, it is required to understand 3D properties solely from 2D images. The mechanism of experiencing the 3D world as images on the eye’s retina is known as rendering and has also been extensively explored in Computer Graphics [41]. Unfortunately, rendering is also known to be non-differentiable due to the rasterization step, as gradients cannot be computed for the argmax function. Nevertheless, many approaches for differentiable rendering have recently been proposed. The real gradient is thereby either approximated [22, 37], or computed analytically by approximating the rasterization function itself [6, 35].

In summary, we make the following contributions. i) To the best of our knowledge, we are the first to conduct self-supervised 6D object pose estimation from real data, without the need of 6D labels. ii) Leveraging neural rendering, we formulate a self-supervised 6D pose estimation solution by means of visual and geometric alignment. iii) We experimentally show that the proposed method, which we dub \(\mathrm{Self6D}\), outperforms state-of-the-art methods for monocular 6D object pose estimation trained without real annotations by a large margin.

2 Related Work

We first introduce recent work in monocular 6D pose estimation. Afterwards, we discuss important methods from neural rendering as they form a core part of our (as well as other) self-supervised learning frameworks. We then outline other successful approaches grounded on self-supervised learning. Lastly, we take a brief look at domain adaptation in the field of 6D pose, since our method can be considered an implicit formulation to close the synthetic-to-real domain gap.

2.1 Monocular 6D Pose Estimation

Recently, monocular 6D pose estimation has received a lot of attention and several very promising works have been proposed  [15].

One major branch is grounded on establishing 2D-3D correspondences between the image and the 3D CAD model. After estimating these correspondences, PnP is commonly employed to solve for the 6D pose. Inspired by  [3, 4], Rad et al. propose to employ a CNN to estimate the 2D projections of the 3D bounding box corners in image space  [46]. Similarly, [17, 44] also regress 2D projections of associated sparse 3D keypoints, however, both employ segmentation paired with voting to improve the reliability. In contrast, [30, 43, 61] ascertain dense 2D-3D correspondences, rather than sparse ones.

Another branch of work learns a pose embedding, which can be utilized for later retrieval. In particular, inspired by [24, 58], the approach of [52] employs an Augmented AutoEncoder (AAE) to learn latent representations for the 3D rotation.

A few methods also directly regress the 6D pose. For instance, while [23] extends [36] to also classify the viewpoint and in-plane rotation, [38] further adjusts [23] to implicitly deal with ambiguities via multiple hypotheses (MHP). In [59] and [29] the authors minimize a point matching loss.

The majority of these methods [17, 43, 46, 53, 59] exploit annotated real data to train their models. However, labeling real data commonly comes with a large cost in time and labor. Moreover, a shortage of sufficient real world annotations can lead to overfitting, despite exploiting strategies such as crop&paste [8, 21]. Other works, in contrast, fully rely on synthetic data to deal with these pitfalls [38, 52]. Nonetheless, their performance falls far behind the methods based on real data. We, thus, harness the best of both worlds: unannotated data can be easily obtained at scale and, combined with our self-supervision for pose, it enables us to outperform all methods trained on synthetic data by a large margin.

2.2 Neural Rendering

Rasterization is a core part of all traditional rendering pipelines. Nonetheless, rasterization involves discrete assignment operations, preventing the flow of gradients throughout the rendering process. A series of works has been devoted to circumventing this hard assignment in order to reestablish the gradient flow.

Loper and Black introduce the first differentiable renderer by means of a first-order Taylor approximation to calculate the derivative of pixel values [37]. In [22], the authors instead approximate the gradient as the potential change of a pixel’s intensity w.r.t. the mesh vertices. SoftRas [35] conducts rendering by aggregating the probabilistic contributions of each mesh triangle in relation to the rendered pixels. Consequently, the gradients can be calculated analytically, however, at the cost of extra computation. DIB-R [6] further extends [35] to render under a variety of different lighting conditions. In this work, we use DIB-R [6], since it can be considered state-of-the-art for neural rendering.

2.3 Recent Trends in Self-supervised Learning

Self-supervised learning, i.e. learning despite the lack of properly labeled data, has recently enabled a large number of applications, ranging from 2D image understanding all the way to depth estimation for autonomous driving. At their core, self-supervised learning approaches implicitly learn about a specific task through solving related proxy tasks. This is commonly achieved by enforcing different constraints, such as pixel consistencies across multiple views or modalities.

One prominent approach in this area is MonoDepth [9], which conducts monocular depth estimation by warping the 2D image points into another view and enforcing a minimum reprojection loss. Subsequently, many works extending MonoDepth have been introduced [10, 11, 45]. In visual representation learning, consistency is ensured by solving pretext tasks [26]. Another line of work explores self-supervised learning for 3D human pose estimation, leveraging multi-view epipolar geometry [25] or imposing 2D-3D consistency after lifting and reprojection of keypoints [5]. Self-supervised learning approaches using neural rendering have also been proposed in the field of 3D object and human body reconstruction from single RGB images [1, 20, 42, 57, 64].

In the domain of 6D pose estimation, self-supervised learning is still a rather unexplored field. [7] proposes a novel self-labeling pipeline with an interactive robotic manipulator. Essentially, running several methods for 6D pose estimation, they can reliably generate precise annotations. Nonetheless, the final 6D pose estimation model is still trained fully-supervised using the acquired data. In this work, we propose to instead directly employ self-supervision for 6D pose by enforcing visual and geometric consistencies on top of neural rendering.

2.4 Domain Adaptation for 6D Pose Estimation

Bridging the domain gap between synthetic and real data is crucial in 6D pose estimation. Many works tackle this problem by learning a transformation to align the synthetic and real domains via Generative Adversarial Networks (GANs)  [2, 28, 60] or by means of feature mapping  [47]. Exemplary, [28] uses a cross-cycle consistency loss based on disentangled representations to embed images onto a domain-invariant content space and a domain-specific attribute space. [47] instead maps the features of a color-based pose estimator to a depth-based pose estimator.

In contrast, works from domain randomization aim at learning domain-invariant attributes. For instance, harnessing random backgrounds and severe augmentations  [23, 52] or employing adversarial training to generate backgrounds and image augmentations  [60].

Fig. 2.

Our self-supervised training pipeline. Top: We start training our model for 6D pose estimation purely on synthetic RGB data, to predict a 3D rotation R, translation t and object instance mask \(M^P\). Using a large amount of unlabeled RGB-D images \((I^S, D^S)\), we enhance the model’s performance by means of self-supervised learning. We differentiably render (\(\mathcal {R}\)) the associated RGB-D image and mask \((I^R, D^R, M^R)\). Bottom: We impose various constraints to visually (a and b), and geometrically (c) align the 6D pose.

3 Self-supervised 6D Pose Estimation

In this work we aim at conducting 6D pose estimation from monocular images via self-supervised learning. To this end, we propose a novel model that can learn monocular pose estimation from both synthetic RGB data and real world unannotated RGB-D data. Employing neural rendering, the model can be self-supervised by establishing coherence between real and rendered images w.r.t. the 6D pose. Since this requires good initial pose estimates, we rely on a two-stage approach. As shown in Fig. 1, we start by training our model using synthetic RGB data only. Afterwards, we further enhance the pose estimation performance by leveraging unlabeled real world RGB-D data.

We harness different visual and geometric constraints to seek the best alignment w.r.t. 6D pose. Unfortunately, while a 3D model contains information about the visible and invisible regions, the depth map only covers the visible surface. This complicates supervision since the invisible points would mistakenly contribute to the alignment. Therefore, we aim to extract only the model’s visible surface given the current pose. This can be achieved in different ways: by culling the hidden points, or simply rendering the object in its current pose. Since we are required to render color for visual alignment, we resort to rendering depth for visible surface extraction, as it comes with no extra cost in computation.

We use the differentiable renderer DIB-R proposed by [6] to render the 6D pose estimates from our model. Since DIB-R is only able to render RGB images and object masks, we extend it to also provide the depth map fully differentiably. We additionally modify the camera projection to conduct a real perspective projection.\(^{1}\) Given the estimated 6D pose, consisting of the 3D rotation R and 3D translation t, together with the 3D CAD model \(\mathcal {M}\) and the camera intrinsics matrix K, we render the triplet \((I^R, D^R, M^R)\) consisting of the rendered RGB image \(I^R\), the rendered depth map \(D^R\) and the rendered mask \(M^R\)

$$\begin{aligned} \mathcal {R}(R, t, K, \mathcal {M}) = (I^R, D^R, M^R). \end{aligned}$$
(1)
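For illustration, the following minimal PyTorch sketch shows the kind of perspective projection we assume precedes the rasterization step; the function project_vertices and its argument layout are our own and not part of the DIB-R API, and the intrinsics in the usage example are dummy values.

```python
import torch

def project_vertices(R, t, K, verts):
    """Transform object-space vertices into camera space and project them
    with a full perspective model; rasterization itself is left to the
    differentiable renderer."""
    verts_cam = verts @ R.T + t              # (V, 3) points in camera space
    proj = verts_cam @ K.T                   # apply the intrinsics K
    pix = proj[:, :2] / proj[:, 2:3]         # perspective divide -> (V, 2)
    depth = verts_cam[:, 2]                  # per-vertex depth, used for D^R
    return pix, depth

# Example with dummy intrinsics and a random object 0.5 m in front of the camera.
K = torch.tensor([[572.4, 0.0, 320.0],
                  [0.0, 573.6, 240.0],
                  [0.0, 0.0, 1.0]])
R = torch.eye(3)
t = torch.tensor([0.0, 0.0, 0.5])
verts = torch.rand(100, 3) * 0.1 - 0.05      # dummy vertices within +/- 5 cm
pix, depth = project_vertices(R, t, K, verts)
```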

Architecture Details. Besides rendering, the prediction of the 3D rotation and translation also has to be differentiable in order to allow backpropagation. While methods based on establishing 2D-3D correspondences are currently dominating the field, it is infeasible to resort to them, as gradients cannot be computed for PnP. We therefore rely on a network architecture similar to ROI-10D [39], since it directly estimates rotation and translation. Unfortunately, the predicted poses from ROI-10D are not accurate enough to match the demands of our self-supervision; thus, we base our method on the more recent FCOS [54] detector. Moreover, a crucial part of our subsequent self-supervision requires object instance masks. Since no annotations are provided, we further extend ROI-10D to also estimate the visible object mask \(M^P\) for each detection.

Our model is grounded on the object detector FCOS using a ResNet-50 based feature pyramid network (FPN)  [31] backbone to compute 2D region proposals. The FPN feature maps from different levels are then fused and concatenated with the input RGB image and 2D coordinates  [34], from which the regions of interest are extracted via ROI-Align to predict masks and poses. Inspired by ROI-10D, we use different branches to predict the 3D rotation R parameterized as a 4D quaternion q, the 3D translation t defined as the 2D projection \((c_x,c_y)\) of the 3D object centroid and the distance z, and the visible object mask \(M^P\).
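To make this parameterization concrete, the sketch below (our own illustration, not the paper's code) converts a predicted quaternion into a rotation matrix and lifts the predicted centroid projection \((c_x,c_y)\) and distance z back to a 3D translation as \(t = z\,K^{-1}[c_x, c_y, 1]^T\); the (w, x, y, z) quaternion convention is an assumption.

```python
import torch

def quat_to_rotmat(q):
    """Convert a (possibly unnormalized) quaternion (w, x, y, z) into a 3x3
    rotation matrix."""
    q = q / q.norm()
    w, x, y, z = q
    return torch.stack([
        torch.stack([1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)]),
        torch.stack([2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)]),
        torch.stack([2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)]),
    ])

def recover_translation(cx, cy, z, K):
    """Lift the predicted centroid projection (cx, cy) and distance z back to
    a 3D translation: t = z * K^{-1} [cx, cy, 1]^T."""
    uv1 = torch.tensor([cx, cy, 1.0])
    return z * (torch.linalg.inv(K) @ uv1)
```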

To train the first stage, we use the focal loss [32] for classification and the GIoU loss [48] for bounding box regression. We rely on the binary cross-entropy loss for mask prediction. Following [29], we use the average distance of distinguishable model points as the objective function for the pose. The final loss can be summarized as

$$\begin{aligned} \mathcal {L} = \lambda _{class}\,\mathcal {L}_{class} + \lambda _{box}\,\mathcal {L}_{box} + \lambda _{mask}\,\mathcal {L}_{mask} + \lambda _{pose}\,\mathcal {L}_{pose}, \end{aligned}$$
(2)
$$\begin{aligned} \mathcal {L}_{pose} = \underset{x \in \mathcal {M}}{\mathrm{avg}} \Vert (Rx + t) - (\bar{R}x + \bar{t})\Vert _2, \end{aligned}$$
(3)

where \(\lambda _{class}, \lambda _{box}, \lambda _{mask}\) and \(\lambda _{pose}\) denote the balance factors for each task, \(\mathcal {M}\) denotes the 3D model, and \(\left[ R|t\right] , \left[ \bar{R}| \bar{t}\right] \) represent the predicted and ground truth poses, respectively. We kindly refer to the supplementary material for more details on the employed hyper-parameters.
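A minimal sketch of how these terms could be combined in PyTorch; the point-matching pose term mirrors Eq. 3, while the \(L_2\) norm and the uniform default weights are assumptions (the actual balance factors are given in the supplementary material).

```python
import torch

def pose_loss(R_pred, t_pred, R_gt, t_gt, model_points):
    """Average distance between the model points transformed by the predicted
    and by the ground-truth pose (point-matching objective, cf. Eq. 3)."""
    p_pred = model_points @ R_pred.T + t_pred   # (N, 3)
    p_gt = model_points @ R_gt.T + t_gt         # (N, 3)
    return (p_pred - p_gt).norm(dim=1).mean()

def first_stage_loss(l_class, l_box, l_mask, l_pose,
                     w_class=1.0, w_box=1.0, w_mask=1.0, w_pose=1.0):
    """Weighted sum of the four supervised terms (cf. Eq. 2)."""
    return w_class * l_class + w_box * l_box + w_mask * l_mask + w_pose * l_pose
```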

For simplicity of the following, we define all foreground pixels as \(N_+\) and all background pixels as \(N_-\). We further denote the set of all pixels as \(N=N_+ \cup N_-\).

Neural Rendering for Visual Alignment. The most intuitive way is to simply align the rendered image \(I^R\) with the sensor image \(I^S\) by directly applying a loss to both images. However, as the domain gap between \(I^S\) and \(I^R\) turns out to be very large, this does not work well in practice. In particular, lighting changes as well as reflections and bad reconstruction quality (especially in terms of color) oftentimes cause a high error despite good pose estimates, eventually leading to divergence in the optimization. Hence, in an effort to keep the domain gap as small as possible, we impose multiple constraints measuring different domain-independent properties. In particular, we assess different visual similarities w.r.t. mask, color, image structure, and high-level content.

Since object masks are naturally domain agnostic, they can provide a particularly strong supervision. As our data is unannotated, we resort to our predicted masks \(M^P\) as a weak supervision signal. However, since the predicted masks are imperfect, we utilize a modified cross-entropy loss [18], which recalibrates the weights of the positive and negative regions

$$\begin{aligned} \mathcal {L}_{mask} = -\frac{|N_-|}{|N|} \sum _{j \in N_+} \log M^R_j \; - \; \frac{|N_+|}{|N|} \sum _{j \in N_-} \log \big (1 - M^R_j\big ). \end{aligned}$$
(4)
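The following sketch shows one plausible instantiation of such a recalibrated cross-entropy, weighting the foreground and background terms by the opposite region's relative size; the exact reweighting of [18] may differ.

```python
import torch

def balanced_mask_loss(mask_render, mask_pred, eps=1e-6):
    """Cross-entropy between the rendered mask M^R and the predicted mask M^P,
    reweighting foreground (N_+) and background (N_-) contributions by the
    opposite region's relative size."""
    pos = mask_pred > 0.5                       # N_+ from the predicted mask
    neg = ~pos                                  # N_-
    n_pos = pos.sum().clamp(min=1).float()
    n_neg = neg.sum().clamp(min=1).float()
    n = n_pos + n_neg
    loss_pos = -(n_neg / n) * torch.log(mask_render[pos] + eps).sum()
    loss_neg = -(n_pos / n) * torch.log(1.0 - mask_render[neg] + eps).sum()
    return (loss_pos + loss_neg) / n
```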

Although masks are not suffering from the domain gap, they discard a lot of valuable information. In particular, color information is often the only guidance to disambiguate the 6D pose, especially for geometrically simple objects.

Since the domain shift is at least partially caused by lighting, we attempt to decouple lighting prior to measuring color similarity. Letting \(\rho \) denote the transformation from RGB to LAB space that additionally discards the lightness channel, we evaluate color coherence on the remaining two channels according to

$$\begin{aligned} \mathcal {L}_{ab} = \underset{j \in N_+}{\mathrm{avg}} \big \Vert \rho (I^S)_j - \rho (I^R)_j \big \Vert _1. \end{aligned}$$
(5)
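A sketch of this color term using kornia's RGB-to-LAB conversion (assuming kornia is available; the paper does not specify an implementation). The \(L_1\) distance and the restriction to foreground pixels are assumptions.

```python
import torch
import kornia

def ab_loss(img_sensor, img_render, mask_fg):
    """Compare only the a/b chroma channels in LAB space over foreground
    pixels, discarding the lightness channel (cf. Eq. 5)."""
    lab_s = kornia.color.rgb_to_lab(img_sensor)   # (B, 3, H, W), channels L/a/b
    lab_r = kornia.color.rgb_to_lab(img_render)
    diff = (lab_s[:, 1:] - lab_r[:, 1:]).abs()    # drop channel 0 (lightness)
    m = mask_fg.unsqueeze(1).float()              # (B, 1, H, W)
    return (diff * m).sum() / (2.0 * m.sum().clamp(min=1.0))
```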

We also borrow various ideas from image reconstruction and domain translation, as they face the same dilemma. We assess the structural similarity (SSIM) in the RGB space and additionally follow the common practice of using a multi-scale variant, namely MS-SSIM [63]

$$\begin{aligned} \mathcal {L}_{ms\text {-}ssim} = 1 - \mathrm {MS\text {-}SSIM}\big (I^S \odot M^P,\; I^R \odot M^R\big ). \end{aligned}$$
(6)

Thereby, \(\odot \) denotes the element-wise multiplication and \(s=5\) is the number of employed scales. For more details on MS-SSIM, we kindly refer the readers to the supplement and  [63].
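For illustration, a short sketch based on the pytorch_msssim package (an assumption; any differentiable MS-SSIM implementation would do), masking both images before comparison as in Eq. 6.

```python
import torch
from pytorch_msssim import ms_ssim

def ms_ssim_loss(img_sensor, img_render, mask_pred, mask_render):
    """1 - MS-SSIM between the masked sensor and rendered images, so that
    perfect structural agreement yields zero loss."""
    x = img_sensor * mask_pred.unsqueeze(1)       # I^S masked with M^P
    y = img_render * mask_render.unsqueeze(1)     # I^R masked with M^R
    return 1.0 - ms_ssim(x, y, data_range=1.0)    # five scales by default
```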

Another common practice is to appraise the perceptual similarity [19, 62] in feature space. To this end, a pretrained deep neural network such as AlexNet [27] is typically employed to ensure low- and high-level similarity. We apply the perceptual loss at different levels of the CNN. Specifically, we extract the feature maps of \(L=5\) layers and normalize them along the channel dimension. Then we compute the squared \(L_2\) distances of the normalized feature maps \(\hat{\phi }^l(\cdot )\) for each layer l. We average the individual contributions spatially and sum across all layers [62]

$$\begin{aligned} \mathcal {L}_{perceptual} = \sum _{l=1}^{L} \frac{1}{H_l W_l} \sum _{h,w} \big \Vert \hat{\phi }^l(I^S)_{hw} - \hat{\phi }^l(I^R)_{hw} \big \Vert _2^2. \end{aligned}$$
(7)
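A sketch of such an LPIPS-style perceptual term built on torchvision's pretrained AlexNet (assuming a recent torchvision); the layer cut points and the unit per-layer weights are assumptions.

```python
import torch
import torchvision

class PerceptualLoss(torch.nn.Module):
    """Channel-normalized AlexNet features from L = 5 layers; squared L2
    difference, averaged spatially and summed over layers (cf. Eq. 7)."""
    def __init__(self):
        super().__init__()
        feats = torchvision.models.alexnet(weights="DEFAULT").features.eval()
        cuts = [2, 5, 8, 10, 12]                    # slice after each ReLU
        self.slices = torch.nn.ModuleList(
            [feats[a:b] for a, b in zip([0] + cuts[:-1], cuts)])
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, x, y):
        loss = 0.0
        for block in self.slices:
            x, y = block(x), block(y)
            xn = x / (x.norm(dim=1, keepdim=True) + 1e-8)
            yn = y / (y.norm(dim=1, keepdim=True) + 1e-8)
            loss = loss + ((xn - yn) ** 2).sum(dim=1).mean()
        return loss
```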

The visual alignment is then composed as the weighted sum over all four terms

$$\begin{aligned} \mathcal {L}_{visual} = \mathcal {L}_{mask} + \alpha \,\mathcal {L}_{ab} + \beta \,\mathcal {L}_{ms\text {-}ssim} + \gamma \,\mathcal {L}_{perceptual}, \end{aligned}$$
(8)

where \(\alpha \), \(\beta \) and \(\gamma \) denote the balance factors for \(\mathcal {L}_{ab}\), \(\mathcal {L}_{ms\text {-}ssim}\), and \(\mathcal {L}_{perceptual}\), respectively. We refer to the supplement for more details on the hyper-parameters.

Neural Rendering for Geometric Alignment. Since the depth map only provides information for the visible areas, aligning it with the transformed 3D model in a fashion similar to Eq. 3 harms performance. Therefore, we exploit the rendered depth map to enable a comparison of the visible areas only. Nevertheless, employing a loss directly on both depth maps leads to bad correspondences, as the points where the masks do not intersect cannot be matched.

Hence, we operate on the visible surface in 3D to find the best geometric alignment. We first backproject \(D^S\) and \(D^R\) using the corresponding masks \(M^P\) and \(M^R\) to retrieve the visible pointclouds \(\mathcal {P}^S\) and \(\mathcal {P}^R\) in camera space with

$$\begin{aligned} \pi ^{-1}(D,M,K) = \{\,K^{-1} \begin{bmatrix} x_j&y_j&1 \end{bmatrix}^{T} \cdot D_j \mid \forall j : M_j>0\,\}, \end{aligned}$$
(9)
$$\begin{aligned} \mathcal {P}^S = \pi ^{-1}(D^S, M^P, K), \qquad \mathcal {P}^R = \pi ^{-1}(D^R, M^R, K). \end{aligned}$$
(10)

Thereby, \((x_j,y_j)\) denotes the 2D pixel location of j in M.
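A minimal sketch of the backprojection \(\pi ^{-1}\) for a single depth map, assuming the usual pixel convention of columns as x and rows as y.

```python
import torch

def backproject(depth, mask, K):
    """Lift masked depth pixels into a camera-space point cloud (cf. Eq. 9):
    p_j = D_j * K^{-1} [x_j, y_j, 1]^T for every pixel j with M_j > 0."""
    v, u = torch.nonzero(mask > 0, as_tuple=True)       # rows (y_j), cols (x_j)
    z = depth[v, u]
    pix = torch.stack([u.float(), v.float(), torch.ones_like(z)], dim=0)
    pts = torch.linalg.inv(K) @ pix * z                  # (3, J)
    return pts.T                                         # (J, 3) point cloud
```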

Since it is infeasible to estimate direct 3D-3D correspondences between \(\mathcal {P}^S\) and \(\mathcal {P}^R\), we resort to the chamfer distance to seek the best alignment in 3D

$$\begin{aligned} \mathcal {L}_{geom} = \frac{1}{|\mathcal {P}^S|} \sum _{p^S \in \mathcal {P}^S} \min _{p^R \in \mathcal {P}^R} \big \Vert p^S - p^R \big \Vert _2 + \frac{1}{|\mathcal {P}^R|} \sum _{p^R \in \mathcal {P}^R} \min _{p^S \in \mathcal {P}^S} \big \Vert p^S - p^R \big \Vert _2. \end{aligned}$$
(11)

The overall self-supervision is composed as \(\mathcal {L}_{self} = \mathcal {L}_{visual} + \eta \, \mathcal {L}_{geom}\), with \(\eta \) denoting the balance factor of \(\mathcal {L}_{geom}\). An overview is also presented in Fig. 2. Notably, while we require RGB-D data for self-supervision, we do not need any depth data during inference.
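A minimal sketch of the chamfer term and the overall objective; the exact normalization of Eq. 11 and the value of \(\eta \) are assumptions.

```python
import torch

def chamfer_distance(p_s, p_r):
    """Symmetric chamfer distance between two point clouds of shapes
    (J_s, 3) and (J_r, 3), cf. Eq. 11."""
    d = torch.cdist(p_s, p_r)                       # (J_s, J_r) pairwise L2
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def self_supervision_loss(l_visual, pts_sensor, pts_render, eta=1.0):
    """Overall self-supervision L_self = L_visual + eta * L_geom."""
    return l_visual + eta * chamfer_distance(pts_sensor, pts_render)
```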

4 Evaluation

In this section, we first introduce our experimental setup. Afterwards, we present an analysis of the quality of the predicted masks and different ablations to illustrate the effectiveness of our proposed self-supervised loss. We conclude by comparing our method with other state-of-the-art methods for 6D pose estimation and domain adaptation. For better understanding, in addition to the results of \(\mathrm{Self6D}\), we also evaluate our method trained on synthetic data only and trained with additional real 6D pose labels. Since these can be considered the lower and upper bound of our method, we refer to them as \(\mathrm{Self6D}\hbox {-}\mathrm{LB}\) and \(\mathrm{Self6D}\hbox {-}\mathrm{UB}\) in the following.

Synthetic Training Data.  [55] and [16] recently proposed to employ photorealistic and physically plausible renderings to improve 2D detection and 6D pose estimation, in contrast to simple OpenGL rendering [23]. In our experiments it turns out that a mixture of both approaches, together with heavy augmentations (e.g. random Gaussian noise, intensity jitter), leads to the best results.

Datasets. To evaluate our proposed method, we leverage the commonly used LineMOD dataset [12], which consists of 15 sequences. Only 13 of these provide water-tight CAD models; we therefore remove the other two sequences. In [3], the authors propose to sample \(15\%\) of the real data for training in order to close the domain gap. We use the same split, however, discarding the pose labels. As a second dataset, we utilize the recent HomebrewedDB [21] dataset. However, we only employ the sequence which covers three objects from LineMOD, to show that we can self-supervise the same model even in a new environment.

To also show generalization to other common datasets for 6D pose, we demonstrate the effectiveness of our self-supervision on 5 objects from YCB-Video [59] in the supplementary material. To compare with domain adaptation based methods, we use the common Cropped LineMOD dataset [58], consisting of center-cropped \(64\times 64\) patches of 11 different small objects in cluttered scenes, imaged in various poses.

Metrics for 6D Pose. We report our results w.r.t. the ADD metric  [12], measuring whether the average deviation of the transformed model points is less than \(10\%\) of the object’s diameter. For symmetric objects (e.g., Eggbox and Glue in LineMOD) we rely on the ADD-S metric, which instead measures the error as the average distance to the closest model point  [12, 14].

$$\begin{aligned} \mathbf{ADD} = \underset{x \in \mathcal {M}}{\mathrm{avg}} \Vert (Rx + t) - (\bar{R}x + \bar{t})\Vert _2, \end{aligned}$$
(12)
$$\begin{aligned} \mathbf{ADD\text {-}S} = \underset{x_{2} \in \mathcal {M}}{\mathrm{avg}} \min _{x_{1} \in \mathcal {M}} \Vert (Rx_{1} + t) - (\bar{R}x_{2} + \bar{t})\Vert _2. \end{aligned}$$
(13)
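For reference, a compact implementation of both metrics together with the 10%-of-diameter criterion (a sketch with our own variable names).

```python
import torch

def add_metric(R, t, R_gt, t_gt, pts):
    """ADD (Eq. 12): average L2 deviation of model points under the
    estimated vs. the ground-truth pose."""
    p = pts @ R.T + t
    p_gt = pts @ R_gt.T + t_gt
    return (p - p_gt).norm(dim=1).mean()

def adds_metric(R, t, R_gt, t_gt, pts):
    """ADD-S (Eq. 13): for symmetric objects, match each ground-truth point
    to the closest transformed model point."""
    p = pts @ R.T + t
    p_gt = pts @ R_gt.T + t_gt
    return torch.cdist(p_gt, p).min(dim=1).values.mean()

def is_correct(error, diameter, thresh=0.1):
    """A pose counts as correct if the error is below 10% of the diameter."""
    return bool(error < thresh * diameter)
```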
Fig. 3.

Pose error vs. self-supervision. We optimize \(\mathcal {L}_{self}\) on single images from LineMOD for 200 iterations and report the average over 100 images in total. We initialize the 6D poses with \(\mathrm{Self6D}\hbox {-}\mathrm{LB}\).

4.1 Analysis on the Quality of Predicted Masks

Thanks to the physically-based renderings, the masks predicted on real data are very accurate and can thus be reliably used as a self-supervision signal. For instance, on the LineMOD test set, the average F1 score and mIoU between the predicted masks and the ground-truth masks are 89.63% and 90.38%, respectively. Please refer to the supplementary material for detailed results and qualitative examples.

4.2 Ablation Study

Self-Supervision vs. 6D Pose Error. We want to demonstrate that there is indeed a high correlation between our proposed \(\mathcal {L}_{self}\) and the actual 6D pose error. To this end, we randomly draw 100 samples from LineMOD and optimize separately on each sample, always starting from \(\mathrm{Self6D}\hbox {-}\mathrm{LB}\). Figure 3 illustrates the average behavior of the loss vs. the 6D pose error at each iteration. As the loss decreases, the pose error for both rotation and translation continuously declines until convergence. The accompanying qualitative images (Fig. 3, right) further support this observation, as the initial pose is significantly worse compared to the final optimized result. We refer to the supplementary material for more qualitative results.

Table 1. Ablation. We report the Average Recall of ADD(-S) on LineMOD.

Individual Loss Contributions. Table 1 illustrates the contribution of each individual loss component on LineMOD. Note that supervision from both the visual and the geometric domain is vital for our self-supervised training. Disabling either \(\mathcal {L}_{mask}\) or \(\mathcal {L}_{geom}\) almost always leads to unstable training and divergence (the average recall is only \(0.1\%\) and \(6.4\%\) w.r.t. ADD(-S)). The remaining three factors, measuring color similarity, have a comparably small impact. Concretely, we drop by more than \(2\%\) when disabling \(\mathcal {L}_{ms\text {-}ssim}\), and by about \(1\%\) when disabling \(\mathcal {L}_{ab}\) or \(\mathcal {L}_{perceptual}\). Nonetheless, we still achieve the overall best results when applying all loss terms together. Most importantly, we can report a significant relative improvement of almost \(50\%\), from \(40.1\%\) to \(58.9\%\), by leveraging the proposed self-supervision. Moreover, except for the Duck object, all other objects undergo a strong enhancement in ADD(-S). Notably, we can almost halve the difference between training with and without real pose labels.

4.3 Comparison with State-of-the-Art

In the first part of this section we present a comparison with current state-of-the-art methods for 6D pose estimation. In the second part, we present our results in the area of domain adaptation on Cropped LineMOD.

6D Pose Estimation

LineMOD Dataset. In line with other works, we distinguish between training with and without real pose labels, i.e. making use of annotated real training data. Although we exploit real data, we do not employ any pose labels and our method, therefore, falls into the latter category. We want to highlight that our model can produce state-of-the-art results for training both with and without labels. Referring to Table 2, for training using only synthetic data, \(\mathrm{Self6D}\hbox {-}\mathrm{LB}\) reveals an average recall of \(40.1\%\), which is clearly better than AAE [52] with \(31.4\%\) and on par with MHP [38] and DPOD\(^{2}\) [61] reporting \(38.8\%\) and \(40.5\%\). On the other hand, when training with real pose labels, we are again on par with other recently published methods such as PVNet [44] and CDPN [30], reporting a mean average recall of \(86.9\%\). Furthermore, our proposed self-supervision \(\mathrm{Self6D}\) achieves an overall average recall of \(58.9\%\), which constitutes a relative improvement of more than \(51\%\) over all state-of-the-art methods using no real pose labels. Except for Holep, Duck and Iron, we can report a significant increase. Objects with little variation in color and geometry can be difficult to optimize. In addition, the 3D mesh of Holep is rather different from the actual object perceived in the real images, which makes our visual alignment less meaningful.

Table 2. Results for LineMOD. Top: Qualitative results on unseen examples. The projected 3D bounding boxes in blue, red and green denote the poses of the ground truth, \(\mathrm{Self6D}\hbox {-}\mathrm{LB}\) and \(\mathrm{Self6D}\), respectively. Bottom: Comparison with the state of the art. We present the results for the Average Recall (%) of the ADD(-S) metric. Real Pose Labels refers to the 15% training split from [3] with pose labels. We use the same split for training, however, without employing labels.\(^\mathrm{2}\)

HomebrewedDB Dataset. In Fig. 4 (left) we compare our method with DPOD [61] and SSD6D [23] after refinement using [40] (SSD6D+Ref.) on the three objects of HomebrewedDB that it shares with LineMOD.\(^{2}\) Unfortunately, methods directly solving for the 6D pose always implicitly learn the camera intrinsics, which degrades the performance when exposed to a new camera. Approaches based on 2D-3D correspondences are instead robust to camera changes, as they simply run PnP using the new intrinsics. Therefore, our \(\mathrm{Self6D}\hbox {-}\mathrm{LB}\) is slightly outperformed by [61]. SSD6D+Ref. [40] employs contour-based pose refinement using renderings of the current hypotheses. Similarly, rendering the pose with the new intrinsics again enables easy adaptation and can even exceed [61] and our \(\mathrm{Self6D}\) on the Bvise object. Nevertheless, we can easily adapt to the new domain and intrinsics by leveraging only 15% of unannotated data from [21]. In fact, we almost double their numbers for all other objects and reach a similar level as for LineMOD (Table 3).

Based on this observation, we were curious to understand the adaptation capabilities of our model w.r.t. the amount of real data that we expose it to. We divided the samples from HomebrewedDB into 100 images for testing and 900 images for training. Afterwards, we repeatedly trained our model with an increasing amount of data, however, always evaluating on the same test split. In Fig. 4 (right) we illustrate the corresponding results. When using only 15% (150 samples) of the real data for training, we can already almost double the mean average recall (mAR). Using \(\approx \) 40% of the real data, the mAR can be improved by \(\approx \) 130%, from 31% to 71%. Afterwards, it slowly saturates at \(\approx \) 74%.

Fig. 4.

Results for HomebrewedDB. Left: Comparison with  [61] and  [40].\(^{2}\) While both train with synthetic data only, we report our results for synthetic data (\(\mathrm{Self6D}\hbox {-}\mathrm{LB}\)) and after self-supervision (\(\mathrm{Self6D}\)) using 15% of real data from  [21]. Right: Self-supervised training w.r.t. an increasing percentage of real training data. Results are always reported on the same unseen test split.

LineMOD Occlusion Dataset. We also evaluate our method on LineMOD Occlusion, which exhibits stronger occlusion. We follow the BOP [15] standard and evaluate on a subset of 200 samples. We compare \(\mathrm{Self6D}\) with two state-of-the-art methods using synthetic data only, namely DPOD [61] and CDPN [30].\(^{3}\) While our \(\mathrm{Self6D}\hbox {-}\mathrm{LB}\) can clearly outperform [61] with 15.1% compared to 6.3%, [30] exceeds our \(\mathrm{Self6D}\hbox {-}\mathrm{LB}\) by 5.4% and reports a mean average recall of 20.8%. Methods based on 2D-3D correspondences are more robust towards occlusion, as they consider only the visible regions, while direct methods are less stable since they infer poses from both visible and occluded regions. Nonetheless, after utilizing the remaining real RGB-D data via our self-supervision, we can easily surpass [30] (32.1% vs. 20.8%) and double the performance of our \(\mathrm{Self6D}\hbox {-}\mathrm{LB}\). Notably, there is still plenty of room for improvement for all methods trained without real labels, compared to our fully-supervised model \(\mathrm{Self6D}\hbox {-}\mathrm{UB}\) (70.2%).

Table 3. Results for LineMOD Occlusion. Comparison with  [61] and  [30]. We evaluate the Average Recall(%) of ADD(-S) on the BOP  [15] split.\(^\mathrm{3}\)

Domain Adaptation for Pose Estimation

Since our method is suitable for conducting synthetic-to-real domain adaptation, we assess its transfer capabilities on the commonly used Cropped LineMOD scenario. We self-supervise the model with the real training set from Cropped LineMOD and report the mean angle error on the real test set. As shown in Table 4, our synthetically trained model (\(\mathrm{Self6D}\hbox {-}\mathrm{LB}\)) already slightly exceeds state-of-the-art methods such as PixelDA [2]. \(\mathrm{Self6D}\) successfully surpasses this original model on the target domain, reducing the mean angle error from \(19.8^\circ \) to \(15.8^\circ \).

Table 4. Comparison with state of the art on Cropped LineMOD. We present the classification accuracy as well as mean angle error.

5 Conclusion

This work introduced \(\mathrm{Self6D}\), the first self-supervised 6D object pose estimation approach aimed at learning from real data without the need for 6D pose annotations. Leveraging neural rendering, we are able to enforce several visual and geometric constraints, resulting in a remarkable leap forward compared to other state-of-the-art methods. Moreover, \(\mathrm{Self6D}\) demonstrated a notable reduction of the gap to the state of the art for pose estimation with real pose labels.

A main future direction is exploring how to overcome the need for depth data during self-supervision. Another interesting aspect is to also incorporate 2D detections into the self-supervision, as this would allow backpropagating the loss in an end-to-end fashion throughout the entire network.