
1 Introduction

From glass doors and windows to kitchenware and all kinds of containers, transparent materials are prevalent throughout daily life. Thus, perceiving the pose (position and orientation) of transparent objects is a crucial capability for autonomous perception systems seeking to interact with their environment. However, transparent objects present unique perception challenges in both the RGB and depth domains. As shown in Fig. 2, in RGB the color appearance of transparent objects depends heavily on the background, viewing angle, material, and lighting conditions due to light reflection and refraction. In depth, common commercially available depth sensors record mostly invalid or inaccurate depth values within regions of transparency. These visual challenges, especially the missing measurements in the depth domain, pose severe problems for autonomous object manipulation and obstacle avoidance tasks. This paper sets out to address these problems by studying how category-level transparent object pose estimation may be achieved using end-to-end learning.

Fig. 1.

Overview of TransNet, a pipeline for category-level transparent object pose estimation. Given instance-level segmentation masks as input, TransNet estimates the 6 degrees of freedom pose and scale for each transparent object in the image. Internally, TransNet uses surface normal estimation, depth completion, and a transformer-based architecture for accurate pose estimation despite noisy sensor data.

Recent works have shown promising results on grasping transparent objects by completing the missing depth values followed by the use of a geometry-based grasp engine [9, 12, 29], or transfer learning from RGB-based grasping neural networks [36]. For more advanced manipulation tasks such as rigid body pick-and-place or liquid pouring, geometry-based estimations, such as symmetrical axes, edges [27] or object poses [26], are required to model the manipulation trajectories. Instance-level transparent object poses could be estimated from keypoints on stereo RGB images [23, 24] or directly from a single RGB-D image [38] with support plane assumptions. Recently emerged large-scale transparent object datasets [6, 9, 23, 29, 39] pave the way for addressing the problem using deep learning.

In this work, we aim to extend the frontier of 3D transparent object perception with three primary contributions.

  • First, we explore the importance of depth completion and surface normal estimation in transparent object pose estimation. Results from these studies indicate the relative importance of each modality and their analysis suggests promising directions for follow-on studies.

  • Second, we introduce TransNet, a category-level pose estimation pipeline for transparent objects as illustrated in Fig. 1. It utilizes surface normal estimation, depth completion, and a transformer-based architecture to estimate transparent objects’ 6D poses and scales.

  • Third, we demonstrate that TransNet outperforms a baseline that uses a state-of-the-art opaque object pose estimation approach [7] along with transparent object depth completion [9].

Fig. 2.

Challenges of transparent object perception. Images are from the Clearpose dataset [6]. Left: RGB image. Top right: raw depth image. Bottom right: ground-truth depth image.

2 Related Works

2.1 Transparent Object Visual Perception for Manipulation

Transparent objects need to be perceived before being manipulated. Lai et al. [18] and Khaing et al. [16] developed CNN models to detect transparent objects from RGB images. Xie et al. [37] proposed a deep segmentation model that achieved state-of-the-art segmentation accuracy. ClearGrasp [29] employed depth completion for use with pose estimation on robotic grasping tasks, training three DeepLabv3+ [4] models to perform image segmentation, surface normal estimation, and boundary segmentation. Follow-on studies developed different approaches for depth completion, including implicit functions [47], NeRF features [12], combined point cloud and depth features [39], adversarial learning [30], multi-view geometry [1], and RGB image completion [9]. Without completing depth, Weng et al. [36] proposed a method to transfer a learned grasping policy from the RGB domain to the raw sensor depth domain. For instance-level pose estimation, Xu et al. [38] utilized segmentation, surface normals, and an image-coordinate UV-map as input to a network similar to [32] that estimates 6 DOF object pose. Keypose [24] estimates 2D keypoints and regresses object poses from stereo images using triangulation. Regarding other specialized sensors, Xu et al. [40] used light-field images to perform segmentation with a graph-cut-based approach. Kalra et al. [15] trained Mask R-CNN [11] on polarization images to outperform a baseline trained on only RGB images by a large margin. Zhou et al. [44,45,46] employed light-field images to learn features for robotic grasping and object pose estimation. Along with these methods, large-scale datasets, spanning different sensors and both synthetic and real-world domains, have been collected and made public for various related tasks [6, 9, 15, 23, 24, 29, 37, 39, 44, 47]. Compared with these previous works, and to the best of our knowledge, we propose the first category-level pose estimation approach for transparent objects. Notably, the proposed approach provides reliable 6D pose and scale estimates across instances with similar shapes.

2.2 Opaque Object Category-Level Pose Estimation

Category-level object pose estimation aims to estimate unseen objects’ 6D pose within seen categories, together with their scale or canonical shape. To the best of our knowledge, there are currently no category-level pose estimation works focusing on transparent objects; the works mentioned below mostly consider opaque objects and do not transfer well to transparency due to their dependence on accurate depth. Wang et al. [35] introduced the Normalized Object Coordinate Space (NOCS) for dense 3D correspondence learning, and used the Umeyama algorithm [33] to solve for object pose and scale. They also contributed a synthetic and a real dataset used extensively by subsequent works for benchmarking. Later, Li et al. [19] extended the idea to articulated objects. To simultaneously reconstruct the canonical point cloud and estimate the pose, Chen et al. [2] proposed a method based on canonical shape space (CASS). Tian et al. [31] learned category-specific shape priors from an autoencoder and demonstrated their power for pose estimation and shape completion. 6D-ViT [48] and ACR-Pose [8] extended this idea by utilizing a pyramid visual transformer (PVT) and a generative adversarial network (GAN) [10], respectively. Structure-guided prior adaptation (SGPA) [3] utilized a transformer architecture for dynamic shape prior adaptation. Rather than learning a dense correspondence, FS-Net [5] regressed the pose parameters directly, proposing to learn two orthogonal axes for 3D orientation, and contributed an efficient data augmentation process for depth-only approaches. GPV-Pose [7] further improved FS-Net by adding a geometric consistency loss between 3D bounding boxes, reconstruction, and pose. Also with depth as the only input, category-level point pair features (CPPF) [42] reduce the sim-to-real gap by learning deep point pair features. DualPoseNet [20] benefited from rotation-invariant embedding for category-level pose estimation. Differing from other works that use segmentation networks to crop image patches as a first stage, CenterSnap [13] presented a single-stage approach for the prediction of 3D shape, 6D pose, and size.

Compared with opaque objects, the main challenge in perceiving transparent objects is the poor quality of the input depth. Thus, the proposed TransNet takes inspiration from the above category-level pose estimation works regarding feature embedding and architecture design. More specifically, TransNet leverages both the Pointformer from PVT and the pose decoders from FS-Net and GPV-Pose. In the following section, the TransNet architecture is described, focusing on how to integrate a single-view depth completion module and utilize imperfect depth predictions to learn pose estimates of transparent objects.

3 TransNet

Fig. 3.

Architecture for TransNet. TransNet is a two-stage deep neural network for category-level transparent object pose estimation. The first stage uses object instance segmentation (from Mask R-CNN [11], not shown in the diagram) to generate RGB-D patches, which are used as input to a depth completion network and a surface normal estimation network (RGB only). The second stage uses randomly sampled pixels within the objects’ segmentation masks to generate a generalized point cloud, formed as the per-pixel concatenation of ray direction, RGB, surface normal, and completed depth features. Pointformer [48], a transformer-based point cloud embedding architecture, transforms the generalized point cloud into high-dimensional features. A concatenation of embedding features, global features, and a one-hot category label (from Mask R-CNN) is provided to the pose estimation module. The pose estimation module is composed of four decoders, one each for translation, x-axis, z-axis, and scale regression. Finally, the estimated object pose is recovered and returned as output.

Given an input RGB-D pair (\(\mathcal {I}\), \(\mathcal {D}\)), our goal is to predict objects’ 6D rigid body transformations \([{\textbf {R}}|{\textbf {t}}]\) and 3D scales \({\textbf {s}}\) in the camera coordinate frame, where \({\textbf {R}} \in SO(3), {\textbf {t}} \in \mathbb {R}^{3}\) and \({\textbf {s}} \in \mathbb {R}^{3}_{+}\). In this problem, inaccurate/invalid depth readings exist within the image region corresponding to transparent objects (represented as a binary mask \(\mathcal {M}_t\)). To approach the category-level pose estimation problem along with inaccurate depth input, we propose a novel two-stage deep neural network pipeline, called TransNet.

3.1 Architecture Overview

Following recent work in object pose estimation [5, 7, 34], we first apply a pre-trained instance segmentation module (Mask R-CNN [11]), fine-tuned on the pose estimation dataset, to extract each object’s bounding box patch, mask, and category label, separating the objects of interest from the rest of the image.

The first stage of TransNet takes the patches as input and attempts to correct the inaccurate depth caused by transparent objects. Depth completion (TransCG [9]) and surface normal estimation (U-Net [28]) are applied to the RGB-D patches to obtain estimated depth-normal pairs. The estimated depth-normal pairs, together with the RGB and ray direction patches, are concatenated into feature patches; pixels are then randomly sampled within the instance masks to generate generalized point cloud features.

In the second stage of TransNet, the generalized point cloud is processed through Pointformer [48], a transformer-based point cloud embedding module, to produce concatenated feature vectors. The pose is then estimated separately in four decoder modules for object translation, x-axis, z-axis, and scale. The estimated rotation matrix is recovered from the two estimated axes. Each component is discussed in more detail in the following sections.

3.2 Object Instance Segmentation

Similar to other categorical pose estimation work [7], we train a Mask R-CNN [11] model on the same dataset used for pose estimation to obtain each object’s bounding box \(\mathcal {B}\), mask \(\mathcal {M}\), and category label \(\mathcal {H}_c\). Patches of ray direction \(\mathcal {R}_{\mathcal {B}}\), RGB \(\mathcal {I}_{\mathcal {B}}\), and raw depth \(\mathcal {D}_{\mathcal {B}}\) are cropped from the original data using bounding box \(\mathcal {B}\) before being input to the first stage of TransNet.
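For concreteness, the following is a minimal PyTorch sketch of this patch extraction step, assuming a fixed \(256\times 256\) patch size (the size used in our implementation, Sect. 4); the function name and the nearest-neighbor resizing are illustrative choices rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def crop_patches(rays, rgb, depth, box, out_size=256):
    """Crop ray-direction, RGB, and raw-depth maps with bounding box B and
    resize them to a fixed out_size x out_size patch.

    rays: (3, H, W), rgb: (3, H, W), depth: (1, H, W); box = (u0, v0, u1, v1).
    """
    u0, v0, u1, v1 = box

    def crop(x):
        patch = x[:, v0:v1, u0:u1].unsqueeze(0)          # (1, C, h, w)
        # Nearest-neighbor resizing avoids blending valid and invalid depth.
        return F.interpolate(patch, size=(out_size, out_size),
                             mode="nearest").squeeze(0)  # (C, out_size, out_size)

    return crop(rays), crop(rgb), crop(depth)
```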

3.3 Transparent Object Depth Completion

Due to light reflection and refraction on transparent material, the depth of transparent objects is very noisy. Therefore, depth completion is necessary to reduce the sensor noise. Given the raw RGB-D patch pair (\(\mathcal {I}_{\mathcal {B}}\), \(\mathcal {D}_{\mathcal {B}}\)) and the transparent mask \(\mathcal {M}_t\) (the intersection of the transparent objects’ masks with bounding box \(\mathcal {B}\)), transparent object depth completion \(\mathcal {F}_{D}\) is applied to obtain the completed depth of the transparent region \(\{\hat{\mathcal {D}}_{(i, j)}|(i, j)\in \mathcal {M}_t \}\).

Inspired by TransCG [9], a state-of-the-art depth completion method, we incorporate a similar multi-scale depth completion architecture into TransNet.

$$\begin{aligned} \hat{\mathcal {D}}_\mathcal {B} = \mathcal {F}_{D}\left( \mathcal {I}_\mathcal {B}, \mathcal {D}_\mathcal {B}\right) \end{aligned}$$
(1)

We use the same training loss as TransCG:

$$\begin{aligned} \begin{aligned}&\mathcal {L} = \mathcal {L}_d + \lambda _{smooth} \mathcal {L}_s \\&\mathcal {L}_d = \frac{1}{N_p}\sum _{p\in \mathcal {M}_t \bigcap \mathcal {B}}\left\Vert \hat{\mathcal {D}}_p - \mathcal {D}^{*}_p\right\Vert ^2 \\&\mathcal {L}_s = \frac{1}{N_p}\sum _{p\in \mathcal {M}_t \bigcap \mathcal {B}}\left( 1 - \text {cos}\left\langle \mathcal {N}(\hat{\mathcal {D}}_p), \mathcal {N}(\mathcal {D}^{*}_p)\right\rangle \right) \end{aligned} \end{aligned}$$
(2)

where \(\mathcal {D}^{*}\) is the ground truth depth image patch, \(p\in \mathcal {M}_t \bigcap \mathcal {B}\) represents the transparent region in the patch, \(\left\langle \boldsymbol{\cdot \; , \; \cdot }\right\rangle \) denotes the dot product operator and \(\mathcal {N}(\boldsymbol{\cdot })\) denotes the operator that calculates surface normals from depth. \(\mathcal {L}_d\) is the \(L_2\) distance between the estimated and ground truth depth within the transparency mask. \(\mathcal {L}_s\) is the cosine-similarity loss between surface normals calculated from the estimated and ground truth depth. \(\lambda _{smooth}\) balances the two losses.
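A minimal PyTorch sketch of the loss in Eq. (2) follows; `normal_from_depth` is a hypothetical stand-in for the operator \(\mathcal {N}(\boldsymbol{\cdot })\), and the mask corresponds to the transparent region \(\mathcal {M}_t \bigcap \mathcal {B}\).

```python
import torch
import torch.nn.functional as F

def depth_completion_loss(d_hat, d_gt, mask, normal_from_depth, lambda_smooth=0.001):
    """Sketch of Eq. (2): masked squared-L2 depth loss plus a surface-normal
    smoothness term.

    d_hat, d_gt: (B, H, W) completed and ground-truth depth patches.
    mask:        (B, H, W) boolean transparent-region mask.
    normal_from_depth: callable mapping depth to (B, 3, H, W) unit normals.
    """
    mask_f = mask.float()
    n_p = mask_f.sum().clamp(min=1.0)

    # L_d: squared depth error averaged over transparent pixels.
    l_d = ((d_hat - d_gt) ** 2 * mask_f).sum() / n_p

    # L_s: 1 - cosine similarity between normals of predicted and GT depth.
    cos = F.cosine_similarity(normal_from_depth(d_hat),
                              normal_from_depth(d_gt), dim=1)  # (B, H, W)
    l_s = ((1.0 - cos) * mask_f).sum() / n_p

    return l_d + lambda_smooth * l_s
```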

3.4 Transparent Object Surface Normal Estimation

Surface normal estimation \(\mathcal {F}_{SN}\) estimates the surface normals \(\mathcal {S}_{\mathcal {B}}\) from the RGB image \(\mathcal {I}_{\mathcal {B}}\). Although previous category-level pose estimation works [5, 7] show that depth alone is sufficient to obtain opaque objects’ poses, the experiments in Sect. 4.3 demonstrate that surface normals are not a redundant input for transparent object pose estimation. Here, we slightly modify U-Net [28] to perform the surface normal estimation.

$$\begin{aligned} \hat{\mathcal {S}}_\mathcal {B} = \mathcal {F}_{SN}\left( \mathcal {I}_\mathcal {B}\right) \end{aligned}$$
(3)

We use the cosine similarity loss:

$$\begin{aligned} \begin{aligned}&\mathcal {L} = \frac{1}{N_p}\sum _{p\in \mathcal {B}}\left( 1 - \text {cos}\left\langle \hat{\mathcal {S}}_p, \mathcal {S}^{*}_p\right\rangle \right) \end{aligned} \end{aligned}$$
(4)

where \(p\in \mathcal {B}\) indicates that the loss is applied to all pixels in the bounding box \(\mathcal {B}\).
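A minimal sketch of Eq. (4), assuming both normal maps are per-pixel unit vectors:

```python
import torch.nn.functional as F

def surface_normal_loss(s_hat, s_gt):
    """Sketch of Eq. (4): cosine-similarity loss averaged over all pixels in
    the bounding-box patch. s_hat, s_gt: (B, 3, H, W) unit-normal maps."""
    cos = F.cosine_similarity(s_hat, s_gt, dim=1)  # (B, H, W)
    return (1.0 - cos).mean()
```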

3.5 Generalized Point Cloud

As input to the second stage, the generalized point cloud \(\mathcal {P}\in \mathbb {R}^{N\times d}\) is a stack of d-dimensional features from the first stage taken at N sampled points, inspired by [38]; in our work, \(d=10\). Given the completed depth \(\hat{\mathcal {D}}_\mathcal {B}\) and predicted surface normal \(\hat{\mathcal {S}}_\mathcal {B}\) from Eq. (1) and Eq. (3), together with the RGB patch \(\mathcal {I}_\mathcal {B}\) and ray direction patch \(\mathcal {R}_\mathcal {B}\), a concatenated feature patch is given as \(\left[ \mathcal {I}_\mathcal {B}, \hat{\mathcal {D}}_\mathcal {B}, \hat{\mathcal {S}}_\mathcal {B}, \mathcal {R}_\mathcal {B}\right] \in \mathbb {R}^{H \times W \times 10}\). Here the ray direction \(\mathcal {R}\) represents the direction from the camera origin to each pixel in the camera frame. For each pixel (u, v):

$$\begin{aligned} \begin{aligned}&p = \begin{bmatrix}u&v&1\end{bmatrix}^T \\&\mathcal {R} = \frac{K^{-1} p}{\left\Vert K^{-1} p\right\Vert ^2} \end{aligned} \end{aligned}$$
(5)

where p is the homogeneous UV coordinate in the image plane and K is the camera intrinsic matrix. The UV mapping itself is an important cue when estimating poses from patches [14], as it provides information about the relative position and size of the patches within the overall image. We use the ray direction instead of the UV mapping because it additionally encodes the camera intrinsics.

We randomly sample N pixels within the transparent mask of the feature patch to obtain the generalized point cloud \(\mathcal {P}\in \mathbb {R}^{N\times 10}\). A more detailed experiment in Sect. 4.3 explores the best feature combination for the generalized point cloud.
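A sketch of the generalized point cloud construction (Eq. (5) plus the sampling step) is shown below. The function name is illustrative; the ray directions are normalized to unit length (reading the denominator of Eq. (5) as the vector norm), and sampling with replacement is a simplification.

```python
import torch

def build_generalized_point_cloud(rgb, depth_hat, normal_hat, K, mask, n_samples=1024):
    """Build the 10-D generalized point cloud of Sect. 3.5.

    rgb:        (3, H, W) RGB patch I_B
    depth_hat:  (1, H, W) completed depth
    normal_hat: (3, H, W) estimated surface normals
    K:          (3, 3) camera intrinsic matrix
    mask:       (H, W) boolean transparent-object mask
    Returns P in R^{N x 10}: per-pixel [RGB, depth, normal, ray direction].
    """
    _, h, w = rgb.shape
    v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    uv1 = torch.stack([u, v, torch.ones_like(u)], dim=0).float()   # (3, H, W)

    # Ray direction (Eq. 5): back-project pixels and normalize to unit length.
    rays = torch.einsum("ij,jhw->ihw", torch.linalg.inv(K), uv1)
    rays = rays / rays.norm(dim=0, keepdim=True)

    feats = torch.cat([rgb, depth_hat, normal_hat, rays], dim=0)   # (10, H, W)
    feats = feats.reshape(10, -1).T                                # (H*W, 10)

    # Randomly sample N pixels inside the transparent mask (with replacement).
    idx = mask.reshape(-1).nonzero(as_tuple=True)[0]
    choice = idx[torch.randint(len(idx), (n_samples,))]
    return feats[choice]                                           # (N, 10)
```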

3.6 Transformer Feature Embedding

Given generalized point cloud \(\mathcal {P}\), we apply an encoder and multi-head decoder strategy to get objects’ poses and scales. We use Pointformer [48], a multi-stage transformer-based point cloud embedding method:

$$\begin{aligned} \mathcal {P}_{emb} = \mathcal {F}_{PF}\left( \mathcal {P}\right) \end{aligned}$$
(6)

where \(\mathcal {P}_{emb} \in \mathbb {R}^{N\times d_{emb}}\) is a high-dimensional feature embedding. During our experiments, we also considered other common point cloud embedding methods, such as 3D-GCN [21], which have demonstrated their power in many category-level pose estimation methods [5, 7]. During feature aggregation for each point, these methods use a nearest-neighbor search over coordinate space and then calculate new features as a weighted sum of the features of the surrounding points. Due to the noisy input \(\hat{\mathcal {D}}\) from Eq. (1), the nearest-neighbor search may become unreliable and produce noisy feature embeddings. Pointformer, in contrast, aggregates features with attention, so gradients back-propagate through the whole point cloud. The comparisons and discussion in Sect. 4.2 demonstrate that transformer-based embedding methods are more stable than nearest-neighbor-based methods when both are trained on noisy depth data.

We then use a Point Pooling layer (a multilayer perceptron (MLP) plus max-pooling) to extract the global feature \(\mathcal {P}_{global}\), and concatenate it with the local features \(\mathcal {P}_{emb}\) and the one-hot category label \(\mathcal {H}_{c}\) from instance segmentation before passing the result to the decoders:

$$\begin{aligned} \begin{aligned}&\mathcal {P}_{global} = \text {MaxPool}\left( \text {MLP}\left( \mathcal {P}_{emb}\right) \right) \\&\mathcal {P}_{concat} = \left[ \mathcal {P}_{emb}, \mathcal {P}_{global}, \mathcal {H}_{c}\right] \end{aligned} \end{aligned}$$
(7)
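A sketch of Eq. (7) is given below; the feature dimensions are illustrative rather than the exact values used in TransNet.

```python
import torch
import torch.nn as nn

class PointPoolingHead(nn.Module):
    """Sketch of Eq. (7): per-point MLP, max-pooling to a global feature,
    then per-point concatenation of [P_emb, P_global, one-hot category]."""

    def __init__(self, d_emb=128, d_global=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_emb, d_global), nn.ReLU(),
            nn.Linear(d_global, d_global),
        )

    def forward(self, p_emb, category_onehot):
        # p_emb: (B, N, d_emb); category_onehot: (B, n_categories)
        p_global = self.mlp(p_emb).max(dim=1).values               # (B, d_global)
        n = p_emb.shape[1]
        return torch.cat([
            p_emb,
            p_global.unsqueeze(1).expand(-1, n, -1),
            category_onehot.unsqueeze(1).expand(-1, n, -1),
        ], dim=-1)                            # (B, N, d_emb + d_global + n_categories)
```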

3.7 Pose and Scale Estimation

After we extract the feature embeddings from multi-modal input, we apply four separate decoders for translation, x-axis, z-axis, and scale estimation.

Translation Residual Estimation. As demonstrated in [5], residual estimation achieves better performance than direct regression by learning the distribution of the residual between the prior and the actual value. The translation decoder \(\mathcal {F}_{t}\) learns a 3D translation residual from the object translation prior \(t_{prior}\), calculated as the average of the predicted 3D coordinates over the sampled pixels in \(\mathcal {P}\). To be more specific:

$$\begin{aligned} \begin{aligned}&t_{prior} = \frac{1}{N_p}\sum _{p\in N} K ^{-1} \left[ u_p \ v_p \ 1\right] ^T \hat{\mathcal {D}_p} \\&\hat{t} = t_{prior} + \mathcal {F}_{t}\left( \left[ \mathcal {P}_{concat}, \mathcal {P}\right] \right) \\ \end{aligned} \end{aligned}$$
(8)

where K is the camera intrinsic matrix and \(u_p\), \(v_p\) are the 2D pixel coordinates of the sampled pixel p. We use the \(L_1\) loss between the ground truth and estimated position:

$$\begin{aligned} \mathcal {L}_t = \left|\hat{t} - t^*\right|\end{aligned}$$
(9)
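A sketch of Eq. (8) and Eq. (9) follows; `translation_decoder` is a hypothetical stand-in for the MLP decoder \(\mathcal {F}_{t}\).

```python
import torch

def translation_prior(uv, depth_hat, K):
    """Sketch of the prior in Eq. (8): average of the back-projected 3D
    points of the N sampled pixels.
    uv: (N, 2) pixel coordinates; depth_hat: (N,) completed depth; K: (3, 3)."""
    uv1 = torch.cat([uv.float(), torch.ones(uv.shape[0], 1)], dim=1)  # (N, 3)
    pts = (torch.linalg.inv(K) @ uv1.T).T * depth_hat[:, None]        # (N, 3)
    return pts.mean(dim=0)

# t_hat = t_prior + translation_decoder(torch.cat([p_concat, p], dim=-1))
# l_t   = (t_hat - t_gt).abs().sum()        # L1 loss of Eq. (9)
```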

Pose Estimation. Similar to [5], rather than directly regressing the rotation matrix R, we decouple it into two orthogonal axes and estimate them separately, which is more effective. As shown in Fig. 3, we decouple R into the z-axis \(a_z\) (red axis) and x-axis \(a_x\) (green axis). Following the confidence learning strategy in [7], the network learns confidence values to handle the fact that the two regressed axes are not necessarily orthogonal:

$$\begin{aligned} \begin{aligned}&\left[ \hat{a}_i, c_i\right] = \mathcal {F}_i\left( \mathcal {P}_{concat}\right) , \ i\in \left\{ x, z\right\} \\&\theta _z = \frac{c_x}{c_x + c_z}\left( \theta - \frac{\pi }{2}\right) \\&\theta _x = \frac{c_z}{c_x + c_z}\left( \theta - \frac{\pi }{2}\right) \end{aligned} \end{aligned}$$
(10)

where \(c_x, c_z\) denote the confidences of the learned axes and \(\theta \) is the angle between \(a_x\) and \(a_z\). \(\theta _x, \theta _z\) are obtained by solving an optimization problem and are then used to rotate \(a_x\) and \(a_z\) within their common plane. More details can be found in [7]. For the training loss, we first use an \(L_1\) loss and a cosine-similarity loss for axis estimation:

$$\begin{aligned} \begin{aligned}&\mathcal {L}_{r_i} = \left|\hat{a}_i - a^*_i\right|+ 1 - \left\langle \hat{a}_i, a^*_i\right\rangle , \ i\in \left\{ x, z\right\} \end{aligned} \end{aligned}$$
(11)

Then to constrain the perpendicular relationship between two axes, we add the angular loss:

$$\begin{aligned} \begin{aligned}&\mathcal {L}_a = \left\langle \hat{a}_x, \hat{a}_z\right\rangle \end{aligned} \end{aligned}$$
(12)

To learn the axis confidence, we add a confidence loss, which is the \(L_1\) distance between the estimated confidence and an exponential of the \(L_2\) distance between the ground truth and estimated axes:

$$\begin{aligned} \begin{aligned}&\mathcal {L}_{con_i} = \left|c_i - \text {exp}\left( \alpha \left\Vert \hat{a}_i - a^*_i \right\Vert _2\right) \right|, \ i\in \left\{ x, z\right\} \end{aligned} \end{aligned}$$
(13)

where \(\alpha \) is a constant to scale the distance.
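A sketch of Eq. (10), Eq. (11), and Eq. (13) is shown below; the final in-plane rotation of the two axes follows [7] and is omitted, and the tensor shapes are illustrative.

```python
import math
import torch
import torch.nn.functional as F

def axis_confidence_angles(a_x, a_z, c_x, c_z):
    """Sketch of Eq. (10): split the deviation from orthogonality between
    the two regressed axes according to their confidences.
    a_x, a_z: (B, 3) axis predictions; c_x, c_z: (B,) confidences."""
    a_x, a_z = F.normalize(a_x, dim=-1), F.normalize(a_z, dim=-1)
    theta = torch.acos((a_x * a_z).sum(-1).clamp(-1.0, 1.0))  # angle between axes
    err = theta - math.pi / 2.0
    theta_z = c_x / (c_x + c_z) * err   # the more confident axis is rotated less
    theta_x = c_z / (c_x + c_z) * err
    # a_x and a_z are then rotated by theta_x and theta_z within their common
    # plane, following [7]; that step is omitted here.
    return theta_x, theta_z

def axis_losses(a_hat, a_gt, c, alpha=1.0):
    """Sketch of Eq. (11) and Eq. (13) for a single axis (x or z):
    L1 + (1 - cosine) regression loss, plus the confidence loss against the
    target exp(alpha * ||a_hat - a_gt||_2)."""
    l_r = (a_hat - a_gt).abs().sum(-1) + 1.0 - (a_hat * a_gt).sum(-1)
    l_con = (c - torch.exp(alpha * (a_hat - a_gt).norm(dim=-1))).abs()
    return l_r.mean(), l_con.mean()
```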

Thus the overall loss for the second stage is:

$$\begin{aligned} \begin{aligned} \mathcal {L}&= \lambda _s\mathcal {L}_s + \lambda _t\mathcal {L}_t + \lambda _{r_x}\mathcal {L}_{r_x} + \lambda _{r_z}\mathcal {L}_{r_z} \\&\quad +\lambda _{r_a}\mathcal {L}_{a} + \lambda _{con_x}\mathcal {L}_{con_x} + \lambda _{con_z}\mathcal {L}_{con_z} \end{aligned} \end{aligned}$$
(14)

To deal with object symmetry, we apply specific treatments for different symmetry types. For axially symmetric objects (those whose shape is unchanged when rotating around one axis), we ignore the losses for the x-axis, i.e., \(\mathcal {L}_{con_x}\) and \(\mathcal {L}_{r_x}\). For planar symmetric objects (those whose shape is unchanged when mirrored about one or more planes), we generate all candidate x-axis rotations. For example, for an object symmetric about the \(x\)-\(z\) plane and \(y\)-\(z\) plane, rotating the x-axis about the z-axis by \(\pi \) radians does not affect the object’s shape. The new x-axis is denoted as \(a_{x_{\pi }}\) and the loss for the x-axis is defined as the minimum loss over both candidates:

$$\begin{aligned} \begin{aligned} \mathcal {L}_x = \text {min}\left( \mathcal {L}_x(a_x), \mathcal {L}_x(a_{x_{\pi }})\right) \end{aligned} \end{aligned}$$
(15)
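A minimal sketch of Eq. (15); here the \(\pi \) rotation about the z-axis is applied to the ground-truth x-axis, which simply negates it since the x-axis is orthogonal to the z-axis.

```python
import torch

def planar_symmetric_x_loss(axis_loss_fn, a_x_hat, a_x_gt):
    """Sketch of Eq. (15): evaluate the x-axis loss against the ground-truth
    x-axis and its candidate rotated by pi about the z-axis, keeping the
    minimum. Because the x-axis is orthogonal to z, the rotated candidate is
    simply the negated axis."""
    a_x_pi = -a_x_gt
    return torch.min(axis_loss_fn(a_x_hat, a_x_gt),
                     axis_loss_fn(a_x_hat, a_x_pi))
```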

Scale Residual Estimation. Similar to the translation decoder, we define the scale prior \(s_{prior}\) as the average scale of all object 3D CAD models within each category. The scale of a given instance is then calculated as follows:

$$\begin{aligned} \begin{aligned}&\hat{s} = s_{prior} + \mathcal {F}_{s}\left( \mathcal {P}_{concat}\right) \\ \end{aligned} \end{aligned}$$
(16)

The loss function is defined as the \(L_1\) loss between the ground truth scale and estimated scale:

$$\begin{aligned} \mathcal {L}_s = \left|\hat{s} - s^*\right|\end{aligned}$$
(17)

4 Experiments

Dataset. We evaluated TransNet and baseline models on the Clearpose dataset [6] for categorical transparent object pose estimation. The Clearpose dataset contains over 350K real-world labeled RGB-D frames in 51 scenes and 9 sets, with around 5M instance annotations covering 63 household objects. We selected 47 objects and grouped them into 6 categories: bottle, bowl, container, tableware, water cup, and wine cup. We used all the scenes in set2, set4, set5, and set6 for training and the scenes in set3 and set7 for validation and testing. This division guaranteed that each category contained some objects unseen during training. Overall, we used 190K images for training and 6K for testing. For training depth completion and surface normal estimation, we used the same dataset split.

Implementation Details. Our model was trained in several stages. For all the experiments in this paper, we used the ground-truth instance segmentation as input, which could also be obtained from Mask R-CNN [11]. The image patches were generated from object bounding boxes and re-scaled to a fixed shape of \(256\times 256\) pixels. For TransCG, we trained with the AdamW optimizer [25], \(\lambda _{smooth} = 0.001\), and a learning rate of 0.001 until convergence. For U-Net, we used the Adam optimizer [17] with a learning rate of \(1\times 10^{-4}\), also training until convergence. For both surface normal estimation and depth completion, the batch size was set to 24 images. The surface normal estimation and depth completion models were frozen during training of the second stage.

For the second stage, the training hyperparameters for Pointformer followed those used in [48]. We applied data augmentation to the RGB features and to the instance masks used for sampling the generalized point cloud. A batch size of 18 was used. To balance the sampling distribution across categories, 3 instance samples were selected randomly for each of the 6 categories. We followed GPV-Pose [7] for the training hyper-parameters. The loss weights were kept fixed during training, \(\left\{ \lambda _{r_x}, \lambda _{r_z}, \lambda _{r_a}, \lambda _{t}, \lambda _{s}, \lambda _{con_x}, \lambda _{con_z}\right\} = \left\{ 8, 8, 4, 8, 8, 1, 1\right\} \times 0.0001\). We used the Ranger optimizer [22, 41, 43] with a linear warm-up for the first 1000 iterations, followed by cosine annealing starting at the 0.72 anneal point. All pose estimation experiments were trained on a 16 GB RTX 3080 GPU for 30 epochs of 6000 iterations each. All categories were trained with a single model, instead of one model per category.

Evaluation Metrics. For category-level pose estimation, we followed [5, 7], using the 3D intersection over union (IoU) between the ground truth and estimated 3D bounding boxes (the estimated 3D bounding box is drawn from the estimated scale and pose) at 25%, 50%, and 75% thresholds. Additionally, we used \(5^{\circ }2\) cm, \(5^{\circ }5\) cm, \(10^{\circ }5\) cm, and \(10^{\circ }\)10 cm as metrics; each reports the percentage of estimates with rotation and translation errors below the given degree and distance thresholds. For Sect. 4.4, we also used separate translation and rotation metrics (2 cm, 5 cm, 10 cm, \(5^{\circ }\), \(10^{\circ }\)) that compute the percentage with respect to a single factor.
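For illustration, the degree-centimeter metrics can be computed from per-object rotation and translation errors as sketched below (object symmetries are ignored in this sketch).

```python
import torch

def pose_errors(R_hat, t_hat, R_gt, t_gt):
    """Per-object rotation error (degrees) and translation error (cm)."""
    cos = ((R_hat.T @ R_gt).trace() - 1.0) / 2.0
    rot_err_deg = torch.rad2deg(torch.acos(cos.clamp(-1.0, 1.0)))
    trans_err_cm = (t_hat - t_gt).norm() * 100.0      # metres -> cm
    return rot_err_deg, trans_err_cm

# e.g., an estimate counts toward the 5 degree, 2 cm metric when
# rot_err_deg < 5 and trans_err_cm < 2.
```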

For depth completion evaluation, we calculated the root mean squared error (RMSE), absolute relative error (REL), and mean absolute error (MAE), and used \(\delta _{1.05}\), \(\delta _{1.10}\), and \(\delta _{1.25}\) as metrics, where \(\delta _n\) is calculated as:

$$\begin{aligned} \delta _n = \frac{1}{N_p}\sum _{p}{} {\textbf {I}}\left( \text {max}\left( \frac{\hat{\mathcal {D}}_p}{\mathcal {D}^*_p}, \frac{\mathcal {D}^*_p}{\hat{\mathcal {D}}_p}\right) < n\right) \end{aligned}$$
(18)

where \({\textbf {I}}(\boldsymbol{\cdot })\) represents the indicator function, and \(\hat{\mathcal {D}}_p\) and \(\mathcal {D}^*_p\) denote the estimated and ground truth depth at pixel p.
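A minimal sketch of Eq. (18), evaluated over the transparent mask:

```python
import torch

def delta_metric(d_hat, d_gt, mask, n=1.05):
    """Sketch of Eq. (18): fraction of masked pixels whose depth ratio
    max(d_hat/d_gt, d_gt/d_hat) falls below the threshold n."""
    ratio = torch.max(d_hat / d_gt, d_gt / d_hat)
    return (ratio[mask] < n).float().mean()
```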

For surface normal estimation, we calculated RMSE and MAE errors and used \(11.25^{\circ }\), \(22.5^{\circ }\), and \(30^{\circ }\) as thresholds. Here \(11.25^{\circ }\) represents the percentage of estimates with an angular distance less than \(11.25^{\circ }\) from ground truth surface normal.

4.1 Comparison with Baseline

Table 1. Comparison with the baseline on the Clearpose Dataset.

We chose a state-of-the-art categorical opaque object pose estimation model (GPV-Pose [7]) as a baseline, trained with estimated depth from TransCG [9] for a fair comparison. From Table 1, TransNet outperformed the baseline on most of the metrics on the Clearpose dataset. The \(3\text {D}_{25}\) metric is easy to satisfy, so the two methods perform similarly on it. For the remaining metrics, TransNet achieved around 2\(\times \) the percentage of the baseline on \(3\text {D}_{50}\), 3\(\times \) on \(10^{\circ }5\,\text {cm}\) and \(10^{\circ }10\,\text {cm}\), and 5\(\times \) on \(5^{\circ }5\,\text {cm}\) and \(5^{\circ }2\,\text {cm}\). Qualitative results for TransNet are shown in Fig. 4.

Fig. 4.

Qualitative results of category-level pose estimates from TransNet. The left column shows the original RGB images from our test set and the right column shows the pose estimation results. The white bounding boxes are the ground truth and the colored ones are the estimates; different colors represent different categories. For axially symmetric objects, because we only care about the scale and z-axis, we use the ground truth x-axis and the estimated z-axis to compute the visualized x-axis. A pitcher appears without either a ground-truth or an estimated bounding box because it does not belong to any of the defined categories and is ignored during both training and testing.

4.2 Embedding Method Analysis

In Table 2, we compared the embedding methods 3D-GCN [21] and Pointformer [48] within TransNet. The generalized point cloud modalities were depth, RGB, and ray direction (without surface normals) for all trials; the only differences between trials were the depth type and the embedding method. With ground truth depth as input, 3D-GCN and Pointformer achieved similar results, and on some metrics, such as \(5^{\circ }5\) cm, 3D-GCN was even better. However, when the ground truth depth was replaced with estimated depth (modeling the change from the opaque to the transparent setting), Pointformer retained much more accuracy than 3D-GCN. We attribute this to how each method aggregates features. Like many point cloud embedding methods, 3D-GCN propagates information between nearest neighbors, which is efficient when the point cloud has low noise; but with completed depth, the high noise makes passing data among neighbors unstable. Pointformer, in contrast, passes information across the whole point cloud regardless of the noise level. Therefore, given depth with large uncertainty, transformer-based embedding methods appear more robust than those based on nearest neighbors.

Table 2. Comparison between different embedding methods

4.3 Ablation Study of Generalized Point Cloud

We explored different combinations of feature inputs for the generalized point cloud to find the one most suitable for TransNet. Results are shown in Table 3. In trials 1 and 2, we compared the effect of adding estimated surface normals to the generalized point cloud. All the metrics demonstrate that the inclusion of surface normals improves the resulting pose estimation accuracy.

Table 3. Ablation study over different feature combinations for the generalized point cloud. In both trials, RGB is also used as an input feature.

4.4 Depth and Surface Normal Exploration on TransNet

Table 4. Accuracy for depth completion on Clearpose dataset. All the metrics are calculated within the transparent mask.
Table 5. Accuracy for surface normal estimation on Clearpose dataset.
Table 6. Evaluation for depth and surface normal accuracy on TransNet.

We explored combinations of depth and surface normal inputs of different accuracy. Results in Table 4 and Table 5 show the performance of TransCG and U-Net separately. "GT" and "EST" in Table 6 represent ground truth and estimated input for depth and surface normal, respectively. From the comparison among trials 1–3, accurate depth is more essential than surface normals for category-level transparent object pose estimation. For instance, as the ground truth depth is replaced by estimated depth from trial 1 to trial 3, \(5^{\circ }2\) cm decreases by 23.7; in comparison, replacing the surface normals only decreases \(5^{\circ }2\) cm by 8.4 between trial 1 and trial 2. More specifically, from the decoupled rotation and translation metrics, 2 cm decreases by 41.1 between trial 1 and trial 3 compared to 9.7 between trial 1 and trial 2, meaning that depth accuracy is more important for translation estimation. Focusing on 2 cm, 5 cm, and 10 cm between trial 1 and trial 4, the first metric decreases by 46.7 while the latter two lose much less (20.5 for 5 cm and 3.1 for 10 cm). This can be explained by the depth completion accuracy shown in Table 4 (MAE = 0.041 m, between 2 cm and 5 cm). From the comparison of trials 1–4 on the \(5^{\circ }\) and \(10^{\circ }\) metrics, either accurate surface normals or accurate depth can support good performance on rotation (for either trial 2 or trial 3, \(5^{\circ }\) decreases by 10.0 and \(10^{\circ }\) decreases by around 7). Once both are estimated, \(5^{\circ }\) decreases by 38.5 and \(10^{\circ }\) decreases by 38.2.

5 Conclusions

In this paper, we proposed TransNet, a two-stage pipeline for category-level transparent object pose estimation. TransNet outperformed a baseline by taking advantage of both state-of-the-art depth completion and opaque object category pose estimation. Ablation studies about multi-modal input and feature embedding modules were performed to guide deeper explorations. In the future, we plan to explore how category information can be used earlier in the network for better accuracy, improve depth completion potentially using additional consistency losses, and extend the model to be category-level across both transparent and opaque instances.