1 Introduction

Monocular visual scene understanding is a fundamental technology for many automatic applications, especially in the field of autonomous driving. Using only a single-view driving image, existing vehicle parsing studies have progressed from 2D vehicle detection, through 6D vehicle pose recovery, to vehicle shape reconstruction. However, much less effort has been devoted to vehicle texture estimation, even though both humans and autonomous cars rely heavily on the appearance of vehicles to perceive their surroundings. Simultaneously recovering the geometry and texture of vehicles is also important for synthetic driving data generation [19], vehicle tracking [20], vehicle parsing [23] and so on.

Fig. 1. We propose a method to recover realistic 3D textured models of vehicles from a single image (top left) under real street environments. Our approach can reconstruct the shape and texture with fine details. (We manually adjust the scale and layout of models for better visualization.)

Challenges for monocular geometry and texture recovery of vehicles mainly arise from the difficulties in inferring the invisible texture conditioned on only visible pixels while handling various vehicle shapes. Additionally, in real-world street environments, reconstruction methods are also expected to offset the adverse impact of complicated lighting conditions (e.g., strong sunlight and shadows) and diverse materials (e.g., transparent or reflective non-Lambertian surfaces). That said, the shape and appearance of vehicles are not completely arbitrary. Our key insight is that those challenges can be addressed with the prior knowledge from vehicle models, especially the part semantics. Therefore, we seek to find a method that is a) aware of the underlying semantics of vehicles, and b) flexible enough to recover various geometric structures and texture patterns.

Recently, deep implicit functions (DIFs), which model 3D shapes using continuous functions in 3D space, have been proven powerful in representing complex geometric structures [22, 28]. Texture Fields (TF) [26] and PIFu [31] took a step further by representing mesh texture with implicit functions and estimating point color conditioned on the input image. To do so, both TF and PIFu diffuse the surface color into the 3D space. However, it remains physically unclear how to define and interpret the color value off the surface. Moreover, geometry and texture are not fully disentangled in either PIFu or TF, as both rely on the location of the surface to diffuse the color into 3D space, making it difficult to incorporate semantic constraints.

In this paper, we explore a novel method, VERTEX, for VEhicle Reconstruction and TEXture estimation from a single image in real-world street environments. At its core is a novel implicit geo-tex representation that extends DIFs and jointly represents vehicle surface geometry and texture using implicit semantic template mapping. The key idea is to map each vehicle instance to a canonical template field [8, 39] in a semantic-preserving manner. In our geo-tex representation, texture inference is constrained on the 2-manifold of the canonical template; in this way, we can leverage the semantic prior of vehicle template, encourage the model to learn a consistent latent space for all vehicles and bypass the unclear physical meaning of a texture field.

However, training such a representation for vehicle reconstruction is not straightforward, because we have no access to the ground-truth mapping from vehicle instances to the canonical template field. Previous works [8, 39] train the mapping network in an unsupervised manner, where the mapping follows a shortest-distance principle. As a result, the mapping in these methods is not guaranteed to preserve accurate semantic correspondences. To resolve this drawback, we propose a joint training method for the geometry reconstruction and texture estimation networks. Our training method differs substantially from the “first geometry, then texture” schedule adopted by typical reconstruction works [13, 26, 31]. It stems from the insight that the surface texture is closely related to its semantic labels; consider, for example, the appearance differences between parts such as car bodies, windows, tires and lights. The texture information can therefore serve as additional supervision that forces the template mapping to be semantics-preserving.

Trained with our joint training method, our implicit geo-tex representation combines the advantages of both mesh templates and implicit functions: on one hand, it is expressive enough to represent various shapes, which is the main advantage of DIFs; on the other hand, it disentangles the texture representation from geometry and thus supports many downstream tasks, including material editing and texture transfer. Although initially designed for vehicles, our method generalizes to other objects such as bikes, planes and sofas.

To simulate real street environments and evaluate our method, we also contribute a synthetic dataset containing 830 elaborately textured car models, rendered with the Physically Based Rendering (PBRT) system under measured HDRI skymaps to obtain highly realistic images. Each instance is labeled with key points as semantic annotations and can be used for evaluation and future research.

In summary, our contributions include:

  • a novel implicit geo-tex representation with semantic dense correspondences and latent space disentanglement, enabling fine-grained texture estimation, part-level understanding and vehicle editing;

  • a joint training strategy leveraging the consistency between RGB color and part semantics for semantics-preserving template mapping;

  • a new vehicle dataset containing diverse, detailed car CAD models, images rendered with PBRT, and corresponding real-world HDRI skymaps.

2 Related Work

2.1 Monocular Vehicle Reconstruction

Recently, many works [1, 10, 13, 16] have concentrated on recovering 3D vehicle texture in real environments. Due to the lack of ground-truth 3D data for real scenes, they mainly learn reconstruction from collections of 2D images using unsupervised or self-supervised learning and build on mesh representations. Although these works eliminate the need for 3D annotations and generate meaningful textured vehicle models, they still suffer from coarse reconstruction results and the limitations of fixed-topology representations. With large-scale synthetic datasets such as ShapeNet [4], many works [6, 26, 33] train deep neural networks to perform vehicle reconstruction from images. Based on volumetric representations such as 3D voxels [33] and implicit functions [26], these works generate plausible textured models on synthetic data, but still struggle with low-quality texture. In contrast, our approach outperforms state-of-the-art methods in terms of visual fidelity and 3D consistency while representing topology-varying objects.

In addition, some works [3, 9, 25, 27, 38, 40] focus on novel view synthesis, i.e., inferring texture in the 2D domain. Although they can produce realistic images, they lack a compact 3D representation, which is not in line with our goal.

2.2 Deep Implicit Representation

Traditionally, implicit functions represent shapes by constructing a continuous volumetric field and embedding surfaces as its iso-surfaces [2, 32, 34]. In recent years, implicit functions have been implemented with neural networks [5, 11, 22, 28, 31, 37] and have shown promising results. For example, DeepSDF [28] learns an implicit function whose network output represents the signed distance of a point to its nearest surface. Other approaches define implicit functions as 3D occupancy probability functions and cast shape representation as a point classification problem [5, 22, 31, 37].

As for texture inference, both TF [26] and PIFu [31] define texture implicitly as a function of 3D positions. The former uses global latent codes separately extracted from the input image and the geometry, whereas the latter leverages local pixel-aligned features. Compared with these approaches [26, 31], which predict the texture distribution in the whole 3D space, our method explicitly constrains the texture distribution to the 2D manifold of the template surface via implicit semantic template mapping.

Fig. 2. Overview of our approach. Given a single RGB image, vehicle digitization is achieved by geometry and texture reconstruction. We first convert the input image into an albedo map, and then extract multi-scale latent codes in Latent Embedding. Conditioned on these latent codes, our neural networks infer the SDF to reconstruct the mesh surface and then regress RGB values for the surface.

3 Implicit Geo-Tex Representation

Our method for vehicle reconstruction and texture estimation is built upon a novel geo-tex joint representation, which is presented in this section.

3.1 Basic Formulation

We believe that an ideal geo-tex representation should disentangle texture representation from geometry, as UV mapping does, and should accord with the physical fact that texture is attached only to the 2D surface of the object. In particular, observing that vehicles are a class of objects with a strong template prior, we extend DIT [39] and propose a joint geo-tex representation using deep implicit semantic templates. The key idea is to manipulate the implicit field of the vehicle template to represent vehicle geometry while embedding texture on the 2-manifold of the template surface. Mathematically, we denote the vehicle template surface by \(\mathcal {S}_T\), defined as the zero level set of a signed distance function \(F: \mathbb {R}^3\mapsto \mathbb {R}\), i.e. \( F(\boldsymbol{q})=0\), where \(\boldsymbol{q}\in \mathbb {R}^3\) denotes a 3D point. Our representation can then be formulated as:

$$\begin{aligned} \left\{ \begin{array}{l} \boldsymbol{p}_{tp}=W(\boldsymbol{p}, \boldsymbol{z}_{shape})\\ s=F(\boldsymbol{p}_{tp})\\ \boldsymbol{p}_{tp}^{(S)}=W(\boldsymbol{p}^{(S)}, \boldsymbol{z}_{shape})\\ c=T(\boldsymbol{p}_{tp}^{(S)}, \boldsymbol{z}_{tex}) \end{array} \right. \end{aligned}$$
(1)

where \(W:\mathbb {R}^3\times \mathcal {X}_{shape}\mapsto \mathbb {R}^3\) is a spatial warping function mapping the 3D point \(\boldsymbol{p}\in \mathbb {R}^3\) to the corresponding location \(\boldsymbol{p}_{tp}\) in the canonical template space conditioned on the shape latent code \(\boldsymbol{z}_{shape}\), and F queries the signed distance value s at \(\boldsymbol{p}_{tp}\). \(\boldsymbol{p}^{(S)}\in \mathcal {S}\subset \mathbb {R}^3\) is a 3D point on the vehicle surface \(\mathcal {S}\), which is also mapped onto the template surface \(\mathcal {S}_T\) by the warping function W, and \(T:\mathcal {S}_T\times \mathcal {X}_{tex}\mapsto \mathbb {R}^3\) regresses the color value c of the template surface point \(\boldsymbol{p}_{tp}^{(S)}\) conditioned on the texture latent code \(\boldsymbol{z}_{tex}\). Intuitively, we map the vehicle surface to the template using the warping function W and embed the surface texture of different vehicles onto one unified template. Therefore, in our representation, texture is defined only on the template surface (a 2D manifold), avoiding the unclear physical meaning of a three-dimensional texture field.
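For concreteness, the following is a minimal PyTorch sketch of Eq. (1), with a warping network for W, a template SDF network for F and a color network for T. The layer sizes, latent dimensions and the offset-style warping are our own assumptions for illustration, not the exact architecture of VERTEX.

```python
# Minimal sketch of the geo-tex representation in Eq. (1); network depths,
# widths, latent dimensions and offset-style warping are assumptions.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256, depth=4):
    layers, d = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class GeoTex(nn.Module):
    def __init__(self, shape_dim=256, tex_dim=256):
        super().__init__()
        self.warp = mlp(3 + shape_dim, 3)   # W: instance space -> template space (as an offset)
        self.template_sdf = mlp(3, 1)       # F: SDF of the shared canonical template
        self.rgb = mlp(3 + tex_dim, 3)      # T: color defined on the template surface

    def to_template(self, p, z_shape):
        z = z_shape.unsqueeze(0).expand(p.shape[0], -1)
        return p + self.warp(torch.cat([p, z], dim=-1))            # p_tp = W(p, z_shape)

    def sdf(self, p, z_shape):
        return self.template_sdf(self.to_template(p, z_shape))     # s = F(p_tp)

    def color(self, p_surf, z_shape, z_tex):
        p_tp = self.to_template(p_surf, z_shape)                   # p_tp^(S)
        z = z_tex.unsqueeze(0).expand(p_tp.shape[0], -1)
        return torch.sigmoid(self.rgb(torch.cat([p_tp, z], dim=-1)))   # c = T(p_tp^(S), z_tex)
```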

3.2 Formulation for Image-based Reconstruction

For a specific instance, the shape information is encoded by \(\boldsymbol{z}_{shape}\), while the texture information is encoded by \(\boldsymbol{z}_{tex}\), both of which can be extracted from the input image using CNN-based encoders. To further preserve the fine details present in the monocular observation, we fuse local texture information represented as \(\boldsymbol{z}_{loc\_tex}(\boldsymbol{p})\) at the pixel level. Not only does the texture in visible regions benefit from the local features, but invisible regions can also be enhanced through the structure prior of the template. Formally, our formulation can be rewritten as:

$$\begin{aligned} \left\{ \begin{array}{l} \boldsymbol{p}_{tp}=W(\boldsymbol{p}, \boldsymbol{z}_{shape})\\ s=F(\boldsymbol{p}_{tp})\\ \boldsymbol{p}_{tp}^{(S)}=W(\boldsymbol{p}^{(S)}, \boldsymbol{z}_{shape})\\ c=T(\boldsymbol{p}_{tp}^{(S)},\boldsymbol{z}_{tex},\boldsymbol{z}_{loc\_tex}(\boldsymbol{p})) \end{array} \right. \end{aligned}$$
(2)

where \(T:\mathcal {S}_T\times \mathcal {X}_{tex}\times \mathcal {X}_{loc\_tex}\mapsto \mathbb {R}^3\) is conditioned on the latent codes \(\boldsymbol{z}_{tex}\) and \(\boldsymbol{z}_{loc\_tex}\).
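As a minimal sketch of the texture branch in Eq. (2), the RGB decoder below takes the warped surface point together with the global code \(\boldsymbol{z}_{tex}\) and the per-point pixel-aligned code \(\boldsymbol{z}_{loc\_tex}(\boldsymbol{p})\); the hidden sizes and code dimensions are assumptions.

```python
# Sketch of the RGB decoder T in Eq. (2); hidden sizes and code dims are assumptions.
import torch
import torch.nn as nn

class TextureDecoder(nn.Module):
    def __init__(self, tex_dim=256, loc_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + tex_dim + loc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))

    def forward(self, p_tp_surf, z_tex, z_loc_tex):
        # p_tp_surf: (N, 3) surface points already warped onto the template
        # z_tex:     (tex_dim,) global texture code, shared by all points
        # z_loc_tex: (N, loc_dim) pixel-aligned local codes, one per point
        z_g = z_tex.unsqueeze(0).expand(p_tp_surf.shape[0], -1)
        return torch.sigmoid(self.net(torch.cat([p_tp_surf, z_g, z_loc_tex], dim=-1)))
```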

In summary, aiming at vehicle texture recovery, our representation is more expressive while incurring lower complexity. However, implementing and training this representation for textured vehicle reconstruction is not straightforward. We describe how we achieve this goal in Sect. 4.

Fig. 3. To implement implicit semantic template mapping (right), we minimize the data terms of geometry (blue arrows) and texture (green arrows) reconstruction simultaneously. In addition, regularization terms (orange and pink arrows) for specific network modules are applied to assist training. Note that Z in the RGB Decoder is the concatenation of the global and local texture latent codes. (Color figure online)

4 Joint Geo-tex Training Method

4.1 Network Architecture

Figure 2 illustrates the overview of our network, which consists of three modules: Latent Embedding (yellow), Geometry Reconstruction (blue) and Texture Estimation (green). Our network takes as input a single vehicle image and the corresponding 2D silhouette, which can be produced by off-the-shelf 2D detectors [15], and generates a textured mesh.

Albedo Recovery: We empirically found that directly extracting texture latent codes from the input images leads to unsatisfactory results. Therefore, before feeding the input image to our network, we first infer its intrinsic color in the 2D domain by means of image-to-image translation [30], and the recovered albedo image is used as the input for the texture encoders in Latent Embedding. We find that this module effectively alleviates the adverse effects of image illumination and helps recover consistent texture.

Latent Embedding: The global shape and texture latent codes, \(\boldsymbol{z}_{shape}\) and \(\boldsymbol{z}_{tex}\), are extracted from the input image and the recovered albedo map using two separate ResNet-based [12] encoders. The local texture feature, \(\boldsymbol{z}_{loc\_tex}(\boldsymbol{p})\), is sampled following the practice of PIFu [31], as sketched below. Unlike other texture inference works [26, 31], which utilize only either global or local features for texture reconstruction, we fuse multi-scale texture features to recover robust and detailed texture.
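The pixel-aligned sampling could look like the sketch below, which projects a 3D point into the image plane and bilinearly samples the encoder feature map; the projection helper `project` and the feature-map layout are assumptions for illustration.

```python
# Sketch of pixel-aligned feature sampling in the spirit of PIFu; the projection
# function and feature-map layout are assumptions.
import torch
import torch.nn.functional as Fnn

def sample_local_features(feat_map, points, project):
    """
    feat_map: (1, C, H, W) feature map from the albedo-image encoder
    points:   (N, 3) query points in the camera/object frame
    project:  callable mapping (N, 3) points to normalized image coords in [-1, 1]
    returns:  (N, C) per-point local texture codes z_loc_tex(p)
    """
    uv = project(points)                       # (N, 2), (x, y) in [-1, 1]
    grid = uv.view(1, 1, -1, 2)                # grid_sample expects (B, H_out, W_out, 2)
    sampled = Fnn.grid_sample(feat_map, grid, align_corners=True)   # (1, C, 1, N)
    return sampled.squeeze(0).squeeze(1).t()   # (N, C)
```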

Geometry Reconstruction and Texture Estimation: These two modules form the core of VERTEX. They consist of three main components: Template Mapping, Template SDF Decoder and RGB Decoder. Conditioned on \(\boldsymbol{z}_{shape}\), volume samples are fed sequentially to the Template Mapping and Template SDF Decoder to predict the continuous signed distance field. For texture estimation, surface points on the reconstructed mesh are first warped to the template surface conditioned on \(\boldsymbol{z}_{shape}\), and then passed through the RGB Decoder together with the latent codes \(\boldsymbol{z}_{tex}\), \(\boldsymbol{z}_{loc\_tex}(\boldsymbol{p})\) and \(\boldsymbol{z}_{pose}\) to predict texture.

4.2 Network Training

Based on our implicit geo-tex representation, we train the geometry and texture reconstruction networks jointly. We visualize the training process in Fig. 3 and give the detailed definitions of our training losses below.

Data Loss: For geometry reconstruction, we mainly train by minimizing the \(\ell _1\) loss between the predicted and the ground-truth point SDF values:

$$\begin{aligned} L_{geo}=\frac{1}{N_{sdf}}\sum _{i=1}^{N_{sdf}}\left\| F(W\left( \boldsymbol{p}_i, \boldsymbol{z}_{shape} \right) )-s_i \right\| _1 \end{aligned}$$
(3)

where \(N_{sdf}\) is the number of input sample points, \(\boldsymbol{z}_{shape}\) is the shape latent code corresponding to the volume sample point \(\boldsymbol{p}_i\), and \(s_i\) is the corresponding ground-truth SDF value at \(\boldsymbol{p}_i\).

To train the texture estimation network, we minimize the \(\ell _1\) loss between the regressed and the ground-truth intrinsic RGB values:

$$\begin{aligned} L_{tex}=\frac{1}{N_{sf}}\sum _{i=1}^{N_{sf}}\left\| T\left( W\left( \boldsymbol{p}_i^{(S)}, \boldsymbol{z}_{shape} \right) , \boldsymbol{z}_{tex}, \boldsymbol{z}_{loc\_tex}(\boldsymbol{p}_i^{(S)})\right) -c_i \right\| _1 \end{aligned}$$
(4)

where \(N_{sf}\) is the number of input surface points, \(c_i\) is the corresponding ground-truth color value at the surface point \(\boldsymbol{p}_i^{(S)}\), and \(\boldsymbol{z}_{shape}\), \(\boldsymbol{z}_{tex}\) and \(\boldsymbol{z}_{loc\_tex}\) are the latent codes corresponding to \(\boldsymbol{p}_i^{(S)}\).

Regularization Loss: To establish a continuous mapping between the instance space and the canonical template space, we introduce an additional regularization term to constrain the position offsets of points after warping:

$$\begin{aligned} L_{reg}=\frac{1}{N_{sdf}}\sum _{i=1}^{N_{sdf}}\left\| W\left( \boldsymbol{p}_i, \boldsymbol{z}_{shape} \right) - \boldsymbol{p}_i \right\| _2 \end{aligned}$$
(5)

Template SDF Supervision: We supervise the Template SDF Decoder directly using sample points of the template car model. The loss is defined as:

$$\begin{aligned} L_{tp\_sdf}=\frac{1}{N_{tp\_sdf}}\sum _{i=1}^{N_{tp\_sdf}}\left\| F(\boldsymbol{p}_i^{(tp)})-s_i^{(tp)} \right\| _1 \end{aligned}$$
(6)

where \(N_{tp\_sdf}\) is the number of input sample points, \(\boldsymbol{p}_i^{(tp)}\) is a volume sample point around the template model and \(s_i^{(tp)}\) is the corresponding SDF value.

Overall, the total loss function is formulated as the weighted sum of the above terms:

$$\begin{aligned} L = L_{tex}+w_{g}L_{geo}+w_{reg}L_{reg}+w_{t}L_{tp\_sdf} \end{aligned}$$
(7)
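A training step could then combine Eqs. (3)–(7) as sketched below, reusing the representation and decoder sketches above; the loss weights, the batch layout and the exposed interface of `model` (to_template, template_sdf) and `tex_decoder` are assumptions made for illustration.

```python
# Sketch of the joint loss in Eqs. (3)-(7); weights and batch layout are assumptions.
import torch
import torch.nn.functional as Fnn

def joint_loss(model, tex_decoder, batch, w_g=1.0, w_reg=0.01, w_t=1.0):
    # Eq. (3): l1 loss on SDF values of volume samples warped into template space.
    p_tp = model.to_template(batch["vol_pts"], batch["z_shape"])
    l_geo = Fnn.l1_loss(model.template_sdf(p_tp), batch["gt_sdf"])

    # Eq. (4): l1 loss on albedo colors of surface samples warped onto the template.
    p_tp_s = model.to_template(batch["surf_pts"], batch["z_shape"])
    pred_rgb = tex_decoder(p_tp_s, batch["z_tex"], batch["z_loc_tex"])
    l_tex = Fnn.l1_loss(pred_rgb, batch["gt_rgb"])

    # Eq. (5): keep warped points close to their original positions.
    l_reg = (p_tp - batch["vol_pts"]).norm(dim=-1).mean()

    # Eq. (6): supervise the template SDF with samples around the template car.
    l_tp = Fnn.l1_loss(model.template_sdf(batch["tp_pts"]), batch["tp_sdf"])

    # Eq. (7): weighted sum of all terms.
    return l_tex + w_g * l_geo + w_reg * l_reg + w_t * l_tp
```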

4.3 Inference

As shown in the pipeline in Fig. 2, during inference we first regress the signed distance field with the Geometry Reconstruction branch, and then 3D points on the extracted surface are fed into the Texture Estimation branch to recover the surface texture. However, because ground-truth camera intrinsic and extrinsic parameters are unavailable, it is difficult for a 3D point to sample the correct local feature from the feature map, which poses a significant challenge. We address this problem by setting up a virtual camera and further optimizing the 6D pose within a render-and-compare optimization framework. See the supplementary material for details.
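A rough sketch of this inference procedure, under the interfaces assumed in the earlier sketches, is given below: the SDF is evaluated on a dense grid, the zero level set is extracted with marching cubes, and the mesh vertices are colored by the texture branch. The grid resolution, bounding box and the `sample_loc_tex` helper (which would wrap the pose-optimized projection and feature sampling) are assumptions.

```python
# Rough inference sketch: dense SDF evaluation, marching cubes, per-vertex color.
# Grid resolution, bounding box and the model/decoder interfaces are assumptions.
import numpy as np
import torch
from skimage.measure import marching_cubes

@torch.no_grad()
def reconstruct(model, tex_decoder, z_shape, z_tex, sample_loc_tex, res=128, bound=1.0):
    # Evaluate the instance SDF on a regular grid inside [-bound, bound]^3.
    axis = torch.linspace(-bound, bound, res)
    pts = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1).reshape(-1, 3)
    sdf = model.sdf(pts, z_shape).reshape(res, res, res).cpu().numpy()

    # Extract the zero level set as a triangle mesh.
    verts, faces, _, _ = marching_cubes(sdf, level=0.0)
    verts = verts / (res - 1) * 2 * bound - bound        # grid indices -> world coordinates

    # Color each vertex: warp onto the template, then query the RGB decoder.
    v = torch.from_numpy(verts.astype(np.float32))
    v_tp = model.to_template(v, z_shape)
    colors = tex_decoder(v_tp, z_tex, sample_loc_tex(v)).cpu().numpy()
    return verts, faces, colors
```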

Fig. 4. Results on in-the-wild images. Monocular input images are shown in the top row. We compare 3D models reconstructed by our method with those of the baselines (PIFu and Onet+TF) retrained on our dataset. Two rendered views are provided to demonstrate reconstruction quality. Our method performs well in terms of both robustness and accuracy.

5 Experiments

In this section, we first introduce the new vehicle dataset in Sect. 5.1. In Sect. 5.2, we present reconstruction results in real environments and quantitative scores on our dataset compared with two state-of-the-art baselines. Ablation studies and the generalization to other object categories are presented in the supplementary material.

5.1 Dataset

To generate the synthetic dataset, we collect 83 industry-grade 3D CAD models covering common vehicle types, each labeled with 23 semantic key points. We specifically select a commonly seen car as the vehicle template. To enrich the texture diversity of our dataset, we assign ten different textures to each model. We generate images with high visual fidelity using the Physically Based Rendering (PBRT) [29] system and measured HDRI skymaps from the Laval HDR Sky Database [18]. In total, we obtain a training set of 6,300 instances and a testing set of 2,000 instances. Please refer to the supplementary material for more details.

5.2 Results and Comparison

We compare our method with two state-of-the-art methods based on implicit functions. One is PIFu [31], which leverages pixel-aligned features to infer both occupancy probabilities and the texture distribution. The other is Onet + Texture Field [22, 26], in which Onet reconstructs the shape from the monocular input image and TF infers the color of surface points conditioned on the image and the geometry. For a fair comparison, we retrain both methods on our dataset, concatenating the RGB image and the instance mask into a 4-channel RGB-M image as the new input. Specifically, for PIFu, instead of the stacked hourglass network [24] designed for human-related tasks, we use ResNet34 as the encoder backbone and extract the features before every pooling layer in ResNet to obtain feature embeddings. For Onet and TF, we use the original encoder and decoder networks and adjust the dimensions of the corresponding latent codes to match those in our method.

Qualitative Comparison. To show that our method adapts to real-world images, we collect several images from KITTI [21], CityScapes [7], ApolloScape [35], CCPD, SCD [17] and the Internet. As shown in Fig. 4, our approach generates more robust results than PIFu, while recovering much more texture detail than the combination of Onet and Texture Field.

Table 1. Quantitative evaluation using the FID and SSIM metrics on our dataset. For SSIM, larger is better; for FID, smaller is better. Our method achieves the best results on both metrics.

Quantitative Comparison. To quantitatively evaluate the reconstruction quality of the different methods, we use two metrics: the structural similarity index measure (SSIM) [36] and the Fréchet inception distance (FID) [14]. These two metrics respectively measure the local and global quality of images. SSIM is a local score that measures the distance between the rendered image and the ground truth on a per-instance basis (larger is better). FID is widely used in GAN evaluation to measure the perceptual distance between the distributions of predicted and ground-truth images. It is worth noting that neither SSIM nor FID can evaluate the quality of generated 3D texture directly: all textured 3D objects must be rendered into 2D images from the same viewpoints as the ground truth. To obtain a more convincing result, for each generated 3D textured model we render it from 10 different views and evaluate the scores between the renderings and the corresponding ground-truth albedo images. As shown in Tab. 1, our method gives significantly better results in terms of FID and achieves state-of-the-art results in terms of SSIM, indicating that our 3D models preserve stable and fine details under multi-view observation. The quantitative results agree with the performance illustrated in the qualitative comparison.
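As an illustration of this multi-view protocol, the snippet below averages per-view SSIM over the renderings of one instance using scikit-image; the rendering pipeline is omitted and the input lists are placeholders. FID would be computed analogously between the full sets of rendered and ground-truth images with a standard implementation.

```python
# Sketch of the per-instance multi-view SSIM score; rendering is assumed to be
# done beforehand and the two lists hold views from the same 10 viewpoints.
import numpy as np
from skimage.metrics import structural_similarity

def multiview_ssim(pred_views, gt_views):
    """pred_views, gt_views: lists of HxWx3 uint8 renderings (same viewpoints, same order)."""
    scores = [
        structural_similarity(p, g, channel_axis=-1, data_range=255)
        for p, g in zip(pred_views, gt_views)
    ]
    return float(np.mean(scores))
```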

We also implement a variant of our method that does not fuse local features, for the purpose of a fair comparison. As shown in Tab. 1, our reconstruction conditioned on global latent codes alone still outperforms ‘Onet+TF’, demonstrating that our representation is more expressive in terms of inferring texture on the vehicle surface.

6 Conclusion

In this paper, we have introduced VERTEX, a novel method for monocular vehicle reconstruction in real-world traffic scenarios. Experiments demonstrate that our method can recover 3D vehicle models with robust and detailed texture from a monocular image. Based on the proposed implicit semantic template mapping, we have presented a new geometry-texture joint representation to constrain texture distribution on the template surface. We believe the proposed implicit geo-tex representation can further inspire 3D learning tasks on other classes of objects sharing a strong template prior.