1 Introduction

Novel view synthesis (NVS) addresses the problem of generating new view images of a scene or an object given one or more input views, and can be used to render arbitrary viewpoints of real-world scenes in virtual reality (VR). As shown in Fig. 1, given one image of a chair, we generate a new image of the same chair from a novel viewpoint (GT denotes the real target view image). Traditional NVS methods often rely on parallax maps or depth maps, which incur high computational costs. With the development of neural networks, supervised learning methods can now synthesize high-quality novel views. NVS has applications in many areas, including image editing and animating still photographs.

Fig. 1. With an input image, a novel view of the same object or scene is synthesized. The figure shows the results of DPNet compared with two other methods, all of which synthesize the target view by predicting a target depth map; DPNet produces clearer results than the other two depth-guided view synthesis methods. (Color figure online)

NVS methods proposed in recent years fall into two main types: 3D geometry-based methods and image-based rendering (IBR) methods. 3D geometry-based methods require comprehensive 3D understanding, so the first step is to recover an approximate underlying 3D structure. Some methods estimate the underlying 3D geometry in the form of voxels [1] or meshes [2], and then apply the corresponding camera transformations to the pixels of the 3D structure to produce the final output [3]. However, they not only require considerable time and resources, but also produce holes where prior information is lacking. In such cases, hole-filling algorithms are needed, but these algorithms are not always effective [4].

Unlike 3D geometry-based methods, IBR methods generate novel images directly from input images. Because pixels from the source views can be reprojected to the target view, low-level details such as colors and textures are well preserved. Zhou et al. [5] directly estimate the appearance flow and obtain the final pixel values of the target view from the input views, and Chen et al. [6] predict the target depth map to obtain pixel-to-pixel correspondences through 3D warping. Hou et al. [7] also predict the depth map of the target view, but warp feature maps to generate the final target view image rather than directly warping pixels from the source view. These IBR methods all achieve high view synthesis quality.

Fig. 2. Overview of the view synthesis pipeline. There are two main components: the depth net and the pose net. The depth net takes only the source view as input and generates the depth map \(\hat{D_s}\). \(\hat{D_s}\) and two nearby views are used to reconstruct the source view, and the L1 loss between the generated source view and the real source view is minimized to train the depth net. The pose net then transforms the feature points extracted from \(\hat{D_s}\) to produce the depth map of the target view \(\hat{D_t}\). Finally, \(\hat{D_t}\) is used to warp pixels of the source view to the target view with bilinear interpolation.

In this paper, we design a framework to improve view synthesis quality. Motivated by the advantages of IBR methods, we adopt the approach of warping pixels from the source view to the target view with the help of the target depth map. More specifically, we propose DPNet, which consists of a depth net and a pose net as shown in Fig. 2. DPNet first predicts the source depth map and then deduces the target depth map from it, rather than producing the target depth map directly. In this way, a skip-connection structure can be introduced into the pose net, so multi-level feature maps extracted from the source depth map can be transferred to the target depth map to improve its accuracy. To further improve the accuracy of the predicted source depth map, two nearby view images are reprojected to the source view through the predicted source depth map. The camera transformation and the predicted source depth map are then fed into the pose net to generate the target depth map. Subsequently, the generated target depth map is used to compute dense correspondences between the source view and the target view via perspective projection. Finally, the output image is synthesized via pixel warping.

To obtain clear and continuous synthesis results, four specially designed loss functions are used to train DPNet. The supervision loss improves the depth map estimation accuracy, while the L1 reconstruction loss and the VGG perceptual loss encourage realistic images. Moreover, the edge smoothness loss makes the final target depth map more continuous at edges. Detailed experiments are conducted on a synthetic object dataset [8] and a real scene dataset [9], and the depth estimation accuracy and image quality are evaluated qualitatively and quantitatively. The experimental results demonstrate that DPNet indeed improves depth estimation accuracy and image quality.

2 Related Work

The study of novel view synthesis has a long history in computer vision and graphics. Existing approaches differ in whether they use pure images or a 3D geometric structure, and in whether a single view image or multiple view images are fed into the neural network. Recently, neural radiance fields and generative models have emerged as new directions.

2.1 Geometric View Synthesis

If multiple images of a scene are provided, a 3D geometry scaffold can be constructed with the help of COLMAP [10, 11]. Riegler et al. [3] first ran structure-from-motion [10] to obtain camera intrinsics and camera poses, then ran multi-view stereo [11] on the posed images to obtain per-image depth maps, and finally fused these maps into a point cloud. Similarly, Penner et al. [12] warped the extracted source feature maps into the target view using a depth map derived from the 3D geometry scaffold. A confidence image and a color image for each input image were obtained from these warped feature maps, and these confidence and color images were then aggregated to produce the final output. More recently, deep learning techniques have created a new level of possibility and flexibility. Lombardi et al. [13] learned an implicit voxel representation of an object given many training views and generated new views of that object at test time.

2.2 3D from Single Image

Inference about 3D shape can serve as an implicit step in view synthesis. Given the severe ill-posedness of recovering 3D shape from a single image, recent work deploys neural networks for this task. These methods can be categorized by their output representation into mesh, point cloud, and voxel approaches. With a single image as input, Tatarchenko et al. [14] predicted many unseen views and their depth maps, and these views were fused into a 3D point cloud which was later optimized to obtain a mesh. In [4], the features extracted from a single input and the depth map estimated from the same input were used to create a point cloud carrying features. Many works explore using a DNN to predict 3D object shapes [15] or the depth map of a scene given an image [16]. These works focus on the quality of the 3D predictions rather than on the view synthesis task.

2.3 Image-Based Rendering Methods

Recently, many deep neural networks have been developed to learn the image-to-image mapping between source and target views [5,6,7, 17, 18]. Zhou et al. [5] directly estimated an appearance flow map in order to warp pixels of the source view to their positions in the target view, and Sun et al. [17] further refined the output by fusing multiple views with confidence maps. With the help of a predicted target depth map, Chen et al. [6] directly warped pixels of the source view to the target view, while Hou et al. [7] warped the multi-level feature maps extracted from the source view to synthesize the final output. To improve the quality of the synthesized image, Park et al. [18] used two consecutive encoder-decoder networks, first predicting a disocclusion-aware flow and then refining the transformed image with a completion network. In this paper, the target depth map is not predicted directly from the inputs; instead, the source depth map is first estimated by the depth net and then transformed into the target depth map by the pose net.

2.4 Generative Models and Neural Radiance Fields

View synthesis can also be regarded as an image generation task, and it is closely related to the field of generative image modeling [19, 20]. In [21], explicit pose control was allowed, and a voxel representation was also used. Although these methods can be used for view synthesis, the resulting views lack consistency and offer no control over the objects to be synthesized. The neural radiance field (NeRF) [22] produces impressive results by training a multi-layer perceptron (MLP) to map 3D rays to occupancy and color. Images are synthesized from this representation by volume rendering. This approach has been extended to unbounded outdoor scenes and crowdsourced image collections.

3 The Proposed Method for Novel View Synthesis

Figure 2 shows an overview of DPNet, which consists of two subnets: the depth net \(\varPsi _D\) and the pose net \(\varPsi _P\). The depth net first estimates the depth map of the source view. It uses a skip-connection structure with four downsampling and four upsampling layers to produce a prediction of the same spatial resolution as the input. This is followed by a sigmoid layer and a renormalization step, so that the predicted depth falls within the minimum and maximum values of each dataset. The predicted depth map is used to warp two nearby view images to the source view, and the L1 distance between the generated source view and the real source view is used to train the depth net. As for the pose net, the given transformation matrix is applied to the 3D feature points extracted from the predicted source depth map to obtain the 3D feature points of the target depth map. From the transformed 3D feature points, the depth map of the target view is predicted. The estimated target depth map is then used to find dense correspondences between the target and source views. Finally, the source image is warped into the target image via bilinear interpolation.
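The following minimal sketch (PyTorch, not the authors' code) summarizes this forward pass; DepthNet, PoseNet, and the pixel-warping operator pixel_warp are our own placeholder names, assumed to behave as described in the following subsections.

```python
# A minimal sketch of the DPNet forward pass under the assumptions stated above.
import torch
import torch.nn as nn


class DPNet(nn.Module):
    def __init__(self, depth_net: nn.Module, pose_net: nn.Module):
        super().__init__()
        self.depth_net = depth_net   # predicts the source depth map from I_s
        self.pose_net = pose_net     # maps D_s (plus the camera transform) to D_t

    def forward(self, I_s, K, T_s_to_t):
        D_s = self.depth_net(I_s)                     # source depth map \hat{D}_s
        D_t = self.pose_net(D_s, T_s_to_t)            # target depth map \hat{D}_t
        T_t_to_s = torch.inverse(T_s_to_t)            # warping needs the target-to-source pose
        I_t_hat = pixel_warp(I_s, D_t, K, T_t_to_s)   # Eqs. (1)-(2): reproject + bilinear sample
        return I_t_hat, D_s, D_t
```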

Fig. 3. Illustration of the pixel warping process from the source view to the target view. Each pixel \(p_{ct}\) in the target view is first reprojected onto the source view based on the predicted depth map and the camera pose transformation, and then the pixel value in the target view is obtained by bilinear interpolation.

3.1 Pixel Warping

The reprojection process and the bilinear interpolation process are shown in Fig. 3. For the reprojection process, the per-pixel correspondence \(C_{ct \rightarrow cs}\) is obtained from the target depth map \(D_{ct}\) by converting the depth map to 3D coordinates \([X, Y, Z]\) and applying a perspective projection:

$$\begin{aligned} {[X,Y,Z]}^T=D_{ct}(x_{ct},y_{ct})K^{-1}{[x_{ct},y_{ct},1]}^T,\end{aligned}$$
(1a)
$$\begin{aligned} {[x_{cs},y_{cs},1]}^T\sim T_{ct{\rightarrow }cs}{[X,Y,Z,1]}^T, \end{aligned}$$
(1b)

where each pixel \((x_{ct}, y_{ct})\) in the target view corresponds to the pixel position \((x_{cs}, y_{cs})\) in the source view. K is the camera intrinsic matrix and \(T_{ct{\rightarrow }cs}\) is the transformation matrix from the target view to the source view. For the bilinear interpolation process, with the obtained per-pixel correspondences \(C_{ct \rightarrow cs}\), the pixels of the source view can be warped to their corresponding positions in the target view:

$$\begin{aligned} \begin{aligned} I_{ct}(x_{ct},y_{ct}) = \sum \limits _{x_{cs}}\sum \limits _{y_{cs}} \max (0,1-|x_{cs}-C^{x}_{ct{\rightarrow }cs}(x_{ct},y_{ct})|)&\\ \max (0,1-|y_{cs}-C^{y}_{ct{\rightarrow }cs}(x_{ct},y_{ct})|)\,I_{cs}(x_{cs},y_{cs})&. \end{aligned} \end{aligned}$$
(2)

Introducing the intermediate step of predicting the depth map forces the network to adhere to geometric constraints and resolves ambiguous correspondences. This process is denoted by \(I_{ct} = PW(I_{cs},D_{ct},T_{ct{\rightarrow }cs})\).
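As an illustration, a possible PyTorch implementation of the PW operator is sketched below; the paper does not give the authors' exact implementation, so tensor shapes and numerical details are assumptions.

```python
import torch
import torch.nn.functional as F


def pixel_warp(I_s, D_t, K, T_t_to_s):
    """Warp the source image to the target view (a sketch of Eqs. 1-2).

    I_s: (B,3,H,W) source image, D_t: (B,1,H,W) target depth,
    K: (B,3,3) intrinsics, T_t_to_s: (B,4,4) target-to-source pose.
    """
    B, _, H, W = I_s.shape
    device, dtype = I_s.device, I_s.dtype
    # Pixel grid of the target view in homogeneous coordinates.
    ys, xs = torch.meshgrid(torch.arange(H, device=device, dtype=dtype),
                            torch.arange(W, device=device, dtype=dtype),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1)
    # Eq. (1a): back-project target pixels to 3D camera coordinates.
    cam = torch.inverse(K) @ pix.expand(B, -1, -1)           # (B,3,HW)
    cam = cam * D_t.reshape(B, 1, -1)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device, dtype=dtype)], dim=1)
    # Eq. (1b): transform into the source camera and project with K.
    src = K @ (T_t_to_s @ cam_h)[:, :3]                      # (B,3,HW)
    x_cs = src[:, 0] / (src[:, 2] + 1e-7)
    y_cs = src[:, 1] / (src[:, 2] + 1e-7)
    # Eq. (2): bilinear sampling of the source image (grid_sample expects [-1,1] coords).
    grid_x = 2.0 * x_cs / (W - 1) - 1.0
    grid_y = 2.0 * y_cs / (H - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(I_s, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```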

Fig. 4. Network architecture of the depth and pose modules. The width and height of each blue/red rectangular block represent the output channel count and spatial dimension of the feature map at the corresponding layer, and each decrease/increase in width or height represents a change by a factor of 2 (the last conv layer is the output and does not follow this rule). The depth net consists of 4 downsampling layers and 4 upsampling layers with a skip-connection structure. For the pose net, inspired by [6], we also extract a latent code (3D feature points) to inject the camera transformation and predict the target depth map. (Color figure online)

Fig. 5. Results on the ShapeNet chair and car datasets. DPNet generates more structure-consistent predictions than [6] (for example, it does not produce the distorted chair leg seen in row 5); on the other hand, the images generated by DPNet are clearer than those of [7], recovering rich low-level details (for example, the chair surface in row 2 is sharper).

3.2 Depth Map Estimation

The depth net takes a single input image and produces the source depth map \(D_s=\varPsi _D (I_s)\). Two additional nearby view images \(I_{n1}\) and \(I_{n2}\), together with their camera transformations \(T_{s{\rightarrow }n1}\) and \(T_{s{\rightarrow }n2}\), are introduced and warped to the source view to improve the depth estimation accuracy: \(\hat{I}_{s1}=PW(I_{n1},D_s,T_{s{\rightarrow }n1})\) and \(\hat{I}_{s2} =PW(I_{n2},D_s,T_{s{\rightarrow }n2})\). The two L1 distances, between \(I_s\) and \(\hat{I}_{s1}\) and between \(I_s\) and \(\hat{I}_{s2}\), form an important part of the final training loss. The specific structure is shown in Fig. 4(a): it consists of four downsampling layers and four upsampling layers, and the skip-connection structure transfers multi-level feature maps to produce more stable predictions.
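A minimal sketch of the sigmoid-plus-renormalization output head and of the nearby-view supervision term is given below; pixel_warp is the PW operator sketched in Sect. 3.1, and d_min/d_max denote the per-dataset depth bounds.

```python
import torch


def renormalize_depth(raw, d_min, d_max):
    # The sigmoid output in (0, 1) is rescaled to the valid depth range of the dataset.
    return d_min + (d_max - d_min) * torch.sigmoid(raw)


def supervision_loss(I_s, D_s, I_n1, I_n2, K, T_s_to_n1, T_s_to_n2):
    # Eq. (5): warp both nearby views back into the source view with the
    # predicted source depth and penalize the L1 photometric error.
    I_s1_hat = pixel_warp(I_n1, D_s, K, T_s_to_n1)
    I_s2_hat = pixel_warp(I_n2, D_s, K, T_s_to_n2)
    return (I_s1_hat - I_s).abs().mean() + (I_s2_hat - I_s).abs().mean()
```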

Table 1. Results on ShapeNet objects. DPNet performs better than [6, 7] for both chair and car objects, showing that it can deal with both the complex shapes of chairs and the rich colors and textures of cars (\(\downarrow \) suggests the smaller the better, \(\uparrow \) suggests the larger the better).

3.3 Depth Map Transformation and Target View Generation

In the pose net, a transformation matrix is applied to the latent code to predict the depth map of the target view; the pose net thus learns compact latent representations that are transformation-equivariant. Given the source depth map, the 3D feature points extracted from it can be regarded as a set of points \(z_s\in R^{n\times 3}\). These 3D feature points are multiplied by the given transformation \(T_{s{\rightarrow }t}={[R|t]}_{s{\rightarrow }t}\) to obtain the transformed 3D feature points of the target view:

$$\begin{aligned} \begin{aligned} \widetilde{z}_s=T_{s{\rightarrow }t}{\cdot }\dot{z}_s \end{aligned} \end{aligned}$$
(3)

where \(\dot{z}_s\) is the homogeneous representation of \(z_s\). The target depth map \(D_t\) is then decoded from \(\widetilde{z}_s\). With the generated target depth map \(D_t\) and the corresponding camera transformation \(T_{t{\rightarrow }s}\), the target view image is synthesized as \(\hat{I}_{t} =PW(I_s,D_t,T_{t{\rightarrow }s})\). Because the input to the pose net is a source depth map rather than a source view image, the skip-connection structure can be introduced to transfer multi-level feature maps and make the pose net more effective (as shown in Fig. 4(b)).
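The latent transformation of Eq. (3) can be sketched as follows; pose_encoder and pose_decoder are hypothetical names for the two halves of the pose net.

```python
import torch


def transform_latent_points(z_s, T_s_to_t):
    """z_s: (B, n, 3) latent 3D feature points, T_s_to_t: (B, 4, 4) pose matrix."""
    ones = torch.ones_like(z_s[..., :1])
    z_h = torch.cat([z_s, ones], dim=-1)                    # homogeneous \dot{z}_s, (B, n, 4)
    z_t = torch.einsum("bij,bnj->bni", T_s_to_t, z_h)       # Eq. (3)
    return z_t[..., :3]                                     # transformed points for the target view


# Hypothetical usage: the target depth map is decoded from the transformed points.
# z_s = pose_encoder(D_s)
# D_t = pose_decoder(transform_latent_points(z_s, T_s_to_t))
```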

3.4 Training Loss Functions

The framework is trained in an end-to-end manner. For each training sample, a single source image, two nearby view images, one target view image, and their relative transformations are provided. The depth net and the pose net are optimized jointly. To train the depth net in an implicitly supervised manner, the supervision loss is used to improve the depth map estimation accuracy. For pixel regression, the L1 reconstruction loss and the VGG perceptual loss are used to generate realistic images. Moreover, the edge smoothness loss makes the final target depth map more continuous at edges.

L1 Reconstruction Loss. The L1 reconstruction loss is the L1 distance between the predicted target view \(\hat{I}_t\) and the ground truth \(I_t\):

$$\begin{aligned} \begin{aligned} \textit{L}_{recon}={\Vert \hat{I}_t - I_t \Vert } \end{aligned} \end{aligned}$$
(4)

Minimizing this reconstruction loss drives the network to produce realistic novel views by predicting the necessary depth maps.

Supervision Loss. The supervision loss consists of two parts, both of which are L1 distances between the ground-truth source view \(I_s\) and the reconstructed source views \(\hat{I}_{s1}\) and \(\hat{I}_{s2}\):

$$\begin{aligned} \begin{aligned} \textit{L}_{sup}={\Vert \hat{I}_{s1} - I_s \Vert }+{\Vert \hat{I}_{s2} - I_s \Vert } \end{aligned} \end{aligned}$$
(5)

Minimizing this supervision loss drives the depth net to produce a more accurate source depth map.

Fig. 6. Qualitative results on KITTI. DPNet produces clear and structurally consistent predictions, while the depth-guided pixel warping method [6] produces distortions and the depth-guided multi-level feature map warping method [7] produces blurry results.

VGG Perceptual Loss. In addition to the L1 reconstruction loss, we also employ a VGG perceptual loss to obtain realistic synthesis results. The pre-trained VGG16 network is used to extract features from the generated results and the ground-truth images, and the perceptual loss is the sum of the feature distances (L1 distances) computed at multiple layers.
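A possible implementation of this perceptual term is sketched below; the selected VGG16 layers are our assumption, since the paper does not list them.

```python
import torch
import torch.nn as nn
from torchvision import models


class VGGPerceptualLoss(nn.Module):
    def __init__(self, layers=(3, 8, 15, 22)):     # relu1_2, relu2_2, relu3_3, relu4_3 (assumed)
        super().__init__()
        self.vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
        self.layers = set(layers)
        for p in self.vgg.parameters():
            p.requires_grad_(False)

    def forward(self, fake, real):
        # Inputs are assumed to be ImageNet-normalized image batches of identical shape.
        loss, x, y = 0.0, fake, real
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.layers:
                loss = loss + (x - y).abs().mean()  # L1 distance between feature maps
        return loss
```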

Edge Smoothness Loss. The edge smoothness loss encourages local smoothness of the predicted depth map. The loss is weighted by image gradients because depth discontinuities usually occur at image edges:

$$\begin{aligned} \begin{aligned} \textit{L}_{edge}=\frac{1}{N}\sum _{i,j}|\partial _x\widetilde{D}_t^{ij}|e^{-{\Vert \partial _xI_t^{ij} \Vert }} + |\partial _y\widetilde{D}_t^{ij}|e^{-{\Vert \partial _yI_t^{ij} \Vert }} \end{aligned} \end{aligned}$$
(6)

where \(\widetilde{D}_t\) is the predicted depth map of the target view and \(I_t\) is the ground-truth target view.
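Eq. (6) can be implemented with finite differences, for example as in the following sketch (averaging the image gradient over color channels is our assumption).

```python
import torch


def edge_smoothness_loss(D_t, I_t):
    """D_t: (B,1,H,W) predicted target depth, I_t: (B,3,H,W) ground-truth target view."""
    dD_dx = (D_t[:, :, :, 1:] - D_t[:, :, :, :-1]).abs()
    dD_dy = (D_t[:, :, 1:, :] - D_t[:, :, :-1, :]).abs()
    # Image gradients, averaged over color channels, act as edge-aware weights.
    dI_dx = (I_t[:, :, :, 1:] - I_t[:, :, :, :-1]).abs().mean(dim=1, keepdim=True)
    dI_dy = (I_t[:, :, 1:, :] - I_t[:, :, :-1, :]).abs().mean(dim=1, keepdim=True)
    loss_x = (dD_dx * torch.exp(-dI_dx)).mean()
    loss_y = (dD_dy * torch.exp(-dI_dy)).mean()
    return loss_x + loss_y
```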

In summary, the final loss function of the joint training framework is:

$$\begin{aligned} \begin{aligned} \textit{L}=\lambda _r{\textit{L}_{recon}} + \lambda _s{\textit{L}_{sup}} + \lambda _v{\textit{L}_{vgg}} + \lambda _e{\textit{L}_{edge}} \end{aligned} \end{aligned}$$
(7)

where the \(\lambda _r\), \(\lambda _s\), \(\lambda _v\), and \(\lambda _e\) are weights for different loss functions.
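Using the loss sketches above, the total objective of Eq. (7) could be assembled as follows; the weight values shown are placeholders, not values reported by the paper.

```python
# Hypothetical loss weights; the paper does not state the values used.
lambda_r, lambda_s, lambda_v, lambda_e = 1.0, 1.0, 0.1, 0.1


def total_loss(I_t_hat, I_t, I_s, I_s1_hat, I_s2_hat, D_t, vgg_loss):
    L_recon = (I_t_hat - I_t).abs().mean()                                   # Eq. (4)
    L_sup = (I_s1_hat - I_s).abs().mean() + (I_s2_hat - I_s).abs().mean()    # Eq. (5)
    L_vgg = vgg_loss(I_t_hat, I_t)                                           # perceptual term
    L_edge = edge_smoothness_loss(D_t, I_t)                                  # Eq. (6)
    return lambda_r * L_recon + lambda_s * L_sup + lambda_v * L_vgg + lambda_e * L_edge
```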

Table 2. Results on KITTI. DPNet achieves the best SSIM results, with L1 performance outperforming both Chen et al. [6] and Hou et al. [7]. (\(\downarrow \) suggests the smaller the better, \(\uparrow \) suggests the larger the better).

4 Experiment Results and Analysis

In this section, experiments are conducted on two public datasets: the ShapeNet dataset [8] and the KITTI dataset [9]. DPNet is compared with state-of-the-art algorithms to evaluate its performance qualitatively and quantitatively. Further ablation studies verify the effectiveness of the different modules of DPNet.

4.1 Dataset and Experiment Setup

Two different types of datasets are used in the experiments: the ShapeNet dataset [8] for synthetic objects and the KITTI dataset [9] for real-world scenes. More specifically, the cars and chairs in the ShapeNet dataset are selected. Datasets with complex structures and camera transformations pose a great challenge for 3D understanding (e.g., depth estimation), while datasets with rich textures reveal whether these methods preserve fine-grained detail well. Among the selected categories, the chairs have more complex shapes and structures, whereas the cars exhibit more colorful patterns. For KITTI, the scenes contain more objects, and translation is the primary transformation between frames, unlike ShapeNet, where rotation is the key transformation. In this case, accurate depth estimation is less critical, and the ability to recover low-level detail matters more for performance.

ShapeNet.    Rendered images with a resolution of 256 \(\times \) 256 from 54 viewpoints (azimuths from \(0^{\circ }\) to \(360^{\circ }\) in \(20^{\circ }\) increments, at elevations of \(0^{\circ }\), \(10^{\circ }\), and \(20^{\circ }\)) are used for each object. Training and test pairs are two views whose azimuth difference lies within the range [\(-40^{\circ }\), \(40^{\circ }\)]. For ShapeNet chairs, there are 558 chair objects in the training set and 140 in the test set; for ShapeNet cars, there are 5,997 car objects in the training set and 1,500 in the test set.

KITTI.   There are 11 sequences and each sequence contains around 2,000 frames on average. The training pairs are restricted to be separated by at most 7 frames.

For the experimental setup, the depth net and the pose net are jointly trained using the Adam solver with the same settings as [7]: \(\beta _1=0.9\), \(\beta _2=0.99\), and a learning rate of \(6\times 10^{-5}\).
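In code, this setup corresponds to something like the following sketch (depth_net and pose_net are assumed to be instantiated modules).

```python
import itertools
import torch

# Joint Adam optimization of both subnets with the hyperparameters stated above.
params = itertools.chain(depth_net.parameters(), pose_net.parameters())
optimizer = torch.optim.Adam(params, lr=6e-5, betas=(0.9, 0.99))
```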

Fig. 7. Ablation study results. We compare the performance of the full model with its variants. The results show that removing \(\textit{L}_{recon}\) leads to incomplete objects (like the chair in the 5th column), and removing the skip-connection structure in the depth net and pose net results in missing chair legs (also visible in the 5th column). \(\textit{L}_{vgg}\) makes the results sharper, while \(\textit{L}_{sup}\) and \(\textit{L}_{edge}\) lead to more accurate depth map estimation and thus more stable results.

4.2 Evaluation Metrics and Evaluation Results

For evaluation metrics, the Mean Absolute Error (\(L_1\) error) and the Structural SIMilarity (SSIM) index are used to evaluate the synthesized results. For the L1 metric, smaller is better; for the SSIM metric, larger is better. For image synthesis quality, DPNet is compared with two state-of-the-art depth map guided methods: the pixel warping method of [6] and the multi-level feature map warping method of [7]. For depth map estimation, DPNet is compared with a source depth map estimation method [4] and a target depth map estimation method [6]. Table 1 shows the results on the ShapeNet test sets. DPNet performs best for both the chair and the car objects, showing that it can handle both the complex 3D structure of the chairs and the rich textures of the cars. Figure 5 shows the qualitative results of all methods. The depth map guided pixel warping method [6] suffers from distortion, and the depth map guided multi-level feature map warping method [7] produces blurry results. In DPNet, two nearby views are introduced to improve the accuracy of depth map estimation, so more impressive results are generated (e.g., more complete chair legs and a more detailed car roof in row 5). All methods are also evaluated on KITTI. Table 2 shows the quantitative results: DPNet performs better than [6] and obtains comparable results to [7]. Figure 6 shows the qualitative results; DPNet produces clearer predictions and better preserves structure (see the bottom part of row 1 and the manhole cover in row 5). In conclusion, DPNet achieves high image synthesis quality.
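For reference, the two image-quality metrics can be computed as in the sketch below (using scikit-image; images are assumed to be HxWx3 arrays scaled to [0, 1]).

```python
import numpy as np
from skimage.metrics import structural_similarity


def evaluate_image(pred, gt):
    l1 = np.abs(pred - gt).mean()                                   # mean absolute error
    ssim = structural_similarity(pred, gt, channel_axis=-1, data_range=1.0)
    return l1, ssim
```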

Table 3. Results of ablation studies. All designed modules and loss functions help improve performance. (\(\downarrow \) suggests the smaller the better, \(\uparrow \) suggests the larger the better).
Table 4. Depth estimation results on ShapeNet chairs.

4.3 Ablation Studies and Depth Estimation Results

To understand how the different modules of the framework contribute, we conduct an ablation study on ShapeNet chairs, the most challenging dataset in terms of 3D structure. Figure 7 and Table 3 show the performance of the different variants. "No skip" stands for removing the skip-connection structure in the depth net and pose net; "no \(\textit{L}_{recon}\)", "no \(\textit{L}_{sup}\)", "no \(\textit{L}_{edge}\)", and "no \(\textit{L}_{vgg}\)" denote removing the corresponding loss term from the total loss function. The results show that removing \(\textit{L}_{recon}\) leads to incomplete objects (like the chair in the 5th column), and removing the skip-connection structure in the depth net and pose net leads to missing chair legs (also in the 5th column). \(\textit{L}_{vgg}\) makes the results sharper, while \(\textit{L}_{sup}\) and \(\textit{L}_{edge}\) lead to more accurate depth map estimation and thus more stable results.

Moreover, to show that more accurate depth maps are predicted, four metrics are used to evaluate depth map quality [7]. L1-all computes the mean absolute difference; L1-rel computes the mean absolute relative difference, L1-rel = \(\frac{1}{n}\sum _{i}{|{gt}_i - {pred}_i |} {/}{{gt}_i} \); and L1-inv is the mean absolute difference in inverse depth, L1-inv = \(\frac{1}{n}\sum _{i}{|{gt}_i^{-1} -{pred}_i^{-1} |}\). Besides the L1 metrics, we also use sc-inv = \({\left( \frac{1}{n}\sum _i z_i^{2} -\frac{1}{n^2}\left( \sum _i z_i \right) ^2 \right) }^{\frac{1}{2}}\), where \(z_i = \lg \left( {{pred}_i}\right) -\lg \left( {{gt}_i}\right) \). The source depth map estimation is compared with [4] and the target depth map estimation with [6]. Table 4 shows that our predicted depth is more accurate, which explains why DPNet achieves better results than the other methods.
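These four depth metrics can be computed as in the following NumPy sketch (the small epsilon for numerical stability is our assumption).

```python
import numpy as np


def depth_metrics(gt, pred, eps=1e-8):
    """gt, pred: positive depth arrays of identical shape (valid pixels only)."""
    l1_all = np.abs(gt - pred).mean()
    l1_rel = (np.abs(gt - pred) / (gt + eps)).mean()
    l1_inv = np.abs(1.0 / (gt + eps) - 1.0 / (pred + eps)).mean()
    z = np.log10(pred + eps) - np.log10(gt + eps)          # lg(pred_i) - lg(gt_i)
    sc_inv = np.sqrt((z ** 2).mean() - z.mean() ** 2)      # scale-invariant error
    return l1_all, l1_rel, l1_inv, sc_inv
```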

5 Conclusion and Discussion

In this paper, DPNet is proposed to address the novel view synthesis task. It consists of two subnets: a depth net and a pose net. The depth net predicts the depth map of the source view from a single input view, and two nearby view images are introduced to improve the accuracy of the predicted depth map. The pose net then transforms the source depth map into the target depth map. Moreover, warping pixels from the source view to the target view preserves low-level details, so clearer predictions are produced. Experimental results show that DPNet outperforms the depth map guided warping methods discussed above.