
1 Introduction

Image-based novel view synthesis, i.e., rendering a 3D scene from a novel viewpoint given a set of context views (images and camera poses), is a long-standing problem in computer graphics with applications ranging from robotics (e.g. planning to grasp objects) to augmented and virtual reality (e.g. interactive virtual meetings). Recently, the field has gained a lot of popularity thanks to Neural Radiance Field (NeRF) methods [2, 40] that were successfully applied to the problem and outperformed prior approaches. We distinguish between two variants of the view synthesis problem. The first variant renders a novel view from multiple context images taken from similar viewpoints [40, 69]. Only a (very) sparse set of context images is provided in the second variant [51, 72], i.e., larger viewpoint variations and missing observations need to be handled. The latter task is much more difficult as it is necessary to learn suitable priors that can be used to predict unseen scene parts. This paper focuses on the second variant.

Fig. 1. Our novel view synthesis method renders images of previously unseen objects based on a few context images. It operates in 2D space without any explicit 3D reasoning (as opposed to NeRF-based approaches [51, 72]). The results are shown on the CO3D [51] (right) and InteriorNet [32] (left) datasets, rendered for unseen scenes

Recently, generalizable NeRF-based approaches have been proposed to tackle this problem by learning priors for a class of objects and scenes [51, 72]. Instead of learning a radiance field for each scene, they use context views captured from the target scene to construct the radiance field on the fly by projecting the image features from all context views into 3D. Highly optimized NeRF approaches [22, 43, 50, 71] can be sped up by tuning or caching the radiance field representation [43], although this often requires many images per scene. To the best of our knowledge, these techniques do not apply to generalizable NeRF-based methods, which do not learn a scene-specific radiance field and take thousands of GPU-hours to train [51]. In contrast, 2D-only feed-forward networks can be highly efficient. However, explicitly encoding 3D geometric principles in them can be challenging. In our work, we thus pose the question: Is reasoning in 3D necessary for high-quality novel view synthesis, or can a purely image-based method achieve a competitive performance?

Recently, Rombach et al. [54] successfully tackled single-view novel view synthesis, where the model was able to predict novel views without explicit 3D reasoning. Inspired by these findings, we tackle the more complex problem of multi-view novel view synthesis. To answer the question posed above, we propose a method without explicit 3D reasoning that predicts novel views from multiple context images in a single forward pass of a neural network. We train our model on a large collection of diverse scenes to enable it to learn 3D priors implicitly. Our approach is able to render a view in a novel scene, unseen at training time, three orders of magnitude faster than state-of-the-art (SoTA) NeRF-based approaches [51], while also being ten times faster to train. Furthermore, we are able to train a single model to render multiple classes of scenes (see Fig. 1), whereas the SoTA NeRF-based approaches typically train per-class models [51].

Our model uses a two-stage architecture consisting of a Vector Quantized-Variational Autoencoder (VQ-VAE) codebook [45] and a transformer model. The codebook model is used to embed individual images into a smaller latent space. The transformer solves the novel view synthesis task in this latent space before the image is recovered via a decoder. This enables the codebook to focus on finer details in images while the transformer operates on shorter input sequences, reducing the quadratic memory complexity of its attention layer.

For training, we pass a sequence of views into the transformer and optimize it for all context sizes at the same time, effectively utilizing all images in the training batch; this differs from other methods [20, 21, 46, 48] that optimize only a single query view. Unlike autoregressive models [21, 46, 48], we do not decode images token-by-token; all tokens are decoded at once, which is both faster and mathematically exact (whereas autoregressive models rely on greedy decoding strategies). Our approach can be considered a combination of autoregressive [47, 68] and masked [18] transformer models. With the standard attention mechanism, the complexity would be quadratic in the number of views, because we would have to stack the different query views corresponding to different context sizes along the batch dimension. Therefore, we propose a novel attention mechanism called branching attention with constant overhead regardless of how many query views we optimize. Our attention mechanism also allows us to optimize the same model for the camera pose estimation task – predicting the query image’s camera pose given a set of context views. Since this task can be considered an “inverse” of the novel view synthesis task [70], we consider the ability to perform both tasks via the same model to be an intriguing property. Even though the localization results are not yet competitive with state-of-the-art localization pipelines, we achieve a similar level of pose accuracy as comparable methods such as [1, 60].

In summary, this paper makes the following contributions: 1) We propose an efficient novel view synthesis approach that does not use explicit 3D reasoning. Our two-stage method consisting of a codebook model and a transformer is competitive with state-of-the-art NeRF-based approaches while being more efficient to train. Compared to similar methods that do not use explicit 3D reasoning [15, 20, 66], our approach is not only evaluated on synthetic data but performs well on real-world scenes. 2) Our transformer model is a combination of an autoregressive and a masked transformer. We propose a novel attention mechanism called branching attention that allows us to optimize for multiple context sizes at once with a constant memory overhead. 3) Thanks to the branching attention, our model can both render a novel view from a given pose and predict the pose for a given image. 4) Our source code and pre-trained models are publicly available at https://github.com/jkulhanek/viewformer.

2 Related Work

Novel view synthesis has a long history [12, 63]. Recently, deep learning techniques have been applied with great success, enabling higher realism [16, 24, 38, 52, 53]. Some approaches use explicit reconstructed geometry to warp context images into the target view [16, 24, 52, 53, 65]. In our approach, we do not require any proxy geometry and only operate on 2D images.

Neural Radiance Field methods [2, 27, 36, 38, 40, 50, 71] use neural networks to represent the continuous volumetric scene function. To render a view, for each pixel in the image plane, they project a ray into 3D space and query the radiance field at 3D points along the ray. The radiance field is trained for each scene separately. Some methods generalize to new scenes by conditioning the continuous volumetric function on the context images [55, 64], which allows them to utilize trained priors and render views from scenes on which the model was not trained, much like our approach. Other approaches remove the trainable continuous volumetric scene function altogether. Instead, they reproject the context images’ features into 3D space and apply the NeRF-based rendering pipeline on top of this representation [25, 51, 67, 69, 72]. Similarly to these methods, our approach also utilizes a few context views (fewer than 20), and it also generalizes to unseen objects. However, we use neither the continuous volumetric function nor the reprojection into 3D space. A different approach, IBRNet [69], learns to copy existing colours from the context views, effectively interpolating them. Unlike ours, it thus cannot be applied to settings where the object is not sufficiently covered by the context views [25, 51, 67, 72].

A different line of work directly maps 2D context images to the 2D query image using an end-to-end neural network [15, 20, 66]. GQN-based methods [15, 20, 66] apply a CNN to context images and camera poses and combine the resulting features. While some GQN methods [15, 20] do not use any explicit 3D reasoning (same as our approach), Tobin et al. [66] use epipolar attention to aggregate features from the context views. We optimize our model on all context images and fully utilize the training sequences, whereas GQN methods optimize only a single query view.

A recent work by Rombach et al. [54] proposed an approach for novel view synthesis without explicit 3D modeling. They used a codebook and a transformer model to map a single context view to a novel view from a different pose. Their approach is limited in scope to mostly forward-facing scenes, where it is easier to render the novel view from a single context view and the poses have to be close to one another. It cannot be extended to more views due to the limit on the sequence size of the transformer model. In contrast, our approach focuses on using multiple context views, which we enable through the proposed branching attention. Furthermore, we can jointly train the same model for both novel view synthesis and camera pose estimation, and our decoding is faster because we decode all output tokens at once instead of autoregressively.

Visual Localization. There is an enormous body of work tackling the problem of localization, where the goal is to output the camera pose given the camera image. Structure-based approaches use correspondences between 2D pixel positions and 3D scene coordinates for camera pose estimation [6, 11, 34, 37, 56, 58, 62]. Our method does not explicitly reason in 3D space, and the camera pose is instead predicted by the network. Simple image retrieval (IR) approaches store a database of all images with camera poses and for each query image they try to find the most similar images [9, 10, 17, 26, 59, 74] and use them to estimate the pose of the query. IR methods can also be used to select relevant images for accurate pose estimation [4, 26, 56, 74, 75].

Pose regression methods train a convolutional neural network (CNN) to regress the camera pose of an input image. There are two categories: absolute pose regression (APR) methods [5, 8, 14, 28, 30, 33, 41, 60] and relative pose regression (RPR) methods [1, 19, 31, 33, 39]. It was shown [59] that APR is often not (much) more accurate than IR. RPR methods do not train a CNN per scene or per set of scenes but instead condition the CNN on a set of context views. While our approach performs relative pose regression, its main focus is on novel view synthesis. Some pose regression methods use novel view synthesis [14, 41, 42, 44]; however, they assume there is a separate method that generates images, whereas our method performs both novel view synthesis and camera pose regression in a single model. Iterative refinement pose regression methods [57, 70] start with an initial camera pose estimate and refine it iteratively; in contrast, our approach generates novel views and camera pose estimates in a single forward pass.

3 Method

In this work, we tackle the problem of image-based novel view synthesis – given a set of context views, the algorithm has to generate the image it would most likely observe from a query camera pose. We focus on the case where the number of context views is small, and the views sparsely cover the 3D scene. Thus, the algorithm must hallucinate parts of the scene in a manner consistent with the context views. Therefore, it is necessary to learn a prior over a class of scenes (e.g., all indoor environments) and use this prior for novel scenes. Besides rendering novel views, our model can also perform camera pose estimation, i.e., the “inverse” of the view synthesis task: given a set of context views and a query image, the model outputs the camera pose from which the image was taken.

Our framework consists of two components: a codebook model and a transformer model. The codebook is used to map images to a smaller discrete latent space (code space), and back to the image space. In the code space, each image is represented by a sequence of tokens. For the novel view synthesis task, the transformer is given a set of context views in the code space and the query camera pose, and it generates an image in the code space. The codebook then maps the image tokens back to the image space. See Fig. 2 for an overview. For the camera pose estimation task, the transformer is given the set of context views and the query image in the code space, and it generates the camera pose using a regression head attached to the output of the transformer corresponding to the query image tokens.

Having the codebook and the transformer as separate components was inspired by recent work on image generation [21, 48, 54]. The main motivation is to decrease the length of the transformer’s input sequence, because the required memory grows quadratically with it. It also allows us to separate image generation from view synthesis, enabling us to train the transformer more efficiently in a simpler space.
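As a concrete summary of this two-stage design, the inference path of Fig. 2 can be sketched as follows. This is a minimal sketch with hypothetical module interfaces (the names and signatures of encoder, transformer, and decoder are illustrative), assuming the codebook and the transformer have already been trained:

```python
import torch

def render_novel_view(encoder, transformer, decoder,
                      context_images, context_poses, query_pose):
    """Sketch of the inference pipeline in Fig. 2 (hypothetical interfaces).

    context_images: (n, 3, 128, 128) tensor, context_poses: (n, 7),
    query_pose: (7,) -- 3D position + orientation quaternion.
    """
    with torch.no_grad():
        # 1) Codebook encoder E_theta: each 128x128 image -> an 8x8 grid of token indices.
        context_tokens = encoder(context_images)                    # (n, 8, 8) integer codes
        # 2) Transformer: given context tokens + poses and the query pose,
        #    predict all 64 query-view tokens in a single forward pass.
        query_tokens = transformer(context_tokens, context_poses, query_pose)  # (8, 8)
        # 3) Codebook decoder D_theta: map the predicted tokens back to pixel space.
        return decoder(query_tokens.unsqueeze(0))                   # (1, 3, 128, 128)
```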

Fig. 2. Inference pipeline. The context images \(x_i\) are encoded by the codebook’s encoder \(E_\theta \) to the code representation \(s_i\). We embed all tokens in \(s_i\) and add the transformed camera pose \(c_i\). The transformer generates the query image tokens, which are decoded by the codebook’s decoder \(D_\theta \)

The codebook model is a VQ-VAE [45, 49], which is a variational autoencoder with a categorical distribution over the latent space. The model consists of two parts: the encoder \(E_\theta \) and the decoder \(D_\theta \). The encoder first reduces the dimension of the input image from \(128 \times 128\) pixels to \(8 \times 8\) tokens by several strided convolution layers. The convolutional part is followed by a quantization layer, which maps the resulting feature map to a discrete space. The quantization layer stores \(n_{lat}\) embedding vectors of the same dimension as the feature vectors returned by the convolutional part of the encoder. It encodes each point of the feature map by returning the index of the closest embedding vector. The output of the encoder at position (i, j) for image x is:

$$\begin{aligned} \mathop {\mathrm {arg\,min}}\limits _k \, \Vert (f^{(enc)}_\theta (x))_{i, j} - W^{(emb)}_k \Vert _2, \end{aligned}$$
(1)

where \(W^{(emb)} \in \mathbb {R}^{n_{lat} \times d_{lat}}\) is the embedding matrix with rows \(W_k^{(emb)}\) of length \(d_{lat}\) and \(f_\theta ^{(enc)}\) is the convolutional part of the encoder. The decoder then performs an inverse operation by first encoding the indices back to the embedding vectors by using \(W^{(emb)}\) followed by several convolutional layers combined with upscaling to increase the spatial dimension back to the original image size.

Since the operation in Eq. (1) is not differentiable, we approximate the gradient with a straight-through estimator [3] and copy the gradients from the decoder input to the encoder output. The final loss for the codebook is a weighted sum of three parts: the pixel-wise mean absolute error (MAE) between the input image and the reconstructed image, the perceptual loss between the input and reconstructed image [21], and the commitment loss [45, 49] \(\mathcal {L}_c\), which encourages the output of the encoder to stay close to the chosen embedding vector to prevent it from fluctuating too frequently from one vector to another:

$$\begin{aligned} \mathcal {L}_c = \min _{k} \,|| f_{\theta }^{(enc)}(x)_{i, j} - \text {sg}(W_k^{(emb)})||_2^2 , \end{aligned}$$
(2)

where \(\text {sg}\) is the stop-gradient operation [45]. We use the exponential moving average updates for the codebook [45]. See [45, 49] for more details on the codebook, and the supp. mat. for the architecture details.
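For illustration, the quantization of Eq. (1), the straight-through gradient estimator, and the commitment loss of Eq. (2) can be sketched as follows. This is a simplified PyTorch sketch: the layer sizes and the loss weight are illustrative, and the exponential moving average codebook updates and the perceptual loss used in the actual model are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Simplified quantization layer: nearest-embedding lookup (Eq. 1),
    straight-through gradients, and the commitment loss (Eq. 2)."""

    def __init__(self, n_lat=1024, d_lat=256, beta=0.25):
        super().__init__()
        self.embedding = nn.Embedding(n_lat, d_lat)   # rows are W^(emb)_k
        self.beta = beta                              # commitment loss weight (illustrative)

    def forward(self, z):                             # z: (B, d_lat, H, W) encoder features
        B, D, H, W = z.shape
        z_flat = z.permute(0, 2, 3, 1).reshape(-1, D)              # (B*H*W, d_lat)
        # Eq. (1): index of the closest embedding vector at every spatial position.
        dist = torch.cdist(z_flat, self.embedding.weight)           # (B*H*W, n_lat)
        codes = dist.argmin(dim=1)                                  # discrete tokens
        z_q = self.embedding(codes).reshape(B, H, W, D).permute(0, 3, 1, 2)
        # Eq. (2): commitment loss keeps encoder outputs close to the chosen embedding.
        commit_loss = self.beta * F.mse_loss(z, z_q.detach())
        # Straight-through estimator: copy gradients from decoder input to encoder output.
        z_q = z + (z_q - z).detach()
        return z_q, codes.reshape(B, H, W), commit_loss
```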

Transformer. We first describe the case of image generation and extend the approach to camera pose estimation later. We optimize the transformer for multiple context sizes and multiple query views in the batch at the same time. This has two benefits: it will allow the trained model to handle different context sizes, and the model will fully utilize the training batch (multiple images will be targets in the loss function). Each training batch consists of a set of n views. Let \((x_i)_{i=1}^{n}\) be the sequence of images under a random ordering and \((c_i)_{i=1}^{n}\) be the sequence of the associated camera poses. Let us also define the sequence of images transformed by the encoder \(E_\theta \) parametrized by \(\theta \) as \(s_i = E_\theta (x_i)\), \(i=1, \ldots , n\). Note that each \(s_i\) is itself a sequence of tokens. With this formulation, we generate the next image in the sequence given all the previous views, effectively optimizing all different context sizes at once. Therefore, we model the probability \(p(s_i|s_{<i}, c_{\le i})\). Note that we do not optimize the first \(n_{\text {min}}\) views (called the pure context), because they usually do not provide enough information for the task.
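In other words, a training sequence of n views contributes the factorized likelihood

$$\begin{aligned} p\big (s_{n_{\text {min}}+1}, \ldots , s_n \mid s_{\le n_{\text {min}}}, c_{\le n}\big ) = \prod _{i=n_{\text {min}}+1}^{n} p\big (s_i \mid s_{<i}, c_{\le i}\big ), \end{aligned}$$

so every view after the pure context acts as a training target for its own context size.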

Fig. 3. Branching attention mechanism: the nodes represent parts of the processed sequence. Starting in any node and tracing the arrows backwards gives the sequence over which the attention is computed, e.g., node \(s_7,\varnothing \) attends to \(s_1, c_1\), \(s_2, c_2\), ..., \(s_7, \varnothing \). Nodes in the last transformer block are used in the loss computation

In practice, we need to replace the tokens corresponding to each query view with mask tokens to allow the transformer to decode them in a single forward pass. For the image generation task, the tokens of the last image in the sequence are replaced with special mask tokens \(\lambda \), and, for the localization task, the tokens of the last image do not include the camera pose (denoted as \(\varnothing \)). However, if we simply replaced these tokens in the training sequence, the subsequent query images would not be able to attend to the original tokens. Therefore, we have to process both the original and the masked tokens. For the i-th query image, we need the sequence of \(i - 1\) context views ending with masked tokens at the i-th position. We can represent these sequences as a tree (see Fig. 3) where different endings branch off a shared trunk. By following a leaf node back to the root of the tree, we recover the original sequence corresponding to a particular query view.

For localization, we train the model to output the camera pose \(c_i\) given \(s_{\le i}\) and \(c_{<i}\). As for image generation, this leads to \(n - n_{\text {min}}\) sequences. We attach a regression head to the hidden representation of all tokens of the last image in each sequence. The query image tokens form the input, and we mask the query camera pose by replacing its representation with a single trainable vector.

Branching Attention. In this section, we introduce the branching attention which computes attention over the tree shown in Fig. 3, and allows us to optimize the transformer model for all context sizes and tasks very efficiently. Note that we have to forward all tree nodes through all layers of the transformer. Therefore, the memory and time complexity is proportional to the number of nodes in the tree and thus to the number of views and tasks.

The input to the branching attention is a sequence of triplets of keys, queries, and values: \(\big ((K^{(i)}, Q^{(i)}, V^{(i)})\big )_{i=0}^{p}\) with \(p = 2\), because we train the model on two tasks. Each element in the sequence corresponds to a single row in Fig. 3, and \(i = 0\) is the middle row. All \(K^{(i)}\), \(Q^{(i)}\), \(V^{(i)}\) have size \((n k^2) \times d_m\), where \(d_m\) is the dimensionality of the model and k is the size of the image in the latent space. The output of the branching attention is a sequence \(\big (R^{(i)}\big )_{i=0}^{p}\). The case of \(R^{(0)}\) is handled differently because it corresponds to the trunk shared by all tasks and context sizes. Let us define a lower triangular matrix \(M \in \mathbb {R}^{n \times n}\) with \(m_{i,j} = 1\) if \(j \le i\) and 0 otherwise. We compute the causal block attention as:

$$\begin{aligned} R^{(0)} = \big (\text {softmax}(Q^{(0)} (K^{(0)})^T) \odot (M \otimes \textbf{1}^{k^2 \times k^2})\big ) V^{(0)}, \end{aligned}$$
(3)

where \(\otimes \) and \(\odot \) are the Kronecker and element-wise product, respectively, and \(\textbf{1}^{m \times n}\) is a matrix of ones. Equation (3) is similar to normal masked attention [68], the only difference being the causal mask: here, we allow the model to attend to all previous images and to all other vectors from the same image. For \(i > 0\) we can compute \(R^{(i)}\) as follows:

$$\begin{aligned} D&= Q^{(i)} (K^{(0)})^T,\end{aligned}$$
(4)
$$\begin{aligned} C&= \begin{bmatrix} Q_{1:k^2}^{(i)} (K_{1:k^2}^{(i)})^T\\ \vdots \\ Q_{(n - 1) \cdot k^2 + 1:n \cdot k^2}^{(i)} (K_{(n - 1) \cdot k^2 + 1:n \cdot k^2}^{(i)})^T \end{bmatrix},\end{aligned}$$
(5)
$$\begin{aligned} S&= \text {softmax}([D, C]) \odot [(M - I) \otimes \textbf{1}^{k^2 \times k^2}, \textbf{1}^{nk^2 \times k^2}],\end{aligned}$$
(6)
$$\begin{aligned} S'&= S_{\cdot , 1:n \cdot k^2}\,, S'' = S_{\cdot , n \cdot k^2 + 1: (n + 1) \cdot k^2},\end{aligned}$$
(7)
$$\begin{aligned} R^{(i)}&= S' V^{(0)} + \begin{bmatrix} S''_{1:k^2} V^{(i)}_{1:k^2}\\ \vdots \\ S''_{(n - 1) \cdot k^2 + 1:n \cdot k^2} V^{(i)}_{(n - 1) \cdot k^2 + 1:n \cdot k^2}\\ \end{bmatrix}. \end{aligned}$$
(8)

Matrix D contains the unmasked raw attention scores between the i-th queries and the keys from all previous images. Matrix C contains the raw pairwise attention scores between the i-th queries and the i-th keys (the ending of each sequence). The softmax is then computed to normalize the attention scores, and the causal mask is applied to the result, yielding the attention matrix S. The respective values are then weighted by the computed scores; in particular, the scores contained in the last \(k^2\) columns of the attention matrix are redistributed back to the associated i-th values. The result \(R^{(0)}\) corresponds to the nodes in the middle row in Fig. 3, whereas \(R^{(i)}, i > 0\) are the other nodes.
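A minimal single-head sketch of branching attention is given below. It applies the causal masks of Eqs. (3)–(8) additively before the softmax (the standard way of implementing such masks), omits the usual \(1/\sqrt{d}\) scaling and the multi-head projections, and, for readability, materializes the branch-vs-branch scores as a dense masked matrix instead of the block-diagonal \((n k^2) \times k^2\) slice of Eq. (5):

```python
import torch
import torch.nn.functional as F

def branching_attention(K, Q, V, n, k2):
    """Single-head sketch of branching attention (cf. Eqs. (3)-(8)).

    K, Q, V: lists with one (n*k2, d) tensor per branch i = 0..p.
    Branch 0 is the shared trunk: every block of k2 image tokens attends to all
    blocks of previous images and to its own block. A branch i > 0 holds the
    alternative endings (masked tokens / missing pose): its queries attend to
    the trunk keys of strictly previous images plus the keys of its own block
    inside branch i. All causal masks are applied additively before the softmax.
    """
    L = n * k2
    blk = torch.arange(L, device=K[0].device) // k2            # image index of each token
    neg_inf = float("-inf")

    # Trunk, Eq. (3): block-causal mask M (x) 1, i.e. attend iff key image <= query image.
    logits0 = Q[0] @ K[0].T
    logits0 = logits0.masked_fill(blk[None, :] > blk[:, None], neg_inf)
    outputs = [F.softmax(logits0, dim=-1) @ V[0]]

    for i in range(1, len(K)):
        # D, Eq. (4): branch queries vs. trunk keys, restricted to strictly previous images.
        D = (Q[i] @ K[0].T).masked_fill(blk[None, :] >= blk[:, None], neg_inf)
        # C, Eq. (5): branch queries vs. branch keys of the same image block only
        # (dense here for readability; the paper keeps an (n*k2) x k2 block-diagonal slice).
        C = (Q[i] @ K[i].T).masked_fill(blk[None, :] != blk[:, None], neg_inf)
        # Eqs. (6)-(8): joint softmax over [D, C], then split and weight the values.
        S = F.softmax(torch.cat([D, C], dim=-1), dim=-1)
        S_trunk, S_branch = S[:, :L], S[:, L:]
        outputs.append(S_trunk @ V[0] + S_branch @ V[i])
    return outputs
```

Keeping only the block-diagonal slice of Eq. (5), as in the paper, reduces the branch-specific score matrix from \((n k^2) \times (n k^2)\) to \((n k^2) \times k^2\), which keeps the additional cost per branch small compared to the trunk attention.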

Transformer Input and Training. To build the input for the transformer, we first embed all image tokens into trainable vector embeddings of length \(d_m\). Before passing camera poses to the network, we express all of them relative to the first context camera pose in the sequence. We represent each camera pose by concatenating the 3D position with the normalized orientation quaternion (a unit quaternion with a positive real part). Finally, we transform the camera poses with a trainable feed-forward neural network that increases their dimension to the size of the image token embeddings \(d_m\) so that the two can be summed.
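A minimal sketch of this pose embedding is shown below; the hidden layer width is an illustrative assumption, and the poses are assumed to already be expressed relative to the first context camera:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseEmbedding(nn.Module):
    """Lift a 7-D camera pose (3D position + unit quaternion with a positive
    real part) to the model dimension d_m so it can be summed with the image
    token embeddings. A sketch; the hidden width is an assumption."""

    def __init__(self, d_m=768, d_hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(7, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_m))

    def forward(self, position, quaternion):      # (B, 3), (B, 4) as (w, x, y, z)
        q = F.normalize(quaternion, dim=-1)       # unit quaternion
        # Resolve the q / -q ambiguity: keep the representative with a positive real part.
        q = torch.where(q[..., :1] < 0, -q, q)
        return self.mlp(torch.cat([position, q], dim=-1))   # (B, d_m)
```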

Similarly to [47], we also add the positional embeddings by summing the input sequence with a sequence of trainable vectors. However, our positional embeddings are shared for all images in the sequence, i.e., the i-th token of every image will share the same positional embedding.

The output of the last transformer block is passed to an affine layer followed by a softmax layer, and it is trained using the cross-entropy loss to recover the last \(k^2\) tokens (\(s_{j,1}, \ldots , s_{j,k^2}\)). For the localization task, the hidden representation of the last \(k^2\) tokens is passed through a two-layer feed-forward neural network and trained using the mean squared error to match the ground-truth camera pose. Note that we compute the losses over position and orientation separately and add them together without weighting. Since we attach the pose prediction head to the hidden representation of all tokens of the query image, we obtain multiple pose estimates. During inference, we simply average them.
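The per-query-view losses and the pose averaging described above can be sketched as follows (tensor shapes are illustrative; as stated above, the position and orientation terms are simply summed without weighting):

```python
import torch
import torch.nn.functional as F

def view_synthesis_loss(token_logits, target_tokens):
    """Cross-entropy over the k^2 predicted tokens of a query view.
    token_logits: (B, k*k, n_lat), target_tokens: (B, k*k) codebook indices."""
    return F.cross_entropy(token_logits.flatten(0, 1), target_tokens.flatten())

def localization_loss(pred_pose, gt_pose):
    """MSE to the ground-truth pose with separate position and orientation terms.
    pred_pose: (B, k*k, 7) -- one estimate per query-image token; gt_pose: (B, 7)."""
    gt = gt_pose[:, None].expand_as(pred_pose)
    return (F.mse_loss(pred_pose[..., :3], gt[..., :3])      # position
            + F.mse_loss(pred_pose[..., 3:], gt[..., 3:]))   # orientation quaternion

def predict_pose(pred_pose):
    """At inference, average the per-token pose estimates of the query image."""
    return pred_pose.mean(dim=1)                              # (B, 7)
```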

4 Experiments

To answer the question of whether explicit 3D reasoning is really needed for novel view synthesis, we designed a series of experiments evaluating the proposed approach. First, we evaluate the codebook, whose performance is an upper bound on what we can achieve with the full pipeline. We next compare our method to GQN-based methods [15, 20, 66] that also do not use continuous volumetric scene representations. We continue by evaluating our approach on other synthetic data. Then, we compare our approach to state-of-the-art NeRF-based approaches on a real-world dataset. Finally, we report our model’s localization performance.

We evaluate our approach on both real and synthetic datasets: a) Shepard-Metzler-7-Parts (SM7) [20, 61] is a synthetic dataset, where objects composed of 7 cubes of different colours are rotated in space. b) ShapeNet [13] is a synthetic dataset of simple objects. We use \(128 \times 128\) pixel images rendered by [64] containing two categories: cars and chairs. c) InteriorNet [32] is a collection of interior environments designed by 1,100 professional designers. We used the publicly available part of the dataset (20k scenes with 20 images each). While the dataset is synthetic, the renderings are similar to real-world environments. The first 600 environments serve as our test set. d) Common Objects in 3D (CO3D) [51] is a real-world dataset containing 1.5 million images showing almost 19k objects from 51 MS-COCO [35] categories (e.g., apple, donut, vase, etc.). The capture of the dataset was crowd-sourced. e) 7-Scenes [23] is a real-world dataset depicting 7 indoor scenes as captured by a Kinect RGB-D camera. The dataset consists of 44 sequences of 500–1,000 frames each and it is a standard benchmark for visual localization [1, 8, 30, 31, 39].

Fig. 4. Codebook evaluation on multiple datasets, comparing the ground truth (GT) with the reconstructed image. For the 7-Scenes dataset, we compare the model finetuned and not finetuned on 7-Scenes

Fig. 5. Results on the SM7 dataset. We compare against GQN [20] and STR-GQN [15]

Codebook Evaluation. First, we evaluate the quality of our codebooks by measuring the quality of the images generated by the encoder-decoder architecture without the transformer. We trained codebooks of size 1,024 with the same hyperparameters for all experiments, using an architecture very similar to [21]. The training took roughly 480 GPU-hours. A detailed description of the model and the hyperparameters is given in the supp. mat. as well as in the published code.

Examples of reconstructed images are shown in Fig. 4. As can be seen, although losing some details and image sharpness, the codebooks recover the overall shape well, even though we use only \(8\times 8\) codes to represent an image. In some images, there are noticeable artifacts. In our analysis, we pinpointed the perceptual loss as the cause, but removing the perceptual loss led to blurrier images. Further analysis of the codebooks is included in the supp. mat.

Full Method Evaluation. Having verified that our codebooks work as intended, we evaluate our complete approach in the context of image synthesis. The transformer is trained using only the tokens generated by the codebook. The architecture of our transformer model is based on GPT2 [47]. We give more details on the architecture, the motivation, and the hyperparameters in the supp. mat.

The SM7 dataset was used to compare our approach to other methods that operate only in 2D image space [15, 20, 66]. Our method achieved the best mean absolute error (MAE) of 1.61, followed by E-GQN [66] with 2.14, STR-GQN [15] with 3.11, and the original GQN [20] with an MAE of 3.13. The results were averaged over 1,000 scenes (context size 3) and computed on images of size \(64 \times 64\) pixels. A qualitative comparison is shown in Fig. 5.

Fig. 6. Evaluation of our method on the InteriorNet dataset with context size 19

We use the InteriorNet dataset because of its large size and realistic appearance. The models pre-trained on it are also used in other experiments. Since each scene provides 20 images, we use 19 context views. Figure 6 shows images generated by the model trained for both the localization and novel view synthesis tasks.

ShapeNet Evaluation. We fine-tuned the InteriorNet pre-trained model on the ShapeNet dataset. We trained a single model for both categories (cars and chairs) using 3 context views. The training details and additional results are given in the supp. mat. We show a qualitative comparison with PixelNeRF [72] in Fig. 7. Note that PixelNeRF trains a separate model for each category.

The results show that our method achieves good visual quality overall, especially on the cars category, although the geometry is slightly distorted on the chairs. Compared to PixelNeRF, our method prefers to hallucinate a part of the scene instead of rendering a blurry image. This can cause some neighboring views to have a different colour or shape in places where the scene is less covered by the context views. However, this problem can be reduced by simply adding the previously generated view to the set of context views. See the video in the supp. mat.

Fig. 7. ShapeNet qualitative comparison with PixelNeRF [72] using 2 context views

Table 1. Novel view synthesis results on the CO3D dataset [51] on all categories and on the 10 categories from [51]. We compare ViewFormer with and without localization (‘no-loc’) trained on all categories (‘@ all cat.’) and on the 10 selected categories (‘@ 10 cat.’). We show the PSNR and LPIPS for seen and unseen scenes (‘train’ and ‘test’) and the test PSNR for varying context sizes. The best value is bold; the second best is underlined
Fig. 8. Evaluation of our method on the CO3D dataset [51] with context size 9

Common Objects in 3D. In order to show that we can transfer a model pre-trained on synthetic data to real-world scenes, we evaluate our method on the CO3D dataset [51]. We compare our approach with NeRF-based methods using the results reported in [51]. We also tried to train PixelNeRF [72] on the CO3D dataset but were not able to obtain good results; therefore, we omit it from the comparison. While the baselines are trained separately per category, we train two transformer models: one on the 10 categories used for evaluation in [51] and one on all dataset categories. We fine-tune the model trained on the InteriorNet dataset. The context size is 9. Additional details and hyperparameters are given in the supp. mat.

The testing set of each category in the CO3D dataset is split into two subsets: ‘train’ and ‘test’, containing unseen images of objects that were seen and unseen during training, respectively. We use the evaluation procedure provided by Reizenstein et al. [51]. It evaluates the model on 1,000 sequences from each category with context sizes 1, 3, 5, 7, and 9. The peak signal-to-noise ratio (PSNR) and the LPIPS distance [73] are reported. Note that the PSNR is calculated only on foreground pixels. For more details on the evaluation procedure and the compared methods, please see [51].
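For reference, a foreground-masked PSNR of this kind can be computed as in the following sketch (our reading of the protocol; the official evaluation code of [51] may differ in details such as the peak value or how the foreground mask is thresholded):

```python
import torch

def foreground_psnr(pred, target, fg_mask, peak=1.0):
    """PSNR restricted to foreground pixels.
    pred, target: (3, H, W) images in [0, peak]; fg_mask: (H, W) boolean mask."""
    mse = ((pred - target) ** 2)[:, fg_mask].mean()
    return 10.0 * torch.log10(peak ** 2 / mse)
```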

Table 1 shows the results of the evaluation on all CO3D categories and on the 10 categories used for evaluation in [51]. Our method is competitive even though it does not explicitly reason in 3D as the other baselines do, does not utilize object masks, and even though we trained a single model for all categories while the other baselines are trained per category. Note that on the whole dataset, the top-performing method, NerFormer [51], was trained for about 8,400 GPU-hours, while training our codebook took 480 GPU-hours, training the transformer on InteriorNet took 280 GPU-hours, and fine-tuning the transformer took 90 GPU-hours, giving a total of 850 GPU-hours. Also, note that rendering a single view takes 178 s for NerFormer and only 93 ms for our approach.

The results show that our model has a large capacity (it is able to learn all categories while the baselines are only trained on a single category), and it benefits from more training data as can be seen when comparing models trained on 10 and all categories. We also observe that models achieve a higher performance on 10 categories than on all categories, suggesting that the categories selected by the authors of the dataset are easier to learn or of higher quality. All our models outperform all baselines in terms of LPIPS, which indicates that the images can look more realistic while possibly not matching the real images very precisely.

Figures 1 and 8 show qualitative results. Our method generalizes well to unseen object instances, although it tends to lose some details. To answer the original question of whether explicit 3D reasoning is needed for novel view synthesis: based on our results, we claim that even without explicit 3D reasoning, we can achieve similar results, especially when the data are noisy, e.g., on a real-world dataset.

Evaluating Localization Accuracy on 7-Scenes. We compare the localization part of our approach to methods from the literature on the 7-Scenes dataset [23]. Due to space constraints, here we only summarize the results of the comparisons. Detailed results can be found in the supp. mat.

Our approach performs similarly to existing APR and RPR techniques that also use only a single forward pass through a network [1, 8, 30, 60], but worse than iterative approaches such as [19] or methods that use more densely spaced synthetic views as additional input [41]. Note that these approaches that do not use 3D scene geometry are less accurate than state-of-the-art methods based on 2D-3D correspondences [7, 56, 58]. Overall, the results show that our approach achieves a similar level of pose accuracy as comparable methods. Furthermore, our approach is able to perform both localization and novel view synthesis in a single forward pass, while the other methods can only be used for localization.

5 Conclusions and Future Work

This paper presents a two-stage approach to novel view synthesis from a few sparsely distributed context images. We train our model on classes of similar 3D scenes to be able to generalize to a novel scene from only a handful of images, as opposed to NeRF and similar methods that are trained per scene. The model consists of a VQ-VAE codebook [45] and a transformer model. To efficiently train the transformer, we propose a novel branching attention module. Our approach, ViewFormer, can render a view from a previously unseen scene in 93 ms without any explicit 3D reasoning, and we train a single model to render multiple categories of objects, whereas NeRF-based approaches train per-category models [51]. We show that our method is competitive with SoTA NeRF-based approaches, especially on real-world data, even without any explicit 3D reasoning. This is an intriguing result because it implies that either current NeRF-based methods do not utilize 3D priors effectively, or a 2D-only model is able to learn such priors on its own without explicit 3D modeling. The experiments also show that ViewFormer outperforms other 2D-only multi-view methods.

One limitation of our approach is the large amount of data needed, which we tackle through pre-training on a large synthetic dataset. Also, we need to fine-tune both the codebook and the transformer to achieve high-quality results on new datasets, which could be resolved by utilizing a larger codebook trained on more data. Using more tokens to represent images should increase the rendering quality and pose accuracy. We also want to experiment with a simpler architecture with no codebook and larger scenes, possibly of outdoor environments.