1 Introduction

Novel View Synthesis (NVS) is the problem of synthesising unseen camera views from a set of known views [8, 29]. NVS is a key technology that can enable compelling augmented or virtual reality experiences [10], new entertainment technology [6], and robotics applications [11]. NVS has undergone a significant improvement after the introduction of Neural Radiance Fields (NeRF) [2, 17] – a trainable implicit neural representation of a 3D scene that can photorealistically render unseen (novel) views. NeRF is a data-driven model that can synthesise high-quality novel views, but it typically requires many multi-view images, e.g. hundreds of images taken from different, uniformly distributed camera viewpoints around an object of interest [17]. If these viewpoints are few and/or not uniformly distributed, the resulting NeRF model may fail to produce satisfactory novel views [12, 16]. This detrimental effect is a known drawback of NeRF-based approaches: the model tends to overfit the known viewpoints, losing generalisation on novel views that are farthest from them. This is known as the few-shot view synthesis problem [12].

Fig. 1. Given a set of known views (ground truth), View Morphing-NeRF (VM-NeRF) generates image transitions between views (morph) that can be effectively used to train a NeRF model in the case of few-shot view synthesis. Results are of a higher quality when VM-NeRF is used.

In this paper, we propose to tackle the problem of training a NeRF model on scenes captured with a sparse set of viewpoints by using a novel geometry-based strategy based on View Morphing [24] (Fig. 1). This purely geometric method can synthesise, or morph, a new viewpoint that lies in-between two given camera views while ensuring realistic image transitions. Traditionally, view morphing requires a set of accurate point matches between known image pairs in order to successfully perform the morph. As this matching stage is hard to integrate into a NeRF-based learning pipeline, our intuition is to leverage the per-image depth information implicitly estimated by NeRF to obtain dense coordinate matches among views after an image rectification stage (Fig. 2). To this end, we relax and modify several steps of the view morphing strategy so that they can be properly integrated into the NeRF learning paradigm. This technique does not require any prior knowledge about the captured 3D scene, and it can synthesise 3D projective transformations (e.g. 3D rotations, translations, shears) of objects by operating entirely on the input images. We evaluate our approach by using the dataset of the original NeRF paper [17] and show that PSNR improves by up to 1.8 dB and 1.0 dB when eight and four views are used for training, respectively. We compare our approach with DietNeRF [12], AugNeRF [5] and RegNeRF [19], and show that our approach can produce higher-quality renderings.

To summarise, our contributions are:

  • We present a novel and effective method for NeRF to address the problem of few-shot view synthesis;

  • We introduce a new view morphing technique based on the NeRF depth output, named VM-NeRF;

  • VM-NeRF can achieve higher-quality rendered images than alternative methods in the literature.

2 Related Work

NVS scene synthesis can be solved either by using traditional 3D reconstruction techniques [23] or by adopting methods based on neural rendering [26]. Neural Radiance Fields (NeRF) is a recent neural rendering method that can learn a volumetric representation of an unknown 3D scene approximating its radiance and density fields from a set of known (ground truth) views by using a multilayer perceptron (MLP) [17]. NeRF optimises its parameters on one scene based on a set of known views, thus overfitting can occur when these views are few.

Current approaches addressing few-shot novel view synthesis can be divided into two groups. The first group uses the same trained network to generate novel views of different scenes. This category of methods trains on datasets characterised by similar scenes, such as DTU [1]. Multiple-scene training can introduce dataset biases and may produce low-quality results in contexts outside the training domain [18, 27]. SparseNeuS [14] and ShaRF [22] train NVS on multiple scenes by conditioning the MLP with features that encode the appearance and geometry of the surface at a 3D location. This can be achieved by using an auxiliary deep network jointly trained with NeRF. The second group uses the original per-scene optimisation procedure of NeRF, so a single network trains and tests on only one scene, leading to methods without dataset bias. These methods are more likely to encounter overfitting problems on the known views; however, they reduce this likelihood by adding either semantic or geometric constraints during training. DietNeRF belongs to this category and exploits the feature representations of known images computed with a CLIP pre-trained image encoder, renders random poses, and processes them by imposing semantic consistency through CLIP features [12]. RegNeRF [19] renders random viewpoints around the known ones, and introduces regularisation constraints between known viewpoints and randomly sampled ones.

Single-scene methods working with few viewpoints may overfit on the known images, producing artefacts when novel views are rendered. In general, overfitting can be mitigated via data augmentation [25], and to the best of our knowledge, the only methods that address data augmentation for NeRF are AugNeRF [5] and GeoAug [4]. AugNeRF aims to improve NeRF generalisation by using adversarial data augmentation to enforce each ray and its augmented version to produce the same result. GeoAug [4] perturbs the translation and rotation of the known viewpoints during training. Our proposed approach does not perturb the known input views and rays; instead, we create new views (novel 3D projective transformations) using pairs of known views. This allows us to enforce coherence of newly rendered viewpoints between distant viewpoint pairs. At the time of acceptance of this paper, we could not replicate the results of GeoAug because the authors had not released their source code.

3 Preliminaries

3.1 NeRF Overview

NeRF’s objective is to synthesise novel views of a scene by optimising a volumetric function given a finite set of input views [17]. Let \(f_{\boldsymbol{\theta }}\) be the underlying function we aim to optimise. The input to \(f_{\boldsymbol{\theta }}\) is a 5D datum that encodes a point on a camera ray, i.e. a 3D spatial location (xyz) and a 2D viewing direction \((\theta , \phi )\). Let \(\boldsymbol{c} \in \mathbb {R}^3\) be the view-dependent emitted radiance (colour) and \(\sigma \) be the volume density that \(f_{\boldsymbol{\theta }}\) predicts at (xyz). Novel views are synthesised by querying 5D data along the camera rays. Traditional volume rendering techniques can be used to transform \(\boldsymbol{c}\) and \(\sigma \) into an image [13, 15]. Because volume rendering is differentiable, \(f_{\boldsymbol{\theta }}\) can be implemented as a fully-connected deep network and learned.

Rendering a view from a novel viewpoint consists of estimating the integrals of all 3D rays that originate from the camera optic centre and that pass through each pixel of the camera image plane. Let \(\boldsymbol{r}\) be a 3D ray. To make rendering computationally tractable, each ray is represented as a finite set of 3D spatial locations, indexed with i, which are defined between two clipping distances: a near one (\(t_n\)) and a far one (\(t_f\)). Let \(\varGamma \) be the number of 3D spatial locations sampled between \(t_n\) and \(t_f\). Rendering the colour of a pixel is given by

$$\begin{aligned} \hat{\boldsymbol{c}}(\boldsymbol{r}) = \sum _{i=1}^{\varGamma } s(i) \left( 1 - e^{-\hat{\boldsymbol{\sigma }}(\boldsymbol{r})_i \boldsymbol{\delta }_i} \right) \hat{\boldsymbol{c}}(\boldsymbol{r})_i, \end{aligned}$$
(1)

where \(\hat{\boldsymbol{c}}(\boldsymbol{r})_i\) is the colour and \(\hat{\boldsymbol{\sigma }}(\boldsymbol{r})_i\) is the density predicted by the network at the \(i^{th}\) location, \(\boldsymbol{\delta }_i = t_{i+1} - t_i\) is the distance between adjacent sampled 3D spatial locations, and s(i) is the transmittance accumulated up to the \(i^{th}\) spatial location, i.e. the exponential of the negative accumulated volume density, which is computed as

$$\begin{aligned} s(i) = e^{ - \sum ^{i-1}_{j=1} \hat{\boldsymbol{\sigma }}(\boldsymbol{r})_j \boldsymbol{\delta }_j}, \end{aligned}$$
(2)

In Eq. 1, the term \((1 - e^{- \hat{\boldsymbol{\sigma }}(\boldsymbol{r})_i \boldsymbol{\delta }_i})\) acts as a density-based weight: the higher the density value \(\sigma \) of a point, the larger its contribution to the final rendered colour.

Similarly to Eq. 1, we can render the pixel depth as

$$\begin{aligned} \boldsymbol{d}(\boldsymbol{r}) = \sum _{i=1}^{\varGamma } s(i) \left( 1 - e^{- \hat{\boldsymbol{\sigma }}(\boldsymbol{r})_i \boldsymbol{\delta }_i} \right) \boldsymbol{z}_i, \end{aligned}$$
(3)

where \(\boldsymbol{z}_i\) is the distance of the \(i^{th}\) spatial location with respect to the camera optic centre.
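For clarity, the following minimal PyTorch sketch shows how Eqs. 1–3 can be evaluated for a batch of rays; tensor shapes and the handling of the last sampling interval are illustrative choices and may differ from the original implementation [17].

```python
import torch

def render_rays(sigma, rgb, z_vals):
    """Composite per-sample densities and colours along a batch of rays.

    sigma:  (R, S) densities predicted at the S samples of R rays
    rgb:    (R, S, 3) colours predicted at the samples
    z_vals: (R, S) distances of the samples from the camera optic centre
    Returns the rendered pixel colour (Eq. 1) and depth (Eq. 3).
    """
    # delta_i = t_{i+1} - t_i; the last interval is padded with a large value
    deltas = z_vals[:, 1:] - z_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)

    alpha = 1.0 - torch.exp(-sigma * deltas)        # density-based weight term
    # s(i): transmittance accumulated up to sample i (Eq. 2),
    # implemented as an exclusive cumulative product of (1 - alpha)
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)

    weights = trans * alpha                          # s(i) (1 - exp(-sigma_i delta_i))
    colour = (weights[..., None] * rgb).sum(dim=1)   # Eq. 1
    depth = (weights * z_vals).sum(dim=1)            # Eq. 3
    return colour, depth
```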

The input required to learn the NeRF parameters is a set of N images and their corresponding camera information. Let \(\mathcal {I} = \{\boldsymbol{I}_k\}_{k=1}^N\) be the training images, and \(\mathcal {P} = \{\boldsymbol{P}_k\}_{k=1}^N\) and \(\mathcal {K} = \{\boldsymbol{K}_k\}_{k=1}^N\) be their corresponding camera poses and intrinsic parameters, respectively. A pose \(\boldsymbol{P} = [ \boldsymbol{R}, \boldsymbol{t} ]\) is composed of rotation \(\boldsymbol{R}\) and translation \(\boldsymbol{t}\). We can estimate the depth map of a given view k by rendering the depth of all its pixels, therefore we can define the estimated depth maps as \(\mathcal {D} = \{\boldsymbol{D}_k\}_{k=1}^N\).

Learning \(f_{\boldsymbol{\theta }}\) is achieved by comparing each ground-truth pixel \(\boldsymbol{c}(\boldsymbol{r})\) with its predicted counterpart \(\hat{\boldsymbol{c}}(\boldsymbol{r})\). The goal is to minimise the following L2-norm objective function

$$\begin{aligned} \mathcal {L} = \frac{1}{|\mathcal {R} |}\sum _{\boldsymbol{r} \in \mathcal {R}} \left( \Vert \boldsymbol{c}(\boldsymbol{r}) - \hat{\boldsymbol{c}}_c(\boldsymbol{r}) \Vert _2^2 + \Vert \boldsymbol{c}(\boldsymbol{r}) - \hat{\boldsymbol{c}}_f(\boldsymbol{r}) \Vert _2^2 \right) , \end{aligned}$$
(4)

where \(\hat{\boldsymbol{c}}_c(\boldsymbol{r})\) and \(\hat{\boldsymbol{c}}_f(\boldsymbol{r})\) are the coarse and fine predicted volume colours for ray \(\boldsymbol{r}\), respectively. Please refer to [17] for more details.
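A direct PyTorch transcription of Eq. 4 could look as follows; the tensor layout is an assumption made here for illustration.

```python
import torch

def photometric_loss(c_gt, c_coarse, c_fine):
    """L2 photometric loss of Eq. 4, averaged over the batch of rays.

    c_gt, c_coarse, c_fine: (R, 3) ground-truth, coarse and fine colours.
    """
    return ((c_gt - c_coarse) ** 2).sum(dim=-1).mean() + \
           ((c_gt - c_fine) ** 2).sum(dim=-1).mean()
```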

3.2 View Morphing Overview

The objective of view morphing is to synthesise natural 2D transitions between an image pair \(\{\boldsymbol{I}_k, \boldsymbol{I}_{k'}\}\), and the approach can be summarised in three steps: i) the two images are prewarped through rectification, i.e. their image planes are aligned without changing their cameras’ optic centres; ii) the morph is computed between these prewarped images to generate a morphed image whose viewpoint lies on the line connecting the optic centres; iii) the image plane of the morphed image is transformed to a desired viewpoint through postwarping.

In practice, assuming the two views are prewarped, the morph uses the knowledge of their camera poses \(\boldsymbol{P}_k, \boldsymbol{P}_{k'}\), and the pixel correspondences between the images, i.e. \(q_k: \boldsymbol{I}_k \Rightarrow \boldsymbol{I}_{k'}, q_{k'}: \boldsymbol{I}_{k'} \Rightarrow \boldsymbol{I}_k\), where \(q_k\) is a function that maps a pixel of \(\boldsymbol{I}_k\) to the corresponding pixel in \(\boldsymbol{I}_{k'}\) [24]. Sparse pixel correspondences can be defined by a user or determined by a keypoint detector; they can then be densified via interpolation to create a dense correspondence map. This procedure is not viable as is in a learning-based pipeline, hence we have to define a novel view morphing strategy for a NeRF-based network architecture. A warp function for each image can be computed from the correspondence map through linear interpolation

$$\begin{aligned}&\hat{\boldsymbol{I}}_{k,\alpha } = ( 1-\alpha )\hat{\boldsymbol{I}}_k + \alpha q_k ( \hat{\boldsymbol{I}}_k ) \nonumber \\&\hat{\boldsymbol{I}}_{k',\alpha } = ( 1 - \alpha ) q_{k'} ( \hat{\boldsymbol{I}}_{k'} ) + \alpha \hat{\boldsymbol{I}}_{k'} \nonumber \\&\boldsymbol{P}_\alpha = (1 - \alpha ) \boldsymbol{P}_0 + \alpha \boldsymbol{P}_1, \end{aligned}$$
(5)

where \(\hat{\boldsymbol{I}}_k\) are the pixel coordinates of the image of camera k, \(\hat{\boldsymbol{I}}_{k,\alpha }\) are the pixel coordinates of the morphed image, \(\boldsymbol{P}_0\) and \(\boldsymbol{P}_1\) are the poses of the two prewarped views, and \(\alpha \in [0,1]\) regulates the position of the morphed view along the line connecting the two views. The morphed image can then be computed by averaging the pixel colours of the warped images. Please refer to [24] for more details.
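As an illustration, a minimal sketch of the warps in Eq. 5 is given below, assuming the correspondence maps \(q_k\) and \(q_{k'}\) are already available as dense coordinate arrays; the array shapes and function names are illustrative.

```python
import numpy as np

def morph_coordinates(coords_k, coords_kp, q_k_of_k, q_kp_of_kp, P0, P1, alpha):
    """Linear warps of Eq. 5 between two prewarped views.

    coords_k, coords_kp:   (H*W, 2) pixel coordinates of views k and k'
    q_k_of_k, q_kp_of_kp:  (H*W, 2) corresponding coordinates in the other view
    P0, P1:                (3, 4) poses of the two prewarped views
    alpha:                 position of the morphed view in [0, 1]
    """
    warped_k = (1.0 - alpha) * coords_k + alpha * q_k_of_k       # I_{k, alpha}
    warped_kp = (1.0 - alpha) * q_kp_of_kp + alpha * coords_kp   # I_{k', alpha}
    P_alpha = (1.0 - alpha) * P0 + alpha * P1                    # interpolated pose
    return warped_k, warped_kp, P_alpha
```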

4 NeRF-Based View Morphing

The goal of NeRF-based View Morphing (VM-NeRF) is to use the geometrical constraints of the morphing technique to synthesise a set of additional training input views \(\mathcal {M} = \{ {\textbf {M}}_{(k,k'),\alpha }\}\), where \({\textbf {M}}_{(k,k'),\alpha }\) is a morphed view generated from the view pair k and \(k'\) with a given value of \(\alpha \). Adapting view morphing in a learning-based pipeline is challenging as we need reliable pixel correspondences (\(q_k\) and \(q_{k'}\)) to synthesise morphed views. Our intuition is that it is possible to compute one-to-one correspondences from the disparity information, a function of the depth as in Eq. 3, which we can render with the very same NeRF model. We can then linearly interpolate the photometric content of the view pair to produce the morphed view.

Based on the description in Sect. 3.2, we integrate in NeRF only the prewarping and morphing steps. We experimentally found that postwarping does not lead to better results. Section 4.1 describes how we perform the initial rectification of the two cameras. Section 4.2 describes how the image morphing is computed. Section 4.3 provides detailed information on our practical approach to training NeRF with View Morphing. Figure 2 shows the block diagram of our approach.

Fig. 2. Block diagram of NeRF-based View Morphing (VM-NeRF). From the left, we (1) predict the depth with NeRF, (2) rectify the input images and predicted depths, and (3) compute the image morphing of a view randomly positioned between the view pair. \(\alpha \) determines the new view position and it is sampled from a Gaussian distribution.

4.1 Rectification

Our first step is rectification, which amounts to rotating the known camera poses \(\boldsymbol{P}_k\) and \(\boldsymbol{P}_{k'}\) around their optic centres until their image planes become coplanar. We can then compute the common image plane by using standard algorithms such as [7, 9]. We represent this plane as the rotation matrix

$$\begin{aligned} \tilde{\boldsymbol{R}} = [ \boldsymbol{a}_{x}, \boldsymbol{a}_{y}, \boldsymbol{a}_{z} ], \end{aligned}$$
(6)

where \(\boldsymbol{a}_{x}, \boldsymbol{a}_{y}, \boldsymbol{a}_{z}\) are the axes of the common image plane resulting from the rectification. Stereo rectification is applied to the original images \( \left\{ \boldsymbol{I}_k, \boldsymbol{I}_{k'} \right\} \) and to the depth maps \( \left\{ \boldsymbol{D}_k, \boldsymbol{D}_{k'} \right\} \) predicted via Eq. 3. The new camera pose of view k is \(\tilde{\boldsymbol{P}}_k = [ \tilde{\boldsymbol{R}}, \boldsymbol{t}_k ]\), where \(\boldsymbol{t}_k\) is the translation of the original camera pose \(\boldsymbol{P}_k\) (the same applies to view \(k'\)).

Rectification algorithms are typically based on the assumptions that viewpoints are aligned horizontally and that the reference viewpoint is the left-hand camera (from an observer positioned behind the cameras) [7, 9]. This is atypical in NeRF, as viewpoints may have arbitrary camera configurations, leading to errors that must be corrected. We mitigate this problem by measuring the angle between \(\boldsymbol{a}_{z}\) and the z-axis of the original view pose. If this angle is greater than \(45^{\circ }\) with respect to both \(\boldsymbol{P}_k\) and \(\boldsymbol{P}_{k'}\), we rotate the warping matrices and poses by \(90^{\circ }\) or \(180^{\circ }\). Applying this modification to the conventional rectification algorithm allows us to correctly generate the rectified images \(\{ \boldsymbol{\tilde{I}}_k, \boldsymbol{\tilde{I}}_{k'} \}\) and the rectified depth maps \(\{ \boldsymbol{\tilde{D}}_k, \boldsymbol{\tilde{D}}_{k'} \}\).
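A minimal sketch of this angle test is shown below; it assumes camera-to-world rotations whose third column is the viewing (z) axis, and it only flags when a corrective rotation is needed, leaving the choice between \(90^{\circ }\) and \(180^{\circ }\) to the rectification code.

```python
import numpy as np

def needs_correction(a_z, R_k, R_kp, threshold_deg=45.0):
    """Return True when the rectified z-axis deviates from both original views.

    a_z:        (3,) z-axis of the common rectified plane (third column of R~)
    R_k, R_kp:  (3, 3) camera-to-world rotations of the original poses
    """
    def angle_deg(u, v):
        cos = np.clip(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)), -1.0, 1.0)
        return np.degrees(np.arccos(cos))

    # z-axes of the original cameras are the third columns of their rotations
    return (angle_deg(a_z, R_k[:, 2]) > threshold_deg and
            angle_deg(a_z, R_kp[:, 2]) > threshold_deg)
```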

4.2 Image Morphing

The second step is image morphing, i.e. fusing the rectified images to obtain the new morphed image. This procedure is divided into three steps: i) finding the pixel correspondences; ii) computing the position of each pixel on the morphed camera; iii) fusing pixels that fall in the same position. To determine the image correspondences, we initially compute the disparity maps as functions of the rectified estimated depths

$$\begin{aligned} \boldsymbol{E}_k = \frac{f_k}{\boldsymbol{\tilde{D}}_{k}} \left\| \boldsymbol{o}_k - \boldsymbol{o}_{k'} \right\| _{2}, \quad \boldsymbol{E}_{k'} = \frac{f_{k'}}{\boldsymbol{\tilde{D}}_{k'}} \left\| \boldsymbol{o}_k - \boldsymbol{o}_{k'} \right\| _{2}, \end{aligned}$$
(7)

where \(\{ \boldsymbol{o}_k, \boldsymbol{o}_{k'} \}\) are the principal points and \(\{f_k, f_{k'} \}\) are the focal lengths of cameras k and \(k'\).
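A minimal sketch of Eq. 7 is shown below; it assumes the focal length is expressed in pixels and that the depth map has already been rectified.

```python
import numpy as np

def disparity_from_depth(depth_rect, focal, o_k, o_kp):
    """Disparity map of Eq. 7 computed from a rectified depth map.

    depth_rect: (H, W) rectified depth map predicted by NeRF (Eq. 3)
    focal:      focal length of the camera, in pixels
    o_k, o_kp:  (3,) the points o_k, o_k' of Eq. 7
    """
    baseline = np.linalg.norm(o_k - o_kp)   # ||o_k - o_k'||_2
    return focal * baseline / depth_rect
```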

Then, we determine the correspondences of the pixel positions between images defined in Eq. 5 as

$$\begin{aligned} q_k(\hat{\boldsymbol{I}}_k) = \hat{\boldsymbol{I}}_k + \frac{\boldsymbol{\tilde{b}}_k}{\Vert \boldsymbol{\tilde{b}}_k \Vert _2} \textbf{1}^\top \odot \boldsymbol{E}_k, \end{aligned}$$
(8)

where \(\textbf{1}\) is a vector of ones, \(\odot \) indicates the Hadamard product, and \(\boldsymbol{\tilde{b}}_k\) is the baseline direction with respect to the common plane defined in Eq. 6, which is computed as

$$\begin{aligned} \boldsymbol{\tilde{b}}_k = \boldsymbol{a}_z \times ( (\boldsymbol{o}_k - \boldsymbol{o}_{k'}) \times \boldsymbol{a}_z ). \end{aligned}$$
(9)

The same operation is computed for \(k'\). Then, we apply the warp functions of Eq. 5 to compute the position of each pixel on the morphed view, thus obtaining \(\hat{\boldsymbol{I}}_{k,\alpha }\) and \(\hat{\boldsymbol{I}}_{k',\alpha }\).
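For illustration, a sketch of Eqs. 8–9 is given below; mapping the 3D baseline direction to the 2D image plane by projecting it onto the first two axes of \(\tilde{\boldsymbol{R}}\) is an assumption made for this example, as are the array shapes.

```python
import numpy as np

def correspondences_from_disparity(coords_k, disparity_k, R_tilde, o_k, o_kp):
    """Dense pixel correspondences q_k of Eq. 8, using the baseline of Eq. 9.

    coords_k:    (H*W, 2) rectified pixel coordinates of view k
    disparity_k: (H*W,)   disparity values E_k at those pixels
    R_tilde:     (3, 3)   common rectified rotation [a_x, a_y, a_z] of Eq. 6
    o_k, o_kp:   (3,)     the points o_k, o_k' of Eq. 7
    """
    a_x, a_y, a_z = R_tilde[:, 0], R_tilde[:, 1], R_tilde[:, 2]
    # baseline direction lying on the common plane (Eq. 9)
    b = np.cross(a_z, np.cross(o_k - o_kp, a_z))
    b = b / np.linalg.norm(b)
    # express the unit baseline direction in the rectified image axes
    b_2d = np.array([b @ a_x, b @ a_y])
    # shift each pixel along the baseline by its disparity (Eq. 8)
    return coords_k + disparity_k[:, None] * b_2d[None, :]
```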

Lastly, a coalescence operation [3] fuses the pixels of the two views k and \(k'\). The coalescence operation concatenates the two sets of coordinates and fuses the pixels that fall in the same position, preserving only the pixel value of the point that is nearest to the camera. We use \(\{ \tilde{\boldsymbol{D}}_k, \tilde{\boldsymbol{D}}_{k'} \}\) to determine the distance of the points.
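The sketch below illustrates the idea with a naive z-buffer; the actual coalescence operation follows [3] and can be implemented as a vectorised scatter, so the explicit loop here is only for exposition.

```python
import numpy as np

def coalesce(coords, colours, depths, height, width):
    """Fuse warped pixels that land on the same morphed-image position,
    keeping the value of the point nearest to the camera.

    coords:  (P, 2) warped pixel coordinates from both views, concatenated
    colours: (P, 3) corresponding pixel colours
    depths:  (P,)   rectified depths used to resolve collisions
    """
    morph = np.zeros((height, width, 3))
    zbuf = np.full((height, width), np.inf)
    cols = np.round(coords[:, 0]).astype(int)
    rows = np.round(coords[:, 1]).astype(int)
    valid = (rows >= 0) & (rows < height) & (cols >= 0) & (cols < width)
    for r, c, col, d in zip(rows[valid], cols[valid], colours[valid], depths[valid]):
        if d < zbuf[r, c]:           # keep the point nearest to the camera
            zbuf[r, c] = d
            morph[r, c] = col
    return morph
```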

4.3 Training with VM-NeRF

VM-NeRF is subject to the same geometric constraints as the original view morphing technique [24]. These constraints require that no singular camera configuration exists, which occurs whenever the optic centre of one camera is within the field of view of the other [24]. We also discard camera pairs whose distance exceeds a threshold \(\gamma \), as the morphed cameras may lie on a transition path that crosses regions where the object of interest is not actually visible (making them of little use for training a NeRF-based model).
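A possible way to test these conditions is sketched below, assuming world-to-camera rotations, pinhole intrinsics, and the image resolution of the synthetic dataset; none of these conventions are prescribed by our method.

```python
import numpy as np

def is_valid_pair(o_k, R_k, K_k, o_kp, R_kp, K_kp, gamma=6.0, width=800, height=800):
    """Reject camera pairs that are too far apart or in a singular configuration,
    i.e. when the optic centre of one camera falls inside the other's field of view.
    """
    if np.linalg.norm(o_k - o_kp) > gamma:
        return False

    def centre_visible(o_other, o_cam, R_cam, K_cam):
        p = K_cam @ (R_cam @ (o_other - o_cam))   # project the other optic centre
        if p[2] <= 0:
            return False                          # behind the camera
        u, v = p[0] / p[2], p[1] / p[2]
        return 0 <= u < width and 0 <= v < height

    # singular if either optic centre is visible from the other camera
    if centre_visible(o_kp, o_k, R_k, K_k) or centre_visible(o_k, o_kp, R_kp, K_kp):
        return False
    return True
```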

Because view morphing allows the synthesis of a new view at any point on the line that connects the known camera pair, we randomly sample new views using a Gaussian distribution centred halfway between the two cameras. Specifically, let us consider a normalised distance between the two cameras. The Gaussian distribution is centred at 0.5, and the standard deviation \(\sigma \) is chosen such that \(3\sigma \) approximately reaches the optic centre positions. Therefore, we sample \(\alpha \sim \mathcal {N}(0.5, \sigma )\) with \(0 \le \alpha \le 1\). The depth that NeRF renders in the first few iterations is noisy; therefore, we let NeRF warm up on the known views for \(\lambda \) iterations before synthesising VM-NeRF views and injecting them into the subsequent training iterations. After the warm-up, for each valid camera pair, we regenerate M new views every \(\eta \) training iterations, as the predicted depth improves over time during training.
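A minimal sketch of this sampling and scheduling logic, using the hyperparameters of Sect. 5.1 (\(\sigma = 0.2\), \(\lambda = 500\), \(\eta = 5\)), is given below; clipping \(\alpha \) to \([0, 1]\) rather than resampling is an assumption made for brevity.

```python
import numpy as np

def sample_alpha(sigma=0.2):
    """Sample the morph position alpha ~ N(0.5, sigma), constrained to [0, 1]."""
    return float(np.clip(np.random.normal(0.5, sigma), 0.0, 1.0))

def regenerate_views(iteration, warmup=500, every=5):
    """Warm up NeRF on the known views for `warmup` iterations, then refresh
    the M morphed views of each valid pair every `every` iterations."""
    return iteration >= warmup and (iteration - warmup) % every == 0
```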

5 Experiments

5.1 Experimental Setup

We evaluate our method on three training setups using the NeRF realistic synthetic \(360^{\circ }\) dataset [17], which is composed of eight scenes, i.e. Chair, Drums, Ficus, Lego, Materials, Ship, Mic, Hot Dog. First setup: we select \(N=8\) views out of the 100 available for each scene using Farthest Point Sampling (FPS) [20] (the first view is used for FPS initialisation in each scene). Second setup: we use the same \(N=8\) views used in DietNeRF [12]. Third setup: we select \(N=4\) views using the same FPS approach. We test each trained model on all the test views of NeRF realistic synthetic \(360^{\circ }\). We quantify the rendering results using the peak signal-to-noise ratio (PSNR), the structural similarity index measure (SSIM) [28] and the learned perceptual image patch similarity (LPIPS) [30]. We quantitatively compare our approach against DietNeRF [12] and RegNeRF [19], as the most recent methods for few-shot view synthesis. We also compare against AugNeRF [5] because it is the only data augmentation method for NeRF, and data augmentation can be a useful strategy to promote generalisation. We use the Chair scene for our ablation study, which consists of testing VM-NeRF on four different, randomly chosen configurations of eight views and on the DietNeRF configuration.
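For reference, a simple FPS implementation over camera optic centres is sketched below; operating on optic centres (rather than, e.g., viewing directions) is an assumption made for this example.

```python
import numpy as np

def farthest_point_sampling(centres, n_views, start=0):
    """Greedily select n_views cameras, each maximising the distance
    to the already-selected set.

    centres: (N, 3) camera optic centres of the available views
    start:   index of the view used to initialise the selection
    """
    selected = [start]
    dists = np.linalg.norm(centres - centres[start], axis=1)
    while len(selected) < n_views:
        nxt = int(np.argmax(dists))
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(centres - centres[nxt], axis=1))
    return selected
```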

Table 1. Results on the NeRF realistic synthetic \(360^\circ \) dataset.
Table 2. Ablation study results. Keys: # views: number of views and the corresponding view subset; avg. dist.: average distance between view pairs.

We implement NeRF and our approach in PyTorch Lightning, and run experiments on a single Nvidia A40 with a batch size of 1024 rays. A single scene can be trained in about two days. We use the original implementations of DietNeRF, AugNeRF and RegNeRF to evaluate the different setups. We set the same training parameters as in [17], and set \(\gamma = 6\), \(\sigma = 0.2\), \(M=1\), \(\eta = 5\), \(\lambda = 500\).

5.2 Analysis of the Results

Quantitative. Table 1 shows the results averaged over the eight scenes. Our NeRF implementation can achieve nearly the same results reported in [17] on the 100-view setup, i.e. PSNR equal to 31.21 (ours) compared to 31.01 [17].

VM-NeRF outperforms all the other methods in the eight-view setting. Interestingly, the original version of NeRF is the second best, followed by DietNeRF and RegNeRF. AugNeRF fails to produce satisfactory results. We can also observe that VM-NeRF achieves slightly better quality than its version with oracle depth maps, i.e. 24.39 vs. 24.22 PSNR. In fact, we observed that VM-NeRF can effectively leverage the depth information that is estimated during training, even though it is noisy. We also evaluate VM-NeRF on the same eight views originally tested by DietNeRF [12]. Here too, we achieve higher-quality results on average, i.e. 24.14 vs. 23.59. We also improve in the four-view setup, where we obtain a gain of +1.02 PSNR on average. The results also show that the perturbation of the known input views, performed by AugNeRF, has adverse effects in all the tested setups.

Fig. 3. Comparisons on test-set views of scenes of NeRF realistic synthetic \(360^{\circ }\). Unlike AugNeRF [5], VM-NeRF is an effective method that can be used for few-shot view synthesis problems. Unlike DietNeRF [12], VM-NeRF enables NeRF to learn scenes with a higher definition. VM-NeRF produces fewer artefacts than RegNeRF [19] during rendering. We report the PSNR that we measured for each method and for each rendered image. AugNeRF fails to learn Chair and Lego (white and black outputs).

Qualitative. Figure 3 shows some qualitative results on Chair, Hot Dog and Lego, where we can observe that VM-NeRF produces results with better details than DietNeRF. We speculate that this difference with DietNeRF may be due to its CLIP-based approach, which is introduced to leverage a semantic consistency loss for regularisation [21]. The CLIP output is a low-dimensional (global) representation vector of the image, which may hinder the learning of high-definition details. In contrast, our approach interpolates the original photometric information from two views to produce a new input view, without losing information through the encoding into a low-dimensional representation vector. Figure 3 also shows that, compared to RegNeRF, our approach produces fewer artefacts by correlating nearby views.

Ablation Study. We assess the stability of VM-NeRF by evaluating the rendering quality when different combinations of views are used to train NeRF. Table 2 shows that the performance is fairly stable throughout different view configurations. We also observed that the algorithm is robust to variations in the distance between view pairs. As long as a view pair is not singular and the distance between cameras is adequate to create acceptable 3D projective transformations of the object, we can successfully synthesise new views with VM-NeRF.

6 Conclusions

We presented a novel method for few-shot view synthesis that blends NeRF and the View Morphing technique [24]. View morphing requires no prior knowledge of the 3D shape and is based on general principles of projective geometry. We evaluated our approach using the conventional dataset employed by NeRF-based methods, demonstrating that VM-NeRF more effectively learns 3D scenes across various few-shot view synthesis setups. VM-NeRF can interpolate only along the line that connects the optic centres of each camera pair; therefore, it cannot reconstruct the whole object if only a part of it is viewed during training. Lastly, we designed our approach to be fully differentiable, so an attractive research direction is to integrate it into an end-to-end training pipeline.