1 Introduction

3D reconstruction generates the 3D shape of an object from one or more 2D images and plays an important role in various applications, including medical image processing [1], virtual reality [2], CAD [3], and human detection [4]. It has traditionally been tackled with conventional computer vision algorithms [5, 6].

However, these traditional algorithms rely on prior knowledge, strong assumptions, and sophisticated hardware, and are thus impractical in many scenarios. Recently, deep learning has shown powerful advantages in computer vision, prompting many researchers to turn to learning-based methods for 3D reconstruction. For multi-view 3D reconstruction, some works focus on matching view features extracted from different views of an object [7,8,9]. Compared with multi-view reconstruction, single-view 3D reconstruction is more difficult because of self-occlusion and the absence of object information from other viewing angles. It is therefore necessary to develop single-view 3D reconstruction algorithms with higher reconstruction accuracy.

Generally, the reconstructed 3D shape of an object can be represented as a volume [7,8,9,10,11,12,13,14], a mesh [15], or a point cloud [16,17,18]. For single-view voxel-based reconstruction, several learning-based networks have been proposed. For example, some approaches first extract a view feature from the single input image and then transform it into a 3D representation [7,8,9]. 3D VAE-GAN [10] is an adversarial learning-based network, and some methods use transformers to perform end-to-end single-image 3D reconstruction [11, 12]. These methods do not explicitly consider the spatial semantic information of objects, which leads to inaccurate or incomplete reconstructed volumes. Considering the reconstruction accuracy, parameter count, and optimization convergence speed of our model, we choose to improve AttSets [8] and propose a better reconstruction framework.

In this paper, we present IV-Net, a novel framework for single-view 3D volume reconstruction. IV-Net reconstructs a refined volume by fusing the image feature with the feature of the volume recovered by the baseline and contains two main modules: a baseline and an IV refiner. For each single-view input, the baseline module is pre-trained to generate a relatively reliable 3D volume, which supplies additional spatial information. Based on the pre-trained baseline, the IV refiner then generates a better-reconstructed volume, in which the details and shape of the object are better predicted because the image feature is fused with this spatial information.

Our main contributions are as follows:

  1. (1)

    We construct a unified refiner network for single-view 3D reconstruction, namely IV-Net. It recovers the shape and details of an object well and is broadly applicable and adaptable.

  2. (2)

    We present a multi-scale convolutional block that extracts multi-scale information to enhance the learning ability of the 2D encoder.

  3. (3)

    We construct a residual convolutional neural network as the 3D encoder to efficiently extract the spatial feature of the recovered volume.

  4. (4)

    Experimental results on both ShapeNet and Pix3D datasets demonstrate that IV-Net improves the reconstruction quality and performs favorably compared with state-of-the-art methods.

This paper consists of five sections. Section 2 introduces the related work. Section 3 discusses our framework and loss functions. The datasets, evaluation metrics and results of comparative experiments are shown in Sect. 4. Section 5 presents the conclusion and the prospect for future research.

2 Related work

Predicting the 3D shape of an object from a 2D image is a challenging and ill-posed problem. Recently, with the availability of large-scale datasets, several learning-based networks have been proposed for single-view reconstruction.

With a limited memory budget, OGN [13] utilizes octrees to represent high-resolution reconstructed volumes. Matryoshka networks [14] decompose the 3D shape of an object into nested shape layers and outperform octree-based methods. Building on the success of generative adversarial networks (GANs) [19] and their variants, 3D VAE-GAN [10] generates a volume from a single-view input by combining a GAN with a variational autoencoder (VAE). Encoder–decoder approaches such as 3D-R2N2 [7], AttSets [8], and Pix2Vox++ [9] first encode the single-view input image into a fixed-size feature vector and then pass it to a 3D decoder that produces the 3D representation. In particular, AttSets [8] adopts the encoder–decoders of 3D-R2N2 [7] and SilNet [20] as its two base networks. Researchers have also used transformers for 3D volume reconstruction [11, 12]. However, these voxel-based works do not consider the spatial information of objects and therefore reconstruct comparatively inaccurate volumes in terms of detail and shape. Departing from voxel-based representations, Pixel2Mesh [15] applies a graph convolutional network to generate a 3D triangular mesh from a single-view image, while PSGN [16] and 3D-LMNet [17] generate point-cloud representations.

3 Methodology

The proposed IV-Net reconstructs a 3D volume of size \(32^{3}\) from a single-view image and contains two modules optimized in two steps: a baseline and an IV refiner, as illustrated in Fig. 1. First, for each single-view image input, the baseline is pre-trained to obtain an image feature and a coarse volume, which supplies additional spatial information. Then, the IV refiner is trained to generate a more accurate volume. We introduce these two modules in detail below.

Fig. 1
figure 1

Overview of the proposed IV-Net

3.1 Baseline

Our baseline model consists of three components: a 2D encoder, a latent feature processing (LFP) module, and a 3D decoder. From each single-view input, the baseline is pre-trained to obtain an image feature and a coarse 3D volume of size \(32^{3}\). Considering the reconstruction accuracy, parameter count, and optimization convergence speed of our model, we improve the AttSets [8] network as our baseline. Specifically, its encoder–decoder is based on 3D-R2N2 [7]. The encoder and decoder of 3D-R2N2 are standard residual convolutional neural networks (CNNs), whose residual connections between convolution layers stabilize and accelerate the optimization of very deep networks.

3.1.1 2D Encoder

From each single-view \(127 \times 127 \times 3\) image input, the 2D encoder produces a fixed \(1 \times 1024\) image feature vector z, as shown in Fig. 2. Our 2D encoder is based on the 2D encoder of 3D-R2N2 [7], which mainly uses fixed \(3 \times 3\) convolutions. To extract multi-scale image features, we design a multi-scale convolutional (MSC) block to replace some of these fixed-scale convolution layers, see Fig. 2. The MSC block begins with an MSC layer that extracts multi-scale feature maps from the input feature and then applies a \(1 \times 1\) convolution layer to strengthen the relation between the multi-scale feature maps across channels, see Fig. 3a. For example, an MSC layer with three kernels applies the three different kernels to the input feature and concatenates the three output feature maps along the channel dimension to obtain the output feature, see Fig. 3b. In practice, considering the sizes of the input image features, we use only two kernels (i.e., \(3 \times 3\) and \(5 \times 5\) convolutions) in the MSC layers. More specifically, the convolutions in an MSC layer all use stride \(1 \times 1\), 'SAME' padding, and c/n filters, where c is the number of channels of the input feature and n is the number of kernel types.

Fig. 2
figure 2

2D encoder with MSC blocks, where fc is a fully connected layer

Fig. 3
figure 3

Multi-scale convolutional (MSC) block extracts the multi-scale information of the input feature in the 2D encoder. a Architecture of the MSC block. In practice, we use only two kernels (i.e., \(3 \times 3\), \(5 \times 5\)) in each MSC layer. b MSC layer with three convolutional kernels (c = 6). 'C' denotes concatenating feature maps along the channel
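To make the block concrete, the following is a minimal TensorFlow 2 sketch of an MSC block as described above; the ReLU activations and any layer details beyond what the text specifies are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the MSC block: parallel convolutions at several kernel
# sizes (c/n filters each), concatenated along channels, then a 1x1 conv.
import tensorflow as tf

def msc_block(x, kernel_sizes=(3, 5)):
    c = x.shape[-1]                      # channels of the input feature map
    n = len(kernel_sizes)
    branches = [
        tf.keras.layers.Conv2D(c // n, k, strides=1, padding="same",
                               activation="relu")(x)   # activation assumed
        for k in kernel_sizes
    ]
    multi_scale = tf.keras.layers.Concatenate(axis=-1)(branches)
    # 1x1 conv mixes information across the concatenated multi-scale channels.
    return tf.keras.layers.Conv2D(c, 1, strides=1, padding="same",
                                  activation="relu")(multi_scale)

# Example: a 32x32 feature map with 64 channels keeps its shape.
feat = tf.keras.Input(shape=(32, 32, 64))
out = msc_block(feat)                    # (None, 32, 32, 64)
```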

3.1.2 Latent feature processing (LFP) module

The LFP module maps the \(1 \times 1024\) feature vector z to a voxelized \(4^{3} \times 128\) latent image feature \({\mathbf{F}}_{{\mathbf{I}}}\). AttSets [8] is a unified framework for single-view and multi-view 3D reconstruction and employs an LFP module to attentively fuse image features into a voxelized latent image feature. For a single-view image, the LFP can be simplified to a fully connected (fc) layer, a ReLU activation, and a reshape operation.
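As a concrete illustration, a minimal TensorFlow 2 sketch of this simplified LFP could look as follows; it transcribes the fc + ReLU + reshape description above, with no further assumptions beyond the layer API.

```python
# Simplified LFP: fc layer with ReLU, reshaped to a 4x4x4x128 latent feature.
import tensorflow as tf

def simplified_lfp(z):
    """Map a 1x1024 image feature to the 4^3 x 128 latent feature F_I."""
    h = tf.keras.layers.Dense(4 * 4 * 4 * 128, activation="relu")(z)
    return tf.keras.layers.Reshape((4, 4, 4, 128))(h)

z = tf.keras.Input(shape=(1024,))
F_I = simplified_lfp(z)                  # (None, 4, 4, 4, 128)
```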

3.1.3 3D Decoder

The 3D decoder transforms the latent feature \({\mathbf{F}}_{{\mathbf{I}}}\) into a coarse volume of size \(32 \times 32 \times 32\). Its construction is the same as the 3D decoder of 3D-R2N2 [7].
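For orientation, the snippet below sketches a decoder that upsamples a \(4^{3} \times 128\) latent feature to a \(32^{3}\) occupancy volume. The actual 3D-R2N2 decoder uses residual 3D convolutions with unpooling; the transposed-convolution variant shown here is only a simplified stand-in with assumed channel widths.

```python
# Simplified stand-in for the 3D decoder: 4^3 latent -> 32^3 occupancy volume.
import tensorflow as tf

def decoder_3d(latent):
    x = latent                                             # (4, 4, 4, 128)
    for filters in (64, 32, 16):                           # 4^3 -> 8^3 -> 16^3 -> 32^3
        x = tf.keras.layers.Conv3DTranspose(filters, 4, strides=2,
                                            padding="same")(x)
        x = tf.keras.layers.LeakyReLU(0.2)(x)
    # Final 3D conv + sigmoid gives per-voxel occupancy probabilities.
    return tf.keras.layers.Conv3D(1, 3, padding="same",
                                  activation="sigmoid")(x)

latent = tf.keras.Input(shape=(4, 4, 4, 128))
volume = decoder_3d(latent)                                # (None, 32, 32, 32, 1)
```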

3.2 IV refiner

To supplement spatial information, the IV refiner, built on the pre-trained baseline, fuses the image feature with the feature of the recovered coarse volume to obtain a more accurate volume. It consists of two components: a 3D encoder and a 3D decoder. Following the structure of our 2D encoder, we construct 3D residual convolutional architectures for the 3D encoder to effectively extract the voxel feature.

3.2.1 3D Encoder

Generally, features are combined in two main ways: 'concat' and '+.' To obtain the voxel feature \({\mathbf{F}}_{{\mathbf{V}}}\) of the coarse volume, we construct two corresponding residual convolutional architectures for the 3D encoder: 3D Encoder/A and 3D Encoder/B, as shown in Fig. 4. Each residual convolutional block begins with two \(4 \times 4 \times 4\) convolutional layers with stride \(1 \times 1 \times 1\), each followed by a leaky ReLU activation with a leaky rate of 0.2; a residual connection is added between the input feature and the output of the second layer, followed by a max pooling layer with a kernel size of \(2 \times 2 \times 2\). 3D Encoder/A contains three residual blocks whose convolutional layers have 32, 64, and 128 output channels, respectively. 3D Encoder/B begins with four residual blocks, whose convolutional layers have 32, 64, 128, and 128 output channels, respectively, followed by a reshape operation and an fc layer. After being processed by 3D Encoder/A or /B, the input 3D volume is encoded into a \(4^{3} \times 128\) or \(1 \times 1024\) voxel feature \({\mathbf{F}}_{{\mathbf{V}}}\), respectively.

Fig. 4
figure 4

Network architectures of 3D Encoder/A (top) and 3D Encoder/B (bottom)
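A minimal TensorFlow 2 sketch of one residual block and of 3D Encoder/A, following the description above, is given below; the \(1 \times 1 \times 1\) projection on the skip path when the channel count changes is our assumption to make the residual addition valid.

```python
# One residual block: two 4x4x4 convs, leaky ReLU (0.2), skip connection,
# then 2x2x2 max pooling, as described in the text.
import tensorflow as tf

def residual_block_3d(x, filters):
    shortcut = x
    y = tf.keras.layers.Conv3D(filters, 4, strides=1, padding="same")(x)
    y = tf.keras.layers.LeakyReLU(0.2)(y)
    y = tf.keras.layers.Conv3D(filters, 4, strides=1, padding="same")(y)
    y = tf.keras.layers.LeakyReLU(0.2)(y)
    if shortcut.shape[-1] != filters:        # assumed projection for the skip path
        shortcut = tf.keras.layers.Conv3D(filters, 1, padding="same")(shortcut)
    y = tf.keras.layers.Add()([y, shortcut])
    return tf.keras.layers.MaxPool3D(pool_size=2)(y)

# 3D Encoder/A: three residual blocks, 32^3 volume -> 4^3 x 128 voxel feature.
vol = tf.keras.Input(shape=(32, 32, 32, 1))
x = vol
for f in (32, 64, 128):
    x = residual_block_3d(x, f)              # final shape (None, 4, 4, 4, 128)
```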

3.2.2 3D Decoder

The 3D decoder of the IV refiner is identical to that of the baseline. It takes as input the combination of the image feature and the voxel feature \({\mathbf{F}}_{{\mathbf{V}}}\) and transforms the combined feature into the final volume. There are two main combination methods: '+' adds the latent image feature \({\mathbf{F}}_{{\mathbf{I}}}\) to \({\mathbf{F}}_{{\mathbf{V}}}\), and 'concat' concatenates the image feature z with \({\mathbf{F}}_{{\mathbf{V}}}\) along the channel dimension. Note that we adopt 3D Encoder/A when applying '+' and 3D Encoder/B when applying 'concat.'
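The two fusion schemes can be sketched as follows under the feature shapes used in this paper; anything beyond the plain addition and concatenation stated in the text (in particular, the mapping of the concatenated vector back to the decoder's \(4^{3} \times 128\) input shape) is our assumption.

```python
# Sketch of the '+' and 'concat' fusion schemes.
import tensorflow as tf

F_I   = tf.keras.Input(shape=(4, 4, 4, 128))   # latent image feature (Encoder/A path)
F_V_a = tf.keras.Input(shape=(4, 4, 4, 128))   # voxel feature from 3D Encoder/A
z     = tf.keras.Input(shape=(1024,))          # image feature vector
F_V_b = tf.keras.Input(shape=(1024,))          # voxel feature from 3D Encoder/B

# '+' fusion: element-wise addition of the two voxel-shaped features.
fused_add = tf.keras.layers.Add()([F_I, F_V_a])                    # (None, 4, 4, 4, 128)

# 'concat' fusion: concatenate the two vectors along the channel, then map
# them back to the decoder's latent shape (this mapping is assumed).
fused_cat = tf.keras.layers.Concatenate(axis=-1)([z, F_V_b])       # (None, 2048)
fused_cat = tf.keras.layers.Dense(4 * 4 * 4 * 128, activation="relu")(fused_cat)
fused_cat = tf.keras.layers.Reshape((4, 4, 4, 128))(fused_cat)
```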

3.3 Reconstruction loss

The reconstruction loss is crucial for network training. Suppose Y denotes the ground truth, y the corresponding prediction, \(Y_{i}\) and \(y_{i}\) the i-th ground-truth and predicted voxels, and N the number of voxels in the predicted volume. Two reconstruction losses are introduced as follows.

3.3.1 Cross-entropy (CE) loss

The standard cross-entropy loss is widely used as the loss function in previous works on 3D volume reconstruction [7,8,9, 11]. It is calculated as follows:

$$ l_{{{\text{CE}}}} (Y,y) = - \frac{1}{N}\sum\limits_{i = 1}^{N} {[Y_{i} \log (y_{i} ) + (1 - Y_{i} )\log (1 - y_{i} )]} . $$
(1)

3.3.2 Dice loss

The Dice loss can better optimize IoU [21] and alleviates the problem of highly unbalanced voxel occupancy [12, 22]; it is defined as follows:

$$ l_{{{\text{Dice}}}} (Y,y) = 1 - \frac{{\sum\nolimits_{i = 1}^{N} {Y_{i} y_{i} } }}{{\sum\nolimits_{i = 1}^{N} {(Y_{i} + y_{i} )} }} - \frac{{\sum\nolimits_{i = 1}^{N} {(1 - Y_{i} )(1 - y_{i} )} }}{{\sum\nolimits_{i = 1}^{N} {(2 - Y_{i} - y_{i} )} }}. $$
(2)

In general, the smaller the value of the loss function, the closer the prediction is to the ground truth. Based on comparative experiments, we choose the Dice loss [22] as our reconstruction loss to optimize the baseline and the IV refiner step by step.
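For reference, the two reconstruction losses of Eqs. (1) and (2) can be written directly in TensorFlow 2 as below; the small epsilons for numerical stability are our own addition.

```python
# Reconstruction losses of Eqs. (1) and (2), transcribed from the formulas above.
import tensorflow as tf

def ce_loss(Y, y, eps=1e-7):
    """Mean binary cross-entropy over the N voxels of the predicted volume."""
    y = tf.clip_by_value(y, eps, 1.0 - eps)
    return -tf.reduce_mean(Y * tf.math.log(y) + (1.0 - Y) * tf.math.log(1.0 - y))

def dice_loss(Y, y, eps=1e-7):
    """Dice loss of Eq. (2): accounts for both occupied and empty voxels."""
    inter_fg = tf.reduce_sum(Y * y)
    union_fg = tf.reduce_sum(Y + y)
    inter_bg = tf.reduce_sum((1.0 - Y) * (1.0 - y))
    union_bg = tf.reduce_sum(2.0 - Y - y)
    return 1.0 - inter_fg / (union_fg + eps) - inter_bg / (union_bg + eps)

Y = tf.cast(tf.random.uniform((2, 32, 32, 32)) > 0.5, tf.float32)  # toy ground truth
y = tf.random.uniform((2, 32, 32, 32))                             # toy prediction
print(ce_loss(Y, y).numpy(), dice_loss(Y, y).numpy())
```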

4 Experiments

IV-Net is implemented in TensorFlow 2.0 and trained on a machine with an Intel Core i9-10920X CPU @ 3.50 GHz and a GeForce RTX 3060 GPU, with a batch size of 24 and the Adam optimizer [23]. In this section, we report experimental evaluations on two public datasets, ShapeNet [24] and Pix3D [25]. For both training and testing, the output 3D reconstructions have size \(32^{3}\). On the training set, we first train the baseline module for 60 epochs; after freezing the pre-trained baseline, the IV refiner is trained for 40 epochs. We adopt Intersection over Union (IoU) and F-Score as similarity evaluation metrics [26].
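The two-stage optimization schedule can be sketched as follows; the tiny stand-in models, the random toy data, and the built-in binary cross-entropy loss only illustrate the freezing and training order and are not the actual IV-Net architecture, loss, or released code.

```python
# Two-stage training: pre-train the baseline, freeze it, then train the refiner.
import tensorflow as tf

# Stand-in "baseline": image -> coarse 32^3 volume (real model is Sect. 3.1).
baseline = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 5, strides=4, activation="relu",
                           input_shape=(127, 127, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(32 * 32 * 32, activation="sigmoid"),
    tf.keras.layers.Reshape((32, 32, 32)),
])
# Stand-in "IV refiner": coarse volume -> refined volume (real model is Sect. 3.2).
refiner = tf.keras.Sequential([
    tf.keras.layers.Reshape((32, 32, 32, 1), input_shape=(32, 32, 32)),
    tf.keras.layers.Conv3D(1, 3, padding="same", activation="sigmoid"),
    tf.keras.layers.Reshape((32, 32, 32)),
])

images = tf.random.uniform((4, 127, 127, 3))                        # toy batch
volumes = tf.cast(tf.random.uniform((4, 32, 32, 32)) > 0.5, tf.float32)

# Stage 1: train the baseline (60 epochs in the paper; 1 here for brevity).
baseline.compile(optimizer="adam", loss="binary_crossentropy")
baseline.fit(images, volumes, batch_size=4, epochs=1)

# Stage 2: freeze the pre-trained baseline and train only the IV refiner
# (40 epochs in the paper).
baseline.trainable = False
inp = tf.keras.Input(shape=(127, 127, 3))
iv_net = tf.keras.Model(inp, refiner(baseline(inp, training=False)))
iv_net.compile(optimizer="adam", loss="binary_crossentropy")
iv_net.fit(images, volumes, batch_size=4, epochs=1)
```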

4.1 Datasets

4.1.1 ShapeNet

As a large 3D object dataset, ShapeNet [24] contains 55 categories and 51,300 3D models. Following [7,8,9, 13,14,15,16,17, 29, 30], we use a subset of ShapeNet (i.e., ShapeNet13), which includes about 44K models voxelized at a resolution of \(32^{3}\). For ShapeNet13, 3D-R2N2 [7] rendered 24 images of size \(137 \times 137\) for each model from 24 different viewpoints. Since our baseline module expects inputs of size \(127 \times 127\), we resize the single-view images from \(137 \times 137\) to \(127 \times 127\) in our experiments.

4.1.2 Pix3D

Different from the synthetic ShapeNet dataset [24], Pix3D [25] aligns 3D models with real-world 2D images. Its largest category is chairs, which consists of 3839 real-world images and the corresponding objects. Following convention, Pix3D is used only to evaluate the proposed methods on real-world images [9, 25], so we also test our method on Pix3D-Chairs only.

4.2 Evaluation metric

For the proposed networks, IoU is applied as a similarity metric to evaluate their reconstruction quality, and the IoU score is calculated as follows:

$$ {\text{IoU}} = \frac{{\sum\limits_{i = 1}^{N} {I(y_{i} > t)I(Y_{i} > 0)} }}{{\sum\limits_{i = 1}^{N} {I[(I(y_{i} > t) + I(Y_{i} > 0)) > 0]} }}, $$
(3)

where \(I( \cdot )\) is an indicator function that equals 1 when its condition is satisfied and 0 otherwise, \(Y_{i}\) and \(y_{i}\) are the i-th ground-truth voxel and predicted value, t is the voxelization threshold, set [9] to a fixed value of 0.3 in our experiments, and N denotes the total number of voxels in the predicted volume.
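The following sketch transcribes Eq. (3) with threshold t = 0.3 into code, using toy volumes for illustration.

```python
# IoU of Eq. (3): threshold the prediction, binarize the ground truth, then
# divide the intersection by the union of occupied voxels.
import tensorflow as tf

def voxel_iou(Y, y, t=0.3):
    pred = tf.cast(y > t, tf.float32)          # I(y_i > t)
    gt = tf.cast(Y > 0, tf.float32)            # I(Y_i > 0)
    intersection = tf.reduce_sum(pred * gt)
    union = tf.reduce_sum(tf.cast((pred + gt) > 0, tf.float32))
    return intersection / union

Y = tf.cast(tf.random.uniform((32, 32, 32)) > 0.5, tf.float32)  # toy ground truth
y = tf.random.uniform((32, 32, 32))                             # toy prediction
print(voxel_iou(Y, y).numpy())
```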

F-Score is used as an additional metric to evaluate reconstruction quality and is defined [27] as follows:

$$ {\text{F-Score}}(d) = \frac{2P(d)R(d)}{{P(d) + R(d)}}, $$
(4)

where d is the distance threshold, set [9] to 1%, and P(d) and R(d) are the precision and recall, computed as follows:

$$ P(d) = \frac{1}{{N_{R} }}\sum\limits_{r \in R} {\left[ {\mathop {\min }\limits_{g \in G} \left\| {g - r} \right\| < d} \right]} , $$
(5)
$$ R(d) = \frac{1}{{N_{G} }}\sum\limits_{g \in G} {\left[ {\mathop {\min }\limits_{r \in R} \left\| {g - r} \right\| < d} \right]} , $$
(6)

where R and G denote the predicted and ground-truth point clouds, and \(N_{R}\) and \(N_{G}\) denote the numbers of points in R and G. For voxel-based reconstruction methods, we first generate a mesh of the 3D surface from the voxels by applying the marching cubes algorithm [28] and then sample 8192 points [9] from the mesh to obtain the corresponding point cloud.
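A straightforward (not memory-optimized) NumPy sketch of Eqs. (4)–(6) on sampled point clouds is shown below; marching cubes and mesh sampling are omitted, and the toy point counts are smaller than the 8192 points used in the paper.

```python
# F-Score at distance threshold d between predicted (R) and ground-truth (G)
# point clouds, following Eqs. (4)-(6).
import numpy as np

def f_score(pred_pts, gt_pts, d=0.01):
    # Pairwise Euclidean distances between the two point sets.
    dists = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    precision = np.mean(dists.min(axis=1) < d)   # each predicted point to its nearest GT point
    recall = np.mean(dists.min(axis=0) < d)      # each GT point to its nearest predicted point
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

R = np.random.rand(1024, 3)    # toy predicted points (the paper samples 8192)
G = np.random.rand(1024, 3)    # toy ground-truth points
print(f_score(R, G))
```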

4.3 Ablation study

In this section, IV-Net is ablated on the ShapeNet dataset [24] with respect to the simplified LFP, the loss functions, the MSC block, and the 3D encoders of the IV refiner:

  • Set up 1 To validate the rationality of simplifying the LFP, we train AttSets [8] with the original or simplified LFP using the standard CE loss. Table 1 shows that simplifying the LFP maintains its learning effect. We denote AttSets with the simplified LFP as AttSets/S.

  • Set up 2 To compare which loss benefits our method, AttSets/S is also trained with the Dice loss [22]. Table 1 indicates that replacing the CE loss [7] with the Dice loss improves the performance from 0.642 to 0.655. Therefore, we adopt the Dice loss as our reconstruction loss.

  • Set up 3 The fixed \(3 \times 3\) convolutions may limit the ability to extract image features, so we replace some of them with MSC blocks. The performance increases from 0.651 to 0.654 after using MSC blocks. AttSets/S with MSC blocks, trained with the reconstruction loss \(l_{{{\text{Dice}}}}\), is defined as our baseline.

  • Set up 4 A key issue of IV-Net is how to fuse the image feature with the feature of the recovered volume. We compare the two most common methods, '+' and 'concat,' in Table 1, adopting 3D Encoder/A and /B for '+' and 'concat,' respectively. Table 1 shows that the IV refiner with '+' performs better than with 'concat,' and that adding the IV refiner improves the performance from 0.658 to 0.681 or 0.680. Table 2 indicates that '+' has fewer parameters than 'concat.' Hence, the IV refiner uses '+' to fuse features and chooses 3D Encoder/A as its shape encoder.

Table 1 The effect of the Dice loss, MSC block, and IV refiner in our proposed network in terms of IoU and F-Scores
Table 2 Comparisons of parameter size of the two methods ‘concat’ and ‘+’ to fuse image feature and corresponding voxel feature

Moreover, Fig. 5 gives visual comparisons of AttSets/S with CE loss, the baseline, and IV-Net ('+'). It shows that our baseline performs better than AttSets/S with CE loss and that IV-Net outperforms the baseline, which validates the effect of the Dice loss, the MSC block, and the IV refiner with 3D Encoder/A.

Fig. 5
figure 5

Visual comparisons of AttSets/S, baseline and IV-Net

4.4 Evaluation on the ShapeNet dataset

On the synthetic ShapeNet dataset [24], we split ShapeNet13 into two sets, with 4/5 for training and the remainder for testing, the same as [7, 8]. IV-Net is compared with several state-of-the-art methods, including 3D-R2N2 [7], OGN [13], Matryoshka [14], AtlasNet [29], Pixel2Mesh [15], OccNet [30], IM-Net [31], AttSets [8], and Pix2Vox++ [9]; the IoU scores and F-Scores of these methods are listed in Tables 3 and 4, respectively, where the overall IoU/F-Score is the mean over all 13 categories. IV-Net outperforms these methods in overall IoU and F-Score. Additionally, IV-Net achieves the best IoU in 5 of the 13 categories and the best F-Score in 4 of the 13 categories.

Table 3 IoU results of several reconstruction approaches on ShapeNet13. For each category, the best IoU score is highlighted in bold
Table 4 F-Score results of several reconstruction methods on ShapeNet13. For each category, the best F-Score is highlighted in bold

Meanwhile, Fig. 6 visually compares IV-Net with two voxel-based approaches, AttSets [8] and Pix2Vox++ [9], and shows that IV-Net reconstructs visually cleaner and more accurate volumes in some categories, for instance the legs of chairs, the tails and wings of airplanes, and small details of rifles and lamps.

Fig. 6
figure 6

Visual examples of single-view 3D reconstruction on ShapeNet13

4.5 Evaluation on the Pix3D dataset

On the real-world Pix3D dataset [25], following [9, 25], we use Pix3D-Chairs as the test set to evaluate methods on real-world images. To account for the complex backgrounds of real-world images, we first generate 60 images for each chair of ShapeNet-Chairs using Render for CNN [32], adding random backgrounds [9, 25] sampled from the SUN dataset [33]. These generated images form the training set, i.e., ShapeNet-Chairs-RfS. IV-Net is compared with 3D-R2N2 [7], Pix3D [25], and Pix2Vox++ [9]. The IoUs and F-Scores on Pix3D-Chairs are shown in Table 5, and the results indicate that IV-Net performs better than these methods. Figure 7 gives visual comparisons on Pix3D-Chairs among our baseline, Pix2Vox++, and IV-Net. By adding spatial features, IV-Net reconstructs the details of objects, such as legs and handles, better than the baseline and Pix2Vox++.

Table 5 IoU and F-Score results of several reconstruction approaches on Pix3D-Chairs
Fig. 7
figure 7

Examples of single-view 3D reconstruction on Pix3D-Chairs

4.6 Computational complexity

Table 6 compares the parameter size and inference time of IV-Net with those of several state-of-the-art methods. The values in Table 6 are taken from Pix2Vox++ [9], and we follow its measurement scheme for our method.

Table 6 Parameter size and inference time comparisons among IV-Net and other state-of-the-art methods

5 Conclusions

In this paper, we propose IV-Net, a novel framework for single-view 3D reconstruction with broad and adaptable application prospects. We design a multi-scale convolutional block to enhance the ability of the 2D encoder and construct two versions of the 3D encoder to efficiently extract the voxel feature. By fusing the image feature with the feature of the recovered volume, the IV refiner raises the accuracy of the reconstructed volumes and recovers the detailed structures of 3D shapes. In both quantitative and qualitative evaluations, our network outperforms state-of-the-art methods in 3D reconstruction and has fewer parameters than most of them. However, our method does not obtain the optimal results in some categories of ShapeNet13, which we will address in future work.