1 Introduction

Recent advances in Minimally Invasive Surgery (MIS) bring many advantages to patients, including reduced access trauma, less bleeding and shorter hospitalization. However, they also impose challenges on intra-operative navigation, where acquiring 3D data in real-time is difficult and most clinical practices rely on 2D projections or images from fluoroscopy, cross-sectional Magnetic Resonance Imaging (MRI) and ultrasound. Resolving complex 3D geometries from these 2D images is difficult, so there is a pressing need for techniques that reconstruct 3D structures intra-operatively, in real-time, from limited or even a single 2D projection or image.

For example, pre-operative 3D context from MRI or Computed Tomography (CT) has been registered to intra-operative 2D ultrasound images with both spatial and temporal alignment, facilitating intra-operative navigation for cardiac MIS [9]. Pre-operative 3D meshes from CT have been adapted to intra-operative 2D X-ray images with an as-rigid-as-possible method, acting as an arterial road map for Endovascular Aneurysm Repair [14]. The 3D shape of a stent-graft at three different states: fully-compressed [17], partially-deployed [16] and fully-deployed [18], has been instantiated from a single 2D fluoroscopic projection with stent-graft modelling, graft gap interpolation, the Robust Perspective-n-Point method, Graph Convolutional Networks (GCN) and mesh manipulation, improving navigation in Fenestrated Endovascular Aneurysm Repair. A review of bony structure reconstruction from multi-view X-ray images can be found in [7].

Fig. 1. An illustration of the evolution of 3D shape instantiation, from the two-stage KPLSR-based approach [20] and the deep learning-based PointOutNet [19] to the Instantiation-Net, which reconstructs a 3D mesh from a 2D image in an end-to-end fashion.

Recently, a general and registration-free framework for 3D shape instantiation was proposed [20] with three steps: 1) 3D volumes of the target were scanned pre-operatively at different time frames of the deformation cycle; 3D meshes were segmented and expressed as 3D Statistical Shape Models (SSMs), and Sparse Principal Component Analysis (SPCA) was used to analyze the 3D SSMs and determine the most informative scan plane; 2) 2D images were scanned synchronously at the determined optimal scan plane; 2D contours were segmented and expressed as 2D SSMs, and Kernel Partial Least Squares Regression (KPLSR) was applied to learn the relationship between the 2D and 3D SSMs; 3) the KPLSR-learned relationship was applied to the intra-operative 2D SSMs to reconstruct the instantaneous 3D SSMs for navigation. Two deficiencies exist: 1) manual segmentation is essential; 2) two hyper-parameters, the Gaussian width and the component number, need to be tuned carefully and manually. To avoid these drawbacks, a one-stage and fully automatic Deep Convolutional Neural Network (DCNN), PointOutNet, was proposed to reconstruct the 3D point cloud of a target from its single 2D projection with the Chamfer loss [19]. However, a 3D mesh, which captures more surface detail, is more helpful than a point cloud.

In this paper, we propose an Instantiation-Net to reconstruct the 3D mesh of a target from its single 2D projection. DenseNet-121 is used to extract abundant features from the 2D image input, a Graph Convolutional Network (GCN) is used to reconstruct the 3D mesh, and Fully Connected (FC) layers connect the two. Figure 1 illustrates how the framework for 3D shape instantiation has evolved from the two-stage KPLSR-based framework [20] and the PointOutNet [19] to the Instantiation-Net proposed in this paper. 27 Right Ventricles (RVs), corresponding to 609 experiments, were used for validation. An average 3D distance error of around 2 mm was achieved, which is comparable to the performance in [20] but with end-to-end and fully automatic training.

2 Methodology

The input of the Instantiation-Net is a single 2D image I with a size of \(192 \times 256\), while the output is a 3D mesh \(\mathscr {F}\) with vertices V and connectivity A. The proposed Instantiation-Net consists of three parts: a DCNN, FC layers and a GCN.

2.1 DCNN

For an image input \(I_\mathrm{N\times \mathrm H\times \mathrm W \times \mathrm C}\), where \(\mathrm N\) is the batch size and is fixed at 1 in this paper, \(\mathrm H\) is the image height, \(\mathrm W\) is the image width, and \(\mathrm C\) is the image channel and is 1 for medical images, the first part of the Instantiation-Net is DenseNet-121 [8], consisting of multiple convolutional layers, batch normalization layers, average-pooling layers and ReLU layersFootnote 1, which extracts abundant features from the single 2D image input. Detailed layer configurations are shown in Fig. 2.
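As an illustrative sketch of this feature-extraction part, assuming the off-the-shelf Keras DenseNet-121 rather than the exact layer configuration in Fig. 2 (variable names are ours):

```python
# Minimal sketch of the DCNN part, assuming the standard Keras DenseNet-121.
import tensorflow as tf
from tensorflow.keras.applications import DenseNet121

# Single grey-scale MRI slice (N=1, H=192, W=256, C=1), tiled to 3 channels so
# that the ImageNet-pretrained weights can be reused (see Sect. 2.3).
image = tf.keras.Input(shape=(192, 256, 1))
tiled = tf.keras.layers.Concatenate(axis=-1)([image, image, image])

backbone = DenseNet121(include_top=False, weights='imagenet',
                       input_shape=(192, 256, 3), pooling='avg')
features = backbone(tiled)                      # (N, 1024) global feature vector
encoder = tf.keras.Model(image, features)
```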

Fig. 2. Detailed layer configurations of the proposed Instantiation-Net.

2.2 GCN

For a 3D mesh \(\mathscr {F}\) with vertices \(\mathbf {V}_\mathrm{M \times 3}\) and connectivity \(\mathbf {A}_\mathrm{M \times M}\), where \(\mathrm M\) is the number of vertices in the mesh, \(\mathbf {A}_\mathrm{M \times M}\) is the adjacency matrix with \(\mathbf {A}_{ij} = 1\) if the ith and jth vertices are connected by an edge and \(\mathbf {A}_{ij} = 0\) otherwise. The non-normalized graph Laplacian matrix is calculated as \(\mathbf {L}=\mathbf {D}-\mathbf {A}\), where \(\mathbf {D}\) is the vertex degree matrix with \(\mathbf {D}_{ii} = \sum _{j=1}^\mathrm{M} \mathbf {A}_{ij}\) and \(\mathbf {D}_{ij}=0\) if \(i\ne j\). To apply the Fourier transform to the mesh vertices, \(\mathbf {L}\) is decomposed into the Fourier basis as \(\mathbf {L}=\mathbf {U} \varLambda \mathbf {U}^T\), where \(\mathbf {U}\) is the matrix of eigenvectors and \(\varLambda \) is the diagonal matrix of eigenvalues. The Fourier transform of the vertices v is then formulated as \(v_w = \mathbf {U}^Tv\), while the inverse Fourier transform is \(v = \mathbf {U}v_w\). The convolution in the spatial domain of the vertices v and a kernel s can be computed via the spectral domain as \(v*s = \mathbf {U}((\mathbf {U}^Tv)\odot (\mathbf {U}^Ts))\). However, this computation is very expensive, as it involves dense matrix multiplications with \(\mathbf {U}\). Hence a Chebyshev polynomial expansion is used to reformulate the computation with a kernel \(g_{\theta }\):

$$\begin{aligned} g_{\theta }(\mathbf {L})=\sum _{k=0}^{K-1}\theta _{k}T_{k}(\tilde{\mathbf {L}}) \end{aligned}$$
(1)

where \(\tilde{\mathbf {L}} = 2\mathbf {L}/\lambda _{max}-\mathbf {I}_\mathrm{M}\) is the scaled Laplacian, \(\lambda _{max}\) is the largest eigenvalue of \(\mathbf {L}\), and \(\theta _k\) are the Chebyshev coefficients. \(T_{k}\) is the Chebyshev polynomial of order k and is calculated recursively as [2]:

$$\begin{aligned} \mathbf {T}_{k}(\tilde{\mathbf {L}}) = 2\tilde{\mathbf {L}}\mathbf {T}_{k-1}(\tilde{\mathbf {L}})-\mathbf {T}_{k-2}(\tilde{\mathbf {L}}) \end{aligned}$$
(2)

where \(\mathbf {T}_0=\mathbf {I}\), \(\mathbf {T}_1=\tilde{\mathbf {L}}\). The spectral convolution is then defined as:

$$\begin{aligned} y_j = v*s = \sum _{i=1}^{F_{in}}g_{\theta _{i,j}}(\mathbf {L})v_i \end{aligned}$$
(3)

where \(v_i\) is the ith feature channel of the input \(\mathbf {V}\), \(F_{in}\) is the feature channel number of the input, \(j\in (1, F_{out})\), and \(F_{out}\) is the feature channel number of the output Y. Each convolutional layer has \(F_{in}\times F_{out} \times K\) trainable parameters.
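For concreteness, a minimal NumPy sketch of the Chebyshev spectral convolution of Eqs. (1)-(3) is given below; the function and variable names are illustrative and not those of the actual implementation.

```python
import numpy as np

def cheb_conv(L, V, theta):
    """L: (M, M) graph Laplacian, V: (M, F_in) vertex features,
    theta: (K, F_in, F_out) Chebyshev coefficients."""
    M = L.shape[0]
    lam_max = np.linalg.eigvalsh(L).max()         # largest eigenvalue of L
    L_tilde = 2.0 * L / lam_max - np.eye(M)       # scaled Laplacian
    T_prev, T_curr = np.eye(M), L_tilde           # T_0 and T_1 of Eq. (2)
    Y = V @ theta[0]                              # k = 0 term: T_0 V = V
    if len(theta) > 1:
        Y = Y + (T_curr @ V) @ theta[1]           # k = 1 term
    for k in range(2, len(theta)):
        T_prev, T_curr = T_curr, 2.0 * L_tilde @ T_curr - T_prev
        Y = Y + (T_curr @ V) @ theta[k]           # higher-order terms
    return Y                                      # (M, F_out) output features
```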

Besides graph convolutional layers, up-sampling layers are also applied to learn hierarchical mesh structures. First, the mesh \(\mathscr {F}\) is down-sampled, or simplified, to a mesh with \(\mathrm M//S\) vertices, where the stride \(\mathrm S\) is set to 4 or 3 in this paper. Several mesh simplification algorithms can be used at this stage, such as Quadric Error Metrics [13] and weighted Quadric Error Metrics Simplification (QEMS) [5]. The connectivity of the simplified meshes is recorded and used to calculate \(\mathbf {L}\) for the graph convolution at each resolution. The vertices discarded during mesh simplification are projected back onto the nearest triangle, with the projected position computed in barycentric coordinates; this projection defines the up-sampling, as sketched below. More details regarding the down-sampling, up-sampling and graph convolutional layers can be found in [13].
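A minimal sketch of this up-sampling step, assuming the barycentric projections have been precomputed offline as in [13]; the helper names are hypothetical:

```python
import numpy as np
import scipy.sparse as sp

def build_upsampling_matrix(n_fine, n_coarse, tri_idx, bary_w):
    """tri_idx: (n_fine, 3) indices of the nearest coarse-mesh triangle for each
    fine-mesh vertex; bary_w: (n_fine, 3) barycentric weights of its projection.
    A vertex kept during simplification simply has weight 1 on itself."""
    rows = np.repeat(np.arange(n_fine), 3)
    cols = tri_idx.reshape(-1)
    vals = bary_w.reshape(-1)
    return sp.csr_matrix((vals, (rows, cols)), shape=(n_fine, n_coarse))

def upsample(U, V_coarse):
    """V_coarse: (n_coarse, F) vertex features on the simplified mesh."""
    return U @ V_coarse                           # (n_fine, F) features on the full mesh
```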

2.3 Instantiation-Net

For the DCNN part, DenseNet-121 [8] is imported from Keras, with parameters pre-trained on ImageNet [3]. For the FC part, two FC layers with an output feature dimension of 8 are used. For the GCN part, four up-sampling and graph convolutional layers are adopted [13]. Detailed configurations of each layer are shown in Fig. 2. An intuitive illustration of the Instantiation-Net, with multiple layers compacted into blocks, is shown in Fig. 3. The input is generated by tiling the 2D MRI image three times along the channel dimension. A 3D mesh can be reconstructed directly from the single 2D image input by the proposed Instantiation-Net in a fully automatic and end-to-end fashion.
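The following is a rough, runnable sketch of how the three parts are wired together in Keras. The GCN decoder is replaced here by a per-vertex dense layer purely so that the sketch runs; the actual network uses four up-sampling and Chebyshev graph convolutional blocks (Sect. 2.2), and the FC sizes and the coarse-mesh dimensions are illustrative rather than those of Fig. 2.

```python
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import DenseNet121

def build_instantiation_net(n_coarse=64, n_features=16):
    image = tf.keras.Input(shape=(192, 256, 1))              # single 2D MRI slice
    x = layers.Concatenate(axis=-1)([image, image, image])   # tile to 3 channels
    x = DenseNet121(include_top=False, weights='imagenet',
                    input_shape=(192, 256, 3), pooling='avg')(x)
    x = layers.Dense(8, activation='relu')(x)                # FC part (feature dim 8)
    x = layers.Dense(n_coarse * n_features)(x)
    x = layers.Reshape((n_coarse, n_features))(x)            # coarsest-mesh features
    # GCN part would go here: 4 x (mesh up-sampling + Chebyshev graph convolution)
    vertices = layers.Dense(3)(x)                            # stand-in producing (x, y, z)
    return tf.keras.Model(image, vertices)
```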

Fig. 3. An intuitive illustration of the proposed Instantiation-Net, with multiple layers compacted into blocks.

2.4 Experimental Setup

The data used are the same as in [19, 20]. Following [19, 20], the Instantiation-Net was trained patient-specifically with leave-one-frame-out cross-validation: one time frame of the patient was used as the test set, while all other time frames were used as the training set. Stochastic Gradient Descent (SGD) was used as the optimizer with a momentum of 0.9, and each experiment was trained for up to 1200 epochs. The initial learning rate was \(5e^{-3}\) and decayed by a factor of 0.97 every \(5\times \mathrm M\) iterations, where \(\mathrm M\) is the number of time frames for each patient. The kernel size K of the GCN was 3. For most experiments, the feature channel number and stride of the GCN were 64 and 4 respectively, while some experiments used 16 and 3 instead. The proposed framework was implemented with TensorFlow and Keras. The L1 loss was used as the loss function, because the L2 loss experienced convergence difficulty in our experiments. The average of the 3D distance errors over all vertices is used as the evaluation metric. A sketch of this training configuration is given below.
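A minimal sketch of the training configuration, assuming a Keras model `net` (e.g., the hypothetical build_instantiation_net sketched in Sect. 2.3) that outputs an \(\mathrm M \times 3\) vertex array, and arrays `train_images` and `train_vertices` taken from the leave-one-frame-out split:

```python
import tensorflow as tf

M_frames = 25                                    # number of time frames of the patient
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=5e-3,
    decay_steps=5 * M_frames,                    # decay every 5 x M iterations
    decay_rate=0.97,
    staircase=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)

def l1_loss(v_true, v_pred):                     # L1 loss over all vertex coordinates
    return tf.reduce_mean(tf.abs(v_true - v_pred))

def mean_3d_distance_error(v_true, v_pred):      # evaluation metric: mean per-vertex
    return tf.reduce_mean(tf.norm(v_true - v_pred, axis=-1))   # Euclidean distance (mm)

net.compile(optimizer=optimizer, loss=l1_loss, metrics=[mean_3d_distance_error])
net.fit(train_images, train_vertices, batch_size=1, epochs=1200)
```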

Fig. 4. The 3D distance error of each vertex of four randomly-selected meshes. The color bar is in units of mm.

3 Results

To demonstrate the stability and robustness of the proposed Instantiation-Net across the vertices within a mesh and across the time frames within a patient, the 3D distance errors for each vertex of four meshes and for each time frame of 12 patients are shown in Sect. 3.1 and Sect. 3.2 respectively. To validate the performance of the proposed Instantiation-Net, the PLSR-based and KPLSR-based 3D shape instantiation in [20] are adopted as the baselines in Sect. 3.3.

3.1 3D Distance Error for a Mesh

Four reconstructed meshes were selected randomly; the 3D distance error of each vertex is shown in color in Fig. 4. It can be observed that the error is distributed evenly over the vertices and does not concentrate or cluster in one specific area. Higher errors appear at the top of the RV, which is expected, as the ground-truth meshes have fewer vertices at the top of the RV than in other areas.

3.2 3D Distance Error for a Patient

Figure 5 illustrates the 3D distance errors of each time frame of 12 randomly selected subjects. We can see that, for most time frames, the 3D distance errors are around 2 mm. Higher errors appear at some time frames, e.g., time frames 1 and 25 of subject 7, time frame 11 of subject 9, time frame 9 of subject 5, time frames 18 and 20 of subject 15, and time frame 13 of subject 26. This phenomenon, known as the boundary effect, was also observed in [19, 20]. At systole or diastole of the cardiac cycle, the heart reaches its smallest or largest size, resulting in extreme 3D meshes compared with other time frames. In the cross-validation, if such an extreme time frame is not seen in the training data but is tested, the prediction accuracy will be lower.

3.3 Comparison to Other Methods

Figure 6 compares the reconstruction performance of the proposed Instantiation-Net with the PLSR- and KPLSR-based 3D shape instantiation methods on the 27 subjects, evaluated by the mean of the 3D distance errors across all time frames. We can see that the proposed Instantiation-Net outperforms PLSR-based 3D shape instantiation while slightly under-performing KPLSR-based 3D shape instantiation for most patients. The overall mean 3D distance errors of the meshes generated by the Instantiation-Net, PLSR-based and KPLSR-based 3D shape instantiation are 2.21 mm, 2.38 mm and 2.01 mm respectively. In addition, the performance of the proposed Instantiation-Net is robust across patients; no obvious outliers are observed.

Fig. 5. The mean 3D distance errors of each time frame of 12 randomly selected patients.

Fig. 6. The mean 3D distance errors of the meshes of 27 subjects generated by the proposed Instantiation-Net, PLSR- and KPLSR-based 3D shape instantiation.

All experiments were performed with an Intel Xeon® E5-1650 v4 CPU and an Nvidia Titan Xp GPU. The GPU memory consumption was around 11 GB, which is larger than the 4 GB consumed by the PointOutNet in [19], while the PLSR-based and KPLSR-based methods in [20] were trained on a CPU. The training time was around 1 h for one time frame, which is longer than the 30 min of the PointOutNet in [19], while the PLSR-based and KPLSR-based methods in [20] took a few minutes. However, the inference of the end-to-end Instantiation-Net only takes about 0.5 s to generate a 3D mesh automatically, while KPLSR-based 3D shape instantiation needs manual segmentation.

4 Discussion

Due to the limited coverage of the MRI images at the atrioventricular ring, fewer vertices and sparser connectivity exist at the top of the 3D RV mesh, resulting in a higher error in this area, as shown in the right example in Fig. 4. In practical applications, the training data will cover all time frames pre-operatively, which can eliminate the boundary effect discussed in Sect. 3.2.

A DCNN has a powerful ability to extract features from images, while a GCN has a powerful ability to deform meshes with both vertex deformation and connectivity maintenance. This paper integrates these two strong networks to achieve 3D mesh reconstruction from a single 2D image, crossing data modalities. To the best of the authors' knowledge, this is one of the few pioneering works that achieve direct 3D mesh reconstruction from 2D images with end-to-end training. In medical computer vision, this is the first work that achieves 3D mesh reconstruction from a single 2D image in an end-to-end and fully automatic fashion.

Apart from the baselines in this paper, other works address similar tasks, e.g., [1, 4, 6, 11, 15, 19]. However, a 3D occupancy grid is reconstructed in [1], a point cloud is reconstructed in [4, 11, 19] and a 3D volume is reconstructed in [6, 15]. 3D occupancy grids, point clouds and 3D volumes are different 3D data modalities from the 3D mesh reconstructed in this paper, hence it is difficult to conduct a fair comparison with them. In addition, two orthogonal X-rays are needed for the 3D volume reconstruction in [15], which is not applicable to the single-image input used in this paper.

One potential drawback of the proposed Instantiation-Net is that it requires more GPU memory and a longer training time than the PointOutNet in [19] and the PLSR-based and KPLSR-based 3D shape instantiation in [20]; however, the inference is quick and fully automatic.

5 Conclusion

In this paper, an end-to-end framework, called Instantiation-Net, was proposed to instantiate the 3D mesh of the RV from a single 2D MRI image. A DCNN is used to extract the feature map from the 2D image, which is connected via FC layers to a GCN-based 3D mesh reconstruction part. The results on 609 experiments showed that the proposed network achieves higher accuracy than the PLSR-based and slightly lower accuracy than the KPLSR-based 3D shape instantiation in [20]. These results show that one-stage shape instantiation directly from a 2D image to a 3D mesh can be achieved by the proposed Instantiation-Net, with performance comparable to the two baseline methods.

We believe that the combination of DCNN and GCN will be very useful in the medical area, as it bridges the gap between the image and mesh modalities. In the future, we will extend the proposed Instantiation-Net to broader applications, e.g., reconstructing 3D meshes directly from 3D volumes.