1 Introduction

Polygon meshes have seen great advances in the medical imaging community, propelled by modern graphics processing units (GPUs) that are optimized for mesh rasterization. These advances make polygon meshes easy to render, and the adoption of powerful modern rendering engines, such as Unity, enables efficient visualization. Although previous works such as [3] enable direct rendering of the polygonised isosurface of binary volumes, the ray casting technique they use is much more computationally expensive than mesh rasterization. The polygon mesh is a graph-based representation that uses vertices and their connecting edges to model the isosurfaces of objects in 3D space. The graph-based data structure allows arbitrary vertex placement in continuous 3D space, so the isosurface can be stored at varying levels of detail, resulting in a highly compact data structure. Moreover, the shape of the mesh can be easily deformed by displacing the vertices, while the spatial topology is preserved by the edges that connect the vertex pairs, making it ideal for medical simulations such as cardiac cycles [4].

Fig. 1. A close-up visual comparison of the mesh surface extracted using (a) MC, (b) MC with the TwoStep Smooth filter from Voreen [1], and (c) Voxel2Mesh [2]. The contour of (c) is slightly different from (a) and (b) due to the limited deformation ability.

However, generating meshes of segmented isosurfaces from medical images is a complex task. The pipeline consists of segmenting the regions of interest (ROIs) and extracting polygons from the volumetric segmentation data. Example ROIs include specific anatomical structures such as bony structures, e.g., the skull and ribs, and organs such as the liver. The conventional approach to generating a polygon mesh is reconstruction from 3D volumetric data acquired by imaging techniques such as computed tomography (CT) and magnetic resonance imaging (MRI). The acquired 3D volumetric data uses a dense, discrete voxelized grid of uniform precision to represent the internal spatial properties. It is subsequently segmented into a volumetric mask of the desired ROI, either by processing each slice of the input volume, as with U-Net [5], or by processing the entire input volume, as with 3D U-Net [6].

Marching Cubes (MC) [7] is the conventional method for generating 3D polygon meshes from volumetric segmentation masks. The quality of the extracted mesh is determined by the resolution of the 3D volumetric data, where the z-axis resolution is often limited by the medical imaging protocol. Meshes extracted from volumes with low z-axis resolution suffer from staircase artefacts, which degrade the visual quality. Although smoothing filters can be applied to mitigate the staircase artefacts, they often cause volume shrinkage and loss of the overall shape [8].
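The shrinkage effect mentioned above can be illustrated with a minimal sketch. Plain Laplacian smoothing stands in here for the smoothing filters discussed in the text (an assumption, not the specific filter evaluated in [8]), and the octahedron, `mesh_volume`, and `laplacian_smooth` are hypothetical names chosen for illustration.

```python
import numpy as np

def mesh_volume(verts, faces):
    """Absolute enclosed volume of a closed, consistently oriented triangle mesh."""
    # Each face and the origin span a tetrahedron with signed volume det / 6.
    return abs(np.linalg.det(verts[faces]).sum()) / 6.0

def laplacian_smooth(verts, faces, lam=0.5, iterations=1):
    """Move every vertex a fraction `lam` towards the mean of its edge neighbours."""
    neighbours = [set() for _ in range(len(verts))]
    for a, b, c in faces:
        neighbours[a].update((b, c))
        neighbours[b].update((a, c))
        neighbours[c].update((a, b))
    out = verts.astype(float).copy()
    for _ in range(iterations):
        means = np.array([out[sorted(n)].mean(axis=0) for n in neighbours])
        out = out + lam * (means - out)
    return out

# Toy closed mesh (assumption): a unit octahedron with outward-facing triangles.
verts = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                  [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)
faces = np.array([[0, 2, 4], [2, 1, 4], [1, 3, 4], [3, 0, 4],
                  [2, 0, 5], [1, 2, 5], [3, 1, 5], [0, 3, 5]])

volume_before = mesh_volume(verts, faces)
volume_after = mesh_volume(laplacian_smooth(verts, faces), faces)
```

In this toy case every vertex is pulled halfway towards the centroid, so the enclosed volume falls to one eighth of its original value after a single pass. Real filters are less aggressive, but the direction of the effect is the same.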

To eliminate the staircase artefacts and create smooth meshes while preserving the volume shape, Wickramasinghe et al. proposed Voxel2Mesh [2], a mesh deformation deep neural network for medical ROI representation. The mesh deformation approach is inspired by Pixel2Mesh [9] and its successor Pixel2Mesh++ [10], which use a graph convolutional network (GCN) to optimize the vertex displacement of an ellipsoid mesh template from a single image. In Voxel2Mesh, the authors adapted Pixel2Mesh to take 3D volumetric data as input and generate a mesh representation of the ROIs.

Staircase artefacts often occur when the MC process is used to extract meshes from low-resolution discrete volumetric data, as the low level of detail prevents the volumetric representation from capturing the smooth curvature of the surfaces. In contrast, the mesh template is a continuous representation of the shape that allows arbitrary levels of detail and can therefore capture the details of the curvature. The deformation network bypasses the MC process and deforms the mesh template in continuous space to produce smooth surfaces, and therefore avoids the staircase artefacts, as shown in Fig. 1.

However, a key limitation is that the graph convolution layers cannot alter the edge connections and therefore cannot change the topological structure, e.g., the genus, which describes the number of handles or “holes” in the surface of a 3D object. This limitation hinders the ability of Voxel2Mesh to deform the spherical template mesh into complex anatomical structures with higher genus values, such as the pelvis.
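The genus of a closed, connected, orientable triangle mesh follows from the Euler characteristic, \(V - E + F = 2 - 2g\). The sketch below computes it from the face list alone, which makes explicit that the genus depends only on connectivity and not on vertex positions. The function names and the two toy meshes are hypothetical and only serve as illustration.

```python
def genus(faces):
    """Genus of a closed, connected, orientable triangle mesh via V - E + F = 2 - 2g."""
    vertices = {v for face in faces for v in face}
    edges = set()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))  # undirected edge, stored once
    euler = len(vertices) - len(edges) + len(faces)
    return (2 - euler) // 2

def torus_faces(n=4, m=4):
    """Connectivity of an n x m quad grid with wrap-around, split into triangles."""
    faces = []
    for i in range(n):
        for j in range(m):
            a = i * m + j
            b = ((i + 1) % n) * m + j
            c = ((i + 1) % n) * m + (j + 1) % m
            d = i * m + (j + 1) % m
            faces += [(a, b, c), (a, c, d)]
    return faces

# A tetrahedron is topologically a sphere; the wrapped grid is a torus.
tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
sphere_like_genus = genus(tetrahedron)
torus_genus = genus(torus_faces())
```

Because no vertex coordinate enters the computation, displacing vertices alone leaves the result unchanged, so a genus-0 template stays genus 0 however it is deformed.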

In this study, we evaluate the numerical and visual quality of medical mesh deformation networks on three anatomical structures of varying complexity and topological structure: the pelvis from CTPelvic1K [11], the liver from CHAOS [12], and the kidneys from CT-ORG [13]. To evaluate and demonstrate the general challenges and constraints of deformation-based medical mesh representation, we used, in addition to Voxel2Mesh, Pixel2Mesh-3D, a variant of Pixel2Mesh with 3D convolution layers in place of 2D convolution layers. The purpose of these experiments is to investigate the unique characteristics of deformation networks that bypass the MC process and to demonstrate their current limitations in optimizing topological structures. The goal is to provide insights for the medical mesh representation community regarding the importance of such networks and the key challenges to be addressed in future research.

2 Related Work

2.1 Medical Imaging Mesh Generation

Medical isosurface mesh generation consists of a volumetric segmentation task and a mesh extraction process. Convolutional neural networks (CNNs) are widely used for medical image segmentation, e.g., fully convolutional networks (FCNs) [14], PSPNet [15], and U-Net [5]. Among these, U-Net and its variants [6, 16,17,18] are the most popular choices for this task. U-Net uses an encoder-decoder architecture with skip connections, which provide direct connections between mirroring layers of the encoder and the decoder. These skip connections help preserve both the coarse information and the fine details in the results. However, both the vanilla U-Net and many of its variants operate on 2D image slices, without using the spatial information of the 3D volume. Both 3D U-Net [6] and V-Net [18] were proposed to operate directly on the 3D volume by replacing the 2D operations of the vanilla U-Net with their 3D counterparts. To address the foreground-background imbalance problem, which becomes exponentially more severe in 3D, V-Net also introduced a new loss function based on the Dice coefficient. However, due to hardware limitations and medical restrictions on radiation dosage, the resolution of the resulting volume is limited, and the extracted mesh therefore suffers from staircase artefacts.

MC is the most prominent method for mesh extraction. It extracts a triangle mesh isosurface from the volumetric data by matching pre-calculated cube configurations to the ROI boundary in the volumetric data. However, the MC process is non-differentiable, so the extracted mesh cannot be optimized by end-to-end training. Liao et al. [19] and Chen et al. [20] used deep learning networks to learn the optimal cube configuration instead of using the pre-calculated configurations during surface extraction, which makes the mesh extraction process differentiable and allows the output mesh to be optimized directly using deep learning methods. However, their works are limited by hardware constraints and extract meshes only from smaller volumetric data (up to \({128}^{3}\) volume resolution) [21].
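To make the foreground-background imbalance discussed above concrete, the following is a minimal sketch of a Dice-based loss. It uses the squared-denominator form commonly associated with V-Net; the exact form, the smoothing term `eps`, and the toy volume are assumptions for illustration rather than details taken from [18].

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Soft Dice overlap between a predicted probability map and a binary mask."""
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / ((pred ** 2).sum() + (target ** 2).sum() + eps)

def dice_loss(pred, target):
    """Loss that is 0 for a perfect overlap and close to 1 for no overlap."""
    return 1.0 - dice_coefficient(pred, target)

# Toy volume (assumption): a 4^3 foreground block inside a 32^3 background.
target = np.zeros((32, 32, 32))
target[14:18, 14:18, 14:18] = 1.0
all_background = np.zeros_like(target)

voxel_accuracy = (all_background == target).mean()
loss_all_background = dice_loss(all_background, target)
loss_perfect = dice_loss(target, target)
```

Predicting background everywhere is correct for more than 99% of the voxels in this toy volume, yet its Dice loss is close to 1, which is why an overlap-based loss is less sensitive to class imbalance than a per-voxel accuracy measure.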

2.2 Mesh Deformation Models

Deformable mesh models were first introduced by Terzopoulos et al. [22, 23] and were quickly adapted for medical image segmentation tasks [24, 25]. By implementing a graph convolutional network (GCN) model [26] that optimizes the vertex placement, Wang et al. [9] introduced the first deep learning based mesh deformation model, Pixel2Mesh, which generates a 3D mesh from a single image by deforming an ellipsoid mesh template. This model takes a 2D image as input and cannot process 3D volumetric data. Wickramasinghe et al. then adapted its GCN model to process 3D volumetric data in Voxel2Mesh [2]. In Voxel2Mesh, an encoder-decoder network with skip connections extracts features from the input volume, and these features are then sampled by an adaptive mesh unpooling strategy that maps the spatial features in volume space to the corresponding vertices in mesh space. The sampled features guide the GCN in deforming the sphere template into the desired shape. Voxel2Mesh was later extended to various medical mesh generation tasks [4, 27,28,29,30], where sophisticated data-driven templates of the target anatomical structures are used to minimize the required deformation. The quality of the resulting mesh depends heavily on the initialization of the deformable template [31], as misaligned spatial features in volume space and mesh space destabilize the deformation. Kong et al. [4] addressed this problem by predicting the displacement of a control point grid to align the features in the two spaces. Although some works [32,33,34] address the topology-dependent problem of using pre-defined templates, the template selection process is non-differentiable and cannot be trained end-to-end, which makes it hard to address in a deep learning context.
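The deformation step described above can be sketched as a graph convolution that predicts a per-vertex offset from the vertex position and the features sampled for that vertex. This is a simplified illustration under stated assumptions: the mean-aggregation layer, the two-layer depth, the weight names, and the random inputs are hypothetical and do not reproduce the Pixel2Mesh or Voxel2Mesh architectures.

```python
import numpy as np

def adjacency_from_faces(faces, num_vertices):
    """Dense symmetric adjacency matrix built from the triangle edges."""
    adj = np.zeros((num_vertices, num_vertices))
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            adj[u, v] = adj[v, u] = 1.0
    return adj

def graph_conv(features, adj, w_self, w_neigh):
    """Combine each vertex's own features with the mean of its neighbours' features."""
    degree = adj.sum(axis=1, keepdims=True)
    neighbour_mean = adj @ features / np.maximum(degree, 1.0)
    return features @ w_self + neighbour_mean @ w_neigh

def deform_step(verts, faces, sampled_features, weights):
    """Predict per-vertex offsets; the face list is passed through unchanged."""
    adj = adjacency_from_faces(faces, len(verts))
    x = np.concatenate([verts, sampled_features], axis=1)
    hidden = np.tanh(graph_conv(x, adj, weights["w_self_1"], weights["w_neigh_1"]))
    offsets = graph_conv(hidden, adj, weights["w_self_2"], weights["w_neigh_2"])
    return verts + offsets, faces

# Toy inputs (assumptions): a tetrahedron template, 8 sampled features per vertex,
# and random weights standing in for trained parameters.
rng = np.random.default_rng(0)
verts = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float)
faces = np.array([[0, 1, 2], [0, 3, 1], [1, 3, 2], [2, 3, 0]])
sampled = rng.normal(size=(4, 8))
weights = {
    "w_self_1": rng.normal(scale=0.1, size=(11, 16)),
    "w_neigh_1": rng.normal(scale=0.1, size=(11, 16)),
    "w_self_2": rng.normal(scale=0.1, size=(16, 3)),
    "w_neigh_2": rng.normal(scale=0.1, size=(16, 3)),
}
new_verts, new_faces = deform_step(verts, faces, sampled, weights)
```

The layer only reads the connectivity to decide which features to aggregate. Vertex positions change, while the number of vertices and the face list are returned as they were.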

3 Experiments

3.1 Data and Experiment Setup

Three datasets of varying shape complexity are used to evaluate the deformation networks.

Fig. 2. The templates used for deformation networks. The 162-face icosahedron template (a) is used for both the CHAOS and Pelvic1K dataset, and the twin 162-face icosahedrons template (b) is used for CT-ORG dataset.

  1. Liver segmentation (simple shape complexity): The CHAOS dataset [12] consists of 20 CTs of the human abdomen and their liver segmentation masks.

  2. Kidney segmentation (moderate shape complexity): The CT-ORG dataset [13] consists of 40 CTs of the lower human body and their kidney (among other organs) segmentation masks. The two kidneys are selected because of their different topological structures.

  3. Pelvis segmentation (high shape complexity): The CTPelvic1K dataset [11] consists of 103 CTs and their pelvis segmentation masks. The pelvis is selected for its complex topological structures such as the obturator foramen.

Each dataset is randomly divided into a training set and a testing set with a ratio of 7:3. Each volume is down-sampled using trilinear interpolation to \({256}^{2}\) in slice resolution, with proportional slice counts for training, while the ground truth isosurface mesh models are generated using MC with step size 1 on the full-resolution segmentation labels to minimize the staircase artefacts.
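The data preparation above can be sketched as follows. The reading of "proportional slice counts" as scaling the slice count by the same factor as the in-plane size is an assumption, as are the function names, the fixed random seed, and the example volume size.

```python
import random

def split_cases(case_ids, train_fraction=0.7, seed=0):
    """Randomly split case identifiers into training and testing subsets."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]

def downsampled_shape(depth, height, width, target=256):
    """Target (slices, rows, cols) when the in-plane size is scaled to `target`."""
    scale = target / max(height, width)
    return (max(1, round(depth * scale)), round(height * scale), round(width * scale))

# Case counts taken from the dataset descriptions above.
split_sizes = {
    name: tuple(len(part) for part in split_cases(range(count)))
    for name, count in (("CHAOS", 20), ("CT-ORG", 40), ("CTPelvic1K", 103))
}
# Hypothetical input volume of 100 slices at 512 x 512.
example_shape = downsampled_shape(100, 512, 512)
```

Under these assumptions the 7:3 ratio yields 14/6, 28/12, and 72/31 training/testing cases for the three datasets.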

We compare the deformation networks, visually and numerically, against a traditional pipeline that uses the standard U-Net for ROI segmentation and MC for mesh extraction. As shown in Fig. 2, the deformation models use a 162-face icosahedron as the template for CHAOS and Pelvic1K, and a twin 162-face icosahedron as the template for CT-ORG to accommodate the two kidneys. The U-Net uses the RMSprop optimizer and the BCE-with-logits loss, with a batch size of 8. The U-Net is trained until convergence on the same datasets used for Voxel2Mesh. The step size of MC is 1.
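For orientation, the element counts of a subdivided icosahedron can be computed in closed form. The sketch assumes the standard 4-to-1 triangle subdivision; whether the templates in Fig. 2 were built this way is not stated in the text, and `icosphere_counts` is a hypothetical helper.

```python
def icosphere_counts(level):
    """(vertices, edges, faces) of an icosahedron after `level` 4-to-1 subdivisions."""
    faces = 20 * 4 ** level
    edges = 30 * 4 ** level
    vertices = 10 * 4 ** level + 2
    return vertices, edges, faces

counts_by_level = {level: icosphere_counts(level) for level in range(4)}
```

Under this scheme the counts are (12, 30, 20), (42, 120, 80), (162, 480, 320), and (642, 1920, 1280) for levels 0 to 3. The value 162 appears as the vertex count at level 2, where the face count is 320, and every level satisfies \(V - E + F = 2\), i.e. genus 0.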

We conducted all our experiments, both training and inference, on a workstation with an NVIDIA Tesla V100 GPU running Ubuntu 20.04. All the mesh rendering images were captured using MeshLab [35].

3.2 Metrics

The following three metrics were used to evaluate the mesh quality quantitatively:

The average symmetric surface distance [36] (ASSD) measures the average distance from all points of one mesh to the isosurface of the other, and vice versa, hence the name symmetric; the lower the better;

The Hausdorff distance [37] (HD) measures the maximum of all nearest-neighbour distances between the points of the two meshes; the lower the better;

The Chamfer distance [38] (CD) measures the average of all nearest-neighbour distances between the points of the two meshes; the lower the better.
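The three metrics can be sketched on sampled point sets as follows. Several conventions exist for each metric; the choices here (pooled mean for ASSD, maximum of both directed maxima for HD, sum of the two directed means of unsquared distances for CD) are assumptions and may differ from the implementation used for Table 1. The brute-force nearest-neighbour search is only practical for small point sets; a spatial index would be needed for larger ones.

```python
import numpy as np

def nearest_distances(src, dst):
    """Distance from each point in `src` to its nearest neighbour in `dst`."""
    squared = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(squared.min(axis=1))

def surface_metrics(points_a, points_b):
    """ASSD, HD, and CD between two point sets sampled from two surfaces."""
    a_to_b = nearest_distances(points_a, points_b)
    b_to_a = nearest_distances(points_b, points_a)
    assd = (a_to_b.sum() + b_to_a.sum()) / (len(a_to_b) + len(b_to_a))
    hd = max(a_to_b.max(), b_to_a.max())
    cd = a_to_b.mean() + b_to_a.mean()
    return {"ASSD": float(assd), "HD": float(hd), "CD": float(cd)}

# Toy point sets (assumption): B has one outlier that dominates the Hausdorff distance.
points_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
points_b = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [3.0, 0.0, 1.0]])
metrics = surface_metrics(points_a, points_b)
```

In the toy example the single outlier sets HD to \(\sqrt{5}\), whereas ASSD and CD average it with the remaining unit distances, which illustrates why HD is the most sensitive of the three to localized errors.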

For all point-based metrics (ASSD, HD, CD), we randomly sample 100,000 points on the isosurface for each mesh model.
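One common way to draw such samples is to pick triangles with probability proportional to their area and then pick a uniformly distributed point inside each chosen triangle. The sketch below follows that scheme; it is an assumption that the evaluation used it, and `sample_surface` and the octahedron are hypothetical.

```python
import numpy as np

def sample_surface(verts, faces, n_points, seed=0):
    """Draw points uniformly (by area) from the surface of a triangle mesh."""
    rng = np.random.default_rng(seed)
    tri = verts[faces]  # shape (F, 3, 3): three corners per face
    edge_1 = tri[:, 1] - tri[:, 0]
    edge_2 = tri[:, 2] - tri[:, 0]
    areas = 0.5 * np.linalg.norm(np.cross(edge_1, edge_2), axis=1)
    chosen = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    # Uniform coordinates in the unit square, folded back into the triangle.
    u, v = rng.random(n_points), rng.random(n_points)
    outside = u + v > 1.0
    u[outside], v[outside] = 1.0 - u[outside], 1.0 - v[outside]
    return tri[chosen, 0] + u[:, None] * edge_1[chosen] + v[:, None] * edge_2[chosen]

# Toy mesh (assumption): a unit octahedron, whose surface satisfies |x| + |y| + |z| = 1.
verts = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                  [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)
faces = np.array([[0, 2, 4], [2, 1, 4], [1, 3, 4], [3, 0, 4],
                  [2, 0, 5], [1, 2, 5], [3, 1, 5], [0, 3, 5]])
points = sample_surface(verts, faces, 1000)
```

Area weighting matters when triangle sizes vary, as they do after deformation: sampling faces uniformly would over-represent regions with many small triangles.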

3.3 Quantitative Result

In Table 1, the U-Net reports the best scores on all three metrics for the CHAOS dataset, as well as the best ASSD score for the Pelvic1K dataset. Voxel2Mesh reports the best scores on all three metrics for CT-ORG, and the best HD and CD scores for Pelvic1K. Pixel2Mesh-3D reports the worst scores in all experiments.

Table 1. Quantitative comparison of Voxel2Mesh, Pixel2Mesh-3D and 2D U-Net, using three metrics, over three different datasets. The best score of the three is in bold font.

3.4 Qualitative Result

In Fig. 3, the U-Net (a) exhibits high visual similarity with the ground truth mesh (d) in both the liver (first row) and kidney (second row) tasks. However, the U-Net fails to reconstruct the lower parts of the pelvis, as indicated by the blue arrow. Moreover, the U-Net also reconstructs parts of the spine and femur that are outside the ROIs (see the red arrow). The visual quality of all three meshes from the U-Net, as well as of the ground truth meshes, which are extracted using MC, is compromised by the staircase artefacts. The liver meshes generated by both Voxel2Mesh (b) and Pixel2Mesh-3D (c) are unable to preserve the sharp edges of the organ, as indicated by the blue arrows in the first row. In the second row, the lower parts of the kidney pairs are stretched towards the opposite kidneys in both deformation networks. In the third row, both the Voxel2Mesh and Pixel2Mesh-3D results suffer from the problem of fixed topology and thereby fail to reconstruct the pelvis with detailed structures such as the obturator foramen, indicated by the red arrows, resulting in a basin-shaped mesh.

Fig. 3. Visual comparison of the CHAOS (first row), CT-ORG (second row), and Pelvic1K (third row) meshes between (a) U-Net, (b) Voxel2Mesh, (c) Pixel2Mesh-3D, (d) the ground truth, and (e) a coronal view of the CT used as input.

Fig. 4. The visualization of curvature principal directions where red encodes maximal curvature and green encodes minimal curvature. (a–c) are taken from the mesh generated by Voxel2Mesh from different viewpoints, and (d) is taken from mesh extracted by MC for reference of staircase artefacts.

We used the principal curvature directions computed in MeshLab to examine the staircase artefacts. We visualize the magnitude of the curvature by encoding the maximum curvature direction in red, the minimum curvature direction in green, and the third principal direction, which is perpendicular to both the maximum and minimum directions, in blue. The intensity of each colour channel indicates the magnitude of the corresponding curvature direction; that is, green regions indicate flat surfaces, while red and blue regions indicate sharp edges. In Fig. 4, the mesh generated by Voxel2Mesh (a–c) exhibits high curvature strength in regions corresponding to the overall shape. The mesh extracted by MC (d), however, exhibits distinct steep edges along the shape curves.

4 Discussion

We observed several limitations in the performance of the mesh deformation networks when evaluating the reconstructed meshes of various complexity. Specifically, we found that:

  • The networks are unable to reconstruct the meshes with sharp edges,

  • The networks are unable to deform templates with multiple objects, and

  • The networks are unable to reconstruct organs of different topological structures.

As shown in the first row of Fig. 3, both Voxel2Mesh and Pixel2Mesh-3D can deform their icosahedron templates into simple shapes, such as the liver in the CHAOS dataset. However, they fail to preserve fine details such as the sharp edges indicated by the blue arrows.

In the second row of Fig. 3, where the deformation networks deform a template consisting of an icosahedron pair, the networks fail to differentiate between the vertices of the two icosahedrons, causing vertices to be misplaced onto the surface of the opposite kidney.

Furthermore, due to their graph convolutional network (GCN) structure, the networks are unable to deform the template into complex shapes. The GCN consists of graph convolution layers that modify the values of the graph vertices but are constrained by the topology of the graph, as they cannot change the connectivity of the mesh edges. As shown in the third row of Fig. 3, both networks were able to deform the template into the shape of a basin. However, the details of the pelvis and the foramen are all missing from the mesh. This again is due to the limited capacity of the sphere to deform into a complex basin shape with multiple openings.

In contrast to the mesh models extracted by MC from a U-Net segmented volume, which are affected by staircase artefacts, the meshes deformed from templates remain unaffected. This is because the sphere template with its smooth surface is deformed directly into the output shape, without passing through the voxelized grid that causes the artefacts. In the third row of Fig. 3, the U-Net fails to reconstruct thin structures, such as those indicated by the red and blue arrows, and erroneously reconstructs parts outside the ROIs. Conversely, these issues are not present in the outputs of the deformation networks, as the topological structure is constrained by the pre-defined template.

We further verify the absence of staircase artefacts in meshes generated by deformation-based networks by examining the curvature direction distribution in Fig. 4. The staircase artefacts can be identified visually by the contrasting colour ripples, which indicate high curvature strength and high homogeneity along the z-axis, as shown in (d). Such ripples are not present in (a–c), which are generated by a deformation network.

In future work, we will investigate other medical isosurface representations that can minimize the staircase artefacts and can be used for more complex anatomical structures, such as signed distance functions (SDFs) [38], which use mathematical functions to represent the zero-level isosurface of the desired shape. As this representation does not require a template as initial input, it is not restricted by a pre-defined shape and can represent arbitrary shapes. Moreover, an SDF describes the shape in continuous space and can therefore be sampled at arbitrarily high resolution when MC is used for mesh extraction; this can potentially minimize the staircase artefacts, which are mainly caused by low z-axis resolution.
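The two properties named above, template-free topology and resolution-independent sampling, can be illustrated with an analytic SDF. The torus, its radii, and the grid extent are hypothetical choices for illustration; a learned SDF for anatomical data would replace the closed-form function.

```python
import numpy as np

def torus_sdf(points, major_radius=1.0, minor_radius=0.3):
    """Signed distance to a torus around the z-axis: negative inside, positive outside."""
    ring = np.sqrt(points[..., 0] ** 2 + points[..., 1] ** 2) - major_radius
    return np.sqrt(ring ** 2 + points[..., 2] ** 2) - minor_radius

def sample_grid(sdf, resolution, extent=1.5):
    """Evaluate `sdf` on a cubic grid of `resolution`^3 points in [-extent, extent]^3."""
    axis = np.linspace(-extent, extent, resolution)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    return sdf(grid)

# The same function is sampled at two resolutions without any re-fitting.
coarse = sample_grid(torus_sdf, 16)
fine = sample_grid(torus_sdf, 64)

inside_tube = float(torus_sdf(np.array([1.0, 0.0, 0.0])))
in_the_hole = float(torus_sdf(np.array([0.0, 0.0, 0.0])))
```

The torus has genus 1, yet no template of matching topology is involved: the hole is simply the region where the function is positive. Either sampled grid could be passed to an MC implementation to extract the zero-level surface, with the grid spacing chosen freely along all three axes.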

5 Conclusion

In this study, we highlighted the existing limitations of using a mesh deformation network to represent medical ROIs, which include the restricted ability to reconstruct shapes with sharp edges and the inability to change the topological structure of the template. Although the visual quality is inferior to that of traditional pipelines that use MC for mesh extraction, the ability of deformation networks to bypass the MC process and avoid the staircase artefacts makes them worthwhile subjects for future research.