Introduction

Facial appearance is central to an individual's identity and can significantly affect self-confidence and social relationships. Patients with jaw deformities suffer from both esthetic impairment and functional abnormality [1]. Orthognathic surgery is a corrective jaw surgery that treats skeletal deformities by repositioning osteotomized bony segments into desired positions [2, 3]. Although surgically “untouched,” the facial soft tissues are passively and “automatically” corrected as the underlying bony segments move [4]. Because of the complex nature of facial anatomy, orthognathic surgery requires an accurate surgical plan. To date, surgeons can accurately plan the movement of bony segments (“bony movement” for short) using computer-aided surgical simulation (CASS) technology [3, 5]. However, because of the nonlinear relationship between the bony segments and the soft tissues, predicting postoperative facial appearance remains challenging in practice [6].

Various attempts have been made to predict three-dimensional (3D) facial change following orthognathic surgery [4, 7, 8]. Among them, the finite element method (FEM) is reported to be the most accurate and biomechanically relevant [6, 9]. When using FEM simulation, geometrically accurate patient-specific FE mesh modeling and realistic boundary condition assignment are critical to achieving quantitatively and qualitatively accurate predictions [4, 6, 9]. However, this level of customization is often difficult to implement in a general FE solver, which motivated the development of a novel incremental simulation method in our previous work [6]. FEM simulation is also computationally expensive: a typical facial change prediction takes about 30 minutes to complete after the bony movement is planned. This makes FEM simulation impractical for quick surgical planning in clinical settings, because surgeons often try multiple procedures or revisions during planning for each patient in order to achieve the best possible outcome. While FE acceleration methods such as SOFA [10], NiftySim [11], and proper orthogonal decomposition may speed up computation, their simulation times rarely approach what is needed for rapid surgical planning, especially for highly detailed meshes [12]. Therefore, a more efficient approach is needed to reduce prediction time while maintaining accuracy comparable to FEM.

Deep learning has been applied to many tasks related to orthognathic surgical planning, including classification, segmentation, registration, and denoising [13]. Only recently have deep learning techniques been introduced as a potential alternative to the traditional FEM for simulating biomechanical problems, including tissue deformation [14]. While training a deep neural network is computationally expensive, a fully trained network can decrease simulation time by several orders of magnitude compared to FEM [14]. This decrease in computation time has made deep learning an attractive approach for obtaining simulation results rapidly.

Deep learning networks based on the U-Net architecture have been developed to simulate soft-tissue deformation in various organs [15, 16]. However, such models require input data sampled from a regularly spaced grid, which is not suitable for facial change simulation. Accurately capturing details of the face, especially in clinically critical regions such as the lips, is essential for predicting the facial appearance outcome. Therefore, a network that can accept unstructured data with irregularly spaced nodes is needed. Recent works have used PointNet for deformation estimation because it accepts data in point cloud format, allowing unstructured input [17, 18]. These networks were able to learn biomechanically relevant tissue deformation from unstructured data. However, their boundary type information was limited to whether or not deformation occurred at a given node, whereas the boundary types in facial tissue simulations are more complex [4, 6, 9]. For this reason, a way of explicitly supplying boundary type information to the network is needed.

The purpose of this study is to develop a novel biomechanics-informed deep learning method for efficient and accurate facial tissue change simulation, addressing the weaknesses of prior deformation prediction networks and of FEM simulation. The contributions of this proof-of-concept work are (1) implementation of a deep neural network based on the PointNet++ architecture [19] that accepts input in point cloud format and is therefore compatible with any geometrically detailed facial mesh, and (2) implementation of explicit patient-specific boundary types as additional network input to improve facial change prediction accuracy.

Method

The proposed biomechanics-informed deep learning method is based on PointNet++ [19]. In this method, we assume that the facial tissue mesh has already been generated from computed tomography (CT), and the surgical plan (i.e., exact bony movement) has already been formed. The facial mesh in point cloud format and the explicit patient-specific boundary types are used as inputs to the network for fast prediction of facial soft-tissue deformation following the bony movement. Figure 1 shows an overview of the proposed method.

Fig. 1 Overview of the proposed method for soft-tissue deformation prediction

Data representation

Our network learns a nonlinear mapping between an input state, represented by the starting mesh, boundary types, and bony surface displacement, and the predicted mesh following bony displacement. The network is designed to accept input data in the same format as an FEM simulation. Partially inspired by Mendizabal et al. [12] and Saeed et al. [17], the input consists of \(N\) feature vectors \(x_n=[c_n, b_n, s_n]\), where \(n = 1,2,\ldots,N\) indexes the nodes of the input FEM mesh. The vectors \(c_n\) are the Cartesian coordinates of the input mesh nodes, \(b_n\) are one-hot encodings of the boundary type, and \(s_n\) are the applied surface displacements in the \(x, y,\) and \(z\) directions. The encoding vector \(b\) varies with the boundary type at each node (Fig. 1). A one-hot encoding can distinguish many types of boundary conditions, as opposed to the binary indicator used by Saeed et al. [17]. Three boundary types are implemented: fixed, moving, and free nodes [6]. For fixed nodes, \(b = [0, 0, 1]\); for moving nodes with known displacement, \(b = [0, 1, 0]\); and for the remaining free nodes, \(b = [1, 0, 0]\). We therefore refer to the inclusion of the encoding vector \(b\) as “explicit” boundary types, whereas its exclusion is referred to as “implicit” boundary types. For moving nodes, \(s = [s_x, s_y, s_z]\), where \(s_x, s_y, s_z\) are the applied nodal displacements derived from the corresponding bony movement. For fixed and free nodes, \(s = [0, 0, 0]\).

We use the FEM simulation output as ground truth when training the deep neural network. The network is trained to predict the final nodal displacement after deformation, \(u_n\), where \(u=[u_x, u_y, u_z]\). The ground-truth nodal displacements, \(v_n\) with \(v=[v_x, v_y, v_z]\), are computed by subtracting the nodal coordinates of the input mesh from those of the FEM-simulated mesh. The network is tasked with finding the function that minimizes the expected error between \(u_n\) and \(v_n\). Mean squared error is used as the loss function:

$$\begin{aligned} \min \frac{1}{N}\sum \limits _{n=1}^N \Vert (v_n-u_n)\Vert ^2 \end{aligned}$$
(1)
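For illustration, the per-node feature construction and the loss in Eq. (1) can be sketched as follows; this is a minimal NumPy/PyTorch example with illustrative array names, shapes, and label ordering, not the actual implementation:

```python
# Minimal sketch (illustrative only): per-node input features x_n = [c_n, b_n, s_n]
# with one-hot boundary types, and the mean squared error loss of Eq. (1).
import numpy as np
import torch

FREE, MOVING, FIXED = 0, 1, 2  # assumed integer labels for the three boundary types

def build_input_features(coords, boundary_type, surface_disp):
    """coords: (N, 3) nodal coordinates c_n
    boundary_type: (N,) integer label per node (FREE/MOVING/FIXED)
    surface_disp: (N, 3) applied displacement s_n (zeros for free and fixed nodes)
    returns: (N, 9) feature matrix [c_n, b_n, s_n]"""
    b = np.eye(3)[boundary_type]  # one-hot b_n: free=[1,0,0], moving=[0,1,0], fixed=[0,0,1]
    return np.concatenate([coords, b, surface_disp], axis=1).astype(np.float32)

def mse_loss(pred_disp, gt_disp):
    """Eq. (1): mean over nodes of the squared norm ||v_n - u_n||^2."""
    return torch.mean(torch.sum((gt_disp - pred_disp) ** 2, dim=-1))
```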

Network design details

Fig. 2 Architecture of the modified PointNet++ network: a overall structure, b feature encoding module, c feature decoding module. N1 × 128 denotes the node feature size after the third feature decoding module

PointNet++ [19] is adopted for our task because of its efficiency in point set processing. Its structure is modified by adding the boundary type information and displacement vectors as additional input channels. PointNet++ is a hierarchical feature extraction network consisting of 4 feature encoding modules, 4 feature decoding modules, and a unit PointNet layer, as shown in Fig. 2a. Each feature encoding module has a sampling layer, a grouping layer, and a PointNet layer, as shown in Fig. 2b. Each feature decoding module consists of an interpolation layer and a unit PointNet layer, as illustrated in Fig. 2c. The PointNet layer has a multilayer perceptron (MLP) and a max pooling operator. The unit PointNet layer is similar to a 1 × 1 convolution in convolutional neural networks [19]. Skip connections are used to concatenate features between the feature encoding and decoding modules. The final output of the network is the predicted 3D displacement vector for each of the N input nodes.
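Conceptually, the only changes relative to the original PointNet++ are the widened 9-channel input (coordinates, one-hot boundary type, and applied displacement) and the 3-channel per-node output. A minimal PyTorch sketch of these two pieces is shown below; the hierarchical backbone is omitted and the layer widths are illustrative, not the actual configuration:

```python
# Illustrative sketch of the input/output changes around a PointNet++-style backbone.
import torch
import torch.nn as nn

# First shared MLP of the backbone accepts 9 channels instead of 3:
# xyz coordinates (3) + one-hot boundary type (3) + applied displacement (3).
first_layer = nn.Conv1d(in_channels=9, out_channels=64, kernel_size=1)

class DisplacementHead(nn.Module):
    """Unit-PointNet-style output head: a shared MLP applied per node (1 x 1 convolution)."""
    def __init__(self, in_channels=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(in_channels, 128, kernel_size=1), nn.ReLU(),
            nn.Conv1d(128, 3, kernel_size=1),  # per-node displacement (u_x, u_y, u_z)
        )

    def forward(self, node_features):   # node_features: (B, C, N)
        return self.mlp(node_features)   # output: (B, 3, N)
```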

Experiments and results

Experiments

We tested the method’s ability to simulate facial mesh change following synthetic bony movement based on real patient data. We then tested the method on an actual patient’s surgical plan to assess the performance of our network in solving a clinical problem.

Data for facial change simulation

A dataset of synthetic surgical plans was generated from real patient examples to train, validate, and test our network. The actual surgical plan from each patient was reserved to evaluate the network's performance on real data. Patients who underwent double-jaw orthognathic surgery were randomly selected from our digital archive [IRB#: Pro00008890]. We generated synthetic bony movements and their corresponding FE facial meshes to serve as training, validation, and test data.

The synthetic bony movements were created first. Following the standard surgical procedure, the midface and mandible of the preoperative CT models were osteotomized into a LeFort segment, a distal mandible, and right and left proximal segments. After the postoperative CT models were registered to the preoperative ones based on surgically unaltered volumes, the surgical plan for the actual surgery (i.e., the movement of each bony segment) was retrospectively established. The LeFort and distal segments were moved individually in six degrees of freedom, while the right and left proximal segments were rotated around the ipsilateral condyle and aligned to the distal mandible. Each rotational and translational bony movement was then divided into several sub-steps within the maximal surgical movement.

A facial change was then simulated for each combination of LeFort and distal mandibular (bony) movements. An initial hexahedral patient-specific FE mesh model (47,088 nodes and 38,280 elements) with detailed lip geometry was generated from patient CT images using the eFTP-VP method [6, 20]. Neo-Hookean material properties (Young's modulus: 3,000 Pa; Poisson's ratio: 0.47) and patient-specific boundary conditions were applied [6]. Using our validated FEM simulation method [6], facial meshes were generated (Fig. 3). An incremental approach was used for the FEM simulation, in which facial changes were simulated sequentially based on incremental bony movements from the preoperative to the final position. Each simulation produced at least 10 results (e.g., 9 intermediate incremental results and 1 final result), and each incremental result was used as a separate data sample. Since each simulation takes about 30 minutes to complete, generating the 3,600 data samples for subject 1 took approximately a week (Table 1). The number of data samples generated for each subject depended on the range of bony movement that was physiologically plausible and therefore varied across subjects (Table 1).

To improve network training efficiency, the area above the infraorbital region was removed from the original mesh, and the number of nodes and elements was also reduced (from approximately 50,000 to 3,960 nodes) while maintaining the best possible geometric accuracy (Fig. 3). The moving and fixed nodes in the facial FE mesh were assigned using a K-nearest neighbor algorithm (Supplementary Fig. S1).
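For illustration only, a simplified version of this nearest-neighbor assignment is sketched below; the neighbor count, distance threshold, and displacement averaging are assumptions, and the actual criteria follow Supplementary Fig. S1 (fixed nodes are omitted here for brevity):

```python
# Hedged sketch: label facial FE mesh nodes adjacent to the osteotomized bone as
# "moving" and transfer the planned bony displacement via k nearest neighbors.
import numpy as np
from scipy.spatial import cKDTree

def assign_moving_nodes(mesh_nodes, bone_points, bone_disp, k=3, max_dist=2.0):
    """mesh_nodes: (N, 3) facial FE mesh node coordinates (float)
    bone_points: (M, 3) points on the osteotomized bony surfaces
    bone_disp:   (M, 3) planned displacement of each bony surface point
    returns per-node labels (0 = free, 1 = moving) and surface displacements s_n."""
    tree = cKDTree(bone_points)
    dists, idx = tree.query(mesh_nodes, k=k)            # k nearest bony points per node
    labels = np.zeros(len(mesh_nodes), dtype=int)        # default: free node
    surf_disp = np.zeros_like(mesh_nodes)
    near = dists.mean(axis=1) < max_dist                  # nodes adjacent to bone (assumed threshold)
    labels[near] = 1                                       # moving nodes
    surf_disp[near] = bone_disp[idx[near]].mean(axis=1)   # average displacement of the k neighbors
    return labels, surf_disp
```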

Table 1 The data split and number of data samples for each subject

Network training and evaluation

Fig. 3 Example of facial change simulation: a preoperative original mesh, b facial mesh following bony movement, c adjusted preoperative mesh for network training, d adjusted facial mesh following bony movement for network training

The available data were split randomly into training, validation, and test sets (70%, 10%, and 20%, respectively; Table 1). The feature vectors c and s were scaled to the range [0, 1] before being fed to the network. Mean squared error was used as the training loss. We used the Adam optimizer, which adaptively adjusts the learning rate, with an initial learning rate of 1e-5 and a batch size of 8. The network was trained for 100 epochs, which took approximately 5 hours per subject on an NVIDIA Tesla V100 GPU.
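A minimal, self-contained training-loop sketch with these settings is shown below; the per-node MLP stands in for the PointNet++ backbone and the synthetic tensors stand in for the data loader:

```python
# Illustrative training setup: Adam (lr = 1e-5), batch size 8, 100 epochs, MSE loss.
# The backbone and data are placeholders; c and s are assumed pre-scaled to [0, 1].
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(9, 128), nn.ReLU(), nn.Linear(128, 3))  # stand-in backbone

features = torch.rand(8, 3960, 9)   # one synthetic batch: (B, N, 9) input features
gt_disp = torch.rand(8, 3960, 3)    # (B, N, 3) ground-truth FEM displacements v_n
train_loader = [(features, gt_disp)]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
loss_fn = nn.MSELoss()

for epoch in range(100):
    for x, v in train_loader:
        u = model(x)                 # predicted nodal displacements u_n
        loss = loss_fn(u, v)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```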

The network was evaluated using the mean Euclidean error between the predicted and ground-truth node locations, \(e(u,v) = \frac{1}{N}\sum \nolimits _{n=1}^N \Vert v_n-u_n\Vert \). The distribution of these mean errors was used to assess the network's performance.
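This metric can be computed directly from the predicted and ground-truth displacements, as in the short sketch below:

```python
# Mean Euclidean error e(u, v): average L2 distance between predicted and
# ground-truth node positions (equivalently, between displacement vectors).
import numpy as np

def mean_euclidean_error(pred_disp, gt_disp):
    """pred_disp, gt_disp: (N, 3) predicted and ground-truth nodal displacements."""
    return np.linalg.norm(gt_disp - pred_disp, axis=1).mean()
```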

Ablation study

To understand the impact of including explicit boundary type information, an ablation study was performed. The boundary type vector \(b\) was omitted from the input vector \(x\); this configuration is referred to as “implicit” boundary types because the network must still learn the boundary behavior implicitly. The size of the first layer of the PointNet++ network was changed to match the reduced input dimension. The mean Euclidean error was calculated for all samples in the validation and test sets to compare the network with and without explicit boundary types. Since the distribution of mean errors was not normal, a Wilcoxon signed-rank test was used to test for statistically significant differences in network performance.
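A minimal sketch of this ablation comparison is given below; the feature column indices and the paired per-sample error arrays are illustrative assumptions:

```python
# Sketch of the ablation: drop the one-hot boundary columns b_n from the input and
# compare paired per-sample mean errors with a Wilcoxon signed-rank test.
from scipy.stats import wilcoxon

def implicit_features(features):
    """Keep coordinates and surface displacement only: x_n goes from 9 to 6 channels."""
    return features[:, :, [0, 1, 2, 6, 7, 8]]   # columns 3-5 are the one-hot boundary types

def compare_ablation(errors_explicit, errors_implicit, alpha=0.05):
    """Paired per-sample mean Euclidean errors computed on the same validation/test samples."""
    stat, p_value = wilcoxon(errors_explicit, errors_implicit)
    return p_value, p_value < alpha
```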

Results

Fig. 4 Results of the ablation study comparing explicit boundary types (E) with implicit boundary types (I). Bars marked with an asterisk (*) denote a statistically significant difference due to inclusion of explicit boundary type information \((p < 0.05)\)

In the facial mesh prediction task following synthetic bony movement, the network achieved a mean error between 0.159 and 0.642 mm on the test set of each subject (Fig. 4). The mean error rarely exceeded 1 mm, even in simulations with very large input displacement (Supplementary Fig. S2). The predicted facial mesh closely resembled the ground-truth FEM mesh, and the largest errors typically occurred around the lips (Fig. 5). On the real surgical plan examples, the network achieved a mean error between 0.292 and 0.989 mm across subjects. These results are comparable to, if not better than, those of the synthetic bony movement simulations when considered relative to the average input displacement (Supplementary Fig. S2). The error was reasonable given the magnitude of the ground-truth deformation (Supplementary Fig. S3). The ablation study showed that including explicit boundary types had mixed effects on network performance. In subject 3, including explicit boundary types improved performance; however, in subjects 4 and 5, it degraded performance, and in subjects 1 and 2, performance was not significantly affected. The average run time was only slightly longer when boundary types were included, increasing by 5 ms on average.

Fig. 5 Visualization of a simulated facial mesh following synthetic bony movement

Discussion

Our method is capable of closely approximating FEM results. The modified PointNet++ network predicts deformation accurately and consistently, as demonstrated by the low mean error achieved in the facial simulations. The network also adapts easily to various simulated displacements and achieves low error even for large displacements, as shown by the surgical simulation results (Table 2). These results indicate that the presented method is robust to high levels of elastic deformation. Our method also performs well on real surgical plans. Qualitatively, the network-predicted facial shapes closely resemble the ground-truth FEM facial shapes (Fig. 5). These results demonstrate our method's ability to capture the fine facial details that are imperative to facial surgical planning.

Table 2 Performance on real surgical simulations for each subject

The inclusion of explicit boundary types did not have a noticeable effect in subjects 1 and 2 and even decreased performance in subjects 4 and 5. Only subject 3 showed improved accuracy when explicit boundary types were included. Including explicit boundary types only seems to improve accuracy when the simulations involve a high maximum deformation, as seen in the surgical planning results (Table 2). We believe our method of including explicit boundary types may be limited by the learning process of the network: since deep neural networks act as universal approximators, introducing explicit boundary types in the input alone may not noticeably affect learning without also providing a way to enforce boundary conditions in the output. Our future research will seek to enforce boundary conditions through network design or loss algorithms.

The main advantage of our deep learning method is the decrease in simulation time compared to FEM. The average computation time of our network was less than 700 ms, whereas FEM takes several minutes for similar simulations [6]. This decrease in computation time allows clinicians to perform many more simulations during surgical planning and to receive rapid feedback, while our method still achieves simulation results comparable to FEM.

One limitation of our work is the use of mean squared error as our only loss function. In future iterations of our network, adding a smoothing loss term may help lower the error while also improving visual accuracy. Furthermore, recent work by Odot et al. has emphasized that mean squared error as a loss function may cause shape inaccuracies when simulating hyper-elastic materials [21]. Future iterations of our network will include a governing physics equation in the loss, as in the work of Raissi et al. [22]. We also did not implement sliding nodes as a possible boundary type in this work. We believe this may have limited the performance of the explicit boundary type encoding, as the original FE meshes for our subjects contained sliding nodes; modeling sliding nodes will require custom loss algorithms, which we will investigate in future work. Another limitation is that we did not include material properties as additional network input, a feature we will add in future studies, as has been done in related work [17]. Finally, as this was a proof-of-concept study, the network was trained on data from only one subject at a time. Future iterations of the network will be trained using data from multiple subjects; training should occur on a large group of subjects with a wide range of physiologically relevant surgical plans to make the network robust and generalizable to unseen subjects. Ideally, a network trained on sufficiently many subjects would be adaptable to new subjects (possibly with minimal fine-tuning), making it suitable for clinical use. To train our network on multiple subjects, known point correspondence across subjects will need to be established to optimize the training procedure [23]. In addition, to further validate our method, we will compare the accuracy of the PointNet++ network to previously used networks such as U-Net [12, 15, 16].

Conclusion

We presented a deep learning method for biomechanics modeling of facial deformation in orthognathic surgical planning. Our method addressed issues in previous deformation prediction networks approximating FEM, namely network compatibility with geometrically detailed facial meshes and the inclusion of explicit boundary type information. The proposed method achieved accurate performance on facial mesh simulations following synthetic bony movement. Inclusion of explicit boundary type information had mixed results, improving performance in simulations with large deformations but decreasing performance in simulations with small deformations. Finally, our network achieved accurate results on a real surgical example, demonstrating its clinical feasibility.