1 Introduction

In photon radiation therapy, accurate dose modeling is paramount to ensure that treatment plans safely target the tumour. Existing algorithms such as Collapsed Cone Convolution [1] or Pencil Beam [11] fail to match the precision of Monte-Carlo (MC) radiation transport calculations for the deposited dose [4, 6]. Yet, MC generation of radiotherapy dose distributions remains too time-consuming for clinical adoption. Recent deep learning methods for accelerating MC dose calculation [12, 16] offer a solution to this problem, using common computer vision loss functions. Although such methods provide a good trade-off between time and performance, training on such loss functions amounts to solving a proxy problem, with no strict guarantee of jointly optimizing the clinical validation of the generated dose, which is performed using the Gamma index Passing Rate (GPR).

The GPR is one of the most essential and commonly used clinical evaluation metrics for the verification of complex radiotherapy dose delivery such as Intensity Modulated Radiation Therapy or Volumetric Modulated Arc Therapy (VMAT) [13]. As such, the GPR provides a clinical criterion to assess the quality of the model’s predictions. Therefore, training directly with the GPR as the primary objective would align training more closely with the clinical standpoint. However, the GPR has two main limitations that deter its use as a loss function. First, training neural networks in a supervised setting requires a differentiable loss function to allow backpropagation, yet the GPR is non-differentiable, which jeopardizes gradient descent. Second, despite efforts to bridge the gap, current gamma index and GPR computations remain time-consuming, especially when comparing high-dimensional dose distributions.

By taking a medical imaging perspective, we circumvent these challenges to incorporate the GPR as an optimization criterion during the training of neural networks. To our knowledge, this is the first study to create a new class of loss functions based on the GPR and to bring the time of gamma index computations down to milliseconds, both for 2D and 3D dose distributions. We provide a proof-of-concept showcasing deep learning acceleration of MC dose simulations with models trained to optimize the presented GPR-based loss functions. Finally, we study the behavior of the GPR-based loss functions and benchmark them against the Structural Similarity Index Measure (SSIM), the Mean Absolute Error (MAE), and the Mean Squared Error (MSE). Our code and models will be publicly released.

2 Related Work

Loss Functions: When training a neural network on a task, the choice of the loss function is crucial. Loss functions such as the Dice Loss [15], the Focal Loss [7], or the Structural Similarity Index Measure (SSIM) [17] have revolutionized, respectively, segmentation, object detection, and image processing tasks. Moreover, not all loss functions have the same impact on training and inference, as explained in the study introducing the Multiscale-SSIM [19].

This problem becomes even more evident in the medical field, in which models need to ensure reliable performance. For this reason, integrating mathematical objectives that train the models to optimize clinically relevant properties is of utmost importance for their integration into clinical practice.

In light of these considerations, we overcome the mathematical challenge of the GPR and turn this clinical metric into a viable loss function for our task of accelerating the simulation of MC radiotherapy dose distributions. We provide a family of GPR-based criteria that are therefore consistent with clinical requirements.

Gamma Index: The main challenge of computing the gamma index matrix resides in the pixel-wise computation of gamma index values, whose cost grows with the dimensionality of the evaluated dose distribution. Prohibitive calculation time hinders the potential of the GPR as a loss function. Many works propose ways to decrease the computational complexity, either by changing the mathematical formalism or by accelerating the calculations. In [5], Gu et al. use a geometric method with a GPU-accelerated radial pre-sorting technique to speed up calculations. Chen et al. [3] consider reducing the search distance by using a fast Euclidean distance transform.

In this paper, we present an acceleration approach suited to deep learning frameworks that significantly reduces the calculation time and enables fast training with our GPR-based loss functions.

3 Methods

3.1 The Gamma Passing Rate

Gamma Index: Let \(D_r\) and \(D_e\) be two dose distributions (\(\mathbb {R}^k\rightarrow {}\mathbb {R}\)), respectively the reference and the evaluated. In our case, the evaluated dose distribution is the model’s prediction. To each of them corresponds a grid of points in which each point, \(P_r\) of \(D_r\), and \(P_e\) of \(D_e\) has a coordinate vector, respectively \(\vec {d}(P_r)\) and \(\vec {d}(P_e)\), and a dose value, \(D_r(P_r)\) and \(D_e(P_e)\).

Let us consider a point \(P_r\) in \(D_r\) and the points \(P_e\) in a vicinity \(V(P_r)\) around \(P_r\). Then the gamma index \(\varGamma \) is defined as a real-valued function such that for all \(P_r \in D_r\), \(\varGamma (P_r)\) is given by:

$$\begin{aligned} \varGamma (P_r)=\min _{P_e \in V(P_r)}{ \sqrt{\frac{|| \vec {d}(P_e) - \vec {d}(P_r) ||^2}{DTA^2} + \frac{(D_e(P_e) - D_r(P_r))^2}{\varDelta ^2}}} \end{aligned}$$
(1)

where DTA is the tolerance on the Distance-To-Agreement (DTA), commonly in mm, and \(\varDelta \) is the tolerance on the relative dose difference expressed as a percentage of the reference dose value \(D_r(P_r)\). This definition entails that each point \(P_r\) has its own gamma index value in \(\varGamma \), which indicates how close neighbouring points \(P_e\) are, both spatially and dose-wise.
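As an illustration of Eq. 1, the following minimal Python sketch computes the gamma index by exhaustive search on a 1D grid. The function name `gamma_index_1d`, the uniform grid spacing, and the small floor `eps` on the local dose tolerance (to avoid dividing by zero where the reference dose vanishes) are assumptions made for this example, not part of the definition above.

```python
import math


def gamma_index_1d(ref, ev, dta_mm=2.0, tol_frac=0.02, spacing_mm=1.0, eps=1e-8):
    """Brute-force gamma index (Eq. 1) for two 1D dose profiles on the same grid."""
    radius = int(dta_mm // spacing_mm)  # vicinity V(P_r), in grid steps
    gamma = []
    for i, d_ref in enumerate(ref):
        # Dose tolerance as a fraction of the local reference dose (eps is an assumption).
        delta = max(tol_frac * d_ref, eps)
        best = math.inf
        for j in range(max(0, i - radius), min(len(ev), i + radius + 1)):
            dist2 = ((j - i) * spacing_mm) ** 2 / dta_mm ** 2
            dose2 = (ev[j] - d_ref) ** 2 / delta ** 2
            best = min(best, math.sqrt(dist2 + dose2))
        gamma.append(best)
    return gamma
```

The cost of this search grows with the number of reference points times the size of the vicinity, which is the bottleneck addressed in Sect. 3.3.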

GPR: Let us introduce a dose threshold \(\delta \) and consider a point \(P_r\) of the reference distribution such that \(D_r(P_r) \ge \delta \). Then, given a DTA and dose tolerance \(\varDelta \), the evaluated distribution matches the reference at \(P_r\), if the passing criterion is satisfied, i.e. if:

$$\begin{aligned} \text {Passing criterion:} \quad \varGamma (P_r) \le 1 \end{aligned}$$
(2)

The GPR is defined as the percentage of points \(P_r \) that satisfy the condition in Eq. 2 while \(D_r(P_r) \ge \delta \).

Let \(\mathbbm {1}_{D_r \ge \delta }\) and \(\mathbbm {1}_{\varGamma \le 1}\) be the indicator functions defined such that:

$$\begin{aligned} \mathbbm {1}_{D_r \ge \delta }(P_r) = \left\{ \begin{array}{l} 1 \quad \text { if }D_r(P_r) \ge \delta .\\ 0 \quad \text { otherwise}. \end{array} \right. \quad \quad \quad \mathbbm {1}_{\varGamma \le 1}(P_r) = \left\{ \begin{array}{l} 1 \quad \text { if }\varGamma (P_r) \le 1.\\ 0 \quad \text { otherwise}. \end{array} \right. \end{aligned}$$
(3)

Then we can write the GPR as follows:

$$\begin{aligned} GPR(D_r, D_e) = \frac{\sum _{P_r \in D_r} \mathbbm {1}_{D_r \ge \delta }(P_r) \cdot \mathbbm {1}_{\varGamma \le 1}(P_r)}{\sum _{P_r \in D_r} \mathbbm {1}_{D_r \ge \delta }(P_r)} \end{aligned}$$
(4)
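Eq. 4 can be sketched in a few lines of Python. The function name `gamma_passing_rate` and the flat-list representation of the gamma map and reference dose are assumptions for illustration; the value is returned as a fraction in [0, 1] rather than a percentage.

```python
def gamma_passing_rate(gamma, ref_dose, threshold):
    """Fraction of reference points with dose >= threshold whose gamma index is <= 1 (Eq. 4)."""
    # Keep only the points selected by the dose-threshold indicator.
    kept = [g for g, d in zip(gamma, ref_dose) if d >= threshold]
    if not kept:
        raise ValueError("no reference point reaches the dose threshold")
    # Count the points satisfying the passing criterion of Eq. 2.
    return sum(1 for g in kept if g <= 1.0) / len(kept)
```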

3.1.1 Minimization Problem:

With the GPR formulation in Eq. 4, maximizing the GPR amounts to minimizing the corresponding loss function \(L_{GPR}^{ \delta }\), which takes values in [0, 1]:

$$\begin{aligned} L_{GPR}^{ \delta }(D_r, D_e) = 1 - \frac{\sum _{P_r \in D_r} \mathbbm {1}_{D_r \ge \delta }(P_r) \cdot \mathbbm {1}_{\varGamma \le 1}(P_r)}{\sum _{P_r \in D_r} \mathbbm {1}_{D_r \ge \delta }(P_r)} = 1 - GPR \end{aligned}$$
(5)

Due to the fact that the indicator function \(\mathbbm {1}_{D_r \ge \delta }\) does not depend on \(\varGamma \), the gradient of \(L_{GPR}^{ \delta }\) (with respect to the trainable parameters) can be written as follows:

$$\begin{aligned} \frac{\partial L_{GPR}^{ \delta }}{\partial w} = - \frac{1}{\sum _{P_r \in D_r} \mathbbm {1}_{D_r \ge \delta }(P_r)} \cdot \sum _{P_r \in D_r} \mathbbm {1}_{D_r \ge \delta }(P_r) \frac{ \partial \mathbbm {1}_{\varGamma \le 1}(P_r)}{\partial \varGamma } \frac{ \partial \varGamma (P_r)}{\partial w}\, , \end{aligned}$$
(6)

where w represents any of the trainable parameters of the neural network.

The problem with the above definition of \(L_{GPR}^{ \delta }\) is that its gradient is zero almost everywhere, a direct consequence of the fact that the indicator function \(\mathbbm {1}_{\varGamma \le 1}(\cdot )\) is piecewise constant with respect to \(\varGamma \). This prevents SGD training. To address this issue, we propose in the following a soft approximation of the objective function \(L_{GPR}^{ \delta }\) with non-zero gradients.
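The absence of a usable gradient can be checked numerically. In the sketch below, a small perturbation of any gamma value leaves the hard loss unchanged unless the value crosses 1, so the finite-difference slope is exactly zero. The helper names and the step size are assumptions for this example.

```python
def hard_gpr_loss(gamma, ref_dose, threshold):
    """Hard loss of Eq. 5: one minus the fraction of passing points above the threshold."""
    kept = [g for g, d in zip(gamma, ref_dose) if d >= threshold]
    return 1.0 - sum(1 for g in kept if g <= 1.0) / len(kept)


def finite_difference(gamma, ref_dose, threshold, index, step=1e-4):
    """Forward-difference slope of the hard loss with respect to gamma[index]."""
    shifted = list(gamma)
    shifted[index] += step
    before = hard_gpr_loss(gamma, ref_dose, threshold)
    after = hard_gpr_loss(shifted, ref_dose, threshold)
    return (after - before) / step
```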

3.2 Soft Counting with Sigmoid-GPR

To avoid the propagation of null gradients, we propose to use the sigmoid function, \(\sigma (x) = \left( 1+e^{-x}\right) ^{-1}\), evaluated at \(\beta x\), to approximate the counting of passing voxels. The slope of the sigmoid depends on its sharpness \(\beta \), which we consider as a hyperparameter.

Moreover, we note that for all \(P_r \in D_r\) such that \(\varGamma (P_r) \ne 1\), it holds that:

$$\begin{aligned} \lim _{\beta \rightarrow +\infty } \sigma (\beta \cdot (1 - \varGamma (P_r))) = \mathbbm {1}_{\varGamma \le 1}(P_r) \end{aligned}$$
(7)

Hence, the asymptotic behaviour of the sigmoid function combined with shifting the gamma index values can provide an estimate of the count of passing voxels by summation over all points \(P_r\). The accuracy of the estimation then depends on the value of \(\beta \): the bigger the \(\beta \), the more precise the estimation will be.

Thus, we approximate the loss \(L_{GPR}^{ \delta }\) in Eq. 5 with \(L_{\sigma - GPR}^{ \delta }\) defined using the sigmoid function:

$$\begin{aligned} L_{\sigma - GPR}^{ \delta } = 1 - \frac{\sum _{P_r \in D_r} { \sigma (\beta \cdot (1 - \varGamma (P_r))) \mathbbm {1}_{D_r \ge \delta }(P_r)}}{\sum _{P_r \in D_r} { \mathbbm {1}_{D_r \ge \delta }(P_r)}} \end{aligned}$$
(8)

\(L_{\sigma - GPR}^{ \delta }\) is differentiable everywhere and, provided the sharpness \(\beta \) is not too high, gradients are non-zero and allow gradient descent to update the model’s weights during backpropagation.

Given Eq. 7, we remark that \(L_{\sigma - GPR}^{ \delta }\) accurately approximates the true GPR loss function, i.e., \(L_{\sigma - GPR}^{ \delta } \rightarrow L_{GPR}^{ \delta }\) as \(\beta \rightarrow +\infty \).
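The following sketch illustrates Eq. 8 and the two regimes discussed above: for a large \(\beta \) the soft loss approaches the hard loss, while its gradient with respect to the gamma values becomes vanishingly small. The function names are assumptions for this example, and the gradient is written by hand instead of relying on an automatic differentiation framework.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_gpr_loss(gamma, ref_dose, threshold, beta):
    """Sigmoid-GPR loss of Eq. 8 on flat lists of gamma and reference dose values."""
    kept = [g for g, d in zip(gamma, ref_dose) if d >= threshold]
    return 1.0 - sum(sigmoid(beta * (1.0 - g)) for g in kept) / len(kept)


def soft_gpr_grad(gamma, ref_dose, threshold, beta):
    """Derivative of the soft loss w.r.t. each gamma value: beta * s * (1 - s) / N."""
    n = sum(1 for d in ref_dose if d >= threshold)
    grad = []
    for g, d in zip(gamma, ref_dose):
        if d >= threshold:
            s = sigmoid(beta * (1.0 - g))
            grad.append(beta * s * (1.0 - s) / n)
        else:
            grad.append(0.0)  # points below the dose threshold do not contribute
    return grad
```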

Annealing Schedule of \(\boldsymbol{\beta }\): In light of the equations above, we propose to consider \(\beta \) as a hyperparameter. At the beginning of training, the model usually predicts poorly and the majority of voxels fail to satisfy the gamma index passing criterion. This implies that the corresponding loss computed with \(L_{\sigma - GPR}^{ \delta }\) generates vanishing gradients almost everywhere if the value of \(\beta \) is set too high. To avoid this behaviour, we propose an annealing schedule for \(\beta \) that starts with low initial values and progressively increases \(\beta \) over the training. Moreover, when \(\beta \sim 0^+\), the Taylor series expansion of the sigmoid function yields:

$$\begin{aligned} \sigma (\beta \cdot (1 - \varGamma (P_r))) \approx \frac{1}{2} + \frac{\beta }{4} \cdot (1 - \varGamma (P_r)) \end{aligned}$$
(9)

Given Eq. 9, we can write the first-order Taylor expansion of \(L_{\sigma - GPR}^{ \delta }\) when \(\beta \sim 0^+\):

$$\begin{aligned} L_{\sigma - GPR}^{ \delta } \approx \frac{1}{2} - \frac{\beta }{4} + \frac{\beta }{4} \cdot \frac{\sum _{P_r \in D_r} \varGamma (P_r) \mathbbm {1}_{D_r \ge \delta }(P_r)}{\sum _{P_r \in D_r} \mathbbm {1}_{D_r \ge \delta }(P_r)} \end{aligned}$$
(10)

Consequently, we introduce the loss function \(L_{\varGamma }^{\delta }(D_r, D_e)\) to model the linear behaviour of \(L_{\sigma - GPR}^{ \delta }\) at the start of the annealing schedule:

$$\begin{aligned} L_{\varGamma }^{\delta }(D_r, D_e) = \frac{\sum _{P_r \in D_r} \varGamma (P_r) \cdot \mathbbm {1}_{D_r \ge \delta }(P_r)}{\sum _{P_r \in D_r} \mathbbm {1}_{D_r \ge \delta }(P_r)} \end{aligned}$$
(11)
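The link between the sigmoid loss at small \(\beta \) and the mean gamma index can be checked numerically. In the sketch below, two gamma maps with the same mean but very different values give nearly the same sigmoid loss at \(\beta = 0.02\), and clearly different losses at \(\beta = 5\). For brevity, the example assumes that every point lies above the dose threshold; the function names are illustrative.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_loss(gamma, beta):
    """Sigmoid-GPR loss when all points are above the dose threshold (assumption)."""
    return 1.0 - sum(sigmoid(beta * (1.0 - g)) for g in gamma) / len(gamma)


def mean_gamma_loss(gamma):
    """Mean gamma index (Eq. 11) under the same assumption."""
    return sum(gamma) / len(gamma)


spread = [0.0, 3.0]   # one perfect voxel, one badly failing voxel
uniform = [1.5, 1.5]  # same mean gamma, both voxels failing
gap_small_beta = abs(soft_loss(spread, 0.02) - soft_loss(uniform, 0.02))
gap_large_beta = abs(soft_loss(spread, 5.0) - soft_loss(uniform, 5.0))
```

At small \(\beta \), the loss therefore only sees the average gamma value, whereas at larger \(\beta \) it rewards the number of voxels that actually pass.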

As the training continues and the loss decreases, the annealing scheme progressively increases \(\beta \) in order to improve the approximation of the GPR loss \(L_{GPR}^{ \delta }\) defined in Eq. 5. As \(\beta \) takes larger values, minimizing the loss amounts to getting failing voxels (voxels with \(\varGamma > 1\)) to satisfy the passing criterion.

To model the behaviour of \(L_{\sigma - GPR}^{ \delta }\) at this stage (i.e. as \(\beta \) takes larger values), we modify \(L_{\varGamma }^{\delta }\) in Eq. 11 by introducing the loss function \(L_{\varGamma > 1}^{\delta }\). To prevent backpropagation of zero gradients with respect to \(\mathbbm {1}_{\varGamma > 1}\), we use the stopgrad operation:

$$\begin{aligned} L_{\varGamma> 1}^{\delta }(D_r, D_e) = \frac{\sum _{P_r \in D_r} \varGamma (P_r) \cdot \mathbbm {1}_{D_r \ge \delta }(P_r) \cdot stopgrad(\mathbbm {1}_{\varGamma > 1}(P_r))}{\sum _{P_r \in D_r} \mathbbm {1}_{D_r \ge \delta }(P_r)} \end{aligned}$$
(12)

For the sake of characterizing the behaviour of \(L_{\sigma - GPR}^{ \delta }\) in Eq. 8, we also study trainings that involve the use of loss functions \(L_{\varGamma }^{\delta }\) and \(L_{\varGamma > 1}^{\delta }\) in the following experiments.
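A minimal sketch of Eqs. 11 and 12 is given below. With automatic differentiation, stopgrad treats the indicator \(\mathbbm {1}_{\varGamma > 1}\) as a constant; here the resulting gradient with respect to each gamma value is written by hand. The function name and the flat-list inputs are assumptions for this example.

```python
def masked_gamma_losses(gamma, ref_dose, threshold):
    """Return (L_gamma, L_gamma_gt1, grad of L_gamma_gt1 w.r.t. each gamma value)."""
    mask = [d >= threshold for d in ref_dose]
    n = sum(mask)
    if n == 0:
        raise ValueError("no reference point reaches the dose threshold")
    # Eq. 11: mean gamma index over the points above the dose threshold.
    l_gamma = sum(g for g, m in zip(gamma, mask) if m) / n
    # Eq. 12: only failing points (gamma > 1) contribute to the numerator.
    l_fail = sum(g for g, m in zip(gamma, mask) if m and g > 1.0) / n
    # With the indicator held constant (stopgrad), the derivative is 1/N on
    # failing points above the threshold and 0 everywhere else.
    grad_fail = [1.0 / n if (m and g > 1.0) else 0.0 for g, m in zip(gamma, mask)]
    return l_gamma, l_fail, grad_fail
```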

3.3 Accelerating Gamma Index Matrix Computations

Having a differentiable GPR loss does not make it directly applicable for neural network training since, by definition, it requires iterating over all voxels in the given distributions, therefore leading to prohibitive computation time when considering high-resolution distributions. To deal with this issue, we propose an accelerated version of GPR for faster calculations.

To avoid physical incoherence when computing gamma index values, we resample the evaluated and reference distributions to a resolution of 1 mm\(^3\) with bilinear interpolation. From the definition in Eq. 1, one can observe that evaluated voxels located farther than DTA mm from \(P_r\) automatically yield a gamma value greater than 1. Thus, we limit the search to a fixed vicinity defined by the chosen DTA. More precisely, the gamma index value of a reference point is obtained by comparing the gamma values computed with the voxels of a cube comprising \((2 \times DTA +1)^k\) voxels, in the case of k-dimensional dose distributions. We then use unfolding to extract sliding local blocks of the evaluated distribution generated by the model. This operation creates one channel per voxel in the vicinity defined by the DTA. Finally, we apply the minimum operation over the channel dimension to get the minimal gamma index value.

The approach enables fast computation of the gamma index distribution \(\varGamma \), which is necessary for the calculations of \(L_{\sigma -GPR}^{\delta }\), \(L_{\varGamma }^{\delta }\) and \(L_{\varGamma > 1}^{\delta }\) presented in Eqs. 8, 11 and 12. Computation times are discussed in Sect. 5.
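The structure of this computation can be sketched in plain Python for a 2D map on a 1 mm grid: one "channel" of candidate gamma values is built per offset in the \((2 \times DTA + 1)^2\) window, and the gamma map is the minimum over channels. In a tensor framework, the loops below would be replaced by an unfold operation and a minimum over the channel dimension. The function name, the out-of-grid handling with infinite values, and the floor `eps` on the local dose tolerance are assumptions for this example.

```python
import math
from itertools import product


def gamma_map_unfold(ref, ev, dta=2, tol_frac=0.02, eps=1e-8):
    """Gamma map of a 2D dose grid (1 mm spacing assumed) via per-offset channels."""
    h, w = len(ref), len(ref[0])
    channels = []
    # One channel per voxel of the (2 * dta + 1)^2 vicinity.
    for dy, dx in product(range(-dta, dta + 1), repeat=2):
        dist2 = (dy * dy + dx * dx) / dta ** 2  # spatial term, constant per channel
        chan = [[math.inf] * w for _ in range(h)]  # inf marks neighbours outside the grid
        for y in range(h):
            for x in range(w):
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    delta = max(tol_frac * ref[y][x], eps)
                    dose2 = ((ev[yy][xx] - ref[y][x]) / delta) ** 2
                    chan[y][x] = math.sqrt(dist2 + dose2)
        channels.append(chan)
    # Minimum over the channel dimension gives the gamma index of each reference voxel.
    return [[min(c[y][x] for c in channels) for x in range(w)] for y in range(h)]
```

Corner voxels of the window lie farther than the DTA from the centre and can only produce gamma values above 1, so including them does not change which voxels pass.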

4 Dataset and Experimental Design

Dataset: We carried out the experiments on the publicly available dataset presented in [10] comprising 50 patients treated with VMAT plans. Each patient has a reference dose distribution computed from \(1 \times 10^{11}\) particles and a low precision simulation computed from \(1 \times 10^9\) particles. The main goal of the methods benchmarked on this task is to generate the high precision simulation of the dose from the available low precision one. More details about the dataset can be found in the original publication. For our experiments, we split the patients 35-5-10 into the training, validation and test sets, respectively. The cases in the dataset correspond to various anatomies; we therefore split them as equally as possible between sets to avoid biases.

Even though our approach enables training on 3D dose distributions, the dataset comprises a small number of samples. Thus, we decided to carry out the experiments in 2D to favour significant experiments and a relevant benchmark. In this setting, a training sample corresponds to an axial slice of a patient’s dose volume. The 2D training dataset therefore comprises around 11k training samples, where a sample is a pair of corresponding slices of low precision and high precision dose simulation.

Preprocessing: We normalized both low precision and reference distributions using the average dose maximum computed over the reference dose volumes from the training set. We then applied the same normalization to the validation and test sets. To enable batch training, we padded each training sample with zeros in order to match a fixed size of \(256 \times 256\). To further help the model generate accurate dose predictions, we added the corresponding CT slice as a second input channel to incorporate the corresponding anatomy. We applied min-max normalization to CT volumes so that voxel values remain in the [0, 1] range.
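These preprocessing steps can be sketched as follows. The helper names, the bottom/right placement of the zero padding, and the flat-list representation of CT values are assumptions for this example; the text above does not specify them.

```python
def normalize_dose(slice_2d, mean_max_dose):
    """Divide a dose slice by the average dose maximum of the training reference volumes."""
    return [[v / mean_max_dose for v in row] for row in slice_2d]


def pad_to_square(slice_2d, size=256, fill=0.0):
    """Zero-pad a 2D slice on the bottom and right (assumed placement) to size x size."""
    h, w = len(slice_2d), len(slice_2d[0])
    if h > size or w > size:
        raise ValueError("slice is larger than the target size")
    padded = [list(row) + [fill] * (size - w) for row in slice_2d]
    padded += [[fill] * size for _ in range(size - h)]
    return padded


def min_max_normalize(values):
    """Rescale a flat list of CT values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```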

Model: In all experiments, the model is a standard UNet architecture [14] with skip connections between the encoder and the decoder. The encoder part of the model performs downsampling twice with convolutional layers using \(4\times 4\) filters and a stride of 2. Symmetrically, transposed convolutions upsample feature maps in the decoder. Each stage of the UNet comprises two convolution blocks before downsampling or upsampling. Much like the convolutional block presented by Liu et al. in [8], a convolution block first applies a convolution with \(7\times 7\) filters and \(3\times 3\) padding, and then two convolutions with \(3\times 3\) filters to further process the feature maps. Each convolution is followed by a Gaussian Error Linear Unit (GELU) activation. The block ends with a residual connection to keep high frequency details from the block’s input. Overall, the model has around 10 million trainable parameters.

Optimization Set-Up: In all trainings, we trained the model using the AdamW optimizer [9]. We set the initial learning rate to \(3 \times 10^{-4}\) and decreased it progressively during training when the validation loss stagnated. Weight decay was set to \(5 \times 10^{-4}\) and the batch size to 16. We trained for at most 20k iterations on an NVIDIA GeForce RTX 3090 GPU. Trainings were stopped when overfitting appeared, following an early stopping strategy: training stopped when the validation loss failed to decrease by 2% over 500 iterations. With this training scheme, early stopping occurred after around 15k iterations.
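One possible reading of this early stopping rule is sketched below: training stops once 500 iterations have passed without the validation loss improving on its best value by at least 2% in relative terms. The class name, its interface, and this interpretation of the criterion are assumptions for illustration.

```python
import math


class EarlyStopping:
    """Stop when the validation loss has not improved by min_rel over patience iterations."""

    def __init__(self, min_rel_improvement=0.02, patience_iters=500):
        self.min_rel = min_rel_improvement
        self.patience = patience_iters
        self.best = math.inf
        self.best_iter = 0

    def should_stop(self, iteration, val_loss):
        # A new best must beat the previous best by at least min_rel (relative).
        if val_loss < self.best * (1.0 - self.min_rel):
            self.best = val_loss
            self.best_iter = iteration
            return False
        return iteration - self.best_iter >= self.patience
```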

Loss Functions: To train with the GPR-based loss using sigmoid count \(L_{\sigma - GPR}^{ \delta }\) presented in Eq. 8, we designed the following annealing schedule for the sharpness parameter \(\beta \). We set the initial value of \(\beta \) to \(2\times 10^{-2}\) for the first 150 iterations. Then, \(\beta \) increased by 5% every 50 iterations until it reached an intermediate value of 3, after which updates slowed down to 5% every 100 iterations. Updates stopped when \(\beta \) reached a chosen ceiling value of \(\beta _{max} = 5\). Setting \(\beta _{max}\) prevented the slope of the sigmoid from getting too sharp and the loss from encountering a vanishing gradient problem, which would stop the updates of gradient descent. Additional benchmarks with the approximating loss functions \(L_{\varGamma }^{\delta }\) and \(L_{\varGamma > 1}^{\delta }\) were also conducted, in order to better characterize the behaviour of \(L_{\sigma - GPR}^{ \delta }\) for small values of \(\beta \).
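The annealing schedule described above can be sketched as a simple per-iteration simulation. The function name, the exact iteration at which each update fires, and the clamping of the last update to \(\beta _{max}\) are assumptions for this example.

```python
def beta_schedule(n_iters, beta0=0.02, warmup=150, mid=3.0, beta_max=5.0,
                  rate=1.05, fast_every=50, slow_every=100):
    """Return the list of beta values used at iterations 0 .. n_iters - 1."""
    betas = []
    beta = beta0
    since_update = 0
    for it in range(n_iters):
        if it >= warmup and beta < beta_max:
            since_update += 1
            # Updates every 50 iterations below the intermediate value, every 100 above.
            period = fast_every if beta < mid else slow_every
            if since_update >= period:
                beta = min(beta * rate, beta_max)  # clamp to the ceiling (assumption)
                since_update = 0
        betas.append(beta)
    return betas
```

Under these assumptions, \(\beta \) reaches its ceiling well before the 20k-iteration budget, so the later part of training runs at a fixed sharpness.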

For all GPR-based functions, we set the dose threshold \(\delta \) to 20%. This means that, while the loss functions compute the gamma index distribution by considering all voxels, the computed approximated GPR value takes into account only voxels \(P_r\) for which the dose value is at least 20% of the maximum dose of the reference distribution, i.e. \(D_r(P_r) \ge 20\% \cdot \max _{P_r \in D_r}{D_r(P_r)}\).

To benchmark against our proposed GPR-based loss functions, we considered several other loss functions commonly used in computer vision. The benchmark includes the MAE and the MSE for a comparison with pixel-wise errors. Finally, we considered the combination of pixel-wise errors with the SSIM. More precisely, the benchmark includes SSIM-MAE and SSIM-MSE, which are the equally weighted sum of respectively the SSIM and MAE, and the SSIM and MSE. For each training on the loss functions considered above, we used the exact same model architecture and optimization strategy, in order to promote the reliability and fairness of the comparison.

5 Results

5.1 Training with GPR-Based Loss Functions

An extensive quantitative comparison on the test set for each training, using the MAE, MSE, SSIM and GPR with various values of DTA and dose tolerance \(\varDelta \), is summarised in Table 1. As the test set comprises 10 patients, we computed the metrics over each slice of each patient’s volume, and then averaged over the test set for each metric. The results show that models trained with GPR-based loss functions tend to outperform the others with respect to the GPR, the MAE and the MSE. In contrast, models trained with SSIM-MAE and SSIM-MSE show the highest SSIM scores. On closer inspection, however, they report among the lowest performance on the remaining metrics. This result indicates that the SSIM may not be a well-suited metric to evaluate the quality of dose distributions, since high SSIM scores do not coincide with good GPR, MAE or MSE values.

Table 1. Evaluation metrics over the dose distributions comprised in the test set. Different benchmarks over the considered loss functions for different metrics are highlighted with their mean and standard deviation. With bold we indicate the best performing methods per metric.

To assess the statistical significance of the results, we take an in-depth look at each patient in the test set to explain the high standard deviation values observed in Table 1. Boxplots a), b) and d) in Fig. 1 point out the presence of an outlier patient case on which models tend to fail with respect to the GPR, SSIM and MSE. In contrast with SSIM, MSE and MAE-trained models, we observe that models trained with GPR-based loss functions not only display robustness to this outlier, but also show a smaller standard deviation over the whole test set.

Fig. 1. Boxplots representing the evaluation metrics achieved by trained models for each case in the test set depending on the loss function used for training. The y axis indicates the values of the considered metric. The x axis specifies the loss function with which the corresponding model was trained.

Figure 1 also allows us to compare discrepancies within the family of GPR-based loss functions. While all of them produce better performing models with respect to all evaluation metrics except the SSIM, the loss function with sigmoid counting \(L_{\sigma -GPR}^{\delta }\) outperforms \(L_{\varGamma }^{\delta }\) and \(L_{\varGamma > 1}^{\delta }\). We explain this behaviour by the fact that both \(L_{\varGamma }^{\delta }\) and \(L_{\varGamma > 1}^{\delta }\) focus only on minimizing gamma index values, and not on directly maximizing the number of voxels satisfying the passing criterion. We conclude that \(L_{\sigma -GPR}^{\delta }\) yields better maximization of the GPR and is therefore the better approximation of the true GPR loss function \(L_{GPR}^{\delta }\).

Fig. 2. First row from left to right: a single slice of the 1e9 dose volume, predictions of models trained with MSE and \(L_{\sigma -GPR}^{\delta }\), and reference 1e11 dose. Second row: gamma index maps for the three different representations.

The MSE-trained model outperforms other models trained with non GPR-based loss functions with respect to the GPR, so we chose to display its dose prediction conjointly with the dose generated by the \(L_{\sigma -GPR}^{\delta }\) trained model in Fig. 2. Although both trainings achieved convergence, the prediction of the MSE-trained model manifests important artefacts at the bottom of the generated dose. Additionally, the dose itself seems to be smoother than the dose predicted with the \(L_{\sigma -GPR}^{\delta }\) training. Finally, the MSE-trained model appears to overestimate the dose in low-dose regions to a greater extent than the \(L_{\sigma -GPR}^{\delta }\)-trained model.

Table 2. Speed comparison of metrics computed over 2D or 3D dose distributions.

Fig. 3. Boxplots of execution times of the SSIM, our proposed approach and the exhaustive search method on 3D dose distributions.

5.2 Speed-Up of GPR-Acceleration Approach

In an effort to promote the GPR-based loss functions as viable deep learning optimization criteria that allow fast error computations and training, we had to accelerate gamma index computations. To quantify the extent of our acceleration approach, we benchmark against two methods. The first one is a GPU-accelerated exhaustive search approach in a limited vicinity of 3 mm\(^3\) around the considered reference voxel. The second is an open-source tool from PyMedPhys [2] which makes use of acceleration ideas from Wendling et al. [18] and executes on CPU and is single-threaded. Regarding the latter, we limit the interpolation ratio to 2 to have a fair comparison.

The time estimation was twofold. We timed each evaluation metric and GPR-based loss function on 3D or 2D distributions stemming from the MC dataset used for the experiments. 3D dose distributions were interpolated to a resolution of 1 mm\(^3\), giving a shape of \(128 \times 200 \times 200\), i.e. around \(5 \times 10^6\) voxels. The 2D dose distributions comprised axial slices of the 3D dose distributions and were interpolated to a size of \(400 \times 400\). For the GPR calculations, we set the DTA and \(\varDelta \) to 2 mm and 2%, respectively. Execution times are displayed in Table 2.

Figure 3 and Table 2 highlight that our approach has a speed equivalent to that of the SSIM. Compared to the exhaustive search method, our approach speeds up gamma index computations by a factor of at least 30 for 3D dose distributions and by a factor of two for 2D distributions. Consistent with these results, trainings took around 24 h for the SSIM-MAE, SSIM-MSE and GPR-based loss functions, whereas they lasted 15 h for the MAE and the MSE. The results therefore support the use of our GPR-based loss functions to train deep neural networks efficiently. Nonetheless, our comparison is limited to speed assessment and does not encompass RAM usage and precision considerations. Although our approach achieves a significant speed gain in the computation of the GPR metric and, by extension, of the GPR-based loss functions presented in this study, it comes at the price of an increased RAM usage caused by the unfolding operation.
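To give a sense of this memory cost, the rough estimate below counts the values held by a fully materialized unfolded tensor for the settings of this section (DTA of 2 mm at 1 mm resolution, i.e. a radius of 2 voxels). Single-precision storage (4 bytes per value) and full materialization of every channel are assumptions; actual usage depends on the implementation.

```python
import math


def unfolded_bytes(shape, dta_vox, bytes_per_value=4):
    """Memory of an unfolded tensor with one channel per voxel of the DTA vicinity."""
    n_voxels = math.prod(shape)
    n_channels = (2 * dta_vox + 1) ** len(shape)  # (2 * DTA + 1)^k window
    return n_voxels * n_channels * bytes_per_value


bytes_3d = unfolded_bytes((128, 200, 200), dta_vox=2)  # 125 channels
bytes_2d = unfolded_bytes((400, 400), dta_vox=2)       # 25 channels
```

Under these assumptions, the 3D case requires about 2.56 GB for a single distribution, against 16 MB in 2D.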

We note that, for all loss functions, the obtained GPRs do not meet the 95% GPR threshold indicating clinical validation. Nevertheless, the goal of the experiments was to show the benefits of directly optimizing the clinical metric during training, and the results support that statement.

6 Conclusion

Adopting the correct optimization criterion is essential to train deep learning models in line with the task they are designed to solve. For the task of accelerating MC radiotherapy dose simulation with deep learning, this work shows that directly optimizing models with the clinical validation metric yields a significant improvement in predicted dose quality when compared to other loss functions. We provide a fast computation of the GPR to enable such results. Moreover, the GPR is a similarity metric for distributions in general, and may be applied to other tasks such as radiotherapy dose generation or even finding adversarial examples for generative adversarial networks. Future work will focus on addressing the remaining limitations of our approach and assessing the potential of our new class of loss functions in solving other deep learning tasks.