1 Introduction

Intracranial hemorrhage (ICH) refers to bleeding within the skull, a serious medical emergency that can cause severe disability or even death [1]. A characteristic symptom of severe ICH is brain midline shift (MLS), which is the lateral displacement of midline cerebral structures (see Fig. 1). MLS is an important and quantifiable indicator of the severity of mass effects and the urgency of intervention [2, 3, 9]. For instance, an MLS threshold of 5 millimeters (mm) is frequently used to trigger immediate intervention and close monitoring [4]. MLS quantification demands high accuracy and efficiency, which are difficult to achieve manually, especially in emergencies, because of the variability in shift regions, unclear landmark boundaries, and non-standard scanning poses. An automated algorithm that can quantify MLS immediately and accurately is therefore highly desirable for identifying urgent patients for timely treatment.

To measure MLS, clinicians usually first identify a few CT slices with large shifts and then measure the maximum deviation of landmarks such as the septum pellucidum, third ventricle, or falx from their normal counterparts as the final MLS distance (see examples in Fig. 1). This clinical procedure is difficult to translate into a well-defined automated process. Currently, there are only a limited number of studies on automated MLS quantification, and they use different strategies with varied labeling requirements. Nguyen et al. proposed a landmark-based method that relies on anatomical markers to determine the location of the deformed midline [9]. However, this method applies only to cases where MLS appears in these specific marker regions. Liao et al. adopted a symmetry-based method that seeks a curve connecting all deformed structures [10], which is difficult to generalize because of over-simplified anatomical assumptions and sensitivity to the patient's scan pose. A few recent works try to overcome these limitations by using stronger supervision with dense labeling. Some studies formulated MLS quantification as a midline segmentation task [5,6,7], delineating the intact midline as labels to supervise the training of segmentation models. Another study designed a hemisphere segmentation task to quantify MLS [8], which requires pixel-wise annotation for each slice. However, obtaining such dense annotations is very costly and time-consuming, and may not be necessary for MLS quantification.

Fig. 1. Examples of head CT scans illustrating how radiologists measure MLS. The dashed red line connecting the anterior falx and posterior falx denotes a hypothetical normal midline. Blue circles denote the shifted landmarks. The perpendicular red lines from the shifted landmarks to the normal midline are measured as the MLS. (Color figure online)

To tackle these limitations, we propose to formulate MLS quantification as a deformation prediction problem solved with semi-supervised learning (SSL) from only limited annotations. Our framework avoids the strong dependency on specific landmarks and the over-simplified assumptions of previous methods without increasing the labeling effort. We use only sparse and weak labels as ground-truth supervision, namely one shifted landmark and its normal counterpart on a limited number of slices provided by radiologists, and we fully exploit the unlabeled slices and non-MLS data to impose extra regularization for the sparse-to-dense extension. Existing SSL methods typically use a model partially trained on labeled data to generate pseudo labels for unlabeled data, assuming that labeled and unlabeled data are generally similar. These methods can be sub-optimal in our case because labeled MLS slices usually present the largest deformation, while unlabeled slices contain only minor or no deformation. Instead, our SSL strategy generates a corresponding non-MLS image for each unlabeled MLS slice with generative models and requires that the deformation field warp the generated non-MLS image into the original MLS one. However, as we only have volume-wise labels for MLS and non-MLS classification, it can be difficult to train a slice-wise discriminator as required by many generative models such as GANs [12]. Fortunately, the recently proposed diffusion models [15], which have proven powerful in both distribution learning and image generation without depending on discriminators, offer a potentially good solution.

In this work, we propose a novel semi-supervised learning framework based on diffusion models to quantify brain MLS from head CT images via deformation prediction. Our method effectively exploits supervision and regularization from all types of available data, including MLS images with sparse ground-truth labels, MLS images without labels, and non-MLS images. We validate our method on a real clinical head CT dataset and show the effectiveness of each proposed component. Our contributions include: (1) introducing an effective deformation strategy for brain MLS quantification, (2) incorporating diffusion models as a representation learner to extract features reflecting where and how an MLS image differs from a non-MLS image, and (3) proposing a diffusion model-based semi-supervised framework that can effectively leverage massive unlabeled data to improve model performance.

2 Methods

Figure 2 illustrates our diffusion model-based semi-supervised learning framework for MLS quantification via deformation prediction. In Sect. 2.1, we introduce our deformation strategy with only sparse supervision. In Sect. 2.2, we propose to incorporate non-MLS data for representation learning. In Sect. 2.3, we describe how to utilize unlabeled MLS images for sparse-to-dense regularization.

Fig. 2. The pipeline of our proposed semi-supervised deformation strategy for MLS quantification. Sparse labels supervise the labeled image \(x^{l}\), and the unlabeled image \(x^{u}\) is self-supervised with the generated negative image \(x'^{u}\).

2.1 MLS Quantification Through Deformation Estimation

Our proposed deformation strategy for brain MLS quantification aims to find an optimal deformation field \(\phi \) such that an MLS image can be regarded as a hypothetical non-MLS image warped with this deformation field. The deformation field can be parameterized by a function of high complexity, so that it does not explicitly rely on a single landmark or on over-simplified symmetry assumptions, which naturally overcomes the limitations of existing methods. We apply a learning-based framework and parameterize the deformation field with a U-shaped neural network. The input to the network is an individual 2D slice and the output is the stationary velocity field v. The diffeomorphic deformation field \(\phi \) is then calculated by integrating the velocity field, similarly to VoxelMorph [11] for image registration. The learning process is supervised by sparse deformation ground truth. For each labeled slice, we have the ground truth \(\textbf{y}=(y_1, y_2)\), a two-dimensional vector pointing from the shifted landmark toward its presumed normal location (the red arrow in Fig. 2). The predicted deformation \(\hat{\textbf{y}}\), also a two-dimensional vector, is bilinearly interpolated from the deformation field at the shifted landmark. To alleviate the influence of a few extremely large deformation points and increase the model's robustness, we use the Huber loss to measure the discrepancy between the predicted deformation and the label:

$$\begin{aligned} l_{\text {huber}}(y_d, \hat{y}_d)=\left\{ \begin{aligned}&|y_d-\hat{y}_d|,&|y_d-\hat{y}_d| \ge c, \\&\frac{(y_d-\hat{y}_d)^2+c^2}{2c},&|y_d-\hat{y}_d| < c. \end{aligned} \right. \end{aligned}$$
(1)

where \(d \in \{1,2\}\). The hyperparameter c is the threshold that separates the absolute-error regime from the squared-error regime. We also encourage a smooth deformation field with a diffusion regularizer on the spatial gradients of the deformation \(\phi \) to avoid a discontinuous deformation field:

$$\begin{aligned} l_{\text {smooth}}=\sum _j\sum _k\Vert \phi _{jk}-\phi _{(j-1)k}\Vert ^2+\Vert \phi _{jk}-\phi _{j(k-1)}\Vert ^2. \end{aligned}$$
(2)

We adopt a coarse-to-fine scheme, in which velocity fields are generated through upsampling with skip connections to progressively aggregate features of different scales, making the model more adaptive to extremely large deformation.
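The sparse supervision of Eqs. (1) and (2) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `(H, W, 2)` field layout, the `(row, col)` landmark convention, and the choice to sum Eq. (1) over the two components are all assumptions made for the example.

```python
import numpy as np

def bilinear_sample(field, point):
    """Bilinearly interpolate an (H, W, 2) field at a continuous (row, col) point."""
    h, w, _ = field.shape
    r = float(np.clip(point[0], 0, h - 1))
    c = float(np.clip(point[1], 0, w - 1))
    r0, c0 = int(np.floor(r)), int(np.floor(c))
    r1, c1 = min(r0 + 1, h - 1), min(c0 + 1, w - 1)
    fr, fc = r - r0, c - c0
    top = (1 - fc) * field[r0, c0] + fc * field[r0, c1]
    bottom = (1 - fc) * field[r1, c0] + fc * field[r1, c1]
    return (1 - fr) * top + fr * bottom

def huber_loss(y, y_hat, c=3.0):
    """Eq. (1) per component, summed over d in {1, 2} (the sum is an assumption).

    As written in Eq. (1), a zero error still contributes c / 2 per component.
    """
    err = np.abs(np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float))
    per_dim = np.where(err >= c, err, (err ** 2 + c ** 2) / (2 * c))
    return float(per_dim.sum())

def smoothness_loss(field):
    """Eq. (2): squared finite differences of the field along rows and columns."""
    d_rows = field[1:, :, :] - field[:-1, :, :]
    d_cols = field[:, 1:, :] - field[:, :-1, :]
    return float((d_rows ** 2).sum() + (d_cols ** 2).sum())
```

For a labeled slice, the predicted vector would be `bilinear_sample(phi, landmark)` and the supervised term `huber_loss(y, y_hat)`; boundary differences are simply omitted in `smoothness_loss`.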

2.2 Learning Negative Patterns from Non-MLS Images

To learn a deformation field that warps a non-MLS image into an MLS one, we would ideally need paired non-MLS and MLS images for network training, which do not exist in practice. A naive substitute is to generate a corresponding non-MLS image. However, generated images entail some randomness and often lack important details, and depending too much on such synthetic inputs can lead to poor robustness. Inspired by the score-matching interpretation of diffusion models [17], we propose to learn the non-MLS distribution from a large number of negative cases. Given an MLS image, we can then evaluate which parts of the image make it different from a non-MLS image. This deviation can serve as a latent feature that helps the deformation network predict the deformation.

Diffusion models, especially DDPM [14], define the forward diffusion process as a Markov process that progressively adds Gaussian noise to a given image, and then approximate the reverse process with a Gaussian distribution. The forward process can be collapsed into one-step sampling: \(x_t = \sqrt{\alpha _t}x_0+\sqrt{1-\alpha _t}\epsilon \), where \(\alpha _t:=\prod ^t_{s=0}(1-\beta _s)\), \(\beta _s\) is a predefined variance schedule, and \(\epsilon \) is sampled from \(\mathcal {N}(0,I)\). The mean \(\mu _\theta (x_t,t)\) and variance \(\varSigma _\theta (x_t,t)\) of the reverse process can be parameterized by neural networks. A popular choice is to re-parameterize \(\mu _\theta (x_t,t)\) so that the network estimates \(\hat{\epsilon }_\theta (x_t, t)\), an approximation of the noise \(\epsilon \), instead of \(\mu _\theta (x_t,t)\). Moreover, the output of the diffusion network \(\hat{\epsilon }_\theta (x_t, t)\) is in effect a scaled score function \(\nabla \log p(x_t)\), as it moves the corrupted image in the direction opposite to the corruption [18].
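As an illustration of the one-step forward sampling, the sketch below builds a linear variance schedule and draws \(x_t\) from \(x_0\). The schedule endpoints and number of steps are taken from the values reported in Sect. 3.2; the function names and the zero-based indexing of t are assumptions made for this example.

```python
import numpy as np

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Linearly spaced noise variances beta_s."""
    return np.linspace(beta_start, beta_end, num_steps)

def cumulative_alpha(betas):
    """alpha_t = prod_{s<=t} (1 - beta_s), one value per step t."""
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alphas, rng):
    """Sample x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas[t]) * x0 + np.sqrt(1.0 - alphas[t]) * eps
    return x_t, eps

# Example: corrupt a dummy 256 x 256 slice at t = 600.
rng = np.random.default_rng(0)
alphas = cumulative_alpha(linear_beta_schedule())
x_t, eps = q_sample(np.zeros((256, 256)), 600, alphas, rng)
```

Because \(\alpha _t\) decreases monotonically toward zero, larger t leaves less of the original image in \(x_t\), which is the trade-off examined for the inference-time noise level in Sect. 3.4.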

As a result, by pre-training one unconditional diffusion model on all data (denoted as \(\mathcal {U}\)) and one conditional diffusion model on only non-MLS data (denoted as \(\mathcal {C}\)), the difference between the two outputs

$$\begin{aligned} \hat{\epsilon }_{\theta _\mathcal {U}}(x_t,t)-\hat{\epsilon }_{\theta _\mathcal {C}}(x_t,t) \propto \nabla \log p(x_t|n) - \nabla \log p(x_t) = \nabla \log p(n|x_t), \end{aligned}$$
(3)

can be regarded as the gradient of the class prediction (\(n=1\) for non-MLS and 0 otherwise) with respect to the input image, which reflects how the input image deviates from a non-MLS image. This latent feature contains information about how to transform the MLS-positive image into a non-MLS one and is therefore helpful for training the deformation network. Moreover, this feature representation fluctuates less with the randomness of the additive noise, because both terms are estimates of the same stochastic noise, whose contribution is largely cancelled by the subtraction. It is therefore more stable than either the predicted noise or the generated MLS-negative images. For training, we randomly sample t from 0 to the number of diffusion steps \(T_{\text {train}}\), while for inference we fix it to a certain value. We examine the effect of this value in Sect. 3.4.
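The last equality in Eq. (3) follows from Bayes' rule. A short derivation is given here under the standard assumption that a noise-prediction network approximates a negatively scaled score, \(\hat{\epsilon }_\theta (x_t,t)\approx -\sqrt{1-\alpha _t}\,\nabla _{x_t}\log p(x_t)\), and using the fact that the class prior \(p(n)\) does not depend on \(x_t\):

$$\begin{aligned} \log p(n|x_t)&=\log p(x_t|n)+\log p(n)-\log p(x_t), \\ \nabla _{x_t}\log p(n|x_t)&=\nabla _{x_t}\log p(x_t|n)-\nabla _{x_t}\log p(x_t), \\ \hat{\epsilon }_{\theta _\mathcal {U}}(x_t,t)-\hat{\epsilon }_{\theta _\mathcal {C}}(x_t,t)&\approx \sqrt{1-\alpha _t}\left( \nabla _{x_t}\log p(x_t|n)-\nabla _{x_t}\log p(x_t)\right) =\sqrt{1-\alpha _t}\,\nabla _{x_t}\log p(n|x_t). \end{aligned}$$

The proportionality constant \(\sqrt{1-\alpha _t}\) depends only on the noise level, which is consistent with fixing t at inference.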

2.3 Semi-supervised Deformation Regularization

Deformation estimation is a dense prediction problem, while we only have sparse supervision. This can lead to flickering and poor generalizability if the deformation is not properly regularized. On the other hand, we have a significant amount of unlabeled data from the MLS volumes that is potentially helpful. Therefore, we propose to include these unlabeled data during training in a semi-supervised manner, so that they can provide extra regularization or produce additional training examples based on noisy pseudo-labels. Many existing semi-supervised methods use the predictions for unlabeled data given by the same or a twin network as pseudo-labels, and then supervise the model or impose regularization with these pseudo-labels. However, these methods rely on the strong assumption that labeled and unlabeled data are drawn from the same distribution, which does not hold in our case because most labeled data have large deformation while unlabeled data have minor or no deformation. Therefore, we seek another type of pseudo-label that bypasses the distribution assumption. As the deformation field is assumed to warp a hypothetically normal image into an MLS one, we generate hypothetical non-MLS images \(x'_0\) using the pre-trained diffusion models through a series of denoising steps with classifier-free guidance [16]:

$$\begin{aligned} \hat{\epsilon }(x_t, t) = \lambda \hat{\epsilon }_{\theta _\mathcal {C}}(x_t,t) + (1-\lambda )\hat{\epsilon }_{\theta _\mathcal {U}}(x_t,t), \end{aligned}$$
(4)

where \(\lambda \) is a hyper-parameter controlling the strength of the guidance. We warp \(x'_0\) with the deformation field to obtain \(\phi (x'_0)\) and measure its similarity to the original \(x_0\) with an MSE loss \(l_\text {mse}\). Because the generative process involves a lot of random sampling, the generated image is unlikely to be fully faithful to the original image, so \(l_\text {mse}\) can only serve as noisy supervision. Therefore, instead of generating \(x'_0\) ahead of deformation network training, we generate it on the fly (i.e., generating new cases at each iteration) so that the effect of the noise can be averaged out.
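The guided noise estimate of Eq. (4) and one denoising update can be sketched as below. The combination rule follows Eq. (4); the deterministic DDIM update (no added noise per step) is an assumption made for this illustration, and the function names are hypothetical.

```python
import numpy as np

def guided_noise(eps_cond, eps_uncond, lam):
    """Eq. (4): lam * eps_C + (1 - lam) * eps_U."""
    return lam * eps_cond + (1.0 - lam) * eps_uncond

def ddim_step(x_t, eps, alpha_t, alpha_prev):
    """Deterministic DDIM update from noise level alpha_t to alpha_prev.

    alpha_t and alpha_prev are cumulative products as in the forward process.
    """
    # Estimate of the clean image implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_t) * eps) / np.sqrt(alpha_t)
    # Re-noise the estimate to the (lower) target noise level.
    return np.sqrt(alpha_prev) * x0_pred + np.sqrt(1.0 - alpha_prev) * eps
```

With \(\lambda =1\) the estimate reduces to the conditional (non-MLS) model and with \(\lambda =0\) to the unconditional one; values above 1 extrapolate away from the unconditional prediction, pushing the sample further toward the non-MLS class.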

The final MLS measurement is estimated as the length of the maximum displacement vector in the predicted deformation field, which makes it particularly sensitive to over-estimation. For unlabeled slices, we still have the prior that the MLS of a slice cannot be larger than the MLS \(\delta \) of the volume it belongs to. We therefore incorporate an additional ceiling loss to penalize over-estimation:

$$\begin{aligned} l_{\text {ceil}} = \sum _{j}\sum _{k} \max (0, || \phi _{jk} ||-\delta ). \end{aligned}$$
(5)
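A minimal NumPy sketch of the ceiling loss in Eq. (5) is given below. It assumes the deformation field is stored as an `(H, W, 2)` array and that the field and \(\delta \) are expressed in the same unit; both are assumptions of this example.

```python
import numpy as np

def ceiling_loss(field, delta):
    """Eq. (5): penalize displacement magnitudes that exceed the volume-level MLS."""
    magnitude = np.linalg.norm(field, axis=-1)  # per-pixel displacement length
    return float(np.maximum(0.0, magnitude - delta).sum())
```

Pixels whose displacement stays below \(\delta \) contribute nothing, so the term only acts on over-estimated regions.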

Overall, the loss is a combination of supervised and unsupervised terms, with weights controlling the relative importance of each term:

$$\begin{aligned} l = l_{\text {huber}} +w_1 l_{\text {smooth}} + u(i)(l_\text {mse} + w_2 l_{\text {ceil}}), \end{aligned}$$
(6)

where \(w_1\) and \(w_2\) are two fixed weight terms and u(i) is a time-varying weight term that is expected to gradually increase as the training iteration i progresses so that the training can converge quickly through strong supervision first and then refine and enhance generalizability via unsupervised loss.
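The combination in Eq. (6) can be sketched as follows. The linear ramp from 1 to 10 for u(i) matches the schedule stated in Sect. 3.2; the default values of `w1` and `w2` and the per-iteration granularity of the ramp are placeholders chosen for this example, not values confirmed by the text.

```python
def unsup_weight(i, total_iters, start=1.0, end=10.0):
    """Linearly increasing weight u(i) over training iterations 0 .. total_iters - 1."""
    frac = min(max(i / max(total_iters - 1, 1), 0.0), 1.0)
    return start + frac * (end - start)

def total_loss(l_huber, l_smooth, l_mse, l_ceil, i, total_iters, w1=1.0, w2=1.0):
    """Eq. (6): supervised terms plus the time-weighted unsupervised terms."""
    u = unsup_weight(i, total_iters)
    return l_huber + w1 * l_smooth + u * (l_mse + w2 * l_ceil)
```

Early in training the supervised Huber term dominates; as u(i) grows, the noisy image-similarity and ceiling terms gain influence.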

3 Experiments and Results

3.1 Data Acquisition and Preprocessing

We retrospectively collected anonymized thick-slice, non-contrast head CT scans of patients who were admitted with head trauma or stroke symptoms and diagnosed with various subtypes of intracranial hemorrhage, including epidural hemorrhage, subdural hemorrhage, subarachnoid hemorrhage, intraventricular hemorrhage, and intraparenchymal hemorrhage, between July 2019 and December 2019 at the Prince of Wales Hospital, a public hospital under the Hospital Authority of Hong Kong. Ethics approval was obtained from the Joint Chinese University of Hong Kong-New Territories East Cluster Clinical Research Ethics Committee. The eligible patients contributed 2793 CT volumes, of which 124 are MLS-positive cases. The MLS ranges between 2.24 mm and 20.12 mm, with a mean value of 8.34 mm and a median value of 8.73 mm. The annotation was performed by two trained physicians and verified by one experienced radiologist (with over 10 years of clinical experience in ICH). The labeling process followed a real clinical measurement pipeline, in which the shifted landmark, the anterior falx point, and the posterior falx point were marked, and the length of the perpendicular line from the landmark to the line connecting the anterior and posterior falx points was taken as the measured MLS value. For each volume, a few slices with large deformation were separately measured and annotated, and the largest shift served as the case-level label. On average, 4 out of 30 slices of each volume were labeled. All slices of non-MLS cases are unlabeled. We discarded the first 8 and the last 5 slices, as they mainly contain structures irrelevant to MLS. For pre-processing, we resampled all images to a pixel size of 0.86 mm and then cropped or padded the resulting images to a resolution of 256 \(\times \) 256 pixels. The HU window was set to [0, 80]. We applied intensity clipping (0.5 and 99.5 percentiles) and min-max normalization (between -1 and 1) to each image. Random rotation between \(-15^{\circ }\) and \(15^{\circ }\) was used for data augmentation.
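The intensity and size normalization described above could be sketched as follows. This is an illustration under stated assumptions: resampling to 0.86 mm and the random rotation are omitted, the crop or pad is assumed to be centered, and the padding value of -1 (the minimum after normalization) is a choice made for this example.

```python
import numpy as np

def window_and_normalize(hu_slice, hu_min=0.0, hu_max=80.0):
    """Apply the HU window, clip to the 0.5/99.5 percentiles, and scale to [-1, 1]."""
    img = np.clip(hu_slice.astype(np.float64), hu_min, hu_max)
    lo, hi = np.percentile(img, [0.5, 99.5])
    img = np.clip(img, lo, hi)
    if hi > lo:
        return 2.0 * (img - lo) / (hi - lo) - 1.0
    return np.zeros_like(img)  # constant slice: no contrast to normalize

def crop_or_pad(img, size=256, fill=-1.0):
    """Center-crop or center-pad a 2D image to size x size."""
    out = np.full((size, size), fill, dtype=img.dtype)
    h, w = img.shape
    src_r, src_c = max((h - size) // 2, 0), max((w - size) // 2, 0)
    dst_r, dst_c = max((size - h) // 2, 0), max((size - w) // 2, 0)
    ch, cw = min(h, size), min(w, size)
    out[dst_r:dst_r + ch, dst_c:dst_c + cw] = img[src_r:src_r + ch, src_c:src_c + cw]
    return out
```

A slice would be processed as `crop_or_pad(window_and_normalize(hu_slice))` after resampling.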

3.2 Implementation Details

For the diffusion network, we use the network architecture designed in DDPM [15] and set the noise level from \(10^{-4}\) to \(2 \times 10^{-2}\) with a linear schedule and \(T_{\text {train}}=1000\). For non-MLS image generation, we apply the Denoising Diffusion Implicit Model (DDIM) [13] with 50 steps and set the noise scale to 15 to shorten the generation time. We set the hyper-parameters as \(\alpha =1\), \(\beta =1\), \(c=3\) and \(\gamma =2\). u(i) increases from 1 to 10 following a linear schedule. The diffusion models are trained with the AdamW optimizer with an initial learning rate of \(1\times 10^{-4}\) and a batch size of 4 for \(2\times 10^5\) iterations. We up-sample the MLS-positive data by \(10\times \) when training the unconditional diffusion model. The deformation network is trained with the AdamW optimizer with an initial learning rate of \(1\times 10^{-4}\) and a batch size of 16 for 100 epochs. All models are implemented with PyTorch 1.12.1 and trained on one NVIDIA GeForce RTX 3090 GPU.

Table 1. Comparison of different methods with 5-fold cross-validation.

3.3 Quantification Accuracy and Deformation Quality

We evaluate the performance of our quantification strategy using mean absolute error (MAE) and root mean square error (RMSE). For volume-wise evaluation, we measure the maximum deformation of each slice of the whole volume and select the largest one as the final result. We also report a slice-wise evaluation based on labeled slices, which reflects how the models perform on slices with large deformation. Since existing MLS estimation methods require different types of labels from ours, it is difficult to compare with them directly. We therefore first compare our deformation-based strategy with a regression-based strategy, which uses DenseNet-121 [21] to directly predict the slice-wise MLS. We also compare our proposed semi-supervised learning approach with two popular semi-supervised learning methods, Mean-Teacher [19] and Cross Pseudo Supervision (CPS) [20], both implemented within our deformation framework. The results, based on 5-fold cross-validation, are given in Table 1.
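The evaluation protocol can be sketched as below. The sketch assumes each predicted deformation field is an `(H, W, 2)` array in pixel units that is converted to millimeters with the 0.86 mm pixel size from Sect. 3.1; the function names and this unit convention are assumptions of the example.

```python
import numpy as np

def slice_mls_mm(field, pixel_mm=0.86):
    """Slice-wise MLS: length of the largest displacement vector, in mm."""
    return float(np.linalg.norm(field, axis=-1).max() * pixel_mm)

def volume_mls_mm(fields, pixel_mm=0.86):
    """Volume-wise MLS: the largest slice-wise MLS over all slices of a volume."""
    return max(slice_mls_mm(f, pixel_mm) for f in fields)

def mae(pred, target):
    """Mean absolute error."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.abs(diff).mean())

def rmse(pred, target):
    """Root mean square error."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sqrt((diff ** 2).mean()))
```

Because the volume-wise value is a maximum over slices, a single over-estimated slice determines the prediction for the whole volume.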

From the results, we can see that when only labeled MLS slices are used for model learning, our deformation strategy already outperforms the regression model. This may be attributed to the fact that our deformation model learns both MLS values and locations, while a regression model only captures the MLS value. The gap is even larger for slice-wise performance. Moreover, all three semi-supervised learning methods, i.e., Mean-Teacher, CPS, and ours, consistently improve the performance of deformation prediction, showing the benefit and importance of incorporating unlabeled data into model learning. Our semi-supervised learning method based on diffusion models achieves better quantification results than Mean-Teacher and CPS, reducing the volume-wise MAE from 3.80 mm to 2.43 mm. An interesting observation is that the unlabeled data contribute more to the volume-wise evaluation than to the slice-wise evaluation. By inspecting the predictions, we find that the deformation model trained with labeled data only tends to overestimate the deformation of slices with little or no deformation, which makes the volume-wise prediction error-prone. As most unlabeled data are slices with minor shifts, incorporating these data for semi-supervised learning imposes constraints that discourage large deformation, which greatly improves the model's robustness.

Fig. 3. Predicted deformation on (a) MLS images and (b) non-MLS images. The regions with the largest deformation are highlighted. Slice-wise predicted MLS and ground truth are provided.

We also visualize the predicted deformation field of several sample cases. From Fig. 3(a), we can see that the model accurately locates where the maximum shift appears and pushes it to its hypothetically normal counterpart. The largest deformation occurs exactly at the site of the maximum shift. To validate the robustness of our model, we also select several patients diagnosed with no MLS and plot the predicted deformation of these samples. As can be seen in Fig. 3(b), our method provides a reasonable prediction for non-MLS images by outputting much smaller values than for MLS images. The predictions for non-MLS images are not exactly zero for two reasons. First, even for a completely healthy person, the midline cannot be perfectly aligned due to factors such as scan pose. Second, our model tends to overestimate the shift because the final measurement is the maximum deformation.

3.4 Ablation Study

We conduct several ablation experiments to study how individual components of our proposed framework affect model performance. The reported volume-wise results are obtained by training on four folds and testing on the remaining fold.

Effects for Representation Learning. We first conduct ablation studies to verify that the latent feature extracted from the two diffusion models is truly useful for deformation prediction. To this end, we select two deformation models, one trained with only labeled data and the other using semi-supervised learning, and compare their performance with and without the extracted representation as input. The results are given in Table 2. As expected, incorporating the representation can improve the model performance in both cases.

The noise level is an important component of diffusion models. Only with a proper noise level can the model accurately estimate the deviation of the image from the negative sample space. Therefore, we run inference with multiple noise levels and compare their effects on model performance. The results are shown in Fig. 4. Our model is very robust to this hyper-parameter: as long as t is not too small, the model gives very similar performance. The best performance appears in the middle of the range, at \(t=600\). This is reasonable, as small noise fails to corrupt the original image sufficiently and thus degrades score estimation, while large noise may obscure too many details of the original image.

Table 2. Effects of the representation.

Fig. 4. Effects of the noise level.

Quantity of Unlabeled Images. To verify the usefulness of unlabeled images, we conduct ablation studies on the number of unlabeled images used. For each experiment, we randomly sample 20%, 40%, 60%, or 80% of the volumes and incorporate the unlabeled slices of these volumes for semi-supervised training. For the remaining volumes, we use only the labeled slices. We also run one experiment that completely removes the use of unlabeled images. The pre-trained diffusion models are the same in every experiment and use all the data. In other words, the excluded unlabeled images still contribute indirectly to model training through the diffusion models. The results are shown in Fig. 5(a). As can be seen, model performance and robustness improve as we incorporate more unlabeled images. This supports our claim that the model learns valuable information from unlabeled data.

Quantity of Non-MLS Images. To further measure the benefit of including non-MLS cases, we conduct another ablation study on the proportion of non-MLS cases. As the number of non-MLS cases is currently much higher than that of MLS cases, we upsample the MLS cases so that the two quantities are approximately the same when training the unconditional diffusion model. For the ablation, we first downsample the non-MLS data so that their quantity is \(1\times \), \(5\times \), and \(10\times \) that of the MLS cases, and then upsample the MLS cases to balance them. From the results in Fig. 5(b), we find that model performance improves as more non-MLS cases are incorporated. Increasing the number of non-MLS cases helps train the diffusion models and further improves the quality of the generated images and extracted feature representations. However, this effect soon saturates because the number of MLS cases is relatively small. This can be a bottleneck for effectively using the non-MLS cases, as it is challenging to train unconditional diffusion models with such imbalanced datasets.

Fig. 5. Results of our ablation experiments in terms of: (a) proportion of unlabeled data used, and (b) proportion of negative data used.

4 Conclusions and Future Work

In this paper, we propose a novel framework based on deformation field estimation to automatically measure brain MLS. The labels we use are sparse, which greatly alleviates the labeling workload. We also propose a semi-supervised learning strategy based on diffusion models that significantly improves model performance. Experiments on a clinical dataset show that our method achieves satisfactory performance. We also verify that using unlabeled data and non-MLS cases helps improve the model's performance. Our method has several limitations. First, model performance relies heavily on the pre-trained diffusion models, and training diffusion models with extremely imbalanced data requires great effort. Second, the measurement results exhibit randomness due to noise corruption. Finally, the measurement results are prone to overestimation. Our future work will address these limitations.