1 Introduction

Cone Beam Computed Tomography (CBCT) is an emerging medical imaging technique for examining the internal structure of a subject noninvasively. A CBCT scanner emits cone-shaped X-ray beams and captures 2D projections at equal angular intervals. Compared with conventional Fan Beam CT (FBCT), CBCT enjoys high spatial resolution and fast scanning speed [19]. Recent years have witnessed the blossoming of low-dose CT, which delivers a significantly lower radiation dose during the scanning process. There are two ways to reduce the dose: decreasing the source intensity or the number of projection views [8]. This paper focuses on the latter, i.e., sparse-view CBCT reconstruction.

Sparse-view CBCT reconstruction aims to retrieve a volumetric attenuation coefficient field from dozens of projections. It is a challenging task in two respects. First, insufficient views lead to notable artifacts: traditional CBCT obtains hundreds of projections, whereas sparse-view CBCT works with roughly 10\(\times \) fewer inputs. Second, the spatial and computational complexity of CBCT reconstruction is much higher than that of FBCT reconstruction due to the dimensional increase of the inputs. CBCT relies on 2D projections to build a 3D model, while FBCT simplifies the process by stacking 2D slices restored from 1D projections (at the cost of scanning time and dose).

Existing CBCT approaches can be divided into three categories: analytical, iterative, and learning-based methods. Analytical methods estimate attenuation coefficients by solving the Radon transform and its inverse. A typical example is the FDK algorithm [7]. It produces good results in ideal scenarios but copes poorly with ill-posed problems such as sparse views. The second family, iterative methods, formulates reconstruction as a minimization process. These approaches utilize an optimization framework combined with regularization modules. While iterative methods perform well on ill-posed problems [2, 20], they require substantial computation time and memory. Recently, learning-based methods have become popular with the rise of AI. They use deep neural networks to 1) predict and extrapolate projections [3, 22, 24, 28], 2) regress attenuation coefficients with similar data [11, 27], and 3) make the optimization process differentiable [1, 6, 10]. Most of these methods [3, 11, 22, 27] need extensive datasets for network training. Moreover, they rely on neural networks to remember what a CT looks like, so a model trained for one application is difficult to transfer to another. While self-supervised methods exist [1, 28], they operate under FBCT settings owing to limits on network capacity and memory consumption; their performance and efficiency drop when applied to the CBCT scenario.

Apart from the aforementioned work dedicated to CT reconstruction, efforts have been made to tackle other ill-posed problems, such as 3D reconstruction in the computer vision field. Similar to CT reconstruction, 3D reconstruction uses RGB images to estimate 3D shapes, which are usually represented as discrete point clouds or meshes. Recent studies [13, 16] propose Implicit Neural Representation (INR) as an alternative to those discrete representations. INR parameterizes a bounded scene as a neural network that maps spatial coordinates to metrics such as occupancy and color. With the help of a position encoder [14, 21], INR is capable of learning high-frequency details.

This paper proposes Neural Attenuation Fields (NAF), a fast self-supervised solution for sparse-view CBCT reconstruction. Here we use ‘self-supervised’ to highlight that NAF requires no external CT scans, only the X-ray projections of the object of interest. Inspired by 3D reconstruction work [13, 16], we parameterize the attenuation coefficient field as an INR and imitate the X-ray attenuation process with a self-supervised network pipeline. Specifically, we train a Multi-Layer Perceptron (MLP) whose input is an encoded spatial coordinate (x, y, z) and whose output is the attenuation coefficient \(\mu \) at that location. Instead of using a common frequency-domain encoding, we adopt hash encoding [14], a learning-based position encoder, to help the network quickly learn high-frequency details. Projections are synthesized by predicting the attenuation coefficients of sampled points along ray trajectories and attenuating incident beams accordingly. The network is optimized with gradient descent by minimizing the error between real and synthesized projections. We demonstrate that NAF quantitatively and qualitatively outperforms existing solutions on both human organ and phantom datasets. While most INR approaches take hours to train, our method can reconstruct a detailed CT model within 10–40 minutes, which is comparable to iterative methods.

In summary, the main contributions of this work are:

  • We propose a novel and fast self-supervised method for sparse-view CBCT reconstruction. Neither external datasets nor structural priors are needed, only the projections of the subject.

  • The proposed method achieves state-of-the-art accuracy with relatively short computation time. Its performance and efficiency make it feasible for clinical CT applications.

  • The code will be publicly available for investigation purposes.

Fig. 1. NAF pipeline. Gray block: The CBCT scanner captures X-ray projections from different views. Blue block: NAF simulates projections. Orange block: NAF is optimized by comparing real and synthesized projections. Green block: NAF generates a CT model by querying corresponding voxels. (Color figure online)

2 Method

2.1 Pipeline

The pipeline of NAF is shown in Fig. 1. During a CBCT scan, an X-ray source rotates around the object and emits cone-shaped X-ray beams. A 2D panel detects X-ray projections at equal angular intervals. NAF then uses the scanner geometry to imitate the attenuation process discretely. It learns CT shapes by comparing real and synthesized projections. After model optimization, the final CT image is generated by querying corresponding voxels.

NAF consists of four modules: ray sampling, position encoding, attenuation coefficient prediction, and projection synthesis. First, we uniformly sample points along X-ray paths based on the scanner geometry. A position encoder network then encodes their spatial coordinates to extract valuable features. After that, an MLP network consumes the encoded information and predicts attenuation coefficients. The last step of NAF is to synthesize projections by attenuating incident X-rays according to the predicted attenuation coefficients on their paths.

2.2 Neural Attenuation Fields

Ray Sampling. Each pixel value of a projection image results from an X-ray passing through a cubical space and getting attenuated by the media inside. We sample N points along the segment where each ray intersects the cube. A stratified sampling method [13] is adopted, in which we divide a ray into N evenly spaced bins and uniformly sample one point within each bin. Setting N greater than the desired CT size ensures that at least one sample is assigned to every grid cell that an X-ray traverses. The coordinates of the sampled points are then sent to the position encoding module.
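
As an illustration, below is a minimal PyTorch sketch of this stratified sampling step; the function name and tensor shapes are our assumptions, not the authors' released code.

```python
import torch

def stratified_sample(ray_o, ray_d, t_near, t_far, n_samples):
    """Stratified sampling along a batch of rays (a minimal sketch).

    ray_o, ray_d: (B, 3) ray origins and unit directions.
    t_near, t_far: (B,) distances where each ray enters/exits the cube.
    Returns points (B, n_samples, 3) and their ray distances (B, n_samples).
    """
    # Divide [t_near, t_far] into n_samples evenly spaced bins ...
    edges = torch.linspace(0.0, 1.0, n_samples + 1, device=ray_o.device)
    lower = t_near[:, None] + (t_far - t_near)[:, None] * edges[:-1]
    upper = t_near[:, None] + (t_far - t_near)[:, None] * edges[1:]
    # ... and uniformly draw one sample inside each bin.
    t = lower + (upper - lower) * torch.rand_like(lower)
    pts = ray_o[:, None, :] + t[..., None] * ray_d[:, None, :]
    return pts, t
```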

Position Encoding. A simple MLP can theoretically approximate any function [9]. Recent studies [18, 21], however, reveal that neural networks prefer to learn low-frequency details due to “spectral bias”. To counteract this bias, a position encoder is introduced to map 3D spatial coordinates to a higher-dimensional space.

A common choice is the frequency encoder proposed by Mildenhall et al. [13]. It decomposes a spatial coordinate \(\textbf{p}\in \mathbb {R}^{3}\) into L sets of sinusoidal components at different frequencies. While the frequency encoder eases network training, it is quite cumbersome: in medical imaging practice [26, 28], the size of the encoder output is set to 256 or greater, so the following network must be wider and deeper to cope with the inflated inputs. As a result, it takes hours to train millions of network parameters, which is unacceptable for fast CT reconstruction.
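
For concreteness, a minimal sketch of such a NeRF-style frequency encoder follows (the formulation is the commonly used one; the function name is ours). The output has \(3\times 2L\) channels, which is why a large L inflates the downstream network.

```python
import math
import torch

def frequency_encode(p, L):
    """NeRF-style frequency encoding (a sketch): maps (..., 3) coordinates
    to 3 * 2 * L channels of sinusoids at exponentially spaced frequencies."""
    feats = []
    for i in range(L):
        feats.append(torch.sin(2.0 ** i * math.pi * p))
        feats.append(torch.cos(2.0 ** i * math.pi * p))
    return torch.cat(feats, dim=-1)
```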

Frequency-domain encoding is a dense encoder because it utilizes the entire frequency spectrum. However, dense encoding is redundant for CBCT reconstruction for two main reasons. First, a human body usually consists of several homogeneous media, such as muscle and bone. Attenuation coefficients remain approximately uniform inside one medium but vary between different media, so high-frequency features are unnecessary except for points near edges. Second, natural objects favor smoothness. Many organs have simple shapes, such as spindles (muscles) or cylinders (bones), whose smooth surfaces can be easily learned with low-dimensional features.

To exploit these characteristics of the scanned objects, we use the hash encoder [14], a learning-based sparse encoding solution. The hash encoder \(\mathcal {M_{H}}\) is defined as:

$$\begin{aligned} \mathcal {M_{H}}(\textbf{p};\mathbf {\Theta })=[\mathcal {I}(\textbf{H}_{1}),\cdots ,\mathcal {I}(\textbf{H}_{L})]^T,~\textbf{H}=\{\textbf{c}\,|\,h(\textbf{c})=(\bigoplus _{j} c_{j}\pi _{j})~\textrm{mod}~T\}. \end{aligned}$$
(1)

The hash encoder describes a bounded space with L multiresolution voxel grids. A trainable feature lookup table \(\mathbf {\Theta }\) of size T is assigned to each voxel grid. At each resolution level, we 1) detect the neighbouring corners \(\textbf{c}\) (cubes with different colors in Fig. 1(b)) of the queried point \(\textbf{p}\), 2) look up their corresponding features \(\textbf{H}\) via a spatial hash function \(h\) [23], and 3) generate a feature vector by linear interpolation \(\mathcal {I}\). The output of the hash encoder is the concatenation of the feature vectors at all resolution levels. More details of the hash function and its symbols can be found in [14].
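
To make Eq. (1) concrete, below is a minimal PyTorch sketch of a single resolution level; the full encoder concatenates the outputs of all L levels. Variable names and the unit-cube domain are our assumptions; the reference implementation is described in [14].

```python
import torch

PRIMES = (1, 2654435761, 805459861)  # spatial-hashing constants from [23]

def hash_encode_level(p, table, resolution):
    """One resolution level of Eq. (1): a minimal sketch.

    p: (B, 3) coordinates scaled to [0, 1); table: (T, F) trainable features.
    Returns trilinearly interpolated features of shape (B, F).
    """
    T = table.shape[0]
    pos = p * resolution
    c0 = pos.floor().long()        # lower corner of the enclosing grid cell
    w = pos - c0.float()           # trilinear interpolation weights
    feats = torch.zeros(p.shape[0], table.shape[1], device=p.device)
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                c = c0 + torch.tensor([dx, dy, dz], device=p.device)
                # h(c) = (XOR_j c_j * pi_j) mod T
                idx = (c[:, 0] * PRIMES[0]) ^ (c[:, 1] * PRIMES[1]) \
                    ^ (c[:, 2] * PRIMES[2])
                wt = ((w[:, 0] if dx else 1 - w[:, 0])
                      * (w[:, 1] if dy else 1 - w[:, 1])
                      * (w[:, 2] if dz else 1 - w[:, 2]))
                feats = feats + wt[:, None] * table[idx % T]
    return feats
```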

Compared with the frequency encoder, the hash encoder produces much smaller outputs (32 dimensions in our setting) with competitive feature quality, for two reasons. On the one hand, the many-to-one property of the hash function conforms to the sparse nature of human organs. On the other hand, a trainable encoder can learn to focus on relevant details and select a suitable frequency spectrum [14]. Thanks to the hash encoder, the subsequent network can be far more compact.

Attenuation Coefficient Prediction. We represent the bounded field with a simple MLP \(\mathbf {\Phi }\), which takes an encoded spatial coordinate as input and outputs the attenuation coefficient \(\mu \) at that position. As illustrated in Fig. 1(c), the network is composed of 4 fully-connected layers. The first three layers are 32 channels wide with ReLU activation functions in between, while the last layer has one neuron followed by a sigmoid activation. A skip connection concatenates the network input to the second layer’s activation. By contrast, Zang et al. [28] use a 6-layer, 256-channel MLP to learn features from a frequency encoder; our network is \(10\times \) smaller.
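
A PyTorch sketch of this network follows; the class and layer names are ours, and the exact activation placement reflects our reading of the description above.

```python
import torch
import torch.nn as nn

class NAFMLP(nn.Module):
    """4-layer MLP sketched from the description above (a hedged sketch;
    in_dim = 32 matches the hash-encoder output size in our setting)."""
    def __init__(self, in_dim=32, width=32):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, width)
        self.fc2 = nn.Linear(width, width)
        self.fc3 = nn.Linear(width + in_dim, width)  # after the skip concat
        self.fc4 = nn.Linear(width, 1)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        h = torch.relu(self.fc2(h))
        h = torch.cat([x, h], dim=-1)  # skip: concat input to 2nd activation
        h = torch.relu(self.fc3(h))
        return torch.sigmoid(self.fc4(h))  # attenuation coefficient in (0, 1)
```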

Projection Synthesis. According to Beer’s Law, the intensity of an X-ray traversing matter is reduced by the exponential integration of the attenuation coefficients on its path. We numerically synthesize the attenuation process with:

$$\begin{aligned} I=I_{0}\exp (-\sum _{i=1}^{N}\mu _{i}\delta _{i}), \end{aligned}$$
(2)

where \(I_{0}\) is the initial intensity and \(\delta _{i}=\Vert \textbf{p}_{i+1}-\textbf{p}_{i}\Vert \) is the distance between adjacent points.
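
In code, Eq. (2) reduces to one exponentiated weighted sum per ray. The sketch below is a hedged illustration: with N sampled points there are only N-1 inter-sample distances, so the last sample is dropped by assumption.

```python
import torch

def synthesize_projection(mu, pts, I0=1.0):
    """Numerical Beer's Law of Eq. (2): a minimal sketch.

    mu: (B, N) predicted attenuation coefficients along each ray.
    pts: (B, N, 3) sampled points; deltas are inter-sample distances.
    """
    deltas = torch.norm(pts[:, 1:] - pts[:, :-1], dim=-1)      # (B, N-1)
    return I0 * torch.exp(-(mu[:, :-1] * deltas).sum(dim=-1))  # (B,)
```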

2.3 Model Optimization and Output

NAF is updated by minimizing the L2 loss between real and synthesized projections. The loss function \(\mathcal {L}\) is defined as:

$$\begin{aligned} \mathcal {L}(\mathbf {\Theta },\mathbf {\Phi }) = \sum _{\textbf{r}\in \textbf{B}}\Vert I_{r}(\textbf{r})-I_{s}(\textbf{r})\Vert ^2, \end{aligned}$$
(3)

where \(\textbf{B}\) is a ray batch, and \(I_{r}\) and \(I_{s}\) are the real and synthesized projection values for ray \(\textbf{r}\), respectively. We update both the hash encoder \(\mathbf {\Theta }\) and the attenuation coefficient network \(\mathbf {\Phi }\) during the training process.
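
Putting the pieces together, one optimization step might look like the following sketch, which reuses the hypothetical helpers sketched in Sect. 2.2 and assumes `rays` bundles per-ray origins, directions and cube bounds.

```python
import torch

def train_step(encoder, mlp, optimizer, rays, I_real, n_samples=192):
    """One gradient step on Eq. (3) for a batch of rays (a sketch)."""
    pts, _ = stratified_sample(rays["o"], rays["d"],
                               rays["t_near"], rays["t_far"], n_samples)
    B, N, _ = pts.shape
    mu = mlp(encoder(pts.reshape(-1, 3))).reshape(B, N)
    I_syn = synthesize_projection(mu, pts)
    loss = ((I_real - I_syn) ** 2).sum()  # Eq. (3) over the ray batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```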

The final output is formulated as a discrete 3D matrix. We build a voxel grid of the desired size and pass the voxel coordinates to the trained MLP to predict the corresponding attenuation coefficients. A CT model is thus restored.
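
A sketch of this querying step is shown below; the grid resolution, chunk size and unit-cube domain are illustrative assumptions.

```python
import torch

def extract_volume(encoder, mlp, size=128, chunk=65536):
    """Query the trained model on a voxel grid (a sketch; a unit-cube
    reconstruction domain is assumed)."""
    axis = torch.linspace(0.0, 1.0, size)
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), -1)
    coords = grid.reshape(-1, 3)
    out = []
    with torch.no_grad():
        for batch in coords.split(chunk):
            out.append(mlp(encoder(batch)).squeeze(-1))
    return torch.cat(out).reshape(size, size, size)
```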

3 Experiments

3.1 Experimental Settings

Data. We conduct experiments on five datasets containing human organ and phantom data. Details are listed in Table 1.

Human Organ: We evaluate our method on public datasets of human organ CTs [4, 12], including chest, jaw, foot and abdomen. The chest data are from the LIDC-IDRI dataset [4], and the rest are from the Open Scientific Visualization Datasets [12]. Since these datasets only provide volumetric CT scans, we generate projections with the tomographic toolbox TIGRE [5], capturing 50 projections with 3% noise over a range of 180\(^{\circ }\). We train our model with these projections and evaluate its performance against the raw volumetric CT data.

Phantom: We collect a phantom dataset by scanning a silicon aortic phantom with a GE C-arm Medical System. The system captures 582 fluoroscopy projections of size 500 \(\times \) 500, with the position primary angle ranging from \(-103^{\circ }\) to 93\(^{\circ }\) and the position secondary angle fixed at 0\(^{\circ }\). A 512 \(\times \) 512 \(\times \) 510 CT image is also generated with inbuilt algorithms as the ground truth. We only use 50 projections for our experiments.

Table 1. Details of CT datasets used in the experiments.

Baselines. We compare our approach with four baseline techniques. FDK [7] is chosen as a representative of analytical methods. The second method, SART [2], is a robust iterative reconstruction algorithm. ASD-POCS [20] is another iterative method with a total-variation regularizer. Finally, we implement a CBCT variant of IntraTomo [28], named IntraTomo3D, as an example of frequency-encoding deep learning methods.

Implementation Details. Our proposed method is implemented in PyTorch [17]. We use the Adam optimizer with a learning rate that starts at \(1\times 10^{-3}\) and steps down to \(1\times 10^{-4}\). The batch size is 2048 rays at each iteration. The number of samples per ray depends on the size of the CT data; for example, we sample 192 points along each ray for the 128 \(\times \) 128 \(\times \) 128 chest CT. We use the same hyper-parameter setting for the hash encoder as [14]. More details of the hyper-parameters can be found in the supplementary material. All experiments are conducted on a single RTX 3090 GPU. We evaluate the five methods quantitatively in terms of peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [25]. PSNR (dB) statistically assesses the artifact suppression performance, while SSIM measures the perceptual difference between two signals. Higher PSNR/SSIM values indicate more accurate reconstruction.
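
For reference, a minimal PSNR sketch consistent with this evaluation protocol (assuming volumes normalized to [0, 1]; the helper name is ours):

```python
import torch

def psnr(pred, gt, data_range=1.0):
    """Peak signal-to-noise ratio in dB (a sketch; volumes are assumed
    to be scaled to [0, data_range])."""
    mse = torch.mean((pred - gt) ** 2)
    return 10.0 * torch.log10(data_range ** 2 / mse)
```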

Fig. 2. Qualitative results of five methods. From left to right: examples of X-ray projections, slices of 3D CT models reconstructed by five methods, and the ground truth CT slices.

3.2 Results

Performance. Our method produces the best quantitative results on both the human organ and phantom datasets, as listed in Table 2. Both PSNR and SSIM values are significantly higher than those of the other methods. For example, the PSNR value of our method on the abdomen dataset is 3.07 dB higher than that of the second-best method, SART.

We also provide visualization results of the different methods in Fig. 2. FDK restores low-quality models with notable artifacts, as analytical methods demand large numbers of projections.

Table 2. PSNR/SSIM measurements of five methods on five datasets.

The iterative method SART suppresses noise at the cost of losing certain details. The reconstruction results of ASD-POCS are heavily smeared because total-variation regularization encourages removing high-frequency details, including both unwanted noise and expected tiny structures. IntraTomo3D produces clean results; however, the edges between media are slightly blurred, which shows that the frequency encoder fails to teach the network to focus on edges. With the help of hash encoding, the results of the proposed NAF have the most details, the clearest edges and the fewest artifacts. Figure 3 indicates that NAF outperforms the other methods in all slices of the reconstructed CT volume.

Figure 4 shows the performance of iterative and learning-based methods under different numbers of views. Performance clearly improves as the number of input views increases. Our method achieves better results than the others under most circumstances.

Fig. 3. Slice-wise performance of iterative and learning-based methods on the abdomen dataset.

Time. We record the running time of the iterative and learning-based methods, as shown in Fig. 5. All methods use CUDA [15] to accelerate computation. Overall, the methods spend less time on datasets with smaller projections (chest, jaw and foot) and increasingly more time on larger datasets (abdomen and aorta). IntraTomo3D requires more than one hour to train its network. Benefiting from the compact network design, NAF spends a running time similar to iterative methods and is 3\(\times \) faster than the frequency-encoding deep learning method IntraTomo3D.

Fig. 4. Performance under different numbers of views on the abdomen dataset.

Fig. 5. Running time that iterative and learning-based methods take to converge to stable results.

4 Conclusion

This paper proposes NAF, a fast self-supervised learning-based solution for sparse-view CBCT reconstruction. Our method trains a fully-connected deep neural network that consumes a 3D spatial coordinate and outputs the attenuation coefficient at that location. NAF synthesizes projections by attenuating incident X-rays based on the predicted attenuation coefficients. The network is updated by minimizing the error between real and synthesized projections. We show that frequency encoding is not computationally efficient for tomographic reconstruction tasks; as an alternative, a learning-based encoder called hash encoding is adopted to extract valuable features. Experimental results on human organ and phantom datasets indicate that the proposed method achieves significantly better results than the baselines while requiring reasonably short computation time.