I Introduction

With the development of positron emission tomography (PET) instrumentation, the axial field of view (FOV) continually increases, leading to the new area of long axial field of view (LAFOV) PET, or total-body PET. Compared with the current clinical standard-of-care systems, whose axial FOV is about 26 cm, LAFOV PET systems have larger solid-angle coverage and a longer axial FOV. In addition, a large anatomical region can be covered with a single bed position. Consequently, the increased sensitivity allows the total PET acquisition time to be reduced by a large factor [1,2,3].

Several image reconstruction algorithms have been proposed to reconstruct tomographic images from projection data. Conventional methods solve a mapping from measurement space to image space based on physical principles. The clinically established methods include analytical and iterative methods. Analytical methods, such as filtered back projection (FBP) [4], achieve fast image reconstruction, but the obtained images have a high noise level. Iterative methods, such as maximum-likelihood expectation–maximization (MLEM) [5] and ordered-subset expectation maximization (OSEM) [6], which alternate forward projection and back projection, are the clinically accepted standard. Iterative methods yield reconstructed images with a low noise level and satisfactory contrast; however, the iterative process is time-consuming. Moreover, in order to correctly perform attenuation correction, a computed tomography (CT) or magnetic resonance (MR) image is used to estimate the attenuation map. In recent years, neural networks have been used for tomographic image reconstruction to achieve higher quality results with sparse information and short reconstruction times [7, 8]. Deep learning-based methods have been applied in several ways. For instance, deep learning methods are proposed for noise reduction in order to allow low-dose PET imaging protocols [9,10,11]. A neural network can be integrated into the iterative process to accelerate convergence and improve reconstruction quality [12, 13]. Alternatively, a neural network can be trained to map projection data directly to images. For this direct approach, the automated transform by manifold approximation (AUTOMAP) is proposed to learn the relationship between the sensor domain and the image domain [14]. This method mainly tackles MR image reconstruction, although application to PET systems is also achievable [14]. A deep encoder-decoder network, referred to as DeepPET, is used for the direct reconstruction of PET images; PET images and projection data simulated from the XCAT digital phantom are used to train the network [15]. Kandarpa et al. [16] propose a double U-Net to learn the sinogram-to-image transformation, with a deep-learning pipeline consisting of denoising, image reconstruction, and super-resolution segments. William et al. [17] propose the DirectPET network to achieve full-size neural network PET reconstruction from histo-image data, where CT-based attenuation maps are used as additional input for corrections. Reconstruction from histo-images using a U-Net, where a CT-based attenuation map is required, is also proposed [18].

The LAFOV of total-body PET increases the probability of detecting lines of response (LORs) and thus increases the sensitivity. However, the highly oblique LORs between distant rings suffer from parallax error [1] and introduce large heterogeneity in the image quality [19, 20]. The increased Compton scattering and the increased ratio of multiple to single scattered photons are another critical bottleneck for the reconstruction of LAFOV PET [21]. The fraction of multiple scatters changes heterogeneously in LAFOV PET [22], and the fraction of random events also depends on the ring difference [20]. Correcting for the heterogeneity of random and multiply scattered events makes the reconstruction more difficult than in conventional scanners.

This paper explores the application of an encoder-decoder network to LAFOV PET reconstruction using clinical patient data. The study focuses on achieving end-to-end PET reconstruction directly from the detector domain to the image domain. In addition, attenuation correction is integrated into the training process.

II Material and methods

Patients and imaging

Clinical patient list-mode data are collected using a Biograph Vision Quadra (Siemens Healthineers) at the University of Bern, Switzerland. This system has an axial FOV of 106 cm. Preliminary assessments of this scanner’s characteristics reveal a sensitivity of 174 cps/kBq and a time-of-flight (TOF) resolution of 219 ps in ultra-high sensitivity mode [23].

The study includes 80 patients (median age, 66 years; age range, 27–83 years; 36 females; BMI, 25.40 ± 4.70 kg/m2) who are injected with 18F-FDG and undergo a PET/CT examination. In all the cases, the subjects fast for more than 4 h and have blood glucose below 200 mg/dl. All the patients are injected with 18F-FDG with an uptake time of 90 min ± 10%. Patients without complete PET/CT scan images from above the head to below the thigh, and those with poor image quality because of movement, are excluded from the study. The 80 patients are randomly split into a training dataset of 60 patients (median age, 67 years; age range, 27–83 years; 26 females; BMI, 25.43 ± 4.71 kg/m2), a validation dataset of 10 patients (median age, 62 years; age range, 58–75 years; 5 females; BMI, 26.72 ± 5.96 kg/m2), and a test dataset of 10 patients (median age, 67 years; age range, 40–81 years; 5 females; BMI, 23.92 ± 2.11 kg/m2). This study is performed following the requirements of the respective local ethics committees in Switzerland (Req-2021–00,517).

Data pre-processing

List-mode data obtained from the scanner are reconstructed using a dedicated software prototype (e7-tools, Siemens Healthineers) with CT-based attenuation correction. As in our clinical routine, the PET images are reconstructed using PSF-TOF with 4 iterations and 5 subsets [23]. The 3D sinograms are converted into two-dimensional (2D) slices using single-slice rebinning (SSRB) [4]. A stack of 2D sinograms is created by placing each detected event on the plane that is perpendicular to the scanner axis (z) and passes through the midpoint of the line connecting the two detectors of the event. The image matrix size of the 2D image volumes is 440 × 440 with a pixel size of 1.65 mm × 1.65 mm. A total of 644 2D sinogram slices are obtained for each patient, corresponding to 644 reconstructed images.
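As an illustration, a minimal SSRB sketch in Python/NumPy is given below. The oblique-sinogram layout and the ring-index bookkeeping are simplifying assumptions of this sketch and do not reflect the exact scanner data format.

```python
import numpy as np

def ssrb(oblique_sinos, ring1, ring2, n_slices):
    """Minimal single-slice rebinning (SSRB) sketch.

    oblique_sinos : hypothetical array of oblique sinograms, shape
        (n_sinograms, n_angles, n_radial); sinogram i connects detector
        rings ring1[i] and ring2[i].
    n_slices : number of rebinned 2D slices (644 per patient in this study).
    """
    rebinned = np.zeros((n_slices,) + oblique_sinos.shape[1:], dtype=np.float32)
    for i in range(oblique_sinos.shape[0]):
        # Assign each oblique sinogram to the slice lying at the axial
        # midpoint of the line connecting its two rings
        # (slice index expressed in half-ring-spacing units).
        mid = int(ring1[i]) + int(ring2[i])
        if 0 <= mid < n_slices:
            rebinned[mid] += oblique_sinos[i]
    return rebinned
```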

The 2D sinogram slices are the input dataset of the network, and the reconstructed images from e7-tools are used as training targets. Several images at the starting and ending positions of each patient’s data have low counts and are therefore excluded from the study. Each patient retains 599 data pairs (2D sinogram and reconstructed image). Finally, 60 patients are designated for training, 10 patients for validation, and 10 patients for testing.

Deep neural network structure

An encoder-decoder network is developed for direct image reconstruction. It comprises two parts: image transform and perceptual loss networks [24] (cf. Figure 1).

Fig. 1
figure 1

The network used in this paper. It includes two parts: the image transform network and the perceptual loss network. The detailed structures of the image transform network and the perceptual loss network are provided in Supplemental Figs. 1 and 2

The structure of the proposed training network is based on DeepPET [15]. The network consists of encoder, transformation, and decoder parts (cf. Figure 1, Supplemental Figs. 1, 2) [25]. It contains 31 convolution blocks and one single convolution layer. Each convolution block includes a convolution layer used to extract features, a batch normalization (BN) layer used to speed up the training and network convergence, and a rectified linear unit (ReLU) activation function. In the encoder and transformation parts, the convolution filters of the first two blocks have a size of 7 × 7, those of the following two blocks a size of 5 × 5, and the others a size of 3 × 3. The number of extracted features increases from 32 to 1024. The convolution layers decrease the width and height of the feature maps using a kernel stride of 2. In the decoder part, the convolution filters have a size of 3 × 3, and the feature maps are enlarged by upsampling layers. The output layer is a convolution layer with a single feature. The 2D sinogram slices are resized to 288 × 269 and used as inputs of the network. The outputs of the image transform network are reconstructed images, which are fed into the perceptual loss network. In total, this network comprises 64,544,865 parameters.
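For orientation, a condensed Keras sketch of the image transform network is shown below. The number of blocks per stage, the upsampling details, and the final resizing to the 440 × 440 image grid are simplifications and assumptions of this sketch; the exact layer-by-layer structure is given in Supplemental Fig. 1.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, kernel, stride=1):
    # Convolution block: convolution + batch normalization + ReLU.
    x = layers.Conv2D(filters, kernel, strides=stride, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def build_image_transform_net(input_shape=(288, 269, 1)):
    """Simplified encoder-transformation-decoder sketch (not parameter-exact)."""
    inputs = tf.keras.Input(shape=input_shape)

    # Encoder: 7x7 and 5x5 filters first, then 3x3; stride-2 convolutions
    # shrink the feature maps while the feature count grows from 32 to 1024.
    x = conv_block(inputs, 32, 7)
    x = conv_block(x, 64, 7, stride=2)
    x = conv_block(x, 128, 5, stride=2)
    x = conv_block(x, 256, 5, stride=2)
    x = conv_block(x, 512, 3, stride=2)
    x = conv_block(x, 1024, 3, stride=2)

    # Transformation part: 3x3 blocks at the bottleneck resolution.
    x = conv_block(x, 1024, 3)
    x = conv_block(x, 1024, 3)

    # Decoder: 3x3 blocks, feature maps enlarged by upsampling layers.
    for filters in (512, 256, 128, 64, 32):
        x = layers.UpSampling2D()(x)
        x = conv_block(x, filters, 3)

    # Output layer: single-feature convolution, resized here to the
    # 440 x 440 image grid (one possible way to match the target size).
    x = layers.Conv2D(1, 3, padding="same")(x)
    outputs = layers.Resizing(440, 440)(x)
    return tf.keras.Model(inputs, outputs, name="image_transform_net")
```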

The perceptual loss network uses the first 3 convolution blocks of VGG19 [26] (cf. Supplemental Fig. 2). The VGG network uses stacks of small-scale convolution kernels (3 × 3) rather than large-scale convolution kernels. This design creates multiple non-linear layers, increases the depth of the network, and enables complex feature learning. The convolution blocks in VGG19 include a convolution layer followed by a ReLU activation function, and the sizes of the feature maps are reduced by pooling layers. The first 3 convolutional blocks of VGG19 form the shallow part of the network. The features they extract remain close to the input and retain low-level information such as color, texture, and edges, which is sufficient to capture the relevant features while remaining robust. In addition, the depth of the three convolutional blocks ensures a sufficiently large receptive field for better reconstruction of structural details. Another consideration behind the choice of the first three blocks follows the study of perceptual loss [24], which has shown that reconstruction from more than three layers preserves image content and overall spatial structure while losing color, texture, and exact shape. Another study on PET image fine-tuning found that features extracted from deeper layers could reduce the quality of the predicted images [9]. The weights of the VGG19 network pre-trained on the ImageNet database (image-net.org) are used. The outputs of the first 3 pooling layers are extracted and used as the feature reconstruction loss:

$${L}_{VGG}=\frac{1}{3}\sum\nolimits_{i=1}^{3}\left|VGG{\left(x\right)}_{i}-VGG{\left(y\right)}_{i}\right|$$
(1)

where VGG(x)i represents the output of the i-th pooling layer of VGG19 for the ground truth input, and VGG(y)i denotes the output of the i-th pooling layer of VGG19 for the predicted image from the image transform network.
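A possible implementation of this feature reconstruction loss with the ImageNet-pre-trained VGG19 shipped with Keras is sketched below; the tiling of the single PET channel to three channels and the omission of intensity preprocessing are assumptions of this sketch.

```python
import tensorflow as tf

# VGG19 pre-trained on ImageNet; the outputs of the first three pooling
# layers ("block1_pool", "block2_pool", "block3_pool") serve as features.
_vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
vgg_features = tf.keras.Model(
    inputs=_vgg.input,
    outputs=[_vgg.get_layer(name).output
             for name in ("block1_pool", "block2_pool", "block3_pool")],
)
vgg_features.trainable = False

def vgg_loss(x, y):
    """Eq. (1): mean absolute difference of the first three VGG19 pooling
    outputs for the ground truth x and the prediction y."""
    # Tile the single PET channel to the three channels expected by VGG19
    # (an assumption of this sketch; intensity preprocessing is omitted).
    feats_x = vgg_features(tf.image.grayscale_to_rgb(x))
    feats_y = vgg_features(tf.image.grayscale_to_rgb(y))
    return tf.add_n([tf.reduce_mean(tf.abs(fx - fy))
                     for fx, fy in zip(feats_x, feats_y)]) / 3.0
```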

Two other terms, the mean square error (MSE) loss (cf. Equation (2)) and the structural similarity (SSIM) loss [27] (cf. Equation (3)), are also involved in the loss function.

$${L}_{MSE}=\frac{1}{n}\sum\nolimits_{i=1}^{n}{\left({x}_{i}-{y}_{i}\right)}^{2}$$
(2)

where x is the ground truth, y represents the predicted image of the image transform network, and n denotes the total number of image pixels.

$${L}_{SSIM}=1-\frac{\left(2{u}_{x}{u}_{y}+{C}_{1}\right)\left(2{\sigma }_{xy}+{C}_{2}\right)}{\left({u}_{x}^{2}+{u}_{y}^{2}+{C}_{1}\right)\left({\sigma }_{x}^{2}+{\sigma }_{y}^{2}+{C}_{2}\right)}$$
(3)
$${C}_{1}={\left(0.01\cdot \mathit{max}\left(x\right)\right)}^{2}$$
(4)
$${C}_{2}={\left(0.03\cdot \mathit{max}\left(x\right)\right)}^{2}$$
(5)

where \({u}_{x}\) and \({\sigma }_{x}^{2}\) are respectively the mean and variance of the ground truth image pixels, \({u}_{y}\) and \({\sigma }_{y}^{2}\) are respectively the mean and variance of the predicted image pixels, \({\sigma }_{xy}\) represents the covariance of the ground truth and predicted images, and \(\mathit{max}\left(x\right)\) denotes the maximum value of the ground truth image.

The total loss function is expressed as:

$$loss={L}_{MSE}+{L}_{SSIM}+{L}_{VGG}$$
(6)
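The three terms of Eqs. (2)–(6) can be combined as in the following sketch, reusing the vgg_loss defined above. The use of tf.image.ssim with the ground truth maximum as the dynamic range mirrors the constants of Eqs. (4) and (5), but the exact implementation details are assumptions.

```python
import tensorflow as tf

def mse_loss(x, y):
    # Eq. (2): mean squared error over all image pixels.
    return tf.reduce_mean(tf.square(x - y))

def ssim_loss(x, y):
    # Eqs. (3)-(5): 1 - SSIM, with the dynamic range set to the maximum of
    # the ground truth so that C1 = (0.01*max(x))^2 and C2 = (0.03*max(x))^2.
    return 1.0 - tf.reduce_mean(tf.image.ssim(x, y, max_val=tf.reduce_max(x)))

def total_loss(x, y):
    # Eq. (6): sum of the MSE, SSIM, and VGG feature reconstruction losses.
    return mse_loss(x, y) + ssim_loss(x, y) + vgg_loss(x, y)
```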

Network training and test procedure

The network is implemented using TensorFlow [28]; the training of the network is performed on a GPU (Tesla V100-PCIe-16 GB, NVIDIA), and testing is performed on a GeForce RTX 2080 Ti (NVIDIA). The Adam optimization method [29] is used as the optimizer with a learning rate of \(10^{-4}\). All the images in the training dataset (35,940 2D sinograms and reconstructed images) are used as input to the network with a batch size of 50 and trained for 300 epochs. The trained network is tested on 10 patients’ data (5,990 2D sinograms and reconstructed images).
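A minimal training sketch under these settings (Adam, learning rate \(10^{-4}\), batch size 50, 300 epochs) is given below; the data-loading helpers are hypothetical placeholders, and the model and loss functions are those sketched above.

```python
import tensorflow as tf

# Hypothetical data-loading helpers (not shown): arrays of 2D sinograms and
# the corresponding e7-tools reconstructions prepared as described above.
train_sino, train_img = load_training_pairs()
val_sino, val_img = load_validation_pairs()

model = build_image_transform_net()            # sketched earlier

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=total_loss,                           # Eq. (6), sketched earlier
)

model.fit(
    train_sino, train_img,
    validation_data=(val_sino, val_img),
    batch_size=50,
    epochs=300,
)
```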

The network is tested with three noise levels of the input sinograms. Frames of shorter duration (1/10 and 1/20 of the original frame) are used to generate sinograms with different noise levels. These sinograms are processed by SSRB and used as the input to the network. The outputs are compared with the reconstruction results of sinograms histogrammed from the complete list-mode PET data to evaluate the influence of the noise level of the input sinograms on the AI-based reconstruction results.
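The generation of the lower-count inputs can be sketched as follows; the list-mode event layout (one row per coincidence with a time stamp and 2D sinogram bin indices after SSRB) is a hypothetical format used only for illustration.

```python
import numpy as np

def short_frame_sinogram(events, frame_fraction, shape):
    """Histogram only the first fraction of a frame into a 2D sinogram stack.

    events : structured array with fields 't' (time stamp), 'slice', 'angle',
        and 'radial' -- a hypothetical list-mode layout after SSRB.
    frame_fraction : e.g. 1/10 or 1/20 of the original frame duration.
    shape : (n_slices, n_angles, n_radial) of the sinogram stack.
    """
    t0, t1 = events['t'].min(), events['t'].max()
    kept = events[events['t'] <= t0 + frame_fraction * (t1 - t0)]
    sino = np.zeros(shape, dtype=np.float32)
    np.add.at(sino, (kept['slice'], kept['angle'], kept['radial']), 1.0)
    return sino
```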

Image quality evaluation

The image quality evaluation is performed with the structural similarity index (SSIM), normalized root-mean-squared error (NRMSE), and peak signal-to-noise ratio (PSNR) [27], computed on the body regions. SSIM is an index used to measure the similarity of two images; the mean, standard deviation, and covariance estimate the brightness, contrast, and structural similarity, respectively. The values range between 0 and 1, and a value closer to 1 indicates that the output image is more similar to the target image. The SSIM is computed as

$$SSIM=\frac{\left(2{u}_{x}{u}_{y}+{C}_{1}\right)\left(2{\sigma }_{xy}+{C}_{2}\right)}{\left({u}_{x}^{2}+{u}_{y}^{2}+{C}_{1}\right)\left({\sigma }_{x}^{2}+{\sigma }_{y}^{2}+{C}_{2}\right)}.$$
(7)

NRMSE is calculated based on the mean square error (MSE):

$$NRMSE=\frac{\sqrt{MSE}}{\overline{x} }$$
(8)

where \(\overline{x }\) is the average value of all the pixels in the ground truth image, x represents the ground truth image, and y denotes the predicted image of the network.

The PSNR is computed as

$$PSNR=20\cdot {\mathit{log}}_{10}\left(\frac{{MAX}_{I}}{\sqrt{MSE}}\right)$$
(9)

where \({MAX}_{I}\) is the maximum value of the reconstructed image.
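A sketch of the computation of the three metrics (Eqs. (7)–(9)) is given below; the restriction to body regions is omitted, and the use of scikit-image for SSIM is an implementation choice rather than the exact code used in the study.

```python
import numpy as np
from skimage.metrics import structural_similarity

def evaluate_image(x, y):
    """x: ground truth image, y: predicted image (2D NumPy arrays)."""
    mse = np.mean((x - y) ** 2)
    nrmse = np.sqrt(mse) / np.mean(x)                          # Eq. (8)
    psnr = 20.0 * np.log10(np.max(x) / np.sqrt(mse))           # Eq. (9); MAX_I taken as max of the reference here
    ssim = structural_similarity(x, y, data_range=np.max(x))   # Eq. (7)
    return {"NRMSE": nrmse, "PSNR": psnr, "SSIM": ssim}
```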

Clinical evaluation

The results on the test dataset are further evaluated by 2 nuclear medicine physicians. For each patient, a typical lesion is selected and manually delineated; among the 10 patients, 1 patient proves to have no lesion. The mean standardized uptake value (SUVmean) and maximum standardized uptake value (SUVmax) of the tracer are measured in the selected lesions. The relative errors between the ground truth and the DeepPET results and between the ground truth and the proposed method are calculated and compared. The 3D sinogram data are also reconstructed using the FBP method for comparison, and example visualizations and statistics of the comparison are provided.
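A small sketch of the lesion-level comparison is given below; the SUV images and the manually delineated lesion masks are assumed to be available as arrays.

```python
import numpy as np

def lesion_suv_errors(reference_suv, predicted_suv, lesion_mask):
    """Relative SUVmean and SUVmax errors within a delineated lesion mask.

    reference_suv, predicted_suv : SUV images as NumPy arrays.
    lesion_mask : boolean array marking the manually delineated lesion.
    """
    ref = reference_suv[lesion_mask]
    pred = predicted_suv[lesion_mask]
    err_mean = (pred.mean() - ref.mean()) / ref.mean()   # relative SUVmean error
    err_max = (pred.max() - ref.max()) / ref.max()       # relative SUVmax error
    return err_mean, err_max
```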

III Results

We have trained the networks 3 times with randomly initialized weights. The obtained final loss curves are shown in Fig. 2. The MSE is used as the loss of DeepPET, while the perceptual loss of Eq. (6) is used for the proposed network. The losses drop significantly during the first 50 epochs of training, and the loss curves of the validation set stop decreasing after the network has been trained for 300 epochs. Therefore, we stop the network training at that point to prevent overfitting.

Fig. 2
figure 2

Loss curves of the network training

The average time cost of this work and DeepPET for predicting 1 patient (644 images) is 14 s in both cases, including 7 s for the data preparation process (SSRB) and 7 s for network prediction on an NVIDIA GeForce RTX 2080 Ti. The research prototype software (e7-tools, Siemens Healthineers) reconstructs a single patient’s images in approximately 200 s for PSF-TOF and 320 s for FBP-TOF. The time costs of the methods are compared in Fig. 3.

Fig. 3
figure 3

Comparison of time cost for the reconstruction of 1 patient’s data. This work and DeepPET take about 14 s per patient in total, including about 7 s for data preparation (SSRB) and about 7 s for prediction. The e7-tools use about 200 s with PSF-TOF and 320 s with FBP-TOF

Network test results

As the network selection criterion, the MSEs over the 10 validation patients are 9.21 for the DeepPET network and 7.63 for the network proposed in this study. For the 10 test cases, the predicted images of the proposed network and DeepPET are shown in Fig. 4, along with the ground truth and the input sinogram. Figure 6 presents the image quality evaluation results (NRMSE, PSNR, and SSIM) obtained by the proposed network for the ten test patients. The same patients’ data are also tested using DeepPET, and the average results over the ten patients are shown in Table 1.

Fig. 4
figure 4

Test set reconstruction results using DeepPET and the proposed method. Left to right: PET sinogram, ground truth, the results of DeepPET, and the proposed network. The images are labeled with SSIM, NRMSE, and PSNR relative to the ground truth

Table 1 Quality evaluation results of the test database: NRMSE, PSNR, and SSIM. The quantitative results of the two networks are statistically analyzed with a paired t test

The input images and reconstruction results from different body areas are shown in Fig. 4. A strong similarity can be seen between the results obtained by the proposed method and the ground truth. Especially in regions where the tracer uptake is high, such as the head, chest, and heart, the results obtained by the proposed method are coherent with the reference values. Point-like high-uptake positions exist in the pelvic cavity, legs, and other areas, and the proposed network can also restore them accurately. In addition, slight structural differences can be observed at the edges of some structures, such as the details of the brain and the edge of the heart, which manifest as blurred edges. This is mainly because the ground truth image is reconstructed directly from the 3D sinogram, whereas 2D sinograms are used as the input for prediction in the proposed method; some information is lost when the 3D sinogram is converted to a 2D sinogram, and errors are introduced. Compared with the test results obtained by the original DeepPET structure, the restoration of the image structure and details is improved, which demonstrates the efficiency of introducing the perceptual loss network.

The quantitative result statistics are shown in Fig. 6, and the uncertainties are listed in Supplemental Table 2. The SSIM of the original DeepPET predictions compared with the ground truth is 0.95 ± 0.02, and the network shows a 2% improvement in SSIM (bringing it closer to 1) after the perceptual loss structure is introduced. In addition, the proposed network increases the peak signal-to-noise ratio from 82.02 ± 0.90 to 82.36 ± 0.87, a slight improvement. Moreover, the NRMSE decreases from 0.63 ± 0.06 to 0.60 ± 0.06, which indicates that the reconstructed image is closer to the ground truth. The quantitative results obtained by the two networks are statistically analyzed with a paired t test, and all metrics show a significant improvement.

It can be seen from Fig. 5 that the AI-based reconstruction is robust to noise in the sinograms. The image quality metrics (green crosses and red points in Fig. 6) of our results from noisy sinograms are at the same level as those from the original sinograms. This finding is consistent with existing studies [17], which demonstrated that a neural network produces images from a half-count sinogram that are nearly equivalent to those from full-count data. As a comparison, we reconstructed the sinograms of 1/10 and 1/20 frame duration with the FBP method and compared the results with the AI-based results (Supplemental Fig. 3, Table 3). With counts decreasing to 1/10 and 1/20, the SSIM decreases by 1.5% and 2.4% for the AI-based method, the NRMSE increases by 2.7% and 3.6%, and the PSNR decreases by 0.2% and 0.4%, which is much better than the FBP method. For the FBP method, the SSIM decreases by 10.6% and 14.7%, the NRMSE increases by 71.4% and 95.4%, and the PSNR decreases by 3.8% and 5.5%. The AI-based reconstruction is therefore less sensitive to the count statistics than FBP reconstruction, and the deep-learning reconstructed images from a low-count sinogram are very similar to the results from full-count data. The mechanism is that the convolutional layers extract features of the input data from a larger spatial context, reducing the noise caused by the low-count input.

Fig. 5
figure 5

Comparison between the reconstruction results of the sinograms generated with full frames, with frames of 1/10 duration, and with frames of 1/20 duration. Left to right: sinogram generated with full frames, results of sinograms generated with full frames, with frames of 1/10 duration, and with frames of 1/20 duration

Fig. 6
figure 6

The image quality evaluation results of this work for ten test patients, including normalized root-mean-squared error (NRMSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM). (This work—results of sinograms generated with full frames; this work_n1—sinograms generated with frames of 1/10 duration; this work_n2—sinograms generated with frames of 1/20 duration)

Clinical evaluation results

The mean standardized uptake value (SUVmean) and maximum standardized uptake value (SUVmax) of the tracer are measured in a region of interest of the lesions (cf. Figure 7) for the test set (median age, 67 years; age range, 40–81 years; 5 females; BMI, 23.85 ± 2.38 kg/m2). The relative errors between the ground truth, DeepPET, and the proposed method are calculated (cf. Table 2). The comparison shows that for smaller lesions, such as lesions 1 and 6, the reconstruction results obtained by the proposed method are closer to the ground truth. For larger lesions, such as lesions 3 and 7, the recovery of SUVmax by the proposed method and DeepPET is slightly worse. However, by comparing the shape and contour of the lesions, it can be seen that the results obtained by the proposed method are more similar to the ground truth. For some cases, the two reconstruction results do not contain enough details, and lesion 8 is not clearly separated. In addition, the proposed method shows superior performance for anatomical structures with low uptake; for instance, in the same slice as lesion 7, it better displays the non-uptake area in the liver, which is not clearly shown in the DeepPET results. In general, compared with DeepPET, the SUVmean and SUVmax of the lesions obtained by the proposed method are closer to the ground truth, which indicates that the prediction results can provide a better clinical reference at the lesion level. Both the currently trained network and DeepPET show possible degradation of small lesions such as lesion 2. Compared with the results obtained by the AI methods, the FBP method generates more accurate SUVmean and SUVmax for some lesions, such as lesions 6, 7, and 8. However, as expected, the noise level of the FBP reconstruction results is higher.

Fig. 7
figure 7

The mean standardized uptake values (SUVmean) and maximum standardized uptake values (SUVmax) of the tracer measured in the lesions

Table 2 The mean standardized uptake value (SUVmean) errors and maximum standardized uptake value (SUVmax) errors between the ground truth and the results of DeepPET, this work, and FBP for lesions from the test cases

IV Discussion

This study follows the mainstream of AI development for PET reconstruction and focuses on direct reconstruction from sinogram data. In contrast to most existing studies, which use sinogram data from phantom-based simulation [15] or anthropomorphic simulation obtained by projecting real patient data [30], the training and testing in this study are performed directly on real PET measurements. In addition, a critical concern for AI development is its reproducibility and extensibility to the complexity of real applications [31]. Compared with development on simulated sinogram data, the development on real measurement data in this study can better tackle the challenges of physical and physiological complexity. It also enhances the translational potential of data-driven methods.

An advanced LAFOV PET scanner is used to develop and test the AI-based direct reconstruction from sinogram data. Although the conversion to 2D sinogram data of LAFOV PET leads to a loss of information, the preliminary results demonstrate that the deep neural network can reconstruct PET images with corrections for attenuation and scattering directly from sinogram data, without requiring a CT input. The ground truth data used for training are reconstructed PET images with corrections for attenuation and scattering. This potential of AI in complex reconstruction with different corrections may benefit the reconstruction of LAFOV PET, considering the increased complexity of its reconstruction [21]. Although the current study does not consider all the challenging issues, such as the larger and heterogeneous solid angles in LAFOV PET reconstruction, the AI methods may be able to deal with this complexity and heterogeneity, which encourages the development of this technology. At this stage, the AI-based reconstruction may be less advanced and accurate than the physics-based reconstruction. Further improvements of the input sinograms and of the training data with more accurate corrections may enhance the performance of this data-driven approach in LAFOV PET reconstruction, which may eventually reach or outperform the physics-based reconstruction.

Due to the large number of LORs received in LAFOV PET, the storage and processing of the acquisition data are daunting [21, 22]. For instance, the 106-cm LAFOV system has roughly 10 times as much data to process as a short axial FOV (SAFOV) PET system, and when more oblique LORs are used, there could be a 40-fold increase [22]. In fact, the prompt count rate peaks at 10 million events, a few orders of magnitude larger than for a traditional PET scanner [21]. Conventional PET reconstruction algorithms are inefficient in processing the vast amount of data in LAFOV PET reconstruction. Although the presented test is performed on sinogram data, the results demonstrate that deep learning can significantly shorten (by up to 36 times) the reconstruction time for whole-body imaging compared with a conventional iterative algorithm. This potential for accelerating the computation may bring advantages for the practice of LAFOV PET in the clinical routine.

Based on the lesion demarcation, overall image quality, and visually assessed signal-to-noise ratio, the proposed method improves image quality compared with the traditional DeepPET approach. In addition, a semiquantitative measurement is used, and the obtained results are shown in Table 2. This paper evaluates a series of lesions located in different organs such as the rib, muscle, mediastinum, and retroperitoneal soft tissue (cf. Figure 7). Lesions 1 and 2 are both located in the rib; the image obtained by the proposed method shows a better-outlined shape than the traditional DeepPET, whose result could easily be misdiagnosed as located in the sternum. Considering the purpose of optimizing the reconstruction, the outputs show a satisfying performance in presenting the morphological character of the primary lesions with elevated uptake values. The lack of structural details leads to misdiagnosis of the location and, in the worse cases, conceals some small lesions. This may be due to the limited number of training cases; the use of larger and more varied training sets can improve the prediction accuracy. Compared with the actual reconstruction, the currently developed AI-based reconstruction can generally recover the primary anatomy of patients similar to the trained ones, and it can generally maintain contrasts and quantitative relations. As can be seen from Fig. 7, no attenuation or scatter artifacts are observed in the AI-reconstructed images. The AI-reconstructed images generally look smoother than the actual reconstruction and may miss lesions in complex anatomical contexts, such as the ribs, intercostal spaces, and supra-/sub-clavicular areas (e.g., lesions 2 and 8). For the quantitative analysis of the lesions, our network misses 2 lesions, reduces the SUV values of 4 lesions, and increases the SUV values of 3 lesions, while DeepPET misses 2 lesions, reduces the SUV values of 5 lesions, and increases the SUV values of 2 lesions. Although our network has lower biases than DeepPET, the AI-based reconstruction methods are still suboptimal and remain at an early research stage; they are not yet able to replace conventional reconstruction. Nevertheless, with the proof of concept in this study, it is expected that further development of AI-based reconstruction with a more extensive and diverse training dataset may overcome these limitations and improve the performance. Eventually, the AI-derived results may match or outperform the conventional reconstruction.

The proportion of female cases is 43%, 50%, and 50% in the training, test, and validation datasets, respectively. We compare the test results of the 10 test cases, including 5 males and 5 females. The NRMSEs, PSNRs, and SSIMs of the two genders are calculated, and a one-way analysis of variance is performed. It can be observed from Table 3 that all the p values of NRMSE, PSNR, and SSIM are above 0.05, which indicates no significant gender bias for the trained network in this study.

Table 3 The one-way analysis of variance for gender bias
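The gender comparison can be reproduced with a standard one-way ANOVA, for example with SciPy as sketched below; the per-patient metric arrays are supplied by the user.

```python
from scipy.stats import f_oneway

def gender_bias_test(metric_male, metric_female):
    """One-way ANOVA on a per-patient image quality metric (NRMSE, PSNR, or
    SSIM) for the male and female test cases; a p value above 0.05 is read
    as no significant gender bias."""
    f_stat, p_value = f_oneway(metric_male, metric_female)
    return f_stat, p_value
```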

One limitation of this study is the conversion of 3D sinograms to 2D sinograms, where some noise is introduced and part of the spatial information in the axial direction is lost [32, 33]. This results in information loss for LAFOV PET and hampers the throughput of image reconstruction. The conversion is made necessary by the large memory requirements of 3D sinograms and our limited GPU memory: the memory required to process one patient can reach 19 GB, and it is almost impossible to train a 3D network on hundreds of patients with the GPU capacities currently available in most research institutes. Consequently, a compromise is made to focus on reconstructing 2D sinograms after transformation. With the anticipated increase in computational capacity, exploring 3D AI-based reconstruction can become feasible in the future. Nevertheless, our results demonstrate that AI-based reconstruction can partly overcome the limitation of 2D sinograms and is relatively robust to the information loss. We believe that the current development of 2D AI-based reconstruction on real clinical data is an important step toward the breakthrough of AI technology in PET image reconstruction.

We tested the trained network with the NEMA International Electrotechnical Commission (IEC) body phantom [34] and with patients with extreme anatomies and found that the network trained on regular clinical data fails in these extreme situations. Deep learning is a data-driven method, and the performance of deep learning models depends heavily on the knowledge established from the training data [35,36,37]. Although the AI-based reconstruction methods have several advantages over conventional reconstruction methods, they have limited extrapolation capability and are not suitable for untrained scenarios such as the physical phantom and extreme anatomy cases in this study. There is still a long way to go before AI-based methods can replace traditional reconstruction methods. Although this study focuses on developing and evaluating regular clinical data, it is expected that the developed methodology will work in these extreme situations, provided that sufficient relevant training data can be prepared in future work.

This study uses sinograms and reconstructed images with attenuation and scatter corrections as the training data. Given the supervision of these corrected reconstructions, it is assumed that the deep neural network can learn the complex principles of reconstruction with attenuation and scatter correction, but the underlying mechanism is not clear. The black-box nature of AI-based reconstruction methods is a critical limitation [38], and further interpretation of the AI-based methodology is an important direction for future research.

V Conclusion

This paper proposes a network structure combining an encoder-decoder and a perceptual loss structure to improve direct PET image reconstruction from projection data. To the best of our knowledge, this is the first AI-based reconstruction method tested on real clinical data from a LAFOV PET system. The preliminary results demonstrate that improving the deep learning architecture can improve the performance of AI-based reconstruction. In response to the challenge of training on real data, the perceptual loss network structure is used to optimize the neural network: the pre-trained VGG network extracts feature maps from the predicted images and the ground truth, and the perceptual loss is added to the loss function, improving the training efficiency and the network performance. The comparison of the prediction results demonstrates that the structural similarity of the reconstructed images and the signal-to-noise ratio are improved. This is because the perceptual loss function measures the distance between the predicted image and the target image at the feature level rather than at the pixel level; therefore, the structure of the image can be better reconstructed over a larger area [24].

Despite the limitations of the AI-based methods, the end-to-end reconstruction process from sinogram data demonstrates the potential of deep learning to learn complex reconstruction principles such as projection, normalization, attenuation correction, and scatter correction. The current research results reveal the possibility and advantages of AI methods for PET image reconstruction, but it should be acknowledged that the reconstruction accuracy cannot yet fully meet clinical requirements. In future work, further optimization and development of AI-based reconstruction may provide an efficient solution for complex PET reconstruction such as LAFOV PET.