Introduction

Positron emission tomography (PET) plays an important role in molecular imaging, quantitatively revealing tissue metabolism and neurochemistry in vivo, and has been widely used in humans and animals [1, 2]. In clinical routine, a semi-quantitative index, the standardized uptake value (SUV), is the standard means of interpreting PET images [3]. However, a number of factors, such as the amount of tracer injected and the uptake time after injection, affect the accuracy of image evaluation and diagnosis [4]. To enable absolute quantitative analysis, dynamic PET scanning followed by kinetic modeling has been applied to provide useful physiological parameters of interest, such as blood flow and metabolism, giving complementary information for clinical diagnosis and therapy [5, 6]. Conventionally, parametric images are produced indirectly: a series of dynamic images is first reconstructed independently from the sinogram data, and the time activity curves (TACs) are then fitted with kinetic models, typically linear graphical analyses (e.g., the Patlak/Logan plot) or non-linear compartment models [6]. However, the noise distribution in iteratively reconstructed dynamic images is usually space variant, object dependent, and difficult to characterize, resulting in inaccurate estimation of parametric images with this indirect approach [7, 8]. Direct parametric image reconstruction tackles this problem by generating parametric images directly from the measured raw sinograms, where the noise follows a well-defined Poisson distribution [9]. It reduces noise propagation [10] and therefore improves the quality of the parametric images [11] as well as the physiological quantification [12].

In spite of its promising image results and potential clinical applications, dynamic PET imaging is still hampered by several limitations: (i) long acquisition times, (ii) the need for accurate measurement of the arterial input function (AIF), and (iii) large data sizes due to the number of frames [3,4,5, 13]. On standard axial field-of-view PET scanners, dynamic whole-body imaging requires a multi-bed multi-pass protocol because of the small axial field of view (AFOV) and low sensitivity of the scanner [11, 14, 15]. A routine dynamic scan usually starts after tracer injection and lasts for more than 1 h to guarantee adequate photon counts and avoid noisy images. Such long acquisitions lead to inevitable physiological motion [2], low scanner throughput for hospitals [16], and patient discomfort. Moreover, parametric image reconstruction methods require an accurate estimate of the AIF, which in early research was obtained by invasive blood sampling through an arterial or arterialized venous catheter [17]; this is invasive and costly for patients and clinical staff. Therefore, several alternative non-invasive methods have been proposed, including population-based input functions [18], factor analysis [19], image-derived input functions (IDIF) [20,21,22], simultaneous estimation [23], and recent machine learning methods [24]. IDIF is the most common non-invasive method and requires measuring the activity in a blood-pool region such as the ascending or descending aorta or the left ventricle (LV). In addition, dynamic PET scans require many data frames, so the resulting large datasets become a challenge in themselves [25].

Recent advancements in long axial field-of-view (LAFOV) PET scanners, such as the uEXPLORER (United Imaging, Shanghai, China), the PennPET Explorer, and the Biograph Vision Quadra (Siemens Healthineers, Hoffman Estates, IL, USA), provide new possibilities and challenges for parametric imaging [26,27,28,29], making single-bed single-pass whole-body dynamic scans possible [30, 31]. The large coverage and high sensitivity facilitate blood input function measurement, more accurate tracer kinetic modeling, and high-quality parametric imaging [32]. They also enable the potential use of abbreviated dynamic imaging protocols [33]. Nevertheless, estimation of the AIF is still necessary in current dynamic protocols on both conventional and total-body PET scanners, and the many short time frames acquired impose a heavy storage and computation burden on the PET system. Therefore, a methodology that avoids AIF measurement and saves storage is urgently needed.

In recent years, deep learning has been applied to many tasks in medical imaging, such as noise reduction [34,35,36,37], image segmentation [38, 39], and image reconstruction [40,41,42,43,44]. Convolutional neural networks (CNNs) and generative adversarial networks (GANs) can achieve results comparable or superior to traditional algorithms, along with fast computation speed.

In particular, research on using CNNs as a regularization term in the reconstruction model [40] or on directly transforming PET projection data into images through CNNs [43, 44] has drawn much attention in deep learning–based PET image reconstruction. The work in [44] proposed a convolutional encoder–decoder (CED) model, DeepPET, that successfully reconstructs a PET sinogram into a high-quality image without time-consuming back-projection steps. Motivated by the powerful representation ability and end-to-end training of DeepPET, we aimed to realize fast parametric imaging with high image quality and without the need for an IDIF. Specifically, we modified the original DeepPET architecture and introduced self-attention modules to reconstruct dynamic multi-frame sinograms directly into Patlak parametric images. The experiment was implemented on a total-body PET scanner, the Biograph Vision Quadra. Twenty patients were recruited for an 18F-FDG dynamic scan. During training, the sinograms acquired over part of the scan time were used as input and the conventionally reconstructed direct Patlak Ki images served as ground truth. As a preliminary study, this work mainly attempts to demonstrate the feasibility of fast parametric reconstruction without an input function using deep learning.

Materials and methods

Data preparation

The Biograph Vision Quadra is an LAFOV PET scanner with high sensitivity (176 cps/kBq) [29], which has the potential to accelerate data acquisition [31]; its long axial length (106 cm) covers the critically important organs of interest, enabling parametric imaging of the major organs in a single bed position. Twenty patients were recruited for an 18F-FDG dynamic scan. The local Institutional Review Board approved the study (KEK 2019-02193), and all patients provided informed consent. As the Patlak graphical method relies on the late-time linear phase of the plot, we chose the last five frames (25 min, 40–65 min post-injection) of sinograms as the training input dataset; the sinograms were crystal-based and only randoms correction was applied, by subtracting the delayed sinograms. These data were also reconstructed into parametric images with a direct parametric image reconstruction method, the nested EM algorithm (8 iterations, 5 subsets, and 30 nested loops), using an IDIF measured from the descending aorta. A Gaussian filter with 2-mm FWHM was applied to the final reconstructed parametric images [13, 32].

Parametric image reconstruction model

In a dynamic PET scan, the measured data \(y\) follow a Poisson distribution:

$${y}_{lm}\sim {\text{Poisson}}\{{\overline{y} }_{lm}\}$$
(1)
$${\overline{y} }_{lm}={\sum }_{j=1}^{{n}_{j}}{p}_{lj}{x}_{jm}+{r}_{lm}+{s}_{lm}$$
(2)

where \({p}_{lj}\) is the PET system matrix, \(l\) and \(j\) index the sinogram bins and image pixels, \(m\) indexes the frame, \(r\) and \(s\) are the random and scatter events measured during data acquisition, and \(x\) is the activity map. For the conventional parametric image reconstruction in this work, linear Patlak modeling was used, the most widely used graphical analysis technique for irreversible tracers such as 18F-FDG. In this model, the activity map \(x\) at time \(t\) can be modeled as [45]:

$$\begin{array}{cc}x(t)={K}_{i}{\int }_{0}^{t}{C}_{p}(\tau )d\tau +DV\cdot {C}_{p}(t)& t>{t}^{*}\end{array}$$
(3)

where \({t}^{*}\) is the equilibrium time, \({K}_{i}\) denotes the uptake (influx) rate of the tracer into the irreversibly bound compartment, and the intercept \(DV\) denotes the initial volume of distribution. \({C}_{p}\) represents the plasma input function obtained by the aforementioned invasive blood sampling or by non-invasive approaches.
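For illustration, the following is a minimal sketch, not the implementation used in this work, of how Ki and DV can be estimated for a single voxel or region TAC by ordinary least squares on Eq. (3); the function name and array arguments are hypothetical.

```python
import numpy as np

def patlak_fit(tac, cp, t, t_star):
    """Least-squares fit of the Patlak model in Eq. (3) for one TAC.

    tac    : measured activity x(t) at frame mid-times
    cp     : plasma input function C_p(t) at the same time points
    t      : frame mid-times [min]
    t_star : equilibrium time t* after which the model is valid
    Returns (Ki, DV).
    """
    # Running integral of the input function, int_0^t C_p(tau) dtau
    # (trapezoidal rule; assumes t[0] is close to the injection time)
    int_cp = np.concatenate(
        ([0.0], np.cumsum(0.5 * (cp[1:] + cp[:-1]) * np.diff(t))))
    late = t > t_star
    # Eq. (3) is linear in (Ki, DV): x(t) = Ki * int_cp(t) + DV * C_p(t)
    A = np.stack([int_cp[late], cp[late]], axis=1)
    ki, dv = np.linalg.lstsq(A, tac[late], rcond=None)[0]
    return ki, dv
```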

To estimate \({K}_{i}\) and \(DV\) directly from projection data, a nested EM algorithm [46] was employed, in which the activity image update and the parameter estimation are decoupled into the following steps ((4)–(6)), applied iteratively [13]:

$${x}_{jm}:=\frac{{\overline{x} }_{jm}}{{p}_{j}}{\sum }_{l}{p}_{lj}\frac{{y}_{lm}}{{\sum }_{j=1}^{{n}_{j}}{p}_{lj}{\overline{x} }_{jm}+{r}_{lm}+{s}_{lm}}$$
(4)
$$\left\{\begin{array}{c}K_{i_j}:=\frac{K_{i_j}}{\sum_m\int_0^{t_m}C_p(\tau)d\tau}\sum\nolimits_m(\int_0^{t_m}C_p(\tau)d\tau\frac{x_{jm}}{{\overline x}_{jm}})\\DV_j:=\frac{DV_j}{\sum_mC_p(t_m)}\sum\nolimits_m(C_p(t_m)\frac{x_{jm}}{{\overline x}_{jm}})\end{array}\right.$$
(5)
$${\overline{x} }_{jm}={K}_{{i}_{j}}{\int }_{0}^{{t}_{m}}{C}_{p}(\tau )d\tau +DV_j\cdot {C}_{p}({t}_{m})$$
(6)

where the sub-loop (nested loop) in (5) is embedded in the main loop from (4) to (6). In this work, we targeted the Patlak Ki image.
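The following is a simplified, dense-matrix sketch of the update scheme in Eqs. (4)–(6), with the ordered subsets omitted for clarity; the variable names and the handling of the system matrix are illustrative assumptions rather than the scanner implementation.

```python
import numpy as np

def nested_em_patlak(y, P, r, s, int_cp, cp, n_main=8, n_nested=30):
    """Toy version of the nested EM updates of Eqs. (4)-(6).

    y      : measured sinograms, shape (n_bins, n_frames)
    P      : system matrix, shape (n_bins, n_pixels), dense here for clarity
    r, s   : randoms and scatter estimates, same shape as y
    int_cp : int_0^{t_m} C_p(tau) dtau per frame, shape (n_frames,)
    cp     : C_p(t_m) per frame, shape (n_frames,)
    """
    n_pix = P.shape[1]
    ki = np.ones(n_pix)
    dv = np.ones(n_pix)
    sens = P.sum(axis=0)                                  # p_j in Eq. (4)
    for _ in range(n_main):
        # Eq. (6): model the frame activities from the current parameters
        xbar = np.outer(ki, int_cp) + np.outer(dv, cp)
        # Eq. (4): one EM update of the frame-wise activity images
        ybar = P @ xbar + r + s
        x = (xbar / sens[:, None]) * (P.T @ (y / ybar))
        # Eq. (5): nested loop re-estimating Ki and DV from the EM images
        for _ in range(n_nested):
            xbar = np.outer(ki, int_cp) + np.outer(dv, cp)
            ratio = x / xbar
            ki = ki / int_cp.sum() * (ratio @ int_cp)
            dv = dv / cp.sum() * (ratio @ cp)
    return ki, dv
```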

CNN framework

In this study, we constructed a deep CNN motivated by DeepPET [44], which employs a CED architecture to reconstruct projection data into an image. Compared to traditional iterative methods, e.g., maximum-likelihood expectation maximization (MLEM), DeepPET reconstruction learns a mapping (operator) from projection data to image from a large number of training datasets. Adequately diverse and extensive training data are the key consideration for mapping an unseen input to an unknown ground truth [47]. We therefore constructed a DeepPET-like structure for the task of parametric imaging. Figure 1 illustrates the schematic view of the CNN framework used in this study, which consists of encoding, transformation, and decoding parts, as well as a domain transformation module that reconstructs the input sinograms into dynamic images with the ordered subset expectation maximization (OSEM) algorithm and then feeds this dynamic image information into the decoding part; a sketch of the OSEM update is given below. The final output is the predicted parametric image. Introducing dynamic image information encourages the network to learn richer features and thereby improves its generalization ability. The multi-frame sinograms were fed into the encoding phase as a multi-slice input, and the directly reconstructed Patlak Ki images served as the training label. In addition, given the characteristics of parametric reconstruction, we introduced a self-attention module to capture spatial and temporal features in the spatial and channel dimensions. Traditional convolution operations process a local receptive field with fixed-size kernels (e.g., 3 × 3, 5 × 5) and lack the ability to capture global information or long-range dependencies [48, 49]. Therefore, we replaced the transformation layer between the encoder and decoder in the original DeepPET with spatial attention and temporal/channel attention modules to improve the feature representation, as shown on the right of Fig. 1.
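As a reference for the domain transformation module, the sketch below shows a minimal OSEM update for a single frame; it is purely illustrative (dense system matrix, hypothetical argument names) and not the vendor reconstruction used here.

```python
import numpy as np

def osem_frame(y, P, r, s, n_iter=2, n_subsets=5):
    """Minimal OSEM sketch for one dynamic frame (dense system matrix).

    y    : measured sinogram, shape (n_bins,)
    P    : system matrix, shape (n_bins, n_pixels)
    r, s : randoms and scatter estimates, shape (n_bins,)
    """
    n_bins, n_pix = P.shape
    x = np.ones(n_pix)
    subsets = np.array_split(np.arange(n_bins), n_subsets)
    for _ in range(n_iter):
        for idx in subsets:
            Ps = P[idx]
            ybar = Ps @ x + r[idx] + s[idx]                 # expected counts
            x *= (Ps.T @ (y[idx] / ybar)) / Ps.sum(axis=0)  # multiplicative EM update
    return x
```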

Fig. 1
figure 1

Left: The DeepPET framework used as the baseline in this work, including three parts: encoder, transformation layer, and decoder. Right: Alternative transformation layers: upper, the module used in DeepPET; lower, the self-attention module

As shown in Fig. 1, the multi-frame sinograms pass through the encoding phase into a latent space representation and are rebuilt stepwise into an image-domain output in the decoding phase. In detail, each layer of the network consists of a convolutional layer (Conv), batch normalization layer (BN), and activation layer (ReLU). First, the sinograms are convolved with two layers with a kernel size of 7 × 7 and then processed by two down-sampling blocks with five 5 × 5 convolution layers, with the remaining layers having a kernel size of 3 × 3. As mentioned above, we adopted two structures for the transformation phase: the module used in DeepPET and the self-attention module. In DeepPET, all features in the transformation layer are of the same size (16 × 16), and the structure consists of three, five, and three consecutive convolution layers, respectively. The details of the self-attention module are shown in Fig. 2: two parallel attention modules connect the encoder and decoder. After the encoder phase, the feature maps are first fed into a convolution module to obtain high-level features. Then, the parallel spatial and channel attention modules compute attention matrices representing the spatial dependency within each slice and the interdependency between channel maps, respectively. These attention matrices are multiplied with the high-level features, and the two resulting matrices are summed element-wise. Before the decoder phase, the summed result is fed into a convolution module again. The distinction between spatial and channel attention and the calculation details follow a scene segmentation network, DANet [50]. Finally, in the decoding phase, the number of feature maps is reduced by a series of up-sampling and Conv-BN-ReLU blocks, and the last 3 × 3 convolution layer delivers one feature map.
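For readers interested in the structure, the following PyTorch sketch shows DANet-style spatial and channel attention blocks arranged in parallel, as in Fig. 2; the layer sizes and module names are our illustrative assumptions, not the exact configuration used.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """DANet-style position attention: each location attends to all others."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)          # (B, HW, C//8)
        k = self.key(x).flatten(2)                             # (B, C//8, HW)
        attn = torch.softmax(q @ k, dim=-1)                    # (B, HW, HW)
        v = self.value(x).flatten(2)                           # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x

class ChannelAttention(nn.Module):
    """DANet-style channel attention: interdependencies between channel maps."""
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q = x.flatten(2)                                       # (B, C, HW)
        attn = torch.softmax(q @ q.transpose(1, 2), dim=-1)    # (B, C, C)
        out = (attn @ q).view(b, c, h, w)
        return self.gamma * out + x

class DualAttentionBlock(nn.Module):
    """Parallel spatial + channel attention, summed element-wise (cf. Fig. 2)."""
    def __init__(self, channels):
        super().__init__()
        self.pre = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                 nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.sa = SpatialAttention(channels)
        self.ca = ChannelAttention()
        self.post = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, x):
        x = self.pre(x)
        return self.post(self.sa(x) + self.ca(x))
```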

Fig. 2
figure 2

An overview of the self-attention module consisting of a spatial attention module and a channel attention module

Optimization

In the optimization step, the mean absolute error (MAE) was adopted as a loss function, described below:

$${\text{MAE}}=\frac{1}{N}{\sum }_{i=1}^{N}\left|{y}_{i}-f({x}_{i})\right|$$
(7)

where \({y}_{i}\) denotes the label Patlak Ki image, \({x}_{i}\) the input sinograms, and \(f\) the neural network. To encourage the network to generate realistic textures and details matching the label, we introduced a perceptual loss [51], expressed as follows:

$${P}_{\text{loss}}=\frac{1}{N}{\sum }_{i=1}^{N}\left|\phi (f({x}_{i}))-\phi ({y}_{i})\right|$$
(8)

For the mapping function \(\phi\), we chose a pre-trained VGG16 network [52]. We extracted the outputs of the second and fifth pooling layers and calculated their MAE loss to account for both low-level and high-level features; details can be seen in Fig. 3. Overall, the total loss function is as follows:

$${\text{Loss}}=\alpha \cdot {\text{MAE}}+\beta \cdot {P}_{\text{loss}}$$
(9)

where \(\alpha\) and \(\beta\) are weighting parameters controlling the MAE loss and perceptual loss, respectively. We evaluated the performance of the proposed network trained with different combinations of \(\alpha\) and \(\beta\) to determine the final loss function. The value of \(\beta\) was first set to 0 and \(\alpha\) was chosen from {0.01, 0.1, 1, 10, 50}. After fixing the optimal value of \(\alpha\), \(\beta\) was chosen from {0.01, 0.1, 0.5, 1}. The effect of the \(\alpha\) and \(\beta\) values on the predicted results is shown in Fig. 4. The mean square error (MSE) between predicted Ki and label Ki was used as the criterion. The minimum MSE was found when \(\alpha\) and \(\beta\) were set to 10 and 0.01, respectively.
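A possible PyTorch realization of the loss in Eqs. (7)–(9) is sketched below; it assumes torchvision's VGG16 with features taken after the second and fifth pooling layers and single-channel Ki images replicated to three channels, which are our assumptions rather than the exact preprocessing used in this work.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    """MAE between VGG16 features at the 2nd and 5th pooling layers (Eq. 8)."""
    def __init__(self):
        super().__init__()
        features = vgg16(pretrained=True).features.eval()
        for p in features.parameters():
            p.requires_grad = False
        # In torchvision's VGG16, pool2 is layer index 9 and pool5 is index 30
        self.block1 = features[:10]    # up to and including pool2
        self.block2 = features[10:31]  # up to and including pool5

    def forward(self, pred, target):
        # Ki images are single-channel; replicate to 3 channels for VGG
        pred3, target3 = pred.repeat(1, 3, 1, 1), target.repeat(1, 3, 1, 1)
        p2, t2 = self.block1(pred3), self.block1(target3)
        p5, t5 = self.block2(p2), self.block2(t2)
        return (p2 - t2).abs().mean() + (p5 - t5).abs().mean()

def total_loss(pred, target, perceptual, alpha=10.0, beta=0.01):
    """Eq. (9): weighted sum of the MAE loss and the perceptual loss."""
    return alpha * (pred - target).abs().mean() + beta * perceptual(pred, target)
```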

Fig. 3
figure 3

The procedure of calculating the perceptual loss

Fig. 4
figure 4

Left: The prediction error for different \(\alpha\) when \(\beta\) was set to 0. Right: Different \(\beta\) when \(\alpha\) was set to 10

Training details

During network training and testing, the sinograms and Patlak Ki images were used as input and label data, respectively. The dimension of the original sinogram was 520 × 50 × 5 and that of the Patlak Ki images was 440 × 440. We resized the sinograms and Ki images to 256 × 256 × 5 and 256 × 256, respectively, with an interpolation algorithm. Data from sixteen patients were used for training and four for testing. Pairs of sinograms and direct Patlak Ki images were used for network training and optimization; the whole workflow can be seen in Fig. 5. The network was implemented with Python 3.8 and PyTorch 1.8. Training and testing were performed on Ubuntu 20.04. For optimization, we chose the Adam optimizer with a learning rate of 0.0001 and a batch size of 48. The number of epochs was set to 300, at which point the model had converged. To inspect the performance of the CNN-based method on lesion volumes, a qualified nuclear medicine physician identified the 18F-FDG-avid malignant lesions and delineated tumor volumes of interest (VOIs) with a professional tool (PMOD v.4.1), using a threshold of 50% of the maximum in the SUV images.
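The training procedure can be summarized by the simplified sketch below, using the hyperparameters stated above (Adam, learning rate 0.0001, batch size 48, 300 epochs); the domain transformation branch and data loading details are omitted, and the tensor and function names are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train(model, sinograms, ki_labels, perceptual, epochs=300, lr=1e-4,
          batch_size=48, alpha=10.0, beta=0.01, device="cuda"):
    """sinograms: (N, 5, 256, 256) resized inputs; ki_labels: (N, 1, 256, 256)."""
    loader = DataLoader(TensorDataset(sinograms, ki_labels),
                        batch_size=batch_size, shuffle=True)
    model, perceptual = model.to(device), perceptual.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            pred = model(x)
            # Eq. (9): alpha * MAE + beta * perceptual loss
            loss = alpha * (pred - y).abs().mean() + beta * perceptual(pred, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```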

Fig. 5
figure 5

The workflow of data acquisition, processing, and training

Evaluation metrics

To perform a quantitative evaluation of the CNN-based methods, MSE, structural similarity index measure (SSIM), and peak signal-to-noise ratio (PSNR) were calculated.

$${\text{MSE}}=\frac{1}{N}{\sum }_{i=1}^{N}{({y}_{i}-f({x}_{i}))}^{2}$$
(10)
$${\text{SSIM}}(x,y)=\frac{(2{\mu }_{x}{\mu }_{y}+{c}_{1})(2{\sigma }_{xy}+{c}_{2})}{({\mu }_{x}^{2}+{\mu }_{y}^{2}+{c}_{1})({\sigma }_{x}^{2}+{\sigma }_{y}^{2}+{c}_{2})}$$
(11)
$${\text{PSNR}}=10\cdot {\mathrm{log}}_{10}\left(\frac{{\text{MAX}}^{2}}{\text{MSE}}\right)$$
(12)

where \({\mu }_{x}\) and \({\mu }_{y}\) are the mean values of the network output and label, \({\sigma }_{xy}\) is their covariance, \({\sigma }_{x}^{2}\) and \({\sigma }_{y}^{2}\) are the variances, and \({c}_{1}\) and \({c}_{2}\) are two constants.
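For concreteness, the metrics in Eqs. (10)–(12) can be computed globally per slice as in the following NumPy sketch; in practice a windowed SSIM and specific choices of MAX and the constants may be used, which we do not specify here.

```python
import numpy as np

def mse(pred, label):
    return np.mean((label - pred) ** 2)                 # Eq. (10)

def psnr(pred, label):
    max_val = label.max()                               # MAX in Eq. (12)
    return 10.0 * np.log10(max_val ** 2 / mse(pred, label))

def ssim(pred, label, c1=1e-4, c2=9e-4):
    """Global SSIM as in Eq. (11); a windowed SSIM would average this locally."""
    mu_x, mu_y = pred.mean(), label.mean()
    var_x, var_y = pred.var(), label.var()
    cov_xy = np.mean((pred - mu_x) * (label - mu_y))
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```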

Results

General results

To assess the performance of CNN-based reconstruction, six normal 2D slices representing multiple body parts from the four test patients are shown to illustrate how the CNN output compares with the conventionally reconstructed direct Ki. The comparison between DeepPET and the proposed self-attention DeepPET was also carried out, as shown in Fig. 6; these two networks are labeled DeepPET and proposed in all figures and tables, respectively. From top to bottom, Fig. 6 shows the results of DeepPET, self-attention DeepPET, and the label Ki images. To reveal more details, we zoomed in on the local region marked by the red rectangle in the label Ki image for each result. Overall, as seen in Fig. 6, the CNN-based results reproduce image structures similar to the nested EM results. The self-attention DeepPET outperformed DeepPET in fine detail, where purely 2D convolution operations with a limited receptive field are insufficient to capture global information. Especially in high-activity regions, self-attention DeepPET showed structure profiles and value distributions closer to the label Ki images than the original DeepPET framework. Taking the cardiac area as an example, the DeepPET predictions appeared to overestimate Ki in slice 3, showing a broader distribution of high Ki values, and to underestimate Ki in slice 4, compared with self-attention DeepPET. In slice 2, the DeepPET results even missed small structures. To compare the methods quantitatively, the MSE, PSNR, and SSIM values between predicted results and label images are listed at the lower left of each image slice. Low MSE (< 0.1%), high PSNR (> 30 dB), and high SSIM (> 0.9) were observed for both CNN-based methods. This demonstrates that the CNN-based framework can achieve image quality comparable to traditional direct parametric image reconstruction and that self-attention DeepPET performs better than the DeepPET framework.

Fig. 6
figure 6

The Ki images reconstructed by the CNN-based methods and nested EM. From top to bottom: the results of DeepPET, the proposed self-attention DeepPET, and nested EM (label). The MSE, PSNR, and SSIM values between each predicted result and the label data are shown at the lower left of each image

For quantitative analysis, the average MSE, SSIM, and PSNR values were calculated over all test datasets to evaluate the performance of CNN-based parametric reconstruction, as listed in Table 1; a clearer illustration is given in Fig. 7. From Table 1, it is apparent that both CNN-based methods achieved a small MSE of about 0.03% and a high SSIM of about 0.98, as well as a high PSNR. Comparing DeepPET and self-attention DeepPET, the MSE was 0.032% for the former and 0.028% for the latter, and the PSNR of the latter was ~0.7 dB higher, whereas both methods gave very similar SSIM values. In addition, since reconstruction time is one of the concerns of this work, the reconstruction times of the CNN-based methods are shown in Table 2. Here, we regarded the sum of the model loading time (nearly 3.0 s) and the image generation time for an individual volume (619 slices per patient) as the reconstruction time. The CNN-based methods took less than 20 s to reconstruct an individual volume. Since self-attention DeepPET replaces the very deep convolution layers in the transformation part of DeepPET with self-attention modules involving only a few convolution and matrix operations, it took less time than DeepPET.

Table 1 Quantitative results in terms of MSE, SSIM, and PSNR
Fig. 7
figure 7

Box plots of quantitative comparison between DeepPET and self-attention DeepPET results for four test patients in terms of MSE, PSNR, and SSIM

Table 2 Mean computation cost comparison

Lesion analysis

According to the lesion segmentation results, we obtained 11 VOIs from the test dataset and selected six slices to display, as seen in Fig. 8, which shows the results of DeepPET, self-attention DeepPET, and the label Ki from top to bottom. The values of the evaluation metrics (MSE, PSNR, and SSIM) are listed, and the local regions are zoomed in. Qualitatively and quantitatively, compared to the Patlak Ki images reconstructed by the nested EM algorithm, the predicted Ki recovered most lesion details. Moreover, the proposed method using the self-attention module produced better results than the DeepPET framework in terms of MSE, PSNR, and SSIM. As with the normal slices, the DeepPET results had larger errors than those of self-attention DeepPET with the same learning rate and number of epochs, and self-attention DeepPET delineated a more accurate profile and value distribution for each lesion. For example, in lesion 1, the DeepPET result overestimated the Ki values at the edge of the lesion (a higher Ki implies a higher tracer influx rate), while the proposed self-attention framework showed lesion morphology closer to the label Ki.

Fig. 8
figure 8

The Ki image slices with lesion obtained from CNN-based method and nested EM reconstruction. From upper to lower, the results of DeepPET, proposed self-attention DeepPET, and nested EM (label), respectively

To quantify the performance of the CNN-based methods on lesion detection, we calculated the mean Ki with standard deviation over the 11 lesion VOIs and list the statistics in Table 3; the unit of Ki is mL/g/min. The histogram and linear regression results are shown in Fig. 9. In the regression plot, the horizontal axis is the true Ki and the vertical axis the predicted Ki from the CNN-based methods. No significant difference between the CNN-based and traditionally reconstructed results was found, which suggests that the CNN-based method is applicable to parametric reconstruction and can produce images of the same high quality as directly reconstructed ones. The high correlation between the CNN-based and nested EM methods supports this conclusion, with \({R}^{2}\) of 0.73 for DeepPET and 0.82 for the proposed self-attention DeepPET.
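A minimal sketch of this VOI-level analysis, assuming predicted and label Ki volumes plus boolean lesion masks (hypothetical names), could look like the following:

```python
import numpy as np
from scipy import stats

def lesion_regression(pred_ki, label_ki, lesion_masks):
    """Mean Ki per lesion VOI for predicted vs. label images, plus R^2.

    pred_ki, label_ki : 3D Ki volumes
    lesion_masks      : list of boolean VOI masks, one per lesion
    """
    pred_means = np.array([pred_ki[m].mean() for m in lesion_masks])
    label_means = np.array([label_ki[m].mean() for m in lesion_masks])
    slope, intercept, r, p, se = stats.linregress(label_means, pred_means)
    return pred_means, label_means, r ** 2
```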

Table 3 The Ki means and standard deviations in lesions (N = 11)
Fig. 9
figure 9

The histogram with standard deviations (a) and the linear regression between predicted Ki and label Ki values (b) based on Table 3

Table 4 The Ki means and standard deviations in small lesions

In Fig. 10, we selected four larger lesions to evaluate the correlation between predicted Ki and true Ki. Based on the lesion segmentation masks, we calculated the mean Ki in each slice within each lesion volume, so the number of Ki means equals the number of slices the lesion volume covers. Linear regression was then applied between predicted Ki and true Ki. In each subplot, the left side shows the sagittal (top), coronal (middle), and transverse (bottom) planes with the lesions labeled in red, and the right side shows the regression result. As seen in Fig. 10, a significant correlation between predicted Ki and true Ki was found for most lesions. Additionally, the proposed self-attention DeepPET showed better results than DeepPET.

Fig. 10
figure 10

The scatter plot between predicted Ki from CNN-based method and label Ki on four larger lesion volumes

Moreover, to further investigate the ability of CNN-based parametric imaging for small lesions, three small lesions with diameters of less than 10 mm were chosen from the twenty patients' data. Training and testing were repeated with the same training details as above. As shown in Fig. 11, the lesions are a nodule in the posterior lower segment of the right liver lobe, a nodule in the apical segment of the left lung, and a lymph node in the right axilla. Diameters of 8.9 mm, 8.0 mm, and 6.0 mm were measured on the static PET transverse view, as seen in Fig. 11a. As can be seen from Fig. 11b, the predicted Ki results indicate that the CNN-based methods can detect small lesions successfully. Using the lesion segmentation masks, we calculated the mean Ki within these three lesions for both the CNN-based results and the label data, as shown in Table 4. The predicted Ki images preserved the lesion details and had comparable statistics, which is meaningful for clinical oncology research. Meanwhile, with the self-attention mechanism introduced, the predictions were better than those of DeepPET.

Fig. 11
figure 11

a The static PET images. b The Ki images obtained from DeepPET, the proposed self-attention DeepPET, and the nested EM algorithm

Discussion

In this work, we estimated parametric images using a CNN-based method for a total-body PET scanner. Building on previous work such as DeepPET and DPIR-Net [43, 44], which successfully produced static PET images directly from raw projection data, we proposed a deep convolutional encoder–decoder network for dynamic parametric reconstruction.

Apart from the raw projection data, we fed the low-resolution dynamic images into the decoding phase to help the network converge to optimal results given the limited dataset. In previous research on DeepPET [43, 44], a large number of datasets, including simulation phantoms, were used. In this study, the present results show that using sinograms and dynamic images simultaneously can deliver high-quality parametric images for a DeepPET-like network. In addition, we explored the feasibility of CNN-based parametric image generation from static or dynamic PET images only [53, 54]. A 2D U-Net CNN [55] was adopted to map static or dynamic PET images to parametric images. The static PET image (256 × 256, 60–65 min post-injection) and dynamic PET images (256 × 256 × 5, 40–65 min post-injection) were fed into the U-Net and trained separately. Compared with the proposed DeepPET-based structures, all training parameters except the learning rate were kept the same for U-Net; a learning rate of 0.0002 was chosen for U-Net to achieve optimal results. Three examples are shown in Fig. 12. The first column shows static PET images, the middle four columns show predicted Ki images from the different CNN structures, and the last column shows the Patlak Ki images. The predicted Ki results obtained from U-Net trained with static/dynamic images looked inferior, especially in low Ki regions, compared to the DeepPET-based networks trained with sinograms and dynamic images, and in the magnified regions the latter presented structure and value distributions closer to the label Ki. Figure 13 shows the quantitative results on the test dataset for the four CNN-based methods in terms of MSE, PSNR, and SSIM. The two DeepPET-based methods achieved lower MSE and higher PSNR and SSIM than U-Net. Meanwhile, training U-Net with dynamic PET images achieved better results than training with static images, possibly because the multi-frame input acts as feature augmentation and introduces time-varying tracer distribution information.

Fig. 12
figure 12

Three slices of static PET, predicted Ki, and Patlak Ki images from left to right. The predicted Ki images are the results obtained from U-Net trained by static images, U-Net trained by dynamic images, and DeepPET and self-attention DeepPET trained by sinograms along with dynamic images, respectively

Fig. 13
figure 13

Box plots of quantitative comparison of different predicted results for all the test patients in terms of MSE, PSNR, and SSIM

Among deep learning–based parametric imaging studies, one approach embeds a CNN module into the reconstruction model, such as the CT-guided Logan plot [56], in which an iterative reconstruction framework with a deep neural network as a constraint was implemented. This kind of method no longer needs a large number of training pairs, but it does require a corresponding anatomical image from CT or MRI. Another approach maps indirect Patlak images to direct ones with a CNN, but an indirect Patlak reconstruction is required before the CNN is applied [57]. In all of these deep learning–based parametric reconstructions, the blood input function must still be acquired, invasively or non-invasively. In contrast, the proposed CNN-based method works well without additional anatomical images or a blood input function, delivering high-quality Patlak Ki estimates comparable to the standard nested EM algorithm.

Recently, there has been growing interest in total-body PET scanners. The LAFOV offers large anatomical coverage with excellent sensitivity. For previous scanners, a sensitivity of less than 1% has long been a challenge, resulting in poor signal-to-noise ratio (SNR) in images; the LAFOV PET approach addresses this dilemma. Several studies have demonstrated that total-body PET yields an approximately 40-fold increase in effective sensitivity and enables shorter scan times [58]. PET scanners with higher sensitivity than conventional systems have significant potential to promote fast dynamic scans and lower-dose scans. However, this comes with a dramatic increase in the volume and complexity of dynamic data. With respect to this motivation, studies on parametric imaging of early 18F-FDG kinetics have demonstrated the feasibility of estimating parametric images using only the first 90 s of post-injection scan data on a total-body PET scanner [25]. In this study, we used only the last five frames as the data to be reconstructed, which not only reduces the data volume but also matches the fact that the Patlak graphical method relies on the late-time linear phase of the plot.

All results demonstrated that the CNN-based method can achieve image quality equivalent to direct parametric reconstruction with the nested EM algorithm. This suggests that deep learning methods can potentially generate total-body PET parametric images from data of the Biograph Vision Quadra and other LAFOV PET scanners. In the dynamic protocols on the Biograph Vision Quadra, a total of 62 frames are reconstructed, leading to a data size in excess of one gigabyte, and both indirect and direct reconstruction take considerable time. A deep learning–based approach may therefore be appropriate and could significantly reduce reconstruction time and complexity.

Compared with static PET scans, dynamic PET kinetic analysis reveals the tracer kinetics and has a temporal dimension. In our CNN, multi-frame sinograms were fed into the network and the temporal information was convolved in the channel dimension. To account for the characteristics of parametric reconstruction, we replaced the deep convolution layers in the transformation part of DeepPET with two parallel self-attention modules: spatial and channel attention. The results reveal that using only 2D convolution operations misses global feature information and leads to insufficient performance on detailed structures in the final predicted Ki images. Moreover, in this work we only targeted the Patlak graphical plot, which mainly applies to irreversible or nearly irreversible radiotracers, e.g., 18F-FDG. Extending the approach to other tracers, such as gallium-68-labeled prostate-specific membrane antigen (68Ga-PSMA), or to non-linear compartment models remains an important issue for further research. Meanwhile, because of the currently limited dataset, we introduced a domain transformation module to constrain the network training process. Despite its simplicity, noise can propagate from the emission images to the final estimated Ki images. With this consideration, more diverse and extensive simulated or real datasets are required so that the CNN can sufficiently represent the possible features of the input domain. Additionally, owing to the limitation of current academic computational resources, the proposed networks only tackle 2D parametric reconstruction, ignoring the inter-slice spatial information and leading to discontinuous predictions across slices [59]. Nevertheless, with further increases in computational power, a 3D network combining the major components of this work, such as the loss function and attention mechanism, may become feasible for 3D parametric imaging.

Conclusion

The purpose of this study was to demonstrate the feasibility of CNN-based parametric imaging on a total-body PET scanner, the Biograph Vision Quadra. We proposed an encoder–decoder framework with spatial and channel self-attention modules to generate high-quality Patlak Ki images from dynamic data. Owing to the high sensitivity of the scanner, only a few frames of data, of adequate quality, were needed. The results show that the CNN-based method can produce high-quality parametric images from few projection data; over all test datasets, the proposed method achieved an MSE of less than 0.03%, an SSIM of ~0.98, and a PSNR of ~38 dB. Meanwhile, the absence of an input function in the CNN-based method and the dramatic reduction in reconstruction time have great potential to make dynamic PET scanning more clinically acceptable.