
4.1 Methodological Considerations

4.1.1 PET-Derived Quantitative Metrics

4.1.1.1 Standardized Uptake Value (SUV)

SUV is the most frequently used semiquantitative PET metric for measuring tumor glucose metabolism. It is defined as the ratio of the decay-corrected FDG concentration in a volume of interest (VOI) to the injected dose normalized to the patient’s body weight. Besides body weight-based SUV, various other SUVs have been introduced to account for the different bio-distribution of FDG in different body compositions (Table 4.1). The most commonly used is SUL, which is the SUV normalized to the lean body mass (LBM), defined for males and females, respectively, as

Table 4.1 Pros and cons of different SUV measures
$$ \begin{array}{ll}\mathrm{LBM}=1.1\times \mathrm{weight}-120\times {\left(\mathrm{weight}/\mathrm{height}\right)}^2 & \text{(male)}\\ \mathrm{LBM}=1.07\times \mathrm{weight}-148\times {\left(\mathrm{weight}/\mathrm{height}\right)}^2 & \text{(female)}\end{array} $$

This index takes into account the different bio-distribution of FDG in fat tissue. Although several recommendations advocate the use of SUL, e.g., in treatment response evaluation [4], it is not in widespread use because of the familiarity established with SUVmax. The general advice, also given by the EANM-SNM guidelines on FDG-PET use [1, 3], is to collect SUL along with SUV data to further the understanding of its relevance in both clinical practice and experimental settings. The SUV, being an index of PET tracer uptake in any tissue, should be measured in a known volume of interest (VOI), because its value varies considerably with the VOI used.
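As a worked illustration of the definitions above, the following Python sketch computes a body weight-based SUV and an SUL. The function names, the units (Bq/ml, Bq, kg, cm), and the implied tissue density of 1 g/ml are assumptions made for this example; the LBM coefficients are those quoted in the text.

```python
def lean_body_mass(weight_kg, height_cm, sex):
    """LBM in kg, using the coefficients quoted in the text.

    Assumption: weight is given in kg and height in cm.
    """
    ratio_sq = (weight_kg / height_cm) ** 2
    if sex == "male":
        return 1.1 * weight_kg - 120.0 * ratio_sq
    if sex == "female":
        return 1.07 * weight_kg - 148.0 * ratio_sq
    raise ValueError("sex must be 'male' or 'female'")


def suv(conc_bq_per_ml, injected_bq, mass_kg):
    """Decay-corrected concentration divided by injected dose per gram.

    Pass the body weight for SUV or the lean body mass for SUL.
    """
    dose_per_gram = injected_bq / (mass_kg * 1000.0)
    return conc_bq_per_ml / dose_per_gram


# Hypothetical patient: 80 kg, 180 cm, male, 370 MBq injected,
# 20 kBq/ml measured in the VOI (already decay corrected).
lbm = lean_body_mass(80.0, 180.0, "male")
print("SUV:", round(suv(20000.0, 370e6, 80.0), 2))
print("SUL:", round(suv(20000.0, 370e6, lbm), 2))
```

Because the LBM is smaller than the body weight, the SUL of a given lesion is lower than its body weight-based SUV.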

SUVmax

The SUVmax is defined as the maximum value for SUV in a VOI. The rationale is choosing the single point that represents the hottest uptake or highest metabolism in the tumor. This rationale is quite strong and moreover the SUVmax is simple to measure. However, being a single voxel measurement, SUVmax is intrinsically vulnerable to image noise (Fig. 4.1). Consequently, repeated tumor SUVmax measurements showed an intra-patient bias of 5–30 % [5].

Fig. 4.1

The detailed characteristics of the noise affecting PET images are often not well known; typically, the overall noise is assumed to be Gaussian. The noise levels observed in PET images complicate their interpretation, since the measurement of uniformity is strictly connected with the noise. The metrics used in the figure for measuring activity concentration are “max,” the highest pixel value; “hot,” the average pixel value in a 1 cm diameter region around the “max”; “peak,” the average pixel value in a 1 cm diameter ROI in the hottest region; and “average,” the average of all the pixel values encompassed in the region of interest. The figure shows how these metrics change with the acquisition time inside a large uniform VOI placed in the center of a uniform phantom. “Max,” “hot,” and “peak” follow a similar trend and are the most influenced by counting statistics: as the acquisition duration increases, these values decrease and the measured activity concentration becomes closer to the expected value. In contrast, “average” does not change with the scan duration, and the ratio between the measured and the expected activity is always about one. Errors are larger for “max,” “hot,” and “peak” at low statistics, whereas “average” is essentially constant because it is averaged over a large number of pixels. SUVmean is therefore a good estimate of the expected value, while “peak” and “max” always overestimate the real value, and this overestimation grows dramatically when the acquisition time is short. The histogram of the SUV distribution inside the same VOI illustrates this: the mean value is constant regardless of the acquisition time, whereas “peak” and “max” are much larger at short acquisition times
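The behavior described in this caption can be reproduced qualitatively with a small simulation. The sketch below is illustrative only: it assumes independent Gaussian voxel noise whose standard deviation scales with the inverse square root of the scan duration, and the noise level at 60 s, the number of voxels, and the function name are hypothetical choices rather than measured values.

```python
import numpy as np


def simulate_uniform_voi(scan_seconds, n_voxels=5000, true_value=1.0,
                         noise_at_60s=0.20, seed=0):
    """Return (max, mean) of a simulated uniform VOI.

    Assumption: Gaussian noise with sigma proportional to
    1 / sqrt(scan duration), independent between voxels.
    """
    rng = np.random.default_rng(seed)
    sigma = noise_at_60s * np.sqrt(60.0 / scan_seconds)
    voxels = rng.normal(true_value, sigma, size=n_voxels)
    return float(voxels.max()), float(voxels.mean())


for seconds in (30, 60, 120, 300, 600):
    vmax, vmean = simulate_uniform_voi(seconds)
    print(f"{seconds:4d} s  max/true = {vmax:.2f}  mean/true = {vmean:.2f}")
```

Under these assumptions the maximum stays above the true value and approaches it only as the scan duration increases, while the mean remains close to one at every duration.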

SUVmean

The SUVmean is the average of the SUVs within the VOI drawn for the tumor. It is much less vulnerable to image noise, but it depends heavily on the delineation method used for drawing the VOI and on the selected region within the tumor volume [6]. Defining a VOI in a tumor mass may have different meanings depending on its coverage within the mass. Generally, one would like to provide the SUVmean of the entire lesion, but this requires knowledge of its exact dimensions and of its borders with respect to the background, which is often not available in routine applications. An alternative approach is to delineate a VOI inside the tumor, far from its border, to minimize the effect of the background uptake on the SUV measurement. Nonetheless, VOI delineation is subjective; tumors are usually heterogeneous and/or sometimes associated with necrotic centers; and, finally, selecting only a part of the tumor without including the hottest part defeats the purpose of obtaining an accurate measurement.

SUVpeak

The SUVpeak represents the maximum tumor activity within a 1 cm³ VOI in the hottest part of the tumor volume [3, 6]. The rationale is to have an index associated with the hottest part of the tumor, as with SUVmax, but measured in a standard volume of 1 cm³. The SUVpeak is characteristically less affected by noise than SUVmax and does not require definition of tumor boundaries, which is a necessary step for obtaining an SUVmean. Repeated tumor SUVpeak measurements yield a lower within-patient bias (1–11 %) than those of SUVmax [5]. The SUVpeak is the measurement proposed for defining therapy response in the PET response criteria in solid tumors (PERCIST) developed by Wahl et al. [4]. Nonetheless, despite being relatively simple, this method requires careful use of custom software on a dedicated workstation to be calculated accurately.
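The three SUV definitions can be compared on a voxel array with a short NumPy sketch. This is a simplified illustration, not the PERCIST reference implementation: it assumes the image is already expressed in SUV units, it centers the 1 cm³ sphere on the hottest voxel rather than searching for the sphere position with the highest mean, and the function and argument names are hypothetical.

```python
import numpy as np


def suv_metrics(suv_image, voi_mask, voxel_size_mm):
    """Return (SUVmax, SUVmean, SUVpeak) for a boolean VOI mask.

    suv_image: 3D array in SUV units; voxel_size_mm: 3 spacings in mm.
    """
    values = suv_image[voi_mask]
    suv_max = float(values.max())
    suv_mean = float(values.mean())

    # Locate the hottest voxel inside the VOI.
    masked = np.where(voi_mask, suv_image, -np.inf)
    hot = np.unravel_index(np.argmax(masked), suv_image.shape)

    # Sphere of 1 cm3 (1000 mm3): radius = (3V / 4pi)^(1/3), about 6.2 mm.
    radius_mm = (3.0 * 1000.0 / (4.0 * np.pi)) ** (1.0 / 3.0)
    grids = np.indices(suv_image.shape)
    dist_sq = sum(((g - c) * s) ** 2
                  for g, c, s in zip(grids, hot, voxel_size_mm))
    sphere = dist_sq <= radius_mm ** 2

    # Simplification: the sphere is centered on the hottest voxel.
    suv_peak = float(suv_image[sphere].mean())
    return suv_max, suv_mean, suv_peak


# Hypothetical example: uniform background with one noisy hot voxel.
image = np.ones((20, 20, 20))
image[10, 10, 10] = 10.0
mask = np.ones(image.shape, dtype=bool)
print(suv_metrics(image, mask, (2.0, 2.0, 2.0)))
```

In this example a single outlying voxel determines SUVmax entirely, has a limited effect on SUVpeak, and barely changes SUVmean, which mirrors the noise sensitivity discussed above.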

4.1.1.2 Sources of Errors in SUV Measurements

Common technical and host-related sources of error in SUV measurements are summarized in Table 4.2. These factors have been reviewed extensively in the literature [1, 2], and recommendations have also been released by the US and European nuclear medicine associations [1, 3]. The recommendations provided should be considered minimal standards and should be followed by all imaging centers. While several recommendations are easy to adopt in clinical practice, e.g., maintaining a rest state during uptake time, others are more difficult to achieve in a busy clinic, e.g., a consistent time interval for the uptake period. Importantly, the higher the level of standardization reached, the simpler it will be to compare PET metrics acquired at different time points (intra-patient) and between different patients (inter-patient), either at a single site or across multiple centers.

Table 4.2 Common sources of error in SUV calculations
4.1.1.2.1 Technical (Site) Factors

Several factors are patient independent and depend only on the equipment and the procedure used by the site to perform PET/CT imaging studies. The requirements to limit the influence of these factors on SUV measurements should be fulfilled once and verified periodically (Table 4.2). Cross-calibration of PET scanners and activity calibrators is essential to minimize SUV variability. The procedure for calibration of the PET scanner is depicted in Fig. 4.2. Although cumbersome, this approach proved effective in increasing the accuracy of tracer uptake measurements, reducing variability to 5–10 % [6]. This is well below the 10–25 % variation observed even in the controlled environment of a multicenter clinical trial [4]. Particular care should be taken to measure the activity used for calibrating the scanner with the same activity calibrator that is used for patient doses. If more than one calibrator is used, they should all be cross-calibrated with a traceable radioactive source. PET sites not equipped with dose calibrators cannot obtain reliable SUV measurements. Indeed, the activity injected into the patient must always be measured with the calibrator that is cross-calibrated with the PET/CT scanner used for imaging. If the activity is measured elsewhere, for example, at the radiopharmaceutical production site, this process is not necessary. Cross-calibration in a multicenter framework generally achieves a variability of less than 10 %, while 5 % should be a requirement for using PET/CT quantitatively [7, 8]. An optimal inter-scanner variability of 3 % has been reached when comparing two PET/CT scanners [8], requiring new cross-calibration strategies. Imaging parameters, such as scan duration per bed position, acquisition mode, 18F-FDG dose, and reconstruction method, directly affect the image quality and quantitative results [3]. These parameters should be preset to fulfill the guidelines [1] for the recovery coefficient curve. 
The recovery coefficients are calculated as the ratio of the measured to the expected activity concentration in hot spheres of different radii in a phantom (Fig. 4.3). In addition to the above parameters, the actual administered activity and the accuracy of the patient’s weight and height influence the variability in SUV measurements. The injected activity is the difference between the activity in the syringe, measured with the activity calibrator, and the residual activity in the syringe and administration lines. If the line is flushed with saline, the residual activity is usually lower than 1–3 MBq, and its measurement can be omitted. All the clocks of the department should be synchronized with the scanner and dose calibrator clocks to avoid bias in time and, consequently, in SUV assessments. Intravenously administered contrast media can alter the SUV of a lesion if a diagnostic CT is performed as part of the PET/CT [3]. While specific recommendations can be found elsewhere [3], a general recommendation is to perform a low-dose CT for attenuation correction before the PET scan and the full-dose diagnostic CT after the PET scan. For the calculation of SULs, the patient’s weight should be routinely entered in the PET DICOM dataset with the calibration factors to avoid errors in SUV calculations (e.g., a 5 kg difference in an 80 kg patient leads to an error of about 6 % in SUV).
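The bookkeeping for the administered activity and the sensitivity of SUV to the recorded weight can be sketched as follows. The function names and the example values are hypothetical; the sketch assumes that the syringe and the residual are assayed on the same cross-calibrated calibrator with synchronized clocks, and it uses the physical half-life of fluorine-18 (109.77 min).

```python
import math

F18_HALF_LIFE_MIN = 109.77  # physical half-life of fluorine-18, minutes


def decay(activity_mbq, minutes_elapsed, half_life_min=F18_HALF_LIFE_MIN):
    """Activity after a given delay (a negative delay corrects backwards)."""
    return activity_mbq * math.exp(
        -math.log(2.0) * minutes_elapsed / half_life_min)


def net_injected_activity(syringe_mbq, min_before_injection,
                          residual_mbq, min_after_injection):
    """Net activity (MBq) referred to the injection time.

    The full syringe is assayed before injection; the residual in the
    syringe and lines is assayed after injection.
    """
    full = decay(syringe_mbq, min_before_injection)   # forward in time
    left = decay(residual_mbq, -min_after_injection)  # back in time
    return full - left


def suv_error_from_weight(true_weight_kg, recorded_weight_kg):
    """Relative SUV error from a wrong weight (SUV is linear in weight)."""
    return (recorded_weight_kg - true_weight_kg) / true_weight_kg


# Hypothetical example: 370 MBq assayed 10 min before injection,
# 2 MBq residual assayed 5 min after injection.
print(round(net_injected_activity(370.0, 10.0, 2.0, 5.0), 1), "MBq")
# A 5 kg overestimate in an 80 kg patient gives 0.0625, i.e., 6.25 %.
print(suv_error_from_weight(80.0, 85.0))
```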

Fig. 4.2

PET scanner calibration. PET scanner electronics measure the count rate of annihilation events. PET scanner calibration is carried out to associate an activity to this count rate. This is done by injecting a known activity, measured in an activity calibrator, in a cylindrical uniform phantom and scanning it with PET

Fig. 4.3

Recovery coefficient. An example of a recovery coefficient curve is provided for non-optimized (left) and optimized (right) PET parameters
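A recovery coefficient curve of this kind is simply the measured-to-expected concentration ratio for each sphere. The sketch below shows the calculation; the sphere diameters and the measured concentrations are hypothetical illustrative numbers, not data from the figure.

```python
def recovery_coefficients(measured_by_diameter_mm, expected_concentration):
    """Ratio of measured to expected activity concentration per sphere.

    measured_by_diameter_mm: dict mapping sphere diameter (mm) to the
    measured concentration, in the same units as the expected value.
    """
    if expected_concentration <= 0:
        raise ValueError("expected concentration must be positive")
    return {diameter: measured / expected_concentration
            for diameter, measured in sorted(measured_by_diameter_mm.items())}


# Hypothetical phantom measurement (kBq/ml), expected value 20 kBq/ml.
measured = {10: 8.0, 13: 11.5, 17: 14.5, 22: 17.0, 28: 18.5, 37: 19.5}
for diameter, rc in recovery_coefficients(measured, 20.0).items():
    print(f"{diameter:2d} mm sphere: RC = {rc:.2f}")
```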

4.1.1.2.2 Host Factors

Several patient-dependent factors, from patient preparation to scan acquisition, affect SUVs and must be verified for each patient (Table 4.2). The biological factors include uptake time, plasma glucose levels, and patient motion or breathing artifacts. For most of these factors, clear recommendations [3] have been provided, as they directly affect SUVs and also image interpretation. In addition, SUV measurements are affected by tumor perfusion, hypoxia, and inflammatory cell infiltrates in the tumor microenvironment, which cannot be controlled by extrinsic manipulations. The SUVs decrease in normal tissues as the uptake time increases [9], with a linear decrease of SUV in all three compartments. The FDG uptake of the same lesion in images acquired at different time intervals after radiotracer injection is influenced by the recirculating FDG. It is, hence, fundamental to use the same uptake time at all time points when sequentially imaging the same patient, to maintain intra-patient consistency and to reduce uptake time differences in longitudinal scans. As a general recommendation, a patient requiring quantitative PET measurements should be scheduled as the first patient of the day to minimize delays in acquisition times, which occur frequently later in the day. Moreover, to obtain comparable data in longitudinal studies, the PET technologists should record the actual uptake time so that it can be reproduced in subsequent scans. Elevated plasma glucose levels result in decreased FDG uptake by the tumor, leading to erroneously low SUVs [1, 8]. Consequently, variable plasma glucose levels in longitudinal studies of the same patient will likely cause artificial SUV changes. Maintaining a constant plasma glucose level in the range of 4–7 mmol/L in an individual patient across all longitudinal studies, and keeping track of the measured values, is an achievable goal with a concerted team effort. 
There are several strategies for dealing with plasma glucose levels in SUV calculations, but further research is needed to understand whether intra-subject or inter-subject standardization is required. Patient motion, including breathing, can also significantly influence SUV measurements [3]. To minimize this negative effect, the fused PET and CT images should be visually inspected to identify possible patient motion near a lesion. Breathing particularly affects lesions in the thoracic area. Correction techniques using dynamic acquisition and breathing control devices are being introduced in PET/CT scanners; until they are available, however, data affected by motion should not be used for SUV measurements.
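The host-related checks discussed above lend themselves to a simple automated comparison between a baseline and a follow-up scan. The sketch below is one possible form of such a check: the 4–7 mmol/L glucose range is taken from the text, whereas the 10 min uptake time tolerance, the dictionary keys, and the function name are hypothetical and should be replaced by the values of the local protocol.

```python
def check_followup_consistency(baseline, followup,
                               max_uptake_diff_min=10.0,
                               glucose_range_mmol_l=(4.0, 7.0)):
    """Return a list of protocol deviations between two scans.

    Each scan is a dict with 'uptake_min' and 'glucose_mmol_l'.
    The uptake time tolerance is a hypothetical default.
    """
    issues = []
    uptake_diff = abs(followup["uptake_min"] - baseline["uptake_min"])
    if uptake_diff > max_uptake_diff_min:
        issues.append(
            f"uptake time differs from baseline by {uptake_diff:.0f} min")
    low, high = glucose_range_mmol_l
    for label, scan in (("baseline", baseline), ("follow-up", followup)):
        if not low <= scan["glucose_mmol_l"] <= high:
            issues.append(
                f"{label} plasma glucose outside {low}-{high} mmol/L")
    return issues


baseline_scan = {"uptake_min": 60.0, "glucose_mmol_l": 5.2}
followup_scan = {"uptake_min": 85.0, "glucose_mmol_l": 8.1}
for issue in check_followup_consistency(baseline_scan, followup_scan):
    print("WARNING:", issue)
```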

4.1.1.3 Metabolic Tumor Volume Measurements

Other proposed PET-derived functional metrics include metabolically active tumor volume (MTV) and total lesion glycolysis (TLG). The tumor volume concept was developed in the late 1990s [10] but did not evolve until recently because the necessary software was lacking. These volume-based PET parameters measure metabolic activity in the entire tumor mass and are designed to reflect tumor biology.

4.1.1.3.1 MTV

The MTV measures the total volume of metabolically active tumor included within a VOI, for either a single lesion or multiple lesions, and is expressed in cm³ or ml. The rationale is the assumption that the tumor has a higher metabolic activity than the surrounding healthy tissue, which allows the tumor volume to be defined accurately. MTV is only slightly affected by noise since it includes hundreds or thousands of voxels.

4.1.1.3.2 TLG

The TLG is the product of the SUVmean in the defined VOI and the MTV; the rationale is to combine tumor burden and metabolic activity into a single index that reflects both the tumor volume and the uptake within the entire volume. The routine application of these parameters is challenging because the quantification process requires complex calculations, is prone to subjective definitions of VOIs, and is rather time consuming. Several segmentation algorithms exist, relying on manual (by an expert) or semiautomatic methods for tumor delineation (Table 4.3). With the recent development of software-assisted automated VOI assessments, volume-based metabolic quantitative parameters have become increasingly available. Although these metrics are potentially useful clinical parameters for assessing treatment response and survival, they are not yet ready for clinical application because they have yet to be standardized and validated [8–10]. The advantages and disadvantages of these methods are provided in Table 4.3.
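Once a tumor mask is available, MTV and TLG follow directly from their definitions. The NumPy sketch below assumes that the image is in SUV units and that the segmentation has already been performed; the function and argument names are hypothetical.

```python
import numpy as np


def mtv_and_tlg(suv_image, tumor_mask, voxel_size_mm):
    """Return (MTV in ml, TLG) for a boolean tumor mask.

    MTV = number of segmented voxels x voxel volume.
    TLG = SUVmean within the mask x MTV.
    """
    voxel_ml = float(np.prod(voxel_size_mm)) / 1000.0  # mm3 to ml
    mtv_ml = int(tumor_mask.sum()) * voxel_ml
    if mtv_ml == 0:
        return 0.0, 0.0
    suv_mean = float(suv_image[tumor_mask].mean())
    return mtv_ml, suv_mean * mtv_ml


# Hypothetical lesion: 125 voxels of 4 x 4 x 4 mm with SUV 6.
image = np.zeros((10, 10, 10))
mask = np.zeros(image.shape, dtype=bool)
mask[:5, :5, :5] = True
image[mask] = 6.0
print(mtv_and_tlg(image, mask, (4.0, 4.0, 4.0)))
```

For multiple lesions, the same function can be applied to the union of the lesion masks or to each lesion separately, with the results summed.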

Table 4.3 Pros and cons of the various categories of PET image segmentation techniques

4.1.1.4 Variability of PET-Derived Quantitative Metrics

The first prerequisite for reliably measuring a PET-derived tumor volume is a robust and reproducible method to accurately determine SUV-based parameters, overcoming the abovementioned sources of error related to physical, technical, and biological factors [7]. In particular, all quantitative PET metrics are affected by user-defined factors, including the image acquisition settings (i.e., duration of acquisition, slice thickness, acquisition mode (2D vs. 3D), and reconstruction algorithm) and the corrections applied (i.e., attenuation, scattered and random coincidences, and dead time corrections). To minimize SUV variability, it is necessary to cross-calibrate the PET scanners and ancillary instruments. Though cumbersome, this approach proved effective in increasing the accuracy of tracer uptake measurements, reducing inter-scanner variability of the measured activity to 5–10 % [7, 11–16], which is a major achievement compared to the 10–25 % variation observed even in the controlled environment of a multicenter clinical trial [17].

4.1.1.5 PET Test–Retest Reproducibility

Reported variability of SUV in patient test–retest studies differed from the desired range of ≤10 % [18–23]. The largest repeatability study, of 62 patients with gastrointestinal malignancies, reported a decrease in the intra-subject coefficient of variation of SUV measurements from 16 % to 11 % after applying centralized quality control assessment and analysis [24, 25]. These studies showed that the variance of SUV is greater in clinical practice than in clinical trials, even in a single-site experience: the threshold criteria for a difference of a second scan with respect to baseline at the 95 % confidence level were 49 % and 44 % for SUVmax and SUVmean, respectively. A recent meta-analysis by De Langen et al. showed that SUVmean had slightly better repeatability than SUVmax, with better reproducibility in larger lesions [26]. However, a recent study comparing SUVmax, SUVmean, SUVpeak, and TGV found that different SUV definitions yielded 20 % variation in tumor response values for an individual tumor and variation of up to 90 % for a single SUV measurement [27]. Another study showed that the mean percentage difference in SUVmax measurements obtained on subsequent scans in 100 patients with a known chest lesion was 0.9 ± 7.8, with a coefficient of variation of 4.3 % [28]. This variability was much lower than that reported in previous studies with a range of 2.5–8.2 % [7, 11, 29]. Besides SUV, Leijenaar et al. [30] demonstrated a high test–retest reproducibility of various radiomics features as well as high (91 %) interobserver stability. Based on the results of these studies, protocol variation should be kept to a minimum when performing repeated scans on the same patient in order to improve the reliability of SUV measurements.
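The repeatability statistics quoted in such studies can be computed from paired test–retest measurements. The sketch below uses one common set of definitions (relative differences with respect to the pair mean, a within-subject coefficient of variation, and a 95 % repeatability coefficient equal to 1.96 × √2 × wCV); the cited studies may have used different estimators, and the function name and the data are hypothetical.

```python
import numpy as np


def repeatability_stats(scan1, scan2):
    """Return (mean % difference, SD of % difference, wCV %, RC %).

    scan1, scan2: paired SUV measurements of the same lesions.
    """
    a = np.asarray(scan1, dtype=float)
    b = np.asarray(scan2, dtype=float)
    rel_diff = (b - a) / ((a + b) / 2.0)  # relative to the pair mean
    mean_pct = 100.0 * rel_diff.mean()
    sd_pct = 100.0 * rel_diff.std(ddof=1)
    # Within-subject coefficient of variation from paired data.
    wcv_pct = 100.0 * np.sqrt(np.mean(rel_diff ** 2) / 2.0)
    # Change that two scans exceed only 5 % of the time without true change.
    rc_pct = 1.96 * np.sqrt(2.0) * wcv_pct
    return float(mean_pct), float(sd_pct), float(wcv_pct), float(rc_pct)


# Hypothetical SUVmax values for five lesions scanned twice.
test = [8.2, 5.1, 12.4, 6.7, 9.9]
retest = [7.6, 5.6, 13.1, 6.2, 10.4]
print([round(x, 1) for x in repeatability_stats(test, retest)])
```

A measured change between baseline and follow-up that is smaller than the repeatability coefficient cannot be distinguished from measurement variability under these assumptions.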

4.1.2 Segmentation Methods for Volume Calculations

Different segmentation techniques of varying complexity have been proposed for PET-derived volumes (Table 4.3). Comparing the performance of different methods from published data is almost impossible given the variety of algorithms used and the degree of operator manipulation [31, 32]. To date, there is no consensus on a reproducible, accurate, and practical method that should be preferred for tumor segmentation. The existing methodologies are described in the following paragraphs.

Manual Technique

Manual contouring by an experienced imaging expert was the first method applied in this field and is still widely used. However, this procedure is cumbersome and time consuming, particularly in patients with disseminated disease. This method is technically the least sophisticated and economically the least demanding, but it expectedly leads to significant interobserver variability, in the range of 5–137 % [33].

Thresholding Method

The most widely used method to define a tumor volume is the thresholding method that requires identification of voxels exceeding a predefined threshold [34]. The thresholding can be performed using fixed or adaptive methodologies. In general, application of the proper threshold technique is a challenging task because of the limited resolution of PET images. Blurring due to partial volume effect [35] (Fig. 4.4) or motion artifacts and noise fluctuations due to limited photon counts can degrade segmentation accuracy.

Fig. 4.4

Partial volume effect. Partial volume effect refers both to image blurring due to the finite spatial resolution of the scanner (left in figure) and to voxel sampling (right in figure). It affects small lesions and is aggravated by tumor heterogeneity. The SUVmax and SUVmean measurements in a lesion volume of 2.5 ml (about 1 cm diameter) could be underestimated by up to 50 % (Fig. 4.2), and complete recovery of the actual SUV is achieved only for lesions greater than 5 ml. Many strategies have been developed in the past to correct for partial volume effects, but none of them has reached sufficient maturity for daily practice [19]. Only recently have new algorithms been incorporated directly into the reconstruction algorithms of modern scanners. It should be emphasized that small tumor volumes do not necessarily imply a small number of cells, since tumors become visible at about 10^5–10^6 malignant cells considering the resolution limits of PET scanners

Percentage threshold

The earliest thresholding method was based on a percentage of the SUV, mainly using a cutoff of 40–50 % of the SUVmax [36]. This method was based simply on phantom studies of static spheres. Subsequently, a value of 40 % was adopted by several groups for tumor delineation in radiotherapy planning of non-small cell lung cancer (NSCLC) [37], cervical cancer [38], and head and neck squamous cell carcinoma (HNSCC) [39]. The principal drawbacks of this method are that the optimal threshold is influenced by the size of the tumor volume, that the surrounding background is not taken into account, and that the threshold is often “scanner specific” because of its strong dependence on the spatial resolution of the instrument. Based on the available data suggesting insufficient tumor coverage with fixed thresholding methods, this method is no longer recommended, particularly for RT planning purposes [40].

Fixed threshold

As an alternative method, an absolute SUV threshold can be used for tumor segmentation. However, tumor inhomogeneity and motion artifacts may hinder the application of this approach by failing to provide adequate tumor delineation in nearly half of the cases, in particular for lesions showing a low tumor-to-background ratio [41]. Moreover, fixed thresholding techniques take neither the background nor the tumor size into consideration [42] thus being inappropriate to define a tumor volume.
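The percentage and absolute thresholding rules described above can each be written in a few lines of NumPy. In the sketch below the 40 % fraction is the value mentioned in the text, while the absolute cutoff of 2.5, the function names, and the example data are hypothetical; the image is assumed to be in SUV units and restricted to a user-drawn VOI.

```python
import numpy as np


def percentage_threshold_mask(suv_image, voi_mask, fraction=0.40):
    """Voxels in the VOI at or above a fraction of the VOI SUVmax."""
    suv_max = suv_image[voi_mask].max()
    return voi_mask & (suv_image >= fraction * suv_max)


def fixed_threshold_mask(suv_image, voi_mask, suv_cutoff=2.5):
    """Voxels in the VOI at or above an absolute SUV cutoff.

    The default cutoff is a hypothetical illustrative value.
    """
    return voi_mask & (suv_image >= suv_cutoff)


# Hypothetical one-dimensional profile through a lesion.
profile = np.array([1.0, 3.0, 4.0, 10.0, 5.0])
voi = np.ones(profile.shape, dtype=bool)
print(percentage_threshold_mask(profile, voi))  # cutoff is 40 % of 10 = 4
print(fixed_threshold_mask(profile, voi))       # cutoff is SUV 2.5
```

Neither rule uses the background level or the lesion size, which is the limitation noted in the text: the two masks differ whenever the absolute cutoff and the chosen percentage of SUVmax do not coincide.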

Adaptive threshold

To address the background-dependent variability, some investigators suggested adapting the threshold to the tumor-to-background ratio (TBR) [43, 44]. Subsequently, a more developed system based on an iterative technique was introduced to optimize the thresholding for the TBR approach [44–47]. The rationale is to change the TBR threshold iteratively until the algorithm converges on an optimal threshold. This is a reasonable and logical approach. However, the coexistence of several operator-dependent thresholding methods, based on different morphologic aspects of radiotracer concentration in tumors, justifies the search for automatic threshold-computing software.
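The iterative idea can be illustrated with a minimal sketch. The update rule used here (threshold = background + fraction × (mean SUV of the current segmentation − background)) is an assumption chosen for illustration and is not the specific algorithm of the cited references, which calibrate the relation between threshold and TBR on phantom data; the function name, the default fraction, and the tolerance are likewise hypothetical.

```python
import numpy as np


def iterative_threshold(suv_image, voi_mask, background_suv, fraction=0.5,
                        tol=1e-3, max_iter=50):
    """Return (threshold, mask) after iterating to convergence.

    Assumed update rule: the threshold lies a fixed fraction of the way
    between the background and the mean SUV of the current segmentation.
    """
    values = suv_image[voi_mask]
    if values.max() <= background_suv:
        raise ValueError("no voxel in the VOI exceeds the background")
    # Start from the maximum, then refine with the segmented mean.
    threshold = background_suv + fraction * (values.max() - background_suv)
    for _ in range(max_iter):
        inside_mean = values[values >= threshold].mean()
        new_threshold = background_suv + fraction * (inside_mean - background_suv)
        converged = abs(new_threshold - threshold) < tol
        threshold = new_threshold
        if converged:
            break
    return float(threshold), voi_mask & (suv_image >= threshold)


# Hypothetical lesion: 8 voxels of SUV 10 in a background of SUV 1.
image = np.ones((6, 6, 6))
image[2:4, 2:4, 2:4] = 10.0
voi = np.ones(image.shape, dtype=bool)
threshold, mask = iterative_threshold(image, voi, background_suv=1.0)
print(threshold, int(mask.sum()))
```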

Gradient technique

This technique measures gradient differences between the lesion and the surrounding background with a good spatial accuracy and efficiency [48, 49]. Gradient methodology includes simple edge or ridge detectors [50] or watershed method [51]. More recently deformable active contour models have been applied to PET segmentation with the assumption that contours are characterized by sharp variations in the image intensity [52, 53]. Despite being intuitive, the gradient technique suffers considerably from image noise and often requires filtering of the images with a blurring effect [54].
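The quantity on which gradient-based methods operate is the magnitude of the spatial SUV gradient, which peaks at the lesion boundary. The sketch below computes it for a 3D image with `numpy.gradient`; it deliberately omits the smoothing step and the edge, watershed, or active contour stage that a real method would add, and the function name is hypothetical.

```python
import numpy as np


def gradient_magnitude(suv_image, voxel_size_mm):
    """Gradient magnitude (SUV per mm) of a 3D image.

    voxel_size_mm: the three voxel spacings, one per array axis.
    No denoising is applied here, so noisy images would first need
    filtering, as noted in the text.
    """
    grads = np.gradient(suv_image.astype(float), *voxel_size_mm)
    return np.sqrt(sum(g ** 2 for g in grads))


# Hypothetical sharp boundary between background (1) and lesion (10).
image = np.ones((10, 4, 4))
image[5:] = 10.0
magnitude = gradient_magnitude(image, (2.0, 2.0, 2.0))
# The magnitude is highest on the two slices adjacent to the boundary.
print(magnitude[:, 0, 0])
```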

More sophisticated techniques

To overcome the difficulties originating from thresholding and gradient techniques, several authors have explored more sophisticated approaches used in other scientific domains, such as active deformable models, learning methods, and stochastic approaches [55], as well as pattern recognition algorithms [56]. Learning methods based on classification require training on data with known labels (known ground truth). However, this is a challenging task due to the variability of PET tracer uptake and bio-distribution in tissues, which in turn depends on the biomarker concentration in the blood (e.g., glucose concentration for FDG) and on other technical factors. In addition, PET images need to be properly contoured to identify the ground truth for training purposes (e.g., structures contoured by a panel of experienced radiologists). Therefore, beyond the ground truth, the application of these methods requires additional information from a thoroughly checked source. Stochastic models offer the advantage of incorporating the voxel intensity as a variable directly into the model. However, these models rely on a predefined noise model, which has not yet been defined for PET and is strongly influenced by the parameters and type of the reconstruction algorithm. In general, the Gaussian assumption is used because it reduces the computational burden and speeds up convergence.

Comparison between methods

Reproducibility is a key issue associated with segmentation methods. Different methods give rise to variations in the calculated PET volumes in the range of 40–50 % [9, 31, 58, 59], and this variability can even reach 400 % [32]. The performance of tumor delineation methods, in turn, largely depends on variations in the TBR, image resolution, and image noise level. Evaluating the accuracy of the segmentation methods is rather difficult because it is virtually impossible to rely on a ground truth as comparator. Studies have been proposed using phantoms, morphological images (CT or MRI), and pathology specimens [57], but there is no consensus among scientists on the optimal method. Despite the heterogeneity of clinical behavior and aggressiveness of the malignant processes, there is preliminary evidence to suggest that MTV and TLG have independent prognostic value across different types of cancers, including lymphoma [60]. It is therefore important to pursue validation studies to establish the real value of these methodologies and also prove their reproducibility in large prospective data sets.

4.1.2.1 Applications in Radiation Oncology

Radiation therapy (RT) is one of the pillars of combined-modality treatment for Hodgkin lymphoma. Successive technological advances over the past decade have revolutionized the definition of the target tumor volume and the boundaries of the radiation field. These new methods have increased the effectiveness of this treatment modality while delivering much smaller doses to critical organs such as the lung, heart, and breast [61]. RT treatments can be classified as total lymphoid/nodal irradiation (TLI/TNI), extended-field RT (EFRT), involved-field RT (IFRT), and involved-node RT (INRT) (see Chapter 5). In the modern era of conformal radiotherapy, TNI and EFRT are no longer in practice and have been supplanted by limited-field radiation therapy: IFRT, in which the RT field encompasses all of the clinically involved nodal regions, and INRT, which aims to deliver the dose only to the initially involved nodes rather than to the entire region of the involved nodal chain. Consequently, the current guidelines for combined chemotherapy and RT indicate that the delineation of the target volume should always be carried out on the affected regions [62, 63].

Field delineation in RT planning is one of the most important applications of PET/CT imaging (see Chapter 5). In recent years, a large number of studies and methodological research projects have been performed to develop and validate automatic and semiautomatic algorithms for accurate and robust delineation of RT target volumes. So far, only a few clinical trials have been conducted in which dose escalation was prescribed on an FDG-avid area within the GTV [64–67].

Recent studies demonstrated high observer variability in clinical target volume (CTV) delineation for HL [68–70], thus highlighting the need for a robust and operator-independent methodology for target definition. A considerable improvement in treatment volume definition on simulation CT has been obtained by integrating the information provided by the FDG-PET/CT acquired before chemotherapy for diagnosis and staging purposes [67, 71–74]. The common practice for combining the FDG-PET/CT findings with the CT-based CTV delineation is visual assessment. Briefly, the physician compares the two imaging modalities displayed on two different screens and confirms the matching on anatomical landmarks. However, this approach is time consuming and operator dependent. Some authors proposed methods based on rigid image coregistration and overlay (image fusion), reporting favorable results if the FDG-PET/CT is acquired in the treatment position [73–75].

Dedicated PET/CT planning is already available in some centers, but care must be taken when fusing diagnostic and planning scans because of the need for a deformable registration, which is yet to become a standardized procedure. Nonetheless, there are practical obstacles in routine practice such as the scanning position of the patient (position of the arms and/or neck) and the use of different scanners. In addition, weight loss and lymph node shrinkage occurring between the two imaging stages represent particularly challenging issues for PET/CT matching based on rigid registration.

Similar to other cancers, PET/CT manual contouring is the standard technique in lymphoma [76]. To increase reproducibility, the use of a flat table for PET/CT imaging is advisable. Due to the relatively simple geometry of the lymphoma lymph nodal masses in axial CT and PET/CT sections, a PET segmentation algorithm has been rarely used instead of manual contouring for RT planning.

4.1.3 Conclusions

There is a large variability in computational complexity and level of user interaction required by the various image segmentation techniques. In the near future, the development of more sophisticated and robust tools for PET segmentation will probably help physicians to use these quantitative methods with higher precision and accuracy. However, it is imperative to adopt standardized acquisition, reconstruction, and analysis protocols for the clinical use of PET quantitative metrics.

4.2 Clinical Applications in Lymphoma

4.2.1 Why Should Quantitative Methodology Be Preferred Over Qualitative?

The widely utilized anatomic imaging parameters rely on tumor size change as a measure for treatment response. Nevertheless, functional imaging lends itself as a better surrogate metric for demonstrating a biological tumor response. Although visual assessment of FDG-PET/CT has been successfully integrated into clinical practice for therapy monitoring, high rate of false-positive results even in the hands of expert readers have raised concerns [7781] for its usefulness, particularly, for interim PET-adapted therapeutic strategies. With the emphasis on the liver as a reference background adopted by D 5PS criteria [82], the inter-patient variability and intra-patient fluctuations of hepatic FDG uptake during therapy [8385] have become a focus of concern. More importantly, the depth of tumor response categorization by visual criteria may lead to suboptimal differentiation between response categories by oversimplification. Furthermore, visual assessment is proven to be a reproducible and efficacious method for treatment response assessment in HL [159, 160] and FL [179] but its role is less substantiated with the currently available data in other lymphoma subsets [153]. Quantitative analysis allows for an objective assessment of treatment response, thereby minimizing interobserver variations and more suitable for a continuous measure of response which is also one of the most effective ways to reduce sample size [86]. In order to minimize potential treatment-associated morbidity, and unnecessary interventions, the tumor metabolic response can be used as a practical early clinical end point to substitute survival end points, which may counteract the high cost and lengthy process attendant with the regulatory approval of the novel drugs. Functional imaging provides an earlier and faster readout for treatment response compared to morphologic imaging; consequently, it is preferable for early and accurate evaluation of the efficacy of novel treatments. 
With the recent emergence of sophisticated software programs, tumor volumes can be determined with much less effort than before. Thus, MTV, as a measure of the viable tumor fraction, or TLG, as the product of MTV and the mean SUV within the volume, may better predict ultimate patient outcome than anatomical imaging, either at baseline or early during therapy, and may better estimate tumor burden. Hence, there is a strong interest in the development of various quantitative metabolic PET metrics in an effort to decrease the rate of false-positive results, increase reproducibility, and maximize statistical power.
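The relationship between MTV, mean SUV, and TLG described above can be illustrated with a minimal sketch. It assumes the tumor voxels have already been segmented and that every voxel has the same volume; the function name, the voxel values, and the voxel volume are hypothetical and serve only as an illustration.

```python
def mtv_and_tlg(tumor_suv_values, voxel_volume_ml):
    """Return (MTV in ml, SUVmean, TLG) for voxels already segmented as tumor.

    Assumption: all voxels share the same volume, given in milliliters.
    """
    if not tumor_suv_values:
        return 0.0, 0.0, 0.0
    mtv_ml = len(tumor_suv_values) * voxel_volume_ml      # metabolic tumor volume
    suv_mean = sum(tumor_suv_values) / len(tumor_suv_values)
    tlg = mtv_ml * suv_mean                               # TLG = MTV x SUVmean
    return mtv_ml, suv_mean, tlg


# Hypothetical SUVs of four segmented tumor voxels, each 0.5 ml in volume.
mtv, suv_mean, tlg = mtv_and_tlg([4.0, 6.0, 8.0, 6.0], voxel_volume_ml=0.5)
print(f"MTV = {mtv} ml, SUVmean = {suv_mean}, TLG = {tlg}")
```

Because TLG scales with both volume and uptake, two lesions with the same SUVmax can have very different TLG values.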

4.2.2 PET-Derived Quantitative Metrics in Clinical Practice

4.2.2.1 Standardized Uptake Values (SUV)

As alluded to in the previous section, SUVmax has been investigated as a quantitative PET parameter to provide an objective measure for assessing tumor metabolic activity in tissues.

4.2.2.1.1 Baseline Tumor Characterization

The advent of genomic and proteomic technologies has been shifting traditional cancer management toward an individualized treatment strategy. However, these methods are impractical in a routine setting and do not allow for a complete characterization of the tumor because tumor tissues are spatially and temporally heterogeneous. Noninvasive assessment of tumor behavior with the use of imaging may provide more comprehensive guidance for improving therapy decisions in cancer patients. Among all indications, differentiating a malignant from a benign etiology, or a low-grade from a high-grade phenotype, using an objective imaging tool would be clinically desirable. In this regard, although limited and not validated, the existing published data showing correlation between the SUVmax and tumor histologic characteristics, surgical stage, and prognosis are summarized in the following section.

Diagnosis of different tumor phenotypes

Considering the need for a more aggressive treatment for transformed low-grade lymphomas (LGL) compared to nontransformed LGLs [87], early identification of transformation to an aggressive phenotype would be clinically consequential. There is sufficient evidence that FDG-PET/CT can detect transformation of chronic lymphocytic leukemia (CLL) to diffuse large B-cell lymphoma (DLBCL), the so-called Richter’s transformation [88–91]. In a retrospective study by Bruzzi et al. (n = 37), SUVmax of >5.0 was considered highly suggestive of Richter’s transformation with an overall sensitivity and negative predictive value (NPV) of 91 % and 97 %, respectively [88]. Recently, Falchi et al. reported that SUVmax ≥10, international prognostic score (IPS) ≥2, bulky disease, and age ≥65 were independently associated with shorter overall survival (OS) in CLL patients (n = 332) [89]. SUVmax ≥10 strongly correlated with OS (57 vs. 7 months). Corroborating these results, Michallet et al. identified a threshold of tumor SUVmax >10 as the most effective discriminating cutoff value, which yielded a sensitivity and specificity of 91 % and 95 %, respectively, for identifying transformation by PET in CLL patients (n = 250) [90].
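The way a SUVmax cutoff translates into sensitivity, specificity, and NPV can be sketched as follows. The case list is entirely hypothetical and is not taken from any of the cited studies; the strict "greater than" comparison and the function name are likewise assumptions made for illustration.

```python
def evaluate_cutoff(cases, cutoff):
    """Score a SUVmax cutoff against biopsy-proven transformation status.

    cases: list of (suvmax, transformed) pairs; a scan is called positive
    when suvmax > cutoff (assumed convention).
    """
    tp = fp = tn = fn = 0
    for suvmax, transformed in cases:
        positive = suvmax > cutoff
        if positive and transformed:
            tp += 1
        elif positive and not transformed:
            fp += 1
        elif not positive and transformed:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical cohort: (SUVmax, transformation confirmed by biopsy).
cohort = [(12.0, True), (15.5, True), (8.0, True), (3.2, False),
          (4.1, False), (11.0, False), (2.5, False), (6.0, False)]
for cutoff in (5.0, 10.0):
    print(cutoff, evaluate_cutoff(cohort, cutoff))
```

In this toy cohort the lower cutoff misses no transformed case but calls more nontransformed cases positive, which is the trade-off behind the different thresholds reported above.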

The transformation to large B-cell aggressive lymphoma is also a critical event for patients with follicular lymphoma (FL), which warrants a more aggressive therapy approach than de novo FL. The value of FDG-PET/CT in diagnosing transformation has been well established for guiding lymph node biopsy when transformation is suspected. Although there is a lack of consistency in defining an exact SUVmax cutoff, a transformation is suggested at a SUVmax of 10–15 [91–96]. It should be emphasized, however, that thresholds indicating transformation should be investigated in homogeneous patient cohorts because the cutoff value will differ among subtypes of indolent lymphomas [94]. Because proliferation is a hallmark of transformation, 3′-deoxy-3′-[18F]fluorothymidine (FLT), as a specific surrogate for proliferation [97], is hypothesized to be superior to FDG for early detection of progression to a more aggressive histology (see Chapter 1: the newer tracers). Nonetheless, there are conflicting reports and this premise has not yet been proven [95, 98]. In a comparative study (n = 26) by Wondergem et al., the ability of FDG to discriminate between FL and transformed FL was superior to that of FLT, with a SUVmax cutoff of 14.5 aiming at 100 % sensitivity with a maximum specificity (82 %) [95]. At the optimal sensitivity, the specificity of FLT was only 36 %, which would imply an unacceptably high proportion of patients requiring a biopsy to exclude transformation. The poor performance of FLT raises the question of its specificity for cell proliferation or Ki-67 expression. Therefore, the clinical impact of FLT remains to be determined in ongoing research studies.

The nodular lymphocyte predominant HL (NLPHL) is an uncommon subtype that invariably expresses CD20 and has an excellent OS, but unlike classical HL (cHL), late relapses may occur. In addition to staging and response assessment, determination of a disparate phenotype may be clinically relevant because NLPHL has a propensity to be associated with a concurrent aggressive B-cell non-Hodgkin lymphoma, or to transform into one, which would require long-term follow-up and image-guided rebiopsy. Hence, recognizing the imaging features of this entity is important. NLPHL is FDG avid, although SUVs are generally lower than those observed in cHL [99, 100]. A study by Hutchings et al. (n = 60) found that the mean SUVmax was 8.0 for NLPHL vs. 11–15 for cHL (p = 0.002) [99]. In a retrospective study (n = 12), NLPHL patients were also found to have a lower FDG SUVmax than those with T-cell/histiocyte-rich large B-cell lymphoma (THR-LBCL) (mean SUVmax, 6.9 vs. 16.6, p = 0.055) [101].

Tumor heterogeneity

The spatial and temporal tumor heterogeneity limits the accuracy of tissue-based molecular assays. However, algorithms of image characterizations may capture intratumor heterogeneity as a signature of gene expression patterns, particularly, with the use of quantitative methods [102, 103]. The heterogeneity of tumor morphology largely accounts for an idiosyncratic treatment response within a single or across different neoplastic disorders. Genetic and epigenetic differences between cancer cells within a tumor might explain why some tumor cells are resistant to therapy, while others are sensitive and can be eradicated after an effective treatment.

Radiomics is an emerging field and refers to the comprehensive evaluation of the entire tumor volume using quantitative image evaluation of tumor phenotypes [102, 104, 105]. Recently, the data published by Aerts et al. suggested that radiomics decoded a general prognostic phenotype existing in multiple cancer types by revealing associations with the underlying gene expression patterns [106]. In one series of mixed cancers including DLBCL, integrating image textural features with SUV measurements significantly improved the prediction accuracy of morphological changes (Spearman correlation coefficient = 0.87, p < 0.0002) [107]. Some of the textural image features (such as entropy and maximum probability) were superior in predicting morphological changes of radiotracer uptake regions longitudinally, compared to SUVmax. In another pilot study, voxel distribution of FDG uptake demonstrated no significant differences in the heterogeneity indices between responders and nonresponders, while the heterogeneity of the intratumoral distribution of 111In-ibritumomab tiuxetan was correlated with the tumor response in this cohort of 16 NHL patients [108]. In this study, pre-therapeutic FDG SUVmax was predictive of the tumor response to 90Y-ibritumomab tiuxetan therapy on a lesion-by-lesion basis. This result is consistent with a previous report [109], while in another prior report, pre-therapeutic FDG SUVmax was not predictive of the tumor response to 90Y-ibritumomab tiuxetan therapy [110]. This may be because of the small number of patients and different analysis methods. Nonetheless, in radionuclide therapy, the nonuniformity of the absorbed dose by the tumor may be a key issue for treatment success or failure. Pre-therapeutic FDG SUVmax in combination with heterogeneity of 111In-ibritumomab tiuxetan might enhance the predictive values for tumor response and long-term outcome, which will be clarified in further studies. 
Radiomics may have a large clinical impact providing a wealth of extractable additional information that can be quantified for monitoring phenotypic changes during treatment. However, it is still in an early phase of development, and there are multiple technical issues that still need to be streamlined and validated to prove its clinical relevance.
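As a concrete illustration of the textural features mentioned above, entropy and maximum probability can be derived from a gray-level co-occurrence matrix. The sketch below is a simplified assumption, not the method of the cited studies: it works on a small 2D slice that has already been quantized to integer gray levels and counts only horizontal neighbor pairs, whereas published implementations typically use 3D neighborhoods and several directions.

```python
import math
from collections import Counter


def cooccurrence_probabilities(image):
    """Normalized co-occurrence of each gray level with its right-hand neighbor."""
    counts = Counter()
    for row in image:
        for left, right in zip(row, row[1:]):
            counts[(left, right)] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("image needs at least one horizontal voxel pair")
    return {pair: n / total for pair, n in counts.items()}


def texture_features(image):
    """Return (entropy in bits, maximum probability) of the co-occurrence matrix."""
    probs = cooccurrence_probabilities(image)
    entropy = sum(-p * math.log2(p) for p in probs.values())
    max_probability = max(probs.values())
    return entropy, max_probability


# Hypothetical quantized uptake patterns: uniform vs. heterogeneous.
uniform = [[1, 1, 1], [1, 1, 1]]
heterogeneous = [[1, 2], [3, 4]]
print(texture_features(uniform))
print(texture_features(heterogeneous))
```

A uniform region concentrates all probability in one gray-level pair (low entropy, high maximum probability), while a heterogeneous region spreads it across many pairs.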

4.2.2.1.2 Assessment of Bone Marrow Involvement (BMI)

Although it is widely recognized that a unilateral iliac crest BMB could underestimate lymphoma infiltration, bone marrow biopsy (BMB) has been the standard conventional method to evaluate bone marrow (BM) involvement in lymphomas (see Chapter 1: the need for bone marrow biopsy). However, BMB is associated with complications such as bleeding, anxiety, and pain [111, 112]. To overcome these disadvantages, the high sensitivity provided by whole body PET/CT imaging is exploited for effectively diagnosing BMI. According to the new Lugano guidelines, if a PET/CT is performed, a BMB is no longer required for the routine evaluation of patients with HL because of the low incidence of BMI [113, 114]. In DLBCL, if the scan is negative, a BMB is indicated to identify involvement by discordant histology if relevant for a clinical trial or patient management [113, 115, 116]. Several studies investigated whether visual and quantitative PET-based BM assessment can replace blind BMBs in various lymphoma subtypes.

Non-Hodgkin Lymphoma

Adams et al. reported the inability of FDG-PET/CT to replace BMB in newly diagnosed DLBCL because PET-based BM assessments, including SUVs, were prognostically inferior to BMB (n = 78). Multivariate analysis showed that only BMB status was an independent predictive factor of PFS (P = 0.016) and OS (P = 0.004) [117]. The design of this study, however, was not optimal because of its retrospective analysis and the use of BMB as the only reference standard for the diagnosis of BMI, which only allowed for the calculation of patient-based sensitivity of FDG-PET/CT. The same group of investigators subsequently reported that, in a head-to-head comparison with BMB, the diagnostic value of both visual and quantitative PET/CT for the detection of BMI was low in a cohort of 40 DLBCL patients [118]. The SUVmean, SUVmax, and SUVpeak of BMB-negative patients (1.4 ± 0.49, 2.2 ± 0.69, and 1.7 ± 0.59, respectively) considerably overlapped with those of BMB-positive patients (1.8 ± 0.53, 2.7 ± 0.71, and 2.2 ± 0.61, respectively).

Contrary to these results, in patients with FL, quantitative PET analysis was more beneficial in diagnosing BMI than visual assessment in a preliminary study of 22 patients. Optimal SUVmax cutoff of 2.1 yielded sensitivity and specificity combinations of approximately 87 % [119]. In another study, of 41 patients with grade 1-3a FL and diffuse BM uptake, using a SUVmean cutoff of ≥2 resulted in approximately 30 % improved sensitivity at no cost to specificity. Moreover, using the ratio SUVmean/MBP ≥1, the sensitivity of PET/CT to detect BM involvement improved to 83 % [120]. As a limitation, this study was retrospective and included both staging and restaging patient groups which added heterogeneity to the data.

Hodgkin lymphoma

Although the value of qualitative analysis and the rareness of BMI in HL have been addressed previously, several studies investigated the added value of a quantitative PET approach in the detection of BMI by HL. SUVmax evaluation did not have an incremental value over visual evaluation in a retrospective study that included 26 HL patients [121]. In another retrospective study of 106 HL patients, Salaun et al. reported that multivariate analysis revealed an independent correlation between sacral SUVmax and Ann Arbor stage (p = 0.005). No BMI was found in patients who presented with SUVmax below 3.4 [122].

In summary, because the qualitative interpretation of PET may be marred by the physiologic accumulation of FDG within the BM, there is a need for an objective whole body technique to yield quantifiable results that may simulate BMB. At first glance, the distinction between these potentially overlapping conditions may be easy, considering that only focal FDG uptake is considered to represent BMI in HL [113, 114]. However, this distinction is challenging in NHL, where BMI can present with both focal and diffuse patterns of FDG uptake [116, 117]. In this regard, development of a quantitative PET approach may be particularly relevant in patients with newly diagnosed NHL. However, a number of unsettled issues still exist: what extent of increase in BM uptake should be considered suggestive of BMI; if this increase is quantifiable, how it should be corrected for the actual BM volume, which varies from one patient to another; how to factor in the differences in BM volume in different parts of the body; and, finally, what methods would minimize the overlap between reactive BM hyperplasia and diffuse BMI. With the wealth of available software programs, further work is underway to address these valid concerns and to determine the actual role of a quantitative PET approach.

4.2.2.2 Quantitative PET-Derived Metrics Beyond SUVs

As discussed at length in the previous section, SUV can be biased by the count variability and tumor heterogeneity in a volume of interest because of the reliance on a single voxel measurement. Furthermore, besides the anatomic finding of high tumor burden in a disseminated disease, which is frequently recorded at baseline in lymphoma, a methodology able to assess and quantify the metabolic activity of a given tumor burden would be more clinically relevant. In an effort to reduce bias, increase reproducibility, and improve the predictive value of PET results, functional volume parameters, i.e., metabolic tumor volume (MTV) and tumor lesion glycolysis (TLG), have been under investigation [1, 8–10].

4.2.2.2.1 Prognostic Value of PET-Derived Quantitative Metrics at Baseline

If the baseline whole body disease volume is proven to be an independent prognostic factor, high-risk patients may be objectively identified for treatment intensification. However, there is a paucity of clinical data for the establishment of a prognostic system that is based on pre-therapy quantitative PET metrics affecting clinical outcomes of lymphoma patients. The available literature in both HL and NHL is discussed in the following section and summarized in Table 4.4.

Table 4.4 Published studies in lymphoma using metabolic tumor volume as a measure of outcome

Hodgkin lymphoma

Tumor bulk is a significant negative prognostic factor in early-stage HL [113, 123–125]. However, not only does the exact definition of tumor bulk remain controversial, but an objective method to measure whole body tumor burden is also yet to be established for patient-tailored management. Thus far, the practice has relied on the indirect measures of tumor burden, i.e., the extent of involved sites used by the Ann Arbor staging system, and integrated factors including number of disease sites, stage, and LDH used by the prognostic systems including the international prognostic score (IPS) to stratify risk categories [126–129]. In a prior study of HL patients treated on standard protocols, the mean tumor burden normalized to body surface area based on CT measurements was found to be largely superior to all prognostic models as a predictor of complete remission and survival [124, 125]. Given the coverage of the entire body, metabolic volume determination may be a better surrogate for response and survival by representing overall tumor functionality.

Several retrospective studies using various methodologies for calculating the tumor volume showed that there may be a benefit in using PET quantitative metrics to predict survival [130–132]. In a study by Song et al. in 127 early-stage HL patients (20 % bulky) treated with six cycles of ABVD, with or without involved-field radiotherapy (IFRT), the multivariate analysis showed that only older age, B symptoms, and high MTV status were independently associated with PFS and OS (PFS, p = 0.008; OS, p = 0.007) [130]. In this study, a fixed threshold of SUV ≥2.5 was used to determine the disease volume. In another single-center study, Kanoun et al. showed that pre-therapy MTV was predictive of patient outcomes in a cohort of 59 HL patients (92 % stage II–IV, 60 % IPS > 2), who were treated with an anthracycline-based therapy with or without IFRT [131]. The patients with a low MTV had a significantly better 4-year PFS than those with a high MTV (85 % vs. 42 %, p = 0.001, 88 % vs. 45 %, p = 0.0015, respectively). MTVs were measured with a semiautomatic method using a 41 % SUVmax threshold. In multivariate analysis, only baseline MTV (p < 0.006, RR 4.4) and ΔSUVmax at PET2 (71 %, p = 0.0005, RR 6.3) remained independent predictors of PFS, whereas tumor bulk (≥10 cm) did not reach statistical significance. In contrast to these findings, Tseng et al. reported that at a median follow-up of 50 months, baseline absolute PET metrics including SUVmax, SUVmean, and MTV did not predict survival in 30 HL patients (stage IIB-IV 63 %, 30 % IPS > 2) treated with varying chemotherapy regimens with or without IFRT, whereas IPS was associated with PFS (p < 0.05) and OS (p < 0.01) [132]. On the contrary, the ΔMTV (p < 0.01), ΔSUVmax (p = 0.01), and ΔSUVmean (p < 0.05) at interim PET were associated with PFS and OS. 
This divergent result compared to others may be attributable to the small patient cohort and the differences in methodologies, patient population (stage, risk factors), and therapy protocol. However, all of the above reviewed studies had suboptimal designs marred by their retrospective nature, which was inherently prone to biases because of non-standardized protocols and patient preparation (see previous section). Also, the use of various segmentation methods and resultant MTV cutoffs that varied between 200 and 500 ml led to non-comparable and non-generalizable results. Moreover, a fixed threshold that was used by all of these studies is not considered optimal for volumetric assessment, as discussed in the previous technical section.
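The dependence of MTV on the segmentation method can be made concrete with a minimal sketch comparing the two threshold rules mentioned above, a fixed SUV of 2.5 and 41 % of SUVmax. The voxel values and voxel volume are hypothetical, and the sketch deliberately ignores spatial connectivity and lesion delineation, which real segmentation software has to handle.

```python
def segment_fixed(voi_suv_values, threshold=2.5):
    """Keep voxels whose SUV is at or above a fixed absolute threshold."""
    return [v for v in voi_suv_values if v >= threshold]


def segment_relative(voi_suv_values, fraction=0.41):
    """Keep voxels whose SUV is at or above a fraction of the VOI's SUVmax."""
    if not voi_suv_values:
        return []
    cutoff = fraction * max(voi_suv_values)
    return [v for v in voi_suv_values if v >= cutoff]


def volume_ml(voxels, voxel_volume_ml):
    """Volume of the segmented voxels, assuming a uniform voxel size."""
    return len(voxels) * voxel_volume_ml


# Hypothetical SUVs inside one volume of interest; each voxel is 0.5 ml.
voi = [1.0, 2.0, 3.0, 5.0, 9.0, 20.0]
print("fixed 2.5:", volume_ml(segment_fixed(voi), 0.5), "ml")
print("41 % of SUVmax:", volume_ml(segment_relative(voi), 0.5), "ml")
```

In this toy example the same lesion yields twice the volume with the fixed threshold as with the relative one, which illustrates why MTV cutoffs derived with different methods are not directly comparable.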

In a retrospective analysis of prospectively acquired data in 89 cHL patients whose findings were reported previously by Hutchings et al. [133], with a median follow-up of 52 months, no baseline clinical parameters correlated with PFS, but both baseline and interim quantitative PET parameters did [134]. The MTV was the strongest predictor of PFS at baseline (p = 0.002) and D 5PS at PET1 (p < 0.0001) (unpublished data). However, these data were obtained in a mixture of early- and advanced-stage patients, with as many as 54 % of the original series of 126 patients having limited-stage disease (IA–IIB) (Fig. 4.5). Further investigations should include more homogeneous data for definitive conclusions on the role of quantitative PET in the determination of HL outcomes. In view of the existing promising data, there is a need for more prospective large datasets to definitively determine the complementary or independent role of quantitative FDG-PET metrics at baseline for predicting prognosis and guiding treatment decisions in cHL.

Fig. 4.5 Patient examples

Diffuse large B-cell lymphoma

For NHL, there are no universally accepted or validated criteria for defining “bulky” disease, although 6 cm was suggested as the best cutoff for FL [135] and 6–10 cm for DLBCL [136]. A more streamlined and objective tumor burden measure would be preferred to better guide management. The pretreatment FDG-PET metrics have been investigated as a potential predictor of survival in patients with DLBCL treated with rituximab, cyclophosphamide, doxorubicin, vincristine, and prednisolone (R-CHOP) [137–146]. In a retrospective study of 169 patients with stage II–III (74 % IPI ≤2) de novo DLBCL, imaged prior to R-CHOP therapy (6–8 cycles), Song et al. found in a multivariate analysis that the whole body tumor burden was a more important prognostic parameter for PFS and OS than Ann Arbor staging (PFS, HR = 5.3; OS, HR = 7.0; both p < 0.001) [138]. MTV was defined with a thresholding intensity based on SUVmax ≥2.5. With a median follow-up of 36 months, the 3-year estimates of PFS and OS were significantly higher in the low MTV group than in the high MTV group (PFS, 90 % vs. 56 %; OS, 93 % vs. 58 %, both p < 0.001). The same group of investigators found similar results in 165 early-stage (71 % IPI ≤2) primary gastrointestinal DLBCL patients [147]. During a median follow-up of 37 months, MTV was a better predictor of survival than SUVmax as determined by receiver operating characteristic (ROC) analysis (0.92 vs. 0.70). Multivariate analysis revealed that a high IPI score (p = 0.001) and high MTV (p < 0.001) were independent prognostic factors for both PFS and OS, while other known prognostic factors were not significant. In another study of 140 DLBCL patients who received R-CHOP therapy followed by RT to bulky disease, after a median follow-up of 28.5 months, the TLG at the threshold of 50 % ΔSUVmax was significantly associated with PFS and OS (HR = 4.4; p = 0.008 for PFS and HR = 3.1; 95 % CI = 1.0–9.6; p = 0.049 for OS) [141]. High IPI score and Ann Arbor stage of III/IV did not significantly shorten PFS. 
Similarly, in a retrospective study of 114 DLBCL patients [140] enrolled in the previously reported International Validation Study [148], Sasanelli et al., using a 41 % SUVmax threshold, found that MTV was the only independent predictor of OS (p = 0.002) and PFS (p = 0.03) compared with other pre-therapy indices including tumor bulk (≥10 cm), LDH, stage, and age-adjusted IPI. The 3-year estimates of PFS were 77 % in the low metabolic tumor burden group and 60 % in the high metabolic tumor burden group (p = 0.04), and prediction of OS was even better (87 % vs. 60 %, p = 0.0003). TLG failed to predict PFS and was less predictive of OS than MTV, in contrast to prior results. This multicenter study, however, was flawed by the absence of protocol harmonization and cross-calibration of scanners across participating centers, variability of therapy protocols, and the lack of comparative analysis between volumetric results and SUVs. More recently, Kim et al. reported that the higher MTV group had a significantly inferior EFS compared with the lower MTV group during a median follow-up of 28 months in 96 DLBCL patients who were treated with R-CHOP [142]. In this study, MTV was defined with a fixed threshold of 2.5. There was no difference in EFS between patients with stage II and stage III disease (n = 53), but within this group the higher MTV group showed a significantly inferior EFS compared with the lower MTV group. Likewise, Xie et al. demonstrated that, according to the cutoff determined from ROC analysis, lower MTV and TLG values prior to therapy were highly predictive of favorable PFS in DLBCL (n = 60) [144]. The multivariate analysis determined that the MTV and TLG values and the number of enlarged lymph nodes predicted PFS independent of the National Comprehensive Cancer Network International Prognostic Index (NCCN-IPI) score and lactate dehydrogenase (LDH) level.

Several studies have reported results that contradict those of the studies reviewed above [145, 146]. Gallicchio et al. suggested that the baseline SUVmax was a better predictor of EFS (P = 0.0002, HR 0.13) during a median 18-month follow-up than MTV and TLG in a study of 52 DLBCL patients with intermediate IPI scores, who were treated with R-CHOP [145]. Only the IPI score 3 was slightly but significantly associated with poor outcome. The metabolic volume was determined with a 42 % threshold. It is conceivable that patients with intermediate IPI score presenting high SUVmax would respond better since the magnitude of glycolytic activity rather than the amount of metabolically active burden appears to be the key determinant. Adams et al. retrospectively investigated the pretreatment PET/CT in 73 patients with newly diagnosed DLBCL who had undergone R-CHOP immunochemotherapy [146]. On univariate Cox regression analysis, only the NCCN-IPI was a significant predictor of PFS (P = 0.024), and only the NCCN-IPI and MTV were significant predictors of OS (P = 0.039 and P = 0.043, respectively). Therefore, the NCCN-IPI was suggested to remain the most important prognostic tool in this disease.

Combined results of a systematic review of seven retrospective studies involving 703 DLBCL patients [149], which included some of the above referenced studies [137, 138, 140, 141, 143, 146], suggested that SUVmax and MTV are significant prognostic factors for PFS (HR 1.61, p = 0.038 and HR 2.18, p < 0.001, respectively). Similarly, high MTV and TLG values unfavorably influenced the 3-year OS (OR, 5.40 and 2.19, respectively). For OS, only high MTV was a strong predictor of poor prognosis in DLBCL, with an HR of 2.99 (p < 0.001). Overall, this meta-analysis found that the outcomes of the included studies were inconsistent. First, although the principal treatment protocol in six trials was R-CHOP [137, 138, 140, 143, 146, 150], treatments were inhomogeneous in the trial conducted by Sasanelli et al. [140], in which 55 % of patients received R-CHOP only, 45 % received R-ACVBP, and an additional 18 % underwent autologous stem cell transplantation. Second, the use of different risk scoring systems also impacted the homogeneity of the analysis. Five trials used the old IPI scoring system for risk stratification [137, 138, 141, 143, 150], one trial used the age-adjusted IPI scoring system [144], and the other used the recently proposed NCCN-IPI scoring system [146]. With the exception of the study by Oh et al. [150], most patients in the remaining six trials had low-intermediate or high-intermediate risk according to the IPI system. Third, the varying inclusion and exclusion criteria might have led to the heterogeneity of the results. Moreover, the studies varied widely in the optimal cutoff values for survival prediction, with the cutoff values ranging from 11 to 30 for ΔSUVmax, from 220 to 550 ml for MTV, and from 415 to 2955 for TLG. The trials also differed in the Cox proportional hazard regression methods. Finally, the small number of patients might have influenced the reliability of results. 
These are collectively the probable reasons leading to the high heterogeneity of the combined results. When the outcomes from other ongoing trials are published, a further meta-analysis will be needed.

In a prospective cohort of 103 primary mediastinal large B-cell lymphoma (PMBCL) patients who were enrolled in the International Extranodal Lymphoma Study Group (IELSG) and received combination chemo-immunotherapy, Ceriani et al. reported that only TLG retained statistical significance for both PFS (P < .001) and OS (P = .001) in a multivariate analysis [151]. The MTV was estimated using a threshold method based on 25 % of the SUVmax, which was lower than other proposed thresholds [132, 152]. The 5-year OS was 100 % for patients with low TLG vs. 80 % for those with high TLG (p = .0001), whereas PFS was 99 % vs. 64 %, respectively (P < .0001). Nonetheless, this was a retrospective evaluation in a group of 21 centers using various scanners. Additionally, despite a p < 0.0001, the HR for TLG was only 1.36 for increments of 103. Although considered preliminary, these results indicate that TLG at staging PET/CT could be a useful index in predicting outcomes in high-grade NHL including PMBCL treated with standard first-line chemotherapy regimens. Although it is premature to define the role of volumetric measurements in predicting outcomes, as a preliminary conclusion, metabolic tumor volumes tend to be superior to ΔSUVmax in predictive value for survival, and a high MTV is significantly associated with reduced survival in DLBCL patients treated with R-CHOP. Because of the heterogeneity of the presently published data, these results should be interpreted with caution. This area of research will benefit from future large-scale prospective studies and further development in segmentation methodologies.

4.2.2.2.2 Predictive Value of PET-Derived Quantitative Metrics During or After Therapy

Taking a step forward from the traditional risk stratification systems, efforts have been concentrated on the interim PET results as a tool for guidance in early therapy modifications. However, the prognostic value of interim PET remains controversial in DLBCL patients with qualitative assessment variably correlated with outcome. The high false-positive rate associated with visual scoring systems, including the Deauville 5-point scale (D 5PS), has laid the grounds for quantitative PET initiatives when there is no existent optimal evaluation method for early assessment of response.

ΔSUV-based evaluation. Based on the results of multiple studies published by the Groupe d’Etude des Lymphomes de l’Adulte (GELA), it was suggested that the percentage reduction in SUVmax between baseline and interim PET (ΔSUVmax) improves both the interpretation accuracy and the interobserver reproducibility and better predicts patient outcome than visual analysis [153–155]. This group of investigators demonstrated that a 66 % reduction in SUVmax between baseline (PET0) and after two cycles of chemotherapy (PET2) better predicted event-free survival (EFS) by reducing false-positive results of visual analysis. Other subsequent studies published corroborative results in DLBCL patients treated with anthracycline-based regimens plus rituximab [156, 157]. However, opposing results have also been reported by Pregno et al. in DLBCL patients treated with R-CHOP, in whom the ΔSUVmax (both 66 % cutoff and median) at PET2 to PET4 was rather weakly correlated with outcome (p = 0.113) [80]. Although the cohort was homogeneous, the limitations of this study included a small sample size, different time point analysis, and later than optimal time point preference (PET3 to PET4 vs. PET2). A similar quantitative approach was applied by Rossi et al. to HL patients and showed that ΔSUVmax at PET2 was more accurate than the D 5PS in the prediction of outcome [158]. In this retrospective cohort of 59 consecutive HL patients who were treated with 4–8 cycles of anthracycline-based chemotherapy, PET2 ΔSUVmax >71 % was considered a favorable response. Although visual PET2 positivity was related to a lower 4-year PFS (45 % vs. 81 %, p < 0.002), ΔSUVmax was more accurate for identifying patients with different 4-year PFS (82 % vs. 30 %; p < 0.0001). In a multivariate analysis using the IPI and ΔSUVmax as covariates, ΔSUVmax remained the unique independent predictor for PFS (RR, 8.1 and p = 0.0001). 
Quantitative interpretation of PET may lend itself as a more pragmatic tool to guide clinicians in lymphoma management, but the results of available data only point to the need for larger prospective trials and for optimization and standardization of criteria for interim PET evaluation to assess the real prognostic value of interim PET results.
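The ΔSUVmax calculation discussed above is a simple percentage reduction from baseline, which can be sketched as follows. The function names and example SUVmax values are hypothetical, and treating a reduction strictly greater than the cutoff as favorable is an assumption; the 66 % and 71 % cutoffs are those quoted in the text.

```python
def delta_suvmax_percent(baseline_suvmax, interim_suvmax):
    """Percentage reduction in SUVmax between baseline (PET0) and interim PET."""
    if baseline_suvmax <= 0:
        raise ValueError("baseline SUVmax must be positive")
    return 100.0 * (baseline_suvmax - interim_suvmax) / baseline_suvmax


def is_favorable_response(baseline_suvmax, interim_suvmax, cutoff_percent=66.0):
    """Assumed convention: favorable when the reduction exceeds the cutoff."""
    return delta_suvmax_percent(baseline_suvmax, interim_suvmax) > cutoff_percent


# Hypothetical patient: SUVmax falls from 20.0 at PET0 to 6.5 at PET2.
reduction = delta_suvmax_percent(20.0, 6.5)
print(f"reduction = {reduction:.1f} %")
print("favorable at 66 %:", is_favorable_response(20.0, 6.5, 66.0))
print("favorable at 71 %:", is_favorable_response(20.0, 6.5, 71.0))
```

The same patient is classified differently under the two cutoffs, which illustrates why a uniformly applicable threshold matters for interim PET interpretation.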

Tumor metabolic volume evaluation

Although ΔSUVmax measurements partially improve on visual criteria and decrease the rate of false-positive results, a uniformly applicable ΔSUVmax cutoff has not been established to accurately predict clinical outcome. One can hypothesize that volumetric quantitative PET metrics have a better predictive value early during therapy beyond that of ΔSUVmax as well as traditional risk factors in lymphoma. The results are summarized under two topics, “HL” and “DLBCL,” respectively, in the following section. In general, a judicious approach should be adopted when interpreting these studies because the majority were retrospective, and no detailed information was provided on the quality assurance of the investigated data or on scanner calibration, image reconstruction algorithms, and patient scanning protocols (see previous section). Another design flaw of prior studies was the presence of a mixed population of early- and advanced-stage disease. It has long been established that Ann Arbor staging is one of the most important pre-therapy prognostication systems and an essential component of prognostic models such as IPI and IPS. Therefore, evaluation of the additional value of PET quantitative metrics in distinct categories of early- and advanced-stage patients is necessary to derive clinically meaningful prognostic information.

Hodgkin lymphoma

PET-derived quantitative metrics can improve the robustness of response assessment for therapy adaptation in HL patients. Several studies were designed to address this objective [131, 132, 134]. The study by Kanoun et al. revealed that both baseline MTV and ΔSUVmax at PET2 were independent predictors of PFS in a mixed early- and advanced-stage HL population [131]. The combination of MTV and ΔSUVmax made it possible to identify three subsets of HL patients with different PFS outcomes: ΔSUVmax >71 % and MTV ≤225 ml, ΔSUVmax ≤71 % or MTV >225 ml, and ΔSUVmax ≤71 % and MTV >225 ml. In these three groups, the 4-year PFS rates were 92 %, 49 %, and 20 % (p < 0.0001), respectively. In another retrospective study by Tseng et al., 30 HL patients (53 % with stage III–IV disease and 67 % with IPS ≥ 2) were treated with varying chemotherapy regimens [Stanford V (67 %), ABVD (17 %), VAMP (10 %), or BEACOPP (7 %)] with or without radiation therapy [132]. Interim-treatment scans were performed at a median of 55 days from the staging PET. At a median follow-up of 50 months, baseline absolute PET parameters did not predict survival, while the ΔMTV (p = 0.01), ΔTLG (p < 0.01), and ΔSUVmax (p = 0.02) were associated with PFS. In this study, all calculated PET parameters were also associated with OS. IPS was also associated with PFS and OS (p < 0.05 and p < 0.01, respectively). These results suggest that the chemosensitivity of the tumor as measured by PET early during treatment is more predictive of clinical outcome than the initial tumor bulk, which gives further credence to prior validation studies [159, 160]. However, because the study included relapsed patients and various chemotherapy regimens, including intensive treatments, these data do not support firm, reproducible conclusions. The quantitative PET results were also investigated in pediatric HL patients [161–164].
As in the adult population, response assessment after two cycles improved the specificity of response assessment by 30 % using ΔSUVmax with a cutoff of 58 % [163, 164]. Contrary to these results, however, multiple other studies did not confirm the high predictive power of PET status early during therapy [79–81]. In a recent study by Hussien et al. in 54 pediatric HL patients treated on treatment optimization protocols, all quantitative PET measures (SUVmax, SUVmean, MTV, and TGV) fared significantly better than the qualitative response assessment using D 5PS at PET2 [162]. ΔSUVmax was the most powerful predictor of treatment outcome (area under the curve, 0.92; p < 0.001). The tumor volumes were determined with a fixed threshold of 2.5 SUV and with a threshold of the mean liver SUV plus two standard deviations. In this study, technical parameters were better controlled than in other studies: all PET scanners were cross-calibrated, and scan protocols followed EANM guidelines. However, sophisticated volumetric PET measures did not perform significantly better than the previously proposed ΔSUVmax in early response assessment [1, 3]. In summary, in the pediatric HL population, as in the adult population, these results are preliminary, and larger cohorts are needed to confirm this observation and better define the role of PET/CT imaging. Recently, Hasenclever et al. used a continuous scale by assigning D 5PS categories to certain quantitative PET cutoff values, using the quotient of the SUVpeak of the area with the most FDG-avid residual uptake and the SUVmean of the liver, in 898 pediatric HL patients after two chemotherapy cycles [165]. The borderlines for D 5PS 3, 4, and 5 corresponded to quantitative PET values of 0.95, 1.3, and 2.0, respectively, and a quantitative PET value of <1.3 excluded an unfavorable response with high sensitivity. This method warrants a prospective validation study before it can be used in clinical settings.
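The two decision rules quoted in this section, the three-group stratification by ΔSUVmax and baseline MTV and the mapping of the residual-to-liver uptake quotient onto D 5PS categories, can be written out as a short Python sketch. The function names and group labels are hypothetical. Reading the middle group as "exactly one adverse factor" and treating each borderline as belonging to the higher category are assumptions, since the text does not specify either.

```python
def risk_group(delta_suvmax_pct: float, baseline_mtv_ml: float) -> str:
    """Three-group stratification using the cutoffs quoted in the text.

    Assumption: the intermediate group has exactly one adverse factor.
    """
    adverse_delta = delta_suvmax_pct <= 71.0   # insufficient SUVmax reduction
    adverse_mtv = baseline_mtv_ml > 225.0      # large baseline metabolic volume
    adverse_count = int(adverse_delta) + int(adverse_mtv)
    return ("favorable", "intermediate", "unfavorable")[adverse_count]


def qpet(residual_suvpeak: float, liver_suvmean: float) -> float:
    """Quotient of the hottest residual SUVpeak and the mean liver uptake."""
    if liver_suvmean <= 0:
        raise ValueError("liver SUVmean must be positive")
    return residual_suvpeak / liver_suvmean


def qpet_to_d5ps(q: float) -> str:
    """Map the quotient onto D 5PS categories at borderlines 0.95, 1.3, 2.0.

    Assumption: a value exactly on a borderline falls in the higher category.
    """
    if q >= 2.0:
        return "5"
    if q >= 1.3:
        return "4"
    if q >= 0.95:
        return "3"
    return "1-2"


# Hypothetical patients
print(risk_group(80.0, 100.0))        # both factors favorable
print(risk_group(50.0, 300.0))        # both factors adverse
print(qpet_to_d5ps(qpet(3.0, 2.0)))   # quotient of 1.5
```

A continuous quotient makes the borderline between categories explicit, whereas the visual scale leaves the comparison with liver uptake to the reader's judgment.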

Diffuse large B-cell lymphoma

Several retrospective studies investigated the value of quantitative PET-derived metrics in DLBCL, with the majority of data showing encouraging results [166, 167]. Park et al. investigated ΔSUVmax, TLG, and ΔTLG after 2 or 3 cycles in R-CHOP-treated DLBCL patients (n = 100), including 57 patients with an IPI score of 1–3: the absolute values of baseline and interim SUVs calculated as the sum of values from 5 lesions (SUVsum) and interim ΔSUVmax were significantly correlated with PFS [166]. Although the ΔSUVmax and ΔTLG after 2 or 3 cycles were not associated with prognosis, the segmentation algorithm was based on a mediastinal blood pool (MBP) threshold, which might have yielded larger MTVs than other thresholding methods. The result of this study, although retrospective in design, highlights the potential of a quantitative approach to better delineate patient risk groups, particularly in those with IPI scores of 1–3, which comprise overlapping risk categories in which true low-risk patients should be better separated from the high-risk group to individualize therapies. These results could serve as a basis for future studies on the use of PET/CT in clinical practice as an adjunct to IPI. Gradient-based methods appear to be more accurate than source-to-background ratio methods for segmenting FDG-PET images [43]. Malek et al. performed a retrospective study to correlate the ΔMTV and ΔSUVmax on interim PET with PFS after 2–4 cycles in 140 DLBCL patients, using a gradient-based method that yielded a greater tumor volume than the threshold-based method [167]. During a median follow-up of 37 months and with the use of R-CHOP and R-DA-EPOCH (rituximab-dose-adjusted etoposide, prednisone, Oncovin, cyclophosphamide, hydroxydaunorubicin) as the first-line therapy, D 5PS did not correlate with PFS (P = 0.37).
Compared with the threshold-based method, the gradient-based method resulted in a statistically significantly greater MTV on pretreatment as well as interim PET images. However, no significant difference was noted between the two methods. ΔMTV predicted PFS better than ΔSUVmax as the AUC for ΔMTV was significantly larger compared with that for ΔSUVmax (AUCΔMTV: 0.713 and AUCΔSUVmax: 0.873; P: 0.0324). Briefly, ΔMTV by either method after initial treatment was a better predictor of PFS compared with ΔSUVmax. Further analysis also revealed the importance of ΔMTV on interim PET for predicting PFS in patients who had also achieved a significant ΔSUVmax. MTV assessment (by either gradient- or threshold-based methods) may provide a better methodology to accurately predict PFS as it incorporates both metabolic and volumetric information as a measure of tumor burden. Contrary to the aforementioned results, in a cohort of 73 newly diagnosed DLBCL patients, Adams et al. showed that the NCCN-IPI [168] was the most important prognostic tool for PFS (p = 0.024) and OS (p = 0.039) compared to PET-derived metrics including SUVmax, MTV, and TLG [146]. In this retrospective study, the authors used a threshold setting of 40 % of the SUVmax for volume delineation by a single expert. Median values of SUVmax, MTV, and TLG were used as cutoff values for group discrimination. Compared to prior studies, these significantly different results might have stemmed from methodological differences, different patient populations, the use of non-cross-calibrated scanners, and the overestimation of MTV and TLG through the use of a retrospective cutoff value in ROC analysis. In a pilot study of pediatric NHL patients (n = 16), Furth et al. showed a limited predictive value for PET2 due to a considerably high rate of false-positive findings, especially in patients with bulky disease [169].
With a mean follow-up of 60.2 months, the Kaplan–Meier survival analysis revealed no significant differences in 5-year PFS for the conventional imaging modality (CIM) (76.9 % vs. 66.7 %; p = 0.67), for visual PET (85.7 % vs. 66.7 %; p = 0.34), or for ΔSUVmax-based analysis (88.9 % vs. 57.1 %; p = 0.12). In relapsed or refractory DLBCL, in a multicenter clinical trial of 55 patients treated with bendamustine–rituximab, Tateishi et al. demonstrated that the ΔTLG can be used to quantify the response to treatment and can predict PFS after the last treatment cycle [170]. In this study, scanners were cross-calibrated using a NEMA/IEC image quality phantom. MTV was calculated with a fixed threshold of SUVmax >2.5. The percentage change in all PET parameters except for the area under the curve of the cumulative SUV-volume histogram was significantly greater in complete responders than in non-complete responders after two cycles and after the last cycle. The percentage change of the sum of total lesion glycolysis after the last cycle (relative risk, 5.24; P = 0.003) was an independent predictor of PFS. An early PET scan after two cycles of treatment can effectively predict the outcome in patients with DLBCL treated with rituximab and anthracycline-based chemotherapy by using either a visual or a quantitative approach. If its validity is proven in prospective studies, the interim ΔSUVmax approach may better serve clinicians in designing a risk-adapted therapeutic strategy for DLBCL patients.
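The volumetric metrics compared in these studies follow from simple voxel arithmetic, sketched below in plain Python. This is an illustration under stated assumptions rather than the pipeline of any cited study: the flat list of voxel SUVs, the 0.5 ml voxel volume, and the function names are hypothetical, and voxels equal to the threshold are counted as inside the volume. Gradient-based segmentation is not shown, and only a fixed threshold and a percentage-of-SUVmax threshold are illustrated.

```python
def mtv_and_tlg(voxel_suvs, voxel_volume_ml, threshold):
    """Return (MTV in ml, TLG) for the voxels at or above the SUV threshold.

    TLG = MTV x SUVmean, which equals voxel volume x sum of included SUVs.
    """
    inside = [s for s in voxel_suvs if s >= threshold]
    mtv = len(inside) * voxel_volume_ml
    tlg = sum(inside) * voxel_volume_ml
    return mtv, tlg


def percent_change(baseline, interim):
    """Percentage reduction of a metric between baseline and interim PET."""
    return 100.0 * (baseline - interim) / baseline


# Hypothetical lesion voxels (SUV) at baseline and at interim PET
baseline_voxels = [1.0, 3.0, 5.0, 8.0, 10.0]
interim_voxels = [1.0, 2.0, 3.0, 4.0]
voxel_ml = 0.5

# Fixed threshold of 2.5 SUV
mtv0, tlg0 = mtv_and_tlg(baseline_voxels, voxel_ml, 2.5)
mtv1, tlg1 = mtv_and_tlg(interim_voxels, voxel_ml, 2.5)
print("dMTV (%):", percent_change(mtv0, mtv1))
print("dTLG (%):", percent_change(tlg0, tlg1))

# Relative threshold: 40 % of the lesion SUVmax
rel_threshold = 0.40 * max(baseline_voxels)
print("MTV, TLG at 40 % SUVmax:", mtv_and_tlg(baseline_voxels, voxel_ml, rel_threshold))
```

The same lesion yields different MTV and TLG values under the two thresholds, which illustrates why results obtained with different segmentation methods are difficult to compare across studies.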

4.2.2.2.3 Radiation Therapy (RT) Planning

A limitation of FDG-PET in RT for HL is the variability in the delineation of tumor volumes. Automatic or semiautomatic segmentation methods, including thresholding based on a percentage of the tumor SUVmax, may decrease variability in tumor delineation, but there are limited data in lymphoma using tumor volume segmentation methodologies. In a preliminary study using a 15–40 % SUVmax threshold segmentation method, there was, on average, a 7.6-fold increase in PET volume between the 15 % and 40 % SUVmax thresholds. There was a clinically significant decrease in dose to normal structures when the involved site radiation therapy (ISRT) plans were generated using the 15 % SUVmax volumes compared with the 40 % SUVmax volumes [171]. If these results can be reproduced, a streamlined approach may be developed using segmentation methods for conformal therapies. Moreover, the increased functional volume could be an artifact when contrast-enhanced CT is used for attenuation correction. In this case, it is recommended that the delineation volume obtained with the relative or adaptive method be preferred when contrast media are used for PET/CT [172].
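The sensitivity of the delineated volume to the chosen percentage-of-SUVmax threshold can be illustrated with a small sweep in Python. The voxel values, the 0.5 ml voxel volume, and the function name below are hypothetical, and the resulting ratio is not meant to reproduce the 7.6-fold figure quoted above. The sketch shows only the general behavior that a lower relative threshold includes more voxels and therefore produces a larger volume.

```python
def volume_at_fraction(voxel_suvs, voxel_volume_ml, fraction):
    """Volume (ml) of voxels at or above `fraction` x the lesion SUVmax."""
    cutoff = fraction * max(voxel_suvs)
    return sum(1 for s in voxel_suvs if s >= cutoff) * voxel_volume_ml


# Hypothetical lesion with a hot core and a low-uptake rim (SUV per voxel)
lesion = [10.0, 8.0, 6.0, 4.5, 3.5, 3.0, 2.5, 2.0, 1.8, 1.6, 1.0, 0.5]
voxel_ml = 0.5

for fraction in (0.15, 0.25, 0.40):
    volume = volume_at_fraction(lesion, voxel_ml, fraction)
    print(f"{fraction:.0%} of SUVmax -> {volume} ml")

ratio = volume_at_fraction(lesion, voxel_ml, 0.15) / volume_at_fraction(lesion, voxel_ml, 0.40)
print("volume ratio, 15 % vs 40 % threshold:", ratio)
```

In a real plan, the rim voxels admitted by a low threshold may reflect physiological or inflammatory uptake rather than tumor, so a sweep of this kind does not replace clinical judgment.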

The use of FDG-PET data for target volume delineation in ISRT and IFRT planning requires mindful use of automatic segmentation methods, particularly in conformal field designs such as ISRT, in which variations in pre-chemotherapy GTVs resulting from different SUVmax thresholds may lead to clinically significant changes. Clinical judgment is still required for the delineation of target volumes, and no segmentation method can reliably distinguish FDG uptake caused by neoplastic processes from uptake caused by physiological or inflammatory processes. The most accurate method for target volume definition in HL remains the manual generation of the volumes by a skilled radiation oncologist, with input from a nuclear medicine physician when needed. This field is evolving, and further robust data are required to determine a reliable segmentation methodology to optimize treatment volumes and dose to normal structures.

4.2.3 New Technology

Magnetic resonance imaging (MRI) using the diffusion-weighted technique (DWI) has been suggested as a useful method in the assessment of lymphoma lesions, particularly those with multiple conglomerate lymph nodes. There is preliminary evidence that the glycolytic rate as measured by FDG-PET and the changes in water compartmentalization and water diffusion as measured by the apparent diffusion coefficient (ADC) on DWI are independent biological phenomena in newly diagnosed DLBCL [173, 174]. In one series, however, there was no significant correlation between ΔSUVmax and ΔADC after initiation of the first cycle of chemotherapy in patients with HL or DLBCL. Thus, these data did not support the replacement of FDG-PET with DW-MRI for response evaluation in lymphoma patients [175]. ADC values were also found to discriminate between indolent and aggressive NHL, and this finding can be useful in assessing possible transformation from indolent to aggressive NHL [176]. There are also pilot data showing that the accuracy of DWI was significantly higher than that of PET/CT in differentiating between malignant and benign mediastinal and hilar lymphadenopathy [177]. In other cohorts, DW-MRI provided results comparable with or complementary to those of PET/CT for staging and early response assessment in DLBCL [178–180].

In summary, the literature is not yet mature enough to definitively prove or refute a diagnostic role for this modality compared to PET imaging in lymphoma patients. Further studies are warranted to assess the complementary roles of these different imaging biomarkers in the evaluation and follow-up of lymphoma.

4.2.4 Conclusions

Quantitative assessment with PET-derived volumes is still evolving, and these preliminary findings suggest that it can be useful in the prediction of clinical outcome and may improve on the predictive value of conventional risk-stratifying systems. However, there is currently significant heterogeneity in the published data on the prognostic value of quantitative PET; thus, these results should be interpreted with caution because of their retrospective design, insufficient representation of risk and stage groups, differences in treatment strategies, and the varying methodologies used to measure MTVs. There is no consensus regarding the optimal quantitative index for assessing the metabolically active disease burden using PET/CT imaging. Hence, the prognostic and predictive value of functional tumor volume remains to be further investigated in standardized, prospective, multicenter studies to determine to what extent these parameters could improve individualized treatment approaches in lymphoma.