Introduction

There is growing interest in the development of practical and inexpensive, yet accurate surrogate measures to predict treatment response and survival both in routine clinical practice and for the development of novel cancer therapeutics. In this regard, imaging studies utilizing molecular or functional quantitative measures play a pivotal role in characterizing neoplastic processes. Being less operator-dependent, these quantitative metrics allow for objective endpoints in multicenter clinical trials. However, to maximize the power of the trial, the standardization of imaging procedures including patient preparation, harmonization of image acquisition, reconstruction and analysis, and validation of these methods is essential to individualize treatment. A reliable quality control program with the inclusion of proper calibration of the scanner, cross-calibration with the dose calibrators and across various scanners is necessary for accurate quantitative measurements to enable an effective management change [1, 2, 3]. In addition to technical factors, FDG uptake is influenced by many different host and tumor-related factors (i.e., serum glucose and insulin levels, tumor glucose utilization, perfusion and hypoxia, inflammatory cell infiltrates in the tumor microenvironment and tumor viable cell fraction) [3].

In this review, in addition to technical and methodological considerations related to PET-derived volumetric metrics, we will review the published data to focus on the potential value of these metrics in the prediction of treatment response and patient prognosis in lymphoma.

Methodological considerations

PET-derived quantitative metrics

Standardized uptake value (SUV)

SUV is frequently used in PET reporting and adopted as a semi-quantitative index for measuring tumor glucose metabolism. It is defined as the ratio of the FDG concentration in a volume of interest (VOI) to the injected dose normalized to the patient’s body weight. Among the various SUV indices that could be reported, the most widely used is the SUVmax, defined as the maximal SUV value in a VOI. Being a single voxel measurement, SUVmax is intrinsically vulnerable to image noise. Repeat tumor SUVmax measurements had a within-patient bias of 5 to 30 % [4]. The SUVmean, an average value within the entire VOI is much less noisy but its value heavily depends on the delineation method for the VOI [5]. Consequently, an alternative index, SUVpeak, measuring the maximal tumor activity in a 1 cm3 VOI in the hottest volume of the tumor has been proposed [3, 5]. The SUVpeak characteristically is less affected by the noise compared to SUVmax and does not require definition of tumor boundaries which is a necessary step for SUVmean. Repeat tumor SUVpeak measurements have a lower within-patient bias (1–11 %) [4] compared to SUVmax. The SUVpeak has been the proposed measurement in the definition of therapy response for PET Response Criteria in Solid Tumors (PERCIST) [6]. Nonetheless, despite being relatively simple, it requires the use of custom software on a dedicated workstation to be calculated.
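
The SUV definition above can be illustrated with a minimal sketch. The function names, units and numerical values are hypothetical and chosen only for illustration; the sketch assumes a tissue density of 1 g/mL and an injected dose that has already been decay-corrected to scan time, neither of which is specified in the text.

```python
def suv_body_weight(concentration_kbq_per_ml, injected_dose_mbq, body_weight_kg):
    """Body-weight-normalized SUV for a single voxel (or a VOI average)."""
    dose_kbq = injected_dose_mbq * 1000.0   # MBq -> kBq
    weight_g = body_weight_kg * 1000.0      # kg -> g (assumes 1 g of tissue ~ 1 mL)
    return concentration_kbq_per_ml / (dose_kbq / weight_g)


def suv_max_and_mean(voi_suvs):
    """SUVmax is the hottest voxel; SUVmean averages every voxel in the VOI."""
    return max(voi_suvs), sum(voi_suvs) / len(voi_suvs)


# Hypothetical VOI: four voxel concentrations in kBq/mL,
# 350 MBq injected into a 70 kg patient.
concentrations = [10.0, 25.0, 40.0, 15.0]
voi = [suv_body_weight(c, 350.0, 70.0) for c in concentrations]
suv_max, suv_mean = suv_max_and_mean(voi)
```

Because SUVmax depends on a single voxel, changing one noisy value in the list changes it directly, whereas SUVmean changes only when the set of voxels included in the VOI changes.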

Metabolic volume measurements

Other proposed PET-derived functional metrics include metabolically active tumor volume (MTV) and total lesion glycolysis (TLG). These concepts were developed in the late 1990s [7] but did not evolve until recently because the necessary software was lacking. These volume-based PET parameters measure metabolic activity in the entire tumor mass and thereby reflect tumor biology. MTV incorporates size-dependent thresholding to determine the metabolically active tumor volume on the basis of the SUVmax obtained within a volume of interest, both for a single lesion and for multiple lesions. TLG is the product of the SUVmean in the defined VOI and the MTV; it combines tumor burden and its metabolic activity. Although well described, the routine application of these parameters is challenging because it demands a considerable amount of time and effort. However, with the recent development of software-based automated VOI assessments, volume-based metabolic parameters have become increasingly available as quantitative PET indices. Although these metrics have the potential to become useful indices for assessing treatment response and survival, they are yet to be standardized and validated before they can be translated into clinical practice [8–10].
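
As a minimal sketch of the two definitions above, MTV and TLG can be computed from the voxels that a segmentation step has already assigned to the tumor. The function name, the voxel volume and the SUV values are hypothetical; how the voxels are selected is deliberately left outside this example.

```python
def mtv_and_tlg(segmented_suvs, voxel_volume_ml):
    """Return (MTV in mL, TLG) for voxels already assigned to the tumor.

    MTV = number of segmented voxels x volume of one voxel
    TLG = SUVmean of the segmented voxels x MTV
    """
    mtv_ml = len(segmented_suvs) * voxel_volume_ml
    suv_mean = sum(segmented_suvs) / len(segmented_suvs)
    return mtv_ml, suv_mean * mtv_ml


# Hypothetical lesion: three segmented voxels of 0.5 mL each.
mtv, tlg = mtv_and_tlg([5.0, 8.0, 10.0], voxel_volume_ml=0.5)
```

For multiple lesions, the same calculation can be applied to the union of all segmented voxels, which is one way to obtain a whole-body value.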

Variability of PET-derived quantitative metrics

All volume measurements are based on SUVs, thus the first prerequisite to perform volumetric analysis is to assure a reliable technique to derive SUVs with respect to physical, technological and biological factors. Several publications describe the impact of different factors on SUV [11]. All quantitative PET measurements including SUVs and volume metrics are affected by attenuation, scattered and random coincidences, and dead time correction. In addition, quantitative parameters are affected by user-defined factors including image acquisition settings, e.g., duration of acquisition, thickness of the slice, acquisition mode (2D vs 3D), reconstruction algorithm, and other instrumentation considerations.

The first prerequisite to minimize SUV variability is the cross-calibration of PET scanners and ancillary instrumentation. Albeit cumbersome, this approach proved effective in increasing accuracy in tracer uptake measurements by 5–10 % [12–17]. This is well below the range of 10–25 % observed even in a controlled environment of a multicenter clinical trial [18]. In addition to scanner calibration, optimization of the image reconstruction algorithm should also be achieved.

The partial volume effect refers both to image blurring due to the finite spatial resolution of the scanner and to voxel sampling. It affects small lesions in particular, and its impact is compounded by tumor heterogeneity [18, 19]. SUVmax and SUVmean measurements in a lesion volume of 5 ml could be underestimated by up to 50 % [1]. Many strategies have been developed in the past to correct for partial volume effects, but none of them has matured enough for daily practice [20]. Only recently have new algorithms been applied directly to reconstruction [21] in modern scanners. It should be emphasized that a small tumor volume does not necessarily imply a small number of cells, since tumors become visible at about 10^5–10^6 malignant cells [22] considering the resolution limits of PET scanners. These recovery algorithms, through the modification of PET scanner resolution, allow for better sensitivity by recovering more than 30 % of the activity in sub-centimeter lesions [21]. In the case of larger tumor volumes, partial volume correction is not expected to improve the predictive values for therapy response [23–25]. Nevertheless, a recent study of head and neck cancers recommended partial volume correction for accurate measurement of PET parameters regardless of tumor size [26].

PET test–retest reproducibility

Despite the significant potential for variability, SUV is widely used in the clinical practice and most publications report SUVs. A recent meta-analysis found that SUVmean had mildly better repeatability than SUVmax with better reproducibility in larger lesions [27]. However, a recent study comparing SUVmax, SUVmean, SUVpeak and TGV found that different SUV definitions yielded a 20 % variation of values for individual tumor response and variation of up to 90 % for a single SUV measure [28].

Methods for volume calculations

Different segmentation techniques for PET-derived volumes have been proposed with a varied complexity and operator manipulation [29, 30]. Hence, comparing the performance of different methods from published data is almost impossible given the variety of algorithms used [30]. To date, there is no consensus on a reproducible, accurate and practical method that should be preferred for tumor segmentation. The existing methodologies are described in the following paragraph.

Manual technique

The manual contouring by an experienced imaging expert is one of the first methods that has been applied and it is still widely used. However, this procedure is cumbersome, and rather time consuming in the case of disseminated disease. This method is technically and economically less demanding but leads to significant interobserver variability.

Thresholding method

The thresholding can be performed using fixed, percentage or adaptive methodologies. The most widely used approach to define a tumor volume is the identification of voxels exceeding a predefined threshold [31, 32].

In this regard, the earliest thresholding method was based on a percentage of the SUV, mainly 40–50 % of the SUVmax [33]. This method was based simply on phantom studies of static spheres and was later adopted by several groups. The principal drawback of this method is that the optimal threshold is influenced by the size of the tumor volume. As an alternative, an absolute SUV threshold can be used for segmentation. However, tumor inhomogeneity and motion artifacts may hinder the application of this approach, which fails to provide good tumor delineation in nearly half of the cases, particularly when describing lesions with a low tumor-to-background ratio (TBR) [34]. Indeed, fixed thresholding techniques take neither the background nor the tumor size into consideration [35]. To address the variability that depends on the background, some authors suggested a threshold value based on tumor-to-background ratios [36, 37]. Subsequently, a more developed system with an iterative technique was introduced to optimize the TBR thresholding approach [37–40]. The rationale is to change the TBR threshold iteratively until the algorithm converges on an optimal threshold.
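
The three thresholding strategies described above can be contrasted in a short sketch. The iterative function is a simplified illustration of a background-adapted threshold that is updated until it stops changing; it is not the specific algorithm of any of the cited studies, and the function names, the contrast fraction of 0.5, the background value and the voxel values are all assumptions made for this example.

```python
def fixed_threshold(suvs, cutoff=2.5):
    """Absolute threshold: keep voxels at or above a fixed SUV."""
    return [s for s in suvs if s >= cutoff]


def percent_threshold(suvs, fraction=0.41):
    """Relative threshold: keep voxels at or above a fraction of SUVmax."""
    cutoff = fraction * max(suvs)
    return [s for s in suvs if s >= cutoff]


def iterative_threshold(suvs, background, fraction=0.5, tol=1e-3, max_iter=50):
    """Background-adapted threshold, refined until it converges.

    threshold = background + fraction * (mean of current segment - background)
    Assumes max(suvs) >= background and 0 < fraction <= 1.
    """
    threshold = background + fraction * (max(suvs) - background)
    for _ in range(max_iter):
        inside = [s for s in suvs if s >= threshold]
        mean_inside = sum(inside) / len(inside)
        updated = background + fraction * (mean_inside - background)
        if abs(updated - threshold) < tol:
            break
        threshold = updated
    return threshold


# Hypothetical voxel SUVs: a hot lesion on a background of about 1.0.
suvs = [1.0, 1.2, 0.9, 4.0, 6.0, 8.0, 7.0, 1.1]
converged = iterative_threshold(suvs, background=1.0)
```

In this toy example all three methods select the same four voxels; with a lower tumor-to-background ratio or a smaller lesion the selections would be expected to diverge, which is the behavior discussed in the text.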

In general, application of the proper threshold technique is a challenging task because of the limited resolution of PET images. Blurring due to partial volume effect or motion artifacts and noise fluctuations due to limited photon counts can degrade segmentation accuracy.

Gradient technique

This technique measures gradient differences between the lesion and the surrounding background with good spatial accuracy and efficiency [41, 42]. Gradient methodology includes simple edge or ridge detectors [43] and the watershed method [44]. More recently, deformable active contour models have been applied to PET segmentation with the assumption that contours are characterized by sharp variations in image intensity [45, 46]. Despite being intuitive, the gradient technique suffers considerably from image noise and often requires filtering of the images, which has a blurring effect [47].
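
The idea of a simple edge detector can be sketched on a one-dimensional intensity profile drawn through a lesion. This is only an illustration of locating a boundary at the steepest intensity change using central differences; it is not the watershed or active contour methods cited above, and the function name and the profile values are hypothetical.

```python
def edge_positions(profile):
    """Return (rising_edge_index, falling_edge_index) along a 1D profile.

    The gradient at interior sample i is the central difference
    (profile[i + 1] - profile[i - 1]) / 2. The rising edge is placed at the
    largest positive gradient, the falling edge at the most negative one.
    """
    grad = [(profile[i + 1] - profile[i - 1]) / 2.0
            for i in range(1, len(profile) - 1)]
    # grad[k] belongs to sample k + 1 of the original profile.
    rising = max(range(len(grad)), key=lambda k: grad[k]) + 1
    falling = min(range(len(grad)), key=lambda k: grad[k]) + 1
    return rising, falling


# Hypothetical noise-free SUV profile crossing a lesion.
profile = [1, 1, 1, 2, 5, 8, 8, 8, 5, 2, 1, 1]
rising, falling = edge_positions(profile)
```

Adding random noise to the profile can move the detected edges, which is why smoothing is usually applied first, at the cost of additional blurring.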

There are also other evolving methods, including stochastic modeling [48] and those that use a pattern recognition algorithm [49].

Comparison among methods

The variability of the thresholding methods leads to significant flaws with regard to accuracy and reproducibility and, consequently, to misleading results. Reproducibility is a key issue associated with segmentation methods, and evaluating accuracy is more difficult because it is virtually impossible to have a ground truth. Studies have been proposed using phantoms, morphological images (CT or MRI) and pathology specimens. The latter suffer from the fact that tissues are deformed during and after surgical excision [48]. Using phantom studies, the variability between PET segmentation methods in mean tumor volumes was 40–46 % [10, 29]. Zaidi et al. compared ten PET segmentation methods in in vivo tumors and found a variation in mean tumor volumes of up to 400 % [30].

The above investigators and several guidelines concluded that the method used to determine the SUV and tumor volume greatly affects the reliability of the estimates and a new generation of automated contouring technique is an unfulfilled need to advance this field [50, 51].

Clinical applications in lymphoma

The need for development of PET-derived volume metrics

Although visual assessment of FDG-PET/CT data has been successfully implemented for monitoring therapy, a significant percentage of false-positive results has raised concerns about its usefulness for interim PET-adapted strategies [52–54]. The inability of the visual criteria to accurately differentiate between different response categories can be addressed with a quantification approach. Hence, various quantitative PET metrics have been proposed in an effort to decrease the rate of false-positive results and increase interpretation reproducibility. The SUVmax has been the most widely used quantitative measure in prior studies for assessing tumor metabolic activity, although these results have never been prospectively validated.

Early studies with small patient samples suggested promising results using either a cutoff or an absolute value for SUVs after several cycles in mixed groups of non-Hodgkin lymphoma (NHL) and Hodgkin lymphoma (HL) patients for the early assessment of therapy response [55–57]. Subsequent studies carried out in relatively homogeneous and larger groups of advanced-stage diffuse large B-cell lymphoma (DLBCL) patients, treated with CHOP or CHOP-like regimens, demonstrated that a decrease in SUV of 66 % after 2 cycles [58–60] and 70 % after 4 cycles of chemotherapy [59, 61] was a more powerful prognostic tool than visual evaluation in the prediction of event-free survival (EFS). In pediatric patients, SUV-based response assessment after two cycles of therapy improved the specificity of post-2-cycle response assessment by 30 % when a ΔSUVmax cutoff of 58 % was used [62, 63]. Contrary to these results, however, multiple other studies did not confirm the high predictive power of PET status early during therapy [52–54].
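
The percent-reduction criterion described above can be written as a short calculation. The function names and the SUVmax values are hypothetical; whether a reduction exactly equal to the cutoff counts as a response is not stated in the text, so the strict inequality used here is an assumption.

```python
def delta_suvmax_percent(baseline_suvmax, interim_suvmax):
    """Percent reduction in SUVmax from baseline to interim PET."""
    return 100.0 * (baseline_suvmax - interim_suvmax) / baseline_suvmax


def is_metabolic_responder(baseline_suvmax, interim_suvmax, cutoff_percent=66.0):
    """True when the reduction exceeds the cutoff (66 % after 2 cycles,
    70 % after 4 cycles in the DLBCL studies cited above)."""
    return delta_suvmax_percent(baseline_suvmax, interim_suvmax) > cutoff_percent


# Hypothetical patients with a baseline SUVmax of 20.
good = is_metabolic_responder(20.0, 5.0)    # 75 % reduction
poor = is_metabolic_responder(20.0, 8.0)    # 60 % reduction
```

The same patient can fall on different sides of the criterion depending on the cutoff and the time point, which is one reason the cutoffs are reported per cycle.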

Quantitative analysis of tumor response is an evolving endeavor and requires further improvement of methodology and application of these methods to large patient data for validation purposes. SUVs rely on one measurement in a volume of interest, which can be biased by variability. More importantly, the use of SUV does not take into consideration the extent of involved nodal and/or extra-nodal sites as a measure of biologic function. In addition, they do not reflect metabolic activity within the entire tumor volume, which would be a more desirable end-point to address tumor heterogeneity. Therefore, in an effort to determine the total tumor volume to improve on the predictive value and reproducibility of PET results, quantitative techniques beyond SUVs have been proposed. These parameters include tumor functional volume parameters, i.e., MTV and TLG.

The attendant high reproducibility with volume-derived indices may impart a better set of quantitative tools than absolute or percent change in SUVs [1, 8, 10]. With the continued improvement in sophisticated software programs, PET-derived volumes may enable better-orchestrated criteria for measuring entire viable tumor burden as a surrogate for prediction of treatment response and prognosis.

Considerations for quantitative studies

For all quantitative approaches using PET technology, it is imperative to adopt standardized acquisition, reconstruction and analysis protocols for the use of quantitative PET metrics. In addition to technical factors, FDG uptake is influenced by many different host and tumor-related factors including extent of glucose utilization by the tumor, tumor viable cell fraction, patient weight, serum glucose and insulin levels, variations in the uptake period, tumor perfusion and hypoxia, and inflammatory cells present in the tumor microenvironment [3].

Methodologically, as alluded to in the “Methodological considerations” section, there are ongoing challenges associated with tumor segmentation algorithms [9, 10].

Another issue to be emphasized is that the cutoff values used as prognostic indicators may differ between early response definitions during therapy and later time points. It is also conceivable that the quantitative PET results obtained for DLBCL [58, 59, 64–71] are not applicable to HL [72–76] or other lymphoma subtypes [77] because of different disease characteristics.

Clinical studies: overall analysis

Multiple retrospective studies reported promising results for functional tumor volume parameters with respect to response assessment in lymphoma [64–77]. The majority of these studies involved DLBCL patients [64–71], and some reported early results in HL patients [72–76]. This may reflect a greater need for risk stratification in DLBCL than in HL because the latter is more curable.

A systematic review of 7 studies with a total of 703 DLBCL patients suggested that SUVmax and MTV may be significant prognostic markers for progression-free survival (PFS) and that MTV may be the only predictor of overall survival (OS) in DLBCL [78]. There was a strong association between high SUVmax, MTV and TLG values and unfavorable 3-year PFS, with odds ratios (ORs) of 2.6, 3.7 and 2.3, respectively. Although the pooled results showed that high SUVmax and MTV were negative predictors of PFS, and that only high MTV was a predictor of poor OS in DLBCL, the hazard ratios (HRs) did not exceed 3.0. Collectively, the published studies in this field were marred by an obvious lack of methodological strength, including mixed risk group distributions and the use of varying techniques to derive PET volumes, which resulted in ungeneralizable data. These studies also suffered from insufficient numbers of patients and disease-related events and from nonuniform therapy protocols.

Overall, the SUVmax is still considered the standard parameter for quantitative PET assessment. The PET-derived volume metrics are yet to show a clear advantage over SUVmax in early response assessment in trials designed with adequate power for monitoring treatment response and predicting survival.

Clinical studies: detailed analysis

Baseline evaluation of tumor bulk

Tumor bulk is an important prognostic factor, particularly, in early stage HL [79]. But despite its long-term use in clinical practice, definition of tumor bulk has been the subject of ongoing debates. Various risk stratification systems use indirect measures of tumor burden. In this regard, the Ann Arbor staging system uses the extent of disease, international prognostic score (IPS) uses stage and other prognostic systems in lymphoma integrated number of disease sites (either extra-nodal or nodal) as well as stage and LDH, to stratify risk categories for an individualized management. If the role of a baseline volume-derived measure is proven as a prognostic factor, identification of patients at high risk of treatment failure may be feasible early for treatment intensification. Gobbi et al. used contrast-enhanced CT to determine the overall tumor burden by measuring all nodal and extra-nodal disease sites as a predictor of treatment outcome [80, 81]. In HL patients treated on standard protocols, the mean tumor burden normalized to body surface area (rTB) was the best predictor of complete remission and survival. The rTB was largely superior to all prognostic models. For the same stage and treatment, patients who had a relapse had a 60–108 % initial rTB compared with those who had achieved a cure [81]. However, functional information afforded by PET imaging has proved to be an effective tool in detecting early tumor response preceding a detectable anatomical change. Therefore, metabolic volume determination may be a better surrogate for response and survival by representing overall tumor functionality. Therefore, baseline PET-derived quantitative metabolic parameters may benefit imagers to improve on the visual interpretation.

There is a limited number of studies addressing the complementary value of quantitative approaches using baseline PET-derived metrics to optimally determine the tumor bulk. In a retrospective study of 169 patients with stage II–III (74 % IPI 0–2) de novo DLBCL, imaged prior to R-CHOP therapy (6–8 cycles), Song et al. reported that the total tumor burden was a more important prognostic parameter than Ann Arbor stage in a multivariate analysis [64]. MTV was defined with an SUV ≥ 2.5 as the contouring border. The high MTV group had lower PFS and OS, regardless of stage, compared with the low MTV group (p < 0.001) during a median follow-up of 36 months. Three-year PFS and OS were significantly higher in the low MTV group (< 220 cm3) than in the high MTV group (PFS 90 vs. 56 %, OS 93 vs. 58 %, both p < 0.001). Multivariate analysis revealed that high MTV was an independent factor for the prediction of an unfavorable outcome (PFS, HR = 5.3; OS, HR = 7.0, both p < 0.001), whereas stage III had no significant impact on survival [64]. Using a similar design, the same group of investigators found similar results in 165 patients with stage IE–IIE (71 % IPI 0–2) primary gastrointestinal DLBCL treated with R-CHOP, or surgery plus R-CHOP [65]. During a median follow-up of 37 months, the low IPI group had a longer PFS and OS than the high IPI group (81 and 86 %, respectively, vs. 58 and 60.5 %, respectively). MTV (160 cm3) was a better predictor of survival than SUVmax (12.0) as determined by receiver operating characteristic (ROC) analysis (0.92 vs. 0.70). Multivariate analysis revealed that a high IPI score (p = 0.001), high MTV (p < 0.001), and surgical resection followed by R-CHOP therapy (p < 0.001) were independent prognostic factors for both PFS and OS, whereas other known prognostic factors were not significant.

Kim et al. reported that the higher MTV group showed a significantly inferior EFS compared with the lower MTV group during a median follow-up of 28 months in stage II–III DLBCL patients (n = 53) who were treated with R-CHOP. In this study, MTV was defined with a fixed SUV threshold of 2.5. There was no difference in EFS between patients with Ann Arbor stage II and those with stage III [68].

In a study of 140 DLBCL patients who received R-CHOP therapy plus 36 Gy of radiotherapy for bulky disease, after a median follow-up of 28.5 months, the TLG using a segmentation threshold of 50 % was independently associated with PFS and OS (HR = 4.4; 95 % CI = 1.5–13.1; p = 0.008 for PFS and HR = 3.1; 95 % CI = 1.0–9.6; p = 0.049 for OS) [68]. The Ann Arbor stage and IPI score were not significantly associated with survival. High TLG values (> 415.5) were associated with reduced survival compared with low TLG values (≤ 415.5), with a 2-year PFS of 73 vs. 92 % (p = 0.007) and a 2-year OS of 81 vs. 93 % (p = 0.031).

Similar to prior results, Sasanelli et al., using a 41 % SUVmax threshold, found that MTV was the only independent predictor of OS and, to a lesser extent, of PFS (p = 0.002) compared with other pre-therapy indices of tumor burden, in a retrospective study of 114 DLBCL patients [69] enrolled in the previously reported International Validation Study [70]. The 3-year estimates of PFS were 77 % in the low metabolic burden group (MTV ≤ 550 cm3) and 60 % in the high metabolic burden group (MTV > 550 cm3, p = 0.04), and prediction of OS was even better (87 vs. 60 %, p = 0.0003). However, this study is limited not only by its retrospective design, the absence of protocol harmonization across participating centers or cross-calibration of scanners, and the variability of therapy protocols, but also by the lack of a comparative analysis between volumetric results and SUVs. Using the same patient data, the same investigators advocated that a 66 % ΔSUVmax cutoff yielded better predictive values for the 3-year PFS estimate (44 % in PET2 positive vs. 79 % in PET negative) than visual interpretation using the Deauville 5-point score (D 5PS) [82] (59 % in PET2 positive vs. 81 % in PET negative) [83].

In primary mediastinal (thymic) large B-cell lymphoma (PMBCL), retrospective evaluation of PET data in a multicenter study suggested that baseline TLG was a predictor of outcomes in a prospectively enrolled cohort of 103 patients who received combination chemoimmunotherapy with consolidation radiotherapy (~90 %) [77]. The MTV was estimated using a threshold method based on 25 % of the SUVmax, which was lower than the 41 % proposed by Boellaard et al. [3]. In multivariate analysis, only TLG retained statistical significance for both OS (p = 0.001) and PFS (p < 0.001). At 5 years, OS was 100 % for patients with low TLG vs. 80 % for those with high TLG (p = 0.0001), while PFS was 99 vs. 64 %, respectively (p < 0.0001). Nonetheless, this was a retrospective evaluation of data from non-harmonized PET cameras in a group of 21 centers using various scanners. In addition, despite a p < 0.0001, the HR for TLG was only 1.36 for increments of 10^3. It is also unknown how these results compared with D 5PS.

With the use of a fixed threshold method (SUV ≥ 2.5), in a retrospective study of 127 early stage HL patients who were treated with six cycles of ABVD, with or without involved-field radiotherapy to bulky disease, the cutoff MTV value was 198 cm3 as determined by ROC analysis [72]. In the multivariate analysis, only the presence of B symptoms and high MTV status were independently associated with PFS and OS (PFS, p = 0.008; OS, p = 0.007). Survival in the high MTV group was lower than in the low MTV group (PFS, p < 0.012; OS, p < 0.045) [72]. Both studies used a fixed threshold method, which is inherently not optimal for volumetric assessment, as discussed in the previous technical section.

In another retrospective single-center study of a cohort of 59 HL patients (92 % stage II–IV, 61 % IPS > 2) who were treated with an anthracycline-based therapy with or without IFRT (20–36 Gy), the MTV was measured with a semiautomatic method using a 41 % SUVmax threshold [73]. The baseline MTV was predictive of patient outcomes; patients with a low MTV had a significantly better 4-year PFS than those with an MTV0 > 225 ml (85 vs. 42 %, p = 0.001, and 88 vs. 45 %, p = 0.0015, respectively) [73]. In multivariate analysis, only MTV (p < 0.006, RR 4.4) and ΔSUVmaxPET0-2 (p = 0.0005, RR 6.3) remained independent predictors of PFS (discussed further in the next section). Tumor bulk (diameter ≥ 10 cm) did not reach significance as a predictor of PFS.

Several studies have reported results that contradict the prior studies [71, 76]. Gallicchio et al. suggested that the baseline SUVmax was a better predictor of EFS than MTV and TLG in a study of 52 DLBCL patients with IPI scores of 1–3 who were treated with R-CHOP [71]. The metabolic volume was determined with a 42 % threshold. After a median follow-up of 18 months, MTV and TLG were not associated with EFS, but patients presenting with a higher SUVmax showed significantly better EFS (p = 0.0002, HR 0.13), indicating that the higher the magnitude of glycolytic activity, the better the response to subsequent chemotherapy. Only an IPI score of 3 was slightly but significantly associated with poor outcome. It is conceivable that patients with an intermediate IPI score presenting with a high SUVmax would respond better, since the magnitude of glycolytic activity rather than the amount of metabolically active burden appears to be the key determinant. It has been previously reported that, despite a high tumor burden, patients exhibiting high metabolic activity at baseline PET usually respond rapidly with a complete resolution of metabolic activity irrespective of a large residual mass [84]. It is also possible that the methodological imperfections of the prior studies have compounded the problems associated with the thresholded components of the tumors, amplifying their intrinsic drawbacks when accurate lesion/background discrimination is required. However, the sample size of this study is too small to derive any conclusions from these results. Tseng et al. reported similar results in 30 HL patients of all stages treated with varying chemotherapy regimens with or without radiation therapy. At a median follow-up of 50 months, baseline absolute PET parameters did not predict survival, while the ΔMTV and ΔSUV at interim PET performed 55 days after therapy initiation were associated with PFS and OS (further reviewed in the “tumor bulk” section) [76].

When analyzed collectively, there were three trials using the old IPI scoring system for risk stratification of patients [64, 66, 68], one trial using the age-adjusted IPI scoring system [69] and the others using the recently proposed National Comprehensive Cancer Network International Prognostic Index (NCCN-IPI) scoring system [70]. Kim et al. showed that a high IPI score (≥ 3) significantly reduced OS (2-year OS of 79 vs. 90 %, p = 0.049). Ann Arbor stage III/IV adversely influenced PFS (p = 0.013) [68]. However, high IPI score and Ann Arbor stage III/IV did not significantly shorten PFS (p = 0.200) and OS (p = 0.921). The studies varied widely in the optimal cutoff values for survival prediction, with the cutoff values ranging from 11 to 30 for SUVmax, from 220 to 550 ml for MTV and from 415.5 to 2955.4 for TLG.

Quantitative assessment with MTV is still evolving, and these preliminary findings suggest that it can be potentially useful in the prediction of clinical outcome and may prove superior to the Ann Arbor staging system in DLBCL patients treated with R-CHOP. The results of the previous studies should be interpreted with caution because of their limited retrospective design, insufficient representation of risk and stage groups, differences in treatment strategies, and the varying methodologies used to measure MTV. Hence, well-designed, prospective studies in large patient populations including all nodal stages and extra-nodal sites are warranted.

Treatment response evaluation during therapy

The existing preliminary data are in line with the hypothesis that percent MTV decrease predicts survival accurately in lymphoma patients and there is a tendency for metabolic volumes to perform better than SUVmax.

HL

The outcome of HL has significantly improved in the past twenty years, however, not without adverse events associated with high cure rates. Hence the treatment objective of HL is to achieve a cure with minimal toxicity. A risk-adapted approach can help this goal by identifying high-risk patients with a poor prognostic outcome for an appropriate treatment policy. Quantitative metrics derived from PET can improve the robustness of response assessment for therapy adaptation. There are several studies designed to address this objective.

The study by Kanoun et al. revealed that both baseline MTV and ΔSUVmax at PET2 are independent predictors of PFS (discussed further in the previous section) [73]. The combination of MTV and ΔSUVmax made it possible to identify three subsets of HL patients with different outcomes in terms of PFS (p < 0.0001). These were: (1) ΔSUVmax > 71 % and MTV ≤ 225 ml; (2) ΔSUVmax < 71 % or MTV > 225 ml; and (3) ΔSUVmax ≤ 71 % and MTV > 225 ml. In these three groups, the 4-year PFS rates were 92, 49 and 20 % (p < 0.0001), respectively. In another retrospective study of 30 HL patients with either early or advanced-stage disease treated with varying chemotherapy regimens with or without radiation therapy, at a median follow-up of 50 months, baseline PET parameters did not predict survival (reviewed in the “tumor bulk” section), while the % change in PET parameters between baseline and interim for MTV (p = 0.01), TLG (p < 0.01) and SUVmax (p = 0.02) was associated with PFS [76]. IPS was also associated with PFS (p < 0.05). These results suggest that the chemosensitivity of the tumor as measured by PET early during treatment is more predictive of clinical outcome than the initial tumor bulk, which gives further credence to prior validation studies [85, 86]. However, because relapsed patients and various chemotherapy regimens, including intensive treatments, were included, these data do not allow definitive conclusions.
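
The three-group stratification attributed to Kanoun et al. can be sketched by counting adverse factors. The function name is hypothetical, and reading the intermediate group as patients with exactly one of the two adverse factors (ΔSUVmax ≤ 71 % or MTV > 225 ml) is an interpretation of the text above, not a statement of the authors' published rule.

```python
def hl_risk_group(delta_suvmax_percent, baseline_mtv_ml,
                  delta_cutoff=71.0, mtv_cutoff_ml=225.0):
    """Return 0, 1 or 2 adverse factors for one patient.

    Adverse factor 1: ΔSUVmax at PET2 at or below the cutoff (poor reduction).
    Adverse factor 2: baseline MTV above the cutoff (high metabolic burden).
    """
    poor_reduction = delta_suvmax_percent <= delta_cutoff
    high_burden = baseline_mtv_ml > mtv_cutoff_ml
    return int(poor_reduction) + int(high_burden)


# 4-year PFS reported in the text for 0, 1 and 2 adverse factors.
reported_pfs_percent = {0: 92, 1: 49, 2: 20}

# Hypothetical patient: 60 % SUVmax reduction, baseline MTV of 300 ml.
group = hl_risk_group(60.0, 300.0)
```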

These quantitative PET results were also investigated in pediatric patients [74, 75]. In a recent study by Hussien et al., in a cohort of 54 pediatric HL patients treated on treatment optimization protocols, the volumes were determined with a fixed threshold of 2.5 SUV and at a threshold of mean liver plus two standard deviations SUV (corrected for lean body mass) [75]. All quantitative PET measures (SUVmax, SUVmean, MTV and TGV) performed significantly better than the qualitative response assessment using D 5PS at interim PET decreasing false-negative results. Δ SUVmax revealed the best results (area under the curve, 0.92; p < 0.001) but was not significantly superior to SUVmax estimation at PET2 and ΔTLGmax in the prediction of response. However, sophisticated volumetric PET measures did not perform significantly better than the previously proposed ΔSUVmax in early response assessment. All analytical strategies failed to improve the impaired positive predictive value to a clinically acceptable level while preserving the excellent negative predictive value. In this study, technical parameters were better controlled than other studies, all PET scanners were cross-calibrated and scan protocols followed EANM guidelines [1, 3]. All volumetric and SUV measurements performed significantly better than baseline metrics (discussed in the subsequent section) [75] in pediatric HL patients.

DLBCL

Despite the improvements in survival in the rituximab era, approximately one-third of patients presenting with advanced-stage DLBCL still relapse or have refractory disease. It is important to note that the majority of patients with relapsed disease die of lymphoma. Therefore, quantitative PET metrics may help develop a better risk stratification system to prognosticate and to predict relapse and refractory disease early during therapy, so that novel therapeutic approaches can be implemented beyond standard therapy with rituximab, cyclophosphamide, doxorubicin, vincristine and prednisolone (R-CHOP).

Esfahani et al. studied 20 DLBCL patients from a single institution who were treated with R-CHOP and underwent PET at baseline and after two cycles of standard chemotherapy [66]. The volumes were defined as all voxels inside a lesion with an SUV greater than 1.5 × liver SUVmean plus 2.5 standard deviations of liver SUV [6]. At interim PET, SUVmax (cutoff: 2.3), SUVmean (cutoff: 2.1) and TLG (cutoff: 96.5) were significant predictors of disease progression. However, MTV could not significantly predict relapse.

Contrary to the aforementioned results, in a single-center study of 73 newly diagnosed DLBCL patients, Adams et al. found that the NCCN-IPI [87] was the most important prognostic tool for PFS (p = 0.024) and OS (p = 0.039) compared to PET-derived metrics including SUVmax, MTV and TLG [70]. In this retrospective study, volumes were delineated by a single expert using a threshold of 40 % of the SUVmax. Median values of SUVmax, MTV and TLG were used as cutoff values for group discrimination. These markedly different results, compared to prior studies, might have stemmed from methodological differences, different patient populations, the shortcomings of previous studies related to the use of non-cross-calibrated scanners, and the overestimation of the prognostic value of MTV and TLG through the use of a retrospective cutoff value in ROC analysis.
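The relative-threshold delineation and the median-based grouping described for this study can be sketched as follows. This is an assumption-laden illustration, not the authors' workflow: the lesion is represented as a flat list of voxel SUVs with a uniform, hypothetical voxel volume, voxels at or above 40 % of the lesion's own SUVmax are counted, and patients with a value strictly above the cohort median are labeled "high". Function names and data are invented.

```python
from statistics import median


def relative_threshold_mtv(lesion_suvs, voxel_volume_ml, fraction=0.40):
    """MTV (ml) of voxels at or above a fraction of the lesion's own SUVmax."""
    cutoff = fraction * max(lesion_suvs)
    return sum(1 for s in lesion_suvs if s >= cutoff) * voxel_volume_ml


def median_split(values):
    """Label each value 'high' if above the cohort median, otherwise 'low'."""
    cut = median(values)
    return ["high" if v > cut else "low" for v in values]


# Invented lesion: SUVmax is 10.0, so the 40 % cutoff is an SUV of 4.0.
mtv = relative_threshold_mtv([2.0, 3.0, 5.0, 10.0], voxel_volume_ml=0.5)

# Invented cohort of per-patient MTV values (ml).
groups = median_split([10, 50, 200, 400, 90])
print(mtv, groups)
```

Unlike a cutoff chosen retrospectively by ROC analysis, a median split is fixed by the cohort distribution rather than by the outcome, which is one reason results obtained with the two approaches are not directly comparable.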

There is a paucity of data on the evaluation of patient outcomes using volumetric PET metrics in other subtypes of NHL. A recent article suggested that the use of PET-derived volumes segmented with a fixed threshold of SUVmax 3.0 together with the IPI score might be useful for detailed prediction of prognosis in extra-nodal natural killer T cell lymphoma patients [88]. Despite a lower IPI score, patients with high PET volume values might be considered candidates for aggressive therapy to improve clinical outcomes.

In relapsed or refractory DLBCL, in a multicenter clinical trial of 55 patients treated with bendamustine–rituximab, Tateishi et al. demonstrated that the percent change in TLG can be used to quantify the response to treatment and, when measured after the last treatment cycle, can predict PFS [89]. In this study, scanners were cross-calibrated using a NEMA/IEC image quality phantom. MTV was calculated with a fixed threshold of SUVmax > 2.5. The percentage change in all PET parameters except for the area under the curve of the cumulative SUV–volume histogram was significantly greater in complete responders than in non-complete responders after two cycles and after the last cycle. The Cox proportional hazard model and best subset selection method revealed that the percentage change of the sum of total lesion glycolysis after the last cycle (relative risk, 5.24; p = 0.003) was an independent predictor of PFS.
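The percentage change of the sum of TLG can be illustrated with a short Python sketch. It assumes that per-lesion TLG values are summed at each time point and that the change is reported as a signed percentage of the baseline sum (negative values indicate a decrease); the cited trial may have used a different sign convention. The function name and the lesion values are invented.

```python
def percent_change(baseline, follow_up):
    """Signed percent change from baseline (negative = decrease)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (follow_up - baseline) / baseline


# Invented per-lesion TLG values for one patient.
baseline_tlg = [120.0, 60.0, 20.0]
final_tlg = [30.0, 10.0, 0.0]  # a resolved lesion contributes 0

change = percent_change(sum(baseline_tlg), sum(final_tlg))
print(change)
```

Summing over lesions before computing the change means that a large residual lesion dominates the result, whereas averaging per-lesion changes would weight small and large lesions equally.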

Conclusions

Not surprisingly, there is great variability in the reported prognostic values of PET-derived metabolic volumes, given the variability of quantitative methodology. In the near future, the development of more sophisticated and robust tools for PET segmentation should help physicians to use these quantitative methods with higher precision and accuracy. The heterogeneity of the published series should be further minimized by pre-trial scanner cross-calibration and the use of standardization protocols for patient preparation, timing, scanning, image generation, and analysis. Currently, there is no consensus regarding the optimal quantitative index to assess the metabolically active disease burden using FDG-PET/CT imaging. The shared deficiencies of the studies published thus far include retrospective design; an insufficient number of patients and/or disease-related events; heterogeneous patient populations consisting of all stages and risk categories; varying therapy schemes, length of therapy, involvement of RT and doses of RT; scanners of different generations and designs; lack of cross-calibration between scanners; and the use of non-validated and non-optimized methodologies for volume derivation. Therefore, the prognostic and predictive value of functional tumor volume remains to be further investigated in standardized, prospective, multicenter studies to determine to what extent these parameters could improve the individualized treatment approach in lymphoma.