Introduction

Primary mediastinal (thymic) large B cell lymphoma (PMBCL) is a distinct entity arising from B cells within the thymus [1,2,3,4,5]. It usually presents with a single, large anterior mediastinal mass, often with local invasion and compressive syndromes [6]. PMBCL commonly affects young adults, and treatment with aggressive immunochemotherapy regimens, with or without radiotherapy, appears to yield good outcomes, with 5-year survival rates of over 90% in recent studies [7, 8]. However, salvage treatment for the few patients in whom initial therapy fails generally has a poor outcome [9, 10]. Since early identification of the minority of high-risk patients would allow the development of risk-stratified approaches, there is an unmet need for reliable prognostic markers.

There is increasing evidence of the prognostic value of quantitative parameters obtained from 18FDG-PET/CT imaging in patients with aggressive non-Hodgkin lymphoma [11,12,13]. Standardized uptake value (SUV) has previously been the most widely studied parameter [14,15,16]. More recently, there has been growing recognition of volume-based metabolic assessment, including metabolic tumour volume (MTV) and total lesion glycolysis (TLG; defined as the product of MTV and SUVmean), as promising prognostic indices in solid tumours [17], multiple myeloma [18] and malignant lymphomas [19,20,21,22,23,24,25,26,27,28]. A number of different methods for the calculation of volume-based PET parameters have been proposed. In lymphomas, the 3D isocontour method using a fixed threshold as cut-off has been widely applied, but the definition of the best cut-off is still a matter of debate.

Meignan et al. tested a fixed threshold of 41% SUVmax, demonstrating good correlation between the segmented volumes and actual dimensions of the lesions in a phantom model, simulating the characteristics of the more common disseminated lymphomas [29]. This cut-off, recommended by EANM guidelines to segment tumour volume [30], has been validated by the same group as prognostic in Hodgkin lymphoma and diffuse large B cell lymphoma (DLBCL) [29, 31]. It has also been adopted for the study of other subtypes, including follicular lymphoma and peripheral T cell lymphoma [27].

More recently, Ilyas et al. compared different methods of MTV calculation in DLBCL (including the fixed threshold of 41% of SUVmax), showing that an SUV threshold of 2.5 achieved the best inter-observer agreement and was easiest to apply, although all methods predicted progression-free survival (PFS) and overall survival (OS) with similar accuracy [32].

Although these studies included patients with bulky disease, ranging from 9 to 40% of cases, the results obtained in disseminated lymphomas cannot be directly applied to PMBCL, where the tumour burden is characterized in >50% of patients by a single mass with diameter >10 cm and in >80% by a lesion with diameter >7 cm [8]. We have previously shown that a threshold at 25% performs better than the standard approaches for segmentation of the bulky mediastinal lesions in PMBCL and that MTV and TLG, estimated with this threshold, are powerful predictors of clinical outcome [33].

Therefore, in this study we compared MTV estimations obtained using three standard fixed threshold (FT) segmentation algorithms (differing only in the applied cut-off) and a contrast-based automatic region growing (RG) algorithm. We used an anthropomorphic phantom simulating the particular clinical presentation of PMBCL to investigate how large, very FDG-avid masses may affect the performance of these different methods. We then tested the same methods in a cohort of PMBCL patients to assess their performance in real mediastinal lesions and to compare their accuracy in predicting PFS and OS in this type of lymphoma.

Materials and methods

Phantom

To simulate the characteristics of the mediastinal lesions found in patients with PMBCL, we assembled a simplified anthropomorphic phantom reproducing the main features of the disease (Fig. 1). The morpho-metabolic characteristics of the mediastinal lesions and the other intra-thoracic structures in the phantoms were derived from those estimated in 103 PMBCL patients enrolled in the IELSG-26 study (NCT00944567) [8]. In this cohort, the MTV was calculated with a cut-off of 25% of the SUVmax [33] and the intra-tumour heterogeneity of tracer uptake was expressed as the area under the curve (AUC) of the cumulative SUV volume histograms (CSH) (AUC-CSH) [34].
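As an illustration of the heterogeneity index mentioned above, the following is a minimal sketch of one common way to compute an AUC-CSH, assuming the cumulative histogram plots the fraction of lesion volume with uptake at or above each threshold expressed as a fraction of SUVmax. The function name `auc_csh`, the number of threshold steps and the synthetic inputs are illustrative assumptions; the exact implementation used in [34] may differ.

```python
import numpy as np

def auc_csh(suv_values, n_steps=101):
    """Area under the cumulative SUV-volume histogram (AUC-CSH).

    suv_values: 1-D array of SUVs for the voxels inside the segmented lesion
    (assumed positive). Lower results indicate more heterogeneous uptake.
    """
    suv = np.asarray(suv_values, dtype=float)
    suv_max = suv.max()
    # Thresholds expressed as a fraction of SUVmax, from 0 to 1.
    thresholds = np.linspace(0.0, 1.0, n_steps)
    # Fraction of the lesion volume with SUV at or above each threshold.
    volume_fraction = np.array([(suv >= t * suv_max).mean() for t in thresholds])
    # Trapezoidal rule written out explicitly.
    widths = np.diff(thresholds)
    heights = (volume_fraction[1:] + volume_fraction[:-1]) / 2.0
    return float(np.sum(widths * heights))

# Synthetic inputs (not patient or phantom data).
homogeneous = np.full(1000, 10.0)              # every voxel has the same SUV
heterogeneous = np.linspace(1.0, 10.0, 1000)   # uptake spread evenly up to SUVmax
auc_homogeneous = auc_csh(homogeneous)
auc_heterogeneous = auc_csh(heterogeneous)
```

Under this formulation a perfectly uniform lesion gives an AUC of 1, and the value falls as uptake becomes more unevenly distributed, which is consistent with lower AUC-CSH values denoting greater heterogeneity in the phantom series.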

Fig. 1

Characteristics of the assembled anthropomorphic phantom. A Scheme of the anthropomorphic phantom (regular shape and homogeneous uptake settings of the phantom lesion) with mediastinal lesion [1], cardiac insert [2] and mediastinal great vessels insert [3]. B PET-CT image volumetric reconstruction of the phantom and fusion. C Phantom with lesion insert simulating an irregularly shaped mediastinal mass. D Insert assembled to simulate mediastinal mass with inhomogeneous intra-lesion tracer distribution

Four series of phantoms were assembled with variable volume, shape (regular/irregular), FDG uptake, and FDG distribution patterns (uniform/heterogeneous) chosen to represent the range estimated in IELSG-26 patients. All the SUV values in the range were tested. For volume and heterogeneity, however, the choice of values was partly constrained by technical factors, since the fixed structure and dimensions of the inserts did not allow lesions with very low volume or very high AUC-CSH to be included in the phantom model (Table 1).

Table 1 Characteristics of the mediastinal lesions and the other intra-thoracic structures in the PMBCL patients enrolled in the IELSG-26 study and the corresponding values used in the assembled phantoms

All phantoms were scanned with a hybrid PET-CT scanner (MCT40, Siemens AG, Germany) and images were reconstructed using a standard clinical protocol with CT-based attenuation correction. The Online Resource summarises the main phantom characteristics and provides further methodological details (Table S1).

Volume segmentation

Volumes were measured on PET/CT acquisitions using a dedicated FDA-approved software program (TrueD-Syngo, Siemens AG, Germany). A spherical box was positioned around the phantom insert mimicking the tumoral lesion to define a volume of interest (VOI). The MTV was then automatically calculated by a standardized routine program applying a fixed threshold (FT) algorithm. The tumour volume was delineated by all the voxels within the initial VOI with SUV values above or equal to the fixed threshold [35]. We tested the methods by applying three thresholds recently proposed in the literature for lymphomas: 25% of the SUVmax, 41% of the SUVmax and an absolute SUV value of 2.5 (FT25%, FT41%, FT2.5) [29, 30, 33, 36]. In addition, the MTV was also calculated by means of a region growing (RG) algorithm (3D Freeform Isocontour tool, TrueD Syngo, Siemens AG, Germany), in which the automatically segmented volume includes all the voxels with common characteristics [37]. In PET images, after the selection of the central voxels of the tumour, all the neighbouring voxels were automatically included in the segmented volume according to the voxel intensity. The algorithm was applied in contrast mode, which does not require a predefined threshold [38].
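The fixed threshold rule described above can be sketched in a few lines. This is a generic illustration under stated assumptions, not the vendor implementation: the function name `fixed_threshold_mtv`, the 4 mm isotropic voxel size and the synthetic SUV array are all hypothetical, and SUVmax is taken as the maximum inside the operator-drawn VOI.

```python
import numpy as np

def fixed_threshold_mtv(suv, voi_mask, voxel_volume_ml, percent=None, absolute=None):
    """Segment a lesion with a fixed threshold and return (mask, MTV in ml).

    suv: 3-D array of SUVs; voi_mask: boolean array of the same shape marking
    the operator-drawn volume of interest. Exactly one of `percent`
    (fraction of SUVmax inside the VOI, e.g. 0.25) or `absolute`
    (SUV cut-off, e.g. 2.5) must be given.
    """
    if (percent is None) == (absolute is None):
        raise ValueError("give exactly one of percent or absolute")
    suv_max = suv[voi_mask].max()
    threshold = percent * suv_max if percent is not None else absolute
    # Keep every voxel inside the VOI whose SUV is at or above the threshold.
    lesion_mask = voi_mask & (suv >= threshold)
    mtv_ml = float(lesion_mask.sum() * voxel_volume_ml)
    return lesion_mask, mtv_ml

# Synthetic example (not phantom or patient data): a 10x10x10-voxel core
# (SUV 20) surrounded by shells of SUV 6 and SUV 3 in a background of SUV 1.
suv = np.ones((30, 30, 30))
suv[8:22, 8:22, 8:22] = 3.0
suv[9:21, 9:21, 9:21] = 6.0
suv[10:20, 10:20, 10:20] = 20.0
voi = np.ones_like(suv, dtype=bool)   # VOI covering the whole array
voxel_ml = 0.064                      # assumed 4 mm isotropic voxels

results = {
    "FT25%": fixed_threshold_mtv(suv, voi, voxel_ml, percent=0.25)[1],
    "FT41%": fixed_threshold_mtv(suv, voi, voxel_ml, percent=0.41)[1],
    "FT2.5": fixed_threshold_mtv(suv, voi, voxel_ml, absolute=2.5)[1],
}
```

In this toy case the three cut-offs capture the core only (FT41%), the core plus the inner shell (FT25%), and all three layers (FT2.5). The ordering reflects only how the thresholds relate to the synthetic uptake levels chosen here.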

Patients

Baseline PET scans of 103 patients with histologically proven PMBCL enrolled in the IELSG-26 study [8, 33] were analyzed with the same protocol and the methodology applied for the phantom imaging.

Statistical analysis

Quantitative variables were expressed as median and interquartile range (IQR). Medians were compared using the Wilcoxon signed-rank test. Absolute and relative differences between MTV measurements were expressed as mean and standard deviation (SD). Association between two variables was investigated through regression and correlation analysis (Pearson’s coefficient). Agreement between methods was quantified with the intra-class correlation coefficient (ICC). Survival functions were defined according to the revised NCI criteria [12] and estimated by the Kaplan-Meier method; patient groups were compared by the log-rank test. Cox proportional hazard models were used for the estimation of hazard ratios (HRs). The ability of different methods to correctly identify patients at risk of shorter survival was compared using both Uno’s C concordance probability estimator (CPE) [39] and an in-sample fit approach to model selection (Akaike information criterion, AIC) [40]. A p value <0.05 was considered statistically significant. Negative predictive value (NPV) and positive predictive value (PPV) were calculated according to standard definitions [41]. Statistical analyses were conducted using the STATA statistical software package, version 11 (StataCorp, College Station, TX), the R statistical software environment, version 3.1.1, or the MedCalc Statistical Software, version 15.8 (MedCalc Software bvba, Ostend, Belgium), as appropriate.
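For readers less familiar with the predictive values, the standard definitions can be written out as follows. This is a minimal sketch: the function name `predictive_values` and the ten-patient toy cohort are hypothetical, and the calculation treats outcome as a simple binary event, ignoring censoring and follow-up time.

```python
def predictive_values(high_risk, event):
    """Return (PPV, NPV) for a binary risk classification.

    high_risk: iterable of bools, True if the patient's MTV exceeds the cut-point.
    event: iterable of bools, True if the patient progressed (or died).
    """
    pairs = list(zip(high_risk, event))
    tp = sum(1 for h, e in pairs if h and e)          # high MTV, event
    fp = sum(1 for h, e in pairs if h and not e)      # high MTV, no event
    tn = sum(1 for h, e in pairs if not h and not e)  # low MTV, no event
    fn = sum(1 for h, e in pairs if not h and e)      # low MTV, event
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv

# Hypothetical toy cohort of 10 patients (not study data).
high_risk = [True, True, True, True, False, False, False, False, False, False]
event     = [True, True, False, False, False, False, False, False, False, True]
ppv, npv = predictive_values(high_risk, event)
```

Here PPV is the proportion of high-MTV patients who had an event and NPV is the proportion of low-MTV patients who did not.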

Results

Phantom models

Lesions of increasing volume (from 193 to 1256 ml) with fixed uptake and homogeneous tracer distribution (Fig. 2A)

Among the tested cut-offs, the FT25% provided the most accurate estimation of the actual volume (Online Resource, Table S1). The results obtained with this cut-off showed a consistent small overestimation of MTV, independent of the lesion volume (mean ± SD, 2.58% ± 0.9%). The RG method showed a systematic slight underestimation (mean ± SD, −3.83% ± 0.7%). The use of the higher 41% threshold resulted in underestimation of the lesion volumes (mean ± SD, −4.9% ± 1.8%) with a progressive worsening of performance for the smallest lesions (≤401 ml). On the other hand, the FT2.5 showed a clear overestimation (mean ± SD, 13.85% ± 4.0%), which was reduced for the smallest lesions.
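The percentage errors quoted here are assumed to be the signed difference between estimated and actual volume, divided by the actual volume. A minimal sketch of that calculation, with hypothetical volumes that are not the values in Table S1:

```python
import statistics

def relative_errors(estimated_ml, actual_ml):
    """Percentage error of each estimated volume relative to the true volume."""
    return [100.0 * (est - act) / act for est, act in zip(estimated_ml, actual_ml)]

# Hypothetical volumes in ml (illustrative only).
actual = [200.0, 400.0, 800.0]
estimated = [206.0, 410.0, 816.0]
errors = relative_errors(estimated, actual)
mean_error = statistics.mean(errors)
sd_error = statistics.stdev(errors)   # sample standard deviation
```

A positive mean indicates overestimation and a negative mean underestimation, matching the sign convention used for the four methods.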

Fig. 2

Relative error of MTV estimation by the different methods. The relative error of MTV estimation in the anthropomorphic phantom models using different methods, as a function of lesion volume (A), uptake heterogeneity (B) and uptake intensity of the phantom lesion (C)

Lesions of increasing heterogeneity of tracer distribution (AUC-CSH from 0.66 to 0.43) with fixed volume and fixed uptake (Fig. 2B)

For FT41%, a rise in heterogeneity caused a progressive volume underestimation in comparison with the results obtained with the homogeneous insert, exceeding 15% for the greatest heterogeneity. The FT25% provided the most accurate MTV calculation (error of −0.49% for the lesion with highest heterogeneity), although there was a slight inverse correlation between volume estimated and degree of heterogeneity. FT2.5 and RG MTV estimation were not affected by the degree of heterogeneity.

Lesions of increasing uptake (SUV value from 10 to 36) with fixed volume and homogeneous tracer distribution (Fig. 2C)

The results demonstrated that the uptake level of the lesion does not significantly affect the accuracy of methods using a percentage threshold (FT25% and FT41%) or of the RG program. Conversely, FT2.5, which applies a fixed absolute threshold, showed a progressive overestimation of MTV that correlated with the SUV values.

Lesions of fixed volume with irregular shape and homogeneous uptake

The four methods estimated the MTV of an irregularly shaped lesion with accuracy comparable to that demonstrated for the lesions of regular shape and similar volume (error < 1%).

The Online Resource (Tables S2–S4) reports the detailed results of MTV estimation by the different segmentation methods (with their differences and relative errors with respect to the actual phantom volume).

Patient population

The median values of MTV estimated by the different approaches (FT25%, FT41%, FT2.5, RG) are summarized in Table 2, which also reports the absolute and percentage thresholds of the SUV values applied by the different methods. The RG algorithm segmented MTV applying thresholds ranging between 10% and 40% of SUVmax, with a mean value slightly higher than 25% (mean = 27.5%, median = 28%, IQR 23–32%). The cut-off value of SUV 2.5 corresponded in the overall population to a median SUVmax percentage of 13% (IQR 10.4–15.4%).
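The correspondence between an absolute cut-off and a percentage of SUVmax is simple arithmetic, and writing it out may help to explain why FT2.5 behaves like a progressively lower relative threshold as uptake rises. The symbol p_{2.5} is introduced here for illustration, and the SUVmax values are back-calculated from the percentages quoted above; they are not figures reported in the text.

```latex
\[
p_{2.5} = 100 \times \frac{2.5}{\mathrm{SUV_{max}}}\,\%
\qquad\Longleftrightarrow\qquad
\mathrm{SUV_{max}} = \frac{2.5}{p_{2.5}/100}
\]
\[
p_{2.5} = 13\% \;\Rightarrow\; \mathrm{SUV_{max}} \approx 19.2,
\qquad
p_{2.5} = 10.4\% \;\Rightarrow\; \mathrm{SUV_{max}} \approx 24.0,
\qquad
p_{2.5} = 15.4\% \;\Rightarrow\; \mathrm{SUV_{max}} \approx 16.2
\]
```

By the same relation, a lesion with an SUVmax of 36 (the upper end of the phantom uptake range) would be segmented by FT2.5 at roughly 7% of SUVmax, well below both percentage thresholds.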

Table 2 Comparison of different segmentation methods with respect to applied thresholds and estimated metabolic tumour volumes, and analysis of relative differences with intra-class correlation

We observed statistically significant differences between patient MTVs estimated by the four methods (Wilcoxon signed-rank test p < 0.0001 for the four methods considered together and p < 0.001 for each pair of methods compared). These relative differences hinged on the different thresholds used (Table 2), and we found a highly significant (p < 0.0001) inverse correlation between median MTV and threshold SUV values (Online Resource, Fig. S1). In nearly all patients, the FT2.5 provided the highest MTV value, while FT41% contoured the smallest MTV (Online Resource, Fig. S2). Nevertheless, there was an extremely good correlation between all methods for the MTV estimation: Pearson’s coefficient ranged between 0.8 and 0.9, with p values consistently <0.0001 (Online Resource, Table S5).

The difference between FT25% and the other methods appears independent of the MTV value for FT41% and RG, while for FT2.5 it decreases progressively with increasing size of the lesion, as demonstrated in the phantom model (Fig. 3A).

Fig. 3

Relative difference in MTV values estimated by different methods in the patient population (left) and in anthropomorphic phantom models (right). The relative difference of MTV estimation by FT2.5, FT41% and RG methods with respect to FT25% is plotted as a function of lesion volume (A), heterogeneity (B) and uptake intensity of the phantom lesion (C)

FT41% showed higher MTV underestimation in patients with greater heterogeneity, while the relative MTV overestimation by FT2.5 was higher in cases with higher SUVmax, again in keeping with the trend observed in the phantom models (Fig. 3B and C).

Notably, the relative percentage differences of the MTV estimated by the different methods were 3–4 fold higher in patients than in phantoms (Fig. 3A).

The correlation between MTV measurements obtained by different methods (ICC) varied from 0.89 (between FT25% and RG) to 0.71 (between FT25% and FT41%), confirming the presence of inter-method differences (Table 2).

Prognostic value of MTV

The distribution and area under the ROC curves for all four methods were similar for prediction of either overall survival or progression-free survival, with the exception of a significantly different AUC of the ROC curves between FT25% and FT41% methods (representing the largest and smallest value, respectively) for the analysis of progression-free survival. Although all the methods allowed the identification of patients with significantly higher risk of progression, they provided different MTV cut-points for the separation of low- and high-risk groups (Table 3). The optimal MTV cut-point for the FT2.5 method was approximately three times greater than the one derived from the FT41% method. The four methods demonstrated similar NPV (from 94% to 95%) but different PPV (from 20% to 45%). Kaplan–Meier estimates (Fig. 4) showed that patients with low MTV had significantly longer PFS and OS compared with patients with high MTV, regardless of the method used. A concordance probability estimation (Uno’s C method) and the Akaike global model fit criterion showed that FT25% achieved both the best fit (lower AIC) and the best discrimination (higher CPE) in our cohort to identify patients with shorter PFS and OS (Online Resource, Table S6).

Table 3 Description of the ROC analysis used to identify optimal cut-off points and outcome results according to the different segmentation methods
Fig. 4

Risk group discrimination in PMBCL patients according to MTV estimated by different methods. Kaplan-Meier curves of PFS (panel A) and OS (panel B) according to MTV at diagnosis estimated using different segmentation algorithms

Additionally, elevated TLG, which is the product of MTV and SUVmean, was also significantly associated with a shorter PFS and OS in PMBCL, irrespective of the segmentation method adopted. Analogous to MTV, the best discrimination of patients at different risk of treatment failure was also provided using the FT25% method for the estimation of TLG (Online Resource, Tables S7 and S8).
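Because TLG is derived from the segmented volume, it inherits the choice of threshold. The following minimal sketch shows the calculation from a lesion mask; the function name `total_lesion_glycolysis`, the voxel volume and the toy SUV values are hypothetical.

```python
import numpy as np

def total_lesion_glycolysis(suv, lesion_mask, voxel_volume_ml):
    """Return (MTV in ml, SUVmean, TLG) for a segmented lesion."""
    mtv_ml = float(lesion_mask.sum() * voxel_volume_ml)
    suv_mean = float(suv[lesion_mask].mean())
    return mtv_ml, suv_mean, mtv_ml * suv_mean

# Toy 1-D "image": four lesion voxels of 0.5 ml each (illustrative values).
suv = np.array([1.0, 8.0, 12.0, 10.0, 10.0, 1.0])
mask = suv >= 2.5
mtv, suv_mean, tlg = total_lesion_glycolysis(suv, mask, voxel_volume_ml=0.5)
```

Lowering the threshold enlarges the mask, which raises MTV but tends to lower SUVmean, so the two factors of TLG move in opposite directions as the cut-off changes.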

Discussion

There is increasing evidence, mainly described in large B cell lymphoma, but also in several other histologic lymphoma subtypes, of the prognostic value of quantitative parameters (SUVmax, MTV and TLG) obtained from initial staging with 18FDG-PET/CT [11, 12, 26, 33, 42]. However, their clinical use is still constrained by technical limitations, which make it difficult to obtain comparable results across various PET/CT systems and institutions. The cut-point that defines the patient with high MTV and poor risk disease is particularly inconsistent and depends on the method used for its estimation; its variability in different reports is also related to the small sample size of most individual studies. The present study addressed the need to optimize the MTV measure in patients with PMBCL, where baseline quantitative PET parameters have been shown to be powerful prognostic indicators [33] and might have a role in treatment decisions [34].

We developed a phantom model that attempted to reproduce the clinical and metabolic features observed in the largest prospective study on PMBCL patients [8]. Our phantom is, to the best of our knowledge, the first lymphoma model that, in addition to volume, shape and uptake, also takes into account the metabolic tumour heterogeneity, which we recently showed to be helpful for the early identification of poor-risk PMBCL [34]. We tested performance of the different methods (FT25%, FT41%, FT2.5, RG) proposed in the literature for the metabolic volume segmentation in various phantom models.

Our results showed that FT25% provides the best concordance between measured and actual volumes, with a performance only slightly affected by the lesion characteristics and maintained when metabolic heterogeneity is included in the models to make phantoms more adherent to the biological features of the disease. These experimental data are consistent with the clinical observations we made in prior studies on the use of PET in PMBCL patients [8, 33], in which the 25% threshold, selected by qualitative visual analysis of the best fit between metabolic edges and anatomic boundaries, was superior to the 41% currently recommended for volume segmentation in lymphoma patients [30]. In fact, the particular clinical features of PMBCL, usually characterized by bulky mediastinal mass, limit the potential advantages of the 41% cut-off, which was developed in diffuse large B cell lymphoma (DLBCL), which often presents with multiple lesions at different sites, of variable size, background and metabolic uptake.

In phantom models, FT2.5 led to a systematic volume overestimation. Conversely, FT41% underestimated the volume of large FDG-avid masses, particularly in highly heterogeneous lesions. The RG algorithm, which applies a variable threshold, also showed a generalized volume underestimation. These phantom model results appear in keeping with those obtained in patients with PMBCL. The relative percentage differences of the MTV estimated in patients by the different methods were consistently higher than those obtained in the PET images of the phantoms. These findings may be explained by the higher variability of volume, tracer uptake, and heterogeneity in tumour lesions in the patient population compared with the phantom model. Although it is not possible to specify in individual patients the weight of each variable, the trends that we observed in patients in the relationship between segmentation methods and lesion characteristics appeared similar to those demonstrated in phantom models.

We demonstrated that in PMBCL, defined by a single large lesion, the estimated MTV values are strongly method-related and the four approaches provided significantly different MTV values. The optimal cut-offs for MTV to predict PFS and OS were also very different and method-related, as recently shown in DLBCL [32]. While these inter-method variations are expected, the finding, by ROC curve analysis, of different intra-method cut-offs for PFS and OS is difficult to explain on the basis of biology. This limitation may be overcome by the analysis of larger cohorts, such as that of the ongoing IELSG37 study (NCT01599559), which aims to enroll more than 500 patients.

All the methodologies demonstrated predictive power in risk-stratifying mediastinal masses, confirming the intrinsic prognostic role of baseline MTV in PMBCL. Although all four methods could predict PFS with good accuracy and similar NPV (95–98%), the FT25% showed the highest PPV (45%) and provided the best model to identify patients with poorer outcomes in this population.

Our results confirm the relationship between the most accurate estimation of the actual dimensions of the lesions and the best risk classification. We found individual patients in the study with very high FDG uptake and others with very heterogeneous masses, where FT41% and FT2.5 appeared to either under- or over-estimate MTV in comparison with FT25%.

Recently, Ilyas et al. proposed FT2.5 as the preferred method to segment PET volumes in patients with DLBCL because the use of an absolute cut-off (i.e. SUV = 2.5) would be less time-consuming, more reproducible and easier to apply than the other methods based on a percentage of the maximum SUV (i.e. 41%).

These considerations are of less importance in PMBCL, which is characterized by one dominant lesion. In this specific situation, the more common standard programs, after drawing the initial volume of interest around the pathological 18FDG uptake, automatically define the edges of the lesion in a single step, applying the chosen cut-off, either an absolute SUV value or a percentage of the maximum SUV. Our experience using FT25% in this setting showed very high inter-observer reproducibility (Pearson correlation coefficient > 0.99 and P < 0.0001) for the calculation of MTV [33].

Our prior studies have shown that, among functional PET parameters, elevated TLG is the best individual predictor of poorer PFS [33, 34]. In the present study, analogous to what was described for MTV, TLG values calculated with the FT25% method provided the best PPV, although all segmentation methods were able to correctly identify patients with favorable outcome.

In conclusion, this study has established that a cut-off of 25% SUVmax of the lesion is the optimal method for measuring baseline MTV in PMBCL and provides the best discrimination of patient risk.