Introduction

As the world population ages, early diagnosis of neurodegenerative dementia represents a challenge for both society as a whole and the medical community. It is estimated that 46.8 million people worldwide are living with dementia in 2015. This number will almost double every 20 years, reaching 74.7 million in 2030 and 131.5 million in 2050 [1]. Alzheimer’s disease (AD) is considered as the main etiology of neurodegenerative dementia (approximately 2/3 of cases have AD). AD impairs patient cognitive functions, and impacts daily life autonomy, and as such raises major public health policy issues in terms of home care and institutional placement [2]. Early diagnosis is a key feature in current patient management since treatments become less effective with worsening patient state.

Considerable effort is ongoing to identify and develop reliable biomarkers of incipient AD [3, 4] and other types of dementia [5] to target individuals who would most benefit from early therapeutic intervention [6]. Among these biomarkers, 18F-FDG positron emission tomography (PET) has been recognized as a valuable imaging modality for the positive and differential diagnosis of neurodegenerative dementias [7,8,9,10]. Since absolute quantification of cortical glucose metabolism requires dynamic acquisitions [11, 12] that rarely fit with clinical constraints, visual interpretation is usually performed using a static scan [13]. 18F-FDG uptake is a sensitive marker of synaptic dysfunction [14], and as such it has been shown to have good sensitivity in the detection of early brain dysfunction [15, 16] and follow-up of disease evolution over time [12]. Nevertheless, visual rating of relative cortical 18F-FDG distribution yields undesirable inter-reader variability [11, 17] and sub-optimal specificity especially among moderately skilled readers [18, 19].

To overcome these limitations of visual interpretation, several automated semi-quantitative techniques have been developed. Although complex computational approaches based either on deep learning [20], neural networks [21], or Support Vector Machine (SVM) [22] showing promising classification capabilities have recently been reported, most semi-quantitative methods rely on the statistical mapping of voxel-based normalized cortical uptake ratio with respect to some reference region [23,24,25,26,27].

The aim of the present study is to propose and highlight the potential and the relevance of brain 18F-FDG PET analysis based on an interval-valued reconstruction of 18F-FDG distribution [28, 29]. The proposed framework allows for a direct estimation of voxel-wise confidence intervals accounting for the statistical variability of voxel values. Subject classification was obtained via interval-based statistical comparison between cortical regions of interest and a reference region (extracted from the same reconstruction), known to be spared by AD. For the analysis, only raw data from current PET acquisition were required. Diagnostic performances were compared to those of PET data post-processing using a validated automated software based on the computation of regional normalized uptake ratios SD with reference to a healthy control database [30].

Materials and methods

Patient characteristics

Twenty-six patients were prospectively recruited from the outpatients of the Nuclear Medicine Department at Montpelier University Hospital from November 2016 to July 2017. All included patients were referred for brain 18F-FDG PET in routine conditions for the exploration of memory or executive dysfunction. Clinical diagnosis made during follow-up was considered as the gold-standard for patient classification.

The AD group (10 patients, 3 men and 7 women, 76 ± 6 years, range 64–84 years) was composed of patients diagnosed with probable AD according to the NINCDS-ADRDA criteria [31] during a clinical follow-up of 320 ± 204 days (range 101–606).

The control group (16 patients, 9 men and 7 women, 61 ± 13 years, range 36–77 years) was composed of patients with null or low pre-test probability of neurodegenerative disease, for whom 18F-FDG PET data were within the range of normal, and/ or rated as normal during a clinical follow-up of 158 ± 162 days (range 0–590).

PET data acquisition and reconstruction

PET examinations were performed on a Siemens Biograph mCT 20 Flow PET-CT system (Siemens Medical Solutions Knoxville, USA) about 30 min after intravenous (IV) injection of 2.5 MBq/kg of 18F-FDG. Data were acquired in three-dimensional (3D) time-of-flight mode for 10 min. Emission data were corrected for attenuation using the embedded computerized tomography (CT) scanner. Random coincidences (where the two photons did not arise from the same annihilation event), scatter coincidences, and dead-time were also corrected using dedicated manufacturer tools. Patient characteristics and technical data are summarized in Table 1. For the needs of the comparison experiments, PET data were reconstructed with two different reconstruction methods.

Table 1 Characteristics of the study groups

First, PET images were obtained using the routine workflow implemented at our institution for brain PET imaging. Iterative reconstruction was performed using 3D OSEM (21 subsets, 8 iterations) including PSF correction, followed by a 5, 5, 5 mm FWHM Gaussian post-smoothing procedure. Images were sampled on a 400 × 400 × 109 grid with a voxel size of 2.04 × 2.04 × 2.03 mm3.

Second, PET data were reconstructed using an interval-valued extension of the maximum likelihood-expectation maximization (ML–EM) algorithm [32, 33] called NIBEM [28, 29] that stands for Non-Additive interval-based expectation maximization. The main motivation for using this algorithm resides in its ability to directly reconstruct voxel-wise confidence intervals. The considered confidence intervals account for the statistical variability affecting reconstructed voxel values. The confidence level associated with these intervals was shown to be about 90% (29). As the current version of the mentioned algorithm was only described in 2D, 3D emission data were rebinned into a stack of 109 two-dimensional (2D) sinograms using the Fourier rebinning (FORE) algorithm [34]. Then the 2D sinograms were reconstructed using NIBEM on a 200 × 200 grid with a voxel size of 4.1 × 4.1 × 2.03 mm3. Reconstructions were performed using 70 iterations, which makes it possible to reach images with a similar noise level as obtained with 3D OSEM algorithm described in the routine workflow. Time-of-flight information was not exploited in this preliminary work. Instead of computing scalar values for each voxel, the algorithm used was interval-valued (i.e. the algorithm produces interval values for each voxel), then the measure associated with each voxel i was a real interval denoted [\(f_{i}\)], which lower and upper bounds are respectively denoted \(\underline{{f_{i} }}\) and \(\overline{{f_{i} }}\) Here, [\(f_{i}\)] represents the set of real numbers lying between its respective lower et upper bounds \(\underline{{f_{i} }}\) and \(\overline{{f_{i} }}\). The central image \(\hat{f}\) (i.e. the image that minimizes the Hausdorff distance between \(\underline{f}\) and \(\overline{f}\)) was defined, for each voxel i, as \(\hat{f}_{i} = {{\left( {\underline{{f_{i} }} + \overline{{f_{i} }} } \right)} \mathord{\left/ {\vphantom {{\left( {\underline{{f_{i} }} + \overline{{f_{i} }} } \right)} 2}} \right. \kern-\nulldelimiterspace} 2}\). In other words, for each voxel i of the reconstructed volume, the value \(\hat{f}_{i}\) is the center of the interval [\(f_{i}\)] reconstructed with NIBEM. A graphical illustration of the reconstruction procedure is presented in Fig. 1.

Fig. 1
figure 1

PET data reconstructed with interval-valued algorithm NIBEM

Patient classification using SUV ratio

The first set of images (3D OSEM) was post-processed using the manufacturer’s tool dedicated to statistical analyses of brain PET data (Siemens Scenium) [30]. PET images were automatically registered to the Montreal Neurological Institute (MNI) space, then segmented into cortical regions of interest (ROI) including whole brain, parietal, temporal, and temporo-mesial cortices. Mean standardized uptake values (SUV) were measured in each cortical ROI and normalized to mean whole brain SUV to produce cortical SUV ratios (SUVr). Last, these cortical SUVr were converted to standard deviation scores (SUVr SD), commonly called z score, based on reference distributions from age-matched control populations. As the ROI affected by AD appears hypometabolic, ROI SUVr SD of AD patients are usually negative. For example, a SUVr SD that equals − 1 means that the considered SUVr is 1 standard deviation below the mean SUVr of the control group. For a given classification threshold, a subject was considered as “not AD” if all the SUVr SD computed in 6 reference ROIs (parietal left/right, temporal left/right, and temporo-mesial left/right) were above the threshold. If at least one SUVr SD was below the threshold, the subject was classified as “AD”.

Patient classification using NIBEM

As Scenium is a commercial software, registration procedure and reference database are not accessible. We then designed our own procedure to compare our method. For each subject, the NIBEM central image \(\hat{f}\) was smoothed with a Gaussian kernel of 5, 5, 5 mm FWHM, then spatially normalized to the MNI space using the widely known tissue probability map template provided by SPM12 (Wellcome Trust Centre, London, UK). Reconstructed upper and lower NIBEM PET volumes were then spatially normalized using the estimated transformation parameters computed over \(\hat{f}\). As upper, lower, and central volumes are defined in the same space, normalizing the data with the same transformation parameters allows minimization of the spatial registration bias between upper and lower registered volumes. Resulting PET images were sampled on a 91 × 109 × 91 grid with cubic voxels. Image voxels were labeled according to the AAL2Footnote 1 anatomical atlas [35]. A graphical illustration of the proposed spatial registration procedure is shown in Fig. 2.

Fig. 2
figure 2

PET data spatial registration procedure

Spatially normalized NIBEM images were then segmented based on voxel labels to define cortical ROIs including left (L) and right (R) parietal, temporal and temporo-mesial lobes.

The clinical diagnosis of AD in 18F-FDG PET is made by highlighting the hypo-metabolism of some specific ROIs that are affected by the disease with respect to reference regions that are known not to be affected by AD. Sub-cortical grey nuclei can be considered as the reference region for such a comparison [36].

Thus, as NIBEM reconstructions are interval-valued, the interval-valued distribution associated with each relevant ROI (L and R parietal, temporal, and temporo-mesial lobes) needs to be compared to the interval-valued distribution associated with the reference ROI (sub-cortical grey nuclei).

To our knowledge, the literature concerning interval-valued distribution comparison remains rather scarce. Although an interval-valued generalization of the Wilcoxon rank-sum test was proposed [37], it appears to be insufficiently specific for the statistical comparison needed to be performed in this present work. Indeed, robust tests like Wilcoxon's test are affected by the imprecision of the data: the more imprecise the data the more difficult it is for the statistical tests to reject the null hypothesis. In the presence of a considerable imprecision (which is the case in the present application), the Wilcoxon test is too weakly specific to reject such a hypothesis.

Therefore, we propose to compare interval-valued distributions using a consistency measure between the two interval-valued distributions: \(\left[ {f_{{{\text{TEST}}}}^{1} } \right], \ldots ,\left[ {f_{{{\text{TEST}}}}^{N} } \right]\) for the tested ROI and \(\left[ {f_{{{\text{REF}}}}^{1} } \right], \ldots ,\left[ {f_{{{\text{REF}}}}^{M} } \right]\) for the reference region. These two distributions can be considered as being two random empirical distributions of intervals. Note that the probability density associated with each interval-valued reconstruction is unknown.

Two intervals \(\left[ {f_{{{\text{TEST}}}}^{i} } \right]\) and \(\left[ {f_{{{\text{REF}}}}^{j} } \right]\) are said to be inconsistent if \(\left[ {f_{{{\text{TEST}}}}^{i} } \right] \cap \left[ {f_{{{\text{REF}}}}^{j} } \right] = \emptyset\) and consistent if \(\left[ {f_{{{\text{TEST}}}}^{i} } \right] \cap \left[ {f_{{{\text{REF}}}}^{j} } \right] \ne \emptyset\) The consistency function \({\text{g}}\left( {\left[ {f_{{{\text{TEST}}}}^{i} } \right],\left[ {f_{{{\text{REF}}}}^{j} } \right]} \right)\) is equal to 1 if both intervals are consistent, 0 if they are inconsistent. If we consider the pair \(\left( {\left[ {f_{{{\text{TEST}}}}^{i} } \right],\left[ {f_{{{\text{REF}}}}^{j} } \right]} \right)\), assume these two intervals independent and since there are \(\begin{array}{*{20}c} N.M \\ \end{array}\) possible comparisons, the weight \({1 \mathord{\left/ {\vphantom {1 {\left( {\begin{array}{*{20}c} N.M \\ \end{array} } \right)}}} \right. \kern-\nulldelimiterspace} {\left( {\begin{array}{*{20}c} N.M \\ \end{array} } \right)}}\) can be associated with the pair. Under this assumption, in the theory of belief functions [38], the more common way to calculate a consistency measure between the two interval distributions is to consider the cumulative mass on the pairs that are consistent. The consistency measurement denoted IR between the region to test and the reference region is thus defined by the equation:

$${\text{IR}}_{{\text{TEST,REF}}} = \frac{1}{N.M}\sum\limits_{i = 1}^{N} {\sum\limits_{j = 1}^{M} {{\text{g}}\left( {\left[ {f_{{{\text{TEST}}}}^{i} } \right],\left[ {f_{{{\text{REF}}}}^{j} } \right]} \right)} } ,$$
(1)

where \({\text{IR}}_{{\text{TEST,REF}}} = 1\) means that all intervals of the tested region intersect the intervals of the reference region. On the opposite, \({\text{IR}}_{{\text{TEST,REF}}} = 0\) means that none of the intervals intersect between the two considered regions. Varying a given classification threshold from 1 to 0, a subject was considered as “not AD” if all the concordance measures IR computed independently between the 6 ROIs (left/ right temporal, left/right temporo-mesial and left/ right parietal lobes) and the reference (sub-cortical grey nuclei) regions were above the threshold. If at least one intersection ratio was below the threshold, the subject was classified as “AD”. ROI segmentation and comparison processes are graphically detailed in Fig. 3. Illustration of both post-processing and proposed methods outputs for an AD patient extracted from the database can be found in Fig. 4. For each method, one score by ROI is extracted from the analysis, albeit with different meanings: deviation scores for SUVr SD analysis and consistency measure between the analyzed ROI and a reference ROI for NIBEM IR analysis. An overview of the reconstructions for both compared methods is presented in Fig. 5.

Fig. 3
figure 3

PET data segmentation and ROI comparison step

Fig. 4
figure 4

Illustration of brain 18F-FDG PET analysis for a patient diagnosed with Alzheimer’s disease with both compared methods

Fig. 5:
figure 5

18F-FDG PET brain axial slice of an AD patient. a NIBEM lower bound, b NIBEM upper bound, c OSEM 3D reconstruction used in SUVr SD method, d SUVr SD map with its color bar. Same color bar was used for (a) and (b). All volumes were registered to template space

Statistical analysis

Differences between the control group and AD group in the distribution of SUVr SD and NIBEM IR computed on an ROI basis were assessed using Mann–Whitney’s test.

The ability of the two methods to discriminate between controls and AD subjects was evaluated using a receiver operating characteristic (ROC) analysis by varying the classification thresholds (from − 10 to 10 with a step. ROC analysis is performed with pre-defined criteria that, respectively, state false-positive (FP), false-negative (FN), true-positive (TP), and true-negative (TN) definitions:

  • FP: a patient classified as “AD” whilst labeled “control”

  • FN: a patient classified as “not AD” whilst labeled “AD”

  • TP: a patient classified as “AD” whilst labeled “AD”

  • TN: a patient classified as “not AD” whilst labeled “control”

For each threshold, FP, FN, TP, and TN were computed. Then always for each threshold, (sensitivity) vs (1− specificity) was computed and plotted to obtain the corresponding ROC curve. Areas under the ROC curve (ROC AUC) were compared [39] and the optimal threshold was defined as that maximizing Youden’s index (sensitivity + specificity− 1).

Diagnostic performances of the two methods were compared using Fisher’s exact test. A p value ≤ 0.05 was considered as significant. All statistical computations were performed using Matlab (The MathWorks, Inc).

Results

Distributions of SUVr SD by cortical ROI in controls and AD subjects are presented in Fig. 6a. It shows that distributions were significantly different in all tested cortical ROIs. It thus highlights that SUVr SD analysis is effective in classifying “AD” and “not AD” groups of the study according to the analyzed ROIs.

Fig. 6
figure 6

Box plots of (a) SUVr SD distributions and (b) NIBEM IR distributions both for healthy controls (blue) and AD patients (orange) according to cortical ROI. Box: median and inter-quartile range. Whiskers: mean ± 1.5 standard deviations. Round markers stand for extreme outliers beyond the whisker’s limits.

The same experiment carried out with cortical ROI NIBEM IR distributions in controls and AD subjects is presented in Fig. 6b. Group distributions were shown to also be significantly different between controls and AD patients for all the comparisons ROI vs reference except for the left temporo-mesial lobe. However, the classification (see Fig. 7; Table 2) is not impacted by the lack of IR statistical difference between the “AD” and “controls” groups in this specific ROI. It thus suggests that considering these groups and this comparison framework, left temporo-mesial lobe ROI does not play a big role in the classification framework in this specific study.

Fig. 7
figure 7

Receiver operating curves (ROC) of SUVr SD (red dotted line) and NIBEM IR (blue line and boxes) in differentiating healthy controls from AD patients. The black round marker indicates the optimal threshold maximizing Youden’s index that is common to both methods.

Table 2 Diagnostic performances of SUVr SD and NIBEM IR in differentiating healthy controls from AD patients

Figure 7 shows the ROC curves obtained by varying the classification threshold of each method. There was no significant difference between the ROC AUCs of the two methods (0.95 vs 0.96) for SUVr SD and NIBEM IR (respectively, p = 0.86). Both AUC values are excellent, and highlight the good classification performance of both frameworks on the considered dataset.

In the present experiment, the optimal classification thresholds that maximize the Youden’s index were 0.614 for NIBEM IR and − 2.1 for Scenium SUVr SD. These threshold values are those used for analyzing and comparing the two methods. In Table 2, where the diagnostic performance of both methods in differentiating healthy controls from AD patients is summarized, we can observe that statistical outcomes (sensitivity, specificity, accuracy, positive and negative predictive value) are equal for both methods considering the formerly specified thresholds.

Discussion

It is well documented that in the first stages of AD, the temporal and parietal lobes are usually affected by amyloid deposition [18]. This coincides with neurofibril deposition initially confined to the medial temporal lobe and limbic structures [11, 18]. These manifestations tend to be reflected as areas of hypo-metabolism on 18F-FDG PET reconstructed images. Visual rating of cortical distribution yields undesirable inter-reader variability [11, 17] and sub-optimal specificity especially among moderately skilled readers [18, 19]. To overcome these limitations, automated semi-quantitative techniques have been developed to help the physician judge whether the ROI can be considered as hypo-metabolic or not. In the first stages of the disease, hypo-metabolism is not usually observed symmetrically. It thus would seem intuitive to make a relative comparison: can the considered ROI be considered as identical in terms of radio-tracer concentration with respect to its symmetrical counterpart or a reference region?

The key limitation, for performing reliable direct ROI comparison in PET, is that no information about the statistical variability of the reconstructed data is directly available with the reconstruction algorithms used in clinical routine [28, 29].

To overcome this problem, most of the diagnostic assistance techniques recently proposed in the literature rely on the use of databases. These techniques can be separated into two groups. The first one is artificial intelligence (AI) based: for a very specific task, a convolutional neural network can be trained using a previously annotated database to classify whether the proposed 18F-FDG PET scan belongs to a patient suffering from the disease or not. Such approaches can lead to very promising classification performance [20]. The second group of approaches relies on the statistical mapping of voxel-based normalized cortical uptake ratio with respect to some reference region [23,24,25,26,27]. Each investigated voxel or ROI is characterized by a score that reflects its distance to a reference score. The latter is computed based on a database of ethnic-, age- and sex-matched healthy controls. As previously mentioned, these approaches give promising results, but have certain limitations. The major drawback is that they inescapably depend on the database on which they rely or from which they have been trained. This raises the question of performance variations of these methods considering changes in acquisition equipment, in reconstruction parameters, acquisition conditions, radio-tracer dose, etc. As institution-wide, country-specific, or equipment manufacturer-wide recommendations are and will continue to different, it appears very difficult to trust a diagnosis relying only on AI or database-aided techniques. Nevertheless, comparisons with these methods that are now fashionable would be of great interest. Apart from the need to have access to raw data to perform NIBEM reconstruction, the comparison between the proposed approach and these state-of-the-art methods would be straightforward as their output is of a comparable nature and the metrics used to assess their performance are similar.

The framework proposed in this article is different from the kind of approaches described above as it apprehends the problem of ROI comparison in PET in an alternative way. Through the reconstruction of confidence intervals, what is proposed is to directly compare regions of interest for dementia with a reference region extracted from the same reconstruction thanks to concordance measures. The performed comparison is not based on any external information (like a database for instance) for the analysis. Indeed, the reference used to perform the hypo-metabolism analysis is specific to each patient. It does not require a normalization step either. Only raw data of PET acquisition are needed. The preliminary results presented in this paper seem to highlight that such a processing framework makes possible diagnostic performances that are promising as they seem similar to a validated post-processing software for AD on the tested data.

However, the current study has also some limitations that raise interesting perspectives and necessary improvements. Concerning the clinical validation, the number of patients included remains limited. Indeed, as this method requires access to raw PET data, it is impossible to use databases such as ADNI [40] because associated raw data are usually not available. It thus could be interesting to validate the results on a more consistent database. Speaking of databases, it could also be interesting to adapt such an analysis to a broader panel of neurodegenerative diseases like mild cognitive impairment, fronto-temporal dementia, and Lewy body dementia. Another important step in the clinical follow-up of patients with dementia is the quantification of the evolution of brain function disorders. A method like the one proposed in this paper could have a role to play for this task. In further studies, one could imagine testing the ability to quantify the evolution of a disease over time by analyzing the analogous IR scores of PET scans acquired at two different times. It may thus be interesting to see if the IR score evolution correlates with the clinical and expected evolution of the disease and thus allow a better follow-up of disease evolution over time. On the side of theoretical developments, to be in line with current PET standards it may be interesting to perform NIBEM reconstructions making use of time-of-flight information. However, this would require prior theoretical developments to manage time-of-flight in the NIBEM algorithm. Generalizing the NIBEM algorithm to perform 3D reconstruction of CI estimates without making use of re-binning procedures would also be a plus. Comparability of the tested methods would be greatly improved. In short, the objective of further studies is to propose a reconstruction pipeline that meets all the current PET reconstruction standards. The diagnosis-aid performance will also benefit further theoretical developments in statistical testing of interval-valued distributions that could be more specific to the statistical population characteristics we are trying to compare. More generally, the diagnosis of neurodegenerative diseases in PET will surely continue to benefit from image quality improvements allowed by hardware and software progress of PET technology. However, chances are pretty good that tools like the one presented that facilitate better understand and characterize the acquired and processed data will also play an important role for early diagnosis in the future.

One last important point concerns the use of a database for comparative analysis. Even if the proposed analysis framework does not require external information like metadata or database references, it still requires to set beforehand a classification IR threshold that allows to state if the analyzed patient is affected by the disease or not. This information is required for all the diagnosis assistance methods mentioned above. Obviously, this preliminary work does not have the level of proof to establish a definitive threshold that could be used to differentiate AD from healthy controls in routine clinical work. As this threshold is the only dependence of the proposed method to a database, it would be interesting to carry out a study to see if the classification performance is independent of the group characteristics, acquisition conditions, radio-tracer dose, etc. In the case of a positive answer, it could be conceivable to propose a decision-making process based on the proposed methodology that is freed from the need of using any database which would be an important step for aided-diagnosis methods.

Conclusion

The method presented in this paper is the first method of the literature that exploits the uncertainty quantization associated with the data (in particular the statistical variability here) to make an assisting diagnosis tool in PET. The preliminary and original results presented in this paper highlighted the fact that clinical routine would directly benefit from theoretical progress in the field of confidence interval estimation and comparison. Moreover, the limited computational cost of the proposed method and its relatively easy implementation in the clinical routine framework should allow its use as a complementary tool for diagnosis aid and thus to improve the decision algorithm of AD diagnosis.

To conclude, this paper has shown that interval-valued reconstruction can allow promising analysis of brain 18F-FDG PET data, yielding diagnostic performances similar to that of a SUVr SD-based commercial post-processing software.