Introduction

Brain dopamine transporter (DAT) imaging with [I-123]FP-CIT SPECT is widely used in the differential diagnosis of parkinsonian syndromes (Kägi et al. 2010; Brooks 2012; Kupsch et al. 2012; Varrone et al. 2013). [I-123]FP-CIT is a tracer that binds to DAT located on the presynaptic terminal cell membranes of dopaminergic neurons, and thereby, the tracer binding may identify the loss of dopamine, DAT expression or the number of functional dopaminergic neurons (Kupsch et al. 2012; Ba and Martin 2015). The method can be used to differentiate essential tremor (ET) and Alzheimer’s disease from neurodegenerative dopaminergic parkinsonian syndromes, including Parkinson’s disease (PD), progressive supranuclear palsy (PSP), multiple system atrophy (MSA) and dementia with Lewy bodies (LBD) (Kägi et al. 2010; Brooks 2012; Ba and Martin 2015). In addition, it appears that secondary parkinsonism, such as drug-induced parkinsonism (DIP) and vascular parkinsonism which do not generally affect nigrostriatal projections, can be differentiated from neurodegenerative parkinsonism using DAT imaging (Brooks 2012; Ba and Martin 2015).

Commonly, an expert rater, often a nuclear medicine physician, performs the initial visual interpretation of DAT SPECT scans. The visual approach is generally accepted and is often the preferred method of analyzing DAT scans, although several automated semi-quantitative methods are also available (Badiavas et al. 2011). Most of these methods rely on binding ratios that are calculated by comparing tracer uptake in regions of interest (ROIs) that show specific binding, most commonly the whole striatum, the caudate nucleus or the anterior/posterior putamen, with uptake in reference areas free of specific binding (occipital or cerebellar cortex) (Joutsa et al. 2015). A commonly used quantitative index is the specific binding ratio (SBR), the ratio of specific to nonspecific uptake, which is presumably related to the density of the presynaptic terminals or DAT (Badiavas et al. 2011; Varrone et al. 2013). The advantage of the automated method is the lack of subjectivity or rater-induced bias (Koch et al. 2005; Pencharz et al. 2014) compared to the visual analysis, which is subjective and possibly training-dependent (Scherfler and Nocker 2009; Kahraman et al. 2012). On the other hand, the disadvantage of the automated method could be its simplistic mathematical approach, which can be vulnerable to systematic artifacts or errors (Scherfler and Nocker 2009; Buchert et al. 2015) and does not allow room for clinical interpretation in complex or atypical cases.

Some current protocols for the clinical evaluation of DAT SPECT scans emphasize the parallel and equal roles of visual and automated analyses. However, the extent to which automated semi-quantitative analysis should be used (not at all, equal or supplementary to the visual analysis, or even as a primary method of analysis) remains unclear (Ottaviani et al. 2006; Filippi et al. 2008; Suárez-Piñera et al. 2011; Buchert et al. 2015). In a recent study by Albert et al., a combination of visual and semi-quantitative analyses seemed to lead to the best diagnostic accuracy in clinical follow-up (Albert et al. 2016). In another study by Söderlund et al., the visual analyses between experts were very consistent, but a combined approach of visual and automated analyses of tracer binding created better reproducibility. The study also suggested that observers with less experience tended to over-report scans as abnormal (Söderlund et al. 2013). Buchert et al. reported that visual evaluation of a standardized slab view display of DAT scans showed agreement with semi-quantitative analysis in as many as 90 % of cases. However, in their study, readers without any experience showed similar performance compared to experienced readers in the visual analysis (Buchert et al. 2015). On the basis of these studies, it remains unclear whether training or experience in visual analysis provides any benefit in the diagnostic accuracy. Furthermore, the clinical characteristics of cases that exhibit disagreement between visual and automated analyses have not been studied.

If tracer binding appears visually abnormal but the interpretation of the automated analysis is normal—or vice versa—how should these cases be interpreted? The issue is of high importance because false interpretations can lead to false diagnoses and unjustified treatments (Suárez-Piñera et al. 2011; Hauser and Grosset 2012; Seibyl et al. 2014). It would also be cost-effective if accurate clinical analyses could be performed by nonexperts. This study aimed to investigate the frequency and the characteristics of DAT SPECT imaging cases with discrepant findings in clinical visual and automated analyses. We further studied if visual scan analysis could be performed by nonexperts by comparing nuclear medicine specialists to raters without prior experience.

Methods

Subjects

The study sample included 120 patients who had undergone a brain [I-123]FP-CIT SPECT due to clinically uncertain parkinsonism. We initially identified 489 patients with appropriate clinical data and [I-123]FP-CIT SPECT images from our database (scanned in 2007–2012). Age at scan, sex, scan date, scanner, duration of motor symptoms before scanning, predominant side of the motor symptom, presence of tremor and the post-scan clinical diagnosis in follow-up were available for every patient (Kaasinen et al. 2014). The 489 scans were visually categorized into one of the four categories of striatal tracer binding by one of the investigators (1 = Normal, 2 = Slightly abnormal, 3 = Abnormal and 4 = Clearly abnormal) (Benamer et al. 2000; Staff et al. 2009; Suárez-Piñera et al. 2011; Kupsch et al. 2012). The uptake was categorized as normal (grade 1) when it was symmetrical and observed in all striatal nuclei in both hemispheres. The uptake was categorized as slightly abnormal (grade 2) if there was a visually detectable reduction in putaminal uptake in one or both hemispheres in association with normal or almost normal caudate uptake in both hemispheres. The uptake was categorized as abnormal (grade 3) when there was clear bilateral reduction (even if asymmetric) in putaminal uptake, with mostly preserved caudate nuclei in both hemispheres. The uptake was categorized as clearly abnormal (grade 4) when there was clear bilateral reduction in putaminal uptake in association with a binding loss in one or both caudate nuclei. Next, the category distributions were calculated and the data were sorted randomly. Different samples of 120 patients were extracted, and the sample with the most representative distribution of scans for each of the four categories was selected as the final study sample without clinical information. The sample of 120 scans was considered to be sufficiently small but statistically acceptable for reliable and consistent individual visual ratings.
The most common final diagnoses were PD (n = 49), essential tremor (n = 16) and medication-induced parkinsonism (n = 8). The demographics of the study sample are presented in Table 1. The study was approved by the ethical committee of the local hospital district and was conducted according to the principles of the Declaration of Helsinki.

Table 1 Main demographic characteristics of the studied sample categorized according to discrepancies between visual and automated analyses

Scanning and image preprocessing

Patients were imaged on either a GE Infinia II Hawkeye SPECT/CT scanner (GE Medical Systems, Milwaukee, WI) or a Picker Irix gamma camera (Picker International, Uniontown, OH). The data were acquired using low-energy high-resolution (LEHR) collimators. The images were reconstructed using the ordered subsets expectation maximization (OSEM) algorithm of Hybrid Recon Neurology software (version 1.0.15, Hermes Medical Solutions AB, Stockholm, Sweden) and a 3-D Gaussian post-filter with 0.7 cm full width at half maximum, as previously described (Kaasinen et al. 2014).

Automated semi-quantitative analysis

All scans were analyzed using BRASS automated analysis software (ROI-method) (version 3.6; Hermes Medical Solutions, Stockholm, Sweden). Scanner-specific corrections were used for the BRASS analyses (Kaasinen et al. 2014). SBRs for six regions (right and left caudate, right and left anterior putamen, and right and left posterior putamen) were then calculated using the occipital cortex as the reference region: \( \text{SBR} = (\text{ROI}_{\text{caudate or putamen}} - \text{ROI}_{\text{occipital}}) / \text{ROI}_{\text{occipital}} \). Semi-quantitative uptake was defined as abnormal if it was more than two standard deviations below the reference mean in any of the six analyzed regions (Varrone et al. 2013). Out of the 120 patients, there were four abnormal cases that had caudate binding defects without parallel losses in putaminal binding.
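The SBR formula and the two-standard-deviation rule above can be illustrated with a minimal Python sketch. This is not the BRASS implementation: the function names, the uptake values and the reference mean and standard deviation are hypothetical and serve only to make the arithmetic explicit.

```python
from typing import Dict


def specific_binding_ratio(roi_uptake: float, occipital_uptake: float) -> float:
    """Return SBR = (ROI - occipital) / occipital for one striatal region."""
    if occipital_uptake <= 0:
        raise ValueError("occipital reference uptake must be positive")
    return (roi_uptake - occipital_uptake) / occipital_uptake


def z_score(sbr: float, reference_mean: float, reference_sd: float) -> float:
    """Express an SBR as standard deviations from the reference mean."""
    return (sbr - reference_mean) / reference_sd


def is_abnormal(region_z_scores: Dict[str, float], threshold: float = -2.0) -> bool:
    """Flag a scan if any region lies more than 2 SD below the reference mean."""
    return any(z < threshold for z in region_z_scores.values())


# Hypothetical mean uptake values, chosen only to make the arithmetic visible.
sbr = specific_binding_ratio(roi_uptake=30.0, occipital_uptake=10.0)  # (30 - 10) / 10

# Hypothetical reference mean and SD for this region (not values from the study).
z = z_score(sbr, reference_mean=3.0, reference_sd=0.4)

# One region below the threshold is enough to flag the whole scan.
scan_flag = is_abnormal({"left_posterior_putamen": z, "right_caudate": -0.4})
```

Because the rule is applied to any of the six regions, a single region just below −2 standard deviations is sufficient to label the scan abnormal.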

Image layouts for visual analysis

In the initial phase of image selection, Vinci software (version 3.0, Max Planck Institute, Cologne, Germany) was used to generate horizontal views of the scans from 489 patients. Twelve striatal slices with a thickness of 4.7 mm were used in the visualizations using the Speckle rainbow color scale (range = SBR 0.0–5.5). Each image included four larger and eight smaller scan slices using the neurological convention (right side of the image = right side of the patient). After the selection of the final study sample of 120 patients, four representative images of each category were selected from the original data of patients who were not included in these 120 scans. The layout of the images and examples of each category are shown in Fig. 1.

Fig. 1

Four categories of striatal tracer binding used in the visual analysis of images. Each rater was instructed to categorize images of 120 patients using this sample as reference. The raters were also instructed to indicate the hemisphere with the most prominent reduction in tracer binding. In the images, the right hemisphere is on the right side of the image and the left hemisphere is on the left side. The color scale bar indicates specific binding ratios (SBRs)

Visual analyses

The 120 images were individually analyzed by six independent raters. Two of the raters were nuclear medicine physicians with extensive experience in analyzing [I-123]FP-CIT scans (>10 years). The other raters were nonexperts, including two registered nuclear medicine nurses with some knowledge of [I-123]FP-CIT SPECT imaging but no knowledge of SPECT image analysis and two nonmedical laymen with no previous knowledge of brain imaging. The raters were asked to categorize the 120 patients into one of the four categories of striatal tracer binding (Fig. 1) and to specify possible interhemispheric differences. The visual analysis was performed with the help of examples (Fig. 1), category definitions and the following clinical patient information: age at scan, sex, duration of motor symptoms at scanning and the predominant side of motor symptoms. All raters received both written and oral instructions on how to analyze the images, and they were blinded to the results of the automated analysis and each other’s ratings.

Statistical analyses

The agreement between dichotomous (normal/abnormal) visual analyses and the automated analysis, as well as the inter-rater agreement, was calculated using kappa statistics (Cohen’s unweighted κ) (Cohen 1960). The strength of agreement was defined based on κ values: less than 0.20 was considered to be poor agreement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 good agreement and 0.81–1.00 excellent agreement (Altman 1991). The differences in the agreement between experts and nonexperts were compared using Z tests. Patient characteristics between discrepant and non-discrepant cases were compared with one-way ANOVA with post hoc Bonferroni correction and Chi-square tests. The difference between cameras in discrepant cases was compared with a Chi-square test. The level of statistical significance was set at p < 0.05.
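As an illustration of the agreement statistic, the following Python sketch computes Cohen’s unweighted κ for two dichotomous ratings and maps it to the agreement bands listed above. The ratings are toy data, not study data. Two details are assumptions made for the example: visual grade 1 is treated as normal and grades 2–4 as abnormal, and the upper bound of each band is treated as inclusive.

```python
from typing import List, Sequence


def dichotomize(grades: Sequence[int]) -> List[int]:
    """Map visual grades to 0 (normal) or 1 (abnormal).

    Assumption for this sketch: grade 1 is normal, grades 2-4 are abnormal.
    """
    return [0 if grade == 1 else 1 for grade in grades]


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's unweighted kappa: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed proportion of agreement.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from the marginal proportions.
    p_e = sum((list(rater_a).count(l) / n) * (list(rater_b).count(l) / n) for l in labels)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (p_o - p_e) / (1.0 - p_e)


def agreement_strength(kappa: float) -> str:
    """Agreement bands as listed in the text (upper bounds treated as inclusive)."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "excellent"


# Toy data: ten visual grades from one rater and ten automated results.
visual = dichotomize([1, 1, 2, 3, 4, 1, 2, 1, 3, 4])
automated = [0, 0, 1, 1, 1, 0, 0, 1, 1, 1]

kappa = cohens_kappa(visual, automated)  # p_o = 0.80, p_e = 0.52
strength = agreement_strength(kappa)
```

In this toy example the raters agree on eight of ten scans, but because chance agreement is already 0.52, κ falls in the moderate band rather than the good band.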

Results

Visual vs. automated analysis

Of the total 120 scans, 26 (21.7 %) were categorized as normal by automated analysis and both experts, and eight scans (6.7 %) were categorized as normal by automated analysis and either one of the experts (non-discrepant normal scans, n = 34, 28.3 %). Similarly, 73 scans (60.8 %) were categorized as abnormal by automated analysis and both experts, and one scan (0.8 %) was categorized as abnormal by automated analysis and one of the experts (non-discrepant abnormal scans, n = 74, 61.7 %). Twelve (10 %) of the cases showed discrepancy between expert visual and automated analyses (discrepant scans). Specifically, nine (7.5 %) were categorized as normal by automated analysis and abnormal by both experts, and three (2.5 %) were categorized as abnormal by automated analysis and normal by both experts. The main demographic characteristics of the patients are presented in Table 1.

Discrepant scans had 17.6 % lower mean striatal SBRs compared to normal scans (p = 0.003) and 62.7 % higher SBRs compared to abnormal scans (p < 0.001). Patients with discrepant scans were older compared to patients with normal scans (72.6 vs. 62.4 years, p = 0.023) without significant age differences compared to patients with abnormal scans (p = 0.33) (Table 1; Fig. 2). Discrepant cases did not differ from other cases in terms of gender distribution or in the predominant side of motor symptoms (Table 1).

Fig. 2

Group differences in (a) the mean striatal specific binding ratio (SBR) and (b) age. The normal group denotes the 26 patients who were categorized as normal by both visual and automated ROI analysis, the abnormal group denotes the 73 patients who were categorized as abnormal by both visual and automated ROI analysis, and the discrepant group denotes the 12 patients who had different results in the two analysis methods. Statistical significances denote post hoc Bonferroni-corrected p values after one-way ANOVA. *p < 0.05, **p < 0.01, ***p < 0.001, NS nonsignificant

Of the 74 patients scanned with the GE Infinia scanner, seven (9.5 %) had discrepant results; respectively, discrepant results were observed in 5/46 (10.9 %) patients scanned with the Picker Irix scanner. There were no significant differences in the prevalences of discrepant cases between cameras (p = 0.80).
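The camera comparison can be reproduced from the counts given above (7 of 74 versus 5 of 46 discrepant cases). The sketch below is a plain Pearson Chi-square test on the 2 × 2 table, written with the Python standard library only. The absence of a continuity correction is an assumption of this sketch, as the text does not state which variant was used.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson Chi-square test (no continuity correction) for the table [[a, b], [c, d]].

    Returns the test statistic and the p value for one degree of freedom.
    """
    observed = ((a, b), (c, d))
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    n = a + b + c + d
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the Chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Rows: GE Infinia (7 discrepant, 67 non-discrepant), Picker Irix (5 discrepant, 41 non-discrepant).
chi2, p_value = chi_square_2x2(7, 67, 5, 41)
```

With these counts the statistic is approximately 0.06 and the p value approximately 0.80, in line with the reported result. One expected cell count is below five (4.6), so an exact test would be a reasonable alternative in a reanalysis.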

The clinical characteristics of the 12 patients with discrepant results are presented in Table 2, and their SPECT scans are displayed in Fig. 3. Out of the nine patients who were visually categorized as abnormal while automated analysis categorized them as normal, eight (88.9 %) were aged 70 or older at the time of imaging, six (66.7 %) had cognitive defect at the time of imaging or developed dementia within 5 years after imaging, and four (44.4 %) received a final diagnosis of drug-induced parkinsonism (DIP). The three discrepant cases that were categorized as abnormal by the automated method were 2.0–2.5 standard deviations below the reference SBR mean in at least one striatal region. For patient number 10, the reduced region was the left anterior putamen (−2.26); for patient number 11, the right anterior putamen (−2.36), the left anterior putamen (−2.34), the left posterior putamen (−2.46) and the left caudatus (−2.21); and for patient number 12, the right anterior putamen (−2.01), the left anterior putamen (−2.31), the left posterior putamen (−2.08), the right caudatus (−2.03) and the left caudatus (−2.13). Patients 11 and 12, who showed uniform slight reductions in binding over the whole striatum uni- or bilaterally, were later diagnosed as having essential tremor.

Table 2 Individual clinical characteristics of the cases that showed discrepant findings between visual and automated analyses
Fig. 3

Cases that were categorized abnormal in the visual analysis but normal in the automated analysis (cases 1–9), and normal in the visual analysis but abnormal in the automated analysis (cases 10–12). The clinical characteristics of the cases are presented in Table 2 with corresponding case numbers. Four representative striatal slices of each case are presented. The color scale bar indicates specific binding ratios (SBRs)

Effect of expertise in visual analysis

Expert raters showed good agreement between visual and automated analyses [κ = 0.66 (95 % CI 0.51–0.80) and κ = 0.72 (95 % CI 0.58–0.85)], whereas the agreement was fair to moderate in nonexperts. Experts also showed clearly superior inter-rater agreement in dichotomous visual analysis [κ = 0.81 (95 % CI 0.70–0.93)] compared to nonexperts (Table 3). Expert raters spent more time on the analyses compared to nonexperts (60–90 vs. 30–45 s per scan, respectively).

Table 3 Effect of expertise

Discussion

One out of every ten clinical DAT SPECT scans in our data showed discrepancy between visual and automated analyses. Most of these cases were visually interpreted to show slightly abnormal striatal tracer binding, whereas the automated analysis indicated normal binding. Interestingly, none of these patients developed degenerative dopaminergic diseases, such as PD, after a minimum of 4.5 years of clinical follow-up. From a practical clinical point of view, this suggests that scans on the borderline of abnormality with discrepant imaging findings should probably be interpreted as normal. One should also note that the symptom duration in discrepant cases tended to be longer than in other patients. This supports the benign nature of discrepancies, as longer motor symptom durations at the time of imaging appear to be associated with a higher probability of a normal scan (Jaakkola et al. 2016). In addition, our results suggest that the accuracy of visual interpretation of DAT SPECT scans is highly dependent on the level of training, as the performance of experienced nuclear medicine physicians in the visual analysis was clearly superior to that of nonexperts, if we consider the automated analysis as the gold standard.

Patients with discrepant imaging findings were older than patients with normal findings in striatal tracer binding. This implies that the visual interpretation was affected by age-related dopaminergic changes. Particularly, patients who were interpreted as abnormal in the visual analysis but normal by the automated method were older. Many of these patients also had, or later developed, cognitive problems or dementia, which could be another factor misleading the interpretation. Furthermore, several patients received a final diagnosis of DIP. Therefore, combinations of older age, cognitive defect and antipsychotic medication would appear to cause interpretation difficulties in DAT imaging. Possible slight age-related decreases in tracer binding (Varrone et al. 2013; Kaasinen et al. 2015; Albert et al. 2016), in combination with developing dementia, could possibly visually mimic early PD and affect striatal visual shape recognition. While possible age-related defects seem to be difficult to detect in the visual analysis, the semi-quantitative method can be more helpful as it employs age-corrected reference values (Varrone et al. 2013). Furthermore, DIP occurs commonly in patients receiving neuroleptics (Ba and Martin 2015), and it has been reported to be more common among female and elderly patients (Kägi et al. 2010). DIP is generally thought to be associated with normal DAT imaging findings (Kägi et al. 2010; Brooks 2012; Ba and Martin 2015), although some patients might show slight degeneration of the nigrostriatal system (Lorberboym et al. 2006). In this study, several DIP cases appeared to visually show slight reductions of tracer binding in the putamen (cases 3–6 in Fig. 3). Because dementia and DIP are known to be indications where radionuclide imaging is deemed to be useful, the discrepancies in the interpretations of these cases are particularly alarming.

Three discrepant patients were (falsely) categorized as abnormal with the automated semi-quantitative method. It is of interest to note that two of these patients received a final diagnosis of essential tremor. These patients appeared to have uniform slight reductions in tracer binding over the whole striatum in the automated analysis. While the official indication of DAT SPECT is the differentiation between PD and ET (Scherfler and Nocker 2009; Brooks 2012; Kupsch et al. 2012), it has also been suggested that ET, as a heterogeneous disorder, could be a risk factor for PD (Kägi et al. 2010; Jiménez-Jiménez et al. 2012; Thenganatt and Jankovic 2016). Consistent with the present results, semi-quantitative analysis has previously been reported to show slightly reduced tracer binding in ET compared to healthy controls (Gerasimou et al. 2012; Waln et al. 2015). Moreover, it seems that ET could affect all parts of the striatum, unlike PD which induces asymmetrical reductions that are greatest in the putamen (Gerasimou et al. 2012).

The accuracy of visual analysis can indeed be improved by training, as the experts performed better than nonexperts both in the agreement with the automated analysis and in the inter-rater agreement. Nonexperts showed wide divergence in the visual evaluations, and they particularly tended to categorize normal cases incorrectly as abnormal, similarly to the less experienced readers in the study by Söderlund et al. (2013). Only 7–21 % of scans were categorized as normal by nonexperts, whereas both experts categorized 28 % and the automated analysis categorized 36 % of scans as normal. It thus seems evident that the visual interpretation of DAT SPECT scans should be performed by trained readers, ideally with years of experience. It is noteworthy that even without the support of an automated method, the visual evaluations by experts were consistent and reliable. The great majority of cases can thus be correctly diagnosed by visual analysis alone, whereas the value of the semi-quantitative analysis is emphasized in borderline cases that appear to show mild uptake defects. It should also be noted that the prevalence of discrepant cases is dependent on the proportion of borderline cases in the sample. In the recent study by Seibyl et al., excellent accuracy and inter-rater agreement were reported for expert image readers using only the visual analysis of [I-123]FP-CIT SPECT scans (Seibyl et al. 2014). Very good inter-rater agreement between experts in the visual analyses was also reported in an earlier study by Hauser and Grosset (2012). Nevertheless, the combined use of semi-quantification with the visual analysis is recommended by the European Association of Nuclear Medicine (EANM) (Scherfler and Nocker 2009; Buchert et al. 2015), and this combined approach was superior in the studies by Söderlund et al. (2013) and Albert et al. (2016). The findings by Albert et al. are consistent with the results of this study in suggesting that the use of the semi-quantitative method is important, especially in scans that are visually inconclusive or show dopaminergic striatal reductions (Albert et al. 2016). Furthermore, the vital role of semi-quantitative methods was underlined by Filippi et al. in PD diagnostics (Filippi et al. 2008), and in the study by Waln et al., in which semi-automated quantification showed sensitivity even in the differentiation between healthy controls and patients with ET (Waln et al. 2015).

In conclusion, our results suggest that although visual assessment of [I-123]FP-CIT SPECT images is challenging, it is generally reliable when performed by experienced observers. However, visual and automated analyses were discrepant in 10 % of cases. Patients with discrepant interpretations do not seem to develop neurodegenerative parkinsonian syndromes, highlighting the need for cautious interpretation in such cases. Diagnostic accuracy in these cases can be improved by including an automated semi-quantitative analysis in the assessment.