
1 Introduction

In the study of Alzheimer's disease (AD), increasing emphasis has been placed in recent years on early diagnosis. Currently available treatments can only mitigate the downstream effects of the disease process, whereas treatments under development aim to disrupt the disease process itself by interfering with the amyloid cascade, which is thought to be one of the underlying causes of AD [1]. To be effective, such treatments would have to begin before the patient shows the full symptoms of AD, which motivates the study of patients with mild cognitive impairment (MCI). Clinically, MCI is defined as isolated memory deficits that are not severe enough to affect normal daily living [2]. MCI patients convert to AD at a rate of 10–15% per year [3], although some develop other diseases or remain stable. Because stable and converting MCI (MCI-s and MCI-c) patients by definition have similar symptoms, the standard cognitive tests used to diagnose AD are of little help by themselves; instead, imaging and other biomarkers can be combined with machine learning methods to detect subtle differences between the groups. A classifier can be trained on labeled examples of MCI-s and MCI-c images or, alternatively, on examples of AD patients and healthy controls (HC), under the assumption that MCI-s subjects are more HC-like and MCI-c subjects more AD-like. Most such studies use magnetic resonance imaging (MRI), from which a variety of features can be extracted. However, results can be improved by combining MRI features in a multi-kernel framework with fluorodeoxyglucose positron emission tomography (FDG-PET), which measures metabolic activity, with biomarkers measured in a sample of cerebrospinal fluid (CSF), or with genetic information [4, 5].

A limiting factor in the accuracy of these studies may be the mislabeling of training subjects. The gold standard for diagnosing AD is autopsy, but most studies use subjects whose diagnosis has been determined by standard clinical testing, which has been shown to have an error rate of at least 10 % [6]. This issue has not been widely addressed. Aksu et al. [7] point out that training labels for MCI-s and MCI-c are uncertain, and generate their own MCI training labels by following the classification of MCI subjects by an HC versus AD classifier across multiple timepoints. However, even this approach neglects the uncertainty in the HC and AD labels on which the scheme ultimately depends.

Our proposed method follows [8] in abandoning discrete disease-state labels for training altogether. Like [8], we perform a regression to predict a continuous proxy for disease status, but instead of age we use brain atrophy over a period of one year, as measured by the boundary shift integral (BSI) [9], as the regression target. This provides a predicted atrophy rate for each test subject. We use Gaussian process (GP) regression [10] with a multiple kernel framework to optimally combine MRI, FDG-PET and CSF data. The resulting measure predicts MCI conversion within 3 years with a balanced accuracy of 74.6 %, comparable to state-of-the-art techniques that use much larger training sets, including our own previous work using multi-kernel GPs for classification [5].

2 Materials and Methods

2.1 Image and Biomarker Data

All data were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. The MRI images were T1-weighted structural scans from 1.5 T scanners, acquired at baseline and at 12-month follow-up. All were subjected to quality control and automatically corrected for spatial distortion caused by gradient nonlinearity and for B1 field inhomogeneity.

FDG-PET images were acquired 30–60 min post-injection according to the ADNI protocol, then averaged, spatially aligned, interpolated to a standard voxel size, intensity normalized, and smoothed to a common resolution of 8 mm full width at half maximum.

CSF samples were obtained by lumbar puncture around the time of each subject's baseline scan. Levels of the proteins amyloid-\(\beta _{42}\) (A\(\beta _{42}\)), tau, and phosphorylated tau were measured and recorded.

The original ADNI project collected baseline structural MRI scans for all subjects. However, FDG-PET scans and CSF samples were only collected for subsets of these subjects. Furthermore, calculating the BSI requires a 12-month follow-up structural MRI scan, which was also missing for some subjects. As our method requires baseline structural MRI, FDG-PET, CSF data and a 12-month BSI, only 129 subjects could be included in the study; their details are shown in Table 1. Subjects were classified as HC, AD or MCI by neuropsychological and clinical testing at the time of the baseline scan, and MCI conversion status was determined by whether subjects were diagnosed with AD at any point during the 36-month follow-up period.

Table 1. Subject groups and demographics

2.2 Image Processing

Probabilistic grey matter (GM) maps were produced from the native space baseline scans using the NiftySeg tool [11]. The native space images were also anatomically parcellated into 83 regions using a novel label fusion algorithm [12] in a multi-atlas label propagation scheme. The resulting parcellations were used to mask out the brainstem and cerebellum from the native space GM segmentations.

A custom groupwise template was also produced, with all registrations performed using the NiftyReg toolkit [13]. The masked native space GM segmentations were then warped into the groupwise space. Finally, the masked GM segmentations in groupwise space were modulated by the Jacobian determinants of this final deformation, which ensures that the total amount of tissue is preserved by the warp. The MRI features used were therefore voxel-level GM probabilities.
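To make the modulation step concrete, the following Python sketch assumes the masked GM map has already been warped into template space by an external registration tool (NiftyReg performs that step in the pipeline above; it is not called here) and that a Jacobian determinant map is available on the same grid, expressed as native volume per unit template volume. The function name is hypothetical.

```python
import numpy as np

def modulated_gm_features(warped_masked_gm, jacobian_det):
    """Voxel-level MRI features from a GM map already warped to the template.

    warped_masked_gm : GM probabilities (brainstem/cerebellum already masked out)
                       resampled into the groupwise template space
    jacobian_det     : Jacobian determinant of the template-to-native mapping used
                       for resampling (native volume per unit template volume),
                       sampled on the same template grid
    """
    # Modulation: scale each template voxel by the local volume change so that
    # the summed GM in template space matches the summed GM in native space.
    modulated = warped_masked_gm * jacobian_det
    # Flatten to one feature vector per subject.
    return modulated.ravel()
```

A template voxel that draws on twice its own volume of native tissue (Jacobian 2) keeps twice its GM probability, which is what preserves total tissue.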

The native space anatomical parcellations were also rigidly transferred into the space of each subject's FDG-PET image. The parcellation was used to normalise each FDG-PET image by its mean cerebellar activity and then to calculate the mean activity within each anatomical region, generating a set of 83 features for each FDG-PET image.
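A minimal sketch of the FDG-PET feature extraction, assuming the parcellation has already been rigidly resampled into PET space so that `pet` and `labels` share a grid. Which label values make up the cerebellum depends on the atlas, so `cerebellar_labels` is a hypothetical parameter, as is the function name.

```python
import numpy as np

def fdg_pet_regional_features(pet, labels, cerebellar_labels, region_labels):
    """Cerebellum-normalised mean FDG uptake per anatomical region.

    pet               : FDG-PET image array
    labels            : integer parcellation on the same grid as `pet`
    cerebellar_labels : label values treated as cerebellum (atlas-dependent)
    region_labels     : label values to summarise (83 regions in the paper)
    """
    # Intensity-normalise by mean cerebellar activity.
    cerebellum = np.isin(labels, cerebellar_labels)
    normalised = pet / pet[cerebellum].mean()
    # Mean normalised activity within each region gives one feature per region.
    return np.array([normalised[labels == r].mean() for r in region_labels])
```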

2.3 Boundary Shift Integral

The BSI is a robust method for assessing the volume loss of whole brains or brain regions from structural MRI. It calculates a change in volume by integrating the longitudinal shift in position of the boundary between CSF and GM surrounding the region of interest. Preprocessing is needed to extract the region of interest (in our case, the whole brain) from each image, to linearly align the baseline and follow-up images, and to correct for intensity inhomogeneity between scans. We use the latest version of the BSI [9], which uses a symmetric registration scheme to minimise bias and to provide properties desirable in an atrophy measurement, such as inverse consistency and transitivity between multiple timepoints.
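For intuition about what the BSI integrates, the sketch below implements the original, asymmetric BSI formulation rather than the symmetric version of [9] used here. Intensities in the boundary region are clipped to a window spanning the CSF-to-tissue transition, and the summed baseline-minus-follow-up difference is scaled to a volume. It assumes the images are already registered and intensity-normalised; the window limits and the function name are hypothetical.

```python
import numpy as np

def classic_bsi(baseline, followup, boundary_mask, i_low, i_high, voxel_volume):
    """Original (asymmetric) boundary shift integral over a boundary region.

    baseline, followup : registered, intensity-normalised images (same shape)
    boundary_mask      : boolean mask of the boundary region around the brain
    i_low, i_high      : intensity window spanning the CSF-to-tissue transition
    voxel_volume       : volume of one voxel (e.g. in ml)
    Returns the estimated volume lost between scans (positive for atrophy).
    """
    # Clip intensities so only the partial-volume transition band contributes.
    b = np.clip(baseline[boundary_mask], i_low, i_high)
    f = np.clip(followup[boundary_mask], i_low, i_high)
    # Tissue (bright on T1) replaced by CSF (dark) gives a positive difference;
    # dividing by the window width converts intensity change to voxel fractions.
    return voxel_volume * np.sum(b - f) / (i_high - i_low)
```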

We normalise the resulting volume changes by the baseline brain volume and by the exact interval between baseline and follow-up scans, and multiply by 100. This produces a normalised brain atrophy rate (BAR), in percent of baseline brain volume per year, for each subject. These rates are then used as targets in the regression analysis that follows.
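The normalisation is a single expression. The sketch assumes the BSI volume change is reported as a loss (positive for atrophy) in the same units as the baseline volume; the 365.25-day year used to convert the scan interval is an assumption, as the paper does not state its convention.

```python
def brain_atrophy_rate(volume_loss, baseline_volume, interval_years):
    """Normalised brain atrophy rate (BAR) in % of baseline brain volume per year.

    volume_loss     : BSI volume change between scans (positive = atrophy)
    baseline_volume : baseline brain volume, same units as volume_loss
    interval_years  : exact scan interval, in years
    """
    return 100.0 * volume_loss / (baseline_volume * interval_years)


def days_to_years(days):
    # Assumption: a mean year of 365.25 days.
    return days / 365.25
```

For example, a 12 ml loss from a 1200 ml brain over exactly one year gives a BAR of 1.0 %/year.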

2.4 Gaussian Process Regression

Gaussian processes (GPs) provide a Bayesian, kernelised framework for solving both regression and classification problems. As an in-depth explanation of GPs is beyond the scope of this paper, we refer the reader to [10] for a theoretical treatment and to our previous work [5] for an application of multi-kernel GPs to predicting conversion to AD.

Briefly, a GP places a multivariate Gaussian prior on the values of a latent function, and the observed targets are modelled as these values plus Gaussian noise. The covariance of the prior is given by a kernel function evaluated between pairs of training instances in \({\varvec{X}}\), together with a set of hyperparameters \(\theta \) that control the overall form of the prior and the noise variance. During training, the hyperparameters are learned from the training data \({\varvec{X}}\) and targets \({\varvec{y}}\) by type-II maximum likelihood, i.e. by maximising the marginal likelihood. Once the hyperparameters have been set, predictions on unseen data are made by conditioning on the training data and integrating over the resulting posterior of the latent function, which can be done analytically in the regression case.
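The analytic regression case can be sketched in a few lines of NumPy. This assumes a zero-mean prior (a constant offset can be absorbed by the kernel's bias term) and uses a Cholesky factorisation for numerical stability; the function names are illustrative, and the returned variance is that of the latent function, excluding observation noise.

```python
import numpy as np

def gp_predict(K_train, K_cross, k_test_diag, y, noise_var):
    """Zero-mean GP regression: posterior predictive mean and latent variance.

    K_train     : (n, n) prior covariance between training instances
    K_cross     : (n, m) covariance between training and test instances
    k_test_diag : (m,) prior variance of each test instance
    y           : (n,) training targets
    noise_var   : Gaussian noise variance sigma^2
    """
    n = K_train.shape[0]
    L = np.linalg.cholesky(K_train + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + s2 I)^-1 y
    mean = K_cross.T @ alpha
    v = np.linalg.solve(L, K_cross)
    var = k_test_diag - np.sum(v ** 2, axis=0)            # diag of posterior cov
    return mean, var

def log_marginal_likelihood(K_train, y, noise_var):
    """log p(y | X, theta): the objective maximised in type-II maximum likelihood."""
    n = K_train.shape[0]
    L = np.linalg.cholesky(K_train + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -1/2 y^T Ky^-1 y - 1/2 log|Ky| - n/2 log(2 pi), with log|Ky| = 2 sum log diag L
    return (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
            - 0.5 * n * np.log(2 * np.pi))
```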

2.5 Gaussian Processes as Multimodal Kernel Methods

GP regression is based on a covariance matrix K, a symmetric positive semi-definite matrix whose entry \(K_{ij}\) is given by a covariance kernel function k of the feature vectors \({\varvec{x}}_{i}\) and \({\varvec{x}}_{j}\) of the ith and jth subjects and of one or more hyperparameters \(\theta \). We use a linear kernel function, which is simply the scalar product of \({\varvec{x}}_{i}\) and \({\varvec{x}}_{j}\). As GPs belong to the family of kernel methods, a sum of valid kernels is a valid kernel, and a valid kernel multiplied by a positive scalar is also a valid kernel. This implies that, to perform multimodal regression, we can define our kernel function as the weighted sum of a number of subkernels, each calculated from the feature vectors representing a particular type of data, or modality, for each subject. Each subkernel has a scaling hyperparameter \(\alpha \) representing the modality's weight in the overall kernel, and a bias term \(\beta \) is also included in the sum. In the case of multimodal regression using information derived from the MRI, PET and CSF data for each subject, the overall kernel is therefore

$$\begin{aligned} K_{ij}=\alpha _{\text {MRI}}\,({\varvec{x}}_{i,\text {MRI}}\cdot {\varvec{x}}_{j,\text {MRI}})+\alpha _{\text {PET}}\,({\varvec{x}}_{i,\text {PET}}\cdot {\varvec{x}}_{j,\text {PET}})+\alpha _{\text {CSF}}\,({\varvec{x}}_{i,\text {CSF}}\cdot {\varvec{x}}_{j,\text {CSF}})+\beta \end{aligned}\tag{1}$$

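giving a total of four covariance hyperparameters (\(\alpha _{\text {MRI}}\), \(\alpha _{\text {PET}}\), \(\alpha _{\text {CSF}}\) and \(\beta \)) to set, in addition to the noise variance.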
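A sketch of Eq. (1) together with type-II maximum likelihood for its hyperparameters follows. The choice of optimiser (SciPy's L-BFGS-B with finite-difference gradients), the log parameterisation that keeps the \(\alpha \) weights, \(\beta \) and the noise variance positive, the starting point and the bounds are all assumptions for illustration, not details taken from this paper; the function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def combined_linear_kernel(A, B, alphas, beta):
    """Eq. (1): weighted sum of per-modality linear kernels plus a constant bias.

    A, B   : lists of per-modality feature matrices (rows = subjects), e.g.
             [X_mri, X_pet, X_csf]; both lists must use the same modality order
    alphas : one positive weight per modality
    beta   : positive bias term
    """
    return sum(a * (Xa @ Xb.T) for a, Xa, Xb in zip(alphas, A, B)) + beta

def neg_log_ml(log_params, X, y):
    """Negative log marginal likelihood; parameters are logs so they stay positive.

    log_params = [log alpha_1, ..., log alpha_M, log beta, log noise_var]
    """
    *log_alphas, log_beta, log_noise = log_params
    K = combined_linear_kernel(X, X, np.exp(log_alphas), np.exp(log_beta))
    L = np.linalg.cholesky(K + np.exp(log_noise) * np.eye(len(y)))
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ a + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

def fit_hyperparameters(X, y):
    """Type-II maximum likelihood for modality weights, bias and noise variance."""
    x0 = np.zeros(len(X) + 2)                # every hyperparameter starts at 1.0
    bounds = [(-10.0, 10.0)] * len(x0)       # keeps exp() well inside float range
    res = minimize(neg_log_ml, x0, args=(X, y), method="L-BFGS-B", bounds=bounds)
    *log_alphas, log_beta, log_noise = res.x
    return np.exp(log_alphas), np.exp(log_beta), np.exp(log_noise)
```

Learning the \(\alpha \) weights this way lets the data decide how much each modality contributes, without a separate tuning set.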

3 Results

To generate predicted BARs for all 129 subjects, we perform leave-one-out cross-validation (LOOCV) across the entire set. The correlation coefficient between predicted and measured BARs is 0.38 (p < 0.0001), and the root mean squared error is 0.61 %/year. However, our primary focus is not on the predicted BARs themselves, but on whether they can be used to predict conversion in MCI subjects. Figures 1 and 2 show the spread of measured and predicted BAR values, respectively, for all four groups (HC, MCI-s, MCI-c, AD).
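For a GP with fixed hyperparameters, LOOCV predictions need not be computed by refitting n times: a standard identity for GP regression gives all n leave-one-out predictive means from a single matrix inverse. The sketch below uses that shortcut and then computes the two summary statistics reported above. It assumes the hyperparameters are held fixed across folds, which may differ from re-learning them within each fold; the function names are hypothetical.

```python
import numpy as np

def loo_predictions(K, y, noise_var):
    """Leave-one-out predictive means for zero-mean GP regression in closed form.

    Uses mu_{-i} = y_i - [Ky^{-1} y]_i / [Ky^{-1}]_{ii}, with Ky = K + noise_var * I,
    which holds when the hyperparameters (here K and noise_var) are fixed across folds.
    """
    Ky_inv = np.linalg.inv(K + noise_var * np.eye(len(y)))
    return y - (Ky_inv @ y) / np.diag(Ky_inv)

def regression_metrics(y_true, y_pred):
    """Pearson correlation and root mean squared error between measured and predicted BAR."""
    r = np.corrcoef(y_true, y_pred)[0, 1]
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return r, rmse
```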

Fig. 1. Measured BAR across groups

Fig. 2. Predicted BAR across groups

As shown in Figs. 1 and 2, while the mean predicted BAR for each group is similar to the corresponding mean measured BAR, the predicted values for each clinical group form a much tighter cluster, even allowing for a few outliers (marked with +). This reduces the overlap between clinical groups, which is especially noticeable between the MCI-s and MCI-c groups. The resulting accuracy is 74.6 %, which is similar to the best previously reported results. Table 2 gives the balanced accuracy and area under the ROC curve (AUC), together with results for single modalities, demonstrating the benefit of combining sources of data with multi-kernel learning.
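Turning predicted BARs into a conversion forecast requires a decision threshold, which this section does not specify; in the sketch below `threshold` is a hypothetical input, and the function names are illustrative. Balanced accuracy is the mean of sensitivity (MCI-c flagged) and specificity (MCI-s not flagged), and the AUC is computed threshold-free as the probability that a randomly chosen converter has a higher predicted BAR than a randomly chosen stable subject.

```python
import numpy as np

def balanced_accuracy(is_converter, predicted_bar, threshold):
    """Mean of sensitivity and specificity when 'predicted BAR > threshold' flags MCI-c.

    is_converter  : boolean array, True for MCI-c and False for MCI-s
    predicted_bar : predicted atrophy rates for the same subjects
    """
    flagged = predicted_bar > threshold
    sensitivity = np.mean(flagged[is_converter])
    specificity = np.mean(~flagged[~is_converter])
    return 0.5 * (sensitivity + specificity)

def auc(is_converter, predicted_bar):
    """ROC AUC as P(converter scores above a stable subject), counting ties as 1/2."""
    pos = predicted_bar[is_converter][:, None]
    neg = predicted_bar[~is_converter][None, :]
    return np.mean((pos > neg) + 0.5 * (pos == neg))
```

Balanced accuracy is preferred over raw accuracy here because the MCI-s and MCI-c groups need not be the same size.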

We also compare our method with direct binary classification of conversion status, again using GPs. This can be done in three ways: by training on the MCI subjects only, within an LOOCV loop; by training on all subjects, again within an LOOCV loop, grouping HC subjects with MCI-s subjects and MCI-c subjects with AD subjects; and by training on the HC and AD subjects and testing on the MCI subjects. The results are given in Table 3.
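The three baselines differ only in which subjects are used for training and how the four clinical groups are collapsed into binary labels. The sketch below, with a hypothetical function name and group strings, encodes those choices as index sets; the GP classifier itself is not shown.

```python
import numpy as np

def binary_schemes(groups):
    """Training/evaluation set-ups for the direct-classification baselines.

    groups : sequence of 'HC', 'MCI-s', 'MCI-c' or 'AD', one per subject.
    Returns {name: (training pool, evaluation subjects, labels, uses_loocv)},
    with label 1 = AD-like (MCI-c, AD) and 0 = HC-like (HC, MCI-s).
    """
    groups = np.asarray(groups)
    mci = np.isin(groups, ["MCI-s", "MCI-c"])
    label = np.isin(groups, ["MCI-c", "AD"]).astype(int)
    mci_idx = np.where(mci)[0]
    return {
        # MCI subjects only, evaluated by LOOCV within the MCI group
        "mci_only": (mci_idx, mci_idx, label, True),
        # all subjects in the LOOCV loop, accuracy reported on MCI subjects
        "all_subjects": (np.arange(len(groups)), mci_idx, label, True),
        # train once on HC vs AD, test on every MCI subject
        "hc_ad_train": (np.where(~mci)[0], mci_idx, label, False),
    }
```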

 

Table 2. Accuracy of discrimination between MCI-s and MCI-c with predicted BAR
Table 3. Accuracy of discrimination between MCI-s and MCI-c with training on binary diagnostic class labels

4 Discussion

These results show a clear advantage for our method of training on a well-characterised proxy for MCI conversion rather than on diagnostic status itself. Training on BAR enables us to reach accuracies of up to 74.6 %, whereas training on diagnostic labels struggles to perform better than chance. It therefore appears that using BAR bypasses the problems caused by binary diagnostic labels. The data are also used more efficiently: all subjects can contribute to training regardless of diagnostic label, and because hyperparameters are learned automatically, no subjects need to be set aside for tuning. We also show an advantage for multimodal regression. Although direct comparisons between methods are difficult [5], the resulting accuracy in forecasting MCI conversion is among the best yet achieved. The main drawback of the proposed method is that all three types of data are required for the best results (although FDG-PET alone does almost as well), which limits the number of subjects that can be included. However, we intend to evaluate this method further as much greater numbers of subjects with all modalities become available in ADNI 2. Finally, while 12-month follow-up scans are required to calculate BSI values for the training data, it should be emphasised that they are not needed for the test data.