Introduction

Differentiating between Alzheimer’s disease (AD) and frontotemporal dementia (FTD) remains a significant clinical challenge. Current standard procedures for diagnosing dementia subtypes rely mostly on patient clinical history, cognitive assessments, and neuropsychological tests, while structural neuroimaging (MRI) is also routinely performed when available [1]. However, overlapping symptoms, early-onset AD, and atypical presentations and disease courses make accurate diagnosis with these tools more challenging [1,2,3]. For example, a considerable subset of behavioral-variant FTD (bvFTD) patients may show memory deficits similar to those of AD [4, 5], while AD patients may atypically present with poor executive functioning that can even exceed that of bvFTD [6, 7] and show less marked memory impairment [8].

The accurate distinction of dementia subtypes has important implications for all facets of patient management. One aspect pertains to the administration of pharmacological and psychological care [9]. For instance, acetylcholinesterase inhibitors produce modest yet discernible cognitive improvements in AD patients while exhibiting no such effects in FTD patients [10, 11]. Conversely, intranasal oxytocin has shown efficacy in ameliorating neuropsychiatric symptoms in FTD, though its effects in AD remain underexplored [11, 12]. More generally, the distribution of associated neuropsychiatric conditions varies between FTD and AD, necessitating tailored care strategies [3]. Furthermore, with the advent of disease-modifying treatments such as the recent FDA approval of lecanemab for early-stage AD [13], early and precise diagnosis of specific dementia subtypes has become more important than ever, as treatments increasingly target underlying disease etiologies rather than nonspecific symptoms.

Another aspect concerns the varying progressions and prognoses of dementia subtypes. For example, FTD patients, particularly those with the ALS variant, experience more rapid progression and shorter life expectancies compared to other subtypes [14, 15]. Differences between dementia subtypes should also be considered when evaluating the heritability risk of these conditions. Up to 50% of FTD cases may have a hereditary component (particularly associated with the MAPT gene), and an autosomal dominant pattern of inheritance can be identified in up to 20% of patients. However, the hereditary component is less significant in AD, with fewer than 5% of cases showing such a component, primarily due to mutations in the PSEN1 and PSEN2 genes [16]. Lastly, the projected threefold increase in the worldwide population of individuals living with dementia—from about 57 million in 2019 to an estimated 153 million by 2050—further highlights the escalating impact of this health concern and the necessity of precise diagnosis as the foundation for effective disease management [17].

Given the significance and complexity of diagnosing dementia subtypes, investigators and clinical trial sponsors attempting to develop new treatments for these conditions often find it necessary to employ additional diagnostic techniques such as positron-emission tomography (PET) or cerebrospinal fluid analysis to achieve a homogeneous and accurately diagnosed patient cohort [13, 18]. Even though these novel methods can indeed diagnose various dementia subtypes, sometimes even before the presentation of clinical signs and symptoms, their cost and time-intensive nature have hindered their integration into routine clinical practice and pose significant financial and temporal burdens on research studies [18]. Therefore, there is a pressing need to develop automated diagnostic procedures with high accuracy that could simplify clinical research studies and potentially evolve into routine clinical diagnostic techniques in the future.

The power of machine learning models to recognize the complex patterns and relationships characteristic of biomedical data is well known [19]. Consequently, it is unsurprising that several studies have attempted to apply machine learning methods to various clinical and paraclinical data to build tools that complement the diagnostic process for dementia. These data sources encompass demographic information, clinical presentation, past medical history, results of neuropsychological assessments, laboratory biomarkers, and findings from structural and functional (PET) imaging [20,21,22,23,24,25,26]. Surprisingly, however, none of these studies has leveraged data from resting-state fMRI (rs-fMRI) scans to train their models, despite previous reports of condition-specific alterations in rs-fMRI signals in dementia [27, 28]. For instance, impaired connectivity of the default mode network in AD patients, impairment of the salience network in bvFTD patients, and increased default mode network connectivity in bvFTD patients have been consistently reported [29]. Furthermore, conducting an rs-fMRI study is more cost-effective and less time-consuming than a PET study, and unlike PET, rs-fMRI can be safely repeated (e.g., in follow-up studies assessing disease progression) since the technique does not use radioactive isotopes [30]. This matters because PET plays an important role in current attempts to definitively diagnose dementia subtypes [18], and PET images have already been used to build machine learning models that classify dementia subtypes, albeit with limited accuracy [26].

Another shared characteristic among previous studies employing machine learning for diagnosing dementia subtypes is that none of them built models that simultaneously classified AD, mild cognitive impairment (MCI), FTD, and healthy controls (HC). Furthermore, the studies that classified FTD focused mainly on bvFTD, underrepresenting the other clinical subsets of FTD (the primary progressive aphasias [semantic and nonfluent variants]) even though these may constitute up to 28% of the FTD patient population [31].

Considering these areas for improvement, in this study we used rs-fMRI data to build a multi-class classification model that simultaneously identifies HC, MCI, AD, and FTD patients. In addition, while we do not separately classify the different subtypes of FTD, we have included data from all FTD subtypes in our FTD class. Unlike previous rs-fMRI studies, however, we did not follow the commonly used pathway of functional connectivity analysis and instead used raw time-course data. The variety of approaches and techniques for functional connectivity analysis (e.g., graph-theory network analysis, independent component analysis, seed-based analysis) makes reproducibility of findings difficult and might lead to divergent conclusions, whereas analyzing raw time-course data may be more conducive to reproducibility and widespread use [32].

As described in the “Methods” section, we compared three relatively interpretable machine learning algorithms to choose the model structure for our study. We opted for relatively interpretable methods because many of the most powerful machine learning models, especially those utilizing deep learning, are viewed as black boxes due to the difficulty of interpreting their decision-making process [33]. Given that machine learning models are unlikely to be perfect, interpretability and easily understandable visualization of the model’s decision-making process are key in any application of machine learning to medicine [19]. In our model selection experiment, gradient-boosted decision trees (XGBoost) showed superior classification performance compared to the two other algorithms. This was not unexpected, as XGBoost is widely regarded as the state of the art in numerous machine learning tasks involving tabular data, frequently outperforming deep learning models [34]. Moreover, XGBoost strikes a balance within the continuum of machine learning algorithms, robustly extracting nonlinear relationships while maintaining relative interpretability in its decision-making process. Hence, we used XGBoost to create the models for our study.

Methods

Databases and Imaging (fMRI)

Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (adni.loni.usc.edu) and the Frontotemporal Lobar Degeneration Neuroimaging Initiative (FTLDNI) databases. The ADNI was launched in 2003 as a public-private partnership led by Principal Investigator Michael W. Weiner, MD. FTLDNI launched in 2010 under the leadership of Dr. Howard Rosen, MD, at the University of California, San Francisco, with funding from the National Institute of Aging. For up-to-date information regarding these databases, please see www.adni-info.org and http://memory.ucsf.edu/research/studies/nifd. This study was exempt from IRB review due to the public availability of ADNI and FTLDNI and the strict deidentification of data within them.

According to the ADNI acquisition protocol, resting-state fMRI data were acquired with a gradient echo planar imaging (EPI) sequence (TR = 3000 ms; TE = 30 ms; matrix = 64 × 64; flip angle = 80°; voxel size = 3.313 mm × 3.313 mm × 3.313 mm; 48 slices) on a 3 T Philips scanner [35]. FTLDNI was launched based on the infrastructure established by ADNI and shares similar acquisition protocols.

This study used all the available rs-fMRIs in these databases (1351 fMRI scans [AD, 101; FTD, 396; HC, 470; MCI, 384] obtained from 434 patients [AD, 32; FTD, 151; HC from ADNI, 51; HC from FTLDNI, 96; MCI, 103]) and their linked clinical data.

Preprocessing of Imaging Data

Resting-state fMRI data from the ADNI and FTLDNI databases were preprocessed using CONN [36] (RRID:SCR_009550) release 20.b and SPM [37] (RRID:SCR_007037) release 12.7771. Functional and anatomical data were preprocessed using CONN’s automated preprocessing pipeline. Then, the functional data was denoised using CONN’s standard denoising pipeline. The details of these pipelines are presented in Online Resource 1.

After this step, whole-brain gray matter was parcellated into 200 regions of interest (ROIs) based on a voxel-scale functional connectivity parcellation atlas by Schaefer et al. [38]. The time course of each ROI was expressed as the first eigenvariate of the processed time series across all voxels in the ROI [38, 39].

After completing CONN preprocessing and extracting time courses, our dataset still contained data from the 1351 fMRI scans described above. In preparation for our analysis, we excluded the fMRI scans with more than 50% of their volumes identified as outliers (e.g., due to excessive motion artifacts) by CONN’s preprocessing pipelines, leaving 1084 scans for further analysis. Of the 267 excluded scans, 11 (11% of the class’s scans), 133 (34%), 86 (18%), and 37 (10%) were from the AD, FTD, HC, and MCI classes, respectively. The number of patients remained the same after excluding these scans. A summary of the clinical characteristics associated with the remaining scans is presented in Table 1. Finally, scans with more than 140 time points were truncated so that the final form of the time-course data became a 1084 × 200 × 140 array, corresponding to data recorded from 200 brain parcels (time series channels) over 140 time points in 1084 scans.
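The truncation and array assembly step can be expressed compactly. The following is a minimal sketch (not the study’s code), assuming the per-scan ROI time courses are already available as a list of 200 × n arrays:

```python
import numpy as np

def build_timecourse_array(roi_timeseries, n_timepoints=140):
    """Truncate each scan to its first 140 time points and stack.

    `roi_timeseries` is a hypothetical list of per-scan arrays shaped
    (200 parcels, n >= 140 time points).
    """
    truncated = [ts[:, :n_timepoints] for ts in roi_timeseries]
    return np.stack(truncated)  # shape: (n_scans, 200, 140)
```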

Table 1 Summary of the clinical and demographic information associated with the resting-state fMRI scans

It is important to note that many patients in ADNI and FTLDNI underwent multiple rs-fMRI scans over a span of several years. In our final dataset, all patients with repeated scans retained their initial diagnosis (MCI, AD, FTD, HC) in subsequent studies, and no instances of progression from MCI to AD were recorded. Consequently, scans from the same patient would be expected to be more highly correlated than scans from different patients. Such correlation poses a risk of overfitting when it occurs within the training data and a risk of artificially inflated performance metrics when it spans the training and test data.

As explained later, we chose the XGBoost algorithm to construct our models. XGBoost incorporates regularization both in the objective function that it optimizes and by virtue of being an ensemble of weak learners [40], enabling it to counter the overfitting challenge. Concerning the issue of potentially inflated metrics, we present the unseen test-set metrics before and after excluding repeat scans (retaining only the initial scan for each patient) from the test sets in the models discussed in the text.

Feature Extraction

Following these steps, the time series data underwent feature extraction. To this end, we calculated a set of features relevant to time series data using the tsfresh package in Python [41]. The features calculated from the data are presented in Table 2. The resulting fMRI features dataset had 1084 rows (one per scan) and 75,400 columns (one per parcel-feature pair).
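As an illustration, tsfresh-based extraction might look like the sketch below. The feature calculators shown are placeholders (the actual feature set is listed in Table 2), and the input is assumed to be a long-format DataFrame with one row per (scan, parcel, time point):

```python
import pandas as pd
from tsfresh import extract_features

# Placeholder feature dictionary; the study's features are those in Table 2.
fc_parameters = {
    "mean": None,
    "standard_deviation": None,
    "abs_energy": None,
    "sample_entropy": None,
}

def extract_fmri_features(long_df: pd.DataFrame) -> pd.DataFrame:
    # Expected columns (hypothetical names): scan_id, parcel, time, value.
    return extract_features(
        long_df,
        column_id="scan_id",      # one output row per scan
        column_kind="parcel",     # features computed per ROI channel
        column_sort="time",
        column_value="value",
        default_fc_parameters=fc_parameters,
    )  # -> one column per (parcel, feature) pair
```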

Table 2 Features calculated from the time series data

Clinical and Demographic Data

To complement the fMRI data, we extracted the clinical and demographic variables associated with the imaging studies. We identified the following variables as present and similarly measured in both the ADNI and FTLDNI databases: date of birth (enabling the calculation of age), sex, education, Mini-Mental State Examination (MMSE) total score, Clinical Dementia Rating (CDR) total score, forward and backward memory span tests, Boston Naming Test (BNT) score, letter verbal fluency test score, the 15-item Geriatric Depression Scale (GDS) total score, and Functional Activities Questionnaire scores. Throughout the paper, we refer to these variables as clinical (rather than clinical and demographic) variables. We used an algorithm in R to assign values from neuropsychological tests to imaging studies if the tests were performed within 1 year of the fMRI scan. The forward and backward memory span tests were excluded from the data due to a very high percentage (> 60%) of missing values. All other variables had ≤ 25% missing data. However, the Functional Activities Questionnaire scores were also excluded due to the large discrepancy in missing-data rates between the two databases: 40% of FTD scans (FTLDNI) lacked associated Functional Activities Questionnaire scores, whereas less than 3.4% of AD scans (ADNI) did.
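Although the study performed this matching in R, the logic can be sketched in Python with pandas; all column names below are illustrative:

```python
import pandas as pd

def link_clinical_to_scans(scans: pd.DataFrame, tests: pd.DataFrame) -> pd.DataFrame:
    """Attach the nearest neuropsychological assessment to each scan,
    provided it was performed within one year of the fMRI."""
    scans = scans.sort_values("scan_date")
    tests = tests.sort_values("test_date")
    return pd.merge_asof(
        scans, tests,
        left_on="scan_date", right_on="test_date",
        by="patient_id",                    # match within the same patient
        direction="nearest",                # closest assessment before or after
        tolerance=pd.Timedelta(days=365),   # within 1 year of the scan
    )
```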

Model Selection

To select the optimal machine learning model for our study, we compared three well-known and relatively interpretable methods: multinomial logistic regression (LR) and decision trees (DT) from the scikit-learn module and gradient-boosted decision trees from the XGBoost (Extreme Gradient Boosting) module in Python.

To compare the three models, we used a fivefold cross-validation procedure in which the data were split into five folds with nearly identical percentages of all classes based on patient IDs (i.e., the set divisions were performed on patients, not fMRI scans). Although the final classifications were performed on individual fMRI scans (regardless of the patient they belonged to) rather than on patients, splitting by patient prevented scans from a single patient from appearing in both the train and test sets, thus avoiding information leakage and biased performance metrics. In each round of the CV, one of the folds (~ 20% of the scans) was used as the unseen hold-out test set, while the remaining four folds (~ 80% of the scans) were used as the train set. We did not perform hyperparameter optimization in this experiment, as we intended to compare baseline model performances. As shown in Tables 1 and 2 in Online Resource 1, the XGBoost model achieved significantly better metrics than the LR and DT models. As a result, we used and optimized the XGBoost model for the remainder of this study.
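One way to implement such a patient-level, class-stratified split with scikit-learn (the study’s exact splitting code is not reproduced here) is shown below; `X`, `y`, and `patient_ids` are assumed to be arrays aligned by scan:

```python
from sklearn.model_selection import StratifiedGroupKFold

# Folds approximate the overall class proportions while keeping all scans
# from one patient in the same fold, preventing train/test leakage.
cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y, groups=patient_ids):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # ... fit the baseline LR, DT, and XGBoost models on the train folds
    # ... and evaluate on the held-out fold
```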

A notable feature of XGBoost is its built-in regularization, which makes it well suited to the overfitting challenge posed by the repeated scans in our dataset. The ensemble nature of XGBoost contributes to its generalizability and robustness against overfitting [42]. Furthermore, XGBoost primarily optimizes the following objective function:

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} L\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{k=1}^{m} \Omega\left(g_k\right)$$

where \(n\) is the number of training examples, \(y_i\) is the true target value for the \(i\)th example, \(\hat{y}_i^{(t)}\) is the predicted value in the \(t\)th iteration, \(L\) is the loss function, \(m\) is the number of trees (boosting rounds) in the ensemble, \(g_k\) is the \(k\)th tree in the ensemble, and \(\Omega\left(g_k\right)\) is the regularization term applied to each tree [42]. \(\Omega\left(g_k\right)\) itself is given by the following formula:

$$\Omega\left(g_k\right) = \gamma T + \lambda_1 \sum_{j=1}^{T} \left|\omega_j\right| + \frac{1}{2}\lambda_2 \sum_{j=1}^{T} \omega_j^2$$

where \(\gamma\) is a parameter that controls the overall complexity of the tree, \(T\) is the number of leaves in the tree \(g_k\), \(\omega_j\) is the weight associated with the \(j\)th leaf of the tree, \(\lambda_1\) is the regularization parameter controlling the strength of the L1 (lasso) penalty, and \(\lambda_2\) is the regularization parameter controlling the strength of the L2 (ridge) penalty. In this study, we found no enhancement in model performance when incorporating the L1 and L2 penalties. As a result, \(\lambda_1\) and \(\lambda_2\) were set to zero, and we only optimized the \(\gamma\) hyperparameter.
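In the XGBoost Python API, these choices map onto the `reg_alpha` (\(\lambda_1\)), `reg_lambda` (\(\lambda_2\)), and `gamma` (\(\gamma\)) parameters. A hedged sketch of the resulting configuration follows; values other than the disabled penalties are placeholders rather than the study’s tuned settings:

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    objective="multi:softprob",  # four-class probabilistic classification
    reg_alpha=0.0,     # lambda_1: L1 penalty disabled, as in the study
    reg_lambda=0.0,    # lambda_2: L2 penalty disabled, as in the study
    gamma=1.0,         # gamma: per-leaf complexity penalty, tuned via internal CV
    n_estimators=400,  # placeholder number of boosting rounds (m)
)
```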

Nested K-Fold Cross-Validation Procedure

To build the classification models, we used stratified, nested k-fold cross-validation (CV). The nested approach involves a two-layered technique with an external and an internal CV process. This approach enables model hyperparameter optimization while preventing model overfitting to the data and the presentation of overly optimistic metrics [43]. The external CV process was conducted identically to the model selection experiment described above and had five rounds. The internal CV was performed on the train set (and not on the whole dataset) created in each round of the external CV and was used to optimize model hyperparameters. Furthermore, variable selection in the imaging features data, imputation of missing values in clinical data, and standardization of the clinical data and selected imaging features were performed based on the train set defined by the external CV process. These steps ensured that no information leakage occurred between the train and unseen test sets during each external CV round.

For variable selection, the imaging features data were used to create an initial classification model on the train set using XGBoost [40]. This model was then used to determine variable importances based on the average improvement (gain) in the training set loss attributed to each feature. These importance values were normalized to a unit sum, and the features with a normalized importance of at least 0.01 were selected. A summary of the selected features by cortical region, including the number of times each was selected during the external CV process and its average variable importance rank across the folds, is presented in Table 3. A more complete analysis, including summaries based specifically on brain region or mathematical feature, is presented in Online Resource 2.
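A minimal sketch of this gain-based selection, assuming the imaging features sit in a pandas DataFrame `X_train` with a label vector `y_train`:

```python
from xgboost import XGBClassifier

def select_features_by_gain(X_train, y_train, threshold=0.01):
    # Fit an initial, untuned model purely to rank the features.
    probe = XGBClassifier(importance_type="gain")
    probe.fit(X_train, y_train)
    importances = probe.feature_importances_  # gain-based, normalized to unit sum
    return X_train.columns[importances >= threshold]
```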

Table 3 Selected fMRI imaging features summarized by cortical region

After this step, missing values in clinical data were imputed. The mean (for quantitative variables) and mode (for patient sex, the only qualitative variable in the clinical data) were calculated from the external CV round’s train set and were used to replace missing values in both the train and test sets. The imaging features data did not require any imputation as all values were available. Next, the clinical data and the selected features were standardized. In each round, means and standard deviations for each variable were calculated from the external CV’s train set and used to standardize that variable’s values in both the train and test sets.
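A sketch of this train-fitted preprocessing with scikit-learn (variable names are illustrative; the study’s code may differ):

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Statistics are learned on the external round's train set only and then
# applied to both sets, so no information leaks from the test set.
num_imputer = SimpleImputer(strategy="mean")           # quantitative clinical variables
sex_imputer = SimpleImputer(strategy="most_frequent")  # patient sex (qualitative)
scaler = StandardScaler()

X_train_num = scaler.fit_transform(num_imputer.fit_transform(X_train_num))
X_test_num = scaler.transform(num_imputer.transform(X_test_num))
X_train_sex = sex_imputer.fit_transform(X_train_sex)
X_test_sex = sex_imputer.transform(X_test_sex)
```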

Finally, the internal CV process within the train set of each external CV round was used to create that round’s final classification model with the preprocessed selected imaging features and clinical variables. For the internal CV, the RandomizedSearchCV function from the scikit-learn module was used to perform a threefold CV and find the optimal set of model hyperparameters. The final optimized model was then used to generate predictions and performance metrics on the external CV round’s unseen test set. Furthermore, the relative importances of the features fed into each model were calculated and are presented in Online Resource 4.
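The internal search can be sketched as follows; the parameter distributions are placeholders, since the study’s exact search space is not reproduced here:

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_distributions = {                  # illustrative search space
    "gamma": uniform(0, 5),
    "max_depth": randint(2, 10),
    "learning_rate": uniform(0.01, 0.3),
    "n_estimators": randint(100, 600),
}
search = RandomizedSearchCV(
    XGBClassifier(reg_alpha=0.0, reg_lambda=0.0),
    param_distributions=param_distributions,
    n_iter=50,                           # placeholder search budget
    cv=3,                                # threefold internal CV on the train set
    scoring="balanced_accuracy",
    random_state=0,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_      # then evaluated on the unseen test set
```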

Reported Metrics and Statistical Analysis

In this paper, we report the mean and 95% confidence interval for each metric as calculated from the unseen hold-out test sets of the five external cross-validation rounds.

The calculated metrics include balanced accuracy (with class-balanced sample weights according to the inverse prevalence of each target class), area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), F1 score, positive likelihood ratio (LR+), negative likelihood ratio (LR−), and area under the precision-recall (PPV-sensitivity) curve (AUC-PR). We used balanced accuracy instead of accuracy due to the imbalanced class proportions in our dataset. For balanced accuracy, F1 score, AUC, and AUC-PR, two sets of metrics were calculated: one-vs-rest metrics for each class (turning the multi-class classification into a binary classification) and a macro-averaged metric (unweighted average of the metric between all classes to maintain the influence of infrequent classes) to represent the overall performance of that model.
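For reference, the headline metrics can be computed with scikit-learn as in the sketch below; `y_test`, `y_pred`, and `y_proba` are illustrative fold-level arrays:

```python
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score

# Balanced accuracy equals plain accuracy computed with sample weights
# inversely proportional to each true class's prevalence.
bal_acc = balanced_accuracy_score(y_test, y_pred)
f1_per_class = f1_score(y_test, y_pred, average=None)  # one-vs-rest, per class
f1_macro = f1_score(y_test, y_pred, average="macro")   # unweighted class average
auc_macro = roc_auc_score(y_test, y_proba, multi_class="ovr", average="macro")
```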

Ultimately, the performance metrics of the models were compared using t-tests. In cases of multiple testing (pairwise comparisons of the imaging features-only, best clinical-only, best combined clinical and imaging features, imaging features + CDR, and imaging features + MMSE models), the reported p-values are adjusted using the Bonferroni method. The reported effect sizes were calculated as Cohen’s d.
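A sketch of one such comparison, assuming `metric_a` and `metric_b` hold the five per-fold values for two models (so the t-test has 8 degrees of freedom):

```python
import numpy as np
from scipy import stats

def compare_models(metric_a, metric_b, n_comparisons=1):
    t, p = stats.ttest_ind(metric_a, metric_b)  # independent-samples t-test
    # Cohen's d with the pooled standard deviation (equal fold counts).
    pooled_sd = np.sqrt((np.var(metric_a, ddof=1) + np.var(metric_b, ddof=1)) / 2)
    d = (np.mean(metric_a) - np.mean(metric_b)) / pooled_sd
    return t, min(p * n_comparisons, 1.0), d    # Bonferroni-adjusted p-value
```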

All analyses in this study were performed using the pandas (version 1.5.3), numpy (version 1.25.2), scipy (version 1.11.4), researchpy (version 0.3.6), scikit-learn (version 1.2.2), and XGBoost (version 1.7.6) modules in Python (version 3.10.12). All figures were created using the matplotlib (version 3.7.1), seaborn (version 0.13.1), and dtreeviz (version 2.2.2) modules in Python and Inkscape (version 1.3.2).

Results

Model with fMRI Features Data

The fMRI features data were fed into the XGBoost algorithm to create a classification model. As explained in the “Methods” section, the features underwent a selection process before being used to build the final classification model. This model achieved a mean balanced accuracy of 74.4% (95% CI, 72.1–76.7%) and an average classification accuracy of 99.2% (95% CI, 97.9–100.0%) in FTD, 99.0% (95% CI, 98.3–99.7%) in HC, and 93.9% (95% CI, 88.3–99.5%) in MCI scans in the unseen test sets of the five external CV rounds. However, the model exhibited an average accuracy of only 5.5% (95% CI, 0.0–15.8%) in classifying AD scans, misclassifying the rest as MCI (Fig. 1). These classification accuracies were reflected in the model’s F1 scores: 0.08 (95% CI, 0.00–0.21) in AD, 0.99 (95% CI, 0.99–1.00) in FTD, 0.99 (95% CI, 0.99–1.00) in HC, and 0.86 (95% CI, 0.82–0.89) in MCI scans, and 0.73 (95% CI, 0.69–0.77) overall (macro-averaged). Exclusion of the repeat scans from the unseen test sets did not affect the model’s metrics (overall balanced accuracy: 74.6% [95% CI, 71.2–78.0%; t(8) = 0.30, p = 0.770, d = 0.19]; overall F1 score: 0.73 [95% CI, 0.69–0.78; t(8) = 0.00, p = 1.000, d = 0.00]). The complete set of metrics and feature importances for this model may be viewed in Online Resource 4.

Fig. 1
figure 1

XGBoost classification model using selected features from fMRI scans. a Confusion matrix of the model’s predictions vs ground truth. b One-vs-Rest classification ROC curves and their corresponding AUC values of individual classes and their unweighted averages (macro-average). AD, Alzheimer’s disease; FTD, frontotemporal dementia; HC, healthy control; MCI, mild cognitive impairment; OvR, one-versus-rest

Model with Clinical Data

Next, we used all available clinical data to create a classification model. This model reached 70.6% (95% CI, 63.6–77.7%) balanced accuracy (Fig. 2a) in the unseen test sets. The F1 scores were 0.68 (95% CI, 0.60–0.76), 0.79 (95% CI, 0.73–0.85), 0.68 (95% CI, 0.56–0.80), and 0.72 (95% CI, 0.61–0.83) for the AD, FTD, HC, and MCI classes, respectively, and 0.72 (95% CI, 0.62–0.81) overall. In terms of variable importance, the CDR score was ranked first in all cross-validation rounds (average variable importance = 0.29, average rank = 1.2), followed by education (0.20, 2.2), MMSE score (0.17, 2.8), letter verbal fluency test score (0.11, 3.8), BNT score (0.06, 5.8), age (0.06, 6.2), sex (0.06, 6.8), and GDS (0.05, 7.2). Exclusion of the repeat scans from the unseen test sets yielded significantly better metrics (overall balanced accuracy: 74.5% [95% CI, 69.2–79.9%; t(8) = 2.73, p = 0.026, d = 1.73]; overall F1 score: 0.76 [95% CI, 0.72–0.81; t(8) = 2.36, p = 0.046, d = 1.49]).

Fig. 2
figure 2

XGBoost classification models using clinical and demographic information. a Confusion matrix and One-vs-Rest classification ROC curves of the model built using all clinical variables (included variables: age, sex, education, CDR score, MMSE score, BNT score, letter verbal fluency test score, and GDS score). b Confusion matrix and One-vs-Rest classification ROC curves of the best model built using clinical variables (included variables: CDR score, MMSE score, BNT score, letter verbal fluency test score, and GDS score). AD, Alzheimer’s disease; BNT, Boston Naming Test; CDR, Clinical Dementia Rating; FTD, frontotemporal dementia; GDS, 15-item Geriatric Depression Scale; HC, healthy control; MCI, mild cognitive impairment; MMSE, Mini-Mental State Examination; OvR, one-versus-rest

Furthermore, we assessed the performance of models built with all possible combinations (subsets) of clinical variables (Online Resource 3). Of these models, the best performance was achieved using CDR score, MMSE score, BNT score, letter verbal fluency test score, and GDS score. As shown in Fig. 2b, a model using these variables achieved 71.6% (95% CI, 64.9–78.2%) balanced accuracy in the unseen test sets. Exclusion of the repeat scans from the unseen test sets yielded significantly better overall balanced accuracy: 75.9% (95% CI, 68.5–83.4%; t(8) = 2.67, p = 0.028, d = 1.69). The complete set of metrics for these models may be viewed in Online Resource 4.

Model with Combined Imaging and Clinical Information

In pursuit of better model performance, we built an XGBoost classifier using all the available clinical variables and fMRI features. The balanced accuracy for this model was 90.4% (95% CI, 87.6–93.2%) in the unseen test sets (Fig. 3a). Exclusion of the repeat scans from the unseen test sets did not affect overall balanced accuracy: 90.6% (95% CI, 87.0–94.2%; t(8) = 0.27, p = 0.792, d = 0.17).

Fig. 3
figure 3

XGBoost classification models using a combination of clinical and imaging data. a Confusion matrix and One-vs-Rest classification ROC curves of the model built using fMRI features and all clinical variables. b Confusion matrix and One-vs-Rest classification ROC curves of the model built using fMRI features and all clinical features except for age. AD, Alzheimer’s disease; CDR, Clinical Dementia Rating; FTD, frontotemporal dementia; HC, healthy control; MCI, mild cognitive impairment; MMSE, Mini-Mental State Examination; OvR, one-versus-rest

Interestingly, a slightly higher balanced accuracy in the test sets was achieved when combining all clinical variables except age with imaging features (balanced accuracy = 91.1% [95% CI, 87.1–95.1%]). This was our best-performing model, with an average accuracy of 98.8% (95% CI, 96.5–100.0%) for FTD, 98.8% (95% CI, 97.2–100.0%) for HC, and 95.3% (95% CI, 90.2–100.0%) for MCI scans. Once again, the lowest class-specific accuracy was observed in AD scans, at only 71.6% (95% CI, 56.8–86.4%), with the remaining scans misclassified as MCI. The F1 scores were 0.74 (95% CI, 0.62–0.87), 0.99 (95% CI, 0.98–1.00), 0.99 (95% CI, 0.98–1.00), and 0.93 (95% CI, 0.90–0.97) for the AD, FTD, HC, and MCI classes, respectively, and 0.92 (95% CI, 0.87–0.96) overall (Fig. 3b). Exclusion of the repeat scans from the unseen test sets did not affect the metrics (overall balanced accuracy: 91.1% [95% CI, 87.1–95.1%; t(8) = 0.00, p = 1.000, d = 0.00]; overall F1 score: 0.92 [95% CI, 0.88–0.95; t(8) = 0.00, p = 1.000, d = 0.00]). The complete set of metrics and feature importances for these models may be viewed in Online Resource 4.

The balanced accuracy, F1 score, and AUC of the best combined model were significantly better than their corresponding values from both the best clinical (t(8) = 15.45, p < 0.001, d = 9.77; t(8) = 12.34, p < 0.001, d = 7.80; t(8) = 21.73, p < 0.001, d = 13.74) and the imaging features-only (t(8) = 14.15, p < 0.001, d = 22.37; t(8) = 19.59, p < 0.001, d = 12.39; t(8) = 20.69, p < 0.001, d = 13.09) models.

Minimizing the Number of Variables Used in the Combined Model

To find the smallest model with acceptable accuracy, we built XGBoost classifiers using all combinations (subsets) of clinical variables and fMRI features. The balanced accuracy and AUC metrics (as calculated in the unseen test sets of the external CV rounds) for a selection of these models are presented in Table 4. The metrics for all the models may be viewed in Online Resource 3.

Table 4 Performance metrics of various XGBoost models built using the included variables

For example, a model using a smaller subset of clinical variables (education, MMSE score, CDR score, and BNT score) achieved very similar metrics (balanced accuracy = 91.1% [95% CI, 87.6–94.6%]) to the best-performing model mentioned in the previous section. On the other hand, the smallest model with a mean balanced accuracy above 90% used only the CDR score in combination with imaging features (balanced accuracy = 90.7% [95% CI, 84.2–97.2%]) (Fig. 4a). Exclusion of the repeat scans from the unseen test sets did not affect overall balanced accuracy: 89.3% (95% CI, 83.1–95.4%; t(8) = −0.97, p = 0.360, d = 0.61). In addition, the balanced accuracy and F1 score metrics of the imaging features + CDR model were not significantly different from those of the best combined model (t(8) = 0.00, p = 1.000, d = 0.00; t(8) = 0.00, p = 1.000, d = 0.00), but its AUC was significantly lower (t(8) = −4.14, p = 0.032, d = −2.62).

Fig. 4
figure 4

Minimizing the number of clinical variables used in the XGBoost classification models. a Confusion matrix and One-vs-Rest classification ROC curves of the model built using fMRI features and CDR score. b Confusion matrix and One-vs-Rest classification ROC curves of the model built using fMRI features and MMSE score. AD, Alzheimer’s disease; CDR, Clinical Dementia Rating; FTD, frontotemporal dementia; HC, healthy control; MCI, mild cognitive impairment; MMSE, Mini-Mental State Examination; OvR, one-versus-rest

Furthermore, while achieving slightly lower metrics, simpler models, which might be more practical in the clinical setting (regarding the time and expertise required to gather the clinical information), still demonstrated acceptable performance. For example, the model using sex, education, and MMSE score alongside imaging features achieved a balanced accuracy of 89.1% (95% CI, 84.7–93.5%). Similarly, the model using only the MMSE score alongside imaging features achieved a balanced accuracy of 88.7% (95% CI, 86.3–91.1%) (Fig. 4b). Exclusion of the repeat scans from the unseen test sets did not affect the overall balanced accuracy (88.7% [95% CI, 85.1–92.3%]; t(8) = 0.00, p = 1.000, d = 0.00) of this model. In addition, the balanced accuracy, F1 score, and AUC metrics of the imaging features + MMSE model were not significantly different from those of the best combined model (t(8) = −2.63, p = 0.301, d = 1.66; t(8) = 2.06, p = 0.731, d = 1.30; t(8) = 0.00, p = 1.000, d = 0.00). Figure 5 shows one of the estimator trees in this model; the complete set of metrics for the models shown in Fig. 4 and their feature importances may be viewed in Online Resource 4.

Fig. 5
figure 5

A representative decision tree from the fMRI features + MMSE model. The final model for this round of cross-validation is an aggregate of 400 trees similar to the one shown in the figure. The number 400 was derived during hyperparameter optimization. AD, Alzheimer’s disease; FTD, frontotemporal dementia; HC, healthy control; MCI, mild cognitive impairment; MMSE, Mini-Mental State Examination

Discussion

In this study, we have developed a multimodal machine learning model aimed at classifying three subtypes of dementia (MCI, AD, FTD) along with healthy controls. The model exhibits a balanced accuracy of ~90% when tested on previously unseen data, using an average of 15 features extracted from rs-fMRI time course data combined with minimal clinical information (only the MMSE or CDR scores). One exciting aspect of our findings was that the model built using only fMRI features accurately classified FTD, HC, and MCI scans but encountered challenges in identifying AD scans, with nearly all of them misclassified as MCI. This is in line with previous studies, which have shown significant differences between the rs-fMRIs of FTD and AD patients [28], whereas rs-fMRIs from MCI patients show alterations similar to those of AD patients but of lower magnitude [44]. Nevertheless, the low classification accuracy for the AD group may also stem from the comparatively fewer scans in this class. This explanation is supported by the high variability (manifested as wide confidence intervals) of AD classification accuracies across the cross-validation rounds.

The features selected using XGBoost’s variable importance measure may also provide valuable insights (Table 3). For example, the most consistently selected features belonged to the left visual cortex. Functional connectivity changes in the visual networks and occipital cortex may indeed be early differentiators of FTD (particularly, bvFTD) and AD [28]. Alongside the left visual cortex, features from the prefrontal cortices in both hemispheres were frequently selected. Functional connectivity studies have consistently demonstrated dysfunctions in the anterior (ventromedial, anteromedial, and dorsal prefrontal cortex) default mode network (DMN) in the brains of AD and MCI patients [45]. These alterations can differentiate AD patients from healthy individuals [46], and the decline in DMN connectivity is associated with MCI to AD progression [47]. Decreased coherent activity in the dorsal prefrontal component of the anterior DMN has also been observed in behavioral variant FTD [48]. The temporal and somatomotor cortices were also common among the selected features. Alterations of temporal lobe activity and functional connectivity are widespread in both FTD [49] and AD [50]. Somatomotor cortex alterations possibly underlie the motor dysfunctions observed in AD, even during simple tasks in its early stages [51, 52], and contribute to motor signs in FTD [53].

Adding clinical information to the rs-fMRI features helped the model better classify AD and MCI scans. In our study, we were limited by the clinical tests that were similarly available in ADNI and FTLDNI, and future studies may show that tests other than MMSE or CDR better complement the fMRI features data. For example, the short form of the Montreal Cognitive Assessment (MoCA) would be an interesting candidate because it has a similar sensitivity to the full version and takes less time to perform than either MMSE or CDR in the clinic [54].

In conclusion, our study demonstrates that a multimodal model based on features derived from rs-fMRI time course data along with minimal clinical information offers an automated and highly accurate approach for classifying AD-MCI vs. HC vs. FTD. However, it is important to acknowledge the limitations that impact our study and conclusions. The class imbalance within our dataset, particularly the relatively fewer scans in the AD class, and the greater proportion of repeat scans in the AD and MCI classes compared to the FTD class may have constrained the model’s ability to differentiate AD and MCI scans. Another limitation of our model is that XGBoost’s enhanced performance relative to traditional decision trees comes at the cost of decreased transparency. While XGBoost effectively quantifies the overall significance of each feature in its classifications, the randomness inherent in the boosting process complicates the comprehension of how each specific feature influences the model’s prediction in each instance of classification. Finally, although we utilized data from two separate databases (ADNI and FTLDNI) to train our model, both were predominantly curated by the same teams, employing identical equipment and protocols at the same institutions. Therefore, to assess the broader generalizability of our model, it is essential to test it on entirely external datasets containing rs-fMRIs obtained from different machines by diverse teams in future studies. Future research could also delve deeper into the neural correlates of the rs-fMRI features that were selected using XGBoost’s variable importance measure and elucidate the physiological relevance of the brain parcels from which these features were derived.