Introduction

Many believe that cognitive decline is an inevitable part of aging and that developing dementia is a normal part of cognitive aging, which we as individuals have no control over [1, 2]. Research has, however, shown that some individuals maintain their cognitive abilities even though they reach a very high age and that individuals can influence their cognitive aging process [1,2,3,4]. In fact, while it is known that both age and genetics are strong risk factors for cognitive decline [5, 6], a recent report on dementia prevention, intervention, and care states that modifiable risk factors contribute to up to 40% of dementia cases worldwide [3]. The risk factors that are highlighted in the report and considered to have the strongest evidence are low education level, hearing loss, traumatic brain injury, hypertension, excessive alcohol consumption, obesity, smoking, depression, social isolation, physical inactivity, diabetes, and air pollution. Other factors, such as diet [7, 8], multilingualism [9], sleep [10,11,12], and physical fitness [13,14,15,16,17], have also been shown to have an association with dementia and cognitive decline. All of these risk factors are modifiable to some extent. Consequentially, by targeting these known risk factors for dementia and cognitive decline in a systematic way, a substantial proportion of dementia cases could be prevented or delayed. Pinpointing the individuals who could benefit from modifying their lifestyle to improve their habits relating to those risk factors could be of great value. One approach that could improve the ability to identify these individuals would be to apply mathematical models to the available data.

Machine learning algorithms have increasingly been applied to healthcare data [18, 19]. Studies have shown that machine-learning algorithms have proven helpful in risk prediction, disease diagnosis, and assessment of disease severity [18, 20]. Many of these studies have involved data collected from older adults. One example is a study that applied random forest algorithms to examine how sleep predicted mortality in older adults compared to other known predictors of mortality [21]. The results showed that although sleep was not the strongest predictor of mortality, multidimensional sleep data can add to the predictive power of more routinely used predictors of mortality.

Another example is a study that compared the performance of different machine-learning methods in modeling social determinants of health using the Health and Retirement Study dataset [22]. Of the algorithms that were explored, neural networks performed best. The study's results suggested that interactive non-linear relationships between social factors and biological health indicators were identified by applying neural networks to the dataset.

Several machine-learning studies have been performed on data relating to dementia, with the choice of method often being determined by the characteristics of the dataset [23]. Spooner et al. [24] compared the performance of different machine learning methods in predicting the onset of dementia using high-dimensional clinical data from two different datasets. According to their findings, random forest algorithms can outperform linear regression models when non-linear data, with complex non-linear relationships, is present, which is the case with many datasets. Casanova et al. [25] used random forest classification to study predictors of cognitive trajectories. According to their results, nongenetic modifiable risk factors play an important role in the trajectory of cognitive decline. Qiu et al. [26] used a deep learning framework to classify individuals with Alzheimer's disease using MRI data and cognitive scores. The model's diagnostic performance was better than that of a team of neurologists. Garcia-Gutierrez et al. [27] assessed the ability of machine learning algorithms to classify individuals with and without dementia based on neuropsychological assessments. The algorithms performed well, and the results indicated that by applying such algorithms, it might be possible to rely on the results of fewer neuropsychological tests than specialists usually do, reducing the resources needed to perform neuropsychological examinations. The focus of these studies differs, but all of them indicate that useful information about cognition and dementia can be obtained by applying machine learning methods to datasets that contain relevant information.

Present study

The work presented in this paper modeled associations between risk factors and future dementia cases using machine learning methods. Predicting which individuals are more likely to develop dementia can provide valuable insight for healthcare professionals and the healthcare system. By creating more accurate models, costs can be reduced, and healthcare workers can provide care in a more targeted way. Individuals at risk could be identified sooner in the process, and dementia progression could possibly be delayed, improving the quality of life for those individuals. A previous analysis of the AGES-Reykjavik Study dataset using a traditional statistical approach has shown that a logistic regression model analyzed in SPSS can pinpoint risk factors associated with whether an individual will have dementia five years later [28]. This study aimed to take that assessment further and explore whether machine learning algorithms could better estimate future dementia cases based on known risk factors for dementia and cognitive decline.

Methods

Study Sample

The dataset was collected through a study called the Age Gene/Environment Susceptibility-Reykjavik Study (AGES-Reykjavik Study). The participants were recruited through previous participation in the Reykjavik Study, a study sponsored by the Icelandic Heart Association (IHA). A randomly selected subgroup of the participants in the Reykjavik Study, all of whom lived in the Reykjavik area of Iceland at the time of the study and were born between 1907 and 1935, were invited to continue and participate in the AGES-Reykjavik Study. Between 2002 and 2011, data were collected twice for each participant, with approximately five years between baseline and follow-up. The purpose of the AGES-Reykjavik Study was to evaluate risk factors for disability and disease within the older population. Data collected included results from cognitive tests and dementia assessments and information regarding risk factors that could have associations with cognitive performance. Overall, 5,764 participants took part in the baseline data collection of the study. Of those, 3,316 also participated in the follow-up data collection. A more detailed description of the study design can be found in previously published material [29, 30].

The dataset used for all analyses in this study is based on the dataset used in a previous paper where logistic regression was used to assess the relationship between risk factors for cognitive aging and dementia based on traditional statistical methods [28]. It should be noted that a substantial portion of the original AGES-Reykjavik dataset was excluded from the analysis. That is because many participants were excluded since they had missing data for any of the variables included in the logistic regression model analyzed using SPSS. Those who fulfilled the dementia criterion at the baseline measurement were not included in the final dataset since the purpose was to assess the likelihood of developing dementia later. Figure 1 shows a flowchart of participants included and excluded from the analysis. Of those participating in the follow-up data collection round, 16.2% (n = 536) fulfilled a dementia criterion when assessed at baseline. The proportion of participants included in the data analysis (n = 1,491) that fulfilled the criterion at follow-up was 8.2% (n = 123). Since the previous dataset analysis excluded participants with missing data for any of the variables in a model, a substantial number of participants were not included in the final dataset. For a more detailed description of the analytical sample, please refer to [28].

Fig. 1
figure 1

Flowchart showing the participant selection for the study (the figure is a reproduction of work from a previous publication [28])

The Institutional Review Board of the U.S. National Institute on Aging, the National Institutes of Health, and the Icelandic National Bioethics Committee (VSN 00–063) have approved the study. All participants provided written consent for participation in the study.

Exposure variables

The choice of variables for the analysis was based on previous work that analyzed the same data [28, 30], which was supported by the cognitive aging literature [3, 9, 15, 25, 31,32,33,34,35,36,37]. The dataset used for the analyses performed here included eighteen exposure variables known as predictors of cognitive aging. Information about each variable is displayed in Table 1. Age and sex were included as control variables since both are recognized as predictors of cognitive aging [38, 39].

Table 1 Description of exposure variables

Dementia criterion

Participants were evaluated for mild cognitive impairment and dementia at baseline and follow-up. The screening process was the same in both instances, and the same panel of professionals was involved [29]. The three-step process started with a cognitive screening using the Mini-Mental State Examination [42] and the Digit Symbol Substitution Test [43]. Next, a neuropsychological diagnostic battery was administered to participants with positive screening results. Based on two screening tests from the diagnostic neuropsychological test battery, participants were identified for further examination by a neurologist. In the third and final step, based on relevant cognitive, MRI, and health history data, a consensus diagnosis was made by a panel of relevant specialists (neuropsychologist, neurologist, geriatrician, and neuroradiologist) in line with international guidelines. For the purpose of this study, individuals considered by the panel to have either mild cognitive impairment or dementia were grouped together; therefore, all of them fulfilled the current study’s dementia criterion.

Machine learning

Method of analysis

Three classification methods were compared to assess which could provide the most accurate prediction of future dementia cases based on known risk factors for cognitive aging. The scikit-learn library within the Python programming language was used for data analysis in all three cases [44]. The methods compared were logistic regression, random forest, and neural network.

Logistic regression

Within traditional statistics, logistic regression is a method commonly applied to predict binary outcomes by analyzing the relationship between multiple independent variables [45]. The use of logistic regression has also been adopted by those who work on machine learning, and within that discipline, logistic regression is often used to classify incoming information based on existing data.Footnote 1

Random forest

Random forest is a machine-learning method that builds upon the idea of decision trees by combining in one model a forest of decision trees that work together [46]. Figure 2 shows an example of how a random forest algorithm operates. Each decision tree within the model gives a prediction, and the results of the majority of the trees become the result the random forest algorithm gives. The purpose of the algorithm is to create many uncorrelated decision trees that can give more accurate results than any one decision tree could. The number of trees in the random forest and the complexity of the trees can be changed to test which combination gives the best results. The application of decision trees was also explored. However, since the algorithm's performance was underwhelming compared to the other three algorithms, the decision tree analysis will not be discussed further.

Fig. 2
figure 2

An example that explains a random forest algorithm

Using scikit-learn, data on feature importance can be extracted when a random forest algorithm is used, indicating which variables in the model were most important to the model's predictive power. A higher value indicates a higher importance. 

Neural network

Neural network is a machine learning algorithm modeled after how neurons in the brain communicate, allowing the algorithm to recognize patterns and relationships within datasets [47]. Figure 3 shows an example of how a neural network algorithm functions. The algorithm is based on multiple layers, an input layer and an output layer, and in between, there are hidden layers. The number of hidden layers can vary, and the number of nodes within each layer can also be changed. A node is a unit that operates similarly to a neuron; while training the network, a node is either activated or remains inactive. Different combinations of the number of hidden layers and nodes can be tested to identify which provides the most accurate predictions.

Fig. 3
figure 3

An example that explains a neural network algorithm

Data pre-processing

Validation strategy

The dataset was split into training and test datasets to estimate the performance of the different machine learning algorithms. The training dataset's size was 67% of the whole dataset; 493 participants were allocated to the test dataset and 998 to the training dataset. A statistical comparison was performed on the two datasets, and only one variable, diabetes, was significantly different (p < 0.001). Within the training dataset, 7.3% of the participants had diabetes, while 12.8% of participants within the test dataset had diabetes. Despite this anomaly, the comparison suggests that the two datasets are very similar.

Data scaling

Some pre-processing was performed to prepare the data for the machine learning algorithms. Since the two groups within the outcome variable are unbalanced, features were standardized by removing the mean and then scaling to unit variance.

Performance criteria

Confusion matrix

A confusion matrix gives valuable information about the model's performance [48]. The matrix displays which data points are classified correctly and which are classified incorrectly by showing true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Simple metrics calculated based on information from the confusion matrix can give a high-level idea of model performance on specific tasks. These include accuracy, precision and recall.

Accuracy

The accuracy of a model is generally calculated by dividing the number of individuals correctly classified by the number of participants included in the dataset.

Precision

Precision is calculated by dividing the number of true positives by the number of true positives plus the number of false positives. This value indicates the portion of correct positive predictions, showing the proportion of people classified into the dementia group that actually have dementia.

Recall

Recall is calculated by dividing the number of true positives by the number of true positives plus the number of false negatives. This value indicates the portion of correct positive predictions out of all positive cases, giving insight into the proportion of individuals with dementia that the models do not detect.

Balanced accuracy

When outcome variables are imbalanced, balanced accuracy is considered a better metric [49]. This metric considers that one of the classes has a much lower number of individuals that belong to it, giving that class more weight in the accuracy calculations.

F1 score

The F1 score focuses on incorrect classified results (false positives and false negatives), balancing precision and recall [50]. The higher the value, the better the performance of the model is.

ROC curve

A receiver operating characteristic (ROC) curve is a graph that plots the true positive rate of a model against the false positive rate at different classification thresholds [51]. The area under the ROC curve (AUC-ROC), which gives a number between zero and one, indicates how well the classifier performs across all possible thresholds. The higher the number, the better performance is.

Matthews correlation coefficient

The Matthews correlation coefficient (MCC) calculates the correlation between true and predicted values in a dataset [52]. The MCC takes a value between minus one and one, with one indicating perfect performance and minus one indicating that each item has been misclassified.

Overfitting

Each model's misclassification rate for the training and test datasets was compared to identify whether a model was overfitting, thereby creating a generalization error. If the misclassification rate for the test dataset was much higher than the training dataset, the model was overfitting, thereby not performing as desired. When tuning the models, the misclassification rate was continually assessed to ensure the model was not overfitting. Table 2 shows the misclassification rate of the final models for each method. Based on an analysis of trends in the misclassification rate for the training and test datasets, a difference of more than 2% was considered to suggest that a model was overfitting. If that was the case, a change was made to how the hyperparameters were tuned.

Table 2 Misclassification rate for both training and test datasets for all three machine learning methods

Hyperparameter tuning

All machine learning models have hyperparameters that must be specified when running the algorithm. For each machine learning algorithm, a randomized search was used to identify the combination of hyperparameters that performed the best on the dataset of interest (e.g., different combinations of hidden layers and numbers of nodes for neural networks, and the number of trees, maximum number of features, and complexity of each tree in random forest).

Results

Descriptive statistics

Table 3 shows the descriptive statistics for the different exposure variables in the model for the whole dataset and split into groups based on the dementia criterion. Of those with healthy cognition at baseline (n = 1,491), 123 (8.2%) fulfilled the dementia criterion five years later.

Table 3 Descriptive statistics for exposure variables (measured at baseline) showing means, standard deviations, counts and percentages

Model performance

Table 4 compares the performance of all three machine learning algorithms after hyperparameter tuning. According to the confusion matrix, the neural network model had the highest number of true negatives and the lowest number of false positives. The random forest model had the lowest number of false negatives and the highest number of true positives. The random forest algorithm performed best on balanced accuracy, precision, recall, F1 score, and MCC score metrics. The neural network algorithm performed best according to the AUC-ROC metric.

Table 4 Comparison of performance metrics assessing the three different machine learning models being studied

Random forest

Since the random forest algorithm performed well on most of the performance metrics, the results from the random forest model were analyzed further. Figure 4 shows the feature importance for each variable included in the model. According to the random forest algorithm, the five most important features are white matter brain volume, leisure activities, mobility, age, and grey matter brain volume, in that order.

Fig. 4
figure 4

Feature importance for random forest algorithm

Discussion

This study assessed the performance of three different machine-learning algorithms at estimating future dementia cases based on known risk factors for dementia and cognitive decline. Logistic regression, random forest, and neural network algorithms were compared. The results showed that random forest scored higher than logistic regression and neural networks on most performance metrics. The confusion matrix showed that random forest had the highest number of true positives and the lowest number of false negatives from the assessed algorithms. However, neural networks had the highest number of true negatives and the lowest number of false positives. Random forest scored higher than logistic regression and neural networks on balanced accuracy, precision, recall, F1 score, and Matthews correlation coefficient. However, logistic regression had more area under the ROC curve than random forest and neural networks.

The findings align with the results reported by Spooner et al. [24], which suggest that random forest algorithms are well-equipped to analyze data regarding predictors of dementia onset. In their findings, Casanova et al. [25] focused on genetic and modifiable risk factors for developing dementia, while Qiu et al. [26] emphasized the predictability of MRI data. Both studies showed promising results, indicating that information on modifiable risk factors and brain imaging data performed well at predicting the diagnosis of dementia. The results of this study agree with those findings.

Similar to this study, Seligman et al. [22] also compared the performance of different machine-learning algorithms in analyzing health-related data. Both studies showed that the machine learning models only performed moderately well, and Seligman et al. [22] even go so far as to state that some machine learning algorithms cannot produce better predictions than traditional simpler models. However, what differs between the two studies is that while the results of this study indicate that random forest performs better than logistic regression and neural networks, the results from Seligman et al. [22] showed that neural networks significantly outperformed linear regression methods and random forests.

Multiple performance metrics exist in the machine-learning literature, and several were calculated for this analysis. The random forest algorithm performed best according to balanced accuracy, the F1 score, and the Matthews correlation coefficient. All of them are widely used, but studies that have compared the quality of different performance metrics suggest that the MCC is the best classification performance metric for imbalanced datasets where both correct and incorrect classifications must be considered [53, 54]. The area under the ROC curve was the only metric where other performance metrics outperformed random forest. Both logistic regression and neural networks performed better, according to the AUC-ROC. However, the metric has been criticized for overestimating performance when the analyzed dataset is imbalanced, which is the case here [53, 55]. These findings emphasize that the random forest algorithm is the best choice for this dataset. What further supports that choice is the fact that out of the more popular machine learning methods, the random forest algorithm is one of the more easily understood and interpretable methods, in addition to providing information about the relative importance of the predictors included in a model [56].

Precision and recall do not perform well on all datasets but were included in the analysis to see how they compared to the other metrics used. Knowing the recall value is interesting since it indicates how well the model identifies the real positive cases, which is often the focus when analyzing medical data [57]. Recall is considered important when false negatives (classifying an individual to the healthy cognition group when he belongs to the dementia group) are undesirable. However, precision is considered important when false positives (classifying an individual to the dementia group when he belongs to the healthy cognition group) are undesirable. Stating which type of error would be more detrimental is complicated, and such a decision would depend on the models' eventual use. If the purpose were to identify which types of individuals would most likely benefit from lifestyle interventions to reduce the risk of developing dementia but not to inform an individual that he has a high risk of developing dementia in a few years, recall would be the metric of interest. Precision would be a more interesting metric if the purpose were to use the model as a risk calculator on an individual level. The random forest algorithm had higher precision and recall than neural networks and logistic regression. However, when neural networks and logistic regression are compared, the results show that the neural network algorithm performs better on the precision metric, and the logistic regression algorithm performs better on the recall metric.

According to Valsdóttir et al. [28], where logistic regression was performed on the same dataset using the statistical program SPSS, the results showed that leisure activities, self-reported health, education, age, being an ApoE \(\varepsilon 4\) carrier, and white and grey matter volume had a significant relationship with whether an individual would be assessed as having dementia five years later. When those results are compared to the feature importance analysis performed on the RF algorithm in this study, it is evident that the findings are not the same. For example, the number of foreign languages an individual speaks has much higher feature importance compared to education and self-reported health, even though the foreign language variable is not significant in the logistic regression model performed in SPSS. Additionally, these results do not align with the RF algorithm-based findings presented by Casanova et al. [25], which suggests that education is the top-ranked predictor for dementia progression. The reasons for this are unclear but might be explained by the characteristics of the cohort that participated in the AGES-Reykjavik Study [29]. The study’s participants did not have the same educational opportunities that young people have today and may not have had the same access to education as their peers in other countries. That might be why other variables representing potential educational attainment are ranked higher in importance in this dataset.

Strengths and limitations

Even though the study's results might, at first glance, seem very decisive, it is vital to consider them in conjunction with the study's limitations and the data included in it. In the context of machine learning, the dataset being analyzed here is not very large and possibly not well suited for some of the machine learning methods available. Therefore, based on these findings, it is impossible to conclude that random forest algorithms would always perform better on similar but larger datasets.

The study was also limited by the information available in the dataset. The balanced accuracy metric gives insight into how accurate the models' predictions are. The balanced accuracy values in this study suggest that the classification could be significantly improved. The best way would be to include more suitable predictors in the models. Limitations relating to the dataset involve the number of variables available for analysis and that a portion of the variables is based on self-reported information, which needs to be interpreted cautiously. Of course, it is also important to consider that not all available and relevant data was fed into the models (such as MRI data for individual brain regions) for this study. Further studies could include more pertinent data in the models and produce more accurate predictions.

Finally, it must be considered that the results from traditional statistical analysis are generally more accessible to interpret than the outcomes from machine learning algorithms. Although some machine learning algorithms could produce much more accurate predictions for a particular dataset, it could also be the case that it could prove challenging to interpret what the results mean, which could be why researchers might be hesitant to try this more novel method. Therefore, it is essential to contemplate the purpose of analysis before deciding which approach would be most relevant.

The study also had strengths that are worth mentioning. In this study, the same dataset was analyzed as had been used in a previous analysis with the same goal that relied on traditional statistical methods. Since the results suggest that more accuracy could be reached by applying machine learning methods, such as RF algorithms, this could inspire researchers that have until now exclusively relied on traditional statistical approaches to branch out and incorporate in their work other less conventional ways of analyzing data.

Finally, the quality of the AGES-Reykjavik dataset significantly strengthens the results. Experts from many fields contributed to the study design, resulting in a high-quality dataset based on individuals with heterogeneous capabilities [29].

Future research

Since the decision was made to use the same dataset as had been used in a previous publication [28], the dataset used for this analysis was much smaller than the original AGES-Reykjavik dataset. That is because data points with missing data were excluded from the analysis when logistic regression was performed in SPSS. Future research could entail performing a principal component analysis or using other feature selection techniques on the original dataset from the AGES-Reykjavik Study to reduce the dimensionality of the dataset. This could draw out summary variables that could increase interpretability or other important variables not identified in previous dataset analyses.

Another avenue that could be taken with further research would be to focus on using such models within a clinical setting. Some people think that getting dementia is an inevitable part of aging. In contrast, others are convinced they are not a part of the group that will receive a dementia diagnosis when they get older. However, all these individuals could benefit from the prevention strategies proven to decrease the risk of developing dementia. The purpose of applying these models in a clinical setting would be to identify individuals with an increased risk of developing dementia to emphasize the importance of preventative measures among those individuals. In the future, a model such as this one could be applied on an individual level to indicate whether they need to take action to reduce the risk of developing dementia later in life. For older individuals, information about performance on neuropsychological tests could also be included to get even more accurate predictions about the likelihood of progression to dementia.

Conclusion

These preliminary results suggest that a random forest algorithm might better identify individuals in the AGES-Reykjavik Study dataset likely to develop dementia within five years compared to logistic regression and neural networks. However, it must be taken into consideration that these results only indicate better performance since statistical comparisons were not performed. The next step would be to perform such analyses on a more expansive portion of the AGES-Reykjavik Study dataset. These findings emphasize the possibility of gaining further insight into large and complex health-related datasets by exploring further the application of machine learning methods to such datasets. Further studies could support the development of a personalized dementia risk assessment tool that could be used to help people reduce their risk of developing dementia.