Abstract
Background
Few studies have focused on stroke risk assessment in rural regions, and the application of artificial intelligence in stroke risk scoring systems remains insufficient. This study aims to develop a simplified, visualized risk score for rural stroke risk assessment that combines good performance with convenience and is built with a machine learning (ML) algorithm.
Methods
Participants of the Henan Rural Cohort were enrolled in this study. The participants (n = 38,322) were randomly split into a train set and a test set in a ratio of 7:3. An ML algorithm was used to select variables, and logistic regression was then applied to construct the scoring system. The C-statistic and the Brier score (BS) were used to evaluate discrimination and calibration, respectively. The Framingham stroke risk profile (FSRP) and the self-reported stroke risk function (SRSRF) were chosen as comparators.
Results
The Rural Stroke Risk Score (RSRS) was produced in this study, including age, drinking status, triglyceride, type 2 diabetes mellitus, hypertension, waist circumference, and family history of stroke. On validation, the C-statistic was 0.757 (95% CI 0.749–0.765) and the BS was 0.058 in the test set. In addition, the discrimination of RSRS was 6.02% and 7.34% higher than that of the FSRP and SRSRF, respectively.
Conclusions
A well-performing scoring system for assessing stroke risk in rural residents was developed in this study. This risk score could facilitate stroke screening and the prevention of cardiovascular disease in economically underdeveloped areas.
Introduction
The global burden of cardiovascular disease remains substantial, and it is increasing as populations age [1]. According to the report from the American Heart Association (updated in 2022), the age-standardized prevalence rate of stroke increased by 0.77% from 2010 to 2020 [2]. Moreover, stroke is a leading cause of disability, with an increasing incidence trend in developing countries [3]. Previous studies indicate that the burden of stroke in China has increased over the past 30 years and remains particularly high in rural areas [4]. Thus, early intervention in cardiovascular diseases should be advocated to curb the epidemic in economically underdeveloped areas. However, existing stroke diagnostic tools are too expensive and inconvenient for disease screening in rural regions [5]. Fortunately, risk scoring systems have been proven excellent tools to estimate the risk of stroke and identify individuals with higher stroke risk at an early stage.
Several risk scores have been established to predict future stroke events [6], for example, the Framingham Stroke Risk Profile [7], the Revised Framingham Stroke Risk Profile [8], the ABC stroke risk score [9], CHA2DS2-VASc score [10], and SCAIL score [11]. However, there are several problems with existing risk scores: first, some scores are only applicable to specific groups, such as patients with arterial stiffness, which limits their abilities for stroke risk assessment in the general population. Second, some scores include medical imaging variables, which are highly inappropriate for on-site stroke screening, especially in underdeveloped areas. In addition, the application of machine learning, an artificial intelligence technique that can fully exploit non-linear relationships, is still limited in the construction of stroke risk scores. According to previous studies, machine learning in conjunction with deep phenotyping improves accuracy in stroke event prediction in an initially asymptomatic population [12]. What is more, ensemble machine learning algorithms such as gradient boosting machine (GBM) are shown to perform better than basic classifiers [13]. We cannot ignore the outstanding performance of machine learning in the field of disease risk assessment. Finally, relevant evidence [14, 15] suggests that there are very serious urban–rural differences in stroke care and in-hospital mortality [16], and for less developed regions such as rural areas, there is a need to develop easy-to-use scoring systems to support stroke risk assessment and prevention.
To address the problems mentioned above, the study aims to construct a stroke risk scoring system for rural residents with machine learning. It can be used to identify individuals at higher risk of stroke in the population and also facilitate risk assessment for cardiovascular diseases.
Methods
Study population
The Henan Rural Cohort (registration number: ChiCTR-OOC-15006699) was a prospective cohort study conducted from July 2015 to September 2017. Briefly, the Henan Rural Cohort recruited 39,259 participants in five rural regions of China: Suiping County, Yima City, Yuzhou City, Xinxiang County, and Tongxu County [17]. The study focused on permanent rural residents aged 18–79 years [18]. A standardized questionnaire, physical examination, and biochemical tests were used to collect individual information [19]. Demographic information, lifestyle factors (such as drinking status), and personal and parental history of diseases were obtained from the questionnaire. Resting blood pressure was measured three times with an electronic sphygmomanometer (OMRON, Japan), and fasting blood glucose and blood lipids were measured once with the Cobas c501 (Roche, Switzerland). Further details on this cohort have been published previously [20]. The data processing flow chart is displayed in Supplemental Fig. 1. Ethics approval of the study was obtained from the Zhengzhou University Life Science Ethics Committee. Written informed consent was obtained from all participants, and the study complied with the Declaration of Helsinki.
Stroke events
Participants were classified as non-stroke or stroke-affected residents according to their answers to the standardized questionnaire. Stroke was defined as a previous diagnosis of stroke by physicians or local health providers according to the criteria recommended by the World Health Organization: a constellation of sudden or rapid onset of a neurologic deficit of vascular origin that persisted for more than 24 h or until death [2]. In total, 2534 stroke patients in the Henan Rural Cohort were identified in the study.
Model development
Variable selection
Candidate variables were identified using the following criteria: (1) missing data < 30%; (2) no severe collinearity (variance inflation factor, VIF < 10) on collinearity diagnosis; and (3) easily accessible, excluding difficult-to-obtain variables such as genetic and imaging data. To identify the attributes most closely associated with the outcome, and thereby construct a feature subset and reduce the complexity of the model [21], a machine learning method, the gradient boosting machine (GBM), was used to establish the variable importance ranking. GBM is an iterative ensemble algorithm whose core idea is to train a series of weak classifiers on the same training set and combine them into a more powerful final classifier (strong classifier). In this study, variable selection proceeded in two steps: first, the importance ranking of the 17 candidate variables was determined by the GBM; then, considering performance, ease of application, and expert recommendations, the top seven variables were selected as predictors. The model was constructed with the scikit-learn package (sklearn, version 0.21.3) in the Python 3.7 programming language.
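The ranking step described above can be illustrated with a minimal sketch. It assumes scikit-learn's `GradientBoostingClassifier` and its impurity-based `feature_importances_`; the synthetic data, the variable names `var_0` to `var_16`, and the hyperparameters are hypothetical stand-ins, since the study's data and settings are not given in the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the cohort (ASSUMPTION): 17 candidate variables and
# an imbalanced binary outcome, loosely mimicking a low stroke prevalence.
X, y = make_classification(
    n_samples=2000, n_features=17, n_informative=7, n_redundant=0,
    weights=[0.93, 0.07], random_state=0,
)
feature_names = [f"var_{i}" for i in range(X.shape[1])]  # hypothetical names

# Fit a gradient boosting classifier; hyperparameters are illustrative only.
gbm = GradientBoostingClassifier(n_estimators=100, random_state=0)
gbm.fit(X, y)

# Rank the candidates by importance (descending) and keep the top seven.
order = np.argsort(gbm.feature_importances_)[::-1]
ranking = [(feature_names[i], float(gbm.feature_importances_[i])) for i in order]
top_seven = [name for name, _ in ranking[:7]]

for name, importance in ranking:
    print(f"{name}: {importance:.4f}")
```

In scikit-learn the importances are normalized to sum to one, which is consistent with the scale of the values reported for the ranking in this study. The final choice of seven predictors in the paper also drew on ease of application and expert recommendations, which a ranking alone does not capture.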
Relationship modeling
The participants in the derivation cohort (n = 38,322) were randomly split into a train set (n = 26,826) and a test set (n = 11,496) in a ratio of 7:3. The model was built using data from the train set and validated in the test set. To relate the variables to scores, the Framingham methods were applied; a detailed description was published previously [22]. After the continuous variables were separated into reasonable subgroups, the coefficients of the logistic model were used to produce scores. For categorical variables, zero points were assigned to the reference group. All variables except age were divided into subgroups according to criteria recommended in China, including the guidelines for the prevention and treatment of dyslipidemia in China (2016). The scores of all predictors were summed to assign each participant a stroke risk score. Individuals were ultimately divided into five risk groups according to the calculated risk score.
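As a rough illustration of how logistic coefficients become integer points under the Framingham approach [22], the sketch below takes the odds ratios reported in the Results for the dichotomous predictors, converts them to coefficients (β = ln OR), and divides by a constant B that defines one point. The choice of B as the smallest coefficient is an assumption made here for illustration; the text does not state the constant actually used, so the resulting points are not expected to reproduce the published score table in Fig. 2. Age is omitted because its category reference values are not given in the text.

```python
import math

# Odds ratios reported in the Results for the dichotomous predictors.
odds_ratios = {
    "hypertension": 2.208,
    "family_history_of_stroke": 2.392,
    "high_triglyceride": 1.150,
    "high_waist_circumference": 1.123,
    "type_2_diabetes": 1.547,
    "drinking": 1.249,
}

# A logistic regression coefficient is the natural log of its odds ratio.
betas = {name: math.log(value) for name, value in odds_ratios.items()}

# ASSUMPTION: one point equals the smallest coefficient. The paper does not
# report its constant, so these points need not match the published table.
B = min(betas.values())


def points_for(beta, distance_from_reference=1.0, constant=B):
    """Points-system rule: round(beta * (W - W_ref) / B)."""
    return round(beta * distance_from_reference / constant)


points = {name: points_for(beta) for name, beta in betas.items()}

# Hypothetical participant: hypertension and diabetes present, others absent.
participant = {"hypertension", "type_2_diabetes"}
total = sum(points[name] for name in participant)

print(points)
print("total points:", total)
```

A continuous predictor such as age would be handled the same way, with `distance_from_reference` set to the gap between the midpoint of the participant's age group and the midpoint of the reference group.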
Model evaluation
The C-statistic was used to evaluate the discriminatory performance of the scoring system; it typically ranges from 0.5 (no discrimination) to 1.0 (perfect discrimination) and, for a binary outcome, corresponds to the area under the receiver operating characteristic (ROC) curve [23]. Model calibration was evaluated by the Brier score (BS). Sensitivity, specificity, and their 95% CIs were also reported.
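For readers less familiar with these two metrics, the following sketch computes them from first principles on a tiny made-up example. The outcome labels and predicted risks are hypothetical and unrelated to the study data.

```python
def c_statistic(y_true, y_prob):
    """Share of (event, non-event) pairs in which the event has the higher
    predicted risk; tied pairs count as one half."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    concordant = 0.0
    for p_event in events:
        for p_non_event in non_events:
            if p_event > p_non_event:
                concordant += 1.0
            elif p_event == p_non_event:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical outcomes (1 = stroke) and predicted risks for six people.
y_true = [0, 0, 1, 0, 1, 0]
y_prob = [0.05, 0.20, 0.60, 0.10, 0.15, 0.30]

c_value = c_statistic(y_true, y_prob)
bs_value = brier_score(y_true, y_prob)
print(f"C-statistic: {c_value:.3f}")
print(f"Brier score: {bs_value:.3f}")
```

A lower Brier score indicates better agreement between predicted and observed outcomes, whereas a higher C-statistic indicates better separation of events from non-events. The pairwise loop is quadratic in sample size and is meant only to make the definition explicit.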
To compare the performance of this scoring system with similar risk assessment tools, two scores were chosen in this study. First, we chose the Framingham Stroke Risk Profile (FSRP), a predictive tool recommended by the American Heart Association specifically for stroke [7]; second, as self-reported stroke outcomes were used in this study, we used the self-reported stroke risk function (SRSRF) to explore the utility of the new scoring system [24].
Statistical analysis
Descriptive characteristics of candidate variables are shown as numbers (frequencies) for categorical variables and mean ± standard deviation (SD) for continuous variables. For comparisons between groups, the chi-square test (or Fisher’s exact test) was used for categorical variables, whereas the t-test was used for continuous variables. The collinearity analysis was performed using general linear regression. The discriminatory performance of the models was compared by calculating the difference between their C-statistics on the same data, and the DeLong test was used for this comparison. A two-tailed P value of less than 0.05 was considered statistically significant. Statistical tests were performed using SPSS version 21 (IBM, Chicago, USA).
Results
Characteristics of participants
In total, 38,322 participants (15,057 men and 23,265 women) in the Henan Rural Cohort were included in this study (Supplemental Fig. 1). The mean age of the study population was 55.62 years, and 39.29% of the individuals were men. Additionally, 3301 participants (8.61%) reported a family history of stroke. At baseline, there were 2534 stroke patients (6.61%) in the Henan Rural Cohort. The detailed characteristics of the participants are shown in Table 1.
Selected variables and the RSRS
The top seven variables in the GBM variable importance ranking were selected as the factors in this analysis: age, hypertension (HTN), drinking status, family history of stroke, triglyceride (TG), waist circumference (WC), and type 2 diabetes mellitus (T2DM). After several experiments, the best performance was obtained with this combination of seven variables. These variables are all relatively easy to obtain during on-site screening in rural areas through interviews, measurements, and blood collection. The variable importance ranking obtained by the machine learning algorithm is shown in Fig. 1. Age ranked first with a variable importance of 0.4463, followed by HTN with 0.1585. Notably, the variable importance of T2DM (0.0291) was about 2.5 times that of income (0.0118).
Based on the β values of the logistic regression, different coefficients were assigned to different subgroups. The associations (OR, 95% CI) of the scoring variables with stroke were as follows: age (1.076, 95% CI 1.070–1.082), HTN (2.208, 95% CI 1.989–2.452), family history of stroke (2.392, 95% CI 2.063–2.774), TG (1.150, 95% CI 1.035–1.278), WC (1.123, 95% CI 1.007–1.251), T2DM (1.547, 95% CI 1.351–1.771), and drinking status (1.249, 95% CI 1.111–1.404). TG and WC were divided at cutoff points (1.7 mmol/L for TG; 85 cm for men and 80 cm for women for WC) according to the medical guidelines or criteria. Age was grouped in 5-year intervals. The age range in the sample was 18–87. To determine the reference values for the first and last categories, the 1st and 99th percentiles were used to minimize the influence of extreme values. The logistic regression β-coefficients, P values, ORs, and 95% CIs for each of the machine learning-selected variables are displayed in Supplemental Table 1. The model demonstrated acceptable discrimination, with a C-statistic of 0.763 (95% CI 0.758–0.768) in the train set. The calibration performance was also good (BS 0.059). The sensitivity was 0.847 (95% CI 0.829–0.863), and the specificity was 0.551 (95% CI 0.545–0.557) in the train set.
Thus, a new scoring system was constructed for assessing the risk of stroke in rural areas: the Rural Stroke Risk Score (RSRS; range 0–25). The score table is shown in Fig. 2. We also calculated the estimated risk of stroke using the Framingham methods [22]. The risk stratification of stroke in this scoring system is displayed as five subgroups in Fig. 3. The results suggested that individuals with a score of 21 or more had an estimated stroke risk of more than 40.46%, which we classify as “very high.” The correlation between scores and estimated risk is shown in Supplemental Table 2.
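For orientation, the general form that the Framingham points approach [22] uses to turn a point total into an estimated risk under a logistic model can be written as below. The symbols are introduced here for illustration: S is the point total, β₀ the model intercept, W_{i,REF} the reference value of the base category of predictor i, and B the constant that defines one point. None of these values is reported in the text, so the expression shows the structure of the calculation only.

```latex
\hat{p}(S) \;=\; \frac{1}{1 + \exp\!\left[-\left(\beta_0 + \sum_{i} \beta_i W_{i,\mathrm{REF}} + B \cdot S\right)\right]}
```

Because the exponent is linear in S, each additional point multiplies the odds of stroke by the same factor, exp(B), which is why the estimated risk rises steeply in the highest score groups.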
Validation and comparison
The model was validated in the test set. After calculating the RSRS of all participants in the test set, we displayed its distribution in Supplemental Fig. 2. The scores showed an approximately normal distribution in the validation set. The RSRS demonstrated good discrimination, with an overall C-statistic of 0.757 (95% CI 0.749–0.765) in the test set. The ROC curve and the calibration plot of the validation are presented in Fig. 4. The calibration plot showed that the estimated risk of disease was in good agreement with the observed value in D1–D9, and the BS value (0.058) also indicated good calibration. The sensitivity was 0.772 (95% CI 0.740–0.801) and the specificity was 0.623 (95% CI 0.614–0.632) in the test set. Other related metrics are presented in Table 2.
We also compared this model with previously published risk scores. The FSRP had a C-statistic of 0.714, a BS of 0.059, a sensitivity of 0.719, and a specificity of 0.588. The SRSRF, a stroke risk score based on self-reported variables, had a C-statistic of 0.705, a BS of 0.067, a sensitivity of 0.762, and a specificity of 0.565.
Mobile application
In this study, an RSRS-based mobile application was developed to help physicians and CDC personnel use the score more easily and quickly when conducting on-site disease screening. Screenshots of the mobile app for calculating the RSRS are shown in Supplemental Fig. 4. This app allows both health care workers and residents to quickly assess stroke risk.
Discussion
In this study, a new scoring system was developed for assessing the risk of stroke, which was based on machine learning algorithms to determine the key factors. It increased the accuracy as well as the degree of data utilization in model construction. After validation, the RSRS was shown to be applicable in the Chinese rural population.
Risk scoring systems are important for disease prevention. The RSRS was developed to estimate stroke risk in economically underdeveloped regions, so that cardiovascular diseases can be prevented early and the prevalence of stroke in China reduced at the population level. Additionally, the simple and visualized scale of the RSRS avoids the complexity of traditional prediction models, which supports its practicality. With the standardized questionnaire used in this study, age, parental history of stroke, and drinking status can be obtained conveniently, which is a further strength of the RSRS for disease risk assessment in rural areas.
As is well known, early intervention in stroke plays a significant role in disease prevention [4]; constructing a practical, convenient, and efficient scoring system for stroke risk assessment is therefore vital. In the present study, the C-statistic of the model in the test set was 0.757 (95% CI 0.749–0.765), indicating good predictive performance. Moreover, the BS values show that the model is well calibrated. Compared with the FSRP and the SRSRF, the RSRS showed an improvement in discrimination (Δ = 0.043 and 0.052, respectively). In Fig. 4, it is notable that the RSRS gave a higher estimated risk in individuals with higher scores; however, this was not evident in the calibration plot of the initial logistic model (shown in Supplemental Fig. 3). This phenomenon has also been observed in other studies, such as the China-PAR [25] and the Taiwan AF score [26]. According to clinical suggestions [3, 27], stroke is a severe disease that needs attention at an early stage, so this phenomenon may raise awareness of stroke prevention.
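The arithmetic behind these differences, using the test-set C-statistics reported in the Results, is shown below. The subscript labels are introduced here for illustration.

```latex
\begin{aligned}
\Delta_{\mathrm{FSRP}}  &= 0.757 - 0.714 = 0.043, &\qquad \frac{0.043}{0.714} &\approx 0.060,\\
\Delta_{\mathrm{SRSRF}} &= 0.757 - 0.705 = 0.052, &\qquad \frac{0.052}{0.705} &\approx 0.074.
\end{aligned}
```

The ratios correspond to relative improvements of roughly 6% and 7%, in line with the percentages quoted in the Abstract; any difference in the last digit presumably reflects rounding of the C-statistics before division.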
Previous studies indicated that machine learning achieved higher performance for risk assessment in patients with atherosclerotic cardiovascular disease (ASCVD) [28, 29]. The machine learning algorithm provided the importance ranking of the candidate variables, which not only improved the rigor and accuracy of variable selection but also optimized the performance of the scoring system [13]. In contrast to the Stroke Riskometer™, whose predictors were determined by discussion and from the literature, the RSRS was constructed on the basis of GBM feature selection. The artificial intelligence methods thus increased the degree of data mining. Although a few risk scores have been combined with machine learning, the RSRS fills a gap in the field of rural stroke risk assessment. What is more, unlike basic classifiers, GBM is an ensemble machine learning classifier, which enhances the efficiency of data utilization.
Several risk scores have been produced for stroke [6], such as the FSRP. However, the predictors in a scoring system affect its applicability to a great extent. For instance, the SCAIL score performed well in identifying early recurrent stroke [11], but its predictors include lumen stenosis, which requires medical imaging for diagnosis. The Stroke Riskometer™ mentioned above is a mobile app based on the FSRP; however, after further variables were added to the initial FSRP, the scoring system contained nearly 17 predictors [30, 31]. This greatly reduces the convenience of the risk score, and its application in rural regions is therefore severely limited. The RSRS includes accessible predictors such as waist circumference, so it is easier to apply. We summarized the predictors of different models in Supplemental Table 3.
The present study has the following strengths. First, to the best of our knowledge, this is the first study to develop a machine learning-combined risk score in a rural population. Second, the sample size (n = 38,322) is relatively large compared with similar research. Finally, the seven-variable scoring system is simple for clinicians to use in on-site investigations, especially in rural areas.
This study also has limitations. First, we conducted this analysis in a cross-sectional study with no follow-up data. Second, there are two invasive indices among the predictors; however, blood biochemistry tests and portable glucose meters are now accessible even in rural areas, so the score remains convenient overall. Additionally, as the purpose of this study is to develop a score for stroke risk assessment, it needs to be further validated in other cohorts, and its performance in other ethnic groups remains uncertain.
Conclusion
This study developed a convenient tool for stroke risk assessment among rural residents, which is valuable for identifying individuals with higher stroke risk. Additionally, the application of this risk score could facilitate the management of cardiovascular and neurological diseases and guide the prevention of stroke in underdeveloped areas.
Data availability
The data analyzed during the current study are available from the corresponding author upon reasonable request.
References
Hankey GJ (2017) Stroke. Lancet 389(10069):641–654
Tsao CW, Aday AW, Almarzooq ZI et al (2022) Heart disease and stroke statistics-2022 update: a report from the American Heart Association. Circulation 145(8):e153–e639
Campbell BCV, De Silva DA, Macleod MR et al (2019) Ischaemic stroke. Nat Rev Dis Primers 5(1):70
Wang W, Jiang B, Sun H et al (2017) Prevalence, incidence, and mortality of stroke in China: results from a nationwide population-based survey of 480 687 adults. Circulation 135(8):759–771
Schulz UG, Fischer U (2017) Posterior circulation cerebrovascular syndromes: diagnosis and management. J Neurol Neurosurg Psychiatry 88(1):45–53
Flueckiger P, Longstreth W, Herrington D et al (2018) Revised Framingham stroke risk score, nontraditional risk markers, and incident stroke in a multiethnic cohort. Stroke 49(2):363–369
Wolf PA, D’Agostino RB, Belanger AJ et al (1991) Probability of stroke: a risk profile from the Framingham study. Stroke 22(3):312–318
Dufouil C, Beiser A, McLure LA et al (2017) Revised Framingham stroke risk profile to reflect temporal trends. Circulation 135(12):1145–1159
Hijazi Z, Lindbäck J, Alexander JH et al (2016) The ABC (age, biomarkers, clinical history) stroke risk score: a biomarker-based risk score for predicting stroke in atrial fibrillation. Eur Heart J 37(20):1582–1590
Lip GYH, Nieuwlaat R, Pisters R et al (2010) Refining clinical risk stratification for predicting stroke and thromboembolism in atrial fibrillation using a novel risk factor-based approach: the euro heart survey on atrial fibrillation. Chest 137(2):263–272
Kelly PJ, Camps-Renom P, Giannotti N et al (2020) A risk score including carotid plaque inflammation and stenosis severity improves identification of recurrent stroke. Stroke 51(3):838–845
Ambale-Venkatesh B, Yang X, Wu CO et al (2017) Cardiovascular event prediction by machine learning: the multi-ethnic study of atherosclerosis. Circ Res 121(9):1092–1101
Li D, Xiong G, Zeng H et al (2021) Machine learning-aided risk stratification system for the prediction of coronary artery disease. Int J Cardiol 326:30–34
Howard G (2021) Rural-urban differences in stroke risk. Prev Med 152(Pt 2):106661
Wu S, Wu B, Liu M et al (2019) Stroke in China: advances and challenges in epidemiology, prevention, and management. Lancet Neurol 18(4):394–405
Hammond G, Luke AA, Elson L et al (2020) Urban-rural inequities in acute stroke care and in-hospital mortality. Stroke 51(7):2131–2138
Wang Y, Li Y, Liu X et al (2019) Prevalence and influencing factors of coronary heart disease and stroke in Chinese rural adults: the Henan Rural Cohort study. Front Public Health 7:411
Zhang L, Wang Y, Niu M et al (2021) Nonlaboratory-based risk assessment model for type 2 diabetes mellitus screening in Chinese rural population: a joint bagging-boosting model. IEEE J Biomed Health Inform 25(10):4005–4016
Kang N, Chen G, Tu R et al (2022) Adverse associations of different obesity measures and the interactions with long-term exposure to air pollutants with prevalent type 2 diabetes mellitus: the Henan Rural Cohort study. Environ Res 207:112640
Liu X, Mao Z, Li Y et al (2019) Cohort profile: the Henan Rural Cohort: a prospective study of chronic non-communicable diseases. Int J Epidemiol 48(6):1756-1756j
Liao S, Jin L, Dai W-Q et al (2021) A machine learning-based risk scoring system for infertility considering different age groups. Int J Intell Syst 36(3):1331–1344
Sullivan LM, Massaro JM, D’Agostino RB (2004) Presentation of multivariate data for clinical use: the Framingham study risk score functions. Stat Med 23(10):1631–1660
Segar MW, Vaduganathan M, Patel KV et al (2019) Machine learning to predict the risk of incident heart failure hospitalization among patients with diabetes: the WATCH-DM risk score. Diabetes Care 42(12):2298–2306
Howard G, McClure LA, Moy CS et al (2017) Self-reported stroke risk stratification: reasons for geographic and racial differences in stroke study. Stroke 48(7):1737–1743
Xing X, Yang X, Liu F et al (2019) Predicting 10-year and lifetime stroke risk in Chinese population. Stroke 50(9):2371–2378
Chao TF, Chiang CE, Chen TJ et al (2021) Clinical risk score for the prediction of incident atrial fibrillation: derivation in 7 220 654 Taiwan patients with 438 930 incident atrial fibrillations during a 16-year follow-up. J Am Heart Assoc 10(17):e020194
Saini V, Guada L, Yavagal DR (2021) Global epidemiology of stroke and access to acute ischemic stroke interventions. Neurology 97(20 Suppl 2):S6–S16
Adler ED, Voors AA, Klein L et al (2020) Improving risk prediction in heart failure using machine learning. Eur J Heart Fail 22(1):139–147
Kim W, Park JJ, Lee H-Y et al (2021) Predicting survival in heart failure: a risk score based on machine-learning and change point algorithm. Clin Res Cardiol 110(8):1321–1333
Parmar P, Krishnamurthi R, Ikram MA et al (2015) The stroke riskometer(TM) App: validation of a data collection tool and stroke risk predictor. Int J Stroke 10(2):231–244
Feigin VL, Norrving B (2014) A new paradigm for primary prevention strategy in people with elevated risk of stroke. Int J Stroke 9(5):624–626
Acknowledgements
The authors thank all of the participants, coordinators, and administrators for their support and help during the research.
Funding
This research was supported by the National Natural Science Foundation of China (Grant NO: 81930092, 81973128), Foundation of National Key Program of Research and Development of China (Grant NO: 2016YFC0900803), Science and Technology Innovation Team Support Plan of Colleges and Universities in Henan Province (Grant NO:21IRTSTHN029), Key Research Program of Colleges and Universities in Henan Province (Grant NO: 21A330007), and Discipline Key Research and Development Program of Zhengzhou University (Grant NO: XKZDQY202008, XKZDQY202002). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Author information
Contributions
Zhongao Ding: conceptualization, investigation, data curation, methodology, formal analysis, visualization, writing—original draft. Liying Zhang: investigation, data curation, formal analysis, code, writing—review and editing. Miaomiao Niu: visualization, writing—review and editing. Bo Zhao: writing—review and editing. Xiaotian Liu: writing—review and editing. Wenqian Huo: investigation, writing—review and editing. Zhenxing Mao: investigation, writing—review and editing. Jian Hou: writing—review and editing. Zhenfei Wang: writing—review and editing. Chongjian Wang: conceptualization, methodology, investigation, validation, supervision, funding acquisition, project administration, writing—review and editing.
Ethics declarations
Ethical approval
Ethics approval of the study was obtained from the Zhengzhou University Life Science Ethics Committee, and the study was performed in accordance with the ethical standards laid down in the 1964 Declaration of Helsinki and its later amendments. The Henan Rural Cohort has been registered at the Chinese Clinical Trial Register (Trial registration: ChiCTR-OOC-15006699. Registered 6 July 2015, retrospectively registered).
Informed consent
All study subjects gave their informed consent prior to their inclusion in the study.
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ding, Z., Zhang, L., Niu, M. et al. Stroke prevention in rural residents: development of a simplified risk assessment tool with artificial intelligence. Neurol Sci 44, 1687–1694 (2023). https://doi.org/10.1007/s10072-023-06610-5
DOI: https://doi.org/10.1007/s10072-023-06610-5