Introduction

The global burden of cardiovascular disease remains substantial, and it is increasing as populations age [1]. According to the 2022 update of the American Heart Association report, the age-standardized prevalence rate of stroke increased by 0.77% from 2010 to 2020 [2]. Moreover, stroke is a leading cause of disability, and its incidence is rising in developing countries [3]. Previous studies indicate that the burden of stroke in China has increased over the past 30 years and remains particularly high in rural areas [4]. Thus, early intervention in cardiovascular diseases should be advocated to curb the epidemic in economically underdeveloped areas. However, existing stroke diagnostic tools are too expensive and inconvenient for disease screening in rural regions [5]. Fortunately, risk scoring systems have proven to be effective tools for estimating the risk of stroke and identifying individuals at higher risk at an early stage.

Several risk scores have been established to predict future stroke events [6], for example, the Framingham Stroke Risk Profile [7], the Revised Framingham Stroke Risk Profile [8], the ABC stroke risk score [9], the CHA2DS2-VASc score [10], and the SCAIL score [11]. However, existing risk scores have several problems. First, some scores are only applicable to specific groups, such as patients with arterial stiffness, which limits their usefulness for stroke risk assessment in the general population. Second, some scores include medical imaging variables, which are impractical for on-site stroke screening, especially in underdeveloped areas. Third, the application of machine learning, an artificial intelligence technique that can exploit non-linear relationships, is still limited in the construction of stroke risk scores. According to previous studies, machine learning in conjunction with deep phenotyping improves accuracy in stroke event prediction in an initially asymptomatic population [12]. Moreover, ensemble machine learning algorithms such as the gradient boosting machine (GBM) have been shown to perform better than basic classifiers [13]. The strong performance of machine learning in disease risk assessment should therefore not be ignored. Finally, relevant evidence [14, 15] suggests that there are substantial urban–rural differences in stroke care and in-hospital mortality [16], and for less developed regions such as rural areas, there is a need to develop easy-to-use scoring systems to support stroke risk assessment and prevention.

To address the problems mentioned above, this study aims to construct a stroke risk scoring system for rural residents using machine learning. The system can be used to identify individuals at higher risk of stroke in the population and to facilitate risk assessment for cardiovascular diseases.

Methods

Study population

The Henan Rural Cohort (registration number: ChiCTR-OOC-15006699) was a prospective cohort study conducted from July 2015 to September 2017. Briefly, the Henan Rural Cohort recruited 39,259 participants in five rural regions of China: Suiping County, Yima City, Yuzhou City, Xinxiang County, and Tongxu County [17]. The study focused on permanent rural residents aged 18–79 years [18]. A standardized questionnaire, physical examination, and biochemical detection were used to acquire individual information [19]. Demographic information, lifestyles (such as drinking status), and personal and parental history of diseases were obtained from this questionnaire. Resting blood pressure was measured three times with an electronic sphygmomanometer (OMRON, Japan), and fasting blood glucose and blood lipids were measured once with the Cobas c501 (Roche, Switzerland). Further details on this cohort have been published previously [20]. The data processing flow chart is displayed in Supplemental Fig. 1. Ethics approval of the study was obtained from the Zhengzhou University Life Science Ethics Committee. Written informed consent was obtained from all participants, and the study complied with the Declaration of Helsinki.

Stroke events

The investigated individuals were divided into non-stroke and stroke-affected residents according to their answers to the standardized questionnaire. Stroke was defined as a previous diagnosis of stroke by physicians or local health providers according to the criteria recommended by the World Health Organization: a constellation of sudden or rapid onset of a neurologic deficit of vascular origin that persisted for more than 24 h or until death [2]. Consequently, 2534 stroke patients in the Henan Rural Cohort were observed in the study.

Model development

Variable selection

Candidate variables were identified using the following criteria: (1) missing data < 30%; (2) not highly correlated (VIF < 10) according to the collinearity diagnosis; and (3) easily accessible, excluding difficult-to-obtain variables such as genetics and imaging. To filter out the attributes most closely associated with the outcome, and thereby construct a subset of features and reduce the complexity of the model [21], a machine learning method (gradient boosting machine, GBM) was used to establish the variable importance ranking. As an iterative algorithm, the core idea of the GBM is to train different classifiers (weak classifiers) on the same training set and to combine these weak classifiers into a more powerful final classifier (strong classifier). In this study, variable selection was divided into two steps: first, the importance ranking of 17 candidate variables was determined by machine learning; then, considering performance, ease of application, and expert recommendations, the top seven variables were selected as predictors. The model was constructed with the sklearn package (0.21.3) in the Python 3.7 programming language.
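The importance-ranking step can be illustrated with a minimal sketch. It assumes scikit-learn's GradientBoostingClassifier with default-style settings; the study's hyperparameters are not reported here, and the data, feature names, and the helper rank_features below are hypothetical stand-ins rather than the cohort variables.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def rank_features(X, y, names, top_k=None, seed=0):
    """Fit a GBM and return (name, importance) pairs, most important first."""
    model = GradientBoostingClassifier(n_estimators=100, random_state=seed)
    model.fit(X, y)
    ranked = sorted(
        zip(names, model.feature_importances_),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k] if top_k else ranked


# Synthetic stand-in data (NOT the cohort): the outcome depends strongly on
# column 0, weakly on column 1, and not at all on columns 2 and 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
signal = 2.0 * X[:, 0] + 0.5 * X[:, 1]
y = (signal + rng.normal(size=1000) > 0).astype(int)
names = ["strong", "weak", "noise_a", "noise_b"]  # hypothetical names

ranking = rank_features(X, y, names)
for name, importance in ranking:
    print(f"{name:8s} {importance:.4f}")
```

The importances returned by scikit-learn are normalized to sum to 1, so the values are comparable to the proportions reported for the ranking. Passing top_k=7 would mirror the first step of the two-step selection, with the expert review remaining a manual step.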

Relationship modeling

The participants in the derivation cohort (n = 38,322) were randomly split into a train set (n = 26,826) and a test set (n = 11,496) in a ratio of 7:3. The model was built using data from the train set and validated in the test set. To relate the variables to scores, the Framingham methods were applied. A detailed description was published previously [22]. After the continuous variables were separated into rational subgroups, the coefficients of the logistic model were used to produce scores. For the categorical variables, zero points were assigned to the reference group. All variables except age were divided into subgroups according to recommended criteria in China, including the guidelines for the prevention and treatment of dyslipidemia in China (2016). After the scores of all predictors were summed, each participant was assigned a specific stroke risk score. Individuals were ultimately divided into five risk groups by the quintiles of the calculated risk score.
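The conversion from logistic coefficients to integer points can be sketched as follows. This is an illustration of the general Framingham points approach under stated assumptions, not the exact derivation of the published score table: the constant that defines one point (here, the risk carried by 5 years of age) and the age category midpoints are assumptions, and the helper name framingham_points is hypothetical. The two coefficients are recovered from the odds ratios reported in the Results (β = ln OR).

```python
import math


def framingham_points(beta, value, reference, points_unit):
    """Integer points for one risk-factor category.

    beta        : logistic regression coefficient of the risk factor
    value       : reference value of the category being scored
    reference   : reference value of the base category (scores 0 points)
    points_unit : regression units that correspond to one point
    """
    return round(beta * (value - reference) / points_unit)


# Coefficients recovered from reported odds ratios (beta = ln OR).
beta_age = math.log(1.076)  # per year of age
beta_htn = math.log(2.208)  # hypertension, yes vs no

# ASSUMPTION: one point equals the risk carried by 5 years of age.
points_unit = 5 * beta_age

# ASSUMPTION: hypothetical age category midpoints, 22 years (base) and 62 years.
age_points = framingham_points(beta_age, 62, 22, points_unit)
htn_points = framingham_points(beta_htn, 1, 0, points_unit)
total = age_points + htn_points
print(age_points, htn_points, total)
```

Because the points depend on the chosen constant and category reference values, the numbers produced by this sketch need not match the published score table.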

Model evaluation

The C-statistic was employed to evaluate the discriminatory performance of the scoring system. The C-statistic ranges from 0.5 to 1.0 and, for a binary outcome, is equivalent to the area under the receiver operating characteristic (ROC) curve [23]. Model calibration was evaluated by the Brier score (BS). Sensitivity, specificity, and their 95% CIs were also reported.
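For readers who want to see what these two metrics compute, the following is a minimal pure-Python sketch. The function names and the four-observation example are illustrative assumptions; the pairwise loop is quadratic in sample size and is meant for explanation rather than for a cohort-sized data set.

```python
def c_statistic(y_true, y_score):
    """Share of (case, control) pairs in which the case has the higher score.

    Ties count as half a concordant pair. For a binary outcome this equals
    the area under the ROC curve.
    """
    cases = [s for y, s in zip(y_true, y_score) if y == 1]
    controls = [s for y, s in zip(y_true, y_score) if y == 0]
    if not cases or not controls:
        raise ValueError("both outcome classes are required")
    concordant = sum(
        1.0 if c > n else 0.5 if c == n else 0.0
        for c in cases
        for n in controls
    )
    return concordant / (len(cases) * len(controls))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Illustrative data only: two non-cases followed by two cases.
outcomes = [0, 0, 1, 1]
risks = [0.10, 0.40, 0.35, 0.80]
print(c_statistic(outcomes, risks), brier_score(outcomes, risks))
```

A lower Brier score indicates predicted risks that lie closer to the observed outcomes, while a higher C-statistic indicates better separation of cases from non-cases.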

To compare the performance of this scoring system with similar risk assessment tools, two scores were chosen in this study. First, we chose the Framingham Stroke Risk Profile (FSRP), which is a predictive tool recommended by the American Heart Association specifically for stroke [7]; second, as self-reported stroke outcomes were used in this study, we used a self-reported stroke risk score (SRSRF) to explore the utility of the new scoring system [24].

Statistical analysis

Descriptive characteristics of candidate variables were shown as numbers (frequencies) for categorical variables and mean ± standard deviation (SD) for continuous variables. For comparisons between groups, the chi-square test (or Fisher’s exact test) was used for categorical variables, whereas the t-test was used for continuous variables. The collinearity analysis was performed using general linear regression. The discriminatory performance of the models was compared by calculating the difference between their C-statistics on the same data, and the DeLong test was used for this comparison. A two-tailed P value of less than 0.05 was considered statistically significant. Statistical tests were performed using SPSS version 21 (IBM, Chicago, USA).

Results

Characteristics of participants

In total, 38,322 participants (15,057 men and 23,265 women) in the Henan Rural Cohort were included in this study (Supplemental Fig. 1). The mean age of the study population was 55.62 years, and 39.29% of the individuals were men. Additionally, 3301 participants (8.61%) reported a family history of stroke. At baseline, there were 2534 stroke patients (6.61%) in the Henan Rural Cohort. The detailed characteristics of the participants are shown in Table 1.

Table 1 Characteristics of several candidate variables in the Henan Rural Cohort

Selected variables and the RSRS

After selection by the GBM algorithm, the top seven variables ranked by variable importance were chosen as the factors in this analysis: age, hypertension (HTN), drinking status, family history of stroke, triglyceride (TG), waist circumference (WC), and type 2 diabetes mellitus (T2DM). After several experiments, the best performance was obtained with the combination of seven variables. These variables were all relatively easy to obtain during on-site screening through interviews, measurements, and blood collection in rural areas. The variable importance ranking obtained by the machine learning algorithm is shown in Fig. 1. Age ranked first with a variable importance of 0.4463, followed by HTN with 0.1585. Notably, the variable importance of T2DM (0.0291) was roughly 2.5 times that of income (0.0118).

Fig. 1

Importance ranking of the candidate variables in the study. The variables are listed in decreasing order of importance as obtained by GBM

Based on the β values of the logistic regression, different coefficients were assigned to different subgroups. The associations (OR, 95% CI) of the scoring variables with stroke were as follows: age (1.076, 95% CI 1.070–1.082), HTN (2.208, 95% CI 1.989–2.452), family history of stroke (2.392, 95% CI 2.063–2.774), TG (1.150, 95% CI 1.035–1.278), WC (1.123, 95% CI 1.007–1.251), T2DM (1.547, 95% CI 1.351–1.771), and drinking status (1.249, 95% CI 1.111–1.404). TG and WC were divided at cutoff points (1.7 mmol/L for TG; 85 cm for men and 80 cm for women for WC) according to medical guidelines or criteria. Age was divided into 5-year groups. The age range in the sample was 18–87. To determine the reference values for the first and last categories, the 1st and 99th percentiles were used to minimize the influence of extreme values. The logistic regression β-coefficients, P values, ORs, and 95% CIs for each of the machine learning-selected variables are displayed in Supplemental Table 1. The model demonstrated acceptable discrimination, with a C-statistic of 0.763 (95% CI 0.758–0.768) in the train set. The calibration performance was also good (BS 0.059). The sensitivity was 0.847 (95% CI 0.829–0.863), and the specificity was 0.551 (95% CI 0.545–0.557) in the train set.
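The link between the reported odds ratios and the underlying β-coefficients can be made explicit with a short sketch. It assumes the 95% CIs are Wald intervals that are symmetric on the log-odds scale, which is the usual output of logistic regression software but is not stated in the text; the helper names are hypothetical. The HTN estimate reported above is used as the example.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def beta_and_se(odds_ratio, ci_low, ci_high):
    """Recover the log-odds coefficient and its standard error.

    ASSUMPTION: the interval is a Wald interval, exp(beta +/- Z_95 * se).
    """
    beta = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
    return beta, se


def odds_ratio_ci(beta, se):
    """Map a coefficient and standard error back to (OR, lower, upper)."""
    return (
        math.exp(beta),
        math.exp(beta - Z_95 * se),
        math.exp(beta + Z_95 * se),
    )


# Hypertension: OR 2.208 (95% CI 1.989-2.452), as reported above.
beta, se = beta_and_se(2.208, 1.989, 2.452)
or_value, low, high = odds_ratio_ci(beta, se)
print(round(beta, 3), round(se, 3), round(low, 3), round(high, 3))
```

If the round trip reproduces the reported bounds to within rounding error, the interval is consistent with the Wald assumption.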

On this basis, a new scoring system was constructed for assessing the risk of stroke in rural areas: the rural stroke risk score (RSRS; range 0–25). The score table is shown in Fig. 2. We also calculated the estimated risk of stroke using the Framingham methods [22]. The risk stratification of stroke in this scoring system is displayed in five subgroups in Fig. 3. The results suggested that individuals with a score of 21 or higher had an estimated stroke risk of more than 40.46%, which we would classify as “very high.” The correlation between scores and estimated risk is shown in Supplemental Table 2.
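How a point total could be mapped to an estimated risk and a risk group can be sketched as follows. The intercept and the per-point constant are hypothetical placeholders, since the fitted values are not given in the text, so the printed risks are not the published estimates. The quintile grouping uses statistics.quantiles, which requires Python 3.8 or later.

```python
import bisect
import math
import statistics


def estimated_risk(points, intercept, points_unit):
    """Logistic risk for a point total; each point adds points_unit log-odds."""
    linear = intercept + points_unit * points
    return 1.0 / (1.0 + math.exp(-linear))


def quintile_groups(scores):
    """Assign each score to a group 1 (lowest) to 5 (highest) by quintile."""
    cuts = statistics.quantiles(scores, n=5)  # four cut points
    return [bisect.bisect_right(cuts, s) + 1 for s in scores]


# HYPOTHETICAL constants, chosen only so that the example runs.
INTERCEPT = -5.0
POINTS_UNIT = 0.2

scores = list(range(0, 26))  # every possible total on a 0-25 scale
risks = [estimated_risk(s, INTERCEPT, POINTS_UNIT) for s in scores]
groups = quintile_groups(scores)
for s, r, g in zip(scores, risks, groups):
    print(f"score {s:2d}  risk {r:.3f}  group {g}")
```

In practice the cut points would be computed from the observed score distribution of the cohort rather than from an evenly spaced list.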

Fig. 2

The scoring table of RSRS for stroke risk assessment in rural residents. The RSRS can be calculated by adding the corresponding scores of 7 items in the figure

Fig. 3

The five estimated risk groups were stratified by quintiles of calculated scores. Different risk groups are marked by boxes on the figure

Validation and comparison

The model was validated in the test set. After calculating the RSRS of all participants in the test set, we displayed the distribution of the RSRS in Supplemental Fig. 2. The scores showed an approximately normal distribution in the test set. The RSRS demonstrated good discrimination, with an overall C-statistic of 0.757 (95% CI 0.749–0.765) in the test set. The ROC curve and the calibration plot of the validation are presented in Fig. 4. The calibration plots showed that the estimated risk of disease was in good agreement with the observed risk in deciles D1–D9, and the BS value (0.058) also indicated good calibration. The sensitivity was 0.772 (95% CI 0.740–0.801), and the specificity was 0.623 (95% CI 0.614–0.632) in the test set. Other related metrics are presented in Table 2.
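Sensitivity, specificity, and their confidence intervals follow directly from a confusion matrix, as the sketch below shows. The text does not state which interval method was used; the Wilson score interval is an assumption here, and the counts are hypothetical rather than taken from the test set.

```python
import math


def wilson_interval(successes, total, z=1.959964):
    """Wilson score interval for a binomial proportion (95% by default)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)


# HYPOTHETICAL counts for illustration only.
tp, fn, tn, fp = 77, 23, 620, 380
sens, spec = sensitivity_specificity(tp, fn, tn, fp)
sens_ci = wilson_interval(tp, tp + fn)
spec_ci = wilson_interval(tn, tn + fp)
print(sens, sens_ci)
print(spec, spec_ci)
```

The interval for sensitivity is wider than the one for specificity whenever cases are much rarer than non-cases, which matches the pattern of the intervals reported above.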

Fig. 4

Receiver operating characteristic curves and calibration plots of the scoring system. a The ROC curve for the RSRS in the train set (AUC, the area under the curve). b Calibration of observed and estimated stroke risks by deciles of estimated risk in the train set. c The ROC curve for the RSRS in the test set. d Calibration of observed and estimated stroke risks by deciles of estimated risk in the test set

Table 2 Comparison of the performance metrics of 3 stroke risk scores in the test set

We also compared this model with other previously published risk scores. The FSRP had a C-statistic of 0.714, a BS of 0.059, a sensitivity of 0.719, and a specificity of 0.588. The SRSRF, a stroke risk score based on self-reported variables, had a C-statistic of 0.705, a BS of 0.067, a sensitivity of 0.762, and a specificity of 0.565.

Mobile application

In this study, an RSRS-based mobile application was developed to help physicians and CDC personnel use the score more easily and quickly when conducting on-site disease screening. Screenshots of the mobile app for calculating the RSRS are shown in Supplemental Fig. 4. This app allows both health care workers and residents to quickly assess stroke risk.

Discussion

In this study, a new scoring system was developed for assessing the risk of stroke, which was based on machine learning algorithms to determine the key factors. It increased the accuracy as well as the degree of data utilization in model construction. After validation, the RSRS was shown to be applicable in the Chinese rural population.

Risk scoring systems are important for disease prevention. The RSRS was developed to estimate stroke risk in economically underdeveloped regions, so that cardiovascular diseases can be prevented early and the prevalence of stroke in China can be reduced at the population level. Additionally, the simple and visual scale associated with the RSRS avoids the complexity of traditional prediction models, which supports its practicality. With the standardized questionnaires used in this study, age, parental history of stroke, and drinking status can be obtained conveniently, which is a further strength of the RSRS for disease risk assessment in rural areas.

As is well known, early intervention in stroke plays a significant role in disease prevention [4]; thus, constructing a practical, convenient, and efficient scoring system for stroke risk assessment is vital. In the present study, the C-statistic of this model in the test set was 0.757 (95% CI 0.749–0.765), indicating good predictive performance. Moreover, the BS values show that the model is well calibrated. Compared with the FSRP and the SRSRF, the RSRS showed an improvement in discrimination (Δ = 0.043 and 0.052, respectively). Notably, in Fig. 4, the RSRS gave a higher estimated risk for individuals with higher scores; however, this was not evident in the calibration plot of the initial logistic model (shown in Supplemental Fig. 3). This phenomenon was also observed in other studies, such as the China-PAR [25] and the Taiwan AF score [26]. According to clinical suggestions [3, 27], stroke is a severe disease that needs to be emphasized at an early stage, so this phenomenon may raise awareness of stroke prevention.

Previous studies indicated that machine learning demonstrated higher performance for risk assessment in patients with ASCVD [28, 29]. The machine learning algorithm provided the importance ranking of candidate variables, which not only improved the rigor and accuracy of variable selection but also optimized the performance of the scoring system [13]. The RSRS was constructed based on GBM feature selection, whereas the predictors in the Stroke Riskometer™ were determined by discussion and literature review. The artificial intelligence method thus made fuller use of the data. Although a few risk scores have incorporated machine learning, the RSRS fills a gap in the field of rural stroke risk assessment. Moreover, unlike basic classifiers, GBM is an ensemble machine learning classifier, which improves the efficiency of data utilization.

Several risk scores have been produced for stroke [6], such as the FSRP. However, the predictors in a scoring system greatly affect its applicability. For instance, the SCAIL score performed well in the identification of early recurrent stroke [11], but its predictors included lumen stenosis, which requires medical imaging findings for diagnosis. The Stroke Riskometer™ mentioned above is a mobile app based on the FSRP; however, after some variables were added to the initial FSRP, there were nearly 17 predictors in this scoring system [30, 31]. This greatly reduced the convenience of the risk score, and its application in rural regions was therefore severely limited. The RSRS includes accessible predictors such as waist circumference, so it is easier to apply. We summarized the predictors of different models in Supplemental Table 3.

The present study has the following strengths. First, to the best of our knowledge, this is the first study to develop a machine learning-combined risk score in the rural population. Second, the sample size (n = 38,322) is relatively large compared with similar research. Finally, the seven-variable scoring system is simple for clinicians to use in on-site investigations, especially in rural areas.

Limitations also exist in this study. First, we conducted this analysis in a cross-sectional study with no follow-up data. Second, there are two invasive indices among the predictors; however, blood biochemistry tests and portable glucose meters are now accessible even in rural areas, so the score remains convenient overall. Additionally, as the purpose of this study is to develop a score for stroke risk assessment, it needs to be further validated in other cohorts, and its performance in other ethnic groups remains uncertain.

Conclusion

This study developed a convenient tool for stroke risk assessment among rural residents, which is valuable for identifying individuals with higher stroke risk. Additionally, the application of this risk score could facilitate the management of cardiovascular and neurological diseases and guide the prevention of stroke in underdeveloped areas.