1 Introduction

According to the World Health Organization (WHO), obesity is defined as abnormal or excessive body fat mass that has many negative consequences for health [1]. Owing to increasingly sedentary lifestyles, especially in childhood, it has become an epidemic disease [2]. Its prevalence in Iran is 21.7% among people over 18 years old [3]. In 2016, 39% and 13% of adults were overweight and obese, respectively [1]. Another study estimated that overweight and obesity will affect up to 57.8% of adults in the world by 2030 [4]. Cardiovascular diseases (CVDs) are the leading cause of death globally [5]. It is estimated that more than 23.6 million people will die from CVDs by 2030. Obesity is related to hypertension, diabetes mellitus, metabolic syndrome, and dyslipidemia, which are all major risk factors for CVDs. Obesity can therefore lead to heart failure, coronary artery disease, and atrial fibrillation [4, 6]. The pathology of these risk factors derives from excess and disproportionately distributed fat mass, especially abdominal and central obesity [7]. Recent evidence suggests that anthropometric measurements including Body Adiposity Index (BAI), Waist-to-Hip Ratio (WHR), Body Roundness Index (BRI), Body Mass Index (BMI), A Body Shape Index (ABSI), Waist Circumference (WC), and Waist-to-Height Ratio (WHtR) can evaluate body fat mass properly and have a positive linear relation with CVDs mortality [8,9,10].

BMI is widely used for measuring obesity. However, it cannot distinguish between weight gained from lean muscle, fat, or bone density [11]. Also, some studies confirm that age, gender, and ethnicity can influence the relation between BMI and CVDs [12, 13]. WHO-recommended cut-offs for BMI and WC may not be able to identify Asian subjects with CVDs [14]. Previous research has shown that WHtR is a better predictor of CVDs than BMI and WC [15, 16]. For example, WHtR has been recommended as the best predictor of hypertension in elderly males in Korea and Taiwan [17, 18]. On the other hand, WHR has been shown to predict hypertension in Argentinian males and females [19]. Also, Esmaillzadeh et al. [20] reported that WHR is a better predictor of CVDs than WHtR, BMI, and WC in adult males in Tehran, Iran. ABSI is a novel anthropometric index combining BMI and WC that shows a better predictive value for CVDs than WC or BMI [21,22,23]. A study in Singapore found that BAI may be useful for measuring overall adiposity but is not likely to be a stronger predictor than BMI [24]. A systematic review suggested that BRI is a possible predictor of hypertension in adults and performs better than ABSI [25].

Machine learning is the study of how computers learn from data. In medicine, there are many fields in which machine learning approaches have been beneficial. "Supervised" and "unsupervised" learning are the two main types of tasks in machine learning [26]. A decision tree (DT) is a supervised machine learning algorithm. DT is a non-parametric method that belongs to a class of simple regression models for explanation and prediction [27]. DT models might therefore help health policymakers to prioritize interventions that focus on high-risk groups. Because classical statistical methods cannot easily select predictors, we used data mining methods such as DT to identify the anthropometric measurements most strongly associated with CVDs.

Obesity and CVDs are growing public health concerns worldwide. Although many studies have sought to identify the anthropometric measurements most strongly associated with CVDs, controversies still exist. Also, few studies in Iran have examined the association of anthropometric indices with CVDs using machine learning. The Mashhad stroke and heart atherosclerotic disorder (MASHAD) study was started in 2010 to investigate the risk factors of CVDs in a north-eastern population of Iran. Therefore, we aimed to identify the anthropometric indicators most strongly associated with CVDs using data mining in this population after six years of follow-up.

2 Methods

2.1 Study Population

The Mashhad stroke and heart atherosclerotic disorder (MASHAD) study was a large longitudinal prospective cohort study that was started in 2010 and continued until 2020. The purpose of this study was to evaluate various risk factors for CVDs in Mashhad, north-eastern Iran. Accordingly, 9704 healthy individuals without CVDs or other chronic conditions aged 35 to 65 were enrolled. Written informed consent was obtained from all participants. Subjects were followed up for CVDs incidence every three years. All patients who reported suspected CVDs during the follow-up years were contacted and consulted with a cardiologist. Finally, 235 people were confirmed with a diagnosis of CVDs (29 patients with myocardial infarction, 118 patients with unstable angina, 63 patients with stable angina, and 25 patients with cardiovascular death). The rest of the participants were considered healthy subjects (9469). The study was approved by the Ethics Committee of Mashhad University of Medical Sciences.

2.2 Diagnosis of CVDs

During the 6-year follow-up, an expert cardiologist conducted a physical examination to determine whether participants had CVDs based on their medical history. At the beginning of the study, a 12-lead electrocardiogram was performed in all subjects and subsequently reviewed by a cardiologist for signs of changes in P, QRS, T, and especially Q waves using the Minnesota code. A history of myocardial infarction, angina pectoris or a definite Q wave on an electrocardiogram confirmed the presence of CVDs [28, 29]. Other investigations were performed when the cardiologist had doubts about the diagnosis of CVDs, including computed tomography angiography, exercise tolerance testing (ETT), echocardiography, stress echocardiography, radioisotope and angiography.

2.3 Data Collection and Anthropometric Measurements

The variables documented in this database included demographic characteristics (such as gender and age) and anthropometric information (such as weight, height, and BMI). A history of diabetes, hypertension, obesity, dyslipidemia, depression, anemia, anxiety, and metabolic syndrome was also recorded for each individual. All these variables were measured at the beginning of the study as described previously [30].

An experienced clinician recorded the anthropometric measurements and indices, including weight, height, ABSI (A Body Shape Index), BAI (Body Adiposity Index), BMI (Body Mass Index), BRI (Body Roundness Index), demispan, HC (Hip Circumference), WC (Waist Circumference), WHR (Waist-to-Hip Ratio), and WHtR (Waist-to-Height Ratio) (Table 1). For the measurement of weight and height, participants were asked to wear light clothes and no shoes. Height was measured with a fixed stadiometer to the nearest 0.1 cm, and weight with electronic scales to the nearest 0.1 kg. BMI was then defined as weight (kg) divided by the square of height (m). According to World Health Organization (WHO) cut-offs, 25 < BMI ≤ 29.9 kg/m² was considered overweight and BMI ≥ 30 kg/m² was defined as obese [31]. ABSI, BAI, and BRI were calculated using the formulas below:

Table 1 Baseline characteristics of the study population
$$ABSI=\frac{WC}{{BMI}^{2/3}\,{Height}^{1/2}}$$
$$BAI=\frac{Hip\,(cm)}{Height\sqrt{Height}}-18$$
$$BRI=364.2-365.5\times \sqrt{1-\frac{{\left(WC/2\pi \right)}^{2}}{{\left(0.5\,Height\right)}^{2}}}$$
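The three formulas above, together with the BMI definition, can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the units are assumptions (WC and height in metres for ABSI and BRI, hip circumference in centimetres and height in metres for BAI), since the text does not state them for every index.

```python
import math


def bmi(weight_kg, height_m):
    """Body Mass Index: weight (kg) divided by the square of height (m)."""
    return weight_kg / height_m ** 2


def absi(wc_m, bmi_value, height_m):
    """A Body Shape Index; WC and height assumed to be in metres."""
    return wc_m / (bmi_value ** (2 / 3) * height_m ** 0.5)


def bai(hip_cm, height_m):
    """Body Adiposity Index; hip in cm, height assumed to be in metres."""
    return hip_cm / (height_m * math.sqrt(height_m)) - 18


def bri(wc_m, height_m):
    """Body Roundness Index; WC and height must share the same unit."""
    ratio = (wc_m / (2 * math.pi)) ** 2 / (0.5 * height_m) ** 2
    return 364.2 - 365.5 * math.sqrt(1 - ratio)


# Hypothetical participant, not taken from the MASHAD data.
example_bmi = bmi(80.0, 1.75)
example_absi = absi(0.95, example_bmi, 1.75)
example_bai = bai(100.0, 1.75)
example_bri = bri(0.95, 1.75)
```

Because BRI depends only on the ratio of WC to height, any common unit works there; ABSI and BAI are unit-sensitive, so the unit assumptions matter for comparing values with published ranges.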

Demispan was defined as the distance between the mid-sternal notch and the space between the middle and ring fingers with the arm outstretched. During the measurement of HC and WC, participants were asked to stand and exhale. HC was measured at the largest circumference between the crotch and the iliac crest. WC was then measured at the midpoint between the iliac crest and the last rib. A retractable tape measure was used for all measurements. WHR was calculated as WC divided by HC, and WHtR as WC (m) divided by height (m). According to the WHO, truncal obesity is defined as WHR ≥ 0.95 in men and WHR ≥ 0.8 in women. According to International Diabetes Federation (IDF) protocols, WC ≥ 94 cm in men and ≥ 80 cm in women was defined as high.
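As a minimal sketch, the two ratios and the sex-specific cut-offs described above could be written as follows. The function names and the "male"/"female" string coding are assumptions made for illustration only.

```python
def whr(wc_cm, hc_cm):
    """Waist-to-hip ratio: WC divided by HC (same unit for both)."""
    return wc_cm / hc_cm


def whtr(wc_cm, height_cm):
    """Waist-to-height ratio: WC divided by height (same unit for both)."""
    return wc_cm / height_cm


def has_truncal_obesity(whr_value, sex):
    """WHO cut-off quoted in the text: WHR >= 0.95 (men), >= 0.8 (women)."""
    threshold = 0.95 if sex == "male" else 0.8
    return whr_value >= threshold


def has_high_wc(wc_cm, sex):
    """IDF cut-off quoted in the text: WC >= 94 cm (men), >= 80 cm (women)."""
    threshold = 94 if sex == "male" else 80
    return wc_cm >= threshold
```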

2.4 Statistical Analysis

The data were analyzed using SPSS version 22 (Armonk, NY: IBM Corp.) and SAS JMP Pro version 13 (SAS Institute Inc., Cary, NC). Quantitative and qualitative variables were described as mean ± SD and frequency (%), respectively. Chi-square and Fisher’s exact tests were applied to measure the association between categorical variables. The means of quantitative variables were compared between the two groups using the independent t-test. Data mining techniques, namely logistic regression (LR) and decision tree (DT) algorithms, were used to analyze the data.

The number of individuals who developed CVDs during the follow-up period was much smaller than the number of people who did not. In other words, only 2.4% of participants developed the disease during the follow-up period. Such a data set is called an “unbalanced dataset”; it is very common in studies of this kind and occurs when the number of observations in one category is much smaller or larger than in the other categories. Because the results of most learning-based models are sensitive to unbalanced data, a Bayesian theory-based approach was used to transform the unbalanced data set into a balanced one. Based on a prior distribution, samples of 10 observations were drawn so that 8 or 9 cases of disease and a maximum of 2 cases of non-disease were selected. In each step, the sampling was repeated based on the posterior distribution function. These steps were continued until the number of disease cases was very close to that of the other category, i.e. non-disease. After removing observations with missing data in any of the measured variables, the analysis was performed on the balanced data set, which finally comprised 9354 observations, as shown in Fig. 1.
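The degree of imbalance can be checked with simple arithmetic from the counts reported in Section 2.1 (235 confirmed CVDs cases and 9469 healthy subjects). The short sketch below is illustrative only; the variable names are hypothetical, and it does not reproduce the Bayesian resampling procedure itself.

```python
# Counts reported for the original (unbalanced) cohort.
cvd_cases = 235
healthy = 9469
total = cvd_cases + healthy

# Proportion of participants who developed CVDs during follow-up.
incidence = cvd_cases / total

# Number of healthy subjects per CVDs case.
imbalance_ratio = healthy / cvd_cases

# Accuracy of a trivial classifier that labels everyone "non-CVDs".
# Its high value shows why accuracy is misleading on unbalanced data.
majority_baseline_accuracy = healthy / total

print(f"incidence = {incidence:.1%}")
print(f"healthy subjects per case = {imbalance_ratio:.1f}")
print(f"majority-class accuracy = {majority_baseline_accuracy:.1%}")
```

A model trained on such data can reach a high accuracy while never identifying a single case, which is the motivation for balancing the classes before fitting LR and DT.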

Fig. 1 Flow chart of this study

2.5 Logistic Regression (LR) Modelling

Logistic Regression is a popular model to evaluate the relationship between various predictor variables (either categorical or continuous) and binary outcomes in medicine, public health, etc. [32].

Let \({Y}_{i}\) denote the response variable, which takes the value 0 or 1 depending on whether the response occurs or not. Also, let \({\varvec{X}}\) be the vector of covariates associated with the response variable and \({\varvec{\beta}}\) the corresponding vector of regression coefficients. The association between the covariates and the binary response variable can then be investigated as follows:

$$logit\{E({Y}_{i})\}=logit\{Pr({Y}_{i}=1| {\varvec{X}},{\varvec{\beta}})\} = {{\varvec{\beta}}}^{{\varvec{T}}}{\varvec{X}}.$$
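To make the link function concrete, the sketch below shows how a linear predictor \({{\varvec{\beta}}}^{{\varvec{T}}}{\varvec{X}}\) is converted into a probability by inverting the logit. The coefficient values are hypothetical placeholders chosen for illustration; they are not the estimates reported in Table 2.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))


def inv_logit(eta):
    """Inverse of the logit: maps a linear predictor to a probability."""
    return 1 / (1 + math.exp(-eta))


def predict_probability(beta, x):
    """Pr(Y = 1 | X, beta) for coefficient vector beta and covariates x."""
    eta = sum(b * xi for b, xi in zip(beta, x))
    return inv_logit(eta)


# Hypothetical coefficients: intercept, age (years), BRI.
beta = [-6.0, 0.07, 0.3]
# Covariate vector with a leading 1 for the intercept.
x = [1.0, 55.0, 4.5]
probability = predict_probability(beta, x)
```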

2.6 Decision Tree (DT) Modelling

Data mining is an artificial intelligence approach to data analysis that emerged in the late 20th century [33, 34]. It is a process for extracting hidden knowledge from large data sets. One of the important problems for researchers in this process is data classification, for which different techniques exist [35]. DT can be applied in various applications in the medical field [36,37,38,39]. Because it is easy to understand and yields simple, interpretable rules, it is widely applied and studied in these fields [35]. A DT consists of nodes and branches. There are three types of nodes. First, the root node represents the result of subdividing all records into two or more mutually exclusive subsets. Second, the internal nodes represent possible points in the tree structure, connected to the root node above and to the leaf nodes below. Third, the leaf nodes show the final results of the tree in dividing records into target groups. Branches, which emanate from the root node and the internal nodes, indicate the chance of placing records in target groups [33, 34]. The DT algorithm uses the Gini impurity index to select the best variable:

$$Gini\left(D\right)=1-\sum_{i=1}^{m}{P}_{i}^{2}$$

where \({P}_{i}\) is the probability that a record in D belongs to the class \({C}_{i}\) and is estimated by \(|{C}_{i,D}|/|D|\). Logistic regression (LR) is a statistical model used for modeling dichotomous targets and investigating the effect of explanatory variables on them. LR also provides the probability of each record belonging to each target group [40, 41]. The main advantage of LR is that it can show a clear direct or inverse relationship between the inputs (explanatory variables) and the target. It is also a flexible method [42].
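The Gini impurity index defined above can be illustrated with a short sketch. The `gini` function follows the formula directly; the `split_gini` function assumes the usual CART-style criterion, in which a candidate split is scored by the size-weighted impurity of its child nodes. The text does not spell out this second step, so it should be read as an assumption, and both function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini(D) = 1 - sum_i P_i^2, with P_i estimated as |C_i,D| / |D|."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def split_gini(left_labels, right_labels):
    """Size-weighted Gini impurity of a binary split (assumed criterion)."""
    n = len(left_labels) + len(right_labels)
    left_weight = len(left_labels) / n
    right_weight = len(right_labels) / n
    return left_weight * gini(left_labels) + right_weight * gini(right_labels)


# Toy labels (1 = CVDs, 0 = non-CVDs), not taken from the study data.
parent = [0, 0, 1, 1]
pure_split = split_gini([0, 0], [1, 1])
mixed_split = split_gini([0, 1], [0, 1])
```

Under this criterion, the variable and threshold that give the lowest weighted impurity would be chosen at each node, so a split that separates the classes perfectly scores 0 and an uninformative split keeps the parent's impurity.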

3 Results

3.1 Characteristics of the Study Population

Table 1 summarizes the demographic and clinical characteristics of the study population. Quantitative and qualitative variables are reported as mean ± SD and frequency (%), respectively. Of the 9354 observations in the balanced data set, 4596 (49%) were CVDs cases and 4758 (51%) were non-CVDs cases over the follow-up period. Females made up the majority of both the CVDs and non-CVDs groups (54.50% and 60.82%, respectively). The mean age of CVDs cases was significantly higher than that of the non-CVDs group in both males and females (53.94 ± 7.39 vs. 48.69 ± 8.50 and 54.61 ± 6.64 vs. 47.17 ± 7.97, respectively; p-value < 0.001).

3.2 The Association Between Anthropometric Measurements and CVD Using Logistic Regression (LR) Model

According to the data mining analysis results (Table 2), age, BAI, BMI, demispan, and BRI for males and age, BAI, BMI, and WC for females had a significant association with CVDs (p-value < 0.03). According to the logWorth measures, age and BRI for males and age and BMI for females had the strongest association with CVDs development among the analyzed anthropometric factors (Table 2). Based on Table 1, the LR model indicated no significant association between the other investigated anthropometric factors and CVDs.

Table 2 Parameter estimates of the LR model for CVD by anthropometric factors in male and female

Table 2 also lists the parameter estimates for all significant anthropometric factors. Using these estimates, a regression formula for predicting CVDs from the significant factors can be constructed (see Appendix).

In Table 2, based on the chi-square test results and the corresponding p-values, age, BAI, BMI, BRI, and demispan for males and age, BRI, BMI, and WC for females showed a significant association with CVDs (p-value < 0.03).

Table 2 also gives the unit odds ratios for the significant factors. BRI, BAI, BMI, age, and demispan for males and age, BRI, BMI, and WC for females were significantly associated with CVDs. Among these factors, age and BRI were identified as the most notable risk factors for CVDs in males (OR 1.07 (95% CI 1.06, 1.08) and 1.36 (95% CI 1.22, 1.51), respectively). For females, age and BMI were identified as the most notable risk factors for CVDs (OR 1.14 (95% CI 1.13, 1.15) and 1.05 (95% CI 1.02, 1.07), respectively).
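A unit odds ratio describes the change in odds for a one-unit increase in a predictor. Under the usual LR assumption that the effect is linear on the log-odds scale, the odds ratio for a k-unit increase is the unit odds ratio raised to the power k. The sketch below applies this to the reported unit odds ratio for age in males (1.07 per year); the function names are hypothetical, and the 10-year increment is chosen only for illustration.

```python
import math


def coefficient_from_unit_or(unit_or):
    """Regression coefficient (log-odds per unit) implied by a unit OR."""
    return math.log(unit_or)


def scaled_odds_ratio(unit_or, k):
    """Odds ratio for a k-unit increase, assuming log-linear effects."""
    return unit_or ** k


# Unit odds ratio and 95% CI for age in males, as reported in the text.
age_or_male = 1.07
age_ci_male = (1.06, 1.08)

# Odds ratio for a 10-year age difference, with the CI scaled the same way.
ten_year_or = scaled_odds_ratio(age_or_male, 10)
ten_year_ci = tuple(scaled_odds_ratio(bound, 10) for bound in age_ci_male)
```

This rescaling explains why a per-year odds ratio close to 1 can still correspond to a large difference in odds across the 35 to 65 age range of the cohort.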

3.3 The Association Between Anthropometric Measurements and CVD Using Decision Tree (DT) Model

The results of the DT for anthropometric factors in males are shown in Fig. 2. The DT algorithm evaluated the various CVDs risk factors and categorized them into four layers. In the DT model, the first variable (the root) is the most important, and the variables at subsequent levels are of decreasing importance. As shown in Fig. 2, BRI had the greatest effect on the risk of developing CVDs in males, followed by age and BMI.

Fig. 2 Decision tree model for male

The DT model for males indicated a higher proportion of CVDs among participants with BRI ≥ 3.87 than among those with lower BRI. In the subgroup with BRI < 3.87 and age < 46, 99% of participants were non-CVDs (lowest risk of CVDs). Meanwhile, among those with BRI ≥ 3.87, age ≥ 46, and BMI ≥ 35.97, 90% of subjects were identified as CVDs (highest risk of CVDs). Detailed rules for CVDs created by the DT model are presented in Table 3.

Table 3 Detailed rules based on DT model for male and female

Also, as shown in Fig. 3, age had the greatest effect on the risk of developing CVDs in females, followed by BMI and WC. The DT model for females indicated a higher proportion of CVDs among participants with age ≥ 46 than among younger participants. In the subgroup with age < 46 and BMI < 29.04, 97% of participants were non-CVDs (lowest risk of CVDs). Meanwhile, among those with age ≥ 46, then age ≥ 54, and WC ≥ 84, 71% of subjects were identified as CVDs (highest risk of CVDs). Detailed rules for CVDs created by the DT model are presented in Table 3.
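The lowest-risk and highest-risk paths described for the two trees can be written as explicit rules. The sketch below encodes only the four leaves quoted in the text; the remaining paths are listed in Table 3 and are deliberately left unhandled here. The function names are hypothetical, and the units (age in years, BMI in kg/m², WC in cm) are assumed from the Methods.

```python
def male_rule(bri, age, bmi):
    """Return (label, proportion) for the two male leaves quoted in the text."""
    if bri < 3.87 and age < 46:
        return ("non-CVDs", 0.99)
    if bri >= 3.87 and age >= 46 and bmi >= 35.97:
        return ("CVDs", 0.90)
    return None  # path not described in the text; see Table 3


def female_rule(age, bmi, wc):
    """Return (label, proportion) for the two female leaves quoted in the text."""
    if age < 46 and bmi < 29.04:
        return ("non-CVDs", 0.97)
    if age >= 54 and wc >= 84:
        return ("CVDs", 0.71)
    return None  # path not described in the text; see Table 3


# Hypothetical individuals, not taken from the study data.
low_risk_male = male_rule(bri=3.2, age=40, bmi=24.0)
high_risk_female = female_rule(age=60, bmi=31.0, wc=95)
```

Writing the rules this way makes the interaction structure visible: in males, BMI only matters after the BRI and age splits, while in females the first split is on age.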

Fig. 3 Decision tree model for female

4 Discussion

This large prospective cohort study investigated the association between anthropometric factors and CVDs risk over a 6-year follow-up. We found significant associations of age, BAI, BMI, demispan, and BRI in males and of age, BAI, BMI, and WC in females with the risk of CVDs. Based on the ORs obtained, BRI, age, and BMI increased CVDs incidence. Among these risk factors, BRI and age in males and age and BMI in females were the most significant risk factors for CVDs.

Many recent studies have examined the association between aging and the incidence of CVDs. As expected, our results showed that CVDs incidence is significantly associated with older age. Various studies have demonstrated a steeper increase in the risk of CVDs complications at older ages (> 50 years) [43]. In line with our study, some research has confirmed a linear relationship between aging and some CVDs outcomes, such as carotid artery intimal thickness [44, 45]. The Multi-Ethnic Study of Atherosclerosis (MESA) in America also showed a strong relationship between age and the incidence of cardiovascular problems over a 9-year follow-up [46].

There are uncertainties about using BAI to predict the risk of CVDs. According to several studies, BAI is a poor predictor and is not superior to other anthropometric measurements [47,48,49,50,51]. In contrast to these findings, D’Elia et al. [52] showed in an 8-year follow-up that BAI is a strong predictor of CVDs complications. Another longitudinal study in Brazil found that higher BAI levels were associated with a greater risk of developing coronary heart disease in both genders and in different age groups [53]. Xinyan et al. [54], in a cross-sectional study in Singapore, showed that BAI can be the best predictor in women; however, a prospective cohort study by Susanne et al. [55] demonstrated that BAI is a good predictor of CVDs incidence in males but not in females. A possible explanation for these contradictory results might be the variety of ethnic populations. In our study, we found that BAI has a notable association with a decreased risk of CVDs in males.

Recently, growing evidence has shown that a new index, BRI, is a promising indicator of body fat percentage and visceral adipose tissue. It has also been shown to be related to premature death [21, 56]. In line with our findings, recent studies have reported that BRI can be a good predictor of CVDs in comparison with WC and BMI [57,58,59] and that it is strongly associated with hypertension, diabetes mellitus, and metabolic syndrome, which are major risk factors for CVDs [60,61,62].

In the present study, we evaluated the effect of BMI and body composition on the complications of CVDs. Based on the evidence, higher BMI, and in particular a higher fat mass index, is associated with an increased risk of further consequences. A recent cohort study in Korea also addressed the importance of the link between BMI and CVDs [63]. In a Japanese retrospective cohort study, a strong association was found between BMI and CVDs in middle-aged men [64]. Several conventional observational studies have examined the association of BMI with the incidence of CVDs [65]. In line with our findings, a cohort study and a meta-analysis of observational studies on British women also showed a significant increase in the risk of ischemic attacks with BMI in European and Asian individuals [66]. Interventional lifestyle changes and weight-loss diets aimed at correcting BMI have shown remarkable beneficial effects on reducing cardiovascular risk in various studies [66]. In summary, these meta-analyses generally confirm the association of very low or high BMI with mortality risk in CVDs patients, and suggest that the results may also be influenced by gender, age, and follow-up time [63]. However, a meta-analysis of observational studies found no difference in BMI between patients with aortic aneurysms and healthy individuals [65]. It is important to consider the effect of age when assessing CVDs risk factors and BMI, because obesity increases with aging [46].

To the best of our knowledge, this is the first prospective cohort study with a large population-based sample to identify the association of anthropometric factors with the incidence of CVDs using logistic regression and decision tree analysis. Other strengths of the present study are the sufficient sample size, high quality of data, standardized measures, the use of logistic regression, imputation of missing data, and decision tree construction. This type of study and the large sample size can provide a helpful baseline for further studies.

There are some limitations that need to be considered. Our study used only baseline data from the MASHAD cohort study, and it may not be possible to infer causality. Residual unknown confounders may have affected our results. Also, environmental and genetic factors, such as certain dietary intakes and lifestyle characteristics, need to be taken into account when describing the relationship between CVDs incidence and anthropometric risk factors. Additionally, the number of people who developed CVDs during the follow-up period was much smaller than the number of people who did not (unbalanced data). Individuals older than 65 years were not investigated in our study, which could be responsible for the low incidence of CVDs during the follow-up. In addition, no information was available about the cardiovascular medications that participants had taken (statins, aspirin, etc.). Furthermore, ORs were not calculated for all of our anthropometric factors, which requires further study.

5 Conclusion

This 6-year follow-up cohort study can be a helpful resource for public health care. In this study, we evaluated the association between anthropometric measurements and CVDs risk using LR and DT. Our findings showed that BRI and age for males and age and BMI for females are the most significant factors for CVDs incidence. Also, BRI, age, and BMI had the highest index for the prediction of CVDs. DT identified various interactions between predictor variables of CVDs. As for DT, age was the first and the most crucial root. Further studies with a large sample size are required to assess the association between BRI and the development of CVDs.