Introduction

Metabolic dysfunction-associated fatty liver disease (MAFLD), previously known as nonalcoholic fatty liver disease (NAFLD), is closely linked with metabolic disturbances and has become the most common chronic liver disease worldwide in the 21st century [1, 2]. MAFLD can progress to serious liver conditions, including fibrosis, cirrhosis, and hepatocellular carcinoma [3, 4], and is associated with conditions related to metabolic syndrome, notably diabetes, hypertension, cardiovascular disease, chronic kidney disease, and systemic inflammation [5,6,7,8]. Timely screening and careful management of MAFLD are therefore essential for preventing and mitigating its complications.

During routine screening, accurately identifying MAFLD can be challenging because patients may show no overt signs or symptoms in the early stages. Moreover, conventional biochemical tests do not directly quantify hepatic steatosis or inflammation. The population at risk of MAFLD may therefore be large, which calls for screening across a substantial cohort. Initial diagnosis and staging of MAFLD commonly combine clinical history, laboratory tests, and imaging. More precise assessment requires liver biopsy, currently the accepted reference standard, which provides detailed information on hepatic histology, fibrosis progression, and the degree of inflammation [9]. However, liver biopsy is invasive and carries inherent risks and discomfort, which limits its clinical use. Recently, many researchers have used non-invasive indicators and machine learning (ML) algorithms to identify MAFLD risk, with promising results [10,11,12]. These studies nonetheless have limitations. Some predictive models are specific to particular populations [10], while others rely on data that are difficult to collect, such as omics data and biomarkers [11, 12]. In addition, some studies have small sample sizes or do not account for ethnic differences [13, 14]. As a result, current MAFLD predictive models do not meet the needs of large-scale population health screening.

Xinjiang, located in China, is an expansive and culturally diverse region. Owing to the distinct dietary preferences and genetic differences among its ethnic groups, the prevalence of overweight and obesity in this area is notably high [15,16,17], which favors the accumulation of hepatic fat. Moreover, the region’s climatic extremes, desertification, air quality problems, and relatively isolated geographic position strongly influence the lifestyles and nutritional intake of its inhabitants [18], indirectly increasing vulnerability to MAFLD. MAFLD prediction research in a large, ethnically diverse Xinjiang population is therefore well suited to the region and can advance our understanding of MAFLD risk factors in this setting. Against this background, this study aims to develop the optimal ML model for identifying MAFLD patients in a large-scale health examination population in Western China. Two categories of models, tree-based models and other ML models, were constructed and compared for predictive performance. Furthermore, the study analyzed the important predictive factors to facilitate large-scale MAFLD screening and to gain insight into MAFLD risk factors.

Methods

Study population

The data used in this study came from the China Xinjiang National Health Examination Program, carried out in 2021. Details of the research framework and the participant selection criteria of this program are given in our previous study [19]. The program collected health examination data from 9,382,225 individuals. The exclusion criteria were: (i) missing values for important variables related to the diagnosis of MAFLD, such as plasma triglycerides and high-density lipoprotein cholesterol (n = 3,752,890); (ii) age less than 18 years or older than 100 years (n = 444,569); and (iii) liver cirrhosis, liver tumors, or liver cancer (n = 13,374). After screening, 5,171,392 participants from 14 regions were eligible for subsequent analysis (Fig. 1). Missing values in the included data were imputed using the random forest algorithm. The number of participants in each region was as follows: Hotan (735,022), Ili (723,546), Aksu (729,658), Changji (373,288), Tacheng (323,568), Bayingol Mongolian (319,160), Altay (182,453), Turpan (174,962), Bortala Mongolian (105,143), Hami (166,406), Kizilsu Kirgiz (136,787), Karamay (75,276), Kashgar (924,608), and Urumqi (201,515). The study was approved by the Ethics Committee and Institutional Review Committee of the First Affiliated Hospital of Xinjiang Medical University (K202101-20).

Fig. 1 Analysis process of this study. LASSO, least absolute shrinkage and selection operator; PPV, positive predictive value; NPV, negative predictive value; AUC, area under the receiver operating characteristic curve

MAFLD diagnosis

MAFLD was diagnosed according to the current assessment criteria [20]: radiological imaging-confirmed hepatic steatosis together with at least one of the following three criteria: overweight or obesity, type II diabetes mellitus, or metabolic dysregulation. Hepatic steatosis was determined by diagnostic abdominal ultrasound and by a physician-administered questionnaire asking participants about their disease history (e.g., whether they had ever been diagnosed by a doctor with fatty liver, fatty accumulation, or degeneration of the liver). Overweight or obesity was defined as a body mass index (BMI) ≥ 23 kg/m² (Asian cut-off value). Type II diabetes mellitus was defined by self-reported medical diagnosis, a history of type II diabetes, or a fasting glucose value ≥ 7.0 mmol/L. Metabolic dysregulation was defined by meeting two or more of the following criteria: (1) waist circumference (WC) ≥ 90/80 cm (Asian cut-off value) in men/women, (2) blood pressure ≥ 130/85 mmHg or specific medication, (3) plasma triglycerides ≥ 1.70 mmol/L or specific medication, (4) high-density lipoprotein cholesterol (HDL-C) < 1.0 mmol/L for males and < 1.3 mmol/L for females, and (5) pre-diabetes status (fasting blood glucose level from 5.6 to 6.9 mmol/L or HbA1c from 39 to 47 mmol/mol).
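The diagnostic rule above can be written as a small decision function. The following is a minimal sketch, not the study's code: the `Participant` record and its field names are hypothetical, the blood pressure criterion is read as systolic ≥ 130 or diastolic ≥ 85 mmHg (an assumption about how "130/85" is applied), and the HbA1c branch of the pre-diabetes criterion is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical record; field names are illustrative, not from the study."""
    male: bool
    steatosis: bool          # hepatic steatosis on ultrasound or reported diagnosis
    bmi: float               # kg/m^2
    diabetes_history: bool   # self-reported diagnosis or history of type II diabetes
    fbg: float               # fasting blood glucose, mmol/L
    wc: float                # waist circumference, cm
    sbp: float               # systolic blood pressure, mmHg
    dbp: float               # diastolic blood pressure, mmHg
    tg: float                # plasma triglycerides, mmol/L
    hdlc: float              # HDL cholesterol, mmol/L
    bp_medication: bool = False
    lipid_medication: bool = False


def metabolic_risk_count(p):
    """Count how many of the five metabolic dysregulation criteria are met."""
    criteria = [
        p.wc >= (90 if p.male else 80),                  # (1) waist circumference
        p.sbp >= 130 or p.dbp >= 85 or p.bp_medication,  # (2) assumed reading of 130/85
        p.tg >= 1.70 or p.lipid_medication,              # (3) triglycerides
        p.hdlc < (1.0 if p.male else 1.3),               # (4) HDL-C
        5.6 <= p.fbg <= 6.9,                             # (5) pre-diabetes, FBG only
    ]
    return sum(criteria)


def has_mafld(p):
    """Steatosis plus overweight/obesity, diabetes, or >= 2 metabolic criteria."""
    if not p.steatosis:
        return False
    overweight = p.bmi >= 23                       # Asian cut-off value
    diabetes = p.diabetes_history or p.fbg >= 7.0
    return overweight or diabetes or metabolic_risk_count(p) >= 2
```

Under these assumptions, a lean participant with steatosis is classified as MAFLD only if diabetes is present or at least two metabolic criteria are met.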

Predictors considered

The research data were preprocessed, including normalization and standardization. When deciding which variables to include as predictors, we referred to relevant clinical studies of MAFLD [21,22,23,24] and to factors that have proved effective in machine learning prediction [25]. We then selected 20 attributes from questionnaire surveys and routine medical examination items to build the predictive model. These attributes were sex, age, ethnicity, education, occupation, marital status (MS), exercise frequency (EF), eating habits (EH), smoking status (SS), drinking frequency (DF), cardiovascular diseases (CVD), waist circumference (WC), body mass index (BMI), systolic blood pressure (SBP), diastolic blood pressure (DBP), fasting blood glucose (FBG), total cholesterol (TC), triglyceride (TG), low-density lipoprotein cholesterol (LDLC) and high-density lipoprotein cholesterol (HDLC). These attributes are described in detail in Table 1.

Table 1 Information description of included variables

Data preprocessing

Data preprocessing was performed using the sklearn library in Python. Specifically, the sklearn.preprocessing module was utilized to transform categorical data into numerical labels using LabelEncoder, encode ordinal variables using OrdinalEncoder, and create dummy variables for nominal variables using OneHotEncoder. Additionally, we utilized the MinMaxScaler function for normalization. The principle of this function is to determine the minimum and maximum values for each feature, and then scale all values in the feature so that the minimum value corresponds to 0 and the maximum value corresponds to 1, effectively normalizing the entire range of values to fall within the interval [0, 1].
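To make the scaling and dummy-coding steps concrete, the sketch below re-implements the two transformations in plain Python. It is an illustration of the principle only and is not the sklearn code used in the study; the function names are hypothetical, and mapping a constant column to 0 is a choice made here.

```python
def min_max_scale(values):
    """Rescale a feature so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no range; map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Create one dummy column per category of a nominal variable."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows
```

For example, `min_max_scale([18, 59, 100])` places the middle age halfway along the interval, and `one_hot(["Han", "Uyghur", "Han"])` yields one indicator column per ethnicity, which is how dummy variables such as Ethnicity (Uyghur) arise.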

Grouping and feature selection

The participants were randomly divided in an 8:2 ratio into a training set of 4,137,113 individuals and a testing set of 1,034,279 individuals. Subsequently, we employed Least Absolute Shrinkage and Selection Operator (LASSO) regression for variable selection on the training set. We used the glmnet package in R to perform feature selection with LASSO regression, using a binomial logistic regression model. LASSO regression adds L1 regularization to the estimation of the model coefficients, which promotes sparse solutions: many regression coefficients are shrunk to exactly zero, and the variables with non-zero coefficients are retained.
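For reference, the L1-penalized binomial objective minimized in this kind of analysis can be written as follows. The notation is ours and does not appear in the original text: $y_i \in \{0, 1\}$ is the MAFLD status of participant $i$, $x_i$ the predictor vector, $N$ the training set size, and $\lambda \ge 0$ the penalty strength chosen by cross-validation.

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\sum_{i=1}^{N}
  \Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
        - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
  \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The first term is the average negative log-likelihood of the logistic model; the second is the L1 penalty. As $\lambda$ increases, more coefficients $\beta_j$ are driven to exactly zero, which is what makes the procedure usable for variable selection.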

Prediction models

This study devised and compared two classes of MAFLD screening models: tree-based ML models (including Classification and Regression Trees (CART), Random Forest, Adaptive Boosting (ADABoost), Light Gradient Boosting Machine (LightGBM), Extreme Gradient Boosting (XGBoost), and Categorical Boosting (CatBoost)) and other ML models (namely, k-Nearest Neighbors (KNN) and Artificial Neural Network (ANN)). Furthermore, to keep screening easily accessible and convenient, we used only data gathered from questionnaire surveys and routine medical examination items as predictive factors for the models.

CART, Random Forest, ADABoost, LightGBM, XGBoost, and CatBoost are all tree-based ML algorithms. They are capable of handling both classification and regression problems by constructing decision trees or optimizing gradient boosting decision trees to improve prediction performance. CART is a tree-based classification and regression algorithm that builds a decision tree model by recursively partitioning the dataset [26]. Random Forest improves prediction accuracy by constructing multiple independent decision trees and aggregating their results [27]. ADABoost builds a strong classifier by training a series of weak classifiers and combining them with weighted voting. It gradually improves overall classification performance by adjusting sample weights to focus on misclassified samples [28]. LightGBM is an efficient gradient boosting framework developed by Microsoft. It accelerates model training speed using a histogram-based decision tree algorithm and has lower memory usage [29]. XGBoost is a classic gradient boosting framework that enhances model accuracy and robustness by using second-order Taylor expansion to approximate the loss function and regularization terms [30]. CatBoost, developed by Yandex, is a gradient boosting framework with automatic handling capability for categorical features. It can directly utilize statistical information from categorical features [31].
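The second-order approximation mentioned for XGBoost can be stated explicitly. The expression below follows the standard formulation of the algorithm, with notation that does not appear in the original text: $l$ is the loss, $\hat{y}_i^{(t-1)}$ the prediction after $t-1$ trees, $f_t$ the tree added at step $t$, $T$ its number of leaves, and $w$ its leaf weights.

```latex
\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n}
  \Bigl[\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)
        + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^{2} \Bigr]
  + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda\,\lVert w \rVert^{2}

g_i = \frac{\partial\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^{2} l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial\, \bigl(\hat{y}_i^{(t-1)}\bigr)^{2}}
```

The gradient $g_i$ and curvature $h_i$ summarize the loss for each sample, and the regularization term $\Omega$ penalizes trees with many leaves or large leaf weights.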

The KNN algorithm is based on the fundamental idea of finding the nearest neighbors to a new input instance in the training set [32], and then using the majority vote of these K nearest neighbors for classification prediction. The ANN algorithm simulates the structure of neurons in the human brain and consists of an input layer, hidden layer(s), and an output layer [33]. The input layer receives data and converts it into a suitable format. The hidden layer(s) contains multiple neurons that are used to extract features and transform the input data. The output layer performs classification or prediction based on the output from the hidden layer.
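The KNN voting rule described above can be sketched in a few lines. This is an illustrative implementation with Euclidean distance, which is an assumption (the text does not state the distance metric used); the function name and the toy data in the usage note are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label.
    neighbours = sorted(
        ((math.dist(x, query), label) for x, label in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]
```

With training points `[(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]` labelled `[0, 0, 0, 1, 1]`, a query near the origin is assigned class 0 and a query near `(5.5, 5)` is assigned class 1.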

Model evaluation

To refine model performance, the parameters of each model were tuned using learning curves to find the optimal configuration. Model performance was evaluated using a confusion matrix, from which sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy were calculated, together with the area under the receiver operating characteristic curve (AUC) and its 95% confidence interval. Additionally, to achieve better performance, we adjusted the classification threshold on the training set to find the threshold that maximized the Youden index (the sum of sensitivity and specificity minus 1).
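The confusion-matrix metrics and the Youden-index threshold search can be illustrated as follows. This is a minimal sketch with hypothetical function names, assuming binary labels (1 = MAFLD) and predicted probabilities, and assuming both classes are present in the data; it scans every observed score as a candidate threshold, which may differ from the procedure used in the study.

```python
def confusion_counts(y_true, y_score, threshold):
    """Return (tp, fp, tn, fn) when scores >= threshold are called positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def screening_metrics(y_true, y_score, threshold):
    """Sensitivity, specificity, PPV, NPV and accuracy at one threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, y_score, threshold)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
        "accuracy": (tp + tn) / len(y_true),
    }


def best_youden_threshold(y_true, y_score):
    """Pick the threshold maximizing J = sensitivity + specificity - 1."""
    best_t, best_j = None, -1.0
    for t in sorted(set(y_score)):
        m = screening_metrics(y_true, y_score, t)
        j = m["sensitivity"] + m["specificity"] - 1
        if j > best_j:  # ties keep the lowest threshold
            best_t, best_j = t, j
    return best_t, best_j
```

With a class ratio of roughly 1:9, as in this dataset, the default threshold of 0.5 tends to favor specificity, so moving the threshold to the Youden optimum trades some specificity for sensitivity.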

Feature importance evaluation

The process of evaluating the importance of features through machine learning models involves a systematic quantification of the impact of individual features on the predictive performance of the model [34]. This is achieved by assessing how much the model’s performance is compromised when a specific feature is either permuted or excluded. Machine learning models, including Random Forest, Gradient Boosting, and Neural Networks, exhibit distinct advantages in this realm due to their capacity to grasp intricate relationships and interactions among features. These models offer insights into both linear and nonlinear correlations, thereby facilitating the identification of pivotal features that substantially contribute to the model’s accuracy. The integration of machine learning-based assessments of feature importance not only enhances the model’s interpretability but also aids in the selection of relevant features and provides guidance for domain-specific insights.
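The permutation approach described above can be sketched as follows. This is a generic illustration and not necessarily the importance measure reported for CatBoost in this study; the function names are hypothetical, accuracy is used as the performance metric for simplicity, and each row of `X` is assumed to be a list.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the model's prediction equals the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature the model never uses receives an importance of exactly zero, because shuffling it cannot change any prediction; features the model relies on produce a positive drop.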

Statistical analysis

In this study, continuous variables were represented using mean (standard deviation), while categorical variables were presented using counts (percentages). For each variable, t-tests or Mann-Whitney tests were used for continuous variables, and chi-square tests or Fisher’s exact tests were used for categorical variables. Two-tailed p-values less than 0.05 were considered statistically significant.

Results

Basic characteristics

This study included 5,171,392 participants, with a mean age (standard deviation) of 51.12 (15.00) years, of whom approximately 52.47% were female (Table 2). The majority of participants were of Uyghur ethnicity (54.73%), 43.54% had completed only primary education or lower, and 65.55% were engaged in agricultural, forestry, animal husbandry, fishing, and water management occupations.

Table 2 Baseline characteristics of participants in this study

Notably, the dataset revealed a conspicuous imbalance concerning the prevalence of MAFLD, with 524,584 individuals diagnosed with the condition against a significantly larger group of 4,646,808 without it, highlighting a ratio of roughly 1:9.

Statistical analysis revealed significant differences between participants with and without MAFLD in terms of age, sex, ethnicity, education, occupation, MS, EH, EF, DF, SS, CVD, WC, BMI, SBP, DBP, FBG, TC, TG, LDLC and HDLC.

Feature selection

In this study, we utilized LASSO regression to perform feature selection on the training dataset. As depicted in Fig. 2, the outcomes of the feature selection process through LASSO regression unveiled that the model encompasses 20 non-zero coefficient variables. These variables include sex, age, ethnicity, education, occupation, MS, EF, EH, SS, DF, CVD, WC, BMI, SBP, DBP, FBG, TC, TG, LDLC and HDLC. These 20 variables were subsequently integrated as input features for the screening model developed within the framework of this research.

Fig. 2 Feature selection using LASSO regression in the training set. (A) Cross-validation was performed 10 times to select the optimal parameters (lambda) of the LASSO model. (B) LASSO coefficient profile of 20 characteristics. In the LASSO regression algorithm, as lambda is tuned, the shrinkage and variable selection process leads to a corresponding change in the trajectory of the coefficients of each characteristic related to MAFLD, which can be visualized in the LASSO coefficient profile. MAFLD, metabolic dysfunction-associated fatty liver disease; LASSO, least absolute shrinkage and selection operator

Tuning of parameters

In this study, a 10-fold cross-validation approach was implemented to fine-tune and optimize the parameters of the six tree-based models using the training dataset. The performance of these models, as indicated by their AUC values, was visualized across various parameter configurations.

More specifically, during hyperparameter tuning, max_depth was varied from 1 to 20, the AUC was recorded for each value, and the max_depth with the highest AUC was selected for each model. The optimal hyperparameters for each model are presented in Table 3. The optimal values for max_depth were 11 for CART, 16 for Random Forest, 5 for ADABoost, 11 for LightGBM, 8 for XGBoost, and 6 for CatBoost. Default values were used for the remaining hyperparameters. As a result, we trained six tree-based classification models with good predictive performance.
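The depth search described above amounts to a one-dimensional grid search. The sketch below shows the selection step only; `cv_auc` is a hypothetical callable that would return the mean cross-validated AUC of a model trained with the given max_depth, and it stands in for the actual model fitting, which is not shown.

```python
def select_max_depth(cv_auc, depths=range(1, 21)):
    """Evaluate each candidate depth and return the one with the highest AUC."""
    scores = {depth: cv_auc(depth) for depth in depths}
    best_depth = max(scores, key=scores.get)  # first maximum wins on ties
    return best_depth, scores
```

Calling `select_max_depth` with a scoring function that peaks at a depth of 6 returns 6 together with the full depth-to-AUC mapping, which is the information plotted in a learning curve.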

Table 3 The optimal hyperparameters of each algorithm

Comparison of model performance

Tables 4 and 5 show the evaluation metrics for each model on the training and test sets, respectively. The tree-based machine learning models outperformed the other machine learning methods for large-scale MAFLD screening in this population. Notably, the CatBoost algorithm performed best, with sensitivity, specificity, and AUC values of 0.814, 0.753 and 0.862, respectively. In contrast, the artificial neural network (ANN) showed relatively modest performance in this task. Figure 3 shows the receiver operating characteristic (ROC) curves of all classifiers on the training set and the test set.

Table 4 Performance of each algorithm in the training set
Table 5 Performance of each algorithm in the test set
Fig. 3 ROC curves on the training set and the test set for KNN, ANN, CART, RF, ADABoost, LightGBM, XGBoost and CatBoost respectively. ROC, receiver operating characteristic; KNN, K-Nearest Neighbor; ANN, Artificial Neural Network; CART, Classification and Regression Tree; RF, Random Forest; ADABoost, Adaptive Boosting; LightGBM, Light Gradient Boosting Machine; XGBoost, Extreme Gradient Boosting; CatBoost, Categorical Boosting

Importance of features

We evaluated and ranked feature importance for the CatBoost model, which showed the best performance and attained the highest AUC value. As shown in Fig. 4, BMI, age, TG, WC, FBG, occupation3 (agriculture, forestry, animal husbandry, fishing, and water conservancy roles), HDLC, LDLC, TC, ethnicity (Uyghur), DBP, SBP and CVD were the 13 most important predictive factors. These factors were identified using the CatBoost model to predict MAFLD in a large population on the basis of questionnaire and routine examination data.

Fig. 4 Feature importance of CatBoost algorithm. Ethnicity (Hui), Ethnicity (Kirgiz), Ethnicity (Kazak), Ethnicity (Uyghur) and Ethnicity (Han) are dummy variables of Ethnicity. MS, marital status; EF, exercise frequency; EH, eating habits; SS, smoking status; DF, drinking frequency; CVD, cardiovascular diseases; WC, waist circumference; BMI, body mass index; SBP, systolic blood pressure; DBP, diastolic blood pressure; FBG, fasting blood glucose; TC, total cholesterol; TG, triglyceride; LDLC, low-density lipoprotein cholesterol; HDLC, high-density lipoprotein cholesterol

Discussion

Given the rapidly increasing prevalence of MAFLD, identifying prospective MAFLD patients and implementing suitable therapeutic interventions has become an urgent priority. This study included a cohort of 5,171,392 adults aged 18 and above. Using their physical examination data, we developed and compared ML algorithms for large-scale population screening of MAFLD. The CatBoost algorithm showed the best performance for MAFLD screening, and BMI, age, TG, WC, FBG, occupation3, HDLC, LDLC, TC, ethnicity (Uyghur), DBP, SBP and CVD emerged as the most important predictive factors.

In contrast to conventional statistical models, ML models offer substantially enhanced data analysis and predictive capacities in disease prognostication. These models possess the capability to manage extensive, high-dimensional, and intricate medical datasets, effectively discerning latent patterns and predictive principles to elevate the precision of predictions. Presently, an array of investigations has been undertaken to predict MAFLD employing laboratory indicators. Several scholars have integrated lipidomics, metabolomics, genomics, transcriptomics, and biomarkers as predictive variables for the formulation of models [35,36,37]. While these models have exhibited commendable outcomes in predicting MAFLD, acquiring such data through extensive health screenings proves impractical and hampers endeavors aimed at the broad-scale screening of diseases. Additionally, certain investigations center on the interplay between distinct ailments and MAFLD, encompassing cardiovascular diseases, diabetes, and liver fibrosis, while others pertain solely to particular demographics, such as adolescents or individuals with obesity [35, 38,39,40]. Some studies suffer from the limitation of small sample sizes [35, 38, 39], or do not consider ethnic-specific factors [13, 14]. Studies constrained by these limitations might encounter challenges in extrapolating research conclusions and prediction models to the broader populace. In stark contrast, our investigation encompasses a dataset of 5,171,392 participants hailing from the Xinjiang region, distinguished by its expansive sample size and ethnically diverse population. Consequently, the implications of our study hold promise for advancing MAFLD screening and prognostication within a sizable Chinese demographic.

The aforementioned studies on MAFLD prediction models have all achieved notably high AUCs, with some models reaching above 0.8 [13, 35, 36, 40] and others exceeding 0.9 [35] in their test sets, demonstrating good predictive performance under their respective study conditions. All models constructed and optimized in this study achieved AUCs above 0.8 in the test set, among which the best-performing CatBoost model reached an AUC of 0.862. This, to a certain extent, suggests that the variables selected in this study hold predictive value under the context of large-scale population disease screening, and the constructed models exhibit commendable performance in MAFLD screening.

In this study, we identified BMI, age, TG, FBG, WC, occupation, HDLC, LDLC, TC, ethnicity, SBP, DBP and CVD as pivotal factors for MAFLD screening. Obesity has been widely confirmed to be strongly associated with MAFLD, which explains why BMI and WC are important predictive factors. Age is also strongly correlated with MAFLD, with older age associated with an increased risk, and it has been considered in many MAFLD prediction studies [41]. TG, HDLC, LDLC, and TC are components of the blood lipid profile and have been confirmed to be highly correlated with MAFLD, because elevated lipid levels promote the accumulation of fat in the liver [42]. Studies have shown a significant association between fasting blood glucose levels and the incidence and severity of MAFLD. Elevated fasting blood glucose is a marker of diabetes and insulin resistance, both of which are notably linked to MAFLD. Insulin resistance leads to inadequate utilization of insulin, promoting fat accumulation in the liver and facilitating the development of MAFLD. Moreover, elevated fasting blood glucose may itself directly harm the liver, causing hepatic inflammation and fibrosis and further exacerbating MAFLD [43]. Individuals working in agriculture, forestry, animal husbandry, fishing, and water conservancy are predominantly engaged in physically demanding labor. Sustained physical labor or regular exercise helps prevent the onset of MAFLD, as supported by a number of studies [44].

Research indicates that elevated systolic and diastolic blood pressure are associated with an increased risk of MAFLD. Epidemiological investigations show that the prevalence of MAFLD among hypertensive patients is approximately 49.5%, significantly higher than in the general population [42]. Furthermore, MAFLD appears to be closely linked to hypertension and endothelial dysfunction, and seems to be an independent risk factor for prehypertension and hypertension [45]. Our findings also highlight a strong association between MAFLD and cardiovascular disease. On the one hand, cardiovascular disease risk factors (including hypertension, hyperlipidemia, and diabetes) can lead to abnormal hepatic fat accumulation, giving rise to MAFLD. Moreover, inflammatory and vascular injury factors prompted by cardiovascular disease can enter the systemic circulation, fostering the progression of MAFLD. On the other hand, individuals with MAFLD frequently have obesity and metabolic abnormalities, such as insulin resistance and elevated cholesterol levels. These factors impair endothelial function and accelerate atherosclerosis, thereby increasing vulnerability to cardiovascular disease [38]. Ethnicity is also an important predictor. Ethnic group may be associated with region, dietary habits, climate, genetics and other factors. Previous studies have reported differences in the incidence of MAFLD among different ethnic groups or regions [46].

This study boasts several notable strengths. Foremost, the MAFLD prediction model devised herein leverages variables garnered from physical examination data. Consequently, when juxtaposed with conventional MAFLD diagnostic methodologies, this model emerges as swifter, cost-efficient, and conducive to preliminary MAFLD screening within extensive populations. Additionally, our investigation encompasses a substantial and diverse Chinese demographic, encompassing an array of ethnic backgrounds. This inclusivity adeptly captures the influence of ethnicity-specific elements on the disease, markedly bolstering the applicability of our model. Furthermore, the meticulous sample selection adhering to scientifically grounded inclusion and exclusion criteria characterizes this study. This approach adeptly retains comprehensive data hailing from extensive health screening questionnaires, thus seamlessly aligning with the tenets of epidemiological research on MAFLD in real-world scenarios.

However, this study also has certain limitations. Firstly, our model is based on health check questionnaire data derived from a large-scale Chinese population, excluding data from other countries. The peculiarities of Xinjiang’s ethnic structure and geographical environment may impact the generalizability of this model to other populations, despite it being the first model established based on a multi-ethnic population comprising millions of samples. Secondly, this study is cross-sectional in design, which restricts our ability to ascertain causal relationships between certain factors and MAFLD, as exemplified by the relationship between exercise frequency and MAFLD. Follow-up cohort studies are needed to address this limitation. Thirdly, the quantification of alcohol consumption did not strictly adhere to the exclusion criteria for MAFLD (males > 30 g/day, females > 20 g/day), which could potentially impact the predictive ability of our model. The self-reported component of predictive factors may introduce bias due to inaccuracies or incomplete reporting. Furthermore, the exclusion of a large number of participants with indeterminate disease outcomes may increase potential selection bias.

Conclusions

The severity and prevalence of MAFLD have garnered heightened recognition from the public, propelling the dire necessity for the formulation of a large-scale, population-oriented early screening model. Grounded in a multi-ethnic and expansive sample populace, this study exclusively harnessed questionnaire surveys and customary medical examination components to meticulously establish and juxtapose tree-based MAFLD predictive models against alternative ML methodologies. We identified the optimal MAFLD predictive model and extensively analyzed the interactions between various risk factors and MAFLD. The study results demonstrated that our MAFLD screening model achieved satisfactory predictive performance, providing a new and more economical and efficient approach for the prevention and screening of MAFLD.