Introduction

Osteoporosis is characterized by low bone mass and increased bone fragility, resulting in fractures that occur following minimal or no trauma [1, 2]. Osteoporosis is common in postmenopausal women but remains a silent disease until fractures occur. Fractures place a severe burden on aging individuals because they can lead to poor quality of life and increased mortality [3]. Osteoporosis should therefore be prevented and treated before it is complicated by fractures [4].

According to World Health Organization (WHO) criteria, osteoporosis is operationally defined as a bone mineral density (BMD) that is 2.5 standard deviations or more below the mean for a young healthy adult (T score ≤ −2.5), based on the dual-energy x-ray absorptiometry (DXA) T score [5]. DXA is generally used to diagnose osteoporosis. Although the benefits of screening are apparent, as early diagnosis may help prevent future morbidity and decrease mortality due to fracture complications, uniform screening of the general population using DXA may not be feasible because not all physicians have access to this equipment. Therefore, substantial research has been conducted on when and where to use DXA so as to screen efficiently and to avoid overdiagnosis, misdiagnosis, or a false sense of security [6,7,8]. Several previous studies have proposed prescreening tools, namely simple formulas based on osteoporosis risk factors, to identify women at increased risk of osteoporosis who ought to be selected for BMD measurement [9,10,11].

Machine learning has been shown to improve predictive performance over conventional statistics in many areas of medicine [12,13,14]. Machine learning is a field of computer science that uses computer algorithms to identify patterns in large amounts of data, which can then be used to make predictions on novel data [15]. Using training data with known input and output values, a machine learning algorithm is able to make data-driven predictions or decisions [15, 16]. Although machine learning models have been proposed as tools to predict osteoporosis risk in postmenopausal Korean women, previous studies had limitations, such as applying only the ANN method or not including lifestyle factors such as smoking, physical activity, and coffee and alcohol intake [17].

In this study, we aimed to develop and validate a selection of machine learning models, using a database of 1792 patients who participated in the Korea National Health and Nutrition Examination Surveys (KNHANES) V-1 and V-2 (2010–2011), to construct an osteoporosis predictor. In the KNHANES databases, the definition of osteoporosis is based solely on a T score for BMD, assessed by DXA at the femoral neck or spine, that is 2.5 standard deviations or more below the mean for a young healthy adult (T score ≤ −2.5). Low-trauma hip, vertebral, proximal humerus, or pelvis fractures that could be considered clinical osteoporosis were excluded [18, 19]. The predictive model in our study is complex and high dimensional, as it contains diet and lifestyle properties, in addition to clinical factors, that could contribute to osteoporosis [6, 20, 21]. Given the characteristics of such complex models, we compared the performances of various models, using the raw data together with preprocessed data in which statistically significant features were selected in advance.

Materials and methods

Study population

We analyzed the data from 1792 postmenopausal Korean women who participated in the Korea National Health and Nutrition Examination Surveys (KNHANES) V-1 and V-2 (2010–2011). The KNHANES data are available and can be downloaded from the KNHANES website (https://knhanes.cdc.go.kr/). The KNHANES is a nationwide, population-based, cross-sectional study that has been conducted periodically since 1998, which assesses the health and nutritional status of Koreans, monitors trends in health risk factors and the prevalence of major chronic diseases, and provides data for the development and evaluation of health policies and programs in Korea [22]. We excluded patients with incomplete information from our analysis. This study was approved by our institutional ethics committee (Kangbuk Samsung Hospital Institutional Review Board, Seoul, Republic of Korea; approval number: KBSMC 2020-01-007). The KNHANES received ethical approval from the Institutional Review Board of the Korea Centers for Disease Control and Prevention (IRB Nos. 2010-02CON-21-C and 2011-02CON-06-C) and complies with the Declaration of Helsinki. Informed consent was obtained from all participants for inclusion in the surveys.

Machine learning

Classification machine learning algorithms were used to predict the occurrence of osteoporosis, encoded as a binary outcome variable. The whole process was divided into five parts: (1) data preprocessing, including data cleaning, missing data processing, and data transformation; (2) feature selection, the process of selecting input features for training; (3) model building, the application of the classification machine learning algorithms to achieve reasonable performance; (4) cross-validation, a resampling procedure to evaluate the machine learning models for training and testing of raw and feature selected data; and (5) model performance evaluation, conducted using the area under the receiver operating characteristic curve (AUROC), sensitivity, and specificity. We plotted the AUROC curves for all the machine learning models using the testing data.

Data preprocessing

A total of 1792 patients were included in this study, of whom 613 were diagnosed with osteoporosis. Data were analyzed using R software version 3.6.2 (R Development Core Team, Vienna, Austria). Data scaling was performed using the normalization and min-max scaling functions included in the caret preprocessing library. The continuous variables were age, height, weight, body mass index (BMI), waist circumference, number of pregnancies, and duration of menopause. The categorical variables were estrogen therapy, hyperlipidemia, hypertension, history of fracture, osteoarthritis, rheumatoid arthritis, diabetes mellitus, smoking, alcohol, coffee, and physical activity. We used the KNHANES V-1 and V-2 datasets as the training and testing data, respectively. The entire dataset was thus split into two categories, training and testing: for each machine learning algorithm, the data of 1353 subjects were used for training and those of 439 for testing. The Synthetic Minority Over-sampling Technique (SMOTE), which addresses class imbalance, was used to generate synthetic samples to overcome the low incidence of osteoporosis in the training set [23].
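The two preprocessing steps above can be sketched briefly. The study used caret's preprocessing functions and an R implementation of SMOTE; the following Python/NumPy version is only an illustrative stand-in, and the toy data and variable names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def min_max_scale(x):
    """Rescale each column to the [0, 1] range (min-max scaling)."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo)

def smote_like_oversample(minority, n_new, k=5):
    """Generate synthetic minority samples by interpolating between a
    sampled minority point and one of its k nearest minority neighbors,
    as SMOTE does."""
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # distances from point i to every minority point
        d = np.linalg.norm(minority - minority[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]        # skip the point itself
        j = rng.choice(neighbors)
        gap = rng.random()                        # interpolation factor in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

X_minority = rng.normal(size=(20, 3))             # toy minority-class data
X_scaled = min_max_scale(X_minority)
X_new = smote_like_oversample(X_minority, n_new=10)
```

Because each synthetic point is a convex combination of two real minority samples, the oversampled data stay within the observed range of each feature.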

Feature selection

Numerous studies have been conducted to identify features that may potentially affect osteoporosis risk [6, 20, 21]. We found 19 potential features, including demographic and clinical variables, as shown in Table 1. Feature selection is the process of selecting the features that contribute most to the output prediction [24, 25]. Here, a backward stepwise variable selection procedure based on the logistic regression model was used to identify such variables. To construct the machine learning models, we included only the statistically significant features in the feature selected dataset, as shown in Table 2.
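Backward stepwise selection can be sketched as a greedy loop that repeatedly drops the feature whose removal most improves a model criterion. The study performed this in R (see the MASS library below, whose stepwise routine is AIC-based), so this Python version using scikit-learn and an AIC computed from the log-likelihood is only an illustrative stand-in; the feature names and toy data are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def aic(model, X_cols, y):
    """AIC = 2k - 2*logL, with k = number of coefficients plus the intercept."""
    log_likelihood = -log_loss(y, model.predict_proba(X_cols), normalize=False)
    k = X_cols.shape[1] + 1
    return 2 * k - 2 * log_likelihood

def backward_stepwise(X, y, names):
    """Greedy backward elimination: drop the feature whose removal most
    reduces the AIC, and stop when no removal helps."""
    keep = list(range(X.shape[1]))
    def fit(cols):
        return LogisticRegression(max_iter=1000).fit(X[:, cols], y)
    best = aic(fit(keep), X[:, keep], y)
    improved = True
    while improved and len(keep) > 1:
        improved = False
        for c in list(keep):
            cols = [i for i in keep if i != c]
            score = aic(fit(cols), X[:, cols], y)
            if score < best:
                best, keep, improved = score, cols, True
    return [names[i] for i in keep]

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
# Toy outcome driven by the first two (hypothetical) features only
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)
selected = backward_stepwise(X, y, ["age", "height", "noise1", "noise2"])
```

Note that scikit-learn's logistic regression is L2-regularized by default, so the AIC here is approximate; it suffices to illustrate the selection mechanics.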

Table 1 Demographic data and variable features of the included postmenopausal women
Table 2 Results of the stepwise logistic regression model for osteoporosis risk assessment

Model building

Except for the ANN model, all the machine learning models were imported from the caret package, which contains functions for training and plotting classification and regression models (https://CRAN.R-project.org/package=caret). The ANN model was constructed using the Keras package, which is designed to enable fast experimentation with deep neural networks (https://github.com/keras-team/keras). The machine learning approaches were developed to accurately identify patients at risk for osteoporosis. Classification models, namely k-nearest neighbors (KNN), decision tree (DT), random forest (RF), gradient boosting machine (GBM), support vector machine (SVM), artificial neural network (ANN), and logistic regression (LR), were used to develop the prediction models.

K-nearest neighbors (KNN) is a simple algorithm that classifies unlabeled observations based on a similarity measure such as a distance function. An input value is assigned, by majority vote, to the class most common among its k nearest neighbors as measured by the distance function [26]. Decision trees (DT) create models in the form of a flowchart-like tree structure, in which each internal node represents a feature, each branch a decision rule, and each leaf node the actual prediction. The algorithm learns to partition the data recursively on the basis of feature values [27, 28]. Random forest (RF) is an ensemble classification algorithm that consists of a large number of individual decision trees [29]. Gradient boosting machine (GBM) is a type of machine learning boosting. It produces an ensemble model in the form of shallow, weak, successive trees, with each tree learning from and improving on the previous one [30, 31]. Support vector machine (SVM) splits data into binary categories with a bisecting hyperplane [32]. Hyperplanes are decision boundaries that help separate the data points; the algorithm finds the hyperplane that maximizes the distance between data points of the two categories, and input values falling on either side of the hyperplane are assigned to the corresponding categories. Artificial neural networks (ANN) are computational models inspired by the biological neural networks that constitute animal brains [33]. An ANN consists of input and output layers, as well as inner hidden layers that simulate signal transmission. Each layer comprises many nodes, and the nodes between layers are interconnected by weights that are adjusted as learning proceeds. The algorithm automatically learns from the training dataset to predict output values [34].
Logistic regression (LR) is a traditional statistical method for binary classification problems, although it has also been adopted as a basic machine learning model. Logistic regression predicts the probability of occurrence of a binary event using the sigmoid function, also called the logistic function [35].
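The sigmoid maps the linear predictor of a logistic regression model onto a probability in (0, 1); a minimal sketch, with hypothetical coefficient values:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Predicted probability from a linear predictor b0 + b1 * x;
# the coefficients -1.2 and 0.8 are hypothetical, for illustration only.
p = sigmoid(-1.2 + 0.8 * 1.5)
```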

Using machine learning algorithms to analyze medical data to predict a disease frequently involves choosing hyperparameters. A hyperparameter can be defined as a parameter whose value is set before the learning process begins, rather than being tuned during learning through iterative optimization of an objective function. Investigators typically tune hyperparameters through a series of manual trials. Different model training algorithms require different hyperparameters. The optimal hyperparameters obtained in a fivefold cross-validation of the test set are summarized in Table 3.
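Hyperparameter tuning with fivefold cross-validation can be sketched as a grid search. The study tuned its models in R via caret, so this scikit-learn version, tuning the number of neighbors for a KNN classifier on synthetic data, is only illustrative:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # toy stand-in for the clinical features
y = (X[:, 0] > 0).astype(int)                  # toy binary outcome

# Evaluate each candidate hyperparameter with fivefold cross-validation,
# scoring by AUROC, and keep the best-performing setting.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc",
)
search.fit(X, y)
best_k = search.best_params_["n_neighbors"]
```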

Table 3 Optimal hyperparameters of all machine learning models

Cross-validation

We validated the performance of all classification models using stratified k-fold cross-validation (Fig. 1). Cross-validation is a validation technique for assessing how the classification models will generalize to an unknown dataset and how accurately they will perform in practice. It is widely used in settings wherein the main goal is prediction. In this study, the dataset was randomly divided into five equal folds with approximately the same number of events. After partitioning one data sample into five subsets, one subset was selected for model validation while the remaining subsets were used to establish machine learning models. Finally, the validation results were combined to provide an estimate of the model’s predictive performance.
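Stratified partitioning keeps the event rate of each fold close to that of the whole sample, which matters for an imbalanced outcome such as osteoporosis. A minimal scikit-learn sketch on toy labels (the 20% event rate is hypothetical):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 80 + [1] * 20)              # imbalanced toy labels (20% events)
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Stratification keeps the event rate of every validation fold at ~20%.
fold_event_rates = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]
```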

Fig. 1
figure 1

Schematic of the machine learning pathway

Model performance evaluation

We evaluated diagnostic ability based on four parameters: accuracy, sensitivity, specificity, and AUROC. The AUROC is known as a strong indicator of performance for classifiers in imbalanced datasets [36, 37]. We plotted AUROC curves to compare the performances of the machine learning classification models.
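These metrics follow directly from the confusion matrix and the ranking of predicted scores. A self-contained sketch (the toy labels, predictions, and scores are made up for illustration):

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auroc(y_true, scores):
    """AUROC via the rank (Mann-Whitney U) formulation: the probability that
    a randomly chosen positive case scores higher than a negative one."""
    pairs = better = ties = 0
    for t, s in zip(y_true, scores):
        if t != 1:
            continue
        for t2, s2 in zip(y_true, scores):
            if t2 != 0:
                continue
            pairs += 1
            if s > s2:
                better += 1
            elif s == s2:
                ties += 1
    return (better + 0.5 * ties) / pairs

sens, spec = confusion_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
auc = auroc([1, 1, 1, 0, 0], [0.9, 0.8, 0.3, 0.4, 0.2])
```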

Statistical analysis

The continuous variables are expressed as mean ± standard deviation or median with interquartile range, as appropriate, and were analyzed by the unpaired t test or the Mann-Whitney U test. The categorical variables are presented as absolute numbers (n) and relative frequencies (%) and were analyzed by the chi-square test or Fisher's exact test. The machine learning classification models were constructed using R software (version 3.6.2). The performance of the classification models for osteoporosis risk assessment was measured and compared using AUROCs. We also calculated the accuracy, sensitivity, and specificity (with 95% confidence intervals). Differences with p < 0.01 were considered statistically significant.
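The group comparisons described above map onto standard routines; a sketch using SciPy on hypothetical data (the group means, sample sizes, and 2×2 counts are invented for illustration; the analysis itself was run in R):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
age_osteo = rng.normal(68, 8, size=100)        # hypothetical ages, osteoporosis group
age_ctrl = rng.normal(60, 8, size=100)         # hypothetical ages, control group

# Unpaired t test for a continuous variable
t_stat, p_cont = stats.ttest_ind(age_osteo, age_ctrl)

# Chi-square test for a categorical variable (hypothetical 2x2 counts,
# e.g. smoking status by osteoporosis status)
table = np.array([[40, 60],
                  [20, 80]])
chi2, p_cat, dof, _ = stats.chi2_contingency(table)

significant = p_cont < 0.01                    # significance threshold used in the study
```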

Study design

Considering the high dimensionality of the data, which included 19 variables, we applied two different machine learning approaches, depending on where the variable reduction process was applied [38]. The first approach was to apply the machine learning algorithms to the raw dataset. The second was to first apply logistic regression analysis to the raw dataset so as to select only the effective variables among the training dataset variables. We identified nine variables that were significantly different between patients with osteoporosis and those without. The nonsignificant variables were removed from the algorithmic input in the feature selected dataset.

Results

Patients’ characteristics

We analyzed the data of 1792 postmenopausal Korean women who participated in the KNHANES V-1 and V-2 from January 1, 2010 to December 31, 2011. The demographic and patient characteristics are summarized in Table 1. Osteoporosis occurred in 34.2% of cases (training set, 33.2%; test set, 37.4%).

Feature selection

The input variables used for the feature selected data included age, height, BMI, history of smoking, waist circumference, history of fracture, estrogen therapy, duration of menopause, hyperlipidemia, and diabetes mellitus. Table 1 shows the potential variables for predicting patients at risk of osteoporosis. The MASS library in R was used to perform stepwise backward-elimination logistic regression analysis to obtain the probability coefficients for each variable. The nine features with the greatest regression coefficient magnitudes (with p < 0.2) were used as input variables for the machine learning classification models for osteoporosis risk assessment.

Model performance

The AUROCs for the test data set for all machine learning techniques for predicting osteoporosis risk are shown in Table 4. For the raw data, which included 19 variables, the ANN method achieved the best performance in terms of AUROC (0.741), followed by RF (0.727), LR (0.726), SVM (0.724), KNN (0.712), DT (0.684), and GBM (0.652). For the feature selected data, which included nine variables, the AUROCs increased slightly for all machine learning methods, with the best performance being that of ANN (0.743). Using feature selected data decreased the sensitivity for KNN (0.58) and ANN (0.72) but increased it for LR (0.79), SVM (0.73), DT (0.60), and RF (0.68). All algorithms showed better performance in terms of accuracy when using the feature selected data. The AUROCs of the seven different models are plotted in Fig. 2.

Table 4 Performance of all machine learning models
Fig. 2
figure 2

Areas under the receiver operating curve for raw (left) and feature selected (right) data

Discussion

Feature selection is an important concept in machine learning that has a huge influence on performance. Analysis and modeling with or without the feature selection process offer the opportunity to identify patients at high risk and to identify clinical factors that may increase the risk of osteoporosis. The objective of this study was to demonstrate that machine learning algorithms can accurately predict whether postmenopausal women have a higher possibility of developing osteoporosis. This means that machine learning algorithms provide an alternative approach that could be useful in guiding the decision to perform DXA, considering a specific set of clinical factors. According to the United States Preventive Services Task Force (USPSTF) guidelines, the National Osteoporosis Foundation guidelines, and other guidelines, it is recommended that women aged 65 years or older, postmenopausal women starting or taking long-term (≥ 3 months) systemic glucocorticoid therapy, and perimenopausal or postmenopausal women with additional osteoporosis risk factors (low BMI, current smoking, rheumatoid arthritis, history of hip fracture in a parent, early menopause, and excessive alcohol intake) be screened for osteoporosis by BMD measurement at the hip and lumbar spine [1, 39, 40]. Considering these various factors associated with low bone density, machine learning algorithms may be supportive tools for identifying postmenopausal women at high risk of osteoporosis. In some settings, clinical efficiency can be expected from a two-step screening strategy in which DXA testing follows the use of machine learning algorithms.
A previous randomized study, the Risk-Stratified Osteoporosis Strategy Evaluation (ROSE) study, investigated the effectiveness of a two-step osteoporosis screening program for women aged 65–80 years, using the Fracture Risk Assessment Tool (FRAX), administered as a self-completed questionnaire, to select women for DXA, followed by standard osteoporosis treatment [41]. The ROSE study showed a risk reduction in the group following the two-step strategy when compared with the control group; a FRAX score ≥ 15% was considered to predict moderate or high risk of major osteoporotic fractures, hip fractures, and all fractures [42]. Effective machine learning models coupled with DXA may yield results comparable to those of the ROSE study.

Machine learning algorithms have commonly been applied for classification and prediction rather than causal inference. Our study seeks to promote health by intervening in patients at high risk of osteoporosis; this requires the ability to predict osteoporosis risk, but not causal inference about the effect of an input variable on that risk [43]. As for alcohol intake, there are two relevant questions in the KNHANES: lifetime drinking experience and high-risk drinking frequency. We applied both as input variables in our models, but the performance of the predictive model was higher when lifetime drinking experience was used as the input variable. Therefore, lifetime drinking experience was used as a feature in this study, although a single lifetime drinking episode is not itself a risk factor for bone loss.

We used a relaxed p value criterion (p < 0.20) in the multivariable logistic regression analysis, as shown in Table 2. There is no reason to worry about a relaxed p value criterion at the feature selection stage because it is merely a preselection strategy, and no inference is drawn from this step [44]. This relaxed criterion helps reduce the risk of missing important variables. In addition, it makes it possible to include a sufficient number of features, because machine learning techniques are relatively free of the limitations of conventional statistical analysis, such as multicollinearity [14].

Previous studies have employed the use of logistic regression and various machine learning models to predict patients at high risk of osteoporosis [17, 45]. However, these studies either trained the models using only the ANN method or were based on limited input features. In this study, we developed and validated our models by performing feature selection, cross-validation, and testing on completely different datasets. Thus, our findings are helpful for implementing machine learning methods in clinical settings.

We investigated the application of seven machine learning techniques to the KNHANES V-1 and V-2 databases, which involve heterogeneous clinical characteristics. Unlike previously published studies, which do not incorporate diet and lifestyle patterns, our study included these features. We demonstrated that machine learning algorithms can be applied to predict osteoporosis risk with a reasonable level of performance.

In this study, we found that the optimal ANN needed two hidden layers to predict osteoporosis risk. In the ANN model, the first and second hidden layers were composed of 20 and 10 nodes, respectively. Since no specific tool exists for obtaining the most suitable hyperparameters to construct ANN models, we obtained the optimal hyperparameters empirically. The hyperparameters found could be useful as indicators in future studies using the ANN method.
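The reported two-hidden-layer architecture (20 and 10 nodes) can be mirrored, for illustration, with scikit-learn's MLPClassifier; the study's actual ANN was built with the Keras package, and the data below are synthetic:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))                  # nine features, as in the feature selected data
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # toy binary outcome

# Two hidden layers of 20 and 10 nodes, mirroring the architecture reported here.
ann = MLPClassifier(hidden_layer_sizes=(20, 10), max_iter=1000, random_state=0)
ann.fit(X, y)
probs = ann.predict_proba(X)[:, 1]             # predicted osteoporosis-risk probabilities
```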

This study has several limitations. First, it used a cross-sectional survey that captured a population at a single point in time and is therefore not guaranteed to be representative. Second, the prediction model in our study was based on Korean women; thus, it may be difficult to generalize our findings to a more diverse population. Third, there is some ambiguity in the KNHANES survey. More specifically, pregnancy history was assessed through an individual questionnaire containing the following questions: "Have you ever been pregnant (including current pregnancy, spontaneous abortion, induced abortion, ectopic pregnancy, etc.)?" and, if the answer was yes, "How many pregnancies have you had in total?" Unfortunately, these responses do not make clear whether the reported events represent repeated occurrences in one individual or events in different individuals. Furthermore, our study could not predict the occurrence of osteopenia and osteoporosis separately using a multi-class classification algorithm, which could help reduce the risk of osteoporosis before fractures occur. Finally, in our database, osteoporosis was classified only according to the operational definition; other clinical standards, such as low-trauma fracture, were not considered.

In conclusion, this study is important because it promotes the identification of patients at high risk of osteoporosis in a population of postmenopausal Korean women. The findings show that the ANN model is the best machine learning classification model for predicting osteoporosis risk using a feature selected dataset. We make two observations regarding osteoporosis risk assessment using machine learning models. First, input variables comprising clinical factors together with diet and lifestyle factors, such as coffee intake, alcohol intake, and physical activity, were used to train our machine learning models. Second, we used two entirely different datasets, KNHANES V-1 and V-2, as the training and testing datasets, respectively; thus, the dataset used to train our classifier models (KNHANES V-1) was fully independent of the dataset used for testing (KNHANES V-2). However, careful attention is required in the practical clinical application of our findings, as our study was limited to postmenopausal Korean women and had a limited data size.