Introduction

Areal bone mineral density (aBMD) measured by dual-energy X-ray absorptiometry (DXA) is a significant predictor of fracture risk [1]; however, half of all fractures occur in women who would not be classified as osteoporotic by aBMD [2]. The mechanical strength of bone depends on bone matrix, apparent density, and microarchitecture [3], and measurements of cortical and trabecular microarchitecture are independent of aBMD [4]. Thus, incorporating bone structural information with density measurements could potentially improve assessment of fracture risk.

High-resolution peripheral quantitative computed tomography (HR-pQCT) is an emerging technology capable of measuring cortical and trabecular microarchitecture at the distal radius and distal tibia [5, 6]. The resolution of HR-pQCT also permits application of finite element (FE) analysis to the 3D scans to estimate bone strength. These FE models are highly correlated with bone strength measured directly with mechanical testing [7] as they incorporate both material and architectural components of bone strength. Previous HR-pQCT studies examined the ability of individual bone microarchitectural [8–11] and FE analysis parameters [8–10] or a combination of parameters [9, 10] to discriminate between postmenopausal women with and without a history of low-trauma fracture. These studies showed that bone structure and strength contribute to forearm fracture risk independently of aBMD [8]. Establishing the association of microarchitectural parameters and FE outcomes with low-trauma fractures was an important first step [10]. The natural next question is whether individuals with low-trauma fracture can be classified based on similar outcomes.

Machine learning methods are statistical tools used to recognize patterns in datasets. These methods are often used for classification since they can be trained on known cases and then tested on new cases. Machine learning methods can be used to incorporate all HR-pQCT and FE parameters in order to classify those with and without fracture. This is in contrast to statistical approaches, such as principal component analysis (PCA), which reduce data into combinations of parameters that account for the majority of the variance. Atkinson et al. recently used a machine learning method, gradient boosting machines (GBM), and found that fracture prediction improved when all possible bone density, geometry, and microstructural parameters obtained with central quantitative computed tomography (QCT) and HR-pQCT were included in the GBM model when compared with DXA parameters alone [12]. Support vector machines (SVM) are an alternative machine learning method that can be used to classify individuals with and without previous fracture by maximizing the separation between groups. SVM models offer several advantages over other machine learning methods, including greater stability and minimal requirements for parameter tuning. Thus, our objective was to use SVM models to incorporate bone microarchitectural parameters and FE estimates of bone strength to determine if we can classify women with and without previous low-trauma fractures.

Materials and methods

Participants

Participants in this study were postmenopausal women who were members of the Calgary, AB cohort of the Canadian Multicentre Osteoporosis Study (CaMos) [13, 14]. Briefly, CaMos is a 10-year prospective population-based study in which participants were recruited using a stratified random sampling technique from nine centers across Canada [15]. At the 10-year follow-up, we invited individuals from the Calgary CaMos cohort to participate in an HR-pQCT substudy. From this cohort (N = 442; ages 20–99 years), we identified postmenopausal women who sustained a low-trauma fracture (excluding finger, face, and toes) during the 10 years of study follow-up (n = 44). Fractures were identified by self-report during scheduled interviews at years 3, 5, and 10 and by yearly postal questionnaires through year 9. Additional information regarding the circumstances surrounding the fracture and the fracture site was gathered via a structured telephone interview [16]. Low-trauma fractures were those that occurred without trauma or as a result of a fall from standing height or less. We randomly matched two participants without fracture by age (±1.0 year) to each fracture case (n = 88). We determined height, weight, bisphosphonate and corticosteroid use (>3 months), and menopause status from the interviewer-administered CaMos questionnaire. In addition, as part of the standard CaMos assessment, trained technicians measured femoral neck (FN) and lumbar spine (LS) aBMD (grams per square centimeter) using DXA (Hologic QDR4500, Bedford, USA). The Conjoint Health Research Ethics Board at the University of Calgary approved all procedures.
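The 2:1 age matching described above can be sketched as follows. This is a minimal illustration and not the procedure actually used in the study: the function name match_controls, the dictionary layout, the participant ids and ages, and the choice to draw controls without replacement are all assumptions. Only the ±1.0 year tolerance and the two-controls-per-case ratio come from the text.

```python
import random

def match_controls(cases, pool, tolerance=1.0, per_case=2, seed=0):
    """Randomly assign `per_case` controls to each case within an age tolerance.

    cases, pool: dicts mapping participant id -> age in years (hypothetical layout).
    Controls are drawn without replacement, which is an assumption.
    """
    rng = random.Random(seed)
    available = dict(pool)
    matches = {}
    for case_id, case_age in cases.items():
        # Controls still unused whose age lies within the tolerance
        eligible = [cid for cid, age in available.items()
                    if abs(age - case_age) <= tolerance]
        if len(eligible) < per_case:
            raise ValueError(f"not enough controls within tolerance for {case_id}")
        chosen = rng.sample(eligible, per_case)
        for cid in chosen:
            del available[cid]
        matches[case_id] = chosen
    return matches

# Hypothetical ids and ages, for illustration only
cases = {"F01": 68.2, "F02": 74.9}
pool = {"C01": 67.5, "C02": 68.9, "C03": 69.1,
        "C04": 74.0, "C05": 75.6, "C06": 80.3}
matches = match_controls(cases, pool)
```

Processing cases in a fixed order can exhaust the eligible controls for later cases; a real matching routine would need a policy for that situation.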

HR-pQCT scan acquisition and measurements

As we described in detail elsewhere [13, 14], we scanned all participants using HR-pQCT (XtremeCT; SCANCO Medical, Brüttisellen, Switzerland) at the nondominant radius and left tibia unless there was a previous fracture at the desired site, in which case we scanned the opposite limb. One of two highly trained operators acquired and analyzed all scans according to the manufacturer's standard in vivo protocol. We scanned all participants using 60 kVp effective energy, 1,000 μA current, and 100 ms integration time to acquire 110 slices (approximately 9.02 mm) of the radius and tibia at an 82 μm nominal isotropic resolution. Scans were manually scored for motion on a scale of 0 (no motion) to 4 (significant blurring of the periosteal surface, discontinuities in the cortical shell, or streaking in the soft tissue) and scans scored as a 4 were excluded from this analysis (n = 2).

We used the manufacturer's standard method to filter and binarize the HR-pQCT images [17] and assessed all standard HR-pQCT morphological microstructure outcomes [5]. These measurements were previously validated against microcomputed tomography [6, 18] and have in vivo short-term reproducibility of <4.5 % in adult populations [19]. Reproducibility in our lab is <3.8 % for all parameters at the radius and tibia.

To segment the cortical and trabecular regions, we used an automatic segmentation algorithm implemented in Image Processing Language (IPL V5.07, SCANCO Medical, Brüttisellen, Switzerland) [20]. Based on this segmentation, we calculated macrostructural parameters: cortical (Ct.Ar, square millimeter), trabecular (Tb.Ar, square millimeter), and total (Tt.Ar, square millimeter) cross-sectional areas. In addition, we calculated cortical porosity (Ct.Po, percent) as the percentage of void voxels within the cortex [14, 20] and measured direct cortical thickness (Ct.Th, millimeter) and cortical bone mineral density (Ct.BMD, milligram hydroxyapatite (HA) per cubic centimeter).
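As a rough illustration of the porosity calculation, the sketch below counts void voxels inside a cortical compartment and expresses them as a percentage of all voxels in that compartment. The flat 0/1 mask representation, the function name cortical_porosity, and the toy masks are assumptions; the segmentation itself [20] is not reproduced here.

```python
def cortical_porosity(cortex_mask, bone_mask):
    """Return Ct.Po (%) as the share of cortical-compartment voxels that are void.

    cortex_mask: flat sequence of 0/1 flags, 1 = voxel inside the cortical compartment.
    bone_mask:   flat sequence of 0/1 flags, 1 = mineralized bone voxel.
    Both layouts are hypothetical simplifications of a 3D segmentation.
    """
    if len(cortex_mask) != len(bone_mask):
        raise ValueError("masks must have the same length")
    cortex_voxels = sum(1 for c in cortex_mask if c)
    if cortex_voxels == 0:
        raise ValueError("cortical compartment is empty")
    # A void voxel lies inside the compartment but is not mineralized bone
    void_voxels = sum(1 for c, b in zip(cortex_mask, bone_mask) if c and not b)
    return 100.0 * void_voxels / cortex_voxels

# Toy masks: 8 cortical voxels, 1 of which is a pore
cortex = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
bone = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0]
ct_po = cortical_porosity(cortex, bone)
```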

We acquired FN and LS aBMD by DXA but did not acquire ultradistal radius (UDR) DXA scans for the participants. Therefore, we implemented the method developed by Burghardt et al. [21] to simulate DXA-derived aBMD: we aligned the HR-pQCT scans in the same configuration as a DXA scan and obtained a calibrated projection image from which we determined aBMD. The aBMD values determined by HR-pQCT were then calibrated based on the relationship reported by Burghardt et al. [21]. This method is strongly correlated with DXA aBMD (R² = 0.82) and is highly reproducible with a root mean-squared coefficient of variation of 1.1 %.
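The core idea of a projection-based areal density can be sketched as a line integral of volumetric density along the projection direction. The code below is a simplified illustration only and is not the method of Burghardt et al. [21]: it omits the alignment and the final calibration step, and the nested-list volume layout, the function name projected_abmd, and the toy density values are assumptions.

```python
VOXEL_CM = 0.0082  # 82 um nominal isotropic voxel size, expressed in cm

def projected_abmd(volume, voxel_cm=VOXEL_CM):
    """Project a 3D density volume (mg HA/cm^3) along its first axis.

    volume[d][r][c]: density at depth d, row r, column c (hypothetical layout).
    Returns the mean areal density (g/cm^2) over projected pixels that contain bone.
    """
    rows, cols = len(volume[0]), len(volume[0][0])
    pixel_values = []
    for r in range(rows):
        for c in range(cols):
            # Line integral of density along the projection direction (mg/cm^2)
            column_mg_cm2 = sum(layer[r][c] for layer in volume) * voxel_cm
            if column_mg_cm2 > 0:
                pixel_values.append(column_mg_cm2 / 1000.0)  # mg -> g
    if not pixel_values:
        raise ValueError("no bone in the projection")
    return sum(pixel_values) / len(pixel_values)

# Toy volume: 100 layers deep, one row, two columns; only the first column holds bone
toy_volume = [[[500, 0]] for _ in range(100)]
abmd = projected_abmd(toy_volume)
```

In this toy case the single bone column integrates to 100 × 500 mg/cm³ × 0.0082 cm = 410 mg/cm², that is 0.41 g/cm².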

Finite element analysis

From the HR-pQCT images, we generated FE meshes using the voxel conversion approach [22, 23]. Each voxel was converted to an eight-node hexahedral element with an edge length of 82 μm. The meshes generated resulted in approximately 1 million elements for the radius scans and 2.5 million elements for the tibia scans. We simulated uniaxial compression on each radius and tibia model up to 1 % strain using a Young's modulus of 6,829 MPa and Poisson's ratio of 0.3 [7] as previously reported [13, 14]. We used a custom FE solver (FAIM, Version 4.0; Numerics88 Solutions, Calgary, Canada) on a desktop workstation (Mac Pro, OS X, Version 10.5.6; 2 × 2.8 GHz Quad-Core Intel Xeon) to estimate bone strength (ultimate stress, megapascal) based on the relationship determined by MacNeil and Boyd [7]. We also determined the percentage of load carried by the cortex based on the elements labeled as cortical bone from the automatic segmentation at the most distal and most proximal slices. In order to estimate the risk of forearm fracture, we calculated the load-to-strength ratio (Φ) [24–26]. This ratio represents the estimated fall force on the outstretched arm divided by the estimated failure load from the FE analysis. The theoretical fracture threshold occurs when Φ > 1.0 [24–26].
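The load-to-strength ratio reduces to a single division followed by a comparison with the threshold of 1.0. In the sketch below, the function names and the force values are hypothetical; how the fall force is estimated is defined in [24–26] and is not reproduced here, so it is taken as an input.

```python
def load_to_strength_ratio(fall_force_n, failure_load_n):
    """Return phi = estimated fall force / FE-estimated failure load (both in newtons)."""
    if failure_load_n <= 0:
        raise ValueError("failure load must be positive")
    return fall_force_n / failure_load_n

def exceeds_fracture_threshold(phi, threshold=1.0):
    """True when phi is above the theoretical fracture threshold."""
    return phi > threshold

# Hypothetical values, for illustration only
phi_strong = load_to_strength_ratio(2500.0, 3125.0)  # bone stronger than the fall load
phi_weak = load_to_strength_ratio(2500.0, 2000.0)    # fall load exceeds bone strength
```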

Statistical analysis

We used SVM [27] with a weighted radial basis kernel function to classify women with and without low-trauma fracture. In the first analysis, we examined only low-trauma forearm fractures and in the second analysis, we included all low-trauma fractures. We used Student's t test to compare outcome variables between the fracture and control groups. For each analysis, we generated SVM models based on standardized measurements of (1) all HR-pQCT, FE, and DXA parameters; (2) all HR-pQCT and FE parameters; (3) FE parameters alone; (4) UDR, LS, and FN aBMD; and (5) LS and FN aBMD. For the forearm fractures, we used only the distal radius measurements and for all types of fracture we used the distal radius and distal tibia measurements. To gauge the accuracy of the SVM results and to avoid overfitting to our specific dataset, we used a tenfold cross-validation scheme. This method randomly divides the participants into ten groups, trains the model on nine of these groups, and tests the model on the remaining group. This is repeated ten times so that each group is used as the test group and the accuracy reported is the average of all iterations. The results of this validation are presented as the areas under the curve (AUC) for receiver operating characteristic (ROC) curves. Weka (version 3.7; University of Waikato, Hamilton, New Zealand) was used for the SVM analyses [28], and SPSS Statistics (Version 19.0; IBM; Somers, NY, USA) for all other analyses.
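The tenfold cross-validation scheme can be sketched in plain Python as below. This is an illustration under stated assumptions, not the Weka procedure used in the study: the function names are hypothetical, standardization is omitted, and the classifier is a deliberately simple single-feature threshold rule standing in for the SVM so that the example stays self-contained.

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Randomly partition indices 0..n-1 into k folds of near-equal size (needs n >= k)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and average accuracy over folds."""
    accuracies = []
    for test_idx in kfold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

def fit_threshold(X_train, y_train):
    """Stand-in classifier (NOT an SVM): cut at the midpoint of the two class means."""
    neg = [x[0] for x, label in zip(X_train, y_train) if label == 0]
    pos = [x[0] for x, label in zip(X_train, y_train) if label == 1]
    mean_neg, mean_pos = sum(neg) / len(neg), sum(pos) / len(pos)
    return (mean_neg + mean_pos) / 2.0, mean_pos > mean_neg

def predict_threshold(model, x):
    cut, positive_is_high = model
    return int((x[0] > cut) == positive_is_high)

# Toy, perfectly separable data: 20 controls (label 0) and 20 cases (label 1)
X = [[float(i)] for i in range(20)] + [[float(i) + 100.0] for i in range(20)]
y = [0] * 20 + [1] * 20
cv_accuracy = cross_validate(X, y, fit_threshold, predict_threshold)
```

Because every participant appears in exactly one test fold, each prediction is made by a model that never saw that participant during training.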

Results

Participant characteristics and bone outcome variables

Descriptive characteristics of the participants are provided in Table 1. The fracture and nonfracture groups were similar with respect to age, height, weight, body mass index (BMI), and use of corticosteroids. Use of bisphosphonates was higher among women with a history of forearm fracture compared with nonfracture controls (p = 0.028). The median time between fracture and scan date was 4.5 years (interquartile range (IQR): 3.2–6.7 years). The most common fracture sites were the forearm (n = 14) and lower leg (n = 14), followed by upper arm (n = 9), spine (n = 5), hip (n = 4), upper leg (n = 4), and ribs (n = 2). Eight women reported two low-trauma fractures; we classified these women based on their first fracture.

Table 1 Descriptive characteristics of postmenopausal women with forearm fractures and fractures at all sites, and their corresponding age-matched controls

We present the HR-pQCT and FE variables in Table 2 and DXA parameters in Table 3. Significant differences (p < 0.05) between the forearm fracture group and nonfracture controls were found at the radius for all parameters with the exception of Tb.Th, Ct.Po, and Ct.Ar. When all types of fractures were pooled, all parameters were significantly different with the exception of Tb.Th, Ct.Po, Ct.BMD, Ct.Th, and area measurements. At the distal tibia, Tb.N, Tb.Sp, Ct.BMD, Ct.Po, Tt.Ar, Tb.Ar, and the percentage of the load carried by the cortical region at the distal and proximal slices were not significantly different between forearm fracture cases and controls. For all fracture types pooled at the distal tibia, Ct.BMD, Ct.Po, Tt.Ar, Tb.Ar, and the percentage of the load carried by the cortical region were not significantly different between fracture cases and controls.

Table 2 Bone microarchitecture and finite element parameters [mean (SD)] for the fracture and control groups at the distal radius and distal tibia
Table 3 DXA parameters [mean (SD)] for the fracture and control groups

The percentage of women who had load-to-strength ratios greater than 1.0 and were therefore theoretically at greater risk of fracture was 28.6 % for those with forearm fractures and 10.7 % for their corresponding nonfracture controls; however, the group means were both still below 1.0 (Table 2). For women with all fracture types, 25.0 % exceeded a ratio of 1.0 while 13.6 % of their corresponding controls exceeded the threshold. Women with forearm fracture had significantly lower LS aBMD and UDR aBMD compared with the controls. The women with fractures at all sites had significantly lower FN and LS aBMD than their corresponding controls.

SVM classification

The accuracy, sensitivity, specificity, and ROC-AUCs from the cross-validation of the SVM models are summarized in Table 4. For the forearm fractures and controls, the highest accuracy (83.3 %) and AUC (0.82) were achieved using the HR-pQCT measurements and FE estimates of bone strength together. This was followed by the combination of HR-pQCT, FE, and DXA parameters (accuracy: 81.0 %, AUC: 0.80) and by FE parameters alone (accuracy: 78.6 %, AUC: 0.80). Using LS, FN, and UDR aBMD produced an accuracy of 71.4 % and an AUC of 0.71, while using only LS and FN aBMD gave the worst classification performance (accuracy: 59.5 %, AUC: 0.63). When considering low-trauma fractures at all sites, the SVM models were not as effective at classifying the fracture and control participants compared with the focused forearm fracture analysis. The FE parameters alone had the highest accuracy (69.7 %) and the highest AUC (0.69). The combination of HR-pQCT and FE measurements had an accuracy of 68.9 % and an AUC of 0.67, while the HR-pQCT, FE, and DXA aBMD model also had an accuracy of 68.9 % and an AUC of 0.68.
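For readers less familiar with these metrics, the sketch below shows how accuracy, sensitivity, specificity, and ROC-AUC are computed from true labels, predicted labels, and classifier scores. The function names and the toy labels and scores are hypothetical and unrelated to the study data; the AUC is computed through its rank interpretation (the probability that a randomly chosen fracture case scores higher than a randomly chosen control, with ties counted as one half).

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for binary labels (1 = fracture)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),   # share of fracture cases detected
        "specificity": tn / (tn + fp),   # share of controls correctly cleared
    }

def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of case and control scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 3 fracture cases and 5 controls with hypothetical classifier scores
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
y_pred = [int(s >= 0.5) for s in scores]
metrics = confusion_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

With twice as many controls as cases, as in this study design, accuracy can stay fairly high even when sensitivity is modest, which is one reason to report all four metrics.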

Table 4 SVM classification results for the forearm fractures and for fractures at all sites. Five different models were used for each fracture group. The first was HR-pQCT, FE, and DXA parameters; the second was HR-pQCT and FE parameters; the third was FE parameters alone; the fourth was DXA parameters including simulated ultradistal radius (UDR); and the last was lumbar spine (LS) and femoral neck (FN) DXA only. Accuracy, sensitivity, specificity, and the receiver operating characteristic area under the curve (ROC-AUC) are shown for each model

Discussion

In this study, we demonstrated that a machine learning technique based on an SVM model could classify postmenopausal women with and without a history of low-trauma fracture based on HR-pQCT input. The SVM model incorporating HR-pQCT bone microarchitecture and density, as well as FE estimates of bone strength at the distal radius, outperformed DXA-derived aBMD, which is the clinical gold standard for osteoporosis screening. The SVM method is a promising tool for classification of women with and without previous low-trauma fracture and could potentially be an important clinical basis for fracture risk assessment (Fig. 1).

Fig. 1 Representative scans of a participant with a low-trauma forearm fracture and an age-matched nonfracture control at the distal radius (a and b, respectively) and at the distal tibia (c and d, respectively)

SVMs are a type of machine learning tool that recognizes patterns in data and maps data to a higher dimension to create nonlinear boundaries between the cases [29]. Notably, they are stable and require minimal tuning of parameters [30]. This allows SVMs to work well in many applications and reduces overfitting to the specific dataset. Other machine learning methods such as GBM may provide higher classification accuracy but are computationally expensive and, most importantly, require extensive tuning of parameters. SVMs have an important advantage in that they find a global, unique solution, unlike other methods such as artificial neural networks that may find local minima [31]. This advantage potentially makes SVMs more robust for application to diverse clinical datasets.
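The nonlinear boundary arises from the kernel, which measures similarity between two participants' feature vectors without constructing the higher-dimensional mapping explicitly. A minimal sketch of the standard radial basis function kernel follows. The gamma value and the feature vectors are assumptions, since the kernel parameters are not reported in the text, and the class weighting of the weighted kernel used in this study is not shown.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def gram_matrix(X, gamma=0.5):
    """Pairwise kernel values for all samples; this is the input an SVM solver works with."""
    return [[rbf_kernel(xi, xj, gamma) for xj in X] for xi in X]

# Hypothetical standardized feature vectors for three participants
X = [[0.0, 0.0], [1.0, 1.0], [-2.0, 0.5]]
K = gram_matrix(X)
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart, so gamma controls how local the resulting decision boundary is.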

To date, only one other study has used machine learning techniques to classify postmenopausal women with and without a history of low-trauma fracture [12]. Similar to our findings, Atkinson et al. reported more accurate classification when bone density, geometry, and microstructural outcomes obtained with additional imaging modalities (QCT and HR-pQCT) were included in their GBM model in addition to standard DXA-derived aBMD, indicating the importance of microarchitectural parameters. In addition, whereas not all bone outcomes were significantly different between groups when compared individually, small contributions of each variable combined into a single model allowed for better fracture discrimination. Atkinson et al. reported higher AUCs for classifying forearm fracture compared with our current study, which was possibly due to their larger sample size (n = 99 forearm fractures) [12]. While our study is smaller, we used a tenfold cross-validation scheme, which gives an excellent indication of how our model would perform in practice on a new dataset.

Previous studies used PCA and logistic regression models to determine if HR-pQCT measures of bone microarchitecture and FE estimates of bone strength could discriminate between postmenopausal women with and without previous fracture [9–11]. These analyses differ from our approach in that they attempt to determine the individual parameters (or principal components of parameters) that best discriminate fracture cases from controls, whereas the goal of SVM is to perform classification. Despite the different analytic approaches and goals, results of these studies consistently indicate that HR-pQCT measurements of bone microarchitecture and FE analysis estimates of bone strength outperform DXA-measured aBMD when discriminating between postmenopausal women with and without a history of low-trauma fracture. Consistent with our findings, Vilayphiou et al. showed that FE-estimated bone stiffness had high contributions to the principal components [10]. The relatively similar performance between our model using HR-pQCT, DXA, and FE parameters and the model using FE parameters alone indicates that the FE measurements provide an excellent “summary” measurement of bone quality.

Clinical DXA scans of the forearm are rare; therefore, we included models based on DXA measurements of LS and FN with and without UDR aBMD. The model with only LS and FN aBMD represents a typical clinical scenario and, as expected, resulted in the weakest classification as shown in Table 4. Based on the model that included UDR, FN, and LS aBMD, it is important to include a scan of the distal radius to classify forearm fractures. Our data may actually overestimate the importance of a UDR scan because our simulated UDR aBMD measurements were focused on precisely the area that is most likely to fracture, whereas a real UDR DXA would cover a broader region and may be less sensitive to classifying forearm fracture. It is not surprising that models including UDR aBMD more accurately classify forearm fractures. However, HR-pQCT measurements still outperformed all DXA measurements when classifying forearm fractures.

We found that the SVMs were better at discriminating low-trauma forearm fractures compared with low-trauma fractures at other sites. This is likely attributable to site specificity since the HR-pQCT scans obtained at the distal radius were used to classify the forearm fractures. The difference in discriminative ability may also be due to the large variation in fracture types in our study. We had few participants (n = 6) with fractures at “major” osteoporotic sites (i.e., hip and spine). It is possible that these fractures would be better predicted by the SVMs because women who suffer these types of fractures tend to be at more advanced disease stages compared with women who sustain forearm fractures [32]. A larger cohort of fracture cases would strengthen our understanding of the potential for SVMs to classify fractures. However, our current study indicates that there is excellent potential for this method in classification studies of all low-trauma fractures.

A limitation of our study is that it has a cross-sectional design and a relatively small sample size, and the HR-pQCT scans were acquired retrospectively after the fractures were sustained. Despite these weaknesses, our study provides strong support for SVM combined with HR-pQCT to classify fractures better than the clinical gold standard of DXA aBMD, even at the UDR. It would be ideal to expand this research with a larger fracture cohort and to perform the study prospectively, and work in this area is currently underway at our laboratory and at other sites worldwide. There are also some technical limitations with our study, the first being that the FE estimates are based on a single uniaxial compression test. While this is currently the gold standard for FE applied to HR-pQCT, it may be advantageous to expand the number of simulated FE tests for a more comprehensive assessment of bone strength, and hence better prediction of the strength of the bone during a fall on an outstretched arm. In addition, FE models were computed using a homogeneous Young's modulus for each element. Strength estimates using this method are highly correlated with experimental estimates of bone strength [7]; however, FE measurements may be affected by mineralization differences. In the future, it may be important to account for such differences using scaled modulus values based on density measurements.

The application of SVMs to HR-pQCT data shows great promise to classify postmenopausal women with and without a history of low-trauma fracture. Our results suggest that this is an accurate classification method for forearm fractures and may also be applicable to classifying fractures at other sites. This is the first study to apply SVM to HR-pQCT data and is strengthened by incorporating cross-validation for testing and including FE estimates of bone strength.