Introduction

Vertebral fractures are the most common osteoporotic fractures, and their assessment is widely used to diagnose osteoporosis and to monitor disease progression. This assessment has also been used as an end point in many clinical trials to evaluate the efficacy of drugs for the treatment of osteoporosis [1–4]. Several methods of assessing vertebral fractures have been developed, and most are categorized as quantitative morphometry (QM) [5–9]. The morphometric approach compares the vertebral heights of osteoporotic patients with those of normal women, using the anterior–posterior ratio, middle–posterior ratio, and posterior–posterior adjacent ratio. The cutoff thresholds differ among methods, and no single measurement is considered the gold standard for vertebral fracture assessment. Recently, the semiquantitative (SQ) method has been used instead of QM to assess vertebral fractures in clinical practice and clinical trials. Genant et al. [10] devised the SQ method as a way to assess vertebral fractures without measuring vertebral heights. In the SQ method, each vertebra is graded into one of four categories (normal, mild, moderate, and severe) on visual inspection. Excellent interobserver and intraobserver reproducibility (between experienced and inexperienced but trained observers) was found, with good agreement between the QM and SQ methods [10]. Wu et al. [11] found excellent interobserver agreement using the SQ method. Grados et al. [12] compared the SQ method with four morphometric methods for assessing prevalent vertebral fractures, and also found good agreement. Crans et al. [13] showed that a spinal deformity index derived from the SQ assessment of vertebral fractures predicted future vertebral fracture risk, and that the spinal deformity index or SQ method was clinically useful for the treatment of osteoporosis.

Despite these studies, little is known of exactly how the SQ method is used in clinical practice in Japan. Therefore, our aim was to clarify how the SQ method is used to assess vertebral fractures in clinical practice in Japan, by comparing expert physicians with nonexpert physicians.

Materials and methods

Materials

Lateral thoracic and lumbar spine radiographs of 40 osteoporotic patients were included in the present study. All of the radiographs originated from the Japanese Osteoporosis Intervention Trial (JOINT)-02, conducted nationwide in Japan by the Adequate Treatment of Osteoporosis (A-TOP) research group to compare combination therapy (alendronate and alfacalcidol) with monotherapy (alendronate alone). The details of JOINT-02, including the study design, patient characteristics, inclusion and exclusion criteria, and end points, were previously reported [14, 15]. Baseline and follow-up (12 and 24 months) radiographs were converted into electronic data files (Digital Imaging and Communications in Medicine files). The use of the radiographs in this study was approved by the A-TOP executive and ethics committees.

SQ method and spinal fracture index

The SQ approach was developed by Genant et al. [10] in the 1990s as a method to assess vertebral fractures on visual inspection without measuring vertebral heights. Each vertebra is classified into one of four grades: normal (grade 0); mild deformity (grade 1, 20–25 % reduction in anterior, middle, and/or posterior height and 10–20 % reduction in area); moderate deformity (grade 2, 25–40 % reduction in any height and 20–40 % reduction in area); and severe deformity (grade 3, 40 % reduction in any height and area). For each vertebra, grade 1 or higher was considered “fractured” and grade 0 was considered “not fractured.” The spinal fracture index (SFI) was calculated for each patient by dividing the sum of the individual vertebral grade scores by the number of vertebrae evaluated, which provides general information on the severity of osteoporosis in an individual patient [10].
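The SFI definition above can be illustrated with a minimal sketch. The function names and the example grades are hypothetical and are not data from this study; the sketch only assumes that each evaluated vertebra receives an integer SQ grade from 0 to 3.

```python
from typing import Sequence


def is_fractured(grade: int) -> bool:
    """Grade 1 or higher counts as fractured; grade 0 as not fractured."""
    return grade >= 1


def spinal_fracture_index(grades: Sequence[int]) -> float:
    """Sum of the SQ grades divided by the number of vertebrae evaluated."""
    if len(grades) == 0:
        raise ValueError("at least one vertebra must be evaluated")
    if any(g not in (0, 1, 2, 3) for g in grades):
        raise ValueError("SQ grades must be 0, 1, 2, or 3")
    return sum(grades) / len(grades)


# Hypothetical reading of T4-L4 (13 vertebrae) for one patient.
example_grades = [0, 0, 0, 0, 1, 0, 0, 2, 3, 1, 0, 0, 0]
print("SFI:", spinal_fracture_index(example_grades))
print("Fractured vertebrae:", sum(is_fractured(g) for g in example_grades))
```

Because the denominator is the number of vertebrae evaluated, a vertebra that cannot be read would simply be left out of the list rather than scored as grade 0.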

Assessment of vertebral fractures using the SQ method

Seven expert physicians (expert group) and 37 nonexpert physicians (nonexpert group) independently assessed the vertebral deformity grade (T4–L4) of each patient on a personal computer using the SQ method. The expert group consisted of three orthopedists, three spinal surgeons, and one radiologist (average medical career, 29 years), all with experience assessing vertebral fractures in several drug trials or highly specialized experience assessing vertebral fractures in clinical practice. The nonexpert group consisted of 18 orthopedists, 14 internal medicine physicians, and 5 radiologists (average medical career, 16 years), none of whom used the SQ method for assessing vertebral fractures in daily practice. Baseline and follow-up radiographs were assessed in chronological order for each patient. The physician assessment data were gathered and statistically analyzed as a single data set.

Statistical analysis

The frequency and proportion of each SQ grade per spine level and per visit were determined for the experts and nonexperts, and the proportions were compared using the Pearson chi-squared test. In addition, the proportion of radiographs that the experts and nonexperts assessed as fractured (grade 1 or higher) was examined per spine level at the baseline, 12 months, and 24 months. The SFI was calculated for each patient from each physician's assessment, and the mean value was calculated. Using the SFI as the dependent variable, we used a mixed effects model, accounting for the correlation between the SFI values of the patients assessed by the same physician. The least-squares mean SFI was estimated from the model, adjusted and not adjusted for years of medical experience, to evaluate the difference between the experts and the nonexperts. To analyze interobserver reproducibility within each group, we calculated the kappa statistics per group per spine level (T4–L4). Because we were interested in the degree of agreement between more than two physicians, we used the extended kappa statistical method proposed by Fleiss [16]. The kappa values were interpreted as follows: 0–0.2, poor agreement; 0.2–0.4, fair agreement; 0.4–0.6, moderate agreement; 0.6–0.8, good agreement; 0.8–1.0, very good agreement [16].
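As an illustration of the multi-rater kappa described above, the following is a minimal sketch of the standard Fleiss kappa computation for one spine level, where each patient is a subject and each physician a rater. It is not the analysis code used in this study. The helper names and the example ratings are hypothetical, and treating the upper bound of each interpretation band as inclusive is an assumption, because the bands listed above share their end points.

```python
from typing import List, Sequence


def ratings_to_counts(ratings: Sequence[Sequence[int]],
                      categories: Sequence[int] = (0, 1, 2, 3)) -> List[List[int]]:
    """Turn per-subject lists of rater grades into per-subject category counts."""
    return [[list(row).count(c) for c in categories] for row in ratings]


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """counts[i][j] = number of raters who assigned subject i to category j."""
    n_subjects = len(counts)
    if n_subjects == 0:
        raise ValueError("at least one subject is required")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])
    # Overall proportion of ratings falling into each category.
    p_cat = [sum(row[j] for row in counts) / (n_subjects * n_raters)
             for j in range(n_categories)]
    # Observed agreement per subject, then averaged over subjects.
    p_subj = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_bar = sum(p_subj) / n_subjects
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_exp) / (1.0 - p_exp)


def agreement_label(kappa: float) -> str:
    """Interpretation bands from the text; inclusive upper bounds are assumed."""
    for upper, label in ((0.2, "poor"), (0.4, "fair"),
                         (0.6, "moderate"), (0.8, "good")):
        if kappa <= upper:
            return label
    return "very good"


# Hypothetical grades given by three raters to four patients at one spine level.
example = [[0, 0, 0], [1, 1, 2], [0, 1, 0], [3, 3, 3]]
kappa = fleiss_kappa(ratings_to_counts(example))
print(kappa, agreement_label(kappa))
```

Computing the statistic separately for each spine level and each group, as described above, amounts to calling `fleiss_kappa` once per level on that group's ratings.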

Results

Forty-four physicians (seven experts and 37 nonexperts) assessed 40 sets of spine radiographs at the baseline and during the follow-up period using the SQ method. Table 1 shows the proportions of SQ grade per spine level assessed by the expert and nonexpert groups. There was a significant difference in all spine levels and at all time points between the two groups. The proportion of grade 0 was lower for every spine level in the nonexpert group than in the expert group. Figure 1 shows the proportion of fractured cases per spine level that the expert and nonexpert physicians assessed at the baseline, 12 months, and 24 months. The proportion of fractured cases was high in the thoracolumbar spine (T11–L2) compared with other spine levels in both groups. In addition, the proportion per spine was higher in the nonexpert group than in the expert group at each time point, and was especially high in the upper thoracic spine (T4–T6).

Table 1 Proportion of semiquantitative grade per spine level assessed by the expert and nonexpert groups

Fig. 1 Proportion of fractured cases assessed by the expert and nonexpert groups: a baseline, b 12 months, c 24 months

The mean values of the SFI assessed per case by the experts and nonexperts were plotted at the baseline, 12 months, and 24 months (Fig. 2). The mean values were consistently higher in the nonexpert group than in the expert group for every time point.

Fig. 2 Mean spinal fracture index (SFI) per patient assessed by the expert and nonexpert groups: a baseline, b 12 months, c 24 months

We compared the SFI of the expert group with that of the nonexpert group at the baseline, 12 months, and 24 months using a mixed effects model, with and without adjustment for years of experience as a physician. The least-squares mean SFI was significantly higher in the nonexpert group than in the expert group at all time points (P < 0.0001) (Fig. 3). The difference in the least-squares mean SFI between the nonexpert group and the expert group remained almost constant regardless of adjustment: 0.21 (not adjusted) and 0.19 (adjusted) at the baseline, 0.21 and 0.19 at 12 months, and 0.23 and 0.19 at 24 months.

Fig. 3 Expert versus nonexpert comparison of SFI from the baseline to 24 months: a not adjusted, b adjusted

Table 2 shows the interobserver kappa statistics in the expert and nonexpert groups for the SQ grade of vertebral deformity per spine level. The kappa statistics were higher in the expert group than in the nonexpert group for all vertebral levels at the baseline, 12 months, and 24 months. The expert group scores indicated moderate or good agreement, except for the T4 and L3 levels at the baseline and the T5 and T6 levels at 24 months, and were particularly high between T12 and L4. The kappa statistics for the nonexpert group indicated poor or fair agreement at the baseline, 12 months, and 24 months, and were particularly low between T4 and T6.

Table 2 Kappa statistics per spine level in both groups at the baseline and in the follow-up period

Discussion

We assessed the vertebral fractures of 40 patients using the SQ method at the baseline, 12 months, and 24 months, and evaluated the interobserver reproducibility and discrepancies between expert and nonexpert physicians. At all spine levels from T4 to L4, the proportion of fractured cases (grade 1 or higher) was higher in the nonexpert group than in the expert group at the baseline and in the follow-up period (Fig. 1), and the proportions of SQ grades differed significantly between the two groups at each spine level (Table 1). In addition, in all cases, the mean value of the SFI was higher in the nonexpert group than in the expert group at all visits (Fig. 2). Compared with the experts, the nonexpert group tended to overestimate the SQ grade of vertebral fractures, particularly in the thoracic spine. Genant et al. [10, 11] reported that the SQ assessment of vertebral fractures showed excellent intraobserver and interobserver agreement between experienced and inexperienced but trained physicians, and that the SQ method was a highly reproducible method for assessing osteoporotic vertebral fractures. However, another report indicated that it was difficult to distinguish SQ grade 1 (mild fracture) from borderline deformity (grade 0.5), and that those assessments were sometimes arbitrary [17]. In these reports, the inexperienced physicians were well trained [10, 17], and it would appear that they understood how to make use of all information regarding vertebral body size, shape, and projection to assess vertebral fractures using the SQ method. The discrepancies between the two groups in our study may have resulted from a lack of previous training in SQ assessment for the nonexperts.
In addition, our study was subject to several preexisting biases because the radiographs we used originated from JOINT-02. Participants in that study were previously reported to have high fracture risks, with one or more prevalent vertebral fractures as an inclusion criterion [13, 14], which may have had some effect on the assessment of vertebral fractures by the nonexpert group.

The least-squares mean SFI was significantly higher in the nonexpert group than in the expert group at the baseline and in the follow-up period, whether adjusted or not adjusted (Fig. 3). The estimated difference between the two groups was fairly constant at 0.19–0.23. These results also indicate an overestimation by the nonexpert group compared with the expert group for the SQ assessment of vertebral fractures. In addition, the kappa statistics differed markedly between the expert group and the nonexpert group at the baseline and in the follow-up period. The kappa statistics for the nonexpert group were notably low at 0–0.2 (poor agreement) from T4 to T6 and 0.2–0.4 (fair agreement) from T7 to L4, whereas those of the expert group were high at 0.4–0.6 (moderate agreement) in most spine levels and 0.6–0.8 (good agreement) from T12 to L4. The interobserver reproducibility in the expert group for SQ assessment was excellent compared with that of the nonexpert group, similar to the findings of Genant et al. [10, 11].

Delmas et al. [18] reported that underdiagnosis of vertebral fractures was observed in the IMPACT trial (a multicenter multinational prospective study) in several geographic regions, including North America, Latin America, Europe, South Africa, and Australia. All radiologists were given a radiographic procedure manual, which was the principal tool for standardization of the SQ assessment, and this was a major difference compared with our study. The results of Delmas et al.’s study were as follows: there were 789 patients with vertebral fractures (grade 1 or higher) and 1,662 patients without vertebral fractures (grade 0) in the central readings, and 607 with vertebral fractures and 1,844 without vertebral fractures in the local readings. Further, among 789 patients with vertebral fractures in the central readings, 266 patients had no vertebral fractures (false-negative rate, 34 %) and 523 patients had vertebral fractures (true-positive rate, 66 %) in the local readings. Among 1,662 patients without vertebral fractures in the central readings, 1,578 patients had no vertebral fractures (true-negative rate, 95 %) and 84 patients had vertebral fractures (false-positive rate, 5 %) in the local readings. Conversely, our results indicated that the proportion of fractured cases was lower in the expert group than in the nonexpert group, revealing a discrepancy in the results between the two studies. It appears that a bias toward aggressive identification of vertebral fractures occurs in a clinical trial because of the strict protocol and use of the radiographic procedure manual, which differed from our study design, and our results may be reasonable despite no use of a manual.
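The rates quoted from Delmas et al. can be reproduced from the four cell counts given above, taking the central reading as the reference. The following is a minimal sketch of that arithmetic; the function and key names are hypothetical.

```python
from typing import Dict


def reading_rates(true_pos: int, false_neg: int,
                  false_pos: int, true_neg: int) -> Dict[str, float]:
    """Rates of the local reading, with the central reading as reference."""
    central_fractured = true_pos + false_neg
    central_not_fractured = false_pos + true_neg
    return {
        "true_positive_rate": true_pos / central_fractured,
        "false_negative_rate": false_neg / central_fractured,
        "true_negative_rate": true_neg / central_not_fractured,
        "false_positive_rate": false_pos / central_not_fractured,
    }


# Cell counts as quoted in the text from Delmas et al. [18].
rates = reading_rates(true_pos=523, false_neg=266, false_pos=84, true_neg=1578)
for name, value in rates.items():
    print(f"{name}: {value * 100:.0f} %")
```

The same counts also recover the marginal totals in the text: 523 + 84 = 607 patients with fractures and 266 + 1,578 = 1,844 patients without fractures in the local readings.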

Our study has several limitations. First, the SQ assessment was performed using images on a personal computer rather than on the actual X-ray films, which may have reduced the image resolution and made the shape of the spine appear smaller. Second, all physicians independently assessed the vertebral fractures from the baseline to 24 months without instructions for standardizing the SQ assessment, such as would be provided by a dedicated manual. Finally, the results of our expert group are a reference rather than a gold standard, and our results are only a relative comparison of the two groups because there was no adjudication of assessments within the expert group.

Vertebral fracture assessment is important not only in the diagnosis of osteoporosis and the evaluation of treatment effects but also in epidemiologic studies of osteoporosis and in the treatment of clinical vertebral fractures. The SQ method may not be well known in daily clinical practice, but it has been widely used to assess vertebral fractures in many clinical trials of osteoporosis drugs. Precise assessment of vertebral fractures using the SQ method in daily practice is necessary for proper diagnosis and treatment of osteoporosis. Our results suggest that (1) a conscious effort should be made to promote the SQ method in daily practice, and (2) training programs for the SQ method may help to avoid overestimation of vertebral fractures by nonexpert physicians.

In conclusion, we found that nonexpert physicians in Japan tended to overestimate vertebral fractures in SQ assessment, with poor interobserver reliability among nonexperts and moderate to good interobserver reliability among experts. The SQ method is generally understood to include the entire spectrum of features of spinal deformity and to have high reproducibility. Conscious efforts should be made to promote the SQ method to contribute to the treatment of osteoporosis.