Introduction

The skeleton is the organ most commonly affected by metastatic cancer, and the common cancers show a predilection for metastasising to bone [4]. Tumour registry figures suggest that the incidence of bone metastases is increasing, with breast being the most common causative histology and the femur and spine the most common sites [20]; in addition, bone metastases have been found to be the first sign of disease recurrence in a small number of patients [15]. An estimated 350,000 people die with bone metastases in the United States each year [14]. The management of metastatic deposits in long bones has long been a source of discussion. Many authors have proposed methods with which to identify those lesions at risk of causing pathological fractures based on radiological and clinical factors [2, 9–11, 16, 17, 19]. These methods generally take into account the size of the lesion, whether it involves a weight-bearing bone and whether the lesion is lytic or sclerotic in nature.

The most widely accepted of these predictive systems is that of Mirels [13], who proposed a scoring system based on pain intensity, site, type (lytic, mixed or blastic) and amount of bony involvement (Table 1). Although Mirels’ system is widely used, it was validated in the original study using a small sample (38 patients) and has been subject to independent validation in only one other significant review [5]. This review by Damron et al. is itself limited by a relatively small sample size (n = 12) and by the use of simplified clinical histories, which required physicians to assess pain severity from the written information provided.

Table 1 Mirels’ scoring system

The inclusion of physician-rated pain severity in clinical scoring systems is problematic, as pain is a subjective experience with both physical and psychosocial elements that are difficult to quantify objectively. Furthermore, the paucity of empirical data using validated pain assessments for bone pain complicates matters [6]. While the importance of pain severity in the assessment of fracture risk is generally accepted, it is not absolute, as two significant studies have shown [8, 12]. Keene et al. [12], whose paper is one of the largest on the subject, found that pain was not a significant predictor of fracture. Damron et al. [5] also showed in their intra- and inter-observer concordance study that pain was the factor that showed the greatest variability.

The aim of this study was to independently evaluate the Mirels’ scoring system, as applied to a cohort of patients with bony metastatic disease, in terms of inter- and intra-observer variability, with the objective of obtaining data on its suitability as an ‘off the shelf’ aid to decision making in orthopaedic oncology. It is a basic premise of predictive scoring systems that they show satisfactory intra- and inter-observer reliability from both a clinical and an academic point of view. For treatment decisions to be logical and consistent both within and between treating institutions, and for reported treatment results to be valid, it is vital to have a predictive tool that produces similar results between individual clinicians and with repeated use. To remove the potential for bias caused by patient- or physician-rated pain severity, only the radiological features of the system were evaluated, thereby giving a real sense of the reproducibility of this system using only its most objective elements.

Materials and methods

Patients

Surgical, oncology and HIPE (hospital in-patient enquiry) records from the period between January 2005 and June 2007 inclusive were examined in an effort to identify patients with long bone metastases, and a retrospective chart and radiological review was carried out.

Criteria for selection and inclusion in the study were:

  1. A known histologically proven primary neoplasm
  2. A synchronous metastatic lesion present in a long bone, diagnosed radiologically
  3. No fracture or history of fracture through this lesion
  4. A comprehensive series of pre-fracture, pre-intervention radiographs

Patients who had undergone adjuvant therapy were excluded, as were those in whom no histologically proven primary was identified.

Radiographs showing 35 lesions in 28 patients who met the selection criteria were retrieved. A patient database containing data regarding age, gender, histology and sites affected was created. Only pre-treatment images were selected; in particular, no post-radiotherapy images were used.

The radiographs were reviewed by three fellowship trained orthopaedic surgical oncologists (BH, SD & GOT) using a standard proforma assessment sheet containing the Mirels’ scoring system table. No clinical data were provided and the reviewers rated the radiological features only. This review process was repeated three weeks later using the same radiographs with altered sequence and labelling. The surgeons were blinded to patient identity and no patients currently being treated in the unit were included. Scores were recorded out of a maximum of nine rather than 12 as pain was not considered in this study.
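The radiological score described above can be sketched as a short calculation. This is only an illustration, not the proforma used by the reviewers: the function and variable names are hypothetical, and the category-to-points mapping is an assumption taken from the commonly cited form of Mirels’ table, since the table itself is not reproduced in this text.

```python
# Illustrative sketch only. The point values below follow the commonly cited
# form of Mirels' table and are an assumption, not data from this study.
SITE_POINTS = {"upper limb": 1, "lower limb": 2, "peritrochanteric": 3}
NATURE_POINTS = {"blastic": 1, "mixed": 2, "lytic": 3}


def size_points(fraction_of_diameter: float) -> int:
    """Points for the proportion of bone diameter involved (0 to 1)."""
    if not 0.0 <= fraction_of_diameter <= 1.0:
        raise ValueError("fraction_of_diameter must lie between 0 and 1")
    if fraction_of_diameter < 1 / 3:
        return 1
    if fraction_of_diameter <= 2 / 3:
        return 2
    return 3


def radiological_score(site: str, nature: str, fraction_of_diameter: float) -> int:
    """Sum of the three radiological components; range 3 to 9 (pain omitted)."""
    return (
        SITE_POINTS[site]
        + NATURE_POINTS[nature]
        + size_points(fraction_of_diameter)
    )
```

Under these assumptions, a lytic peritrochanteric lesion occupying more than two-thirds of the bone diameter would receive the maximum radiological score of nine.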

The mean age (mean ± standard deviation) of the patients in this study was 62.3 ± 11.1 years (range 39–81 years). There were 11 male and 17 female patients. The bones affected by metastases were the femur (n = 26), humerus (n = 6) and tibia (n = 3). The primary neoplasms represented in the study cohort were: breast carcinoma (n = 11), small cell lung carcinoma (n = 6), multiple myeloma (n = 5), prostate carcinoma (n = 4), non-small cell lung carcinoma (n = 3), renal cell carcinoma (n = 3), thyroid carcinoma (n = 2), colorectal carcinoma (n = 1) and alveolar soft part sarcoma (n = 1).

Statistical analysis

The data were analysed for both inter- and intra-observer agreement. For inter-observer agreement, the initial overall score and the scores for site, size and nature of lesion were compared across each pair of surgeons. That is, scores for surgeon 1 were compared with those for surgeon 2, and similar comparisons were made for surgeons 1 and 3 and for surgeons 2 and 3. The scores assigned at the second observational time-point were compared in the same way, with each comparison performed using the Kappa statistic. The test of the Kappa statistic considers the null hypothesis of agreement no better than chance versus the alternative hypothesis of agreement beyond what would be expected by chance, with a Kappa statistic of 0 indicating agreement that could be expected by chance and a Kappa statistic of 1 indicating complete agreement.
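To make the pairwise comparison concrete, the following is a minimal sketch of the Kappa calculation, assuming the unweighted (Cohen) form. The analysis in this study was carried out in SPSS; the function name, the rater labels and the six example scores are hypothetical and are not study data.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical scores."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: proportion of lesions given the same score.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed
    # over every score category.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when both raters use one category")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical site scores (1 to 3) from three raters for six lesions.
scores = {
    "surgeon_1": [3, 2, 2, 1, 3, 2],
    "surgeon_2": [3, 2, 2, 1, 3, 2],
    "surgeon_3": [3, 2, 1, 1, 3, 2],
}

# One kappa value per pair of raters: (1, 2), (1, 3) and (2, 3).
pairwise_kappa = {
    (a, b): cohen_kappa(scores[a], scores[b]) for a, b in combinations(scores, 2)
}
```

In this invented example the first two raters agree on every lesion, giving a Kappa of 1, while a single disagreement in six lesions lowers the value for the other pairs to 0.75.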

For intra-observer variability, the initial overall score and scores for site, size and nature of lesion were compared to the second recorded overall score and score for site, size and nature of lesion, respectively, again using the Kappa statistic.

A p-value of less than 0.05 was considered to be statistically significant. All statistical analyses were conducted using the statistical package SPSS 14.0 (SPSS Inc., Chicago, Ill, USA).

Results

Results for inter-observer analysis

All results were reported at a significance level of p < 0.001 except where specifically stated.

For the overall score comparisons, there was evidence of agreement beyond that expected by chance when comparing surgeons 1 and 2 at both time points (κ = 0.350 and 0.505, respectively) and surgeons 2 and 3 at both time points (κ = 0.404 and 0.610, respectively). Surgeons 1 and 3 only demonstrated significant agreement at the second time point (κ = 0.485).

In relation to site score for the first scoring of the X-rays (first observational time-point), there was significant agreement when comparing surgeons 1 and 2 (κ = 0.818), surgeons 1 and 3 (κ = 0.770) and surgeons 2 and 3 (κ = 0.955). Similar results were found at the second observational time-point with concurrence between scores for surgeons 1 and 2 (κ = 0.863), surgeons 1 and 3 (κ = 0.863) and surgeons 2 and 3 (κ = 1.000).

In relation to size score, there was agreement between surgeons 1 and 2 (κ = 0.475), surgeons 1 and 3 (κ = 0.267, p = 0.024) and surgeons 2 and 3 (κ = 0.521) at the first viewing. A similar pattern was seen at the second observational time-point when comparing surgeons 1 and 2 (κ = 0.506), surgeons 1 and 3 (κ = 0.596) and surgeons 2 and 3 (κ = 0.558).

For nature of lesion analysis at the first X-ray scoring, there was significant concordance between all observers (surgeons 1 and 2 [κ = 0.814], surgeons 1 and 3 [κ = 0.589] and surgeons 2 and 3 [κ = 0.695]). Similarly, at the second observational time-point, there was again evidence of agreement comparing surgeons 1 and 2 (κ = 0.669), surgeons 1 and 3 (κ = 0.550) and surgeons 2 and 3 (κ = 0.626).

Results for intra-observer analysis

For surgeon 1, there was evidence of agreement when comparing the first observational and second observational time-points for overall scores (κ = 0.340) as well as comparing the scores for site (κ = 0.765), size (κ = 0.481) and nature of lesion scores (κ = 0.757).

In the case of the second surgeon, there was evidence of agreement when comparing the first observational and second observational time-points for overall scores (κ = 0.392). There was also similarity for the site (κ = 1.000), size (κ = 0.438) and nature of lesion scores (κ = 0.656).

The observations of surgeon 3 showed agreement when comparing the first observational and second observational time-points for overall score (κ = 0.788), site (κ = 0.955), size (κ = 0.561) and nature of lesion scores (κ = 0.766).

The results for intra-observer analysis are shown in Table 2.

Table 2 Intra-observer analysis for all surgeons

Discussion

Mirels, in his paper of 1989 [13], presented a proposed scoring system to quantify the risk of sustaining a pathological fracture through a metastatic lesion in a long bone. He did this by performing a retrospective analysis of 78 lesions in 38 patients that had been irradiated without prophylactic fixation. The ensuing scoring system had a maximum score of 12 that could be attained with individual scores of up to 3 for the four subgroups of site, pain, lesion size and whether the lesion was lytic, mixed or sclerotic. The conclusion of his work suggested that long bones with lesions that score 9 or more should undergo prophylactic fixation.
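The arithmetic of the full score and its cut-off can be sketched as follows. This is an illustrative sketch only, not a clinical tool: the function names are hypothetical, and the minimum component score of 1 is an assumption drawn from the commonly cited form of Mirels’ table rather than from the text above.

```python
def mirels_total(site: int, pain: int, size: int, nature: int) -> int:
    """Sum four component scores, each an integer from 1 to 3 (total 4 to 12)."""
    components = {"site": site, "pain": pain, "size": size, "nature": nature}
    for name, value in components.items():
        # Assumption: each component is scored 1, 2 or 3.
        if value not in (1, 2, 3):
            raise ValueError(f"{name} score must be 1, 2 or 3, got {value!r}")
    return sum(components.values())


def meets_fixation_threshold(total: int, threshold: int = 9) -> bool:
    """True when the total reaches the cut-off of 9 described in the text."""
    return total >= threshold
```

For example, component scores of 3, 2, 2 and 3 sum to 10, which reaches the threshold, whereas 2, 2, 2 and 2 sum to 8, which does not.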

Patients undergoing fixation of pathological fractures benefit from these procedures in terms of mobility and reduction in local pain [18]. Prevention of fracture by prophylactic fixation offers both technical and patient related benefits. In terms of operative procedures, a prophylactic fixation is considered to be of a lesser magnitude than having to fix an established pathological fracture [1, 3, 8, 21]. Furthermore, in relation to the patient, prophylactic fixation has been associated with pain relief with resultant improvement in the quality of life and restoration of ambulation [19] as well as a low complication rate [7].

As previously discussed, the reliability of pain as a predictor of fracture has been questioned, and pain may indeed act as a confounding factor in the prediction of impending pathological fractures through metastases [8, 12] by significantly altering the scores recorded. As such, this study concentrated exclusively on the radiological components of the Mirels’ system, thereby assessing only the most objective elements of the scoring system. We wished to evaluate the intra- and inter-observer reliability of the system, as appropriate levels of both are highly desirable in any clinical scoring system and are in reality a pre-requisite for the acceptability of any clinical test. This is true in particular when reporting results of treatment in the scientific literature, in which the validity of results and conclusions relies on such "like for like" comparisons.

The importance of both the site of the lesion and its association with pain generation, as well as the need for a better understanding of fracture risk in bone that appears sclerotic, is acknowledged but is beyond the scope of this paper.

In this study we have also applied the scoring system to a broader array of pathology, in terms of histology, site and type of lesion, than has been the case in other assessments of the Mirels’ scoring system to date. In doing so, we hope to have provided an improved understanding of the reliability and reproducibility possible with the use of this scoring system.

The results have shown that when applied by experienced orthopaedic surgical oncologists, there is statistically significant inter- and intra-observer agreement across the spectrum of disease patterns. Analysis of subgroups in relation to time-points, size, site, nature of lesions and high or low scoring patterns similarly recorded a high level of agreement throughout the study. These results compare favourably to the only other significant independent appraisal of the reliability of the Mirels system, which was made by Damron et al. [5].

This paper excludes pain from the assessment. This approach is potentially controversial, as pain is integral to many of the scoring systems used in this area. Our objective was, however, to identify the reproducibility of the radiological features of the Mirels score when applied by experienced clinicians, as no empirical data relating to this vital element exist in the scientific literature to date. We acknowledge the potential for bias caused by the relatively short re-review interval; however, we feel that, overall, adequate precautions to minimise this factor were taken.

In conclusion, the results of this study support the radiological components of the Mirels’ scoring system as reliable and repeatable when applied to this cohort of patients. While the pitfalls of the pain subset in altering the score are documented and recognised, this paper would support the continued and regular use of the Mirels scoring system in the management of patients with malignant bone disease.