Introduction

Osteonecrosis affects both compact and cancellous bone in a circumscribed area. The clinical picture of non-traumatic femoral head necrosis was first described more than 50 years ago. Avascular necrosis (AVN) remains a challenging clinical problem because it mostly affects middle-aged patients in the active phase of their lives. An aggravating factor is that both hips are involved in 30–70% of patients, with a peak incidence at 40 years of age [8]. In contrast to traumatic causes such as subcapital fracture or hip dislocation, which lead to vascular disruption and acutely deficient perfusion of the femoral head, the exact aetiology of AVN remains obscure [3, 13]. A multifactorial genesis is discussed, involving underlying conditions and exposures such as Gaucher's disease, ionizing radiation and steroid therapy, as well as risk factors such as excessive alcohol consumption, hyperuricaemia, pancreatitis and pregnancy [7, 10]. Osseous necrosis can develop from failure of the arterial supply, obstruction of venous drainage, intraluminal capillary obstruction or compression of the capillaries in the bone marrow space.

Current pathophysiological models propose that recurrent ischaemic attacks on bone are followed by an increase in intraosseous pressure, most probably due to oedema, since bone marrow is functionally a closed compartment. The intraosseous venules and capillaries are consequently compromised, resulting in a vicious circle similar to the compartment syndrome of the extremities.

In the past, several staging classifications have been introduced and used, some based only on plain radiographs and others on imaging modalities such as computed tomography, magnetic resonance imaging (MRI) or bone scintigraphy. It has been demonstrated that an early diagnosis of AVN plays a crucial role with respect to prognosis and therapeutic success [15].

The classification system introduced by Ficat [4] is possibly the most commonly used. However, in the literature this classification system has been criticised because of great inter- as well as intraobserver variability [11, 18]. Moreover, the Ficat classification does not take the size and location of the necrotic area into account. For appropriate consideration of the missing parameters the ARCO (Association Research Circulation Osseous) classification system was introduced [6].

Many authors have recommended treatment on the basis of the symptoms together with the Ficat and the ARCO classification using radiographs and MRI.

Treatment with pulsed electromagnetic fields or core decompression or both, bone grafting and decompression, and rotational transtrochanteric osteotomy have been suggested for Ficat stage I lesions [1, 14, 17]. These procedures as well as vascularised fibular grafting, rotational transtrochanteric osteotomy and intertrochanteric osteotomy have been recommended for stages II A and B [1, 9, 16]. All these treatments as well as surface replacement and total hip replacement have been recommended to treat stage III lesions.

The choice of treatment and the judgement of its efficacy often are directly based on the Ficat/ARCO stage. Thus, the determination of the Ficat or ARCO stage has important consequences as it has a direct bearing on the patient’s clinical course.

Although these radiological parameters are commonly used to evaluate the stage and progress of disease, most scores and calculations have not been analysed in more detail. A classification system provides a description of meaningful biological information and should be reproducible from one observer to the next as well as by one observer on separate occasions. A lack of reproducibility clouds the comprehension and comparison of studies and treatment recommendations that are based on such classification systems.

Therefore, the aim of our study was to evaluate inter- as well as intraobserver reliability and variability of commonly used parameters of Ficat and ARCO on plain radiographs and MR images for the classification of femoral head necrosis.

Patients and methods

Between 1998 and 2004, all patients with suspected AVN were examined at our University Hospital’s department of radiology using only plain radiographs or both radiographs and MRI. The inclusion criteria for suspected AVN were typical clinical symptoms such as sudden pain in the hip following trauma or, in the absence of trauma, the coincidence of certain co-morbidities such as Gaucher's disease, diabetes, hypertension, rheumatic diseases, etc.

All identifying data on the radiographs and MR images were removed and the images were randomised. Each radiograph and MR sequence was reviewed and classified by six observers: two general orthopaedic surgeons, two orthopaedic residents and two general radiologists. All observers were familiar with both the Ficat and the ARCO classification system and had previously used them in daily clinical routine. The observers were provided with a copy of both classification systems and were allowed to refer to the copies as often as necessary while reading the images. Each observer read the images alone; the only other person present was a third party not involved in the study, who documented the results. No discussion among the observers was allowed during or after the reading. The observers were given as much time as they needed to review each radiograph or MR image sequence carefully. After a decision had been made, the images of the next hip or hips were displayed until all hips had been classified and documented.

After a period of three months all hips were reviewed again on a second occasion. In the interim the images were not available to any observer and no feedback regarding the initial reading was provided. The second review was performed in the same way as the first reading session.

Statistical analysis

Computer-assisted statistical analysis (SAS, Heidelberg, Germany) was used to determine the inter- and intraobserver reliability of both classification systems for radiographs and MRI by calculating the weighted Cohen’s kappa index. Kappa relates the observed proportion of agreement to the proportion of agreement expected by chance. Kappa values of less than 0.5 were considered to indicate poor agreement and values greater than 0.75 excellent agreement. Accuracy, i.e. how close an observation lies to the true value, could not be determined because the correct classification of each evaluated hip was not known. Therefore, the level of agreement between observers (interobserver reliability) and between the two reviews of each observer (intraobserver reproducibility) was assessed over time for both the Ficat and the ARCO classification system.
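As an illustration of the statistic described above, the following is a minimal sketch of a weighted Cohen's kappa for two readings of the same hips. It is not the SAS procedure used in this study. The function name, the choice of linear disagreement weights and the example stage labels are assumptions made for illustration only; the weighting scheme actually applied is not specified here.

```python
def weighted_kappa(ratings_a, ratings_b, categories):
    """Linearly weighted Cohen's kappa for two ratings of the same cases.

    `categories` lists the ordered stages (for example ["I", "II", "III", "IV"]).
    Linear weights are an assumption; quadratic weights are another common choice.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must classify the same non-empty set of cases")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two ordered categories are required")
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)

    # Observed joint proportions: cell (i, j) is the share of cases that
    # rater A placed in category i and rater B placed in category j.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    observed_disagreement = 0.0
    chance_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # 0 on the diagonal, 1 at maximum distance
            observed_disagreement += weight * observed[i][j]
            chance_disagreement += weight * row[i] * col[j]

    if chance_disagreement == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - observed_disagreement / chance_disagreement


# Hypothetical example: two readings of four hips, not data from this study.
stages = ["I", "II", "III", "IV"]
identical = weighted_kappa(["I", "II", "III", "IV"], ["I", "II", "III", "IV"], stages)
chance_level = weighted_kappa(["I", "I", "II", "II"], ["I", "II", "I", "II"], ["I", "II"])
```

With identical readings the observed disagreement is zero and kappa equals 1; when agreement is no better than expected from the marginal frequencies, kappa is 0. Intraobserver reproducibility can be obtained with the same function by passing the first and second reading of one observer.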

Results

Overall 38 patients (16 women, 22 men, mean age 55.5 years, SD 10.6 years) were enrolled; 54 hips were included in this study and evaluated.

Ficat classification (see Table 1)

Interobserver reliability

For the interobserver reliability of radiographs (Table 1), the mean reliability coefficient averaged over both reviews was 0.36 (range 0.11 to 0.68). The mean interobserver reliability coefficient was 0.39 for the first review and 0.32 for the second review.

Table 1 Scheme of Ficat classification (1985) [4]

The MR evaluation revealed a mean interobserver reliability coefficient of 0.37 (range 0.23 to 0.70). The mean interobserver reliability coefficient was 0.39 for the first review and 0.34 for the second review.

Intraobserver reproducibility

The mean weighted kappa intraobserver reproducibility coefficient for radiographs was 0.53 (range 0.29 to 0.76) among the six observers.

For the evaluation of the MR images a mean weighted kappa intraobserver reproducibility coefficient of 0.50 resulted (range 0.29 to 0.71).

ARCO classification (see Table 2)

Interobserver reliability

For the interobserver reliability of MRI (Table 2), the mean reliability coefficient averaged over both reviews was 0.35 (range 0.06 to 0.56). The mean interobserver reliability coefficient was 0.38 for the first review and 0.31 for the second review.

Table 2 Scheme of ARCO classification system (1992) [6]

Intraobserver reproducibility

For the evaluation of the MR images a mean weighted kappa intraobserver reproducibility coefficient of 0.44 resulted (range 0.26 to 0.56).

Discussion

In this study the inter- and intraobserver agreement in the evaluation of avascular necrosis of the femoral head was assessed using the Ficat and the ARCO classification. We found poor interobserver reliability and at best fair intraobserver reproducibility, which limits any meaningful comparison of studies using the Ficat or the ARCO classification system. Thus, these staging systems are still not sufficiently reliable to assess the status of avascular necrosis of the hip when used alone.

Radiological evaluation of the stage and extent of disease plays an important role in avascular necrosis of the femoral head [15]. In the literature an increasing number of studies deal with the predictive quality rating of radiological classification schemes [15, 20].

However, the reproducibility of these analysis techniques has not been evaluated. Orthopaedic classification systems subdivide the spectrum of presentation of a disease process. They need to be reproducible among different observers as well as by the same observer on different occasions.

Sophisticated diagnostics should provide the information as to whether joint-conserving surgery is reasonable or already too late. Thus, diagnostic instruments providing high reliability are necessary especially for the correct and adequate staging of femoral head necrosis.

The modified Ficat system described by Smith et al. [18] does not provide sufficient information for the staging of the disease. In this study, two different classification systems, the Ficat and the ARCO classification, were used to stage osteonecrosis of the hip using both radiographs and MR images. Other orthopaedic studies evaluating classifications of osteonecrosis in bones other than the hip also used kappa statistics and reported disappointing results, with kappa values ranging from 0.4 to 0.57 for interobserver reliability and from 0.58 to 0.69 for intraobserver reproducibility [2, 5]. The modified scheme for interpreting kappa values recommended by Smith et al. [18] was used in this study: kappa values between 0.50 and 0.75 were interpreted as fair and values >0.75 as excellent. According to these modified guidelines, the present study showed poor interobserver reliability for the Ficat classification on radiographs and MRI and for the ARCO classification on MRI. Intraobserver reproducibility was fair for the Ficat classification on radiographs and MRI but poor for the ARCO classification. No excellent results were found for either interobserver reliability or intraobserver reproducibility. Our findings reinforce the results of Smith et al., who used the modified Ficat classification only.
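The interpretation scheme can be written out as a short sketch. The thresholds and the mean kappa values are those reported in this study; the function name and the dictionary labels are illustrative assumptions, and values exactly at a boundary are handled as stated in the text (0.50 to 0.75 fair, >0.75 excellent).

```python
def interpret_kappa(kappa):
    """Map a kappa value to the agreement categories used in this study."""
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.50:
        return "fair"
    return "poor"


# Mean weighted kappa values as reported in the Results section.
reported_means = {
    "Ficat, radiographs, interobserver": 0.36,
    "Ficat, MRI, interobserver": 0.37,
    "ARCO, MRI, interobserver": 0.35,
    "Ficat, radiographs, intraobserver": 0.53,
    "Ficat, MRI, intraobserver": 0.50,
    "ARCO, MRI, intraobserver": 0.44,
}

# Apply the scheme to every reported mean value.
ratings = {label: interpret_kappa(value) for label, value in reported_means.items()}

for label, rating in ratings.items():
    print(f"{label}: {reported_means[label]:.2f} ({rating})")
```

Applied to the reported means, all three interobserver values and the ARCO intraobserver value fall into the poor category, and the two Ficat intraobserver values fall into the fair category.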

Although this was not statistically analysed in this study, the reviewers were least likely to change their classification when they had initially classified a hip as stage I or IV. This is presumably because stages I and IV present the most obvious findings, either normal or severely osteonecrotic, respectively. In comparison, most hips initially classified as stage II A/B or III were staged differently in the second review, most likely because the wording of both classifications for the middle stages can be interpreted in various ways [4].

As stated above, our results showed high variability for the Ficat classification. We therefore conclude that the Ficat classification is not appropriate for the evaluation of femoral head necrosis and that more reliable classifications should be used. The ARCO scheme was developed to allow simultaneous evaluation of different aspects of femoral head necrosis. Unfortunately, using the ARCO classification did not result in a significant improvement in reliability compared with the Ficat classification.

Because interobserver reliability was not excellent, it is not possible to compare studies from different centres, even when they use a similar classification. Likewise, because intraobserver reproducibility was not excellent for either classification, it is not plausible to rely on outcome studies in which radiographs or MRI are assessed over time, e.g. before and after treatment, by the same observer or group of observers.

So far only one study exists describing the possibility of radiologically assessing the extent of necrotic area of the femoral head [19]. Unfortunately, this study presents only a small number of patients and is limited to the initial necrotic stages; thus, the figures shown are not sufficiently representative. Although an exact and precise evaluation of the extent of affected area should be possible using MRI and CT, it remains unclear whether these modalities provide exact information on the real necrotic volume [12, 20], and whether a higher accuracy of diagnosis would inevitably lead to improved treatment results.

Unfortunately, the Ficat and ARCO classification schemes for osteonecrosis of the femoral head do not have acceptable interobserver reliability and intraobserver reproducibility on which treatment protocols and the determination of outcome can be based. However, MRI and CT offer a more detailed view of the involvement of the femoral head, subchondral collapse, narrowing of the joint space and acetabular changes found with the progression of this disease. Nevertheless, radiographs will continue to play an important role in the follow-up evaluation of the disease, even though other, modern imaging modalities such as MRI are available.