Introduction

When planning a reconstructive osteotomy for treatment of hip dysplasia, several factors must be considered, including the degree of dysplasia, the presence of osteoarthritis, and hip congruency [13]. Hip congruency refers to the relationship of the femoral head contour to the acetabulum. In a congruent joint, the arc of the femoral head matches the arc of the acetabulum with a consistent joint space throughout. Some authors suggest a reconstructive pelvic osteotomy, such as a Salter or Pemberton osteotomy, can be performed only if adequate congruency exists between the acetabulum and the femoral head [11, 12, 14], as assessed on either a neutral hip radiographic view or a von Rosen view (abduction and internal rotation) [16]. If the joint is incongruent, a salvage osteotomy such as a Shelf or a Chiari may be required to obtain additional acetabular coverage. The periacetabular osteotomy (PAO), which is used in the adolescent or adult hip to achieve the necessary amount of femoral head coverage, may challenge some of the traditional concepts of preoperative joint congruency.

Previous studies have evaluated intrarater and interobserver reliabilities of radiographic measures of dysplasia and osteoarthritis [1, 3, 8, 9]. There are several reported classifications of hip congruency [6, 7, 9, 17]. Two classifications of congruency are commonly used [2, 5, 9, 17, 20]. Yasunaga et al. proposed a four-part classification for congruency: excellent if the curvature of the acetabulum and femoral head is identical or nearly identical and the joint space is entirely maintained, good if the joint space is adequately maintained, fair if partial narrowing of the joint space has occurred, and poor if the joint space is obliterated (Fig. 1) [20]. Okano et al. modified this to a three-part classification: good if the joint space width at the narrowest point is at least 50% of the joint space width at the widest point, poor if the joint space width at the narrowest point is less than 50% of the joint space width at the widest point, and narrowed if no more than 2 mm of joint space remains in any area of the hip (Fig. 2) [9, 10]. Also, Clohisy et al. described congruency based on the subjective assessment of conformity between the acetabulum and femoral head using all radiographic views but with an emphasis on the AP view [2, 3]. They defined a congruent hip as one that has matching arcs of the femoral head and the acetabulum [3]. In our practice group, however, we have noted differences in opinion regarding what constitutes a congruent joint, particularly in cases of severe dysplasia. A different assessment of congruency could substantially affect the approach to treatment. Also, as newer literature reports the outcomes of PAO based on preoperative congruency, it is important to have reliable, reproducible measures of congruency to render the results of these studies meaningful [9, 19].
As the criteria of Clohisy et al. are general, for the purposes of this study we defined a third rating based solely on the rater’s clinical practice: whether the joint would be considered congruent or not (the subjective classification). The reliability of these three measures, however, has not been well studied [3, 9]. Clohisy et al. reported an interrater kappa coefficient of 0.29 for a yes/no congruency rating [3]. In contrast, Okano et al. reported a high interrater kappa coefficient of 0.92 using their three-part classification of congruency [9].

Fig. 1

The classification of Yasunaga et al. comprises four possible ratings for congruency. (Published with permission from Yasunaga Y, Takahashi K, Ochi M, Ikuta Y, Hisatome T, Nakashiro J, Yamamoto S. Rotational acetabular osteotomy in patients forty-six years of age or older: comparison with younger patients. J Bone Joint Surg Am. 2003;85:266–272.)

Fig. 2

The classification of Okano et al. has three possible ratings assessing hip congruency. (Published with permission from Okano K, Enomoto H, Osaki M, Shindo H. Joint congruency as an indication for rotational acetabular osteotomy. Clin Orthop Relat Res. 2009;467:894–900.)
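
The three-part rule of Okano et al., as summarized in the Introduction, can be written as a short decision function. This is only an illustrative sketch, not the authors' published procedure: the function name `okano_rating` and its parameters are hypothetical, the order in which the rules are checked is an assumption, and "no more than 2 mm of joint space remains in any area" is interpreted here as the widest joint space being 2 mm or less.

```python
def okano_rating(narrowest_mm: float, widest_mm: float) -> str:
    """Classify congruency from two joint space widths (in mm).

    Assumptions (not stated in the source):
    - "narrowed" is checked first and means the widest width is <= 2 mm;
    - "good" means narrowest >= 50% of widest, "poor" means < 50%.
    """
    if widest_mm <= 0 or narrowest_mm < 0 or narrowest_mm > widest_mm:
        raise ValueError("widths must satisfy 0 <= narrowest <= widest and widest > 0")

    # No more than 2 mm of joint space anywhere in the hip.
    if widest_mm <= 2.0:
        return "narrowed"

    # Ratio of narrowest to widest joint space width.
    if narrowest_mm >= 0.5 * widest_mm:
        return "good"
    return "poor"


if __name__ == "__main__":
    # Illustrative values only, not measurements from the study.
    for narrowest, widest in [(4.0, 5.0), (2.0, 5.0), (1.5, 2.0)]:
        print(narrowest, widest, okano_rating(narrowest, widest))
```

In this study the raters applied the classification visually from the published diagram, so a measurement-based rule such as this one may not reflect how the ratings were actually assigned.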

We therefore performed this study to determine the intrarater and interrater reliabilities of these three measures of hip congruency.

Patients and Methods

We retrospectively reviewed an institutional database containing records of 158 adolescents with hip dysplasia who were potential candidates for PAO. All patients had symptomatic hip dysplasia. Radiographs of 30 hips meeting the following inclusion criteria were selected: symptomatic hip dysplasia, skeletal maturity (closure of the triradiate cartilage and proximal femoral physis), and high-quality AP and von Rosen (abduction and internal rotation) view radiographs. Representative radiographs were chosen independently by one author (ANL) who was not involved in the review process. All radiographs were obtained using a standardized imaging protocol, all patient information was removed, and each hip was assigned a study number. Various dysplastic hips were selected so as to encompass a wide spectrum of disease severity and joint congruency. Several otherwise normal, unaffected hips on the contralateral side of the dysplastic hips also were selected to complete the assortment. Several hips had severe dysplasia with severe subluxation, which reduced at least partially on the von Rosen view. Underlying diagnoses included developmental hip dysplasia, cerebral palsy, dysplasia associated with orthopaedic syndromes, and Legg-Calvé-Perthes disease. We excluded hips with frank dislocation and hips in skeletally immature patients. Institutional review board approval was obtained for all aspects of this study.

The observers had no specific training using the classification systems of Yasunaga et al. [20] and Okano et al. [9] beyond what is readily available in the literature, and simply were given published diagrams as a guideline. We used the resources typically available to an orthopaedic surgeon and did not seek any specialized training or contact with Dr Okano or Dr Yasunaga. The AP and von Rosen views for each hip were reviewed by four experienced pediatric orthopaedic surgeons (DJS, JGB, DAP, KER) and two pediatric orthopaedic fellows (SG, MK). All four attending surgeons commonly treat hip dysplasia. Fellows were included to see if there would be better or worse agreement among less experienced surgeons. The observers were blinded regarding patient physical examination, history, symptoms, and identity. Observers were provided with the diagrams (Figs. 1 and 2) for the classification systems of Yasunaga et al. [20] and Okano et al. [9], respectively, and were asked to provide a yes or no response regarding whether the hip was congruent based on the criteria that they would use in a clinical setting. The observers classified the congruency of each hip using the three methods and submitted confidential responses. To obtain intraobserver reliability measures, this procedure was repeated 1 month later with the viewing order of the films shuffled.

Before beginning the study, a biostatistician (RHB) was consulted to determine the appropriate number of cases to include and to assist in study design. Based on the number of reviewers, number of rating criteria, and heterogeneous patient population, a sample size of 30 cases was recommended. This sample size has historically been used for other comparison studies at our institution. A formal power analysis was not completed for this study because the goal was not to test a hypothesis, but to estimate differences between the various classification systems.

Kappa values were used to measure intraobserver and interobserver reliabilities, with 1.0 indicating perfect agreement and 0 indicating a level of agreement expected by chance alone [4]. A simple kappa was used to measure agreement for the subjective rating. For the classifications of Yasunaga et al. [20] and Okano et al. [9], which are ordered variables, we used a weighted kappa as a refinement [15]. This differentiates disagreement by proximity of the response. Thus, ratings that disagree by only one level yield a higher weighted kappa score than ratings that disagree by two levels. Intrarater reliability was calculated for each observer, comparing results from the first reading with the results from the second reading 1 month later. Also, a combined intrarater reliability was calculated, assessing the intrarater agreement in the group of attending staff and in the group of fellows. Interrater reliability was measured between the staff and the fellows for the first and second readings. No subanalysis was performed based on the degree of dysplasia. Instead, we sought to test the validity of these hip congruity measures over a wide spectrum of hip dysplasia. Level of agreement between the different congruency rating methods was assessed using a z-test.
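
The difference between a simple and a weighted kappa can be illustrated with a minimal sketch for two sets of ratings. The function name `cohen_kappa` is hypothetical, the example ratings are invented for illustration and are not study data, and linear disagreement weights are an assumption, because the weighting scheme used in the analysis is not stated.

```python
def cohen_kappa(ratings_a, ratings_b, categories, weighted=False):
    """Cohen's kappa for two equally long rating lists.

    categories lists the possible ratings in order. With weighted=True,
    linear disagreement weights |i - j| / (k - 1) are applied (an assumed
    scheme); otherwise every disagreement counts fully.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)

    # Observed joint proportions of (first rating, second rating).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each set of ratings.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        if weighted:
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    obs_dis = sum(weight(i, j) * observed[i][j] for i in range(k) for j in range(k))
    exp_dis = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if exp_dis == 0:
        raise ValueError("kappa is undefined when no chance disagreement is possible")
    return 1.0 - obs_dis / exp_dis


if __name__ == "__main__":
    levels = [0, 1, 2, 3]              # e.g. four ordered congruency grades
    first = [0, 1, 2, 3, 0, 3]
    off_by_one = [0, 1, 2, 3, 1, 2]    # two disagreements, one level apart
    off_by_two = [0, 1, 2, 3, 2, 1]    # two disagreements, two levels apart
    for second in (off_by_one, off_by_two):
        print(cohen_kappa(first, second, levels),
              cohen_kappa(first, second, levels, weighted=True))
```

In this invented example both second readings disagree with the first on the same two hips, so the simple kappa is identical for the two; only the weighted kappa distinguishes the one-level disagreements from the two-level ones.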

Results

A total of 30 hips were reviewed, representing a spectrum of disease from the unaffected sides to severely dysplastic hips. The intrarater reliability ranged from 0.30 to 0.71 for the criteria of Yasunaga et al. (Table 1). An intrarater reliability for the criteria of Yasunaga et al. could not be calculated for one staff surgeon, as the rater did not select all choices. The reliability using the classification of Okano et al. ranged from 0.15 to 0.41 (Table 1). Looking at each reviewer independently using the subjective yes/no criteria, we found that the intrarater reliability ranged from −0.03 to 0.84. The combined intrarater reliability was 0.74 for the subjective criteria, 0.43 for the Yasunaga et al. classification, and 0.37 for the Okano et al. classification (Table 2). Comparing the combined intrarater reliability between the staff and the fellows, we found that both groups had low intrarater reliability for the classifications of Yasunaga et al. and Okano et al. Overall, only the subjective yes/no method yielded high combined intrarater reliability. The z-test was used to determine which rating system had superior intrarater reliability. The subjective yes/no method had significantly higher kappa scores for combined intrarater reliability than the classifications of Okano et al. and Yasunaga et al. (p < 0.001 and p < 0.001, respectively).

Table 1 Intrarater reliability
Table 2 Combined intrarater reliability

Using the classifications of Yasunaga et al. and Okano et al. and the subjective criteria, the interrater reliability ranged from −0.02 to 0.46 (Table 3). Of all the comparisons, the fellows had the best agreement on the first evaluation of the radiographs using the classification of Okano et al. and the subjective criteria. All other interrater analyses revealed a kappa value less than 0.4. Agreement among the staff on the subjective opinion of congruency was particularly poor, with results being no different from chance. We evaluated the interrater reliability for both trials to see if agreement differed between the first and second reviews; however, no improvement was seen. There was no difference in results when comparing agreement among the fellows and agreement among the staff. Using the z-test to determine which rating system had superior interrater reliability, we found that the subjective criteria and the classification of Okano et al. had significantly better matching rates than the classification of Yasunaga et al. (p < 0.001 and p < 0.001, respectively).

Table 3 Interrater reliability

Discussion

Opinions differ in clinical practice regarding what constitutes a congruent hip, particularly in cases of severe dysplasia. This discrepancy may result in markedly different recommendations for the type of surgical treatment. Also, as the literature reports the outcomes of PAO based on preoperative congruency, it is important to have reliable, reproducible measures of congruency to render the results of these studies meaningful [9, 19]. Of the various reported classifications of hip congruency [6, 7, 9, 17], we selected three that are commonly used, are easily measured in a clinical setting, do not require any specific imaging software, and, thus, are of potential practical value. We performed this study to measure the intraobserver and interobserver reliabilities of these commonly used measures of congruency.

Our investigation has several limitations. First, we evaluated only radiographs. The observers had no knowledge of the patient’s history or physical examination findings and could not contextualize the radiographs. Second, measurements of interrater reliability from the second reading may have limited validity owing to practice effects and potential bias, although reviewers were asked to avoid discussing the study with other practitioners between the two readings. Third, although the radiographs were taken at one center, we noted some variability in positioning of the affected leg. In particular, positioning for the von Rosen view can be limited by the patient’s symptoms and restricted ROM. For this reason, all observers were given the AP and von Rosen views with which to rate the hip congruency. Fourth, some of the radiographs were of hips of patients with severe dysplasia (Fig. 3). These three congruency measures may have better intrarater and interrater reliabilities if applied to a patient population with more subtle findings of hip dysplasia. In some cases, improved agreement was noted in patients with minimal dysplasia (Fig. 4). Finally, our observers had no specific training in using the classification systems of Yasunaga et al. and Okano et al. It is possible that with training the reliability would have been higher. Nevertheless, we provided the reviewers with the information currently available in the literature.

Fig. 3A–B

These are representative (A) AP and (B) von Rosen views of a hip with severe dysplasia. Subjectively, four raters judged this hip to be incongruent, and two thought the hip was congruent. Interestingly, all six raters believed this represented a poorly congruent hip with the Okano criteria. With the Yasunaga criteria, there was one poor rating, two fair, and three good.

Fig. 4A–B

These are representative (A) AP and (B) von Rosen views of a hip with mild dysplasia. All four staff agreed that this hip was excellent (criteria of Yasunaga et al.), good (criteria of Okano et al.), and congruent (subjective criteria). The two fellows rated the hip as good or poor with the criteria of Yasunaga et al. and Okano et al., but agreed subjectively that the hip was congruent.

We presumed there would be good intrarater reliability for the three methods. We found low combined intraobserver reliability for the classifications of Okano et al. and Yasunaga et al. When evaluating the reviewers independently, two attending surgeons had much higher intrarater reliability for the classification of Yasunaga et al. For the most part, our reviewers had difficulty duplicating their results 1 month apart for either classification system. To our knowledge, there are no previous published studies regarding intraobserver reliability of the classification systems of Okano et al. and Yasunaga et al. The combined intraobserver reliability for the subjective criteria was high at 0.74. Even if raters do not agree among themselves on a congruent hip, they consistently recognize what they personally consider to be a congruent joint. Clohisy et al. reported on intrarater reliability of a subjective measure of congruency [3]. They found a combined intrarater reliability of 0.50. Thus, subjective opinion appears to produce the highest intrarater reliability when compared with other measures.

We also presumed there would be low interrater reliability for measurements of hip congruency. We found low interobserver reliability for all three methods when used to measure congruency in hips with a spectrum of hip dysplasia. This was true for the subgroup of pediatric orthopaedic fellows and the attending orthopaedic surgeons. Clohisy et al. had similar findings in their study on interobserver reliability for various hip measures [3]. They rated congruency using a subjective yes/no criterion with an AP view of the pelvis [3]. This method is similar to our subjective criteria. They found the kappa coefficient for the congruency rating was poor at 0.29. We had similar results with an interrater reliability of 0.21 using subjective criteria for congruency. As our method is based only on the qualitative judgment of whether the arc of the acetabulum matches the arc of the femoral head, it is understandable that there would be differences in opinion, resulting in a low kappa score. Okano et al. and Yasunaga et al. provided more detailed descriptions of congruency using three- and four-part classification systems [9, 10, 17, 20]. In our literature review, we did not find any previous reports of the interrater reliability for the classification of Yasunaga et al. As part of a larger study, Okano et al. rated 20 hips using their method and reported an excellent interrater kappa value of 0.92 [9]. To our knowledge, this has not been reproduced in other studies. Our overall interrater kappa using the classification of Okano et al. was 0.25, reflecting low agreement in our patient population. In contrast to our reviewers, Okano et al. [10] likely are familiar with their classification system and are better able to produce similar results, and thus high kappa scores. Alternatively, the hips in our series might have had more severe dysplasia than those in the series by Okano et al., rendering their classification system less reliable among our reviewers.

There has been increasing interest in the role of hip congruency as a surgical indication and as a prognostic factor for results after acetabular osteotomy. Traditionally, congruency has been considered a prerequisite for reconstructive osteotomy. Our observations suggest practitioners may have their own subjective understanding of what constitutes a congruent hip. Only subjective opinion was a reproducible measure of congruency for the individual surgeon, with good intrarater reliability. However, other commonly used measures of congruency have low intraobserver reliability, and all three methods have low interobserver reliability. Additional studies with more specific guidelines are needed to validate the current measures of congruency. Alternatively, a new radiologic rating of congruency with greater reproducibility among practitioners may aid in refining operative indications and understanding postoperative outcomes for osteotomies in the context of severe hip dysplasia.