Introduction

The tibial plateau is one of the most critical load-bearing areas in the human body. Fractures of the plateau affect knee alignment, stability, and motion. Early detection and appropriate treatment of these fractures are critical to minimize patient disability and reduce the risk of documented complications, particularly posttraumatic arthritis [1]. As the anatomy of the tibial plateau is complex, careful surgical planning with standard radiographs and computed tomography (CT) is essential. In general, surgeons make treatment decisions by assessing the imaging and grading the fracture pattern according to one of several classification systems [2].

A good classification system should be simple but accurate, show good reproducibility, and be based on clinically relevant data. Several systems have been described to classify tibial plateau fractures (TPF) [3–9]. The most widely used are the Schatzker [3] and AO [10] systems, but concern has been raised over the accuracy and reproducibility of these two classifications [8, 9, 11]. Several new classification systems have since been developed, but few of them have been compared to date [12–14].

The purpose of this study was to compare the five most commonly used classification systems for TPF by evaluating their intra- and inter-observer reproducibility with standard radiography and CT scan. The hypothesis was that all five systems would have a low degree of inter-observer agreement with acceptable intra-observer variability.

Material and methods

We retrospectively studied all patients who underwent TPF surgery at two centres between 2013 and 2015. We recorded demographic data (age, gender and body mass index (BMI)) as well as the injury mechanism and knee side.

We only included cases with available anteroposterior and standard lateral radiographs and two-dimensional (2D) and three-dimensional (3D) CT scan reconstructions. We excluded patients whose radiographs were not performed in the index hospitals and patients with isolated spine avulsion fractures.

Four observers (one senior orthopaedic surgeon, one junior orthopaedic surgeon, one orthopaedic surgeon fellow and one senior resident) evaluated the radiological findings. The observers analyzed and classified each fracture using five different systems: AO [15], Schatzker [3], Luo [2], Khan [9] and the revised Duparc [16] systems. A diagrammatic scheme with a written description of these five classifications was also provided. All the observers evaluated the anteroposterior and lateral views of the standard radiography, and the two most representative views of the axial, sagittal and coronal CT images and their 3D reconstruction. An independent observer had previously selected each of the CT views. The observers evaluated the images on two occasions, eight weeks apart, to determine intra-observer reliability. Their second evaluation was used to assess the inter-observer agreement. None of the observers had been involved in the treatment of these patients’ fractures.

Informed consent was obtained from all individual participants included in the study.

Classification systems

In the AO classification system, the TPF corresponds to number 41 [15]. This system classifies the fractures into A (extra-articular), B (partial articular), and C (complete articular), with subtypes in each group (Fig. 1).

Fig. 1 AO classification system of TPF

The Schatzker classification is based on the severity of the fracture [3], classifying TPF into six types: I - wedge-shaped pure cleavage fracture of the lateral tibial plateau, II - split and depression of the lateral tibial plateau, III - pure depression of the lateral tibial plateau, IV - pure depression of the medial tibial plateau, V - involving both tibial plateau regions, and VI - fracture through the metadiaphysis of the tibia (Fig. 2). Luo’s three-column classification system is based on CT and 3D reconstructions: a view containing most of the fibular head and condylar spine is selected on the axial CT images of the tibial plateau, and the plateau is divided into three columns (medial, lateral and posterior) (Fig. 3) [2]. The classification described by Khan et al. [9] groups the fractures into lateral, medial, posterior, anterior, rim, bicondylar and subcondylar. In each group, the fracture is subclassified with numbers depending on its characteristics, creating an alphanumeric system (Fig. 4).

Fig. 2 Schatzker classification of TPF

Fig. 3 Drawing of Luo’s three-column classification system of TPF

Fig. 4 Khan classification of TPF

The revised Duparc classification, proposed by Gicquel et al. [16], is based on groups of unicondylar, bicondylar, spinocondylar and isolated posteromedial fractures with acronyms for each fracture (Fig. 5).

Fig. 5 Revised Duparc classification of TPF

Statistical analysis

Statistical analysis was performed using SPSS 19 (SPSS, Chicago, IL). Categorical variables are expressed as percentages and frequencies. Means and standard deviations as well as median, minimums, and maximums were calculated for each continuous variable.

The kappa coefficient (K) [17] was calculated to analyze the reliability of the classifications made by the same observer on separate occasions (intra-observer reliability) or by the four different observers (inter-observer reliability).

The K statistic reflects how many responses the observers agreed on and how many agreements occurred by chance [18]. A 100% agreement had a value of 1.00 (maximum), while agreement attributed to chance had a value of 0 (minimum). The values were interpreted according to Landis and Koch [19]: <0.21 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 excellent.
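As an illustration only, the kappa calculation and the interpretation bands quoted above can be sketched in Python. This is not the SPSS procedure used in the study. The sketch assumes the two-rater (Cohen) form of kappa, and the averaging of pairwise values across observers is an assumption, since the text does not state how the multi-observer value was obtained. The function names and the Schatzker labels in the example are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: proportion of cases given the same label twice.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the marginal proportions, summed over labels.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    p_chance = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both lists contain one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


def mean_pairwise_kappa(ratings_by_observer):
    """Average Cohen's kappa over all observer pairs (assumed approach)."""
    pairs = list(combinations(ratings_by_observer, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


def landis_koch(kappa):
    """Map a kappa value to the bands quoted in the text."""
    if kappa < 0.21:
        return "slight"
    if kappa < 0.41:
        return "fair"
    if kappa < 0.61:
        return "moderate"
    if kappa < 0.81:
        return "substantial"
    return "excellent"


# Hypothetical Schatzker types given by one observer to ten cases,
# on two occasions (intra-observer comparison).
first_reading = ["I", "II", "II", "III", "IV", "V", "VI", "II", "V", "VI"]
second_reading = ["I", "II", "III", "III", "IV", "V", "VI", "II", "VI", "VI"]

k = cohen_kappa(first_reading, second_reading)
print(f"kappa = {k:.3f} ({landis_koch(k)})")  # kappa = 0.756 (substantial)
```

In this made-up example the two readings agree in 8 of 10 cases (80%), but because 18% agreement would be expected by chance from the label frequencies, kappa is lower than the raw agreement.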

Results

A total of 112 patients were included. There were 71 men and 41 women. Mean age was 47.1 years (range 21–86) and mean body mass index was 25.2 ± 3.6 kg/m². The fracture was on the left knee in 64 patients (57%) and on the right knee in 48 patients (43%).

While most of the TPF (56%) were sustained during high-energy trauma (traffic accidents, falls from height), in 25% of cases the injury mechanism was a low-energy accident and in 19% it was a sports injury.

The inter-observer agreement was fair in the revised Duparc and Khan classifications. Inter-observer reliability was significantly better in the AO, Schatzker and Luo classifications, with a substantial agreement (Table 1, column A). The intra-observer agreement was moderate in the revised Duparc and Khan classifications and excellent in the AO, Schatzker and Luo classifications (Table 1, column B).

Table 1 Reproducibility of the five classification systems

Discussion

One of the most remarkable findings of this investigation was that, when standard radiographs were combined with 2D and 3D CT images, the AO, Schatzker and Luo classifications showed an excellent intra-observer agreement with a substantial inter-observer agreement. This contradicted our hypothesis. The small discrepancies between intra- and inter-observer agreement might indicate that the AO, Schatzker and particularly Luo classifications can be reproducible with adequate training. Zhu et al. [2] also observed a moderate–good inter-observer agreement for these three classification systems. However, they evaluated a considerably lower number of cases (n = 50) and compared only three classifications. Furthermore, their observers underwent previous training, and although this could theoretically have improved their inter-observer agreement values, their results were no better than those of the present study (Table 2). In another study, Mellema et al. [20] observed only a fair inter-observer agreement in the Schatzker and Luo classifications. They noted that even the addition of 3D CT reconstruction did not improve the overall inter-observer reliability in these cases. Although 81 observers were involved in their study, they evaluated these two classifications in only 15 complex TPF cases (Table 2), randomized to either 2D or 2D and 3D CT evaluations using web-based platforms. Gicquel et al. [16] observed a moderate inter-observer agreement in the 50 cases they evaluated using the Schatzker, AO and Duparc classifications. The more favourable inter-observer agreement observed in our study might be attributed to the fact that we used a combination of standard radiography, 2D CT and 3D CT reconstructions. Furthermore, we provided observers with pre-selected fixed images instead of the whole CT study. The number of evaluators was similar in most studies, with the exception of one study [20].

Surprisingly few studies have evaluated intra-observer agreement. In one of these, Gicquel et al. [16] observed a substantial intra-observer agreement in the classifications of Schatzker, AO and Duparc. They also compared their findings with those from other studies that had calculated intra-observer agreement, and found that it was better when observers were given 3D CT images [12, 15]. The intra-observer results of the current investigation were also clearly higher (Table 2).

Table 2 Published studies

The reproducibility of fracture classifications can be affected by several factors, such as the experience of the observers, the simplicity of the system, additional tools or information provided to the observers, binary decision-making, and rank-order analysis [21]. While intra-observer agreement is high, the moderate inter-observer agreement could be improved with training and by providing observers with more tools and details to increase accuracy and reproducibility. The use of a combination of standard radiographs and 2D and 3D CT images might be a good alternative to increase the reliability and reproducibility of these frequently used simple classifications. The most frequently used TPF classification was described by Schatzker et al. [3] with standard radiographs only, and showed a low intra- and inter-observer agreement [22]. Standard radiographs are often inaccurate and underestimate the extent of displacement and depression of these fractures [23]. Thus, several authors have evaluated the accuracy of classification with CT or magnetic resonance imaging and have reported improved results [11, 24–26]. Brunner et al. [27], for example, found that CT scanning could improve the inter-observer and intra-observer reliability in both the AO and Schatzker classifications. In addition, computed tomography is currently considered an essential tool to diagnose and plan tibial plateau surgery [2, 12, 13]. Recent studies of surgical techniques have shown that, even in simple cases, 3.5-mm locking plates are biomechanically superior to cannulated screws [28–30].

Tibial plateau fractures show great variability in their patterns. In recent years, much focus has been given to the so-called posterior column, which has been shown to have a major influence on surgical planning, fracture reduction accuracy and functional outcomes [31–33]. Luo’s classification addresses this column concept, and although it might excessively simplify the different fracture patterns, it could be used as a complementary tool to the more widely used AO or Schatzker classifications.

The main weakness of the present study was that preselected CT images were provided to the evaluators instead of the whole CT study. This was done to facilitate their assessments but it may have increased intra- and inter-observer agreement as the sample might have underestimated the heterogeneity of the fracture patterns.

Conclusions

Although previous training may be needed, the AO, Schatzker and Luo classifications showed good reproducibility when tibial plateau fractures were assessed with a combination of standard radiographs and biplanar and 3D CT images. The revised Duparc and Khan classifications showed lower reproducibility, and therefore their use is not recommended.