Introduction

The morphology of intervertebral disc degeneration has often been described in the literature [1, 9–11, 14, 18, 28, 33, 34]. For research purposes in particular, however, these changes need to be quantified. Many different grading systems have therefore been developed, mainly for the lumbar but also for the cervical spine [21]. Some of these systems, such as macroscopic or histologic grading systems, can only be used in vitro. Others, such as those based on plain radiographs, discography, computed tomography or magnetic resonance imaging, are also applicable in clinical practice. Of these, magnetic resonance imaging has become increasingly popular since the intervertebral disc itself can be visualised and the procedure is not invasive. Nevertheless, grading systems based on plain radiographs still have several advantages. First, in contrast to discograms, they are less invasive. Second, in contrast to magnetic resonance imaging and computed tomography, they only require a standard X-ray machine and are much cheaper. Third, plain radiographs are often taken for diagnostic or follow-up purposes and are thus often already available.

A grading system has to fulfil certain requirements to become a valuable tool. First, the ratings should be the same irrespective of the experience of the observer. Second, they should be valid, i.e. they should reflect the “real” degree of degeneration. However, of the nine radiographic grading systems for lumbar or cervical disc degeneration found in the literature [6, 17, 19, 20, 23–25, 31], only three have been tested for interobserver agreement [19, 23, 24] and none for validity. Furthermore, most of them are based on terms such as “mild”, “severe”, “small” or “large” [6, 17, 19, 20, 23]. Since such terms are not well defined and tend to be subjective, the interobserver agreement of a grading system is expected to worsen the more such terms it uses.

Therefore, the first aim of this study was to combine the existing radiographic grading systems into a new one, in which all subjective terms were replaced by more objective ones. The second aim was to test this new grading system for validity and for agreement between experienced and inexperienced observers. Due to the uncinate processes and the smaller dimensions of the cervical spine, lumbar and cervical discs need to be graded in different ways. To prevent confusion, this study was therefore divided into the present Part I for the lumbar spine and a Part II for the cervical spine.

Materials and methods

The new grading system covers the three main radiographic signs of disc degeneration: “Height Loss”, “Osteophyte Formation” and “Diffuse Sclerosis” (Table 1). On lateral and postero-anterior radiographs each of these three variables first has to be graded individually on a scale from 0 to 3. Based on the sum of these three scores, the “Overall Degree of Degeneration” is assigned to each disc on a four-point scale from 0 (no degeneration) to 3 (severe degeneration).

Table 1 New radiographic grading system for lumbar intervertebral disc degeneration modified according to the systems found in literature

“Height Loss” is defined as the average anterior and posterior (but not central) decrease in disc height relative to the respective height before degeneration. The anterior height before degeneration is estimated based on the normal values reported by Frobin et al. [15] (Fig. 1; Table 2). To account for interindividual differences, the ranges of normal disc height should be considered rather than their mean value. The posterior height before degeneration is estimated as being equal to or smaller than, but not larger than, the respective anterior height [12]. The central disc height was not included in the assessment of “Height Loss” since at this position the height increases in some cases of osteoporosis (fish-vertebra deformity).

Fig. 1

To assess the degree of height loss, first, the actual disc height has to be determined. For this purpose, the anterior and posterior edges of the adjacent vertebral bodies (small white circles) are defined as those points having the largest distance from the centre of the vertebral body (black points). Then, the distance of each of these four edges to the midplane of the disc (dashed line) is measured. Finally, the sum of the two anterior distances is defined as the actual anterior disc height, and the sum of the two posterior distances is defined as the actual posterior disc height. This procedure is meant to support the estimation of actual disc height, but does not have to be carried out using drawings or digitisation. In a second step, this actual height is compared to the respective height before degeneration, which is estimated based on the normal values reported by Frobin et al. [15] (Table 2).
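The measurement described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the study: the edge points and the midplane are assumed to be available as 2D coordinates digitised from the lateral radiograph, the heights before degeneration are assumed to be supplied by the observer (for example estimated from the normal ranges in Table 2), and all function and parameter names are hypothetical.

```python
import math


def distance_to_line(point, line_point, line_dir):
    """Perpendicular distance from `point` to the line through `line_point`
    with direction `line_dir` (all given as (x, y) tuples)."""
    px, py = point
    qx, qy = line_point
    dx, dy = line_dir
    norm = math.hypot(dx, dy)
    if norm == 0:
        raise ValueError("line_dir must not be the zero vector")
    # Magnitude of the 2D cross product divided by the direction length.
    return abs((px - qx) * dy - (py - qy) * dx) / norm


def actual_disc_heights(upper_anterior, upper_posterior,
                        lower_anterior, lower_posterior,
                        midplane_point, midplane_dir):
    """Actual anterior and posterior disc height: the sum of the distances
    of the two anterior (or posterior) edges to the disc midplane."""
    def d(p):
        return distance_to_line(p, midplane_point, midplane_dir)
    anterior = d(upper_anterior) + d(lower_anterior)
    posterior = d(upper_posterior) + d(lower_posterior)
    return anterior, posterior


def height_loss_percent(actual_anterior, actual_posterior,
                        initial_anterior, initial_posterior):
    """Average anterior and posterior decrease in height, in percent of the
    estimated height before degeneration (estimates are observer input)."""
    loss_anterior = (initial_anterior - actual_anterior) / initial_anterior
    loss_posterior = (initial_posterior - actual_posterior) / initial_posterior
    return 100.0 * (loss_anterior + loss_posterior) / 2.0


# Hypothetical coordinates in mm; the midplane is taken as the x-axis.
anterior, posterior = actual_disc_heights(
    upper_anterior=(0.0, 5.0), upper_posterior=(30.0, 4.0),
    lower_anterior=(0.0, -5.0), lower_posterior=(30.0, -3.0),
    midplane_point=(0.0, 0.0), midplane_dir=(1.0, 0.0))
print(anterior, posterior)
print(height_loss_percent(anterior, posterior, 12.5, 10.0))
```

How the midplane itself is constructed is not specified in the text, so it is passed in explicitly here rather than derived from the edge points.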

Table 2 Normal values of anterior disc height normalised to the antero-posterior diameter of the cranial vertebral body (=100%) (mean of male and female subjects according to Frobin et al. [15])

The variable “Osteophyte Formation” is assessed in terms of the number and length of osteophytes growing at the two anterior, two posterior, two left lateral and two right lateral edges of the adjacent vertebral bodies (Fig. 2).

Fig. 2

To assess the variable “Osteophyte Formation”, the two anterior (e1, e2), two posterior (e3, e4), two right lateral (e5, e6) and two left lateral edges (e7, e8) of the adjacent vertebral bodies are screened for osteophytes. Their number is counted and their length is measured along their long axis, beginning at the former border of the vertebral body and ending at their tips (white lines at the edges e1, e2, e5, e6, e7 and e8).

The variable “Diffuse Sclerosis” is graded in terms of the number of predefined regions that are affected by sclerosis (Fig. 3). A thickening of the bony endplates should not be counted if it is not diffuse.

Fig. 3

The variable “Diffuse Sclerosis” is assessed on the lateral radiographs only. The lower half of the upper vertebral body and the upper half of the lower vertebral body are each divided into four regions. Then, the number of regions covered by sclerosis is counted. Note that a partially covered region is counted as if it were completely covered. In this example, the number of affected regions (asterisk) would be three for the upper and three for the lower vertebral body.

To validate the new radiographic grading system, first, the radiographic degrees of degeneration of 44 intervertebral discs from 16 fresh frozen mono- or polysegmental human osteoligamentous lumbar spine specimens were determined. The age of the donors ranged between 16 and 91 years (mean 66 years) and none of them had a known history of trauma or spinal disease. These radiographic degrees of degeneration were then compared to the respective macroscopic ones, which were defined as the “real” degrees of degeneration. For this purpose, the specimens were first x-rayed in the lateral and postero-anterior direction (43805 X-Ray System, Faxitron Series, Hewlett Packard, USA; film to source distance 61 cm) using a tube voltage of 45–50 kV and an exposure time of 5 min. Then, while still frozen, they were cut in the mid-sagittal plane. The cut surfaces were photographed and stored for evaluation. To allow a direct comparison of the radiographic with the macroscopic degrees of degeneration, the macroscopic grading system also covered the three variables “Height Loss”, “Osteophyte Formation” and “Diffuse Sclerosis” (Tables 1 and 3). Macroscopically, however, the three variables “Nucleus Pulposus”, “Annulus Fibrosus” and “Endplate Cartilage” were added to reflect the “real” degree of degeneration as closely as possible (modified according to Thompson et al. [32]).

Table 3 Macroscopic grading system for lumbar intervertebral disc degeneration used as the “gold standard” to define the “real” degree of degeneration (modified according to the systems found in literature)

Using this modified macroscopic grading system, the “real” degree of degeneration of the 44 lumbar discs was determined by two observers independently. Both of them were familiar with disc degeneration and had several years of experience in spinal research. The “real” degree of degeneration was then defined as the mean value of the results of both observers. Then, the 44 discs were additionally graded radiographically by one of these two observers. To ensure that this observer was not biased by the evaluation of the macroscopic slices carried out a few days before, the radiographs were blinded and put in a randomised order. The postero-anterior radiographs of four discs could not be evaluated due to poor quality. These four discs could therefore only be included for the variables “Height Loss” and “Diffuse Sclerosis”. The remaining 40 discs, however, could be evaluated completely. To statistically assess the agreement between the radiographic and the “real”, macroscopic degree of degeneration, weighted Kappa coefficients (quadratic weights) with 95% confidence limits (95% CL) were calculated according to Fleiss and Cohen [13] using the software SAS 8.2 [30]. These calculations were carried out under the assumption that the observations of the individual intervertebral discs were independent.

To show whether the grade assigned to a disc depends on the degree of experience of the observer, the agreement between one experienced and one inexperienced observer was determined. Both observers graded the lateral and postero-anterior radiographs of 27 osteoligamentous mono- or polysegmental spine specimens with a total of 84 lumbar intervertebral discs. The age of the donors ranged between 16 and 92 years (mean 67 years) and none of them had a known history of trauma or spinal disease. The experienced observer was the one who also evaluated the macroscopic slices and the radiographs for validation. In contrast, the inexperienced observer, a mechanical engineer without any medical training, had no experience in reading radiographs and was not familiar with disc degeneration. However, he was trained before grading the discs: the grading system was explained using some training radiographs. Furthermore, the radiographic appearance of the most common spinal diseases such as osteoporosis, osteoporotic fractures, fish-vertebra deformities, spondylolysis, Bechterew’s disease or spinal metastases was demonstrated in a 30 min session. Then, written instructions were handed over, in which the assessment of the three variables was explained again and the normal values of anterior lumbar disc height were listed, similar to Figs. 1, 2 and 3 and to Table 2. Apart from these instructions, the inexperienced observer did not get any further help during grading.

Statistically, the agreement between the ratings of the experienced and the inexperienced observer was evaluated using the same type of weighted Kappa coefficient as for validation [13]. For both the validation and the assessment of the interobserver agreement, a Kappa of <0.00 was interpreted as poor agreement, 0.00–0.20 as slight agreement, 0.21–0.40 as fair agreement, 0.41–0.60 as moderate agreement, 0.61–0.80 as substantial agreement and 0.81–1.00 as almost perfect agreement [22].
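For readers who want to reproduce this type of statistic, the following Python sketch computes a weighted Kappa with quadratic weights and maps it to the verbal categories listed above. It is not the SAS procedure used in the study and does not compute the 95% confidence limits. It assumes that grades are treated as equally spaced numeric scores, so that half-point mean grades are handled by their numeric value, and the function names are hypothetical.

```python
from collections import Counter


def quadratic_weighted_kappa(ratings_a, ratings_b):
    """Weighted Kappa with quadratic disagreement weights for two paired
    lists of numeric grades (one pair per disc)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed mean squared disagreement between the paired ratings.
    observed = sum((a - b) ** 2 for a, b in zip(ratings_a, ratings_b)) / n
    # Mean squared disagreement expected by chance from the marginal
    # grade frequencies of the two raters.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(ca * cb * (a - b) ** 2
                   for a, ca in count_a.items()
                   for b, cb in count_b.items()) / (n * n)
    if expected == 0:
        raise ValueError("Kappa is undefined: both raters used one grade only")
    return 1.0 - observed / expected


def interpret_kappa(kappa):
    """Verbal category for a Kappa value, following the bands in the text."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical toy ratings, not data from the study.
experienced = [0, 1, 2, 3]
inexperienced = [0, 1, 2, 2]
kappa = quadratic_weighted_kappa(experienced, inexperienced)
print(kappa, interpret_kappa(kappa))
```

Because the bands in the text are given to two decimals, values between 0.80 and 0.81 are assigned to the upper category in this sketch.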

Results

The agreement between the macroscopic ratings of the two experienced observers was almost perfect (Kappa between 0.874 and 0.920) for the overall degree of degeneration and the variables “Height Loss”, “Nucleus Pulposus”, “Annulus Fibrosus” and “Endplate Cartilage” (Table 4). For the variables “Osteophyte Formation” and “Diffuse Sclerosis” the agreement was somewhat lower, but still substantial (Kappa 0.675 and 0.707, respectively). These good agreements would almost have allowed the rating of only one observer to be defined as the “real” degree of degeneration. To further increase objectivity, however, the average ratings of both were used instead.

Table 4 Agreement between the macroscopic ratings of the two experienced observers (weighted Kappa coefficients with 95% CL)

The validation of the radiographic grading system revealed an almost perfect agreement with the macroscopic, “real” degree of degeneration for the variable “Height Loss” (Kappa 0.862) and a lower but still substantial agreement for “Osteophyte Formation” (Kappa 0.613) (Table 5). For the overall degree of degeneration the agreement was also substantial. The radiographic grades, however, tended to be lower than the macroscopic ones: in 20 out of 40 discs the “real” overall degree of degeneration was underestimated, but in only three it was overestimated (Fig. 4). For the variable “Diffuse Sclerosis”, Kappa was 0.343, reflecting only fair agreement. In this case, far fewer sclerotic areas were detected radiographically than macroscopically.

Table 5 Agreement between the radiographic and the macroscopic “real” degrees of degeneration of 40 and 44 lumbar intervertebral discs, respectively (weighted Kappa coefficients with 95% CL)
Fig. 4

Agreement between the radiographic and the macroscopic “real” degree of degeneration of 40 and 44 lumbar intervertebral discs, respectively. Each field contains the number of discs rated with 0, 1, 2 or 3 points radiographically (rating of one experienced observer) and with 0, 0.5, 1, 1.5, 2, 2.5 or 3 points macroscopically (mean value of the ratings of two experienced observers)

The agreement between the radiographic ratings of the experienced and the inexperienced observer was substantial (Kappa between 0.681 and 0.798) for all three variables as well as for the overall degree of degeneration (Table 6). However, the inexperienced observer generally tended to assign lower degrees of degeneration than the experienced one (Fig. 5). For example, for the overall degree of degeneration, 15 discs were rated 1° lower by the inexperienced observer but only one disc was rated 1° higher. Nevertheless, most ratings were identical: both observers assigned the same degree of “Height Loss” to 65% of all discs, the same degree of “Osteophyte Formation” to 79%, the same degree of “Diffuse Sclerosis” to 80% and the same overall degree of degeneration to 81%. The differences in grade assignment never exceeded 1° except for one disc for the variable “Osteophyte Formation” and two discs for the variable “Diffuse Sclerosis”, where the difference was 2°. Differences of more than 2° did not occur.
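The descriptive figures in this paragraph (share of identical ratings and number of discs rated one or two degrees lower or higher) can be tallied from paired ratings as in the following Python sketch. The function name and the toy ratings are hypothetical and serve only to illustrate the bookkeeping; they are not the study data.

```python
from collections import Counter


def agreement_summary(experienced, inexperienced):
    """Return the percentage of identical ratings and a tally of the grade
    differences (inexperienced minus experienced) for paired ratings."""
    if len(experienced) != len(inexperienced) or not experienced:
        raise ValueError("need two non-empty rating lists of equal length")
    differences = Counter(b - a for a, b in zip(experienced, inexperienced))
    percent_identical = 100 * differences[0] / len(experienced)
    return percent_identical, dict(differences)


# Hypothetical toy ratings for five discs.
percent_identical, differences = agreement_summary([0, 1, 2, 3, 2],
                                                   [0, 1, 1, 3, 2])
print(percent_identical)  # share of discs with identical grades, in percent
print(differences)        # negative keys: inexperienced observer rated lower
```

A tally that is skewed towards negative differences corresponds to the tendency described above, where the inexperienced observer assigns lower grades than the experienced one.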

Table 6 Agreement between the radiographic ratings of one experienced and one inexperienced observer (weighted Kappa coefficients with 95% CL)
Fig. 5

Agreement between the radiographic ratings of one experienced and one inexperienced observer. Each field contains the number of lumbar intervertebral discs rated with the respective scores

Discussion

In this study, the radiographic grading systems for lumbar intervertebral disc degeneration available from the literature were combined into a new one, in which undefined and subjective terms were replaced by better defined and more objective ones. In the resulting system, as in the grading system of Mimura et al. [25], the height loss of the disc was estimated as the percentage decrease in height relative to the height before degeneration. Osteophytes were assessed in terms of their number and absolute length, and the degree of sclerosis was determined according to the number of predefined areas that were affected.

Despite these attempts to create a more objective grading system than those known from literature, a certain degree of subjectivity still remained. In the assessment of the variable “Height Loss”, for example, the initial disc height still needs to be estimated. In vivo, this estimation becomes even more difficult due to the diurnal changes in disc height [2, 5, 29]. But even in vitro, its assessment is difficult due to the large spread of normal values [15]. Thus, a wide variety of different estimations are possible. Therefore, it is not surprising that the ratings of the two observers were not equal: for the inexperienced observer the height loss of the discs often seemed to be less severe than for the experienced one.

In contrast, the variable “Osteophyte Formation” could be defined much more objectively. Nevertheless, the inexperienced observer tended to see fewer osteophytes; for example, he tended to define pointed edges as normal, whereas the experienced observer tended to define them as osteophytes. Thus, even though the terms used in the new grading system are more objective than those used in the systems known from the literature, interobserver differences still have to be expected. The tendencies seen in this study, however, should not be generalised to all experienced and inexperienced observers, since the ratings of only one experienced and one inexperienced observer were compared with each other and since the quality of radiographs is not the same always and everywhere. For example, the alignment of a patient during X-raying may be more difficult than the alignment of a spine specimen.

However, despite these tendencies, the differences between the ratings of the two observers were small: for the three variables and the overall degree of degeneration they did not exceed 1° except in three cases, where the difference was 2°. Furthermore, the interobserver agreement was substantial with Kappa coefficients of 0.798 for “Height Loss”, 0.687 for “Osteophyte Formation”, 0.681 for “Diffuse Sclerosis” and 0.787 for the overall degree of degeneration. Even though these values reflect the agreement between two observers with different levels of experience, they were not much lower than those reported by Lane et al. [23] for three observers with similar experience. The Kappa coefficients of Lane et al. were 0.95 for “Narrowing”, 0.91 for “Osteophytes” and 0.93 for the “Summary Grades”. For “Sclerosis”, however, Lane et al. reported a Kappa of only 0.55. Thus, even though the three observers of Lane et al. were all experienced, their interobserver agreement for this variable was lower than the respective agreement of the new system. In contrast to Lane et al., but similar to the present study, Madan et al. [24] reported the agreement between five observers with different levels of experience. Their interobserver Kappa coefficients varied between 0.351 and 0.673 for the overall disc grade and were thus lower than the respective value of the new system. These results indicate that the use of undefined terms may work with experienced observers for the variables “Height Loss” and “Osteophyte Formation”, but works neither for the variable “Sclerosis” nor with inexperienced observers.

Similar to the work of Madan et al., the agreement between observers with different degrees of experience was also reported by Pfirrmann et al., who developed a grading system based on magnetic resonance images [27]. The Kappa coefficients reported by this group ranged between 0.74 and 0.81 for the overall disc grade. This range covers the respective value for the new radiographic system (0.787), but is higher than the range reported by Madan et al. (0.351–0.673) [24]. These differences between Pfirrmann et al. and Madan et al. indicate that the assessment of signal intensity and homogeneity on magnetic resonance images may per se be more objective than the assessment of bony structures and densities on radiographs. This would probably also be the case for Modic’s classification of vertebral body marrow changes [26]. According to Modic et al., these changes are associated with degenerative disc disease and, thus, are often used to quantify disc degeneration.

Due to their small number, each of the three variables “Height Loss”, “Diffuse Sclerosis” and “Osteophyte Formation” strongly influences the overall degree of degeneration. Thus, discs of one and the same overall degree of degeneration may have completely different appearances. Depending on the purpose of the study, it might therefore be advantageous to report the three variables individually instead of the overall degree of degeneration only. Another possibility to reduce the weight of each variable would be to include further variables such as “Listhesis” or “Disc Calcification” in the grading system. These variables are assumed to be associated with disc degeneration and can be seen on X-rays [7, 8, 35]. An objective grading, however, is difficult. The degree of listhesis seen on a radiograph, for example, strongly depends on the loading of the spine during X-raying. For instance, the degree of listhesis of one and the same patient may be completely different in a lying position when compared to a standing or sitting position. And whether calcifications can be seen on radiographs or not strongly depends on the quality of the radiograph and the voltage used. Therefore, these two variables were not included in the new grading system.

To validate the radiographic grading system, the macroscopic degree of degeneration was defined as being “real”. This definition was used since macroscopic slices directly reflect the changes within the disc, whereas radiographs only depict the surrounding bony structures. Histologically, disc degeneration shows regional variations within one and the same disc [4]. The macroscopic grading system used here, however, does not account for these differences. This also applies to the radiographic grading system since the disc itself cannot be depicted. Thus, in both grading systems, the macroscopic and the radiographic one, the disc is assessed as “average”.

Compared to the macroscopic “real” degrees of degeneration, the radiographic degrees of degeneration tended to be lower. This underestimation has two reasons. First, radiographically, the loss of intervertebral height, the formation of osteophytes and endplate sclerosis are indirect signs of degeneration, while changes within the disc itself cannot be seen directly. Thus, early degenerative changes, such as a discolouring of the nucleus, cannot be detected on radiographs. Similarly, Frobin et al. showed that signal loss within the intervertebral disc is possible without radiographic loss of height [16]. In such cases, the disc may have grade 0 radiographically but grade 1 macroscopically. Thus, in the detection of early degenerative changes within the disc, magnetic resonance imaging may have certain advantages compared to plain radiography. However, according to Benneker et al., a magnetic resonance imaging score does not necessarily correlate better with morphology than a radiographic score [3]. Second, the variable “Diffuse Sclerosis” is easily underestimated on radiographs. This underestimation may become even more pronounced if radiographs of patients instead of osteoligamentous specimens have to be rated, since on the radiographs of patients much more tissue surrounds the spine and influences the x-ray transparency around the vertebral bodies. The only variable for which radiography revealed a higher degree of degeneration than macroscopy was “Osteophyte Formation”, since macroscopically the assessment of osteophytes was restricted to the mid-sagittal plane.

Despite these discrepancies between the radiographic and the macroscopic “real” degrees of degeneration, however, the agreement for the overall degree of degeneration still was substantial (Fig. 6). Thus, the overall validity of the new radiographic grading system is deemed to be good.

Fig. 6

Examples of the four degrees of degeneration

Conclusions

In conclusion, we believe that the new radiographic grading system is an almost objective, valid and reliable tool if the degree of degeneration of individual lumbar intervertebral discs has to be quantified. However, the user should always remember that the radiographic degree of degeneration tends to be lower than the “real” macroscopic one and that slight differences between the ratings of observers with different degrees of experience have to be expected.

This study was focused on the agreement between one experienced and one inexperienced observer to evaluate the objectivity of the new system. Other parameters such as the intraobserver agreement, the agreement between observers with similar degrees of experience, the agreement between whole institutions or the effect of the quality of the radiographs on the ratings need to be investigated in future studies.