Introduction

Cervical spondylotic myelopathy (CSM) is a degenerative disease of the cervical spine and the most common cause of spinal cord dysfunction worldwide [1–3]. Static factors, including the protrusion of osteophytic spurs, disk bulging, ossification of the posterior longitudinal ligament (OPLL) and hypertrophy of the ligamentum flavum, together with dynamic factors, such as neck extension and flexion, result in stenosis of the spinal canal and potential compression of the neural elements [2]. Mechanical compression of the cord may lead to demyelination or focal necrosis in the posterior and lateral columns as well as loss of nerve cells in the grey matter [4]. Surgery is a common treatment option for CSM and is used to arrest its progression, prevent further neurological decline and even improve clinical status [5–8]. Many different surgical techniques have been explored, including anterior procedures such as cervical discectomy and fusion and corpectomy, and posterior approaches such as laminectomy, laminectomy and fusion, and laminoplasty [9, 10].

The proper diagnosis of CSM and the assessment of patient impairment and disability are essential for the implementation of appropriate treatment programs and surgical strategies. Because assessment is often non-standardized and subjective, a patient sometimes undergoes surgery either unnecessarily or too late for optimal recovery [11–14]. Clinicians are increasingly using quantitative or semi-quantitative guidelines, including grading of neurological impairment with the Ranawat scoring system and rating of functional disability with the Japanese Orthopaedic Association (JOA) scale, the Nurick grade or the 30-m walking test [15–17]. As a result of these tools, treatment decisions are made more scientifically rather than based solely on clinical judgement. Many of these objective assessments also allow for the quantification of the severity of myelopathy. This quantification at baseline enables surgeons to predict surgical outcome, determine the necessary surgical approach for optimal results and objectively evaluate how well a patient benefited from treatment [18, 19].

The first objective of this study is to summarize important features of a good scale. The second is to conduct a systematic review of the literature to answer the following key questions (KQ): (1) what scales have the ability to detect the presence of cervical myelopathy? and (2) what are the psychometric properties of scales commonly used to assess myelopathy severity?

Qualities of an ideal scale

There are ten components that constitute an ideal scale:

  1. Quantifiable: Interval or ratio scales are more quantifiable.

  2. Suitable distribution: the range of patients’ values should be spread evenly or normally throughout the range of a scale.

  3. Validity: how well a scale measures what it is intended to. Scales are ideally validated by correlating them with the “gold standard,” something that has yet to be defined for CSM [20, 21]. A scale assessing CSM must also be validated using a CSM population [22].

  4. Sensitivity: it is desirable for a test to record a broad range of absolute values across the population, indicating high sensitivity in detecting differences in severity between patients. A measure of sensitivity is the coefficient of variation.

  5. Responsive: ability to detect clinically significant changes and distinguish between disease severities. Measures of responsiveness include effect sizes, area under receiver operating curves and ceiling or floor effects [23].

  6. Easy to perform: preferably requiring no special training and having a short time frame.

  7. High inter-rater and intra-rater reliability: producing the same results for repeated uses under the same conditions. Inter-observer reliability is the agreement between two or more raters, whereas intra-observer reliability is the agreement between two ratings made by a single observer on the same patient.

  8. Internal consistency: a measure of the homogeneity of the scale. The Cronbach’s alpha is a normalized measure of correlation between multiple components of a single scale, where a score of 1 signifies perfect correlation [24, 25].

  9. One-dimensional: adds all of the components of myelopathy with no hierarchy.

  10. Relevant: the measure quantifies a patient’s complaints and changes as the disease progresses.
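The coefficient of variation named under the sensitivity criterion can be illustrated with a short calculation. The following is a minimal sketch, assuming the usual definition (sample standard deviation divided by the mean); the function name and the walking times are hypothetical and are not taken from any study in this review.

```python
from statistics import mean, stdev


def coefficient_of_variation(scores):
    """Return the sample standard deviation divided by the mean."""
    m = mean(scores)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return stdev(scores) / m


# Hypothetical walking times in seconds (illustrative values only).
example_times = [20.0, 25.0, 30.0, 35.0, 40.0]
cv = coefficient_of_variation(example_times)
print(f"coefficient of variation = {cv:.3f}")
```

A larger value indicates that measurements are spread more widely relative to their mean, which is the property the sensitivity criterion asks for.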

In addition, it is essential that the scale addresses the pathophysiology and the key signs and symptoms of CSM as well as its natural history. As shown in Fig. 1, the areas of increased damage in CSM are the lateral corticospinal tracts, which transmit signals for voluntary movements to the muscles, and the posterior and anterior spinocerebellar tracts, which convey sensory information on posture and movement to the cerebellum [26, 27]. Damage to these pathways explains the common signs and symptoms of CSM, including a broad-based spastic gait and numb, clumsy hands [28, 29]. Even though the exact anatomy and location of damage differ across the myelopathic population, this pattern of severity can provide a rationale for selecting components of the disease to monitor. These concepts are important to remember when designing a scale intended to assess clinical and functional deficits in CSM. A measure that includes evaluation of voluntary movement, balance and coordinated activity will likely be more sensitive and relevant than one targeted towards bladder function, conscious joint position sense and temperature and pain sensation. An ideal scale should also reflect the natural history of the disease, a requirement much more difficult to meet. The onset of CSM is generally insidious, and the disease progresses in a stepwise fashion with periods of stability and periods of decline [30–33]. At the early stages of the disease, patients typically present with a mild functional gait deficit rather than an abnormal neurological examination. A scale directed towards lower limb function and sensitive enough to detect mild leg symptoms would address the natural history of this disease [34]. Along with spastic and ataxic leg complaints, the classical triad of symptoms also includes a painful, stiff neck and brachialgia. These symptoms are rarely included in outcome measures directly but may indirectly be accounted for through various questions on a patient’s disability.

Fig. 1

Areas affected by cervical spondylotic myelopathy: essential to consider for the development of a valid, responsive and reliable outcome measure. As demonstrated by this diagram, the areas of increased damage are the lateral corticospinal tracts, and the posterior and anterior spinocerebellar tracts

Materials and methods

A systematic search was conducted using MEDLINE, MEDLINE in Process, EMBASE and Cochrane Central Register of Controlled Trials for literature published through November 2012. The keywords used for this search were “CSM or Ossification of the Posterior Longitudinal Ligament AND Outcome Measures AND Preoperative or Postoperative”. Two other previous searches focusing on predictors of surgical outcome and the epidemiology of the disease were also explored to identify further relevant articles. The search was limited to humans, aged 18 years or older and to English studies. Case reports, meeting abstracts/proceedings, white papers and editorials were excluded.

Studies were included for the purpose of KQ1 if they compared scale measurements between a control and a myelopathic patient population. For KQ2, studies were included if they discussed any psychometric property of a scale including its validity, reliability and responsiveness. Given that there is no published gold standard for the assessment of cervical myelopathy, analyses that evaluated correlation between scales were considered eligible validation studies. Studies focusing on patients with radiculopathy, rheumatoid arthritis, coexisting lumbar degeneration or neck problems were not necessarily excluded provided they discussed a scale that may be used to assess the functional status of patients with cervical myelopathy.

All abstracts and titles were reviewed independently by two of the authors and were sorted based on pre-determined inclusion and exclusion criteria. Figure 2 displays the search and review process in detail. From the included articles, the following data were extracted: author, objective, outcome measures studied, psychometric properties and correlations, and conclusions. We determined which question each study answered and which psychometric properties were assessed.

Fig. 2

Search strategy and detailed review process. SF-36: Short Form-36; JOA/mJOA: (modified) Japanese Orthopaedic Association

Results

The total number of citations found from the three searches was 5,745. Of these, 37 met inclusion criteria: 10 answered KQ1 and 37 addressed KQ2. For KQ2, 23 studies explored validity by assessing correlation between scales, 10 examined reliability, 8 analyzed responsiveness mostly by quantifying a standardized response mean or by observing floor and ceiling effects and 6 looked at internal correlations and consistency. Scales analyzed in this review were either functional or impairment measures or were patient-reported outcomes or health assessment tools. The most frequently reported scale was SF-36 (n = 16) followed by Nurick grade (n = 14), JOA (n = 13), modified JOA (mJOA, n = 7) and the grip and release test (GRT, n = 6). Four studies each presented results on the Cooper, Harsh and 30-m walking test.

Evidence that scales can detect myelopathy

Ten studies explored the ability of various scales to detect myelopathy by comparing values in a control and patient population (Table 1). Hosono et al. and Mihara et al. [35, 36] reported a significantly lower number of grip and release cycles in patients with cervical myelopathy compared to a control group (22.9 ± 8.7 vs. 32.5 ± 9.0; 17.3 ± 5.7 vs. 23.6 ± 4.5 cycles). Five studies explored different components of gait by performing a 30-m walking test, a 10-s step test, a foot-tapping test, a triangle step test or by simply observing a patient’s stance and gait [34, 36–40]. Maezawa et al. [39] divided a 24-patient sample into three groups based on preoperative JOA score: group I consisted of patients with mild spasticity or normal function with a JOA ≥10; group II had a JOA between 7 and 9; and group III had a JOA ≤6 points. Compared to a control population, patients in group II or III had slower walking speed, prolonged stance phase duration, reduced stance-phase duration and a shorter stride length. In addition, all three groups had reduced knee flexion, and groups II and III had decreased plantar flexion of the ankle as compared to controls. Similarly, Kuhtz-Buschbeck et al. [40] demonstrated that myelopathic patients have a slower walking velocity and a longer stance phase, as well as a wider step width and longer double support time. Singh and Crockard [34] observed similar differences in gait: mean preoperative walking time and number of steps were significantly worse for CSM patients than for controls (85.4 ± 11.2 vs. 24.3 ± 0.8 s; 74.8 ± 5.3 vs. 46.9 ± 1.2 steps).

Table 1 The ability of various measurements to detect the difference between myelopathic and control patients

The triangle step test and foot-tapping test can also detect the presence of myelopathy, as demonstrated by Mihara et al. [36] and Numasawa et al. [38]. Patients with myelopathy completed significantly fewer steps on the triangular board and fewer foot taps in 10 s than controls (18.4 ± 5.2 vs. 25.4 ± 3.7; 23.8 ± 7.2 vs. 31.7 ± 6.4).

Olindo et al. [41] assessed the ability of a nine-hole peg test to quantify hand disability in a CSM population. Mean and median times to perform this test were significantly greater in the patient group than in the controls (35 ± 29.9 and 22.2 s vs. 16.1 ± 3.1 and 16.2 s).

Finally, as identified by King et al. [42], patient-reported outcome measures may differ between a myelopathic and a control population. Patients with CSM exhibited worse scores on all 8 domains of the SF-36, the physical component score (PCS) and the mental component score (MCS) compared to age-adjusted normative data (PCS: 27.8 ± 8.3 vs. 37.3 ± 10.6; MCS: 40.5 ± 12.9 vs. 44.4 ± 17.3).

Psychometric properties of scales used to assess patients with cervical myelopathy

Validity

Given that there is no gold standard for the assessment of CSM, a scale’s validity is typically evaluated by correlating it with other measures. Furthermore, other forms of validity including construct, convergent or discriminant, predictive and biological can be used to determine whether a scale actually measures what it is intended to. Twenty-three studies either described previously reported correlations or one of these other forms of validation (Table 2).

Table 2 Correlations between scales and other forms of validity

Construct validity for the SF-36 was demonstrated by four studies. Brazier et al. [43] hypothesized that female patients, the elderly, those of social class IV or V and frequent health services users should have lower perceived outcome. These demographic groups did indeed display a lower SF-36 score [43]. Thakar et al. [44] proposed that mobility-related items of the SF-36 should correlate with the Nurick grade; this relationship was also true. Furthermore, King and Roberts [45], as expected, identified a significant association between Nurick grade, Cooper leg subscale and Harsh scale with physical functioning, role limitations physical, social functioning and the PCS of the SF-36. Similarly, the World Health Organization Quality of Life-Bref (WHOQOL-Bref) scale also exhibited construct validity as demonstrated by a significant relationship between preoperative physical domain scores and Nurick grade [44].

Certain subscales of the SF-36 also demonstrate discriminant and convergent validity. In a study by Baron et al. [46], role limitations physical, bodily pain, social functioning and role limitations displayed convergent and discriminant validity as their item-own correlations were greater than item-other correlations by 2 standard errors. Thakar et al. [44] confirmed convergent validity by reporting that associations between physical functioning, role limitations physical, bodily pain and PCS ranged from 0.67 to 0.74 and correlations between vitality, social functioning, role limitations emotional, mental health and MCS were from 0.67 to 0.72. Finally, predictive validity of SF-36 was proven by Latimer et al. [47] and Guilfoyle et al. [48] since the physical functioning subscale was predictive of postoperative myelopathy disability index (MDI).

The MCS and PCS of the SF-36 were highly correlated with both the preoperative and postoperative SF-12 MCS and PCS scores and the visual analog scale [49, 50]. The SF-36 physical functioning score was also highly associated with the pain and disability component of the North American Spine Society (NASS) instrument and the lower extremity domain of the Japanese Orthopaedic Association Cervical Myelopathy Evaluation Questionnaire (JOACMEQ), supporting the validity of both scales [51, 52].

Multiple studies reported significant correlations with the Nurick grade: lower limb mJOA, total mJOA, JOA, Cooper myelopathy scale (CMS) of the lower and upper extremity, European myelopathy score (EMS) and poor-grade, patient-reported outcomes [20, 53–55]. Additionally, as identified by King et al. [50], there was a significant association between Nurick grade and the visual analog scale, time to trade-off and willingness to pay. Similarly, JOA was shown to be highly related to CMS of the lower and upper extremity, EMS, time to trade-off and willingness to pay [20, 50, 55].

Casey et al. [56] identified significant correlations between MDI and spinal canal diameter, spinal cord diameter, spinal cord area, and the degree of vertical translocation of the odontoid peg through the foramen magnum, indicating biological validity. In addition, the MDI was highly associated with the EMS both preoperatively and postoperatively [20]. EMS was also significantly correlated with CMS of the upper and lower extremity [55].

To evaluate the construct validity of the neck disability index (NDI), Riddle and Stratford [57] hypothesized that patients whose work status was altered or who were undergoing litigation would have lower scores. This hypothesis was confirmed. Young et al. [58], on the other hand, speculated that “stable” patients would have NDI, patient-specific functional scale (PSFS) and numeric pain rating scale (NPRS) scores that would not change, whereas patients with a “larger” clinically meaningful improvement would have a significant change in disability. Based on this study, NDI had poor construct validity.

Reliability

Ten studies assessed the reliability of various scales used to evaluate CSM (Table 3). Yonenobu et al. [59] conducted a study on patients with OPLL to determine the inter-rater and intra-rater reliabilities of the JOA. The interobserver reliability was high (intraclass correlation, ICC = 0.813, 95 % CI = 0.704–0.888). Table 3 displays the proportion of agreement between raters and across interviews for motor, sensory and bladder function. Intra- and inter-observer agreement was highest, 82.9 and 82.3 %, respectively, for shoulder and elbow motor function. Although Bartels et al. [60] did not validate mJOA, their study demonstrated that when researchers read a text box aloud to the patients, the kappa of the scale increased from 0.56 ± 0.11 to 0.78 ± 0.05 (p < 0.001).
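The kappa statistic mentioned above corrects the raw proportion of agreement for agreement expected by chance. The following is a minimal sketch of the unweighted two-rater form (Cohen's kappa); whether the cited study used this form or a weighted variant is not stated here, so that choice is an assumption, and the ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same patients."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of patients")
    n = len(rater_a)
    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical category scores assigned by two raters to eight patients.
ratings_a = [1, 1, 2, 2, 3, 3, 1, 2]
ratings_b = [1, 1, 2, 3, 3, 3, 1, 1]
kappa = cohen_kappa(ratings_a, ratings_b)
print(f"observed agreement = 0.75, kappa = {kappa:.3f}")
```

In this example the raters agree on 6 of 8 patients, yet kappa is lower than 0.75 because part of that agreement would be expected by chance alone.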

Table 3 Interobserver and test–retest reliability

The reliability of several gait parameters, including the 10-s step test, the foot-tapping test and the 30-m walking test, was described by several studies. Nakashima et al. [61] observed unsubstantial differences between the 1st and 2nd measurements of 30-m walking test time (pre-op: 0.95, postop: 0.89), walking test steps (pre-op: −0.28, postop: −0.55) and 10-s step test (pre-op: −0.28, postop: −0.55). This study confirmed the findings of Singh and Crockard [34], who reported high intra-observer reliability for the 30-m walking test time and number of steps. The foot-tapping test also demonstrated high immediate test–retest reliability in both the control and myelopathy groups (right side, control: r = 0.931, myelopathy: r = 0.934) [38]. McDermott et al. [62] performed an extensive analysis of various temporal-spatial, kinematic and kinetic gait parameters and quantified test–retest reliability by conducting assessments 2–7 days apart. The intraclass correlations (ICC) for all temporal-spatial parameters were high (ICC >0.9) with the exception of opposite foot contact (ICC = 0.61). Kinematic parameters displayed a wider range of reliability values, from 0.33 for ankle position at initial contact to 0.92 for total range pelvic obliquity. Finally, most kinetic parameters exhibited high ICCs (>0.75) except for medio-lateral ground reaction force (ICC = 0.12), peak hip abductor moment (ICC = 0.54) and peak eccentric power at the knee during loading (ICC = 0.56).

As demonstrated by two studies, the GRT displays high interobserver reliability (0.989) and exhibits only small differences when measured across trials (pre-op: −0.89, postop: −0.52, right hand) [35, 61].

The reliability of six different patient-reported outcome measures was described, including the NDI, PSFS, NPRS, cervical spine outcomes questionnaire (CSOQ), SF-36 and SF-12 [49, 58, 63]. In a study on patients with cervical radiculopathy, NDI and NPRS exhibited similar fair inter-observer reliability (ICC = 0.55, 0.59), whereas the PSFS demonstrated low reliability (ICC = 0.17). BenDebba et al. [63] evaluated this property of the CSOQ using a patient’s 3-month and 12-month scores: ICCs were high for neck pain severity (0.80), shoulder arm pain severity (0.80), functional disability (0.85), psychological distress (0.82), physical symptoms other than pain (0.86) and health care utilization (0.86).

Finally, Brazier et al. [43] reported the 2-week test–retest reliability of all 8 subscales of the SF-36. The correlations were all high (>0.75) except for social functioning, role limitations physical and role limitations emotional.

Internal consistency

A scale is defined as having good internal consistency if it has a Cronbach’s alpha that exceeds 0.7 or, preferably, 0.8. Based on the results from several studies and as illustrated in Table 4, the SF-36 has good internal consistency with a Cronbach’s alpha of 0.82 for all 36 items, >0.7 for all 8 subscales, 0.82–0.93 for the MCS and 0.89–0.92 for the PCS [43–46, 49]. Singh et al. [49] reported a lower internal consistency of 0.77 for both the MCS and PCS of the SF-12. Other reported Cronbach’s alphas include MDI (0.92–0.95), EMS (0.68), JOA (0.72) and >0.85 for all domains of the WHOQOL-Bref [44] (Table 4).
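For readers unfamiliar with how Cronbach’s alpha is obtained, the following is a minimal sketch of the standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of the total score), where k is the number of items. The function name and the item responses are hypothetical; this is an illustration of the statistic, not a reproduction of any cited analysis.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Variance of each item across respondents (columns of the table).
    item_variance_sum = sum(variance(item) for item in zip(*responses))
    # Variance of each respondent's total score.
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical two-item scale answered by four respondents.
example_responses = [[1, 2], [2, 1], [3, 4], [4, 3]]
alpha = cronbach_alpha(example_responses)
print(f"Cronbach's alpha = {alpha:.2f}")
print("meets the 0.7 threshold" if alpha > 0.7 else "below the 0.7 threshold")
```

When all items rise and fall together across respondents, the total-score variance dominates the summed item variances and alpha approaches 1.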

Table 4 Internal consistency determined by Cronbach’s alpha

Responsiveness

Eight studies reported on a scale’s responsiveness (Table 5). The effect sizes for the SF-36, WHOQOL-Bref, Bournemouth questionnaire (BQ), NDI, neck functional disability scale (NFDS), NPRS and MDI have been defined in the literature [44, 46, 48, 56, 58, 64, 65]. Four studies referenced Cohen’s classification of effect size, which defines 0.2 as a small group change, 0.5 as moderate and 0.8 as large. According to these criteria, three composite measures of the CSOQ are considered highly responsive (neck pain = 0.71–1.15, functional disability = 0.9–1.0, physical symptoms = 0.63–0.95), two moderately responsive (shoulder-arm pain = 0.68–0.79, psychological distress = 0.58–0.72) and one poorly responsive (health care utilization = 0.21–0.36) [63]. For the SF-36, Baron et al. [46] identified a significant change in the value of 5 scores postoperatively: effect sizes were −0.70 for energy/vitality, −0.55 for mental health, −0.52 for social functioning, −0.62 for bodily pain and −0.43 for physical functioning. Thakar et al. [44] and Guilfoyle et al. [48] reported similar effect sizes for mental health (0.54 for both studies) and higher effect sizes for physical functioning (0.78; 0.86). The responsiveness of emotional role was moderate (ES = 0.62) [44]. For bodily pain, Thakar et al. [44] and Guilfoyle et al. [48] observed effect sizes of 0.80 and 0.65, respectively. The domains of the WHOQOL-Bref, on the other hand, were generally less responsive than the SF-36 subscales, with the exception of the physical domain (ES = 0.68) [44]. Effect sizes were 0.03, 0.39 and 0.45 for social, psychological and environment, respectively. As described by Bolton [64], all BQ subscales (0.82–1.94), the BQ total score (1.67) and NRS pain (1.02–1.60) were highly responsive to change, and NDI and NFDS were moderately responsive. The effect size for MDI was reported by Casey et al. [56] as 0.39.
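The two responsiveness statistics named in this review, the effect size and the standardized response mean, differ only in their denominator. The following is a minimal sketch assuming the common definitions (mean change divided by the baseline standard deviation, and mean change divided by the standard deviation of the change scores); the individual studies may have used other variants. The labelling function treats Cohen’s values of 0.2, 0.5 and 0.8 as lower bounds of each band, which is an assumption, and all scores are hypothetical.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """Mean change divided by the standard deviation of baseline scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the standard deviation of the change scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


def cohen_label(value):
    """Label the magnitude using 0.2, 0.5 and 0.8 as band boundaries (assumed)."""
    magnitude = abs(value)  # the sign only reflects the scoring direction
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "moderate"
    if magnitude >= 0.2:
        return "small"
    return "negligible"


# Hypothetical preoperative and postoperative scores for five patients.
pre_scores = [10, 12, 14, 16, 18]
post_scores = [13, 14, 18, 19, 21]
es = effect_size(pre_scores, post_scores)
srm = standardized_response_mean(pre_scores, post_scores)
print(f"effect size = {es:.2f} ({cohen_label(es)}), SRM = {srm:.2f}")
```

Because the two denominators can differ substantially, effect sizes and standardized response means from different studies are not directly interchangeable.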

Table 5 Responsiveness given by effect size, area under a receiver operating curve or floor and ceiling effects

Other studies described responsiveness by calculating the area under a receiver operating characteristic (ROC) curve [58, 65]. This curve plots sensitivity against 1 − specificity to describe a scale’s ability to differentiate between two groups of patients. Young et al. [58] divided their patient sample into four groups based on whether they were unchanged, stable, “smaller” clinically improved or “larger” clinically improved following surgery. The areas under the ROC curve for NDI, PSFS and NPRS were 0.74, 0.71 and 0.72, respectively, when differentiating between stable and “larger” clinically improved patients. Cleland et al. [65], on the other hand, reported a much lower area of 0.57 for NDI (stable and improved patients) and a higher area of 0.99 for the PSFS.
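The area under the ROC curve has a convenient interpretation: it equals the probability that a randomly chosen improved patient shows a larger change score than a randomly chosen stable patient, with ties counted as one half. The following is a minimal sketch of that pairwise calculation; the function name and the change scores are hypothetical, and the cited studies may have computed the area by other means.

```python
def roc_auc(improved_changes, stable_changes):
    """Area under the ROC curve via pairwise comparison of two groups."""
    if not improved_changes or not stable_changes:
        raise ValueError("both groups must contain at least one patient")
    wins = 0.0
    for improved in improved_changes:
        for stable in stable_changes:
            if improved > stable:
                wins += 1.0
            elif improved == stable:
                wins += 0.5  # ties count as half a correct ordering
    return wins / (len(improved_changes) * len(stable_changes))


# Hypothetical change scores on a disability scale for two patient groups.
improved_group = [8, 10, 12, 5]
stable_group = [2, 5, 6, 1]
auc = roc_auc(improved_group, stable_group)
print(f"area under the ROC curve = {auc:.3f}")
```

An area of 0.5 means the scale separates the groups no better than chance, while an area of 1.0 means every improved patient changed more than every stable patient.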

Three studies identified significant floor and ceiling effects for subscales of the SF-36. Baron et al. [46], Guilfoyle et al. [48], and Thakar et al. [44] reported a high floor effect for physical role (62.8, 50.9 and 64.3 %) and emotional role (35.1, 20, 71.4 %). Studies have reported floor effects of 15.4 % for physical functioning, 30 % for bodily pain and 17.1 % for social functioning and have identified ceiling effects of 22.9 % for social functioning, 20.4 % for physical role and 51.8 % for emotional role [44, 46, 48]. Casey et al. [56] noted a floor effect for the MDI as 27 out of their 194 sample of surgical patients had maximum disability. Although not explicitly examined in the literature, previous studies have speculated that the mJOA has a ceiling effect. In a study by Fehlings et al. [66], the degree of improvement on the mJOA scale was directly associated with baseline severity score, with severe patients showing the greatest change in mJOA. As a mild patient has less room for improvement as their baseline score is closer to the perfect score of 18, they are expected to show smaller changes in mJOA score.
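Floor and ceiling effects are usually reported as the percentage of patients who score at the lowest or highest possible value of a scale. The following is a minimal sketch of that calculation; the function name and the scores are hypothetical, and the 0 to 18 range is used only because 18 is described above as the perfect mJOA score. Whether the minimum represents the worst or best state depends on the scoring direction of the scale in question.

```python
def floor_ceiling_effects(scores, min_score, max_score):
    """Return the percentage of scores at the scale minimum and maximum."""
    if not scores:
        raise ValueError("at least one score is required")
    n = len(scores)
    floor_pct = 100.0 * sum(1 for s in scores if s == min_score) / n
    ceiling_pct = 100.0 * sum(1 for s in scores if s == max_score) / n
    return floor_pct, ceiling_pct


# Hypothetical scores on a 0 to 18 scale for ten patients.
example_scores = [18, 17, 18, 12, 9, 15, 18, 0, 14, 16]
floor_pct, ceiling_pct = floor_ceiling_effects(example_scores, 0, 18)
print(f"floor effect = {floor_pct:.1f} %, ceiling effect = {ceiling_pct:.1f} %")
```

Patients who already sit at the maximum cannot register further improvement, which is why a pronounced ceiling effect limits the responsiveness of a scale in mild disease.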

Discussion

It is evident from this review that there are numerous measurement tools that may be used to assess patients with cervical myelopathy. Singh et al. [67] reported the results of a survey in which clinicians identified that, although it is essential to quantify functional disability in CSM patients, the specific measurements that may do this are underused or not ideal [67]. Even though some scales are more commonly used in global practice, there still remains a lack of a gold standard outcome measure that can assess disease severity over time and objectively determine a patient’s improvement following intervention. This prevents the establishment of standard quantitative guidelines used to direct and implement appropriate treatment programs and makes it more challenging to accurately predict surgical outcome. It is therefore essential to develop or define a gold standard to determine the ideal timing of surgical intervention, thereby optimizing the clinical management of CSM populations.

Tetreault et al. [68], in a systematic review on important predictors of outcome, identified that results may differ depending on what scale was used to evaluate outcome. For example, when outcome was assessed using the Nurick grade, its association with various predictors was less conclusive: the direction of the relationship between preoperative condition and Nurick grade was unclear as was the hypothesis that duration of symptoms was correlated with a worse outcome. The authors speculated these findings were likely due to limitations in the Nurick grade as it is insensitive and largely weighted towards employment and lower limb function. The results from this study demonstrate that even the most commonly used outcome measures have significant flaws that may prevent them from correctly identifying certain relationships.

This systematic review provides an extensive summary of the outcome measures used for the functional assessment and quality of life evaluation of CSM. Some of these scales, such as the mJOA, Nurick and 30-m walking test have gained acceptance and are widely used whereas others have not yet broken into clinical practice. This study has explored the psychometric properties and the utility of existing scales in an attempt to propose an ideal measure.

Unfortunately, we were not able to conclusively identify the ideal scale for evaluating patients with cervical myelopathy, as we believe it does not exist. An ideal outcome measure would be one that incorporates all ten described qualities and also reflects the pathophysiology, signs and symptoms and natural history of the disease. Given that CSM typically affects the lateral corticospinal and the spinocerebellar tracts, a scale that assesses voluntary movement, balance and coordinated activity will likely be more sensitive and relevant than one that does not. In addition, a scale that can detect mild symptoms will address the insidious and slowly progressive nature of the disease. It is clear from Tables 1 through 5 that certain scales can detect the presence of myelopathy, are valid, highly reliable and responsive to change, and have a high internal consistency. Despite this, there are definite limitations to all measures reported. Based on these findings, we propose that functional measures such as the MDI, mJOA and Nurick grade should be used in combination with more sensitive and quantitative measures, including various walking tests or the grip and release test. In addition, the use of quality of life measures such as the SF-36 may help determine the impact of myelopathy on a patient’s mental and physical status. The use of a wider range of functional and impairment tests may help better define optimal treatment interventions as well as the prognostic value of certain clinical and imaging factors. Finally, we hope this review will direct future research in the form of a high-quality clinimetric study exploring the quality of performance of key measures.