Introduction

Neck pain is a very common symptom, with a lifetime prevalence of around 50% [1]. Because so many patients are affected, approximately 1% of total health care expenditure is spent on its treatment [2]. Many treatments for musculoskeletal disorders are carried out with the aim of improving the patient’s quality of life and function. Patients, health insurers and governmental bodies increasingly expect appropriate documentation of the efficacy of medical treatment [3, 4]. As a result of outcomes research, indications can be optimised [5–7], therapy and predictors critically questioned, and success or deterioration measured [8, 9]. Patient safety has been improved by the analysis of side effects and contraindications, and different surgeons and hospitals have been compared [10, 11]. Outcomes research tries to guide doctors and patients through the variety of treatment options, while also taking the costs of treatment into consideration.

Various options for monitoring treatment effects in neck pain have been examined. Physiological approaches involving, for example, the assessment of strength or range of motion, have by and large failed to serve as valid outcome parameters, because they do not relate well to the factors of importance to the patient, such as symptoms and function in daily life [12]. Social science approaches have more recently found their way into outcomes research and many patient self-rating questionnaires have been developed. These subjective patient-orientated questionnaires appear to be the most valid outcome measurements we have today, but they are only useful for systematic documentation if they are feasible for use in routine daily practice [13]. Questionnaires should be long enough to include all essential questions but demand as little time as possible for the patient to complete. Short questionnaires also reduce the workload of data management, making them easier to integrate into the existing infrastructure of an institution [14]. If the instruments are standardised and available in different languages, they allow national and worldwide comparisons of baseline status and treatment outcomes for a given disorder [14].

In 1998, Deyo et al. [14] recommended a set of six core questions as a parsimonious and valid instrument for assessing outcome in disorders of the lumbar spine. The questions evaluated the dimensions pain (axial and radiating pain), function, symptom-specific well-being, and disability (social and work). This core set showed excellent psychometric characteristics in patients with back pain undergoing either surgical or conservative management [15, 16] and multilingual versions were adopted for use in the Spine Tango system, the international spine surgery registry of Eurospine, the Spine Society of Europe (SSE) [17].

The set of questions was also adapted for the cervical spine, by enquiring about neck rather than back problems, and it too showed good validity and reliability [18]. However, the latter study included only patients with moderate symptoms undergoing conservative management, did not include a quality of life question, and did not examine the responsiveness of the questionnaire, which is one of the key elements of outcome instruments [19]. The aim of the present study was to further analyse the psychometric characteristics of this questionnaire, the Core Outcome Measures Index for the neck (COMI-neck), in a group of patients undergoing surgery of the cervical spine, with a focus on its responsiveness compared with that of other well-established condition-specific and quality of life questionnaires.

Methods

Patients

The study represents a retrospective analysis of prospectively collected data. All patients who had undergone cervical spine disc replacement surgery at our hospital between May 2005 and March 2010 were eligible for inclusion in the study as long as they could understand written German, had fulfilled the indications for disc replacement surgery (aged between 18 and 65, no segmental kyphosis, degeneration in not more than two segments, unsuccessful conservative therapy for at least 3 months, suffering from cervical brachialgia, discogenic neck or shoulder pain or early stage cervical myelopathy) and had reached 3 months follow-up after their operation. Exclusion criteria were traumatic or neoplastic indications for surgery. After their consultation with the surgeon in which the decision to operate was made, patients filled out a booklet of baseline questionnaires containing the EuroQol-5D and the NASS-cervical (see below), in compliance with the SWISSspine registry for disc arthroplasty in Bern [20], part of the government-mandated prospective evaluation of disc arthroplasty outcomes in Switzerland. Three months after surgery, at the time of the clinical follow-up with the surgeon, the booklet of questionnaires was completed again. As part of our hospital’s own in-house Spine outcomes registry a questionnaire containing the COMI (see below) was sent to the patient at home, preoperatively together with the information about their forthcoming hospital stay, and they were asked to complete it and hand it in during admission. Three months after surgery, a follow-up COMI was sent to them from the Research Department to complete and return by post.

Since the study was intended to compare the psychometric characteristics of the questionnaires themselves rather than report the outcome of the surgical procedure per se, the short-term follow-up of 3 months was considered unproblematic and in keeping with previous methodological studies [21].

The Core Outcome Measures Index for the neck (COMI-neck)

The COMI-neck is a short, self-administered outcome instrument consisting of just seven questions that evaluate five dimensions: pain, neck-related function, symptom-specific well-being, general quality of life and disability (social and work). Apart from the two disability items, which refer to the last 4 weeks, all items relate to how the patient felt in the last week. The two pain items use a 0–10 graphic rating scale; all other items use a 5-point adjectival scale. The higher the score, the worse the patient’s status. The scoring for each dimension [15] is summarised in Table 1. The summary score is the average of the scores for all five dimensions (each transformed to a 0–10 scale) [15]. According to the categories described by Beaton and Schemitsch [22], the COMI is a condition-specific questionnaire but also includes generic and so-called additional elements.
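The scoring described above can be illustrated with a minimal sketch. Table 1 is not reproduced here, so the coding of the 5-point items as 1 to 5 and their linear rescaling to 0–10 are assumptions, as are the function and parameter names; the use of the higher of the two pain scores and the averaging of the two disability items follow the description given elsewhere in this paper.

```python
def rescale_5pt(item: int) -> float:
    """Map a 5-point response (assumed coded 1..5) linearly onto 0-10."""
    if not 1 <= item <= 5:
        raise ValueError("5-point items are assumed to be coded 1..5")
    return (item - 1) * 2.5


def comi_summary(neck_pain: float, arm_pain: float, function: int,
                 wellbeing: int, quality_of_life: int,
                 social_disability: int, work_disability: int) -> float:
    """Average of the five dimension scores, each on a 0-10 scale."""
    # Pain dimension: the higher of the two 0-10 pain ratings.
    pain = max(neck_pain, arm_pain)
    # Disability dimension: average of the two rescaled disability items.
    disability = (rescale_5pt(social_disability)
                  + rescale_5pt(work_disability)) / 2
    dimensions = [
        pain,
        rescale_5pt(function),
        rescale_5pt(wellbeing),
        rescale_5pt(quality_of_life),
        disability,
    ]
    return sum(dimensions) / len(dimensions)


# Hypothetical patient: dimensions are 7, 7.5, 10, 5 and 5, so the mean is 6.9.
example = comi_summary(7, 3, 4, 5, 3, 2, 4)
```

Because every dimension is expressed on the same 0–10 scale before averaging, each contributes equally to the summary score regardless of its original response format.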

Table 1 COMI items, response options, and scoring

At the 3-month follow-up, the same questions were presented with an additional question to evaluate the global outcome of treatment [23]. The question enquired: “Overall, how much did the treatment that you received (the operation) help your neck problem? …” and was answered with a 5-point Likert scale (operation helped a lot, helped, helped only little, did not help, made things worse). This global outcome question was dichotomised into “good” (operation helped, helped a lot) and “poor” (operation helped only little, did not help, made things worse) for further analyses. Although “helped only little” is still a positive outcome, the cut-off point for “good” was placed higher than this, since clinically this is not considered a satisfactory outcome for elective surgery [15, 23].

Assessment of the psychometric properties of the COMI-neck

Questionnaire battery

To evaluate the COMI’s construct validity the following reference instruments were used: the EuroQol-Five Dimension (EQ-5D), the EuroQol-visual analogue scale (EQ-VAS), the North American Spine Society Cervical Spine Outcome Assessment Instrument (NASS-cervical) and two 0–10 visual analogue scales for neck pain and for arm pain.

The EQ-5D measures health-related quality of life [24]. It is a standardised, widely used, generic questionnaire and consists of the five items, i.e. mobility, self-care, usual activities, pain/discomfort and anxiety/depression. Each item is rated on a 3-point adjectival scale. The EQ-VAS is used to quantify the ‘overall health state’, with the patient indicating his/her current health status on a 0–100 VAS. In the present study, a horizontal scale was used in preference to the vertical scale used in the original version, for ease of layout. The EQ-5D summary index scores [ranging from −0.594 (worse than death) to 1 (best possible health)] were calculated using the unweighted method of Prieto and Sacristán [25].

The NASS-cervical is a pain and disability questionnaire developed by the North American Spine Society Outcome Assessment Task Force and is a region-specific measure [26, 27]. It is based on questions from the Oswestry Disability Index concerning dressing, lifting, walking, sitting, standing, sleeping, participating in social life, travelling, and sexual activity plus eight additional questions about the frequency and bothersomeness of pain (neck, arm), sensory disturbances and motor disturbances. Several summary scores exist, which differ in relation to the questions chosen to form the average value (see Table 2); however, the “pain&disability” dimension appears to be the most frequently used subscale and the remaining subscales have not been widely researched.

Table 2 Overview of the items in the NASS-cervical questionnaire and the items making up each of the subscales

Additional medical history and surgical variables describing the study group were extracted from the Spine Tango Spine Surgery Registry [17].

Statistical analysis

The following rules were applied in the case of missing data: for the COMI and EQ-5D, no missing data were allowed because they consist of only one item per domain. For the NASS-cervical, a minimum of 80% of the items had to have been completed for each of the two main domains (pain&disability; neurology). We used this completion rate as the acceptable minimum for questionnaires in general (Elfering, personal communication) because there appeared to be no consensus in the literature from previous authors working with the NASS questionnaire [28, 29]. The missing values were then replaced by the average of the values for the completed items in order to derive the score.
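As a minimal sketch of the 80% completion rule, the function below returns no score when too few items were answered and otherwise averages the completed items (which is equivalent to replacing each missing value with that average before taking the mean). The function name, the use of `None` for a missing item and the example values are illustrative assumptions, not part of the registry software.

```python
from typing import Optional, Sequence


def subscale_score(items: Sequence[Optional[float]],
                   min_completion: float = 0.8) -> Optional[float]:
    """Mean of the completed items, or None if completion is below the minimum."""
    answered = [value for value in items if value is not None]
    if not items or len(answered) / len(items) < min_completion:
        return None  # subscale cannot be scored
    # Averaging the answered items gives the same result as imputing the
    # missing items with that average and then averaging all items.
    return sum(answered) / len(answered)


# Four of five items completed (80%): the score is the mean of the four.
scored = subscale_score([1, 2, 3, 4, None])
# Three of five items completed (60%): no score is derived.
unscored = subscale_score([1, 2, 3, None, None])
```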

Floor and ceiling effects

Floor and ceiling effects were given by the proportion of individuals obtaining scores equivalent to the worst status and the best status, respectively, for each item and scale investigated. This indicates the proportion for whom, respectively, no meaningful deterioration or improvement in their condition could be detected since they are already at the extreme of the range. Due to the different scoring polarity of the questionnaires, for the COMI and NASS the highest scores represented floor effects (worst status) and the lowest scores, ceiling effects (best status); the converse was true for the EQ-5D and EQ-VAS scores. Floor/ceiling effects >70% are considered to be adverse and <15–20%, ideal [30, 31]. Floor and ceiling effects were determined for all scales, in order to provide some perspective for interpreting the corresponding values for the COMI.
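The calculation can be sketched as follows, assuming that the worst and best attainable scores of each scale are supplied explicitly so that the differing polarity of the questionnaires is handled by the caller. The function name and the example data are hypothetical.

```python
from typing import Sequence, Tuple


def floor_ceiling(scores: Sequence[float], worst: float,
                  best: float) -> Tuple[float, float]:
    """Return (floor %, ceiling %): the share of scores at worst and best status."""
    n = len(scores)
    if n == 0:
        raise ValueError("at least one score is required")
    floor_pct = 100 * sum(1 for s in scores if s == worst) / n
    ceiling_pct = 100 * sum(1 for s in scores if s == best) / n
    return floor_pct, ceiling_pct


# COMI-style polarity: 10 is the worst status and 0 the best.
comi_like = floor_ceiling([10, 10, 0, 5, 5, 5, 5, 5, 5, 5], worst=10, best=0)
# EQ-VAS-style polarity: 0 is the worst status and 100 the best.
vas_like = floor_ceiling([100, 80, 60, 0], worst=0, best=100)
```

Passing `worst` and `best` as arguments keeps the definition of floor (worst status) and ceiling (best status) identical across instruments, whichever end of the scale represents good health.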

Construct validity

Construct validity addresses the extent to which a questionnaire’s scores relate to other measures in a manner that is consistent with theoretically derived hypotheses concerning the concepts that are being measured [32]. The relationships between the COMI items/summary score and other questionnaire items or scores describing similar dimensions were examined using Pearson’s correlation coefficients. The following pairs of COMI items and corresponding items/questionnaires were examined:

  • the COMI “high pain” score and the higher of the two (arm or neck pain) 0–10 VAS pain scores and the NASS-cervical pain score;

  • the COMI-neck (arm) pain and the respective neck (arm) pain VAS;

  • the COMI item “neck function” and the NASS-cervical pain&disability and NASS-cervical disability score;

  • the COMI item “symptom-specific well being” and the EuroQol-5D and EQ-VAS;

  • the COMI item “general quality of life” and the EuroQol-5D and EQ-VAS;

  • the COMI “disability” average score and the NASS-cervical pain&disability and NASS-cervical disability score.

The correlations between the COMI summary score and all summary scores of the EQ-5D and the NASS-cervical were also examined [33].

Based on the validation studies for the original COMI and as recommended by Streiner and Norman [33] for measures of the same/similar attributes it was hypothesised that correlation coefficients would range from 0.4 to 0.8 for the relationships between the individual COMI items and their corresponding full-length questionnaires and between the COMI summary index score and NASS-cervical and EQ-5D summary index scores.
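For illustration, Pearson’s correlation coefficient and the check against the hypothesised range can be sketched as below. Comparing the absolute value of r with the 0.4–0.8 range is an assumption made here because scales of opposite polarity (for example the COMI and the EQ-5D) are expected to correlate negatively; the function names and data are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two paired samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def in_hypothesised_range(r: float, low: float = 0.4, high: float = 0.8) -> bool:
    """True if the strength of the correlation, ignoring its sign, lies in [low, high]."""
    return low <= abs(r) <= high


# Hypothetical paired scores from a short item and its reference scale.
r = pearson_r([2, 4, 5, 7, 9], [30, 35, 50, 45, 70])
```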

Responsiveness

Responsiveness refers to the ability of an instrument to show small but clinically important changes [21]. Beaton et al. [34] emphasise the importance of using different measures of responsiveness. In the present study we used three different approaches to compare questionnaire responsiveness.

Firstly, the effect size (standardised response mean; SRM) for the different questionnaires was calculated by taking the group mean of all the individual change scores and dividing this by the standard deviation of these change scores [35]. An effect size (or SRM) of 0.2 is regarded as small, 0.5 as moderate and 0.8 as large [36, 37]. The SRM allows a group-level interpretation of the study population undergoing treatment [34].
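A minimal sketch of the SRM calculation is given below. Two details are assumptions rather than statements from the methods: the change score is taken as baseline minus follow-up, so that improvement is positive on scales where higher scores indicate worse status, and the sample standard deviation (n − 1 denominator) is used.

```python
from statistics import mean, stdev
from typing import Sequence


def standardised_response_mean(baseline: Sequence[float],
                               follow_up: Sequence[float]) -> float:
    """Mean of the individual change scores divided by their standard deviation."""
    if len(baseline) != len(follow_up):
        raise ValueError("baseline and follow-up scores must be paired")
    # Assumed sign convention: positive change = improvement when high = worse.
    changes = [b - f for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


# Hypothetical scores for four patients: changes are 5, 3, 4 and 4.
srm = standardised_response_mean([8, 7, 9, 6], [3, 4, 5, 2])
```

The same function can be applied separately to the patients in each global outcome group to obtain group-specific SRMs.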

Secondly, unpaired t tests were used to detect significant differences between change scores (pre-treatment to the 3-month follow-up) for the good and the poor outcome groups (dichotomised as described above). In addition, the SRMs were determined and compared for “good” and “poor” outcome groups separately.

Thirdly, receiver operating characteristic (ROC) curves were plotted. The responsiveness of a questionnaire can be analysed in an analogous manner to the evaluation of a diagnostic test [21]. The score-change for the questionnaire represents the diagnostic test and this is examined in relation to the “global outcome of surgery”, which is taken to represent the “gold standard” or external criterion [21]. The resulting ROC curve displays the sensitivity and specificity for detecting a “good outcome” at each of several possible change-score cut-off points. The area under the ROC curve (AUC) describes how closely the ROC plot approaches that of a perfect test discriminating with 100% sensitivity and 100% specificity (AUC = 1.0) [38]. An AUC of 0.93, for example, means that a randomly selected patient from the “good” group has a greater change score than that of a randomly selected patient from the “poor” group 93% of the time. The cut-off point at which the sum of sensitivity and specificity was maximised was identified by calculating the Youden index (Youden index = Sensitivity + Specificity − 1) [39]. According to Beaton et al.’s classification [34], determination of the cut-off score in this manner allows individual-level interpretation for the observed questionnaires, which facilitates the monitoring of change in individual patients. The analyses were conducted using PASW Statistics 18.0 (SPSS Inc., Chicago, IL, USA) and MedCalc (MedCalc Software, Mariakerke, Belgium) and statistical significance was accepted at the P < 0.05 level.
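The interpretation of the AUC and the Youden index given above can be sketched in a few lines. This is an illustrative calculation, not the procedure implemented in the statistical packages used: the AUC is computed directly as the proportion of good/poor patient pairs in which the “good” patient has the greater change score (ties counted as one half, an assumption), and each observed change score is tried as a cut-off, with “change ≥ cut-off” predicting a good outcome. All names and data are hypothetical.

```python
from typing import Sequence, Tuple


def auc(good: Sequence[float], poor: Sequence[float]) -> float:
    """Probability that a random 'good' change score exceeds a random 'poor' one."""
    wins = 0.0
    for g in good:
        for p in poor:
            if g > p:
                wins += 1.0
            elif g == p:
                wins += 0.5  # ties count as half (assumed convention)
    return wins / (len(good) * len(poor))


def best_cutoff(good: Sequence[float],
                poor: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (cut-off, Youden index, sensitivity, specificity) maximising Youden."""
    best = None
    for cutoff in sorted(set(good) | set(poor)):
        sensitivity = sum(1 for g in good if g >= cutoff) / len(good)
        specificity = sum(1 for p in poor if p < cutoff) / len(poor)
        youden = sensitivity + specificity - 1
        if best is None or youden > best[1]:
            best = (cutoff, youden, sensitivity, specificity)
    return best


# Hypothetical change scores in the good and poor outcome groups.
good_changes = [4.0, 3.0, 5.0, 2.0]
poor_changes = [1.0, 2.5]
area = auc(good_changes, poor_changes)                    # 7 of 8 pairs
cutoff, youden, sens, spec = best_cutoff(good_changes, poor_changes)
```

With very small groups, as in this toy example, the chosen cut-off can shift with a single patient, which is worth bearing in mind when one outcome group is small.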

Results

Over the period of study, 134 patients were eligible for inclusion. However, only 89 of these patients had completed and returned all three questionnaires at baseline (Table 3). Of these 89 patients, 14 (15.7%) were lost to follow-up, leaving a follow-up group of 75 patients. The data from all 89 patients at baseline were used for the analysis of floor and ceiling effects and construct validity. The data from the 75 patients with follow-up questionnaires were used for the calculations of responsiveness and follow-up floor and ceiling effects. At follow-up, it was not possible to calculate a NASS-cervical “pain&disability” subscale score for one patient or a “neurology” subscale score for another, because fewer than 80% of the items in the subscale had been completed (see “Methods”). The demographic, medical history and surgical variables describing the whole study group (N = 89) are shown in Table 4.

Table 3 Overview of the number of questionnaires handed out and returned, preoperatively and at follow-up
Table 4 Study sample characteristics of baseline and follow-up population

Floor and ceiling effects

Table 5 shows the percentage floor effects (worst status) and ceiling effects (best status) for each of the instruments. The COMI summary score, NASS-cervical pain&disability score, neurology score 1 and pain score, and the EQ-VAS each showed low (<15%) floor and ceiling effects at both baseline and follow-up.

Table 5 Floor effects (worst status) and ceiling effects (best status) for each of the questionnaire items/scales

At follow-up there were high but not adverse ceiling effects for the NASS-cervical neurology score 2 and the NASS-cervical disability score (33 and 26%, respectively) and for EQ-5D (33%). Similarly, all the individual COMI items displayed high ceiling effects at follow-up (19–48%). At baseline the COMI items “function”, “disability” and “symptom specific well-being” showed high to adverse floor effects (29, 35 and 83%, respectively). Some of the individual items of the EQ-5D (mobility, self-care, and anxiety/depression) had very high ceiling effects at baseline (51–65%) and adverse (89–92%) ceiling effects at follow-up. EQ-5D pain had 37% floor effects at baseline and a similar percentage of ceiling effects at follow-up.

Construct validity

The relationships between each of the COMI item scores and the corresponding questionnaire scores are shown in Table 6. The COMI summary score showed moderate to high correlations with the EQ-5D, EQ-VAS, NASS-cervical pain&disability, NASS-cervical pain and NASS-cervical disability scores (−0.60 to 0.73) but low correlations with the two NASS-cervical neurology scores (0.24–0.37).

Table 6 Correlation coefficients describing the relationships between the COMI single items/summary score and reference items or full-length questionnaires at baseline in the 89 patients

Correlation coefficients of 0.47–0.63 were found for the relationship between the COMI-neck pain item score and the NASS-cervical pain scale, NASS pain questions 1 and 5 and the two VASs for neck pain. Similar correlations (0.54–0.72) were found for the various measures of arm pain. The COMI item “function” correlated well (0.60) with the NASS-cervical pain&disability and NASS-cervical disability scores. Generally low correlations (−0.27 to −0.31) were found between the COMI item “symptom specific well-being” and the EQ-5D and EQ-VAS scores. COMI “general quality of life” scores showed correlations of −0.50 to −0.59 with the EQ-5D and the EQ-VAS scores. COMI “disability” scores showed correlations of 0.56–0.57 with the NASS-cervical pain&disability and the NASS-cervical disability scores.

The correlations for all the change scores showed slightly lower coefficients (r = −0.22 to −0.60) than the corresponding correlations of the absolute scores at baseline (Table 6).

Responsiveness

The global outcome ratings were distributed as follows: 58 (77.3%) helped a lot, 10 (13.3%) helped, 7 (9.3%) helped only little, 0 (0%) did not help, 0 (0%) made things worse. Hence the “good outcome” group consisted of 68 patients (90%) and the “poor outcome” group of 7 (10%).

There was a significant (P < 0.001) difference in the mean COMI change-scores for the good and poor outcome groups. Four out of the five NASS-cervical scores and the EQ-VAS (but not the EQ-5D) also showed significant differences between the scores for the good and poor outcome groups (Table 7, Fig. 1). The effect sizes (SRMs) giving information about the responsiveness or sensitivity to change for each of the instruments are compared in Fig. 1. The COMI showed the greatest difference between the SRMs for the good and poor outcome groups (2.34 and 0.34, respectively), i.e. it showed the best ability to discriminate between outcome groups, having a very high SRM in the good outcome group and a low SRM in the group with a poor outcome (Table 7). All the NASS subscales and EQ-5D scales showed smaller SRM differences between the good and poor outcomes indicating a worse discriminative ability.

Table 7 Mean scores, standard deviation, SRM and P value (difference between outcome groups for the change score (baseline to follow-up)) of the different questionnaires split by the global outcome question or regarded as one group
Fig. 1

Standardised response means (SRMs) for the good and poor outcome groups for each instrument, highlighting the ability of the instrument to discriminate between the groups. The higher the SRM in the good outcome group, the lower the SRM in the poor outcome group (should be close to zero) and the greater the difference between the SRMs for the two groups, the more discriminative is the instrument

Figure 2 shows the ROC curves for each of the questionnaires. The COMI summary score is the closest to the top left corner, i.e., shows the best discriminative function. This is also shown by the data for the AUC which was 0.96 for the COMI and significantly (P < 0.05) higher than the AUCs for all the other questionnaires (Table 8). The EQ-VAS showed a slightly greater AUC than the EQ-5D summary score but the difference was not significant. An improvement of 2.7 or more points in the COMI summary score predicted a good outcome with a sensitivity of 83.3% and specificity of 100% (Youden index 0.83). Summarising, with all three of the methods applied to examine responsiveness, the COMI showed the best ability to discriminate between good and poor outcomes.

Fig. 2

Receiver operating characteristics curves for the different instruments. As an external criterion the global outcome question was chosen. See Table 8 for further details

Table 8 Comparison of receiver operating curves for the different instruments

Discussion

Patients, health insurers and governmental bodies increasingly expect outcomes research to be carried out to evaluate the effectiveness of treatment and the performance of individual health professionals and hospitals. Short questionnaires like the COMI are ideal for the longitudinal assessment of treatment outcomes [40] and have various advantages over longer instruments, such as easier administration and higher completion rates [13]. Nonetheless, it is also important to use questionnaires with adequate psychometric properties. For instruments designed to be used in longitudinal assessments, i.e. as outcome instruments, responsiveness and validity are two of the most essential criteria [19, 41]. The single COMI items and the COMI summary score showed good correlations with the corresponding fuller questionnaires, indicating adequate construct validity, and the COMI also demonstrated good responsiveness.

Floor and ceiling effects

There were observable floor and/or ceiling effects (depending on the time-point of assessment) for the single items of all questionnaires examined in the present study. Some, notably the EQ-5D items mobility, anxiety/depression, and self-care at follow-up, and the COMI item symptom-specific well-being at baseline, even exceeded the critical level of 70% [40]. This might suggest that the responsiveness of these items would be limited because further change in an even more extreme direction might not be measurable. As a consequence of the critical ceiling effects for the EQ-5D individual items, the summary score of the EQ-5D also showed relatively high ceiling effects at follow-up. Despite there being some floor and ceiling effects for some single COMI items, the COMI-neck summary score showed no critical floor and ceiling effects.

Some argue that Likert scales with only five categories or the EQ-5D-style scale with only three categories may not be able to detect small but important changes [33]. As an alternative, 7-point or 10-point rating scales (similar to the pain VAS) have been recommended. However, in other studies, the 5-point Likert scale has been shown to display almost identical responsiveness to the 0–10 VAS, with the added advantage of being easier to administer and interpret [42]. The three-category response scale of the EQ-5D showed extremely high floor and ceiling effects and this likely diminished its responsiveness. This problem is known to the developers of the instrument and its further evaluation has led to the establishment of a 5-point scale for the EQ-5D similar to that used in the COMI [43–45]. Floor and ceiling effects are highly population-dependent [40]. The present study involved patients undergoing cervical spine surgery, who typically suffer from severe functional restrictions, neurological deficits and high pain preoperatively (see Table 4) and who generally have only minimal symptoms after treatment. This would be commensurate with greater floor effects preoperatively and greater ceiling effects postoperatively. It is likely that a group of patients with less severe symptoms undergoing conservative therapy for neck pain would not show as many floor effects at baseline or ceiling effects after treatment.

Many patients in the present study reported a good outcome and displayed high EQ-5D and low COMI and NASS-cervical scores at 3 months postoperatively. Generally speaking, a very high predominance of good outcomes might suggest that participation had been selective and be indicative of response bias. However, in the present study we believe that it was simply the effective surgery that led to this distribution of outcomes. We deduce this from the fact that, in the larger group of patients that completed the COMI but not the other questionnaires (see Table 3), the % good outcomes [87.8% (101/115) patients; detailed data not shown] was similar to the value in the smaller group (90.6%, in 75 patients) that completed all three questionnaires preoperatively and at follow-up, and who were used in the comparisons of instrument responsiveness. The size of the follow-up group was lower due to the poorer rate of completion of the NASS and EQ-5D questionnaires. Whether this was the result of the different (and less “local”) administrative system used to collect the data or the greater length of the questionnaire battery cannot be ascertained. High completion rates are essential to feel confident in measuring the benefit of the treatment, unbiased by selective participation.

Construct validity

As in the original validation study [15], the individual COMI items showed a good correlation with their reference scales, with the exception of “symptom specific well-being”. A possible explanation for this finding, namely that this item delivers unique important information for the summary score and should therefore continue to be included in the instrument, has been discussed before [15]. There was a much stronger correlation between the COMI and the NASS-cervical pain&disability score (and their respective change scores) than between the COMI and the NASS neurology score. This behaviour of the neurology score was also described by Stoll et al. [27], who found no correlation between this subscale and any of the SF-36 subscales. It is likely explained by the lack of any specific neurology assessment in the COMI and in the SF-36.

Responsiveness

For questionnaires that are to be used on a longitudinal basis, i.e. as outcome instruments, it is essential to know how well they are able to detect small but important changes [40, 46]. This information is used to inform clinical decisions and assist with the calculation of sample sizes in further studies. The t test results and the very low SRM for the poor outcome group and high SRM for the good outcome group indicated the excellent discriminative ability of the COMI. Examining the SRMs in the good and poor outcome groups separately was considered to be a fundamental necessity to see whether the questionnaires had the ability to differentiate between different global outcomes [34]. Evaluation of the SRM for the whole group alone fails to reveal whether an instrument also shows change where none is actually perceived by the patient. This would not be an ideal characteristic for an outcome instrument: a responsive questionnaire should not show improvement or deterioration when none has occurred. The NASS-cervical and the EuroQol showed less favourable SRM values than the COMI, and did not differentiate as well between good and poor outcome groups, suggesting that they were less responsive tools. A previous study [27] showed similar SRMs to those found in our study for the NASS pain&disability and the NASS neurology score 1 after conservative treatment over 3 weeks. A possible explanation for the greater responsiveness of the COMI might be the parsimonious choice of the COMI items, whereby only those that are most relevant to the condition are included, and the use in the summary score of the higher of the two pain scores (arm or neck pain), rather than either just neck pain or just radiating pain or an average of the two. Some patients suffer only from neck pain or only from arm pain. Pain is known to be one of the most responsive items in spinal surgery [47], and if the effect of intense pain in the most painful region is “diluted” by averaging it with pain scores for non-painful regions, then this will undoubtedly reduce the sensitivity of the pain item. Hyland [40] refers to this notion as shifting and non-shifting questions. In our study sample we observed that the NASS-cervical items lifting, walking, sitting, standing, stiffness, trembling and sexual activity, and the EuroQol items mobility, self-care and anxiety/depression had SRM values below 0.8, which indicates that these were non-shifting elements (specific results not shown) and therefore likely diluted the average change-score.

The low responsiveness of the EuroQol compared with the COMI or NASS-cervical was not particularly unexpected, given that it is a generic rather than a condition-specific measure. Generic measures are almost always less responsive, since the questions they contain are less specific to the condition in question and often include non-shifting items (see above). As mentioned earlier, the EQ-5D has only a 3-point scale with the two extremes effectively being “no problems” and “cannot do”; despite excellent treatment it is rare that patients change from the very worst to the best status or vice versa. The EuroQol is successfully used in cost-effectiveness analyses of treatment for spinal disorders [48] or to examine iatrogenic effects of treatment [49] but is not recommended for use as a standalone outcome instrument for specific conditions/disorders [50]. In ROC analysis, the EQ-VAS showed a greater AUC (0.74) than did the EQ-5D summary score (0.70) but with overlapping confidence intervals such that the two did not differ significantly. Interestingly, previous studies in patients with coronary heart disease, angina, stroke, diabetes, myocardial infarction, high blood pressure, joint pain and asthma have also shown that the single-item EQ-VAS is more responsive than the EQ-5D summary score [51, 52].

The proximity of the COMI curve to the top left corner of the ROC plot and the high AUC value for the COMI reflected its excellent ability to discriminate between the good and poor outcome groups. Its performance in this respect was better than that of either the NASS-cervical or the EQ-5D. Previous studies calculating ROCs for the EQ-5D in two health surveys [53, 54] and in a study on the treatment of femoral neck fractures [55] observed similar AUCs (0.70–0.77) to that in the present study (0.70). Unfortunately, however, they did not calculate the Youden index or the corresponding values for sensitivity and specificity in detecting a good outcome or positive health status that would otherwise have allowed direct comparison with our data. To the best of our knowledge, no previous studies have carried out ROC analyses of the NASS-cervical in neck pain patients. The results of such analyses permit the monitoring and management of individual patients and are a fundamental element in Beaton et al.’s responsiveness classification [34]. In future studies, more attention should be paid to this useful analysis and outcome instruments should be evaluated for their sensitivity and specificity using the ROC method [56].

Limitations of the study

Our study has some limitations. The follow-up period was only 3 months. However, since the study was intended to compare the psychometric characteristics of the questionnaires themselves rather than report the outcome of the surgical procedure per se, the rather short 3-month period of follow-up was considered unproblematic, and in keeping with previous methodological studies [21]. Furthermore, such a follow-up period covered the phase immediately after surgery, in which most of the changes occur, and allowed the maximum number of datasets to be included; our decision was supported by the finding that previous studies of a similar nature have reported no significant change in outcome up to 2 years later [23, 57].

We did not evaluate the test–retest reliability of the COMI-neck in the present study, because White et al. [18] reported good reliability for the English version of the COMI-neck, and because the test–retest reliability of the German COMI-back, which is identical to the COMI-neck but for the fact that it enquires about back/leg symptoms rather than neck/arm symptoms, has also been confirmed [15]. However, the reliability of the COMI-neck might require further verification in relation to the specific patient group in which it is to be used in future studies.

The number of patients in the “poor outcome” group was rather low, which may limit the external validity of the responsiveness analysis. Nonetheless, in previous studies of the COMI-back [23] and the EQ-5D [53–55] similar results were obtained in terms of the responsiveness and minimal clinically important differences recorded.

From the initial 134 patients operated on, only 75 could be included in the follow-up group, and this was predominantly the result of missing NASS-cervical/EQ-5D questionnaires. We do not believe that this introduced a notable bias, though, since the outcome results for the COMI-neck in the larger group that completed only this questionnaire were similar to those reported for the group of 75 patients who completed all three (see earlier). Moreover, the baseline COMI scores for the larger group with a COMI but not the other questionnaires (n = 130) were similar to those for the group with all three questionnaires at baseline (n = 89) (detailed data not shown). In our spine centre we have observed how difficult the collection of questionnaires in daily practice can be without the employment of a dedicated study nurse/research assistant (the SWISSspine questionnaires were administered by the surgeon’s secretary in conjunction with the SWISSspine registry for disc arthroplasty in Bern [20], whereas the COMI system is managed as part of an internal quality management system, run with dedicated staff from the Research Department).

A further ongoing problem in all responsiveness studies is the lack of an external gold standard for measuring treatment success. There is no consensus regarding the selection of an external criterion except that it should represent major clinical improvement or deterioration of health. Patient-orientated appraisals are widely accepted in the literature [21, 23, 35, 58–60], but other measurements, for example clinician- and patient-orientated assessment [59, 61] or return to full activity [21], may also be useful to examine in future studies. The global outcome criterion was included in the 3-month follow-up questionnaire that also contained the COMI. We examined whether this may have led to bias in that the global outcome was completed at the same time and under the same conditions as the COMI itself, and hence had a higher chance of being more closely related to it. However, there was a high correlation [r = 0.7 (data not shown)] between the highest pain score determined from almost identical single pain items in the SWISSspine and in the COMI instruments at follow-up, which would tend to suggest that this was unlikely to have been the case.

None of the items in the COMI are weighted in the final score in relation to their perceived relative importance. The issue of weighting dimensions is an oft-discussed theme in the literature [33]. When the COMI was first developed the scores for the items were simply averaged for convenience and the excellent performance of the instrument resulted in the scoring being kept that way. An advantage of this method is of course its simplicity, in that it allows the quick and easy computation of the COMI summary score. Further studies might, however, examine whether other methods of computation would further improve its psychometric properties.
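The difference between the current unweighted scoring and a possible weighted alternative can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the domain names, the example values (taken to be already expressed on a common 0 to 10 scale) and the weights are hypothetical and are not taken from the COMI scoring manual or from our data.

```python
def summary_score(domain_scores, weights=None):
    """Combine domain scores into a single summary score.

    With weights=None, every domain counts equally (simple average).
    With a weights mapping, a weighted average is returned instead.
    """
    if weights is None:
        return sum(domain_scores.values()) / len(domain_scores)
    total_weight = sum(weights[d] for d in domain_scores)
    weighted_sum = sum(domain_scores[d] * weights[d] for d in domain_scores)
    return weighted_sum / total_weight


# Hypothetical domain scores for one patient (illustrative values only).
scores = {
    "pain": 6.0,
    "function": 5.0,
    "symptom_specific_wellbeing": 7.5,
    "quality_of_life": 5.0,
    "disability": 2.5,
}

# Hypothetical weighting that counts pain twice as heavily as the rest.
pain_heavy = {
    "pain": 2.0,
    "function": 1.0,
    "symptom_specific_wellbeing": 1.0,
    "quality_of_life": 1.0,
    "disability": 1.0,
}

print(summary_score(scores))              # unweighted average
print(summary_score(scores, pain_heavy))  # weighted alternative
```

Any weighting scheme of this kind would need to be justified empirically and re-validated, since it changes the score distribution on which the responsiveness and cut-off values are based.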

Conclusion

Our results demonstrate that the COMI-neck is a valid and responsive instrument for use in assessing the outcome of patients undergoing cervical spine surgery. Despite some large floor and ceiling effects for the individual items, its responsiveness did not appear to be negatively affected: indeed, of the instruments examined, the COMI proved to be the best for discriminating between good and poor global outcomes. The COMI has the potential to serve as an outcome instrument not only for evaluating group outcomes in clinical trials, multicentre studies, routine quality management and surgical registry systems, but also for individual patient monitoring. In this way, the COMI can be used to enhance outcomes research, distinguish between useful and futile treatments, evaluate the performance of surgeons and hospitals and optimise the treatment of individual patients. Further analyses of the COMI-neck should be carried out in groups of non-surgically treated patients but, in view of the comparable performance of the COMI-back in both surgical and non-surgical groups [15], we are optimistic that the COMI-neck will perform just as well in non-surgical patients.