Introduction

The evaluation of therapies for low back pain requires consideration of a number of variables. It has been recommended that a full evaluation should include a condition-specific disability measure, a general health measure (e.g. EQ-5D [20], WHODAS II [61], SF36 [62]), a pain measure (e.g. VAS [32]), a satisfaction measure, and a measure of employment [4]. For further technical details the reader is referred to two recent articles on the same subject [43, 44]. The present article aims to provide a more practical overview of the five most frequently used back-specific outcome scales.

The application of a widely used tool allows comparisons between the study group and other populations. All currently available measures have flaws or restrictions regarding their construction, validation or application. In the absence of an ideal instrument [5], the choice of a commonly used measurement tool may be considered reasonable.

Methods

Eighty-two back-related questionnaires were identified by reviewing the major medical databases. The ten most frequently used questionnaires were analysed using the following criteria: general characteristics, external validity, internal consistency, responsiveness to changes, floor and ceiling effects, question focus, offered answers, item masking score bias, item weighting score bias and cross contamination score bias [43, 44]. This study is a shorter and more practical version of that review and summarises the five most widely used LBP assessment instruments. General measures such as the SF36 [62] as well as isolated pain measures were not included in the investigation.

History and technical validation concepts:

A number of specific areas were examined, defined below. More specific information is available in the literature [4, 48, 58].

General Characteristics: The population from which the score was developed, number of items, item scoring, information about subscales, and a brief description of the domains are provided.

Reliability: There are a variety of ways of examining the reproducibility of a measure administered on different occasions. Test–retest reliability is the most important. It is measured best by using tests of agreement such as Kappa [3, 42, 58]. The Pearson correlation coefficient [3] is a measure of correlation and although commonly used is a less precise measure. Pearson correlation values should exceed 0.8 and kappa values should exceed 0.5. Another measure is the Bland Altman plot [3]. This describes the spread of the score values within the same individuals between the test and the retest examination and provides a 95% confidence interval.
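The Bland-Altman calculation described above can be sketched in a few lines. This is a minimal illustration, not a validated statistical routine: the function name and the test and retest scores are hypothetical, and the 95% interval is computed here as the conventional limits of agreement (mean difference plus or minus 1.96 standard deviations of the differences).

```python
from statistics import mean, stdev


def bland_altman_limits(test, retest):
    """Return (bias, lower, upper) for paired test-retest scores.

    bias is the mean within-subject difference; lower and upper are the
    95% limits of agreement, bias -/+ 1.96 * SD of the differences.
    """
    if len(test) != len(retest):
        raise ValueError("test and retest must have the same length")
    diffs = [a - b for a, b in zip(test, retest)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical questionnaire scores of six patients on two occasions.
test_scores = [20, 34, 48, 12, 60, 42]
retest_scores = [22, 30, 50, 14, 56, 44]

bias, lower, upper = bland_altman_limits(test_scores, retest_scores)
print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
```

A bias close to zero with narrow limits suggests that the score is reproducible in stable patients; how narrow the limits need to be depends on the scale range of the questionnaire.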

Internal Consistency: Measures of internal consistency are based on a single administration of the outcome measure. If the outcome measure has a relatively large number of items addressing the same dimension, such as measures of physical function, it is reasonable to expect that the scores on each item would be correlated with the scores on all other items. Thus, if the internal consistency is low, the different items should not be summed, because they measure different domains. Internal consistency is predominantly measured by Cronbach’s alpha [10]. Values above 0.8 are acceptable.
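As an illustration of how Cronbach's alpha is obtained from a single administration, the following sketch uses the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of the total score). The function name and the response data are hypothetical and do not come from any of the questionnaires reviewed here.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("at least two items are required")
    if any(len(row) != k for row in responses):
        raise ValueError("every respondent must answer every item")
    item_columns = list(zip(*responses))  # one tuple of scores per item
    item_variance_sum = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical answers of five patients to four items of one domain.
responses = [
    [3, 4, 3, 4],
    [1, 2, 1, 1],
    [5, 4, 5, 5],
    [2, 2, 3, 2],
    [4, 5, 4, 4],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

With these invented data the items move together closely, so alpha lies above the 0.8 threshold named above; items drawn from unrelated domains would lower it.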

Responsiveness to changes: The minimum clinically important difference (MCID) is the value of the change in the score which equates to the smallest change in the condition of interest the patient perceives as beneficial. Responsiveness can also be evaluated using the receiver operating characteristic (ROC) curve which is constructed by calculating the sensitivity (true positive rate) and specificity (true negative rate) of the cut-off point for each of the possible score values [58]. An index of the “goodness” of the questionnaire is the area under this curve (AUC), which is usually abbreviated as D′. A poorly discriminating questionnaire has an area of 0.5 and a perfect test has an AUC of 1.0 [58].
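The area under the ROC curve can be illustrated without tracing the curve cut-off by cut-off. The sketch below uses the rank-based equivalent: the proportion of pairs in which a patient judged improved by an external criterion shows a larger score change than a patient judged unchanged, with ties counted as half. The grouping into improved and unchanged, the function name and all values are assumptions made for the example.

```python
def roc_auc(improved, unchanged):
    """AUC as the share of (improved, unchanged) pairs ranked correctly.

    Ties count as half a correct pair; the result equals the trapezoidal
    area under the empirical ROC curve.
    """
    pairs = len(improved) * len(unchanged)
    if pairs == 0:
        raise ValueError("both groups must contain at least one patient")
    correct = 0.0
    for x in improved:
        for y in unchanged:
            if x > y:
                correct += 1.0
            elif x == y:
                correct += 0.5
    return correct / pairs


# Hypothetical score improvements (baseline minus follow-up points).
improved_changes = [12, 9, 15, 7, 10]
unchanged_changes = [3, 8, 1, 5, 9]

auc = roc_auc(improved_changes, unchanged_changes)
print(f"AUC = {auc:.2f}")
```

An AUC of 0.5 would mean that the score change does not separate the two groups at all, matching the description of a poorly discriminating questionnaire above.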

External validation: Comparison of a new score with the existing scores allows the assessment of its performance against known measures, particularly in the selection of measurement domains, responsiveness and floor and ceiling effects.

Floor and Ceiling effect: Floor and ceiling effects describe the percentage of subjects who reach the minimum or maximum possible score [4, 58]. In these ranges the measure is inefficient in discriminating between subjects. A similar problem occurs when the results are skewed towards a certain region of the scale. Floor and ceiling effects may be observed if a measure developed in one population, e.g. severely disabled subjects in a pain clinic, is used in a very different population, e.g. attendees in primary care.
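Floor and ceiling effects can be quantified directly as the percentage of subjects at either end of the scale. The following sketch assumes a hypothetical 0 to 24 point scale and invented scores from a mildly affected sample; the function name is likewise an assumption.

```python
def floor_ceiling_percent(scores, min_score, max_score):
    """Return (floor %, ceiling %): share of subjects at the scale limits."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor = 100.0 * sum(1 for s in scores if s == min_score) / n
    ceiling = 100.0 * sum(1 for s in scores if s == max_score) / n
    return floor, ceiling


# Hypothetical disability scores of ten primary care attendees (0 = no disability).
scores = [0, 0, 3, 5, 0, 12, 24, 7, 0, 2]

floor_pct, ceiling_pct = floor_ceiling_percent(scores, min_score=0, max_score=24)
print(f"floor: {floor_pct:.0f}%, ceiling: {ceiling_pct:.0f}%")
```

In this invented sample four of ten subjects sit at the lowest possible score, so any further improvement in these subjects could not be registered by the scale.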

Apart from the technical aspects of the validation process, the softer aspects of scale construction are of major importance and are often underestimated. The following definitions were employed:

Question focus: Each question should have one single target (e.g. “Do you have pain in the groin?”) and they should be easy to understand and unambiguous [48, 58]. This aspect was recognised in most of the questionnaires. Questions should follow a logical structure and potential sources of inaccuracy should be specifically examined. For example, gender differences may exist in questions such as “Do you have back pain when doing chores?” as in many households chores are not evenly distributed between men and women.

In the ICF classification [26], three dimensions are described: (1) impairments (sub-divided into functions and structures), (2) activity and participation limitations and (3) environmental factors. Impairment is divided into (1a) impairments of body functions (e.g. sleep as a mental impairment, pain as a sensory impairment [50], blood pressure as a cardiovascular impairment, etc.) and (1b) impairments of body structures (e.g. cartilage damage, impairments of spinal cord or peripheral nerves, etc.). Activities and participation comprise activities (such as learning, walking, doing housework, etc.) and participation (e.g. relationships, religion, etc.). Environmental factors include the health care system, food, climate, etc. All these dimensions are part of an outcome assessment and are valuable indicators of quality of life. They can be covered within one questionnaire, as in the low back outcome score (LBOS) [25], or different questionnaires can be combined to fulfil the ICF criteria, as is the case in the NASS LSO questionnaire [11]. Covering all dimensions in one questionnaire is usually preferred because the type of questions, the answer scaling and the question flow can be harmonised and similar questions can be merged into one. The importance of assessing all ICF dimensions is stressed, but summing the answers to questions that belong to different ICF classes or subclasses in order to obtain a final sum score is not recommended (see below).

Offered answers: Answers should be clear and the scale has to be comprehensive and disjunctive [48] (answers do not overlap and focus on one single issue). It is helpful for the patient to use one single answer type throughout the questionnaire. Technically this issue is important for questions that will be grouped in order to get a sum score (see item weighting below).

Score: All answers of a questionnaire are given a certain numeric value. Adding up the various answer values results in a score. Sum scores facilitate comparison between patients or patient groups or between preoperative and postoperative conditions. Summing the scores of individual questions can result in a number of sources of error.

Item masking score bias: This may be present if unrelated questions are summed in a single score [18, 29]. For example, a person suffers from severe back pain (ICF group 1a) and marks 9 out of a maximum of 10 points, but has almost unrestricted walking (ICF group 2) with 3 out of 10 points. This results in a sum score of 12 points. During follow-up the same person suffers from moderate back pain only and marks 4 out of 10 points. Meanwhile, walking is severely restricted (8 out of 10 points). The follow-up sum score remains unchanged at 12 points, which suggests an unchanged outcome. The sum score has masked a significant change. Masking is more likely to be present if a sum score is composed of items focusing on different domains. The effect has also been called score bias [17, 19]. In a questionnaire comprising a number of domains, the use of sub-scores focused on only one dimension should be considered.
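The masking example above can be reproduced with its own numbers. The sketch compares the change in the sum score with the change in each item; the dictionary layout and variable names are illustrative only.

```python
# Item scores from the example in the text (0-10 points per item).
baseline = {"pain": 9, "walking": 3}
follow_up = {"pain": 4, "walking": 8}

# Change in the sum score between baseline and follow-up.
sum_change = sum(follow_up.values()) - sum(baseline.values())

# Change per item, i.e. what separate sub-scores would report.
sub_changes = {item: follow_up[item] - baseline[item] for item in baseline}

print("sum score change:", sum_change)
print("sub-score changes:", sub_changes)
```

The sum score change is zero although pain fell by five points and walking restriction rose by five points, which is exactly the change that separate sub-scores would reveal.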

Item weighting score bias: Some measures allocate different weights to questions. For example, weighting the ability to do work with 10 points and walking with 3 points places much emphasis on the work item. If the ability to work changes slightly, the sum score is affected significantly. If the walking capacity is altered, the change will hardly influence the sum score. Often no rationale is presented for these different weights, which may lead to under or overestimation of certain outcome parameters. Further research on the importance that patients place on various activities may improve this weighting. Weighting problems can also arise if the sum score contains questions that relate to abilities that are of no relevance for certain patients such as doing chores for certain male individuals or sex life for some elderly people. By giving the answers of such questions a certain amount of points, we put weight on questions that are meaningless for some individuals.
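The effect of unequal weights can be made concrete with the weights from the example above (work 10 points, walking 3 points). The sketch assumes, purely for illustration, that both items are answered on a four-level scale coded 0 to 3 and that the item score is the weight multiplied by the fraction of the maximum level; none of the reviewed questionnaires is claimed to be scored in exactly this way.

```python
# Weights taken from the example in the text; the answer coding is assumed.
WEIGHTS = {"work": 10, "walking": 3}
MAX_LEVEL = 3  # hypothetical four-level answers coded 0, 1, 2, 3


def weighted_sum(answers):
    """Sum score where each answer level is scaled by the item weight."""
    return sum(WEIGHTS[item] * level / MAX_LEVEL for item, level in answers.items())


baseline = {"work": 2, "walking": 2}
work_one_level_better = {"work": 3, "walking": 2}
walking_one_level_better = {"work": 2, "walking": 3}

# Effect of the same one-level improvement on the sum score, per item.
work_effect = weighted_sum(work_one_level_better) - weighted_sum(baseline)
walking_effect = weighted_sum(walking_one_level_better) - weighted_sum(baseline)

print(f"work: +{work_effect:.2f} points, walking: +{walking_effect:.2f} points")
```

Under these assumptions an identical one-level improvement moves the sum score more than three times as much for work as for walking, which is why the rationale behind the weights matters.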

Cross contamination score bias: This is present if answers can be influenced by other diseases. If the question is not properly phrased in relation to the symptoms of the addressed disease, co-lesions or co-morbidities can alter the outcome. The question “do you have pain in your leg?” for example might produce a positive answer in a patient with severe radiating low back pain but degenerative hip disease might also produce a positive response.

Suggestions: suggestions are made for avoiding bias when using each of the five questionnaires. These suggestions should be carefully considered. Changes only in the calculation of the main and sub-scores will not change the content of the questionnaire. However, changes to the content of questions or answers will necessitate a new validation process before the changed questionnaire can be used.

None of the mentioned scale construction problems will be detected by internal or external validation and the same is the case with questions that do not offer disjunctive answers. These issues are structural problems that should be eliminated before validating a questionnaire. A proper outcome validation is hardly possible if one of the mentioned effects or a score bias is present. Note that the overview analysis of each questionnaire is always based on the latest available English version.

The Oswestry disability index (ODI)

The ODI was initiated in 1976 and version 2.0 [21] is recommended [1, 52] for use. The administration is easy. A slightly modified ODI is used in the NASS [11] questionnaire (see below). The ODI and the Roland Morris questionnaire are the most thoroughly validated questionnaires [23, 24, 28, 31, 36] and have good reliability and internal consistency. The ODI responsiveness seems to be acceptable but not as good as that of the Roland Morris score [12, 56]. The ODI can also be used for cervical problems [64]. However, Taylor et al. [59] found that the ODI is more sensitive to patients who had improved and less sensitive to patients whose condition remained unchanged. This fact is closely related to its floor effect [47]. External validation against different questionnaires shows neither an advantage nor a disadvantage of the ODI [12, 24, 25, 31, 35, 36, 39, 64] compared to other assessment tools. Slight item weighting, cross contamination and item masking bias are expected. The ODI is validated in English [22], German [40, 41], French [16], Finnish [27] and Greek [6]. Translations into several other languages do not appear to be validated.

Conclusion: The ODI is a simple, well analysed questionnaire widely used in comparative studies. Because of its floor effect, the ODI is not recommended for the assessment of preventive measures. In severely affected collectives the ODI can be recommended for in-depth scientific research studies if combined with the NASS (see below).

The Roland Morris disability questionnaire (RMDQ)

The RMDQ was derived from the Sickness Impact Profile: 24 out of 136 items were selected, and the questionnaire was published in 1983 [53]. The RMDQ is short, simple to administer and widely used. Despite several published modification proposals [15, 47, 57, 60], the original version of the RMDQ is favoured by an international expert group [14]. Good reliability [13, 33, 53] and internal consistency [31, 39, 51, 57] are reported. The RMDQ seems to detect changes over time slightly better than the Oswestry scale [12, 39], especially in patients with mild disabilities [1], provided that the initial score is in the range between 4 and 20 [55]. Nevertheless, the RMDQ lacks a ceiling effect [55] in severely affected individuals. The RMDQ is validated in English [53], French [9], German [63], Greek [7], Portuguese [45], Spanish [37], Swedish [33] and Turkish [38] and is available in several other non-validated language versions. A significant advantage is that the RMDQ questions are straightforward and consistently related to the back, offering simple Yes/No answers. Thus the RMDQ does not display uncontrolled item weighting or cross contamination.

Conclusion: The RMDQ is the low back assessment tool of choice if combined with a general health assessment and used in a mildly to moderately affected low back pain collective.

The low back outcome score (LBOS)

The questionnaire was first published in 1992. The answer options of each item are scaled. For pain an eleven point scale ranging from “no pain” to “maximum pain possible” is used. The other items use a four point scaled text. The total score gives different weights to different questions. Although this weighting has to be conducted with care, the administration is easy. All aspects of the ICF classification [26] are considered. Slight extensions to the questionnaire have been reported [30]. The LBOS is reliable [30] and provides good internal consistency [30]. The MCID is reported to be 7.5 points [34]. The LBOS correlates well with the ODI (r = 0.87) [59]. A ceiling effect was reported in a non-compensated low back pain population [25]. The English version is validated; non-validated German and Spanish versions are available. The questionnaire considers the different dimensions of the ICF classification, so summing the items into a total score may lead to item masking. Nevertheless, the consideration of the different ICF dimensions provides a multidimensional assessment of the patient and is a strength of the LBOS. There is a possibility of item weighting bias and cross contamination bias. It is suggested that the sum score should not be used and that sub-scores for each of the dimensions (pain, functional pain, and ability items) should be used instead.

Conclusion: If the above suggestions are considered, the LBOS is useful because it is short, covers the important aspects of the treatment outcome and clearly discriminates between pain and disability. Thus the LBOS can be suggested if only a short assessment (such as for registries or for the private use of one surgeon’s patient collective) is envisioned. Unfortunately only the English version is validated.

The Quebec back pain disability scale (QBPDS)

The QBPDS was validated on a back pain population and published in 1995 [36]. The questions were designed using a conceptual model. Item selection was done using factor analysis of 46 disability items. Twenty items were selected and tested for reliability. The QBPDS measures only functional disability (self care, walking, sitting, standing, lifting, sport, stairs and housework) and sleep. Pain has to be evaluated with other tools. Items about social life, sex life and the need for help are not included. Nevertheless, the items give a comprehensive view of the patient’s disabilities because easy as well as more difficult functional abilities are covered. The administration is easy. The MCID lies between 14 and 19 points [12, 35] and the questionnaire seems to be as reliable as the ODI or the RMDQ. Floor and ceiling effects are not reported and the QBPDS is validated in English, French [36] and Dutch [54]. The simple, disability-focused questions and the consistent five answer options are the strengths of this questionnaire. Item masking is not present but uncontrolled item weighting and cross contamination might be present.

Conclusion: The questionnaire is well focused on disabilities and offers consistent answers, which makes it an excellent disability assessment tool. If the QBPDS is combined with an independent pain assessment tool, it can be recommended for low back pain assessment. Unfortunately this questionnaire lacks validated translations into other languages and is not used as often as the RMDQ or the ODI.

The NASS lumbar spine outcome (NASS LSO) assessment instrument

The NASS was first published by Daltroy et al. [11] and is based on a consensus of the North American Spine Society. It considers all aspects of the ICF classification, e.g. demographic data (age, sex, race, education and insurance information), medical history (co-morbidities, past surgeries etc.), body functions (pain, neurogenic symptoms etc.) and employment history. The questionnaire construct includes the SF36, a modified ODI, and a modified employment assessment published by Bigos et al. [2]. For the follow-up assessment the questionnaire is slightly modified. It takes 20–25 min to fill in the form and the scoring is complex. Like all the other questionnaires mentioned, the NASS is reliable and shows good internal consistency [11]. Data on the MCID as well as research on floor and ceiling effects are not available in the literature. The NASS LSO is validated in English [11], German [49] and Italian [46]. With its 62 main questions, the NASS LSO baseline questionnaire is by far the longest of all analysed instruments. Pain is a very dominant factor assessed with different methods. The pain questions themselves (ODI), however, do not explicitly relate pain and disability to the back. Instead, a pain locator (a picture on which the client can mark the location of pain) is offered, where the various painful body regions can be clearly indicated. The pain locator is a useful instrument for drawing a precise pattern of pain distribution that reduces undetected cross contamination effects. Questions that are not specific to a location can be set in relation to the location marked in the pain locator. This increases the value of the ODI significantly. Because different questionnaires are combined, the questions are not posed in a homogeneous way. As a consequence, certain questions are asked twice. Item masking would be present if all dimensions of the NASS were summed, but such a sum is not meaningful because sub-scores are calculated according to the original questionnaires.

Conclusion: The NASS enables an extensive outcome assessment in which pain is very dominant in the low back specific questions. In contrast to the ODI, pain can be indicated precisely (pain drawing), so that changes related to the pain origin become visible. Time consumption, administration and statistics limit the use of the NASS to scientists who can take advantage of a scientific environment (statistician, study nurse or equivalent, etc.).

Discussion

No gold standard exists for outcome assessment in low back therapies. Each of the studied questionnaires has its advantages and disadvantages. However, all of the questionnaires have been validated and have proved to be reliable and consistent.

Scientists who stress the importance of comparing data with other collectives should use the RMDQ or the ODI. The ODI seems to be more discriminative in patients with severe low back pain, while the RMDQ seems to be more sensitive in less severely affected patients. These two questionnaires have to be combined with a general health assessment tool such as the EuroQol (no fee) [8, 20], the WHODAS II (no fee) [61] or the SF36/SF12 (fee to pay) [62].

Spine specialists who want to assess their own patient collectives should favour an easy to administer questionnaire which gives them a quick and good overview. In this case the LBOS can be suggested. Ideally the LBOS is combined with the five questions of the EuroQol [8, 20]. Unfortunately the LBOS is not validated in languages other than English.

Scientists who work together with a research department and are not discouraged by the administrative effort required should take advantage of the NASS LSO assessment instrument. However, this instrument is available in only a few languages.