Introduction

When all the costs, direct and indirect, were individually determined during a 2-year period among nearly 2,000 employed Swedish men and women sick-listed 28 days or more due to low-back or neck problems, it was found that the direct costs, the costs for all medical interventions, including surgery, were only 6.9% of the total costs [20]. Although higher in that study (93.1%) than earlier reported, the indirect costs are known to be the major costs for back problems and increasing with the duration of the work disability [18, 43, 48].

Strikingly low return to work rates within 2 years (63% in the US and Sweden, 40% in Denmark, 35% in Germany, 72% in the Netherlands and 60% in Israel) were recently reported in a multinational study of subjects sick-listed more than three months for back problems [8]. In that multinational study, which aimed at reflecting “the everyday” treatment and rehabilitation of those chronic patients, there was no clear evidence of any effect on pain or back function, or on the return to work rate, of any of the non-surgical medical treatments or vocational rehabilitation tried [22]. Those “negative findings” confirmed what others have found, that reliable scientific evidence is missing for an effective treatment of long-lasting work-related back or neck problems [3, 17, 22, 35, 40]. For secondary prevention, many treatment efforts and modalities, and their combinations, have been tried, but without definite and reproduced success in bringing people back to work, especially those with long-lasting disabling spine problems [22, 24, 37, 40, 44].

Awareness of the multifaceted character of the problem has generated trials emphasising exercise, education, ergonomic intervention or behavioural therapy alone or combined. In spite of such efforts, scientific evidence for a generally working concept is weak [17, 44]. Several studies have reported that modified work programs increase the cumulative return to work (RTW) rate and decrease the disability time considerably [29]. Whether such programs have a general effect or influence upon certain groups of sick-listed people has not been clarified, however.

For the back and neck patients as a group the work ability/inability is a multifaceted problem, to a great extent consisting of factors other than strictly medical ones [21, 40, 46]. Numerous high quality studies have looked for predictors not only of the medical outcome but of work capacity/incapacity as well [1, 26, 36]. With few exceptions, the predictive and detective ability of the reported predictors has been limited [34]. Many of the frequently reported predictive factors have been quite general, such as higher age, earlier episodes of spine problems, sciatica or leg pain, while other more specific factors have had uncertain validity and frequently have consisted of single items from specific questionnaires, alone or combined [19, 21, 42].

Presently, and at large, it seems as if the sick-listed sub-chronic or chronic back or neck patient, as a consequence of continuous problems, receives one usually ineffective “new” treatment/examination after the other until “everything” has been tried [22].

It is reasonable to assume that an early, versatile and reliable predictor of, for example, prognosis for RTW or non return to work (NRTW) would at least hold back some of the apparently ineffective interventions that not only add to costs but also probably prolong the already protracted course of the problems.

The object of this study was to perform a comparative study of the ability of various commonly used health measures to predict RTW or NRTW in a cohort of men and women sick-listed for more than 28 days due to low-back pain (LBP) or neck pain (NP).

Material and methods

Design

The study was a prospective cohort questionnaire and register study with a 2-year follow-up. The participants received mailed self-administered questionnaires after 28 days, 90 days, 1 year and 2 years.

Participants

A cohort of 1,822 employed Swedish men and women between 18 and 59 years sick-listed for 28 days due to either LBP or NP (sick listing certified by a physician) were selected consecutively from five socio-demographically representative regional Swedish General Insurance Offices [5]. Of the 1,822 subjects, 247 persons were not included in the present study since they reported both LBP and NP. A higher proportion of the remaining 1,575 subjects were women (56%), and more than 70% of all subjects suffered from LBP. Among the persons who were sick-listed due to neck pain, 70% were women. Because of these differences the participants in the study were presented in four groups: (1) men with LBP, (2) men with NP, (3) women with LBP and (4) women with NP. Self-employed and unemployed people and those with generalised arthritis or a fracture, tumour or infection, as well as women suffering from back pain in connection with pregnancy, were excluded from the study.

Questionnaires

The questionnaires included a large set of baseline characteristics, such as working and family conditions, education, economy, treatments and rehabilitation measures, and also ten different commonly used instruments reflecting various aspects of present health status, such as health-related quality of life, pain, disability, back function and depressivity (Table 1).

Table 1 List of the instruments/scales used in the study, and the times at which the data were collected (X time when the different instruments were applied)

EuroQol

EuroQol (EQ-5D) is a generic health-related quality of life measure. It provides a single index. The individuals classify their own health status in five dimensions (5D): mobility, self-care, usual activity, pain/discomfort and anxiety/depression, each within three levels: no problems, moderate problems and severe problems. The instrument yields a total of 243 possible states, and the Time Trade Off method is used to rate the different states of health. The value 0.00 indicates dead and 1.00 indicates full health. The EuroQol thermometer is a visual analogue scale, on which the respondent is asked to mark his or her health between 0 and 100 [710, 13, 14].
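The count of 243 states follows from five dimensions with three levels each (3^5). A minimal sketch that enumerates them is given here; the numeric level coding 1 to 3 and the names `DIMENSIONS`, `LEVELS` and `enumerate_states` are assumptions made for illustration, and the Time Trade Off valuations themselves are not reproduced.

```python
from itertools import product

# Five EQ-5D dimensions as listed in the text.
DIMENSIONS = (
    "mobility",
    "self-care",
    "usual activity",
    "pain/discomfort",
    "anxiety/depression",
)

# Assumed coding: 1 = no problems, 2 = moderate problems, 3 = severe problems.
LEVELS = (1, 2, 3)


def enumerate_states():
    """Return every combination of one level per dimension."""
    return list(product(LEVELS, repeat=len(DIMENSIONS)))


states = enumerate_states()
print(len(states))  # number of distinct health states
```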

Hannover ADL

This is a questionnaire developed for measuring back-pain-related disability. It is a self-administered questionnaire of 12 questions for the assessment of functional limitations in activities of daily living among patients with musculoskeletal disorders. The 12 items have to be scored, summed and transformed on to a scale from 0 (worst back function) to 100 (best back function) [25].

Thirty-six-item short form health survey

The questionnaire is a standardized, generic self-administered instrument. The thirty-six-item short form health survey (SF-36) describes eight domains of health with each scored from 0 (poor health) to 100 (optimal health) and results in a health profile. It is not designed to generate a single index. In the present study, only the subscales for General Health, Vitality, Social Functioning and Mental Health were used [45, 47].

Von Korff’s pain and disability score

This questionnaire was made as a simple method of grading the severity of chronic pain for use in general population surveys and studies of primary care pain patients. It included seven questions which measure pain experience and functional restrictions during the last 6 months. Low values on the scale mean less pain or disability [27].

Zung’s depressivity scale

It is a self-rating depression scale. The most commonly found characteristics of depression were used and divided into 20 items, of which ten were worded symptomatically positive and ten negative. The scale is constructed so that the less depressed patients will have a low score on the scale and the more depressed patients will have a higher score [49].

Higher values on the scales indicate better health or function, with the exceptions of Von Korff’s pain and disability score and Zung’s depressivity scale, where better health is indicated by low values. For most scales, data were obtained at 28 days and 2 years, and therefore only the values obtained at these times were used in the comparative study.

Response rates

The response rates at 28 days and 2 years differed considerably according to gender and diagnoses (back or neck diagnoses) (Table 2). After 28 days, women and those with NP were more apt to respond than men and those with LBP. The number of respondents decreased substantially between 28 days and 2 years, especially among men with NP, where the non-response rate exceeded 50%. Sick-listing status (i.e. RTW or NRTW) during the 2-year study was obtained for all subjects (100%) from the participating General Health Insurance offices. From those data a rather comprehensive non-response analysis was made, which showed that the non-respondents included a higher proportion of young people, males, individuals with LBP and people sick-listed for a shorter period [20, 22].

Table 2 The response rates after 28 days and after 2 years among men and women, and among those with LBP and NP diagnoses. The percentage of respondents varied slightly between the different scales

Prevalence of work resumption

The ability to predict work resumption based on the value of the scale will be dependent on the prevalence, i.e. the proportion of subjects who have not returned to work after a certain time. The prevalence decreased steadily from 28 days to 2 years with different patterns depending on gender and diagnosis, which is reported elsewhere [5]. The prevalence for respondents and non-respondents obtained for scale 1 (EQ-5D) is presented in Table 3. A similar pattern was obtained for all the other scales. No significant difference in prevalence was detected between respondents and non-respondents, in spite of the relatively large sample sizes. It thus seems reasonable to assume that the respondent data were not biased in this respect.

Table 3 Prevalence (proportion NRTW) among those who participated in the study. The corresponding prevalence for the non-respondents is shown in parentheses

Determination of a cut-off value c representing RTW or NRTW for each scale

In order to study how well the subject’s value on a scale may serve as a predictor of work resumption, an approach based on a statistical diagnostic test was used [11]. Here, the ability of a certain scale to serve as a predictor is expressed in terms of probability (proportion) of correct predictions. To this end, one first has to determine a cut-off value c on the scale such that a subject is defined as RTW on scale for values in one direction of c and NRTW on scale in the opposite direction. Since the direction of the scales used in the present study differed, NRTW on scale was used for values below c on the scales 1–7 in Table 1 and NRTW on scale for values above c on the scales 8–10 in Table 1. One proper way to determine the numerical value of c is considered below.
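The direction-dependent classification described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `classify_on_scale` is hypothetical, and the handling of a value exactly equal to c (treated here as RTW on scale) is an assumption, since the text does not specify it.

```python
# Scales 1-7 in Table 1: low values indicate NRTW on scale.
LOW_MEANS_NRTW = frozenset(range(1, 8))
# Scales 8-10 in Table 1: high values indicate NRTW on scale.
HIGH_MEANS_NRTW = frozenset(range(8, 11))


def classify_on_scale(scale_no, value, c):
    """Classify one subject as 'RTW' or 'NRTW' on a scale, given cut-off c.

    Assumption: a value exactly equal to c counts as RTW on scale.
    """
    if scale_no in LOW_MEANS_NRTW:
        return "NRTW" if value < c else "RTW"
    if scale_no in HIGH_MEANS_NRTW:
        return "NRTW" if value > c else "RTW"
    raise ValueError(f"unknown scale number: {scale_no}")


# Illustrative values only (not study data).
print(classify_on_scale(1, 0.50, 0.65))
print(classify_on_scale(9, 30, 40))
```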

The true states to be predicted were termed RTW if the subject had returned to work and NRTW if not. Consider the following definitions: predictive value of RTW on scale (PRTW) = proportion of subjects who are RTW given they are RTW on scale, and predictive value of NRTW on scale (PNRTW) = proportion of subjects who are NRTW given they are NRTW on scale. These predictive values are dependent on the cut-off value c and also on the prevalence, so that with decreasing prevalence PRTW increases, while PNRTW decreases. This is illustrated in Fig. 1, where PRTW and PNRTW are plotted against c on scale 1 (EQ-5D score) for men with LBP diagnosis, as an example. Notice that PNRTW approaches the value of the prevalence as c increases, while PRTW tends to 1. It is also seen that the PRTW−PNRTW difference increases as the prevalence decreases. Since the predictive values are dependent on the prevalence, they are less useful in a comparative study. Instead the following relative measures were used:

$$ \text{RelPRTW} = \frac{\text{PRTW} - (1 - \text{Prevalence})}{1 - \text{Prevalence}} $$
(1)
$$ \text{RelPNRTW} = \frac{\text{PNRTW} - \text{Prevalence}}{\text{Prevalence}} $$
(2)

Here, RelPRTW and RelPNRTW can be interpreted as the relative gain obtained by using the scale as a predictor rather than merely using the relative frequency of RTW and NRTW, respectively. RelPNRTW=0.70 thus means that 70% more NRTW subjects can be classified correctly by using the scale as a predictor, than a prediction based on the relative frequency of NRTW.
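As a minimal sketch, the relative measures (1) and (2) can be computed directly from a predictive value and the prevalence. The function names `rel_prtw` and `rel_pnrtw` and the input numbers are hypothetical and chosen only to reproduce the 0.70 interpretation above.

```python
def rel_prtw(prtw, prevalence):
    """Relative gain in predicting RTW, measure (1).

    The baseline is the relative frequency of RTW, i.e. 1 - prevalence.
    """
    baseline = 1.0 - prevalence
    return (prtw - baseline) / baseline


def rel_pnrtw(pnrtw, prevalence):
    """Relative gain in predicting NRTW, measure (2).

    The baseline is the relative frequency of NRTW, i.e. the prevalence.
    """
    return (pnrtw - prevalence) / prevalence


# Hypothetical example: prevalence 0.50 and PNRTW 0.85 give a relative gain of 0.70.
print(round(rel_pnrtw(0.85, 0.50), 2))
```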

Fig. 1

a The EQ-5D values after 28 days and RTW within 90 days for men with LBP diagnoses. Prevalence (NRTW) at 90 days 0.56. b The EQ-5D values after 28 days and RTW within 1 year for men with LBP diagnoses. Prevalence (NRTW) at 1 year 0.19. c The EQ-5D values after 28 days and RTW within 2 years for men with LBP diagnoses. Prevalence (NRTW) at 2 years 0.12

Sensitivity, specificity and some further diagnostic measures

Besides the predictive measures (1) and (2) above, the scales were compared regarding the detective concepts Sensitivity = proportion of NRTW which are classified as NRTW on scale and Specificity = proportion of RTW which are classified as RTW on scale. Both these measures depend on the cut-off point c. By plotting Sensitivity on the Y-axis against 1−Specificity on the X-axis for various values of c, one obtains the receiver operating characteristic (ROC) curve (Fig. 2). The latter is used to compare the detective ability of alternative diagnostic tests in such a way that the diagnostic test with the highest values on the ROC curve is preferred [4]. In this study there was a large number of ROC curves to be compared. In order to simplify the comparisons, only one point on the ROC curve was used and the following measure was found to be useful:

$$ {\text{ROClevel}} = \sqrt {({\text{Sensitivity}})^2 + ({\text{Specificity}})^2 } . $$
(3)

The measure (3) is simply the distance from the point (1−Specificity, Sensitivity) on the curve to the point (1, 0). The measure can take values within the interval \( \left( {0,\sqrt 2 \approx 1.41} \right), \) large values indicating high detecting ability of the diagnostic test. In practice one should require that (3) is larger than \( 1/\sqrt 2 \approx 0.71 \) since the latter limit is obtained for a diagnostic test which produces false classifications at the same rate as true classifications.
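A short sketch of measure (3) is given here, together with a check that it equals the Euclidean distance from the ROC point to the corner (1, 0) of the plot. The function names `roc_level` and `distance_to_corner` are hypothetical, and the numeric inputs are illustrative only.

```python
from math import sqrt


def roc_level(sensitivity, specificity):
    """Measure (3): sqrt(Sensitivity^2 + Specificity^2)."""
    return sqrt(sensitivity ** 2 + specificity ** 2)


def distance_to_corner(x, y):
    """Distance from the ROC point (x, y) = (1 - Specificity, Sensitivity) to (1, 0)."""
    return sqrt((1.0 - x) ** 2 + y ** 2)


# A test that classifies correctly and falsely at the same rate sits at the 0.71 limit.
print(round(roc_level(0.5, 0.5), 2))

# The two formulations agree for any sensitivity/specificity pair.
sens, spec = 0.8, 0.6
print(abs(roc_level(sens, spec) - distance_to_corner(1.0 - spec, sens)) < 1e-12)
```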

Fig. 2

Sensitivity (proportion of NRTW which are classified as NRTW on EQ-5D scale) plotted against 1 minus specificity (specificity being the proportion of RTW which are classified as RTW on EQ-5D scale) at 90 days for men with LBP. Each point on the ROC curve has a corresponding cut-off value

Predictive measures are used for predicting the outcome on the basis of a diagnostic test. Detective measures are typically used when one wants to replace a comprehensive health examination by a simpler one. Here the emphasis will be on prediction.

The problem of determining a proper cut-off value c is complicated by the fact that a large value of c on the scales 1–7 increases PRTW but decreases PNRTW (Fig. 1), with a similar reversed relation for Sensitivity and Specificity. In the present study, a good compromise was to choose c as the largest value on the scales 1–7 (and the smallest value on the scales 8–10) for which PNRTW−Prevalence > PRTW−(1−Prevalence). For all three curves in Fig. 1 this rule yielded c=0.65. In some cases, the PRTW and PNRTW curves were unstable with heavy fluctuations. They sometimes also behaved in an unexpected way, e.g. the PNRTW increased with increasing c on the scales 1–7. In these cases no cut-off value was determined. Such anomalies may to some extent be explained by sampling fluctuations if the sample size is small. However, the most plausible explanation is that the scale has very little to do with the predicted outcome.
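The cut-off rule can be sketched as a search over candidate values of c. The sketch rests on several assumptions that the text does not state: the candidates are taken to be the observed scale values, a value equal to c counts as RTW on scale, and a candidate is skipped when either predicted group is empty. The names `predictive_values` and `choose_cutoff` and the toy data are hypothetical. The judgement about unstable or anomalous curves is not automated here; the function returns `None` only when no candidate satisfies the inequality.

```python
def predictive_values(scores, nrtw, c, low_means_nrtw=True):
    """Return (PRTW, PNRTW) at cut-off c, or None if a predicted group is empty."""
    flagged = [(s < c) if low_means_nrtw else (s > c) for s in scores]
    n_flagged = sum(flagged)
    n_cleared = len(scores) - n_flagged
    if n_flagged == 0 or n_cleared == 0:
        return None
    # PNRTW: share of true NRTW among those classified NRTW on scale.
    pnrtw = sum(1 for f, y in zip(flagged, nrtw) if f and y) / n_flagged
    # PRTW: share of true RTW among those classified RTW on scale.
    prtw = sum(1 for f, y in zip(flagged, nrtw) if not f and not y) / n_cleared
    return prtw, pnrtw


def choose_cutoff(scores, nrtw, low_means_nrtw=True, candidates=None):
    """Largest c (scales 1-7) or smallest c (scales 8-10) satisfying
    PNRTW - Prevalence > PRTW - (1 - Prevalence); None if no candidate does."""
    prevalence = sum(nrtw) / len(nrtw)
    if candidates is None:
        candidates = sorted(set(scores))  # assumption: observed values as grid
    admissible = []
    for c in candidates:
        values = predictive_values(scores, nrtw, c, low_means_nrtw)
        if values is None:
            continue
        prtw, pnrtw = values
        if pnrtw - prevalence > prtw - (1.0 - prevalence):
            admissible.append(c)
    if not admissible:
        return None
    return max(admissible) if low_means_nrtw else min(admissible)


# Toy data (not study data): low scores tend to go with NRTW (True).
toy_scores = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
toy_nrtw = [True, True, True, False, True, False, False, False]
print(choose_cutoff(toy_scores, toy_nrtw))
```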

The ability of the scales in Table 1 to predict work resumption, as measured by RelPRTW and RelPNRTW, and to detect those subjects who returned to work, as measured by ROC level, was ranked. This was done separately for men with LBP, men with NP, women with LBP and women with NP. The measures (1), (2) and (3) were calculated for scale values after 28 days and work resumption after 90 days, 1 year and 2 years.

One may use other measures of diagnostic ability, such as rate of true positive accuracy (TP) = number of subjects correctly classified as NRTW/total sample size, rate of true negative accuracy (TN) = number of subjects correctly classified as RTW/total sample size, and overall diagnostic accuracy = TP + TN. These measures (TP and TN) are also presented, although they may be less informative in a forecast situation. To help clarify the diagnostic tests used in this study, a general representation of a diagnostic test is presented below in Table 4 [2].

Table 4 General representation of diagnostic test.
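The relations between the diagnostic measures used in this study can be sketched from the four cells of a 2×2 classification table. The layout, the function name `diagnostic_measures` and the counts are assumptions for illustration; TP and TN are read here as the shares of the whole sample correctly classified as NRTW and RTW, so that their sum is the overall diagnostic accuracy.

```python
def diagnostic_measures(tp, fn, fp, tn):
    """Compute the study's measures from a 2x2 table of counts.

    tp: NRTW classified as NRTW on scale    fn: NRTW classified as RTW on scale
    fp: RTW classified as NRTW on scale     tn: RTW classified as RTW on scale
    """
    n = tp + fn + fp + tn
    return {
        "prevalence": (tp + fn) / n,      # proportion NRTW
        "sensitivity": tp / (tp + fn),    # NRTW detected as NRTW
        "specificity": tn / (tn + fp),    # RTW detected as RTW
        "PNRTW": tp / (tp + fp),          # predictive value of NRTW on scale
        "PRTW": tn / (tn + fn),           # predictive value of RTW on scale
        "TP": tp / n,                     # rate of true positive accuracy
        "TN": tn / n,                     # rate of true negative accuracy
        "accuracy": (tp + tn) / n,        # overall diagnostic accuracy = TP + TN
    }


# Hypothetical counts, not study data.
measures = diagnostic_measures(tp=40, fn=10, fp=20, tn=30)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```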

Results

Summary characteristics of the scales

The mean scores on the various scales are presented in Table 5 for gender and type of diagnoses. With few exceptions, there was an improvement in the different aspects (quality of life, pain, function and depressivity) the scales were meant to measure from day 28 up until 2 years. Generally, the improvement—as reflected by the different scales—was more pronounced for LBP than for NP, and for men with LBP compared with women having LBP. There was also an overall tendency that men scored “better” than women. This pattern was most obvious for patients with LBP, where the largest differences between men and women were obtained for the scales 3 (HADL), 6 (SF36 Social function) and 9 (VKP). Here, the men scored 7–15% better and the differences were highly significant (P<0.001).

Table 5 Mean scores for the scales 1–10 in Table 1 after 28 days and 2 years. RTW is indicated by large values on scales 1–7 and by small values on scales 8–10. (Scales: 1 EQ-5D; 2 EuroQol thermometer; 3 HADL; 4 SF-36VI; 5 SF-36MH; 6 SF-36SF; 7 SF-36GH; 8 Zung; 9 Von Korff’s pain and 10 Von Korff’s score)

Table 6 shows the correlations between the scores on the different scales after 28 days according to gender and type of diagnoses. The negative signs for scales 8–10 are simply due to the fact that an improvement (better) is indicated by lower values on these scales. When combining scales, one should look for scales with low correlations in absolute value. Such correlations are found between the groups of scales (5, 6) and (1–3), and also between the groups of scales (9, 10) and (1–8), respectively. It must be noted that the correlations between the scales within the different groups were high. The implication of this is that, if one wants to combine values from several scales, then one should choose some of (1–3), some of (5, 6) and some of (9, 10). These combinations were valid in all four gender and diagnosis groups.
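The search for weakly correlated scales can be sketched as below. This is only an illustration: the text does not say which correlation coefficient was used, so the Pearson coefficient is an assumption, as are the threshold, the function names `pearson` and `low_correlation_pairs`, and the toy scores.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def low_correlation_pairs(scores_by_scale, threshold):
    """Return (scale_a, scale_b, r) for pairs with |r| below the threshold."""
    pairs = []
    for a, b in combinations(sorted(scores_by_scale), 2):
        r = pearson(scores_by_scale[a], scores_by_scale[b])
        if abs(r) < threshold:
            pairs.append((a, b, r))
    return pairs


# Toy scores for three scales (not study data); the keys are scale numbers.
toy = {1: [1, 2, 3, 4], 2: [2, 4, 6, 8], 9: [1, -1, -1, 1]}
for a, b, r in low_correlation_pairs(toy, threshold=0.3):
    print(a, b, round(r, 2))
```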

Table 6 Correlations (%) between the scores on the scales after 28 days according to gender and diagnosis. No values from the scale 7 SF-36GH were obtained. Scales: 1 EQ-5D; 2 EuroQol thermometer; 3 HADL; 4 SF-36VI; 5 SF-36MH; 6 SF-36SF; 7 SF-36GH; 8 Zung; 9 Von Korff’s pain and 10 Von Korff’s score

Predictive and detective ability of the scales

Tables 7–18 reflect the predictive and detective ability of the scales 1–10 (except for 7) when the scores were obtained after 28 days and the outcome (RTW or NRTW) after 90 days, 1 and 2 years. In the tables, the Sensitivity, Specificity, TP and TN are also presented. Some of the scales could not be used, due to the anomalous behaviour of the PRTW and PNRTW curves described in the Material and methods. This was especially notable for outcomes far ahead in time and for women.

Table 7 Outcome after 90 days, men with LBP and NP. The relative predictive value of not return to work (Rel PNRTW) and the relative predictive value of return to work (Rel PRTW) are shown in parentheses (Prevalence LBP 0.56, NP 0.58). Scale: 1 EQ-5D; 2 EuroQol thermometer; 3 HADL; 4 SF-36VI; 5 SF-36MH; 6 SF-36SF; 7 SF-36GH; 8 Zung; 9 Von Korff’s pain and 10 Von Korff’s score
Table 8 Outcome after 90 days, men with LBP and NP. The Sensitivity (Sens), Specificity (Spec), true positive (TP) and true negative (TN) values for the different scales. Scale: 1 EQ-5D; 2 EuroQol thermometer; 3 HADL; 4 SF-36VI; 5 SF-36MH; 6 SF-36SF; 7 SF-36GH; 8 Zung; 9 Von Korff’s pain and 10 Von Korff’s score
Table 9 Outcome after 90 days, women with LBP and NP. The relative predictive value of not return to work (Rel PNRTW) and the relative predictive value of return to work (Rel PRTW) are shown in parentheses (Prevalence LBP 0.65, NP 0.63). Scale: 1 EQ-5D; 2 EuroQol thermometer; 3 HADL; 4 SF-36VI; 5 SF-36MH; 6 SF-36SF; 7 SF-36GH; 8 Zung; 9 Von Korff’s pain and 10 Von Korff’s score
Table 10 Outcome after 90 days, women with LBP and NP. The Sensitivity (Sens), Specificity (Spec), true positive (TP) and true negative (TN) values for the different scales. Scale: 1 EQ-5D; 2 EuroQol thermometer; 3 HADL; 4 SF-36VI; 5 SF-36MH; 6 SF-36SF; 7 SF-36GH; 8 Zung; 9 Von Korff’s pain and 10 Von Korff’s score
Table 11 Outcome after 1 year, men with LBP and NP. The relative predictive value of not return to work (Rel PNRTW) and the relative predictive value of return to work (Rel PRTW) are shown in parentheses (Prevalence LBP 0.19, NP 0.27). Scale: 1 EQ-5D; 2 EuroQol thermometer; 3 HADL; 4 SF-36VI; 5 SF-36MH; 6 SF-36SF; 7 SF-36GH; 8 Zung; 9 Von Korff’s pain and 10 Von Korff’s score
Table 12 Outcome after 1 year, men with LBP and NP. The Sensitivity (Sens), Specificity (Spec), true positive (TP) and true negative (TN) values for the different scales. Scale: 1 EQ-5D; 2 EuroQol thermometer; 3 HADL; 4 SF-36VI; 5 SF-36MH; 6 SF-36SF; 7 SF-36GH; 8 Zung; 9 Von Korff’s pain and 10 Von Korff’s score
Table 13 Outcome after 1 year, women with LBP and NP. The relative predictive value of not return to work (Rel PNRTW) and the relative predictive value of return to work (Rel PRTW) are shown in parentheses (Prevalence LBP 0.27, NP 0.27). Scales: 1 EQ-5D; 2 EuroQol thermometer; 3 HADL; 4 SF-36VI; 5 SF-36MH; 6 SF-36SF; 7 SF-36GH; 8 Zung; 9 Von Korff’s pain and 10 Von Korff’s score
Table 14 Outcome after 1 year, women with LBP and NP. The Sensitivity (Sens), Specificity (Spec), true positive (TP) and true negative (TN) values for the different scales. Scale: 1 EQ-5D; 2 EuroQol thermometer; 3 HADL; 4 SF-36VI; 5 SF-36MH; 6 SF-36SF; 7 SF-36GH; 8 Zung; 9 Von Korff’s pain and 10 Von Korff’s score
Table 15 Outcome after 2 years, men with LBP and NP. The relative predictive value of not return to work (Rel PNRTW) and the relative predictive value of return to work (Rel PRTW) are shown in parentheses (Prevalence LBP 0.12, NP 0.20). Scales: 1 EQ-5D; 2 EuroQol thermometer; 3 HADL; 4 SF-36VI; 5 SF-36MH; 6 SF-36SF; 7 SF-36GH; 8 Zung; 9 Von Korff’s pain and 10 Von Korff’s score
Table 16 Outcome after 2 years, men with LBP and NP. The Sensitivity (Sens), Specificity (Spec), true positive (TP) and true negative (TN) values for the different scales. Scale: 1 EQ-5D; 2 EuroQol thermometer; 3 HADL; 4 SF-36VI; 5 SF-36MH; 6 SF-36SF; 7 SF-36GH; 8 Zung; 9 Von Korff’s pain and 10 Von Korff’s score
Table 17 Outcome after 2 years, women with LBP and NP. The relative predictive value of not return to work (Rel PNRTW) and the relative predictive value of return to work (Rel PRTW) are shown in parentheses (Prevalence LBP 0.17, NP 0.20). Scale: 1 EQ-5D; 2 EuroQol thermometer; 3 HADL; 4 SF-36VI; 5 SF-36MH; 6 SF-36SF; 7 SF-36GH; 8 Zung; 9 Von Korff’s pain and 10 Von Korff’s score
Table 18 Outcome after 2 years, women with LBP and NP. The Sensitivity (Sens), Specificity (Spec), true positive (TP) and true negative (TN) values for the different scales. Scale: 1 EQ-5D; 2 EuroQol thermometer; 3 HADL; 4 SF-36VI; 5 SF-36MH; 6 SF-36SF; 7 SF-36GH; 8 Zung; 9 Von Korff’s pain and 10 Von Korff’s score

On the whole, RelPNRTW for the highest ranked scales was larger for men than for women and for subjects with NP than for those with LBP. Men and those with NP were thus easier to predict as NRTW. Scale 1 (EQ-5D), followed by 3 (HADL) and 2 (EuroQol thermometer), performed best on the whole as judged by RelPNRTW. It is notable that the scale 6 (SF-36 Social Function) served very well for predicting NRTW after 90 days, but lost some relative predicting ability after 1 and 2 years.

A somewhat different pattern was seen for the prediction of RTW, as measured by RelPRTW. Here too, the scales 1–3 performed well on the whole, but scale 9 (Von Korff’s pain) was outstanding as a short-time (90 days) predictor for all subjects except women with NP. For that group, scale 9 performed so badly that it could not be used, and scale 8 (Zung) was the outstanding one instead.

These findings suggest that if the prediction of return to work or not is to be based on a single scale, then scale 1 (EQ-5D) should be used. A prediction based on this scale for men with NP and the outcome after 90 days shows that 92% of the NRTW and 72% of the RTW cases can be correctly predicted (Table 7) [these figures were obtained by inserting the prevalence 0.58 from Table 3 into (1) and (2)]. However, it is evident that one can do better by combining those scales that are good in predicting NRTW with those which are good in predicting RTW. Account should also be taken of the length of the forecast interval, gender and diagnosis. Accordingly, the scales which have been found useful for predicting the outcome after 90 days are 1–3, 6 and 9, with the exception of women with NP, for whom scale 9 should be replaced by scale 8.
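The conversion between relative and absolute predictive values follows from solving (1) and (2) for PRTW and PNRTW:

$$ \text{PNRTW} = \text{Prevalence}\,(1 + \text{RelPNRTW}), \qquad \text{PRTW} = (1 - \text{Prevalence})\,(1 + \text{RelPRTW}) $$

As a back-calculation from the figures quoted above (not values read from the tables), a prevalence of 0.58 together with PNRTW = 0.92 and PRTW = 0.72 would correspond to

$$ \text{RelPNRTW} = \frac{0.92}{0.58} - 1 \approx 0.59, \qquad \text{RelPRTW} = \frac{0.72}{1 - 0.58} - 1 \approx 0.71 $$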

Discussion

The results from this study suggest that RTW or NRTW among sick-listed lower back or neck pain patients can be predicted to a high extent, especially through the use of the quality of life measure EQ-5D. The study revealed some particularities; e.g. that men generally scored “better”, i.e. higher QoL, less pain, less functional impairment and less depression than women at all four measuring occasions during the 2-year study. That men scored “better” was especially pronounced for those sick-listed with LBP problems [22]. It was also evident that the response patterns underwent changes with time during the 2-year follow-up. For those reasons it was obvious that the most accurate prediction could be made after grouping of the sick-listed according to gender, diagnoses and elapsed time off work.

It was obvious that of all the separately tested scales EQ-5D had the highest overall ability to predict RTW or NRTW in this cohort of initially (28 days or more) sick-listed LBP or NP subjects. That ability remained irrespective of gender, diagnoses or duration of the problems. EQ-5D’s prediction ability of, for example, 92% of NRTW and 72% of RTW within 90 days is the highest or among the highest seen in the literature. It was also evident that the predictive ability could be even more complete by combining selected scales. To obtain the best result when different scales were combined, scales best at predicting RTW should be combined with those best at predicting NRTW. Although EQ-5D had the best overall capacity, the predictive ability of the different instruments varied somewhat with time, as well as with diagnoses and gender. EQ-5D and von Korff’s pain instrument had the advantage of retaining a high predictive ability irrespective of gender and diagnoses over the 2-year study period.

Von Korff’s pain questionnaire was, for example, best in predicting early RTW (<90 days), except for women with NP. That pain is of particular importance during the early phases of back or neck problems is not surprising and seemed to be confirmed recently in both acute and sub-acute LBP [28]. Why pain was of less importance specifically for women with NP is one of many results which could not be easily explained. The pain level has been found to be of great importance for RTW/NRTW also in chronic LBP patients [15, 16, 26]. When several job and clinical factors that all predicted RTW/NRTW were analysed and compared, it was found that the level of pain was of greatest importance. In one of those studies it was even found that without utilising the pain level RTW/NRTW could not be predicted. That several of the instruments used in the present study, EQ-5D, von Korff’s pain and disability scales, as well as Zung and Hannover ADL, involve dimensions or consequences of pain directly or indirectly might thus be one explanation for their high predictive ability. Several other studies have also found that there are different predictors depending on the duration of time off work (compensated disability) [32]. The predictive ability of EQ-5D revealed in the present study is especially interesting, since EQ-5D was one of the QoL instruments, SF-36 the other, that was recommended for standard use in spine research by a group of international researchers [12]. In their proposal of useful instruments they discuss properties that ought to characterise such instruments [12]. To be recommended for use in a standard battery the instrument must have validity, especially construct validity, but also responsiveness, and be practical to use. As far as we know, EQ-5D’s predictive ability, as shown here, has not been tested before. Since not only its validity, but also responsiveness, specificity and sensitivity were tested here, it apparently is a useful instrument for prediction as well. The simplicity of the EQ-5D instrument in relation to its ability is somewhat astonishing. Even when compared with the most recently tested predictive instrument covering a relatively wide range of psychological variables, EQ-5D’s mere five questions, each with three alternatives, seem quite adequate [33]. The complexity of prediction was shown by Kool et al. [26], who reported that NRTW could be predicted with high probability when “two out of four” of the following were present: a positive step test, positive behavioural signs, a high pain level (nine or ten), or a positive pseudo strength test.

The focus on psycho-social factors for the LBP patient’s work ability, particularly job satisfaction, was set by the findings in the so-called Boeing study during the early 1980s [6]. From that time it was more or less accepted that the disability related to neck and back problems must be understood as having a multifactorial background where psychological and/or psycho-social factors are of considerable importance [8, 17, 22, 30, 32, 38, 40, 41]. That is especially true for disabilities lasting beyond the acute (medical) phase of the problems [21, 39, 40]. Several different physical, psycho-social, social and psychological factors have repeatedly been identified as risk indicators of disability accompanying those problems [8, 22, 23, 31, 40]. In a recent scientific evidence review of risk factors for long-lasting compensated work absence (>3 months) due to back or neck problems, it was concluded that they were very similar to those seen in non-specific pain syndromes [21].

As frequently seen in prospective questionnaire surveys, the non-response rate was rather high, especially after 1 and 2 years. In this particular study, however, the work status (RTW or NRTW) of every participant, responder and non-responder alike, was obtained from the registers of the compulsory National Health Insurance. The effects of missed subjects in those respects were therefore limited, and a rather comprehensive non-response analysis was possible [20]. How can the information from this study be used? Shaw et al. [42] concluded in their wide-ranging review of prognostic factors for low-back disability that an understanding of the importance of many factors is still missing or that the evidence is even contradictory. Although in no way clarifying the specific reason(s), individually or on a group basis, for a sick-listed subject’s decision or ability for RTW or NRTW, some of the tested predictors in the current study, alone or combined, can serve the purpose of detecting early who, with sub-chronic back or neck pain, will or can return early, late or not at all to work. An early accurate prediction can help in directing more adequately, for example, the rehabilitation efforts of bringing someone back to work. Avoiding, for example, “unnecessary interventions” among those where the chances for RTW are good, and instead concentrating vocational and medical rehabilitation resources on those where the risks for an extended disability period are great, ought to improve the outcome. How applicable are the findings? Although this study was undertaken on a Swedish cohort, it is quite likely that the results are applicable to similar patients in other countries. The reason for this is that the Swedish cohort was only one part of a study with identical core protocols that was performed simultaneously in the Netherlands, Germany, the United States (California and New Jersey), Denmark and Israel [8]. The national results from this multinational study, particularly in relation to the main outcome, RTW/NRTW, were in many aspects so similar that the predictors found in the present study might work in the other countries as well [22].

The present paper has focused on finding those scales that might be useful for predicting return to work or not. It turned out that EQ-5D had the highest overall ability in this respect. However, as can be seen from Tables 7–18, there were other scales that occasionally performed better. This suggests that predictions can be improved by combining the values from several scales. For this purpose other statistical techniques are available, for example discriminant analysis or logistic regression. This, however, is beyond the scope of this study.

Conclusions

When RTW or NRTW were predicted in a cohort of sick-listed low-back or neck patients, EQ-5D had outstanding properties in this respect irrespective of gender, diagnosis or elapsed time during a 2-year study.