Introduction

Chronic musculoskeletal pain (CMP) affects approximately 20% of the adult European population [1, 2]. Pain is considered chronic when it persists for three months or longer [3]. Because CMP can impair work ability (WA), it is a major reason for reduced work participation [4, 5]. WA is defined as the ability of workers to do their work according to the demands of the job, in the context of their health and mental resources [6]. It is a comprehensive concept composed of different aspects that are presented as the ‘house of WA’. The foundation of this model is the health aspect, which consists of an amalgam of mental and physical health and social functioning [7]. To measure WA from the perspective of the worker, self-reported outcome measures are used. Self-reported outcome measures need to have adequate measurement properties to justify their use in the clinic or in research [8].

The work ability index (WAI) is the most commonly used WA questionnaire worldwide in occupational health care, clinical practice, and research [9]. This questionnaire correlates moderately to strongly with self-rated general health questionnaires (r = 0.44–0.79) and is therefore considered a valid instrument to estimate WA among healthy workers [10,11,12]. The WAI is a 10-item questionnaire that has been translated into and validated in several languages, including Dutch [10, 13,14,15,16]. The first question of the WAI (“current WA compared with lifetime best WA”) is also known as the work ability score (WAS). This single item was strongly related to the total WAI for assessing the current level and progression of WA among general workers and those who are on long-term sick leave (Rs = 0.63–0.87) [17,18,19]. Because of its brevity, the WAS may be a good alternative to the WAI in research and clinically useful for routine evaluation and interpretation of patient outcomes [20].

Despite the widespread use of the WAS, its test–retest reliability, agreement, construct validity, and responsiveness have not been studied in sick-listed workers with CMP. The research questions for the present study were:

  1. What is the test–retest reliability and agreement of the WAS in sick-listed workers with CMP?

  2. Is the construct validity of the WAS adequate in sick-listed workers with CMP?

  3. What is the responsiveness and minimal clinically important change of the WAS in sick-listed workers with CMP?

Methods

The Consensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist was applied when designing this study [8, 21, 22].

Study Design and Ethics

A retrospective observational cohort study was used to evaluate the measurement properties of the WAS. Data were derived from electronic health records from seven vocational rehabilitation (VR) centres in the Netherlands, collected between November 2014 and October 2019. For the construct validity, a cross-sectional study design was used. A repeated-measurement design with a fifteen-week interval was used to assess test–retest reliability, agreement, and responsiveness.

In the Netherlands, no permission is required from a medical ethics committee for the evaluation of outcomes of care solely based on anonymous data derived from the medical records. All data security and privacy regulations were adhered to. Informed consent was obtained from all workers being included in the study.

Study Sample

The study samples consisted of sick-listed workers with CMP admitted to a fifteen-week multidisciplinary VR program, provided in one of the participating VR centres. The program involved an individualized exercise program, cognitive behaviour therapy, group education, relaxation, and work-related guidance, delivered by a team of healthcare providers [23]. The inclusion criteria for the program were: being of working age (18–65 years), suffering from subacute (6–12 weeks) or chronic (> 12 weeks) musculoskeletal pain, and having decreased work participation (part-time or full-time sick leave or reduced work productivity) [23]. When essential baseline or discharge data were missing, the worker's data were excluded from analysis. Workers were excluded from this study if they had comorbidities other than CMP as the primary reason for sick leave or if they had no paid work.

Procedure

At baseline, before the start of the VR program, personal characteristics were collected and all workers filled out a set of questionnaires, as part of usual care [24]. Questionnaires were sent by mail to be completed individually by the workers at home. At discharge from the VR program, the workers received the set of questionnaires for the second time and also completed the global perceived effect scale.

Measurements

Personal Characteristics

The personal characteristics collected in this study were: age (years), sex (male, female), educational level (low, medium, high), work status at baseline (full-time, part-time, 100% sick leave), extent of contract (hours/days), number of pain locations, and duration of pain (months, years).

Work Ability Score (WAS)

WA was assessed using the WAS, which is the first item of the WAI: ‘What is your current WA compared to your lifetime best WA?’ The question is scored on an 11-point Likert scale, where 0 represents ‘completely unable to work’ and 10 represents ‘WA perceived as lifetime best’. WAS and WAI are strongly related and are good indicators of WA [17].

iMTA Productivity Cost Questionnaire (iPCQ)

Work productivity is determined by the worker’s presence and performance at work. The first phenomenon is known as sickness absence, while the second phenomenon is called presenteeism [25]. Sickness absence and presenteeism were assessed with the iMTA Productivity Cost Questionnaire (iPCQ). Long-term sickness absence related to the reason for which workers came to VR was reported as the number of calendar days between the date on which sick leave was reported and the date on which the baseline questionnaires were filled out. For the workers on short-term sick leave, the number of days on sick leave in the past 4 weeks was reported. The presenteeism score from workers who were partly or completely at work and experienced presenteeism was used. The score ranges from 0 (I couldn’t do anything) to 10 (I could do the same as normal) [26].

Pain Disability Index (PDI)

Self-reported disability related to pain was assessed using the PDI. This questionnaire covers seven areas of activities and participation: family and home responsibilities, recreation, social activity, occupation, sexual behaviour, self-care, and life-support activity. Each area has one question, which is scored on an 11-point rating scale where 0 means no disability and 10 represents maximum disability. The total score ranges from 0 to 70 points, with higher scores indicating more disability [27].

RAND-36 Physical Functioning

Physical functioning was assessed using the physical functioning scale of the RAND-36 Health Survey. This scale consists of 10 questions with three levels of response (‘yes, strongly limited’, ‘yes, a bit limited’, and ‘no, not limited’). The total score ranges from 0 to 100, with higher scores indicating better physical functioning [28, 29].

EuroQol 5D (EQ-5D)

Health-related quality of life was assessed using the first part of the EQ-5D. This part covers five dimensions of health: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. Each dimension has one question with three levels of response (no problems, some problems, and severe problems). Answers can be transformed into an index score ranging from 0 to 1, with higher scores indicating better overall health [30].

Numeric Pain Rating Scale (NPRS)

The pain intensity was assessed using the NPRS, requiring workers to rate their average and worst level of pain over the past seven days. The questions were scored on an 11-point rating scale, where 0 referred to no pain and 10 to worst possible pain [31].

Global Perceived Effect (GPE)

Evaluation of how much the rehabilitation program changed the work functioning of the worker compared with pre-rehabilitation level was assessed using one item of the global perceived effect (GPE) (‘How much did the VR program change your work functioning compared to pre-treatment?’). The question was scored on a 7-point Likert scale, ranging from 1 (extremely worsened) to 7 (completely improved), while 4 is unchanged [32]. Based on the GPE score the workers were classified as improved (score 5–7), stable (score 4) or worsened (score 1–3). This instrument was used as the anchor (external criterion) in the responsiveness analysis, to compare the changes over time as derived from the WAS.

Statistical Analysis

Data were analyzed with SPSS version 24.0 statistical software for Windows (SPSS Inc., Chicago, IL, USA). For baseline characteristics, the distribution was assessed by skewness, kurtosis, and histograms. A skewness or kurtosis between −1.0 and 1.0 indicated a normal distribution for large sample sizes (> 300 participants). For smaller sample sizes, a z-score less than 1.96 was accepted as indicating a normal distribution [33]. Mean value and standard deviation (SD) were presented for continuous normally distributed data, and median and interquartile range (IQR) were used for non-normally distributed data. p < 0.05 was interpreted as statistically significant for all analyses.

Reliability

The test–retest reliability was derived from workers who were stable over a fifteen-week period. Stability was defined based on four criteria: (1) the GPE score was 4; (2) the change on question number 4 of the PDI was not greater than 1 point from baseline to discharge; (3) the change on the presenteeism score was not greater than 1 point for workers who were partly or completely at work and experienced presenteeism at baseline; and (4) the difference in short-term sick leave was less than 5.5 days, while workers who were on long-term sick leave at baseline were considered stable on this criterion if they were still on 100% sick leave at discharge [27, 34]. The intraclass correlation coefficient (ICC) was calculated to assess test–retest reliability, based on a 2-way mixed-effects model for absolute agreement, with corresponding 95% confidence intervals (CI) [35]. An ICC ≥ 0.7 with a value > 0.5 for the lower bound of the 95% CI is generally considered to be acceptable for test–retest reliability [35, 36].
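As an illustration of the reliability coefficient described above, the following minimal sketch computes an absolute-agreement ICC from the mean squares of a two-way analysis of variance. It assumes the single-measurement form, often written ICC(A,1); the text does not state whether the single or average form was used, so this choice is an assumption. The function name and the example scores are hypothetical and are not study data, and the confidence interval is not computed here.

```python
def icc_absolute_agreement(scores):
    """Single-measurement, absolute-agreement ICC from a two-way ANOVA.

    scores: list of rows, one row per worker, one column per occasion.
    Assumption: the single-measurement form is used for illustration.
    """
    n = len(scores)       # number of workers
    k = len(scores[0])    # number of occasions (2 for test and retest)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for workers (rows), occasions (columns) and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Systematic differences between occasions (ms_cols) lower the ICC.
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical baseline and discharge WAS pairs for six stable workers.
pairs = [(2, 3), (4, 4), (1, 1), (6, 5), (3, 3), (0, 1)]
print(round(icc_absolute_agreement(pairs), 2))
```

Because the absolute-agreement form penalizes a systematic shift between the two occasions, a sample in which every worker scores exactly one point higher at retest yields an ICC below 1 even though the rank order is perfectly preserved.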

Agreement

The standard error of measurement (SEM) was calculated to assess the absolute amount of measurement error (\(SEM = SD \times \sqrt{1-ICC}\)), where SD is the standard deviation of the WAS scores obtained from all workers and ICC is the test–retest reliability coefficient. The SEM was also used to determine the smallest detectable change (SDC) for an individual (\({SDC}_{individual} = SEM \times 1.96 \times \sqrt{2}\)) and for the total sample (\({SDC}_{sample} = \frac{{SDC}_{individual}}{\sqrt{n}}\)). The agreement of the WAS is considered good if the absolute measurement error is smaller than the minimal clinically important change [36, 37].
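The three agreement formulas can be sketched in a few lines. The function names are hypothetical, and the input values (SD = 2.0, ICC = 0.90, n = 25) are illustrative placeholders rather than study results.

```python
import math


def sem(sd, icc):
    """Standard error of measurement: SD * sqrt(1 - ICC)."""
    return sd * math.sqrt(1.0 - icc)


def sdc_individual(sem_value):
    """Smallest detectable change for one worker: SEM * 1.96 * sqrt(2)."""
    return sem_value * 1.96 * math.sqrt(2.0)


def sdc_sample(sdc_ind, n):
    """Smallest detectable change for a sample of n workers."""
    return sdc_ind / math.sqrt(n)


# Illustrative placeholder inputs, not values from the study.
sem_value = sem(sd=2.0, icc=0.90)
sdc_ind = sdc_individual(sem_value)
print(round(sem_value, 2), round(sdc_ind, 2), round(sdc_sample(sdc_ind, n=25), 2))
```

The factor \(\sqrt{2}\) reflects that a change score combines the measurement error of two assessments, and dividing by \(\sqrt{n}\) shows why a much smaller change is detectable at group level than for an individual.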

Floor and Ceiling Effects

Floor and ceiling effects were considered present if more than 15% of the workers from the construct validity sample achieved the minimum (0) or maximum (10) possible WAS score at baseline [36].
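The 15% criterion can be expressed as a simple check. This is a minimal sketch with a hypothetical function name and made-up scores.

```python
def floor_ceiling(scores, minimum=0, maximum=10, threshold=0.15):
    """Share of workers at the scale extremes and whether it exceeds the threshold."""
    n = len(scores)
    floor_share = sum(1 for s in scores if s == minimum) / n
    ceiling_share = sum(1 for s in scores if s == maximum) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


# Hypothetical baseline WAS scores for ten workers.
example = floor_ceiling([0, 0, 5, 5, 5, 5, 5, 5, 5, 10])
print(example)
```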

Construct Validity

Records containing all required baseline measurements were used for the construct validity analysis. WAS construct validity was examined based on seven hypotheses. The Spearman rank correlation coefficient (rho) was used to measure associations. The construct validity was considered sufficient if at least 75% of the predefined hypotheses were not refuted [36].

  1. WAS correlates moderately (|r| > 0.5) with the work productivity measures. Work productivity is the result of the workers' capacities and abilities; thus, both instruments are related to the assessment of a worker’s capability to carry out work [38,39,40,41].

    1.1. WAS correlates moderately negatively (r < −0.5) with days of sickness absence.

    1.2. WAS correlates moderately positively (r > 0.5) with the presenteeism score.

  2. WAS correlates weakly to moderately negatively (−0.5 < r < −0.2) with the total PDI score.

    2.1. WAS correlates moderately negatively (r < −0.5) with question number 4 of the PDI. Question 4 is the specific work-related question, which captures the construct of WA most specifically.

  3. WAS correlates weakly to moderately positively (0.2 < r < 0.5) with the RAND-36 physical functioning scale. The instrument measures the three primary domains of physical health that are key components to consider when evaluating physical functioning in the context of work [42]. Physical functioning is part of the foundation and relevant for WA and daily life functioning but is mainly related to WA in workers with high physical demands [43].

  4. WAS correlates weakly to moderately positively (0.2 < r < 0.5) with the EQ-5D. Quality of life is a generic dimension of health, which is less directly related to WA.

  5. WAS correlates weakly negatively (−0.3 < r < 0) with the NPRS, because pain is a comprehensive multidimensional construct that possibly represents only a fraction of WA.
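The hypothesis-testing procedure above can be sketched as follows: compute a Spearman rank correlation, check it against a predefined range, and apply the 75% rule. This is a minimal sketch in plain Python; the function names, the range bounds, and the example data are assumptions for illustration and are not taken from the study.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def hypothesis_supported(rho, lower, upper):
    """True if the observed rho lies inside the hypothesized range."""
    return lower <= rho <= upper


def construct_validity_sufficient(outcomes, required_share=0.75):
    """Apply the rule that at least 75% of hypotheses must not be refuted."""
    return sum(outcomes) / len(outcomes) >= required_share


# Hypothetical WAS and presenteeism scores for eight workers.
was = [2, 5, 3, 7, 1, 6, 4, 8]
presenteeism = [3, 5, 2, 8, 1, 7, 6, 7]
rho = spearman_rho(was, presenteeism)
print(round(rho, 2), hypothesis_supported(rho, 0.5, 1.0))
```

With six of seven hypotheses supported, `construct_validity_sufficient([True] * 6 + [False])` returns `True`, because 6/7 is above the 75% threshold.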

Responsiveness

The GPE was used to classify workers as ‘improved’, ‘stable’ or ‘worsened’. The group ‘worsened’ was not included in the analysis. Based on this classification, a receiver operating characteristic (ROC) curve of the absolute change score was created by plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity). The minimal clinically important change was determined by the optimal cut-off point of the ROC curve of the change scores [44]. Additional responsiveness analyses were performed for the change scores in which the total sample was stratified by the baseline WAS tertile scores, because the minimal clinically important change is likely to be influenced by baseline scores [45]. The area under the curve and its 95% CI were used to describe the ability of the WAS to distinguish improved workers from workers who had not improved. An area under the curve > 0.9 indicates excellent discrimination, 0.7–0.9 good discrimination, and 0.5–0.7 moderate discrimination; discrimination fails if the area under the curve is ≤ 0.5 [46].
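The anchor-based procedure can be illustrated with a small sketch that derives sensitivity and specificity for each candidate cut-off of the change score, selects a cut-off, and computes the area under the curve. The text does not specify how the optimal cut-off point was chosen; the Youden index (sensitivity + specificity − 1) used here is an assumption, as are the function names and the example data.

```python
def split_groups(change_scores, improved):
    """Separate change scores of improved and stable workers."""
    pos = [c for c, flag in zip(change_scores, improved) if flag]
    neg = [c for c, flag in zip(change_scores, improved) if not flag]
    return pos, neg


def roc_points(change_scores, improved):
    """Return (cutoff, sensitivity, specificity) per candidate cut-off.

    A worker is classified as improved when the change score >= cutoff.
    """
    pos, neg = split_groups(change_scores, improved)
    points = []
    for cutoff in sorted(set(change_scores)):
        sensitivity = sum(1 for c in pos if c >= cutoff) / len(pos)
        specificity = sum(1 for c in neg if c < cutoff) / len(neg)
        points.append((cutoff, sensitivity, specificity))
    return points


def optimal_cutoff(points):
    """Cut-off maximizing the Youden index (an assumed criterion)."""
    return max(points, key=lambda p: p[1] + p[2] - 1.0)[0]


def auc(change_scores, improved):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos, neg = split_groups(change_scores, improved)
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical WAS change scores; True = 'improved' on the GPE, False = 'stable'.
changes = [0, 1, 1, 2, 3, 4, 2, 0]
anchor = [False, False, True, True, True, True, False, False]
print(optimal_cutoff(roc_points(changes, anchor)), round(auc(changes, anchor), 2))
```

Other cut-off criteria, such as the point closest to the upper-left corner of the ROC plot, can give a different value on the same data, so the criterion should be reported alongside the result.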

Sample Size

According to the COSMIN checklist, a sample size of 50–99 participants is considered adequate to obtain reasonable results for determining test–retest reliability, agreement, validity, and responsiveness. Furthermore, a sample size of ≥ 100 participants is assessed as excellent [22].

Results

A total of 34 workers were eligible for the reliability and agreement analysis because they met the four criteria of the operational definition for being considered stable. In total, 1291 eligible workers filled out the complete baseline questionnaires and were available for the construct validity analysis. The baseline and discharge responsiveness questionnaires were completed by 590 workers. WAS at baseline was not significantly different between responders and non-responders (p = 0.413). A flowchart of the inclusion of workers is shown in Fig. 1. The baseline characteristics of the workers for the study samples are shown in Table 1.

Fig. 1 Flowchart of recruitment, evaluation, and exclusion

Table 1 Baseline characteristics of the sick-listed workers for different study samples

Reliability and Agreement

Test–retest reliability was ICC = 0.89 (95% CI 0.77–0.94). The mean WAS scores for test and retest were 2.9 (SD 2.1) and 3.1 (SD 2.2), respectively (p = 0.386). The SEM, individual SDC, and SDC for the total sample were 0.69, 1.92, and 0.33, respectively.

Floor and Ceiling Effects

At baseline, 8.0% of the workers scored 0 and 1.4% of the workers scored 10 (valid n = 1291). The percentages did not exceed 15%, therefore significant floor and ceiling effects were not present.

Construct Validity

Results of the construct validity analysis are shown in Table 2. Six of the seven (85.7%) predefined hypotheses on the magnitude of the relationship between the WAS and the other constructs were supported. The hypothesis on the correlation between the WAS and the number of sickness absence days was refuted; the observed correlation was slightly weaker than hypothesized.

Table 2 Hypothesized and observed Spearman rank correlation coefficient between the baseline work ability score (WAS) and other measurement instruments

Responsiveness

Based on the GPE classification, 48 out of the 590 workers worsened, 117 workers were stable and 425 workers improved. WAS at baseline did not significantly differ between the stable and improved group (p = 0.120).

The mean scores, area under the curve, minimal clinically important change, sensitivity, and specificity of the WAS for the total sample and the baseline tertiles are presented in Table 3 and the ROC curves are shown in Fig. 2. The discriminative ability in the total sample between the stable and improved group was an area under the curve of 0.76 (95% CI 0.71–0.81), with a corresponding minimal clinically important change of 2.0 points.

Table 3 Mean baseline and change scores, standard deviations, and responsiveness of the work ability score (WAS)

Fig. 2 Receiver operating characteristic (ROC) curves of the work ability score (WAS). (a) ROC curve of the total study sample (n = 542). (b) ROC curve of the sample with baseline WAS tertile 1 score (n = 180). (c) ROC curve of the sample with baseline WAS tertile 2 score (n = 167). (d) ROC curve of the sample with baseline WAS tertile 3 score (n = 195). AUC = area under the curve; CI = confidence interval

Discussion

This is the first study that assesses the measurement properties of the WAS in sick-listed workers with CMP. The test–retest reliability analysis resulted in an ICC = 0.89, which is considered adequate. Floor and ceiling effects were not present. Six of the seven predefined hypotheses were not refuted, supporting the construct validity of the WAS. The minimal clinically important change for the total sample was 2.0 points, with a good discriminative ability. In summary, the WAS demonstrated acceptable reliability, construct validity, and responsiveness in this study sample.

The test–retest reliability in the present study was similar to a study among Iranian workers (ICC = 0.83) [47] and comparable with the total WAI among healthy nurses and healthcare workers (ICC = 0.92) [12]. Direct comparison is difficult because of differences in study samples (healthy versus CMP), and the time interval between test and retest.

As expected, the strongest correlation with the WAS was seen for the presenteeism score (r = 0.64), followed by PDI question 4 (‘How would you rate the level of disability you typically experience during occupational activities?’) (r = −0.52) and sickness absence (r = −0.40), indicating that these measurement instruments were best related to the construct of WA. The correlations indicate that when the presenteeism score decreased, or when the score on PDI question 4 or days of sickness absence increased, perceived WA decreased. The correlation between the WAS and sickness absence was weaker than expected. Stronger correlations were present within healthy samples (r = −0.44 to −0.62) [39], indicating that days of sickness absence capture WA better among samples of healthy individuals than among those with CMP and a relatively high rate of long-term sick leave (57.5%). The correlation between the WAS and presenteeism score was comparable with the result of another study (r = 0.69) [40], supporting the validity of the WAS. Construct validity of the WAI was better supported by physical functioning (r = 0.38–0.40) [14, 16] compared with the result for the WAS in the present study (r = 0.22). This difference could be explained by the fact that the WAI is a more comprehensive measurement instrument, and that previous studies included workers who primarily worked in physically demanding jobs, which influences perceived WA and physical functioning [9, 48].

In the present study, the area under the curve of 0.76 provides evidence that the WAS is a responsive instrument for detecting clinically relevant changes at discharge from VR. The discriminative ability of the WAS was good within the first (area under the curve = 0.90) and second (area under the curve = 0.85) tertile, and moderate for the third tertile (area under the curve = 0.68). To our knowledge, the responsiveness of the WAS has not previously been assessed; therefore, the results cannot be compared with those of other studies.

The results of this study support the WAS as a valid, reliable, and responsive instrument. Consequently, the WAS is suitable for WA assessments at group and individual level among workers with heterogeneous work types, and suitable for monitoring progress in VR. To decide whether an improvement in the WAS is clinically important and is not due to measurement error, minimal clinically important change values should be interpreted in relation to the SDC. The results of this study indicate an individual-level SDC of 1.92 and an anchor-based minimal clinically important change of 2.0 points for the WAS in the total sample. The minimal clinically important change for the first and second tertile was 3 and 2 points, respectively. Because the minimal clinically important change is larger than the SDC, the minimal clinically important change should be used as the cut-off value. In contrast, the minimal clinically important change of the third tertile was only 1 point and cannot be distinguished from the measurement error; therefore, the SDC should be used as the cut-off value. When interpreting changes for an individual, it is recommended to account for baseline scores to avoid misclassification bias [49]. Sick-listed workers with a baseline score of ≤ 2 (first tertile) should increase by at least 3.0 points, and sick-listed workers with a baseline score ≥ 3 (second and third tertile) should increase by at least 2.0 points, to conclude that a relevant and measurable change has occurred. This information should be useful for clinicians in the VR setting and for researchers using the WAS as an outcome measure, to help determine whether a clinically meaningful change has occurred as a consequence of the VR program.
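The baseline-dependent interpretation rule described above can be written as a short decision function. The thresholds follow the values stated in the text (at least 3.0 points for a baseline score ≤ 2, at least 2.0 points for a baseline score ≥ 3); the function names are hypothetical, and the sketch is meant as an illustration of the rule rather than a validated clinical tool.

```python
def required_increase(baseline_was):
    """Minimum WAS increase regarded as relevant, given the baseline score."""
    return 3.0 if baseline_was <= 2 else 2.0


def relevant_change(baseline_was, discharge_was):
    """True if the increase from baseline to discharge meets the threshold."""
    return (discharge_was - baseline_was) >= required_increase(baseline_was)


# A worker going from 2 to 4 does not reach the 3-point threshold,
# whereas a worker going from 3 to 5 reaches the 2-point threshold.
print(relevant_change(2, 4), relevant_change(3, 5))
```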

Strengths and Limitations

A general strength of this study was the use of data from usual care collected in seven different VR centres in the Netherlands. Because the study was performed in a setting and sample that is representative of daily clinical practice, the results are more broadly generalizable. There was also a sufficient sample size for the construct validity and responsiveness analyses. The study sample consisted of sick-listed workers with different CMP complaints and a broad range of professions, working hours, educational levels, sex, and age. This makes the WAS suitable for a wide population of workers with CMP.

Despite the strengths, the present study does have some limitations that primarily impact reliability and agreement. First, a traditional test–retest design could not be used, because there was no earlier measurement moment at which the WAS had been measured. The time between test and retest assessment was 15 weeks, during which a VR program was followed. To ensure that the sample was stable between the two assessment moments, which is a prerequisite for test–retest analyses, a strict operational definition was used. This strict operational definition resulted in a sample of n = 34, which is lower than recommended (n = 50) [22]. To investigate the extent to which the operational definition and small sample size affected the results, two post hoc sensitivity analyses were conducted. For the first sensitivity analysis, the new threshold for the PDI question 4 change score was 2 points (n = 53), which equals the minimal clinically important change determined from data in this study. The results of this analysis were ICC = 0.83 (95% CI 0.70–0.90), SEM = 0.82, SDCindividual = 2.27 and SDCgroup = 0.31. In the second analysis, besides the extension of the PDI change score, the GPE score was broadened to include 3 and 5 (n = 111). The results of the second analysis were ICC = 0.76 (95% CI 0.65–0.83), SEM = 1.04, SDCindividual = 2.88 and SDCgroup = 0.27. By loosening the operational definition of stability, a slightly less stable sample was created, resulting in larger samples, a slight decrease in the ICC, and an increase in the SEM. Given these results, it is unlikely that the strict definition of stability affected the test–retest reliability and agreement of the WAS.

The second limitation of the present study is the potential selection bias in the reliability and agreement sample. The sample was not completely representative of the total study sample. The included workers were on average older and had a longer duration of pain, higher pain scores, and worse WAS and PDI scores. A further limitation is that the results of the current study are limited to the care-as-usual population in the Dutch VR setting. Future research should reveal whether these findings can be replicated and generalized to other samples.

Conclusion

The current study provides support for the use of the WAS for assessing and evaluating WA in workers with CMP in vocational rehabilitation. Apart from having adequate measurement properties, it is easy to administer, simple to interpret, and not time-consuming for the worker to complete. A group change of 2.0 points, and a change score of 3.0 and 2.0 points for individuals with a baseline score ≤ 2.0 and ≥ 3.0, respectively, can be used for evaluation purposes to assess the effectiveness of the VR program.