Introduction

A systematic review of outcome measures for shoulder disorders identified pain, range of motion, and function as the most commonly assessed domains [2], while pain and physical function have been identified as core outcome domains [22]. The shoulder pain and disability index (SPADI) consists of a five-item pain subscale and an eight-item disability subscale [25]. It originally used visual analogue scales, but items are now scored on an ordinal rating scale from zero to ten, with ten indicating the highest level of pain/difficulty. Different scoring strategies have been used: some authors report the mean of all 13 items, while others calculate separate averages for the five pain items and the eight disability items. The latter method also allows reporting the mean of the two subscores, thus giving equal weight to the two domains. The SPADI is recommended for clinical practice and research [10, 26]. However, these recommendations are based on studies of the validity, reliability, and responsiveness of the SPADI using classical test theory (CTT).
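The two scoring strategies described above can give different totals because the disability subscale has more items than the pain subscale. The following is a minimal sketch, not an official scoring routine: the function name, the example responses, and the multiplication by 10 to express results on a 0 to 100 scale are assumptions made for illustration.

```python
from statistics import mean


def spadi_scores(pain, disability):
    """Summarise SPADI item responses (each 0-10) under both scoring strategies.

    Results are multiplied by 10 to lie on a 0-100 scale (an assumed convention).
    """
    pain, disability = list(pain), list(disability)
    if len(pain) != 5 or len(disability) != 8:
        raise ValueError("expected 5 pain items and 8 disability items")
    if any(not 0 <= x <= 10 for x in pain + disability):
        raise ValueError("item scores must lie between 0 and 10")

    pain_score = 10 * mean(pain)
    disability_score = 10 * mean(disability)
    return {
        "pain": pain_score,
        "disability": disability_score,
        # Strategy 1: every item has equal weight, so disability dominates (8 of 13 items).
        "mean_of_13_items": 10 * mean(pain + disability),
        # Strategy 2: the two domains have equal weight.
        "mean_of_subscales": (pain_score + disability_score) / 2,
    }


# Hypothetical respondent with more pain than disability.
scores = spadi_scores([8, 6, 7, 5, 4], [3, 2, 4, 1, 2, 3, 5, 4])
print(scores)
```

For this hypothetical respondent the equal-domain total is higher than the 13-item mean, because the lower disability responses carry more weight when all items are averaged together.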

Rasch analysis [24, 12, 6] is a modern approach based on item response theory (IRT) [28] that is used for the development and testing of patient-reported outcome (PRO) instruments. All IRT models possess certain desirable properties. While some consider the Rasch model to be an overly simplified statistical model, its simplicity is exactly what gives it a special status among IRT models: it represents a measurement model in which the sum score of item responses contains all information about the underlying latent variable that the scale is intended to measure. A recent validation study of the SPADI using Rasch analysis identified strengths and limitations not previously observed using CTT methods [15]. This study concluded that the SPADI should be treated as two separate subscales and that, while the pain subscale fits the Rasch model well, the disability subscale does not. It also concluded that clinicians should exercise caution when interpreting score changes on the disability subscale and attempt to compare their scores to age- and sex-stratified data. The Danish version of the SPADI has been cross-culturally adapted and validated using CTT [4]. The purpose of this study is to validate the two subscales of the Danish translation of the SPADI using Rasch analysis, evaluate differential item functioning (DIF), and study how well the two subscales are targeted to patients with rotator cuff-related disorders.
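The statement that the sum score contains all information about the latent variable can be illustrated for the simplest case. The sketch below uses the dichotomous Rasch model, whereas the SPADI items are polytomous, and the notation (person parameter \(\theta\), item parameters \(\beta_i\), sum score \(r\), number of items \(k\)) is introduced here for illustration only. For a single item with response \(x_i\in\{0,1\}\),

\[
P(X_i=x_i\mid\theta)=\frac{\exp\{x_i(\theta-\beta_i)\}}{1+\exp(\theta-\beta_i)} .
\]

Assuming the items are conditionally independent given \(\theta\), the joint probability of a response pattern \(x=(x_1,\dots,x_k)\) with sum score \(r=\sum_{i=1}^{k}x_i\) is

\[
P(X=x\mid\theta)=\frac{\exp\left(r\theta-\sum_{i=1}^{k}x_i\beta_i\right)}{\prod_{i=1}^{k}\{1+\exp(\theta-\beta_i)\}} ,
\]

and conditioning on the sum score gives

\[
P(X=x\mid R=r)=\frac{\exp\left(-\sum_{i=1}^{k}x_i\beta_i\right)}{\gamma_r(\beta)},\qquad
\gamma_r(\beta)=\sum_{y:\,\sum_i y_i=r}\exp\left(-\sum_{i=1}^{k}y_i\beta_i\right).
\]

The conditional distribution does not depend on \(\theta\), so once the sum score is known, the particular response pattern carries no further information about the person.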

Methods

The data for this validation study came from a consecutive cohort of patients with rotator cuff-related disorders (subacromial impingement with or without rotator cuff tear) [8]. The cohort included patients with a shoulder problem referred to orthopedic specialist assessment at a public secondary care outpatient clinic during a 3-month period (March to June 2014). Eligible patients received an information letter explaining that an extended assessment was offered immediately prior to the orthopedic specialist examination. As part of the additional assessment, the information letter contained the SPADI questionnaire and instructions to fill it in and bring it on the day of examination. Orthopedic specialists were blinded to results of the assessments, and patients were diagnosed according to the clinical judgement of the orthopedic specialist performing the examination. Study methods and results have been described elsewhere [8, 9, 29].

Overall fit to the Rasch model was assessed using the Andersen conditional likelihood ratio test [1], and individual item fit was evaluated by comparing observed and expected item-restscore associations [18]. We also evaluated item fit graphically by dividing the sample into five score groups (often denoted ‘class intervals’ in the Rasch literature) and, for each item, plotting the item mean scores in each interval and comparing these to 95% confidence regions for the model expectations. To test the assumption of uni-dimensionality, we compared observed and expected subscore correlations [14]. Differential item functioning (DIF) [13] occurs when responses to an item differ systematically by some other factor or variable, such as age or gender. Local response dependence occurs when items are almost identical (redundancy) or when they share features, e.g., wording or response format, or are associated with some other underlying trait. We evaluated local response dependence and tested for DIF with respect to gender and age using log linear Rasch model tests [16] and item screening [19]. The ability of the subscales to discriminate between respondents was evaluated using Cronbach's coefficient alpha and the person separation index (PSI) [23]. In all analyses, we adjusted p values using the Benjamini–Hochberg [15] procedure to control the false discovery rate.
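As an illustration of the multiplicity adjustment mentioned above, the following is a minimal sketch of the Benjamini–Hochberg step-up adjustment. It is not the code used in the analyses (which were run in DIGRAM and SAS); the function name and the example p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p values from four tests.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
print(adj)
```

A test is then declared significant at false discovery rate q when its adjusted p value is at most q.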

Disordered thresholds occur when participants cannot consistently discriminate between the available response options. Jerosch-Herold et al. examined category probability curves and proposed a re-scoring (00112233445). Our sample was deemed to be too small to estimate threshold parameters with sufficient precision and we used this re-scoring for all items.
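The re-scoring pattern 00112233445 collapses the eleven original response categories (0 to 10) into six (0 to 5). A minimal sketch of that mapping, with hypothetical names and example responses:

```python
# Position s in the tuple holds the new category for original response s (0-10),
# following the pattern 00112233445.
RESCORE = (0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5)


def rescore(item_scores):
    """Collapse original 0-10 item responses into the 0-5 categories."""
    for s in item_scores:
        if s not in range(11):
            raise ValueError(f"response {s!r} is outside 0-10")
    return [RESCORE[s] for s in item_scores]


# Hypothetical responses to six items.
print(rescore([0, 1, 2, 5, 9, 10]))
```

Adjacent pairs of original categories are merged, except that the top category (10) keeps a category of its own.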

Analyses were done using DIGRAM [17, 20] and version 9.4 of the SAS software package. Person-item location maps were created using a SAS macro [6].

Results

The validation sample consisted of 229 patients (48% female) with a mean age of 54.5 years (SD = 14.2). Information about employment status was available for 221 patients, of whom 115 were currently working and 106 were not working (sick leave, retired, unemployed). A total of 21 patients reported being on part time (n = 5) or sick leave/unemployed (n = 16) due to the shoulder disorder. The mean SPADI original score was 55 (SD = 22), and the dominant side was affected in 56.5% (122 of 216). Average pain last week (on NPRS, 0–10) was 5 (SD = 2) (n = 210). Duration of the shoulder problem (n = 226) was 0–1 months in 3 patients (1%), 1–3 months in 37 (16%), 3–6 months in 50 (22%), and 6 months or more in 136 (60%).

Rasch analysis of SPADI pain subscale

Initial analysis of the pain subscale revealed poor fit to the Rasch model (Andersen \(\chi ^2=58.7\), df=23, \(p=0.0001\)). The item screening indicated local response dependence for two item pairs, P1 (‘at its worst’) and P2 (‘lying on affected side’) (\(p=0.0001\)) and P3 (‘reaching for object on a high shelf’) and P4 (‘touching the back of your neck’) (\(p<0.0001\)), as well as DIF by age for P1 ‘pain at worst’ (\(p=0.0131\)). Adding these to the model yielded a log linear Rasch model with excellent overall fit (Andersen \(\chi ^2=48.4\), df=56, \(p=0.7540\)). Regarding individual item fit, the item fit statistics (Table 1) and the plots of observed and expected item mean scores (Fig. 1) indicated that the data fit the log linear Rasch model.

Table 1 Item fit statistics
Fig. 1 Item fit plot for the pain subscale. Item mean scores (solid lines) and 95% confidence regions for expected mean scores (shaded areas)

Reliability was high, with a Cronbach coefficient alpha of 0.86 and a person separation index (PSI) of 0.84, and the person-item location map (Fig. 2, left panel) shows that the subscale works well at different levels of the construct.

Fig. 2 Person-item location maps. Distribution of person location estimates above the x-axis and item threshold distribution below the x-axis for the pain (left panel) and disability (right panel) subscales

Rasch analysis of SPADI disability subscale

Initial analysis of the eight-item disability subscale revealed misfit to the Rasch model (Andersen \(\chi ^2=61.9\), df=39, \(p=0.0112\)) and evidence of substantial misfit for item D7 ‘carry a heavy object’ (observed item-restscore association 0.52, expected item-restscore association 0.66, \(p<0.0001\)). Deleting the items D3 ‘putting on undershirt or jumper’ and D7 ‘carry heavy object’ from the subscale yielded a model with excellent overall fit to the Rasch model (Andersen \(\chi ^2=36.3\), df=29, \(p=0.1647\)) and with excellent item fit (Table 1; observed item mean scores corresponded to Rasch model predictions, Fig. 3). There was no evidence of DIF with respect to age or gender (results not shown), but there was evidence of local response dependence for the items D4 ‘putting on a shirt that buttons at front’ and D5 ‘putting on trousers’ (\(p<0.0001\)). Adding this to the model yielded a log linear Rasch model with excellent overall fit (Andersen \(\chi ^2=51.4\), df=45, \(p=0.2366\)) and excellent individual item fit (Table 1, Fig. 3). Reliability was high, with a Cronbach coefficient alpha of 0.89 and a person separation index (PSI) of 0.87, and the person-item location map (Fig. 2, right panel) shows that the subscale works well at different levels of the construct.

Fig. 3 Item fit plot for the disability subscale. Item mean scores (solid lines) and 95% confidence regions for expected mean scores (shaded areas)

Dimensionality

Testing the assumption of uni-dimensionality by comparing observed and expected subscale correlations [14] showed the SPADI to be two-dimensional: expected subscale correlation 0.698 (s.e.=0.0262), observed subscale correlation 0.620, \(p=0.0029\).

Impact of DIF

Scores derived from the SPADI should be interpreted with caution. Firstly, it should be treated as a five-item pain subscale and a six-item disability subscale. We disclosed evidence of DIF by age for P1 (‘pain at worst’). In order to evaluate the impact of this, we computed equated scores and found the difference to be smaller than 0.61 (Table 2).

Table 2 Equated scores showing the impact of DIF

Discussion

We validated the Danish version of the SPADI and found results very similar to those Jerosch-Herold et al. found for the English version. The Danish version of the SPADI should be reported as two separate subscales. The pain subscale has some DIF, but the impact appears to be small. The disability subscale cannot be used in its current form, but a six-item version was found to fit the Rasch model adequately. Rasch analysis of the SPADI has identified some strengths and limitations not previously observed using CTT methods.

For the pain subscale, Jerosch-Herold et al. found DIF by age for P1 ‘pain at worst’ and by gender for P5 ‘pain when pushing with involved arm,’ but no evidence of local response dependence. We replicated the finding regarding DIF by age for P1 only. Regarding local response dependence we found evidence for two item pairs. Computation of equated scores indicated the difference to be relatively small.

For the disability subscale, we replicated the finding of Jerosch-Herold et al. that the six-item version resulting from removing the items D3 ‘putting on undershirt or jumper’ and D7 ‘carrying heavy object’ showed reasonable fit to the Rasch model, but where Jerosch-Herold et al. found DIF for D1 and D4 by gender and for D5 by age, our analysis did not disclose evidence of DIF. Again we disclosed significant evidence for local response dependence (for the item pair D4 ‘putting on a shirt that buttons at front’ and D5 ‘putting on trousers’) where Jerosch-Herold et al. did not.

Regarding the misfit of item D3 (‘Putting on undershirt or jumper’), we speculate that the item is double-barreled because the two garments are quite different in the Danish translation. Regarding the misfit of D7 (‘Carry heavy object’), it is likely that respondents do not rate ‘Carrying a heavy object of 10 pounds (4.5 kg)’ consistently because they are not equally strong. A DIF item “behaves differently for various subgroups after controlling for the overall differences between subgroups on the construct being measured” [11]. Regarding the DIF for item P1 (‘pain at its worst’), where respondents over 60 consistently score slightly lower, we speculate that their reference for pain is shifted.

Jerosch-Herold et al. found more evidence of DIF than we did. We speculate that the difference in sample size could be the reason for this. Regarding local response dependence, we found more evidence than Jerosch-Herold et al. and speculate that the reason could be that they did not follow the recommendation by Marais [21] that local dependence should be considered relative to the average residual correlation, cf. Christensen et al. [8].

Beyond the smaller sample size, our study population differed from the population of Jerosch-Herold et al. in other ways. Most importantly, Jerosch-Herold et al. [15] included all patients treated for shoulder pain irrespective of shoulder disorder, whereas this study only includes patients with rotator cuff-related disorders. Furthermore, the mean original SPADI total score of 55 (SD 22) in our sample is somewhat higher than the 48 (SD 22) in the sample included by Jerosch-Herold et al. [15].

Clinical implications

In conclusion, scores derived from the SPADI should be interpreted with caution. Firstly, it should be treated as a five-item pain subscale and a six-item disability subscale. Reporting of scores can still be done using a linear transformation to a zero to 100 scale (for the validation sample studied here this would yield SPADI pain 64 (SD = 22) and SPADI disability 44 (SD = 25)). Secondly, clinicians should attempt to compare their scores to age-stratified data, even though the impact of differential item functioning seems small.
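The linear transformation to a zero to 100 scale can be sketched as dividing the subscale sum score by its maximum possible value and multiplying by 100. This is an illustrative sketch only: the function name and example responses are hypothetical, and the maximum per item is passed explicitly because it depends on whether items use the original 0 to 10 scoring or the 0 to 5 re-scoring, which the text does not specify for the reported values.

```python
def to_0_100(item_scores, max_per_item):
    """Linearly transform a subscale sum score to a 0-100 scale."""
    if not item_scores:
        raise ValueError("at least one item score is required")
    if any(not 0 <= s <= max_per_item for s in item_scores):
        raise ValueError("item score outside the allowed range")
    return 100 * sum(item_scores) / (len(item_scores) * max_per_item)


# Hypothetical five-item pain responses, assuming the 0-5 re-scoring.
pain_score = to_0_100([5, 3, 2, 4, 1], max_per_item=5)
# Hypothetical six-item disability responses, assuming original 0-10 scoring.
disability_score = to_0_100([4, 2, 6, 3, 5, 1], max_per_item=10)
print(pain_score, disability_score)
```

Because the transformation only rescales the sum score, it preserves the ordering of respondents on each subscale.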