Key Points for Decision Makers

Our study cannot ultimately confirm an underlying trait of functioning and disability when applying the Disabilities of the Arm, Shoulder and Hand (DASH) Questionnaire to a sample of patients with hand injuries and diseases.

Especially patients with severe problems in functioning have difficulties differentiating between the response options of the DASH.

Further studies using Rasch analysis and including other (clinical) target groups with hand, arm and shoulder injuries or diseases are needed.

1 Introduction

Patient-reported outcome measures (PROMs) are widely accepted tools to measure health outcomes within clinical research and practice. Data and results retrieved from PROMs highly benefit the assessment of patients’ problems, their care, further assignment of interventions, and evaluation of research [1]. Thus, PROMs used for these purposes need to be robust, meaningful and interpretable, and in case of disease-specific PROMs, targeted to the needs and problems of patients affected by the respective health condition.

Several PROMs have been developed for the upper extremities and the hand and used for clinical and research purposes [2], such as the Michigan Hand Outcomes Questionnaire [3], the Western Ontario Rotator Cuff Index [4] and the Boston Carpal Tunnel Questionnaire [5]. The advantages of PROMs focusing on a defined anatomical region such as the hand or wrist (e.g. Michigan Hand Outcomes Questionnaire) instead of disease-specific PROMs (e.g. Boston Carpal Tunnel Questionnaire) have been revealed in terms of clinical practicability and applicability [6].

Persons affected by injuries and diseases of the hand or the upper extremities are known to have specific needs and problems in functioning and everyday life, such as problems in fine hand use, self-care, domestic life and social participation [7–9]. However, as pointed out by Coenen et al. [11], PROMs developed for the upper extremities do not capture all aspects of functioning, as defined by the World Health Organization’s (WHO’s) International Classification of Functioning, Disability and Health (ICF) [10], that are relevant from the patient’s perspective.

The Disabilities of the Arm, Shoulder and Hand (DASH) Questionnaire [12] is the most commonly used PROM in clinical research and practice for patients with injuries and diseases of the upper extremities [13], and is available in more than 30 languages. It is a 30-item PROM aiming to measure functioning and disability in people with all kinds of injuries and diseases of the upper extremities. The DASH was developed by experts who selected items from a large pool of existing scales [12, 14]. The reliability and validity of the DASH have been thoroughly investigated by classical test theory approaches in various settings [14–18]. Internal reliability (Cronbach’s alpha) has been reported as excellent in various subpopulations (0.92–0.98) [16]. The German version of the DASH [19] was validated using data from a sample of patients with shoulder pain [17] and showed high internal reliability (Cronbach’s alpha 0.96). Analyses addressing dimensionality and internal construct validity have rarely been conducted, and studies using factor analysis have shown varying results [20–23]. Unidimensionality was a problem in these four studies, and factor analysis supported a two-factor [23] or three-factor [20–22] model.

The strengths and weaknesses of PROMs can also be analysed by using Rasch analysis. Up to now this method has not been extensively used to analyse the DASH, apart from the development of a shortened version [24] on the one hand, and studies focusing on patients with specific health conditions, namely multiple sclerosis [25] and musculoskeletal disorders of the upper extremities (e.g. shoulder arthritis) [21], on the other. The DASH was developed for persons with any kind of injury and disease of the upper extremities. However, it is also used for persons with only a specific anatomical region affected, such as for patients with hand injuries and diseases. This versatility generally offers a broader range of clinical and research applications, but might also bear the risk of the outcome being too unspecific for people with hand injuries and diseases.

The objective of this study was to examine whether the DASH is a valid and reliable questionnaire to assess functioning and disability of patients with hand injuries and diseases using Rasch analysis.

2 Methods

2.1 Design

This psychometric study used data from two multicentre cross-sectional studies that were carried out to develop and validate the ICF Core Sets for Hand Conditions [26, 27]. Both studies were conducted in a clinical setting in four German hospitals run by the German Statutory Accident Insurance. In these studies, participants with a broad range of injuries and diseases of the hand filled in the DASH amongst other PROMs. Socio-demographic and disease-related data were collected by health professionals and study nurses, respectively. Sampling strategies, inclusion and exclusion criteria, as well as the data collection of these studies have been described elsewhere in detail [26–28]. Both studies were performed based on the principles of the Declaration of Helsinki and were approved by the Ethics Committee of the Medical Association, Hamburg, and the Ethics Committee of the Ludwig-Maximilians-Universität (LMU), Munich.

For this psychometric study we used bifactor and Rasch analysis to examine the reliability and validity of the DASH in persons with all kinds of hand injuries and diseases.

2.2 Data Used for Analyses

We used socio-demographic and disease-related data, as well as data derived from the German version of the DASH assessed in the two former empirical studies. Each of the 30 DASH items is scored from 1 (e.g. ‘no difficulty’) to 5 (e.g. ‘unable’). The ‘DASH disability/symptom score’ is calculated by first averaging the given answers, then subtracting one and multiplying the result by 25. The score ranges from 0 to 100, where a higher score indicates greater disability [14].
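The scoring rule can be illustrated with a minimal sketch. The function name `dash_score` and the use of `None` for a missing answer are assumptions made for this illustration; the limit of three missing answers is the validity criterion applied in this study.

```python
from typing import Optional, Sequence

# Validity criterion used in this study: not more than three missing answers.
MAX_MISSING = 3


def dash_score(answers: Sequence[Optional[int]]) -> Optional[float]:
    """Return the DASH disability/symptom score (0-100), or None if invalid.

    `answers` holds the 30 item responses (1..5); None marks a missing answer
    (a convention chosen for this sketch).
    """
    if len(answers) != 30:
        raise ValueError("the DASH has exactly 30 items")
    given = [a for a in answers if a is not None]
    if any(a not in (1, 2, 3, 4, 5) for a in given):
        raise ValueError("responses must be integers from 1 to 5")
    if len(answers) - len(given) > MAX_MISSING:
        return None  # too many missing answers for a valid score
    mean_response = sum(given) / len(given)  # average of the given answers
    return (mean_response - 1) * 25          # subtract one, multiply by 25
```

A patient answering ‘no difficulty’ (1) on every item scores 0, and a patient answering ‘unable’ (5) on every item scores 100.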

For our analyses, we solely considered data of patients who (a) had a valid DASH score, i.e. not more than three missing answers and (b) suffered from an injury (e.g. wrist fracture) or disease of the hand (e.g. carpal tunnel syndrome). Therefore, we excluded 29 patients with invalid DASH scores and 23 patients because of their diagnoses from the two samples (N = 469), resulting in a final sample of 417 patients used for further analysis.

2.3 Data Analyses

2.3.1 Descriptive Analysis

Descriptive analysis was carried out to describe the sample in terms of socio-demographic and disease-related characteristics. Results of the descriptive analysis are displayed in a table showing frequencies and percentages.

2.3.2 Internal Consistency Reliability

The internal consistency reliability of the DASH items was assessed based on different measures. For each item, the inter-item correlation and the item-total correlation (defined as the correlation between the item and the total DASH score) were calculated [29–32]. In addition, Cronbach’s alpha and McDonald’s omega total were computed.
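As an illustration of two of these measures, the following sketch computes Cronbach’s alpha and the item-total correlations from a complete person-by-item matrix. It is not the code used in this study; the function names are hypothetical, Pearson correlations are assumed, and missing answers are not handled.

```python
from statistics import mean, variance
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(data: List[List[float]]) -> float:
    """data[p][i] is the response of person p to item i (complete cases only)."""
    k = len(data[0])
    item_vars = [variance([row[i] for row in data]) for i in range(k)]
    total_var = variance([sum(row) for row in data])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def item_total_correlations(data: List[List[float]]) -> List[float]:
    """Correlation of each item with the total score, which includes that item."""
    totals = [sum(row) for row in data]
    return [pearson([row[i] for row in data], totals)
            for i in range(len(data[0]))]
```

Because the total score includes the item itself, these item-total correlations are uncorrected, in line with the definition given above.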

2.3.3 Metric Analysis

We carried out metric analysis according to the conceptualization of the item response theory, starting with the testing of model assumptions and ending with Rasch analysis; additionally, we tested for Differential Item Functioning (DIF).

(1) Testing data for Rasch model assumptions: We tested three model assumptions: unidimensionality, monotonicity and local independency. Unidimensionality was examined by means of a bifactor analysis [33, 34]. The assumption made by bifactor analysis is the existence of one general factor and multiple independent group factors. High loadings on the general factor for all items point to an underlying unidimensional latent trait. The number of factors considered in the bifactor analysis was determined based on permuted parallel analysis [35]. According to this method, the number of factors is defined as the number of eigenvalues resulting from the observed data that exceed the 95% quantile of the eigenvalues resulting from several permutations of the observed data. In addition, a single-factor model was estimated to compare the factor loadings with the general factor from the bifactor analysis. If the loadings are similar, the general factor in the bifactor model represents the main dimension of the construct very well. Monotonicity was examined for each item by reviewing graphs of the item distributions conditional on average ‘rest scores’. These scores were calculated for each person as the total raw score of all remaining non-missing items divided by their number. Monotonicity can be assumed if there is a consistent trend that persons with higher rest scores are more likely to have more problems in the given item. Local independency was examined based on residual correlations among items resulting from a single-factor analysis [36]. A residual correlation higher than r = 0.3 suggests that the response to one question influences the response to another question.
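The rest-score calculation can be sketched as follows. The grouping into equally sized rest-score groups and the comparison of group means are simplifications chosen for this illustration (the study itself inspected graphs of the full item distributions); all function names are hypothetical, and `None` marks a missing answer.

```python
from typing import List, Optional


def rest_score(responses: List[Optional[int]], item: int) -> Optional[float]:
    """Average of all non-missing responses except the one at index `item`."""
    rest = [r for j, r in enumerate(responses) if j != item and r is not None]
    return sum(rest) / len(rest) if rest else None


def mean_response_by_rest_group(data: List[List[Optional[int]]],
                                item: int, n_groups: int = 4) -> List[float]:
    """Mean response to `item` within groups of increasing rest score."""
    pairs = []
    for row in data:
        rs = rest_score(row, item)
        if rs is not None and row[item] is not None:
            pairs.append((rs, row[item]))
    pairs.sort(key=lambda p: p[0])  # order persons by rest score
    size = len(pairs) / n_groups
    groups = [pairs[round(g * size):round((g + 1) * size)]
              for g in range(n_groups)]
    return [sum(v for _, v in grp) / len(grp) for grp in groups if grp]


def is_non_decreasing(values: List[float]) -> bool:
    """True if the group means never drop as the rest score increases."""
    return all(a <= b for a, b in zip(values, values[1:]))
```

Group means that rise with the rest score are consistent with monotonicity for that item; a drop between adjacent groups would call for a closer look at the full conditional distribution.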

(2) Computing and fitting of the Rasch model: Rasch analysis generally explores to what extent the observed data, i.e. the actual given answers to the items, fits the Rasch model [37]. Within Rasch analysis, the model assumes an underlying latent trait; in this study, it is the trait of functioning and disability. On this trait, both items and persons are being located. The Partial Credit Model (PCM) was chosen to perform Rasch analysis in our study because of the set of ordinal, polytomous DASH items.

Within the PCM, the item thresholds for each item are calculated, indicating the location on the latent trait where the item best discriminates between persons [37–40]. Likewise, each person is assigned a value on the latent trait (called person’s ability). The higher a person’s ability is in relation to the threshold of an item, the higher the probability of a problem on that item. When a person’s location on the latent trait is equal to the threshold of the item, there is by definition a 0.5 probability of a response below or above this threshold. In case of unordered thresholds, the response options have to be collapsed until they are in a correct order. If collapsing of response options was necessary for specific items in our analysis, the collapsing strategy was decided for each item separately. The person separation index was calculated for the final model.
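The role of the thresholds can be made concrete with a small sketch of the PCM category probabilities for one item. It assumes the standard PCM parameterization, with response options recoded to 0..m and one threshold per step between adjacent options; the function names are hypothetical, and the code is not taken from the eRm package.

```python
import math
from typing import List, Sequence


def pcm_probabilities(theta: float, thresholds: Sequence[float]) -> List[float]:
    """Probability of each response category 0..m for a person at `theta`.

    thresholds[j-1] is the threshold between categories j-1 and j.
    """
    # Cumulative sums of (theta - threshold); category 0 has an empty sum.
    logits = [0.0]
    for delta in thresholds:
        logits.append(logits[-1] + (theta - delta))
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def thresholds_ordered(thresholds: Sequence[float]) -> bool:
    """True if every threshold lies above the previous one."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))
```

For a person located exactly at a threshold, the two adjacent categories are equally likely, which corresponds to the 0.5 probability described above. When `thresholds_ordered` returns `False`, at least one response option is never the most probable answer at any point on the latent trait, which is the situation addressed by collapsing response options.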

Chi-square-based item infit and outfit mean square statistics were computed. The outfit statistic includes differences between the observed and expected responses for all items. The infit statistic is weighted and emphasizes those items near the person’s ability [37]. Values close to 1 indicate good item fit, while values larger than 1 demonstrate underfit (i.e. the observed data varies considerably more than can be explained by the model, constituting a violation of the model). Values smaller than 1 indicate overfit (i.e. the data varies noticeably less than expected based on the model, which is commonly considered acceptable) [41, 42]. According to the available literature, a range between 0.7 and 1.3 is generally acceptable [41].
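The two statistics can be sketched for a single item as follows, assuming the usual definitions: outfit is the unweighted mean of the squared standardized residuals, and infit is the sum of squared residuals divided by the sum of the model variances. Person abilities and thresholds are taken as given, responses are recoded to 0..m, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def pcm_probabilities(theta: float, thresholds: Sequence[float]) -> List[float]:
    """PCM probabilities of the categories 0..m for a person at `theta`."""
    logits = [0.0]
    for delta in thresholds:
        logits.append(logits[-1] + (theta - delta))
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def item_fit(thetas: Sequence[float], responses: Sequence[int],
             thresholds: Sequence[float]) -> Tuple[float, float]:
    """Return (infit, outfit) mean squares for one item."""
    sq_resid, variances = [], []
    for theta, x in zip(thetas, responses):
        probs = pcm_probabilities(theta, thresholds)
        expected = sum(k * p for k, p in enumerate(probs))
        var = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
        sq_resid.append((x - expected) ** 2)
        variances.append(var)
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(sq_resid)
    # Infit: information-weighted, so well-targeted persons count more.
    infit = sum(sq_resid) / sum(variances)
    return infit, outfit


def within_acceptable_range(mean_square: float) -> bool:
    """Range of 0.7 to 1.3 cited in the text as generally acceptable."""
    return 0.7 <= mean_square <= 1.3
```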

(3) Testing for DIF: If an item shows DIF, it is a potential cause for bias in person measurement. This is the case where subgroups of a sample (e.g. older and younger patients) respond in a different way to a specific item despite equal levels of the underlying characteristic being measured. We tested DIF for sex and for age dichotomized into above or below 45 years (thus splitting the sample into two almost equally sized groups at the median). Due to the sample size, a change in McFadden’s pseudo R² (>0.02) was chosen as the criterion for flagging.
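The flagging criterion can be sketched as follows. The sketch assumes the common logistic-regression approach to DIF, in which a model predicting the item response from ability alone is compared with a model that additionally includes group membership; the log-likelihoods are taken as given, and the function names are hypothetical.

```python
def mcfadden_r2(loglik_model: float, loglik_null: float) -> float:
    """McFadden's pseudo R-squared: 1 - logL(model) / logL(null model)."""
    return 1.0 - loglik_model / loglik_null


def flags_dif(loglik_null: float, loglik_ability_only: float,
              loglik_with_group: float, cutoff: float = 0.02) -> bool:
    """Flag an item if adding the group term raises pseudo R-squared by more than `cutoff`."""
    change = (mcfadden_r2(loglik_with_group, loglik_null)
              - mcfadden_r2(loglik_ability_only, loglik_null))
    return change > cutoff
```

With illustrative log-likelihoods of −100 (null model), −90 (ability only) and −87 (ability plus group), the pseudo R² rises from 0.10 to 0.13, so the item would be flagged.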

Data analyses were performed using R software [43] and R package eRm [44] (Rasch analysis) as well as R package lordif [45] (testing for DIF).

3 Results

3.1 Sample

The age of the 417 patients included in the data analyses ranged from 18 to 82 years, with a mean age of 45.0 [standard deviation (SD) 13.7] years. Table 1 provides an overview of socio-demographic and disease-related characteristics of the sample. Detailed information on the patients’ diagnoses is shown in Table 2. The DASH score ranged from 0 to 97.5, with a mean score of 41.4 (SD 21.3).

Table 1 Socio-demographic and disease-related characteristics of patients (N = 417)
Table 2 Diagnoses of patients (N = 417); multiple answers per patient possible

3.2 Internal Consistency Reliability

The DASH items showed high internal consistency reliability for our sample. Inter-item correlation was high, but showed great variability (mean 0.50, range 0.12–0.89). Item-total correlation was higher than the inter-item correlation, with less variation (mean 0.72, range 0.50–0.84). Cronbach’s alpha was 0.97, and McDonald’s omega total was 0.98. Therefore, all values indicated high internal consistency reliability in our sample.

3.3 Rasch Analysis

3.3.1 Rasch Model Assumptions

Unidimensionality: We derived three factors (displayed as blue, red and green curves in Fig. 1) from the permuted parallel analysis. Figure 1 shows high loadings for all items on the general factor (black curve). In general, the loadings were very close to the loadings from the single-factor model (dotted black curve). However, it must be noted that the items 24–28 concerning sensations (i.e. tingling, weakness, stiffness) and pain [i.e. pain (in general), pain when performing an activity] had higher loadings on the factor displayed as the red curve than on the general factor (black curve). So we assumed that these items did not fit the latent trait of functioning and disability perfectly in our sample. We still decided to compute the PCM including these items, but compared it to a model without the items referring to sensation and pain. Furthermore, the DASH items 2 (‘writing’) and 3 (‘turning a key’) (blue curve) as well as 10 (‘carrying a shopping bag’) and 11 (‘carrying a heavy object’) (green curve) each formed a factor, but with factor loadings below those of the general factor.

Fig. 1 Loadings retrieved from bifactor analysis. The solid black curve constitutes the factor loadings for the general factor, while the coloured curves (blue, red and green) constitute the factor loadings for the three group factors from the bifactor analysis. The dotted black curve contains the factor loadings from a single-factor model

Monotonicity: Mosaic plots were evaluated to detect monotonicity. A general trend of a person with higher rest scores showing greater problems could be seen in all items. Therefore, monotonicity could be concluded.

Local independency: We calculated residual correlations among all items based on a single-factor analysis to check for possible local dependency. Overall, the results showed very low residual correlations, with three exceptions in accordance with the results from the bifactor analysis. A high residual correlation was found for items 2 (‘writing’) and 3 (‘turning a key’) (r = 0.31) as well as for items 10 and 11 (r = 0.36), both referring to carrying objects.

3.3.2 Computing of Partial Credit Models

As stated before, we decided to compute and compare two different PCMs: one with and one without the questions addressing pain and sensations (items 24–28). Since there were only slight differences, we decided to solely use a model including all items to perform further analyses. A closer inspection of the threshold patterns supported the use of the PCM. Before fitting the model, we found eight items with unordered thresholds. We collapsed the response options in these items until all thresholds were in the correct order (see Fig. 2). The final PCM was calculated and tested using the collapsed data. The person separation index for the model with the collapsed items was 0.96, indicating high reliability. Nine items showed misfit to the fitted PCM. The infit and outfit mean square statistics for this model revealed an overfit for six items and an underfit for three items (see Table 3).

Fig. 2 Collapsing strategies of items with unordered thresholds. DASH items with unordered thresholds of the response options ranging from 1 to 5 are displayed along with the respective collapsing strategies. Collapsed response options are marked by circles

Table 3 Polychoric correlations of the DASH items with the rest score and item fit statistics, in order of DASH items

Figure 3 displays the person-item map. In the upper part, the grey histogram reflects the distribution of the persons’ abilities on the latent trait of functioning and disability. Persons’ abilities range from −4.41 to 4.24 (mean −0.09, median 0.01) (see x-axis values). In the lower part, the item thresholds (black circles) are visualized. The item thresholds cover most of the range (from −1.94 to 3.94, mean 0.41, median 0.25), but leave gaps, especially at the lower end of the continuum. However, only a very small part of the sample (25 persons, i.e. 6.0%) has persons’ abilities below the lowest threshold. To facilitate judgment on targeting between persons’ abilities and item thresholds, the item thresholds are additionally visualized by little black lines directly beneath the histogram of the persons’ abilities in the upper part of Fig. 3. Overall, the items target the persons well.

Fig. 3 Person-item map. The histogram in the upper part displays the distribution of the persons’ abilities. In the lower part, the circles illustrate the thresholds for each response category of the respective items, while (as a summary measure) the black dots indicate the mean of the thresholds for the respective item. The item thresholds are in addition visualized by little black lines directly beneath the histogram of persons’ abilities

3.3.3 Testing for Differential Item Functioning

No DIF was found for age. Item 11 (‘carrying a heavy object’) showed DIF for gender.

4 Discussion

With this psychometric study we examined whether the DASH is a valid and reliable questionnaire to measure functioning and disability in patients with hand injuries and diseases using Rasch analysis. Previous studies using traditional psychometric methods in different populations and various settings showed excellent results on reliability and validity. Our study cannot ultimately confirm an underlying latent trait when applying the DASH to a sample of patients with hand injuries and diseases. Some items also showed disordered response options and item misfit.

The DASH was developed and is used for persons suffering from any kind of injury and disease of the upper extremities. As it aims to measure the extent of disability of a person, all items should be located on a latent trait or continuum of disability. From a methodological point of view, there is no standard approach to prove this continuum and more specifically to examine whether a PROM is unidimensional or not. In our analysis we decided to examine unidimensionality using bifactor analysis. Our bifactor analysis showed that items referring to pain and sensations did not fit the underlying dimension of disability. Although pain and sensations are clinically relevant and important aspects of disability for patients, the inclusion of these items into the overall DASH score seems to be problematic. The residual correlations and the bifactor analysis both revealed problematic results for items 2 (‘write’) and 3 (‘turn a key’). According to Reise this can be explained by the content similarity of these two items [34].

We found that nine out of 30 items misfit the model. Particularly, activities such as preparing a meal (item 4), doing heavy household chores (item 7) or making a bed (item 8) did not fit the model. This could have been caused by our sample including mostly men who probably do not perform these activities on a regular basis. Even though it showed no misfit, writing (item 2) was also a problematic item. After discussions within our research team, we concluded that writing as an activity may not be of relevance in the digital age or for patients with the working background of a craftsperson, of which our sample mostly consisted. Furthermore, we found DIF for item 11 (‘carry a heavy object’). Yet, one must consider that even without the existence of a hand injury or disease, men and women may not respond to this question similarly.

The items referring to recreational activities partly showed unordered thresholds before fitting the final PCM. This might be explained by the fact that activities such as knitting, golfing or playing Frisbee do not reflect hobbies carried out by the majority of the population. In addition, it seemed to be difficult for our study participants (most of them treated as inpatients during data collection) to rate activities referring to recreation and leisure, as patients in clinical settings hardly perform them. This should be considered when applying the DASH as an outcome measure in clinical trials and in clinical settings. The issues reported beforehand might hint at the possibly limited conceptual base of the DASH. In the development process of the DASH, no psychometric techniques were applied for item reduction [12]. Items were selected and added on the basis of experts’ opinions. Therefore, items did not necessarily measure the same construct or operationalize important aspects as intended.

Besides the fit of items, it is important to look at their discrimination ability in further detail. We found disordered thresholds for eight items. This could have several reasons. The most intuitive one is that the persons filling in the DASH had difficulties discriminating between the five response options and their labels. Detailed investigations of the item characteristic curves showed that it seemed to be difficult for patients to differentiate between ‘moderate’ and ‘severe difficulty’, as well as ‘severe difficulty’ and ‘unable’. Cano and colleagues also reported similar findings in their sample of patients with multiple sclerosis [25]. One option to overcome this problem is to reconsider the labelling or even the number of the response options, especially since the wording of the response options is only identical for the first 21 questions (i.e. ‘no’, ‘mild’, ‘moderate’ or ‘severe difficulty’, ‘unable’) and then changes four times to, for example, ‘not at all’, ‘slightly’, ‘moderately’, ‘quite a bit’ and ‘extremely’ in item 22.

With regard to the person parameters, our findings suggest that the DASH is not perfectly targeted for people who suffer from injuries and diseases exclusively located on the hand. Both the mean location for persons and items on a well-targeted PROM (i.e. not too hard, nor too easy) should be around the value of zero [40]. In our study, we detected a slight shift to the right for the item thresholds. Accordingly, the persons’ abilities of a small part of the sample could not be differentiated optimally on the basis of the available item thresholds at the lower end of the continuum. However, this should be interpreted with caution as we used data from a clinical sample including a number of patients with severe impairments [7, 9, 11].

This study has some limitations. Firstly, the data used for our analyses were collected in German hospitals run by the German Statutory Accident Insurance. Thus, many patients in our sample had injuries following accidents at work (e.g. cutting damage with transection of a nerve), which might not reflect the whole range of hand injuries in general and might have biased the results. Secondly, due to the typical clientele of those hospitals, our sample showed an unbalanced gender proportion, which might have influenced the DIF testing. Thirdly, it remains unclear at which treatment stage the patients were included in the two former studies and more specifically when they filled out the DASH. The use of a clinical convenience sample might limit the generalizability of our results to other patient groups or settings. Fourthly, using the median age of 45 as the cut-off to test for DIF was an arbitrary decision. Finally, it is possible that some linguistic and cultural characteristics of the German version of the DASH could have influenced some of our results.

5 Conclusions

Our study reveals that some DASH items do not fit the underlying trait of disability that the DASH aims to assess. Unordered thresholds of items shed light on patients’ problems in differentiating between response options when filling out the DASH, especially in those who are severely affected. It would be worthwhile to perform further studies using Rasch analysis with other target groups (e.g. patients with injuries of the upper arm and shoulder) to compare their findings with our results.