Introduction

The impact of knee osteoarthritis (OA) on the individual is usually estimated by evaluating pain, physical function and the patient’s global assessment of well-being [1, 2]. For these purposes, patient-reported outcome measures (PROMs) are commonly used, but performance measures (PMs) could also be considered to quantify physical function [3]. It is well established that PROMs and PMs do not capture the same aspects of physical function in musculoskeletal conditions including knee OA [4–9]; it is suspected that PROMs generally measure an overall comprehensive experience [10, 11], whereas PMs may target a more specific construct linked to impairments in body functions [11]. One study (n = 115) found that the sensitivity to change over a period of 2 years was better for PMs than for PROMs in a population of patients with hip and/or knee OA [12]. The same conclusion was reached in a population of patients undergoing hip or knee replacement (n = 73), leading the authors to suggest that PMs should be core outcome measures in knee OA [13].

Hence, both PROMs and PMs should be used, as they contribute to a comprehensive understanding of a patient’s situation [14–16]. Pain triggered by activity is a characteristic feature of knee OA [17–20]. This leads to the anticipation that a PM may contribute further valuable information if a pain score is integrated with it, so that it measures the specific construct of pain during an activity. In fact, it has been suggested that pain measures in knee OA should always include either performing pain-provoking activities or asking about pain during these activities [21]. Even though several PMs exist that presumably provoke pain, there are no validated PMs with associated pain assessment for knee OA. A solution to this could be to extend an existing PM to include a pain score. However, we believe that it is possible to exceed the feasibility of existing PMs and thereby increase the incentive to use outcome measures in clinical practice.

Based on input from patients and health professionals [22], we have developed a Dynamic weight-bearing Assessment of Pain (DAP) for knee OA. The instrument combines a PM (weight-bearing knee bends) with a PROM (self-reported pain intensity). The pain intensity is measured on a 0–10 Numeric Rating Scale (NRS) as preferred by patient groups [23] and recommended for measuring pain intensity in clinical trials [19]. The psychometric properties of the DAP are yet to be established. As a first step, the objective of this study was to estimate the reliability, agreement and smallest detectable change (SDC) in the DAP to establish thresholds for detection of change between tests.

Methods

Participants

This study was nested within an assessor- and participant-blinded randomized controlled trial comparing corticosteroid injection with placebo prior to 12 weeks of supervised exercise three times weekly in people with knee OA (EudraCT: 2012-002607-18). Inclusion criteria for the trial were as follows: age above 40 years, radiologically verified diagnosed knee OA, ‘pain while walking on a flat surface’ of at least 4 on a 0–10 NRS, and a body mass index >20 and <35 kg/m². Exclusion criteria were use of intra-articular corticosteroids in the knee or participation in physiotherapeutic exercise for knee OA within the last 3 months, or severe diseases. As part of the larger trial, the participants filled in the ‘Knee injury and Osteoarthritis Outcome Score’ (KOOS) [24].

Data for the current reliability and agreement study were collected at the follow-up assessments in the hosting trial done 3 months after termination of study interventions. All participants in this study gave informed consent before enrolling in the hosting trial and received a copy of the consent. All participants were asked about adverse events during a rheumatologist consultation within 2 weeks after the tests. This report follows the reporting guideline recommended [25] for reliability and agreement studies (GRRAS statement) [26]. The statistical analyses follow the COSMIN standards [27, 28].

Test description

The DAP is a simple performance test with an integrated pain score, designed to provide useful information on the interaction between pain and function for monitoring treatment progress and evaluating treatment effects. The DAP is intended for use both in research and in clinical practice, primarily physiotherapy related. The patient is asked to perform as many standing knee bends as possible within 30 s. For each bend, the knees should reach approximately 90 degrees of flexion (visually inspected by the observer) and full extension (to the extent possible for the individual patient). Limited range of motion does not preclude a test and does not result in missing data. This is supervised by the rater, who decides whether the test performance is approved according to the purpose, e.g., clinical monitoring of treatment progress or scientific purposes. There are three scores in the test: (1) number of pain-free knee bends; (2) number of painful knee bends; and (3) pain during knee bends on a 0–10 NRS. Scores from (1) and (2) are added to give the total number of knee bends. The pain score is obtained immediately after the knee bend test with the question: ‘How much pain did you feel during the knee bends, on a scale from 0 to 10, where 0 is no pain and 10 is the worst pain you can imagine?’ In case the pain varies during the test, the highest pain intensity is recorded. The DAP takes about 1 min to perform including instructions and does not require any equipment besides a stopwatch/watch. The numbers of knee bends are direct measures of the patient’s ability to repeat a weight-bearing activity within a short timeframe; the pain intensity scores are measures of pain during a specific weight-bearing physical activity. The purpose is to reflect the limitations due to knee OA in daily activities that involve weight-bearing knee bending (e.g., getting up from or sitting down in a chair, gardening, cleaning).
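As a minimal sketch of how the three DAP scores fit together, the record below stores them and derives the total number of knee bends. The class and field names are hypothetical and chosen only for illustration. The validation rules simply mirror the description above (non-negative counts, pain on a 0–10 NRS), and the example values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DapResult:
    """One 30-s DAP test (hypothetical structure, not part of the published test)."""

    pain_free_bends: int  # score (1): number of pain-free knee bends
    painful_bends: int    # score (2): number of painful knee bends
    pain_nrs: int         # score (3): highest pain during the bends, 0-10 NRS

    def __post_init__(self):
        # Reject values that cannot occur according to the test description.
        if self.pain_free_bends < 0 or self.painful_bends < 0:
            raise ValueError("bend counts must be non-negative")
        if not 0 <= self.pain_nrs <= 10:
            raise ValueError("pain must be on the 0-10 NRS")

    @property
    def total_bends(self) -> int:
        # Scores (1) and (2) are added to give the total number of knee bends.
        return self.pain_free_bends + self.painful_bends


# Made-up example values, not study data.
example = DapResult(pain_free_bends=8, painful_bends=12, pain_nrs=4)
print(example.total_bends)
```

Deriving the total rather than storing it keeps the three recorded scores as the only source of truth, so the total cannot drift out of sync with its two components.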

Study design

The study design is shown in Fig. 1. Two physiotherapists tested the DAP on all participants. Rater A (LK) is the test developer and experienced in using the DAP. Rater B (EG) had no experience with the DAP, but had an introduction and one rehearsal session with a knee OA patient not included in the study. Each participant had two visits separated by a minimum of 2 days and a maximum of 1 week. At the first visit, a single DAP was conducted by rater A. At the second visit, the DAP was conducted by both raters A and B in a randomized order, separated by approximately 1.5 h. Thus, each participant was tested twice by rater A and once by rater B.

Fig. 1 Study design

Statistical analysis

The statistical analyses were performed using SAS statistical software (version 9.3) and SPSS (IBM SPSS Statistics 19). Reliability was estimated by Intra-class Correlation Coefficients (ICC) for agreement based on a two-way ANOVA with random effects (single measures, ICC 2.1) [28]. ICCs were calculated for both intra- and inter-rater data. We had decided a priori to interpret the ICC value using the criteria for clinical acceptability suggested by Fleiss, where ICC < 0.4 represents poor, 0.4 < ICC < 0.75 represents fair to good and ICC > 0.75 represents excellent agreement [29]. However, as the DAP is also intended for use in clinical practice, the quality criterion for this purpose was conservatively set to an ICC of at least 0.90 [30, 31], with a lower 95 % confidence limit of at least 0.75.
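The single-measures agreement ICC described above can be sketched as follows. This is an illustration under stated assumptions, not the SAS or SPSS procedure used in the study. It uses the standard Shrout and Fleiss mean-square formula for ICC(2,1). The function names and the example scores are hypothetical. The Fleiss categories do not say where values of exactly 0.4 or 0.75 belong, so the handling of those boundaries here is an assumption.

```python
def two_way_mean_squares(scores):
    """Return (MS_subjects, MS_raters, MS_residual) for an n x k table.

    `scores` holds one row per participant and one column per rating.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_resid = ss_total - ss_rows - ss_cols

    return (ss_rows / (n - 1),
            ss_cols / (k - 1),
            ss_resid / ((n - 1) * (k - 1)))


def icc_2_1(scores):
    """Two-way random effects, absolute agreement, single measures."""
    n, k = len(scores), len(scores[0])
    msr, msc, mse = two_way_mean_squares(scores)
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def fleiss_label(icc):
    """Fleiss categories; treatment of exactly 0.4 and 0.75 is an assumption."""
    if icc < 0.4:
        return "poor"
    if icc <= 0.75:
        return "fair to good"
    return "excellent"


# Made-up pain scores (rater A, rater B) for five participants, not study data.
demo = [[2, 3], [5, 5], [0, 1], [7, 6], [4, 4]]
value = icc_2_1(demo)
print(round(value, 2), fleiss_label(value))
```

Because the rater mean square enters the denominator, a systematic difference between raters lowers this coefficient, which is why it is an agreement ICC rather than a consistency ICC.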

We calculated the measurement error as the ‘standard error of measurement’ (SEM), which represents the standard deviation of repeated measures in one patient. SEM was calculated as the square root of the residual mean square value obtained from the two-way analysis of variance, which is used to calculate the ICC [28]. SEM was calculated both within raters (intra-rater) and between raters (inter-rater). Subsequently, the smallest detectable change (SDC) was calculated, representing the minimal change that must appear to ensure that the observed change is beyond measurement error; SDC is calculated as 1.96 × √2 × SEM for both intra- and inter-rater data [28]. Based on the SDC, the limits of agreement (LoA) were calculated (\(\bar{d} \pm \text{SDC}\), where \(\bar{d}\) is the mean difference between the two tests) and presented in Bland and Altman plots. Acceptable SDC was set to a maximum of two points or a reduction of 30.0 % in the 11-point NRS for pain, as this is regarded to represent the minimal clinically important difference [32]. For the knee bending scores, the a priori maximum SDC was set to 2.6, based on the minimal clinically important difference in the 30-s sit-to-stand test, tested in a hip OA population [33].
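The chain from residual mean square to SEM, SDC and limits of agreement can be sketched as below. The function names are hypothetical. The SEM inputs are the rounded pain-score values reported in the Results (0.70 intra-rater, 0.86 inter-rater), so the computed SDC can differ slightly from the published figures, which were presumably derived from unrounded SEMs. The mean difference passed to the limits-of-agreement function is a made-up placeholder.

```python
from math import sqrt

PAIN_SDC_MAX = 2.0   # a priori acceptable SDC for the 0-10 pain NRS
BENDS_SDC_MAX = 2.6  # a priori acceptable SDC for the knee bend scores


def sem_from_residual_ms(residual_mean_square):
    """SEM as the square root of the ANOVA residual mean square."""
    return sqrt(residual_mean_square)


def sdc_from_sem(sem):
    """Smallest detectable change: 1.96 x sqrt(2) x SEM."""
    return 1.96 * sqrt(2) * sem


def limits_of_agreement(mean_diff, sdc):
    """Limits of agreement as mean difference +/- SDC."""
    return mean_diff - sdc, mean_diff + sdc


# Rounded pain-score SEMs taken from the Results section.
for label, sem in (("intra-rater", 0.70), ("inter-rater", 0.86)):
    sdc = sdc_from_sem(sem)
    verdict = "acceptable" if sdc <= PAIN_SDC_MAX else "not acceptable"
    print(f"{label}: SDC = {sdc:.2f} ({verdict})")

# Placeholder mean difference of 0.5 points, for illustration only.
print(limits_of_agreement(0.5, sdc_from_sem(0.70)))
```

Because the factor 1.96 × √2 is roughly 2.77, an SEM of about 0.72 points is the largest that still keeps the SDC within the two-point threshold for pain.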

Sample size considerations

The power calculation was based on estimates of the 95 % confidence interval of the ICC. Assuming that the reliability of the DAP corresponded to an ICC of 0.80, including 13 participants would result in a lower 95 % confidence limit of 0.60 [28]. Based on this analysis, the number of participants was conservatively set to 20. We recruited 20 participants among the last 20 participants enrolled in the hosting trial.

Results

A total of 20 hosting trial participants who met the eligibility criteria were invited. All agreed to participate, and all completed the study. Their characteristics are presented in Table 1. Summary statistics from tests and retests are provided in Table 2. Table 3 presents the results for ICC, SEM, SDC and LoA.

Table 1 Participants’ characteristics
Table 2 Summary statistics for intra- and inter-rater test, retest and difference
Table 3 Intra-class Correlation Coefficients (ICC) with 95 % confidence interval (CI), standard error of measurement (SEM), smallest detectable change (SDC) and limits of agreement (LoA)

Of the four scores, the pain intensity score showed the best properties in terms of low SEM (0.70 for the intra-rater tests and 0.86 for the inter-rater tests on a scale from 0 to 10), acceptable SDC for the intra-rater tests (1.95) and excellent ICC (0.93, CI 0.83 to 0.97 for the intra-rater tests and 0.91, CI 0.78 to 0.96 for the inter-rater tests). The SDC for the inter-rater tests (2.39) did not reach the a priori acceptable level. The three knee bend scores all had ICC above 0.50, showing fair-to-good agreement. However, only for the inter-rater tests did the lower confidence limit not fall below 0.40. The SEM for knee bends varied from 2.95 to 6.85 for the intra-rater tests and 2.56 to 5.95 for the inter-rater tests, in both cases with the lowest SEM for the total knee bend scores. For none of the knee bend scores did the SDC fall below the a priori defined maximum of 2.6. The Bland and Altman plots in Fig. 2 illustrate the differences between observers plotted against the mean value of both observers for (A) pain intensity, (B) total bends, (C) pain-free bends and (D) painful bends.

Fig. 2 Bland and Altman plots illustrating the differences between observers (A − B) plotted against the mean value of observers (A + B)/2 for (a) the pain intensity scores, (b) the total knee bend scores, (c) pain-free knee bends and (d) painful knee bends. The solid line (mean difference) is an expression of the systematic error between observers, while the limits of agreement define the boundaries of random error

Within 2 weeks after the tests (at the rheumatologist consultation), one participant complained about pain in the days after performing the DAP. The excess pain had disappeared at the time of the consultation and was not considered related to the test. Otherwise, no adverse events were noted.

Discussion

This study supports the integration of a pain score with a performance measure in order to capture another perspective of pain: the interaction with function. In this population of people with symptomatic knee OA, the DAP pain score shows excellent ICC, comparable with patient-reported outcome measures [34] and other performance-based outcome measures such as walking, stair, and chair stand tests [35]. Furthermore, the DAP has the advantage of being very short and not requiring any equipment besides a (stop)watch, whereas other performance measures typically require walking lanes, stairs, or chairs of standard dimensions. This, together with the low SEM and, for the intra-rater test, acceptable SDC, supports the applicability of the DAP in research and clinical practice, with the pain score as primary indicator.

The excellent ICC (0.91 and 0.93) of the pain score suggests that measuring pain during a pain-provoking activity yields reliable results. The demands on reliability and measurement error for instruments applied at the individual level in clinical practice are higher than at group level, as there is often only one score (no averaging) [36]. Thus, the low measurement error of the DAP pain score makes the DAP useful at the individual level in clinical practice.

The knee bend scores did not show adequate reliability and agreement in this population; hence, the knee bend scores may be omitted leaving only the pain score in the test, making this even simpler. However, the number of knee bends may have a motivational effect because of the more detailed information on treatment progress provided. This remains to be evaluated.

Limitations

This population had relatively mild symptoms with a mean of 70.5 on the KOOS pain subscale, and 78.5 on the KOOS function subscale (0–100 scales; higher is better). However, a mean of 55.3 on the KOOS quality-of-life subscale (range 0–87.5) indicates that the participants were indeed affected by their knee OA. Three patients had a DAP pain score of 0 at the first visit (NRS = 0), and six patients had a DAP pain score of 0 at the second visit (regardless of the rater). This calls for attention to the risk of floor or ceiling effects of the DAP. However, as there is no reason to discriminate patients reporting no pain any further, this cannot be categorized as a floor effect [28]. The same can be assumed regarding the knee bend score, as a limited knee mobility does not exclude anyone from performing the DAP. The reliability of the DAP is still unknown for populations with more severe symptoms. The small sample size, based on a priori calculations, is a possible limitation to the study. Furthermore, the lack of a stable external measure to ensure the absence of change between the two visits is a limitation to this study, as potential changes could have affected the correlation coefficients.

In general, the reliability was higher between the two raters than within the same rater, at least for the knee bend scores. This may be related to the study design, with one test by rater A on the first visit, and tests by both raters A and B on the second visit; higher mean knee bend scores and lower mean pain scores on the second visit suggest a certain learning effect. The difference could also be due to day-to-day variability. However, the SEM did not vary much between intra- and inter-rater measures. As measurement error is more a characteristic of a test in itself [27], it is expected to remain stable across populations and raters. The random sequence of the raters at the second visit may have influenced the intra-rater reliability, given that about half of the tests at the second visit were preceded by a test with the other rater. However, there was no significant difference related to the sequence of tests; mean pain score difference was 0.8 (3.0 where rater A tested first, and 2.2 where rater A tested second, p = 0.58); mean total knee bend score difference was 2.2 (22.4 where rater A tested first, and 20.2 where rater A tested second, p = 0.39).

In this study, we asked the participants to bend their knees from a standing position until reaching flexion of approximately 90°. This is a somewhat unspecific instruction and was only monitored visually by the rater; thus, a certain variability is assumed. For example, the two participants who reached more than 30 in the total number of knee bends on their second visit are unlikely to have reached 90°. The good results in this study despite this uncertainty lend further support to the properties of the DAP. Also, bending the knees to approximately 90° is a reasonably easy task to comprehend for most patients, and we believe that a pragmatic approach to test design facilitates feasibility and cooperation from the patients. During instructions, it was emphasized to the patients that no predefined number of squats was expected from them; the number of squats was according to their personal limit of tolerance. This might result in some patients choosing to endure pain in exchange for better performance (more bends) and some choosing less pain at the expense of high performance. This is true both in everyday life and in the interaction with the healthcare system, and unpredictable pain behavior is a premise for all performance measures. The DAP was developed in an attempt to address this pain behavior; i.e., we believe that the pain score of the DAP reflects the interaction between pain and function. Hence, we do not think of this as a limitation to the DAP.

We chose to include only one rehearsal session with the non-experienced rater. We were confident with this choice because basic physiotherapy knowledge enables understanding and performing this simple test. Furthermore, we wanted to test whether this minimal instruction would be sufficient; as this seems to hold true, the feasibility of the DAP is promising in this regard. However, the low reliability of the knee bend scores suggests that more explicit instructions are warranted; this is pending. Importantly, the results of this study only apply to physiotherapists; whether the DAP can be used by other groups of health professionals remains to be examined.

All participants were asked about adverse events during a rheumatologist consultation within 2 weeks after the tests. Only one participant complained about pain in his unaffected knee after the tests, but this was not considered related to the DAP. Thus, we are confident that the DAP is safe and carries no excessive risks compared to everyday activities in a population with mild knee OA.

In conclusion, the reliability, agreement and, for the intra-rater test, the smallest detectable change in the DAP pain score meet the demands for use in clinical practice and research. The total knee bend score should be kept for motivational reasons. Evaluation of other important psychometric properties of the DAP such as validity, responsiveness and feasibility is pending.