Introduction

Rheumatoid arthritis (RA) affects up to 1% of adults and is three times more common in women than men [1, 2]. The joint pain, swelling, and damage associated with RA greatly affect physical, emotional, and social health and significantly impair health-related quality of life (HRQL) [3,4,5]. Up to half of people with RA are unable to work due to disability after 10 years, a trend that remains unchanged despite the introduction of biologics 20 years ago [6]. One reason for this may be that people with RA experience unexpected increases in RA activity (flares), both temporary and sustained, which vary in frequency, severity, and impact. Disease flares are periods of increased inflammatory disease and have been defined as a cluster of symptoms of sufficient duration and intensity to require a review and possible change of existing treatment [7]. RA flares are important to identify and manage as they contribute to joint damage, disability, and increased cardiovascular disease, all of which greatly impact HRQL in people with RA [8].

Current treatment guidelines emphasize tight disease control and a “treat-to-target approach” in RA, with remission or low disease activity as the treatment goal to reduce disability and improve HRQL [9]. To achieve this, RA is closely monitored, a patient’s level of disease activity is calculated using a validated algorithm, and therapies are adjusted as needed until this target is reached. However, reliable monitoring of disease activity in real time is hampered by the lack of a gold standard and the use of different approaches and tools of varying complexity that can yield different results.

Patient-reported outcomes (PROs) have been a cornerstone of RA disease monitoring since the late 1970s, when the Patient Global Assessment (PGA), an 11-point numeric rating scale (NRS), was introduced [10]. In clinical practice, physicians monitor the number of swollen and tender joints and ask patients about their pain and disability to form an overall impression of disease activity (MD Global Assessment [MDGA]). Joint counts and MDGA constitute clinician-reported outcomes (CLIN-ROs). In the early 1990s, researchers created a composite disease activity index, the Disease Activity Score (DAS), by combining the PGA with MD joint counts and a biomarker (first ESR and later CRP) in a weighted algorithm [11]. Thresholds were identified to classify RA patients as being in remission, low, moderate, or high disease activity states and better match treatment to current disease activity. The inclusion of biomarkers largely relegated the DAS to research given the need for laboratory results and complex calculations. More recently, the PGA and CLIN-ROs (i.e., joint counts and MDGA) were combined into a simple summative score known as the Clinical Disease Activity Index (CDAI). CDAI can be easily calculated during clinical encounters to inform treatment decision-making at the visit [12]. However, growing evidence suggests that CLIN-ROs overestimate improvement and underestimate progression of disease activity [13]. This prompted international members of the Outcome Measures in Rheumatology (OMERACT) RA Flare Working Group to co-create with patients a new composite PRO of RA disease activity, the RA Flare Questionnaire (RA-FQ) [14]. In the RA-FQ, patients rate RA symptoms and function over the past week [15]. International patient focus groups were conducted to identify relevant domains [16], and items were refined through a Delphi exercise that included several hundred patients, clinicians, researchers, and other stakeholders [17]. The RA-FQ has been shown to be valid, reliable, and responsive to change in international clinical trials and observational studies [17, 18]. To increase utility, evidence supporting score interpretation is needed.

The minimally important difference (MID) has been defined as the average change score of patients reporting a “small” or “minimal” difference [19, 20]. Meaningful change is the average score change among RA patients who consider themselves to be a lot better or a lot worse [21, 22]. In RA, meaningful change helps to establish whether treatment is sufficiently controlling RA inflammation to allow patients to feel and function better in everyday life; as such, for patient-reported outcomes, patients should define these values [23]. For regulatory activities, the U.S. FDA also recommends identifying meaningful within-patient score change from the patient perspective [24]; however, in clinical practice, physician assessment and CDAI level often drive treatment decisions given the emphasis on treat-to-target for RA management.

Our goal was to identify the mean minimal and meaningful score change for the RA-FQ between groups of patients who were improving, worsening, or experienced no change in disease activity. Scores were obtained at consecutive visits 3 months apart from patients, treating physicians, and in relation to changes in CDAI levels. We also explored score changes in response to both improving and worsening RA as others have noted these may vary based on the direction of change [25].

Methods

Design

We used data from two consecutive visits (the 3- and 6-month follow-ups) of RA patients enrolled in the Canadian Early Arthritis Cohort (CATCH). CATCH is a prospective observational inception cohort of adults with early RA who are enrolled around the time of diagnosis and beginning treatment and followed at pre-determined intervals [26]. These time points were chosen because 3 months is a typical time frame used to judge the effectiveness of initial RA treatments and because the interval allowed for sufficient variation in patients experiencing improvement and worsening at the second visit. Ethics approval was previously obtained at each of the 16 participating CATCH sites across Canada. Written informed consent was obtained from participants at enrollment, and the study was conducted in accordance with the Declaration of Helsinki.

Participants

Participants were adults aged 18 years or older who were enrolled in CATCH between 2011, when the RA-FQ was implemented, and March 2017, when the Patient Global Impression of Change (PGIC) was discontinued, and who had RA-FQ and PGIC scores available at the 3- and 6-month visits.

Outcomes

Rheumatoid arthritis flare questionnaire (RA-FQ)

The RA-FQ contains five items that ask respondents to rate their pain, physical function, stiffness, fatigue, and participation over the past week using 11-point NRS (0 = none to 10 = severe). The RA-FQ was co-created with patients and in accordance with best practice methods [27,28,29]. The conceptual framework evolved from international focus groups with RA patients [16] and Delphi exercises with RA patients, clinicians, researchers, and other stakeholders [3]. Psychometric performance of the RA-FQ was examined in multiple international clinical trials and longitudinal observational studies, including CATCH [18, 30, 31]. Rasch analysis showed acceptable fit to the Rasch model, with items and people covering a broad measurement continuum with appropriate targeting of items to people, ordered thresholds, minimal differential item functioning by language, sex, or age [17]. A summative score across items is defensible, yielding an interval score (0–50) where higher scores reflect worsening disease activity.
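
As an illustration of the summative scoring described above, the following minimal sketch totals the five item ratings and computes the within-patient change between two visits. The function names, the item keys, and the decision to reject incomplete questionnaires are assumptions made for this example; the source does not specify how missing items are handled.

```python
from typing import Mapping

# Hypothetical item keys for the five RA-FQ domains named in the text.
RA_FQ_ITEMS = ("pain", "physical_function", "stiffness", "fatigue", "participation")


def ra_fq_total(responses: Mapping[str, int]) -> int:
    """Sum the five RA-FQ items (each rated 0-10) into a 0-50 total."""
    total = 0
    for item in RA_FQ_ITEMS:
        if item not in responses:
            # Assumption: an incomplete questionnaire is rejected, not imputed.
            raise ValueError(f"missing RA-FQ item: {item}")
        value = responses[item]
        if not 0 <= value <= 10:
            raise ValueError(f"{item} must be between 0 and 10, got {value}")
        total += value
    return total


def ra_fq_change(first_visit: Mapping[str, int], second_visit: Mapping[str, int]) -> int:
    """Within-patient change; positive values indicate worsening."""
    return ra_fq_total(second_visit) - ra_fq_total(first_visit)
```

Because higher scores reflect worse disease activity, a negative change corresponds to improvement.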

Patient-reported outcomes (PROs)

The Patient Global Assessment is an 11-point NRS that asks “Considering all the ways arthritis affects you, how well are you doing today” with responses ranging from 0 = very well to 10 = very poorly [10]. At the second visit, patients also completed a 5-point RA Global Impression of Change rating (“Compared to your last visit would you say that your arthritis is a lot better, a little better, the same, a little worse, or a lot worse”).

Physician-reported outcomes (CLIN-ROs)

At each visit, physicians counted the number of swollen and tender joints (from a total of 28). Swollen joint counts represent the most widely used “objective” indicator of inflammatory activity reflecting inflamed synovial tissue, while tender joint counts are thought to indicate the patient’s level of pain [32]. Among the seven American College of Rheumatology RA Core Data Set measures for clinical trials, joint counts are weighted heavily compared to the other five measures [33]. The MD Global Assessment is a 0–10 rating scale with higher scores representing higher RA disease activity.

Clinical disease activity

The Clinical Disease Activity Index (CDAI), an index that sums tender and swollen joint counts and patient and MD Global Assessments, was calculated for each visit [34]. CDAI is widely used to classify RA disease activity level (remission ≤ 2.8, low > 2.8 to ≤ 10, moderate > 10 to ≤ 22, and high > 22) and to help guide treatment decisions. We used CDAI rather than indices that include biomarkers, such as the DAS-28-CRP and SDAI, because scores on all of these measures are moderately to highly correlated [35] and CDAI is widely used in care as part of shared decision-making [36].
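
The CDAI calculation and classification can be sketched as follows. This is a minimal illustration using the cut points stated above; the function names, the input range checks (28-joint counts, 0 to 10 global ratings), and the signed category-shift helper are assumptions made for the example rather than part of the study's analysis code.

```python
# Ordered from least to most active disease.
CDAI_LEVELS = ("remission", "low", "moderate", "high")


def cdai_score(tender_joints: int, swollen_joints: int,
               patient_global: float, md_global: float) -> float:
    """Sum the four CDAI components (range 0-76)."""
    for name, count in (("tender_joints", tender_joints),
                        ("swollen_joints", swollen_joints)):
        if not 0 <= count <= 28:
            raise ValueError(f"{name} must be between 0 and 28, got {count}")
    for name, rating in (("patient_global", patient_global),
                         ("md_global", md_global)):
        if not 0 <= rating <= 10:
            raise ValueError(f"{name} must be between 0 and 10, got {rating}")
    return float(tender_joints + swollen_joints + patient_global + md_global)


def cdai_level(score: float) -> str:
    """Classify a CDAI score using the cut points given in the text."""
    if score <= 2.8:
        return "remission"
    if score <= 10:
        return "low"
    if score <= 22:
        return "moderate"
    return "high"


def cdai_level_shift(first_score: float, second_score: float) -> int:
    """Number of categories moved between visits; negative means improvement."""
    first = CDAI_LEVELS.index(cdai_level(first_score))
    second = CDAI_LEVELS.index(cdai_level(second_score))
    return second - first
```

For example, a patient moving from a CDAI of 25 (high) to 8 (low) has shifted by two categories in the direction of improvement.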

Anchors

Patients were asked “In the past month, is your arthritis (RA)…” and chose from seven categories ranging from much worse to much better. Meaningful within-person differences reflected ratings of much worse and worse for deterioration and much better and better for improvement; minimal within-person differences reflected ratings of a little better or a little worse. To identify meaningful and minimal score changes from the physician perspective, we calculated the difference between the two visits in the MD global, a static global impression of RA severity (0–10). This method was recently recommended by the U.S. FDA as it may be less prone to recall error [29]. We are aware of only one study, by an OMERACT committee, that has identified clinically important changes in physician global assessments of RA activity using a consensus process among rheumatologists, methodologists, and other stakeholders [37]. They reported that a change of 2.3 on an 11-point NRS represented a clinically important change in RA patients, and 1.4 in clinical trials of new therapeutic agents [37]. As 1- and 2-point changes in patient global assessments have been associated with being slightly better and much better in studies of patients with musculoskeletal pain [38], including RA [39], we used these thresholds to represent minimal and meaningful change from the physician perspective. This approach is also consistent with the widely used ACR20 criteria (i.e., 20% improvement of core outcomes including physician global) to define improvement. We also estimated score changes associated with a 1 or 2 category change in CDAI levels, as this is a common secondary outcome in RA research.
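
The physician anchor can be expressed as a simple rule on the difference between the two MD global ratings. The sketch below is illustrative only: the function name and category labels are hypothetical, and it assumes integer ratings so that every nonzero difference is at least 1 point.

```python
def md_global_anchor(first_visit: int, second_visit: int) -> str:
    """Map the change in MD global (0-10) to a change category.

    A 1-point change is treated as minimal and a change of 2 or more
    points as meaningful. Higher ratings mean higher disease activity,
    so a positive difference indicates worsening.
    """
    for rating in (first_visit, second_visit):
        if not 0 <= rating <= 10:
            raise ValueError(f"MD global must be between 0 and 10, got {rating}")
    diff = second_visit - first_visit
    if diff == 0:
        return "same"
    direction = "worse" if diff > 0 else "better"
    size = "a little" if abs(diff) == 1 else "a lot"
    return f"{size} {direction}"
```

A rating that falls from 5 to 4 would therefore be labelled "a little better", and one that rises from 3 to 6 "a lot worse".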

Statistical methods

Descriptive statistics were calculated to summarize patient characteristics and outcomes. Our overall approach was to calculate the mean within-patient difference between the two visits from three perspectives: patients, physicians, and CDAI categories. We calculated Pearson correlation coefficients between changes in the RA-FQ total and component scores and changes in other disease activity indicators. Mean score change was also calculated for traditional clinical indicators of RA disease activity (CDAI, Patient Global, swollen, and tender joints) across change categories. To visually examine the separation between change groups across all levels of the RA-FQ change scores and enhance interpretation, we generated empirical cumulative distribution function (eCDF) curves by patient, MD, and CDAI change categories. Analyses were completed using SAS (v. 9.4) and plots were constructed using R (v. 3.4.3).
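
The mean change method and the eCDF construction can be sketched in a few lines. The study analyses were run in SAS and R; the Python below is only an illustration of the two calculations, with hypothetical function names, and any values passed to it in an example are made up rather than study data.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Sequence, Tuple


def mean_change_by_category(
    records: Iterable[Tuple[str, float]]
) -> Dict[str, float]:
    """Average the RA-FQ change score within each anchor category.

    Each record is (anchor_category, change), where change is the
    second-visit score minus the first-visit score.
    """
    groups: Dict[str, List[float]] = defaultdict(list)
    for category, change in records:
        groups[category].append(change)
    return {category: mean(changes) for category, changes in groups.items()}


def ecdf(changes: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (value, cumulative proportion) pairs for one category.

    The proportion at each distinct value is the share of patients
    whose change score is less than or equal to that value.
    """
    ordered = sorted(changes)
    n = len(ordered)
    points: List[Tuple[float, float]] = []
    for rank, value in enumerate(ordered, start=1):
        if points and points[-1][0] == value:
            points[-1] = (value, rank / n)  # tie: keep the highest proportion
        else:
            points.append((value, rank / n))
    return points
```

Plotting one eCDF per anchor category on shared axes gives the kind of display described above, where horizontal separation between curves indicates how well the categories are distinguished.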

Results

Participants

Participants were 808 middle-aged adults who were mostly white (84%) and female (71%). At enrollment, 85% were classified as having moderate or high RA disease activity (Table 1). At 3 months, 43% were in moderate or high disease activity, 39% in low disease activity, and 18% in remission; by 6 months, 65% were in remission or low disease activity. Mean RA-FQ scores at enrollment and at the 3- and 6-month visits were 25.5, 15.6, and 13.9, respectively. Correlations between changes in the RA-FQ total and component scores and disease activity indicators ranged from 0.33 for swollen joint counts to 0.89 for pain and physical function, supporting the appropriateness of the data to evaluate change scores [40].

Table 1 Characteristics of participants at enrollment (n = 808)

Patient perspective: minimal and meaningful change

At the second visit, 59% of patients rated their RA disease activity as improved, 20% worse, and 21% the same as the previous visit. Mean changes in RA-FQ total score and component scores by patient change categories are shown in Table 2. RA-FQ scores changed in the expected directions, and the magnitude of change was similar within components. Worsening disease activity was associated with larger mean changes in RA-FQ scores than improvement. Similar patterns were evident in changes in traditional disease activity indicators (CDAI, patient and MD global assessments, and tender and swollen joint counts). Cumulative distribution function (CDF) curves by patient RA change categories, shown in Fig. 1, suggest that throughout most of the range of change scores, there was clear separation of curves supporting discrimination among categories by patients, with larger spread across scores related to worsening.

Table 2 Change in RA-FQ scores at second visit by patient global impression of change categories
Fig. 1 Cumulative distribution function curves of RA-FQ change by patient change categories

Physician ratings: minimal and meaningful change

At the second visit, physicians rated RA disease activity as improved in 44% of patients, worsened in 24%, and the same as the previous visit in 32%. As compared with patients, physicians were more likely to classify patients as the same or worse. Mean changes in RA-FQ total score and component scores by physician change categories 0, |1|, and |≥ 2| are shown in Table 3. RA-FQ scores changed in the expected directions, and the magnitude of change was similar within components. In contrast to patients, improvement was associated with larger mean changes in RA-FQ scores than worsening disease activity. CDF curves by physician RA change categories shown in Fig. 2 suggest that throughout much of the range of scores, physicians had difficulty discriminating between patients who were the same or a little worse at the second visit.

Table 3 Change in RA-FQ by physician change categories* (N = 787)
Fig. 2 Cumulative distribution function curves by physician change categories (|1| or |2| point change on MD Global for minimal and meaningful change, respectively)

CDAI categories: minimal and meaningful change

Score changes in relation to CDAI change by |1| or |≥ 2| categories are shown in Table 4. At the second visit, 56% of patients remained within the same CDAI category, 29% had improved, and 15% had worsened by one or more CDAI categories. The thresholds for a 1 category change were similar to patient and physician thresholds representing meaningful change. Improvement was associated with a larger mean RA-FQ score change for patients starting with moderate-high disease activity, where the change for a two-category improvement (i.e., to remission) was nearly double that for a one-category improvement (to low disease activity). Similarly, for worsening disease activity, patients starting in low disease activity had numerically larger mean changes than individuals who started in remission at the first visit. Across disease activity levels, patients whose CDAI category was unchanged had similar scores at both visits. CDF curves illustrate considerable variability within and wide separation of curves between change categories among patients within the same CDAI category at both visits (see Fig. 3).

Table 4 Change in RA-FQ by change in Clinical Disease Activity levels between visits
Fig. 3 Cumulative distribution function curves by CDAI change categories

Discussion

The RA-FQ is one of only a few RA measures that can summarize the extent and complexity of the broad impact of RA symptoms on how people feel and function as a single number. We have previously shown that the RA-FQ is sensitive to changes in RA symptoms and function [17]. This is the first study to identify group-based within-person minimal and meaningful change scores for the RA-FQ. Notably, thresholds for change were influenced by the anchor, direction, and baseline values. Values derived from patients are contrasted with those of treating physicians and in relation to change in CDAI levels, the keystone of the treat-to-target approach.

Our results hold implications for researchers, methodologists, and clinicians. First, the thresholds for minimal and meaningful change varied depending on the anchor used. This in turn impacted the proportion of patients who were classified as having improved or worsened. For example, among patients, 59% classified themselves as better or much better, whereas physicians classified 44% as improved, and 29% had improved by at least one CDAI category. Patients were least likely to be classified as worse using a change in CDAI level (15%) and most likely to be seen as having worsened disease activity by physicians (24%), with patients landing in between (20%). Second, patients had larger thresholds for defining worsening RA whereas physicians had larger thresholds for defining improvement. In effect, physicians were looking for more resolution of symptoms/impacts than patients to be confident RA was much better (score change −7.3 vs. −6.0, respectively), but had a lower threshold to judge symptoms and RA activity as much worse (score change +5.7 vs. +8.9). The lack of separation between CDF curves of physician RA change categories suggested that physicians could detect substantial changes in disease activity but had difficulty recognizing when patients were a little better or a little worse as compared to the previous visit. Establishing the overlap among thresholds that both patients and physicians consider meaningful and worthwhile holds important implications for clinical trials, comparative effectiveness research, and optimal management of RA in care settings. There are likely to be scenarios where more or less stringent definitions of change are appropriate, which will influence the choice of anchor (e.g., change that is associated with a greater likelihood of increased joint damage may be better reflected by physician or CDAI categories). In addition, in RA, duration of change may be as important as the magnitude of change; for example, in RA clinical trials, it has been proposed that increased disease activity should persist for at least 7 days before being classified as an inflammatory flare [17]. Discrepancies in thresholds are also important to identify as these may contribute to patient dissatisfaction with care, treatment non-adherence, poorer disease outcomes, and increases in health care utilization and costs [41].

Two recent systematic reviews concluded that patient and physician perspectives regarding RA disease status often significantly differ [33, 34]; when discordance exists, up to 79% of patients generally perceived their disease as being more active than their treating physician did. As global assessments of physicians and patients generally have good reliability [42], this discrepancy can be explained, in part, because pain, disability, and fatigue may feature prominently in patient assessments, whereas joint counts and acute phase reactants influence physician ratings [43]. Our data suggest that assessments of change also differ, with patients more likely to perceive improvement than providers. Further, among patients, the RA-FQ score change associated with meaningful improvement vs. worsening differed. Meaningful improvement for patients was associated with a 6-point decrease in RA-FQ, whereas meaningful worsening was associated with a 9-point increase; similar patterns were seen for minimal worsening or improvement. This suggests that either patients may be more vigilant to identify improvement or perhaps that the threshold for disease to be perceived as worse is higher than that for improvement. At the same time, the graphical displays suggest patients appear better able to identify worsening RA at a finer level than improvement.

The pattern observed with physicians differed from patients. First, clinicians classified fewer patients as improved between visits with more classified as the same or a lot worse. In contrast to patients, on average, a numerically larger RA-FQ score change was associated with improvement versus worsening. It is interesting to note that the change in joint count (and CDAI) also was higher for patients rated as a lot worse or a lot better as compared with patient-based assessments of change. This is likely because physicians view joint counts as a reliable indicator of inflammatory disease activity. Physicians also appeared less able to differentiate patients who were a little worse from those who had the same level of disease activity at the previous visit. Changes in RA-FQ scores in relation to changes in CDAI levels were robust and similar in the direction of worsening or improvement; scores were also stable between visits in participants whose CDAI level was unchanged. Overall, triangulating meaningful RA-FQ change scores among patient and clinician ratings with RA disease activity classifications yielded robust results adding further evidence that disease activity as captured by the RA-FQ represents a well-defined concept that can be reliably measured. Evaluation of within-patient change establishes both the responsiveness of the RA-FQ and patient-relevant thresholds of change. The use of multiple anchors also demonstrates that benefits of a treatment as judged by physicians and using a treat-to-target approach are also perceived as clinically meaningful by patients. Conversely, both physicians and CDAI change may suggest patients are deteriorating before patients perceive their RA has changed.

While thresholds identified at the group level can inform policy and comparisons between different treatments, individual-level thresholds are necessary to inform clinical treatment decisions [44]. Examining within-patient change visually can also offer new information. The CDF curves present a continuous view of the proportion of patients within each category experiencing score changes. The physical separation between curves suggests that most patients who said they were a lot better or a lot worse were relatively distinct from those reporting they were a little better or worse or the same throughout most of the continuum of score changes. We also evaluated changes in traditional RA clinical indicators and observed that changes in other disease activity indicators (e.g., mean CDAI and joint counts) were largest when using the CDAI anchors followed by physician change categories. This is not surprising since joint counts and physician global impression of disease activity constitute 3 of 4 components of the CDAI. When physicians rated patients as a lot better, CDAI decreased by 12 points; similarly, when patients were rated as a lot worse, CDAI increased by 11 points. The changes in CDAI scores we observed were notably larger than the minimal clinically important differences that have been identified for CDAI (i.e., -6 points when starting with moderate disease activity and 2 points for worsening when starting in remission/low disease activity) in one study [45]. The change in CDAI when using patient reports of feeling a lot better was -5 points, whereas a lot worse was 7 points.

These findings add to the growing body of evidence supporting strong measurement properties (reliability, construct, content and criterion validity, responsiveness) of the RA-FQ. One reason may be that the RA-FQ was developed in accordance with best practice methods [27,28,29]. The conceptual framework of the RA-FQ evolved from international focus groups with RA patients [16] and Delphi exercise with RA patients, clinicians, researchers, and other stakeholders [3]. Performance of the RA-FQ was examined in multiple international clinical trials and longitudinal observational studies, including CATCH [18, 30, 31]. Rasch analysis showed acceptable fit to the Rasch model, with items and people covering a broad measurement continuum with appropriate targeting of items to people, ordered thresholds, minimal differential item functioning by language, sex, or age [17].

The strengths of this study include the use of multiple anchors to identify meaningful change in a large, well-characterized, and diverse real-world sample of people with RA. Patients had been recently diagnosed and had started disease-modifying treatments, which often take several months to reach maximum therapeutic effectiveness; only 21% rated their RA as the same at both visits. Values identified may also generalize to situations where patients are starting a new RA treatment as a result of flaring. We used patient and clinician impressions and CDAI, a disease activity index that combines these perspectives with joint counts and is viewed as “the most specific quantitative clinical measure” in RA research and care [33]. In this early RA cohort, many patients reported a change in their RA status at the second visit, and these reports were supported by changes in the standardized indicators of RA disease activity recommended in international clinical practice guidelines [46, 47] and used in clinical trials as part of a treat-to-target approach. We also visualized scores associated with patient change categories with CDFs, a technique recently recommended by the US Food and Drug Administration (FDA) [24]. There are limitations. A recent systematic review showed how different methods, including predictive modeling, have been applied to identify change thresholds [40]. While some, such as distribution-based methods, have fallen out of favor, others, such as bookmarking, are being applied more widely [40, 48]. We used the mean change method, which is the most widely used method [40], is currently recommended by the FDA [24], and is used in RA trials identifying change scores for NIH-PROMIS™ measures [21]; however, this approach may yield values higher than actual minimal thresholds [40]. We evaluated mean score changes in relation to disease activity levels at the first visit for CDAI only. In patients with established RA, different values may be obtained for improvement and worsening as patients gain more experience with disease flares. We also observed that mean change values for worsening and improving RA varied according to the baseline value. Values derived from our cohort of recently diagnosed patients may be higher than those estimated in established patients with less active disease at baseline or patients with painful comorbidities, such as osteoarthritis. As baseline values influenced the magnitude of change, additional work is needed to determine change values for patients who begin in states of remission or low disease activity or participate in tapering trials.

In summary, in a large, diverse cohort of real-world patients recently diagnosed with RA, we compared multiple anchors to derive meaningful and minimal changes from the perspective of patients and clinicians and using CDAI, a disease activity index used widely as part of treat-to-target approaches. We found similar patterns overall, but some important differences in the actual value of thresholds. Further, our results suggest that the benchmarks used to classify RA treatments as a success or failure differ depending on the anchor used. These findings contribute new information that can be used to interpret RA-FQ scores in research and patient care. They also demonstrate that the proportion of patients classified as “responders” to a new treatment could vary considerably depending on the anchor used to define meaningful change.