Introduction

With the intensifying emphasis on patient-centered care, patient-reported outcomes (PROs) are playing an increasingly important role in clinical practice. PROs are reports that come directly from patients about a health condition or its treatment and include an array of outcomes such as symptoms, functional status, and health-related quality of life (HRQOL) [1, 2]. PRO data can promote patient-centered care in two key ways. First, PROs are commonly used in comparative research studies [3–6], and oncologists have also endorsed the value of PROs in clinical trials [7, 8]. PRO findings from these studies can facilitate patients’ and clinicians’ understanding of how different treatments affect patient functioning and well-being and can inform treatment decision making [8, 9].

A second important application of PROs in clinical practice is to inform individual patient management [10, 11]. In this application, patients complete a PRO questionnaire, the results of which are shared with their clinician (and frequently the patient) to help identify and address concerns. Use of PROs for individual patient management has been shown to improve clinician–patient communication [12–15].

However, a number of barriers limit PRO use in practice by patients and clinicians, including variation in which PRO measures are used, how they are scored, how scores are scaled, and how data are reported. The PRO and Quality of Life Instrument Database describes over 800 PRO measures [16] that vary in how they are scored (e.g., higher scores are better on some measures but worse on others). PRO measures also vary in scaling; for example, 0–100 may represent the worst-to-best score, or scores may be normed (e.g., to a population average of 50). Thus, scores can have widely different meanings depending on the PRO being used. In patient management applications, patients and clinicians consistently report that the variation in PRO scoring and scaling makes score interpretation and application challenging [17–19]. For group-level data from research studies, oncologists strongly endorse the potential of PROs to enhance clinical trial interpretation, but fewer than half feel comfortable interpreting PRO results themselves [8]. Further, some methods for presenting PRO data are more easily understood and more accurately interpreted than others [20, 21].
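To make the scaling issue concrete, the following is a minimal illustrative sketch (not drawn from any instrument's official scoring manual); the item range, reference mean, and reference SD are hypothetical. It shows how the same mean item response can appear as three very different numbers depending on whether a 0–100 "higher = worse" convention, a reverse-scored "higher = better" convention, or a T-score normed to a population average of 50 is used.

```python
# Illustrative sketch only: hypothetical raw responses and reference values,
# not data or scoring rules from this study.

def scale_0_100(raw_mean, item_min=1, item_max=4):
    """Linearly transform a mean item score to 0-100 (higher = more symptoms)."""
    return 100 * (raw_mean - item_min) / (item_max - item_min)

def reverse_score(score_0_100):
    """Reverse the 0-100 scale so that higher = better (fewer symptoms)."""
    return 100 - score_0_100

def t_score(score_0_100, pop_mean=60.0, pop_sd=25.0):
    """Norm a 0-100 score to a T-score metric (population mean 50, SD 10).
    The reference mean and SD here are made up for illustration."""
    return 50 + 10 * (score_0_100 - pop_mean) / pop_sd

raw_mean = 3.0  # average of hypothetical 1-4 responses on a symptom scale
s = scale_0_100(raw_mean)
print(f"0-100, higher = worse : {s:.0f}")      # 67
print(f"0-100, higher = better: {reverse_score(s):.0f}")  # 33
print(f"T-score (normed to 50): {t_score(s):.0f}")        # 53
```

The same underlying responses thus read as 67, 33, or 53 depending on the convention, which is precisely the kind of variation that complicates interpretation.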

We undertook this study to (1) evaluate patient and clinician comprehension of PRO data using existing presentation approaches and (2) obtain qualitative feedback on attributes of different presentation formats that are helpful and challenging.

Methods

Study design

This cross-sectional, mixed-methods study used a self-completed exercise followed by semi-structured qualitative interviews to identify facilitators of and barriers to comprehension and application of current PRO data presentation formats. Enrolled patients and clinicians were randomized 1:1 to evaluate either “group-level” PRO data from comparative research studies or “individual-level” PRO data for patient monitoring. The Johns Hopkins School of Medicine Institutional Review Board approved the study; all participants provided written consent.

Population and setting

Participants were recruited from eight academic and community institutions representing diverse populations in the Johns Hopkins Clinical Research Network (JHCRN). Eligible patients were ≥21 years old, diagnosed with any cancer (excluding non-melanoma skin cancer), ≥6 months post-diagnosis, not currently receiving acute treatment, and able to communicate in English. We purposively sampled by educational attainment, cancer type, and practice setting. Eligible clinicians were in active oncology practice, including medical, radiation, and surgical oncologists, gynecologic oncologists/urologists, oncology nurse practitioners, and oncology fellows. We purposively sampled by practice setting and by clinician specialty.

Study procedures

All interviews had a common structure. Participants began by completing a self-directed paper-based exercise. First, in order to orient participants to PRO questionnaires, each completed six domains from the European Organization for Research and Treatment of Cancer Quality of Life Questionnaire-Core-30 (QLQ-C30) [22]. Second, we assessed participants’ intuitive understanding of the meaning of increasing/decreasing PRO scores by asking them to interpret two line graphs: one labeled ‘general well-being’ with a line trending up and one labeled ‘shortness of breath’ with a line trending down (Fig. 1). The nature of the graph was explained (e.g., “each square on the line represents a recent clinic visit”), but no instructions regarding the meaning of direction were provided. For each graph, participants reported whether they thought the graph represented a patient getting “better,” “worse,” or “not sure.”

Fig. 1

Line graphs used to evaluate participants’ intuitive understanding of trending PRO scores for an individual-level, single domain function and symptom. In keeping with the research design, scores on the y-axis are not provided

Third, participants were presented with different approaches for presenting PRO scores, using example data from the six QLQ-C30 domains. QLQ-C30 data were generally presented consistent with its scoring conventions (0–100 scale, with higher scores representing better function and greater symptom burden), but for some formats, data were presented so that directionality (i.e., whether higher was better or worse) was the same across function and symptom domains. The data presentation formats were selected to illustrate a range of approaches found in the literature. For group-level data, six formats presenting hypothetical clinical trial results were evaluated (Fig. 2): simple line graphs of mean scores over time, line graphs with norms, line graphs with confidence intervals, bar charts of average changes, bar charts based on a responder definition (improved, stable, worsened), and cumulative distribution functions [1]. For individual-level data, four formats presenting a hypothetical patient’s scores were evaluated (Fig. 3): line graphs of scores over time [17], tabulated scores [23], bubble plots of scores at a point in time [13], and a heat map [24]. Each format was presented on a separate page with its own explanation and legend, including instructions regarding directionality [e.g., “For function scales, high scores represent high levels of functioning. For symptom scales, high scores represent high symptom levels (greater symptom burden).”]. Participants were instructed to consider only the information on a given page, to discourage them from applying descriptions of one format to others. To control for potential order effects, formats were presented in one of four randomly assigned orders.
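As an illustration of how such group-level formats can be generated, the sketch below uses simulated two-arm data with an assumed 0–100 "higher = better" scale and a ±5-point responder threshold; it is not the code used to produce the study materials and reproduces only three of the six formats (mean line graph, responder proportions, and cumulative distribution function).

```python
# Illustrative sketch with simulated, hypothetical data; assumed scoring: 0-100, higher = better.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
visits = np.array([0, 3, 6, 9, 12])                              # months on study
arm_a = 60 + 1.0 * visits + rng.normal(0, 2, (40, visits.size))  # 40 simulated patients per arm
arm_b = 60 + 0.2 * visits + rng.normal(0, 2, (40, visits.size))

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# (a) Line graph of mean scores over time
for arm, label in [(arm_a, "Treatment A"), (arm_b, "Treatment B")]:
    axes[0].plot(visits, arm.mean(axis=0), marker="o", label=label)
axes[0].set(title="Mean scores over time", xlabel="Month", ylabel="Score (higher = better)")
axes[0].legend()

# (b) Bar chart of proportions improved / stable / worsened (assumed responder threshold: 5 points)
def responder_props(arm, threshold=5):
    change = arm[:, -1] - arm[:, 0]
    return [(change >= threshold).mean(),
            (np.abs(change) < threshold).mean(),
            (change <= -threshold).mean()]

width = 0.35
x = np.arange(3)
axes[1].bar(x - width / 2, responder_props(arm_a), width, label="Treatment A")
axes[1].bar(x + width / 2, responder_props(arm_b), width, label="Treatment B")
axes[1].set_xticks(x)
axes[1].set_xticklabels(["Improved", "Stable", "Worsened"])
axes[1].set(title="Proportions responding", ylabel="Proportion of patients")
axes[1].legend()

# (c) Cumulative distribution function of change from baseline
for arm, label in [(arm_a, "Treatment A"), (arm_b, "Treatment B")]:
    change = np.sort(arm[:, -1] - arm[:, 0])
    axes[2].step(change, np.arange(1, change.size + 1) / change.size, where="post", label=label)
axes[2].set(title="Cumulative distribution of change", xlabel="Change from baseline (points)",
            ylabel="Cumulative proportion")
axes[2].legend()

fig.tight_layout()
plt.show()
```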

Fig. 2

Group-level data formats included (a) line graphs of mean scores with confidence intervals, (b) proportions responding (improved/same/worsened), (c) bar charts of average changes, and (d) cumulative distribution functions. Additional group-level data formats included simple line graphs of mean scores and line graphs of normed mean scores (not shown). Each format was presented on a separate page with its own explanation and legend (not all of which are shown in the Figure)

Fig. 3

Individual-level data formats included (a) line graphs of scores over time, (b) tabulated scores, (c) heat map of normed scores, and (d) bubble plot of scores. Each format was presented on a separate page with its own explanation and legend (not all of which are shown in the Figure)

For each format, participants responded to two questions that assessed accuracy of interpretation. Finally, for each format, participants rated “How easy is it for you to understand these graphs?” (0 = Very difficult to 10 = Very easy) and “How useful do you find these graphs?” (0 = Not at all to 10 = Very).

Following the self-directed portion, the interviewer conducted a semi-structured debriefing interview. The interviewer assessed participants’ reasoning for the intuitive interpretation questions and then reviewed the participants’ responses to each format, including the ease-of-understanding and usefulness ratings and the decision process for one accuracy question. For each format, participants were asked what they liked, what they did not like or found confusing, and what they would add, remove, or change. In the individual-level data interviews, specific probes were used to evaluate particular format attributes (e.g., yellow shading in the tabulated scores). Finally, respondents were asked whether they preferred formats that depict single or multiple time-points and to select the one format that was easiest to understand and most helpful for patients and clinicians to use in practice. Interviews were audio-recorded and transcribed.

Analyses and sample size

The quantitative data were analyzed descriptively using proportions and medians/ranges. To analyze the qualitative data, we used a “concurrent triangulation” design [25] utilizing an applied “framework approach” [26] that focused on attributes relevant to comprehension, preference, and utility of PRO data. The quantitative data results were considered in the context of the key points identified from the qualitative data to develop overall study findings.
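As an illustration only, the following is a minimal sketch of this descriptive quantitative approach in Python (pandas), using a hypothetical data frame and made-up column names rather than the study data.

```python
# Illustrative sketch: hypothetical responses; column names are invented for this example.
import pandas as pd

df = pd.DataFrame({
    "role": ["patient", "patient", "patient", "clinician", "clinician"],
    "format": ["simple_line", "cdf", "simple_line", "cdf", "simple_line"],
    "ease_0_10": [9, 3, 8, 5, 9],      # "How easy is it for you to understand these graphs?"
    "useful_0_10": [8, 2, 9, 6, 10],   # "How useful do you find these graphs?"
    "accurate_q1": [1, 0, 1, 1, 1],    # 1 = correct response to the first accuracy question
})

# Medians and ranges of the 0-10 ratings, by respondent role and format
ratings = (df.groupby(["role", "format"])[["ease_0_10", "useful_0_10"]]
             .agg(["median", "min", "max"]))

# Proportion answering the first accuracy question correctly, by role and format
accuracy = df.groupby(["role", "format"])["accurate_q1"].mean()

print(ratings)
print(accuracy)
```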

To organize the qualitative data, the research team developed a coding scheme based on the study objectives, interview structure, and content of the initial interviews. Codes related to positive or negative comments made by participants on each format and to emergent themes. After several training rounds performed by the whole team, one team member (E.L.) coded all transcripts using ATLAS.ti [27] and each transcript’s coding was reviewed by a second investigator (E.B., M.B., C.S., K.S.). Team members independently identified themes from reports summarizing the coded text (e.g., positive and negative comments on each format), which were then discussed by the group to summarize key points for each format. Selected quotations that illustrate the key findings are included in the results (“[P]” and “[C]” attributions are used to identify patient and clinician respondents, respectively).

Sample size was determined based on thematic saturation for the qualitative objectives. We estimated that 50 patient and 20 clinician interviews would be required (half each addressing group- and individual-level formats). We confirmed thematic saturation by identifying themes from the first 56 interviews and then determining that the final 14 interviews did not identify substantive additions.

Results

Study sample

To recruit 50 patients, we assessed the eligibility of 147, of whom 25 were not eligible, 10 declined participation, and 62 were registered but not randomized (e.g., not meeting purposive sample needs). The participants’ characteristics are listed in Table 1. The median age was 65.5 years, 54 % were female, and 78 % were white. Per our purposive sampling strategy, participants had a range of education levels, and the most common cancers represented were breast (32 %) and prostate (28 %). The median time from first diagnosis was 5 years (range 0–32), and median time from most recent diagnosis was 3 years (range 0–7). Patients were recruited from 6 JHCRN practice locations.

Table 1 Participant demographics

To recruit 20 clinicians, we assessed the eligibility of 37, of whom 3 were not eligible, 1 declined participation, and 13 were registered but not randomized. The clinician participants’ characteristics are also shown in Table 1. The sample included clinicians from a variety of specialties, with a median age of 42, 40 % of whom were female. Clinicians were recruited from 6 JHCRN practice locations.

Intuitive interpretation findings

The vast majority of patients and clinicians interpreted the upward-trending “general well-being” line as getting better (patients: 96 % better, 2 % worse, 2 % not sure; clinicians: 80 % better, 5 % worse, 15 % not sure). In the interviews, participants reported that upward-trending lines intuitively signify improvement. Participants reporting “not sure” were often unwilling to guess direction in the absence of labeled axes or scale descriptors. One reason given for selecting “getting worse” was that the QLQ-C30 response options (1 = no problem at all; 4 = very much a problem) implied that higher numbers indicated worse quality of life.

In contrast, participants’ interpretation of the downward-trending “shortness of breath” line showed greater variation (patients: 60 % worse, 34 % better, 6 % not sure; clinicians: 10 % worse, 70 % better, 20 % not sure). Many who chose “worse” drew on the intuition that an upward-trending line indicates improvement. Others interpreted the downward trend as reflecting less symptom burden (“This was going down, so that meant the person was not feeling as short of breath” [P007]), whereas several participants were not sure, again citing unwillingness to guess direction in the absence of labeled axes or scale descriptors.

Group-level data formats

Table 2 summarizes respondents’ ratings, overall preferences, and feedback on the group-level data formats. Both patients and clinicians rated simple line graphs highest for ease-of-understanding and usefulness. Across formats, clinicians’ accuracy of interpretation ranged from 56 % (a Cumulative Distribution Function question) to 100 % (a Normed Line Graph question), whereas patients’ accuracy ranged from 36 % (a Cumulative Distribution Function question) to 100 % (a Line Graphs with Confidence Intervals question) (data not shown).

Table 2 Summary of results for group-level data formats

Line graphs were generally described as “straightforward” and “clear.” Patients tended to find normed scores, p values, and confidence intervals confusing. Clinicians appreciated that normed scores provided a basis for comparison (beyond the two study arms) and valued p values and confidence intervals as particularly important for publication. Across the line graph formats, the inconsistency in scoring direction was often reported as unclear: “for function it’s high levels of functioning, for symptoms it’s greater burden, which if you get confused between the two that throws you off there a little bit” [P002].

There was also some support for bar chart formats. Respondents noted that it was easy to compare treatments using bar charts of average changes. Many clinicians thought that the bar charts of proportions responding to treatment were clinically meaningful (“you’d rather have things that are stable, worse or improved and that’s usually how we judge things” [C014]). Both patients and clinicians found the cumulative distribution function confusing and difficult to interpret (“it stinks and I kept liking it worse!” [C012]), though the ability to compare treatments on proportions responding across multiple cut-points was appreciated among the few participants who grasped the concept.

In terms of overall preferences, 90 % of clinicians and 87 % of patients preferred formats displaying multiple time-points. Patients most often selected simple line graphs for overall ease-of-understanding and usefulness, whereas clinicians selected either line graphs of normed scores or line graphs with confidence intervals (Table 2).

Individual-level data formats

Table 3 summarizes respondents’ ratings, preferences, and feedback on the individual-level data formats. In the self-directed exercise, ratings for ease-of-understanding and usefulness were highest for line graphs for both patients and clinicians. Clinicians’ accuracy of interpretation was high across formats (90–100 % on both questions on all formats), and patients’ ranged from 64 % on one Line Graphs question to 96 % on the other Line Graphs question and on one Heat Map question (data not shown).

Table 3 Summary of results for individual-level data formats

In the interviews, participants noted that line graphs “…gave a lot of information in a visually easy way to look at …you could just glance at it” [C022]. Some clinicians drew analogies to other clinical information: “Looks like a blood pressure graph or sodium graph” [C066]. The importance of highlighting concerning scores was noted, but the approach used in the sample format (highlighting the graph in yellow) was not always effective. Several respondents expressed a desire for reference values/scale: “It would be nice to know what the normal range is,” [P001] whereas others said, “I don’t really care how I compare to others” [P054]. There was also a question of who the normative population should be (e.g., US average, other cancer patients).

As with the group-level formats, the inconsistency between whether higher scores were better or worse (indicated with an arrow on the y-axis pointing in the direction that was better) posed a barrier to interpretation, contributing to several participants’ incorrect responses to accuracy questions. As noted above, the two accuracy questions for line graphs represented both the highest and lowest accuracy rates. A “Physical Function” graph trending down, with the “Better” arrow indicating that higher scores are better, was correctly interpreted as worsening by 96 % of patients and 100 % of clinicians. However, a “Fatigue” graph trending up, with the “Better” arrow indicating that lower scores are better, was correctly interpreted by only 64 % of patients and 90 % of clinicians. When asked about their responses to the “Fatigue” question in the debriefing interview, some participants did not notice that lower scores were better and reported intuitively interpreting higher scores as better (“I imagine if it’s going up, it’s getting better” [P029]). Others noticed only on further examination that lower scores were better for Fatigue (“…I didn’t notice [indications of direction]” [P051]; “For some of them going up is better and so…my first instinct was incorrect” [C009]).

Among other individual-level formats, tabulated scores were considered “boring but straightforward” [P054], with one respondent noting their value when small differences in numbers are clinically important. Highlighting the particular score considered potentially concerning was appreciated (as compared with highlighting the entire graph, as in the line graphs). However, others thought that the format presented too much data and that the scores were difficult to interpret without a reference scale. One respondent thought that the comments were an “apology for a confusing table” [C016]. The inconsistency in scoring directionality was noted here, too. On the heat map, color was considered helpful, but several respondents recommended simplifying the color scheme. There was also some debate as to whether red should indicate good or bad, and some thought that better scores should be on the right. Most patients did not understand the concept of “range.” There was also an issue with having 50 as the norm: “Statistically it’s correct, but people don’t want to be 50. If you…take a test, you don’t want to get a C” [C026]. While bubble plots were considered “easy on the eye” [C005], trends over time were missed and the scaling was not clear to everyone. Notably, on this format, scores were depicted consistently (higher always better), but one respondent noted that it was confusing that pain was 92 but “normal.”

In terms of overall preferences, 90 % of clinicians and 92 % of patients preferred formats displaying multiple time-points versus single time-points. Both patients and clinicians most often selected line graphs for overall ease-of-understanding and usefulness (Table 3).

Discussion

For PRO data to contribute meaningfully to patient-centered care, patients and clinicians must be able to understand and interpret these data. Previous research has demonstrated that variation in PRO scoring, scaling, and presentation poses obstacles to interpretation and application. This study aimed to elicit patient and clinician input on formats for PRO data presentation and their respective attributes that would help to promote PRO data understanding and clinical use.

For group-level data, our findings suggest the need to use different approaches when presenting data to patients (e.g., in educational materials or decision aids) versus clinicians (e.g., in peer-reviewed publications). Patients tended to prefer simple line graphs, whereas clinicians valued greater detail (e.g., normed scores, or inclusion of confidence intervals or p values). Because the appropriate format further depends on how the study endpoint is conceptualized (e.g., mean scores over time vs. proportions responding to treatment), best practices for presenting bar charts of group-level data also need to be considered. For individual-level data, line graphs were the consensus favorite among both patients and clinicians.

The findings of this study should be considered in the context of its design. All candidate formats displayed data from the QLQ-C30, an instrument with distinct subscales that may be easier to graph. For the most part, the data were reported based on the QLQ-C30’s scoring conventions (0–100, with higher scores representing better function and greater symptom burden). In some cases, we reported the data normed or with consistent scoring directionality—such as is used by other PRO instruments. Additional research should explore whether the findings from this study are replicated when the results of different PRO measures are displayed.

Our purposive sampling of patients ensured representation across cancer types, care settings, and education levels. We recruited patients who were at least 6 months post-diagnosis and who had completed acute treatment, in order not to burden patients who were under active treatment. Thus, our sample may not reflect the perspectives of patients with short-term survival (e.g., pancreatic cancer) or who are currently dealing with acute side effects. Clinicians were sampled to represent a variety of specialties and care settings. Our purposive sampling strategy ensured inclusion of patient participants with only high school education (20 %). We note, however, that 40 % of patients had postgraduate education; this distribution of patient education experiences may have influenced how patients responded to different format options. An additional strength of the study design was that we controlled for potential learning or other order effects by randomly assigning format presentation order across interviews. Finally, we expect that these results from a cancer setting will be generalizable to other disease settings, although this requires confirmation.

This mixed-methods study provides quantitative data (ratings, overall preferences, accuracy questions) to complement the qualitative feedback. We applied a relatively simple qualitative content analysis to meet the aims of this study; a more in-depth exploration of qualitative data from this and future studies could contribute further insights regarding particular issues, such as why respondents had difficulty understanding the opposite scoring of symptom and function domains, but that was beyond the scope of the current study. While qualitative thematic saturation was confirmed by analyzing the last 14 interviews separately from the first 56, the sample size of 50 patients, and especially 20 clinicians, limited precision for the quantitative measures. The accuracy-of-interpretation questions were primarily intended to engage respondents in active interpretation of the graphs. We interpret those findings with caution, as questions asked across formats may not have been equally difficult, and participants might have guessed the correct answers.

Nonetheless, the accuracy scores, when interpreted in the context of the data regarding intuitive understanding of directionality, do illustrate substantial room for improvement in communicating PROs effectively. The physical function line graph was consistent with participants’ intuitive interpretation of whether higher scores on a function domain would be better or worse and was accurately interpreted by all clinicians and all but one patient. There was no clear consensus on the intuitive meaning of a symptom graph trending up, and this was reflected in the comparatively low accuracy rates for interpreting the fatigue line graph. Approaches to address this inconsistency in directionality are needed to improve PRO data interpretation.
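As one illustration of such an approach (our own sketch, not a method developed or endorsed in this study), symptom scales scored 0–100 with higher = worse could be reverse-coded before display so that "up" always means better.

```python
# Minimal sketch of one possible approach (illustrative only): reverse-code 0-100 symptom
# scores before plotting so that higher always indicates better on the displayed axis.
def display_score(score_0_100, domain_type):
    """Return a score for display on a common 'higher = better' axis.

    domain_type: "function" (higher already = better) or "symptom" (higher = worse).
    """
    if domain_type == "symptom":
        return 100 - score_0_100
    return score_0_100

# A fatigue score of 70 (high symptom burden) would display as 30 on the common axis,
# so a downward-trending fatigue line and a downward-trending function line read the same way.
print(display_score(70, "symptom"))   # -> 30
print(display_score(70, "function"))  # -> 70
```

Whether such reverse-coding helps or instead obscures the instrument's native scoring is an empirical question of the kind the Work Groups described below are intended to address.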

We noted that for many formats, median ease-of-understanding and usefulness scores were 8/10 or higher. The wide range of responses on virtually all formats, however, emphasizes the fact that format preferences vary substantially, even within patient and clinician groups. Finally, the patients’ and clinicians’ preferences and interpretation reflect current levels of familiarity. While more complex formats (e.g., cumulative distribution function) were rated poorly, it is possible that with time and greater exposure, the information provided by these more complex approaches would be valued.

In the next steps of this research, we are engaging patients and clinicians in an iterative Work Group process to develop approaches that address the issues identified in this study. The Work Groups will develop candidate approaches to address inconsistent scoring directionality, to effectively highlight statistical significance and clinical meaningfulness (group-level data), and to highlight clinically important scores and provide reference scores (individual-level data). Approaches developed by these Work Groups will then be tested in additional one-on-one interviews and subsequently evaluated in quantitative studies powered to test their interpretation appropriately. Our findings are expected to inform best practices for presenting PRO data to patients and clinicians and thereby facilitate better understanding and application of PRO data in practice.