Background

In recent years, there has been increased focus on the use of health-related quality of life (HRQOL) and other patient-reported outcomes (PROs) in clinical practice for individual patient management [1]. Unlike the use of PROs in clinical trials and observational studies, where differences between groups are the primary focus, using PROs in clinical practice raises a number of methodological issues. For example, to be useful in managing individual patients, outcomes need to be measured far more reliably and precisely than is the case when estimating group means [2]. One of the challenges that must be addressed to facilitate using PROs in clinical practice is identifying what score represents a problem that requires attention (a "problem score").

The use of PROs in clinical practice is analogous to the use of lab tests in that both PROs and lab tests provide the clinician with information on the patient's health. When clinicians seek to identify a problem or determine the differential diagnoses to be considered or ruled out, they order a variety of relevant tests, patients have the lab work completed, the results are reported to the clinician, and values that are outside of a given range motivate action. With PROs, the approach is similar, except that rather than going to a laboratory, patients complete a questionnaire either in the office or via the Internet. As with lab tests that require identification of values that are "abnormal," the use of PROs in clinical practice requires identification of "problem" scores that merit further attention.

Take, for example, Patient X, who has completed the physical function questions from the European Organisation for Research and Treatment of Cancer Quality of Life Questionnaire-Core 30 (EORTC QLQ-C30) [3]. Patient X has quite a bit of difficulty doing strenuous physical activity, a little bit of difficulty with moderate physical activity, and no problems with activities of daily living. The resulting scale score for Patient X is 60 on a scale from 0 to 100, with higher scores representing better function. Does this score of 60 represent limitations in physical function that require attention from the patient's health care provider? Few guides are currently available to help clinicians interpret the meaning of scores on PRO questionnaires. One approach, used by Velikova et al. [4], is to provide mean scores for the general population as a basis for comparison. While helpful, these normative data still do not provide a clear indication of what score represents a problem for an individual patient that requires attention.

A novel approach that might be useful in identifying problem scores on HRQOL questionnaires is to use needs assessments. Needs assessments generally ask whether a need exists and, frequently, how well existing needs are being met [5]. For example, the Supportive Care Needs Survey-Short Form (SCNS) [6, 7] asks patients whether a given need is applicable to them, whether they have that need but it is being satisfied, or whether they have a low, moderate, or high level of unmet need. Thus, needs assessments provide an indication of the extent to which the patient perceives an unmet need in an area. Using data from a study in which cancer patients completed both the EORTC QLQ-C30 and the SCNS, we conducted a preliminary analysis to investigate whether needs assessments could be used to identify problem scores on HRQOL questionnaires.

Methods

Patient population and data collection

The study population and data collection methods have been described previously [8, 9]. Briefly, the patients of seven medical oncologists involved in the treatment of breast, prostate, and lung cancers were recruited for participation through flyers handed out by clinic staff. Eligibility criteria included (1) diagnosis of breast, prostate, or lung cancer at any stage, (2) aged 18 or older, (3) currently undergoing treatment with chemotherapy, radiation therapy, hormonal therapy, biologic therapy, or therapy as part of a clinical trial, (4) physically and cognitively able to complete the questionnaire, (5) able to read and write in English, and (6) able and willing to provide oral informed consent. Based on our plans to use the data for initial exploratory analyses, we aimed to enroll 35–50 patients per tumour type for a total target sample size of 105–150 patients.

The questionnaires for this study included the SCNS [6, 7] followed by the EORTC QLQ-C30 [3] and complied with the requirements of the instruments' copyright holders. The QLQ-C30 includes five function domains (physical, role, emotional, cognitive, social), eight symptom domains (fatigue, pain, nausea and vomiting, dyspnoea, insomnia, appetite loss, constipation, diarrhoea), plus financial impact and global health/quality of life ratings. Most questions use a four-point scale from "not at all" to "very much" and have a 1-week recall period. Domain scores are transformed to a 0–100 scale, with higher scores on function domains representing better function and higher scores on the other domains representing greater burden. The SCNS addresses five domains of need (psychological, health system and information, physical and daily living, patient care and support, sexual). Patients respond using a five-point scale from "not applicable" to "high need" with a 1-month recall period. To calculate domain scores, we averaged the items in each domain; scores > 2.0 represent the presence of an unmet need. This cut-off of > 2.0 was also applied in analyses involving individual SCNS items.
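
The scoring conventions described above can be summarized in a brief R sketch. This is a minimal illustration, not the study's actual code: the function and variable names are ours, missing-data handling is simplified, and the official scoring manuals should be consulted for the exact item-to-scale mappings.

```r
# QLQ-C30: the raw score (RS) is the mean of a domain's items; it is then
# linearly transformed to 0-100. 'item_range' is the difference between the
# maximum and minimum possible item scores (3 for the four-point items).
qlq_function_score <- function(items, item_range = 3) {
  rs <- mean(items, na.rm = TRUE)
  (1 - (rs - 1) / item_range) * 100   # higher = better function
}

qlq_symptom_score <- function(items, item_range = 3) {
  rs <- mean(items, na.rm = TRUE)
  ((rs - 1) / item_range) * 100       # higher = greater burden
}

# SCNS: as described above, a domain score is the mean of its items (coded 1-5),
# and a score > 2.0 is taken to indicate the presence of an unmet need.
scns_domain_score <- function(items) mean(items, na.rm = TRUE)
has_unmet_need    <- function(items) scns_domain_score(items) > 2.0
```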

Patients also reported their age, sex, race, and education level. Clinicians completed a form that detailed patients’ Eastern Cooperative Oncology Group (ECOG) performance status, cancer type, extent of disease, and current and previous treatments. The study was reviewed and approved by the Institutional Review Board of the Johns Hopkins School of Medicine (#NA_00001797) and complies with the 1964 Declaration of Helsinki.

Research questions and hypotheses

Before we could address the main research question of whether needs assessments help identify scores on HRQOL questionnaires that represent a problem, we first had to establish whether HRQOL scores differ by the level of need reported. If, for example, patients' average HRQOL pain scores were the same regardless of whether they reported on the needs assessment that pain was "not applicable" or that they had a "high need," then a needs assessment would clearly not be helpful in identifying problem scores on the HRQOL questionnaire. We hypothesized that needs assessments would be most useful where there is a close relationship between the content of the needs assessment and the HRQOL scale. Specifically, we examined the content of both the QLQ-C30 and the SCNS and, for each QLQ-C30 domain, identified the SCNS domain/item(s) with the most similar content. While some QLQ-C30 domains (e.g., pain) had clearly similar item(s) in the SCNS, other domains (e.g., dyspnoea) had no similar item(s) in the SCNS, requiring us to use a more generic item (e.g., "feeling unwell a lot of the time"). Based on how close the content match was between the QLQ-C30 domain and the SCNS, we hypothesized a strong, moderate, or weak association between each QLQ-C30 domain and its matched SCNS domain/item (Table 1).

Table 1 Hypothesized relationship between QLQ-C30 and SCNS domains/items and resulting areas under the curve (AUC)

Analysis

After performing descriptive analyses of the patients’ sociodemographic and clinical characteristics, we conducted bivariate analyses and constructed boxplots to explore how the distribution of QLQ-C30 scores varied by level of need reported.
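
As an illustration of this exploratory step, the sketch below shows one way such boxplots could be produced in R. The data frame and variable names (dat, qlq_score, need_level) and the response-category labels are assumptions for illustration, not the study's actual code.

```r
# Hypothetical data frame 'dat' with one row per patient:
#   qlq_score  - QLQ-C30 domain score (0-100)
#   need_level - SCNS response category for the matched domain/item
dat$need_level <- factor(dat$need_level,
                         levels = c("not applicable", "satisfied",
                                    "low need", "moderate need", "high need"))

# Distribution of QLQ-C30 scores by level of need reported
boxplot(qlq_score ~ need_level, data = dat,
        xlab = "Level of need reported (SCNS)",
        ylab = "QLQ-C30 domain score")
```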

To evaluate the discriminative ability of the QLQ-C30 domains for their hypothesized SCNS domain/item(s), we calculated the area under the receiver operating characteristic (ROC) curve (AUC). The AUC is a quantitative measure of how well a continuous variable discriminates between the two classes defined by a binary variable (see the "Technical Appendix" for a detailed explanation of ROC analyses and a practical example). The ROC curve represents the trade-off between sensitivity and specificity. If the continuous variable predicts the outcome well, the ROC curve bends towards the top left corner, where both sensitivity and specificity are high, and the AUC approaches 1.0. If sensitivity = 1 − specificity along the entire curve, the variable performs no better than chance and the AUC equals .50; AUCs below .50 indicate discrimination in the direction opposite to that expected.
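
As a concrete illustration, the AUC can be computed directly from its equivalence to the Mann-Whitney U statistic. The base-R sketch below does this for a function domain, where lower QLQ-C30 scores indicate worse function; the variable names and the direction of the comparison are assumptions for illustration rather than the study's actual code.

```r
# qlq_score:  continuous QLQ-C30 domain score (0-100)
# unmet_need: TRUE if the matched SCNS domain/item indicates an unmet need
auc_for_pair <- function(qlq_score, unmet_need) {
  pos <- qlq_score[unmet_need]    # patients with an unmet need
  neg <- qlq_score[!unmet_need]   # patients without an unmet need
  # Probability that a randomly chosen patient with an unmet need has a lower
  # (worse) function score than a randomly chosen patient without one,
  # counting ties as 1/2; for symptom domains the comparison would be reversed.
  wins <- outer(pos, neg, FUN = function(p, n) (p < n) + 0.5 * (p == n))
  mean(wins)
}
```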

As shown in Table 1, we generated a list of 38 hypothesized relationships between QLQ-C30 scores and binary groupings of SCNS domains/items, resulting in 38 separate ROC analyses. In our case, the two classes are patients with and without an unmet need, and we examined how well QLQ-C30 scores predict which group a patient belongs to. Of the 38 calculated AUCs, 12 were less than .50; 12 were between .50 and .70; 10 were between .70 and .75; 3 were between .75 and .80; and 1 was greater than .80. Though there are no firm benchmarks for classifying AUC values, it has been suggested that values below .70 represent poor discrimination, values between .70 and .80 represent acceptable discrimination, and values above .80 represent excellent discrimination [10]. Based on this distribution, we further explored the relationship between the presence or absence of an unmet need and QLQ-C30 scores for pairs with an AUC ≥ .70. For these pairs, we then tested potential cut-off scores and calculated the associated sensitivity and specificity, as well as the positive and negative predictive values. All analyses were performed using R version 2.7.0.
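
The sketch below illustrates, under assumed variable names, how the test characteristics at a candidate cut-off could be computed for a function domain, with scores at or below the cut-off flagged as a potential problem. The flagging rule and the candidate cut-off values shown are illustrative, not the study's actual code.

```r
test_characteristics <- function(qlq_score, unmet_need, cutoff) {
  flagged <- qlq_score <= cutoff        # flag scores at or below the cut-off
  tp <- sum(flagged & unmet_need)       # true positives
  fp <- sum(flagged & !unmet_need)      # false positives
  fn <- sum(!flagged & unmet_need)      # false negatives
  tn <- sum(!flagged & !unmet_need)     # true negatives
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    ppv         = tp / (tp + fp),
    npv         = tn / (tn + fn))
}

# Compare several candidate cut-offs for a function domain
sapply(c(33, 50, 60, 70, 80, 90), function(k)
  test_characteristics(qlq_score, unmet_need, k))
```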

Results

Sample characteristics

The characteristics of the patient sample have been reported previously [8, 9]. Briefly, a total of 117 patients enrolled in the study, a 91% response rate from the 129 patients referred to the research coordinator by clinic staff. The majority of the patients had either breast (43%) or prostate (41%) cancer, with the remainder having lung cancer (16%). The mean age of the sample was 61.2 years, 77% were white, and 49% were women. As might be expected from a sample attending an outpatient clinic, the patients had good performance status, with 95% having ECOG ratings of 0 or 1. Half of the patients had metastatic disease. The majority of patients were currently taking hormonal therapies and had previously had surgery.

Bivariate analyses of HRQOL scores by need reported

To answer the question of whether HRQOL scores vary by presence or absence of an unmet need, we examined the distribution of HRQOL scores by level of need reported. As hypothesized, HRQOL scores showed more differentiation by need reported when the content of the need domain/item(s) was most similar to the content of the HRQOL questionnaire. Figure 1 provides example boxplots for strong, moderate, and weak relationships. For the hypothesized strong relationship (physical function and work around the home), there was a clear differentiation in scores between patients who did and did not perceive an unmet need. For the hypothesized moderate relationship (sleep and lack of energy/tiredness), there was more overlap in the boxes, but the median scores differed. For the hypothesized weak relationship (nausea/vomiting and feeling unwell a lot of the time), the median scores of both need groups were 0. These bivariate analyses confirmed that HRQOL scores do differ by level of need reported, provided that the content of the need domain/item is closely related to the HRQOL domain.

Fig. 1

Example boxplots for hypothesized strong, moderate, and weak associations between QLQ-C30 scores and presence/absence of an unmet need

ROC analysis

We then performed ROC analyses to assess the strength of the association between the HRQOL domains and the need items/domains, using the area under the ROC curve. Our hypotheses were largely supported: all of the EORTC domains hypothesized to have a strong association had an AUC ≥ .70 with at least one need item or domain (Table 1). Where we hypothesized moderate relationships, the best matches had AUCs of .51 for sleep and .64 for social function. The hypothesized weak associations had AUCs ranging from .34 to .54, with one exception: global health/quality of life had a strong association with "feeling unwell a lot of the time" (AUC = .73). Based on these results, we further examined the five domains with hypothesized strong relationships plus the global health/quality of life domain to evaluate the test characteristics associated with various cut-off scores.

Calculation of test characteristics

After calculating the AUCs for all pairings of the QLQ-C30 and SCNS items outlined above, we selected a group of pairings for further analysis. For each of the six QLQ-C30 domains with an AUC ≥ .70, we examined the SCNS item or domain that was discriminated most accurately, i.e., had the highest AUC. In all cases, an individual SCNS item had the highest AUC, so all additional analyses used these single items only. The resulting QLQ-C30–SCNS item pairings were as follows: physical function-work around the home (AUC = .81); role function-work around the home (AUC = .73); emotional function-feelings of sadness (AUC = .74); pain-pain (AUC = .78); fatigue-lack of energy/tiredness (AUC = .74); global health/quality of life-feeling unwell a lot of the time (AUC = .73) (Fig. 2). For each pairing, we calculated the sensitivity and specificity of the QLQ-C30 score using cut-offs of 0, 33, 50, 60, 70, 80, 90, and 100 for all QLQ-C30 domains except pain and fatigue, for which we used cut-offs of 0, 5, 10, 20, 30, 33, 40, and 100. The lowest and highest cut-offs often resulted in sensitivity or specificity values of 0% or 100%. Based on these results, we report two candidate cut-off scores for each domain in Table 2, which provide what we considered to be the best trade-off between sensitivity and specificity. For all six domains, we were able to identify a cut-off with sensitivity ≥ .85 and specificity ≥ .50. For example, for the physical function domain, a cut-off of 90 resulted in a sensitivity of .85 and a specificity of .58, meaning that with this cut-off, 85% of patients who perceive an unmet physical and daily living need are correctly identified as having a problem, and 58% of patients who do not perceive such a need are correctly identified as not having a problem.

Fig. 2

ROC curves for select QLQ-C30 domains that discriminate SCNS classification with an AUC of .70 or higher

Table 2 Sensitivity, specificity, positive predictive value, and negative predictive value with various cut-off scores

We also calculated the positive and negative predictive values of the cut-offs applied to our test population. As a reminder, sensitivity and specificity are characteristics of the test itself, whereas the positive and negative predictive values depend on the prevalence of the condition in the population being tested. As can be seen in Table 2, these cut-offs, when applied to this test population, had very high negative predictive values (ranging from .85 to .95), indicating that when the cut-off classifies a patient as not having a problem, the patient very likely does not have a problem. The positive predictive values were somewhat lower (ranging from .39 to .68), indicating that patients identified by the cut-off as having a problem actually have one only 39–68% of the time. These positive and negative predictive values can help inform the choice of cut-off score based on the clinical importance of false positives versus false negatives.
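
Because predictive values shift with prevalence while sensitivity and specificity do not, the sketch below shows how the PPV and NPV implied by the physical function example (sensitivity .85, specificity .58) would change in populations with different prevalences of unmet need. The prevalence values are illustrative assumptions, not estimates from this study.

```r
predictive_values <- function(sens, spec, prev) {
  ppv <- (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
  npv <- (spec * (1 - prev)) / (spec * (1 - prev) + (1 - sens) * prev)
  c(ppv = ppv, npv = npv)
}

# PPV/NPV at assumed prevalences of 10%, 25%, and 50% of patients with an
# unmet need, using sensitivity = .85 and specificity = .58
sapply(c(0.10, 0.25, 0.50), function(p) predictive_values(0.85, 0.58, p))
```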

Application of the findings

In applying the results of this analysis, it is important to determine the appropriate cut-off scores to use, and the appropriate cut-off depends on the clinical importance of false positives versus false negatives. Is it more important that every patient who possibly has a problem in one of these areas be identified, which would argue for favouring sensitivity over specificity? Or is it more important to be sure that patients identified as having a problem actually have the problem, which would argue for favouring specificity over sensitivity? It is also helpful to consider how the cut-off scores might be applied in practice, and there are two possible approaches. The first approach would use the HRQOL questionnaire score alone to determine a treatment decision: the patient scored X on the pain scale; therefore, we will prescribe pain medication. The second approach uses cut-off scores to flag potential problems that clinicians should enquire about further: the patient scored X on the pain scale; therefore, we should talk to the patient about his/her pain and potentially recommend action if our follow-up suggests that action is required. Because it is more likely that HRQOL questionnaires will be used to identify potential problems for further evaluation (the second approach), it is probably appropriate to favour sensitivity over specificity.
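
As a simple illustration of the second approach, the sketch below flags the domains whose scores cross a chosen cut-off so that the clinician can enquire further. The domain names, the patient's scores, and all cut-off values other than the physical function example of 90 are placeholders, not recommendations from this study.

```r
# Hypothetical cut-offs: function scores at or below these values are flagged
# (in practice these would be taken from a table such as Table 2)
cutoffs <- c(physical_function = 90, role_function = 80, emotional_function = 75)

# One patient's QLQ-C30 function scores (illustrative values)
patient_scores <- c(physical_function = 60, role_function = 100, emotional_function = 67)

# Domains to discuss with the patient at the next visit
names(patient_scores)[patient_scores <= cutoffs[names(patient_scores)]]
```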

The positive and negative predictive values are also informative in determining the appropriate cut-off. In this test population, if the cut-off identified the patient as not having a problem, it was correct 85–95% of the time, reassuring clinicians that they need not spend time querying patients about issues the patients do not perceive as being an unmet need (as measured by the SCNS). On the other hand, if the cut-off identified the patient as having an unmet need, it was only correct 39–68% of the time, so many of the patients identified as potentially having a problem may not require intervention. If the test is only used to identify areas for further enquiry (the second approach), the relatively high rate of false positives is not overly concerning. That being said, too many false positives can lead to “alert fatigue” (i.e., clinicians growing tired of asking patients about issues that are not really a problem) and may result in clinicians ignoring the PRO results.

Discussion

This analysis provides preliminary support for the use of needs assessments to help identify problem scores on HRQOL questionnaires. As expected, the needs assessment worked best for this purpose when there was good overlap between the content of the HRQOL domains and the SCNS items. For QLQ-C30 domains with close SCNS matches, there were cut-off scores with both high sensitivity and specificity, meaning that the test performs well in identifying both patients who do perceive an unmet need and patients who do not perceive an unmet need. Because there are currently few guides available to indicate what score represents a problem, these results suggest that this needs assessment approach may be useful in identifying problem scores on HRQOL questionnaires.

While the results of this analysis are encouraging, they are preliminary. Before this approach can be applied in practice, further research addressing the limitations of this study should be conducted. Because the sample was relatively small and consisted predominantly of well-functioning patients with breast and prostate cancer, these results need to be confirmed in larger studies with more diverse samples to improve generalizability. Studies in such samples are particularly important given the sample-dependent nature of positive and negative predictive values.

It would also be helpful to know whether this approach works with other HRQOL measures and/or other needs assessments. Patients' reports of their supportive care needs may differ from patient to patient or change within a given patient over time. A more rigorous evaluation would not rely on a single self-reported measure as the gold standard but would use an array of measures, including patient-reported, clinician-reported, and objective disease measures, to triangulate and identify cut-points. Longitudinal studies might also help identify important changes in scores, an area that this analysis does not address.

One issue that remains is what external criterion to use for the HRQOL domains that have no good content match in a needs assessment. There are several options for approaching this issue. In this study, we used a validated needs assessment, which has the benefit of using questions that have been developed and tested through a rigorous process. However, it would also be possible to develop needs assessment-type questions to use as anchors specifically for the purpose of identifying problem scores on HRQOL questionnaires. This latter approach would also address another weakness of our study: the two questionnaires had different recall periods. The QLQ-C30 questions generally use a 1-week recall period while the SCNS questions use a 1-month recall period, meaning that patients' responses did not refer to exactly the same time period, and this discrepancy could lead to differences in patients' answers to the QLQ-C30 versus the SCNS. Developing purpose-specific questions would allow the recall periods to be made consistent. As noted above, it would be preferable to supplement whatever PRO is used as the gold standard with other clinician-reported and objective measures.

While all of these issues will need to be addressed in future studies, the preliminary evidence from this analysis suggests that there are viable approaches for identifying problem scores on HRQOL questionnaires. Being able to identify scores that require a clinician’s attention is an important step in using PROs in clinical practice to assist with individual patient management.