Introduction

The recognition of patient-reported outcomes (PROs) as independent outcomes in cancer represents a major shift in medicine in recent decades [1, 2]. This shift is consolidated by the CONSORT-PRO Extension Statement, developed to improve the reporting of PROs on patients’ evaluation of symptoms, functioning, and quality of life (QoL) [3].

In oncology, traditional and targeted agents represent a variety of biological mechanisms with a suppressive effect on the oral epithelium and the salivary glands [4, 5], in line with other medications for general symptom relief, e.g., pain medication, corticosteroids, and antihypertensives [4, 6–8]. The reported prevalence of the most frequent oral complications varies by cancer diagnosis, stage, previous and ongoing antitumor treatment, comorbidities, and study design: from 6 to 91 % for xerostomia/salivary gland hypofunction with reduced flow or altered composition [4–10], 60–86 % for taste changes [11, 12], 30 % for caries [13], up to 43–75 % for viral/fungal infections [5, 14–16], 23–81 % for pain [5, 16], and up to 80 % for mucositis [17, 18]. Osteonecrosis of the jaw, associated with head-and-neck radiotherapy, varies from 5 to 13 % [16, 19] and is also documented after bisphosphonate and denosumab treatment [19].

Studies show that dental and oral problems are more routinely assessed in patients with head-and-neck cancer [20] than in general oncology patients [21, 22]. This may be because surgery and radiotherapy automatically direct attention to this location [21, 22], and dental examinations are included in pretreatment procedures. Oral side effects are underreported by patients and healthcare providers, especially beyond the phases of active, curative treatment [22]. Nevertheless, depending on the number, intensity, and duration of oral adverse effects, QoL may be compromised after most cancer regimens, in the acute phase and during recovery and follow-up [11, 23–25]. The consequence may be a vicious circle of long-lasting sores, mouth pain, oral infections, and dental problems such as caries and loose teeth, with a negative impact on QoL dimensions like fatigue, nutritional intake, and social functioning [5].

Optimal PRO evaluation must be based on validated assessment tools that are brief, patient-centered, and comprehensive. Most oral health assessment tools are too long, focus on a single aspect, e.g., xerostomia, or rarely evaluate the impact on QoL [26]. The frequently used European Organisation for Research and Treatment of Cancer (EORTC) core questionnaire (EORTC QLQ-C30) assesses generic QoL aspects [27]; thus, the development of specific site or treatment modules is encouraged for clinical trials [28, 29].

This paper presents phase IV, an international field study of the EORTC oral health QoL module [26], intended for clinical use in cancer patients. Aims were to field-test the module in a large, international group of patients to investigate all aspects of its psychometric properties.

Patients and methods

Study design

Questionnaires

The study followed the EORTC Quality of Life Group procedures for module development [28, 29] with patients completing the phase IV QLQ-OH module [26] and the EORTC QLQ-C30 [27]. The module [26] encompassed 17 questions, with four hypothesized scales: pain and discomfort (6 items), xerostomia (2 items), eating (4 items), and information (2 items) and two single items: future worries and use of and problems with dentures.

The 30-item EORTC QLQ-C30 contains five functional scales: physical, role, cognitive, emotional, and social; three symptom scales: fatigue, nausea/vomiting, and pain; and six single items [27]. With the exception of the QLQ-C30 global health/QoL scale (scored from 1 to 7) and the dichotomous module item on use of dentures, all items are scored from 1 “not at all” to 4 “very much”. Higher scores represent better functioning on the functional and global health scales, and more symptoms/problems on the other items and on the QLQ-OH module. All scores were linearly transformed to a 0–100 scale using EORTC guidelines [29].
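
The linear transformation can be illustrated with a minimal sketch, assuming the standard EORTC approach in which the raw score is the mean of the answered items and functional scales are reversed so that 100 represents best functioning. The function name `transform_score` and its parameters are hypothetical, and the handling of missing items is simplified here.

```python
def transform_score(item_scores, item_range=3, functional=False):
    """Linearly transform the mean of the answered items to a 0-100 scale.

    item_range is the maximum minus the minimum response value:
    3 for items scored 1-4, 6 for the global health items scored 1-7.
    Missing answers are passed as None (simplified assumption).
    """
    answered = [s for s in item_scores if s is not None]
    raw = sum(answered) / len(answered)      # raw score = mean of the items
    scaled = (raw - 1) / item_range * 100    # 0 = lowest, 100 = highest response
    # Functional scales are reversed so that 100 means best functioning.
    return 100 - scaled if functional else scaled


print(transform_score([2, 3]))                      # symptom scale, two items
print(transform_score([1, 1, 2], functional=True))  # functional scale
print(transform_score([7, 7], item_range=6))        # global health/QoL
```

Under this convention a symptom score of 100 reflects the highest symptom burden, whereas a functional or global health score of 100 reflects the best possible state.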

Patients

Patients were recruited from 14 institutions in 10 countries between November 2012 and August 2014. Eligibility criteria were age ≥ 18 years, any cancer diagnosis, language fluency, informed consent, and being in active treatment or ≤3 years post-treatment. Patients with terminal disease or obvious cognitive impairment according to standard clinical criteria (disturbed consciousness, disorientation to time/place, or attention deficits) were ineligible.

A sampling matrix was used to ensure a wide distribution of diagnoses, treatment phases, and socio-demographics and included the following five patient groups: (A) in active curative treatment, (B) 2–6 months after cancer treatment, (C) 6 months–3 years after cancer treatment, (D) receiving palliative treatment, and (E) referred to hospital dentist/oral health team. The sample size was chosen to (1) satisfy the ‘rule of thumb’ of 5–10 respondents per item for efficient factor analysis; (2) generate patient groups large enough to enable item response theory (IRT) methods to analyze differential item functioning (DIF) associated with groups A–D above, plus at least 50 in group E, if applicable; and (3) ensure adequate patient groups for stability and responsiveness to change analyses (RCA).
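
As a worked example of criterion (1), the rule of thumb can be applied to the 17-item module. This is a minimal sketch; the function name `respondents_needed` is hypothetical, and the calculation covers only the factor analysis criterion, not the group-size requirements in (2) and (3).

```python
def respondents_needed(n_items, per_item_low=5, per_item_high=10):
    """Return the (lower, upper) number of respondents suggested by the
    5-10 respondents-per-item rule of thumb for factor analysis."""
    return n_items * per_item_low, n_items * per_item_high


low, high = respondents_needed(17)  # 17 questions in the phase IV module
print(low, high)
```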

Methods

Eligible in- and out-patients were approached and informed by the study personnel. Participants completed the QLQ-OH module [26] and the EORTC QLQ-C30 [27], prior to a set of debriefing questions regarding the module’s clarity of wording, whether questions were perceived as intrusive, difficult or irrelevant, and additional comments. Study personnel completed a form on socio-demographic and medical variables including Karnofsky performance status [30].

A subset of patients (n = 177) completed the forms twice, with a 2-week time span. Test–retest reliability was assessed in 60 patients whose oral health issues were not expected to change, while RCA was evaluated in 117 patients undergoing therapy known to negatively affect oral health. A 2-week interval between the RCA assessments was deemed adequate based on clinical experience.

Ethics approval followed national/local requirements. Informed consent was obtained from all participants. The study was registered in ClinicalTrials.gov (Protocol 2012/1390REK).

Data analyses and item selection

The validation dataset for the QLQ-OH module was prepared in IBM SPSS Statistics v.21 for Windows (IBM Corporation, Armonk, NY) and variables screened for missing values. Preliminary descriptive analysis of responses to the 17 items was conducted and checked for severe restriction in range; that is, where only two responses accounted for more than 95 % of respondents [31]. Using a combination of techniques from classical test theory and IRT, the structure and psychometric properties of the hypothesized scales were analyzed.
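
The restriction-in-range check described above can be sketched as follows. This is an illustration only, not the SPSS procedure used in the study; the function name `restricted_range` is hypothetical, and the example responses are invented.

```python
from collections import Counter


def restricted_range(responses, top_n=2, cutoff=0.95):
    """Flag an item when its top_n most frequent response categories
    account for more than `cutoff` of the answered responses."""
    answered = [r for r in responses if r is not None]
    counts = Counter(answered)
    top = sum(count for _, count in counts.most_common(top_n))
    return top / len(answered) > cutoff


# Invented example: 97 of 100 respondents chose category 1 or 2.
item = [1] * 80 + [2] * 17 + [3] * 2 + [4] * 1
print(restricted_range(item))
```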

Principal components analysis

The QLQ-OH module comprised 12 items scored “during the last week,” three items scored “during the course of the illness,” and two items related to dentures. Using the 12 items scored in the same time frame, principal components analysis (PCA) with oblimin rotation was chosen to identify potential items to form scales. Respondents with missing responses for more than 10 % of the items in the module were omitted during this stage of the analysis. Omitting these respondents, rather than imputing values, preserved a complete and unbiased dataset for the exploratory factor analysis. Initial eigenvalues (>1) were inspected to assess the optimum number of factors, with a threshold value of 0.4 used for item loading coefficients in the analysis. Scale reliability was then assessed using Cronbach’s alpha coefficients.
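
Cronbach’s alpha can be computed from complete-case item responses as in the following minimal sketch, which uses the standard formula based on item variances and the variance of the summed score. The function name `cronbach_alpha` and the small example data are hypothetical; the study itself used SPSS.

```python
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list of k item scores per respondent (complete cases only).

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Invented responses from four respondents to two items scored 1-4.
data = [[1, 2], [2, 1], [3, 4], [4, 3]]
print(round(cronbach_alpha(data), 3))
```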

RUMM 2030 software (RUMM Laboratory Pty Ltd., Australia) was then used to test the unidimensionality of subscales identified in the factor analysis. The default procedure for RUMM 2030 uses the partial credit model, which allows items to have varying numbers of response categories and does not assume the distance between response thresholds is uniform. The following summary statistics were used to assess model fit, using established guidelines [32]. A well-fitting solution would be indicated by a probability from the item-trait interaction chi-square greater than 0.05, with Bonferroni correction. Due to the sensitivity of the chi-square statistic with large sample sizes, an adjusted chi-square was adopted for a sample size of 300. Fit residual values, for both person (PFR) and item (IFR), were inspected; a mean close to zero and a SD less than 1.5 was desirable. Individual item fit residual values greater than +2.5 were taken to indicate misfit and less than −2.5 to indicate item redundancy. Internal consistency was assessed using the person separation index (PSI) with values above 0.7 considered desirable for group level analysis. Threshold maps were inspected for noteworthy disordering, which would indicate inconsistent use of the response options. Rescoring was considered if a significant improvement in model fit was seen.
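
The decision rules for individual item fit residuals and the Bonferroni adjustment can be expressed as a short sketch. The function names are hypothetical, and the code only encodes the thresholds stated above; it does not reproduce the RUMM 2030 estimation itself.

```python
def classify_fit_residual(fit_residual, limit=2.5):
    """Apply the +/-2.5 rule to a single item fit residual."""
    if fit_residual > limit:
        return "misfit"
    if fit_residual < -limit:
        return "redundant"
    return "acceptable"


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


print(classify_fit_residual(-4.191))  # strongly negative residual
print(classify_fit_residual(1.2))
print(bonferroni_alpha(0.05, 8))      # e.g., eight items tested
```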

Differential item functioning (DIF) was examined to detect possible item bias arising from differing response patterns across groups in the sample: sex, age group, and treatment group. Person item threshold maps were plotted to assess whether the scales appropriately targeted the respondent group. Lastly, dimensionality was assessed using equating t tests to compare person estimates derived from the two most disparate subsets of scale items [33]. A proportion of significant t tests below 5 % was considered acceptable. Results from the PCA and Rasch analyses were then combined to establish a solution for a set of scales which provided the best overall fit and optimal psychometric properties.
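
The equating t test procedure can be sketched as follows, assuming the commonly described approach in which each person’s two estimates are compared using their standard errors and the proportion of significant tests is reported. The function name, the critical value of 1.96, and the example data are assumptions for illustration; the exact computation in RUMM 2030 may differ.

```python
from math import sqrt


def proportion_significant(est_a, se_a, est_b, se_b, critical=1.96):
    """Proportion of persons whose two estimates (from item subsets A and B)
    differ significantly: |t| = |a - b| / sqrt(se_a^2 + se_b^2) > critical."""
    flagged = 0
    for a, sa, b, sb in zip(est_a, se_a, est_b, se_b):
        t = (a - b) / sqrt(sa ** 2 + sb ** 2)
        if abs(t) > critical:
            flagged += 1
    return flagged / len(est_a)


# Invented person estimates (logits) and standard errors for four persons.
a = [0.0, 1.0, -1.0, 2.0]
b = [0.1, 0.9, -1.2, -1.0]
se = [0.5, 0.5, 0.5, 0.5]
print(proportion_significant(a, se, b, se))
```

A proportion below 0.05 would support treating the item set as unidimensional under this criterion.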

Results

Patients

Overall, 585 patients from 14 centers in 10 countries: France, Germany (2), Greece, Israel, Italy, Netherlands (2), Norway (2), Poland (2), Sweden, and UK were included, varying from 35 (Greece) to 102 (Poland). For 13 records, more than 20 % of values were missing for the items of interest for potential scales, e.g., 10 patients had a feeding tube; thus, the eating items were not applicable.

A core dataset of 572 patients (98 %), 54 % females, mean age 60.4 (SD 12.9), remained for analyses, with occasional missing values accepted for demographic and clinical variables (Table 1). The majority were married or living with a partner (70 %), and 53 % were outpatients. The most frequent diagnosis was head-and-neck cancer (21 %), followed by breast cancer (15 %). Forty-five percent had disseminated or metastatic disease, with metastases to the lymph nodes (10 %) or bones (8 %) being most frequent. Comorbidities were present in 51 %, with two or more in 14 %. Heart disease and/or hypertension were most prevalent (n = 109; 20 %). No significant differences were found between those who were included and those who were not. The following groups were analyzed for DIF: sex, age groups (≤50, 51–60, 61–70, 71+), treatment group (in active treatment or not), and treatment intent (curative vs. not curative).

Table 1 Sociodemographic and medical characteristics

Acceptability

Five hundred forty-nine of 572 patients (96 %) were interviewed, varying from 94 to 100 % per country. Completion took <10 min for 58 %. Assistance was provided to 21 % (n = 114), primarily with reading and/or writing (n = 96; 84 %). Forty-five patients (8 %) marked one or more items as confusing or difficult to answer (range 0–20 per country). The items most frequently marked were satisfaction with information, sensitivity to food and drink, and sticky saliva. In addition, 75 patients (14 %) provided free comments on specific items. Dichotomous answer categories were suggested for the information item (n = 12), and five patients suggested dropping these items. General comments were provided by 75 patients, primarily related to satisfaction with the content and that these issues were addressed (40 %).

Scale structure and reliability

Assessment of the item responses

The dataset was screened for missing responses to the scoring items of the QLQ-OH module. Four items: sensitivity to food/drink, taste change, eating solid food, satisfaction with information had a high proportion of missing values. Two items showed significant restriction in range with low endorsement by patients in this dataset; over 70 % of patients did not report either bleeding gums or having lip sores. However, due to their obvious clinical significance in certain groups of patients, these were retained for further examination in the psychometric analyses.

Principal components analysis

In phase III, the three hypothesized scales (12 items) using a four-point scale, “during the last week” exhibited good internal consistency and reliability [26]. The first principal components analysis (PCA) on the phase IV dataset suggested a two-factor structure, accounting for 55.2 % of the variance. The first factor had a comparatively high eigenvalue (5.26) with the second (1.36) and subsequent factors having small eigenvalues. Inspection of the pattern matrix showed a fair degree of cross-loading for three items between the two factors (Table 2), supporting yet another hypothesis, that the QLQ-OH module could be unidimensional. The information scale, using the timeframe “during the course of your illness” and the dichotomous item use of dentures were analyzed separately.
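
The reported variance can be checked from the two eigenvalues: in a PCA of standardized items, each of the 12 items contributes one unit of variance, so the share explained is the sum of the retained eigenvalues divided by 12. The following is a minimal sketch of that arithmetic; the function name `variance_explained` is hypothetical.

```python
def variance_explained(eigenvalues, n_items):
    """Percentage of total variance explained by the given components,
    assuming a PCA on the correlation matrix (total variance = n_items)."""
    return sum(eigenvalues) / n_items * 100


reported = [5.26, 1.36]                         # eigenvalues of the two factors
retained = [ev for ev in reported if ev > 1]    # eigenvalue > 1 criterion
print(round(variance_explained(retained, 12), 1))
print(round(variance_explained(reported[:1], 12), 1))  # first factor alone
```

The first factor alone accounts for most of the explained variance, which is consistent with the suggestion that the module could be unidimensional.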

Table 2 Principal components analysis; factor loading coefficients

Rasch analysis

First, the two factors identified in Table 2 were tested for goodness of fit (GOF) to the Rasch model. Summary statistics indicated that for factor 1, two items needed to be removed to improve fit whereas for factor 2, one item needed to be removed. Second, all 12 items were tested together to test for unidimensionality. At each step, the item with the greatest misfit or greatest redundancy was removed. Items were removed in the following order, following standard methodology: (1) sensitivity to food and drink, fit residual (FR) = −4.191, (2) problems enjoying meals, FR = −3.510, (3) soreness in mouth, FR = −3.301, and lastly (4) sticky saliva, FR = −2.779.

Eight items formed a clinically useful scale (named OH-QoL) with good fit to the Rasch model (overall chi-square = 69.6, df = 64, p = 0.295/8 = 0.037). These items, all scored on the conventional EORTC four-point scale, were pain in gums, bleeding gums, lip sores, problems with teeth, sore in mouth corners, dry mouth, taste change, and problems eating solid food, with an acceptable Cronbach’s alpha coefficient of 0.786. Inspection of the threshold map revealed slight disordering of thresholds for three items; these could be explained by small frequencies in some categories.

Optimum solution

The statistical analyses conducted in this study served to complement one another. Taking into account the cross-loading of items across the two factors in the PCA and the subsequent results of the Rasch analyses (both on the individual factors and the combined items), the optimum solution adopted for the module was the QLQ-OH15 questionnaire, with an OH-QoL score (8 items), information scale (2 items), scale regarding dentures (2 items), and three single items (sticky saliva/mouth soreness/sensitivity to food/drink). In line with other EORTC QoL modules, the overall total score of the 8 items was standardized to a scale from 0 to 100, with 100 meaning highest QoL (lowest symptom burden). Table 3 displays the item-by-item correlation matrix for the overall eight-item OH-QoL score. Fit statistics to the Rasch model were all within accepted limits: person fit residual (PFR) (SD = 0.950, mean = −0.307), item fit residual (IFR) (SD = 1.462, mean = −0.893), and a PSI of 0.600. The percentage of significant equating t tests (1.57 %) was below the 5 % threshold, and no DIF was found for sex, age, or curative vs. non-curative treatment (p > 0.003, Bonferroni correction). Only one item, taste change, showed slight uniform DIF for treatment group (active treatment vs. not).

Table 3 Item-by-item correlation for the eight-item OH-QoL score

Known group comparisons

There were no significant differences in the OH-QoL score for sex, age group, treatment group (curative or not), or whether satisfied with information given. However, there were highly significant differences in the overall OH-QoL score as to the extent of the patients’ sore mouth, problems with dentures, and problems with sticky saliva (p < 0.003). The head-and-neck patients scored lower on the OH-QoL score (more problems) than the patient group with other cancers (Fig. 1). The OH-QoL score also varied significantly according to patient performance status (worse Karnofsky score = lower QoL) (Fig. 2).

Fig. 1

Median OH-QoL scores for patients with head-and-neck cancer (n = 117) vs. other cancers (n = 455). Boxplot with medians (interquartile ranges). Lower OH-QoL scores (0–100) indicate more oral health-related problems.

Fig. 2

Median OH-QoL scores split by Karnofsky Performance Status score. Boxplot with medians (interquartile ranges). Lower OH-QoL scores (0–100) indicate more oral health-related problems.

Test–retest and responsiveness

Test–retest reliability, assessed 2 weeks apart (n = 60), revealed no significant differences in responses over time (Wilcoxon matched-pairs signed-rank test, z = 0.229, p = 0.82). Responsiveness was tested in 117 patients with varying diagnoses undergoing therapy with potential oral adverse effects. Consistently higher levels of oral problems were reported at the second assessment, although the difference was not statistically significant (z = 1.904, p = 0.056).
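
The reported p values can be related to the z statistics through the two-sided normal approximation, as in the minimal sketch below. This is an assumption about how the p values were derived; statistical packages may apply continuity or tie corrections, so small differences from the reported values are expected. The function name `two_sided_p` is hypothetical.

```python
from math import erfc, sqrt


def two_sided_p(z):
    """Two-sided p value for a standard normal test statistic z."""
    return erfc(abs(z) / sqrt(2))


print(round(two_sided_p(0.229), 2))   # test-retest comparison
print(round(two_sided_p(1.904), 3))   # responsiveness comparison
```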

The overall OH-QoL score showed mild to moderate correlations with the EORTC QLQ-C30 scores (0.3 to 0.4, p < 0.05; Table 4). The overall OH-QoL score indicated a lower QoL (higher symptom burden) for head-and-neck cancer (mean: 71) than for other cancers (mean: 78; Table 4), and the QLQ-OH15 single items showed a higher symptom burden in the head-and-neck group.

Table 4 Correlation between transformed scores on the QLQ-C30 and the QLQ-OH15, all diagnoses versus head-and-neck

Discussion

This study represents the final phase of the EORTC module development process and investigates the reliability, validity, and psychometric properties of an EORTC QLQ-OH module in an international heterogeneous sample of cancer patients. Two items were removed from the phase III module, due to statistical misfit and patient feedback, yielding a 15-item questionnaire: the QLQ-OH15, containing one eight-item OH-QoL scale, three single items (sticky saliva/mouth soreness/sensitivity to food/drink), and two two-item contingency scales regarding use (yes/no) and problems with dentures and reception of (yes/no) and satisfaction with information. Patients’ appreciative comments on the debriefing forms indicate that the instrument was well understood and perceived as relevant.

The standardized cross-cultural development of questionnaires under the EORTC umbrella ensures the identification of issues perceived as relevant by patients and the necessary psychometric properties for international use. No apparent cross-cultural differences were observed in this study. One item, sticky saliva, was reported as difficult by patients, particularly among Swedish patients. This item was taken from the EORTC item-bank, used in the head-and-neck module [34] since 1994 with no reported problems, and no mistakes were identified in the Swedish translation. Because of the high clinical importance, this item was retained, as was the item, sensitivity to food and drink, also from the item bank. Patient feedback resulted in a change of answer categories from 1 to 4 to yes/no on the first information item, received information, offering a skip option for the subsequent, satisfaction with information.

Although the hypothesized scale structure of the QLQ-OH module during phase III needed some refinement, the items developed remained robust during phase IV. This study demonstrates the powerful combination of classical test theory and IRT in the development of new scales. The eight-item scale represents an overall OH-QoL scale that is influenced by the oral health status. Thus, the final QLQ-OH15 module has three multi-item scales: the OH-QoL, the information scale, and one regarding use and problems with dentures, supplemented by three single items on symptoms, all perceived as relevant by patients and clinicians during the stepwise development. When used in conjunction with EORTC QLQ-C30, as is the convention for EORTC modules, the multidimensional concept of QoL is well addressed, e.g., how oral problems may influence social activities and functioning. Thus, we regard the QLQ-OH15 as an overall screening instrument for QoL related to oral health. It should be noted that the measurement properties of the eight items constituting the OH-QoL score are not maintained if split into subscales. The items may be used to assess the frequency or severity of these issues, if solely based on the 1–4 raw scores, although we do not recommend this. As opposed to some of the other QLQ modules, e.g., the elderly and social well-being modules [35, 36], the QLQ-OH15 may be viewed as having a predominantly physical focus. In our opinion, this is no drawback, as the initial idea originated from clinical practice in an oncology oral health team. A brief, easy-to-use assessment tool may improve the awareness of oral problems in all cancers among healthcare providers and patients. Thus, preventative and supportive care actions can be taken during treatment and follow-up, e.g., alleviation of mucositis and dry mouth; early detection and treatment of oral mucosal infections, periodontal diseases, and caries; adjustment of ill-fitting prostheses; and dietary counseling.

However, assessment tools have little value if they are not perceived as relevant by the users, are unable to discriminate between groups that are perceived as different with respect to symptom intensity, or are insensitive to change over time. All these requirements were met in the present study. No significant differences in scores on the oral health issues were found with demographic- or treatment-related variables, supporting the discriminant and criterion validity. Internal consistency was acceptable, test–retest results showed no significant differences whereas responsiveness was shown in patients whose oral health was expected to change over time. As the primary intention of the QLQ-OH15 development was to produce a clinically useful tool, a number of clinical hypotheses were also investigated, showing that patients with head-and-neck cancer, those with lower performance status, sore mouth and sticky saliva, and problems with dentures had significantly lower OH-QoL scores compared to others (p values <0.001). Although most people today are dentate, ill-fitting dentures that may lead to nutritional problems should be acknowledged, especially among the elderly.

One limitation of the present study may be that inspections of the threshold map revealed slight disordering of thresholds for two items in the OH-QoL score. This could be explained, for example, by low incidence of bleeding gums in these patients, despite the large sample size. On the other hand, this may also occur with even larger samples, unless a strict stratification or a very detailed inclusion matrix was applied. In hindsight, responsiveness related to one particular cancer treatment or diagnosis could have been investigated in more detail. However, our experience with international, multicenter studies shows that researchers’ access to patients and diagnostic groups varies, and that may influence patient recruitment.

Statistical study strengths relate to the large sample size and the utilization of the combination of CTT and IRT methods, as best practice in scale development. Overall study strengths are the cross-cultural validation and systematic development according to established EORTC guidelines, the apparent clinical validity and applicability across treatment phases and cancer diagnoses, and the positive patient feedback. The EORTC Quality of Life Group supports the development of symptom-based questionnaires (SBQs) focusing on side effects related to new treatment regimens. The QLQ-OH15 module fits well with this, as a review reported substantial differences in oral mucositis and stomatitis in cancer patients treated with different tyrosine kinase inhibitors [37]. Also, a recent randomized trial demonstrated more QoL improvements in cancer patients undergoing systematic monitoring of PROs compared to those being monitored at the discretion of the clinicians [38], thereby demonstrating a beneficial effect on clinical outcomes.

Conclusion

The results from this large-scale international study support the psychometric properties of the QLQ-OH15 as a clinical instrument for evaluation of oral health issues that may impact on QoL. Its use in conjunction with EORTC QLQ-C30 makes it feasible to assess, treat, or prevent oral problems before, during, and after cancer treatment.