Introduction

Accurate capture and monitoring of symptomatic adverse events (AEs) is essential in clinical trials and drug labeling to ensure patient safety and inform treatment-related decision-making (Basch 2010, 2014, 2016). In the United States, the standard approach to collecting this information in oncology trials is clinician reporting using the Common Terminology Criteria for Adverse Events (CTCAE) (National Cancer Institute 2010), which allows licensed clinicians [i.e., medical doctors (MDs) and registered nurses (RNs)] to grade AEs based upon descriptive clinical criteria (e.g., Grade 3 nausea = inadequate oral caloric or fluid intake; tube feeding, TPN, or hospitalization indicated). The assignment of a given AE grade has implications for patient treatment and/or participation in clinical trials.

The CTCAE includes multiple categories of AEs: lab-based AEs (e.g., neutropenia), which are generally sourced directly from lab reports; clinical measurement-based AEs (e.g., hypertension), which are typically evaluated and reported by clinicians; and symptom-based AEs such as fatigue or nausea, which, despite being amenable to patient reporting, are still primarily rated by clinicians (Basch et al. 2014).

The increased acceptance of the use of patient-reported outcomes (PROs), defined as the unfiltered direct report of a given symptom by a patient (Basch 2012; Trotti et al. 2007), to characterize the patient symptomatic experience has led to the US National Cancer Institute’s initiative to develop a PRO version of the CTCAE (PRO-CTCAE) that will be used in future US-based clinical trials in oncology (Basch et al. 2014; Dueck et al. 2015; Hay et al. 2014). Given that both clinician- and patient-based reporting of symptomatic AEs will be commonplace in US-based oncology clinical trials, it is important to understand how these independent rating sources are associated before this information can be integrated into clinical practice.

In prior work using conventional statistical metrics [e.g., intraclass correlation (ICC), Cohen’s weighted κ] to compare clinician and patient reports of AE severity on an ordinal response scale, we observed that inter-rater agreement is highly dependent on the prevalence of the AE: a high proportion of “asymptomatic” rating pairs (i.e., both ratings are 0, or not present) can inflate the apparent level of agreement, which may not accurately represent agreement within the subset of patients experiencing any level of the symptom (Atkinson et al. 2012, 2016).

An alternative Bayesian approach to the calculation of concordance, based on the Graded Item Response Model (GRM), was recently proposed by Baldwin and colleagues (Baldwin et al. 2009). This approach applies the underlying principles of the original Samejima GRM (Samejima 1997) in a Bayesian framework. In their example, patient hip fracture radiographs were independently judged by 12 orthopedic surgeons using a four-level classification of severity. The surgeons’ ratings were viewed from an Item Response Theory (IRT) perspective: each surgeon’s severity rating was modeled as a scale item, while the patients’ radiographs were treated as a sample from a latent continuum of hip fracture severity. This analytic framework allowed an IRT analysis of the raw rectangular dataset of 15 patients evaluated by 12 surgeons (likened to scale items). The item threshold parameters in the fitted Bayesian GRM represented the surgeons’ decision cutoffs, and the item discrimination parameters represented how sensitive the surgeons’ responses were to changes in hip fracture severity. The authors found that the model-predicted decision cutoffs agreed reasonably well with the surgeons’ severity ratings. This example showed that the Bayesian GRM framework can identify how raters differ in their independent assessments, including differences that are subtle and highly contextual (e.g., concordance at low latent hip fracture severity, with discordance emerging at high latent severity).

The present study applied this Bayesian GRM framework to measuring concordance between doctor (MD)-, registered nurse (RN)-, and patient-based reporting of symptomatic AEs. We sought to model and further characterize nuanced differences in AE grading thresholds between MDs, RNs, and patients, thus providing information beyond what can be obtained with traditional statistical methods such as Cohen’s weighted κ or the ICC.

Methods

Patients

The data sample for this secondary analysis included 393 English-speaking cancer patients of mixed disease type (i.e., lung, prostate, and gynecologic) who were undergoing chemotherapy as part of an Institutional Review Board-approved protocol at Memorial Sloan Kettering Cancer Center (MSK) between March 2005 and August 2009; informed consent was obtained from all included patients (Basch et al. 2005, 2007a, b, 2009, 2016). Patient records were eligible for inclusion in this analysis if they contained documented independent MD, RN, and patient symptom ratings for a single clinic visit, without any other restrictions (Atkinson et al. 2012).

Measures

Common Terminology Criteria for Adverse Events version 4 (CTCAE) (National Cancer Institute 2010)—CTCAE consists of a library of over 700 descriptive terms for clinician-based assessment of patient AEs related to cancer treatment. Each CTCAE term is graded on a verbal descriptor scale spanning Grades 0–5, with each term following a similar grading convention (i.e., 0 = not present, 1 = mild, 2 = moderate, 3 = severe and/or requiring medical intervention but not life-threatening, 4 = life-threatening consequences, and 5 = death).

Symptom Tracking and Reporting (STAR) (Basch et al. 2005, 2007a, b, 2009, 2015, 2016)—STAR is a web-based adaptation of CTCAE that was developed and validated to facilitate patient reporting of treatment-related AEs in the clinic waiting area and at home between visits. STAR items are assessed using a 5-point verbal descriptor rating scale similar to CTCAE (i.e., 0 = none, 1 = mild, 2 = moderate, 3 = severe, 4 = disabling). STAR items assessing constipation, diarrhea, dyspnea, fatigue, nausea, and vomiting were included in the present analysis to correspond with the analogous clinician-based CTCAE ratings for these AEs.

Procedure

Routinely documented patient electronic medical records were examined using the MSK Health Information System. Data were abstracted in cases where independent ratings of constipation, diarrhea, dyspnea, fatigue, nausea, and vomiting were made by an MD (via CTCAE), an RN (via CTCAE), and the patient (via STAR) during the same clinic visit.

Statistical analysis

A Bayesian GRM was fitted to estimate the latent grading thresholds of clinicians and patients (Baldwin et al. 2009). In this analysis we focused on the model-based expected item responses for MDs, RNs, and patients. This model-based approach is advantageous in that it facilitates extraction of core information from data that contain multiple sources of variability; the resulting model-estimated responses represent the most likely AE ratings from MDs, RNs, and patients with random error variability parsed out. Each of the six individual AEs was treated as unidimensional, given that each AE was probed using a single item and independently rated by MDs, RNs, and patients.
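For reference, under the GRM the probability that rater j assigns patient i a grade of k or higher is modeled as a logistic function of the latent AE severity θ_i, with item discrimination α_j and ordered thresholds κ_jk (a standard formulation written in common IRT notation, rather than an equation reproduced from Baldwin et al. 2009):

$$P(Y_{ij} \ge k \mid \theta_i) = \frac{1}{1 + \exp\left[-\alpha_j\left(\theta_i - \kappa_{jk}\right)\right]}, \qquad k = 1, \dots, K - 1,$$

with category probabilities obtained as differences of adjacent cumulative probabilities, $P(Y_{ij} = k \mid \theta_i) = P(Y_{ij} \ge k \mid \theta_i) - P(Y_{ij} \ge k + 1 \mid \theta_i)$.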

It was necessary to code the data in a manner amenable to the GRM framework. Whereas the Baldwin example contained a single rating for each observation, the present dataset contains up to three ratings per symptom for each patient: one each from an MD, an RN, and the patient. Table 1 illustrates the data structure for a single symptom, with the columns representing the fitted scale items. For each column, GRM item discrimination and threshold parameters were calculated. The posterior mean values of the model-fitted item responses were calculated to represent model-based AE grades obtained independently from MDs, RNs, and patients. Because no individual MD or RN assessed AEs in all patients, instances where a given MD did not make a rating were treated as missing (noted by “N/A”). For example, MD 264 may have rated patients 004 and 390–393 but no other patients in the dataset. The Bayesian GRM approach updates the parameter estimates based on available data only; missing entries therefore provide no information with respect to the posterior distributions of the parameters. This permitted the modeling of decision thresholds across the aggregated clinic clusters as they occurred in actual clinical encounters, without forcing a complete rectangular data structure.

Table 1 Example of data entry structure
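As a minimal sketch of this layout in R (hypothetical grades; the rater and patient IDs echo the example above, and NA marks rater–patient pairs that did not occur):

```r
# Wide-format layout for a single AE: one row per patient,
# one column per rater "item"; NA = rater did not assess that patient.
nausea <- data.frame(
  patient = c("004", "390", "391"),
  MD_264  = c(2, 1, NA),   # CTCAE grades from MD 264 (hypothetical values)
  RN_012  = c(1, 1, 0),    # CTCAE grades from a hypothetical RN
  PT_STAR = c(3, 1, 0)     # patients' own STAR ratings
)
```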

With respect to the prior distributions, the α parameters follow a Gaussian distribution with a mean of 1.0 and a standard deviation of 2.5, truncated below at 0.0. The threshold parameters κ follow a Gaussian distribution with a mean of zero and a standard deviation of 2.5. The θ values were constrained to follow a standard normal distribution. A total of 86,000 iterations were simulated, with the first 6,000 discarded as burn-in and the remaining 80,000 thinned by retaining every tenth draw. Other specific details of the Bayesian computation are explained elsewhere (Baldwin et al. 2009). Local independence among MD, RN, and patient ratings was assumed to simplify the illustrative examples. All analyses were completed using R version 3.2.3 (R Development Core Team 2016) and Just Another Gibbs Sampler (JAGS) version 4.1 (Plummer 2016). The rjags package was used as a conduit to send the data from R to JAGS for the simulations.
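To make this specification concrete, the following is a minimal rjags sketch under the priors above; it is a simplified, hypothetical rendering rather than the exact code used in the analysis. Y denotes the patient-by-item grade matrix from Table 1 recoded to 1–5, and ordered thresholds are enforced here by sorting, one common device for imposing monotonicity:

```r
library(rjags)

grm_string <- "
model {
  for (i in 1:N) {
    theta[i] ~ dnorm(0, 1)                      # latent AE severity, standard normal
    for (j in 1:J) {
      Y[i, j] ~ dcat(prob[i, j, 1:K])           # ordinal likelihood; NA cells stay uninformative
      for (k in 1:(K - 1)) {
        logit(pstar[i, j, k]) <- alpha[j] * (theta[i] - kappa[j, k])
      }
      prob[i, j, 1] <- 1 - pstar[i, j, 1]
      for (k in 2:(K - 1)) {
        prob[i, j, k] <- pstar[i, j, k - 1] - pstar[i, j, k]
      }
      prob[i, j, K] <- pstar[i, j, K - 1]
    }
  }
  for (j in 1:J) {
    alpha[j] ~ dnorm(1, 0.16) T(0, )            # mean 1, SD 2.5 (precision 1/2.5^2), truncated > 0
    for (k in 1:(K - 1)) {
      kappa0[j, k] ~ dnorm(0, 0.16)             # mean 0, SD 2.5
    }
    kappa[j, 1:(K - 1)] <- sort(kappa0[j, 1:(K - 1)])  # enforce ordered thresholds
  }
}"

jm <- jags.model(textConnection(grm_string),
                 data = list(Y = Y, N = nrow(Y), J = ncol(Y), K = 5),
                 n.chains = 1)
update(jm, 6000)                                # discard 6,000 burn-in iterations
post <- coda.samples(jm, c("alpha", "kappa", "theta"),
                     n.iter = 80000, thin = 10) # keep every tenth of 80,000 draws
```

Because JAGS updates parameters only from observed cells, the NA entries in Y contribute no information to the posteriors for the item and person parameters, consistent with the handling of missing ratings described above.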

Results

Table 2 includes characteristics of the included patients (N = 393). Patients (median age = 63, range = 26–91 years) were diagnosed with lung (34%), prostate (29%), or gynecologic (37%) malignancies; the majority of patients (85%) were high functioning (i.e., score ≥80/100), as captured by the clinician-reported Karnofsky Performance Status measure (Karnofsky and Burchenal 1949). Patients were independently rated by 1 of 26 attending oncologists and 1 of 26 corresponding RNs, without access to each other’s assessments, as part of their routine clinic visit. The average time between MD and RN ratings was 68.04 min (Atkinson et al. 2012).

Table 2 Patient characteristics

Table 3 displays the means, standard deviations, and traditional concordance metrics for patient AE ratings, separated into comparisons of MDs and RNs, MDs and patients, and RNs and patients. ICCs less than 0.40 indicate poor agreement, values between 0.40 and 0.75 indicate moderate agreement, and values of 0.75 or higher indicate excellent agreement (Rosner 2005). Cohen’s κ estimates follow a similar convention, with values from 0.00 to 0.40 representing poor concordance, 0.41–0.75 indicating fair to good agreement, and values over 0.75 indicating excellent agreement (Shrout and Fleiss 1979). For the current sample, Cohen’s κ and ICC estimates were poor to moderate at best when comparing any of these rating sources for each of the AEs. Additionally, the Cohen’s κ for the comparison between MD and patient ratings of constipation was −0.05, indicating agreement no better than chance.

Table 3 Means, standard deviations, and traditional concordance metrics for adverse event ratings by MDs, RNs, and patients
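For reference, these conventional metrics can be computed in R with standard packages; the sketch below uses hypothetical vectors md and pt holding paired 0–4 grades for one AE, not data from this study:

```r
library(irr)     # provides icc()
library(psych)   # provides cohen.kappa()

# Hypothetical paired MD and patient ratings for one AE (grades 0-4)
md <- c(0, 0, 1, 2, 0, 3, 1, 0)
pt <- c(0, 1, 1, 3, 0, 2, 2, 0)

icc(cbind(md, pt), model = "twoway", type = "agreement")  # intraclass correlation
cohen.kappa(cbind(md, pt))   # reports both unweighted and weighted kappa
```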

Figure 1 presents GRM estimates for MDs, RNs, and patients, and the resulting differences between MD and patient, RN and patient, and MD and RN ratings for nausea. Each trace line represents the expected a posteriori (EAP) AE ratings made by each individual over a range of latent toxicity values. The upper row of subplots in Fig. 1 displays the EAP CTCAE ratings for nausea made by MDs (left) and RNs (center), and the EAP STAR ratings made by all patients (right).

Fig. 1

Graded response model estimates and histograms for MDs, RNs, and patients, and the differences between MD, RN, and patient thresholds for rating nausea. For the top two rows, each trace line represents the expected a posteriori (EAP) AE ratings made by each individual over a range of latent AE values. For the histograms, the thick Gaussian kernel density trace lines represent the smoothed version of the responses
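To make the EAP trace lines concrete, the expected grade implied by a fitted GRM can be computed from a rater’s posterior mean item parameters as the sum of the cumulative category probabilities. The sketch below uses hypothetical parameter values, not estimates from this study:

```r
# Expected AE grade (0-4 scale) for one rater as a function of latent severity.
# For a nonnegative integer grade Y, E[Y | theta] = sum over k of P(Y >= k | theta).
expected_grade <- function(theta, alpha, kappa) {
  pstar <- sapply(kappa, function(k) plogis(alpha * (theta - k)))  # cumulative probs
  rowSums(pstar)
}

theta_grid <- seq(-3, 6, by = 0.1)
# Hypothetical posterior-mean discrimination and thresholds for one MD
eap_md <- expected_grade(theta_grid, alpha = 1.4, kappa = c(0.5, 2.0, 3.5, 5.0))
plot(theta_grid, eap_md, type = "l",
     xlab = "Latent nausea severity (SD units)", ylab = "Expected CTCAE grade")
```

Plotting this curve over a grid of latent severities reproduces the kind of trace line shown in the upper subplots of Fig. 1.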

The second row of subplots in Fig. 1 displays the differences between model-estimated MD and patient, RN and patient, and MD and RN EAP AE ratings for nausea. In these subplots, a difference of zero represents perfect concordance between raters, with positive and negative values indicating underestimation or overestimation of relative AE ratings, respectively. Here, MDs and RNs were observed to underestimate patient-reported nausea, with a slight RN overestimation of nausea relative to MD ratings. Further, the Bayesian GRM showed higher variability in patients’ thresholds for assessing nausea than in those obtained from MDs and RNs, as indicated by the more extreme trace lines in the patient versus MD and patient versus RN subplots.

The bottom subplots of Fig. 1 represent histograms of the GRM-estimated rating scale thresholds for nausea (i.e., Grade 0–1, 1–2, or 2–3), separated by MDs, RNs, and patients. For example, the subplot labeled “MD: Grade 0–1 Threshold” plots the estimated thresholds for all 26 MDs, with the x-axis representing the latent implicit decision thresholds, in standard deviations above or below the mean, and the y-axis representing the frequency count (i.e., the tallest bar represents four MDs with a latent implicit decision threshold near 0 standard deviations). The thick Gaussian kernel density trace lines represent smoothed versions of the histograms (Silverman 1986).

The MD and RN latent implicit decision thresholds peak at approximately five standard deviations above the mean, whereas the patient latent implicit decision thresholds peak at approximately four standard deviations above the mean. This implies that differences in AE grading between patients and MDs or RNs are more likely to occur at these higher levels of nausea toxicity.

Figure 2 presents results for constipation and follows the same general format as Fig. 1. Here, concordance between patients and MDs appears fairly high, with subtle MD underestimation at the lower grading thresholds and overestimation at the higher thresholds. RNs overestimate the higher grading thresholds of patient- and MD-rated constipation. The frequency distribution subplots of Fig. 2 indicate that MD, RN, and patient latent implicit thresholds are relatively similar for the Grade 2–3 threshold, but that differences as large as 1 or 2 grades occur when RNs rate constipation at the Grade 1–2 threshold, as compared to MDs. Appendix II includes similar figures for the remaining four AEs (i.e., diarrhea, dyspnea, fatigue, vomiting).

Fig. 2

Graded response model estimates and histograms for MDs, RNs, and patients, and the differences between MD, RN, and patient thresholds for rating constipation. For the top two rows, each trace line represents the expected a posteriori (EAP) AE ratings made by each individual over a range of latent AE values. For the histograms, the thick Gaussian kernel density trace lines represent the smoothed version of the responses

Discussion

Traditional concordance metrics are well established for characterizing the relationship between two independent sources of information. However, when applying these methods to AE reporting, where there are likely to be many instances in which MDs, RNs, and patients agree simply because a symptom is not present, the resulting coefficients may not accurately represent the actual level of agreement. Additionally, a single coefficient does not tell the complete story of the relatedness of clinician- and patient-based AE ratings, particularly with respect to the direction and magnitude of the discrepancies. In the oncology clinical trial setting, where a difference as small as one CTCAE grade can determine whether a patient continues participation in a trial, it is crucial to accurately identify and understand any sources of discrepancy in AE ratings. In this study, we used a Bayesian GRM to model concordance between MD-, RN-, and patient-based AE reporting, as well as to characterize potentially nuanced differences in AE grading thresholds between these three rating sources.

We found that, on average, the disagreements between MDs, RNs, and patients were generally less than one grade, although in some instances these discrepancies varied by up to two grades. Overall, MDs and RNs underestimated patient-reported diarrhea, dyspnea, nausea, vomiting, and fatigue. The Bayesian GRM analysis also demonstrated that RNs overestimated higher levels (i.e., Grade 1–2) of constipation when compared to patient or MD ratings, which is consistent with previous findings from a study of patients undergoing chemotherapy (Cirillo et al. 2009).

Additionally, the Bayesian GRM indicated higher variability in the latent patient AE rating thresholds than in those obtained from MDs or RNs. This finding is consistent with our previous work indicating that clinician-based toxicity reports underestimate the frequency and severity of AEs relative to patient reports (Basch et al. 2009). Patient variability in AE-reporting thresholds is likely due to the highly subjective and contextual nature of AE self-reporting, whereby a level of symptom burden that one patient rates as severe might be rated as mild by another. Patients also may not be aware that important decisions related to their treatment and continued participation in a clinical trial may be affected by their reported AE levels. As patient reporting of AEs becomes commonplace in oncology clinical trials, it may be important to provide patients with additional context regarding the treatment-related implications of reporting a higher grade of a given AE.

The Bayesian GRM analysis begins to provide evidence that patients report some symptoms that MDs and/or RNs might not consider important until the AE has reached a more elevated level of severity. This is important to understand as the inclusion of patient-reported AEs nears standardization in US-based oncology clinical trials. Clinical trial participation can also influence clinician AE grading: assigning a higher AE grade for a particular symptom may result in a patient being removed from a trial, despite other evidence of therapeutic benefit. The Bayesian GRM visualization of differences in AE grading thresholds could be a useful tool for helping MDs and RNs communicate and acknowledge differences between clinician and patient AE reports while explaining the implications of assigning a higher AE grade.

Given that patient reports of AEs are becoming increasingly accepted for inclusion in clinical trials, an outstanding issue is which source of AE reports should be considered the definitive “gold standard” indicator of AE levels. In the present study, MD ratings were compared with RN and patient ratings of AEs. While patients were treated as the reference category when compared to MDs or RNs, this was done only to illustrate differences between sources of AE ratings. In the absence of standardized AE grading decision criteria for MDs, RNs, and patients, there may be no definitive “gold standard” source of AE information. Nevertheless, these multiple AE rating sources should be used as complementary pieces of information that can provide clinicians with a more complete picture of the patient symptomatic experience.

This study has several limitations. Our sample was collected at a single tertiary cancer center and was limited in diversity with respect to race, ethnicity, and disease type; only three cancer populations were included (i.e., lung, prostate, and gynecologic). Additionally, while the Bayesian GRM is helpful in depicting underlying patterns of concordance between clinician- and patient-based AE ratings, it does not explain the sources of discordance between raters. The STAR measure has been previously validated as a tool to capture patient-reported AEs (Basch et al. 2005); however, this instrument assesses a limited number of patient AEs. With the recent development of PRO-CTCAE (Basch et al. 2014; Dueck et al. 2015; Hay et al. 2014), a natural next step is to apply this Bayesian GRM analysis in a multicenter prospective study of patients across multiple disease types, assessing a wide range of treatment-related AEs as captured by CTCAE and PRO-CTCAE. Finally, in this context the GRM operates under the assumption that MD, RN, and patient ratings are locally independent given the model. The results should therefore be interpreted with caution, as non-independence may exist between these ratings. A formal investigation of this potential statistical dependence is beyond the scope of this article. Future applications of this analysis should accommodate the multilevel data structure (i.e., patients nested within RNs, who are nested within MDs) and potentially assess the utility of alternative models designed for such structures, such as the Rasch testlet model (Wang and Wilson 2005).

The Bayesian GRM can be a useful descriptive tool for understanding and visualizing the nuanced differences between MD-, RN-, and patient-based AE-reporting thresholds. For instances where MDs and RNs rate the same patient or set of patients, the Bayesian GRM can display subtle patterns of discrepancy between ratings and show where potential large, 1–2 grade differences may exist for a given AE. This information can assist MDs and RNs in the standardization of AE grading. Similarly, as patient reports of treatment-related AEs become commonplace in oncology clinical trials, patient ratings can be included in a Bayesian GRM framework and displayed relative to the corresponding clinician ratings for a given AE. Such information can enhance communication between patient and provider and help patients understand the importance of accurate AE reporting, toward ultimately improving decisions related to treatment and long-term patient health outcomes.