Introduction

The establishment of causality between drug exposure and adverse drug reactions (ADRs) is challenging. To date, few diagnostic tools are available to confirm or refute an implicated drug. Thus, obtaining a comprehensive history (timing of exposure, onset of ADR symptoms, previous reactions to similar medications, and other associated risk factors) and understanding the pharmacological profile of the implicated drug are critical in assessing causality.

A standardized approach for drug causality assessment is recommended [1]. Causality assessment tools (CATs) provide guidance for gathering critical information related to an ADR event. Several CATs exist which categorize the relationship between drug exposure and ADR as unlikely, possible, probable, or definite. Some CATs have been developed for specific types of ADRs, such as the Roussel Uclaf Causality Assessment Method for drug-induced liver injury or the algorithm of drug causality for epidermal necrolysis (ALDEN), which is specific to cases of Stevens-Johnson syndrome (SJS) and toxic epidermal necrolysis (TEN) [2, 3]. Other CATs, such as the Naranjo or Liverpool, are not specific to a clinical presentation and can thus be used for a variety of ADRs [4, 5].

Although CATs were developed to assist in determining a link between drug exposure and ADR, agreement between causality tools is poor [6]. To date, no studies have compared the ALDEN, Liverpool, and Naranjo in the assessment of SJS/TEN cases. The objective of this study was to compare the reliability of these three CATs in assessing SJS/TEN cases and to quantify their validity by comparing the results with expert judgment.

Methods

Causality assessment tools

Seven reviewers independently completed three CATs (ALDEN, Liverpool, Naranjo) for 11 Stevens-Johnson syndrome/toxic epidermal necrolysis (SJS/TEN) cases. Briefly, the Naranjo consists of 10 questions with yes/no/do not know options for each response [5]. A score is assigned for each question based on the response, and the sum of the scores determines the causality classification of doubtful, possible, probable, or definite. Similarly, the ALDEN consists of 6 criteria, each with an associated question and a score based on the response, with the total score determining the causality as very unlikely, unlikely, possible, probable, or very probable [3]. The Liverpool tool is a visual algorithm consisting of yes/no questions that determine the path to the next question and the final causality classification of unlikely, possible, probable, or definite [4]. Each CAT was applied to categorize every potential drug as definite/very probable, probable, possible, or doubtful/unlikely with respect to causing SJS/TEN. For analysis, the ALDEN results of very unlikely and unlikely were grouped together and classified as doubtful/unlikely. An additional reviewer (NS) provided expert opinion by designating the most likely implicated drug(s) for each case using clinical judgment without use of a CAT.
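As an illustrative sketch only (not part of the study procedures), the mapping from a Naranjo total score to its causality category can be expressed in R, the language used for the analyses reported below. The cut-offs follow the published scale (total score ≥ 9 definite, 5–8 probable, 1–4 possible, ≤ 0 doubtful); the item scores in the example are hypothetical.

```r
# Illustrative only: map a Naranjo total score to its causality category using
# the cut-offs of the published scale [5]: >= 9 definite, 5-8 probable,
# 1-4 possible, <= 0 doubtful.
naranjo_category <- function(total_score) {
  if (total_score >= 9)      "definite"
  else if (total_score >= 5) "probable"
  else if (total_score >= 1) "possible"
  else                       "doubtful"
}

# Hypothetical item scores for one drug in one case (individual Naranjo items
# score between -1 and +2 depending on the yes/no/do not know response)
item_scores <- c(1, 2, 1, 0, 2, -1, 0, 0, 0, 1)
naranjo_category(sum(item_scores))  # total = 6 -> "probable"
```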

Clinical cases

Eleven real and randomly chosen historical clinical cases were provided by the authors CC and CL, who did not take part in the causality assessments. All cases were diagnosed on the basis of RegiSCAR criteria and had undergone rigorous evaluation by a dermatologist and an SJS/TEN review committee at the time of clinical presentation [7, 8]. For this study, information for each SJS/TEN clinical case was de-identified and then provided to the 7 reviewers. The information included a brief medical history, a detailed clinical presentation, and the relationship of the onset of the cutaneous ADR to any recent drug exposure. Laboratory evaluations and skin biopsy results were provided when available. The timing of initiation through discontinuation was provided for each drug when available.

Statistical analysis

Reliability was evaluated using Cohen’s Kappa. Agreement was measured (1) by method, comparing all reviewers within each CAT [“inter-rater reliability”], (2) by reviewer, comparing reliability across the 3 CATs for each reviewer [“intra-rater reliability”], and (3) by case, comparing all reviewers within each method. Kappa results were interpreted as follows: values ≤ 0 indicating no agreement, 0.01–0.20 none to slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect agreement [9]. Somers’ D and the c statistic were calculated to assess the validity of the 7 reviewers’ results when compared with expert opinion (NS). A c statistic of 0.5 indicates that the CAT is no better than random chance at identifying the implicated drug when compared with expert opinion, a value over 0.7 indicates a good model, and a value of 1 means that the CAT perfectly predicted agreement with expert opinion [10]. Confidence intervals for the Kappa scores are bias-corrected and accelerated (BCa) bootstrap intervals based on 1000 replications. All analyses were completed using the “irr,” “pROC,” and “Hmisc” packages in R (version 3.3.2).
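As an illustrative sketch of these computations (not the exact analysis script), the R code below assembles a simulated ratings matrix, computes a multi-rater kappa with the “irr” package, obtains a BCa bootstrap confidence interval (using the “boot” package, which is not named above and is an assumption), and derives the c statistic and Somers’ D with “pROC” and “Hmisc.” The multi-rater extension shown (Light’s kappa, the average of pairwise Cohen’s kappas) is one common choice and is assumed here; all data in the example are simulated.

```r
# Illustrative sketch of the reliability and validity computations (not the
# exact study script). 'ratings' has one row per drug-case pair and one column
# per reviewer, coded 1-4 (doubtful/unlikely ... definite/very probable); the
# expert coding and all values below are simulated.
library(irr)    # kappa statistics
library(boot)   # BCa bootstrap (package assumed; not named in the text)
library(pROC)   # ROC curve / c statistic
library(Hmisc)  # somers2()

set.seed(123)
n_drugs <- 30
ratings <- matrix(sample(1:4, n_drugs * 7, replace = TRUE), ncol = 7)

# Multi-rater agreement: Light's kappa (average of all pairwise Cohen's kappas)
kappam.light(ratings)

# BCa bootstrap confidence interval for kappa, resampling drug-case rows
kappa_stat <- function(data, idx) kappam.light(data[idx, ])$value
boot_out <- boot(ratings, statistic = kappa_stat, R = 1000)
boot.ci(boot_out, type = "bca")

# Validity against expert opinion: c statistic and Somers' D for one reviewer
expert_implicated <- rbinom(n_drugs, 1, 0.3)  # 1 = expert names the drug as implicated
reviewer_score    <- ratings[, 1]             # ordinal CAT category from reviewer 1
auc(roc(expert_implicated, reviewer_score))   # c statistic (area under the ROC curve)
somers2(reviewer_score, expert_implicated)    # returns both C and Somers' Dxy
```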

Results

Eleven SJS/TEN cases were initially examined. We excluded one case from the final analysis because reviewers determined that the cutaneous reaction had occurred prior to drug exposure. The final analysis included 10 cases involving 30 drugs (Table 1).

Table 1 Cutaneous skin reaction cases

Overall inter-rater reliability by CAT was slight to fair. The Kappa for the ALDEN was 0.223 (95% confidence interval (CI) 0.141, 0.355), for the Naranjo 0.112 (95% CI 0.019, 0.266), and for the Liverpool 0.124 (95% CI 0.034, 0.273). In general, the Kappa increased with increasing perceived likelihood that the drug caused the reaction (Table 2). Moderate agreement occurred when the ALDEN response classified a drug as definite/very probable, and this was the highest level of agreement achieved across all CATs.

Table 2 Inter-rater reliability by Causality Assessment Tool (CAT)

Similarly, intra-rater reliability by reviewer was generally poor when comparing across the 3 CATs (Supplemental Table 1). Only a single reviewer (reviewer #1) achieved overall moderate agreement (Kappa: 0.466) when evaluating the same drugs using the different CATs. As with the inter-rater reliability, the Kappa was highest with a definite/very probable response. Examination of all reviewer results stratified by both case and method failed to improve agreement (Fig. 1).

Fig. 1 Variability of Causality Assessment Tool (CAT) results for 10 SJS/TEN cases. Results of the three CATs applied by 7 reviewers to 10 cases of severe cutaneous drug reactions are displayed. The causality classification of unlikely, possible, probable, or definite is depicted by color/shading. The figure demonstrates the observed inter- and intra-rater variability by case and CAT

When comparing the validity of each individual CAT against the expert reviewer, the area under the curve was highest for the ALDEN (c statistic 0.65) as compared with the Naranjo (0.52) or Liverpool (0.54). Agreement between CAT and expert review occurred most frequently when the reviewers’ response for a given drug was definite/very probable. A definite result by the Naranjo aligned with the expert reviewer 100% of the time, although only 2 responses fell into this category by Naranjo scoring. With the Liverpool CAT, 8 responses were deemed definite, with 88% agreement with expert opinion, as compared with the ALDEN, for which 36 responses were deemed definite/very probable, with 86% agreement. Agreement was lowest (56%) for responses determined by the ALDEN to be unlikely as compared with the expert reviewer.

Discussion

Determining the likelihood that a drug exposure resulted in an ADR is important, yet the effectiveness of available CATs is insufficient [11, 12]. The ability to discern whether or not a drug resulted in an adverse reaction has several implications. First, a medical provider must make future prescribing decisions for a patient following an adverse reaction; establishing causality helps guide which drug classes should and should not be used in the future. Second, pharmacovigilance programs used by healthcare systems, the pharmaceutical industry, and regulatory agencies rely on determining causality between drug exposure and adverse reactions to detect both existing and new ADR signals. Third, research focused on identifying ADR predictors requires detailed phenotyping of ADR patients, including drug exposure and causality. The findings from this study demonstrate overall poor performance of the three CATs based on inter-rater reliability, reliability by reviewer when comparing across the 3 CATs, and validity when compared with expert opinion.

Our study demonstrates that CAT results have low overall agreement even when the tools are used by specialists in the field of drug safety. Even when reviewers were provided the same data, interpretation was highly variable. CAT agreement appeared highest when the drug was deemed the definitive culprit. Inter-rater agreement has previously been shown to be highest when results are more conclusive, such as a “definite” classification [13]. Inter-rater reliability was poor when the drug was determined to be unlikely to have caused SJS/TEN. Additionally, the CATs used in this study were only slightly better than chance at predicting the implicated drug when compared with an expert reviewer.

This study is unique in that the seven reviewers provide geographical representation across the globe, making the findings more generalizable. Our ADR cases were limited to severe cutaneous reactions. The CATs selected for this study had not been previously compared and were chosen for their distinct characteristics: the Naranjo can be applied to all ADRs regardless of phenotype, the ALDEN is specific to SJS/TEN cases, and the Liverpool is not ADR specific but is presented as a flow diagram rather than the tabular scoring systems of the Naranjo and ALDEN. Despite these differences, our findings align with previous studies demonstrating the overall poor reliability of causality assessment tools [11,12,13,14].

ADRs are under-recognized and underreported, and CATs are not consistently utilized in the medical setting [15, 16]. To date, no universally accepted CAT has been shown to provide highly reliable and valid results; clinicians therefore often rely on clinical judgment alone, which can be highly subjective [17]. A CAT that is easy to use while providing useful and reliable results to help guide future prescribing is needed. As the field of drug safety evolves, more information becomes available regarding potential predictors associated with the development of ADRs. Efforts continue in the identification of genetic markers associated with ADRs, including serious skin reactions [18]. New information on drug metabolism and the immune system continues to advance our understanding of ADR development and risk [19, 20]. An enhanced tool for drug-induced SJS/TEN that can incorporate data from immunological testing (e.g., lymphocyte transformation test), pharmacogenetic results (e.g., human leukocyte antigen, drug metabolizing enzyme genotype), and pharmacokinetic data may strengthen the usefulness and applicability of CATs.

Our study has limitations. This study was retrospective, and application of the CATs was based on the case documentation provided. We reviewed only 11 cases; however, we simulated kappa calculations by randomly sampling cases with replacement from our existing data. Based on this simulation, approximately 250 cases would be needed to observe non-overlapping confidence intervals between the ALDEN and the other methods, and the kappa scores did not vary even when 1000 simulated cases were included, supporting our finding that reliability is poor for the 3 CATs assessed. However, the figure of 250 cases is only a calculated estimate. Larger studies in this population, such as the one performed by Sassolas et al. [3], should be performed in the future to further validate our findings. No validated test is available to serve as the gold standard for SJS/TEN cases, and thus we relied on expert opinion. A single expert with vast experience in making the clinical diagnosis of SJS/TEN served as the “gold standard” in this study to best represent what occurs in the clinical setting when causality is assessed. We recognize that there is potential variability in expert opinion; however, evaluating that variability was beyond the scope of this study. Not every reviewer completed a CAT for every drug, resulting in sporadic missing data points. Only 30 drugs were included in this study, which may have limited our ability to determine the true reliability of these assessment tools. For this study, our working group focused only on drug-associated SJS/TEN; future studies should include non-drug-induced reactions.
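The case-resampling simulation referred to above can be sketched in R as follows. The code is illustrative only: the per-case ratings are simulated, and the use of Light’s kappa and the sample sizes examined are assumptions for the example rather than the exact study procedure.

```r
# Illustrative sketch of the case-resampling simulation described above (not
# the exact study script). Each element of 'ratings_by_case' is one case: a
# drugs-by-reviewers matrix of CAT categories coded 1-4; all values are simulated.
library(irr)

set.seed(123)
ratings_by_case <- lapply(1:10, function(i)
  matrix(sample(1:4, 3 * 7, replace = TRUE), ncol = 7))  # 10 cases, ~3 drugs each, 7 reviewers

simulate_kappa <- function(cases, n_cases, n_rep = 200) {
  replicate(n_rep, {
    resampled <- sample(cases, size = n_cases, replace = TRUE)  # resample whole cases
    kappam.light(do.call(rbind, resampled))$value               # multi-rater kappa
  })
}

# Project the kappa and its spread at larger, simulated numbers of cases
for (n in c(10, 250, 1000)) {
  k <- simulate_kappa(ratings_by_case, n_cases = n)
  cat(sprintf("n = %4d cases: mean kappa %.3f (2.5th-97.5th percentile %.3f to %.3f)\n",
              n, mean(k), quantile(k, 0.025), quantile(k, 0.975)))
}
```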

In conclusion, the currently available CATs have poor reliability and validity for drug-induced SJS/TEN. Given the importance of determining ADR causality for patient care, research, the pharmaceutical industry, and regulatory purposes, development of an enhanced tool for drug-induced SJS/TEN that incorporates data from immunological testing and pharmacogenetic results may strengthen CAT usefulness and applicability.