Introduction

Pharmacovigilance experts devote considerable effort to post-marketing surveillance of adverse drug reactions (ADRs). Although the prepared mind of the pharmacovigilance expert remains the cornerstone of this process [1], statistical algorithms, also known as data mining algorithms (DMAs), are being promoted as supplementary tools for safety reviewers. Opinions vary on their utility and optimum deployment, mainly because their use has not been fully validated, in part owing to a lack of consensus on gold standards for causality. True positive associations may be inherently more interesting, but constructing reference sets for validation also requires the identification of “true negatives” against which the performance of DMAs can be measured.

Occasionally, drug-event associations (DEAs), originally considered credible based on traditional pharmacovigilance monitoring, are discounted with varying levels of certitude after further investigation. We refer to these DEAs as “phantom ships” [2]. Phantom associations may be discounted through epidemiological evidence, careful clinical analysis of the individual cases, and/or fundamental clinical pharmacological principles [3–9].

Objective

To highlight some previously ignored decision-theoretic aspects of signal detection, using common implementations of two DMAs applied to eight potential “phantom” associations.

Methods

Two authors (M.H. and E.v.P.) selected a convenience sample of drug-event combinations (DECs) that could be identified as ‘phantom DECs’; these are listed in Table 1. Four currently used metrics from two types of disproportionality analysis, a frequentist method (i.e. standard PRRs [10]) and an empirical Bayesian method (i.e. stratified MGPS [11]), were applied to the FDA-AERS database through the third quarter of 2003 (3Q2003). For each metric/threshold, the timing of the first statistical disproportionality—hereafter referred to as a signal of disproportionate reporting (SDR) [12]—was identified. A MEDLINE search was used to identify the first literature citation for each of the DECs. For PRRs, an SDR was defined as PRR > 2, chi-squared > 4, and case count > 2 [10]. For MGPS, we used the commonly cited threshold of EB05 > 2, N > 0 [13] and an additional threshold of EBGM > 2, N > 0.
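
As a concrete illustration of the frequentist criteria, the following is a minimal Python sketch, not taken from any FDA-AERS tooling: it computes a PRR and a Pearson chi-squared statistic from a hypothetical 2×2 table of report counts and applies the thresholds above. The function name, table layout, and edge-case handling are our own simplifying assumptions.

```python
# Minimal illustrative sketch (not an official PRR implementation).
# The 2x2 contingency table of spontaneous-report counts is:
#   a: drug of interest, event of interest
#   b: drug of interest, all other events
#   c: all other drugs, event of interest
#   d: all other drugs, all other events

def prr_sdr(a: int, b: int, c: int, d: int) -> bool:
    """Return True if the counts meet the SDR criteria used here:
    PRR > 2, chi-squared > 4, and case count > 2."""
    if a == 0 or c == 0:
        return False  # PRR is zero or undefined; real tools handle this separately
    prr = (a / (a + b)) / (c / (c + d))

    # Pearson chi-squared for a 2x2 table (no continuity correction assumed).
    n = a + b + c + d
    chi_sq = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

    return prr > 2 and chi_sq > 4 and a > 2
```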

Table 1 SDRs of phantom associations based on frequentist and empirical Bayesian DMAs/metrics: number of reports to SDR and (year of first signal)

Results

Both the frequentist and the empirical Bayesian algorithm generated an SDR for every phantom association, with the single exception of the commonly cited MGPS metric EB05 > 2, which did not generate an SDR for DECs 1 and 2. Literature reports preceded an SDR in five instances with both DMAs (see Table 1).

Discussion and conclusions

Both DMAs generated SDRs for all selected phantom associations on one or more metrics. For DECs 1 and 2, EB05 > 2 was the only threshold metric that discriminated these phantom associations (i.e. that did not generate an SDR). This is not surprising, because it may be the most “severe” of the metrics, in that it incorporates empirical Bayesian shrinkage plus an additional frequentist element of shrinkage through the use of the lower bound of the 95% posterior interval. While we were unable to review every case of each association to determine the quality of the clinical evidence, our sample included published case reviews that were notable for a lack of evidence to support an association.
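
To make the effect of this double shrinkage concrete, here is a deliberately simplified numerical sketch. It assumes a single gamma-Poisson model with arbitrary prior parameters, rather than the two-component gamma-mixture prior and stratification that MGPS actually employs [11], so the numbers are purely illustrative of why EB05 lags EBGM at small counts.

```python
# Simplified single-component gamma-Poisson sketch of empirical Bayes
# shrinkage; MGPS itself uses a two-component gamma mixture (not shown),
# and the prior parameters below are arbitrary illustrative choices.
import numpy as np
from scipy.special import digamma
from scipy.stats import gamma

def ebgm_eb05(n_obs: int, expected: float, alpha: float = 0.2, beta: float = 0.1):
    """Posterior of the reporting ratio lambda, given n_obs observed versus
    `expected` reports, is Gamma(alpha + n_obs, rate = beta + expected)."""
    shape, rate = alpha + n_obs, beta + expected
    ebgm = np.exp(digamma(shape) - np.log(rate))     # geometric mean of the posterior
    eb05 = gamma.ppf(0.05, shape, scale=1.0 / rate)  # 5th percentile of the posterior
    return ebgm, eb05

# Hold the raw reporting ratio at 4 and vary the report count:
for n in (3, 10, 50):
    ebgm, eb05 = ebgm_eb05(n, expected=n / 4)
    print(f"N={n:3d}  EBGM={ebgm:5.2f}  EB05={eb05:5.2f}")
```

In this toy model, three reports at four times the expected rate give an EBGM above 2 but an EB05 near 1; only as reports accumulate does EB05 also cross the threshold.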

Defining a misclassification error when evaluating DMAs has been the subject of vigorous debate. Regarding false positive misclassification, some argue that if the data were misleading but the DMA accomplished its intended objective of identifying associations that were not obviously spurious at the outset (i.e. that warranted further investigation), then the result should not be counted as a misclassification by the DMA. Another view, based on interest in the incremental utility of DMAs versus traditional approaches, is that such scenarios represent misclassification by both traditional and computational approaches. Although traditional and computational approaches to signal detection have distinctive and complementary features [22], a corollary lesson is that, because both draw on the same dataset, they share common properties and their misclassification errors are likely to be correlated.

Although classification errors are to be expected with any screening tool, the results of the present study constitute a further caution against “seduction bias”—the tendency to over-interpret findings generated by algorithms with an extensive mathematical framework even though they are susceptible to many of the same reporting biases and artifacts as traditional approaches [22, 23]. A myriad of factors [24–30] influence reporting (e.g. attention from the medical and/or lay press) and therefore produce misclassification by both traditional and computational methods. Literature reports preceded an SDR in five instances with both DMAs (Table 1). Hence, prior publication in the literature may be a predisposing factor for yielding statistical associations when data mining the FDA-AERS database.

It is especially noteworthy that most of the selected phantom associations were highlighted on the basis of small numbers of reports (Table 1). Often, the ADRs involved in such ‘phantom ship’ associations are signs or symptoms with low background incidence rates that are rarely reported as ADRs for other drugs; small numbers of reports of the association are therefore sufficient to yield a statistically significant effect when applying DMAs, as the brief example below shows.
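
Reusing the hypothetical prr_sdr sketch from the Methods section with invented counts illustrates the point: when an event is almost never reported for other drugs, just three reports can clear every frequentist threshold.

```python
# Invented counts: 3 reports of the event for the drug of interest (out of
# 1,000 total), versus 10 reports of the event among 1,000,010 reports for
# all other drugs.
print(prr_sdr(a=3, b=997, c=10, d=1_000_000))  # True: PRR ~ 300, chi-squared ~ 688
```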

What is the significance of the greater discriminatory behavior of the EB05 > 2 threshold in this exercise? Some investigators assign priority to the “less is more” principle—namely, that a metric is superior if it presents the user with fewer potential associations to evaluate. Notwithstanding the findings from our small and non-systematic sample, this remains a matter of opinion at this time, since there is no clear decision-theoretic framework to guide such assessments [31], and the relative importance of sensitivity versus specificity may be situation dependent.

Previous publications have not fully explored these issues, and some answers have been accepted before all the questions have even been formulated. For example, what are the relative benefits and opportunity costs of earlier detection of both true and spurious associations? Earlier detection with a smaller number of cases is commonly assumed to be advantageous. But if the association cannot be clarified until additional cases are submitted, and this coincides with initial detection by a less sensitive method, then earlier detection by the more sensitive method merely imposes an additional burden of monitoring over time without earlier resolution, akin to lead-time bias in medical screening. Conversely, earlier detection may allow more timely implementation of highly focused and intensified follow-up data capture procedures, which could itself lead to earlier resolution. Analogous considerations apply to spurious associations. A more careful and systematic analysis of the utilities and costs associated with the use of DMAs in real-world pharmacovigilance scenarios could yield added benefits and insights beyond the usual published data mining exercises [31].

We believe that certain phantom ships might be included within a larger reference set for understanding the performance of DMAs relative to traditional approaches. Although many questions remain about the optimal approach to such validation exercises [32], human interpretation of the results remains pivotal [23].